diff --git a/findings/2026-05-08-50-concurrency-race-condition-analysis.md b/findings/2026-05-08-50-concurrency-race-condition-analysis.md
new file mode 100644
index 0000000..0053f96
--- /dev/null
+++ b/findings/2026-05-08-50-concurrency-race-condition-analysis.md
@@ -0,0 +1,223 @@
+# Finding 50: Concurrency and Race Condition Analysis (A New Lens)
+
+## Summary
+
+New analytical lens: identify concurrency hazards, race conditions, and
+underspecified interleaving semantics in architecture documents that describe
+concurrent or parallel operations. Applied to gargoyle's `signal-lifecycle.md`
+(111 lines), a compact specification of how trading signals flow through
+evaluation stages.
+
+## Setup
+
+**Document:** gargoyle `signal-lifecycle.md` (111 lines)
+**Lens:** Concurrency and race condition analysis (NEW)
+**Models:** GPT-5, Claude 4.6 Opus, Claude 4.6 Sonnet
+**Method:** Same document (full text) + same focused analytical prompt to all
+3 models via HAI proxy. Structured prompt specifying 5 focus areas:
+producer-consumer races, state visibility between components, ordering
+guarantees for concurrent writers, resource contention under load, and
+TOCTOU gaps. Required structured output per finding (concurrent operations,
+underspecified interleaving, potential failure mode, severity, exact quote).
+No tools, no project context beyond the document.
+
+**Token usage:**
+- GPT-5: 1,542 input → 7,045 output (5,376 reasoning tokens), 59s latency
+- Opus: 1,827 input → 2,513 output
+- Sonnet: 1,827 input → 2,960 output
+
+## Results
+
+### Finding Counts by Severity
+
+| Model | Critical | High | Medium | Low | Total |
+|-------|----------|------|--------|-----|-------|
+| GPT-5 | 4 | 2 | 3 | 0 | 9 |
+| Opus | 3 | 2 | 2 | 0 | 7 |
+| Sonnet | 3 | 2 | 2 | 0 | 7 |
+
+### Shared Findings (All Three Models)
+
+All three models independently identified these core concurrency hazards:
+
+1. **Multi-aggregator fan-out duplication** (Critical): The same signal
+   appearing in multiple aggregators produces duplicate decisions → doubled
+   orders. All three quoted the same key sentence about signal_id independence
+   from decision_id.
+
+2. **Aggregator completion predicate vs. in-flight signals** (Critical/High):
+   The boundary between "buffer accepting signals" and "completion predicate
+   fires" is not atomic. All three identified this as a race at the step 3→4
+   boundary.
+
+3. **Position state TOCTOU** (Critical): Action resolution depends on
+   position state that can be concurrently modified. GPT-5 framed it as
+   concurrent decisions; Opus framed it as concurrent `scale_in` signals;
+   Sonnet extended it to PortfolioMonitor vs. decision pipeline.
+
+4. **Backpressure/expiration under load** (Medium-High): Non-deterministic
+   signal expiry with underspecified audit treatment.
+
+5. **Audit write ordering** (Medium): Non-atomicity between decision
+   formation, audit writes, and downstream forwarding.
+
+### Unique/Distinctive Findings
+
+**GPT-5 unique:**
+- **PortfolioMonitor stop-loss close racing with strategy signals** (Critical):
+  Identified the concurrent control path between PortfolioMonitor
+  close-triggers and the decision pipeline producing new signals for the same
+  instrument. Neither Opus nor Sonnet identified this as a *separate* race
+  from the general position TOCTOU.
+- **Duplicate signal_id from crash+restart** (High): UUID collision or replay
+  in aggregation buffers. GPT-5 was the only model to treat the failure mode
+  table's "duplicate signal_id" entry as a concurrency hazard rather than a
+  mere correctness bug.
+
+**Opus unique:**
+- **Strategy worker crash + in-flight signal survival** (Critical): If a
+  worker crashes after dispatching signals to Signal Risk but before completion,
+  and then restarts and produces new equivalent signals, both the old and new
+  signals may contribute to a decision. This contradicts the spec's assumption
+  that crashes mean "buffered signals are lost." Opus was the only model to
+  identify that "lost" is ambiguous about whether downstream buffers (not just
+  the crashed process's local buffer) are included.
+
+**Sonnet unique:**
+- **Signal Risk crash AFTER dispatch but before audit** (High): A signal is
+  approved and already sent to the aggregator, but the approving Signal Risk
+  crashes before writing its audit entry. The audit log now shows a signal that
+  influenced a real decision but has no risk evaluation record, so it appears
+  to have bypassed risk controls. Neither GPT-5 nor Opus identified this
+  specific audit-integrity race.
+
+### Quality Comparison
+
+**GPT-5 (9 findings, 7,045 tokens):**
+Highest raw finding count. Most systematic: it enumerated every stage boundary
+and every concurrent actor pair, producing comprehensive coverage. However,
+some findings are somewhat mechanical extrapolations (e.g., Finding 8 on
+duplicate signal_ids is the spec's own failure mode table entry restated as a
+concurrency problem rather than a genuinely new insight). Strength: exhaustive
+enumeration of stage boundaries and actor pairs. Weakness: some findings are
+more "restatement of the spec's acknowledged risks" than new analytical insight.
+
+**Opus (7 findings, 2,513 tokens):**
+Most architecturally precise. Each finding includes a detailed causal chain
+explaining exactly HOW the race manifests in practice. The "strategy crash +
+in-flight survival" finding is the most original across all three: it
+identifies a contradiction within the spec's own recovery model (claiming
+crashes lose signals while the pipeline topology means some signals survive
+the producer's crash). Strength: reasoning about what the spec's OWN claims
+imply when combined. Weakness: slightly fewer findings means some coverage
+gaps vs GPT-5.
+
+**Sonnet (7 findings, 2,960 tokens):**
+Best individual attack narratives with concrete timing scenarios (T=0, T=99ms,
+T=100ms, T=101ms examples). The Signal Risk crash-after-dispatch finding shows
+strong reasoning about the difference between "signal is lost" (stated) and
+"signal was already sent downstream" (unaddressed). Strength: concrete
+temporal scenarios that make each race viscerally understandable. Weakness:
+Finding 6 (audit ordering) is mechanically similar to the other two models'
+observations, with a less novel angle.
+
+## Analysis
+
+### Concurrency as a Lens: Assessment
+
+This lens produced **high-quality, architecturally significant findings from
+all three models**, comparable to the adversarial lens (Finding #49) in
+productivity and direct actionability.
+
+Key characteristics of the concurrency lens:
+- **Naturally multi-actor**: forces models to reason about pairs/groups of
+  concurrent operations, which requires compositional thinking
+- **Demands temporal reasoning**: models must reason about "what if X happens
+  between steps N and N+1", which is a specific cognitive skill
+- **Specification-exploiting**: finds gaps where the document says "then Y
+  happens" without specifying atomicity, a common spec-writing failure
+- **Directly actionable**: each finding maps to a specific design decision
+  (add a lock, add ordering guarantee, add audit entry, etc.)
+
+### Model Characteristics on This Task Type
+
+Concurrency analysis requires three cognitive skills:
+1. **Enumeration**: identifying all pairs of concurrent actors/operations
+2. **Temporal reasoning**: working through what happens at specific orderings
+3. **Specification interpretation**: identifying what claims are made vs. what
+   is left ambiguous
+
+**GPT-5** excels at (1): systematic enumeration of every actor pair. It
+produced 9 findings because it methodically checked every stage boundary.
+
+**Opus** excels at (3): it found the contradiction between the spec's recovery
+claim and the pipeline topology, requiring deep interpretation of what "signals
+are lost" actually means in context.
+
+**Sonnet** excels at (2): it constructed the most vivid temporal scenarios
+(T=0/99/100/101ms), making each race immediately graspable. Its Signal Risk
+crash finding also shows good (3) reasoning about stated vs. unstated cases.
+
+### Comparison to Previous Lenses
+
+| Lens | GPT-5 finds | Opus finds | Sonnet finds | Total unique |
+|------|-------------|------------|--------------|--------------|
+| Adversarial (#49) | 25 | 14 | 11 | ~35 |
+| Concurrency (#50) | 9 | 7 | 7 | ~10 |
+| Defense-in-depth (#48) | 10 | 7 | 6 | ~14 |
+| Emergent behavior (#47) | 8 | 6 | 5 | ~12 |
+
+Lower raw count than adversarial, BUT:
+- Higher proportion of Critical/High findings (7/10 unique ≈ 70% vs ~40% for adversarial)
+- Every finding is directly actionable (specific design decision needed)
+- The document is much smaller (111 lines vs 170 for #49)
+- Higher quality-per-finding ratio: no padding or obvious observations
+
+### Key Insight
+
+**Concurrency analysis rewards a different cognitive profile than previous lenses.**
+Prior lenses (adversarial, defense-in-depth, gap analysis) primarily reward
+*completeness*: finding all instances of a pattern. Concurrency analysis
+rewards *compositional temporal reasoning*: mentally simulating interleaved
+executions to identify non-obvious failure modes. This explains why the
+finding-count gap between models is smaller here (9 vs 7 vs 7) compared to
+adversarial (25 vs 14 vs 11): the bottleneck is reasoning depth per finding,
+not enumeration breadth.
+
+**Root cause pattern identified by all three models:**
+The specification describes a *pipeline* with multiple concurrent *stages* but
+specifies only the happy-path *sequence* through stages. Nowhere does it define:
+- Atomicity boundaries (what is a transaction within/between stages?)
+- Visibility semantics (when does a downstream stage see an upstream action?)
+- Conflict resolution (what happens when two paths act on shared state?)
+
+This is the same class of specification gap that causes real production
+concurrency bugs: correct sequential logic described without concurrent
+correctness guarantees.
+
+## Practical Implications
+
+For architecture document review:
+- **GPT-5** for exhaustive enumeration of all concurrent actor pairs
+- **Opus** for finding contradictions between a spec's claims and its topology
+- **Sonnet** for vivid temporal scenarios that make races immediately clear
+
+The concurrency lens is recommended for any specification that describes:
+- Pipeline/stage architectures
+- Multiple producers or consumers
+- Timeouts and expiration
+- State that is read by one component and written by another
+- Recovery mechanisms (crash/restart)
+
+## Meta
+
+**Finding number:** 50
+**New lens:** Yes (Concurrency and race condition analysis)
+**Builds on:** None directly; related to #41 (temporal ordering) but distinct:
+#41 focused on sequential ordering assumptions, #50 focuses on concurrent
+interleaving
+**Open question generated:** Does the document size matter for concurrency
+analysis? This 111-line doc produced 10 unique findings. Would a larger
+multi-component doc (e.g., system-overview.md at 323 lines) produce
+proportionally more, or does concurrency analysis saturate at the
+interface boundaries regardless of document length?