finding 50: concurrency and race condition analysis lens

New analytical lens applied to signal-lifecycle.md (111 lines). All three models (GPT-5, Opus, Sonnet) reported 7-9 findings each, with roughly 70% rated Critical or High. Key insight: concurrency analysis rewards compositional temporal reasoning over enumeration breadth, narrowing the gap between models compared to other lenses. Unique finds: GPT-5 (stop-loss race, duplicate UUID), Opus (crash survival contradiction), Sonnet (Signal Risk audit gap after dispatch).
2026-05-08 11:06:06 -07:00
parent 7ca01f0cbf
commit 5b8f8caf8c
1 changed files with 223 additions and 0 deletions
@@ -0,0 +1,223 @@
+# Finding 50: Concurrency and Race Condition Analysis — A New Lens
+
+## Summary
+
+New analytical lens: identify concurrency hazards, race conditions, and
+underspecified interleaving semantics in architecture documents that describe
+concurrent or parallel operations. Applied to gargoyle's `signal-lifecycle.md`
+(111 lines) — a compact specification of how trading signals flow through
+evaluation stages.
+
+## Setup
+
+**Document:** gargoyle `signal-lifecycle.md` (111 lines)
+**Lens:** Concurrency and race condition analysis (NEW)
+**Models:** GPT-5, Claude 4.6 Opus, Claude 4.6 Sonnet
+**Method:** Same document (full text) + same focused analytical prompt to all
+3 models via HAI proxy. Structured prompt specifying 5 focus areas:
+producer-consumer races, state visibility between components, ordering
+guarantees for concurrent writers, resource contention under load, and
+TOCTOU gaps. Required structured output per finding (concurrent operations,
+underspecified interleaving, potential failure mode, severity, exact quote).
+No tools, no project context beyond the document.
+
+**Token usage:**
+- GPT-5: 1,542 input → 7,045 output (5,376 reasoning tokens), 59s latency
+- Opus: 1,827 input → 2,513 output
+- Sonnet: 1,827 input → 2,960 output
+
+## Results
+
+### Finding Counts by Severity
+
+| Model | Critical | High | Medium | Low | Total |
+|-------|----------|------|--------|-----|-------|
+| GPT-5 | 4 | 2 | 3 | 0 | 9 |
+| Opus | 3 | 2 | 2 | 0 | 7 |
+| Sonnet | 3 | 2 | 2 | 0 | 7 |
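
The Critical/High share implied by the table above can be checked with a few lines of arithmetic. This is only a sanity-check sketch: the counts are copied from the table, while the variable and function names are illustrative and not part of the study's tooling.

```python
# Severity counts copied from the table above: (Critical, High, Medium, Low).
counts = {
    "GPT-5": (4, 2, 3, 0),
    "Opus": (3, 2, 2, 0),
    "Sonnet": (3, 2, 2, 0),
}


def critical_high_share(row):
    """Fraction of one model's findings rated Critical or High."""
    critical, high, _medium, _low = row
    return (critical + high) / sum(row)


per_model = {m: round(100 * critical_high_share(r), 1) for m, r in counts.items()}

# Pooled share across all 23 reported (not deduplicated) findings.
pooled_hits = sum(r[0] + r[1] for r in counts.values())
pooled_total = sum(sum(r) for r in counts.values())
pooled = round(100 * pooled_hits / pooled_total, 1)

print(per_model)  # {'GPT-5': 66.7, 'Opus': 71.4, 'Sonnet': 71.4}
print(pooled)     # 69.6
```

Both the per-model and the pooled figures land close to the "roughly 70%" quoted for this lens.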
+
+### Shared Findings (All Three Models)
+
+All three models independently identified these core concurrency hazards:
+
+1. **Multi-aggregator fan-out duplication** (Critical) — Same signal appearing
+   in multiple aggregators produces duplicate decisions → doubled orders.
+   All three quoted the same key sentence about signal_id independence from
+   decision_id.
+
+2. **Aggregator completion predicate vs. in-flight signals** (Critical/High) —
+   The boundary between "buffer accepting signals" and "completion predicate
+   fires" is not atomic. All three identified this as a race at the step 3→4
+   boundary.
+
+3. **Position state TOCTOU** (Critical) — Action resolution depends on
+   position state that can be concurrently modified. GPT-5 framed it as
+   concurrent decisions; Opus framed it as concurrent `scale_in` signals;
+   Sonnet extended it to PortfolioMonitor vs. decision pipeline.
+
+4. **Backpressure/expiration under load** (Medium-High) — Non-deterministic
+   signal expiry with underspecified audit treatment.
+
+5. **Audit write ordering** (Medium) — Non-atomicity between decision
+   formation, audit writes, and downstream forwarding.
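
One common way to close a check-then-act gap like the position state TOCTOU in item 3 is an optimistic version check: a decision commits only if the position has not changed since it was read. The sketch below is a hypothetical illustration, not something the spec describes; `Position`, `read`, and `commit_if_unchanged` are invented names, and in a real system the compare-and-write step would itself have to be atomic (a lock or a database compare-and-set).

```python
from dataclasses import dataclass


@dataclass
class Position:
    """Hypothetical shared position state (not taken from the spec)."""
    qty: int = 0
    version: int = 0  # bumped on every successful write


def read(pos):
    """Time of check: snapshot the quantity together with its version."""
    return pos.qty, pos.version


def commit_if_unchanged(pos, seen_version, new_qty):
    """Time of use: write only if nobody else wrote since `seen_version`.

    In a real concurrent system this compare-and-write must be atomic.
    """
    if pos.version != seen_version:
        return False  # stale read: caller must re-read and re-resolve the action
    pos.qty = new_qty
    pos.version += 1
    return True


pos = Position()
qty_a, ver_a = read(pos)  # decision A resolves its action against qty=0
qty_b, ver_b = read(pos)  # decision B resolves against the same snapshot
a_ok = commit_if_unchanged(pos, ver_a, qty_a + 60)
b_ok = commit_if_unchanged(pos, ver_b, qty_b + 60)  # rejected as stale

print(a_ok, b_ok, pos.qty)  # True False 60
```

Without the version check both writes would be applied from the same stale snapshot, which is the failure mode all three models described.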
+
+### Unique/Distinctive Findings
+
+**GPT-5 unique:**
+- **PortfolioMonitor stop-loss close racing with strategy signals** (Critical)
+  — Identified the concurrent control path between PortfolioMonitor
+  close-triggers and the decision pipeline producing new signals for the same
+  instrument. Neither Opus nor Sonnet identified this as a *separate* race
+  from the general position TOCTOU.
+- **Duplicate signal_id from crash+restart** (High) — UUID collision or replay
+  in aggregation buffers. GPT-5 was the only model to treat the failure mode
+  table's "duplicate signal_id" entry as a concurrency hazard rather than a
+  mere correctness bug.
+
+**Opus unique:**
+- **Strategy worker crash + in-flight signal survival** (Critical) — If a
+  worker crashes after dispatching signals to Signal Risk but before completion,
+  and then restarts and produces new equivalent signals, both the old and new
+  signals may contribute to a decision. This contradicts the spec's assumption
+  that crashes mean "buffered signals are lost." Opus was the only model to
+  identify that "lost" is ambiguous about whether downstream buffers (not just
+  the crashed process's local buffer) are included.
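
The ambiguity Opus points at can be illustrated with a producer epoch (a fencing counter). This is a hypothetical sketch, assuming each worker increments an epoch on restart and downstream buffers can see it; neither the `epoch` field nor `live_signals` comes from the spec.

```python
from typing import NamedTuple


class Signal(NamedTuple):
    worker: str
    epoch: int    # hypothetical: incremented each time the worker restarts
    payload: str


def live_signals(buffer, current_epoch):
    """Keep only signals produced by each worker's current incarnation."""
    return [s for s in buffer if s.epoch == current_epoch.get(s.worker)]


buffer = [
    Signal("w1", 1, "buy"),  # dispatched before the crash, still buffered downstream
    Signal("w1", 2, "buy"),  # equivalent signal re-produced after the restart
]

survivors = live_signals(buffer, {"w1": 2})
print(survivors)  # [Signal(worker='w1', epoch=2, payload='buy')]
```

Without some rule of this kind, both buffered signals could contribute to the same decision, which is the contradiction described above.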
+
+**Sonnet unique:**
+- **Signal Risk crash AFTER dispatch but before audit** (High) — A signal
+  is approved and sent to the aggregator, but the approving Signal Risk
+  crashes before writing its audit entry. The audit log now shows a signal that
+  influenced a real decision but has no risk evaluation record — appearing to
+  have bypassed risk controls. Neither GPT-5 nor Opus identified this specific
+  audit-integrity race.
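
The audit gap depends on the order of two side effects. A minimal simulation, assuming a crash can land between writing the audit entry and dispatching to the aggregator, shows how the two orderings fail differently. All names here (`approve`, `run`, `Crash`) are hypothetical and the lists stand in for the real audit log and aggregator buffer.

```python
class Crash(Exception):
    """Simulated process crash between the two side effects."""


def approve(signal_id, audit_log, aggregator, audit_first, crash_between):
    """Perform the audit write and the dispatch in the requested order."""
    steps = [
        lambda: audit_log.append(signal_id),   # write risk evaluation record
        lambda: aggregator.append(signal_id),  # dispatch downstream
    ]
    if not audit_first:
        steps.reverse()
    steps[0]()
    if crash_between:
        raise Crash(signal_id)
    steps[1]()


def run(audit_first):
    """Crash between the two steps and report what each store contains."""
    audit_log, aggregator = [], []
    try:
        approve("sig-1", audit_log, aggregator, audit_first, crash_between=True)
    except Crash:
        pass
    # Signals that reached the aggregator without a risk record.
    unaudited = [s for s in aggregator if s not in audit_log]
    return audit_log, aggregator, unaudited


print(run(audit_first=False))  # ([], ['sig-1'], ['sig-1'])
print(run(audit_first=True))   # (['sig-1'], [], [])
```

Dispatch-first leaves a signal that can influence a decision with no risk record; audit-first leaves a record for a signal that was never delivered, which is usually the easier failure to reconcile.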
+
+### Quality Comparison
+
+**GPT-5 (9 findings, 7,045 tokens):**
+Highest raw finding count. Most systematic — enumerated every stage boundary
+and every concurrent actor pair, producing comprehensive coverage. However,
+some findings are somewhat mechanical extrapolations (e.g., Finding 8 on
+duplicate signal_ids is the spec's own failure mode table entry restated as a
+concurrency problem rather than a genuinely new insight). Strength: broad,
+systematic enumeration of stage boundaries, although it missed the unique Opus
+and Sonnet findings above. Weakness: some findings restate risks the spec
+already acknowledges rather than adding new analytical insight.
+
+**Opus (7 findings, 2,513 tokens):**
+Most architecturally precise. Each finding includes a detailed causal chain
+explaining exactly HOW the race manifests in practice. The "strategy crash +
+in-flight survival" finding is the most original across all three — it
+identifies a contradiction within the spec's own recovery model (claiming
+crashes lose signals while the pipeline topology means some signals survive
+the producer's crash). Strength: reasoning about what the spec's own claims
+imply when combined. Weakness: fewer findings than GPT-5, leaving some
+coverage gaps.
+
+**Sonnet (7 findings, 2,960 tokens):**
+Best individual attack narratives with concrete timing scenarios (T=0, T=99ms,
+T=100ms, T=101ms examples). The Signal Risk crash-after-dispatch finding shows
+strong reasoning about the difference between "signal is lost" (stated) and
+"signal was already sent downstream" (unaddressed). Strength: concrete
+temporal scenarios that make each race viscerally understandable. Weakness:
+Finding 6 (audit ordering) closely mirrors the other two models' observations
+and adds no new angle.
+
+## Analysis
+
+### Concurrency as a Lens: Assessment
+
+This lens produced **high-quality, architecturally significant findings from
+all three models** — comparable to the adversarial lens (Finding #49) in
+productivity and direct actionability.
+
+Key characteristics of the concurrency lens:
+- **Naturally multi-actor** — forces models to reason about pairs/groups of
+  concurrent operations, which requires compositional thinking
+- **Demands temporal reasoning** — models must reason about "what if X happens
+  between steps N and N+1" which is a specific cognitive skill
+- **Specification-exploiting** — finds gaps where the document says "then Y
+  happens" without specifying atomicity, a common spec-writing failure
+- **Directly actionable** — each finding maps to a specific design decision
+  (add a lock, add ordering guarantee, add audit entry, etc.)
+
+### Model Characteristics on This Task Type
+
+Concurrency analysis requires three cognitive skills:
+1. **Enumeration** — identifying all pairs of concurrent actors/operations
+2. **Temporal reasoning** — working through what happens at specific orderings
+3. **Specification interpretation** — identifying what claims are made vs. what
+   is left ambiguous
+
+**GPT-5** excels at (1) — systematic enumeration of every actor pair. It found
+9 findings because it methodically checked every stage boundary.
+
+**Opus** excels at (3) — it found the contradiction between the spec's recovery
+claim and the pipeline topology, requiring deep interpretation of what "signals
+are lost" actually means in context.
+
+**Sonnet** excels at (2) — it constructed the most vivid temporal scenarios
+(T=0/99/100/101ms) making each race immediately graspable. Its Signal Risk
+crash finding also shows good (3) reasoning about stated vs. unstated cases.
+
+### Comparison to Previous Lenses
+
+| Lens | GPT-5 finds | Opus finds | Sonnet finds | Total unique |
+|------|-------------|------------|--------------|--------------|
+| Adversarial (#49) | 25 | 14 | 11 | ~35 |
+| Concurrency (#50) | 9 | 7 | 7 | ~10 |
+| Defense-in-depth (#48) | 10 | 7 | 6 | ~14 |
+| Emergent behavior (#47) | 8 | 6 | 5 | ~12 |
+
+Lower raw count than adversarial, BUT:
+- Higher proportion of Critical/High findings (7/10 unique ≈ 70% vs ~40% for adversarial)
+- Every finding is directly actionable (specific design decision needed)
+- The document is smaller (111 lines vs 170 for #49)
+- Higher quality-per-finding ratio — no padding or obvious observations
+
+### Key Insight
+
+**Concurrency analysis rewards a different cognitive profile than previous lenses.**
+Prior lenses (adversarial, defense-in-depth, gap analysis) primarily reward
+*completeness* — finding all instances of a pattern. Concurrency analysis
+rewards *compositional temporal reasoning* — mentally simulating interleaved
+executions to identify non-obvious failure modes. This explains why the
+finding-count gap between models is smaller here (9 vs 7 vs 7) compared to
+adversarial (25 vs 14 vs 11): the bottleneck is reasoning depth per finding,
+not enumeration breadth.
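
The narrowing can be quantified as the ratio of the highest to the lowest per-model count for each lens. The counts below are copied from the lens comparison table; the spread metric itself is just one illustrative way to measure the gap, not a measure used elsewhere in this study.

```python
# Per-model finding counts from the lens comparison table: (GPT-5, Opus, Sonnet).
lens_counts = {
    "Adversarial (#49)": (25, 14, 11),
    "Concurrency (#50)": (9, 7, 7),
    "Defense-in-depth (#48)": (10, 7, 6),
    "Emergent behavior (#47)": (8, 6, 5),
}

# Spread = highest count divided by lowest count; 1.0 would mean no gap at all.
spread = {lens: round(max(c) / min(c), 2) for lens, c in lens_counts.items()}
narrowest = min(spread, key=spread.get)

for lens, ratio in spread.items():
    print(f"{lens}: {ratio}")
print("narrowest gap:", narrowest)
```

By this measure the concurrency lens has the smallest spread (about 1.29) and the adversarial lens the largest (about 2.27).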
+
+**Root cause pattern identified by all three models:**
+The specification describes a *pipeline* with multiple concurrent *stages* but
+specifies only the happy-path *sequence* through stages. Nowhere does it define:
+- Atomicity boundaries (what is a transaction within/between stages?)
+- Visibility semantics (when does a downstream stage see an upstream action?)
+- Conflict resolution (what happens when two paths act on shared state?)
+
+This is the same class of specification gap that causes real production
+concurrency bugs: correct sequential logic described without concurrent
+correctness guarantees.
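
The gap between sequential and concurrent correctness can be made concrete by enumerating interleavings. The sketch below assumes two decisions that each perform a check step and then an apply step against a shared position limit; the actors, the limit of 100, and the increment of 60 are hypothetical values chosen for illustration.

```python
def interleavings(a, b):
    """Yield every merge of tuples a and b that preserves each one's own order."""
    if not a or not b:
        yield a + b
        return
    for rest in interleavings(a[1:], b):
        yield (a[0],) + rest
    for rest in interleavings(a, b[1:]):
        yield (b[0],) + rest


def final_qty(schedule, limit=100, add=60):
    """Replay one schedule of (actor, step) pairs against a shared position."""
    qty, passed = 0, {}
    for actor, step in schedule:
        if step == "check":
            passed[actor] = qty + add <= limit  # time of check
        elif passed[actor]:
            qty += add                          # time of use
    return qty


decision_a = (("A", "check"), ("A", "apply"))
decision_b = (("B", "check"), ("B", "apply"))

schedules = list(interleavings(decision_a, decision_b))
violating = [s for s in schedules if final_qty(s) > 100]
print(len(schedules), len(violating))  # 6 4
```

Each decision is correct when run alone, yet four of the six possible schedules exceed the limit. Only the two fully serialized schedules are safe, which is exactly what an atomicity boundary would enforce.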
+
+## Practical Implications
+
+For architecture document review:
+- **GPT-5** for exhaustive enumeration of all concurrent actor pairs
+- **Opus** for finding contradictions between a spec's claims and its topology
+- **Sonnet** for vivid temporal scenarios that make races immediately clear
+
+The concurrency lens is recommended for any specification that describes:
+- Pipeline/stage architectures
+- Multiple producers or consumers
+- Timeouts and expiration
+- State that is read by one component and written by another
+- Recovery mechanisms (crash/restart)
+
+## Meta
+
+**Finding number:** 50
+**New lens:** Yes (Concurrency and race condition analysis)
+**Builds on:** None directly; related to #41 (temporal ordering) but distinct —
+#41 focused on sequential ordering assumptions, #50 focuses on concurrent
+interleaving
+**Open question generated:** Does document size matter for concurrency
+analysis? This 111-line doc produced about 10 unique findings. Would a larger
+multi-component doc (e.g., system-overview.md at 323 lines) produce
+proportionally more, or does concurrency analysis saturate at the
+interface boundaries regardless of document length?