model-research/findings/2026-05-08-50-concurrency-race-condition-analysis.md

# Finding 50: Concurrency and Race Condition Analysis — A New Lens

## Summary

New analytical lens: identify concurrency hazards, race conditions, and
underspecified interleaving semantics in architecture documents that describe
concurrent or parallel operations. Applied to gargoyle's `signal-lifecycle.md`
(111 lines) — a compact specification of how trading signals flow through
evaluation stages.

## Setup

**Document:** gargoyle `signal-lifecycle.md` (111 lines)
**Lens:** Concurrency and race condition analysis (NEW)
**Models:** GPT-5, Claude 4.6 Opus, Claude 4.6 Sonnet
**Method:** Same document (full text) + same focused analytical prompt to all
3 models via HAI proxy. Structured prompt specifying 5 focus areas:
producer-consumer races, state visibility between components, ordering
guarantees for concurrent writers, resource contention under load, and
TOCTOU gaps. Required structured output per finding (concurrent operations,
underspecified interleaving, potential failure mode, severity, exact quote).
No tools, no project context beyond the document.
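
The required per-finding output could be captured in a small record type. The sketch below is illustrative only: the class, field, and function names are assumptions derived from the five output fields listed above, not the schema actually used in the prompt.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    CRITICAL = "Critical"
    HIGH = "High"
    MEDIUM = "Medium"
    LOW = "Low"


@dataclass(frozen=True)
class ConcurrencyFinding:
    """One finding with the five required fields (field names are assumed)."""
    concurrent_operations: str
    underspecified_interleaving: str
    potential_failure_mode: str
    severity: Severity
    exact_quote: str  # verbatim text from the reviewed document


def quote_is_verbatim(finding: ConcurrencyFinding, document_text: str) -> bool:
    """Check that the quoted evidence really appears in the source document."""
    return finding.exact_quote in document_text
```

A check like `quote_is_verbatim` is one way to confirm that the "exact quote" field was copied from the document rather than paraphrased.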

**Token usage:**
- GPT-5: 1,542 input → 7,045 output (5,376 reasoning tokens), 59s latency
- Opus: 1,827 input → 2,513 output
- Sonnet: 1,827 input → 2,960 output

## Results

### Finding Counts by Severity

| Model | Critical | High | Medium | Low | Total |
|-------|----------|------|--------|-----|-------|
| GPT-5 | 4 | 2 | 3 | 0 | 9 |
| Opus | 3 | 2 | 2 | 0 | 7 |
| Sonnet | 3 | 2 | 2 | 0 | 7 |

### Shared Findings (All Three Models)

All three models independently identified these core concurrency hazards:

1. **Multi-aggregator fan-out duplication** (Critical) — Same signal appearing
   in multiple aggregators produces duplicate decisions → doubled orders.
   All three quoted the same key sentence about `signal_id` being independent
   of `decision_id`.

2. **Aggregator completion predicate vs. in-flight signals** (Critical/High) —
   The boundary between "buffer accepting signals" and "completion predicate
   fires" is not atomic. All three identified this as a race at the step 3→4
   boundary.

3. **Position state TOCTOU** (Critical) — Action resolution depends on
   position state that can be concurrently modified. GPT-5 framed it as
   concurrent decisions, Opus as concurrent `scale_in` signals, and Sonnet
   extended it to PortfolioMonitor vs. the decision pipeline.

4. **Backpressure/expiration under load** (Medium-High) — Non-deterministic
   signal expiry with underspecified audit treatment.

5. **Audit write ordering** (Medium) — Non-atomicity between decision
   formation, audit writes, and downstream forwarding.
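
The position-state TOCTOU in item 3 can be illustrated with a minimal sketch. Everything here is hypothetical: the `Position` type, the size limit of 1, and the function names are assumptions for illustration and do not come from `signal-lifecycle.md`. The first function replays the bad interleaving by hand so the outcome is deterministic; the second shows one possible fix, doing the check and the act under a single lock.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class Position:
    """Hypothetical position record; max_size stands in for any limit."""
    size: int = 0
    max_size: int = 1
    lock: threading.Lock = field(default_factory=threading.Lock)


def racy_interleaving(position: Position) -> int:
    """Replay the bad ordering by hand: both decisions check, then both act."""
    a_ok = position.size < position.max_size  # decision A checks
    b_ok = position.size < position.max_size  # decision B checks the same state
    if a_ok:
        position.size += 1  # A acts
    if b_ok:
        position.size += 1  # B acts on a check that is no longer true
    return position.size


def scale_in_atomic(position: Position) -> bool:
    """Check and act under one lock so no other decision can interleave."""
    with position.lock:
        if position.size < position.max_size:
            position.size += 1
            return True
        return False
```

With a fresh `Position()`, `racy_interleaving` returns 2 even though the limit is 1, while two calls to `scale_in_atomic` return `True` then `False` and leave the size at 1.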

### Unique/Distinctive Findings

**GPT-5 unique:**
- **PortfolioMonitor stop-loss close racing with strategy signals** (Critical)
  — Identified the concurrent control path between PortfolioMonitor
  close-triggers and the decision pipeline producing new signals for the same
  instrument. Neither Opus nor Sonnet identified this as a race *separate*
  from the general position TOCTOU; Sonnet raised it only as an extension of
  that finding.
- **Duplicate signal_id from crash+restart** (High) — UUID collision or replay
  in aggregation buffers. GPT-5 was the only model to treat the failure mode
  table's "duplicate signal_id" entry as a concurrency hazard rather than a
  mere correctness bug.
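
One common mitigation for replayed or duplicated `signal_id` values is to make buffer acceptance idempotent. The sketch below is an assumption about how that might look, not something the spec describes; `AggregationBuffer` and `accept` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class AggregationBuffer:
    """Hypothetical aggregator buffer that accepts each signal_id at most once."""
    seen_ids: set = field(default_factory=set)
    signals: list = field(default_factory=list)

    def accept(self, signal_id: str, payload: dict) -> bool:
        """Return True if buffered, False if this signal_id was already seen."""
        if signal_id in self.seen_ids:
            return False
        self.seen_ids.add(signal_id)
        self.signals.append((signal_id, payload))
        return True
```

An in-memory set like this does not survive a restart of the aggregator itself, so the crash+restart case would additionally need the seen-set to be durable.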

**Opus unique:**
- **Strategy worker crash + in-flight signal survival** (Critical) — If a
  worker crashes after dispatching signals to Signal Risk but before completion,
  and then restarts and produces new equivalent signals, both the old and new
  signals may contribute to a decision. This contradicts the spec's assumption
  that crashes mean "buffered signals are lost." Opus was the only model to
  identify that "lost" is ambiguous about whether downstream buffers (not just
  the crashed process's local buffer) are included.

**Sonnet unique:**
- **Signal Risk crash AFTER dispatch but before audit** (High) — A signal
  is approved and sent to the aggregator, but the Signal Risk instance that
  approved it crashes before writing its audit entry. The audit log then shows
  a signal that influenced a real decision but has no risk evaluation record,
  so it appears to have bypassed risk controls. Neither GPT-5 nor Opus
  identified this specific audit-integrity race.
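
The audit-integrity race above comes down to the order of two writes. The following sketch simulates a crash between them with an exception; the function names and the list-based audit log and inbox are hypothetical stand-ins, not the system's actual interfaces.

```python
class Crash(Exception):
    """Stands in for a process crash at a chosen point."""


def dispatch_then_audit(signal_id, audit_log, aggregator_inbox, crash_between=False):
    """Ordering with the gap: the signal goes downstream before it is audited."""
    aggregator_inbox.append(signal_id)
    if crash_between:
        raise Crash("crashed after dispatch, before audit")
    audit_log.append(signal_id)


def audit_then_dispatch(signal_id, audit_log, aggregator_inbox, crash_between=False):
    """Write-ahead ordering: the audit entry exists before the signal is usable."""
    audit_log.append(signal_id)
    if crash_between:
        raise Crash("crashed after audit, before dispatch")
    aggregator_inbox.append(signal_id)


def unaudited(audit_log, aggregator_inbox):
    """Signals that reached the aggregator without a risk-evaluation record."""
    return [s for s in aggregator_inbox if s not in audit_log]
```

With the write-ahead ordering, a crash leaves an audited signal that was never dispatched, which is a lost signal rather than an unaudited one. Whether that trade-off is acceptable is a design decision the spec would need to state.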

### Quality Comparison

**GPT-5 (9 findings, 7,045 tokens):**
Highest raw finding count. Most systematic — enumerated every stage boundary
and every concurrent actor pair, producing comprehensive coverage. However,
some findings are somewhat mechanical extrapolations (e.g., Finding 8 on
duplicate signal_ids is the spec's own failure mode table entry restated as a
concurrency problem rather than a genuinely new insight). Strength: the
broadest enumeration of the three, though not exhaustive (it missed the
Opus-only and Sonnet-only findings above). Weakness: some findings restate
risks the spec already acknowledges rather than add new analytical insight.

**Opus (7 findings, 2,513 tokens):**
Most architecturally precise. Each finding includes a detailed causal chain
explaining exactly HOW the race manifests in practice. The "strategy crash +
in-flight survival" finding is the most original across all three — it
identifies a contradiction within the spec's own recovery model (claiming
crashes lose signals while the pipeline topology means some signals survive
the producer's crash). Strength: reasoning about what the spec's OWN claims
imply when combined. Weakness: slightly fewer findings means some coverage
gaps vs GPT-5.

**Sonnet (7 findings, 2,960 tokens):**
Best individual attack narratives with concrete timing scenarios (T=0, T=99ms,
T=100ms, T=101ms examples). The Signal Risk crash-after-dispatch finding shows
strong reasoning about the difference between "signal is lost" (stated) and
"signal was already sent downstream" (unaddressed). Strength: concrete
temporal scenarios that make each race viscerally understandable. Weakness:
Finding 6 (audit ordering) closely mirrors the other two models' observations
and adds little that is new.

## Analysis

### Concurrency as a Lens: Assessment

This lens produced **high-quality, architecturally significant findings from
all three models** — comparable to the adversarial lens (Finding #49) in
usefulness and direct actionability, despite a lower raw finding count.

Key characteristics of the concurrency lens:
- **Naturally multi-actor** — forces models to reason about pairs/groups of
  concurrent operations, which requires compositional thinking
- **Demands temporal reasoning** — models must reason about "what if X happens
  between steps N and N+1" which is a specific cognitive skill
- **Specification-exploiting** — finds gaps where the document says "then Y
  happens" without specifying atomicity, a common spec-writing failure
- **Directly actionable** — each finding maps to a specific design decision
  (add a lock, add ordering guarantee, add audit entry, etc.)

### Model Characteristics on This Task Type

Concurrency analysis requires three cognitive skills:
1. **Enumeration** — identifying all pairs of concurrent actors/operations
2. **Temporal reasoning** — working through what happens at specific orderings
3. **Specification interpretation** — identifying what claims are made vs. what
   is left ambiguous

**GPT-5** excels at (1) — systematic enumeration of every actor pair. It found
9 findings because it methodically checked every stage boundary.

**Opus** excels at (3) — it found the contradiction between the spec's recovery
claim and the pipeline topology, requiring deep interpretation of what "signals
are lost" actually means in context.

**Sonnet** excels at (2) — it constructed the most vivid temporal scenarios
(T=0/99/100/101ms) making each race immediately graspable. Its Signal Risk
crash finding also shows good (3) reasoning about stated vs. unstated cases.
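
Skill (2) can be made mechanical for small cases by enumerating every order-preserving interleaving of two actors' steps and checking an invariant on each. The sketch below is a toy model under stated assumptions: the two-step check/act actors and the position limit of 1 are invented for illustration and are not taken from the reviewed spec.

```python
from itertools import combinations


def interleavings(a, b):
    """Yield every merge of sequences a and b that preserves each one's order."""
    n, m = len(a), len(b)
    for a_slots in combinations(range(n + m), n):
        slots = set(a_slots)
        ia, ib = iter(a), iter(b)
        yield [next(ia) if i in slots else next(ib) for i in range(n + m)]


def run(schedule):
    """Execute (actor, op) steps against a shared position size."""
    size, seen = 0, {}
    for actor, op in schedule:
        if op == "check":
            seen[actor] = size < 1  # limit of 1 is an assumption
        elif op == "act" and seen[actor]:
            size += 1
    return size


A = [("A", "check"), ("A", "act")]
B = [("B", "check"), ("B", "act")]
bad = [s for s in interleavings(A, B) if run(s) > 1]
```

In this toy model, four of the six possible interleavings exceed the limit; only the two fully serialized orders are safe.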

### Comparison to Previous Lenses

| Lens | GPT-5 finds | Opus finds | Sonnet finds | Total unique |
|------|-------------|------------|--------------|--------------|
| Adversarial (#49) | 25 | 14 | 11 | ~35 |
| Concurrency (#50) | 9 | 7 | 7 | ~10 |
| Defense-in-depth (#48) | 10 | 7 | 6 | ~14 |
| Emergent behavior (#47) | 8 | 6 | 5 | ~12 |

Lower raw count than adversarial, BUT:
- Higher proportion of Critical/High findings (7/10 unique ≈ 70% vs ~40% for adversarial)
- Every finding is directly actionable (specific design decision needed)
- The document is smaller (111 lines vs 170 for #49, roughly 35% shorter)
- Higher quality-per-finding ratio: little padding and few obvious observations

### Key Insight

**Concurrency analysis rewards a different cognitive profile than previous lenses.**
Prior lenses (adversarial, defense-in-depth, gap analysis) primarily reward
*completeness* — finding all instances of a pattern. Concurrency analysis
rewards *compositional temporal reasoning* — mentally simulating interleaved
executions to identify non-obvious failure modes. This is consistent with the
finding-count gap between models being smaller here (9 vs 7 vs 7, a spread of
about 1.3x) than for the adversarial lens (25 vs 14 vs 11, about 2.3x): the
bottleneck is reasoning depth per finding, not enumeration breadth.

**Root cause pattern identified by all three models:**
The specification describes a *pipeline* with multiple concurrent *stages* but
specifies only the happy-path *sequence* through stages. Nowhere does it define:
- Atomicity boundaries (what is a transaction within/between stages?)
- Visibility semantics (when does a downstream stage see an upstream action?)
- Conflict resolution (what happens when two paths act on shared state?)

This is the same class of specification gap that causes real production
concurrency bugs: correct sequential logic described without concurrent
correctness guarantees.
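
As one example of what an explicit atomicity boundary could look like, the sketch below makes "test the completion predicate and close the buffer" a single locked step, so a late signal is rejected visibly rather than silently dropped or included. The class name, the count-based predicate, and the method names are all hypothetical; the spec does not define this mechanism.

```python
import threading


class ClosableBuffer:
    """Hypothetical aggregation buffer with an explicit atomicity boundary."""

    def __init__(self, required_count: int):
        self._lock = threading.Lock()
        self._signals = []
        self._closed = False
        self._required = required_count  # stand-in completion predicate

    def offer(self, signal) -> bool:
        """Add a signal; returns False if the buffer has already closed."""
        with self._lock:
            if self._closed:
                return False
            self._signals.append(signal)
            return True

    def try_complete(self):
        """Atomically test the predicate and close; returns the batch or None."""
        with self._lock:
            if self._closed or len(self._signals) < self._required:
                return None
            self._closed = True
            return list(self._signals)
```

Because `offer` returns `False` after the close, the producer learns the signal missed the batch, which also gives the spec a natural place to define audit treatment for late arrivals.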

## Practical Implications

For architecture document review:
- **GPT-5** for exhaustive enumeration of all concurrent actor pairs
- **Opus** for finding contradictions between a spec's claims and its topology
- **Sonnet** for vivid temporal scenarios that make races immediately clear

The concurrency lens is recommended for any specification that describes:
- Pipeline/stage architectures
- Multiple producers or consumers
- Timeouts and expiration
- State that is read by one component and written by another
- Recovery mechanisms (crash/restart)

## Meta

**Finding number:** 50
**New lens:** Yes (Concurrency and race condition analysis)
**Builds on:** None directly; related to #41 (temporal ordering) but distinct:
Finding #41 focused on sequential ordering assumptions, while #50 focuses on
concurrent interleaving
**Open question generated:** Does document size matter for concurrency
analysis? This 111-line doc produced ~10 unique findings. Would a larger
multi-component doc (e.g., system-overview.md at 323 lines) produce
proportionally more, or does concurrency analysis saturate at the
interface boundaries regardless of document length?