finding 50: concurrency and race condition analysis lens
New analytical lens applied to signal-lifecycle.md (111 lines). All three models (GPT-5, Opus, Sonnet) found 7-9 findings each with 70% at Critical/High severity. Key insight: concurrency analysis rewards compositional temporal reasoning over enumeration breadth, narrowing the gap between models compared to other lenses. Unique finds: GPT-5 (stop-loss race, duplicate UUID), Opus (crash survival contradiction), Sonnet (Signal Risk audit gap after dispatch).
# Finding 50: Concurrency and Race Condition Analysis — A New Lens

## Summary

New analytical lens: identify concurrency hazards, race conditions, and
underspecified interleaving semantics in architecture documents that describe
concurrent or parallel operations. Applied to gargoyle's `signal-lifecycle.md`
(111 lines), a compact specification of how trading signals flow through
evaluation stages.

## Setup

**Document:** gargoyle `signal-lifecycle.md` (111 lines)
**Lens:** Concurrency and race condition analysis (NEW)
**Models:** GPT-5, Claude 4.6 Opus, Claude 4.6 Sonnet
**Method:** Same document (full text) + same focused analytical prompt to all
3 models via HAI proxy. Structured prompt specifying 5 focus areas:
producer-consumer races, state visibility between components, ordering
guarantees for concurrent writers, resource contention under load, and
TOCTOU gaps. Required structured output per finding (concurrent operations,
underspecified interleaving, potential failure mode, severity, exact quote).
No tools, no project context beyond the document.

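The required per-finding output can be pictured as a small record type. The sketch below is illustrative only: the class name, field names, and example values are assumptions derived from the five fields listed above, not the actual prompt schema.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    CRITICAL = "Critical"
    HIGH = "High"
    MEDIUM = "Medium"
    LOW = "Low"


@dataclass(frozen=True)
class ConcurrencyFinding:
    """Hypothetical record mirroring the five required output fields."""
    concurrent_operations: tuple[str, ...]
    underspecified_interleaving: str
    potential_failure_mode: str
    severity: Severity
    exact_quote: str

    def is_critical_or_high(self) -> bool:
        return self.severity in (Severity.CRITICAL, Severity.HIGH)


# Example values are paraphrased from the shared findings, not model output.
example = ConcurrencyFinding(
    concurrent_operations=(
        "aggregator buffer accepts a signal",
        "completion predicate fires",
    ),
    underspecified_interleaving="signal arriving at the step 3 to 4 boundary",
    potential_failure_mode="decision forms without an in-flight signal",
    severity=Severity.CRITICAL,
    exact_quote="<verbatim sentence from the spec goes here>",
)
```

Requiring an exact quote per finding ties each claim to a specific sentence of the spec, which makes cross-model comparison of findings easier.
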
**Token usage:**
- GPT-5: 1,542 input → 7,045 output (5,376 reasoning tokens), 59s latency
- Opus: 1,827 input → 2,513 output
- Sonnet: 1,827 input → 2,960 output

## Results

### Finding Counts by Severity

| Model | Critical | High | Medium | Low | Total |
|-------|----------|------|--------|-----|-------|
| GPT-5 | 4 | 2 | 3 | 0 | 9 |
| Opus | 3 | 2 | 2 | 0 | 7 |
| Sonnet | 3 | 2 | 2 | 0 | 7 |

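As a quick check on the severity mix, the Critical/High share per model can be computed directly from the table. This is a minimal sketch; the dictionary layout and function name are illustrative, and the numbers are copied from the table above.

```python
# Counts copied from the severity table above.
counts = {
    "GPT-5": {"Critical": 4, "High": 2, "Medium": 3, "Low": 0},
    "Opus": {"Critical": 3, "High": 2, "Medium": 2, "Low": 0},
    "Sonnet": {"Critical": 3, "High": 2, "Medium": 2, "Low": 0},
}


def critical_high_share(row):
    """Fraction of a model's findings rated Critical or High."""
    return (row["Critical"] + row["High"]) / sum(row.values())


shares = {model: critical_high_share(row) for model, row in counts.items()}

# Pooled share across all 23 raw (non-deduplicated) findings.
pooled = sum(r["Critical"] + r["High"] for r in counts.values()) / sum(
    sum(r.values()) for r in counts.values()
)

for model, share in shares.items():
    print(f"{model}: {share:.0%}")
print(f"Pooled: {pooled:.0%}")
```

This gives 6/9 (67%) for GPT-5, 5/7 (71%) for Opus and Sonnet, and 16/23 (70%) pooled over raw findings.
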
### Shared Findings (All Three Models)

All three models independently identified these core concurrency hazards:

1. **Multi-aggregator fan-out duplication** (Critical): The same signal
   appearing in multiple aggregators produces duplicate decisions and
   therefore doubled orders. All three quoted the same key sentence about
   signal_id independence from decision_id.

2. **Aggregator completion predicate vs. in-flight signals** (Critical/High):
   The boundary between "buffer accepting signals" and "completion predicate
   fires" is not atomic. All three identified this as a race at the step 3→4
   boundary.

3. **Position state TOCTOU** (Critical): Action resolution depends on
   position state that can be concurrently modified. GPT-5 framed this as
   concurrent decisions; Opus framed it as concurrent `scale_in` signals;
   Sonnet extended it to PortfolioMonitor vs. the decision pipeline.

4. **Backpressure/expiration under load** (Medium-High): Non-deterministic
   signal expiry with underspecified audit treatment.

5. **Audit write ordering** (Medium): Non-atomicity between decision
   formation, audit writes, and downstream forwarding.

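The position-state TOCTOU (item 3) is the classic check-then-act pattern. One possible mitigation, sketched below, is to make the check and the update a single critical section. `PositionBook`, `scale_in` as a method, and the size limit are hypothetical names chosen for illustration; the spec does not prescribe this design.

```python
import threading


class PositionBook:
    """Hypothetical position store; names and limits are illustrative only."""

    def __init__(self, max_size: int) -> None:
        self._max_size = max_size
        self._size = 0
        self._lock = threading.Lock()

    def scale_in(self, quantity: int) -> bool:
        # The check and the update share one lock, so no other writer can
        # change the position between the read and the write.
        with self._lock:
            if self._size + quantity > self._max_size:
                return False
            self._size += quantity
            return True

    @property
    def size(self) -> int:
        with self._lock:
            return self._size


book = PositionBook(max_size=100)
results = []
results_lock = threading.Lock()


def worker() -> None:
    accepted = book.scale_in(60)
    with results_lock:
        results.append(accepted)


threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With two concurrent requests of 60 against a limit of 100, exactly one is accepted whichever thread runs first. Without the lock, both checks could pass before either update lands.
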
### Unique/Distinctive Findings

**GPT-5 unique:**
- **PortfolioMonitor stop-loss close racing with strategy signals** (Critical):
  Identified the concurrent control path between PortfolioMonitor
  close-triggers and the decision pipeline producing new signals for the same
  instrument. Neither Opus nor Sonnet identified this as a *separate* race
  from the general position TOCTOU.
- **Duplicate signal_id from crash+restart** (High): UUID collision or replay
  in aggregation buffers. GPT-5 was the only model to treat the failure mode
  table's "duplicate signal_id" entry as a concurrency hazard rather than a
  mere correctness bug.

**Opus unique:**
- **Strategy worker crash + in-flight signal survival** (Critical): If a
  worker crashes after dispatching signals to Signal Risk but before completion,
  and then restarts and produces new equivalent signals, both the old and new
  signals may contribute to a decision. This contradicts the spec's assumption
  that crashes mean "buffered signals are lost." Opus was the only model to
  identify that "lost" is ambiguous about whether downstream buffers (not just
  the crashed process's local buffer) are included.

**Sonnet unique:**
- **Signal Risk crash AFTER dispatch but before audit** (High): A signal
  approved and already sent to the aggregator, but whose approving Signal Risk
  crashes before writing its audit entry. The audit log now shows a signal that
  influenced a real decision but has no risk evaluation record, so it appears
  to have bypassed risk controls. Neither GPT-5 nor Opus identified this
  specific audit-integrity race.

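The Sonnet finding comes down to the order of two side effects around a crash point. The toy simulation below compares dispatch-first with audit-first ordering. All names (`approve`, `audit_log`, `aggregator_inbox`) are hypothetical, and plain lists stand in for the real audit store and aggregator.

```python
class CrashBetweenSteps(Exception):
    """Simulates the Signal Risk process dying between its two side effects."""


def approve(signal_id, audit_log, aggregator_inbox, audit_first, crash_between):
    steps = [
        lambda: audit_log.append(signal_id),         # write risk audit entry
        lambda: aggregator_inbox.append(signal_id),  # forward to aggregator
    ]
    if not audit_first:
        steps.reverse()
    steps[0]()
    if crash_between:
        raise CrashBetweenSteps(signal_id)
    steps[1]()


def unaudited_dispatches(audit_log, aggregator_inbox):
    """Signals that reached the aggregator without a risk audit record."""
    return [s for s in aggregator_inbox if s not in audit_log]


def run(audit_first):
    audit_log, inbox = [], []
    try:
        approve("sig-1", audit_log, inbox, audit_first, crash_between=True)
    except CrashBetweenSteps:
        pass
    return unaudited_dispatches(audit_log, inbox)


dispatch_first_gap = run(audit_first=False)
audit_first_gap = run(audit_first=True)
```

Dispatch-first leaves `sig-1` in the aggregator with no audit record, which is the gap Sonnet described. Audit-first avoids that gap but leaves the opposite residue: an audit entry for a signal that was never forwarded. Which residue is acceptable is a design decision the spec would need to state.
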
### Quality Comparison

**GPT-5 (9 findings, 7,045 tokens):**
Highest raw finding count. Most systematic: enumerated every stage boundary
and every concurrent actor pair, producing comprehensive coverage. However,
some findings are somewhat mechanical extrapolations (e.g., Finding 8 on
duplicate signal_ids is the spec's own failure mode table entry restated as a
concurrency problem rather than a genuinely new insight). Strength: exhaustive
enumeration of stage boundaries and actor pairs. Weakness: some findings are
more "restatement of the spec's acknowledged risks" than new analytical
insight.

**Opus (7 findings, 2,513 tokens):**
Most architecturally precise. Each finding includes a detailed causal chain
explaining exactly HOW the race manifests in practice. The "strategy crash +
in-flight survival" finding is the most original across all three: it
identifies a contradiction within the spec's own recovery model (claiming
crashes lose signals while the pipeline topology means some signals survive
the producer's crash). Strength: reasoning about what the spec's OWN claims
imply when combined. Weakness: slightly fewer findings means some coverage
gaps vs. GPT-5.

**Sonnet (7 findings, 2,960 tokens):**
Best individual attack narratives with concrete timing scenarios (T=0, T=99ms,
T=100ms, T=101ms examples). The Signal Risk crash-after-dispatch finding shows
strong reasoning about the difference between "signal is lost" (stated) and
"signal was already sent downstream" (unaddressed). Strength: concrete
temporal scenarios that make each race viscerally understandable. Weakness:
Finding 6 (audit ordering) closely mirrors the other two models' observations
and adds no new angle.

## Analysis

### Concurrency as a Lens: Assessment

This lens produced **high-quality, architecturally significant findings from
all three models**, comparable to the adversarial lens (Finding #49) in
productivity and direct actionability.

Key characteristics of the concurrency lens:
- **Naturally multi-actor**: forces models to reason about pairs/groups of
  concurrent operations, which requires compositional thinking
- **Demands temporal reasoning**: models must reason about "what if X happens
  between steps N and N+1", which is a specific cognitive skill
- **Specification-exploiting**: finds gaps where the document says "then Y
  happens" without specifying atomicity, a common spec-writing failure
- **Directly actionable**: each finding maps to a specific design decision
  (add a lock, add ordering guarantee, add audit entry, etc.)

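The "what if X happens between steps N and N+1" reasoning can be made mechanical for small cases by enumerating every interleaving of two actors and testing an invariant. The sketch below uses an assumed toy model (two check-then-act actors sharing one position limit), not the gargoyle pipeline itself.

```python
from itertools import combinations


def interleavings(a, b):
    """Yield every merge of sequences a and b that preserves each one's order."""
    n = len(a) + len(b)
    for a_slots in combinations(range(n), len(a)):
        slots = set(a_slots)
        a_iter, b_iter = iter(a), iter(b)
        yield [next(a_iter) if i in slots else next(b_iter) for i in range(n)]


LIMIT = 100  # illustrative position limit


def make_actor(name, qty):
    # Each actor is a non-atomic pair: a check step, then an act step.
    def check(state):
        state[name] = state["position"] + qty <= LIMIT

    def act(state):
        if state[name]:
            state["position"] += qty

    return [check, act]


def count_violations():
    bad = total = 0
    for schedule in interleavings(make_actor("A", 60), make_actor("B", 60)):
        state = {"position": 0}
        for step in schedule:
            step(state)
        total += 1
        if state["position"] > LIMIT:
            bad += 1
    return bad, total


bad, total = count_violations()
```

Of the 6 possible schedules, the 4 in which both checks run before either act breach the limit. Only the two fully serial schedules are safe, which is why a spec that describes only the sequential path leaves most orderings undefined.
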
### Model Characteristics on This Task Type

Concurrency analysis requires three cognitive skills:
1. **Enumeration**: identifying all pairs of concurrent actors/operations
2. **Temporal reasoning**: working through what happens at specific orderings
3. **Specification interpretation**: identifying what claims are made vs. what
   is left ambiguous

**GPT-5** excels at (1): systematic enumeration of every actor pair. It
produced 9 findings because it methodically checked every stage boundary.

**Opus** excels at (3): it found the contradiction between the spec's recovery
claim and the pipeline topology, requiring deep interpretation of what "signals
are lost" actually means in context.

**Sonnet** excels at (2): it constructed the most vivid temporal scenarios
(T=0/99/100/101ms), making each race immediately graspable. Its Signal Risk
crash finding also shows good (3) reasoning about stated vs. unstated cases.

### Comparison to Previous Lenses

| Lens | GPT-5 finds | Opus finds | Sonnet finds | Total unique |
|------|-------------|------------|--------------|--------------|
| Adversarial (#49) | 25 | 14 | 11 | ~35 |
| Concurrency (#50) | 9 | 7 | 7 | ~10 |
| Defense-in-depth (#48) | 10 | 7 | 6 | ~14 |
| Emergent behavior (#47) | 8 | 6 | 5 | ~12 |

Lower raw count than adversarial, BUT:
- Higher proportion of Critical/High findings (7/10 unique ≈ 70% vs ~40% for adversarial)
- Every finding is directly actionable (specific design decision needed)
- The document is much smaller (111 lines vs 170 for #49)
- Higher quality-per-finding ratio: no padding or obvious observations

### Key Insight

**Concurrency analysis rewards a different cognitive profile than previous lenses.**
Prior lenses (adversarial, defense-in-depth, gap analysis) primarily reward
*completeness*: finding all instances of a pattern. Concurrency analysis
rewards *compositional temporal reasoning*: mentally simulating interleaved
executions to identify non-obvious failure modes. This explains why the
finding-count gap between models is smaller here (9 vs 7 vs 7) compared to
adversarial (25 vs 14 vs 11): the bottleneck is reasoning depth per finding,
not enumeration breadth.

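One simple way to quantify the "gap between models" is the ratio of the highest to the lowest per-model finding count for each lens. The metric is an illustrative choice, not one used in the original analysis; the counts are copied from the lens comparison table.

```python
# Per-model finding counts copied from the lens comparison table.
finding_counts = {
    "Adversarial (#49)": {"GPT-5": 25, "Opus": 14, "Sonnet": 11},
    "Concurrency (#50)": {"GPT-5": 9, "Opus": 7, "Sonnet": 7},
    "Defense-in-depth (#48)": {"GPT-5": 10, "Opus": 7, "Sonnet": 6},
    "Emergent behavior (#47)": {"GPT-5": 8, "Opus": 6, "Sonnet": 5},
}


def max_min_ratio(counts):
    """Highest per-model count divided by the lowest, for one lens."""
    return max(counts.values()) / min(counts.values())


ratios = {lens: max_min_ratio(c) for lens, c in finding_counts.items()}
narrowest = min(ratios, key=ratios.get)

for lens, ratio in ratios.items():
    print(f"{lens}: {ratio:.2f}")
```

On this measure the concurrency lens has the narrowest spread (about 1.29, against 2.27 for adversarial, 1.67 for defense-in-depth, and 1.60 for emergent behavior).
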
**Root cause pattern identified by all three models:**
The specification describes a *pipeline* with multiple concurrent *stages* but
specifies only the happy-path *sequence* through stages. Nowhere does it define:
- Atomicity boundaries (what is a transaction within/between stages?)
- Visibility semantics (when does a downstream stage see an upstream action?)
- Conflict resolution (what happens when two paths act on shared state?)

This is the same class of specification gap that causes real production
concurrency bugs: correct sequential logic described without concurrent
correctness guarantees.

## Practical Implications

For architecture document review:
- **GPT-5** for exhaustive enumeration of all concurrent actor pairs
- **Opus** for finding contradictions between a spec's claims and its topology
- **Sonnet** for vivid temporal scenarios that make races immediately clear

The concurrency lens is recommended for any specification that describes:
- Pipeline/stage architectures
- Multiple producers or consumers
- Timeouts and expiration
- State that is read by one component and written by another
- Recovery mechanisms (crash/restart)

## Meta

**Finding number:** 50
**New lens:** Yes (Concurrency and race condition analysis)
**Builds on:** None directly; related to #41 (temporal ordering) but distinct:
finding #41 focused on sequential ordering assumptions, whereas #50 focuses on
concurrent interleaving
**Open question generated:** Does the document size matter for concurrency
analysis? This 111-line doc produced 10 unique findings. Would a larger
multi-component doc (e.g., system-overview.md at 323 lines) produce
proportionally more, or does concurrency analysis saturate at the
interface boundaries regardless of document length?