# Finding 50: Concurrency and Race Condition Analysis as a New Lens

## Summary

New analytical lens: identify concurrency hazards, race conditions, and
underspecified interleaving semantics in architecture documents that describe
concurrent or parallel operations. Applied to gargoyle's `signal-lifecycle.md`
(111 lines), a compact specification of how trading signals flow through
evaluation stages.

## Setup

**Document:** gargoyle `signal-lifecycle.md` (111 lines)
**Lens:** Concurrency and race condition analysis (NEW)
**Models:** GPT-5, Claude 4.6 Opus, Claude 4.6 Sonnet
**Method:** The same document (full text) and the same focused analytical
prompt were sent to all 3 models via HAI proxy. The structured prompt specified
5 focus areas: producer-consumer races, state visibility between components,
ordering guarantees for concurrent writers, resource contention under load, and
TOCTOU gaps. Each finding required structured output (concurrent operations,
underspecified interleaving, potential failure mode, severity, exact quote).
No tools, no project context beyond the document.

**Token usage:**
- GPT-5: 1,542 input → 7,045 output (5,376 reasoning tokens), 59s latency
- Opus: 1,827 input → 2,513 output
- Sonnet: 1,827 input → 2,960 output

## Results

### Finding Counts by Severity

| Model | Critical | High | Medium | Low | Total |
|-------|----------|------|--------|-----|-------|
| GPT-5 | 4 | 2 | 3 | 0 | 9 |
| Opus | 3 | 2 | 2 | 0 | 7 |
| Sonnet | 3 | 2 | 2 | 0 | 7 |

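
As a quick check on the severity mix, the Critical/High share per model can be computed directly from the table above. This is illustrative arithmetic only; the dictionary layout and the function name `critical_high_share` are assumptions introduced here, not part of the experiment tooling.

```python
# Counts copied from the severity table: (critical, high, medium, low).
COUNTS = {
    "GPT-5": (4, 2, 3, 0),
    "Opus": (3, 2, 2, 0),
    "Sonnet": (3, 2, 2, 0),
}


def critical_high_share(counts):
    """Return the fraction of findings rated Critical or High."""
    critical, high, medium, low = counts
    total = critical + high + medium + low
    return (critical + high) / total


for model, counts in COUNTS.items():
    print(f"{model}: {critical_high_share(counts):.0%} Critical/High")
# GPT-5: 67% Critical/High
# Opus: 71% Critical/High
# Sonnet: 71% Critical/High
```

All three models land in a narrow 67-71% band, so the severity mix is similar across models even though the raw counts differ.
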
### Shared Findings (All Three Models)

All three models independently identified these core concurrency hazards:

1. **Multi-aggregator fan-out duplication** (Critical): The same signal
   appearing in multiple aggregators produces duplicate decisions and therefore
   doubled orders. All three quoted the same key sentence about signal_id
   independence from decision_id.

2. **Aggregator completion predicate vs. in-flight signals** (Critical/High):
   The boundary between "buffer accepting signals" and "completion predicate
   fires" is not atomic. All three identified this as a race at the step 3→4
   boundary.

3. **Position state TOCTOU** (Critical): Action resolution depends on
   position state that can be concurrently modified. GPT-5 framed it as
   concurrent decisions; Opus framed it as concurrent `scale_in` signals;
   Sonnet extended it to PortfolioMonitor vs. the decision pipeline.

4. **Backpressure/expiration under load** (Medium-High): Non-deterministic
   signal expiry with underspecified audit treatment.

5. **Audit write ordering** (Medium): Non-atomicity between decision
   formation, audit writes, and downstream forwarding.

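
The position-state TOCTOU (item 3 above) can be made concrete with a small sketch. Everything here is hypothetical: the `Position` class, the `max_quantity` limit, and the numbers are assumptions for illustration and do not come from `signal-lifecycle.md`. The first function replays one bad schedule by hand so the outcome is deterministic; the second shows one possible mitigation, putting the check and the update in a single critical section.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class Position:
    """Hypothetical shared position state (names are illustrative only)."""
    quantity: int = 0
    max_quantity: int = 100
    _lock: threading.Lock = field(default_factory=threading.Lock, repr=False)

    def can_scale_in(self, amount: int) -> bool:
        # "Check" step: reads state that another actor may change right after.
        return self.quantity + amount <= self.max_quantity

    def apply(self, amount: int) -> None:
        # "Use" step: acts on whatever the earlier check concluded.
        self.quantity += amount

    def scale_in_atomic(self, amount: int) -> bool:
        # Check and use inside one critical section closes the TOCTOU window.
        with self._lock:
            if self.quantity + amount <= self.max_quantity:
                self.quantity += amount
                return True
            return False


def racy_interleaving() -> int:
    """Replay one bad schedule: both decisions check before either applies."""
    pos = Position(quantity=40)
    ok_a = pos.can_scale_in(50)  # decision A checks: 40 + 50 <= 100
    ok_b = pos.can_scale_in(50)  # decision B checks the same stale state
    if ok_a:
        pos.apply(50)
    if ok_b:
        pos.apply(50)
    return pos.quantity


def atomic_threads() -> int:
    """Run eight concurrent scale-ins through the locked path."""
    pos = Position(quantity=40)
    threads = [
        threading.Thread(target=pos.scale_in_atomic, args=(50,))
        for _ in range(8)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return pos.quantity


print(racy_interleaving())  # 140: both scale-ins applied, limit exceeded
print(atomic_threads())     # 90: only one scale-in fits under the limit
```

A lock is only one option; an optimistic version check or a single-writer owner for position state would address the same gap. The point is that the spec would need to name some such boundary.
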
### Unique/Distinctive Findings

**GPT-5 unique:**
- **PortfolioMonitor stop-loss close racing with strategy signals** (Critical):
  Identified the concurrent control path between PortfolioMonitor
  close-triggers and the decision pipeline producing new signals for the same
  instrument. Neither Opus nor Sonnet identified this as a *separate* race
  from the general position TOCTOU.
- **Duplicate signal_id from crash+restart** (High): UUID collision or replay
  in aggregation buffers. GPT-5 was the only model to treat the failure mode
  table's "duplicate signal_id" entry as a concurrency hazard rather than a
  mere correctness bug.

**Opus unique:**
- **Strategy worker crash + in-flight signal survival** (Critical): If a
  worker crashes after dispatching signals to Signal Risk but before completion,
  and then restarts and produces new equivalent signals, both the old and new
  signals may contribute to a decision. This contradicts the spec's assumption
  that crashes mean "buffered signals are lost." Opus was the only model to
  identify that "lost" is ambiguous about whether downstream buffers (not just
  the crashed process's local buffer) are included.

**Sonnet unique:**
- **Signal Risk crash AFTER dispatch but before audit** (High): A signal is
  approved and already sent to the aggregator, but the approving Signal Risk
  crashes before writing its audit entry. The audit log now shows a signal that
  influenced a real decision but has no risk evaluation record, so it appears to
  have bypassed risk controls. Neither GPT-5 nor Opus identified this specific
  audit-integrity race.

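
Sonnet's audit-integrity race comes down to the order of two side effects. The sketch below simulates a crash between them under both orderings. It is a toy model under stated assumptions: plain lists stand in for a durable audit log and the aggregator, the `Crash` exception and the `process` and `unaudited` helpers are hypothetical, and writing the audit entry first is one possible mitigation rather than anything the spec prescribes.

```python
class Crash(Exception):
    """Simulated process crash (hypothetical)."""


def process(signal_id, audit_log, aggregator, audit_first, crash_between):
    """Run the audit write and the dispatch in the chosen order."""
    steps = [
        lambda: audit_log.append(signal_id),   # record the risk evaluation
        lambda: aggregator.append(signal_id),  # forward the approved signal
    ]
    if not audit_first:
        steps.reverse()
    steps[0]()
    if crash_between:
        raise Crash(signal_id)  # the second side effect never happens
    steps[1]()


def unaudited(audit_first):
    """Return signals that reached the aggregator with no audit entry."""
    audit_log, aggregator = [], []
    try:
        process("sig-1", audit_log, aggregator, audit_first, crash_between=True)
    except Crash:
        pass
    return [s for s in aggregator if s not in audit_log]


print(unaudited(audit_first=False))  # ['sig-1']: dispatched but never audited
print(unaudited(audit_first=True))   # []: audited, though never dispatched
```

With the audit write first, a crash leaves an audit entry for a signal that was never forwarded, which is arguably the safer failure: the log over-reports rather than hiding a signal that influenced a decision.
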
### Quality Comparison

**GPT-5 (9 findings, 7,045 tokens):**
Highest raw finding count. Most systematic: it enumerated every stage boundary
and every concurrent actor pair, producing the broadest coverage. However,
some findings are somewhat mechanical extrapolations (e.g., Finding 8 on
duplicate signal_ids is the spec's own failure mode table entry restated as a
concurrency problem rather than a genuinely new insight). Strength: exhaustive
enumeration of stage boundaries and actor pairs. Weakness: some findings
restate risks the spec already acknowledges rather than adding new analytical
insight.

**Opus (7 findings, 2,513 tokens):**
Most architecturally precise. Each finding includes a detailed causal chain
explaining exactly HOW the race manifests in practice. The "strategy crash +
in-flight survival" finding is the most original across all three: it
identifies a contradiction within the spec's own recovery model (claiming
crashes lose signals while the pipeline topology means some signals survive
the producer's crash). Strength: reasoning about what the spec's OWN claims
imply when combined. Weakness: fewer findings, leaving some coverage gaps
relative to GPT-5.

**Sonnet (7 findings, 2,960 tokens):**
Best individual attack narratives with concrete timing scenarios (T=0, T=99ms,
T=100ms, T=101ms examples). The Signal Risk crash-after-dispatch finding shows
strong reasoning about the difference between "signal is lost" (stated) and
"signal was already sent downstream" (unaddressed). Strength: concrete
temporal scenarios that make each race viscerally understandable. Weakness:
Finding 6 (audit ordering) closely mirrors the other two models' observations
and adds no novel angle.

## Analysis

### Concurrency as a Lens: Assessment

This lens produced **high-quality, architecturally significant findings from
all three models**, comparable to the adversarial lens (Finding #49) in
productivity and direct actionability.

Key characteristics of the concurrency lens:
- **Naturally multi-actor:** forces models to reason about pairs/groups of
  concurrent operations, which requires compositional thinking
- **Demands temporal reasoning:** models must reason about "what if X happens
  between steps N and N+1", which is a specific cognitive skill
- **Specification-exploiting:** finds gaps where the document says "then Y
  happens" without specifying atomicity, a common spec-writing failure
- **Directly actionable:** each finding maps to a specific design decision
  (add a lock, add ordering guarantee, add audit entry, etc.)

### Model Characteristics on This Task Type

Concurrency analysis requires three cognitive skills:
1. **Enumeration:** identifying all pairs of concurrent actors/operations
2. **Temporal reasoning:** working through what happens at specific orderings
3. **Specification interpretation:** identifying what claims are made vs. what
   is left ambiguous

**GPT-5** excels at (1): systematic enumeration of every actor pair. It found
9 findings because it methodically checked every stage boundary.

**Opus** excels at (3): it found the contradiction between the spec's recovery
claim and the pipeline topology, requiring deep interpretation of what "signals
are lost" actually means in context.

**Sonnet** excels at (2): it constructed the most vivid temporal scenarios
(T=0/99/100/101ms), making each race immediately graspable. Its Signal Risk
crash finding also shows good (3) reasoning about stated vs. unstated cases.

### Comparison to Previous Lenses

| Lens | GPT-5 finds | Opus finds | Sonnet finds | Total unique |
|------|-------------|------------|--------------|--------------|
| Adversarial (#49) | 25 | 14 | 11 | ~35 |
| Concurrency (#50) | 9 | 7 | 7 | ~10 |
| Defense-in-depth (#48) | 10 | 7 | 6 | ~14 |
| Emergent behavior (#47) | 8 | 6 | 5 | ~12 |

Lower raw count than adversarial, BUT:
- Higher proportion of Critical/High findings (7 of ~10 unique, about 70%, vs ~40% for adversarial)
- Every finding is directly actionable (specific design decision needed)
- The document is much smaller (111 lines vs 170 for #49)
- Higher quality-per-finding ratio: no padding or obvious observations

### Key Insight

**Concurrency analysis rewards a different cognitive profile than previous lenses.**
Prior lenses (adversarial, defense-in-depth, gap analysis) primarily reward
*completeness*: finding all instances of a pattern. Concurrency analysis
rewards *compositional temporal reasoning*: mentally simulating interleaved
executions to identify non-obvious failure modes. This explains why the
finding-count gap between models is smaller here (9 vs 7 vs 7) compared to
adversarial (25 vs 14 vs 11): the bottleneck is reasoning depth per finding,
not enumeration breadth.

**Root cause pattern identified by all three models:**
The specification describes a *pipeline* with multiple concurrent *stages* but
specifies only the happy-path *sequence* through stages. Nowhere does it define:
- Atomicity boundaries (what is a transaction within/between stages?)
- Visibility semantics (when does a downstream stage see an upstream action?)
- Conflict resolution (what happens when two paths act on shared state?)

This is the same class of specification gap that causes real production
concurrency bugs: correct sequential logic described without concurrent
correctness guarantees.

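
To show what an explicit atomicity boundary could look like, here is a minimal sketch of an aggregation buffer in which accepting a signal and evaluating the completion predicate share one critical section. It is not the gargoyle design: the `AggregationBuffer` class, the `offer` method, the status strings, and the "N distinct signals" completion predicate are all assumptions made for illustration.

```python
import threading


class AggregationBuffer:
    """Hypothetical buffer: accept and completion check are one atomic step."""

    def __init__(self, required: int) -> None:
        self._required = required  # assumed predicate: N distinct signals
        self._signals = {}
        self._sealed = False
        self._lock = threading.Lock()

    def offer(self, signal_id: str, value: float):
        """Return (status, snapshot); snapshot is set only on completion."""
        with self._lock:
            if self._sealed:
                return "late", None        # arrived after the predicate fired
            if signal_id in self._signals:
                return "duplicate", None   # replayed or delivered twice
            self._signals[signal_id] = value
            if len(self._signals) >= self._required:
                self._sealed = True        # seal and snapshot atomically
                return "completed", dict(self._signals)
            return "accepted", None


buf = AggregationBuffer(required=2)
results = [
    buf.offer("sig-a", 0.4),
    buf.offer("sig-a", 0.4),
    buf.offer("sig-b", 0.7),
    buf.offer("sig-c", 0.1),
]
for status, snapshot in results:
    print(status, snapshot)
```

Each of the three unanswered questions gets a definite answer in this sketch: the lock is the atomicity boundary, the returned snapshot is what the downstream stage sees, and late or duplicate signals are rejected with an explicit status that could be audited.
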
## Practical Implications

For architecture document review:
- **GPT-5** for exhaustive enumeration of all concurrent actor pairs
- **Opus** for finding contradictions between a spec's claims and its topology
- **Sonnet** for vivid temporal scenarios that make races immediately clear

The concurrency lens is recommended for any specification that describes:
- Pipeline/stage architectures
- Multiple producers or consumers
- Timeouts and expiration
- State that is read by one component and written by another
- Recovery mechanisms (crash/restart)

## Meta

**Finding number:** 50
**New lens:** Yes (Concurrency and race condition analysis)
**Builds on:** None directly. Related to #41 (temporal ordering) but distinct:
that finding focused on sequential ordering assumptions, whereas this one
focuses on concurrent interleaving.
**Open question generated:** Does document size matter for concurrency
analysis? This 111-line doc produced 10 unique findings. Would a larger
multi-component doc (e.g., system-overview.md at 323 lines) produce
proportionally more, or does concurrency analysis saturate at the
interface boundaries regardless of document length?