Break the monolithic 3249-line findings file into 29 individual files, one per experiment. Each file is named YYYY-MM-DD-NN-slug.md for easy chronological sorting and discovery. No content changes — purely structural reorganization.
Finding 13: Race condition identification: Opus excels at subtle temporal interactions; GPT-5 is exhaustive; Sonnet struggles with concurrency reasoning
Date: 2026-05-03
Task: Identify race conditions, timing-dependent bugs, and ordering hazards in
gargoyle's concurrent-failure-detection.md (241 lines) — a document specifically
about concurrent detection logic with timers, ETS state, and multi-process events.
How we used them: Same document (full text) + same focused analytical question
to all 3 models via HAI proxy. Prompt specifically asked for event ordering problems,
timer interaction bugs, state visibility gaps, crash/restart timing, and multi-instance
coordination. Required each finding to reference specific mechanisms in the document
with specific interleaving descriptions. No tools, no project context beyond the
document itself.
| Model | Time | Output tokens | Reasoning tokens | Race conditions found |
|---|---|---|---|---|
| GPT-5 | 116s | 10,587 | 8,192 | 12 |
| Claude Opus 4.6 | ~105s | 4,610 | (internal) | 10 |
| Claude Sonnet 4.6 | ~39s | 1,404 | (internal) | 7 |
What they found — common ground (all 3 identified):
- Stale timer messages in mailbox after cancellation (classic Erlang timer race)
- HealthMonitor crash losing compound detection state (init from :unknown, no replay)
- ETS vs GenServer state divergence visible to dashboard
- Kill switch mode conflict (FLATTEN from single-process vs HALT from compound path)
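The stale-timer race that all three models flagged can be illustrated with a toy, single-threaded simulation. This is a sketch in Python rather than the project's Elixir; the `Monitor` class, the `debounce_expired` tag, and the integer refs are hypothetical stand-ins, not names from the reviewed document. It shows an expiry message that is already queued when the cancel happens, and a handler guard that compares the message's ref against the currently live ref.

```python
from collections import deque
from itertools import count


class Monitor:
    """Toy stand-in for a process with a mailbox and one debounce timer."""

    def __init__(self):
        self.mailbox = deque()
        self._refs = count(1)
        self.timer_ref = None  # ref of the timer currently considered live
        self.evaluations = 0

    def start_timer(self):
        self.timer_ref = next(self._refs)
        return self.timer_ref

    def fire(self, ref):
        # The expiry message is enqueued; it cannot be recalled afterwards.
        self.mailbox.append(("debounce_expired", ref))

    def cancel_timer(self):
        # Cancelling forgets the ref, but a queued message stays queued.
        self.timer_ref = None

    def drain(self, guard):
        while self.mailbox:
            _tag, ref = self.mailbox.popleft()
            if guard and ref != self.timer_ref:
                continue  # stale expiry from a cancelled timer: ignore it
            self.evaluations += 1
            self.timer_ref = None


def run(guard):
    m = Monitor()
    ref = m.start_timer()
    m.fire(ref)       # expiry lands in the mailbox ...
    m.cancel_timer()  # ... just before the cancel takes effect
    m.drain(guard)
    return m.evaluations
```

Without the guard, `run(False)` performs one spurious evaluation; with it, `run(True)` performs none.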
GPT-5 unique findings (not in either Claude model):
- Cross-sender message ordering: recovery events come from pipeline processes while timer expiry comes from the runtime. Erlang preserves message order only between a given sender and receiver, NOT across different senders, so the "rapid recovery" safety argument in the doc, which relies on state being updated before the timer fires, isn't guaranteed to hold
- Debounce starvation: a flapping component repeatedly restarts the timer, causing compound evaluation to be postponed indefinitely while ≥2 components are genuinely degraded
- State regression: {:degraded} arriving after {:escalated, :kill_switch} with no guard in the event table — state machine allows regressing from :halted to :degraded
- Cold-start window: application boots with existing degraded processes that won't re-emit events, compound detection never fires
- Catch-all handle_info could accidentally swallow timer messages if pattern matching is ordered wrong (implementation pitfall of the described approach)
- Debounce window growing beyond calibrated bounds from repeated timer restarts
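The debounce starvation and window-growth findings above can be sketched numerically. This is a minimal model that assumes every degradation event restarts the timer to a full window. The `max_wait` cap is a hypothetical mitigation added for illustration, not something the reviewed document proposes, and all timings are made up.

```python
def evaluation_time(event_times, window, max_wait=None):
    """Return the time (ms) at which the debounced evaluation fires.

    event_times: sorted timestamps of degradation events.
    window: debounce window, restarted by every event that arrives in time.
    max_wait: optional hard cap measured from the first event (assumed mitigation).
    """
    first = event_times[0]
    deadline = first + window
    for t in event_times[1:]:
        if t >= deadline:
            break  # the timer already fired before this event arrived
        deadline = t + window  # a restart pushes the deadline out
        if max_wait is not None:
            deadline = min(deadline, first + max_wait)
    return deadline


# A component flapping every 400 ms against a 500 ms window (illustrative values).
flapping = list(range(0, 4000, 400))
uncapped = evaluation_time(flapping, window=500)
capped = evaluation_time(flapping, window=500, max_wait=1500)
```

With these numbers the uncapped evaluation is deferred until 4100 ms, after the flapping stops, while the capped variant fires at 1500 ms regardless of further restarts.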
Claude Opus unique findings (not in either other model):
- Timer restart pushing evaluation PAST single-process escalation timeout: the debounce mechanism can DEFEAT compound detection when the second degradation arrives near the end of the first window. The timer resets to a full window, and the first process escalates via the single-process path before the new window fires. The system therefore gets FLATTEN instead of HALT, exactly what compound detection was supposed to prevent.
- Strategy worker single-atom masking via event ordering: Worker A degrades, Worker B degrades (same atom), Worker A recovers → atom set to :normal while B is still degraded. Event ordering across different workers mapped to same atom creates state loss.
- Registry stale PID after HealthMonitor crash: if subscription is PID-based (not PubSub topic), new HM instance is deaf — events go to dead PID, silently dropped. Compound detection completely disabled for that user until subscription refresh.
- :rest_for_one cascade + coincidental independent issue: debounce designed to filter cascade restarts, but cascade can EXPOSE a pre-existing issue (SignalRisk restarts and finds its upstream stale). Is this "compound" or "cascade + bad luck"? Semantic ambiguity the design doesn't address.
- Compound cleared event without recovery debounce: :compound_degradation_cleared emitted immediately when last process recovers (no settling period), causing operator oscillation if recovery is transient.
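The single-atom masking finding above reduces to a few lines. The sketch below compares one shared status value for all strategy workers with an assumed alternative that tracks the set of degraded workers. The function names and the set-based fix are illustrative, not taken from the reviewed document.

```python
def apply_atom(events):
    """One shared status for all strategy workers: the last writer wins."""
    status = "normal"
    for _worker, state in events:
        status = state
    return status


def apply_set(events):
    """Assumed alternative: remember which workers are currently degraded."""
    degraded = set()
    for worker, state in events:
        if state == "degraded":
            degraded.add(worker)
        else:
            degraded.discard(worker)
    return "degraded" if degraded else "normal"


# Worker A degrades, Worker B degrades, Worker A recovers.
events = [("A", "degraded"), ("B", "degraded"), ("A", "normal")]
atom_view = apply_atom(events)
set_view = apply_set(events)
```

Here `atom_view` ends up `"normal"` even though B is still degraded, while `set_view` stays `"degraded"`.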
Claude Sonnet unique findings:
- ETS table creation race at startup (HealthMonitor writes before table exists)
- Registry lookup failure during pipeline startup (events before HM registered)
However, Sonnet also made analytical errors: it described "multiple HealthMonitor instances for the same user" scenarios despite the document clearly stating one instance per user via DynamicSupervisor. Several of its findings assumed multi-instance coordination that doesn't match the architecture.
Quality assessment:
- GPT-5 was the most exhaustive and technically precise. Its cross-sender ordering finding (#2) is genuinely insightful — it identifies that the document's "rapid recovery" safety argument implicitly assumes events arrive in wall-clock order, which Erlang does NOT guarantee across different senders. The debounce starvation finding (#3) identifies a real operational hazard with practical consequences. All 12 findings reference specific mechanisms and describe specific interleavings clearly.
- Claude Opus found fewer race conditions but several were qualitatively superior. The timer-restart-defeats-compound-detection finding is the most architecturally significant race in the entire analysis — it shows that the debounce mechanism can work AGAINST the design's stated goals in specific (realistic) timing scenarios. The strategy-worker event ordering masking is also a genuine design flaw unique to the single-atom decision. Opus continues its pattern of reasoning about design TENSIONS rather than just failure modes.
- Claude Sonnet was notably weaker here than in previous experiments. Only 1,404 output tokens vs 4,610 (Opus) and 10,587 (GPT-5). Several findings contained analytical errors (assuming multi-instance coordination that doesn't exist). It found only 7 races, and 2-3 of those were based on misreadings of the architecture. This is a significant regression from Finding #12 where Sonnet found 17 assumptions (85% of GPT-5's count).
Key insight — concurrency reasoning is a different skill than assumption-finding: In previous experiments (#10, #11, #12), Sonnet 4.6 performed well on assumption-finding (a task that requires reasoning about what's NOT stated). Here, on race condition identification (a task requiring reasoning about temporal interleavings and message ordering semantics), Sonnet drops significantly. This suggests the task type matters more than we previously thought:
- Assumption-finding: Requires breadth of consideration ("what must be true for this to work?"). Sonnet handles this well — it's essentially pattern matching across possible failure dimensions.
- Race condition identification: Requires SEQUENTIAL reasoning about specific interleavings ("if A happens, then B happens, then C happens, what state is visible?"). This benefits dramatically from extended reasoning tokens (GPT-5's 8,192 reasoning tokens) or from Opus's internal reasoning depth.
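The lesson: don't extrapolate model performance across task types. A model that's 85% as good at assumption-finding may be only about half as good at concurrency analysis: here Sonnet found 7 of GPT-5's 12 races (58%) by raw count, and fewer once the 2-3 findings based on misreadings are discounted. The cognitive demands are different.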
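The interleaving-style reasoning described above can be made concrete by enumerating delivery orders mechanically. The sketch assumes three independent senders into one mailbox and keeps only each sender's own order, which is the guarantee the cross-sender finding relies on. The message names are hypothetical.

```python
def interleavings(queues):
    """All delivery orders that keep each sender's own messages in order."""
    if all(not q for q in queues):
        return [[]]
    result = []
    for i, q in enumerate(queues):
        if q:
            rest = queues[:i] + [q[1:]] + queues[i + 1:]
            result += [[q[0]] + tail for tail in interleavings(rest)]
    return result


senders = [
    ["p1_recovered"],                 # pipeline process 1
    ["p2_degraded", "p2_recovered"],  # pipeline process 2
    ["timer_expired"],                # runtime timer
]
orders = interleavings(senders)

# Orders in which the timer is handled before process 1's recovery is seen.
timer_first = [
    o for o in orders if o.index("timer_expired") < o.index("p1_recovered")
]
```

Of the 12 permitted orders, 6 deliver `timer_expired` before `p1_recovered`, so an argument that assumes recovery is always processed first covers only half of the cases in this toy model.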
Opus's distinguishing strength — finding design contradictions: Opus's best finding (timer restart defeating compound detection) isn't just a race condition — it's identifying that the debounce mechanism can work against the design's own stated goals. This is consistent with Opus's pattern in previous findings: it finds tensions where one part of the design undermines another part. For race condition analysis specifically, this manifests as "here's where your safety mechanism becomes your vulnerability."
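A worked timing example makes the timer-restart finding easier to check. The sketch assumes a single debounce timer that restarts to a full window on the second degradation, and a per-process escalation timeout measured from the first degradation. The function name and all durations are hypothetical. The FLATTEN and HALT outcomes follow the single-process and compound paths described in the findings.

```python
def first_action(a_degraded_at, b_degraded_at, debounce, escalation_timeout):
    """Return (action, time) for whichever path triggers first.

    Models only the case where B degrades while A's debounce window is open.
    """
    first_window_end = a_degraded_at + debounce
    if not a_degraded_at <= b_degraded_at < first_window_end:
        raise ValueError("B must degrade inside A's first debounce window")
    compound_eval_at = b_degraded_at + debounce  # restart to a full window
    single_escalation_at = a_degraded_at + escalation_timeout
    if single_escalation_at < compound_eval_at:
        return ("FLATTEN", single_escalation_at)  # single-process path wins
    return ("HALT", compound_eval_at)             # compound path wins


# Illustrative values in seconds: 5 s debounce, 8 s escalation timeout.
early = first_action(0, 1, debounce=5, escalation_timeout=8)
late = first_action(0, 4, debounce=5, escalation_timeout=8)
```

With B degrading at 1 s the compound evaluation runs at 6 s and yields HALT; with B degrading at 4 s the evaluation is pushed to 9 s, so A's single-process escalation at 8 s yields FLATTEN first.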
Practical implication for architecture review:
- For race condition analysis: use GPT-5 (exhaustive) + Opus (design-tension)
- Sonnet is NOT suitable for concurrency reasoning tasks — use it for assumption-finding and structural review instead
- The three-model stack needs task-appropriate assignment:
  - Structural/assumption review: all three models contribute
  - Concurrency/race analysis: GPT-5 + Opus only
  - Bias detection: any model (per Finding #8)
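The assignment above could be encoded as a small lookup so that review tooling picks models by task type. This is a sketch only: the dictionary keys, the function name, and the idea of a routing table are assumptions, not part of the existing workflow.

```python
# Hypothetical routing table mirroring the assignment in this finding.
REVIEW_ASSIGNMENT = {
    "structural_assumption": ["GPT-5", "Claude Opus 4.6", "Claude Sonnet 4.6"],
    "concurrency_race": ["GPT-5", "Claude Opus 4.6"],
    # Any one of these suffices for bias detection (per Finding #8).
    "bias_detection": ["GPT-5", "Claude Opus 4.6", "Claude Sonnet 4.6"],
}


def reviewers_for(task_type):
    """Return the candidate models for a review task type."""
    try:
        return REVIEW_ASSIGNMENT[task_type]
    except KeyError:
        raise ValueError(f"unknown task type: {task_type!r}") from None
```

Keeping the mapping in one place makes it easy to revise when a later experiment changes the picture for a given task type.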