model-research/findings/2026-05-03-13-race-condition-identification-opus-excels.md

# Finding 13: Race condition identification: Opus excels at subtle temporal interactions; GPT-5 is exhaustive; Sonnet struggles with concurrency reasoning

**Date:** 2026-05-03
**Task:** Identify race conditions, timing-dependent bugs, and ordering hazards in
gargoyle's `concurrent-failure-detection.md` (241 lines) — a document specifically
about concurrent detection logic with timers, ETS state, and multi-process events.
**How we used them:** Same document (full text) + same focused analytical question
to all 3 models via HAI proxy. Prompt specifically asked for event ordering problems,
timer interaction bugs, state visibility gaps, crash/restart timing, and multi-instance
coordination. Required each finding to reference specific mechanisms in the document
with specific interleaving descriptions. No tools, no project context beyond the
document itself.

| Model | Time | Output tokens | Reasoning tokens | Race conditions found |
|---|---|---|---|---|
| GPT-5 | 116s | 10,587 | 8,192 | 12 |
| Claude Opus 4.6 | ~105s | 4,610 | not reported (internal) | 10 |
| Claude Sonnet 4.6 | ~39s | 1,404 | not reported (internal) | 7 |

**What they found — common ground (all 3 identified):**
- Stale timer messages in mailbox after cancellation (classic Erlang timer race)
- HealthMonitor crash losing compound detection state (init from :unknown, no replay)
- ETS vs GenServer state divergence visible to dashboard
- Kill switch mode conflict (FLATTEN from single-process vs HALT from compound path)
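
The stale-timer item is the easiest of these to reproduce in isolation. The sketch below is a toy Python model of a single process mailbox, not Elixir and not gargoyle's code; `Monitor`, `run`, and the `debounce_timeout` message name are hypothetical. It shows the interleaving all three models described: the timeout is already queued when the cancel happens, so it is only ignored if the handler compares the message's reference against the currently active one.

```python
from collections import deque
from itertools import count

_refs = count(1)  # stand-in for unique timer references


class Monitor:
    """Toy single-process model: one mailbox, at most one active debounce timer."""

    def __init__(self, check_ref):
        self.mailbox = deque()
        self.active_ref = None
        self.check_ref = check_ref  # True = ignore timeouts from cancelled timers
        self.evaluations = 0

    def start_timer(self):
        self.active_ref = next(_refs)
        return self.active_ref

    def timer_fires(self, ref):
        # The runtime enqueues the timeout; the process has not handled it yet.
        self.mailbox.append(("debounce_timeout", ref))

    def cancel_timer(self):
        # Cancelling prevents future firing but cannot un-send a queued message.
        self.active_ref = None

    def drain(self):
        while self.mailbox:
            _tag, ref = self.mailbox.popleft()
            if self.check_ref and ref != self.active_ref:
                continue  # stale timeout from a cancelled timer
            self.evaluations += 1


def run(check_ref):
    m = Monitor(check_ref)
    ref = m.start_timer()
    m.timer_fires(ref)  # timeout is already in the mailbox...
    m.cancel_timer()    # ...when the recovery event cancels the timer
    m.drain()
    return m.evaluations
```

Without the reference check, `run(False)` performs one spurious compound evaluation; with it, `run(True)` performs none.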

**GPT-5 unique findings (not in either Claude model):**
- Cross-sender message ordering: recovery events from pipeline processes vs timer
  expiry from runtime (Erlang preserves order only per sender-receiver pair, NOT
  across senders). The "rapid recovery" safety argument in the doc relies on state
  being updated before the timer fires, which isn't guaranteed
- Debounce starvation: a flapping component repeatedly restarts the timer, causing
  compound evaluation to be indefinitely postponed while ≥2 processes are genuinely
  degraded
- State regression: {:degraded} arriving after {:escalated, :kill_switch} with no
  guard in the event table — state machine allows regressing from :halted to :degraded
- Cold-start window: application boots with existing degraded processes that won't
  re-emit events, compound detection never fires
- Catch-all handle_info could accidentally swallow timer messages if pattern matching
  is ordered wrong (implementation pitfall of the described approach)
- Debounce window growing beyond calibrated bounds from repeated timer restarts
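
The debounce starvation and window-growth items share one mechanism: a timer that restarts on every event has a deadline of "latest event + window". A minimal sketch, assuming hypothetical values (a 5-unit window and a component flapping every 4 units) that do not come from the document; the `max_delay` cap is an assumption of this sketch, shown only as one possible bound, not something the document or any model proposed.

```python
def debounce_fire_time(event_times, window):
    """Return when a restart-on-every-event debounce timer finally fires.

    Each event restarts the timer, so the deadline is always
    (latest event time + window).
    """
    deadline = None
    for t in sorted(event_times):
        if deadline is not None and t >= deadline:
            break  # timer fired before this event arrived
        deadline = t + window
    return deadline


def capped_fire_time(event_times, window, max_delay):
    """Same timer, but evaluation is forced no later than first event + max_delay."""
    first = min(event_times)
    return min(debounce_fire_time(event_times, window), first + max_delay)


# Hypothetical flapping component: one event every 4 units, 10 events in total.
flapping = list(range(0, 40, 4))
```

With these numbers `debounce_fire_time(flapping, 5)` does not fire until the flapping stops, while `capped_fire_time(flapping, 5, 10)` bounds the delay at 10 units.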

**Claude Opus unique findings (not in either of the other models):**
- Timer restart pushing evaluation PAST single-process escalation timeout — the
  debounce mechanism can DEFEAT compound detection when second degradation arrives
  near end of first window (resets to full window, first process escalates via
  single-process path before new window fires). This means system gets FLATTEN
  instead of HALT — exactly what compound detection was supposed to prevent.
- Strategy worker single-atom masking via event ordering: Worker A degrades, Worker
  B degrades (same atom), Worker A recovers → atom set to :normal while B is still
  degraded. Event ordering across different workers mapped to same atom creates
  state loss.
- Registry stale PID after HealthMonitor crash: if subscription is PID-based (not
  PubSub topic), new HM instance is deaf — events go to dead PID, silently dropped.
  Compound detection completely disabled for that user until subscription refresh.
- :rest_for_one cascade + coincidental independent issue: debounce designed to
  filter cascade restarts, but cascade can EXPOSE a pre-existing issue (SignalRisk
  restarts and finds its upstream stale). Is this "compound" or "cascade + bad luck"?
  Semantic ambiguity the design doesn't address.
- Compound cleared event without recovery debounce: :compound_degradation_cleared
  emitted immediately when last process recovers (no settling period), causing
  operator oscillation if recovery is transient.
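
The single-atom masking item can be shown in a few lines. This is a sketch under assumed names (`worker_a`, `worker_b`, and both functions are hypothetical); it contrasts a last-writer-wins shared status with per-worker tracking for the exact event order Opus described.

```python
def single_atom_state(events):
    """Last-writer-wins: every worker writes the same shared status atom."""
    status = "normal"
    for _worker, new_status in events:
        status = new_status
    return status


def per_worker_state(events):
    """Track each worker separately; degraded if any worker is degraded."""
    by_worker = {}
    for worker, new_status in events:
        by_worker[worker] = new_status
    return "degraded" if "degraded" in by_worker.values() else "normal"


events = [
    ("worker_a", "degraded"),
    ("worker_b", "degraded"),
    ("worker_a", "normal"),  # A recovers while B is still degraded
]
```

For this sequence the shared atom ends at `normal` even though `worker_b` never recovered, which is the state loss the finding points at.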

**Claude Sonnet unique findings:**
- ETS table creation race at startup (HealthMonitor writes before table exists)
- Registry lookup failure during pipeline startup (events before HM registered)

However, Sonnet also made analytical errors: it described "multiple HealthMonitor
instances for the same user" scenarios despite the document clearly stating one
instance per user via DynamicSupervisor. Several of its findings assumed
multi-instance coordination that doesn't match the architecture.

**Quality assessment:**
- **GPT-5** was the most exhaustive and technically precise. Its cross-sender
  ordering finding (#2) is genuinely insightful — it identifies that the document's
  "rapid recovery" safety argument implicitly assumes events arrive in wall-clock
  order, which Erlang does NOT guarantee across different senders. The debounce
  starvation finding (#3) identifies a real operational hazard with practical
  consequences. All 12 findings reference specific mechanisms and describe specific
  interleavings clearly.
- **Claude Opus** found fewer race conditions but several were qualitatively
  superior. The timer-restart-defeats-compound-detection finding is the most
  architecturally significant race in the entire analysis — it shows that the
  debounce mechanism can work AGAINST the design's stated goals in specific
  (realistic) timing scenarios. The strategy-worker event ordering masking is
  also a genuine design flaw unique to the single-atom decision. Opus continues
  its pattern of reasoning about design TENSIONS rather than just failure modes.
- **Claude Sonnet** was notably weaker here than in previous experiments. Only
  1,404 output tokens vs 4,610 (Opus) and 10,587 (GPT-5). Several findings
  contained analytical errors (assuming multi-instance coordination that doesn't
  exist). It found only 7 races, and 2-3 of those were based on misreadings of
  the architecture. This is a significant regression from Finding #12 where
  Sonnet found 17 assumptions (85% of GPT-5's count).
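
The cross-sender ordering point from the GPT-5 assessment above can be made concrete by enumerating the delivery orders that per-sender FIFO still permits. The sketch models only the ordering guarantee, not the BEAM scheduler, and the event names are hypothetical stand-ins for the document's events.

```python
def interleavings(a, b):
    """Yield every merge of lists a and b that keeps each list's own order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def possible_orders():
    pipeline = ["degraded", "recovered"]  # one sender: FIFO between these two
    runtime = ["debounce_timeout"]        # a different sender: the timer
    return [
        order for order in interleavings(pipeline, runtime)
        # Causality: the timer only exists once "degraded" has been handled.
        if order.index("degraded") < order.index("debounce_timeout")
    ]


def unsafe_orders():
    """Orders where the timeout is handled before the recovery is visible."""
    return [
        order for order in possible_orders()
        if order.index("debounce_timeout") < order.index("recovered")
    ]
```

Two orders remain possible, and one of them handles the timeout while the state still says degraded, regardless of which message was sent first in wall-clock time.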

**Key insight — concurrency reasoning is a different skill than assumption-finding:**
In previous experiments (#10, #11, #12), Sonnet 4.6 performed well on
assumption-finding (a task that requires reasoning about what's NOT stated).
Here, on race condition identification (a task requiring reasoning about temporal
interleavings and message ordering semantics), Sonnet's output drops sharply in
both volume and accuracy. This suggests the task type matters more than we
previously thought:

- **Assumption-finding:** Requires breadth of consideration ("what must be true
  for this to work?"). Sonnet handles this well — it's essentially pattern
  matching across possible failure dimensions.
- **Race condition identification:** Requires SEQUENTIAL reasoning about specific
  interleavings ("if A happens, then B happens, then C happens, what state is
  visible?"). In this experiment that appears to benefit substantially from
  extended reasoning (GPT-5 reported 8,192 reasoning tokens) or from Opus's
  internal reasoning depth.

The lesson: don't extrapolate model performance across task types. A model that's
85% as good at assumption-finding may be 50% as good at concurrency analysis.
The cognitive demands are different.
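
The rough "50%" figure can be sanity-checked against this run's own counts. The arithmetic below uses only numbers from the table and the quality assessment (12 races for GPT-5, 7 for Sonnet, 2-3 of Sonnet's based on misreadings); treating GPT-5's count as the baseline is an assumption carried over from how Finding #12 reported its 85%.

```python
gpt5_races = 12
sonnet_races = 7
sonnet_misread = (2, 3)  # "2-3" findings based on misreadings of the architecture

# Raw count ratio, before discounting anything.
raw_ratio = sonnet_races / gpt5_races

# Ratio after removing the misread findings (best case, worst case).
valid_ratios = [(sonnet_races - m) / gpt5_races for m in sonnet_misread]

print(f"raw: {raw_ratio:.0%}")
print("after discounting misreadings:", ", ".join(f"{r:.0%}" for r in valid_ratios))
```

The raw ratio sits a little above one half and the discounted ratios sit below it, so "50%" is a fair midpoint rather than a measured value.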

**Opus's distinguishing strength — finding design contradictions:**
Opus's best finding (timer restart defeating compound detection) isn't just a
race condition — it's identifying that the debounce mechanism can work against
the design's own stated goals. This is consistent with Opus's pattern in
previous findings: it finds tensions where one part of the design undermines
another part. For race condition analysis specifically, this manifests as
"here's where your safety mechanism becomes your vulnerability."

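The timing behind Opus's best finding can be written down as a small model. This is a sketch with hypothetical values (a 5-unit debounce window and an 8-unit single-process escalation timeout); the document's real values are not reproduced here, and `first_action` is an illustrative name. Process A degrades at t=0, process B degrades inside the first debounce window, and the function reports which kill-switch action happens first. With these values, a second degradation at t=1 still yields HALT at t=6, while one at t=4 pushes the compound evaluation to t=9, so A's single-process escalation wins with FLATTEN at t=8.

```python
def first_action(second_degraded_at, debounce_window, escalation_timeout):
    """Return (action, time) for the first kill-switch action taken.

    Process A degrades at t=0 and B degrades at `second_degraded_at`.
    Every degradation restarts the debounce timer, but A's single-process
    escalation timer keeps running from t=0.
    """
    if not 0 <= second_degraded_at < debounce_window:
        raise ValueError("model only covers B degrading inside the first window")

    single_escalation_at = escalation_timeout
    compound_eval_at = second_degraded_at + debounce_window  # timer was restarted

    # A tie is treated as FLATTEN here; the real outcome would depend on
    # message ordering, which this model does not cover.
    if compound_eval_at < single_escalation_at:
        return ("HALT", compound_eval_at)
    return ("FLATTEN", single_escalation_at)
```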
**Practical implication for architecture review:**
- For race condition analysis: use GPT-5 (exhaustive) + Opus (design-tension)
- Sonnet is NOT suitable for concurrency reasoning tasks — use it for
  assumption-finding and structural review instead
- The three-model stack needs task-appropriate assignment:
  - Structural/assumption review: all three models contribute
  - Concurrency/race analysis: GPT-5 + Opus only
  - Bias detection: any model (per Finding #8)