# Finding 13: Race condition identification: Opus excels at subtle temporal interactions; GPT-5 is exhaustive; Sonnet struggles with concurrency reasoning

**Date:** 2026-05-03
**Task:** Identify race conditions, timing-dependent bugs, and ordering hazards in
gargoyle's `concurrent-failure-detection.md` (241 lines), a document specifically
about concurrent detection logic with timers, ETS state, and multi-process events.
**How we used them:** Same document (full text) + same focused analytical question
to all 3 models via HAI proxy. The prompt specifically asked for event ordering problems,
timer interaction bugs, state visibility gaps, crash/restart timing, and multi-instance
coordination. It required each finding to reference specific mechanisms in the document
with specific interleaving descriptions. No tools, no project context beyond the
document itself.

| Model | Time | Output tokens | Reasoning tokens | Race conditions found |
|---|---|---|---|---|
| GPT-5 | 116s | 10,587 | 8,192 | 12 |
| Claude Opus 4.6 | ~105s | 4,610 | (internal) | 10 |
| Claude Sonnet 4.6 | ~39s | 1,404 | (internal) | 7 |

**What they found (common ground, identified by all 3):**
- Stale timer messages in mailbox after cancellation (classic Erlang timer race)
- HealthMonitor crash losing compound detection state (init from `:unknown`, no replay)
- ETS vs GenServer state divergence visible to dashboard
- Kill switch mode conflict (FLATTEN from single-process vs HALT from compound path)
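
The stale-timer item is easiest to see with a toy mailbox. The sketch below is a Python simulation, not the document's Elixir code: the `Monitor` class, its message tuples, and the `timer_ref` field are hypothetical names introduced for illustration. It assumes the timer's expiry message is already in the mailbox when the process cancels the timer, so cancellation cannot take it back. Tagging each timer with a fresh reference and ignoring mismatches is one common guard, shown here as an option rather than as the design's actual fix.

```python
from collections import deque
from itertools import count


class Monitor:
    """Toy single-process mailbox model (hypothetical, not gargoyle's code)."""

    def __init__(self, check_refs):
        self.mailbox = deque()
        self.check_refs = check_refs
        self.timer_ref = None      # ref of the currently armed timer, if any
        self.evaluations = 0       # how many times compound evaluation ran
        self._refs = count(1)

    def start_timer(self):
        self.timer_ref = next(self._refs)
        return self.timer_ref

    def timer_fires(self, ref):
        # The expiry message is now in the mailbox; a later cancel
        # cannot remove it.
        self.mailbox.append(("debounce_expired", ref))

    def cancel_timer(self):
        self.timer_ref = None

    def drain(self):
        while self.mailbox:
            tag, ref = self.mailbox.popleft()
            if tag == "debounce_expired":
                if self.check_refs and ref != self.timer_ref:
                    continue       # stale message from a cancelled timer
                self.evaluations += 1


def run(check_refs):
    m = Monitor(check_refs)
    ref = m.start_timer()
    m.timer_fires(ref)   # expiry is enqueued...
    m.cancel_timer()     # ...and only then does the process cancel the timer
    m.drain()
    return m.evaluations


print(run(check_refs=False))  # 1: the stale message triggers an evaluation
print(run(check_refs=True))   # 0: the stale message is ignored
```

Without the reference check, the cancelled timer still causes one evaluation; with it, the stale message is dropped.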

**GPT-5 unique findings (not in either Claude model):**
- Cross-sender message ordering: recovery events from pipeline processes vs timer
  expiry from the runtime (Erlang preserves per-sender order, NOT cross-sender). The
  "rapid recovery" safety argument in the doc relies on state being updated before
  the timer fires, which isn't guaranteed.
- Debounce starvation: a flapping component repeatedly restarts the timer, causing
  compound evaluation to be postponed indefinitely while ≥2 processes are genuinely
  degraded.
- State regression: `{:degraded}` arriving after `{:escalated, :kill_switch}` with no
  guard in the event table, so the state machine allows regressing from `:halted` to
  `:degraded`.
- Cold-start window: the application boots with existing degraded processes that won't
  re-emit events, so compound detection never fires.
- Catch-all `handle_info` could accidentally swallow timer messages if pattern matching
  is ordered wrong (implementation pitfall of the described approach).
- Debounce window growing beyond calibrated bounds from repeated timer restarts.
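
The debounce starvation and window-growth items can be sketched numerically. This is a minimal Python model, assuming a debounce timer that restarts to a full window on every incoming health event. The function name, the 5-second window, and the event times are hypothetical values chosen for illustration, not figures from the document.

```python
def first_evaluation_time(event_times, window):
    """Return when a restart-on-every-event debounce timer would first fire.

    event_times: sorted times (seconds) at which health events arrive.
    Returns None if there are no events.
    """
    deadline = None
    for t in event_times:
        if deadline is not None and t >= deadline:
            return deadline        # timer fired before this event arrived
        deadline = t + window      # every event restarts the full window
    return deadline


WINDOW = 5.0  # hypothetical debounce window in seconds

# Two components degrade at t=0 and t=1 and stay degraded.
quiet = first_evaluation_time([0.0, 1.0], WINDOW)

# Same two degradations, plus a third component flapping every 4 seconds.
flapping = [0.0, 1.0] + [2.0 + 4.0 * i for i in range(15)]
starved = first_evaluation_time(flapping, WINDOW)

print(quiet, starved)  # 6.0 63.0
```

In this model the two genuinely degraded components are evaluated at t=6 when the system is quiet, but not until t=63 when a third component flaps faster than the window. The delay grows for as long as the flapping continues.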

**Claude Opus unique findings (not in either other model):**
- Timer restart pushing evaluation PAST the single-process escalation timeout: the
  debounce mechanism can DEFEAT compound detection when the second degradation arrives
  near the end of the first window (the timer resets to a full window, and the first
  process escalates via the single-process path before the new window fires). This
  means the system gets FLATTEN instead of HALT, exactly what compound detection was
  supposed to prevent.
- Strategy worker single-atom masking via event ordering: Worker A degrades, Worker
  B degrades (same atom), Worker A recovers → atom set to `:normal` while B is still
  degraded. Event ordering across different workers mapped to the same atom creates
  state loss.
- Registry stale PID after HealthMonitor crash: if the subscription is PID-based (not
  a PubSub topic), the new HM instance is deaf. Events go to the dead PID and are
  silently dropped. Compound detection is completely disabled for that user until the
  subscription is refreshed.
- `:rest_for_one` cascade + coincidental independent issue: the debounce is designed to
  filter cascade restarts, but a cascade can EXPOSE a pre-existing issue (SignalRisk
  restarts and finds its upstream stale). Is this "compound" or "cascade + bad luck"?
  This is a semantic ambiguity the design doesn't address.
- Compound cleared event without recovery debounce: `:compound_degradation_cleared` is
  emitted immediately when the last process recovers (no settling period), causing
  operator oscillation if the recovery is transient.
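
The single-atom masking item reduces to a small state comparison. The Python sketch below replays the A-degrades, B-degrades, A-recovers sequence against two hypothetical state shapes: one shared status value (the masking-prone shape described above) and a per-worker set. The class names and event strings are assumptions made for illustration, and the per-worker set is only one possible alternative, not something the document proposes.

```python
class AtomState:
    """One status value shared by all strategy workers."""

    def __init__(self):
        self.status = "normal"

    def handle(self, worker, event):
        # The worker identity is ignored: the last event wins.
        self.status = "degraded" if event == "degraded" else "normal"

    def degraded(self):
        return self.status == "degraded"


class PerWorkerState:
    """Tracks which workers are degraded, so one recovery cannot hide another."""

    def __init__(self):
        self.degraded_workers = set()

    def handle(self, worker, event):
        if event == "degraded":
            self.degraded_workers.add(worker)
        else:
            self.degraded_workers.discard(worker)

    def degraded(self):
        return bool(self.degraded_workers)


events = [("A", "degraded"), ("B", "degraded"), ("A", "recovered")]
atom, per_worker = AtomState(), PerWorkerState()
for worker, event in events:
    atom.handle(worker, event)
    per_worker.handle(worker, event)

print(atom.degraded(), per_worker.degraded())  # False True
```

After the sequence, the shared value reports normal while Worker B is still degraded; the per-worker set still reports degradation.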

**Claude Sonnet unique findings:**
- ETS table creation race at startup (HealthMonitor writes before table exists)
- Registry lookup failure during pipeline startup (events before HM registered)

However, Sonnet also made analytical errors: it described "multiple HealthMonitor
instances for the same user" scenarios despite the document clearly stating one
instance per user via DynamicSupervisor. Several of its findings assumed
multi-instance coordination that doesn't match the architecture.

**Quality assessment:**
- **GPT-5** was the most exhaustive and technically precise. Its cross-sender
  ordering finding (#2) is genuinely insightful: it identifies that the document's
  "rapid recovery" safety argument implicitly assumes events arrive in wall-clock
  order, which Erlang does NOT guarantee across different senders. The debounce
  starvation finding (#3) identifies a real operational hazard with practical
  consequences. All 12 findings reference specific mechanisms and describe specific
  interleavings clearly.
- **Claude Opus** found fewer race conditions but several were qualitatively
  superior. The timer-restart-defeats-compound-detection finding is the most
  architecturally significant race in the entire analysis: it shows that the
  debounce mechanism can work AGAINST the design's stated goals in specific
  (realistic) timing scenarios. The strategy-worker event ordering masking is
  also a genuine design flaw unique to the single-atom decision. Opus continues
  its pattern of reasoning about design TENSIONS rather than just failure modes.
- **Claude Sonnet** was notably weaker here than in previous experiments. It
  produced only 1,404 output tokens vs 4,610 (Opus) and 10,587 (GPT-5). Several
  findings contained analytical errors (assuming multi-instance coordination that
  doesn't exist). It found only 7 races, and 2-3 of those were based on misreadings
  of the architecture. This is a significant regression from Finding #12, where
  Sonnet found 17 assumptions (85% of GPT-5's count).

**Key insight: concurrency reasoning is a different skill than assumption-finding.**
In previous experiments (#10, #11, #12), Sonnet 4.6 performed well on
assumption-finding (a task that requires reasoning about what's NOT stated).
Here, on race condition identification (a task requiring reasoning about temporal
interleavings and message ordering semantics), Sonnet's performance drops
significantly. This suggests the task type matters more than we previously thought:

- **Assumption-finding:** Requires breadth of consideration ("what must be true
  for this to work?"). Sonnet handles this well; it's essentially pattern
  matching across possible failure dimensions.
- **Race condition identification:** Requires SEQUENTIAL reasoning about specific
  interleavings ("if A happens, then B happens, then C happens, what state is
  visible?"). This benefits dramatically from extended reasoning tokens (GPT-5's
  8,192 reasoning tokens) or from Opus's internal reasoning depth.
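
The sequential reasoning described above amounts to enumerating interleavings and checking the visible state in each. A minimal Python sketch follows, applied to the cross-sender case from GPT-5's findings: a recovery event from a pipeline process and a timer expiry from the runtime. The message names, the starting count of two degraded processes, and the `outcome` rule are hypothetical simplifications, not the document's event table.

```python
def interleavings(a, b):
    """Yield every merge of a and b that keeps each list's internal order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def outcome(messages, degraded=2):
    """Process messages in mailbox order; evaluate on timer expiry."""
    for msg in messages:
        if msg == "recovered":
            degraded -= 1
        elif msg == "timer_expired":
            return "HALT" if degraded >= 2 else "no_action"
    return "no_action"


pipeline = ["recovered"]       # sent by a pipeline process
runtime = ["timer_expired"]    # sent by the timer

results = {
    tuple(order): outcome(order)
    for order in interleavings(pipeline, runtime)
}
for order, result in results.items():
    print(order, "->", result)
# ('recovered', 'timer_expired') -> no_action
# ('timer_expired', 'recovered') -> HALT
```

Per-sender ordering holds in both merges, yet the two arrival orders produce different decisions. That gap is what an argument based on "recovery always lands first" has to close.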

The lesson: don't extrapolate model performance across task types. A model that's
85% as good at assumption-finding may be 50% as good at concurrency analysis.
The cognitive demands are different.

**Opus's distinguishing strength: finding design contradictions.**
Opus's best finding (timer restart defeating compound detection) isn't just a
race condition: it identifies that the debounce mechanism can work against
the design's own stated goals. This is consistent with Opus's pattern in
previous findings: it finds tensions where one part of the design undermines
another part. For race condition analysis specifically, this manifests as
"here's where your safety mechanism becomes your vulnerability."
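
The timer-restart finding can be checked with simple arithmetic. The Python sketch below assumes that the second degradation restarts the debounce timer to a full window and that the single-process escalation clock runs from the first degradation. The function name, the 10-second debounce, the 15-second escalation timeout, and the tie-breaking rule are all hypothetical choices for illustration.

```python
def first_action(t_first, t_second, debounce, escalation_timeout):
    """Return (time, action) for whichever path fires first.

    Assumes the second degradation restarts the debounce timer to a full
    window, and the single-process escalation clock starts at t_first.
    Ties go to the single-process path (an arbitrary choice for this sketch).
    """
    compound_eval_at = t_second + debounce
    single_escalation_at = t_first + escalation_timeout
    if single_escalation_at <= compound_eval_at:
        return single_escalation_at, "FLATTEN"   # single-process path wins
    return compound_eval_at, "HALT"              # compound path wins


# Second degradation arrives early in the first window.
early = first_action(t_first=0, t_second=2, debounce=10, escalation_timeout=15)

# Second degradation arrives near the end of the first window.
late = first_action(t_first=0, t_second=9, debounce=10, escalation_timeout=15)

print(early, late)  # (12, 'HALT') (15, 'FLATTEN')
```

With these numbers, a second degradation at t=2 still yields HALT at t=12, while one at t=9 pushes compound evaluation to t=19, after the single-process path has already issued FLATTEN at t=15.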

**Practical implication for architecture review:**
- For race condition analysis: use GPT-5 (exhaustive) + Opus (design-tension)
- Sonnet is NOT suitable for concurrency reasoning tasks; use it for
  assumption-finding and structural review instead
- The three-model stack needs task-appropriate assignment:
  - Structural/assumption review: all three models contribute
  - Concurrency/race analysis: GPT-5 + Opus only
  - Bias detection: any model (per Finding #8)