refactor(findings): split ALL-FINDINGS.md into per-experiment files
Break the monolithic 3249-line findings file into 29 individual files, one per experiment. Each file is named YYYY-MM-DD-NN-slug.md for easy chronological sorting and discovery. No content changes — purely structural reorganization.
@@ -0,0 +1,126 @@
# Finding 13: Race condition identification: Opus excels at subtle temporal interactions; GPT-5 is exhaustive; Sonnet struggles with concurrency reasoning

**Date:** 2026-05-03
**Task:** Identify race conditions, timing-dependent bugs, and ordering hazards in
gargoyle's `concurrent-failure-detection.md` (241 lines), a document specifically
about concurrent detection logic with timers, ETS state, and multi-process events.
**How we used them:** Same document (full text) + same focused analytical question
to all 3 models via HAI proxy. The prompt specifically asked for event ordering problems,
timer interaction bugs, state visibility gaps, crash/restart timing, and multi-instance
coordination. It required each finding to reference specific mechanisms in the document
with specific interleaving descriptions. No tools, no project context beyond the
document itself.

| Model | Time | Output tokens | Reasoning tokens | Race conditions found |
|---|---|---|---|---|
| GPT-5 | 116s | 10,587 | 8,192 | 12 |
| Claude Opus 4.6 | ~105s | 4,610 | (internal) | 10 |
| Claude Sonnet 4.6 | ~39s | 1,404 | (internal) | 7 |

**What they found in common (all 3 identified):**
- Stale timer messages in mailbox after cancellation (classic Erlang timer race)
- HealthMonitor crash losing compound detection state (init from :unknown, no replay)
- ETS vs GenServer state divergence visible to dashboard
- Kill switch mode conflict (FLATTEN from single-process vs HALT from compound path)
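The stale-timer race in the first bullet can be sketched as a toy model. This is a Python simulation, not code from the reviewed document: the class, the mailbox, and the reference-tagging guard are all assumptions made for illustration. The guard shown (tag each timer with a fresh reference and ignore timeouts whose reference is no longer current) is one common mitigation, not necessarily the one gargoyle uses.

```python
import itertools
from collections import deque


class DebounceState:
    """Toy model of a process mailbox plus one restartable timer (hypothetical)."""

    def __init__(self):
        self.mailbox = deque()
        self._refs = itertools.count(1)
        self.current_ref = None  # reference of the only timer considered live
        self.evaluations = 0

    def start_timer(self):
        self.current_ref = next(self._refs)
        return self.current_ref

    def timer_fires(self, ref):
        # The timeout is enqueued; a later cancel cannot un-send it.
        self.mailbox.append(("timeout", ref))

    def cancel_timer(self):
        self.current_ref = None

    def drain(self, check_ref):
        while self.mailbox:
            _tag, ref = self.mailbox.popleft()
            if check_ref and ref != self.current_ref:
                continue  # stale timeout from a cancelled timer: ignore it
            self.evaluations += 1


def run(check_ref):
    state = DebounceState()
    old_ref = state.start_timer()
    state.timer_fires(old_ref)  # timeout is already in the mailbox...
    state.cancel_timer()        # ...when the cancel happens
    state.start_timer()         # a new debounce window starts and has not fired
    state.drain(check_ref)
    return state.evaluations


print(run(check_ref=False))  # unguarded handler evaluates on the stale message
print(run(check_ref=True))   # guarded handler ignores it
```

Without the reference check, the handler runs one evaluation that belongs to a window that was already cancelled; with the check, it runs none.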

**GPT-5 unique findings (not in either Claude model):**
- Cross-sender message ordering: recovery events from pipeline processes vs timer
  expiry from runtime (Erlang preserves per-sender order, NOT cross-sender). The
  "rapid recovery" safety argument in the doc relies on state being updated before
  the timer fires, which isn't guaranteed
- Debounce starvation: a flapping component repeatedly restarts the timer, causing
  compound evaluation to be postponed indefinitely while ≥2 processes are genuinely degraded
- State regression: {:degraded} arriving after {:escalated, :kill_switch} with no
  guard in the event table, so the state machine allows regressing from :halted to :degraded
- Cold-start window: the application boots with existing degraded processes that won't
  re-emit events, so compound detection never fires
- Catch-all handle_info could accidentally swallow timer messages if pattern matching
  is ordered wrong (implementation pitfall of the described approach)
- Debounce window growing beyond calibrated bounds from repeated timer restarts
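The debounce starvation and window-growth findings above can be illustrated with a small timeline calculation. This is a sketch under stated assumptions: the function name, the 5-second window, the 4-second flapping interval, and especially the `max_wait` cap are hypothetical and do not come from the reviewed document. The cap is shown only as one possible way to bound the delay.

```python
def evaluation_time(event_times, window, max_wait=None):
    """Return when a restart-on-event debounce finally fires.

    event_times: sorted times at which degradation events arrive.
    window:      debounce window restarted by every event.
    max_wait:    optional (hypothetical) upper bound measured from the first event.
    """
    first = event_times[0]
    deadline = first + window
    for t in event_times[1:]:
        if t >= deadline:
            return deadline           # timer fired before this event arrived
        deadline = t + window         # event restarts the timer
        if max_wait is not None:
            deadline = min(deadline, first + max_wait)
    return deadline


# A component flapping every 4 time units for 60 units, with a window of 5.
flapping = list(range(0, 61, 4))

print(evaluation_time(flapping, window=5))               # fires only after flapping stops
print(evaluation_time(flapping, window=5, max_wait=15))  # bounded by the cap
```

With an uncapped restart, the evaluation is deferred until one full window after the last flap, however long the flapping lasts. That is the starvation GPT-5 describes.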

**Claude Opus unique findings (not in either other model):**
- Timer restart pushing evaluation PAST single-process escalation timeout: the
  debounce mechanism can DEFEAT compound detection when the second degradation arrives
  near the end of the first window (the timer resets to a full window, and the first
  process escalates via the single-process path before the new window fires). This means
  the system gets FLATTEN instead of HALT, exactly what compound detection was supposed
  to prevent.
- Strategy worker single-atom masking via event ordering: Worker A degrades, Worker
  B degrades (same atom), Worker A recovers → atom set to :normal while B is still
  degraded. Event ordering across different workers mapped to the same atom creates
  state loss.
- Registry stale PID after HealthMonitor crash: if subscription is PID-based (not
  PubSub topic), the new HM instance is deaf. Events go to the dead PID and are silently
  dropped. Compound detection is completely disabled for that user until subscription refresh.
- :rest_for_one cascade + coincidental independent issue: debounce is designed to
  filter cascade restarts, but a cascade can EXPOSE a pre-existing issue (SignalRisk
  restarts and finds its upstream stale). Is this "compound" or "cascade + bad luck"?
  This is a semantic ambiguity the design doesn't address.
- Compound cleared event without recovery debounce: :compound_degradation_cleared is
  emitted immediately when the last process recovers (no settling period), causing
  operator oscillation if recovery is transient.
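The first Opus finding is easiest to see as arithmetic on a timeline. The sketch below assumes process A degrades at t=0 and process B degrades before the first debounce window ends. The 10-second window and 15-second escalation timeout are made-up values, and the function is a hypothetical model rather than gargoyle's logic.

```python
def first_action(second_degraded_at, window, escalation_timeout, restart_on_event):
    """Return (mode, time) of whichever kill-switch path fires first.

    Process A degrades at t=0; process B degrades at `second_degraded_at`,
    which must fall inside the first debounce window.
    """
    assert 0 <= second_degraded_at < window
    escalation_at = escalation_timeout  # A's single-process escalation (FLATTEN)
    if restart_on_event:
        evaluation_at = second_degraded_at + window  # B's event restarts the timer
    else:
        evaluation_at = window                       # timer keeps its original deadline
    if evaluation_at < escalation_at:
        return ("HALT", evaluation_at)     # compound path wins
    return ("FLATTEN", escalation_at)      # single-process path wins


# Hypothetical numbers: 10s debounce window, 15s single-process escalation.
print(first_action(9, 10, 15, restart_on_event=False))
print(first_action(9, 10, 15, restart_on_event=True))
print(first_action(2, 10, 15, restart_on_event=True))
```

Under these assumed values, any second degradation arriving at t ≥ 5 pushes the restarted evaluation to t ≥ 15, so the single-process FLATTEN fires first even though two processes are degraded.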

**Claude Sonnet unique findings:**
- ETS table creation race at startup (HealthMonitor writes before table exists)
- Registry lookup failure during pipeline startup (events before HM registered)
- However, Sonnet also made analytical errors: it described "multiple HealthMonitor
  instances for the same user" scenarios despite the document clearly stating one
  instance per user via DynamicSupervisor. Several of its findings assumed
  multi-instance coordination that doesn't match the architecture.

**Quality assessment:**
- **GPT-5** was the most exhaustive and technically precise. Its cross-sender
  ordering finding (#2) is genuinely insightful: it identifies that the document's
  "rapid recovery" safety argument implicitly assumes events arrive in wall-clock
  order, which Erlang does NOT guarantee across different senders. The debounce
  starvation finding (#3) identifies a real operational hazard with practical
  consequences. All 12 findings reference specific mechanisms and describe specific
  interleavings clearly.
- **Claude Opus** found fewer race conditions but several were qualitatively
  superior. The timer-restart-defeats-compound-detection finding is the most
  architecturally significant race in the entire analysis: it shows that the
  debounce mechanism can work AGAINST the design's stated goals in specific
  (realistic) timing scenarios. The strategy-worker event ordering masking is
  also a genuine design flaw unique to the single-atom decision. Opus continues
  its pattern of reasoning about design TENSIONS rather than just failure modes.
- **Claude Sonnet** was notably weaker here than in previous experiments. It produced only
  1,404 output tokens vs 4,610 (Opus) and 10,587 (GPT-5). Several findings
  contained analytical errors (assuming multi-instance coordination that doesn't
  exist). It found only 7 races, and 2-3 of those were based on misreadings of
  the architecture. This is a significant regression from Finding #12, where
  Sonnet found 17 assumptions (85% of GPT-5's count).
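The strategy-worker masking flaw credited to Opus above can be reproduced in a few lines. This is an illustrative sketch only. Both tracker classes and the worker names are hypothetical, and the per-worker set is an assumed alternative for comparison, not something the reviewed design proposes.

```python
class SingleAtomTracker:
    """All strategy workers share one status value (the design under review)."""

    def __init__(self):
        self.status = "normal"

    def handle(self, worker_id, event):
        # The worker id is ignored: the last event overwrites the shared status.
        self.status = "degraded" if event == "degraded" else "normal"

    def is_degraded(self):
        return self.status == "degraded"


class PerWorkerTracker:
    """Hypothetical alternative: remember which workers are degraded."""

    def __init__(self):
        self.degraded = set()

    def handle(self, worker_id, event):
        if event == "degraded":
            self.degraded.add(worker_id)
        else:
            self.degraded.discard(worker_id)

    def is_degraded(self):
        return bool(self.degraded)


# Worker A degrades, worker B degrades, worker A recovers; B is still degraded.
EVENTS = [
    ("worker_a", "degraded"),
    ("worker_b", "degraded"),
    ("worker_a", "recovered"),
]


def replay(tracker):
    for worker_id, event in EVENTS:
        tracker.handle(worker_id, event)
    return tracker.is_degraded()


print(replay(SingleAtomTracker()))  # A's recovery masks B's degradation
print(replay(PerWorkerTracker()))   # B is still tracked as degraded
```

The single shared value reports the group as healthy after A recovers, which is the state loss the finding describes.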

**Key insight: concurrency reasoning is a different skill than assumption-finding**
In previous experiments (#10, #11, #12), Sonnet 4.6 performed well on
assumption-finding (a task that requires reasoning about what's NOT stated).
Here, on race condition identification (a task requiring reasoning about temporal
interleavings and message ordering semantics), Sonnet's performance drops significantly.
This suggests the task type matters more than we previously thought:

- **Assumption-finding:** Requires breadth of consideration ("what must be true
  for this to work?"). Sonnet handles this well: it's essentially pattern
  matching across possible failure dimensions.
- **Race condition identification:** Requires SEQUENTIAL reasoning about specific
  interleavings ("if A happens, then B happens, then C happens, what state is
  visible?"). This appears to benefit dramatically from extended reasoning tokens (GPT-5's
  8,192 reasoning tokens) or from Opus's internal reasoning depth.

The lesson: don't extrapolate model performance across task types. A model that's
85% as good at assumption-finding (Finding #12) may be only about half as good at
concurrency analysis (here, 7 races against GPT-5's 12, with 2-3 of the 7 based on
misreadings). The cognitive demands are different.

**Opus's distinguishing strength: finding design contradictions**
Opus's best finding (timer restart defeating compound detection) isn't just a
race condition. It identifies that the debounce mechanism can work against
the design's own stated goals. This is consistent with Opus's pattern in
previous findings: it finds tensions where one part of the design undermines
another part. For race condition analysis specifically, this manifests as
"here's where your safety mechanism becomes your vulnerability."

**Practical implication for architecture review:**
- For race condition analysis: use GPT-5 (exhaustive) + Opus (design-tension)
- Sonnet is NOT suitable for concurrency reasoning tasks. Use it for
  assumption-finding and structural review instead
- The three-model stack needs task-appropriate assignment:
  - Structural/assumption review: all three models contribute
  - Concurrency/race analysis: GPT-5 + Opus only
  - Bias detection: any model (per Finding #8)