refactor(findings): split ALL-FINDINGS.md into per-experiment files

Break the monolithic 3249-line findings file into 29 individual files, one per experiment. Each file is named YYYY-MM-DD-NN-slug.md for easy chronological sorting and discovery. No content changes — purely structural reorganization.
2026-05-06 07:15:50 -07:00
parent 1b108ff66e
commit 6af8a6ee10
32 changed files with 3232 additions and 3254 deletions
@@ -0,0 +1,126 @@
+# Finding 13: Race condition identification: Opus excels at subtle temporal interactions; GPT-5 is exhaustive; Sonnet struggles with concurrency reasoning
+
+**Date:** 2026-05-03
+**Task:** Identify race conditions, timing-dependent bugs, and ordering hazards in
+gargoyle's `concurrent-failure-detection.md` (241 lines) — a document specifically
+about concurrent detection logic with timers, ETS state, and multi-process events.
+**How we used them:** Same document (full text) + same focused analytical question
+to all 3 models via HAI proxy. Prompt specifically asked for event ordering problems,
+timer interaction bugs, state visibility gaps, crash/restart timing, and multi-instance
+coordination. Required each finding to reference specific mechanisms in the document
+with specific interleaving descriptions. No tools, no project context beyond the
+document itself.
+
+| Model | Time | Output tokens | Reasoning tokens | Race conditions found |
+|---|---|---|---|---|
+| GPT-5 | 116s | 10,587 | 8,192 | 12 |
+| Claude Opus 4.6 | ~105s | 4,610 | (internal) | 10 |
+| Claude Sonnet 4.6 | ~39s | 1,404 | (internal) | 7 |
+
+**What they found — common ground (all 3 identified):**
+- Stale timer messages in mailbox after cancellation (classic Erlang timer race)
+- HealthMonitor crash losing compound detection state (init from :unknown, no replay)
+- ETS vs GenServer state divergence visible to dashboard
+- Kill switch mode conflict (FLATTEN from single-process vs HALT from compound path)
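
The stale-timer race in the first bullet can be illustrated with a toy mailbox model. This is a minimal Python sketch, not the gargoyle implementation: the `Monitor` class, `check_ref` flag, and integer timer refs are all hypothetical names introduced for illustration. It assumes the expiry message is already queued when the owner cancels the timer, and shows one common guard: tag each timer with a ref and ignore expiries whose ref is no longer current.

```python
from collections import deque
from itertools import count

_refs = count(1)  # hypothetical stand-in for unique timer references


class Monitor:
    """Toy single-mailbox process (illustrative only)."""

    def __init__(self, check_ref):
        self.mailbox = deque()
        self.check_ref = check_ref  # True = guard against stale expiries
        self.timer_ref = None
        self.evaluations = 0

    def start_timer(self):
        self.timer_ref = next(_refs)
        return self.timer_ref

    def fire(self, ref):
        # The runtime enqueues the expiry; it may already be in the mailbox
        # by the time the owner decides to cancel.
        self.mailbox.append(("timeout", ref))

    def cancel_timer(self):
        # Cancelling forgets the ref but cannot remove a queued message.
        self.timer_ref = None

    def drain(self):
        while self.mailbox:
            tag, ref = self.mailbox.popleft()
            if tag != "timeout":
                continue
            if self.check_ref and ref != self.timer_ref:
                continue  # stale expiry from a cancelled timer: ignore
            self.evaluations += 1
            self.timer_ref = None


def run(check_ref):
    m = Monitor(check_ref)
    ref = m.start_timer()
    m.fire(ref)       # expiry lands in the mailbox...
    m.cancel_timer()  # ...just before the cancellation takes effect
    m.drain()
    return m.evaluations
```

Under these assumptions `run(False)` performs one spurious evaluation and `run(True)` performs none, while a timer that was never cancelled still triggers its evaluation in the guarded version.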
+
+**GPT-5 unique findings (not found by either Claude model):**
+- Cross-sender message ordering: recovery events from pipeline processes vs timer
+  expiry from runtime (Erlang preserves per-sender order, NOT cross-sender) — the
+  "rapid recovery" safety argument in the doc relies on state being updated before
+  timer fires, which isn't guaranteed
+- Debounce starvation: a flapping component repeatedly restarts the timer, so
+  compound evaluation is postponed indefinitely while ≥2 processes are genuinely degraded
+- State regression: {:degraded} arriving after {:escalated, :kill_switch} with no
+  guard in the event table — state machine allows regressing from :halted to :degraded
+- Cold-start window: the application boots with already-degraded processes that
+  won't re-emit events, so compound detection never fires
+- Catch-all handle_info could accidentally swallow timer messages if pattern matching
+  is ordered wrong (implementation pitfall of the described approach)
+- Debounce window growing beyond calibrated bounds from repeated timer restarts
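
The debounce starvation and window-growth bullets describe the same mechanism: a restart-on-event timer whose deadline keeps moving. The sketch below is a hedged Python model of that behaviour, not code from the reviewed document. The function name, the integer timestamps, and in particular the `max_delay` hard cap are assumptions added for illustration; the cap is one possible mitigation, not something the document proposes.

```python
def evaluation_time(event_times, window, max_delay=None):
    """Return when a restart-on-event debounce fires, or None if no events.

    max_delay (hypothetical mitigation) bounds the deadline relative to the
    first event so that flapping cannot postpone evaluation forever.
    """
    if not event_times:
        return None
    events = sorted(event_times)
    first = events[0]
    hard_cap = None if max_delay is None else first + max_delay

    def clip(deadline):
        return deadline if hard_cap is None else min(deadline, hard_cap)

    deadline = clip(first + window)
    for t in events[1:]:
        if t >= deadline:
            break  # the timer already fired before this event arrived
        deadline = clip(t + window)  # each event restarts the full window
    return deadline


# A component flapping every 4 time units against a 5-unit window.
flaps = list(range(0, 100, 4))
uncapped = evaluation_time(flaps, window=5)               # 101
capped = evaluation_time(flaps, window=5, max_delay=15)   # 15
```

With these made-up numbers the uncapped debounce does not evaluate until the flapping stops, while the capped variant evaluates 15 units after the first event regardless of later restarts.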
+
+**Claude Opus unique findings (not found by either other model):**
+- Timer restart pushing evaluation PAST single-process escalation timeout: the
+  debounce mechanism can DEFEAT compound detection when the second degradation
+  arrives near the end of the first window (the timer resets to a full window,
+  and the first process escalates via the single-process path before the new
+  window fires). The system then gets FLATTEN instead of HALT, which is exactly
+  what compound detection was supposed to prevent.
+- Strategy worker single-atom masking via event ordering: Worker A degrades, Worker
+  B degrades (same atom), Worker A recovers → atom set to :normal while B is still
+  degraded. Event ordering across different workers mapped to same atom creates
+  state loss.
+- Registry stale PID after HealthMonitor crash: if subscription is PID-based (not
+  PubSub topic), new HM instance is deaf — events go to dead PID, silently dropped.
+  Compound detection completely disabled for that user until subscription refresh.
+- :rest_for_one cascade + coincidental independent issue: debounce designed to
+  filter cascade restarts, but cascade can EXPOSE a pre-existing issue (SignalRisk
+  restarts and finds its upstream stale). Is this "compound" or "cascade + bad luck"?
+  Semantic ambiguity the design doesn't address.
+- Compound cleared event without recovery debounce: :compound_degradation_cleared
+  emitted immediately when last process recovers (no settling period), causing
+  operator oscillation if recovery is transient.
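
The single-atom masking bullet is easy to reproduce in a few lines. The following Python sketch uses hypothetical worker names and status strings; it contrasts a single shared status slot, where the last event wins, with per-worker tracking. The per-worker variant is only one possible fix, assumed here for comparison, and is not taken from the reviewed document.

```python
def atom_status(events):
    """Single shared status slot: whichever event arrives last wins."""
    status = "normal"
    for _worker, state in events:
        status = state
    return status


def per_worker_status(events):
    """Track workers separately; degraded while any worker is degraded."""
    degraded = set()
    for worker, state in events:
        if state == "degraded":
            degraded.add(worker)
        else:
            degraded.discard(worker)
    return "degraded" if degraded else "normal"


# Worker A degrades, Worker B degrades, Worker A recovers.
events = [
    ("worker_a", "degraded"),
    ("worker_b", "degraded"),
    ("worker_a", "normal"),
]
masked = atom_status(events)         # "normal": B's degradation is lost
tracked = per_worker_status(events)  # "degraded": B is still counted
```

The interleaving matters only because two workers write to the same slot; with a single worker both functions agree.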
+
+**Claude Sonnet unique findings:**
+- ETS table creation race at startup (HealthMonitor writes before table exists)
+- Registry lookup failure during pipeline startup (events before HM registered)
+- However, Sonnet also made analytical errors: it described "multiple HealthMonitor
+  instances for the same user" scenarios despite the document clearly stating one
+  instance per user via DynamicSupervisor. Several of its findings assumed
+  multi-instance coordination that doesn't match the architecture.
+
+**Quality assessment:**
+- **GPT-5** was the most exhaustive and technically precise. Its cross-sender
+  ordering finding (#2) is genuinely insightful — it identifies that the document's
+  "rapid recovery" safety argument implicitly assumes events arrive in wall-clock
+  order, which Erlang does NOT guarantee across different senders. The debounce
+  starvation finding (#3) identifies a real operational hazard with practical
+  consequences. All 12 findings reference specific mechanisms and describe specific
+  interleavings clearly.
+- **Claude Opus** found fewer race conditions but several were qualitatively
+  superior. The timer-restart-defeats-compound-detection finding is the most
+  architecturally significant race in the entire analysis — it shows that the
+  debounce mechanism can work AGAINST the design's stated goals in specific
+  (realistic) timing scenarios. The strategy-worker event ordering masking is
+  also a genuine design flaw unique to the single-atom decision. Opus continues
+  its pattern of reasoning about design TENSIONS rather than just failure modes.
+- **Claude Sonnet** was notably weaker here than in previous experiments. It
+  produced only 1,404 output tokens vs 4,610 (Opus) and 10,587 (GPT-5), and
+  several findings contained analytical errors (assuming multi-instance
+  coordination that doesn't exist). It found only 7 races, and 2-3 of those
+  were based on misreadings of the architecture. This is a significant
+  regression from Finding #12, where Sonnet found 17 assumptions (85% of
+  GPT-5's count).
+
+**Key insight — concurrency reasoning is a different skill than assumption-finding:**
+In previous experiments (#10, #11, #12), Sonnet 4.6 performed well on
+assumption-finding (a task that requires reasoning about what's NOT stated).
+Here, on race condition identification (a task requiring reasoning about temporal
+interleavings and message ordering semantics), Sonnet drops significantly. This
+suggests the task type matters more than we previously thought:
+
+- **Assumption-finding:** Requires breadth of consideration ("what must be true
+  for this to work?"). Sonnet handles this well — it's essentially pattern
+  matching across possible failure dimensions.
+- **Race condition identification:** Requires SEQUENTIAL reasoning about specific
+  interleavings ("if A happens, then B happens, then C happens, what state is
+  visible?"). On this evidence it appears to benefit from extended reasoning:
+  GPT-5 used 8,192 reasoning tokens, and Opus reasons internally (token count
+  not reported).
+
+The lesson: don't extrapolate model performance across task types. A model that
+reached 85% of GPT-5's count on assumption-finding reached 58% here (7 of 12
+races), and less once the 2-3 findings based on misreadings are discounted.
+The cognitive demands are different.
+
+**Opus's distinguishing strength — finding design contradictions:**
+Opus's best finding (timer restart defeating compound detection) isn't just a
+race condition — it's identifying that the debounce mechanism can work against
+the design's own stated goals. This is consistent with Opus's pattern in
+previous findings: it finds tensions where one part of the design undermines
+another part. For race condition analysis specifically, this manifests as
+"here's where your safety mechanism becomes your vulnerability."
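
The timing behind that finding can be worked through with a small timeline model. This is a Python sketch under stated assumptions: the first degradation occurs at t=0, the compound path yields HALT and the single-process path yields FLATTEN (as described in the findings above), and the window and timeout values are invented for illustration. The `restart_on_event` flag is hypothetical and exists only to compare the described restart behaviour with a fixed window.

```python
def outcome(second_at, window, single_timeout, restart_on_event=True):
    """Return (mode, time) for whichever escalation path acts first.

    The first process degrades at t=0; the second degrades at `second_at`.
    """
    if restart_on_event or second_at >= window:
        # Either the second event restarts the full window, or the first
        # window already expired with one degraded process and a new
        # window starts at the second event.
        compound_at = second_at + window
    else:
        # Fixed window anchored at the first degradation.
        compound_at = window
    single_at = single_timeout  # single-process escalation for process one
    if compound_at <= single_at:  # ties go to the compound path (arbitrary)
        return ("HALT", compound_at)
    return ("FLATTEN", single_at)


# Hypothetical numbers: 10-unit debounce window, 15-unit single timeout.
late_restart = outcome(second_at=9, window=10, single_timeout=15)
late_fixed = outcome(second_at=9, window=10, single_timeout=15,
                     restart_on_event=False)
early_restart = outcome(second_at=2, window=10, single_timeout=15)
```

With these numbers a second degradation at t=9 pushes compound evaluation to t=19, so the single-process path fires first at t=15 and the result is FLATTEN. A fixed window would evaluate at t=10 and produce HALT, and an early second degradation at t=2 still produces HALT at t=12 even with restarts.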
+
+**Practical implication for architecture review:**
+- For race condition analysis: use GPT-5 (exhaustive) + Opus (design-tension)
+- Sonnet is NOT suitable for concurrency reasoning tasks — use it for
+  assumption-finding and structural review instead
+- The three-model stack needs task-appropriate assignment:
+  - Structural/assumption review: all three models contribute
+  - Concurrency/race analysis: GPT-5 + Opus only
+  - Bias detection: any model (per Finding #8)