refactor(findings): split ALL-FINDINGS.md into per-experiment files

Break the monolithic 3249-line findings file into 29 individual files, one per experiment. Each file is named YYYY-MM-DD-NN-slug.md for easy chronological sorting and discovery. No content changes — purely structural reorganization.
2026-05-06 07:15:50 -07:00
parent 1b108ff66e
commit 6af8a6ee10
32 changed files with 3232 additions and 3254 deletions
@@ -0,0 +1,152 @@
+# Finding 24: Design improvement proposals: GPT-5 excels at defense-in-depth thinking; Opus finds subtle design contradictions; Sonnet produces generic recommendations
+
+**Date:** 2026-05-05
+**Task:** Propose specific design improvements for gargoyle's `kill-switch.md` (185 lines)
+— the primary safety mechanism that prevents rogue orders. NEW task type: generative/
+creative ("what would you improve?") rather than purely analytical ("what's wrong?").
+**How we used them:** Same document (full text) + same focused prompt to all 3 models
+via HAI proxy. Prompt asked for 8-15 specific improvements with: weakness, proposed
+change (concrete), tradeoff, severity rating. Explicitly excluded generic advice
+("add more tests") and asked about runtime assumptions. No tools, no project context.
+
+| Model | Time | Output tokens | Reasoning tokens | Improvements proposed |
+|---|---|---|---|---|
+| GPT-5 | 118s | 8,710 | 6,016 | 15 |
+| Claude Opus 4.6 | 127s | 4,985 | (internal) | 15 |
+| Claude Sonnet 4.6 | 40s | 1,636 | (internal) | 12 |
+
+**What they found — common ground (all 3 identified):**
+- DB write failure blocking engagement (fail-open under DB outage) — all three
+  proposed in-memory-first engagement with async persistence
+- Kill switch process liveness monitoring (heartbeat/watchdog)
+- Broker connectivity loss during cancellation operations
+- ETS table ownership and crash-window vulnerability
+- Supervisor restart suppression as an unstated mechanism
+- Per-venue/per-broker scope extension
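
The in-memory-first engagement that all three models proposed can be sketched roughly as follows. This is a minimal Python illustration, not gargoyle's code (which runs on the BEAM and uses ETS); `KillSwitch`, `persist`, `flush`, and `persist_failures` are hypothetical names chosen for the example, and a real system would presumably drain the queue from a background worker rather than by an explicit call.

```python
import queue
import threading


class KillSwitch:
    """Hypothetical sketch: the gate closes in memory before any DB write."""

    def __init__(self, persist):
        self._engaged = threading.Event()  # in-memory gate state
        self._pending = queue.Queue()      # engagement records awaiting persistence
        self._persist = persist            # assumed callable that writes to the DB; may raise
        self.persist_failures = []         # kept so an operator can retry or alert

    def engage(self, reason):
        self._engaged.set()                # step 1: close the gate immediately
        self._pending.put(reason)          # step 2: queue persistence; engagement never waits on it

    def allows_orders(self):
        return not self._engaged.is_set()

    def flush(self):
        """Drain the persistence queue; a DB failure is recorded, not propagated."""
        while not self._pending.empty():
            reason = self._pending.get()
            try:
                self._persist(reason)
            except Exception as exc:
                self.persist_failures.append((reason, exc))
```

The point of the ordering is that a DB outage can delay the audit record but cannot leave the gate open.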
+
+**GPT-5 unique findings (raised by neither other model):**
+- **Infrastructure-level "hard kill"** — egress proxy or service mesh that blocks
+  broker traffic independently of the application. Belt-and-suspenders approach
+  where the kill switch works even if the entire BEAM VM is unresponsive. This
+  was GPT-5's highest-impact unique insight.
+- **Kill fence token (epoch)** — every order-carrying message includes an epoch;
+  stale-epoch messages are dropped at the gate. Elegantly solves in-flight
+  messages without needing drain timeouts.
+- **Cluster/multi-node propagation** — detailed leader election + epoch broadcast
+  + fail-closed on partition design.
+- **Post-engage broker verification** — query broker AFTER engaging to confirm no
+  orders slipped through during the engagement window.
+- **Liquidation exposure validation** — proving tagged liquidation orders actually
+  REDUCE exposure rather than trusting the tag.
+- **Recovery/cold-start order suppression** — ensuring reconciliation/recovery
+  routines can't submit orders while engaged.
+- **Engage latency reordering** — ETS first, terminate second, DB async.
+- **Audit log tamper evidence** — append-only external sink + hash chain.
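
To make the kill fence token idea concrete, here is a minimal Python sketch of an epoch-checked gate. It is an illustration under assumptions, not GPT-5's proposal verbatim and not gargoyle's implementation: `OrderMessage`, `OrderGate`, `stamp`, and `admit` are hypothetical names, and the choice to keep the bumped epoch after disengaging is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrderMessage:
    order_id: str
    epoch: int  # epoch observed when the message was created


class OrderGate:
    """Hypothetical gate sitting in front of the broker connection."""

    def __init__(self):
        self.epoch = 0
        self.engaged = False

    def stamp(self, order_id):
        # Producers stamp each order-carrying message with the current epoch.
        return OrderMessage(order_id, self.epoch)

    def engage(self):
        # Engaging bumps the epoch, which makes every in-flight message stale.
        self.epoch += 1
        self.engaged = True

    def disengage(self):
        # Assumption: the bumped epoch is kept, so pre-kill messages stay stale.
        self.engaged = False

    def admit(self, msg):
        # Drop everything while engaged, and anything stamped with an old epoch.
        return (not self.engaged) and msg.epoch == self.epoch
```

Because stale messages are rejected by comparison rather than by waiting, a message created before the kill is still dropped if it arrives after the switch is released.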
+
+**Claude Opus unique findings (raised by neither other model):**
+- **Ordering contradiction in engagement sequence** — identified that the
+  documented order (DB → ETS → terminate) creates a specific risk if a crash
+  occurs BETWEEN termination and ETS update (not just DB failure). The insight
+  is about the window where termination has started but the gate is still open.
+  More subtle than GPT-5's version (which focused on DB-blocking-engage).
+- **Concurrent engagement race (mode escalation)** — multiple triggers
+  simultaneously issuing conflicting modes (RESTRICT vs LIQUIDATE). Proposed
+  explicit escalation rules (LIQUIDATE always wins) with GenServer serialization.
+- **Shared resources under per-user scope** — per-user kill switch doesn't
+  address orders in shared broker connection buffers. Forces architectural
+  decision about connection pooling strategy.
+- **Clock/time integrity for audit log** — monotonic counters + NTP validation
+  for forensic reliability.
+- **Partial multi-user engagement failures** — what happens when global engage
+  successfully terminates 4/5 user pipelines but one has orphaned processes.
+- **Liquidation direction validation** — similar to GPT-5's exposure validation
+  but framed differently: checking that corrupted position records cannot
+  cause liquidation to OPEN positions rather than close them.
+- **Process termination verification** — checking that `:kill` signals actually
+  worked (defense against trap_exit, NIF blocking).
+- **Engagement latency SLA** — defining a 50ms target with monitoring/alerting.
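
The mode escalation rule (LIQUIDATE always wins) can be sketched in a few lines of Python. This is a hedged illustration only: `Mode`, `ModeState`, and the `NORMAL` state are hypothetical names, and the lock merely stands in for the GenServer serialization mentioned above.

```python
import threading
from enum import IntEnum


class Mode(IntEnum):
    # Ordering encodes severity. NORMAL is an assumed "not engaged" state.
    NORMAL = 0
    RESTRICT = 1
    LIQUIDATE = 2


class ModeState:
    """Hypothetical holder of the current kill-switch mode."""

    def __init__(self):
        self._lock = threading.Lock()  # stands in for GenServer serialization
        self.mode = Mode.NORMAL

    def engage(self, requested):
        # Requests are applied one at a time; the more severe mode always wins,
        # so a late RESTRICT can never downgrade an earlier LIQUIDATE.
        with self._lock:
            self.mode = max(self.mode, requested)
            return self.mode
```

With this rule the final mode does not depend on the order in which concurrent triggers arrive.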
+
+**Claude Sonnet findings (all also present in GPT-5 or Opus, differently framed):**
+- No genuinely unique improvements that GPT-5 or Opus didn't also identify.
+- Several were generic: "missing resource cleanup," "circuit breaker integration,"
+  "performance monitoring" — exactly the kind of advice the prompt tried to
+  exclude.
+- The "missing heartbeat" and "network partition handling" proposals were solid
+  but less detailed than the corresponding GPT-5/Opus versions.
+
+**Quality assessment:**
+- **GPT-5** produced the most ACTIONABLE improvements. Its proposals were
+  architecturally concrete ("add an egress proxy," "use kill epochs in messages,"
+  "query broker post-engage") and showed defense-in-depth thinking — multiple
+  independent layers rather than fixing one path. The infrastructure kill (#2)
+  is genuinely novel: no other model proposed going OUTSIDE the application
+  boundary for safety enforcement. GPT-5 consistently thought about "what if
+  this entire runtime is compromised?" rather than just fixing within-app paths.
+- **Claude Opus** produced equally numerous improvements (15) with characteristic
+  precision about failure SEQUENCES. Its unique strength: identifying design
+  contradictions rather than just gaps (the engagement ordering issue, concurrent
+  mode escalation, shared-resource scope mismatch). Opus's proposals were more
+  "fix the design tension" while GPT-5's were more "add another safety layer."
+  Opus also included the process termination verification and engagement latency
+  SLA — operational rigor that GPT-5 skipped.
+- **Claude Sonnet** produced 12 proposals in 40s (fast) but quality was notably
+  lower. Several proposals were generic software engineering advice that the
+  prompt explicitly excluded ("add performance monitoring," "resource cleanup").
+  No unique insights emerged. Sonnet's proposals lacked the architectural depth
+  of GPT-5 (no outside-the-application thinking) and the design-tension
+  identification of Opus.
+
+**Key insight — generative vs analytical tasks:**
+
+This is the first experiment testing a GENERATIVE task ("propose improvements")
+rather than a purely analytical one ("find problems"). The results reveal:
+
+1. **GPT-5's defense-in-depth thinking is unique.** In analytical tasks, GPT-5
+   finds exhaustive lists of issues. In generative tasks, it proposes LAYERED
+   solutions — multiple independent mechanisms that each catch what the others
+   miss. The infrastructure kill proposal (external to the application) shows
+   GPT-5 reasoning about failure modes that are invisible to within-app analysis.
+
+2. **Opus's design-tension identification transfers to improvement proposals.**
+   In analytical tasks, Opus finds where parts of a design contradict each other.
+   In generative tasks, this manifests as proposals that RESOLVE tensions rather
+   than just adding patches. The engagement ordering contradiction and mode
+   escalation rules are both "this design says X but the mechanism allows Y —
+   here's how to make them consistent."
+
+3. **Sonnet doesn't transfer well to generative tasks.** In analytical tasks
+   (assumption-finding, cross-component analysis), Sonnet performs well (85% of
+   GPT-5 in some experiments). In generative tasks, it falls back to generic
+   engineering advice. The task requires both identifying problems AND proposing
+   concrete solutions. Sonnet handles the first step but lacks depth on the
+   second.
+
+**Comparison to analytical task performance:**
+
+| Task type | GPT-5 character | Opus character | Sonnet character |
+|---|---|---|---|
+| Assumption-finding (#10-12) | Exhaustive breadth | Design tensions | Good (85% of GPT-5) |
+| Race conditions (#13) | Technical precision | Design contradictions | Weak (errors) |
+| Invariant violations (#20) | Maximum selectivity | Self-correcting depth | Imprecise |
+| **Design improvements (#24)** | **Defense-in-depth layers** | **Tension resolution** | **Generic advice** |
+
+The generative task reveals each model's reasoning STYLE more clearly than analytical tasks do.
+GPT-5's reasoning enables it to construct multi-layered solutions. Opus's internal
+reasoning enables it to identify what a design SHOULD be (not just what's wrong).
+Sonnet pattern-matches against known engineering practices without deep synthesis.
+
+**Practical implication:**
+
+For design improvement sessions on safety-critical systems:
+- Run GPT-5 for defense-in-depth proposals ("what layers should exist?")
+- Run Opus for design consistency proposals ("where does the design contradict itself?")
+- Skip Sonnet — its output is indistinguishable from generic checklists
+- The combination of GPT-5 + Opus produces complementary improvements: GPT-5 adds
+  safety layers, Opus fixes internal contradictions. Together they address both
+  "not enough protection" and "protection mechanisms that work against each other."
+
+**Cost analysis:**
+GPT-5: 118s, ~10.9K tokens (6K reasoning). Opus: 127s, ~5K tokens. Sonnet: 40s, ~1.6K tokens.
+For a safety-critical design review, running GPT-5 + Opus costs ~16K tokens and produces
+30 proposals (15 per model) whose unique insights barely overlap. That is excellent ROI
+for a kill switch design that protects real money.