refactor(findings): split ALL-FINDINGS.md into per-experiment files
Break the monolithic 3249-line findings file into 29 individual files, one per experiment. Each file is named YYYY-MM-DD-NN-slug.md for easy chronological sorting and discovery. No content changes — purely structural reorganization.
@@ -0,0 +1,152 @@
# Finding 24: Design improvement proposals: GPT-5 excels at defense-in-depth thinking; Opus finds subtle design contradictions; Sonnet produces generic recommendations

**Date:** 2026-05-05
**Task:** Propose specific design improvements for gargoyle's `kill-switch.md` (185 lines),
the primary safety mechanism that prevents rogue orders. NEW task type: generative/
creative ("what would you improve?") rather than purely analytical ("what's wrong?").
**How we used them:** Same document (full text) + same focused prompt to all 3 models
via HAI proxy. Prompt asked for 8-15 specific improvements with: weakness, proposed
change (concrete), tradeoff, severity rating. Explicitly excluded generic advice
("add more tests") and asked about runtime assumptions. No tools, no project context.

| Model | Time | Output tokens | Reasoning tokens | Improvements proposed |
|---|---|---|---|---|
| GPT-5 | 118s | 8,710 | 6,016 | 15 |
| Claude Opus 4.6 | 127s | 4,985 | (internal) | 15 |
| Claude Sonnet 4.6 | 40s | 1,636 | (internal) | 12 |

**What they found: common ground (all 3 identified):**
- DB write failure blocking engagement (fail-open under DB outage); all three
  proposed in-memory-first engagement with async persistence
- Kill switch process liveness monitoring (heartbeat/watchdog)
- Broker connectivity loss during cancellation operations
- ETS table ownership and crash-window vulnerability
- Supervisor restart suppression as unstated mechanism
- Per-venue/per-broker scope extension
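The in-memory-first engagement that all three models proposed can be sketched roughly as follows. This is an illustrative Python sketch, not gargoyle's implementation (which runs on the BEAM); the `KillSwitch` class, its `persist` callable, and `failing_db` are hypothetical names introduced here. The point it shows is that the gate closes before any persistence is attempted, so a database outage cannot leave the switch open.

```python
import queue
import threading


class KillSwitch:
    """Hypothetical gate: flip an in-memory flag first, persist in the background."""

    def __init__(self, persist):
        self._engaged = threading.Event()  # in-memory gate, checked on every order
        self._pending = queue.Queue()      # engagement records awaiting persistence
        self._persist = persist            # assumed callable; may raise during a DB outage
        self.persist_failures = 0
        threading.Thread(target=self._writer, daemon=True).start()

    def engage(self, reason):
        self._engaged.set()                # gate closes before any I/O happens
        self._pending.put(reason)          # persistence is queued, never awaited

    def allows_orders(self):
        return not self._engaged.is_set()

    def _writer(self):
        while True:
            reason = self._pending.get()
            try:
                self._persist(reason)
            except Exception:
                self.persist_failures += 1  # a real system would retry and alert here
            finally:
                self._pending.task_done()

    def flush(self):
        self._pending.join()               # helper for the example: wait for queued writes


def failing_db(reason):
    # Simulates the DB outage scenario from the finding.
    raise ConnectionError("database unavailable")


switch = KillSwitch(persist=failing_db)
switch.engage("manual trigger")
switch.flush()
print(switch.allows_orders(), switch.persist_failures)
```

The tradeoff is that the engaged state may not survive a restart until a retried write succeeds, so a sketch like this would still need a fail-closed default on startup.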

**GPT-5 unique findings (not in either other model):**
- **Infrastructure-level "hard kill"**: egress proxy or service mesh that blocks
  broker traffic independently of the application. Belt-and-suspenders approach
  where the kill switch works even if the entire BEAM VM is unresponsive. This
  was GPT-5's highest-impact unique insight.
- **Kill fence token (epoch)**: every order-carrying message includes an epoch;
  stale-epoch messages are dropped at the gate. Elegantly solves in-flight
  messages without needing drain timeouts.
- **Cluster/multi-node propagation**: detailed leader election + epoch broadcast
  + fail-closed on partition design.
- **Post-engage broker verification**: query broker AFTER engaging to confirm no
  orders slipped through during the engagement window.
- **Liquidation exposure validation**: proving tagged liquidation orders actually
  REDUCE exposure rather than trusting the tag.
- **Recovery/cold-start order suppression**: ensuring reconciliation/recovery
  routines can't submit orders while engaged.
- **Engage latency reordering**: ETS first, terminate second, DB async.
- **Audit log tamper evidence**: append-only external sink + hash chain.
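The kill fence token from the list above can be illustrated with a small sketch. It is written in Python for brevity and is an assumption about how such a fence might look, not GPT-5's actual proposal text or gargoyle's code; `OrderMessage`, `OrderGate`, and their methods are hypothetical names. Each message carries the epoch it was created under, and each engagement bumps the epoch, so messages that were in flight before the kill stay rejected even after the switch is released.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrderMessage:
    order_id: str
    epoch: int  # kill epoch observed when the message was created


class OrderGate:
    """Hypothetical gate that drops messages stamped with a stale epoch."""

    def __init__(self):
        self.epoch = 0
        self.engaged = False

    def stamp(self, order_id):
        return OrderMessage(order_id, self.epoch)

    def engage(self):
        self.engaged = True
        self.epoch += 1  # invalidates every message stamped before this point

    def disengage(self):
        self.engaged = False

    def admit(self, msg):
        # Closed while engaged; afterwards only current-epoch messages pass.
        return (not self.engaged) and msg.epoch == self.epoch


gate = OrderGate()
in_flight = gate.stamp("A")    # created before the kill
gate.engage()
blocked_during = gate.admit(gate.stamp("B"))
gate.disengage()
fresh = gate.stamp("C")        # created after the kill was released
print(blocked_during, gate.admit(in_flight), gate.admit(fresh))
```

Because staleness is decided by comparing integers, no drain timeout is needed: a late-arriving pre-kill message is rejected whenever it shows up.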

**Claude Opus unique findings (not in either other model):**
- **Ordering contradiction in engagement sequence**: identified that the
  documented order (DB → ETS → terminate) creates a specific risk if a crash
  occurs BETWEEN termination and ETS update (not just DB failure). The insight
  is about the window where termination has started but the gate is still open.
  More subtle than GPT-5's version (which focused on DB-blocking-engage).
- **Concurrent engagement race (mode escalation)**: multiple triggers
  simultaneously issuing conflicting modes (RESTRICT vs LIQUIDATE). Proposed
  explicit escalation rules (LIQUIDATE always wins) with GenServer serialization.
- **Shared resources under per-user scope**: per-user kill switch doesn't
  address orders in shared broker connection buffers. Forces an architectural
  decision about connection pooling strategy.
- **Clock/time integrity for audit log**: monotonic counters + NTP validation
  for forensic reliability.
- **Partial multi-user engagement failures**: what happens when a global engage
  successfully terminates 4/5 user pipelines but one has orphaned processes.
- **Liquidation direction validation**: similar to GPT-5's exposure validation
  but framed differently: checking whether corrupted position records could cause
  liquidation to OPEN positions rather than close them.
- **Process termination verification**: checking that `:kill` signals actually
  worked (defense against trap_exit, NIF blocking).
- **Engagement latency SLA**: defining a 50ms target with monitoring/alerting.
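Both GPT-5's exposure validation and Opus's direction validation come down to checking an order against the position rather than trusting its liquidation tag. A minimal sketch of such a check, in Python, under stated assumptions: quantities are signed (positive for long/buy, negative for short/sell), the position figure comes from a source the check trusts, and `reduces_exposure` is a hypothetical helper, not a function from gargoyle.

```python
def reduces_exposure(position, order_qty):
    """Return True only if the signed order shrinks |position| without flipping its side.

    Assumed convention: positive = long / buy, negative = short / sell.
    """
    if position == 0 or order_qty == 0:
        return False  # nothing to liquidate, or an empty order
    opposite_side = (position > 0) != (order_qty > 0)
    return opposite_side and abs(order_qty) <= abs(position)


checks = {
    "sell 40 against long 100": reduces_exposure(100, -40),
    "buy 40 against long 100": reduces_exposure(100, 40),
    "sell 150 against long 100": reduces_exposure(100, -150),
    "buy 20 against short 50": reduces_exposure(-50, 20),
    "sell 10 against flat": reduces_exposure(0, -10),
}
for label, ok in checks.items():
    print(label, ok)
```

The third case is the one Opus's framing targets: an oversized "liquidation" closes the position and then opens a new one on the other side, which a tag-only check would let through.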

**Claude Sonnet findings (all also present in GPT-5 or Opus, differently framed):**
- No genuinely unique improvements that GPT-5 or Opus didn't also identify.
- Several were generic: "missing resource cleanup," "circuit breaker integration,"
  "performance monitoring," exactly the kind of advice the prompt tried to
  exclude.
- The "missing heartbeat" and "network partition handling" proposals were solid
  but less detailed than the corresponding GPT-5/Opus versions.

**Quality assessment:**
- **GPT-5** produced the most ACTIONABLE improvements. Its proposals were
  architecturally concrete ("add an egress proxy," "use kill epochs in messages,"
  "query broker post-engage") and showed defense-in-depth thinking: multiple
  independent layers rather than fixing one path. The infrastructure kill (#2)
  is genuinely novel: no other model proposed going OUTSIDE the application
  boundary for safety enforcement. GPT-5 consistently thought about "what if
  this entire runtime is compromised?" rather than just fixing within-app paths.
- **Claude Opus** produced equally numerous improvements (15) with characteristic
  precision about failure SEQUENCES. Its unique strength: identifying design
  contradictions rather than just gaps (the engagement ordering issue, concurrent
  mode escalation, shared-resource scope mismatch). Opus's proposals were more
  "fix the design tension" while GPT-5's were more "add another safety layer."
  Opus also included the process termination verification and engagement latency
  SLA, operational rigor that GPT-5 skipped.
- **Claude Sonnet** produced 12 proposals in 40s (fast) but quality was notably
  lower. Several proposals were generic software engineering advice that the
  prompt explicitly excluded ("add performance monitoring," "resource cleanup").
  No unique insights emerged. Sonnet's proposals lacked the architectural depth
  of GPT-5 (no outside-the-application thinking) and the design-tension
  identification of Opus.

**Key insight: generative vs analytical tasks:**

This is the first experiment testing a GENERATIVE task ("propose improvements")
rather than a purely analytical one ("find problems"). The results reveal:

1. **GPT-5's defense-in-depth thinking is unique.** In analytical tasks, GPT-5
   finds exhaustive lists of issues. In generative tasks, it proposes LAYERED
   solutions: multiple independent mechanisms that each catch what the others
   miss. The infrastructure kill proposal (external to the application) shows
   GPT-5 reasoning about failure modes that are invisible to within-app analysis.

2. **Opus's design-tension identification transfers to improvement proposals.**
   In analytical tasks, Opus finds where parts of a design contradict each other.
   In generative tasks, this manifests as proposals that RESOLVE tensions rather
   than just adding patches. The engagement ordering contradiction and mode
   escalation rules are both "this design says X but the mechanism allows Y;
   here's how to make them consistent."

3. **Sonnet doesn't transfer well to generative tasks.** In analytical tasks
   (assumption-finding, cross-component analysis), Sonnet performs well (85% of
   GPT-5 in some experiments). In generative tasks, it falls back to generic
   engineering advice. The task requires both identifying problems AND proposing
   concrete solutions; Sonnet handles the first step but not the second with
   sufficient depth.
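The mode escalation rule mentioned in item 2 ("LIQUIDATE always wins") is small enough to sketch. The Python below is an illustrative assumption, not Opus's proposal verbatim: `Mode`, `ModeState`, and the `OFF` member are hypothetical, and a lock stands in for the serialization a GenServer would provide on the BEAM.

```python
import threading
from enum import IntEnum


class Mode(IntEnum):
    OFF = 0        # hypothetical "not engaged" state
    RESTRICT = 1
    LIQUIDATE = 2  # highest severity, so it wins any conflict


class ModeState:
    """Serializes engagement requests and never downgrades the active mode."""

    def __init__(self):
        self._lock = threading.Lock()  # stand-in for GenServer serialization
        self.mode = Mode.OFF

    def request(self, requested):
        with self._lock:
            # Escalation rule: keep whichever mode is more severe.
            self.mode = max(self.mode, requested)
            return self.mode


state = ModeState()
state.request(Mode.LIQUIDATE)
after_conflict = state.request(Mode.RESTRICT)  # a later, weaker trigger
print(after_conflict.name)
```

With the rule expressed as a maximum over an ordered enum, the outcome no longer depends on which trigger arrives first, which is the consistency property the finding describes.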

**Comparison to analytical task performance:**

| Task type | GPT-5 character | Opus character | Sonnet character |
|---|---|---|---|
| Assumption-finding (#10-12) | Exhaustive breadth | Design tensions | Good (85% of GPT-5) |
| Race conditions (#13) | Technical precision | Design contradictions | Weak (errors) |
| Invariant violations (#20) | Maximum selectivity | Self-correcting depth | Imprecise |
| **Design improvements (#24)** | **Defense-in-depth layers** | **Tension resolution** | **Generic advice** |

The generative task exposes differences in how the models REASON more clearly than
analytical tasks do. GPT-5's reasoning enables it to construct multi-layered solutions.
Opus's internal reasoning enables it to identify what a design SHOULD be (not just
what's wrong). Sonnet pattern-matches against known engineering practices without
deep synthesis.

**Practical implication:**

For design improvement sessions on safety-critical systems:
- Run GPT-5 for defense-in-depth proposals ("what layers should exist?")
- Run Opus for design consistency proposals ("where does the design contradict itself?")
- Skip Sonnet: its output is indistinguishable from generic checklists
- The combination of GPT-5 + Opus produces complementary improvements: GPT-5 adds
  safety layers, Opus fixes internal contradictions. Together they address both
  "not enough protection" and "protection mechanisms that work against each other."

**Cost analysis:**
GPT-5: 118s, ~10.9K tokens (6K reasoning). Opus: 127s, ~5K tokens. Sonnet: 40s, ~1.6K tokens.
For a safety-critical design review, running GPT-5 + Opus costs ~16K tokens and produces
30 proposals (15 each) whose unique insights barely overlap. Excellent ROI for a kill switch
design that protects real money.
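The combined figure can be checked with a few lines of arithmetic. The sketch below uses only the rounded token counts quoted in this cost paragraph and the proposal counts from the results table; the variable names are illustrative, and the per-proposal number is a derived convenience, not a measurement from the experiment.

```python
# Rounded figures as quoted in the cost paragraph (approximate).
gpt5_tokens = 10_900
opus_tokens = 5_000

# Proposal counts from the results table.
proposals = 15 + 15

combined_tokens = gpt5_tokens + opus_tokens
tokens_per_proposal = combined_tokens / proposals

print(combined_tokens, proposals, tokens_per_proposal)
```

On those rounded inputs the pair comes to 15,900 tokens, which is the "~16K" quoted above, or about 530 tokens per proposal.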