# Finding 24: Design improvement proposals: GPT-5 excels at defense-in-depth thinking; Opus finds subtle design contradictions; Sonnet produces generic recommendations

**Date:** 2026-05-05
**Task:** Propose specific design improvements for gargoyle's `kill-switch.md` (185 lines),
the primary safety mechanism that prevents rogue orders. NEW task type: generative/creative
("what would you improve?") rather than purely analytical ("what's wrong?").
**How we used them:** Same document (full text) + same focused prompt to all 3 models
via HAI proxy. Prompt asked for 8-15 specific improvements with: weakness, proposed
change (concrete), tradeoff, severity rating. Explicitly excluded generic advice
("add more tests") and asked about runtime assumptions. No tools, no project context.

| Model | Time | Output tokens | Reasoning tokens | Improvements proposed |
|---|---|---|---|---|
| GPT-5 | 118s | 8,710 | 6,016 | 15 |
| Claude Opus 4.6 | 127s | 4,985 | (internal) | 15 |
| Claude Sonnet 4.6 | 40s | 1,636 | (internal) | 12 |

**What they found: common ground (all 3 identified):**
- DB write failure blocking engagement (fail-open under DB outage): all three
  proposed in-memory-first engagement with async persistence
- Kill switch process liveness monitoring (heartbeat/watchdog)
- Broker connectivity loss during cancellation operations
- ETS table ownership and crash-window vulnerability
- Supervisor restart suppression as unstated mechanism
- Per-venue/per-broker scope extension

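The in-memory-first engagement that all three models proposed can be sketched as follows. This is a minimal illustration in Python, not gargoyle's implementation: a thread and a queue stand in for whatever BEAM process would own persistence, and `persist`, `KillSwitch`, and `flush` are hypothetical names introduced here.

```python
import queue
import threading


class KillSwitch:
    """Sketch: the gate closes in memory before any database I/O is attempted."""

    def __init__(self, persist):
        # `persist` is a hypothetical callable that writes the engagement to
        # durable storage. It may raise when the database is unavailable.
        self._persist = persist
        self._engaged = threading.Event()
        self._pending = queue.Queue()
        self.persist_failures = 0
        threading.Thread(target=self._drain, daemon=True).start()

    def engage(self, reason):
        # Step 1: close the gate in memory. A DB outage cannot block this.
        self._engaged.set()
        # Step 2: hand persistence to the background worker and return.
        self._pending.put(reason)

    def allows_orders(self):
        return not self._engaged.is_set()

    def _drain(self):
        while True:
            reason = self._pending.get()
            try:
                self._persist(reason)
            except Exception:
                # A failed write is counted (a real system would retry and
                # alert), but it never reopens the gate.
                self.persist_failures += 1
            finally:
                self._pending.task_done()

    def flush(self):
        """Block until every queued persistence attempt has been processed."""
        self._pending.join()
```

With this ordering a database outage degrades durability of the record, not the engagement itself, which is the fail-closed behavior the proposals aim for.
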
**GPT-5 unique findings (in neither other model):**
- **Infrastructure-level "hard kill":** egress proxy or service mesh that blocks
  broker traffic independently of the application. Belt-and-suspenders approach
  where the kill switch works even if the entire BEAM VM is unresponsive. This
  was GPT-5's highest-impact unique insight.
- **Kill fence token (epoch):** every order-carrying message includes an epoch;
  stale-epoch messages are dropped at the gate. Elegantly solves in-flight
  messages without needing drain timeouts.
- **Cluster/multi-node propagation:** detailed leader election + epoch broadcast
  + fail-closed on partition design.
- **Post-engage broker verification:** query broker AFTER engaging to confirm no
  orders slipped through during the engagement window.
- **Liquidation exposure validation:** proving tagged liquidation orders actually
  REDUCE exposure rather than trusting the tag.
- **Recovery/cold-start order suppression:** ensuring reconciliation/recovery
  routines can't submit orders while engaged.
- **Engage latency reordering:** ETS first, terminate second, DB async.
- **Audit log tamper evidence:** append-only external sink + hash chain.

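The kill fence token is easiest to see in a small example. The sketch below is an assumption-laden illustration in Python rather than anything taken from `kill-switch.md`: `OrderMessage`, `OrderGate`, `stamp`, and `admit` are hypothetical names, and the gate is single-process, so it ignores the cluster propagation that GPT-5 also proposed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrderMessage:
    order_id: str
    epoch: int  # kill epoch observed when the message was created


class OrderGate:
    """Sketch: drop any message stamped before the most recent engagement."""

    def __init__(self):
        self.epoch = 0
        self.engaged = False

    def stamp(self, order_id):
        # Producers stamp each message with the epoch that is current now.
        return OrderMessage(order_id, self.epoch)

    def engage(self):
        # Bumping the epoch invalidates every message already in flight.
        self.engaged = True
        self.epoch += 1

    def disengage(self):
        # The epoch is deliberately not reset: messages created before the
        # kill stay stale even after trading resumes.
        self.engaged = False

    def admit(self, msg):
        if self.engaged:
            return False
        return msg.epoch == self.epoch
```

Because staleness is decided per message at the gate, nothing has to wait for queues to drain: an order created before the kill is rejected whenever it finally arrives, including after the switch is disengaged.
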
**Claude Opus unique findings (in neither other model):**
- **Ordering contradiction in engagement sequence:** identified that the
  documented order (DB → ETS → terminate) creates a specific risk if a crash
  occurs BETWEEN termination and ETS update (not just DB failure). The insight
  is about the window where termination has started but the gate is still open.
  More subtle than GPT-5's version (which focused on DB-blocking-engage).
- **Concurrent engagement race (mode escalation):** multiple triggers
  simultaneously issuing conflicting modes (RESTRICT vs LIQUIDATE). Proposed
  explicit escalation rules (LIQUIDATE always wins) with GenServer serialization.
- **Shared resources under per-user scope:** per-user kill switch doesn't
  address orders in shared broker connection buffers. Forces an architectural
  decision about connection pooling strategy.
- **Clock/time integrity for audit log:** monotonic counters + NTP validation
  for forensic reliability.
- **Partial multi-user engagement failures:** what happens when global engage
  successfully terminates 4/5 user pipelines but one has orphaned processes.
- **Liquidation direction validation:** similar to GPT-5's exposure validation
  but framed differently: checking that corrupted position records could not cause
  liquidation to OPEN positions rather than close them.
- **Process termination verification:** checking that `:kill` signals actually
  worked (defense against trap_exit, NIF blocking).
- **Engagement latency SLA:** defining a 50ms target with monitoring/alerting.

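The escalation rule Opus proposed ("LIQUIDATE always wins", with serialized updates) can be illustrated briefly. This is a hedged sketch in Python: a lock stands in for GenServer serialization, and `Mode`, `ModeState`, and the `OFF` level are hypothetical names added for the example; only RESTRICT and LIQUIDATE come from the finding above.

```python
import threading
from enum import IntEnum


class Mode(IntEnum):
    # Higher value means more severe. OFF is an assumed baseline level.
    OFF = 0
    RESTRICT = 1
    LIQUIDATE = 2


class ModeState:
    """Sketch: concurrent triggers can only escalate the mode, never lower it."""

    def __init__(self):
        self._lock = threading.Lock()  # stands in for GenServer serialization
        self.mode = Mode.OFF

    def engage(self, requested):
        # Requests are applied one at a time, and the stricter mode is kept,
        # so the result does not depend on arrival order.
        with self._lock:
            self.mode = max(self.mode, requested)
            return self.mode
```

Making the rule a monotonic maximum means a late RESTRICT request cannot downgrade an active LIQUIDATE. Lowering the mode would need a separate, explicit disengage path, which this sketch leaves out.
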
**Claude Sonnet findings (all also present in GPT-5 or Opus, differently framed):**
- No genuinely unique improvements: everything Sonnet proposed was also identified
  by GPT-5 or Opus.
- Several were generic ("missing resource cleanup," "circuit breaker integration,"
  "performance monitoring"), exactly the kind of advice the prompt tried to
  exclude.
- The "missing heartbeat" and "network partition handling" proposals were solid
  but less detailed than the corresponding GPT-5/Opus versions.

**Quality assessment:**
- **GPT-5** produced the most ACTIONABLE improvements. Its proposals were
  architecturally concrete ("add an egress proxy," "use kill epochs in messages,"
  "query broker post-engage") and showed defense-in-depth thinking: multiple
  independent layers rather than a fix to one path. The infrastructure kill (#2)
  is genuinely novel: no other model proposed going OUTSIDE the application
  boundary for safety enforcement. GPT-5 consistently thought about "what if
  this entire runtime is compromised?" rather than just fixing within-app paths.
- **Claude Opus** produced the same number of improvements (15) with characteristic
  precision about failure SEQUENCES. Its unique strength: identifying design
  contradictions rather than just gaps (the engagement ordering issue, concurrent
  mode escalation, shared-resource scope mismatch). Opus's proposals were more
  "fix the design tension" while GPT-5's were more "add another safety layer."
  Opus also included the process termination verification and engagement latency
  SLA, operational rigor that GPT-5 skipped.
- **Claude Sonnet** produced 12 proposals in 40s (fast) but quality was notably
  lower. Several proposals were generic software engineering advice that the
  prompt explicitly excluded ("add performance monitoring," "resource cleanup").
  No unique insights emerged. Sonnet's proposals lacked the architectural depth
  of GPT-5 (no outside-the-application thinking) and the design-tension
  identification of Opus.

**Key insight: generative vs analytical tasks:**

This is the first experiment testing a GENERATIVE task ("propose improvements")
rather than a purely analytical one ("find problems"). The results reveal:

1. **GPT-5's defense-in-depth thinking is unique.** In analytical tasks, GPT-5
   finds exhaustive lists of issues. In generative tasks, it proposes LAYERED
   solutions: multiple independent mechanisms that each catch what the others
   miss. The infrastructure kill proposal (external to the application) shows
   GPT-5 reasoning about failure modes that are invisible to within-app analysis.

2. **Opus's design-tension identification transfers to improvement proposals.**
   In analytical tasks, Opus finds where parts of a design contradict each other.
   In generative tasks, this manifests as proposals that RESOLVE tensions rather
   than just adding patches. The engagement ordering contradiction and mode
   escalation rules are both "this design says X but the mechanism allows Y;
   here's how to make them consistent."

3. **Sonnet doesn't transfer well to generative tasks.** In analytical tasks
   (assumption-finding, cross-component analysis), Sonnet performs well (85% of
   GPT-5 in some experiments). In generative tasks, it falls back to generic
   engineering advice. The task requires both identifying problems AND proposing
   concrete solutions; Sonnet handles the first step but does not carry out the
   second with sufficient depth.

**Comparison to analytical task performance:**

| Task type | GPT-5 character | Opus character | Sonnet character |
|---|---|---|---|
| Assumption-finding (#10-12) | Exhaustive breadth | Design tensions | Good (85% of GPT-5) |
| Race conditions (#13) | Technical precision | Design contradictions | Weak (errors) |
| Invariant violations (#20) | Maximum selectivity | Self-correcting depth | Imprecise |
| **Design improvements (#24)** | **Defense-in-depth layers** | **Tension resolution** | **Generic advice** |

The generative task reveals each model's reasoning STYLE more clearly than analytical
tasks do. GPT-5's reasoning lets it construct multi-layered solutions. Opus's internal
reasoning lets it identify what a design SHOULD be (not just what's wrong).
Sonnet appears to pattern-match against known engineering practices without deep synthesis.

**Practical implication:**

For design improvement sessions on safety-critical systems:
- Run GPT-5 for defense-in-depth proposals ("what layers should exist?")
- Run Opus for design consistency proposals ("where does the design contradict itself?")
- Skip Sonnet: its output is indistinguishable from generic checklists
- The combination of GPT-5 + Opus produces complementary improvements: GPT-5 adds
  safety layers, Opus fixes internal contradictions. Together they address both
  "not enough protection" and "protection mechanisms that work against each other."

**Cost analysis:**
GPT-5: 118s, ~10.9K tokens (6K reasoning). Opus: 127s, ~5K tokens. Sonnet: 40s, ~1.6K tokens.
For a safety-critical design review, running GPT-5 + Opus costs ~16K tokens and produces
30 improvements whose unique insights barely overlap. That is excellent ROI for a kill
switch design that protects real money.