model-research/findings/2026-05-05-24-design-improvement-proposals-gpt5-excels.md
Rodin 6af8a6ee10 refactor(findings): split ALL-FINDINGS.md into per-experiment files
Break the monolithic 3249-line findings file into 29 individual files,
one per experiment. Each file is named YYYY-MM-DD-NN-slug.md for easy
chronological sorting and discovery.

No content changes — purely structural reorganization.
2026-05-06 07:15:50 -07:00


# Finding 24: Design improvement proposals: GPT-5 excels at defense-in-depth thinking; Opus finds subtle design contradictions; Sonnet produces generic recommendations
**Date:** 2026-05-05

**Task:** Propose specific design improvements for gargoyle's `kill-switch.md` (185 lines),
the primary safety mechanism that prevents rogue orders. NEW task type: generative/creative
("what would you improve?") rather than purely analytical ("what's wrong?").

**How we used them:** Same document (full text) + same focused prompt to all 3 models
via HAI proxy. Prompt asked for 8-15 specific improvements with: weakness, proposed
change (concrete), tradeoff, severity rating. Explicitly excluded generic advice
("add more tests") and asked about runtime assumptions. No tools, no project context.

| Model | Time | Output tokens | Reasoning tokens | Improvements proposed |
|---|---|---|---|---|
| GPT-5 | 118s | 8,710 | 6,016 | 15 |
| Claude Opus 4.6 | 127s | 4,985 | (internal) | 15 |
| Claude Sonnet 4.6 | 40s | 1,636 | (internal) | 12 |

**What they found (common ground, identified by all 3):**
- DB write failure blocking engagement (fail-open under DB outage) — all three
proposed in-memory-first engagement with async persistence
- Kill switch process liveness monitoring (heartbeat/watchdog)
- Broker connectivity loss during cancellation operations
- ETS table ownership and crash-window vulnerability
- Supervisor restart suppression as unstated mechanism
- Per-venue/per-broker scope extension
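
The in-memory-first engagement that all three models converged on can be sketched as follows. This is a minimal illustration in Python rather than the BEAM code gargoyle runs. `KillSwitch`, `persist`, and `flush` are hypothetical names, a `threading.Event` stands in for the ETS flag, and retry/alerting on failed writes is omitted. What it is meant to show: engaging never waits on the database, and a failed write never reopens the gate.

```python
import queue
import threading


class KillSwitch:
    """In-memory-first kill switch: the gate closes before any durable write."""

    def __init__(self, persist):
        self._engaged = threading.Event()  # stand-in for the in-memory (ETS) flag
        self._pending = queue.Queue()      # engagement records awaiting persistence
        self._persist = persist            # hypothetical durable-write callable
        threading.Thread(target=self._drain, daemon=True).start()

    def engage(self, reason):
        # Close the gate first; this step never touches the database.
        self._engaged.set()
        self._pending.put(reason)

    def allows_orders(self):
        return not self._engaged.is_set()

    def flush(self):
        # Block until every queued record has been attempted (handy in tests).
        self._pending.join()

    def _drain(self):
        # Persist in the background; a failed write leaves the gate closed.
        while True:
            reason = self._pending.get()
            try:
                self._persist(reason)
            except Exception:
                pass  # a real system would retry and alert; omitted in this sketch
            finally:
                self._pending.task_done()
```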
**GPT-5 unique findings (not in either other model):**
- **Infrastructure-level "hard kill"** — egress proxy or service mesh that blocks
broker traffic independently of the application. Belt-and-suspenders approach
where the kill switch works even if the entire BEAM VM is unresponsive. This
was GPT-5's highest-impact unique insight.
- **Kill fence token (epoch)** — every order-carrying message includes an epoch;
stale-epoch messages are dropped at the gate. Elegantly solves in-flight
messages without needing drain timeouts.
- **Cluster/multi-node propagation** — detailed leader election + epoch broadcast
+ fail-closed on partition design.
- **Post-engage broker verification** — query broker AFTER engaging to confirm no
orders slipped through during the engagement window.
- **Liquidation exposure validation** — proving tagged liquidation orders actually
REDUCE exposure rather than trusting the tag.
- **Recovery/cold-start order suppression** — ensuring reconciliation/recovery
routines can't submit orders while engaged.
- **Engage latency reordering** — ETS first, terminate second, DB async.
- **Audit log tamper evidence** — append-only external sink + hash chain.
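
The kill fence token idea above can be illustrated with a small sketch. It is written in Python for neutrality, not as gargoyle's implementation; `OrderMessage`, `OrderGate`, and their fields are hypothetical names. It assumes a single gate holding one integer epoch that is bumped on every engagement and never rolled back, so messages stamped before an engagement stay stale even after the switch is released.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrderMessage:
    order_id: str
    epoch: int  # kill epoch observed when the message was created


class OrderGate:
    """Drops order messages stamped with an epoch older than the current one."""

    def __init__(self):
        self.current_epoch = 0
        self.engaged = False

    def engage(self):
        # Bumping the epoch invalidates every message already in flight.
        self.current_epoch += 1
        self.engaged = True

    def disengage(self):
        # The epoch is not rolled back, so pre-engagement messages stay stale.
        self.engaged = False

    def admit(self, msg):
        if self.engaged:
            return False
        return msg.epoch == self.current_epoch
```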
**Claude Opus unique findings (not in either other model):**
- **Ordering contradiction in engagement sequence** — identified that the
documented order (DB → ETS → terminate) creates a specific risk if a crash
occurs BETWEEN termination and ETS update (not just DB failure). The insight
is about the window where termination has started but gate is still open.
More subtle than GPT-5's version (which focused on DB-blocking-engage).
- **Concurrent engagement race (mode escalation)** — multiple triggers
simultaneously issuing conflicting modes (RESTRICT vs LIQUIDATE). Proposed
explicit escalation rules (LIQUIDATE always wins) with GenServer serialization.
- **Shared resources under per-user scope** — per-user kill switch doesn't
address orders in shared broker connection buffers. Forces architectural
decision about connection pooling strategy.
- **Clock/time integrity for audit log** — monotonic counters + NTP validation
for forensic reliability.
- **Partial multi-user engagement failures** — what happens when global engage
successfully terminates 4/5 user pipelines but one has orphaned processes.
- **Liquidation direction validation** — similar to GPT-5's exposure validation
but framed differently: checking that corrupted position records cannot cause
liquidation to OPEN positions rather than close them.
- **Process termination verification** — checking that `:kill` signals actually
worked (defense against trap_exit, NIF blocking).
- **Engagement latency SLA** — defining a 50ms target with monitoring/alerting.
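
The mode escalation rule Opus proposed (LIQUIDATE always wins, with requests serialized) can be sketched as below. This is an illustrative Python sketch, not the GenServer the finding refers to: a `threading.Lock` stands in for GenServer serialization, `Mode` and `ModeState` are hypothetical names, and ranking RESTRICT below LIQUIDATE by integer severity is an assumption that goes slightly beyond what the finding states.

```python
import threading
from enum import IntEnum


class Mode(IntEnum):
    # Severity ordering is an assumption; the finding only says LIQUIDATE wins.
    OFF = 0
    RESTRICT = 1
    LIQUIDATE = 2


class ModeState:
    """Serializes engagement requests and never downgrades the active mode."""

    def __init__(self):
        self._lock = threading.Lock()  # stand-in for GenServer serialization
        self.mode = Mode.OFF

    def engage(self, requested):
        with self._lock:
            # Keep whichever mode is more severe, regardless of arrival order.
            self.mode = max(self.mode, requested)
            return self.mode
```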
**Claude Sonnet findings (all also present in GPT-5 or Opus, differently framed):**
- No proposal was unique to Sonnet.
- Several were generic: "missing resource cleanup," "circuit breaker integration,"
"performance monitoring" — exactly the kind of advice the prompt tried to
exclude.
- The "missing heartbeat" and "network partition handling" proposals were solid
but less detailed than the corresponding GPT-5/Opus versions.

**Quality assessment:**
- **GPT-5** produced the most ACTIONABLE improvements. Its proposals were
architecturally concrete ("add an egress proxy," "use kill epochs in messages,"
"query broker post-engage") and showed defense-in-depth thinking — multiple
independent layers rather than fixing one path. The infrastructure kill (#2)
is genuinely novel: no other model proposed going OUTSIDE the application
boundary for safety enforcement. GPT-5 consistently thought about "what if
this entire runtime is compromised?" rather than just fixing within-app paths.
- **Claude Opus** produced equally numerous improvements (15) with characteristic
precision about failure SEQUENCES. Its unique strength: identifying design
contradictions rather than just gaps (the engagement ordering issue, concurrent
mode escalation, shared-resource scope mismatch). Opus's proposals were more
"fix the design tension" while GPT-5's were more "add another safety layer."
Opus also included the process termination verification and engagement latency
SLA — operational rigor that GPT-5 skipped.
- **Claude Sonnet** produced 12 proposals in 40s (fast) but quality was notably
lower. Several proposals were generic software engineering advice that the
prompt explicitly excluded ("add performance monitoring," "resource cleanup").
No unique insights emerged. Sonnet's proposals lacked the architectural depth
of GPT-5 (no outside-the-application thinking) and the design-tension
identification of Opus.

**Key insight (generative vs analytical tasks):**
This is the first experiment testing a GENERATIVE task ("propose improvements")
rather than a purely analytical one ("find problems"). The results reveal:
1. **GPT-5's defense-in-depth thinking is unique.** In analytical tasks, GPT-5
finds exhaustive lists of issues. In generative tasks, it proposes LAYERED
solutions — multiple independent mechanisms that each catch what the others
miss. The infrastructure kill proposal (external to the application) shows
GPT-5 reasoning about failure modes that are invisible to within-app analysis.
2. **Opus's design-tension identification transfers to improvement proposals.**
In analytical tasks, Opus finds where parts of a design contradict each other.
In generative tasks, this manifests as proposals that RESOLVE tensions rather
than just adding patches. The engagement ordering contradiction and mode
escalation rules are both "this design says X but the mechanism allows Y —
here's how to make them consistent."
3. **Sonnet's analytical strength doesn't transfer to generative tasks.** In analytical tasks
(assumption-finding, cross-component analysis), Sonnet performs well (85% of
GPT-5 in some experiments). In generative tasks, it falls back to generic
engineering advice. The task requires both identifying problems AND proposing
concrete solutions: Sonnet handles the first step but does not carry the second
to sufficient depth.

**Comparison to analytical task performance:**

| Task type | GPT-5 character | Opus character | Sonnet character |
|---|---|---|---|
| Assumption-finding (#10-12) | Exhaustive breadth | Design tensions | Good (85% of GPT-5) |
| Race conditions (#13) | Technical precision | Design contradictions | Weak (errors) |
| Invariant violations (#20) | Maximum selectivity | Self-correcting depth | Imprecise |
| **Design improvements (#24)** | **Defense-in-depth layers** | **Tension resolution** | **Generic advice** |

The generative task reveals each model's reasoning STYLE more clearly than the analytical tasks do.
GPT-5 constructs multi-layered solutions. Opus identifies what a design SHOULD be
(not just what's wrong). Sonnet pattern-matches against known engineering practices
without deep synthesis.

**Practical implication:**
For design improvement sessions on safety-critical systems:
- Run GPT-5 for defense-in-depth proposals ("what layers should exist?")
- Run Opus for design consistency proposals ("where does the design contradict itself?")
- Skip Sonnet — its output is indistinguishable from generic checklists
- The combination of GPT-5 + Opus produces complementary improvements: GPT-5 adds
safety layers, Opus fixes internal contradictions. Together they address both
"not enough protection" and "protection mechanisms that work against each other."

**Cost analysis:**
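GPT-5: 118s, 8,710 output tokens (6,016 reasoning tokens listed separately in the table above).
Opus: 127s, 4,985 output tokens. Sonnet: 40s, 1,636 output tokens.
For a safety-critical design review, running GPT-5 + Opus costs roughly 13.7K output tokens
(about 19.7K if GPT-5's reasoning tokens are counted on top) and produces 30 proposals, 15 each.
Apart from the common-ground items, the two models' unique insights barely overlap.
Excellent ROI for a kill switch design that protects real money.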