model-research/findings/2026-05-05-24-design-improvement-proposals-gpt5-excels.md

# Finding 24: Design improvement proposals: GPT-5 excels at defense-in-depth thinking; Opus finds subtle design contradictions; Sonnet produces generic recommendations

**Date:** 2026-05-05
**Task:** Propose specific design improvements for gargoyle's `kill-switch.md` (185 lines),
the design document for the primary safety mechanism that prevents rogue orders. NEW task
type: generative/creative ("what would you improve?") rather than purely analytical ("what's wrong?").
**How we used them:** Same document (full text) + same focused prompt to all 3 models
via HAI proxy. Prompt asked for 8-15 specific improvements with: weakness, proposed
change (concrete), tradeoff, severity rating. Explicitly excluded generic advice
("add more tests") and asked about runtime assumptions. No tools, no project context.

| Model | Time | Output tokens | Reasoning tokens | Improvements proposed |
|---|---|---|---|---|
| GPT-5 | 118s | 8,710 | 6,016 | 15 |
| Claude Opus 4.6 | 127s | 4,985 | (internal) | 15 |
| Claude Sonnet 4.6 | 40s | 1,636 | (internal) | 12 |

**What they found — common ground (all 3 identified):**
- DB write failure blocking engagement (fail-open under DB outage) — all three
  proposed in-memory-first engagement with async persistence
- Kill switch process liveness monitoring (heartbeat/watchdog)
- Broker connectivity loss during cancellation operations
- ETS table ownership and crash-window vulnerability
- Supervisor restart suppression as unstated mechanism
- Per-venue/per-broker scope extension
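
The in-memory-first engagement that all three models proposed can be sketched as follows. This is a minimal illustration, not gargoyle's implementation: gargoyle runs on the BEAM, and Python is used here only as a neutral, runnable notation. The names `KillSwitch`, `persist`, and `drain_once` are hypothetical, and the `threading.Event` merely stands in for the ETS gate flag.

```python
import queue
import threading


class KillSwitch:
    """Hypothetical sketch: the in-memory flag decides the gate, the DB only records it."""

    def __init__(self, persist):
        self._engaged = threading.Event()  # stands in for the ETS gate flag
        self._pending = queue.Queue()      # records awaiting durable storage
        self._persist = persist            # hypothetical callable that writes to the DB
        self.persist_errors = []           # kept for alerting; never blocks engage()

    def engage(self, reason):
        # 1. Close the gate first. This step does not touch the DB.
        self._engaged.set()
        # 2. Queue the durable record for a background worker to write.
        self._pending.put({"event": "engage", "reason": reason})

    def allows_orders(self):
        return not self._engaged.is_set()

    def drain_once(self):
        """Write queued records; on failure, requeue the record and report False."""
        while not self._pending.empty():
            record = self._pending.get()
            try:
                self._persist(record)
            except Exception as exc:
                self.persist_errors.append(exc)
                self._pending.put(record)  # retry on the next drain
                return False
        return True
```

With this ordering a DB outage delays the audit record but cannot keep the gate open. The tradeoff is a window in which the engaged state exists only in memory, which is why the proposals pair it with liveness monitoring.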

**GPT-5 unique findings (not in either other model):**
- **Infrastructure-level "hard kill"** — egress proxy or service mesh that blocks
  broker traffic independently of the application. Belt-and-suspenders approach
  where the kill switch works even if the entire BEAM VM is unresponsive. This
  was GPT-5's highest-impact unique insight.
- **Kill fence token (epoch)** — every order-carrying message includes an epoch;
  stale-epoch messages are dropped at the gate. Elegantly solves in-flight
  messages without needing drain timeouts.
- **Cluster/multi-node propagation** — detailed leader election + epoch broadcast
  + fail-closed on partition design.
- **Post-engage broker verification** — query broker AFTER engaging to confirm no
  orders slipped through during the engagement window.
- **Liquidation exposure validation** — proving tagged liquidation orders actually
  REDUCE exposure rather than trusting the tag.
- **Recovery/cold-start order suppression** — ensuring reconciliation/recovery
  routines can't submit orders while engaged.
- **Engage latency reordering** — ETS first, terminate second, DB async.
- **Audit log tamper evidence** — append-only external sink + hash chain.
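
The kill fence token idea can be illustrated with a small sketch. It assumes a single gate process and is written in Python purely for illustration (the real system is on the BEAM). `OrderMessage`, `EpochGate`, and the choice to bump the epoch on release as well as on engage are assumptions of this sketch, not details from GPT-5's proposal.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class OrderMessage:
    order_id: str
    epoch: int  # kill epoch observed when the message was created


class EpochGate:
    """Hypothetical gate: every state change bumps the epoch, making older messages stale."""

    def __init__(self):
        self._lock = threading.Lock()
        self._epoch = 0
        self._engaged = False

    def current_epoch(self):
        with self._lock:
            return self._epoch

    def engage(self):
        with self._lock:
            self._engaged = True
            self._epoch += 1  # messages already in flight now carry a stale epoch

    def release(self):
        with self._lock:
            self._engaged = False
            self._epoch += 1  # messages stamped while engaged stay rejected

    def admit(self, msg):
        # Admit only when the gate is open AND the message matches the current epoch.
        with self._lock:
            return (not self._engaged) and msg.epoch == self._epoch
```

Because the check is a comparison at the gate, no drain timeout is needed: a message created before engagement is rejected whenever it arrives, even after the switch has been released.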

**Claude Opus unique findings (not in either other model):**
- **Ordering contradiction in engagement sequence** — identified that the
  documented order (DB → ETS → terminate) creates a specific risk if a crash
  occurs BETWEEN termination and ETS update (not just DB failure). The insight
  is about the window where termination has started but gate is still open.
  More subtle than GPT-5's version (which focused on DB-blocking-engage).
- **Concurrent engagement race (mode escalation)** — multiple triggers
  simultaneously issuing conflicting modes (RESTRICT vs LIQUIDATE). Proposed
  explicit escalation rules (LIQUIDATE always wins) with GenServer serialization.
- **Shared resources under per-user scope** — per-user kill switch doesn't
  address orders in shared broker connection buffers. Forces architectural
  decision about connection pooling strategy.
- **Clock/time integrity for audit log** — monotonic counters + NTP validation
  for forensic reliability.
- **Partial multi-user engagement failures** — what happens when global engage
  successfully terminates 4/5 user pipelines but one has orphaned processes.
- **Liquidation direction validation** — similar to GPT-5's exposure validation
  but framed differently: corrupted position records could cause
  liquidation to OPEN positions rather than close them.
- **Process termination verification** — checking that `:kill` signals actually
  worked (defense against trap_exit, NIF blocking).
- **Engagement latency SLA** — defining a 50ms target with monitoring/alerting.
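
The escalation rule from the concurrent engagement finding ("LIQUIDATE always wins") reduces to a monotonic maximum over serialized requests. The sketch below is illustrative Python rather than the Elixir the system would use; the lock stands in for GenServer serialization, and the `OFF` member, the numeric values, and the `ModeArbiter` name are assumptions.

```python
import threading
from enum import IntEnum


class Mode(IntEnum):
    # Numeric order encodes severity; the values are illustrative assumptions.
    OFF = 0
    RESTRICT = 1
    LIQUIDATE = 2


class ModeArbiter:
    """Serializes engage requests; the lock plays the role of a GenServer mailbox."""

    def __init__(self):
        self._lock = threading.Lock()
        self._mode = Mode.OFF

    def request(self, mode):
        """Apply an engage request. The mode can escalate but never downgrade."""
        with self._lock:
            self._mode = max(self._mode, mode)
            return self._mode

    @property
    def mode(self):
        with self._lock:
            return self._mode
```

Whichever order two simultaneous triggers arrive in, the resulting mode is the same, so the race no longer affects the outcome. Downgrading would need a separate, explicit release path, which this sketch omits.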

**Claude Sonnet findings (all also present in GPT-5 or Opus, differently framed):**
- None of the 12 proposals was unique; each had a counterpart in GPT-5's or Opus's output.
- Several were generic: "missing resource cleanup," "circuit breaker integration,"
  "performance monitoring" — exactly the kind of advice the prompt tried to
  exclude.
- The "missing heartbeat" and "network partition handling" proposals were solid
  but less detailed than the corresponding GPT-5/Opus versions.

**Quality assessment:**
- **GPT-5** produced the most ACTIONABLE improvements. Its proposals were
  architecturally concrete ("add an egress proxy," "use kill epochs in messages,"
  "query broker post-engage") and showed defense-in-depth thinking — multiple
  independent layers rather than fixing one path. The infrastructure kill (#2)
  is genuinely novel: no other model proposed going OUTSIDE the application
  boundary for safety enforcement. GPT-5 consistently thought about "what if
  this entire runtime is compromised?" rather than just fixing within-app paths.
- **Claude Opus** matched GPT-5's count (15 improvements), with characteristic
  precision about failure SEQUENCES. Its unique strength: identifying design
  contradictions rather than just gaps (the engagement ordering issue, concurrent
  mode escalation, shared-resource scope mismatch). Opus's proposals were more
  "fix the design tension" while GPT-5's were more "add another safety layer."
  Opus also included the process termination verification and engagement latency
  SLA — operational rigor that GPT-5 skipped.
- **Claude Sonnet** produced 12 proposals in 40s (fast) but quality was notably
  lower. Several proposals were generic software engineering advice that the
  prompt explicitly excluded ("add performance monitoring," "resource cleanup").
  No unique insights emerged. Sonnet's proposals lacked the architectural depth
  of GPT-5 (no outside-the-application thinking) and the design-tension
  identification of Opus.

**Key insight — generative vs analytical tasks:**

This is the first experiment testing a GENERATIVE task ("propose improvements")
rather than a purely analytical one ("find problems"). The results reveal:

1. **GPT-5's defense-in-depth thinking is unique.** In analytical tasks, GPT-5
   finds exhaustive lists of issues. In generative tasks, it proposes LAYERED
   solutions — multiple independent mechanisms that each catch what the others
   miss. The infrastructure kill proposal (external to the application) shows
   GPT-5 reasoning about failure modes that are invisible to within-app analysis.

2. **Opus's design-tension identification transfers to improvement proposals.**
   In analytical tasks, Opus finds where parts of a design contradict each other.
   In generative tasks, this manifests as proposals that RESOLVE tensions rather
   than just adding patches. The engagement ordering contradiction and mode
   escalation rules are both "this design says X but the mechanism allows Y —
   here's how to make them consistent."

3. **Sonnet's analytical strength doesn't transfer to generative tasks.** In
   analytical tasks (assumption-finding, cross-component analysis), Sonnet
   performs well (85% of GPT-5 in some experiments). In generative tasks, it
   falls back to generic engineering advice. The task requires both identifying
   problems AND proposing concrete solutions; Sonnet handles the first step but
   lacks sufficient depth on the second.

**Comparison to analytical task performance:**

| Task type | GPT-5 character | Opus character | Sonnet character |
|---|---|---|---|
| Assumption-finding (#10-12) | Exhaustive breadth | Design tensions | Good (85% of GPT-5) |
| Race conditions (#13) | Technical precision | Design contradictions | Weak (errors) |
| Invariant violations (#20) | Maximum selectivity | Self-correcting depth | Imprecise |
| **Design improvements (#24)** | **Defense-in-depth layers** | **Tension resolution** | **Generic advice** |

The generative task reveals each model's reasoning STYLE more clearly than analytical tasks do.
GPT-5 appears to construct multi-layered solutions. Opus appears to identify what a design
SHOULD be (not just what's wrong). Sonnet appears to pattern-match against known engineering
practices without deep synthesis.

**Practical implication:**

For design improvement sessions on safety-critical systems:
- Run GPT-5 for defense-in-depth proposals ("what layers should exist?")
- Run Opus for design consistency proposals ("where does the design contradict itself?")
- Skip Sonnet: it contributed no unique proposals, and several were generic checklist items
- The combination of GPT-5 + Opus produces complementary improvements: GPT-5 adds
  safety layers, Opus fixes internal contradictions. Together they address both
  "not enough protection" and "protection mechanisms that work against each other."

**Cost analysis:**
GPT-5: 118s, ~10.9K tokens (6K reasoning). Opus: 127s, ~5K tokens. Sonnet: 40s, ~1.6K tokens.
For a safety-critical design review, running GPT-5 + Opus costs ~16K tokens and produces
30 proposals (15 per model). The common-ground findings overlap, but the unique findings are
almost entirely disjoint. Excellent ROI for a kill switch design that protects real money.