model-research/findings/2026-05-05-24-design-improvement-proposals-gpt5-excels.md
Rodin 6af8a6ee10 refactor(findings): split ALL-FINDINGS.md into per-experiment files
Break the monolithic 3249-line findings file into 29 individual files,
one per experiment. Each file is named YYYY-MM-DD-NN-slug.md for easy
chronological sorting and discovery.

No content changes — purely structural reorganization.
2026-05-06 07:15:50 -07:00


Finding 24: Design improvement proposals: GPT-5 excels at defense-in-depth thinking; Opus finds subtle design contradictions; Sonnet produces generic recommendations

Date: 2026-05-05

Task: Propose specific design improvements for gargoyle's kill-switch.md (185 lines), the primary safety mechanism that prevents rogue orders. NEW task type: generative/creative ("what would you improve?") rather than purely analytical ("what's wrong?").

How we used them: Same document (full text) + same focused prompt to all 3 models via HAI proxy. The prompt asked for 8-15 specific improvements, each with a weakness, a concrete proposed change, a tradeoff, and a severity rating. It explicitly excluded generic advice ("add more tests") and asked about runtime assumptions. No tools, no project context.

| Model | Time | Output tokens | Reasoning tokens | Improvements proposed |
|---|---|---|---|---|
| GPT-5 | 118s | 8,710 | 6,016 | 15 |
| Claude Opus 4.6 | 127s | 4,985 | (internal) | 15 |
| Claude Sonnet 4.6 | 40s | 1,636 | (internal) | 12 |

What they found — common ground (all 3 identified):

  • DB write failure blocking engagement (fail-open under DB outage) — all three proposed in-memory-first engagement with async persistence
  • Kill switch process liveness monitoring (heartbeat/watchdog)
  • Broker connectivity loss during cancellation operations
  • ETS table ownership and crash-window vulnerability
  • Supervisor restart suppression as unstated mechanism
  • Per-venue/per-broker scope extension
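
The in-memory-first engagement that all three models proposed can be illustrated with a minimal sketch. It is written in Python rather than the system's own runtime, and every name in it (`KillSwitch`, `persist`, `persist_failures`) is hypothetical and not taken from kill-switch.md. The assumption is simply that the gate is an in-memory flag and that the database write happens on a background worker, so a database outage cannot keep the switch from engaging.

```python
import queue
import threading


class KillSwitch:
    """Hypothetical in-memory-first kill switch: the gate closes before any I/O."""

    def __init__(self, persist):
        self._engaged = threading.Event()  # in-memory gate checked on every order
        self._pending = queue.Queue()      # engagement records awaiting persistence
        self._persist = persist            # hypothetical callable that writes to the DB
        self.persist_failures = []         # records whose DB write failed
        threading.Thread(target=self._drain, daemon=True).start()

    def engage(self, reason):
        # Close the gate first so a slow or failing database cannot delay it.
        self._engaged.set()
        self._pending.put({"reason": reason})

    def allow_order(self):
        return not self._engaged.is_set()

    def flush(self):
        # Block until every queued record has been attempted.
        self._pending.join()

    def _drain(self):
        while True:
            record = self._pending.get()
            try:
                self._persist(record)
            except Exception as exc:
                # The failure is recorded, but the gate stays closed.
                self.persist_failures.append((record, repr(exc)))
            finally:
                self._pending.task_done()


def unavailable_db(record):
    # Stand-in for a database outage.
    raise ConnectionError("database unavailable")


switch = KillSwitch(persist=unavailable_db)
switch.engage("risk limit breached")
switch.flush()
print(switch.allow_order(), len(switch.persist_failures))
```

The sketch deliberately leaves out recovery: an in-memory flag is lost on restart, so a real design would still need a rule for what state the gate starts in when the persisted record is missing.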

GPT-5 unique findings (not in either other model):

  • Infrastructure-level "hard kill" — egress proxy or service mesh that blocks broker traffic independently of the application. Belt-and-suspenders approach where the kill switch works even if the entire BEAM VM is unresponsive. This was GPT-5's highest-impact unique insight.
  • Kill fence token (epoch) — every order-carrying message includes an epoch; stale-epoch messages are dropped at the gate. Elegantly solves in-flight messages without needing drain timeouts.
  • Cluster/multi-node propagation: detailed leader election + epoch broadcast + fail-closed on partition design.
  • Post-engage broker verification — query broker AFTER engaging to confirm no orders slipped through during the engagement window.
  • Liquidation exposure validation — proving tagged liquidation orders actually REDUCE exposure rather than trusting the tag.
  • Recovery/cold-start order suppression — ensuring reconciliation/recovery routines can't submit orders while engaged.
  • Engage latency reordering — ETS first, terminate second, DB async.
  • Audit log tamper evidence — append-only external sink + hash chain.
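
The kill fence token idea from the list above can be sketched as follows. This is an illustration under assumptions, not GPT-5's actual proposal text: the names `OrderMessage` and `EpochGate` are hypothetical, and the sketch assumes each message is stamped with the epoch that was current when it was created and that engaging the switch bumps the epoch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrderMessage:
    order_id: str
    epoch: int  # kill epoch observed when the message was created


class EpochGate:
    """Hypothetical gate that drops messages stamped with a stale epoch."""

    def __init__(self):
        self.current_epoch = 0
        self.engaged = False

    def engage(self):
        # Bumping the epoch invalidates every message already in flight.
        self.current_epoch += 1
        self.engaged = True

    def disengage(self):
        # The epoch is NOT rolled back, so pre-engagement messages stay stale.
        self.engaged = False

    def admit(self, msg):
        return (not self.engaged) and msg.epoch == self.current_epoch


gate = EpochGate()
in_flight = OrderMessage("order-1", gate.current_epoch)
gate.engage()
gate.disengage()
fresh = OrderMessage("order-2", gate.current_epoch)
print(gate.admit(in_flight), gate.admit(fresh))
```

Because the stale message is rejected by comparison rather than by waiting, no drain timeout is needed: an order created before engagement can never pass the gate, even if it arrives after the switch has been disengaged.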

Claude Opus unique findings (not in either other model):

  • Ordering contradiction in engagement sequence — identified that the documented order (DB → ETS → terminate) creates a specific risk if a crash occurs BETWEEN termination and ETS update (not just DB failure). The insight is about the window where termination has started but gate is still open. More subtle than GPT-5's version (which focused on DB-blocking-engage).
  • Concurrent engagement race (mode escalation) — multiple triggers simultaneously issuing conflicting modes (RESTRICT vs LIQUIDATE). Proposed explicit escalation rules (LIQUIDATE always wins) with GenServer serialization.
  • Shared resources under per-user scope — per-user kill switch doesn't address orders in shared broker connection buffers. Forces architectural decision about connection pooling strategy.
  • Clock/time integrity for audit log — monotonic counters + NTP validation for forensic reliability.
  • Partial multi-user engagement failures — what happens when global engage successfully terminates 4/5 user pipelines but one has orphaned processes.
  • Liquidation direction validation: similar to GPT-5's exposure validation but framed differently, guarding against corrupted position records that could cause liquidation to OPEN positions rather than close them.
  • Process termination verification — checking that :kill signals actually worked (defense against trap_exit, NIF blocking).
  • Engagement latency SLA — defining a 50ms target with monitoring/alerting.
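
The mode escalation rule from the list above (LIQUIDATE always wins, with requests serialized) can be sketched like this. A lock stands in for GenServer serialization, and the names `Mode`, `ModeRegister`, and the `OFF` state are hypothetical; the source names only RESTRICT and LIQUIDATE.

```python
import threading
from enum import IntEnum


class Mode(IntEnum):
    OFF = 0        # hypothetical "not engaged" state, not named in the source
    RESTRICT = 1
    LIQUIDATE = 2  # highest severity, so it always wins


class ModeRegister:
    """Hypothetical serialized register where the mode can only escalate."""

    def __init__(self):
        self._lock = threading.Lock()  # stand-in for GenServer serialization
        self._mode = Mode.OFF

    def request(self, mode):
        # Requests are applied one at a time; the more severe mode is kept.
        with self._lock:
            self._mode = max(self._mode, mode)
            return self._mode

    @property
    def mode(self):
        with self._lock:
            return self._mode


register = ModeRegister()
triggers = [
    threading.Thread(target=register.request, args=(mode,))
    for mode in (Mode.LIQUIDATE, Mode.RESTRICT, Mode.RESTRICT)
]
for t in triggers:
    t.start()
for t in triggers:
    t.join()
print(register.mode.name)
```

Taking the maximum makes the outcome independent of arrival order, which is the property the escalation rule is meant to guarantee when several triggers fire at once.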

Claude Sonnet findings (all also present in GPT-5 or Opus, differently framed):

  • No improvements that GPT-5 or Opus did not also identify.
  • Several were generic: "missing resource cleanup," "circuit breaker integration," "performance monitoring" — exactly the kind of advice the prompt tried to exclude.
  • The "missing heartbeat" and "network partition handling" proposals were solid but less detailed than the corresponding GPT-5/Opus versions.

Quality assessment:

  • GPT-5 produced the most ACTIONABLE improvements. Its proposals were architecturally concrete ("add an egress proxy," "use kill epochs in messages," "query broker post-engage") and showed defense-in-depth thinking — multiple independent layers rather than fixing one path. The infrastructure kill (#2) is genuinely novel: no other model proposed going OUTSIDE the application boundary for safety enforcement. GPT-5 consistently thought about "what if this entire runtime is compromised?" rather than just fixing within-app paths.
  • Claude Opus produced equally numerous improvements (15) with characteristic precision about failure SEQUENCES. Its unique strength: identifying design contradictions rather than just gaps (the engagement ordering issue, concurrent mode escalation, shared-resource scope mismatch). Opus's proposals were more "fix the design tension" while GPT-5's were more "add another safety layer." Opus also included the process termination verification and engagement latency SLA — operational rigor that GPT-5 skipped.
  • Claude Sonnet produced 12 proposals in 40s (fast) but quality was notably lower. Several proposals were generic software engineering advice that the prompt explicitly excluded ("add performance monitoring," "resource cleanup"). No unique insights emerged. Sonnet's proposals lacked the architectural depth of GPT-5 (no outside-the-application thinking) and the design-tension identification of Opus.

Key insight — generative vs analytical tasks:

This is the first experiment testing a GENERATIVE task ("propose improvements") rather than a purely analytical one ("find problems"). The results reveal:

  1. GPT-5's defense-in-depth thinking is unique. In analytical tasks, GPT-5 finds exhaustive lists of issues. In generative tasks, it proposes LAYERED solutions — multiple independent mechanisms that each catch what the others miss. The infrastructure kill proposal (external to the application) shows GPT-5 reasoning about failure modes that are invisible to within-app analysis.

  2. Opus's design-tension identification transfers to improvement proposals. In analytical tasks, Opus finds where parts of a design contradict each other. In generative tasks, this manifests as proposals that RESOLVE tensions rather than just adding patches. The engagement ordering contradiction and mode escalation rules are both "this design says X but the mechanism allows Y — here's how to make them consistent."

  3. Sonnet doesn't transfer well to generative tasks. In analytical tasks (assumption-finding, cross-component analysis), Sonnet performs well (85% of GPT-5 in some experiments). In generative tasks, it falls back to generic engineering advice. The task requires both identifying problems AND proposing concrete solutions — Sonnet handles the first step but not the second with sufficient depth.

Comparison to analytical task performance:

| Task type | GPT-5 character | Opus character | Sonnet character |
|---|---|---|---|
| Assumption-finding (#10-12) | Exhaustive breadth | Design tensions | Good (85% of GPT-5) |
| Race conditions (#13) | Technical precision | Design contradictions | Weak (errors) |
| Invariant violations (#20) | Maximum selectivity | Self-correcting depth | Imprecise |
| Design improvements (#24) | Defense-in-depth layers | Tension resolution | Generic advice |

The generative task reveals each model's characteristic reasoning style more clearly than the analytical tasks do. GPT-5's reasoning appears to let it construct multi-layered solutions. Opus's internal reasoning appears to let it identify what a design SHOULD be (not just what's wrong). Sonnet appears to pattern-match against known engineering practices without deep synthesis.

Practical implication:

For design improvement sessions on safety-critical systems:

  • Run GPT-5 for defense-in-depth proposals ("what layers should exist?")
  • Run Opus for design consistency proposals ("where does the design contradict itself?")
  • Skip Sonnet — its output is indistinguishable from generic checklists
  • The combination of GPT-5 + Opus produces complementary improvements: GPT-5 adds safety layers, Opus fixes internal contradictions. Together they address both "not enough protection" and "protection mechanisms that work against each other."

Cost analysis: GPT-5: 118s, ~10.9K tokens (6K reasoning). Opus: 127s, ~5K tokens. Sonnet: 40s, ~1.6K tokens. For a safety-critical design review, running GPT-5 + Opus costs ~16K tokens and produces 30 proposals (15 per model) whose unique insights barely overlap. That is excellent ROI for a kill switch design that protects real money.