model-research/findings/2026-05-04-19-union-coverage-test-gpt5-mini.md

Finding 19: Union coverage test: GPT-5 Mini + Sonnet 4.6 covers ~71% of GPT-5's findings; the missing 29% is where the real value lives

Date: 2026-05-04

Task: Identify hidden assumptions in gargoyle's trading-pipeline.md (1,110 lines, ~62KB), the most complex document tested so far, covering the full end-to-end path from tick ingestion through order execution.

How we used them: Same document (full text, no truncation) + same focused analytical question to all 3 models via HAI proxy. Standard hidden-assumption prompt with 5 categories (runtime behavior, external dependencies, timing/ordering, scale/load, uncovered failure modes). Required specific output format per finding. No tools, no project context beyond the document itself.

| Model | Time | Output tokens | Reasoning tokens | Assumptions found |
|---|---|---|---|---|
| GPT-5 | 99s | 9,418 | 5,696 | 35 |
| GPT-5 Mini | 93s | 5,309 | 1,792 | 21 |
| Claude Sonnet 4.6 | 38s | 1,792 | (internal) | 17 |

Coverage analysis — can Mini + Sonnet together replace GPT-5?

Categorized each of GPT-5's 35 findings by whether the union of Mini + Sonnet also identified the same assumption:

  • Covered by BOTH Mini and Sonnet: ~12 findings (common ground — any model finds these: idempotency, single-writer, clock sync, instrument resolution, fill immutability, reconciliation gate, backpressure, fill correlation, event ordering, audit scalability, PortfolioRisk bottleneck)
  • Covered by Mini only (not Sonnet): ~7 findings (transactional atomicity, audit causal consistency, modification-in-flight enforcement, OM throughput, decimal precision, PM/PR close-only race, partition duplicate submit)
  • Covered by Sonnet only (not Mini): ~6 findings (market data feed rates, pipeline-vs-market speed, corporate actions atomicity, kill switch partition, shared port isolation, market close vs auction fills)
  • Union(Mini + Sonnet) total coverage: ~25/35 = ~71% of GPT-5's findings
  • GPT-5 unique (missed by both): ~10-18 findings, depending on how strictly a match is judged (the ~71% figure above uses the lenient count of 10)
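
The arithmetic behind the coverage figures can be reproduced with a minimal sketch. It uses only the approximate counts from the list above. The "strict" figure is an inference from the upper bound of 18 missed findings, not a number reported in the experiment, and the function name `coverage_of` is illustrative.

```python
# Approximate counts taken from the coverage breakdown above.
GPT5_TOTAL = 35
covered_by_both = 12
covered_by_mini_only = 7
covered_by_sonnet_only = 6


def coverage_of(total: int, missed: int) -> float:
    """Fraction of GPT-5's findings that the Mini + Sonnet union also found."""
    return (total - missed) / total


union_covered = covered_by_both + covered_by_mini_only + covered_by_sonnet_only
missed_lenient = GPT5_TOTAL - union_covered  # lenient matching
missed_strict = 18  # upper bound quoted above (assumed to be the strict count)

lenient = coverage_of(GPT5_TOTAL, missed_lenient)
strict = coverage_of(GPT5_TOTAL, missed_strict)

print(f"union covers {union_covered}/{GPT5_TOTAL} = {lenient:.0%} (lenient)")
print(f"strict matching would give {strict:.0%}")
```

Under the lenient count the union reaches 25/35. If all 18 borderline matches were rejected, coverage would fall to roughly half of GPT-5's findings.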

What GPT-5 uniquely found that the cheaper pair missed:

The missing 29% is NOT random — it's systematically different in character:

  1. Operational edge cases: Default TIF "day" broker semantics, OrderRate counting retries, extended-hours MarketHours mismatch, fractional quantities, local expiry timer precision per instrument
  2. Design-level interaction gaps: PortfolioRisk concurrent decision race (snapshot stale between two parallel approvals), re-validation gap between approval and submit, decision loss on crash after audit write
  3. Domain-specific knowledge: Manual broker-side actions conflicting with state machine, options/complex instrument position_effect mapping, Decision→Order 1:1 invariant vs broker auto-splitting, wash sale retroactive P&L mutation
  4. Architectural observations: Reduction re-entry rule insufficiency, PortfolioMonitor coalescing vs fast breach detection, multi-aggregator fanout and audit partial writes, replay/backtest alignment with production controls

These share a common trait: they require domain expertise (knowing how brokers actually behave, how regulatory rules interact, how production trading systems fail in practice) combined with architectural reasoning (how the design's own mechanisms interact under those real-world conditions). The cheaper models find assumptions about the document's internal consistency; GPT-5 additionally finds assumptions about the document's relationship to the external world it must operate in.

GPT-5 Mini vs Sonnet 4.6 — complementary, not redundant:

Mini and Sonnet covered different gaps:

  • Mini was stronger on internal consistency (transactional atomicity, causal consistency, decimal precision, modification serialization)
  • Sonnet was stronger on external interactions (market data feeds, corporate actions, kill switch distribution, shared resource isolation)

This aligns with previous findings: Mini reasons about implementation mechanics; Sonnet reasons about system boundaries and external interactions. Their union covers more ground than either alone.

Cost comparison:

| Approach | Total tokens | Approx. cost | Coverage of GPT-5 |
|---|---|---|---|
| GPT-5 alone | ~21K (9.4K output + 5.7K reasoning) | ~$0.80 | 100% (35 findings) |
| Mini + Sonnet | ~7.1K output + 1.8K reasoning | ~$0.25 | ~71% (25/35 findings) |
| All three | ~28K total | ~$1.05 | >100% (35 + unique Sonnet/Mini extras) |
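
As a quick check on the cost figures, the sketch below recomputes the ratios from the approximate dollar amounts in the table. The cost-per-covered-finding metric is an illustrative derived quantity, not something measured in the experiment.

```python
# Approximate per-run costs in USD, copied from the table above.
cost_gpt5 = 0.80
cost_pair = 0.25          # GPT-5 Mini + Sonnet 4.6 combined
findings_gpt5 = 35
findings_pair_covered = 25  # GPT-5 findings also found by the pair

cost_ratio = cost_pair / cost_gpt5       # pair cost relative to GPT-5
cost_all_three = cost_gpt5 + cost_pair   # running every model once

# Illustrative derived metric: dollars per GPT-5-equivalent finding.
per_finding_gpt5 = cost_gpt5 / findings_gpt5
per_finding_pair = cost_pair / findings_pair_covered

print(f"pair costs {cost_ratio:.0%} of GPT-5 alone")
print(f"all three: ${cost_all_three:.2f}")
print(f"per finding: GPT-5 ${per_finding_gpt5:.3f}, pair ${per_finding_pair:.3f}")
```

The pair is cheaper per covered finding, but this metric treats all findings as equally valuable, which is not the case here.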

Key insight — the 71% coverage is a floor, not a ceiling:

The union covers 71% of GPT-5's specific findings. But Mini and Sonnet each also produced findings that GPT-5 DIDN'T make:

  • Sonnet: DailyLossLimit query performance scaling, instrument reference data propagation atomicity across components
  • Mini: Signal audit correlation ambiguity under replay/duplicate ticks

So the total unique finding space is LARGER than what any single model produces. Running all three produces the most comprehensive analysis.

Answer to the open question: "Would running GPT-5 Mini + Sonnet together approach GPT-5's coverage at lower combined cost?"

Partially. The pair covers ~71% of GPT-5's findings at ~31% of the cost. But the missing 29% is disproportionately valuable: it contains the domain-specific, interaction-level, real-world-knowledge findings that are most likely to prevent production incidents. For a quick sanity check or first-pass screening, Mini + Sonnet is excellent value. For architecture review where completeness matters (financial system, safety-critical), GPT-5 is not replaceable by cheaper models: its unique findings flag exactly the assumptions that would cause real-world failures.

Practical implication: The optimal strategy depends on stakes:

  • Low stakes (internal doc review, non-critical systems): Mini + Sonnet gives ~71% coverage at ~31% of the cost, which is strong ROI
  • High stakes (financial systems, safety-critical): run all three — the ~$1 total cost is irrelevant vs the value of the extra 10-18 findings
  • Budget-conscious high stakes: run GPT-5 alone — it subsumes most of what Mini + Sonnet find, and adds the critical domain-knowledge findings
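
The three-way rule above can be written as a small selector. This is a sketch only: the `Stakes` enum, the `choose_models` function, and the model label strings are hypothetical names for illustration, not identifiers from any API or from the experiment harness.

```python
from enum import Enum


class Stakes(Enum):
    LOW = "low"    # internal doc review, non-critical systems
    HIGH = "high"  # financial systems, safety-critical


def choose_models(stakes: Stakes, budget_constrained: bool = False) -> list[str]:
    """Return the model set suggested by the stakes-based rule above."""
    if stakes is Stakes.LOW:
        # Cheap pair: roughly 71% of GPT-5's coverage at roughly 31% of the cost.
        return ["GPT-5 Mini", "Claude Sonnet 4.6"]
    if budget_constrained:
        # GPT-5 alone covers most of what the pair finds, plus its unique findings.
        return ["GPT-5"]
    # High stakes, no budget pressure: take the union of all three.
    return ["GPT-5", "GPT-5 Mini", "Claude Sonnet 4.6"]


print(choose_models(Stakes.HIGH))
print(choose_models(Stakes.HIGH, budget_constrained=True))
print(choose_models(Stakes.LOW))
```

The budget flag is ignored for low-stakes work, since the pair is already the cheapest option considered.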

The cost argument for Mini + Sonnet as a GPT-5 REPLACEMENT doesn't hold for important work. The cost argument for Mini + Sonnet as a GPT-5 COMPLEMENT is strong — they catch a few things GPT-5 misses, and the union of all three is the most thorough analysis available.

Document complexity observation: This is the largest document tested (1,110 lines vs previous 185-785 lines). GPT-5's finding count scaled up (35 vs 20-26 on smaller docs) while maintaining quality — no padding with obvious/low-value findings. Mini also scaled (21 vs 6 on 459-line doc in Finding #14). Sonnet scaled less (17 vs 12-17 on smaller docs) — it appears to have a natural output ceiling regardless of document size, consistent with its self-filtering behavior observed in previous findings.