# Finding 19: Union coverage test: GPT-5 Mini + Sonnet 4.6 covers ~71% of GPT-5's findings; the missing 29% is where the real value lives

**Date:** 2026-05-04

**Task:** Identify hidden assumptions in gargoyle's `trading-pipeline.md` (1,110 lines, ~62KB), the most complex document tested so far, covering the full end-to-end path from tick ingestion through order execution.

**How we used them:** Same document (full text, no truncation) + same focused analytical question to all 3 models via HAI proxy. Standard hidden-assumption prompt with 5 categories (runtime behavior, external dependencies, timing/ordering, scale/load, uncovered failure modes). Required specific output format per finding. No tools, no project context beyond the document itself.

| Model | Time | Output tokens | Reasoning tokens | Assumptions found |
|---|---|---|---|---|
| GPT-5 | 99s | 9,418 | 5,696 | 35 |
| GPT-5 Mini | 93s | 5,309 | 1,792 | 21 |
| Claude Sonnet 4.6 | 38s | 1,792 | (internal) | 17 |

**Coverage analysis: can Mini + Sonnet together replace GPT-5?**

Categorized each of GPT-5's 35 findings by whether the union of Mini + Sonnet also identified the same assumption:

- **Covered by BOTH Mini and Sonnet:** ~12 findings (common ground that any model finds: idempotency, single-writer, clock sync, instrument resolution, fill immutability, reconciliation gate, backpressure, fill correlation, event ordering, audit scalability, PortfolioRisk bottleneck)
- **Covered by Mini only (not Sonnet):** ~7 findings (transactional atomicity, audit causal consistency, modification-in-flight enforcement, OM throughput, decimal precision, PM/PR close-only race, partition duplicate submit)
- **Covered by Sonnet only (not Mini):** ~6 findings (market data feed rates, pipeline-vs-market speed, corporate actions atomicity, kill switch partition, shared port isolation, market close vs auction fills)
- **Union(Mini + Sonnet) total coverage:** ~25/35 = **~71%** of GPT-5's findings
- **GPT-5 unique (missed by both):** ~10-18 findings, depending on how strictly a match is judged (10 corresponds to the 25/35 count above)

**What GPT-5 uniquely found that the cheaper pair missed:**

The missing 29% is NOT random; it is systematically different in character:

1. **Operational edge cases:** Default TIF "day" broker semantics, OrderRate counting retries, extended-hours MarketHours mismatch, fractional quantities, local expiry timer precision per instrument
2. **Design-level interaction gaps:** PortfolioRisk concurrent decision race (snapshot stale between two parallel approvals), re-validation gap between approval and submit, decision loss on crash after audit write
3. **Domain-specific knowledge:** Manual broker-side actions conflicting with state machine, options/complex instrument position_effect mapping, Decision→Order 1:1 invariant vs broker auto-splitting, wash sale retroactive P&L mutation
4. **Architectural observations:** Reduction re-entry rule insufficiency, PortfolioMonitor coalescing vs fast breach detection, multi-aggregator fanout and audit partial writes, replay/backtest alignment with production controls

These share a common trait: they require **domain expertise** (knowing how brokers actually behave, how regulatory rules interact, how production trading systems fail in practice) combined with **architectural reasoning** (how the design's own mechanisms interact under those real-world conditions). The cheaper models find assumptions about the document's internal consistency; GPT-5 additionally finds assumptions about the document's relationship to the external world it must operate in.

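
The coverage buckets reported earlier in this finding amount to plain set arithmetic over GPT-5's findings. The sketch below is only an illustration of that bookkeeping: the integer IDs are placeholders standing in for the 35 GPT-5 findings (not real finding labels), and `coverage_report`, `mini_matched`, and `sonnet_matched` are hypothetical names. It assumes each GPT-5 finding has already been judged, by hand, as matched or not matched by each cheaper model.

```python
def coverage_report(reference, matched_a, matched_b):
    """Bucket the reference findings by which cheaper model(s) also found them."""
    a = matched_a & reference  # ignore anything outside the reference set
    b = matched_b & reference
    union = a | b
    return {
        "both": len(a & b),
        "a_only": len(a - b),
        "b_only": len(b - a),
        "union": len(union),
        "missed": len(reference - union),
        "coverage": len(union) / len(reference),
    }


# Placeholder IDs 1..35 stand in for GPT-5's 35 findings.
gpt5 = set(range(1, 36))
# Approximate bucket sizes from this finding: 12 shared, 7 Mini-only, 6 Sonnet-only.
mini_matched = set(range(1, 20))                         # IDs 1-12 shared, 13-19 Mini-only
sonnet_matched = set(range(1, 13)) | set(range(20, 26))  # IDs 1-12 shared, 20-25 Sonnet-only

report = coverage_report(gpt5, mini_matched, sonnet_matched)
print(report)
print(f"union coverage: {report['union']}/{len(gpt5)} = {report['coverage']:.0%}")
# union coverage: 25/35 = 71%
```

Judging matches more strictly only shrinks `mini_matched` and `sonnet_matched`, which is how the count of GPT-5-unique findings can grow from 10 toward 18 without GPT-5's own output changing.
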

**GPT-5 Mini vs Sonnet 4.6 (complementary, not redundant):**

Mini and Sonnet covered different gaps:

- Mini was stronger on **internal consistency** (transactional atomicity, causal consistency, decimal precision, modification serialization)
- Sonnet was stronger on **external interactions** (market data feeds, corporate actions, kill switch distribution, shared resource isolation)

This aligns with previous findings: Mini reasons about implementation mechanics; Sonnet reasons about system boundaries and external interactions. Their union covers more ground than either alone.

**Cost comparison:**

| Approach | Total tokens | Approx. cost | Coverage of GPT-5 |
|---|---|---|---|
| GPT-5 alone | ~21K (9.4K output + 5.7K reasoning) | ~$0.80 | 100% (35 findings) |
| Mini + Sonnet | ~7.1K output + 1.8K reasoning | ~$0.25 | ~71% (25/35 findings) |
| All three | ~28K total | ~$1.05 | >100% (35 + unique Sonnet/Mini extras) |

**Key insight: the 71% coverage is a floor, not a ceiling.**

The union covers 71% of GPT-5's specific findings. But Mini and Sonnet each also produced findings that GPT-5 DIDN'T make:

- Sonnet: DailyLossLimit query performance scaling, instrument reference data propagation atomicity across components
- Mini: Signal audit correlation ambiguity under replay/duplicate ticks

So the total unique finding space is LARGER than what any single model produced. Running all three produces the most comprehensive analysis.

**Answer to the open question: "Would running GPT-5 Mini + Sonnet together approach GPT-5's coverage at lower combined cost?"**

**Partially.** The pair covers ~71% of GPT-5's findings at ~31% of the cost. But the missing 29% is disproportionately valuable: it contains the domain-specific, interaction-level, real-world-knowledge findings that are most likely to prevent production incidents. For a quick sanity check or first-pass screening, Mini + Sonnet is excellent value.
For architecture review where completeness matters (financial systems, safety-critical systems), GPT-5 is not replaceable by cheaper models: its unique findings are the ones most likely to correspond to real-world failures.

**Practical implication:**

The optimal strategy depends on stakes:

- **Low stakes** (internal doc review, non-critical systems): Mini + Sonnet gives ~71% coverage at ~31% of the cost, a strong ROI
- **High stakes** (financial systems, safety-critical): run all three; the ~$1 total cost is negligible next to the value of the extra 10-18 findings
- **Budget-conscious high stakes:** run GPT-5 alone; it subsumes most of what Mini + Sonnet find, and adds the critical domain-knowledge findings

The cost argument for Mini + Sonnet as a GPT-5 REPLACEMENT doesn't hold for important work. The cost argument for Mini + Sonnet as a GPT-5 COMPLEMENT is strong: they catch a few things GPT-5 misses, and the union of all three is the most thorough analysis available.

**Document complexity observation:**

This is the largest document tested (1,110 lines vs previous 185-785 lines). GPT-5's finding count scaled up (35 vs 20-26 on smaller docs) while maintaining quality, with no padding from obvious or low-value findings. Mini also scaled (21 vs 6 on the 459-line doc in Finding #14). Sonnet scaled less (17 vs 12-17 on smaller docs); it appears to have a natural output ceiling regardless of document size, consistent with the self-filtering behavior observed in previous findings.
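
The cost and coverage ratios quoted in this finding can be re-derived from the cost comparison table. The sketch below only reuses the approximate dollar figures and finding counts reported there; the variable names are illustrative, and the per-finding cost is a derived number that does not appear in the original analysis.

```python
# Approximate figures copied from the cost comparison table.
COST_USD = {"gpt5": 0.80, "mini_plus_sonnet": 0.25}
# Number of GPT-5's 35 findings that each approach covers.
GPT5_FINDINGS_COVERED = {"gpt5": 35, "mini_plus_sonnet": 25}

# Pair relative to GPT-5 alone.
cost_ratio = COST_USD["mini_plus_sonnet"] / COST_USD["gpt5"]
coverage_ratio = GPT5_FINDINGS_COVERED["mini_plus_sonnet"] / GPT5_FINDINGS_COVERED["gpt5"]

# Derived metric (not in the original table): dollars per covered GPT-5 finding.
cost_per_finding = {
    name: COST_USD[name] / GPT5_FINDINGS_COVERED[name] for name in COST_USD
}

# Running all three is simply the sum of the two rows.
all_three_cost = sum(COST_USD.values())

print(f"pair cost vs GPT-5:     {cost_ratio:.0%}")      # 31%
print(f"pair coverage of GPT-5: {coverage_ratio:.0%}")  # 71%
for name, usd in cost_per_finding.items():
    print(f"{name}: ${usd:.3f} per covered GPT-5 finding")
print(f"all three: ${all_three_cost:.2f}")              # $1.05
```

The per-finding figure favors the cheaper pair, but it counts only findings matched against GPT-5's list and treats every finding as equally valuable, which is the assumption the stakes-based recommendation above argues against.
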