# Finding 19: Union coverage test: GPT-5 Mini + Sonnet 4.6 covers ~71% of GPT-5's findings; the missing 29% is where the real value lives
**Date:** 2026-05-04
**Task:** Identify hidden assumptions in gargoyle's `trading-pipeline.md` (1,110 lines,
~62KB) — the most complex document tested so far, covering the full end-to-end path
from tick ingestion through order execution.
**How we used them:** Same document (full text, no truncation) + same focused analytical
question to all 3 models via HAI proxy. Standard hidden-assumption prompt with 5
categories (runtime behavior, external dependencies, timing/ordering, scale/load,
uncovered failure modes). Required specific output format per finding. No tools, no
project context beyond the document itself.
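As a rough illustration of the setup described above, the sketch below assembles one prompt from the five categories and reuses it unchanged for every model. The prompt wording, the function name `build_prompt`, and the placeholder document text are assumptions made for illustration; the actual prompt text, the required output format, and the HAI proxy call are not reproduced in this finding.

```python
# Hypothetical sketch: the real prompt wording and output format are not
# given in this finding; only the five category names and the model names
# come from it.
CATEGORIES = [
    "runtime behavior",
    "external dependencies",
    "timing/ordering",
    "scale/load",
    "uncovered failure modes",
]

MODELS = ["GPT-5", "GPT-5 Mini", "Claude Sonnet 4.6"]


def build_prompt(document_text: str) -> str:
    """Combine the focused question, the categories, and the full document."""
    category_lines = "\n".join(f"- {name}" for name in CATEGORIES)
    return (
        "Identify the hidden assumptions in the document below.\n"
        "Group them under these categories:\n"
        f"{category_lines}\n\n"
        "--- DOCUMENT START ---\n"
        f"{document_text}\n"
        "--- DOCUMENT END ---"
    )


# Every model receives the identical prompt, so differences in the output
# come from the model rather than from the input.
document_text = "example document body"  # placeholder, not the real file
prompts = {model: build_prompt(document_text) for model in MODELS}
```

Building the prompt once and fanning it out keeps the comparison fair: no model sees a truncated document or a differently worded question.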
| Model | Time | Output tokens | Reasoning tokens | Assumptions found |
|---|---|---|---|---|
| GPT-5 | 99s | 9,418 | 5,696 | 35 |
| GPT-5 Mini | 93s | 5,309 | 1,792 | 21 |
| Claude Sonnet 4.6 | 38s | 1,792 | (internal) | 17 |
**Coverage analysis — can Mini + Sonnet together replace GPT-5?**
Each of GPT-5's 35 findings was categorized by whether Mini, Sonnet, or both
also identified the same assumption:
- **Covered by BOTH Mini and Sonnet:** ~12 findings (common ground — any model
finds these: idempotency, single-writer, clock sync, instrument resolution,
fill immutability, reconciliation gate, backpressure, fill correlation, event
ordering, audit scalability, PortfolioRisk bottleneck)
- **Covered by Mini only (not Sonnet):** ~7 findings (transactional atomicity,
audit causal consistency, modification-in-flight enforcement, OM throughput,
decimal precision, PM/PR close-only race, partition duplicate submit)
- **Covered by Sonnet only (not Mini):** ~6 findings (market data feed rates,
pipeline-vs-market speed, corporate actions atomicity, kill switch partition,
shared port isolation, market close vs auction fills)
- **Union(Mini + Sonnet) total coverage:** ~25/35 = **~71%** of GPT-5's findings
- **GPT-5 unique (missed by both):** ~10 findings under the lenient matching behind the 71% figure, up to ~18 under stricter matching
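The coverage arithmetic above can be sketched with plain set operations. The integer IDs are placeholders standing in for GPT-5's 35 findings (the real matching was done by hand and is not reproduced here), and the bucket sizes use the approximate lenient counts from the list.

```python
# Placeholder IDs 0..34 stand in for GPT-5's 35 findings. Bucket sizes are
# the approximate counts from the list above (12 both, 7 Mini-only,
# 6 Sonnet-only); which ID lands in which bucket is arbitrary.
gpt5 = set(range(35))
both = set(range(0, 12))
mini_only = set(range(12, 19))
sonnet_only = set(range(19, 25))

# Restricted to findings that GPT-5 also made.
mini = both | mini_only
sonnet = both | sonnet_only


def coverage(reference: set, *candidates: set) -> float:
    """Fraction of the reference findings found by at least one candidate."""
    covered = reference & set().union(*candidates)
    return len(covered) / len(reference)


mini_cov = coverage(gpt5, mini)
sonnet_cov = coverage(gpt5, sonnet)
union_cov = coverage(gpt5, mini, sonnet)
missed = gpt5 - (mini | sonnet)

print(f"Mini alone:   {mini_cov:.0%}")
print(f"Sonnet alone: {sonnet_cov:.0%}")
print(f"Union:        {union_cov:.0%}")
print(f"GPT-5 only:   {len(missed)} findings")
```

Under these counts the union reaches 25 of 35 findings while each cheaper model alone covers roughly half, which is why the pair is evaluated together rather than individually.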
**What GPT-5 uniquely found that the cheaper pair missed:**
The missing 29% is NOT random — it's systematically different in character:
1. **Operational edge cases:** Default TIF "day" broker semantics, OrderRate
counting retries, extended-hours MarketHours mismatch, fractional quantities,
local expiry timer precision per instrument
2. **Design-level interaction gaps:** PortfolioRisk concurrent decision race
(snapshot stale between two parallel approvals), re-validation gap between
approval and submit, decision loss on crash after audit write
3. **Domain-specific knowledge:** Manual broker-side actions conflicting with
state machine, options/complex instrument position_effect mapping, Decision→Order
1:1 invariant vs broker auto-splitting, wash sale retroactive P&L mutation
4. **Architectural observations:** Reduction re-entry rule insufficiency,
PortfolioMonitor coalescing vs fast breach detection, multi-aggregator fanout
and audit partial writes, replay/backtest alignment with production controls
These share a common trait: they require **domain expertise** (knowing how brokers
actually behave, how regulatory rules interact, how production trading systems
fail in practice) combined with **architectural reasoning** (how the design's own
mechanisms interact under those real-world conditions). The cheaper models find
assumptions about the document's internal consistency; GPT-5 additionally finds
assumptions about the document's relationship to the external world it must
operate in.
**GPT-5 Mini vs Sonnet 4.6 — complementary, not redundant:**
Mini and Sonnet covered different gaps:
- Mini was stronger on **internal consistency** (transactional atomicity, causal
consistency, decimal precision, modification serialization)
- Sonnet was stronger on **external interactions** (market data feeds, corporate
actions, kill switch distribution, shared resource isolation)
This aligns with previous findings: Mini reasons about implementation mechanics;
Sonnet reasons about system boundaries and external interactions. Their union
covers more ground than either alone.
**Cost comparison:**
| Approach | Total tokens | Approx. cost | Coverage of GPT-5 |
|---|---|---|---|
| GPT-5 alone | ~21K (9.4K output + 5.7K reasoning) | ~$0.80 | 100% (35 findings) |
| Mini + Sonnet | ~7.1K output + 1.8K reasoning | ~$0.25 | ~71% (25/35 findings) |
| All three | ~28K total | ~$1.05 | >100% (35 + unique Sonnet/Mini extras) |
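The ratios quoted from this table can be checked with a few lines of arithmetic. This is a minimal sketch using the approximate dollar figures and the lenient coverage count from this finding; the variable names are illustrative.

```python
# Approximate figures copied from the cost table above.
GPT5_COST = 0.80       # USD, GPT-5 alone
PAIR_COST = 0.25       # USD, GPT-5 Mini + Sonnet 4.6
GPT5_FINDINGS = 35
PAIR_COVERED = 25      # GPT-5 findings also found by the pair (lenient count)

cost_ratio = PAIR_COST / GPT5_COST
coverage_ratio = PAIR_COVERED / GPT5_FINDINGS
all_three_cost = GPT5_COST + PAIR_COST

# Cost per GPT-5 finding covered; this ignores how valuable each finding is.
gpt5_cost_per_finding = GPT5_COST / GPT5_FINDINGS
pair_cost_per_finding = PAIR_COST / PAIR_COVERED

print(f"Pair cost relative to GPT-5: {cost_ratio:.0%}")
print(f"Pair coverage of GPT-5:      {coverage_ratio:.0%}")
print(f"All three:                   ${all_three_cost:.2f}")
print(f"GPT-5 per finding:           ${gpt5_cost_per_finding:.3f}")
print(f"Pair per covered finding:    ${pair_cost_per_finding:.3f}")
```

Per covered finding the pair is cheaper, but the calculation treats all findings as equally valuable, which is not the case for the findings only GPT-5 produced.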
**Key insight — the 71% coverage is a floor, not a ceiling:**
The union covers 71% of GPT-5's specific findings. But Mini and Sonnet each
also produced findings that GPT-5 DIDN'T make:
- Sonnet: DailyLossLimit query performance scaling, instrument reference data
propagation atomicity across components
- Mini: Signal audit correlation ambiguity under replay/duplicate ticks
So the total unique finding space is LARGER than what any single model produces.
Running all three produces the most comprehensive analysis.
**Answer to the open question: "Would running GPT-5 Mini + Sonnet together
approach GPT-5's coverage at lower combined cost?"**
**Partially.** The pair covers ~71% of GPT-5's findings at ~31% of the cost.
But the missing 29% is disproportionately valuable — it contains the
domain-specific, interaction-level, real-world-knowledge findings that are
most likely to prevent production incidents. For a quick sanity check or
first-pass screening, Mini + Sonnet is excellent value. For architecture
review where completeness matters (financial system, safety-critical), GPT-5
is not replaceable by cheaper models: its unique findings flag exactly the
assumptions that would cause real-world failures.
**Practical implication:** The optimal strategy depends on stakes:
- **Low stakes** (internal doc review, non-critical systems): Mini + Sonnet
is 71% coverage at 31% cost — strong ROI
- **High stakes** (financial systems, safety-critical): run all three — the
~$1 total cost is irrelevant vs the value of the extra 10-18 findings
- **Budget-conscious high stakes:** run GPT-5 alone — it subsumes most of
what Mini + Sonnet find, and adds the critical domain-knowledge findings
The cost argument for Mini + Sonnet as a GPT-5 REPLACEMENT doesn't hold for
important work. The cost argument for Mini + Sonnet as a GPT-5 COMPLEMENT
is strong — they catch a few things GPT-5 misses, and the union of all three
is the most thorough analysis available.
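The three tiers above amount to a small lookup from stakes to model set. The sketch below encodes that rule; the `Stakes` enum and the `choose_models` function are hypothetical names introduced for illustration, not part of any existing tooling.

```python
from enum import Enum


class Stakes(Enum):
    LOW = "low"
    HIGH = "high"
    HIGH_BUDGET_CONSCIOUS = "high, budget-conscious"


def choose_models(stakes: Stakes) -> list[str]:
    """Return the models to run for a review, following the tiers above."""
    if stakes is Stakes.LOW:
        # Roughly 71% of GPT-5's findings at roughly 31% of the cost.
        return ["GPT-5 Mini", "Claude Sonnet 4.6"]
    if stakes is Stakes.HIGH_BUDGET_CONSCIOUS:
        # GPT-5 subsumes most of what the cheaper pair finds.
        return ["GPT-5"]
    # High stakes: the union of all three is the most thorough option.
    return ["GPT-5", "GPT-5 Mini", "Claude Sonnet 4.6"]


for level in Stakes:
    print(f"{level.value}: {', '.join(choose_models(level))}")
```

The high-stakes branch is the fallthrough default, so an unrecognized stakes level gets the most thorough review rather than the cheapest one.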
**Document complexity observation:**
This is the largest document tested (1,110 lines vs previous 185-785 lines).
GPT-5's finding count scaled up (35 vs 20-26 on smaller docs) while maintaining
quality, with no padding from obvious or low-value findings. Mini also scaled (21 vs
6 on the 459-line doc in Finding #14). Sonnet scaled less (17 vs 12-17 on smaller
docs) — it appears to have a natural output ceiling regardless of document size,
consistent with its self-filtering behavior observed in previous findings.
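One way to make the scaling claim for Mini concrete is to compare growth in finding count against growth in document length. This is a minimal sketch using only the two Mini data points quoted above; the documents differ in content, so the result is indicative rather than a controlled measurement.

```python
# GPT-5 Mini data points quoted above (Finding #14 and this finding).
small_lines, small_findings = 459, 6
large_lines, large_findings = 1110, 21

line_ratio = large_lines / small_lines            # growth in document length
finding_ratio = large_findings / small_findings   # growth in finding count

# Findings per 100 lines of source document.
density_small = small_findings / small_lines * 100
density_large = large_findings / large_lines * 100

print(f"Document length grew {line_ratio:.2f}x")
print(f"Finding count grew   {finding_ratio:.2f}x")
print(f"Density: {density_small:.2f} -> {density_large:.2f} findings per 100 lines")
```

Mini's finding count grew faster than the document length between these two runs, whereas Sonnet's count of 17 sits at the top of its earlier 12-17 range despite the larger input.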