# Finding 19: Union coverage test: GPT-5 Mini + Sonnet 4.6 covers ~71% of GPT-5's findings; the missing 29% is where the real value lives

**Date:** 2026-05-04
**Task:** Identify hidden assumptions in gargoyle's `trading-pipeline.md` (1,110 lines,
~62 KB), the most complex document tested so far, covering the full end-to-end path
from tick ingestion through order execution.
**How we used them:** Same document (full text, no truncation) + same focused analytical
question to all 3 models via HAI proxy. Standard hidden-assumption prompt with 5
categories (runtime behavior, external dependencies, timing/ordering, scale/load,
uncovered failure modes). Required specific output format per finding. No tools, no
project context beyond the document itself.

| Model | Time | Output tokens | Reasoning tokens | Assumptions found |
|---|---|---|---|---|
| GPT-5 | 99s | 9,418 | 5,696 | 35 |
| GPT-5 Mini | 93s | 5,309 | 1,792 | 21 |
| Claude Sonnet 4.6 | 38s | 1,792 | (internal) | 17 |

**Coverage analysis: can Mini + Sonnet together replace GPT-5?**

We categorized each of GPT-5's 35 findings by whether Mini, Sonnet, or both
also identified the same assumption:

- **Covered by BOTH Mini and Sonnet:** ~12 findings (common ground that any model
  finds: idempotency, single-writer, clock sync, instrument resolution,
  fill immutability, reconciliation gate, backpressure, fill correlation, event
  ordering, audit scalability, PortfolioRisk bottleneck)
- **Covered by Mini only (not Sonnet):** ~7 findings (transactional atomicity,
  audit causal consistency, modification-in-flight enforcement, OM throughput,
  decimal precision, PM/PR close-only race, partition duplicate submit)
- **Covered by Sonnet only (not Mini):** ~6 findings (market data feed rates,
  pipeline-vs-market speed, corporate actions atomicity, kill switch partition,
  shared port isolation, market close vs auction fills)
- **Union(Mini + Sonnet) total coverage:** ~25/35 = **~71%** of GPT-5's findings
- **GPT-5 unique (missed by both):** ~10-18 findings, depending on how strictly a
  match is judged (the lenient count of ~10 is the remainder, 35 - 25)

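The coverage arithmetic above can be reproduced with plain set operations. The following is a minimal sketch, not the procedure actually used for the analysis: the integer IDs are placeholders for GPT-5's findings, the helper name `union_coverage` is hypothetical, and the sets model only which GPT-5 findings each cheaper model matched (not each model's own finding count).

```python
def union_coverage(reference: set, *candidates: set) -> float:
    """Return the fraction of `reference` that appears in the union of `candidates`."""
    if not reference:
        raise ValueError("reference set must not be empty")
    covered = reference & set().union(*candidates)
    return len(covered) / len(reference)


# Placeholder IDs 0..34 stand in for GPT-5's 35 findings (not real finding names).
gpt5 = set(range(35))
both = set(range(0, 12))          # ~12 matched by both Mini and Sonnet
mini_only = set(range(12, 19))    # ~7 matched by Mini only
sonnet_only = set(range(19, 25))  # ~6 matched by Sonnet only

mini = both | mini_only
sonnet = both | sonnet_only

coverage = union_coverage(gpt5, mini, sonnet)
missed = gpt5 - (mini | sonnet)
print(f"union coverage: {coverage:.1%}, missed by both: {len(missed)}")

# Sensitivity to match strictness: the report gives ~10 to 18 missed findings.
for missed_count in (10, 18):
    print(f"missed={missed_count}: coverage {(35 - missed_count) / 35:.1%}")
```

With the lenient count this reproduces 25/35, about 71%. At the strict end of the reported range (18 missed) the same arithmetic gives 17/35, about 49%, so the headline figure is best read as the optimistic end of the range.
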
**What GPT-5 uniquely found that the cheaper pair missed:**

The missing 29% is NOT random; it is systematically different in character:

1. **Operational edge cases:** Default TIF "day" broker semantics, OrderRate
   counting retries, extended-hours MarketHours mismatch, fractional quantities,
   local expiry timer precision per instrument
2. **Design-level interaction gaps:** PortfolioRisk concurrent decision race
   (snapshot stale between two parallel approvals), re-validation gap between
   approval and submit, decision loss on crash after audit write
3. **Domain-specific knowledge:** Manual broker-side actions conflicting with
   state machine, options/complex instrument position_effect mapping, Decision→Order
   1:1 invariant vs broker auto-splitting, wash sale retroactive P&L mutation
4. **Architectural observations:** Reduction re-entry rule insufficiency,
   PortfolioMonitor coalescing vs fast breach detection, multi-aggregator fanout
   and audit partial writes, replay/backtest alignment with production controls

These share a common trait: they require **domain expertise** (knowing how brokers
actually behave, how regulatory rules interact, how production trading systems
fail in practice) combined with **architectural reasoning** (how the design's own
mechanisms interact under those real-world conditions). The cheaper models find
assumptions about the document's internal consistency; GPT-5 additionally finds
assumptions about the document's relationship to the external world it must
operate in.

**GPT-5 Mini vs Sonnet 4.6 (complementary, not redundant):**

Mini and Sonnet covered different gaps:
- Mini was stronger on **internal consistency** (transactional atomicity, causal
  consistency, decimal precision, modification serialization)
- Sonnet was stronger on **external interactions** (market data feeds, corporate
  actions, kill switch distribution, shared resource isolation)

This aligns with previous findings: Mini reasons about implementation mechanics;
Sonnet reasons about system boundaries and external interactions. Their union
covers more ground than either alone.

**Cost comparison:**

| Approach | Total tokens | Approx. cost | Coverage of GPT-5 |
|---|---|---|---|
| GPT-5 alone | ~21K (9.4K output + 5.7K reasoning) | ~$0.80 | 100% (35 findings) |
| Mini + Sonnet | ~7.1K output + 1.8K reasoning | ~$0.25 | ~71% (25/35 findings) |
| All three | ~28K total | ~$1.05 | 100% (35 findings), plus findings unique to Sonnet/Mini |

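To make the trade-off in the table concrete, the sketch below derives a few ratios from its approximate figures. The per-finding and extra-cost numbers are derived metrics that do not appear in the original analysis, and the `Approach` class is a hypothetical helper for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Approach:
    name: str
    cost_usd: float     # approximate cost from the table
    gpt5_findings: int  # GPT-5 findings covered (out of 35)


gpt5 = Approach("GPT-5 alone", 0.80, 35)
pair = Approach("Mini + Sonnet", 0.25, 25)

cost_ratio = pair.cost_usd / gpt5.cost_usd
coverage_ratio = pair.gpt5_findings / gpt5.gpt5_findings

# Derived metric (assumption, not in the report): extra spend of choosing
# GPT-5 over the pair, divided by the extra findings that choice buys.
extra_cost_per_finding = (gpt5.cost_usd - pair.cost_usd) / (
    gpt5.gpt5_findings - pair.gpt5_findings
)

for approach in (gpt5, pair):
    per_finding = approach.cost_usd / approach.gpt5_findings
    print(f"{approach.name}: ${per_finding:.3f} per covered finding")
print(f"pair cost ratio: {cost_ratio:.0%}, coverage ratio: {coverage_ratio:.0%}")
print(f"extra cost per GPT-5-only finding: ${extra_cost_per_finding:.3f}")
```

Under the lenient count of 10 GPT-5-only findings, the extra spend works out to roughly five to six cents per additional finding, which is the arithmetic behind treating cost as a minor factor for high-stakes reviews.
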
**Key insight: the 71% coverage is a floor, not a ceiling.**

The union covers 71% of GPT-5's specific findings. But Mini and Sonnet each
also produced findings that GPT-5 DIDN'T make:
- Sonnet: DailyLossLimit query performance scaling, instrument reference data
  propagation atomicity across components
- Mini: Signal audit correlation ambiguity under replay/duplicate ticks

So the total unique finding space is LARGER than any single model's output. Running all
three produces the most comprehensive analysis.

**Answer to the open question: "Would running GPT-5 Mini + Sonnet together
approach GPT-5's coverage at lower combined cost?"**

**Partially.** The pair covers ~71% of GPT-5's findings at ~31% of the cost.
But the missing 29% is disproportionately valuable: it contains the
domain-specific, interaction-level, real-world-knowledge findings that are
most likely to prevent production incidents. For a quick sanity check or
first-pass screening, Mini + Sonnet is excellent value. For architecture
review where completeness matters (financial systems, safety-critical systems), GPT-5
is not replaceable by cheaper models: its unique findings are exactly the
ones that point at real-world failure modes.

**Practical implication:** The optimal strategy depends on stakes:
- **Low stakes** (internal doc review, non-critical systems): Mini + Sonnet
  gives ~71% coverage at ~31% of the cost, a strong ROI
- **High stakes** (financial systems, safety-critical): run all three. The
  ~$1 total cost is negligible against the value of the extra 10-18 findings
- **Budget-conscious high stakes:** run GPT-5 alone. It subsumes most of
  what Mini + Sonnet find, and adds the critical domain-knowledge findings

The cost argument for Mini + Sonnet as a GPT-5 REPLACEMENT doesn't hold for
important work. The cost argument for Mini + Sonnet as a GPT-5 COMPLEMENT
is strong: they catch a few things GPT-5 misses, and the union of all three
is the most thorough analysis available.

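The three tiers above amount to a small lookup from stakes to the set of models to run. A minimal sketch of that mapping follows, assuming a hypothetical `Stakes` enum and `models_for` helper; the model labels are display names taken from this write-up, not real API model identifiers.

```python
from enum import Enum


class Stakes(Enum):
    LOW = "low"
    HIGH = "high"
    HIGH_ON_A_BUDGET = "high_on_a_budget"


# Model labels are illustrative strings, not real API model identifiers.
MODEL_PLAN = {
    Stakes.LOW: ("GPT-5 Mini", "Sonnet 4.6"),
    Stakes.HIGH: ("GPT-5", "GPT-5 Mini", "Sonnet 4.6"),
    Stakes.HIGH_ON_A_BUDGET: ("GPT-5",),
}


def models_for(stakes: Stakes) -> tuple[str, ...]:
    """Return the models to run for a review at the given stakes level."""
    return MODEL_PLAN[stakes]


for level in Stakes:
    print(level.value, "->", ", ".join(models_for(level)))
```

Keeping the plan as data rather than branching logic makes it easy to revise if later findings change the recommendation for a tier.
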
**Document complexity observation:**
This is the largest document tested (1,110 lines vs previous 185-785 lines).
GPT-5's finding count scaled up (35 vs 20-26 on smaller docs) while maintaining
quality, with no padding from obvious/low-value findings. Mini also scaled (21 vs
6 on the 459-line doc in Finding #14). Sonnet scaled less (17 vs 12-17 on smaller
docs). It appears to have a natural output ceiling regardless of document size,
consistent with its self-filtering behavior observed in previous findings.