# Finding 7: Token budget matters more than model size for gap analysis (confirmed)

**Date:** 2026-05-03

**Task:** Identify unaddressed failure scenarios in gargoyle's `failure-modes.md` (383 lines, ~25KB)

**How we used them:** Same document, same analytical question ("What failure scenarios are NOT covered?"), three models: GPT-5 with 16K `max_completion_tokens`, Sonnet 4 with 4K `max_tokens`, and GPT-4.1 Mini with 4K `max_completion_tokens`. No project context beyond the document itself. Pure gap-analysis task.

**Results:**
- GPT-5 (16K budget): 28 gaps, most exhaustive. Found domain-specific edge cases the others missed entirely: ClOrdID collision across restarts, fractional share rounding, broker maintenance windows (410/426), hot code upgrades, regulatory halts vs staleness distinction, cancel-ack-then-late-fill race, DNS/TLS as distinct from network outage.
- Sonnet 4 (4K budget): 12 gaps, concise. Unique framing: distinguished latency degradation from outage (subtle but actionable), and ETS corruption from ETS loss.
- GPT-4.1 Mini (4K budget): 13 gaps with summary table. Unique angles: unknown broker status enum values, configuration schema mismatches on cold-start, malformed signals from logic bugs (not just crashes).

**Overlap (all three):** Rate limiting, clock skew, resource exhaustion, DB failures, message backpressure, partial connectivity.

**Key insight:** GPT-5's 4K attempt produced ZERO output (`finish_reason: length`): all tokens were consumed by internal reasoning. At 16K it produced the richest analysis. This confirms finding #3 (GPT-5 needs generous token budgets) AND adds a new observation: for open-ended analytical questions, GPT-5's reasoning overhead takes a proportionally larger share of the budget. Sonnet and Mini both produced useful output at 4K because they don't burn tokens on chain-of-thought.

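
The overlap and unique-findings breakdown reported in the results can be reproduced mechanically once each model's gaps are reduced to comparable labels. The sketch below is one way to do that, not the procedure actually used for this finding. It assumes the labels were normalized by hand, uses only a subset of the gaps named in the results, and the names `gaps_by_model` and `split_findings` are hypothetical.

```python
from collections import Counter

# Subset of gap labels copied from the results; normalization is assumed
# to have been done by hand so that equal gaps have equal strings.
gaps_by_model = {
    "gpt-5": {
        "rate limiting", "clock skew",
        "ClOrdID collision across restarts", "fractional share rounding",
    },
    "sonnet-4": {
        "rate limiting", "clock skew",
        "latency degradation vs outage", "ETS corruption vs loss",
    },
    "gpt-4.1-mini": {
        "rate limiting", "clock skew",
        "unknown broker status enum values",
        "config schema mismatch on cold start",
    },
}


def split_findings(gaps_by_model):
    """Return (gaps found by every model, gaps found by exactly one model)."""
    # Count how many models reported each gap label.
    counts = Counter(g for gaps in gaps_by_model.values() for g in gaps)
    n_models = len(gaps_by_model)
    shared = {g for g, c in counts.items() if c == n_models}
    unique = {
        model: {g for g in gaps if counts[g] == 1}
        for model, gaps in gaps_by_model.items()
    }
    return shared, unique


shared, unique = split_findings(gaps_by_model)
print("all three:", sorted(shared))
for model, gaps in unique.items():
    print(model, "only:", sorted(gaps))
```

Exact string matching is the weak point of this approach: two models rarely phrase the same gap identically, so the manual normalization step carries most of the work.
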
**Model personality confirmed:**
- GPT-5: exhaustive, domain-aware, finds edge cases a senior SRE would know
- Sonnet: precise, architectural, finds design-level distinctions
- GPT-4.1 Mini: structured, systematic, finds enumeration gaps

**Practical implication:** For failure mode / gap analysis on design docs:
- GPT-5 with ≥16K tokens for maximum coverage (most unique findings)
- Sonnet for architectural framing ("this is really two different problems")
- Mini for completeness checking ("what about this enum value?")
- Running all three costs ~$0.50 and catches gaps none alone would find
- GPT-5 at 4K is USELESS for this task: always give it room to think

**Note on GPT-5 reasoning overhead:** First attempt at 4K `max_completion_tokens` returned empty content with `finish_reason: length`. The model spent all 4K tokens on internal reasoning and produced nothing. This is worse than a short answer: it's zero value for non-zero cost. Always budget ≥16K for GPT-5 analytical tasks.
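
The empty-output failure described in the note is easy to miss if the caller only checks for an exception. The sketch below shows one way to detect it and retry with a larger budget. It assumes the response has already been converted to a plain dict shaped like a Chat Completions result (`choices[0].finish_reason`, `choices[0].message.content`). `call_model`, `is_starved`, `complete_with_escalation`, and the stub are hypothetical names, and 4096/16384 are assumed stand-ins for the "4K" and "16K" budgets above.

```python
def is_starved(response: dict) -> bool:
    """True when the budget ran out before any visible content was produced."""
    choice = response["choices"][0]
    content = choice["message"].get("content") or ""
    return choice.get("finish_reason") == "length" and not content.strip()


def complete_with_escalation(call_model, budgets=(4096, 16384)):
    """Try each budget in order; return (budget, response) for the first usable one.

    call_model is a hypothetical callable taking a token budget and returning
    a response dict; wire it to the real client in practice.
    """
    for budget in budgets:
        response = call_model(budget)
        if not is_starved(response):
            return budget, response
    raise RuntimeError(f"no visible output at any budget in {budgets}")


def stub_call_model(budget: int) -> dict:
    """Stand-in for a real API call, mimicking the behaviour observed above."""
    if budget < 16384:
        # All tokens spent on reasoning: empty content, truncated by length.
        return {"choices": [{"finish_reason": "length", "message": {"content": ""}}]}
    return {"choices": [{"finish_reason": "stop", "message": {"content": "gap list"}}]}


budget, response = complete_with_escalation(stub_call_model)
print(budget, response["choices"][0]["finish_reason"])
```

Every starved attempt is still billed, so for GPT-5 analytical tasks it is cheaper to pass a `budgets` sequence that starts at the ≥16K floor and keep the check only as a guard.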