model-research/findings/2026-05-03-07b-token-budget-matters-more-than.md

# Finding 7: Token budget matters more than model size for gap analysis (confirmed)

**Date:** 2026-05-03
**Task:** Identify unaddressed failure scenarios in gargoyle's `failure-modes.md` (383 lines, ~25KB)
**How we used them:** Same document, same analytical question ("What failure scenarios
are NOT covered?"), three models. GPT-5 with 16K max_completion_tokens, Sonnet 4
with 4K max_tokens, GPT-4.1 Mini with 4K max_completion_tokens. No project context
beyond the document itself. Pure gap-analysis task.

**Results:**
- GPT-5 (16K budget): 28 gaps, most exhaustive. Found domain-specific edge cases
  others missed entirely: ClOrdID collision across restarts, fractional share rounding,
  broker maintenance windows (410/426), hot code upgrades, regulatory halts vs staleness
  distinction, cancel-ack-then-late-fill race, DNS/TLS as distinct from network outage.
- Sonnet 4 (4K budget): 12 gaps, concise. Unique framing: distinguished latency
  degradation from outage (subtle but actionable). ETS corruption vs loss.
- GPT-4.1 Mini (4K budget): 13 gaps with summary table. Unique angles: unknown broker
  status enum values, configuration schema mismatches on cold-start, malformed signals
  from logic bugs (not just crashes).

**Overlap (all three):** Rate limiting, clock skew, resource exhaustion, DB failures,
message backpressure, partial connectivity.

**Key insight:** GPT-5's 4K attempt produced ZERO output (finish_reason: length) —
all tokens consumed by internal reasoning. At 16K it produced the richest analysis.
This confirms finding #3 (GPT-5 needs generous token budgets) AND adds a new
observation: for open-ended analytical questions, GPT-5's reasoning overhead is
proportionally larger. The 4K models (Sonnet, Mini) both produced useful output at
4K because they don't burn tokens on chain-of-thought.

**Model personality confirmed:**
- GPT-5: exhaustive, domain-aware, finds edge cases a senior SRE would know
- Sonnet: precise, architectural, finds design-level distinctions
- GPT-4.1 Mini: structured, systematic, finds enumeration gaps

**Practical implication:** For failure mode / gap analysis on design docs:
- GPT-5 with ≥16K tokens for maximum coverage (most unique findings)
- Sonnet for architectural framing ("this is really two different problems")
- Mini for completeness checking ("what about this enum value?")
- Running all three costs ~$0.50 and catches gaps none alone would find
- GPT-5 at 4K is USELESS for this task — always give it room to think

**Note on GPT-5 reasoning overhead:** First attempt at 4K max_completion_tokens
returned empty content with finish_reason: length. The model spent all 4K tokens
on internal reasoning and produced nothing. This is worse than a short answer —
it's zero value for non-zero cost. Always budget ≥16K for GPT-5 analytical tasks.