Files
model-research/findings/2026-05-03-07b-token-budget-matters-more-than.md
T
Rodin 6af8a6ee10 refactor(findings): split ALL-FINDINGS.md into per-experiment files
Break the monolithic 3249-line findings file into 29 individual files,
one per experiment. Each file is named YYYY-MM-DD-NN-slug.md for easy
chronological sorting and discovery.

No content changes — purely structural reorganization.
2026-05-06 07:15:50 -07:00

2.8 KiB

Finding 7: Token budget matters more than model size for gap analysis (confirmed)

Date: 2026-05-03 Task: Identify unaddressed failure scenarios in gargoyle's failure-modes.md (383 lines, ~25KB) How we used them: Same document, same analytical question ("What failure scenarios are NOT covered?"), three models. GPT-5 with 16K max_completion_tokens, Sonnet 4 with 4K max_tokens, GPT-4.1 Mini with 4K max_completion_tokens. No project context beyond the document itself. Pure gap-analysis task.

Results:

  • GPT-5 (16K budget): 28 gaps, most exhaustive. Found domain-specific edge cases others missed entirely: ClOrdID collision across restarts, fractional share rounding, broker maintenance windows (410/426), hot code upgrades, regulatory halts vs staleness distinction, cancel-ack-then-late-fill race, DNS/TLS as distinct from network outage.
  • Sonnet 4 (4K budget): 12 gaps, concise. Unique framing: distinguished latency degradation from outage (subtle but actionable). ETS corruption vs loss.
  • GPT-4.1 Mini (4K budget): 13 gaps with summary table. Unique angles: unknown broker status enum values, configuration schema mismatches on cold-start, malformed signals from logic bugs (not just crashes).

Overlap (all three): Rate limiting, clock skew, resource exhaustion, DB failures, message backpressure, partial connectivity.

Key insight: GPT-5's 4K attempt produced ZERO output (finish_reason: length) — all tokens consumed by internal reasoning. At 16K it produced the richest analysis. This confirms finding #3 (GPT-5 needs generous token budgets) AND adds a new observation: for open-ended analytical questions, GPT-5's reasoning overhead is proportionally larger. The 4K models (Sonnet, Mini) both produced useful output at 4K because they don't burn tokens on chain-of-thought.

Model personality confirmed:

  • GPT-5: exhaustive, domain-aware, finds edge cases a senior SRE would know
  • Sonnet: precise, architectural, finds design-level distinctions
  • GPT-4.1 Mini: structured, systematic, finds enumeration gaps

Practical implication: For failure mode / gap analysis on design docs:

  • GPT-5 with ≥16K tokens for maximum coverage (most unique findings)
  • Sonnet for architectural framing ("this is really two different problems")
  • Mini for completeness checking ("what about this enum value?")
  • Running all three costs ~$0.50 and catches gaps none alone would find
  • GPT-5 at 4K is USELESS for this task — always give it room to think

Note on GPT-5 reasoning overhead: First attempt at 4K max_completion_tokens returned empty content with finish_reason: length. The model spent all 4K tokens on internal reasoning and produced nothing. This is worse than a short answer — it's zero value for non-zero cost. Always budget ≥16K for GPT-5 analytical tasks.