Break the monolithic 3249-line findings file into 29 individual files, one per experiment. Each file is named YYYY-MM-DD-NN-slug.md for easy chronological sorting and discovery. No content changes — purely structural reorganization.
2.8 KiB
Finding 7: Token budget matters more than model size for gap analysis (confirmed)
Date: 2026-05-03
Task: Identify unaddressed failure scenarios in gargoyle's failure-modes.md (383 lines, ~25KB)
How we used them: Same document, same analytical question ("What failure scenarios
are NOT covered?"), three models. GPT-5 with 16K max_completion_tokens, Sonnet 4
with 4K max_tokens, GPT-4.1 Mini with 4K max_completion_tokens. No project context
beyond the document itself. Pure gap-analysis task.
Results:
- GPT-5 (16K budget): 28 gaps, most exhaustive. Found domain-specific edge cases others missed entirely: ClOrdID collision across restarts, fractional share rounding, broker maintenance windows (410/426), hot code upgrades, regulatory halts vs staleness distinction, cancel-ack-then-late-fill race, DNS/TLS as distinct from network outage.
- Sonnet 4 (4K budget): 12 gaps, concise. Unique framing: distinguished latency degradation from outage (subtle but actionable). ETS corruption vs loss.
- GPT-4.1 Mini (4K budget): 13 gaps with summary table. Unique angles: unknown broker status enum values, configuration schema mismatches on cold-start, malformed signals from logic bugs (not just crashes).
Overlap (all three): Rate limiting, clock skew, resource exhaustion, DB failures, message backpressure, partial connectivity.
Key insight: GPT-5's 4K attempt produced ZERO output (finish_reason: length) — all tokens consumed by internal reasoning. At 16K it produced the richest analysis. This confirms finding #3 (GPT-5 needs generous token budgets) AND adds a new observation: for open-ended analytical questions, GPT-5's reasoning overhead is proportionally larger. The 4K models (Sonnet, Mini) both produced useful output at 4K because they don't burn tokens on chain-of-thought.
Model personality confirmed:
- GPT-5: exhaustive, domain-aware, finds edge cases a senior SRE would know
- Sonnet: precise, architectural, finds design-level distinctions
- GPT-4.1 Mini: structured, systematic, finds enumeration gaps
Practical implication: For failure mode / gap analysis on design docs:
- GPT-5 with ≥16K tokens for maximum coverage (most unique findings)
- Sonnet for architectural framing ("this is really two different problems")
- Mini for completeness checking ("what about this enum value?")
- Running all three costs ~$0.50 and catches gaps none alone would find
- GPT-5 at 4K is USELESS for this task — always give it room to think
Note on GPT-5 reasoning overhead: First attempt at 4K max_completion_tokens returned empty content with finish_reason: length. The model spent all 4K tokens on internal reasoning and produced nothing. This is worse than a short answer — it's zero value for non-zero cost. Always budget ≥16K for GPT-5 analytical tasks.