6af8a6ee10
Break the monolithic 3249-line findings file into 29 individual files, one per experiment. Each file is named YYYY-MM-DD-NN-slug.md for easy chronological sorting and discovery. No content changes — purely structural reorganization.
47 lines
2.8 KiB
Markdown
47 lines
2.8 KiB
Markdown
# Finding 7: Token budget matters more than model size for gap analysis (confirmed)
|
|
|
|
**Date:** 2026-05-03
|
|
**Task:** Identify unaddressed failure scenarios in gargoyle's `failure-modes.md` (383 lines, ~25KB)
|
|
**How we used them:** Same document, same analytical question ("What failure scenarios
|
|
are NOT covered?"), three models. GPT-5 with 16K max_completion_tokens, Sonnet 4
|
|
with 4K max_tokens, GPT-4.1 Mini with 4K max_completion_tokens. No project context
|
|
beyond the document itself. Pure gap-analysis task.
|
|
|
|
**Results:**
|
|
- GPT-5 (16K budget): 28 gaps, most exhaustive. Found domain-specific edge cases
|
|
others missed entirely: ClOrdID collision across restarts, fractional share rounding,
|
|
broker maintenance windows (410/426), hot code upgrades, regulatory halts vs staleness
|
|
distinction, cancel-ack-then-late-fill race, DNS/TLS as distinct from network outage.
|
|
- Sonnet 4 (4K budget): 12 gaps, concise. Unique framing: distinguished latency
|
|
degradation from outage (subtle but actionable). ETS corruption vs loss.
|
|
- GPT-4.1 Mini (4K budget): 13 gaps with summary table. Unique angles: unknown broker
|
|
status enum values, configuration schema mismatches on cold-start, malformed signals
|
|
from logic bugs (not just crashes).
|
|
|
|
**Overlap (all three):** Rate limiting, clock skew, resource exhaustion, DB failures,
|
|
message backpressure, partial connectivity.
|
|
|
|
**Key insight:** GPT-5's 4K attempt produced ZERO output (finish_reason: length) —
|
|
all tokens consumed by internal reasoning. At 16K it produced the richest analysis.
|
|
This confirms finding #3 (GPT-5 needs generous token budgets) AND adds a new
|
|
observation: for open-ended analytical questions, GPT-5's reasoning overhead is
|
|
proportionally larger. The 4K models (Sonnet, Mini) both produced useful output at
|
|
4K because they don't burn tokens on chain-of-thought.
|
|
|
|
**Model personality confirmed:**
|
|
- GPT-5: exhaustive, domain-aware, finds edge cases a senior SRE would know
|
|
- Sonnet: precise, architectural, finds design-level distinctions
|
|
- GPT-4.1 Mini: structured, systematic, finds enumeration gaps
|
|
|
|
**Practical implication:** For failure mode / gap analysis on design docs:
|
|
- GPT-5 with ≥16K tokens for maximum coverage (most unique findings)
|
|
- Sonnet for architectural framing ("this is really two different problems")
|
|
- Mini for completeness checking ("what about this enum value?")
|
|
- Running all three costs ~$0.50 and catches gaps none alone would find
|
|
- GPT-5 at 4K is USELESS for this task — always give it room to think
|
|
|
|
**Note on GPT-5 reasoning overhead:** First attempt at 4K max_completion_tokens
|
|
returned empty content with finish_reason: length. The model spent all 4K tokens
|
|
on internal reasoning and produced nothing. This is worse than a short answer —
|
|
it's zero value for non-zero cost. Always budget ≥16K for GPT-5 analytical tasks.
|