Files

T

Rodin 6af8a6ee10 refactor(findings): split ALL-FINDINGS.md into per-experiment files

Break the monolithic 3249-line findings file into 29 individual files,
one per experiment. Each file is named YYYY-MM-DD-NN-slug.md for easy
chronological sorting and discovery.

No content changes — purely structural reorganization.

2026-05-06 07:15:50 -07:00

2.8 KiB

Raw Blame History

Finding 7: Token budget matters more than model size for gap analysis (confirmed)

Date: 2026-05-03 Task: Identify unaddressed failure scenarios in gargoyle's failure-modes.md (383 lines, ~25KB) How we used them: Same document, same analytical question ("What failure scenarios are NOT covered?"), three models. GPT-5 with 16K max_completion_tokens, Sonnet 4 with 4K max_tokens, GPT-4.1 Mini with 4K max_completion_tokens. No project context beyond the document itself. Pure gap-analysis task.

Results:

GPT-5 (16K budget): 28 gaps, most exhaustive. Found domain-specific edge cases others missed entirely: ClOrdID collision across restarts, fractional share rounding, broker maintenance windows (410/426), hot code upgrades, regulatory halts vs staleness distinction, cancel-ack-then-late-fill race, DNS/TLS as distinct from network outage.
Sonnet 4 (4K budget): 12 gaps, concise. Unique framing: distinguished latency degradation from outage (subtle but actionable). ETS corruption vs loss.
GPT-4.1 Mini (4K budget): 13 gaps with summary table. Unique angles: unknown broker status enum values, configuration schema mismatches on cold-start, malformed signals from logic bugs (not just crashes).

Overlap (all three): Rate limiting, clock skew, resource exhaustion, DB failures, message backpressure, partial connectivity.

Key insight: GPT-5's 4K attempt produced ZERO output (finish_reason: length) — all tokens consumed by internal reasoning. At 16K it produced the richest analysis. This confirms finding #3 (GPT-5 needs generous token budgets) AND adds a new observation: for open-ended analytical questions, GPT-5's reasoning overhead is proportionally larger. The 4K models (Sonnet, Mini) both produced useful output at 4K because they don't burn tokens on chain-of-thought.

Model personality confirmed:

GPT-5: exhaustive, domain-aware, finds edge cases a senior SRE would know
Sonnet: precise, architectural, finds design-level distinctions
GPT-4.1 Mini: structured, systematic, finds enumeration gaps

Practical implication: For failure mode / gap analysis on design docs:

GPT-5 with ≥16K tokens for maximum coverage (most unique findings)
Sonnet for architectural framing ("this is really two different problems")
Mini for completeness checking ("what about this enum value?")
Running all three costs ~$0.50 and catches gaps none alone would find
GPT-5 at 4K is USELESS for this task — always give it room to think

Note on GPT-5 reasoning overhead: First attempt at 4K max_completion_tokens returned empty content with finish_reason: length. The model spent all 4K tokens on internal reasoning and produced nothing. This is worse than a short answer — it's zero value for non-zero cost. Always budget ≥16K for GPT-5 analytical tasks.

2.8 KiB Raw Blame History

Finding 7: Token budget matters more than model size for gap analysis (confirmed)

2.8 KiB

Raw Blame History