11 KiB
Lessons Learned: Operational Guide for AI Model Selection
Generated: 2026-05-11 09:00 PDT
Based on: 74 experiments (2026-04-26 to 2026-05-11)
This is the actionable distillation. For evidence and methodology, see REPORT.md.
Quick Reference: Model Selection by Task
┌─────────────────────────────────────────────────────────────────┐
│ TASK TYPE DECISION TREE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Is this a VERIFICATION task? │
│ (contradiction, consistency, race condition) │
│ │ │
│ ├─ YES → Use GPT-5 + Opus (skip Sonnet) │
│ │ Sonnet has ~33% precision on verification │
│ │ │
│ └─ NO → Is this CROSS-DOCUMENT? │
│ │ │
│ ├─ YES → Use Opus (2.4x faster, more findings) │
│ │ │
│ └─ NO → Is this HIGH-STAKES? │
│ (financial, safety, regulatory) │
│ │ │
│ ├─ YES → Run all three │
│ │ (GPT-5 + Opus + Sonnet) │
│ │ Total: ~$1-2, worth it │
│ │ │
│ └─ NO → Sonnet first-pass │
│ Add Opus if findings need depth │
│ │
└─────────────────────────────────────────────────────────────────┘
Rules
Rule 1: Match Model to Task Type
| If the task is... | Use this | Not this |
|---|---|---|
| Finding what's missing | GPT-5 | Mini |
| Finding contradictions | Opus | Sonnet |
| Cross-document consistency | Opus | GPT-5 |
| Quick structural scan | Sonnet 4.6 | GPT-5 |
| Broad coverage (noise OK) | Sonnet 4.5 | Sonnet 4.6 |
| Adversarial attack paths | GPT-5 then Opus | Either alone |
| Regulatory compliance | GPT-5 | Opus |
| Operational blind spots | GPT-5 | Sonnet |
Rule 2: Don't Trust Sonnet for Verification
Sonnet finds ~3 contradictions but only ~1 is genuine. The others are misreadings. Use Sonnet for identification tasks (what's here?), not verification tasks (is this true?).
Rule 3: Isolate the Signal
When checking for something specific (bias, contradictions, missing assumptions), extract the relevant text and ask about it directly. Don't bury the question in a broad review mandate. Signal-to-noise ratio matters more than model capability.
Rule 4: Run the Ensemble for High Stakes
For anything financial, safety-critical, or regulatory: run GPT-5 + Opus + Sonnet. Each finds things the others miss. The union is larger than any single model's output. Cost is trivial vs. the value.
Rule 5: Give GPT-5 Enough Tokens
GPT-5 needs max_completion_tokens ≥ 16K. A truncated GPT-5 response is worse than a complete Opus response. Token budget matters more than model size.
Rule 6: Break Large Outputs Into Sections
Single agents die generating 1000+ lines. Rich input is fine; it's output length that kills. For large generation tasks: break into sections, draft in parallel, assemble.
Rule 7: Narrow Framing Doesn't Fix Reasoning Gaps
You cannot make Sonnet match GPT-5/Opus by writing a better prompt. Narrow framing changes WHAT it looks for, not HOW WELL it reasons. The gap is architectural, not prompt engineering.
Operational Playbooks
Playbook A: Architecture Document Review
- First pass (Sonnet 4.6): ~30s, catches structural issues, broken refs, obvious gaps
- Deep analysis (GPT-5): ~90s, finds domain-specific gaps, hidden assumptions, edge cases
- Design tensions (Opus): ~60s, finds where the design contradicts itself
- Merge and dedupe: Union of all three, remove duplicates, sort by severity
Playbook B: Cross-Document Consistency Check
- Use Opus only. It's 2.4x faster than GPT-5 and finds more issues.
- Provide both documents in a single prompt (~25KB max)
- Explicitly exclude omissions in the prompt — you want contradictions, not "Doc A covers X but Doc B doesn't"
Playbook C: Adversarial Security Review
- First pass (GPT-5): Exhaustive enumeration of attack surface
- Extension pass (Opus): Give Opus GPT-5's findings, ask it to critique and extend
- Result: 30% more findings at 28% more cost, with prioritization
Playbook D: Regulatory Compliance Review
- First pass (Sonnet): ~25s, identifies areas of concern
- Deep dive (GPT-5): Regulatory specificity, correct citations, edge cases
- GPT-5's reasoning tokens are spent on verification — trust its citations
Playbook E: Contradiction Detection
- Use GPT-5 + Opus in parallel (not Sonnet)
- GPT-5 finds: Specification conflicts (same scenario, different prescriptions)
- Opus finds: Logical impossibilities (rules that can't coexist)
- Neither dominates — they find different classes of contradiction
Anti-Patterns
❌ Anti-Pattern 1: Using Sonnet for Verification Tasks
What happens: Sonnet reports contradictions that aren't real. You waste time investigating false positives or worse, trust a false negative.
Instead: Use GPT-5 or Opus for any task requiring "is this true?" reasoning.
❌ Anti-Pattern 2: Giving GPT-5 a Broad Mandate
What happens: GPT-5 spawns sub-agents, times out, or dumps raw tool output instead of synthesizing.
Instead: Give GPT-5 explicit single-actor instructions + output format. For Claude, broader mandates are fine.
❌ Anti-Pattern 3: Burying Important Checks in Large Reviews
What happens: The model misses the important thing because it's one of 47 things to check.
Instead: Extract the important check and ask about it specifically. Signal-to-noise ratio matters.
❌ Anti-Pattern 4: Extrapolating Across Task Types
What happens: "GPT-5 was great at X, so I'll use it for Y" — and it's mediocre.
Instead: Task type predicts performance better than "model X is better." Check the task-type table.
❌ Anti-Pattern 5: Skipping the Union
What happens: You run one model, miss things another would have caught, and the bug reaches production.
Instead: For high-stakes work, run the ensemble. The cost is trivial vs. the risk.
❌ Anti-Pattern 6: Tuning Reasoning Effort
What happens: You spend time adjusting low/medium/high reasoning effort parameters.
Instead: Don't bother. It has negligible effect on analytical work. Task type is the lever.
❌ Anti-Pattern 7: Trying to Fix Sonnet with Prompts
What happens: You write increasingly narrow prompts trying to get Sonnet to match GPT-5's reasoning depth.
Instead: Accept that the gap is architectural. Use Sonnet for what it's good at (speed, breadth, structural review), use GPT-5/Opus for reasoning-heavy tasks.
Model Personality Cheat Sheet
| Model | Personality | Default Behavior | Give It |
|---|---|---|---|
| GPT-5 | Exhaustive enumerator | Lists everything systematically | Bounded tasks, explicit output format, single-actor instructions |
| Opus | Design critic | Finds tensions and contradictions | Open-ended analysis, room to reason about boundaries |
| Sonnet 4.6 | Structural scanner | Fast, precise, shallow | Quick first-pass work, structural review |
| Sonnet 4.5 | Broad coverage | More findings, more noise | When you want breadth over precision |
| GPT-4.1 | Generic competent | Stays within document framing | Middle-ground cost-sensitive work |
| GPT-4.1 Mini | Template filler | Formulaic but catches obvious things | Bulk screening, sanity checks |
Opus Superpower
Opus finds where the spec's own assumptions are false. It doesn't just find missing things — it finds things the spec believes to be true that aren't.
Examples:
- "Realized P&L cannot recover" — the de-escalation model assumes all metrics can improve, but this one fundamentally cannot (#31)
- "Forward detection logic is backwards" — spec describes triggers in the wrong direction (#32)
- "Stop-loss defeated by temporal composition" — safety mechanism rendered ineffective by slow strategy (#47)
GPT-5 Superpower
GPT-5 reasons about the document's relationship to the real world. It asks "what must be true about the external world for this to work?"
Examples:
- Broker rate limiting (429s) bypasses "connection lost" detection (#9)
- Corporate actions bypass staleness detection (#9)
- DB "commit unknown outcome" causing restart loops (#9)
- Cross-symbol strategies with partial staleness (#9)
- IRS rule nuances that simplifications violate (#54)
Decision Framework
When to Add Another Model
| Situation | Action |
|---|---|
| Sonnet found nothing | Add Opus (may find design tensions) |
| GPT-5 found lots but all similar | Add Opus (may find different class) |
| Opus found tensions but no enumeration | Add GPT-5 (exhaustive coverage) |
| Cross-document task | Use Opus only (2.4x faster) |
| Regulatory/compliance task | Use GPT-5 (correct citations) |
When NOT to Add Another Model
| Situation | Action |
|---|---|
| Quick structural scan | Sonnet alone is fine |
| Bulk screening | Mini alone is fine |
| Already ran GPT-5 + Opus | Adding Sonnet rarely helps |
| Low-stakes internal doc | One model is enough |
Cost-Benefit Quick Calc
| Risk level | Model cost | Justified? |
|---|---|---|
| Financial/safety | ~$1-2 for ensemble | Always yes |
| Customer-facing | ~$0.50 for GPT-5 | Usually yes |
| Internal process | ~$0.10 for Sonnet | Always yes |
| One-off exploration | ~$0.02 for Mini | Always yes |
What We Still Don't Know
- Corpus bias: All experiments used gargoyle docs. Rankings may differ for other domains.
- Run variance: All findings are single-run. Stochastic variation is unquantified.
- Scale effects: Largest doc tested is 1,110 lines. Unknown behavior at 2000+.
- Non-architecture domains: These findings are for architecture document analysis, not coding, not chat, not creative writing.
Summary: The Two Things That Matter Most
-
Task type determines model choice. Don't pick a model because "it's best." Pick the model that's best for THIS task type.
-
The union beats any single model. For high-stakes work, run the ensemble. Different models find qualitatively different things.
Everything else is optimization.