# Lessons Learned: Operational Guide for AI Model Selection

> **Generated:** 2026-05-11 09:00 PDT
> **Based on:** 74 experiments (2026-04-26 to 2026-05-11)

_This is the actionable distillation. For evidence and methodology, see REPORT.md._

---

## Quick Reference: Model Selection by Task

```
┌─────────────────────────────────────────────────────────────────┐
│                     TASK TYPE DECISION TREE                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Is this a VERIFICATION task?                                   │
│  (contradiction, consistency, race condition)                   │
│     │                                                           │
│     ├─ YES → Use GPT-5 + Opus (skip Sonnet)                     │
│     │        Sonnet has ~33% precision on verification          │
│     │                                                           │
│     └─ NO → Is this CROSS-DOCUMENT?                             │
│             │                                                   │
│             ├─ YES → Use Opus (2.4x faster, more findings)      │
│             │                                                   │
│             └─ NO → Is this HIGH-STAKES?                        │
│                     (financial, safety, regulatory)             │
│                     │                                           │
│                     ├─ YES → Run all three                      │
│                     │        (GPT-5 + Opus + Sonnet)            │
│                     │        Total: ~$1-2, worth it             │
│                     │                                           │
│                     └─ NO → Sonnet first-pass                   │
│                             Add Opus if findings need depth     │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

---

## Rules

### Rule 1: Match Model to Task Type

| If the task is... | Use this | Not this |
|-------------------|----------|----------|
| Finding what's missing | GPT-5 | Mini |
| Finding contradictions | Opus | Sonnet |
| Cross-document consistency | Opus | GPT-5 |
| Quick structural scan | Sonnet 4.6 | GPT-5 |
| Broad coverage (noise OK) | Sonnet 4.5 | Sonnet 4.6 |
| Adversarial attack paths | GPT-5 then Opus | Either alone |
| Regulatory compliance | GPT-5 | Opus |
| Operational blind spots | GPT-5 | Sonnet |

### Rule 2: Don't Trust Sonnet for Verification

Sonnet finds ~3 contradictions but only ~1 is genuine. The others are misreadings. Use Sonnet for *identification* tasks (what's here?), not *verification* tasks (is this true?).

### Rule 3: Isolate the Signal

When checking for something specific (bias, contradictions, missing assumptions), extract the relevant text and ask about it directly.
Don't bury the question in a broad review mandate. Signal-to-noise ratio matters more than model capability.

### Rule 4: Run the Ensemble for High Stakes

For anything financial, safety-critical, or regulatory: run GPT-5 + Opus + Sonnet. Each finds things the others miss. The union is larger than any single model's output. Cost is trivial vs. the value.

### Rule 5: Give GPT-5 Enough Tokens

GPT-5 needs `max_completion_tokens` ≥ 16K. A truncated GPT-5 response is worse than a complete Opus response. Token budget matters more than model size.

### Rule 6: Break Large Outputs Into Sections

Single agents die generating 1000+ lines. Rich input is fine; it's output length that kills. For large generation tasks: break into sections, draft in parallel, assemble.

### Rule 7: Narrow Framing Doesn't Fix Reasoning Gaps

You cannot make Sonnet match GPT-5/Opus by writing a better prompt. Narrow framing changes WHAT it looks for, not HOW WELL it reasons. The gap is architectural, not a prompt-engineering problem.

---

## Operational Playbooks

### Playbook A: Architecture Document Review

1. **First pass (Sonnet 4.6):** ~30s, catches structural issues, broken refs, obvious gaps
2. **Deep analysis (GPT-5):** ~90s, finds domain-specific gaps, hidden assumptions, edge cases
3. **Design tensions (Opus):** ~60s, finds where the design contradicts itself
4. **Merge and dedupe:** Union of all three, remove duplicates, sort by severity

### Playbook B: Cross-Document Consistency Check

1. **Use Opus only.** It's 2.4x faster than GPT-5 and finds more issues.
2. **Provide both documents in a single prompt** (~25KB max)
3. **Explicitly exclude omissions** in the prompt: you want contradictions, not "Doc A covers X but Doc B doesn't"

### Playbook C: Adversarial Security Review

1. **First pass (GPT-5):** Exhaustive enumeration of attack surface
2. **Extension pass (Opus):** Give Opus GPT-5's findings, ask it to critique and extend
3. **Result:** 30% more findings at 28% more cost, with prioritization

### Playbook D: Regulatory Compliance Review

1. **First pass (Sonnet):** ~25s, identifies areas of concern
2. **Deep dive (GPT-5):** Regulatory specificity, correct citations, edge cases
3. **GPT-5's reasoning tokens are spent on verification**, so trust its citations

### Playbook E: Contradiction Detection

1. **Use GPT-5 + Opus in parallel** (not Sonnet)
2. **GPT-5 finds:** Specification conflicts (same scenario, different prescriptions)
3. **Opus finds:** Logical impossibilities (rules that can't coexist)
4. **Neither dominates:** they find different classes of contradiction

---

## Anti-Patterns

### ❌ Anti-Pattern 1: Using Sonnet for Verification Tasks

**What happens:** Sonnet reports contradictions that aren't real. You waste time investigating false positives or, worse, trust a false negative.

**Instead:** Use GPT-5 or Opus for any task requiring "is this true?" reasoning.

### ❌ Anti-Pattern 2: Giving GPT-5 a Broad Mandate

**What happens:** GPT-5 spawns sub-agents, times out, or dumps raw tool output instead of synthesizing.

**Instead:** Give GPT-5 explicit single-actor instructions + output format. For Claude, broader mandates are fine.

### ❌ Anti-Pattern 3: Burying Important Checks in Large Reviews

**What happens:** The model misses the important thing because it's one of 47 things to check.

**Instead:** Extract the important check and ask about it specifically. Signal-to-noise ratio matters.

### ❌ Anti-Pattern 4: Extrapolating Across Task Types

**What happens:** "GPT-5 was great at X, so I'll use it for Y", and it turns out mediocre at Y.

**Instead:** Task type predicts performance better than "model X is better." Check the task-type table.

### ❌ Anti-Pattern 5: Skipping the Union

**What happens:** You run one model, miss things another would have caught, and the bug reaches production.

**Instead:** For high-stakes work, run the ensemble. The cost is trivial vs. the risk.
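
The union step (also the "merge and dedupe" step in Playbook A) can be sketched roughly as below. This is a minimal illustration, not the procedure used in the experiments: the `Finding` shape, the four-level severity scale, and the title-normalization duplicate key are all assumptions, and real deduplication would likely need fuzzier matching.

```python
from dataclasses import dataclass

# Hypothetical severity scale; a lower rank sorts first.
SEVERITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3}


@dataclass
class Finding:
    model: str     # label of the model that reported it (hypothetical labels)
    title: str     # short description of the issue
    severity: str  # one of the keys in SEVERITY_RANK


def dedupe_key(title: str) -> str:
    """Naive duplicate key: case-folded title with whitespace collapsed."""
    return " ".join(title.casefold().split())


def merge_findings(*runs: list[Finding]) -> list[dict]:
    """Union findings across model runs, drop duplicates, sort by severity."""
    merged: dict[str, dict] = {}
    for run in runs:
        for f in run:
            entry = merged.setdefault(
                dedupe_key(f.title),
                {"title": f.title, "severity": f.severity, "models": []},
            )
            if f.model not in entry["models"]:
                entry["models"].append(f.model)
            # Keep the most severe rating any model assigned.
            if SEVERITY_RANK[f.severity] < SEVERITY_RANK[entry["severity"]]:
                entry["severity"] = f.severity
    return sorted(merged.values(), key=lambda e: SEVERITY_RANK[e["severity"]])


# Placeholder inputs for illustration only; these are not experiment data.
gpt5_run = [Finding("gpt-5", "Placeholder issue A", "high")]
opus_run = [
    Finding("opus", "placeholder  issue a", "critical"),  # duplicate of A
    Finding("opus", "Placeholder issue B", "medium"),
]
merged = merge_findings(gpt5_run, opus_run)
```

Recording which models reported each finding keeps the evidence that different models surface different things, which is the point of running the ensemble.
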
### ❌ Anti-Pattern 6: Tuning Reasoning Effort

**What happens:** You spend time adjusting low/medium/high reasoning effort parameters.

**Instead:** Don't bother. It has negligible effect on analytical work. Task type is the lever.

### ❌ Anti-Pattern 7: Trying to Fix Sonnet with Prompts

**What happens:** You write increasingly narrow prompts trying to get Sonnet to match GPT-5's reasoning depth.

**Instead:** Accept that the gap is architectural. Use Sonnet for what it's good at (speed, breadth, structural review) and GPT-5/Opus for reasoning-heavy tasks.

---

## Model Personality Cheat Sheet

| Model | Personality | Default Behavior | Give It |
|-------|-------------|------------------|---------|
| **GPT-5** | Exhaustive enumerator | Lists everything systematically | Bounded tasks, explicit output format, single-actor instructions |
| **Opus** | Design critic | Finds tensions and contradictions | Open-ended analysis, room to reason about boundaries |
| **Sonnet 4.6** | Structural scanner | Fast, precise, shallow | Quick first-pass work, structural review |
| **Sonnet 4.5** | Broad coverage | More findings, more noise | When you want breadth over precision |
| **GPT-4.1** | Generic competent | Stays within document framing | Middle-ground cost-sensitive work |
| **GPT-4.1 Mini** | Template filler | Formulaic but catches obvious things | Bulk screening, sanity checks |

### Opus Superpower

Opus finds where the spec's **own assumptions are false**. It doesn't just find missing things; it finds things the spec *believes* to be true that *aren't*.


Examples:

- "Realized P&L cannot recover": the de-escalation model assumes all metrics can improve, but this one fundamentally cannot (#31)
- "Forward detection logic is backwards": spec describes triggers in the wrong direction (#32)
- "Stop-loss defeated by temporal composition": safety mechanism rendered ineffective by slow strategy (#47)

### GPT-5 Superpower

GPT-5 reasons about the document's **relationship to the real world**. It asks "what must be true about the external world for this to work?"

Examples:

- Broker rate limiting (429s) bypasses "connection lost" detection (#9)
- Corporate actions bypass staleness detection (#9)
- DB "commit unknown outcome" causing restart loops (#9)
- Cross-symbol strategies with partial staleness (#9)
- IRS rule nuances that simplifications violate (#54)

---

## Decision Framework

### When to Add Another Model

| Situation | Action |
|-----------|--------|
| Sonnet found nothing | Add Opus (may find design tensions) |
| GPT-5 found lots but all similar | Add Opus (may find different class) |
| Opus found tensions but no enumeration | Add GPT-5 (exhaustive coverage) |
| Cross-document task | Use Opus only (2.4x faster) |
| Regulatory/compliance task | Use GPT-5 (correct citations) |

### When NOT to Add Another Model

| Situation | Action |
|-----------|--------|
| Quick structural scan | Sonnet alone is fine |
| Bulk screening | Mini alone is fine |
| Already ran GPT-5 + Opus | Adding Sonnet rarely helps |
| Low-stakes internal doc | One model is enough |

### Cost-Benefit Quick Calc

| Risk level | Model cost | Justified? |
|------------|------------|------------|
| Financial/safety | ~$1-2 for ensemble | Always yes |
| Customer-facing | ~$0.50 for GPT-5 | Usually yes |
| Internal process | ~$0.10 for Sonnet | Always yes |
| One-off exploration | ~$0.02 for Mini | Always yes |

---

## What We Still Don't Know

1. **Corpus bias:** All experiments used gargoyle docs. Rankings may differ for other domains.
2. **Run variance:** All findings are single-run. Stochastic variation is unquantified.
3. **Scale effects:** Largest doc tested is 1,110 lines. Behavior at 2000+ lines is unknown.
4. **Non-architecture domains:** These findings are for architecture document analysis, not coding, not chat, not creative writing.

---

## Summary: The Two Things That Matter Most

1. **Task type determines model choice.** Don't pick a model because "it's best." Pick the model that's best for THIS task type.
2. **The union beats any single model.** For high-stakes work, run the ensemble. Different models find qualitatively different things.

Everything else is optimization.
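
Point 1 can be made mechanical. The following is a minimal sketch of the Quick Reference decision tree as a function, assuming the three yes/no questions are answered up front. The `Task` fields and the lowercase model labels are hypothetical names chosen for illustration; they are not API model identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    verification: bool = False    # contradiction, consistency, race condition
    cross_document: bool = False
    high_stakes: bool = False     # financial, safety, regulatory


def select_models(task: Task) -> list[str]:
    """Walk the decision tree top to bottom; the first matching branch wins."""
    if task.verification:
        return ["gpt-5", "opus"]            # skip Sonnet for verification
    if task.cross_document:
        return ["opus"]
    if task.high_stakes:
        return ["gpt-5", "opus", "sonnet"]  # full ensemble
    return ["sonnet"]                       # add Opus if findings need depth


choice = select_models(Task(cross_document=True, high_stakes=True))
```

Because the branches are checked in the tree's order, a task that is both cross-document and high-stakes resolves to Opus alone, exactly as the tree is drawn; whether that is the intended behavior for such tasks is worth confirming against REPORT.md.
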