# Lessons Learned: Operational Guide for AI Model Selection

> **Generated:** 2026-05-18 09:02 PDT  
> **Based on:** 80 experiments (2026-04-26 to 2026-05-15)

_This is the actionable distillation. For evidence and methodology, see REPORT.md._

---

## Quick Reference: Model Selection by Task

```
┌─────────────────────────────────────────────────────────────────┐
│                    TASK TYPE DECISION TREE                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Is this a VERIFICATION task?                                   │
│  (self-contradiction, consistency check, race condition)        │
│     │                                                           │
│     ├─ YES → Is it CROSS-DOCUMENT comparison?                   │
│     │         │                                                 │
│     │         ├─ YES → Use Opus (or Sonnet for inter-doc        │
│     │         │        contradictions specifically)             │
│     │         │                                                 │
│     │         └─ NO → Use GPT-5 + Opus (skip Sonnet)           │
│     │                  Sonnet has ~33% precision on             │
│     │                  self-contradiction verification          │
│     │                                                           │
│     └─ NO → Is this SECURITY code review?                      │
│              │                                                  │
│              ├─ YES → Use dedicated security persona            │
│              │        (generalist reviewers miss it)            │
│              │                                                  │
│              └─ NO → Is this HIGH-STAKES?                      │
│                       (financial, safety, regulatory)           │
│                       │                                         │
│                       ├─ YES → Run all three                   │
│                       │        (GPT-5 + Opus + Sonnet)         │
│                       │        Total: ~$0.50-0.70              │
│                       │                                         │
│                       └─ NO → Sonnet first-pass               │
│                               Add Opus if findings need depth   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

---

## Rules

### Rule 1: Match Model to Task Type

| If the task is... | Use this | Not this |
|-------------------|----------|----------|
| Finding what's missing | GPT-5 | Mini |
| Finding self-contradictions | GPT-5 + Opus (both) | Sonnet |
| Cross-document consistency | Opus | GPT-5 |
| Inter-document contradictions | Sonnet | GPT-5 |
| Quick structural scan | Sonnet 4.6 | GPT-5 |
| Broad coverage (noise OK) | Sonnet 4.5 | Sonnet 4.6 |
| Adversarial attack paths | GPT-5 then Opus | Either alone |
| Regulatory compliance | GPT-5 | Opus |
| Operational blind spots | GPT-5 + Opus | Sonnet |
| Security code review | Dedicated security persona | Generalist prompt |
| State machine completeness | GPT-5 | Sonnet |
| External system assumptions | GPT-5 | Sonnet |
| Counterfactual ordering | GPT-5 | Sonnet |
| Degraded-mode analysis | Opus + GPT-5 | Sonnet |
| Implementation ambiguity | Any (all viable) | — |

### Rule 2: Don't Trust Sonnet for Verification

Sonnet finds ~3 contradictions but only ~1 is genuine. The others are misreadings. Use Sonnet for *identification* tasks (what's here?) and *inter-document comparison* (do these conflict?), not *self-contradiction verification* (is this internally consistent?).

**Exception:** Inter-document contradiction (#67) — Sonnet outperforms GPT-5 when comparing two documents for conflicting claims. Parallel comparison ≠ serial verification.

### Rule 3: Isolate the Signal

When checking for something specific (bias, contradictions, missing assumptions), extract the relevant text and ask about it directly. Don't bury the question in a broad review mandate. Signal-to-noise ratio matters more than model capability.

### Rule 4: Run the Ensemble for High Stakes

For anything financial, safety-critical, or regulatory: run GPT-5 + Opus + Sonnet. Each finds things the others miss. The union is 30-60% larger than any single model. Cost is trivial vs. the value.

### Rule 5: Give GPT-5 Enough Tokens

GPT-5 needs `max_completion_tokens` ≥ 16K. A truncated GPT-5 response is worse than a complete Opus response. Token budget matters more than model size.

### Rule 6: Break Large Outputs Into Sections

Single agents die generating 1000+ lines. Rich input is fine; it's output length that kills. For large generation tasks: break into sections, draft in parallel, assemble.

### Rule 7: Narrow Framing Doesn't Fix Reasoning Gaps

You cannot make Sonnet match GPT-5/Opus by writing a better prompt. Narrow framing changes WHAT it looks for, not HOW WELL it reasons. The gap is architectural, not prompt engineering.

### Rule 8: Specialized Personas Outperform Model Upgrades (for security)

A dedicated security-reviewer persona on the same model catches issues that a generalist reviewer misses and approves. For security code review: configure explicit security criteria (trust boundaries, library edge cases, OS interaction) rather than relying on "please also check security."

### Rule 9: Dual-Bot Disagreement Is a Feature

Two reviewers that sometimes disagree create a quality ratchet. The PR can't merge until both are satisfied. Removing one reviewer drops the gate from ~32% REQUEST_CHANGES to ~2%. Never reduce to single-bot without understanding the quality tradeoff.

### Rule 10: Monitor Your AI Pipeline

AI pipelines need operational monitoring like any production system. A dispatcher malfunction caused 3.5x cost overage and invalidated experiment data (Finding #80). Monitor: dispatch correctness, per-PR review count, API costs, and reviewer participation patterns.

---

## Anti-Patterns

### ❌ "Just use the best model for everything"

No single model dominates. GPT-5 is worst for inter-document contradictions. Opus is worst for exhaustive enumeration. Sonnet is worst for self-contradiction verification. Match the task.

### ❌ "Sonnet is good enough for a quick check"

Only on structural/identification tasks. On verification tasks, Sonnet's 33% precision means 2/3 of its findings are wrong. A "quick check" that produces false confidence is worse than no check.

### ❌ "We'll fix the prompt to get better results"

Narrow framing helps direct attention but cannot fix reasoning depth gaps. If Sonnet can't verify a contradiction with a perfect prompt, a better prompt won't help. Use a better model.

### ❌ "One reviewer is enough"

Finding #78: Dropping from dual-bot to single-bot review caused a 15x drop in REQUEST_CHANGES rate. The disagreement between reviewers is where quality lives. A single reviewer has blind spots; two reviewers catch each other's misses.

### ❌ "Security is just another review criteria"

Findings #79, #79b: Generalist reviewers (Sonnet, GPT) both APPROVED code with critical SSRF bypasses. Only a dedicated security persona blocked merge. Security requires domain-specific knowledge (CGN ranges, proxy inheritance, call-site consistency) that generalist prompts don't invoke.

### ❌ "More reviews = better quality"

Finding #80: When the dispatcher malfunctioned, PRs received 14+ reviews from 6 reviewers. This didn't improve quality — it inflated costs 3.5x and created noise. Targeted, specialized reviews > spray-and-pray.

### ❌ "The models will catch operational issues"

Finding #80: A broken dispatcher ran for days before anyone noticed the 3.5x cost overage. AI pipelines need traditional observability (cost monitoring, dispatch verification, participation metrics) — they don't self-diagnose operational problems.

---

## Operational Playbooks

### Playbook 1: Architecture Document Review

```
1. Sonnet first-pass (15-40s, $0.02)
   - Structural gaps, missing sections, obvious issues
   - Decision: Is this worth deeper analysis?

2. If yes → GPT-5 focused analysis (80-140s, $0.40)
   - Hidden assumptions, domain-specific gaps
   - Regulatory compliance (if applicable)
   - Temporal/ordering hazards

3. Opus design-tension analysis (50-120s, $0.12)
   - Where do principles conflict?
   - Where do safety mechanisms become vulnerabilities?
   - Cross-document consistency (if multiple docs)

4. Union findings → prioritize by severity
```

### Playbook 2: Security Code Review

```
1. Standard review (generalist model, any)
   - Code structure, patterns, obvious issues
   - Note: Will likely APPROVE even with security gaps

2. Dedicated security persona (MANDATORY for auth/crypto/network)
   - Explicit criteria: trust boundaries, library semantics, OS interaction
   - Checks: HTTPS enforcement at every call site
   - Checks: IP validation (is_global vs is_private)
   - Checks: Transport inheritance (proxy, TLS, timeout settings)
   - Checks: Write-path vs read-path consistency

3. Security persona has VETO power
   - If security says REQUEST_CHANGES, it blocks regardless of other approvals
```

### Playbook 3: Multi-Model Review Pipeline

```
Configuration (optimal based on Finding #78):
- Bot 1: Structural/pattern reviewer (Sonnet — fast, catches structural issues)
- Bot 2: Depth/logic reviewer (GPT — catches reasoning issues)
- Bot 3: Security reviewer (dedicated persona — catches security issues)

Rules:
- ALL bots must approve for merge (disagreement = quality signal)
- Monitor: REQUEST_CHANGES rate should be 20-40% (too low = degraded gate)
- Monitor: Per-PR review count should be 4-8 (too high = dispatcher bug)
- Monitor: API cost per PR (set alerts for >2x expected)

Operational:
- If REQUEST_CHANGES drops below 10% → investigate (model config? code quality? gate degraded?)
- If review count exceeds 15 → check dispatcher/webhook configuration
- Track: which bot finds which issues → continuous model-task matching
```

### Playbook 4: Regulatory Compliance Review

```
1. Sonnet structural scan ($0.02, 15-30s)
   - Identify which regulatory categories are addressed
   - Flag structural gaps (missing sections, uncovered rules)

2. GPT-5 regulatory cross-reference ($0.40, 120-160s)
   - Rule-by-rule comparison with actual regulations
   - Cite specific IRS/FINRA/SEC sections
   - Identify mathematical/formula errors
   - Find edge cases the implementation misses

3. Opus operational compliance ($0.12, 50-80s)
   - What the system needs to DO at runtime
   - Cross-account, cross-entity obligations
   - Where the implementation's model doesn't match regulatory reality

4. Combine → prioritize Critical/High for immediate action
```

### Playbook 5: Cross-Document Consistency Check

```
If comparing 2-3 documents for contradictions:
  → Opus primary (2.4x faster, finds more boundary tensions)
  → Sonnet for parallel comparison of claims (faster for direct conflicts)
  → GPT-5 only if documents are very complex (1000+ lines combined)

If checking one document for self-consistency:
  → GPT-5 for specification conflicts (statement A contradicts statement B)
  → Opus for logical impossibilities (rule A + rule B = impossible condition)
  → Skip Sonnet (33% precision, wastes time filtering false positives)
```

---

## Model Personality Cheat Sheet

| Model | Thinks Like | Asks | Finds |
|-------|------------|------|-------|
| GPT-5 | Systems engineer with 20 years experience | "What does the real world need that this doesn't address?" | Infrastructure gaps, operational hazards, regulatory oversights |
| Opus | Architecture critic / philosophy professor | "Where do your own principles contradict each other?" | Design tensions, logical impossibilities, safety-mechanism-as-vulnerability |
| Sonnet 4.6 | Junior developer implementing the spec | "If I were coding this, what would confuse me?" | Implementation ambiguities, structural gaps, obvious missing pieces |
| Sonnet 4.5 | Enthusiastic intern brainstorming | "What COULD go wrong?" | Broad list of concerns (noisy, needs filtering) |
| GPT-4.1 | Reliable senior dev doing code review | "Does this follow the patterns correctly?" | Structural issues, format consistency |
| GPT-4.1 Mini | Fast intern doing a checklist | "Is this obviously incomplete?" | Missing sections, obvious gaps |

---

## Decision Framework: "Should I Run Another Model?"

```
After getting results from Model A, ask:

1. Is the task VERIFICATION? (contradictions, races, consistency)
   → YES: Run the complementary model (GPT-5 ↔ Opus)
   → Sonnet results alone are insufficient for verification

2. Are the stakes HIGH? (financial, safety, regulatory, security)
   → YES: Run all available models. The marginal cost ($0.10-0.50)
     is negligible vs. the cost of a missed finding.

3. Did Model A find < 5 issues on a complex document?
   → YES: The task might not suit Model A. Try a different model.
   → GPT-5 finding < 5 issues is unusual — check prompt/token budget.

4. Is this a SCREENING pass before deeper work?
   → YES: Sonnet is sufficient. Save heavy models for the deep dive.
   → NO: Add GPT-5 or Opus depending on task type.

5. Is the document about SECURITY (auth, crypto, network)?
   → YES: Use a dedicated security persona regardless of other reviewers.
   → Generalist models will likely approve code with security gaps.
```

---

## Key Numbers to Remember

| Fact | Number | Source |
|------|--------|--------|
| Sonnet precision on contradiction detection | ~33% | #39, #43 |
| Adversarial ensemble improvement over single model | +30% findings | #35 |
| Adversarial ensemble extra cost | +28% tokens | #35 |
| Opus speed advantage over GPT-5 (cross-doc) | 2.4x faster | #28 |
| Opus token efficiency vs GPT-5 | 6-9x fewer tokens/finding | Multiple |
| GPT-5 minimum token budget | 16K completion tokens | #5, multiple |
| Union findings vs best single model | 30-60% more | Multiple |
| Dual-bot REQUEST_CHANGES rate | ~32% | #78 |
| Single-bot REQUEST_CHANGES rate | ~2% | #78 |
| Post-merge escape: missing tests | 22% of all findings | #78 |
| Security persona catch rate vs generalist | 100% vs 0% (on security issues) | #79 |
| Dispatcher malfunction cost overage | 3.5x | #80 |
| Total findings analyzed | 80 | This report |
| Total validated analytical lenses | 28 | REPORT.md Part 6 |

---

## What Changed Since Last Report

**New rules added:**
- Rule 8: Specialized Personas Outperform Model Upgrades (for security)
- Rule 9: Dual-Bot Disagreement Is a Feature
- Rule 10: Monitor Your AI Pipeline

**New anti-patterns:**
- "One reviewer is enough"
- "Security is just another review criteria"
- "More reviews = better quality"
- "The models will catch operational issues"

**New playbooks:**
- Playbook 2: Security Code Review
- Playbook 3: Multi-Model Review Pipeline (updated from operational data)

**Updated decision tree:**
- Added security review branch
- Added inter-document contradiction (Sonnet dominance) path

---

## Evolution Notes

This document evolves weekly. As new findings emerge:
- Rules are added/modified when patterns are validated across 3+ experiments
- Anti-patterns are added after observing repeated mistakes
- Playbooks are updated when operational data improves recommendations
- The decision tree is simplified when clearer heuristics emerge

The goal: make model selection as automatic as possible, so the researcher spends time on *analysis* rather than *model choice*.