model-research/LESSONS.md

# Actionable Lessons: Using AI Models for Analytical Work

> **Generated:** 2026-05-06 07:30 PDT
> **Based on:** 29 experiments (2026-04-26 to 2026-05-06)

_Distilled from 29 experiments. These are the rules._

---

## The Three Rules

### 1. Match the model to the task, not the prestige

| If you need... | Use... | Why |
|---------------|--------|-----|
| "What's missing from this design?" | GPT-5 | Reasons about the world outside the document |
| "Where does this design contradict itself?" | Opus | Logical argumentation, zero false positives |
| "Is this consistent with that other doc?" | Opus | 2.4x faster, more findings than GPT-5 |
| "How could an attacker exploit this?" | GPT-5 (coverage) + Opus (creativity) | Different attack styles |
| "Quick sanity check before I ship" | Sonnet | Fast, cheap, precise enough |
| "What race conditions exist here?" | GPT-5 + Opus | Sonnet produces errors on concurrency |
| "Is there bias in this text?" | Any model (even GPT-4.1 Mini) | All models catch isolated bias equally |
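
The table above can be encoded as a small lookup so that scripts pick models consistently. This is a minimal sketch: the task-type keys and the `models_for` helper are hypothetical names introduced here, and the model labels are the shorthand used in this document, not API model identifiers. Choosing GPT-4.1 Mini for isolated bias is an assumption; the table says any model works.

```python
# Routing table from Rule 1, keyed by hypothetical task-type names.
# Model labels are document shorthand, not API model identifiers.
ROUTING = {
    "gap_finding": ["GPT-5"],
    "contradiction": ["Opus"],
    "cross_doc_consistency": ["Opus"],
    "adversarial": ["GPT-5", "Opus"],   # coverage + creativity
    "sanity_check": ["Sonnet"],
    "concurrency": ["GPT-5", "Opus"],   # Sonnet produces errors here
    "isolated_bias": ["GPT-4.1 Mini"],  # assumption: any model works per the table
}


def models_for(task_type: str) -> list[str]:
    """Return the models recommended for a task type; raise on unknown types."""
    try:
        return list(ROUTING[task_type])
    except KeyError:
        raise ValueError(f"no routing rule for task type: {task_type!r}") from None
```

Raising on unknown task types, rather than falling back to a default model, forces a deliberate routing decision for each new kind of task.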

### 2. Isolate the signal before asking the question

Don't ask "review this PR" and expect the model to catch a subtle bias buried in 6,600 lines of diff. Extract what matters and ask about it directly. Rich context dilutes attention.

**Pattern:**
- ❌ "Review this PR for quality, correctness, and bias" (broad mandate + rich context = missed signals)
- ✅ "Here are 12 hypotheses. Do any lead toward a predetermined conclusion?" (narrow question + minimal context = found everything)
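
One way to enforce the narrow-question pattern is to build prompts only from extracted items plus a single question. A minimal sketch, assuming the extraction has already happened; `build_narrow_prompt` is a hypothetical helper, not part of any library.

```python
def build_narrow_prompt(question: str, items: list[str]) -> str:
    """Build a prompt containing only the extracted items and one focused question."""
    if not items:
        raise ValueError("extract at least one item before asking")
    # Number the items so the model can refer to them individually.
    numbered = "\n".join(f"{i}. {item}" for i, item in enumerate(items, start=1))
    return f"Here are {len(items)} items:\n\n{numbered}\n\n{question}"
```

The function takes no diff or surrounding document at all, so the prompt cannot grow back into a broad "review everything" request.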

### 3. Run multiple models on anything that matters

No single model finds everything. The union of GPT-5 + Opus + Sonnet finds more than any individual model. The findings that a single-model run misses are disproportionately the ones that would cause production incidents.

**Decision framework:**
- **Costs nothing to get wrong:** One model is fine (Sonnet for speed, Opus for depth)
- **Would be embarrassing to miss:** Two models (Opus + GPT-5)
- **Would cost money or safety:** Three models (GPT-5 + Opus + Sonnet, plus manual review of unique findings)
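
The decision framework can be written down as a function so the choice is explicit rather than ad hoc. A minimal sketch: the `Stakes` enum, `select_models`, and `needs_manual_review` are hypothetical names, and the model labels are document shorthand.

```python
from enum import Enum


class Stakes(Enum):
    LOW = "costs nothing to get wrong"
    MEDIUM = "would be embarrassing to miss"
    HIGH = "would cost money or safety"


def select_models(stakes: Stakes, prefer_speed: bool = False) -> list[str]:
    """Map the stakes of a review to the set of models to run."""
    if stakes is Stakes.LOW:
        # One model is fine: Sonnet for speed, Opus for depth.
        return ["Sonnet"] if prefer_speed else ["Opus"]
    if stakes is Stakes.MEDIUM:
        return ["Opus", "GPT-5"]
    return ["GPT-5", "Opus", "Sonnet"]


def needs_manual_review(stakes: Stakes) -> bool:
    """Only the highest tier adds manual review of unique findings."""
    return stakes is Stakes.HIGH
```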

---

## Operational Playbook

### Architecture Document Review
```
1. Opus: contradiction detection + cross-doc consistency
2. GPT-5: hidden assumptions + gap-finding
3. Sonnet: quick structural scan (broken refs, missing sections)
4. Merge findings, deduplicate, triage by severity
```
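
Step 4 (merge, deduplicate, triage) might look like the sketch below. Assumptions: the `Finding` shape and the four-level severity scale are hypothetical, and deduplication is a naive match on normalized titles. Real findings from different models are usually worded differently and would need fuzzier or manual matching.

```python
from dataclasses import dataclass

# Assumed severity scale; lower rank sorts first.
SEVERITY_ORDER = {"critical": 0, "high": 1, "medium": 2, "low": 3}


@dataclass(frozen=True)
class Finding:
    model: str
    title: str
    severity: str


def _key(title: str) -> str:
    """Normalize case and whitespace for naive exact-match deduplication."""
    return " ".join(title.lower().split())


def merge_findings(findings: list[Finding]) -> list[dict]:
    """Merge per-model findings, deduplicate by title, sort most severe first."""
    merged: dict[str, dict] = {}
    for f in findings:
        if f.severity not in SEVERITY_ORDER:
            raise ValueError(f"unknown severity: {f.severity!r}")
        entry = merged.setdefault(
            _key(f.title), {"title": f.title, "severity": f.severity, "models": []}
        )
        if f.model not in entry["models"]:
            entry["models"].append(f.model)
        # When models disagree, keep the most severe rating.
        if SEVERITY_ORDER[f.severity] < SEVERITY_ORDER[entry["severity"]]:
            entry["severity"] = f.severity
    return sorted(merged.values(), key=lambda e: SEVERITY_ORDER[e["severity"]])
```

Keeping the `models` list on each merged entry preserves which model raised what, which is useful when triaging findings that only one model reported.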

### Pre-Implementation Spec Review
```
1. Opus: "Where do the stated principles conflict?"
2. GPT-5: "What must be true about the world for this to work?"
3. Sonnet 4.5: "What would an implementer have to guess?"
```
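
The three spec-review questions can be kept as data so each model always gets its own question. A minimal sketch: `build_spec_review_requests` is a hypothetical helper that only prepares request payloads, and sending them is left to whatever client you already use.

```python
# (model label, question) pairs from the playbook; labels are document shorthand.
SPEC_REVIEW_STEPS = [
    ("Opus", "Where do the stated principles conflict?"),
    ("GPT-5", "What must be true about the world for this to work?"),
    ("Sonnet 4.5", "What would an implementer have to guess?"),
]


def build_spec_review_requests(spec_text: str) -> list[dict[str, str]]:
    """Pair each model with its question and the spec text."""
    if not spec_text.strip():
        raise ValueError("spec_text is empty")
    return [
        {"model": model, "prompt": f"{question}\n\n---\n\n{spec_text}"}
        for model, question in SPEC_REVIEW_STEPS
    ]
```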

### Security/Adversarial Review
```
1. GPT-5: "Enumerate all possible abuses of each mechanism"
2. Opus: "What would a smart adversary do that the designer didn't consider?"
3. Union the findings — GPT-5 catches mechanism-level, Opus catches system-level
```
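
When taking the union in step 3, it helps to record which model produced each finding, so that the findings reported by only one model can be reviewed by hand. A minimal sketch, assuming findings have already been normalized to comparable strings; both function names are hypothetical.

```python
def union_with_provenance(findings_by_model: dict[str, set[str]]) -> dict[str, list[str]]:
    """Union all findings and record which models reported each one."""
    provenance: dict[str, list[str]] = {}
    for model, findings in findings_by_model.items():
        for finding in findings:
            provenance.setdefault(finding, []).append(model)
    return provenance


def found_by_one_model(provenance: dict[str, list[str]]) -> dict[str, str]:
    """Return findings that exactly one model reported, mapped to that model."""
    return {f: models[0] for f, models in provenance.items() if len(models) == 1}
```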

### PR Review (dual-reviewer pattern)
```
- Sonnet: structural issues, broken links, formatting
- GPT-5: semantic issues, logical gaps, verdict mismatches
- For important PRs: add Opus for design-tension detection
```

---

## Anti-Patterns (Things That Don't Work)

1. **"Use the most expensive model for everything"** — GPT-5 is 5-9x more expensive than Opus per finding, and Opus beats it on contradiction/consistency tasks.

2. **"Reasoning effort = better output"** — The low/medium/high parameter has negligible effect on analytical tasks. Don't bother tuning it.

3. **"Sonnet can do anything Opus does, just shallower"** — Wrong. Sonnet produces *errors* on concurrency reasoning and false positives on contradiction detection. It's not "cheaper Opus" — it's a different tool.

4. **"More context = better analysis"** — Signal-to-noise ratio matters more than context richness. Isolate what you're asking about.

5. **"One good prompt works everywhere"** — Prompt framing shapes output more than model choice. The same model with a broad vs narrow prompt produces qualitatively different work. Design prompts per task type.

6. **"Run it once, trust the output"** — Single runs are stochastic. Models miss things non-deterministically. Multiple models or multiple runs are the only hedge.
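
For the last anti-pattern, repeated runs can be aggregated by counting how many runs reported each finding. A minimal sketch, assuming each run's findings have been normalized to comparable strings; `aggregate_runs` is a hypothetical helper.

```python
from collections import Counter


def aggregate_runs(runs: list[set[str]]) -> list[tuple[str, int]]:
    """Union findings across runs and count how many runs reported each one."""
    counts = Counter(finding for run in runs for finding in run)
    # Sort by hit count (descending), then alphabetically for stable output.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
```

Findings with a low count are the ones a single run could easily have missed, so they are worth reading rather than discarding as noise.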

---

## Model Personality Cheat Sheet

| Model | Default behavior | Thinks like a... |
|-------|-----------------|------------------|
| GPT-5 | Exhaustive enumeration, high verification bar, slow | Thorough auditor checking every line item |
| Opus | Design tensions, self-correcting, efficient | Architect who sees how pieces conflict |
| Sonnet 4.6 | Fast structural scan, self-filtering, concise | Senior engineer doing a quick review |
| Sonnet 4.5 | Exhaustive, verbose, occasional severity inflation | Junior engineer trying to catch everything |
| GPT-4.1 | Structured, stays within the document's framing | Competent analyst following a checklist |
| GPT-4.1 Mini | Formulaic, maps findings 1:1 to document sections | Intern reading the doc and noting concerns |

---

## The Bottom Line

**For our specific workflow (gargoyle architecture review, PR reviews, design docs):**

1. Opus is the default analytical model — most efficient, deepest on consistency/contradiction
2. GPT-5 is the "we can't afford to miss anything" model — use on high-stakes docs
3. Sonnet is the speed/screening model — first pass, structural checks, assumption-finding only
4. Never use Sonnet alone for concurrency, contradiction, or adversarial analysis
5. Always isolate the analytical question from surrounding noise
6. Task-type-specific prompts beat generic "review this" prompts every time