Actionable Lessons: Using AI Models for Analytical Work
Generated: 2026-05-06 07:30 PDT
Based on: 29 experiments (2026-04-26 to 2026-05-06)
The rules below are distilled from those 29 experiments.
The Three Rules
1. Match the model to the task, not the prestige
| If you need... | Use... | Why |
|---|---|---|
| "What's missing from this design?" | GPT-5 | Reasons about the world outside the document |
| "Where does this design contradict itself?" | Opus | Logical argumentation, zero false positives |
| "Is this consistent with that other doc?" | Opus | 2.4x faster than GPT-5, with more findings |
| "How could an attacker exploit this?" | GPT-5 (coverage) + Opus (creativity) | Different attack styles |
| "Quick sanity check before I ship" | Sonnet | Fast, cheap, precise enough |
| "What race conditions exist here?" | GPT-5 + Opus | Sonnet produces errors on concurrency |
| "Is there bias in this text?" | Anything (even Mini) | All models catch isolated bias equally |
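The table above can be kept next to the tooling as a small lookup. This is a minimal sketch, not an established API: the task keys are hypothetical labels invented here, the fallback to Opus follows the "default analytical model" note at the end of this document, and `GPT-4.1 Mini` for bias checks is just one reading of "Anything (even Mini)".

```python
# Hypothetical task labels; model names come from the table above.
ROUTING: dict[str, list[str]] = {
    "gap_finding": ["GPT-5"],
    "self_contradiction": ["Opus"],
    "cross_doc_consistency": ["Opus"],
    "adversarial": ["GPT-5", "Opus"],
    "sanity_check": ["Sonnet"],
    "race_conditions": ["GPT-5", "Opus"],  # Sonnet deliberately excluded
    "bias_check": ["GPT-4.1 Mini"],        # assumption: any model would do here
}


def models_for(task: str) -> list[str]:
    """Return the models suggested for a task type, falling back to Opus."""
    return ROUTING.get(task, ["Opus"])


print(models_for("race_conditions"))
```

Keeping the mapping in data rather than in scattered conditionals makes it easy to revise when a later experiment changes a recommendation.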
2. Isolate the signal before asking the question
Don't ask "review this PR" and expect the model to catch a subtle bias buried in 6,600 lines of diff. Extract what matters and ask about it directly. Rich context dilutes attention.
Pattern:
- ❌ "Review this PR for quality, correctness, and bias" (broad mandate + rich context = missed signals)
- ✅ "Here are 12 hypotheses. Do any lead toward a predetermined conclusion?" (narrow question + minimal context = found everything)
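One way to make the narrow-question pattern routine is to build the prompt only from items that were already extracted. The helper below is a sketch under that assumption: `narrow_prompt` is a hypothetical name, and the extraction step itself (pulling the hypotheses out of the diff) is left to whatever process fits the material.

```python
def narrow_prompt(question: str, items: list[str]) -> str:
    """Build a prompt from extracted items plus one narrow question.

    Nothing else (diff, surrounding docs) is included, so the model's
    attention stays on the signal being asked about.
    """
    if not items:
        raise ValueError("extract at least one item before asking")
    numbered = "\n".join(f"{i}. {item}" for i, item in enumerate(items, start=1))
    return f"Here are {len(items)} items:\n{numbered}\n\n{question}"


# Placeholder items; in practice these would be the extracted hypotheses.
prompt = narrow_prompt(
    "Do any of these lead toward a predetermined conclusion?",
    ["Hypothesis A ...", "Hypothesis B ..."],
)
print(prompt)
```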
3. Run multiple models on anything that matters
No single model finds everything. The union of GPT-5 + Opus + Sonnet finds more than any individual model. The findings that a single-model run misses are disproportionately the ones that would cause production incidents.
Decision framework:
- Costs nothing to get wrong: One model is fine (Sonnet for speed, Opus for depth)
- Would be embarrassing to miss: Two models (Opus + GPT-5)
- Would cost money or safety: Three models (GPT-5 + Opus + Sonnet, plus manual review of unique findings)
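The decision framework above is small enough to encode directly. The sketch below is illustrative only: the `Stakes` enum, the `review_plan` function, and the `prefer_depth` flag are hypothetical names, and the model lists simply restate the three tiers.

```python
from enum import Enum


class Stakes(Enum):
    LOW = "costs nothing to get wrong"
    MEDIUM = "would be embarrassing to miss"
    HIGH = "would cost money or safety"


def review_plan(stakes: Stakes, prefer_depth: bool = False) -> dict:
    """Map a stakes level to the models to run and whether to review by hand."""
    if stakes is Stakes.LOW:
        # One model: Sonnet for speed, Opus for depth.
        return {"models": ["Opus" if prefer_depth else "Sonnet"], "manual_review": False}
    if stakes is Stakes.MEDIUM:
        return {"models": ["Opus", "GPT-5"], "manual_review": False}
    # HIGH: all three models, plus manual review of findings unique to one model.
    return {"models": ["GPT-5", "Opus", "Sonnet"], "manual_review": True}


print(review_plan(Stakes.HIGH))
```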
Operational Playbook
Architecture Document Review
1. Opus: contradiction detection + cross-doc consistency
2. GPT-5: hidden assumptions + gap-finding
3. Sonnet: quick structural scan (broken refs, missing sections)
4. Merge findings, deduplicate, triage by severity
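Step 4 (merge, deduplicate, triage) might look like the sketch below. Everything here is an assumption made for illustration: the `Finding` shape, the three-level severity scale, and deduplication by whitespace- and case-normalized summary text. Real findings are rarely worded identically across models, so exact-match dedup is a crude first pass and near-duplicates still need a human or a fuzzier comparison.

```python
from dataclasses import dataclass

# Hypothetical severity scale; lower number means more severe.
SEVERITY_ORDER = {"critical": 0, "major": 1, "minor": 2}


@dataclass(frozen=True)
class Finding:
    model: str
    summary: str
    severity: str


def merge_findings(findings: list[Finding]) -> list[dict]:
    """Union findings across models, dedupe by normalized summary, sort by severity."""
    merged: dict[str, dict] = {}
    for f in findings:
        key = " ".join(f.summary.lower().split())  # normalize case and whitespace
        entry = merged.setdefault(
            key, {"summary": f.summary, "severity": f.severity, "models": []}
        )
        if f.model not in entry["models"]:
            entry["models"].append(f.model)
        # When models disagree on severity, keep the most severe rating.
        if SEVERITY_ORDER[f.severity] < SEVERITY_ORDER[entry["severity"]]:
            entry["severity"] = f.severity
    return sorted(merged.values(), key=lambda e: SEVERITY_ORDER[e["severity"]])


# Placeholder findings for illustration only.
for item in merge_findings([
    Finding("Opus", "Retry policy conflicts with idempotency claim", "major"),
    Finding("GPT-5", "retry policy conflicts with idempotency claim", "critical"),
    Finding("Sonnet", "Broken reference in section 3", "minor"),
]):
    print(item)
```

Recording which models reported each finding makes the findings unique to one model easy to spot, and those are the ones flagged for manual review on high-stakes documents.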
Pre-Implementation Spec Review
1. Opus: "Where do the stated principles conflict?"
2. GPT-5: "What must be true about the world for this to work?"
3. Sonnet 4.5: "What would an implementer have to guess?"
Security/Adversarial Review
1. GPT-5: "Enumerate all possible abuses of each mechanism"
2. Opus: "What would a smart adversary do that the designer didn't consider?"
3. Union the findings — GPT-5 catches mechanism-level, Opus catches system-level
PR Review (dual-reviewer pattern)
- Sonnet: structural issues, broken links, formatting
- GPT-5: semantic issues, logical gaps, verdict mismatches
- For important PRs: add Opus for design-tension detection
Anti-Patterns (Things That Don't Work)
- "Use the most expensive model for everything": GPT-5 is 5-9x more expensive than Opus per finding, and Opus beats it on contradiction/consistency tasks.
- "Reasoning effort = better output": The low/medium/high parameter has negligible effect on analytical tasks. Don't bother tuning it.
- "Sonnet can do anything Opus does, just shallower": Wrong. Sonnet produces errors on concurrency reasoning and false positives on contradiction detection. It's not "cheaper Opus"; it's a different tool.
- "More context = better analysis": Signal-to-noise ratio matters more than context richness. Isolate what you're asking about.
- "One good prompt works everywhere": Prompt framing shapes output more than model choice. The same model with a broad vs narrow prompt produces qualitatively different work. Design prompts per task type.
- "Run it once, trust the output": Single runs are stochastic. Models miss things non-deterministically. Multiple models or multiple runs are the only hedge.
Model Personality Cheat Sheet
| Model | Default behavior | Thinks like a... |
|---|---|---|
| GPT-5 | Exhaustive enumeration, high verification bar, slow | Thorough auditor checking every line item |
| Opus | Design tensions, self-correcting, efficient | Architect who sees how pieces conflict |
| Sonnet 4.6 | Fast structural scan, self-filtering, concise | Senior engineer doing a quick review |
| Sonnet 4.5 | Exhaustive, verbose, occasional severity inflation | Junior engineer trying to catch everything |
| GPT-4.1 | Structured, stays within the document's framing | Competent analyst following a checklist |
| GPT-4.1 Mini | Formulaic, maps findings 1:1 to document sections | Intern reading the doc and noting concerns |
The Bottom Line
For our specific workflow (gargoyle architecture review, PR reviews, design docs):
- Opus is the default analytical model — most efficient, deepest on consistency/contradiction
- GPT-5 is the "we can't afford to miss anything" model — use on high-stakes docs
- Sonnet is the speed/screening model — first pass, structural checks, assumption-finding only
- Never use Sonnet alone for concurrency, contradiction, or adversarial analysis
- Always isolate the analytical question from surrounding noise
- Task-type-specific prompts beat generic "review this" prompts every time