Actionable Lessons: Using AI Models for Analytical Work

Generated: 2026-05-06 07:30 PDT
Based on: 29 experiments (2026-04-26 to 2026-05-06)

Distilled from 29 experiments. These are the rules.


The Three Rules

1. Match the model to the task, not the prestige

| If you need... | Use... | Why |
| --- | --- | --- |
| "What's missing from this design?" | GPT-5 | Reasons about the world outside the document |
| "Where does this design contradict itself?" | Opus | Logical argumentation, zero false positives |
| "Is this consistent with that other doc?" | Opus | 2.4x faster, more findings than GPT-5 |
| "How could an attacker exploit this?" | GPT-5 (coverage) + Opus (creativity) | Different attack styles |
| "Quick sanity check before I ship" | Sonnet | Fast, cheap, precise enough |
| "What race conditions exist here?" | GPT-5 + Opus | Sonnet produces errors on concurrency |
| "Is there bias in this text?" | Anything (even Mini) | All models catch isolated bias equally |
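The table above can be encoded as a small lookup so the model choice is made by task type rather than by habit. This is a minimal sketch, not part of the original experiments: the task labels and the `models_for` helper are hypothetical names, and picking GPT-4.1 Mini for bias checks is an assumption (the table only says any model works).

```python
# Task-to-model routing, transcribed from the table above.
# The task labels (dictionary keys) are hypothetical names chosen for this sketch.
MODEL_ROUTING = {
    "gap_finding": ["GPT-5"],
    "contradiction": ["Opus"],
    "cross_doc_consistency": ["Opus"],
    "adversarial": ["GPT-5", "Opus"],
    "sanity_check": ["Sonnet"],
    "concurrency": ["GPT-5", "Opus"],
    # Assumption: any model works for isolated bias, so the cheapest is used here.
    "bias_check": ["GPT-4.1 Mini"],
}


def models_for(task: str) -> list[str]:
    """Return the models to run for a task label, failing loudly on unknown labels."""
    try:
        return list(MODEL_ROUTING[task])  # copy so callers cannot mutate the table
    except KeyError:
        raise ValueError(f"unknown task type: {task!r}") from None


print(models_for("adversarial"))  # prints ['GPT-5', 'Opus']
```

Raising on an unknown label is a deliberate choice in this sketch: a silent default would reintroduce the "one model for everything" habit the table is meant to replace.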

2. Isolate the signal before asking the question

Don't ask "review this PR" and expect the model to catch a subtle bias buried in 6,600 lines of diff. Extract what matters, ask about it directly. Rich context dilutes attention.

Pattern:

  • Avoid: "Review this PR for quality, correctness, and bias" (broad mandate + rich context = missed signals)
  • Prefer: "Here are 12 hypotheses. Do any lead toward a predetermined conclusion?" (narrow question + minimal context = found everything)
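One way to make the narrow pattern repeatable is to build the prompt from the extracted items only, so the surrounding diff never reaches the model. The sketch below is illustrative: `build_focused_prompt` and its parameters are hypothetical, and the extraction step itself (pulling the hypotheses out of the PR) is assumed to have happened already.

```python
def build_focused_prompt(question: str, items: list[str], label: str = "items") -> str:
    """Combine one narrow question with only the extracted items it applies to."""
    if not items:
        raise ValueError("extract at least one item before asking")
    # Number the items so the model can refer to them unambiguously.
    numbered = "\n".join(f"{i}. {item}" for i, item in enumerate(items, start=1))
    return f"Here are {len(items)} {label}.\n{numbered}\n\n{question}"


# Placeholder hypotheses; in practice these come from the document under review.
prompt = build_focused_prompt(
    "Do any lead toward a predetermined conclusion?",
    ["<hypothesis 1>", "<hypothesis 2>"],
    label="hypotheses",
)
print(prompt)
```

The function takes no "context" argument on purpose: anything not in `items` stays out of the prompt.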

3. Run multiple models on anything that matters

No single model finds everything. The union of GPT-5 + Opus + Sonnet finds more than any individual model, and the findings a single-model run misses are disproportionately the ones that would cause production incidents.

Decision framework:

  • Costs nothing to get wrong: One model is fine (Sonnet for speed, Opus for depth)
  • Would be embarrassing to miss: Two models (Opus + GPT-5)
  • Would cost money or safety: All three models, plus manual review of unique findings (those reported by only one model)
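The decision framework above can be written down as a function of the stakes. This is a sketch under assumptions: the `Stakes` enum, the `review_plan` name, and the `prefer_speed` flag are all hypothetical, and the model ordering within each list carries no meaning.

```python
from enum import Enum


class Stakes(Enum):
    LOW = "costs nothing to get wrong"
    MEDIUM = "would be embarrassing to miss"
    HIGH = "would cost money or safety"


def review_plan(stakes: Stakes, prefer_speed: bool = False) -> dict:
    """Map the stakes of a review to the models to run and whether to review manually."""
    if stakes is Stakes.LOW:
        # One model is fine: Sonnet for speed, Opus for depth.
        return {"models": ["Sonnet" if prefer_speed else "Opus"], "manual_review": False}
    if stakes is Stakes.MEDIUM:
        return {"models": ["Opus", "GPT-5"], "manual_review": False}
    # HIGH: all three models, plus a manual pass over findings unique to one model.
    return {"models": ["GPT-5", "Opus", "Sonnet"], "manual_review": True}


print(review_plan(Stakes.MEDIUM)["models"])  # prints ['Opus', 'GPT-5']
```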

Operational Playbook

Architecture Document Review

1. Opus: contradiction detection + cross-doc consistency
2. GPT-5: hidden assumptions + gap-finding
3. Sonnet: quick structural scan (broken refs, missing sections)
4. Merge findings, deduplicate, triage by severity
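Step 4 (merge, deduplicate, triage) can be sketched as follows. Everything here is an assumption made for illustration: the `Finding` shape, the three-level severity scale, and especially the deduplication rule, which only matches summaries that are identical after lowercasing and whitespace normalization. Real model output usually paraphrases, so a production version would need fuzzier matching or a human pass.

```python
from dataclasses import dataclass

# Assumed severity scale; lower number means more urgent.
SEVERITY_ORDER = {"critical": 0, "major": 1, "minor": 2}


@dataclass(frozen=True)
class Finding:
    model: str
    summary: str
    severity: str


def normalize(text: str) -> str:
    """Naive dedup key: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def merge_findings(findings: list[Finding]) -> list[dict]:
    """Merge per-model findings, keep the worst severity, and sort most severe first."""
    merged: dict[str, dict] = {}
    for f in findings:
        entry = merged.setdefault(
            normalize(f.summary),
            {"summary": f.summary, "severity": f.severity, "models": []},
        )
        if f.model not in entry["models"]:
            entry["models"].append(f.model)
        # If models disagree on severity, keep the more severe rating.
        if SEVERITY_ORDER[f.severity] < SEVERITY_ORDER[entry["severity"]]:
            entry["severity"] = f.severity
    return sorted(merged.values(), key=lambda e: SEVERITY_ORDER[e["severity"]])
```

Keeping the `models` list on each merged entry makes it easy to spot findings that only one model reported, which are the ones worth a manual look.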

Pre-Implementation Spec Review

1. Opus: "Where do the stated principles conflict?"
2. GPT-5: "What must be true about the world for this to work?"
3. Sonnet 4.5: "What would an implementer have to guess?"

Security/Adversarial Review

1. GPT-5: "Enumerate all possible abuses of each mechanism"
2. Opus: "What would a smart adversary do that the designer didn't consider?"
3. Union the findings — GPT-5 catches mechanism-level, Opus catches system-level

PR Review (dual-reviewer pattern)

- Sonnet: structural issues, broken links, formatting
- GPT-5: semantic issues, logical gaps, verdict mismatches
- For important PRs: add Opus for design-tension detection

Anti-Patterns (Things That Don't Work)

  1. "Use the most expensive model for everything" — GPT-5 is 5-9x more expensive than Opus per finding, and Opus beats it on contradiction/consistency tasks.

  2. "Reasoning effort = better output" — The low/medium/high parameter has negligible effect on analytical tasks. Don't bother tuning it.

  3. "Sonnet can do anything Opus does, just shallower" — Wrong. Sonnet produces errors on concurrency reasoning and false positives on contradiction detection. It's not "cheaper Opus" — it's a different tool.

  4. "More context = better analysis" — Signal-to-noise ratio matters more than context richness. Isolate what you're asking about.

  5. "One good prompt works everywhere" — Prompt framing shapes output more than model choice. The same model with a broad vs narrow prompt produces qualitatively different work. Design prompts per task type.

  6. "Run it once, trust the output" — Single runs are stochastic. Models miss things non-deterministically. Multiple models or multiple runs are the only hedge.
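To see how much a single run can miss, repeated runs of the same prompt can be compared by how often each finding shows up. The sketch below assumes findings have already been reduced to comparable string labels; `finding_stability` is a hypothetical helper, and the example labels are placeholders rather than experimental data.

```python
from collections import Counter


def finding_stability(runs: list[set[str]]) -> dict[str, float]:
    """Return, for each finding, the fraction of runs in which it appeared."""
    if not runs:
        return {}
    counts = Counter(finding for run in runs for finding in run)
    return {finding: n / len(runs) for finding, n in counts.items()}


# Placeholder labels for three runs of the same prompt.
runs = [{"A", "B"}, {"A"}, {"A", "C"}]
stability = finding_stability(runs)
# Findings below 1.0 would have been missed by at least one single run.
flaky = sorted(f for f, share in stability.items() if share < 1.0)
print(flaky)  # prints ['B', 'C']
```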


Model Personality Cheat Sheet

| Model | Default behavior | Thinks like a... |
| --- | --- | --- |
| GPT-5 | Exhaustive enumeration, high verification bar, slow | Thorough auditor checking every line item |
| Opus | Design tensions, self-correcting, efficient | Architect who sees how pieces conflict |
| Sonnet 4.6 | Fast structural scan, self-filtering, concise | Senior engineer doing a quick review |
| Sonnet 4.5 | Exhaustive, verbose, occasional severity inflation | Junior engineer trying to catch everything |
| GPT-4.1 | Structured, stays within the document's framing | Competent analyst following a checklist |
| GPT-4.1 Mini | Formulaic, maps findings 1:1 to document sections | Intern reading the doc and noting concerns |

The Bottom Line

For our specific workflow (gargoyle architecture review, PR reviews, design docs):

  1. Opus is the default analytical model — most efficient, deepest on consistency/contradiction
  2. GPT-5 is the "we can't afford to miss anything" model — use on high-stakes docs
  3. Sonnet is the speed/screening model — first pass, structural checks, assumption-finding only
  4. Never use Sonnet alone for concurrency, contradiction, or adversarial analysis
  5. Always isolate the analytical question from surrounding noise
  6. Task-type-specific prompts beat generic "review this" prompts every time
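Rule 4 is the easiest of these to enforce mechanically. A minimal sketch, assuming review plans are expressed as a task label plus a list of model names; the `check_plan` helper and the task labels are hypothetical:

```python
# Task types where a Sonnet-only review is not allowed (rule 4 above).
SONNET_NEEDS_PARTNER = {"concurrency", "contradiction", "adversarial"}


def check_plan(task: str, models: list[str]) -> None:
    """Raise ValueError if the plan relies on Sonnet alone for a task it is weak on."""
    only_sonnet = bool(models) and all(m.startswith("Sonnet") for m in models)
    if task in SONNET_NEEDS_PARTNER and only_sonnet:
        raise ValueError(f"{task} review needs Opus or GPT-5 alongside Sonnet")


check_plan("concurrency", ["Sonnet", "Opus"])  # passes silently
check_plan("sanity_check", ["Sonnet"])         # passes silently
```

Matching on the `"Sonnet"` prefix is meant to cover both Sonnet 4.5 and Sonnet 4.6 as named in the cheat sheet.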