# Actionable Lessons: Using AI Models for Analytical Work

_Distilled from 29 experiments. These are the rules._

---

## The Three Rules

### 1. Match the model to the task, not the prestige

| If you need... | Use... | Why |
|---------------|--------|-----|
| "What's missing from this design?" | GPT-5 | Reasons about the world outside the document |
| "Where does this design contradict itself?" | Opus | Logical argumentation, zero false positives |
| "Is this consistent with that other doc?" | Opus | 2.4x faster, more findings than GPT-5 |
| "How could an attacker exploit this?" | GPT-5 (coverage) + Opus (creativity) | Different attack styles |
| "Quick sanity check before I ship" | Sonnet | Fast, cheap, precise enough |
| "What race conditions exist here?" | GPT-5 + Opus | Sonnet produces errors on concurrency |
| "Is there bias in this text?" | Anything (even Mini) | All models catch isolated bias equally |
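
The table above can be read as a lookup from task type to model. A minimal sketch, assuming hypothetical task keys (`gap_finding`, `contradiction`, and so on) and plain-string model labels that are not real API identifiers; the bias row picks GPT-4.1 Mini only as one example of "anything":

```python
# Hypothetical routing table transcribing the rows above.
# Keys and model labels are illustrative names, not API identifiers.
ROUTING = {
    "gap_finding": ["GPT-5"],
    "contradiction": ["Opus"],
    "cross_doc_consistency": ["Opus"],
    "adversarial": ["GPT-5", "Opus"],
    "sanity_check": ["Sonnet"],
    "concurrency": ["GPT-5", "Opus"],
    # Assumption: any model works here, so the sketch picks a cheap one.
    "isolated_bias": ["GPT-4.1 Mini"],
}


def models_for(task_type: str) -> list[str]:
    """Return the models the table recommends for a task type."""
    try:
        return list(ROUTING[task_type])
    except KeyError:
        raise ValueError(f"unknown task type: {task_type!r}") from None
```

Failing loudly on an unknown task type is a deliberate choice in this sketch: falling back to a default model would quietly reintroduce the "one model for everything" habit.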

### 2. Isolate the signal before asking the question

Don't ask "review this PR" and expect the model to catch a subtle bias buried in 6,600 lines of diff. Extract what matters and ask about it directly. Rich context dilutes attention.

**Pattern:**

- ❌ "Review this PR for quality, correctness, and bias" (broad mandate + rich context = missed signals)
- ✅ "Here are 12 hypotheses. Do any lead toward a predetermined conclusion?" (narrow question + minimal context = found everything)
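
One way to apply the pattern above is to build the prompt from a single question plus only the excerpts it concerns. A minimal sketch; `build_narrow_prompt` is a hypothetical helper, and how the excerpts get extracted from the PR is left to the caller:

```python
def build_narrow_prompt(question: str, excerpts: list[str]) -> str:
    """Combine one focused question with only the excerpts it concerns."""
    if not excerpts:
        raise ValueError("extract at least one excerpt before asking")
    # Number the excerpts so the model can cite them in its answer.
    numbered = "\n".join(
        f"{i}. {text.strip()}" for i, text in enumerate(excerpts, start=1)
    )
    return f"{question.strip()}\n\n{numbered}"
```

For the example above, the 12 hypotheses would be the `excerpts` and the question about predetermined conclusions would be `question`; the rest of the diff never reaches the model.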

### 3. Run multiple models on anything that matters

No single model finds everything. The union of GPT-5 + Opus + Sonnet finds more than any individual model. The findings that a single-model run misses are disproportionately the ones that would cause production incidents.

**Decision framework:**

- **Costs nothing to get wrong:** One model is fine (Sonnet for speed, Opus for depth)
- **Would be embarrassing to miss:** Two models (Opus + GPT-5)
- **Would cost money or safety:** All three models, plus manual review of unique findings
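
The decision framework above maps stakes to a review plan. A minimal sketch, assuming a hypothetical `Stakes` enum and a `prefer_speed` flag for the low-stakes choice between Sonnet and Opus; neither name comes from the experiments:

```python
from enum import Enum


class Stakes(Enum):
    LOW = "costs nothing to get wrong"
    MEDIUM = "would be embarrassing to miss"
    HIGH = "would cost money or safety"


def review_plan(stakes: Stakes, prefer_speed: bool = False) -> dict:
    """Return the models to run and whether unique findings need manual review."""
    if stakes is Stakes.LOW:
        # One model is enough: Sonnet for speed, Opus for depth.
        model = "Sonnet" if prefer_speed else "Opus"
        return {"models": [model], "manual_review": False}
    if stakes is Stakes.MEDIUM:
        return {"models": ["Opus", "GPT-5"], "manual_review": False}
    # High stakes: all three models, then a human checks the unique findings.
    return {"models": ["GPT-5", "Opus", "Sonnet"], "manual_review": True}
```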

---

## Operational Playbook

### Architecture Document Review

```
1. Opus: contradiction detection + cross-doc consistency
2. GPT-5: hidden assumptions + gap-finding
3. Sonnet: quick structural scan (broken refs, missing sections)
4. Merge findings, deduplicate, triage by severity
```
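
Step 4 (merge, deduplicate, triage) can be sketched as follows. The `Finding` shape, the three-level severity scale, and the exact-match deduplication on normalized text are all assumptions made for illustration; real findings phrased differently by different models would need fuzzier matching or a manual pass:

```python
from dataclasses import dataclass

# Assumed severity scale; lower rank means more severe.
SEVERITY_RANK = {"critical": 0, "major": 1, "minor": 2}


@dataclass(frozen=True)
class Finding:
    model: str
    summary: str
    severity: str


def merge_findings(*runs: list[Finding]) -> list[dict]:
    """Union findings across runs, deduplicate, and sort most severe first."""
    merged: dict[str, dict] = {}
    for run in runs:
        for finding in run:
            # Naive dedup key: lowercase with whitespace collapsed.
            key = " ".join(finding.summary.lower().split())
            entry = merged.setdefault(
                key,
                {"summary": finding.summary, "severity": finding.severity, "models": []},
            )
            if finding.model not in entry["models"]:
                entry["models"].append(finding.model)
            # When models disagree on severity, keep the more severe rating.
            if SEVERITY_RANK[finding.severity] < SEVERITY_RANK[entry["severity"]]:
                entry["severity"] = finding.severity
    return sorted(merged.values(), key=lambda e: SEVERITY_RANK[e["severity"]])
```

Entries whose `models` list has a single element are the unique findings that the high-stakes tier of the decision framework sends to manual review.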

### Pre-Implementation Spec Review

```
1. Opus: "Where do the stated principles conflict?"
2. GPT-5: "What must be true about the world for this to work?"
3. Sonnet 4.5: "What would an implementer have to guess?"
```

### Security/Adversarial Review

```
1. GPT-5: "Enumerate all possible abuses of each mechanism"
2. Opus: "What would a smart adversary do that the designer didn't consider?"
3. Union the findings: GPT-5 catches mechanism-level issues, Opus catches system-level ones
```

### PR Review (dual-reviewer pattern)

```
- Sonnet: structural issues, broken links, formatting
- GPT-5: semantic issues, logical gaps, verdict mismatches
- For important PRs: add Opus for design-tension detection
```

---

## Anti-Patterns (Things That Don't Work)

1. **"Use the most expensive model for everything"**: GPT-5 is 5-9x more expensive than Opus per finding, and Opus beats it on contradiction/consistency tasks.

2. **"More reasoning effort = better output"**: The low/medium/high reasoning-effort parameter has negligible effect on analytical tasks. Don't bother tuning it.

3. **"Sonnet can do anything Opus does, just shallower"**: Wrong. Sonnet produces *errors* on concurrency reasoning and false positives on contradiction detection. It's not "cheaper Opus"; it's a different tool.

4. **"More context = better analysis"**: Signal-to-noise ratio matters more than context richness. Isolate what you're asking about.

5. **"One good prompt works everywhere"**: Prompt framing shapes output more than model choice. The same model produces qualitatively different work with a broad prompt than with a narrow one. Design prompts per task type.

6. **"Run it once, trust the output"**: Single runs are stochastic. Models miss things non-deterministically. Multiple models or multiple runs are the only hedge.
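
For anti-pattern 6, one rough way to see how stochastic a setup is: repeat the same run several times and measure how often each finding shows up. A minimal sketch, assuming each run has already been reduced to a set of hypothetical finding identifiers; a finding that appears in only a fraction of runs is one a single run could easily have missed:

```python
from collections import Counter


def hit_rate_by_finding(runs: list[set[str]]) -> dict[str, float]:
    """Map each finding to the fraction of runs that reported it."""
    if not runs:
        return {}
    # Each run is a set, so a finding counts at most once per run.
    counts = Counter(finding for run in runs for finding in run)
    return {finding: n / len(runs) for finding, n in counts.items()}
```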

---

## Model Personality Cheat Sheet

| Model | Default behavior | Thinks like a... |
|-------|-----------------|------------------|
| GPT-5 | Exhaustive enumeration, high verification bar, slow | Thorough auditor checking every line item |
| Opus | Design tensions, self-correcting, efficient | Architect who sees how pieces conflict |
| Sonnet 4.6 | Fast structural scan, self-filtering, concise | Senior engineer doing a quick review |
| Sonnet 4.5 | Exhaustive, verbose, occasional severity inflation | Junior engineer trying to catch everything |
| GPT-4.1 | Structured, stays within the document's framing | Competent analyst following a checklist |
| GPT-4.1 Mini | Formulaic, maps findings 1:1 to document sections | Intern reading the doc and noting concerns |

---

## The Bottom Line

**For our specific workflow (gargoyle architecture review, PR reviews, design docs):**

1. Opus is the default analytical model: most efficient, deepest on consistency/contradiction
2. GPT-5 is the "we can't afford to miss anything" model: use on high-stakes docs
3. Sonnet is the speed/screening model: first pass, structural checks, assumption-finding only
4. Never use Sonnet alone for concurrency, contradiction, or adversarial analysis
5. Always isolate the analytical question from surrounding noise
6. Task-type-specific prompts beat generic "review this" prompts every time