docs: regenerate weekly report (2026-05-11)
This commit is contained in:
+218
-87
@@ -1,114 +1,245 @@
|
||||
# Actionable Lessons: Using AI Models for Analytical Work
|
||||
# Lessons Learned: Operational Guide for AI Model Selection
|
||||
|
||||
> **Generated:** 2026-05-06 07:30 PDT
|
||||
> **Based on:** 29 experiments (2026-04-26 to 2026-05-06)
|
||||
> **Generated:** 2026-05-11 09:00 PDT
|
||||
> **Based on:** 74 experiments (2026-04-26 to 2026-05-11)
|
||||
|
||||
_Distilled from 29 experiments. These are the rules._
|
||||
_This is the actionable distillation. For evidence and methodology, see REPORT.md._
|
||||
|
||||
---
|
||||
|
||||
## The Three Rules
|
||||
## Quick Reference: Model Selection by Task
|
||||
|
||||
### 1. Match the model to the task, not the prestige
|
||||
|
||||
| If you need... | Use... | Why |
|
||||
|---------------|--------|-----|
|
||||
| "What's missing from this design?" | GPT-5 | Reasons about the world outside the document |
|
||||
| "Where does this design contradict itself?" | Opus | Logical argumentation, zero false positives |
|
||||
| "Is this consistent with that other doc?" | Opus | 2.4x faster, more findings than GPT-5 |
|
||||
| "How could an attacker exploit this?" | GPT-5 (coverage) + Opus (creativity) | Different attack styles |
|
||||
| "Quick sanity check before I ship" | Sonnet | Fast, cheap, precise enough |
|
||||
| "What race conditions exist here?" | GPT-5 + Opus | Sonnet produces errors on concurrency |
|
||||
| "Is there bias in this text?" | Anything (even Mini) | All models catch isolated bias equally |
|
||||
|
||||
### 2. Isolate the signal before asking the question
|
||||
|
||||
Don't ask "review this PR" and expect the model to catch a subtle bias buried in 6,600 lines of diff. Extract what matters, ask about it directly. Rich context dilutes attention.
|
||||
|
||||
**Pattern:**
|
||||
- ❌ "Review this PR for quality, correctness, and bias" (broad mandate + rich context = missed signals)
|
||||
- ✅ "Here are 12 hypotheses. Do any lead toward a predetermined conclusion?" (narrow question + minimal context = found everything)
|
||||
|
||||
### 3. Run multiple models on anything that matters
|
||||
|
||||
No single model finds everything. The union of GPT-5 + Opus + Sonnet finds more than any individual. The missing findings from a single-model run are disproportionately the ones that would cause production incidents.
|
||||
|
||||
**Decision framework:**
|
||||
- **Costs nothing to get wrong:** One model is fine (Sonnet for speed, Opus for depth)
|
||||
- **Would be embarrassing to miss:** Two models (Opus + GPT-5)
|
||||
- **Would cost money or safety:** Three models (all three, plus manual review of unique findings)
|
||||
|
||||
---
|
||||
|
||||
## Operational Playbook
|
||||
|
||||
### Architecture Document Review
|
||||
```
|
||||
1. Opus: contradiction detection + cross-doc consistency
|
||||
2. GPT-5: hidden assumptions + gap-finding
|
||||
3. Sonnet: quick structural scan (broken refs, missing sections)
|
||||
4. Merge findings, deduplicate, triage by severity
|
||||
```
|
||||
|
||||
### Pre-Implementation Spec Review
|
||||
```
|
||||
1. Opus: "Where do the stated principles conflict?"
|
||||
2. GPT-5: "What must be true about the world for this to work?"
|
||||
3. Sonnet 4.5: "What would an implementer have to guess?"
|
||||
```
|
||||
|
||||
### Security/Adversarial Review
|
||||
```
|
||||
1. GPT-5: "Enumerate all possible abuses of each mechanism"
|
||||
2. Opus: "What would a smart adversary do that the designer didn't consider?"
|
||||
3. Union the findings — GPT-5 catches mechanism-level, Opus catches system-level
|
||||
```
|
||||
|
||||
### PR Review (dual-reviewer pattern)
|
||||
```
|
||||
- Sonnet: structural issues, broken links, formatting
|
||||
- GPT-5: semantic issues, logical gaps, verdict mismatches
|
||||
- For important PRs: add Opus for design-tension detection
|
||||
┌─────────────────────────────────────────────────────────────────┐
|
||||
│ TASK TYPE DECISION TREE │
|
||||
├─────────────────────────────────────────────────────────────────┤
|
||||
│ │
|
||||
│ Is this a VERIFICATION task? │
|
||||
│ (contradiction, consistency, race condition) │
|
||||
│ │ │
|
||||
│ ├─ YES → Use GPT-5 + Opus (skip Sonnet) │
|
||||
│ │ Sonnet has ~33% precision on verification │
|
||||
│ │ │
|
||||
│ └─ NO → Is this CROSS-DOCUMENT? │
|
||||
│ │ │
|
||||
│ ├─ YES → Use Opus (2.4x faster, more findings) │
|
||||
│ │ │
|
||||
│ └─ NO → Is this HIGH-STAKES? │
|
||||
│ (financial, safety, regulatory) │
|
||||
│ │ │
|
||||
│ ├─ YES → Run all three │
|
||||
│ │ (GPT-5 + Opus + Sonnet) │
|
||||
│ │ Total: ~$1-2, worth it │
|
||||
│ │ │
|
||||
│ └─ NO → Sonnet first-pass │
|
||||
│ Add Opus if findings need depth │
|
||||
│ │
|
||||
└─────────────────────────────────────────────────────────────────┘
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Anti-Patterns (Things That Don't Work)
|
||||
## Rules
|
||||
|
||||
1. **"Use the most expensive model for everything"** — GPT-5 is 5-9x more expensive than Opus per finding, and Opus beats it on contradiction/consistency tasks.
|
||||
### Rule 1: Match Model to Task Type
|
||||
|
||||
2. **"Reasoning effort = better output"** — The low/medium/high parameter has negligible effect on analytical tasks. Don't bother tuning it.
|
||||
| If the task is... | Use this | Not this |
|
||||
|-------------------|----------|----------|
|
||||
| Finding what's missing | GPT-5 | Mini |
|
||||
| Finding contradictions | Opus | Sonnet |
|
||||
| Cross-document consistency | Opus | GPT-5 |
|
||||
| Quick structural scan | Sonnet 4.6 | GPT-5 |
|
||||
| Broad coverage (noise OK) | Sonnet 4.5 | Sonnet 4.6 |
|
||||
| Adversarial attack paths | GPT-5 then Opus | Either alone |
|
||||
| Regulatory compliance | GPT-5 | Opus |
|
||||
| Operational blind spots | GPT-5 | Sonnet |
|
||||
|
||||
3. **"Sonnet can do anything Opus does, just shallower"** — Wrong. Sonnet produces *errors* on concurrency reasoning and false positives on contradiction detection. It's not "cheaper Opus" — it's a different tool.
|
||||
### Rule 2: Don't Trust Sonnet for Verification
|
||||
|
||||
4. **"More context = better analysis"** — Signal-to-noise ratio matters more than context richness. Isolate what you're asking about.
|
||||
Sonnet finds ~3 contradictions but only ~1 is genuine. The others are misreadings. Use Sonnet for *identification* tasks (what's here?), not *verification* tasks (is this true?).
|
||||
|
||||
5. **"One good prompt works everywhere"** — Prompt framing shapes output more than model choice. The same model with a broad vs narrow prompt produces qualitatively different work. Design prompts per task type.
|
||||
### Rule 3: Isolate the Signal
|
||||
|
||||
6. **"Run it once, trust the output"** — Single runs are stochastic. Models miss things non-deterministically. Multiple models or multiple runs are the only hedge.
|
||||
When checking for something specific (bias, contradictions, missing assumptions), extract the relevant text and ask about it directly. Don't bury the question in a broad review mandate. Signal-to-noise ratio matters more than model capability.
|
||||
|
||||
### Rule 4: Run the Ensemble for High Stakes
|
||||
|
||||
For anything financial, safety-critical, or regulatory: run GPT-5 + Opus + Sonnet. Each finds things the others miss. The union is larger than any single model's output. Cost is trivial vs. the value.
|
||||
|
||||
### Rule 5: Give GPT-5 Enough Tokens
|
||||
|
||||
GPT-5 needs `max_completion_tokens` ≥ 16K. A truncated GPT-5 response is worse than a complete Opus response. Token budget matters more than model size.
|
||||
|
||||
### Rule 6: Break Large Outputs Into Sections
|
||||
|
||||
Single agents die generating 1000+ lines. Rich input is fine; it's output length that kills. For large generation tasks: break into sections, draft in parallel, assemble.
|
||||
|
||||
### Rule 7: Narrow Framing Doesn't Fix Reasoning Gaps
|
||||
|
||||
You cannot make Sonnet match GPT-5/Opus by writing a better prompt. Narrow framing changes WHAT it looks for, not HOW WELL it reasons. The gap is architectural, not prompt engineering.
|
||||
|
||||
---
|
||||
|
||||
## Operational Playbooks
|
||||
|
||||
### Playbook A: Architecture Document Review
|
||||
|
||||
1. **First pass (Sonnet 4.6):** ~30s, catches structural issues, broken refs, obvious gaps
|
||||
2. **Deep analysis (GPT-5):** ~90s, finds domain-specific gaps, hidden assumptions, edge cases
|
||||
3. **Design tensions (Opus):** ~60s, finds where the design contradicts itself
|
||||
4. **Merge and dedupe:** Union of all three, remove duplicates, sort by severity
|
||||
|
||||
### Playbook B: Cross-Document Consistency Check
|
||||
|
||||
1. **Use Opus only.** It's 2.4x faster than GPT-5 and finds more issues.
|
||||
2. **Provide both documents in a single prompt** (~25KB max)
|
||||
3. **Explicitly exclude omissions** in the prompt — you want contradictions, not "Doc A covers X but Doc B doesn't"
|
||||
|
||||
### Playbook C: Adversarial Security Review
|
||||
|
||||
1. **First pass (GPT-5):** Exhaustive enumeration of attack surface
|
||||
2. **Extension pass (Opus):** Give Opus GPT-5's findings, ask it to critique and extend
|
||||
3. **Result:** 30% more findings at 28% more cost, with prioritization
|
||||
|
||||
### Playbook D: Regulatory Compliance Review
|
||||
|
||||
1. **First pass (Sonnet):** ~25s, identifies areas of concern
|
||||
2. **Deep dive (GPT-5):** Regulatory specificity, correct citations, edge cases
|
||||
3. **GPT-5's reasoning tokens are spent on verification** — trust its citations
|
||||
|
||||
### Playbook E: Contradiction Detection
|
||||
|
||||
1. **Use GPT-5 + Opus in parallel** (not Sonnet)
|
||||
2. **GPT-5 finds:** Specification conflicts (same scenario, different prescriptions)
|
||||
3. **Opus finds:** Logical impossibilities (rules that can't coexist)
|
||||
4. **Neither dominates** — they find different classes of contradiction
|
||||
|
||||
---
|
||||
|
||||
## Anti-Patterns
|
||||
|
||||
### ❌ Anti-Pattern 1: Using Sonnet for Verification Tasks
|
||||
|
||||
**What happens:** Sonnet reports contradictions that aren't real. You waste time investigating false positives or worse, trust a false negative.
|
||||
|
||||
**Instead:** Use GPT-5 or Opus for any task requiring "is this true?" reasoning.
|
||||
|
||||
### ❌ Anti-Pattern 2: Giving GPT-5 a Broad Mandate
|
||||
|
||||
**What happens:** GPT-5 spawns sub-agents, times out, or dumps raw tool output instead of synthesizing.
|
||||
|
||||
**Instead:** Give GPT-5 explicit single-actor instructions + output format. For Claude, broader mandates are fine.
|
||||
|
||||
### ❌ Anti-Pattern 3: Burying Important Checks in Large Reviews
|
||||
|
||||
**What happens:** The model misses the important thing because it's one of 47 things to check.
|
||||
|
||||
**Instead:** Extract the important check and ask about it specifically. Signal-to-noise ratio matters.
|
||||
|
||||
### ❌ Anti-Pattern 4: Extrapolating Across Task Types
|
||||
|
||||
**What happens:** "GPT-5 was great at X, so I'll use it for Y" — and it's mediocre.
|
||||
|
||||
**Instead:** Task type predicts performance better than "model X is better." Check the task-type table.
|
||||
|
||||
### ❌ Anti-Pattern 5: Skipping the Union
|
||||
|
||||
**What happens:** You run one model, miss things another would have caught, and the bug reaches production.
|
||||
|
||||
**Instead:** For high-stakes work, run the ensemble. The cost is trivial vs. the risk.
|
||||
|
||||
### ❌ Anti-Pattern 6: Tuning Reasoning Effort
|
||||
|
||||
**What happens:** You spend time adjusting low/medium/high reasoning effort parameters.
|
||||
|
||||
**Instead:** Don't bother. It has negligible effect on analytical work. Task type is the lever.
|
||||
|
||||
### ❌ Anti-Pattern 7: Trying to Fix Sonnet with Prompts
|
||||
|
||||
**What happens:** You write increasingly narrow prompts trying to get Sonnet to match GPT-5's reasoning depth.
|
||||
|
||||
**Instead:** Accept that the gap is architectural. Use Sonnet for what it's good at (speed, breadth, structural review), use GPT-5/Opus for reasoning-heavy tasks.
|
||||
|
||||
---
|
||||
|
||||
## Model Personality Cheat Sheet
|
||||
|
||||
| Model | Default behavior | Thinks like a... |
|
||||
|-------|-----------------|------------------|
|
||||
| GPT-5 | Exhaustive enumeration, high verification bar, slow | Thorough auditor checking every line item |
|
||||
| Opus | Design tensions, self-correcting, efficient | Architect who sees how pieces conflict |
|
||||
| Sonnet 4.6 | Fast structural scan, self-filtering, concise | Senior engineer doing a quick review |
|
||||
| Sonnet 4.5 | Exhaustive, verbose, occasional severity inflation | Junior engineer trying to catch everything |
|
||||
| GPT-4.1 | Structured, stays within the document's framing | Competent analyst following a checklist |
|
||||
| GPT-4.1 Mini | Formulaic, maps findings 1:1 to document sections | Intern reading the doc and noting concerns |
|
||||
| Model | Personality | Default Behavior | Give It |
|
||||
|-------|-------------|------------------|---------|
|
||||
| **GPT-5** | Exhaustive enumerator | Lists everything systematically | Bounded tasks, explicit output format, single-actor instructions |
|
||||
| **Opus** | Design critic | Finds tensions and contradictions | Open-ended analysis, room to reason about boundaries |
|
||||
| **Sonnet 4.6** | Structural scanner | Fast, precise, shallow | Quick first-pass work, structural review |
|
||||
| **Sonnet 4.5** | Broad coverage | More findings, more noise | When you want breadth over precision |
|
||||
| **GPT-4.1** | Generic competent | Stays within document framing | Middle-ground cost-sensitive work |
|
||||
| **GPT-4.1 Mini** | Template filler | Formulaic but catches obvious things | Bulk screening, sanity checks |
|
||||
|
||||
### Opus Superpower
|
||||
|
||||
Opus finds where the spec's **own assumptions are false**. It doesn't just find missing things — it finds things the spec *believes* to be true that *aren't*.
|
||||
|
||||
Examples:
|
||||
- "Realized P&L cannot recover" — the de-escalation model assumes all metrics can improve, but this one fundamentally cannot (#31)
|
||||
- "Forward detection logic is backwards" — spec describes triggers in the wrong direction (#32)
|
||||
- "Stop-loss defeated by temporal composition" — safety mechanism rendered ineffective by slow strategy (#47)
|
||||
|
||||
### GPT-5 Superpower
|
||||
|
||||
GPT-5 reasons about the document's **relationship to the real world**. It asks "what must be true about the external world for this to work?"
|
||||
|
||||
Examples:
|
||||
- Broker rate limiting (429s) bypasses "connection lost" detection (#9)
|
||||
- Corporate actions bypass staleness detection (#9)
|
||||
- DB "commit unknown outcome" causing restart loops (#9)
|
||||
- Cross-symbol strategies with partial staleness (#9)
|
||||
- IRS rule nuances that simplifications violate (#54)
|
||||
|
||||
---
|
||||
|
||||
## The Bottom Line
|
||||
## Decision Framework
|
||||
|
||||
**For our specific workflow (gargoyle architecture review, PR reviews, design docs):**
|
||||
### When to Add Another Model
|
||||
|
||||
1. Opus is the default analytical model — most efficient, deepest on consistency/contradiction
|
||||
2. GPT-5 is the "we can't afford to miss anything" model — use on high-stakes docs
|
||||
3. Sonnet is the speed/screening model — first pass, structural checks, assumption-finding only
|
||||
4. Never use Sonnet alone for concurrency, contradiction, or adversarial analysis
|
||||
5. Always isolate the analytical question from surrounding noise
|
||||
6. Task-type-specific prompts beat generic "review this" prompts every time
|
||||
| Situation | Action |
|
||||
|-----------|--------|
|
||||
| Sonnet found nothing | Add Opus (may find design tensions) |
|
||||
| GPT-5 found lots but all similar | Add Opus (may find different class) |
|
||||
| Opus found tensions but no enumeration | Add GPT-5 (exhaustive coverage) |
|
||||
| Cross-document task | Use Opus only (2.4x faster) |
|
||||
| Regulatory/compliance task | Use GPT-5 (correct citations) |
|
||||
|
||||
### When NOT to Add Another Model
|
||||
|
||||
| Situation | Action |
|
||||
|-----------|--------|
|
||||
| Quick structural scan | Sonnet alone is fine |
|
||||
| Bulk screening | Mini alone is fine |
|
||||
| Already ran GPT-5 + Opus | Adding Sonnet rarely helps |
|
||||
| Low-stakes internal doc | One model is enough |
|
||||
|
||||
### Cost-Benefit Quick Calc
|
||||
|
||||
| Risk level | Model cost | Justified? |
|
||||
|------------|------------|------------|
|
||||
| Financial/safety | ~$1-2 for ensemble | Always yes |
|
||||
| Customer-facing | ~$0.50 for GPT-5 | Usually yes |
|
||||
| Internal process | ~$0.10 for Sonnet | Always yes |
|
||||
| One-off exploration | ~$0.02 for Mini | Always yes |
|
||||
|
||||
---
|
||||
|
||||
## What We Still Don't Know
|
||||
|
||||
1. **Corpus bias:** All experiments used gargoyle docs. Rankings may differ for other domains.
|
||||
2. **Run variance:** All findings are single-run. Stochastic variation is unquantified.
|
||||
3. **Scale effects:** Largest doc tested is 1,110 lines. Unknown behavior at 2000+.
|
||||
4. **Non-architecture domains:** These findings are for architecture document analysis, not coding, not chat, not creative writing.
|
||||
|
||||
---
|
||||
|
||||
## Summary: The Two Things That Matter Most
|
||||
|
||||
1. **Task type determines model choice.** Don't pick a model because "it's best." Pick the model that's best for THIS task type.
|
||||
|
||||
2. **The union beats any single model.** For high-stakes work, run the ensemble. Different models find qualitatively different things.
|
||||
|
||||
Everything else is optimization.
|
||||
|
||||
Reference in New Issue
Block a user