docs: regenerate weekly report (2026-05-11)

This commit is contained in:
Rodin
2026-05-11 09:04:35 -07:00
parent 2ca8c974f3
commit 828da269c0
2 changed files with 394 additions and 140 deletions
+218 -87
View File
@@ -1,114 +1,245 @@
# Actionable Lessons: Using AI Models for Analytical Work
# Lessons Learned: Operational Guide for AI Model Selection
> **Generated:** 2026-05-06 07:30 PDT
> **Based on:** 29 experiments (2026-04-26 to 2026-05-06)
> **Generated:** 2026-05-11 09:00 PDT
> **Based on:** 74 experiments (2026-04-26 to 2026-05-11)
_Distilled from 29 experiments. These are the rules._
_This is the actionable distillation. For evidence and methodology, see REPORT.md._
---
## The Three Rules
## Quick Reference: Model Selection by Task
### 1. Match the model to the task, not the prestige
| If you need... | Use... | Why |
|---------------|--------|-----|
| "What's missing from this design?" | GPT-5 | Reasons about the world outside the document |
| "Where does this design contradict itself?" | Opus | Logical argumentation, zero false positives |
| "Is this consistent with that other doc?" | Opus | 2.4x faster, more findings than GPT-5 |
| "How could an attacker exploit this?" | GPT-5 (coverage) + Opus (creativity) | Different attack styles |
| "Quick sanity check before I ship" | Sonnet | Fast, cheap, precise enough |
| "What race conditions exist here?" | GPT-5 + Opus | Sonnet produces errors on concurrency |
| "Is there bias in this text?" | Anything (even Mini) | All models catch isolated bias equally |
### 2. Isolate the signal before asking the question
Don't ask "review this PR" and expect the model to catch a subtle bias buried in 6,600 lines of diff. Extract what matters, ask about it directly. Rich context dilutes attention.
**Pattern:**
- ❌ "Review this PR for quality, correctness, and bias" (broad mandate + rich context = missed signals)
- ✅ "Here are 12 hypotheses. Do any lead toward a predetermined conclusion?" (narrow question + minimal context = found everything)
### 3. Run multiple models on anything that matters
No single model finds everything. The union of GPT-5 + Opus + Sonnet finds more than any individual. The missing findings from a single-model run are disproportionately the ones that would cause production incidents.
**Decision framework:**
- **Costs nothing to get wrong:** One model is fine (Sonnet for speed, Opus for depth)
- **Would be embarrassing to miss:** Two models (Opus + GPT-5)
- **Would cost money or safety:** Three models (all three, plus manual review of unique findings)
---
## Operational Playbook
### Architecture Document Review
```
1. Opus: contradiction detection + cross-doc consistency
2. GPT-5: hidden assumptions + gap-finding
3. Sonnet: quick structural scan (broken refs, missing sections)
4. Merge findings, deduplicate, triage by severity
```
### Pre-Implementation Spec Review
```
1. Opus: "Where do the stated principles conflict?"
2. GPT-5: "What must be true about the world for this to work?"
3. Sonnet 4.5: "What would an implementer have to guess?"
```
### Security/Adversarial Review
```
1. GPT-5: "Enumerate all possible abuses of each mechanism"
2. Opus: "What would a smart adversary do that the designer didn't consider?"
3. Union the findings — GPT-5 catches mechanism-level, Opus catches system-level
```
### PR Review (dual-reviewer pattern)
```
- Sonnet: structural issues, broken links, formatting
- GPT-5: semantic issues, logical gaps, verdict mismatches
- For important PRs: add Opus for design-tension detection
┌─────────────────────────────────────────────────────────────────┐
│ TASK TYPE DECISION TREE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Is this a VERIFICATION task? │
│ (contradiction, consistency, race condition) │
│ │ │
│ ├─ YES → Use GPT-5 + Opus (skip Sonnet) │
│ │ Sonnet has ~33% precision on verification │
│ │ │
│ └─ NO → Is this CROSS-DOCUMENT? │
│ │ │
│ ├─ YES → Use Opus (2.4x faster, more findings) │
│ │ │
│ └─ NO → Is this HIGH-STAKES? │
│ (financial, safety, regulatory) │
│ │ │
│ ├─ YES → Run all three │
│ │ (GPT-5 + Opus + Sonnet) │
│ │ Total: ~$1-2, worth it │
│ │ │
│ └─ NO → Sonnet first-pass │
│ Add Opus if findings need depth │
│ │
└─────────────────────────────────────────────────────────────────┘
```
---
## Anti-Patterns (Things That Don't Work)
## Rules
1. **"Use the most expensive model for everything"** — GPT-5 is 5-9x more expensive than Opus per finding, and Opus beats it on contradiction/consistency tasks.
### Rule 1: Match Model to Task Type
2. **"Reasoning effort = better output"** — The low/medium/high parameter has negligible effect on analytical tasks. Don't bother tuning it.
| If the task is... | Use this | Not this |
|-------------------|----------|----------|
| Finding what's missing | GPT-5 | Mini |
| Finding contradictions | Opus | Sonnet |
| Cross-document consistency | Opus | GPT-5 |
| Quick structural scan | Sonnet 4.6 | GPT-5 |
| Broad coverage (noise OK) | Sonnet 4.5 | Sonnet 4.6 |
| Adversarial attack paths | GPT-5 then Opus | Either alone |
| Regulatory compliance | GPT-5 | Opus |
| Operational blind spots | GPT-5 | Sonnet |
3. **"Sonnet can do anything Opus does, just shallower"** — Wrong. Sonnet produces *errors* on concurrency reasoning and false positives on contradiction detection. It's not "cheaper Opus" — it's a different tool.
### Rule 2: Don't Trust Sonnet for Verification
4. **"More context = better analysis"** — Signal-to-noise ratio matters more than context richness. Isolate what you're asking about.
Sonnet finds ~3 contradictions but only ~1 is genuine. The others are misreadings. Use Sonnet for *identification* tasks (what's here?), not *verification* tasks (is this true?).
5. **"One good prompt works everywhere"** — Prompt framing shapes output more than model choice. The same model with a broad vs narrow prompt produces qualitatively different work. Design prompts per task type.
### Rule 3: Isolate the Signal
6. **"Run it once, trust the output"** — Single runs are stochastic. Models miss things non-deterministically. Multiple models or multiple runs are the only hedge.
When checking for something specific (bias, contradictions, missing assumptions), extract the relevant text and ask about it directly. Don't bury the question in a broad review mandate. Signal-to-noise ratio matters more than model capability.
### Rule 4: Run the Ensemble for High Stakes
For anything financial, safety-critical, or regulatory: run GPT-5 + Opus + Sonnet. Each finds things the others miss. The union is larger than any single model's output. Cost is trivial vs. the value.
### Rule 5: Give GPT-5 Enough Tokens
GPT-5 needs `max_completion_tokens` ≥ 16K. A truncated GPT-5 response is worse than a complete Opus response. Token budget matters more than model size.
### Rule 6: Break Large Outputs Into Sections
Single agents die generating 1000+ lines. Rich input is fine; it's output length that kills. For large generation tasks: break into sections, draft in parallel, assemble.
### Rule 7: Narrow Framing Doesn't Fix Reasoning Gaps
You cannot make Sonnet match GPT-5/Opus by writing a better prompt. Narrow framing changes WHAT it looks for, not HOW WELL it reasons. The gap is architectural, not prompt engineering.
---
## Operational Playbooks
### Playbook A: Architecture Document Review
1. **First pass (Sonnet 4.6):** ~30s, catches structural issues, broken refs, obvious gaps
2. **Deep analysis (GPT-5):** ~90s, finds domain-specific gaps, hidden assumptions, edge cases
3. **Design tensions (Opus):** ~60s, finds where the design contradicts itself
4. **Merge and dedupe:** Union of all three, remove duplicates, sort by severity
### Playbook B: Cross-Document Consistency Check
1. **Use Opus only.** It's 2.4x faster than GPT-5 and finds more issues.
2. **Provide both documents in a single prompt** (~25KB max)
3. **Explicitly exclude omissions** in the prompt — you want contradictions, not "Doc A covers X but Doc B doesn't"
### Playbook C: Adversarial Security Review
1. **First pass (GPT-5):** Exhaustive enumeration of attack surface
2. **Extension pass (Opus):** Give Opus GPT-5's findings, ask it to critique and extend
3. **Result:** 30% more findings at 28% more cost, with prioritization
### Playbook D: Regulatory Compliance Review
1. **First pass (Sonnet):** ~25s, identifies areas of concern
2. **Deep dive (GPT-5):** Regulatory specificity, correct citations, edge cases
3. **GPT-5's reasoning tokens are spent on verification** — trust its citations
### Playbook E: Contradiction Detection
1. **Use GPT-5 + Opus in parallel** (not Sonnet)
2. **GPT-5 finds:** Specification conflicts (same scenario, different prescriptions)
3. **Opus finds:** Logical impossibilities (rules that can't coexist)
4. **Neither dominates** — they find different classes of contradiction
---
## Anti-Patterns
### ❌ Anti-Pattern 1: Using Sonnet for Verification Tasks
**What happens:** Sonnet reports contradictions that aren't real. You waste time investigating false positives or worse, trust a false negative.
**Instead:** Use GPT-5 or Opus for any task requiring "is this true?" reasoning.
### ❌ Anti-Pattern 2: Giving GPT-5 a Broad Mandate
**What happens:** GPT-5 spawns sub-agents, times out, or dumps raw tool output instead of synthesizing.
**Instead:** Give GPT-5 explicit single-actor instructions + output format. For Claude, broader mandates are fine.
### ❌ Anti-Pattern 3: Burying Important Checks in Large Reviews
**What happens:** The model misses the important thing because it's one of 47 things to check.
**Instead:** Extract the important check and ask about it specifically. Signal-to-noise ratio matters.
### ❌ Anti-Pattern 4: Extrapolating Across Task Types
**What happens:** "GPT-5 was great at X, so I'll use it for Y" — and it's mediocre.
**Instead:** Task type predicts performance better than "model X is better." Check the task-type table.
### ❌ Anti-Pattern 5: Skipping the Union
**What happens:** You run one model, miss things another would have caught, and the bug reaches production.
**Instead:** For high-stakes work, run the ensemble. The cost is trivial vs. the risk.
### ❌ Anti-Pattern 6: Tuning Reasoning Effort
**What happens:** You spend time adjusting low/medium/high reasoning effort parameters.
**Instead:** Don't bother. It has negligible effect on analytical work. Task type is the lever.
### ❌ Anti-Pattern 7: Trying to Fix Sonnet with Prompts
**What happens:** You write increasingly narrow prompts trying to get Sonnet to match GPT-5's reasoning depth.
**Instead:** Accept that the gap is architectural. Use Sonnet for what it's good at (speed, breadth, structural review), use GPT-5/Opus for reasoning-heavy tasks.
---
## Model Personality Cheat Sheet
| Model | Default behavior | Thinks like a... |
|-------|-----------------|------------------|
| GPT-5 | Exhaustive enumeration, high verification bar, slow | Thorough auditor checking every line item |
| Opus | Design tensions, self-correcting, efficient | Architect who sees how pieces conflict |
| Sonnet 4.6 | Fast structural scan, self-filtering, concise | Senior engineer doing a quick review |
| Sonnet 4.5 | Exhaustive, verbose, occasional severity inflation | Junior engineer trying to catch everything |
| GPT-4.1 | Structured, stays within the document's framing | Competent analyst following a checklist |
| GPT-4.1 Mini | Formulaic, maps findings 1:1 to document sections | Intern reading the doc and noting concerns |
| Model | Personality | Default Behavior | Give It |
|-------|-------------|------------------|---------|
| **GPT-5** | Exhaustive enumerator | Lists everything systematically | Bounded tasks, explicit output format, single-actor instructions |
| **Opus** | Design critic | Finds tensions and contradictions | Open-ended analysis, room to reason about boundaries |
| **Sonnet 4.6** | Structural scanner | Fast, precise, shallow | Quick first-pass work, structural review |
| **Sonnet 4.5** | Broad coverage | More findings, more noise | When you want breadth over precision |
| **GPT-4.1** | Generic competent | Stays within document framing | Middle-ground cost-sensitive work |
| **GPT-4.1 Mini** | Template filler | Formulaic but catches obvious things | Bulk screening, sanity checks |
### Opus Superpower
Opus finds where the spec's **own assumptions are false**. It doesn't just find missing things — it finds things the spec *believes* to be true that *aren't*.
Examples:
- "Realized P&L cannot recover" — the de-escalation model assumes all metrics can improve, but this one fundamentally cannot (#31)
- "Forward detection logic is backwards" — spec describes triggers in the wrong direction (#32)
- "Stop-loss defeated by temporal composition" — safety mechanism rendered ineffective by slow strategy (#47)
### GPT-5 Superpower
GPT-5 reasons about the document's **relationship to the real world**. It asks "what must be true about the external world for this to work?"
Examples:
- Broker rate limiting (429s) bypasses "connection lost" detection (#9)
- Corporate actions bypass staleness detection (#9)
- DB "commit unknown outcome" causing restart loops (#9)
- Cross-symbol strategies with partial staleness (#9)
- IRS rule nuances that simplifications violate (#54)
---
## The Bottom Line
## Decision Framework
**For our specific workflow (gargoyle architecture review, PR reviews, design docs):**
### When to Add Another Model
1. Opus is the default analytical model — most efficient, deepest on consistency/contradiction
2. GPT-5 is the "we can't afford to miss anything" model — use on high-stakes docs
3. Sonnet is the speed/screening model — first pass, structural checks, assumption-finding only
4. Never use Sonnet alone for concurrency, contradiction, or adversarial analysis
5. Always isolate the analytical question from surrounding noise
6. Task-type-specific prompts beat generic "review this" prompts every time
| Situation | Action |
|-----------|--------|
| Sonnet found nothing | Add Opus (may find design tensions) |
| GPT-5 found lots but all similar | Add Opus (may find different class) |
| Opus found tensions but no enumeration | Add GPT-5 (exhaustive coverage) |
| Cross-document task | Use Opus only (2.4x faster) |
| Regulatory/compliance task | Use GPT-5 (correct citations) |
### When NOT to Add Another Model
| Situation | Action |
|-----------|--------|
| Quick structural scan | Sonnet alone is fine |
| Bulk screening | Mini alone is fine |
| Already ran GPT-5 + Opus | Adding Sonnet rarely helps |
| Low-stakes internal doc | One model is enough |
### Cost-Benefit Quick Calc
| Risk level | Model cost | Justified? |
|------------|------------|------------|
| Financial/safety | ~$1-2 for ensemble | Always yes |
| Customer-facing | ~$0.50 for GPT-5 | Usually yes |
| Internal process | ~$0.10 for Sonnet | Always yes |
| One-off exploration | ~$0.02 for Mini | Always yes |
---
## What We Still Don't Know
1. **Corpus bias:** All experiments used gargoyle docs. Rankings may differ for other domains.
2. **Run variance:** All findings are single-run. Stochastic variation is unquantified.
3. **Scale effects:** Largest doc tested is 1,110 lines. Unknown behavior at 2000+.
4. **Non-architecture domains:** These findings are for architecture document analysis, not coding, not chat, not creative writing.
---
## Summary: The Two Things That Matter Most
1. **Task type determines model choice.** Don't pick a model because "it's best." Pick the model that's best for THIS task type.
2. **The union beats any single model.** For high-stakes work, run the ensemble. Different models find qualitatively different things.
Everything else is optimization.