rodin/model-research

Fork 0

Files

T

Rodin 828da269c0 docs: regenerate weekly report (2026-05-11)

2026-05-11 09:04:35 -07:00

11 KiB

Raw Blame History

Lessons Learned: Operational Guide for AI Model Selection

Generated: 2026-05-11 09:00 PDT
Based on: 74 experiments (2026-04-26 to 2026-05-11)

This is the actionable distillation. For evidence and methodology, see REPORT.md.

Quick Reference: Model Selection by Task

┌─────────────────────────────────────────────────────────────────┐
│                    TASK TYPE DECISION TREE                      │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Is this a VERIFICATION task?                                   │
│  (contradiction, consistency, race condition)                   │
│     │                                                           │
│     ├─ YES → Use GPT-5 + Opus (skip Sonnet)                    │
│     │        Sonnet has ~33% precision on verification          │
│     │                                                           │
│     └─ NO → Is this CROSS-DOCUMENT?                            │
│              │                                                  │
│              ├─ YES → Use Opus (2.4x faster, more findings)    │
│              │                                                  │
│              └─ NO → Is this HIGH-STAKES?                      │
│                       (financial, safety, regulatory)           │
│                       │                                         │
│                       ├─ YES → Run all three                   │
│                       │        (GPT-5 + Opus + Sonnet)         │
│                       │        Total: ~$1-2, worth it          │
│                       │                                         │
│                       └─ NO → Sonnet first-pass               │
│                               Add Opus if findings need depth   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Rules

Rule 1: Match Model to Task Type

If the task is...	Use this	Not this
Finding what's missing	GPT-5	Mini
Finding contradictions	Opus	Sonnet
Cross-document consistency	Opus	GPT-5
Quick structural scan	Sonnet 4.6	GPT-5
Broad coverage (noise OK)	Sonnet 4.5	Sonnet 4.6
Adversarial attack paths	GPT-5 then Opus	Either alone
Regulatory compliance	GPT-5	Opus
Operational blind spots	GPT-5	Sonnet

Rule 2: Don't Trust Sonnet for Verification

Sonnet finds ~3 contradictions but only ~1 is genuine. The others are misreadings. Use Sonnet for identification tasks (what's here?), not verification tasks (is this true?).

Rule 3: Isolate the Signal

When checking for something specific (bias, contradictions, missing assumptions), extract the relevant text and ask about it directly. Don't bury the question in a broad review mandate. Signal-to-noise ratio matters more than model capability.

Rule 4: Run the Ensemble for High Stakes

For anything financial, safety-critical, or regulatory: run GPT-5 + Opus + Sonnet. Each finds things the others miss. The union is larger than any single model's output. Cost is trivial vs. the value.

Rule 5: Give GPT-5 Enough Tokens

GPT-5 needs max_completion_tokens ≥ 16K. A truncated GPT-5 response is worse than a complete Opus response. Token budget matters more than model size.

Rule 6: Break Large Outputs Into Sections

Single agents die generating 1000+ lines. Rich input is fine; it's output length that kills. For large generation tasks: break into sections, draft in parallel, assemble.

Rule 7: Narrow Framing Doesn't Fix Reasoning Gaps

You cannot make Sonnet match GPT-5/Opus by writing a better prompt. Narrow framing changes WHAT it looks for, not HOW WELL it reasons. The gap is architectural, not prompt engineering.

Operational Playbooks

Playbook A: Architecture Document Review

First pass (Sonnet 4.6): ~30s, catches structural issues, broken refs, obvious gaps
Deep analysis (GPT-5): ~90s, finds domain-specific gaps, hidden assumptions, edge cases
Design tensions (Opus): ~60s, finds where the design contradicts itself
Merge and dedupe: Union of all three, remove duplicates, sort by severity

Playbook B: Cross-Document Consistency Check

Use Opus only. It's 2.4x faster than GPT-5 and finds more issues.
Provide both documents in a single prompt (~25KB max)
Explicitly exclude omissions in the prompt — you want contradictions, not "Doc A covers X but Doc B doesn't"

Playbook C: Adversarial Security Review

First pass (GPT-5): Exhaustive enumeration of attack surface
Extension pass (Opus): Give Opus GPT-5's findings, ask it to critique and extend
Result: 30% more findings at 28% more cost, with prioritization

Playbook D: Regulatory Compliance Review

First pass (Sonnet): ~25s, identifies areas of concern
Deep dive (GPT-5): Regulatory specificity, correct citations, edge cases
GPT-5's reasoning tokens are spent on verification — trust its citations

Playbook E: Contradiction Detection

Use GPT-5 + Opus in parallel (not Sonnet)
GPT-5 finds: Specification conflicts (same scenario, different prescriptions)
Opus finds: Logical impossibilities (rules that can't coexist)
Neither dominates — they find different classes of contradiction

Anti-Patterns

❌ Anti-Pattern 1: Using Sonnet for Verification Tasks

What happens: Sonnet reports contradictions that aren't real. You waste time investigating false positives or worse, trust a false negative.

Instead: Use GPT-5 or Opus for any task requiring "is this true?" reasoning.

❌ Anti-Pattern 2: Giving GPT-5 a Broad Mandate

What happens: GPT-5 spawns sub-agents, times out, or dumps raw tool output instead of synthesizing.

Instead: Give GPT-5 explicit single-actor instructions + output format. For Claude, broader mandates are fine.

❌ Anti-Pattern 3: Burying Important Checks in Large Reviews

What happens: The model misses the important thing because it's one of 47 things to check.

Instead: Extract the important check and ask about it specifically. Signal-to-noise ratio matters.

❌ Anti-Pattern 4: Extrapolating Across Task Types

What happens: "GPT-5 was great at X, so I'll use it for Y" — and it's mediocre.

Instead: Task type predicts performance better than "model X is better." Check the task-type table.

❌ Anti-Pattern 5: Skipping the Union

What happens: You run one model, miss things another would have caught, and the bug reaches production.

Instead: For high-stakes work, run the ensemble. The cost is trivial vs. the risk.

❌ Anti-Pattern 6: Tuning Reasoning Effort

What happens: You spend time adjusting low/medium/high reasoning effort parameters.

Instead: Don't bother. It has negligible effect on analytical work. Task type is the lever.

❌ Anti-Pattern 7: Trying to Fix Sonnet with Prompts

What happens: You write increasingly narrow prompts trying to get Sonnet to match GPT-5's reasoning depth.

Instead: Accept that the gap is architectural. Use Sonnet for what it's good at (speed, breadth, structural review), use GPT-5/Opus for reasoning-heavy tasks.

Model Personality Cheat Sheet

Model	Personality	Default Behavior	Give It
GPT-5	Exhaustive enumerator	Lists everything systematically	Bounded tasks, explicit output format, single-actor instructions
Opus	Design critic	Finds tensions and contradictions	Open-ended analysis, room to reason about boundaries
Sonnet 4.6	Structural scanner	Fast, precise, shallow	Quick first-pass work, structural review
Sonnet 4.5	Broad coverage	More findings, more noise	When you want breadth over precision
GPT-4.1	Generic competent	Stays within document framing	Middle-ground cost-sensitive work
GPT-4.1 Mini	Template filler	Formulaic but catches obvious things	Bulk screening, sanity checks

Opus Superpower

Opus finds where the spec's own assumptions are false. It doesn't just find missing things — it finds things the spec believes to be true that aren't.

Examples:

"Realized P&L cannot recover" — the de-escalation model assumes all metrics can improve, but this one fundamentally cannot (#31)
"Forward detection logic is backwards" — spec describes triggers in the wrong direction (#32)
"Stop-loss defeated by temporal composition" — safety mechanism rendered ineffective by slow strategy (#47)

GPT-5 Superpower

GPT-5 reasons about the document's relationship to the real world. It asks "what must be true about the external world for this to work?"