rodin/model-research

Fork 0

Files

T

Rodin 5426026908 docs: regenerate weekly report (2026-05-18)

2026-05-18 16:10:16 +00:00

16 KiB

Raw Permalink Blame History

Lessons Learned: Operational Guide for AI Model Selection

Generated: 2026-05-18 09:02 PDT
Based on: 80 experiments (2026-04-26 to 2026-05-15)

This is the actionable distillation. For evidence and methodology, see REPORT.md.

Quick Reference: Model Selection by Task

┌─────────────────────────────────────────────────────────────────┐
│                    TASK TYPE DECISION TREE                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Is this a VERIFICATION task?                                   │
│  (self-contradiction, consistency check, race condition)        │
│     │                                                           │
│     ├─ YES → Is it CROSS-DOCUMENT comparison?                   │
│     │         │                                                 │
│     │         ├─ YES → Use Opus (or Sonnet for inter-doc        │
│     │         │        contradictions specifically)             │
│     │         │                                                 │
│     │         └─ NO → Use GPT-5 + Opus (skip Sonnet)           │
│     │                  Sonnet has ~33% precision on             │
│     │                  self-contradiction verification          │
│     │                                                           │
│     └─ NO → Is this SECURITY code review?                      │
│              │                                                  │
│              ├─ YES → Use dedicated security persona            │
│              │        (generalist reviewers miss it)            │
│              │                                                  │
│              └─ NO → Is this HIGH-STAKES?                      │
│                       (financial, safety, regulatory)           │
│                       │                                         │
│                       ├─ YES → Run all three                   │
│                       │        (GPT-5 + Opus + Sonnet)         │
│                       │        Total: ~$0.50-0.70              │
│                       │                                         │
│                       └─ NO → Sonnet first-pass               │
│                               Add Opus if findings need depth   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Rules

Rule 1: Match Model to Task Type

If the task is...	Use this	Not this
Finding what's missing	GPT-5	Mini
Finding self-contradictions	GPT-5 + Opus (both)	Sonnet
Cross-document consistency	Opus	GPT-5
Inter-document contradictions	Sonnet	GPT-5
Quick structural scan	Sonnet 4.6	GPT-5
Broad coverage (noise OK)	Sonnet 4.5	Sonnet 4.6
Adversarial attack paths	GPT-5 then Opus	Either alone
Regulatory compliance	GPT-5	Opus
Operational blind spots	GPT-5 + Opus	Sonnet
Security code review	Dedicated security persona	Generalist prompt
State machine completeness	GPT-5	Sonnet
External system assumptions	GPT-5	Sonnet
Counterfactual ordering	GPT-5	Sonnet
Degraded-mode analysis	Opus + GPT-5	Sonnet
Implementation ambiguity	Any (all viable)	—

Rule 2: Don't Trust Sonnet for Verification

Sonnet finds ~3 contradictions but only ~1 is genuine. The others are misreadings. Use Sonnet for identification tasks (what's here?) and inter-document comparison (do these conflict?), not self-contradiction verification (is this internally consistent?).

Exception: Inter-document contradiction (#67) — Sonnet outperforms GPT-5 when comparing two documents for conflicting claims. Parallel comparison ≠ serial verification.

Rule 3: Isolate the Signal

When checking for something specific (bias, contradictions, missing assumptions), extract the relevant text and ask about it directly. Don't bury the question in a broad review mandate. Signal-to-noise ratio matters more than model capability.

Rule 4: Run the Ensemble for High Stakes

For anything financial, safety-critical, or regulatory: run GPT-5 + Opus + Sonnet. Each finds things the others miss. The union is 30-60% larger than any single model. Cost is trivial vs. the value.

Rule 5: Give GPT-5 Enough Tokens

GPT-5 needs max_completion_tokens ≥ 16K. A truncated GPT-5 response is worse than a complete Opus response. Token budget matters more than model size.

Rule 6: Break Large Outputs Into Sections

Single agents die generating 1000+ lines. Rich input is fine; it's output length that kills. For large generation tasks: break into sections, draft in parallel, assemble.

Rule 7: Narrow Framing Doesn't Fix Reasoning Gaps

You cannot make Sonnet match GPT-5/Opus by writing a better prompt. Narrow framing changes WHAT it looks for, not HOW WELL it reasons. The gap is architectural, not prompt engineering.

Rule 8: Specialized Personas Outperform Model Upgrades (for security)

A dedicated security-reviewer persona on the same model catches issues that a generalist reviewer misses and approves. For security code review: configure explicit security criteria (trust boundaries, library edge cases, OS interaction) rather than relying on "please also check security."

Rule 9: Dual-Bot Disagreement Is a Feature

Two reviewers that sometimes disagree create a quality ratchet. The PR can't merge until both are satisfied. Removing one reviewer drops the gate from ~32% REQUEST_CHANGES to ~2%. Never reduce to single-bot without understanding the quality tradeoff.

Rule 10: Monitor Your AI Pipeline

AI pipelines need operational monitoring like any production system. A dispatcher malfunction caused 3.5x cost overage and invalidated experiment data (Finding #80). Monitor: dispatch correctness, per-PR review count, API costs, and reviewer participation patterns.

Anti-Patterns

❌ "Just use the best model for everything"

No single model dominates. GPT-5 is worst for inter-document contradictions. Opus is worst for exhaustive enumeration. Sonnet is worst for self-contradiction verification. Match the task.

❌ "Sonnet is good enough for a quick check"

Only on structural/identification tasks. On verification tasks, Sonnet's 33% precision means 2/3 of its findings are wrong. A "quick check" that produces false confidence is worse than no check.

❌ "We'll fix the prompt to get better results"

Narrow framing helps direct attention but cannot fix reasoning depth gaps. If Sonnet can't verify a contradiction with a perfect prompt, a better prompt won't help. Use a better model.

❌ "One reviewer is enough"

Finding #78: Dropping from dual-bot to single-bot review caused a 15x drop in REQUEST_CHANGES rate. The disagreement between reviewers is where quality lives. A single reviewer has blind spots; two reviewers catch each other's misses.

❌ "Security is just another review criteria"

Findings #79, #79b: Generalist reviewers (Sonnet, GPT) both APPROVED code with critical SSRF bypasses. Only a dedicated security persona blocked merge. Security requires domain-specific knowledge (CGN ranges, proxy inheritance, call-site consistency) that generalist prompts don't invoke.

❌ "More reviews = better quality"

Finding #80: When the dispatcher malfunctioned, PRs received 14+ reviews from 6 reviewers. This didn't improve quality — it inflated costs 3.5x and created noise. Targeted, specialized reviews > spray-and-pray.

❌ "The models will catch operational issues"

Finding #80: A broken dispatcher ran for days before anyone noticed the 3.5x cost overage. AI pipelines need traditional observability (cost monitoring, dispatch verification, participation metrics) — they don't self-diagnose operational problems.

Operational Playbooks

Playbook 1: Architecture Document Review

1. Sonnet first-pass (15-40s, $0.02)
   - Structural gaps, missing sections, obvious issues
   - Decision: Is this worth deeper analysis?

2. If yes → GPT-5 focused analysis (80-140s, $0.40)
   - Hidden assumptions, domain-specific gaps
   - Regulatory compliance (if applicable)
   - Temporal/ordering hazards

3. Opus design-tension analysis (50-120s, $0.12)
   - Where do principles conflict?
   - Where do safety mechanisms become vulnerabilities?
   - Cross-document consistency (if multiple docs)

4. Union findings → prioritize by severity

Playbook 2: Security Code Review

1. Standard review (generalist model, any)
   - Code structure, patterns, obvious issues
   - Note: Will likely APPROVE even with security gaps

2. Dedicated security persona (MANDATORY for auth/crypto/network)
   - Explicit criteria: trust boundaries, library semantics, OS interaction
   - Checks: HTTPS enforcement at every call site
   - Checks: IP validation (is_global vs is_private)
   - Checks: Transport inheritance (proxy, TLS, timeout settings)
   - Checks: Write-path vs read-path consistency

3. Security persona has VETO power
   - If security says REQUEST_CHANGES, it blocks regardless of other approvals

Playbook 3: Multi-Model Review Pipeline

Configuration (optimal based on Finding #78):
- Bot 1: Structural/pattern reviewer (Sonnet — fast, catches structural issues)
- Bot 2: Depth/logic reviewer (GPT — catches reasoning issues)
- Bot 3: Security reviewer (dedicated persona — catches security issues)

Rules:
- ALL bots must approve for merge (disagreement = quality signal)
- Monitor: REQUEST_CHANGES rate should be 20-40% (too low = degraded gate)
- Monitor: Per-PR review count should be 4-8 (too high = dispatcher bug)
- Monitor: API cost per PR (set alerts for >2x expected)

Operational:
- If REQUEST_CHANGES drops below 10% → investigate (model config? code quality? gate degraded?)
- If review count exceeds 15 → check dispatcher/webhook configuration
- Track: which bot finds which issues → continuous model-task matching

Playbook 4: Regulatory Compliance Review

1. Sonnet structural scan ($0.02, 15-30s)
   - Identify which regulatory categories are addressed
   - Flag structural gaps (missing sections, uncovered rules)

2. GPT-5 regulatory cross-reference ($0.40, 120-160s)
   - Rule-by-rule comparison with actual regulations
   - Cite specific IRS/FINRA/SEC sections
   - Identify mathematical/formula errors
   - Find edge cases the implementation misses

3. Opus operational compliance ($0.12, 50-80s)
   - What the system needs to DO at runtime
   - Cross-account, cross-entity obligations
   - Where the implementation's model doesn't match regulatory reality

4. Combine → prioritize Critical/High for immediate action

Playbook 5: Cross-Document Consistency Check

If comparing 2-3 documents for contradictions:
  → Opus primary (2.4x faster, finds more boundary tensions)
  → Sonnet for parallel comparison of claims (faster for direct conflicts)
  → GPT-5 only if documents are very complex (1000+ lines combined)

If checking one document for self-consistency:
  → GPT-5 for specification conflicts (statement A contradicts statement B)
  → Opus for logical impossibilities (rule A + rule B = impossible condition)
  → Skip Sonnet (33% precision, wastes time filtering false positives)

Model Personality Cheat Sheet

Model	Thinks Like	Asks	Finds
GPT-5	Systems engineer with 20 years experience	"What does the real world need that this doesn't address?"	Infrastructure gaps, operational hazards, regulatory oversights
Opus	Architecture critic / philosophy professor	"Where do your own principles contradict each other?"	Design tensions, logical impossibilities, safety-mechanism-as-vulnerability
Sonnet 4.6	Junior developer implementing the spec	"If I were coding this, what would confuse me?"	Implementation ambiguities, structural gaps, obvious missing pieces
Sonnet 4.5	Enthusiastic intern brainstorming	"What COULD go wrong?"	Broad list of concerns (noisy, needs filtering)
GPT-4.1	Reliable senior dev doing code review	"Does this follow the patterns correctly?"	Structural issues, format consistency
GPT-4.1 Mini	Fast intern doing a checklist	"Is this obviously incomplete?"	Missing sections, obvious gaps

Decision Framework: "Should I Run Another Model?"

After getting results from Model A, ask:

1. Is the task VERIFICATION? (contradictions, races, consistency)
   → YES: Run the complementary model (GPT-5 ↔ Opus)
   → Sonnet results alone are insufficient for verification

2. Are the stakes HIGH? (financial, safety, regulatory, security)
   → YES: Run all available models. The marginal cost ($0.10-0.50)
     is negligible vs. the cost of a missed finding.

3. Did Model A find < 5 issues on a complex document?
   → YES: The task might not suit Model A. Try a different model.
   → GPT-5 finding < 5 issues is unusual — check prompt/token budget.

4. Is this a SCREENING pass before deeper work?
   → YES: Sonnet is sufficient. Save heavy models for the deep dive.
   → NO: Add GPT-5 or Opus depending on task type.

5. Is the document about SECURITY (auth, crypto, network)?
   → YES: Use a dedicated security persona regardless of other reviewers.
   → Generalist models will likely approve code with security gaps.

Key Numbers to Remember

Fact	Number	Source
Sonnet precision on contradiction detection	~33%	#39, #43
Adversarial ensemble improvement over single model	+30% findings	#35
Adversarial ensemble extra cost	+28% tokens	#35
Opus speed advantage over GPT-5 (cross-doc)	2.4x faster	#28
Opus token efficiency vs GPT-5	6-9x fewer tokens/finding	Multiple
GPT-5 minimum token budget	16K completion tokens	#5, multiple
Union findings vs best single model	30-60% more	Multiple
Dual-bot REQUEST_CHANGES rate	~32%	#78
Single-bot REQUEST_CHANGES rate	~2%	#78
Post-merge escape: missing tests	22% of all findings	#78
Security persona catch rate vs generalist	100% vs 0% (on security issues)	#79
Dispatcher malfunction cost overage	3.5x	#80
Total findings analyzed	80	This report
Total validated analytical lenses	28	REPORT.md Part 6

What Changed Since Last Report

New rules added:

Rule 8: Specialized Personas Outperform Model Upgrades (for security)
Rule 9: Dual-Bot Disagreement Is a Feature
Rule 10: Monitor Your AI Pipeline

New anti-patterns:

"One reviewer is enough"
"Security is just another review criteria"
"More reviews = better quality"
"The models will catch operational issues"

New playbooks:

Playbook 2: Security Code Review
Playbook 3: Multi-Model Review Pipeline (updated from operational data)

Updated decision tree:

Added security review branch
Added inter-document contradiction (Sonnet dominance) path

Evolution Notes

This document evolves weekly. As new findings emerge:

Rules are added/modified when patterns are validated across 3+ experiments
Anti-patterns are added after observing repeated mistakes
Playbooks are updated when operational data improves recommendations
The decision tree is simplified when clearer heuristics emerge

The goal: make model selection as automatic as possible, so the researcher spends time on analysis rather than model choice.

16 KiB Raw Permalink Blame History