docs: regenerate weekly report (2026-05-11)

This commit is contained in:
Rodin
2026-05-11 09:04:35 -07:00
parent 2ca8c974f3
commit 828da269c0
2 changed files with 394 additions and 140 deletions
+218 -87
View File
@@ -1,114 +1,245 @@
# Actionable Lessons: Using AI Models for Analytical Work
# Lessons Learned: Operational Guide for AI Model Selection
> **Generated:** 2026-05-06 07:30 PDT
> **Based on:** 29 experiments (2026-04-26 to 2026-05-06)
> **Generated:** 2026-05-11 09:00 PDT
> **Based on:** 74 experiments (2026-04-26 to 2026-05-11)
_Distilled from 29 experiments. These are the rules._
_This is the actionable distillation. For evidence and methodology, see REPORT.md._
---
## The Three Rules
## Quick Reference: Model Selection by Task
### 1. Match the model to the task, not the prestige
| If you need... | Use... | Why |
|---------------|--------|-----|
| "What's missing from this design?" | GPT-5 | Reasons about the world outside the document |
| "Where does this design contradict itself?" | Opus | Logical argumentation, zero false positives |
| "Is this consistent with that other doc?" | Opus | 2.4x faster, more findings than GPT-5 |
| "How could an attacker exploit this?" | GPT-5 (coverage) + Opus (creativity) | Different attack styles |
| "Quick sanity check before I ship" | Sonnet | Fast, cheap, precise enough |
| "What race conditions exist here?" | GPT-5 + Opus | Sonnet produces errors on concurrency |
| "Is there bias in this text?" | Anything (even Mini) | All models catch isolated bias equally |
### 2. Isolate the signal before asking the question
Don't ask "review this PR" and expect the model to catch a subtle bias buried in 6,600 lines of diff. Extract what matters, ask about it directly. Rich context dilutes attention.
**Pattern:**
- ❌ "Review this PR for quality, correctness, and bias" (broad mandate + rich context = missed signals)
- ✅ "Here are 12 hypotheses. Do any lead toward a predetermined conclusion?" (narrow question + minimal context = found everything)
### 3. Run multiple models on anything that matters
No single model finds everything. The union of GPT-5 + Opus + Sonnet finds more than any individual. The missing findings from a single-model run are disproportionately the ones that would cause production incidents.
**Decision framework:**
- **Costs nothing to get wrong:** One model is fine (Sonnet for speed, Opus for depth)
- **Would be embarrassing to miss:** Two models (Opus + GPT-5)
- **Would cost money or safety:** Three models (all three, plus manual review of unique findings)
---
## Operational Playbook
### Architecture Document Review
```
1. Opus: contradiction detection + cross-doc consistency
2. GPT-5: hidden assumptions + gap-finding
3. Sonnet: quick structural scan (broken refs, missing sections)
4. Merge findings, deduplicate, triage by severity
```
### Pre-Implementation Spec Review
```
1. Opus: "Where do the stated principles conflict?"
2. GPT-5: "What must be true about the world for this to work?"
3. Sonnet 4.5: "What would an implementer have to guess?"
```
### Security/Adversarial Review
```
1. GPT-5: "Enumerate all possible abuses of each mechanism"
2. Opus: "What would a smart adversary do that the designer didn't consider?"
3. Union the findings — GPT-5 catches mechanism-level, Opus catches system-level
```
### PR Review (dual-reviewer pattern)
```
- Sonnet: structural issues, broken links, formatting
- GPT-5: semantic issues, logical gaps, verdict mismatches
- For important PRs: add Opus for design-tension detection
┌─────────────────────────────────────────────────────────────────┐
│ TASK TYPE DECISION TREE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Is this a VERIFICATION task? │
│ (contradiction, consistency, race condition) │
│ │ │
│ ├─ YES → Use GPT-5 + Opus (skip Sonnet) │
│ │ Sonnet has ~33% precision on verification │
│ │ │
│ └─ NO → Is this CROSS-DOCUMENT? │
│ │ │
│ ├─ YES → Use Opus (2.4x faster, more findings) │
│ │ │
│ └─ NO → Is this HIGH-STAKES? │
│ (financial, safety, regulatory) │
│ │ │
│ ├─ YES → Run all three │
│ │ (GPT-5 + Opus + Sonnet) │
│ │ Total: ~$1-2, worth it │
│ │ │
│ └─ NO → Sonnet first-pass │
│ Add Opus if findings need depth │
│ │
└─────────────────────────────────────────────────────────────────┘
```
---
## Anti-Patterns (Things That Don't Work)
## Rules
1. **"Use the most expensive model for everything"** — GPT-5 is 5-9x more expensive than Opus per finding, and Opus beats it on contradiction/consistency tasks.
### Rule 1: Match Model to Task Type
2. **"Reasoning effort = better output"** — The low/medium/high parameter has negligible effect on analytical tasks. Don't bother tuning it.
| If the task is... | Use this | Not this |
|-------------------|----------|----------|
| Finding what's missing | GPT-5 | Mini |
| Finding contradictions | Opus | Sonnet |
| Cross-document consistency | Opus | GPT-5 |
| Quick structural scan | Sonnet 4.6 | GPT-5 |
| Broad coverage (noise OK) | Sonnet 4.5 | Sonnet 4.6 |
| Adversarial attack paths | GPT-5 then Opus | Either alone |
| Regulatory compliance | GPT-5 | Opus |
| Operational blind spots | GPT-5 | Sonnet |
3. **"Sonnet can do anything Opus does, just shallower"** — Wrong. Sonnet produces *errors* on concurrency reasoning and false positives on contradiction detection. It's not "cheaper Opus" — it's a different tool.
### Rule 2: Don't Trust Sonnet for Verification
4. **"More context = better analysis"** — Signal-to-noise ratio matters more than context richness. Isolate what you're asking about.
Sonnet finds ~3 contradictions but only ~1 is genuine. The others are misreadings. Use Sonnet for *identification* tasks (what's here?), not *verification* tasks (is this true?).
5. **"One good prompt works everywhere"** — Prompt framing shapes output more than model choice. The same model with a broad vs narrow prompt produces qualitatively different work. Design prompts per task type.
### Rule 3: Isolate the Signal
6. **"Run it once, trust the output"** — Single runs are stochastic. Models miss things non-deterministically. Multiple models or multiple runs are the only hedge.
When checking for something specific (bias, contradictions, missing assumptions), extract the relevant text and ask about it directly. Don't bury the question in a broad review mandate. Signal-to-noise ratio matters more than model capability.
### Rule 4: Run the Ensemble for High Stakes
For anything financial, safety-critical, or regulatory: run GPT-5 + Opus + Sonnet. Each finds things the others miss. The union is larger than any single model's output. Cost is trivial vs. the value.
### Rule 5: Give GPT-5 Enough Tokens
GPT-5 needs `max_completion_tokens` ≥ 16K. A truncated GPT-5 response is worse than a complete Opus response. Token budget matters more than model size.
### Rule 6: Break Large Outputs Into Sections
Single agents die generating 1000+ lines. Rich input is fine; it's output length that kills. For large generation tasks: break into sections, draft in parallel, assemble.
### Rule 7: Narrow Framing Doesn't Fix Reasoning Gaps
You cannot make Sonnet match GPT-5/Opus by writing a better prompt. Narrow framing changes WHAT it looks for, not HOW WELL it reasons. The gap is architectural, not prompt engineering.
---
## Operational Playbooks
### Playbook A: Architecture Document Review
1. **First pass (Sonnet 4.6):** ~30s, catches structural issues, broken refs, obvious gaps
2. **Deep analysis (GPT-5):** ~90s, finds domain-specific gaps, hidden assumptions, edge cases
3. **Design tensions (Opus):** ~60s, finds where the design contradicts itself
4. **Merge and dedupe:** Union of all three, remove duplicates, sort by severity
### Playbook B: Cross-Document Consistency Check
1. **Use Opus only.** It's 2.4x faster than GPT-5 and finds more issues.
2. **Provide both documents in a single prompt** (~25KB max)
3. **Explicitly exclude omissions** in the prompt — you want contradictions, not "Doc A covers X but Doc B doesn't"
### Playbook C: Adversarial Security Review
1. **First pass (GPT-5):** Exhaustive enumeration of attack surface
2. **Extension pass (Opus):** Give Opus GPT-5's findings, ask it to critique and extend
3. **Result:** 30% more findings at 28% more cost, with prioritization
### Playbook D: Regulatory Compliance Review
1. **First pass (Sonnet):** ~25s, identifies areas of concern
2. **Deep dive (GPT-5):** Regulatory specificity, correct citations, edge cases
3. **GPT-5's reasoning tokens are spent on verification** — trust its citations
### Playbook E: Contradiction Detection
1. **Use GPT-5 + Opus in parallel** (not Sonnet)
2. **GPT-5 finds:** Specification conflicts (same scenario, different prescriptions)
3. **Opus finds:** Logical impossibilities (rules that can't coexist)
4. **Neither dominates** — they find different classes of contradiction
---
## Anti-Patterns
### ❌ Anti-Pattern 1: Using Sonnet for Verification Tasks
**What happens:** Sonnet reports contradictions that aren't real. You waste time investigating false positives or worse, trust a false negative.
**Instead:** Use GPT-5 or Opus for any task requiring "is this true?" reasoning.
### ❌ Anti-Pattern 2: Giving GPT-5 a Broad Mandate
**What happens:** GPT-5 spawns sub-agents, times out, or dumps raw tool output instead of synthesizing.
**Instead:** Give GPT-5 explicit single-actor instructions + output format. For Claude, broader mandates are fine.
### ❌ Anti-Pattern 3: Burying Important Checks in Large Reviews
**What happens:** The model misses the important thing because it's one of 47 things to check.
**Instead:** Extract the important check and ask about it specifically. Signal-to-noise ratio matters.
### ❌ Anti-Pattern 4: Extrapolating Across Task Types
**What happens:** "GPT-5 was great at X, so I'll use it for Y" — and it's mediocre.
**Instead:** Task type predicts performance better than "model X is better." Check the task-type table.
### ❌ Anti-Pattern 5: Skipping the Union
**What happens:** You run one model, miss things another would have caught, and the bug reaches production.
**Instead:** For high-stakes work, run the ensemble. The cost is trivial vs. the risk.
### ❌ Anti-Pattern 6: Tuning Reasoning Effort
**What happens:** You spend time adjusting low/medium/high reasoning effort parameters.
**Instead:** Don't bother. It has negligible effect on analytical work. Task type is the lever.
### ❌ Anti-Pattern 7: Trying to Fix Sonnet with Prompts
**What happens:** You write increasingly narrow prompts trying to get Sonnet to match GPT-5's reasoning depth.
**Instead:** Accept that the gap is architectural. Use Sonnet for what it's good at (speed, breadth, structural review), use GPT-5/Opus for reasoning-heavy tasks.
---
## Model Personality Cheat Sheet
| Model | Default behavior | Thinks like a... |
|-------|-----------------|------------------|
| GPT-5 | Exhaustive enumeration, high verification bar, slow | Thorough auditor checking every line item |
| Opus | Design tensions, self-correcting, efficient | Architect who sees how pieces conflict |
| Sonnet 4.6 | Fast structural scan, self-filtering, concise | Senior engineer doing a quick review |
| Sonnet 4.5 | Exhaustive, verbose, occasional severity inflation | Junior engineer trying to catch everything |
| GPT-4.1 | Structured, stays within the document's framing | Competent analyst following a checklist |
| GPT-4.1 Mini | Formulaic, maps findings 1:1 to document sections | Intern reading the doc and noting concerns |
| Model | Personality | Default Behavior | Give It |
|-------|-------------|------------------|---------|
| **GPT-5** | Exhaustive enumerator | Lists everything systematically | Bounded tasks, explicit output format, single-actor instructions |
| **Opus** | Design critic | Finds tensions and contradictions | Open-ended analysis, room to reason about boundaries |
| **Sonnet 4.6** | Structural scanner | Fast, precise, shallow | Quick first-pass work, structural review |
| **Sonnet 4.5** | Broad coverage | More findings, more noise | When you want breadth over precision |
| **GPT-4.1** | Generic competent | Stays within document framing | Middle-ground cost-sensitive work |
| **GPT-4.1 Mini** | Template filler | Formulaic but catches obvious things | Bulk screening, sanity checks |
### Opus Superpower
Opus finds where the spec's **own assumptions are false**. It doesn't just find missing things — it finds things the spec *believes* to be true that *aren't*.
Examples:
- "Realized P&L cannot recover" — the de-escalation model assumes all metrics can improve, but this one fundamentally cannot (#31)
- "Forward detection logic is backwards" — spec describes triggers in the wrong direction (#32)
- "Stop-loss defeated by temporal composition" — safety mechanism rendered ineffective by slow strategy (#47)
### GPT-5 Superpower
GPT-5 reasons about the document's **relationship to the real world**. It asks "what must be true about the external world for this to work?"
Examples:
- Broker rate limiting (429s) bypasses "connection lost" detection (#9)
- Corporate actions bypass staleness detection (#9)
- DB "commit unknown outcome" causing restart loops (#9)
- Cross-symbol strategies with partial staleness (#9)
- IRS rule nuances that simplifications violate (#54)
---
## The Bottom Line
## Decision Framework
**For our specific workflow (gargoyle architecture review, PR reviews, design docs):**
### When to Add Another Model
1. Opus is the default analytical model — most efficient, deepest on consistency/contradiction
2. GPT-5 is the "we can't afford to miss anything" model — use on high-stakes docs
3. Sonnet is the speed/screening model — first pass, structural checks, assumption-finding only
4. Never use Sonnet alone for concurrency, contradiction, or adversarial analysis
5. Always isolate the analytical question from surrounding noise
6. Task-type-specific prompts beat generic "review this" prompts every time
| Situation | Action |
|-----------|--------|
| Sonnet found nothing | Add Opus (may find design tensions) |
| GPT-5 found lots but all similar | Add Opus (may find different class) |
| Opus found tensions but no enumeration | Add GPT-5 (exhaustive coverage) |
| Cross-document task | Use Opus only (2.4x faster) |
| Regulatory/compliance task | Use GPT-5 (correct citations) |
### When NOT to Add Another Model
| Situation | Action |
|-----------|--------|
| Quick structural scan | Sonnet alone is fine |
| Bulk screening | Mini alone is fine |
| Already ran GPT-5 + Opus | Adding Sonnet rarely helps |
| Low-stakes internal doc | One model is enough |
### Cost-Benefit Quick Calc
| Risk level | Model cost | Justified? |
|------------|------------|------------|
| Financial/safety | ~$1-2 for ensemble | Always yes |
| Customer-facing | ~$0.50 for GPT-5 | Usually yes |
| Internal process | ~$0.10 for Sonnet | Always yes |
| One-off exploration | ~$0.02 for Mini | Always yes |
---
## What We Still Don't Know
1. **Corpus bias:** All experiments used gargoyle docs. Rankings may differ for other domains.
2. **Run variance:** All findings are single-run. Stochastic variation is unquantified.
3. **Scale effects:** Largest doc tested is 1,110 lines. Unknown behavior at 2000+.
4. **Non-architecture domains:** These findings are for architecture document analysis, not coding, not chat, not creative writing.
---
## Summary: The Two Things That Matter Most
1. **Task type determines model choice.** Don't pick a model because "it's best." Pick the model that's best for THIS task type.
2. **The union beats any single model.** For high-stakes work, run the ensemble. Different models find qualitatively different things.
Everything else is optimization.
+176 -53
View File
@@ -1,136 +1,238 @@
# Model Research Report: AI Models for Analytical Work
> **Generated:** 2026-05-06 07:30 PDT
> **Findings analyzed:** 29
> **Period:** 2026-04-26 to 2026-05-06
> **Generated:** 2026-05-11 09:00 PDT
> **Findings analyzed:** 74
> **Period:** 2026-04-26 to 2026-05-11
_29 experiments across 11 days. Five models tested on architecture document analysis — not coding._
_74 experiments across 16 days. Six models tested on architecture document analysis — not coding._
---
## What's New (Since May 6)
**45 new findings** (29 → 74) covering:
- **New task types validated:** Operational blind spot analysis (#46), emergent behavior from rule composition (#47), defense-in-depth gaps (#48), adversarial evasion/tampering (#49), concurrency race conditions (#50), implementation ambiguity (#51), degraded mode propagation (#52), unstated constraints (#53), state reconstruction correctness (#55), operational burden (#56), event flow correctness (#57), state machine completeness (#58), convention-rule gaps (#59), counterfactual event ordering (#60), regulatory completeness (#61), data integrity signal flow (#62), external system assumptions (#63), specification gaps (#64), temporal correctness (#65), concurrent write hazards (#65b), cross-context contract coherence (#68), boundary contract analysis, boundary violation analysis, inter-document contradiction analysis, security boundary analysis, audit log data integrity (#11-May), wash sale regulatory compliance (#11-May)
- **Cross-document consistency expanded** (#37, #44): Opus confirmed as dominant for subtle contradictions across tightly-coupled docs
- **Regulatory compliance analysis depth** (#38, #54, #61): GPT-5 excels at IRS/regulatory specificity with correct citations
- **Narrow framing tested and rejected** (#39, #43): Sonnet cannot match GPT-5/Opus via prompt framing alone — reasoning depth is the bottleneck
- **Adversarial ensemble validated** (#35): Critique-then-extend produces 30% more findings at 28% more cost
- **Operational burden as distinct lens** (#45, #56): Models diverge on what constitutes "operator cognitive load"
- **Silent data corruption paths** (#40): GPT-5 excels at tracing multi-step corruption through financial accounting
- **Temporal ordering dependencies** (#41): All models catch obvious ordering; GPT-5 unique on subtle cascades
- **Failure propagation chains** (#42): Opus finds the architectural insight; GPT-5 finds the enumeration
---
## Executive Summary
We tested GPT-5, Claude Opus, Claude Sonnet 4.6, Claude Sonnet 4.5, GPT-4.1, and GPT-4.1 Mini on analytical tasks: finding hidden assumptions, race conditions, design contradictions, adversarial attack paths, regulatory gaps, and cross-document inconsistencies in real architecture documents.
We tested GPT-5, Claude Opus, Claude Sonnet 4.6, Claude Sonnet 4.5, GPT-4.1, and GPT-4.1 Mini on analytical tasks: finding hidden assumptions, race conditions, design contradictions, adversarial attack paths, regulatory gaps, cross-document inconsistencies, operational blind spots, emergent behaviors, and security boundaries in real architecture documents.
**The central finding:** Different models don't just find more or fewer things — they find *qualitatively different kinds* of things. Model choice is task-dependent, and no single model dominates all analytical work.
**The secondary finding:** Task type predicts model performance better than "model X is better." A model that excels at gap-finding may struggle at contradiction detection. Match the model to the task.
---
## Part 1: What Each Model Is Good At
### GPT-5
**Strength:** Exhaustive enumeration + domain-specific reasoning about the real world.
GPT-5's reasoning tokens change the *kind* of analysis, not just the depth. Non-reasoning models identify risks within a document's own frame of reference. GPT-5 reasons about the document's relationship to the external world: broker semantics, deployment topology, OTP runtime behavior under load, timing correlations across independent subsystems.
GPT-5's reasoning tokens change the *kind* of analysis, not just the depth. Non-reasoning models identify risks within a document's own frame of reference. GPT-5 reasons about the document's relationship to the external world: broker semantics, deployment topology, OTP runtime behavior under load, timing correlations across independent subsystems, IRS regulatory requirements.
- Best at: gap-finding, hidden assumptions, adversarial enumeration, temporal boundary analysis
- Unique ability: finds multi-component interaction failures that require domain knowledge
- Weakness: slow (2-4x longer than Opus), expensive (5-10x more tokens per finding), sometimes over-verifies
| Capability | Evidence |
|------------|----------|
| Domain-specific gaps | #9, #31: Broker rate limiting, credential rotation, corporate actions |
| Multi-component interactions | #10, #14: Finds assumptions requiring cross-boundary reasoning |
| Adversarial enumeration | #29, #35: Most thorough attack surface coverage |
| Temporal boundary analysis | #18: 15 findings with mathematical precision |
| Regulatory compliance | #23, #38, #54: Correct IRS citations, regulatory edge cases |
| Silent data corruption | #40: Traces multi-step corruption paths |
| Invariant violation paths | #20: Precise, verifiable paths through state space |
| Operational blind spots | #46: 18 findings including cross-service trace gaps |
- Best at: gap-finding, hidden assumptions, adversarial enumeration, temporal boundary analysis, regulatory compliance, operational blind spots
- Unique ability: finds multi-component interaction failures requiring domain knowledge
- Weakness: slow (2-4x longer than Opus), expensive (5-10x more tokens per finding), sometimes over-enumerates
- Finding count: typically 15-35 depending on document complexity
### Claude Opus
**Strength:** Design tensions, logical argumentation, creative adversarial thinking.
**Strength:** Design tensions, logical argumentation, creative adversarial thinking, cross-document consistency.
Opus consistently identifies where one part of a design undermines another part. It doesn't enumerate failure modes — it finds the deeper question of *why* the design's own principles conflict.
- Best at: contradiction detection, cross-document consistency, race conditions (design-level), adversarial creativity
- Unique ability: self-corrects mid-analysis, finds "your safety mechanism IS your vulnerability" patterns
| Capability | Evidence |
|------------|----------|
| Contradiction detection | #25, #43: Finds logical impossibilities via deductive reasoning |
| Cross-document consistency | #28, #37, #44: 2.4x faster than GPT-5, finds more issues |
| Race conditions (design-level) | #13: 10 high-quality findings, self-corrects mid-analysis |
| Adversarial creativity | #29, #35: "Your safety mechanism IS your vulnerability" patterns |
| False assumption detection | #31, #32: Finds where spec's own logic contradicts itself |
| Emergent behavior insight | #47: Stop-loss defeated by temporal composition (best single finding) |
| Survivor bias identification | #46: Decision latency histogram hides stuck decisions |
- Best at: contradiction detection, cross-document consistency, race conditions (design-level), adversarial creativity, finding false assumptions
- Unique ability: self-corrects mid-analysis, finds where protection mechanisms become vulnerabilities
- Most efficient model: 6-9x fewer tokens per finding than GPT-5 on shared task types
- Weakness: produces fewer findings on pure enumeration tasks (10-13 vs GPT-5's 20-35)
### Claude Sonnet 4.6
**Strength:** Speed, structural issues, assumption-finding. Best precision-per-dollar.
| Capability | Evidence |
|------------|----------|
| Quick first-pass screening | #9, #12: 2-3x faster than other models |
| Structural review | #5: Catches formatting, broken links, missing sections |
| Specification gap identification | #16: 13 findings, zero false positives |
| Observability gaps | #33: 11 findings in 36s |
- Best at: quick first-pass screening, structural review, specification gap identification
- Zero false positives on most tasks — every finding is actionable
- Weakness: struggles with concurrency reasoning, contradiction detection, and tasks requiring formal logical reasoning
- Weakness: struggles with concurrency reasoning, contradiction detection, tasks requiring formal logical reasoning
- Produces false positives on verification-heavy tasks (contradiction, race conditions)
**Critical limitation (Finding #39):** Narrow framing does NOT close the gap with GPT-5/Opus. Sonnet can find 3 contradictions but only 1 is genuine (2 are misreadings). The gap is reasoning depth, not framing — Sonnet can't reliably verify whether two statements actually contradict each other.
### Claude Sonnet 4.5
**Strength:** Exhaustive coverage. More findings than 4.6, at the cost of some noise.
- Best at: specification completeness (25 findings vs 4.6's 13)
- Catches operational gaps that 4.6 filters out
| Capability | Evidence |
|------------|----------|
| Specification completeness | #16: 25 findings vs 4.6's 13 |
| Temporal reasoning | #18: 12 findings with no errors (vs 4.6's errors in #13) |
| Operational gaps | Catches gaps that 4.6 filters out |
- Best at: specification completeness, broad coverage
- Tradeoff: severity inflation, more verbose output
- Use 4.5 for coverage, 4.6 for precision
### GPT-4.1
**Strength:** Structured, thorough, good middle ground. Generic but competent.
- Stays within the document's own framing — finds assumptions the document *almost* states
- Best unique contribution: meta-observations about design structure (e.g., "all failure modes treated as isolated")
| Capability | Evidence |
|------------|----------|
| Stays within document framing | #9, #10: Finds assumptions the document almost states |
| Meta-observations | #10: "All failure modes treated as isolated" |
| Cost-effective first pass | Good enough when GPT-5's cost isn't justified |
- Best unique contribution: meta-observations about design structure
- Good enough for first-pass review where GPT-5's cost isn't justified
### GPT-4.1 Mini
**Strength:** Cheapest. Formulaic but catches the obvious things.
- Every finding maps cleanly to a section of the document
| Capability | Evidence |
|------------|----------|
| Scales with document size | #9, #19: 6 findings on 459 lines → 21 on 1,110 lines |
| Clean templates | Every finding maps to a document section |
| Bias detection | #8: Catches bias when signal isn't buried |
- Fine for quick sanity checks, not for architectural insight
- Scales with document size (6 findings on 459 lines → 21 on 1,110 lines)
- Best for: bulk screening, sanity checks, obvious-issue detection
---
## Part 2: What We Learned About Task Types
## Part 2: Task Type → Model Mapping
Not all analytical tasks are the same. Models that excel at one struggle at another.
| Task Type | Best Model | Runner-up | Avoid |
|-----------|-----------|-----------|-------|
| Hidden assumptions | GPT-5 | Opus | Mini (formulaic) |
| Gap-finding | GPT-5 | GPT-4.1 | Mini (surface-level) |
| Race conditions | GPT-5 + Opus | — | Sonnet (errors) |
| Contradiction detection | **Opus** | GPT-5 | Sonnet (false positives) |
| Cross-document consistency | **Opus** | GPT-5 | — |
| Adversarial attack paths | GPT-5 (enumeration) + Opus (creativity) | — | — |
| Bias detection | Any model | — | — |
| Design coherence | Document-dependent | — | — |
| Specification completeness | Sonnet 4.5 (breadth) or GPT-5 (self-contradictions) | — | — |
| Missing feature identification | All (with right prompt) | — | — |
| Invariant violation paths | GPT-5 (precision) | Opus (breadth) | Sonnet (imprecise) |
| Task Type | Best Model | Runner-up | Avoid | Evidence |
|-----------|-----------|-----------|-------|----------|
| **Gap-finding** | GPT-5 | GPT-4.1 | Mini (surface-level) | #9, #31, #64 |
| **Hidden assumptions** | GPT-5 | Opus | Mini (formulaic) | #10, #11, #12, #53 |
| **Race conditions** | GPT-5 + Opus | — | Sonnet (errors) | #13, #50 |
| **Contradiction detection** | **Opus** | GPT-5 | Sonnet (false positives) | #25, #43 |
| **Cross-document consistency** | **Opus** | GPT-5 | — | #28, #37, #44 |
| **Adversarial attack paths** | GPT-5 (enum) + Opus (creativity) | — | — | #29, #35, #49 |
| **Design coherence** | Document-dependent | — | — | #15, #27 |
| **Specification completeness** | Sonnet 4.5 (breadth) / GPT-5 (self-contradictions) | — | — | #16, #31 |
| **Regulatory compliance** | GPT-5 | Sonnet (first-pass) | — | #23, #38, #54 |
| **Operational blind spots** | GPT-5 | Opus | Sonnet | #46 |
| **Emergent behavior** | GPT-5 (feedback loops) | Opus (best single insight) | — | #47 |
| **Temporal boundaries** | GPT-5 | Opus | — | #18, #41, #65 |
| **State machine completeness** | GPT-5 | Opus | — | #58 |
| **Silent data corruption** | GPT-5 | — | — | #40, #62 |
| **Defense-in-depth gaps** | GPT-5 + Opus | — | — | #48 |
| **Security boundaries** | GPT-5 | Opus | — | #10-May |
**Key pattern:** Tasks requiring *identification* (what's missing? what's assumed?) are accessible to all models. Tasks requiring *verification* (is this sequence legal? does this contradict that?) favor reasoning models (GPT-5, Opus) and exclude Sonnet.
**Task category taxonomy:**
| Category | Sonnet value | Best models |
|----------|--------------|-------------|
| Systematic/exhaustive | None | GPT-5, Opus |
| Creative/generative | Meta-analytical synthesis | Opus, GPT-5 |
| Compliance/regulatory | Adequate but shallow | GPT-5 (deep), Sonnet (first-pass) |
| Cross-document | None | Opus strongly preferred |
---
## Part 3: Meta-Findings About How to Use Models
### 1. Signal-to-noise ratio matters more than model capability (Finding #8)
### 1. Signal-to-noise ratio matters more than model capability (#8)
When biased text is the ONLY input, even GPT-4.1 Mini catches it. When the same bias is buried inside a full PR review with diffs, issues, and project context, expensive models miss it. The issue isn't model intelligence — it's attention dilution.
**Implication:** For important analytical checks, isolate the signal. Extract the relevant text and ask about it specifically. Don't bury important questions inside broad review mandates.
### 2. Prompt framing dominates model personality (Finding #26)
### 2. Prompt framing dominates model personality for OPEN tasks (#26)
Opus's "finds design tensions" and GPT-5's "exhaustive enumeration" are DEFAULT tendencies, not capabilities. With structured prompts that explicitly ask for breadth, Opus produces MORE findings than GPT-5. With structured prompts asking for contradictions, GPT-5 becomes highly selective.
Opus's "finds design tensions" and GPT-5's "exhaustive enumeration" are DEFAULT tendencies, not hard limits. With structured prompts that explicitly ask for breadth, Opus produces MORE findings than GPT-5. With structured prompts asking for contradictions, GPT-5 becomes highly selective.
**Implication:** Model choice matters less than you think for any single task. Prompt structure is the primary lever. Model personality matters for OPEN-ENDED tasks where you want emergent analytical behavior.
**Implication:** Model choice matters less than you think for any single task. Prompt structure is the primary lever. Model personality matters for open-ended tasks where you want emergent analytical behavior.
### 3. Task type predicts model performance better than "model X is better" (Finding #13)
### 3. Narrow framing does NOT fix Sonnet's reasoning gaps (#39, #43)
Sonnet can't match GPT-5/Opus via narrow prompts alone. Narrow framing changes WHAT Sonnet looks for but not HOW WELL it reasons. Sonnet found 3 contradictions but only 1 was genuine (2 were misreadings). The gap is reasoning depth, not prompt engineering.
### 4. Task type predicts model performance better than "model X is better" (#13)
Sonnet scores 85% of GPT-5's performance on assumption-finding but drops to ~50% on concurrency reasoning. Don't extrapolate across task types.
### 4. The union of models finds the most (Finding #19)
### 5. The union of models finds the most (#19)
GPT-5 Mini + Sonnet covers ~71% of GPT-5's findings at 31% of the cost. But the missing 29% contains the domain-specific interaction-level findings most likely to prevent production incidents. Each model also finds things the others miss — the total unique finding space is larger than any single model's output.
### 5. Reasoning tokens change the KIND of analysis, not just the amount (Finding #10)
### 6. Adversarial ensemble produces 30% more findings (#35)
Run GPT-5 for exhaustive enumeration, then give Opus GPT-5's findings and ask it to critique and extend. Result: 56 findings vs 43 (GPT-5 alone) or 28 (Opus alone). Zero full disagreements. The critique's structured assessment is more valuable than raw extensions. Cost: ~28% more tokens for 30% more coverage + prioritization.
### 7. Reasoning tokens change the KIND of analysis, not just the amount (#10)
Non-reasoning models ask "what could this mechanism fail at?" Reasoning models ask "what must be true about the world for this mechanism to work?" This is a qualitative difference in analytical mode, not just thoroughness.
### 6. Reasoning effort parameter is a no-op for analytical work (Finding #21)
### 8. Reasoning effort parameter is a no-op for analytical work (#21)
Low/medium/high reasoning effort had negligible effect on GPT-5's output for open-ended analysis. Task type is a far stronger predictor of reasoning behavior. Don't waste time tuning this parameter for document review.
### 7. Output length kills, input length doesn't (Finding #6)
### 9. Output length kills, input length doesn't (#6)
Single agents die trying to generate 1000+ line documents. Rich input context is fine — it's the output length that causes OOM/timeout. Break output into sections, keep input context rich, draft in parallel, assemble.
### 8. Document complexity shifts model rankings (Finding #27)
### 10. Document complexity shifts model rankings (#27)
Opus beat GPT-5 on coherence analysis for one document but lost on another (more complex) document. Rankings are not universal — they interact with document complexity, domain specificity, and prompt structure.
### 9. Token budget matters more than model size (Finding #7b)
### 11. Token budget matters more than model size (#7b)
When output is truncated by token limits, even GPT-5 produces shallow findings. Ensure sufficient max_completion_tokens (≥16K for GPT-5). A cheap model with enough tokens beats an expensive model that runs out of space.
### 12. Opus excels at finding where specs believe false things (#31, #32)
Across experiments, Opus consistently finds where the spec's OWN ASSUMPTIONS are false. GPT-5 reasons about what the spec FAILS TO SAY. Opus reasons about what the spec BELIEVES to be true and checks whether those beliefs hold. Different but complementary.
### 13. GPT-5's reasoning tokens are spent on VERIFICATION for regulatory tasks (#54)
For domain-specific regulatory analysis (IRS wash sale rules), GPT-5 consistently cited correct publication sections, code numbers, and regulatory references. The 9,600 reasoning tokens appear spent on verification, not generation.
---
## Part 4: Cost-Effectiveness
@@ -138,21 +240,42 @@ When output is truncated by token limits, even GPT-5 produces shallow findings.
| Model | Typical tokens/finding | Relative cost | Best use case |
|-------|----------------------|---------------|---------------|
| Opus | 179-336 | 1x (baseline) | Cross-doc consistency, contradictions, design tensions |
| Sonnet 4.6 | 194-312 | 0.3x | Quick screening, structural review, assumption-finding |
| GPT-5 | 993-2,967 | 5-9x | High-stakes analysis where missing something has real cost |
| Sonnet 4.6 | 111-194 | 0.2-0.3x | Quick screening, structural review, assumption-finding |
| Sonnet 4.5 | 150-250 | 0.25x | Broad coverage when noise is acceptable |
| GPT-5 | 511-2,967 | 5-9x | High-stakes analysis where missing something has real cost |
| GPT-4.1 | ~500 | 0.5x | Middle-ground first pass |
| GPT-4.1 Mini | ~300 | 0.1x | Bulk screening, sanity checks |
**For financial/safety-critical systems:** Run all three (Opus + GPT-5 + Sonnet). The ~$1 total cost per document is irrelevant vs the value of comprehensive coverage.
**For financial/safety-critical systems:** Run all three (Opus + GPT-5 + Sonnet). The ~$1-2 total cost per document is trivially justified vs the value of comprehensive coverage.
**For routine review:** Opus alone or Sonnet + Opus pair. Skip GPT-5 unless the document is complex and the stakes justify it.
**For regulatory compliance:** GPT-5 for depth + correct citations, Sonnet for first-pass breadth.
---
## Part 5: What's Still Unknown
## Part 5: Open Questions
1. **Would running models sequentially (feed Model A's output to Model B) outperform parallel runs?** Hypothesized for adversarial analysis but untested.
2. **Are these findings corpus-specific?** All 29 experiments used gargoyle architecture docs. Different domains may shift rankings.
3. **How much do results vary across runs?** All findings are single-run. Stochastic variation is unquantified.
4. **Does Sonnet's narrow-framing weakness go away with explicit concurrency prompts?** Untested — the hypothesis that Sonnet's "structural reviewer" tendency is a framing artifact.
5. **What happens on 2000+ line documents?** Largest tested is 1,110 lines. Unknown if model rankings shift at extreme scale.
### Still Unanswered
1. **Are these findings corpus-specific?** All 74 experiments used gargoyle architecture docs. Different domains may shift rankings.
2. **How much do results vary across runs?** All findings are single-run. Stochastic variation is unquantified.
3. **What happens on 2000+ line documents?** Largest tested is 1,110 lines. Unknown if model rankings shift at extreme scale.
4. **Cross-document consistency as maintenance tool:** Does running cross-doc analysis across MORE document pairs yield additional real inconsistencies? Could become a systematic documentation maintenance tool.
5. **Why Opus dominates cross-doc consistency:** Is it because contradictions are easy to verify once spotted (reducing GPT-5's verification advantage)? Or because boundary reasoning (Opus's strength) is the primary skill needed?
### Answered Questions (from open-questions.md)
- ~~Opus + narrow framing for contradiction detection~~ → **WRONG QUESTION** (#43). Opus doesn't try to match GPT-5 — it finds a different CLASS of contradiction. GPT-5 finds specification conflicts (same scenario, conflicting prescriptions). Opus finds logical impossibilities (rules whose interaction produces impossible conditions). Neither dominates.
- ~~Sonnet + narrow framing = GPT-5 level?~~ → **NO** (#39). The gap is reasoning depth, not framing.
- ~~Adversarial ensemble (GPT-5 → Opus)?~~ → **YES** (#35). 30% more findings at 28% more cost.
- ~~Opus's "missing feature identification" mode — is it promptable?~~ → **YES** (#26). All models find regulatory gaps when explicitly prompted.
- ~~Is Opus > GPT-5 for coherence tasks universal?~~ → **NO** (#27). Document complexity affects ranking.