diff --git a/LESSONS.md b/LESSONS.md
index 68c08ec..2f3c936 100644
--- a/LESSONS.md
+++ b/LESSONS.md
@@ -1,114 +1,245 @@
-# Actionable Lessons: Using AI Models for Analytical Work
+# Lessons Learned: Operational Guide for AI Model Selection
 
-> **Generated:** 2026-05-06 07:30 PDT
-> **Based on:** 29 experiments (2026-04-26 to 2026-05-06)
+> **Generated:** 2026-05-11 09:00 PDT
+> **Based on:** 74 experiments (2026-04-26 to 2026-05-11)
 
-_Distilled from 29 experiments. These are the rules._
+_This is the actionable distillation. For evidence and methodology, see REPORT.md._
 
 ---
 
-## The Three Rules
+## Quick Reference: Model Selection by Task
 
-### 1. Match the model to the task, not the prestige
-
-| If you need... | Use... | Why |
-|---------------|--------|-----|
-| "What's missing from this design?" | GPT-5 | Reasons about the world outside the document |
-| "Where does this design contradict itself?" | Opus | Logical argumentation, zero false positives |
-| "Is this consistent with that other doc?" | Opus | 2.4x faster, more findings than GPT-5 |
-| "How could an attacker exploit this?" | GPT-5 (coverage) + Opus (creativity) | Different attack styles |
-| "Quick sanity check before I ship" | Sonnet | Fast, cheap, precise enough |
-| "What race conditions exist here?" | GPT-5 + Opus | Sonnet produces errors on concurrency |
-| "Is there bias in this text?" | Anything (even Mini) | All models catch isolated bias equally |
-
-### 2. Isolate the signal before asking the question
-
-Don't ask "review this PR" and expect the model to catch a subtle bias buried in 6,600 lines of diff. Extract what matters, ask about it directly. Rich context dilutes attention.
-
-**Pattern:**
-- ❌ "Review this PR for quality, correctness, and bias" (broad mandate + rich context = missed signals)
-- ✅ "Here are 12 hypotheses. Do any lead toward a predetermined conclusion?" (narrow question + minimal context = found everything)
-
-### 3. Run multiple models on anything that matters
-
-No single model finds everything. The union of GPT-5 + Opus + Sonnet finds more than any individual. The missing findings from a single-model run are disproportionately the ones that would cause production incidents.
-
-**Decision framework:**
-- **Costs nothing to get wrong:** One model is fine (Sonnet for speed, Opus for depth)
-- **Would be embarrassing to miss:** Two models (Opus + GPT-5)
-- **Would cost money or safety:** Three models (all three, plus manual review of unique findings)
-
----
-
-## Operational Playbook
-
-### Architecture Document Review
 ```
-1. Opus: contradiction detection + cross-doc consistency
-2. GPT-5: hidden assumptions + gap-finding
-3. Sonnet: quick structural scan (broken refs, missing sections)
-4. Merge findings, deduplicate, triage by severity
-```
-
-### Pre-Implementation Spec Review
-```
-1. Opus: "Where do the stated principles conflict?"
-2. GPT-5: "What must be true about the world for this to work?"
-3. Sonnet 4.5: "What would an implementer have to guess?"
-```
-
-### Security/Adversarial Review
-```
-1. GPT-5: "Enumerate all possible abuses of each mechanism"
-2. Opus: "What would a smart adversary do that the designer didn't consider?"
-3. Union the findings — GPT-5 catches mechanism-level, Opus catches system-level
-```
-
-### PR Review (dual-reviewer pattern)
-```
-- Sonnet: structural issues, broken links, formatting
-- GPT-5: semantic issues, logical gaps, verdict mismatches
-- For important PRs: add Opus for design-tension detection
+┌─────────────────────────────────────────────────────────────────┐
+│                     TASK TYPE DECISION TREE                     │
+├─────────────────────────────────────────────────────────────────┤
+│                                                                 │
+│  Is this a VERIFICATION task?                                   │
+│  (contradiction, consistency, race condition)                   │
+│     │                                                           │
+│     ├─ YES → Use GPT-5 + Opus (skip Sonnet)                     │
+│     │        Sonnet has ~33% precision on verification          │
+│     │                                                           │
+│     └─ NO → Is this CROSS-DOCUMENT?                             │
+│             │                                                   │
+│             ├─ YES → Use Opus (2.4x faster, more findings)      │
+│             │                                                   │
+│             └─ NO → Is this HIGH-STAKES?                        │
+│                     (financial, safety, regulatory)             │
+│                     │                                           │
+│                     ├─ YES → Run all three                      │
+│                     │        (GPT-5 + Opus + Sonnet)            │
+│                     │        Total: ~$1-2, worth it             │
+│                     │                                           │
+│                     └─ NO → Sonnet first-pass                   │
+│                             Add Opus if findings need depth     │
+│                                                                 │
+└─────────────────────────────────────────────────────────────────┘
 ```
 
 ---
 
-## Anti-Patterns (Things That Don't Work)
+## Rules
 
-1. **"Use the most expensive model for everything"** — GPT-5 is 5-9x more expensive than Opus per finding, and Opus beats it on contradiction/consistency tasks.
+### Rule 1: Match Model to Task Type
 
-2. **"Reasoning effort = better output"** — The low/medium/high parameter has negligible effect on analytical tasks. Don't bother tuning it.
+| If the task is... | Use this | Not this |
+|-------------------|----------|----------|
+| Finding what's missing | GPT-5 | Mini |
+| Finding contradictions | Opus | Sonnet |
+| Cross-document consistency | Opus | GPT-5 |
+| Quick structural scan | Sonnet 4.6 | GPT-5 |
+| Broad coverage (noise OK) | Sonnet 4.5 | Sonnet 4.6 |
+| Adversarial attack paths | GPT-5 then Opus | Either alone |
+| Regulatory compliance | GPT-5 | Opus |
+| Operational blind spots | GPT-5 | Sonnet |
 
-3. **"Sonnet can do anything Opus does, just shallower"** — Wrong. Sonnet produces *errors* on concurrency reasoning and false positives on contradiction detection. It's not "cheaper Opus" — it's a different tool.
+### Rule 2: Don't Trust Sonnet for Verification
 
-4. **"More context = better analysis"** — Signal-to-noise ratio matters more than context richness. Isolate what you're asking about.
+Sonnet finds ~3 contradictions but only ~1 is genuine. The others are misreadings. Use Sonnet for *identification* tasks (what's here?), not *verification* tasks (is this true?).
 
-5. **"One good prompt works everywhere"** — Prompt framing shapes output more than model choice. The same model with a broad vs narrow prompt produces qualitatively different work. Design prompts per task type.
+### Rule 3: Isolate the Signal
 
-6. **"Run it once, trust the output"** — Single runs are stochastic. Models miss things non-deterministically. Multiple models or multiple runs are the only hedge.
+When checking for something specific (bias, contradictions, missing assumptions), extract the relevant text and ask about it directly. Don't bury the question in a broad review mandate. Signal-to-noise ratio matters more than model capability.
+
+### Rule 4: Run the Ensemble for High Stakes
+
+For anything financial, safety-critical, or regulatory: run GPT-5 + Opus + Sonnet. Each finds things the others miss. The union is larger than any single model's output. Cost is trivial vs. the value.
+
+### Rule 5: Give GPT-5 Enough Tokens
+
+GPT-5 needs `max_completion_tokens` ≥ 16K. A truncated GPT-5 response is worse than a complete Opus response. Token budget matters more than model size.
+
+### Rule 6: Break Large Outputs Into Sections
+
+Single agents die generating 1000+ lines. Rich input is fine; it's output length that kills. For large generation tasks: break into sections, draft in parallel, assemble.
+
+### Rule 7: Narrow Framing Doesn't Fix Reasoning Gaps
+
+You cannot make Sonnet match GPT-5/Opus by writing a better prompt. Narrow framing changes WHAT it looks for, not HOW WELL it reasons. The gap is architectural, not prompt engineering.
+
+---
+
+## Operational Playbooks
+
+### Playbook A: Architecture Document Review
+
+1. **First pass (Sonnet 4.6):** ~30s, catches structural issues, broken refs, obvious gaps
+2. **Deep analysis (GPT-5):** ~90s, finds domain-specific gaps, hidden assumptions, edge cases
+3. **Design tensions (Opus):** ~60s, finds where the design contradicts itself
+4. **Merge and dedupe:** Union of all three, remove duplicates, sort by severity
+
+### Playbook B: Cross-Document Consistency Check
+
+1. **Use Opus only.** It's 2.4x faster than GPT-5 and finds more issues.
+2. **Provide both documents in a single prompt** (~25KB max)
+3. **Explicitly exclude omissions** in the prompt — you want contradictions, not "Doc A covers X but Doc B doesn't"
+
+### Playbook C: Adversarial Security Review
+
+1. **First pass (GPT-5):** Exhaustive enumeration of attack surface
+2. **Extension pass (Opus):** Give Opus GPT-5's findings, ask it to critique and extend
+3. **Result:** 30% more findings at 28% more cost, with prioritization
+
+### Playbook D: Regulatory Compliance Review
+
+1. **First pass (Sonnet):** ~25s, identifies areas of concern
+2. **Deep dive (GPT-5):** Regulatory specificity, correct citations, edge cases
+3. **GPT-5's reasoning tokens are spent on verification** — trust its citations
+
+### Playbook E: Contradiction Detection
+
+1. **Use GPT-5 + Opus in parallel** (not Sonnet)
+2. **GPT-5 finds:** Specification conflicts (same scenario, different prescriptions)
+3. **Opus finds:** Logical impossibilities (rules that can't coexist)
+4. **Neither dominates** — they find different classes of contradiction
+
+---
+
+## Anti-Patterns
+
+### ❌ Anti-Pattern 1: Using Sonnet for Verification Tasks
+
+**What happens:** Sonnet reports contradictions that aren't real. You waste time investigating false positives or worse, trust a false negative.
+
+**Instead:** Use GPT-5 or Opus for any task requiring "is this true?" reasoning.
+
+### ❌ Anti-Pattern 2: Giving GPT-5 a Broad Mandate
+
+**What happens:** GPT-5 spawns sub-agents, times out, or dumps raw tool output instead of synthesizing.
+
+**Instead:** Give GPT-5 explicit single-actor instructions + output format. For Claude, broader mandates are fine.
+
+### ❌ Anti-Pattern 3: Burying Important Checks in Large Reviews
+
+**What happens:** The model misses the important thing because it's one of 47 things to check.
+
+**Instead:** Extract the important check and ask about it specifically. Signal-to-noise ratio matters.
+
+### ❌ Anti-Pattern 4: Extrapolating Across Task Types
+
+**What happens:** "GPT-5 was great at X, so I'll use it for Y" — and it's mediocre.
+
+**Instead:** Task type predicts performance better than "model X is better." Check the task-type table.
+
+### ❌ Anti-Pattern 5: Skipping the Union
+
+**What happens:** You run one model, miss things another would have caught, and the bug reaches production.
+
+**Instead:** For high-stakes work, run the ensemble. The cost is trivial vs. the risk.
+
+### ❌ Anti-Pattern 6: Tuning Reasoning Effort
+
+**What happens:** You spend time adjusting low/medium/high reasoning effort parameters.
+
+**Instead:** Don't bother. It has negligible effect on analytical work. Task type is the lever.
+
+### ❌ Anti-Pattern 7: Trying to Fix Sonnet with Prompts
+
+**What happens:** You write increasingly narrow prompts trying to get Sonnet to match GPT-5's reasoning depth.
+
+**Instead:** Accept that the gap is architectural. Use Sonnet for what it's good at (speed, breadth, structural review), use GPT-5/Opus for reasoning-heavy tasks.
 
 ---
 
 ## Model Personality Cheat Sheet
 
-| Model | Default behavior | Thinks like a... |
-|-------|-----------------|------------------|
-| GPT-5 | Exhaustive enumeration, high verification bar, slow | Thorough auditor checking every line item |
-| Opus | Design tensions, self-correcting, efficient | Architect who sees how pieces conflict |
-| Sonnet 4.6 | Fast structural scan, self-filtering, concise | Senior engineer doing a quick review |
-| Sonnet 4.5 | Exhaustive, verbose, occasional severity inflation | Junior engineer trying to catch everything |
-| GPT-4.1 | Structured, stays within the document's framing | Competent analyst following a checklist |
-| GPT-4.1 Mini | Formulaic, maps findings 1:1 to document sections | Intern reading the doc and noting concerns |
+| Model | Personality | Default Behavior | Give It |
+|-------|-------------|------------------|---------|
+| **GPT-5** | Exhaustive enumerator | Lists everything systematically | Bounded tasks, explicit output format, single-actor instructions |
+| **Opus** | Design critic | Finds tensions and contradictions | Open-ended analysis, room to reason about boundaries |
+| **Sonnet 4.6** | Structural scanner | Fast, precise, shallow | Quick first-pass work, structural review |
+| **Sonnet 4.5** | Broad coverage | More findings, more noise | When you want breadth over precision |
+| **GPT-4.1** | Generic competent | Stays within document framing | Middle-ground cost-sensitive work |
+| **GPT-4.1 Mini** | Template filler | Formulaic but catches obvious things | Bulk screening, sanity checks |
+
+### Opus Superpower
+
+Opus finds where the spec's **own assumptions are false**. It doesn't just find missing things — it finds things the spec *believes* to be true that *aren't*.
+
+Examples:
+- "Realized P&L cannot recover" — the de-escalation model assumes all metrics can improve, but this one fundamentally cannot (#31)
+- "Forward detection logic is backwards" — spec describes triggers in the wrong direction (#32)
+- "Stop-loss defeated by temporal composition" — safety mechanism rendered ineffective by slow strategy (#47)
+
+### GPT-5 Superpower
+
+GPT-5 reasons about the document's **relationship to the real world**. It asks "what must be true about the external world for this to work?"
+
+Examples:
+- Broker rate limiting (429s) bypasses "connection lost" detection (#9)
+- Corporate actions bypass staleness detection (#9)
+- DB "commit unknown outcome" causing restart loops (#9)
+- Cross-symbol strategies with partial staleness (#9)
+- IRS rule nuances that simplifications violate (#54)
 
 ---
 
-## The Bottom Line
+## Decision Framework
 
-**For our specific workflow (gargoyle architecture review, PR reviews, design docs):**
+### When to Add Another Model
 
-1. Opus is the default analytical model — most efficient, deepest on consistency/contradiction
-2. GPT-5 is the "we can't afford to miss anything" model — use on high-stakes docs
-3. Sonnet is the speed/screening model — first pass, structural checks, assumption-finding only
-4. Never use Sonnet alone for concurrency, contradiction, or adversarial analysis
-5. Always isolate the analytical question from surrounding noise
-6. Task-type-specific prompts beat generic "review this" prompts every time
+| Situation | Action |
+|-----------|--------|
+| Sonnet found nothing | Add Opus (may find design tensions) |
+| GPT-5 found lots but all similar | Add Opus (may find different class) |
+| Opus found tensions but no enumeration | Add GPT-5 (exhaustive coverage) |
+| Cross-document task | Use Opus only (2.4x faster) |
+| Regulatory/compliance task | Use GPT-5 (correct citations) |
+
+### When NOT to Add Another Model
+
+| Situation | Action |
+|-----------|--------|
+| Quick structural scan | Sonnet alone is fine |
+| Bulk screening | Mini alone is fine |
+| Already ran GPT-5 + Opus | Adding Sonnet rarely helps |
+| Low-stakes internal doc | One model is enough |
+
+### Cost-Benefit Quick Calc
+
+| Risk level | Model cost | Justified? |
+|------------|------------|------------|
+| Financial/safety | ~$1-2 for ensemble | Always yes |
+| Customer-facing | ~$0.50 for GPT-5 | Usually yes |
+| Internal process | ~$0.10 for Sonnet | Always yes |
+| One-off exploration | ~$0.02 for Mini | Always yes |
+
+---
+
+## What We Still Don't Know
+
+1. **Corpus bias:** All experiments used gargoyle docs. Rankings may differ for other domains.
+2. **Run variance:** All findings are single-run. Stochastic variation is unquantified.
+3. **Scale effects:** Largest doc tested is 1,110 lines. Unknown behavior at 2000+.
+4. **Non-architecture domains:** These findings are for architecture document analysis, not coding, not chat, not creative writing.
+
+---
+
+## Summary: The Two Things That Matter Most
+
+1. **Task type determines model choice.** Don't pick a model because "it's best." Pick the model that's best for THIS task type.
+
+2. **The union beats any single model.** For high-stakes work, run the ensemble. Different models find qualitatively different things.
+
+Everything else is optimization.
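
The "Task Type Decision Tree" added to LESSONS.md in the diff above can be read as a short function. What follows is a minimal sketch in Python, not code from the repository: the `Task` record, its three boolean flags, and the `select_models` name are all hypothetical, chosen only to mirror the three questions the tree asks in order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical task descriptor; each flag mirrors one question in the tree."""
    is_verification: bool = False    # contradiction, consistency, race condition
    is_cross_document: bool = False
    is_high_stakes: bool = False     # financial, safety, regulatory


def select_models(task: Task) -> list[str]:
    """Walk the decision tree top to bottom and return the suggested models."""
    if task.is_verification:
        return ["GPT-5", "Opus"]            # the tree says to skip Sonnet here
    if task.is_cross_document:
        return ["Opus"]
    if task.is_high_stakes:
        return ["GPT-5", "Opus", "Sonnet"]  # the full ensemble
    return ["Sonnet"]                       # first pass; add Opus if findings need depth


if __name__ == "__main__":
    print(select_models(Task(is_cross_document=True)))
```

Because the tree tests verification first, a high-stakes verification task comes back as GPT-5 + Opus without Sonnet in this sketch. That matches Rule 2, but Rule 4 asks for all three models on high-stakes work, and the diff does not say which rule takes precedence.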
diff --git a/REPORT.md b/REPORT.md
index 5a28961..30ecb0a 100644
--- a/REPORT.md
+++ b/REPORT.md
@@ -1,136 +1,238 @@
 # Model Research Report: AI Models for Analytical Work
 
-> **Generated:** 2026-05-06 07:30 PDT
-> **Findings analyzed:** 29
-> **Period:** 2026-04-26 to 2026-05-06
+> **Generated:** 2026-05-11 09:00 PDT
+> **Findings analyzed:** 74
+> **Period:** 2026-04-26 to 2026-05-11
 
-_29 experiments across 11 days. Five models tested on architecture document analysis — not coding._
+_74 experiments across 16 days. Six models tested on architecture document analysis — not coding._
+
+---
+
+## What's New (Since May 6)
+
+**45 new findings** (29 → 74) covering:
+
+- **New task types validated:** Operational blind spot analysis (#46), emergent behavior from rule composition (#47), defense-in-depth gaps (#48), adversarial evasion/tampering (#49), concurrency race conditions (#50), implementation ambiguity (#51), degraded mode propagation (#52), unstated constraints (#53), state reconstruction correctness (#55), operational burden (#56), event flow correctness (#57), state machine completeness (#58), convention-rule gaps (#59), counterfactual event ordering (#60), regulatory completeness (#61), data integrity signal flow (#62), external system assumptions (#63), specification gaps (#64), temporal correctness (#65), concurrent write hazards (#65b), cross-context contract coherence (#68), boundary contract analysis, boundary violation analysis, inter-document contradiction analysis, security boundary analysis, audit log data integrity (#11-May), wash sale regulatory compliance (#11-May)
+- **Cross-document consistency expanded** (#37, #44): Opus confirmed as dominant for subtle contradictions across tightly-coupled docs
+- **Regulatory compliance analysis depth** (#38, #54, #61): GPT-5 excels at IRS/regulatory specificity with correct citations
+- **Narrow framing tested and rejected** (#39, #43): Sonnet cannot match GPT-5/Opus via prompt framing alone — reasoning depth is the bottleneck
+- **Adversarial ensemble validated** (#35): Critique-then-extend produces 30% more findings at 28% more cost
+- **Operational burden as distinct lens** (#45, #56): Models diverge on what constitutes "operator cognitive load"
+- **Silent data corruption paths** (#40): GPT-5 excels at tracing multi-step corruption through financial accounting
+- **Temporal ordering dependencies** (#41): All models catch obvious ordering; GPT-5 unique on subtle cascades
+- **Failure propagation chains** (#42): Opus finds the architectural insight; GPT-5 finds the enumeration
+
+---
 
 ## Executive Summary
 
-We tested GPT-5, Claude Opus, Claude Sonnet 4.6, Claude Sonnet 4.5, GPT-4.1, and GPT-4.1 Mini on analytical tasks: finding hidden assumptions, race conditions, design contradictions, adversarial attack paths, regulatory gaps, and cross-document inconsistencies in real architecture documents.
+We tested GPT-5, Claude Opus, Claude Sonnet 4.6, Claude Sonnet 4.5, GPT-4.1, and GPT-4.1 Mini on analytical tasks: finding hidden assumptions, race conditions, design contradictions, adversarial attack paths, regulatory gaps, cross-document inconsistencies, operational blind spots, emergent behaviors, and security boundaries in real architecture documents.
 
 **The central finding:** Different models don't just find more or fewer things — they find *qualitatively different kinds* of things. Model choice is task-dependent, and no single model dominates all analytical work.
 
+**The secondary finding:** Task type predicts model performance better than "model X is better." A model that excels at gap-finding may struggle at contradiction detection. Match the model to the task.
+
 ---
 
 ## Part 1: What Each Model Is Good At
 
 ### GPT-5
+
 **Strength:** Exhaustive enumeration + domain-specific reasoning about the real world.
 
-GPT-5's reasoning tokens change the *kind* of analysis, not just the depth. Non-reasoning models identify risks within a document's own frame of reference. GPT-5 reasons about the document's relationship to the external world: broker semantics, deployment topology, OTP runtime behavior under load, timing correlations across independent subsystems.
+GPT-5's reasoning tokens change the *kind* of analysis, not just the depth. Non-reasoning models identify risks within a document's own frame of reference. GPT-5 reasons about the document's relationship to the external world: broker semantics, deployment topology, OTP runtime behavior under load, timing correlations across independent subsystems, IRS regulatory requirements.
 
-- Best at: gap-finding, hidden assumptions, adversarial enumeration, temporal boundary analysis
-- Unique ability: finds multi-component interaction failures that require domain knowledge
-- Weakness: slow (2-4x longer than Opus), expensive (5-10x more tokens per finding), sometimes over-verifies
+| Capability | Evidence |
+|------------|----------|
+| Domain-specific gaps | #9, #31: Broker rate limiting, credential rotation, corporate actions |
+| Multi-component interactions | #10, #14: Finds assumptions requiring cross-boundary reasoning |
+| Adversarial enumeration | #29, #35: Most thorough attack surface coverage |
+| Temporal boundary analysis | #18: 15 findings with mathematical precision |
+| Regulatory compliance | #23, #38, #54: Correct IRS citations, regulatory edge cases |
+| Silent data corruption | #40: Traces multi-step corruption paths |
+| Invariant violation paths | #20: Precise, verifiable paths through state space |
+| Operational blind spots | #46: 18 findings including cross-service trace gaps |
+
+- Best at: gap-finding, hidden assumptions, adversarial enumeration, temporal boundary analysis, regulatory compliance, operational blind spots
+- Unique ability: finds multi-component interaction failures requiring domain knowledge
+- Weakness: slow (2-4x longer than Opus), expensive (5-10x more tokens per finding), sometimes over-enumerates
 - Finding count: typically 15-35 depending on document complexity
 
 ### Claude Opus
-**Strength:** Design tensions, logical argumentation, creative adversarial thinking.
+
+**Strength:** Design tensions, logical argumentation, creative adversarial thinking, cross-document consistency.
 
 Opus consistently identifies where one part of a design undermines another part. It doesn't enumerate failure modes — it finds the deeper question of *why* the design's own principles conflict.
 
-- Best at: contradiction detection, cross-document consistency, race conditions (design-level), adversarial creativity
-- Unique ability: self-corrects mid-analysis, finds "your safety mechanism IS your vulnerability" patterns
+| Capability | Evidence |
+|------------|----------|
+| Contradiction detection | #25, #43: Finds logical impossibilities via deductive reasoning |
+| Cross-document consistency | #28, #37, #44: 2.4x faster than GPT-5, finds more issues |
+| Race conditions (design-level) | #13: 10 high-quality findings, self-corrects mid-analysis |
+| Adversarial creativity | #29, #35: "Your safety mechanism IS your vulnerability" patterns |
+| False assumption detection | #31, #32: Finds where spec's own logic contradicts itself |
+| Emergent behavior insight | #47: Stop-loss defeated by temporal composition (best single finding) |
+| Survivor bias identification | #46: Decision latency histogram hides stuck decisions |
+
+- Best at: contradiction detection, cross-document consistency, race conditions (design-level), adversarial creativity, finding false assumptions
+- Unique ability: self-corrects mid-analysis, finds where protection mechanisms become vulnerabilities
 - Most efficient model: 6-9x fewer tokens per finding than GPT-5 on shared task types
 - Weakness: produces fewer findings on pure enumeration tasks (10-13 vs GPT-5's 20-35)
 
 ### Claude Sonnet 4.6
+
 **Strength:** Speed, structural issues, assumption-finding. Best precision-per-dollar.
 
+| Capability | Evidence |
+|------------|----------|
+| Quick first-pass screening | #9, #12: 2-3x faster than other models |
+| Structural review | #5: Catches formatting, broken links, missing sections |
+| Specification gap identification | #16: 13 findings, zero false positives |
+| Observability gaps | #33: 11 findings in 36s |
+
 - Best at: quick first-pass screening, structural review, specification gap identification
 - Zero false positives on most tasks — every finding is actionable
-- Weakness: struggles with concurrency reasoning, contradiction detection, and tasks requiring formal logical reasoning
+- Weakness: struggles with concurrency reasoning, contradiction detection, tasks requiring formal logical reasoning
 - Produces false positives on verification-heavy tasks (contradiction, race conditions)
 
+**Critical limitation (Finding #39):** Narrow framing does NOT close the gap with GPT-5/Opus. Sonnet can find 3 contradictions but only 1 is genuine (2 are misreadings). The gap is reasoning depth, not framing — Sonnet can't reliably verify whether two statements actually contradict each other.
+
 ### Claude Sonnet 4.5
+
 **Strength:** Exhaustive coverage. More findings than 4.6, at the cost of some noise.
 
-- Best at: specification completeness (25 findings vs 4.6's 13)
-- Catches operational gaps that 4.6 filters out
+| Capability | Evidence |
+|------------|----------|
+| Specification completeness | #16: 25 findings vs 4.6's 13 |
+| Temporal reasoning | #18: 12 findings with no errors (vs 4.6's errors in #13) |
+| Operational gaps | Catches gaps that 4.6 filters out |
+
+- Best at: specification completeness, broad coverage
 - Tradeoff: severity inflation, more verbose output
+- Use 4.5 for coverage, 4.6 for precision
 
 ### GPT-4.1
+
 **Strength:** Structured, thorough, good middle ground. Generic but competent.
 
-- Stays within the document's own framing — finds assumptions the document *almost* states
-- Best unique contribution: meta-observations about design structure (e.g., "all failure modes treated as isolated")
+| Capability | Evidence |
+|------------|----------|
+| Stays within document framing | #9, #10: Finds assumptions the document almost states |
+| Meta-observations | #10: "All failure modes treated as isolated" |
+| Cost-effective first pass | Good enough when GPT-5's cost isn't justified |
+
+- Best unique contribution: meta-observations about design structure
 - Good enough for first-pass review where GPT-5's cost isn't justified
 
 ### GPT-4.1 Mini
+
 **Strength:** Cheapest. Formulaic but catches the obvious things.
 
-- Every finding maps cleanly to a section of the document
+| Capability | Evidence |
+|------------|----------|
+| Scales with document size | #9, #19: 6 findings on 459 lines → 21 on 1,110 lines |
+| Clean templates | Every finding maps to a document section |
+| Bias detection | #8: Catches bias when signal isn't buried |
+
 - Fine for quick sanity checks, not for architectural insight
-- Scales with document size (6 findings on 459 lines → 21 on 1,110 lines)
+- Best for: bulk screening, sanity checks, obvious-issue detection
 
 ---
 
-## Part 2: What We Learned About Task Types
+## Part 2: Task Type → Model Mapping
 
 Not all analytical tasks are the same. Models that excel at one struggle at another.
 
-| Task Type | Best Model | Runner-up | Avoid |
-|-----------|-----------|-----------|-------|
-| Hidden assumptions | GPT-5 | Opus | Mini (formulaic) |
-| Gap-finding | GPT-5 | GPT-4.1 | Mini (surface-level) |
-| Race conditions | GPT-5 + Opus | — | Sonnet (errors) |
-| Contradiction detection | **Opus** | GPT-5 | Sonnet (false positives) |
-| Cross-document consistency | **Opus** | GPT-5 | — |
-| Adversarial attack paths | GPT-5 (enumeration) + Opus (creativity) | — | — |
-| Bias detection | Any model | — | — |
-| Design coherence | Document-dependent | — | — |
-| Specification completeness | Sonnet 4.5 (breadth) or GPT-5 (self-contradictions) | — | — |
-| Missing feature identification | All (with right prompt) | — | — |
-| Invariant violation paths | GPT-5 (precision) | Opus (breadth) | Sonnet (imprecise) |
+| Task Type | Best Model | Runner-up | Avoid | Evidence |
+|-----------|-----------|-----------|-------|----------|
+| **Gap-finding** | GPT-5 | GPT-4.1 | Mini (surface-level) | #9, #31, #64 |
+| **Hidden assumptions** | GPT-5 | Opus | Mini (formulaic) | #10, #11, #12, #53 |
+| **Race conditions** | GPT-5 + Opus | — | Sonnet (errors) | #13, #50 |
+| **Contradiction detection** | **Opus** | GPT-5 | Sonnet (false positives) | #25, #43 |
+| **Cross-document consistency** | **Opus** | GPT-5 | — | #28, #37, #44 |
+| **Adversarial attack paths** | GPT-5 (enum) + Opus (creativity) | — | — | #29, #35, #49 |
+| **Design coherence** | Document-dependent | — | — | #15, #27 |
+| **Specification completeness** | Sonnet 4.5 (breadth) / GPT-5 (self-contradictions) | — | — | #16, #31 |
+| **Regulatory compliance** | GPT-5 | Sonnet (first-pass) | — | #23, #38, #54 |
+| **Operational blind spots** | GPT-5 | Opus | Sonnet | #46 |
+| **Emergent behavior** | GPT-5 (feedback loops) | Opus (best single insight) | — | #47 |
+| **Temporal boundaries** | GPT-5 | Opus | — | #18, #41, #65 |
+| **State machine completeness** | GPT-5 | Opus | — | #58 |
+| **Silent data corruption** | GPT-5 | — | — | #40, #62 |
+| **Defense-in-depth gaps** | GPT-5 + Opus | — | — | #48 |
+| **Security boundaries** | GPT-5 | Opus | — | #10-May |
 
 **Key pattern:** Tasks requiring *identification* (what's missing? what's assumed?) are accessible to all models. Tasks requiring *verification* (is this sequence legal? does this contradict that?) favor reasoning models (GPT-5, Opus) and exclude Sonnet.
 
+**Task category taxonomy:**
+
+| Category | Sonnet value | Best models |
+|----------|--------------|-------------|
+| Systematic/exhaustive | None | GPT-5, Opus |
+| Creative/generative | Meta-analytical synthesis | Opus, GPT-5 |
+| Compliance/regulatory | Adequate but shallow | GPT-5 (deep), Sonnet (first-pass) |
+| Cross-document | None | Opus strongly preferred |
+
 ---
 
 ## Part 3: Meta-Findings About How to Use Models
 
-### 1. Signal-to-noise ratio matters more than model capability (Finding #8)
+### 1. Signal-to-noise ratio matters more than model capability (#8)
 
 When biased text is the ONLY input, even GPT-4.1 Mini catches it. When the same bias is buried inside a full PR review with diffs, issues, and project context, expensive models miss it. The issue isn't model intelligence — it's attention dilution.
 
 **Implication:** For important analytical checks, isolate the signal. Extract the relevant text and ask about it specifically. Don't bury important questions inside broad review mandates.
 
-### 2. Prompt framing dominates model personality (Finding #26)
+### 2. Prompt framing dominates model personality for OPEN tasks (#26)
 
-Opus's "finds design tensions" and GPT-5's "exhaustive enumeration" are DEFAULT tendencies, not capabilities. With structured prompts that explicitly ask for breadth, Opus produces MORE findings than GPT-5. With structured prompts asking for contradictions, GPT-5 becomes highly selective.
+Opus's "finds design tensions" and GPT-5's "exhaustive enumeration" are DEFAULT tendencies, not hard limits. With structured prompts that explicitly ask for breadth, Opus produces MORE findings than GPT-5. With structured prompts asking for contradictions, GPT-5 becomes highly selective.
 
-**Implication:** Model choice matters less than you think for any single task. Prompt structure is the primary lever. Model personality matters for OPEN-ENDED tasks where you want emergent analytical behavior.
+**Implication:** Model choice matters less than you think for any single task. Prompt structure is the primary lever. Model personality matters for open-ended tasks where you want emergent analytical behavior.
 
-### 3. Task type predicts model performance better than "model X is better" (Finding #13)
+### 3. Narrow framing does NOT fix Sonnet's reasoning gaps (#39, #43)
+
+Sonnet can't match GPT-5/Opus via narrow prompts alone. Narrow framing changes WHAT Sonnet looks for but not HOW WELL it reasons. Sonnet found 3 contradictions but only 1 was genuine (2 were misreadings). The gap is reasoning depth, not prompt engineering.
+
+### 4. Task type predicts model performance better than "model X is better" (#13)
 
 Sonnet scores 85% of GPT-5's performance on assumption-finding but drops to ~50% on concurrency reasoning. Don't extrapolate across task types.
 
-### 4. The union of models finds the most (Finding #19)
+### 5. The union of models finds the most (#19)
 
 GPT-5 Mini + Sonnet covers ~71% of GPT-5's findings at 31% of the cost. But the missing 29% contains the domain-specific interaction-level findings most likely to prevent production incidents. Each model also finds things the others miss — the total unique finding space is larger than any single model's output.
 
-### 5. Reasoning tokens change the KIND of analysis, not just the amount (Finding #10)
+### 6. Adversarial ensemble produces 30% more findings (#35)
+
+Run GPT-5 for exhaustive enumeration, then give Opus GPT-5's findings and ask it to critique and extend. Result: 56 findings vs 43 (GPT-5 alone) or 28 (Opus alone). Zero full disagreements. The critique's structured assessment is more valuable than raw extensions. Cost: ~28% more tokens for 30% more coverage + prioritization.
+
+### 7. Reasoning tokens change the KIND of analysis, not just the amount (#10)
 
 Non-reasoning models ask "what could this mechanism fail at?" Reasoning models ask "what must be true about the world for this mechanism to work?" This is a qualitative difference in analytical mode, not just thoroughness.
 
-### 6. Reasoning effort parameter is a no-op for analytical work (Finding #21)
+### 8. Reasoning effort parameter is a no-op for analytical work (#21)
 
 Low/medium/high reasoning effort had negligible effect on GPT-5's output for open-ended analysis. Task type is a far stronger predictor of reasoning behavior. Don't waste time tuning this parameter for document review.
 
-### 7. Output length kills, input length doesn't (Finding #6)
+### 9. Output length kills, input length doesn't (#6)
 
 Single agents die trying to generate 1000+ line documents. Rich input context is fine — it's the output length that causes OOM/timeout. Break output into sections, keep input context rich, draft in parallel, assemble.
 
-### 8. Document complexity shifts model rankings (Finding #27)
+### 10. Document complexity shifts model rankings (#27)
 
 Opus beat GPT-5 on coherence analysis for one document but lost on another (more complex) document. Rankings are not universal — they interact with document complexity, domain specificity, and prompt structure.
 
-### 9. Token budget matters more than model size (Finding #7b)
+### 11. Token budget matters more than model size (#7b)
 
 When output is truncated by token limits, even GPT-5 produces shallow findings. Ensure sufficient max_completion_tokens (≥16K for GPT-5). A cheap model with enough tokens beats an expensive model that runs out of space.
 
+### 12. Opus excels at finding where specs believe false things (#31, #32)
+
+Across experiments, Opus consistently finds where the spec's OWN ASSUMPTIONS are false. GPT-5 reasons about what the spec FAILS TO SAY. Opus reasons about what the spec BELIEVES to be true and checks whether those beliefs hold. Different but complementary.
+
+### 13. GPT-5's reasoning tokens are spent on VERIFICATION for regulatory tasks (#54)
+
+For domain-specific regulatory analysis (IRS wash sale rules), GPT-5 consistently cited correct publication sections, code numbers, and regulatory references. The 9,600 reasoning tokens appear spent on verification, not generation.
+
 ---
 
 ## Part 4: Cost-Effectiveness
@@ -138,21 +240,42 @@ When output is truncated by token limits, even GPT-5 produces shallow findings.
 | Model | Typical tokens/finding | Relative cost | Best use case |
 |-------|----------------------|---------------|---------------|
 | Opus | 179-336 | 1x (baseline) | Cross-doc consistency, contradictions, design tensions |
-| Sonnet 4.6 | 194-312 | 0.3x | Quick screening, structural review, assumption-finding |
-| GPT-5 | 993-2,967 | 5-9x | High-stakes analysis where missing something has real cost |
+| Sonnet 4.6 | 111-194 | 0.2-0.3x | Quick screening, structural review, assumption-finding |
+| Sonnet 4.5 | 150-250 | 0.25x | Broad coverage when noise is acceptable |
+| GPT-5 | 511-2,967 | 5-9x | High-stakes analysis where missing something has real cost |
 | GPT-4.1 | ~500 | 0.5x | Middle-ground first pass |
 | GPT-4.1 Mini | ~300 | 0.1x | Bulk screening, sanity checks |
 
-**For financial/safety-critical systems:** Run all three (Opus + GPT-5 + Sonnet). The ~$1 total cost per document is irrelevant vs the value of comprehensive coverage.
+**For financial/safety-critical systems:** Run all three (Opus + GPT-5 + Sonnet). The ~$1-2 total cost per document is trivially justified vs the value of comprehensive coverage.
 
 **For routine review:** Opus alone or Sonnet + Opus pair.
Skip GPT-5 unless the document is complex and the stakes justify it. +**For regulatory compliance:** GPT-5 for depth + correct citations, Sonnet for first-pass breadth. + --- -## Part 5: What's Still Unknown +## Part 5: Open Questions -1. **Would running models sequentially (feed Model A's output to Model B) outperform parallel runs?** Hypothesized for adversarial analysis but untested. -2. **Are these findings corpus-specific?** All 29 experiments used gargoyle architecture docs. Different domains may shift rankings. -3. **How much do results vary across runs?** All findings are single-run. Stochastic variation is unquantified. -4. **Does Sonnet's narrow-framing weakness go away with explicit concurrency prompts?** Untested — the hypothesis that Sonnet's "structural reviewer" tendency is a framing artifact. -5. **What happens on 2000+ line documents?** Largest tested is 1,110 lines. Unknown if model rankings shift at extreme scale. +### Still Unanswered + +1. **Are these findings corpus-specific?** All 74 experiments used gargoyle architecture docs. Different domains may shift rankings. + +2. **How much do results vary across runs?** All findings are single-run. Stochastic variation is unquantified. + +3. **What happens on 2000+ line documents?** Largest tested is 1,110 lines. Unknown if model rankings shift at extreme scale. + +4. **Cross-document consistency as maintenance tool:** Does running cross-doc analysis across MORE document pairs yield additional real inconsistencies? Could become a systematic documentation maintenance tool. + +5. **Why Opus dominates cross-doc consistency:** Is it because contradictions are easy to verify once spotted (reducing GPT-5's verification advantage)? Or because boundary reasoning (Opus's strength) is the primary skill needed? + +### Answered Questions (from open-questions.md) + +- ~~Opus + narrow framing for contradiction detection~~ → **WRONG QUESTION** (#43). 
Opus doesn't try to match GPT-5 — it finds a different CLASS of contradiction. GPT-5 finds specification conflicts (same scenario, conflicting prescriptions). Opus finds logical impossibilities (rules whose interaction produces impossible conditions). Neither dominates. + +- ~~Sonnet + narrow framing = GPT-5 level?~~ → **NO** (#39). The gap is reasoning depth, not framing. + +- ~~Adversarial ensemble (GPT-5 → Opus)?~~ → **YES** (#35). 30% more findings at 28% more cost. + +- ~~Opus's "missing feature identification" mode — is it promptable?~~ → **YES** (#26). All models find regulatory gaps when explicitly prompted. + +- ~~Is Opus > GPT-5 for coherence tasks universal?~~ → **NO** (#27). Document complexity affects ranking.
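
Reviewer note: the adversarial-ensemble figures quoted for #35 (56 findings vs 43 for GPT-5 alone and 28 for Opus alone, at roughly 28% more tokens) can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only and is not part of LESSONS.md; the function and constant names are hypothetical, and the counts are simply the ones quoted in the diff.

```python
def relative_gain(ensemble: int, baseline: int) -> float:
    """Fractional increase in findings of the ensemble over a single-model baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be a positive finding count")
    return (ensemble - baseline) / baseline


# Finding counts as quoted for experiment #35 (assumed to be comparable counts).
ENSEMBLE_FINDINGS = 56
GPT5_ALONE = 43
OPUS_ALONE = 28
TOKEN_OVERHEAD = 0.28  # "~28% more tokens", taken at face value

gain_vs_gpt5 = relative_gain(ENSEMBLE_FINDINGS, GPT5_ALONE)
gain_vs_opus = relative_gain(ENSEMBLE_FINDINGS, OPUS_ALONE)

print(f"gain vs GPT-5 alone: {gain_vs_gpt5:.1%}")
print(f"gain vs Opus alone:  {gain_vs_opus:.1%}")
print(f"coverage gain per unit of token overhead: {gain_vs_gpt5 / TOKEN_OVERHEAD:.2f}")
```

On these numbers the "30% more findings" headline is measured against GPT-5 alone; against Opus alone the ensemble doubles the count, which may be worth stating explicitly in the text.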