docs: regenerate weekly report (2026-05-11)

2026-05-11 09:04:35 -07:00
parent 2ca8c974f3
commit 828da269c0
2 changed files with 394 additions and 140 deletions
@@ -1,114 +1,245 @@
-# Actionable Lessons: Using AI Models for Analytical Work
+# Lessons Learned: Operational Guide for AI Model Selection

-> **Generated:** 2026-05-06 07:30 PDT  
-> **Based on:** 29 experiments (2026-04-26 to 2026-05-06)
+> **Generated:** 2026-05-11 09:00 PDT  
+> **Based on:** 74 experiments (2026-04-26 to 2026-05-11)

-_Distilled from 29 experiments. These are the rules._
+_This is the actionable distillation. For evidence and methodology, see REPORT.md._

 ---

-## The Three Rules
+## Quick Reference: Model Selection by Task

-### 1. Match the model to the task, not the prestige
-
-| If you need... | Use... | Why |
-|---------------|--------|-----|
-| "What's missing from this design?" | GPT-5 | Reasons about the world outside the document |
-| "Where does this design contradict itself?" | Opus | Logical argumentation, zero false positives |
-| "Is this consistent with that other doc?" | Opus | 2.4x faster, more findings than GPT-5 |
-| "How could an attacker exploit this?" | GPT-5 (coverage) + Opus (creativity) | Different attack styles |
-| "Quick sanity check before I ship" | Sonnet | Fast, cheap, precise enough |
-| "What race conditions exist here?" | GPT-5 + Opus | Sonnet produces errors on concurrency |
-| "Is there bias in this text?" | Anything (even Mini) | All models catch isolated bias equally |
-
-### 2. Isolate the signal before asking the question
-
-Don't ask "review this PR" and expect the model to catch a subtle bias buried in 6,600 lines of diff. Extract what matters, ask about it directly. Rich context dilutes attention.
-
-**Pattern:**
- ❌ "Review this PR for quality, correctness, and bias" (broad mandate + rich context = missed signals)
- ✅ "Here are 12 hypotheses. Do any lead toward a predetermined conclusion?" (narrow question + minimal context = found everything)
-
-### 3. Run multiple models on anything that matters
-
-No single model finds everything. The union of GPT-5 + Opus + Sonnet finds more than any individual. The missing findings from a single-model run are disproportionately the ones that would cause production incidents.
-
-**Decision framework:**
- **Costs nothing to get wrong:** One model is fine (Sonnet for speed, Opus for depth)
- **Would be embarrassing to miss:** Two models (Opus + GPT-5)
- **Would cost money or safety:** Three models (all three, plus manual review of unique findings)
-
---
-
-## Operational Playbook
-
-### Architecture Document Review
 ```
-1. Opus: contradiction detection + cross-doc consistency
-2. GPT-5: hidden assumptions + gap-finding
-3. Sonnet: quick structural scan (broken refs, missing sections)
-4. Merge findings, deduplicate, triage by severity
-```
-
-### Pre-Implementation Spec Review
-```
-1. Opus: "Where do the stated principles conflict?"
-2. GPT-5: "What must be true about the world for this to work?"
-3. Sonnet 4.5: "What would an implementer have to guess?"
-```
-
-### Security/Adversarial Review
-```
-1. GPT-5: "Enumerate all possible abuses of each mechanism"
-2. Opus: "What would a smart adversary do that the designer didn't consider?"
-3. Union the findings — GPT-5 catches mechanism-level, Opus catches system-level
-```
-
-### PR Review (dual-reviewer pattern)
-```
- Sonnet: structural issues, broken links, formatting
- GPT-5: semantic issues, logical gaps, verdict mismatches
- For important PRs: add Opus for design-tension detection
+┌─────────────────────────────────────────────────────────────────┐
+│                    TASK TYPE DECISION TREE                      │
+├─────────────────────────────────────────────────────────────────┤
+│                                                                 │
+│  Is this a VERIFICATION task?                                   │
+│  (contradiction, consistency, race condition)                   │
+│     │                                                           │
+│     ├─ YES → Use GPT-5 + Opus (skip Sonnet)                    │
+│     │        Sonnet has ~33% precision on verification          │
+│     │                                                           │
+│     └─ NO → Is this CROSS-DOCUMENT?                            │
+│              │                                                  │
+│              ├─ YES → Use Opus (2.4x faster, more findings)    │
+│              │                                                  │
+│              └─ NO → Is this HIGH-STAKES?                      │
+│                       (financial, safety, regulatory)           │
+│                       │                                         │
+│                       ├─ YES → Run all three                   │
+│                       │        (GPT-5 + Opus + Sonnet)         │
+│                       │        Total: ~$1-2, worth it          │
+│                       │                                         │
+│                       └─ NO → Sonnet first-pass               │
+│                               Add Opus if findings need depth   │
+│                                                                 │
+└─────────────────────────────────────────────────────────────────┘
 ```

 ---

-## Anti-Patterns (Things That Don't Work)
+## Rules

-1. **"Use the most expensive model for everything"** — GPT-5 is 5-9x more expensive than Opus per finding, and Opus beats it on contradiction/consistency tasks.
+### Rule 1: Match Model to Task Type

-2. **"Reasoning effort = better output"** — The low/medium/high parameter has negligible effect on analytical tasks. Don't bother tuning it.
+| If the task is... | Use this | Not this |
+|-------------------|----------|----------|
+| Finding what's missing | GPT-5 | Mini |
+| Finding contradictions | Opus | Sonnet |
+| Cross-document consistency | Opus | GPT-5 |
+| Quick structural scan | Sonnet 4.6 | GPT-5 |
+| Broad coverage (noise OK) | Sonnet 4.5 | Sonnet 4.6 |
+| Adversarial attack paths | GPT-5 then Opus | Either alone |
+| Regulatory compliance | GPT-5 | Opus |
+| Operational blind spots | GPT-5 | Sonnet |

-3. **"Sonnet can do anything Opus does, just shallower"** — Wrong. Sonnet produces *errors* on concurrency reasoning and false positives on contradiction detection. It's not "cheaper Opus" — it's a different tool.
+### Rule 2: Don't Trust Sonnet for Verification

-4. **"More context = better analysis"** — Signal-to-noise ratio matters more than context richness. Isolate what you're asking about.
+Sonnet finds ~3 contradictions but only ~1 is genuine. The others are misreadings. Use Sonnet for *identification* tasks (what's here?), not *verification* tasks (is this true?).

-5. **"One good prompt works everywhere"** — Prompt framing shapes output more than model choice. The same model with a broad vs narrow prompt produces qualitatively different work. Design prompts per task type.
+### Rule 3: Isolate the Signal

-6. **"Run it once, trust the output"** — Single runs are stochastic. Models miss things non-deterministically. Multiple models or multiple runs are the only hedge.
+When checking for something specific (bias, contradictions, missing assumptions), extract the relevant text and ask about it directly. Don't bury the question in a broad review mandate. Signal-to-noise ratio matters more than model capability.
+
+### Rule 4: Run the Ensemble for High Stakes
+
+For anything financial, safety-critical, or regulatory: run GPT-5 + Opus + Sonnet. Each finds things the others miss. The union is larger than any single model's output. Cost is trivial vs. the value.
+
+### Rule 5: Give GPT-5 Enough Tokens
+
+GPT-5 needs `max_completion_tokens` ≥ 16K. A truncated GPT-5 response is worse than a complete Opus response. Token budget matters more than model size.
+
+### Rule 6: Break Large Outputs Into Sections
+
+Single agents die generating 1000+ lines. Rich input is fine; it's output length that kills. For large generation tasks: break into sections, draft in parallel, assemble.
+
+### Rule 7: Narrow Framing Doesn't Fix Reasoning Gaps
+
+You cannot make Sonnet match GPT-5/Opus by writing a better prompt. Narrow framing changes WHAT it looks for, not HOW WELL it reasons. The gap is architectural, not prompt engineering.
+
+---
+
+## Operational Playbooks
+
+### Playbook A: Architecture Document Review
+
+1. **First pass (Sonnet 4.6):** ~30s, catches structural issues, broken refs, obvious gaps
+2. **Deep analysis (GPT-5):** ~90s, finds domain-specific gaps, hidden assumptions, edge cases
+3. **Design tensions (Opus):** ~60s, finds where the design contradicts itself
+4. **Merge and dedupe:** Union of all three, remove duplicates, sort by severity
+
+### Playbook B: Cross-Document Consistency Check
+
+1. **Use Opus only.** It's 2.4x faster than GPT-5 and finds more issues.
+2. **Provide both documents in a single prompt** (~25KB max)
+3. **Explicitly exclude omissions** in the prompt — you want contradictions, not "Doc A covers X but Doc B doesn't"
+
+### Playbook C: Adversarial Security Review
+
+1. **First pass (GPT-5):** Exhaustive enumeration of attack surface
+2. **Extension pass (Opus):** Give Opus GPT-5's findings, ask it to critique and extend
+3. **Result:** 30% more findings at 28% more cost, with prioritization
+
+### Playbook D: Regulatory Compliance Review
+
+1. **First pass (Sonnet):** ~25s, identifies areas of concern
+2. **Deep dive (GPT-5):** Regulatory specificity, correct citations, edge cases
+3. **GPT-5's reasoning tokens are spent on verification** — trust its citations
+
+### Playbook E: Contradiction Detection
+
+1. **Use GPT-5 + Opus in parallel** (not Sonnet)
+2. **GPT-5 finds:** Specification conflicts (same scenario, different prescriptions)
+3. **Opus finds:** Logical impossibilities (rules that can't coexist)
+4. **Neither dominates** — they find different classes of contradiction
+
+---
+
+## Anti-Patterns
+
+### ❌ Anti-Pattern 1: Using Sonnet for Verification Tasks
+
+**What happens:** Sonnet reports contradictions that aren't real. You waste time investigating false positives or worse, trust a false negative.
+
+**Instead:** Use GPT-5 or Opus for any task requiring "is this true?" reasoning.
+
+### ❌ Anti-Pattern 2: Giving GPT-5 a Broad Mandate
+
+**What happens:** GPT-5 spawns sub-agents, times out, or dumps raw tool output instead of synthesizing.
+
+**Instead:** Give GPT-5 explicit single-actor instructions + output format. For Claude, broader mandates are fine.
+
+### ❌ Anti-Pattern 3: Burying Important Checks in Large Reviews
+
+**What happens:** The model misses the important thing because it's one of 47 things to check.
+
+**Instead:** Extract the important check and ask about it specifically. Signal-to-noise ratio matters.
+
+### ❌ Anti-Pattern 4: Extrapolating Across Task Types
+
+**What happens:** "GPT-5 was great at X, so I'll use it for Y" — and it's mediocre.
+
+**Instead:** Task type predicts performance better than "model X is better." Check the task-type table.
+
+### ❌ Anti-Pattern 5: Skipping the Union
+
+**What happens:** You run one model, miss things another would have caught, and the bug reaches production.
+
+**Instead:** For high-stakes work, run the ensemble. The cost is trivial vs. the risk.
+
+### ❌ Anti-Pattern 6: Tuning Reasoning Effort
+
+**What happens:** You spend time adjusting low/medium/high reasoning effort parameters.
+
+**Instead:** Don't bother. It has negligible effect on analytical work. Task type is the lever.
+
+### ❌ Anti-Pattern 7: Trying to Fix Sonnet with Prompts
+
+**What happens:** You write increasingly narrow prompts trying to get Sonnet to match GPT-5's reasoning depth.
+
+**Instead:** Accept that the gap is architectural. Use Sonnet for what it's good at (speed, breadth, structural review), use GPT-5/Opus for reasoning-heavy tasks.

 ---

 ## Model Personality Cheat Sheet

-| Model | Default behavior | Thinks like a... |
-|-------|-----------------|------------------|
-| GPT-5 | Exhaustive enumeration, high verification bar, slow | Thorough auditor checking every line item |
-| Opus | Design tensions, self-correcting, efficient | Architect who sees how pieces conflict |
-| Sonnet 4.6 | Fast structural scan, self-filtering, concise | Senior engineer doing a quick review |
-| Sonnet 4.5 | Exhaustive, verbose, occasional severity inflation | Junior engineer trying to catch everything |
-| GPT-4.1 | Structured, stays within the document's framing | Competent analyst following a checklist |
-| GPT-4.1 Mini | Formulaic, maps findings 1:1 to document sections | Intern reading the doc and noting concerns |
+| Model | Personality | Default Behavior | Give It |
+|-------|-------------|------------------|---------|
+| **GPT-5** | Exhaustive enumerator | Lists everything systematically | Bounded tasks, explicit output format, single-actor instructions |
+| **Opus** | Design critic | Finds tensions and contradictions | Open-ended analysis, room to reason about boundaries |
+| **Sonnet 4.6** | Structural scanner | Fast, precise, shallow | Quick first-pass work, structural review |
+| **Sonnet 4.5** | Broad coverage | More findings, more noise | When you want breadth over precision |
+| **GPT-4.1** | Generic competent | Stays within document framing | Middle-ground cost-sensitive work |
+| **GPT-4.1 Mini** | Template filler | Formulaic but catches obvious things | Bulk screening, sanity checks |
+
+### Opus Superpower
+
+Opus finds where the spec's **own assumptions are false**. It doesn't just find missing things — it finds things the spec *believes* to be true that *aren't*.
+
+Examples:
+- "Realized P&L cannot recover" — the de-escalation model assumes all metrics can improve, but this one fundamentally cannot (#31)
+- "Forward detection logic is backwards" — spec describes triggers in the wrong direction (#32)
+- "Stop-loss defeated by temporal composition" — safety mechanism rendered ineffective by slow strategy (#47)
+
+### GPT-5 Superpower
+
+GPT-5 reasons about the document's **relationship to the real world**. It asks "what must be true about the external world for this to work?"
+
+Examples:
+- Broker rate limiting (429s) bypasses "connection lost" detection (#9)
+- Corporate actions bypass staleness detection (#9)
+- DB "commit unknown outcome" causing restart loops (#9)
+- Cross-symbol strategies with partial staleness (#9)
+- IRS rule nuances that simplifications violate (#54)

 ---

-## The Bottom Line
+## Decision Framework

-**For our specific workflow (gargoyle architecture review, PR reviews, design docs):**
+### When to Add Another Model

-1. Opus is the default analytical model — most efficient, deepest on consistency/contradiction
-2. GPT-5 is the "we can't afford to miss anything" model — use on high-stakes docs
-3. Sonnet is the speed/screening model — first pass, structural checks, assumption-finding only
-4. Never use Sonnet alone for concurrency, contradiction, or adversarial analysis
-5. Always isolate the analytical question from surrounding noise
-6. Task-type-specific prompts beat generic "review this" prompts every time
+| Situation | Action |
+|-----------|--------|
+| Sonnet found nothing | Add Opus (may find design tensions) |
+| GPT-5 found lots but all similar | Add Opus (may find different class) |
+| Opus found tensions but no enumeration | Add GPT-5 (exhaustive coverage) |
+| Cross-document task | Use Opus only (2.4x faster) |
+| Regulatory/compliance task | Use GPT-5 (correct citations) |
+
+### When NOT to Add Another Model
+
+| Situation | Action |
+|-----------|--------|
+| Quick structural scan | Sonnet alone is fine |
+| Bulk screening | Mini alone is fine |
+| Already ran GPT-5 + Opus | Adding Sonnet rarely helps |
+| Low-stakes internal doc | One model is enough |
+
+### Cost-Benefit Quick Calc
+
+| Risk level | Model cost | Justified? |
+|------------|------------|------------|
+| Financial/safety | ~$1-2 for ensemble | Always yes |
+| Customer-facing | ~$0.50 for GPT-5 | Usually yes |
+| Internal process | ~$0.10 for Sonnet | Always yes |
+| One-off exploration | ~$0.02 for Mini | Always yes |
+
+---
+
+## What We Still Don't Know
+
+1. **Corpus bias:** All experiments used gargoyle docs. Rankings may differ for other domains.
+2. **Run variance:** All findings are single-run. Stochastic variation is unquantified.
+3. **Scale effects:** Largest doc tested is 1,110 lines. Unknown behavior at 2000+.
+4. **Non-architecture domains:** These findings are for architecture document analysis, not coding, not chat, not creative writing.
+
+---
+
+## Summary: The Two Things That Matter Most
+
+1. **Task type determines model choice.** Don't pick a model because "it's best." Pick the model that's best for THIS task type.
+
+2. **The union beats any single model.** For high-stakes work, run the ensemble. Different models find qualitatively different things.
+
+Everything else is optimization.
@@ -1,136 +1,238 @@
 # Model Research Report: AI Models for Analytical Work

-> **Generated:** 2026-05-06 07:30 PDT  
-> **Findings analyzed:** 29  
-> **Period:** 2026-04-26 to 2026-05-06
+> **Generated:** 2026-05-11 09:00 PDT  
+> **Findings analyzed:** 74  
+> **Period:** 2026-04-26 to 2026-05-11

-_29 experiments across 11 days. Five models tested on architecture document analysis — not coding._
+_74 experiments across 16 days. Six models tested on architecture document analysis — not coding._
+
+---
+
+## What's New (Since May 6)
+
+**45 new findings** (29 → 74) covering:
+
+- **New task types validated:** Operational blind spot analysis (#46), emergent behavior from rule composition (#47), defense-in-depth gaps (#48), adversarial evasion/tampering (#49), concurrency race conditions (#50), implementation ambiguity (#51), degraded mode propagation (#52), unstated constraints (#53), state reconstruction correctness (#55), operational burden (#56), event flow correctness (#57), state machine completeness (#58), convention-rule gaps (#59), counterfactual event ordering (#60), regulatory completeness (#61), data integrity signal flow (#62), external system assumptions (#63), specification gaps (#64), temporal correctness (#65), concurrent write hazards (#65b), cross-context contract coherence (#68), boundary contract analysis, boundary violation analysis, inter-document contradiction analysis, security boundary analysis, audit log data integrity (#11-May), wash sale regulatory compliance (#11-May)
+- **Cross-document consistency expanded** (#37, #44): Opus confirmed as dominant for subtle contradictions across tightly-coupled docs
+- **Regulatory compliance analysis depth** (#38, #54, #61): GPT-5 excels at IRS/regulatory specificity with correct citations
+- **Narrow framing tested and rejected** (#39, #43): Sonnet cannot match GPT-5/Opus via prompt framing alone — reasoning depth is the bottleneck
+- **Adversarial ensemble validated** (#35): Critique-then-extend produces 30% more findings at 28% more cost
+- **Operational burden as distinct lens** (#45, #56): Models diverge on what constitutes "operator cognitive load"
+- **Silent data corruption paths** (#40): GPT-5 excels at tracing multi-step corruption through financial accounting
+- **Temporal ordering dependencies** (#41): All models catch obvious ordering; GPT-5 unique on subtle cascades
+- **Failure propagation chains** (#42): Opus finds the architectural insight; GPT-5 finds the enumeration
+
+---

 ## Executive Summary

-We tested GPT-5, Claude Opus, Claude Sonnet 4.6, Claude Sonnet 4.5, GPT-4.1, and GPT-4.1 Mini on analytical tasks: finding hidden assumptions, race conditions, design contradictions, adversarial attack paths, regulatory gaps, and cross-document inconsistencies in real architecture documents.
+We tested GPT-5, Claude Opus, Claude Sonnet 4.6, Claude Sonnet 4.5, GPT-4.1, and GPT-4.1 Mini on analytical tasks: finding hidden assumptions, race conditions, design contradictions, adversarial attack paths, regulatory gaps, cross-document inconsistencies, operational blind spots, emergent behaviors, and security boundaries in real architecture documents.

 **The central finding:** Different models don't just find more or fewer things — they find *qualitatively different kinds* of things. Model choice is task-dependent, and no single model dominates all analytical work.

+**The secondary finding:** Task type predicts model performance better than "model X is better." A model that excels at gap-finding may struggle at contradiction detection. Match the model to the task.
+
 ---

 ## Part 1: What Each Model Is Good At

 ### GPT-5
+
 **Strength:** Exhaustive enumeration + domain-specific reasoning about the real world.

-GPT-5's reasoning tokens change the *kind* of analysis, not just the depth. Non-reasoning models identify risks within a document's own frame of reference. GPT-5 reasons about the document's relationship to the external world: broker semantics, deployment topology, OTP runtime behavior under load, timing correlations across independent subsystems.
+GPT-5's reasoning tokens change the *kind* of analysis, not just the depth. Non-reasoning models identify risks within a document's own frame of reference. GPT-5 reasons about the document's relationship to the external world: broker semantics, deployment topology, OTP runtime behavior under load, timing correlations across independent subsystems, IRS regulatory requirements.

- Best at: gap-finding, hidden assumptions, adversarial enumeration, temporal boundary analysis
- Unique ability: finds multi-component interaction failures that require domain knowledge
- Weakness: slow (2-4x longer than Opus), expensive (5-10x more tokens per finding), sometimes over-verifies
+| Capability | Evidence |
+|------------|----------|
+| Domain-specific gaps | #9, #31: Broker rate limiting, credential rotation, corporate actions |
+| Multi-component interactions | #10, #14: Finds assumptions requiring cross-boundary reasoning |
+| Adversarial enumeration | #29, #35: Most thorough attack surface coverage |
+| Temporal boundary analysis | #18: 15 findings with mathematical precision |
+| Regulatory compliance | #23, #38, #54: Correct IRS citations, regulatory edge cases |
+| Silent data corruption | #40: Traces multi-step corruption paths |
+| Invariant violation paths | #20: Precise, verifiable paths through state space |
+| Operational blind spots | #46: 18 findings including cross-service trace gaps |
+
+- Best at: gap-finding, hidden assumptions, adversarial enumeration, temporal boundary analysis, regulatory compliance, operational blind spots
+- Unique ability: finds multi-component interaction failures requiring domain knowledge
+- Weakness: slow (2-4x longer than Opus), expensive (5-10x more tokens per finding), sometimes over-enumerates
 - Finding count: typically 15-35 depending on document complexity

 ### Claude Opus
-**Strength:** Design tensions, logical argumentation, creative adversarial thinking.
+
+**Strength:** Design tensions, logical argumentation, creative adversarial thinking, cross-document consistency.

 Opus consistently identifies where one part of a design undermines another part. It doesn't enumerate failure modes — it finds the deeper question of *why* the design's own principles conflict.

- Best at: contradiction detection, cross-document consistency, race conditions (design-level), adversarial creativity
- Unique ability: self-corrects mid-analysis, finds "your safety mechanism IS your vulnerability" patterns
+| Capability | Evidence |
+|------------|----------|
+| Contradiction detection | #25, #43: Finds logical impossibilities via deductive reasoning |
+| Cross-document consistency | #28, #37, #44: 2.4x faster than GPT-5, finds more issues |
+| Race conditions (design-level) | #13: 10 high-quality findings, self-corrects mid-analysis |
+| Adversarial creativity | #29, #35: "Your safety mechanism IS your vulnerability" patterns |
+| False assumption detection | #31, #32: Finds where spec's own logic contradicts itself |
+| Emergent behavior insight | #47: Stop-loss defeated by temporal composition (best single finding) |
+| Survivor bias identification | #46: Decision latency histogram hides stuck decisions |
+
+- Best at: contradiction detection, cross-document consistency, race conditions (design-level), adversarial creativity, finding false assumptions
+- Unique ability: self-corrects mid-analysis, finds where protection mechanisms become vulnerabilities
 - Most efficient model: 6-9x fewer tokens per finding than GPT-5 on shared task types
 - Weakness: produces fewer findings on pure enumeration tasks (10-13 vs GPT-5's 20-35)

 ### Claude Sonnet 4.6
+
 **Strength:** Speed, structural issues, assumption-finding. Best precision-per-dollar.

+| Capability | Evidence |
+|------------|----------|
+| Quick first-pass screening | #9, #12: 2-3x faster than other models |
+| Structural review | #5: Catches formatting, broken links, missing sections |
+| Specification gap identification | #16: 13 findings, zero false positives |
+| Observability gaps | #33: 11 findings in 36s |
+
 - Best at: quick first-pass screening, structural review, specification gap identification
 - Zero false positives on most tasks — every finding is actionable
- Weakness: struggles with concurrency reasoning, contradiction detection, and tasks requiring formal logical reasoning
+- Weakness: struggles with concurrency reasoning, contradiction detection, tasks requiring formal logical reasoning
 - Produces false positives on verification-heavy tasks (contradiction, race conditions)

+**Critical limitation (Finding #39):** Narrow framing does NOT close the gap with GPT-5/Opus. Sonnet can find 3 contradictions but only 1 is genuine (2 are misreadings). The gap is reasoning depth, not framing — Sonnet can't reliably verify whether two statements actually contradict each other.
+
 ### Claude Sonnet 4.5
+
 **Strength:** Exhaustive coverage. More findings than 4.6, at the cost of some noise.

- Best at: specification completeness (25 findings vs 4.6's 13)
- Catches operational gaps that 4.6 filters out
+| Capability | Evidence |
+|------------|----------|
+| Specification completeness | #16: 25 findings vs 4.6's 13 |
+| Temporal reasoning | #18: 12 findings with no errors (vs 4.6's errors in #13) |
+| Operational gaps | Catches gaps that 4.6 filters out |
+
+- Best at: specification completeness, broad coverage
 - Tradeoff: severity inflation, more verbose output
+- Use 4.5 for coverage, 4.6 for precision

 ### GPT-4.1
+
 **Strength:** Structured, thorough, good middle ground. Generic but competent.

- Stays within the document's own framing — finds assumptions the document *almost* states
- Best unique contribution: meta-observations about design structure (e.g., "all failure modes treated as isolated")
+| Capability | Evidence |
+|------------|----------|
+| Stays within document framing | #9, #10: Finds assumptions the document almost states |
+| Meta-observations | #10: "All failure modes treated as isolated" |
+| Cost-effective first pass | Good enough when GPT-5's cost isn't justified |
+
+- Best unique contribution: meta-observations about design structure
 - Good enough for first-pass review where GPT-5's cost isn't justified

 ### GPT-4.1 Mini
+
 **Strength:** Cheapest. Formulaic but catches the obvious things.

- Every finding maps cleanly to a section of the document
+| Capability | Evidence |
+|------------|----------|
+| Scales with document size | #9, #19: 6 findings on 459 lines → 21 on 1,110 lines |
+| Clean templates | Every finding maps to a document section |
+| Bias detection | #8: Catches bias when signal isn't buried |
+
 - Fine for quick sanity checks, not for architectural insight
- Scales with document size (6 findings on 459 lines → 21 on 1,110 lines)
+- Best for: bulk screening, sanity checks, obvious-issue detection

 ---

-## Part 2: What We Learned About Task Types
+## Part 2: Task Type → Model Mapping

 Not all analytical tasks are the same. Models that excel at one struggle at another.

-| Task Type | Best Model | Runner-up | Avoid |
-|-----------|-----------|-----------|-------|
-| Hidden assumptions | GPT-5 | Opus | Mini (formulaic) |
-| Gap-finding | GPT-5 | GPT-4.1 | Mini (surface-level) |
-| Race conditions | GPT-5 + Opus | — | Sonnet (errors) |
-| Contradiction detection | **Opus** | GPT-5 | Sonnet (false positives) |
-| Cross-document consistency | **Opus** | GPT-5 | — |
-| Adversarial attack paths | GPT-5 (enumeration) + Opus (creativity) | — | — |
-| Bias detection | Any model | — | — |
-| Design coherence | Document-dependent | — | — |
-| Specification completeness | Sonnet 4.5 (breadth) or GPT-5 (self-contradictions) | — | — |
-| Missing feature identification | All (with right prompt) | — | — |
-| Invariant violation paths | GPT-5 (precision) | Opus (breadth) | Sonnet (imprecise) |
+| Task Type | Best Model | Runner-up | Avoid | Evidence |
+|-----------|-----------|-----------|-------|----------|
+| **Gap-finding** | GPT-5 | GPT-4.1 | Mini (surface-level) | #9, #31, #64 |
+| **Hidden assumptions** | GPT-5 | Opus | Mini (formulaic) | #10, #11, #12, #53 |
+| **Race conditions** | GPT-5 + Opus | — | Sonnet (errors) | #13, #50 |
+| **Contradiction detection** | **Opus** | GPT-5 | Sonnet (false positives) | #25, #43 |
+| **Cross-document consistency** | **Opus** | GPT-5 | — | #28, #37, #44 |
+| **Adversarial attack paths** | GPT-5 (enum) + Opus (creativity) | — | — | #29, #35, #49 |
+| **Design coherence** | Document-dependent | — | — | #15, #27 |
+| **Specification completeness** | Sonnet 4.5 (breadth) / GPT-5 (self-contradictions) | — | — | #16, #31 |
+| **Regulatory compliance** | GPT-5 | Sonnet (first-pass) | — | #23, #38, #54 |
+| **Operational blind spots** | GPT-5 | Opus | Sonnet | #46 |
+| **Emergent behavior** | GPT-5 (feedback loops) | Opus (best single insight) | — | #47 |
+| **Temporal boundaries** | GPT-5 | Opus | — | #18, #41, #65 |
+| **State machine completeness** | GPT-5 | Opus | — | #58 |
+| **Silent data corruption** | GPT-5 | — | — | #40, #62 |
+| **Defense-in-depth gaps** | GPT-5 + Opus | — | — | #48 |
+| **Security boundaries** | GPT-5 | Opus | — | #10-May |

 **Key pattern:** Tasks requiring *identification* (what's missing? what's assumed?) are accessible to all models. Tasks requiring *verification* (is this sequence legal? does this contradict that?) favor reasoning models (GPT-5, Opus) and exclude Sonnet.

+**Task category taxonomy:**
+
+| Category | Sonnet value | Best models |
+|----------|--------------|-------------|
+| Systematic/exhaustive | None | GPT-5, Opus |
+| Creative/generative | Meta-analytical synthesis | Opus, GPT-5 |
+| Compliance/regulatory | Adequate but shallow | GPT-5 (deep), Sonnet (first-pass) |
+| Cross-document | None | Opus strongly preferred |
+
 ---

 ## Part 3: Meta-Findings About How to Use Models

-### 1. Signal-to-noise ratio matters more than model capability (Finding #8)
+### 1. Signal-to-noise ratio matters more than model capability (#8)

 When biased text is the ONLY input, even GPT-4.1 Mini catches it. When the same bias is buried inside a full PR review with diffs, issues, and project context, expensive models miss it. The issue isn't model intelligence — it's attention dilution.

 **Implication:** For important analytical checks, isolate the signal. Extract the relevant text and ask about it specifically. Don't bury important questions inside broad review mandates.

-### 2. Prompt framing dominates model personality (Finding #26)
+### 2. Prompt framing dominates model personality for OPEN tasks (#26)

-Opus's "finds design tensions" and GPT-5's "exhaustive enumeration" are DEFAULT tendencies, not capabilities. With structured prompts that explicitly ask for breadth, Opus produces MORE findings than GPT-5. With structured prompts asking for contradictions, GPT-5 becomes highly selective.
+Opus's "finds design tensions" and GPT-5's "exhaustive enumeration" are DEFAULT tendencies, not hard limits. With structured prompts that explicitly ask for breadth, Opus produces MORE findings than GPT-5. With structured prompts asking for contradictions, GPT-5 becomes highly selective.

-**Implication:** Model choice matters less than you think for any single task. Prompt structure is the primary lever. Model personality matters for OPEN-ENDED tasks where you want emergent analytical behavior.
+**Implication:** Model choice matters less than you think for any single task. Prompt structure is the primary lever. Model personality matters for open-ended tasks where you want emergent analytical behavior.

-### 3. Task type predicts model performance better than "model X is better" (Finding #13)
+### 3. Narrow framing does NOT fix Sonnet's reasoning gaps (#39, #43)
+
+Sonnet can't match GPT-5/Opus via narrow prompts alone. Narrow framing changes WHAT Sonnet looks for but not HOW WELL it reasons. Sonnet found 3 contradictions but only 1 was genuine (2 were misreadings). The gap is reasoning depth, not prompt engineering.
+
+### 4. Task type predicts model performance better than "model X is better" (#13)

 Sonnet scores 85% of GPT-5's performance on assumption-finding but drops to ~50% on concurrency reasoning. Don't extrapolate across task types.

-### 4. The union of models finds the most (Finding #19)
+### 5. The union of models finds the most (#19)

 GPT-5 Mini + Sonnet covers ~71% of GPT-5's findings at 31% of the cost. But the missing 29% contains the domain-specific interaction-level findings most likely to prevent production incidents. Each model also finds things the others miss — the total unique finding space is larger than any single model's output.

-### 5. Reasoning tokens change the KIND of analysis, not just the amount (Finding #10)
+### 6. Adversarial ensemble produces 30% more findings (#35)
+
+Run GPT-5 for exhaustive enumeration, then give Opus GPT-5's findings and ask it to critique and extend. Result: 56 findings vs 43 (GPT-5 alone) or 28 (Opus alone). Zero full disagreements. The critique's structured assessment is more valuable than raw extensions. Cost: ~28% more tokens for 30% more coverage + prioritization.
+
+### 7. Reasoning tokens change the KIND of analysis, not just the amount (#10)

 Non-reasoning models ask "what could this mechanism fail at?" Reasoning models ask "what must be true about the world for this mechanism to work?" This is a qualitative difference in analytical mode, not just thoroughness.

-### 6. Reasoning effort parameter is a no-op for analytical work (Finding #21)
+### 8. Reasoning effort parameter is a no-op for analytical work (#21)

 Low/medium/high reasoning effort had negligible effect on GPT-5's output for open-ended analysis. Task type is a far stronger predictor of reasoning behavior. Don't waste time tuning this parameter for document review.

-### 7. Output length kills, input length doesn't (Finding #6)
+### 9. Output length kills, input length doesn't (#6)

 Single agents die trying to generate 1000+ line documents. Rich input context is fine — it's the output length that causes OOM/timeout. Break output into sections, keep input context rich, draft in parallel, assemble.

-### 8. Document complexity shifts model rankings (Finding #27)
+### 10. Document complexity shifts model rankings (#27)

 Opus beat GPT-5 on coherence analysis for one document but lost on another (more complex) document. Rankings are not universal — they interact with document complexity, domain specificity, and prompt structure.

-### 9. Token budget matters more than model size (Finding #7b)
+### 11. Token budget matters more than model size (#7b)

 When output is truncated by token limits, even GPT-5 produces shallow findings. Ensure sufficient max_completion_tokens (≥16K for GPT-5). A cheap model with enough tokens beats an expensive model that runs out of space.

+### 12. Opus excels at finding where specs believe false things (#31, #32)
+
+Across experiments, Opus consistently finds where the spec's OWN ASSUMPTIONS are false. GPT-5 reasons about what the spec FAILS TO SAY. Opus reasons about what the spec BELIEVES to be true and checks whether those beliefs hold. Different but complementary.
+
+### 13. GPT-5's reasoning tokens are spent on VERIFICATION for regulatory tasks (#54)
+
+For domain-specific regulatory analysis (IRS wash sale rules), GPT-5 consistently cited correct publication sections, code numbers, and regulatory references. The 9,600 reasoning tokens appear spent on verification, not generation.
+
 ---

 ## Part 4: Cost-Effectiveness
@@ -138,21 +240,42 @@ When output is truncated by token limits, even GPT-5 produces shallow findings.
 | Model | Typical tokens/finding | Relative cost | Best use case |
 |-------|----------------------|---------------|---------------|
 | Opus | 179-336 | 1x (baseline) | Cross-doc consistency, contradictions, design tensions |
-| Sonnet 4.6 | 194-312 | 0.3x | Quick screening, structural review, assumption-finding |
-| GPT-5 | 993-2,967 | 5-9x | High-stakes analysis where missing something has real cost |
+| Sonnet 4.6 | 111-194 | 0.2-0.3x | Quick screening, structural review, assumption-finding |
+| Sonnet 4.5 | 150-250 | 0.25x | Broad coverage when noise is acceptable |
+| GPT-5 | 511-2,967 | 5-9x | High-stakes analysis where missing something has real cost |
 | GPT-4.1 | ~500 | 0.5x | Middle-ground first pass |
 | GPT-4.1 Mini | ~300 | 0.1x | Bulk screening, sanity checks |

-**For financial/safety-critical systems:** Run all three (Opus + GPT-5 + Sonnet). The ~$1 total cost per document is irrelevant vs the value of comprehensive coverage.
+**For financial/safety-critical systems:** Run all three (Opus + GPT-5 + Sonnet). The ~$1-2 total cost per document is trivially justified vs the value of comprehensive coverage.

 **For routine review:** Opus alone or Sonnet + Opus pair. Skip GPT-5 unless the document is complex and the stakes justify it.

+**For regulatory compliance:** GPT-5 for depth + correct citations, Sonnet for first-pass breadth.
+
 ---

-## Part 5: What's Still Unknown
+## Part 5: Open Questions

-1. **Would running models sequentially (feed Model A's output to Model B) outperform parallel runs?** Hypothesized for adversarial analysis but untested.
-2. **Are these findings corpus-specific?** All 29 experiments used gargoyle architecture docs. Different domains may shift rankings.
-3. **How much do results vary across runs?** All findings are single-run. Stochastic variation is unquantified.
-4. **Does Sonnet's narrow-framing weakness go away with explicit concurrency prompts?** Untested — the hypothesis that Sonnet's "structural reviewer" tendency is a framing artifact.
-5. **What happens on 2000+ line documents?** Largest tested is 1,110 lines. Unknown if model rankings shift at extreme scale.
+### Still Unanswered
+
+1. **Are these findings corpus-specific?** All 74 experiments used gargoyle architecture docs. Different domains may shift rankings.
+
+2. **How much do results vary across runs?** All findings are single-run. Stochastic variation is unquantified.
+
+3. **What happens on 2000+ line documents?** Largest tested is 1,110 lines. Unknown if model rankings shift at extreme scale.
+
+4. **Cross-document consistency as maintenance tool:** Does running cross-doc analysis across MORE document pairs yield additional real inconsistencies? Could become a systematic documentation maintenance tool.
+
+5. **Why Opus dominates cross-doc consistency:** Is it because contradictions are easy to verify once spotted (reducing GPT-5's verification advantage)? Or because boundary reasoning (Opus's strength) is the primary skill needed?
+
+### Answered Questions (from open-questions.md)
+
+- ~~Opus + narrow framing for contradiction detection~~ → **WRONG QUESTION** (#43). Opus doesn't try to match GPT-5 — it finds a different CLASS of contradiction. GPT-5 finds specification conflicts (same scenario, conflicting prescriptions). Opus finds logical impossibilities (rules whose interaction produces impossible conditions). Neither dominates.
+
+- ~~Sonnet + narrow framing = GPT-5 level?~~ → **NO** (#39). The gap is reasoning depth, not framing.
+
+- ~~Adversarial ensemble (GPT-5 → Opus)?~~ → **YES** (#35). 30% more findings at 28% more cost.
+
+- ~~Opus's "missing feature identification" mode — is it promptable?~~ → **YES** (#26). All models find regulatory gaps when explicitly prompted.
+
+- ~~Is Opus > GPT-5 for coherence tasks universal?~~ → **NO** (#27). Document complexity affects ranking.