docs: regenerate weekly report (2026-05-18)

Rodin
2026-05-18 16:10:16 +00:00
parent afbc013e2e
commit 5426026908
2 changed files with 510 additions and 285 deletions
+212 -124
@@ -1,7 +1,7 @@
# Lessons Learned: Operational Guide for AI Model Selection
> **Generated:** 2026-05-11 09:00 PDT
> **Based on:** 74 experiments (2026-04-26 to 2026-05-11)
> **Generated:** 2026-05-18 09:02 PDT
> **Based on:** 80 experiments (2026-04-26 to 2026-05-15)
_This is the actionable distillation. For evidence and methodology, see REPORT.md._
@@ -11,25 +11,32 @@ _This is the actionable distillation. For evidence and methodology, see REPORT.m
```
┌─────────────────────────────────────────────────────────────────┐
│ TASK TYPE DECISION TREE │
│ TASK TYPE DECISION TREE
├─────────────────────────────────────────────────────────────────┤
│ │
│ Is this a VERIFICATION task? │
│ (contradiction, consistency, race condition)
│ (self-contradiction, consistency check, race condition) │
│ │ │
│ ├─ YES → Use GPT-5 + Opus (skip Sonnet)
│ │ Sonnet has ~33% precision on verification
│ ├─ YES → Is it CROSS-DOCUMENT comparison?
│ │
│ │ ├─ YES → Use Opus (or Sonnet for inter-doc │
│ │ │ contradictions specifically) │
│ │ │ │
│ │ └─ NO → Use GPT-5 + Opus (skip Sonnet) │
│ │ Sonnet has ~33% precision on │
│ │ self-contradiction verification │
│ │ │
│ └─ NO → Is this CROSS-DOCUMENT?
│ └─ NO → Is this SECURITY code review?
│ │ │
│ ├─ YES → Use Opus (2.4x faster, more findings)
│ ├─ YES → Use dedicated security persona
│ │ (generalist reviewers miss it) │
│ │ │
│ └─ NO → Is this HIGH-STAKES? │
│ (financial, safety, regulatory) │
│ │ │
│ ├─ YES → Run all three │
│ │ (GPT-5 + Opus + Sonnet) │
│ │ Total: ~$1-2, worth it
│ │ Total: ~$0.50-0.70
│ │ │
│ └─ NO → Sonnet first-pass │
│ Add Opus if findings need depth │
@@ -46,17 +53,26 @@ _This is the actionable distillation. For evidence and methodology, see REPORT.m
| If the task is... | Use this | Not this |
|-------------------|----------|----------|
| Finding what's missing | GPT-5 | Mini |
| Finding contradictions | Opus | Sonnet |
| Finding self-contradictions | GPT-5 + Opus (both) | Sonnet |
| Cross-document consistency | Opus | GPT-5 |
| Inter-document contradictions | Sonnet | GPT-5 |
| Quick structural scan | Sonnet 4.6 | GPT-5 |
| Broad coverage (noise OK) | Sonnet 4.5 | Sonnet 4.6 |
| Adversarial attack paths | GPT-5 then Opus | Either alone |
| Regulatory compliance | GPT-5 | Opus |
| Operational blind spots | GPT-5 | Sonnet |
| Operational blind spots | GPT-5 + Opus | Sonnet |
| Security code review | Dedicated security persona | Generalist prompt |
| State machine completeness | GPT-5 | Sonnet |
| External system assumptions | GPT-5 | Sonnet |
| Counterfactual ordering | GPT-5 | Sonnet |
| Degraded-mode analysis | Opus + GPT-5 | Sonnet |
| Implementation ambiguity | Any (all viable) | — |
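The table above can be read as a lookup from task type to model list. A minimal sketch, assuming hypothetical task-type keys and a hypothetical `pick_models` helper (neither appears in the report); only a subset of rows is transcribed:

```python
# Hypothetical routing table: a subset of the Rule 1 rows, keyed by
# illustrative task-type names that are not part of the report itself.
ROUTING = {
    "finding_whats_missing": ["GPT-5"],
    "self_contradictions": ["GPT-5", "Opus"],
    "cross_document_consistency": ["Opus"],
    "inter_document_contradictions": ["Sonnet"],
    "quick_structural_scan": ["Sonnet 4.6"],
    "operational_blind_spots": ["GPT-5", "Opus"],
    "security_code_review": ["dedicated security persona"],
}


def pick_models(task_type: str) -> list[str]:
    """Return the recommended reviewers for a task type, or fail loudly."""
    if task_type not in ROUTING:
        # An unknown task type should not silently fall back to a default model.
        raise ValueError(f"no routing rule for task type: {task_type!r}")
    return list(ROUTING[task_type])


print(pick_models("self_contradictions"))
```

Failing on an unknown task type, rather than defaulting to one model, keeps the "match the task" rule explicit.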
### Rule 2: Don't Trust Sonnet for Verification
Sonnet finds ~3 contradictions but only ~1 is genuine. The others are misreadings. Use Sonnet for *identification* tasks (what's here?), not *verification* tasks (is this true?).
Sonnet finds ~3 contradictions but only ~1 is genuine. The others are misreadings. Use Sonnet for *identification* tasks (what's here?) and *inter-document comparison* (do these conflict?), not *self-contradiction verification* (is this internally consistent?).
**Exception:** Inter-document contradiction (#67) — Sonnet outperforms GPT-5 when comparing two documents for conflicting claims. Parallel comparison ≠ serial verification.
### Rule 3: Isolate the Signal
@@ -64,7 +80,7 @@ When checking for something specific (bias, contradictions, missing assumptions)
### Rule 4: Run the Ensemble for High Stakes
For anything financial, safety-critical, or regulatory: run GPT-5 + Opus + Sonnet. Each finds things the others miss. The union is larger than any single model's output. Cost is trivial vs. the value.
For anything financial, safety-critical, or regulatory: run GPT-5 + Opus + Sonnet. Each finds things the others miss. The union is 30-60% larger than the best single model's output. At roughly $0.50-0.70 per run, the cost is trivial next to the value.
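One way to measure the union's gain over the best single model is sketched below. It assumes findings are short strings and that two findings are duplicates only if they match after case and whitespace normalization, which is far cruder than a real dedupe step. The function names and sample data are illustrative, not taken from the experiments.

```python
def _normalize(finding: str) -> str:
    """Collapse case and whitespace so trivially different strings match."""
    return " ".join(finding.lower().split())


def union_findings(findings_by_model: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each distinct finding to the list of models that reported it."""
    merged: dict[str, list[str]] = {}
    for model, findings in findings_by_model.items():
        for finding in findings:
            reporters = merged.setdefault(_normalize(finding), [])
            if model not in reporters:
                reporters.append(model)
    return merged


def union_gain(findings_by_model: dict[str, list[str]]) -> float:
    """Fractional gain of the union over the best single model (0.5 = 50% more)."""
    best = max(
        (len({_normalize(f) for f in fs}) for fs in findings_by_model.values()),
        default=0,
    )
    if best == 0:
        return 0.0
    return len(union_findings(findings_by_model)) / best - 1.0


# Made-up sample: two models, one overlapping finding.
sample = {
    "GPT-5": ["rate limit bypass", "stale quotes", "restart loop", "clock skew"],
    "Opus": ["Stale quotes", "stop-loss defeated", "audit blind spot"],
}
print(f"{union_gain(sample):.0%}")
```

Keeping the list of reporting models per finding also supports the "which bot finds which issues" tracking recommended elsewhere in this guide.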
### Rule 5: Give GPT-5 Enough Tokens
@@ -78,168 +94,240 @@ Single agents die generating 1000+ lines. Rich input is fine; it's output length
You cannot make Sonnet match GPT-5/Opus by writing a better prompt. Narrow framing changes WHAT it looks for, not HOW WELL it reasons. The gap is architectural, not prompt engineering.
---
### Rule 8: Specialized Personas Outperform Model Upgrades (for security)
## Operational Playbooks
A dedicated security-reviewer persona on the same model catches issues that a generalist reviewer misses and approves. For security code review: configure explicit security criteria (trust boundaries, library edge cases, OS interaction) rather than relying on "please also check security."
### Playbook A: Architecture Document Review
### Rule 9: Dual-Bot Disagreement Is a Feature
1. **First pass (Sonnet 4.6):** ~30s, catches structural issues, broken refs, obvious gaps
2. **Deep analysis (GPT-5):** ~90s, finds domain-specific gaps, hidden assumptions, edge cases
3. **Design tensions (Opus):** ~60s, finds where the design contradicts itself
4. **Merge and dedupe:** Union of all three, remove duplicates, sort by severity
Two reviewers that sometimes disagree create a quality ratchet. The PR can't merge until both are satisfied. Removing one reviewer drops the gate from ~32% REQUEST_CHANGES to ~2%. Never reduce to single-bot without understanding the quality tradeoff.
### Playbook B: Cross-Document Consistency Check
### Rule 10: Monitor Your AI Pipeline
1. **Use Opus only.** It's 2.4x faster than GPT-5 and finds more issues.
2. **Provide both documents in a single prompt** (~25KB max)
3. **Explicitly exclude omissions** in the prompt — you want contradictions, not "Doc A covers X but Doc B doesn't"
### Playbook C: Adversarial Security Review
1. **First pass (GPT-5):** Exhaustive enumeration of attack surface
2. **Extension pass (Opus):** Give Opus GPT-5's findings, ask it to critique and extend
3. **Result:** 30% more findings at 28% more cost, with prioritization
### Playbook D: Regulatory Compliance Review
1. **First pass (Sonnet):** ~25s, identifies areas of concern
2. **Deep dive (GPT-5):** Regulatory specificity, correct citations, edge cases
3. **GPT-5's reasoning tokens are spent on verification** — trust its citations
### Playbook E: Contradiction Detection
1. **Use GPT-5 + Opus in parallel** (not Sonnet)
2. **GPT-5 finds:** Specification conflicts (same scenario, different prescriptions)
3. **Opus finds:** Logical impossibilities (rules that can't coexist)
4. **Neither dominates** — they find different classes of contradiction
AI pipelines need operational monitoring like any production system. A dispatcher malfunction caused 3.5x cost overage and invalidated experiment data (Finding #80). Monitor: dispatch correctness, per-PR review count, API costs, and reviewer participation patterns.
---
## Anti-Patterns
### ❌ Anti-Pattern 1: Using Sonnet for Verification Tasks
### ❌ "Just use the best model for everything"
**What happens:** Sonnet reports contradictions that aren't real. You waste time investigating false positives or, worse, trust a false negative.
No single model dominates. GPT-5 is worst for inter-document contradictions. Opus is worst for exhaustive enumeration. Sonnet is worst for self-contradiction verification. Match the task.
**Instead:** Use GPT-5 or Opus for any task requiring "is this true?" reasoning.
### ❌ "Sonnet is good enough for a quick check"
### ❌ Anti-Pattern 2: Giving GPT-5 a Broad Mandate
Only on structural/identification tasks. On verification tasks, Sonnet's 33% precision means 2/3 of its findings are wrong. A "quick check" that produces false confidence is worse than no check.
**What happens:** GPT-5 spawns sub-agents, times out, or dumps raw tool output instead of synthesizing.
### ❌ "We'll fix the prompt to get better results"
**Instead:** Give GPT-5 explicit single-actor instructions + output format. For Claude, broader mandates are fine.
Narrow framing helps direct attention but cannot fix reasoning depth gaps. If Sonnet can't verify a contradiction with a perfect prompt, a better prompt won't help. Use a better model.
### ❌ Anti-Pattern 3: Burying Important Checks in Large Reviews
### ❌ "One reviewer is enough"
**What happens:** The model misses the important thing because it's one of 47 things to check.
Finding #78: Dropping from dual-bot to single-bot review caused a 15x drop in REQUEST_CHANGES rate. The disagreement between reviewers is where quality lives. A single reviewer has blind spots; two reviewers catch each other's misses.
**Instead:** Extract the important check and ask about it specifically. Signal-to-noise ratio matters.
### ❌ "Security is just another review criterion"
### ❌ Anti-Pattern 4: Extrapolating Across Task Types
Findings #79, #79b: Generalist reviewers (Sonnet, GPT) both APPROVED code with critical SSRF bypasses. Only a dedicated security persona blocked merge. Security requires domain-specific knowledge (CGN ranges, proxy inheritance, call-site consistency) that generalist prompts don't invoke.
**What happens:** "GPT-5 was great at X, so I'll use it for Y" — and it's mediocre.
### ❌ "More reviews = better quality"
**Instead:** Task type predicts performance better than "model X is better." Check the task-type table.
Finding #80: When the dispatcher malfunctioned, PRs received 14+ reviews from 6 reviewers. This didn't improve quality — it inflated costs 3.5x and created noise. Targeted, specialized reviews > spray-and-pray.
### ❌ Anti-Pattern 5: Skipping the Union
### ❌ "The models will catch operational issues"
**What happens:** You run one model, miss things another would have caught, and the bug reaches production.
Finding #80: A broken dispatcher ran for days before anyone noticed the 3.5x cost overage. AI pipelines need traditional observability (cost monitoring, dispatch verification, participation metrics) — they don't self-diagnose operational problems.
**Instead:** For high-stakes work, run the ensemble. The cost is trivial vs. the risk.
---
### ❌ Anti-Pattern 6: Tuning Reasoning Effort
## Operational Playbooks
**What happens:** You spend time adjusting low/medium/high reasoning effort parameters.
### Playbook 1: Architecture Document Review
**Instead:** Don't bother. It has negligible effect on analytical work. Task type is the lever.
```
1. Sonnet first-pass (15-40s, $0.02)
- Structural gaps, missing sections, obvious issues
- Decision: Is this worth deeper analysis?
### ❌ Anti-Pattern 7: Trying to Fix Sonnet with Prompts
2. If yes → GPT-5 focused analysis (80-140s, $0.40)
- Hidden assumptions, domain-specific gaps
- Regulatory compliance (if applicable)
- Temporal/ordering hazards
**What happens:** You write increasingly narrow prompts trying to get Sonnet to match GPT-5's reasoning depth.
3. Opus design-tension analysis (50-120s, $0.12)
- Where do principles conflict?
- Where do safety mechanisms become vulnerabilities?
- Cross-document consistency (if multiple docs)
**Instead:** Accept that the gap is architectural. Use Sonnet for what it's good at (speed, breadth, structural review), use GPT-5/Opus for reasoning-heavy tasks.
4. Union findings → prioritize by severity
```
### Playbook 2: Security Code Review
```
1. Standard review (generalist model, any)
- Code structure, patterns, obvious issues
- Note: Will likely APPROVE even with security gaps
2. Dedicated security persona (MANDATORY for auth/crypto/network)
- Explicit criteria: trust boundaries, library semantics, OS interaction
- Checks: HTTPS enforcement at every call site
- Checks: IP validation (is_global vs is_private)
- Checks: Transport inheritance (proxy, TLS, timeout settings)
- Checks: Write-path vs read-path consistency
3. Security persona has VETO power
- If security says REQUEST_CHANGES, it blocks regardless of other approvals
```
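The `is_global` vs `is_private` check in step 2 can be illustrated with Python's standard `ipaddress` module. This is a minimal sketch of the pitfall, not the review-bot's actual code; both function names are hypothetical. It covers only literal IP classification and ignores DNS resolution, redirects, and proxy settings.

```python
import ipaddress


def naive_allows(ip_text: str) -> bool:
    """Deny-list style check: allow anything that is not flagged private."""
    return not ipaddress.ip_address(ip_text).is_private


def strict_allows(ip_text: str) -> bool:
    """Allow-list style check: allow only globally routable addresses."""
    return ipaddress.ip_address(ip_text).is_global


# 100.64.0.0/10 (carrier-grade NAT shared address space) is neither
# private nor global in the ipaddress module, so the two checks disagree.
for candidate in ("8.8.8.8", "10.0.0.1", "100.64.0.1"):
    print(candidate, naive_allows(candidate), strict_allows(candidate))
```

The deny-list version lets the CGN address through, while the allow-list version rejects it. A generalist review can easily miss this kind of library edge case.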
### Playbook 3: Multi-Model Review Pipeline
```
Configuration (optimal based on Finding #78):
- Bot 1: Structural/pattern reviewer (Sonnet — fast, catches structural issues)
- Bot 2: Depth/logic reviewer (GPT — catches reasoning issues)
- Bot 3: Security reviewer (dedicated persona — catches security issues)
Rules:
- ALL bots must approve for merge (disagreement = quality signal)
- Monitor: REQUEST_CHANGES rate should be 20-40% (too low = degraded gate)
- Monitor: Per-PR review count should be 4-8 (too high = dispatcher bug)
- Monitor: API cost per PR (set alerts for >2x expected)
Operational:
- If REQUEST_CHANGES drops below 10% → investigate (model config? code quality? gate degraded?)
- If review count exceeds 15 → check dispatcher/webhook configuration
- Track: which bot finds which issues → continuous model-task matching
```
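The operational thresholds and the all-must-approve rule above could be encoded roughly as follows. This is a sketch under stated assumptions: `PipelineStats`, `check_pipeline`, and `can_merge` are hypothetical names, the verdict strings are illustrative, and how the metrics are collected is left out entirely.

```python
from dataclasses import dataclass


@dataclass
class PipelineStats:
    request_changes_rate: float  # fraction of reviews requesting changes, 0.0-1.0
    reviews_per_pr: float        # average number of reviews posted per PR
    cost_per_pr: float           # observed API cost per PR, in dollars
    expected_cost_per_pr: float  # budgeted API cost per PR, in dollars


def check_pipeline(stats: PipelineStats) -> list[str]:
    """Return one alert per threshold from the playbook that is breached."""
    alerts = []
    if stats.request_changes_rate < 0.10:
        alerts.append("REQUEST_CHANGES below 10%: investigate a degraded gate")
    if stats.reviews_per_pr > 15:
        alerts.append("review count above 15: check dispatcher/webhook configuration")
    if stats.cost_per_pr > 2 * stats.expected_cost_per_pr:
        alerts.append("API cost per PR above 2x expected")
    return alerts


def can_merge(verdicts: dict[str, str]) -> bool:
    """All bots must approve; a missing or non-approving verdict blocks merge."""
    return bool(verdicts) and all(v == "APPROVED" for v in verdicts.values())


healthy = PipelineStats(0.32, 6, 1.0, 1.0)
print(check_pipeline(healthy))
print(can_merge({"structural": "APPROVED", "logic": "APPROVED", "security": "REQUEST_CHANGES"}))
```

Because every bot must approve, the security persona's veto needs no special case: its REQUEST_CHANGES blocks the merge like any other.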
### Playbook 4: Regulatory Compliance Review
```
1. Sonnet structural scan ($0.02, 15-30s)
- Identify which regulatory categories are addressed
- Flag structural gaps (missing sections, uncovered rules)
2. GPT-5 regulatory cross-reference ($0.40, 120-160s)
- Rule-by-rule comparison with actual regulations
- Cite specific IRS/FINRA/SEC sections
- Identify mathematical/formula errors
- Find edge cases the implementation misses
3. Opus operational compliance ($0.12, 50-80s)
- What the system needs to DO at runtime
- Cross-account, cross-entity obligations
- Where the implementation's model doesn't match regulatory reality
4. Combine → prioritize Critical/High for immediate action
```
### Playbook 5: Cross-Document Consistency Check
```
If comparing 2-3 documents for contradictions:
→ Opus primary (2.4x faster, finds more boundary tensions)
→ Sonnet for parallel comparison of claims (faster for direct conflicts)
→ GPT-5 only if documents are very complex (1000+ lines combined)
If checking one document for self-consistency:
→ GPT-5 for specification conflicts (statement A contradicts statement B)
→ Opus for logical impossibilities (rule A + rule B = impossible condition)
→ Skip Sonnet (33% precision, wastes time filtering false positives)
```
---
## Model Personality Cheat Sheet
| Model | Personality | Default Behavior | Give It |
|-------|-------------|------------------|---------|
| **GPT-5** | Exhaustive enumerator | Lists everything systematically | Bounded tasks, explicit output format, single-actor instructions |
| **Opus** | Design critic | Finds tensions and contradictions | Open-ended analysis, room to reason about boundaries |
| **Sonnet 4.6** | Structural scanner | Fast, precise, shallow | Quick first-pass work, structural review |
| **Sonnet 4.5** | Broad coverage | More findings, more noise | When you want breadth over precision |
| **GPT-4.1** | Generic competent | Stays within document framing | Middle-ground cost-sensitive work |
| **GPT-4.1 Mini** | Template filler | Formulaic but catches obvious things | Bulk screening, sanity checks |
### Opus Superpower
Opus finds where the spec's **own assumptions are false**. It doesn't just find missing things — it finds things the spec *believes* to be true that *aren't*.
Examples:
- "Realized P&L cannot recover" — the de-escalation model assumes all metrics can improve, but this one fundamentally cannot (#31)
- "Forward detection logic is backwards" — spec describes triggers in the wrong direction (#32)
- "Stop-loss defeated by temporal composition" — safety mechanism rendered ineffective by slow strategy (#47)
### GPT-5 Superpower
GPT-5 reasons about the document's **relationship to the real world**. It asks "what must be true about the external world for this to work?"
Examples:
- Broker rate limiting (429s) bypasses "connection lost" detection (#9)
- Corporate actions bypass staleness detection (#9)
- DB "commit unknown outcome" causing restart loops (#9)
- Cross-symbol strategies with partial staleness (#9)
- IRS rule nuances that simplifications violate (#54)
| Model | Thinks Like | Asks | Finds |
|-------|------------|------|-------|
| GPT-5 | Systems engineer with 20 years experience | "What does the real world need that this doesn't address?" | Infrastructure gaps, operational hazards, regulatory oversights |
| Opus | Architecture critic / philosophy professor | "Where do your own principles contradict each other?" | Design tensions, logical impossibilities, safety-mechanism-as-vulnerability |
| Sonnet 4.6 | Junior developer implementing the spec | "If I were coding this, what would confuse me?" | Implementation ambiguities, structural gaps, obvious missing pieces |
| Sonnet 4.5 | Enthusiastic intern brainstorming | "What COULD go wrong?" | Broad list of concerns (noisy, needs filtering) |
| GPT-4.1 | Reliable senior dev doing code review | "Does this follow the patterns correctly?" | Structural issues, format consistency |
| GPT-4.1 Mini | Fast intern doing a checklist | "Is this obviously incomplete?" | Missing sections, obvious gaps |
---
## Decision Framework
## Decision Framework: "Should I Run Another Model?"
### When to Add Another Model
```
After getting results from Model A, ask:
| Situation | Action |
|-----------|--------|
| Sonnet found nothing | Add Opus (may find design tensions) |
| GPT-5 found lots but all similar | Add Opus (may find different class) |
| Opus found tensions but no enumeration | Add GPT-5 (exhaustive coverage) |
| Cross-document task | Use Opus only (2.4x faster) |
| Regulatory/compliance task | Use GPT-5 (correct citations) |
1. Is the task VERIFICATION? (contradictions, races, consistency)
→ YES: Run the complementary model (GPT-5 ↔ Opus)
Sonnet results alone are insufficient for verification
### When NOT to Add Another Model
2. Are the stakes HIGH? (financial, safety, regulatory, security)
→ YES: Run all available models. The marginal cost ($0.10-0.50)
is negligible vs. the cost of a missed finding.
| Situation | Action |
|-----------|--------|
| Quick structural scan | Sonnet alone is fine |
| Bulk screening | Mini alone is fine |
| Already ran GPT-5 + Opus | Adding Sonnet rarely helps |
| Low-stakes internal doc | One model is enough |
3. Did Model A find < 5 issues on a complex document?
→ YES: The task might not suit Model A. Try a different model.
→ GPT-5 finding < 5 issues is unusual — check prompt/token budget.
### Cost-Benefit Quick Calc
4. Is this a SCREENING pass before deeper work?
→ YES: Sonnet is sufficient. Save heavy models for the deep dive.
→ NO: Add GPT-5 or Opus depending on task type.
| Risk level | Model cost | Justified? |
|------------|------------|------------|
| Financial/safety | ~$1-2 for ensemble | Always yes |
| Customer-facing | ~$0.50 for GPT-5 | Usually yes |
| Internal process | ~$0.10 for Sonnet | Always yes |
| One-off exploration | ~$0.02 for Mini | Always yes |
5. Is the document about SECURITY (auth, crypto, network)?
→ YES: Use a dedicated security persona regardless of other reviewers.
→ Generalist models will likely approve code with security gaps.
```
---
## What We Still Don't Know
## Key Numbers to Remember
1. **Corpus bias:** All experiments used gargoyle docs. Rankings may differ for other domains.
2. **Run variance:** All findings are single-run. Stochastic variation is unquantified.
3. **Scale effects:** Largest doc tested is 1,110 lines. Unknown behavior at 2000+.
4. **Non-architecture domains:** These findings are for architecture document analysis, not coding, not chat, not creative writing.
| Fact | Number | Source |
|------|--------|--------|
| Sonnet precision on contradiction detection | ~33% | #39, #43 |
| Adversarial ensemble improvement over single model | +30% findings | #35 |
| Adversarial ensemble extra cost | +28% tokens | #35 |
| Opus speed advantage over GPT-5 (cross-doc) | 2.4x faster | #28 |
| Opus token efficiency vs GPT-5 | 6-9x fewer tokens/finding | Multiple |
| GPT-5 minimum token budget | 16K completion tokens | #5, multiple |
| Union findings vs best single model | 30-60% more | Multiple |
| Dual-bot REQUEST_CHANGES rate | ~32% | #78 |
| Single-bot REQUEST_CHANGES rate | ~2% | #78 |
| Post-merge escape: missing tests | 22% of all findings | #78 |
| Security persona catch rate vs generalist | 100% vs 0% (on security issues) | #79 |
| Dispatcher malfunction cost overage | 3.5x | #80 |
| Total findings analyzed | 80 | This report |
| Total validated analytical lenses | 28 | REPORT.md Part 6 |
---
## Summary: The Two Things That Matter Most
## What Changed Since Last Report
1. **Task type determines model choice.** Don't pick a model because "it's best." Pick the model that's best for THIS task type.
**New rules added:**
- Rule 8: Specialized Personas Outperform Model Upgrades (for security)
- Rule 9: Dual-Bot Disagreement Is a Feature
- Rule 10: Monitor Your AI Pipeline
2. **The union beats any single model.** For high-stakes work, run the ensemble. Different models find qualitatively different things.
**New anti-patterns:**
- "One reviewer is enough"
- "Security is just another review criterion"
- "More reviews = better quality"
- "The models will catch operational issues"
Everything else is optimization.
**New playbooks:**
- Playbook 2: Security Code Review
- Playbook 3: Multi-Model Review Pipeline (updated from operational data)
**Updated decision tree:**
- Added security review branch
- Added inter-document contradiction (Sonnet dominance) path
---
## Evolution Notes
This document evolves weekly. As new findings emerge:
- Rules are added/modified when patterns are validated across 3+ experiments
- Anti-patterns are added after observing repeated mistakes
- Playbooks are updated when operational data improves recommendations
- The decision tree is simplified when clearer heuristics emerge
The goal: make model selection as automatic as possible, so the researcher spends time on *analysis* rather than *model choice*.
+298 -161
@@ -1,37 +1,44 @@
# Model Research Report: AI Models for Analytical Work
> **Generated:** 2026-05-11 09:00 PDT
> **Findings analyzed:** 74
> **Period:** 2026-04-26 to 2026-05-11
> **Generated:** 2026-05-18 09:02 PDT
> **Findings analyzed:** 80
> **Period:** 2026-04-26 to 2026-05-15
> **Corpus:** gargoyle architecture docs, review-bot security code, dev pipeline metrics
_74 experiments across 16 days. Six models tested on architecture document analysis — not coding._
_80 experiments across 20 days. Six models tested on architecture document analysis, security review, and development process effectiveness._
---
## What's New (Since May 6)
## What's New (Since May 11)
**45 new findings** (29 → 74) covering:
**6 new findings** (74 → 80) covering:
- **New task types validated:** Operational blind spot analysis (#46), emergent behavior from rule composition (#47), defense-in-depth gaps (#48), adversarial evasion/tampering (#49), concurrency race conditions (#50), implementation ambiguity (#51), degraded mode propagation (#52), unstated constraints (#53), state reconstruction correctness (#55), operational burden (#56), event flow correctness (#57), state machine completeness (#58), convention-rule gaps (#59), counterfactual event ordering (#60), regulatory completeness (#61), data integrity signal flow (#62), external system assumptions (#63), specification gaps (#64), temporal correctness (#65), concurrent write hazards (#65b), cross-context contract coherence (#68), boundary contract analysis, boundary violation analysis, inter-document contradiction analysis, security boundary analysis, audit log data integrity (#11-May), wash sale regulatory compliance (#11-May)
- **Cross-document consistency expanded** (#37, #44): Opus confirmed as dominant for subtle contradictions across tightly-coupled docs
- **Regulatory compliance analysis depth** (#38, #54, #61): GPT-5 excels at IRS/regulatory specificity with correct citations
- **Narrow framing tested and rejected** (#39, #43): Sonnet cannot match GPT-5/Opus via prompt framing alone — reasoning depth is the bottleneck
- **Adversarial ensemble validated** (#35): Critique-then-extend produces 30% more findings at 28% more cost
- **Operational burden as distinct lens** (#45, #56): Models diverge on what constitutes "operator cognitive load"
- **Silent data corruption paths** (#40): GPT-5 excels at tracing multi-step corruption through financial accounting
- **Temporal ordering dependencies** (#41): All models catch obvious ordering; GPT-5 unique on subtle cascades
- **Failure propagation chains** (#42): Opus finds the architectural insight; GPT-5 finds the enumeration
- **Finding #78: Dev Loop Effectiveness Analysis** — Quantitative audit of the gargoyle autonomous development pipeline. Key results: dual-bot review (Sonnet + GPT) achieved 32% REQUEST_CHANGES rate vs 2% for single-bot. Post-merge review caught 100 escaped defects (all fixed). 22% of post-merge findings were missing test coverage. Sonnet-review-bot dropout was the single largest quality regression.
- **Finding #79 (two parts): Multi-Model Security Review Catches SSRF Gaps** — Dedicated security-reviewer persona caught CGN range bypass (100.64.0.0/10 not covered by Python `is_private`) and proxy-assisted SSRF (Go `http.DefaultTransport` cloning preserves `ProxyFromEnvironment`). Standard reviewers (Sonnet, GPT) both approved — only the specialized security persona blocked merge. Validates: domain-specialized reviewer roles outperform generalist "security-aware" review.
- **Finding #79b: HTTPS Enforcement Bypass in GitHub Client** — Security reviewer caught write-path methods (`PostReview`, `DeleteReview`, `RequestReviewer`) bypassing the `doRequest` HTTPS guard. Standard reviewers missed it. 30-minute fix cycle from detection to re-approval. Validates: write-path code paths deserve extra security scrutiny.
- **Finding #80: Config-A/B Dispatcher Malfunction** — The even/odd PR parity routing for multi-model review experiments was NOT operational. All 6 reviewers fired on all PRs simultaneously, causing 3.5x API cost overage and invalidating Phase 1 baseline metrics. Demonstrates: operational monitoring of AI pipeline configuration is critical.
**Key new insights:**
1. Specialized reviewer personas provide value that model capability alone cannot replicate
2. Multi-model review pipelines need operational monitoring (cost, dispatch correctness)
3. Dual-bot disagreement acts as a natural quality ratchet — removing one bot degrades quality disproportionately
4. Security review at library/OS-interaction boundaries requires domain-specific knowledge (CGN ranges, proxy inheritance)
---
## Executive Summary
We tested GPT-5, Claude Opus, Claude Sonnet 4.6, Claude Sonnet 4.5, GPT-4.1, and GPT-4.1 Mini on analytical tasks: finding hidden assumptions, race conditions, design contradictions, adversarial attack paths, regulatory gaps, cross-document inconsistencies, operational blind spots, emergent behaviors, and security boundaries in real architecture documents.
We tested GPT-5, Claude Opus 4.6, Claude Sonnet 4.6, Claude Sonnet 4.5, GPT-4.1, and GPT-4.1 Mini on analytical tasks: finding hidden assumptions, race conditions, design contradictions, adversarial attack paths, regulatory gaps, cross-document inconsistencies, operational blind spots, emergent behaviors, security boundaries, and multi-model review pipeline effectiveness.
**The central finding:** Different models don't just find more or fewer things — they find *qualitatively different kinds* of things. Model choice is task-dependent, and no single model dominates all analytical work.
**The secondary finding:** Task type predicts model performance better than "model X is better." A model that excels at gap-finding may struggle at contradiction detection. Match the model to the task.
**The tertiary finding (new):** Reviewer *persona* (security-focused, domain-focused) matters as much as model capability. A dedicated security reviewer using the same model catches issues that a generalist reviewer on the same model misses.
---
## Part 1: What Each Model Is Good At
@@ -44,23 +51,30 @@ GPT-5's reasoning tokens change the *kind* of analysis, not just the depth. Non-
| Capability | Evidence |
|------------|----------|
| Domain-specific gaps | #9, #31: Broker rate limiting, credential rotation, corporate actions |
| Multi-component interactions | #10, #14: Finds assumptions requiring cross-boundary reasoning |
| Domain-specific gaps | #9, #31, #63: Broker rate limiting, credential rotation, entitlement gaps |
| Multi-component interactions | #10, #14, #68: Finds assumptions requiring cross-boundary reasoning |
| Adversarial enumeration | #29, #35: Most thorough attack surface coverage |
| Temporal boundary analysis | #18: 15 findings with mathematical precision |
| Regulatory compliance | #23, #38, #54: Correct IRS citations, regulatory edge cases |
| Silent data corruption | #40: Traces multi-step corruption paths |
| Temporal boundary analysis | #18, #41: 15+ findings with mathematical precision |
| Regulatory compliance | #23, #38, #54, #61, #64: Correct IRS/FINRA citations, regulatory edge cases |
| Silent data corruption | #40, #62: Traces multi-step corruption paths |
| Invariant violation paths | #20: Precise, verifiable paths through state space |
| Operational blind spots | #46: 18 findings including cross-service trace gaps |
| State machine completeness | #58: 16 gaps including race windows during state transitions |
| Concurrent write hazards | #65: 19 hazards with specific ordering interleavings |
| External system assumptions | #63: 24 assumptions about broker APIs, network behavior |
| Counterfactual event ordering | #60: 30 findings through systematic permutation |
| Specification gap analysis | #64b: 17 implementation-divergence scenarios |
| Convention rule gaps | #59: 34 findings through section-by-section enumeration |
- Best at: gap-finding, hidden assumptions, adversarial enumeration, temporal boundary analysis, regulatory compliance, operational blind spots
- Unique ability: finds multi-component interaction failures requiring domain knowledge
- Weakness: slow (2-4x longer than Opus), expensive (5-10x more tokens per finding), sometimes over-enumerates
- Best at: gap-finding, hidden assumptions, adversarial enumeration, temporal boundary analysis, regulatory compliance, operational blind spots, state machine analysis, exhaustive permutation
- Unique ability: finds multi-component interaction failures requiring domain knowledge + systematic enumeration
- Weakness: slow (2-4x longer than Opus), expensive (5-10x more tokens per finding), sometimes over-enumerates low-severity variants
- Finding count: typically 15-35 depending on document complexity
- Typical cost: $0.30-0.50 per experiment
### Claude Opus
### Claude Opus 4.6
**Strength:** Design tensions, logical argumentation, creative adversarial thinking, cross-document consistency.
**Strength:** Design tensions, logical argumentation, creative adversarial thinking, cross-document consistency, failure mode reasoning.
Opus consistently identifies where one part of a design undermines another part. It doesn't enumerate failure modes — it finds the deeper question of *why* the design's own principles conflict.
@@ -73,209 +87,332 @@ Opus consistently identifies where one part of a design undermines another part.
| False assumption detection | #31, #32: Finds where spec's own logic contradicts itself |
| Emergent behavior insight | #47: Stop-loss defeated by temporal composition (best single finding) |
| Survivor bias identification | #46: Decision latency histogram hides stuck decisions |
| Degraded-mode propagation | #52: 10 findings including lost-pending-state indistinguishability |
| Failure propagation chains | #42: 10 findings in 4K tokens — 2.2x more token-efficient than GPT-5 |
| Security boundary tensions | Security analysis: Signal reconnaissance via audit blindspot |
| Design-level incompleteness | #51, #52: Where the model is fundamentally underspecified |
- Best at: contradiction detection, cross-document consistency, race conditions (design-level), adversarial creativity, finding false assumptions
- Unique ability: self-corrects mid-analysis, finds where protection mechanisms become vulnerabilities
- Best at: contradiction detection, cross-document consistency, race conditions (design-level), adversarial creativity, finding false assumptions, degraded-mode propagation, failure mode reasoning
- Unique ability: self-corrects mid-analysis, finds where protection mechanisms become vulnerabilities, identifies logical impossibilities from rule interaction
- Most efficient model: 6-9x fewer tokens per finding than GPT-5 on shared task types
- Weakness: produces fewer findings on pure enumeration tasks (10-13 vs GPT-5's 20-35)
- Typical cost: $0.08-0.15 per experiment
### Claude Sonnet 4.6
**Strength:** Speed, structural issues, assumption-finding. Best precision-per-dollar.
**Strength:** Fast structural scanning, implementation-perspective findings, inter-document contradiction detection, quick first-pass screening.
Sonnet is the fastest and cheapest model. It catches ~60-80% of findings on structural tasks. On inter-document contradiction detection (Finding #67), it *outperformed* GPT-5: more findings, better severity calibration, 10x faster.
| Capability | Evidence |
|------------|----------|
| Quick first-pass screening | #9, #12: 2-3x faster than other models |
| Structural review | #5: Catches formatting, broken links, missing sections |
| Specification gap identification | #16: 13 findings, zero false positives |
| Observability gaps | #33: 11 findings in 36s |
| Quick structural scanning | #12, #14: 17 findings in 35s; recovers with structure |
| First-pass screening | #51, #63: Catches ~60% of findings at 1/5 the cost |
| Inter-document contradictions | #67: 5 findings (3 Critical) vs GPT-5's 4 (0 Critical), 14s vs 136s |
| Implementation perspective | #51: Finds what would confuse a developer writing the actual code |
| Regulatory category identification | #61: Finds structural gaps in regulatory coverage quickly |
| Cross-component basics | #14: 8 findings with good structure when prompts are explicit |
- Best at: quick first-pass screening, structural review, specification gap identification
- Zero false positives on most tasks — every finding is actionable
- Weakness: struggles with concurrency reasoning, contradiction detection, tasks requiring formal logical reasoning
- Produces false positives on verification-heavy tasks (contradiction, race conditions)
- Best at: fast first-pass, structural scanning, inter-document contradictions, implementation-perspective ambiguities
- Unique ability: reasons from implementer's perspective ("if I were coding this, what would I be unsure about?")
- Strength: 5-10x faster than GPT-5, useful for time-constrained reviews
- Weakness: ~33% precision on self-contradiction detection (misreads), cannot match GPT-5/Opus depth on verification tasks, zero unique insights on many analytical tasks
- Finding count: typically 7-17 depending on task type
- Typical cost: $0.01-0.03 per experiment
**Critical limitation (Finding #39):** Narrow framing does NOT close the gap with GPT-5/Opus. With a narrow prompt, Sonnet reported 3 contradictions, but only 1 was genuine (2 were misreadings). The gap is reasoning depth, not framing: Sonnet can't reliably verify whether two statements actually contradict each other.
### GPT-4.1 and GPT-4.1 Mini
**Role:** Non-reasoning baselines for structured tasks.
| Capability | Evidence |
|------------|----------|
| Structured output | #4: Best at consistent JSON/table format |
| Quick gap-finding | #9: 14 findings at lowest cost tier |
| Delegation target | #4, #14: Good enough for simple enumeration tasks |
| Cross-component basics | #14: Finds obvious interactions |
- GPT-4.1: Solid non-reasoning baseline, 14 findings on gap-finding tasks
- GPT-4.1 Mini: Cheapest screening, 12 findings on gap-finding, useful for triage
- Neither suitable for verification, contradiction, or adversarial analysis
### Claude Sonnet 4.5
**Strength:** Exhaustive coverage. More findings than 4.6, at the cost of some noise.
**Role:** Predecessor comparison, broad-but-noisy coverage.
| Capability | Evidence |
|------------|----------|
| Specification completeness | #16: 25 findings vs 4.6's 13 |
| Temporal reasoning | #18: 12 findings with no errors (vs 4.6's errors in #13) |
| Operational gaps | Catches gaps that 4.6 filters out |
- Best at: specification completeness, broad coverage
- Tradeoff: severity inflation, more verbose output
- Use 4.5 for coverage, 4.6 for precision
### GPT-4.1
**Strength:** Structured, thorough, good middle ground. Generic but competent.
| Capability | Evidence |
|------------|----------|
| Stays within document framing | #9, #10: Finds assumptions the document almost states |
| Meta-observations | #10: "All failure modes treated as isolated" |
| Cost-effective first pass | Good enough when GPT-5's cost isn't justified |
- Best unique contribution: meta-observations about design structure
- Good enough for first-pass review where GPT-5's cost isn't justified
### GPT-4.1 Mini
**Strength:** Cheapest. Formulaic but catches the obvious things.
| Capability | Evidence |
|------------|----------|
| Scales with document size | #9, #19: 6 findings on 459 lines → 21 on 1,110 lines |
| Clean templates | Every finding maps to a document section |
| Bias detection | #8: Catches bias when signal isn't buried |
- Fine for quick sanity checks, not for architectural insight
- Best for: bulk screening, sanity checks, obvious-issue detection
- Produces more findings than Sonnet 4.6 but with more noise
- Less precise severity calibration
- Use when breadth > precision (initial exploration)
---
## Part 2: Task Type → Model Mapping
## Part 2: Task-Type Performance Matrix
Not all analytical tasks are the same. Models that excel at one struggle at another.
### Core Task Types (validated across 5+ experiments)
| Task Type | Best Model | Runner-up | Avoid | Evidence |
|-----------|-----------|-----------|-------|----------|
| **Gap-finding** | GPT-5 | GPT-4.1 | Mini (surface-level) | #9, #31, #64 |
| **Hidden assumptions** | GPT-5 | Opus | Mini (formulaic) | #10, #11, #12, #53 |
| **Race conditions** | GPT-5 + Opus | — | Sonnet (errors) | #13, #50 |
| **Contradiction detection** | **Opus** | GPT-5 | Sonnet (false positives) | #25, #43 |
| **Cross-document consistency** | **Opus** | GPT-5 | — | #28, #37, #44 |
| **Adversarial attack paths** | GPT-5 (enum) + Opus (creativity) | — | — | #29, #35, #49 |
| **Design coherence** | Document-dependent | — | — | #15, #27 |
| **Specification completeness** | Sonnet 4.5 (breadth) / GPT-5 (self-contradictions) | — | — | #16, #31 |
| **Regulatory compliance** | GPT-5 | Sonnet (first-pass) | — | #23, #38, #54 |
| **Operational blind spots** | GPT-5 | Opus | Sonnet | #46 |
| **Emergent behavior** | GPT-5 (feedback loops) | Opus (best single insight) | — | #47 |
| **Temporal boundaries** | GPT-5 | Opus | — | #18, #41, #65 |
| **State machine completeness** | GPT-5 | Opus | — | #58 |
| **Silent data corruption** | GPT-5 | — | — | #40, #62 |
| **Defense-in-depth gaps** | GPT-5 + Opus | — | — | #48 |
| **Security boundaries** | GPT-5 | Opus | — | #10-May |
| Task Type | Best Model(s) | Evidence | Notes |
|-----------|---------------|----------|-------|
| Hidden assumption identification | GPT-5 + Opus | #10-12, #53 | GPT-5 for breadth, Opus for design tensions |
| Gap-finding (what's missing) | GPT-5 | #9, #31, #46 | Dominates on exhaustive enumeration |
| Self-contradiction detection | GPT-5 + Opus | #25, #43 | Different types: spec conflicts vs logical impossibilities |
| Cross-document consistency | Opus (primary) | #28, #37, #44 | 2.4x faster, more findings than GPT-5 |
| Inter-document contradictions | Sonnet (primary) | #67 | Outperforms GPT-5 on parallel comparison |
| Race condition identification | GPT-5 + Opus | #13, #50 | Sonnet unreliable for concurrency |
| Adversarial attack paths | GPT-5 → Opus ensemble | #29, #35 | 30% more findings with critique+extend |
| Regulatory compliance | GPT-5 (primary) | #23, #38, #54, #61 | Correct citations, regulatory edge cases |
| Operational blind spots | GPT-5 + Opus | #46 | GPT-5 maps coverage gaps; Opus finds sources of false confidence |
| Temporal ordering dependencies | GPT-5 + Opus | #18, #41 | Different aspects of temporal reasoning |
| Failure propagation chains | Opus + GPT-5 | #42 | Opus architectural insight; GPT-5 enumeration |
| Silent data corruption | GPT-5 | #40, #62 | Traces multi-step paths through accounting |
**Key pattern:** Tasks requiring *identification* (what's missing? what's assumed?) are accessible to all models. Tasks requiring *verification* (is this sequence legal? does this contradict that?) favor reasoning models (GPT-5, Opus) and exclude Sonnet.
### Newer Task Types (validated in 2-4 experiments)
**Task category taxonomy:**
| Task Type | Best Model(s) | Evidence | Notes |
|-----------|---------------|----------|-------|
| Emergent behavior / rule composition | GPT-5 + Opus | #47 | GPT-5 feedback loops; Opus best single finding |
| Defense-in-depth gaps | GPT-5 + Opus | #48 | Complementary coverage |
| Concurrency / write hazards | GPT-5 | #50, #65 | Exhaustive hazard enumeration |
| Implementation ambiguity | All viable | #51 | Smallest model gap; Sonnet viable |
| Degraded-mode propagation | Opus + GPT-5 | #52 | Opus finds boundary semantic mismatches |
| State machine completeness | GPT-5 | #58 | 16 gaps through systematic transition coverage |
| Convention/specification gaps | GPT-5 | #59 | 34 findings via section-by-section enumeration |
| Counterfactual event ordering | GPT-5 | #60 | 30 findings through systematic permutation |
| Data integrity / signal flow | GPT-5 + Opus | #62 | GPT-5 audit gaps; Opus semantic violations |
| External system assumptions | GPT-5 | #63 | Reasoning about systems NOT in the document |
| Temporal correctness | Opus | #65b | Stronger on cross-component temporal coupling |
| Cross-context contracts | GPT-5 | #68 | Flow tracing across bounded contexts |
| Security boundary analysis | Opus + GPT-5 | Security | Opus finds design tension exploits |
| Event flow correctness | All (different strengths) | #57 | GPT-5 domain knowledge; Opus crash scenarios; Sonnet structure |
| Boundary contract analysis | GPT-5 + Opus | Boundary | Exhaustive + design-level |
| Operational burden analysis | GPT-5 + Opus | #45, #56 | Different definitions of "operator load" |
| Category | Sonnet value | Best models |
|----------|--------------|-------------|
| Systematic/exhaustive | None | GPT-5, Opus |
| Creative/generative | Meta-analytical synthesis | Opus, GPT-5 |
| Compliance/regulatory | Adequate but shallow | GPT-5 (deep), Sonnet (first-pass) |
| Cross-document | None | Opus strongly preferred |
### Security Code Review (new category)
| Task Type | Best Approach | Evidence | Notes |
|-----------|---------------|----------|-------|
| SSRF defense review | Dedicated security persona | #79 | Catches CGN, proxy bypass |
| HTTPS enforcement audit | Dedicated security persona | #79b | Catches inconsistent call-site guards |
| Multi-model security pipeline | Specialized + generalist | #79, #79b | Security persona blocks what generalists approve |
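
The task-type tables and the decision tree can be read as a lookup with a fallback. The following is a minimal sketch of that reading, not the pipeline's actual router: the task-type keys, the model labels, and the `pick_models` helper are all hypothetical names chosen for illustration, and only a handful of rows are encoded.

```python
from typing import Dict, List

# Hypothetical task-type keys; model lists follow rows of the matrix above.
ROUTING: Dict[str, List[str]] = {
    "cross_document_consistency": ["opus"],
    "inter_document_contradictions": ["sonnet"],
    "self_contradiction": ["gpt-5", "opus"],
    "race_conditions": ["gpt-5", "opus"],
    "gap_finding": ["gpt-5"],
    "regulatory_compliance": ["gpt-5"],
    "security_code_review": ["security-persona"],
}


def pick_models(task_type: str, high_stakes: bool = False) -> List[str]:
    """Return the models to run for a task type (illustrative only)."""
    if task_type in ROUTING:
        # Return a copy so callers cannot mutate the routing table.
        return list(ROUTING[task_type])
    if high_stakes:
        # Unlisted but high-stakes work: run all three models.
        return ["gpt-5", "opus", "sonnet"]
    # Default: Sonnet first pass; add Opus later if findings need depth.
    return ["sonnet"]
```

Keeping the table as data rather than nested conditionals makes it easy to update when a new experiment changes a ranking.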
---
## Part 3: Meta-Findings About How to Use Models
## Part 3: Meta-Findings
### 1. Signal-to-noise ratio matters more than model capability (#8)
### 3.1 — Model Complementarity Is the Dominant Pattern
When biased text is the ONLY input, even GPT-4.1 Mini catches it. When the same bias is buried inside a full PR review with diffs, issues, and project context, expensive models miss it. The issue isn't model intelligence — it's attention dilution.
No single model dominates. Across all task types, the **union** of model findings is 30-60% larger than the best single model. This isn't noise — unique findings from each model are consistently validated as genuine.
**Implication:** For important analytical checks, isolate the signal. Extract the relevant text and ask about it specifically. Don't bury important questions inside broad review mandates.
**Evidence:** Finding #42 — Failure propagation chains. Opus: 10 findings in 4K tokens. GPT-5: 10 findings in 9K tokens. Same count, but only ~60% overlap. The non-overlapping findings from each are architecturally significant.
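
As a worked check of how the Finding #42 numbers relate to the union claim, the sketch below applies inclusion-exclusion to two finding sets. It assumes "~60% overlap" means roughly 6 of each model's 10 findings are shared; the source does not give the exact count, and `union_gain` is a hypothetical helper.

```python
def union_gain(count_a: int, count_b: int, shared: int):
    """Return (union size, union size relative to the best single model)."""
    union = count_a + count_b - shared  # inclusion-exclusion for two sets
    best_single = max(count_a, count_b)
    return union, union / best_single


# Opus: 10 findings, GPT-5: 10 findings, assumed 6 in common.
union, ratio = union_gain(10, 10, 6)
print(union, ratio)
```

Under that assumption the union holds 14 findings, 40% more than either model alone, which sits inside the 30-60% range quoted above.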
### 2. Prompt framing dominates model personality for OPEN tasks (#26)
### 3.2 — Two Distinct Modes of Contradiction Detection
Opus's "finds design tensions" and GPT-5's "exhaustive enumeration" are DEFAULT tendencies, not hard limits. With structured prompts that explicitly ask for breadth, Opus produces MORE findings than GPT-5. With structured prompts asking for contradictions, GPT-5 becomes highly selective.
| Mode | Best Model | What It Catches | Cognitive Demand |
|------|-----------|-----------------|------------------|
| Specification conflicts | GPT-5 | Same scenario, different prescriptions | Statement comparison + verification |
| Logical impossibilities | Opus | Rules that can't coexist under all conditions | Multi-step deductive reasoning |
**Implication:** Model choice matters less than you think for any single task. Prompt structure is the primary lever. Model personality matters for open-ended tasks where you want emergent analytical behavior.
From Finding #43: GPT-5 and Opus don't compete on contradiction detection — they find entirely different *classes* of contradiction. Run both for complete coverage.
### 3. Narrow framing does NOT fix Sonnet's reasoning gaps (#39, #43)
### 3.3 — Narrow Framing Cannot Fix Reasoning Gaps
Sonnet can't match GPT-5/Opus via narrow prompts alone. Narrow framing changes WHAT Sonnet looks for but not HOW WELL it reasons. Sonnet found 3 contradictions but only 1 was genuine (2 were misreadings). The gap is reasoning depth, not prompt engineering.
Finding #39 (confirmed by #43): Giving Sonnet a focused "check for contradictions" prompt changes WHAT it looks for but not HOW WELL it reasons. Sonnet with narrow framing found 3 contradictions but only 1 was genuine. The gap between Sonnet and reasoning models is architectural — you cannot prompt-engineer around it.
### 4. Task type predicts model performance better than "model X is better" (#13)
### 3.4 — Adversarial Ensemble Produces Superior Coverage
Sonnet scores 85% of GPT-5's performance on assumption-finding but drops to ~50% on concurrency reasoning. Don't extrapolate across task types.
Finding #35: GPT-5 → Opus critique+extend pipeline produces 30% more findings than either model alone. Zero full disagreements during critique. Extension phase adds genuinely new High-severity findings. Cost: ~28% more tokens for 30% more coverage + prioritization.
### 5. The union of models finds the most (#19)
### 3.5 — Document Type Shapes Finding Character
GPT-5 Mini + Sonnet covers ~71% of GPT-5's findings at 31% of the cost. But the missing 29% contains the domain-specific interaction-level findings most likely to prevent production incidents. Each model also finds things the others miss — the total unique finding space is larger than any single model's output.
| Document Level | Best Analytical Lenses | Evidence |
|---------------|----------------------|----------|
| Overview/architecture | Failure propagation, blast radius, isolation verification | #42 |
| Component specifications | Race conditions, invariant violations, hidden assumptions | #13, #20 |
| Cross-cutting docs | Temporal ordering, recovery hazards, cross-context contracts | #41, #68 |
| Convention/rules docs | Exhaustive enumeration, contradiction detection | #59 |
| Regulatory specs | Compliance gap analysis, regulatory cross-referencing | #61, #64 |
### 6. Adversarial ensemble produces 30% more findings (#35)
### 3.6 — Token Budget Matters More Than Model Choice (for some tasks)
Run GPT-5 for exhaustive enumeration, then give Opus GPT-5's findings and ask it to critique and extend. Result: 56 findings vs 43 (GPT-5 alone) or 28 (Opus alone). Zero full disagreements. The critique's structured assessment is more valuable than raw extensions. Cost: ~28% more tokens for 30% more coverage + prioritization.
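
The critique+extend flow can be sketched as a two-stage function. This is an illustrative outline only: `enumerate_fn` and `critique_fn` are hypothetical callables standing in for the GPT-5 and Opus calls, findings are modeled as plain strings, and exact-match deduplication is a simplifying assumption. `coverage_gain` reproduces the 56 vs 43 arithmetic.

```python
from typing import Callable, List

Finding = str


def critique_and_extend(
    document: str,
    enumerate_fn: Callable[[str], List[Finding]],
    critique_fn: Callable[[str, List[Finding]], List[Finding]],
) -> List[Finding]:
    """Stage 1 enumerates; stage 2 sees stage 1's findings and adds new ones."""
    base = enumerate_fn(document)
    extensions = critique_fn(document, base)
    merged = list(base)
    for finding in extensions:
        if finding not in merged:  # naive exact-match dedup (assumption)
            merged.append(finding)
    return merged


def coverage_gain(ensemble_count: int, single_count: int) -> float:
    """Fractional increase of the ensemble over a single model."""
    return (ensemble_count - single_count) / single_count


print(round(coverage_gain(56, 43), 3))
```

With 56 ensemble findings against GPT-5's 43, the gain is 13/43, or about 30%, matching the figure quoted above.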
Finding #7b: A truncated GPT-5 response is worse than a complete Opus response. GPT-5 needs `max_completion_tokens` ≥ 16K. When token budgets are equal, the model gap narrows on enumeration tasks but widens on verification tasks.
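
One way to avoid accidental truncation is to enforce the budget floor where request parameters are built. The sketch below is a hypothetical helper, not part of any SDK: it only manipulates a plain dict, and it reads "16K" as 16,000 tokens, which is an assumption.

```python
MIN_GPT5_COMPLETION_TOKENS = 16_000  # "16K" from Finding #7b, read as 16,000


def with_token_floor(params: dict, floor: int = MIN_GPT5_COMPLETION_TOKENS) -> dict:
    """Return a copy of params with max_completion_tokens raised to the floor."""
    out = dict(params)  # do not mutate the caller's dict
    current = out.get("max_completion_tokens") or 0
    out["max_completion_tokens"] = max(current, floor)
    return out


request = with_token_floor({"model": "gpt-5", "max_completion_tokens": 4096})
print(request["max_completion_tokens"])
```

A budget already above the floor is left unchanged, so the helper can be applied to every request unconditionally.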
### 7. Reasoning tokens change the KIND of analysis, not just the amount (#10)
### 3.7 — Reasoning Effort Settings Have Negligible Effect
Non-reasoning models ask "what could this mechanism fail at?" Reasoning models ask "what must be true about the world for this mechanism to work?" This is a qualitative difference in analytical mode, not just thoroughness.
Finding #21: Low/medium/high reasoning effort on GPT-5 produced nearly identical output quality. Either the parameter doesn't work for open-ended analysis or the tasks were within GPT-5's "easy" threshold.
### 8. Reasoning effort parameter is a no-op for analytical work (#21)
### 3.8 — Inter-Document Contradiction: Sonnet's One Dominance
Low/medium/high reasoning effort had negligible effect on GPT-5's output for open-ended analysis. Task type is a far stronger predictor of reasoning behavior. Don't waste time tuning this parameter for document review.
Finding #67: On inter-document contradiction detection (comparing two documents for conflicting statements), Sonnet outperformed GPT-5: 5 findings (3 Critical) vs 4 findings (0 Critical) in 14s vs 136s. This is the **only** task type where Sonnet clearly dominates. The hypothesis: this task requires parallel comparison (pattern matching across two texts) which benefits from Sonnet's approach more than GPT-5's serial deep reasoning.
### 9. Output length kills, input length doesn't (#6)
### 3.9 — Reviewer Persona > Model Capability for Security
Single agents die trying to generate 1000+ line documents. Rich input context is fine — it's the output length that causes OOM/timeout. Break output into sections, keep input context rich, draft in parallel, assemble.
Findings #79, #79b: A dedicated security-reviewer persona caught critical SSRF gaps that both the Sonnet and GPT generalist reviewers missed (both approved the change). The security reviewer uses structured criteria (trust boundaries, library semantics, OS interaction) that generalist review prompts don't invoke. Persona specialization provides unique coverage that model upgrades alone do not.
### 10. Document complexity shifts model rankings (#27)
### 3.10 — Dual-Bot Disagreement as Quality Ratchet
Opus beat GPT-5 on coherence analysis for one document but lost on another (more complex) document. Rankings are not universal: they interact with document complexity, domain specificity, and prompt structure.
### 11. Token budget matters more than model size (#7b)
When output is truncated by token limits, even GPT-5 produces shallow findings. Ensure sufficient max_completion_tokens (≥16K for GPT-5). A cheap model with enough tokens beats an expensive model that runs out of space.
### 12. Opus excels at finding where specs believe false things (#31, #32)
Across experiments, Opus consistently finds where the spec's OWN ASSUMPTIONS are false. GPT-5 reasons about what the spec FAILS TO SAY. Opus reasons about what the spec BELIEVES to be true and checks whether those beliefs hold. Different but complementary.
### 13. GPT-5's reasoning tokens are spent on VERIFICATION for regulatory tasks (#54)
For domain-specific regulatory analysis (IRS wash sale rules), GPT-5 consistently cited correct publication sections, code numbers, and regulatory references. The 9,600 reasoning tokens appear spent on verification, not generation.
Finding #78: In the gargoyle dev pipeline, dual-bot review (Sonnet + GPT) achieved a 32% REQUEST_CHANGES rate. After dropping to single-bot review (GPT only), the REQUEST_CHANGES rate fell to 2%. The disagreement between two models, where one blocks while the other approves, creates a natural quality gate. Removing one model from the pipeline disproportionately degrades review rigor.
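
The quality-gate idea can be illustrated with a small sketch. The source does not say how the pipeline combines verdicts, so the "any reviewer blocks" rule below is an assumption, and the reviewer names and sample verdicts are made up for illustration rather than taken from the experiment.

```python
from typing import Dict, List

REQUEST_CHANGES = "REQUEST_CHANGES"
APPROVED = "APPROVED"


def gate(verdicts: Dict[str, str]) -> str:
    """Block the PR if any reviewer requests changes (assumed rule)."""
    return REQUEST_CHANGES if REQUEST_CHANGES in verdicts.values() else APPROVED


def request_changes_rate(prs: List[Dict[str, str]]) -> float:
    """Fraction of PRs the gate blocks; 0.0 for an empty list."""
    if not prs:
        return 0.0
    blocked = sum(1 for verdicts in prs if gate(verdicts) == REQUEST_CHANGES)
    return blocked / len(prs)


# Illustrative data only: one reviewer blocks on the second PR.
sample = [
    {"sonnet": APPROVED, "gpt": APPROVED},
    {"sonnet": REQUEST_CHANGES, "gpt": APPROVED},
]
print(request_changes_rate(sample))
```

Under this rule, dropping a reviewer can only lower the block rate or leave it unchanged, which is consistent with the direction of the effect reported above.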
---
## Part 4: Cost-Effectiveness
| Model | Typical tokens/finding | Relative cost | Best use case |
|-------|----------------------|---------------|---------------|
| Opus | 179-336 | 1x (baseline) | Cross-doc consistency, contradictions, design tensions |
| Sonnet 4.6 | 111-194 | 0.2-0.3x | Quick screening, structural review, assumption-finding |
| Sonnet 4.5 | 150-250 | 0.25x | Broad coverage when noise is acceptable |
| GPT-5 | 511-2,967 | 5-9x | High-stakes analysis where missing something has real cost |
| GPT-4.1 | ~500 | 0.5x | Middle-ground first pass |
| GPT-4.1 Mini | ~300 | 0.1x | Bulk screening, sanity checks |
### Per-Experiment Cost by Model
**For financial/safety-critical systems:** Run all three (Opus + GPT-5 + Sonnet). The ~$1-2 total cost per document is trivially justified vs the value of comprehensive coverage.
| Model | Typical Time | Typical Output | Typical Cost | Findings/$ |
|-------|-------------|----------------|-------------|-----------|
| GPT-5 | 80-140s | 7-11K tokens | $0.30-0.50 | 30-60 |
| Claude Opus 4.6 | 50-120s | 2-5K tokens | $0.08-0.15 | 80-130 |
| Claude Sonnet 4.6 | 15-40s | 1-2K tokens | $0.01-0.03 | 300-700 |
| GPT-4.1 | 20-40s | 2-4K tokens | $0.03-0.06 | 200-400 |
| GPT-4.1 Mini | 10-20s | 1-2K tokens | $0.005-0.01 | 1000+ |
**For routine review:** Opus alone or Sonnet + Opus pair. Skip GPT-5 unless the document is complex and the stakes justify it.
### Three-Model Ensemble Cost
**For regulatory compliance:** GPT-5 for depth + correct citations, Sonnet for first-pass breadth.
Running GPT-5 + Opus + Sonnet on a single document:
- **Total cost:** ~$0.40-0.70
- **Total time:** ~3-5 minutes (sequential)
- **Total unique findings:** Typically 1.3-1.6x the best single model
- **Value proposition:** Finding one Critical issue before production justifies the entire research budget
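
The ensemble total follows from summing the per-model ranges in the per-experiment cost table. A short sketch of that arithmetic, with `ensemble_cost` as a hypothetical helper name:

```python
# (low, high) typical cost per experiment in USD, from the table above.
COST_RANGE_USD = {
    "GPT-5": (0.30, 0.50),
    "Claude Opus 4.6": (0.08, 0.15),
    "Claude Sonnet 4.6": (0.01, 0.03),
}


def ensemble_cost(models):
    """Sum the low and high ends of each model's typical cost range."""
    low = sum(COST_RANGE_USD[m][0] for m in models)
    high = sum(COST_RANGE_USD[m][1] for m in models)
    return round(low, 2), round(high, 2)


print(ensemble_cost(["GPT-5", "Claude Opus 4.6", "Claude Sonnet 4.6"]))
```

The sum comes to about $0.39-0.68, which rounds to the ~$0.40-0.70 range quoted for the three-model ensemble.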
### Efficiency Ratios
| Metric | GPT-5 | Opus | Sonnet |
|--------|-------|------|--------|
| Tokens per finding | 500-1000 | 200-400 | 100-200 |
| Time per finding | 6-10s | 5-10s | 2-4s |
| Unique finding rate | 25-40% | 20-35% | 5-15% |
| False positive rate | <5% | <5% | 15-33% (verification tasks) |
### When to Use Each Tier
| Budget | Approach | Expected Coverage |
|--------|----------|-------------------|
| Minimal ($0.01-0.03) | Sonnet only | ~60% of findings, fast |
| Standard ($0.15-0.20) | Opus + Sonnet | ~80% of findings, good depth |
| Comprehensive ($0.50-0.70) | GPT-5 + Opus + Sonnet | ~95% of findings, full coverage |
| Critical ($1-2) | Ensemble (GPT-5 → Opus critique+extend) + Sonnet | Maximum coverage with prioritization |
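
The tier table can be turned into a simple budget lookup. The sketch below is one possible reading: it treats the upper end of each tier's cost range as the budget needed to afford it, which is an assumption, and `pick_tier` is a hypothetical helper.

```python
# (budget ceiling in USD, approach), ordered from cheapest to most expensive.
TIERS = [
    (0.03, "Sonnet only"),
    (0.20, "Opus + Sonnet"),
    (0.70, "GPT-5 + Opus + Sonnet"),
    (2.00, "Ensemble (GPT-5 -> Opus critique+extend) + Sonnet"),
]


def pick_tier(budget_usd: float):
    """Return the most comprehensive approach the budget covers, or None."""
    chosen = None
    for ceiling, approach in TIERS:
        if budget_usd >= ceiling:
            chosen = approach
    return chosen


print(pick_tier(0.25))
```

A budget of $0.25 selects the Opus + Sonnet pair; anything below $0.03 returns `None` because no tier fits.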
---
## Part 5: Open Questions
## Part 5: Pipeline Findings (Dev Loop Analysis)
### Still Unanswered
### Multi-Model Review Pipeline Effectiveness (Finding #78)
1. **Are these findings corpus-specific?** All 74 experiments used gargoyle architecture docs. Different domains may shift rankings.
| Metric | Dual-Bot (Sonnet+GPT) | Single-Bot (GPT only) |
|--------|----------------------|----------------------|
| REQUEST_CHANGES rate | 32% | 2% |
| Avg reviews per PR | 7-11 | 22-30 |
| Post-merge escape rate | Declining | Unknown (too recent) |
| Most caught category | Missing tests (22%) | — |
2. **How much do results vary across runs?** All findings are single-run. Stochastic variation is unquantified.
### Security Review Pipeline (Findings #79, #79b)
3. **What happens on 2000+ line documents?** Largest tested is 1,110 lines. Unknown if model rankings shift at extreme scale.
| Standard Reviewer | Security Reviewer | Outcome |
|-------------------|-------------------|---------|
| APPROVED (both Sonnet + GPT) | REQUEST_CHANGES | Correct (security issue was real) |
| Generalist prompt | Domain-specific criteria | Security persona provides unique value |
| Misses library semantics | Catches Python `is_private` gaps | Domain knowledge matters |
| Misses OS interaction | Catches proxy inheritance | Cross-layer reasoning matters |
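
To make the library-semantics row concrete, here is one documented behaviour of Python's `ipaddress` module that an SSRF guard can trip over. The source does not say exactly which gap Finding #79 involved, so this is an illustration rather than a reconstruction: carrier-grade NAT space (100.64.0.0/10) is reported as neither private nor global, so a guard that rejects only `is_private` addresses lets it through. The `is_safe_destination` helper is hypothetical.

```python
import ipaddress


def naive_guard(ip_text: str) -> bool:
    """Deny-list style: allow anything that is not flagged private."""
    return not ipaddress.ip_address(ip_text).is_private


def is_safe_destination(ip_text: str) -> bool:
    """Allow-list style: allow only globally reachable addresses."""
    return ipaddress.ip_address(ip_text).is_global


cgn = "100.64.0.1"  # shared address space used for carrier-grade NAT
print(naive_guard(cgn), is_safe_destination(cgn))
```

The naive guard accepts the CGN address while the allow-list guard rejects it. Neither check addresses DNS resolution or proxy settings inherited from the environment, which need separate handling.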
4. **Cross-document consistency as maintenance tool:** Does running cross-doc analysis across MORE document pairs yield additional real inconsistencies? Could become a systematic documentation maintenance tool.
### Operational Lessons (Finding #80)
5. **Why Opus dominates cross-doc consistency:** Is it because contradictions are easy to verify once spotted (reducing GPT-5's verification advantage)? Or because boundary reasoning (Opus's strength) is the primary skill needed?
- Config-A/B parity routing must be actively monitored
- All-reviewer-fire-always costs 3.5x expected budget
- Phase 1 baseline invalidated by dispatcher malfunction
- Operational monitoring of AI pipelines is non-optional
### Answered Questions (from open-questions.md)
---
- ~~Opus + narrow framing for contradiction detection~~ → **WRONG QUESTION** (#43). Opus doesn't try to match GPT-5 — it finds a different CLASS of contradiction. GPT-5 finds specification conflicts (same scenario, conflicting prescriptions). Opus finds logical impossibilities (rules whose interaction produces impossible conditions). Neither dominates.
## Part 6: Validated Analytical Lenses (Full Catalog)
- ~~Sonnet + narrow framing = GPT-5 level?~~ → **NO** (#39). The gap is reasoning depth, not framing.
The research has validated **28 distinct analytical lenses** for architecture document review:
- ~~Adversarial ensemble (GPT-5 → Opus)?~~ → **YES** (#35). 30% more findings at 28% more cost.
| # | Lens | First Tested | Key Findings |
|---|------|-------------|--------------|
| 1 | Hidden assumption identification | #10 | GPT-5 + Opus complementary |
| 2 | Gap-finding | #9 | GPT-5 dominates |
| 3 | Bias detection | #8 | Signal isolation matters most |
| 4 | Self-contradiction detection | #25, #43 | Two distinct modes |
| 5 | Cross-document consistency | #28 | Opus dominates |
| 6 | Inter-document contradictions | #67 | Sonnet dominates |
| 7 | Race condition identification | #13 | GPT-5 + Opus |
| 8 | Temporal boundary analysis | #18 | GPT-5 + Opus |
| 9 | Cross-component interaction | #14 | All models viable |
| 10 | Adversarial manipulation | #29 | Ensemble best |
| 11 | Design coherence | #15, #27 | Document-dependent |
| 12 | Spec completeness | #16 | Sonnet 4.5 adequate |
| 13 | Missing-feature identification | #26 | Promptable across all |
| 14 | Operational blind spots | #46 | GPT-5 + Opus |
| 15 | Emergent behavior / rule composition | #47 | GPT-5 feedback loops; Opus insight |
| 16 | Defense-in-depth gaps | #48 | GPT-5 + Opus |
| 17 | Adversarial evasion/tampering | #49 | GPT-5 + Opus |
| 18 | Concurrency / race conditions | #50 | GPT-5 exhaustive |
| 19 | Implementation ambiguity | #51 | All viable (smallest gap) |
| 20 | Degraded-mode propagation | #52 | Opus + GPT-5 |
| 21 | Failure propagation chains | #42 | Opus insight; GPT-5 coverage |
| 22 | State machine completeness | #58 | GPT-5 dominates |
| 23 | Convention/specification gaps | #59 | GPT-5 dominates |
| 24 | Counterfactual event ordering | #60 | GPT-5 systematic permutation |
| 25 | Regulatory completeness | #61, #64 | GPT-5 regulatory; Opus operational |
| 26 | Data integrity / signal flow | #62 | GPT-5 audit; Opus semantic |
| 27 | External system assumptions | #63 | GPT-5 exhaustive |
| 28 | Security boundary analysis | Security | Opus tension exploits |
- ~~Opus's "missing feature identification" mode — is it promptable?~~ → **YES** (#26). All models find regulatory gaps when explicitly prompted.
---
- ~~Is Opus > GPT-5 for coherence tasks universal?~~ → **NO** (#27). Document complexity affects ranking.
## Part 7: Open Questions
### High Priority
1. **Does the dual-bot quality ratchet scale?** Finding #78 showed dramatic quality degradation when dropping from 2 to 1 reviewer bot. Would 3 bots (adding Opus) further improve? Or is the marginal value of the 3rd reviewer diminishing?
2. **Security persona transferability:** Findings #79/#79b validate specialized security review on SSRF/HTTPS code. Does the same persona pattern work for auth, crypto, and other security domains? Or does each domain need a separately-tuned persona?
3. **Config-A/B measurement recovery:** With the dispatcher now fixed, can Phase 1 data be salvaged (all reviewers ran, so both configs' data exists), or must the experiment restart?
4. **Reasoning effort on harder documents:** Finding #21 showed negligible effect on a moderately complex document. Test with a genuinely hard document (1000+ lines, multiple interacting concerns) to see if reasoning effort matters when the task exceeds the "easy" threshold.
5. **Model personality vs prompt:** Finding #26 showed missing-feature identification is promptable. How many other "model personality" observations are prompt framing effects? Systematic test needed.
### Medium Priority
6. **Cross-corpus generalization:** All findings are on a single corpus (gargoyle). Do the model rankings hold for other domains (infrastructure, web apps, data pipelines)?
7. **Opus for inter-document contradictions:** Finding #67 showed Sonnet outperforming GPT-5. Would Opus (with its boundary reasoning strength) outperform both?
8. **Automated lens selection:** Given 28 validated lenses, can a model accurately select which lenses apply to a given document? Or does human judgment remain necessary?
9. **Longitudinal review effectiveness:** As the codebase improves from post-merge review findings, does the multi-model review pipeline's REQUEST_CHANGES rate stabilize or continue declining?
### Answered (from previous period)
- ~~Opus + narrow framing for contradiction detection~~ → ANSWERED (#43): Different class of findings, not comparable
- ~~Sonnet + narrow framing = GPT-5 level?~~ → ANSWERED (#39): No. Reasoning depth, not framing, is the bottleneck
- ~~Adversarial ensemble value?~~ → ANSWERED (#35): Yes, 30% more coverage at 28% more cost
- ~~Is Opus > GPT-5 universal for coherence?~~ → ANSWERED (#27): No, document-dependent
---
## Methodology
See `methodology.md` for full experimental setup. Key constraints:
- Same input text to all models (no information advantage)
- Structured prompts with explicit categories and output format
- No tools, no project context beyond the document(s) under analysis
- Independent runs (no cross-pollination between models)
- Single researcher evaluating findings (subjectivity acknowledged)
- Single corpus (gargoyle) — domain bias possible
Models: GPT-5, Claude Opus 4.6, Claude Sonnet 4.6, GPT-4.1, GPT-4.1 Mini, Claude Sonnet 4.5
---
## Conclusion
After 80 experiments, the evidence strongly supports a multi-model approach to analytical work:
1. **No single model dominates** — task type determines the best model
2. **The union always exceeds the parts** — run multiple models for critical work
3. **Persona specialization adds unique value** — beyond model capability
4. **Operational monitoring matters** — AI pipelines need the same rigor as production systems
5. **The research pays for itself** — total budget (~$30-50 over 20 days) vs value of findings applied to a real production system
The next frontier is operationalizing these findings: automated lens selection, pipeline health monitoring, and measuring downstream impact of review quality on production defect rates.