docs: regenerate weekly report (2026-05-18)

2026-05-18 16:10:16 +00:00
parent afbc013e2e
commit 5426026908
2 changed files with 510 additions and 285 deletions
@@ -1,7 +1,7 @@
 # Lessons Learned: Operational Guide for AI Model Selection

-> **Generated:** 2026-05-11 09:00 PDT  
-> **Based on:** 74 experiments (2026-04-26 to 2026-05-11)
+> **Generated:** 2026-05-18 09:02 PDT  
+> **Based on:** 80 experiments (2026-04-26 to 2026-05-15)

 _This is the actionable distillation. For evidence and methodology, see REPORT.md._

@@ -11,25 +11,32 @@ _This is the actionable distillation. For evidence and methodology, see REPORT.m

 ```
 ┌─────────────────────────────────────────────────────────────────┐
-│                    TASK TYPE DECISION TREE                      │
+│                    TASK TYPE DECISION TREE                       │
 ├─────────────────────────────────────────────────────────────────┤
 │                                                                 │
 │  Is this a VERIFICATION task?                                   │
-│  (contradiction, consistency, race condition)                   │
+│  (self-contradiction, consistency check, race condition)        │
 │     │                                                           │
-│     ├─ YES → Use GPT-5 + Opus (skip Sonnet)                    │
-│     │        Sonnet has ~33% precision on verification          │
+│     ├─ YES → Is it CROSS-DOCUMENT comparison?                   │
+│     │         │                                                 │
+│     │         ├─ YES → Use Opus (or Sonnet for inter-doc        │
+│     │         │        contradictions specifically)             │
+│     │         │                                                 │
+│     │         └─ NO → Use GPT-5 + Opus (skip Sonnet)           │
+│     │                  Sonnet has ~33% precision on             │
+│     │                  self-contradiction verification          │
 │     │                                                           │
-│     └─ NO → Is this CROSS-DOCUMENT?                            │
+│     └─ NO → Is this SECURITY code review?                      │
 │              │                                                  │
-│              ├─ YES → Use Opus (2.4x faster, more findings)    │
+│              ├─ YES → Use dedicated security persona            │
+│              │        (generalist reviewers miss it)            │
 │              │                                                  │
 │              └─ NO → Is this HIGH-STAKES?                      │
 │                       (financial, safety, regulatory)           │
 │                       │                                         │
 │                       ├─ YES → Run all three                   │
 │                       │        (GPT-5 + Opus + Sonnet)         │
-│                       │        Total: ~$1-2, worth it          │
+│                       │        Total: ~$0.50-0.70              │
 │                       │                                         │
 │                       └─ NO → Sonnet first-pass               │
 │                               Add Opus if findings need depth   │
@@ -46,17 +53,26 @@ _This is the actionable distillation. For evidence and methodology, see REPORT.m
 | If the task is... | Use this | Not this |
 |-------------------|----------|----------|
 | Finding what's missing | GPT-5 | Mini |
-| Finding contradictions | Opus | Sonnet |
+| Finding self-contradictions | GPT-5 + Opus (both) | Sonnet |
 | Cross-document consistency | Opus | GPT-5 |
+| Inter-document contradictions | Sonnet | GPT-5 |
 | Quick structural scan | Sonnet 4.6 | GPT-5 |
 | Broad coverage (noise OK) | Sonnet 4.5 | Sonnet 4.6 |
 | Adversarial attack paths | GPT-5 then Opus | Either alone |
 | Regulatory compliance | GPT-5 | Opus |
-| Operational blind spots | GPT-5 | Sonnet |
+| Operational blind spots | GPT-5 + Opus | Sonnet |
+| Security code review | Dedicated security persona | Generalist prompt |
+| State machine completeness | GPT-5 | Sonnet |
+| External system assumptions | GPT-5 | Sonnet |
+| Counterfactual ordering | GPT-5 | Sonnet |
+| Degraded-mode analysis | Opus + GPT-5 | Sonnet |
+| Implementation ambiguity | Any (all viable) | — |

 ### Rule 2: Don't Trust Sonnet for Verification

-Sonnet finds ~3 contradictions but only ~1 is genuine. The others are misreadings. Use Sonnet for *identification* tasks (what's here?), not *verification* tasks (is this true?).
+Sonnet finds ~3 contradictions but only ~1 is genuine (~33% precision); the others are misreadings. Use Sonnet for *identification* tasks (what's here?) and *inter-document comparison* (do these conflict?), not *self-contradiction verification* (is this internally consistent?).
+
+**Exception:** Inter-document contradiction (#67) — Sonnet outperforms GPT-5 when comparing two documents for conflicting claims. Parallel comparison ≠ serial verification.
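
The ~33% figure is just the ratio of the two counts in Rule 2: about 1 genuine contradiction out of about 3 reported. A minimal sketch of that arithmetic (the function name and the rounded counts are illustrative, not taken from the raw experiment data):

```python
def precision(genuine: int, reported: int) -> float:
    """Fraction of reported findings that turned out to be genuine."""
    if reported == 0:
        raise ValueError("no findings reported")
    return genuine / reported


# Rounded counts from Rule 2: ~3 contradictions reported, ~1 genuine.
sonnet_precision = precision(genuine=1, reported=3)
print(f"{sonnet_precision:.0%}")
```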

 ### Rule 3: Isolate the Signal

@@ -64,7 +80,7 @@ When checking for something specific (bias, contradictions, missing assumptions)

 ### Rule 4: Run the Ensemble for High Stakes

-For anything financial, safety-critical, or regulatory: run GPT-5 + Opus + Sonnet. Each finds things the others miss. The union is larger than any single model's output. Cost is trivial vs. the value.
+For anything financial, safety-critical, or regulatory: run GPT-5 + Opus + Sonnet. Each finds things the others miss. The union is 30-60% larger than any single model's findings. Cost is trivial vs. the value.
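
One way to measure the union effect on your own runs is to compare the deduplicated union against the largest single-model finding set. This is a sketch only: the function name, the model keys, and the finding IDs are hypothetical, and it assumes findings have already been deduplicated to stable IDs.

```python
def union_gain(findings_by_model: dict[str, set[str]]) -> float:
    """Relative gain of the union over the largest single-model finding set."""
    union = set().union(*findings_by_model.values())
    best = max(len(found) for found in findings_by_model.values())
    return (len(union) - best) / best


# Hypothetical finding IDs, invented for illustration.
example = {
    "gpt-5": {"F1", "F2", "F3", "F4", "F5"},
    "opus": {"F3", "F4", "F6", "F7"},
    "sonnet": {"F1", "F6"},
}
gain = union_gain(example)  # 7 unique findings vs. 5 for the best single model
```

In this made-up example the union is 40% larger than the best single model.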

 ### Rule 5: Give GPT-5 Enough Tokens

@@ -78,168 +94,240 @@ Single agents die generating 1000+ lines. Rich input is fine; it's output length

 You cannot make Sonnet match GPT-5/Opus by writing a better prompt. Narrow framing changes WHAT it looks for, not HOW WELL it reasons. The gap is architectural, not prompt engineering.

---
+### Rule 8: Specialized Personas Outperform Model Upgrades (for security)

-## Operational Playbooks
+A dedicated security-reviewer persona on the same model catches issues that a generalist reviewer misses while approving the code. For security code review: configure explicit security criteria (trust boundaries, library edge cases, OS interaction) rather than relying on "please also check security."

-### Playbook A: Architecture Document Review
+### Rule 9: Dual-Bot Disagreement Is a Feature

-1. **First pass (Sonnet 4.6):** ~30s, catches structural issues, broken refs, obvious gaps
-2. **Deep analysis (GPT-5):** ~90s, finds domain-specific gaps, hidden assumptions, edge cases
-3. **Design tensions (Opus):** ~60s, finds where the design contradicts itself
-4. **Merge and dedupe:** Union of all three, remove duplicates, sort by severity
+Two reviewers that sometimes disagree create a quality ratchet. The PR can't merge until both are satisfied. Removing one reviewer drops the gate from ~32% REQUEST_CHANGES to ~2%. Never reduce to single-bot without understanding the quality tradeoff.

-### Playbook B: Cross-Document Consistency Check
+### Rule 10: Monitor Your AI Pipeline

-1. **Use Opus only.** It's 2.4x faster than GPT-5 and finds more issues.
-2. **Provide both documents in a single prompt** (~25KB max)
-3. **Explicitly exclude omissions** in the prompt — you want contradictions, not "Doc A covers X but Doc B doesn't"
-
-### Playbook C: Adversarial Security Review
-
-1. **First pass (GPT-5):** Exhaustive enumeration of attack surface
-2. **Extension pass (Opus):** Give Opus GPT-5's findings, ask it to critique and extend
-3. **Result:** 30% more findings at 28% more cost, with prioritization
-
-### Playbook D: Regulatory Compliance Review
-
-1. **First pass (Sonnet):** ~25s, identifies areas of concern
-2. **Deep dive (GPT-5):** Regulatory specificity, correct citations, edge cases
-3. **GPT-5's reasoning tokens are spent on verification** — trust its citations
-
-### Playbook E: Contradiction Detection
-
-1. **Use GPT-5 + Opus in parallel** (not Sonnet)
-2. **GPT-5 finds:** Specification conflicts (same scenario, different prescriptions)
-3. **Opus finds:** Logical impossibilities (rules that can't coexist)
-4. **Neither dominates** — they find different classes of contradiction
+AI pipelines need operational monitoring like any production system. A dispatcher malfunction caused a 3.5x cost overage and invalidated experiment data (Finding #80). Monitor: dispatch correctness, per-PR review count, API costs, and reviewer participation patterns.

 ---

 ## Anti-Patterns

-### ❌ Anti-Pattern 1: Using Sonnet for Verification Tasks
+### ❌ "Just use the best model for everything"

-**What happens:** Sonnet reports contradictions that aren't real. You waste time investigating false positives or worse, trust a false negative.
+No single model dominates. GPT-5 is worst for inter-document contradictions. Opus is worst for exhaustive enumeration. Sonnet is worst for self-contradiction verification. Match the task.

-**Instead:** Use GPT-5 or Opus for any task requiring "is this true?" reasoning.
+### ❌ "Sonnet is good enough for a quick check"

-### ❌ Anti-Pattern 2: Giving GPT-5 a Broad Mandate
+Only on structural/identification tasks. On verification tasks, Sonnet's 33% precision means 2/3 of its findings are wrong. A "quick check" that produces false confidence is worse than no check.

-**What happens:** GPT-5 spawns sub-agents, times out, or dumps raw tool output instead of synthesizing.
+### ❌ "We'll fix the prompt to get better results"

-**Instead:** Give GPT-5 explicit single-actor instructions + output format. For Claude, broader mandates are fine.
+Narrow framing helps direct attention but cannot fix reasoning depth gaps. If Sonnet can't verify a contradiction with a perfect prompt, a better prompt won't help. Use a better model.

-### ❌ Anti-Pattern 3: Burying Important Checks in Large Reviews
+### ❌ "One reviewer is enough"

-**What happens:** The model misses the important thing because it's one of 47 things to check.
+Finding #78: Dropping from dual-bot to single-bot review caused a roughly 15x drop in the REQUEST_CHANGES rate (from ~32% to ~2%). The disagreement between reviewers is where quality lives. A single reviewer has blind spots; two reviewers catch each other's misses.

-**Instead:** Extract the important check and ask about it specifically. Signal-to-noise ratio matters.
+### ❌ "Security is just another review criterion"

-### ❌ Anti-Pattern 4: Extrapolating Across Task Types
+Findings #79, #79b: Generalist reviewers (Sonnet, GPT) both APPROVED code with critical SSRF bypasses. Only a dedicated security persona blocked merge. Security requires domain-specific knowledge (CGN ranges, proxy inheritance, call-site consistency) that generalist prompts don't invoke.

-**What happens:** "GPT-5 was great at X, so I'll use it for Y" — and it's mediocre.
+### ❌ "More reviews = better quality"

-**Instead:** Task type predicts performance better than "model X is better." Check the task-type table.
+Finding #80: When the dispatcher malfunctioned, PRs received 14+ reviews from 6 reviewers. This didn't improve quality — it inflated costs 3.5x and created noise. Targeted, specialized reviews > spray-and-pray.

-### ❌ Anti-Pattern 5: Skipping the Union
+### ❌ "The models will catch operational issues"

-**What happens:** You run one model, miss things another would have caught, and the bug reaches production.
+Finding #80: A broken dispatcher ran for days before anyone noticed the 3.5x cost overage. AI pipelines need traditional observability (cost monitoring, dispatch verification, participation metrics) — they don't self-diagnose operational problems.

-**Instead:** For high-stakes work, run the ensemble. The cost is trivial vs. the risk.
+---

-### ❌ Anti-Pattern 6: Tuning Reasoning Effort
+## Operational Playbooks

-**What happens:** You spend time adjusting low/medium/high reasoning effort parameters.
+### Playbook 1: Architecture Document Review

-**Instead:** Don't bother. It has negligible effect on analytical work. Task type is the lever.
+```
+1. Sonnet first-pass (15-40s, $0.02)
+   - Structural gaps, missing sections, obvious issues
+   - Decision: Is this worth deeper analysis?

-### ❌ Anti-Pattern 7: Trying to Fix Sonnet with Prompts
+2. If yes → GPT-5 focused analysis (80-140s, $0.40)
+   - Hidden assumptions, domain-specific gaps
+   - Regulatory compliance (if applicable)
+   - Temporal/ordering hazards

-**What happens:** You write increasingly narrow prompts trying to get Sonnet to match GPT-5's reasoning depth.
+3. Opus design-tension analysis (50-120s, $0.12)
+   - Where do principles conflict?
+   - Where do safety mechanisms become vulnerabilities?
+   - Cross-document consistency (if multiple docs)

-**Instead:** Accept that the gap is architectural. Use Sonnet for what it's good at (speed, breadth, structural review), use GPT-5/Opus for reasoning-heavy tasks.
+4. Union findings → prioritize by severity
+```
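
Adding up the per-step figures in Playbook 1 gives the cost and latency of the full three-model pass. The sketch below assumes the steps run sequentially and that the quoted figures are per document; the dictionary keys are illustrative names.

```python
# Per-step figures copied from Playbook 1: (min seconds, max seconds, cost in USD).
STEPS = {
    "sonnet_first_pass": (15, 40, 0.02),
    "gpt5_focused_analysis": (80, 140, 0.40),
    "opus_design_tension": (50, 120, 0.12),
}

total_cost = round(sum(cost for _, _, cost in STEPS.values()), 2)
min_seconds = sum(low for low, _, _ in STEPS.values())
max_seconds = sum(high for _, high, _ in STEPS.values())

print(f"cost ${total_cost:.2f}, sequential latency {min_seconds}-{max_seconds}s")
```

The $0.54 total sits inside the ~$0.50-0.70 range quoted in the decision tree for running all three models.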
+
+### Playbook 2: Security Code Review
+
+```
+1. Standard review (generalist model, any)
+   - Code structure, patterns, obvious issues
+   - Note: Will likely APPROVE even with security gaps
+
+2. Dedicated security persona (MANDATORY for auth/crypto/network)
+   - Explicit criteria: trust boundaries, library semantics, OS interaction
+   - Checks: HTTPS enforcement at every call site
+   - Checks: IP validation (is_global vs is_private)
+   - Checks: Transport inheritance (proxy, TLS, timeout settings)
+   - Checks: Write-path vs read-path consistency
+
+3. Security persona has VETO power
+   - If security says REQUEST_CHANGES, it blocks regardless of other approvals
+```
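
The `is_global` vs `is_private` check in step 2 matters because the two properties are not complements. A minimal sketch using Python's standard `ipaddress` module; the helper name `is_safe_destination` is hypothetical, and the sketch assumes the host has already been resolved to an IP address (DNS handling is out of scope here):

```python
import ipaddress


def is_safe_destination(host_ip: str) -> bool:
    """Allowlist check: accept only globally routable addresses."""
    return ipaddress.ip_address(host_ip).is_global


# 100.64.0.0/10 is shared address space (carrier-grade NAT).
# The ipaddress module reports it as neither private nor global.
cgn = ipaddress.ip_address("100.64.0.1")
denylist_allows_cgn = not cgn.is_private                  # a denylist lets it through
allowlist_allows_cgn = is_safe_destination("100.64.0.1")  # the allowlist rejects it
```

A filter written as `not ip.is_private` would accept the CGN address, while the `is_global` allowlist rejects it along with loopback, link-local, and RFC 1918 ranges.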
+
+### Playbook 3: Multi-Model Review Pipeline
+
+```
+Configuration (optimal based on Finding #78):
+- Bot 1: Structural/pattern reviewer (Sonnet — fast, catches structural issues)
+- Bot 2: Depth/logic reviewer (GPT — catches reasoning issues)
+- Bot 3: Security reviewer (dedicated persona — catches security issues)
+
+Rules:
+- ALL bots must approve for merge (disagreement = quality signal)
+- Monitor: REQUEST_CHANGES rate should be 20-40% (too low = degraded gate)
+- Monitor: Per-PR review count should be 4-8 (too high = dispatcher bug)
+- Monitor: API cost per PR (set alerts for >2x expected)
+
+Operational:
+- If REQUEST_CHANGES drops below 10% → investigate (model config? code quality? gate degraded?)
+- If review count exceeds 15 → check dispatcher/webhook configuration
+- Track: which bot finds which issues → continuous model-task matching
+```
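
The operational thresholds in Playbook 3 can be expressed as a small alert check. This is a sketch under stated assumptions: the `PipelineStats` fields and the `pipeline_alerts` function are hypothetical names, and how the statistics are collected is left open.

```python
from dataclasses import dataclass


@dataclass
class PipelineStats:
    request_changes_rate: float  # fraction of reviews requesting changes, 0.0-1.0
    reviews_per_pr: float        # average review count per PR
    cost_per_pr: float           # observed API cost per PR
    expected_cost_per_pr: float  # budgeted API cost per PR


def pipeline_alerts(stats: PipelineStats) -> list[str]:
    """Return alerts for the investigate/check thresholds listed in Playbook 3."""
    alerts = []
    if stats.request_changes_rate < 0.10:
        alerts.append("REQUEST_CHANGES below 10%: investigate gate")
    if stats.reviews_per_pr > 15:
        alerts.append("review count above 15: check dispatcher/webhook")
    if stats.cost_per_pr > 2 * stats.expected_cost_per_pr:
        alerts.append("API cost above 2x expected")
    return alerts


# Hypothetical readings, invented for illustration.
healthy = pipeline_alerts(PipelineStats(0.32, 6, 1.00, 1.00))
degraded = pipeline_alerts(PipelineStats(0.02, 16, 3.50, 1.00))
```

With these made-up readings, `healthy` is empty and `degraded` carries all three alerts.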
+
+### Playbook 4: Regulatory Compliance Review
+
+```
+1. Sonnet structural scan ($0.02, 15-30s)
+   - Identify which regulatory categories are addressed
+   - Flag structural gaps (missing sections, uncovered rules)
+
+2. GPT-5 regulatory cross-reference ($0.40, 120-160s)
+   - Rule-by-rule comparison with actual regulations
+   - Cite specific IRS/FINRA/SEC sections
+   - Identify mathematical/formula errors
+   - Find edge cases the implementation misses
+
+3. Opus operational compliance ($0.12, 50-80s)
+   - What the system needs to DO at runtime
+   - Cross-account, cross-entity obligations
+   - Where the implementation's model doesn't match regulatory reality
+
+4. Combine → prioritize Critical/High for immediate action
+```
+
+### Playbook 5: Cross-Document Consistency Check
+
+```
+If comparing 2-3 documents for contradictions:
+  → Opus primary (2.4x faster than GPT-5, finds more boundary tensions)
+  → Sonnet for parallel comparison of claims (faster for direct conflicts)
+  → GPT-5 only if documents are very complex (1000+ lines combined)
+
+If checking one document for self-consistency:
+  → GPT-5 for specification conflicts (statement A contradicts statement B)
+  → Opus for logical impossibilities (rule A + rule B = impossible condition)
+  → Skip Sonnet (33% precision, wastes time filtering false positives)
+```
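
The routing in Playbook 5 can be sketched as a single function. The function name and model labels are illustrative, and treating more than three documents the same as the 2-3 case is an assumption that the playbook does not cover.

```python
def consistency_models(num_documents: int, combined_lines: int = 0) -> list[str]:
    """Pick models for a consistency check, following Playbook 5."""
    if num_documents < 1:
        raise ValueError("need at least one document")
    if num_documents == 1:
        # Self-consistency: GPT-5 for specification conflicts,
        # Opus for logical impossibilities; Sonnet is skipped.
        return ["gpt-5", "opus"]
    # Cross-document: Opus primary, Sonnet for direct claim conflicts.
    models = ["opus", "sonnet"]
    if combined_lines >= 1000:
        # GPT-5 only for very complex document sets (1000+ lines combined).
        models.append("gpt-5")
    return models
```

For example, `consistency_models(2, combined_lines=400)` returns `["opus", "sonnet"]`.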

 ---

 ## Model Personality Cheat Sheet

-| Model | Personality | Default Behavior | Give It |
-|-------|-------------|------------------|---------|
-| **GPT-5** | Exhaustive enumerator | Lists everything systematically | Bounded tasks, explicit output format, single-actor instructions |
-| **Opus** | Design critic | Finds tensions and contradictions | Open-ended analysis, room to reason about boundaries |
-| **Sonnet 4.6** | Structural scanner | Fast, precise, shallow | Quick first-pass work, structural review |
-| **Sonnet 4.5** | Broad coverage | More findings, more noise | When you want breadth over precision |
-| **GPT-4.1** | Generic competent | Stays within document framing | Middle-ground cost-sensitive work |
-| **GPT-4.1 Mini** | Template filler | Formulaic but catches obvious things | Bulk screening, sanity checks |
-
-### Opus Superpower
-
-Opus finds where the spec's **own assumptions are false**. It doesn't just find missing things — it finds things the spec *believes* to be true that *aren't*.
-
-Examples:
- "Realized P&L cannot recover" — the de-escalation model assumes all metrics can improve, but this one fundamentally cannot (#31)
- "Forward detection logic is backwards" — spec describes triggers in the wrong direction (#32)
- "Stop-loss defeated by temporal composition" — safety mechanism rendered ineffective by slow strategy (#47)
-
-### GPT-5 Superpower
-
-GPT-5 reasons about the document's **relationship to the real world**. It asks "what must be true about the external world for this to work?"
-
-Examples:
- Broker rate limiting (429s) bypasses "connection lost" detection (#9)
- Corporate actions bypass staleness detection (#9)
- DB "commit unknown outcome" causing restart loops (#9)
- Cross-symbol strategies with partial staleness (#9)
- IRS rule nuances that simplifications violate (#54)
+| Model | Thinks Like | Asks | Finds |
+|-------|------------|------|-------|
+| GPT-5 | Systems engineer with 20 years of experience | "What does the real world need that this doesn't address?" | Infrastructure gaps, operational hazards, regulatory oversights |
+| Opus | Architecture critic / philosophy professor | "Where do your own principles contradict each other?" | Design tensions, logical impossibilities, safety-mechanism-as-vulnerability |
+| Sonnet 4.6 | Junior developer implementing the spec | "If I were coding this, what would confuse me?" | Implementation ambiguities, structural gaps, obvious missing pieces |
+| Sonnet 4.5 | Enthusiastic intern brainstorming | "What COULD go wrong?" | Broad list of concerns (noisy, needs filtering) |
+| GPT-4.1 | Reliable senior dev doing code review | "Does this follow the patterns correctly?" | Structural issues, format consistency |
+| GPT-4.1 Mini | Fast intern doing a checklist | "Is this obviously incomplete?" | Missing sections, obvious gaps |

 ---

-## Decision Framework
+## Decision Framework: "Should I Run Another Model?"

-### When to Add Another Model
+```
+After getting results from Model A, ask:

-| Situation | Action |
-|-----------|--------|
-| Sonnet found nothing | Add Opus (may find design tensions) |
-| GPT-5 found lots but all similar | Add Opus (may find different class) |
-| Opus found tensions but no enumeration | Add GPT-5 (exhaustive coverage) |
-| Cross-document task | Use Opus only (2.4x faster) |
-| Regulatory/compliance task | Use GPT-5 (correct citations) |
+1. Is the task VERIFICATION? (contradictions, races, consistency)
+   → YES: Run the complementary model (GPT-5 ↔ Opus)
+   → Sonnet results alone are insufficient for verification

-### When NOT to Add Another Model
+2. Are the stakes HIGH? (financial, safety, regulatory, security)
+   → YES: Run all available models. The marginal cost ($0.10-0.50)
+     is negligible vs. the cost of a missed finding.

-| Situation | Action |
-|-----------|--------|
-| Quick structural scan | Sonnet alone is fine |
-| Bulk screening | Mini alone is fine |
-| Already ran GPT-5 + Opus | Adding Sonnet rarely helps |
-| Low-stakes internal doc | One model is enough |
+3. Did Model A find < 5 issues on a complex document?
+   → YES: The task might not suit Model A. Try a different model.
+   → GPT-5 finding < 5 issues is unusual — check prompt/token budget.

-### Cost-Benefit Quick Calc
+4. Is this a SCREENING pass before deeper work?
+   → YES: Sonnet is sufficient. Save heavy models for the deep dive.
+   → NO: Add GPT-5 or Opus depending on task type.

-| Risk level | Model cost | Justified? |
-|------------|------------|------------|
-| Financial/safety | ~$1-2 for ensemble | Always yes |
-| Customer-facing | ~$0.50 for GPT-5 | Usually yes |
-| Internal process | ~$0.10 for Sonnet | Always yes |
-| One-off exploration | ~$0.02 for Mini | Always yes |
+5. Is the document about SECURITY (auth, crypto, network)?
+   → YES: Use a dedicated security persona regardless of other reviewers.
+   → Generalist models will likely approve code with security gaps.
+```

 ---

-## What We Still Don't Know
+## Key Numbers to Remember

-1. **Corpus bias:** All experiments used gargoyle docs. Rankings may differ for other domains.
-2. **Run variance:** All findings are single-run. Stochastic variation is unquantified.
-3. **Scale effects:** Largest doc tested is 1,110 lines. Unknown behavior at 2000+.
-4. **Non-architecture domains:** These findings are for architecture document analysis, not coding, not chat, not creative writing.
+| Fact | Number | Source |
+|------|--------|--------|
+| Sonnet precision on self-contradiction verification | ~33% | #39, #43 |
+| Adversarial ensemble improvement over single model | +30% findings | #35 |
+| Adversarial ensemble extra cost | +28% tokens | #35 |
+| Opus speed advantage over GPT-5 (cross-doc) | 2.4x faster | #28 |
+| Opus token efficiency vs GPT-5 | 6-9x fewer tokens/finding | Multiple |
+| GPT-5 minimum token budget | 16K completion tokens | #5, multiple |
+| Union findings vs best single model | 30-60% more | Multiple |
+| Dual-bot REQUEST_CHANGES rate | ~32% | #78 |
+| Single-bot REQUEST_CHANGES rate | ~2% | #78 |
+| Post-merge escape: missing tests | 22% of all findings | #78 |
+| Security persona catch rate vs generalist | 100% vs 0% (on security issues) | #79 |
+| Dispatcher malfunction cost overage | 3.5x | #80 |
+| Total findings analyzed | 80 | This report |
+| Total validated analytical lenses | 28 | REPORT.md Part 6 |

 ---

-## Summary: The Two Things That Matter Most
+## What Changed Since Last Report

-1. **Task type determines model choice.** Don't pick a model because "it's best." Pick the model that's best for THIS task type.
+**New rules added:**
+- Rule 8: Specialized Personas Outperform Model Upgrades (for security)
+- Rule 9: Dual-Bot Disagreement Is a Feature
+- Rule 10: Monitor Your AI Pipeline

-2. **The union beats any single model.** For high-stakes work, run the ensemble. Different models find qualitatively different things.
+**New anti-patterns:**
+- "One reviewer is enough"
+- "Security is just another review criterion"
+- "More reviews = better quality"
+- "The models will catch operational issues"

-Everything else is optimization.
+**New playbooks:**
+- Playbook 2: Security Code Review
+- Playbook 3: Multi-Model Review Pipeline (updated from operational data)
+
+**Updated decision tree:**
+- Added security review branch
+- Added inter-document contradiction (Sonnet dominance) path
+
+---
+
+## Evolution Notes
+
+This document evolves weekly. As new findings emerge:
+- Rules are added/modified when patterns are validated across 3+ experiments
+- Anti-patterns are added after observing repeated mistakes
+- Playbooks are updated when operational data improves recommendations
+- The decision tree is simplified when clearer heuristics emerge
+
+The goal: make model selection as automatic as possible, so the researcher spends time on *analysis* rather than *model choice*.