From f865a0d778f2d5641ab8bfd9e72eb38570acf587 Mon Sep 17 00:00:00 2001
From: Rodin
Date: Wed, 6 May 2026 07:24:12 -0700
Subject: [PATCH] docs: add research report and actionable lessons summary
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

REPORT.md — full analysis of 29 experiments: model strengths, task-type
mappings, meta-findings, cost-effectiveness, and open questions.

LESSONS.md — distilled operational playbook: which model for which task,
anti-patterns, decision framework, and the three core rules.
---
 LESSONS.md | 111 ++++++++++++++++++++++++++++++++++++++
 REPORT.md  | 154 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 265 insertions(+)
 create mode 100644 LESSONS.md
 create mode 100644 REPORT.md

diff --git a/LESSONS.md b/LESSONS.md
new file mode 100644
index 0000000..383fd2d
--- /dev/null
+++ b/LESSONS.md
@@ -0,0 +1,111 @@
+# Actionable Lessons: Using AI Models for Analytical Work
+
+_Distilled from 29 experiments. These are the rules._
+
+---
+
+## The Three Rules
+
+### 1. Match the model to the task, not the prestige
+
+| If you need... | Use... | Why |
+|---------------|--------|-----|
+| "What's missing from this design?" | GPT-5 | Reasons about the world outside the document |
+| "Where does this design contradict itself?" | Opus | Logical argumentation, zero false positives |
+| "Is this consistent with that other doc?" | Opus | 2.4x faster, more findings than GPT-5 |
+| "How could an attacker exploit this?" | GPT-5 (coverage) + Opus (creativity) | Different attack styles |
+| "Quick sanity check before I ship" | Sonnet | Fast, cheap, precise enough |
+| "What race conditions exist here?" | GPT-5 + Opus | Sonnet produces errors on concurrency |
+| "Is there bias in this text?" | Anything (even Mini) | All models catch isolated bias equally |
+
+### 2. Isolate the signal before asking the question
+
+Don't ask "review this PR" and expect the model to catch a subtle bias buried in 6,600 lines of diff. Extract what matters, ask about it directly. Rich context dilutes attention.
+
+**Pattern:**
+- ❌ "Review this PR for quality, correctness, and bias" (broad mandate + rich context = missed signals)
+- ✅ "Here are 12 hypotheses. Do any lead toward a predetermined conclusion?" (narrow question + minimal context = found everything)
+
+### 3. Run multiple models on anything that matters
+
+No single model finds everything. The union of GPT-5 + Opus + Sonnet finds more than any individual. The missing findings from a single-model run are disproportionately the ones that would cause production incidents.
+
+**Decision framework:**
+- **Costs nothing to get wrong:** One model is fine (Sonnet for speed, Opus for depth)
+- **Would be embarrassing to miss:** Two models (Opus + GPT-5)
+- **Would cost money or safety:** Three models (all three, plus manual review of unique findings)
+
+---
+
+## Operational Playbook
+
+### Architecture Document Review
+```
+1. Opus: contradiction detection + cross-doc consistency
+2. GPT-5: hidden assumptions + gap-finding
+3. Sonnet: quick structural scan (broken refs, missing sections)
+4. Merge findings, deduplicate, triage by severity
+```
+
+### Pre-Implementation Spec Review
+```
+1. Opus: "Where do the stated principles conflict?"
+2. GPT-5: "What must be true about the world for this to work?"
+3. Sonnet 4.5: "What would an implementer have to guess?"
+```
+
+### Security/Adversarial Review
+```
+1. GPT-5: "Enumerate all possible abuses of each mechanism"
+2. Opus: "What would a smart adversary do that the designer didn't consider?"
+3. Union the findings — GPT-5 catches mechanism-level, Opus catches system-level
+```
+
+### PR Review (dual-reviewer pattern)
+```
+- Sonnet: structural issues, broken links, formatting
+- GPT-5: semantic issues, logical gaps, verdict mismatches
+- For important PRs: add Opus for design-tension detection
+```
+
+---
+
+## Anti-Patterns (Things That Don't Work)
+
+1. **"Use the most expensive model for everything"** — GPT-5 is 5-9x more expensive than Opus per finding, and Opus beats it on contradiction/consistency tasks.
+
+2. **"Reasoning effort = better output"** — The low/medium/high parameter has negligible effect on analytical tasks. Don't bother tuning it.
+
+3. **"Sonnet can do anything Opus does, just shallower"** — Wrong. Sonnet produces *errors* on concurrency reasoning and false positives on contradiction detection. It's not "cheaper Opus" — it's a different tool.
+
+4. **"More context = better analysis"** — Signal-to-noise ratio matters more than context richness. Isolate what you're asking about.
+
+5. **"One good prompt works everywhere"** — Prompt framing shapes output more than model choice. The same model with a broad vs narrow prompt produces qualitatively different work. Design prompts per task type.
+
+6. **"Run it once, trust the output"** — Single runs are stochastic. Models miss things non-deterministically. Multiple models or multiple runs are the only hedge.
+
+---
+
+## Model Personality Cheat Sheet
+
+| Model | Default behavior | Thinks like a... |
+|-------|-----------------|------------------|
+| GPT-5 | Exhaustive enumeration, high verification bar, slow | Thorough auditor checking every line item |
+| Opus | Design tensions, self-correcting, efficient | Architect who sees how pieces conflict |
+| Sonnet 4.6 | Fast structural scan, self-filtering, concise | Senior engineer doing a quick review |
+| Sonnet 4.5 | Exhaustive, verbose, occasional severity inflation | Junior engineer trying to catch everything |
+| GPT-4.1 | Structured, stays within the document's framing | Competent analyst following a checklist |
+| GPT-4.1 Mini | Formulaic, maps findings 1:1 to document sections | Intern reading the doc and noting concerns |
+
+---
+
+## The Bottom Line
+
+**For our specific workflow (gargoyle architecture review, PR reviews, design docs):**
+
+1. Opus is the default analytical model — most efficient, deepest on consistency/contradiction
+2. GPT-5 is the "we can't afford to miss anything" model — use on high-stakes docs
+3. Sonnet is the speed/screening model — first pass, structural checks, assumption-finding only
+4. Never use Sonnet alone for concurrency, contradiction, or adversarial analysis
+5. Always isolate the analytical question from surrounding noise
+6. Task-type-specific prompts beat generic "review this" prompts every time
diff --git a/REPORT.md b/REPORT.md
new file mode 100644
index 0000000..c270809
--- /dev/null
+++ b/REPORT.md
@@ -0,0 +1,154 @@
+# Model Research Report: AI Models for Analytical Work
+
+_29 experiments across 11 days (2026-04-26 to 2026-05-06). Six models tested on architecture document analysis — not coding._
+
+## Executive Summary
+
+We tested GPT-5, Claude Opus, Claude Sonnet 4.6, Claude Sonnet 4.5, GPT-4.1, and GPT-4.1 Mini on analytical tasks: finding hidden assumptions, race conditions, design contradictions, adversarial attack paths, regulatory gaps, and cross-document inconsistencies in real architecture documents.
+
+**The central finding:** Different models don't just find more or fewer things — they find *qualitatively different kinds* of things. Model choice is task-dependent, and no single model dominates all analytical work.
+
+---
+
+## Part 1: What Each Model Is Good At
+
+### GPT-5
+**Strength:** Exhaustive enumeration + domain-specific reasoning about the real world.
+
+GPT-5's reasoning tokens change the *kind* of analysis, not just the depth. Non-reasoning models identify risks within a document's own frame of reference. GPT-5 reasons about the document's relationship to the external world: broker semantics, deployment topology, OTP runtime behavior under load, timing correlations across independent subsystems.
+
+- Best at: gap-finding, hidden assumptions, adversarial enumeration, temporal boundary analysis
+- Unique ability: finds multi-component interaction failures that require domain knowledge
+- Weakness: slow (2-4x longer than Opus), expensive (5-10x more tokens per finding), sometimes over-verifies
+- Finding count: typically 15-35 depending on document complexity
+
+### Claude Opus
+**Strength:** Design tensions, logical argumentation, creative adversarial thinking.
+
+Opus consistently identifies where one part of a design undermines another part. It doesn't enumerate failure modes — it finds the deeper question of *why* the design's own principles conflict.
+
+- Best at: contradiction detection, cross-document consistency, race conditions (design-level), adversarial creativity
+- Unique ability: self-corrects mid-analysis, finds "your safety mechanism IS your vulnerability" patterns
+- Most efficient model: 6-9x fewer tokens per finding than GPT-5 on shared task types
+- Weakness: produces fewer findings on pure enumeration tasks (10-13 vs GPT-5's 20-35)
+
+### Claude Sonnet 4.6
+**Strength:** Speed, structural issues, assumption-finding. Best precision-per-dollar.
+
+- Best at: quick first-pass screening, structural review, specification gap identification
+- Zero false positives on most tasks — every finding is actionable
+- Weakness: struggles with concurrency reasoning, contradiction detection, and tasks requiring formal logical reasoning
+- Produces false positives on verification-heavy tasks (contradiction, race conditions)
+
+### Claude Sonnet 4.5
+**Strength:** Exhaustive coverage. More findings than 4.6, at the cost of some noise.
+
+- Best at: specification completeness (25 findings vs 4.6's 13)
+- Catches operational gaps that 4.6 filters out
+- Tradeoff: severity inflation, more verbose output
+
+### GPT-4.1
+**Strength:** Structured, thorough, good middle ground. Generic but competent.
+
+- Stays within the document's own framing — finds assumptions the document *almost* states
+- Best unique contribution: meta-observations about design structure (e.g., "all failure modes treated as isolated")
+- Good enough for first-pass review where GPT-5's cost isn't justified
+
+### GPT-4.1 Mini
+**Strength:** Cheapest. Formulaic but catches the obvious things.
+
+- Every finding maps cleanly to a section of the document
+- Fine for quick sanity checks, not for architectural insight
+- Scales with document size (6 findings on 459 lines → 21 on 1,110 lines)
+
+---
+
+## Part 2: What We Learned About Task Types
+
+Not all analytical tasks are the same. Models that excel at one struggle at another.
+
+| Task Type | Best Model | Runner-up | Avoid |
+|-----------|-----------|-----------|-------|
+| Hidden assumptions | GPT-5 | Opus | Mini (formulaic) |
+| Gap-finding | GPT-5 | GPT-4.1 | Mini (surface-level) |
+| Race conditions | GPT-5 + Opus | — | Sonnet (errors) |
+| Contradiction detection | **Opus** | GPT-5 | Sonnet (false positives) |
+| Cross-document consistency | **Opus** | GPT-5 | — |
+| Adversarial attack paths | GPT-5 (enumeration) + Opus (creativity) | — | — |
+| Bias detection | Any model | — | — |
+| Design coherence | Document-dependent | — | — |
+| Specification completeness | Sonnet 4.5 (breadth) or GPT-5 (self-contradictions) | — | — |
+| Missing feature identification | All (with right prompt) | — | — |
+| Invariant violation paths | GPT-5 (precision) | Opus (breadth) | Sonnet (imprecise) |
+
+**Key pattern:** Tasks requiring *identification* (what's missing? what's assumed?) are accessible to all models. Tasks requiring *verification* (is this sequence legal? does this contradict that?) favor reasoning models (GPT-5, Opus) and exclude Sonnet.
+
+---
+
+## Part 3: Meta-Findings About How to Use Models
+
+### 1. Signal-to-noise ratio matters more than model capability (Finding #8)
+
+When biased text is the ONLY input, even GPT-4.1 Mini catches it. When the same bias is buried inside a full PR review with diffs, issues, and project context, expensive models miss it. The issue isn't model intelligence — it's attention dilution.
+
+**Implication:** For important analytical checks, isolate the signal. Extract the relevant text and ask about it specifically. Don't bury important questions inside broad review mandates.
+
+### 2. Prompt framing dominates model personality (Finding #26)
+
+Opus's "finds design tensions" and GPT-5's "exhaustive enumeration" are DEFAULT tendencies, not capabilities. With structured prompts that explicitly ask for breadth, Opus produces MORE findings than GPT-5. With structured prompts asking for contradictions, GPT-5 becomes highly selective.
+
+**Implication:** Model choice matters less than you think for any single task. Prompt structure is the primary lever. Model personality matters for OPEN-ENDED tasks where you want emergent analytical behavior.
+
+### 3. Task type predicts model performance better than "model X is better" (Finding #13)
+
+Sonnet scores 85% of GPT-5's performance on assumption-finding but drops to ~50% on concurrency reasoning. Don't extrapolate across task types.
+
+### 4. The union of models finds the most (Finding #19)
+
+GPT-5 Mini + Sonnet covers ~71% of GPT-5's findings at 31% of the cost. But the missing 29% contains the domain-specific interaction-level findings most likely to prevent production incidents. Each model also finds things the others miss — the total unique finding space is larger than any single model's output.
+
+### 5. Reasoning tokens change the KIND of analysis, not just the amount (Finding #10)
+
+Non-reasoning models ask "what could this mechanism fail at?" Reasoning models ask "what must be true about the world for this mechanism to work?" This is a qualitative difference in analytical mode, not just thoroughness.
+
+### 6. Reasoning effort parameter is a no-op for analytical work (Finding #21)
+
+Low/medium/high reasoning effort had negligible effect on GPT-5's output for open-ended analysis. Task type is a far stronger predictor of reasoning behavior. Don't waste time tuning this parameter for document review.
+
+### 7. Output length kills, input length doesn't (Finding #6)
+
+Single agents die trying to generate 1000+ line documents. Rich input context is fine — it's the output length that causes OOM/timeout. Break output into sections, keep input context rich, draft in parallel, assemble.
+
+### 8. Document complexity shifts model rankings (Finding #27)
+
+Opus beat GPT-5 on coherence analysis for one document but lost on another (more complex) document. Rankings are not universal — they interact with document complexity, domain specificity, and prompt structure.
+
+### 9. Token budget matters more than model size (Finding #7b)
+
+When output is truncated by token limits, even GPT-5 produces shallow findings. Ensure sufficient max_completion_tokens (≥16K for GPT-5). A cheap model with enough tokens beats an expensive model that runs out of space.
+
+---
+
+## Part 4: Cost-Effectiveness
+
+| Model | Typical tokens/finding | Relative cost | Best use case |
+|-------|----------------------|---------------|---------------|
+| Opus | 179-336 | 1x (baseline) | Cross-doc consistency, contradictions, design tensions |
+| Sonnet 4.6 | 194-312 | 0.3x | Quick screening, structural review, assumption-finding |
+| GPT-5 | 993-2,967 | 5-9x | High-stakes analysis where missing something has real cost |
+| GPT-4.1 | ~500 | 0.5x | Middle-ground first pass |
+| GPT-4.1 Mini | ~300 | 0.1x | Bulk screening, sanity checks |
+
+**For financial/safety-critical systems:** Run all three (Opus + GPT-5 + Sonnet). The ~$1 total cost per document is irrelevant vs the value of comprehensive coverage.
+
+**For routine review:** Opus alone or Sonnet + Opus pair. Skip GPT-5 unless the document is complex and the stakes justify it.
+
+---
+
+## Part 5: What's Still Unknown
+
+1. **Would running models sequentially (feed Model A's output to Model B) outperform parallel runs?** Hypothesized for adversarial analysis but untested.
+2. **Are these findings corpus-specific?** All 29 experiments used gargoyle architecture docs. Different domains may shift rankings.
+3. **How much do results vary across runs?** All findings are single-run. Stochastic variation is unquantified.
+4. **Does Sonnet's narrow-framing weakness go away with explicit concurrency prompts?** Untested — the hypothesis that Sonnet's "structural reviewer" tendency is a framing artifact.
+5. **What happens on 2000+ line documents?** Largest tested is 1,110 lines. Unknown if model rankings shift at extreme scale.
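
Review note: the decision framework in LESSONS.md and the "merge findings, deduplicate" playbook step could be made concrete with a small helper. The sketch below is only an illustration under stated assumptions. The tier names (`low`, `medium`, `high`), the `Finding` type, and the helper functions are hypothetical and not part of this patch, and the sample findings in the usage example are made-up placeholders, not experiment results. Deduplication here is exact matching after whitespace and case normalization, a deliberate simplification: merging real model output would likely need fuzzier matching or manual triage.

```python
from dataclasses import dataclass

# Hypothetical tier names; the model lists mirror the decision framework in LESSONS.md.
MODELS_BY_STAKES = {
    "low": ["opus"],                      # one model (Sonnet is the speed alternative)
    "medium": ["opus", "gpt-5"],          # two models
    "high": ["opus", "gpt-5", "sonnet"],  # all three, plus manual review of unique findings
}


@dataclass(frozen=True)
class Finding:
    model: str
    text: str


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different wordings compare equal."""
    return " ".join(text.lower().split())


def merge_findings(findings):
    """Union findings across models, recording which models reported each one."""
    merged = {}
    for f in findings:
        entry = merged.setdefault(normalize(f.text), {"text": f.text, "models": []})
        if f.model not in entry["models"]:
            entry["models"].append(f.model)
    return list(merged.values())


def unique_findings(merged):
    """Findings reported by exactly one model: the candidates for manual review."""
    return [m for m in merged if len(m["models"]) == 1]


if __name__ == "__main__":
    # Placeholder findings for illustration only.
    sample = [
        Finding("opus", "Retry policy conflicts with idempotency claim"),
        Finding("gpt-5", "retry policy  conflicts with idempotency claim"),
        Finding("gpt-5", "Broker redelivery is assumed but never stated"),
    ]
    for entry in merge_findings(sample):
        print(entry["models"], entry["text"])
```

Tracking which models reported each finding keeps the single-model findings easy to pull out, which matches the "manual review of unique findings" step in the high-stakes tier.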