f865a0d778
REPORT.md — full analysis of 29 experiments: model strengths, task-type mappings, meta-findings, cost-effectiveness, and open questions. LESSONS.md — distilled operational playbook: which model for which task, anti-patterns, decision framework, and the three core rules.
155 lines
9.5 KiB
Markdown
155 lines
9.5 KiB
Markdown
# Model Research Report: AI Models for Analytical Work
|
|
|
|
_29 experiments across 11 days (2026-04-26 to 2026-05-06). Five models tested on architecture document analysis — not coding._
|
|
|
|
## Executive Summary
|
|
|
|
We tested GPT-5, Claude Opus, Claude Sonnet 4.6, Claude Sonnet 4.5, GPT-4.1, and GPT-4.1 Mini on analytical tasks: finding hidden assumptions, race conditions, design contradictions, adversarial attack paths, regulatory gaps, and cross-document inconsistencies in real architecture documents.
|
|
|
|
**The central finding:** Different models don't just find more or fewer things — they find *qualitatively different kinds* of things. Model choice is task-dependent, and no single model dominates all analytical work.
|
|
|
|
---
|
|
|
|
## Part 1: What Each Model Is Good At
|
|
|
|
### GPT-5
|
|
**Strength:** Exhaustive enumeration + domain-specific reasoning about the real world.
|
|
|
|
GPT-5's reasoning tokens change the *kind* of analysis, not just the depth. Non-reasoning models identify risks within a document's own frame of reference. GPT-5 reasons about the document's relationship to the external world: broker semantics, deployment topology, OTP runtime behavior under load, timing correlations across independent subsystems.
|
|
|
|
- Best at: gap-finding, hidden assumptions, adversarial enumeration, temporal boundary analysis
|
|
- Unique ability: finds multi-component interaction failures that require domain knowledge
|
|
- Weakness: slow (2-4x longer than Opus), expensive (5-10x more tokens per finding), sometimes over-verifies
|
|
- Finding count: typically 15-35 depending on document complexity
|
|
|
|
### Claude Opus
|
|
**Strength:** Design tensions, logical argumentation, creative adversarial thinking.
|
|
|
|
Opus consistently identifies where one part of a design undermines another part. It doesn't enumerate failure modes — it finds the deeper question of *why* the design's own principles conflict.
|
|
|
|
- Best at: contradiction detection, cross-document consistency, race conditions (design-level), adversarial creativity
|
|
- Unique ability: self-corrects mid-analysis, finds "your safety mechanism IS your vulnerability" patterns
|
|
- Most efficient model: 6-9x fewer tokens per finding than GPT-5 on shared task types
|
|
- Weakness: produces fewer findings on pure enumeration tasks (10-13 vs GPT-5's 20-35)
|
|
|
|
### Claude Sonnet 4.6
|
|
**Strength:** Speed, structural issues, assumption-finding. Best precision-per-dollar.
|
|
|
|
- Best at: quick first-pass screening, structural review, specification gap identification
|
|
- Zero false positives on most tasks — every finding is actionable
|
|
- Weakness: struggles with concurrency reasoning, contradiction detection, and tasks requiring formal logical reasoning
|
|
- Produces false positives on verification-heavy tasks (contradiction, race conditions)
|
|
|
|
### Claude Sonnet 4.5
|
|
**Strength:** Exhaustive coverage. More findings than 4.6, at the cost of some noise.
|
|
|
|
- Best at: specification completeness (25 findings vs 4.6's 13)
|
|
- Catches operational gaps that 4.6 filters out
|
|
- Tradeoff: severity inflation, more verbose output
|
|
|
|
### GPT-4.1
|
|
**Strength:** Structured, thorough, good middle ground. Generic but competent.
|
|
|
|
- Stays within the document's own framing — finds assumptions the document *almost* states
|
|
- Best unique contribution: meta-observations about design structure (e.g., "all failure modes treated as isolated")
|
|
- Good enough for first-pass review where GPT-5's cost isn't justified
|
|
|
|
### GPT-4.1 Mini
|
|
**Strength:** Cheapest. Formulaic but catches the obvious things.
|
|
|
|
- Every finding maps cleanly to a section of the document
|
|
- Fine for quick sanity checks, not for architectural insight
|
|
- Scales with document size (6 findings on 459 lines → 21 on 1,110 lines)
|
|
|
|
---
|
|
|
|
## Part 2: What We Learned About Task Types
|
|
|
|
Not all analytical tasks are the same. Models that excel at one struggle at another.
|
|
|
|
| Task Type | Best Model | Runner-up | Avoid |
|
|
|-----------|-----------|-----------|-------|
|
|
| Hidden assumptions | GPT-5 | Opus | Mini (formulaic) |
|
|
| Gap-finding | GPT-5 | GPT-4.1 | Mini (surface-level) |
|
|
| Race conditions | GPT-5 + Opus | — | Sonnet (errors) |
|
|
| Contradiction detection | **Opus** | GPT-5 | Sonnet (false positives) |
|
|
| Cross-document consistency | **Opus** | GPT-5 | — |
|
|
| Adversarial attack paths | GPT-5 (enumeration) + Opus (creativity) | — | — |
|
|
| Bias detection | Any model | — | — |
|
|
| Design coherence | Document-dependent | — | — |
|
|
| Specification completeness | Sonnet 4.5 (breadth) or GPT-5 (self-contradictions) | — | — |
|
|
| Missing feature identification | All (with right prompt) | — | — |
|
|
| Invariant violation paths | GPT-5 (precision) | Opus (breadth) | Sonnet (imprecise) |
|
|
|
|
**Key pattern:** Tasks requiring *identification* (what's missing? what's assumed?) are accessible to all models. Tasks requiring *verification* (is this sequence legal? does this contradict that?) favor reasoning models (GPT-5, Opus) and exclude Sonnet.
|
|
|
|
---
|
|
|
|
## Part 3: Meta-Findings About How to Use Models
|
|
|
|
### 1. Signal-to-noise ratio matters more than model capability (Finding #8)
|
|
|
|
When biased text is the ONLY input, even GPT-4.1 Mini catches it. When the same bias is buried inside a full PR review with diffs, issues, and project context, expensive models miss it. The issue isn't model intelligence — it's attention dilution.
|
|
|
|
**Implication:** For important analytical checks, isolate the signal. Extract the relevant text and ask about it specifically. Don't bury important questions inside broad review mandates.
|
|
|
|
### 2. Prompt framing dominates model personality (Finding #26)
|
|
|
|
Opus's "finds design tensions" and GPT-5's "exhaustive enumeration" are DEFAULT tendencies, not capabilities. With structured prompts that explicitly ask for breadth, Opus produces MORE findings than GPT-5. With structured prompts asking for contradictions, GPT-5 becomes highly selective.
|
|
|
|
**Implication:** Model choice matters less than you think for any single task. Prompt structure is the primary lever. Model personality matters for OPEN-ENDED tasks where you want emergent analytical behavior.
|
|
|
|
### 3. Task type predicts model performance better than "model X is better" (Finding #13)
|
|
|
|
Sonnet scores 85% of GPT-5's performance on assumption-finding but drops to ~50% on concurrency reasoning. Don't extrapolate across task types.
|
|
|
|
### 4. The union of models finds the most (Finding #19)
|
|
|
|
GPT-5 Mini + Sonnet covers ~71% of GPT-5's findings at 31% of the cost. But the missing 29% contains the domain-specific interaction-level findings most likely to prevent production incidents. Each model also finds things the others miss — the total unique finding space is larger than any single model's output.
|
|
|
|
### 5. Reasoning tokens change the KIND of analysis, not just the amount (Finding #10)
|
|
|
|
Non-reasoning models ask "what could this mechanism fail at?" Reasoning models ask "what must be true about the world for this mechanism to work?" This is a qualitative difference in analytical mode, not just thoroughness.
|
|
|
|
### 6. Reasoning effort parameter is a no-op for analytical work (Finding #21)
|
|
|
|
Low/medium/high reasoning effort had negligible effect on GPT-5's output for open-ended analysis. Task type is a far stronger predictor of reasoning behavior. Don't waste time tuning this parameter for document review.
|
|
|
|
### 7. Output length kills, input length doesn't (Finding #6)
|
|
|
|
Single agents die trying to generate 1000+ line documents. Rich input context is fine — it's the output length that causes OOM/timeout. Break output into sections, keep input context rich, draft in parallel, assemble.
|
|
|
|
### 8. Document complexity shifts model rankings (Finding #27)
|
|
|
|
Opus beat GPT-5 on coherence analysis for one document but lost on another (more complex) document. Rankings are not universal — they interact with document complexity, domain specificity, and prompt structure.
|
|
|
|
### 9. Token budget matters more than model size (Finding #7b)
|
|
|
|
When output is truncated by token limits, even GPT-5 produces shallow findings. Ensure sufficient max_completion_tokens (≥16K for GPT-5). A cheap model with enough tokens beats an expensive model that runs out of space.
|
|
|
|
---
|
|
|
|
## Part 4: Cost-Effectiveness
|
|
|
|
| Model | Typical tokens/finding | Relative cost | Best use case |
|
|
|-------|----------------------|---------------|---------------|
|
|
| Opus | 179-336 | 1x (baseline) | Cross-doc consistency, contradictions, design tensions |
|
|
| Sonnet 4.6 | 194-312 | 0.3x | Quick screening, structural review, assumption-finding |
|
|
| GPT-5 | 993-2,967 | 5-9x | High-stakes analysis where missing something has real cost |
|
|
| GPT-4.1 | ~500 | 0.5x | Middle-ground first pass |
|
|
| GPT-4.1 Mini | ~300 | 0.1x | Bulk screening, sanity checks |
|
|
|
|
**For financial/safety-critical systems:** Run all three (Opus + GPT-5 + Sonnet). The ~$1 total cost per document is irrelevant vs the value of comprehensive coverage.
|
|
|
|
**For routine review:** Opus alone or Sonnet + Opus pair. Skip GPT-5 unless the document is complex and the stakes justify it.
|
|
|
|
---
|
|
|
|
## Part 5: What's Still Unknown
|
|
|
|
1. **Would running models sequentially (feed Model A's output to Model B) outperform parallel runs?** Hypothesized for adversarial analysis but untested.
|
|
2. **Are these findings corpus-specific?** All 29 experiments used gargoyle architecture docs. Different domains may shift rankings.
|
|
3. **How much do results vary across runs?** All findings are single-run. Stochastic variation is unquantified.
|
|
4. **Does Sonnet's narrow-framing weakness go away with explicit concurrency prompts?** Untested — the hypothesis that Sonnet's "structural reviewer" tendency is a framing artifact.
|
|
5. **What happens on 2000+ line documents?** Largest tested is 1,110 lines. Unknown if model rankings shift at extreme scale.
|