Files
model-research/REPORT.md
T
Rodin f865a0d778 docs: add research report and actionable lessons summary
REPORT.md — full analysis of 29 experiments: model strengths, task-type
mappings, meta-findings, cost-effectiveness, and open questions.

LESSONS.md — distilled operational playbook: which model for which task,
anti-patterns, decision framework, and the three core rules.
2026-05-06 07:24:12 -07:00

9.5 KiB

Model Research Report: AI Models for Analytical Work

29 experiments across 11 days (2026-04-26 to 2026-05-06). Five models tested on architecture document analysis — not coding.

Executive Summary

We tested GPT-5, Claude Opus, Claude Sonnet 4.6, Claude Sonnet 4.5, GPT-4.1, and GPT-4.1 Mini on analytical tasks: finding hidden assumptions, race conditions, design contradictions, adversarial attack paths, regulatory gaps, and cross-document inconsistencies in real architecture documents.

The central finding: Different models don't just find more or fewer things — they find qualitatively different kinds of things. Model choice is task-dependent, and no single model dominates all analytical work.


Part 1: What Each Model Is Good At

GPT-5

Strength: Exhaustive enumeration + domain-specific reasoning about the real world.

GPT-5's reasoning tokens change the kind of analysis, not just the depth. Non-reasoning models identify risks within a document's own frame of reference. GPT-5 reasons about the document's relationship to the external world: broker semantics, deployment topology, OTP runtime behavior under load, timing correlations across independent subsystems.

  • Best at: gap-finding, hidden assumptions, adversarial enumeration, temporal boundary analysis
  • Unique ability: finds multi-component interaction failures that require domain knowledge
  • Weakness: slow (2-4x longer than Opus), expensive (5-10x more tokens per finding), sometimes over-verifies
  • Finding count: typically 15-35 depending on document complexity

Claude Opus

Strength: Design tensions, logical argumentation, creative adversarial thinking.

Opus consistently identifies where one part of a design undermines another part. It doesn't enumerate failure modes — it finds the deeper question of why the design's own principles conflict.

  • Best at: contradiction detection, cross-document consistency, race conditions (design-level), adversarial creativity
  • Unique ability: self-corrects mid-analysis, finds "your safety mechanism IS your vulnerability" patterns
  • Most efficient model: 6-9x fewer tokens per finding than GPT-5 on shared task types
  • Weakness: produces fewer findings on pure enumeration tasks (10-13 vs GPT-5's 20-35)

Claude Sonnet 4.6

Strength: Speed, structural issues, assumption-finding. Best precision-per-dollar.

  • Best at: quick first-pass screening, structural review, specification gap identification
  • Zero false positives on most tasks — every finding is actionable
  • Weakness: struggles with concurrency reasoning, contradiction detection, and tasks requiring formal logical reasoning
  • Produces false positives on verification-heavy tasks (contradiction, race conditions)

Claude Sonnet 4.5

Strength: Exhaustive coverage. More findings than 4.6, at the cost of some noise.

  • Best at: specification completeness (25 findings vs 4.6's 13)
  • Catches operational gaps that 4.6 filters out
  • Tradeoff: severity inflation, more verbose output

GPT-4.1

Strength: Structured, thorough, good middle ground. Generic but competent.

  • Stays within the document's own framing — finds assumptions the document almost states
  • Best unique contribution: meta-observations about design structure (e.g., "all failure modes treated as isolated")
  • Good enough for first-pass review where GPT-5's cost isn't justified

GPT-4.1 Mini

Strength: Cheapest. Formulaic but catches the obvious things.

  • Every finding maps cleanly to a section of the document
  • Fine for quick sanity checks, not for architectural insight
  • Scales with document size (6 findings on 459 lines → 21 on 1,110 lines)

Part 2: What We Learned About Task Types

Not all analytical tasks are the same. Models that excel at one struggle at another.

Task Type Best Model Runner-up Avoid
Hidden assumptions GPT-5 Opus Mini (formulaic)
Gap-finding GPT-5 GPT-4.1 Mini (surface-level)
Race conditions GPT-5 + Opus Sonnet (errors)
Contradiction detection Opus GPT-5 Sonnet (false positives)
Cross-document consistency Opus GPT-5
Adversarial attack paths GPT-5 (enumeration) + Opus (creativity)
Bias detection Any model
Design coherence Document-dependent
Specification completeness Sonnet 4.5 (breadth) or GPT-5 (self-contradictions)
Missing feature identification All (with right prompt)
Invariant violation paths GPT-5 (precision) Opus (breadth) Sonnet (imprecise)

Key pattern: Tasks requiring identification (what's missing? what's assumed?) are accessible to all models. Tasks requiring verification (is this sequence legal? does this contradict that?) favor reasoning models (GPT-5, Opus) and exclude Sonnet.


Part 3: Meta-Findings About How to Use Models

1. Signal-to-noise ratio matters more than model capability (Finding #8)

When biased text is the ONLY input, even GPT-4.1 Mini catches it. When the same bias is buried inside a full PR review with diffs, issues, and project context, expensive models miss it. The issue isn't model intelligence — it's attention dilution.

Implication: For important analytical checks, isolate the signal. Extract the relevant text and ask about it specifically. Don't bury important questions inside broad review mandates.

2. Prompt framing dominates model personality (Finding #26)

Opus's "finds design tensions" and GPT-5's "exhaustive enumeration" are DEFAULT tendencies, not capabilities. With structured prompts that explicitly ask for breadth, Opus produces MORE findings than GPT-5. With structured prompts asking for contradictions, GPT-5 becomes highly selective.

Implication: Model choice matters less than you think for any single task. Prompt structure is the primary lever. Model personality matters for OPEN-ENDED tasks where you want emergent analytical behavior.

3. Task type predicts model performance better than "model X is better" (Finding #13)

Sonnet scores 85% of GPT-5's performance on assumption-finding but drops to ~50% on concurrency reasoning. Don't extrapolate across task types.

4. The union of models finds the most (Finding #19)

GPT-5 Mini + Sonnet covers ~71% of GPT-5's findings at 31% of the cost. But the missing 29% contains the domain-specific interaction-level findings most likely to prevent production incidents. Each model also finds things the others miss — the total unique finding space is larger than any single model's output.

5. Reasoning tokens change the KIND of analysis, not just the amount (Finding #10)

Non-reasoning models ask "what could this mechanism fail at?" Reasoning models ask "what must be true about the world for this mechanism to work?" This is a qualitative difference in analytical mode, not just thoroughness.

6. Reasoning effort parameter is a no-op for analytical work (Finding #21)

Low/medium/high reasoning effort had negligible effect on GPT-5's output for open-ended analysis. Task type is a far stronger predictor of reasoning behavior. Don't waste time tuning this parameter for document review.

7. Output length kills, input length doesn't (Finding #6)

Single agents die trying to generate 1000+ line documents. Rich input context is fine — it's the output length that causes OOM/timeout. Break output into sections, keep input context rich, draft in parallel, assemble.

8. Document complexity shifts model rankings (Finding #27)

Opus beat GPT-5 on coherence analysis for one document but lost on another (more complex) document. Rankings are not universal — they interact with document complexity, domain specificity, and prompt structure.

9. Token budget matters more than model size (Finding #7b)

When output is truncated by token limits, even GPT-5 produces shallow findings. Ensure sufficient max_completion_tokens (≥16K for GPT-5). A cheap model with enough tokens beats an expensive model that runs out of space.


Part 4: Cost-Effectiveness

Model Typical tokens/finding Relative cost Best use case
Opus 179-336 1x (baseline) Cross-doc consistency, contradictions, design tensions
Sonnet 4.6 194-312 0.3x Quick screening, structural review, assumption-finding
GPT-5 993-2,967 5-9x High-stakes analysis where missing something has real cost
GPT-4.1 ~500 0.5x Middle-ground first pass
GPT-4.1 Mini ~300 0.1x Bulk screening, sanity checks

For financial/safety-critical systems: Run all three (Opus + GPT-5 + Sonnet). The ~$1 total cost per document is irrelevant vs the value of comprehensive coverage.

For routine review: Opus alone or Sonnet + Opus pair. Skip GPT-5 unless the document is complex and the stakes justify it.


Part 5: What's Still Unknown

  1. Would running models sequentially (feed Model A's output to Model B) outperform parallel runs? Hypothesized for adversarial analysis but untested.
  2. Are these findings corpus-specific? All 29 experiments used gargoyle architecture docs. Different domains may shift rankings.
  3. How much do results vary across runs? All findings are single-run. Stochastic variation is unquantified.
  4. Does Sonnet's narrow-framing weakness go away with explicit concurrency prompts? Untested — the hypothesis that Sonnet's "structural reviewer" tendency is a framing artifact.
  5. What happens on 2000+ line documents? Largest tested is 1,110 lines. Unknown if model rankings shift at extreme scale.