18 KiB
Model Research Report: AI Models for Analytical Work
Generated: 2026-05-11 09:00 PDT
Findings analyzed: 74
Period: 2026-04-26 to 2026-05-11
74 experiments across 16 days. Six models tested on architecture document analysis — not coding.
What's New (Since May 6)
45 new findings (29 → 74) covering:
- New task types validated: Operational blind spot analysis (#46), emergent behavior from rule composition (#47), defense-in-depth gaps (#48), adversarial evasion/tampering (#49), concurrency race conditions (#50), implementation ambiguity (#51), degraded mode propagation (#52), unstated constraints (#53), state reconstruction correctness (#55), operational burden (#56), event flow correctness (#57), state machine completeness (#58), convention-rule gaps (#59), counterfactual event ordering (#60), regulatory completeness (#61), data integrity signal flow (#62), external system assumptions (#63), specification gaps (#64), temporal correctness (#65), concurrent write hazards (#65b), cross-context contract coherence (#68), boundary contract analysis, boundary violation analysis, inter-document contradiction analysis, security boundary analysis, audit log data integrity (#11-May), wash sale regulatory compliance (#11-May)
- Cross-document consistency expanded (#37, #44): Opus confirmed as dominant for subtle contradictions across tightly-coupled docs
- Regulatory compliance analysis depth (#38, #54, #61): GPT-5 excels at IRS/regulatory specificity with correct citations
- Narrow framing tested and rejected (#39, #43): Sonnet cannot match GPT-5/Opus via prompt framing alone — reasoning depth is the bottleneck
- Adversarial ensemble validated (#35): Critique-then-extend produces 30% more findings at 28% more cost
- Operational burden as distinct lens (#45, #56): Models diverge on what constitutes "operator cognitive load"
- Silent data corruption paths (#40): GPT-5 excels at tracing multi-step corruption through financial accounting
- Temporal ordering dependencies (#41): All models catch obvious ordering; GPT-5 unique on subtle cascades
- Failure propagation chains (#42): Opus finds the architectural insight; GPT-5 finds the enumeration
Executive Summary
We tested GPT-5, Claude Opus, Claude Sonnet 4.6, Claude Sonnet 4.5, GPT-4.1, and GPT-4.1 Mini on analytical tasks: finding hidden assumptions, race conditions, design contradictions, adversarial attack paths, regulatory gaps, cross-document inconsistencies, operational blind spots, emergent behaviors, and security boundaries in real architecture documents.
The central finding: Different models don't just find more or fewer things — they find qualitatively different kinds of things. Model choice is task-dependent, and no single model dominates all analytical work.
The secondary finding: Task type predicts model performance better than "model X is better." A model that excels at gap-finding may struggle at contradiction detection. Match the model to the task.
Part 1: What Each Model Is Good At
GPT-5
Strength: Exhaustive enumeration + domain-specific reasoning about the real world.
GPT-5's reasoning tokens change the kind of analysis, not just the depth. Non-reasoning models identify risks within a document's own frame of reference. GPT-5 reasons about the document's relationship to the external world: broker semantics, deployment topology, OTP runtime behavior under load, timing correlations across independent subsystems, IRS regulatory requirements.
| Capability | Evidence |
|---|---|
| Domain-specific gaps | #9, #31: Broker rate limiting, credential rotation, corporate actions |
| Multi-component interactions | #10, #14: Finds assumptions requiring cross-boundary reasoning |
| Adversarial enumeration | #29, #35: Most thorough attack surface coverage |
| Temporal boundary analysis | #18: 15 findings with mathematical precision |
| Regulatory compliance | #23, #38, #54: Correct IRS citations, regulatory edge cases |
| Silent data corruption | #40: Traces multi-step corruption paths |
| Invariant violation paths | #20: Precise, verifiable paths through state space |
| Operational blind spots | #46: 18 findings including cross-service trace gaps |
- Best at: gap-finding, hidden assumptions, adversarial enumeration, temporal boundary analysis, regulatory compliance, operational blind spots
- Unique ability: finds multi-component interaction failures requiring domain knowledge
- Weakness: slow (2-4x longer than Opus), expensive (5-10x more tokens per finding), sometimes over-enumerates
- Finding count: typically 15-35 depending on document complexity
Claude Opus
Strength: Design tensions, logical argumentation, creative adversarial thinking, cross-document consistency.
Opus consistently identifies where one part of a design undermines another part. It doesn't enumerate failure modes — it finds the deeper question of why the design's own principles conflict.
| Capability | Evidence |
|---|---|
| Contradiction detection | #25, #43: Finds logical impossibilities via deductive reasoning |
| Cross-document consistency | #28, #37, #44: 2.4x faster than GPT-5, finds more issues |
| Race conditions (design-level) | #13: 10 high-quality findings, self-corrects mid-analysis |
| Adversarial creativity | #29, #35: "Your safety mechanism IS your vulnerability" patterns |
| False assumption detection | #31, #32: Finds where spec's own logic contradicts itself |
| Emergent behavior insight | #47: Stop-loss defeated by temporal composition (best single finding) |
| Survivor bias identification | #46: Decision latency histogram hides stuck decisions |
- Best at: contradiction detection, cross-document consistency, race conditions (design-level), adversarial creativity, finding false assumptions
- Unique ability: self-corrects mid-analysis, finds where protection mechanisms become vulnerabilities
- Most efficient model: 6-9x fewer tokens per finding than GPT-5 on shared task types
- Weakness: produces fewer findings on pure enumeration tasks (10-13 vs GPT-5's 20-35)
Claude Sonnet 4.6
Strength: Speed, structural issues, assumption-finding. Best precision-per-dollar.
| Capability | Evidence |
|---|---|
| Quick first-pass screening | #9, #12: 2-3x faster than other models |
| Structural review | #5: Catches formatting, broken links, missing sections |
| Specification gap identification | #16: 13 findings, zero false positives |
| Observability gaps | #33: 11 findings in 36s |
- Best at: quick first-pass screening, structural review, specification gap identification
- Zero false positives on most tasks — every finding is actionable
- Weakness: struggles with concurrency reasoning, contradiction detection, tasks requiring formal logical reasoning
- Produces false positives on verification-heavy tasks (contradiction, race conditions)
Critical limitation (Finding #39): Narrow framing does NOT close the gap with GPT-5/Opus. Sonnet can find 3 contradictions but only 1 is genuine (2 are misreadings). The gap is reasoning depth, not framing — Sonnet can't reliably verify whether two statements actually contradict each other.
Claude Sonnet 4.5
Strength: Exhaustive coverage. More findings than 4.6, at the cost of some noise.
| Capability | Evidence |
|---|---|
| Specification completeness | #16: 25 findings vs 4.6's 13 |
| Temporal reasoning | #18: 12 findings with no errors (vs 4.6's errors in #13) |
| Operational gaps | Catches gaps that 4.6 filters out |
- Best at: specification completeness, broad coverage
- Tradeoff: severity inflation, more verbose output
- Use 4.5 for coverage, 4.6 for precision
GPT-4.1
Strength: Structured, thorough, good middle ground. Generic but competent.
| Capability | Evidence |
|---|---|
| Stays within document framing | #9, #10: Finds assumptions the document almost states |
| Meta-observations | #10: "All failure modes treated as isolated" |
| Cost-effective first pass | Good enough when GPT-5's cost isn't justified |
- Best unique contribution: meta-observations about design structure
- Good enough for first-pass review where GPT-5's cost isn't justified
GPT-4.1 Mini
Strength: Cheapest. Formulaic but catches the obvious things.
| Capability | Evidence |
|---|---|
| Scales with document size | #9, #19: 6 findings on 459 lines → 21 on 1,110 lines |
| Clean templates | Every finding maps to a document section |
| Bias detection | #8: Catches bias when signal isn't buried |
- Fine for quick sanity checks, not for architectural insight
- Best for: bulk screening, sanity checks, obvious-issue detection
Part 2: Task Type → Model Mapping
Not all analytical tasks are the same. Models that excel at one struggle at another.
| Task Type | Best Model | Runner-up | Avoid | Evidence |
|---|---|---|---|---|
| Gap-finding | GPT-5 | GPT-4.1 | Mini (surface-level) | #9, #31, #64 |
| Hidden assumptions | GPT-5 | Opus | Mini (formulaic) | #10, #11, #12, #53 |
| Race conditions | GPT-5 + Opus | — | Sonnet (errors) | #13, #50 |
| Contradiction detection | Opus | GPT-5 | Sonnet (false positives) | #25, #43 |
| Cross-document consistency | Opus | GPT-5 | — | #28, #37, #44 |
| Adversarial attack paths | GPT-5 (enum) + Opus (creativity) | — | — | #29, #35, #49 |
| Design coherence | Document-dependent | — | — | #15, #27 |
| Specification completeness | Sonnet 4.5 (breadth) / GPT-5 (self-contradictions) | — | — | #16, #31 |
| Regulatory compliance | GPT-5 | Sonnet (first-pass) | — | #23, #38, #54 |
| Operational blind spots | GPT-5 | Opus | Sonnet | #46 |
| Emergent behavior | GPT-5 (feedback loops) | Opus (best single insight) | — | #47 |
| Temporal boundaries | GPT-5 | Opus | — | #18, #41, #65 |
| State machine completeness | GPT-5 | Opus | — | #58 |
| Silent data corruption | GPT-5 | — | — | #40, #62 |
| Defense-in-depth gaps | GPT-5 + Opus | — | — | #48 |
| Security boundaries | GPT-5 | Opus | — | #10-May |
Key pattern: Tasks requiring identification (what's missing? what's assumed?) are accessible to all models. Tasks requiring verification (is this sequence legal? does this contradict that?) favor reasoning models (GPT-5, Opus) and exclude Sonnet.
Task category taxonomy:
| Category | Sonnet value | Best models |
|---|---|---|
| Systematic/exhaustive | None | GPT-5, Opus |
| Creative/generative | Meta-analytical synthesis | Opus, GPT-5 |
| Compliance/regulatory | Adequate but shallow | GPT-5 (deep), Sonnet (first-pass) |
| Cross-document | None | Opus strongly preferred |
Part 3: Meta-Findings About How to Use Models
1. Signal-to-noise ratio matters more than model capability (#8)
When biased text is the ONLY input, even GPT-4.1 Mini catches it. When the same bias is buried inside a full PR review with diffs, issues, and project context, expensive models miss it. The issue isn't model intelligence — it's attention dilution.
Implication: For important analytical checks, isolate the signal. Extract the relevant text and ask about it specifically. Don't bury important questions inside broad review mandates.
2. Prompt framing dominates model personality for OPEN tasks (#26)
Opus's "finds design tensions" and GPT-5's "exhaustive enumeration" are DEFAULT tendencies, not hard limits. With structured prompts that explicitly ask for breadth, Opus produces MORE findings than GPT-5. With structured prompts asking for contradictions, GPT-5 becomes highly selective.
Implication: Model choice matters less than you think for any single task. Prompt structure is the primary lever. Model personality matters for open-ended tasks where you want emergent analytical behavior.
3. Narrow framing does NOT fix Sonnet's reasoning gaps (#39, #43)
Sonnet can't match GPT-5/Opus via narrow prompts alone. Narrow framing changes WHAT Sonnet looks for but not HOW WELL it reasons. Sonnet found 3 contradictions but only 1 was genuine (2 were misreadings). The gap is reasoning depth, not prompt engineering.
4. Task type predicts model performance better than "model X is better" (#13)
Sonnet scores 85% of GPT-5's performance on assumption-finding but drops to ~50% on concurrency reasoning. Don't extrapolate across task types.
5. The union of models finds the most (#19)
GPT-5 Mini + Sonnet covers ~71% of GPT-5's findings at 31% of the cost. But the missing 29% contains the domain-specific interaction-level findings most likely to prevent production incidents. Each model also finds things the others miss — the total unique finding space is larger than any single model's output.
6. Adversarial ensemble produces 30% more findings (#35)
Run GPT-5 for exhaustive enumeration, then give Opus GPT-5's findings and ask it to critique and extend. Result: 56 findings vs 43 (GPT-5 alone) or 28 (Opus alone). Zero full disagreements. The critique's structured assessment is more valuable than raw extensions. Cost: ~28% more tokens for 30% more coverage + prioritization.
7. Reasoning tokens change the KIND of analysis, not just the amount (#10)
Non-reasoning models ask "what could this mechanism fail at?" Reasoning models ask "what must be true about the world for this mechanism to work?" This is a qualitative difference in analytical mode, not just thoroughness.
8. Reasoning effort parameter is a no-op for analytical work (#21)
Low/medium/high reasoning effort had negligible effect on GPT-5's output for open-ended analysis. Task type is a far stronger predictor of reasoning behavior. Don't waste time tuning this parameter for document review.
9. Output length kills, input length doesn't (#6)
Single agents die trying to generate 1000+ line documents. Rich input context is fine — it's the output length that causes OOM/timeout. Break output into sections, keep input context rich, draft in parallel, assemble.
10. Document complexity shifts model rankings (#27)
Opus beat GPT-5 on coherence analysis for one document but lost on another (more complex) document. Rankings are not universal — they interact with document complexity, domain specificity, and prompt structure.
11. Token budget matters more than model size (#7b)
When output is truncated by token limits, even GPT-5 produces shallow findings. Ensure sufficient max_completion_tokens (≥16K for GPT-5). A cheap model with enough tokens beats an expensive model that runs out of space.
12. Opus excels at finding where specs believe false things (#31, #32)
Across experiments, Opus consistently finds where the spec's OWN ASSUMPTIONS are false. GPT-5 reasons about what the spec FAILS TO SAY. Opus reasons about what the spec BELIEVES to be true and checks whether those beliefs hold. Different but complementary.
13. GPT-5's reasoning tokens are spent on VERIFICATION for regulatory tasks (#54)
For domain-specific regulatory analysis (IRS wash sale rules), GPT-5 consistently cited correct publication sections, code numbers, and regulatory references. The 9,600 reasoning tokens appear spent on verification, not generation.
Part 4: Cost-Effectiveness
| Model | Typical tokens/finding | Relative cost | Best use case |
|---|---|---|---|
| Opus | 179-336 | 1x (baseline) | Cross-doc consistency, contradictions, design tensions |
| Sonnet 4.6 | 111-194 | 0.2-0.3x | Quick screening, structural review, assumption-finding |
| Sonnet 4.5 | 150-250 | 0.25x | Broad coverage when noise is acceptable |
| GPT-5 | 511-2,967 | 5-9x | High-stakes analysis where missing something has real cost |
| GPT-4.1 | ~500 | 0.5x | Middle-ground first pass |
| GPT-4.1 Mini | ~300 | 0.1x | Bulk screening, sanity checks |
For financial/safety-critical systems: Run all three (Opus + GPT-5 + Sonnet). The ~$1-2 total cost per document is trivially justified vs the value of comprehensive coverage.
For routine review: Opus alone or Sonnet + Opus pair. Skip GPT-5 unless the document is complex and the stakes justify it.
For regulatory compliance: GPT-5 for depth + correct citations, Sonnet for first-pass breadth.
Part 5: Open Questions
Still Unanswered
-
Are these findings corpus-specific? All 74 experiments used gargoyle architecture docs. Different domains may shift rankings.
-
How much do results vary across runs? All findings are single-run. Stochastic variation is unquantified.
-
What happens on 2000+ line documents? Largest tested is 1,110 lines. Unknown if model rankings shift at extreme scale.
-
Cross-document consistency as maintenance tool: Does running cross-doc analysis across MORE document pairs yield additional real inconsistencies? Could become a systematic documentation maintenance tool.
-
Why Opus dominates cross-doc consistency: Is it because contradictions are easy to verify once spotted (reducing GPT-5's verification advantage)? Or because boundary reasoning (Opus's strength) is the primary skill needed?
Answered Questions (from open-questions.md)
-
Opus + narrow framing for contradiction detection→ WRONG QUESTION (#43). Opus doesn't try to match GPT-5 — it finds a different CLASS of contradiction. GPT-5 finds specification conflicts (same scenario, conflicting prescriptions). Opus finds logical impossibilities (rules whose interaction produces impossible conditions). Neither dominates. -
Sonnet + narrow framing = GPT-5 level?→ NO (#39). The gap is reasoning depth, not framing. -
Adversarial ensemble (GPT-5 → Opus)?→ YES (#35). 30% more findings at 28% more cost. -
Opus's "missing feature identification" mode — is it promptable?→ YES (#26). All models find regulatory gaps when explicitly prompted. -
Is Opus > GPT-5 for coherence tasks universal?→ NO (#27). Document complexity affects ranking.