# Model Research Report: AI Models for Analytical Work

> **Generated:** 2026-05-11 09:00 PDT
> **Findings analyzed:** 74
> **Period:** 2026-04-26 to 2026-05-11

_74 experiments across 16 days. Six models tested on architecture document analysis, not coding._

---

## What's New (Since May 6)

**45 new findings** (29 → 74) covering:

- **New task types validated:** Operational blind spot analysis (#46), emergent behavior from rule composition (#47), defense-in-depth gaps (#48), adversarial evasion/tampering (#49), concurrency race conditions (#50), implementation ambiguity (#51), degraded mode propagation (#52), unstated constraints (#53), state reconstruction correctness (#55), operational burden (#56), event flow correctness (#57), state machine completeness (#58), convention-rule gaps (#59), counterfactual event ordering (#60), regulatory completeness (#61), data integrity signal flow (#62), external system assumptions (#63), specification gaps (#64), temporal correctness (#65), concurrent write hazards (#65b), cross-context contract coherence (#68), boundary contract analysis, boundary violation analysis, inter-document contradiction analysis, security boundary analysis, audit log data integrity (#11-May), wash sale regulatory compliance (#11-May)
- **Cross-document consistency expanded** (#37, #44): Opus confirmed as dominant for subtle contradictions across tightly-coupled docs
- **Regulatory compliance analysis depth** (#38, #54, #61): GPT-5 excels at IRS/regulatory specificity with correct citations
- **Narrow framing tested and rejected** (#39, #43): Sonnet cannot match GPT-5/Opus via prompt framing alone; reasoning depth is the bottleneck
- **Adversarial ensemble validated** (#35): Critique-then-extend produces 30% more findings at 28% more cost
- **Operational burden as distinct lens** (#45, #56): Models diverge on what constitutes "operator cognitive load"
- **Silent data corruption paths** (#40): GPT-5 excels at tracing multi-step corruption through
financial accounting
- **Temporal ordering dependencies** (#41): All models catch obvious ordering; GPT-5 unique on subtle cascades
- **Failure propagation chains** (#42): Opus finds the architectural insight; GPT-5 finds the enumeration

---

## Executive Summary

We tested GPT-5, Claude Opus, Claude Sonnet 4.6, Claude Sonnet 4.5, GPT-4.1, and GPT-4.1 Mini on analytical tasks: finding hidden assumptions, race conditions, design contradictions, adversarial attack paths, regulatory gaps, cross-document inconsistencies, operational blind spots, emergent behaviors, and security boundaries in real architecture documents.

**The central finding:** Different models don't just find more or fewer things; they find *qualitatively different kinds* of things. Model choice is task-dependent, and no single model dominates all analytical work.

**The secondary finding:** Task type predicts model performance better than "model X is better." A model that excels at gap-finding may struggle at contradiction detection. Match the model to the task.

---

## Part 1: What Each Model Is Good At

### GPT-5

**Strength:** Exhaustive enumeration + domain-specific reasoning about the real world.

GPT-5's reasoning tokens change the *kind* of analysis, not just the depth. Non-reasoning models identify risks within a document's own frame of reference. GPT-5 reasons about the document's relationship to the external world: broker semantics, deployment topology, OTP runtime behavior under load, timing correlations across independent subsystems, IRS regulatory requirements.

| Capability | Evidence |
|------------|----------|
| Domain-specific gaps | #9, #31: Broker rate limiting, credential rotation, corporate actions |
| Multi-component interactions | #10, #14: Finds assumptions requiring cross-boundary reasoning |
| Adversarial enumeration | #29, #35: Most thorough attack surface coverage |
| Temporal boundary analysis | #18: 15 findings with mathematical precision |
| Regulatory compliance | #23, #38, #54: Correct IRS citations, regulatory edge cases |
| Silent data corruption | #40: Traces multi-step corruption paths |
| Invariant violation paths | #20: Precise, verifiable paths through state space |
| Operational blind spots | #46: 18 findings including cross-service trace gaps |

- Best at: gap-finding, hidden assumptions, adversarial enumeration, temporal boundary analysis, regulatory compliance, operational blind spots
- Unique ability: finds multi-component interaction failures requiring domain knowledge
- Weakness: slow (2-4x longer than Opus), expensive (5-10x more tokens per finding), sometimes over-enumerates
- Finding count: typically 15-35 depending on document complexity

### Claude Opus

**Strength:** Design tensions, logical argumentation, creative adversarial thinking, cross-document consistency.

Opus consistently identifies where one part of a design undermines another part. It doesn't enumerate failure modes; it finds the deeper question of *why* the design's own principles conflict.

| Capability | Evidence |
|------------|----------|
| Contradiction detection | #25, #43: Finds logical impossibilities via deductive reasoning |
| Cross-document consistency | #28, #37, #44: 2.4x faster than GPT-5, finds more issues |
| Race conditions (design-level) | #13: 10 high-quality findings, self-corrects mid-analysis |
| Adversarial creativity | #29, #35: "Your safety mechanism IS your vulnerability" patterns |
| False assumption detection | #31, #32: Finds where spec's own logic contradicts itself |
| Emergent behavior insight | #47: Stop-loss defeated by temporal composition (best single finding) |
| Survivor bias identification | #46: Decision latency histogram hides stuck decisions |

- Best at: contradiction detection, cross-document consistency, race conditions (design-level), adversarial creativity, finding false assumptions
- Unique ability: self-corrects mid-analysis, finds where protection mechanisms become vulnerabilities
- Most efficient model: 6-9x fewer tokens per finding than GPT-5 on shared task types
- Weakness: produces fewer findings on pure enumeration tasks (10-13 vs GPT-5's 20-35)

### Claude Sonnet 4.6

**Strength:** Speed, structural issues, assumption-finding. Best precision-per-dollar.

| Capability | Evidence |
|------------|----------|
| Quick first-pass screening | #9, #12: 2-3x faster than other models |
| Structural review | #5: Catches formatting, broken links, missing sections |
| Specification gap identification | #16: 13 findings, zero false positives |
| Observability gaps | #33: 11 findings in 36s |

- Best at: quick first-pass screening, structural review, specification gap identification
- Zero false positives on most tasks: on those tasks, every finding is actionable
- Weakness: struggles with concurrency reasoning, contradiction detection, and tasks requiring formal logical reasoning
- Produces false positives on verification-heavy tasks (contradiction detection, race conditions)

**Critical limitation (Finding #39):** Narrow framing does NOT close the gap with GPT-5/Opus. Sonnet found 3 contradictions, but only 1 was genuine (2 were misreadings). The gap is reasoning depth, not framing: Sonnet can't reliably verify whether two statements actually contradict each other.

### Claude Sonnet 4.5

**Strength:** Exhaustive coverage. More findings than 4.6, at the cost of some noise.

| Capability | Evidence |
|------------|----------|
| Specification completeness | #16: 25 findings vs 4.6's 13 |
| Temporal reasoning | #18: 12 findings with no errors (vs 4.6's errors in #13) |
| Operational gaps | Catches gaps that 4.6 filters out |

- Best at: specification completeness, broad coverage
- Tradeoff: severity inflation, more verbose output
- Use 4.5 for coverage, 4.6 for precision

### GPT-4.1

**Strength:** Structured, thorough, good middle ground. Generic but competent.

| Capability | Evidence |
|------------|----------|
| Stays within document framing | #9, #10: Finds assumptions the document almost states |
| Meta-observations | #10: "All failure modes treated as isolated" |
| Cost-effective first pass | Good enough when GPT-5's cost isn't justified |

- Best unique contribution: meta-observations about design structure
- Good enough for first-pass review where GPT-5's cost isn't justified

### GPT-4.1 Mini

**Strength:** Cheapest. Formulaic but catches the obvious things.

| Capability | Evidence |
|------------|----------|
| Scales with document size | #9, #19: 6 findings on 459 lines → 21 on 1,110 lines |
| Clean templates | Every finding maps to a document section |
| Bias detection | #8: Catches bias when signal isn't buried |

- Fine for quick sanity checks, not for architectural insight
- Best for: bulk screening, sanity checks, obvious-issue detection

---

## Part 2: Task Type → Model Mapping

Not all analytical tasks are the same. Models that excel at one struggle at another.
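
In tooling, the task-to-model mapping in this part could be applied as a small lookup. The sketch below is a hypothetical illustration, not something the experiments used: `ROUTING`, `pick_model`, and the lowercase task keys are invented names, and only a handful of rows from the Task Type → Model Mapping table are encoded.

```python
from typing import Dict, Optional, Tuple

# Hypothetical routing table: task type -> (best model, runner-up or None).
# Values are copied from a few rows of the Task Type -> Model Mapping table;
# the key spellings and the structure itself are assumptions for illustration.
ROUTING: Dict[str, Tuple[str, Optional[str]]] = {
    "gap-finding": ("GPT-5", "GPT-4.1"),
    "hidden assumptions": ("GPT-5", "Opus"),
    "contradiction detection": ("Opus", "GPT-5"),
    "cross-document consistency": ("Opus", "GPT-5"),
    "regulatory compliance": ("GPT-5", "Sonnet"),
    "silent data corruption": ("GPT-5", None),
}


def pick_model(task_type: str, use_runner_up: bool = False) -> str:
    """Return the recommended model for a task type.

    Falls back to the best model when no runner-up is recorded.
    Raises KeyError for task types that are not in the table.
    """
    key = task_type.strip().lower()
    if key not in ROUTING:
        raise KeyError(f"no routing entry for task type: {task_type!r}")
    best, runner_up = ROUTING[key]
    if use_runner_up and runner_up is not None:
        return runner_up
    return best
```

Raising on unknown task types, rather than defaulting to one model, keeps the report's main caution visible: rankings should not be extrapolated to task types that were never tested.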

| Task Type | Best Model | Runner-up | Avoid | Evidence |
|-----------|-----------|-----------|-------|----------|
| **Gap-finding** | GPT-5 | GPT-4.1 | Mini (surface-level) | #9, #31, #64 |
| **Hidden assumptions** | GPT-5 | Opus | Mini (formulaic) | #10, #11, #12, #53 |
| **Race conditions** | GPT-5 + Opus | — | Sonnet (errors) | #13, #50 |
| **Contradiction detection** | **Opus** | GPT-5 | Sonnet (false positives) | #25, #43 |
| **Cross-document consistency** | **Opus** | GPT-5 | — | #28, #37, #44 |
| **Adversarial attack paths** | GPT-5 (enum) + Opus (creativity) | — | — | #29, #35, #49 |
| **Design coherence** | Document-dependent | — | — | #15, #27 |
| **Specification completeness** | Sonnet 4.5 (breadth) / GPT-5 (self-contradictions) | — | — | #16, #31 |
| **Regulatory compliance** | GPT-5 | Sonnet (first-pass) | — | #23, #38, #54 |
| **Operational blind spots** | GPT-5 | Opus | Sonnet | #46 |
| **Emergent behavior** | GPT-5 (feedback loops) | Opus (best single insight) | — | #47 |
| **Temporal boundaries** | GPT-5 | Opus | — | #18, #41, #65 |
| **State machine completeness** | GPT-5 | Opus | — | #58 |
| **Silent data corruption** | GPT-5 | — | — | #40, #62 |
| **Defense-in-depth gaps** | GPT-5 + Opus | — | — | #48 |
| **Security boundaries** | GPT-5 | Opus | — | #10-May |

**Key pattern:** Tasks requiring *identification* (what's missing? what's assumed?) are accessible to all models. Tasks requiring *verification* (is this sequence legal? does this contradict that?) favor reasoning models (GPT-5, Opus) and exclude Sonnet.

**Task category taxonomy:**

| Category | Sonnet value | Best models |
|----------|--------------|-------------|
| Systematic/exhaustive | None | GPT-5, Opus |
| Creative/generative | Meta-analytical synthesis | Opus, GPT-5 |
| Compliance/regulatory | Adequate but shallow | GPT-5 (deep), Sonnet (first-pass) |
| Cross-document | None | Opus strongly preferred |

---

## Part 3: Meta-Findings About How to Use Models

### 1. Signal-to-noise ratio matters more than model capability (#8)

When biased text is the ONLY input, even GPT-4.1 Mini catches it. When the same bias is buried inside a full PR review with diffs, issues, and project context, expensive models miss it. The issue isn't model intelligence; it's attention dilution.

**Implication:** For important analytical checks, isolate the signal. Extract the relevant text and ask about it specifically. Don't bury important questions inside broad review mandates.

### 2. Prompt framing dominates model personality for OPEN tasks (#26)

Opus's "finds design tensions" and GPT-5's "exhaustive enumeration" are DEFAULT tendencies, not hard limits. With structured prompts that explicitly ask for breadth, Opus produces MORE findings than GPT-5. With structured prompts asking for contradictions, GPT-5 becomes highly selective.

**Implication:** Model choice matters less than you think for any single task. Prompt structure is the primary lever. Model personality matters for open-ended tasks where you want emergent analytical behavior.

### 3. Narrow framing does NOT fix Sonnet's reasoning gaps (#39, #43)

Sonnet can't match GPT-5/Opus via narrow prompts alone. Narrow framing changes WHAT Sonnet looks for but not HOW WELL it reasons. Sonnet found 3 contradictions but only 1 was genuine (2 were misreadings). The gap is reasoning depth, not prompt engineering.

### 4. Task type predicts model performance better than "model X is better" (#13)

Sonnet scores 85% of GPT-5's performance on assumption-finding but drops to ~50% on concurrency reasoning. Don't extrapolate across task types.

### 5. The union of models finds the most (#19)

GPT-4.1 Mini + Sonnet covers ~71% of GPT-5's findings at 31% of the cost. But the missing 29% contains the domain-specific, interaction-level findings most likely to prevent production incidents. Each model also finds things the others miss, so the total unique finding space is larger than any single model's output.

### 6. Adversarial ensemble produces 30% more findings (#35)

Run GPT-5 for exhaustive enumeration, then give Opus GPT-5's findings and ask it to critique and extend. Result: 56 findings vs 43 (GPT-5 alone) or 28 (Opus alone). Zero full disagreements. The critique's structured assessment is more valuable than raw extensions. Cost: ~28% more tokens for 30% more coverage + prioritization.

### 7. Reasoning tokens change the KIND of analysis, not just the amount (#10)

Non-reasoning models ask "what could this mechanism fail at?" Reasoning models ask "what must be true about the world for this mechanism to work?" This is a qualitative difference in analytical mode, not just thoroughness.

### 8. Reasoning effort parameter is a no-op for analytical work (#21)

Low/medium/high reasoning effort had negligible effect on GPT-5's output for open-ended analysis. Task type is a far stronger predictor of reasoning behavior. Don't waste time tuning this parameter for document review.

### 9. Output length kills, input length doesn't (#6)

Single agents fail when trying to generate 1000+ line documents. Rich input context is fine; it's the output length that causes OOM/timeout. Break output into sections, keep input context rich, draft in parallel, assemble.

### 10. Document complexity shifts model rankings (#27)

Opus beat GPT-5 on coherence analysis for one document but lost on another (more complex) document. Rankings are not universal; they interact with document complexity, domain specificity, and prompt structure.

### 11. Token budget matters more than model size (#7b)

When output is truncated by token limits, even GPT-5 produces shallow findings. Ensure sufficient `max_completion_tokens` (≥16K for GPT-5). A cheap model with enough tokens beats an expensive model that runs out of space.

### 12. Opus excels at finding where specs believe false things (#31, #32)

Across experiments, Opus consistently finds where the spec's OWN ASSUMPTIONS are false. GPT-5 reasons about what the spec FAILS TO SAY.
Opus reasons about what the spec BELIEVES to be true and checks whether those beliefs hold. Different but complementary.

### 13. GPT-5's reasoning tokens are spent on VERIFICATION for regulatory tasks (#54)

For domain-specific regulatory analysis (IRS wash sale rules), GPT-5 consistently cited correct publication sections, code numbers, and regulatory references. The 9,600 reasoning tokens appear spent on verification, not generation.

---

## Part 4: Cost-Effectiveness

| Model | Typical tokens/finding | Relative cost | Best use case |
|-------|----------------------|---------------|---------------|
| Opus | 179-336 | 1x (baseline) | Cross-doc consistency, contradictions, design tensions |
| Sonnet 4.6 | 111-194 | 0.2-0.3x | Quick screening, structural review, assumption-finding |
| Sonnet 4.5 | 150-250 | 0.25x | Broad coverage when noise is acceptable |
| GPT-5 | 511-2,967 | 5-9x | High-stakes analysis where missing something has real cost |
| GPT-4.1 | ~500 | 0.5x | Middle-ground first pass |
| GPT-4.1 Mini | ~300 | 0.1x | Bulk screening, sanity checks |

**For financial/safety-critical systems:** Run all three (Opus + GPT-5 + Sonnet). The ~$1-2 total cost per document is trivially justified vs the value of comprehensive coverage.

**For routine review:** Opus alone or Sonnet + Opus pair. Skip GPT-5 unless the document is complex and the stakes justify it.

**For regulatory compliance:** GPT-5 for depth + correct citations, Sonnet for first-pass breadth.

---

## Part 5: Open Questions

### Still Unanswered

1. **Are these findings corpus-specific?** All 74 experiments used gargoyle architecture docs. Different domains may shift rankings.
2. **How much do results vary across runs?** All findings are single-run. Stochastic variation is unquantified.
3. **What happens on 2000+ line documents?** Largest tested is 1,110 lines. Unknown if model rankings shift at extreme scale.
4. **Cross-document consistency as maintenance tool:** Does running cross-doc analysis across MORE document pairs yield additional real inconsistencies? Could become a systematic documentation maintenance tool.
5. **Why Opus dominates cross-doc consistency:** Is it because contradictions are easy to verify once spotted (reducing GPT-5's verification advantage)? Or because boundary reasoning (Opus's strength) is the primary skill needed?

### Answered Questions (from open-questions.md)

- ~~Opus + narrow framing for contradiction detection~~ → **WRONG QUESTION** (#43). Opus doesn't try to match GPT-5; it finds a different CLASS of contradiction. GPT-5 finds specification conflicts (same scenario, conflicting prescriptions). Opus finds logical impossibilities (rules whose interaction produces impossible conditions). Neither dominates.
- ~~Sonnet + narrow framing = GPT-5 level?~~ → **NO** (#39). The gap is reasoning depth, not framing.
- ~~Adversarial ensemble (GPT-5 → Opus)?~~ → **YES** (#35). 30% more findings at 28% more cost.
- ~~Opus's "missing feature identification" mode: is it promptable?~~ → **YES** (#26). All models find regulatory gaps when explicitly prompted.
- ~~Is Opus > GPT-5 for coherence tasks universal?~~ → **NO** (#27). Document complexity affects ranking.
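
The critique-then-extend ensemble confirmed in #35 can be sketched as a two-stage pipeline. This is a minimal illustration under stated assumptions, not the harness used in the experiments: the two model calls are passed in as plain callables (`enumerate_findings` and `critique_and_extend` are hypothetical names, not a real API), findings are plain strings, and deduplication is a naive case-insensitive match. `relative_gain` reproduces the reported arithmetic for 56 ensemble findings against 43 for GPT-5 alone.

```python
from typing import Callable, List


def critique_then_extend(
    document: str,
    enumerate_findings: Callable[[str], List[str]],
    critique_and_extend: Callable[[str, List[str]], List[str]],
) -> List[str]:
    """Run an enumerating model first, then hand its findings to a second
    model for critique and extension. Both callables are hypothetical
    stand-ins for real model calls."""
    base = enumerate_findings(document)
    extensions = critique_and_extend(document, base)

    # Merge while preserving order; drop exact repeats (case-insensitive).
    seen = set()
    merged: List[str] = []
    for finding in base + extensions:
        key = finding.strip().lower()
        if key not in seen:
            seen.add(key)
            merged.append(finding)
    return merged


def relative_gain(ensemble_count: int, baseline_count: int) -> float:
    """Fractional increase in findings of the ensemble over a baseline."""
    return (ensemble_count - baseline_count) / baseline_count
```

With the counts from #35, `relative_gain(56, 43)` is about 0.30, which matches the reported 30% increase over GPT-5 alone.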