Files

T

Rodin 5426026908 docs: regenerate weekly report (2026-05-18)

2026-05-18 16:10:16 +00:00

26 KiB

Raw Blame History

Model Research Report: AI Models for Analytical Work

Generated: 2026-05-18 09:02 PDT
Findings analyzed: 80
Period: 2026-04-26 to 2026-05-15
Corpus: gargoyle architecture docs, review-bot security code, dev pipeline metrics

80 experiments across 20 days. Six models tested on architecture document analysis, security review, and development process effectiveness.

What's New (Since May 11)

6 new findings (74 → 80) covering:

Finding #78: Dev Loop Effectiveness Analysis — Quantitative audit of the gargoyle autonomous development pipeline. Key results: dual-bot review (Sonnet + GPT) achieved 32% REQUEST_CHANGES rate vs 2% for single-bot. Post-merge review caught 100 escaped defects (all fixed). 22% of post-merge findings were missing test coverage. Sonnet-review-bot dropout was the single largest quality regression.
Finding #79 (two parts): Multi-Model Security Review Catches SSRF Gaps — Dedicated security-reviewer persona caught CGN range bypass (100.64.0.0/10 not covered by Python is_private) and proxy-assisted SSRF (Go http.DefaultTransport cloning preserves ProxyFromEnvironment). Standard reviewers (Sonnet, GPT) both approved — only the specialized security persona blocked merge. Validates: domain-specialized reviewer roles outperform generalist "security-aware" review.
Finding #79b: HTTPS Enforcement Bypass in GitHub Client — Security reviewer caught write-path methods (PostReview, DeleteReview, RequestReviewer) bypassing the doRequest HTTPS guard. Standard reviewers missed it. 30-minute fix cycle from detection to re-approval. Validates: write-path code paths deserve extra security scrutiny.
Finding #80: Config-A/B Dispatcher Malfunction — The even/odd PR parity routing for multi-model review experiments was NOT operational. All 6 reviewers fired on all PRs simultaneously, causing 3.5x API cost overage and invalidating Phase 1 baseline metrics. Demonstrates: operational monitoring of AI pipeline configuration is critical.

Key new insights:

Specialized reviewer personas provide value that model capability alone cannot replicate
Multi-model review pipelines need operational monitoring (cost, dispatch correctness)
Dual-bot disagreement acts as a natural quality ratchet — removing one bot degrades quality disproportionately
Security review at library/OS-interaction boundaries requires domain-specific knowledge (CGN ranges, proxy inheritance)

Executive Summary

We tested GPT-5, Claude Opus 4.6, Claude Sonnet 4.6, Claude Sonnet 4.5, GPT-4.1, and GPT-4.1 Mini on analytical tasks: finding hidden assumptions, race conditions, design contradictions, adversarial attack paths, regulatory gaps, cross-document inconsistencies, operational blind spots, emergent behaviors, security boundaries, and multi-model review pipeline effectiveness.

The central finding: Different models don't just find more or fewer things — they find qualitatively different kinds of things. Model choice is task-dependent, and no single model dominates all analytical work.

The secondary finding: Task type predicts model performance better than "model X is better." A model that excels at gap-finding may struggle at contradiction detection. Match the model to the task.

The tertiary finding (new): Reviewer persona (security-focused, domain-focused) matters as much as model capability. A dedicated security reviewer using the same model catches issues that a generalist reviewer on the same model misses.

Part 1: What Each Model Is Good At

GPT-5

Strength: Exhaustive enumeration + domain-specific reasoning about the real world.

GPT-5's reasoning tokens change the kind of analysis, not just the depth. Non-reasoning models identify risks within a document's own frame of reference. GPT-5 reasons about the document's relationship to the external world: broker semantics, deployment topology, OTP runtime behavior under load, timing correlations across independent subsystems, IRS regulatory requirements.

Capability	Evidence
Domain-specific gaps	#9, #31, #63: Broker rate limiting, credential rotation, entitlement gaps
Multi-component interactions	#10, #14, #68: Finds assumptions requiring cross-boundary reasoning
Adversarial enumeration	#29, #35: Most thorough attack surface coverage
Temporal boundary analysis	#18, #41: 15+ findings with mathematical precision
Regulatory compliance	#23, #38, #54, #61, #64: Correct IRS/FINRA citations, regulatory edge cases
Silent data corruption	#40, #62: Traces multi-step corruption paths
Invariant violation paths	#20: Precise, verifiable paths through state space
Operational blind spots	#46: 18 findings including cross-service trace gaps
State machine completeness	#58: 16 gaps including race windows during state transitions
Concurrent write hazards	#65: 19 hazards with specific ordering interleavings
External system assumptions	#63: 24 assumptions about broker APIs, network behavior
Counterfactual event ordering	#60: 30 findings through systematic permutation
Specification gap analysis	#64b: 17 implementation-divergence scenarios
Convention rule gaps	#59: 34 findings through section-by-section enumeration

Best at: gap-finding, hidden assumptions, adversarial enumeration, temporal boundary analysis, regulatory compliance, operational blind spots, state machine analysis, exhaustive permutation
Unique ability: finds multi-component interaction failures requiring domain knowledge + systematic enumeration
Weakness: slow (2-4x longer than Opus), expensive (5-10x more tokens per finding), sometimes over-enumerates low-severity variants
Finding count: typically 15-35 depending on document complexity
Typical cost: $0.30-0.50 per experiment

Claude Opus 4.6

Strength: Design tensions, logical argumentation, creative adversarial thinking, cross-document consistency, failure mode reasoning.

Opus consistently identifies where one part of a design undermines another part. It doesn't enumerate failure modes — it finds the deeper question of why the design's own principles conflict.

Capability	Evidence
Contradiction detection	#25, #43: Finds logical impossibilities via deductive reasoning
Cross-document consistency	#28, #37, #44: 2.4x faster than GPT-5, finds more issues
Race conditions (design-level)	#13: 10 high-quality findings, self-corrects mid-analysis
Adversarial creativity	#29, #35: "Your safety mechanism IS your vulnerability" patterns
False assumption detection	#31, #32: Finds where spec's own logic contradicts itself
Emergent behavior insight	#47: Stop-loss defeated by temporal composition (best single finding)
Survivor bias identification	#46: Decision latency histogram hides stuck decisions
Degraded-mode propagation	#52: 10 findings including lost-pending-state indistinguishability
Failure propagation chains	#42: 10 findings in 4K tokens — 2.2x more token-efficient than GPT-5
Security boundary tensions	Security analysis: Signal reconnaissance via audit blindspot
Design-level incompleteness	#51, #52: Where the model is fundamentally underspecified

Best at: contradiction detection, cross-document consistency, race conditions (design-level), adversarial creativity, finding false assumptions, degraded-mode propagation, failure mode reasoning
Unique ability: self-corrects mid-analysis, finds where protection mechanisms become vulnerabilities, identifies logical impossibilities from rule interaction
Most efficient model: 6-9x fewer tokens per finding than GPT-5 on shared task types
Weakness: produces fewer findings on pure enumeration tasks (10-13 vs GPT-5's 20-35)
Typical cost: $0.08-0.15 per experiment

Claude Sonnet 4.6

Strength: Fast structural scanning, implementation-perspective findings, inter-document contradiction detection, quick first-pass screening.

Sonnet is the fastest and cheapest model. It catches ~60-80% of findings on structural tasks. On inter-document contradiction detection (Finding #67), it outperformed GPT-5: more findings, better severity calibration, 10x faster.

Capability	Evidence
Quick structural scanning	#12, #14: 17 findings in 35s; recovers with structure
First-pass screening	#51, #63: Catches ~60% of findings at 1/5 the cost
Inter-document contradictions	#67: 5 findings (3 Critical) vs GPT-5's 4 (0 Critical), 14s vs 136s
Implementation perspective	#51: Finds what would confuse a developer writing the actual code
Regulatory category identification	#61: Finds structural gaps in regulatory coverage quickly
Cross-component basics	#14: 8 findings with good structure when prompts are explicit

Best at: fast first-pass, structural scanning, inter-document contradictions, implementation-perspective ambiguities
Unique ability: reasons from implementer's perspective ("if I were coding this, what would I be unsure about?")
Strength: 5-10x faster than GPT-5, useful for time-constrained reviews
Weakness: ~33% precision on self-contradiction detection (misreads), cannot match GPT-5/Opus depth on verification tasks, zero unique insights on many analytical tasks
Finding count: typically 7-17 depending on task type
Typical cost: $0.01-0.03 per experiment

GPT-4.1 and GPT-4.1 Mini

Role: Non-reasoning baselines for structured tasks.

Capability	Evidence
Structured output	#4: Best at consistent JSON/table format
Quick gap-finding	#9: 14 findings at lowest cost tier
Delegation target	#4, #14: Good enough for simple enumeration tasks
Cross-component basics	#14: Finds obvious interactions

GPT-4.1: Solid non-reasoning baseline, 14 findings on gap-finding tasks
GPT-4.1 Mini: Cheapest screening, 12 findings on gap-finding, useful for triage
Neither suitable for verification, contradiction, or adversarial analysis

Claude Sonnet 4.5

Role: Predecessor comparison, broad-but-noisy coverage.

Produces more findings than Sonnet 4.6 but with more noise
Less precise severity calibration
Use when breadth > precision (initial exploration)

Part 2: Task-Type Performance Matrix

Core Task Types (validated across 5+ experiments)

Task Type	Best Model(s)	Evidence	Notes
Hidden assumption identification	GPT-5 + Opus	#10-12, #53	GPT-5 for breadth, Opus for design tensions
Gap-finding (what's missing)	GPT-5	#9, #31, #46	Dominates on exhaustive enumeration
Self-contradiction detection	GPT-5 + Opus	#25, #43	Different types: spec conflicts vs logical impossibilities
Cross-document consistency	Opus (primary)	#28, #37, #44	2.4x faster, more findings than GPT-5
Inter-document contradictions	Sonnet (primary)	#67	Outperforms GPT-5 on parallel comparison
Race condition identification	GPT-5 + Opus	#13, #50	Sonnet unreliable for concurrency
Adversarial attack paths	GPT-5 → Opus ensemble	#29, #35	30% more findings with critique+extend
Regulatory compliance	GPT-5 (primary)	#23, #38, #54, #61	Correct citations, regulatory edge cases
Operational blind spots	GPT-5 + Opus	#46	GPT-5 coverage mapping; Opus false confidence
Temporal ordering dependencies	GPT-5 + Opus	#18, #41	Different aspects of temporal reasoning
Failure propagation chains	Opus + GPT-5	#42	Opus architectural insight; GPT-5 enumeration
Silent data corruption	GPT-5	#40, #62	Traces multi-step paths through accounting

Newer Task Types (validated in 2-4 experiments)

Task Type	Best Model(s)	Evidence	Notes
Emergent behavior / rule composition	GPT-5 + Opus	#47	GPT-5 feedback loops; Opus best single finding
Defense-in-depth gaps	GPT-5 + Opus	#48	Complementary coverage
Concurrency / write hazards	GPT-5	#50, #65	Exhaustive hazard enumeration
Implementation ambiguity	All viable	#51	Smallest model gap; Sonnet viable
Degraded-mode propagation	Opus + GPT-5	#52	Opus finds boundary semantic mismatches
State machine completeness	GPT-5	#58	16 gaps through systematic transition coverage
Convention/specification gaps	GPT-5	#59	34 findings via section-by-section enumeration
Counterfactual event ordering	GPT-5	#60	30 findings through systematic permutation
Data integrity / signal flow	GPT-5 + Opus	#62	GPT-5 audit gaps; Opus semantic violations
External system assumptions	GPT-5	#63	Reasoning about systems NOT in the document
Temporal correctness	Opus	#65b	Stronger on cross-component temporal coupling
Cross-context contracts	GPT-5	#68	Flow tracing across bounded contexts
Security boundary analysis	Opus + GPT-5	Security	Opus finds design tension exploits
Event flow correctness	All (different strengths)	#57	GPT-5 domain knowledge; Opus crash scenarios; Sonnet structure
Boundary contract analysis	GPT-5 + Opus	Boundary	Exhaustive + design-level
Operational burden analysis	GPT-5 + Opus	#45, #56	Different definitions of "operator load"

Security Code Review (new category)

Task Type	Best Approach	Evidence	Notes
SSRF defense review	Dedicated security persona	#79	Catches CGN, proxy bypass
HTTPS enforcement audit	Dedicated security persona	#79b	Catches inconsistent call-site guards
Multi-model security pipeline	Specialized + generalist	#79, #79b	Security persona blocks what generalists approve

Part 3: Meta-Findings

3.1 — Model Complementarity Is the Dominant Pattern

No single model dominates. Across all task types, the union of model findings is 30-60% larger than the best single model. This isn't noise — unique findings from each model are consistently validated as genuine.

Evidence: Finding #42 — Failure propagation chains. Opus: 10 findings in 4K tokens. GPT-5: 10 findings in 9K tokens. Same count, but only ~60% overlap. The non-overlapping findings from each are architecturally significant.

3.2 — Two Distinct Modes of Contradiction Detection

Mode	Best Model	What It Catches	Cognitive Demand
Specification conflicts	GPT-5	Same scenario, different prescriptions	Statement comparison + verification
Logical impossibilities	Opus	Rules that can't coexist under all conditions	Multi-step deductive reasoning

From Finding #43: GPT-5 and Opus don't compete on contradiction detection — they find entirely different classes of contradiction. Run both for complete coverage.

3.3 — Narrow Framing Cannot Fix Reasoning Gaps

Finding #39 (confirmed by #43): Giving Sonnet a focused "check for contradictions" prompt changes WHAT it looks for but not HOW WELL it reasons. Sonnet with narrow framing found 3 contradictions but only 1 was genuine. The gap between Sonnet and reasoning models is architectural — you cannot prompt-engineer around it.

3.4 — Adversarial Ensemble Produces Superior Coverage

Finding #35: GPT-5 → Opus critique+extend pipeline produces 30% more findings than either model alone. Zero full disagreements during critique. Extension phase adds genuinely new High-severity findings. Cost: ~28% more tokens for 30% more coverage + prioritization.

3.5 — Document Type Shapes Finding Character

Document Level	Best Analytical Lenses	Evidence
Overview/architecture	Failure propagation, blast radius, isolation verification	#42
Component specifications	Race conditions, invariant violations, hidden assumptions	#13, #20
Cross-cutting docs	Temporal ordering, recovery hazards, cross-context contracts	#41, #68
Convention/rules docs	Exhaustive enumeration, contradiction detection	#59
Regulatory specs	Compliance gap analysis, regulatory cross-referencing	#61, #64

3.6 — Token Budget Matters More Than Model Choice (for some tasks)

Finding #7b: A truncated GPT-5 response is worse than a complete Opus response. GPT-5 needs max_completion_tokens ≥ 16K. When token budgets are equal, the model gap narrows on enumeration tasks but widens on verification tasks.

3.7 — Reasoning Effort Settings Have Negligible Effect

Finding #21: Low/medium/high reasoning effort on GPT-5 produced nearly identical output quality. Either the parameter doesn't work for open-ended analysis or the tasks were within GPT-5's "easy" threshold.

3.8 — Inter-Document Contradiction: Sonnet's One Dominance

Finding #67: On inter-document contradiction detection (comparing two documents for conflicting statements), Sonnet outperformed GPT-5: 5 findings (3 Critical) vs 4 findings (0 Critical) in 14s vs 136s. This is the only task type where Sonnet clearly dominates. The hypothesis: this task requires parallel comparison (pattern matching across two texts) which benefits from Sonnet's approach more than GPT-5's serial deep reasoning.

3.9 — Reviewer Persona > Model Capability for Security

Findings #79, #79b: A dedicated security-reviewer persona caught critical SSRF gaps that both Sonnet and GPT generalist reviewers missed and approved. The security reviewer uses structured criteria (trust boundaries, library semantics, OS interaction) that generalist review prompts don't invoke. Persona specialization provides unique coverage beyond model improvement.

3.10 — Dual-Bot Disagreement as Quality Ratchet

Finding #78: In the gargoyle dev pipeline, dual-bot review (Sonnet + GPT) achieved 32% REQUEST_CHANGES rate. After dropping to single-bot (GPT only), REQUEST_CHANGES dropped to 2%. The disagreement between two models — where one blocks while the other approves — creates a natural quality gate. Removing one model from the pipeline disproportionately degrades review rigor.

Part 4: Cost-Effectiveness

Per-Experiment Cost by Model

Model	Typical Time	Typical Output	Typical Cost	Findings/$
GPT-5	80-140s	7-11K tokens	$0.30-0.50	30-60
Claude Opus 4.6	50-120s	2-5K tokens	$0.08-0.15	80-130
Claude Sonnet 4.6	15-40s	1-2K tokens	$0.01-0.03	300-700
GPT-4.1	20-40s	2-4K tokens	$0.03-0.06	200-400
GPT-4.1 Mini	10-20s	1-2K tokens	$0.005-0.01	1000+

Three-Model Ensemble Cost

Running GPT-5 + Opus + Sonnet on a single document:

Total cost: ~$0.40-0.70
Total time: ~3-5 minutes (sequential)
Total unique findings: Typically 1.3-1.6x the best single model
Value proposition: Finding one Critical issue before production justifies the entire research budget

Efficiency Ratios

Metric	GPT-5	Opus	Sonnet
Tokens per finding	500-1000	200-400	100-200
Time per finding	6-10s	5-10s	2-4s
Unique finding rate	25-40%	20-35%	5-15%
False positive rate	<5%	<5%	15-33% (verification tasks)

When to Use Each Tier

Budget	Approach	Expected Coverage
Minimal ($0.01-0.03)	Sonnet only	~60% of findings, fast
Standard ($0.15-0.20)	Opus + Sonnet	~80% of findings, good depth
Comprehensive ($0.50-0.70)	GPT-5 + Opus + Sonnet	~95% of findings, full coverage
Critical ($1-2)	Ensemble (GPT-5 → Opus critique+extend) + Sonnet	Maximum coverage with prioritization

Part 5: Pipeline Findings (Dev Loop Analysis)

Multi-Model Review Pipeline Effectiveness (Finding #78)

Metric	Dual-Bot (Sonnet+GPT)	Single-Bot (GPT only)
REQUEST_CHANGES rate	32%	2%
Avg reviews per PR	7-11	22-30
Post-merge escape rate	Declining	Unknown (too recent)
Most caught category	Missing tests (22%)	—

Security Review Pipeline (Findings #79, #79b)

Standard Reviewer	Security Reviewer	Outcome
APPROVED (both Sonnet + GPT)	REQUEST_CHANGES	Correct (security issue was real)
Generalist prompt	Domain-specific criteria	Security persona provides unique value
Misses library semantics	Catches Python `is_private` gaps	Domain knowledge matters
Misses OS interaction	Catches proxy inheritance	Cross-layer reasoning matters

Operational Lessons (Finding #80)

Config-A/B parity routing must be actively monitored
All-reviewer-fire-always costs 3.5x expected budget
Phase 1 baseline invalidated by dispatcher malfunction
Operational monitoring of AI pipelines is non-optional

Part 6: Validated Analytical Lenses (Full Catalog)

The research has validated 28 distinct analytical lenses for architecture document review:

#	Lens	First Tested	Key Findings
1	Hidden assumption identification	#10	GPT-5 + Opus complementary
2	Gap-finding	#9	GPT-5 dominates
3	Bias detection	#8	Signal isolation matters most
4	Self-contradiction detection	#25, #43	Two distinct modes
5	Cross-document consistency	#28	Opus dominates
6	Inter-document contradictions	#67	Sonnet dominates
7	Race condition identification	#13	GPT-5 + Opus
8	Temporal boundary analysis	#18	GPT-5 + Opus
9	Cross-component interaction	#14	All models viable
10	Adversarial manipulation	#29	Ensemble best
11	Design coherence	#15, #27	Document-dependent
12	Spec completeness	#16	Sonnet 4.5 adequate
13	Missing-feature identification	#26	Promptable across all
14	Operational blind spots	#46	GPT-5 + Opus
15	Emergent behavior / rule composition	#47	GPT-5 feedback loops; Opus insight
16	Defense-in-depth gaps	#48	GPT-5 + Opus
17	Adversarial evasion/tampering	#49	GPT-5 + Opus
18	Concurrency / race conditions	#50	GPT-5 exhaustive
19	Implementation ambiguity	#51	All viable (smallest gap)
20	Degraded-mode propagation	#52	Opus + GPT-5
21	Failure propagation chains	#42	Opus insight; GPT-5 coverage
22	State machine completeness	#58	GPT-5 dominates
23	Convention/specification gaps	#59	GPT-5 dominates
24	Counterfactual event ordering	#60	GPT-5 systematic permutation
25	Regulatory completeness	#61, #64	GPT-5 regulatory; Opus operational
26	Data integrity / signal flow	#62	GPT-5 audit; Opus semantic
27	External system assumptions	#63	GPT-5 exhaustive
28	Security boundary analysis	Security	Opus tension exploits

Part 7: Open Questions

High Priority

Does the dual-bot quality ratchet scale? Finding #78 showed dramatic quality degradation when dropping from 2 to 1 reviewer bot. Would 3 bots (adding Opus) further improve? Or is the marginal value of the 3rd reviewer diminishing?
Security persona transferability: Findings #79/#79b validate specialized security review on SSRF/HTTPS code. Does the same persona pattern work for auth, crypto, and other security domains? Or does each domain need a separately-tuned persona?
Config-A/B measurement recovery: With the dispatcher now fixed, can Phase 1 data be salvaged (all reviewers ran, so both configs' data exists), or must the experiment restart?
Reasoning effort on harder documents: Finding #21 showed negligible effect on a moderately complex document. Test with a genuinely hard document (1000+ lines, multiple interacting concerns) to see if reasoning effort matters when the task exceeds the "easy" threshold.
Model personality vs prompt: Finding #26 showed missing-feature identification is promptable. How many other "model personality" observations are prompt framing effects? Systematic test needed.

Medium Priority

Cross-corpus generalization: All findings are on a single corpus (gargoyle). Do the model rankings hold for other domains (infrastructure, web apps, data pipelines)?
Opus for inter-document contradictions: Finding #67 showed Sonnet outperforming GPT-5. Would Opus (with its boundary reasoning strength) outperform both?
Automated lens selection: Given 28 validated lenses, can a model accurately select which lenses apply to a given document? Or does human judgment remain necessary?
Longitudinal review effectiveness: As the codebase improves from post-merge review findings, does the multi-model review pipeline's REQUEST_CHANGES rate stabilize or continue declining?

Answered (from previous period)

~~Opus + narrow framing for contradiction detection~~ → ANSWERED (#43): Different class of findings, not comparable
~~Sonnet + narrow framing = GPT-5 level?~~ → ANSWERED (#39): No. Reasoning depth, not framing, is the bottleneck
~~Adversarial ensemble value?~~ → ANSWERED (#35): Yes, 30% more coverage at 28% more cost
~~Is Opus > GPT-5 universal for coherence?~~ → ANSWERED (#27): No, document-dependent

Methodology

See methodology.md for full experimental setup. Key constraints:

Same input text to all models (no information advantage)
Structured prompts with explicit categories and output format
No tools, no project context beyond the document(s) under analysis
Independent runs (no cross-pollination between models)
Single researcher evaluating findings (subjectivity acknowledged)
Single corpus (gargoyle) — domain bias possible

Models: GPT-5, Claude Opus 4.6, Claude Sonnet 4.6, GPT-4.1, GPT-4.1 Mini, Claude Sonnet 4.5

Conclusion

After 80 experiments, the evidence strongly supports a multi-model approach to analytical work:

No single model dominates — task type determines the best model
The union always exceeds the parts — run multiple models for critical work
Persona specialization adds unique value — beyond model capability
Operational monitoring matters — AI pipelines need the same rigor as production systems
The research pays for itself — total budget (~$30-50 over 20 days) vs value of findings applied to a real production system

The next frontier is operationalizing these findings: automated lens selection, pipeline health monitoring, and measuring downstream impact of review quality on production defect rates.

26 KiB Raw Blame History