Initial publish: 29 findings, 6 prompts, methodology, open questions

Full comparative analysis of GPT-5, Claude Opus 4.6, Claude Sonnet 4.6, GPT-4.1, and GPT-4.1 Mini on analytical tasks (not coding). Contents: - findings/ALL-FINDINGS.md — complete 3,249-line research log with all 29 findings, methodology notes, and open questions - prompts/ — 6 exact prompts used across experiments - methodology.md — experimental setup and evaluation criteria - open-questions.md — unanswered questions for future work - README.md — overview and summary table Key findings: - Cross-document consistency: Opus is 2.4x faster with more findings - Gap-finding: GPT-5 reasoning tokens find domain-specific gaps - Race conditions: Opus excels at temporal interaction reasoning - Bias detection: Signal-to-noise ratio > model capability - Adversarial analysis: GPT-5 exhaustive, Opus qualitatively different Signed-off-by: Rodin
2026-05-05 19:13:03 -07:00
parent 4aea0d004b
commit 1b108ff66e
10 changed files with 3831 additions and 2 deletions
@@ -1,3 +1,81 @@
-# model-research
+# Model Research — AI for Analytical Work

-Comparative analysis of AI models on analytical tasks — not coding. Tracking what works when using GPT-5, Claude Opus, Claude Sonnet, and GPT-4.1 for research, document review, bias detection, and architecture analysis.
+Comparative analysis of AI models on **analytical tasks** — not coding.
+
+Most public discussion about LLM capabilities focuses on code generation.
+We found almost no published methodology for using models in analytical
+research tasks (searched 2026-04-26). This repo fills that gap with
+controlled experiments and reproducible findings.
+
+## What We're Testing
+
+Using GPT-5, Claude Opus 4.6, Claude Sonnet 4.6, and GPT-4.1 (+ Mini) for:
+
+- Architecture document review
+- Bias and assumption detection
+- Gap-finding in design specifications
+- Cross-document consistency analysis
+- Race condition identification
+- Adversarial path analysis
+- Contradiction detection
+- Regulatory compliance review
+
+## Key Findings (Summary)
+
+| # | Task Type | Winner | Key Insight |
+|---|-----------|--------|-------------|
+| 1 | PR review | Both | Different models catch different things — Sonnet: structural, GPT-5: semantic |
+| 2 | Bias detection | Framing | Signal-to-noise ratio matters more than model capability |
+| 9 | Gap-finding | GPT-5 | Reasoning tokens find domain-specific gaps, not generic ones |
+| 10 | Hidden assumptions | GPT-5 | Reasoning produces qualitatively different (not just more) findings |
+| 13 | Race conditions | Opus | Temporal interaction reasoning is Opus's strongest domain |
+| 15 | Design coherence | Task-dependent | Single-doc: model choice depends on document complexity |
+| 25 | Contradiction detection | Opus | Precision > exhaustiveness; Opus's self-correction is unique |
+| 28 | Cross-doc consistency | Opus | 2.4x faster than GPT-5 with more findings; boundary reasoning |
+| 29 | Adversarial analysis | GPT-5 + Opus | GPT-5: exhaustive; Opus: qualitatively different attack vectors |
+
+## Methodology
+
+Each experiment:
+1. Same input document(s) to all models
+2. Same structured prompt with explicit categories to analyze
+3. No tools, no project context beyond the document(s)
+4. Independent runs — no cross-pollination between models
+5. Results evaluated for: correctness, uniqueness, actionability
+
+**Context dimensions tracked:**
+- Rich vs minimal (how much background info)
+- Broad vs focused ("review this" vs "answer this specific question")
+- What kind of context (diff, full files, issue text, nothing)
+- Whether the model had tools or just text
+- Whether the task was step-by-step or open-ended
+
+## Repository Structure
+
+```
+findings/           # Individual findings with full analysis
+  01-different-models-different-things.md
+  02-narrow-lens-vs-broad-review.md
+  ...
+  28-cross-document-consistency.md
+  29-adversarial-manipulation.md
+prompts/            # Exact prompts used for reproducibility
+  cross-document-consistency.md
+  design-coherence.md
+  gap-finding.md
+  hidden-assumptions.md
+  ...
+open-questions.md   # Unanswered questions for future experiments
+methodology.md      # Full methodology notes
+```
+
+## Who We Are
+
+This research is conducted by [Rodin](https://gitea.weiker.me/rodin) (AI
+assistant) and Aaron Weiker. The test corpus is gargoyle — an Elixir
+trading system with extensive architecture documentation (~35 design docs,
+~5000 lines).
+
+## License
+
+CC BY 4.0 — share and adapt with attribution.
@@ -0,0 +1,76 @@
+# Methodology
+
+## Principles
+
+1. **Internet opinions about models are overwhelmingly about coding.** Don't
+   extrapolate to analytical work without testing.
+2. **"Just because someone says it on the internet doesn't make it right."**
+   Opinions need context. Track our own evidence.
+3. **Absence of published methodology for a use case is itself a finding.**
+4. **No unsupported generalizations.** Each finding needs: date, task,
+   how we used it (context shape, task framing, what info the model
+   had/didn't have), what happened, takeaway.
+
+## Experimental Setup
+
+### Models Tested
+
+| Model | Provider | Access | Notes |
+|-------|----------|--------|-------|
+| GPT-5 | OpenAI (via HAI proxy) | API | Requires `max_completion_tokens` ≥16K |
+| Claude Opus 4.6 | Anthropic (via HAI proxy) | API | Internal reasoning (not exposed) |
+| Claude Sonnet 4.6 | Anthropic (via HAI proxy) | API | Fast, cost-effective |
+| GPT-4.1 | OpenAI (via HAI proxy) | API | Non-reasoning, structured output |
+| GPT-4.1 Mini | OpenAI (via HAI proxy) | API | Cheapest, good for screening |
+| Claude Sonnet 4.5 | Anthropic (via HAI proxy) | API | Predecessor to 4.6 |
+
+### Control Variables
+
+- **Same input:** All models receive identical document text
+- **Same prompt:** Structured prompt with explicit categories and output format
+- **Same constraints:** No tools, no project context beyond the document(s)
+- **Independent runs:** No cross-pollination between model runs
+- **Temperature:** 0.3 for GPT-4.1/Mini; default (1.0) for GPT-5 (required)
+
+### Measurement
+
+- **Time:** Wall clock from request to final token
+- **Output tokens:** Total generated tokens
+- **Reasoning tokens:** For reasoning models (GPT-5), exposed separately
+- **Findings count:** Number of distinct issues identified
+- **Unique findings:** Issues found by only one model
+- **Severity distribution:** Critical / High / Medium / Low per finding
+- **Tokens per finding:** Efficiency metric
+
+### Evaluation Criteria
+
+Each finding is assessed for:
+1. **Correctness:** Is the identified issue real?
+2. **Uniqueness:** Did only this model find it?
+3. **Actionability:** Would a developer change something based on this?
+4. **Depth:** Surface observation vs architectural insight?
+
+### Context Dimensions Tracked
+
+| Dimension | Options |
+|-----------|---------|
+| Context richness | Rich (full project) vs Minimal (document only) |
+| Task framing | Broad ("review this") vs Focused ("check for X") |
+| Context type | Diff, full files, issue text, research notes, nothing |
+| Tool access | With tools (API calls, file reads) vs text-only |
+| Task structure | Step-by-step explicit vs open-ended |
+
+## Limitations
+
+- Single test corpus (gargoyle architecture docs) — domain bias possible
+- Single researcher evaluating findings — subjectivity in quality assessment
+- Models are non-deterministic — single runs, not averaged
+- Proxy adds latency — timing comparisons are relative, not absolute
+- Internal reasoning tokens not visible for Claude models
+
+## Reproducibility
+
+Prompts for each experiment are in the `prompts/` directory. The test
+corpus is the gargoyle project's `docs/` directory (available at
+`gitea.weiker.me/grgl/gargoyle`). Each finding documents the exact document
+used, its line count, and the specific version/commit when relevant.
@@ -0,0 +1,58 @@
+# Open Questions
+
+Unanswered questions from experiments, ordered by potential impact.
+
+## High Priority
+
+### Signal-to-noise confirmation (from Finding #8)
+Give a model the FULL PR review context (diff, files, issue, AC) but add
+the narrow bias question as an explicit review checklist item. If the model
+catches bias despite the rich context, it confirms the signal-to-noise
+hypothesis. If it misses, it suggests something else (attention allocation,
+task switching cost).
+
+### Cross-document consistency as maintenance tool (from Finding #28)
+Does running cross-doc analysis across MORE document pairs (domain readmes
+vs implementation docs, design docs vs plan docs) yield additional real
+inconsistencies? Could become a systematic documentation maintenance tool.
+
+### Why Opus dominates cross-doc consistency (from Finding #28)
+Opus was 2.4x faster AND found more issues than GPT-5. Is this because
+cross-doc contradictions are easy to verify once spotted (reducing GPT-5's
+verification advantage)? Or because boundary reasoning (Opus's strength)
+is the primary skill needed?
+
+### Sonnet + narrow framing = GPT-5 level? (from Finding #5)
+Would Sonnet catch semantic issues if given a narrower "check for logical
+consistency" framing instead of broad review? The hypothesis: Sonnet's
+"structural reviewer" tendency is a framing artifact, not a capability limit.
+
+## Medium Priority
+
+### Adversarial analysis ensemble (from Finding #29)
+Run GPT-5 and Opus sequentially — give Opus access to GPT-5's findings
+and ask it to critique and extend. Does the ensemble find more than either
+alone? Does Opus's system-level thinking complement GPT-5's exhaustiveness?
+
+### Reasoning effort parameter (from Finding #21)
+Reasoning effort (low/medium/high) had negligible effect on GPT-5's
+analytical output. Is this because the parameter doesn't work for open-ended
+analysis? Or because the task was already within GPT-5's "easy" threshold?
+Test with a harder document.
+
+### Model personality vs prompt (from Finding #26)
+Missing-feature identification IS promptable across all models — prompt
+framing eliminates Opus's historical advantage. How many other "model
+personality" observations are actually just prompt framing effects?
+
+## Answered Questions
+
+- ~~Opus's "missing feature identification" mode — is it promptable?~~
+  **YES** (Finding #26): all models find regulatory gaps when explicitly
+  prompted. Opus's behavior was an emergent DEFAULT tendency, not a unique
+  capability.
+
+- ~~Is Opus > GPT-5 for coherence tasks universal?~~
+  **NO** (Finding #27): Opus's advantage from Finding #15 was document-
+  specific. On risk-controls.md (992 lines, more complex), GPT-5 regained
+  top position. Document complexity and domain specialization affect ranking.
@@ -0,0 +1,59 @@
+# Prompt: Adversarial Manipulation Analysis
+
+Used in Finding #29.
+
+## Setup
+
+- Single document (full text)
+- Same prompt to all models
+- No tools, no project context beyond the document
+
+## Prompt
+
+```
+You are a red-team security analyst reviewing a trading system's
+aggregation component. Your task is to identify how a MISBEHAVING,
+COMPROMISED, or BUGGY upstream component could exploit this design
+to produce harmful trading outcomes that bypass downstream safety controls.
+
+## Categories of adversarial manipulation:
+
+1. **Signal injection** — How could a compromised strategy inject signals
+   that exploit the aggregator's logic to produce dangerous decisions?
+2. **Timing manipulation** — How could an attacker manipulate timing
+   (delays, bursts, clock skew) to exploit the aggregator's temporal logic?
+3. **Capacity weaponization** — How could the max_signals bound or group
+   completion logic be exploited to force premature or delayed decisions?
+4. **State corruption via crash** — How could deliberate crashes be used
+   to put the aggregator in an exploitable state?
+5. **Audit evasion** — How could an attacker cause the aggregator to make
+   decisions that don't appear in the audit log, or appear differently
+   than what actually happened?
+
+## For each attack vector:
+
+- **Category:** (one of the 5 above)
+- **Attack vector:** Name of the attack
+- **Mechanism:** How the attacker exploits the design
+- **Exploit:** Step-by-step attack sequence
+- **Why downstream controls miss it:** Why PortfolioRisk, BuyingPower,
+  or other downstream checks don't catch this
+- **Severity:** Critical / High / Medium
+- **Mitigation:** What the design could add to prevent it
+
+## Document:
+
+[FULL TEXT OF aggregation.md, 193 lines]
+```
+
+## Results
+
+| Model | Time | Findings | Unique vectors |
+|-------|------|----------|----------------|
+| GPT-5 | ~150s | 8 | 3 (most exhaustive) |
+| Opus | ~65s | 6 | 2 (qualitatively different) |
+| Sonnet | ~20s | 4 | 0 (subset of others) |
+
+GPT-5 was most exhaustive and systematic. Opus found qualitatively different
+attack vectors with system-level thinking (e.g., exploiting supervision tree
+restart semantics).
@@ -0,0 +1,58 @@
+# Prompt: Contradiction Detection
+
+Used in Finding #25.
+
+## Setup
+
+- Single document (full text)
+- Same prompt to all models
+- No tools, no project context beyond the document
+
+## Prompt
+
+```
+You are analyzing a design document for CONTRADICTIONS — places where
+the document makes two claims that cannot both be true simultaneously.
+
+This is NOT about:
+- Missing information
+- Unclear writing
+- Design tradeoffs
+- Things that MIGHT conflict
+
+This IS about:
+- Statement A says X, Statement B says NOT-X
+- Mechanism A requires condition C, Mechanism B prevents condition C
+- Rule A applies to set S, but S includes elements that violate Rule A
+
+## Categories:
+
+1. **Direct contradictions** — Two statements that are logically incompatible
+2. **Mechanism conflicts** — Two described mechanisms that cannot coexist
+3. **Scope violations** — A rule/invariant that is violated by a specific
+   case described elsewhere in the document
+4. **Temporal impossibilities** — A sequence that requires something to be
+   true before the described mechanism makes it true
+
+## For each contradiction:
+
+- **Category:** (one of the 4 above)
+- **Statement A:** (exact text, with section)
+- **Statement B:** (exact text, with section)
+- **Why contradictory:** (formal reasoning about incompatibility)
+- **Severity:** Critical (system correctness) / High (safety) / Medium (confusion)
+
+Be PRECISE. Only report genuine logical contradictions, not differences
+in emphasis or scope.
+
+## Document:
+
+[FULL TEXT OF DOCUMENT]
+```
+
+## Key Design Decision
+
+The "Be PRECISE" instruction and explicit exclusion list ("NOT about")
+is critical. Without it, models pad findings with style/clarity issues.
+The contradiction prompt naturally favors Opus (self-correcting, withdraws
+false positives) over GPT-5 (exhaustive, includes borderline cases).
@@ -0,0 +1,80 @@
+# Prompt: Cross-Document Consistency Analysis
+
+Used in Finding #28.
+
+## Setup
+
+- Two documents provided as full text in a single prompt (~25KB total)
+- Document A: `system-overview.md` (323 lines, narrative overview)
+- Document B: `architecture.md` (213 lines, DDD-focused)
+- No tools, no project context beyond the two documents
+- Same prompt to all 3 models independently
+
+## Prompt
+
+```
+You are analyzing two architecture documents that describe the SAME system.
+Your task is to identify places where these documents CONTRADICT each other
+— not where they differ in scope or detail level, but where they make
+incompatible claims about the same concept.
+
+## Categories of inconsistency to check:
+
+1. **Terminology conflicts** — Same concept called different names in ways
+   that imply different meanings (not just abbreviation)
+2. **Structural contradictions** — Documents disagree about what is inside
+   vs outside a component boundary
+3. **Flow/sequence conflicts** — Documents describe incompatible orderings
+   or data flows for the same process
+4. **Ownership/authority conflicts** — Documents disagree about which
+   component owns, writes, or is authoritative for a concept
+5. **Philosophical contradictions** — Documents state incompatible
+   foundational assumptions (e.g., event sourcing vs CRUD)
+
+## What to EXCLUDE:
+
+- Omissions (one doc covers something the other doesn't)
+- Detail-level differences (one is more detailed than the other)
+- Naming differences that are clearly just abbreviations
+- Scope differences (one covers more topics)
+
+## Output format per finding:
+
+For each inconsistency found:
+- **Category:** (one of the 5 above)
+- **Severity:** Critical / High / Medium
+- **Document A says:** (exact quote or precise paraphrase with section ref)
+- **Document B says:** (exact quote or precise paraphrase with section ref)
+- **Why these are incompatible:** (explain why both cannot be correct)
+- **Impact:** (what would go wrong if an implementer followed both)
+
+## Document A: [system-overview.md]
+
+[FULL TEXT OF DOCUMENT A]
+
+## Document B: [architecture.md]
+
+[FULL TEXT OF DOCUMENT B]
+```
+
+## Key Design Decisions
+
+1. **Explicit exclusion of omissions** — prevents models from padding
+   findings with "Doc A mentions X but Doc B doesn't"
+2. **Five specific categories** — focuses attention without being
+   so restrictive that models miss novel inconsistency types
+3. **Required "why incompatible" explanation** — forces models to reason
+   about WHY differences matter, not just list differences
+4. **Impact field** — grounds findings in practical consequences
+5. **Both documents in single prompt** — enables cross-referencing
+   without tool calls or context fragmentation
+
+## Results
+
+| Model | Time | Findings | Tokens/finding |
+|-------|------|----------|----------------|
+| Opus | 52s | 7 | 336 |
+| GPT-5 | 125s | 6 | 2,967 |
+| Sonnet | 14s | 4 | 194 |
+
+Opus recommended for this task type.
@@ -0,0 +1,71 @@
+# Prompt: Design Coherence Analysis
+
+Used in Findings #15, #27.
+
+## Setup
+
+- Single document provided as full text
+- No tools, no project context beyond the document
+- Same prompt to all models independently
+
+## Prompt
+
+```
+You are analyzing a single design document for INTERNAL incoherence —
+places where the document contradicts itself. The document states
+principles, invariants, or guarantees in one place, then describes
+mechanisms that violate those guarantees elsewhere.
+
+## Categories of incoherence to check:
+
+1. **Safety properties not enforced** — Document claims a safety property
+   (e.g., "fail-closed") but the described mechanism has a path that
+   violates it
+2. **State machine violations** — Declared states/transitions don't match
+   the described behavior (missing transitions, unreachable states,
+   states with no exit)
+3. **Recovery contradictions** — Recovery mechanism assumes preconditions
+   that the failure scenario explicitly invalidates
+4. **Supervision conflicts** — Supervision strategy contradicts the
+   independence/coupling claims about the supervised processes
+5. **Cross-mechanism contradictions** — Two different sections describe
+   incompatible behaviors for the same scenario
+
+## What to EXCLUDE:
+
+- Missing features (things the document doesn't cover)
+- Design tradeoffs that are explicitly acknowledged
+- Future work items marked as such
+
+## Output format per finding:
+
+- **Category:** (one of the 5 above)
+- **Severity:** Critical / High / Medium
+- **Section A says:** (exact quote with section reference)
+- **Section B says:** (exact quote with section reference)
+- **The incoherence:** (explain the contradiction)
+- **Why it matters:** (what would break in implementation)
+
+## Document:
+
+[FULL TEXT OF DOCUMENT]
+```
+
+## Results (Finding #15: failure-modes.md, 383 lines)
+
+| Model | Time | Findings |
+|-------|------|----------|
+| Sonnet 4.6 | 39s | 5 |
+| Opus 4.6 | 105s | 7 (8 attempted, 1 self-withdrawn) |
+| GPT-5 | 120s | 4 |
+
+## Results (Finding #27: risk-controls.md, 992 lines)
+
+| Model | Time | Findings |
+|-------|------|----------|
+| Sonnet 4.6 | 31s | 4 |
+| Opus 4.6 | 86s | 5 |
+| GPT-5 | 112s | 6 |
+
+Key insight: results are document-dependent. Opus won on the shorter doc,
+GPT-5 won on the longer, more complex one.
@@ -0,0 +1,47 @@
+# Prompt: Gap-Finding in Architecture Documents
+
+Used in Finding #9.
+
+## Setup
+
+- Single document (full text, no truncation)
+- Same focused analytical question to all models
+- No tools, no project context beyond the document
+- Temperature 0.3 for GPT-4.1/Mini, default for GPT-5
+
+## Prompt
+
+```
+You are a systems reliability engineer reviewing a failure modes document
+for a trading platform. Your task is to identify MISSING failure scenarios
+— things that COULD go wrong in this architecture but are NOT covered in
+the document.
+
+Focus on:
+1. Scenarios specific to THIS architecture (not generic "server could crash")
+2. Interactions between components that could produce unexpected states
+3. External dependency failures not covered
+4. Timing/ordering issues in the described sequences
+5. Recovery procedures that have gaps
+
+For each missing scenario:
+- **Scenario:** What goes wrong
+- **Why it's specific to this system:** Why generic monitoring wouldn't catch it
+- **Impact:** What state the system ends up in
+- **Why the document misses it:** What assumption makes this invisible
+
+## Document:
+
+[FULL TEXT OF failure-modes.md, 383 lines]
+```
+
+## Results
+
+| Model | Time | Output tokens | Reasoning tokens | Scenarios found |
+|-------|------|---------------|------------------|-----------------|
+| GPT-4.1 Mini | 16s | 2,003 | 0 | 10 |
+| GPT-4.1 | 24s | 2,575 | 0 | 15 |
+| GPT-5 | 45s | 8,565 | 6,656 | 14 |
+
+GPT-5 found the most domain-specific and actionable gaps despite finding
+fewer total scenarios than GPT-4.1. Quality > quantity.
@@ -0,0 +1,53 @@
+# Prompt: Hidden Assumption Identification
+
+Used in Findings #10, #11, #12.
+
+## Setup
+
+- Single document (full text)
+- Same prompt to all models
+- No tools, no project context beyond the document
+- Temperature 0.3 for non-reasoning models
+
+## Prompt
+
+```
+You are reviewing a system design document for hidden assumptions —
+things the design DEPENDS ON being true but does NOT explicitly state
+or validate.
+
+A hidden assumption is different from a design decision:
+- Design decision: "We use event sourcing" (explicit choice)
+- Hidden assumption: "Events will always be delivered in order"
+  (unstated dependency that could break)
+
+For each hidden assumption found:
+- **Assumption:** What the design implicitly depends on
+- **Where it's hidden:** Which mechanism relies on it (section reference)
+- **What breaks if violated:** Concrete failure mode
+- **Likelihood of violation:** In production, how likely is this to be
+  violated? (not in theory — in the real world with network partitions,
+  clock skew, operator error, etc.)
+
+Focus on assumptions that:
+1. Are NOT explicitly stated in the document
+2. COULD realistically be violated in production
+3. Would cause SILENT incorrect behavior (not loud crashes)
+4. Are specific to THIS architecture (not generic distributed systems concerns)
+
+## Document:
+
+[FULL TEXT OF DOCUMENT]
+```
+
+## Results (Finding #10: cold-start-and-recovery.md, 234 lines)
+
+| Model | Time | Output tokens | Reasoning tokens | Assumptions found |
+|-------|------|---------------|------------------|-------------------|
+| GPT-4.1 Mini | 25s | 3,090 | 0 | 12 |
+| GPT-4.1 | 77s | 2,751 | 0 | 14 |
+| GPT-5 | 78s | 2,649 | 4,096 | 26 |
+
+GPT-5 found 2x more assumptions AND they were qualitatively different —
+multi-component interaction assumptions that require reasoning about
+system-level behavior, not just local properties.