Initial publish: 29 findings, 6 prompts, methodology, open questions

Full comparative analysis of GPT-5, Claude Opus 4.6, Claude Sonnet 4.6, GPT-4.1, and GPT-4.1 Mini on analytical tasks (not coding).

Contents:
- findings/ALL-FINDINGS.md: complete 3,249-line research log with all 29 findings, methodology notes, and open questions
- prompts/: 6 exact prompts used across experiments
- methodology.md: experimental setup and evaluation criteria
- open-questions.md: unanswered questions for future work
- README.md: overview and summary table

Key findings:
- Cross-document consistency: Opus is 2.4x faster with more findings
- Gap-finding: GPT-5 reasoning tokens find domain-specific gaps
- Race conditions: Opus excels at temporal interaction reasoning
- Bias detection: Signal-to-noise ratio > model capability
- Adversarial analysis: GPT-5 exhaustive, Opus qualitatively different

Signed-off-by: Rodin
2026-05-05 19:13:03 -07:00
parent 4aea0d004b
commit 1b108ff66e
10 changed files with 3831 additions and 2 deletions
@@ -1,3 +1,81 @@
-# model-research
+# Model Research — AI for Analytical Work
-Comparative analysis of AI models on analytical tasks — not coding. Tracking what works when using GPT-5, Claude Opus, Claude Sonnet, and GPT-4.1 for research, document review, bias detection, and architecture analysis.
+Comparative analysis of AI models on **analytical tasks** — not coding.
 Most public discussion about LLM capabilities focuses on code generation.
 We found almost no published methodology for using models in analytical
 research tasks (searched 2026-04-26). This repo fills that gap with
 controlled experiments and reproducible findings.
 ## What We're Testing
 Using GPT-5, Claude Opus 4.6, Claude Sonnet 4.6, and GPT-4.1 (+ Mini) for:
 - Architecture document review
 - Bias and assumption detection
 - Gap-finding in design specifications
 - Cross-document consistency analysis
 - Race condition identification
 - Adversarial path analysis
 - Contradiction detection
 - Regulatory compliance review
 ## Key Findings (Summary)
 | # | Task Type | Winner | Key Insight |
 |---|-----------|--------|-------------|
 | 1 | PR review | Both | Different models catch different things — Sonnet: structural, GPT-5: semantic |
 | 2 | Bias detection | Framing | Signal-to-noise ratio matters more than model capability |
 | 9 | Gap-finding | GPT-5 | Reasoning tokens find domain-specific gaps, not generic ones |
 | 10 | Hidden assumptions | GPT-5 | Reasoning produces qualitatively different (not just more) findings |
 | 13 | Race conditions | Opus | Temporal interaction reasoning is Opus's strongest domain |
 | 15 | Design coherence | Task-dependent | Single-doc: model choice depends on document complexity |
 | 25 | Contradiction detection | Opus | Precision > exhaustiveness; Opus's self-correction is unique |
 | 28 | Cross-doc consistency | Opus | 2.4x faster than GPT-5 with more findings; boundary reasoning |
 | 29 | Adversarial analysis | GPT-5 + Opus | GPT-5: exhaustive; Opus: qualitatively different attack vectors |
 ## Methodology
 Each experiment:
 1. Same input document(s) to all models
 2. Same structured prompt with explicit categories to analyze
 3. No tools, no project context beyond the document(s)
 4. Independent runs — no cross-pollination between models
 5. Results evaluated for: correctness, uniqueness, actionability
 **Context dimensions tracked:**
 - Rich vs minimal (how much background info)
 - Broad vs focused ("review this" vs "answer this specific question")
 - What kind of context (diff, full files, issue text, nothing)
 - Whether the model had tools or just text
 - Whether the task was step-by-step or open-ended
 ## Repository Structure
 ```
 findings/           # Individual findings with full analysis
  01-different-models-different-things.md
  02-narrow-lens-vs-broad-review.md
  ...
  28-cross-document-consistency.md
  29-adversarial-manipulation.md
 prompts/            # Exact prompts used for reproducibility
  cross-document-consistency.md
  design-coherence.md
  gap-finding.md
  hidden-assumptions.md
  ...
 open-questions.md   # Unanswered questions for future experiments
 methodology.md      # Full methodology notes
 ```
 ## Who We Are
 This research is conducted by [Rodin](https://gitea.weiker.me/rodin) (AI
 assistant) and Aaron Weiker. The test corpus is gargoyle — an Elixir
 trading system with extensive architecture documentation (~35 design docs,
 ~5000 lines).
 ## License
 CC BY 4.0 — share and adapt with attribution.
@@ -0,0 +1,76 @@
 # Methodology
 ## Principles
 1. **Internet opinions about models are overwhelmingly about coding.** Don't
   extrapolate to analytical work without testing.
 2. **"Just because someone says it on the internet doesn't make it right."**
   Opinions need context. Track our own evidence.
 3. **Absence of published methodology for a use case is itself a finding.**
 4. **No unsupported generalizations.** Each finding needs: date, task,
   how we used it (context shape, task framing, what info the model
   had/didn't have), what happened, takeaway.
 ## Experimental Setup
 ### Models Tested
 | Model | Provider | Access | Notes |
 |-------|----------|--------|-------|
 | GPT-5 | OpenAI (via HAI proxy) | API | Requires `max_completion_tokens` ≥16K |
 | Claude Opus 4.6 | Anthropic (via HAI proxy) | API | Internal reasoning (not exposed) |
 | Claude Sonnet 4.6 | Anthropic (via HAI proxy) | API | Fast, cost-effective |
 | GPT-4.1 | OpenAI (via HAI proxy) | API | Non-reasoning, structured output |
 | GPT-4.1 Mini | OpenAI (via HAI proxy) | API | Cheapest, good for screening |
 | Claude Sonnet 4.5 | Anthropic (via HAI proxy) | API | Predecessor to 4.6 |
 ### Control Variables
 - **Same input:** All models receive identical document text
 - **Same prompt:** Structured prompt with explicit categories and output format
 - **Same constraints:** No tools, no project context beyond the document(s)
 - **Independent runs:** No cross-pollination between model runs
 - **Temperature:** 0.3 for GPT-4.1 and GPT-4.1 Mini; GPT-5 runs at its default (1.0), which is required for that model
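The temperature rule above can be sketched as a small lookup. This is a minimal sketch, not the harness used in the experiments: the model identifier strings and the `request_settings` helper are hypothetical, `16_000` is one reading of the "≥16K" note in the models table, and the Claude settings are left empty because the source does not state them.

```python
def request_settings(model: str) -> dict:
    """Return per-model sampling settings as described under Control Variables.

    Model id strings are hypothetical labels, not confirmed API identifiers.
    """
    if model in ("gpt-4.1", "gpt-4.1-mini"):
        # Non-reasoning OpenAI models were run at temperature 0.3.
        return {"temperature": 0.3}
    if model == "gpt-5":
        # GPT-5 ran at its default temperature, so none is set here.
        # 16_000 is an assumed floor for the ">=16K" note in the models table.
        return {"max_completion_tokens": 16_000}
    # The source does not state a temperature for the Claude models.
    return {}


for name in ("gpt-4.1", "gpt-4.1-mini", "gpt-5", "claude-opus-4.6"):
    print(name, request_settings(name))
```

Keeping these settings in one place makes it easier to confirm that a difference between models comes from the model and not from a stray parameter.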
 ### Measurement
 - **Time:** Wall clock from request to final token
 - **Output tokens:** Total generated tokens
 - **Reasoning tokens:** For reasoning models (GPT-5), exposed separately
 - **Findings count:** Number of distinct issues identified
 - **Unique findings:** Issues found by only one model
 - **Severity distribution:** Critical / High / Medium / Low per finding
 - **Tokens per finding:** Efficiency metric
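Two of these metrics are simple to compute once findings carry stable identifiers. The sketch below assumes a reviewer has already decided which findings from different models describe the same issue and given them a shared id. The function names and the toy data are illustrative, not taken from the research log.

```python
def tokens_per_finding(output_tokens: int, findings: int) -> float:
    """Efficiency metric: generated tokens divided by distinct findings."""
    if findings <= 0:
        raise ValueError("findings must be positive")
    return output_tokens / findings


def unique_findings(found: dict[str, set[str]]) -> dict[str, set[str]]:
    """Map each model to the finding ids that no other model reported."""
    result = {}
    for model, ids in found.items():
        # Union of everything the *other* models reported.
        others = set().union(*(v for k, v in found.items() if k != model))
        result[model] = ids - others
    return result


# Toy data (hypothetical ids, not experimental results).
found = {
    "model_a": {"F1", "F2", "F3"},
    "model_b": {"F2", "F4"},
    "model_c": {"F2"},
}
unique = unique_findings(found)
print(unique)
print(tokens_per_finding(2400, 8))
```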
 ### Evaluation Criteria
 Each finding is assessed for:
 1. **Correctness:** Is the identified issue real?
 2. **Uniqueness:** Did only this model find it?
 3. **Actionability:** Would a developer change something based on this?
 4. **Depth:** Surface observation vs architectural insight?
 ### Context Dimensions Tracked
 | Dimension | Options |
 |-----------|---------|
 | Context richness | Rich (full project) vs Minimal (document only) |
 | Task framing | Broad ("review this") vs Focused ("check for X") |
 | Context type | Diff, full files, issue text, research notes, nothing |
 | Tool access | With tools (API calls, file reads) vs text-only |
 | Task structure | Step-by-step explicit vs open-ended |
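One way to keep these dimensions attached to every finding is a small record type. This is a sketch under assumptions: the class and field names are hypothetical, and the example classification at the end is illustrative, not copied from the findings log.

```python
from dataclasses import dataclass
from enum import Enum


class Richness(Enum):
    RICH = "full project"
    MINIMAL = "document only"


class Framing(Enum):
    BROAD = "review this"
    FOCUSED = "check for X"


@dataclass(frozen=True)
class RunContext:
    richness: Richness
    framing: Framing
    context_type: str   # "diff", "full files", "issue text", "research notes", "nothing"
    tools: bool         # True = with tools, False = text-only
    step_by_step: bool  # True = explicit steps, False = open-ended


# Illustrative classification of a document-only, text-only run (assumed).
document_only = RunContext(
    richness=Richness.MINIMAL,
    framing=Framing.FOCUSED,
    context_type="full files",
    tools=False,
    step_by_step=True,
)
print(document_only)
```

Because the record is frozen, two runs with the same context compare equal, which makes it straightforward to group findings by context shape.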
 ## Limitations
 - Single test corpus (gargoyle architecture docs) — domain bias possible
 - Single researcher evaluating findings — subjectivity in quality assessment
 - Models are non-deterministic, and each result is a single run rather than an average, so small differences in counts or timing may not repeat
 - Proxy adds latency — timing comparisons are relative, not absolute
 - Internal reasoning tokens not visible for Claude models
 ## Reproducibility
 Prompts for each experiment are in the `prompts/` directory. The test
 corpus is the gargoyle project's `docs/` directory (available at
 `gitea.weiker.me/grgl/gargoyle`). Each finding documents the exact document
 used, its line count, and the specific version/commit when relevant.
@@ -0,0 +1,58 @@
 # Open Questions
 Unanswered questions from experiments, ordered by potential impact.
 ## High Priority
 ### Signal-to-noise confirmation (from Finding #8)
 Give a model the FULL PR review context (diff, files, issue, AC) but add
 the narrow bias question as an explicit review checklist item. If the model
 catches the bias despite the rich context, that supports the signal-to-noise
 hypothesis. If it misses the bias, some other factor is likely at work
 (attention allocation, task-switching cost).
 ### Cross-document consistency as maintenance tool (from Finding #28)
 Does running cross-doc analysis across MORE document pairs (domain readmes
 vs implementation docs, design docs vs plan docs) yield additional real
 inconsistencies? Could become a systematic documentation maintenance tool.
 ### Why Opus dominates cross-doc consistency (from Finding #28)
 Opus was 2.4x faster AND found more issues than GPT-5. Is this because
 cross-doc contradictions are easy to verify once spotted (reducing GPT-5's
 verification advantage)? Or because boundary reasoning (Opus's strength)
 is the primary skill needed?
 ### Sonnet + narrow framing = GPT-5 level? (from Finding #5)
 Would Sonnet catch semantic issues if given a narrower "check for logical
 consistency" framing instead of broad review? The hypothesis: Sonnet's
 "structural reviewer" tendency is a framing artifact, not a capability limit.
 ## Medium Priority
 ### Adversarial analysis ensemble (from Finding #29)
 Run GPT-5 and Opus sequentially — give Opus access to GPT-5's findings
 and ask it to critique and extend. Does the ensemble find more than either
 alone? Does Opus's system-level thinking complement GPT-5's exhaustiveness?
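A possible shape for this experiment, as a sketch only: the second model receives the original prompt plus the first model's findings, and the ensemble is credited only with findings that neither solo run produced. Both helper names and the wording of the critique instruction are hypothetical, and finding ids are assumed to be assigned by a reviewer.

```python
def build_critique_prompt(base_prompt: str, prior_findings: list[str]) -> str:
    """Append another model's findings to the original prompt for a second pass."""
    listed = "\n".join(f"{i}. {text}" for i, text in enumerate(prior_findings, 1))
    return (
        f"{base_prompt}\n\n"
        "## Findings from a previous analyst:\n"
        f"{listed}\n\n"
        "Critique each finding above, then add any attack vectors it missed."
    )


def ensemble_gain(first: set[str], second: set[str], ensemble: set[str]) -> int:
    """Count findings the ensemble produced that neither solo run produced."""
    return len(ensemble - (first | second))


prompt = build_critique_prompt("Analyze the document.", ["vector one", "vector two"])
print(prompt)
print(ensemble_gain({"A", "B"}, {"B", "C"}, {"A", "B", "C", "D"}))
```

Comparing against the union of the solo runs, not against each run separately, avoids counting findings that the second model would have reported anyway.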
 ### Reasoning effort parameter (from Finding #21)
 Reasoning effort (low/medium/high) had negligible effect on GPT-5's
 analytical output. Is this because the parameter doesn't work for open-ended
 analysis? Or because the task was already within GPT-5's "easy" threshold?
 Test with a harder document.
 ### Model personality vs prompt (from Finding #26)
 Missing-feature identification IS promptable across all models — prompt
 framing eliminates Opus's historical advantage. How many other "model
 personality" observations are actually just prompt framing effects?
 ## Answered Questions
 - ~~Opus's "missing feature identification" mode — is it promptable?~~
  **YES** (Finding #26): all models find regulatory gaps when explicitly
  prompted. Opus's behavior was an emergent DEFAULT tendency, not a unique
  capability.
 - ~~Is Opus > GPT-5 for coherence tasks universal?~~
  **NO** (Finding #27): Opus's advantage from Finding #15 was document-
  specific. On risk-controls.md (992 lines, more complex), GPT-5 regained
  top position. Document complexity and domain specialization affect ranking.
@@ -0,0 +1,59 @@
 # Prompt: Adversarial Manipulation Analysis
 Used in Finding #29.
 ## Setup
 - Single document (full text)
 - Same prompt to all models
 - No tools, no project context beyond the document
 ## Prompt
 ```
 You are a red-team security analyst reviewing a trading system's
 aggregation component. Your task is to identify how a MISBEHAVING,
 COMPROMISED, or BUGGY upstream component could exploit this design
 to produce harmful trading outcomes that bypass downstream safety controls.
 ## Categories of adversarial manipulation:
 1. **Signal injection** — How could a compromised strategy inject signals
   that exploit the aggregator's logic to produce dangerous decisions?
 2. **Timing manipulation** — How could an attacker manipulate timing
   (delays, bursts, clock skew) to exploit the aggregator's temporal logic?
 3. **Capacity weaponization** — How could the max_signals bound or group
   completion logic be exploited to force premature or delayed decisions?
 4. **State corruption via crash** — How could deliberate crashes be used
   to put the aggregator in an exploitable state?
 5. **Audit evasion** — How could an attacker cause the aggregator to make
   decisions that don't appear in the audit log, or appear differently
   than what actually happened?
 ## For each attack vector:
 - **Category:** (one of the 5 above)
 - **Attack vector:** Name of the attack
 - **Mechanism:** How the attacker exploits the design
 - **Exploit:** Step-by-step attack sequence
 - **Why downstream controls miss it:** Why PortfolioRisk, BuyingPower,
  or other downstream checks don't catch this
 - **Severity:** Critical / High / Medium
 - **Mitigation:** What the design could add to prevent it
 ## Document:
 [FULL TEXT OF aggregation.md, 193 lines]
 ```
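Because the prompt fixes one field per line, responses can be split into records before findings are counted. The sketch below assumes a model follows the `- **Field:** value` layout exactly. `parse_findings` and the sample text are hypothetical and are not output from any tested model.

```python
import re

# Matches lines such as "- **Severity:** High".
FIELD = re.compile(r"^\s*-\s*\*\*(?P<name>[^*]+?):\*\*\s*(?P<value>.*)$")


def parse_findings(text: str, first_field: str = "Category") -> list[dict[str, str]]:
    """Split a response into one dict per finding, keyed by field name."""
    findings = []
    current = None
    last_key = None
    for line in text.splitlines():
        match = FIELD.match(line)
        if match:
            name, value = match["name"].strip(), match["value"].strip()
            # A new finding starts whenever the first field reappears.
            if name == first_field or current is None:
                current = {}
                findings.append(current)
            current[name] = value
            last_key = name
        elif current is not None and last_key and line.strip():
            # Continuation line: fold wrapped text into the previous field.
            current[last_key] += " " + line.strip()
    return findings


sample = """\
- **Category:** Timing manipulation
- **Attack vector:** Example vector
- **Severity:** High
- **Mitigation:** Example mitigation that wraps
  onto a second line

- **Category:** Audit evasion
- **Severity:** Medium
"""
parsed = parse_findings(sample)
print(len(parsed), parsed[0]["Mitigation"])
```

Any prose a model adds between findings would be folded into the preceding field, so real responses would likely need a manual check before counts are trusted.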
 ## Results
 | Model | Time | Findings | Unique vectors |
 |-------|------|----------|----------------|
 | GPT-5 | ~150s | 8 | 3 (most exhaustive) |
 | Opus | ~65s | 6 | 2 (qualitatively different) |
 | Sonnet | ~20s | 4 | 0 (subset of others) |
 GPT-5 was most exhaustive and systematic. Opus found qualitatively different
 attack vectors with system-level thinking (e.g., exploiting supervision tree
 restart semantics).
@@ -0,0 +1,58 @@
 # Prompt: Contradiction Detection
 Used in Finding #25.
 ## Setup
 - Single document (full text)
 - Same prompt to all models
 - No tools, no project context beyond the document
 ## Prompt
 ```
 You are analyzing a design document for CONTRADICTIONS — places where
 the document makes two claims that cannot both be true simultaneously.
 This is NOT about:
 - Missing information
 - Unclear writing
 - Design tradeoffs
 - Things that MIGHT conflict
 This IS about:
 - Statement A says X, Statement B says NOT-X
 - Mechanism A requires condition C, Mechanism B prevents condition C
 - Rule A applies to set S, but S includes elements that violate Rule A
 ## Categories:
 1. **Direct contradictions** — Two statements that are logically incompatible
 2. **Mechanism conflicts** — Two described mechanisms that cannot coexist
 3. **Scope violations** — A rule/invariant that is violated by a specific
   case described elsewhere in the document
 4. **Temporal impossibilities** — A sequence that requires something to be
   true before the described mechanism makes it true
 ## For each contradiction:
 - **Category:** (one of the 4 above)
 - **Statement A:** (exact text, with section)
 - **Statement B:** (exact text, with section)
 - **Why contradictory:** (formal reasoning about incompatibility)
 - **Severity:** Critical (system correctness) / High (safety) / Medium (confusion)
 Be PRECISE. Only report genuine logical contradictions, not differences
 in emphasis or scope.
 ## Document:
 [FULL TEXT OF DOCUMENT]
 ```
 ## Key Design Decision
 The "Be PRECISE" instruction and the explicit exclusion list ("NOT about")
 are critical. Without them, models pad findings with style/clarity issues.
 The contradiction prompt naturally favors Opus (self-correcting, withdraws
 false positives) over GPT-5 (exhaustive, includes borderline cases).
@@ -0,0 +1,80 @@
 # Prompt: Cross-Document Consistency Analysis
 Used in Finding #28.
 ## Setup
 - Two documents provided as full text in a single prompt (~25KB total)
 - Document A: `system-overview.md` (323 lines, narrative overview)
 - Document B: `architecture.md` (213 lines, DDD-focused)
 - No tools, no project context beyond the two documents
 - Same prompt to all 3 models independently
 ## Prompt
 ```
 You are analyzing two architecture documents that describe the SAME system.
 Your task is to identify places where these documents CONTRADICT each other
 — not where they differ in scope or detail level, but where they make
 incompatible claims about the same concept.
 ## Categories of inconsistency to check:
 1. **Terminology conflicts** — Same concept called different names in ways
   that imply different meanings (not just abbreviation)
 2. **Structural contradictions** — Documents disagree about what is inside
   vs outside a component boundary
 3. **Flow/sequence conflicts** — Documents describe incompatible orderings
   or data flows for the same process
 4. **Ownership/authority conflicts** — Documents disagree about which
   component owns, writes, or is authoritative for a concept
 5. **Philosophical contradictions** — Documents state incompatible
   foundational assumptions (e.g., event sourcing vs CRUD)
 ## What to EXCLUDE:
 - Omissions (one doc covers something the other doesn't)
 - Detail-level differences (one is more detailed than the other)
 - Naming differences that are clearly just abbreviations
 - Scope differences (one covers more topics)
 ## Output format per finding:
 For each inconsistency found:
 - **Category:** (one of the 5 above)
 - **Severity:** Critical / High / Medium
 - **Document A says:** (exact quote or precise paraphrase with section ref)
 - **Document B says:** (exact quote or precise paraphrase with section ref)
 - **Why these are incompatible:** (explain why both cannot be correct)
 - **Impact:** (what would go wrong if an implementer followed both)
 ## Document A: [system-overview.md]
 [FULL TEXT OF DOCUMENT A]
 ## Document B: [architecture.md]
 [FULL TEXT OF DOCUMENT B]
 ```
 ## Key Design Decisions
 1. **Explicit exclusion of omissions** — prevents models from padding
   findings with "Doc A mentions X but Doc B doesn't"
 2. **Five specific categories** — focuses attention without being
   so restrictive that models miss novel inconsistency types
 3. **Required "why incompatible" explanation** — forces models to reason
   about WHY differences matter, not just list differences
 4. **Impact field** — grounds findings in practical consequences
 5. **Both documents in single prompt** — enables cross-referencing
   without tool calls or context fragmentation
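Decision 5 amounts to simple string assembly. The following is a minimal sketch of how the single prompt could be built, assuming the section headers shown in the prompt above. The function name is hypothetical, and the placeholder texts stand in for the real documents.

```python
def build_cross_doc_prompt(instructions: str, docs: dict[str, str]) -> str:
    """Place the instructions and both documents in one prompt, labelled A and B."""
    if len(docs) != 2:
        raise ValueError("expected exactly two documents")
    parts = [instructions.strip()]
    # dicts preserve insertion order, so the first entry becomes Document A.
    for label, (name, text) in zip("AB", docs.items()):
        parts.append(f"## Document {label}: [{name}]\n{text.strip()}")
    return "\n\n".join(parts)


prompt = build_cross_doc_prompt(
    "Identify places where these documents CONTRADICT each other.",
    {
        "system-overview.md": "placeholder text for document A",
        "architecture.md": "placeholder text for document B",
    },
)
print(prompt)
```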
 ## Results
 | Model | Time | Findings | Tokens/finding |
 |-------|------|----------|----------------|
 | Opus | 52s | 7 | 336 |
 | GPT-5 | 125s | 6 | 2,967 |
 | Sonnet | 14s | 4 | 194 |
 Opus is recommended for this task type: it reported the most findings (7) and finished in under half of GPT-5's time.
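The headline ratio can be checked directly from the table. The snippet below uses only the numbers reported above. The total-token figures it derives are approximate, since tokens per finding is presumably rounded, and the source does not say whether reasoning tokens are included.

```python
results = {
    # model: (seconds, findings, tokens_per_finding), copied from the table above
    "Opus": (52, 7, 336),
    "GPT-5": (125, 6, 2967),
    "Sonnet": (14, 4, 194),
}

speedup = results["GPT-5"][0] / results["Opus"][0]
print(f"Opus vs GPT-5 speedup: {speedup:.1f}x")  # 125 / 52 rounds to 2.4x

for model, (seconds, findings, tpf) in results.items():
    # Implied total is findings * tokens/finding; treat it as an estimate.
    print(f"{model}: ~{findings * tpf} tokens total, {seconds / findings:.1f}s per finding")
```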
@@ -0,0 +1,71 @@
 # Prompt: Design Coherence Analysis
 Used in Findings #15, #27.
 ## Setup
 - Single document provided as full text
 - No tools, no project context beyond the document
 - Same prompt to all models independently
 ## Prompt
 ```
 You are analyzing a single design document for INTERNAL incoherence —
 places where the document contradicts itself. The document states
 principles, invariants, or guarantees in one place, then describes
 mechanisms that violate those guarantees elsewhere.
 ## Categories of incoherence to check:
 1. **Safety properties not enforced** — Document claims a safety property
   (e.g., "fail-closed") but the described mechanism has a path that
   violates it
 2. **State machine violations** — Declared states/transitions don't match
   the described behavior (missing transitions, unreachable states,
   states with no exit)
 3. **Recovery contradictions** — Recovery mechanism assumes preconditions
   that the failure scenario explicitly invalidates
 4. **Supervision conflicts** — Supervision strategy contradicts the
   independence/coupling claims about the supervised processes
 5. **Cross-mechanism contradictions** — Two different sections describe
   incompatible behaviors for the same scenario
 ## What to EXCLUDE:
 - Missing features (things the document doesn't cover)
 - Design tradeoffs that are explicitly acknowledged
 - Future work items marked as such
 ## Output format per finding:
 - **Category:** (one of the 5 above)
 - **Severity:** Critical / High / Medium
 - **Section A says:** (exact quote with section reference)
 - **Section B says:** (exact quote with section reference)
 - **The incoherence:** (explain the contradiction)
 - **Why it matters:** (what would break in implementation)
 ## Document:
 [FULL TEXT OF DOCUMENT]
 ```
 ## Results (Finding #15: failure-modes.md, 383 lines)
 | Model | Time | Findings |
 |-------|------|----------|
 | Sonnet 4.6 | 39s | 5 |
 | Opus 4.6 | 105s | 7 (8 attempted, 1 self-withdrawn) |
 | GPT-5 | 120s | 4 |
 ## Results (Finding #27: risk-controls.md, 992 lines)
 | Model | Time | Findings |
 |-------|------|----------|
 | Sonnet 4.6 | 31s | 4 |
 | Opus 4.6 | 86s | 5 |
 | GPT-5 | 112s | 6 |
 Key insight: results are document-dependent. Opus won on the shorter doc,
 GPT-5 won on the longer, more complex one.
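The ranking flip is easy to see when the two tables are put side by side. This sketch ranks by raw findings count only, which is a crude proxy: the evaluation criteria in methodology.md also weigh correctness, uniqueness, actionability, and depth.

```python
findings = {
    "failure-modes.md (383 lines)": {"Sonnet 4.6": 5, "Opus 4.6": 7, "GPT-5": 4},
    "risk-controls.md (992 lines)": {"Sonnet 4.6": 4, "Opus 4.6": 5, "GPT-5": 6},
}


def rank(counts: dict[str, int]) -> list[str]:
    """Order models by findings count, highest first."""
    return sorted(counts, key=counts.get, reverse=True)


rankings = {doc: rank(counts) for doc, counts in findings.items()}
for doc, order in rankings.items():
    print(f"{doc}: {' > '.join(order)}")
```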
@@ -0,0 +1,47 @@
 # Prompt: Gap-Finding in Architecture Documents
 Used in Finding #9.
 ## Setup
 - Single document (full text, no truncation)
 - Same focused analytical question to all models
 - No tools, no project context beyond the document
 - Temperature 0.3 for GPT-4.1/Mini, default for GPT-5
 ## Prompt
 ```
 You are a systems reliability engineer reviewing a failure modes document
 for a trading platform. Your task is to identify MISSING failure scenarios
 — things that COULD go wrong in this architecture but are NOT covered in
 the document.
 Focus on:
 1. Scenarios specific to THIS architecture (not generic "server could crash")
 2. Interactions between components that could produce unexpected states
 3. External dependency failures not covered
 4. Timing/ordering issues in the described sequences
 5. Recovery procedures that have gaps
 For each missing scenario:
 - **Scenario:** What goes wrong
 - **Why it's specific to this system:** Why generic monitoring wouldn't catch it
 - **Impact:** What state the system ends up in
 - **Why the document misses it:** What assumption makes this invisible
 ## Document:
 [FULL TEXT OF failure-modes.md, 383 lines]
 ```
 ## Results
 | Model | Time | Output tokens | Reasoning tokens | Scenarios found |
 |-------|------|---------------|------------------|-----------------|
 | GPT-4.1 Mini | 16s | 2,003 | 0 | 10 |
 | GPT-4.1 | 24s | 2,575 | 0 | 15 |
 | GPT-5 | 45s | 8,565 | 6,656 | 14 |
 GPT-5 found the most domain-specific and actionable gaps despite reporting
 fewer scenarios than GPT-4.1 (14 vs 15). Quality > quantity.
@@ -0,0 +1,53 @@
 # Prompt: Hidden Assumption Identification
 Used in Findings #10, #11, #12.
 ## Setup
 - Single document (full text)
 - Same prompt to all models
 - No tools, no project context beyond the document
 - Temperature 0.3 for non-reasoning models
 ## Prompt
 ```
 You are reviewing a system design document for hidden assumptions —
 things the design DEPENDS ON being true but does NOT explicitly state
 or validate.
 A hidden assumption is different from a design decision:
 - Design decision: "We use event sourcing" (explicit choice)
 - Hidden assumption: "Events will always be delivered in order"
  (unstated dependency that could break)
 For each hidden assumption found:
 - **Assumption:** What the design implicitly depends on
 - **Where it's hidden:** Which mechanism relies on it (section reference)
 - **What breaks if violated:** Concrete failure mode
 - **Likelihood of violation:** In production, how likely is this to be
  violated? (not in theory — in the real world with network partitions,
  clock skew, operator error, etc.)
 Focus on assumptions that:
 1. Are NOT explicitly stated in the document
 2. COULD realistically be violated in production
 3. Would cause SILENT incorrect behavior (not loud crashes)
 4. Are specific to THIS architecture (not generic distributed systems concerns)
 ## Document:
 [FULL TEXT OF DOCUMENT]
 ```
 ## Results (Finding #10: cold-start-and-recovery.md, 234 lines)
 | Model | Time | Output tokens | Reasoning tokens | Assumptions found |
 |-------|------|---------------|------------------|-------------------|
 | GPT-4.1 Mini | 25s | 3,090 | 0 | 12 |
 | GPT-4.1 | 77s | 2,751 | 0 | 14 |
 | GPT-5 | 78s | 2,649 | 4,096 | 26 |
 GPT-5 found roughly twice as many assumptions as either GPT-4.1 model
 (26 vs 14 and 12), AND they were qualitatively different: multi-component
 interaction assumptions that require reasoning about system-level behavior,
 not just local properties.