Rodin 1b108ff66e Initial publish: 29 findings, 6 prompts, methodology, open questions
Full comparative analysis of GPT-5, Claude Opus 4.6, Claude Sonnet 4.6,
GPT-4.1, and GPT-4.1 Mini on analytical tasks (not coding).

Contents:
- findings/ALL-FINDINGS.md — complete 3,249-line research log with all
  29 findings, methodology notes, and open questions
- prompts/ — 6 exact prompts used across experiments
- methodology.md — experimental setup and evaluation criteria
- open-questions.md — unanswered questions for future work
- README.md — overview and summary table

Key findings:
- Cross-document consistency: Opus is 2.4x faster with more findings
- Gap-finding: GPT-5 reasoning tokens find domain-specific gaps
- Race conditions: Opus excels at temporal interaction reasoning
- Bias detection: Signal-to-noise ratio > model capability
- Adversarial analysis: GPT-5 exhaustive, Opus qualitatively different

Signed-off-by: Rodin
2026-05-05 19:13:03 -07:00


Model Research — AI for Analytical Work

Comparative analysis of AI models on analytical tasks — not coding.

Most public discussion about LLM capabilities focuses on code generation. We found almost no published methodology for using models in analytical research tasks (searched 2026-04-26). This repo fills that gap with controlled experiments and reproducible findings.

What We're Testing

We use GPT-5, Claude Opus 4.6, Claude Sonnet 4.6, GPT-4.1, and GPT-4.1 Mini for:

  • Architecture document review
  • Bias and assumption detection
  • Gap-finding in design specifications
  • Cross-document consistency analysis
  • Race condition identification
  • Adversarial path analysis
  • Contradiction detection
  • Regulatory compliance review

Key Findings (Summary)

| #  | Task Type               | Winner         | Key Insight |
|----|-------------------------|----------------|-------------|
| 1  | PR review               | Both           | Different models catch different things (Sonnet: structural, GPT-5: semantic) |
| 2  | Bias detection          | Framing        | Signal-to-noise ratio matters more than model capability |
| 9  | Gap-finding             | GPT-5          | Reasoning tokens find domain-specific gaps, not generic ones |
| 10 | Hidden assumptions      | GPT-5          | Reasoning produces qualitatively different (not just more) findings |
| 13 | Race conditions         | Opus           | Temporal interaction reasoning is Opus's strongest domain |
| 15 | Design coherence        | Task-dependent | Single-doc: model choice depends on document complexity |
| 25 | Contradiction detection | Opus           | Precision > exhaustiveness; Opus's self-correction is unique |
| 28 | Cross-doc consistency   | Opus           | 2.4x faster than GPT-5 with more findings; boundary reasoning |
| 29 | Adversarial analysis    | GPT-5 + Opus   | GPT-5: exhaustive; Opus: qualitatively different attack vectors |

Methodology

Each experiment follows the same protocol:

  1. Same input document(s) to all models
  2. Same structured prompt with explicit categories to analyze
  3. No tools, no project context beyond the document(s)
  4. Independent runs — no cross-pollination between models
  5. Results evaluated for: correctness, uniqueness, actionability
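
The protocol above can be sketched as a small harness. This is a minimal illustration, not the code used for these experiments: `Experiment`, `ModelFn`, `run_independently`, and `unique_findings` are hypothetical names, and each model is assumed to be wrapped as a plain function that takes the prompt text and returns a list of findings.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical: a model is any callable that takes the full prompt text and
# returns a list of finding strings. Real API clients are not shown here.
ModelFn = Callable[[str], List[str]]


@dataclass(frozen=True)
class Experiment:
    documents: List[str]  # same input document(s) for every model
    prompt: str           # same structured prompt with explicit categories

    def render(self) -> str:
        # No tools and no project context: the model sees only prompt + documents.
        return self.prompt + "\n\n" + "\n\n".join(self.documents)


def run_independently(exp: Experiment, models: Dict[str, ModelFn]) -> Dict[str, List[str]]:
    """Call each model once with identical input; no output is shared between models."""
    text = exp.render()
    return {name: fn(text) for name, fn in models.items()}


def unique_findings(results: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Findings reported by exactly one model (exact string match, a simplification)."""
    unique = {}
    for name, findings in results.items():
        others = {f for other, fs in results.items() if other != name for f in fs}
        unique[name] = [f for f in findings if f not in others]
    return unique
```

Only the uniqueness criterion is mechanized here, and only by exact string match. Correctness and actionability are left to a reviewer's judgment in this sketch.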

Context dimensions tracked:

  • Rich vs minimal (how much background info)
  • Broad vs focused ("review this" vs "answer this specific question")
  • What kind of context (diff, full files, issue text, nothing)
  • Whether the model had tools or just text
  • Whether the task was step-by-step or open-ended
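
One way to record these dimensions per run is a small typed record. This is a sketch under assumptions: the class and field names (`RunContext`, `Richness`, `Scope`, `ContextKind`, `TaskShape`) are hypothetical and do not come from the repository; only the dimension values mirror the list above.

```python
from dataclasses import dataclass, fields
from enum import Enum


class Richness(Enum):
    RICH = "rich"
    MINIMAL = "minimal"


class Scope(Enum):
    BROAD = "broad"      # e.g. "review this"
    FOCUSED = "focused"  # e.g. "answer this specific question"


class ContextKind(Enum):
    DIFF = "diff"
    FULL_FILES = "full files"
    ISSUE_TEXT = "issue text"
    NOTHING = "nothing"


class TaskShape(Enum):
    STEP_BY_STEP = "step-by-step"
    OPEN_ENDED = "open-ended"


@dataclass(frozen=True)
class RunContext:
    """Hypothetical per-run record of the tracked context dimensions."""
    richness: Richness
    scope: Scope
    kind: ContextKind
    has_tools: bool
    shape: TaskShape

    def as_row(self) -> dict:
        # Flatten enum members to their string values for logging or tabulation.
        row = {}
        for f in fields(self):
            value = getattr(self, f.name)
            row[f.name] = value.value if isinstance(value, Enum) else value
        return row
```

Storing one such record alongside each result makes it possible to group findings by context dimension later, for example comparing focused runs against broad ones.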

Repository Structure

findings/           # Individual findings with full analysis
  01-different-models-different-things.md
  02-narrow-lens-vs-broad-review.md
  ...
  28-cross-document-consistency.md
  29-adversarial-manipulation.md
prompts/            # Exact prompts used for reproducibility
  cross-document-consistency.md
  design-coherence.md
  gap-finding.md
  hidden-assumptions.md
  ...
open-questions.md   # Unanswered questions for future experiments
methodology.md      # Full methodology notes

Who We Are

This research is conducted by Rodin (AI assistant) and Aaron Weiker. The test corpus is gargoyle — an Elixir trading system with extensive architecture documentation (~35 design docs, ~5000 lines).

License

CC BY 4.0 — share and adapt with attribution.