Rodin 1b108ff66e Initial publish: 29 findings, 6 prompts, methodology, open questions

Full comparative analysis of GPT-5, Claude Opus 4.6, Claude Sonnet 4.6,
GPT-4.1, and GPT-4.1 Mini on analytical tasks (not coding).

Contents:
- findings/ALL-FINDINGS.md — complete 3,249-line research log with all
  29 findings, methodology notes, and open questions
- prompts/ — 6 exact prompts used across experiments
- methodology.md — experimental setup and evaluation criteria
- open-questions.md — unanswered questions for future work
- README.md — overview and summary table

Key findings:
- Cross-document consistency: Opus is 2.4x faster with more findings
- Gap-finding: GPT-5 reasoning tokens find domain-specific gaps
- Race conditions: Opus excels at temporal interaction reasoning
- Bias detection: Signal-to-noise ratio > model capability
- Adversarial analysis: GPT-5 exhaustive, Opus qualitatively different

Signed-off-by: Rodin

2026-05-05 19:13:03 -07:00

Model Research — AI for Analytical Work

Comparative analysis of AI models on analytical tasks — not coding.

Most public discussion about LLM capabilities focuses on code generation. We found almost no published methodology for using models in analytical research tasks (searched 2026-04-26). This repo fills that gap with controlled experiments and reproducible findings.

What We're Testing

Using GPT-5, Claude Opus 4.6, Claude Sonnet 4.6, and GPT-4.1 (+ Mini) for:

- Architecture document review
- Bias and assumption detection
- Gap-finding in design specifications
- Cross-document consistency analysis
- Race condition identification
- Adversarial path analysis
- Contradiction detection
- Regulatory compliance review

Key Findings (Summary)

| # | Task Type | Winner | Key Insight |
|---|---|---|---|
| 1 | PR review | Both | Different models catch different things (Sonnet: structural; GPT-5: semantic) |
| 2 | Bias detection | Framing | Signal-to-noise ratio matters more than model capability |
| 9 | Gap-finding | GPT-5 | Reasoning tokens find domain-specific gaps, not generic ones |
| 10 | Hidden assumptions | GPT-5 | Reasoning produces qualitatively different (not just more) findings |
| 13 | Race conditions | Opus | Temporal interaction reasoning is Opus's strongest domain |
| 15 | Design coherence | Task-dependent | Single-doc: model choice depends on document complexity |
| 25 | Contradiction detection | Opus | Precision > exhaustiveness; Opus's self-correction is unique |
| 28 | Cross-doc consistency | Opus | 2.4x faster than GPT-5 with more findings; boundary reasoning |
| 29 | Adversarial analysis | GPT-5 + Opus | GPT-5: exhaustive; Opus: qualitatively different attack vectors |

Methodology

Each experiment:

- Same input document(s) to all models
- Same structured prompt with explicit categories to analyze
- No tools, no project context beyond the document(s)
- Independent runs — no cross-pollination between models
- Results evaluated for: correctness, uniqueness, actionability
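
The setup above can be sketched as a small harness. This is an illustrative sketch only, not the code used in these experiments: `call_model` is a hypothetical callable standing in for whatever client sends a prompt to a model, and matching findings by normalized string is a simplifying assumption (judging whether two findings are "the same" would normally take human review).

```python
from typing import Callable, Dict, List, Set

# Hypothetical signature: (model_name, prompt, documents) -> list of finding strings.
ModelCaller = Callable[[str, str, List[str]], List[str]]


def run_experiment(
    models: List[str],
    prompt: str,
    documents: List[str],
    call_model: ModelCaller,
) -> Dict[str, List[str]]:
    """Send the identical prompt and documents to each model in isolation."""
    results: Dict[str, List[str]] = {}
    for model in models:
        # Each call receives its own copy of the inputs; no model sees
        # another model's output (no cross-pollination).
        results[model] = call_model(model, prompt, list(documents))
    return results


def unique_findings(results: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Return, per model, the findings that no other model reported.

    Assumption: two findings are the same if they match after trimming
    whitespace and lowercasing. This is a crude proxy for "uniqueness".
    """
    normalized = {m: {f.strip().lower() for f in fs} for m, fs in results.items()}
    unique: Dict[str, Set[str]] = {}
    for model, found in normalized.items():
        others: Set[str] = set()
        for other_model, other_found in normalized.items():
            if other_model != model:
                others |= other_found
        unique[model] = found - others
    return unique
```

Correctness and actionability are left out of the sketch on purpose, since both depend on a reviewer's judgment rather than on anything computable from the raw output.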

Context dimensions tracked:

- Rich vs minimal (how much background info)
- Broad vs focused ("review this" vs "answer this specific question")
- What kind of context (diff, full files, issue text, nothing)
- Whether the model had tools or just text
- Whether the task was step-by-step or open-ended
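
One way to record these dimensions per run is a small typed record. The sketch below is an assumption about how such tracking could look, not the repository's actual format; every class and field name (`RunContext`, `Background`, `had_tools`, and so on) is hypothetical.

```python
from dataclasses import dataclass, fields
from enum import Enum


class Background(Enum):
    RICH = "rich"
    MINIMAL = "minimal"


class Scope(Enum):
    BROAD = "broad"        # e.g. "review this"
    FOCUSED = "focused"    # e.g. "answer this specific question"


class ContextKind(Enum):
    DIFF = "diff"
    FULL_FILES = "full_files"
    ISSUE_TEXT = "issue_text"
    NOTHING = "nothing"


class TaskShape(Enum):
    STEP_BY_STEP = "step_by_step"
    OPEN_ENDED = "open_ended"


@dataclass(frozen=True)
class RunContext:
    """Hypothetical per-run record of the five context dimensions."""
    background: Background
    scope: Scope
    context_kind: ContextKind
    had_tools: bool
    task_shape: TaskShape

    def as_row(self) -> dict:
        """Flatten to plain values, e.g. for a CSV row or log line."""
        row = {}
        for f in fields(self):
            value = getattr(self, f.name)
            row[f.name] = value.value if isinstance(value, Enum) else value
        return row
```

Using enums rather than free text keeps runs comparable: two experiments can be grouped by `scope` or `context_kind` without normalizing spelling first.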

Repository Structure

findings/           # Individual findings with full analysis
  01-different-models-different-things.md
  02-narrow-lens-vs-broad-review.md
  ...
  28-cross-document-consistency.md
  29-adversarial-manipulation.md
prompts/            # Exact prompts used for reproducibility
  cross-document-consistency.md
  design-coherence.md
  gap-finding.md
  hidden-assumptions.md
  ...
open-questions.md   # Unanswered questions for future experiments
methodology.md      # Full methodology notes

Who We Are

This research is conducted by Rodin (AI assistant) and Aaron Weiker. The test corpus is gargoyle — an Elixir trading system with extensive architecture documentation (~35 design docs, ~5000 lines).

License

CC BY 4.0 — share and adapt with attribution.
