model-research/README.md
Rodin 6af8a6ee10 refactor(findings): split ALL-FINDINGS.md into per-experiment files
Break the monolithic 3249-line findings file into 29 individual files,
one per experiment. Each file is named YYYY-MM-DD-NN-slug.md for easy
chronological sorting and discovery.

No content changes — purely structural reorganization.
2026-05-06 07:15:50 -07:00


Model Research — AI for Analytical Work

Comparative analysis of AI models on analytical tasks — not coding.

Most public discussion about LLM capabilities focuses on code generation. We found almost no published methodology for using models in analytical research tasks (searched 2026-04-26). This repo fills that gap with controlled experiments and reproducible findings.

What We're Testing

We test GPT-5, Claude Opus 4.6, Claude Sonnet 4.6, and GPT-4.1 (+ Mini) on:

  • Architecture document review
  • Bias and assumption detection
  • Gap-finding in design specifications
  • Cross-document consistency analysis
  • Race condition identification
  • Adversarial path analysis
  • Contradiction detection
  • Regulatory compliance review

Key Findings (Summary)

| # | Task Type | Winner | Key Insight |
|---|---|---|---|
| 1 | PR review | Both | Different models catch different things (Sonnet: structural, GPT-5: semantic) |
| 2 | Bias detection | Framing | Signal-to-noise ratio matters more than model capability |
| 9 | Gap-finding | GPT-5 | Reasoning tokens find domain-specific gaps, not generic ones |
| 10 | Hidden assumptions | GPT-5 | Reasoning produces qualitatively different (not just more) findings |
| 13 | Race conditions | Opus | Temporal interaction reasoning is Opus's strongest domain |
| 15 | Design coherence | Task-dependent | Single-doc: model choice depends on document complexity |
| 25 | Contradiction detection | Opus | Precision > exhaustiveness; Opus's self-correction is unique |
| 28 | Cross-doc consistency | Opus | 2.4x faster than GPT-5 with more findings; boundary reasoning |
| 29 | Adversarial analysis | GPT-5 + Opus | GPT-5: exhaustive; Opus: qualitatively different attack vectors |

Methodology

Each experiment:

  1. Same input document(s) to all models
  2. Same structured prompt with explicit categories to analyze
  3. No tools, no project context beyond the document(s)
  4. Independent runs — no cross-pollination between models
  5. Results evaluated for: correctness, uniqueness, actionability
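
The protocol above can be sketched as a small harness. This is an illustrative sketch only, not the code used in these experiments: `ModelFn`, `RunResult`, `build_prompt`, and `run_experiment` are hypothetical names, and each model is assumed to be wrapped in a plain function that takes prompt text and returns the raw answer.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical signature: takes the full prompt text, returns the model's raw answer.
ModelFn = Callable[[str], str]


@dataclass(frozen=True)
class RunResult:
    model: str
    prompt: str
    output: str


def build_prompt(template: str, documents: List[str]) -> str:
    """Combine the fixed structured prompt with the document(s) under review."""
    joined = "\n\n---\n\n".join(documents)
    return f"{template}\n\n{joined}"


def run_experiment(
    template: str, documents: List[str], models: Dict[str, ModelFn]
) -> List[RunResult]:
    """Send the identical prompt to every model, one independent call each."""
    # Built once, so every model sees exactly the same input and prompt.
    prompt = build_prompt(template, documents)
    results = []
    for name, call in models.items():
        # Each call receives only the prompt: no tools, no project context,
        # and no output from any other model.
        results.append(RunResult(model=name, prompt=prompt, output=call(prompt)))
    return results
```

Building the prompt once, outside the loop, is what keeps the comparison controlled: any difference in output then comes from the model, not from the input. Evaluation for correctness, uniqueness, and actionability would happen afterwards on the collected `RunResult` list.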

Context dimensions tracked:

  • Rich vs minimal (how much background info)
  • Broad vs focused ("review this" vs "answer this specific question")
  • What kind of context (diff, full files, issue text, nothing)
  • Whether the model had tools or just text
  • Whether the task was step-by-step or open-ended
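
One way to record these dimensions alongside each run is a small typed record. The sketch below is an assumption about how such tracking could look, not the repository's actual schema; `ContextKind`, `ContextDimensions`, and all field names are hypothetical.

```python
from dataclasses import asdict, dataclass
from enum import Enum


class ContextKind(Enum):
    # The four kinds of context named in the list above.
    DIFF = "diff"
    FULL_FILES = "full files"
    ISSUE_TEXT = "issue text"
    NOTHING = "nothing"


@dataclass(frozen=True)
class ContextDimensions:
    rich: bool          # True = rich background info, False = minimal
    focused: bool       # True = specific question, False = broad "review this"
    kind: ContextKind   # what kind of context the model was given
    had_tools: bool     # True = model had tools, False = text only
    step_by_step: bool  # True = step-by-step task, False = open-ended

    def as_row(self) -> dict:
        """Flatten to plain values, e.g. for a CSV or Markdown table row."""
        row = asdict(self)
        row["kind"] = self.kind.value
        return row
```

Recording the dimensions as explicit fields makes it possible to group results later by, say, `focused` or `had_tools` without re-reading each finding.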

Repository Structure

findings/                                         # Individual findings with full analysis
  README.md                                       # Context and index
  YYYY-MM-DD-NN-slug.md                           # One file per experiment
  2026-04-26-01-different-models-catch-different-things.md
  2026-04-26-07-emerging-role-assignments-pattern-not.md
  2026-05-03-07b-token-budget-matters-more-than.md  # Duplicate #7 (suffix b)
  2026-05-03-15-design-coherence-analysis.md
  ...
  2026-05-05-29-adversarial-manipulation-analysis-new-task.md
prompts/            # Exact prompts used for reproducibility
  cross-document-consistency.md
  design-coherence.md
  gap-finding.md
  hidden-assumptions.md
  ...
open-questions.md   # Unanswered questions for future experiments
methodology.md      # Full methodology notes

Findings are named YYYY-MM-DD-NN-slug.md for chronological sorting. Numbers are zero-padded (01 to 29). The duplicate finding #7 uses a b suffix.
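
The naming convention can be parsed mechanically. The following is a minimal sketch that assumes the suffix is a single lowercase letter directly after the two-digit number, as in the `07b` example; `FINDING_RE`, `Finding`, `parse_finding`, and `sort_findings` are hypothetical helpers, not part of this repository.

```python
import re
from typing import List, NamedTuple, Optional

# YYYY-MM-DD, a two-digit number, an optional one-letter suffix, then the slug.
FINDING_RE = re.compile(r"^(\d{4}-\d{2}-\d{2})-(\d{2})([a-z]?)-(.+)\.md$")


class Finding(NamedTuple):
    date: str
    number: int
    suffix: str
    slug: str


def parse_finding(filename: str) -> Optional[Finding]:
    """Return the parsed parts, or None if the name does not follow the convention."""
    m = FINDING_RE.match(filename)
    if m is None:
        return None
    date, number, suffix, slug = m.groups()
    return Finding(date, int(number), suffix, slug)


def sort_findings(filenames: List[str]) -> List[str]:
    """Sort by date, then number, then suffix; skip files such as README.md."""
    parsed = [(parse_finding(f), f) for f in filenames]
    return [f for p, f in sorted((p, f) for p, f in parsed if p is not None)]
```

For example, `parse_finding("2026-05-03-07b-token-budget-matters-more-than.md")` yields number `7` with suffix `"b"`, while `parse_finding("README.md")` returns `None`.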

Who We Are

This research is conducted by Rodin (AI assistant) and Aaron Weiker. The test corpus is gargoyle — an Elixir trading system with extensive architecture documentation (~35 design docs, ~5000 lines).

License

CC BY 4.0 — share and adapt with attribution.