Rodin 6af8a6ee10 refactor(findings): split ALL-FINDINGS.md into per-experiment files

Break the monolithic 3249-line findings file into 29 individual files,
one per experiment. Each file is named YYYY-MM-DD-NN-slug.md for easy
chronological sorting and discovery.

No content changes — purely structural reorganization.

2026-05-06 07:15:50 -07:00

Model Research — AI for Analytical Work

Comparative analysis of AI models on analytical tasks — not coding.

Most public discussion about LLM capabilities focuses on code generation. We found almost no published methodology for using models in analytical research tasks (searched 2026-04-26). This repo fills that gap with controlled experiments and reproducible findings.

What We're Testing

We compare GPT-5, Claude Opus 4.6, Claude Sonnet 4.6, GPT-4.1, and GPT-4.1 Mini on the following analytical tasks:

Architecture document review
Bias and assumption detection
Gap-finding in design specifications
Cross-document consistency analysis
Race condition identification
Adversarial path analysis
Contradiction detection
Regulatory compliance review

Key Findings (Summary)

#	Task Type	Winner	Key Insight
1	PR review	Sonnet + GPT-5	Different models catch different things (Sonnet: structural, GPT-5: semantic)
2	Bias detection	Framing	Signal-to-noise ratio matters more than model capability
9	Gap-finding	GPT-5	Reasoning tokens find domain-specific gaps, not generic ones
10	Hidden assumptions	GPT-5	Reasoning produces qualitatively different (not just more) findings
13	Race conditions	Opus	Temporal interaction reasoning is Opus's strongest domain
15	Design coherence	Task-dependent	Single-doc: model choice depends on document complexity
25	Contradiction detection	Opus	Precision > exhaustiveness; Opus's self-correction is unique
28	Cross-doc consistency	Opus	2.4x faster than GPT-5 with more findings; boundary reasoning
29	Adversarial analysis	GPT-5 + Opus	GPT-5: exhaustive; Opus: qualitatively different attack vectors

Methodology

Each experiment:

Same input document(s) to all models
Same structured prompt with explicit categories to analyze
No tools, no project context beyond the document(s)
Independent runs — no cross-pollination between models
Results evaluated for: correctness, uniqueness, actionability
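
The experiment loop above can be sketched in a few lines of Python. This is a minimal illustration, not the harness used for the findings: `call_model` is a hypothetical callable standing in for whatever client sends a prompt to a model, and treating "uniqueness" as a set difference over normalized finding labels is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set

# Hypothetical signature: (model_name, prompt_text) -> raw response text.
ModelCaller = Callable[[str, str], str]


@dataclass(frozen=True)
class RunResult:
    model: str
    response: str


def build_prompt(template: str, documents: List[str]) -> str:
    """Combine the structured prompt and the input documents once, so every model sees identical text."""
    joined = "\n\n---\n\n".join(documents)
    return f"{template}\n\n{joined}"


def run_experiment(
    models: List[str],
    template: str,
    documents: List[str],
    call_model: ModelCaller,
) -> List[RunResult]:
    """Send the same prompt to each model independently."""
    prompt = build_prompt(template, documents)
    # Each call receives only the prompt: no tools, no output from other models.
    return [RunResult(model=m, response=call_model(m, prompt)) for m in models]


def unique_findings(findings_by_model: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Return, per model, the findings that no other model reported."""
    unique = {}
    for model, findings in findings_by_model.items():
        others = set().union(
            *(f for m, f in findings_by_model.items() if m != model)
        )
        unique[model] = findings - others
    return unique
```

Building the prompt once and passing the same string to every call keeps the "same input, same prompt" rule structural. Correctness and actionability still need human judgment and are not modeled here.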

Context dimensions tracked:

Rich vs minimal (how much background info)
Broad vs focused ("review this" vs "answer this specific question")
What kind of context (diff, full files, issue text, nothing)
Whether the model had tools or just text
Whether the task was step-by-step or open-ended
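
One way to record these dimensions per run is a small typed record. The sketch below is illustrative only: the class, field, and enum names are hypothetical and are not taken from the repository.

```python
from dataclasses import dataclass
from enum import Enum


class Richness(Enum):
    RICH = "rich"
    MINIMAL = "minimal"


class Scope(Enum):
    BROAD = "broad"      # e.g. "review this"
    FOCUSED = "focused"  # e.g. "answer this specific question"


class ContextKind(Enum):
    DIFF = "diff"
    FULL_FILES = "full_files"
    ISSUE_TEXT = "issue_text"
    NOTHING = "nothing"


@dataclass(frozen=True)
class ContextDimensions:
    """Context settings for a single model run (hypothetical schema)."""
    richness: Richness
    scope: Scope
    kind: ContextKind
    had_tools: bool
    step_by_step: bool

    def as_row(self) -> dict:
        """Flatten to plain values, e.g. for a CSV row or a findings header."""
        return {
            "richness": self.richness.value,
            "scope": self.scope.value,
            "kind": self.kind.value,
            "had_tools": self.had_tools,
            "step_by_step": self.step_by_step,
        }
```

Using enums rather than free text keeps runs comparable: two experiments either share a context setting or they do not.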

Repository Structure

findings/                                         # Individual findings with full analysis
  README.md                                       # Context and index
  YYYY-MM-DD-NN-slug.md                           # One file per experiment
  2026-04-26-01-different-models-catch-different-things.md
  2026-04-26-07-emerging-role-assignments-pattern-not.md
  2026-05-03-07b-token-budget-matters-more-than.md  # Duplicate #7 (suffix b)
  2026-05-03-15-design-coherence-analysis.md
  ...
  2026-05-05-29-adversarial-manipulation-analysis-new-task.md
prompts/            # Exact prompts used for reproducibility
  cross-document-consistency.md
  design-coherence.md
  gap-finding.md
  hidden-assumptions.md
  ...
open-questions.md   # Unanswered questions for future experiments
methodology.md      # Full methodology notes

Findings are named YYYY-MM-DD-NN-slug.md for chronological sorting. Numbers are zero-padded (01–29). The duplicate finding #7 uses a b suffix.
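
The naming convention can be checked or parsed with a short script. The following is a sketch under stated assumptions: the regular expression and the names `FindingName` and `parse_finding_name` are illustrative, and slugs are assumed to contain only lowercase letters, digits, and hyphens, as in the examples listed above.

```python
import re
from typing import NamedTuple, Optional

# YYYY-MM-DD-NN[suffix]-slug.md, where the optional suffix is one lowercase letter.
FINDING_RE = re.compile(r"^(\d{4}-\d{2}-\d{2})-(\d{2})([a-z]?)-([a-z0-9-]+)\.md$")


class FindingName(NamedTuple):
    date: str
    number: int
    suffix: str
    slug: str


def parse_finding_name(filename: str) -> Optional[FindingName]:
    """Parse a findings filename; return None for files such as README.md."""
    match = FINDING_RE.match(filename)
    if match is None:
        return None
    date, number, suffix, slug = match.groups()
    return FindingName(date, int(number), suffix, slug)


def chronological(filenames):
    """Sort findings by date, then experiment number, then suffix, skipping other files."""
    parsed = [(parse_finding_name(name), name) for name in filenames]
    return [name for info, name in sorted(p for p in parsed if p[0] is not None)]
```

For example, `parse_finding_name("2026-05-03-07b-token-budget-matters-more-than.md")` yields number `7` with suffix `"b"`, while `README.md` is skipped.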

Who We Are

This research is conducted by Rodin (AI assistant) and Aaron Weiker. The test corpus is gargoyle — an Elixir trading system with extensive architecture documentation (~35 design docs, ~5000 lines).

License

CC BY 4.0 — share and adapt with attribution.
