Rodin 6af8a6ee10 refactor(findings): split ALL-FINDINGS.md into per-experiment files
Break the monolithic 3249-line findings file into 29 individual files,
one per experiment. Each file is named YYYY-MM-DD-NN-slug.md for easy
chronological sorting and discovery.

No content changes — purely structural reorganization.
2026-05-06 07:15:50 -07:00

# Model Research — AI for Analytical Work
Comparative analysis of AI models on **analytical tasks** — not coding.
Most public discussion about LLM capabilities focuses on code generation.
We found almost no published methodology for using models in analytical
research tasks (searched 2026-04-26). This repo fills that gap with
controlled experiments and reproducible findings.
## What We're Testing
We compare GPT-5, Claude Opus 4.6, Claude Sonnet 4.6, and GPT-4.1 (+ Mini) on:
- Architecture document review
- Bias and assumption detection
- Gap-finding in design specifications
- Cross-document consistency analysis
- Race condition identification
- Adversarial path analysis
- Contradiction detection
- Regulatory compliance review
## Key Findings (Summary)
| # | Task Type | Winner | Key Insight |
|---|-----------|--------|-------------|
| 1 | PR review | Both | Different models catch different things — Sonnet: structural, GPT-5: semantic |
| 2 | Bias detection | Framing | Signal-to-noise ratio matters more than model capability |
| 9 | Gap-finding | GPT-5 | Reasoning tokens find domain-specific gaps, not generic ones |
| 10 | Hidden assumptions | GPT-5 | Reasoning produces qualitatively different (not just more) findings |
| 13 | Race conditions | Opus | Temporal interaction reasoning is Opus's strongest domain |
| 15 | Design coherence | Task-dependent | Single-doc: model choice depends on document complexity |
| 25 | Contradiction detection | Opus | Precision > exhaustiveness; Opus's self-correction is unique |
| 28 | Cross-doc consistency | Opus | 2.4x faster than GPT-5 with more findings; boundary reasoning |
| 29 | Adversarial analysis | GPT-5 + Opus | GPT-5: exhaustive; Opus: qualitatively different attack vectors |
## Methodology
Each experiment:
1. Same input document(s) to all models
2. Same structured prompt with explicit categories to analyze
3. No tools, no project context beyond the document(s)
4. Independent runs — no cross-pollination between models
5. Results evaluated for: correctness, uniqueness, actionability
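
The steps above can be sketched in Python. This is a minimal illustration, not the harness used for these experiments: `ModelFn`, `build_prompt`, `run_experiment`, and `unique_findings` are hypothetical names, each model is assumed to be a plain callable from prompt text to a list of findings, and "uniqueness" is approximated by exact string comparison.

```python
from typing import Callable

# Hypothetical: a "model" is any callable mapping a prompt to a list of findings.
ModelFn = Callable[[str], list[str]]


def build_prompt(document: str, categories: list[str]) -> str:
    """Build the single structured prompt that every model receives verbatim."""
    bullets = "\n".join(f"- {c}" for c in categories)
    return f"Analyze the document for:\n{bullets}\n\n---\n{document}"


def run_experiment(
    document: str, categories: list[str], models: dict[str, ModelFn]
) -> dict[str, list[str]]:
    """Send the same prompt to each model; no output is shared between models."""
    prompt = build_prompt(document, categories)
    return {name: fn(prompt) for name, fn in models.items()}


def unique_findings(results: dict[str, list[str]]) -> dict[str, list[str]]:
    """Keep, per model, only the findings that no other model reported."""
    unique: dict[str, list[str]] = {}
    for name, findings in results.items():
        others: set[str] = set()
        for other_name, other_findings in results.items():
            if other_name != name:
                others.update(other_findings)
        unique[name] = [f for f in findings if f not in others]
    return unique
```

In practice, two models rarely phrase the same finding identically, so the uniqueness check would need manual or semantic matching. Correctness and actionability need human judgment and are not modeled here.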
**Context dimensions tracked:**
- Rich vs minimal (how much background info)
- Broad vs focused ("review this" vs "answer this specific question")
- What kind of context (diff, full files, issue text, nothing)
- Whether the model had tools or just text
- Whether the task was step-by-step or open-ended
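
One way to record these dimensions per run is a small immutable record, sketched below. The names `RunContext`, `ContextKind`, and `differing_dimensions` are illustrative assumptions, not the repository's actual schema. Reducing "rich vs minimal" and "broad vs focused" to booleans is also a simplification.

```python
from dataclasses import dataclass, fields, replace
from enum import Enum


class ContextKind(Enum):
    """What kind of context the model was given (values from the list above)."""
    DIFF = "diff"
    FULL_FILES = "full files"
    ISSUE_TEXT = "issue text"
    NOTHING = "nothing"


@dataclass(frozen=True)
class RunContext:
    rich: bool          # True = rich background info, False = minimal
    focused: bool       # True = specific question, False = broad "review this"
    kind: ContextKind   # diff, full files, issue text, or nothing
    has_tools: bool     # True = model had tools, False = text only
    step_by_step: bool  # True = step-by-step task, False = open-ended


def differing_dimensions(a: RunContext, b: RunContext) -> list[str]:
    """Names of the dimensions on which two runs differ."""
    return [
        f.name
        for f in fields(RunContext)
        if getattr(a, f.name) != getattr(b, f.name)
    ]


# Example: two runs that differ only in how much background was supplied.
baseline = RunContext(
    rich=False, focused=True, kind=ContextKind.NOTHING,
    has_tools=False, step_by_step=False,
)
variant = replace(baseline, rich=True)
```

Comparing two runs with `differing_dimensions` shows whether they differ in exactly one dimension, which is the condition for attributing a change in results to that dimension.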
## Repository Structure
```
findings/                      # Individual findings with full analysis
  README.md                    # Context and index
  YYYY-MM-DD-NN-slug.md        # One file per experiment
  2026-04-26-01-different-models-catch-different-things.md
  2026-04-26-07-emerging-role-assignments-pattern-not.md
  2026-05-03-07b-token-budget-matters-more-than.md  # Duplicate #7 (suffix b)
  2026-05-03-15-design-coherence-analysis.md
  ...
  2026-05-05-29-adversarial-manipulation-analysis-new-task.md
prompts/                       # Exact prompts used for reproducibility
  cross-document-consistency.md
  design-coherence.md
  gap-finding.md
  hidden-assumptions.md
  ...
open-questions.md              # Unanswered questions for future experiments
methodology.md                 # Full methodology notes
```
Findings are named `YYYY-MM-DD-NN-slug.md` for chronological sorting.
Numbers are zero-padded (01 to 29). The second finding numbered 7 uses a `b` suffix (`07b`).
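
The naming scheme can be parsed mechanically. The sketch below is an illustration only: `FINDING_RE`, `Finding`, and `parse_finding` are hypothetical names. It assumes the optional suffix is a single lowercase letter, although only `b` is documented above.

```python
import re
from datetime import date
from typing import NamedTuple, Optional

# YYYY-MM-DD-NN[suffix]-slug.md; a single-letter suffix is an assumption.
FINDING_RE = re.compile(r"^(\d{4})-(\d{2})-(\d{2})-(\d{2})([a-z]?)-(.+)\.md$")


class Finding(NamedTuple):
    day: date     # experiment date taken from the filename
    number: int   # finding number (the zero-padded NN part)
    suffix: str   # "" or a letter such as "b" for a duplicate number
    slug: str     # remaining descriptive part, without ".md"


def parse_finding(filename: str) -> Optional[Finding]:
    """Parse a findings filename; return None if it does not match the scheme."""
    match = FINDING_RE.match(filename)
    if match is None:
        return None
    year, month, day_of_month, number, suffix, slug = match.groups()
    return Finding(
        date(int(year), int(month), int(day_of_month)),
        int(number),
        suffix,
        slug,
    )
```

For example, `parse_finding("2026-05-03-07b-token-budget-matters-more-than.md")` yields date 2026-05-03, number 7, suffix `b`, and slug `token-budget-matters-more-than`, while `parse_finding("README.md")` returns `None`. Because the date and number are zero-padded, a plain lexicographic sort of the filenames is already chronological.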
## Who We Are
This research is conducted by [Rodin](https://gitea.weiker.me/rodin) (AI
assistant) and Aaron Weiker. The test corpus is gargoyle — an Elixir
trading system with extensive architecture documentation (~35 design docs,
~5000 lines).
## License
CC BY 4.0 — share and adapt with attribution.