model-research/README.md
Rodin a3aebc7cc1 docs(readme): add Reports section with links to REPORT.md and LESSONS.md
Explains what each file contains, that they're auto-regenerated weekly,
and includes generation timestamps.
2026-05-06 07:29:03 -07:00

# Model Research — AI for Analytical Work
Comparative analysis of AI models on **analytical tasks** — not coding.
Most public discussion about LLM capabilities focuses on code generation.
We found almost no published methodology for using models in analytical
research tasks (searched 2026-04-26). This repo fills that gap with
controlled experiments and reproducible findings.
## What We're Testing
Using GPT-5, Claude Opus 4.6, Claude Sonnet 4.6, and GPT-4.1 (+ Mini) for:
- Architecture document review
- Bias and assumption detection
- Gap-finding in design specifications
- Cross-document consistency analysis
- Race condition identification
- Adversarial path analysis
- Contradiction detection
- Regulatory compliance review
## Key Findings (Summary)
| # | Task Type | Winner | Key Insight |
|---|-----------|--------|-------------|
| 1 | PR review | Both | Different models catch different things — Sonnet: structural, GPT-5: semantic |
| 2 | Bias detection | Framing | Signal-to-noise ratio matters more than model capability |
| 9 | Gap-finding | GPT-5 | Reasoning tokens find domain-specific gaps, not generic ones |
| 10 | Hidden assumptions | GPT-5 | Reasoning produces qualitatively different (not just more) findings |
| 13 | Race conditions | Opus | Temporal interaction reasoning is Opus's strongest domain |
| 15 | Design coherence | Task-dependent | Single-doc: model choice depends on document complexity |
| 25 | Contradiction detection | Opus | Precision > exhaustiveness; Opus's self-correction is unique |
| 28 | Cross-doc consistency | Opus | 2.4x faster than GPT-5 with more findings; boundary reasoning |
| 29 | Adversarial analysis | GPT-5 + Opus | GPT-5: exhaustive; Opus: qualitatively different attack vectors |
## Methodology
Each experiment:
1. Same input document(s) to all models
2. Same structured prompt with explicit categories to analyze
3. No tools, no project context beyond the document(s)
4. Independent runs — no cross-pollination between models
5. Results evaluated for: correctness, uniqueness, actionability

**Context dimensions tracked:**
- Rich vs minimal (how much background info)
- Broad vs focused ("review this" vs "answer this specific question")
- What kind of context (diff, full files, issue text, nothing)
- Whether the model had tools or just text
- Whether the task was step-by-step or open-ended
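
The steps and context dimensions above can be sketched as a small harness. This is a minimal illustration, not the code used for these experiments: `ContextDimensions`, `Run`, and `run_experiment` are hypothetical names, and each model is represented by a caller-supplied function rather than any real vendor API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class ContextDimensions:
    """Hypothetical record of the context dimensions tracked per experiment."""
    richness: str       # "rich" or "minimal"
    scope: str          # "broad" or "focused"
    context_kind: str   # "diff", "full_files", "issue_text", or "nothing"
    has_tools: bool     # whether the model had tools or just text
    step_by_step: bool  # step-by-step task vs open-ended task


@dataclass
class Run:
    model: str
    output: str
    context: ContextDimensions


def run_experiment(
    prompt: str,
    documents: List[str],
    models: Dict[str, Callable[[str], str]],
    context: ContextDimensions,
) -> List[Run]:
    # Build the input once so every model receives identical text.
    payload = prompt + "\n\n" + "\n\n---\n\n".join(documents)
    runs = []
    for name, ask in models.items():
        # Each call sees only the payload, never another model's output,
        # which keeps the runs independent.
        runs.append(Run(model=name, output=ask(payload), context=context))
    return runs
```

Passing the models in as plain callables keeps the sketch independent of any particular client library. Evaluation for correctness, uniqueness, and actionability would happen afterwards on the collected `Run` records.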
## Reports
- **[REPORT.md](REPORT.md)**: Full research analysis. Covers model strengths with evidence, task-type → model mappings, meta-findings about how to use models effectively, cost-effectiveness comparison, and open questions. Regenerated weekly from all findings.
- **[LESSONS.md](LESSONS.md)**: Actionable summary. The distilled "here's what to actually do" version: three core rules, operational playbooks for different review types, anti-patterns to avoid, and a model personality cheat sheet. Start here if you want answers, not methodology.

Both files include a generation timestamp and are automatically regenerated every Monday at 9 AM Pacific to incorporate new experiment results.
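
To check whether a report is stale, the next scheduled regeneration can be computed from the stated schedule. The sketch below is an assumption-laden illustration: the function name `next_regeneration` is hypothetical, "Pacific" is taken to mean the `America/Los_Angeles` time zone, and the scheduler that actually runs the job is not described here.

```python
from datetime import datetime, time, timedelta
from zoneinfo import ZoneInfo

# Assumption: "Pacific" means the America/Los_Angeles zone.
PACIFIC = ZoneInfo("America/Los_Angeles")


def next_regeneration(now: datetime) -> datetime:
    """Return the next Monday 09:00 Pacific strictly after `now`.

    `now` should be timezone-aware; a naive value is interpreted
    as system local time by `astimezone`.
    """
    local = now.astimezone(PACIFIC)
    days_ahead = (0 - local.weekday()) % 7  # Monday is weekday 0
    candidate = datetime.combine(
        local.date() + timedelta(days=days_ahead), time(9, 0), tzinfo=PACIFIC
    )
    if candidate <= local:
        # Already at or past this week's slot, so move to next Monday.
        candidate += timedelta(days=7)
    return candidate
```

A report whose generation timestamp is older than the most recent Monday 09:00 Pacific slot has probably missed a regeneration.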
## Repository Structure
```
REPORT.md              # Full research report (auto-regenerated weekly)
LESSONS.md             # Actionable lessons and playbooks (auto-regenerated weekly)
findings/              # Individual experiment files (one per experiment)
  README.md            # Context and index
  YYYY-MM-DD-NN-slug.md
  2026-04-26-01-different-models-catch-different-things.md
  ...
  2026-05-05-29-adversarial-manipulation-analysis-new-task.md
prompts/               # Exact prompts used for reproducibility
  cross-document-consistency.md
  design-coherence.md
  gap-finding.md
  hidden-assumptions.md
  ...
open-questions.md      # Unanswered questions for future experiments
methodology.md         # Full methodology notes
```
Findings are named `YYYY-MM-DD-NN-slug.md` for chronological sorting.
Numbers are zero-padded (01 through 29). The duplicate finding #7 uses a `b` suffix.
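
The naming convention can be parsed mechanically, for example when building an index of findings. This is a rough sketch: `parse_finding` and `Finding` are hypothetical names, and the regex assumes the `b` suffix attaches directly to the number (as in `07b`), which the convention above does not spell out.

```python
import re
from typing import NamedTuple

# Assumption: an optional single-letter suffix follows the two-digit number.
FINDING_RE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})-(?P<num>\d{2})(?P<suffix>[a-z]?)-(?P<slug>.+)\.md$"
)


class Finding(NamedTuple):
    date: str     # ISO date, so string order is chronological order
    number: int   # finding number without zero padding
    suffix: str   # "" or a letter such as "b" for a duplicate number
    slug: str


def parse_finding(filename: str) -> Finding:
    match = FINDING_RE.match(filename)
    if match is None:
        raise ValueError(f"not a finding filename: {filename!r}")
    return Finding(
        match["date"], int(match["num"]), match["suffix"], match["slug"]
    )
```

Because `Finding` is a tuple that starts with the ISO date and the number, `sorted(parse_finding(name) for name in names)` orders findings chronologically, matching a plain filename sort.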
## Who We Are
This research is conducted by [Rodin](https://gitea.weiker.me/rodin) (AI
assistant) and Aaron Weiker. The test corpus is gargoyle — an Elixir
trading system with extensive architecture documentation (~35 design docs,
~5000 lines).
## License
CC BY 4.0 — share and adapt with attribution.