Break the monolithic 3249-line findings file into 29 individual files, one per experiment. Each file is named YYYY-MM-DD-NN-slug.md for easy chronological sorting and discovery. No content changes — purely structural reorganization.
# Model Research: AI for Analytical Work

Comparative analysis of AI models on **analytical tasks**, not coding.

Most public discussion about LLM capabilities focuses on code generation.
We found almost no published methodology for using models in analytical
research tasks (searched 2026-04-26). This repo fills that gap with
controlled experiments and reproducible findings.

## What We're Testing

Using GPT-5, Claude Opus 4.6, Claude Sonnet 4.6, and GPT-4.1 (+ Mini) for:

- Architecture document review
- Bias and assumption detection
- Gap-finding in design specifications
- Cross-document consistency analysis
- Race condition identification
- Adversarial path analysis
- Contradiction detection
- Regulatory compliance review

## Key Findings (Summary)

| # | Task Type | Winner | Key Insight |
|---|-----------|--------|-------------|
| 1 | PR review | Both | Different models catch different things. Sonnet: structural, GPT-5: semantic |
| 2 | Bias detection | Framing | Signal-to-noise ratio matters more than model capability |
| 9 | Gap-finding | GPT-5 | Reasoning tokens find domain-specific gaps, not generic ones |
| 10 | Hidden assumptions | GPT-5 | Reasoning produces qualitatively different (not just more) findings |
| 13 | Race conditions | Opus | Temporal interaction reasoning is Opus's strongest domain |
| 15 | Design coherence | Task-dependent | Single-doc: model choice depends on document complexity |
| 25 | Contradiction detection | Opus | Precision > exhaustiveness; Opus's self-correction is unique |
| 28 | Cross-doc consistency | Opus | 2.4x faster than GPT-5 with more findings; boundary reasoning |
| 29 | Adversarial analysis | GPT-5 + Opus | GPT-5: exhaustive; Opus: qualitatively different attack vectors |

## Methodology

Each experiment:
1. Same input document(s) to all models
2. Same structured prompt with explicit categories to analyze
3. No tools, no project context beyond the document(s)
4. Independent runs, with no cross-pollination between models
5. Results evaluated for: correctness, uniqueness, actionability

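The protocol above can be sketched in a few lines of Python. This is an illustrative outline only, not the harness used for these experiments: `ModelFn`, `build_prompt`, `run_experiment`, and `unique_findings` are hypothetical names, a "model" is assumed to be any function from prompt text to response text, and "uniqueness" is read here as "reported by exactly one model", which is one possible interpretation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set

# Hypothetical: a model is any callable mapping a prompt string to a response string.
ModelFn = Callable[[str], str]


@dataclass(frozen=True)
class RunResult:
    model: str
    response: str


def build_prompt(categories: List[str], documents: List[str]) -> str:
    """Build one structured prompt; every model receives this exact string."""
    category_lines = "\n".join(f"- {c}" for c in categories)
    body = "\n\n---\n\n".join(documents)
    return f"Analyze the document(s) below for:\n{category_lines}\n\n{body}"


def run_experiment(
    models: Dict[str, ModelFn], categories: List[str], documents: List[str]
) -> List[RunResult]:
    """Send the same prompt to each model; no run sees another model's output."""
    prompt = build_prompt(categories, documents)
    return [RunResult(name, fn(prompt)) for name, fn in models.items()]


def unique_findings(findings_by_model: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Assumed reading of 'uniqueness': findings that only one model reported."""
    unique = {}
    for name, findings in findings_by_model.items():
        others = set().union(
            *(f for n, f in findings_by_model.items() if n != name)
        )
        unique[name] = findings - others
    return unique
```

Building the prompt once and passing the same string to every callable is what keeps steps 1 and 2 honest; extracting findings from free-text responses is left out because it depends on the prompt format.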
**Context dimensions tracked:**
- Rich vs minimal (how much background info)
- Broad vs focused ("review this" vs "answer this specific question")
- What kind of context (diff, full files, issue text, nothing)
- Whether the model had tools or just text
- Whether the task was step-by-step or open-ended

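One way to record these dimensions per run is a small typed record, so that runs can be grouped and compared later. The sketch below is an assumption about how such tracking could look; the class and field names (`ContextDimensions`, `richness`, `scope`, and so on) are hypothetical and do not come from this repo.

```python
from dataclasses import dataclass
from enum import Enum


class Richness(Enum):
    RICH = "rich"
    MINIMAL = "minimal"


class Scope(Enum):
    BROAD = "broad"      # e.g. "review this"
    FOCUSED = "focused"  # e.g. "answer this specific question"


class ContextKind(Enum):
    DIFF = "diff"
    FULL_FILES = "full files"
    ISSUE_TEXT = "issue text"
    NOTHING = "nothing"


@dataclass(frozen=True)
class ContextDimensions:
    """Hypothetical per-run record of the five tracked dimensions."""
    richness: Richness
    scope: Scope
    kind: ContextKind
    had_tools: bool
    step_by_step: bool

    def as_row(self) -> dict:
        """Flatten to plain values, e.g. for a CSV or Markdown table row."""
        return {
            "richness": self.richness.value,
            "scope": self.scope.value,
            "kind": self.kind.value,
            "had_tools": self.had_tools,
            "step_by_step": self.step_by_step,
        }
```

Using enums rather than free text keeps the labels consistent across experiments, which matters when comparing runs that differ in only one dimension.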
## Repository Structure

```
findings/                   # Individual findings with full analysis
  README.md                 # Context and index
  YYYY-MM-DD-NN-slug.md     # One file per experiment
  2026-04-26-01-different-models-catch-different-things.md
  2026-04-26-07-emerging-role-assignments-pattern-not.md
  2026-05-03-07b-token-budget-matters-more-than.md  # Duplicate #7 (suffix b)
  2026-05-03-15-design-coherence-analysis.md
  ...
  2026-05-05-29-adversarial-manipulation-analysis-new-task.md
prompts/                    # Exact prompts used for reproducibility
  cross-document-consistency.md
  design-coherence.md
  gap-finding.md
  hidden-assumptions.md
  ...
open-questions.md           # Unanswered questions for future experiments
methodology.md              # Full methodology notes
```

Findings are named `YYYY-MM-DD-NN-slug.md` for chronological sorting.
Numbers are zero-padded (01–29). The duplicate finding #7 uses a `b` suffix (`07b`).

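As a rough illustration of the naming scheme, the following Python sketch parses a findings filename into its parts. The names `FINDING_RE`, `Finding`, and `parse_finding` are hypothetical, and accepting any single lowercase letter as a suffix is an assumption; only `b` is known to be in use.

```python
import re
from datetime import date
from typing import NamedTuple, Optional

# YYYY-MM-DD-NN[suffix]-slug.md; the optional one-letter suffix is an assumption
# generalized from the single known case, "07b".
FINDING_RE = re.compile(r"^(\d{4})-(\d{2})-(\d{2})-(\d{2})([a-z]?)-(.+)\.md$")


class Finding(NamedTuple):
    day: date
    number: int
    suffix: str  # "" when there is no suffix
    slug: str


def parse_finding(filename: str) -> Optional[Finding]:
    """Return the parsed parts, or None for files that do not follow the scheme."""
    m = FINDING_RE.match(filename)
    if m is None:
        return None
    year, month, dom, number, suffix, slug = m.groups()
    return Finding(date(int(year), int(month), int(dom)), int(number), suffix, slug)
```

Because the date comes first and the number is zero-padded, a plain lexicographic sort of the filenames (for example `sorted(names)`) already yields chronological order, and `07b` sorts after `07` when both share a date.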
## Who We Are

This research is conducted by [Rodin](https://gitea.weiker.me/rodin) (AI
assistant) and Aaron Weiker. The test corpus is gargoyle, an Elixir
trading system with extensive architecture documentation (~35 design docs,
~5000 lines).

## License

CC BY 4.0: share and adapt with attribution.