# Model Research — AI for Analytical Work

Comparative analysis of AI models on **analytical tasks** — not coding.

Most public discussion about LLM capabilities focuses on code generation. We found almost no published methodology for using models in analytical research tasks (searched 2026-04-26). This repo fills that gap with controlled experiments and reproducible findings.

## What We're Testing

Using GPT-5, Claude Opus 4.6, Claude Sonnet 4.6, and GPT-4.1 (+ Mini) for:

- Architecture document review
- Bias and assumption detection
- Gap-finding in design specifications
- Cross-document consistency analysis
- Race condition identification
- Adversarial path analysis
- Contradiction detection
- Regulatory compliance review

## Key Findings (Summary)

| # | Task Type | Winner | Key Insight |
|---|-----------|--------|-------------|
| 1 | PR review | Both | Different models catch different things — Sonnet: structural, GPT-5: semantic |
| 2 | Bias detection | Framing | Signal-to-noise ratio matters more than model capability |
| 9 | Gap-finding | GPT-5 | Reasoning tokens find domain-specific gaps, not generic ones |
| 10 | Hidden assumptions | GPT-5 | Reasoning produces qualitatively different (not just more) findings |
| 13 | Race conditions | Opus | Temporal interaction reasoning is Opus's strongest domain |
| 15 | Design coherence | Task-dependent | Single-doc: model choice depends on document complexity |
| 25 | Contradiction detection | Opus | Precision > exhaustiveness; Opus's self-correction is unique |
| 28 | Cross-doc consistency | Opus | 2.4x faster than GPT-5 with more findings; boundary reasoning |
| 29 | Adversarial analysis | GPT-5 + Opus | GPT-5: exhaustive; Opus: qualitatively different attack vectors |

## Methodology

Each experiment:

1. Same input document(s) to all models
2. Same structured prompt with explicit categories to analyze
3. No tools, no project context beyond the document(s)
4. Independent runs — no cross-pollination between models
5. Results evaluated for: correctness, uniqueness, actionability

**Context dimensions tracked:**

- Rich vs minimal (how much background info)
- Broad vs focused ("review this" vs "answer this specific question")
- What kind of context (diff, full files, issue text, nothing)
- Whether the model had tools or just text
- Whether the task was step-by-step or open-ended

## Reports

- **[REPORT.md](REPORT.md)** — Full research analysis. Covers model strengths with evidence, task-type → model mappings, meta-findings about how to use models effectively, cost-effectiveness comparison, and open questions. Regenerated weekly from all findings.
- **[LESSONS.md](LESSONS.md)** — Actionable summary. The distilled "here's what to actually do" version: three core rules, operational playbooks for different review types, anti-patterns to avoid, and a model personality cheat sheet. Start here if you want answers, not methodology.

Both files include a generation timestamp and are automatically regenerated every Monday at 9 AM Pacific to incorporate new experiment results.

## Repository Structure

```
REPORT.md                    # Full research report (auto-regenerated weekly)
LESSONS.md                   # Actionable lessons and playbooks (auto-regenerated weekly)
findings/                    # Individual experiment files (one per experiment)
  README.md                  # Context and index
  YYYY-MM-DD-NN-slug.md
  2026-04-26-01-different-models-catch-different-things.md
  ...
  2026-05-05-29-adversarial-manipulation-analysis-new-task.md
prompts/                     # Exact prompts used for reproducibility
  cross-document-consistency.md
  design-coherence.md
  gap-finding.md
  hidden-assumptions.md
  ...
open-questions.md            # Unanswered questions for future experiments
methodology.md               # Full methodology notes
```

Findings are named `YYYY-MM-DD-NN-slug.md` for chronological sorting. Numbers are zero-padded (01–29). The duplicate finding #7 uses a `b` suffix.

## Who We Are

This research is conducted by [Rodin](https://gitea.weiker.me/rodin) (AI assistant) and Aaron Weiker. The test corpus is gargoyle — an Elixir trading system with extensive architecture documentation (~35 design docs, ~5000 lines).

## License

CC BY 4.0 — share and adapt with attribution.
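
## Example: Parsing Finding Filenames

The `YYYY-MM-DD-NN-slug.md` naming convention described under Repository Structure lends itself to a small parser for indexing or sorting findings. The following is a minimal sketch, not part of the repository's tooling. The names `FINDING_RE`, `Finding`, and `parse_finding` are hypothetical, and the sketch assumes the `b` suffix for the duplicate finding sits directly after the two-digit number (for example `07b`), which the README does not spell out.

```python
import re
from datetime import date
from typing import NamedTuple, Optional

# Assumption: an optional "b" suffix directly follows the two-digit number.
FINDING_RE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})"   # YYYY-MM-DD
    r"-(?P<num>\d{2})(?P<suffix>b?)"  # zero-padded NN, optional "b"
    r"-(?P<slug>[a-z0-9-]+)\.md$"     # lowercase hyphenated slug
)


class Finding(NamedTuple):
    day: date
    number: int
    suffix: str
    slug: str


def parse_finding(filename: str) -> Optional[Finding]:
    """Return the parsed parts of a finding filename, or None if it does not match."""
    m = FINDING_RE.match(filename)
    if m is None:
        return None
    try:
        day = date.fromisoformat(m["date"])
    except ValueError:
        # Digits matched the pattern but do not form a real calendar date.
        return None
    return Finding(day, int(m["num"]), m["suffix"], m["slug"])


def sort_findings(filenames):
    """Parse filenames, drop non-findings, and sort by date, number, suffix."""
    parsed = [f for f in map(parse_finding, filenames) if f is not None]
    return sorted(parsed, key=lambda f: (f.day, f.number, f.suffix))
```

Files that do not follow the pattern, such as `findings/README.md` or the literal `YYYY-MM-DD-NN-slug.md` placeholder, return `None` and are skipped by `sort_findings`.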