Initial publish: 29 findings, 6 prompts, methodology, open questions

Full comparative analysis of GPT-5, Claude Opus 4.6, Claude Sonnet 4.6, GPT-4.1, and GPT-4.1 Mini on analytical tasks (not coding).

Contents:
- findings/ALL-FINDINGS.md: complete 3,249-line research log with all 29 findings, methodology notes, and open questions
- prompts/: 6 exact prompts used across experiments
- methodology.md: experimental setup and evaluation criteria
- open-questions.md: unanswered questions for future work
- README.md: overview and summary table

Key findings:
- Cross-document consistency: Opus is 2.4x faster with more findings
- Gap-finding: GPT-5 reasoning tokens find domain-specific gaps
- Race conditions: Opus excels at temporal interaction reasoning
- Bias detection: Signal-to-noise ratio > model capability
- Adversarial analysis: GPT-5 exhaustive, Opus qualitatively different

Signed-off-by: Rodin
2026-05-05 19:13:03 -07:00
parent 4aea0d004b
commit 1b108ff66e
10 changed files with 3831 additions and 2 deletions
@@ -1,3 +1,81 @@
-# model-research
+# Model Research — AI for Analytical Work

-Comparative analysis of AI models on analytical tasks — not coding. Tracking what works when using GPT-5, Claude Opus, Claude Sonnet, and GPT-4.1 for research, document review, bias detection, and architecture analysis.
+Comparative analysis of AI models on **analytical tasks** — not coding.
+
+Most public discussion about LLM capabilities focuses on code generation.
+We found almost no published methodology for using models in analytical
+research tasks (searched 2026-04-26). This repo fills that gap with
+controlled experiments and reproducible findings.
+
+## What We're Testing
+
+We test GPT-5, Claude Opus 4.6, Claude Sonnet 4.6, GPT-4.1, and GPT-4.1 Mini on:
+
+- Architecture document review
+- Bias and assumption detection
+- Gap-finding in design specifications
+- Cross-document consistency analysis
+- Race condition identification
+- Adversarial path analysis
+- Contradiction detection
+- Regulatory compliance review
+
+## Key Findings (Summary)
+
+| # | Task Type | Winner | Key Insight |
+|---|-----------|--------|-------------|
+| 1 | PR review | Both (Sonnet, GPT-5) | Different models catch different things: Sonnet finds structural issues, GPT-5 semantic ones |
+| 2 | Bias detection | Framing | Signal-to-noise ratio matters more than model capability |
+| 9 | Gap-finding | GPT-5 | Reasoning tokens find domain-specific gaps, not generic ones |
+| 10 | Hidden assumptions | GPT-5 | Reasoning produces qualitatively different (not just more) findings |
+| 13 | Race conditions | Opus | Temporal interaction reasoning is Opus's strongest domain |
+| 15 | Design coherence | Task-dependent | Single-doc: model choice depends on document complexity |
+| 25 | Contradiction detection | Opus | Precision > exhaustiveness; Opus's self-correction is unique |
+| 28 | Cross-doc consistency | Opus | 2.4x faster than GPT-5 with more findings; boundary reasoning |
+| 29 | Adversarial analysis | GPT-5 + Opus | GPT-5: exhaustive; Opus: qualitatively different attack vectors |
+
+## Methodology
+
+Each experiment follows the same protocol:
+1. Same input document(s) to all models
+2. Same structured prompt with explicit categories to analyze
+3. No tools, no project context beyond the document(s)
+4. Independent runs — no cross-pollination between models
+5. Results evaluated for: correctness, uniqueness, actionability
+
+**Context dimensions tracked:**
+- Rich vs minimal (how much background info)
+- Broad vs focused ("review this" vs "answer this specific question")
+- What kind of context (diff, full files, issue text, nothing)
+- Whether the model had tools or just text
+- Whether the task was step-by-step or open-ended
+
+## Repository Structure
+
+```
+findings/           # Individual findings with full analysis
+  01-different-models-different-things.md
+  02-narrow-lens-vs-broad-review.md
+  ...
+  28-cross-document-consistency.md
+  29-adversarial-manipulation.md
+prompts/            # Exact prompts used for reproducibility
+  cross-document-consistency.md
+  design-coherence.md
+  gap-finding.md
+  hidden-assumptions.md
+  ...
+open-questions.md   # Unanswered questions for future experiments
+methodology.md      # Full methodology notes
+```
+
+## Who We Are
+
+This research is conducted by [Rodin](https://gitea.weiker.me/rodin) (AI
+assistant) and Aaron Weiker. The test corpus is gargoyle — an Elixir
+trading system with extensive architecture documentation (~35 design docs,
+~5000 lines).
+
+## License
+
+CC BY 4.0 — share and adapt with attribution.
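
The protocol in the README's Methodology section (same documents and prompt for every model, independent runs, then evaluation for uniqueness) could be sketched roughly as follows. This is an illustrative sketch only and not code from the repository: `ModelFn`, `ContextDimensions`, `build_input`, `run_experiment`, and `unique_findings` are hypothetical names, each model is stood in for by a plain function from text to a list of finding strings instead of a real API call, and "uniqueness" is assumed to mean exact string match, which a real evaluation would almost certainly replace with human or semantic comparison.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set

# Hypothetical stand-in: a "model" is any function from input text to findings.
ModelFn = Callable[[str], List[str]]


@dataclass(frozen=True)
class ContextDimensions:
    """Assumed record of the context dimensions listed in the README."""
    rich_background: bool    # rich vs minimal background info
    focused_question: bool   # focused question vs broad "review this"
    context_kind: str        # e.g. "diff", "full files", "issue text", "nothing"
    has_tools: bool          # tools available vs text only
    step_by_step: bool       # step-by-step vs open-ended task


def build_input(prompt: str, documents: List[str]) -> str:
    """Combine the structured prompt and the document(s) into one input."""
    return "\n\n".join([prompt, *documents])


def run_experiment(
    prompt: str, documents: List[str], models: Dict[str, ModelFn]
) -> Dict[str, List[str]]:
    """Give every model the identical input; no model sees another's output."""
    model_input = build_input(prompt, documents)
    return {name: list(model(model_input)) for name, model in models.items()}


def unique_findings(results: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """For each model, keep the findings that no other model reported."""
    unique: Dict[str, Set[str]] = {}
    for name, findings in results.items():
        others: Set[str] = set()
        for other_name, other_findings in results.items():
            if other_name != name:
                others.update(other_findings)
        unique[name] = set(findings) - others
    return unique


if __name__ == "__main__":
    # Stub models with made-up findings, purely for demonstration.
    stubs: Dict[str, ModelFn] = {
        "model-a": lambda text: ["missing timeout", "stale cache"],
        "model-b": lambda text: ["stale cache", "unordered writes"],
    }
    context = ContextDimensions(False, True, "full files", False, False)
    results = run_experiment("List race conditions.", ["<design doc>"], stubs)
    print(context)
    print(unique_findings(results))
```

Building the input once and passing the same string to every model is what enforces the "same input, same prompt" rule. Correctness and actionability are left out of the sketch because they need a reviewer's judgment.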