Initial publish: 29 findings, 6 prompts, methodology, open questions

Full comparative analysis of GPT-5, Claude Opus 4.6, Claude Sonnet 4.6, GPT-4.1, and GPT-4.1 Mini on analytical tasks (not coding).

Contents:
- findings/ALL-FINDINGS.md: complete 3,249-line research log with all 29 findings, methodology notes, and open questions
- prompts/: 6 exact prompts used across experiments
- methodology.md: experimental setup and evaluation criteria
- open-questions.md: unanswered questions for future work
- README.md: overview and summary table

Key findings:
- Cross-document consistency: Opus is 2.4x faster with more findings
- Gap-finding: GPT-5 reasoning tokens find domain-specific gaps
- Race conditions: Opus excels at temporal interaction reasoning
- Bias detection: Signal-to-noise ratio > model capability
- Adversarial analysis: GPT-5 exhaustive, Opus qualitatively different

Signed-off-by: Rodin
2026-05-05 19:13:03 -07:00
parent 4aea0d004b
commit 1b108ff66e
10 changed files with 3831 additions and 2 deletions
@@ -0,0 +1,76 @@
+# Methodology
+
+## Principles
+
+1. **Internet opinions about models are overwhelmingly about coding.** Don't
+   extrapolate to analytical work without testing.
+2. **"Just because someone says it on the internet doesn't make it right."**
+   Opinions need context. Track our own evidence.
+3. **Absence of published methodology for a use case is itself a finding.**
+4. **No unsupported generalizations.** Each finding needs: date, task,
+   how we used it (context shape, task framing, what info the model
+   had/didn't have), what happened, takeaway.
+
+## Experimental Setup
+
+### Models Tested
+
+| Model | Provider | Access | Notes |
+|-------|----------|--------|-------|
+| GPT-5 | OpenAI (via HAI proxy) | API | Requires `max_completion_tokens` ≥16K |
+| Claude Opus 4.6 | Anthropic (via HAI proxy) | API | Internal reasoning (not exposed) |
+| Claude Sonnet 4.6 | Anthropic (via HAI proxy) | API | Fast, cost-effective |
+| GPT-4.1 | OpenAI (via HAI proxy) | API | Non-reasoning, structured output |
+| GPT-4.1 Mini | OpenAI (via HAI proxy) | API | Cheapest, good for screening |
+| Claude Sonnet 4.5 | Anthropic (via HAI proxy) | API | Predecessor to 4.6 |
+
+### Control Variables
+
+- **Same input:** All models receive identical document text
+- **Same prompt:** Structured prompt with explicit categories and output format
+- **Same constraints:** No tools, no project context beyond the document(s)
+- **Independent runs:** No cross-pollination between model runs
+- **Temperature:** 0.3 for GPT-4.1 and GPT-4.1 Mini; GPT-5 runs at its default (1.0), which it requires
+
+### Measurement
+
+- **Time:** Wall clock from request to final token
+- **Output tokens:** Total generated tokens
+- **Reasoning tokens:** Recorded separately where the API exposes them (GPT-5); not visible for Claude models
+- **Findings count:** Number of distinct issues identified
+- **Unique findings:** Issues found by only one model
+- **Severity distribution:** Critical / High / Medium / Low per finding
+- **Tokens per finding:** Efficiency metric
+
+### Evaluation Criteria
+
+Each finding is assessed for:
+1. **Correctness:** Is the identified issue real?
+2. **Uniqueness:** Did only this model find it?
+3. **Actionability:** Would a developer change something based on this?
+4. **Depth:** Is it a surface observation or an architectural insight?
+
+### Context Dimensions Tracked
+
+| Dimension | Options |
+|-----------|---------|
+| Context richness | Rich (full project) vs Minimal (document only) |
+| Task framing | Broad ("review this") vs Focused ("check for X") |
+| Context type | Diff, full files, issue text, research notes, nothing |
+| Tool access | With tools (API calls, file reads) vs text-only |
+| Task structure | Step-by-step explicit vs open-ended |
+
+## Limitations
+
+- Single test corpus (gargoyle architecture docs) — domain bias possible
+- Single researcher evaluating findings — subjectivity in quality assessment
+- Models are non-deterministic — single runs, not averaged
+- Proxy adds latency — timing comparisons are relative, not absolute
+- Internal reasoning tokens not visible for Claude models
+
+## Reproducibility
+
+Prompts for each experiment are in the `prompts/` directory. The test
+corpus is the gargoyle project's `docs/` directory (available at
+`gitea.weiker.me/grgl/gargoyle`). Each finding documents the exact document
+used, its line count, and the specific version/commit when relevant.
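
The Measurement metrics listed in methodology.md (findings count, unique findings, severity distribution, tokens per finding) could be computed along the following lines. This is a minimal sketch and not part of the commit: the `ModelRun` structure, the finding IDs, and the sample numbers are all hypothetical, and it assumes that "tokens per finding" means output tokens divided by findings count and that findings have already been deduplicated by hand into shared IDs across models.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ModelRun:
    """One model run. Field names are hypothetical, not taken from the repo."""
    model: str
    output_tokens: int
    # Maps a manually normalized finding ID to its severity label.
    findings: dict[str, str] = field(default_factory=dict)


def tokens_per_finding(run: ModelRun) -> Optional[float]:
    """Assumed definition: output tokens / findings count (None if no findings)."""
    if not run.findings:
        return None
    return run.output_tokens / len(run.findings)


def unique_findings(runs: list[ModelRun]) -> dict[str, set[str]]:
    """For each model, the finding IDs that no other model reported."""
    seen = Counter(fid for run in runs for fid in run.findings)
    return {
        run.model: {fid for fid in run.findings if seen[fid] == 1}
        for run in runs
    }


def severity_distribution(run: ModelRun) -> Counter:
    """Count of findings per severity label (Critical / High / Medium / Low)."""
    return Counter(run.findings.values())


# Placeholder data for illustration only; these are not results from the study.
runs = [
    ModelRun("model-a", 1200, {"F1": "High", "F2": "Low", "F3": "Medium"}),
    ModelRun("model-b", 900, {"F1": "High", "F4": "Critical"}),
]

for run in runs:
    print(run.model, len(run.findings), tokens_per_finding(run),
          dict(severity_distribution(run)))
print(unique_findings(runs))
```

Keying findings by a shared ID keeps the uniqueness check a simple count across runs, but it moves the hard part (deciding when two models describe the same issue) into the manual normalization step, which is where the single-evaluator subjectivity noted under Limitations would enter.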