1b108ff66e
Full comparative analysis of GPT-5, Claude Opus 4.6, Claude Sonnet 4.6, GPT-4.1, and GPT-4.1 Mini on analytical tasks (not coding). Contents: - findings/ALL-FINDINGS.md — complete 3,249-line research log with all 29 findings, methodology notes, and open questions - prompts/ — 6 exact prompts used across experiments - methodology.md — experimental setup and evaluation criteria - open-questions.md — unanswered questions for future work - README.md — overview and summary table Key findings: - Cross-document consistency: Opus is 2.4x faster with more findings - Gap-finding: GPT-5 reasoning tokens find domain-specific gaps - Race conditions: Opus excels at temporal interaction reasoning - Bias detection: Signal-to-noise ratio > model capability - Adversarial analysis: GPT-5 exhaustive, Opus qualitatively different Signed-off-by: Rodin
3.2 KiB
3.2 KiB
Methodology
Principles
- Internet opinions about models are overwhelmingly about coding. Don't extrapolate to analytical work without testing.
- "Just because someone says it on the internet doesn't make it right." Opinions need context. Track our own evidence.
- Absence of published methodology for a use case is itself a finding.
- No unsupported generalizations. Each finding needs: date, task, how we used it (context shape, task framing, what info the model had/didn't have), what happened, takeaway.
Experimental Setup
Models Tested
| Model | Provider | Access | Notes |
|---|---|---|---|
| GPT-5 | OpenAI (via HAI proxy) | API | Requires max_completion_tokens ≥16K |
| Claude Opus 4.6 | Anthropic (via HAI proxy) | API | Internal reasoning (not exposed) |
| Claude Sonnet 4.6 | Anthropic (via HAI proxy) | API | Fast, cost-effective |
| GPT-4.1 | OpenAI (via HAI proxy) | API | Non-reasoning, structured output |
| GPT-4.1 Mini | OpenAI (via HAI proxy) | API | Cheapest, good for screening |
| Claude Sonnet 4.5 | Anthropic (via HAI proxy) | API | Predecessor to 4.6 |
Control Variables
- Same input: All models receive identical document text
- Same prompt: Structured prompt with explicit categories and output format
- Same constraints: No tools, no project context beyond the document(s)
- Independent runs: No cross-pollination between model runs
- Temperature: 0.3 for GPT-4.1/Mini; default (1.0) for GPT-5 (required)
Measurement
- Time: Wall clock from request to final token
- Output tokens: Total generated tokens
- Reasoning tokens: For reasoning models (GPT-5), exposed separately
- Findings count: Number of distinct issues identified
- Unique findings: Issues found by only one model
- Severity distribution: Critical / High / Medium / Low per finding
- Tokens per finding: Efficiency metric
Evaluation Criteria
Each finding is assessed for:
- Correctness: Is the identified issue real?
- Uniqueness: Did only this model find it?
- Actionability: Would a developer change something based on this?
- Depth: Surface observation vs architectural insight?
Context Dimensions Tracked
| Dimension | Options |
|---|---|
| Context richness | Rich (full project) vs Minimal (document only) |
| Task framing | Broad ("review this") vs Focused ("check for X") |
| Context type | Diff, full files, issue text, research notes, nothing |
| Tool access | With tools (API calls, file reads) vs text-only |
| Task structure | Step-by-step explicit vs open-ended |
Limitations
- Single test corpus (gargoyle architecture docs) — domain bias possible
- Single researcher evaluating findings — subjectivity in quality assessment
- Models are non-deterministic — single runs, not averaged
- Proxy adds latency — timing comparisons are relative, not absolute
- Internal reasoning tokens not visible for Claude models
Reproducibility
Prompts for each experiment are in the prompts/ directory. The test
corpus is the gargoyle project's docs/ directory (available at
gitea.weiker.me/grgl/gargoyle). Each finding documents the exact document
used, its line count, and the specific version/commit when relevant.