Rodin 1b108ff66e Initial publish: 29 findings, 6 prompts, methodology, open questions
Full comparative analysis of GPT-5, Claude Opus 4.6, Claude Sonnet 4.6,
GPT-4.1, and GPT-4.1 Mini on analytical tasks (not coding).

Contents:
- findings/ALL-FINDINGS.md — complete 3,249-line research log with all
  29 findings, methodology notes, and open questions
- prompts/ — 6 exact prompts used across experiments
- methodology.md — experimental setup and evaluation criteria
- open-questions.md — unanswered questions for future work
- README.md — overview and summary table

Key findings:
- Cross-document consistency: Opus is 2.4x faster with more findings
- Gap-finding: GPT-5 reasoning tokens find domain-specific gaps
- Race conditions: Opus excels at temporal interaction reasoning
- Bias detection: Signal-to-noise ratio > model capability
- Adversarial analysis: GPT-5 exhaustive, Opus qualitatively different

Signed-off-by: Rodin
2026-05-05 19:13:03 -07:00

Methodology

Principles

  1. Internet opinions about models are overwhelmingly about coding. Don't extrapolate to analytical work without testing.
  2. "Just because someone says it on the internet doesn't make it right." Opinions need context. Track our own evidence.
  3. Absence of published methodology for a use case is itself a finding.
  4. No unsupported generalizations. Each finding needs: date, task, how we used it (context shape, task framing, what info the model had/didn't have), what happened, takeaway.
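One way to enforce principle 4 is to give every log entry the same shape. The sketch below is illustrative only: the class name `ResearchFinding` and all field names are assumptions made for this example, not identifiers from the repository.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ResearchFinding:
    """One research-log entry. All names here are hypothetical."""
    observed_on: date
    task: str
    context_shape: str    # what the model was given (e.g. document only)
    task_framing: str     # how the task was phrased
    info_available: str   # what information the model had
    info_missing: str     # what information the model did not have
    what_happened: str
    takeaway: str

    def missing_fields(self) -> list[str]:
        """Return the names of text fields that were left blank."""
        return [name for name, value in vars(self).items()
                if isinstance(value, str) and not value.strip()]
```

An entry whose `missing_fields()` result is non-empty would be a candidate "unsupported generalization" under this reading of the principle.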

Experimental Setup

Models Tested

| Model | Provider | Access | Notes |
|---|---|---|---|
| GPT-5 | OpenAI (via HAI proxy) | API | Requires max_completion_tokens ≥16K |
| Claude Opus 4.6 | Anthropic (via HAI proxy) | API | Internal reasoning (not exposed) |
| Claude Sonnet 4.6 | Anthropic (via HAI proxy) | API | Fast, cost-effective |
| GPT-4.1 | OpenAI (via HAI proxy) | API | Non-reasoning, structured output |
| GPT-4.1 Mini | OpenAI (via HAI proxy) | API | Cheapest, good for screening |
| Claude Sonnet 4.5 | Anthropic (via HAI proxy) | API | Predecessor to 4.6 |

Control Variables

  • Same input: All models receive identical document text
  • Same prompt: Structured prompt with explicit categories and output format
  • Same constraints: No tools, no project context beyond the document(s)
  • Independent runs: No cross-pollination between model runs
  • Temperature: 0.3 for GPT-4.1 and GPT-4.1 Mini; default (1.0) for GPT-5, which requires the default
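The sampling settings above can be written down as a small helper so that every run uses the same parameters. This is a sketch under stated assumptions: the model identifier strings are hypothetical (the names used by the proxy are not given here), and the function name `request_params` is invented for illustration.

```python
def request_params(model: str) -> dict:
    """Build per-model request parameters from the documented settings.

    Model identifiers such as "gpt-5" are hypothetical placeholders.
    """
    params: dict = {"model": model}
    if model == "gpt-5":
        # GPT-5 runs at its default temperature (1.0), so none is set.
        # The models table asks for at least 16K completion tokens.
        params["max_completion_tokens"] = 16_384
    elif model in ("gpt-4.1", "gpt-4.1-mini"):
        params["temperature"] = 0.3
    # No temperature is documented for the other models, so none is set.
    return params
```

Keeping the settings in one function makes it harder for a one-off run to drift from the control variables.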

Measurement

  • Time: Wall clock from request to final token
  • Output tokens: Total generated tokens
  • Reasoning tokens: Reported separately for models that expose them (GPT-5)
  • Findings count: Number of distinct issues identified
  • Unique findings: Issues found by only one model
  • Severity distribution: Critical / High / Medium / Low per finding
  • Tokens per finding: Efficiency metric
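The measurements above could be computed along the following lines. This is a minimal sketch, not the tooling used for the experiments. It assumes that issues have already been normalized by the evaluator to shared labels so that set comparison is meaningful, and that "tokens per finding" means output tokens divided by findings count; both are assumptions, as are all function names.

```python
import time
from collections import Counter


def timed(call):
    """Run call() and return (result, wall-clock seconds)."""
    start = time.perf_counter()
    result = call()
    return result, time.perf_counter() - start


def unique_issues(issues_by_model: dict[str, set[str]]) -> dict[str, set[str]]:
    """For each model, keep the issues that no other model reported."""
    counts = Counter(issue
                     for issues in issues_by_model.values()
                     for issue in issues)
    return {model: {issue for issue in issues if counts[issue] == 1}
            for model, issues in issues_by_model.items()}


def tokens_per_finding(output_tokens: int, findings_count: int) -> float:
    """Assumed definition: output tokens divided by findings count."""
    if findings_count == 0:
        return float("inf")  # no findings: efficiency is undefined, report inf
    return output_tokens / findings_count
```

For example, `tokens_per_finding(1200, 8)` returns `150.0`, and `unique_issues({"a": {"x", "y"}, "b": {"y", "z"}})` returns `{"a": {"x"}, "b": {"z"}}`.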

Evaluation Criteria

Each finding is assessed for:

  1. Correctness: Is the identified issue real?
  2. Uniqueness: Did only this model find it?
  3. Actionability: Would a developer change something based on this?
  4. Depth: Surface observation vs architectural insight?
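The four criteria above map naturally onto a per-issue record. The sketch below is hypothetical: `IssueAssessment`, its field names, and the two depth labels are assumptions chosen for illustration, and the summary it computes is only one possible way to aggregate the assessments.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class IssueAssessment:
    """Evaluator's judgment of one model-reported issue (hypothetical names)."""
    issue: str
    correct: bool      # is the identified issue real?
    unique: bool       # did only this model find it?
    actionable: bool   # would a developer change something based on it?
    depth: str         # assumed labels: "surface" or "architectural"


def depth_of_correct(assessments: list[IssueAssessment]) -> Counter:
    """Count correct issues per depth label; incorrect issues are ignored."""
    return Counter(a.depth for a in assessments if a.correct)
```

Filtering on correctness first keeps an incorrect but impressive-sounding issue from inflating a model's depth score.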

Context Dimensions Tracked

| Dimension | Options |
|---|---|
| Context richness | Rich (full project) vs Minimal (document only) |
| Task framing | Broad ("review this") vs Focused ("check for X") |
| Context type | Diff, full files, issue text, research notes, nothing |
| Tool access | With tools (API calls, file reads) vs text-only |
| Task structure | Step-by-step explicit vs open-ended |
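The dimensions in the table above could be recorded with each run as a small typed value, which makes runs easy to group and compare. This is an illustrative sketch: the enum, class, and field names are assumptions, and only the option values come from the table.

```python
from dataclasses import dataclass
from enum import Enum


class Richness(Enum):
    RICH = "full project"
    MINIMAL = "document only"


class Framing(Enum):
    BROAD = "review this"
    FOCUSED = "check for X"


class ContextType(Enum):
    DIFF = "diff"
    FULL_FILES = "full files"
    ISSUE_TEXT = "issue text"
    RESEARCH_NOTES = "research notes"
    NOTHING = "nothing"


@dataclass(frozen=True)
class RunContext:
    """Context description for one model run (hypothetical names)."""
    richness: Richness
    framing: Framing
    context_type: ContextType
    tools: bool          # True = with tools, False = text-only
    step_by_step: bool   # True = explicit steps, False = open-ended

    def label(self) -> str:
        """Short slash-separated label for grouping runs."""
        return "/".join([
            self.richness.name.lower(),
            self.framing.name.lower(),
            self.context_type.name.lower(),
            "tools" if self.tools else "text-only",
            "step-by-step" if self.step_by_step else "open-ended",
        ])
```

A run with minimal context, a focused prompt, full files, no tools, and an open-ended structure would be labeled `minimal/focused/full_files/text-only/open-ended`.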

Limitations

  • Single test corpus (gargoyle architecture docs) — domain bias possible
  • Single researcher evaluating findings — subjectivity in quality assessment
  • Models are non-deterministic — single runs, not averaged
  • Proxy adds latency — timing comparisons are relative, not absolute
  • Internal reasoning tokens not visible for Claude models

Reproducibility

Prompts for each experiment are in the prompts/ directory. The test corpus is the gargoyle project's docs/ directory (available at gitea.weiker.me/grgl/gargoyle). Each finding documents the exact document used, its line count, and the specific version/commit when relevant.