Files

T

Rodin 1b108ff66e Initial publish: 29 findings, 6 prompts, methodology, open questions

Full comparative analysis of GPT-5, Claude Opus 4.6, Claude Sonnet 4.6,
GPT-4.1, and GPT-4.1 Mini on analytical tasks (not coding).

Contents:
- findings/ALL-FINDINGS.md — complete 3,249-line research log with all
  29 findings, methodology notes, and open questions
- prompts/ — 6 exact prompts used across experiments
- methodology.md — experimental setup and evaluation criteria
- open-questions.md — unanswered questions for future work
- README.md — overview and summary table

Key findings:
- Cross-document consistency: Opus is 2.4x faster with more findings
- Gap-finding: GPT-5 reasoning tokens find domain-specific gaps
- Race conditions: Opus excels at temporal interaction reasoning
- Bias detection: Signal-to-noise ratio > model capability
- Adversarial analysis: GPT-5 exhaustive, Opus qualitatively different

Signed-off-by: Rodin

2026-05-05 19:13:03 -07:00

3.2 KiB

Raw Blame History

Methodology

Principles

Internet opinions about models are overwhelmingly about coding. Don't extrapolate to analytical work without testing.
"Just because someone says it on the internet doesn't make it right." Opinions need context. Track our own evidence.
Absence of published methodology for a use case is itself a finding.
No unsupported generalizations. Each finding needs: date, task, how we used it (context shape, task framing, what info the model had/didn't have), what happened, takeaway.

Experimental Setup

Models Tested

Model	Provider	Access	Notes
GPT-5	OpenAI (via HAI proxy)	API	Requires `max_completion_tokens` ≥16K
Claude Opus 4.6	Anthropic (via HAI proxy)	API	Internal reasoning (not exposed)
Claude Sonnet 4.6	Anthropic (via HAI proxy)	API	Fast, cost-effective
GPT-4.1	OpenAI (via HAI proxy)	API	Non-reasoning, structured output
GPT-4.1 Mini	OpenAI (via HAI proxy)	API	Cheapest, good for screening
Claude Sonnet 4.5	Anthropic (via HAI proxy)	API	Predecessor to 4.6

Control Variables

Same input: All models receive identical document text
Same prompt: Structured prompt with explicit categories and output format
Same constraints: No tools, no project context beyond the document(s)
Independent runs: No cross-pollination between model runs
Temperature: 0.3 for GPT-4.1/Mini; default (1.0) for GPT-5 (required)

Measurement

Time: Wall clock from request to final token
Output tokens: Total generated tokens
Reasoning tokens: For reasoning models (GPT-5), exposed separately
Findings count: Number of distinct issues identified
Unique findings: Issues found by only one model
Severity distribution: Critical / High / Medium / Low per finding
Tokens per finding: Efficiency metric

Evaluation Criteria

Each finding is assessed for:

Correctness: Is the identified issue real?
Uniqueness: Did only this model find it?
Actionability: Would a developer change something based on this?
Depth: Surface observation vs architectural insight?

Context Dimensions Tracked

Dimension	Options
Context richness	Rich (full project) vs Minimal (document only)
Task framing	Broad ("review this") vs Focused ("check for X")
Context type	Diff, full files, issue text, research notes, nothing
Tool access	With tools (API calls, file reads) vs text-only
Task structure	Step-by-step explicit vs open-ended

Limitations

Single test corpus (gargoyle architecture docs) — domain bias possible
Single researcher evaluating findings — subjectivity in quality assessment
Models are non-deterministic — single runs, not averaged
Proxy adds latency — timing comparisons are relative, not absolute
Internal reasoning tokens not visible for Claude models

Reproducibility

Prompts for each experiment are in the prompts/ directory. The test corpus is the gargoyle project's docs/ directory (available at gitea.weiker.me/grgl/gargoyle). Each finding documents the exact document used, its line count, and the specific version/commit when relevant.

3.2 KiB Raw Blame History