# Methodology

## Principles

1. **Internet opinions about models are overwhelmingly about coding.** Don't extrapolate to analytical work without testing.
2. **"Just because someone says it on the internet doesn't make it right."** Opinions need context. Track our own evidence.
3. **Absence of published methodology for a use case is itself a finding.**
4. **No unsupported generalizations.** Each finding needs: date, task, how we used it (context shape, task framing, what info the model had/didn't have), what happened, takeaway.

## Experimental Setup

### Models Tested

| Model | Provider | Access | Notes |
|-------|----------|--------|-------|
| GPT-5 | OpenAI (via HAI proxy) | API | Requires `max_completion_tokens` ≥16K |
| Claude Opus 4.6 | Anthropic (via HAI proxy) | API | Internal reasoning (not exposed) |
| Claude Sonnet 4.6 | Anthropic (via HAI proxy) | API | Fast, cost-effective |
| GPT-4.1 | OpenAI (via HAI proxy) | API | Non-reasoning, structured output |
| GPT-4.1 Mini | OpenAI (via HAI proxy) | API | Cheapest, good for screening |
| Claude Sonnet 4.5 | Anthropic (via HAI proxy) | API | Predecessor to 4.6 |

### Control Variables

- **Same input:** All models receive identical document text
- **Same prompt:** Structured prompt with explicit categories and output format
- **Same constraints:** No tools, no project context beyond the document(s)
- **Independent runs:** No cross-pollination between model runs
- **Temperature:** 0.3 for GPT-4.1/Mini; default (1.0) for GPT-5 (required)

### Measurement

- **Time:** Wall clock from request to final token
- **Output tokens:** Total generated tokens
- **Reasoning tokens:** For reasoning models (GPT-5), exposed separately
- **Findings count:** Number of distinct issues identified
- **Unique findings:** Issues found by only one model
- **Severity distribution:** Critical / High / Medium / Low per finding
- **Tokens per finding:** Efficiency metric

### Evaluation Criteria

Each finding is assessed for:

1. **Correctness:** Is the identified issue real?
2. **Uniqueness:** Did only this model find it?
3. **Actionability:** Would a developer change something based on this?
4. **Depth:** Surface observation vs architectural insight?

### Context Dimensions Tracked

| Dimension | Options |
|-----------|---------|
| Context richness | Rich (full project) vs Minimal (document only) |
| Task framing | Broad ("review this") vs Focused ("check for X") |
| Context type | Diff, full files, issue text, research notes, nothing |
| Tool access | With tools (API calls, file reads) vs text-only |
| Task structure | Step-by-step explicit vs open-ended |

## Limitations

- Single test corpus (gargoyle architecture docs): domain bias possible
- Single researcher evaluating findings: subjectivity in quality assessment
- Models are non-deterministic: single runs, not averaged
- Proxy adds latency: timing comparisons are relative, not absolute
- Internal reasoning tokens not visible for Claude models

## Reproducibility

Prompts for each experiment are in the `prompts/` directory. The test corpus is the gargoyle project's `docs/` directory (available at `gitea.weiker.me/grgl/gargoyle`). Each finding documents the exact document used, its line count, and the specific version/commit when relevant.
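
The per-model metrics listed under Measurement (findings count, unique findings, severity distribution, tokens per finding) can be tallied with a short script. The following is a minimal sketch, not the project's actual tooling: the `Finding` record, the `summarize` helper, the issue labels, the model names, and the token counts are all hypothetical placeholders chosen for illustration, and "tokens per finding" is assumed here to mean output tokens divided by findings count.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    model: str     # model that reported the issue
    issue_id: str  # hypothetical label assigned by the reviewer after deduplication
    severity: str  # "Critical", "High", "Medium", or "Low"


def summarize(findings, output_tokens):
    """Return per-model findings count, unique findings, severity mix, tokens per finding."""
    # Map each issue to the set of models that reported it.
    reporters = {}
    for f in findings:
        reporters.setdefault(f.issue_id, set()).add(f.model)

    summary = {}
    for model, tokens in output_tokens.items():
        # Keep one record per issue so repeated reports are not double-counted.
        own = {f.issue_id: f for f in findings if f.model == model}
        count = len(own)
        summary[model] = {
            "findings": count,
            # An issue is unique when this model is its only reporter.
            "unique": sorted(i for i in own if reporters[i] == {model}),
            "severity": dict(Counter(f.severity for f in own.values())),
            # Assumed definition: output tokens / findings; undefined with zero findings.
            "tokens_per_finding": tokens / count if count else None,
        }
    return summary


# Made-up example data, not experimental results.
findings = [
    Finding("model-a", "missing-retry", "High"),
    Finding("model-a", "stale-diagram", "Low"),
    Finding("model-b", "missing-retry", "High"),
]
report = summarize(findings, {"model-a": 1200, "model-b": 900})
print(report["model-a"])
```

Keeping the issue label separate from the model name means uniqueness falls out of a simple set comparison, but it does depend on the reviewer assigning the same label when two models describe the same issue in different words.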