# Methodology

## Principles

1. **Internet opinions about models are overwhelmingly about coding.** Don't
   extrapolate to analytical work without testing.
2. **"Just because someone says it on the internet doesn't make it right."**
   Opinions need context. Track our own evidence.
3. **Absence of published methodology for a use case is itself a finding.**
4. **No unsupported generalizations.** Each finding needs: date, task,
   how we used it (context shape, task framing, what info the model
   had/didn't have), what happened, takeaway.
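
The required fields in principle 4 map naturally onto a small record type. The following is a minimal sketch, not the format actually used in this project: the class name `Finding`, the field names, and the helper `missing_fields` are all hypothetical.

```python
from dataclasses import dataclass, fields
from datetime import date


@dataclass(frozen=True)
class Finding:
    """One logged observation. Field names are illustrative assumptions."""

    when: date            # date of the run
    task: str             # what the model was asked to do
    context_shape: str    # what was placed in the context window
    task_framing: str     # how the request was phrased
    info_available: str   # what the model had and did not have
    outcome: str          # what happened
    takeaway: str         # the conclusion drawn from this single run


def missing_fields(finding: Finding) -> list[str]:
    """Return the names of text fields that are empty or whitespace-only."""
    return [
        f.name
        for f in fields(finding)
        if isinstance(getattr(finding, f.name), str)
        and not getattr(finding, f.name).strip()
    ]
```

A check like `missing_fields` could run before a finding is accepted into the log, so that an entry without a takeaway or without context details is flagged rather than silently recorded.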

## Experimental Setup

### Models Tested

| Model | Provider | Access | Notes |
|-------|----------|--------|-------|
| GPT-5 | OpenAI (via HAI proxy) | API | Requires `max_completion_tokens` ≥16K |
| Claude Opus 4.6 | Anthropic (via HAI proxy) | API | Internal reasoning (not exposed) |
| Claude Sonnet 4.6 | Anthropic (via HAI proxy) | API | Fast, cost-effective |
| GPT-4.1 | OpenAI (via HAI proxy) | API | Non-reasoning, structured output |
| GPT-4.1 Mini | OpenAI (via HAI proxy) | API | Cheapest, good for screening |
| Claude Sonnet 4.5 | Anthropic (via HAI proxy) | API | Predecessor to 4.6 |

### Control Variables

- **Same input:** All models receive identical document text
- **Same prompt:** Structured prompt with explicit categories and output format
- **Same constraints:** No tools, no project context beyond the document(s)
- **Independent runs:** No cross-pollination between model runs
- **Temperature:** 0.3 for GPT-4.1 and GPT-4.1 Mini; GPT-5 runs at its default (1.0), which it requires
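
The per-model sampling settings above can be kept in one place so that no run drifts from them. This is a rough sketch under stated assumptions: the model identifier strings and the function `sampling_params` are hypothetical, the value 16384 is one reading of the "≥16K" note in the models table, and no setting is shown for the Claude models because none is stated here.

```python
# Hypothetical identifiers; the names used by the proxy may differ.
LOW_TEMPERATURE_MODELS = {"gpt-4.1", "gpt-4.1-mini"}
GPT5_MIN_COMPLETION_TOKENS = 16_384  # assumption: one reading of ">=16K"


def sampling_params(model: str) -> dict:
    """Return the sampling-related request parameters for one model."""
    params: dict = {}
    if model in LOW_TEMPERATURE_MODELS:
        params["temperature"] = 0.3
    elif model == "gpt-5":
        # GPT-5 runs at its default temperature (1.0), so none is sent.
        params["max_completion_tokens"] = GPT5_MIN_COMPLETION_TOKENS
    # Other models: nothing is specified in this document, so nothing is set.
    return params
```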

### Measurement

- **Time:** Wall clock from request to final token
- **Output tokens:** Total generated tokens
- **Reasoning tokens:** Recorded separately where the API exposes them (GPT-5)
- **Findings count:** Number of distinct issues identified
- **Unique findings:** Issues found by only one model
- **Severity distribution:** Critical / High / Medium / Low per finding
- **Tokens per finding:** Output tokens divided by findings count (efficiency metric)
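
Two of these metrics are simple derived quantities. The sketch below assumes that findings have already been deduplicated by hand into shared issue identifiers, and that tokens per finding means output tokens divided by findings count; the function names and the input shape are hypothetical.

```python
from collections import Counter
from typing import Optional


def unique_findings(findings_by_model: dict[str, set[str]]) -> dict[str, set[str]]:
    """For each model, keep only the issues that no other model reported."""
    counts = Counter(
        issue for issues in findings_by_model.values() for issue in issues
    )
    return {
        model: {issue for issue in issues if counts[issue] == 1}
        for model, issues in findings_by_model.items()
    }


def tokens_per_finding(output_tokens: int, findings_count: int) -> Optional[float]:
    """Output tokens spent per distinct finding; None when there are no findings."""
    if findings_count == 0:
        return None
    return output_tokens / findings_count
```

Returning `None` for a run with zero findings avoids a division by zero and keeps such runs distinguishable from runs that were merely inefficient.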

### Evaluation Criteria

Each finding is assessed for:
1. **Correctness:** Is the identified issue real?
2. **Uniqueness:** Did only this model find it?
3. **Actionability:** Would a developer change something based on this?
4. **Depth:** Surface observation vs architectural insight?

### Context Dimensions Tracked

| Dimension | Options |
|-----------|---------|
| Context richness | Rich (full project) vs Minimal (document only) |
| Task framing | Broad ("review this") vs Focused ("check for X") |
| Context type | Diff, full files, issue text, research notes, nothing |
| Tool access | With tools (API calls, file reads) vs text-only |
| Task structure | Step-by-step explicit vs open-ended |
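
If each run is tagged with these dimensions, the tags can be checked against the table before they are stored. A minimal sketch, assuming one value per dimension per run; the key names, the option spellings, and `validate_context` are hypothetical.

```python
# Allowed options per dimension, transcribed from the table above.
CONTEXT_DIMENSIONS = {
    "context_richness": {"rich", "minimal"},
    "task_framing": {"broad", "focused"},
    "context_type": {"diff", "full files", "issue text", "research notes", "nothing"},
    "tool_access": {"with tools", "text-only"},
    "task_structure": {"step-by-step", "open-ended"},
}


def validate_context(tags: dict[str, str]) -> None:
    """Raise ValueError if a dimension is missing, unknown, or has a bad value."""
    missing = CONTEXT_DIMENSIONS.keys() - tags.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    unknown = tags.keys() - CONTEXT_DIMENSIONS.keys()
    if unknown:
        raise ValueError(f"unknown dimensions: {sorted(unknown)}")
    for dimension, value in tags.items():
        if value not in CONTEXT_DIMENSIONS[dimension]:
            raise ValueError(f"invalid value {value!r} for {dimension}")
```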

## Limitations

- Single test corpus (gargoyle architecture docs), so domain bias is possible
- Single researcher evaluating findings, so quality assessments are subjective
- Models are non-deterministic, and each result comes from a single run rather than an average over repeated runs
- The proxy adds latency, so timing comparisons are relative, not absolute
- Internal reasoning tokens are not visible for Claude models

## Reproducibility

Prompts for each experiment are in the `prompts/` directory. The test
corpus is the gargoyle project's `docs/` directory (available at
`gitea.weiker.me/grgl/gargoyle`). Each finding documents the exact document
used, its line count, and the specific version/commit when relevant.