model-research/findings/2026-04-26-02-cheap-model-narrow-lens-expensive.md
Rodin 6af8a6ee10 refactor(findings): split ALL-FINDINGS.md into per-experiment files
Break the monolithic 3249-line findings file into 29 individual files,
one per experiment. Each file is named YYYY-MM-DD-NN-slug.md for easy
chronological sorting and discovery.

No content changes — purely structural reorganization.
2026-05-06 07:15:50 -07:00


Finding 2: Cheap model + narrow lens > expensive model + broad review (one data point)

Date: 2026-04-26

Task: Check 12 rewritten hypotheses for directional bias

How we used them:

  • Sonnet & GPT-5: given full PR review context (diff, file content, issue, acceptance criteria). Broad mandate: "review this PR." Rich context but unfocused task.

  • GPT-4.1 Mini: given ONLY the 12 hypothesis texts + one focused question: "Do any of these hypotheses lead toward a predetermined conclusion?" Minimal context, laser-focused task. No diff, no project docs, no issue.

  • Both Sonnet and GPT-5 approved the hypotheses as reviewers

  • GPT-4.1 Mini found ALL 12 pushed toward predetermined conclusions

  • Words like "requires," "necessary," "must be" were flagged as directional

  • Takeaway: In this one comparison, task framing mattered more than model size. Rich context + broad mandate = missed the forest for the trees. Minimal context + precise question = found exactly what mattered. This needs more testing: was it the narrow framing, the lack of surrounding context, or both?
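One way to separate the two candidate explanations is a 2x2 comparison that crosses framing (narrow vs. broad) with context (minimal vs. rich). The sketch below only assembles the four prompts; it does not call any model. The two instruction strings are quoted from this finding, while `build_prompt`, `all_conditions`, and the `pr_context` parameter are hypothetical names introduced for illustration, and the prompt layout is an assumption rather than the format used in the original run.

```python
from itertools import product

# Both strings are quoted from the finding; the layout around them is assumed.
NARROW_QUESTION = "Do any of these hypotheses lead toward a predetermined conclusion?"
BROAD_MANDATE = "review this PR."

FRAMINGS = ("narrow", "broad")
CONTEXTS = ("minimal", "rich")


def build_prompt(hypotheses, framing, context, pr_context=""):
    """Assemble one prompt for a single (framing, context) condition.

    hypotheses: list of hypothesis texts.
    pr_context: diff, file content, issue, etc. Included only when context == "rich".
    """
    if framing not in FRAMINGS:
        raise ValueError(f"unknown framing: {framing!r}")
    if context not in CONTEXTS:
        raise ValueError(f"unknown context: {context!r}")

    parts = []
    if context == "rich":
        parts.append(pr_context)
    # Number the hypotheses so a reviewer's answer can refer to them by index.
    parts.append("\n".join(f"{i}. {h}" for i, h in enumerate(hypotheses, start=1)))
    parts.append(NARROW_QUESTION if framing == "narrow" else BROAD_MANDATE)
    return "\n\n".join(parts)


def all_conditions(hypotheses, pr_context):
    """Return a dict mapping (framing, context) to the prompt for that cell."""
    return {
        (framing, context): build_prompt(hypotheses, framing, context, pr_context)
        for framing, context in product(FRAMINGS, CONTEXTS)
    }
```

Running every model on all four cells would show whether the narrow question still catches the bias when the full PR context is present (narrow + rich), and whether stripping context alone helps a broad review (broad + minimal).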