
Rodin 1b108ff66e Initial publish: 29 findings, 6 prompts, methodology, open questions

Full comparative analysis of GPT-5, Claude Opus 4.6, Claude Sonnet 4.6,
GPT-4.1, and GPT-4.1 Mini on analytical tasks (not coding).

Contents:
- findings/ALL-FINDINGS.md — complete 3,249-line research log with all
  29 findings, methodology notes, and open questions
- prompts/ — 6 exact prompts used across experiments
- methodology.md — experimental setup and evaluation criteria
- open-questions.md — unanswered questions for future work
- README.md — overview and summary table

Key findings:
- Cross-document consistency: Opus is 2.4x faster with more findings
- Gap-finding: GPT-5 reasoning tokens find domain-specific gaps
- Race conditions: Opus excels at temporal interaction reasoning
- Bias detection: Signal-to-noise ratio > model capability
- Adversarial analysis: GPT-5 exhaustive, Opus qualitatively different

Signed-off-by: Rodin

2026-05-05 19:13:03 -07:00

2.7 KiB


Open Questions

Unanswered questions from experiments, ordered by potential impact.

High Priority

Signal-to-noise confirmation (from Finding #8)

Give a model the FULL PR review context (diff, files, issue, AC) and add the narrow bias question as an explicit review checklist item. If the model catches the bias despite the rich context, that supports the signal-to-noise hypothesis. If it still misses the bias, something else is at work (attention allocation, task-switching cost).
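A minimal sketch of how the two conditions could be built so that they differ only in the checklist item. The `ReviewContext` fields, the `build_prompt` helper, the section headings, and the placeholder strings are illustrative assumptions, not the prompts used in the experiments; treating "full context without the checklist" as the control is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class ReviewContext:
    """Full PR review context. Field names are hypothetical."""
    diff: str
    files: str
    issue: str
    acceptance_criteria: str


def build_prompt(ctx: ReviewContext, checklist: list[str]) -> str:
    """Assemble one review prompt: rich context first, checklist (if any) last."""
    sections = [
        "## Diff\n" + ctx.diff,
        "## Files\n" + ctx.files,
        "## Issue\n" + ctx.issue,
        "## Acceptance criteria\n" + ctx.acceptance_criteria,
    ]
    if checklist:
        items = "\n".join(f"- [ ] {item}" for item in checklist)
        sections.append("## Review checklist\n" + items)
    return "\n\n".join(sections)


# Placeholders only: substitute the real PR material and the real bias question.
ctx = ReviewContext(diff="<diff>", files="<files>", issue="<issue>",
                    acceptance_criteria="<AC>")
BIAS_QUESTION = "<narrow bias question from the Finding #8 experiment>"

control = build_prompt(ctx, [])                 # rich context, no checklist
treatment = build_prompt(ctx, [BIAS_QUESTION])  # same context plus checklist item
```

Because the treatment prompt is the control prompt plus one appended section, any difference in bias detection can be attributed to the checklist item rather than to a change in context.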

Cross-document consistency as maintenance tool (from Finding #28)

Does running cross-doc analysis across MORE document pairs (domain readmes vs implementation docs, design docs vs plan docs) yield additional real inconsistencies? If so, it could become a systematic documentation maintenance tool.
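One way to enumerate the pairs to check, sketched under assumptions: the group names mirror the pairings mentioned above, but every file path below is a made-up placeholder, and `document_pairs` is a hypothetical helper.

```python
from itertools import product
from typing import Iterator

# Hypothetical document groups; replace the paths with the real repository layout.
DOC_GROUPS = {
    "domain_readmes": ["domains/a/README.md", "domains/b/README.md"],
    "implementation_docs": ["docs/impl/a.md"],
    "design_docs": ["docs/design/a.md"],
    "plan_docs": ["docs/plan/a.md", "docs/plan/b.md"],
}

# Which groups to compare against each other.
PAIRINGS = [
    ("domain_readmes", "implementation_docs"),
    ("design_docs", "plan_docs"),
]


def document_pairs(groups: dict[str, list[str]],
                   pairings: list[tuple[str, str]]) -> Iterator[tuple[str, str]]:
    """Yield every (left, right) document pair for the configured group pairings."""
    for left, right in pairings:
        yield from product(groups[left], groups[right])


pairs = list(document_pairs(DOC_GROUPS, PAIRINGS))
```

Each pair would then be handed to the same cross-doc consistency prompt, so the number of model calls grows with the product of the group sizes.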

Why Opus dominates cross-doc consistency (from Finding #28)

Opus was 2.4x faster AND found more issues than GPT-5. Is this because cross-doc contradictions are easy to verify once spotted (reducing GPT-5's verification advantage)? Or because boundary reasoning (Opus's strength) is the primary skill needed?

Sonnet + narrow framing = GPT-5 level? (from Finding #5)

Would Sonnet catch semantic issues if given a narrower "check for logical consistency" framing instead of broad review? The hypothesis: Sonnet's "structural reviewer" tendency is a framing artifact, not a capability limit.

Medium Priority

Adversarial analysis ensemble (from Finding #29)

Run GPT-5 and Opus sequentially — give Opus access to GPT-5's findings and ask it to critique and extend. Does the ensemble find more than either alone? Does Opus's system-level thinking complement GPT-5's exhaustiveness?
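The sequential setup could be sketched as follows. A "model" here is any callable that takes a prompt and returns a list of finding strings; that interface, the prompt wording, and the exact-string de-duplication are all simplifying assumptions (real findings would need fuzzy or manual matching).

```python
from typing import Callable

# Hypothetical interface: prompt in, list of finding strings out.
Model = Callable[[str], list[str]]


def run_sequential(document: str, first: Model, second: Model) -> dict[str, list[str]]:
    """Run `first` alone, then ask `second` to critique and extend its findings."""
    base = "Adversarially analyze this document:\n" + document
    first_findings = first(base)
    listed = "\n".join(f"- {f}" for f in first_findings)
    second_findings = second(
        base
        + "\n\nA previous reviewer found:\n" + listed
        + "\n\nCritique these findings and add any they missed."
    )
    # dict.fromkeys de-duplicates while preserving first-seen order.
    combined = list(dict.fromkeys(first_findings + second_findings))
    return {"first": first_findings, "second": second_findings, "combined": combined}


def ensemble_only(combined: list[str], solo_a: list[str], solo_b: list[str]) -> list[str]:
    """Findings the ensemble produced that neither solo run produced."""
    seen = set(solo_a) | set(solo_b)
    return [f for f in combined if f not in seen]
```

Comparing `ensemble_only(...)` against an empty list answers the first question directly: a non-empty result means the ensemble surfaced something neither model found alone.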

Reasoning effort parameter (from Finding #21)

Reasoning effort (low/medium/high) had negligible effect on GPT-5's analytical output. Is this because the parameter doesn't work for open-ended analysis? Or because the task was already within GPT-5's "easy" threshold? Test with a harder document.
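To make "negligible effect" measurable on a harder document, one option is to compare the sets of findings across effort levels with Jaccard similarity (intersection size over union size). This is a sketch, not the metric used in Finding #21: the `run` callable is a hypothetical wrapper around the model call, and treating findings as exactly matching strings is an assumption.

```python
from itertools import combinations
from typing import Callable

EFFORTS = ("low", "medium", "high")


def jaccard(a: set[str], b: set[str]) -> float:
    """Overlap of two finding sets: 1.0 means identical, 0.0 means disjoint."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def effort_overlap(run: Callable[[str], set[str]]) -> dict[tuple[str, str], float]:
    """Run once per effort level and score every pair of levels.

    `run(effort)` is a hypothetical callable returning the findings for that level.
    """
    results = {effort: run(effort) for effort in EFFORTS}
    return {(x, y): jaccard(results[x], results[y])
            for x, y in combinations(EFFORTS, 2)}
```

Scores near 1.0 for every pair would match the "negligible effect" observation; scores that drop on a harder document would suggest the original task sat below GPT-5's "easy" threshold.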

Model personality vs prompt (from Finding #26)

Missing-feature identification IS promptable across all models — prompt framing eliminates Opus's historical advantage. How many other "model personality" observations are actually just prompt framing effects?

Answered Questions

- ~~Opus's "missing feature identification" mode — is it promptable?~~ YES (Finding #26): all models find regulatory gaps when explicitly prompted. Opus's behavior was an emergent DEFAULT tendency, not a unique capability.
- ~~Is Opus > GPT-5 for coherence tasks universal?~~ NO (Finding #27): Opus's advantage from Finding #15 was document-specific. On risk-controls.md (992 lines, more complex), GPT-5 regained top position. Document complexity and domain specialization affect ranking.
