Initial publish: 29 findings, 6 prompts, methodology, open questions

Full comparative analysis of GPT-5, Claude Opus 4.6, Claude Sonnet 4.6, GPT-4.1, and GPT-4.1 Mini on analytical tasks (not coding).

Contents:
- findings/ALL-FINDINGS.md - complete 3,249-line research log with all 29 findings, methodology notes, and open questions
- prompts/ - 6 exact prompts used across experiments
- methodology.md - experimental setup and evaluation criteria
- open-questions.md - unanswered questions for future work
- README.md - overview and summary table

Key findings:
- Cross-document consistency: Opus is 2.4x faster with more findings
- Gap-finding: GPT-5 reasoning tokens find domain-specific gaps
- Race conditions: Opus excels at temporal interaction reasoning
- Bias detection: Signal-to-noise ratio > model capability
- Adversarial analysis: GPT-5 exhaustive, Opus qualitatively different

Signed-off-by: Rodin
2026-05-05 19:13:03 -07:00
parent 4aea0d004b
commit 1b108ff66e
10 changed files with 3831 additions and 2 deletions
@@ -0,0 +1,58 @@
+# Open Questions
+
+Unanswered questions from experiments, ordered by potential impact.
+
+## High Priority
+
+### Signal-to-noise confirmation (from Finding #8)
+Give a model the FULL PR review context (diff, files, issue, AC) but add
+the narrow bias question as an explicit review checklist item. If the model
+catches the bias despite the rich context, that supports the signal-to-noise
+hypothesis. If it still misses the bias, something else is at work
+(attention allocation, task-switching cost).
+
+### Cross-document consistency as maintenance tool (from Finding #28)
+Does running cross-doc analysis across MORE document pairs (domain READMEs
+vs implementation docs, design docs vs plan docs) yield additional real
+inconsistencies? If so, it could become a systematic documentation maintenance tool.
+
+### Why Opus dominates cross-doc consistency (from Finding #28)
+Opus was 2.4x faster AND found more issues than GPT-5. Is this because
+cross-doc contradictions are easy to verify once spotted (reducing GPT-5's
+verification advantage)? Or because boundary reasoning (Opus's strength)
+is the primary skill needed?
+
+### Sonnet + narrow framing = GPT-5 level? (from Finding #5)
+Would Sonnet catch semantic issues if given a narrower "check for logical
+consistency" framing instead of a broad review? The hypothesis: Sonnet's
+"structural reviewer" tendency is a framing artifact, not a capability limit.
+
+## Medium Priority
+
+### Adversarial analysis ensemble (from Finding #29)
+Run GPT-5 and Opus sequentially: give Opus access to GPT-5's findings and
+ask it to critique and extend them. Does the ensemble find more than either
+model alone? Does Opus's system-level thinking complement GPT-5's exhaustiveness?
+
+### Reasoning effort parameter (from Finding #21)
+Reasoning effort (low/medium/high) had negligible effect on GPT-5's
+analytical output. Is this because the parameter doesn't work for open-ended
+analysis? Or because the task was already within GPT-5's "easy" threshold?
+Test with a harder document.
+
+### Model personality vs prompt (from Finding #26)
+Missing-feature identification IS promptable across all models — prompt
+framing eliminates Opus's historical advantage. How many other "model
+personality" observations are actually just prompt framing effects?
+
+## Answered Questions
+
+- ~~Opus's "missing feature identification" mode — is it promptable?~~
+  **YES** (Finding #26): all models find regulatory gaps when explicitly
+  prompted. Opus's behavior was an emergent DEFAULT tendency, not a unique
+  capability.
+
+- ~~Is Opus > GPT-5 for coherence tasks universal?~~
+  **NO** (Finding #27): Opus's advantage from Finding #15 was document-specific.
+  On risk-controls.md (992 lines, more complex), GPT-5 regained the top
+  position. Document complexity and domain specialization affect the ranking.
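
The "Adversarial analysis ensemble" question above asks whether a sequential GPT-5 then Opus run finds more than either model alone. One way to score that comparison is sketched below. It is not part of this commit, and it assumes each run's findings have already been normalized to short comparable labels; `compare_ensemble`, `EnsembleComparison`, and the sample labels are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnsembleComparison:
    only_first: frozenset       # found by the first model alone, not the second
    only_second: frozenset      # found by the second model alone, not the first
    new_in_ensemble: frozenset  # found only by the sequential ensemble run
    ensemble_total: int         # number of distinct findings in the ensemble run
    beats_both: bool            # True if the ensemble found more than either solo run


def compare_ensemble(first, second, extended):
    """Compare two solo runs against a sequential ensemble run.

    first, second: finding labels from each model run on its own.
    extended: finding labels after the second model critiques and
              extends the first model's output.
    Assumption: labels are already deduplicated/normalized by hand.
    """
    a, b, e = frozenset(first), frozenset(second), frozenset(extended)
    return EnsembleComparison(
        only_first=a - b,
        only_second=b - a,
        new_in_ensemble=e - (a | b),
        ensemble_total=len(e),
        beats_both=len(e) > max(len(a), len(b)),
    )


# Hypothetical labels, not results from the research log.
result = compare_ensemble(
    first=["auth-bypass", "replay"],
    second=["replay", "trust-boundary"],
    extended=["auth-bypass", "replay", "trust-boundary", "rollback-gap"],
)
print(sorted(result.new_in_ensemble), result.beats_both)
```

Counting distinct labels is a deliberately crude measure. It says nothing about finding severity or false positives, which would need a separate manual review step.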