# Open Questions

Unanswered questions from experiments, ordered by potential impact.

## High Priority

### Signal-to-noise confirmation (from Finding #8)
Give a model the FULL PR review context (diff, files, issue, AC) but add
the narrow bias question as an explicit review checklist item. If the model
catches the bias despite the rich context, that confirms the signal-to-noise
hypothesis. If it misses the bias, something else is at work (for example,
attention allocation or task-switching cost).
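
One way to set up this comparison is to build two prompts from the same PR context that differ only in whether the narrow bias question appears as a checklist item. The sketch below is illustrative only: `PRContext`, `build_review_prompt`, the section headings, and the wording of `BIAS_ITEM` are all hypothetical and are not taken from the experiment harness.

```python
from dataclasses import dataclass


@dataclass
class PRContext:
    """Full PR review context. Field names are hypothetical."""
    diff: str
    files: str
    issue: str
    ac: str  # "AC" as written in the note above


# Hypothetical wording of the narrow bias question.
BIAS_ITEM = "Does this change introduce or preserve any bias?"


def build_review_prompt(ctx: PRContext, checklist: list[str]) -> str:
    """Render the full context, then append the checklist items, if any."""
    sections = [
        "## Diff\n" + ctx.diff,
        "## Files\n" + ctx.files,
        "## Issue\n" + ctx.issue,
        "## AC\n" + ctx.ac,
    ]
    if checklist:
        items = "\n".join(f"- [ ] {item}" for item in checklist)
        sections.append("## Review checklist\n" + items)
    return "\n\n".join(sections)


# Placeholder context; real runs would load an actual PR.
ctx = PRContext(diff="<diff>", files="<files>", issue="<issue>", ac="<AC>")
control = build_review_prompt(ctx, checklist=[])
treatment = build_review_prompt(ctx, checklist=[BIAS_ITEM])
```

Keeping the context identical across both prompts means any difference in whether the bias is caught can be attributed to the checklist item rather than to a change in context.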

### Cross-document consistency as maintenance tool (from Finding #28)
Does running cross-doc analysis across MORE document pairs (domain READMEs
vs implementation docs, design docs vs plan docs) yield additional real
inconsistencies? If so, it could become a systematic documentation
maintenance tool.
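
As a rough sketch of how the document pairs could be enumerated before each pair is sent for analysis, the snippet below crosses two document categories at a time. The category names, file paths, and the `document_pairs` helper are placeholders invented for illustration.

```python
from itertools import product

# Hypothetical document groups; the file names are placeholders.
doc_groups = {
    "domain_readme": ["domains/a/README.md", "domains/b/README.md"],
    "implementation": ["docs/impl-a.md"],
    "design": ["docs/design.md"],
    "plan": ["docs/plan.md"],
}

# Category pairings named in the question above.
pairings = [("domain_readme", "implementation"), ("design", "plan")]


def document_pairs(groups, pairings):
    """Return every (left, right) document pair for the chosen category pairings."""
    pairs = []
    for left, right in pairings:
        pairs.extend(product(groups[left], groups[right]))
    return pairs


pairs = document_pairs(doc_groups, pairings)
```

Each resulting pair would then be one unit of work for the cross-doc consistency prompt.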

### Why Opus dominates cross-doc consistency (from Finding #28)
Opus was 2.4x faster AND found more issues than GPT-5. Is this because
cross-doc contradictions are easy to verify once spotted (reducing GPT-5's
verification advantage)? Or because boundary reasoning (Opus's strength)
is the primary skill needed?

### Sonnet + narrow framing = GPT-5 level? (from Finding #5)
Would Sonnet catch semantic issues if given a narrower "check for logical
consistency" framing instead of a broad review? The hypothesis: Sonnet's
"structural reviewer" tendency is a framing artifact, not a capability limit.

## Medium Priority

### ~~Adversarial analysis ensemble (from Finding #29)~~ → ANSWERED (Finding #35)
~~Run GPT-5 and Opus sequentially: give Opus access to GPT-5's findings
and ask it to critique and extend. Does the ensemble find more than either
alone? Does Opus's system-level thinking complement GPT-5's exhaustiveness?~~

**YES.** The ensemble produces 56 findings vs 43 (GPT-5) or 28 (Opus) alone,
a 30% improvement over GPT-5 alone. Zero full disagreements: the critique
phase calibrates severity without discarding findings. The extension phase
adds 13 genuinely new findings (4 High). The critique's structured assessment
is more valuable than the raw extensions.
Cost: ~28% more tokens for 30% more coverage plus prioritization.
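
The counts above can be cross-checked with a few lines of arithmetic. This is only a sanity check on the reported numbers. Reading the 30% figure as relative to GPT-5 alone is an inference from those numbers, and the variable names are illustrative.

```python
# Counts reported in Finding #35.
gpt5_alone = 43
opus_alone = 28
ensemble = 56
new_from_extension = 13

# The 13 extension findings account for the gap between GPT-5 alone and the ensemble.
assert gpt5_alone + new_from_extension == ensemble

gain_over_gpt5 = ensemble / gpt5_alone - 1  # roughly 0.30
gain_over_opus = ensemble / opus_alone - 1  # 56 is exactly twice 28

print(f"vs GPT-5: {gain_over_gpt5:.0%}, vs Opus: {gain_over_opus:.0%}")
# prints "vs GPT-5: 30%, vs Opus: 100%"
```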

### Reasoning effort parameter (from Finding #21)
Reasoning effort (low/medium/high) had negligible effect on GPT-5's
analytical output. Is this because the parameter doesn't work for open-ended
analysis? Or because the task was already within GPT-5's "easy" threshold?
Test with a harder document.

### Model personality vs prompt (from Finding #26)
Missing-feature identification IS promptable across all models: prompt
framing eliminates Opus's historical advantage. How many other "model
personality" observations are actually just prompt framing effects?

## Answered Questions

- ~~Opus's "missing feature identification" mode: is it promptable?~~
  **YES** (Finding #26): all models find regulatory gaps when explicitly
  prompted. Opus's behavior was an emergent DEFAULT tendency, not a unique
  capability.

- ~~Is Opus > GPT-5 for coherence tasks universal?~~
  **NO** (Finding #27): Opus's advantage from Finding #15 was
  document-specific. On risk-controls.md (992 lines, more complex), GPT-5
  regained the top position. Document complexity and domain specialization
  affect ranking.