model-research/open-questions.md
claw d8a030d9e9 finding #43: opus + narrow framing for contradiction detection
Tests the open question from Finding #39: does Opus's internal reasoning
depth suffice for self-contradiction verification?

Key result: wrong question. Opus finds a different CLASS of contradiction
than GPT-5. GPT-5 finds specification conflicts (statement comparison).
Opus finds logical impossibilities (deductive rule interaction). Neither
dominates — they don't overlap. Sonnet remains unreliable (~33% precision).

Document tested: escalation-policy.md (228 lines)
Models: GPT-5, Claude Opus 4.6, Claude Sonnet 4.6
2026-05-07 16:05:14 -07:00


# Open Questions
Unanswered questions from experiments, ordered by potential impact.
## High Priority
### Signal-to-noise confirmation (from Finding #8)
Give a model the FULL PR review context (diff, files, issue, AC) but add
the narrow bias question as an explicit review checklist item. If the model
catches the bias despite the rich context, that confirms the signal-to-noise
hypothesis. If it misses the bias, the cause is something else (attention
allocation, task-switching cost).
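
A minimal sketch of how the test prompt could be assembled, assuming the PR context is available as plain strings. The function name, section titles, and the placeholder values are all hypothetical; the actual narrow bias question from Finding #8 is not reproduced here and must be filled in.

```python
def build_review_prompt(diff, files, issue, acceptance_criteria, checklist):
    """Assemble one prompt holding the full PR context plus a review checklist."""
    sections = [
        ("Diff", diff),
        ("Files", "\n".join(files)),
        ("Issue", issue),
        ("Acceptance criteria", acceptance_criteria),
        # The narrow question rides along as an explicit checklist item.
        ("Review checklist", "\n".join(f"- [ ] {item}" for item in checklist)),
    ]
    return "\n\n".join(f"## {title}\n{body}" for title, body in sections)


# Placeholder: substitute the real narrow bias question from Finding #8.
NARROW_BIAS_QUESTION = "<narrow bias question from Finding #8>"

prompt = build_review_prompt(
    diff="<diff text>",
    files=["<path/to/file>"],
    issue="<issue text>",
    acceptance_criteria="<AC text>",
    checklist=[NARROW_BIAS_QUESTION],
)
```

Keeping the checklist as the last section is an arbitrary choice in this sketch; its position in the prompt may itself be worth varying in the experiment.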
### Cross-document consistency as maintenance tool (from Finding #28)
Does running cross-doc analysis across MORE document pairs (domain readmes
vs implementation docs, design docs vs plan docs) yield additional real
inconsistencies? Could become a systematic documentation maintenance tool.
### Why Opus dominates cross-doc consistency (from Finding #28)
Opus was 2.4x faster AND found more issues than GPT-5. Is this because
cross-doc contradictions are easy to verify once spotted (reducing GPT-5's
verification advantage)? Or because boundary reasoning (Opus's strength)
is the primary skill needed?
### ~~Opus + narrow framing for contradiction detection (from Finding #39)~~ → ANSWERED (Finding #43)
~~Would Opus + narrow framing match GPT-5 for self-contradiction detection?
Finding #39 showed Sonnet can't do it even with narrow framing (reasoning
depth issue). Opus has strong cross-boundary reasoning — does its internal
reasoning depth suffice for the verification step that Sonnet lacks?~~
**WRONG QUESTION.** Opus does not match GPT-5; it finds a different CLASS
of contradiction. GPT-5 finds specification conflicts (same scenario, conflicting
prescriptions, found via statement comparison). Opus finds logical impossibilities
(rules whose interaction produces impossible conditions, found via deductive
reasoning). Neither dominates and their findings do not overlap, so run both for
complete coverage. Sonnet remains unreliable (~33% precision on contradiction detection).
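
One way "run both" could look in practice is a plain union of the two result sets, tagged by contradiction class. This is a sketch under assumptions: the `Finding` fields, the `kind` labels, and the `merge_findings` helper are hypothetical, and the sample data is placeholder text, not output from either model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    model: str      # which model reported it
    kind: str       # "specification_conflict" or "logical_impossibility"
    location: str   # hypothetical locator, e.g. a section heading
    summary: str


def merge_findings(*runs):
    """Union findings from several runs, keeping the first report per (kind, location)."""
    merged = {}
    for run in runs:
        for finding in run:
            merged.setdefault((finding.kind, finding.location), finding)
    return list(merged.values())


# Placeholder data only; these are not findings from escalation-policy.md.
gpt5_run = [Finding("gpt-5", "specification_conflict", "section-a", "placeholder")]
opus_run = [Finding("opus", "logical_impossibility", "section-b", "placeholder")]

combined = merge_findings(gpt5_run, opus_run)
by_kind = {f.kind for f in combined}
```

Because the two classes did not overlap on the tested document, the de-duplication step is a safeguard rather than a necessity; it matters only if a later document produces shared findings.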
### ~~Sonnet + narrow framing = GPT-5 level? (from Finding #5)~~ → ANSWERED (Finding #39)
~~Would Sonnet catch semantic issues if given a narrower "check for logical
consistency" framing instead of broad review? The hypothesis: Sonnet's
"structural reviewer" tendency is a framing artifact, not a capability limit.~~
**NO.** Narrow framing changes WHAT Sonnet looks for (redirects from gaps to
contradictions) but not HOW WELL it reasons. With narrow framing, Sonnet reported
3 contradictions, but only 1 was genuine (2 were misreadings). GPT-5 reported 4,
all genuine. The gap is reasoning depth, not framing: Sonnet can't reliably verify
whether two statements actually contradict each other.
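
The precision behind these counts can be computed directly. The sketch below uses only the numbers stated above (1 of 3 genuine for Sonnet narrow, 4 of 4 for GPT-5); the `precision` helper is an illustrative name, not part of any experiment tooling.

```python
def precision(genuine: int, reported: int) -> float:
    """Fraction of reported contradictions that were genuine."""
    if reported == 0:
        raise ValueError("precision is undefined with zero reported findings")
    return genuine / reported


sonnet_narrow = precision(genuine=1, reported=3)
gpt5 = precision(genuine=4, reported=4)

print(f"Sonnet narrow: {sonnet_narrow:.0%}")  # prints "Sonnet narrow: 33%"
print(f"GPT-5: {gpt5:.0%}")                   # prints "GPT-5: 100%"
```

With samples this small (3 and 4 findings), these figures are point estimates on one document and carry wide uncertainty.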
## Medium Priority
### ~~Adversarial analysis ensemble (from Finding #29)~~ → ANSWERED (Finding #35)
~~Run GPT-5 and Opus sequentially — give Opus access to GPT-5's findings
and ask it to critique and extend. Does the ensemble find more than either
alone? Does Opus's system-level thinking complement GPT-5's exhaustiveness?~~
**YES.** The ensemble produces 56 findings vs 43 (GPT-5 alone) or 28 (Opus alone),
a 30% improvement over GPT-5 alone. Zero full disagreements: the critique phase
calibrates severity without discarding findings. The extension phase adds 13
genuinely new findings (4 High). The critique's structured assessment is more
valuable than the raw extensions.
Cost: ~28% more tokens for 30% more coverage + prioritization.
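
The arithmetic behind the 30% figure, as a worked check. It assumes the ensemble total is GPT-5's 43 findings plus the 13 extension findings, which is consistent with the reported 56 but is not stated explicitly; the variable names are illustrative.

```python
gpt5_alone = 43
opus_alone = 28
extension_new = 13    # genuinely new findings from the extension phase
extra_tokens = 0.28   # approximate extra token cost reported for the ensemble

# Assumption: ensemble total = GPT-5 findings + extension findings.
ensemble_total = gpt5_alone + extension_new

gain_over_gpt5 = ensemble_total / gpt5_alone - 1   # about 0.302
gain_over_opus = ensemble_total / opus_alone - 1   # 1.0, i.e. double Opus alone

# Extra coverage bought per unit of extra token spend.
coverage_per_token = gain_over_gpt5 / extra_tokens

print(f"ensemble total: {ensemble_total}")
print(f"gain over GPT-5 alone: {gain_over_gpt5:.1%}")
print(f"gain over Opus alone: {gain_over_opus:.0%}")
```

The 30% gain is measured against the stronger single model (GPT-5); against Opus alone the ensemble doubles the finding count.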
### Reasoning effort parameter (from Finding #21)
Reasoning effort (low/medium/high) had negligible effect on GPT-5's
analytical output. Is this because the parameter doesn't work for open-ended
analysis? Or because the task was already within GPT-5's "easy" threshold?
Test with a harder document.
### Model personality vs prompt (from Finding #26)
Missing-feature identification IS promptable across all models — prompt
framing eliminates Opus's historical advantage. How many other "model
personality" observations are actually just prompt framing effects?
## Answered Questions
- ~~Opus's "missing feature identification" mode — is it promptable?~~
**YES** (Finding #26): all models find regulatory gaps when explicitly
prompted. Opus's behavior was an emergent DEFAULT tendency, not a unique
capability.
- ~~Is Opus > GPT-5 for coherence tasks universal?~~
**NO** (Finding #27): Opus's advantage from Finding #15 was
document-specific. On risk-controls.md (992 lines, more complex), GPT-5
regained top position. Document complexity and domain specialization affect ranking.