finding #43: opus + narrow framing for contradiction detection

Tests the open question from Finding #39: does Opus's internal reasoning
depth suffice for self-contradiction verification?

Key result: wrong question. Opus finds a different CLASS of contradiction
than GPT-5. GPT-5 finds specification conflicts (statement comparison).
Opus finds logical impossibilities (deductive rule interaction). Neither
dominates — they don't overlap. Sonnet remains unreliable (~33% precision).

Document tested: escalation-policy.md (228 lines)
Models: GPT-5, Claude Opus 4.6, Claude Sonnet 4.6
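
A minimal sketch of the bookkeeping behind "neither dominates, they don't
overlap" and the precision figure. The Finding type, the helper names, and
every location string below are hypothetical placeholders for illustration,
not data from the actual runs:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One reported contradiction (hypothetical record shape)."""
    location: str  # placeholder identifier, e.g. a section name
    kind: str      # "specification_conflict" or "logical_impossibility"


def precision(reported: set, confirmed: set) -> float:
    """Fraction of reported findings that were confirmed as real."""
    if not reported:
        return 0.0
    return len(reported & confirmed) / len(reported)


def combine(*runs: set) -> set:
    """Union of findings from several model runs."""
    merged = set()
    for run in runs:
        merged |= run
    return merged


# Placeholder findings, NOT the real results from escalation-policy.md.
run_a = {Finding("section-A", "specification_conflict")}
run_b = {Finding("section-B", "logical_impossibility")}

overlap = run_a & run_b          # empty when the two classes are disjoint
merged = combine(run_a, run_b)   # running both covers both classes

# A run that reports three findings of which one is confirmed
# has precision 1/3, i.e. roughly 33%.
noisy = {Finding(f"section-{i}", "specification_conflict") for i in range(3)}
confirmed = {Finding("section-0", "specification_conflict")}
noisy_precision = precision(noisy, confirmed)
```

When the two sets are disjoint, the union is strictly larger than either
run alone, which is the argument for running both models.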
claw
2026-05-07 16:05:14 -07:00
parent 296bb21eb7
commit d8a030d9e9
2 changed files with 144 additions and 3 deletions
+10 -3
@@ -22,11 +22,18 @@ cross-doc contradictions are easy to verify once spotted (reducing GPT-5's
 verification advantage)? Or because boundary reasoning (Opus's strength)
 is the primary skill needed?
-### Opus + narrow framing for contradiction detection (from Finding #39)
-Would Opus + narrow framing match GPT-5 for self-contradiction detection?
+### ~~Opus + narrow framing for contradiction detection (from Finding #39)~~ → ANSWERED (Finding #43)
+~~Would Opus + narrow framing match GPT-5 for self-contradiction detection?
 Finding #39 showed Sonnet can't do it even with narrow framing (reasoning
 depth issue). Opus has strong cross-boundary reasoning — does its internal
-reasoning depth suffice for the verification step that Sonnet lacks?
+reasoning depth suffice for the verification step that Sonnet lacks?~~
+**WRONG QUESTION.** Opus doesn't try to match GPT-5 — it finds a different CLASS
+of contradiction. GPT-5 finds specification conflicts (same scenario, conflicting
+prescriptions via statement comparison). Opus finds logical impossibilities (rules
+whose interaction produces impossible conditions via deductive reasoning). Neither
+dominates — they don't overlap. Run both for complete coverage. Sonnet remains
+unreliable (~33% precision on contradiction detection).
 ### ~~Sonnet + narrow framing = GPT-5 level? (from Finding #5)~~ → ANSWERED (Finding #39)
 ~~Would Sonnet catch semantic issues if given a narrower "check for logical