finding #43: opus + narrow framing for contradiction detection

Tests the open question from Finding #39: does Opus's internal reasoning depth suffice for self-contradiction verification?

Key result: wrong question. Opus finds a different CLASS of contradiction than GPT-5. GPT-5 finds specification conflicts (statement comparison). Opus finds logical impossibilities (deductive rule interaction). Neither dominates; they don't overlap. Sonnet remains unreliable (~33% precision).

Document tested: escalation-policy.md (228 lines)
Models: GPT-5, Claude Opus 4.6, Claude Sonnet 4.6
2026-05-07 16:05:14 -07:00
parent 296bb21eb7
commit d8a030d9e9
2 changed files with 144 additions and 3 deletions
@@ -0,0 +1,134 @@
 # Finding #43: Opus + narrow framing finds a qualitatively different type of contradiction than GPT-5; neither dominates
 **Date:** 2026-05-07
 **Document:** `docs/domain/contexts/risk/escalation-policy.md` (228 lines)
 **Task type:** Internal logical consistency / self-contradiction detection
 **Models:** Claude Opus 4.6, GPT-5, Claude Sonnet 4.6 (all narrow framing)
 **Open question tested:** "Would Opus + narrow framing match GPT-5 for self-contradiction detection?" (from Finding #39)
 ## Experiment Design
 Finding #39 showed that Sonnet + narrow framing does NOT close the gap with GPT-5 for
 contradiction detection — Sonnet found 3 contradictions but only 1 was genuine (2 misreadings).
 The open question: does Opus's deeper internal reasoning suffice for the verification step
 that Sonnet lacks?
 Three conditions, same document, same narrow prompt:
 | Condition | Model | Time | Output tokens | Reasoning tokens | Contradictions |
 |---|---|---|---|---|---|
 | A | GPT-5 | 52s | 6,415 | 6,208 | 1 |
 | B | Claude Opus 4.6 | 12s | 468 | (internal) | 1 |
 | C | Claude Sonnet 4.6 | 26s | 1,451 | (internal) | 3 |
 ## What They Found
 ### GPT-5 (1 genuine contradiction):
 **Broker-unavailable timing conflict:** The prose says broker unreachability leads to kill
 switch only after "continued consecutive breaches" (N more evaluations). The table says
 broker unavailable → "Immediate kill switch escalation." Both describe the same scenario
 (broker unavailable during liquidation) but prescribe different timing: debounce-gated vs
 immediate. Severity: High.
 ### Claude Opus 4.6 (1 genuine contradiction):
 **Debounce reset paradox:** The document states "A single clear evaluation resets the breach
 counter." But the Liquidation Sizing section says if liquidation is insufficient, "the next
 evaluation cycle can trigger additional liquidation — but only after the debounce count resets
 and fires again." If the metric NEVER clears (liquidation was insufficient, metric still
 breaches), the counter can never reset per the stated rule. Yet the document says additional
 liquidation requires the counter to reset. These cannot both be true for a continuously-
 breaching metric. Severity: High.
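
The paradox can be reproduced with a minimal simulation. This is a sketch under assumptions, not the policy's actual implementation: the `Debounce` class, the `armed` flag, and the threshold of 3 are hypothetical, chosen only to model the two quoted rules literally.

```python
from dataclasses import dataclass


@dataclass
class Debounce:
    """Breach counter modelling the two quoted rules (hypothetical)."""
    threshold: int      # consecutive breaches needed to fire
    count: int = 0
    armed: bool = True  # False after firing, until the counter resets

    def evaluate(self, breached: bool) -> bool:
        """Process one evaluation cycle; return True if liquidation fires."""
        if not breached:
            # Rule 1: a single clear evaluation resets the breach counter.
            self.count = 0
            self.armed = True
            return False
        self.count += 1
        # Rule 2: firing again requires the counter to have reset first.
        if self.armed and self.count >= self.threshold:
            self.armed = False
            return True
        return False


def count_fires(evaluations: list[bool], threshold: int = 3) -> int:
    """Count how many times liquidation fires over a sequence of evaluations."""
    debounce = Debounce(threshold=threshold)
    return sum(debounce.evaluate(breached) for breached in evaluations)


# Metric never clears: liquidation fires once and can never fire again.
continuous = [True] * 20
# Metric clears once after cycle 10, which resets and re-arms the counter.
with_clear = [True] * 10 + [False] + [True] * 9

print(count_fires(continuous))  # 1
print(count_fires(with_clear))  # 2
```

Under a continuously breaching metric the simulated liquidation fires exactly once. A second firing only becomes possible after a clear evaluation, which is precisely the event that cannot occur when the first liquidation was insufficient.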
 ### Claude Sonnet 4.6 (3 claimed, assessment below):
 1. **Failure modes "Automatic" vs manual de-escalation** — Claims "Automatic" recovery in
   the failure modes table contradicts "manual only" de-escalation from liquidate.
   **Assessment: MISREAD.** The "Automatic" column describes how the system HANDLES the
   failure scenario (auto-retries, escalates to kill switch), not downward de-escalation.
   The system's autonomous recovery is escalation UPWARD (to kill switch), which is
   consistent with manual-only downward de-escalation.
 2. **Debounce defaults vs calibration guidance** — Restrict→Liquidate defaults to 3 but
   calibration says volatile metrics need 5-8.
   **Assessment: TENSION, not contradiction.** The document explicitly says "These are
   configurable per metric" — the defaults don't need to match the guidance for specific
   metric types. The calibration section explains HOW to override defaults, not what the
   defaults must be. This is advice vs defaults, not statement vs statement.
 3. **Kill switch immediate trigger vs "post-liquidation" event description** — Same finding
   as GPT-5's: broker-unavailable immediate escalation conflicts with the event described
   as "post-liquidation."
   **Assessment: GENUINE.** This is the same contradiction GPT-5 found but arrived at via
   a different evidence path (event description rather than prose/table conflict).
 **Sonnet accuracy: 3 claimed = 1 genuine + 1 tension + 1 misread. Precision is 33% (1/3) counting only the genuine finding, or 67% (2/3) if the tension is also counted as a valid flag.**
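
The two ends of the precision range follow from how the "tension" verdict is scored. A small worked calculation, using the three assessments above (the `strict` and `lenient` names are illustrative only):

```python
from collections import Counter

# Assessments of Sonnet's three claims, in the order listed above.
assessments = ["misread", "tension", "genuine"]
counts = Counter(assessments)
total = len(assessments)

# Strict: only a genuine contradiction counts as a correct claim.
strict = counts["genuine"] / total
# Lenient: a real tension also counts as a useful flag.
lenient = (counts["genuine"] + counts["tension"]) / total

print(f"strict precision:  {strict:.0%}")   # 33%
print(f"lenient precision: {lenient:.0%}")  # 67%
```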
 ## Analysis
 ### GPT-5's finding vs Opus's finding — different types of contradiction:
 GPT-5 found a **surface-level specification conflict**: two statements about the same
 scenario (broker unavailable) prescribe different behaviors (wait N breaches vs immediate).
 This is the type of contradiction you'd find during careful proofreading — it's where the
 document says "X" in one place and "not-X" in another about the same thing.
 Opus found a **logical impossibility**: the interaction between two stated rules creates a
 situation that can never resolve. The debounce reset rule (requires a clear evaluation) and
 the re-triggering mechanism (needs the counter to reset) cannot both work as described when
 the metric continuously breaches. This is NOT a statement-vs-statement conflict — it's a
 logical consequence that the author likely didn't reason through.
 These are qualitatively different:
 - GPT-5's type: "you said conflicting things about the same scenario" (specification bug)
 - Opus's type: "your rules, when combined, produce an impossible requirement" (logic bug)
 ### Does Opus match GPT-5?
 **No — but not because it's worse.** They find different things. GPT-5's 6,208 reasoning
 tokens went toward exhaustively checking statement pairs for direct conflicts. Opus's
 internal reasoning went toward understanding the LOGICAL INTERACTION between rules.
 GPT-5 missed the debounce reset paradox (likely because it requires multi-step logical
 reasoning about rule interactions rather than statement comparison). Opus missed the
 broker-unavailable timing conflict (likely because it's a more surface-level inconsistency
 between prose and table that doesn't involve logical deduction).
 ### Sonnet's continued weakness:
 Consistent with Finding #39: Sonnet found 3 contradictions but only 1 was genuine (the
 broker-unavailable one, same as GPT-5). The failure-modes misread shows Sonnet doesn't
 reliably verify whether two statements ACTUALLY conflict — it pattern-matches on surface
 similarity ("Automatic" and "manual only" appear to conflict) without reasoning about
 whether they refer to the same thing. The debounce/calibration "contradiction" confuses
 advisory guidance with specification (a type confusion that neither GPT-5 nor Opus made here).
 ## Key Insight — Two distinct contradiction-finding modes:
 | Mode | Best model | What it catches | Cognitive demand |
 |---|---|---|---|
 | Specification conflicts | GPT-5 | Same scenario, different prescriptions | Statement comparison + verification |
 | Logical impossibilities | Opus | Rules that cannot all hold under some condition (e.g. a metric that never clears) | Multi-step logical deduction |
 This explains why the open question ("does Opus match GPT-5?") has no clean yes/no answer.
 They're not attempting the same thing. GPT-5 exhaustively compares statement pairs. Opus
 reasons about what the stated rules IMPLY when combined. Both modes catch real bugs that
 the other misses.
 ## Practical Implication
 For self-contradiction detection in architecture documents:
 - Run BOTH GPT-5 and Opus — they catch fundamentally different types of contradictions
 - GPT-5 catches specification bugs (conflicting statements about the same thing)
 - Opus catches logic bugs (rules whose interactions produce impossible conditions)
 - Sonnet remains unreliable — too many false positives from surface-pattern matching
 - Adding Opus costs little: 12s and 468 output tokens, vs 52s and 6,415 output tokens for GPT-5 (Opus's reasoning tokens are internal and not counted here)
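
The coverage argument can be checked directly against the verdicts recorded in this finding. The sketch below is only a tabulation of those results. The short labels and the `genuine` helper are hypothetical names introduced for illustration, and no model is actually called.

```python
# Findings per model, keyed by hypothetical short labels.
# Verdicts are the assessments recorded in this finding.
findings = {
    "GPT-5": {"broker-unavailable-timing": "genuine"},
    "Opus": {"debounce-reset-paradox": "genuine"},
    "Sonnet": {
        "automatic-vs-manual": "misread",
        "defaults-vs-calibration": "tension",
        "broker-unavailable-timing": "genuine",
    },
}


def genuine(model: str) -> set[str]:
    """Return the labels of the genuine contradictions a model reported."""
    return {
        label
        for label, verdict in findings[model].items()
        if verdict == "genuine"
    }


gpt5, opus = genuine("GPT-5"), genuine("Opus")

print(sorted(gpt5 | opus))  # union: both genuine contradictions
print(gpt5 & opus)          # overlap: set()
print(genuine("Sonnet") <= gpt5 | opus)  # True: Sonnet adds nothing new
```

On this document the union of GPT-5 and Opus covers both genuine contradictions with no overlap, and Sonnet's one genuine finding is already in that union.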
 ## Updated Answer to Open Question
 > "Would Opus + narrow framing match GPT-5 for self-contradiction detection?"
 **Wrong question.** Opus is not a substitute for GPT-5; it finds a different class of
 contradiction. The right framing: Opus + GPT-5 together catch more than either alone,
 and on this document the contradictions they found did not overlap. Run both.
@@ -22,11 +22,18 @@ cross-doc contradictions are easy to verify once spotted (reducing GPT-5's
 verification advantage)? Or because boundary reasoning (Opus's strength)
 is the primary skill needed?
-### Opus + narrow framing for contradiction detection (from Finding #39)
+### ~~Opus + narrow framing for contradiction detection (from Finding #39)~~ → ANSWERED (Finding #43)
-Would Opus + narrow framing match GPT-5 for self-contradiction detection?
+~~Would Opus + narrow framing match GPT-5 for self-contradiction detection?
 Finding #39 showed Sonnet can't do it even with narrow framing (reasoning
 depth issue). Opus has strong cross-boundary reasoning — does its internal
-reasoning depth suffice for the verification step that Sonnet lacks?
+reasoning depth suffice for the verification step that Sonnet lacks?~~
 **WRONG QUESTION.** Opus doesn't try to match GPT-5 — it finds a different CLASS
 of contradiction. GPT-5 finds specification conflicts (same scenario, conflicting
 prescriptions via statement comparison). Opus finds logical impossibilities (rules
 whose interaction produces impossible conditions via deductive reasoning). Neither
 dominates — they don't overlap. Run both for complete coverage. Sonnet remains
 unreliable (~33% precision on contradiction detection).
 ### ~~Sonnet + narrow framing = GPT-5 level? (from Finding #5)~~ → ANSWERED (Finding #39)
 ~~Would Sonnet catch semantic issues if given a narrower "check for logical