finding #43: opus + narrow framing for contradiction detection

Tests the open question from Finding #39: does Opus's internal reasoning depth suffice for self-contradiction verification?

Key result: wrong question. Opus finds a different CLASS of contradiction than GPT-5. GPT-5 finds specification conflicts (statement comparison). Opus finds logical impossibilities (deductive rule interaction). Neither dominates; they don't overlap. Sonnet remains unreliable (~33% precision).

Document tested: escalation-policy.md (228 lines)
Models: GPT-5, Claude Opus 4.6, Claude Sonnet 4.6
2026-05-07 16:05:14 -07:00
parent 296bb21eb7
commit d8a030d9e9
2 changed files with 144 additions and 3 deletions
@@ -0,0 +1,134 @@
+# Finding #43: Opus + narrow framing finds a qualitatively different type of contradiction than GPT-5; neither dominates
+
+**Date:** 2026-05-07
+**Document:** `docs/domain/contexts/risk/escalation-policy.md` (228 lines)
+**Task type:** Internal logical consistency / self-contradiction detection
+**Models:** Claude Opus 4.6, GPT-5, Claude Sonnet 4.6 (all narrow framing)
+**Open question tested:** "Would Opus + narrow framing match GPT-5 for self-contradiction detection?" (from Finding #39)
+
+## Experiment Design
+
+Finding #39 showed that Sonnet + narrow framing does NOT close the gap with GPT-5 for
+contradiction detection — Sonnet found 3 contradictions but only 1 was genuine (2 misreadings).
+The open question: does Opus's deeper internal reasoning suffice for the verification step
+that Sonnet lacks?
+
+Three conditions, same document, same narrow prompt:
+
+| Condition | Model | Time | Output tokens | Reasoning tokens | Contradictions claimed |
+|---|---|---|---|---|---|
+| A | GPT-5 | 52s | 6,415 | 6,208 | 1 |
+| B | Claude Opus 4.6 | 12s | 468 | (internal) | 1 |
+| C | Claude Sonnet 4.6 | 26s | 1,451 | (internal) | 3 |
+
+## What They Found
+
+### GPT-5 (1 genuine contradiction):
+
+**Broker-unavailable timing conflict:** The prose says broker unreachability leads to kill
+switch only after "continued consecutive breaches" (N more evaluations). The table says
+broker unavailable → "Immediate kill switch escalation." Both describe the same scenario
+(broker unavailable during liquidation) but prescribe different timing: debounce-gated vs
+immediate. Severity: High.
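The comparison step behind this kind of finding can be sketched in a few lines. This is only an illustration under stated assumptions: the `statements` tuples are a hand-written paraphrase of the prose and table entries quoted above, and `find_spec_conflicts` is a hypothetical helper, not anything GPT-5 or the policy document defines. The hard part, extracting and normalizing the statements, is what the model does and is not shown here.

```python
from collections import defaultdict

# Hand-paraphrased (source, scenario, prescription) triples; illustrative only.
statements = [
    ("prose", "broker unavailable during liquidation",
     "kill switch after continued consecutive breaches"),
    ("table", "broker unavailable during liquidation",
     "immediate kill switch escalation"),
]


def find_spec_conflicts(statements):
    """Group prescriptions by scenario and keep scenarios with more than one."""
    by_scenario = defaultdict(set)
    for _source, scenario, prescription in statements:
        by_scenario[scenario].add(prescription)
    return {s: sorted(p) for s, p in by_scenario.items() if len(p) > 1}


conflicts = find_spec_conflicts(statements)
for scenario, prescriptions in conflicts.items():
    print(scenario, "->", prescriptions)
```

A specification conflict in this sense is one scenario key mapped to two different prescriptions, which is why it can be found by pairwise statement comparison.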
+
+### Claude Opus 4.6 (1 genuine contradiction):
+
+**Debounce reset paradox:** The document states "A single clear evaluation resets the breach
+counter." But the Liquidation Sizing section says if liquidation is insufficient, "the next
+evaluation cycle can trigger additional liquidation — but only after the debounce count resets
+and fires again." If the metric NEVER clears (liquidation was insufficient, metric still
+breaches), the counter can never reset per the stated rule. Yet the document says additional
+liquidation requires the counter to reset. These cannot both be true for a continuously-
+breaching metric. Severity: High.
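The paradox can be made concrete with a small simulation. This is a minimal sketch, not the policy's actual implementation: `DEBOUNCE_N`, `simulate`, and the `reset_on_fire` flag are hypothetical, and the sketch assumes a trigger fires exactly when the counter reaches the threshold. Resetting the counter after firing is just one possible reading that would remove the paradox; the document does not state it.

```python
DEBOUNCE_N = 3  # hypothetical threshold; the document says these are configurable


def simulate(breaches, reset_on_fire):
    """Return the evaluation indices at which liquidation fires.

    breaches: sequence of bools, True meaning the metric breached.
    reset_on_fire: False follows the stated rule literally (only a clear
    evaluation resets the counter); True also resets after each firing.
    """
    counter = 0
    fired_at = []
    for i, breached in enumerate(breaches):
        if not breached:
            counter = 0  # "A single clear evaluation resets the breach counter."
            continue
        counter += 1
        if counter == DEBOUNCE_N:
            fired_at.append(i)
            if reset_on_fire:
                counter = 0  # assumed fix, not stated in the document
    return fired_at


always_breaching = [True] * 9  # liquidation was insufficient, metric never clears
literal = simulate(always_breaching, reset_on_fire=False)
patched = simulate(always_breaching, reset_on_fire=True)
print(literal)  # [2]
print(patched)  # [2, 5, 8]
```

Under the literal rule the counter passes the threshold once and never returns to it, so "additional liquidation" can never fire for a metric that keeps breaching.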
+
+### Claude Sonnet 4.6 (3 claimed, assessment below):
+
+1. **Failure modes "Automatic" vs manual de-escalation** — Claims "Automatic" recovery in
+   the failure modes table contradicts "manual only" de-escalation from liquidate.
+   **Assessment: MISREAD.** The "Automatic" column describes how the system HANDLES the
+   failure scenario (auto-retries, escalates to kill switch), not downward de-escalation.
+   The system's autonomous recovery is escalation UPWARD (to kill switch), which is
+   consistent with manual-only downward de-escalation.
+
+2. **Debounce defaults vs calibration guidance** — Restrict→Liquidate defaults to 3 but
+   calibration says volatile metrics need 5-8.
+   **Assessment: TENSION, not contradiction.** The document explicitly says "These are
+   configurable per metric" — the defaults don't need to match the guidance for specific
+   metric types. The calibration section explains HOW to override defaults, not what the
+   defaults must be. This is advice vs defaults, not statement vs statement.
+
+3. **Kill switch immediate trigger vs "post-liquidation" event description** — Same finding
+   as GPT-5's: broker-unavailable immediate escalation conflicts with the event described
+   as "post-liquidation."
+   **Assessment: GENUINE.** This is the same contradiction GPT-5 found but arrived at via
+   a different evidence path (event description rather than prose/table conflict).
+
+**Sonnet precision: 1 genuine + 1 tension + 1 misread out of 3 claimed. That is 33% if only the genuine finding counts (1 of 3), or 67% if the tension is also credited (2 of 3).**
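The two ends of that precision range come from how the "tension" verdict is scored. A quick sketch of the arithmetic, with `verdicts` and `precision` as illustrative names:

```python
# Assessments of Sonnet's three claims, in the order listed above.
verdicts = ["misread", "tension", "genuine"]


def precision(verdicts, credited):
    """Fraction of claims whose verdict is in the credited set."""
    return sum(v in credited for v in verdicts) / len(verdicts)


strict = precision(verdicts, {"genuine"})              # only genuine counts
lenient = precision(verdicts, {"genuine", "tension"})  # tension also credited
print(f"strict={strict:.0%} lenient={lenient:.0%}")    # strict=33% lenient=67%
```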
+
+## Analysis
+
+### GPT-5's finding vs Opus's finding — different types of contradiction:
+
+GPT-5 found a **surface-level specification conflict**: two statements about the same
+scenario (broker unavailable) prescribe different behaviors (wait N breaches vs immediate).
+This is the type of contradiction you'd find during careful proofreading — it's where the
+document says "X" in one place and "not-X" in another about the same thing.
+
+Opus found a **logical impossibility**: the interaction between two stated rules creates a
+situation that can never resolve. The debounce reset rule (requires a clear evaluation) and
+the re-triggering mechanism (needs the counter to reset) cannot both work as described when
+the metric continuously breaches. This is NOT a statement-vs-statement conflict — it's a
+logical consequence that the author likely didn't reason through.
+
+These are qualitatively different:
+- GPT-5's type: "you said conflicting things about the same scenario" (specification bug)
+- Opus's type: "your rules, when combined, produce an impossible requirement" (logic bug)
+
+### Does Opus match GPT-5?
+
+**No, but not because it's worse.** They find different things. GPT-5's 6,208 reasoning
+tokens appear to have gone toward exhaustively checking statement pairs for direct conflicts.
+Opus's internal reasoning appears to have gone toward the LOGICAL INTERACTION between rules.
+
+GPT-5 missed the debounce reset paradox (likely because it requires multi-step logical
+reasoning about rule interactions rather than statement comparison). Opus missed the
+broker-unavailable timing conflict (likely because it's a more surface-level inconsistency
+between prose and table that doesn't involve logical deduction).
+
+### Sonnet's continued weakness:
+
+Consistent with Finding #39: Sonnet again claimed 3 contradictions, of which only 1 was
+genuine (the broker-unavailable one, same as GPT-5). The failure-modes misread shows Sonnet
+doesn't reliably verify whether two statements ACTUALLY conflict. It pattern-matches on
+surface similarity ("Automatic" and "manual only" appear to conflict) without checking
+whether they refer to the same thing. The debounce/calibration "contradiction" confuses
+advisory guidance with specification, a type confusion that GPT-5 and Opus both avoided here.
+
+## Key Insight — Two distinct contradiction-finding modes:
+
+| Mode | Best model | What it catches | Cognitive demand |
+|---|---|---|---|
+| Specification conflicts | GPT-5 | Same scenario, different prescriptions | Statement comparison + verification |
+| Logical impossibilities | Opus | Rules that can't coexist under all conditions | Multi-step logical deduction |
+
+This explains why the open question ("does Opus match GPT-5?") has no clean yes/no answer.
+They're not attempting the same thing. GPT-5 exhaustively compares statement pairs. Opus
+reasons about what the stated rules IMPLY when combined. Both modes catch real bugs that
+the other misses.
+
+## Practical Implication
+
+For self-contradiction detection in architecture documents:
+- Run BOTH GPT-5 and Opus — they catch fundamentally different types of contradictions
+- GPT-5 catches specification bugs (conflicting statements about the same thing)
+- Opus catches logic bugs (rules whose interactions produce impossible conditions)
+- Sonnet remains unreliable — too many false positives from surface-pattern matching
+- The added cost is small: Opus took 12s and 468 output tokens, versus 52s and 6,415 output tokens for GPT-5
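The "run both" recommendation amounts to taking the union of verified findings. A small sketch using only the genuine findings reported in this experiment; the `genuine` dictionary and the short labels are shorthand introduced here, not identifiers from any tool:

```python
# Genuine findings per model, as assessed in this finding (labels are shorthand).
genuine = {
    "GPT-5": {"broker-unavailable timing conflict"},
    "Claude Opus 4.6": {"debounce reset paradox"},
    "Claude Sonnet 4.6": {"broker-unavailable timing conflict"},
}

overlap = genuine["GPT-5"] & genuine["Claude Opus 4.6"]
combined = genuine["GPT-5"] | genuine["Claude Opus 4.6"]
sonnet_only = genuine["Claude Sonnet 4.6"] - combined

print(len(overlap), len(combined), len(sonnet_only))  # 0 2 0
```

On this one document the two sets are disjoint and Sonnet adds nothing beyond their union; whether that holds on other documents is not something this experiment shows.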
+
+## Updated Answer to Open Question
+
+> "Would Opus + narrow framing match GPT-5 for self-contradiction detection?"
+
+**Wrong question.** Opus doesn't try to match GPT-5; it finds a different class of
+contradiction. The right framing: on this document, Opus + GPT-5 together caught more than
+either alone, and the contradictions they found didn't overlap. Run both.
@@ -22,11 +22,18 @@ cross-doc contradictions are easy to verify once spotted (reducing GPT-5's
 verification advantage)? Or because boundary reasoning (Opus's strength)
 is the primary skill needed?

-### Opus + narrow framing for contradiction detection (from Finding #39)
-Would Opus + narrow framing match GPT-5 for self-contradiction detection?
+### ~~Opus + narrow framing for contradiction detection (from Finding #39)~~ → ANSWERED (Finding #43)
+~~Would Opus + narrow framing match GPT-5 for self-contradiction detection?
 Finding #39 showed Sonnet can't do it even with narrow framing (reasoning
 depth issue). Opus has strong cross-boundary reasoning — does its internal
-reasoning depth suffice for the verification step that Sonnet lacks?
+reasoning depth suffice for the verification step that Sonnet lacks?~~
+
+**WRONG QUESTION.** Opus doesn't try to match GPT-5 — it finds a different CLASS
+of contradiction. GPT-5 finds specification conflicts (same scenario, conflicting
+prescriptions via statement comparison). Opus finds logical impossibilities (rules
+whose interaction produces impossible conditions via deductive reasoning). Neither
+dominates — they don't overlap. Run both for complete coverage. Sonnet remains
+unreliable (~33% precision on contradiction detection).

 ### ~~Sonnet + narrow framing = GPT-5 level? (from Finding #5)~~ → ANSWERED (Finding #39)
 ~~Would Sonnet catch semantic issues if given a narrower "check for logical