finding #43: opus + narrow framing for contradiction detection

Tests the open question from Finding #39: does Opus's internal reasoning depth suffice for self-contradiction verification?

Key result: wrong question. Opus finds a different CLASS of contradiction than GPT-5. GPT-5 finds specification conflicts (statement comparison). Opus finds logical impossibilities (deductive rule interaction). Neither dominates, and their findings don't overlap. Sonnet remains unreliable (~33% precision).

Document tested: escalation-policy.md (228 lines)
Models: GPT-5, Claude Opus 4.6, Claude Sonnet 4.6
@@ -0,0 +1,134 @@
# Finding #43: Opus + narrow framing produces a qualitatively different contradiction type than GPT-5; neither dominates

**Date:** 2026-05-07
**Document:** `docs/domain/contexts/risk/escalation-policy.md` (228 lines)
**Task type:** Internal logical consistency / self-contradiction detection
**Models:** Claude Opus 4.6, GPT-5, Claude Sonnet 4.6 (all narrow framing)
**Open question tested:** "Would Opus + narrow framing match GPT-5 for self-contradiction detection?" (from Finding #39)

## Experiment Design

Finding #39 showed that Sonnet + narrow framing does NOT close the gap with GPT-5 for
contradiction detection: Sonnet found 3 contradictions, but only 1 was genuine (the other 2
were misreadings). The open question: does Opus's deeper internal reasoning suffice for the
verification step that Sonnet lacks?

Three conditions, same document, same narrow prompt:

| Condition | Model | Time | Output tokens | Reasoning tokens | Contradictions claimed |
|---|---|---|---|---|---|
| A | GPT-5 | 52s | 6,415 | 6,208 | 1 |
| B | Claude Opus 4.6 | 12s | 468 | (internal) | 1 |
| C | Claude Sonnet 4.6 | 26s | 1,451 | (internal) | 3 |

## What They Found

### GPT-5 (1 genuine contradiction):

**Broker-unavailable timing conflict:** The prose says broker unreachability leads to the kill
switch only after "continued consecutive breaches" (N more evaluations). The table says
broker unavailable → "Immediate kill switch escalation." Both describe the same scenario
(broker unavailable during liquidation) but prescribe different timing: debounce-gated vs
immediate. Severity: High.
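
To make the conflict concrete, here is a minimal sketch that encodes the two quoted rules as predicates and lists the inputs on which they disagree. The function names and the value of `ASSUMED_N` are illustrative assumptions; the policy's actual debounce count for this path is not stated in this finding.

```python
# Hypothetical encoding of the two rules GPT-5 flagged. Nothing here comes
# from the policy's implementation; ASSUMED_N is an illustrative value only.
ASSUMED_N = 3  # stands in for the "N more evaluations" mentioned in the prose


def prose_rule(breaches_while_unreachable: int) -> bool:
    """Prose: kill switch only after continued consecutive breaches."""
    return breaches_while_unreachable >= ASSUMED_N


def table_rule(breaches_while_unreachable: int) -> bool:
    """Table: broker unavailable -> immediate kill switch escalation."""
    return True


# Breach counts for which the two rules prescribe different behavior.
conflicts = [k for k in range(ASSUMED_N + 2) if prose_rule(k) != table_rule(k)]
print(conflicts)  # [0, 1, 2]
```

Under this reading, the two rules disagree for every breach count below N, which is exactly the window in which the prose waits and the table escalates.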

### Claude Opus 4.6 (1 genuine contradiction):

**Debounce reset paradox:** The document states "A single clear evaluation resets the breach
counter." But the Liquidation Sizing section says if liquidation is insufficient, "the next
evaluation cycle can trigger additional liquidation — but only after the debounce count resets
and fires again." If the metric NEVER clears (liquidation was insufficient, metric still
breaches), the counter can never reset per the stated rule. Yet the document says additional
liquidation requires the counter to reset. These cannot both be true for a continuously
breaching metric. Severity: High.

### Claude Sonnet 4.6 (3 claimed, assessment below):

1. **Failure modes "Automatic" vs manual de-escalation:** Claims "Automatic" recovery in
   the failure modes table contradicts "manual only" de-escalation from liquidate.
   **Assessment: MISREAD.** The "Automatic" column describes how the system HANDLES the
   failure scenario (auto-retries, escalates to kill switch), not downward de-escalation.
   The system's autonomous recovery is escalation UPWARD (to kill switch), which is
   consistent with manual-only downward de-escalation.

2. **Debounce defaults vs calibration guidance:** Restrict→Liquidate defaults to 3 but
   calibration says volatile metrics need 5-8.
   **Assessment: TENSION, not contradiction.** The document explicitly says "These are
   configurable per metric"; the defaults don't need to match the guidance for specific
   metric types. The calibration section explains HOW to override defaults, not what the
   defaults must be. This is advice vs defaults, not statement vs statement.

3. **Kill switch immediate trigger vs "post-liquidation" event description:** Same finding
   as GPT-5's: broker-unavailable immediate escalation conflicts with the event described
   as "post-liquidation."
   **Assessment: GENUINE.** This is the same contradiction GPT-5 found but arrived at via
   a different evidence path (event description rather than prose/table conflict).

**Sonnet accuracy: 1 genuine + 1 tension + 1 misread out of 3 claimed = 33% precision if only the genuine finding counts, 67% if the tension is also counted as valid.**
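
The two ends of the precision range follow directly from the three assessments. A minimal sketch of the arithmetic, assuming the lower bound counts only the genuine finding and the upper bound also counts the tension as a valid report (the variable names are illustrative):

```python
from fractions import Fraction

# Assessment counts for Sonnet's three claims, as listed above.
claimed = {"genuine": 1, "tension": 1, "misread": 1}
total = sum(claimed.values())

# Strict: only the genuine contradiction counts as a true positive.
strict = Fraction(claimed["genuine"], total)
# Lenient (assumption): the tension is also accepted as a useful finding.
lenient = Fraction(claimed["genuine"] + claimed["tension"], total)

print(f"strict: {float(strict):.0%}, lenient: {float(lenient):.0%}")
# strict: 33%, lenient: 67%
```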

## Analysis

### GPT-5's finding vs Opus's finding: different types of contradiction

GPT-5 found a **surface-level specification conflict**: two statements about the same
scenario (broker unavailable) prescribe different behaviors (wait N breaches vs immediate).
This is the type of contradiction you'd find during careful proofreading, where the
document says "X" in one place and "not-X" in another about the same thing.

Opus found a **logical impossibility**: the interaction between two stated rules creates a
situation that can never resolve. The debounce reset rule (requires a clear evaluation) and
the re-triggering mechanism (needs the counter to reset) cannot both work as described when
the metric continuously breaches. This is NOT a statement-vs-statement conflict; it is a
logical consequence that the author likely didn't reason through.

These are qualitatively different:
- GPT-5's type: "you said conflicting things about the same scenario" (specification bug)
- Opus's type: "your rules, when combined, produce an impossible requirement" (logic bug)

### Does Opus match GPT-5?

**No, but not because it's worse.** They find different things. GPT-5's 6,208 reasoning
tokens went toward exhaustively checking statement pairs for direct conflicts. Opus's
internal reasoning went toward understanding the LOGICAL INTERACTION between rules.

GPT-5 missed the debounce reset paradox (likely because it requires multi-step logical
reasoning about rule interactions rather than statement comparison). Opus missed the
broker-unavailable timing conflict (likely because it's a more surface-level inconsistency
between prose and table that doesn't involve logical deduction).

### Sonnet's continued weakness:

Consistent with Finding #39: Sonnet found 3 contradictions but only 1 was genuine (the
broker-unavailable one, same as GPT-5). The failure-modes misread shows Sonnet doesn't
reliably verify whether two statements ACTUALLY conflict. It pattern-matches on surface
similarity ("Automatic" and "manual only" appear to conflict) without reasoning about
whether they refer to the same thing. The debounce/calibration "contradiction" confuses
advisory guidance with specification (a type confusion that reasoning models avoid).

## Key Insight: Two distinct contradiction-finding modes

| Mode | Best model | What it catches | Cognitive demand |
|---|---|---|---|
| Specification conflicts | GPT-5 | Same scenario, different prescriptions | Statement comparison + verification |
| Logical impossibilities | Opus | Rules that can't coexist under all conditions | Multi-step logical deduction |

This explains why the open question ("does Opus match GPT-5?") has no clean yes/no answer.
They're not attempting the same thing. GPT-5 exhaustively compares statement pairs. Opus
reasons about what the stated rules IMPLY when combined. Both modes catch real bugs that
the other misses.

## Practical Implication

For self-contradiction detection in architecture documents:
- Run BOTH GPT-5 and Opus; they catch fundamentally different types of contradictions
- GPT-5 catches specification bugs (conflicting statements about the same thing)
- Opus catches logic bugs (rules whose interactions produce impossible conditions)
- Sonnet remains unreliable: too many false positives from surface-pattern matching
- Adding Opus costs little: 12s and 468 output tokens, vs 52s and 6,415 output tokens for GPT-5
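
The "run both" recommendation amounts to taking the union of the two result sets. A minimal sketch of that merge step, assuming each model's output has already been parsed into findings with a reviewer-assigned key; the `Finding` type, its fields, and the key strings are hypothetical and not part of any tooling described here:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    key: str    # hypothetical short identifier assigned during review
    kind: str   # "specification" or "logic"
    model: str  # which model reported it


def merge(*runs):
    """Union findings across runs, keeping the first report of each key."""
    merged = {}
    for run in runs:
        for finding in run:
            merged.setdefault(finding.key, finding)
    return list(merged.values())


# The two genuine findings from this experiment, one per model.
gpt5 = [Finding("broker-unavailable-timing", "specification", "GPT-5")]
opus = [Finding("debounce-reset-paradox", "logic", "Claude Opus 4.6")]

combined = merge(gpt5, opus)
print([f.key for f in combined])
# ['broker-unavailable-timing', 'debounce-reset-paradox']
```

Keying on a shared identifier means a contradiction reported by more than one model is counted once, while the non-overlapping findings both survive.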

## Updated Answer to Open Question

> "Would Opus + narrow framing match GPT-5 for self-contradiction detection?"

**Wrong question.** Opus doesn't try to match GPT-5; it finds a different class of
contradiction. The right framing: Opus + GPT-5 together catch more than either alone,
and the contradictions they find don't overlap. Run both.
+10
-3
@@ -22,11 +22,18 @@ cross-doc contradictions are easy to verify once spotted (reducing GPT-5's
 verification advantage)? Or because boundary reasoning (Opus's strength)
 is the primary skill needed?

-### Opus + narrow framing for contradiction detection (from Finding #39)
-Would Opus + narrow framing match GPT-5 for self-contradiction detection?
+### ~~Opus + narrow framing for contradiction detection (from Finding #39)~~ → ANSWERED (Finding #43)
+~~Would Opus + narrow framing match GPT-5 for self-contradiction detection?
 Finding #39 showed Sonnet can't do it even with narrow framing (reasoning
 depth issue). Opus has strong cross-boundary reasoning — does its internal
-reasoning depth suffice for the verification step that Sonnet lacks?
+reasoning depth suffice for the verification step that Sonnet lacks?~~
+
+**WRONG QUESTION.** Opus doesn't try to match GPT-5 — it finds a different CLASS
+of contradiction. GPT-5 finds specification conflicts (same scenario, conflicting
+prescriptions via statement comparison). Opus finds logical impossibilities (rules
+whose interaction produces impossible conditions via deductive reasoning). Neither
+dominates — they don't overlap. Run both for complete coverage. Sonnet remains
+unreliable (~33% precision on contradiction detection).

 ### ~~Sonnet + narrow framing = GPT-5 level? (from Finding #5)~~ → ANSWERED (Finding #39)
 ~~Would Sonnet catch semantic issues if given a narrower "check for logical