From 0f43934cb89e56770724d92ab94aceda31f6ee3c Mon Sep 17 00:00:00 2001
From: Rodin
Date: Sun, 10 May 2026 18:32:45 -0700
Subject: [PATCH] Add finding #67: Inter-document contradiction analysis

Sonnet 4 outperforms GPT-5 on inter-document contradiction analysis:
- More findings (5 vs 4)
- Faster (14s vs 136s)
- Better severity calibration (3 Critical vs 0 Critical)

Key insight: GPT-5's extended reasoning (9.7K tokens) doesn't pay off for this task type. Inter-document comparison requires parallel pattern matching, not serial verification.
---
 ...0-inter-document-contradiction-analysis.md | 55 +++++++++++++++++++
 1 file changed, 55 insertions(+)
 create mode 100644 findings/2026-05-10-inter-document-contradiction-analysis.md

diff --git a/findings/2026-05-10-inter-document-contradiction-analysis.md b/findings/2026-05-10-inter-document-contradiction-analysis.md
new file mode 100644
index 0000000..ee479ac
--- /dev/null
+++ b/findings/2026-05-10-inter-document-contradiction-analysis.md
@@ -0,0 +1,55 @@
+# Finding #67: Inter-document Contradiction Analysis
+
+**Date:** 2026-05-10
+**Documents:** `escalation-policy.md` (228 lines) + `kill-switch.md` (293 lines)
+**Task Type:** Inter-document contradiction detection
+
+## Summary
+
+Sonnet 4 outperforms GPT-5 on inter-document contradiction analysis: more findings (5 vs 4), faster (14s vs 136s), and better severity calibration (3 Critical vs 0 Critical).
+
+## Results
+
+| Model | Time | Output tokens | Reasoning tokens | Findings | Critical | High | Medium |
+|---|---|---|---|---|---|---|---|
+| Claude Sonnet 4 | 14s | 864 | (internal) | 5 | 3 | 2 | 0 |
+| GPT-5 | 136s | 711 | 9,728 | 4 | 0 | 3 | 1 |
+
+## Key Findings
+
+### Common ground (both models found):
+
+1. **Autonomous vs manual liquidation** - Doc A says system submits autonomous liquidation orders; Doc B says operator manually triggers liquidation
+2. **Restrict behavior mismatch** - Doc A: no new positions allowed; Doc B: reject-all (no submissions OR cancellations)
+3. **Automatic vs manual escalation** - Doc A: debounce-driven auto-escalation; Doc B: "transition is never automatic"
+
+### Sonnet unique (Critical-severity):
+
+4. **Acceptance policy contradicts autonomous liquidation** - Doc B's close-only policy rejects "all automated decision engine orders" — but Doc A's autonomous liquidation orders ARE automated orders. Liquidation mechanism cannot work as specified.
+
+5. **Kill switch semantic confusion** - Doc A treats kill switch as escalation BEYOND liquidate; Doc B treats liquidate as a MODE OF kill switch. Different hierarchies.
+
+### GPT-5 unique:
+
+- Meta-observation about vocabulary claims vs actual behavior divergence (valid but less actionable than Sonnet's Critical findings)
+
+## Analysis
+
+GPT-5 used 9,728 reasoning tokens (~10x Sonnet's output) but produced fewer, lower-severity findings. Possible explanations:
+
+1. **Working memory pressure**: Comparing two documents requires holding claims from both simultaneously. Extended reasoning may cause fixation on specific threads rather than broad scanning.
+
+2. **Verification burden mismatch**: Single-document analysis benefits from thorough verification. Inter-document analysis requires parallel comparison (pattern matching) — potentially Sonnet's strength over GPT-5's serial reasoning.
+
+3. **Severity under-calibration**: GPT-5 rated 0 Critical; Sonnet rated 3. The acceptance-policy/autonomous-liquidation contradiction would prevent liquidation from functioning — Sonnet's Critical rating is accurate.
+
+## Practical Implications
+
+- Use Sonnet as primary reviewer for inter-document contradiction analysis
+- GPT-5's reasoning overhead doesn't pay off for this task type
+- Task involves parallel comparison (Sonnet strength) not serial verification (GPT-5 strength)
+
+## Open Questions
+
+- Would Opus outperform both? Given Opus's strength at emergent design tensions, it might excel at finding contradictions that arise from the COMBINATION of both documents' design decisions.
+- Does this pattern hold for other document pairs, or was it specific to these documents?