model-research/findings/2026-05-10-inter-document-contradiction-analysis.md
Rodin 0f43934cb8 Add finding #67: Inter-document contradiction analysis
Sonnet 4 outperforms GPT-5 on inter-document contradiction analysis:
- More findings (5 vs 4)
- Faster (14s vs 136s)
- Better severity calibration (3 Critical vs 0 Critical)

Key insight: GPT-5's extended reasoning (9.7K tokens) doesn't pay off
for this task type. Inter-document comparison requires parallel pattern
matching, not serial verification.
2026-05-10 18:32:45 -07:00

Finding #67: Inter-document Contradiction Analysis

Date: 2026-05-10
Documents: escalation-policy.md (Doc A, 228 lines) + kill-switch.md (Doc B, 293 lines)
Task Type: Inter-document contradiction detection

Summary

On this document pair, Claude Sonnet 4 outperformed GPT-5 at inter-document contradiction analysis: it reported more findings (5 vs 4), finished roughly 10x faster (14s vs 136s), and calibrated severity better (3 Critical vs 0 Critical).

Results

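| Model | Time | Output tokens | Reasoning tokens | Findings | Critical | High | Medium |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Claude Sonnet 4 | 14s | 864 | (internal) | 5 | 3 | 2 | 0 |
| GPT-5 | 136s | 711 | 9,728 | 4 | 0 | 3 | 1 |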
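
The ratios implied by these numbers can be checked with a few lines of arithmetic. This is only an illustrative sketch: `RunResult` and its field names are hypothetical, and representing Sonnet's "(internal)" reasoning tokens as `None` is an assumption about what that cell means.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RunResult:
    """One row of the results table (hypothetical structure, not from the source)."""
    model: str
    seconds: int
    output_tokens: int
    reasoning_tokens: Optional[int]  # None = not reported (assumed meaning of "(internal)")
    critical: int
    high: int
    medium: int

    @property
    def findings(self) -> int:
        # Total findings is the sum of the severity buckets.
        return self.critical + self.high + self.medium


sonnet = RunResult("Claude Sonnet 4", 14, 864, None, 3, 2, 0)
gpt5 = RunResult("GPT-5", 136, 711, 9728, 0, 3, 1)

# Wall-clock ratio: how many times longer GPT-5 took.
time_ratio = gpt5.seconds / sonnet.seconds

# GPT-5 reasoning tokens relative to Sonnet's visible output tokens.
reasoning_vs_output = gpt5.reasoning_tokens / sonnet.output_tokens

# Seconds spent per reported finding, per model.
seconds_per_finding = {r.model: r.seconds / r.findings for r in (sonnet, gpt5)}

print(f"time ratio: {time_ratio:.1f}x")
print(f"reasoning/output ratio: {reasoning_vs_output:.1f}x")
print(seconds_per_finding)
```

The severity buckets sum to the reported finding counts for both models, which is a quick consistency check on the table.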

Key Findings

Common ground (both models found):

  1. Autonomous vs manual liquidation - Doc A says system submits autonomous liquidation orders; Doc B says operator manually triggers liquidation
  2. Restrict behavior mismatch - Doc A: no new positions allowed; Doc B: reject-all (no submissions OR cancellations)
  3. Automatic vs manual escalation - Doc A: debounce-driven auto-escalation; Doc B: "transition is never automatic"
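
The restrict mismatch (item 2) can be made concrete by encoding each document's rule as a predicate and listing the actions on which they disagree. This is a minimal sketch: the `Action` enum and both function names are hypothetical, and it assumes Doc A permits everything except opening new positions, which the summary above implies but does not state.

```python
from enum import Enum


class Action(Enum):
    # Hypothetical action vocabulary, chosen for illustration only.
    OPEN_POSITION = "open"
    REDUCE_POSITION = "reduce"
    CANCEL_ORDER = "cancel"


def restrict_allows_doc_a(action: Action) -> bool:
    # Doc A as summarized: no new positions allowed.
    # Assumption: every other action remains permitted.
    return action is not Action.OPEN_POSITION


def restrict_allows_doc_b(action: Action) -> bool:
    # Doc B as summarized: reject-all (no submissions or cancellations).
    return False


# Actions where the two documents prescribe different outcomes.
disagreements = [
    a for a in Action
    if restrict_allows_doc_a(a) != restrict_allows_doc_b(a)
]

print([a.name for a in disagreements])
```

Under these assumptions the two documents agree only on rejecting new positions; reducing a position or cancelling an order is allowed by one reading and rejected by the other.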

Sonnet unique (Critical-severity):

  1. Acceptance policy contradicts autonomous liquidation - Doc B's close-only policy rejects "all automated decision engine orders" — but Doc A's autonomous liquidation orders ARE automated orders. Liquidation mechanism cannot work as specified.

  2. Kill switch semantic confusion - Doc A treats kill switch as escalation BEYOND liquidate; Doc B treats liquidate as a MODE OF kill switch. Different hierarchies.
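
The first Sonnet-only finding can be sketched as a policy check that rejects the very orders the liquidation mechanism depends on. The names below (`Source`, `Order`, `close_only_accepts`) are hypothetical, and the sketch assumes that close-only mode otherwise accepts position-reducing orders from an operator; only the "rejects all automated decision engine orders" rule comes from the finding itself.

```python
from dataclasses import dataclass
from enum import Enum


class Source(Enum):
    OPERATOR = "operator"
    DECISION_ENGINE = "automated decision engine"


@dataclass(frozen=True)
class Order:
    source: Source
    reduces_position: bool


def close_only_accepts(order: Order) -> bool:
    """Doc B's close-only policy, as summarized in the finding."""
    # Rule from the finding: all automated decision engine orders are rejected.
    if order.source is Source.DECISION_ENGINE:
        return False
    # Assumption: other orders are accepted only if they reduce a position.
    return order.reduces_position


# Doc A: the system itself submits liquidation orders, so they are automated.
autonomous_liquidation = Order(Source.DECISION_ENGINE, reduces_position=True)
# Doc B: an operator manually triggers liquidation.
manual_liquidation = Order(Source.OPERATOR, reduces_position=True)

print(close_only_accepts(autonomous_liquidation))
print(close_only_accepts(manual_liquidation))
```

In this reading, Doc A's autonomous liquidation order is rejected even though it only reduces exposure, which is why the finding concludes the mechanism cannot work as specified.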

GPT-5 unique:

  • Meta-observation about vocabulary claims vs actual behavior divergence (valid but less actionable than Sonnet's Critical findings)
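
The common/unique split above is a set partition, and writing it that way makes it easy to repeat for other model pairs. The short finding keys below are hypothetical labels for the findings listed in this section, not identifiers from either model's output.

```python
# Hypothetical short keys for the findings described above.
sonnet_findings = {
    "autonomous-vs-manual-liquidation",
    "restrict-behavior-mismatch",
    "automatic-vs-manual-escalation",
    "acceptance-policy-blocks-liquidation",
    "kill-switch-hierarchy",
}
gpt5_findings = {
    "autonomous-vs-manual-liquidation",
    "restrict-behavior-mismatch",
    "automatic-vs-manual-escalation",
    "vocabulary-vs-behavior-divergence",
}

common = sonnet_findings & gpt5_findings       # found by both models
sonnet_only = sonnet_findings - gpt5_findings  # found only by Sonnet
gpt5_only = gpt5_findings - sonnet_findings    # found only by GPT-5

print(len(common), len(sonnet_only), len(gpt5_only))
```

The counts match the results table: 3 shared findings plus 2 Sonnet-only gives 5, and 3 shared plus 1 GPT-5-only gives 4.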

Analysis

GPT-5 used 9,728 reasoning tokens (about 11x Sonnet's 864 output tokens) but produced fewer, lower-severity findings. Possible explanations:

  1. Working memory pressure: Comparing two documents requires holding claims from both simultaneously. Extended reasoning may cause fixation on specific threads rather than broad scanning.

  2. Verification burden mismatch: Single-document analysis benefits from thorough verification. Inter-document analysis requires parallel comparison (pattern matching), which may favor Sonnet over GPT-5's serial reasoning.

  3. Severity under-calibration: GPT-5 rated 0 Critical; Sonnet rated 3. The acceptance-policy/autonomous-liquidation contradiction would prevent liquidation from functioning — Sonnet's Critical rating is accurate.

Practical Implications

  • Use Sonnet as primary reviewer for inter-document contradiction analysis
  • GPT-5's reasoning overhead (136s, 9,728 reasoning tokens) did not pay off for this task type
  • The task appears to reward parallel comparison (a Sonnet strength) rather than serial verification (a GPT-5 strength)

Open Questions

  • Would Opus outperform both? Given Opus's strength at emergent design tensions, it might excel at finding contradictions that arise from the COMBINATION of both documents' design decisions.
  • Does this pattern hold for other document pairs, or was it specific to these documents?