Add finding #67: Inter-document contradiction analysis
Sonnet 4 outperforms GPT-5 on inter-document contradiction analysis:
- More findings (5 vs 4)
- Faster (14s vs 136s)
- Better severity calibration (3 Critical vs 0 Critical)

Key insight: GPT-5's extended reasoning (9.7K tokens) doesn't pay off for this task type. Inter-document comparison requires parallel pattern matching, not serial verification.
@@ -0,0 +1,55 @@
# Finding #67: Inter-document Contradiction Analysis

**Date:** 2026-05-10
**Documents:** `escalation-policy.md` (228 lines) + `kill-switch.md` (293 lines)
**Task Type:** Inter-document contradiction detection

## Summary

On this document pair, Sonnet 4 outperformed GPT-5 at inter-document contradiction analysis: it reported more findings (5 vs 4), finished faster (14s vs 136s), and calibrated severity better (3 findings rated Critical vs 0).

## Results

| Model | Time | Output tokens | Reasoning tokens | Findings | Critical | High | Medium |
|---|---|---|---|---|---|---|---|
| Claude Sonnet 4 | 14s | 864 | (internal) | 5 | 3 | 2 | 0 |
| GPT-5 | 136s | 711 | 9,728 | 4 | 0 | 3 | 1 |
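
The headline ratios follow directly from the table. The sketch below recomputes them; the dictionary layout and the helper name `total_findings` are illustrative assumptions, and only the numbers come from the Results table.

```python
# Figures copied from the Results table; the data layout is an assumption.
results = {
    "Claude Sonnet 4": {
        "seconds": 14, "output_tokens": 864, "reasoning_tokens": None,
        "critical": 3, "high": 2, "medium": 0,
    },
    "GPT-5": {
        "seconds": 136, "output_tokens": 711, "reasoning_tokens": 9728,
        "critical": 0, "high": 3, "medium": 1,
    },
}


def total_findings(row):
    """Sum the per-severity counts for one model (hypothetical helper)."""
    return row["critical"] + row["high"] + row["medium"]


sonnet = results["Claude Sonnet 4"]
gpt5 = results["GPT-5"]

# Wall-clock ratio: how many times longer GPT-5 took.
time_ratio = gpt5["seconds"] / sonnet["seconds"]

# GPT-5 reasoning tokens relative to Sonnet's visible output tokens.
reasoning_vs_output = gpt5["reasoning_tokens"] / sonnet["output_tokens"]

# GPT-5 reasoning tokens spent per reported finding.
reasoning_per_finding = gpt5["reasoning_tokens"] / total_findings(gpt5)

print(f"time ratio: {time_ratio:.1f}x")
print(f"reasoning vs Sonnet output: {reasoning_vs_output:.1f}x")
print(f"reasoning tokens per GPT-5 finding: {reasoning_per_finding:.0f}")
```

Sonnet's reasoning tokens are listed as "(internal)" in the table, so no reasoning-to-reasoning comparison is attempted here.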

## Key Findings

Doc A and Doc B below refer to `escalation-policy.md` and `kill-switch.md` respectively.

### Common ground (both models found):

1. **Autonomous vs manual liquidation** - Doc A says the system submits autonomous liquidation orders; Doc B says the operator manually triggers liquidation
2. **Restrict behavior mismatch** - Doc A: no new positions allowed; Doc B: reject-all (no submissions OR cancellations)
3. **Automatic vs manual escalation** - Doc A: debounce-driven auto-escalation; Doc B: "transition is never automatic"

### Sonnet unique (Critical-severity):

4. **Acceptance policy contradicts autonomous liquidation** - Doc B's close-only policy rejects "all automated decision engine orders", but Doc A's autonomous liquidation orders ARE automated orders. The liquidation mechanism cannot work as specified.

5. **Kill switch semantic confusion** - Doc A treats the kill switch as an escalation BEYOND liquidate; Doc B treats liquidate as a MODE OF the kill switch. The two documents describe different hierarchies.
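
Finding 4 can be made concrete with a toy model. The sketch below is an assumption-laden illustration, not the documents' actual logic: the `Order` record, its `source` and `intent` fields, and the `close_only_accepts` predicate are all hypothetical names, and the predicate encodes only the one clause quoted above (reject every automated order) plus an assumed rule that manual closing orders are accepted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Order:
    source: str  # hypothetical field: "automated" or "operator"
    intent: str  # hypothetical field: "open" or "close"


def close_only_accepts(order: Order) -> bool:
    """Assumed encoding of Doc B's close-only policy as quoted in finding 4."""
    if order.source == "automated":
        # "all automated decision engine orders" are rejected
        return False
    # Assumption: in close-only mode, manual orders may only reduce exposure.
    return order.intent == "close"


# Doc A: the system itself submits liquidation (closing) orders.
autonomous_liquidation = Order(source="automated", intent="close")

# Doc B: an operator triggers liquidation by hand.
manual_liquidation = Order(source="operator", intent="close")

print(close_only_accepts(autonomous_liquidation))
print(close_only_accepts(manual_liquidation))
```

Under these assumptions the autonomous liquidation order is rejected by the very policy that is active while liquidation is supposed to happen, which is the contradiction Sonnet rated Critical.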

### GPT-5 unique:

- Meta-observation about the divergence between vocabulary claims and actual behavior (valid, but less actionable than Sonnet's Critical findings)
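
The overlap between the two result sets can be tallied with plain set operations. In this sketch the short slugs are hypothetical labels invented for the findings listed above; they do not appear in either model's output.

```python
# Hypothetical slugs standing in for the findings listed in this section.
common_ground = {
    "autonomous-vs-manual-liquidation",
    "restrict-behavior-mismatch",
    "automatic-vs-manual-escalation",
}

sonnet_findings = common_ground | {
    "acceptance-policy-vs-autonomous-liquidation",
    "kill-switch-hierarchy",
}
gpt5_findings = common_ground | {
    "vocabulary-vs-behavior-divergence",
}

shared = sonnet_findings & gpt5_findings       # found by both models
only_sonnet = sonnet_findings - gpt5_findings  # Sonnet unique
only_gpt5 = gpt5_findings - sonnet_findings    # GPT-5 unique

print("shared:", sorted(shared))
print("Sonnet only:", sorted(only_sonnet))
print("GPT-5 only:", sorted(only_gpt5))
```

The counts match the Results table: 3 shared findings, plus 2 unique to Sonnet (5 total) and 1 unique to GPT-5 (4 total).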

## Analysis

GPT-5 used 9,728 reasoning tokens (roughly 11x Sonnet's 864 output tokens) but produced fewer, lower-severity findings. Possible explanations:

1. **Working memory pressure**: Comparing two documents requires holding claims from both simultaneously. Extended reasoning may cause fixation on specific threads rather than broad scanning.

2. **Verification burden mismatch**: Single-document analysis benefits from thorough verification. Inter-document analysis requires parallel comparison (pattern matching), which may be Sonnet's strength relative to GPT-5's serial reasoning.

3. **Severity under-calibration**: GPT-5 rated 0 findings Critical; Sonnet rated 3. The acceptance-policy/autonomous-liquidation contradiction would prevent liquidation from functioning, so Sonnet's Critical rating for it is accurate.

## Practical Implications

- Use Sonnet as the primary reviewer for inter-document contradiction analysis
- GPT-5's reasoning overhead doesn't pay off for this task type
- The task involves parallel comparison (a Sonnet strength), not serial verification (a GPT-5 strength)

## Open Questions

- Would Opus outperform both? Given Opus's strength at emergent design tensions, it might excel at finding contradictions that arise from the COMBINATION of both documents' design decisions.
- Does this pattern hold for other document pairs, or was it specific to these documents?