# Finding 25: Contradiction detection (NEW task type): Opus excels at finding genuine contradictions with precision; GPT-5 is thorough but spends heavily on reasoning; Sonnet finds surface-level issues quickly

**Date:** 2026-05-05

**Task:** Identify internal contradictions, logical inconsistencies, and conflicting rules in gargoyle's `order-state-machine.md` (311 lines), a document defining states, transitions, invariants, fill precedence rules, and time-in-force behavior.

**How we used them:** Same document (full text) + same focused analytical question to all 3 models via HAI proxy. Prompt specifically asked for: state machine contradictions, semantic conflicts, rule violations, implicit contradictions, and terminology inconsistencies. Required each finding to quote the conflicting statements, explain the logical argument, assign severity, and recommend which statement should "win." No tools, no project context beyond the document itself.

| Model | Time | Output tokens | Reasoning tokens | Contradictions found |
|---|---|---|---|---|
| GPT-5 | 162s | 12,074 | 11,008 | 4 |
| Claude Opus 4.6 | 41s | 2,056 | (internal) | 6 |
| Claude Sonnet 4.6 | 17s | 826 | (internal) | 4 |

**What they found: common ground (2+ models identified):**

- **Missing `pending_cancel → partially_filled` revert transition** (GPT-5 #1 + Opus #2 + Sonnet partial): The "Rejection reverts" invariant states orders return to their "pre-modification state (`working` or `partially_filled`)", but the state diagram only shows `pending_cancel → working` for cancel rejection, with no path back to `partially_filled`. All models correctly identified this as the diagram being incomplete relative to the stated invariant. GPT-5 and Opus rated CRITICAL.
- **Same issue for `pending_replace` revert** (GPT-5 #1 + Opus #3): The state diagram only shows `pending_replace → working` for replace rejection, but a replace requested from `partially_filled` should revert to `partially_filled`.
Same root cause as above, just the replace variant.
- **FOK "never partially fills" vs state machine allowing it** (GPT-5 #2 + Opus #4): The TIF table says FOK "never partially fills" but the state machine has no guards preventing FOK orders from reaching `partially_filled`. Both correctly noted this is a broker-enforced guarantee but the document presents it as system-level.
- **`rejection_reason` described as "broker-provided" but local rejections exist** (GPT-5 #4 + Opus #5 + Sonnet): `pending → rejected` is "local validation failure" with no broker interaction, but the field says "Broker-provided reason when rejected." All three caught this terminology inconsistency.

**GPT-5 unique findings (not in either other model):**

- **IOC valid terminal states exclude `expired` vs generic expiry transitions** (#3): IOC should never reach `expired` (unfilled portion is cancelled immediately), but the state diagram allows any order to transition to `expired` without TIF guards. Well-reasoned extension of the FOK finding to IOC semantics. GPT-5 correctly identified that broker "expired-like" outcomes should map to `cancelled` for IOC.

**Claude Opus unique findings (not in either other model):**

- **Terminal states that aren't terminal: the `partially_filled` re-entry problem** (#1): Opus identified the DEEPER contradiction beyond the surface-level "cancelled states have outgoing transitions." When `cancelled → partially_filled` fires via late fill, the order is now non-terminal with NO defined mechanism to re-terminate if no further fills arrive. The order is stuck in `partially_filled` indefinitely. This goes beyond "the diagram contradicts the definition of terminal" to "the fill precedence rule creates an unspecified operational scenario." This is the most architecturally significant finding across all three models.
- **Fill precedence label misapplication to non-terminal states** (#6): The state diagram labels transitions from `pending_cancel → partially_filled` and `pending_replace → partially_filled` as "fill precedence," but the Fill Precedence Rule explicitly defines itself as overriding TERMINAL states. `pending_cancel` is non-terminal. The label conflates two different mechanisms (fill during pending modification vs. fill overriding terminal state), which could cause implementers to use the same code path for fundamentally different scenarios.

**Claude Sonnet unique findings (not in either other model):**

- **State diagram terminal arrow contradiction** (#1): Sonnet was the only model to explicitly note that the Mermaid diagram shows `cancelled → [*]` (terminal arrow) while simultaneously showing `cancelled → partially_filled` (outgoing transition). A valid observation but more surface-level than Opus's deeper analysis of the same phenomenon.
- **Pending replace fill logic error** (#3): Sonnet argued that receiving a fill during `pending_replace` creates a logical impossibility because the order parameters are in flux. This is WRONG: fills always apply to current parameters (the replace hasn't been confirmed yet), and the document actually handles this correctly. This is a FALSE POSITIVE from Sonnet.

**Quality assessment:**

- **Claude Opus** was the clear winner for this task. Found the most contradictions (6), had the highest precision (0 false positives, tied with GPT-5), and, crucially, found qualitatively deeper issues. The `partially_filled` re-entry problem (#1) isn't just "the diagram has a missing transition" but "the fill precedence rule creates an unresolvable operational state." The fill precedence label misapplication (#6) identifies a conceptual confusion that would genuinely cause implementation bugs. Opus completed in only 41s with 2,056 output tokens, far more efficient than GPT-5 (Sonnet was faster and cheaper still, but less precise).
- **GPT-5** found 4 genuine contradictions with 0 false positives but spent an extraordinary amount of reasoning tokens (11,008) for modest output (1,066 visible content tokens, 10.3:1 reasoning ratio). The IOC finding was unique and valuable. But the cost is disproportionate: 162s and 12K tokens for 4 findings vs Opus's 41s and 2K tokens for 6 findings. GPT-5's reasoning budget seems to have been mostly spent on VERIFICATION (confirming each finding is genuine), consistent with Finding #20's observation.
- **Claude Sonnet** was fastest (17s) and found 4 items, but one was a false positive (the `pending_replace` logic error claim is incorrect). That gives it a precision of 75% (3/4 genuine), the lowest of the three. Its genuine findings were all also found by the other models (no unique true contributions). Sonnet appears to trade accuracy for speed on contradiction detection.

**Key insight: contradiction detection favors precision-oriented models:** This task is fundamentally about LOGICAL ARGUMENTATION: proving that two statements cannot both be true. Unlike assumption-finding (which is about imagining what could go wrong) or gap-finding (which is about identifying missing content), contradiction detection requires the model to:

1. Hold two statements in working memory simultaneously
2. Construct a formal argument for why they conflict
3. NOT get confused by statements that SEEM contradictory but are actually consistent

Requirement #3 is where models diverge. Sonnet produced a false positive because it didn't fully reason through whether the `pending_replace` fill scenario is actually inconsistent (it isn't: current parameters apply). Opus avoided this trap entirely and additionally found DEEPER contradictions that require multi-step logical reasoning (the re-entry problem, the label misapplication). GPT-5 also avoided false positives but at massive computational cost.
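
The precision and reasoning-ratio figures quoted in the quality assessment follow directly from the results table. A minimal Python sketch of that arithmetic, assuming (as the 1,066 visible-token figure implies) that GPT-5's output-token count includes its reasoning tokens; the `Run` class and its field names are illustrative and not part of any tooling used for this finding:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Run:
    """One model run, with values copied from the results table."""
    model: str
    seconds: int
    output_tokens: int
    reasoning_tokens: Optional[int]  # None where reasoning is reported as "(internal)"
    findings: int
    false_positives: int

    @property
    def precision(self) -> float:
        # Share of reported findings that were judged genuine.
        return (self.findings - self.false_positives) / self.findings

    @property
    def reasoning_ratio(self) -> Optional[float]:
        # Reasoning tokens per visible token; assumes output_tokens
        # includes the reasoning tokens (12,074 - 11,008 = 1,066 visible).
        if self.reasoning_tokens is None:
            return None
        visible = self.output_tokens - self.reasoning_tokens
        return self.reasoning_tokens / visible


RUNS = [
    Run("GPT-5", 162, 12_074, 11_008, findings=4, false_positives=0),
    Run("Claude Opus 4.6", 41, 2_056, None, findings=6, false_positives=0),
    Run("Claude Sonnet 4.6", 17, 826, None, findings=4, false_positives=1),
]

for run in RUNS:
    ratio = run.reasoning_ratio
    ratio_text = "n/a" if ratio is None else f"{ratio:.1f}:1"
    print(f"{run.model}: precision {run.precision:.0%}, reasoning ratio {ratio_text}")
```

With a single run per model and only four to six findings each, these ratios are descriptive rather than statistically meaningful.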

**Opus's efficiency advantage:** This is the first task where Opus is not just qualitatively better than GPT-5 but also quantitatively more efficient: 6 findings in 41s and about 2K tokens vs GPT-5's 4 findings in 162s and about 12K tokens. That is roughly 9x more findings per reported output token (about 2.9 vs 0.33 findings per 1K tokens) and about 4x faster. For contradiction detection specifically, Opus appears to have a structural advantage, possibly because its internal reasoning is better calibrated for logical argumentation than GPT-5's externalized reasoning chain.

**Comparison to Finding #20 (invariant violation paths):** In Finding #20, GPT-5 was maximally selective (3 findings, all genuine, 15:1 reasoning ratio). Here, GPT-5 shows the same pattern: few findings, all genuine, high reasoning ratio (10.3:1). The difference: in #20, GPT-5's selectivity meant it found UNIQUE violations others missed. Here, only one of GPT-5's four findings (the IOC expiry issue, #3) was unique; the other three were also found by Opus, which additionally contributed two findings of its own. GPT-5's high verification bar helps far less when Opus is ALSO precise AND more thorough.

**Updated task-model assignment:** For contradiction/consistency checking:

1. **Opus**: best choice. Zero false positives, deepest contradictions, far cheaper than GPT-5
2. **GPT-5**: solid backup. Zero false positives, unique TIF-related insights, but expensive and slower
3. **Sonnet**: NOT recommended for this task. Produces false positives, no unique true contributions

This supports the emerging pattern: each model has task types where it excels. Opus excels at logical argumentation and design tensions. GPT-5 excels at exhaustive enumeration and operational concerns. Sonnet excels at speed and structural/assumption analysis but struggles with tasks requiring formal logical reasoning (contradiction detection, concurrency analysis per Finding #13).

**Practical implication:** When reviewing architecture documents for internal consistency (e.g., before implementation begins), run Opus. If budget allows, add GPT-5 for TIF/edge-case coverage.
Skip Sonnet for consistency checking: its speed advantage is negated by the false positive risk.
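
Two of the structural contradictions reported in this finding (a terminal state with an outgoing transition, and rejection-revert paths missing from the diagram) are mechanical enough that, in principle, a small script could flag them. The sketch below is illustrative only: it encodes just the transitions quoted in this finding, not the full `order-state-machine.md` diagram, and the `Transition` type, the "cancel rejected" / "replace rejected" labels, and the terminal-state set are assumptions made for the example.

```python
from typing import NamedTuple


class Transition(NamedTuple):
    src: str
    dst: str
    label: str


# Only the transitions quoted in this finding; labels other than
# "fill precedence" and "local validation failure" are hypothetical.
TRANSITIONS = [
    Transition("pending", "rejected", "local validation failure"),
    Transition("pending_cancel", "working", "cancel rejected"),
    Transition("pending_replace", "working", "replace rejected"),
    Transition("pending_cancel", "partially_filled", "fill precedence"),
    Transition("pending_replace", "partially_filled", "fill precedence"),
    Transition("cancelled", "partially_filled", "fill precedence"),
]

# Assumed terminal set, limited to states this finding names as terminal.
TERMINAL_STATES = {"cancelled", "expired", "rejected"}

# "Rejection reverts" invariant: revert to the pre-modification state.
REVERT_TARGETS = {"working", "partially_filled"}
REVERT_LABELS = {
    "pending_cancel": "cancel rejected",
    "pending_replace": "replace rejected",
}


def terminal_states_with_exits(transitions, terminal):
    """Terminal states that nevertheless have an outgoing transition."""
    return sorted({t.src for t in transitions if t.src in terminal})


def missing_revert_paths(transitions, revert_labels, targets):
    """(state, target) pairs the invariant requires but the diagram lacks."""
    missing = []
    for state, label in sorted(revert_labels.items()):
        reached = {t.dst for t in transitions
                   if t.src == state and t.label == label}
        missing.extend((state, dst) for dst in sorted(targets - reached))
    return missing


print(terminal_states_with_exits(TRANSITIONS, TERMINAL_STATES))
print(missing_revert_paths(TRANSITIONS, REVERT_LABELS, REVERT_TARGETS))
```

A check like this only reaches structural issues. The semantic findings (the `partially_filled` re-entry problem, the fill precedence label misapplication, the FOK and IOC guard gaps) still depend on the kind of argumentation compared in this finding.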