# Finding 25: Contradiction detection: NEW task type — Opus excels at finding genuine contradictions with precision; GPT-5 is thorough but spends heavily on reasoning; Sonnet finds surface-level issues quickly

**Date:** 2026-05-05
**Task:** Identify internal contradictions, logical inconsistencies, and conflicting rules
in gargoyle's `order-state-machine.md` (311 lines) — a document defining states,
transitions, invariants, fill precedence rules, and time-in-force behavior.
**How we used them:** Same document (full text) + same focused analytical question to all
3 models via HAI proxy. Prompt specifically asked for: state machine contradictions,
semantic conflicts, rule violations, implicit contradictions, and terminology
inconsistencies. Required each finding to quote the conflicting statements, explain
the logical argument, assign severity, and recommend which statement should "win."
No tools, no project context beyond the document itself.

| Model | Time | Output tokens | Reasoning tokens | Contradictions reported | False positives |
|---|---|---|---|---|---|
| GPT-5 | 162s | 12,074 | 11,008 | 4 | 0 |
| Claude Opus 4.6 | 41s | 2,056 | (internal) | 6 | 0 |
| Claude Sonnet 4.6 | 17s | 826 | (internal) | 4 | 1 |

**What they found — common ground (2+ models identified):**

- **Missing `pending_cancel → partially_filled` revert transition** (GPT-5 #1 +
  Opus #2 + Sonnet partial): The "Rejection reverts" invariant states orders return
  to their "pre-modification state (`working` or `partially_filled`)", but the state
  diagram only shows `pending_cancel → working` for cancel rejection — no path back
  to `partially_filled`. All models correctly identified this as the diagram being
  incomplete relative to the stated invariant. GPT-5 and Opus rated CRITICAL.
- **Same issue for `pending_replace` revert** (GPT-5 #1 + Opus #3): The state diagram
  only shows `pending_replace → working` for replace rejection, but a replace
  requested from `partially_filled` should revert to `partially_filled`. Same root
  cause as above, just the replace variant.
- **FOK "never partially fills" vs state machine allowing it** (GPT-5 #2 + Opus #4):
  The TIF table says FOK "never partially fills" but the state machine has no guards
  preventing FOK orders from reaching `partially_filled`. Both correctly noted this
  is a broker-enforced guarantee but the document presents it as system-level.
- **`rejection_reason` described as "broker-provided" but local rejections exist**
  (GPT-5 #4 + Opus #5 + Sonnet): `pending → rejected` is "local validation failure"
  with no broker interaction, but the field says "Broker-provided reason when
  rejected." All three caught this terminology inconsistency.
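
The two revert findings reduce to a mechanical check: for every pending state, the diagram should contain a rejection transition back to each pre-modification state. A minimal Python sketch of that check follows. The transition list is a hand-transcribed fragment based only on the transitions quoted in this finding, and the trigger names (`cancel_rejected`, `replace_rejected`, `fill`) are illustrative assumptions, not identifiers from `order-state-machine.md`.

```python
from typing import List, NamedTuple, Set


class Transition(NamedTuple):
    source: str
    target: str
    trigger: str


# Fragment of the diagram as described in this finding (not the full document).
DIAGRAM: Set[Transition] = {
    Transition("pending_cancel", "working", "cancel_rejected"),
    Transition("pending_replace", "working", "replace_rejected"),
    Transition("pending_cancel", "partially_filled", "fill"),
    Transition("pending_replace", "partially_filled", "fill"),
}

# "Rejection reverts" invariant: a rejected modification returns the order
# to its pre-modification state, which may be either of these.
PRE_MODIFICATION_STATES = ("working", "partially_filled")
REJECTION_TRIGGER = {
    "pending_cancel": "cancel_rejected",
    "pending_replace": "replace_rejected",
}


def missing_revert_transitions(diagram: Set[Transition]) -> List[Transition]:
    """Return the revert transitions the invariant requires but the diagram lacks."""
    missing = []
    for pending_state, trigger in REJECTION_TRIGGER.items():
        for revert_to in PRE_MODIFICATION_STATES:
            needed = Transition(pending_state, revert_to, trigger)
            if needed not in diagram:
                missing.append(needed)
    return missing


missing = missing_revert_transitions(DIAGRAM)
for t in missing:
    print(f"missing: {t.source} -> {t.target} on {t.trigger}")
```

Run against this fragment, the check reports the two missing reverts to `partially_filled`, one for each pending state. The existing `fill` edges into `partially_filled` do not satisfy the invariant because they fire on a different trigger.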

**GPT-5 unique findings (not in either other model):**

- **IOC valid terminal states exclude `expired` vs generic expiry transitions** (#3):
  IOC should never reach `expired` (unfilled portion is cancelled immediately), but
  the state diagram allows any order to transition to `expired` without TIF guards.
  Well-reasoned extension of the FOK finding to IOC semantics. GPT-5 correctly
  identified that broker "expired-like" outcomes should map to `cancelled` for IOC.
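
One way to make the FOK and IOC guarantees system-level rather than broker-implied is an explicit guard keyed on time-in-force. The sketch below is hypothetical: the function names, the guard table, and the `DAY` example value are assumptions for illustration, and only the FOK and IOC rules come from the findings above.

```python
# States each time-in-force must never reach, per the findings above.
# Any TIF not listed here is assumed to have no restriction (an assumption).
FORBIDDEN_STATES_BY_TIF = {
    "FOK": frozenset({"partially_filled"}),  # FOK "never partially fills"
    "IOC": frozenset({"expired"}),           # unfilled IOC remainder is cancelled
}


def is_transition_allowed(tif: str, target_state: str) -> bool:
    """Guard: reject transitions into states the order's TIF rules out."""
    return target_state not in FORBIDDEN_STATES_BY_TIF.get(tif, frozenset())


def map_broker_expiry(tif: str) -> str:
    """Map a broker 'expired-like' outcome to a local state.

    For IOC the outcome is recorded as cancelled, following GPT-5's
    recommendation; every other TIF keeps expired (an assumption).
    """
    return "cancelled" if tif == "IOC" else "expired"


print(is_transition_allowed("FOK", "partially_filled"))
print(is_transition_allowed("DAY", "partially_filled"))
print(map_broker_expiry("IOC"))
```

Whether a violated guard should raise, log, or quarantine the order is a design decision the source document would need to make; the sketch only shows where the check would sit.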

**Claude Opus unique findings (not in either other model):**

- **Terminal states that aren't terminal — the `partially_filled` re-entry problem**
  (#1): Opus identified the DEEPER contradiction beyond the surface-level "cancelled
  states have outgoing transitions." When `cancelled → partially_filled` fires via
  late fill, the order is now non-terminal with NO defined mechanism to re-terminate
  if no further fills arrive. The order is stuck in `partially_filled` indefinitely.
  This goes beyond "the diagram contradicts the definition of terminal" to "the fill
  precedence rule creates an unspecified operational scenario." This is the most
  architecturally significant finding across all three models.
- **Fill precedence label misapplication to non-terminal states** (#6): The state
  diagram labels transitions from `pending_cancel → partially_filled` and
  `pending_replace → partially_filled` as "fill precedence," but the Fill
  Precedence Rule explicitly defines itself as overriding TERMINAL states.
  `pending_cancel` is non-terminal. The label conflates two different mechanisms
  (fill during pending modification vs. fill overriding terminal state), which
  could cause implementers to use the same code path for fundamentally different
  scenarios.
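
Both Opus findings can be illustrated with a small sketch: one helper flags states declared terminal that still have outgoing edges, and another keeps the two fill mechanisms on separate code paths. The terminal-state set, the edge list, and the returned labels are assumptions reconstructed from this finding, not taken verbatim from `order-state-machine.md`.

```python
# Assumed terminal set; the source document's full list is not quoted here.
TERMINAL_STATES = frozenset({"cancelled", "rejected", "expired"})
PENDING_MODIFICATION_STATES = frozenset({"pending_cancel", "pending_replace"})

# Edges the diagram labels "fill precedence", as described in this finding.
FILL_PRECEDENCE_EDGES = [
    ("cancelled", "partially_filled"),
    ("pending_cancel", "partially_filled"),
    ("pending_replace", "partially_filled"),
]


def terminal_states_with_exits(edges):
    """Return states declared terminal that nevertheless have an outgoing edge."""
    return sorted({source for source, _target in edges if source in TERMINAL_STATES})


def classify_fill(source_state: str) -> str:
    """Separate the two mechanisms the diagram label conflates."""
    if source_state in TERMINAL_STATES:
        # Fill Precedence Rule proper: a late fill overrides a terminal state.
        return "late_fill_overrides_terminal"
    if source_state in PENDING_MODIFICATION_STATES:
        # Ordinary fill that arrives while a cancel or replace is in flight.
        return "fill_during_pending_modification"
    return "ordinary_fill"


print(terminal_states_with_exits(FILL_PRECEDENCE_EDGES))
for source, _target in FILL_PRECEDENCE_EDGES:
    print(source, "->", classify_fill(source))
```

The sketch only detects the re-entry; it does not resolve it. How an order that re-enters `partially_filled` from `cancelled` should terminate again remains unspecified in the source document.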

**Claude Sonnet unique findings (not in either other model):**

- **State diagram terminal arrow contradiction** (#1): Sonnet was the only model to
  explicitly note that the Mermaid diagram shows `cancelled → [*]` (terminal arrow)
  while simultaneously showing `cancelled → partially_filled` (outgoing transition).
  A valid observation but more surface-level than Opus's deeper analysis of the same
  phenomenon.
- **Pending replace fill logic error** (#3): Sonnet argued that receiving a fill
  during `pending_replace` creates a logical impossibility because the order
  parameters are in flux. This is WRONG — fills always apply to current parameters
  (the replace hasn't been confirmed yet), and the document actually handles this
  correctly. This is a FALSE POSITIVE from Sonnet.

**Quality assessment:**

- **Claude Opus** was the clear winner for this task. It found the most contradictions
  (6), tied GPT-5 for the highest precision (0 false positives), and, crucially, found
  qualitatively deeper issues. The `partially_filled` re-entry problem (#1) isn't
  just "the diagram has a missing transition" but "the fill precedence rule creates
  an unresolvable operational state." The fill precedence label misapplication (#6)
  identifies a conceptual confusion that would genuinely cause implementation bugs.
  Opus completed in only 41s with 2,056 output tokens, roughly a quarter of GPT-5's
  time and a sixth of its tokens; only Sonnet was faster.
- **GPT-5** found 4 genuine contradictions with 0 false positives but spent an
  extraordinary amount of reasoning tokens (11,008) for modest output (1,066 visible
  content tokens, 10.3:1 reasoning ratio). The IOC finding was unique and valuable.
  But the cost is disproportionate: 162s and 12K tokens for 4 findings vs Opus's
  41s and 2K tokens for 6 findings. GPT-5's reasoning budget seems to have been
  mostly spent on VERIFICATION (confirming each finding is genuine), consistent
  with Finding #20's observation.
- **Claude Sonnet** was fastest (17s) and found 4 items, but one was a false positive
  (the pending_replace logic error claim is incorrect). That gives it a precision of
  75% (3/4 genuine), the lowest of the three. Its one unique genuine finding (the
  terminal arrow contradiction, #1) is a surface-level view of the phenomenon Opus
  analyzed in more depth, and its other genuine findings were also found by the other
  models, so it added no unique insight. Sonnet appears to trade accuracy for speed
  on contradiction detection.
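
The precision and reasoning-ratio figures quoted above can be recomputed from the results table. A minimal sketch follows; the `Run` class and its field names are illustrative, and it assumes GPT-5's output token count includes its reasoning tokens, which is what the 1,066 visible-token figure implies.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Run:
    model: str
    output_tokens: int
    reasoning_tokens: Optional[int]  # None where no separate count was reported
    reported: int                    # contradictions the model reported
    false_positives: int

    @property
    def precision(self) -> float:
        """Share of reported contradictions that were genuine."""
        return (self.reported - self.false_positives) / self.reported

    @property
    def reasoning_ratio(self) -> Optional[float]:
        """Reasoning tokens per visible output token, if reasoning is reported."""
        if self.reasoning_tokens is None:
            return None
        visible = self.output_tokens - self.reasoning_tokens
        return self.reasoning_tokens / visible


RUNS = [
    Run("GPT-5", 12_074, 11_008, reported=4, false_positives=0),
    Run("Claude Opus 4.6", 2_056, None, reported=6, false_positives=0),
    Run("Claude Sonnet 4.6", 826, None, reported=4, false_positives=1),
]

for run in RUNS:
    ratio = run.reasoning_ratio
    ratio_text = f"{ratio:.1f}:1" if ratio is not None else "n/a"
    print(f"{run.model}: precision {run.precision:.0%}, reasoning ratio {ratio_text}")
```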

**Key insight — contradiction detection favors precision-oriented models:**

This task is fundamentally about LOGICAL ARGUMENTATION: proving that two statements
cannot both be true. Unlike assumption-finding (which is about imagining what could go
wrong) or gap-finding (which is about identifying missing content), contradiction
detection requires the model to:
1. Hold two statements in working memory simultaneously
2. Construct a formal argument for why they conflict
3. NOT get confused by statements that SEEM contradictory but are actually consistent

Requirement #3 is where models diverge. Sonnet produced a false positive because it
didn't fully reason through whether the pending_replace fill scenario is actually
inconsistent (it isn't — current parameters apply). Opus avoided this trap entirely
and additionally found DEEPER contradictions that require multi-step logical reasoning
(the re-entry problem, the label misapplication). GPT-5 also avoided false positives
but at massive computational cost.

**Opus's efficiency advantage:**
This is the first task where Opus is not just qualitatively better but also
quantitatively more efficient than GPT-5: 6 findings in 41s and 2K tokens vs GPT-5's
4 findings in 162s and 12K tokens. That's roughly 9x more findings per output token
(2.9 vs 0.33 per 1K tokens) and 4x faster. For contradiction detection specifically,
Opus appears to have a structural advantage, possibly because its internal reasoning
is better calibrated for logical argumentation than GPT-5's externalized reasoning chain.
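
As a check on the efficiency comparison, the per-token and per-second ratios can be recomputed from the results table. This is a minimal sketch; the helper name is illustrative, and the token counts are total output tokens as reported in the table.

```python
def findings_per_1k_tokens(findings: int, output_tokens: int) -> float:
    """Findings produced per 1,000 output tokens."""
    return findings / output_tokens * 1000


opus_rate = findings_per_1k_tokens(6, 2_056)    # about 2.9 per 1K tokens
gpt5_rate = findings_per_1k_tokens(4, 12_074)   # about 0.33 per 1K tokens

token_efficiency_ratio = opus_rate / gpt5_rate  # about 8.8x in Opus's favor
speedup = 162 / 41                              # about 4x faster wall-clock

print(f"Opus:  {opus_rate:.2f} findings per 1K tokens")
print(f"GPT-5: {gpt5_rate:.2f} findings per 1K tokens")
print(f"Token efficiency ratio: {token_efficiency_ratio:.1f}x")
print(f"Speedup: {speedup:.1f}x")
```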

**Comparison to Finding #20 (invariant violation paths):**
In Finding #20, GPT-5 was maximally selective (3 findings, all genuine, 15:1
reasoning ratio). Here, GPT-5 shows the same pattern: few findings, all genuine,
high reasoning ratio (10.3:1). The difference: in #20, GPT-5's selectivity meant
it found UNIQUE violations others missed. Here, three of GPT-5's four findings were
also found by Opus, which added two unique findings of its own; only the IOC finding
(#3) was unique to GPT-5. GPT-5's high verification bar buys little when Opus is
ALSO precise AND more thorough.

**Updated task-model assignment:**

For contradiction/consistency checking:
1. **Opus**: best choice. Zero false positives, deepest contradictions, and far
   cheaper than GPT-5
2. **GPT-5**: solid backup. Zero false positives and a unique TIF-related insight
   (IOC), but expensive and slower
3. **Sonnet**: NOT recommended for this task. It produced a false positive and added
   no insight beyond what Opus found

This confirms the emerging pattern: each model has task types where it excels.
Opus excels at logical argumentation and design tensions. GPT-5 excels at
exhaustive enumeration and operational concerns. Sonnet excels at speed and
structural/assumption analysis but struggles with tasks requiring formal logical
reasoning (contradiction detection, concurrency analysis per Finding #13).

**Practical implication:** When reviewing architecture documents for internal
consistency (e.g., before implementation begins), run Opus. If budget allows,
add GPT-5 for TIF/edge-case coverage. Skip Sonnet for consistency checking —
its speed advantage is negated by the false positive risk.