Break the monolithic 3249-line findings file into 29 individual files, one per experiment. Each file is named YYYY-MM-DD-NN-slug.md for easy chronological sorting and discovery. No content changes — purely structural reorganization.
Finding 25: Contradiction detection: NEW task type — Opus excels at finding genuine contradictions with precision; GPT-5 is thorough but spends heavily on reasoning; Sonnet finds surface-level issues quickly
Date: 2026-05-05
Task: Identify internal contradictions, logical inconsistencies, and conflicting rules
in gargoyle's order-state-machine.md (311 lines) — a document defining states,
transitions, invariants, fill precedence rules, and time-in-force behavior.
How we used them: Same document (full text) + same focused analytical question to all
3 models via HAI proxy. Prompt specifically asked for: state machine contradictions,
semantic conflicts, rule violations, implicit contradictions, and terminology
inconsistencies. Required each finding to quote the conflicting statements, explain
the logical argument, assign severity, and recommend which statement should "win."
No tools, no project context beyond the document itself.
| Model | Time | Output tokens | Reasoning tokens | Contradictions found |
|---|---|---|---|---|
| GPT-5 | 162s | 12,074 | 11,008 | 4 |
| Claude Opus 4.6 | 41s | 2,056 | (internal) | 6 |
| Claude Sonnet 4.6 | 17s | 826 | (internal) | 4 |
What they found — common ground (2+ models identified):
- Missing `pending_cancel → partially_filled` revert transition (GPT-5 #1 + Opus #2 + Sonnet partial): The "Rejection reverts" invariant states orders return to their "pre-modification state (`working` or `partially_filled`)", but the state diagram only shows `pending_cancel → working` for cancel rejection, with no path back to `partially_filled`. All models correctly identified this as the diagram being incomplete relative to the stated invariant. GPT-5 and Opus rated CRITICAL.
- Same issue for `pending_replace` revert (GPT-5 #1 + Opus #3): The state diagram only shows `pending_replace → working` for replace rejection, but a replace requested from `partially_filled` should revert to `partially_filled`. Same root cause as above, just the replace variant.
- FOK "never partially fills" vs state machine allowing it (GPT-5 #2 + Opus #4): The TIF table says FOK "never partially fills" but the state machine has no guards preventing FOK orders from reaching `partially_filled`. Both correctly noted this is a broker-enforced guarantee but the document presents it as system-level.
- `rejection_reason` described as "broker-provided" but local rejections exist (GPT-5 #4 + Opus #5 + Sonnet): `pending → rejected` is "local validation failure" with no broker interaction, but the field says "Broker-provided reason when rejected." All three caught this terminology inconsistency.
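The revert gap described above can be checked mechanically. The following is a minimal sketch, assuming the diagram's edges are transcribed by hand into `(from, to, label)` triples. The `DIAGRAM` contents, the label wording, and the `missing_reverts` helper are hypothetical; only the edges mentioned in this finding are included, not the full order-state-machine.md.

```python
# Pre-modification states named by the "Rejection reverts" invariant.
PRE_MODIFICATION_STATES = {"working", "partially_filled"}

# Hypothetical transcription of the edges this finding mentions.
# Label text is assumed; the real diagram has more states and edges.
DIAGRAM = {
    ("pending_cancel", "working", "cancel rejected"),
    ("pending_replace", "working", "replace rejected"),
    ("pending_cancel", "partially_filled", "fill precedence"),
    ("pending_replace", "partially_filled", "fill precedence"),
}


def missing_reverts(diagram, pending_states, revert_targets):
    """Return (pending, target) pairs the invariant requires
    but that no rejection-labelled edge provides."""
    rejection_edges = {
        (src, dst) for src, dst, label in diagram if "rejected" in label
    }
    return sorted(
        (pending, target)
        for pending in pending_states
        for target in revert_targets
        if (pending, target) not in rejection_edges
    )


gaps = missing_reverts(
    DIAGRAM, {"pending_cancel", "pending_replace"}, PRE_MODIFICATION_STATES
)
print(gaps)
```

Under these assumptions the check reports both `pending_cancel → partially_filled` and `pending_replace → partially_filled` as missing rejection reverts. Edges are matched by label because a fill edge into `partially_filled` exists but is not a rejection revert.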
GPT-5 unique findings (not in either other model):
- IOC valid terminal states exclude `expired` vs generic expiry transitions (#3): IOC should never reach `expired` (unfilled portion is cancelled immediately), but the state diagram allows any order to transition to `expired` without TIF guards. Well-reasoned extension of the FOK finding to IOC semantics. GPT-5 correctly identified that broker "expired-like" outcomes should map to `cancelled` for IOC.
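One way to picture the missing TIF guards from the FOK and IOC findings is a small lookup consulted before a transition is applied. This is a sketch under assumptions, not the document's design: `FORBIDDEN_BY_TIF`, `transition_allowed`, and `normalize_target` are hypothetical names, and the mapping encodes only the two claims quoted in this finding.

```python
# Assumed mapping, derived only from the claims in this finding:
# FOK "never partially fills"; IOC should never reach `expired`.
FORBIDDEN_BY_TIF = {
    "FOK": {"partially_filled"},
    "IOC": {"expired"},
}


def transition_allowed(tif: str, target_state: str) -> bool:
    """Guard: reject transitions the order's time-in-force rules out."""
    return target_state not in FORBIDDEN_BY_TIF.get(tif, set())


def normalize_target(tif: str, target_state: str) -> str:
    """Map a broker 'expired-like' outcome to `cancelled` for IOC orders,
    following GPT-5's recommendation; other cases pass through unchanged."""
    if tif == "IOC" and target_state == "expired":
        return "cancelled"
    return target_state


print(transition_allowed("FOK", "partially_filled"))  # False
print(normalize_target("IOC", "expired"))             # cancelled
```

Whether such a guard belongs in the state machine or stays a broker-side guarantee is the open question both models raised; the sketch only shows what a system-level version could look like.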
Claude Opus unique findings (not in either other model):
- Terminal states that aren't terminal (the `partially_filled` re-entry problem, #1): Opus identified the DEEPER contradiction beyond the surface-level "cancelled states have outgoing transitions." When `cancelled → partially_filled` fires via late fill, the order is now non-terminal with NO defined mechanism to re-terminate if no further fills arrive. The order is stuck in `partially_filled` indefinitely. This goes beyond "the diagram contradicts the definition of terminal" to "the fill precedence rule creates an unspecified operational scenario." This is the most architecturally significant finding across all three models.
- Fill precedence label misapplication to non-terminal states (#6): The state diagram labels transitions from `pending_cancel → partially_filled` and `pending_replace → partially_filled` as "fill precedence," but the Fill Precedence Rule explicitly defines itself as overriding TERMINAL states. `pending_cancel` is non-terminal. The label conflates two different mechanisms (fill during pending modification vs. fill overriding terminal state), which could cause implementers to use the same code path for fundamentally different scenarios.
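The re-entry problem can be made concrete with a toy edge list. This is a minimal sketch, assuming a terminal set of `filled`, `cancelled`, `rejected`, and `expired` and a handful of hand-written edges. `TERMINAL`, `EDGES`, the trigger names, and both helpers are hypothetical and do not reproduce the real diagram.

```python
# Assumed terminal set; the source document defines the authoritative one.
TERMINAL = {"filled", "cancelled", "rejected", "expired"}

# Hypothetical (from, to, trigger) edges illustrating the late-fill case.
EDGES = [
    ("cancelled", "partially_filled", "late fill"),
    ("partially_filled", "filled", "fill"),
    ("partially_filled", "pending_cancel", "cancel requested"),
]


def leaky_terminals(terminal, edges):
    """Terminal states that nevertheless have an outgoing edge."""
    return sorted({src for src, _dst, _trigger in edges if src in terminal})


def exit_triggers(state, edges):
    """Triggers that can move an order out of `state`."""
    return sorted({trigger for src, _dst, trigger in edges if src == state})


print(leaky_terminals(TERMINAL, EDGES))          # ['cancelled']
print(exit_triggers("partially_filled", EDGES))  # ['cancel requested', 'fill']
```

In this toy model every exit from `partially_filled` needs a new external event. An order that re-enters from `cancelled` and receives no further fill has nothing that returns it to a terminal state, which is the scenario Opus flagged.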
Claude Sonnet unique findings (not in either other model):
- State diagram terminal arrow contradiction (#1): Sonnet was the only model to explicitly note that the Mermaid diagram shows `cancelled → [*]` (terminal arrow) while simultaneously showing `cancelled → partially_filled` (outgoing transition). A valid observation but more surface-level than Opus's deeper analysis of the same phenomenon.
- Pending replace fill logic error (#3): Sonnet argued that receiving a fill during `pending_replace` creates a logical impossibility because the order parameters are in flux. This is WRONG: fills always apply to current parameters (the replace hasn't been confirmed yet), and the document actually handles this correctly. This is a FALSE POSITIVE from Sonnet.
Quality assessment:
- Claude Opus was the clear winner for this task. Found the most contradictions (6), had the highest precision (0 false positives), and, crucially, found qualitatively deeper issues. The `partially_filled` re-entry problem (#1) isn't just "the diagram has a missing transition" but "the fill precedence rule creates an unresolvable operational state." The fill precedence label misapplication (#6) identifies a conceptual confusion that would genuinely cause implementation bugs. Opus completed in only 41s with 2,056 output tokens, by far the most efficient.
- GPT-5 found 4 genuine contradictions with 0 false positives but spent an extraordinary amount of reasoning tokens (11,008) for modest output (1,066 visible content tokens, 10.3:1 reasoning ratio). The IOC finding was unique and valuable. But the cost is disproportionate: 162s and 12K tokens for 4 findings vs Opus's 41s and 2K tokens for 6 findings. GPT-5's reasoning budget seems to have been mostly spent on VERIFICATION (confirming each finding is genuine), consistent with Finding #20's observation.
- Claude Sonnet was fastest (17s) and found 4 items, but one was a false positive (the pending_replace logic error claim is incorrect). That gives it a precision of 75% (3/4 genuine) — the lowest of the three. Its genuine findings were all also found by the other models (no unique true contributions). Sonnet appears to trade speed for accuracy on contradiction detection.
Key insight — contradiction detection favors precision-oriented models:
This task is fundamentally about LOGICAL ARGUMENTATION: proving that two statements cannot both be true. Unlike assumption-finding (which is about imagining what could go wrong) or gap-finding (which is about identifying missing content), contradiction detection requires the model to:
1. Hold two statements in working memory simultaneously
2. Construct a formal argument for why they conflict
3. NOT get confused by statements that SEEM contradictory but are actually consistent
Requirement #3 is where models diverge. Sonnet produced a false positive because it didn't fully reason through whether the pending_replace fill scenario is actually inconsistent (it isn't — current parameters apply). Opus avoided this trap entirely and additionally found DEEPER contradictions that require multi-step logical reasoning (the re-entry problem, the label misapplication). GPT-5 also avoided false positives but at massive computational cost.
Opus's efficiency advantage: This is the first task where Opus is not just qualitatively better but also quantitatively more efficient: 6 findings in 41s and 2K tokens vs GPT-5's 4 findings in 162s and 12K tokens. That is roughly 9x more findings per output token (6/2,056 vs 4/12,074) and about 4x faster (162s vs 41s). For contradiction detection specifically, Opus appears to have a structural advantage, possibly because its internal reasoning is better calibrated for logical argumentation than GPT-5's externalized reasoning chain.
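The efficiency ratios can be recomputed directly from the results table. The sketch below uses only the table's numbers; the dictionary layout and helper name are illustrative. Note that GPT-5's output-token count includes its reasoning tokens, while the Claude models' reasoning is internal and not reported.

```python
# Figures copied from the results table above.
RESULTS = {
    "GPT-5": {"seconds": 162, "output_tokens": 12_074,
              "reasoning_tokens": 11_008, "findings": 4},
    "Claude Opus 4.6": {"seconds": 41, "output_tokens": 2_056, "findings": 6},
    "Claude Sonnet 4.6": {"seconds": 17, "output_tokens": 826, "findings": 4},
}


def findings_per_1k_tokens(row):
    """Findings produced per 1,000 reported output tokens."""
    return row["findings"] / row["output_tokens"] * 1000


gpt5 = RESULTS["GPT-5"]
opus = RESULTS["Claude Opus 4.6"]

token_efficiency_ratio = findings_per_1k_tokens(opus) / findings_per_1k_tokens(gpt5)
speedup = gpt5["seconds"] / opus["seconds"]
visible_tokens = gpt5["output_tokens"] - gpt5["reasoning_tokens"]
reasoning_ratio = gpt5["reasoning_tokens"] / visible_tokens

print(f"findings per token, Opus vs GPT-5: {token_efficiency_ratio:.1f}x")  # 8.8x
print(f"wall-clock speedup, Opus vs GPT-5: {speedup:.1f}x")                 # 4.0x
print(f"GPT-5 reasoning ratio: {reasoning_ratio:.1f}:1")                    # 10.3:1
```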
Comparison to Finding #20 (invariant violation paths): In Finding #20, GPT-5 was maximally selective (3 findings, all genuine, 15:1 reasoning ratio). Here, GPT-5 shows the same pattern: few findings, all genuine, high reasoning ratio (10.3:1). The difference: in #20, GPT-5's selectivity meant it found UNIQUE violations others missed. Here, three of GPT-5's four findings were also found by Opus (only the IOC finding was unique), and Opus found 2 more that GPT-5 missed. GPT-5's high verification bar helps less when Opus is ALSO precise AND more thorough.
Updated task-model assignment:
For contradiction/consistency checking:
- Opus — best choice: highest precision, deepest contradictions, most efficient
- GPT-5 — solid backup: zero false positives, unique TIF-related insights, but expensive and slower
- Sonnet — NOT recommended for this task: produces false positives, no unique true contributions
This confirms the emerging pattern: each model has task types where it excels. Opus excels at logical argumentation and design tensions. GPT-5 excels at exhaustive enumeration and operational concerns. Sonnet excels at speed and structural/assumption analysis but struggles with tasks requiring formal logical reasoning (contradiction detection, concurrency analysis per Finding #13).
Practical implication: When reviewing architecture documents for internal consistency (e.g., before implementation begins), run Opus. If budget allows, add GPT-5 for TIF/edge-case coverage. Skip Sonnet for consistency checking — its speed advantage is negated by the false positive risk.
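The assignment above could be recorded as a small lookup for a review script. This is an illustrative sketch only: the `ASSIGNMENT` structure, the task key, and `models_for` are hypothetical, and the model names are plain labels rather than API identifiers.

```python
# Preference order taken from this finding; structure and names are illustrative.
ASSIGNMENT = {
    "contradiction_detection": {
        "primary": "Claude Opus 4.6",
        "optional": ["GPT-5"],          # add when budget allows
        "avoid": ["Claude Sonnet 4.6"], # false positive risk on this task
    },
}


def models_for(task: str, extra_budget: bool = False) -> list[str]:
    """Return the models to run for a task, in preference order."""
    entry = ASSIGNMENT[task]
    models = [entry["primary"]]
    if extra_budget:
        models.extend(entry["optional"])
    return models


print(models_for("contradiction_detection"))
print(models_for("contradiction_detection", extra_budget=True))
```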