Rodin 6af8a6ee10 refactor(findings): split ALL-FINDINGS.md into per-experiment files

Break the monolithic 3249-line findings file into 29 individual files,
one per experiment. Each file is named YYYY-MM-DD-NN-slug.md for easy
chronological sorting and discovery.

No content changes — purely structural reorganization.

2026-05-06 07:15:50 -07:00

Finding 25: Contradiction detection: NEW task type — Opus excels at finding genuine contradictions with precision; GPT-5 is thorough but spends heavily on reasoning; Sonnet finds surface-level issues quickly

Date: 2026-05-05

Task: Identify internal contradictions, logical inconsistencies, and conflicting rules in gargoyle's order-state-machine.md (311 lines), a document defining states, transitions, invariants, fill precedence rules, and time-in-force behavior.

How we used them: Same document (full text) + same focused analytical question to all 3 models via HAI proxy. Prompt specifically asked for: state machine contradictions, semantic conflicts, rule violations, implicit contradictions, and terminology inconsistencies. Required each finding to quote the conflicting statements, explain the logical argument, assign severity, and recommend which statement should "win." No tools, no project context beyond the document itself.

| Model | Time | Output tokens | Reasoning tokens | Contradictions found |
|---|---|---|---|---|
| GPT-5 | 162s | 12,074 | 11,008 | 4 |
| Claude Opus 4.6 | 41s | 2,056 | (internal) | 6 |
| Claude Sonnet 4.6 | 17s | 826 | (internal) | 4 |

What they found — common ground (2+ models identified):

Missing pending_cancel → partially_filled revert transition (GPT-5 #1 + Opus #2 + Sonnet partial): The "Rejection reverts" invariant states orders return to their "pre-modification state (working or partially_filled)", but the state diagram only shows pending_cancel → working for cancel rejection, with no path back to partially_filled. All three models flagged the diagram as incomplete relative to the stated invariant, although Sonnet covered the issue only partially. GPT-5 and Opus rated it CRITICAL.
Same issue for pending_replace revert (GPT-5 #1 + Opus #3): The state diagram only shows pending_replace → working for replace rejection, but a replace requested from partially_filled should revert to partially_filled. Same root cause as above, just the replace variant.
FOK "never partially fills" vs state machine allowing it (GPT-5 #2 + Opus #4): The TIF table says FOK "never partially fills" but the state machine has no guards preventing FOK orders from reaching partially_filled. Both correctly noted this is a broker-enforced guarantee but the document presents it as system-level.
rejection_reason described as "broker-provided" but local rejections exist (GPT-5 #4 + Opus #5 + Sonnet): pending → rejected is "local validation failure" with no broker interaction, but the field says "Broker-provided reason when rejected." All three caught this terminology inconsistency.
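The revert-transition gap in the first two items can be checked mechanically. The sketch below is a minimal illustration, assuming the diagram's rejection edges are exactly the two this finding mentions; the edge set and the helper name `missing_revert_edges` are hypothetical and are not taken from order-state-machine.md itself.

```python
# Hypothetical reconstruction of the rejection edges described in this finding;
# the real order-state-machine.md may name or structure them differently.
DIAGRAM_REJECTION_EDGES = {
    ("pending_cancel", "working"),
    ("pending_replace", "working"),
}

# States an order may be in before a cancel or replace is requested,
# per the "Rejection reverts" invariant quoted above.
PRE_MODIFICATION_STATES = ("working", "partially_filled")
PENDING_STATES = ("pending_cancel", "pending_replace")


def missing_revert_edges(diagram_edges):
    """Return revert edges the invariant requires but the diagram lacks."""
    required = {
        (pending, origin)
        for pending in PENDING_STATES
        for origin in PRE_MODIFICATION_STATES
    }
    return sorted(required - set(diagram_edges))


missing = missing_revert_edges(DIAGRAM_REJECTION_EDGES)
for source, target in missing:
    print(f"invariant requires {source} -> {target}, diagram has no such edge")
```

Under these assumptions the check reports the two partially_filled reverts, matching the cancel and replace variants identified by GPT-5 and Opus.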

GPT-5 unique findings (not in either other model):

IOC valid terminal states exclude expired vs generic expiry transitions (#3): IOC should never reach expired (unfilled portion is cancelled immediately), but the state diagram allows any order to transition to expired without TIF guards. Well-reasoned extension of the FOK finding to IOC semantics. GPT-5 correctly identified that broker "expired-like" outcomes should map to cancelled for IOC.
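One possible way to resolve the FOK and IOC findings is to make the time-in-force constraints explicit guards rather than relying on broker behavior. The following is a sketch under that assumption; the table `FORBIDDEN_TARGETS` and both function names are hypothetical, and the source document may prefer a different resolution (for example, documenting the guarantee as broker-enforced).

```python
# Hypothetical guard table: states each time-in-force must never reach,
# based only on the FOK and IOC semantics described in this finding.
FORBIDDEN_TARGETS = {
    "FOK": {"partially_filled"},  # FOK "never partially fills"
    "IOC": {"expired"},           # IOC remainder is cancelled, not expired
}


def is_transition_allowed(tif, target_state):
    """Return False when the target state is excluded for this time-in-force."""
    return target_state not in FORBIDDEN_TARGETS.get(tif, set())


def map_broker_outcome(tif, broker_state):
    """Map a broker-reported outcome to a local state.

    For IOC, an "expired-like" broker outcome is mapped to cancelled,
    following GPT-5's recommendation.
    """
    if tif == "IOC" and broker_state == "expired":
        return "cancelled"
    return broker_state


print(is_transition_allowed("FOK", "partially_filled"))
print(map_broker_outcome("IOC", "expired"))
```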

Claude Opus unique findings (not in either other model):

Terminal states that aren't terminal — the partially_filled re-entry problem (#1): Opus identified the DEEPER contradiction beyond the surface-level "cancelled states have outgoing transitions." When cancelled → partially_filled fires via late fill, the order is now non-terminal with NO defined mechanism to re-terminate if no further fills arrive. The order is stuck in partially_filled indefinitely. This goes beyond "the diagram contradicts the definition of terminal" to "the fill precedence rule creates an unspecified operational scenario." This is the most architecturally significant finding across all three models.
Fill precedence label misapplication to non-terminal states (#6): The state diagram labels transitions from pending_cancel → partially_filled and pending_replace → partially_filled as "fill precedence," but the Fill Precedence Rule explicitly defines itself as overriding TERMINAL states. pending_cancel is non-terminal. The label conflates two different mechanisms (fill during pending modification vs. fill overriding terminal state), which could cause implementers to use the same code path for fundamentally different scenarios.
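The distinction behind Opus's #6 can be made concrete by classifying a fill according to the state it arrives in. This is an illustrative sketch only: the state sets and the three labels are assumptions, and `LATE_FILL_TERMINAL_STATES` lists only the terminal state this finding names explicitly.

```python
# Assumption: only "cancelled" is named in this finding as a terminal state
# that a late fill can override; the source document may list others.
LATE_FILL_TERMINAL_STATES = {"cancelled"}
PENDING_MODIFICATION_STATES = {"pending_cancel", "pending_replace"}


def classify_fill(current_state):
    """Name the mechanism a fill triggers, keeping the two cases separate."""
    if current_state in LATE_FILL_TERMINAL_STATES:
        # The Fill Precedence Rule proper: a fill overriding a terminal state.
        return "fill_precedence_override"
    if current_state in PENDING_MODIFICATION_STATES:
        # A fill that races an unconfirmed cancel or replace request.
        return "fill_during_pending_modification"
    return "ordinary_fill"


for state in ("cancelled", "pending_cancel", "working"):
    print(state, "->", classify_fill(state))
```

Keeping the two branches distinct is the point: an implementation that routes both through one "fill precedence" handler would inherit the conflation the label creates in the diagram.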

Claude Sonnet unique findings (not in either other model):

State diagram terminal arrow contradiction (#1): Sonnet was the only model to explicitly note that the Mermaid diagram shows cancelled → [*] (terminal arrow) while simultaneously showing cancelled → partially_filled (outgoing transition). A valid observation but more surface-level than Opus's deeper analysis of the same phenomenon.
Pending replace fill logic error (#3): Sonnet argued that receiving a fill during pending_replace creates a logical impossibility because the order parameters are in flux. This is WRONG — fills always apply to current parameters (the replace hasn't been confirmed yet), and the document actually handles this correctly. This is a FALSE POSITIVE from Sonnet.

Quality assessment:

Claude Opus was the clear winner for this task. It found the most contradictions (6), had the highest precision (0 false positives), and, crucially, found qualitatively deeper issues. The partially_filled re-entry problem (#1) isn't just "the diagram has a missing transition" but "the fill precedence rule creates an unresolvable operational state." The fill precedence label misapplication (#6) identifies a conceptual confusion that would genuinely cause implementation bugs. Opus completed in only 41s with 2,056 output tokens, a fraction of GPT-5's time and token cost; only Sonnet was faster, and at lower precision.
GPT-5 found 4 genuine contradictions with 0 false positives but spent an extraordinary amount of reasoning tokens (11,008) for modest output (1,066 visible content tokens, 10.3:1 reasoning ratio). The IOC finding was unique and valuable. But the cost is disproportionate: 162s and 12K tokens for 4 findings vs Opus's 41s and 2K tokens for 6 findings. GPT-5's reasoning budget seems to have been mostly spent on VERIFICATION (confirming each finding is genuine), consistent with Finding #20's observation.
Claude Sonnet was fastest (17s) and found 4 items, but one was a false positive (the pending_replace logic error claim is incorrect). That gives it a precision of 75% (3/4 genuine), the lowest of the three. Its genuine findings all overlapped with issues the other models also raised; even the terminal-arrow observation is a shallower view of Opus's #1, so Sonnet made no unique true contributions. Sonnet appears to trade accuracy for speed on contradiction detection.
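The figures in this assessment can be reproduced from the results table plus the false-positive counts stated above. The sketch below is a convenience calculation, assuming (as the 1,066 visible-token figure implies) that GPT-5's output-token count includes its reasoning tokens; the dictionary layout and function names are illustrative.

```python
# Numbers copied from the results table and the false-positive counts above.
RUNS = {
    "GPT-5": {"seconds": 162, "output_tokens": 12074, "found": 4, "false_positives": 0},
    "Claude Opus 4.6": {"seconds": 41, "output_tokens": 2056, "found": 6, "false_positives": 0},
    "Claude Sonnet 4.6": {"seconds": 17, "output_tokens": 826, "found": 4, "false_positives": 1},
}
GPT5_REASONING_TOKENS = 11008


def precision(run):
    """Fraction of reported contradictions that were genuine."""
    return (run["found"] - run["false_positives"]) / run["found"]


def genuine_per_1k_tokens(run):
    """Genuine findings per 1,000 output tokens."""
    return 1000 * (run["found"] - run["false_positives"]) / run["output_tokens"]


# Assumes output_tokens includes reasoning tokens for GPT-5.
gpt5_visible = RUNS["GPT-5"]["output_tokens"] - GPT5_REASONING_TOKENS
gpt5_reasoning_ratio = GPT5_REASONING_TOKENS / gpt5_visible

for name, run in RUNS.items():
    print(f"{name}: precision={precision(run):.2f}, "
          f"genuine/1k tokens={genuine_per_1k_tokens(run):.2f}")
print(f"GPT-5 reasoning ratio: {gpt5_reasoning_ratio:.1f}:1")
```

Per-token throughput is only one lens: Sonnet scores well on it because its output is short, which is why the assessment weighs precision and depth of findings alongside cost.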

Key insight — contradiction detection favors precision-oriented models:

This task is fundamentally about LOGICAL ARGUMENTATION: proving that two statements cannot both be true. Unlike assumption-finding (which is about imagining what could go wrong) or gap-finding (which is about identifying missing content), contradiction detection requires the model to:

Hold two statements in working memory simultaneously
Construct a formal argument for why they conflict
NOT get confused by statements that SEEM contradictory but are actually consistent

Requirement #3 is where models diverge. Sonnet produced a false positive because it didn't fully reason through whether the pending_replace fill scenario is actually inconsistent (it isn't — current parameters apply). Opus avoided this trap entirely and additionally found DEEPER contradictions that require multi-step logical reasoning (the re-entry problem, the label misapplication). GPT-5 also avoided false positives but at massive computational cost.

Opus's efficiency advantage: This is the first task where Opus is not just qualitatively better but also quantitatively more efficient than GPT-5: 6 findings in 41s and 2,056 output tokens vs GPT-5's 4 findings in 162s and 12,074 tokens. That's roughly 9x more findings per output token (about 2.9 vs 0.33 findings per 1,000 tokens) and about 4x faster. For contradiction detection specifically, Opus appears to have a structural advantage, possibly because its internal reasoning is better calibrated for logical argumentation than GPT-5's externalized reasoning chain.

Comparison to Finding #20 (invariant violation paths): In Finding #20, GPT-5 was maximally selective (3 findings, all genuine, 15:1 reasoning ratio). Here, GPT-5 shows the same pattern: few findings, all genuine, high reasoning ratio (10.3:1). The difference: in #20, GPT-5's selectivity meant it found UNIQUE violations others missed. Here, three of GPT-5's four findings were also found by Opus; only the IOC finding (#3) was unique to GPT-5, while Opus contributed two unique findings of its own. GPT-5's high verification bar buys little when Opus is ALSO precise AND more thorough.

Updated task-model assignment:

For contradiction/consistency checking:

Opus — best choice: highest precision, deepest contradictions, most efficient
GPT-5 — solid backup: zero false positives, unique TIF-related insights, but expensive and slower
Sonnet — NOT recommended for this task: produces false positives, no unique true contributions

This confirms the emerging pattern: each model has task types where it excels. Opus excels at logical argumentation and design tensions. GPT-5 excels at exhaustive enumeration and operational concerns. Sonnet excels at speed and structural/assumption analysis but struggles with tasks requiring formal logical reasoning (contradiction detection, concurrency analysis per Finding #13).

Practical implication: When reviewing architecture documents for internal consistency (e.g., before implementation begins), run Opus. If budget allows, add GPT-5 for TIF/edge-case coverage. Skip Sonnet for consistency checking — its speed advantage is negated by the false positive risk.
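The recommendation above could be encoded as a small routing table. This is a hypothetical sketch: the task key, the table shape, and `models_for` are illustrative names, not part of any existing tooling described in these findings.

```python
# Hypothetical routing table reflecting the assignment for this task type.
ROUTING = {
    "contradiction_detection": {
        "primary": "Claude Opus 4.6",
        "backup": "GPT-5",              # adds TIF/edge-case coverage at higher cost
        "skip": ["Claude Sonnet 4.6"],  # false-positive risk on this task
    },
}


def models_for(task, budget_allows_backup=False):
    """Return the models to run for a task, primary first."""
    entry = ROUTING[task]
    models = [entry["primary"]]
    if budget_allows_backup:
        models.append(entry["backup"])
    return models


print(models_for("contradiction_detection"))
print(models_for("contradiction_detection", budget_allows_backup=True))
```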
