# Finding 15: Design Coherence Analysis

**Date:** 2026-05-03

**Task:** Identify internal design incoherences in gargoyle's `failure-modes.md` (383 lines): places where the document's stated principles/invariants are contradicted by its own specified mechanisms.

**How we used them:** Same document (full text) + same focused analytical question to all 3 models via HAI proxy. Highly structured prompt specifying 5 categories of incoherence to look for (safety properties not enforced, state machine violations, recovery contradictions, supervision conflicts, cross-mechanism contradictions). Required each finding to reference specific sections. No tools, no project context beyond the document itself.

| Model | Time | Output tokens | Reasoning tokens | Incoherences found |
|---|---|---|---|---|
| Claude Sonnet 4.6 | ~39s | 1,168 | (internal) | 5 |
| Claude Opus 4.6 | ~105s | 3,378 | (internal) | 7 (8 attempted, 1 self-withdrawn) |
| GPT-5 | ~120s | 10,235 | 9,088 | 4 |

**What they found (common ground, all 3 identified):**
- State machine universality claim vs Strategy.Worker crash behavior (process crashes bypass the degraded state entirely; there is no transition path in the model)
- Market data staleness advisory-only vs the "don't trade when ambiguous" principle (or vs concurrent failure auto-halt)
- `pending_cancel`/`pending_replace` absent from recovery query set (GPT-5 and Sonnet found this directly; Opus addressed the broader state machine gap)

**GPT-5 unique findings (not in either Claude model):**
- Kill switch halted = "process terminated" vs kill switch requiring RUNNING processes for cancel_all, FLATTEN, and LIQUIDATE modes. The state definition claims processes are terminated, but the mechanisms require them alive to execute orders. **This is the most architecturally significant finding**: it reveals a fundamental definitional error in the state machine.
- Per-symbol degradation contradicts the process-level degradation semantics.
  A worker "enters degraded" but continues operating for non-stale symbols, violating the stated definition that degraded = "cannot perform primary function." The metrics/eventing model has no per-symbol dimension.

**Claude Opus unique findings (not in either other model):**
- `:rest_for_one` cascade creates a FIFTH implicit state (terminated-and-restarting) not in the four-state model: processes that were `normal` are forcibly killed (not by kill switch) and restart.
- PortfolioMonitor continues evaluating with stale data ("fail-safe") while Strategy.Workers are stopped for the SAME condition. This contradicts both the universal state machine (PM doesn't transition to degraded) and the doc's reasoning about why stale data is dangerous.
- Path-dependent recovery criteria: Strategy.Worker recovery requires 21+ bars after crash but only "price continuity check" after staleness. The state machine's single "catch-up complete" exit condition can't express this.
- `halted → [*]` transition in state diagram is logically impossible if "halted" means the process is already terminated: dead processes can't fire transitions.
- Compound failure detection requires a meta-observer across processes, but the per-process state machine model has no way to express cross-process conditions.

Opus also self-corrected one finding that initially looked like an incoherence but was actually consistent.

**Claude Sonnet unique findings (not in either other model):**
- Market data global staleness: the failure table says "Manual (disengage)" for recovery, implying automatic engagement happened, but the text says it's advisory only. Table contradicts prose.
- ReconciliationGate: doc claims gate survives OM crash (separate supervision tree), but then says "missing ETS table = not ready" when OM crashes. If the gate survives, why would its table be missing?
- Signal survival claims are contradictory between sections: worker crash says downstream signals survive, but OM crash says all upstream signals lost.
  (NOTE: this is actually describing different scenarios: worker crash doesn't cascade to SignalRisk, OM crash does via `:rest_for_one`. Sonnet may have misread the architecture here; the two statements are consistent when you understand the supervision tree.)

**Quality assessment:**
- **GPT-5** found only 4 incoherences but TWO of them are genuinely critical architectural findings. The "halted = terminated" vs "kill switch requires running processes" contradiction is a real design error: you can't both terminate processes AND require them to execute cancel/liquidation orders. The per-symbol degradation finding is also a real modeling gap. GPT-5 was MORE SELECTIVE here than in previous experiments; it didn't pad with medium-severity findings. Each of its 4 was high/critical.
- **Claude Opus** produced the most findings (7 valid) with characteristic depth. Its self-correction (withdrawing finding #6 after deeper analysis) shows intellectual honesty rare in model outputs. The PortfolioMonitor stale-data contradiction is genuinely insightful: same input condition, opposite response, no justification within the state machine model. The compound failure meta-observer finding identifies a modeling category error. Opus also found modeling imprecisions (path-dependent recovery, `halted → [*]` impossibility) that the other models didn't notice.
- **Claude Sonnet** found 5 issues quickly (39s, 1,168 tokens) but quality was mixed. Finding #4 (ReconciliationGate) raises a genuine question about the ETS table ownership claim. Finding #1 (table vs prose contradiction on market data staleness) is a real documentation inconsistency. However, Finding #5 appears to misread the supervision architecture: the two statements about signal survival ARE consistent when you understand that different crashes cascade differently. Sonnet produced one false positive.

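
To make the "halted = terminated" contradiction concrete, here is a minimal Python sketch; it is not gargoyle code. It assumes a simplified encoding in which each state definition records whether the process is still alive, and each kill-switch mode records whether it needs a live process to do its work. `StateDef`, `Mechanism`, `find_contradictions`, and the `process_alive` flags for `normal` and `degraded` are hypothetical, introduced only for illustration. Only the three states named in this write-up are encoded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StateDef:
    name: str
    process_alive: bool  # what the state definition claims about the process


@dataclass(frozen=True)
class Mechanism:
    name: str
    target_state: str         # state the mechanism is specified to operate in
    needs_live_process: bool  # must a running process execute orders?


def find_contradictions(states, mechanisms):
    """Return mechanisms that need a live process in a state defined as dead."""
    alive = {s.name: s.process_alive for s in states}
    return [
        m.name
        for m in mechanisms
        if m.needs_live_process and not alive[m.target_state]
    ]


# Assumed encoding of the claims as summarized in this finding.
STATES = [
    StateDef("normal", True),
    StateDef("degraded", True),
    StateDef("halted", False),  # "halted" is defined as "process terminated"
]

MECHANISMS = [
    Mechanism("cancel_all", "halted", True),
    Mechanism("FLATTEN", "halted", True),
    Mechanism("LIQUIDATE", "halted", True),
]

print(find_contradictions(STATES, MECHANISMS))
# → ['cancel_all', 'FLATTEN', 'LIQUIDATE']
```

Under this encoding every kill-switch mode is flagged, which mirrors GPT-5's point: either the `halted` definition or the mode requirements has to change.
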
**Key insight: "design coherence" is a NEW analytical category with distinct model strengths.** This is different from assumption-finding (Findings #10-12), race conditions (Finding #13), and cross-component interactions (Finding #14). Coherence checking requires the model to hold MULTIPLE parts of the document in tension with each other and reason about whether they're compatible. Results:

- **GPT-5** was MORE SELECTIVE than in any previous experiment. Only 4 findings vs 10-24 in other tasks. But precision was near-perfect: all 4 are genuine contradictions. This suggests GPT-5's reasoning tokens are being used for VERIFICATION (checking whether apparent contradictions hold up) rather than EXPLORATION (finding more things). The 9K reasoning tokens produced 4 findings vs the usual 10+; GPT-5 is self-editing aggressively.
- **Opus** hit its sweet spot. Coherence checking IS design-tension identification, which is Opus's consistent strength. Finding incoherences requires exactly the kind of "how does this design disagree with itself" reasoning that Opus excels at. It also showed unique self-correction behavior (withdrawing a finding after deeper analysis).
- **Sonnet** was fast but produced a false positive. Coherence checking requires holding multiple document sections in memory simultaneously and reasoning about their compatibility. This is harder than assumption-finding (where you reason about one mechanism at a time) but easier than race conditions (which require sequential temporal reasoning). Sonnet occupies a middle ground.

**Model ranking for design coherence checking:**
1. Claude Opus 4.6: most findings, highest depth, self-correcting (7 valid)
2. GPT-5: fewest findings but near-perfect precision, finds the critical ones (4)
3. Claude Sonnet 4.6: fast screening, but prone to false positives on architectural misreads (4/5 valid)

**This inverts the usual GPT-5 > Opus ordering.** In previous experiments, GPT-5 consistently found MORE issues.
Here, GPT-5 was more selective than Opus. The task type (self-consistency checking) favors Opus's "design tension" reasoning style over GPT-5's "exhaustive exploration" style. GPT-5 apparently uses its reasoning to VERIFY rather than GENERATE when the task is about contradictions rather than gaps.

**Practical implication:** For architecture documents, run coherence checking as a separate pass using Opus as the primary model. GPT-5's higher precision means it's good for confirming which Opus findings are genuine vs overreads. The two-pass approach: Opus generates candidates → GPT-5 validates → result is the intersection plus GPT-5's independent finds.
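
The two-pass merge described above can be sketched with plain set operations. This is a minimal illustration, assuming each finding is reduced to a short string label. The function name `two_pass_result`, the labels, and especially the validation verdicts in the example are hypothetical; GPT-5 was not actually asked to validate Opus's candidates in this experiment.

```python
def two_pass_result(opus_candidates, gpt5_validated, gpt5_independent):
    """Merge findings: (Opus candidates confirmed by GPT-5) plus GPT-5's own finds.

    Returns (accepted, dropped), where dropped holds Opus candidates
    that GPT-5 did not confirm and that may deserve a human look.
    """
    confirmed = opus_candidates & gpt5_validated
    accepted = confirmed | gpt5_independent
    dropped = opus_candidates - gpt5_validated
    return accepted, dropped


# Hypothetical labels loosely based on findings named in this write-up.
opus_candidates = {
    "rest_for_one-fifth-state",
    "pm-stale-data",
    "path-dependent-recovery",
}
# Made-up verdicts, for illustration only.
gpt5_validated = {"rest_for_one-fifth-state", "pm-stale-data"}
gpt5_independent = {"halted-vs-kill-switch"}

accepted, dropped = two_pass_result(
    opus_candidates, gpt5_validated, gpt5_independent
)
print(sorted(accepted))
# → ['halted-vs-kill-switch', 'pm-stale-data', 'rest_for_one-fifth-state']
print(sorted(dropped))
# → ['path-dependent-recovery']
```

Returning the dropped candidates alongside the accepted ones is a design choice of this sketch: it keeps unconfirmed Opus findings visible for manual review instead of discarding them silently.
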