model-research/findings/2026-05-03-15-design-coherence-analysis.md

# Finding 15: Design Coherence Analysis

**Date:** 2026-05-03
**Task:** Identify internal design incoherences in gargoyle's `failure-modes.md` (383 lines)
— places where the document's stated principles/invariants are contradicted by its own
specified mechanisms.
**How we used the models:** All 3 models received the same document (full text) and the
same focused analytical question via HAI proxy. The prompt was highly structured: it
specified 5 categories of incoherence to look for (safety properties not enforced, state
machine violations, recovery contradictions, supervision conflicts, cross-mechanism
contradictions) and required each finding to reference specific sections. No tools, no
project context beyond the document itself.
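
The prompt structure described above can be sketched as a small builder. This is an illustration only, not the prompt actually used: the function name `build_coherence_prompt`, the instruction wording, and the document delimiter are assumptions. Only the five category names and the section-reference requirement come from the description above.

```python
# The five categories named in this finding's task description.
CATEGORIES = [
    "safety properties not enforced",
    "state machine violations",
    "recovery contradictions",
    "supervision conflicts",
    "cross-mechanism contradictions",
]


def build_coherence_prompt(document_text: str) -> str:
    """Build one prompt string to send unchanged to every model.

    The wording below is hypothetical; the real prompt is not reproduced
    in this note.
    """
    category_lines = "\n".join(
        f"{i}. {name}" for i, name in enumerate(CATEGORIES, start=1)
    )
    return (
        "Identify places where this document's stated principles or "
        "invariants are contradicted by its own specified mechanisms.\n"
        "Look for these categories of incoherence:\n"
        f"{category_lines}\n"
        "Each finding must reference the specific sections involved.\n\n"
        "--- DOCUMENT ---\n"
        f"{document_text}"
    )
```

Keeping the prompt in one function makes it easy to confirm that every model saw byte-identical input, which is what makes the comparison in the table meaningful.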

| Model | Time | Output tokens | Reasoning tokens | Incoherences found |
|---|---|---|---|---|
| Claude Sonnet 4.6 | ~39s | 1,168 | (internal) | 5 |
| Claude Opus 4.6 | ~105s | 3,378 | (internal) | 7 (8 attempted, 1 self-withdrawn) |
| GPT-5 | ~120s | 10,235 | 9,088 | 4 |
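
As a rough way to read the table, cost per finding can be computed directly from its columns. This sketch uses only the numbers above; the times are approximate, and the Claude reasoning tokens are not reported, so the token figures are not strictly comparable across vendors.

```python
# (seconds, output_tokens, incoherences_found), copied from the table above.
# Times are the approximate values reported there.
RUNS = {
    "Claude Sonnet 4.6": (39, 1168, 5),
    "Claude Opus 4.6": (105, 3378, 7),
    "GPT-5": (120, 10235, 4),
}


def per_finding_cost(runs):
    """Return {model: (tokens_per_finding, seconds_per_finding)}."""
    return {
        model: (tokens / found, seconds / found)
        for model, (seconds, tokens, found) in runs.items()
    }


for model, (tok, sec) in per_finding_cost(RUNS).items():
    print(f"{model}: {tok:.0f} tokens/finding, {sec:.1f} s/finding")
```

This works out to roughly 234, 483, and 2,559 output tokens per finding for Sonnet, Opus, and GPT-5 respectively, and about 7.8 s, 15.0 s, and 30.0 s per finding.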

**What they found — common ground (all 3 identified):**
- State machine universality claim vs Strategy.Worker crash behavior (process
  crashes bypass the degraded state entirely — no transition path in the model)
- Market data staleness advisory-only vs the "don't trade when ambiguous" principle
  (or vs concurrent failure auto-halt)
- `pending_cancel`/`pending_replace` absent from recovery query set (GPT-5 and
  Sonnet found this directly; Opus covered it only indirectly, through the
  broader state machine gap)

**GPT-5 unique findings (not in either Claude model):**
- Kill switch halted = "process terminated" vs kill switch requiring RUNNING
  processes for cancel_all, FLATTEN, and LIQUIDATE modes. The state definition
  claims processes are terminated, but the mechanisms require them alive to
  execute orders. **This is the most architecturally significant finding** — it
  reveals a fundamental definitional error in the state machine.
- Per-symbol degradation contradicts the process-level degradation semantics.
  A worker "enters degraded" but continues operating for non-stale symbols —
  violating the stated definition that degraded = "cannot perform primary
  function." The metrics/eventing model has no per-symbol dimension.
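
The first contradiction above can be made concrete with a toy model. This is not gargoyle's code: the `Process` class, the `kill_switch` function, and the assumption that the kill switch moves the process to `halted` before running the mode's actions are all hypothetical. Only the "halted = process terminated" definition and the three mode names come from the finding.

```python
from dataclasses import dataclass

# Modes that, per the finding, need a running process to submit orders.
MODES_NEEDING_LIVE_PROCESS = {"cancel_all", "FLATTEN", "LIQUIDATE"}


@dataclass
class Process:
    state: str = "normal"

    @property
    def alive(self) -> bool:
        # The state definition under review: halted = "process terminated".
        return self.state != "halted"


def kill_switch(proc: Process, mode: str) -> str:
    """Halt the process, then attempt the mode's order actions.

    The halt-then-act ordering is an assumption made for illustration.
    """
    proc.state = "halted"
    if mode in MODES_NEEDING_LIVE_PROCESS and not proc.alive:
        raise RuntimeError(f"{mode} needs a running process, but it is halted")
    return "done"
```

Under these assumptions every one of the three modes raises, which is the contradiction in miniature: either `halted` cannot mean "terminated", or these modes must complete before the process reaches `halted`.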

**Claude Opus unique findings (not in either other model):**
- `:rest_for_one` cascade creates a FIFTH implicit state
  (terminated-and-restarting) not in the four-state model: processes that were
  `normal` are forcibly killed (not by the kill switch) and restart.
  (Separately, Opus self-corrected one candidate finding that initially looked
  like an incoherence but was actually consistent.)
- PortfolioMonitor continues evaluating with stale data ("fail-safe") while
  Strategy.Workers are stopped for the SAME condition — contradicts both the
  universal state machine (PM doesn't transition to degraded) and the doc's
  reasoning about why stale data is dangerous.
- Path-dependent recovery criteria: Strategy.Worker recovery requires 21+ bars
  after crash but only "price continuity check" after staleness. The state
  machine's single "catch-up complete" exit condition can't express this.
- `halted → [*]` transition in state diagram is logically impossible if "halted"
  means the process is already terminated — dead processes can't fire transitions.
- Compound failure detection requires a meta-observer across processes but the
  per-process state machine model has no way to express cross-process conditions.
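
The path-dependent recovery finding can be illustrated with a short sketch. The type and function names here are hypothetical, as are the `"crash"` and `"staleness"` labels; the 21-bar threshold and the price continuity check are the two criteria quoted in the finding.

```python
from dataclasses import dataclass


@dataclass
class RecoveryContext:
    cause: str                     # "crash" or "staleness" (labels assumed)
    bars_seen: int = 0             # bars received since recovery began
    price_continuous: bool = False  # result of the price continuity check


def catch_up_complete(ctx: RecoveryContext) -> bool:
    """Exit condition that depends on how the worker entered recovery."""
    if ctx.cause == "crash":
        return ctx.bars_seen >= 21       # "21+ bars" after a crash
    if ctx.cause == "staleness":
        return ctx.price_continuous      # continuity check after staleness
    raise ValueError(f"unknown cause: {ctx.cause}")
```

The point of the sketch is that the exit condition needs the entry cause as an input. A state machine with a single, context-free "catch-up complete" transition has nowhere to carry that information.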

**Claude Sonnet unique findings (not in either other model):**
- Market data global staleness: the failure table says "Manual (disengage)" for
  recovery — implying automatic engagement happened — but the text says it's
  advisory only. Table contradicts prose.
- ReconciliationGate: doc claims gate survives OM crash (separate supervision
  tree), but then says "missing ETS table = not ready" when OM crashes. If the
  gate survives, why would its table be missing?
- Signal survival claims are contradictory between sections: worker crash says
  downstream signals survive, but OM crash says all upstream signals lost.
  (NOTE: this is actually describing different scenarios — worker crash doesn't
  cascade to SignalRisk, OM crash does via :rest_for_one. Sonnet may have
  misread the architecture here — the two statements are consistent when you
  understand the supervision tree.)

**Quality assessment:**
- **GPT-5** found only 4 incoherences but TWO of them are genuinely critical
  architectural findings. The "halted = terminated" vs "kill switch requires
  running processes" contradiction is a real design error — you can't both
  terminate processes AND require them to execute cancel/liquidation orders.
  The per-symbol degradation finding is also a real modeling gap. GPT-5 was
  MORE SELECTIVE here than in previous experiments — it didn't pad with
  medium-severity findings. Each of its 4 was high/critical.
- **Claude Opus** produced the most findings (7 valid) with characteristic
  depth. Its self-correction (withdrawing finding #6 after deeper analysis)
  shows intellectual honesty rare in model outputs. The PortfolioMonitor
  stale-data contradiction is genuinely insightful — same input condition,
  opposite response, no justification within the state machine model. The
  compound failure meta-observer finding identifies a modeling category error.
  Opus also found modeling imprecisions (path-dependent recovery, halted → [*]
  impossibility) that the other models didn't notice.
- **Claude Sonnet** found 5 issues quickly (39s, 1,168 tokens) but quality was
  mixed. Finding #4 (ReconciliationGate) raises a genuine question about
  the ETS table ownership claim. Finding #1 (table vs prose contradiction on
  market data staleness) is a real documentation inconsistency. However,
  Finding #5 appears to misread the supervision architecture — the two
  statements about signal survival ARE consistent when you understand that
  different crashes cascade differently. Sonnet produced one false positive.

**Key insight — "design coherence" is a NEW analytical category with distinct model strengths:**
This is different from assumption-finding (Findings #10-12), race conditions
(Finding #13), and cross-component interactions (Finding #14). Coherence
checking requires the model to hold MULTIPLE parts of the document in tension
with each other and reason about whether they're compatible. Results:

- **GPT-5** was MORE SELECTIVE than in any previous experiment: only 4 findings
  vs 10-24 in other tasks. Precision was high, with all 4 judged genuine
  contradictions. This suggests GPT-5's reasoning tokens are being used for
  VERIFICATION (checking whether apparent contradictions hold up) rather than
  EXPLORATION (finding more things). The 9K reasoning tokens produced 4 findings
  vs the usual 10+, which is consistent with aggressive self-editing.
- **Opus** hit its sweet spot. Coherence checking IS design-tension identification
  — Opus's consistent strength. Finding incoherences requires exactly the kind
  of "how does this design disagree with itself" reasoning that Opus excels at.
  It also showed unique self-correction behavior (withdrawing a finding after
  deeper analysis).
- **Sonnet** was fast but produced a false positive. Coherence checking requires
  holding multiple document sections in memory simultaneously and reasoning about
  their compatibility — this is harder than assumption-finding (where you
  reason about one mechanism at a time) but easier than race conditions (which
  require sequential temporal reasoning). Sonnet occupies a middle ground.

**Model ranking for design coherence checking:**
1. Claude Opus 4.6 — most findings, highest depth, self-correcting (7 valid)
2. GPT-5 — fewest findings but near-perfect precision, finds the critical ones (4)
3. Claude Sonnet 4.6 — fast screening, but prone to false positives on
   architectural misreads (4/5 valid)

**This inverts the usual GPT-5 > Opus ordering.** In previous experiments, GPT-5
consistently found MORE issues. Here, GPT-5 was more selective than Opus. The
task type (self-consistency checking) favors Opus's "design tension" reasoning
style over GPT-5's "exhaustive exploration" style. GPT-5 apparently uses its
reasoning to VERIFY rather than GENERATE when the task is about contradictions
rather than gaps.

**Practical implication:** For architecture documents, run coherence checking as
a separate pass using Opus as the primary model. GPT-5's higher precision makes
it well suited to confirming which Opus findings are genuine and which are
overreads. The two-pass approach is: Opus generates candidates → GPT-5 validates
them → the result is the set of Opus candidates that GPT-5 confirms, plus
GPT-5's independent finds.
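
The two-pass combination rule can be sketched as plain list and set operations. The function name, argument names, and the placeholder finding IDs in the usage note are assumptions for illustration; they are not results from this experiment.

```python
def two_pass_result(opus_candidates, gpt5_confirmed, gpt5_independent):
    """Combine a generate-then-validate review into one finding list.

    opus_candidates:  findings proposed in pass 1 (ordered)
    gpt5_confirmed:   candidates that GPT-5 judged genuine in pass 2
    gpt5_independent: findings GPT-5 raised on its own
    """
    confirmed_set = set(gpt5_confirmed)
    # Intersection: keep only candidates that survived validation,
    # preserving the order in which pass 1 produced them.
    kept = [f for f in opus_candidates if f in confirmed_set]
    # Append GPT-5's own findings, skipping any already kept.
    extras = [f for f in gpt5_independent if f not in kept]
    return kept + extras
```

For example, `two_pass_result(["A", "B", "C"], {"A", "C"}, ["D", "A"])` returns `["A", "C", "D"]`. Note that under this rule an Opus candidate that GPT-5 does not confirm is dropped, so it may be worth logging the rejected candidates separately for human review.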