Files
model-research/findings/2026-05-03-15-design-coherence-analysis.md
Rodin 6af8a6ee10 refactor(findings): split ALL-FINDINGS.md into per-experiment files
Break the monolithic 3249-line findings file into 29 individual files,
one per experiment. Each file is named YYYY-MM-DD-NN-slug.md for easy
chronological sorting and discovery.

No content changes — purely structural reorganization.
2026-05-06 07:15:50 -07:00

8.5 KiB

Finding 15: Design Coherence Analysis

Date: 2026-05-03 Task: Identify internal design incoherences in gargoyle's failure-modes.md (383 lines) — places where the document's stated principles/invariants are contradicted by its own specified mechanisms. How we used them: Same document (full text) + same focused analytical question to all 3 models via HAI proxy. Highly structured prompt specifying 5 categories of incoherence to look for (safety properties not enforced, state machine violations, recovery contradictions, supervision conflicts, cross-mechanism contradictions). Required each finding to reference specific sections. No tools, no project context beyond the document itself.

Model Time Output tokens Reasoning tokens Incoherences found
Claude Sonnet 4.6 ~39s 1,168 (internal) 5
Claude Opus 4.6 ~105s 3,378 (internal) 7 (8 attempted, 1 self-withdrawn)
GPT-5 ~120s 10,235 9,088 4

What they found — common ground (all 3 identified):

  • State machine universality claim vs Strategy.Worker crash behavior (process crashes bypass the degraded state entirely — no transition path in the model)
  • Market data staleness advisory-only vs the "don't trade when ambiguous" principle (or vs concurrent failure auto-halt)
  • pending_cancel/pending_replace absent from recovery query set (GPT-5 and Sonnet found this directly; Opus addressed the broader state machine gap)

GPT-5 unique findings (not in either Claude model):

  • Kill switch halted = "process terminated" vs kill switch requiring RUNNING processes for cancel_all, FLATTEN, and LIQUIDATE modes. The state definition claims processes are terminated, but the mechanisms require them alive to execute orders. This is the most architecturally significant finding — it reveals a fundamental definitional error in the state machine.
  • Per-symbol degradation contradicts the process-level degradation semantics. A worker "enters degraded" but continues operating for non-stale symbols — violating the stated definition that degraded = "cannot perform primary function." The metrics/eventing model has no per-symbol dimension.

Claude Opus unique findings (not in either other model):

  • :rest_for_one cascade creates a FIFTH implicit state (terminated-and- restarting) not in the four-state model — processes that were normal are forcibly killed (not by kill switch) and restart. Self-corrected one finding that initially looked like incoherence but was actually consistent.
  • PortfolioMonitor continues evaluating with stale data ("fail-safe") while Strategy.Workers are stopped for the SAME condition — contradicts both the universal state machine (PM doesn't transition to degraded) and the doc's reasoning about why stale data is dangerous.
  • Path-dependent recovery criteria: Strategy.Worker recovery requires 21+ bars after crash but only "price continuity check" after staleness. The state machine's single "catch-up complete" exit condition can't express this.
  • halted → [*] transition in state diagram is logically impossible if "halted" means the process is already terminated — dead processes can't fire transitions.
  • Compound failure detection requires a meta-observer across processes but the per-process state machine model has no way to express cross-process conditions.

Claude Sonnet unique findings (not in either other model):

  • Market data global staleness: the failure table says "Manual (disengage)" for recovery — implying automatic engagement happened — but the text says it's advisory only. Table contradicts prose.
  • ReconciliationGate: doc claims gate survives OM crash (separate supervision tree), but then says "missing ETS table = not ready" when OM crashes. If the gate survives, why would its table be missing?
  • Signal survival claims are contradictory between sections: worker crash says downstream signals survive, but OM crash says all upstream signals lost. (NOTE: this is actually describing different scenarios — worker crash doesn't cascade to SignalRisk, OM crash does via :rest_for_one. Sonnet may have misread the architecture here — the two statements are consistent when you understand the supervision tree.)

Quality assessment:

  • GPT-5 found only 4 incoherences but TWO of them are genuinely critical architectural findings. The "halted = terminated" vs "kill switch requires running processes" contradiction is a real design error — you can't both terminate processes AND require them to execute cancel/liquidation orders. The per-symbol degradation finding is also a real modeling gap. GPT-5 was MORE SELECTIVE here than in previous experiments — it didn't pad with medium-severity findings. Each of its 4 was high/critical.
  • Claude Opus produced the most findings (7 valid) with characteristic depth. Its self-correction (withdrawing finding #6 after deeper analysis) shows intellectual honesty rare in model outputs. The PortfolioMonitor stale-data contradiction is genuinely insightful — same input condition, opposite response, no justification within the state machine model. The compound failure meta-observer finding identifies a modeling category error. Opus also found modeling imprecisions (path-dependent recovery, halted → [*] impossibility) that the other models didn't notice.
  • Claude Sonnet found 5 issues quickly (39s, 1,168 tokens) but quality was mixed. Finding #4 (ReconciliationGate) raises a genuine question about the ETS table ownership claim. Finding #1 (table vs prose contradiction on market data staleness) is a real documentation inconsistency. However, Finding #5 appears to misread the supervision architecture — the two statements about signal survival ARE consistent when you understand that different crashes cascade differently. Sonnet produced one false positive.

Key insight — "design coherence" is a NEW analytical category with distinct model strengths: This is different from assumption-finding (Finding #10-12), race conditions (Finding #13), and cross-component interactions (Finding #14). Coherence checking requires the model to hold MULTIPLE parts of the document in tension with each other and reason about whether they're compatible. Results:

  • GPT-5 was MORE SELECTIVE than in any previous experiment. Only 4 findings vs 10-24 in other tasks. But precision was near-perfect — all 4 are genuine contradictions. This suggests GPT-5's reasoning tokens are being used for VERIFICATION (checking whether apparent contradictions hold up) rather than EXPLORATION (finding more things). The 9K reasoning tokens produced 4 findings vs the usual 10+ — GPT-5 is self-editing aggressively.
  • Opus hit its sweet spot. Coherence checking IS design-tension identification — Opus's consistent strength. Finding incoherences requires exactly the kind of "how does this design disagree with itself" reasoning that Opus excels at. It also showed unique self-correction behavior (withdrawing a finding after deeper analysis).
  • Sonnet was fast but produced a false positive. Coherence checking requires holding multiple document sections in memory simultaneously and reasoning about their compatibility — this is harder than assumption-finding (where you reason about one mechanism at a time) but easier than race conditions (which require sequential temporal reasoning). Sonnet occupies a middle ground.

Model ranking for design coherence checking:

  1. Claude Opus 4.6 — most findings, highest depth, self-correcting (7 valid)
  2. GPT-5 — fewest findings but near-perfect precision, finds the critical ones (4)
  3. Claude Sonnet 4.6 — fast screening, but prone to false positives on architectural misreads (4/5 valid)

This inverts the usual GPT-5 > Opus ordering. In previous experiments, GPT-5 consistently found MORE issues. Here, GPT-5 was more selective than Opus. The task type (self-consistency checking) favors Opus's "design tension" reasoning style over GPT-5's "exhaustive exploration" style. GPT-5 apparently uses its reasoning to VERIFY rather than GENERATE when the task is about contradictions rather than gaps.

Practical implication: For architecture documents, run coherence checking as a separate pass using Opus as the primary model. GPT-5's higher precision means it's good for confirming which Opus findings are genuine vs overreads. The two-pass approach: Opus generates candidates → GPT-5 validates → result is the intersection plus GPT-5's independent finds.