model-research/findings/2026-05-03-15-design-coherence-analysis.md
Rodin 6af8a6ee10 refactor(findings): split ALL-FINDINGS.md into per-experiment files
Break the monolithic 3249-line findings file into 29 individual files,
one per experiment. Each file is named YYYY-MM-DD-NN-slug.md for easy
chronological sorting and discovery.

No content changes — purely structural reorganization.
2026-05-06 07:15:50 -07:00

# Finding 15: Design Coherence Analysis

**Date:** 2026-05-03

**Task:** Identify internal design incoherences in gargoyle's `failure-modes.md` (383 lines):
places where the document's stated principles/invariants are contradicted by its own
specified mechanisms.

**How we used them:** Same document (full text) + same focused analytical question to all
3 models via HAI proxy. Highly structured prompt specifying 5 categories of incoherence
to look for (safety properties not enforced, state machine violations, recovery contradictions,
supervision conflicts, cross-mechanism contradictions). Required each finding to reference
specific sections. No tools, no project context beyond the document itself.

| Model | Time | Output tokens | Reasoning tokens | Incoherences found |
|---|---|---|---|---|
| Claude Sonnet 4.6 | ~39s | 1,168 | (internal) | 5 |
| Claude Opus 4.6 | ~105s | 3,378 | (internal) | 7 (8 attempted, 1 self-withdrawn) |
| GPT-5 | ~120s | 10,235 | 9,088 | 4 |
**What they found — common ground (all 3 identified):**
- State machine universality claim vs Strategy.Worker crash behavior (process
crashes bypass the degraded state entirely — no transition path in the model)
- Market data staleness advisory-only vs the "don't trade when ambiguous" principle
(or vs concurrent failure auto-halt)
- `pending_cancel`/`pending_replace` absent from recovery query set (GPT-5 and
Sonnet found this directly; Opus addressed the broader state machine gap)
**GPT-5 unique findings (not in either Claude model):**
- The `halted` state is defined as "process terminated", but the kill switch's
cancel_all, FLATTEN, and LIQUIDATE modes need RUNNING processes to execute
orders. The state definition and the mechanisms cannot both hold. **This is
the most architecturally significant finding**: it reveals a fundamental
definitional error in the state machine.
- Per-symbol degradation contradicts the process-level degradation semantics.
A worker "enters degraded" but continues operating for non-stale symbols —
violating the stated definition that degraded = "cannot perform primary
function." The metrics/eventing model has no per-symbol dimension.
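
The first contradiction can be made concrete with a small model. This is a minimal sketch, not gargoyle's implementation: `ProcState`, `Proc`, and `run_mode` are hypothetical names, only two of the four states are modeled, and it assumes `halted` means exactly what the state definition says (the process is no longer running).

```python
from dataclasses import dataclass
from enum import Enum


class ProcState(Enum):
    # Only the two states needed for the illustration; the others are omitted.
    NORMAL = "normal"
    HALTED = "halted"  # per the state definition: "process terminated"


@dataclass
class Proc:
    name: str  # hypothetical process label
    state: ProcState = ProcState.NORMAL

    @property
    def alive(self) -> bool:
        # Taking the definition literally: a halted process is not running.
        return self.state is not ProcState.HALTED


def run_mode(proc: Proc, mode: str) -> str:
    """Run a kill-switch mode (cancel_all, FLATTEN, LIQUIDATE) on a process."""
    if not proc.alive:
        # A terminated process cannot submit cancel or liquidation orders.
        raise RuntimeError(f"{proc.name} is terminated; cannot run {mode}")
    return f"{proc.name}: {mode} submitted"


if __name__ == "__main__":
    om = Proc("OM")
    print(run_mode(om, "cancel_all"))
    om.state = ProcState.HALTED
    try:
        run_mode(om, "LIQUIDATE")
    except RuntimeError as err:
        print(f"contradiction: {err}")
```

Under the literal definition, every kill-switch mode fails once the process is `halted`, so one of the two statements in the document has to change.
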
**Claude Opus unique findings (not in either other model):**
- `:rest_for_one` cascade creates a FIFTH implicit state
(terminated-and-restarting) not in the four-state model: processes that were
`normal` are forcibly killed (not by kill switch) and restart. (Opus also
withdrew one candidate finding that initially looked like an incoherence but
turned out to be consistent.)
- PortfolioMonitor continues evaluating with stale data ("fail-safe") while
Strategy.Workers are stopped for the SAME condition — contradicts both the
universal state machine (PM doesn't transition to degraded) and the doc's
reasoning about why stale data is dangerous.
- Path-dependent recovery criteria: Strategy.Worker recovery requires 21+ bars
after crash but only "price continuity check" after staleness. The state
machine's single "catch-up complete" exit condition can't express this.
- `halted → [*]` transition in state diagram is logically impossible if "halted"
means the process is already terminated — dead processes can't fire transitions.
- Compound failure detection requires a meta-observer across processes but the
per-process state machine model has no way to express cross-process conditions.
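
The path-dependent recovery finding is easier to see in code. The sketch below is an illustration under assumptions: `RecoveryContext`, `catch_up_complete`, and the `cause` labels are hypothetical, and only the two criteria quoted above (21+ bars after a crash, a price continuity check after staleness) come from the analysis.

```python
from dataclasses import dataclass

MIN_BARS_AFTER_CRASH = 21  # "21+ bars" after a crash


@dataclass
class RecoveryContext:
    cause: str  # hypothetical labels: "crash" or "staleness"
    bars_seen: int = 0
    price_continuity_ok: bool = False


def catch_up_complete(ctx: RecoveryContext) -> bool:
    """Exit condition for recovery; it depends on how recovery was entered."""
    if ctx.cause == "crash":
        return ctx.bars_seen >= MIN_BARS_AFTER_CRASH
    if ctx.cause == "staleness":
        return ctx.price_continuity_ok
    raise ValueError(f"unknown recovery cause: {ctx.cause}")
```

A state machine with a single unparameterized "catch-up complete" transition has nowhere to carry `cause`, which is the gap Opus pointed at.
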
**Claude Sonnet unique findings (not in either other model):**
- Market data global staleness: the failure table says "Manual (disengage)" for
recovery — implying automatic engagement happened — but the text says it's
advisory only. Table contradicts prose.
- ReconciliationGate: doc claims gate survives OM crash (separate supervision
tree), but then says "missing ETS table = not ready" when OM crashes. If the
gate survives, why would its table be missing?
- Signal survival claims are contradictory between sections: worker crash says
downstream signals survive, but OM crash says all upstream signals lost.
(NOTE: this is actually describing different scenarios — worker crash doesn't
cascade to SignalRisk, OM crash does via :rest_for_one. Sonnet may have
misread the architecture here — the two statements are consistent when you
understand the supervision tree.)
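
The note above turns on how OTP's `:rest_for_one` strategy works: when a child terminates, that child and every child started after it are restarted, while children started before it are untouched. A minimal sketch follows. The start order in `CHILDREN` is a hypothetical arrangement chosen to match the behavior described in the note, not the actual gargoyle supervision tree.

```python
def rest_for_one_restarts(children: list[str], crashed: str) -> list[str]:
    """Children restarted under :rest_for_one: the crashed child plus
    every child that was started after it."""
    idx = children.index(crashed)
    return children[idx:]


# Hypothetical start order, for illustration only.
CHILDREN = ["OM", "SignalRisk", "Strategy.Worker"]

if __name__ == "__main__":
    for crashed in ("Strategy.Worker", "OM"):
        print(crashed, "->", rest_for_one_restarts(CHILDREN, crashed))
```

With this ordering, a worker crash restarts only the worker, so signals held by SignalRisk survive. An OM crash restarts everything after it, so those signals are lost. Both statements hold at once.
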
**Quality assessment:**
- **GPT-5** found only 4 incoherences but TWO of them are genuinely critical
architectural findings. The "halted = terminated" vs "kill switch requires
running processes" contradiction is a real design error — you can't both
terminate processes AND require them to execute cancel/liquidation orders.
The per-symbol degradation finding is also a real modeling gap. GPT-5 was
MORE SELECTIVE here than in previous experiments — it didn't pad with
medium-severity findings. Each of its 4 was high/critical.
- **Claude Opus** produced the most findings (7 valid) with characteristic
depth. Its self-correction (withdrawing finding #6 after deeper analysis)
is a verification step neither other model showed here. The PortfolioMonitor
stale-data contradiction is genuinely insightful — same input condition,
opposite response, no justification within the state machine model. The
compound failure meta-observer finding identifies a modeling category error.
Opus also found modeling imprecisions (path-dependent recovery, halted → [*]
impossibility) that the other models didn't notice.
- **Claude Sonnet** found 5 issues quickly (39s, 1,168 tokens) but quality was
mixed. Finding #4 (ReconciliationGate) raises a genuine question about
the ETS table ownership claim. Finding #1 (table vs prose contradiction on
market data staleness) is a real documentation inconsistency. However,
Finding #5 appears to misread the supervision architecture — the two
statements about signal survival ARE consistent when you understand that
different crashes cascade differently. Sonnet produced one false positive.
**Key insight — "design coherence" is a NEW analytical category with distinct model strengths:**
This is different from assumption-finding (Findings #10-12), race conditions
(Finding #13), and cross-component interactions (Finding #14). Coherence
checking requires the model to hold MULTIPLE parts of the document in tension
with each other and reason about whether they're compatible. Results:
- **GPT-5** was MORE SELECTIVE than in any previous experiment: only 4 findings
vs 10-24 in other tasks, and all 4 are genuine contradictions. This suggests
GPT-5's reasoning tokens are being used for VERIFICATION (checking whether
apparent contradictions hold up) rather than EXPLORATION (finding more
things). With ~9K reasoning tokens behind 4 findings, GPT-5 appears to be
self-editing aggressively.
- **Opus** hit its sweet spot. Coherence checking IS design-tension
identification ("how does this design disagree with itself"), which has been
Opus's consistent strength. It was also the only model to self-correct,
withdrawing a finding after deeper analysis.
- **Sonnet** was fast but produced a false positive. Coherence checking requires
holding multiple document sections in memory simultaneously and reasoning about
their compatibility — this is harder than assumption-finding (where you
reason about one mechanism at a time) but easier than race conditions (which
require sequential temporal reasoning). Sonnet occupies a middle ground.
**Model ranking for design coherence checking:**
1. Claude Opus 4.6 — most findings, highest depth, self-correcting (7 valid)
2. GPT-5 — fewest findings but near-perfect precision, finds the critical ones (4)
3. Claude Sonnet 4.6 — fast screening, but prone to false positives on
architectural misreads (4/5 valid)
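
The counts behind this ranking can be put side by side. The sketch below uses only numbers stated in this finding. It assumes "reported" means findings left in the final answer (so Opus's self-withdrawn finding is excluded) and uses the output-token column as given; `precision` and `tokens_per_valid` are hypothetical helper names.

```python
# Reported/valid counts and output tokens as stated in this finding.
RESULTS = {
    "Claude Sonnet 4.6": {"reported": 5, "valid": 4, "output_tokens": 1168},
    "Claude Opus 4.6": {"reported": 7, "valid": 7, "output_tokens": 3378},
    "GPT-5": {"reported": 4, "valid": 4, "output_tokens": 10235},
}


def precision(row: dict) -> float:
    """Share of reported findings that held up on review."""
    return row["valid"] / row["reported"]


def tokens_per_valid(row: dict) -> float:
    """Output tokens spent per valid finding."""
    return row["output_tokens"] / row["valid"]


if __name__ == "__main__":
    for model, row in RESULTS.items():
        print(f"{model}: precision={precision(row):.2f}, "
              f"tokens/valid={tokens_per_valid(row):.0f}")
```
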
**This inverts the usual GPT-5 > Opus ordering.** In previous experiments, GPT-5
consistently found MORE issues. Here, GPT-5 was more selective than Opus. The
task type (self-consistency checking) favors Opus's "design tension" reasoning
style over GPT-5's "exhaustive exploration" style. GPT-5 apparently uses its
reasoning to VERIFY rather than GENERATE when the task is about contradictions
rather than gaps.
**Practical implication:** For architecture documents, run coherence checking as
a separate pass using Opus as the primary model. GPT-5's higher precision makes
it a good validator for separating genuine Opus findings from overreads. The
two-pass approach is: Opus generates candidates → GPT-5 validates → the result
is the intersection plus GPT-5's independent finds.
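
The combination rule in the two-pass approach reduces to two set operations. This is a minimal sketch: `two_pass` and the finding IDs are hypothetical placeholders, and how a validator's verdicts are collected is left out.

```python
def two_pass(
    opus_candidates: set[str],
    gpt5_confirmed: set[str],
    gpt5_independent: set[str],
) -> set[str]:
    """Keep the Opus candidates that GPT-5 confirms, then add the
    findings GPT-5 produced on its own."""
    validated = opus_candidates & gpt5_confirmed
    return validated | gpt5_independent


if __name__ == "__main__":
    # Placeholder IDs, not real findings from this experiment.
    final = two_pass(
        opus_candidates={"A", "B", "C"},
        gpt5_confirmed={"A", "C"},
        gpt5_independent={"D"},
    )
    print(sorted(final))
```

Candidate `B` is dropped as an unconfirmed overread, and `D` enters the result even though Opus never proposed it.
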