# Finding 15: Design Coherence Analysis

**Date:** 2026-05-03

**Task:** Identify internal design incoherences in gargoyle's `failure-modes.md` (383 lines),
i.e. places where the document's stated principles/invariants are contradicted by its own
specified mechanisms.

**How we used them:** Same document (full text) + same focused analytical question to all
3 models via HAI proxy. Highly structured prompt specifying 5 categories of incoherence
to look for (safety properties not enforced, state machine violations, recovery contradictions,
supervision conflicts, cross-mechanism contradictions). Each finding was required to reference
specific sections. No tools, no project context beyond the document itself.

| Model | Time | Output tokens | Reasoning tokens | Incoherences found |
|---|---|---|---|---|
| Claude Sonnet 4.6 | ~39s | 1,168 | (internal) | 5 |
| Claude Opus 4.6 | ~105s | 3,378 | (internal) | 7 (8 attempted, 1 self-withdrawn) |
| GPT-5 | ~120s | 10,235 | 9,088 | 4 |

**What they found (common ground, all 3 identified):**
- State machine universality claim vs Strategy.Worker crash behavior (process
  crashes bypass the degraded state entirely; the model has no transition path for them)
- Market data staleness being advisory-only vs the "don't trade when ambiguous" principle
  (or vs the concurrent-failure auto-halt)
- `pending_cancel`/`pending_replace` absent from the recovery query set (GPT-5 and
  Sonnet found this directly; Opus addressed the broader state machine gap)

**GPT-5 unique findings (not in either Claude model):**
- Kill switch: halted = "process terminated" vs the kill switch requiring RUNNING
  processes for cancel_all, FLATTEN, and LIQUIDATE modes. The state definition
  claims processes are terminated, but the mechanisms require them alive to
  execute orders. **This is the most architecturally significant finding**: it
  reveals a fundamental definitional error in the state machine.
- Per-symbol degradation contradicts the process-level degradation semantics.
  A worker "enters degraded" but continues operating for non-stale symbols,
  violating the stated definition that degraded = "cannot perform primary
  function." The metrics/eventing model has no per-symbol dimension.

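
The kill-switch contradiction can be made concrete with a small consistency check. The sketch below is illustrative Python, not gargoyle's code: `STATE_ALIVE`, `MODES_NEEDING_LIVE_PROCESS`, and `kill_switch_conflicts` are hypothetical names that encode only what the finding states (halted means terminated; cancel_all, FLATTEN, and LIQUIDATE need running processes).

```python
# Illustrative model only; all names are hypothetical, not taken from gargoyle.

# Whether a process in a given state is still alive, per the state
# definitions as summarized in this finding ("halted" = process terminated).
# The fourth state of the model is not named here, so it is left out.
STATE_ALIVE = {
    "normal": True,
    "degraded": True,
    "halted": False,
}

# Kill-switch modes that, per the finding, need live processes to send orders.
MODES_NEEDING_LIVE_PROCESS = {"cancel_all", "FLATTEN", "LIQUIDATE"}


def kill_switch_conflicts(target_state, state_alive, modes):
    """Return the modes that cannot run if the kill switch puts processes in target_state."""
    if state_alive[target_state]:
        return []
    return sorted(modes)


conflicts = kill_switch_conflicts("halted", STATE_ALIVE, MODES_NEEDING_LIVE_PROCESS)
print(conflicts)  # ['FLATTEN', 'LIQUIDATE', 'cancel_all']
```

Under these assumptions every order-executing mode conflicts with the `halted` definition, which is why the finding reads as a definitional error and not a missing edge case.
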
**Claude Opus unique findings (not in either other model):**
- `:rest_for_one` cascade creates a FIFTH implicit state (terminated-and-restarting)
  not in the four-state model: processes that were `normal` are
  forcibly killed (not by the kill switch) and restart. Separately, Opus self-corrected
  one finding that initially looked like an incoherence but was actually consistent.
- PortfolioMonitor continues evaluating with stale data ("fail-safe") while
  Strategy.Workers are stopped for the SAME condition. This contradicts both the
  universal state machine (PM doesn't transition to degraded) and the doc's
  reasoning about why stale data is dangerous.
- Path-dependent recovery criteria: Strategy.Worker recovery requires 21+ bars
  after a crash but only a "price continuity check" after staleness. The state
  machine's single "catch-up complete" exit condition can't express this.
- `halted → [*]` transition in the state diagram is logically impossible if "halted"
  means the process is already terminated: dead processes can't fire transitions.
- Compound failure detection requires a meta-observer across processes, but the
  per-process state machine model has no way to express cross-process conditions.

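
The path-dependent recovery finding is easier to see with the two exit criteria written side by side. This is a minimal sketch under stated assumptions: the 21-bar threshold and the price continuity check come from the finding above, while `WorkerProgress`, `catch_up_complete`, and the `cause` labels are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class WorkerProgress:
    bars_since_restart: int     # bars observed since the worker came back
    price_continuity_ok: bool   # outcome of the "price continuity check"


# Threshold taken from the finding ("21+ bars after a crash").
MIN_BARS_AFTER_CRASH = 21


def catch_up_complete(cause: str, progress: WorkerProgress) -> bool:
    """Exit condition that depends on how the worker entered the non-normal state."""
    if cause == "crash":
        return progress.bars_since_restart >= MIN_BARS_AFTER_CRASH
    if cause == "staleness":
        return progress.price_continuity_ok
    raise ValueError(f"unknown cause: {cause}")


p = WorkerProgress(bars_since_restart=5, price_continuity_ok=True)
print(catch_up_complete("crash", p))      # False
print(catch_up_complete("staleness", p))  # True
```

The same progress snapshot gives two different answers, so a state machine with one "catch-up complete" edge would have to either remember the entry cause or split the state to express both rules.
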
**Claude Sonnet unique findings (not in either other model):**
- Market data global staleness: the failure table says "Manual (disengage)" for
  recovery, implying automatic engagement happened, but the text says it's
  advisory only. Table contradicts prose.
- ReconciliationGate: the doc claims the gate survives an OM crash (separate supervision
  tree), but then says "missing ETS table = not ready" when OM crashes. If the
  gate survives, why would its table be missing?
- Signal survival claims are contradictory between sections: the worker crash section says
  downstream signals survive, but the OM crash section says all upstream signals are lost.
  (NOTE: these actually describe different scenarios. A worker crash doesn't
  cascade to SignalRisk; an OM crash does, via `:rest_for_one`. Sonnet may have
  misread the architecture here: the two statements are consistent when you
  understand the supervision tree.)

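
The note on the signal-survival false positive rests on how OTP's `:rest_for_one` strategy behaves: when a child terminates, that child and every child started after it are restarted, while children started before it are left alone. The sketch below simulates that rule in Python. The child order and the idea that workers sit under a separate one-for-one supervisor are assumptions made for illustration; gargoyle's actual supervision tree is not shown in this finding.

```python
def rest_for_one_restarts(children, crashed):
    """Children restarted under :rest_for_one: the crashed child and all started after it."""
    idx = children.index(crashed)
    return children[idx:]


def one_for_one_restarts(children, crashed):
    """Children restarted under :one_for_one: only the crashed child."""
    if crashed not in children:
        raise ValueError(f"{crashed} is not supervised here")
    return [crashed]


# Hypothetical start order, chosen so that SignalRisk starts after OrderManager,
# which is what the OM-crash cascade described above would require.
order_pipeline = ["OrderManager", "SignalRisk"]

# Hypothetical: workers assumed to live under their own supervisor.
workers = ["Strategy.Worker.A", "Strategy.Worker.B"]

om_crash = rest_for_one_restarts(order_pipeline, "OrderManager")
worker_crash = one_for_one_restarts(workers, "Strategy.Worker.A")

print("SignalRisk" in om_crash)      # True: OM crash takes SignalRisk down with it
print("SignalRisk" in worker_crash)  # False: worker crash leaves SignalRisk running
```

Under these assumptions both statements in the document hold at once, which matches the assessment that Sonnet's finding is a misread and not a real contradiction.
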
**Quality assessment:**
- **GPT-5** found only 4 incoherences, but TWO of them are genuinely critical
  architectural findings. The "halted = terminated" vs "kill switch requires
  running processes" contradiction is a real design error: you can't both
  terminate processes AND require them to execute cancel/liquidation orders.
  The per-symbol degradation finding is also a real modeling gap. GPT-5 was
  MORE SELECTIVE here than in previous experiments and didn't pad with
  medium-severity findings. Each of its 4 was high/critical.
- **Claude Opus** produced the most findings (7 valid) with characteristic
  depth. Its self-correction (withdrawing finding #6 after deeper analysis)
  shows an intellectual honesty rare in model outputs. The PortfolioMonitor
  stale-data contradiction is genuinely insightful: same input condition,
  opposite response, no justification within the state machine model. The
  compound failure meta-observer finding identifies a modeling category error.
  Opus also found modeling imprecisions (path-dependent recovery, `halted → [*]`
  impossibility) that the other models didn't notice.
- **Claude Sonnet** found 5 issues quickly (39s, 1,168 tokens) but quality was
  mixed. Finding #4 (ReconciliationGate) raises a genuine question about
  the ETS table ownership claim. Finding #1 (table vs prose contradiction on
  market data staleness) is a real documentation inconsistency. However,
  Finding #5 appears to misread the supervision architecture: the two
  statements about signal survival ARE consistent once you understand that
  different crashes cascade differently. Sonnet produced one false positive.

**Key insight: "design coherence" is a NEW analytical category with distinct model strengths.**
This is different from assumption-finding (Finding #10-12), race conditions
(Finding #13), and cross-component interactions (Finding #14). Coherence
checking requires the model to hold MULTIPLE parts of the document in tension
with each other and reason about whether they're compatible. Results:

- **GPT-5** was MORE SELECTIVE than in any previous experiment. Only 4 findings
  vs 10-24 in other tasks. But precision was near-perfect: all 4 are genuine
  contradictions. This suggests GPT-5's reasoning tokens are being used for
  VERIFICATION (checking whether apparent contradictions hold up) rather than
  EXPLORATION (finding more things). The 9K reasoning tokens produced 4 findings
  vs the usual 10+, so GPT-5 appears to be self-editing aggressively.
- **Opus** hit its sweet spot. Coherence checking IS design-tension identification,
  which is Opus's consistent strength. Finding incoherences requires exactly the kind
  of "how does this design disagree with itself" reasoning that Opus excels at.
  It also showed unique self-correction behavior (withdrawing a finding after
  deeper analysis).
- **Sonnet** was fast but produced a false positive. Coherence checking requires
  holding multiple document sections in memory simultaneously and reasoning about
  their compatibility. This is harder than assumption-finding (where you
  reason about one mechanism at a time) but easier than race conditions (which
  require sequential temporal reasoning). Sonnet occupies a middle ground.

**Model ranking for design coherence checking:**
1. Claude Opus 4.6: most findings, highest depth, self-correcting (7 valid)
2. GPT-5: fewest findings but near-perfect precision, finds the critical ones (4)
3. Claude Sonnet 4.6: fast screening, but prone to false positives on
   architectural misreads (4/5 valid)

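
The counts behind this ranking can be tabulated as a quick sanity check. The sketch below uses only the valid/reported counts stated in this finding. The sort key (valid findings first, then precision) is an assumption that happens to reproduce the ranking; the write-up's own ranking also weighs depth and severity, which this does not capture.

```python
# (valid findings, reported findings) per model, as stated in this finding.
results = {
    "Claude Sonnet 4.6": (4, 5),  # one false positive
    "GPT-5": (4, 4),
    "Claude Opus 4.6": (7, 7),    # 8 attempted, 1 self-withdrawn before reporting
}


def precision(valid, reported):
    """Fraction of reported findings that held up on review."""
    return valid / reported


# Assumed ordering rule: more valid findings first, precision as tie-breaker.
ranking = sorted(
    results,
    key=lambda model: (results[model][0], precision(*results[model])),
    reverse=True,
)

for model in ranking:
    valid, reported = results[model]
    print(f"{model}: {valid}/{reported} valid, precision {precision(valid, reported):.2f}")
```

With these numbers, GPT-5 and Sonnet tie on valid findings (4 each) and are separated only by precision, which is the distinction the ranking text draws between "finds the critical ones" and "prone to false positives".
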
**This inverts the usual GPT-5 > Opus ordering.** In previous experiments, GPT-5
consistently found MORE issues. Here, GPT-5 was more selective than Opus. The
task type (self-consistency checking) favors Opus's "design tension" reasoning
style over GPT-5's "exhaustive exploration" style. GPT-5 apparently uses its
reasoning to VERIFY rather than GENERATE when the task is about contradictions
rather than gaps.

**Practical implication:** For architecture documents, run coherence checking as
a separate pass using Opus as the primary model. GPT-5's higher precision makes
it a good check on which Opus findings are genuine vs overreads. The
two-pass approach: Opus generates candidates → GPT-5 validates → the result is the
set of Opus candidates that GPT-5 confirms, plus GPT-5's independent finds.

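
The two-pass combination amounts to a set intersection followed by a union. The sketch below is a minimal illustration: `two_pass` is a hypothetical helper, and the finding labels, including which candidates GPT-5 confirms, are placeholders invented for the example. No such validation pass was run in this experiment.

```python
def two_pass(opus_candidates, gpt5_confirmed, gpt5_independent):
    """Combine passes: Opus candidates that GPT-5 confirms, plus GPT-5's own finds."""
    validated = opus_candidates & gpt5_confirmed
    return validated | gpt5_independent


# Placeholder labels for illustration only.
opus_candidates = {"pm-stale-data", "path-dependent-recovery", "possible-overread"}
gpt5_confirmed = {"pm-stale-data", "path-dependent-recovery"}
gpt5_independent = {"kill-switch-halted", "per-symbol-degradation"}

final = two_pass(opus_candidates, gpt5_confirmed, gpt5_independent)
print(sorted(final))
# ['kill-switch-halted', 'path-dependent-recovery', 'per-symbol-degradation', 'pm-stale-data']
```

An Opus candidate that GPT-5 does not confirm (here `possible-overread`) drops out, while GPT-5's independent finds are kept without needing Opus to agree.
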