Files
model-research/findings/2026-05-07-37-crossdoc-consistency-tightlycoupled-risk-docs.md
T
claw 58e69e21f8 finding 37: cross-doc consistency on tightly coupled risk docs
Tested kill-switch.md + escalation-policy.md (same bounded context,
shared vocabulary). Key insight: shared vocabulary claims are the most
dangerous inconsistency — same words with opposite severity ordering.

Opus found the severity-ordering inversion (restrict/liquidate ladders
run in opposite directions). GPT-5 found the meta-issue (the 'same
vocabulary' claim is itself the problem). Sonnet fast but shallow.

Tightly coupled docs produce more Critical findings than loosely coupled
ones (Finding #28).
2026-05-07 04:29:23 -07:00

139 lines
8.7 KiB
Markdown

# Finding 37: Cross-document consistency on tightly coupled risk docs — shared vocabulary creates critical contradictions invisible to each document's author
**Date:** 2026-05-07
**Task:** Identify contradictions, inconsistencies, or semantic conflicts between
gargoyle's `kill-switch.md` (293 lines) and `escalation-policy.md` (228 lines) — two
documents in the SAME bounded context (risk) that describe related subsystems and
explicitly reference each other.
**How we used them:** Both documents provided as full text in a single prompt (~29KB).
Structured prompt specifying 5 categories (direct contradictions, semantic mismatches,
boundary ambiguities, escalation path conflicts, failure mode contradictions). Required
specific output format per finding. No tools, no project context beyond the two
documents. Same analytical task type as Finding #28, but testing with TIGHTLY COUPLED
documents from the same domain vs Finding #28's loosely coupled docs (overview vs
architecture).
| Model | Time | Output tokens | Reasoning tokens | Findings | Critical | High | Medium |
|---|---|---|---|---|---|---|---|
| GPT-5 | 185s | 11,149 | 9,792 | 6 | 2 | 2 | 1(+1 Med-Hi) |
| Claude Opus 4.6 | 54s | 2,410 | (internal) | 7 | 5 | 2 | 0 |
| Claude Sonnet 4 | 23s | 1,049 | (internal) | 6 | 3 | 2 | 1 |
**Note on models:** This experiment used Claude Sonnet 4 (not 4.6 as in previous
experiments) and Claude Opus 4.6. GPT-5 remains the same.
**What they found — common ground (all 3 identified):**
1. **"Restrict" means fundamentally different things** — Kill-switch.md defines Restrict
as decision engine TERMINATED + reject-all policy (total shutdown). Escalation-policy.md
uses Restrict as a live monitoring level where the system continues evaluating metrics
and can autonomously escalate to Liquidate. All three models rated this Critical.
2. **Liquidation execution model conflict** — Kill-switch.md describes liquidation as a
manual operator action (Step 2: operator triggers from dashboard). Escalation-policy.md
describes autonomous liquidation orders submitted by the system with iterative sizing.
All three models found this (Critical).
3. **Restrict→Liquidate transition: manual vs autonomous** — Kill-switch.md explicitly
states "The system never makes this transition automatically." Escalation-policy.md
defines an automatic Restrict→Liquidate transition with debounce=3 evaluations.
GPT-5 and Sonnet identified this directly; Opus captured it within the broader
severity-ordering finding.
4. **Kill switch engagement prerequisites conflict** — Kill-switch.md says automated
systems can engage on detecting a breach (any breach). Escalation-policy.md constrains
automated kill switch engagement to post-liquidation-failure only. GPT-5 and Sonnet
flagged this as High; Opus captured it as part of the escalation ladder inversion.
5. **Monitor crash + persisted state = split-brain** — Kill-switch state survives
crashes per kill-switch.md, but escalation-policy.md says monitor restarts from Clear.
All three models identified this creates an inconsistent recovery state.
**GPT-5 unique findings:**
- The "shared escalation vocabulary" claim in kill-switch.md explicitly asserts these
modes "describe the same outcomes" as continuous risk monitoring — but they DON'T.
GPT-5 specifically flagged this meta-claim as the ROOT of the contradictions. Neither
Claude model explicitly attacked the "same vocabulary" assertion as itself being the
problem.
- De-escalation semantics conflict: automatic Restrict→Alert cooldown in
escalation-policy.md vs "all disengagement requires operator action" in kill-switch.md.
Framed this as a principle-level contradiction.
**Claude Opus unique findings:**
- **Severity ordering inversion** (most architecturally significant finding across all
models): Kill-switch.md treats Restrict as MORE severe than Liquidate (you de-escalate
FROM Restrict TO Liquidate when you verify state). Escalation-policy.md treats
Liquidate as MORE severe than Restrict (Restrict escalates TO Liquidate on sustained
breach). THE ESCALATION LADDERS RUN IN OPPOSITE DIRECTIONS. This is genuinely a
Critical design flaw — it means the two subsystems fundamentally disagree about which
state is worse.
- **Decision engine liveness contradiction**: Escalation-policy.md's de-escalation from
Restrict requires ongoing metric evaluation (Portfolio Risk alive). Kill-switch.md
terminates Portfolio Risk on engagement in Restrict mode. The de-escalation mechanism
is impossible if the termination boundary is implemented as specified.
- **Acceptance policy gap**: Escalation-policy.md never specifies which acceptance policy
its Restrict level sets, creating an implementation gap when combined with
kill-switch.md's explicit Restrict=reject-all mapping.
**Claude Sonnet 4 unique findings:**
- Kill switch engagement authority: unclear whether risk monitors can bypass the
escalation policy or must flow through debounce/cooldown logic. (Medium severity
boundary question)
- Sonnet noted the ambiguity about whether kill switch "liquidate mode" and escalation
policy "liquidate level" are the same state or different states sharing a name —
framed this as the core namespace collision problem.
**Key insight — shared vocabulary is the most dangerous form of inconsistency:**
The documents CLAIM to share vocabulary ("These use the same escalation vocabulary as
continuous risk monitoring because they describe the same outcomes"). This creates a
false sense of consistency. Implementers reading kill-switch.md would assume Restrict/
Liquidate mean what escalation-policy.md says, and vice versa — but they DON'T.
Opus's severity-ordering inversion finding is the clearest demonstration: one document
has a ladder going Restrict→Liquidate (escalating), the other has Restrict←Liquidate
(de-escalating from Restrict into Liquidate). Same words, opposite directions. This is
the kind of bug that passes design review because everyone reads one document at a time
and the terms "feel" consistent.
**Comparison to Finding #28 (loosely coupled docs):**
Finding #28 tested system-overview.md + architecture.md (different abstraction levels,
different concerns). That experiment found 6-7 inconsistencies with the models
performing similarly to here. Key differences:
1. **Tightly coupled docs produce MORE Critical findings.** Finding #28: 1-3 Critical
per model. This experiment: 2-5 Critical per model. When documents describe the same
domain and cross-reference each other, contradictions are more likely to cause real
implementation bugs.
2. **Shared terminology amplifies contradictions.** The most dangerous findings here
(severity ordering, Restrict semantics) arise specifically because both documents USE
THE SAME WORDS with different meanings. Finding #28's docs used different terminology
for different concepts, making contradictions more obvious.
3. **Opus performs better on tightly coupled docs.** In Finding #28, Opus found 7
inconsistencies (similar to GPT-5's 6). Here, Opus found 7 findings with 5 Critical
— outperforming GPT-5 (6 findings, 2 Critical) in severity accuracy. Opus excels
when the contradictions require reasoning about how terms relate across documents.
**Model performance on cross-document consistency (updated hierarchy):**
For tightly coupled documents with shared vocabulary:
1. **Claude Opus 4.6** — Best at finding the DEEP contradictions (severity ordering
inversion, liveness impossibility). Fewer findings but higher proportion of Critical-
severity. Reasons about semantic relationships between document claims.
2. **GPT-5** — Most thorough and finds meta-level issues (the "same vocabulary" claim
itself is problematic). Good at principle-level contradictions. Reasoning tokens
help trace implication chains.
3. **Claude Sonnet 4** — Fast (23s vs 54-185s), finds the core issues correctly, but
misses the more subtle architectural implications. Good for first-pass screening.
**Practical implication for documentation review:**
Cross-document consistency analysis is most valuable (and finds the most Critical bugs)
when applied to tightly coupled documents that SHARE VOCABULARY. The anti-pattern to
hunt for: documents that claim alignment ("uses the same vocabulary as X") but define
terms in subtly incompatible ways. This is more dangerous than documents that use
obviously different terminology — different words signal difference; same words mask it.
**Recommendation:** When a document claims terminological alignment with another ("uses
the same vocabulary as..."), that's the strongest signal to run cross-document
consistency analysis. The claim of shared vocabulary is ITSELF a risk factor for hidden
contradictions.