model-research/findings/2026-05-07-37-crossdoc-consistency-tightlycoupled-risk-docs.md

# Finding 37: Cross-document consistency on tightly coupled risk docs — shared vocabulary creates critical contradictions invisible to each document's author

**Date:** 2026-05-07
**Task:** Identify contradictions, inconsistencies, or semantic conflicts between
gargoyle's `kill-switch.md` (293 lines) and `escalation-policy.md` (228 lines) — two
documents in the SAME bounded context (risk) that describe related subsystems and
explicitly reference each other.
**How we used them:** Both documents provided as full text in a single prompt (~29KB).
Structured prompt specifying 5 categories (direct contradictions, semantic mismatches,
boundary ambiguities, escalation path conflicts, failure mode contradictions). Required
specific output format per finding. No tools, no project context beyond the two
documents. Same analytical task type as Finding #28, but with TIGHTLY COUPLED
documents from the same domain, where Finding #28 used loosely coupled docs (overview
vs architecture).
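
A minimal sketch of how a prompt with this shape could be assembled. The five category names come from the description above; the instruction wording, the per-finding output fields, and the `build_prompt` helper are hypothetical stand-ins, since the exact prompt text is not reproduced in this note.

```python
# Categories as listed in this finding; everything else here is illustrative.
CATEGORIES = [
    "direct contradictions",
    "semantic mismatches",
    "boundary ambiguities",
    "escalation path conflicts",
    "failure mode contradictions",
]


def build_prompt(doc_a_name, doc_a_text, doc_b_name, doc_b_text):
    """Assemble one prompt containing both documents in full (no tools, no extra context)."""
    category_lines = "\n".join(f"{i}. {c}" for i, c in enumerate(CATEGORIES, 1))
    return (
        "Identify contradictions, inconsistencies, or semantic conflicts "
        "between the two documents below.\n\n"
        f"Report findings in these categories:\n{category_lines}\n\n"
        # Assumed output format: the real per-finding format is not given in the note.
        "For each finding give: category, severity, the conflicting claim "
        "from each document, and why they conflict.\n\n"
        f"=== {doc_a_name} ===\n{doc_a_text}\n\n"
        f"=== {doc_b_name} ===\n{doc_b_text}\n"
    )
```

Calling `build_prompt("kill-switch.md", ks_text, "escalation-policy.md", ep_text)` with the two file contents would yield a single string of roughly the combined document size plus a short instruction header.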

| Model | Time | Output tokens | Reasoning tokens | Findings | Critical | High | Medium |
|---|---|---|---|---|---|---|---|
| GPT-5 | 185s | 11,149 | 9,792 | 6 | 2 | 2 | 1 (+1 Medium-High) |
| Claude Opus 4.6 | 54s | 2,410 | (internal) | 7 | 5 | 2 | 0 |
| Claude Sonnet 4 | 23s | 1,049 | (internal) | 6 | 3 | 2 | 1 |

**Note on models:** This experiment used Claude Sonnet 4 (not 4.6 as in previous
experiments) and Claude Opus 4.6. GPT-5 remains the same.
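
For comparing the rows of the table, a small sketch that derives per-finding ratios from the reported numbers. The `Run` dataclass and the `summarize` helper are illustrative names, not part of the experiment tooling. Severity labels are the ones each model assigned to its own findings, so the Critical share describes how a model rated its output rather than an independently verified severity.

```python
from dataclasses import dataclass


@dataclass
class Run:
    model: str
    seconds: int
    output_tokens: int
    findings: int
    critical: int


# Values copied from the results table above.
RUNS = [
    Run("GPT-5", 185, 11_149, 6, 2),
    Run("Claude Opus 4.6", 54, 2_410, 7, 5),
    Run("Claude Sonnet 4", 23, 1_049, 6, 3),
]


def summarize(run):
    """Derive simple ratios for one run; all inputs come from the table."""
    return {
        "model": run.model,
        "critical_share": run.critical / run.findings,
        "seconds_per_finding": run.seconds / run.findings,
        "tokens_per_finding": run.output_tokens / run.findings,
    }


if __name__ == "__main__":
    for run in RUNS:
        s = summarize(run)
        print(
            f"{s['model']}: critical share {s['critical_share']:.2f}, "
            f"{s['seconds_per_finding']:.1f}s and "
            f"{s['tokens_per_finding']:.0f} output tokens per finding"
        )
```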

**Common ground (identified by all 3, directly or within a broader finding):**

1. **"Restrict" means fundamentally different things** — Kill-switch.md defines Restrict
   as decision engine TERMINATED + reject-all policy (total shutdown). Escalation-policy.md
   uses Restrict as a live monitoring level where the system continues evaluating metrics
   and can autonomously escalate to Liquidate. All three models rated this Critical.

2. **Liquidation execution model conflict** — Kill-switch.md describes liquidation as a
   manual operator action (Step 2: operator triggers from dashboard). Escalation-policy.md
   describes autonomous liquidation orders submitted by the system with iterative sizing.
   All three models found this (Critical).

3. **Restrict→Liquidate transition: manual vs autonomous** — Kill-switch.md explicitly
   states "The system never makes this transition automatically." Escalation-policy.md
   defines an automatic Restrict→Liquidate transition with debounce=3 evaluations.
   GPT-5 and Sonnet identified this directly; Opus captured it within the broader
   severity-ordering finding.

4. **Kill switch engagement prerequisites conflict** — Kill-switch.md says automated
   systems can engage on detecting a breach (any breach). Escalation-policy.md constrains
   automated kill switch engagement to post-liquidation-failure only. GPT-5 and Sonnet
   flagged this as High; Opus captured it as part of the escalation ladder inversion.

5. **Monitor crash + persisted state = split-brain** — Kill-switch state survives
   crashes per kill-switch.md, but escalation-policy.md says the monitor restarts from
   Clear. All three models identified that this creates an inconsistent recovery state.
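
One way to make contradictions like these mechanically visible is to encode each document's claims as a table keyed by (subject, attribute) and diff the two tables. The sketch below does that for four of the items above. The keys and values are hand-written paraphrases of this finding, not text extracted from either document, and `diff_claims` is a hypothetical helper.

```python
# Hand-encoded paraphrases of each document's position (assumption: the
# attribute names are invented here purely for illustration).
KILL_SWITCH_CLAIMS = {
    ("Restrict", "system_state"): "terminated, reject-all",
    ("Restrict->Liquidate", "trigger"): "manual",
    ("Liquidate", "executed_by"): "operator",
    ("crash_recovery", "state_after_restart"): "persisted",
}

ESCALATION_POLICY_CLAIMS = {
    ("Restrict", "system_state"): "live, still evaluating metrics",
    ("Restrict->Liquidate", "trigger"): "automatic (debounce=3)",
    ("Liquidate", "executed_by"): "system",
    ("crash_recovery", "state_after_restart"): "Clear",
}


def diff_claims(claims_a, claims_b):
    """Return (key, value_a, value_b) for every key both tables define differently."""
    conflicts = []
    for key in sorted(claims_a.keys() & claims_b.keys()):
        if claims_a[key] != claims_b[key]:
            conflicts.append((key, claims_a[key], claims_b[key]))
    return conflicts


if __name__ == "__main__":
    for key, a, b in diff_claims(KILL_SWITCH_CLAIMS, ESCALATION_POLICY_CLAIMS):
        print(f"{key}: kill-switch.md says {a!r}, escalation-policy.md says {b!r}")
```

Keys that appear in only one table are ignored by this diff, so gaps where one document is silent would need a separate check.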

**GPT-5 unique findings:**
- The "shared escalation vocabulary" claim in kill-switch.md explicitly asserts these
  modes "describe the same outcomes" as continuous risk monitoring — but they DON'T.
  GPT-5 specifically flagged this meta-claim as the ROOT of the contradictions. Neither
  Claude model explicitly attacked the "same vocabulary" assertion as itself being the
  problem.
- De-escalation semantics conflict: automatic Restrict→Alert cooldown in
  escalation-policy.md vs "all disengagement requires operator action" in kill-switch.md.
  Framed this as a principle-level contradiction.

**Claude Opus unique findings:**
- **Severity ordering inversion** (most architecturally significant finding across all
  models): Kill-switch.md treats Restrict as MORE severe than Liquidate (you de-escalate
  FROM Restrict TO Liquidate when you verify state). Escalation-policy.md treats
  Liquidate as MORE severe than Restrict (Restrict escalates TO Liquidate on sustained
  breach). THE ESCALATION LADDERS RUN IN OPPOSITE DIRECTIONS. This is genuinely a
  Critical design flaw — it means the two subsystems fundamentally disagree about which
  state is worse.
- **Decision engine liveness contradiction**: Escalation-policy.md's de-escalation from
  Restrict requires ongoing metric evaluation (Portfolio Risk alive). Kill-switch.md
  terminates Portfolio Risk on engagement in Restrict mode. The de-escalation mechanism
  is impossible if the termination boundary is implemented as specified.
- **Acceptance policy gap**: Escalation-policy.md never specifies which acceptance policy
  its Restrict level sets, creating an implementation gap when combined with
  kill-switch.md's explicit Restrict=reject-all mapping.
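
The severity ordering inversion can be illustrated by writing each document's ladder as a set of (less severe, more severe) pairs and looking for pairs that appear reversed in the other document. This is a sketch under the assumption that the orderings below faithfully paraphrase the findings above; `inversions` is a hypothetical helper, not something either document defines.

```python
# Each pair reads (less_severe, more_severe).
# kill-switch.md: you de-escalate FROM Restrict TO Liquidate, so Restrict is worse.
KILL_SWITCH_ORDER = {
    ("Liquidate", "Restrict"),
}

# escalation-policy.md: Restrict escalates TO Liquidate, and Restrict cools
# down to Alert, so Alert < Restrict < Liquidate.
ESCALATION_POLICY_ORDER = {
    ("Alert", "Restrict"),
    ("Restrict", "Liquidate"),
}


def inversions(order_a, order_b):
    """Return pairs from order_a whose reverse appears in order_b."""
    return sorted(
        (less, more) for (less, more) in order_a if (more, less) in order_b
    )


if __name__ == "__main__":
    for less, more in inversions(KILL_SWITCH_ORDER, ESCALATION_POLICY_ORDER):
        print(
            f"kill-switch.md ranks {more} above {less}; "
            f"escalation-policy.md ranks {less} above {more}"
        )
```

The check only compares directly stated pairs. Detecting inversions implied through intermediate states would require taking the transitive closure of each ordering first.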

**Claude Sonnet 4 unique findings:**
- Kill switch engagement authority: unclear whether risk monitors can bypass the
  escalation policy or must flow through debounce/cooldown logic. (Medium severity
  boundary question)
- Sonnet noted the ambiguity about whether kill switch "liquidate mode" and escalation
  policy "liquidate level" are the same state or different states sharing a name —
  framed this as the core namespace collision problem.

**Key insight — shared vocabulary is the most dangerous form of inconsistency:**
The documents CLAIM to share vocabulary ("These use the same escalation vocabulary as
continuous risk monitoring because they describe the same outcomes"). This creates a
false sense of consistency. Implementers reading kill-switch.md would assume Restrict/
Liquidate mean what escalation-policy.md says, and vice versa — but they DON'T.

Opus's severity-ordering inversion finding is the clearest demonstration: both documents
describe a Restrict→Liquidate transition, but escalation-policy.md treats it as an
escalation (Liquidate is the worse state) while kill-switch.md treats it as a
de-escalation (Restrict is the worse state). Same words, opposite directions. This is
the kind of bug that passes design review because everyone reads one document at a time
and the terms "feel" consistent.

**Comparison to Finding #28 (loosely coupled docs):**
Finding #28 tested system-overview.md + architecture.md (different abstraction levels,
different concerns). That experiment surfaced 6-7 inconsistencies, a count similar to
this one. Key differences:

1. **Tightly coupled docs produce MORE Critical findings.** Finding #28: 1-3 Critical
   per model. This experiment: 2-5 Critical per model. When documents describe the same
   domain and cross-reference each other, contradictions are more likely to cause real
   implementation bugs.
2. **Shared terminology amplifies contradictions.** The most dangerous findings here
   (severity ordering, Restrict semantics) arise specifically because both documents USE
   THE SAME WORDS with different meanings. Finding #28's docs used different terminology
   for different concepts, making contradictions more obvious.
3. **Opus stands out more on tightly coupled docs.** In Finding #28, Opus found 7
   inconsistencies (similar to GPT-5's 6). Here, Opus again found 7 but rated 5 of them
   Critical, versus GPT-5's 6 findings with 2 Critical. Severity is self-assigned by
   each model, so this is a difference in depth of the issues surfaced rather than a
   verified accuracy score. Opus excels when the contradictions require reasoning about
   how terms relate across documents.

**Model performance on cross-document consistency (updated hierarchy):**

For tightly coupled documents with shared vocabulary:
1. **Claude Opus 4.6** — Best at finding the DEEP contradictions (severity ordering
   inversion, liveness impossibility). Most findings (7) and the highest proportion
   rated Critical (5 of 7). Reasons about semantic relationships between document claims.
2. **GPT-5** — Most thorough and finds meta-level issues (the "same vocabulary" claim
   itself is problematic). Good at principle-level contradictions. Its 9,792 reasoning
   tokens may help it trace implication chains.
3. **Claude Sonnet 4** — Fast (23s vs 54-185s), finds the core issues correctly, but
   misses the more subtle architectural implications. Good for first-pass screening.

**Practical implication for documentation review:**
Cross-document consistency analysis is most valuable (and finds the most Critical bugs)
when applied to tightly coupled documents that SHARE VOCABULARY. The anti-pattern to
hunt for: documents that claim alignment ("uses the same vocabulary as X") but define
terms in subtly incompatible ways. This is more dangerous than documents that use
obviously different terminology — different words signal difference; same words mask it.

**Recommendation:** When a document claims terminological alignment with another ("uses
the same vocabulary as..."), that's the strongest signal to run cross-document
consistency analysis. The claim of shared vocabulary is ITSELF a risk factor for hidden
contradictions.
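
As a first-pass way to act on this recommendation, a documentation set could be scanned for explicit alignment claims so those document pairs are queued for cross-document review. The sketch below is a simple line-based heuristic. The regular expression and the `find_alignment_claims` helper are assumptions chosen for illustration, and the pattern will miss claims that are phrased differently or wrapped across lines.

```python
import re

# Matches phrases such as "uses the same vocabulary as" or
# "use the same escalation vocabulary" (one optional word before the noun).
ALIGNMENT_CLAIM = re.compile(
    r"\b(?:uses?|shares?)\s+the\s+same\s+(?:\w+\s+)?"
    r"(?:vocabulary|terminology|terms)\b",
    re.IGNORECASE,
)


def find_alignment_claims(text):
    """Return (line_number, stripped_line) for each line containing an alignment claim."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if ALIGNMENT_CLAIM.search(line):
            hits.append((lineno, line.strip()))
    return hits


if __name__ == "__main__":
    sample = (
        "Modes are Restrict and Liquidate.\n"
        "These use the same escalation vocabulary as continuous risk monitoring.\n"
        "Disengagement requires operator action.\n"
    )
    for lineno, line in find_alignment_claims(sample):
        print(f"line {lineno}: {line}")
```

A hit is only a signal to run the full consistency analysis on the referenced document pair; it says nothing about whether the claim is actually false.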