model-research/findings/2026-05-07-37-crossdoc-consistency-tightlycoupled-risk-docs.md
claw 58e69e21f8 finding 37: cross-doc consistency on tightly coupled risk docs
Tested kill-switch.md + escalation-policy.md (same bounded context,
shared vocabulary). Key insight: shared vocabulary claims are the most
dangerous inconsistency — same words with opposite severity ordering.

Opus found the severity-ordering inversion (restrict/liquidate ladders
run in opposite directions). GPT-5 found the meta-issue (the 'same
vocabulary' claim is itself the problem). Sonnet fast but shallow.

Tightly coupled docs produce more Critical findings than loosely coupled
ones (Finding #28).
2026-05-07 04:29:23 -07:00


Finding 37: Cross-document consistency on tightly coupled risk docs — shared vocabulary creates critical contradictions invisible to each document's author

Date: 2026-05-07

Task: Identify contradictions, inconsistencies, or semantic conflicts between gargoyle's kill-switch.md (293 lines) and escalation-policy.md (228 lines). The two documents sit in the SAME bounded context (risk), describe related subsystems, and explicitly reference each other.

How we used them: Both documents were provided as full text in a single prompt (~29KB). The structured prompt specified 5 categories (direct contradictions, semantic mismatches, boundary ambiguities, escalation path conflicts, failure mode contradictions) and required a specific output format per finding. No tools, no project context beyond the two documents. This is the same analytical task type as Finding #28, but tested with TIGHTLY COUPLED documents from the same domain rather than Finding #28's loosely coupled docs (overview vs architecture).
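A minimal sketch of how a prompt like this could be assembled. Only the five category names come from this finding. The function name `build_consistency_prompt`, the instruction wording, and the per-finding output fields are hypothetical, since the actual prompt text is not recorded here.

```python
CATEGORIES = [
    "direct contradictions",
    "semantic mismatches",
    "boundary ambiguities",
    "escalation path conflicts",
    "failure mode contradictions",
]

# Hypothetical per-finding fields; the real output format is not recorded here.
FINDING_FIELDS = [
    "category",
    "severity",
    "quote from document A",
    "quote from document B",
    "why they conflict",
]


def build_consistency_prompt(name_a, text_a, name_b, text_b):
    """Assemble one prompt that contains both documents in full."""
    category_lines = "\n".join(
        f"{i}. {c}" for i, c in enumerate(CATEGORIES, start=1)
    )
    field_lines = "\n".join(f"- {f}" for f in FINDING_FIELDS)
    return (
        f"Identify contradictions between {name_a} and {name_b}.\n\n"
        f"Check these categories:\n{category_lines}\n\n"
        f"Report each finding with:\n{field_lines}\n\n"
        f"=== {name_a} ===\n{text_a}\n\n"
        f"=== {name_b} ===\n{text_b}\n"
    )
```

Keeping both documents in one prompt, rather than summarizing either, matches the setup described above: the model sees every definition of a shared term side by side.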

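| Model | Time | Output tokens | Reasoning tokens | Findings | Critical | High | Medium |
|---|---|---|---|---|---|---|---|
| GPT-5 | 185s | 11,149 | 9,792 | 6 | 2 | 2 | 1 (+1 Med-Hi) |
| Claude Opus 4.6 | 54s | 2,410 | (internal) | 7 | 5 | 2 | 0 |
| Claude Sonnet 4 | 23s | 1,049 | (internal) | 6 | 3 | 2 | 1 |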
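The ratios behind the later comparisons can be derived directly from these counts. The sketch below copies the numbers from the table; the helper names `critical_share` and `seconds_per_finding` are illustrative, not part of any existing tooling.

```python
# Counts copied from the results table above.
RESULTS = {
    "GPT-5": {"seconds": 185, "findings": 6, "critical": 2},
    "Claude Opus 4.6": {"seconds": 54, "findings": 7, "critical": 5},
    "Claude Sonnet 4": {"seconds": 23, "findings": 6, "critical": 3},
}


def critical_share(row):
    """Fraction of a model's findings that it rated Critical."""
    return row["critical"] / row["findings"]


def seconds_per_finding(row):
    """Wall-clock seconds spent per reported finding."""
    return row["seconds"] / row["findings"]


for model, row in RESULTS.items():
    print(
        f"{model}: {critical_share(row):.0%} Critical, "
        f"{seconds_per_finding(row):.1f}s per finding"
    )
# GPT-5: 33% Critical, 30.8s per finding
# Claude Opus 4.6: 71% Critical, 7.7s per finding
# Claude Sonnet 4: 50% Critical, 3.8s per finding
```

These shares reflect each model's own severity ratings; the experiment has no independent ground truth for severity.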

Note on models: This experiment used Claude Sonnet 4 (not 4.6 as in previous experiments) and Claude Opus 4.6. GPT-5 remains the same.

What they found, common ground (identified by all 3, either directly or within a broader finding):

  1. "Restrict" means fundamentally different things — Kill-switch.md defines Restrict as decision engine TERMINATED + reject-all policy (total shutdown). Escalation-policy.md uses Restrict as a live monitoring level where the system continues evaluating metrics and can autonomously escalate to Liquidate. All three models rated this Critical.

  2. Liquidation execution model conflict — Kill-switch.md describes liquidation as a manual operator action (Step 2: operator triggers from dashboard). Escalation-policy.md describes autonomous liquidation orders submitted by the system with iterative sizing. All three models found this (Critical).

  3. Restrict→Liquidate transition: manual vs autonomous — Kill-switch.md explicitly states "The system never makes this transition automatically." Escalation-policy.md defines an automatic Restrict→Liquidate transition with debounce=3 evaluations. GPT-5 and Sonnet identified this directly; Opus captured it within the broader severity-ordering finding.

  4. Kill switch engagement prerequisites conflict — Kill-switch.md says automated systems can engage on detecting a breach (any breach). Escalation-policy.md constrains automated kill switch engagement to post-liquidation-failure only. GPT-5 and Sonnet flagged this as High; Opus captured it as part of the escalation ladder inversion.

  5. Monitor crash + persisted state = split-brain — Kill-switch state survives crashes per kill-switch.md, but escalation-policy.md says monitor restarts from Clear. All three models identified this creates an inconsistent recovery state.
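The manual-versus-autonomous conflicts above can be made mechanical by writing each document's transition claims down as data and diffing them. This is a sketch under assumptions: the `Transition` type and `trigger_conflicts` helper are hypothetical, and the claims are hand-encoded from the summaries in this finding, not extracted from the documents.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transition:
    source: str
    target: str
    trigger: str  # "manual" (operator action) or "automatic" (system-initiated)


# Claims as summarized in this finding, encoded by hand.
KILL_SWITCH = [Transition("Restrict", "Liquidate", "manual")]
ESCALATION_POLICY = [Transition("Restrict", "Liquidate", "automatic")]


def trigger_conflicts(doc_a, doc_b):
    """Return (source, target, trigger_a, trigger_b) for edges whose triggers differ."""
    triggers_a = {(t.source, t.target): t.trigger for t in doc_a}
    conflicts = []
    for t in doc_b:
        edge = (t.source, t.target)
        if edge in triggers_a and triggers_a[edge] != t.trigger:
            conflicts.append((t.source, t.target, triggers_a[edge], t.trigger))
    return conflicts


print(trigger_conflicts(KILL_SWITCH, ESCALATION_POLICY))
# [('Restrict', 'Liquidate', 'manual', 'automatic')]
```

A table like this only catches conflicts on edges both documents name identically, which is exactly the situation here: the shared vocabulary makes the edges line up while the semantics diverge.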

GPT-5 unique findings:

  • The "shared escalation vocabulary" claim in kill-switch.md explicitly asserts these modes "describe the same outcomes" as continuous risk monitoring — but they DON'T. GPT-5 specifically flagged this meta-claim as the ROOT of the contradictions. Neither Claude model explicitly attacked the "same vocabulary" assertion as itself being the problem.
  • De-escalation semantics conflict: automatic Restrict→Alert cooldown in escalation-policy.md vs "all disengagement requires operator action" in kill-switch.md. Framed this as a principle-level contradiction.

Claude Opus unique findings:

  • Severity ordering inversion (most architecturally significant finding across all models): Kill-switch.md treats Restrict as MORE severe than Liquidate (you de-escalate FROM Restrict TO Liquidate when you verify state). Escalation-policy.md treats Liquidate as MORE severe than Restrict (Restrict escalates TO Liquidate on sustained breach). THE ESCALATION LADDERS RUN IN OPPOSITE DIRECTIONS. This is genuinely a Critical design flaw — it means the two subsystems fundamentally disagree about which state is worse.
  • Decision engine liveness contradiction: Escalation-policy.md's de-escalation from Restrict requires ongoing metric evaluation (Portfolio Risk alive). Kill-switch.md terminates Portfolio Risk on engagement in Restrict mode. The de-escalation mechanism is impossible if the termination boundary is implemented as specified.
  • Acceptance policy gap: Escalation-policy.md never specifies which acceptance policy its Restrict level sets, creating an implementation gap when combined with kill-switch.md's explicit Restrict=reject-all mapping.
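The severity-ordering inversion can be expressed as a check on two ordered lists. The sketch below is illustrative: `ordering_inversions` is a hypothetical helper, each list runs from least to most severe, and the full Clear < Alert < Restrict < Liquidate ordering for escalation-policy.md is an assumption inferred from the transitions mentioned in this finding.

```python
from itertools import combinations

# Least severe first. Kill-switch.md treats Restrict as worse than Liquidate.
KILL_SWITCH_ORDER = ["Liquidate", "Restrict"]
# Assumed full ladder for escalation-policy.md (inferred, not quoted).
ESCALATION_POLICY_ORDER = ["Clear", "Alert", "Restrict", "Liquidate"]


def ordering_inversions(order_a, order_b):
    """Return pairs of shared terms that the two orderings rank in opposite order."""
    rank_b = {term: i for i, term in enumerate(order_b)}
    shared = [term for term in order_a if term in rank_b]
    inversions = []
    # combinations() preserves order_a's ranking: lower always precedes higher.
    for lower, higher in combinations(shared, 2):
        if rank_b[lower] > rank_b[higher]:
            inversions.append((lower, higher))
    return inversions


print(ordering_inversions(KILL_SWITCH_ORDER, ESCALATION_POLICY_ORDER))
# [('Liquidate', 'Restrict')]
```

An empty result would mean the documents agree on relative severity for every term they share; any pair in the result is a term pair whose ladder direction flips between documents.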

Claude Sonnet 4 unique findings:

  • Kill switch engagement authority: unclear whether risk monitors can bypass the escalation policy or must flow through debounce/cooldown logic. (Medium severity boundary question)
  • Sonnet noted the ambiguity about whether kill switch "liquidate mode" and escalation policy "liquidate level" are the same state or different states sharing a name — framed this as the core namespace collision problem.

Key insight: shared vocabulary is the most dangerous form of inconsistency. The documents CLAIM to share vocabulary ("These use the same escalation vocabulary as continuous risk monitoring because they describe the same outcomes"). This creates a false sense of consistency. Implementers reading kill-switch.md would assume Restrict/Liquidate mean what escalation-policy.md says, and vice versa, but they DON'T.

Opus's severity-ordering inversion finding is the clearest demonstration. Both documents describe a move from Restrict to Liquidate, but escalation-policy.md treats it as an escalation (Liquidate is the worse state) while kill-switch.md treats it as a de-escalation (Restrict is the worse state). Same words, opposite directions. This is the kind of bug that passes design review because everyone reads one document at a time and the terms "feel" consistent.

Comparison to Finding #28 (loosely coupled docs): Finding #28 tested system-overview.md + architecture.md (different abstraction levels, different concerns). In that experiment the models found 6-7 inconsistencies each, similar to the finding counts here. Key differences:

  1. Tightly coupled docs produce MORE Critical findings. Finding #28: 1-3 Critical per model. This experiment: 2-5 Critical per model. When documents describe the same domain and cross-reference each other, contradictions are more likely to cause real implementation bugs.
  2. Shared terminology amplifies contradictions. The most dangerous findings here (severity ordering, Restrict semantics) arise specifically because both documents USE THE SAME WORDS with different meanings. Finding #28's docs used different terminology for different concepts, making contradictions more obvious.
  3. Opus performs better on tightly coupled docs. In Finding #28, Opus found 7 inconsistencies (similar to GPT-5's 6). Here, Opus again found 7 but rated 5 of them Critical, against GPT-5's 6 findings with 2 Critical. Opus excels when the contradictions require reasoning about how terms relate across documents.

Model performance on cross-document consistency (updated hierarchy):

For tightly coupled documents with shared vocabulary:

  1. Claude Opus 4.6: Best at finding the DEEP contradictions (severity ordering inversion, liveness impossibility). Most findings (7) and the highest proportion rated Critical (5 of 7). Reasons about semantic relationships between document claims.
  2. GPT-5: Most thorough and finds meta-level issues (the "same vocabulary" claim itself is problematic). Good at principle-level contradictions. Reasoning tokens help trace implication chains.
  3. Claude Sonnet 4: Fast (23s vs 54-185s), finds the core issues correctly, but misses the more subtle architectural implications. Good for first-pass screening.

Practical implication for documentation review: Cross-document consistency analysis is most valuable (and finds the most Critical bugs) when applied to tightly coupled documents that SHARE VOCABULARY. The anti-pattern to hunt for: documents that claim alignment ("uses the same vocabulary as X") but define terms in subtly incompatible ways. This is more dangerous than documents that use obviously different terminology — different words signal difference; same words mask it.

Recommendation: When a document claims terminological alignment with another ("uses the same vocabulary as..."), that's the strongest signal to run cross-document consistency analysis. The claim of shared vocabulary is ITSELF a risk factor for hidden contradictions.
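One way to act on this recommendation is to scan documentation for alignment claims and queue the matching documents for cross-document review. The sketch below is a rough heuristic, not an existing tool: the regular expression and the `find_alignment_claims` name are assumptions, and the pattern covers only a few phrasings.

```python
import re

# Hypothetical pattern for claims such as "uses the same vocabulary as X".
ALIGNMENT_CLAIM = re.compile(
    r"\b(?:uses?|shares?)\s+the\s+same\s+(?:\w+\s+)?"
    r"(?:vocabulary|terminology|terms)\b",
    re.IGNORECASE,
)


def find_alignment_claims(text):
    """Return (line_number, line) pairs for lines that claim shared terminology."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if ALIGNMENT_CLAIM.search(line):
            hits.append((number, line.strip()))
    return hits


sample = (
    "Restrict mode rejects all orders.\n"
    "These use the same escalation vocabulary as continuous risk monitoring "
    "because they describe the same outcomes.\n"
)
for number, line in find_alignment_claims(sample):
    print(number, line)
```

A hit is a signal to run the full consistency analysis on the pair of documents, not evidence of a contradiction by itself.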