58e69e21f8
Tested kill-switch.md + escalation-policy.md (same bounded context, shared vocabulary). Key insight: shared vocabulary claims are the most dangerous inconsistency — same words with opposite severity ordering. Opus found the severity-ordering inversion (restrict/liquidate ladders run in opposite directions). GPT-5 found the meta-issue (the 'same vocabulary' claim is itself the problem). Sonnet fast but shallow. Tightly coupled docs produce more Critical findings than loosely coupled ones (Finding #28).
139 lines
8.7 KiB
Markdown
139 lines
8.7 KiB
Markdown
# Finding 37: Cross-document consistency on tightly coupled risk docs — shared vocabulary creates critical contradictions invisible to each document's author
|
|
|
|
**Date:** 2026-05-07
|
|
**Task:** Identify contradictions, inconsistencies, or semantic conflicts between
|
|
gargoyle's `kill-switch.md` (293 lines) and `escalation-policy.md` (228 lines) — two
|
|
documents in the SAME bounded context (risk) that describe related subsystems and
|
|
explicitly reference each other.
|
|
**How we used them:** Both documents provided as full text in a single prompt (~29KB).
|
|
Structured prompt specifying 5 categories (direct contradictions, semantic mismatches,
|
|
boundary ambiguities, escalation path conflicts, failure mode contradictions). Required
|
|
specific output format per finding. No tools, no project context beyond the two
|
|
documents. Same analytical task type as Finding #28, but testing with TIGHTLY COUPLED
|
|
documents from the same domain vs Finding #28's loosely coupled docs (overview vs
|
|
architecture).
|
|
|
|
| Model | Time | Output tokens | Reasoning tokens | Findings | Critical | High | Medium |
|
|
|---|---|---|---|---|---|---|---|
|
|
| GPT-5 | 185s | 11,149 | 9,792 | 6 | 2 | 2 | 1(+1 Med-Hi) |
|
|
| Claude Opus 4.6 | 54s | 2,410 | (internal) | 7 | 5 | 2 | 0 |
|
|
| Claude Sonnet 4 | 23s | 1,049 | (internal) | 6 | 3 | 2 | 1 |
|
|
|
|
**Note on models:** This experiment used Claude Sonnet 4 (not 4.6 as in previous
|
|
experiments) and Claude Opus 4.6. GPT-5 remains the same.
|
|
|
|
**What they found — common ground (all 3 identified):**
|
|
|
|
1. **"Restrict" means fundamentally different things** — Kill-switch.md defines Restrict
|
|
as decision engine TERMINATED + reject-all policy (total shutdown). Escalation-policy.md
|
|
uses Restrict as a live monitoring level where the system continues evaluating metrics
|
|
and can autonomously escalate to Liquidate. All three models rated this Critical.
|
|
|
|
2. **Liquidation execution model conflict** — Kill-switch.md describes liquidation as a
|
|
manual operator action (Step 2: operator triggers from dashboard). Escalation-policy.md
|
|
describes autonomous liquidation orders submitted by the system with iterative sizing.
|
|
All three models found this (Critical).
|
|
|
|
3. **Restrict→Liquidate transition: manual vs autonomous** — Kill-switch.md explicitly
|
|
states "The system never makes this transition automatically." Escalation-policy.md
|
|
defines an automatic Restrict→Liquidate transition with debounce=3 evaluations.
|
|
GPT-5 and Sonnet identified this directly; Opus captured it within the broader
|
|
severity-ordering finding.
|
|
|
|
4. **Kill switch engagement prerequisites conflict** — Kill-switch.md says automated
|
|
systems can engage on detecting a breach (any breach). Escalation-policy.md constrains
|
|
automated kill switch engagement to post-liquidation-failure only. GPT-5 and Sonnet
|
|
flagged this as High; Opus captured it as part of the escalation ladder inversion.
|
|
|
|
5. **Monitor crash + persisted state = split-brain** — Kill-switch state survives
|
|
crashes per kill-switch.md, but escalation-policy.md says monitor restarts from Clear.
|
|
All three models identified this creates an inconsistent recovery state.
|
|
|
|
**GPT-5 unique findings:**
|
|
- The "shared escalation vocabulary" claim in kill-switch.md explicitly asserts these
|
|
modes "describe the same outcomes" as continuous risk monitoring — but they DON'T.
|
|
GPT-5 specifically flagged this meta-claim as the ROOT of the contradictions. Neither
|
|
Claude model explicitly attacked the "same vocabulary" assertion as itself being the
|
|
problem.
|
|
- De-escalation semantics conflict: automatic Restrict→Alert cooldown in
|
|
escalation-policy.md vs "all disengagement requires operator action" in kill-switch.md.
|
|
Framed this as a principle-level contradiction.
|
|
|
|
**Claude Opus unique findings:**
|
|
- **Severity ordering inversion** (most architecturally significant finding across all
|
|
models): Kill-switch.md treats Restrict as MORE severe than Liquidate (you de-escalate
|
|
FROM Restrict TO Liquidate when you verify state). Escalation-policy.md treats
|
|
Liquidate as MORE severe than Restrict (Restrict escalates TO Liquidate on sustained
|
|
breach). THE ESCALATION LADDERS RUN IN OPPOSITE DIRECTIONS. This is genuinely a
|
|
Critical design flaw — it means the two subsystems fundamentally disagree about which
|
|
state is worse.
|
|
- **Decision engine liveness contradiction**: Escalation-policy.md's de-escalation from
|
|
Restrict requires ongoing metric evaluation (Portfolio Risk alive). Kill-switch.md
|
|
terminates Portfolio Risk on engagement in Restrict mode. The de-escalation mechanism
|
|
is impossible if the termination boundary is implemented as specified.
|
|
- **Acceptance policy gap**: Escalation-policy.md never specifies which acceptance policy
|
|
its Restrict level sets, creating an implementation gap when combined with
|
|
kill-switch.md's explicit Restrict=reject-all mapping.
|
|
|
|
**Claude Sonnet 4 unique findings:**
|
|
- Kill switch engagement authority: unclear whether risk monitors can bypass the
|
|
escalation policy or must flow through debounce/cooldown logic. (Medium severity
|
|
boundary question)
|
|
- Sonnet noted the ambiguity about whether kill switch "liquidate mode" and escalation
|
|
policy "liquidate level" are the same state or different states sharing a name —
|
|
framed this as the core namespace collision problem.
|
|
|
|
**Key insight — shared vocabulary is the most dangerous form of inconsistency:**
|
|
The documents CLAIM to share vocabulary ("These use the same escalation vocabulary as
|
|
continuous risk monitoring because they describe the same outcomes"). This creates a
|
|
false sense of consistency. Implementers reading kill-switch.md would assume Restrict/
|
|
Liquidate mean what escalation-policy.md says, and vice versa — but they DON'T.
|
|
|
|
Opus's severity-ordering inversion finding is the clearest demonstration: one document
|
|
has a ladder going Restrict→Liquidate (escalating), the other has Restrict←Liquidate
|
|
(de-escalating from Restrict into Liquidate). Same words, opposite directions. This is
|
|
the kind of bug that passes design review because everyone reads one document at a time
|
|
and the terms "feel" consistent.
|
|
|
|
**Comparison to Finding #28 (loosely coupled docs):**
|
|
Finding #28 tested system-overview.md + architecture.md (different abstraction levels,
|
|
different concerns). That experiment found 6-7 inconsistencies with the models
|
|
performing similarly to here. Key differences:
|
|
|
|
1. **Tightly coupled docs produce MORE Critical findings.** Finding #28: 1-3 Critical
|
|
per model. This experiment: 2-5 Critical per model. When documents describe the same
|
|
domain and cross-reference each other, contradictions are more likely to cause real
|
|
implementation bugs.
|
|
2. **Shared terminology amplifies contradictions.** The most dangerous findings here
|
|
(severity ordering, Restrict semantics) arise specifically because both documents USE
|
|
THE SAME WORDS with different meanings. Finding #28's docs used different terminology
|
|
for different concepts, making contradictions more obvious.
|
|
3. **Opus performs better on tightly coupled docs.** In Finding #28, Opus found 7
|
|
inconsistencies (similar to GPT-5's 6). Here, Opus found 7 findings with 5 Critical
|
|
— outperforming GPT-5 (6 findings, 2 Critical) in severity accuracy. Opus excels
|
|
when the contradictions require reasoning about how terms relate across documents.
|
|
|
|
**Model performance on cross-document consistency (updated hierarchy):**
|
|
|
|
For tightly coupled documents with shared vocabulary:
|
|
1. **Claude Opus 4.6** — Best at finding the DEEP contradictions (severity ordering
|
|
inversion, liveness impossibility). Fewer findings but higher proportion of Critical-
|
|
severity. Reasons about semantic relationships between document claims.
|
|
2. **GPT-5** — Most thorough and finds meta-level issues (the "same vocabulary" claim
|
|
itself is problematic). Good at principle-level contradictions. Reasoning tokens
|
|
help trace implication chains.
|
|
3. **Claude Sonnet 4** — Fast (23s vs 54-185s), finds the core issues correctly, but
|
|
misses the more subtle architectural implications. Good for first-pass screening.
|
|
|
|
**Practical implication for documentation review:**
|
|
Cross-document consistency analysis is most valuable (and finds the most Critical bugs)
|
|
when applied to tightly coupled documents that SHARE VOCABULARY. The anti-pattern to
|
|
hunt for: documents that claim alignment ("uses the same vocabulary as X") but define
|
|
terms in subtly incompatible ways. This is more dangerous than documents that use
|
|
obviously different terminology — different words signal difference; same words mask it.
|
|
|
|
**Recommendation:** When a document claims terminological alignment with another ("uses
|
|
the same vocabulary as..."), that's the strongest signal to run cross-document
|
|
consistency analysis. The claim of shared vocabulary is ITSELF a risk factor for hidden
|
|
contradictions.
|