finding 44: cross-doc consistency on closely related docs

Sonnet finds ZERO subtle contradictions between signal-lifecycle.md and aggregation.md, while GPT-5 and Opus each find 3 genuine conflicts. Key insight: Sonnet can detect explicit contradictions (Finding 28: 4/6) but completely fails on implication conflicts where one doc's simplified model creates false impressions about another doc's complete specification. Refines Finding 28 and confirms cross-document consistency is actually TWO distinct tasks with different model requirements.
2026-05-07 19:27:20 -07:00
parent d8a030d9e9
commit e127e7b0c7
1 changed files with 138 additions and 0 deletions
@@ -0,0 +1,138 @@
+# Finding 44: Cross-document consistency analysis on closely related docs: Sonnet finds ZERO subtle contradictions; refines Finding 28
+
+**Date:** 2026-05-07
+**Task:** Identify inconsistencies between two closely related architecture documents from
+the same bounded context — `signal-lifecycle.md` (111 lines) and `aggregation.md` (239
+lines) — that describe adjacent stages in the same pipeline (decision-engine context).
+**Builds on Finding 28:** Finding 28 tested cross-doc consistency between high-level
+overview docs (`system-overview.md` vs `architecture.md`) where Sonnet found 4/6 conflicts.
+This experiment uses CLOSELY RELATED component-level docs where contradictions are subtle
+(simplified descriptions vs complete specifications) rather than obvious (different
+architectural philosophies).
+**How we used them:** Both documents concatenated into a single prompt with explicit
+instructions to find ONLY genuine conflicts where both documents make specific claims that
+cannot both be true. Explicitly excluded gaps, style differences, and complementary info.
+Required quote/reference from each document plus explanation of logical incompatibility.
+GPT-5 via HAI OpenAI endpoint; Opus and Sonnet via HAI Anthropic endpoint. No tools.
+
+| Model | Time | Output tokens | Reasoning tokens | Inconsistencies found |
+|---|---|---|---|---|
+| GPT-5 | 84s | 10,587 | 9,856 | 3 (all genuine) |
+| Claude Opus 4.6 | 28s | 1,168 | (internal) | 3 (all genuine) |
+| Claude Sonnet 4.6 | 9s | 309 | (internal) | 0 ("no genuine inconsistencies") |
+
+## GPT-5 found (3 conflicts):
+
+### 1. CRITICAL — Audit coverage for expired signals
+- **Doc A:** "The audit log records rejected signals (with reason) and signals that
+  contributed to decisions. Signals that were produced but passed through uneventfully
+  are not recorded — they are noise."
+- **Doc B:** "The audit trail records expired groups with their discarded signals —
+  nothing is silently dropped."
+- **Conflict:** Expired signals are either recorded (B) or not (A). Direct contradiction
+  about audit scope.
+
+### 2. HIGH — One signal → multiple decisions impossibility
+- **Doc A:** "one signal can appear under many decisions (if multiple aggregators receive it)"
+- **Doc B:** Each strategy has its own aggregator instance with strict isolation. Fan-in
+  happens at PortfolioRisk, not the aggregator. No routing path exists for a single signal
+  to reach multiple aggregators.
+- **Conflict:** Doc A describes a capability that Doc B's architecture makes impossible.
+
+### 3. MEDIUM — Capacity behavior: expire-only vs force-complete
+- **Doc A:** "excess signals are expired rather than accumulated indefinitely"
+- **Doc B:** capacity limits can trigger either "force-complete or expire depending on
+  the algorithm"
+- **Conflict:** Doc A implies capacity always discards; Doc B says it can also produce
+  real decisions.
+
+## Opus found (3 conflicts):
+
+### 1. MEDIUM — Late-arriving signal after timeout assumes expiry only
+- **Doc A:** failure mode says signal arriving after timeout "starts a new group" (implying
+  old group is gone)
+- **Doc B:** force-complete path means old group may have already PRODUCED A DECISION
+- **Conflict:** Doc A doesn't acknowledge that the force-complete case can leave the
+  old group's decision and the new group expressing duplicate trading intent.
+
+### 2. HIGH — Capacity limits: expire-only vs force-complete
+- Same finding as GPT-5 #3 but at higher severity: implementers reading Doc A would
+  believe capacity limits always discard, while Doc B says they may trigger real trading
+  decisions.
+
+### 3. HIGH — Decision creation paths
+- **Doc A:** "Aggregator's completion predicate fires → decision is born" (step 4,
+  presented as THE sole path)
+- **Doc B:** State diagram shows `timeout (force-complete)` and `capacity limit
+  (force-complete)` as additional paths to Complete state — decisions produced WITHOUT
+  predicate satisfaction
+- **Conflict:** Doc A's lifecycle omits two first-class decision-creation paths that
+  Doc B explicitly defines.
+
+## Sonnet found (0 conflicts):
+
+Explicitly stated "NO GENUINE INCONSISTENCIES FOUND." Listed ways the documents are
+CONSISTENT (signal transience, failure handling, identifiers, process flow). Concluded
+they are "complementary rather than conflicting."
+
+## Analysis
+
+**Overlap between GPT-5 and Opus (5 distinct conflicts across both models, 1 shared):**
+- **Shared:** Both caught capacity-limit expire-vs-force-complete (the most surface-level
+  conflict — explicit word "expired" in A vs explicit "force-complete or expire" in B)
+- **GPT-5 unique:** Audit-coverage contradiction (CRITICAL) and multi-aggregator routing
+  impossibility (HIGH). Both require precise cross-referencing of specific claims.
+- **Opus unique:** Late-arrival duplicate-intent concern (MEDIUM) and decision-creation-
+  path omission (HIGH). Both require reasoning about what Doc A's SIMPLIFICATION IMPLIES
+  by OMITTING paths that Doc B explicitly describes.
+
+**Different contradiction-finding modes (confirms Finding 43):**
+- GPT-5 finds contradictions by comparing explicit statements (A says X, B says not-X)
+- Opus finds contradictions by identifying where A's simplification creates false
+  impressions about B's actual behavior
+
+**Why Sonnet failed completely — refined from Finding 28:**
+In Finding 28 (overview vs architecture docs), Sonnet found 4 of 6 conflicts. Here: zero.
+The key difference is CONFLICT SUBTLETY:
+- Finding 28's docs had OBVIOUS contradictions (different architectural philosophies:
+  event sourcing vs fills-as-ground-truth). The surface text clearly disagreed.
+- This experiment's docs have SUBTLE contradictions (simplified lifecycle description
+  omits valid paths described in the detailed spec). The surface text appears consistent
+  — contradictions only emerge when you reason about what the simplification IMPLIES.
+
+Sonnet can spot documents that EXPLICITLY DISAGREE. It cannot detect when one document's
+SIMPLIFIED MODEL creates false impressions about another document's COMPLETE SPECIFICATION.
+That requires holding both models in working memory and testing one against the other,
+a capability that, in this experiment, only the reasoning-backed models (GPT-5 and Opus) showed.
+
+## Key Insight
+
+**Cross-document consistency is actually TWO tasks:**
+1. **Explicit contradiction detection** — "A says X, B says not-X" → Sonnet can handle
+2. **Implication conflict detection** — "A implies X by simplification, B shows not-X as
+   a first-class path" → requires reasoning models
+
+The practical implication: well-written documentation that APPEARS consistent (like these
+two docs) can harbor real design bugs that only reasoning models detect. This is exactly
+the kind of bug that causes implementation drift — one developer reads the lifecycle doc
+and builds assuming predicate-only completion; another reads the aggregation doc and
+implements force-complete as a valid path. The docs are "consistent enough" that a human
+reviewer is unlikely to catch the conflict.
+
+## Cost-Effectiveness
+
+- Opus: 3 genuine findings, 28s, 1,168 output tokens
+- GPT-5: 3 genuine findings, 84s, 10,587 output tokens (9,856 reasoning)
+- Sonnet: 0 findings, 9s, 309 tokens: NEGATIVE value (false reassurance)
+
+For subtle cross-doc consistency: Opus used about 9x fewer output tokens than GPT-5 (1,168 vs
+10,587) and ran 3x faster (28s vs 84s) for the same finding count, though only one finding
+overlaps. Sonnet is worse than useless: it reports "no conflicts" when conflicts exist.
+
+## Updated Task-Model Matrix (cross-document subset)
+
+| Document Relationship | Conflict Type | Sonnet | Opus | GPT-5 |
+|---|---|---|---|---|
+| High-level vs high-level | Obvious philosophy conflicts | ✓ (4/6) | ✓✓ (7/6) | ✓✓ (6/6) |
+| Component vs component (related) | Subtle implication conflicts | ✗ (0/3) | ✓✓ (3/3) | ✓✓ (3/3) |
+| Component vs component (distant) | TBD | ? | ? | ? |
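
The cost-effectiveness and overlap figures above can be re-derived from the results table. The following is a minimal sketch, not part of the original experiment harness: the dictionary layout, the `tokens_per_finding` helper, and the short labels for each conflict are illustrative assumptions, while the numbers are copied from this finding.

```python
# Figures copied from the results table in this finding.
# The structure and key names are illustrative assumptions.
runs = {
    "GPT-5": {"seconds": 84, "output_tokens": 10_587, "genuine": 3},
    "Opus": {"seconds": 28, "output_tokens": 1_168, "genuine": 3},
    "Sonnet": {"seconds": 9, "output_tokens": 309, "genuine": 0},
}


def tokens_per_finding(run):
    """Output tokens spent per genuine finding, or None if nothing was found."""
    if run["genuine"] == 0:
        return None
    return run["output_tokens"] / run["genuine"]


# Opus vs GPT-5: same finding count, so the raw ratios are comparable.
token_ratio = runs["GPT-5"]["output_tokens"] / runs["Opus"]["output_tokens"]
time_ratio = runs["GPT-5"]["seconds"] / runs["Opus"]["seconds"]

# Hypothetical short labels for the conflicts each model reported.
gpt5_findings = {"audit-coverage", "multi-aggregator-routing", "capacity-limit"}
opus_findings = {"late-arrival", "capacity-limit", "decision-creation-paths"}

shared = gpt5_findings & opus_findings    # conflicts both models caught
distinct = gpt5_findings | opus_findings  # all conflicts surfaced by either model

if __name__ == "__main__":
    print(f"token ratio (GPT-5 / Opus): {token_ratio:.2f}")
    print(f"time ratio (GPT-5 / Opus): {time_ratio:.1f}")
    print(f"shared: {sorted(shared)}, distinct total: {len(distinct)}")
    for name, run in runs.items():
        print(name, tokens_per_finding(run))
```

Because the two models share only one finding, running both surfaces five distinct conflicts, which is worth weighing against the raw token ratio when choosing a single model.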