model-research/findings/2026-05-07-44-cross-doc-consistency-subtle-contradictions.md
claw e127e7b0c7 finding 44: cross-doc consistency on closely related docs
Sonnet finds ZERO subtle contradictions between signal-lifecycle.md and
aggregation.md, while GPT-5 and Opus each find 3 genuine conflicts.

Key insight: Sonnet can detect explicit contradictions (Finding 28: 4/6)
but completely fails on implication conflicts where one doc's simplified
model creates false impressions about another doc's complete specification.

Refines Finding 28 and confirms cross-document consistency is actually
TWO distinct tasks with different model requirements.
2026-05-07 19:27:20 -07:00


Finding 44: Cross-document consistency analysis on closely related docs: Sonnet finds ZERO subtle contradictions; refines Finding 28

Date: 2026-05-07

Task: Identify inconsistencies between two closely related architecture documents from the same bounded context (decision-engine): signal-lifecycle.md (111 lines) and aggregation.md (239 lines), which describe adjacent stages in the same pipeline.

Builds on Finding 28: Finding 28 tested cross-doc consistency between high-level overview docs (system-overview.md vs architecture.md), where Sonnet found 4/6 conflicts. This experiment uses CLOSELY RELATED component-level docs where contradictions are subtle (simplified descriptions vs complete specifications) rather than obvious (different architectural philosophies).

How we used them: Both documents were concatenated into a single prompt with explicit instructions to find ONLY genuine conflicts where both documents make specific claims that cannot both be true. The prompt explicitly excluded gaps, style differences, and complementary info, and required a quote/reference from each document plus an explanation of the logical incompatibility. GPT-5 via HAI OpenAI endpoint; Opus and Sonnet via HAI Anthropic endpoint. No tools.
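The prompt setup described above can be sketched roughly as follows. This is a minimal illustration, not the prompt used in the experiment: the function name `build_consistency_prompt`, the instruction wording, and the `=== Doc A/B ===` separators are all assumptions, as is the mapping of Doc A to signal-lifecycle.md and Doc B to aggregation.md.

```python
def build_consistency_prompt(doc_a: str, doc_b: str) -> str:
    """Concatenate two documents under one instruction block.

    Hypothetical helper: the wording paraphrases the constraints listed
    in this finding, not the exact prompt that was sent to the models.
    """
    instructions = (
        "Find ONLY genuine conflicts: cases where both documents make "
        "specific claims that cannot both be true.\n"
        "Exclude gaps, style differences, and complementary information.\n"
        "For each conflict, give a quote or reference from each document "
        "and explain the logical incompatibility."
    )
    # Assumed labelling: Doc A is the lifecycle doc, Doc B the aggregation doc.
    return (
        f"{instructions}\n\n"
        f"=== Doc A: signal-lifecycle.md ===\n{doc_a}\n\n"
        f"=== Doc B: aggregation.md ===\n{doc_b}\n"
    )


prompt = build_consistency_prompt("lifecycle text", "aggregation text")
```

Keeping the exclusions (gaps, style, complementary info) in the instruction block matters for comparing models: without them, a model can pad its answer with non-conflicts.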

| Model | Time | Output tokens | Reasoning tokens | Inconsistencies found |
|---|---|---|---|---|
| GPT-5 | 84s | 10,587 | 9,856 | 3 (all genuine) |
| Claude Opus 4.6 | 28s | 1,168 | (internal) | 3 (all genuine) |
| Claude Sonnet 4.6 | 9s | 309 | (internal) | 0 ("no genuine inconsistencies") |

GPT-5 found (3 conflicts):

1. CRITICAL — Audit coverage for expired signals

  • Doc A: "The audit log records rejected signals (with reason) and signals that contributed to decisions. Signals that were produced but passed through uneventfully are not recorded — they are noise."
  • Doc B: "The audit trail records expired groups with their discarded signals — nothing is silently dropped."
  • Conflict: Expired signals are either recorded (B) or not (A). Direct contradiction about audit scope.

2. HIGH — One signal → multiple decisions impossibility

  • Doc A: "one signal can appear under many decisions (if multiple aggregators receive it)"
  • Doc B: Each strategy has its own aggregator instance with strict isolation. Fan-in happens at PortfolioRisk, not the aggregator. No routing path exists for a single signal to reach multiple aggregators.
  • Conflict: Doc A describes a capability that Doc B's architecture makes impossible.

3. MEDIUM — Capacity behavior: expire-only vs force-complete

  • Doc A: "excess signals are expired rather than accumulated indefinitely"
  • Doc B: capacity limits can trigger either "force-complete or expire depending on the algorithm"
  • Conflict: Doc A implies capacity always discards; Doc B says it can also produce real decisions.

Opus found (3 conflicts):

1. MEDIUM — Late-arriving signal after timeout assumes expiry only

  • Doc A: failure mode says signal arriving after timeout "starts a new group" (implying old group is gone)
  • Doc B: force-complete path means old group may have already PRODUCED A DECISION
  • Conflict: Doc A doesn't acknowledge the force-complete case creates potential duplicate trading intent with the new group.

2. HIGH — Capacity limits: expire-only vs force-complete

  • Same finding as GPT-5 #3 but at higher severity: implementers reading Doc A would believe capacity limits always discard, while Doc B says they may trigger real trading decisions.

3. HIGH — Decision creation paths

  • Doc A: "Aggregator's completion predicate fires → decision is born" (step 4, presented as THE sole path)
  • Doc B: State diagram shows timeout (force-complete) and capacity limit (force-complete) as additional paths to Complete state — decisions produced WITHOUT predicate satisfaction
  • Conflict: Doc A's lifecycle omits two first-class decision-creation paths that Doc B explicitly defines.

Sonnet found (0 conflicts):

Explicitly stated "NO GENUINE INCONSISTENCIES FOUND." Listed ways the documents are CONSISTENT (signal transience, failure handling, identifiers, process flow). Concluded they are "complementary rather than conflicting."

Analysis

Overlap between GPT-5 and Opus:

  • Shared: Both caught capacity-limit expire-vs-force-complete (the most surface-level conflict — explicit word "expired" in A vs explicit "force-complete or expire" in B)
  • GPT-5 unique: Audit-coverage contradiction (CRITICAL) and multi-aggregator routing impossibility (HIGH). Both require precise cross-referencing of specific claims.
  • Opus unique: Late-arrival duplicate-intent concern (MEDIUM) and decision-creation-path omission (HIGH). Both require reasoning about what Doc A's SIMPLIFICATION IMPLIES by OMITTING paths that Doc B explicitly describes.
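The overlap can be checked mechanically with set operations. A small sketch, assuming short labels for each finding (the label strings are made up for illustration; the findings themselves are the ones listed above):

```python
# Hypothetical short labels for the findings reported by each model.
gpt5 = {
    "audit-coverage",
    "multi-aggregator-routing",
    "capacity-expire-vs-force-complete",
}
opus = {
    "late-arrival-duplicate-intent",
    "capacity-expire-vs-force-complete",
    "decision-creation-paths",
}

shared = gpt5 & opus        # found by both models
gpt5_only = gpt5 - opus     # explicit-statement conflicts
opus_only = opus - gpt5     # implication conflicts
distinct = gpt5 | opus      # every conflict found by at least one model

print(f"shared={len(shared)} gpt5_only={len(gpt5_only)} "
      f"opus_only={len(opus_only)} distinct={len(distinct)}")
```

With one shared finding and two unique findings per model, the two runs together surface five distinct conflicts, which is more than either model reports alone.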

Different contradiction-finding modes (confirms Finding 43):

  • GPT-5 finds contradictions by comparing explicit statements (A says X, B says not-X)
  • Opus finds contradictions by identifying where A's simplification creates false impressions about B's actual behavior

Why Sonnet failed completely — refined from Finding 28: In Finding 28 (overview vs architecture docs), Sonnet found 4 conflicts. Here: zero. The key difference is CONFLICT SUBTLETY:

  • Finding 28's docs had OBVIOUS contradictions (different architectural philosophies: event sourcing vs fills-as-ground-truth). The surface text clearly disagreed.
  • This experiment's docs have SUBTLE contradictions (simplified lifecycle description omits valid paths described in the detailed spec). The surface text appears consistent — contradictions only emerge when you reason about what the simplification IMPLIES.

Sonnet can spot documents that EXPLICITLY DISAGREE. It cannot detect when one document's SIMPLIFIED MODEL creates false impressions about another document's COMPLETE SPECIFICATION. This requires holding both models in working memory and testing one against the other — a capability that reasoning-token-backed models excel at.

Key Insight

Cross-document consistency is actually TWO tasks:

  1. Explicit contradiction detection — "A says X, B says not-X" → Sonnet can handle
  2. Implication conflict detection — "A implies X by simplification, B shows not-X as a first-class path" → requires reasoning models

The practical implication: well-written documentation that APPEARS consistent (like these two docs) can harbor real design bugs that only reasoning models detect. This is exactly the kind of bug that causes implementation drift — one developer reads the lifecycle doc and builds assuming predicate-only completion; another reads the aggregation doc and implements force-complete as a valid path. The docs are "consistent enough" that no human reviewer catches the conflict.
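The drift scenario can be made concrete with a toy model. Everything below is hypothetical: `Group`, the two `decide_*` functions, and the `force_complete_on_timeout` flag are illustrative names, not identifiers from either document. The sketch only shows how the two readings diverge on the same timed-out group.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Group:
    """Hypothetical aggregation group; not a type from either document."""
    signals: List[str] = field(default_factory=list)
    timed_out: bool = False


Predicate = Callable[[Group], bool]


def decide_lifecycle_reading(group: Group, predicate: Predicate) -> Optional[str]:
    """Reading of Doc A alone: the completion predicate is the only path."""
    return "decision" if predicate(group) else None


def decide_aggregation_reading(
    group: Group, predicate: Predicate, force_complete_on_timeout: bool
) -> Optional[str]:
    """Reading of Doc B: a timeout may also force-complete the group."""
    if predicate(group):
        return "decision"
    if group.timed_out and force_complete_on_timeout:
        return "decision"  # produced WITHOUT predicate satisfaction
    return None  # group expires


def needs_three_signals(group: Group) -> bool:
    return len(group.signals) >= 3


# Same input, two implementations: only two signals arrived before timeout.
group = Group(signals=["s1", "s2"], timed_out=True)
from_doc_a = decide_lifecycle_reading(group, needs_three_signals)
from_doc_b = decide_aggregation_reading(
    group, needs_three_signals, force_complete_on_timeout=True
)
print(from_doc_a, from_doc_b)
```

The developer working from Doc A gets no decision for this group, while the one working from Doc B emits one, which is the divergence Opus flagged under "Decision creation paths".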

Cost-Effectiveness

  • Opus: 3 genuine findings, 28s, 1,168 output tokens
  • GPT-5: 3 genuine findings, 84s, 10,587 output tokens (9,856 reasoning)
  • Sonnet: 0 findings, 9s, 309 output tokens; NEGATIVE value (false reassurance)

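For subtle cross-doc consistency: Opus used about 9x fewer output tokens than GPT-5 (1,168 vs 10,587) and finished 3x faster (28s vs 84s) with an equal finding count, although only one of the three findings overlaps. Sonnet is worse than useless here: it reports "no conflicts" when conflicts exist, which is actively harmful.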
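The ratios follow directly from the measurements in the results table. A short check, using only those reported numbers (the dictionary layout and the helper name `tokens_per_finding` are illustrative):

```python
from typing import Optional

# Measurements copied from the results table in this finding.
runs = {
    "GPT-5": {"seconds": 84, "output_tokens": 10_587, "findings": 3},
    "Opus": {"seconds": 28, "output_tokens": 1_168, "findings": 3},
    "Sonnet": {"seconds": 9, "output_tokens": 309, "findings": 0},
}


def tokens_per_finding(run: dict) -> Optional[float]:
    """Output tokens spent per genuine finding; None when nothing was found."""
    if run["findings"] == 0:
        return None
    return run["output_tokens"] / run["findings"]


token_ratio = runs["GPT-5"]["output_tokens"] / runs["Opus"]["output_tokens"]
time_ratio = runs["GPT-5"]["seconds"] / runs["Opus"]["seconds"]

print(f"GPT-5 / Opus output tokens: {token_ratio:.2f}x")
print(f"GPT-5 / Opus wall time: {time_ratio:.1f}x")
```

Sonnet has no tokens-per-finding figure at all, since it produced zero findings; its low token count is not a saving in this task.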

Updated Task-Model Matrix (cross-document subset)

| Document Relationship | Conflict Type | Sonnet | Opus | GPT-5 |
|---|---|---|---|---|
| High-level vs high-level | Obvious philosophy conflicts | ✓ (4/6) | ✓✓ (7/6) | ✓✓ (6/6) |
| Component vs component (related) | Subtle implication conflicts | ✗ (0/3) | ✓✓ (3/3) | ✓✓ (3/3) |
| Component vs component (distant) | TBD | ? | ? | ? |
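One way to use the matrix is as a lookup when choosing a model for a consistency check. The sketch below is an assumption about how that could look, not an established tool: the `MATRIX` layout and the `models_with_full_coverage` helper are hypothetical, and the scores are copied as recorded above (including 7/6), with untested cells left as `None`.

```python
from typing import Dict, List, Optional, Tuple

# (found, baseline) pairs as recorded in the matrix; None means untested.
Score = Optional[Tuple[int, int]]

MATRIX: Dict[str, Dict[str, Score]] = {
    "high-level vs high-level": {
        "Sonnet": (4, 6), "Opus": (7, 6), "GPT-5": (6, 6),
    },
    "component vs component (related)": {
        "Sonnet": (0, 3), "Opus": (3, 3), "GPT-5": (3, 3),
    },
    "component vs component (distant)": {
        "Sonnet": None, "Opus": None, "GPT-5": None,
    },
}


def models_with_full_coverage(relationship: str) -> List[str]:
    """Models whose recorded count met or exceeded the baseline for this row."""
    row = MATRIX[relationship]
    return [
        model
        for model, score in row.items()
        if score is not None and score[0] >= score[1]
    ]


related = models_with_full_coverage("component vs component (related)")
distant = models_with_full_coverage("component vs component (distant)")
print(related, distant)
```

An empty result for the distant row means "no data yet", not "no model works", so a caller should treat it as a prompt to run the experiment.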