finding 44: cross-doc consistency on closely related docs

Sonnet finds ZERO subtle contradictions between signal-lifecycle.md and
aggregation.md, while GPT-5 and Opus each find 3 genuine conflicts.

Key insight: Sonnet can detect explicit contradictions (Finding 28: 4/6)
but completely fails on implication conflicts where one doc's simplified
model creates false impressions about another doc's complete specification.

Refines Finding 28 and confirms cross-document consistency is actually
TWO distinct tasks with different model requirements.
This commit is contained in:
claw
2026-05-07 19:27:20 -07:00
parent d8a030d9e9
commit e127e7b0c7
@@ -0,0 +1,138 @@
# Finding 44: Cross-document consistency analysis on closely related docs: Sonnet finds ZERO subtle contradictions; refines Finding 28
**Date:** 2026-05-07
**Task:** Identify inconsistencies between two closely related architecture documents from
the same bounded context — `signal-lifecycle.md` (111 lines) and `aggregation.md` (239
lines) — that describe adjacent stages in the same pipeline (decision-engine context).
**Builds on Finding 28:** Finding 28 tested cross-doc consistency between high-level
overview docs (`system-overview.md` vs `architecture.md`) where Sonnet found 4/6 conflicts.
This experiment uses CLOSELY RELATED component-level docs where contradictions are subtle
(simplified descriptions vs complete specifications) rather than obvious (different
architectural philosophies).
**How we used them:** Both documents concatenated into a single prompt with explicit
instructions to find ONLY genuine conflicts where both documents make specific claims that
cannot both be true. Explicitly excluded gaps, style differences, and complementary info.
Required quote/reference from each document plus explanation of logical incompatibility.
GPT-5 via HAI OpenAI endpoint; Opus and Sonnet via HAI Anthropic endpoint. No tools.
| Model | Time | Output tokens | Reasoning tokens | Inconsistencies found |
|---|---|---|---|---|
| GPT-5 | 84s | 10,587 | 9,856 | 3 (all genuine) |
| Claude Opus 4.6 | 28s | 1,168 | (internal) | 3 (all genuine) |
| Claude Sonnet 4.6 | 9s | 309 | (internal) | 0 ("no genuine inconsistencies") |
## GPT-5 found (3 conflicts):
### 1. CRITICAL — Audit coverage for expired signals
- **Doc A:** "The audit log records rejected signals (with reason) and signals that
contributed to decisions. Signals that were produced but passed through uneventfully
are not recorded — they are noise."
- **Doc B:** "The audit trail records expired groups with their discarded signals —
nothing is silently dropped."
- **Conflict:** Expired signals are either recorded (B) or not (A). Direct contradiction
about audit scope.
### 2. HIGH — One signal → multiple decisions impossibility
- **Doc A:** "one signal can appear under many decisions (if multiple aggregators receive it)"
- **Doc B:** Each strategy has its own aggregator instance with strict isolation. Fan-in
happens at PortfolioRisk, not the aggregator. No routing path exists for a single signal
to reach multiple aggregators.
- **Conflict:** Doc A describes a capability that Doc B's architecture makes impossible.
### 3. MEDIUM — Capacity behavior: expire-only vs force-complete
- **Doc A:** "excess signals are expired rather than accumulated indefinitely"
- **Doc B:** capacity limits can trigger either "force-complete or expire depending on
the algorithm"
- **Conflict:** Doc A implies capacity always discards; Doc B says it can also produce
real decisions.
## Opus found (3 conflicts):
### 1. MEDIUM — Late-arriving signal after timeout assumes expiry only
- **Doc A:** failure mode says signal arriving after timeout "starts a new group" (implying
old group is gone)
- **Doc B:** force-complete path means old group may have already PRODUCED A DECISION
- **Conflict:** Doc A doesn't acknowledge the force-complete case creates potential
duplicate trading intent with the new group.
### 2. HIGH — Capacity limits: expire-only vs force-complete
- Same finding as GPT-5 #3 but at higher severity: implementers reading Doc A would
believe capacity limits always discard, while Doc B says they may trigger real trading
decisions.
### 3. HIGH — Decision creation paths
- **Doc A:** "Aggregator's completion predicate fires → decision is born" (step 4,
presented as THE sole path)
- **Doc B:** State diagram shows `timeout (force-complete)` and `capacity limit
(force-complete)` as additional paths to Complete state — decisions produced WITHOUT
predicate satisfaction
- **Conflict:** Doc A's lifecycle omits two first-class decision-creation paths that
Doc B explicitly defines.
## Sonnet found (0 conflicts):
Explicitly stated "NO GENUINE INCONSISTENCIES FOUND." Listed ways the documents are
CONSISTENT (signal transience, failure handling, identifiers, process flow). Concluded
they are "complementary rather than conflicting."
## Analysis
**Overlap between GPT-5 and Opus:**
- **Shared:** Both caught capacity-limit expire-vs-force-complete (the most surface-level
conflict — explicit word "expired" in A vs explicit "force-complete or expire" in B)
- **GPT-5 unique:** Audit-coverage contradiction (CRITICAL) and multi-aggregator routing
impossibility (HIGH). Both require precise cross-referencing of specific claims.
- **Opus unique:** Late-arrival duplicate-intent concern (MEDIUM) and decision-creation-
path omission (HIGH). Both require reasoning about what Doc A's SIMPLIFICATION IMPLIES
by OMITTING paths that Doc B explicitly describes.
**Different contradiction-finding modes (confirms Finding #43):**
- GPT-5 finds contradictions by comparing explicit statements (A says X, B says not-X)
- Opus finds contradictions by identifying where A's simplification creates false
impressions about B's actual behavior
**Why Sonnet failed completely — refined from Finding 28:**
In Finding 28 (overview vs architecture docs), Sonnet found 4 conflicts. Here: zero.
The key difference is CONFLICT SUBTLETY:
- Finding 28's docs had OBVIOUS contradictions (different architectural philosophies:
event sourcing vs fills-as-ground-truth). The surface text clearly disagreed.
- This experiment's docs have SUBTLE contradictions (simplified lifecycle description
omits valid paths described in the detailed spec). The surface text appears consistent
— contradictions only emerge when you reason about what the simplification IMPLIES.
Sonnet can spot documents that EXPLICITLY DISAGREE. It cannot detect when one document's
SIMPLIFIED MODEL creates false impressions about another document's COMPLETE SPECIFICATION.
This requires holding both models in working memory and testing one against the other —
a capability that reasoning-token-backed models excel at.
## Key Insight
**Cross-document consistency is actually TWO tasks:**
1. **Explicit contradiction detection** — "A says X, B says not-X" → Sonnet can handle
2. **Implication conflict detection** — "A implies X by simplification, B shows not-X as
a first-class path" → requires reasoning models
The practical implication: well-written documentation that APPEARS consistent (like these
two docs) can harbor real design bugs that only reasoning models detect. This is exactly
the kind of bug that causes implementation drift — one developer reads the lifecycle doc
and builds assuming predicate-only completion; another reads the aggregation doc and
implements force-complete as a valid path. The docs are "consistent enough" that no human
reviewer catches the conflict.
## Cost-Effectiveness
Opus: 3 genuine findings, 28s, 1,168 output tokens
GPT-5: 3 genuine findings, 84s, 10,587 output tokens (9,856 reasoning)
Sonnet: 0 findings, 9s, 309 tokens — NEGATIVE value (false reassurance)
For subtle cross-doc consistency: Opus is 9x more token-efficient than GPT-5 with equal
finding count. Sonnet is worse than useless (actively harmful — reports "no conflicts"
when conflicts exist).
## Updated Task-Model Matrix (cross-document subset)
| Document Relationship | Conflict Type | Sonnet | Opus | GPT-5 |
|---|---|---|---|---|
| High-level vs high-level | Obvious philosophy conflicts | ✓ (4/6) | ✓✓ (7/6) | ✓✓ (6/6) |
| Component vs component (related) | Subtle implication conflicts | ✗ (0/3) | ✓✓ (3/3) | ✓✓ (3/3) |
| Component vs component (distant) | TBD | ? | ? | ? |