finding 44: cross-doc consistency on closely related docs
Sonnet finds ZERO subtle contradictions between signal-lifecycle.md and aggregation.md, while GPT-5 and Opus each find 3 genuine conflicts. Key insight: Sonnet can detect explicit contradictions (Finding 28: 4/6) but completely fails on implication conflicts where one doc's simplified model creates false impressions about another doc's complete specification. Refines Finding 28 and confirms cross-document consistency is actually TWO distinct tasks with different model requirements.
@@ -0,0 +1,138 @@
# Finding 44: Cross-document consistency analysis on closely related docs: Sonnet finds ZERO subtle contradictions; refines Finding 28

**Date:** 2026-05-07

**Task:** Identify inconsistencies between two closely related architecture documents from
the same bounded context (decision-engine): `signal-lifecycle.md` (111 lines, Doc A below)
and `aggregation.md` (239 lines, Doc B below), which describe adjacent stages in the same
pipeline.

**Builds on Finding 28:** Finding 28 tested cross-doc consistency between high-level
overview docs (`system-overview.md` vs `architecture.md`), where Sonnet found 4/6 conflicts.
This experiment uses CLOSELY RELATED component-level docs where contradictions are subtle
(simplified descriptions vs complete specifications) rather than obvious (different
architectural philosophies).

**How we used them:** Both documents were concatenated into a single prompt with explicit
instructions to find ONLY genuine conflicts, where both documents make specific claims that
cannot both be true. The prompt explicitly excluded gaps, style differences, and
complementary info, and required a quote/reference from each document plus an explanation
of the logical incompatibility. GPT-5 ran via the HAI OpenAI endpoint; Opus and Sonnet ran
via the HAI Anthropic endpoint. No tools.

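The prompt setup described above could be assembled roughly as follows. This is a minimal sketch, not the prompt actually used in the experiment: the helper name `build_consistency_prompt`, the `===` delimiters, and the exact instruction wording are assumptions made for illustration, and the document bodies are placeholders.

```python
def build_consistency_prompt(doc_a_name: str, doc_a_text: str,
                             doc_b_name: str, doc_b_text: str) -> str:
    """Concatenate two documents into one conflict-finding prompt.

    Hypothetical helper: the wording and delimiters are illustrative only.
    """
    instructions = (
        "Find ONLY genuine conflicts: cases where both documents make specific "
        "claims that cannot both be true.\n"
        "Do NOT report gaps, style differences, or complementary information.\n"
        "For each conflict, give a quote or reference from each document and "
        "explain why the two claims are logically incompatible."
    )
    return (
        f"{instructions}\n\n"
        f"=== Doc A: {doc_a_name} ===\n{doc_a_text}\n\n"
        f"=== Doc B: {doc_b_name} ===\n{doc_b_text}\n"
    )


# Placeholder bodies; the real run would pass the full file contents.
prompt = build_consistency_prompt(
    "signal-lifecycle.md", "(contents of signal-lifecycle.md)",
    "aggregation.md", "(contents of aggregation.md)",
)
```

Labelling the two documents in the prompt makes it easier for a model to attribute each quote to its source, which the task requires.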
| Model | Time | Output tokens | Reasoning tokens | Inconsistencies found |
|---|---|---|---|---|
| GPT-5 | 84s | 10,587 | 9,856 | 3 (all genuine) |
| Claude Opus 4.6 | 28s | 1,168 | (internal) | 3 (all genuine) |
| Claude Sonnet 4.6 | 9s | 309 | (internal) | 0 ("no genuine inconsistencies") |

## GPT-5 found (3 conflicts):

### 1. CRITICAL: Audit coverage for expired signals
- **Doc A:** "The audit log records rejected signals (with reason) and signals that
  contributed to decisions. Signals that were produced but passed through uneventfully
  are not recorded — they are noise."
- **Doc B:** "The audit trail records expired groups with their discarded signals —
  nothing is silently dropped."
- **Conflict:** Expired signals are either recorded (B) or not (A). Direct contradiction
  about audit scope.

### 2. HIGH: One signal → multiple decisions impossibility
- **Doc A:** "one signal can appear under many decisions (if multiple aggregators receive it)"
- **Doc B:** Each strategy has its own aggregator instance with strict isolation. Fan-in
  happens at PortfolioRisk, not the aggregator. No routing path exists for a single signal
  to reach multiple aggregators.
- **Conflict:** Doc A describes a capability that Doc B's architecture makes impossible.

### 3. MEDIUM: Capacity behavior: expire-only vs force-complete
- **Doc A:** "excess signals are expired rather than accumulated indefinitely"
- **Doc B:** capacity limits can trigger either "force-complete or expire depending on
  the algorithm"
- **Conflict:** Doc A implies capacity always discards; Doc B says it can also produce
  real decisions.

## Opus found (3 conflicts):

### 1. MEDIUM: Late-arriving signal after timeout assumes expiry only
- **Doc A:** the failure mode says a signal arriving after timeout "starts a new group"
  (implying the old group is gone)
- **Doc B:** the force-complete path means the old group may have already PRODUCED A DECISION
- **Conflict:** Doc A doesn't acknowledge that the force-complete case creates potential
  duplicate trading intent with the new group.

### 2. HIGH: Capacity limits: expire-only vs force-complete
- Same finding as GPT-5 #3 but at higher severity: implementers reading Doc A would
  believe capacity limits always discard, while Doc B says they may trigger real trading
  decisions.

### 3. HIGH: Decision creation paths
- **Doc A:** "Aggregator's completion predicate fires → decision is born" (step 4,
  presented as THE sole path)
- **Doc B:** The state diagram shows `timeout (force-complete)` and `capacity limit
  (force-complete)` as additional paths to the Complete state, i.e. decisions produced
  WITHOUT predicate satisfaction
- **Conflict:** Doc A's lifecycle omits two first-class decision-creation paths that
  Doc B explicitly defines.

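To make the omitted paths concrete, the two readings can be written as small transition functions and compared. This is a sketch under stated assumptions: `State`, `Trigger`, `doc_b_next_state`, and `doc_a_next_state` are hypothetical names, and the `force_complete` flag merely stands in for Doc B's "depending on the algorithm", whose actual rules are not shown in this finding.

```python
from enum import Enum


class State(Enum):
    COMPLETE = "complete"  # a decision is produced
    EXPIRED = "expired"    # group is discarded, no decision


class Trigger(Enum):
    PREDICATE = "completion predicate fires"
    TIMEOUT = "timeout"
    CAPACITY = "capacity limit"


def doc_b_next_state(trigger: Trigger, force_complete: bool) -> State:
    """Doc B as summarised above: timeout and capacity may force-complete.

    `force_complete` is an assumed stand-in for the algorithm-specific choice.
    """
    if trigger is Trigger.PREDICATE:
        return State.COMPLETE
    return State.COMPLETE if force_complete else State.EXPIRED


def doc_a_next_state(trigger: Trigger) -> State:
    """Doc A's simplified lifecycle as read by the finding: predicate only."""
    return State.COMPLETE if trigger is Trigger.PREDICATE else State.EXPIRED


# Triggers where a reader of Doc A alone would predict the wrong outcome
# for an algorithm that force-completes.
mismatches = [
    t for t in Trigger
    if doc_b_next_state(t, force_complete=True) != doc_a_next_state(t)
]
```

Under these assumptions `mismatches` contains the timeout and capacity triggers, which are the two paths Opus flagged as missing from Doc A's lifecycle.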
## Sonnet found (0 conflicts):

Sonnet explicitly stated "NO GENUINE INCONSISTENCIES FOUND." It listed ways the documents
are CONSISTENT (signal transience, failure handling, identifiers, process flow) and
concluded they are "complementary rather than conflicting."

## Analysis

**Overlap between GPT-5 and Opus:**
- **Shared:** Both caught capacity-limit expire-vs-force-complete (the most surface-level
  conflict: explicit word "expired" in A vs explicit "force-complete or expire" in B)
- **GPT-5 unique:** Audit-coverage contradiction (CRITICAL) and multi-aggregator routing
  impossibility (HIGH). Both require precise cross-referencing of specific claims.
- **Opus unique:** Late-arrival duplicate-intent concern (MEDIUM) and
  decision-creation-path omission (HIGH). Both require reasoning about what Doc A's
  SIMPLIFICATION IMPLIES by OMITTING paths that Doc B explicitly describes.

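The overlap can be checked with plain set operations. In this sketch the string labels are shorthand for the conflict headings listed above, not identifiers taken from the experiment.

```python
# Shorthand labels for the conflicts each model reported (illustrative names).
gpt5 = {
    "audit coverage for expired signals",
    "one signal -> multiple decisions",
    "capacity: expire-only vs force-complete",
}
opus = {
    "late-arriving signal after timeout",
    "capacity: expire-only vs force-complete",
    "decision creation paths",
}

shared = gpt5 & opus        # found by both models
gpt5_only = gpt5 - opus     # found only by GPT-5
opus_only = opus - gpt5     # found only by Opus
distinct = gpt5 | opus      # every conflict surfaced by either model
```

Counted this way there is 1 shared conflict and 2 unique to each model, so the two runs together surfaced 5 distinct conflicts, and each model on its own reported 3 of those 5.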
**Different contradiction-finding modes (confirms Finding 43):**
- GPT-5 finds contradictions by comparing explicit statements (A says X, B says not-X)
- Opus finds contradictions by identifying where A's simplification creates false
  impressions about B's actual behavior

**Why Sonnet failed completely (refined from Finding 28):**
In Finding 28 (overview vs architecture docs), Sonnet found 4 conflicts. Here: zero.
The key difference is CONFLICT SUBTLETY:
- Finding 28's docs had OBVIOUS contradictions (different architectural philosophies:
  event sourcing vs fills-as-ground-truth). The surface text clearly disagreed.
- This experiment's docs have SUBTLE contradictions (the simplified lifecycle description
  omits valid paths described in the detailed spec). The surface text appears consistent;
  contradictions only emerge when you reason about what the simplification IMPLIES.

Sonnet can spot documents that EXPLICITLY DISAGREE. In this experiment it did not detect
that one document's SIMPLIFIED MODEL creates false impressions about another document's
COMPLETE SPECIFICATION. Detecting that requires holding both models in working memory and
testing one against the other, a capability at which reasoning-token-backed models appear
to excel.

## Key Insight

**Cross-document consistency is actually TWO tasks:**
1. **Explicit contradiction detection:** "A says X, B says not-X" → Sonnet can handle
2. **Implication conflict detection:** "A implies X by simplification, B shows not-X as
   a first-class path" → requires reasoning models

The practical implication: well-written documentation that APPEARS consistent (like these
two docs) can harbor real design bugs that, in this experiment, only the reasoning models
detected. This is exactly the kind of bug that causes implementation drift: one developer
reads the lifecycle doc and builds assuming predicate-only completion; another reads the
aggregation doc and implements force-complete as a valid path. The docs are "consistent
enough" that a human reviewer is unlikely to catch the conflict.

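One way to act on the two-task split is a small lookup from conflict kind to the models observed to find at least one such conflict. This is only a sketch: `ConflictKind`, `OBSERVED_CAPABLE`, and `candidate_models` are hypothetical names, and the table encodes nothing more than the outcomes reported in Finding 28 and this finding, which is a very small sample.

```python
from enum import Enum
from typing import Iterable, Set


class ConflictKind(Enum):
    EXPLICIT = "A says X, B says not-X"
    IMPLICATION = "A implies X by simplification, B shows not-X"


ALL_MODELS = {"Sonnet", "Opus", "GPT-5"}

# Models that found at least one conflict of each kind in Findings 28 and 44.
OBSERVED_CAPABLE = {
    ConflictKind.EXPLICIT: {"Sonnet", "Opus", "GPT-5"},
    ConflictKind.IMPLICATION: {"Opus", "GPT-5"},
}


def candidate_models(kinds: Iterable[ConflictKind]) -> Set[str]:
    """Return the models observed to handle every requested conflict kind."""
    result = set(ALL_MODELS)
    for kind in kinds:
        result &= OBSERVED_CAPABLE[kind]
    return result


# A reviewer rarely knows in advance which kind of conflict a doc pair hides,
# so the conservative request is for both kinds.
both = candidate_models([ConflictKind.EXPLICIT, ConflictKind.IMPLICATION])
```

Requesting both kinds narrows the candidates to Opus and GPT-5, which matches the recommendation in this section.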
## Cost-Effectiveness

- Opus: 3 genuine findings, 28s, 1,168 output tokens
- GPT-5: 3 genuine findings, 84s, 10,587 output tokens (9,856 reasoning)
- Sonnet: 0 findings, 9s, 309 tokens: NEGATIVE value (false reassurance)

For subtle cross-doc consistency: Opus is about 9x more token-efficient than GPT-5
(1,168 vs 10,587 output tokens) with an equal finding count. Sonnet is worse than useless
(actively harmful: it reports "no conflicts" when conflicts exist).

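The efficiency comparison is simple arithmetic on the figures reported in this finding. The sketch below only restates those figures; the dictionary layout and variable names are illustrative.

```python
# Figures taken from the results reported in this finding.
runs = {
    "GPT-5":  {"seconds": 84, "output_tokens": 10_587, "findings": 3},
    "Opus":   {"seconds": 28, "output_tokens": 1_168,  "findings": 3},
    "Sonnet": {"seconds": 9,  "output_tokens": 309,    "findings": 0},
}

# How many times more output tokens and wall-clock time GPT-5 used than Opus.
token_ratio = runs["GPT-5"]["output_tokens"] / runs["Opus"]["output_tokens"]
time_ratio = runs["GPT-5"]["seconds"] / runs["Opus"]["seconds"]

# Output tokens spent per genuine finding; undefined when nothing was found.
tokens_per_finding = {
    name: (r["output_tokens"] / r["findings"]) if r["findings"] else None
    for name, r in runs.items()
}
```

The token ratio comes out just above 9 and the time ratio at exactly 3. Tokens-per-finding is left undefined for Sonnet because dividing by zero findings would hide the real problem, which is the false "no conflicts" report.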
## Updated Task-Model Matrix (cross-document subset)

| Document Relationship | Conflict Type | Sonnet | Opus | GPT-5 |
|---|---|---|---|---|
| High-level vs high-level | Obvious philosophy conflicts | ✓ (4/6) | ✓✓ (7/6) | ✓✓ (6/6) |
| Component vs component (related) | Subtle implication conflicts | ✗ (0/3) | ✓✓ (3/3) | ✓✓ (3/3) |
| Component vs component (distant) | TBD | ? | ? | ? |