From e127e7b0c78b929e2833b4baa7b5411b31c39d46 Mon Sep 17 00:00:00 2001
From: claw
Date: Thu, 7 May 2026 19:27:20 -0700
Subject: [PATCH] finding 44: cross-doc consistency on closely related docs

Sonnet finds ZERO subtle contradictions between signal-lifecycle.md and
aggregation.md, while GPT-5 and Opus each find 3 genuine conflicts.

Key insight: Sonnet can detect explicit contradictions (Finding 28: 4/6)
but completely fails on implication conflicts where one doc's simplified
model creates false impressions about another doc's complete
specification.

Refines Finding 28 and confirms cross-document consistency is actually
TWO distinct tasks with different model requirements.
---
 ...s-doc-consistency-subtle-contradictions.md | 138 ++++++++++++++++++
 1 file changed, 138 insertions(+)
 create mode 100644 findings/2026-05-07-44-cross-doc-consistency-subtle-contradictions.md

diff --git a/findings/2026-05-07-44-cross-doc-consistency-subtle-contradictions.md b/findings/2026-05-07-44-cross-doc-consistency-subtle-contradictions.md
new file mode 100644
index 0000000..ea32fa2
--- /dev/null
+++ b/findings/2026-05-07-44-cross-doc-consistency-subtle-contradictions.md
@@ -0,0 +1,138 @@
+# Finding 44: Cross-document consistency analysis on closely related docs: Sonnet finds ZERO subtle contradictions; refines Finding 28
+
+**Date:** 2026-05-07
+**Task:** Identify inconsistencies between two closely related architecture documents from
+the same bounded context — `signal-lifecycle.md` (111 lines) and `aggregation.md` (239
+lines) — that describe adjacent stages in the same pipeline (decision-engine context).
+**Builds on Finding 28:** Finding 28 tested cross-doc consistency between high-level
+overview docs (`system-overview.md` vs `architecture.md`) where Sonnet found 4/6 conflicts.
+This experiment uses CLOSELY RELATED component-level docs where contradictions are subtle
+(simplified descriptions vs complete specifications) rather than obvious (different
+architectural philosophies).
+**How we used them:** Both documents concatenated into a single prompt with explicit
+instructions to find ONLY genuine conflicts where both documents make specific claims that
+cannot both be true. Explicitly excluded gaps, style differences, and complementary info.
+Required quote/reference from each document plus explanation of logical incompatibility.
+GPT-5 via HAI OpenAI endpoint; Opus and Sonnet via HAI Anthropic endpoint. No tools.
+
+| Model | Time | Output tokens | Reasoning tokens | Inconsistencies found |
+|---|---|---|---|---|
+| GPT-5 | 84s | 10,587 | 9,856 | 3 (all genuine) |
+| Claude Opus 4.6 | 28s | 1,168 | (internal) | 3 (all genuine) |
+| Claude Sonnet 4.6 | 9s | 309 | (internal) | 0 ("no genuine inconsistencies") |
+
+## GPT-5 found (3 conflicts):
+
+### 1. CRITICAL — Audit coverage for expired signals
+- **Doc A:** "The audit log records rejected signals (with reason) and signals that
+  contributed to decisions. Signals that were produced but passed through uneventfully
+  are not recorded — they are noise."
+- **Doc B:** "The audit trail records expired groups with their discarded signals —
+  nothing is silently dropped."
+- **Conflict:** Expired signals are either recorded (B) or not (A). Direct contradiction
+  about audit scope.
+
+### 2. HIGH — One signal → multiple decisions impossibility
+- **Doc A:** "one signal can appear under many decisions (if multiple aggregators receive it)"
+- **Doc B:** Each strategy has its own aggregator instance with strict isolation. Fan-in
+  happens at PortfolioRisk, not the aggregator. No routing path exists for a single signal
+  to reach multiple aggregators.
+- **Conflict:** Doc A describes a capability that Doc B's architecture makes impossible.
+
+### 3. MEDIUM — Capacity behavior: expire-only vs force-complete
+- **Doc A:** "excess signals are expired rather than accumulated indefinitely"
+- **Doc B:** capacity limits can trigger either "force-complete or expire depending on
+  the algorithm"
+- **Conflict:** Doc A implies capacity always discards; Doc B says it can also produce
+  real decisions.
+
+## Opus found (3 conflicts):
+
+### 1. MEDIUM — Late-arriving signal after timeout assumes expiry only
+- **Doc A:** failure mode says signal arriving after timeout "starts a new group" (implying
+  old group is gone)
+- **Doc B:** force-complete path means old group may have already PRODUCED A DECISION
+- **Conflict:** Doc A doesn't acknowledge the force-complete case creates potential
+  duplicate trading intent with the new group.
+
+### 2. HIGH — Capacity limits: expire-only vs force-complete
+- Same finding as GPT-5 #3 but at higher severity: implementers reading Doc A would
+  believe capacity limits always discard, while Doc B says they may trigger real trading
+  decisions.
+
+### 3. HIGH — Decision creation paths
+- **Doc A:** "Aggregator's completion predicate fires → decision is born" (step 4,
+  presented as THE sole path)
+- **Doc B:** State diagram shows `timeout (force-complete)` and `capacity limit
+  (force-complete)` as additional paths to Complete state — decisions produced WITHOUT
+  predicate satisfaction
+- **Conflict:** Doc A's lifecycle omits two first-class decision-creation paths that
+  Doc B explicitly defines.
+
+## Sonnet found (0 conflicts):
+
+Explicitly stated "NO GENUINE INCONSISTENCIES FOUND." Listed ways the documents are
+CONSISTENT (signal transience, failure handling, identifiers, process flow). Concluded
+they are "complementary rather than conflicting."
+
+## Analysis
+
+**Overlap between GPT-5 and Opus:**
+- **Shared:** Both caught capacity-limit expire-vs-force-complete (the most surface-level
+  conflict — explicit word "expired" in A vs explicit "force-complete or expire" in B)
+- **GPT-5 unique:** Audit-coverage contradiction (CRITICAL) and multi-aggregator routing
+  impossibility (HIGH). Both require precise cross-referencing of specific claims.
+- **Opus unique:** Late-arrival duplicate-intent concern (MEDIUM) and decision-creation-path
+  omission (HIGH). Both require reasoning about what Doc A's SIMPLIFICATION IMPLIES
+  by OMITTING paths that Doc B explicitly describes.
+
+**Different contradiction-finding modes (confirms Finding #43):**
+- GPT-5 finds contradictions by comparing explicit statements (A says X, B says not-X)
+- Opus finds contradictions by identifying where A's simplification creates false
+  impressions about B's actual behavior
+
+**Why Sonnet failed completely — refined from Finding 28:**
+In Finding 28 (overview vs architecture docs), Sonnet found 4 conflicts. Here: zero.
+The key difference is CONFLICT SUBTLETY:
+- Finding 28's docs had OBVIOUS contradictions (different architectural philosophies:
+  event sourcing vs fills-as-ground-truth). The surface text clearly disagreed.
+- This experiment's docs have SUBTLE contradictions (simplified lifecycle description
+  omits valid paths described in the detailed spec). The surface text appears consistent
+  — contradictions only emerge when you reason about what the simplification IMPLIES.
+
+Sonnet can spot documents that EXPLICITLY DISAGREE. It cannot detect when one document's
+SIMPLIFIED MODEL creates false impressions about another document's COMPLETE SPECIFICATION.
+This requires holding both models in working memory and testing one against the other —
+a capability that reasoning-token-backed models excel at.
+
+## Key Insight
+
+**Cross-document consistency is actually TWO tasks:**
+1. **Explicit contradiction detection** — "A says X, B says not-X" → Sonnet can handle
+2. **Implication conflict detection** — "A implies X by simplification, B shows not-X as
+   a first-class path" → requires reasoning models
+
+The practical implication: well-written documentation that APPEARS consistent (like these
+two docs) can harbor real design bugs that only reasoning models detect. This is exactly
+the kind of bug that causes implementation drift — one developer reads the lifecycle doc
+and builds assuming predicate-only completion; another reads the aggregation doc and
+implements force-complete as a valid path. The docs are "consistent enough" that no human
+reviewer catches the conflict.
+
+## Cost-Effectiveness
+
+Opus: 3 genuine findings, 28s, 1,168 output tokens
+GPT-5: 3 genuine findings, 84s, 10,587 output tokens (9,856 reasoning)
+Sonnet: 0 findings, 9s, 309 tokens — NEGATIVE value (false reassurance)
+
+For subtle cross-doc consistency: Opus is 9x more token-efficient than GPT-5 with equal
+finding count. Sonnet is worse than useless (actively harmful — reports "no conflicts"
+when conflicts exist).
+
+## Updated Task-Model Matrix (cross-document subset)
+
+| Document Relationship | Conflict Type | Sonnet | Opus | GPT-5 |
+|---|---|---|---|---|
+| High-level vs high-level | Obvious philosophy conflicts | ✓ (4/6) | ✓✓ (7/6) | ✓✓ (6/6) |
+| Component vs component (related) | Subtle implication conflicts | ✗ (0/3) | ✓✓ (3/3) | ✓✓ (3/3) |
+| Component vs component (distant) | TBD | ? | ? | ? |
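
Reviewer note: the "9x more token-efficient" figure in the Cost-Effectiveness section can be checked from the results table with a few lines of Python. This is only a sketch; the `Run` class and `tokens_per_finding` helper are hypothetical names introduced for illustration, and "efficiency" is assumed here to mean output tokens per genuine finding. The numbers are copied from the results table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Run:
    """One model run; figures copied from the results table (hypothetical container)."""
    model: str
    seconds: int
    output_tokens: int
    genuine_findings: int


RUNS = [
    Run("GPT-5", 84, 10_587, 3),
    Run("Claude Opus 4.6", 28, 1_168, 3),
    Run("Claude Sonnet 4.6", 9, 309, 0),
]


def tokens_per_finding(run: Run) -> Optional[float]:
    """Output tokens spent per genuine finding; None when nothing was found."""
    if run.genuine_findings == 0:
        return None
    return run.output_tokens / run.genuine_findings


by_model = {run.model: run for run in RUNS}
gpt5 = by_model["GPT-5"]
opus = by_model["Claude Opus 4.6"]

# Assumed definition of efficiency: ratio of tokens-per-finding between two runs.
token_ratio = tokens_per_finding(gpt5) / tokens_per_finding(opus)
time_ratio = gpt5.seconds / opus.seconds

print(f"Opus vs GPT-5 token efficiency: {token_ratio:.1f}x")  # 9.1x
print(f"Opus vs GPT-5 wall-clock speed: {time_ratio:.1f}x")   # 3.0x
```

With zero findings, the Sonnet run has no meaningful tokens-per-finding value, which is why the helper returns `None` instead of dividing by zero.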