Rodin 6af8a6ee10 refactor(findings): split ALL-FINDINGS.md into per-experiment files

Break the monolithic 3249-line findings file into 29 individual files,
one per experiment. Each file is named YYYY-MM-DD-NN-slug.md for easy
chronological sorting and discovery.

No content changes — purely structural reorganization.

2026-05-06 07:15:50 -07:00

Finding 28: Cross-document consistency analysis: NEW task type — GPT-5 finds deep semantic contradictions; Opus finds structural/boundary mismatches; Sonnet identifies core issues quickly

Date: 2026-05-05

Task: Identify contradictions and inconsistencies BETWEEN two architecture documents describing the same system: system-overview.md (323 lines, narrative overview with component flows, invariants, and domain events) and architecture.md (213 lines, DDD-focused with bounded contexts, context map, and message taxonomy).

How we used them: BOTH documents provided as full text in a single prompt (~25KB total). Highly structured prompt specifying 5 categories of cross-document inconsistency (terminology conflicts, structural contradictions, flow/sequence conflicts, ownership/authority conflicts, philosophical contradictions). Required specific output format per finding. Explicitly excluded omissions (things one doc covers and the other doesn't) and detail-level differences. No tools, no project context beyond the two documents. This is a NEW analytical task not previously tested: reasoning about CONSISTENCY BETWEEN documents rather than internal coherence of a single document.
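The prompt itself is not reproduced in this finding. As a rough sketch of the structure described above, it might be assembled as follows. The five category names and the two exclusions come from the description; the function name `build_prompt`, the instruction wording, and the per-finding output format are assumptions made for illustration.

```python
# Hypothetical prompt builder. Category names and exclusions are taken from
# the task description; all instruction wording is an assumption.
CATEGORIES = [
    "terminology conflicts",
    "structural contradictions",
    "flow/sequence conflicts",
    "ownership/authority conflicts",
    "philosophical contradictions",
]

EXCLUSIONS = [
    "omissions (things one document covers and the other does not)",
    "detail-level differences",
]


def build_prompt(doc_a: str, doc_b: str) -> str:
    """Assemble one prompt that contains both documents in full."""
    category_lines = "\n".join(f"{i}. {c}" for i, c in enumerate(CATEGORIES, 1))
    exclusion_lines = "\n".join(f"- {e}" for e in EXCLUSIONS)
    return (
        "Identify contradictions BETWEEN the two documents below.\n\n"
        f"Report only these categories:\n{category_lines}\n\n"
        f"Do not report:\n{exclusion_lines}\n\n"
        # Assumed output format: the real one is not given in the write-up.
        "For each finding, quote the conflicting text from both documents, "
        "explain why both cannot be correct, and assign a severity "
        "(Critical, High, or Medium).\n\n"
        f"=== Document A: system-overview.md ===\n{doc_a}\n\n"
        f"=== Document B: architecture.md ===\n{doc_b}\n"
    )
```

Listing the exclusions explicitly mirrors the constraint in the original prompt: without them, a model is likely to report every topic that appears in only one document.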

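| Model | Time | Output tokens | Reasoning tokens | Inconsistencies found | Critical | High | Medium |
|---|---|---|---|---|---|---|---|
| GPT-5 | 125s | 9,415 | 8,384 | 6 | 2 | 3 | 1 |
| Claude Opus 4.6 | 52s | 2,351 | (internal) | 7 | 3 | 3 | 1 |
| Claude Sonnet 4.6 | 14s | 776 | (internal) | 4 | 1 | 2 | 1 |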

What they found — common ground (all 3 identified):

Event sourcing (all events as source of truth) vs fills-only ground truth: Document A says fills are "ground truth from which all other state can be derived," while Document B says "events are the source of truth, state is computed by replaying events." A treats fills as the recovery foundation; B treats ALL domain events as authoritative. All three models rated this Critical.
Bounded context naming mismatch: "Decision Engine" / "Order Management" (A) vs "Engine" / "Trading" (B) for the same functional responsibilities. GPT-5 folded this into a broader ownership analysis; Opus and Sonnet surfaced it as its own finding.
Signal classification conflict: Document A lists "Signal emitted" as a domain event; Document B explicitly categorizes SignalEmitted as an audit event ("not used to rebuild state"). This determines event store design and recovery semantics.

GPT-5 unique findings (not in either Claude model):

Signal persistence contradiction: Document A states "Signals are never persisted" while Document B lists SignalEmitted as an audit event that IS persisted and states the audit log is mandatory for trading. These are directly incompatible claims about whether signal data is stored.
Audit event ownership conflict: Document A says "Decision approved" events originate from PortfolioRisk. Document B states "only the decision engine writes audit events" and lists DecisionApproved as an audit event example. If PortfolioRisk is part of Risk (not Engine), this is an authority violation.
"Single writer per user" (A: OrderManager writes all trading state) vs per-aggregate single-writer (B: each aggregate writes its own event stream, Ledger owns positions). These are incompatible authority models — either OM centralizes writes or each domain owns its own events.

Claude Opus unique findings (not in either other model):

Engine → OrderManager is an internal pipeline flow (A: same subgraph, direct arrow) vs Engine → Trading is a cross-domain COMMAND (B: PlaceOrder command crossing a bounded context boundary). This structural disagreement determines whether order management is an internal pipeline stage or an independent domain with its own aggregates and command validation.
Signal Risk's architectural position: Document A shows a two-stage risk architecture (SignalRisk pre-aggregation, PortfolioRisk post-aggregation) where Risk is embedded in the pipeline. Document B's context map shows Risk as a separate domain that Engine merely QUERIES ("kill switch active?") — no arrow shows signal routing through Risk. Either risk logic lives inside Engine (contradicting B's context boundary) or the context map is incomplete.
The "reduce" step ownership: A's top-level flow labels Approved →|"reduce"| Decisions (reduction at aggregation), while A's own domain events table says "Decision reduced" originates from PortfolioRisk (reduction after aggregation). This is actually an INTRA-document inconsistency in Document A, but Opus surfaced it as part of cross-doc analysis.

Claude Sonnet unique findings:

None genuinely unique. All 4 findings overlapped with findings from GPT-5 or Opus (event sourcing, signal persistence, context count/naming). Sonnet was efficient (14s, 776 tokens) but did not identify any inconsistency that the other two missed.

Quality assessment:

GPT-5 produced 6 well-reasoned findings with the deepest analysis of OWNERSHIP conflicts. Its signal-persistence contradiction and single-writer authority conflict are genuinely important — they reveal places where the two documents would lead implementers to build fundamentally different systems. Every finding quotes specific text from both documents and explains precisely WHY they can't both be correct. The reasoning investment (8,384 tokens) was used for thorough cross-referencing between documents.
Claude Opus found the most inconsistencies (7) and was remarkably fast (52s vs GPT-5's 125s). Its unique strength: identifying STRUCTURAL contradictions about component boundaries and communication patterns. The Engine→Trading command vs internal pipeline finding is architecturally the most significant discovery — it reveals a fundamental disagreement about whether order management is INSIDE or OUTSIDE the decision engine's boundary. Opus also caught a bonus intra-document inconsistency (the "reduce" labeling error).
Claude Sonnet was the fastest (14s) and most concise (776 tokens) but found only the obvious common-ground issues. For cross-document consistency, Sonnet's speed advantage came at the cost of missing the architectural insights that make this task valuable. It did correctly identify the Critical issue that all three models flagged (fills vs events as the source of truth), making it viable as a quick first-pass screen.

Key insight — cross-document consistency is a DISTINCT task type: This is fundamentally different from single-document analysis (assumptions, race conditions, coherence). It requires:

1. Building a mental model from Document A
2. Building a separate mental model from Document B
3. Finding places where the models are incompatible
4. Reasoning about WHY they can't both be correct (not just "different")

Step 4 is what distinguishes this from simple diff-detection. Many surface differences (naming, detail level, scope) are NOT contradictions — the models must judge which differences are genuinely incompatible vs. complementary. The prompt explicitly excluded omissions and detail-level differences, and all three models respected this constraint well.
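The distinction between a contradiction and an omission can be made concrete with a toy sketch. Everything here is illustrative: the `Claim` and `Verdict` types and the helper functions are hypothetical, and the sample claims are paraphrased from the findings above. Plain string comparison stands in for the judgment step, so a real pipeline would still need a model (or a human) to decide whether two differently worded values are genuinely incompatible, for example to avoid flagging a pure naming difference.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    CONSISTENT = "consistent"
    OMISSION = "omission"            # excluded by the prompt
    CONTRADICTION = "contradiction"


@dataclass(frozen=True)
class Claim:
    subject: str    # what the claim is about, e.g. "signals"
    attribute: str  # which property is asserted, e.g. "persisted"
    value: str      # what the document says about it


def index_claims(claims):
    """Steps 1 and 2: build one lookup table (a crude 'model') per document."""
    return {(c.subject, c.attribute): c.value for c in claims}


def compare(model_a, model_b):
    """Steps 3 and 4: classify every key seen in either document."""
    verdicts = {}
    for key in model_a.keys() | model_b.keys():
        if key not in model_a or key not in model_b:
            verdicts[key] = Verdict.OMISSION       # only one document speaks
        elif model_a[key] == model_b[key]:
            verdicts[key] = Verdict.CONSISTENT
        else:
            verdicts[key] = Verdict.CONTRADICTION  # both speak and disagree
    return verdicts


# Illustrative claims paraphrased from the findings in this write-up.
doc_a = index_claims([
    Claim("signals", "persisted", "never"),
    Claim("ground truth", "source", "fills"),
    Claim("component flows", "described", "yes"),
])
doc_b = index_claims([
    Claim("signals", "persisted", "as SignalEmitted audit events"),
    Claim("ground truth", "source", "all domain events"),
])

verdicts = compare(doc_a, doc_b)
contradictions = sorted(k for k, v in verdicts.items() if v is Verdict.CONTRADICTION)
print(contradictions)
```

Only the keys that both documents assert with different values are reported; the `component flows` claim appears in one document only and is classified as an omission, which matches the exclusion in the prompt.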

Model strengths on cross-document analysis:

GPT-5 excels at ownership/authority conflicts: it systematically checked "who owns this concept" in each document and found mismatches. Its findings cluster around "who writes what" and "who is authoritative."
Opus excels at structural/boundary contradictions: it identified where the documents draw architectural lines differently. Its findings cluster around "where are the boundaries" and "what crosses them."
Sonnet identifies the obvious/critical issues quickly but doesn't dig deeper. Viable for screening, not for thorough analysis.

Comparison to Finding #15 / #27 (single-document coherence checking): Single-document coherence asks "does this document contradict itself?" Cross-document consistency asks "do these documents contradict each other?" Key differences in results:

| Aspect | Single-doc coherence | Cross-doc consistency |
|---|---|---|
| Opus findings | 5-7 | 7 |
| GPT-5 findings | 4-6 | 6 |
| Sonnet findings | 4-5 | 4 |
| Opus unique | Design tensions | Structural/boundary mismatches |
| GPT-5 unique | Definitional errors | Ownership/authority conflicts |
| Best model | Task-dependent | Opus (most findings; faster than GPT-5) |

The relative ordering is similar (Opus ≥ GPT-5 > Sonnet for coherence-style tasks), but the CHARACTER of unique findings shifted. On single-doc coherence, Opus finds design tensions within a single design. On cross-doc consistency, Opus finds BOUNDARY disagreements between two designs. GPT-5 shifts from finding definitional errors to ownership conflicts.

Are these findings REAL bugs in the gargoyle documentation? Yes — several are genuine issues worth fixing:

The fills-vs-events-as-ground-truth is a real philosophical tension between the two documents that needs resolution.
The Position event ownership (OrderManager vs Ledger) is a real boundary conflict that affects implementation.
The Engine→Trading communication style (internal pipeline vs cross-domain command) is a genuine structural ambiguity.
The signal persistence claim ("never persisted" vs SignalEmitted audit event) is a direct textual contradiction.

These are the kind of cross-document inconsistencies that cause teams to build inconsistent implementations — one engineer reads Document A and builds one way, another reads Document B and builds differently.

Practical implication: Cross-document consistency analysis is a high-value task for documentation maintenance. Run it when:

A system has multiple architecture docs written at different times
A refactoring has updated one doc but not another
Multiple people contribute to design documentation
Moving from high-level overview to detailed specification

Opus is the recommended model for this task: faster than GPT-5 (52s vs 125s), most findings (7 vs 6), and uniquely strong at boundary disagreements. GPT-5 adds value for ownership-specific conflicts. Sonnet is sufficient for quick screening (it catches the Critical issue common to all three models in 14s) but won't find the architectural insights.

Cost-effectiveness:

Opus: 2,351 output tokens for 7 findings = 336 tokens/finding (52s)
GPT-5: 9,415 output + 8,384 reasoning tokens for 6 findings = 2,967 tokens/finding (125s)
Sonnet: 776 output tokens for 4 findings = 194 tokens/finding (14s)

Opus is the clear winner on this task type: more findings than GPT-5, 2.4x faster, and 8.8x more token-efficient per finding. GPT-5's massive reasoning investment (8,384 tokens) produced only one fewer finding than Opus — the verification overhead is not paying off here because cross-document contradictions are relatively easy to verify once identified (just check both documents).
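The ratios quoted above can be reproduced from the results table. The snippet below is a small check, not part of the original experiment. It follows the write-up in adding GPT-5's reasoning tokens to its output tokens; if the reported output count already includes reasoning tokens, the GPT-5 figures would be lower.

```python
# Token and timing figures copied from the results table in this finding.
runs = {
    "Opus":   {"seconds": 52,  "tokens": 2351,        "findings": 7},
    "GPT-5":  {"seconds": 125, "tokens": 9415 + 8384, "findings": 6},
    "Sonnet": {"seconds": 14,  "tokens": 776,         "findings": 4},
}


def tokens_per_finding(run):
    """Total counted tokens divided by the number of findings."""
    return run["tokens"] / run["findings"]


for name, run in runs.items():
    print(f"{name}: {tokens_per_finding(run):.1f} tokens/finding in {run['seconds']}s")

# How much faster and more token-efficient Opus was than GPT-5.
speedup = runs["GPT-5"]["seconds"] / runs["Opus"]["seconds"]
efficiency = tokens_per_finding(runs["GPT-5"]) / tokens_per_finding(runs["Opus"])
print(f"Opus vs GPT-5: {speedup:.1f}x faster, {efficiency:.1f}x fewer tokens per finding")
```

The per-finding values printed here are unrounded to one decimal place; the figures in the text are the same values rounded to whole tokens.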
