refactor(findings): split ALL-FINDINGS.md into per-experiment files
Break the monolithic 3249-line findings file into 29 individual files, one per experiment. Each file is named YYYY-MM-DD-NN-slug.md for easy chronological sorting and discovery. No content changes — purely structural reorganization.
@@ -0,0 +1,178 @@
# Finding 28: Cross-document consistency analysis (NEW task type): GPT-5 finds deep semantic contradictions; Opus finds structural/boundary mismatches; Sonnet identifies core issues quickly

**Date:** 2026-05-05
**Task:** Identify contradictions and inconsistencies BETWEEN two architecture documents
describing the same system: `system-overview.md` (323 lines, narrative overview with
component flows, invariants, and domain events) and `architecture.md` (213 lines,
DDD-focused with bounded contexts, context map, and message taxonomy).
**How we used them:** BOTH documents provided as full text in a single prompt (~25KB
total). Highly structured prompt specifying 5 categories of cross-document inconsistency
(terminology conflicts, structural contradictions, flow/sequence conflicts,
ownership/authority conflicts, philosophical contradictions). Required specific output
format per finding. Explicitly excluded omissions (things one doc covers and the other
doesn't) and detail-level differences. No tools, no project context beyond the two
documents. This is a NEW analytical task not previously tested: reasoning about
CONSISTENCY BETWEEN documents rather than internal coherence of a single document.

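A prompt of the shape described above might be assembled roughly as follows. This is
a minimal sketch, not the prompt actually used: the function name, the wording, and
the `=== Document A/B ===` delimiters are illustrative assumptions. Only the five
categories and the exclusion rule come from the experiment description.

```python
# Hypothetical prompt builder. The category names are taken from the
# write-up; the function name, wording, and delimiters are assumptions.
CATEGORIES = [
    "terminology conflicts",
    "structural contradictions",
    "flow/sequence conflicts",
    "ownership/authority conflicts",
    "philosophical contradictions",
]


def build_prompt(doc_a: str, doc_b: str) -> str:
    """Combine both documents and the task rules into one prompt string."""
    category_list = "\n".join(
        f"{i}. {name}" for i, name in enumerate(CATEGORIES, start=1)
    )
    return (
        "Identify contradictions BETWEEN the two documents below.\n"
        f"Report only these categories:\n{category_list}\n"
        "Exclude omissions (topics only one document covers) and "
        "differences in level of detail.\n"
        "For each finding give: category, severity, a quote from each "
        "document, and why both cannot be correct.\n\n"
        f"=== Document A ===\n{doc_a}\n\n"
        f"=== Document B ===\n{doc_b}\n"
    )
```
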
| Model | Time | Output tokens | Reasoning tokens | Inconsistencies found | Critical | High | Medium |
|---|---|---|---|---|---|---|---|
| GPT-5 | 125s | 9,415 | 8,384 | 6 | 2 | 3 | 1 |
| Claude Opus 4.6 | 52s | 2,351 | (internal) | 7 | 3 | 3 | 1 |
| Claude Sonnet 4.6 | 14s | 776 | (internal) | 4 | 1 | 2 | 1 |

**What they found: common ground (all 3 identified):**
- Event sourcing (all events as source of truth) vs fills-only ground truth:
  Document A says fills are "ground truth from which all other state can be
  derived," while Document B says "events are the source of truth, state is
  computed by replaying events." A treats fills as the recovery foundation;
  B treats ALL domain events as authoritative. All three models rated this
  Critical.
- Bounded context naming mismatch: "Decision Engine" / "Order Management" (A)
  vs "Engine" / "Trading" (B) for the same functional responsibilities.
  GPT-5 folded this into a broader ownership analysis; Opus and Sonnet
  surfaced it as its own finding.
- Signal classification conflict: Document A lists "Signal emitted" as a domain
  event; Document B explicitly categorizes `SignalEmitted` as an audit event
  ("not used to rebuild state"). This determines event store design and
  recovery semantics.

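To make the first common-ground item concrete, the sketch below contrasts the two
recovery models on a toy event log. It is an illustration only: the `Event` type,
the event names `OrderPlaced` and `Fill`, and the assumption that one fill closes
one order are all hypothetical and do not come from either document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    kind: str    # hypothetical event names: "OrderPlaced", "Fill"
    symbol: str
    qty: int


def positions_from_fills(log):
    """Fills-only reading (Document A): ignore everything except fills."""
    positions = {}
    for event in log:
        if event.kind == "Fill":
            positions[event.symbol] = positions.get(event.symbol, 0) + event.qty
    return positions


def state_from_all_events(log):
    """Event-sourced reading (Document B): every event mutates state."""
    state = {"positions": {}, "open_orders": 0}
    for event in log:
        if event.kind == "OrderPlaced":
            state["open_orders"] += 1
        elif event.kind == "Fill":
            # Simplification: each fill closes exactly one open order.
            state["open_orders"] -= 1
            positions = state["positions"]
            positions[event.symbol] = positions.get(event.symbol, 0) + event.qty
    return state


log = [
    Event("OrderPlaced", "AAPL", 10),
    Event("Fill", "AAPL", 10),
    Event("OrderPlaced", "MSFT", 5),  # still open: no fill yet
]
print(positions_from_fills(log))
print(state_from_all_events(log))
```

Under the fills-only reading the open MSFT order cannot be recovered from the log;
under the event-sourced reading it can. That gap is why the choice affects recovery
semantics rather than just terminology.
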
**GPT-5 unique findings (not in either Claude model):**
- Signal persistence contradiction: Document A states "Signals are never
  persisted" while Document B lists `SignalEmitted` as an audit event that IS
  persisted and states the audit log is mandatory for trading. These are
  directly incompatible claims about whether signal data is stored.
- Audit event ownership conflict: Document A says "Decision approved" events
  originate from PortfolioRisk. Document B states "only the decision engine
  writes audit events" and lists `DecisionApproved` as an audit event example.
  If PortfolioRisk is part of Risk (not Engine), this is an authority violation.
- "Single writer per user" (A: OrderManager writes all trading state) vs
  per-aggregate single-writer (B: each aggregate writes its own event stream,
  Ledger owns positions). These are incompatible authority models: either OM
  centralizes writes or each domain owns its own events.

**Claude Opus unique findings (not in either other model):**
- Engine → OrderManager is an internal pipeline flow (A: same subgraph, direct
  arrow) vs Engine → Trading is a cross-domain COMMAND (B: `PlaceOrder` command
  crossing a bounded context boundary). This structural disagreement determines
  whether order management is an internal pipeline stage or an independent domain
  with its own aggregates and command validation.
- Signal Risk's architectural position: Document A shows a two-stage risk
  architecture (SignalRisk pre-aggregation, PortfolioRisk post-aggregation)
  where Risk is embedded in the pipeline. Document B's context map shows Risk
  as a separate domain that Engine merely QUERIES ("kill switch active?");
  no arrow shows signal routing through Risk. Either risk logic lives inside
  Engine (contradicting B's context boundary) or the context map is incomplete.
- The "reduce" step ownership: A's top-level flow labels `Approved →|"reduce"|
  Decisions` (reduction at aggregation), while A's own domain events table says
  "Decision reduced" originates from PortfolioRisk (reduction after aggregation).
  This is actually an INTRA-document inconsistency in Document A, but Opus surfaced
  it as part of cross-doc analysis.

**Claude Sonnet unique findings:**
- None genuinely unique. All 4 findings overlapped with GPT-5/Opus common ground
  (event sourcing, signal persistence, context count/naming). Sonnet was efficient
  (14s, 776 tokens) but didn't identify any inconsistency that the other two missed.

**Quality assessment:**
- **GPT-5** produced 6 well-reasoned findings with the deepest analysis of
  OWNERSHIP conflicts. Its signal-persistence contradiction and single-writer
  authority conflict are genuinely important: they reveal places where the two
  documents would lead implementers to build fundamentally different systems.
  Every finding quotes specific text from both documents and explains precisely
  WHY they can't both be correct. The reasoning investment (8,384 tokens) was
  used for thorough cross-referencing between documents.
- **Claude Opus** found the most inconsistencies (7) and was remarkably fast
  (52s vs GPT-5's 125s). Its unique strength: identifying STRUCTURAL contradictions
  about component boundaries and communication patterns. The Engine→Trading
  command vs internal pipeline finding is architecturally the most significant
  discovery: it reveals a fundamental disagreement about whether order
  management is INSIDE or OUTSIDE the decision engine's boundary. Opus also
  caught a bonus intra-document inconsistency (the "reduce" labeling error).
- **Claude Sonnet** was the fastest (14s) and most concise (776 tokens) but
  found only the obvious common-ground issues. For cross-document consistency,
  Sonnet's speed advantage came at the cost of missing the architectural
  insights that make this task valuable. It did correctly identify all the
  Critical-level issues, making it viable as a quick first-pass screen.

**Key insight: cross-document consistency is a DISTINCT task type**
This is fundamentally different from single-document analysis (assumptions,
race conditions, coherence). It requires:
1. Building a mental model from Document A
2. Building a separate mental model from Document B
3. Finding places where the models are incompatible
4. Reasoning about WHY they can't both be correct (not just "different")

Step 4 is what distinguishes this from simple diff-detection. Many surface
differences (naming, detail level, scope) are NOT contradictions; the models
must judge which differences are genuinely incompatible vs. complementary.
The prompt explicitly excluded omissions and detail-level differences, and
all three models respected this constraint well.

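The four steps above can be sketched as a small comparison routine. This is a
simplification under stated assumptions: each document's "mental model" is reduced
to a topic-to-claim mapping, and the step 4 judgment is supplied by the caller
because it cannot be mechanized. The function name, the `retry policy` and
`deployment` topics, and the paraphrased claims are hypothetical; only the first
two topics echo findings reported in this write-up.

```python
def find_contradictions(model_a, model_b, is_incompatible):
    """Return (topic, claim_a, claim_b) for topics that both documents
    address, whose claims differ AND are judged incompatible."""
    findings = []
    # Step 3: only topics present in both models; omissions are skipped.
    for topic in sorted(model_a.keys() & model_b.keys()):
        claim_a, claim_b = model_a[topic], model_b[topic]
        # Step 4: a difference counts only if it is judged incompatible.
        if claim_a != claim_b and is_incompatible(topic, claim_a, claim_b):
            findings.append((topic, claim_a, claim_b))
    return findings


# Steps 1 and 2: one (paraphrased) model per document.
model_a = {
    "source of truth": "fills",
    "signal persistence": "never persisted",
    "retry policy": "retries with backoff",          # hypothetical topic
    "deployment": "single host",                     # hypothetical, only in A
}
model_b = {
    "source of truth": "all domain events",
    "signal persistence": "persisted as audit event",
    "retry policy": "up to 3 retries with backoff",  # hypothetical topic
}

# Stand-in for the reasoning step: topics a reviewer judged incompatible.
JUDGED_INCOMPATIBLE = {"source of truth", "signal persistence"}

findings = find_contradictions(
    model_a, model_b, lambda topic, a, b: topic in JUDGED_INCOMPATIBLE
)
for topic, claim_a, claim_b in findings:
    print(f"{topic}: A says {claim_a!r}, B says {claim_b!r}")
```

The `deployment` topic is dropped as an omission and `retry policy` as a
detail-level difference, which mirrors the exclusions in the prompt.
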
**Model strengths on cross-document analysis:**
- **GPT-5** excels at ownership/authority conflicts: it systematically
  checked "who owns this concept" in each document and found mismatches.
  Its findings cluster around "who writes what" and "who is authoritative."
- **Opus** excels at structural/boundary contradictions: it identified where
  the documents draw architectural lines differently. Its findings cluster
  around "where are the boundaries" and "what crosses them."
- **Sonnet** identifies the obvious/critical issues quickly but doesn't dig
  deeper. Viable for screening, not for thorough analysis.

**Comparison to Finding #15 / #27 (single-document coherence checking):**
Single-document coherence asks "does this document contradict itself?"
Cross-document consistency asks "do these documents contradict each other?"
Key differences in results:

| Aspect | Single-doc coherence | Cross-doc consistency |
|---|---|---|
| Opus findings | 5-7 | 7 |
| GPT-5 findings | 4-6 | 6 |
| Sonnet findings | 4-5 | 4 |
| Opus unique | Design tensions | Structural/boundary mismatches |
| GPT-5 unique | Definitional errors | Ownership/authority conflicts |
| Best model | Task-dependent | Opus (most findings, faster than GPT-5) |

The relative ordering is similar (Opus ≥ GPT-5 > Sonnet for coherence-style
tasks), but the CHARACTER of unique findings shifted. On single-doc coherence,
Opus finds design tensions within a single design. On cross-doc consistency,
Opus finds BOUNDARY disagreements between two designs. GPT-5 shifts from
finding definitional errors to ownership conflicts.

**Are these findings REAL bugs in the gargoyle documentation?**
Yes, several are genuine issues worth fixing:
1. The fills-vs-events-as-ground-truth is a real philosophical tension between
   the two documents that needs resolution.
2. The Position event ownership (OrderManager vs Ledger) is a real boundary
   conflict that affects implementation.
3. The Engine→Trading communication style (internal pipeline vs cross-domain
   command) is a genuine structural ambiguity.
4. The signal persistence claim ("never persisted" vs `SignalEmitted` audit
   event) is a direct textual contradiction.

These are the kinds of cross-document inconsistencies that cause teams to build
inconsistent implementations: one engineer reads Document A and builds one way,
another reads Document B and builds differently.

**Practical implication:** Cross-document consistency analysis is a high-value
task for documentation maintenance. Run it when:
- A system has multiple architecture docs written at different times
- A refactoring has updated one doc but not another
- Multiple people contribute to design documentation
- A design is moving from high-level overview to detailed specification

Opus is the recommended model for this task: faster than GPT-5 (52s vs 125s),
most findings (7 vs 6), and uniquely strong at boundary disagreements. GPT-5 adds
value for ownership-specific conflicts. Sonnet is sufficient for quick
screening (catches the Critical issues in 14s) but won't find the architectural
insights.

**Cost-effectiveness:**
- Opus: 2,351 output tokens for 7 findings = 336 tokens/finding (52s)
- GPT-5: 9,415 output + 8,384 reasoning for 6 findings = 2,967 tokens/finding (125s)
- Sonnet: 776 output tokens for 4 findings = 194 tokens/finding (14s)

Opus is the clear winner over GPT-5 on this task type: more findings, 2.4x
faster, and 8.8x more token-efficient per finding. Sonnet is cheaper still per
finding (194 tokens) but surfaced nothing the other two missed. GPT-5's massive
reasoning investment (8,384 tokens) still produced one fewer finding than Opus.
The verification overhead is not paying off here because cross-document contradictions
are relatively easy to verify once identified (just check both documents).
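The arithmetic behind these figures can be reproduced with a few lines. The sketch
follows the write-up's own accounting, which adds GPT-5's reasoning tokens on top
of its output tokens; the dictionary layout and variable names are illustrative.

```python
# Figures copied from the results table. GPT-5's total follows the
# write-up's accounting: output tokens plus reasoning tokens.
runs = {
    "Opus":   {"tokens": 2351,        "findings": 7, "seconds": 52},
    "GPT-5":  {"tokens": 9415 + 8384, "findings": 6, "seconds": 125},
    "Sonnet": {"tokens": 776,         "findings": 4, "seconds": 14},
}

# Tokens spent per reported finding, per model.
per_finding = {
    model: run["tokens"] / run["findings"] for model, run in runs.items()
}

# Opus relative to GPT-5: wall-clock speedup and per-finding token ratio.
speedup = runs["GPT-5"]["seconds"] / runs["Opus"]["seconds"]
efficiency = per_finding["GPT-5"] / per_finding["Opus"]

for model, value in per_finding.items():
    print(f"{model}: {value:.1f} tokens/finding")
print(f"Opus vs GPT-5: {speedup:.1f}x faster, {efficiency:.1f}x fewer tokens per finding")
```

GPT-5's exact value is 17,799 / 6 = 2,966.5 tokens per finding, which the summary
above rounds up to 2,967.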