refactor(findings): split ALL-FINDINGS.md into per-experiment files

Break the monolithic 3249-line findings file into 29 individual files, one per experiment. Each file is named YYYY-MM-DD-NN-slug.md for easy chronological sorting and discovery. No content changes — purely structural reorganization.
2026-05-06 07:15:50 -07:00
parent 1b108ff66e
commit 6af8a6ee10
32 changed files with 3232 additions and 3254 deletions
@@ -0,0 +1,178 @@
+# Finding 28: Cross-document consistency analysis: NEW task type — GPT-5 finds deep semantic contradictions; Opus finds structural/boundary mismatches; Sonnet identifies core issues quickly
+
+**Date:** 2026-05-05
+**Task:** Identify contradictions and inconsistencies BETWEEN two architecture documents
+describing the same system: `system-overview.md` (323 lines, narrative overview with
+component flows, invariants, and domain events) and `architecture.md` (213 lines,
+DDD-focused with bounded contexts, context map, and message taxonomy).
+**How we used them:** BOTH documents provided as full text in a single prompt (~25KB
+total). Highly structured prompt specifying 5 categories of cross-document inconsistency
+(terminology conflicts, structural contradictions, flow/sequence conflicts,
+ownership/authority conflicts, philosophical contradictions). Required specific output
+format per finding. Explicitly excluded omissions (things one doc covers and the other
+doesn't) and detail-level differences. No tools, no project context beyond the two
+documents. This is a NEW analytical task not previously tested: reasoning about
+CONSISTENCY BETWEEN documents rather than internal coherence of a single document.
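
The prompt layout described above can be sketched as a small builder. This is only an illustration under stated assumptions: the five categories and the two exclusions come from the description, while the function name `build_prompt`, the delimiter strings, and the per-finding output fields are hypothetical and not the wording of the actual prompt.

```python
# Categories and exclusions as listed in the task description.
CATEGORIES = [
    "terminology conflicts",
    "structural contradictions",
    "flow/sequence conflicts",
    "ownership/authority conflicts",
    "philosophical contradictions",
]
EXCLUSIONS = [
    "omissions (things one document covers and the other does not)",
    "detail-level differences",
]


def build_prompt(doc_a: str, doc_b: str) -> str:
    """Assemble one prompt containing both documents in full (hypothetical wording)."""
    category_lines = "\n".join(f"{i}. {name}" for i, name in enumerate(CATEGORIES, 1))
    exclusion_lines = "\n".join(f"- {item}" for item in EXCLUSIONS)
    return (
        "Identify contradictions BETWEEN the two documents below.\n\n"
        f"Report only these categories:\n{category_lines}\n\n"
        f"Do NOT report:\n{exclusion_lines}\n\n"
        # Assumed output format; the real per-finding format is not shown in the source.
        "For each finding give: category, severity, a quote from each document, "
        "and why both cannot be correct.\n\n"
        f"=== Document A ===\n{doc_a}\n\n=== Document B ===\n{doc_b}\n"
    )


prompt = build_prompt("text of system-overview.md", "text of architecture.md")
```

Keeping the categories and exclusions as data makes it easy to rerun the same comparison on another document pair without rewording the instructions.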
+
+| Model | Time | Output tokens | Reasoning tokens | Inconsistencies found | Critical | High | Medium |
+|---|---|---|---|---|---|---|---|
+| GPT-5 | 125s | 9,415 | 8,384 | 6 | 2 | 3 | 1 |
+| Claude Opus 4.6 | 52s | 2,351 | (internal) | 7 | 3 | 3 | 1 |
+| Claude Sonnet 4.6 | 14s | 776 | (internal) | 4 | 1 | 2 | 1 |
+
+**What they found — common ground (all 3 identified):**
+- Event sourcing (all events as source of truth) vs fills-only ground truth:
+  Document A says fills are "ground truth from which all other state can be
+  derived," while Document B says "events are the source of truth, state is
+  computed by replaying events." A treats fills as the recovery foundation;
+  B treats ALL domain events as authoritative. All three models rated this
+  Critical.
+- Bounded context naming mismatch: "Decision Engine" / "Order Management" (A)
+  vs "Engine" / "Trading" (B) for the same functional responsibilities.
+  GPT-5 folded this into a broader ownership analysis; Opus and Sonnet
+  surfaced it as its own finding.
+- Signal classification conflict: Document A lists "Signal emitted" as a domain
+  event; Document B explicitly categorizes `SignalEmitted` as an audit event
+  ("not used to rebuild state"). This determines event store design and
+  recovery semantics.
+
+**GPT-5 unique findings (not in either Claude model):**
+- Signal persistence contradiction: Document A states "Signals are never
+  persisted" while Document B lists `SignalEmitted` as an audit event that IS
+  persisted and states the audit log is mandatory for trading. These are
+  directly incompatible claims about whether signal data is stored.
+- Audit event ownership conflict: Document A says "Decision approved" events
+  originate from PortfolioRisk. Document B states "only the decision engine
+  writes audit events" and lists `DecisionApproved` as an audit event example.
+  If PortfolioRisk is part of Risk (not Engine), this is an authority violation.
+- "Single writer per user" (A: OrderManager writes all trading state) vs
+  per-aggregate single-writer (B: each aggregate writes its own event stream,
+  Ledger owns positions). These are incompatible authority models — either OM
+  centralizes writes or each domain owns its own events.
+
+**Claude Opus unique findings (not in either other model):**
+- Engine → OrderManager is an internal pipeline flow (A: same subgraph, direct
+  arrow) vs Engine → Trading is a cross-domain COMMAND (B: `PlaceOrder` command
+  crossing a bounded context boundary). This structural disagreement determines
+  whether order management is an internal pipeline stage or an independent domain
+  with its own aggregates and command validation.
+- Signal Risk's architectural position: Document A shows a two-stage risk
+  architecture (SignalRisk pre-aggregation, PortfolioRisk post-aggregation)
+  where Risk is embedded in the pipeline. Document B's context map shows Risk
+  as a separate domain that Engine merely QUERIES ("kill switch active?") —
+  no arrow shows signal routing through Risk. Either risk logic lives inside
+  Engine (contradicting B's context boundary) or the context map is incomplete.
+- The "reduce" step ownership: A's top-level flow labels `Approved →|"reduce"|
+  Decisions` (reduction at aggregation), while A's own domain events table says
+  "Decision reduced" originates from PortfolioRisk (reduction after aggregation).
+  This is actually an INTRA-document inconsistency in Document A, but Opus surfaced
+  it as part of cross-doc analysis.
+
+**Claude Sonnet unique findings:**
+- None genuinely unique. All 4 findings overlapped with issues that GPT-5 or Opus
+  also reported (event sourcing, signal classification, context count/naming).
+  Sonnet was efficient (14s, 776 tokens) but did not identify any inconsistency
+  that the other two missed.
+
+**Quality assessment:**
+- **GPT-5** produced 6 well-reasoned findings with the deepest analysis of
+  OWNERSHIP conflicts. Its signal-persistence contradiction and single-writer
+  authority conflict are genuinely important — they reveal places where the two
+  documents would lead implementers to build fundamentally different systems.
+  Every finding quotes specific text from both documents and explains precisely
+  WHY they can't both be correct. The reasoning investment (8,384 tokens) was
+  used for thorough cross-referencing between documents.
+- **Claude Opus** found the most inconsistencies (7) and was remarkably fast
+  (52s vs GPT-5's 125s). Its unique strength: identifying STRUCTURAL contradictions
+  about component boundaries and communication patterns. The Engine→Trading
+  command vs internal pipeline finding is architecturally the most significant
+  discovery — it reveals a fundamental disagreement about whether order
+  management is INSIDE or OUTSIDE the decision engine's boundary. Opus also
+  caught a bonus intra-document inconsistency (the "reduce" labeling error).
+- **Claude Sonnet** was the fastest (14s) and most concise (776 tokens) but
+  found only the obvious common-ground issues. For cross-document consistency,
+  Sonnet's speed advantage came at the cost of missing the architectural
+  insights that make this task valuable. It did catch the one inconsistency
+  that all three models rated Critical (event sourcing vs fills-only ground
+  truth), making it viable as a quick first-pass screen.
+
+**Key insight — cross-document consistency is a DISTINCT task type:**
+This is fundamentally different from single-document analysis (assumptions,
+race conditions, coherence). It requires:
+1. Building a mental model from Document A
+2. Building a separate mental model from Document B
+3. Finding places where the models are incompatible
+4. Reasoning about WHY they can't both be correct (not just "different")
+
+Step 4 is what distinguishes this from simple diff-detection. Many surface
+differences (naming, detail level, scope) are NOT contradictions — the models
+must judge which differences are genuinely incompatible vs. complementary.
+The prompt explicitly excluded omissions and detail-level differences, and
+all three models respected this constraint well.
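
The four steps can be illustrated with a toy comparison. This is a minimal sketch, not how any of the models work internally: the `Claim` type, the `exclusive` flag, and the paraphrased claim values are assumptions made for the example. The flag stands in for the step 4 judgment, which in practice requires reasoning rather than a boolean.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    subject: str            # concept the claim is about
    value: str              # what the document asserts about it
    exclusive: bool = True  # True if only one value can hold for this subject


def build_model(claims):
    """Steps 1-2: index one document's claims by subject."""
    return {claim.subject: claim for claim in claims}


def find_contradictions(model_a, model_b):
    """Steps 3-4: keep subjects both documents cover with incompatible values."""
    conflicts = []
    # Intersecting the keys drops omissions (subjects only one document covers).
    for subject in sorted(model_a.keys() & model_b.keys()):
        a, b = model_a[subject], model_b[subject]
        # Different but non-exclusive values are complementary, not contradictory.
        if a.value != b.value and a.exclusive and b.exclusive:
            conflicts.append((subject, a.value, b.value))
    return conflicts


# Paraphrased claims; wording is illustrative.
DOC_A = [
    Claim("source of truth", "fills are ground truth"),
    Claim("signal persistence", "signals are never persisted"),
    Claim("presentation style", "narrative overview", exclusive=False),
    Claim("component flows", "described"),  # not covered by Document B
]
DOC_B = [
    Claim("source of truth", "all domain events, replayed"),
    Claim("signal persistence", "SignalEmitted stored as audit event"),
    Claim("presentation style", "DDD-focused", exclusive=False),
]

conflicts = find_contradictions(build_model(DOC_A), build_model(DOC_B))
for subject, in_a, in_b in conflicts:
    print(f"{subject}: A says '{in_a}' / B says '{in_b}'")
```

Only the two exclusive subjects are reported. The style difference and the omission are filtered out, which mirrors the exclusions the prompt asked for.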
+
+**Model strengths on cross-document analysis:**
+- **GPT-5** excels at ownership/authority conflicts: it systematically
+  checked "who owns this concept" in each document and found mismatches.
+  Its findings cluster around "who writes what" and "who is authoritative."
+- **Opus** excels at structural/boundary contradictions: it identified where
+  the documents draw architectural lines differently. Its findings cluster
+  around "where are the boundaries" and "what crosses them."
+- **Sonnet** identifies the obvious/critical issues quickly but doesn't dig
+  deeper. Viable for screening, not for thorough analysis.
+
+**Comparison to Finding #15 / #27 (single-document coherence checking):**
+Single-document coherence asks "does this document contradict itself?"
+Cross-document consistency asks "do these documents contradict each other?"
+Key differences in results:
+
+| Aspect | Single-doc coherence | Cross-doc consistency |
+|---|---|---|
+| Opus findings | 5-7 | 7 |
+| GPT-5 findings | 4-6 | 6 |
+| Sonnet findings | 4-5 | 4 |
+| Opus unique | Design tensions | Structural/boundary mismatches |
+| GPT-5 unique | Definitional errors | Ownership/authority conflicts |
+| Best model | Task-dependent | Opus (most findings; faster than GPT-5) |
+
+The relative ordering is similar (Opus ≥ GPT-5 > Sonnet for coherence-style
+tasks), but the CHARACTER of unique findings shifted. On single-doc coherence,
+Opus finds design tensions within a single design. On cross-doc consistency,
+Opus finds BOUNDARY disagreements between two designs. GPT-5 shifts from
+finding definitional errors to ownership conflicts.
+
+**Are these findings REAL bugs in the gargoyle documentation?**
+Yes — several are genuine issues worth fixing:
+1. The fills-vs-events-as-ground-truth is a real philosophical tension between
+   the two documents that needs resolution.
+2. The Position event ownership (OrderManager vs Ledger) is a real boundary
+   conflict that affects implementation.
+3. The Engine→Trading communication style (internal pipeline vs cross-domain
+   command) is a genuine structural ambiguity.
+4. The signal persistence claim ("never persisted" vs `SignalEmitted` audit
+   event) is a direct textual contradiction.
+
+These are the kind of cross-document inconsistencies that cause teams to build
+inconsistent implementations — one engineer reads Document A and builds one way,
+another reads Document B and builds differently.
+
+**Practical implication:** Cross-document consistency analysis is a high-value
+task for documentation maintenance. Run it when:
+- A system has multiple architecture docs written at different times
+- A refactoring has updated one doc but not another
+- Multiple people contribute to design documentation
+- Moving from high-level overview to detailed specification
+
+Opus is the recommended model for this task: faster than GPT-5 (52s vs 125s),
+most findings (7 vs 6), and uniquely strong at boundary disagreements. GPT-5 adds
+value for ownership-specific conflicts. Sonnet is the fastest (14s) and is
+sufficient for quick screening, since it catches the shared Critical issue, but
+it won't find the architectural insights.
+
+**Cost-effectiveness:**
+- Opus: 2,351 output tokens for 7 findings = 336 tokens/finding (52s)
+- GPT-5: 9,415 output + 8,384 reasoning tokens for 6 findings = 2,967 tokens/finding (125s)
+- Sonnet: 776 output tokens for 4 findings = 194 tokens/finding (14s)
+
+On reported tokens, Opus is the clear winner over GPT-5 on this task type: more
+findings, 2.4x faster, and 8.8x more token-efficient per finding. (Claude
+reasoning tokens are internal and are not included in these counts.) Sonnet is
+cheapest per finding but surfaced nothing the other two missed. GPT-5's massive
+reasoning investment (8,384 tokens) still left it one finding short of Opus.
+The verification overhead is not paying off here because cross-document
+contradictions are relatively easy to verify once identified (just check both
+documents).
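
The per-finding figures and the two ratios can be rechecked with a few lines of arithmetic. The numbers are copied from the results table. The dictionary layout and the helper name are illustrative, and Claude reasoning tokens are entered as 0 only because the table reports them as "(internal)", which is an assumption of this sketch rather than a measured value.

```python
# Figures from the results table; reasoning=0 means "not reported", not "none used".
RUNS = {
    "Opus":   {"seconds": 52,  "output": 2351, "reasoning": 0,    "findings": 7},
    "GPT-5":  {"seconds": 125, "output": 9415, "reasoning": 8384, "findings": 6},
    "Sonnet": {"seconds": 14,  "output": 776,  "reasoning": 0,    "findings": 4},
}


def tokens_per_finding(run):
    """Reported tokens (output plus reasoning) divided by the number of findings."""
    return (run["output"] + run["reasoning"]) / run["findings"]


per_finding = {name: tokens_per_finding(run) for name, run in RUNS.items()}
speedup = RUNS["GPT-5"]["seconds"] / RUNS["Opus"]["seconds"]
efficiency = per_finding["GPT-5"] / per_finding["Opus"]

for name, value in per_finding.items():
    print(f"{name}: {value:.1f} tokens/finding")
print(f"Opus vs GPT-5: {speedup:.1f}x faster, {efficiency:.1f}x fewer tokens per finding")
```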