refactor(findings): split ALL-FINDINGS.md into per-experiment files

Break the monolithic 3249-line findings file into 29 individual files,
one per experiment. Each file is named YYYY-MM-DD-NN-slug.md for easy
chronological sorting and discovery.

No content changes — purely structural reorganization.
Rodin
2026-05-06 07:15:50 -07:00
parent 1b108ff66e
commit 6af8a6ee10
32 changed files with 3232 additions and 3254 deletions
@@ -0,0 +1,178 @@
# Finding 28: Cross-document consistency analysis: NEW task type — GPT-5 finds deep semantic contradictions; Opus finds structural/boundary mismatches; Sonnet identifies core issues quickly
**Date:** 2026-05-05
**Task:** Identify contradictions and inconsistencies BETWEEN two architecture documents
describing the same system: `system-overview.md` (323 lines, narrative overview with
component flows, invariants, and domain events) and `architecture.md` (213 lines,
DDD-focused with bounded contexts, context map, and message taxonomy).
**How we used them:** BOTH documents provided as full text in a single prompt (~25KB
total). Highly structured prompt specifying 5 categories of cross-document inconsistency
(terminology conflicts, structural contradictions, flow/sequence conflicts,
ownership/authority conflicts, philosophical contradictions). Required specific output
format per finding. Explicitly excluded omissions (things one doc covers and the other
doesn't) and detail-level differences. No tools, no project context beyond the two
documents. This is a NEW analytical task not previously tested: reasoning about
CONSISTENCY BETWEEN documents rather than internal coherence of a single document.
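
As a rough illustration of the prompt structure described above, the sketch below assembles both documents and the five categories into a single prompt string. The category names come from this write-up; the instruction wording, the per-finding output fields, the `build_prompt` helper, and the mapping of "Document A" to `system-overview.md` and "Document B" to `architecture.md` are assumptions made for illustration, not the prompt that was actually run.

```python
# Category names are taken from the write-up. Everything else in this
# prompt (wording, output fields, A/B labels) is an illustrative assumption.
CATEGORIES = [
    "terminology conflicts",
    "structural contradictions",
    "flow/sequence conflicts",
    "ownership/authority conflicts",
    "philosophical contradictions",
]


def build_prompt(doc_a: str, doc_b: str) -> str:
    """Assemble one prompt that contains both documents in full."""
    numbered = "\n".join(f"{i}. {name}" for i, name in enumerate(CATEGORIES, 1))
    return (
        "Identify contradictions BETWEEN Document A and Document B.\n"
        f"Report only these categories:\n{numbered}\n\n"
        "Do NOT report omissions (topics only one document covers) "
        "or differences in level of detail.\n\n"
        "For each finding give: category, severity, a quote from each "
        "document, and why both statements cannot be correct.\n\n"
        f"=== Document A (system-overview.md) ===\n{doc_a}\n\n"
        f"=== Document B (architecture.md) ===\n{doc_b}\n"
    )


prompt = build_prompt("alpha text", "beta text")
```

Keeping both documents in one prompt, with no tools, matches the setup described here: the model has to hold both texts in context at once rather than retrieve them piecemeal.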
| Model | Time | Output tokens | Reasoning tokens | Inconsistencies found | Critical | High | Medium |
|---|---|---|---|---|---|---|---|
| GPT-5 | 125s | 9,415 | 8,384 | 6 | 2 | 3 | 1 |
| Claude Opus 4.6 | 52s | 2,351 | (internal) | 7 | 3 | 3 | 1 |
| Claude Sonnet 4.6 | 14s | 776 | (internal) | 4 | 1 | 2 | 1 |
**What they found — common ground (all 3 identified):**
- Event sourcing (all events as source of truth) vs fills-only ground truth:
Document A says fills are "ground truth from which all other state can be
derived," while Document B says "events are the source of truth, state is
computed by replaying events." A treats fills as the recovery foundation;
B treats ALL domain events as authoritative. All three models rated this
Critical.
- Bounded context naming mismatch: "Decision Engine" / "Order Management" (A)
vs "Engine" / "Trading" (B) for the same functional responsibilities.
GPT-5 folded this into a broader ownership analysis; Opus and Sonnet
surfaced it as its own finding.
- Signal classification conflict: Document A lists "Signal emitted" as a domain
event; Document B explicitly categorizes `SignalEmitted` as an audit event
("not used to rebuild state"). This determines event store design and
recovery semantics.
**GPT-5 unique findings (not in either Claude model):**
- Signal persistence contradiction: Document A states "Signals are never
persisted" while Document B lists `SignalEmitted` as an audit event that IS
persisted and states the audit log is mandatory for trading. These are
directly incompatible claims about whether signal data is stored.
- Audit event ownership conflict: Document A says "Decision approved" events
originate from PortfolioRisk. Document B states "only the decision engine
writes audit events" and lists `DecisionApproved` as an audit event example.
If PortfolioRisk is part of Risk (not Engine), this is an authority violation.
- "Single writer per user" (A: OrderManager writes all trading state) vs
per-aggregate single-writer (B: each aggregate writes its own event stream,
Ledger owns positions). These are incompatible authority models — either OM
centralizes writes or each domain owns its own events.
**Claude Opus unique findings (not in either other model):**
- Engine → OrderManager is an internal pipeline flow (A: same subgraph, direct
arrow) vs Engine → Trading is a cross-domain COMMAND (B: `PlaceOrder` command
crossing a bounded context boundary). This structural disagreement determines
whether order management is an internal pipeline stage or an independent domain
with its own aggregates and command validation.
- Signal Risk's architectural position: Document A shows a two-stage risk
architecture (SignalRisk pre-aggregation, PortfolioRisk post-aggregation)
where Risk is embedded in the pipeline. Document B's context map shows Risk
as a separate domain that Engine merely QUERIES ("kill switch active?") —
no arrow shows signal routing through Risk. Either risk logic lives inside
Engine (contradicting B's context boundary) or the context map is incomplete.
- The "reduce" step ownership: A's top-level flow labels `Approved →|"reduce"|
Decisions` (reduction at aggregation), while A's own domain events table says
"Decision reduced" originates from PortfolioRisk (reduction after aggregation).
This is actually an INTRA-document inconsistency in Document A, but Opus surfaced
it as part of cross-doc analysis.
**Claude Sonnet unique findings:**
- None genuinely unique. All 4 findings overlapped with GPT-5/Opus common ground
(event sourcing, signal persistence, context count/naming). Sonnet was efficient
(14s, 776 tokens) but didn't identify any inconsistency that the other two missed.
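
The common-ground and unique-findings split above amounts to set intersection and set difference over each model's findings. A minimal sketch follows, using shorthand labels invented here for the bullets in this section. Only findings itemized above are included, so the set sizes can be smaller than the counts in the results table.

```python
# Labels are shorthand invented for this sketch, one per bullet above.
COMMON = {"event-sourcing-vs-fills", "context-naming", "signal-classification"}

findings = {
    "GPT-5": COMMON | {"signal-persistence", "audit-event-ownership",
                       "single-writer-scope"},
    "Opus": COMMON | {"engine-trading-boundary", "signal-risk-position",
                      "reduce-step-ownership"},
    "Sonnet": set(COMMON),
}


def common_ground(by_model):
    """Findings that every model reported."""
    return set.intersection(*by_model.values())


def unique_to(model, by_model):
    """Findings reported by `model` and by no other model."""
    others = set().union(*(f for name, f in by_model.items() if name != model))
    return by_model[model] - others


shared = common_ground(findings)
unique = {name: unique_to(name, findings) for name in findings}
```

With these inputs, `unique["Sonnet"]` is empty while GPT-5 and Opus each keep three labels, which mirrors the breakdown in the bullets.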
**Quality assessment:**
- **GPT-5** produced 6 well-reasoned findings with the deepest analysis of
OWNERSHIP conflicts. Its signal-persistence contradiction and single-writer
authority conflict are genuinely important — they reveal places where the two
documents would lead implementers to build fundamentally different systems.
Every finding quotes specific text from both documents and explains precisely
WHY they can't both be correct. The reasoning investment (8,384 tokens) was
used for thorough cross-referencing between documents.
- **Claude Opus** found the most inconsistencies (7) and was remarkably fast
(52s vs GPT-5's 125s). Its unique strength: identifying STRUCTURAL contradictions
about component boundaries and communication patterns. The Engine→Trading
command vs internal pipeline finding is architecturally the most significant
discovery — it reveals a fundamental disagreement about whether order
management is INSIDE or OUTSIDE the decision engine's boundary. Opus also
caught a bonus intra-document inconsistency (the "reduce" labeling error).
- **Claude Sonnet** was the fastest (14s) and most concise (776 tokens) but
found only the obvious common-ground issues. For cross-document consistency,
Sonnet's speed advantage came at the cost of missing the architectural
insights that make this task valuable. It did correctly identify all the
Critical-level issues, making it viable as a quick first-pass screen.
**Key insight — cross-document consistency is a DISTINCT task type:**
This is fundamentally different from single-document analysis (assumptions,
race conditions, coherence). It requires:
1. Building a mental model from Document A
2. Building a separate mental model from Document B
3. Finding places where the models are incompatible
4. Reasoning about WHY they can't both be correct (not just "different")
Step 4 is what distinguishes this from simple diff-detection. Many surface
differences (naming, detail level, scope) are NOT contradictions — the models
must judge which differences are genuinely incompatible vs. complementary.
The prompt explicitly excluded omissions and detail-level differences, and
all three models respected this constraint well.
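
The four steps can be sketched as a small pipeline. This is a toy model under stated assumptions, not how any of the three models works internally: each document is reduced to a hypothetical `topic -> statement` mapping (steps 1 and 2), differing statements on shared topics become candidates (step 3), and a `judge` callable stands in for the reasoning in step 4. The `logging` topic is made up to show a detail-level difference; the other statements paraphrase claims quoted in this finding.

```python
def candidate_conflicts(model_a, model_b):
    """Step 3: shared topics whose statements differ.

    Topics present in only one document are omissions and are skipped,
    mirroring the exclusion in the prompt.
    """
    shared = model_a.keys() & model_b.keys()
    return sorted(t for t in shared if model_a[t] != model_b[t])


def contradictions(model_a, model_b, judge):
    """Step 4: keep only candidates the judge deems incompatible."""
    return [t for t in candidate_conflicts(model_a, model_b)
            if judge(t, model_a[t], model_b[t])]


# Steps 1 and 2: one simplified mental model per document.
doc_a = {
    "source_of_truth": "fills are ground truth",
    "signal_persistence": "signals are never persisted",
    "logging": "structured logs",            # hypothetical topic
    "component_flows": "covered",            # only in A: an omission in B
}
doc_b = {
    "source_of_truth": "events are the source of truth",
    "signal_persistence": "SignalEmitted is persisted as an audit event",
    "logging": "structured JSON logs",       # hypothetical topic
    "message_taxonomy": "covered",           # only in B: an omission in A
}

# Stand-in for model reasoning: a hand-written verdict per topic.
INCOMPATIBLE = {"source_of_truth", "signal_persistence"}


def judge(topic, statement_a, statement_b):
    return topic in INCOMPATIBLE


candidates = candidate_conflicts(doc_a, doc_b)
confirmed = contradictions(doc_a, doc_b, judge)
```

The `logging` topic survives step 3 but not step 4, which is the distinction between "different" and "cannot both be correct" that the text draws.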
**Model strengths on cross-document analysis:**
- **GPT-5** excels at ownership/authority conflicts: it systematically
checked "who owns this concept" in each document and found mismatches.
Its findings cluster around "who writes what" and "who is authoritative."
- **Opus** excels at structural/boundary contradictions: it identified where
the documents draw architectural lines differently. Its findings cluster
around "where are the boundaries" and "what crosses them."
- **Sonnet** identifies the obvious/critical issues quickly but doesn't dig
deeper. Viable for screening, not for thorough analysis.
**Comparison to Finding #15 / #27 (single-document coherence checking):**
Single-document coherence asks "does this document contradict itself?"
Cross-document consistency asks "do these documents contradict each other?"
Key differences in results:
| Aspect | Single-doc coherence | Cross-doc consistency |
|---|---|---|
| Opus findings | 5-7 | 7 |
| GPT-5 findings | 4-6 | 6 |
| Sonnet findings | 4-5 | 4 |
| Opus unique | Design tensions | Structural/boundary mismatches |
| GPT-5 unique | Definitional errors | Ownership/authority conflicts |
| Best model | Task-dependent | Opus (most findings, faster than GPT-5) |
The relative ordering is similar (Opus ≥ GPT-5 > Sonnet for coherence-style
tasks), but the CHARACTER of unique findings shifted. On single-doc coherence,
Opus finds design tensions within a single design. On cross-doc consistency,
Opus finds BOUNDARY disagreements between two designs. GPT-5 shifts from
finding definitional errors to ownership conflicts.
**Are these findings REAL bugs in the gargoyle documentation?**
Yes — several are genuine issues worth fixing:
1. The fills-vs-events-as-ground-truth is a real philosophical tension between
the two documents that needs resolution.
2. The Position event ownership (OrderManager vs Ledger) is a real boundary
conflict that affects implementation.
3. The Engine→Trading communication style (internal pipeline vs cross-domain
command) is a genuine structural ambiguity.
4. The signal persistence claim ("never persisted" vs `SignalEmitted` audit
event) is a direct textual contradiction.
These are the kind of cross-document inconsistencies that cause teams to build
inconsistent implementations — one engineer reads Document A and builds one way,
another reads Document B and builds differently.
**Practical implication:** Cross-document consistency analysis is a high-value
task for documentation maintenance. Run it when:
- A system has multiple architecture docs written at different times
- A refactoring has updated one doc but not another
- Multiple people contribute to design documentation
- Moving from high-level overview to detailed specification
Opus is the recommended model for this task: faster than GPT-5 (52s vs 125s),
most findings (7 vs 6), and uniquely strong at boundary disagreements. GPT-5 adds
value for ownership-specific conflicts. Sonnet is the fastest overall and is
sufficient for quick screening (catches the Critical issues in 14s) but won't
find the architectural insights.
**Cost-effectiveness:**
- Opus: 2,351 output tokens for 7 findings = 336 tokens/finding (52s)
- GPT-5: 9,415 output + 8,384 reasoning for 6 findings = 2,967 tokens/finding (125s)
- Sonnet: 776 output tokens for 4 findings = 194 tokens/finding (14s)
Opus is the clear winner over GPT-5 on this task type: one more finding, 2.4x
faster, and 8.8x fewer tokens per finding. GPT-5's massive reasoning
investment (8,384 tokens) still produced one fewer finding than Opus. The
verification overhead is not paying off here because cross-document contradictions
are relatively easy to verify once identified (just check both documents).
Sonnet has the lowest token cost per finding (194), but it surfaced only the
common-ground issues.
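
The per-finding figures and ratios above can be reproduced from the results table with a few lines of arithmetic. In this sketch the Claude reasoning tokens, reported as "(internal)", are counted as 0, which matches how the figures above were computed. The `runs` layout and helper name are illustrative. Half-up rounding is used explicitly because Python's built-in `round` rounds 2966.5 down to 2966.

```python
# Numbers are copied from the results table. Reasoning tokens for the
# Claude models are not reported there, so they are treated as 0 here.
runs = {
    "Opus": {"seconds": 52, "output": 2351, "reasoning": 0, "findings": 7},
    "GPT-5": {"seconds": 125, "output": 9415, "reasoning": 8384, "findings": 6},
    "Sonnet": {"seconds": 14, "output": 776, "reasoning": 0, "findings": 4},
}


def tokens_per_finding(run):
    """Total tokens divided by findings, rounded half-up to an integer."""
    exact = (run["output"] + run["reasoning"]) / run["findings"]
    return int(exact + 0.5)


per_finding = {name: tokens_per_finding(run) for name, run in runs.items()}

# Opus relative to GPT-5: wall-clock speedup and token-efficiency ratio.
speedup = runs["GPT-5"]["seconds"] / runs["Opus"]["seconds"]
token_ratio = per_finding["GPT-5"] / per_finding["Opus"]

for name, value in per_finding.items():
    print(f"{name}: {value} tokens/finding")
print(f"Opus vs GPT-5: {speedup:.1f}x faster, {token_ratio:.1f}x fewer tokens/finding")
```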