model-research/findings/2026-05-09-57-event-flow-correctness-analysis.md

# Event Flow Correctness Analysis: A New Analytical Lens

**Finding ID:** 57
**Date:** 2026-05-09
**Documents:** gargoyle/docs/impl/event-catalog.md + gargoyle/docs/domain/architecture.md + gargoyle/docs/domain/system-overview.md (~644 lines combined)
**Task type:** Event flow correctness analysis — a NEW analytical lens
**Prompt:** "Analyze these event-sourced architecture documents for EVENT FLOW correctness. Focus on: missing events, event chain completeness, temporal dependencies, recovery gaps, cross-aggregate event flow."
**Models compared:** GPT-5, Claude Sonnet 4.6, Claude Opus 4.5

## Experiment Design

This experiment tests a novel analytical lens: **event flow correctness analysis**. Unlike gap-finding (what's missing from the spec) or contradiction detection (where statements conflict), this lens asks: "Can this event-sourced system be correctly replayed from its documented events? Are all state transitions covered?"

This is particularly relevant for event-sourced architectures where "events are the source of truth" is a core principle — any gap in the event chain means state cannot be reconstructed after restart.
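
Part of the replay question can be checked mechanically. The sketch below is a minimal illustration, not the project's tooling: it models the order lifecycle as a transition table and reports non-terminal states that no cataloged event can leave. Only `OrderPlaced`, `OrderFilled`, `OrderReplaceRequested`, `OrderCancelRequested`, `pending_replace`, and `pending_cancel` come from the analyzed documents; the `open`, `filled`, and `cancelled` states, the `OrderCancelled` event, and the function name are assumptions made for illustration.

```python
from collections import defaultdict

# (from_state, event, to_state). A from_state of None marks stream creation.
# States "open", "filled", "cancelled" and the OrderCancelled event are
# assumptions; they are not taken from the event catalog.
TRANSITIONS = [
    (None, "OrderPlaced", "open"),
    ("open", "OrderFilled", "filled"),
    ("open", "OrderReplaceRequested", "pending_replace"),
    ("open", "OrderCancelRequested", "pending_cancel"),
    ("pending_cancel", "OrderCancelled", "cancelled"),
]
TERMINAL_STATES = {"filled", "cancelled"}


def dead_end_states(transitions, terminal_states):
    """Return reachable, non-terminal states that no listed event can leave."""
    exits = defaultdict(set)
    reachable = set()
    for src, event, dst in transitions:
        reachable.add(dst)
        if src is not None:
            exits[src].add(event)
    return sorted(
        state
        for state in reachable
        if state not in terminal_states and not exits[state]
    )


print(dead_end_states(TRANSITIONS, TERMINAL_STATES))  # ['pending_replace']
```

Under these assumptions the only dead end is `pending_replace`; adding any exit transition for it empties the result.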

## Performance Metrics

| Model | Time | Input Tokens | Output Tokens | Reasoning Tokens | Findings |
|-------|------|--------------|---------------|------------------|----------|
| GPT-5 | 138s | 6,128 | 9,211 | 6,720 | 13 findings |
| Claude Sonnet 4.6 | 18s | 7,450 | 1,200 | (internal) | 10 findings |
| Claude Opus 4.5 | 74s | 7,450 | 4,034 | (internal) | 15 findings + detailed analysis |

**Time efficiency:** Sonnet finished 7.7x faster than GPT-5 (18s vs. 138s) and 4.1x faster than Opus (18s vs. 74s). For quick screening, Sonnet found 10 issues in 18 seconds.
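
The speed ratios, and a findings-per-minute rate for each model, can be reproduced from the table. This is a small worked calculation; the helper names are illustrative, and findings-per-minute is only a rough proxy because it ignores severity and depth.

```python
# Wall-clock seconds and finding counts from the Performance Metrics table.
runs = {
    "GPT-5": {"seconds": 138, "findings": 13},
    "Claude Sonnet 4.6": {"seconds": 18, "findings": 10},
    "Claude Opus 4.5": {"seconds": 74, "findings": 15},
}


def speedup(fast, slow):
    """How many times faster `fast` finished than `slow`."""
    return runs[slow]["seconds"] / runs[fast]["seconds"]


def findings_per_minute(model):
    run = runs[model]
    return 60 * run["findings"] / run["seconds"]


for name in runs:
    print(f"{name}: {findings_per_minute(name):.1f} findings/min")
print(f"Sonnet vs GPT-5: {speedup('Claude Sonnet 4.6', 'GPT-5'):.1f}x")
print(f"Sonnet vs Opus: {speedup('Claude Sonnet 4.6', 'Claude Opus 4.5'):.1f}x")
```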

## Common Ground (Broad Agreement)

These issues were identified by all three models or, for items 2 and 6, by two of the three (see the Overlap Analysis table). They are the least ambiguous gaps:

1. **Order replacement flow incomplete** (HIGH): `OrderReplaceRequested` exists, but there is no completion event (`OrderReplaced`, `OrderReplaceFailed`). After replay, such orders stay stuck in `pending_replace`.

2. **Cancel rejection not handled** (HIGH): `OrderCancelRequested` moves an order to `pending_cancel`, which has no exit path if the broker rejects the cancel. The fill precedence rule doesn't mention `pending_cancel`.

3. **Lot/Position ownership ambiguity** (CRITICAL/MEDIUM): Lot events (`LotOpened`, `LotClosed`) are listed under "Trading Aggregate Events", but the domain architecture assigns lot ownership to Ledger. That leaves the single-writer invariant unclear.

4. **Corporate action events missing** (HIGH/MEDIUM): The Domain Events table mentions "Lot adjusted", but the Event Catalog has no `LotAdjusted` event definition.

5. **Cross-aggregate propagation mechanism undocumented** (CRITICAL/HIGH/MEDIUM): Trading produces `OrderFilled`, and Ledger must react by producing lots, but no mechanism is specified.

6. **Reconciliation events not cataloged** (HIGH/MEDIUM): The domain invariant says "reconciliation gates trading", but no stream ID, fields, or event shapes are given for reconciliation events.
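
To make item 5 concrete, here is one possible propagation mechanism, sketched as an in-memory publish/subscribe store. It is an assumption for illustration, not the mechanism the analyzed documents intend (they specify none). The `Event` and `EventStore` types, the `"trading"` and `"ledger"` stream names, the `causation_id` field, and the choice of Ledger as lot owner are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Event:
    event_id: str
    type: str
    data: dict
    causation_id: Optional[str] = None  # ID of the event that triggered this one


@dataclass
class EventStore:
    streams: Dict[str, List[Event]] = field(default_factory=dict)
    subscribers: List[Callable[["EventStore", Event], None]] = field(default_factory=list)

    def append(self, stream_id: str, event: Event) -> None:
        """Persist the event, then notify every subscriber."""
        self.streams.setdefault(stream_id, []).append(event)
        for handler in self.subscribers:
            handler(self, event)


def ledger_on_order_filled(store: EventStore, event: Event) -> None:
    """Ledger-side reaction: open a lot for each fill, at most once per fill."""
    if event.type != "OrderFilled":
        return
    ledger = store.streams.get("ledger", [])
    if any(e.causation_id == event.event_id for e in ledger):
        return  # redelivered fill: the lot already exists
    lot = Event(
        event_id=f"lot-for-{event.event_id}",
        type="LotOpened",
        data={"symbol": event.data["symbol"], "quantity": event.data["quantity"]},
        causation_id=event.event_id,
    )
    store.append("ledger", lot)


store = EventStore(subscribers=[ledger_on_order_filled])
fill = Event("evt-1", "OrderFilled", {"symbol": "ABC", "quantity": 100})
store.append("trading", fill)
ledger_on_order_filled(store, fill)  # simulate redelivery after a crash
print([e.type for e in store.streams["ledger"]])  # ['LotOpened']
```

Recording the triggering event's ID on the lot event is what makes the handler safe to re-run during replay. Whichever mechanism the documentation settles on, it needs an equivalent idempotency rule.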

## Model-Specific Strengths

### GPT-5: Exhaustive Enumeration + External Knowledge

GPT-5 produced the most detailed analysis (13 findings) with specific attention to broker interactions and compliance:

**Unique catches:**
- Fill/execution IDs missing on fill events — broker execution ID needed for reliable dedup
- Derived state (realized_pnl) baked into PositionUpdated risks divergence from lot-closure truth
- Broker commissions/fees not modeled (LOW but noted)
- Reporting events (Daily P&L snapshot) referenced but not cataloged

**Characteristic:** GPT-5 brings external knowledge — it knows how real broker APIs work (execution IDs, commissions) and what regulators expect (fill dedup, audit trails). It found the execution ID gap that the other models missed.
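
The execution ID gap is easy to see in a small example. In this sketch the `Fill` type and its `execution_id` field are hypothetical (the catalog's fill events reportedly lack such a field). It assumes the broker assigns one unique ID per execution. Without that ID, a redelivered fill is indistinguishable from a second genuine partial fill of the same size and price.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fill:
    order_id: str
    quantity: int
    price: float
    execution_id: str  # assumed field: broker-assigned, unique per execution


def net_filled_quantity(fills):
    """Sum fill quantities, skipping executions that were already applied."""
    seen = set()
    total = 0
    for fill in fills:
        if fill.execution_id in seen:
            continue  # duplicate delivery of the same execution
        seen.add(fill.execution_id)
        total += fill.quantity
    return total


fills = [
    Fill("ord-1", 50, 10.0, "exec-A"),
    Fill("ord-1", 50, 10.0, "exec-B"),  # a second, genuine partial fill
    Fill("ord-1", 50, 10.0, "exec-B"),  # the same execution delivered twice
]
print(net_filled_quantity(fills))  # 100
```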

### Claude Opus 4.5: Systematic Completeness + Deep Reasoning

Opus produced the most thorough analysis (15 findings) with careful deductive chains:

**Unique catches:**
- Pending states on replay (HIGH) — explicit analysis of what happens when orders are in `pending_*` states at crash time
- Audit log failure recovery — how does trading resume after audit store recovers?
- Strategy worker state / warmup — strategies using moving averages have cold-start issues
- Lot-position atomicity — are these written in same transaction?
- Single writer clarification — who actually writes `OrderPlaced`, aggregator or OrderManager?

**Characteristic:** Opus reasons through crash scenarios step by step. It asks "what if we crash here?" at each state transition. Its analysis of pending states on replay was the most thorough — it identified that reconciliation documentation doesn't specify how to handle orders stuck in `pending_cancel`.
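
The "what if we crash here?" question can be approximated by replaying a log that was cut off at the crash point and listing orders left in a pending state. This is a simplified sketch: the `open` and `filled` state names and the function names are assumptions, and the fold ignores the fill precedence rule, whose details are not reproduced in this report.

```python
# State an order is in after each event. "open" and "filled" are assumed names.
STATE_AFTER = {
    "OrderPlaced": "open",
    "OrderReplaceRequested": "pending_replace",
    "OrderCancelRequested": "pending_cancel",
    "OrderFilled": "filled",
}
PENDING_STATES = {"pending_replace", "pending_cancel"}


def replay(events):
    """Fold (order_id, event_type) pairs into the last known state per order."""
    states = {}
    for order_id, event_type in events:
        states[order_id] = STATE_AFTER[event_type]
    return states


def orders_needing_reconciliation(events):
    """Orders whose outcome is unknown after replay and must be checked with the broker."""
    return sorted(
        order_id
        for order_id, state in replay(events).items()
        if state in PENDING_STATES
    )


# Event log as it stood when the process crashed.
log_at_crash = [
    ("ord-1", "OrderPlaced"),
    ("ord-2", "OrderPlaced"),
    ("ord-1", "OrderCancelRequested"),
    ("ord-3", "OrderPlaced"),
    ("ord-2", "OrderFilled"),
    ("ord-3", "OrderReplaceRequested"),
]
print(orders_needing_reconciliation(log_at_crash))  # ['ord-1', 'ord-3']
```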

### Claude Sonnet 4.6: Fast Structural Scan

Sonnet produced a focused analysis (10 findings) with clear severity stratification:

**Unique catches:**
- Market data event flow incomplete — strategies can't replay decision context without tick events
- Audit event triggering timing — when are audit events written relative to processing?

**Characteristic:** Sonnet thinks like a systems reviewer doing a time-boxed audit. It found the core issues quickly but didn't dig as deep into edge cases. Its market data observation was unique — neither GPT-5 nor Opus mentioned that strategies need historical tick data for replay.

## Overlap Analysis

| Finding | GPT-5 | Sonnet | Opus |
|---------|-------|--------|------|
| Order replacement incomplete | ✅ HIGH | ✅ HIGH | ✅ HIGH |
| Cancel rejection unhandled | ❌ (partial) | ✅ HIGH | ✅ HIGH |
| Lot/Position ownership ambiguity | ✅ CRITICAL | ✅ CRITICAL | ✅ MEDIUM |
| Corporate action events missing | ✅ HIGH | ✅ MEDIUM | ✅ MEDIUM |
| Cross-aggregate propagation unclear | ✅ HIGH | ✅ CRITICAL | ✅ MEDIUM |
| Reconciliation events missing | ✅ HIGH | ✅ MEDIUM | ❌ |
| Fill execution IDs missing | ✅ HIGH | ❌ | ❌ |
| Pending states on replay | ❌ | ❌ | ✅ HIGH |
| Audit log failure recovery | ❌ | ❌ | ✅ MEDIUM |
| Strategy warmup / cold start | ❌ | ❌ | ✅ MEDIUM |
| Market data event flow | ❌ | ✅ MEDIUM | ❌ |
| Partial fill → position chain | ❌ | ❌ | ✅ HIGH |
| Kill switch cascade undocumented | ❌ | ✅ HIGH | ✅ MEDIUM |
| Derived state in PositionUpdated | ✅ MEDIUM | ❌ | ❌ |
| Stream ID conventions incomplete | ❌ | ✅ LOW | ❌ |

- **Union:** 21 unique findings across all three models (15 of them tabulated above)
- **Intersection:** 4 findings flagged by all three models (5 counting GPT-5's partial coverage of cancel rejection)
- **GPT-5 unique:** 3 findings (execution IDs, derived state, commissions)
- **Opus unique:** 4 findings (pending states, audit recovery, warmup, partial fill chain)
- **Sonnet unique:** 3 findings (market data flow, stream ID gaps, audit timing)
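
The tabulated part of the overlap can be recomputed with set operations. The sketch below covers only the 15 rows in the table and counts GPT-5's partial coverage of cancel rejection as absent. Its counts are therefore lower than totals that also include untabulated findings such as commissions or audit timing. The short finding labels are paraphrases chosen for this example.

```python
# Findings each model is marked as having found in the overlap table.
gpt5 = {
    "order replacement", "lot/position ownership", "corporate actions",
    "cross-aggregate propagation", "reconciliation events",
    "fill execution IDs", "derived state in PositionUpdated",
}
sonnet = {
    "order replacement", "cancel rejection", "lot/position ownership",
    "corporate actions", "cross-aggregate propagation", "reconciliation events",
    "market data flow", "kill switch cascade", "stream ID conventions",
}
opus = {
    "order replacement", "cancel rejection", "lot/position ownership",
    "corporate actions", "cross-aggregate propagation", "pending states on replay",
    "audit log recovery", "strategy warmup", "partial fill chain",
    "kill switch cascade",
}
models = {"GPT-5": gpt5, "Sonnet": sonnet, "Opus": opus}

union = set().union(*models.values())
intersection = set.intersection(*models.values())


def unique_to(name):
    """Findings that only the named model reported."""
    others = set().union(*(s for n, s in models.items() if n != name))
    return models[name] - others


print(len(union), len(intersection))  # 15 4
for name in models:
    print(name, sorted(unique_to(name)))
```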

## Key Insight: Event Flow Analysis Reveals Model Reasoning Styles

This lens exposes fundamental differences in how models reason about systems:

1. **GPT-5** brings external domain knowledge. It knows what broker APIs look like, what regulators expect, and what real implementations need. This catches practical gaps (execution IDs, commissions) that pure document analysis misses.

2. **Opus** reasons through failure modes systematically. For every state, it asks "what if we crash here?" This catches recovery gaps and edge cases that require step-by-step deduction.

3. **Sonnet** does rapid structural analysis. It identifies missing pieces and ownership ambiguities quickly but doesn't explore crash scenarios or bring external knowledge.

## Actionable Recommendations for Event Documentation

Based on the union of all findings:

**CRITICAL (must fix before production):**
1. Clarify Lot/Position aggregate ownership — single source of truth
2. Document cross-aggregate event propagation mechanism (pub/sub? direct write? saga?)

**HIGH (should fix):**
3. Add `OrderReplaced`/`OrderReplaceFailed` events
4. Add `OrderCancelRejected` event or expand fill precedence rule
5. Document pending-state reconciliation behavior on restart
6. Add `fill_id` and broker `execution_id` to fill events
7. Add `LotAdjusted` event for corporate actions
8. Document partial fill → position event chain explicitly

**MEDIUM (nice to have):**
9. Add `RiskBreachDetected` event
10. Document kill switch cascade events
11. Document audit log failure/recovery
12. Add strategy warmup to Infrastructure context
13. Document market data event flow for strategy replay
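
As a starting point for recommendations 3 and 4, the completion events could look like the sketch below. Every field name is hypothetical, since the catalog defines none of these events yet. The sketch also assumes that each completion event returns the order to a working state called `open`, and that is an assumption too.

```python
from dataclasses import dataclass


# Hypothetical event shapes; field names are illustrative only.
@dataclass(frozen=True)
class OrderReplaced:
    order_id: str
    new_quantity: int
    new_limit_price: float


@dataclass(frozen=True)
class OrderReplaceFailed:
    order_id: str
    reason: str


@dataclass(frozen=True)
class OrderCancelRejected:
    order_id: str
    reason: str


# event type -> (state it must arrive in, state it leaves the order in)
RESOLVES = {
    OrderReplaced: ("pending_replace", "open"),
    OrderReplaceFailed: ("pending_replace", "open"),
    OrderCancelRejected: ("pending_cancel", "open"),
}


def resolve(state, event):
    """Apply a completion event, rejecting it if the order is not pending on it."""
    expected, new_state = RESOLVES[type(event)]
    if state != expected:
        raise ValueError(f"{type(event).__name__} is not valid in state {state!r}")
    return new_state


print(resolve("pending_cancel", OrderCancelRejected("ord-1", "too late to cancel")))  # open
```

With events like these in the catalog, every `pending_*` state has a documented exit, so a replayed stream no longer ends in a state that only the broker can resolve.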

## Conclusion

Event flow correctness analysis is a valuable lens for event-sourced architectures. It asks: "Can we replay to correct state from these events alone?" This is different from gap-finding (what's missing conceptually) or contradiction detection (where statements conflict) — it's about operational correctness.

**Model recommendation for this lens:**
- **GPT-5 for external domain knowledge** — catches what real implementations need
- **Opus for failure mode reasoning** — catches what happens at crash boundaries
- **Sonnet for fast screening** — catches obvious structural gaps quickly
- Run all three for thoroughness, since each found issues the others missed

**Cost-effectiveness:** For time-constrained reviews, Sonnet (18s, 10 findings) returns the most findings per second, while Opus (74s, 15 findings) gives the deepest single-model analysis. For thoroughness, the GPT-5 + Opus union captures 18 of the 21 findings. Adding Sonnet catches the remaining 3 (market data flow, stream ID gaps, audit timing).

**New pattern discovered:** Models have distinct "reasoning styles" that event flow analysis exposes:
- GPT-5: "What does the real world need?"
- Opus: "What happens when things fail?"
- Sonnet: "What's structurally missing?"

All three styles find real issues. Use all three.