model-research/findings/2026-05-09-57-event-flow-correctness-analysis.md
Rodin faaa6d9c11 Finding #57: Event flow correctness analysis - new analytical lens
Tests a novel lens for event-sourced architectures: can all state be
reconstructed from documented events alone?

Key findings:
- GPT-5 brings external domain knowledge (broker APIs, compliance)
- Opus reasons through failure modes systematically (crash boundaries)
- Sonnet does rapid structural analysis (missing pieces)

21 unique findings across three models with only 5 in common.
Each model's reasoning style reveals different issue categories.

New pattern: event flow analysis exposes model reasoning styles
that gap-finding and contradiction detection don't surface.
2026-05-09 13:29:58 -07:00


Event Flow Correctness Analysis: A New Analytical Lens

Finding ID: 57
Date: 2026-05-09
Documents: gargoyle/docs/impl/event-catalog.md + gargoyle/docs/domain/architecture.md + gargoyle/docs/domain/system-overview.md (~644 lines combined)
Task type: Event flow correctness analysis (a NEW analytical lens)
Prompt: "Analyze these event-sourced architecture documents for EVENT FLOW correctness. Focus on: missing events, event chain completeness, temporal dependencies, recovery gaps, cross-aggregate event flow."
Models compared: GPT-5, Claude Sonnet 4.6, Claude Opus 4.5

Experiment Design

This experiment tests a novel analytical lens: event flow correctness analysis. Unlike gap-finding (what's missing from the spec) or contradiction detection (where statements conflict), this lens asks: "Can this event-sourced system be correctly replayed from its documented events? Are all state transitions covered?"

This is particularly relevant for event-sourced architectures where "events are the source of truth" is a core principle — any gap in the event chain means state cannot be reconstructed after restart.
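To make the replay question concrete, the following is a minimal sketch in Python, not the system's actual implementation. The event names OrderPlaced, OrderReplaceRequested, OrderCancelRequested, and OrderFilled and the states pending_replace and pending_cancel come from the findings in this document; the states "open" and "filled", the event dictionary shape, and the `replay` function are assumptions made for illustration.

```python
from typing import Iterable

# Assumed mapping from event type to the order state it produces.
# "open" and "filled" are illustrative state names, not taken from the docs.
APPLY = {
    "OrderPlaced": "open",
    "OrderReplaceRequested": "pending_replace",
    "OrderCancelRequested": "pending_cancel",
    "OrderFilled": "filled",
}


def replay(events: Iterable[dict]) -> dict:
    """Fold an event log into {order_id: state}; reject undocumented events."""
    state = {}
    for event in events:
        kind = event["type"]
        if kind not in APPLY:
            # An event outside the catalog means replay cannot be trusted.
            raise ValueError(f"undocumented event: {kind}")
        state[event["order_id"]] = APPLY[kind]
    return state


log = [
    {"type": "OrderPlaced", "order_id": "o-1"},
    {"type": "OrderReplaceRequested", "order_id": "o-1"},
]
print(replay(log))  # {'o-1': 'pending_replace'}
```

In this sketch the log ends with the order in pending_replace, and no cataloged event can move it out of that state, which is the kind of gap this lens is meant to surface.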

Performance Metrics

| Model | Time | Input Tokens | Output Tokens | Reasoning Tokens | Findings |
|---|---|---|---|---|---|
| GPT-5 | 138s | 6,128 | 9,211 | 6,720 | 13 findings |
| Claude Sonnet 4.6 | 18s | 7,450 | 1,200 | (internal) | 10 findings |
| Claude Opus 4.5 | 74s | 7,450 | 4,034 | (internal) | 15 findings + detailed analysis |

Time efficiency: Sonnet was 7.7x faster than GPT-5 and about 4x faster than Opus. For quick screening, Sonnet found 10 issues in 18 seconds.
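The speed ratios follow directly from the wall-clock times in the table. As a small worked check (the `speedup` helper is a name introduced here for illustration):

```python
# Wall-clock times in seconds, taken from the Performance Metrics table.
timings_s = {"GPT-5": 138, "Claude Sonnet 4.6": 18, "Claude Opus 4.5": 74}


def speedup(faster: str, slower: str) -> float:
    """Return how many times faster `faster` finished than `slower`."""
    return timings_s[slower] / timings_s[faster]


print(round(speedup("Claude Sonnet 4.6", "GPT-5"), 1))            # 7.7
print(round(speedup("Claude Sonnet 4.6", "Claude Opus 4.5"), 1))  # 4.1
```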

Common Ground (All Three Models)

These issues were identified by all three models — the unambiguous gaps:

  1. Order replacement flow incomplete (HIGH): OrderReplaceRequested exists, but there is no completion event (OrderReplaced, OrderReplaceFailed). Orders remain stuck in pending_replace after replay.

  2. Cancel rejection not handled (HIGH): OrderCancelRequested → pending_cancel has no exit path if the broker rejects the cancel. The fill precedence rule doesn't mention pending_cancel.

  3. Lot/Position ownership ambiguity (CRITICAL/HIGH): Lot events (LotOpened, LotClosed) are listed under "Trading Aggregate Events", but the domain architecture assigns ownership to Ledger. The single-writer invariant is unclear.

  4. Corporate action events missing (HIGH/MEDIUM): The Domain Events table mentions "Lot adjusted", but the Event Catalog has no LotAdjusted event definition.

  5. Cross-aggregate propagation mechanism undocumented (HIGH/MEDIUM): Trading produces OrderFilled, and Ledger must react to produce lots, but no mechanism is specified for how that happens.

  6. Reconciliation events not cataloged (HIGH/MEDIUM): A domain invariant says "reconciliation gates trading", but there are no stream IDs, fields, or event shapes for reconciliation events.
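Gaps 1 and 2 above are both cases of a non-terminal state with a missing exit event, which can be checked mechanically. The sketch below is an assumption-laden illustration: the transition map is hand-written from the findings, and the "open", "filled", and "cancelled" states and the OrderCancelled event are hypothetical names that do not come from the analyzed documents.

```python
# Hypothetical order lifecycle: state -> {event: next_state}.
TRANSITIONS = {
    "open": {
        "OrderReplaceRequested": "pending_replace",
        "OrderCancelRequested": "pending_cancel",
        "OrderFilled": "filled",
    },
    # No OrderReplaced / OrderReplaceFailed is documented, so nothing exits.
    "pending_replace": {},
    # Assumed success path only; a broker rejection has no event.
    "pending_cancel": {"OrderCancelled": "cancelled"},
    "filled": {},
}
TERMINAL = {"filled", "cancelled"}


def dead_ends(transitions: dict, terminal: set) -> list:
    """Return non-terminal states that no documented event can leave."""
    return sorted(
        state
        for state, exits in transitions.items()
        if not exits and state not in terminal
    )


print(dead_ends(TRANSITIONS, TERMINAL))  # ['pending_replace']
```

A check like this catches the replacement gap but not the cancel-rejection gap, because pending_cancel does have one exit; finding the missing rejection path still requires reasoning about what the broker can return.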

Model-Specific Strengths

GPT-5: Exhaustive Enumeration + External Knowledge

GPT-5 produced the most detailed analysis (13 findings) with specific attention to broker interactions and compliance:

Unique catches:

  • Fill/execution IDs missing on fill events — broker execution ID needed for reliable dedup
  • Derived state (realized_pnl) baked into PositionUpdated risks divergence from lot-closure truth
  • Broker commissions/fees not modeled (LOW but noted)
  • Reporting events (Daily P&L snapshot) referenced but not cataloged

Characteristic: GPT-5 brings external knowledge — it knows how real broker APIs work (execution IDs, commissions) and what regulators expect (fill dedup, audit trails). It found the execution ID gap that the other models missed.
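The execution ID point can be illustrated with a short sketch of deduplication during replay. The field names `execution_id` and `qty` and the `apply_fills` helper are hypothetical; the analyzed event catalog does not define them, which is exactly the gap GPT-5 flagged.

```python
def apply_fills(fills: list) -> int:
    """Sum filled quantity, ignoring fills whose execution ID was already seen."""
    seen = set()
    filled_qty = 0
    for fill in fills:
        exec_id = fill["execution_id"]  # hypothetical field name
        if exec_id in seen:
            continue  # duplicate delivery of the same broker execution
        seen.add(exec_id)
        filled_qty += fill["qty"]
    return filled_qty


fills = [
    {"execution_id": "ex-1", "qty": 50},
    {"execution_id": "ex-1", "qty": 50},  # same execution delivered twice
    {"execution_id": "ex-2", "qty": 50},
]
print(apply_fills(fills))  # 100
```

Without a stable execution identifier on the fill event, the redelivered fill would be indistinguishable from a second genuine fill of the same size.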

Claude Opus 4.5: Systematic Completeness + Deep Reasoning

Opus produced the most thorough analysis (15 findings) with careful deductive chains:

Unique catches:

  • Pending states on replay (HIGH) — explicit analysis of what happens when orders are in pending_* states at crash time
  • Audit log failure recovery — how does trading resume after audit store recovers?
  • Strategy worker state / warmup — strategies using moving averages have cold-start issues
  • Lot-position atomicity — are these written in same transaction?
  • Single writer clarification — who actually writes OrderPlaced, aggregator or OrderManager?

Characteristic: Opus reasons through crash scenarios step by step. It asks "what if we crash here?" at each state transition. Its analysis of pending states on replay was the most thorough — it identified that reconciliation documentation doesn't specify how to handle orders stuck in pending_cancel.
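One way to picture the pending-states concern is a post-replay check that lists every order still in a pending_* state, since those are the orders whose real status is only known to the broker. This is a sketch under assumptions: the `needs_reconciliation` helper and the `{order_id: state}` shape are illustrative, not part of the documented system.

```python
def needs_reconciliation(order_states: dict) -> list:
    """Return IDs of orders left in a pending_* state after replay."""
    return sorted(
        order_id
        for order_id, state in order_states.items()
        if state.startswith("pending_")
    )


after_replay = {
    "o-1": "filled",
    "o-2": "pending_cancel",
    "o-3": "pending_replace",
}
print(needs_reconciliation(after_replay))  # ['o-2', 'o-3']
```

What should happen to the orders this returns is the open question Opus raised: the reconciliation documentation does not say how they are resolved on restart.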

Claude Sonnet 4.6: Fast Structural Scan

Sonnet produced a focused analysis (10 findings) with clear severity stratification:

Unique catches:

  • Market data event flow incomplete — strategies can't replay decision context without tick events
  • Audit event triggering timing — when are audit events written relative to processing?

Characteristic: Sonnet thinks like a systems reviewer doing a time-boxed audit. It found the core issues quickly but didn't dig as deep into edge cases. Its market data observation was unique — neither GPT-5 nor Opus mentioned that strategies need historical tick data for replay.

Overlap Analysis

Finding GPT-5 Sonnet Opus
Order replacement incomplete HIGH HIGH HIGH
Cancel rejection unhandled (partial) HIGH HIGH
Lot/Position ownership ambiguity CRITICAL CRITICAL MEDIUM
Corporate action events missing HIGH MEDIUM MEDIUM
Cross-aggregate propagation unclear HIGH CRITICAL MEDIUM
Reconciliation events missing HIGH MEDIUM
Fill execution IDs missing HIGH
Pending states on replay HIGH
Audit log failure recovery MEDIUM
Strategy warmup / cold start MEDIUM
Market data event flow MEDIUM
Partial fill → position chain HIGH
Kill switch cascade undocumented HIGH MEDIUM
Derived state in PositionUpdated MEDIUM
Stream ID conventions incomplete LOW

  • Union: 21 unique findings across all three models
  • Intersection: 5 findings all agreed on
  • GPT-5 unique: 3 findings (execution IDs, derived state, commissions)
  • Opus unique: 4 findings (pending states, audit recovery, warmup, partial fill chain)
  • Sonnet unique: 2 findings (market data flow, stream ID gaps)

Key Insight: Event Flow Analysis Reveals Model Reasoning Styles

This lens exposes fundamental differences in how models reason about systems:

  1. GPT-5 brings external domain knowledge. It knows what broker APIs look like, what regulators expect, and what real implementations need. This catches practical gaps (execution IDs, commissions) that pure document analysis misses.

  2. Opus reasons through failure modes systematically. For every state, it asks "what if we crash here?" This catches recovery gaps and edge cases that require step-by-step deduction.

  3. Sonnet does rapid structural analysis. It identifies missing pieces and ownership ambiguities quickly but doesn't explore crash scenarios or bring external knowledge.

Actionable Recommendations for Event Documentation

Based on the union of all findings:

CRITICAL (must fix before production):

  1. Clarify Lot/Position aggregate ownership — single source of truth
  2. Document cross-aggregate event propagation mechanism (pub/sub? direct write? saga?)

HIGH (should fix):

  3. Add OrderReplaced/OrderReplaceFailed events
  4. Add OrderCancelRejected event or expand fill precedence rule
  5. Document pending-state reconciliation behavior on restart
  6. Add fill_id and broker execution_id to fill events
  7. Add LotAdjusted event for corporate actions
  8. Document partial fill → position event chain explicitly

MEDIUM (nice to have):

  9. Add RiskBreachDetected event
  10. Document kill switch cascade events
  11. Document audit log failure/recovery
  12. Add strategy warmup to Infrastructure context
  13. Document market data event flow for strategy replay
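As a rough illustration of recommendations 3, 4, and 7, the proposed events could be written down as immutable records. The event names come from the recommendations above; every field (`order_id`, `reason`, `new_quantity`, `lot_id`, `ratio`) and the `resolves` helper are hypothetical and would need to be defined by the event catalog's authors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrderReplaced:
    order_id: str
    new_quantity: int  # hypothetical field


@dataclass(frozen=True)
class OrderReplaceFailed:
    order_id: str
    reason: str  # hypothetical field


@dataclass(frozen=True)
class OrderCancelRejected:
    order_id: str
    reason: str  # hypothetical field


@dataclass(frozen=True)
class LotAdjusted:
    lot_id: str
    ratio: float  # hypothetical, e.g. 2.0 for a 2-for-1 split


# Which proposed events would let an order leave each pending state.
EXIT_EVENTS = {
    "pending_replace": (OrderReplaced, OrderReplaceFailed),
    "pending_cancel": (OrderCancelRejected,),
}


def resolves(state: str, event: object) -> bool:
    """True if `event` is one of the proposed exits for `state`."""
    return isinstance(event, EXIT_EVENTS.get(state, ()))


print(resolves("pending_replace", OrderReplaceFailed("o-1", "broker rejected")))  # True
print(resolves("pending_cancel", OrderReplaced("o-1", 10)))                       # False
```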

Conclusion

Event flow correctness analysis is a valuable lens for event-sourced architectures. It asks: "Can we replay to correct state from these events alone?" This is different from gap-finding (what's missing conceptually) or contradiction detection (where statements conflict) — it's about operational correctness.

Model recommendation for this lens:

  • GPT-5 for external domain knowledge — catches what real implementations need
  • Opus for failure mode reasoning — catches what happens at crash boundaries
  • Sonnet for fast screening — catches obvious structural gaps quickly
  • Run all three for thoroughness — they found non-overlapping issues

Cost-effectiveness: For time-constrained reviews, Opus (74s) provides the best depth-to-time ratio. For thoroughness, GPT-5 + Opus union captures 18/21 findings. Adding Sonnet catches the remaining 3 (market data flow, stream ID gaps, audit timing).

New pattern discovered: Models have distinct "reasoning styles" that event flow analysis exposes:

  • GPT-5: "What does the real world need?"
  • Opus: "What happens when things fail?"
  • Sonnet: "What's structurally missing?"

All three styles find real issues. Use all three.