refocus on LLM mistakes, not textbook definitions

Based on actual review findings: - Replay determinism: DateTime.utc_now() in apply, random in state - Event design: OrderUpdated with changes map (CRUD-in-disguise) - Projections as source of truth - Suggesting event 'fixes' instead of compensating events - Missing idempotency in handlers Added Elixir/OTP specific patterns (handle_continue for replay, Process dictionary for test isolation). Anti-patterns table for quick flagging.
2026-05-11 00:29:58 -07:00
parent d68d1697aa
commit b72c14f370
2 changed files with 113 additions and 129 deletions
@@ -1,165 +1,148 @@
 # DDD & Event Sourcing Review Checklist

-Patterns and gotchas for domain-driven design and event sourcing. Focus on what models get subtly wrong.
+What LLMs get subtly wrong in event-sourced systems. Based on actual review findings.

 ---

-## Aggregate Design
+## Replay Determinism (CRITICAL)

-### Boundaries
- [ ] Aggregate defines a **consistency boundary**, not a data grouping
- [ ] Ask: "What MUST be consistent in a single transaction?" - that's the aggregate
- [ ] Prefer smaller aggregates; large aggregates = contention and scaling pain
- [ ] Reference other aggregates by ID, never by embedding the object
+Models frequently introduce non-determinism into replay paths:

-**Common mistake:** Making aggregates too large. An `Order` aggregate doesn't need to contain `Customer` - it needs a `customer_id`.
+### Timestamps
+- [ ] **No `DateTime.utc_now()` in apply/reduce functions** — timestamp must come from the event
+- [ ] **No `System.monotonic_time()` in state reconstruction** — time-based decisions use event clock
+- [ ] If you need "when did this happen", the event carries that data

-### Invariants
- [ ] Business rules enforced **inside** the aggregate, not in services
- [ ] Aggregate is always valid after any operation (or rejects the operation)
- [ ] Constructor enforces required fields - no invalid aggregate should exist
+**LLM mistake:** Adding `updated_at: DateTime.utc_now()` to state in an apply function. This means replay produces different state than the original run.

-**Common mistake:** Putting validation in application services. If `Order` requires at least one line item, the `Order` aggregate enforces this, not `OrderService`.
+### Randomness  
+- [ ] **No `:rand` calls in apply functions** — random values must be computed at command time and stored in event
+- [ ] **No UUID generation during replay** — IDs assigned at command/event creation, never reconstruction

-### Identity
- [ ] Aggregate root has a stable identity (UUID preferred over DB sequence)
- [ ] Identity assigned at creation, never changes
- [ ] Child entities within aggregate may have local IDs (only unique within aggregate)
+**LLM mistake:** Generating order IDs in the apply function instead of the command handler.
+
+### External Calls
+- [ ] **No HTTP/DB/external calls in apply functions** — apply is pure: event in, state out
+- [ ] **No side effects in apply** — logging at debug level is acceptable, nothing else
+- [ ] External data needed for decisions must be fetched BEFORE emitting the event
+
+**LLM mistake:** Fetching current price during apply to validate an order event. The price should be in the event.

 ---

-## Event Sourcing
+## Event Design

-### Event Immutability
- [ ] Events are **facts that happened** - never modify, never delete
- [ ] Bad data? Emit a **compensating event** (e.g., `OrderCorrected`, `AmountAdjusted`)
- [ ] Event schema versioned; old events must remain readable forever
+### Self-Contained Events
+- [ ] **Event contains all data needed to understand what happened** — don't rely on external lookups
+- [ ] **Include denormalized data that might change** — `{product_id, product_name, price_at_time}` not just `{product_id}`
+- [ ] **Actor/causation metadata** — who triggered this, correlation_id for tracing

-**Critical:** Models sometimes suggest "fixing" events in the store. This is always wrong. Events are immutable historical facts.
+**LLM mistake:** `%OrderPlaced{order_id: "123", product_ids: [...]}` — missing prices, quantities, everything needed to understand the order without external lookups.

 ### Event Naming
- [ ] Events are **past tense**: `OrderPlaced`, `PaymentReceived`, `ItemShipped`
- [ ] Commands are **imperative**: `PlaceOrder`, `ProcessPayment`, `ShipItem`
- [ ] Events describe what happened, not what to do
+- [ ] **Past tense** — `OrderPlaced`, `PaymentReceived`, `ItemShipped`
+- [ ] **Not commands** — never `PlaceOrder` or `ProcessPayment` as event names
+- [ ] **Specific over generic** — `OrderItemQuantityAdjusted` not `OrderUpdated`

-### Event Content
- [ ] Include all data needed to understand the event in isolation
- [ ] Don't reference external state that might change
- [ ] Include actor/causation metadata (who, why, correlation_id)
- [ ] Avoid large payloads - events aren't documents
+**LLM mistake:** Creating `OrderUpdated` events with a `changes` map. This is CRUD-in-disguise, not event sourcing.

-**Common mistake:** Storing just IDs and looking up current state. Event should be self-contained: `ItemAdded { product_id, name, price, quantity }` not just `ItemAdded { product_id }`.
+### Event Immutability
+- [ ] **Never suggest "fixing" or "updating" an event** — events are immutable facts
+- [ ] **Compensating events for corrections** — `OrderCorrected`, `AmountAdjusted`
+- [ ] **Schema versioning for evolution** — old events must remain readable forever

-### Aggregate Reconstruction
- [ ] Replay all events to rebuild current state
- [ ] `apply/2` is pure: event in, state out, no side effects
- [ ] Handle unknown event types gracefully (forward compatibility)
+**LLM mistake:** Suggesting a migration that modifies existing events to fix a bug. The correct answer is always a compensating event or projection rebuild.

 ---

-## Domain vs Integration Events
+## Aggregate Boundaries

-### Domain Events
- Internal to bounded context
- Rich, contains domain concepts
- Can evolve with internal refactoring
- Consumed by projections, process managers within same context
+### Size
+- [ ] **Aggregate = consistency boundary, not data grouping** — what MUST be consistent in one transaction?
+- [ ] **Smaller is better** — large aggregates = contention, scaling pain
+- [ ] **Reference other aggregates by ID** — never embed full objects

-### Integration Events
- Cross bounded context communication
- Stable contract, versioned schema
- Minimal data (anti-corruption layer transforms as needed)
- Published to message bus/event broker
+**LLM mistake:** Making `Portfolio` contain a list of `Position` aggregates. Each position should be its own aggregate referenced by ID.

-**Common mistake:** Publishing internal domain events directly to other services. Use an anti-corruption layer to transform to integration events.
+### Invariants
+- [ ] **Business rules inside the aggregate** — not in application services
+- [ ] **Always valid after any operation** — reject operations that would violate invariants
+- [ ] **Constructor enforces required fields** — no invalid aggregate instances

---
-
-## Eventual Consistency
-
-### Expectations
- [ ] Read models may lag behind writes (milliseconds to seconds typically)
- [ ] UI must handle "command accepted" != "projection updated"
- [ ] Idempotent event handlers (same event delivered twice = same result)
-
-### Patterns
- [ ] Optimistic UI: show expected state, reconcile when projection catches up
- [ ] Polling/subscription for read model updates
- [ ] Correlation IDs to track command → event → projection flow
-
-**Common mistake:** Returning the "updated" read model immediately after a command. The projection might not have processed the event yet.
+**LLM mistake:** Putting validation in a service layer: "OrderService checks if order has at least one item." The Order aggregate should reject an empty order.

 ---

 ## Projections (Read Models)

-### Design
- [ ] Projections are disposable - can always rebuild from events
- [ ] One projection per query need (don't share if requirements differ)
- [ ] Denormalized for query performance - no joins at read time
- [ ] Include projection version/position for consistency checks
+### Not Source of Truth
+- [ ] **Projections are derived, disposable** — can always rebuild from events
+- [ ] **If projection is wrong, fix the projection logic and rebuild** — don't "fix" projection data directly
+- [ ] **One projection per query need** — don't share if requirements differ

-### Rebuilding
- [ ] Support full rebuild from event stream
- [ ] Handle schema migrations via rebuild or versioned projections
- [ ] Track last processed event position for resume
+**LLM mistake:** Treating a projection table as canonical and syncing events to match it.

-**Common mistake:** Treating projections as source of truth. If projection is wrong, fix the projection logic and rebuild - don't "fix" the projection data directly.
+### Eventual Consistency
+- [ ] **Read models may lag behind writes** — UI must handle this
+- [ ] **Don't return projection state immediately after command** — it might not be updated yet
+- [ ] **Idempotent handlers** — same event delivered twice produces same result
+
+**LLM mistake:** API endpoint that does `append_event(...)` then immediately `query_projection(...)` and returns it. Race condition.

 ---

-## Snapshotting
+## Idempotency

-### When to Snapshot
- [ ] Only when replay performance is a problem (measure first)
- [ ] Typical threshold: 100-1000 events before snapshot becomes worthwhile
- [ ] Snapshot frequency based on write volume and rebuild tolerance
+### Event Handling
+- [ ] **Idempotency keys for commands** — especially payments, orders
+- [ ] **Check for duplicate events before processing** — at-least-once delivery is common
+- [ ] **Make apply functions idempotent** — applying same event twice = same state

-### Implementation
- [ ] Snapshot = serialized aggregate state at event N
- [ ] Load snapshot, then replay events after snapshot
- [ ] Snapshot schema must evolve with aggregate (version snapshots)
- [ ] Keep events forever - snapshots are optimization, not replacement
+**LLM mistake:** An event handler that increments a counter without checking if this event was already processed.

-**Common mistake:** Snapshotting too early (premature optimization) or deleting events after snapshot (data loss).
+### Command Handling  
+- [ ] **Use `append_if_absent` patterns** — check before write, atomically
+- [ ] **Return success for duplicate valid commands** — don't error on retry

 ---

 ## Process Managers / Sagas

-### Coordination
- [ ] Long-running processes spanning multiple aggregates/contexts
- [ ] Maintain own state (what step are we on, what's pending)
- [ ] React to events, emit commands
- [ ] Handle failures: compensating actions, retries, timeouts
+- [ ] **Long-running coordination across aggregates** — don't do this in application services
+- [ ] **Own state machine with explicit states** — what step are we on?
+- [ ] **Handle timeouts** — what if a step never completes?
+- [ ] **Compensating actions for failures** — if step 3 fails, undo steps 1-2

-### State Machine
- [ ] Explicit states and transitions
- [ ] Idempotent transitions (receiving same event twice is safe)
- [ ] Timeout handling for stuck processes
-
-**Common mistake:** Putting multi-aggregate coordination in application services with direct calls. Use process managers for anything that spans consistency boundaries.
+**LLM mistake:** Multi-aggregate coordination in a service with direct calls and no failure handling:
+```elixir
+def transfer(from, to, amount) do
+  Wallet.debit(from, amount)  # What if this succeeds but next fails?
+  Wallet.credit(to, amount)
+end
+```

 ---

-## Value Objects
+## Elixir/OTP Specific

- [ ] Immutable (no setters, all state set in constructor)
- [ ] Equality by value, not reference (`Money(100, :USD) == Money(100, :USD)`)
- [ ] Self-validating (invalid value object cannot be constructed)
- [ ] No identity - two VOs with same values are interchangeable
+### GenServer State Recovery
+- [ ] **Use `handle_continue` for replay** — not `init/1` directly
+- [ ] **Keep apply functions pure** — use a reducer pattern
+- [ ] **Trap exits if cleanup needed** — but prefer stateless design

-**Examples:** `Money`, `EmailAddress`, `DateRange`, `Coordinates`, `OrderLineItem`
-
-**Common mistake:** Making value objects mutable or giving them database IDs.
+### Process Dictionary for Test Isolation
+- [ ] **`Process.get/put` for store references** — allows per-test isolation
+- [ ] **Set in GenServer init, read in public API** — callers don't pass store around

 ---

-## Anti-Patterns to Flag
+## Anti-Patterns to Flag Immediately

- **Anemic domain model**: Aggregates with only getters/setters, logic in services
- **God aggregate**: Everything in one aggregate, massive contention
- **Event sourcing the UI**: Storing UI state changes as domain events
- **Synchronous projections**: Blocking command until projection updates
- **Shared kernel abuse**: Too much shared code between bounded contexts
- **CRUD in disguise**: `EntityUpdated` events that just dump all fields
+| Pattern | Problem |
+|---------|---------|
+| `DateTime.utc_now()` in apply | Non-deterministic replay |
+| `OrderUpdated` with `changes` map | CRUD-in-disguise |
+| Projection used as source of truth | Data inconsistency |
+| Event "fix" migration | Violates immutability |
+| Multi-aggregate in one transaction | Wrong boundaries |
+| External call in apply | Side effects break replay |
@@ -1,31 +1,32 @@
 # DDD & Event Sourcing Patterns

-Review guidance for domain-driven design and event sourcing. Focus on subtle mistakes models make and patterns they get wrong.
-
-## Contents
-
- `DDD-CHECKLIST.md` - The review checklist covering:
-  - Aggregate Design (boundaries, invariants, identity)
-  - Event Sourcing (immutability, naming, content, reconstruction)
-  - Domain vs Integration Events
-  - Eventual Consistency
-  - Projections (read models)
-  - Snapshotting
-  - Process Managers / Sagas
-  - Value Objects
-  - Anti-Patterns to Flag
+What LLMs get subtly wrong in event-sourced systems. Based on actual code review findings.

 ## Why This Exists

-Models know the textbook definitions but often get the nuances wrong:
- Making aggregates too large (data grouping vs consistency boundary)
+LLMs know the textbook definitions of event sourcing and DDD. But they make subtle mistakes:
+
+- Adding `DateTime.utc_now()` in apply functions (breaks replay determinism)
+- Creating `OrderUpdated` events with a changes map (CRUD-in-disguise)
+- Treating projections as source of truth
 - Suggesting "fixes" to immutable events
- Mixing domain and integration events
- Expecting synchronous read model updates
- Snapshotting prematurely
+- Making aggregates too large (data grouping vs consistency boundary)
+- Multi-aggregate coordination in services without saga patterns

 This checklist catches those mistakes.

+## Contents
+
+- `DDD-CHECKLIST.md` - Review prompts covering:
+  - Replay Determinism (CRITICAL)
+  - Event Design
+  - Aggregate Boundaries
+  - Projections
+  - Idempotency
+  - Process Managers / Sagas
+  - Elixir/OTP Specific
+  - Anti-patterns table
+
 ## Integration

 ```yaml