model-research/findings/2026-05-09-55-state-reconstruction-correctness.md
claw 5ee0cff3a8 experiment #55: state reconstruction correctness — new analytical lens
Tests whether event stream supports time-travel queries, retroactive truth,
and audit reconstruction. All three models found CRITICAL issues in a document
that passed previous lenses. Key insight: distinguishes telemetry events from
sourcing events.

Document: gargoyle corporate-actions.md
Models: GPT-5, Sonnet 4.6, Opus 4.6
Lens validation: model-stable, domain-independent, architecturally significant
2026-05-09 05:06:45 -07:00


State Reconstruction Correctness Analysis

Finding ID: 55
Date: 2026-05-09
Document: gargoyle/docs/domain/contexts/market-data/corporate-actions.md (~120 lines)
Task type: State reconstruction correctness analysis (a NEW analytical lens)
Prompt: "Can the system correctly reconstruct its state at any arbitrary point in time given only the event stream?"
Models compared: GPT-5, Claude Sonnet 4.6, Claude Opus 4.6

Experiment Design

This experiment tests a novel analytical lens: state reconstruction correctness. Unlike previous lenses that focus on gaps, race conditions, or implementation ambiguities, this lens asks whether the system's event stream is sufficient for:

  1. Time-travel queries: Answering "what was position X at timestamp T?"
  2. Retroactive truth: Handling actions whose effective date precedes detection/application
  3. Event ordering: Determining which temporal dimension orders events for replay
  4. Audit trail completeness: Enabling an auditor to reconstruct exact before/after states
  5. Snapshot consistency: Ensuring that a snapshot plus the events replayed after it converges to the same state as a full replay from the start of the stream

This lens is particularly relevant for systems that claim to be event-sourced or must satisfy regulatory audit requirements.
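
To make the first and last capabilities concrete, here is a minimal sketch of a time-travel query and a snapshot consistency check over a toy event stream. The event types `SharesBought` and `SplitApplied`, the `seq` ordering field, and the `replay` helper are all hypothetical; none of them come from the analyzed document.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class SharesBought:
    """Hypothetical sourcing event: shares added to a position."""
    seq: int          # assumed replay-order field
    symbol: str
    quantity: Decimal


@dataclass(frozen=True)
class SplitApplied:
    """Hypothetical sourcing event: a share split applied to a position."""
    seq: int
    symbol: str
    ratio: Decimal    # e.g. Decimal("2") for a 2-for-1 split


def apply(state, event):
    """Pure transition: returns a new state and never mutates the input."""
    new_state = dict(state)
    held = new_state.get(event.symbol, Decimal("0"))
    if isinstance(event, SharesBought):
        new_state[event.symbol] = held + event.quantity
    elif isinstance(event, SplitApplied):
        new_state[event.symbol] = held * event.ratio
    return new_state


def replay(events, up_to_seq=None, start=None):
    """Fold events in seq order, optionally stopping after up_to_seq."""
    state = dict(start or {})
    for event in sorted(events, key=lambda e: e.seq):
        if up_to_seq is not None and event.seq > up_to_seq:
            break
        state = apply(state, event)
    return state


events = [
    SharesBought(seq=1, symbol="ACME", quantity=Decimal("10")),
    SplitApplied(seq=2, symbol="ACME", ratio=Decimal("2")),
    SharesBought(seq=3, symbol="ACME", quantity=Decimal("5")),
]

# Time-travel query: the position as of seq 2.
as_of_2 = replay(events, up_to_seq=2)

# Snapshot consistency: snapshot at seq 2 plus later events must equal a full replay.
tail = [e for e in events if e.seq > 2]
from_snapshot = replay(tail, start=as_of_2)
full = replay(events)
assert from_snapshot == full
```

The check only works because `apply` is a pure function of the state and the event. If a transition read anything outside the event (current market data, wall-clock time), the two replays could diverge.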

Performance Metrics

| Model | Time | Input Tokens | Output Tokens | Reasoning Tokens | Findings |
|---|---|---|---|---|---|
| GPT-5 | 83s | 1,294 | 6,275 | 3,968 | 10 |
| Claude Sonnet 4.6 | 74s | 1,485 | 3,890 | (internal) | 10+ (truncated) |
| Claude Opus 4.6 | 55s | 1,485 | 2,579 | (internal) | 10 |

Key Findings by Model

All Three Identified (Common Ground)

All models converged on these CRITICAL issues:

  1. Missing effective_date in action.applied event — The telemetry event schema shows no effective_date, making it impossible to answer time-travel queries correctly
  2. Undefined details field — The document says "Record immutable adjustment events capturing before/after state" but the event schema uses an unstructured details blob with no contract
  3. Three-timeline conflation — Effective date, detection time, and application time are distinct concepts but not preserved separately in events
  4. No before/after lot-level state in events — Cannot reconstruct exact lot mutations from the event stream alone
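
As an illustration of what closing these four gaps could look like, the sketch below defines a structured adjustment event that keeps the three timelines separate and carries explicit before/after lot state. The names `LotAdjusted`, `LotState`, and `lot_as_of`, and every field on them, are assumptions made for this example; the analyzed document defines no such schema.

```python
from dataclasses import dataclass
from datetime import date, datetime, timezone
from decimal import Decimal
from typing import List, Optional


@dataclass(frozen=True)
class LotState:
    lot_id: str
    quantity: Decimal
    cost_basis_per_share: Decimal


@dataclass(frozen=True)
class LotAdjusted:
    """Hypothetical structured event replacing an unstructured details blob."""
    action_id: str
    effective_date: date    # when the action takes effect in the market
    detected_at: datetime   # when the system learned about the action
    applied_at: datetime    # when the system adjusted the lot
    before: LotState
    after: LotState


def lot_as_of(events: List[LotAdjusted], lot_id: str, on: date) -> Optional[LotState]:
    """Reconstruct a lot's state on a given date from events alone."""
    relevant = sorted(
        (e for e in events if e.before.lot_id == lot_id),
        key=lambda e: e.effective_date,
    )
    if not relevant:
        return None
    state = relevant[0].before          # state prior to the first adjustment
    for event in relevant:
        if event.effective_date <= on:  # order by effective date, not applied_at
            state = event.after
    return state


# A 2-for-1 split effective on 2 March but detected and applied on 5 March.
split = LotAdjusted(
    action_id="split-2026-001",
    effective_date=date(2026, 3, 2),
    detected_at=datetime(2026, 3, 5, 9, 0, tzinfo=timezone.utc),
    applied_at=datetime(2026, 3, 5, 9, 5, tzinfo=timezone.utc),
    before=LotState("lot-1", Decimal("10"), Decimal("50.00")),
    after=LotState("lot-1", Decimal("20"), Decimal("25.00")),
)

day_before = lot_as_of([split], "lot-1", date(2026, 3, 1))  # pre-split lot
day_after = lot_as_of([split], "lot-1", date(2026, 3, 3))   # post-split lot
```

The query for 3 March returns the post-split lot even though the system did not apply the split until 5 March. Without `effective_date` in the event, that answer cannot be derived from the stream.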

GPT-5 Unique Findings

GPT-5 produced the most exhaustive analysis with specific operational concerns:

  • Currency and numeric precision unspecified — Cost basis calculations need explicit rounding rules for deterministic replay
  • No idempotency key for per-user/per-lot adjustments — Crash-retry could produce duplicates
  • Cash postings (dividends, cash-in-lieu) lack event modeling — Pay date vs. ex-date vs. record date semantics missing
  • Instrument identity changes lack time-indexed mapping events — Old→new symbol transitions not replayable
  • Cross-reference normative authority undefined — Which document is source of truth when specs conflict?

GPT-5's recommendations section was exceptionally actionable, specifying exact event schema additions.
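
Two of these concerns, deterministic rounding and crash-retry duplicates, lend themselves to a short sketch. It assumes banker's rounding to the cent and a key derived from the action, user, and lot identifiers. Both choices are illustrative: the analyzed document specifies neither, and `AdjustmentLog` is a hypothetical in-memory stand-in for an event store.

```python
import hashlib
from decimal import Decimal, ROUND_HALF_EVEN

CENT = Decimal("0.01")


def round_money(amount: Decimal) -> Decimal:
    """Assumed rounding rule: half-even to the cent, so replay is deterministic."""
    return amount.quantize(CENT, rounding=ROUND_HALF_EVEN)


def idempotency_key(action_id: str, user_id: str, lot_id: str) -> str:
    """Assumed key shape: one adjustment per (action, user, lot)."""
    raw = f"{action_id}|{user_id}|{lot_id}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


class AdjustmentLog:
    """Hypothetical append-only log that ignores retried writes."""

    def __init__(self):
        self._events = {}  # key -> event; dicts preserve insertion order

    def append(self, action_id, user_id, lot_id, basis_per_share: Decimal) -> bool:
        key = idempotency_key(action_id, user_id, lot_id)
        if key in self._events:
            return False  # duplicate from a crash-retry; nothing is written
        self._events[key] = {
            "action_id": action_id,
            "user_id": user_id,
            "lot_id": lot_id,
            "basis_per_share": round_money(basis_per_share),
        }
        return True

    def __len__(self):
        return len(self._events)


log = AdjustmentLog()
first = log.append("split-2026-001", "user-7", "lot-1", Decimal("16.6666667"))
retry = log.append("split-2026-001", "user-7", "lot-1", Decimal("16.6666667"))
```

Here `first` is `True`, `retry` is `False`, and the log holds one event. Using `Decimal` rather than binary floats keeps the stored basis identical across replays and platforms.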

Sonnet Unique Findings

Sonnet provided the clearest framing of the bi-temporal problem:

  • Bi-temporal data model missing — System needs both valid time (when world changed) and transaction time (when we recorded it)
  • Silent deduplication is invisible in event stream — No way to know if duplicates were suppressed
  • Delisting leaves phantom lots with no terminal state event — Lots exist forever in "delisted" state with no resolution path
  • Source accountability lost for manual entries — No operator audit trail

Sonnet's analysis was remarkably structured with clear severity justifications.
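
The bi-temporal point can be illustrated with a small query that takes both time axes as parameters. This is a sketch under assumed names (`BitemporalFact`, `shares_as_of`, `valid_from`, `recorded_at`); it is not a schema taken from the analyzed document.

```python
from dataclasses import dataclass
from datetime import date, datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class BitemporalFact:
    symbol: str
    shares: int
    valid_from: date       # valid time: when the world changed
    recorded_at: datetime  # transaction time: when the system recorded it


def shares_as_of(
    facts: List[BitemporalFact], symbol: str, valid_on: date, known_at: datetime
) -> Optional[int]:
    """Answer 'what did we believe at known_at about the position on valid_on?'"""
    candidates = [
        f for f in facts
        if f.symbol == symbol
        and f.valid_from <= valid_on
        and f.recorded_at <= known_at
    ]
    if not candidates:
        return None
    # Latest valid time wins; ties are broken by the latest recording.
    best = max(candidates, key=lambda f: (f.valid_from, f.recorded_at))
    return best.shares


facts = [
    BitemporalFact("ACME", 10, date(2026, 1, 10),
                   datetime(2026, 1, 10, 9, 0, tzinfo=timezone.utc)),
    # 2-for-1 split effective 2 March, recorded late on 5 March.
    BitemporalFact("ACME", 20, date(2026, 3, 2),
                   datetime(2026, 3, 5, 12, 0, tzinfo=timezone.utc)),
]

believed_then = shares_as_of(
    facts, "ACME", date(2026, 3, 3),
    datetime(2026, 3, 4, 0, 0, tzinfo=timezone.utc),
)
known_now = shares_as_of(
    facts, "ACME", date(2026, 3, 3),
    datetime(2026, 3, 6, 0, 0, tzinfo=timezone.utc),
)
```

The same date yields two answers: `believed_then` is 10 (what the system knew on 4 March) and `known_now` is 20 (the retroactively corrected truth). An auditor typically needs both, which is why a single timestamp per event is insufficient.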

Opus Unique Findings

Opus excelled at identifying architectural category errors:

  • Events are telemetry, not sourcing events — The document explicitly says "Naming follows the telemetry event convention," which is fundamentally incompatible with state reconstruction
  • State mutations with notification events ≠ event-sourced state transitions — Core architectural mismatch
  • Spinoff basis allocation is non-deterministic without captured FMV parameters — Replaying with different market data produces different results
  • "Mark the action as applied" has no corresponding event — Completion tracking is stateful, not event-based

Opus's concluding insight was the most architecturally significant: "This system treats corporate actions as state mutations with notification events rather than as event-sourced state transitions."
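
The spinoff finding shows the difference between the two kinds of event in miniature. In the sketch below, the event captures the fair market values that were used at application time, so replay never consults live market data. The `SpinoffApplied` type and the pro-rata allocation by relative FMV are assumptions for illustration; the analyzed document does not specify the allocation formula.

```python
from dataclasses import dataclass
from decimal import Decimal, ROUND_HALF_EVEN
from typing import Tuple

CENT = Decimal("0.01")


@dataclass(frozen=True)
class SpinoffApplied:
    """Hypothetical sourcing event: carries every input the transition needs."""
    parent_symbol: str
    child_symbol: str
    parent_basis_before: Decimal
    parent_fmv: Decimal  # FMV used at application time, frozen in the event
    child_fmv: Decimal   # likewise captured, never looked up again on replay


def allocate_basis(event: SpinoffApplied) -> Tuple[Decimal, Decimal]:
    """Split the parent's basis pro rata by captured FMV (assumed rule)."""
    total_fmv = event.parent_fmv + event.child_fmv
    child_basis = (
        event.parent_basis_before * event.child_fmv / total_fmv
    ).quantize(CENT, rounding=ROUND_HALF_EVEN)
    # Derive the parent side by subtraction so the two parts always sum exactly.
    parent_basis = event.parent_basis_before - child_basis
    return parent_basis, child_basis


event = SpinoffApplied(
    parent_symbol="PARENT",
    child_symbol="CHILD",
    parent_basis_before=Decimal("1000.00"),
    parent_fmv=Decimal("80"),
    child_fmv=Decimal("20"),
)

parent_basis, child_basis = allocate_basis(event)
replayed = allocate_basis(event)  # same event in, same result out
```

A telemetry-style event that only announced "spinoff applied" would force a replayer to fetch FMV again, and a different quote would yield a different basis. Freezing the inputs in the event is what makes the transition reproducible.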

Comparative Analysis

| Dimension | GPT-5 | Sonnet | Opus |
|---|---|---|---|
| Finding count | 10 | 10+ | 10 |
| CRITICAL findings | 4 | 3 | 3 |
| Unique insights | 5 | 4 | 4 |
| Tokens per finding | 628 | 389 | 258 |
| Structural clarity | Good | Excellent | Excellent |
| Actionability | Highest | High | Medium |
| Architectural depth | Medium | High | Highest |

Pattern Observations

GPT-5 approaches state reconstruction as a specification completeness problem. It enumerates every field that should exist but doesn't, every timestamp that should be captured, every edge case in the replay logic. The output is a comprehensive implementation checklist.

Sonnet approaches it as a systems design problem. It frames the issues in terms of well-known patterns (bi-temporal modeling, event sourcing vs. CRUD) and explains why each gap matters for reconstruction. The output is a design review.

Opus approaches it as an architectural coherence problem. It identifies the fundamental category error — the document conflates telemetry events with sourcing events — and traces most specific issues back to this root cause. The output is an architectural critique.

Novel Lens Validation

"State reconstruction correctness" proved to be a valuable analytical lens distinct from previous experiments:

  1. It requires temporal reasoning — Unlike gap-finding or ambiguity detection, this lens specifically tests whether events can be replayed in time order to reproduce state
  2. It tests architectural properties, not just specification gaps — The question isn't "what's missing?" but "does this design support a specific capability?"
  3. It's domain-independent — Applicable to any event-driven system, not just financial platforms
  4. All three models found substantive issues — The 120-line document had no obvious event-sourcing defects, yet all models identified CRITICAL reconstruction problems

Practical Implications

For state reconstruction review, use this pipeline:

  1. Opus first — Identifies architectural category errors (is this actually event-sourced or just event-logged?)
  2. Sonnet second — Frames specific gaps in terms of standard patterns (bi-temporal, CQRS, etc.)
  3. GPT-5 third — Produces actionable implementation checklist for remediation

Cost-effectiveness for this lens:

  • Opus: 10 findings in 55s at 2,579 tokens = 258 tokens/finding (best efficiency)
  • Sonnet: 10+ findings in 74s at 3,890 tokens = ~389 tokens/finding
  • GPT-5: 10 findings in 83s at 6,275 tokens = 628 tokens/finding (but includes recommendations)

For this specific task type, Opus delivers the most insight per token, but GPT-5's recommendations section adds significant implementation value.
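
The tokens-per-finding figures are output tokens divided by finding count, rounded to the nearest integer. The short calculation below reproduces them from the reported numbers. Sonnet's count is taken as 10 even though its output was reported as "10+ (truncated)", so its figure is an upper bound.

```python
# Reported output tokens and finding counts per model.
runs = {
    "GPT-5": {"output_tokens": 6275, "findings": 10},
    "Claude Sonnet 4.6": {"output_tokens": 3890, "findings": 10},  # reported as 10+
    "Claude Opus 4.6": {"output_tokens": 2579, "findings": 10},
}


def tokens_per_finding(run):
    """Output tokens spent per reported finding, rounded to an integer."""
    return round(run["output_tokens"] / run["findings"])


efficiency = {model: tokens_per_finding(run) for model, run in runs.items()}
most_efficient = min(efficiency, key=efficiency.get)
```

This metric counts output tokens only. GPT-5 also reported 3,968 reasoning tokens, while the Claude models' reasoning tokens were internal, so the comparison does not cover total compute.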

Lessons Learned

  1. "State reconstruction correctness" is a distinct analytical task — It complements but doesn't overlap with gaps, races, or ambiguities
  2. Telemetry events ≠ sourcing events — This is a common architectural confusion; the lens is good at detecting it
  3. All three models caught the core issue — Multi-model review had ~70% overlap on CRITICAL findings, suggesting the lens is model-stable
  4. Opus identified the root cause most clearly — Its "telemetry vs. sourcing" framing explains most specific findings

Conclusion

The state reconstruction correctness lens is a valuable addition to the analytical toolkit. It's particularly useful for:

  • Event-driven system design review
  • Audit trail compliance assessment
  • Data migration/replay planning
  • Regulatory submission documentation review

For documents claiming event-sourcing or audit trail capabilities, this lens should be part of standard review: all three models found CRITICAL issues in a document that had passed previous analytical lenses without major findings.