From 5ee0cff3a86a7e9025f9c22e8027baf51908aceb Mon Sep 17 00:00:00 2001 From: claw Date: Sat, 9 May 2026 05:06:45 -0700 Subject: [PATCH] =?UTF-8?q?experiment=20#55:=20state=20reconstruction=20co?= =?UTF-8?q?rrectness=20=E2=80=94=20new=20analytical=20lens?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Tests whether event stream supports time-travel queries, retroactive truth, and audit reconstruction. All three models found CRITICAL issues in a document that passed previous lenses. Key insight: distinguishes telemetry events from sourcing events. Document: gargoyle corporate-actions.md Models: GPT-5, Sonnet 4.6, Opus 4.6 Lens validation: model-stable, domain-independent, architecturally significant --- ...-09-55-state-reconstruction-correctness.md | 135 ++++++++++++++++++ 1 file changed, 135 insertions(+) create mode 100644 findings/2026-05-09-55-state-reconstruction-correctness.md diff --git a/findings/2026-05-09-55-state-reconstruction-correctness.md b/findings/2026-05-09-55-state-reconstruction-correctness.md new file mode 100644 index 0000000..a57be74 --- /dev/null +++ b/findings/2026-05-09-55-state-reconstruction-correctness.md @@ -0,0 +1,135 @@ +# State Reconstruction Correctness Analysis + +**Finding ID:** 55 +**Date:** 2026-05-09 +**Document:** gargoyle/docs/domain/contexts/market-data/corporate-actions.md (~120 lines) +**Task type:** State reconstruction correctness analysis — a NEW analytical lens +**Prompt:** "Can the system correctly reconstruct its state at any arbitrary point in time given only the event stream?" +**Models compared:** GPT-5, Claude Sonnet 4.6, Claude Opus 4.6 + +## Experiment Design + +This experiment tests a novel analytical lens: **state reconstruction correctness**. Unlike previous lenses that focus on gaps, race conditions, or implementation ambiguities, this lens asks whether the system's event stream is sufficient for: + +1. **Time-travel queries:** Answering "what was position X at timestamp T?" +2. **Retroactive truth:** Handling actions whose effective date precedes detection/application +3. **Event ordering:** Determining which temporal dimension orders events for replay +4. **Audit trail completeness:** Enabling an auditor to reconstruct exact before/after states +5. **Snapshot consistency:** Ensuring snapshot + replay events converges to the same state + +This lens is particularly relevant for systems that claim to be event-sourced or must satisfy regulatory audit requirements. + +## Performance Metrics + +| Model | Time | Input Tokens | Output Tokens | Reasoning Tokens | Findings | +|-------|------|--------------|---------------|------------------|----------| +| GPT-5 | 83s | 1,294 | 6,275 | 3,968 | 10 | +| Claude Sonnet 4.6 | 74s | 1,485 | 3,890 | (internal) | 10+ (truncated) | +| Claude Opus 4.6 | 55s | 1,485 | 2,579 | (internal) | 10 | + +## Key Findings by Model + +### All Three Identified (Common Ground) + +All models converged on these CRITICAL issues: + +1. **Missing effective_date in action.applied event** — The telemetry event schema shows no effective_date, making it impossible to answer time-travel queries correctly +2. **Undefined `details` field** — The document says "Record immutable adjustment events capturing before/after state" but the event schema uses an unstructured `details` blob with no contract +3. **Three-timeline conflation** — Effective date, detection time, and application time are distinct concepts but not preserved separately in events +4. **No before/after lot-level state in events** — Cannot reconstruct exact lot mutations from the event stream alone + +### GPT-5 Unique Findings + +GPT-5 produced the most exhaustive analysis with specific operational concerns: + +- **Currency and numeric precision unspecified** — Cost basis calculations need explicit rounding rules for deterministic replay +- **No idempotency key for per-user/per-lot adjustments** — Crash-retry could produce duplicates +- **Cash postings (dividends, cash-in-lieu) lack event modeling** — Pay date vs. ex-date vs. record date semantics missing +- **Instrument identity changes lack time-indexed mapping events** — Old→new symbol transitions not replayable +- **Cross-reference normative authority undefined** — Which document is source of truth when specs conflict? + +GPT-5's recommendations section was exceptionally actionable, specifying exact event schema additions. + +### Sonnet Unique Findings + +Sonnet provided the clearest framing of the bi-temporal problem: + +- **Bi-temporal data model missing** — System needs both valid time (when world changed) and transaction time (when we recorded it) +- **Silent deduplication is invisible in event stream** — No way to know if duplicates were suppressed +- **Delisting leaves phantom lots with no terminal state event** — Lots exist forever in "delisted" state with no resolution path +- **Source accountability lost for manual entries** — No operator audit trail + +Sonnet's analysis was remarkably structured with clear severity justifications. + +### Opus Unique Findings + +Opus excelled at identifying architectural category errors: + +- **Events are telemetry, not sourcing events** — The document explicitly says "Naming follows the telemetry event convention," which is fundamentally incompatible with state reconstruction +- **State mutations with notification events ≠ event-sourced state transitions** — Core architectural mismatch +- **Spinoff basis allocation is non-deterministic without captured FMV parameters** — Replaying with different market data produces different results +- **"Mark the action as applied" has no corresponding event** — Completion tracking is stateful, not event-based + +Opus's concluding insight was the most architecturally significant: "This system treats corporate actions as state mutations with notification events rather than as event-sourced state transitions." + +## Comparative Analysis + +| Dimension | GPT-5 | Sonnet | Opus | +|-----------|-------|--------|------| +| Finding count | 10 | 10+ | 10 | +| CRITICAL findings | 4 | 3 | 3 | +| Unique insights | 5 | 4 | 4 | +| Tokens per finding | 628 | 389 | 258 | +| Structural clarity | Good | Excellent | Excellent | +| Actionability | Highest | High | Medium | +| Architectural depth | Medium | High | Highest | + +### Pattern Observations + +**GPT-5** approaches state reconstruction as a specification completeness problem. It enumerates every field that should exist but doesn't, every timestamp that should be captured, every edge case in the replay logic. The output is a comprehensive implementation checklist. + +**Sonnet** approaches it as a systems design problem. It frames the issues in terms of well-known patterns (bi-temporal modeling, event sourcing vs. CRUD) and explains *why* each gap matters for reconstruction. The output is a design review. + +**Opus** approaches it as an architectural coherence problem. It identifies the fundamental category error — the document conflates telemetry events with sourcing events — and traces most specific issues back to this root cause. The output is an architectural critique. + +## Novel Lens Validation + +"State reconstruction correctness" proved to be a valuable analytical lens distinct from previous experiments: + +1. **It requires temporal reasoning** — Unlike gap-finding or ambiguity detection, this lens specifically tests whether events can be replayed in time order to reproduce state +2. **It tests architectural properties, not just specification gaps** — The question isn't "what's missing?" but "does this design support a specific capability?" +3. **It's domain-independent** — Applicable to any event-driven system, not just financial platforms +4. **All three models found substantive issues** — The 120-line document had no obvious event-sourcing defects, yet all models identified CRITICAL reconstruction problems + +## Practical Implications + +**For state reconstruction review, use this pipeline:** + +1. **Opus first** — Identifies architectural category errors (is this actually event-sourced or just event-logged?) +2. **Sonnet second** — Frames specific gaps in terms of standard patterns (bi-temporal, CQRS, etc.) +3. **GPT-5 third** — Produces actionable implementation checklist for remediation + +**Cost-effectiveness for this lens:** +- Opus: 10 findings in 55s at 2,579 tokens = **258 tokens/finding** (best efficiency) +- Sonnet: 10+ findings in 74s at 3,890 tokens = ~389 tokens/finding +- GPT-5: 10 findings in 83s at 6,275 tokens = 628 tokens/finding (but includes recommendations) + +For this specific task type, Opus delivers the most insight per token, but GPT-5's recommendations section adds significant implementation value. + +## Lessons Learned + +1. **"State reconstruction correctness" is a distinct analytical task** — It complements but doesn't overlap with gaps, races, or ambiguities +2. **Telemetry events ≠ sourcing events** — This is a common architectural confusion; the lens is good at detecting it +3. **All three models caught the core issue** — Multi-model review had ~70% overlap on CRITICAL findings, suggesting the lens is model-stable +4. **Opus identified the root cause most clearly** — Its "telemetry vs. sourcing" framing explains most specific findings + +## Conclusion + +The state reconstruction correctness lens is a valuable addition to the analytical toolkit. It's particularly useful for: + +- Event-driven system design review +- Audit trail compliance assessment +- Data migration/replay planning +- Regulatory submission documentation review + +For documents claiming event-sourcing or audit trail capabilities, this lens should be part of standard review — all three models found CRITICAL issues in a document that passed previous analytical lenses without major findings.