model-research/findings/2026-05-09-60-counterfactual-event-ordering-analysis.md
Rodin af950a33d1 Add finding #60: Counterfactual event ordering analysis
New analytical lens testing what breaks when events arrive out of order.
- GPT-5: 30 findings via exhaustive permutation enumeration
- Opus: 19 findings with operational consequence tracing
- Sonnet: 17 findings with regulatory compliance focus

Key insight: GPT-5's reasoning enables systematic swap/delay/duplicate/
interleave enumeration. Sonnet uniquely connects to regulatory requirements.
2026-05-09 18:28:40 -07:00


Finding #60: Counterfactual Event Ordering Analysis

Date: 2026-05-09
Document: specid-lot-selection.md (125 lines)
New analytical lens: Counterfactual event ordering ("what breaks if events arrive in a different order?")

Results

| Model | Time | Output tokens | Reasoning tokens | Findings |
|---|---|---|---|---|
| GPT-5 | 84s | 7,278 | 3,712 | 30 |
| Claude Opus 4 | 56s | 1,662 | (internal) | 19 |
| Claude Sonnet 4 | 36s | 1,464 | (internal) | 17 |

Common Ground (all 3 identified)

  • Concurrent fills on same instrument race to consume same lots (Critical)
  • Crash between selection and closure recording creates non-determinism (Critical)
  • Manual selection validated at order time but may be stale at fill time (High)
  • Corporate actions interleaving with fill processing creates basis/quantity confusion (Critical)
  • Partial event persistence on crash leads to inconsistent state (Critical)

GPT-5 Unique Findings

  • Finding 8: Duplicate buy fill ingestion creates phantom shares — "Ledger shows more shares than actually purchased, allowing subsequent sells to over-consume"
  • Finding 13: Partial fills with manual selection have no partitioning rule — "The first partial fill either consumes the entire manual selection or uses automatic strategy"
  • Finding 16: Closure event reordering affects holding period classification — "Time-sensitive calculations (holding period, wash sale windows) can be mis-evaluated"
  • Finding 20: Network partition causes causal inversion — "Later buy lot.opened arrives before earlier buy" creates causally inconsistent snapshot
  • Finding 24: Recompute job interleaving with selection — "Recompute for fill A executes while selection for fill B is mid-processing"
  • Finding 27: Closure event emission order affects compliance reporting — "Tax lot reports, audit traces will misreport strategy application"
  • Finding 28: Lot instrument metadata race — "Manual selection includes lot from different instrument due to late metadata update"
  • Finding 30: Sharded processing creates merge ordering issues — "Two independent arrival-time orders interleave unpredictably when merged"
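Finding 8 (duplicate buy fill ingestion) is the easiest of these to illustrate. The sketch below shows one common guard, idempotent ingestion keyed by a fill identifier. It is a minimal illustration under stated assumptions, not the mechanism in specid-lot-selection.md: `BuyFill`, `LotLedger`, `fill_id`, and `ingest` are hypothetical names, and it assumes every fill carries a stable unique ID.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass(frozen=True)
class BuyFill:
    # Hypothetical fill record; assumes the upstream feed supplies a stable fill_id.
    fill_id: str
    instrument: str
    quantity: int


@dataclass
class LotLedger:
    # Hypothetical ledger: shares held per instrument plus the set of fill IDs already applied.
    shares: Dict[str, int] = field(default_factory=dict)
    seen_fill_ids: Set[str] = field(default_factory=set)

    def ingest(self, fill: BuyFill) -> bool:
        """Apply a buy fill once; return False if it was already applied."""
        if fill.fill_id in self.seen_fill_ids:
            # Duplicate delivery: ignore it so no phantom shares are created.
            return False
        self.seen_fill_ids.add(fill.fill_id)
        self.shares[fill.instrument] = self.shares.get(fill.instrument, 0) + fill.quantity
        return True


ledger = LotLedger()
fill = BuyFill(fill_id="F-1", instrument="XYZ", quantity=100)
first = ledger.ingest(fill)
second = ledger.ingest(fill)  # same fill delivered again
```

Without the `seen_fill_ids` check, the second delivery would raise the ledger to 200 shares, which is the over-consumption scenario the finding describes.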

Opus Unique Findings

  • Strategy configuration changes mid-flight: "User changes strategy → In-flight fill already being processed → Strategy change committed" — fill uses old strategy
  • Serialization mechanism unspecified: "Phrase 'serial processing prevents concurrent modification' doesn't specify the mechanism (locks, queues, etc.)"
  • Risk system sees incorrect position: "Lot events emitted → Network partition → Some events delayed → Position rebuilt on partial data → Risk system sees incorrect position size, may allow overleveraging"

Sonnet Unique Findings

  • Regulatory identification timing: "Ambiguity about which strategy applies - the one active when order was placed or when fill arrived. Could violate regulatory identification timing requirements"
  • Clock skew creates impossible sequences: "Events appear to occur in impossible sequences (e.g., selling before buying), corrupting tax lot accounting"
  • Wash sale double-adjustment: "Lot closure uses unadjusted basis while wash sale system has already adjusted it, leading to incorrect gain/loss calculations and double-adjustment"
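The clock skew finding can be made concrete with a small replay check. The sketch below sorts events by their recorded timestamp and flags any sell that exceeds the position held at that point. It assumes a long-only account and a hypothetical `(timestamp, side, quantity)` tuple shape. Neither assumption comes from the analyzed document.

```python
from typing import List, Tuple

# Hypothetical event shape: (timestamp, side, quantity), with side "buy" or "sell".
Event = Tuple[int, str, int]


def find_impossible_sells(events: List[Event]) -> List[Event]:
    """Replay events in timestamp order and return sells larger than the position held.

    Assumes a long-only account, so a sell that exceeds the current position
    indicates a skewed or misordered timestamp rather than a short sale.
    """
    position = 0
    flagged: List[Event] = []
    for ts, side, qty in sorted(events, key=lambda e: e[0]):
        if side == "buy":
            position += qty
        elif qty > position:
            # Selling shares that, by timestamp, have not been bought yet.
            flagged.append((ts, side, qty))
        else:
            position -= qty
    return flagged


# The sell's clock is behind the buy's clock, so it appears to happen first.
skewed = [(10, "sell", 50), (12, "buy", 50)]
suspect = find_impossible_sells(skewed)
```

A flagged event is only a signal that the timestamps need reconciling. It does not say which clock is wrong.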

Key Insights

Exhaustive permutation enumeration is a GPT-5 strength

GPT-5's 3,712 reasoning tokens appear to have enabled systematic coverage: it applied swap, delay, duplicate, and interleave to each event type, and it explicitly enumerated network partition, sharded processing, and partial persistence scenarios. The Claude models identified categories of failure but did not enumerate the permutations within each category. Because the Claude models' reasoning token counts are internal, the token comparison itself is one-sided.
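The four permutation types named above can be generated mechanically from a baseline event sequence. The sketch below is one possible reading of each type, not a reconstruction of how any model produced its findings. The function names are hypothetical, and apart from `lot.opened`, which appears in Finding 20, the event labels are invented for illustration.

```python
from typing import Iterator, List


def swaps(events: List[str]) -> Iterator[List[str]]:
    """Yield each ordering obtained by swapping one adjacent pair."""
    for i in range(len(events) - 1):
        out = list(events)
        out[i], out[i + 1] = out[i + 1], out[i]
        yield out


def delays(events: List[str]) -> Iterator[List[str]]:
    """Yield each ordering where one event (other than the last) arrives last."""
    for i in range(len(events) - 1):
        out = list(events)
        out.append(out.pop(i))
        yield out


def duplicates(events: List[str]) -> Iterator[List[str]]:
    """Yield each ordering where one event is delivered twice in a row."""
    for i in range(len(events)):
        out = list(events)
        out.insert(i + 1, out[i])
        yield out


def interleavings(a: List[str], b: List[str]) -> Iterator[List[str]]:
    """Yield every merge of two streams that preserves each stream's own order."""
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


# Hypothetical baseline sequence for a single buy followed by a sell.
baseline = ["buy.filled", "lot.opened", "sell.filled", "lot.closed"]
counterfactuals = list(swaps(baseline)) + list(delays(baseline)) + list(duplicates(baseline))
```

Each generated ordering can then be replayed against the system under test, or handed to a reviewer with the question "what breaks here?". `interleavings` covers the sharded case in Finding 30, where two independently ordered streams are merged.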

Opus traces failures to operational consequences

Opus consistently traced failures to downstream impact ("may allow overleveraging", "corrupted position state", "incomplete event set"). GPT-5 enumerated the failures; Opus connected them to what operators would actually see break.

Sonnet finds regulatory compliance angles

Sonnet uniquely flagged regulatory timing requirements, and it also raised audit trail corruption, which GPT-5's Finding 27 touches on as well. Neither GPT-5 nor Opus explicitly connected ordering failures to the timing requirements of Treasury Regulation §1.1012-1(c).

Task Taxonomy Update

  • Counterfactual event ordering analysis → GPT-5 for exhaustive permutation coverage, Sonnet for regulatory compliance implications
  • This task type is distinct from "race condition analysis" (Finding #13): race condition analysis assumes two operations run concurrently, while counterfactual ordering analysis assumes events arrive in an unexpected sequence
  • GPT-5's reasoning enables systematic enumeration of permutation types (swap, delay, duplicate, interleave, partition)

Practical Implication

For event ordering correctness analysis of financial systems, use GPT-5 for exhaustive permutation coverage, then Sonnet to flag regulatory implications. Opus adds value in tracing failures to operational observability gaps.