Finding #40: Silent data corruption paths in financial accounting

New analytical lens applied to lot-accounting.md (181 lines). Tests how models identify sequences of individually correct operations that produce silently wrong financial results.

Results:
- GPT-5: 12 findings (137s, 10688 reasoning tokens) - tax law domain knowledge
- Opus: 8 findings (121s) - concurrent systems / crash recovery focus
- Sonnet: 8 findings (111s) - structural meta-analysis, highest-leverage finding

Key insight: First experiment where domain-specific knowledge (tax law) is the primary differentiator. Models reason from different knowledge domains: GPT-5=tax law, Opus=distributed systems, Sonnet=architecture patterns. Sonnet produced the most architecturally significant finding: that the system's reconciliation mechanism confirms corruption rather than detecting it (because it re-derives from LotClosed which is itself the corrupted source).
2026-05-07 11:09:58 -07:00
parent 0c632c255a
commit bb0c0d564b
1 changed files with 195 additions and 0 deletions
@@ -0,0 +1,195 @@
+# Finding #40: Silent Data Corruption Paths in Financial Accounting — GPT-5 finds most paths with highest domain specificity; Sonnet produces strongest structural insight
+
+**Date:** 2026-05-07
+**Task:** Identify silent data corruption paths in gargoyle's `lot-accounting.md`
+(181 lines) — sequences of individually correct operations that produce financially
+wrong results (incorrect P&L, wrong tax terms, misattributed gains) without
+triggering any error or invariant violation.
+**How we used them:** Same document (full text) + same focused analytical question
+to all 3 models via HAI proxy. Highly structured prompt specifying 5 categories
+(upstream corruption propagation, ordering-dependent correctness, temporal boundary
+errors, aggregate drift without detection, cross-strategy contamination). Required
+specific output format per finding with severity ratings and financial impact
+quantification. No tools, no project context beyond the document itself.
+
+| Model | Time | Output tokens | Reasoning tokens | Findings |
+|---|---|---|---|---|
+| GPT-5 | 137s | 14,251 | 10,688 | 12 |
+| Claude Opus 4.6 | 121s | 6,270 | (internal) | 8 |
+| Claude Sonnet 4.6 | 111s | 5,844 | (internal) | 8 |
+
+## What they found — common ground (all 3 identified):
+
+- **Wash sale + LotClosed immutability conflict**: Disallowed loss permanently
+  recorded in immutable LotClosed because wash sale detection fires after the
+  closure event is written. No mechanism to annotate, supersede, or disallow
+  the recorded loss. All three models identified this as the most fundamental
+  design tension.
+- **Corporate action timing vs fill ordering**: Late-arriving corporate actions
+  (splits, spin-offs) processed after related sells produce LotClosed events
+  with pre-adjustment cost bases that can never be corrected.
+- **Holding period / tax term boundary errors**: Ambiguous timestamp semantics
+  (exchange time vs processing time), the IRS day-after rule, and timezone
+  boundaries can cause short/long-term misclassification at the one-year boundary.
+- **FIFO ordering ambiguity**: When opened_at timestamps are identical or
+  reordered (network jitter, batch processing), FIFO produces different outcomes
+  without any error signal.
+- **Cross-strategy lot consumption**: FIFO/HIFO operates at position level
+  (user+instrument), not strategy level — sells from Strategy A can consume lots
+  opened by Strategy B, corrupting per-strategy P&L attribution.
+- **Position.average_cost drift after corporate actions**: The spec only describes
+  Position updates during buy/sell flows, not during the Adjusting path — corporate
+  actions that modify lot cost bases may not trigger Position recalculation.
+
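To make the cross-strategy consumption path concrete, here is a minimal Python sketch. It assumes a position-level FIFO queue as described in the list above; the `Lot` fields, the `sell_fifo` helper, and all prices are hypothetical and are not taken from `lot-accounting.md`.

```python
from collections import deque
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class Lot:
    opened_by: str        # hypothetical field: strategy that opened the lot
    quantity: int
    cost_basis: Decimal   # per-share cost


def sell_fifo(lots, quantity, price):
    """Consume lots oldest-first at position level (user+instrument).

    Returns (realized_pnl, list of strategies whose lots were consumed).
    """
    realized = Decimal("0")
    consumed_from = []
    remaining = quantity
    while remaining > 0 and lots:
        lot = lots[0]
        take = min(remaining, lot.quantity)
        realized += (price - lot.cost_basis) * take
        consumed_from.append(lot.opened_by)
        lot.quantity -= take
        remaining -= take
        if lot.quantity == 0:
            lots.popleft()
    return realized, consumed_from


# Strategy B opened the older lot, Strategy A the newer one.
position_lots = deque([
    Lot("B", 100, Decimal("50")),
    Lot("A", 100, Decimal("80")),
])

# Strategy A sells 100 shares at $90; FIFO picks Strategy B's lot.
realized, consumed_from = sell_fifo(position_lots, 100, Decimal("90"))

# What Strategy A would have realized against its own lot.
own_lot_pnl = (Decimal("90") - Decimal("80")) * 100
print(realized, consumed_from, own_lot_pnl)
```

Every step is valid FIFO, yet the sale credits Strategy A with a gain computed from Strategy B's basis (4000 rather than 1000), and no invariant is violated along the way.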
+## GPT-5 unique findings (not in either Claude model):
+
+- **Late spin-off basis reallocation after sale**: 20% basis reallocation to SpinCo
+  arrives after Parent shares were sold → LotClosed records gain computed from
+  original (too-high) basis, understating taxable income by the reallocation amount.
+  Distinct from split timing because the *kind* of error is different (basis
+  reallocation vs quantity/basis halving).
+- **Wash sale holding period tack-back not implemented**: Spec's LotClosed schema
+  has no mechanism to record tack-back of holding period from original lot to
+  replacement lot. `opened_at` is always set from the opening fill timestamp, not
+  the original acquisition date. This can flip long-term/short-term classification.
+- **HIFO flipped by CA timing**: Corporate action reducing one lot's basis changes
+  which lot HIFO selects, producing different P&L depending on whether the CA
+  arrives before or after the sell.
+- **Position.closed_at stale after post-closure corporate action**: If a position is
+  fully closed (closed_at set) and then a corporate action credits new shares,
+  closed_at may not be cleared → Risk treats position as closed while it's live.
+- **Wash sale cross-strategy tax deferral**: Strategy A's loss is deferred into
+  Strategy B's replacement lot basis, silently transferring tax liability between
+  strategies without any detection mechanism.
+- **61-day wash window off-by-one from timezone**: Boundary cases where UTC vs local
+  time difference pushes a replacement buy just inside/outside the 61-day window.
+
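The timezone off-by-one in the last bullet can be reproduced with the standard library alone. This is a sketch under stated assumptions: the wash window is modeled as calendar-day distance of at most 30 days either side of the sale, `EST` is a fixed UTC-5 offset with no DST handling, and the helper names and timestamps are illustrative rather than taken from the spec.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc
EST = timezone(timedelta(hours=-5))  # assumption: fixed offset, ignores DST


def days_between(sale, buy, tz):
    """Calendar-day distance between two instants, as seen in timezone tz."""
    return (buy.astimezone(tz).date() - sale.astimezone(tz).date()).days


def in_wash_window(sale, buy, tz):
    """True if the buy falls within 30 calendar days before or after the sale."""
    return abs(days_between(sale, buy, tz)) <= 30


# Loss sale on Jan 5 (afternoon, local time); replacement buy late on Feb 4.
sale = datetime(2026, 1, 5, 15, 0, tzinfo=EST)
buy = datetime(2026, 2, 4, 20, 30, tzinfo=EST)   # already Feb 5 in UTC

local_days = days_between(sale, buy, EST)
utc_days = days_between(sale, buy, UTC)
print(local_days, in_wash_window(sale, buy, EST))
print(utc_days, in_wash_window(sale, buy, UTC))
```

The same two instants are 30 days apart on the local calendar and 31 days apart on the UTC calendar, so the wash sale determination flips depending on which calendar the system uses.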
+## Claude Opus unique findings (not in either other model):
+
+- **Recovery after crash + strategy switch race**: If user changes lot selection
+  strategy between a crash and fill reprocessing, the recovered fill is processed
+  under the wrong strategy. Immutable LotClosed from the recovery is now permanently
+  wrong. The spec doesn't require strategy to be stored with the fill at submission
+  time.
+- **Position rebuild racing with active processing creates double-count**: The spec's
+  recovery procedure ("re-derive from LotClosed events") can race with new fill
+  processing that's also incrementing Position.realized_pnl, causing a new LotClosed
+  event to be counted both in the rebuild SUM and in the incremental update.
+- **Self-correcting analytical behavior**: Opus began exploring Position.average_cost
+  drift with a running-weighted-average scenario, worked through the math mid-response,
+  proved to itself that the running average IS correct for simple cases, then
+  identified the REAL drift path (corporate action not triggering recalculation).
+  This inline verification continues the pattern from Findings #15 and #20.
+
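The rebuild race that Opus describes can be illustrated as a deterministic interleaving. The sketch below replays one unlucky ordering of steps in a single thread; the variable names and P&L values are hypothetical, and the real system's event log and locking are not modeled.

```python
from decimal import Decimal

ZERO = Decimal("0")

# realized_pnl values of LotClosed events already on the log (illustrative).
lot_closed_log = [Decimal("100"), Decimal("250")]
position_realized_pnl = sum(lot_closed_log, ZERO)   # consistent so far

# Step 1: the fill processor appends a new LotClosed event to the log.
new_event_pnl = Decimal("40")
lot_closed_log.append(new_event_pnl)

# Step 2: before the processor updates Position, the rebuild re-derives
# the total from the log, which already contains the new event.
position_realized_pnl = sum(lot_closed_log, ZERO)

# Step 3: the fill processor resumes and applies its incremental update.
position_realized_pnl += new_event_pnl

# The new event is now counted twice in Position.realized_pnl.
double_counted = position_realized_pnl - sum(lot_closed_log, ZERO)
print(position_realized_pnl, double_counted)
```

One possible mitigation, offered here only as an assumption, is to serialize rebuild and fill processing per position or to make the incremental update idempotent by event ID.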
+## Claude Sonnet unique findings (not in either other model):
+
+- **Rounding error accumulation in split-adjusted cost basis**: 3-for-1 split on
+  100 shares at $100 per share → $33.33 per share × 300 = $9,999 instead of
+  $10,000. Per-split error is small but accumulates without bound across multiple
+  splits and high lot counts.
+  No invariant checks that `SUM(lot.quantity × lot.cost_basis) = original_total_cost`.
+- **Reconciliation re-derives from the corrupted source (meta-finding)**: The spec's
+  reconciliation job checks Position.realized_pnl against SUM(LotClosed.realized_pnl).
+  For 7 of 8 findings, LotClosed is itself the corrupted source — reconciliation
+  confirms the wrong number with high confidence, providing *false assurance*.
+  This is a system-level insight about the reconciliation architecture, not just
+  another data path.
+- **LotClosedAdjustment event type recommendation**: Proposed a specific architectural
+  fix (adjustment events that reference original LotClosed IDs and record deltas)
+  that preserves immutability semantics while enabling corrections. Neither GPT-5
+  nor Opus proposed concrete mechanisms at this level.
+
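The rounding path in Sonnet's first bullet can be checked with a few lines of decimal arithmetic. This sketch assumes per-share cost basis is stored rounded to cents with half-up rounding; the helper name and the 100-share lot are illustrative choices that reproduce the $9,999 figure above.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def split_per_share_basis(cost_basis, ratio):
    """Per-share basis after an N-for-1 split, rounded to cents (assumed policy)."""
    return (cost_basis / ratio).quantize(CENT, rounding=ROUND_HALF_UP)


quantity = 100
cost_basis = Decimal("100.00")
original_total_cost = quantity * cost_basis

# Apply a 3-for-1 split to the lot.
new_quantity = quantity * 3
new_cost_basis = split_per_share_basis(cost_basis, 3)

# The invariant the spec does not check: total cost is preserved by a split.
adjusted_total_cost = new_quantity * new_cost_basis
drift = original_total_cost - adjusted_total_cost
print(new_cost_basis, adjusted_total_cost, drift)
```

Storing the lot's total cost and deriving the per-share figure on read is one way to keep the invariant exact, though whether that fits the spec's schema is not something the source document settles.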
+## Quality assessment:
+
+- **GPT-5** produced the most findings (12) with the highest domain specificity.
+  Its unique findings demonstrate deep knowledge of US tax law (IRS tack-back rules,
+  61-day window semantics, trade-date vs settlement-date distinctions, spin-off basis
+  reallocation mechanics). Several findings are variations on the same theme
+  (corporate action timing) but each identifies a genuinely distinct failure mode
+  with different financial characteristics. The 10,688 reasoning tokens appear to
+  have been invested in domain-specific reasoning about tax rules and timing edge
+  cases. Every finding includes precise financial impact quantification.
+
+- **Claude Opus** produced 8 findings with characteristic design-tension focus.
+  The crash-recovery race condition and strategy-switch-during-reprocessing findings
+  show reasoning about system dynamics that neither other model surfaced — these
+  are about HOW the system operates at runtime, not just what it computes. The
+  self-correcting behavior (proving the running average is actually correct before
+  finding the real drift path) demonstrates intellectual honesty. However, Opus
+  produced fewer unique findings than in previous experiments, and its overlap
+  with the common ground was substantial.
+
+- **Claude Sonnet** was the surprise performer in this experiment. While its raw
+  finding count (8) matches Opus, three of its contributions stand out, and the
+  first two represent qualitatively different analytical modes:
+  1. The **rounding accumulation** finding shows quantitative reasoning about numeric
+     precision that neither GPT-5 nor Opus considered — they focused on logical/semantic
+     errors while Sonnet identified a computational precision problem.
+  2. The **"reconciliation confirms corruption"** meta-finding is architecturally the
+     most important insight across all three models. It identifies that the system's
+     own safety mechanism (reconciliation from LotClosed) becomes a corruption
+     *amplifier* when LotClosed itself is wrong. This is a system-of-systems insight.
+  3. The concrete **LotClosedAdjustment** architectural recommendation shows Sonnet
+     reasoning about solutions, not just problems.
+
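A small sketch shows why reconciliation passes on corrupted data and how an adjustment event could carry the correction. The `LotClosedAdjustment` shape here is a hypothetical reading of Sonnet's recommendation (reference the original event ID, record a delta), and the numbers model a late basis reallocation that should have lowered the basis from $50 to $40 on 100 shares sold at $60.

```python
from dataclasses import dataclass
from decimal import Decimal

ZERO = Decimal("0")


@dataclass(frozen=True)
class LotClosed:
    event_id: str
    realized_pnl: Decimal


@dataclass(frozen=True)
class LotClosedAdjustment:      # hypothetical event shape
    adjusts_event_id: str       # ID of the immutable LotClosed being corrected
    pnl_delta: Decimal
    reason: str


def reconcile(position_realized_pnl, closed_events):
    """The check described in the spec: Position vs SUM(LotClosed.realized_pnl)."""
    return position_realized_pnl == sum((e.realized_pnl for e in closed_events), ZERO)


def adjusted_realized_pnl(closed_events, adjustments):
    """Total P&L with adjustment deltas layered on top of immutable events."""
    base = sum((e.realized_pnl for e in closed_events), ZERO)
    return base + sum((a.pnl_delta for a in adjustments), ZERO)


# Recorded with the stale $50 basis: (60 - 50) * 100 = 1000.
closed = [LotClosed("lc-1", Decimal("1000"))]
position_realized_pnl = Decimal("1000")     # derived from the same wrong event

# Reconciliation agrees, because both sides share the corrupted source.
passes = reconcile(position_realized_pnl, closed)

# Correct P&L with the $40 basis is 2000, so the delta is +1000.
adjustments = [LotClosedAdjustment("lc-1", Decimal("1000"), "late basis reallocation")]
corrected = adjusted_realized_pnl(closed, adjustments)
print(passes, corrected)
```

The original event stays untouched, which preserves immutability, while the correction is auditable through the referenced event ID.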
+## Key insight — Financial domain knowledge as differentiator:
+
+This is the first experiment where **domain-specific knowledge** (US tax law,
+IRS rules, financial accounting conventions) is a primary differentiator between
+models. Previous experiments tested general architectural reasoning skills that
+don't require specialized knowledge. Here:
+
+- GPT-5's unique findings are primarily driven by **tax law knowledge** (tack-back
+  rules, 61-day window semantics, spin-off basis mechanics) that the other models
+  either don't know or don't surface.
+- Opus's unique findings are primarily driven by **concurrent systems reasoning**
+  (crash recovery races, strategy change timing) — its traditional strength.
+- Sonnet's unique findings are primarily driven by **structural/meta-analytical
+  reasoning** (what does it mean that the reconciliation source is corrupted? how
+  do numeric precision errors accumulate?).
+
+The models aren't just finding "different things" — they're reasoning from
+*different knowledge domains* applied to the same document:
+- GPT-5: tax law + financial semantics
+- Opus: distributed systems + runtime dynamics
+- Sonnet: architecture patterns + numeric computing + meta-analysis
+
+## Comparison to Finding #22 (silent correctness failures):
+
+Finding #22 tested the "silent correctness failures" lens on `risk-controls.md`
+(a very different document — runtime behavioral rather than data/accounting).
+That experiment found GPT-5 dominant with Sonnet producing "surface-level" results.
+Here, on a financial accounting document where the failures are about *data
+correctness over time*, Sonnet performs much better. This suggests Sonnet's
+analytical capabilities are strongest when the document describes data structures
+and their transformations (where it can reason about invariants and meta-properties)
+rather than runtime behavior (where it struggles with temporal/concurrent reasoning
+— consistent with Finding #13).
+
+## Practical implication:
+
+For **financial accounting architecture review** specifically:
+- GPT-5 is essential for tax-rule compliance gaps (it appears to have genuine
+  knowledge of IRS wash sale rules, holding period tack-back, corporate action
+  basis mechanics)
+- Sonnet is valuable for structural/meta-analysis of the reconciliation architecture
+  (it found that the safety net itself is compromised — the highest-leverage finding)
+- Opus adds value for runtime/crash-recovery scenarios but provides less unique insight
+  on the data-correctness dimension
+
+The three-model approach continues to justify itself: GPT-5 finds 4 unique tax-law
+gaps (among its 6 unique findings), Opus finds 2 unique runtime gaps, and Sonnet
+finds 2 unique structural insights (including the most architecturally significant
+one). None alone would have produced the complete picture.
+
+## Token efficiency:
+
+| Model | Findings | Tokens/Finding | Unique findings | Tokens/Unique finding |
+|---|---|---|---|---|
+| GPT-5 | 12 | 1,188 | 6 | 2,375 |
+| Opus | 8 | 784 | 3 | 2,090 |
+| Sonnet | 8 | 731 | 3 | 1,948 |
+
+Sonnet is most token-efficient per finding (including its highest-leverage
+meta-finding). GPT-5's reasoning tokens (10,688) produced the most unique findings,
+but by the table above it used ~1.6× Sonnet's output tokens per finding (~1.2× per
+unique finding). For financial document review where every unique finding
+represents potential regulatory risk, all three are justified.
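
The per-finding figures in the table can be re-derived from the output-token counts reported in the first table. The sketch below assumes "Tokens/Finding" means output tokens divided by finding count and that values were rounded half-up; both are inferences from the numbers, not statements in the source.

```python
import math

# Output tokens, findings, and unique findings as reported in the tables above.
runs = {
    "GPT-5":  {"output_tokens": 14251, "findings": 12, "unique": 6},
    "Opus":   {"output_tokens": 6270,  "findings": 8,  "unique": 3},
    "Sonnet": {"output_tokens": 5844,  "findings": 8,  "unique": 3},
}


def round_half_up(x):
    """Round to nearest integer with .5 going up (Python's round() uses banker's rounding)."""
    return math.floor(x + 0.5)


per_finding = {m: r["output_tokens"] / r["findings"] for m, r in runs.items()}
per_unique = {m: r["output_tokens"] / r["unique"] for m, r in runs.items()}

for model in runs:
    print(model, round_half_up(per_finding[model]), round_half_up(per_unique[model]))

# GPT-5 relative to Sonnet, on both measures.
ratio_per_finding = per_finding["GPT-5"] / per_finding["Sonnet"]
ratio_per_unique = per_unique["GPT-5"] / per_unique["Sonnet"]
print(round(ratio_per_finding, 2), round(ratio_per_unique, 2))
```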