model-research/findings/2026-05-07-40-silent-data-corruption-paths-financial-accounting.md

# Finding #40: Silent Data Corruption Paths in Financial Accounting — GPT-5 finds most paths with highest domain specificity; Sonnet produces strongest structural insight

**Date:** 2026-05-07
**Task:** Identify silent data corruption paths in gargoyle's `lot-accounting.md`
(181 lines) — sequences of individually correct operations that produce financially
wrong results (incorrect P&L, wrong tax terms, misattributed gains) without
triggering any error or invariant violation.
**How we used them:** Same document (full text) + same focused analytical question
to all 3 models via HAI proxy. Highly structured prompt specifying 5 categories
(upstream corruption propagation, ordering-dependent correctness, temporal boundary
errors, aggregate drift without detection, cross-strategy contamination). Required
specific output format per finding with severity ratings and financial impact
quantification. No tools, no project context beyond the document itself.

| Model | Time | Output tokens | Reasoning tokens | Findings |
|---|---|---|---|---|
| GPT-5 | 137s | 14,251 | 10,688 | 12 |
| Claude Opus 4.6 | 121s | 6,270 | (internal) | 8 |
| Claude Sonnet 4.6 | 111s | 5,844 | (internal) | 8 |

## What they found — common ground (all 3 identified):

- **Wash sale + LotClosed immutability conflict**: Disallowed loss permanently
  recorded in immutable LotClosed because wash sale detection fires after the
  closure event is written. No mechanism to annotate, supersede, or disallow
  the recorded loss. All three models identified this as the most fundamental
  design tension.
- **Corporate action timing vs fill ordering**: Late-arriving corporate actions
  (splits, spin-offs) processed after related sells produce LotClosed events
  with pre-adjustment cost bases that can never be corrected.
- **Holding period / tax term boundary errors**: Ambiguity in timestamp semantics
  (exchange time vs processing time), the IRS rule that the holding period starts
  the day after acquisition, and timezone boundaries can each cause short/long-term
  misclassification at the one-year boundary.
- **FIFO ordering ambiguity**: When opened_at timestamps are identical or
  reordered (network jitter, batch processing), FIFO produces different outcomes
  without any error signal.
- **Cross-strategy lot consumption**: FIFO/HIFO operates at position level
  (user+instrument), not strategy level — sells from Strategy A can consume lots
  opened by Strategy B, corrupting per-strategy P&L attribution.
- **Position.average_cost drift after corporate actions**: The spec only describes
  Position updates during buy/sell flows, not during the Adjusting path — corporate
  actions that modify lot cost bases may not trigger Position recalculation.
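
The cross-strategy consumption path above can be sketched as follows. This is a minimal illustration, not the spec's implementation: the `Lot` fields, the `sell_fifo` helper, and all quantities and prices are assumptions invented for the example.

```python
from collections import deque
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class Lot:
    strategy: str        # strategy that opened the lot (hypothetical field)
    quantity: int
    cost_basis: Decimal  # per-share cost


def sell_fifo(lots, quantity, price):
    """Consume lots oldest-first at position level, ignoring strategy."""
    realized = Decimal("0")
    consumed_from = []
    remaining = quantity
    while remaining > 0 and lots:
        lot = lots[0]
        taken = min(lot.quantity, remaining)
        realized += (price - lot.cost_basis) * taken
        consumed_from.append(lot.strategy)
        lot.quantity -= taken
        remaining -= taken
        if lot.quantity == 0:
            lots.popleft()
    return realized, consumed_from


# Strategy B opened first at $50; Strategy A opened later at $90.
position_lots = deque([
    Lot("B", 100, Decimal("50")),
    Lot("A", 100, Decimal("90")),
])

# Strategy A sells 100 shares at $80, expecting to close its own lot.
realized, consumed_from = sell_fifo(position_lots, 100, Decimal("80"))
```

In this scenario Strategy A would expect a $1,000 loss on its own $90 lot, but position-level FIFO consumes Strategy B's older $50 lot, so a $3,000 gain is realized and A's lot stays open. Every step is individually valid and nothing raises.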

## GPT-5 unique findings (not in either Claude model):

- **Late spin-off basis reallocation after sale**: 20% basis reallocation to SpinCo
  arrives after Parent shares were sold → LotClosed records gain computed from
  original (too-high) basis, understating taxable income by the reallocation amount.
  Distinct from split timing because the *kind* of error is different (basis
  reallocation vs quantity/basis halving).
- **Wash sale holding period tack-back not implemented**: Spec's LotClosed schema
  has no mechanism to record tack-back of holding period from original lot to
  replacement lot. `opened_at` is always set from the opening fill timestamp, not
  the original acquisition date. This can flip long-term/short-term classification.
- **HIFO flipped by CA timing**: Corporate action reducing one lot's basis changes
  which lot HIFO selects, producing different P&L depending on whether the CA
  arrives before or after the sell.
- **Position.closed_at stale after post-closure corporate action**: If a position is
  fully closed (closed_at set) and then a corporate action credits new shares,
  closed_at may not be cleared → Risk treats position as closed while it's live.
- **Wash sale cross-strategy tax deferral**: Strategy A's loss is deferred into
  Strategy B's replacement lot basis, silently transferring tax liability between
  strategies without any detection mechanism.
- **61-day wash window off-by-one from timezone**: Boundary cases where UTC vs local
  time difference pushes a replacement buy just inside/outside the 61-day window.
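
The timezone off-by-one in the last bullet can be illustrated with a small sketch. It assumes the window is evaluated on calendar dates (30 days before through 30 days after the sale date) and uses a fixed UTC-5 offset in place of a real exchange timezone. The `in_wash_window` helper and both timestamps are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC-5 offset as a stand-in for US Eastern standard time (assumption;
# a real system would need DST-aware zone handling).
EASTERN_STANDARD = timezone(timedelta(hours=-5))


def in_wash_window(sale_ts, buy_ts, tz):
    """True if the buy's calendar date in tz is within 30 days of the sale's."""
    sale_date = sale_ts.astimezone(tz).date()
    buy_date = buy_ts.astimezone(tz).date()
    return abs((buy_date - sale_date).days) <= 30


# Loss sale in regular hours, replacement buy in the evening 30 days later.
sale = datetime(2026, 1, 5, 15, 0, tzinfo=EASTERN_STANDARD)
buy = datetime(2026, 2, 4, 19, 30, tzinfo=EASTERN_STANDARD)

local_result = in_wash_window(sale, buy, EASTERN_STANDARD)  # dates: Jan 5, Feb 4
utc_result = in_wash_window(sale, buy, timezone.utc)        # dates: Jan 5, Feb 5
```

The same pair of fills is a wash sale when dates are taken in local time and not a wash sale when dates are taken in UTC, and neither evaluation produces an error.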

## Claude Opus unique findings (not in either other model):

- **Recovery after crash + strategy switch race**: If user changes lot selection
  strategy between a crash and fill reprocessing, the recovered fill is processed
  under the wrong strategy. Immutable LotClosed from the recovery is now permanently
  wrong. The spec doesn't require strategy to be stored with the fill at submission
  time.
- **Position rebuild racing with active processing creates double-count**: The spec's
  recovery procedure ("re-derive from LotClosed events") can race with new fill
  processing that's also incrementing Position.realized_pnl, causing a new LotClosed
  event to be counted both in the rebuild SUM and in the incremental update.
- **Self-correcting analytical behavior**: Opus began exploring Position.average_cost
  drift with a running-weighted-average scenario, worked through the math mid-response,
  proved to itself that the running average IS correct for simple cases, then
  identified the REAL drift path (corporate action not triggering recalculation).
  This inline verification continues the pattern from Findings #15 and #20.
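
The rebuild double-count can be made concrete with a deterministic interleaving. This is a sketch under assumptions: it models the fill processor as two separate writes (append the LotClosed event, then increment the Position) and lets the rebuild run between them. The variable names and amounts are hypothetical.

```python
from decimal import Decimal

# realized_pnl of each LotClosed event already in the log
lot_closed_log = [Decimal("100")]
# incrementally maintained Position.realized_pnl, currently consistent
position_realized_pnl = Decimal("100")

# Step 1: the fill processor appends a new LotClosed event (write 1 of 2).
new_event_pnl = Decimal("50")
lot_closed_log.append(new_event_pnl)

# Step 2: the rebuild runs between the two writes and re-derives the Position
# from the log, which already contains the new event.
position_realized_pnl = sum(lot_closed_log, Decimal("0"))

# Step 3: the fill processor finishes with its incremental update (write 2 of 2),
# counting the same event a second time.
position_realized_pnl += new_event_pnl

expected = sum(lot_closed_log, Decimal("0"))
drift = position_realized_pnl - expected
```

Under this interleaving the Position ends at 200 while the event log sums to 150. Serializing the rebuild with fill processing, or making the two writes atomic, would be one way to close the gap, though the document does not say which the spec intends.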

## Claude Sonnet unique findings (not in either other model):

- **Rounding error accumulation in split-adjusted cost basis**: A 3-for-1 split of
  100 shares at a $100.00 per-share basis gives $33.33 per share × 300 = $9,999
  instead of $10,000. The per-split error is small but accumulates without bound
  across multiple splits and high lot counts. No invariant checks that
  `SUM(lot.quantity × lot.cost_basis) = original_total_cost`.
- **Reconciliation re-derives from the corrupted source (meta-finding)**: The spec's
  reconciliation job checks Position.realized_pnl against SUM(LotClosed.realized_pnl).
  For 7 of 8 findings, LotClosed is itself the corrupted source — reconciliation
  confirms the wrong number with high confidence, providing *false assurance*.
  This is a system-level insight about the reconciliation architecture, not just
  another data path.
- **LotClosedAdjustment event type recommendation**: Proposed a specific architectural
  fix (adjustment events that reference original LotClosed IDs and record deltas)
  that preserves immutability semantics while enabling corrections. Neither GPT-5
  nor Opus proposed concrete mechanisms at this level.
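
The rounding example in the first bullet can be reproduced with Python's `decimal` module. This is a sketch: the rounding policy (per-share basis rounded half-up to the cent) is an assumption, since the spec's actual policy is not quoted here, and `split_lot` is a hypothetical helper.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def split_lot(quantity, cost_basis, ratio):
    """Apply an N-for-1 split, rounding the per-share basis to the cent."""
    new_quantity = quantity * ratio
    new_basis = (cost_basis / ratio).quantize(CENT, rounding=ROUND_HALF_UP)
    return new_quantity, new_basis


original_quantity = 100
original_basis = Decimal("100.00")
original_total = original_quantity * original_basis  # 10000.00

quantity, basis = split_lot(original_quantity, original_basis, 3)
adjusted_total = quantity * basis                     # 300 * 33.33
drift = original_total - adjusted_total

# The invariant the finding says is missing: total cost is preserved.
invariant_holds = adjusted_total == original_total
```

One split on one lot loses $1.00 of basis. Storing the lot's total cost and deriving the per-share figure on read would be one way to keep the invariant, offered here only as an illustration.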

## Quality assessment:

- **GPT-5** produced the most findings (12) with the highest domain specificity.
  Its unique findings demonstrate deep knowledge of US tax law (IRS tack-back rules,
  61-day window semantics, trade-date vs settlement-date distinctions, spin-off basis
  reallocation mechanics). Several findings are variations on the same theme
  (corporate action timing) but each identifies a genuinely distinct failure mode
  with different financial characteristics. The 10,688 reasoning tokens appear to
  have been invested in domain-specific reasoning about tax rules and timing edge
  cases. Every finding includes precise financial impact quantification.

- **Claude Opus** produced 8 findings with characteristic design-tension focus.
  The strategy-switch-during-recovery and position-rebuild race findings show
  reasoning about system dynamics that neither other model surfaced: they concern
  HOW the system operates at runtime, not just what it computes. The
  self-correcting behavior (proving the running average is actually correct before
  finding the real drift path) demonstrates intellectual honesty. However, Opus
  produced fewer unique findings than in previous experiments, and its overlap with
  the common ground was substantial.

- **Claude Sonnet** was the surprise performer in this experiment. While its raw
  finding count (8) matches Opus, three of its contributions represent qualitatively
  different analytical modes:
  1. The **rounding accumulation** finding shows quantitative reasoning about numeric
     precision that neither GPT-5 nor Opus considered — they focused on logical/semantic
     errors while Sonnet identified a computational precision problem.
  2. The **"reconciliation confirms corruption"** meta-finding is architecturally the
     most important insight across all three models. It identifies that the system's
     own safety mechanism (reconciliation from LotClosed) becomes a corruption
     *amplifier* when LotClosed itself is wrong. This is a system-of-systems insight.
  3. The concrete **LotClosedAdjustment** architectural recommendation shows Sonnet
     reasoning about solutions, not just problems.
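
The "reconciliation confirms corruption" point can be shown with a short sketch. It assumes the reconciliation job is a plain equality check between `Position.realized_pnl` and the sum of `LotClosed.realized_pnl`, and it uses a late 2-for-1 split as the corruption source. The `reconcile` helper and all quantities and prices are hypothetical.

```python
from decimal import Decimal


def reconcile(position_realized_pnl, lot_closed_pnls):
    """Pass if the Position agrees with the sum of LotClosed events."""
    return position_realized_pnl == sum(lot_closed_pnls, Decimal("0"))


# 100 shares bought at $100; a 2-for-1 split makes the correct basis $50.
stale_basis = Decimal("100")   # pre-adjustment basis, split not yet applied
correct_basis = Decimal("50")  # basis after the split adjustment
sell_quantity = 100
sell_price = Decimal("55")

# The sell is processed before the late split arrives.
recorded_pnl = (sell_price - stale_basis) * sell_quantity   # written to LotClosed
correct_pnl = (sell_price - correct_basis) * sell_quantity  # never recorded

lot_closed_pnls = [recorded_pnl]
position_realized_pnl = recorded_pnl  # incremental update from the same event

passes = reconcile(position_realized_pnl, lot_closed_pnls)
error = correct_pnl - position_realized_pnl
```

Reconciliation passes while the recorded result is a $4,500 loss instead of a $500 gain, because both sides of the check derive from the same wrong event. An independent source, such as adjustment events of the kind Sonnet proposed, would be needed for the check to detect this.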

## Key insight — Financial domain knowledge as differentiator:

This is the first experiment where **domain-specific knowledge** (US tax law,
IRS rules, financial accounting conventions) is a primary differentiator between
models. Previous experiments tested general architectural reasoning skills that
don't require specialized knowledge. Here:

- GPT-5's unique findings are primarily driven by **tax law knowledge** (tack-back
  rules, 61-day window semantics, spin-off basis mechanics) that the other models
  either don't know or don't surface.
- Opus's unique findings are primarily driven by **concurrent systems reasoning**
  (crash recovery races, strategy change timing) — its traditional strength.
- Sonnet's unique findings are primarily driven by **structural/meta-analytical
  reasoning** (what does it mean that the reconciliation source is corrupted? how
  do numeric precision errors accumulate?).

The models aren't just finding "different things" — they're reasoning from
*different knowledge domains* applied to the same document:
- GPT-5: tax law + financial semantics
- Opus: distributed systems + runtime dynamics
- Sonnet: architecture patterns + numeric computing + meta-analysis

## Comparison to Finding #22 (silent correctness failures):

Finding #22 tested the "silent correctness failures" lens on `risk-controls.md`
(a very different document — runtime behavioral rather than data/accounting).
That experiment found GPT-5 dominant with Sonnet producing "surface-level" results.
Here, on a financial accounting document where the failures are about *data
correctness over time*, Sonnet performs much better. This suggests Sonnet's
analytical capabilities are strongest when the document describes data structures
and their transformations (where it can reason about invariants and meta-properties)
rather than runtime behavior (where it struggles with temporal/concurrent reasoning
— consistent with Finding #13).

## Practical implication:

For **financial accounting architecture review** specifically:
- GPT-5 is essential for tax-rule compliance gaps (it appears to have genuine
  knowledge of IRS wash sale rules, holding period tack-back, corporate action
  basis mechanics)
- Sonnet is valuable for structural/meta-analysis of the reconciliation architecture
  (it found that the safety net itself is compromised — the highest-leverage finding)
- Opus adds value for runtime/crash-recovery scenarios but provides less unique insight
  on the data-correctness dimension

The three-model approach continues to justify itself: GPT-5 contributes 6 unique
findings (4 of them tax-law gaps), Opus contributes 2 unique runtime gaps, and Sonnet
contributes 2 unique structural insights (including the most architecturally
significant one) plus a concrete fix proposal. None alone would have produced the
complete picture.

## Token efficiency:

| Model | Findings | Output tokens/finding | Unique findings | Output tokens/unique finding |
|---|---|---|---|---|
| GPT-5 | 12 | 1,188 | 6 | 2,375 |
| Opus | 8 | 784 | 3 | 2,090 |
| Sonnet | 8 | 731 | 3 | 1,948 |
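
The per-finding figures can be reproduced from the output-token counts in the first table. The sketch below is only a check of the arithmetic. The `runs` dictionary and helper names are hypothetical, and half-up rounding is an assumption chosen because it matches the 731 reported for Sonnet.

```python
runs = {
    "GPT-5": {"output_tokens": 14_251, "findings": 12, "unique": 6},
    "Opus": {"output_tokens": 6_270, "findings": 8, "unique": 3},
    "Sonnet": {"output_tokens": 5_844, "findings": 8, "unique": 3},
}


def tokens_per(model, key):
    """Unrounded output tokens per finding (key is 'findings' or 'unique')."""
    run = runs[model]
    return run["output_tokens"] / run[key]


def round_half_up(value):
    # Python's built-in round() rounds 730.5 to 730 (half-to-even),
    # so round half up explicitly.
    return int(value + 0.5)


per_finding = {m: round_half_up(tokens_per(m, "findings")) for m in runs}
per_unique = {m: round_half_up(tokens_per(m, "unique")) for m in runs}

# GPT-5 relative to Sonnet, on output tokens only.
ratio_per_finding = tokens_per("GPT-5", "findings") / tokens_per("Sonnet", "findings")
ratio_per_unique = tokens_per("GPT-5", "unique") / tokens_per("Sonnet", "unique")
```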

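Sonnet uses the fewest output tokens per finding and per unique finding (including
its highest-leverage meta-finding). GPT-5 produced the most unique findings, but at
roughly 1.6× Sonnet's output tokens per finding (1,188 vs 731) and 1.2× per unique
finding (2,375 vs 1,948). These ratios cover output tokens only, since reasoning
tokens were not reported for the Claude models. For financial document review where
every unique finding represents potential regulatory risk, all three are justified.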