model-research/findings/2026-05-07-40-silent-data-corruption-paths-financial-accounting.md
claw bb0c0d564b Finding #40: Silent data corruption paths in financial accounting
New analytical lens applied to lot-accounting.md (181 lines).
Tests how models identify sequences of individually correct
operations that produce silently wrong financial results.

Results:
- GPT-5: 12 findings (137s, 10688 reasoning tokens) - tax law domain knowledge
- Opus: 8 findings (121s) - concurrent systems / crash recovery focus
- Sonnet: 8 findings (111s) - structural meta-analysis, highest-leverage finding

Key insight: First experiment where domain-specific knowledge (tax law)
is the primary differentiator. Models reason from different knowledge
domains: GPT-5=tax law, Opus=distributed systems, Sonnet=architecture patterns.

Sonnet produced the most architecturally significant finding: that the
system's reconciliation mechanism confirms corruption rather than detecting
it (because it re-derives from LotClosed which is itself the corrupted source).
2026-05-07 11:09:58 -07:00

# Finding #40: Silent Data Corruption Paths in Financial Accounting — GPT-5 finds most paths with highest domain specificity; Sonnet produces strongest structural insight
**Date:** 2026-05-07
**Task:** Identify silent data corruption paths in gargoyle's `lot-accounting.md`
(181 lines) — sequences of individually correct operations that produce financially
wrong results (incorrect P&L, wrong tax terms, misattributed gains) without
triggering any error or invariant violation.
**How we used them:** Same document (full text) + same focused analytical question
to all 3 models via HAI proxy. Highly structured prompt specifying 5 categories
(upstream corruption propagation, ordering-dependent correctness, temporal boundary
errors, aggregate drift without detection, cross-strategy contamination). Required
specific output format per finding with severity ratings and financial impact
quantification. No tools, no project context beyond the document itself.
| Model | Time | Output tokens | Reasoning tokens | Findings |
|---|---|---|---|---|
| GPT-5 | 137s | 14,251 | 10,688 | 12 |
| Claude Opus 4.6 | 121s | 6,270 | (internal) | 8 |
| Claude Sonnet 4.6 | 111s | 5,844 | (internal) | 8 |
## What they found — common ground (all 3 identified):
- **Wash sale + LotClosed immutability conflict**: Disallowed loss permanently
recorded in immutable LotClosed because wash sale detection fires after the
closure event is written. No mechanism to annotate, supersede, or disallow
the recorded loss. All three models identified this as the most fundamental
design tension.
- **Corporate action timing vs fill ordering**: Late-arriving corporate actions
(splits, spin-offs) processed after related sells produce LotClosed events
with pre-adjustment cost bases that can never be corrected.
- **Holding period / tax term boundary errors**: Ambiguity in timestamp semantics
(exchange time vs processing time), IRS day-after rule, timezone boundaries
causing short/long-term misclassification at the 365-day boundary.
- **FIFO ordering ambiguity**: When opened_at timestamps are identical or
reordered (network jitter, batch processing), FIFO produces different outcomes
without any error signal.
- **Cross-strategy lot consumption**: FIFO/HIFO operates at position level
(user+instrument), not strategy level — sells from Strategy A can consume lots
opened by Strategy B, corrupting per-strategy P&L attribution.
- **Position.average_cost drift after corporate actions**: The spec only describes
Position updates during buy/sell flows, not during the Adjusting path — corporate
actions that modify lot cost bases may not trigger Position recalculation.
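The cross-strategy consumption path can be illustrated with a minimal sketch. It assumes a simplified lot model; the `Lot` fields, the `sell_fifo` helper, and all prices are hypothetical and not taken from `lot-accounting.md`.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class Lot:
    opened_by: str        # strategy that opened the lot (hypothetical field)
    quantity: int
    cost_basis: Decimal   # per-share basis
    opened_at: int        # simplified ordinal timestamp


def sell_fifo(lots, quantity, price, selling_strategy):
    """Close the oldest lots first across the whole position (user + instrument)."""
    closures = []
    remaining = quantity
    for lot in sorted(lots, key=lambda l: l.opened_at):
        if remaining == 0:
            break
        take = min(lot.quantity, remaining)
        if take == 0:
            continue
        lot.quantity -= take
        remaining -= take
        closures.append({
            "opened_by": lot.opened_by,
            "attributed_to": selling_strategy,
            "realized_pnl": (price - lot.cost_basis) * take,
        })
    return closures


lots = [
    Lot("A", 100, Decimal("50"), opened_at=1),
    Lot("B", 100, Decimal("90"), opened_at=2),
]
# Strategy B sells 100 shares at $100, but FIFO consumes Strategy A's older lot.
closures = sell_fifo(lots, 100, Decimal("100"), selling_strategy="B")
print(closures)
```

In this sketch Strategy B is credited with a $5,000 gain computed from Strategy A's $50 basis, while B's own $90 lot stays open. No step fails or violates an invariant; only the per-strategy attribution is wrong.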
## GPT-5 unique findings (not in either Claude model):
- **Late spin-off basis reallocation after sale**: 20% basis reallocation to SpinCo
arrives after Parent shares were sold → LotClosed records gain computed from
original (too-high) basis, understating taxable income by the reallocation amount.
Distinct from split timing because the *kind* of error is different (basis
reallocation vs quantity/basis halving).
- **Wash sale holding period tack-back not implemented**: Spec's LotClosed schema
has no mechanism to record tack-back of holding period from original lot to
replacement lot. `opened_at` is always set from the opening fill timestamp, not
the original acquisition date. This can flip long-term/short-term classification.
- **HIFO flipped by CA timing**: Corporate action reducing one lot's basis changes
which lot HIFO selects, producing different P&L depending on whether the CA
arrives before or after the sell.
- **Position.closed_at stale after post-closure corporate action**: If a position is
fully closed (closed_at set) and then a corporate action credits new shares,
closed_at may not be cleared → Risk treats position as closed while it's live.
- **Wash sale cross-strategy tax deferral**: Strategy A's loss is deferred into
Strategy B's replacement lot basis, silently transferring tax liability between
strategies without any detection mechanism.
- **61-day wash window off-by-one from timezone**: Boundary cases where UTC vs local
time difference pushes a replacement buy just inside/outside the 61-day window.
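The HIFO ordering dependence can be reproduced in a few lines. This is a sketch under assumed numbers: the lot dictionaries, the 20% basis reduction on lot `A`, and the helper names are illustrative rather than part of the spec.

```python
from decimal import Decimal


def hifo_pick(lots):
    """Highest-in-first-out: select the lot with the highest per-share basis."""
    return max(lots, key=lambda lot: lot["cost_basis"])


def sell_hifo(lots, sell_price, quantity):
    lot = hifo_pick(lots)
    return lot["id"], (sell_price - lot["cost_basis"]) * quantity


def apply_basis_factor(lots, lot_id, factor):
    """Return a copy of the lots with one lot's basis scaled by a corporate action."""
    return [
        {**lot, "cost_basis": lot["cost_basis"] * factor} if lot["id"] == lot_id else dict(lot)
        for lot in lots
    ]


lots = [
    {"id": "A", "cost_basis": Decimal("120")},
    {"id": "B", "cost_basis": Decimal("110")},
]

# Sell 100 shares at $130 before the corporate action is processed.
sell_before_ca = sell_hifo(lots, Decimal("130"), 100)

# Same sell, but the corporate action (lot A basis x 0.8) arrived first.
adjusted = apply_basis_factor(lots, "A", Decimal("0.8"))
sell_after_ca = sell_hifo(adjusted, Decimal("130"), 100)

print(sell_before_ca, sell_after_ca)
```

The same sell closes lot `A` for a $1,000 gain in one ordering and lot `B` for a $2,000 gain in the other. Both results are valid under HIFO, so neither ordering raises an error.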
## Claude Opus unique findings (not in either other model):
- **Recovery after crash + strategy switch race**: If user changes lot selection
strategy between a crash and fill reprocessing, the recovered fill is processed
under the wrong strategy. Immutable LotClosed from the recovery is now permanently
wrong. The spec doesn't require strategy to be stored with the fill at submission
time.
- **Position rebuild racing with active processing creates double-count**: The spec's
recovery procedure ("re-derive from LotClosed events") can race with new fill
processing that's also incrementing Position.realized_pnl, causing a new LotClosed
event to be counted both in the rebuild SUM and in the incremental update.
- **Self-correcting analytical behavior**: Opus began exploring Position.average_cost
drift with a running-weighted-average scenario, worked through the math mid-response,
proved to itself that the running average IS correct for simple cases, then
identified the REAL drift path (corporate action not triggering recalculation).
This inline verification continues the pattern from Findings #15 and #20.
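The rebuild double-count can be shown with a deterministic replay of one bad interleaving. The sketch assumes the fill processor first appends the LotClosed event and then increments `Position.realized_pnl` as two separate steps; that two-step structure and all names are assumptions, not details confirmed by the spec.

```python
from decimal import Decimal


def rebuild_sum(lot_closed_events):
    """Recovery path: re-derive realized P&L from LotClosed events."""
    return sum((e["realized_pnl"] for e in lot_closed_events), Decimal("0"))


def simulate_interleaving():
    lot_closed = [{"id": 1, "realized_pnl": Decimal("100")}]
    position = {"realized_pnl": Decimal("100")}

    # Fill processor, step 1: write the new (immutable) LotClosed event.
    new_event = {"id": 2, "realized_pnl": Decimal("40")}
    lot_closed.append(new_event)

    # Rebuild runs between the two steps and already sees the new event.
    position["realized_pnl"] = rebuild_sum(lot_closed)

    # Fill processor, step 2: incremental update counts the same event again.
    position["realized_pnl"] += new_event["realized_pnl"]

    return position["realized_pnl"], rebuild_sum(lot_closed)


position_pnl, ledger_pnl = simulate_interleaving()
print(position_pnl, ledger_pnl)
```

Here the position ends at 180 while the event log sums to 140. Serializing rebuilds against fill processing, or making the increment idempotent per LotClosed ID, would be possible ways to close this window; neither is described in the source document.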
## Claude Sonnet unique findings (not in either other model):
- **Rounding error accumulation in split-adjusted cost basis**: 3-for-1 split on
100 shares at a $100 per-share basis → $33.33 per share × 300 = $9,999 instead of
$10,000. Per-split error is small but accumulates without bound across multiple
splits and high lot counts. No invariant checks that
`SUM(lot.quantity × lot.cost_basis) = original_total_cost`.
- **Reconciliation re-derives from the corrupted source (meta-finding)**: The spec's
reconciliation job checks Position.realized_pnl against SUM(LotClosed.realized_pnl).
For 7 of Sonnet's 8 findings, LotClosed is itself the corrupted source, so
reconciliation confirms the wrong number with high confidence, providing *false
assurance*. This is a system-level insight about the reconciliation architecture,
not just another data path.
- **LotClosedAdjustment event type recommendation**: Proposed a specific architectural
fix (adjustment events that reference original LotClosed IDs and record deltas)
that preserves immutability semantics while enabling corrections. Neither GPT-5
nor Opus proposed concrete mechanisms at this level.
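The rounding example from Sonnet's first finding can be checked directly. The sketch assumes per-share basis is stored rounded to cents with half-up rounding; the spec's actual precision and rounding mode are not stated here, and `split_lot` is a hypothetical helper.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def split_lot(quantity, cost_basis, ratio):
    """Apply an N-for-1 split, storing the per-share basis rounded to cents."""
    new_quantity = quantity * ratio
    new_basis = (cost_basis / ratio).quantize(CENT, rounding=ROUND_HALF_UP)
    return new_quantity, new_basis


quantity, basis = 100, Decimal("100.00")
total_before = quantity * basis

new_quantity, new_basis = split_lot(quantity, basis, 3)
total_after = new_quantity * new_basis
drift = total_before - total_after

print(new_quantity, new_basis, total_after, drift)
```

A total-cost invariant of the form `total_before == total_after` would flag this lot immediately. One common way to avoid the drift is to store total lot cost and derive per-share basis on read, though the source does not say which representation the system uses.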
## Quality assessment:
- **GPT-5** produced the most findings (12) with the highest domain specificity.
Its unique findings demonstrate deep knowledge of US tax law (IRS tack-back rules,
61-day window semantics, trade-date vs settlement-date distinctions, spin-off basis
reallocation mechanics). Several findings are variations on the same theme
(corporate action timing) but each identifies a genuinely distinct failure mode
with different financial characteristics. The 10,688 reasoning tokens appear to
have been invested in domain-specific reasoning about tax rules and timing edge
cases. Every finding includes precise financial impact quantification.
- **Claude Opus** produced 8 findings with characteristic design-tension focus.
The crash-recovery race condition and strategy-switch-during-reprocessing findings
show reasoning about system dynamics that neither other model surfaced — these
are about HOW the system operates at runtime, not just what it computes. The
self-correcting behavior (proving the running average is actually correct before
finding the real drift path) demonstrates intellectual honesty. However, Opus
produced fewer unique findings than in previous experiments, and its overlap with
the common ground was substantial.
- **Claude Sonnet** was the surprise performer in this experiment. While its raw
finding count (8) matches Opus, three of its contributions represent qualitatively
different analytical modes:
1. The **rounding accumulation** finding shows quantitative reasoning about numeric
precision that neither GPT-5 nor Opus considered — they focused on logical/semantic
errors while Sonnet identified a computational precision problem.
2. The **"reconciliation confirms corruption"** meta-finding is architecturally the
most important insight across all three models. It identifies that the system's
own safety mechanism (reconciliation from LotClosed) becomes a corruption
*amplifier* when LotClosed itself is wrong. This is a system-of-systems insight.
3. The concrete **LotClosedAdjustment** architectural recommendation shows Sonnet
reasoning about solutions, not just problems.
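Why reconciliation gives false assurance can be sketched with a wash-sale case. The sketch assumes Position.realized_pnl is updated incrementally from the same LotClosed event that reconciliation later sums; the event shape and the `reconcile` helper are hypothetical.

```python
from decimal import Decimal


def reconcile(position_realized_pnl, lot_closed_events):
    """Check Position.realized_pnl against SUM(LotClosed.realized_pnl)."""
    derived = sum((e["realized_pnl"] for e in lot_closed_events), Decimal("0"))
    return position_realized_pnl == derived


# A $500 loss is written to LotClosed before wash sale detection fires.
lot_closed = [{"id": 1, "realized_pnl": Decimal("-500")}]

# The position was incremented from that same event.
position_realized_pnl = Decimal("-500")

# Independent ground truth: the loss is disallowed, so recognized P&L is 0.
recognized_pnl = Decimal("0")

reconciliation_passes = reconcile(position_realized_pnl, lot_closed)
actually_correct = position_realized_pnl == recognized_pnl

print(reconciliation_passes, actually_correct)
```

Both sides of the check derive from the same corrupted record, so the check passes while the number is wrong. Detecting this class of error would need a source that is independent of LotClosed.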
## Key insight — Financial domain knowledge as differentiator:
This is the first experiment where **domain-specific knowledge** (US tax law,
IRS rules, financial accounting conventions) is a primary differentiator between
models. Previous experiments tested general architectural reasoning skills that
don't require specialized knowledge. Here:
- GPT-5's unique findings are primarily driven by **tax law knowledge** (tack-back
rules, 61-day window semantics, spin-off basis mechanics) that the other models
either don't know or don't surface.
- Opus's unique findings are primarily driven by **concurrent systems reasoning**
(crash recovery races, strategy change timing) — its traditional strength.
- Sonnet's unique findings are primarily driven by **structural/meta-analytical
reasoning** (what does it mean that the reconciliation source is corrupted? how
do numeric precision errors accumulate?).
The models aren't just finding "different things" — they're reasoning from
*different knowledge domains* applied to the same document:
- GPT-5: tax law + financial semantics
- Opus: distributed systems + runtime dynamics
- Sonnet: architecture patterns + numeric computing + meta-analysis
## Comparison to Finding #22 (silent correctness failures):
Finding #22 tested the "silent correctness failures" lens on `risk-controls.md`
(a very different document — runtime behavioral rather than data/accounting).
That experiment found GPT-5 dominant with Sonnet producing "surface-level" results.
Here, on a financial accounting document where the failures are about *data
correctness over time*, Sonnet performs much better. This suggests Sonnet's
analytical capabilities are strongest when the document describes data structures
and their transformations (where it can reason about invariants and meta-properties)
rather than runtime behavior (where it struggles with temporal/concurrent reasoning
— consistent with Finding #13).
## Practical implication:
For **financial accounting architecture review** specifically:
- GPT-5 is essential for tax-rule compliance gaps (it appears to have genuine
knowledge of IRS wash sale rules, holding period tack-back, corporate action
basis mechanics)
- Sonnet is valuable for structural/meta-analysis of the reconciliation architecture
(it found that the safety net itself is compromised — the highest-leverage finding)
- Opus adds value for runtime/crash-recovery scenarios but provides less unique insight
on the data-correctness dimension
The three-model approach continues to justify itself: GPT-5 finds 4 unique tax-law
gaps (among its 6 unique findings), Opus finds 2 unique runtime gaps, and Sonnet
finds 2 unique structural insights (including the most architecturally significant
one) plus a concrete fix proposal. None alone would have produced the complete
picture.
## Token efficiency:
| Model | Findings | Tokens/Finding | Unique findings | Tokens/Unique finding |
|---|---|---|---|---|
| GPT-5 | 12 | 1,188 | 6 | 2,375 |
| Opus | 8 | 784 | 3 | 2,090 |
| Sonnet | 8 | 731 | 3 | 1,948 |
Sonnet is most token-efficient per finding (including its highest-leverage
meta-finding). GPT-5's reasoning tokens (10,688) produced the most unique findings,
but at ~1.6× Sonnet's output tokens per finding (1,188 vs 731). For financial
document review where every unique finding represents potential regulatory risk,
all three are justified.
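The per-finding columns can be recomputed from the output-token counts in the first table. This sketch assumes the figures are output tokens divided by finding count, rounded half-up to the nearest integer; the `ratio_half_up` helper is illustrative.

```python
def ratio_half_up(numerator, denominator):
    """Integer division rounded half-up (avoids Python's banker's rounding)."""
    return (2 * numerator + denominator) // (2 * denominator)


# model: (output tokens, findings, unique findings), taken from the tables above
runs = {
    "GPT-5": (14251, 12, 6),
    "Opus": (6270, 8, 3),
    "Sonnet": (5844, 8, 3),
}

efficiency = {
    model: {
        "tokens_per_finding": ratio_half_up(tokens, findings),
        "tokens_per_unique": ratio_half_up(tokens, unique),
    }
    for model, (tokens, findings, unique) in runs.items()
}

for model, row in efficiency.items():
    print(model, row["tokens_per_finding"], row["tokens_per_unique"])
```

Half-up rounding matters for Sonnet, whose 5,844 / 8 = 730.5 rounds to 731 as shown in the table, whereas Python's built-in `round` would give 730.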