Finding #40: Silent data corruption paths in financial accounting

New analytical lens applied to lot-accounting.md (181 lines). Tests how models
identify sequences of individually correct operations that produce silently wrong
financial results.

Results:
- GPT-5: 12 findings (137s, 10688 reasoning tokens) - tax law domain knowledge
- Opus: 8 findings (121s) - concurrent systems / crash recovery focus
- Sonnet: 8 findings (111s) - structural meta-analysis, highest-leverage finding

Key insight: First experiment where domain-specific knowledge (tax law) is the
primary differentiator. Models reason from different knowledge domains: GPT-5=tax
law, Opus=distributed systems, Sonnet=architecture patterns.

Sonnet produced the most architecturally significant finding: that the system's
reconciliation mechanism confirms corruption rather than detecting it (because it
re-derives from LotClosed which is itself the corrupted source).
@@ -0,0 +1,195 @@
# Finding #40: Silent Data Corruption Paths in Financial Accounting - GPT-5 finds most paths with highest domain specificity; Sonnet produces strongest structural insight

**Date:** 2026-05-07

**Task:** Identify silent data corruption paths in gargoyle's `lot-accounting.md`
(181 lines): sequences of individually correct operations that produce financially
wrong results (incorrect P&L, wrong tax terms, misattributed gains) without
triggering any error or invariant violation.

**How we used them:** Same document (full text) + same focused analytical question
to all 3 models via HAI proxy. Highly structured prompt specifying 5 categories
(upstream corruption propagation, ordering-dependent correctness, temporal boundary
errors, aggregate drift without detection, cross-strategy contamination). Required
specific output format per finding with severity ratings and financial impact
quantification. No tools, no project context beyond the document itself.

| Model | Time | Output tokens | Reasoning tokens | Findings |
|---|---|---|---|---|
| GPT-5 | 137s | 14,251 | 10,688 | 12 |
| Claude Opus 4.6 | 121s | 6,270 | (internal) | 8 |
| Claude Sonnet 4.6 | 111s | 5,844 | (internal) | 8 |

## What they found - common ground (all 3 identified):

- **Wash sale + LotClosed immutability conflict**: Disallowed loss permanently
  recorded in immutable LotClosed because wash sale detection fires after the
  closure event is written. No mechanism to annotate, supersede, or disallow
  the recorded loss. All three models identified this as the most fundamental
  design tension.
- **Corporate action timing vs fill ordering**: Late-arriving corporate actions
  (splits, spin-offs) processed after related sells produce LotClosed events
  with pre-adjustment cost bases that can never be corrected.
- **Holding period / tax term boundary errors**: Ambiguity in timestamp semantics
  (exchange time vs processing time), IRS day-after rule, timezone boundaries
  causing short/long-term misclassification at the 365-day boundary.
- **FIFO ordering ambiguity**: When opened_at timestamps are identical or
  reordered (network jitter, batch processing), FIFO produces different outcomes
  without any error signal.
- **Cross-strategy lot consumption**: FIFO/HIFO operates at position level
  (user+instrument), not strategy level, so sells from Strategy A can consume lots
  opened by Strategy B, corrupting per-strategy P&L attribution.
- **Position.average_cost drift after corporate actions**: The spec only describes
  Position updates during buy/sell flows, not during the Adjusting path, so corporate
  actions that modify lot cost bases may not trigger Position recalculation.
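
The cross-strategy consumption path above can be illustrated with a minimal sketch. It assumes a position-level FIFO queue in which each lot remembers the strategy that opened it; the `Lot` class, `sell_fifo` function, and all prices are hypothetical and not taken from `lot-accounting.md`.

```python
from collections import deque
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class Lot:
    strategy: str   # strategy that opened the lot (hypothetical field)
    quantity: int
    cost: Decimal   # per-share cost basis


def sell_fifo(lots: deque, seller: str, quantity: int, price: Decimal) -> dict:
    """Consume lots oldest-first at position level.

    Returns realized P&L keyed by (selling strategy, strategy that opened the lot).
    """
    pnl = {}
    remaining = quantity
    while remaining > 0 and lots:
        lot = lots[0]
        take = min(remaining, lot.quantity)
        key = (seller, lot.strategy)
        pnl[key] = pnl.get(key, Decimal("0")) + (price - lot.cost) * take
        lot.quantity -= take
        remaining -= take
        if lot.quantity == 0:
            lots.popleft()
    return pnl


# Strategy B bought first at $10; Strategy A bought later at $20.
lots = deque([Lot("B", 100, Decimal("10")), Lot("A", 100, Decimal("20"))])

# Strategy A sells 100 shares at $21. FIFO consumes B's lot, not A's.
result = sell_fifo(lots, seller="A", quantity=100, price=Decimal("21"))

# What A would have realized against its own lot.
own_lot_pnl = (Decimal("21") - Decimal("20")) * 100

print(result)       # A is credited with P&L computed from B's basis
print(own_lot_pnl)  # P&L against A's own lot
```

Every step is individually valid FIFO, yet Strategy A is credited with $1,100 where its own lot would have produced $100, and no check fails.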

## GPT-5 unique findings (not in either Claude model):

- **Late spin-off basis reallocation after sale**: 20% basis reallocation to SpinCo
  arrives after Parent shares were sold → LotClosed records gain computed from
  original (too-high) basis, understating taxable income by the reallocation amount.
  Distinct from split timing because the *kind* of error is different (basis
  reallocation vs quantity/basis halving).
- **Wash sale holding period tack-back not implemented**: Spec's LotClosed schema
  has no mechanism to record tack-back of holding period from original lot to
  replacement lot. `opened_at` is always set from the opening fill timestamp, not
  the original acquisition date. This can flip long-term/short-term classification.
- **HIFO flipped by CA timing**: Corporate action reducing one lot's basis changes
  which lot HIFO selects, producing different P&L depending on whether the CA
  arrives before or after the sell.
- **Position.closed_at stale after post-closure corporate action**: If a position is
  fully closed (closed_at set) and then a corporate action credits new shares,
  closed_at may not be cleared → Risk treats position as closed while it's live.
- **Wash sale cross-strategy tax deferral**: Strategy A's loss is deferred into
  Strategy B's replacement lot basis, silently transferring tax liability between
  strategies without any detection mechanism.
- **61-day wash window off-by-one from timezone**: Boundary cases where UTC vs local
  time difference pushes a replacement buy just inside/outside the 61-day window.
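
The HIFO ordering dependence in the list above can be shown with a small worked example. This is a sketch under stated assumptions: two lots with invented bases, a hypothetical corporate action that cuts only `lot-1`'s basis by 20%, and a simplified `hifo_pick` helper that is not from the spec.

```python
from decimal import Decimal


def hifo_pick(lots: dict) -> str:
    """Return the id of the lot with the highest per-share basis."""
    return max(lots, key=lambda lot_id: lots[lot_id])


def realized_pnl(lots: dict, price: Decimal, quantity: int):
    """Close `quantity` shares against the HIFO-selected lot."""
    lot_id = hifo_pick(lots)
    return lot_id, (price - lots[lot_id]) * quantity


SELL_PRICE = Decimal("55")
QTY = 100

# Order 1: the sell is processed before the corporate action arrives.
lots = {"lot-1": Decimal("50"), "lot-2": Decimal("48")}
sell_first = realized_pnl(lots, SELL_PRICE, QTY)

# Order 2: the corporate action (assumed: lot-1 basis reduced by 20%)
# is applied first, then the same sell is processed.
lots = {"lot-1": Decimal("50"), "lot-2": Decimal("48")}
lots["lot-1"] *= Decimal("0.8")   # 50 -> 40
ca_first = realized_pnl(lots, SELL_PRICE, QTY)

print(sell_first)  # HIFO selects lot-1
print(ca_first)    # HIFO selects lot-2
```

With these numbers the same sell realizes $500 against `lot-1` in one ordering and $700 against `lot-2` in the other, and both results are internally consistent.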

## Claude Opus unique findings (not in either other model):

- **Recovery after crash + strategy switch race**: If user changes lot selection
  strategy between a crash and fill reprocessing, the recovered fill is processed
  under the wrong strategy. Immutable LotClosed from the recovery is now permanently
  wrong. The spec doesn't require strategy to be stored with the fill at submission
  time.
- **Position rebuild racing with active processing creates double-count**: The spec's
  recovery procedure ("re-derive from LotClosed events") can race with new fill
  processing that's also incrementing Position.realized_pnl, causing a new LotClosed
  event to be counted both in the rebuild SUM and in the incremental update.
- **Self-correcting analytical behavior**: Opus began exploring Position.average_cost
  drift with a running-weighted-average scenario, worked through the math mid-response,
  proved to itself that the running average IS correct for simple cases, then
  identified the REAL drift path (corporate action not triggering recalculation).
  This inline verification continues the pattern from Findings #15 and #20.
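
The rebuild double-count described above depends on one specific interleaving, which can be replayed deterministically. The sketch below assumes a plain list standing in for the LotClosed log and a dict standing in for Position; the function names and the three-step ordering are illustrative, not taken from the spec.

```python
from decimal import Decimal

ZERO = Decimal("0")


def rebuild(position: dict, lot_closed_log: list) -> None:
    """Recovery path: re-derive realized P&L from all LotClosed events."""
    position["realized_pnl"] = sum(lot_closed_log, ZERO)


def apply_increment(position: dict, pnl: Decimal) -> None:
    """Normal fill path: add the new closure's P&L to the running total."""
    position["realized_pnl"] += pnl


lot_closed_log = [Decimal("100"), Decimal("-40")]
position = {"realized_pnl": Decimal("60")}

new_pnl = Decimal("25")

# Step 1: the fill processor persists the new LotClosed event.
lot_closed_log.append(new_pnl)
# Step 2: the rebuild job runs and already sees the new event in its SUM.
rebuild(position, lot_closed_log)
# Step 3: the fill processor finishes its incremental update.
apply_increment(position, new_pnl)

expected = sum(lot_closed_log, ZERO)
double_counted = position["realized_pnl"] - expected

print(position["realized_pnl"], expected, double_counted)
```

The overstatement equals the in-flight event's P&L. One conventional mitigation, offered here as an assumption rather than something the spec describes, is to make the rebuild and the incremental update mutually exclusive per position.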

## Claude Sonnet unique findings (not in either other model):

- **Rounding error accumulation in split-adjusted cost basis**: 3-for-1 split of
  100 shares at a $100 per-share basis → $33.33 per share × 300 = $9,999 instead of
  $10,000. Per-split error is small but accumulates without bound across multiple
  splits and high lot counts.
  No invariant checks that `SUM(lot.quantity × lot.cost_basis) = original_total_cost`.
- **Reconciliation re-derives from the corrupted source (meta-finding)**: The spec's
  reconciliation job checks Position.realized_pnl against SUM(LotClosed.realized_pnl).
  For 7 of 8 findings, LotClosed is itself the corrupted source, so reconciliation
  confirms the wrong number with high confidence, providing *false assurance*.
  This is a system-level insight about the reconciliation architecture, not just
  another data path.
- **LotClosedAdjustment event type recommendation**: Proposed a specific architectural
  fix (adjustment events that reference original LotClosed IDs and record deltas)
  that preserves immutability semantics while enabling corrections. Neither GPT-5
  nor Opus proposed concrete mechanisms at this level.
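
The rounding arithmetic in the first item can be reproduced with Python's `decimal` module. This sketch assumes a lot of 100 shares at $100.00 per share and a per-share basis stored to the cent; `split_adjust` is a hypothetical helper, and the spec's actual rounding mode is not known.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def split_adjust(quantity: int, cost_per_share: Decimal, ratio: int):
    """Apply an N-for-1 split, storing the per-share basis rounded to cents."""
    new_quantity = quantity * ratio
    new_cost = (cost_per_share / ratio).quantize(CENT, rounding=ROUND_HALF_UP)
    return new_quantity, new_cost


quantity, cost = 100, Decimal("100.00")
original_total = quantity * cost               # 10000.00

quantity, cost = split_adjust(quantity, cost, 3)
adjusted_total = quantity * cost               # 300 x 33.33

# The invariant the finding says is never checked.
drift = original_total - adjusted_total

print(quantity, cost, adjusted_total, drift)
```

Storing the lot's total cost and deriving the per-share figure on demand would avoid this particular drift; that is a general design option, not something the finding attributes to the spec.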

## Quality assessment:

- **GPT-5** produced the most findings (12) with the highest domain specificity.
  Its unique findings demonstrate deep knowledge of US tax law (IRS tack-back rules,
  61-day window semantics, trade-date vs settlement-date distinctions, spin-off basis
  reallocation mechanics). Several findings are variations on the same theme
  (corporate action timing) but each identifies a genuinely distinct failure mode
  with different financial characteristics. The 10,688 reasoning tokens appear to
  have been invested in domain-specific reasoning about tax rules and timing edge
  cases. Every finding includes precise financial impact quantification.

- **Claude Opus** produced 8 findings with characteristic design-tension focus.
  The crash-recovery race condition and strategy-switch-during-reprocessing findings
  show reasoning about system dynamics that neither other model surfaced: these
  are about HOW the system operates at runtime, not just what it computes. The
  self-correcting behavior (proving the running average is actually correct before
  finding the real drift path) demonstrates intellectual honesty. However, Opus
  produced fewer unique findings than in previous experiments, and its overlap with
  the common ground was substantial.

- **Claude Sonnet** was the surprise performer in this experiment. While its raw
  finding count (8) matches Opus, three of its contributions represent qualitatively
  different analytical modes:
   1. The **rounding accumulation** finding shows quantitative reasoning about numeric
      precision that neither GPT-5 nor Opus considered: they focused on logical/semantic
      errors while Sonnet identified a computational precision problem.
   2. The **"reconciliation confirms corruption"** meta-finding is architecturally the
      most important insight across all three models. It identifies that the system's
      own safety mechanism (reconciliation from LotClosed) becomes a corruption
      *amplifier* when LotClosed itself is wrong. This is a system-of-systems insight.
   3. The concrete **LotClosedAdjustment** architectural recommendation shows Sonnet
      reasoning about solutions, not just problems.
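
To make the "reconciliation confirms corruption" point and the LotClosedAdjustment idea concrete, here is a minimal sketch. The field names, the `reconcile` and `corrected_pnl` helpers, and the late 2-for-1 split scenario are all assumptions for illustration; the source only says that adjustment events reference original LotClosed IDs and record deltas.

```python
from dataclasses import dataclass
from decimal import Decimal

ZERO = Decimal("0")


@dataclass(frozen=True)
class LotClosed:
    event_id: str
    realized_pnl: Decimal


@dataclass(frozen=True)
class LotClosedAdjustment:       # hypothetical shape of the proposed event
    original_event_id: str
    pnl_delta: Decimal


def reconcile(position_pnl: Decimal, closed: list) -> bool:
    """Reconciliation as described: Position vs SUM(LotClosed.realized_pnl)."""
    return position_pnl == sum((e.realized_pnl for e in closed), ZERO)


def corrected_pnl(closed: list, adjustments: list) -> Decimal:
    """Original events stay immutable; corrections are layered on as deltas."""
    known_ids = {e.event_id for e in closed}
    total = sum((e.realized_pnl for e in closed), ZERO)
    for adj in adjustments:
        if adj.original_event_id not in known_ids:
            raise ValueError(f"unknown LotClosed id: {adj.original_event_id}")
        total += adj.pnl_delta
    return total


# 100 post-split shares sold at $55. The 2-for-1 split arrived late, so the
# closure used the pre-split $100 basis instead of the adjusted $50 basis.
recorded = (Decimal("55") - Decimal("100")) * 100    # -4500
true_pnl = (Decimal("55") - Decimal("50")) * 100     #   500

closed = [LotClosed("lc-1", recorded)]
position_pnl = recorded          # incremental update used the same wrong value

passes = reconcile(position_pnl, closed)   # agrees, although both are wrong

adjustments = [LotClosedAdjustment("lc-1", true_pnl - recorded)]
fixed = corrected_pnl(closed, adjustments)

print(passes, fixed)
```

Because Position and LotClosed share the same wrong input, the check passes; only an independent source (here, the adjustment carrying the corrected basis) changes the answer.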

## Key insight - Financial domain knowledge as differentiator:

This is the first experiment where **domain-specific knowledge** (US tax law,
IRS rules, financial accounting conventions) is a primary differentiator between
models. Previous experiments tested general architectural reasoning skills that
don't require specialized knowledge. Here:

- GPT-5's unique findings are primarily driven by **tax law knowledge** (tack-back
  rules, 61-day window semantics, spin-off basis mechanics) that the other models
  either don't know or don't surface.
- Opus's unique findings are primarily driven by **concurrent systems reasoning**
  (crash recovery races, strategy change timing), its traditional strength.
- Sonnet's unique findings are primarily driven by **structural/meta-analytical
  reasoning** (what does it mean that the reconciliation source is corrupted? how
  do numeric precision errors accumulate?).

The models aren't just finding "different things"; they're reasoning from
*different knowledge domains* applied to the same document:
- GPT-5: tax law + financial semantics
- Opus: distributed systems + runtime dynamics
- Sonnet: architecture patterns + numeric computing + meta-analysis

## Comparison to Finding #22 (silent correctness failures):

Finding #22 tested the "silent correctness failures" lens on `risk-controls.md`
(a very different document: runtime behavioral rather than data/accounting).
That experiment found GPT-5 dominant with Sonnet producing "surface-level" results.
Here, on a financial accounting document where the failures are about *data
correctness over time*, Sonnet performs much better. This suggests Sonnet's
analytical capabilities are strongest when the document describes data structures
and their transformations (where it can reason about invariants and meta-properties)
rather than runtime behavior (where it struggles with temporal/concurrent reasoning,
consistent with Finding #13).

## Practical implication:

For **financial accounting architecture review** specifically:
- GPT-5 is essential for tax-rule compliance gaps (it appears to have genuine
  knowledge of IRS wash sale rules, holding period tack-back, corporate action
  basis mechanics)
- Sonnet is valuable for structural/meta-analysis of the reconciliation architecture
  (it found that the safety net itself is compromised, the highest-leverage finding)
- Opus adds value for runtime/crash-recovery scenarios but provides less unique insight
  on the data-correctness dimension

The three-model approach continues to justify itself: GPT-5 finds 4 unique tax-law
gaps, Opus finds 2 unique runtime gaps, and Sonnet finds 2 unique structural insights
(including the most architecturally significant one). None alone would have produced
the complete picture.

## Token efficiency:

| Model | Findings | Tokens/Finding | Unique findings | Tokens/Unique finding |
|---|---|---|---|---|
| GPT-5 | 12 | 1,188 | 6 | 2,375 |
| Opus | 8 | 784 | 3 | 2,090 |
| Sonnet | 8 | 731 | 3 | 1,948 |

Sonnet is most token-efficient per finding (including its highest-leverage
meta-finding). GPT-5's reasoning tokens (10,688) produced the most unique findings
but at ~4.5× the cost per finding vs Sonnet. For financial document review where
every unique finding represents potential regulatory risk, all three are justified.
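
The per-finding columns in the table above follow from the output-token counts reported earlier in this document. A short sketch of that arithmetic, assuming the columns are output tokens divided by finding counts and rounded half-up to whole tokens (the helper name is hypothetical):

```python
# (output tokens, findings, unique findings) as reported in this document
runs = {
    "GPT-5": (14_251, 12, 6),
    "Opus": (6_270, 8, 3),
    "Sonnet": (5_844, 8, 3),
}


def div_half_up(numerator: int, denominator: int) -> int:
    """Integer division rounded half-up, avoiding float rounding surprises."""
    return (2 * numerator + denominator) // (2 * denominator)


table = {
    model: (div_half_up(tokens, findings), div_half_up(tokens, unique))
    for model, (tokens, findings, unique) in runs.items()
}

for model, (per_finding, per_unique) in table.items():
    print(f"{model}: {per_finding:,} tokens/finding, {per_unique:,} tokens/unique finding")

# Output-token ratio per finding, GPT-5 relative to Sonnet.
ratio = table["GPT-5"][0] / table["Sonnet"][0]
print(round(ratio, 2))
```

On output tokens alone the GPT-5 to Sonnet per-finding ratio comes out near 1.6, so the ~4.5× cost figure presumably rests on an input the table does not show, such as per-token pricing.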