claw bb0c0d564b Finding #40 : Silent data corruption paths in financial accounting

New analytical lens applied to lot-accounting.md (181 lines).
Tests how models identify sequences of individually correct
operations that produce silently wrong financial results.

Results:
- GPT-5: 12 findings (137s, 10688 reasoning tokens) - tax law domain knowledge
- Opus: 8 findings (121s) - concurrent systems / crash recovery focus
- Sonnet: 8 findings (111s) - structural meta-analysis, highest-leverage finding

Key insight: First experiment where domain-specific knowledge (tax law)
is the primary differentiator. Models reason from different knowledge
domains: GPT-5=tax law, Opus=distributed systems, Sonnet=architecture patterns.

Sonnet produced the most architecturally significant finding: that the
system's reconciliation mechanism confirms corruption rather than detecting
it (because it re-derives from LotClosed which is itself the corrupted source).

2026-05-07 11:09:58 -07:00


Finding #40: Silent Data Corruption Paths in Financial Accounting — GPT-5 finds most paths with highest domain specificity; Sonnet produces strongest structural insight

Date: 2026-05-07

Task: Identify silent data corruption paths in gargoyle's lot-accounting.md (181 lines): sequences of individually correct operations that produce financially wrong results (incorrect P&L, wrong tax terms, misattributed gains) without triggering any error or invariant violation.

How we used them: Same document (full text) + same focused analytical question to all 3 models via HAI proxy. Highly structured prompt specifying 5 categories (upstream corruption propagation, ordering-dependent correctness, temporal boundary errors, aggregate drift without detection, cross-strategy contamination). Required specific output format per finding with severity ratings and financial impact quantification. No tools, no project context beyond the document itself.

Model	Time	Output tokens	Reasoning tokens	Findings
GPT-5	137s	14,251	10,688	12
Claude Opus 4.6	121s	6,270	(internal)	8
Claude Sonnet 4.6	111s	5,844	(internal)	8

What they found — common ground (all 3 identified):

Wash sale + LotClosed immutability conflict: Disallowed loss permanently recorded in immutable LotClosed because wash sale detection fires after the closure event is written. No mechanism to annotate, supersede, or disallow the recorded loss. All three models identified this as the most fundamental design tension.
Corporate action timing vs fill ordering: Late-arriving corporate actions (splits, spin-offs) processed after related sells produce LotClosed events with pre-adjustment cost bases that can never be corrected.
Holding period / tax term boundary errors: Ambiguity in timestamp semantics (exchange time vs processing time), IRS day-after rule, timezone boundaries causing short/long-term misclassification at the 365-day boundary.
FIFO ordering ambiguity: When opened_at timestamps are identical or reordered (network jitter, batch processing), FIFO produces different outcomes without any error signal.
Cross-strategy lot consumption: FIFO/HIFO operates at position level (user+instrument), not strategy level — sells from Strategy A can consume lots opened by Strategy B, corrupting per-strategy P&L attribution.
Position.average_cost drift after corporate actions: The spec only describes Position updates during buy/sell flows, not during the Adjusting path — corporate actions that modify lot cost bases may not trigger Position recalculation.
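The cross-strategy lot consumption path can be sketched in a few lines. This is a minimal illustration, not the spec's implementation: the `Lot` fields, the `sell_fifo` helper, and the prices are all hypothetical, chosen only to show how position-level FIFO attributes another strategy's lot to the seller.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class Lot:
    strategy: str        # strategy that opened the lot (hypothetical field)
    quantity: int
    cost_basis: Decimal  # per-share basis
    opened_at: int       # ordering key; smaller means older


def sell_fifo(lots, quantity, price, selling_strategy):
    """Consume lots oldest-first across the whole position, ignoring strategy."""
    closed = []
    for lot in sorted(lots, key=lambda l: l.opened_at):
        if quantity == 0:
            break
        take = min(lot.quantity, quantity)
        if take == 0:
            continue
        lot.quantity -= take
        quantity -= take
        closed.append({
            "opened_by": lot.strategy,
            "closed_by": selling_strategy,
            "quantity": take,
            "realized_pnl": (price - lot.cost_basis) * take,
        })
    return closed


lots = [
    Lot("B", 100, Decimal("50"), opened_at=1),  # older lot, opened by Strategy B
    Lot("A", 100, Decimal("90"), opened_at=2),  # newer lot, opened by Strategy A
]

# Strategy A sells 100 shares at 100. FIFO picks Strategy B's lot.
closed = sell_fifo(lots, 100, Decimal("100"), selling_strategy="A")
pnl_recorded_for_a = closed[0]["realized_pnl"]            # (100 - 50) * 100
pnl_from_a_own_lot = (Decimal("100") - Decimal("90")) * 100
```

Every step here is individually valid FIFO, yet Strategy A is credited with 5,000 of P&L where its own lot would have produced 1,000, and no error is raised.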

GPT-5 unique findings (not in either Claude model):

Late spin-off basis reallocation after sale: 20% basis reallocation to SpinCo arrives after Parent shares were sold → LotClosed records gain computed from original (too-high) basis, understating taxable income by the reallocation amount. Distinct from split timing because the kind of error is different (basis reallocation vs quantity/basis halving).
Wash sale holding period tack-back not implemented: Spec's LotClosed schema has no mechanism to record tack-back of holding period from original lot to replacement lot. opened_at is always set from the opening fill timestamp, not the original acquisition date. This can flip long-term/short-term classification.
HIFO flipped by CA timing: Corporate action reducing one lot's basis changes which lot HIFO selects, producing different P&L depending on whether the CA arrives before or after the sell.
Position.closed_at stale after post-closure corporate action: If a position is fully closed (closed_at set) and then a corporate action credits new shares, closed_at may not be cleared → Risk treats position as closed while it's live.
Wash sale cross-strategy tax deferral: Strategy A's loss is deferred into Strategy B's replacement lot basis, silently transferring tax liability between strategies without any detection mechanism.
61-day wash window off-by-one from timezone: Boundary cases where UTC vs local time difference pushes a replacement buy just inside/outside the 61-day window.
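The timezone off-by-one in the last item can be reproduced with plain date arithmetic. The sketch below is a simplification under stated assumptions: `in_wash_window` is a hypothetical helper that models the 61-day window as 30 calendar days on either side of the sale date, the local zone is a fixed UTC-5 offset, and the timestamps are invented. It does not decide which calendar is the legally correct one; it only shows that the answer depends on the choice.

```python
from datetime import datetime, timedelta, timezone

LOCAL = timezone(timedelta(hours=-5))  # assumption: fixed UTC-5 local zone


def in_wash_window(sale, buy, tz):
    """True if the buy date is within 30 calendar days of the sale date in tz."""
    sale_date = sale.astimezone(tz).date()
    buy_date = buy.astimezone(tz).date()
    return abs((buy_date - sale_date).days) <= 30


# Loss sale late in the local evening: still March 1 locally, already March 2 in UTC.
sale = datetime(2026, 3, 1, 23, 30, tzinfo=LOCAL)
# Replacement buy on April 1 in both calendars.
buy = datetime(2026, 4, 1, 10, 0, tzinfo=LOCAL)

local_verdict = in_wash_window(sale, buy, LOCAL)       # 31 days apart
utc_verdict = in_wash_window(sale, buy, timezone.utc)  # 30 days apart
```

The same two fills are outside the window by local dates and inside it by UTC dates, so a system that stores only UTC timestamps and derives dates from them can silently flip the wash sale determination.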

Claude Opus unique findings (not in either other model):

Recovery after crash + strategy switch race: If user changes lot selection strategy between a crash and fill reprocessing, the recovered fill is processed under the wrong strategy. Immutable LotClosed from the recovery is now permanently wrong. The spec doesn't require strategy to be stored with the fill at submission time.
Position rebuild racing with active processing creates double-count: The spec's recovery procedure ("re-derive from LotClosed events") can race with new fill processing that's also incrementing Position.realized_pnl, causing a new LotClosed event to be counted both in the rebuild SUM and in the incremental update.
Self-correcting analytical behavior: Opus began exploring Position.average_cost drift with a running-weighted-average scenario, worked through the math mid-response, proved to itself that the running average IS correct for simple cases, then identified the REAL drift path (corporate action not triggering recalculation). This inline verification continues the pattern from Findings #15 and #20.
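The rebuild double-count can be shown deterministically by fixing the interleaving by hand. This sketch assumes fill processing is two separate steps (append a LotClosed event, then increment Position.realized_pnl); that ordering and the `simulate` helper are illustrative assumptions, not taken from the spec.

```python
from decimal import Decimal


def simulate(rebuild_between_steps):
    """Return (Position.realized_pnl, SUM of LotClosed.realized_pnl) after one new fill."""
    log = [Decimal("100"), Decimal("250")]          # existing LotClosed P&L values
    position = {"realized_pnl": sum(log, Decimal("0"))}
    new_pnl = Decimal("40")

    log.append(new_pnl)                             # step 1: write the LotClosed event
    if rebuild_between_steps:
        # The recovery job re-derives from LotClosed and already sees the new event.
        position["realized_pnl"] = sum(log, Decimal("0"))
    position["realized_pnl"] += new_pnl             # step 2: incremental update
    return position["realized_pnl"], sum(log, Decimal("0"))


clean = simulate(rebuild_between_steps=False)
raced = simulate(rebuild_between_steps=True)
```

Without the rebuild, both values are 390. With the rebuild landing between the two steps, the position reports 430 against an event sum of 390: the new event was counted once in the SUM and once in the increment.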

Claude Sonnet unique findings (not in either other model):

Rounding error accumulation in split-adjusted cost basis: A 3-for-1 split of 100 shares with a $100 per-share basis gives $33.33 per share × 300 shares = $9,999 instead of $10,000. Per-split error is small but accumulates without bound across multiple splits and high lot counts. No invariant checks that SUM(lot.quantity × lot.cost_basis) = original_total_cost.
Reconciliation re-derives from the corrupted source (meta-finding): The spec's reconciliation job checks Position.realized_pnl against SUM(LotClosed.realized_pnl). For 7 of Sonnet's 8 findings, LotClosed is itself the corrupted source, so reconciliation confirms the wrong number with high confidence and provides false assurance. This is a system-level insight about the reconciliation architecture, not just another data path.
LotClosedAdjustment event type recommendation: Proposed a specific architectural fix (adjustment events that reference original LotClosed IDs and record deltas) that preserves immutability semantics while enabling corrections. Neither GPT-5 nor Opus proposed concrete mechanisms at this level.
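The rounding finding is easy to check with exact decimal arithmetic. The sketch below assumes per-share basis is stored rounded to the cent and that the lot held 100 shares at $100 before the split (consistent with the $10,000 total above); `split_adjust` and `total_cost_preserved` are hypothetical helpers, the second being the kind of invariant the finding says is missing.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def split_adjust(quantity, cost_basis, ratio):
    """Apply an N-for-1 split, rounding the per-share basis to the cent."""
    new_quantity = quantity * ratio
    new_basis = (cost_basis / ratio).quantize(CENT, rounding=ROUND_HALF_UP)
    return new_quantity, new_basis


def total_cost_preserved(quantity, cost_basis, original_total_cost):
    """Invariant: a split must not change the lot's total cost."""
    return quantity * cost_basis == original_total_cost


original_total = 100 * Decimal("100.00")
quantity, basis = split_adjust(100, Decimal("100.00"), 3)

adjusted_total = quantity * basis
drift = original_total - adjusted_total
invariant_holds = total_cost_preserved(quantity, basis, original_total)
```

Here the adjusted lot carries 300 shares at $33.33, a total of $9,999.00, so one dollar of basis has vanished and the invariant fails. One way to avoid the loss, offered only as a design option, is to store the lot's total cost and derive per-share basis on read.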

Quality assessment:

GPT-5 produced the most findings (12) with the highest domain specificity. Its unique findings demonstrate deep knowledge of US tax law (IRS tack-back rules, 61-day window semantics, trade-date vs settlement-date distinctions, spin-off basis reallocation mechanics). Several findings are variations on the same theme (corporate action timing) but each identifies a genuinely distinct failure mode with different financial characteristics. The 10,688 reasoning tokens appear to have been invested in domain-specific reasoning about tax rules and timing edge cases. Every finding includes precise financial impact quantification.
Claude Opus produced 8 findings with characteristic design-tension focus. The crash-recovery race condition and strategy-switch-during-reprocessing findings show reasoning about system dynamics that neither other model surfaced: these are about HOW the system operates at runtime, not just what it computes. The self-correcting behavior (proving the running average is actually correct before finding the real drift path) demonstrates intellectual honesty. However, Opus produced fewer unique findings than in previous experiments, and its overlap with the common ground was substantial.
Claude Sonnet was the surprise performer in this experiment. While its raw finding count (8) matches Opus, three of its contributions represent qualitatively different analytical modes:
1. The rounding accumulation finding shows quantitative reasoning about numeric precision that neither GPT-5 nor Opus considered — they focused on logical/semantic errors while Sonnet identified a computational precision problem.
2. The "reconciliation confirms corruption" meta-finding is architecturally the most important insight across all three models. It identifies that the system's own safety mechanism (reconciliation from LotClosed) becomes a corruption amplifier when LotClosed itself is wrong. This is a system-of-systems insight.
3. The concrete LotClosedAdjustment architectural recommendation shows Sonnet reasoning about solutions, not just problems.
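The second and third points can be illustrated together. This is a sketch under assumptions: the event fields, the `reconcile` helper, and the late 2-for-1 split scenario are hypothetical, and `LotClosedAdjustment` follows only the description above (an event that references the original LotClosed ID and records a delta).

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class LotClosed:
    event_id: str
    realized_pnl: Decimal


@dataclass(frozen=True)
class LotClosedAdjustment:
    event_id: str
    original_event_id: str  # the immutable LotClosed being corrected
    pnl_delta: Decimal
    reason: str


def reconcile(position_pnl, closed, adjustments=()):
    """Check Position.realized_pnl against the sum re-derived from events."""
    derived = sum((e.realized_pnl for e in closed), Decimal("0"))
    derived += sum((a.pnl_delta for a in adjustments), Decimal("0"))
    return position_pnl == derived


# 100 shares sold at 60. A 2-for-1 split arrived late, so the event was
# written with the pre-split basis of 100 instead of the adjusted basis of 50.
recorded = LotClosed("lc-1", (Decimal("60") - Decimal("100")) * 100)
true_pnl = (Decimal("60") - Decimal("50")) * 100

# Position was incremented from the same wrong event, so the check passes.
passes_while_wrong = reconcile(recorded.realized_pnl, [recorded])

# An adjustment event corrects the total without mutating the original.
fix = LotClosedAdjustment("adj-1", "lc-1", true_pnl - recorded.realized_pnl,
                          "late corporate action")
passes_after_fix = reconcile(true_pnl, [recorded], [fix])
```

Reconciliation reports agreement on -4,000 even though the true figure is +1,000, because both sides derive from the same event. With the adjustment in place the original event stays untouched and the reconciled total becomes +1,000.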

Key insight — Financial domain knowledge as differentiator:

This is the first experiment where domain-specific knowledge (US tax law, IRS rules, financial accounting conventions) is a primary differentiator between models. Previous experiments tested general architectural reasoning skills that don't require specialized knowledge. Here:

GPT-5's unique findings are primarily driven by tax law knowledge (tack-back rules, 61-day window semantics, spin-off basis mechanics) that the other models either don't know or don't surface.
Opus's unique findings are primarily driven by concurrent systems reasoning (crash recovery races, strategy change timing) — its traditional strength.
Sonnet's unique findings are primarily driven by structural/meta-analytical reasoning (what does it mean that the reconciliation source is corrupted? how do numeric precision errors accumulate?).

The models aren't just finding "different things" — they're reasoning from different knowledge domains applied to the same document:

GPT-5: tax law + financial semantics
Opus: distributed systems + runtime dynamics
Sonnet: architecture patterns + numeric computing + meta-analysis

Comparison to Finding #22 (silent correctness failures):

Finding #22 tested the "silent correctness failures" lens on risk-controls.md (a very different document — runtime behavioral rather than data/accounting). That experiment found GPT-5 dominant with Sonnet producing "surface-level" results. Here, on a financial accounting document where the failures are about data correctness over time, Sonnet performs much better. This suggests Sonnet's analytical capabilities are strongest when the document describes data structures and their transformations (where it can reason about invariants and meta-properties) rather than runtime behavior (where it struggles with temporal/concurrent reasoning — consistent with Finding #13).

Practical implication:

For financial accounting architecture review specifically:

GPT-5 is essential for tax-rule compliance gaps (it appears to have genuine knowledge of IRS wash sale rules, holding period tack-back, corporate action basis mechanics)
Sonnet is valuable for structural/meta-analysis of the reconciliation architecture (it found that the safety net itself is compromised — the highest-leverage finding)
Opus adds value for runtime/crash-recovery scenarios but provides less unique insight on the data-correctness dimension

The three-model approach continues to justify itself: of GPT-5's 6 unique findings, 4 are tax-law gaps; Opus contributes 2 unique runtime gaps; and Sonnet contributes 2 unique structural insights (including the most architecturally significant one) plus a concrete fix proposal. None alone would have produced the complete picture.

Token efficiency:

Model	Findings	Tokens/Finding	Unique findings	Tokens/Unique finding
GPT-5	12	1,188	6	2,375
Opus	8	784	3	2,090
Sonnet	8	731	3	1,948
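The per-finding figures in the table follow directly from the output token counts in the results table. The snippet below recomputes them; it assumes round-half-up to whole tokens, which is what reproduces Sonnet's 731 (5,844 / 8 = 730.5), and `per_finding` is just an illustrative helper.

```python
# (output tokens, findings, unique findings) per model, from the tables above
runs = {
    "GPT-5": (14251, 12, 6),
    "Opus": (6270, 8, 3),
    "Sonnet": (5844, 8, 3),
}


def per_finding(tokens, count):
    """Integer division rounded half up, avoiding float rounding surprises."""
    return (2 * tokens + count) // (2 * count)


efficiency = {
    model: (per_finding(tokens, findings), per_finding(tokens, unique))
    for model, (tokens, findings, unique) in runs.items()
}

for model, (per_all, per_unique) in efficiency.items():
    print(f"{model}: {per_all} tokens/finding, {per_unique} tokens/unique finding")
```

Note that Python's built-in `round` uses round-half-to-even and would give 730 for Sonnet, which is why the helper rounds with integer arithmetic instead.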

Sonnet is most token-efficient per finding (including its highest-leverage meta-finding). GPT-5's reasoning tokens (10,688) produced the most unique findings but at ~4.5× the cost per finding vs Sonnet; measured in output tokens alone, the gap is about 1.6× (1,188 vs 731 per finding). For financial document review where every unique finding represents potential regulatory risk, all three are justified.
