From bb0c0d564be1ed3ec2765c6baad192444eec39c9 Mon Sep 17 00:00:00 2001
From: claw
Date: Thu, 7 May 2026 11:09:58 -0700
Subject: [PATCH] Finding #40: Silent data corruption paths in financial accounting

New analytical lens applied to lot-accounting.md (181 lines). Tests how models
identify sequences of individually correct operations that produce silently
wrong financial results.

Results:
- GPT-5: 12 findings (137s, 10688 reasoning tokens) - tax law domain knowledge
- Opus: 8 findings (121s) - concurrent systems / crash recovery focus
- Sonnet: 8 findings (111s) - structural meta-analysis, highest-leverage finding

Key insight: First experiment where domain-specific knowledge (tax law) is the
primary differentiator. Models reason from different knowledge domains:
GPT-5=tax law, Opus=distributed systems, Sonnet=architecture patterns.

Sonnet produced the most architecturally significant finding: that the
system's reconciliation mechanism confirms corruption rather than detecting
it (because it re-derives from LotClosed which is itself the corrupted source).
---
 ...a-corruption-paths-financial-accounting.md | 195 ++++++++++++++++++
 1 file changed, 195 insertions(+)
 create mode 100644 findings/2026-05-07-40-silent-data-corruption-paths-financial-accounting.md

diff --git a/findings/2026-05-07-40-silent-data-corruption-paths-financial-accounting.md b/findings/2026-05-07-40-silent-data-corruption-paths-financial-accounting.md
new file mode 100644
index 0000000..2de96ea
--- /dev/null
+++ b/findings/2026-05-07-40-silent-data-corruption-paths-financial-accounting.md
@@ -0,0 +1,195 @@
+# Finding #40: Silent Data Corruption Paths in Financial Accounting — GPT-5 finds most paths with highest domain specificity; Sonnet produces strongest structural insight
+
+**Date:** 2026-05-07
+**Task:** Identify silent data corruption paths in gargoyle's `lot-accounting.md`
+(181 lines) — sequences of individually correct operations that produce financially
+wrong results (incorrect P&L, wrong tax terms, misattributed gains) without
+triggering any error or invariant violation.
+**How we used them:** Same document (full text) + same focused analytical question
+to all 3 models via HAI proxy. Highly structured prompt specifying 5 categories
+(upstream corruption propagation, ordering-dependent correctness, temporal boundary
+errors, aggregate drift without detection, cross-strategy contamination). Required
+specific output format per finding with severity ratings and financial impact
+quantification. No tools, no project context beyond the document itself.
+
+| Model | Time | Output tokens | Reasoning tokens | Findings |
+|---|---|---|---|---|
+| GPT-5 | 137s | 14,251 | 10,688 | 12 |
+| Claude Opus 4.6 | 121s | 6,270 | (internal) | 8 |
+| Claude Sonnet 4.6 | 111s | 5,844 | (internal) | 8 |
+
+## What they found — common ground (all 3 identified):
+
+- **Wash sale + LotClosed immutability conflict**: Disallowed loss permanently
+  recorded in immutable LotClosed because wash sale detection fires after the
+  closure event is written. No mechanism to annotate, supersede, or disallow
+  the recorded loss. All three models identified this as the most fundamental
+  design tension.
+- **Corporate action timing vs fill ordering**: Late-arriving corporate actions
+  (splits, spin-offs) processed after related sells produce LotClosed events
+  with pre-adjustment cost bases that can never be corrected.
+- **Holding period / tax term boundary errors**: Ambiguity in timestamp semantics
+  (exchange time vs processing time), IRS day-after rule, timezone boundaries
+  causing short/long-term misclassification at the 365-day boundary.
+- **FIFO ordering ambiguity**: When opened_at timestamps are identical or
+  reordered (network jitter, batch processing), FIFO produces different outcomes
+  without any error signal.
+- **Cross-strategy lot consumption**: FIFO/HIFO operates at position level
+  (user+instrument), not strategy level — sells from Strategy A can consume lots
+  opened by Strategy B, corrupting per-strategy P&L attribution.
+- **Position.average_cost drift after corporate actions**: The spec only describes
+  Position updates during buy/sell flows, not during the Adjusting path — corporate
+  actions that modify lot cost bases may not trigger Position recalculation.
+
+## GPT-5 unique findings (not in either Claude model):
+
+- **Late spin-off basis reallocation after sale**: 20% basis reallocation to SpinCo
+  arrives after Parent shares were sold → LotClosed records gain computed from
+  original (too-high) basis, understating taxable income by the reallocation amount.
+  Distinct from split timing because the *kind* of error is different (basis
+  reallocation vs quantity/basis halving).
+- **Wash sale holding period tack-back not implemented**: Spec's LotClosed schema
+  has no mechanism to record tack-back of holding period from original lot to
+  replacement lot. `opened_at` is always set from the opening fill timestamp, not
+  the original acquisition date. This can flip long-term/short-term classification.
+- **HIFO flipped by CA timing**: Corporate action reducing one lot's basis changes
+  which lot HIFO selects, producing different P&L depending on whether the CA
+  arrives before or after the sell.
+- **Position.closed_at stale after post-closure corporate action**: If a position is
+  fully closed (closed_at set) and then a corporate action credits new shares,
+  closed_at may not be cleared → Risk treats position as closed while it's live.
+- **Wash sale cross-strategy tax deferral**: Strategy A's loss is deferred into
+  Strategy B's replacement lot basis, silently transferring tax liability between
+  strategies without any detection mechanism.
+- **61-day wash window off-by-one from timezone**: Boundary cases where UTC vs local
+  time difference pushes a replacement buy just inside/outside the 61-day window.
+
+## Claude Opus unique findings (not in either other model):
+
+- **Recovery after crash + strategy switch race**: If user changes lot selection
+  strategy between a crash and fill reprocessing, the recovered fill is processed
+  under the wrong strategy. Immutable LotClosed from the recovery is now permanently
+  wrong. The spec doesn't require strategy to be stored with the fill at submission
+  time.
+- **Position rebuild racing with active processing creates double-count**: The spec's
+  recovery procedure ("re-derive from LotClosed events") can race with new fill
+  processing that's also incrementing Position.realized_pnl, causing a new LotClosed
+  event to be counted both in the rebuild SUM and in the incremental update.
+- **Self-correcting analytical behavior**: Opus began exploring Position.average_cost
+  drift with a running-weighted-average scenario, worked through the math mid-response,
+  proved to itself that the running average IS correct for simple cases, then
+  identified the REAL drift path (corporate action not triggering recalculation).
+  This inline verification continues the pattern from Findings #15 and #20.
+
+## Claude Sonnet unique findings (not in either other model):
+
+- **Rounding error accumulation in split-adjusted cost basis**: 3-for-1 split of 100
+  shares at $100 basis → $33.33 per share × 300 = $9,999 instead of $10,000. Per-split error
+  is small but accumulates without bound across multiple splits and high lot counts.
+  No invariant checks that `SUM(lot.quantity × lot.cost_basis) = original_total_cost`.
+- **Reconciliation re-derives from the corrupted source (meta-finding)**: The spec's
+  reconciliation job checks Position.realized_pnl against SUM(LotClosed.realized_pnl).
+  For 7 of 8 findings, LotClosed is itself the corrupted source — reconciliation
+  confirms the wrong number with high confidence, providing *false assurance*.
+  This is a system-level insight about the reconciliation architecture, not just
+  another data path.
+- **LotClosedAdjustment event type recommendation**: Proposed a specific architectural
+  fix (adjustment events that reference original LotClosed IDs and record deltas)
+  that preserves immutability semantics while enabling corrections. Neither GPT-5
+  nor Opus proposed concrete mechanisms at this level.
+
+## Quality assessment:
+
+- **GPT-5** produced the most findings (12) with the highest domain specificity.
+  Its unique findings demonstrate deep knowledge of US tax law (IRS tack-back rules,
+  61-day window semantics, trade-date vs settlement-date distinctions, spin-off basis
+  reallocation mechanics). Several findings are variations on the same theme
+  (corporate action timing) but each identifies a genuinely distinct failure mode
+  with different financial characteristics. The 10,688 reasoning tokens appear to
+  have been invested in domain-specific reasoning about tax rules and timing edge
+  cases. Every finding includes precise financial impact quantification.
+
+- **Claude Opus** produced 8 findings with characteristic design-tension focus.
+  The crash-recovery race condition and strategy-switch-during-reprocessing findings
+  show reasoning about system dynamics that neither other model surfaced — these
+  are about HOW the system operates at runtime, not just what it computes. The
+  self-correcting behavior (proving the running average is actually correct before
+  finding the real drift path) demonstrates intellectual honesty. However, Opus
+  produced fewer unique findings than in previous experiments, and its overlap with
+  the common ground was substantial.
+
+- **Claude Sonnet** was the surprise performer in this experiment. While its raw
+  finding count (8) matches Opus, three of its contributions stand out as
+  qualitatively different from the other models' output:
+  1. The **rounding accumulation** finding shows quantitative reasoning about numeric
+     precision that neither GPT-5 nor Opus considered — they focused on logical/semantic
+     errors while Sonnet identified a computational precision problem.
+  2. The **"reconciliation confirms corruption"** meta-finding is architecturally the
+     most important insight across all three models. It identifies that the system's
+     own safety mechanism (reconciliation from LotClosed) becomes a corruption
+     *amplifier* when LotClosed itself is wrong. This is a system-of-systems insight.
+  3. The concrete **LotClosedAdjustment** architectural recommendation shows Sonnet
+     reasoning about solutions, not just problems.
+
+## Key insight — Financial domain knowledge as differentiator:
+
+This is the first experiment where **domain-specific knowledge** (US tax law,
+IRS rules, financial accounting conventions) is a primary differentiator between
+models. Previous experiments tested general architectural reasoning skills that
+don't require specialized knowledge. Here:
+
+- GPT-5's unique findings are primarily driven by **tax law knowledge** (tack-back
+  rules, 61-day window semantics, spin-off basis mechanics) that the other models
+  either don't know or don't surface.
+- Opus's unique findings are primarily driven by **concurrent systems reasoning**
+  (crash recovery races, strategy change timing) — its traditional strength.
+- Sonnet's unique findings are primarily driven by **structural/meta-analytical
+  reasoning** (what does it mean that the reconciliation source is corrupted? how
+  do numeric precision errors accumulate?).
+
+The models aren't just finding "different things" — they're reasoning from
+*different knowledge domains* applied to the same document:
+- GPT-5: tax law + financial semantics
+- Opus: distributed systems + runtime dynamics
+- Sonnet: architecture patterns + numeric computing + meta-analysis
+
+## Comparison to Finding #22 (silent correctness failures):
+
+Finding #22 tested the "silent correctness failures" lens on `risk-controls.md`
+(a very different document — runtime behavioral rather than data/accounting).
+That experiment found GPT-5 dominant with Sonnet producing "surface-level" results.
+Here, on a financial accounting document where the failures are about *data
+correctness over time*, Sonnet performs much better. This suggests Sonnet's
+analytical capabilities are strongest when the document describes data structures
+and their transformations (where it can reason about invariants and meta-properties)
+rather than runtime behavior (where it struggles with temporal/concurrent reasoning
+— consistent with Finding #13).
+
+## Practical implication:
+
+For **financial accounting architecture review** specifically:
+- GPT-5 is essential for tax-rule compliance gaps (it appears to have genuine
+  knowledge of IRS wash sale rules, holding period tack-back, corporate action
+  basis mechanics)
+- Sonnet is valuable for structural/meta-analysis of the reconciliation architecture
+  (it found that the safety net itself is compromised — the highest-leverage finding)
+- Opus adds value for runtime/crash-recovery scenarios but provides less unique insight
+  on the data-correctness dimension
+
+The three-model approach continues to justify itself: GPT-5 finds 4 unique tax-law
+gaps, Opus finds 2 unique runtime gaps, and Sonnet finds 2 unique structural insights
+(including the most architecturally significant one). None alone would have produced
+the complete picture.
+
+## Token efficiency:
+
+| Model | Findings | Output tokens/Finding | Unique findings | Output tokens/Unique finding |
+|---|---|---|---|---|
+| GPT-5 | 12 | 1,188 | 6 | 2,375 |
+| Opus | 8 | 784 | 3 | 2,090 |
+| Sonnet | 8 | 731 | 3 | 1,948 |
+
+Sonnet is most token-efficient per finding (including its highest-leverage
+meta-finding). GPT-5's reasoning tokens (10,688) produced the most unique findings
+but at ~4.5× the cost per finding vs Sonnet. For financial document review where
+every unique finding represents potential regulatory risk, all three are justified.
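
The per-finding figures in the token-efficiency table can be re-derived from the output-token and finding counts reported in this finding. The following is a minimal Python sketch under stated assumptions: the names `RUNS`, `per_item`, and `efficiency` are illustrative and do not come from the experiment tooling, and half-up rounding is assumed because it reproduces the table (Python's built-in `round` rounds half to even and would give 730 for Sonnet's 730.5).

```python
from decimal import Decimal, ROUND_HALF_UP

# Output-token and finding counts copied from the tables in this finding.
RUNS = {
    "GPT-5": {"output_tokens": 14251, "findings": 12, "unique": 6},
    "Opus": {"output_tokens": 6270, "findings": 8, "unique": 3},
    "Sonnet": {"output_tokens": 5844, "findings": 8, "unique": 3},
}


def per_item(tokens: int, count: int) -> int:
    """Return tokens per item, rounded half up to a whole token (assumed rounding)."""
    ratio = Decimal(tokens) / Decimal(count)
    return int(ratio.quantize(Decimal("1"), rounding=ROUND_HALF_UP))


def efficiency(runs: dict) -> dict:
    """Map each model to (tokens per finding, tokens per unique finding)."""
    return {
        model: (
            per_item(r["output_tokens"], r["findings"]),
            per_item(r["output_tokens"], r["unique"]),
        )
        for model, r in runs.items()
    }


if __name__ == "__main__":
    for model, (per_finding, per_unique) in efficiency(RUNS).items():
        print(f"{model}: {per_finding} tokens/finding, {per_unique} tokens/unique finding")
```

Run as a script, this prints one line per model. The values match the table only when output tokens are the numerator, with reasoning tokens excluded.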