bb0c0d564b
New analytical lens applied to lot-accounting.md (181 lines). Tests how models identify sequences of individually correct operations that produce silently wrong financial results. Results: - GPT-5: 12 findings (137s, 10688 reasoning tokens) - tax law domain knowledge - Opus: 8 findings (121s) - concurrent systems / crash recovery focus - Sonnet: 8 findings (111s) - structural meta-analysis, highest-leverage finding Key insight: First experiment where domain-specific knowledge (tax law) is the primary differentiator. Models reason from different knowledge domains: GPT-5=tax law, Opus=distributed systems, Sonnet=architecture patterns. Sonnet produced the most architecturally significant finding: that the system's reconciliation mechanism confirms corruption rather than detecting it (because it re-derives from LotClosed which is itself the corrupted source).
# Finding #40: Silent Data Corruption Paths in Financial Accounting — GPT-5 finds most paths with highest domain specificity; Sonnet produces strongest structural insight

**Date:** 2026-05-07

**Task:** Identify silent data corruption paths in gargoyle's `lot-accounting.md`
(181 lines): sequences of individually correct operations that produce financially
wrong results (incorrect P&L, wrong tax terms, misattributed gains) without
triggering any error or invariant violation.

**How we used them:** Same document (full text) + same focused analytical question
to all 3 models via HAI proxy. Highly structured prompt specifying 5 categories
(upstream corruption propagation, ordering-dependent correctness, temporal boundary
errors, aggregate drift without detection, cross-strategy contamination). Required
specific output format per finding with severity ratings and financial impact
quantification. No tools, no project context beyond the document itself.

| Model | Time | Output tokens | Reasoning tokens | Findings |
|---|---|---|---|---|
| GPT-5 | 137s | 14,251 | 10,688 | 12 |
| Claude Opus 4.6 | 121s | 6,270 | (internal) | 8 |
| Claude Sonnet 4.6 | 111s | 5,844 | (internal) | 8 |

## What they found — common ground (all 3 identified):

- **Wash sale + LotClosed immutability conflict**: Disallowed loss permanently
  recorded in immutable LotClosed because wash sale detection fires after the
  closure event is written. No mechanism to annotate, supersede, or disallow
  the recorded loss. All three models identified this as the most fundamental
  design tension.
- **Corporate action timing vs fill ordering**: Late-arriving corporate actions
  (splits, spin-offs) processed after related sells produce LotClosed events
  with pre-adjustment cost bases that can never be corrected.
- **Holding period / tax term boundary errors**: Ambiguity in timestamp semantics
  (exchange time vs processing time), IRS day-after rule, timezone boundaries
  causing short/long-term misclassification at the 365-day boundary.
- **FIFO ordering ambiguity**: When opened_at timestamps are identical or
  reordered (network jitter, batch processing), FIFO produces different outcomes
  without any error signal.
- **Cross-strategy lot consumption**: FIFO/HIFO operates at position level
  (user+instrument), not strategy level, so sells from Strategy A can consume lots
  opened by Strategy B, corrupting per-strategy P&L attribution.
- **Position.average_cost drift after corporate actions**: The spec only describes
  Position updates during buy/sell flows, not during the Adjusting path, so corporate
  actions that modify lot cost bases may not trigger Position recalculation.

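The FIFO ordering ambiguity above can be reproduced in a few lines. This is a minimal sketch, not the spec's algorithm: the `Lot` class, `fifo_realized_pnl`, and all prices are hypothetical, and the only property borrowed from the finding is that lots are consumed in `opened_at` order.

```python
from dataclasses import dataclass
from datetime import datetime
from decimal import Decimal


@dataclass(frozen=True)
class Lot:
    # Hypothetical lot shape; field names are assumptions, not the spec's schema.
    lot_id: str
    opened_at: datetime
    quantity: Decimal
    cost_basis: Decimal  # per share


def fifo_realized_pnl(lots, sell_qty, sell_price):
    """Consume lots in opened_at order and return the realized P&L."""
    remaining = sell_qty
    pnl = Decimal("0")
    # sorted() is stable: lots with equal opened_at keep their arrival order,
    # so arrival order silently decides which lot is closed first.
    for lot in sorted(lots, key=lambda l: l.opened_at):
        if remaining == 0:
            break
        take = min(lot.quantity, remaining)
        pnl += take * (sell_price - lot.cost_basis)
        remaining -= take
    return pnl


ts = datetime(2026, 1, 5, 14, 30, 0)  # identical timestamp on both lots
cheap = Lot("A", ts, Decimal("100"), Decimal("90"))
dear = Lot("B", ts, Decimal("100"), Decimal("110"))

pnl_ab = fifo_realized_pnl([cheap, dear], Decimal("100"), Decimal("100"))
pnl_ba = fifo_realized_pnl([dear, cheap], Decimal("100"), Decimal("100"))
print(pnl_ab, pnl_ba)  # same fills, different arrival order, opposite sign
```

Both calls succeed and neither violates an invariant, which is what makes the path silent. A deterministic tie-breaker (for example a monotonic sequence number) would be one way to remove the ambiguity, though the spec's own remedy is not stated here.
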
## GPT-5 unique findings (not in either Claude model):

- **Late spin-off basis reallocation after sale**: 20% basis reallocation to SpinCo
  arrives after Parent shares were sold → LotClosed records gain computed from
  original (too-high) basis, understating taxable income by the reallocation amount.
  Distinct from split timing because the *kind* of error is different (basis
  reallocation vs quantity/basis halving).
- **Wash sale holding period tack-back not implemented**: Spec's LotClosed schema
  has no mechanism to record tack-back of holding period from original lot to
  replacement lot. `opened_at` is always set from the opening fill timestamp, not
  the original acquisition date. This can flip long-term/short-term classification.
- **HIFO flipped by CA timing**: Corporate action reducing one lot's basis changes
  which lot HIFO selects, producing different P&L depending on whether the CA
  arrives before or after the sell.
- **Position.closed_at stale after post-closure corporate action**: If a position is
  fully closed (closed_at set) and then a corporate action credits new shares,
  closed_at may not be cleared → Risk treats position as closed while it's live.
- **Wash sale cross-strategy tax deferral**: Strategy A's loss is deferred into
  Strategy B's replacement lot basis, silently transferring tax liability between
  strategies without any detection mechanism.
- **61-day wash window off-by-one from timezone**: Boundary cases where UTC vs local
  time difference pushes a replacement buy just inside/outside the 61-day window.

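The 61-day window off-by-one can be illustrated with standard-library datetimes. This is a sketch under stated assumptions: `in_wash_window` is a hypothetical helper, the fixed UTC-5 offset is a stand-in for US Eastern time that ignores daylight saving, and the sketch does not claim which date convention the spec or the tax rules require. It only shows that the two conventions disagree.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed UTC-5 offset as a stand-in for US Eastern (no DST handling).
EASTERN = timezone(timedelta(hours=-5))


def in_wash_window(sale_at, buy_at, tz):
    """True if the buy is within 30 calendar days of the sale, using dates in tz."""
    sale_day = sale_at.astimezone(tz).date()
    buy_day = buy_at.astimezone(tz).date()
    return abs((buy_day - sale_day).days) <= 30


sale_at = datetime(2026, 1, 5, 15, 0, tzinfo=EASTERN)   # loss sale
buy_at = datetime(2026, 2, 4, 20, 30, tzinfo=EASTERN)   # evening replacement buy

in_local = in_wash_window(sale_at, buy_at, EASTERN)       # day 30 by local dates
in_utc = in_wash_window(sale_at, buy_at, timezone.utc)    # day 31 by UTC dates
print(in_local, in_utc)
```

The same pair of fills is a wash sale under one convention and not under the other, and neither evaluation raises an error.
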
## Claude Opus unique findings (not in either other model):

- **Recovery after crash + strategy switch race**: If user changes lot selection
  strategy between a crash and fill reprocessing, the recovered fill is processed
  under the wrong strategy. Immutable LotClosed from the recovery is now permanently
  wrong. The spec doesn't require strategy to be stored with the fill at submission
  time.
- **Position rebuild racing with active processing creates double-count**: The spec's
  recovery procedure ("re-derive from LotClosed events") can race with new fill
  processing that's also incrementing Position.realized_pnl, causing a new LotClosed
  event to be counted both in the rebuild SUM and in the incremental update.
- **Self-correcting analytical behavior**: Opus began exploring Position.average_cost
  drift with a running-weighted-average scenario, worked through the math mid-response,
  proved to itself that the running average IS correct for simple cases, then
  identified the REAL drift path (corporate action not triggering recalculation).
  This inline verification continues the pattern from Findings #15 and #20.

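The rebuild double-count can be replayed deterministically by writing out one unlucky interleaving by hand. This is a minimal sketch: the three step functions and the dictionary-based `position` are hypothetical stand-ins for the fill processor and the recovery procedure, and the amounts are invented.

```python
from decimal import Decimal

# Hypothetical state: realized_pnl values of existing LotClosed events.
lot_closed_log = [Decimal("100"), Decimal("-40")]
position = {"realized_pnl": Decimal("60")}  # consistent with the log so far


def append_lot_closed(log, pnl):
    """Fill processor, step 1: write the immutable closure event."""
    log.append(pnl)


def apply_increment(pos, pnl):
    """Fill processor, step 2: incrementally update the Position aggregate."""
    pos["realized_pnl"] += pnl


def rebuild(pos, log):
    """Recovery: re-derive the aggregate from all LotClosed events."""
    pos["realized_pnl"] = sum(log, Decimal("0"))


new_pnl = Decimal("25")
append_lot_closed(lot_closed_log, new_pnl)   # fill processor writes the event
rebuild(position, lot_closed_log)            # rebuild runs in between: 85
apply_increment(position, new_pnl)           # increment lands afterwards: 110

expected = sum(lot_closed_log, Decimal("0"))
print(position["realized_pnl"], expected)
```

Every individual step is correct in isolation; only the ordering produces the double count. Running the rebuild and the incremental update under a shared lock or version check is one conventional remedy, offered here as an assumption rather than something the spec prescribes.
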
## Claude Sonnet unique findings (not in either other model):

- **Rounding error accumulation in split-adjusted cost basis**: 3-for-1 split of
  100 shares at a $100 per-share basis → 300 shares at $33.33 = $9,999 instead of
  $10,000. Per-split error is small but accumulates without bound across multiple
  splits and high lot counts.
  No invariant checks that `SUM(lot.quantity × lot.cost_basis) = original_total_cost`.
- **Reconciliation re-derives from the corrupted source (meta-finding)**: The spec's
  reconciliation job checks Position.realized_pnl against SUM(LotClosed.realized_pnl).
  For 7 of 8 findings, LotClosed is itself the corrupted source, so reconciliation
  confirms the wrong number with high confidence, providing *false assurance*.
  This is a system-level insight about the reconciliation architecture, not just
  another data path.
- **LotClosedAdjustment event type recommendation**: Proposed a specific architectural
  fix (adjustment events that reference original LotClosed IDs and record deltas)
  that preserves immutability semantics while enabling corrections. Neither GPT-5
  nor Opus proposed concrete mechanisms at this level.

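The rounding example works out as follows with exact decimal arithmetic. This is a sketch only: `split_per_share`, the cent precision, and the half-up rounding mode are assumptions, since the spec's actual rounding rule is not quoted here.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")  # assumed storage precision for per-share basis


def split_per_share(basis, ratio):
    """Per-share basis after an N-for-1 split, rounded to the cent."""
    return (basis / ratio).quantize(CENT, rounding=ROUND_HALF_UP)


quantity = Decimal("100")
basis = Decimal("100.00")
original_total = quantity * basis            # 10000.00

new_quantity = quantity * 3                  # 300 shares after a 3-for-1 split
new_basis = split_per_share(basis, Decimal("3"))
adjusted_total = new_quantity * new_basis

drift = original_total - adjusted_total      # basis lost to rounding
print(new_basis, adjusted_total, drift)
```

The missing invariant from the finding is then a one-line check, `adjusted_total == original_total`, which fails here. Storing the lot's total cost and deriving the per-share figure on read is one way to keep that invariant, offered as a design option rather than the spec's approach.
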
## Quality assessment:

- **GPT-5** produced the most findings (12) with the highest domain specificity.
  Its unique findings demonstrate deep knowledge of US tax law (IRS tack-back rules,
  61-day window semantics, trade-date vs settlement-date distinctions, spin-off basis
  reallocation mechanics). Several findings are variations on the same theme
  (corporate action timing) but each identifies a genuinely distinct failure mode
  with different financial characteristics. The 10,688 reasoning tokens appear to
  have been invested in domain-specific reasoning about tax rules and timing edge
  cases. Every finding includes precise financial impact quantification.

- **Claude Opus** produced 8 findings with characteristic design-tension focus.
  The crash-recovery race condition and strategy-switch-during-reprocessing findings
  show reasoning about system dynamics that neither other model surfaced: these
  are about HOW the system operates at runtime, not just what it computes. The
  self-correcting behavior (proving the running average is actually correct before
  finding the real drift path) demonstrates intellectual honesty. However, Opus
  produced fewer unique findings than in previous experiments, and its overlap with
  the common ground was substantial.

- **Claude Sonnet** was the surprise performer in this experiment. While its raw
  finding count (8) matches Opus, three of its contributions represent qualitatively
  different analytical modes:
  1. The **rounding accumulation** finding shows quantitative reasoning about numeric
     precision that neither GPT-5 nor Opus considered; they focused on logical/semantic
     errors while Sonnet identified a computational precision problem.
  2. The **"reconciliation confirms corruption"** meta-finding is architecturally the
     most important insight across all three models. It identifies that the system's
     own safety mechanism (reconciliation from LotClosed) becomes a corruption
     *amplifier* when LotClosed itself is wrong. This is a system-of-systems insight.
  3. The concrete **LotClosedAdjustment** architectural recommendation shows Sonnet
     reasoning about solutions, not just problems.

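The "reconciliation confirms corruption" point, and the adjustment-event idea that answers it, can be sketched together. Everything here is illustrative: the two dataclasses carry only the fields needed for the example, `reconcile` and `effective_pnl` are hypothetical helpers, and the amounts are invented. The only elements taken from the findings are that reconciliation compares Position.realized_pnl with the sum over LotClosed, and that an adjustment references the original event ID and records a delta.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class LotClosed:
    event_id: str
    realized_pnl: Decimal  # immutable once written


@dataclass(frozen=True)
class LotClosedAdjustment:
    # Hypothetical shape of the proposed correction event.
    original_event_id: str
    pnl_delta: Decimal


def reconcile(position_pnl, closed):
    """Reconciliation as described: Position vs SUM(LotClosed.realized_pnl)."""
    return position_pnl == sum((e.realized_pnl for e in closed), Decimal("0"))


def effective_pnl(closed, adjustments):
    """Realized P&L after applying deltas that reference existing events."""
    known = {e.event_id for e in closed}
    total = sum((e.realized_pnl for e in closed), Decimal("0"))
    total += sum((a.pnl_delta for a in adjustments
                  if a.original_event_id in known), Decimal("0"))
    return total


# A loss is recorded, then a later wash sale determination disallows it.
closed = [LotClosed("lc-1", Decimal("-500"))]
position_pnl = Decimal("-500")        # derived from the same wrong event

passes = reconcile(position_pnl, closed)   # both sides share the corrupted source
fix = [LotClosedAdjustment("lc-1", Decimal("500"))]
corrected = effective_pnl(closed, fix)
print(passes, corrected)
```

Reconciliation passes because both sides derive from the same event, which is the false assurance the finding describes. The adjustment leaves `lc-1` untouched and still yields a corrected total.
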
## Key insight — Financial domain knowledge as differentiator:

This is the first experiment where **domain-specific knowledge** (US tax law,
IRS rules, financial accounting conventions) is a primary differentiator between
models. Previous experiments tested general architectural reasoning skills that
don't require specialized knowledge. Here:

- GPT-5's unique findings are primarily driven by **tax law knowledge** (tack-back
  rules, 61-day window semantics, spin-off basis mechanics) that the other models
  either don't know or don't surface.
- Opus's unique findings are primarily driven by **concurrent systems reasoning**
  (crash recovery races, strategy change timing), its traditional strength.
- Sonnet's unique findings are primarily driven by **structural/meta-analytical
  reasoning** (what does it mean that the reconciliation source is corrupted? how
  do numeric precision errors accumulate?).

The models aren't just finding "different things"; they're reasoning from
*different knowledge domains* applied to the same document:
- GPT-5: tax law + financial semantics
- Opus: distributed systems + runtime dynamics
- Sonnet: architecture patterns + numeric computing + meta-analysis

## Comparison to Finding #22 (silent correctness failures):

Finding #22 tested the "silent correctness failures" lens on `risk-controls.md`
(a very different document: runtime behavioral rather than data/accounting).
That experiment found GPT-5 dominant with Sonnet producing "surface-level" results.
Here, on a financial accounting document where the failures are about *data
correctness over time*, Sonnet performs much better. This suggests Sonnet's
analytical capabilities are strongest when the document describes data structures
and their transformations (where it can reason about invariants and meta-properties)
rather than runtime behavior (where it struggles with temporal/concurrent reasoning,
consistent with Finding #13).

## Practical implication:

For **financial accounting architecture review** specifically:
- GPT-5 is essential for tax-rule compliance gaps (it appears to have genuine
  knowledge of IRS wash sale rules, holding period tack-back, corporate action
  basis mechanics)
- Sonnet is valuable for structural/meta-analysis of the reconciliation architecture
  (it found that the safety net itself is compromised, the highest-leverage finding)
- Opus adds value for runtime/crash-recovery scenarios but provides less unique insight
  on the data-correctness dimension

The three-model approach continues to justify itself: GPT-5 finds 4 unique tax-law
gaps, Opus finds 2 unique runtime gaps, and Sonnet finds 2 unique structural insights
(including the most architecturally significant one). None alone would have produced
the complete picture.

## Token efficiency:

| Model | Findings | Tokens/Finding | Unique findings | Tokens/Unique finding |
|---|---|---|---|---|
| GPT-5 | 12 | 1,188 | 6 | 2,375 |
| Opus | 8 | 784 | 3 | 2,090 |
| Sonnet | 8 | 731 | 3 | 1,948 |

Sonnet is most token-efficient per finding (including its highest-leverage
meta-finding). GPT-5's reasoning tokens (10,688) produced the most unique findings
but at ~4.5× the cost per finding vs Sonnet (by the output-token counts in the table
alone, the ratio is about 1.6×). For financial document review where
every unique finding represents potential regulatory risk, all three are justified.
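
The per-finding columns in the table follow directly from the output-token and finding counts reported earlier. A short sketch that recomputes them; the `runs` dictionary and the `rounded_ratio` helper are hypothetical, and rounding half up is an assumption chosen because it reproduces the table's figures.

```python
# (output tokens, findings, unique findings) as reported in this finding.
runs = {
    "GPT-5": (14251, 12, 6),
    "Opus": (6270, 8, 3),
    "Sonnet": (5844, 8, 3),
}


def rounded_ratio(numerator, denominator):
    """Integer division rounded half up, avoiding float rounding surprises."""
    return (2 * numerator + denominator) // (2 * denominator)


per_finding = {m: rounded_ratio(t, n) for m, (t, n, _) in runs.items()}
per_unique = {m: rounded_ratio(t, u) for m, (t, _, u) in runs.items()}

for model in runs:
    print(model, per_finding[model], per_unique[model])

# Output-token ratio between GPT-5 and Sonnet, per finding.
gpt5_vs_sonnet = per_finding["GPT-5"] / per_finding["Sonnet"]
print(round(gpt5_vs_sonnet, 2))
```

These figures count output tokens only. Pricing differences between models are not included, so they are token ratios and not cost ratios.
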