# Finding 22: Silent correctness failures: NEW analytical lens reveals Opus's strength at domain/regulatory reasoning; GPT-5 shows regulatory depth; all models converge on compliance timestamp errors

**Date:** 2026-05-05
**Task:** Identify scenarios where the mechanism produces SILENTLY INCORRECT results
(not errors, not crashes, but wrong financial calculations, wrong lot selections, or wrong
compliance records that pass all validation) in gargoyle's `specid-lot-selection.md`
(306 lines), a financial system specification covering tax lot selection strategies,
cost basis accounting, and IRS SpecID compliance.
**How we used them:** Same document (full text) + same focused analytical question to
all 3 models via HAI proxy. Highly structured prompt specifying 5 categories of silent
incorrectness (stale data, semantic precision, ordering sensitivity, composition errors,
temporal reference errors). Required specific output format per finding with concrete
numerical examples of financial impact. No tools, no project context beyond the document.

| Model | Time | Output tokens | Reasoning tokens | Findings | Critical | High | Medium |
|---|---|---|---|---|---|---|---|
| GPT-5 | 147s | 13,006 | 10,496 | 7 | 2 | 2 | 3 |
| Claude Opus 4.6 | 119s | 5,902 | (internal) | 10 | 3 | 3 | 4 |
| Claude Sonnet 4.6 | 122s | 6,011 | (internal) | 6 | 3 | 3 | 0 |

**What they found: common ground (all 3 identified):**
- `designation_at` = `DateTime.utc_now()` at processing time, NOT at actual
  designation time (manual selection was made at order submission, standing
  orders were configured earlier), so the compliance record is factually incorrect
- Holding period calculation boundary errors (>365 days vs IRS "more than one
  year" rule, off-by-one at leap year boundaries, day-after-acquisition start)
- HIFO tie-breaker `opened_at ASC` ignores the tax_term dimension: it selects
  long-term losses over short-term losses when both have identical cost basis,
  producing less tax-valuable outcomes
- Strategy preference resolved at fill processing time, not at trade time
  (preference changes between trade and fill processing apply retroactively)

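The holding-period and HIFO tie-breaker findings above can be illustrated with a minimal Python sketch. It is not the specification's implementation: the `Lot` class, the function names, and the Feb 29 clamping are assumptions made for illustration, and month-end acquisition edge cases are not modelled. It contrasts the ">365 days" shortcut with an anniversary-based reading of "more than one year", then compares the documented `opened_at ASC` tie-break with a hypothetical term-aware one that prefers short-term lots when cost basis is tied (which helps only when the lots close at a loss).

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal


def naive_is_long_term(acquired: date, sold: date) -> bool:
    """The '>365 days' shortcut; misfires when the span contains Feb 29."""
    return (sold - acquired).days > 365


def first_anniversary(acquired: date) -> date:
    """Same calendar day one year later (assumption: Feb 29 clamps to Feb 28)."""
    try:
        return acquired.replace(year=acquired.year + 1)
    except ValueError:  # Feb 29 has no counterpart in the following year
        return acquired.replace(year=acquired.year + 1, day=28)


def is_long_term(acquired: date, sold: date) -> bool:
    """'More than one year': the sale must fall after the first anniversary."""
    return sold > first_anniversary(acquired)


@dataclass(frozen=True)
class Lot:  # hypothetical shape, not the spec's schema
    lot_id: str
    cost_basis: Decimal  # per share
    opened_at: date


def hifo_spec_order(lots):
    """HIFO with the documented tie-break: highest basis first, then opened_at ASC."""
    return sorted(lots, key=lambda lot: (-lot.cost_basis, lot.opened_at))


def hifo_term_aware_order(lots, sold: date):
    """Hypothetical variant: on a basis tie, short-term lots (False) sort first."""
    return sorted(
        lots,
        key=lambda lot: (-lot.cost_basis, is_long_term(lot.opened_at, sold), lot.opened_at),
    )


# Bought 2023-03-01, sold 2024-03-01: 366 elapsed days, yet held exactly one year.
print(naive_is_long_term(date(2023, 3, 1), date(2024, 3, 1)))  # True
print(is_long_term(date(2023, 3, 1), date(2024, 3, 1)))        # False

lots = [
    Lot("old", Decimal("50"), date(2022, 1, 10)),
    Lot("new", Decimal("50"), date(2024, 1, 10)),
]
sale_date = date(2024, 6, 3)
print([lot.lot_id for lot in hifo_spec_order(lots)])                   # ['old', 'new']
print([lot.lot_id for lot in hifo_term_aware_order(lots, sale_date)])  # ['new', 'old']
```

Both orderings pass any structural validation, which is what makes the tie-break a silent failure: the difference only shows up in the tax character of the realized loss.
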
**GPT-5 unique findings (not in either Claude model):**
- Corporate action applied late, leaving stale cost basis in HIFO: ROC/dividend reduces
  basis, but if `close/4` fires before `apply_corporate_action/3`, HIFO sorts on
  pre-adjusted basis AND records wrong realized P&L permanently. No mechanism
  to restate previously persisted LotClosed events. Concrete example: $2,000
  overstated loss from one trade.
- `designation_at` fragmentation: a single sell consuming multiple lots calls
  `DateTime.utc_now()` per loop iteration, producing slightly different timestamps
  for what should be a single coherent designation event. Audit risk.
- LIFO label in `selection_method` field: records "lifo", but for securities LIFO
  isn't an authorized tax method; the operation is legally SpecID electing the
  newest lots. Downstream reporting may reject or misclassify.

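The `designation_at` fragmentation finding can be shown with a small Python sketch. Everything here is hypothetical (the function names, the record shape, and the fake clock that advances one millisecond per call); it only illustrates why reading the clock inside the loop yields one timestamp per lot, while reading it once yields a single designation event. Note that reading the clock once still records processing time, so it does not address the separate finding that the designation was actually made earlier.

```python
from datetime import datetime, timedelta, timezone
from itertools import count


def make_ticking_clock(start: datetime):
    """Fake clock that advances 1 ms per call, so per-call drift is visible."""
    ticks = count()
    return lambda: start + timedelta(milliseconds=next(ticks))


def close_per_iteration(lot_ids, clock):
    """Pattern described in the finding: the clock is read once per consumed lot."""
    return [{"lot_id": lot_id, "designation_at": clock()} for lot_id in lot_ids]


def close_single_designation(lot_ids, clock):
    """Read the clock once and stamp every lot closed by the same sell."""
    designated_at = clock()
    return [{"lot_id": lot_id, "designation_at": designated_at} for lot_id in lot_ids]


clock = make_ticking_clock(datetime(2026, 5, 5, 14, 30, tzinfo=timezone.utc))
fragmented = close_per_iteration(["lot-1", "lot-2", "lot-3"], clock)
coherent = close_single_designation(["lot-1", "lot-2", "lot-3"], clock)

print(len({r["designation_at"] for r in fragmented}))  # 3 distinct timestamps
print(len({r["designation_at"] for r in coherent}))    # 1 shared timestamp
```
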
**Claude Opus unique findings (not in either other model):**
- Realized P&L excludes commissions/fees: formula uses `sell_fill.price` (raw
  execution price) minus `lot.cost_basis`, not net proceeds. If cost_basis also
  excludes buy-side commissions, P&L is doubly overstated. Active trader doing
  1000 trades/year: ~$20,000+ cumulative P&L overstatement.
- Position `average_cost` is meaningless under SpecID and potentially misleading:
  SpecID exists to exploit lot-level basis differences, but position-level average
  obscures this. If downstream consumers use average_cost for tax estimation,
  results can be 50%+ wrong per lot.
- GenServer mailbox ordering determines lot-to-fill assignment for concurrent sells:
  two simultaneous fills for the same instrument get different lots based on network
  arrival timing. With different holding periods, produces $670+ tax difference
  without user awareness.
- Wash sale rule completely unaddressed: system reports losses as realized/deductible
  without checking 30-day substantially identical purchase rule. Active trader
  harvesting $50,000 in losses could have $0 actually deductible, an $18,500 tax gap.
- `opened_at` semantics undefined: whether it's exchange execution time, GenServer
  arrival time, or settlement date affects every downstream calculation (FIFO/LIFO
  ordering, holding periods, tax terms). Network timing could produce wrong FIFO
  lot selection.

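Two of the Opus findings above reduce to simple arithmetic, sketched below in Python. The function names and fee amounts are assumptions for illustration (a flat $10 fee per side is not a figure from the model output; it is just one input that lands on the ~$20,000 order of magnitude). The wash-sale helper checks only the 30-days-before-or-after window and does not attempt to decide whether two securities are substantially identical.

```python
from datetime import date
from decimal import Decimal


def gross_pnl(sell_price: Decimal, cost_basis: Decimal, qty: int) -> Decimal:
    """Raw execution price minus per-share basis, as the finding describes."""
    return (sell_price - cost_basis) * qty


def net_pnl(sell_price, cost_basis, qty, buy_fee, sell_fee) -> Decimal:
    """Net proceeds minus fee-adjusted basis."""
    net_proceeds = sell_price * qty - sell_fee
    adjusted_basis = cost_basis * qty + buy_fee
    return net_proceeds - adjusted_basis


def in_wash_sale_window(loss_sale: date, purchases) -> bool:
    """True if any repurchase falls within 30 days before or after the loss sale."""
    return any(abs((purchase - loss_sale).days) <= 30 for purchase in purchases)


# Illustrative trade: 100 shares bought at $100, sold at $105, $10 fee per side (assumed).
fee = Decimal("10")
gross = gross_pnl(Decimal("105"), Decimal("100"), 100)
net = net_pnl(Decimal("105"), Decimal("100"), 100, buy_fee=fee, sell_fee=fee)
print(gross, net, gross - net)   # 500 480 20
print((gross - net) * 1000)      # 20000 across 1000 such trades

# Loss sale on 2026-03-10 with a repurchase 15 days later falls inside the window.
print(in_wash_sale_window(date(2026, 3, 10), [date(2026, 3, 25)]))  # True
print(in_wash_sale_window(date(2026, 3, 10), [date(2026, 4, 20)]))  # False
```
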
**Claude Sonnet 4.6 unique findings (not in either other model):**
- Stale cost basis in manual lot picker during concurrent corporate actions: UI shows
  pre-action basis, user selects based on stale data, but `close/4` only validates
  open/ownership/quantity and never re-validates that the selection rationale is still
  correct. No field records the discrepancy.
- `average_cost` recomputation ordering ambiguity in event-sourced model: step 4
  recomputes from "updated lots" but step 3 (persist events) may not have completed;
  if the implementation re-derives from the event store rather than in-memory state, it
  reads pre-closure lot quantities. Accumulates $500+ error per partial close.
- Strategy fallback + config corruption silently overwrites selection method in
  compliance record: if config becomes invalid, fallback to :fifo is logged at
  :warning but LotClosed records `selection_method: "fifo"`, so the compliance record
  shows user "chose" FIFO when they configured HIFO. No field records intended vs
  actual strategy.

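The strategy-fallback finding points at a missing field rather than a wrong computation. The Python sketch below shows one way a closed-lot record could carry both the intended and the applied strategy. It is a hypothetical design, not something the specification defines: the set of valid strategy names and the `configured_method` and `fallback_applied` fields are invented for illustration.

```python
import logging

logger = logging.getLogger("lot_selection")

# Assumed set of valid strategy names; the specification may define others.
VALID_STRATEGIES = {"fifo", "lifo", "hifo", "manual"}


def resolve_strategy(configured):
    """Return the applied method plus what was configured, so a fallback stays visible."""
    if isinstance(configured, str) and configured in VALID_STRATEGIES:
        return {
            "selection_method": configured,
            "configured_method": configured,
            "fallback_applied": False,
        }
    logger.warning("invalid strategy %r, falling back to fifo", configured)
    return {
        "selection_method": "fifo",
        "configured_method": configured,  # keep the raw value for the audit trail
        "fallback_applied": True,
    }


print(resolve_strategy("hifo")["fallback_applied"])  # False
corrupted = resolve_strategy("hif0")                 # e.g. a corrupted config value
print(corrupted["selection_method"], corrupted["fallback_applied"])  # fifo True
```

With the two extra fields, a reviewer can distinguish "user chose FIFO" from "system fell back to FIFO" without consulting the logs.
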
**Quality assessment:**
- **Claude Opus** produced the most findings (10) with the broadest analytical scope.
  Several findings went BEYOND the document's mechanism to identify missing features
  that create silent incorrectness (wash sale rules, commission handling, opened_at
  semantics). This is a different analytical mode: Opus identified what the system
  SHOULD compute but DOESN'T, not just where the existing computation is wrong.
  The wash sale finding is the highest-impact across all three models: an active
  trader's entire tax-loss harvesting strategy could be invalid. The GenServer
  mailbox ordering finding shows characteristic Opus reasoning about emergent
  behavior from design decisions.
- **GPT-5** produced fewer findings (7) but with extreme precision and specificity.
  Every finding includes concrete dollar amounts and specific field references.
  The corporate action stale basis finding is uniquely actionable: it identifies a
  specific race condition between two documented mechanisms (`close/4` and
  `apply_corporate_action/3`) that produces permanently incorrect persisted data
  with no correction path. The designation_at fragmentation finding shows attention
  to implementation detail that neither Claude model noticed. GPT-5 used 10,496
  reasoning tokens for 7 findings (~1,500 tokens/finding), indicating HIGH verification,
  consistent with Finding #20's pattern for precision-over-breadth tasks.
- **Claude Sonnet 4.6** produced 6 findings with strong specificity and novel angles.
  The event-sourced recomputation ordering finding (#5) is architecturally subtle:
  it identifies a composition error between the walk-and-consume algorithm's step
  ordering and event-sourcing patterns. The strategy fallback compliance recording
  finding is a genuine audit hazard. However, Sonnet produced no Medium-severity
  findings: it either found Critical/High issues or filtered everything else out.
  This aligns with its established high-precision, high-self-filtering behavior.

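The per-finding ratios quoted above can be reproduced from the results table. The short Python sketch below does only that arithmetic; the dictionary layout is an illustrative choice, and reasoning tokens for the Claude models are left as `None` because the table reports them as "(internal)".

```python
# Figures copied from the results table; severity is (critical, high, medium).
runs = {
    "GPT-5": {"output": 13006, "reasoning": 10496, "findings": 7, "severity": (2, 2, 3)},
    "Claude Opus 4.6": {"output": 5902, "reasoning": None, "findings": 10, "severity": (3, 3, 4)},
    "Claude Sonnet 4.6": {"output": 6011, "reasoning": None, "findings": 6, "severity": (3, 3, 0)},
}

for model, run in runs.items():
    # Sanity check: severity counts should add up to the findings total.
    assert sum(run["severity"]) == run["findings"], model
    output_per_finding = round(run["output"] / run["findings"])
    if run["reasoning"] is None:
        reasoning_per_finding = "n/a"
    else:
        reasoning_per_finding = round(run["reasoning"] / run["findings"])
    print(f"{model}: {output_per_finding} output tokens/finding, "
          f"{reasoning_per_finding} reasoning tokens/finding")

# GPT-5: 1858 output tokens/finding, 1499 reasoning tokens/finding
# Claude Opus 4.6: 590 output tokens/finding, n/a reasoning tokens/finding
# Claude Sonnet 4.6: 1002 output tokens/finding, n/a reasoning tokens/finding
```
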
**Key insight: "Silent correctness" as an analytical lens**

This is the FIRST experiment testing a "silent incorrectness" prompt. The key
difference from previous analytical lenses:
- **Assumption-finding:** "What must be true for this to work?" (Findings #10-12)
- **Race conditions:** "What timing issues exist?" (Finding #13)
- **Design coherence:** "Does the design contradict itself?" (Finding #15)
- **Invariant violations:** "What operation sequences break invariants?" (Finding #20)
- **Silent correctness:** "Where does the system CONFIDENTLY produce WRONG output
  with NO indication of error?"

The silent correctness lens produced qualitatively different findings from all
previous lenses. The emphasis on "passes all validation" forced models to reason
about what SHOULD be validated but ISN'T, and about semantic correctness (regulatory
requirements, financial accounting rules) vs syntactic correctness (valid types,
non-nil fields, correct schema).

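The gap between syntactic and semantic correctness can be made concrete with a minimal Python sketch. The record fields, the `executed_at` name, and the semantic rule itself are assumptions chosen for illustration; the rule is not a statement of what the regulation requires. The point is only that a record can satisfy every type and presence check while still failing a domain-level check.

```python
from datetime import datetime, timezone

# Hypothetical schema: field name -> required type.
REQUIRED_FIELDS = {
    "lot_id": str,
    "selection_method": str,
    "executed_at": datetime,
    "designation_at": datetime,
}


def passes_schema(record: dict) -> bool:
    """Syntactic check: every required field is present with the right type."""
    return all(isinstance(record.get(name), kind) for name, kind in REQUIRED_FIELDS.items())


def semantic_issues(record: dict) -> list:
    """Illustrative domain check: flag a designation stamped after the trade executed."""
    issues = []
    if record["designation_at"] > record["executed_at"]:
        issues.append("designation_at is later than executed_at")
    return issues


record = {
    "lot_id": "lot-42",
    "selection_method": "manual",
    "executed_at": datetime(2026, 5, 5, 14, 30, 0, tzinfo=timezone.utc),
    # Stamped at processing time, a few seconds after the trade.
    "designation_at": datetime(2026, 5, 5, 14, 30, 7, tzinfo=timezone.utc),
}

print(passes_schema(record))    # True
print(semantic_issues(record))  # ['designation_at is later than executed_at']
```
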
This lens also revealed a key model differentiation not seen before:
- **Opus reasons about MISSING functionality** (wash sales, commissions, opened_at
  semantics): things the system should do but doesn't
- **GPT-5 reasons about EXISTING functionality being wrong** (corporate action race,
  designation fragmentation, LIFO labeling): things the system does but incorrectly
- **Sonnet reasons about COMPOSITION failures** (event-sourcing step ordering,
  strategy fallback propagation): things that are individually correct but combine
  incorrectly

These are three genuinely different analytical modes, not just "more/less thorough."
All three are valuable for different review outcomes: Opus for feature completeness,
GPT-5 for mechanism correctness, Sonnet for integration correctness.

**Financial domain advantage:**

This is the first experiment on a document with strong regulatory/financial semantics.
All three models demonstrated domain knowledge (IRS holding period rules, Treas. Reg.
1.1012-1(c) requirements, wash sale IRC §1091, long-term/short-term capital gains
rate differentials). Opus in particular referenced specific IRC sections and provided
concrete tax rate calculations. The "silent incorrectness" lens works especially well
on financial/regulatory documents because the gap between "syntactically valid output"
and "semantically/legally correct output" is large and consequential.

**Comparison to previous findings on the same models:**

| Task type | GPT-5 findings | Opus findings | Sonnet findings | Opus > GPT-5? |
|---|---|---|---|---|
| Hidden assumptions (#10-12) | 20-35 | 12-13 | 13-17 | No |
| Race conditions (#13) | 12 | 10 | 7 | No |
| Design coherence (#15) | 4 | 7 | 5 | **Yes** |
| Invariant violations (#20) | 3 | 7 | 5 | **Yes** |
| Silent correctness (#22) | 7 | 10 | 6 | **Yes** |

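The "Opus > GPT-5?" column can be re-derived from the counts in the table. The Python sketch below stores each count as a (low, high) range, since Findings #10-12 report ranges, and uses a deliberately conservative comparison (Opus counts as ahead only if its lowest count exceeds GPT-5's highest). That comparison rule is an assumption for illustration; the table does not state how the column was decided.

```python
# (low, high) finding counts copied from the comparison table.
comparison = {
    "Hidden assumptions (#10-12)": {"gpt5": (20, 35), "opus": (12, 13), "sonnet": (13, 17)},
    "Race conditions (#13)": {"gpt5": (12, 12), "opus": (10, 10), "sonnet": (7, 7)},
    "Design coherence (#15)": {"gpt5": (4, 4), "opus": (7, 7), "sonnet": (5, 5)},
    "Invariant violations (#20)": {"gpt5": (3, 3), "opus": (7, 7), "sonnet": (5, 5)},
    "Silent correctness (#22)": {"gpt5": (7, 7), "opus": (10, 10), "sonnet": (6, 6)},
}


def opus_ahead(row: dict) -> bool:
    """Conservative rule (assumed): Opus's lowest count beats GPT-5's highest."""
    return row["opus"][0] > row["gpt5"][1]


for task, row in comparison.items():
    print(f"{task}: {'Yes' if opus_ahead(row) else 'No'}")

# Hidden assumptions (#10-12): No
# Race conditions (#13): No
# Design coherence (#15): Yes
# Invariant violations (#20): Yes
# Silent correctness (#22): Yes
```
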
Pattern confirmed: Opus outperforms GPT-5 (by finding count) on tasks that require
reasoning about the design's RELATIONSHIP to external requirements (regulatory,
financial, consumer expectations). GPT-5 outperforms Opus on tasks that require
EXHAUSTIVE EXPLORATION within a self-contained system (assumptions, race conditions).

The "silent correctness" lens is structurally similar to coherence checking (does the
system match its external requirements?) rather than gap-finding (what's missing
within the system?). This explains why Opus outperforms: the task requires reasoning
about the world outside the document (IRS rules, financial accounting standards,
regulatory requirements), which is Opus's strength.

**Practical implication:**
For financial/regulatory system review, the "silent correctness" lens should be
run using Opus as the primary model (broadest findings including missing-feature
identification) plus GPT-5 for mechanism-level precision. Sonnet adds value for
composition/integration issues that neither Opus nor GPT-5 catches. All three
produced unique, actionable findings that the others missed.

The four findings ALL models converged on (designation_at, holding period, HIFO
tie-breaker, strategy preference timing) should be treated as confirmed design
bugs requiring fixes. The fact that three independent models all identified them
with concrete financial impact examples increases confidence that these are real.