model-research/findings/2026-05-05-22-silent-correctness-failures-new-analytical.md

# Finding 22: Silent correctness failures: NEW analytical lens reveals Opus's strength at domain/regulatory reasoning; GPT-5 shows regulatory depth; all models converge on compliance timestamp errors

**Date:** 2026-05-05
**Task:** Identify scenarios where the mechanism produces SILENTLY INCORRECT results
(not errors, not crashes — wrong financial calculations, wrong lot selections, or wrong
compliance records that pass all validation) in gargoyle's `specid-lot-selection.md`
(306 lines) — a financial system specification covering tax lot selection strategies,
cost basis accounting, and IRS SpecID compliance.
**How we used them:** Same document (full text) + same focused analytical question to
all 3 models via HAI proxy. Highly structured prompt specifying 5 categories of silent
incorrectness (stale data, semantic precision, ordering sensitivity, composition errors,
temporal reference errors). Required specific output format per finding with concrete
numerical examples of financial impact. No tools, no project context beyond the document.

| Model | Time | Output tokens | Reasoning tokens | Findings | Critical | High | Medium |
|---|---|---|---|---|---|---|---|
| GPT-5 | 147s | 13,006 | 10,496 | 7 | 2 | 2 | 3 |
| Claude Opus 4.6 | 119s | 5,902 | (internal) | 10 | 3 | 3 | 4 |
| Claude Sonnet 4.6 | 122s | 6,011 | (internal) | 6 | 3 | 3 | 0 |

**What they found — common ground (all 3 identified):**
- `designation_at` = `DateTime.utc_now()` at processing time, NOT at actual
  designation time (manual selection was made at order submission, standing
  orders were configured earlier) — compliance record factually incorrect
- Holding period calculation boundary errors (>365 days vs IRS "more than one
  year" rule, off-by-one at leap year boundaries, day-after-acquisition start)
- HIFO tie-breaker `opened_at ASC` ignores tax_term dimension — selects
  long-term losses over short-term losses when both have identical cost basis,
  producing less tax-valuable outcomes
- Strategy preference resolved at fill processing time, not at trade time
  (preference changes between trade and fill processing apply retroactively)
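
The holding-period boundary issue can be made concrete with a small sketch. It is written in Python for illustration only (the system under review appears to be Elixir-based), and the function names are hypothetical. It contrasts a `> 365 days` check with a calendar-anniversary check, on the assumption that the holding period starts the day after acquisition and that "more than one year" means a sale after the anniversary of the acquisition date. The Feb 29 fallback is likewise an assumption, not something the spec states.

```python
from datetime import date


def naive_is_long_term(acquired: date, sold: date) -> bool:
    """Day-count check of the kind the finding describes: more than 365 days."""
    return (sold - acquired).days > 365


def calendar_is_long_term(acquired: date, sold: date) -> bool:
    """Calendar check: long-term only if sold after the acquisition anniversary."""
    try:
        anniversary = acquired.replace(year=acquired.year + 1)
    except ValueError:
        # Acquired on Feb 29 and the next year has no Feb 29.
        # Assumption for this sketch: treat Feb 28 as the anniversary.
        anniversary = date(acquired.year + 1, 2, 28)
    return sold > anniversary


# The one-year span below contains Feb 29, 2024, so it is 366 days long.
acquired = date(2023, 3, 1)
sold_on_anniversary = date(2024, 3, 1)

naive_result = naive_is_long_term(acquired, sold_on_anniversary)
calendar_result = calendar_is_long_term(acquired, sold_on_anniversary)
```

Here the two checks disagree: the day-count version classifies a sale on the anniversary as long-term, while the calendar version treats it as short-term. Both outputs are well-formed, which is why this class of error passes validation.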

**GPT-5 unique findings (not in either Claude model):**
- Corporate action applied late, leaving stale cost basis in HIFO: an ROC/dividend
  reduces basis, but if `close/4` fires before `apply_corporate_action/3`, HIFO sorts
  on the pre-adjusted basis AND records the wrong realized P&L permanently. There is
  no mechanism to restate previously persisted `LotClosed` events. Concrete example:
  $2,000 overstated loss from one trade.
- `designation_at` fragmentation: a single sell consuming multiple lots calls
  `DateTime.utc_now()` per loop iteration, producing slightly different timestamps
  for what should be a single coherent designation event. Audit risk.
- LIFO label in `selection_method` field: records "lifo" but for securities LIFO
  isn't an authorized tax method — the operation is legally SpecID electing
  newest lots. Downstream reporting may reject or misclassify.
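
The `designation_at` fragmentation finding can be sketched as follows. This is a Python illustration rather than the spec's Elixir, and the `LotClosed` shape, function names, and test clock are all hypothetical. The only point is the difference between reading the clock once per lot and reading it once per sell. It does not address the separate common-ground finding that the timestamp should reflect when the designation was actually made.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable, Iterable, List, Tuple

Clock = Callable[[], datetime]


@dataclass(frozen=True)
class LotClosed:
    # Hypothetical event shape; the real event has more fields.
    lot_id: str
    quantity: int
    designation_at: datetime


def close_lots_fragmented(lots: Iterable[Tuple[str, int]], now: Clock) -> List[LotClosed]:
    """Reads the clock inside the loop: one timestamp per consumed lot."""
    return [LotClosed(lot_id, qty, now()) for lot_id, qty in lots]


def close_lots_coherent(lots: Iterable[Tuple[str, int]], now: Clock) -> List[LotClosed]:
    """Reads the clock once, so every event from one sell shares a timestamp."""
    designated_at = now()
    return [LotClosed(lot_id, qty, designated_at) for lot_id, qty in lots]


def ticking_clock(start: datetime) -> Clock:
    """Deterministic test clock that advances one microsecond per call."""
    state = {"calls": 0}

    def now() -> datetime:
        state["calls"] += 1
        return start + timedelta(microseconds=state["calls"])

    return now


lots = [("lot-1", 40), ("lot-2", 35), ("lot-3", 25)]
start = datetime(2026, 5, 5, tzinfo=timezone.utc)

fragmented = close_lots_fragmented(lots, ticking_clock(start))
coherent = close_lots_coherent(lots, ticking_clock(start))
```

Passing the clock in as a parameter is a design choice made for this sketch so the behavior can be tested deterministically. In production code the argument would typically be something like `lambda: datetime.now(timezone.utc)`.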

**Claude Opus unique findings (not in either other model):**
- Realized P&L excludes commissions/fees: formula uses `sell_fill.price` (raw
  execution price) minus `lot.cost_basis`, not net proceeds. If cost_basis also
  excludes buy-side commissions, P&L is doubly overstated. For an active trader
  making 1,000 trades/year: $20,000+ cumulative P&L overstatement.
- Position `average_cost` is meaningless under SpecID and potentially misleading:
  SpecID exists to exploit lot-level basis differences, but position-level average
  obscures this. If downstream consumers use average_cost for tax estimation,
  results can be 50%+ wrong per lot.
- GenServer mailbox ordering determines lot-to-fill assignment for concurrent sells:
  two simultaneous fills for the same instrument get different lots based on network
  arrival timing. With different holding periods, produces $670+ tax difference
  without user awareness.
- Wash sale rule completely unaddressed: system reports losses as realized/deductible
  without checking 30-day substantially identical purchase rule. Active trader
  harvesting $50,000 in losses could have $0 actually deductible — $18,500 tax gap.
- `opened_at` semantics undefined: whether it's exchange execution time, GenServer
  arrival time, or settlement date affects every downstream calculation (FIFO/LIFO
  ordering, holding periods, tax terms). Network timing could produce wrong FIFO
  lot selection.
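
The commission finding is easiest to see as arithmetic. The sketch below is a hypothetical Python illustration: the function names, the per-share interpretation of `cost_basis`, and the $10 commissions are assumptions chosen only to show the mechanism. They are not values taken from the spec or from the model's output.

```python
from decimal import Decimal


def gross_pnl(sell_price: Decimal, cost_basis_per_share: Decimal, quantity: int) -> Decimal:
    """Raw execution price minus cost basis, ignoring commissions and fees."""
    return (sell_price - cost_basis_per_share) * quantity


def net_pnl(
    sell_price: Decimal,
    cost_basis_per_share: Decimal,
    quantity: int,
    buy_commission: Decimal,
    sell_commission: Decimal,
) -> Decimal:
    """Net proceeds minus commission-adjusted basis."""
    proceeds = sell_price * quantity - sell_commission
    basis = cost_basis_per_share * quantity + buy_commission
    return proceeds - basis


# Hypothetical trade: 100 shares bought at $50, sold at $55, $10 commission each way.
gross = gross_pnl(Decimal("55"), Decimal("50"), 100)
net = net_pnl(Decimal("55"), Decimal("50"), 100, Decimal("10"), Decimal("10"))
overstatement_per_trade = gross - net
overstatement_per_1000_trades = overstatement_per_trade * 1000
```

With these illustrative inputs the gross figure is $500 and the net figure is $480, so each round trip overstates P&L by $20. Across 1,000 such trades that is $20,000, the same order of magnitude as the cumulative figure cited in the finding.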

**Claude Sonnet 4.6 unique findings (not in either other model):**
- Stale cost basis in manual lot picker during concurrent corporate actions: UI shows
  pre-action basis, user selects based on stale data, but close/4 only validates
  open/ownership/quantity — never re-validates that the selection rationale is still
  correct. No field records the discrepancy.
- `average_cost` recomputation ordering ambiguity in event-sourced model: step 4
  recomputes from "updated lots" but step 3 (persist events) may not have completed
  — if implementation re-derives from event store rather than in-memory state, reads
  pre-closure lot quantities. Accumulates $500+ error per partial close.
- Strategy fallback + config corruption silently overwrites selection method in
  compliance record: if config becomes invalid, fallback to :fifo is logged at
  :warning but LotClosed records `selection_method: "fifo"` — compliance record
  shows user "chose" FIFO when they configured HIFO. No field records intended vs
  actual strategy.
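
One possible shape for the missing "intended vs actual" record from the strategy-fallback finding is sketched below, again in Python for illustration. The `StrategyResolution` type, its field names, and the set of valid strategies are hypothetical; the spec as described records only a single `selection_method`.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed set of strategy names for this sketch.
VALID_STRATEGIES = frozenset({"fifo", "lifo", "hifo", "manual"})
FALLBACK_STRATEGY = "fifo"


@dataclass(frozen=True)
class StrategyResolution:
    configured: Optional[str]  # what the user's config asked for, if readable
    applied: str               # what the lot selection actually used
    fell_back: bool            # True when applied differs because config was invalid


def resolve_strategy(configured: object) -> StrategyResolution:
    """Resolve a strategy while preserving the configured value for the audit trail."""
    if isinstance(configured, str) and configured in VALID_STRATEGIES:
        return StrategyResolution(configured, configured, False)
    # Keep a string form of the bad value so the compliance record can show it.
    recorded = None if configured is None else str(configured)
    return StrategyResolution(recorded, FALLBACK_STRATEGY, True)


ok = resolve_strategy("hifo")
corrupted = resolve_strategy("hif0")  # simulated config corruption
missing = resolve_strategy(None)
```

Carrying `configured`, `applied`, and `fell_back` together into the compliance record would let an auditor distinguish "user chose FIFO" from "system fell back to FIFO". Whether that belongs on `LotClosed` or in a separate event is a design decision this sketch does not settle.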

**Quality assessment:**
- **Claude Opus** produced the most findings (10) with the broadest analytical scope.
  Several findings went BEYOND the document's mechanism to identify missing features
  that create silent incorrectness (wash sale rules, commission handling, opened_at
  semantics). This is a different analytical mode: Opus identified what the system
  SHOULD compute but DOESN'T, not just where the existing computation is wrong.
  The wash sale finding is the highest-impact across all three models — an active
  trader's entire tax-loss harvesting strategy could be invalid. The GenServer
  mailbox ordering finding shows characteristic Opus reasoning about emergent
  behavior from design decisions.
- **GPT-5** produced fewer findings (7) but with extreme precision and specificity.
  Every finding includes concrete dollar amounts and specific field references.
  The corporate action stale basis finding is uniquely actionable — it identifies a
  specific race condition between two documented mechanisms (close/4 and
  apply_corporate_action/3) that produces permanently incorrect persisted data
  with no correction path. The designation_at fragmentation finding shows attention
  to implementation detail that neither Claude model noticed. GPT-5 used 10,496
  reasoning tokens for 7 findings (about 1,500 reasoning tokens per finding), a high
  verification effort consistent with Finding #20's pattern for precision-over-breadth tasks.
- **Claude Sonnet 4.6** produced 6 findings with strong specificity and novel angles.
  The event-sourced recomputation ordering finding (#5) is architecturally subtle —
  it identifies a composition error between the walk-and-consume algorithm's step
  ordering and event-sourcing patterns. The strategy fallback compliance recording
  finding is a genuine audit hazard. However, Sonnet produced no Medium-severity
  findings — it either found Critical/High issues or filtered everything else out.
  This aligns with its established high-precision, high-self-filtering behavior.

**Key insight — "Silent correctness" as an analytical lens:**

This is the FIRST experiment testing a "silent incorrectness" prompt. The key
difference from previous analytical lenses:
- **Assumption-finding:** "What must be true for this to work?" (Finding #10-12)
- **Race conditions:** "What timing issues exist?" (Finding #13)
- **Design coherence:** "Does the design contradict itself?" (Finding #15)
- **Invariant violations:** "What operation sequences break invariants?" (Finding #20)
- **Silent correctness:** "Where does the system CONFIDENTLY produce WRONG output
  with NO indication of error?"

The silent correctness lens produced qualitatively different findings from all
previous lenses. The emphasis on "passes all validation" forced models to reason
about what SHOULD be validated but ISN'T, and about semantic correctness (regulatory
requirements, financial accounting rules) vs syntactic correctness (valid types,
non-nil fields, correct schema).

This lens also revealed a key model differentiation not seen before:
- **Opus reasons about MISSING functionality** (wash sales, commissions, opened_at
  semantics) — things the system should do but doesn't
- **GPT-5 reasons about EXISTING functionality being wrong** (corporate action race,
  designation fragmentation, LIFO labeling) — things the system does but incorrectly
- **Sonnet reasons about COMPOSITION failures** (event-sourcing step ordering,
  strategy fallback propagation) — things that are individually correct but combine
  incorrectly

These are three genuinely different analytical modes, not just "more/less thorough."
All three are valuable for different review outcomes: Opus for feature completeness,
GPT-5 for mechanism correctness, Sonnet for integration correctness.

**Financial domain advantage:**

This is the first experiment on a document with strong regulatory/financial semantics.
All three models demonstrated domain knowledge (IRS holding period rules, Treas. Reg.
1.1012-1(c) requirements, wash sale IRC §1091, long-term/short-term capital gains
rate differentials). Opus in particular referenced specific IRC sections and provided
concrete tax rate calculations. The "silent incorrectness" lens works especially well
on financial/regulatory documents because the gap between "syntactically valid output"
and "semantically/legally correct output" is large and consequential.

**Comparison to previous findings on the same models:**

| Task type | GPT-5 findings | Opus findings | Sonnet findings | Opus > GPT-5? |
|---|---|---|---|---|
| Hidden assumptions (#10-12) | 20-35 | 12-13 | 13-17 | No |
| Race conditions (#13) | 12 | 10 | 7 | No |
| Design coherence (#15) | 4 | 7 | 5 | **Yes** |
| Invariant violations (#20) | 3 | 7 | 5 | **Yes** |
| Silent correctness (#22) | 7 | 10 | 6 | **Yes** |

Pattern confirmed: Opus outperforms GPT-5 (by finding count) on tasks that require
reasoning about the design's RELATIONSHIP to external requirements (regulatory,
financial, consumer expectations). GPT-5 outperforms Opus on tasks that require
EXHAUSTIVE EXPLORATION within a self-contained system (assumptions, race conditions).

The "silent correctness" lens is structurally similar to coherence checking (does the
system match its external requirements?) rather than gap-finding (what's missing
within the system?). This explains why Opus outperforms: the task requires reasoning
about the world outside the document (IRS rules, financial accounting standards,
regulatory requirements), which is Opus's strength.

**Practical implication:**
For financial/regulatory system review, the "silent correctness" lens should be
run using Opus as the primary model (broadest findings including missing-feature
identification) plus GPT-5 for mechanism-level precision. Sonnet adds value for
composition/integration issues that neither Opus nor GPT-5 catches. All three
produced unique, actionable findings that the others missed.

The four findings ALL models converged on (designation_at, holding period, HIFO
tie-breaker, strategy preference timing) should be treated as confirmed design
bugs requiring fixes. The fact that three independent models all identified them
with concrete financial impact examples increases confidence that these are real.
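
As an illustration of what a fix for the HIFO tie-breaker might look like, the Python sketch below compares the documented ordering (highest cost basis first, then `opened_at` ascending) with a term-aware variant. The `Lot` shape and function names are hypothetical, and preferring short-term lots on a tie is an assumption that fits the loss-harvesting case described in the finding. It may not be the right rule when the sale realizes a gain.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal
from typing import List


@dataclass(frozen=True)
class Lot:
    # Hypothetical lot shape with only the fields the ordering needs.
    lot_id: str
    cost_basis: Decimal
    opened_at: date
    tax_term: str  # "short" or "long"


def hifo_order_documented(lots: List[Lot]) -> List[Lot]:
    """Highest cost basis first; ties broken by opened_at ascending."""
    return sorted(lots, key=lambda lot: (-lot.cost_basis, lot.opened_at))


def hifo_order_term_aware(lots: List[Lot]) -> List[Lot]:
    """Highest cost basis first; on ties, short-term lots first, then opened_at."""
    return sorted(
        lots,
        key=lambda lot: (-lot.cost_basis, lot.tax_term != "short", lot.opened_at),
    )


lots = [
    Lot("old-long", Decimal("100"), date(2024, 1, 10), "long"),
    Lot("new-short", Decimal("100"), date(2025, 11, 3), "short"),
    Lot("cheap", Decimal("80"), date(2023, 6, 1), "long"),
]

documented_ids = [lot.lot_id for lot in hifo_order_documented(lots)]
term_aware_ids = [lot.lot_id for lot in hifo_order_term_aware(lots)]
```

With identical cost basis, the documented ordering consumes the older long-term lot first, while the term-aware ordering consumes the short-term lot first. Both orderings are valid HIFO, so no validation step would flag the difference.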