Rodin 6af8a6ee10 refactor(findings): split ALL-FINDINGS.md into per-experiment files

Break the monolithic 3249-line findings file into 29 individual files,
one per experiment. Each file is named YYYY-MM-DD-NN-slug.md for easy
chronological sorting and discovery.

No content changes — purely structural reorganization.

2026-05-06 07:15:50 -07:00

Finding 22: Silent correctness failures: NEW analytical lens reveals Opus's strength at domain/regulatory reasoning; GPT-5 shows regulatory depth; all models converge on compliance timestamp errors

Date: 2026-05-05

Task: Identify scenarios where the mechanism produces SILENTLY INCORRECT results (not errors, not crashes: wrong financial calculations, wrong lot selections, or wrong compliance records that pass all validation) in gargoyle's specid-lot-selection.md (306 lines), a financial system specification covering tax lot selection strategies, cost basis accounting, and IRS SpecID compliance.

How we used them: Same document (full text) + same focused analytical question to all 3 models via HAI proxy. Highly structured prompt specifying 5 categories of silent incorrectness (stale data, semantic precision, ordering sensitivity, composition errors, temporal reference errors). Required specific output format per finding with concrete numerical examples of financial impact. No tools, no project context beyond the document.

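| Model | Time | Output tokens | Reasoning tokens | Findings | Critical | High | Medium |
|---|---|---|---|---|---|---|---|
| GPT-5 | 147s | 13,006 | 10,496 | 7 | 2 | 2 | 3 |
| Claude Opus 4.6 | 119s | 5,902 | (internal) | 10 | 3 | 3 | 4 |
| Claude Sonnet 4.6 | 122s | 6,011 | (internal) | 6 | 3 | 3 | 0 |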

What they found — common ground (all 3 identified):

designation_at = DateTime.utc_now() at processing time, NOT at actual designation time (manual selection was made at order submission, standing orders were configured earlier) — compliance record factually incorrect
Holding period calculation boundary errors (>365 days vs IRS "more than one year" rule, off-by-one at leap year boundaries, day-after-acquisition start)
HIFO tie-breaker opened_at ASC ignores tax_term dimension — selects long-term losses over short-term losses when both have identical cost basis, producing less tax-valuable outcomes
Strategy preference resolved at fill processing time, not at trade time (preference changes between trade and fill processing apply retroactively)
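
The holding-period boundary item can be illustrated with a minimal Python sketch (the reviewed spec is Elixir; every function name below is hypothetical). It assumes the usual reading of the "more than one year" rule: the holding period starts the day after acquisition, so a lot is long-term only when sold after the one-year anniversary of the acquisition date. The Feb 29 fallback to Feb 28 is an assumption made for the sketch, not something the spec states.

```python
from datetime import date


def naive_is_long_term(acquired: date, sold: date) -> bool:
    # Day-count test of the kind the findings flag: "> 365 days".
    return (sold - acquired).days > 365


def one_year_anniversary(acquired: date) -> date:
    try:
        return acquired.replace(year=acquired.year + 1)
    except ValueError:
        # Feb 29 acquisition: assumed to fall back to Feb 28 of the next year.
        return acquired.replace(year=acquired.year + 1, day=28)


def calendar_is_long_term(acquired: date, sold: date) -> bool:
    # Long-term only if the sale falls after the anniversary date.
    return sold > one_year_anniversary(acquired)


# The span crosses Feb 29, 2024, so it is 366 days long even though the
# sale lands exactly on the anniversary.
acquired = date(2023, 3, 1)
sold = date(2024, 3, 1)
print((sold - acquired).days, naive_is_long_term(acquired, sold), calendar_is_long_term(acquired, sold))
```

In this example the two tests disagree, and neither raises an error: the day-count version classifies the lot as long-term while the calendar version still treats it as short-term, which is the kind of result that passes validation while being wrong.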

GPT-5 unique findings (not in either Claude model):

Corporate action applied late leaves a stale cost basis in HIFO: a ROC/dividend reduces basis, but if close/4 fires before apply_corporate_action/3, HIFO sorts on the pre-adjustment basis AND records the wrong realized P&L permanently. There is no mechanism to restate previously persisted LotClosed events. Concrete example: $2,000 overstated loss from one trade.
designation_at fragmentation: a single sell consuming multiple lots calls DateTime.utc_now() per loop iteration, producing slightly different timestamps for what should be a single coherent designation event. Audit risk.
LIFO label in selection_method field: records "lifo" but for securities LIFO isn't an authorized tax method — the operation is legally SpecID electing newest lots. Downstream reporting may reject or misclassify.
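
The designation_at fragmentation item can be sketched in Python as follows. This is an illustration only: the spec is Elixir, and the function names, record shape, and ticking clock are hypothetical stand-ins for DateTime.utc_now() being called inside the lot loop. The sketch contrasts reading the clock per lot with reading it once per sell.

```python
from datetime import datetime, timedelta, timezone
from itertools import count


def make_ticking_clock(start: datetime):
    """Hypothetical clock that advances 1 ms per call, standing in for a wall clock."""
    ticks = count()

    def now() -> datetime:
        return start + timedelta(milliseconds=next(ticks))

    return now


def close_lots_fragmented(lot_ids, now):
    # Reads the clock inside the loop: one timestamp per lot.
    return [{"lot_id": lot_id, "designation_at": now()} for lot_id in lot_ids]


def close_lots_single_designation(lot_ids, now):
    # Reads the clock once, then reuses the value for every lot in the sell.
    designation_at = now()
    return [{"lot_id": lot_id, "designation_at": designation_at} for lot_id in lot_ids]


start = datetime(2026, 5, 5, 14, 30, tzinfo=timezone.utc)
fragmented = close_lots_fragmented(["lot-1", "lot-2", "lot-3"], make_ticking_clock(start))
single = close_lots_single_designation(["lot-1", "lot-2", "lot-3"], make_ticking_clock(start))
print(len({e["designation_at"] for e in fragmented}), len({e["designation_at"] for e in single}))
```

Capturing the timestamp once only removes the fragmentation. It does not address the separate common-ground finding that the timestamp reflects processing time rather than the moment the designation was actually made.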

Claude Opus unique findings (not in either other model):

Realized P&L excludes commissions/fees: formula uses sell_fill.price (raw execution price) minus lot.cost_basis, not net proceeds. If cost_basis also excludes buy-side commissions, P&L is doubly overstated. Active trader doing 1000 trades/year: ~$20,000+ cumulative P&L overstatement.
Position average_cost is meaningless under SpecID and potentially misleading: SpecID exists to exploit lot-level basis differences, but position-level average obscures this. If downstream consumers use average_cost for tax estimation, results can be 50%+ wrong per lot.
GenServer mailbox ordering determines lot-to-fill assignment for concurrent sells: two simultaneous fills for the same instrument get different lots based on network arrival timing. With different holding periods, produces $670+ tax difference without user awareness.
Wash sale rule completely unaddressed: system reports losses as realized/deductible without checking 30-day substantially identical purchase rule. Active trader harvesting $50,000 in losses could have $0 actually deductible — $18,500 tax gap.
opened_at semantics undefined: whether it's exchange execution time, GenServer arrival time, or settlement date affects every downstream calculation (FIFO/LIFO ordering, holding periods, tax terms). Network timing could produce wrong FIFO lot selection.
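
For the wash sale item, a minimal Python sketch of the missing check might look like the following. It is heavily simplified and rests on stated assumptions: the function name is hypothetical, `purchase_dates` is assumed to contain only purchases of replacement shares already judged substantially identical (not the lot being sold), and quantity matching and basis adjustment are ignored. It only flags purchases that fall within 30 days before or after a loss sale.

```python
from datetime import date, timedelta

# 30 days on either side of the loss sale.
WASH_SALE_WINDOW = timedelta(days=30)


def wash_sale_candidates(loss_sale_date: date, purchase_dates: list[date]) -> list[date]:
    # Keep purchases whose distance from the sale date is at most 30 days.
    return [d for d in purchase_dates if abs(d - loss_sale_date) <= WASH_SALE_WINDOW]


loss_sale = date(2026, 3, 15)
purchases = [
    date(2026, 1, 1),   # well outside the window
    date(2026, 2, 13),  # exactly 30 days before the sale
    date(2026, 4, 15),  # 31 days after the sale
]
flagged = wash_sale_candidates(loss_sale, purchases)
print(flagged)
```

A system without any check of this kind would report the full loss as realized and deductible for every sale, which is the silent failure the finding describes.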

Claude Sonnet 4.6 unique findings (not in either other model):

Stale cost basis in manual lot picker during concurrent corporate actions: UI shows pre-action basis, user selects based on stale data, but close/4 only validates open/ownership/quantity — never re-validates that the selection rationale is still correct. No field records the discrepancy.
average_cost recomputation ordering ambiguity in event-sourced model: step 4 recomputes from "updated lots" but step 3 (persist events) may not have completed — if implementation re-derives from event store rather than in-memory state, reads pre-closure lot quantities. Accumulates $500+ error per partial close.
Strategy fallback + config corruption silently overwrites selection method in compliance record: if config becomes invalid, fallback to :fifo is logged at :warning but LotClosed records selection_method: "fifo" — compliance record shows user "chose" FIFO when they configured HIFO. No field records intended vs actual strategy.
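
The strategy fallback item points at a missing "intended vs actual" field. One possible shape for such a record is sketched below in Python. The field names `intended_strategy` and `fallback_applied`, the function name, and the set of valid strategies are all assumptions for illustration; only `selection_method` and the fallback to FIFO come from the finding.

```python
# Assumed set of valid strategy names; the spec's actual list may differ.
VALID_STRATEGIES = {"fifo", "lifo", "hifo", "manual"}


def resolve_strategy(configured):
    """Return what was configured alongside what will actually be applied."""
    if configured in VALID_STRATEGIES:
        return {
            "intended_strategy": configured,
            "selection_method": configured,
            "fallback_applied": False,
        }
    # Invalid or corrupted config: fall back to FIFO, but keep the evidence.
    return {
        "intended_strategy": configured,
        "selection_method": "fifo",
        "fallback_applied": True,
    }


record = resolve_strategy("hfio")  # corrupted value standing in for "hifo"
print(record)
```

With both values persisted, a compliance reviewer can distinguish a user who chose FIFO from one whose configuration was silently replaced by it.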

Quality assessment:

Claude Opus produced the most findings (10) with the broadest analytical scope. Several findings went BEYOND the document's mechanism to identify missing features that create silent incorrectness (wash sale rules, commission handling, opened_at semantics). This is a different analytical mode: Opus identified what the system SHOULD compute but DOESN'T, not just where the existing computation is wrong. The wash sale finding is the highest-impact across all three models — an active trader's entire tax-loss harvesting strategy could be invalid. The GenServer mailbox ordering finding shows characteristic Opus reasoning about emergent behavior from design decisions.
GPT-5 produced fewer findings (7) but with extreme precision and specificity. Every finding includes concrete dollar amounts and specific field references. The corporate action stale basis finding is uniquely actionable: it identifies a specific race condition between two documented mechanisms (close/4 and apply_corporate_action/3) that produces permanently incorrect persisted data with no correction path. The designation_at fragmentation finding shows attention to implementation detail that neither Claude model noticed. GPT-5 used 10,496 reasoning tokens for 7 findings (about 1,500 tokens/finding), a HIGH level of verification consistent with Finding #20's pattern for precision-over-breadth tasks.
Claude Sonnet 4.6 produced 6 findings with strong specificity and novel angles. The event-sourced recomputation ordering finding (#5) is architecturally subtle — it identifies a composition error between the walk-and-consume algorithm's step ordering and event-sourcing patterns. The strategy fallback compliance recording finding is a genuine audit hazard. However, Sonnet produced no Medium-severity findings — it either found Critical/High issues or filtered everything else out. This aligns with its established high-precision, high-self-filtering behavior.

Key insight — "Silent correctness" as an analytical lens:

This is the FIRST experiment testing a "silent incorrectness" prompt. The key difference from previous analytical lenses:

Assumption-finding: "What must be true for this to work?" (Findings #10-12)
Race conditions: "What timing issues exist?" (Finding #13)
Design coherence: "Does the design contradict itself?" (Finding #15)
Invariant violations: "What operation sequences break invariants?" (Finding #20)
Silent correctness: "Where does the system CONFIDENTLY produce WRONG output with NO indication of error?"

The silent correctness lens produced qualitatively different findings from all previous lenses. The emphasis on "passes all validation" forced models to reason about what SHOULD be validated but ISN'T, and about semantic correctness (regulatory requirements, financial accounting rules) vs syntactic correctness (valid types, non-nil fields, correct schema).

This lens also revealed a key model differentiation not seen before:

Opus reasons about MISSING functionality (wash sales, commissions, opened_at semantics) — things the system should do but doesn't
GPT-5 reasons about EXISTING functionality being wrong (corporate action race, designation fragmentation, LIFO labeling) — things the system does but incorrectly
Sonnet reasons about COMPOSITION failures (event-sourcing step ordering, strategy fallback propagation) — things that are individually correct but combine incorrectly

These are three genuinely different analytical modes, not just "more/less thorough." All three are valuable for different review outcomes: Opus for feature completeness, GPT-5 for mechanism correctness, Sonnet for integration correctness.

Financial domain advantage:

This is the first experiment on a document with strong regulatory/financial semantics. All three models demonstrated domain knowledge (IRS holding period rules, Treas. Reg. 1.1012-1(c) requirements, wash sale IRC §1091, long-term/short-term capital gains rate differentials). Opus in particular referenced specific IRC sections and provided concrete tax rate calculations. The "silent incorrectness" lens works especially well on financial/regulatory documents because the gap between "syntactically valid output" and "semantically/legally correct output" is large and consequential.

Comparison to previous findings on the same models:

| Task type | GPT-5 findings | Opus findings | Sonnet findings | Opus > GPT-5? |
|---|---|---|---|---|
| Hidden assumptions (#10-12) | 20-35 | 12-13 | 13-17 | No |
| Race conditions (#13) | 12 | 10 | 7 | No |
| Design coherence (#15) | 4 | 7 | 5 | Yes |
| Invariant violations (#20) | 3 | 7 | 5 | Yes |
| Silent correctness (#22) | 7 | 10 | 6 | Yes |

Pattern confirmed: Opus outperforms GPT-5 (by finding count) on tasks that require reasoning about the design's RELATIONSHIP to external requirements (regulatory, financial, consumer expectations). GPT-5 outperforms Opus on tasks that require EXHAUSTIVE EXPLORATION within a self-contained system (assumptions, race conditions).

The "silent correctness" lens is structurally similar to coherence checking (does the system match its external requirements?) rather than gap-finding (what's missing within the system?). This explains why Opus outperforms: the task requires reasoning about the world outside the document (IRS rules, financial accounting standards, regulatory requirements), which is Opus's strength.

Practical implication: For financial/regulatory system review, the "silent correctness" lens should be run using Opus as the primary model (broadest findings including missing-feature identification) plus GPT-5 for mechanism-level precision. Sonnet adds value for composition/integration issues that neither Opus nor GPT-5 catches. All three produced unique, actionable findings that the others missed.

The four findings ALL models converged on (designation_at, holding period, HIFO tie-breaker, strategy preference timing) should be treated as confirmed design bugs requiring fixes. The fact that three independent models all identified them with concrete financial impact examples increases confidence that these are real.
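
As one example of what a fix could look like, the HIFO tie-breaker item is sketched below in Python. The lot fields mirror the names used in the findings (cost_basis, opened_at, tax_term), but the data, function names, and the term-aware ordering are hypothetical. The sketch assumes the sale is at a loss, where short-term lots are the more tax-valuable choice; for a sale at a gain the preference would reverse.

```python
from datetime import date

# Two lots with identical cost basis but different tax terms, plus a cheaper lot.
lots = [
    {"id": "A", "cost_basis": 100.0, "opened_at": date(2024, 1, 10), "tax_term": "long"},
    {"id": "B", "cost_basis": 100.0, "opened_at": date(2025, 11, 3), "tax_term": "short"},
    {"id": "C", "cost_basis": 90.0, "opened_at": date(2025, 6, 1), "tax_term": "short"},
]


def hifo_spec_order(lots):
    # Ordering as described in the findings: highest basis first, ties by opened_at ASC.
    return sorted(lots, key=lambda lot: (-lot["cost_basis"], lot["opened_at"]))


def hifo_term_aware_order(lots):
    # Hypothetical alternative for loss sales: among equal-basis lots,
    # prefer short-term before falling back to opened_at ASC.
    return sorted(
        lots,
        key=lambda lot: (-lot["cost_basis"], lot["tax_term"] != "short", lot["opened_at"]),
    )


print([lot["id"] for lot in hifo_spec_order(lots)])
print([lot["id"] for lot in hifo_term_aware_order(lots)])
```

Both orderings are valid HIFO by cost basis, so neither would fail validation; they differ only in which of the tied lots is consumed first.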
