Rodin 6af8a6ee10 refactor(findings): split ALL-FINDINGS.md into per-experiment files

Break the monolithic 3249-line findings file into 29 individual files,
one per experiment. Each file is named YYYY-MM-DD-NN-slug.md for easy
chronological sorting and discovery.

No content changes — purely structural reorganization.

2026-05-06 07:15:50 -07:00


Finding 23: Regulatory compliance analysis: GPT-5 finds broadest scope of IRS issues; Opus self-corrects mid-analysis; all models converge on cross-account gap

Date: 2026-05-05

Task: Identify where gargoyle's wash-sale-tracking.md (391 lines) could produce incorrect tax reporting compared to IRS wash sale regulations (IRC 1091). NEW analytical lens: regulatory compliance verification, i.e. asking models to reason about a code implementation's correctness against EXTERNAL regulatory requirements (not internal system assumptions or race conditions).

How we used them: Same document (full text) + same focused analytical question to all 3 models via HAI proxy. Prompt structured around 5 categories: regulatory gaps, interpretation errors, edge cases at regulatory boundaries, cross-account/entity concerns, and interaction with other IRC sections. Required specific regulatory citations, implementation analysis, concrete tax errors, and audit risk levels. No tools, no project context beyond the document.

Model	Time	Output tokens	Reasoning tokens	Findings
GPT-5	178s	12,525	9,536	16
Claude Opus 4.6	155s	7,326	(internal)	16 (with 2 self-corrections/withdrawals)
Claude Sonnet 4.6	40s	1,818	(internal)	12

What they found — common ground (all 3 identified):

Cross-account/IRA/external broker wash sales not tracked (IRS applies at taxpayer level)
Options/contracts to acquire stock not triggering wash sales (explicit in IRC 1091(a) text)
"Substantially identical" definition too narrow (same index ETFs, share classes, ADRs)
Trade date vs settlement date ambiguity in opened_at/closed_at
Short sale wash sales not addressed
Section 475 mark-to-market traders incorrectly subjected to wash sale tracking
IRC 1092 straddle rules interaction not addressed
Related party / spousal transactions not considered
Corporate action identity changes breaking matching
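
The trade date vs settlement date item above is easy to make concrete. The sketch below is a minimal illustration, not gargoyle's logic: `in_wash_sale_window` is a hypothetical helper, the fills are made-up data, and which date the system should actually use is a question for the spec and a tax advisor. It only shows that the choice can flip the result at the 30-day boundary.

```python
from datetime import date

WASH_SALE_DAYS = 30  # window runs 30 days before through 30 days after the loss sale


def in_wash_sale_window(loss_sale: date, acquisition: date) -> bool:
    """Return True if the acquisition falls within 30 days of the loss sale."""
    return abs((acquisition - loss_sale).days) <= WASH_SALE_DAYS


# Hypothetical fills; each settles one business day after its trade date.
sale_trade, sale_settle = date(2026, 1, 2), date(2026, 1, 5)
buy_trade, buy_settle = date(2026, 2, 2), date(2026, 2, 3)

# Trade dates are 31 days apart, settlement dates only 29 days apart.
by_trade_date = in_wash_sale_window(sale_trade, buy_trade)
by_settlement = in_wash_sale_window(sale_settle, buy_settle)

print(by_trade_date, by_settlement)
```

If opened_at/closed_at mix the two conventions, the same pair of fills can be flagged or missed depending on which timestamps happen to be compared.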

GPT-5 unique findings (not in either other model):

Per-share vs lot-level basis tacking (#1): The system applies disallowed_loss and tacked_opened_at at the LOT level, but IRS requires per-share treatment when only partial shares are matched. A lot of 100 shares where only 60 trigger wash sale should have per-share basis segregation — the system inflates basis for all 100 shares. Most architecturally significant finding — a fundamental design-level error, not a missing feature.
IRA permanent disallowance (#2): When replacement purchase is in an IRA, the loss is PERMANENTLY lost (no basis adjustment possible in tax-deferred accounts). System either incorrectly applies basis adjustment inside IRA or misses it entirely.
Instruments not subject to §1091 (#4): §1256 contracts (futures, index options), cryptocurrency, and §475 elections are all exempt — system may over-disallow.
Average-cost mutual fund basis (#11): Wash sale adjustments for funds using average-cost method require different math than discrete lot-level adjustments.
ADRs vs local shares (#14): ADRs and underlying foreign ordinaries are substantially identical but have different instrument_ids.
RSU vestings/ESPP purchases (#15): Equity compensation creating lots via corporate action paths may not trigger check_replacement/2.
Ordering priority between pre/post sale purchases (#10): Industry convention (post-sale first, then pre-sale) may differ from system's strict chronological ordering, causing 1099-B mismatches.
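
The IRA item (#2) above describes a branch that is simple to state in code. The following is a minimal sketch under stated assumptions: `WashSaleOutcome` and `resolve_disallowed_loss` are hypothetical names, not part of gargoyle, and the only rule encoded is the one summarized in the finding (a replacement bought in an IRA receives no basis adjustment, so the disallowed loss is never recovered).

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class WashSaleOutcome:
    disallowed_loss: Decimal   # loss that cannot be deducted on the sale
    basis_adjustment: Decimal  # amount added to the replacement lot's basis
    permanently_lost: bool     # True when the loss is never recovered


def resolve_disallowed_loss(
    loss_per_share: Decimal, matched_shares: int, replacement_in_ira: bool
) -> WashSaleOutcome:
    """Decide where the disallowed loss goes for the matched shares."""
    disallowed = loss_per_share * matched_shares
    if replacement_in_ira:
        # No basis step-up inside the tax-deferred account: the loss is gone.
        return WashSaleOutcome(disallowed, Decimal("0"), True)
    # Taxable replacement: the loss is deferred into the replacement basis.
    return WashSaleOutcome(disallowed, disallowed, False)


taxable = resolve_disallowed_loss(Decimal("5.00"), 60, replacement_in_ira=False)
ira = resolve_disallowed_loss(Decimal("5.00"), 60, replacement_in_ira=True)
print(taxable)
print(ira)
```

Making the permanent case an explicit flag, rather than a zero adjustment alone, lets a tax report distinguish "deferred" from "lost" losses.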

Claude Opus unique findings (not in either other model):

Year-end boundary timing (#5): Loss in December + replacement in January means tax reports generated between Dec 31 and the replacement purchase date are incorrect. Forward detection fires retroactively but users may have already filed. System needs a "30-day pending window" for year-end reports.
Form 8949 reporting format (#6): IRS requires code "W" in column (f) and specific adjustment amounts in column (g). System doesn't describe how tax_summary/3 produces Form 8949-compatible output — potential CP2000 notice triggers from automated IRS matching against broker 1099-B.
"Open lots" query in backward detection (#10): If backward detection only queries currently-open lots, it misses replacements that were acquired AND SOLD within the window. IRS looks at acquisition regardless of current holding status. (Rev. Rul. 56-602)
Forward detection loss ordering unspecified (#7): When multiple prior losses compete for the same replacement shares, ordering matters — different allocation produces different basis amounts on the replacement lot.
DRIP reinvestments triggering wash sales (#9): Dividend reinvestment creates new lots that should trigger forward detection but may not if only buy fills produce LotOpened events.
Self-correcting analytical style (CONFIRMED): Opus withdrew Finding #4 entirely mid-analysis ("Revised assessment: holding period logic appears correct. I withdraw the claim of error"). Spent ~500 words reasoning through the holding period tacking logic, found it correct, and explicitly retracted. This is now confirmed across Findings #15, #20, and #23 as a consistent Opus behavior for verification-heavy regulatory analysis.
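
The "open lots" concern in Opus #10 above can be illustrated with a small filter. This is a sketch only: the `Lot` shape, `backward_candidates`, and the `open_only` switch are hypothetical and merely mimic the suspected behavior; the real query in gargoyle is not shown in the document. Treating a same-day acquisition as inside the window is also an assumption of the sketch.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional

WINDOW = timedelta(days=30)


@dataclass(frozen=True)
class Lot:
    lot_id: str
    opened_at: date
    closed_at: Optional[date] = None  # None means the lot is still open


def backward_candidates(
    lots: List[Lot], loss_sale_date: date, open_only: bool = False
) -> List[Lot]:
    """Lots acquired in the 30 days up to the loss sale.

    The caller is expected to pass lots other than the loss lot itself.
    With open_only=True, lots already sold are dropped (the suspected bug).
    """
    start = loss_sale_date - WINDOW
    result = []
    for lot in lots:
        acquired_in_window = start <= lot.opened_at <= loss_sale_date
        if not acquired_in_window:
            continue
        if open_only and lot.closed_at is not None:
            continue
        result.append(lot)
    return result


lots = [
    Lot("A", opened_at=date(2026, 3, 1), closed_at=date(2026, 3, 10)),  # bought and sold in window
    Lot("B", opened_at=date(2026, 3, 5)),                               # still open
    Lot("C", opened_at=date(2026, 1, 5)),                               # outside the window
]
loss_sale = date(2026, 3, 20)

print([l.lot_id for l in backward_candidates(lots, loss_sale, open_only=True)])
print([l.lot_id for l in backward_candidates(lots, loss_sale)])
```

Lot A is the case Opus flags: it was acquired inside the window but is invisible to a query restricted to currently open lots.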

Claude Sonnet unique findings (not in either other model):

Entity-level tracking for partnerships/S-Corps (#4.2): Tax-transparent entities trading through the platform need K-1 reporting to partners — user-scoped model doesn't address pass-through entity wash sale reporting.
Constructive sale integration (IRC 1259) (#4.1): Short positions or derivatives creating constructive ownership interact with wash sale determination in ways not addressed.
NOL carryforward interaction (#5.3): Wash sale deferrals affect character and timing of losses contributing to NOL calculations across tax years.

Quality assessment:

GPT-5 produced the broadest regulatory scope (16 findings) with the most specific IRS citations (Rev. Rul. 2008-5, Pub. 550, IRC §§267, 1091, 1092, 1222, 1223, 1256, 475). Its per-share vs lot-level finding (#1) is the only one that identifies a FUNDAMENTAL DESIGN ERROR (not a missing feature). Most of the other findings, from all three models, amount to "you don't handle X"; GPT-5's #1 says "what you DO handle is handled INCORRECTLY." This distinction matters: missing features are known scope limitations; incorrect logic is a bug.
Claude Opus matched GPT-5's count (16 with 2 self-corrections = 14 net confirmed) but with different character. Opus excelled at identifying OPERATIONAL implications (year-end boundary timing, Form 8949 format requirements, forward detection ordering) rather than just statutory gaps. Its findings tend to describe HOW the gap manifests in practice ("user files taxes, then January purchase retroactively invalidates the filing") vs GPT-5's approach of citing the statute and describing the theoretical violation.
Claude Sonnet was fast (40s) and produced 12 competent findings but with less regulatory precision. Findings lacked specific IRS citations (no Rev. Rul. references, no Treas. Reg. citations). Several findings overlapped heavily with common ground items without adding unique depth. The entity-level and constructive sale findings show awareness of tax complexity but are relatively generic ("this is complex and not addressed").

Key insight — regulatory compliance as a distinct task type:

This experiment tests a fundamentally different cognitive demand than previous ones: previous tasks asked "what could go wrong with this system?" (internal reasoning). This task asks "does this system correctly implement external rules?" (external reasoning). The model must hold TWO bodies of knowledge simultaneously: the implementation spec AND the regulatory framework, then find mismatches.

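All three models showed strong tax law knowledge: the IRC sections, Revenue Rulings, and Treasury Regulations they cited were cited correctly, although Sonnet gave no Rev. Rul. or Treas. Reg. citations (see the quality assessment above). The differentiation wasn't in legal knowledge but in HOW they applied it: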

GPT-5: Exhaustive statutory mapping ("here's every IRC section that touches wash sales; here's where the implementation falls short on each"). Breadth-first coverage. Found the most issues by sheer scope of regulatory awareness.
Opus: Operational consequence reasoning ("here's how this gap manifests as a real-world problem for the user/auditor"). Found issues by reasoning about the implementation's interaction with real-world workflows (filing deadlines, form formats, broker reconciliation).
Sonnet: Category-based analysis ("here are cross-account issues, here are entity issues, here are interaction issues"). Followed the prompt structure closely but didn't go deep within each category.

The per-share vs lot-level finding (GPT-5 #1) — why it matters:

This is the experiment's most important result. Every model found missing features (options, cross-account, short sales) — those are SCOPE limitations that the document itself acknowledges or defers. GPT-5 uniquely found a correctness bug in the IMPLEMENTED logic: the system's lot-level basis adjustment is mathematically wrong for partial wash sales.

Example (first attempt): Loss lot of 100 shares, replacement lot of 60 shares. Only 60 shares trigger the wash sale, so the system adds 60% of the loss to the replacement lot's basis as the disallowed amount and sets tacked_opened_at on the lot. Since adjusted_quantity = min(loss_quantity, replacement_quantity) = 60, ALL shares of the replacement lot are matched, and the lot-level adjustment is correct here: 60 shares of adjustment are spread across 60 shares.

The edge case GPT-5 identifies requires a replacement lot larger than the loss, i.e. adjusted_quantity < replacement_quantity. Example: a loss of 60 shares matched against a replacement lot of 100 shares. Only 60 replacement shares should be affected, but tacked_opened_at is set on the entire lot (100 shares). This IS a genuine bug: 40 shares get an incorrect holding period classification.
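
One way to avoid the 40-share misclassification is to split the replacement lot at match time. The sketch below is an assumption-laden illustration, not gargoyle's design: the `Lot` fields and `apply_wash_sale` are hypothetical, and `tacked_opened_at` is treated as an effective acquisition date supplied by the caller rather than computed here.

```python
from dataclasses import dataclass, replace
from datetime import date
from decimal import Decimal
from typing import List, Optional


@dataclass(frozen=True)
class Lot:
    quantity: int
    basis_per_share: Decimal
    opened_at: date
    tacked_opened_at: Optional[date] = None  # set only on wash-sale-matched shares


def apply_wash_sale(
    replacement: Lot,
    loss_quantity: int,
    disallowed_loss_per_share: Decimal,
    tacked_opened_at: date,
) -> List[Lot]:
    """Split the replacement lot so only matched shares carry the adjustment."""
    matched = min(loss_quantity, replacement.quantity)
    adjusted = replace(
        replacement,
        quantity=matched,
        basis_per_share=replacement.basis_per_share + disallowed_loss_per_share,
        tacked_opened_at=tacked_opened_at,
    )
    remainder = replacement.quantity - matched
    if remainder == 0:
        return [adjusted]
    # Unmatched shares keep their original basis and holding period.
    return [adjusted, replace(replacement, quantity=remainder)]


# Loss of 60 shares against a 100-share replacement lot (the edge case above).
lots = apply_wash_sale(
    Lot(quantity=100, basis_per_share=Decimal("10"), opened_at=date(2026, 1, 10)),
    loss_quantity=60,
    disallowed_loss_per_share=Decimal("2"),
    tacked_opened_at=date(2025, 6, 1),
)
for lot in lots:
    print(lot)
```

Splitting keeps the lot as the unit of record while still giving per-share results; the alternative is to store the matched quantity on the lot and apply it at sale time.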

Updated task-type taxonomy:

Task type	Primary cognitive demand	Best model
Hidden assumptions	Breadth identification (what's not stated?)	GPT-5 (exhaustive)
Race conditions	Sequential temporal reasoning	GPT-5 + Opus
Cross-component interactions	Component boundary reasoning	GPT-5 + Sonnet
Design coherence	Internal consistency checking	Opus
Invariant violation paths	Construction + verification	GPT-5 (precision)
Silent correctness	External requirement matching	Opus
Regulatory compliance	Dual-knowledge-base comparison	GPT-5 (breadth) + Opus (operations)

Regulatory compliance is closest to "silent correctness" (Finding #22) in that both require reasoning about external requirements. The key difference:

Silent correctness asks "does this produce correct outputs for all inputs?"
Regulatory compliance asks "does this implement the law correctly?"

Both favor models that reason about the system's relationship to the outside world (Opus's strength), but regulatory compliance also rewards breadth of statutory knowledge (GPT-5's strength). The combination produces the most complete picture.

Practical implication: For regulatory compliance review of financial systems:

Run GPT-5 for exhaustive statutory coverage (finds the most gaps)
Run Opus for operational impact analysis (finds how gaps manifest in practice)
Sonnet adds marginal value — use only if budget allows
GPT-5's unique strength: identifying correctness bugs in implemented logic (not just missing features)
Opus's unique strength: identifying timing/workflow issues (year-end, form reporting, reconciliation with broker)
