# Finding 23: Regulatory compliance analysis: GPT-5 finds broadest scope of IRS issues; Opus self-corrects mid-analysis; all models converge on cross-account gap
**Date:** 2026-05-05
**Task:** Identify where gargoyle's `wash-sale-tracking.md` (391 lines) could produce
incorrect tax reporting compared to IRS wash sale regulations (IRC 1091). NEW
analytical lens: regulatory compliance verification — asking models to reason about
a code implementation's correctness against EXTERNAL regulatory requirements (not
internal system assumptions or race conditions).
**How we used them:** Same document (full text) + same focused analytical question
to all 3 models via HAI proxy. Prompt structured around 5 categories: regulatory
gaps, interpretation errors, edge cases at regulatory boundaries, cross-account/entity
concerns, and interaction with other IRC sections. Required specific regulatory
citations, implementation analysis, concrete tax errors, and audit risk levels.
No tools, no project context beyond the document.
| Model | Time | Output tokens | Reasoning tokens | Findings |
|---|---|---|---|---|
| GPT-5 | 178s | 12,525 | 9,536 | 16 |
| Claude Opus 4.6 | 155s | 7,326 | (internal) | 16 (with 2 self-corrections/withdrawals) |
| Claude Sonnet 4.6 | 40s | 1,818 | (internal) | 12 |
**What they found — common ground (all 3 identified):**
- Cross-account/IRA/external broker wash sales not tracked (IRS applies at taxpayer level)
- Options/contracts to acquire stock not triggering wash sales (explicit in IRC 1091(a) text)
- "Substantially identical" definition too narrow (same index ETFs, share classes, ADRs)
- Trade date vs settlement date ambiguity in opened_at/closed_at
- Short sale wash sales not addressed
- Section 475 mark-to-market traders incorrectly subjected to wash sale tracking
- IRC 1092 straddle rules interaction not addressed
- Related party / spousal transactions not considered
- Corporate action identity changes breaking matching
**GPT-5 unique findings (not in either other model):**
- **Per-share vs lot-level basis tacking** (#1): The system applies `disallowed_loss`
and `tacked_opened_at` at the LOT level, but IRS requires per-share treatment
when only partial shares are matched. A lot of 100 shares where only 60 trigger
wash sale should have per-share basis segregation — the system inflates basis for
all 100 shares. **Most architecturally significant finding** — a fundamental
design-level error, not a missing feature.
- **IRA permanent disallowance** (#2): When replacement purchase is in an IRA, the
loss is PERMANENTLY lost (no basis adjustment possible in tax-deferred accounts).
System either incorrectly applies basis adjustment inside IRA or misses it entirely.
- **Instruments not subject to §1091** (#4): §1256 contracts (futures, broad-based
index options) and cryptocurrency fall outside §1091, alongside the §475 elections
that all three models flagged; the system may over-disallow losses on them.
- **Average-cost mutual fund basis** (#11): Wash sale adjustments for funds using
average-cost method require different math than discrete lot-level adjustments.
- **ADRs vs local shares** (#14): ADRs and underlying foreign ordinaries are
substantially identical but have different instrument_ids.
- **RSU vestings/ESPP purchases** (#15): Equity compensation creating lots via
corporate action paths may not trigger `check_replacement/2`.
- **Ordering priority between pre/post sale purchases** (#10): Industry convention
(post-sale first, then pre-sale) may differ from system's strict chronological
ordering, causing 1099-B mismatches.
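
The IRA permanent disallowance finding (#2) can be illustrated with a minimal sketch. It assumes a hypothetical `resolve_wash_sale` helper and hypothetical account-type labels; none of these names come from `wash-sale-tracking.md`. Under Rev. Rul. 2008-5 the loss is disallowed but no basis adjustment is made inside the IRA, so the two branches must differ.

```python
from dataclasses import dataclass
from decimal import Decimal

# Hypothetical account-type labels; the real system's values are not known.
TAX_DEFERRED = {"traditional_ira", "roth_ira"}


@dataclass(frozen=True)
class WashSaleOutcome:
    disallowed_loss: Decimal   # loss the taxpayer cannot claim now
    basis_adjustment: Decimal  # amount added to the replacement lot's basis
    permanently_lost: bool     # True when the loss can never be recovered


def resolve_wash_sale(disallowed_loss: Decimal,
                      replacement_account_type: str) -> WashSaleOutcome:
    """Decide what happens to a disallowed loss based on where the
    replacement shares were bought."""
    if replacement_account_type in TAX_DEFERRED:
        # Replacement bought in an IRA: loss is disallowed and NOT tacked
        # onto any basis, so it is gone for good.
        return WashSaleOutcome(disallowed_loss, Decimal("0"), True)
    # Taxable account: the loss is deferred into the replacement lot's basis.
    return WashSaleOutcome(disallowed_loss, disallowed_loss, False)
```

A system that always takes the second branch would show a basis increase inside the IRA that has no tax effect; one that skips IRA purchases entirely would fail to disallow the loss at all.
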
**Claude Opus unique findings (not in either other model):**
- **Year-end boundary timing** (#5): Loss in December + replacement in January means
tax reports generated between Dec 31 and the replacement purchase date are incorrect.
Forward detection fires retroactively but users may have already filed. System needs
a "30-day pending window" for year-end reports.
- **Form 8949 reporting format** (#6): IRS requires code "W" in column (f) and
specific adjustment amounts in column (g). System doesn't describe how `tax_summary/3`
produces Form 8949-compatible output — potential CP2000 notice triggers from
automated IRS matching against broker 1099-B.
- **"Open lots" query in backward detection** (#10): If backward detection only
queries currently-open lots, it misses replacements that were acquired AND SOLD
within the window. IRS looks at acquisition regardless of current holding status.
(Rev. Rul. 56-602)
- **Forward detection loss ordering unspecified** (#7): When multiple prior losses
compete for the same replacement shares, ordering matters — different allocation
produces different basis amounts on the replacement lot.
- **DRIP reinvestments triggering wash sales** (#9): Dividend reinvestment creates
new lots that should trigger forward detection but may not if only buy fills
produce `LotOpened` events.
- **Self-correcting analytical style (CONFIRMED):** Opus withdrew Finding #4
entirely mid-analysis ("Revised assessment: holding period logic appears correct.
I withdraw the claim of error"). Spent ~500 words reasoning through the holding
period tacking logic, found it correct, and explicitly retracted. This is now
confirmed across Findings #15, #20, and #23 as a consistent Opus behavior for
verification-heavy regulatory analysis.
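
The "open lots" finding (#10) comes down to which filter the backward query uses. The sketch below is an assumption-laden illustration, not the system's actual query: `Acquisition`, `open_only_candidates`, and `lookback_candidates` are hypothetical names, and the 30-day look-back is taken from the §1091(a) window.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

LOOKBACK = timedelta(days=30)  # 30 days before the loss sale, per IRC 1091(a)


@dataclass(frozen=True)
class Acquisition:
    lot_id: str
    acquired_at: date
    closed_at: Optional[date] = None  # None means the lot is still open


def in_lookback(acq: Acquisition, loss_sale_date: date) -> bool:
    """True when the lot was acquired within 30 days before the loss sale."""
    return loss_sale_date - LOOKBACK <= acq.acquired_at <= loss_sale_date


def open_only_candidates(acqs, loss_lot_id, loss_sale_date):
    # The suspected behaviour: only currently-open lots are considered.
    return [a for a in acqs
            if a.lot_id != loss_lot_id
            and a.closed_at is None
            and in_lookback(a, loss_sale_date)]


def lookback_candidates(acqs, loss_lot_id, loss_sale_date):
    # The behaviour the finding asks for: filter on acquisition date only,
    # regardless of whether the lot has since been sold.
    return [a for a in acqs
            if a.lot_id != loss_lot_id
            and in_lookback(a, loss_sale_date)]
```

With a lot acquired ten days before the loss sale and sold five days before it, the first function returns nothing and the second returns the lot, which is exactly the gap the finding describes.
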
**Claude Sonnet unique findings (not in either other model):**
- **Entity-level tracking for partnerships/S-Corps** (#4.2): Tax-transparent entities
trading through the platform need K-1 reporting to partners — user-scoped model
doesn't address pass-through entity wash sale reporting.
- **Constructive sale integration (IRC 1259)** (#4.1): Short positions or derivatives
creating constructive ownership interact with wash sale determination in ways not
addressed.
- **NOL carryforward interaction** (#5.3): Wash sale deferrals affect character and
timing of losses contributing to NOL calculations across tax years.
**Quality assessment:**
- **GPT-5** produced the broadest regulatory scope (16 findings) with the most
specific IRS citations (Rev. Rul. 2008-5, Pub. 550, IRC §§267, 1091, 1092, 1222,
1223, 1256, 475). Its per-share vs lot-level finding (#1) is the only one that
identifies a FUNDAMENTAL DESIGN ERROR (not a missing feature). Most of the other
findings, from all three models, amount to "you don't handle X"; GPT-5's #1 says
"what you DO handle is handled INCORRECTLY." This distinction matters: missing
features are known scope limitations; incorrect logic is a bug.
- **Claude Opus** matched GPT-5's count (16 with 2 self-corrections = 14 net
confirmed) but with different character. Opus excelled at identifying OPERATIONAL
implications (year-end boundary timing, Form 8949 format requirements, forward
detection ordering) rather than just statutory gaps. Its findings tend to describe
HOW the gap manifests in practice ("user files taxes, then January purchase
retroactively invalidates the filing") vs GPT-5's approach of citing the statute
and describing the theoretical violation.
- **Claude Sonnet** was fast (40s) and produced 12 competent findings but with less
regulatory precision. Findings lacked specific IRS citations (no Rev. Rul.
references, no Treas. Reg. citations). Several findings overlapped heavily with
common ground items without adding unique depth. The entity-level and
constructive sale findings show awareness of tax complexity but are relatively
generic ("this is complex and not addressed").
**Key insight — regulatory compliance as a distinct task type:**
This experiment tests a fundamentally different cognitive demand from earlier ones.
Previous tasks asked "what could go wrong with this system?" (internal reasoning).
This task asks "does this system correctly implement external rules?" (external
reasoning). The model must hold TWO bodies of knowledge simultaneously: the
implementation spec AND the regulatory framework, then find mismatches.
All three models had strong tax law knowledge and cited IRC sections correctly;
only Sonnet omitted Revenue Ruling and Treasury Regulation citations. The main
differentiation wasn't in legal knowledge but in HOW they applied it:
- **GPT-5:** Exhaustive statutory mapping ("here's every IRC section that touches
wash sales; here's where the implementation falls short on each"). Breadth-first
coverage. Found the most issues by sheer scope of regulatory awareness.
- **Opus:** Operational consequence reasoning ("here's how this gap manifests as
a real-world problem for the user/auditor"). Found issues by reasoning about
the implementation's interaction with real-world workflows (filing deadlines,
form formats, broker reconciliation).
- **Sonnet:** Category-based analysis ("here are cross-account issues, here are
entity issues, here are interaction issues"). Followed the prompt structure
closely but didn't go deep within each category.
**The per-share vs lot-level finding (GPT-5 #1) — why it matters:**
This is the experiment's most important result. Every model found missing features
(options, cross-account, short sales) — those are SCOPE limitations that the
document itself acknowledges or defers. GPT-5 uniquely found a correctness bug in
the IMPLEMENTED logic: the system's lot-level basis adjustment is mathematically
wrong for partial wash sales.
Example (no bug): a loss lot of 100 shares and a replacement lot of 60 shares. Only
60 shares trigger the wash sale, so 60% of the loss is disallowed and added to the
replacement lot's basis. Assuming `adjusted_quantity = min(loss_quantity,
replacement_quantity)` = 60, every share of the 60-share replacement lot is
matched, so applying the adjustment and `tacked_opened_at` to the whole lot is
correct.
Example (bug): a loss of 60 shares matched against a replacement lot of 100 shares,
i.e. `adjusted_quantity < replacement_quantity`. Only 60 replacement shares are
affected, but `tacked_opened_at` is set on the entire lot (100 shares), and per
GPT-5's finding the `disallowed_loss` is likewise carried at the lot level. This IS
a genuine bug: the 40 non-matched shares get incorrect holding period
classification, and a partial sale of the lot uses a per-share basis that blends
adjusted and unadjusted shares.
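
One way to avoid the problem in the case where the replacement lot is larger than the loss is to split the lot into a matched and an unmatched portion before applying any adjustment. The following is a minimal sketch under stated assumptions: the `Lot` shape and `split_for_wash_sale` are hypothetical, only the `disallowed_loss` and `tacked_opened_at` field names come from the document, and the tacked date is passed in rather than derived because the document does not say how the system computes it.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal
from typing import Optional


@dataclass(frozen=True)
class Lot:
    quantity: int
    cost_basis: Decimal                       # total basis for the whole lot
    opened_at: date
    disallowed_loss: Decimal = Decimal("0")   # wash sale amount folded into basis
    tacked_opened_at: Optional[date] = None   # set only on matched shares


def split_for_wash_sale(replacement: Lot,
                        loss_quantity: int,
                        disallowed_per_share: Decimal,
                        tacked_opened_at: date) -> list[Lot]:
    """Apply a wash sale adjustment to the matched shares only,
    returning one or two lots that together cover the original quantity."""
    matched = min(loss_quantity, replacement.quantity)
    per_share = replacement.cost_basis / replacement.quantity
    disallowed = disallowed_per_share * matched

    adjusted = Lot(
        quantity=matched,
        cost_basis=per_share * matched + disallowed,
        opened_at=replacement.opened_at,
        disallowed_loss=disallowed,
        tacked_opened_at=tacked_opened_at,
    )
    remainder = replacement.quantity - matched
    if remainder == 0:
        return [adjusted]  # whole lot matched: lot-level treatment is fine

    # Unmatched shares keep their original basis and holding period.
    untouched = Lot(remainder, per_share * remainder, replacement.opened_at)
    return [adjusted, untouched]
```

For a 100-share replacement lot with a 1,000 total basis and a 60-share loss disallowed at 2 per share, this yields a 60-share lot with basis 720 and a tacked date, plus a 40-share lot with basis 400 and no tacking.
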
**Updated task-type taxonomy:**
| Task type | Primary cognitive demand | Best model |
|---|---|---|
| Hidden assumptions | Breadth identification (what's not stated?) | GPT-5 (exhaustive) |
| Race conditions | Sequential temporal reasoning | GPT-5 + Opus |
| Cross-component interactions | Component boundary reasoning | GPT-5 + Sonnet |
| Design coherence | Internal consistency checking | Opus |
| Invariant violation paths | Construction + verification | GPT-5 (precision) |
| Silent correctness | External requirement matching | Opus |
| **Regulatory compliance** | **Dual-knowledge-base comparison** | **GPT-5 (breadth) + Opus (operations)** |
Regulatory compliance is closest to "silent correctness" (Finding #22) in that
both require reasoning about external requirements. The key difference:
- Silent correctness asks "does this produce correct outputs for all inputs?"
- Regulatory compliance asks "does this implement the law correctly?"
Both favor models that reason about the system's relationship to the outside
world (Opus's strength), but regulatory compliance also rewards breadth of
statutory knowledge (GPT-5's strength). The combination produces the most
complete picture.
**Practical implication:**
For regulatory compliance review of financial systems:
- Run GPT-5 for exhaustive statutory coverage (finds the most gaps)
- Run Opus for operational impact analysis (finds how gaps manifest in practice)
- Sonnet adds marginal value — use only if budget allows
- GPT-5's unique strength: identifying correctness bugs in implemented logic
(not just missing features)
- Opus's unique strength: identifying timing/workflow issues (year-end, form
reporting, reconciliation with broker)