# Finding 23: Regulatory compliance analysis: GPT-5 finds broadest scope of IRS issues; Opus self-corrects mid-analysis; all models converge on cross-account gap

**Date:** 2026-05-05
**Task:** Identify where gargoyle's `wash-sale-tracking.md` (391 lines) could produce
incorrect tax reporting compared to IRS wash sale regulations (IRC 1091). NEW
analytical lens: regulatory compliance verification, i.e. asking models to reason about
a code implementation's correctness against EXTERNAL regulatory requirements (not
internal system assumptions or race conditions).
**How we used them:** Same document (full text) + same focused analytical question
to all 3 models via HAI proxy. Prompt structured around 5 categories: regulatory
gaps, interpretation errors, edge cases at regulatory boundaries, cross-account/entity
concerns, and interaction with other IRC sections. Required specific regulatory
citations, implementation analysis, concrete tax errors, and audit risk levels.
No tools, no project context beyond the document.

| Model | Time | Output tokens | Reasoning tokens | Findings |
|---|---|---|---|---|
| GPT-5 | 178s | 12,525 | 9,536 | 16 |
| Claude Opus 4.6 | 155s | 7,326 | (internal) | 16 (with 2 self-corrections/withdrawals) |
| Claude Sonnet 4.6 | 40s | 1,818 | (internal) | 12 |

**What they found (common ground, all 3 identified):**
- Cross-account/IRA/external broker wash sales not tracked (IRS applies at taxpayer level)
- Options/contracts to acquire stock not triggering wash sales (explicit in IRC 1091(a) text)
- "Substantially identical" definition too narrow (same index ETFs, share classes, ADRs)
- Trade date vs settlement date ambiguity in opened_at/closed_at
- Short sale wash sales not addressed
- Section 475 mark-to-market traders incorrectly subjected to wash sale tracking
- IRC 1092 straddle rules interaction not addressed
- Related party / spousal transactions not considered
- Corporate action identity changes breaking matching

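The cross-account item in the list above can be made concrete with a small sketch. This is a minimal illustration, not gargoyle's code: the `Acquisition` record, the `security_key` grouping (standing in for a "substantially identical" test), and `replacement_candidates` are all hypothetical names. The only rule it encodes is the statutory window of 30 days before through 30 days after the loss sale, applied per taxpayer rather than per account.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Iterable, List

# Window from IRC 1091(a): 30 days before through 30 days after the loss sale.
WINDOW = timedelta(days=30)


@dataclass(frozen=True)
class Acquisition:
    """Hypothetical acquisition event; not a type from the reviewed document."""
    taxpayer_id: str
    account_id: str      # e.g. taxable account, IRA, external broker
    security_key: str    # assumed grouping key for "substantially identical"
    acquired_on: date    # trade date (assumption)
    quantity: int


def replacement_candidates(taxpayer_id: str, security_key: str, sold_on: date,
                           acquisitions: Iterable[Acquisition]) -> List[Acquisition]:
    """Return acquisitions that could trigger a wash sale for one loss sale.

    Matching is keyed on the taxpayer, so purchases in ANY of the taxpayer's
    accounts are considered. It scans acquisition events rather than open
    lots, so shares bought and already sold again are still candidates.
    """
    return [
        a for a in acquisitions
        if a.taxpayer_id == taxpayer_id
        and a.security_key == security_key
        and abs(a.acquired_on - sold_on) <= WINDOW
    ]
```

An account-scoped query would be the same filter with an extra `account_id` condition, which is exactly what hides the IRA and external-broker purchases.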
**GPT-5 unique findings (not in either other model):**
- **Per-share vs lot-level basis tacking** (#1): The system applies `disallowed_loss`
and `tacked_opened_at` at the LOT level, but IRS requires per-share treatment
when only part of a lot is matched. A lot of 100 shares where only 60 trigger
wash sale should have per-share basis segregation; the system inflates basis for
all 100 shares. **Most architecturally significant finding**: a fundamental
design-level error, not a missing feature.
- **IRA permanent disallowance** (#2): When replacement purchase is in an IRA, the
loss is PERMANENTLY lost (no basis adjustment possible in tax-deferred accounts).
System either incorrectly applies basis adjustment inside IRA or misses it entirely.
- **Instruments not subject to §1091** (#4): §1256 contracts (futures, index options),
cryptocurrency, and §475 elections are all exempt, so the system may over-disallow.
- **Average-cost mutual fund basis** (#11): Wash sale adjustments for funds using
average-cost method require different math than discrete lot-level adjustments.
- **ADRs vs local shares** (#14): ADRs and underlying foreign ordinaries are
substantially identical but have different instrument_ids.
- **RSU vestings/ESPP purchases** (#15): Equity compensation creating lots via
corporate action paths may not trigger `check_replacement/2`.
- **Ordering priority between pre/post sale purchases** (#10): Industry convention
(post-sale first, then pre-sale) may differ from system's strict chronological
ordering, causing 1099-B mismatches.

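The ordering item (#10) in the list above is easier to see with two candidate orderings side by side. The sketch below is illustrative only: both function names are hypothetical, the "post-sale first" rule is taken from the finding as reported rather than from any authority, and treating same-day purchases as post-sale is an assumption.

```python
from datetime import date
from typing import Iterable, List


def chronological(acquired_dates: Iterable[date]) -> List[date]:
    """Strict chronological matching order (what the finding attributes to the system)."""
    return sorted(acquired_dates)


def post_sale_first(acquired_dates: Iterable[date], sold_on: date) -> List[date]:
    """Post-sale purchases first, then pre-sale purchases, each group by date.

    Same-day purchases are grouped with post-sale ones (an assumption).
    """
    dates = list(acquired_dates)
    post = sorted(d for d in dates if d >= sold_on)
    pre = sorted(d for d in dates if d < sold_on)
    return post + pre
```

When the loss quantity is smaller than the total candidate quantity, the first lot in each ordering absorbs the adjustment, so the two orderings can assign the disallowed loss to different lots.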
**Claude Opus unique findings (not in either other model):**
- **Year-end boundary timing** (#5): Loss in December + replacement in January means
tax reports generated between Dec 31 and the replacement purchase date are incorrect.
Forward detection fires retroactively but users may have already filed. System needs
a "30-day pending window" for year-end reports.
- **Form 8949 reporting format** (#6): IRS requires code "W" in column (f) and
specific adjustment amounts in column (g). System doesn't describe how `tax_summary/3`
produces Form 8949-compatible output, so automated IRS matching against broker
1099-B could trigger CP2000 notices.
- **"Open lots" query in backward detection** (#10): If backward detection only
queries currently-open lots, it misses replacements that were acquired AND SOLD
within the window. IRS looks at acquisition regardless of current holding status
(Rev. Rul. 56-602).
- **Forward detection loss ordering unspecified** (#7): When multiple prior losses
compete for the same replacement shares, ordering matters: different allocation
produces different basis amounts on the replacement lot.
- **DRIP reinvestments triggering wash sales** (#9): Dividend reinvestment creates
new lots that should trigger forward detection but may not if only buy fills
produce `LotOpened` events.
- **Self-correcting analytical style (CONFIRMED):** Opus withdrew Finding #4
entirely mid-analysis ("Revised assessment: holding period logic appears correct.
I withdraw the claim of error"). It spent ~500 words reasoning through the holding
period tacking logic, found it correct, and explicitly retracted. This is now
confirmed across Findings #15, #20, and #23 as a consistent Opus behavior for
verification-heavy regulatory analysis.

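The "30-day pending window" suggested in the year-end item (#5) above could be sketched as follows. This is a hedged illustration, not the system's design: `window_closes_on`, `is_loss_final`, and `pending_losses` are hypothetical helpers, and the only rule assumed is that a replacement purchase up to and including the 30th day after the sale can still disallow the loss.

```python
from datetime import date, timedelta
from typing import Iterable, List

WASH_SALE_WINDOW = timedelta(days=30)


def window_closes_on(sold_on: date) -> date:
    """Last day on which a replacement purchase can still affect this loss."""
    return sold_on + WASH_SALE_WINDOW


def is_loss_final(sold_on: date, report_date: date) -> bool:
    """True only once the post-sale window has fully elapsed."""
    return report_date > window_closes_on(sold_on)


def pending_losses(loss_sale_dates: Iterable[date], report_date: date) -> List[date]:
    """Loss sales that a report generated on report_date should flag as provisional."""
    return [d for d in loss_sale_dates if not is_loss_final(d, report_date)]
```

For a loss realized on 20 December, a report run on 5 January would mark it provisional, while a report run on 20 January or later could treat it as final.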
**Claude Sonnet unique findings (not in either other model):**
- **Entity-level tracking for partnerships/S-Corps** (#4.2): Tax-transparent entities
trading through the platform need K-1 reporting to partners; the user-scoped model
doesn't address pass-through entity wash sale reporting.
- **Constructive sale integration (IRC 1259)** (#4.1): Short positions or derivatives
creating constructive ownership interact with wash sale determination in ways not
addressed.
- **NOL carryforward interaction** (#5.3): Wash sale deferrals affect character and
timing of losses contributing to NOL calculations across tax years.

**Quality assessment:**
- **GPT-5** produced the broadest regulatory scope (16 findings) with the most
specific IRS citations (Rev. Rul. 2008-5, Pub. 550, IRC §§267, 1091, 1092, 1222,
1223, 1256, 475). Its per-share vs lot-level finding (#1) is the only one that
identifies a FUNDAMENTAL DESIGN ERROR (not a missing feature). Most of the other
models' findings are "you don't handle X"; GPT-5's #1 says "what you DO handle is
handled INCORRECTLY." This distinction matters: missing features are known scope
limitations; incorrect logic is a bug.
- **Claude Opus** matched GPT-5's count (16 with 2 self-corrections = 14 net
confirmed) but with different character. Opus excelled at identifying OPERATIONAL
implications (year-end boundary timing, Form 8949 format requirements, forward
detection ordering) rather than just statutory gaps. Its findings tend to describe
HOW the gap manifests in practice ("user files taxes, then January purchase
retroactively invalidates the filing") vs GPT-5's approach of citing the statute
and describing the theoretical violation.
- **Claude Sonnet** was fast (40s) and produced 12 competent findings but with less
regulatory precision. Findings lacked specific IRS citations (no Rev. Rul.
references, no Treas. Reg. citations). Several findings overlapped heavily with
common ground items without adding unique depth. The entity-level and
constructive sale findings show awareness of tax complexity but are relatively
generic ("this is complex and not addressed").

**Key insight (regulatory compliance as a distinct task type):**

This experiment tests a fundamentally different cognitive demand than previous ones.
Previous tasks asked "what could go wrong with this system?" (internal reasoning).
This task asks "does this system correctly implement external rules?" (external
reasoning). The model must hold TWO bodies of knowledge simultaneously: the
implementation spec AND the regulatory framework, then find mismatches.

All three models showed strong tax law knowledge, although only GPT-5 and Opus
backed it with Revenue Ruling citations; Sonnet named IRC sections but cited no
rulings or Treasury Regulations. The differentiation wasn't mainly in legal
knowledge but in HOW they applied it:

- **GPT-5:** Exhaustive statutory mapping ("here's every IRC section that touches
wash sales; here's where the implementation falls short on each"). Breadth-first
coverage. Found the most issues by sheer scope of regulatory awareness.
- **Opus:** Operational consequence reasoning ("here's how this gap manifests as
a real-world problem for the user/auditor"). Found issues by reasoning about
the implementation's interaction with real-world workflows (filing deadlines,
form formats, broker reconciliation).
- **Sonnet:** Category-based analysis ("here are cross-account issues, here are
entity issues, here are interaction issues"). Followed the prompt structure
closely but didn't go deep within each category.

**The per-share vs lot-level finding (GPT-5 #1) and why it matters:**

This is the experiment's most important result. Every model found missing features
(options, cross-account, short sales); those are SCOPE limitations that the
document itself acknowledges or defers. GPT-5 uniquely found a correctness bug in
the IMPLEMENTED logic: applying wash sale adjustments at the lot level is wrong
whenever only part of a replacement lot is matched.

Example 1 (no bug): Loss lot of 100 shares, replacement lot of 60 shares. Assuming
`adjusted_quantity = min(loss_quantity, replacement_quantity)`, only 60 shares
trigger wash sale, and the system adds 60% of the loss (the disallowed portion) to
the replacement lot's basis. All 60 replacement shares are matched, so spreading
the adjustment and `tacked_opened_at` across the whole lot is correct. If the
replacement lot later sells 30 shares, they carry half of the adjustment, as they
should.

Example 2 (the bug): Loss of 60 shares matched against a replacement lot of 100
shares. Here `adjusted_quantity < replacement_quantity`, so only 60 of the 100
replacement shares should be affected. The system sets `tacked_opened_at` on the
entire lot (100 shares), so the 40 non-matched shares get an incorrect holding
period classification. Per GPT-5's finding, the lot-level `disallowed_loss` is
likewise spread over all 100 shares instead of the 60 matched ones. This IS a
genuine bug.

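One way to avoid the partial-match problem described above is to split the replacement lot so that only the matched shares carry the adjustment. The sketch below is a hedged illustration under stated assumptions, not the system's implementation: `Lot` and `apply_wash_sale` are hypothetical, `disallowed_loss` is taken to be the amount already prorated to the matched shares, and the tacked date is passed in rather than computed.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Lot:
    """Hypothetical lot record; field names loosely follow the reviewed document."""
    quantity: int
    basis: float                               # total cost basis of the lot
    opened_at: date
    tacked_opened_at: Optional[date] = None    # set only on wash-sale-adjusted shares


def apply_wash_sale(replacement: Lot, loss_quantity: int,
                    disallowed_loss: float, tacked_opened_at: date) -> List[Lot]:
    """Split the replacement lot so only matched shares are adjusted.

    disallowed_loss is assumed to be the amount for the matched shares only.
    """
    matched = min(loss_quantity, replacement.quantity)
    per_share = replacement.basis / replacement.quantity

    adjusted = Lot(
        quantity=matched,
        basis=per_share * matched + disallowed_loss,
        opened_at=replacement.opened_at,
        tacked_opened_at=tacked_opened_at,
    )
    remainder = replacement.quantity - matched
    if remainder == 0:
        return [adjusted]

    # Unmatched shares keep their original basis and holding period.
    untouched = Lot(
        quantity=remainder,
        basis=per_share * remainder,
        opened_at=replacement.opened_at,
    )
    return [adjusted, untouched]
```

With a 100-share replacement lot and a 60-share loss, this yields a 60-share sub-lot carrying the adjustment and tacked date plus a 40-share sub-lot left unchanged; when the replacement lot is no larger than the loss, it returns a single fully adjusted lot.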
**Updated task-type taxonomy:**

| Task type | Primary cognitive demand | Best model |
|---|---|---|
| Hidden assumptions | Breadth identification (what's not stated?) | GPT-5 (exhaustive) |
| Race conditions | Sequential temporal reasoning | GPT-5 + Opus |
| Cross-component interactions | Component boundary reasoning | GPT-5 + Sonnet |
| Design coherence | Internal consistency checking | Opus |
| Invariant violation paths | Construction + verification | GPT-5 (precision) |
| Silent correctness | External requirement matching | Opus |
| **Regulatory compliance** | **Dual-knowledge-base comparison** | **GPT-5 (breadth) + Opus (operations)** |

Regulatory compliance is closest to "silent correctness" (Finding #22) in that
both require reasoning about external requirements. The key difference:
- Silent correctness asks "does this produce correct outputs for all inputs?"
- Regulatory compliance asks "does this implement the law correctly?"

Both favor models that reason about the system's relationship to the outside
world (Opus's strength), but regulatory compliance also rewards breadth of
statutory knowledge (GPT-5's strength). The combination produces the most
complete picture.

**Practical implication:**
For regulatory compliance review of financial systems:
- Run GPT-5 for exhaustive statutory coverage (finds the most gaps)
- Run Opus for operational impact analysis (finds how gaps manifest in practice)
- Sonnet adds marginal value; use only if budget allows
- GPT-5's unique strength: identifying correctness bugs in implemented logic
(not just missing features)
- Opus's unique strength: identifying timing/workflow issues (year-end, form
reporting, reconciliation with broker)