model-research/findings/2026-05-09-54-wash-sale-multi-model-analysis.md
claw bb191e48d1 finding #54: wash sale multi-model design review analysis
Compared Sonnet 4, GPT-5, and Opus 4.6 on gargoyle wash-sale-tracking.md.
Key insights:
- GPT-5 requires 16K+ completion tokens (4K for reasoning alone)
- Opus caught holding period add-vs-backdate correctness issue
- Sonnet caught Section 1259 (constructive sales) that others missed
- All three missed multi-broker 1099-B reconciliation problem
- Multi-model review justified for tax compliance domains
2026-05-09 03:35:12 -07:00

Multi-Model Analysis: Wash Sale Design Document Review

Finding ID: 54
Date: 2026-05-09
Document: gargoyle/docs/domain/contexts/ledger/wash-sale-tracking.md
Task type: Design doc edge case analysis
Prompt: "What edge cases, ambiguities, or potential bugs might this design miss?"
Models compared: Claude Sonnet 4, GPT-5, Claude Opus 4.6

Experiment Design

This experiment compares how different frontier models analyze a real production design document for a trading platform's wash sale tracking feature. The document involves:

  • Tax law compliance (IRC §1091)
  • Event-driven architecture
  • Domain modeling with immutable events
  • Edge cases in financial calculations

Key Findings by Model

Claude Sonnet 4 (5.5K tokens, ~5s latency)

Strengths:

  • Fast, structured response with clear categories
  • Strong on IRS compliance gaps (IRA permanent disallowance, cross-account rules, retirement accounts)
  • Good on "substantially identical" scope limitations
  • Identified trade date vs timestamp issue

Unique catches:

  • Explicitly called out retirement account interactions (IRA/401k)
  • Noted constructive sale provisions (Section 1259)
  • Annual wash sale carryover across tax years

Weaknesses:

  • Less depth on event ordering/concurrency
  • Formula correctness analysis less thorough

GPT-5 (16K tokens with 4K reasoning, ~3min latency)

Strengths:

  • Most comprehensive coverage
  • Exceptional detail on implementation ambiguities
  • Strong on ordering/allocation algorithm gaps
  • Specific bug-prone scenarios with concrete examples
  • Excellent actionable recommendations section

Unique catches:

  • Fractional shares/rounding policy missing
  • Fees/commissions treatment undefined
  • Short sale handling completely absent
  • Most detailed on multiple-lot allocation problem
  • Concrete numeric examples (double-counting across replacements)

Weaknesses:

  • Required 16K token budget (4K for reasoning alone)
  • 3+ minute latency
  • Verbose — some redundancy across sections

Claude Opus 4.6 (16K tokens, ~2min latency)

Strengths:

  • Deepest reasoning about edge cases
  • Best on chain wash sales (daisy-chaining) scenario
  • Excellent on concurrent detection race conditions
  • Strong "holding period tacking" correctness analysis (add vs backdate)
  • Clear "highest-risk issues" prioritization

Unique catches:

  • Backdating vs adding the holding period: the IRS requires adding the loss lot's holding period to the replacement lot's, whereas the design backdates the open date (the two give different results in edge cases)
  • FIFO ordering per Rev. Rul. 85-4 for multiple replacements
  • Gain-then-loss sequence scenario
  • Event ordering: forward detection on a purchase can run before backward detection on a same-day sale
  • Adjustment event interleaving with P&L queries

Weaknesses:

  • Slower than Sonnet for similar category coverage
  • Some overlap with GPT-5 on cross-account gaps

Synthesis: What Each Model Catches

Finding Sonnet GPT-5 Opus
Cross-account wash sales missing
IRA permanent disallowance
Multiple replacement lot allocation Partial
Short sale handling absent
Trade date vs timestamp
Substantially identical too narrow
Chain wash sales (daisy-chaining) Partial
Holding period add vs backdate
FIFO ordering per IRS rules Partial
Concurrent detection race
Rounding/fractional shares
Fees/commissions treatment
Corporate action edge cases
Year-end boundary handling
Section 1259 constructive sales
Concrete numeric examples
Actionable recommendations Partial Partial
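
A coverage matrix like the one above can also be kept as data, which makes "unique catches" and "what would this pipeline miss" mechanical questions. The sketch below is illustrative only: the function names are hypothetical, and the rows are limited to findings whose attribution is stated explicitly in the prose of this finding, so it is a subset and not the full matrix.

```python
# Hypothetical coverage map: finding -> set of models that raised it.
# Only rows whose attribution is spelled out in the prose are included.
coverage = {
    "Cross-account wash sales missing": {"Sonnet", "GPT-5", "Opus"},
    "Holding period add vs backdate": {"Opus"},
    "Section 1259 constructive sales": {"Sonnet"},
    "Short sale handling absent": {"GPT-5"},
    "Rounding/fractional shares": {"GPT-5"},
    "Fees/commissions treatment": {"GPT-5"},
}


def unique_catches(coverage):
    """Return model -> sorted list of findings that only that model raised."""
    result = {}
    for finding, models in coverage.items():
        if len(models) == 1:
            (model,) = models
            result.setdefault(model, []).append(finding)
    return {model: sorted(findings) for model, findings in result.items()}


def missed_by(coverage, pipeline):
    """Return the sorted findings that no model in the pipeline raised."""
    chosen = set(pipeline)
    return sorted(f for f, models in coverage.items() if not (models & chosen))


if __name__ == "__main__":
    print(unique_catches(coverage))
    print(missed_by(coverage, ["Sonnet", "Opus"]))
```

On this subset, `missed_by(coverage, ["Sonnet", "Opus"])` lists the three GPT-5-only rows. That says nothing about overall coverage percentages, since most rows of the matrix are not represented here.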

Model Selection Guidance

For quick design review (time-critical): Sonnet — catches most high-severity compliance gaps, fast enough for interactive use.

For comprehensive pre-implementation review: GPT-5 — exhaustive coverage, actionable recommendations, but budget time and tokens (16K+ completion tokens needed).

For deep edge case analysis: Opus — best at chain scenarios, ordering/concurrency, subtle correctness issues. Good for final review before production.

Optimal pipeline:

  1. Sonnet for initial triage (identifies categories)
  2. GPT-5 or Opus for deep dive on specific high-risk areas Sonnet flagged
  3. Both GPT-5 and Opus if the domain is critical (tax compliance, financial calculations)
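
The three steps above can be sketched as a small orchestration function. This is a minimal sketch under stated assumptions, not the tooling used in this experiment: the `Reviewer` signature, `run_pipeline`, and the prompt strings are all hypothetical, and real model calls would need to be wrapped to fit that signature.

```python
from typing import Callable, Dict, List

# Hypothetical reviewer signature: (document text, focus prompt) -> findings.
Reviewer = Callable[[str, str], List[str]]


def run_pipeline(doc: str, triage: Reviewer, deep: Dict[str, Reviewer],
                 critical: bool) -> Dict[str, List[str]]:
    """Triage with a fast model, then deep-dive each flagged category."""
    # Step 1: the fast triage pass identifies categories.
    categories = triage(doc, "List the high-risk categories in this design.")

    # Step 2: normally a single deep reviewer (the first one supplied).
    # Step 3: for critical domains, every deep reviewer is run.
    chosen = deep if critical else dict(list(deep.items())[:1])

    results: Dict[str, List[str]] = {}
    for category in categories:
        findings: List[str] = []
        for name, reviewer in chosen.items():
            for finding in reviewer(doc, f"Deep dive on: {category}"):
                # Tag each finding with the model that produced it.
                findings.append(f"[{name}] {finding}")
        results[category] = findings
    return results


if __name__ == "__main__":
    # Stub reviewers stand in for real model calls.
    triage = lambda doc, prompt: ["holding period", "concurrency"]
    gpt5 = lambda doc, prompt: ["gpt5 note for " + prompt]
    opus = lambda doc, prompt: ["opus note for " + prompt]
    print(run_pipeline("design doc text", triage,
                       {"GPT-5": gpt5, "Opus": opus}, critical=True))
```

Tagging each finding with its source model keeps the later comparison of overlaps and unique catches straightforward.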

Surprising Results

  1. Opus caught holding period semantics that others missed: The IRS requires adding holding periods, not backdating. The two approaches give different results when loss lot open date ≠ (replacement open date - loss holding period), that is, whenever the replacement was not purchased on the same day the loss lot was sold. Neither GPT-5 nor Sonnet caught this.

  2. GPT-5's reasoning tokens consumed 4K before any output: At 4K max_completion_tokens, GPT-5 returned empty content because all tokens went to reasoning. Callers must budget for reasoning tokens on top of the visible output.

  3. Sonnet caught Section 1259 (constructive sales): A relatively obscure IRS provision that neither Opus nor GPT-5 mentioned. One possible explanation is fresher or broader tax law coverage in Sonnet's training data, though this experiment alone cannot confirm it.

  4. All three missed the same thing: None explicitly addressed what happens when a user has multiple brokers reporting different 1099-Bs with different wash sale treatments. The reconciliation problem is real and untreated.
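
The add-vs-backdate distinction in item 1 is easier to see with dates. The sketch below assumes, based on the condition stated in item 1, that "backdating" means treating the replacement lot as opened on the loss lot's open date. The function names and the example dates are hypothetical and are not taken from the design document.

```python
from datetime import date


def holding_days_added(loss_open, loss_sold, repl_open, repl_sold):
    """Tacking by addition: replacement holding period plus loss lot holding period."""
    return (repl_sold - repl_open).days + (loss_sold - loss_open).days


def holding_days_backdated(loss_open, repl_sold):
    """Backdating (as assumed here): replacement treated as opened on the loss lot's open date."""
    return (repl_sold - loss_open).days


# Hypothetical lots: the replacement is bought 25 days after the loss sale.
loss_open = date(2025, 1, 10)
loss_sold = date(2025, 6, 10)
repl_open = date(2025, 7, 5)
repl_sold = date(2026, 1, 20)

added = holding_days_added(loss_open, loss_sold, repl_open, repl_sold)
backdated = holding_days_backdated(loss_open, repl_sold)

print(added)      # 350
print(backdated)  # 375
# The discrepancy is exactly the gap between the loss sale and the replacement purchase.
print(backdated - added == (repl_open - loss_sold).days)  # True
```

In this example the two methods land on opposite sides of one year, so a lot that backdating would mark long-term is still short-term when holding periods are added. Comparing raw day counts against a year is itself a simplification of the holding period rules.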

Cost Analysis (estimated)

| Model | Input tokens | Output tokens | Latency | Relative cost |
|---|---|---|---|---|
| Sonnet | ~1.3K | ~5.5K | ~5s | 1x |
| GPT-5 | ~1.3K | ~16K (4K reasoning) | ~180s | ~8x |
| Opus | ~1.3K | ~16K | ~120s | ~6x |
Lessons Learned

  1. GPT-5 token budget is critical — Must use max_completion_tokens ≥16K for reasoning-heavy tasks. 4K produces empty output because reasoning consumes all tokens.

  2. Model blind spots are complementary — Each model caught things the others missed. Multi-model review is justified for high-stakes domains.

  3. Opus excels at subtle correctness issues — The holding period add-vs-backdate distinction is exactly the kind of thing that causes production bugs months later.

  4. Sonnet's speed enables iteration — At ~5s latency, you can run Sonnet multiple times with different prompts in the time it takes for one GPT-5 response.
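
One way to guard against the empty-output failure in lesson 1 is to detect empty content and retry with a larger budget. The sketch below is deliberately client-agnostic: `ModelCall` is a hypothetical wrapper signature, not a real SDK function, and the doubling policy and default limits are assumptions chosen for illustration.

```python
from typing import Callable, Tuple

# Hypothetical wrapper signature: takes a completion-token budget and returns
# the visible content (empty if reasoning consumed the whole budget).
ModelCall = Callable[[int], str]


def call_with_budget_retry(call: ModelCall, start_budget: int = 16384,
                           max_budget: int = 65536) -> Tuple[str, int]:
    """Call the model, doubling the budget while the visible content is empty.

    Returns (content, budget_used). Raises RuntimeError if the content is
    still empty at max_budget.
    """
    budget = start_budget
    while True:
        content = call(budget)
        if content.strip():
            return content, budget
        if budget >= max_budget:
            raise RuntimeError(f"empty content even at {budget} completion tokens")
        budget = min(budget * 2, max_budget)


if __name__ == "__main__":
    # Stub that mimics the observed behaviour: nothing visible at 4K or below.
    def fake_model(budget: int) -> str:
        return "" if budget <= 4096 else "review text"

    print(call_with_budget_retry(fake_model, start_budget=4096))
```

Starting at 16K, as the lesson recommends, avoids paying for a wasted first call. The retry loop is a safety net for tasks whose reasoning runs longer than expected.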

Conclusion

For design document review of financial/regulatory domains:

  • Single model: GPT-5 with ≥16K completion tokens is most comprehensive
  • Speed/cost constrained: Sonnet catches ~70% of critical issues at roughly 6-8x lower cost (per the relative cost estimates above)
  • Multi-model pipeline: Sonnet → Opus catches the most unique issues (complementary blind spots)
  • GPT-5 + Opus overlap is high (~80%) but each has unique catches

The multi-model approach is justified for high-stakes domains where the cost of a missed edge case exceeds the cost of running 2-3 models.