claw bb191e48d1 finding #54 : wash sale multi-model design review analysis

Compared Sonnet 4, GPT-5, and Opus 4.6 on gargoyle wash-sale-tracking.md.
Key insights:
- GPT-5 requires 16K+ completion tokens (4K for reasoning alone)
- Opus caught holding period add-vs-backdate correctness issue
- Sonnet caught Section 1259 (constructive sales) that others missed
- All three missed multi-broker 1099-B reconciliation problem
- Multi-model review justified for tax compliance domains

2026-05-09 03:35:12 -07:00


Multi-Model Analysis: Wash Sale Design Document Review

Finding ID: 54
Date: 2026-05-09
Document: gargoyle/docs/domain/contexts/ledger/wash-sale-tracking.md
Task type: Design doc edge case analysis
Prompt: "What edge cases, ambiguities, or potential bugs might this design miss?"
Models compared: Claude Sonnet 4, GPT-5, Claude Opus 4.6

Experiment Design

This experiment compares how different frontier models analyze a real production design document for a trading platform's wash sale tracking feature. The document involves:

Tax law compliance (IRC §1091)
Event-driven architecture
Domain modeling with immutable events
Edge cases in financial calculations

Key Findings by Model

Claude Sonnet 4 (5.5K tokens, ~5s latency)

Strengths:

Fast, structured response with clear categories
Strong on IRS compliance gaps (IRA permanent disallowance, cross-account rules, retirement accounts)
Good on "substantially identical" scope limitations
Identified trade date vs timestamp issue

Unique catches:

Explicitly called out retirement account interactions (IRA/401k)
Noted constructive sale provisions (Section 1259)
Annual wash sale carryover across tax years

Weaknesses:

Less depth on event ordering/concurrency
Formula correctness analysis less thorough

GPT-5 (16K tokens with 4K reasoning, ~3min latency)

Strengths:

Most comprehensive coverage
Exceptional detail on implementation ambiguities
Strong on ordering/allocation algorithm gaps
Specific bug-prone scenarios with concrete examples
Excellent actionable recommendations section

Unique catches:

Fractional shares/rounding policy missing
Fees/commissions treatment undefined
Short sale handling completely absent
Most detailed on multiple-lot allocation problem
Concrete numeric examples (double-counting across replacements)

Weaknesses:

Required 16K token budget (4K for reasoning alone)
3+ minute latency
Verbose — some redundancy across sections

Claude Opus 4.6 (16K tokens, ~2min latency)

Strengths:

Deepest reasoning about edge cases
Best on chain wash sales (daisy-chaining) scenario
Excellent on concurrent detection race conditions
Strong "holding period tacking" correctness analysis (add vs backdate)
Clear "highest-risk issues" prioritization

Unique catches:

Backdating vs adding holding period — IRS requires adding, design does backdating (different results in edge cases)
FIFO ordering per Rev. Rul. 85-4 for multiple replacements
Gain-then-loss sequence scenario
Forward detection on purchase arriving before backward detection on same-day sale
Adjustment event interleaving with P&L queries

Weaknesses:

Slower than Sonnet for similar category coverage
Some overlap with GPT-5 on cross-account gaps

Synthesis: What Each Model Catches

Finding	Sonnet	GPT-5	Opus
Cross-account wash sales missing	✅	✅	✅
IRA permanent disallowance	✅	✅	✅
Multiple replacement lot allocation	Partial	✅	✅
Short sale handling absent	✅	✅	✅
Trade date vs timestamp	✅	✅	✅
Substantially identical too narrow	✅	✅	✅
Chain wash sales (daisy-chaining)	❌	Partial	✅
Holding period add vs backdate	❌	❌	✅
FIFO ordering per IRS rules	❌	Partial	✅
Concurrent detection race	❌	✅	✅
Rounding/fractional shares	❌	✅	✅
Fees/commissions treatment	❌	✅	❌
Corporate action edge cases	✅	❌	✅
Year-end boundary handling	✅	❌	✅
Section 1259 constructive sales	✅	❌	❌
Concrete numeric examples	❌	✅	❌
Actionable recommendations	Partial	✅	Partial
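
The matrix above can be tallied mechanically. The following is a minimal sketch that transcribes the table as given (Y for ✅, P for Partial, N for ❌); the helper names `full_catches`, `sole_catches`, and `combined` are illustrative and not part of the experiment's tooling.

```python
# Coverage matrix transcribed from the table above.
# Column order: Sonnet, GPT-5, Opus. Y = caught, P = partial, N = missed.
MODELS = ("Sonnet", "GPT-5", "Opus")
MATRIX = {
    "Cross-account wash sales missing": "YYY",
    "IRA permanent disallowance": "YYY",
    "Multiple replacement lot allocation": "PYY",
    "Short sale handling absent": "YYY",
    "Trade date vs timestamp": "YYY",
    "Substantially identical too narrow": "YYY",
    "Chain wash sales (daisy-chaining)": "NPY",
    "Holding period add vs backdate": "NNY",
    "FIFO ordering per IRS rules": "NPY",
    "Concurrent detection race": "NYY",
    "Rounding/fractional shares": "NYY",
    "Fees/commissions treatment": "NYN",
    "Corporate action edge cases": "YNY",
    "Year-end boundary handling": "YNY",
    "Section 1259 constructive sales": "YNN",
    "Concrete numeric examples": "NYN",
    "Actionable recommendations": "PYP",
}


def full_catches(model: str) -> set[str]:
    """Rows where the given model has a full catch (Y)."""
    col = MODELS.index(model)
    return {row for row, marks in MATRIX.items() if marks[col] == "Y"}


def sole_catches(model: str) -> set[str]:
    """Rows the model caught fully while both other models missed outright."""
    col = MODELS.index(model)
    return {
        row
        for row, marks in MATRIX.items()
        if marks[col] == "Y" and marks.count("N") == 2
    }


def combined(*models: str) -> set[str]:
    """Rows fully caught by at least one of the given models."""
    return set().union(*(full_catches(m) for m in models))


if __name__ == "__main__":
    for m in MODELS:
        print(m, len(full_catches(m)), sorted(sole_catches(m)))
    print("Sonnet + Opus:", len(combined("Sonnet", "Opus")), "of", len(MATRIX))
```

On the table as transcribed, Sonnet, GPT-5, and Opus have 8, 11, and 13 full catches respectively out of 17 rows. Partial marks are deliberately not counted as catches here; counting them would change the totals.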

Model Selection Guidance

For quick design review (time-critical): Sonnet — catches most high-severity compliance gaps, fast enough for interactive use.

For comprehensive pre-implementation review: GPT-5 — exhaustive coverage, actionable recommendations, but budget time and tokens (16K+ completion tokens needed).

For deep edge case analysis: Opus — best at chain scenarios, ordering/concurrency, subtle correctness issues. Good for final review before production.

Optimal pipeline:

Sonnet for initial triage (identifies categories)
GPT-5 or Opus for deep dive on specific high-risk areas Sonnet flagged
Both GPT-5 and Opus if the domain is critical (tax compliance, financial calculations)
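
The pipeline steps above can be sketched as a small orchestration function. This is an illustration under stated assumptions, not the tooling used in the experiment: `Reviewer` is a hypothetical callable that wraps whatever client sends a prompt to a model and returns its text, and the follow-up prompt wording is invented for the example.

```python
from typing import Callable

# Hypothetical: a reviewer takes a prompt string and returns the model's text.
Reviewer = Callable[[str], str]

# The review prompt used in this experiment.
BASE_PROMPT = "What edge cases, ambiguities, or potential bugs might this design miss?"


def review_pipeline(
    doc: str,
    triage: Reviewer,
    deep: dict[str, Reviewer],
    critical: bool,
) -> dict[str, str]:
    """Run a fast triage pass, then one or more deep-dive passes.

    `deep` maps a label (e.g. "gpt5", "opus") to a reviewer. For non-critical
    domains only the first deep reviewer runs; for critical domains all run.
    """
    results: dict[str, str] = {}

    # Step 1: fast triage to identify categories of concern.
    triage_notes = triage(f"{BASE_PROMPT}\n\n{doc}")
    results["triage"] = triage_notes

    # Step 2: deep dive on the areas the triage pass flagged.
    # (Illustrative wording; the experiment did not specify a follow-up prompt.)
    follow_up = (
        "A first-pass review flagged the areas below. "
        "Analyze the highest-risk ones in depth.\n\n"
        f"{triage_notes}\n\n{doc}"
    )

    # Step 3: critical domains get every deep reviewer, others just the first.
    labels = list(deep) if critical else list(deep)[:1]
    for label in labels:
        results[label] = deep[label](follow_up)
    return results
```

Passing reviewers as plain callables keeps the sketch independent of any particular vendor SDK, so each one can carry its own token budget and timeout.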

Surprising Results

Opus caught holding period semantics that others missed — The IRS requires adding holding periods, not backdating. This produces different results when loss lot open date ≠ (replacement open date - loss holding period). Neither GPT-5 nor Sonnet caught this.
GPT-5's reasoning tokens consumed 4K before any output — At 4K max_completion_tokens, GPT-5 returned empty content (all tokens went to reasoning). This is a critical operational consideration.
Sonnet caught Section 1259 (constructive sales), a relatively obscure IRC provision that neither Opus nor GPT-5 mentioned. This may reflect broader tax-law coverage in Sonnet's training data, but a single run cannot confirm that.
All three missed the same thing — None explicitly addressed what happens when a user has multiple brokers reporting different 1099-Bs with different wash sale treatments. The reconciliation problem is real and untreated.

Cost Analysis (estimated)

Model	Input tokens	Output tokens	Latency	Relative cost
Sonnet	~1.3K	~5.5K	~5s	1x
GPT-5	~1.3K	~16K (4K reasoning)	~180s	~8x
Opus	~1.3K	~16K	~120s	~6x
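
To compare multi-model combinations, the estimates in the table can be added up. The sketch below only does arithmetic on the figures above; `pipeline_estimate` is an illustrative helper, and the assumption that parallel runs are bounded by the slowest model ignores rate limits and queuing.

```python
# Estimates copied from the cost table above.
# relative_cost is in units of one Sonnet run.
ESTIMATES = {
    "Sonnet": {"latency_s": 5, "relative_cost": 1},
    "GPT-5": {"latency_s": 180, "relative_cost": 8},
    "Opus": {"latency_s": 120, "relative_cost": 6},
}


def pipeline_estimate(models: list[str], parallel: bool = False) -> tuple[int, int]:
    """Return (relative cost, latency in seconds) for running the given models.

    Sequential runs add latencies; parallel runs are assumed to take as long
    as the slowest model.
    """
    cost = sum(ESTIMATES[m]["relative_cost"] for m in models)
    latencies = [ESTIMATES[m]["latency_s"] for m in models]
    latency = max(latencies) if parallel else sum(latencies)
    return cost, latency


if __name__ == "__main__":
    print(pipeline_estimate(["Sonnet", "Opus"]))            # sequential Sonnet then Opus
    print(pipeline_estimate(["Sonnet", "GPT-5", "Opus"], parallel=True))
    # How many Sonnet runs fit in one GPT-5 response time.
    print(ESTIMATES["GPT-5"]["latency_s"] // ESTIMATES["Sonnet"]["latency_s"])
```

With these estimates, a sequential Sonnet then Opus run costs about 7x a single Sonnet run and takes about 125 seconds, and roughly 36 Sonnet runs fit in the time of one GPT-5 response.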

Lessons Learned

GPT-5 token budget is critical — Must use max_completion_tokens ≥16K for reasoning-heavy tasks. 4K produces empty output because reasoning consumes all tokens.
Model blind spots are complementary — Each model caught things the others missed. Multi-model review is justified for high-stakes domains.
Opus excels at subtle correctness issues — The holding period add-vs-backdate distinction is exactly the kind of thing that causes production bugs months later.
Sonnet's speed enables iteration — At ~5s latency, you can run Sonnet multiple times with different prompts in the time it takes for one GPT-5 response.
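
The token-budget lesson suggests guarding against empty responses in code. The following is a minimal sketch, assuming a hypothetical `complete(prompt, max_completion_tokens)` wrapper around the actual client that returns the visible output text. The 16K default follows the finding above; the doubling strategy and the 64K ceiling are assumptions made for illustration.

```python
from typing import Callable

# Hypothetical client wrapper: (prompt, max_completion_tokens) -> visible text.
Complete = Callable[[str, int], str]


def complete_with_budget(
    complete: Complete,
    prompt: str,
    start: int = 16_000,    # minimum budget suggested by this finding
    ceiling: int = 64_000,  # assumed upper bound, not from the experiment
) -> tuple[str, int]:
    """Call the model, doubling the completion budget while output is empty.

    Returns the text and the budget that produced it. Raises RuntimeError if
    the output is still empty at the ceiling.
    """
    budget = start
    while True:
        text = complete(prompt, budget)
        if text.strip():
            return text, budget
        if budget >= ceiling:
            raise RuntimeError(
                f"empty output even at {budget} completion tokens"
            )
        # Reasoning may have consumed the whole budget; retry with more room.
        budget = min(budget * 2, ceiling)
```

Returning the budget that finally worked makes it easy to log how much headroom a given prompt needs, which helps pick a sensible default for later runs.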

Conclusion

For design document review of financial/regulatory domains:

Single model: GPT-5 with ≥16K completion tokens is most comprehensive
Speed/cost constrained: Sonnet catches ~70% of critical issues at roughly 6-8x lower cost (per the estimates in the cost table)
Multi-model pipeline: Sonnet → Opus catches the most unique issues (complementary blind spots)
GPT-5 + Opus overlap is high (~80%) but each has unique catches

The multi-model approach is justified for high-stakes domains where the cost of a missed edge case exceeds the cost of running 2-3 models.
