diff --git a/findings/2026-05-09-54-wash-sale-multi-model-analysis.md b/findings/2026-05-09-54-wash-sale-multi-model-analysis.md
new file mode 100644
index 0000000..4f01a75
--- /dev/null
+++ b/findings/2026-05-09-54-wash-sale-multi-model-analysis.md
@@ -0,0 +1,149 @@
+# Multi-Model Analysis: Wash Sale Design Document Review
+
+**Finding ID:** 54
+**Date:** 2026-05-09
+**Document:** gargoyle/docs/domain/contexts/ledger/wash-sale-tracking.md
+**Task type:** Design doc edge case analysis
+**Prompt:** "What edge cases, ambiguities, or potential bugs might this design miss?"
+**Models compared:** Claude Sonnet 4, GPT-5, Claude Opus 4.6
+
+## Experiment Design
+
+This experiment compares how different frontier models analyze a real production design document for a trading platform's wash sale tracking feature. The document involves:
+- Tax law compliance (IRC §1091)
+- Event-driven architecture
+- Domain modeling with immutable events
+- Edge cases in financial calculations
+
+## Key Findings by Model
+
+### Claude Sonnet 4 (5.5K tokens, ~5s latency)
+
+**Strengths:**
+- Fast, structured response with clear categories
+- Strong on IRS compliance gaps (IRA permanent disallowance, cross-account rules, retirement accounts)
+- Good on "substantially identical" scope limitations
+- Identified trade date vs timestamp issue
+
+**Unique catches:**
+- Explicitly called out retirement account interactions (IRA/401k)
+- Noted constructive sale provisions (Section 1259)
+- Annual wash sale carryover across tax years
+
+**Weaknesses:**
+- Less depth on event ordering/concurrency
+- Formula correctness analysis less thorough
+
+### GPT-5 (16K tokens with 4K reasoning, ~3min latency)
+
+**Strengths:**
+- Most comprehensive coverage
+- Exceptional detail on implementation ambiguities
+- Strong on ordering/allocation algorithm gaps
+- Specific bug-prone scenarios with concrete examples
+- Excellent actionable recommendations section
+
+**Unique catches:**
+- Fractional shares/rounding policy missing
+- Fees/commissions treatment undefined
+- Short sale handling completely absent
+- Most detailed on multiple-lot allocation problem
+- Concrete numeric examples (double-counting across replacements)
+
+**Weaknesses:**
+- Required 16K token budget (4K for reasoning alone)
+- 3+ minute latency
+- Verbose, with some redundancy across sections
+
+### Claude Opus 4.6 (16K tokens, ~2min latency)
+
+**Strengths:**
+- Deepest reasoning about edge cases
+- Best on chain wash sales (daisy-chaining) scenario
+- Excellent on concurrent detection race conditions
+- Strong "holding period tacking" correctness analysis (add vs backdate)
+- Clear "highest-risk issues" prioritization
+
+**Unique catches:**
+- Backdating vs adding holding period: the IRS requires adding, the design backdates (different results in edge cases)
+- FIFO ordering per Rev. Rul. 85-4 for multiple replacements
+- Gain-then-loss sequence scenario
+- Forward detection on purchase arriving before backward detection on same-day sale
+- Adjustment event interleaving with P&L queries
+
+**Weaknesses:**
+- Slower than Sonnet for similar category coverage
+- Some overlap with GPT-5 on cross-account gaps
+
+## Synthesis: What Each Model Catches
+
+| Finding | Sonnet | GPT-5 | Opus |
+|---------|--------|-------|------|
+| Cross-account wash sales missing | ✅ | ✅ | ✅ |
+| IRA permanent disallowance | ✅ | ✅ | ✅ |
+| Multiple replacement lot allocation | Partial | ✅ | ✅ |
+| Short sale handling absent | ✅ | ✅ | ✅ |
+| Trade date vs timestamp | ✅ | ✅ | ✅ |
+| Substantially identical too narrow | ✅ | ✅ | ✅ |
+| Chain wash sales (daisy-chaining) | ❌ | Partial | ✅ |
+| Holding period add vs backdate | ❌ | ❌ | ✅ |
+| FIFO ordering per IRS rules | ❌ | Partial | ✅ |
+| Concurrent detection race | ❌ | ✅ | ✅ |
+| Rounding/fractional shares | ❌ | ✅ | ✅ |
+| Fees/commissions treatment | ❌ | ✅ | ❌ |
+| Corporate action edge cases | ✅ | ❌ | ✅ |
+| Year-end boundary handling | ✅ | ❌ | ✅ |
+| Section 1259 constructive sales | ✅ | ❌ | ❌ |
+| Concrete numeric examples | ❌ | ✅ | ❌ |
+| Actionable recommendations | Partial | ✅ | Partial |
+
+## Model Selection Guidance
+
+**For quick design review (time-critical):** Sonnet. It catches most high-severity compliance gaps and is fast enough for interactive use.
+
+**For comprehensive pre-implementation review:** GPT-5. It gives exhaustive coverage and actionable recommendations, but budget time and tokens (16K+ completion tokens needed).
+
+**For deep edge case analysis:** Opus. It is best at chain scenarios, ordering/concurrency, and subtle correctness issues. Good for final review before production.
+
+**Optimal pipeline:**
+1. Sonnet for initial triage (identifies categories)
+2. GPT-5 or Opus for deep dive on specific high-risk areas Sonnet flagged
+3. Both GPT-5 and Opus if the domain is critical (tax compliance, financial calculations)
+
+## Surprising Results
+
+1. **Opus caught holding period semantics that others missed:** The IRS requires *adding* holding periods, not *backdating*. This produces different results when loss lot open date ≠ (replacement open date - loss holding period). Neither GPT-5 nor Sonnet caught this.
+
+2. **GPT-5's reasoning tokens consumed 4K before any output:** With `max_completion_tokens` set to 4K, GPT-5 returned empty content (all tokens went to reasoning). This is a critical operational consideration.
+
+3. **Sonnet caught Section 1259 (constructive sales):** A relatively obscure IRS provision that neither Opus nor GPT-5 mentioned. This suggests Sonnet may have fresher/broader tax law training data.
+
+4. **All three missed the same thing:** None explicitly addressed what happens when a user has *multiple brokers* reporting different 1099-Bs with different wash sale treatments. The reconciliation problem is real and untreated.
+
+## Cost Analysis (estimated)
+
+| Model | Input tokens | Output tokens | Latency | Relative cost |
+|-------|--------------|---------------|---------|---------------|
+| Sonnet | ~1.3K | ~5.5K | ~5s | 1x |
+| GPT-5 | ~1.3K | ~16K (4K reasoning) | ~180s | ~8x |
+| Opus | ~1.3K | ~16K | ~120s | ~6x |
+
+## Lessons Learned
+
+1. **GPT-5 token budget is critical:** Use `max_completion_tokens` ≥16K for reasoning-heavy tasks. A 4K budget produces empty output because reasoning consumes all tokens.
+
+2. **Model blind spots are complementary:** Each model caught things the others missed. Multi-model review is justified for high-stakes domains.
+
+3. **Opus excels at subtle correctness issues:** The holding period add-vs-backdate distinction is exactly the kind of thing that causes production bugs months later.
+
+4. **Sonnet's speed enables iteration:** At ~5s latency, you can run Sonnet multiple times with different prompts in the time it takes for one GPT-5 response.
+
+## Conclusion
+
+For design document review of financial/regulatory domains:
+- **Single model:** GPT-5 with ≥16K completion tokens is most comprehensive
+- **Speed/cost constrained:** Sonnet catches ~70% of critical issues at ~10x lower cost
+- **Multi-model pipeline:** Sonnet → Opus catches the most unique issues (complementary blind spots)
+- **GPT-5 + Opus overlap** is high (~80%) but each has unique catches
+
+The multi-model approach is justified for high-stakes domains where the cost of a missed edge case exceeds the cost of running 2-3 models.
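
Review note: the coverage and overlap statements in the Conclusion can be cross-checked mechanically against the Synthesis table. The following is a minimal sketch, not part of the findings file, and it rests on two assumptions made here rather than by the author: only a full ✅ counts as a catch (Partial and ❌ count as misses unless `count_partial=True`), and every row carries equal weight. The names `TABLE`, `caught_by`, `coverage`, and `overlap` are hypothetical helpers introduced for illustration.

```python
from itertools import combinations

MODELS = ("Sonnet", "GPT-5", "Opus")

# Transcribed from the Synthesis table, one mark per model in MODELS order.
# "Y" = full catch (check mark), "P" = Partial, "N" = miss (cross mark).
TABLE = {
    "Cross-account wash sales missing": "YYY",
    "IRA permanent disallowance": "YYY",
    "Multiple replacement lot allocation": "PYY",
    "Short sale handling absent": "YYY",
    "Trade date vs timestamp": "YYY",
    "Substantially identical too narrow": "YYY",
    "Chain wash sales (daisy-chaining)": "NPY",
    "Holding period add vs backdate": "NNY",
    "FIFO ordering per IRS rules": "NPY",
    "Concurrent detection race": "NYY",
    "Rounding/fractional shares": "NYY",
    "Fees/commissions treatment": "NYN",
    "Corporate action edge cases": "YNY",
    "Year-end boundary handling": "YNY",
    "Section 1259 constructive sales": "YNN",
    "Concrete numeric examples": "NYN",
    "Actionable recommendations": "PYP",
}


def caught_by(model, count_partial=False):
    """Return the set of table rows the given model caught."""
    idx = MODELS.index(model)
    accepted = {"Y", "P"} if count_partial else {"Y"}
    return {row for row, marks in TABLE.items() if marks[idx] in accepted}


def coverage(models, count_partial=False):
    """Number of rows caught by at least one of the given models."""
    return len(set().union(*(caught_by(m, count_partial) for m in models)))


def overlap(a, b, count_partial=False):
    """Number of rows caught by both models."""
    return len(caught_by(a, count_partial) & caught_by(b, count_partial))


if __name__ == "__main__":
    for model in MODELS:
        print(f"{model}: {len(caught_by(model))} of {len(TABLE)} rows")
    for a, b in combinations(MODELS, 2):
        print(f"{a} + {b}: union {coverage((a, b))}, shared {overlap(a, b)}")
```

Because the tally ignores severity and treats Partial as a binary choice, it is best read as a sanity check on the table rather than a replacement for the qualitative ranking in the findings; rerunning it with `count_partial=True` shows how sensitive the comparison is to that choice.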