Tested GPT-5, Opus, Sonnet on wash-sale-tracking.md spec. Opus found a genuine spec bug (trigger logic described backwards). Confirms pattern: GPT-5 for breadth, Opus for logic contradictions, Sonnet adds no value for systematic analytical tasks.
# Experiment #32: Testability Analysis on wash-sale-tracking.md

**Date:** 2026-05-06
**Task type:** Testability analysis (NEW analytical lens)
**Document:** gargoyle's `wash-sale-tracking.md` (184 lines), an IRC 1091 wash sale detection plan

## Hypothesis

Testability analysis (identifying which parts of a spec prevent deterministic automated testing)
is a distinct analytical lens from gap analysis or contradiction detection. Models may differ in
whether they find boundary ambiguities (precision, rounding, inclusive/exclusive) or logical
contradictions that make test assertions indeterminate.

## Method

Same structured prompt to all three models:

> Identify testability problems: untestable behaviors, ambiguities causing different test
> assertions, implicit fragile assumptions, missing boundary conditions, timing/ordering
> non-determinism. For each: state spec text, why it's a testability problem, what the spec
> needs to say.

Models: GPT-5, Claude Opus 4.6, Claude Sonnet 4 (all via HAI proxy on anvil).

## Results

| Model | Findings | Output tokens | Reasoning tokens | Latency | Tokens/finding |
|---|---|---|---|---|---|
| GPT-5 | 14 | 3,039 | 6,336 | 124.5s | 670 (incl. reasoning) |
| Opus | 10 | 2,719 | n/a | ~90s | 272 |
| Sonnet | ~11 (grouped) | 1,117 | n/a | ~24s | 102 |

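The Tokens/finding column is a plain division of tokens spent by findings produced. A minimal sketch that recomputes it from the table, assuming (as the GPT-5 row notes) that reasoning tokens count toward GPT-5's total and that Sonnet's approximate count of 11 is taken at face value. The function name is illustrative only:

```python
def tokens_per_finding(findings, output_tokens, reasoning_tokens=0):
    """Total tokens (output plus any reasoning) spent per finding, rounded."""
    return round((output_tokens + reasoning_tokens) / findings)


# (findings, output tokens, reasoning tokens) copied from the results table.
# Reasoning tokens are treated as 0 where the table reports none.
rows = {
    "GPT-5": (14, 3039, 6336),
    "Opus": (10, 2719, 0),
    "Sonnet": (11, 1117, 0),  # "~11 (grouped)" taken as 11
}

per_finding = {model: tokens_per_finding(*row) for model, row in rows.items()}
print(per_finding)
```

Note that the ranking depends on the accounting choice: without reasoning tokens, GPT-5's figure would be computed from 3,039 output tokens alone.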
## Unique Findings by Model

### GPT-5 unique (5):

1. `total_loss` is undefined (gross vs net, before/after prior adjustments)
2. Effective cost basis formula inconsistency (Decision 9 omits `lot_adjustments`)
3. Cascading wash sales: a loss lot that was previously a replacement lot
4. 61-day vs 30-day wording inconsistency creates off-by-one interpretations
5. Trade date vs settlement date as standalone finding

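Finding 4 is the kind of ambiguity that yields two different test assertions for the same input. The sketch below is illustrative, not taken from the spec: it assumes calendar days and trade dates, and both function names and both readings are hypothetical. Which reading the spec intends is exactly what it needs to state.

```python
from datetime import date, timedelta


def in_window_61_day(sale: date, buy: date) -> bool:
    """Reading A: the sale day plus 30 days on either side (61 days total)."""
    return abs((buy - sale).days) <= 30


def in_window_30_day_counting_sale(sale: date, buy: date) -> bool:
    """Reading B: the sale date itself counts as day 1 of each 30-day span."""
    return abs((buy - sale).days) <= 29


sale = date(2026, 3, 2)
boundary_buy = sale + timedelta(days=30)  # purchase exactly 30 days after the loss sale

reading_a = in_window_61_day(sale, boundary_buy)
reading_b = in_window_30_day_counting_sale(sale, boundary_buy)
print(reading_a, reading_b)  # the two readings disagree on this boundary case
```

A test for the day-30 purchase has to pick one of these results, so the spec should define the window with explicit inclusive or exclusive bounds.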
### Opus unique (5, including a real spec bug):

1. **Forward detection logic is CONTRADICTORY:** the spec says "lot open catches forward wash sales",
   but forward wash sales are buy-then-sell-at-loss, so the *closure* is the trigger. The spec
   describes the triggers backwards. This is a genuine spec bug.
2. Daisy-chain holding period propagation (A→B→C: does C inherit A's date?)
3. Per-share vs per-lot basis semantics for partial closures of replacement lots
4. Instrument identity breaks across mergers (same `instrument_id` rule fails)
5. "What counts as a loss sale": raw or adjusted P&L? (circular dependency)

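The reasoning behind finding 1 can be sketched with a toy event-driven detector. Everything here is an assumption for illustration: the `Detector` class, the hook names `on_lot_open` and `on_lot_close`, an inclusive 30-day window, matching on instrument name, and the use of whatever `pnl` the caller passes in. It is not the spec's design, and it ignores that a lot sold within 30 days of its own purchase would match itself.

```python
from dataclasses import dataclass, field
from datetime import date

WINDOW_DAYS = 30  # assumed inclusive; the exact boundary is a separate open question


@dataclass
class Detector:
    buys: list = field(default_factory=list)        # (instrument, trade_date)
    loss_sales: list = field(default_factory=list)  # (instrument, trade_date)
    matches: list = field(default_factory=list)     # (trigger, buy_date, sale_date)

    def on_lot_open(self, instrument: str, when: date) -> None:
        # A new purchase can only pair with loss sales that already happened.
        for inst, sold in self.loss_sales:
            if inst == instrument and 0 <= (when - sold).days <= WINDOW_DAYS:
                self.matches.append(("open", when, sold))
        self.buys.append((instrument, when))

    def on_lot_close(self, instrument: str, when: date, pnl: float) -> None:
        if pnl >= 0:
            return  # only loss sales are candidates
        # A loss sale can only pair with purchases that already happened.
        for inst, bought in self.buys:
            if inst == instrument and 0 <= (when - bought).days <= WINDOW_DAYS:
                self.matches.append(("close", bought, when))
        self.loss_sales.append((instrument, when))


d = Detector()
d.on_lot_open("XYZ", date(2026, 1, 2))   # original lot
d.on_lot_open("XYZ", date(2026, 3, 1))   # replacement bought first
d.on_lot_close("XYZ", date(2026, 3, 10), pnl=-500.0)  # original then sold at a loss
print(d.matches)
```

In the buy-then-sell-at-loss sequence, no loss exists when the replacement lot opens, so only the close hook can record the match. The open hook covers the opposite order, where the loss sale comes first and the purchase follows.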
### Sonnet unique: None

Neither of the two findings that only Sonnet raised holds up:

- Options substantially-identical finding is a false positive (spec explicitly defers this)
- Database performance finding is off-topic (not a testability problem)

## Key Insight

**Opus found a genuine spec bug** that neither GPT-5 nor Sonnet identified. The spec's description
of which trigger catches which direction of wash sale is logically backwards. This is not ambiguity;
it's an error in the spec's reasoning. This confirms the pattern from experiment #31: Opus excels
at finding where the spec's OWN LOGIC contradicts itself.

## Pattern Confirmation

For systematic/exhaustive analytical tasks:

- **GPT-5:** Best for comprehensive boundary/precision/edge-case enumeration (breadth)
- **Opus:** Best for finding logic contradictions and false assumptions (depth/insight)
- **Sonnet:** No unique value; produces false positives

This matches spec-gap analysis (#31) exactly. Sonnet only contributes unique insights in
creative/generative tasks (adversarial gaming #29-30, emergent behavior identification #23).

## Practical Recommendation

For testability reviews of spec documents before implementation:

1. Run GPT-5 for comprehensive boundary/precision/edge-case coverage
2. Run Opus for logic-level contradictions and assumption violations
3. Skip Sonnet: it finds nothing the other two miss and adds noise

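The two-model review can be sketched as a small harness. This is a sketch under stated assumptions: `Reviewer`, `run_review`, and `unique_to` are hypothetical names, the reviewers are stand-in callables rather than real API clients, and findings are matched by exact string. Real model output is free text and would need manual or semantic matching.

```python
from typing import Callable, Dict, List

# Hypothetical interface: spec text in, list of finding descriptions out.
Reviewer = Callable[[str], List[str]]


def run_review(spec_text: str, reviewers: Dict[str, Reviewer]) -> Dict[str, List[str]]:
    """Run every reviewer on the same spec; map each finding to the models that raised it."""
    raised_by: Dict[str, List[str]] = {}
    for model, review in reviewers.items():
        for finding in review(spec_text):
            raised_by.setdefault(finding, []).append(model)
    return raised_by


def unique_to(model: str, raised_by: Dict[str, List[str]]) -> List[str]:
    """Findings raised by exactly one model, namely the given one."""
    return [f for f, models in raised_by.items() if models == [model]]


# Stand-in reviewers with canned output, used only to exercise the harness.
def breadth_stub(spec_text: str) -> List[str]:
    return ["window boundary is ambiguous", "total_loss is undefined"]


def logic_stub(spec_text: str) -> List[str]:
    return ["trigger direction is reversed", "window boundary is ambiguous"]


merged = run_review("spec text here", {"gpt-5": breadth_stub, "opus": logic_stub})
print(unique_to("gpt-5", merged))
print(unique_to("opus", merged))
```

Keeping the per-finding model list, instead of a flat merged list, makes the "unique findings by model" breakdown used in this experiment a one-line query.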
## Meta-Observation: Task Type Taxonomy

| Task category | Sonnet value | Example experiments |
|---|---|---|
| Systematic/exhaustive | None | #31 spec-gap, #32 testability, #25 contradiction |
| Creative/generative | Meta-analytical synthesis | #29-30 adversarial gaming, #23 emergent behavior identification |
| Compliance/regulatory | Adequate but shallow | #22 silent correctness |

Testability analysis falls firmly in the systematic/exhaustive category. A two-model configuration
(GPT-5 + Opus) is optimal.