model-research/findings/2026-05-06-32-testability-analysis-wash-sale-tracking.md
Rodin 8cfabfdc55 experiment #32: testability analysis — new analytical lens
Tested GPT-5, Opus, Sonnet on wash-sale-tracking.md spec.
Opus found a genuine spec bug (trigger logic described backwards).
Confirms pattern: GPT-5 for breadth, Opus for logic contradictions,
Sonnet adds no value for systematic analytical tasks.
2026-05-06 10:09:05 -07:00

Experiment #32: Testability Analysis on wash-sale-tracking.md

Date: 2026-05-06
Task type: Testability analysis (NEW analytical lens)
Document: gargoyle's wash-sale-tracking.md (184 lines), the IRC 1091 wash sale detection plan

Hypothesis

Testability analysis (identifying what parts of a spec prevent deterministic automated testing) is a distinct analytical lens from gap analysis or contradiction detection. Models may differ in whether they find boundary ambiguities (precision, rounding, inclusive/exclusive) vs logical contradictions that make test assertions indeterminate.

Method

Same structured prompt to all three models:

Identify testability problems: untestable behaviors, ambiguities causing different test assertions, implicit fragile assumptions, missing boundary conditions, timing/ordering non-determinism. For each: state spec text, why it's a testability problem, what the spec needs to say.

Models: GPT-5, Claude Opus 4.6, Claude Sonnet 4 (all via HAI proxy on anvil).

Results

| Model | Findings | Output tokens | Reasoning tokens | Latency | Tokens/finding |
|---|---|---|---|---|---|
| GPT-5 | 14 | 3,039 | 6,336 | 124.5s | 670 (incl. reasoning) |
| Opus | 10 | 2,719 | | ~90s | 272 |
| Sonnet | ~11 (grouped) | 1,117 | | ~24s | 102 |
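The Tokens/finding column can be reproduced from the other columns. The sketch below is only a check of that arithmetic. It assumes the empty reasoning-token cells mean "not reported" and counts them as zero, and it takes Sonnet's approximate count of 11 at face value. The names `runs` and `tokens_per_finding` are illustrative.

```python
# Figures copied from the results table.
# Assumption: reasoning tokens were reported for GPT-5 only; blanks count as 0.
runs = {
    "GPT-5": {"findings": 14, "output": 3039, "reasoning": 6336},
    "Opus": {"findings": 10, "output": 2719, "reasoning": 0},
    "Sonnet": {"findings": 11, "output": 1117, "reasoning": 0},  # "~11 (grouped)"
}


def tokens_per_finding(run, include_reasoning=True):
    """Total tokens spent divided by number of findings, rounded to an integer."""
    total = run["output"] + (run["reasoning"] if include_reasoning else 0)
    return round(total / run["findings"])


per_finding = {name: tokens_per_finding(run) for name, run in runs.items()}
print(per_finding)

# GPT-5 on output tokens alone, for a like-for-like comparison with the others.
gpt5_output_only = tokens_per_finding(runs["GPT-5"], include_reasoning=False)
print(gpt5_output_only)
```

On output tokens alone GPT-5 comes to about 217 tokens per finding, below Opus, so the 670 figure mostly reflects reasoning overhead rather than verbose output.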

Unique Findings by Model

GPT-5 unique (5):

  1. total_loss is undefined (gross vs net, before/after prior adjustments)
  2. Effective cost basis formula inconsistency (Decision 9 omits lot_adjustments)
  3. Cascading wash sales — loss lot that was previously a replacement lot
  4. 61-day vs 30-day wording inconsistency creates off-by-one interpretations
  5. Trade date vs settlement date as standalone finding

Opus unique (5, including a real spec bug):

  1. Forward detection logic is CONTRADICTORY — spec says "lot open catches forward wash sales" but forward wash sales are buy-then-sell-at-loss, so the closure is the trigger. The spec describes the triggers backwards. This is a genuine spec bug.
  2. Daisy-chain holding period propagation (A→B→C: does C inherit A's date?)
  3. Per-share vs per-lot basis semantics for partial closures of replacement lots
  4. Instrument identity breaks across mergers (same instrument_id rule fails)
  5. "What counts as a loss sale" — raw or adjusted P&L? (circular dependency)
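Opus's finding 1 can be checked with a toy event model. This is a minimal sketch under stated assumptions, not the spec's design: a single instrument, no share quantities, and hypothetical names (`Event`, `detect`, the `on_close` and `on_open` trigger labels). It shows that in the buy-then-sell-at-loss case no loss exists yet when the replacement lot opens, so only the closure can fire.

```python
from dataclasses import dataclass
from datetime import date, timedelta

WINDOW = timedelta(days=30)


@dataclass(frozen=True)
class Event:
    kind: str         # "open" (buy) or "close" (sell)
    lot_id: str
    day: date
    pnl: float = 0.0  # realized P&L; meaningful for "close" events only


def detect(events):
    """Return (trigger, loss_lot, replacement_lot) tuples.

    Each event looks back only at events already seen, which is all an
    event-driven detector has available when the trigger fires.
    """
    hits, seen = [], []
    for ev in sorted(events, key=lambda e: e.day):
        for prior in seen:
            if ev.day - prior.day > WINDOW or prior.lot_id == ev.lot_id:
                continue
            if ev.kind == "close" and ev.pnl < 0 and prior.kind == "open":
                # Loss sale finds an earlier replacement buy.
                hits.append(("on_close", ev.lot_id, prior.lot_id))
            elif ev.kind == "open" and prior.kind == "close" and prior.pnl < 0:
                # New buy finds an earlier loss sale.
                hits.append(("on_open", prior.lot_id, ev.lot_id))
        seen.append(ev)
    return hits


# Buy-then-sell-at-loss: replacement lot B is bought before lot A is sold at a loss.
buy_then_sell = detect([
    Event("open", "A", date(2026, 1, 5)),
    Event("open", "B", date(2026, 3, 1)),
    Event("close", "A", date(2026, 3, 10), pnl=-500.0),
])

# Sell-at-loss-then-buy: lot C is bought after lot A is sold at a loss.
sell_then_buy = detect([
    Event("open", "A", date(2026, 1, 5)),
    Event("close", "A", date(2026, 3, 10), pnl=-500.0),
    Event("open", "C", date(2026, 3, 20)),
])
print(buy_then_sell, sell_then_buy)
```

A test written against the spec's wording would expect the `on_open` trigger in the first scenario, where it can never fire.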

Sonnet unique: None. Two of its findings were rejected:

  • Options substantially-identical finding is a false positive (spec explicitly defers this)
  • Database performance finding is off-topic (not a testability problem)

Key Insight

Opus found a genuine spec bug that neither GPT-5 nor Sonnet identified: the spec's description of which trigger catches which direction of wash sale is logically backwards. This is not an ambiguity but an error in the spec's reasoning. It confirms the pattern from experiment #31: Opus excels at finding where the spec's own logic contradicts itself.

Pattern Confirmation

For systematic/exhaustive analytical tasks:

  • GPT-5: Best for comprehensive boundary/precision/edge-case enumeration (breadth)
  • Opus: Best for finding logic contradictions and false assumptions (depth/insight)
  • Sonnet: No unique value; produces false positives

This matches the spec-gap analysis (#31) exactly. Sonnet contributes unique insights only in creative/generative tasks (adversarial gaming #29-30, emergent behavior identification #23).

Practical Recommendation

For testability reviews of spec documents before implementation:

  1. Run GPT-5 for comprehensive boundary/precision/edge-case coverage
  2. Run Opus for logic-level contradictions and assumption violations
  3. Skip Sonnet — it finds nothing the other two miss and adds noise

Meta-Observation: Task Type Taxonomy

| Task category | Sonnet value | Example experiments |
|---|---|---|
| Systematic/exhaustive | None | #31 spec-gap, #32 testability, #25 contradiction |
| Creative/generative | Meta-analytical synthesis | #29-30 adversarial gaming |
| Compliance/regulatory | Adequate but shallow | #22 silent correctness |

Testability analysis falls firmly in the systematic/exhaustive category, so the two-model configuration (GPT-5 + Opus) is the best fit among those tested.