From 8cfabfdc5513fd296cb15cbc56b239759d4b412e Mon Sep 17 00:00:00 2001
From: Rodin
Date: Wed, 6 May 2026 10:09:05 -0700
Subject: [PATCH] =?UTF-8?q?experiment=20#32:=20testability=20analysis=20?=
 =?UTF-8?q?=E2=80=94=20new=20analytical=20lens?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Tested GPT-5, Opus, Sonnet on wash-sale-tracking.md spec. Opus found a
genuine spec bug (trigger logic described backwards). Confirms pattern:
GPT-5 for breadth, Opus for logic contradictions, Sonnet adds no value
for systematic analytical tasks.
---
 ...testability-analysis-wash-sale-tracking.md | 87 +++++++++++++++++++
 1 file changed, 87 insertions(+)
 create mode 100644 findings/2026-05-06-32-testability-analysis-wash-sale-tracking.md

diff --git a/findings/2026-05-06-32-testability-analysis-wash-sale-tracking.md b/findings/2026-05-06-32-testability-analysis-wash-sale-tracking.md
new file mode 100644
index 0000000..1141852
--- /dev/null
+++ b/findings/2026-05-06-32-testability-analysis-wash-sale-tracking.md
@@ -0,0 +1,87 @@
+# Experiment #32: Testability Analysis on wash-sale-tracking.md
+
+**Date:** 2026-05-06
+**Task type:** Testability analysis (NEW analytical lens)
+**Document:** gargoyle's `wash-sale-tracking.md` (184 lines) — IRC 1091 wash sale detection plan
+
+## Hypothesis
+
+Testability analysis (identifying what parts of a spec prevent deterministic automated testing)
+is a distinct analytical lens from gap analysis or contradiction detection. Models may differ in
+whether they find boundary ambiguities (precision, rounding, inclusive/exclusive) vs logical
+contradictions that make test assertions indeterminate.
+
+## Method
+
+Same structured prompt to all three models:
+> Identify testability problems: untestable behaviors, ambiguities causing different test
+> assertions, implicit fragile assumptions, missing boundary conditions, timing/ordering
+> non-determinism. For each: state spec text, why it's a testability problem, what the spec
+> needs to say.
+
+Models: GPT-5, Claude Opus 4.6, Claude Sonnet 4 (all via HAI proxy on anvil).
+
+## Results
+
+| Model | Findings | Output tokens | Reasoning tokens | Latency | Tokens/finding |
+|---|---|---|---|---|---|
+| GPT-5 | 14 | 3,039 | 6,336 | 124.5s | 670 (incl reasoning) |
+| Opus | 10 | 2,719 | — | ~90s | 272 |
+| Sonnet | ~11 (grouped) | 1,117 | — | ~24s | 102 |
+
+## Unique Findings by Model
+
+### GPT-5 unique (5):
+1. `total_loss` is undefined (gross vs net, before/after prior adjustments)
+2. Effective cost basis formula inconsistency (Decision 9 omits lot_adjustments)
+3. Cascading wash sales — loss lot that was previously a replacement lot
+4. 61-day vs 30-day wording inconsistency creates off-by-one interpretations
+5. Trade date vs settlement date as standalone finding
+
+### Opus unique (5, including a real spec bug):
+1. **Forward detection logic is CONTRADICTORY** — spec says "lot open catches forward wash sales"
+   but forward wash sales are buy-then-sell-at-loss, so the *closure* is the trigger. The spec
+   describes the triggers backwards. This is a genuine spec bug.
+2. Daisy-chain holding period propagation (A→B→C: does C inherit A's date?)
+3. Per-share vs per-lot basis semantics for partial closures of replacement lots
+4. Instrument identity breaks across mergers (same instrument_id rule fails)
+5. "What counts as a loss sale" — raw or adjusted P&L? (circular dependency)
+
+### Sonnet unique: None
+- Options substantially-identical finding is a false positive (spec explicitly defers this)
+- Database performance finding is off-topic (not a testability problem)
+
+## Key Insight
+
+**Opus found a genuine spec bug** that neither GPT-5 nor Sonnet identified. The spec's description
+of which trigger catches which direction of wash sale is logically backwards. This is not ambiguity —
+it's an error in the spec's reasoning. This confirms the pattern from experiment #31: Opus excels
+at finding where the spec's OWN LOGIC contradicts itself.
+
+## Pattern Confirmation
+
+For systematic/exhaustive analytical tasks:
+- **GPT-5:** Best for comprehensive boundary/precision/edge-case enumeration (breadth)
+- **Opus:** Best for finding logic contradictions and false assumptions (depth/insight)
+- **Sonnet:** No unique value; produces false positives
+
+This matches spec-gap analysis (#31) exactly. Sonnet only contributes unique insights in
+creative/generative tasks (adversarial gaming #29-30, emergent behavior identification #23).
+
+## Practical Recommendation
+
+For testability reviews of spec documents before implementation:
+1. Run GPT-5 for comprehensive boundary/precision/edge-case coverage
+2. Run Opus for logic-level contradictions and assumption violations
+3. Skip Sonnet — it finds nothing the other two miss and adds noise
+
+## Meta-Observation: Task Type Taxonomy
+
+| Task category | Sonnet value | Example experiments |
+|---|---|---|
+| Systematic/exhaustive | None | #31 spec-gap, #32 testability, #25 contradiction |
+| Creative/generative | Meta-analytical synthesis | #29-30 adversarial gaming |
+| Compliance/regulatory | Adequate but shallow | #22 silent correctness |
+
+Testability analysis falls firmly in the systematic/exhaustive category. Two-model configuration
+(GPT-5 + Opus) is optimal.