model-research/findings/2026-05-10-64-specification-gap-analysis.md
Rodin 873591877d Finding #64: Specification gap analysis - new analytical lens
Tested GPT-5, Opus, Sonnet on specid-lot-selection.md (125 lines)
for implementation specification gaps.

Key findings:
- Opus most cost-effective (4.6 gaps/1K tokens vs 1.8 for GPT-5)
- GPT-5 catches operational/financial edge cases (fees, multi-execution)
- Opus catches design-level binding ambiguities
- Sonnet too shallow for serious spec review

New lens distinct from hidden assumptions and race conditions:
focuses on ambiguity of intent, not risks.
2026-05-10 11:10:33 -07:00

# Specification Gap Analysis — Finding #64
**Date:** 2026-05-10
**Lens:** Implementation Specification Gaps (NEW)
**Document:** gargoyle's `specid-lot-selection.md` (125 lines)
**Models:** GPT-5, Claude Opus 4, Claude Sonnet 4
## Summary
New analytical lens examining **implementation specification gaps** — areas where an implementer would need to make decisions the specification doesn't explicitly address. Unlike hidden assumptions (what must be true) or race conditions (temporal interactions), this lens focuses on **ambiguity that leads to divergent implementations**.
The prompt explicitly asked for:
1. Underspecified algorithms
2. Edge case silence
3. Ordering ambiguity
4. Implicit prerequisites
5. Consistency gaps
6. Semantic ambiguity
## Results
| Model | Time | Output tokens | Reasoning tokens | Gaps found | High impact | Medium |
|---|---|---|---|---|---|---|
| GPT-5 | 126s | 9,561 | 6,592 | 17 | 13 | 4 |
| Claude Opus 4 | 63s | 3,951 | (internal) | 18 | 12 | 4 |
| Claude Sonnet 4 | 28s | 1,639 | (internal) | 11 | 5 | 6 |
## Key Findings by Category
### Tie-Breaking Ambiguity (all models found)
- **HIFO tie-breaking** — Two lots with identical cost basis: sort by opened_at? lot_id? quantity?
- **FIFO/LIFO tie-breaking** — Two lots with identical timestamps: deterministic secondary key?
- All three models correctly identified this as HIGH impact (affects tax classification)
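
A minimal sketch of how the HIFO tie-break choice changes which lot is consumed. The `Lot` record, its field names, and the sample lots are hypothetical and not taken from the spec. The secondary keys shown are only one possible choice.

```python
from dataclasses import dataclass
from datetime import datetime
from decimal import Decimal


@dataclass(frozen=True)
class Lot:
    # Hypothetical lot record; field names are assumptions, not from the spec.
    lot_id: str
    cost_basis: Decimal
    opened_at: datetime
    quantity: int


def hifo_order(lots):
    """Highest cost basis first; ties broken by opened_at, then lot_id."""
    return sorted(lots, key=lambda l: (-l.cost_basis, l.opened_at, l.lot_id))


lots = [
    Lot("B", Decimal("10.00"), datetime(2024, 3, 1), 100),
    Lot("A", Decimal("10.00"), datetime(2025, 1, 15), 100),
]

# Tie on cost basis: opened_at decides, so the older lot "B" is consumed first.
by_opened = [l.lot_id for l in hifo_order(lots)]

# An implementer who breaks ties on lot_id alone consumes "A" first instead.
by_lot_id = [l.lot_id for l in sorted(lots, key=lambda l: (-l.cost_basis, l.lot_id))]

print(by_opened, by_lot_id)
```

Both orderings are valid readings of "highest cost first", yet they close lots with different open dates, which is why the tie-break affects tax classification.
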
### Regulatory Compliance Gaps (all models found)
- **Settlement timing** — Treasury Regulation §1.1012-1(c) requires designation "at or before settlement" but spec doesn't define when settlement occurs or when designation locks
- **Holding period calculation** — "Lots held > 1 year" but spec doesn't define whether this is trade date, settlement date, timestamp-precise, or date-only
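
To illustrate the holding-period ambiguity, the sketch below compares two readings of "held > 1 year": a date-only comparison and a timestamp-precise one. The function names, the Feb 29 convention, and the sample timestamps are assumptions for illustration. Neither reading is claimed to be the one the spec or the regulation intends.

```python
from datetime import datetime


def add_one_year(moment):
    # Works for date and datetime. Feb 29 maps to Feb 28 (one possible convention).
    try:
        return moment.replace(year=moment.year + 1)
    except ValueError:
        return moment.replace(year=moment.year + 1, day=28)


def long_term_date_only(opened: datetime, closed: datetime) -> bool:
    # Reading A: compare calendar dates only.
    return closed.date() > add_one_year(opened.date())


def long_term_timestamp(opened: datetime, closed: datetime) -> bool:
    # Reading B: compare full timestamps.
    return closed > add_one_year(opened)


opened = datetime(2024, 5, 10, 15, 30)
closed = datetime(2025, 5, 10, 16, 0)

# Sold on the anniversary date, 30 minutes after the anniversary timestamp.
date_only_result = long_term_date_only(opened, closed)
timestamp_result = long_term_timestamp(opened, closed)
print(date_only_result, timestamp_result)
```

The same sale is short-term under the date-only reading and long-term under the timestamp reading, so two compliant implementations would report different tax treatment.
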
### Partial Fill Handling (GPT-5 + Opus, not Sonnet)
- **Manual selection + partial fills** — User selects Lot A:600, Lot B:400 for 1000-share sell. Only 300 fills. Pro-rata? Sequential? Reject?
- Sonnet mentioned partial lot closure mechanics but missed the manual selection + partial fill interaction
- HIGH impact: partial fills are common; ambiguity affects every partial fill with Manual selection
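
The Lot A:600 / Lot B:400 example with a 300-share fill can be sketched under two of the candidate policies. The function names and the rounding rule in the pro-rata variant are assumptions. The spec would need to pick one policy (or reject the fill).

```python
def allocate_sequential(selection, filled):
    """Fill lots in the order the user listed them."""
    out = {}
    remaining = filled
    for lot_id, qty in selection:
        take = min(qty, remaining)
        out[lot_id] = take
        remaining -= take
    return out


def allocate_pro_rata(selection, filled):
    """Split the fill in proportion to the selected quantities.

    Rounds down and hands leftover shares to the earliest lots, which is
    itself a choice the spec would have to make.
    """
    total = sum(qty for _, qty in selection)
    out = {lot_id: filled * qty // total for lot_id, qty in selection}
    leftover = filled - sum(out.values())
    for lot_id, qty in selection:
        if leftover == 0:
            break
        if out[lot_id] < qty:
            out[lot_id] += 1
            leftover -= 1
    return out


selection = [("A", 600), ("B", 400)]
sequential = allocate_sequential(selection, 300)
pro_rata = allocate_pro_rata(selection, 300)
print(sequential, pro_rata)
```

Sequential allocation closes 300 shares of Lot A and none of Lot B, while pro-rata closes 180 and 120.
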
### GPT-5 Unique Findings (6 not in either Claude model)
1. **Fees/commissions in gain formula:** `gain = (sell_price - cost_basis) × quantity` but no mention of fees. Different tax systems handle fees differently.
2. **Multi-execution price handling** — Fill at 50@$10.00 + 50@$9.90. Use VWAP? Per-execution closures? Pro-rate fees?
3. **Sell exceeds available long** — Algorithm assumes consuming open lots but doesn't specify shorting or rejection
4. **Account/portfolio scoping** — "Correct instrument" but no mention of cross-account lot selection
5. **Partial metadata availability** — What if opened_at is missing but basis exists? All-or-nothing failure?
6. **Basis at selection time vs post-adjustment** — HIFO uses current basis but wash sales can adjust basis retroactively. Idempotent replay could choose different lots.
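
A rough sketch of findings 1 and 2 together, using the 50@$10.00 + 50@$9.90 fill from the list above. The two lots, their cost bases, and the per-execution fee are hypothetical values chosen for illustration. Subtracting fees from the gain is only one of several possible treatments.

```python
from decimal import Decimal as D

executions = [(50, D("10.00")), (50, D("9.90"))]            # fill from the finding
lots = [("lot1", 50, D("8.00")), ("lot2", 50, D("9.00"))]   # hypothetical lots
fee_per_execution = D("1.00")                               # hypothetical fee

total_qty = sum(q for q, _ in executions)
vwap = sum(q * p for q, p in executions) / total_qty

# Option 1: every lot closes at the VWAP of the whole order.
gain_vwap = {lot_id: (vwap - basis) * qty for lot_id, qty, basis in lots}

# Option 2: each execution closes lots in order (here one execution per lot).
gain_per_exec = {
    lot_id: (price - basis) * qty
    for (lot_id, qty, basis), (_, price) in zip(lots, executions)
}

gross = sum(gain_vwap.values())
# One possible fee treatment: reduce the total gain by all execution fees.
net_of_fees = gross - fee_per_execution * len(executions)
print(vwap, gain_vwap, gain_per_exec, gross, net_of_fees)
```

The total gross gain is the same under both options, but the per-lot gains differ, which matters when the lots have different holding periods. Whether and how fees reduce the gain is a separate decision the formula leaves open.
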
### Opus Unique Findings (5 not in either other model)
1. **Strategy evaluation timing** — User changes HIFO→FIFO between order submission and fill arrival. Which applies?
2. **Corporate action during consumption** — Stock splits while algorithm is mid-walk through lots
3. **Lot ledger invariants** — Assumes re-derivation works but doesn't state invariants (lots sum to position, no negative shares)
4. **"Last fill wins" vs audit completeness** — Serialized processing drops fills but audit trail requires completeness
5. **Same-day open/close** — Buy at 9:00, sell at 15:00 same day. Lot opening and closing overlap in time. Consumption algorithm handles this?
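
The two invariants named in finding 3 (lots sum to position, no negative shares) could be stated as an explicit check. This is a sketch with a hypothetical function name and a simplified `(lot_id, quantity)` representation, not the spec's data model.

```python
def check_ledger_invariants(lots, position_qty):
    """Return the list of violated invariants (empty if the ledger is consistent).

    `lots` is a list of (lot_id, quantity) pairs; this shape is an assumption.
    """
    violations = []
    if any(qty < 0 for _, qty in lots):
        violations.append("negative lot quantity")
    if sum(qty for _, qty in lots) != position_qty:
        violations.append("lots do not sum to position")
    return violations


consistent = check_ledger_invariants([("A", 100), ("B", 50)], 150)
negative = check_ledger_invariants([("A", 100), ("B", -10)], 90)
mismatch = check_ledger_invariants([("A", 100), ("B", 50)], 200)
print(consistent, negative, mismatch)
```

Writing the invariants down this way gives re-derivation a concrete definition of "works" that implementers and tests can share.
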
### Sonnet Unique Findings (1 not in other models)
1. **Instrument delisting during processing** — Order placed, stock delisted before fill. Proceed or fail?
- Note: This is a valid edge case but lower priority than the implementation ambiguities GPT-5 and Opus found
## Model Analytical Style Comparison
### GPT-5: Exhaustive enumeration with concrete divergence examples
GPT-5 produced the most detailed divergence scenarios, often spelling out three different implementation approaches. Its fee/commission finding is particularly valuable: a real-world tax preparation concern that neither Claude model mentioned. The multi-execution price handling finding shows deep domain reasoning about how fills actually arrive.
**Strength:** Real-world operational considerations (fees, multi-price fills, cross-account)
**Weakness:** Sometimes conflates "things the spec doesn't mention" with "things the implementer must guess"
### Opus: Regulatory and system-boundary focused
Opus found fewer unique gaps than GPT-5 (5 vs 6), but they were more architecturally significant. The "strategy evaluation timing" gap reveals a fundamental question about when configuration binding occurs. The "lot ledger invariants" gap identifies that the spec assumes self-consistency without stating the rules. The "last fill wins vs audit completeness" finding shows Opus's characteristic strength at finding consistency tensions.
**Strength:** Design-level contradictions and boundary ambiguities
**Weakness:** Less operational detail than GPT-5
### Sonnet: Structural scan with lower depth
Sonnet found 11 gaps vs 17-18 for the other models. Its findings were valid but shallower — it identified tie-breaking ambiguity but didn't explore cross-account, multi-execution, or fee implications. The "instrument delisting" finding was unique but low-priority.
**Strength:** Fast (28s vs 63s/126s), correct on core issues
**Weakness:** Misses operational nuance and doesn't reason about component interactions
## Novel Insight: Implementation Spec Gaps as Lens
This analytical lens differs from previous experiments:
| Previous lens | Focus | Example |
|---|---|---|
| Hidden assumptions | What must be true for this to work | "Assumes broker API returns all fills" |
| Race conditions | Temporal interleavings that cause bugs | "Fill arrives before lot state updates" |
| **Specification gaps (NEW)** | What implementer must decide that spec doesn't | "HIFO tie-breaking undefined" |
**Key distinction:** Specification gap analysis is about **ambiguity of intent**, not **risks** or **assumptions**. A spec can be internally consistent, assume correct inputs, have no race conditions, and STILL be underspecified — leading to divergent implementations that all "follow the spec."
This is particularly valuable for:
- Design review before implementation
- Documentation quality assessment
- Identifying where tests should specify behavior
## Overlap Analysis
| Gap type | GPT-5 | Opus | Sonnet | All 3 |
|---|---|---|---|---|
| Tie-breaking ambiguity | ✓ | ✓ | ✓ | ✓ |
| Settlement/holding period | ✓ | ✓ | ✓ | ✓ |
| Partial closure mechanics | ✓ | ✓ | ✓ | ✓ |
| Manual + partial fills | ✓ | ✓ | — | — |
| Fees/commissions | ✓ | — | — | — |
| Multi-execution pricing | ✓ | — | — | — |
| Strategy evaluation timing | — | ✓ | — | — |
| Cross-account scoping | ✓ | — | — | — |
| Lot ledger invariants | — | ✓ | — | — |
| Instrument delisting | — | — | ✓ | — |
**Overlap rate:** 5 gaps were found by all 3 models (the table groups them into three gap types), and 12+ gaps were unique to one model (6 for GPT-5, 5 for Opus, 1 for Sonnet).
## Practical Implication
For **specification quality review** (not architecture review), run:
1. **GPT-5:** catches operational/financial edge cases (fees, multi-fills, cross-account)
2. **Opus:** catches design-level binding ambiguities and consistency gaps
3. **Sonnet** (optional, for speed): catches structural issues but misses depth
The union of GPT-5 + Opus findings would produce 23+ unique gaps vs 11-18 from any single model.
## Cost-Efficiency
| Model | Gaps | Time | Tokens | Gaps/minute | Gaps/1K tokens |
|---|---|---|---|---|---|
| GPT-5 | 17 | 126s | 9,561 | 8.1 | 1.8 |
| Opus | 18 | 63s | 3,951 | 17.1 | 4.6 |
| Sonnet | 11 | 28s | 1,639 | 23.6 | 6.7 |
**Opus is the most cost-effective model with adequate depth** for this task type: roughly 2.5x more efficient than GPT-5 in gaps per token (4.6 vs 1.8) at comparable depth. Sonnet has the highest raw ratios (6.7 gaps/1K tokens, 23.6 gaps/minute) but misses too much for serious specification review.
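
The two ratio columns can be reproduced from the gaps, time, and token figures in the table. The sketch below assumes the token column is the output-token count from the Results table. The dictionary layout and function name are illustrative.

```python
results = {
    "GPT-5": {"gaps": 17, "seconds": 126, "tokens": 9561},
    "Opus": {"gaps": 18, "seconds": 63, "tokens": 3951},
    "Sonnet": {"gaps": 11, "seconds": 28, "tokens": 1639},
}


def efficiency(r):
    """Return (gaps per minute, gaps per 1K tokens), rounded to one decimal."""
    gaps_per_minute = r["gaps"] / (r["seconds"] / 60)
    gaps_per_1k_tokens = r["gaps"] / (r["tokens"] / 1000)
    return round(gaps_per_minute, 1), round(gaps_per_1k_tokens, 1)


table = {model: efficiency(r) for model, r in results.items()}
for model, (per_min, per_1k) in table.items():
    print(f"{model}: {per_min} gaps/minute, {per_1k} gaps/1K tokens")
```

These ratios count every gap equally and measure tokens rather than price, so they say nothing about the depth of individual findings.
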
## Recommendations
1. **Spec quality gate:** Before implementation starts, run Opus + GPT-5 on the spec with this prompt. Address HIGH-impact gaps before coding.
2. **Different from architecture review:** This is a documentation quality check, not a safety review. Different skill for different purpose.
3. **Domain expertise matters:** Several GPT-5 findings (fees, multi-execution pricing) reflect financial domain knowledge. For domain-specific specs, GPT-5's breadth may be worth the extra cost.