Finding #64: Specification gap analysis - new analytical lens
Tested GPT-5, Opus, Sonnet on specid-lot-selection.md (125 lines) for implementation specification gaps.

Key findings:
- Opus most cost-effective (4.6 gaps/1K tokens vs 1.8 for GPT-5)
- GPT-5 catches operational/financial edge cases (fees, multi-execution)
- Opus catches design-level binding ambiguities
- Sonnet too shallow for serious spec review

New lens distinct from hidden assumptions and race conditions: focuses on ambiguity of intent, not risks.
@@ -0,0 +1,140 @@
# Specification Gap Analysis: Finding #64

**Date:** 2026-05-10
**Lens:** Implementation Specification Gaps (NEW)
**Document:** gargoyle's `specid-lot-selection.md` (125 lines)
**Models:** GPT-5, Claude Opus 4, Claude Sonnet 4

## Summary

This is a new analytical lens examining **implementation specification gaps**: areas where an implementer would need to make decisions the specification doesn't explicitly address. Unlike hidden assumptions (what must be true) or race conditions (temporal interactions), this lens focuses on **ambiguity that leads to divergent implementations**.

The prompt explicitly asked for:
1. Underspecified algorithms
2. Edge case silence
3. Ordering ambiguity
4. Implicit prerequisites
5. Consistency gaps
6. Semantic ambiguity

## Results

| Model | Time | Output tokens | Reasoning tokens | Gaps found | High impact | Medium |
|---|---|---|---|---|---|---|
| GPT-5 | 126s | 9,561 | 6,592 | 17 | 13 | 4 |
| Claude Opus 4 | 63s | 3,951 | (internal) | 18 | 12 | 4 |
| Claude Sonnet 4 | 28s | 1,639 | (internal) | 11 | 5 | 6 |

## Key Findings by Category

### Tie-Breaking Ambiguity (all models found)
- **HIFO tie-breaking:** Two lots with identical cost basis: sort by opened_at? lot_id? quantity?
- **FIFO/LIFO tie-breaking:** Two lots with identical timestamps: deterministic secondary key?
- All three models correctly identified this as HIGH impact (affects tax classification)

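
One way a spec could close the tie-breaking gap is to name a full sort key. The sketch below is illustrative only: the `Lot` fields and the choice of `opened_at` then `lot_id` as secondary keys are assumptions, not something the reviewed spec states.

```python
from dataclasses import dataclass
from datetime import datetime
from decimal import Decimal


@dataclass(frozen=True)
class Lot:
    lot_id: str
    cost_basis: Decimal  # per-share basis
    opened_at: datetime
    quantity: int


def hifo_order(lots: list[Lot]) -> list[Lot]:
    """Highest cost basis first; ties broken by opened_at, then lot_id.

    The secondary keys are assumed for illustration. The spec under
    review does not say which tie-breaker applies.
    """
    return sorted(lots, key=lambda lot: (-lot.cost_basis, lot.opened_at, lot.lot_id))


lots = [
    Lot("B", Decimal("10.00"), datetime(2024, 3, 1), 100),
    Lot("A", Decimal("10.00"), datetime(2024, 3, 1), 50),
    Lot("C", Decimal("12.50"), datetime(2025, 1, 5), 25),
]
ordered_ids = [lot.lot_id for lot in hifo_order(lots)]
print(ordered_ids)  # ['C', 'A', 'B']
```

Because the key is total (no two lots share a `lot_id`), the result does not depend on input order, which is the property a replayable lot-selection algorithm needs.
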
### Regulatory Compliance Gaps (all models found)
- **Settlement timing:** Treasury Regulation §1.1012-1(c) requires designation "at or before settlement" but spec doesn't define when settlement occurs or when designation locks
- **Holding period calculation:** "Lots held > 1 year" but spec doesn't define whether this is trade date, settlement date, timestamp-precise, or date-only

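
To see why the holding-period wording matters, consider a minimal sketch of one date-only reading of "held > 1 year". The helper names, the Feb 29 convention, and the example dates are all hypothetical; the point is only that the same lot classifies differently depending on which acquisition date is used.

```python
from datetime import date


def one_year_after(d: date) -> date:
    """Calendar anniversary; Feb 29 maps to Feb 28 (an assumed convention)."""
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        return d.replace(year=d.year + 1, day=28)


def is_long_term(acquired: date, sold: date) -> bool:
    """One date-only reading: sold strictly after the one-year anniversary."""
    return sold > one_year_after(acquired)


# Hypothetical lot: bought with trade date 2024-03-14, settled 2024-03-15,
# then sold on 2025-03-15.
by_trade_date = is_long_term(date(2024, 3, 14), date(2025, 3, 15))
by_settle_date = is_long_term(date(2024, 3, 15), date(2025, 3, 15))
print(by_trade_date, by_settle_date)  # True False
```

Two implementers who each "follow the spec" could therefore report the same sale as long-term and short-term respectively.
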
### Partial Fill Handling (GPT-5 + Opus, not Sonnet)
- **Manual selection + partial fills:** User selects Lot A:600, Lot B:400 for 1000-share sell. Only 300 fills. Pro-rata? Sequential? Reject?
- Sonnet mentioned partial lot closure mechanics but missed the manual selection + partial fill interaction
- HIGH impact: partial fills are common; ambiguity affects every partial fill with Manual selection

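
The two non-rejecting interpretations can be sketched as follows, using the Lot A:600 / Lot B:400 example with a 300-share fill. Function names and the rounding rule in the pro-rata variant are assumptions made for illustration.

```python
def allocate_sequential(selection: dict[str, int], filled: int) -> dict[str, int]:
    """Consume the selected lots in listed order until the fill is used up."""
    if filled > sum(selection.values()):
        raise ValueError("fill exceeds selected quantity")
    out, remaining = {}, filled
    for lot_id, qty in selection.items():
        take = min(qty, remaining)
        out[lot_id] = take
        remaining -= take
    return out


def allocate_pro_rata(selection: dict[str, int], filled: int) -> dict[str, int]:
    """Scale each selected quantity by filled/total, flooring to whole shares.

    Shares lost to rounding are handed out one per lot in listed order
    (an assumed rule; the spec is silent on this too).
    """
    total = sum(selection.values())
    if filled > total:
        raise ValueError("fill exceeds selected quantity")
    out = {lot_id: qty * filled // total for lot_id, qty in selection.items()}
    leftover = filled - sum(out.values())
    for lot_id in selection:
        if leftover == 0:
            break
        out[lot_id] += 1
        leftover -= 1
    return out


selection = {"A": 600, "B": 400}
print(allocate_sequential(selection, 300))  # {'A': 300, 'B': 0}
print(allocate_pro_rata(selection, 300))    # {'A': 180, 'B': 120}
```

Both results close exactly 300 shares, yet they touch different lots, so realized gain and holding-period classification can differ.
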
### GPT-5 Unique Findings (6 not in either Claude model)
1. **Fees/commissions in gain formula:** `gain = (sell_price - cost_basis) × quantity` but no mention of fees. Different tax systems handle fees differently.
2. **Multi-execution price handling:** Fill at 50@$10.00 + 50@$9.90. Use VWAP? Per-execution closures? Pro-rate fees?
3. **Sell exceeds available long:** Algorithm assumes consuming open lots but doesn't specify shorting or rejection
4. **Account/portfolio scoping:** "Correct instrument" but no mention of cross-account lot selection
5. **Partial metadata availability:** What if opened_at is missing but basis exists? All-or-nothing failure?
6. **Basis at selection time vs post-adjustment:** HIFO uses current basis but wash sales can adjust basis retroactively. Idempotent replay could choose different lots.

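
As a rough illustration of finding 2, the sketch below applies the 50@$10.00 + 50@$9.90 fill to two hypothetical lots (the lot IDs and per-share bases are invented for the example) and ignores fees entirely. It compares closing all lots at the order's VWAP against closing lots execution by execution.

```python
from decimal import Decimal

# (quantity, price) per execution, from the example in finding 2.
executions = [(50, Decimal("10.00")), (50, Decimal("9.90"))]
# Hypothetical lots in consumption order: (lot_id, quantity, per-share basis).
lots = [("A", 50, Decimal("9.00")), ("B", 50, Decimal("8.00"))]


def gains_vwap(executions, lots):
    """Close every lot at the volume-weighted average price of the whole order."""
    shares = sum(qty for qty, _ in executions)
    vwap = sum(qty * price for qty, price in executions) / shares
    return {lot_id: (vwap - basis) * qty for lot_id, qty, basis in lots}


def gains_per_execution(executions, lots):
    """Walk executions in arrival order, closing lots at each execution's price."""
    gains = {lot_id: Decimal("0") for lot_id, _, _ in lots}
    open_lots = [[lot_id, qty, basis] for lot_id, qty, basis in lots]
    for exec_qty, price in executions:
        for lot in open_lots:
            if exec_qty == 0:
                break
            take = min(exec_qty, lot[1])
            gains[lot[0]] += (price - lot[2]) * take
            lot[1] -= take
            exec_qty -= take
    return gains


print(gains_vwap(executions, lots))           # A: 47.50, B: 97.50
print(gains_per_execution(executions, lots))  # A: 50.00, B: 95.00
```

The total gain is 145 under both readings, but the per-lot split differs, which matters as soon as the lots fall into different holding-period classes.
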
### Opus Unique Findings (5 not in either other model)
1. **Strategy evaluation timing:** User changes HIFO→FIFO between order submission and fill arrival. Which applies?
2. **Corporate action during consumption:** Stock splits while algorithm is mid-walk through lots
3. **Lot ledger invariants:** Assumes re-derivation works but doesn't state invariants (lots sum to position, no negative shares)
4. **"Last fill wins" vs audit completeness:** Serialized processing drops fills but audit trail requires completeness
5. **Same-day open/close:** Buy at 9:00, sell at 15:00 same day. Lot opening and closing overlap in time. Consumption algorithm handles this?

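
The two invariants named in finding 3 are simple enough to state as an executable check. This is a minimal sketch with a hypothetical `check_ledger` helper and a plain dict standing in for the lot ledger; the spec itself defines neither.

```python
def check_ledger(lot_quantities: dict[str, int], position: int) -> list[str]:
    """Return the violated invariants; an empty list means the ledger is consistent."""
    problems = []
    for lot_id, qty in lot_quantities.items():
        if qty < 0:
            problems.append(f"lot {lot_id} has negative shares: {qty}")
    total = sum(lot_quantities.values())
    if total != position:
        problems.append(f"lots sum to {total}, position is {position}")
    return problems


print(check_ledger({"A": 600, "B": 400}, 1000))  # []
print(check_ledger({"A": 600, "B": -100}, 400))
```

Writing the invariants down in the spec would let any re-derivation step be verified rather than assumed correct.
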
### Sonnet Unique Findings (1 not in other models)
1. **Instrument delisting during processing:** Order placed, stock delisted before fill. Proceed or fail?
- Note: This is a valid edge case but lower priority than the implementation ambiguities GPT-5 and Opus found

## Model Analytical Style Comparison

### GPT-5: Exhaustive enumeration with concrete divergence examples
GPT-5 produced the most detailed divergence scenarios, often with three different implementation approaches spelled out. Its fee/commission finding (unique finding 1 above) is particularly valuable: it is a real-world tax preparation concern that neither Claude model mentioned. The multi-execution price handling finding (unique finding 2 above) shows deep domain reasoning about how fills actually arrive.

**Strength:** Real-world operational considerations (fees, multi-price fills, cross-account)
**Weakness:** Sometimes conflates "things the spec doesn't mention" with "things the implementer must guess"

### Opus: Regulatory and system-boundary focused
Opus reported fewer unique gaps than GPT-5 (5 vs 6), but they were more architecturally significant. The "strategy evaluation timing" gap reveals a fundamental question about when configuration binding occurs. The "lot ledger invariants" gap identifies that the spec assumes self-consistency without stating the rules. The "last fill wins vs audit completeness" finding shows Opus's characteristic strength at finding consistency tensions.

**Strength:** Design-level contradictions and boundary ambiguities
**Weakness:** Less operational detail than GPT-5

### Sonnet: Structural scan with lower depth
Sonnet found 11 gaps vs 17-18 for the other models. Its findings were valid but shallower: it identified tie-breaking ambiguity but didn't explore cross-account, multi-execution, or fee implications. The "instrument delisting" finding was unique but low-priority.

**Strength:** Fast (28s vs 63s/126s), correct on core issues
**Weakness:** Misses operational nuance and doesn't reason about component interactions

## Novel Insight: Implementation Spec Gaps as Lens

This analytical lens differs from previous experiments:

| Lens | Focus | Example |
|---|---|---|
| Hidden assumptions | What must be true for this to work | "Assumes broker API returns all fills" |
| Race conditions | Temporal interleavings that cause bugs | "Fill arrives before lot state updates" |
| **Specification gaps (NEW)** | What the implementer must decide because the spec doesn't | "HIFO tie-breaking undefined" |

**Key distinction:** Specification gap analysis is about **ambiguity of intent**, not **risks** or **assumptions**. A spec can be internally consistent, assume correct inputs, have no race conditions, and STILL be underspecified, leading to divergent implementations that all "follow the spec."

This is particularly valuable for:
- Design review before implementation
- Documentation quality assessment
- Identifying where tests should specify behavior

## Overlap Analysis

| Gap type | GPT-5 | Opus | Sonnet | All 3 |
|---|---|---|---|---|
| Tie-breaking ambiguity | ✓ | ✓ | ✓ | ✓ |
| Settlement/holding period | ✓ | ✓ | ✓ | ✓ |
| Partial closure mechanics | ✓ | ✓ | ✓ | ✓ |
| Manual + partial fills | ✓ | ✓ | - | - |
| Fees/commissions | ✓ | - | - | - |
| Multi-execution pricing | ✓ | - | - | - |
| Strategy evaluation timing | - | ✓ | - | - |
| Cross-account scoping | ✓ | - | - | - |
| Lot ledger invariants | - | ✓ | - | - |
| Instrument delisting | - | - | ✓ | - |

**Overlap rate:** 5 gaps found by all 3 models, 12+ gaps unique to one model.

## Practical Implication

For **specification quality review** (not architecture review), run:
1. **GPT-5:** Catches operational/financial edge cases (fees, multi-fills, cross-account)
2. **Opus:** Catches design-level binding ambiguities and consistency gaps
3. **Sonnet** (optional, for speed): Catches structural issues but misses depth

The union of GPT-5 + Opus findings would produce 23+ unique gaps vs 11-18 from any single model.

## Cost-Efficiency

| Model | Gaps | Time | Output tokens | Gaps/minute | Gaps/1K tokens |
|---|---|---|---|---|---|
| GPT-5 | 17 | 126s | 9,561 | 8.1 | 1.8 |
| Opus | 18 | 63s | 3,951 | 17.1 | 4.6 |
| Sonnet | 11 | 28s | 1,639 | 23.6 | 6.7 |

**Opus offers the best cost/depth trade-off** for this task type: about 2.5x more gaps per 1K output tokens than GPT-5 (4.6 vs 1.8) at comparable depth. Sonnet scores highest on both raw ratios (23.6 gaps/minute, 6.7 gaps/1K tokens) but misses too much for serious specification review.

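
The derived columns follow directly from the raw counts. A small sketch that reproduces them is shown here; the dictionary layout and function name are illustrative, and the token figures are the output-token counts from the Results table.

```python
runs = {
    "GPT-5": {"gaps": 17, "seconds": 126, "output_tokens": 9561},
    "Opus": {"gaps": 18, "seconds": 63, "output_tokens": 3951},
    "Sonnet": {"gaps": 11, "seconds": 28, "output_tokens": 1639},
}


def efficiency(run: dict) -> tuple[float, float]:
    """Return (gaps per minute, gaps per 1K output tokens), rounded to 1 decimal."""
    gaps_per_minute = run["gaps"] / (run["seconds"] / 60)
    gaps_per_1k_tokens = run["gaps"] / (run["output_tokens"] / 1000)
    return round(gaps_per_minute, 1), round(gaps_per_1k_tokens, 1)


table = {model: efficiency(run) for model, run in runs.items()}
for model, (per_min, per_1k) in table.items():
    print(f"{model}: {per_min} gaps/minute, {per_1k} gaps/1K tokens")
```

Note that gaps per token is only a proxy for cost: it does not account for per-token pricing differences between models or for the severity of each gap.
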
## Recommendations

1. **Spec quality gate:** Before implementation starts, run Opus + GPT-5 on the spec with this prompt. Address HIGH-impact gaps before coding.
2. **Different from architecture review:** This is a documentation quality check, not a safety review. It is a different skill for a different purpose.
3. **Domain expertise matters:** Several GPT-5 findings (fees, multi-execution pricing) reflect financial domain knowledge. For domain-specific specs, GPT-5's breadth may be worth the extra cost.