Files

T

Rodin 873591877d Finding #64 : Specification gap analysis - new analytical lens

Tested GPT-5, Opus, Sonnet on specid-lot-selection.md (125 lines)
for implementation specification gaps.

Key findings:
- Opus most cost-effective (4.6 gaps/1K tokens vs 1.8 for GPT-5)
- GPT-5 catches operational/financial edge cases (fees, multi-execution)
- Opus catches design-level binding ambiguities
- Sonnet too shallow for serious spec review

New lens distinct from hidden assumptions and race conditions:
focuses on ambiguity of intent, not risks.

2026-05-10 11:10:33 -07:00

8.5 KiB

Raw Blame History

Specification Gap Analysis — Finding #64

Date: 2026-05-10
Lens: Implementation Specification Gaps (NEW)
Document: gargoyle's specid-lot-selection.md (125 lines)
Models: GPT-5, Claude Opus 4, Claude Sonnet 4

Summary

New analytical lens examining implementation specification gaps — areas where an implementer would need to make decisions the specification doesn't explicitly address. Unlike hidden assumptions (what must be true) or race conditions (temporal interactions), this lens focuses on ambiguity that leads to divergent implementations.

The prompt explicitly asked for:

Underspecified algorithms
Edge case silence
Ordering ambiguity
Implicit prerequisites
Consistency gaps
Semantic ambiguity

Results

Model	Time	Output tokens	Reasoning tokens	Gaps found	High impact	Medium
GPT-5	126s	9,561	6,592	17	13	4
Claude Opus 4	63s	3,951	(internal)	18	12	4
Claude Sonnet 4	28s	1,639	(internal)	11	5	6

Key Findings by Category

Tie-Breaking Ambiguity (all models found)

HIFO tie-breaking — Two lots with identical cost basis: sort by opened_at? lot_id? quantity?
FIFO/LIFO tie-breaking — Two lots with identical timestamps: deterministic secondary key?
All three models correctly identified this as HIGH impact (affects tax classification)

Regulatory Compliance Gaps (all models found)

Settlement timing — Treasury Regulation §1.1012-1(c) requires designation "at or before settlement" but spec doesn't define when settlement occurs or when designation locks
Holding period calculation — "Lots held > 1 year" but spec doesn't define whether this is trade date, settlement date, timestamp-precise, or date-only

Partial Fill Handling (GPT-5 + Opus, not Sonnet)

Manual selection + partial fills — User selects Lot A:600, Lot B:400 for 1000-share sell. Only 300 fills. Pro-rata? Sequential? Reject?
Sonnet mentioned partial lot closure mechanics but missed the manual selection + partial fill interaction
HIGH impact: partial fills are common; ambiguity affects every partial fill with Manual selection

GPT-5 Unique Findings (6 not in either Claude model)

Fees/commissions in gain formula — gain = (sell_price - cost_basis) × quantity but no mention of fees. Different tax systems handle fees differently.
Multi-execution price handling — Fill at 50@$10.00 + 50@$9.90. Use VWAP? Per-execution closures? Pro-rate fees?
Sell exceeds available long — Algorithm assumes consuming open lots but doesn't specify shorting or rejection
Account/portfolio scoping — "Correct instrument" but no mention of cross-account lot selection
Partial metadata availability — What if opened_at is missing but basis exists? All-or-nothing failure?
Basis at selection time vs post-adjustment — HIFO uses current basis but wash sales can adjust basis retroactively. Idempotent replay could choose different lots.

Opus Unique Findings (5 not in either other model)

Strategy evaluation timing — User changes HIFO→FIFO between order submission and fill arrival. Which applies?
Corporate action during consumption — Stock splits while algorithm is mid-walk through lots
Lot ledger invariants — Assumes re-derivation works but doesn't state invariants (lots sum to position, no negative shares)
"Last fill wins" vs audit completeness — Serialized processing drops fills but audit trail requires completeness
Same-day open/close — Buy at 9:00, sell at 15:00 same day. Lot opening and closing overlap in time. Consumption algorithm handles this?

Sonnet Unique Findings (1 not in other models)

Instrument delisting during processing — Order placed, stock delisted before fill. Proceed or fail?

Note: This is a valid edge case but lower priority than the implementation ambiguities GPT-5 and Opus found

Model Analytical Style Comparison

GPT-5: Exhaustive enumeration with concrete divergence examples

GPT-5 produced the most detailed divergence scenarios, often with three different implementation approaches spelled out. Its fee/commission finding (#5) is particularly valuable — a real-world tax preparation concern that neither Claude model mentioned. The multi-execution price handling (#6) shows deep domain reasoning about how fills actually arrive.

Strength: Real-world operational considerations (fees, multi-price fills, cross-account) Weakness: Sometimes conflates "things the spec doesn't mention" with "things the implementer must guess"

Opus: Regulatory and system-boundary focused

Opus found fewer gaps but they were more architecturally significant. The "strategy evaluation timing" gap reveals a fundamental question about when configuration binding occurs. The "lot ledger invariants" gap identifies that the spec assumes self-consistency without stating the rules. The "last fill wins vs audit completeness" finding shows Opus's characteristic strength at finding consistency tensions.

Strength: Design-level contradictions and boundary ambiguities Weakness: Less operational detail than GPT-5

Sonnet: Structural scan with lower depth

Sonnet found 11 gaps vs 17-18 for the other models. Its findings were valid but shallower — it identified tie-breaking ambiguity but didn't explore cross-account, multi-execution, or fee implications. The "instrument delisting" finding was unique but low-priority.

Strength: Fast (28s vs 63s/126s), correct on core issues Weakness: Misses operational nuance and doesn't reason about component interactions

Novel Insight: Implementation Spec Gaps as Lens

This analytical lens differs from previous experiments:

Previous lens	Focus	Example
Hidden assumptions	What must be true for this to work	"Assumes broker API returns all fills"
Race conditions	Temporal interleavings that cause bugs	"Fill arrives before lot state updates"
Specification gaps (NEW)	What implementer must decide that spec doesn't	"HIFO tie-breaking undefined"

Key distinction: Specification gap analysis is about ambiguity of intent, not risks or assumptions. A spec can be internally consistent, assume correct inputs, have no race conditions, and STILL be underspecified — leading to divergent implementations that all "follow the spec."

This is particularly valuable for:

Design review before implementation
Documentation quality assessment
Identifying where tests should specify behavior

Overlap Analysis

Gap type	GPT-5	Opus	Sonnet	All 3
Tie-breaking ambiguity	✓	✓	✓	✓
Settlement/holding period	✓	✓	✓	✓
Partial closure mechanics	✓	✓	✓	✓
Manual + partial fills	✓	✓	—	—
Fees/commissions	✓	—	—	—
Multi-execution pricing	✓	—	—	—
Strategy evaluation timing	—	✓	—	—
Cross-account scoping	✓	—	—	—
Lot ledger invariants	—	✓	—	—
Instrument delisting	—	—	✓	—

Overlap rate: 5 gaps found by all 3 models, 12+ gaps unique to one model.

Practical Implication

For specification quality review (not architecture review), run:

GPT-5 — Catches operational/financial edge cases (fees, multi-fills, cross-account)
Opus — Catches design-level binding ambiguities and consistency gaps
Optional Sonnet for speed — Catches structural issues but misses depth

The union of GPT-5 + Opus findings would produce 23+ unique gaps vs 11-18 from any single model.

Cost-Efficiency

Model	Gaps	Time	Tokens	Gaps/minute	Gaps/1K tokens
GPT-5	17	126s	9,561	8.1	1.8
Opus	18	63s	3,951	17.1	4.6
Sonnet	11	28s	1,639	23.6	6.7

Opus is the most cost-effective for this task type — 2.5x more efficient than GPT-5 (gaps per token) while finding comparable depth. Sonnet is fast but misses too much for serious specification review.

Recommendations

Spec quality gate: Before implementation starts, run Opus + GPT-5 on the spec with this prompt. Address HIGH-impact gaps before coding.
Different from architecture review: This is a documentation quality check, not a safety review. Different skill for different purpose.
Domain expertise matters: Several GPT-5 findings (fees, multi-execution pricing) reflect financial domain knowledge. For domain-specific specs, GPT-5's breadth may be worth the extra cost.

8.5 KiB Raw Blame History Unescape Escape