Files
model-research/findings/2026-05-10-64-specification-gap-analysis.md
T
Rodin 873591877d Finding #64: Specification gap analysis - new analytical lens
Tested GPT-5, Opus, Sonnet on specid-lot-selection.md (125 lines)
for implementation specification gaps.

Key findings:
- Opus most cost-effective (4.6 gaps/1K tokens vs 1.8 for GPT-5)
- GPT-5 catches operational/financial edge cases (fees, multi-execution)
- Opus catches design-level binding ambiguities
- Sonnet too shallow for serious spec review

New lens distinct from hidden assumptions and race conditions:
focuses on ambiguity of intent, not risks.
2026-05-10 11:10:33 -07:00

8.5 KiB
Raw Blame History

Specification Gap Analysis — Finding #64

Date: 2026-05-10
Lens: Implementation Specification Gaps (NEW)
Document: gargoyle's specid-lot-selection.md (125 lines)
Models: GPT-5, Claude Opus 4, Claude Sonnet 4

Summary

New analytical lens examining implementation specification gaps — areas where an implementer would need to make decisions the specification doesn't explicitly address. Unlike hidden assumptions (what must be true) or race conditions (temporal interactions), this lens focuses on ambiguity that leads to divergent implementations.

The prompt explicitly asked for:

  1. Underspecified algorithms
  2. Edge case silence
  3. Ordering ambiguity
  4. Implicit prerequisites
  5. Consistency gaps
  6. Semantic ambiguity

Results

Model Time Output tokens Reasoning tokens Gaps found High impact Medium
GPT-5 126s 9,561 6,592 17 13 4
Claude Opus 4 63s 3,951 (internal) 18 12 4
Claude Sonnet 4 28s 1,639 (internal) 11 5 6

Key Findings by Category

Tie-Breaking Ambiguity (all models found)

  • HIFO tie-breaking — Two lots with identical cost basis: sort by opened_at? lot_id? quantity?
  • FIFO/LIFO tie-breaking — Two lots with identical timestamps: deterministic secondary key?
  • All three models correctly identified this as HIGH impact (affects tax classification)

Regulatory Compliance Gaps (all models found)

  • Settlement timing — Treasury Regulation §1.1012-1(c) requires designation "at or before settlement" but spec doesn't define when settlement occurs or when designation locks
  • Holding period calculation — "Lots held > 1 year" but spec doesn't define whether this is trade date, settlement date, timestamp-precise, or date-only

Partial Fill Handling (GPT-5 + Opus, not Sonnet)

  • Manual selection + partial fills — User selects Lot A:600, Lot B:400 for 1000-share sell. Only 300 fills. Pro-rata? Sequential? Reject?
  • Sonnet mentioned partial lot closure mechanics but missed the manual selection + partial fill interaction
  • HIGH impact: partial fills are common; ambiguity affects every partial fill with Manual selection

GPT-5 Unique Findings (6 not in either Claude model)

  1. Fees/commissions in gain formulagain = (sell_price - cost_basis) × quantity but no mention of fees. Different tax systems handle fees differently.
  2. Multi-execution price handling — Fill at 50@$10.00 + 50@$9.90. Use VWAP? Per-execution closures? Pro-rate fees?
  3. Sell exceeds available long — Algorithm assumes consuming open lots but doesn't specify shorting or rejection
  4. Account/portfolio scoping — "Correct instrument" but no mention of cross-account lot selection
  5. Partial metadata availability — What if opened_at is missing but basis exists? All-or-nothing failure?
  6. Basis at selection time vs post-adjustment — HIFO uses current basis but wash sales can adjust basis retroactively. Idempotent replay could choose different lots.

Opus Unique Findings (5 not in either other model)

  1. Strategy evaluation timing — User changes HIFO→FIFO between order submission and fill arrival. Which applies?
  2. Corporate action during consumption — Stock splits while algorithm is mid-walk through lots
  3. Lot ledger invariants — Assumes re-derivation works but doesn't state invariants (lots sum to position, no negative shares)
  4. "Last fill wins" vs audit completeness — Serialized processing drops fills but audit trail requires completeness
  5. Same-day open/close — Buy at 9:00, sell at 15:00 same day. Lot opening and closing overlap in time. Consumption algorithm handles this?

Sonnet Unique Findings (1 not in other models)

  1. Instrument delisting during processing — Order placed, stock delisted before fill. Proceed or fail?
  • Note: This is a valid edge case but lower priority than the implementation ambiguities GPT-5 and Opus found

Model Analytical Style Comparison

GPT-5: Exhaustive enumeration with concrete divergence examples

GPT-5 produced the most detailed divergence scenarios, often with three different implementation approaches spelled out. Its fee/commission finding (#5) is particularly valuable — a real-world tax preparation concern that neither Claude model mentioned. The multi-execution price handling (#6) shows deep domain reasoning about how fills actually arrive.

Strength: Real-world operational considerations (fees, multi-price fills, cross-account) Weakness: Sometimes conflates "things the spec doesn't mention" with "things the implementer must guess"

Opus: Regulatory and system-boundary focused

Opus found fewer gaps but they were more architecturally significant. The "strategy evaluation timing" gap reveals a fundamental question about when configuration binding occurs. The "lot ledger invariants" gap identifies that the spec assumes self-consistency without stating the rules. The "last fill wins vs audit completeness" finding shows Opus's characteristic strength at finding consistency tensions.

Strength: Design-level contradictions and boundary ambiguities Weakness: Less operational detail than GPT-5

Sonnet: Structural scan with lower depth

Sonnet found 11 gaps vs 17-18 for the other models. Its findings were valid but shallower — it identified tie-breaking ambiguity but didn't explore cross-account, multi-execution, or fee implications. The "instrument delisting" finding was unique but low-priority.

Strength: Fast (28s vs 63s/126s), correct on core issues Weakness: Misses operational nuance and doesn't reason about component interactions

Novel Insight: Implementation Spec Gaps as Lens

This analytical lens differs from previous experiments:

Previous lens Focus Example
Hidden assumptions What must be true for this to work "Assumes broker API returns all fills"
Race conditions Temporal interleavings that cause bugs "Fill arrives before lot state updates"
Specification gaps (NEW) What implementer must decide that spec doesn't "HIFO tie-breaking undefined"

Key distinction: Specification gap analysis is about ambiguity of intent, not risks or assumptions. A spec can be internally consistent, assume correct inputs, have no race conditions, and STILL be underspecified — leading to divergent implementations that all "follow the spec."

This is particularly valuable for:

  • Design review before implementation
  • Documentation quality assessment
  • Identifying where tests should specify behavior

Overlap Analysis

Gap type GPT-5 Opus Sonnet All 3
Tie-breaking ambiguity
Settlement/holding period
Partial closure mechanics
Manual + partial fills
Fees/commissions
Multi-execution pricing
Strategy evaluation timing
Cross-account scoping
Lot ledger invariants
Instrument delisting

Overlap rate: 5 gaps found by all 3 models, 12+ gaps unique to one model.

Practical Implication

For specification quality review (not architecture review), run:

  1. GPT-5 — Catches operational/financial edge cases (fees, multi-fills, cross-account)
  2. Opus — Catches design-level binding ambiguities and consistency gaps
  3. Optional Sonnet for speed — Catches structural issues but misses depth

The union of GPT-5 + Opus findings would produce 23+ unique gaps vs 11-18 from any single model.

Cost-Efficiency

Model Gaps Time Tokens Gaps/minute Gaps/1K tokens
GPT-5 17 126s 9,561 8.1 1.8
Opus 18 63s 3,951 17.1 4.6
Sonnet 11 28s 1,639 23.6 6.7

Opus is the most cost-effective for this task type — 2.5x more efficient than GPT-5 (gaps per token) while finding comparable depth. Sonnet is fast but misses too much for serious specification review.

Recommendations

  1. Spec quality gate: Before implementation starts, run Opus + GPT-5 on the spec with this prompt. Address HIGH-impact gaps before coding.
  2. Different from architecture review: This is a documentation quality check, not a safety review. Different skill for different purpose.
  3. Domain expertise matters: Several GPT-5 findings (fees, multi-execution pricing) reflect financial domain knowledge. For domain-specific specs, GPT-5's breadth may be worth the extra cost.