# Finding 26: Missing-feature identification IS promptable across all models; prompt framing eliminates Opus's historical advantage — all three models find regulatory gaps when explicitly asked
**Date:** 2026-05-05

**Task:** Identify computations, behaviors, or features that gargoyle's
`corporate-actions.md` (992 lines) SHOULD perform for financial correctness,
regulatory compliance, or operational safety — but doesn't describe.

**How we used them:** Same document (full text) + same focused analytical
prompt to all 3 models via HAI proxy. Prompt explicitly structured around 5
categories: missing computations, missing behaviors, missing validations,
missing integrations, and regulatory gaps. Required concrete findings with
severity. No tools, no project context beyond the document. GPT-5 via
OpenAI endpoint (16K max_completion_tokens), Opus 4.6 and Sonnet 4.6 via
Anthropic endpoint (8K max_tokens).
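
The setup above (one document, one five-category prompt, three models) can be sketched as follows. This is a minimal illustration, not the prompt that was actually sent: the category names come from the description above and the severity levels from the results table, while the prompt wording and the names `build_prompt` and `build_requests` are hypothetical. No API call is made; the point is only that every model receives a byte-identical prompt.

```python
# Category names are taken from the write-up; everything else here
# (prompt wording, function names) is an illustrative assumption.
CATEGORIES = [
    "missing computations",
    "missing behaviors",
    "missing validations",
    "missing integrations",
    "regulatory gaps",
]
# Assumed from the severity columns of the results table.
SEVERITIES = ["Critical", "High", "Medium"]
MODELS = ["GPT-5", "Claude Opus 4.6", "Claude Sonnet 4.6"]


def build_prompt(document_text: str) -> str:
    """Assemble one structured missing-feature prompt around the document."""
    numbered = "\n".join(
        f"{i}. {name}" for i, name in enumerate(CATEGORIES, start=1)
    )
    return (
        "Identify what this document SHOULD describe for financial "
        "correctness, regulatory compliance, or operational safety, "
        "but does not.\n"
        f"Organize findings under these categories:\n{numbered}\n"
        f"Give each finding a severity ({', '.join(SEVERITIES)}).\n\n"
        "--- DOCUMENT ---\n"
        f"{document_text}"
    )


def build_requests(document_text: str, models: list[str]) -> dict[str, str]:
    """Map every model to the same prompt so only the model varies."""
    prompt = build_prompt(document_text)
    return {model: prompt for model in models}


prompts_by_model = build_requests("<full text of corporate-actions.md>", MODELS)
```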

| Model | Output tokens | Reasoning tokens | Findings | Critical | High | Medium |
|---|---|---|---|---|---|---|
| GPT-5 | 11,354 | 8,512 | 20 | 3 | 10 | 7 |
| Claude Opus 4.6 | 4,111 | (internal) | 23 | 6 | 10 | 7 |
| Claude Sonnet 4.6 | 4,686 | (internal) | 15 | 5 | 6 | 4 |

**What they found (common ground, identified by all 3):**
- Wash sale rule interaction with CA-driven lot closures (IRC §1091)
- Short position treatment for corporate actions
- Same-day corporate action ordering beyond `recorded_at` timestamp
- Record date / ex-date position verification (entitlement timing)
- Idempotency guard preventing double-application per user
- Decimal precision/rounding policy unspecified
- Superseded CA status has no lot rollback mechanism
- Rights/warrants post-creation lifecycle (exercise/expiration)
- Basis preservation invariant has no runtime enforcement
- Manual entry authorization and audit trail

**GPT-5 unique findings (not in either Claude model):**
- Per-lot eligibility based on entitlement date (not just user-level)
- Election-based outcomes for shareholder choices (cash vs stock)
- Instrument-level trading hold during CA application window
- Pre-application consistency checks against broker entitlements
- DB-level enforcement of status transitions and invariants
- Action-type-specific date semantics per field (ex vs record vs payable)
- Voluntary/tender actions beyond distributions
- Backfill/initialization guard for newly onboarded users
- Applicator retry/backoff semantics and confirmation race
- Rights indivisibility constraints vs exact Decimal quantities

**Claude Opus unique findings (not in either other model):**
- Pending order PRICE adjustment after splits (not just cancellation)
- Multi-instrument position recalculation atomicity for mergers
- Mixed merger basis floor at zero (can produce negative basis)
- Tax lot identification method interaction with inherited dates
- Corporate action effect on strategy position limits/risk params
- Corporate actions on instruments not yet in the database
- Partial application window: new user acquires position mid-fan-out
- IRC §305(c) deemed distributions (taxable stock dividends)
- CA impact on unrealized P&L display and strategy evaluation
- Concurrent OrderManager startup + Applicator fan-out race

**Claude Sonnet unique findings (not in either other model):**
- Stale orders: failure modes table contradicts "excluded" section
- IRC §1223(1) holding period tacking verification at lot close
- Spinoff allocation percentage: no validation that child != parent instrument
- Combined spinoff allocations exceeding meaningful bounds
- Cash dividend bypasses OrderManager — record-date quantity snapshot lost
- Mixed merger large-denominator exchange ratio overflow
- Detector schedule: no intraday re-poll for same-day announcements
- ROC cumulative basis floor tracking and IRC §301(c)(3) distinction
- Mixed merger deferred loss not explicitly recorded in metadata
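
The split into "common ground" and per-model "unique" findings used in the lists above amounts to set intersection and set difference. The sketch below illustrates that with a handful of labels abbreviated from those lists; it assumes each model's findings have already been normalized to shared labels (that matching step is not described in this write-up), and the helper names are hypothetical.

```python
# A few labels abbreviated from the lists above, for illustration only.
# Assumption: findings were already mapped to shared labels by hand.
findings_by_model = {
    "GPT-5": {
        "wash sale rule interaction",
        "short position treatment",
        "per-lot eligibility",
    },
    "Claude Opus": {
        "wash sale rule interaction",
        "short position treatment",
        "pending order price adjustment after splits",
    },
    "Claude Sonnet": {
        "wash sale rule interaction",
        "short position treatment",
        "holding period tacking verification",
    },
}


def common_ground(by_model: dict[str, set[str]]) -> set[str]:
    """Findings reported by every model."""
    return set.intersection(*by_model.values())


def unique_to(model: str, by_model: dict[str, set[str]]) -> set[str]:
    """Findings reported by `model` and by no other model."""
    others = set().union(
        *(found for name, found in by_model.items() if name != model)
    )
    return by_model[model] - others


shared = common_ground(findings_by_model)
only_opus = unique_to("Claude Opus", findings_by_model)
```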

**Quality assessment:**
- **Claude Opus** was the MOST PROLIFIC (23 findings) — a notable inversion
from previous experiments where Opus typically found fewer but deeper
findings. Here, the explicit "missing feature" framing appears to have
unlocked Opus's breadth. Its unique findings included genuinely critical
items: pending order price adjustment after splits (Critical — direct
financial loss), multi-instrument atomicity for mergers (Critical —
position loss), and mixed merger negative basis (High — accounting
corruption). The findings were precise, well-reasoned, and showed both
regulatory depth (IRC §305(c)) and operational awareness.
- **GPT-5** was slightly less prolific (20 findings) but maintained its
characteristic breadth and operational-level thinking. Per-lot eligibility
(not just per-user) is a subtle but important distinction. The
election-based outcomes finding shows awareness of real-world corporate action
complexity. The backfill/initialization guard is operationally significant.
GPT-5 spent 8,512 reasoning tokens — moderate for its output volume.
- **Claude Sonnet** found fewer gaps (15) but several were genuinely
insightful. The internal contradiction between the failure modes table
and the "excluded" section is a real document inconsistency. The cash
dividend record-date quantity snapshot insight (#9) identifies a DATA LOSS
problem — the opportunity to capture that data expires. The mixed merger
deferred loss recording gap shows regulatory awareness. However, some
findings were more surface-level or overlapped heavily with the others.

**KEY INSIGHT (the original question from Finding #22 is ANSWERED):**
> "Opus's 'missing feature identification' mode (wash sales, commissions) —
> is this promptable on other models? Could we explicitly ask GPT-5 'what
> should this system compute but doesn't' and get similar results?"

**YES.** When explicitly prompted with a structured "missing feature"
framing, ALL three models found regulatory gaps (wash sales, IRC sections),
missing computations (basis calculations, rounding), and missing behaviors
(lifecycle events, notifications). GPT-5 produced findings in the same
*category* as what Opus uniquely found in Finding #22 (the silent correctness
failures analysis of specid-lot-selection.md).

In Finding #22, Opus uniquely identified wash sales and commission tracking
as missing features while GPT-5 focused on mechanism incorrectness and
Sonnet on composition failures. HERE, with the explicit "what's missing"
prompt, ALL three models found wash sales, ALL found regulatory gaps, and
ALL found missing behaviors.

**This confirms:** Opus's "missing feature identification" mode in Finding
#22 was NOT an inherent model capability; it was an emergent behavior from
the open-ended "silent correctness failures" prompt. When you give ALL models
the EXPLICIT instruction to look for missing features, they all do it. The
differentiation in #22 was caused by the prompt being more open-ended,
allowing each model to default to its natural analytical mode:
- Opus → "what's missing" (features/functionality)
- GPT-5 → "what's wrong" (mechanism failures)
- Sonnet → "what breaks when combined" (composition)

**Prompt framing dominates model personality.** With the right prompt,
any model can be directed into any analytical mode. The model differences
that emerged in earlier open-ended experiments reflect DEFAULT TENDENCIES,
not capabilities.

**NEW finding about Opus on complex documents:**
Opus produced MORE findings than GPT-5 (23 vs 20) — the first time this
has happened on a broad analytical task. Previous pattern: GPT-5 always
finds more (20-33 findings) while Opus finds fewer but deeper (7-13).

What changed? The document is 992 lines (the longest tested) and the
task is explicitly about breadth ("find all gaps"). On this specific
combination (long document + breadth-focused prompt), Opus appears to
allocate its internal reasoning budget toward exploration rather than
its usual depth-first design-tension mode. This suggests Opus's typical
"fewer but deeper" pattern is partially a RESPONSE to shorter documents
where depth is more productive than breadth.

**Practical implications:**
1. For missing-feature analysis: prompt structure matters more than model
choice. All three models are viable. Use the explicit 5-category prompt.
2. Run all three for critical docs — they find different specific gaps
despite finding the same categories.
3. For open-ended analysis where you want models to find DIFFERENT things:
use open-ended prompts. For analysis where you want COMPREHENSIVE
coverage of one type: use structured prompts.
4. Opus's "fewer but deeper" personality can be overridden by document
length + breadth-focused prompt. On 992-line docs, it competes on
volume with GPT-5.

**Cost-effectiveness:**
- Opus: 4,111 output tokens for 23 findings = 179 tokens/finding
- GPT-5: 11,354 output tokens (+ 8,512 reasoning) for 20 findings = 993 tokens/finding
- Sonnet: 4,686 output tokens for 15 findings = 312 tokens/finding

Opus is by far the most efficient: about 5.6x fewer tokens per finding than
GPT-5 (179 vs 993), with MORE findings. The GPT-5 figure includes its 8,512
reasoning tokens, while the Claude models' internal reasoning is not
reported; on output tokens alone (11,354 / 20, about 568 per finding) the
gap is about 3.2x. This is the strongest cost-effectiveness case
for Opus on any tested task. On long documents with breadth-focused prompts,
Opus appears to be the optimal choice for both quality AND efficiency.
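
The per-finding figures above can be recomputed from the results table. The sketch below is a plain arithmetic check under two assumptions: reasoning tokens are added on top of output tokens for GPT-5 (as the figures above do), and they are counted as 0 for the Claude models because the table lists them only as "(internal)". The `Run` record and function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    model: str
    output_tokens: int
    reasoning_tokens: int  # 0 where the table only says "(internal)"
    findings: int


# Numbers copied from the results table in this finding.
RUNS = [
    Run("Claude Opus 4.6", 4_111, 0, 23),
    Run("GPT-5", 11_354, 8_512, 20),
    Run("Claude Sonnet 4.6", 4_686, 0, 15),
]


def tokens_per_finding(run: Run, include_reasoning: bool = True) -> float:
    """Tokens spent per finding, optionally ignoring reported reasoning tokens."""
    total = run.output_tokens
    if include_reasoning:
        total += run.reasoning_tokens
    return total / run.findings


opus, gpt5, sonnet = RUNS
for run in RUNS:
    print(f"{run.model}: {tokens_per_finding(run):.0f} tokens/finding")

# Ratio of GPT-5 to Opus, with and without GPT-5's reasoning tokens.
ratio_with_reasoning = tokens_per_finding(gpt5) / tokens_per_finding(opus)
ratio_output_only = (
    tokens_per_finding(gpt5, include_reasoning=False) / tokens_per_finding(opus)
)
print(f"GPT-5 / Opus: {ratio_with_reasoning:.1f}x (output only: {ratio_output_only:.1f}x)")
```

Keeping `include_reasoning` as a switch makes it easy to see how much of the gap depends on how reasoning tokens are counted, which matters because only one of the three models reports them.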