# Finding 26: Missing-feature identification IS promptable across all models; prompt framing eliminates Opus's historical advantage — all three models find regulatory gaps when explicitly asked
**Date:** 2026-05-05

**Task:** Identify computations, behaviors, or features that gargoyle's
`corporate-actions.md` (992 lines) SHOULD perform for financial correctness,
regulatory compliance, or operational safety, but doesn't describe.

**How we used them:** Same document (full text) + same focused analytical
prompt to all 3 models via HAI proxy. Prompt explicitly structured around 5
categories: missing computations, missing behaviors, missing validations,
missing integrations, and regulatory gaps. Required concrete findings with
severity. No tools, no project context beyond the document. GPT-5 via
OpenAI endpoint (16K max_completion_tokens), Opus 4.6 and Sonnet 4.6 via
Anthropic endpoint (8K max_tokens).

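The setup described above can be sketched as a small configuration plus a prompt builder. This is a minimal illustration, not the harness that was actually run: the `ModelRun` class, the `build_prompt` function, and the prompt wording are all hypothetical, and reading "16K" and "8K" as 16,384 and 8,192 tokens is an assumption. Only the five categories, the endpoint families, and the parameter names come from the description.

```python
from dataclasses import dataclass

# The five categories named in the write-up.
CATEGORIES = [
    "missing computations",
    "missing behaviors",
    "missing validations",
    "missing integrations",
    "regulatory gaps",
]


@dataclass(frozen=True)
class ModelRun:
    label: str         # display name used in the results table
    endpoint: str      # endpoint family the run went through
    limit_param: str   # name of the output-cap parameter for that endpoint
    limit_tokens: int  # cap value; 16K/8K read as 16384/8192 (assumption)


RUNS = [
    ModelRun("GPT-5", "openai", "max_completion_tokens", 16_384),
    ModelRun("Claude Opus 4.6", "anthropic", "max_tokens", 8_192),
    ModelRun("Claude Sonnet 4.6", "anthropic", "max_tokens", 8_192),
]


def build_prompt(document_text: str) -> str:
    """Assemble one prompt, identical for every model (wording is hypothetical)."""
    category_lines = "\n".join(
        f"{i}. {name}" for i, name in enumerate(CATEGORIES, start=1)
    )
    return (
        "Identify what this document SHOULD describe for financial correctness, "
        "regulatory compliance, or operational safety, but does not.\n"
        f"Organize findings under these categories:\n{category_lines}\n"
        "Give concrete findings, each with a severity (Critical, High, Medium).\n\n"
        f"--- DOCUMENT ---\n{document_text}"
    )
```

Keeping the prompt in one function and the per-model differences in `RUNS` makes it easy to see that only the endpoint and the output cap vary between runs.
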
| Model | Output tokens | Reasoning tokens | Findings | Critical | High | Medium |
|---|---|---|---|---|---|---|
| GPT-5 | 11,354 | 8,512 | 20 | 3 | 10 | 7 |
| Claude Opus 4.6 | 4,111 | (internal) | 23 | 6 | 10 | 7 |
| Claude Sonnet 4.6 | 4,686 | (internal) | 15 | 5 | 6 | 4 |

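As a quick consistency check on the table, the severity columns can be summed against the findings column and the Critical share computed per model. This is a small sketch using only the numbers in the table; the `RESULTS` layout and the function names are illustrative.

```python
# Counts copied from the results table.
RESULTS = {
    "GPT-5": {"findings": 20, "critical": 3, "high": 10, "medium": 7},
    "Claude Opus 4.6": {"findings": 23, "critical": 6, "high": 10, "medium": 7},
    "Claude Sonnet 4.6": {"findings": 15, "critical": 5, "high": 6, "medium": 4},
}


def severity_sums_match(row: dict) -> bool:
    """True when Critical + High + Medium equals the findings total."""
    return row["critical"] + row["high"] + row["medium"] == row["findings"]


def critical_share(row: dict) -> float:
    """Fraction of a model's findings that were rated Critical."""
    return row["critical"] / row["findings"]


for model, row in RESULTS.items():
    print(f"{model}: sums match={severity_sums_match(row)}, "
          f"critical share={critical_share(row):.0%}")
```

All three rows sum correctly. On these counts the Critical share is 15% for GPT-5, about 26% for Opus, and about 33% for Sonnet.
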
**What they found (common ground, all 3 identified):**

- Wash sale rule interaction with CA-driven lot closures (IRC §1091)
- Short position treatment for corporate actions
- Same-day corporate action ordering beyond `recorded_at` timestamp
- Record date / ex-date position verification (entitlement timing)
- Idempotency guard preventing double-application per user
- Decimal precision/rounding policy unspecified
- Superseded CA status has no lot rollback mechanism
- Rights/warrants post-creation lifecycle (exercise/expiration)
- Basis preservation invariant has no runtime enforcement
- Manual entry authorization and audit trail

**GPT-5 unique findings (not in either Claude model):**

- Per-lot eligibility based on entitlement date (not just user-level)
- Election-based outcomes for shareholder choices (cash vs stock)
- Instrument-level trading hold during CA application window
- Pre-application consistency checks against broker entitlements
- DB-level enforcement of status transitions and invariants
- Action-type-specific date semantics per field (ex vs record vs payable)
- Voluntary/tender actions beyond distributions
- Backfill/initialization guard for newly onboarded users
- Applicator retry/backoff semantics and confirmation race
- Rights indivisibility constraints vs exact Decimal quantities

**Claude Opus unique findings (not in either other model):**

- Pending order PRICE adjustment after splits (not just cancellation)
- Multi-instrument position recalculation atomicity for mergers
- Mixed merger basis floor at zero (can produce negative basis)
- Tax lot identification method interaction with inherited dates
- Corporate action effect on strategy position limits/risk params
- Corporate actions on instruments not yet in the database
- Partial application window: new user acquires position mid-fan-out
- IRC §305(c) deemed distributions (taxable stock dividends)
- CA impact on unrealized P&L display and strategy evaluation
- Concurrent OrderManager startup + Applicator fan-out race

**Claude Sonnet unique findings (not in either other model):**

- Stale orders: failure modes table contradicts "excluded" section
- IRC §1223(1) holding period tacking verification at lot close
- Spinoff allocation percentage: no validation that child != parent instrument
- Combined spinoff allocations exceeding meaningful bounds
- Cash dividend bypasses OrderManager; record-date quantity snapshot lost
- Mixed merger large-denominator exchange ratio overflow
- Detector schedule: no intraday re-poll for same-day announcements
- ROC cumulative basis floor tracking and IRC §301(c)(3) distinction
- Mixed merger deferred loss not explicitly recorded in metadata

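The "common ground" and "unique" buckets above amount to a set intersection and per-model set differences. A minimal sketch of that partition follows, assuming each model's findings have already been normalized to shared labels by hand. The slugs below are illustrative labels for a few of the listed findings, and `hypothetical-two-model-overlap` is an invented placeholder, not a finding from the experiment.

```python
def partition(findings_by_model: dict[str, set[str]]):
    """Split findings into those every model reported and those only one did."""
    models = list(findings_by_model)
    # Common ground: present in every model's set.
    common = set.intersection(*findings_by_model.values())
    unique = {}
    for model in models:
        # Everything reported by any other model.
        others = set().union(
            *(findings_by_model[m] for m in models if m != model)
        )
        unique[model] = findings_by_model[model] - others
    return common, unique


# Illustrative slugs only; the last Opus/Sonnet entry is a made-up placeholder.
sample = {
    "GPT-5": {"wash-sale-interaction", "per-lot-eligibility"},
    "Opus": {
        "wash-sale-interaction",
        "pending-order-price-adjustment",
        "hypothetical-two-model-overlap",
    },
    "Sonnet": {
        "wash-sale-interaction",
        "stale-orders-contradiction",
        "hypothetical-two-model-overlap",
    },
}

common, unique = partition(sample)
print(common)
print(unique)
```

A finding shared by exactly two models lands in neither bucket, which is one reason a model's common and unique lists need not add up to its total finding count.
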
**Quality assessment:**

- **Claude Opus** was the MOST PROLIFIC (23 findings), a notable inversion
  from previous experiments where Opus typically found fewer but deeper
  findings. Here, the explicit "missing feature" framing appears to have
  unlocked Opus's breadth. Its unique findings included genuinely critical
  items: pending order price adjustment after splits (Critical: direct
  financial loss), multi-instrument atomicity for mergers (Critical:
  position loss), and mixed merger negative basis (High: accounting
  corruption). The findings were precise, well-reasoned, and showed both
  regulatory depth (IRC §305(c)) and operational awareness.
- **GPT-5** was slightly less prolific (20 findings) but maintained its
  characteristic breadth and operational-level thinking. Per-lot eligibility
  (not just per-user) is a subtle but important distinction. The
  election-based outcomes finding shows awareness of real-world corporate
  action complexity. The backfill/initialization guard is operationally
  significant. GPT-5 spent 8,512 reasoning tokens, moderate for its output
  volume.
- **Claude Sonnet** found fewer gaps (15) but several were genuinely
  insightful. The internal contradiction between the failure modes table
  and the "excluded" section is a real document inconsistency. The cash
  dividend record-date quantity snapshot insight (#9) identifies a DATA LOSS
  problem: the opportunity to capture that data expires. The mixed merger
  deferred loss recording gap shows regulatory awareness. However, some
  findings were more surface-level or overlapped heavily with the others.

**KEY INSIGHT: the original question from Finding #22 is ANSWERED.**

> "Opus's 'missing feature identification' mode (wash sales, commissions) —
> is this promptable on other models? Could we explicitly ask GPT-5 'what
> should this system compute but doesn't' and get similar results?"

**YES.** When explicitly prompted with a structured "missing feature"
framing, ALL three models found regulatory gaps (wash sales, IRC sections),
missing computations (basis calculations, rounding), and missing behaviors
(lifecycle events, notifications). GPT-5 produced findings in the same
*category* as what Opus uniquely found in Finding #22 (silent correctness
failures on specid-lot-selection.md).

In Finding #22, Opus uniquely identified wash sales and commission tracking
as missing features while GPT-5 focused on mechanism incorrectness and
Sonnet on composition failures. HERE, with the explicit "what's missing"
prompt, ALL three models found wash sales, ALL found regulatory gaps, and
ALL found missing behaviors.

**This confirms:** Opus's "missing feature identification" mode in
Finding #22 was NOT a capability exclusive to Opus; it was an emergent
behavior from the open-ended "silent correctness failures" prompt. When you
give ALL models the EXPLICIT instruction to look for missing features, they
all do it. The differentiation in #22 came from its more open-ended prompt,
which let each model default to its natural analytical mode:

- Opus → "what's missing" (features/functionality)
- GPT-5 → "what's wrong" (mechanism failures)
- Sonnet → "what breaks when combined" (composition)

**Prompt framing dominates model personality.** With the right prompt,
any model can be directed into any analytical mode. The model differences
that emerged in earlier open-ended experiments reflect DEFAULT TENDENCIES,
not capabilities.

**NEW finding about Opus on complex documents:**

Opus produced MORE findings than GPT-5 (23 vs 20), the first time this
has happened on a broad analytical task. Previous pattern: GPT-5 always
finds more (20-33 findings) while Opus finds fewer but deeper (7-13).
What changed? The document is 992 lines (the longest tested) and the
task is explicitly about breadth ("find all gaps"). On this specific
combination (long document + breadth-focused prompt), Opus appears to
allocate its internal reasoning budget toward exploration rather than
its usual depth-first design-tension mode. This suggests Opus's typical
"fewer but deeper" pattern is partially a RESPONSE to shorter documents
where depth is more productive than breadth.

**Practical implications:**

1. For missing-feature analysis: prompt structure matters more than model
   choice. All three models are viable. Use the explicit 5-category prompt.
2. Run all three for critical docs: they find different specific gaps
   despite finding the same categories.
3. For open-ended analysis where you want models to find DIFFERENT things:
   use open-ended prompts. For analysis where you want COMPREHENSIVE
   coverage of one type: use structured prompts.
4. Opus's "fewer but deeper" personality can be overridden by document
   length + breadth-focused prompt. On 992-line docs, it competes on
   volume with GPT-5.

**Cost-effectiveness:**

- Opus: 4,111 output tokens for 23 findings = 179 tokens/finding
- GPT-5: 11,354 output tokens (+ 8,512 reasoning) for 20 findings = 993 tokens/finding
- Sonnet: 4,686 output tokens for 15 findings = 312 tokens/finding

Opus is by far the most efficient: roughly 5.6x fewer tokens than GPT-5 per
finding, with MORE findings. Note that the GPT-5 figure includes its
reasoning tokens, while the Claude figures count output tokens only because
their reasoning tokens are not reported. This is the strongest
cost-effectiveness case for Opus on any tested task. On long documents with
breadth-focused prompts, Opus appears to be the optimal choice for both
quality AND efficiency.

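The tokens-per-finding figures can be reproduced with a few lines of arithmetic. The sketch below uses only the token and finding counts reported above; the `RUNS` layout and function names are illustrative. It follows the write-up in adding GPT-5's reasoning tokens to its output tokens, and also shows the figure without them, since whether the reported output count already includes reasoning is not stated.

```python
# Token and finding counts as reported; None means reasoning tokens
# were not reported for that model.
RUNS = {
    "Opus": {"output": 4_111, "reasoning": None, "findings": 23},
    "GPT-5": {"output": 11_354, "reasoning": 8_512, "findings": 20},
    "Sonnet": {"output": 4_686, "reasoning": None, "findings": 15},
}


def tokens_per_finding(run: dict, include_reasoning: bool = True) -> float:
    """Tokens spent per finding, optionally adding reported reasoning tokens."""
    total = run["output"]
    if include_reasoning and run["reasoning"] is not None:
        total += run["reasoning"]
    return total / run["findings"]


def ratio_to_opus(model: str, include_reasoning: bool = True) -> float:
    """How many times more tokens per finding a model used than Opus."""
    return (
        tokens_per_finding(RUNS[model], include_reasoning)
        / tokens_per_finding(RUNS["Opus"], include_reasoning)
    )


for name, run in RUNS.items():
    print(f"{name}: {tokens_per_finding(run):.0f} tokens/finding")
print(f"GPT-5 vs Opus, with reasoning: {ratio_to_opus('GPT-5'):.1f}x")
print(f"GPT-5 vs Opus, output only: {ratio_to_opus('GPT-5', False):.1f}x")
```

With reasoning tokens added, GPT-5 comes out at roughly 5.6x Opus per finding; counting only the reported output tokens, the gap is roughly 3.2x. Opus is the most token-efficient of the three under either accounting.
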