refactor(findings): split ALL-FINDINGS.md into per-experiment files
Break the monolithic 3249-line findings file into 29 individual files, one per experiment. Each file is named YYYY-MM-DD-NN-slug.md for easy chronological sorting and discovery. No content changes — purely structural reorganization.
@@ -0,0 +1,158 @@
# Finding 26: Missing-feature identification IS promptable across all models; prompt framing eliminates Opus's historical advantage — all three models find regulatory gaps when explicitly asked

**Date:** 2026-05-05
**Task:** Identify computations, behaviors, or features that the system
specified in gargoyle's `corporate-actions.md` (992 lines) SHOULD perform for
financial correctness, regulatory compliance, or operational safety, but that
the document does not describe.
**How we used them:** Same document (full text) + same focused analytical
prompt to all 3 models via HAI proxy. Prompt explicitly structured around 5
categories: missing computations, missing behaviors, missing validations,
missing integrations, and regulatory gaps. Required concrete findings with
severity. No tools, no project context beyond the document. GPT-5 via
OpenAI endpoint (16K max_completion_tokens), Opus 4.6 and Sonnet 4.6 via
Anthropic endpoint (8K max_tokens).
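
The setup above can be sketched roughly as follows. This is a minimal illustration, not the prompt or harness actually used: the prompt wording, the `ModelRun` and `build_prompt` names, and the exact token values (reading "16K" and "8K" as 16,000 and 8,000) are all assumptions. Only the five categories, the severity requirement, and the per-endpoint parameter names come from the description.

```python
from dataclasses import dataclass

# The five categories named in the write-up.
CATEGORIES = (
    "missing computations",
    "missing behaviors",
    "missing validations",
    "missing integrations",
    "regulatory gaps",
)


@dataclass(frozen=True)
class ModelRun:
    label: str         # model name as written in this finding
    endpoint: str      # which provider endpoint the run used
    limit_param: str   # name of the output-cap parameter
    limit_tokens: int  # ASSUMPTION: "16K"/"8K" read as 16_000/8_000


RUNS = (
    ModelRun("GPT-5", "openai", "max_completion_tokens", 16_000),
    ModelRun("Opus 4.6", "anthropic", "max_tokens", 8_000),
    ModelRun("Sonnet 4.6", "anthropic", "max_tokens", 8_000),
)


def build_prompt(document_text: str) -> str:
    """Assemble one prompt that is sent unchanged to every model.

    The wording is hypothetical; only the five categories and the
    severity requirement come from the write-up.
    """
    numbered = "\n".join(f"{i}. {c}" for i, c in enumerate(CATEGORIES, 1))
    return (
        "Identify what this document SHOULD describe but does not.\n"
        f"Organize findings under these categories:\n{numbered}\n"
        "Give each finding a severity (Critical, High, or Medium).\n\n"
        f"--- DOCUMENT ---\n{document_text}"
    )


prompt = build_prompt("<full text of corporate-actions.md>")
for run in RUNS:
    print(f"{run.label}: {run.endpoint} ({run.limit_param}={run.limit_tokens})")
```

Building the prompt once and reusing the same string for every run keeps the comparison limited to the model and its output cap, which is the property the experiment relies on.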

| Model | Output tokens | Reasoning tokens | Findings | Critical | High | Medium |
|---|---|---|---|---|---|---|
| GPT-5 | 11,354 | 8,512 | 20 | 3 | 10 | 7 |
| Claude Opus 4.6 | 4,111 | (internal) | 23 | 6 | 10 | 7 |
| Claude Sonnet 4.6 | 4,686 | (internal) | 15 | 5 | 6 | 4 |

**What they found: common ground (all 3 identified):**
- Wash sale rule interaction with CA-driven lot closures (IRC §1091)
- Short position treatment for corporate actions
- Same-day corporate action ordering beyond `recorded_at` timestamp
- Record date / ex-date position verification (entitlement timing)
- Idempotency guard preventing double-application per user
- Decimal precision/rounding policy unspecified
- Superseded CA status has no lot rollback mechanism
- Rights/warrants post-creation lifecycle (exercise/expiration)
- Basis preservation invariant has no runtime enforcement
- Manual entry authorization and audit trail

**GPT-5 unique findings (not in either Claude model):**
- Per-lot eligibility based on entitlement date (not just user-level)
- Election-based outcomes for shareholder choices (cash vs stock)
- Instrument-level trading hold during CA application window
- Pre-application consistency checks against broker entitlements
- DB-level enforcement of status transitions and invariants
- Action-type-specific date semantics per field (ex vs record vs payable)
- Voluntary/tender actions beyond distributions
- Backfill/initialization guard for newly onboarded users
- Applicator retry/backoff semantics and confirmation race
- Rights indivisibility constraints vs exact Decimal quantities

**Claude Opus unique findings (not in either other model):**
- Pending order PRICE adjustment after splits (not just cancellation)
- Multi-instrument position recalculation atomicity for mergers
- Mixed merger basis floor at zero (can produce negative basis)
- Tax lot identification method interaction with inherited dates
- Corporate action effect on strategy position limits/risk params
- Corporate actions on instruments not yet in the database
- Partial application window: new user acquires position mid-fan-out
- IRC §305(c) deemed distributions (taxable stock dividends)
- CA impact on unrealized P&L display and strategy evaluation
- Concurrent OrderManager startup + Applicator fan-out race

**Claude Sonnet unique findings (not in either other model):**
- Stale orders: failure modes table contradicts "excluded" section
- IRC §1223(1) holding period tacking verification at lot close
- Spinoff allocation percentage: no validation that child != parent instrument
- Combined spinoff allocations exceeding meaningful bounds
- Cash dividend bypasses OrderManager, so the record-date quantity snapshot is lost
- Mixed merger large-denominator exchange ratio overflow
- Detector schedule: no intraday re-poll for same-day announcements
- ROC cumulative basis floor tracking and IRC §301(c)(3) distinction
- Mixed merger deferred loss not explicitly recorded in metadata

**Quality assessment:**
- **Claude Opus** was the MOST PROLIFIC (23 findings), a notable inversion
  from previous experiments where Opus typically found fewer but deeper
  findings. Here, the explicit "missing feature" framing appears to have
  unlocked Opus's breadth. Its unique findings included genuinely critical
  items: pending order price adjustment after splits (Critical: direct
  financial loss), multi-instrument atomicity for mergers (Critical:
  position loss), and mixed merger negative basis (High: accounting
  corruption). The findings were precise, well-reasoned, and showed both
  regulatory depth (IRC §305(c)) and operational awareness.
- **GPT-5** was slightly less prolific (20 findings) but maintained its
  characteristic breadth and operational-level thinking. Per-lot eligibility
  (not just per-user) is a subtle but important distinction. The
  election-based outcomes finding shows awareness of real-world corporate
  action complexity. The backfill/initialization guard is operationally
  significant. GPT-5 spent 8,512 reasoning tokens, moderate for its output
  volume.
- **Claude Sonnet** found fewer gaps (15) but several were genuinely
  insightful. The internal contradiction between the failure modes table
  and the "excluded" section is a real document inconsistency. The cash
  dividend record-date quantity snapshot insight (#9) identifies a DATA LOSS
  problem: the opportunity to capture that data expires. The mixed merger
  deferred loss recording gap shows regulatory awareness. However, some
  findings were more surface-level or overlapped heavily with the others.

**KEY INSIGHT: The original question from Finding #22 is ANSWERED:**

> "Opus's 'missing feature identification' mode (wash sales, commissions):
> is this promptable on other models? Could we explicitly ask GPT-5 'what
> should this system compute but doesn't' and get similar results?"

**YES.** When explicitly prompted with a structured "missing feature"
framing, ALL three models found regulatory gaps (wash sales, IRC sections),
missing computations (basis calculations, rounding), and missing behaviors
(lifecycle events, notifications). GPT-5 produced findings in the same
*category* as what Opus uniquely found in Finding #22 (silent correctness
failures on specid-lot-selection.md).

In Finding #22, Opus uniquely identified wash sales and commission tracking
as missing features while GPT-5 focused on mechanism incorrectness and
Sonnet on composition failures. HERE, with the explicit "what's missing"
prompt, ALL three models found wash sales, ALL found regulatory gaps, and
ALL found missing behaviors.

**This confirms:** Opus's "missing feature identification" mode in
Finding #22 was NOT an inherent model capability. It was an emergent
behavior from the open-ended "silent correctness failures" prompt. When you
give ALL models the EXPLICIT instruction to look for missing features, they
all do it. The differentiation seen in #22 was caused by the prompt being
more open-ended, allowing each model to default to its natural analytical
mode:
- Opus → "what's missing" (features/functionality)
- GPT-5 → "what's wrong" (mechanism failures)
- Sonnet → "what breaks when combined" (composition)

**Prompt framing dominates model personality.** With the right prompt,
any model can be directed into any analytical mode. The model differences
that emerged in earlier open-ended experiments reflect DEFAULT TENDENCIES,
not capabilities.

**NEW finding about Opus on complex documents:**
Opus produced MORE findings than GPT-5 (23 vs 20), the first time this
has happened on a broad analytical task. Previous pattern: GPT-5 always
finds more (20-33 findings) while Opus finds fewer but deeper (7-13).
What changed? The document is 992 lines, the longest tested, and the
task is explicitly about breadth ("find all gaps"). On this specific
combination (long document + breadth-focused prompt), Opus appears to
allocate its internal reasoning budget toward exploration rather than
its usual depth-first design-tension mode. This suggests Opus's typical
"fewer but deeper" pattern is partially a RESPONSE to shorter documents
where depth is more productive than breadth.

**Practical implications:**
1. For missing-feature analysis: prompt structure matters more than model
   choice. All three models are viable. Use the explicit 5-category prompt.
2. Run all three for critical docs: they find different specific gaps
   despite finding the same categories.
3. For open-ended analysis where you want models to find DIFFERENT things:
   use open-ended prompts. For analysis where you want COMPREHENSIVE
   coverage of one type: use structured prompts.
4. Opus's "fewer but deeper" personality can be overridden by document
   length + breadth-focused prompt. On 992-line docs, it competes on
   volume with GPT-5.
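
When running all three models on one document, the common-ground and unique lists above can be derived mechanically once each finding has a normalized label. The sketch below is one hypothetical way to do that. The `partition_findings` helper is an assumption, and the sample labels are a small subset taken from the lists in this finding. How findings get normalized to matching labels is the hard part and is not shown.

```python
from collections import defaultdict


def partition_findings(by_model):
    """Split finding labels into (common, partial, unique).

    by_model maps a model name to the set of normalized finding labels
    it produced. "common" = found by every model, "partial" = found by
    more than one but not all, "unique" = per-model sets of labels that
    no other model produced.
    """
    owners = defaultdict(set)
    for model, findings in by_model.items():
        for label in findings:
            owners[label].add(model)

    all_models = set(by_model)
    common = {f for f, m in owners.items() if m == all_models}
    partial = {f for f, m in owners.items() if 1 < len(m) < len(all_models)}
    unique = {model: set() for model in by_model}
    for label, models in owners.items():
        if len(models) == 1:
            unique[next(iter(models))].add(label)
    return common, partial, unique


# Small sample drawn from the lists in this finding (not the full data).
sample = {
    "GPT-5": {"wash sale interaction", "per-lot eligibility"},
    "Opus": {"wash sale interaction", "pending order price adjustment"},
    "Sonnet": {"wash sale interaction", "holding period tacking"},
}

common, partial, unique = partition_findings(sample)
print("common:", sorted(common))
for model, labels in unique.items():
    print(f"{model} only:", sorted(labels))
```

The `partial` bucket matters for reading the table: a model's total can exceed its common plus unique count when some findings are shared with exactly one other model.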

**Cost-effectiveness:**
- Opus: 4,111 output tokens for 23 findings = 179 tokens/finding
- GPT-5: 11,354 output tokens (+ 8,512 reasoning) for 20 findings = 993 tokens/finding
- Sonnet: 4,686 output tokens for 15 findings = 312 tokens/finding

Opus is by far the most efficient: roughly 5.5x fewer tokens than GPT-5 per
finding, with MORE findings. (Claude reasoning tokens are listed only as
"(internal)" in the table, so the Claude figures cover output tokens alone.)
This is the strongest cost-effectiveness case for Opus on any tested task.
On long documents with breadth-focused prompts, Opus appears to be the
optimal choice for both quality AND efficiency.
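
The per-finding figures can be reproduced with a few lines of arithmetic. This sketch follows the write-up in adding GPT-5's reasoning tokens on top of its output tokens. Whether the reported output count already includes them is not stated here, so the `include_reasoning` switch is provided to show both readings. Claude reasoning tokens are treated as 0 because they are not reported, which is an assumption of this sketch.

```python
# Token and finding counts copied from the results table.
RUNS = {
    "Opus": {"output": 4_111, "reasoning": 0, "findings": 23},
    "GPT-5": {"output": 11_354, "reasoning": 8_512, "findings": 20},
    "Sonnet": {"output": 4_686, "reasoning": 0, "findings": 15},
}


def tokens_per_finding(run, include_reasoning=True):
    """Return unrounded tokens spent per finding for one run."""
    total = run["output"]
    if include_reasoning:
        total += run["reasoning"]
    return total / run["findings"]


for name, run in RUNS.items():
    print(f"{name}: {round(tokens_per_finding(run))} tokens/finding")

# Ratio of GPT-5 to Opus under both readings of the GPT-5 token count.
opus = tokens_per_finding(RUNS["Opus"])
with_reasoning = tokens_per_finding(RUNS["GPT-5"]) / opus
output_only = tokens_per_finding(RUNS["GPT-5"], include_reasoning=False) / opus
print(f"GPT-5 vs Opus: {with_reasoning:.1f}x (with reasoning), "
      f"{output_only:.1f}x (output only)")
```

Under either reading Opus remains the cheapest per finding; only the size of the gap changes.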