refactor(findings): split ALL-FINDINGS.md into per-experiment files

Break the monolithic 3249-line findings file into 29 individual files, one per experiment. Each file is named YYYY-MM-DD-NN-slug.md for easy chronological sorting and discovery. No content changes — purely structural reorganization.
2026-05-06 07:15:50 -07:00
parent 1b108ff66e
commit 6af8a6ee10
32 changed files with 3232 additions and 3254 deletions
@@ -0,0 +1,158 @@
+# Finding 26: Missing-feature identification IS promptable across all models; prompt framing eliminates Opus's historical advantage — all three models find regulatory gaps when explicitly asked
+
+**Date:** 2026-05-05
+**Task:** Identify computations, behaviors, or features that gargoyle's
+`corporate-actions.md` (992 lines) SHOULD perform for financial correctness,
+regulatory compliance, or operational safety — but doesn't describe.
+**How we used them:** Same document (full text) + same focused analytical
+prompt to all 3 models via HAI proxy. Prompt explicitly structured around 5
+categories: missing computations, missing behaviors, missing validations,
+missing integrations, and regulatory gaps. Required concrete findings with
+severity. No tools, no project context beyond the document. GPT-5 via
+OpenAI endpoint (16K max_completion_tokens), Opus 4.6 and Sonnet 4.6 via
+Anthropic endpoint (8K max_tokens).
+
+| Model | Output tokens | Reasoning tokens | Findings | Critical | High | Medium |
+|---|---|---|---|---|---|---|
+| GPT-5 | 11,354 | 8,512 | 20 | 3 | 10 | 7 |
+| Claude Opus 4.6 | 4,111 | (internal) | 23 | 6 | 10 | 7 |
+| Claude Sonnet 4.6 | 4,686 | (internal) | 15 | 5 | 6 | 4 |
+
+**What they found — common ground (all 3 identified):**
+- Wash sale rule interaction with CA-driven lot closures (IRC §1091)
+- Short position treatment for corporate actions
+- Same-day corporate action ordering beyond `recorded_at` timestamp
+- Record date / ex-date position verification (entitlement timing)
+- Idempotency guard preventing double-application per user
+- Decimal precision/rounding policy unspecified
+- Superseded CA status has no lot rollback mechanism
+- Rights/warrants post-creation lifecycle (exercise/expiration)
+- Basis preservation invariant has no runtime enforcement
+- Manual entry authorization and audit trail
+
+**GPT-5 unique findings (not in either Claude model):**
+- Per-lot eligibility based on entitlement date (not just user-level)
+- Election-based outcomes for shareholder choices (cash vs stock)
+- Instrument-level trading hold during CA application window
+- Pre-application consistency checks against broker entitlements
+- DB-level enforcement of status transitions and invariants
+- Action-type-specific date semantics per field (ex vs record vs payable)
+- Voluntary/tender actions beyond distributions
+- Backfill/initialization guard for newly onboarded users
+- Applicator retry/backoff semantics and confirmation race
+- Rights indivisibility constraints vs exact Decimal quantities
+
+**Claude Opus unique findings (not in either other model):**
+- Pending order PRICE adjustment after splits (not just cancellation)
+- Multi-instrument position recalculation atomicity for mergers
+- Mixed merger basis floor at zero (can produce negative basis)
+- Tax lot identification method interaction with inherited dates
+- Corporate action effect on strategy position limits/risk params
+- Corporate actions on instruments not yet in the database
+- Partial application window: new user acquires position mid-fan-out
+- IRC §305(c) deemed distributions (taxable stock dividends)
+- CA impact on unrealized P&L display and strategy evaluation
+- Concurrent OrderManager startup + Applicator fan-out race
+
+**Claude Sonnet unique findings (not in either other model):**
+- Stale orders: failure modes table contradicts "excluded" section
+- IRC §1223(1) holding period tacking verification at lot close
+- Spinoff allocation percentage — no validation child != parent instrument
+- Combined spinoff allocations exceeding meaningful bounds
+- Cash dividend bypasses OrderManager — record-date quantity snapshot lost
+- Mixed merger large-denominator exchange ratio overflow
+- Detector schedule: no intraday re-poll for same-day announcements
+- ROC cumulative basis floor tracking and IRC §301(c)(3) distinction
+- Mixed merger deferred loss not explicitly recorded in metadata
+
+**Quality assessment:**
+- **Claude Opus** was the MOST PROLIFIC (23 findings) — a notable inversion
+  from previous experiments where Opus typically found fewer but deeper
+  findings. Here, the explicit "missing feature" framing appears to have
+  unlocked Opus's breadth. Its unique findings included genuinely critical
+  items: pending order price adjustment after splits (Critical — direct
+  financial loss), multi-instrument atomicity for mergers (Critical —
+  position loss), and mixed merger negative basis (High — accounting
+  corruption). The findings were precise, well-reasoned, and showed both
+  regulatory depth (IRC §305(c)) and operational awareness.
+- **GPT-5** was slightly less prolific (20 findings) but maintained its
+  characteristic breadth and operational-level thinking. Per-lot eligibility
+  (not just per-user) is a subtle but important distinction. The election-
+  based outcomes finding shows awareness of real-world corporate action
+  complexity. The backfill/initialization guard is operationally significant.
+  GPT-5 spent 8,512 reasoning tokens — moderate for its output volume.
+- **Claude Sonnet** found fewer gaps (15) but several were genuinely
+  insightful. The internal contradiction between the failure modes table
+  and the "excluded" section is a real document inconsistency. The cash
+  dividend record-date quantity snapshot insight (#9) identifies a DATA LOSS
+  problem — the opportunity to capture that data expires. The mixed merger
+  deferred loss recording gap shows regulatory awareness. However, some
+  findings were more surface-level or overlapped heavily with the others.
+
+**KEY INSIGHT — The original question from Finding #22 is ANSWERED:**
+
+> "Opus's 'missing feature identification' mode (wash sales, commissions) —
+> is this promptable on other models? Could we explicitly ask GPT-5 'what
+> should this system compute but doesn't' and get similar results?"
+
+**YES.** When explicitly prompted with a structured "missing feature"
+framing, ALL three models found regulatory gaps (wash sales, IRC sections),
+missing computations (basis calculations, rounding), and missing behaviors
+(lifecycle events, notifications). GPT-5 produced findings in the same
+*category* as what Opus uniquely found in Finding #22 (silent correctness
+failures on specid-lot-selection.md).
+
+In Finding #22, Opus uniquely identified wash sales and commission tracking
+as missing features while GPT-5 focused on mechanism incorrectness and
+Sonnet on composition failures. HERE, with the explicit "what's missing"
+prompt, ALL three models found wash sales, ALL found regulatory gaps, and
+ALL found missing behaviors.
+
+**This confirms:** Opus's "missing feature identification" mode in Finding
+#22 was NOT an inherent model capability — it was an emergent behavior from
+the open-ended "silent correctness failures" prompt. When you give ALL models
+the EXPLICIT instruction to look for missing features, they all do it. The
+differentiation from #22 was caused by the prompt being more open-ended,
+allowing each model to default to its natural analytical mode:
+- Opus → "what's missing" (features/functionality)
+- GPT-5 → "what's wrong" (mechanism failures)
+- Sonnet → "what breaks when combined" (composition)
+
+**Prompt framing dominates model personality.** With the right prompt,
+any model can be directed into any analytical mode. The model differences
+that emerged in earlier open-ended experiments reflect DEFAULT TENDENCIES,
+not capabilities.
+
+**NEW finding about Opus on complex documents:**
+Opus produced MORE findings than GPT-5 (23 vs 20) — the first time this
+has happened on a broad analytical task. Previous pattern: GPT-5 always
+finds more (20-33 findings) while Opus finds fewer but deeper (7-13).
+What changed? The document is 992 lines — the longest tested — and the
+task is explicitly about breadth ("find all gaps"). On this specific
+combination (long document + breadth-focused prompt), Opus appears to
+allocate its internal reasoning budget toward exploration rather than
+its usual depth-first design-tension mode. This suggests Opus's typical
+"fewer but deeper" pattern is partially a RESPONSE to shorter documents
+where depth is more productive than breadth.
+
+**Practical implications:**
+1. For missing-feature analysis: prompt structure matters more than model
+   choice. All three models are viable. Use the explicit 5-category prompt.
+2. Run all three for critical docs — they find different specific gaps
+   despite finding the same categories.
+3. For open-ended analysis where you want models to find DIFFERENT things:
+   use open-ended prompts. For analysis where you want COMPREHENSIVE
+   coverage of one type: use structured prompts.
+4. Opus's "fewer but deeper" personality can be overridden by document
+   length + breadth-focused prompt. On 992-line docs, it competes on
+   volume with GPT-5.
+
+**Cost-effectiveness:**
+Opus: 4,111 output tokens for 23 findings = 179 tokens/finding
+GPT-5: 11,354 output tokens (+ 8,512 reasoning) for 20 findings = 993 tokens/finding
+Sonnet: 4,686 output tokens for 15 findings = 312 tokens/finding
+
+Opus is by far the most efficient: nearly 6x fewer tokens than GPT-5 per
+finding, with MORE findings. This is the strongest cost-effectiveness case
+for Opus on any tested task. On long documents with breadth-focused prompts,
+Opus appears to be the optimal choice for both quality AND efficiency.