refactor(findings): split ALL-FINDINGS.md into per-experiment files

Break the monolithic 3249-line findings file into 29 individual files, one per experiment. Each file is named YYYY-MM-DD-NN-slug.md for easy chronological sorting and discovery. No content changes — purely structural reorganization.
2026-05-06 07:15:50 -07:00
parent 1b108ff66e
commit 6af8a6ee10
32 changed files with 3232 additions and 3254 deletions
@@ -0,0 +1,98 @@
+# Finding 10: Hidden-assumption identification: GPT-5's reasoning produces qualitatively different (not just more) findings
+
+**Date:** 2026-05-02
+**Task:** Identify hidden assumptions in gargoyle's `cold-start-and-recovery.md` (234 lines)
+that could break under real-world production conditions.
+**How we used them:** Same document (full text) + same focused analytical question
+to all 3 models via HAI proxy (OpenAI-compatible endpoint). No tools, no project
+context beyond the document itself. Single prompt, no conversation history.
+Temperature 0.3 for GPT-4.1/Mini; GPT-5 uses default (required).
+
+| Model | Time | Output tokens | Reasoning tokens | Assumptions found |
+|---|---|---|---|---|
+| GPT-4.1 Mini | 25s | 3,090 | 0 | 12 |
+| GPT-4.1 | 77s | 2,751 | 0 | 14 |
+| GPT-5 | 78s | 2,649 | 4,096 | 26 |
+
+**What they found — common ground (all 3 identified):**
+- Broker API consistency/availability during reconciliation
+- ETS table availability and fail-closed behavior
+- Single-writer/mailbox ordering guarantees holding in practice
+- User independence assumption vs shared resources (rate limits, DB)
+- Reconciliation idempotency under repeated runs
+- Corporate action data completeness/timeliness
+- Escalation threshold calibration vs changing market conditions
+- Strategy warmup with partial/missing historical data
+- Signal expiry correctness on restart
+
+**GPT-5 unique findings (not in either other model):**
+- Unbounded mailbox growth during extended reconciliation (memory pressure
+  from queued messages at market open)
+- handle_continue side effects in OTHER processes (risk, metrics) acting
+  concurrently via different paths
+- Pre-existing GTC orders filling while gated (positions as moving target)
+- Broker position semantics mismatch (trade-date vs settled-date)
+- Strategy warmup evaluate() having non-signal side effects (metrics, caches)
+- Historical bar / live tick boundary alignment (double-processing or gaps)
+- ETS gate caching in process state creating fail-open windows
+- Correlated retry stampede when many users restart together
+- Corporate action double-application race with broker (missing idempotency
+  keys per action/instrument/date)
+- Kill switch state vs DB unavailability at startup
+- Market data subscriptions as shared bottleneck across "independent" users
+- Time-invariant signals incorrectly expired by aggregation window logic
+- Broker fills vs positions endpoints internally inconsistent (different caches)
+- Positions changing under reconciliation while kill switch is engaged
+- Gate phase sequencing: :ready written before worker warmup completes
+- Periodic reconciler allowing 1hr of divergent trading (rate-of-change blind)
+
+**GPT-4.1 unique findings (not in GPT-5 or Mini):**
+- No correlated failure handling (all failure modes treated as isolated) —
+  only model to frame this as a meta-assumption about the failure table
+
+**GPT-4.1 Mini unique findings:**
+- None that weren't also covered by the other two models
+
+**Quality assessment:**
+- **GPT-5** didn't just find more assumptions — it found *qualitatively
+  different kinds*. Many of its unique findings involve multi-component
+  interactions (mailbox + reconciliation + market open timing), semantic
+  mismatches (trade-date vs settled positions), and second-order effects
+  (metrics side effects during warmup, GTC orders filling while gated).
+  These require reasoning about system behavior across boundaries the
+  document doesn't explicitly draw.
+- **GPT-4.1** was competent and structured, found the same core assumptions
+  as Mini, plus one good meta-observation about correlated failures. But
+  it stayed within the document's own framing — it found assumptions the
+  document *almost* states rather than ones the document can't see.
+- **GPT-4.1 Mini** was formulaic. Every finding maps cleanly to a section
+  of the document. It's essentially "what could go wrong with each stated
+  mechanism" rather than "what does this design take for granted about
+  the world outside itself."
+
+**Key insight — reasoning tokens change the KIND of analysis:**
+GPT-5's 4,096 reasoning tokens aren't producing "more of the same" —
+they're producing a different analytical mode. The non-reasoning models
+(4.1 and Mini) identify risks within the document's own frame of reference.
+GPT-5 reasons about the document's relationship to the external world:
+broker semantics, deployment topology, OTP runtime behavior under load,
+timing correlations across independent subsystems. This is the difference
+between "what could this mechanism fail at" and "what must be true about
+the world for this mechanism to work."
+
+**Comparison to Finding #9 (gap-finding on failure-modes.md):**
+Same pattern confirmed. GPT-5 consistently finds domain-specific,
+interaction-level issues that require reasoning about component boundaries.
+GPT-4.1 is thorough within-frame. Mini is formulaic. The gap between
+GPT-5 and the others is larger here than in #9 — possibly because
+"hidden assumptions" requires more abstraction than "missing failure
+scenarios." Assumption-finding requires the model to reason about what
+ISN'T stated, which benefits more from extended reasoning.
+
+**Practical implication:** For architecture review, running GPT-5 on
+"identify hidden assumptions" is higher-value than the same question to
+non-reasoning models. The cost difference (4K extra reasoning tokens) is
+trivial for a document that will drive months of implementation. Use
+non-reasoning models for within-frame checks ("does this section have
+gaps") and reasoning models for cross-boundary analysis ("what must be
+true about the world for this to work").