refactor(findings): split ALL-FINDINGS.md into per-experiment files
Break the monolithic 3249-line findings file into 29 individual files, one per experiment. Each file is named YYYY-MM-DD-NN-slug.md for easy chronological sorting and discovery. No content changes — purely structural reorganization.
This commit is contained in:
@@ -0,0 +1,98 @@
|
||||
# Finding 10: Hidden-assumption identification: GPT-5's reasoning produces qualitatively different (not just more) findings
|
||||
|
||||
**Date:** 2026-05-02
|
||||
**Task:** Identify hidden assumptions in gargoyle's `cold-start-and-recovery.md` (234 lines)
|
||||
that could break under real-world production conditions.
|
||||
**How we used them:** Same document (full text) + same focused analytical question
|
||||
to all 3 models via HAI proxy (OpenAI-compatible endpoint). No tools, no project
|
||||
context beyond the document itself. Single prompt, no conversation history.
|
||||
Temperature 0.3 for GPT-4.1/Mini; GPT-5 uses default (required).
|
||||
|
||||
| Model | Time | Output tokens | Reasoning tokens | Assumptions found |
|
||||
|---|---|---|---|---|
|
||||
| GPT-4.1 Mini | 25s | 3,090 | 0 | 12 |
|
||||
| GPT-4.1 | 77s | 2,751 | 0 | 14 |
|
||||
| GPT-5 | 78s | 2,649 | 4,096 | 26 |
|
||||
|
||||
**What they found — common ground (all 3 identified):**
|
||||
- Broker API consistency/availability during reconciliation
|
||||
- ETS table availability and fail-closed behavior
|
||||
- Single-writer/mailbox ordering guarantees holding in practice
|
||||
- User independence assumption vs shared resources (rate limits, DB)
|
||||
- Reconciliation idempotency under repeated runs
|
||||
- Corporate action data completeness/timeliness
|
||||
- Escalation threshold calibration vs changing market conditions
|
||||
- Strategy warmup with partial/missing historical data
|
||||
- Signal expiry correctness on restart
|
||||
|
||||
**GPT-5 unique findings (not in either other model):**
|
||||
- Unbounded mailbox growth during extended reconciliation (memory pressure
|
||||
from queued messages at market open)
|
||||
- handle_continue side effects in OTHER processes (risk, metrics) acting
|
||||
concurrently via different paths
|
||||
- Pre-existing GTC orders filling while gated (positions as moving target)
|
||||
- Broker position semantics mismatch (trade-date vs settled-date)
|
||||
- Strategy warmup evaluate() having non-signal side effects (metrics, caches)
|
||||
- Historical bar / live tick boundary alignment (double-processing or gaps)
|
||||
- ETS gate caching in process state creating fail-open windows
|
||||
- Correlated retry stampede when many users restart together
|
||||
- Corporate action double-application race with broker (missing idempotency
|
||||
keys per action/instrument/date)
|
||||
- Kill switch state vs DB unavailability at startup
|
||||
- Market data subscriptions as shared bottleneck across "independent" users
|
||||
- Time-invariant signals incorrectly expired by aggregation window logic
|
||||
- Broker fills vs positions endpoints internally inconsistent (different caches)
|
||||
- Positions changing under reconciliation while kill switch is engaged
|
||||
- Gate phase sequencing: :ready written before worker warmup completes
|
||||
- Periodic reconciler allowing 1hr of divergent trading (rate-of-change blind)
|
||||
|
||||
**GPT-4.1 unique findings (not in GPT-5 or Mini):**
|
||||
- No correlated failure handling (all failure modes treated as isolated) —
|
||||
only model to frame this as a meta-assumption about the failure table
|
||||
|
||||
**GPT-4.1 Mini unique findings:**
|
||||
- None that weren't also covered by the other two models
|
||||
|
||||
**Quality assessment:**
|
||||
- **GPT-5** didn't just find more assumptions — it found *qualitatively
|
||||
different kinds*. Many of its unique findings involve multi-component
|
||||
interactions (mailbox + reconciliation + market open timing), semantic
|
||||
mismatches (trade-date vs settled positions), and second-order effects
|
||||
(metrics side effects during warmup, GTC orders filling while gated).
|
||||
These require reasoning about system behavior across boundaries the
|
||||
document doesn't explicitly draw.
|
||||
- **GPT-4.1** was competent and structured, found the same core assumptions
|
||||
as Mini, plus one good meta-observation about correlated failures. But
|
||||
it stayed within the document's own framing — it found assumptions the
|
||||
document *almost* states rather than ones the document can't see.
|
||||
- **GPT-4.1 Mini** was formulaic. Every finding maps cleanly to a section
|
||||
of the document. It's essentially "what could go wrong with each stated
|
||||
mechanism" rather than "what does this design take for granted about
|
||||
the world outside itself."
|
||||
|
||||
**Key insight — reasoning tokens change the KIND of analysis:**
|
||||
GPT-5's 4,096 reasoning tokens aren't producing "more of the same" —
|
||||
they're producing a different analytical mode. The non-reasoning models
|
||||
(4.1 and Mini) identify risks within the document's own frame of reference.
|
||||
GPT-5 reasons about the document's relationship to the external world:
|
||||
broker semantics, deployment topology, OTP runtime behavior under load,
|
||||
timing correlations across independent subsystems. This is the difference
|
||||
between "what could this mechanism fail at" and "what must be true about
|
||||
the world for this mechanism to work."
|
||||
|
||||
**Comparison to Finding #9 (gap-finding on failure-modes.md):**
|
||||
Same pattern confirmed. GPT-5 consistently finds domain-specific,
|
||||
interaction-level issues that require reasoning about component boundaries.
|
||||
GPT-4.1 is thorough within-frame. Mini is formulaic. The gap between
|
||||
GPT-5 and the others is larger here than in #9 — possibly because
|
||||
"hidden assumptions" requires more abstraction than "missing failure
|
||||
scenarios." Assumption-finding requires the model to reason about what
|
||||
ISN'T stated, which benefits more from extended reasoning.
|
||||
|
||||
**Practical implication:** For architecture review, running GPT-5 on
|
||||
"identify hidden assumptions" is higher-value than the same question to
|
||||
non-reasoning models. The cost difference (4K extra reasoning tokens) is
|
||||
trivial for a document that will drive months of implementation. Use
|
||||
non-reasoning models for within-frame checks ("does this section have
|
||||
gaps") and reasoning models for cross-boundary analysis ("what must be
|
||||
true about the world for this to work").
|
||||
Reference in New Issue
Block a user