6af8a6ee10
Break the monolithic 3249-line findings file into 29 individual files, one per experiment. Each file is named YYYY-MM-DD-NN-slug.md for easy chronological sorting and discovery. No content changes — purely structural reorganization.
99 lines
5.5 KiB
Markdown
99 lines
5.5 KiB
Markdown
# Finding 10: Hidden-assumption identification: GPT-5's reasoning produces qualitatively different (not just more) findings
|
|
|
|
**Date:** 2026-05-02
|
|
**Task:** Identify hidden assumptions in gargoyle's `cold-start-and-recovery.md` (234 lines)
|
|
that could break under real-world production conditions.
|
|
**How we used them:** Same document (full text) + same focused analytical question
|
|
to all 3 models via HAI proxy (OpenAI-compatible endpoint). No tools, no project
|
|
context beyond the document itself. Single prompt, no conversation history.
|
|
Temperature 0.3 for GPT-4.1/Mini; GPT-5 uses default (required).
|
|
|
|
| Model | Time | Output tokens | Reasoning tokens | Assumptions found |
|
|
|---|---|---|---|---|
|
|
| GPT-4.1 Mini | 25s | 3,090 | 0 | 12 |
|
|
| GPT-4.1 | 77s | 2,751 | 0 | 14 |
|
|
| GPT-5 | 78s | 2,649 | 4,096 | 26 |
|
|
|
|
**What they found — common ground (all 3 identified):**
|
|
- Broker API consistency/availability during reconciliation
|
|
- ETS table availability and fail-closed behavior
|
|
- Single-writer/mailbox ordering guarantees holding in practice
|
|
- User independence assumption vs shared resources (rate limits, DB)
|
|
- Reconciliation idempotency under repeated runs
|
|
- Corporate action data completeness/timeliness
|
|
- Escalation threshold calibration vs changing market conditions
|
|
- Strategy warmup with partial/missing historical data
|
|
- Signal expiry correctness on restart
|
|
|
|
**GPT-5 unique findings (not in either other model):**
|
|
- Unbounded mailbox growth during extended reconciliation (memory pressure
|
|
from queued messages at market open)
|
|
- handle_continue side effects in OTHER processes (risk, metrics) acting
|
|
concurrently via different paths
|
|
- Pre-existing GTC orders filling while gated (positions as moving target)
|
|
- Broker position semantics mismatch (trade-date vs settled-date)
|
|
- Strategy warmup evaluate() having non-signal side effects (metrics, caches)
|
|
- Historical bar / live tick boundary alignment (double-processing or gaps)
|
|
- ETS gate caching in process state creating fail-open windows
|
|
- Correlated retry stampede when many users restart together
|
|
- Corporate action double-application race with broker (missing idempotency
|
|
keys per action/instrument/date)
|
|
- Kill switch state vs DB unavailability at startup
|
|
- Market data subscriptions as shared bottleneck across "independent" users
|
|
- Time-invariant signals incorrectly expired by aggregation window logic
|
|
- Broker fills vs positions endpoints internally inconsistent (different caches)
|
|
- Positions changing under reconciliation while kill switch is engaged
|
|
- Gate phase sequencing: :ready written before worker warmup completes
|
|
- Periodic reconciler allowing 1hr of divergent trading (rate-of-change blind)
|
|
|
|
**GPT-4.1 unique findings (not in GPT-5 or Mini):**
|
|
- No correlated failure handling (all failure modes treated as isolated) —
|
|
only model to frame this as a meta-assumption about the failure table
|
|
|
|
**GPT-4.1 Mini unique findings:**
|
|
- None that weren't also covered by the other two models
|
|
|
|
**Quality assessment:**
|
|
- **GPT-5** didn't just find more assumptions — it found *qualitatively
|
|
different kinds*. Many of its unique findings involve multi-component
|
|
interactions (mailbox + reconciliation + market open timing), semantic
|
|
mismatches (trade-date vs settled positions), and second-order effects
|
|
(metrics side effects during warmup, GTC orders filling while gated).
|
|
These require reasoning about system behavior across boundaries the
|
|
document doesn't explicitly draw.
|
|
- **GPT-4.1** was competent and structured, found the same core assumptions
|
|
as Mini, plus one good meta-observation about correlated failures. But
|
|
it stayed within the document's own framing — it found assumptions the
|
|
document *almost* states rather than ones the document can't see.
|
|
- **GPT-4.1 Mini** was formulaic. Every finding maps cleanly to a section
|
|
of the document. It's essentially "what could go wrong with each stated
|
|
mechanism" rather than "what does this design take for granted about
|
|
the world outside itself."
|
|
|
|
**Key insight — reasoning tokens change the KIND of analysis:**
|
|
GPT-5's 4,096 reasoning tokens aren't producing "more of the same" —
|
|
they're producing a different analytical mode. The non-reasoning models
|
|
(4.1 and Mini) identify risks within the document's own frame of reference.
|
|
GPT-5 reasons about the document's relationship to the external world:
|
|
broker semantics, deployment topology, OTP runtime behavior under load,
|
|
timing correlations across independent subsystems. This is the difference
|
|
between "what could this mechanism fail at" and "what must be true about
|
|
the world for this mechanism to work."
|
|
|
|
**Comparison to Finding #9 (gap-finding on failure-modes.md):**
|
|
Same pattern confirmed. GPT-5 consistently finds domain-specific,
|
|
interaction-level issues that require reasoning about component boundaries.
|
|
GPT-4.1 is thorough within-frame. Mini is formulaic. The gap between
|
|
GPT-5 and the others is larger here than in #9 — possibly because
|
|
"hidden assumptions" requires more abstraction than "missing failure
|
|
scenarios." Assumption-finding requires the model to reason about what
|
|
ISN'T stated, which benefits more from extended reasoning.
|
|
|
|
**Practical implication:** For architecture review, running GPT-5 on
|
|
"identify hidden assumptions" is higher-value than the same question to
|
|
non-reasoning models. The cost difference (4K extra reasoning tokens) is
|
|
trivial for a document that will drive months of implementation. Use
|
|
non-reasoning models for within-frame checks ("does this section have
|
|
gaps") and reasoning models for cross-boundary analysis ("what must be
|
|
true about the world for this to work").
|