Files
model-research/findings/2026-05-02-10-hiddenassumption-identification-gpt5s-reasoning-produces.md
T
Rodin 6af8a6ee10 refactor(findings): split ALL-FINDINGS.md into per-experiment files
Break the monolithic 3249-line findings file into 29 individual files,
one per experiment. Each file is named YYYY-MM-DD-NN-slug.md for easy
chronological sorting and discovery.

No content changes — purely structural reorganization.
2026-05-06 07:15:50 -07:00

99 lines
5.5 KiB
Markdown

# Finding 10: Hidden-assumption identification: GPT-5's reasoning produces qualitatively different (not just more) findings
**Date:** 2026-05-02
**Task:** Identify hidden assumptions in gargoyle's `cold-start-and-recovery.md` (234 lines)
that could break under real-world production conditions.
**How we used them:** Same document (full text) + same focused analytical question
to all 3 models via HAI proxy (OpenAI-compatible endpoint). No tools, no project
context beyond the document itself. Single prompt, no conversation history.
Temperature 0.3 for GPT-4.1/Mini; GPT-5 uses default (required).
| Model | Time | Output tokens | Reasoning tokens | Assumptions found |
|---|---|---|---|---|
| GPT-4.1 Mini | 25s | 3,090 | 0 | 12 |
| GPT-4.1 | 77s | 2,751 | 0 | 14 |
| GPT-5 | 78s | 2,649 | 4,096 | 26 |
**What they found — common ground (all 3 identified):**
- Broker API consistency/availability during reconciliation
- ETS table availability and fail-closed behavior
- Single-writer/mailbox ordering guarantees holding in practice
- User independence assumption vs shared resources (rate limits, DB)
- Reconciliation idempotency under repeated runs
- Corporate action data completeness/timeliness
- Escalation threshold calibration vs changing market conditions
- Strategy warmup with partial/missing historical data
- Signal expiry correctness on restart
**GPT-5 unique findings (not in either other model):**
- Unbounded mailbox growth during extended reconciliation (memory pressure
from queued messages at market open)
- handle_continue side effects in OTHER processes (risk, metrics) acting
concurrently via different paths
- Pre-existing GTC orders filling while gated (positions as moving target)
- Broker position semantics mismatch (trade-date vs settled-date)
- Strategy warmup evaluate() having non-signal side effects (metrics, caches)
- Historical bar / live tick boundary alignment (double-processing or gaps)
- ETS gate caching in process state creating fail-open windows
- Correlated retry stampede when many users restart together
- Corporate action double-application race with broker (missing idempotency
keys per action/instrument/date)
- Kill switch state vs DB unavailability at startup
- Market data subscriptions as shared bottleneck across "independent" users
- Time-invariant signals incorrectly expired by aggregation window logic
- Broker fills vs positions endpoints internally inconsistent (different caches)
- Positions changing under reconciliation while kill switch is engaged
- Gate phase sequencing: :ready written before worker warmup completes
- Periodic reconciler allowing 1hr of divergent trading (rate-of-change blind)
**GPT-4.1 unique findings (not in GPT-5 or Mini):**
- No correlated failure handling (all failure modes treated as isolated) —
only model to frame this as a meta-assumption about the failure table
**GPT-4.1 Mini unique findings:**
- None that weren't also covered by the other two models
**Quality assessment:**
- **GPT-5** didn't just find more assumptions — it found *qualitatively
different kinds*. Many of its unique findings involve multi-component
interactions (mailbox + reconciliation + market open timing), semantic
mismatches (trade-date vs settled positions), and second-order effects
(metrics side effects during warmup, GTC orders filling while gated).
These require reasoning about system behavior across boundaries the
document doesn't explicitly draw.
- **GPT-4.1** was competent and structured, found the same core assumptions
as Mini, plus one good meta-observation about correlated failures. But
it stayed within the document's own framing — it found assumptions the
document *almost* states rather than ones the document can't see.
- **GPT-4.1 Mini** was formulaic. Every finding maps cleanly to a section
of the document. It's essentially "what could go wrong with each stated
mechanism" rather than "what does this design take for granted about
the world outside itself."
**Key insight — reasoning tokens change the KIND of analysis:**
GPT-5's 4,096 reasoning tokens aren't producing "more of the same" —
they're producing a different analytical mode. The non-reasoning models
(4.1 and Mini) identify risks within the document's own frame of reference.
GPT-5 reasons about the document's relationship to the external world:
broker semantics, deployment topology, OTP runtime behavior under load,
timing correlations across independent subsystems. This is the difference
between "what could this mechanism fail at" and "what must be true about
the world for this mechanism to work."
**Comparison to Finding #9 (gap-finding on failure-modes.md):**
Same pattern confirmed. GPT-5 consistently finds domain-specific,
interaction-level issues that require reasoning about component boundaries.
GPT-4.1 is thorough within-frame. Mini is formulaic. The gap between
GPT-5 and the others is larger here than in #9 — possibly because
"hidden assumptions" requires more abstraction than "missing failure
scenarios." Assumption-finding requires the model to reason about what
ISN'T stated, which benefits more from extended reasoning.
**Practical implication:** For architecture review, running GPT-5 on
"identify hidden assumptions" is higher-value than the same question to
non-reasoning models. The cost difference (4K extra reasoning tokens) is
trivial for a document that will drive months of implementation. Use
non-reasoning models for within-frame checks ("does this section have
gaps") and reasoning models for cross-boundary analysis ("what must be
true about the world for this to work").