model-research/findings/2026-05-02-11-hiddenassumption-identification-on-simpler-doc.md

# Finding 11: Hidden-assumption identification on simpler doc: reasoning models diverge in approach, both outperform non-reasoning

**Date:** 2026-05-02
**Task:** Identify hidden assumptions in gargoyle's `market-calendar.md` (238 lines),
a simpler, single-component document of similar length to the 234-line cold-start
doc from Finding #10.
**How we used them:** Same document (full text) + same focused analytical question
to all 3 models via HAI proxy. No tools, no project context beyond the document
itself. Single prompt, no conversation history. Temperature 0.3 for GPT-4.1;
GPT-5 and Opus ran at their default temperatures (required for those models).

| Model | Time | Output tokens | Reasoning tokens | Assumptions found |
|---|---|---|---|---|
| GPT-4.1 | 19s | 2,554 | 0 | 14 |
| Claude Opus 4.6 | 74s | 3,288 | (internal, not reported) | 13 |
| GPT-5 | 101s | 8,417 | 5,504 | 24 |
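
The raw columns are easier to compare once normalized per assumption found. The snippet below is a small sketch that only re-derives numbers from the table; the `runs` layout and the `per_assumption` helper are illustrative names, not part of the test harness.

```python
# Figures copied from the table above; nothing here is newly measured.
runs = {
    "GPT-4.1":         {"seconds": 19,  "output_tokens": 2554, "assumptions": 14},
    "Claude Opus 4.6": {"seconds": 74,  "output_tokens": 3288, "assumptions": 13},
    "GPT-5":           {"seconds": 101, "output_tokens": 8417, "assumptions": 24},
}

def per_assumption(run: dict) -> dict:
    """Return output tokens and wall-clock seconds spent per assumption found."""
    return {
        "tokens": run["output_tokens"] / run["assumptions"],
        "seconds": run["seconds"] / run["assumptions"],
    }

summary = {model: per_assumption(run) for model, run in runs.items()}

for model, stats in summary.items():
    print(f"{model}: {stats['tokens']:.0f} tokens, {stats['seconds']:.1f}s per assumption")
```

These are cost-style ratios only; they say nothing about the quality of each assumption.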

**What they found — common ground (all 3 identified):**
- Alpaca calendar API data correctness/completeness as single source of truth
- Alpaca API availability at startup (no local cache persistence)
- ETS table atomicity during refresh (partial-state exposure risk)
- System clock/timezone alignment (dates are timezone-naive)
- NYSE emergency/unscheduled closures not reflected until refresh
- Two-year cache range sufficiency
- API response format stability
- Rate limiting / API capacity concerns

**GPT-5 unique findings (not found by either other model):**
- Date struct term-ordering in ETS match specs may not match chronological
  order (ETS range guards rely on Erlang term comparison, not Date semantics)
- close_time/1 returns naive Time without timezone — DST conversion burden on
  consumers, one hour off twice per year
- trading_day?/1 conflates "not a trading day" with "calendar unavailable" —
  operational outages invisible to callers
- ETS table name collision risk (global namespace per node)
- No other process should modify the ETS table (access mode discipline)
- Network egress and credential availability on all nodes at all times
- ETS read/write concurrency flags for contention under load
- Direct ETS access by consumers bypassing the module's error handling
- next/prev_trading_day edge cases at cache boundaries
- Alpaca API start/end parameter inclusivity (off-by-one at year boundaries)
- Half-day vs full-day distinction insufficiency for special sessions
- Small table size makes O(n) selects acceptable (scaling concern)
- Year-end refresh failure leaving gaps at boundary
- Alpaca never omits a legitimate trading day (absence = non-trading conflation)
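
The term-ordering item at the top of this list can be illustrated with a small model. The sketch below is Python, not Elixir: it only mimics the Erlang rule for comparing maps of equal size (keys in ascending term order, then values in key order). The `date_struct` dict layout and the `erlang_map_sort_key` helper are illustrative assumptions, not code from `market-calendar.md`, and whether the real match specs guard on raw Date structs is not confirmed here.

```python
from datetime import date

def date_struct(d: date) -> dict:
    """Model an Elixir %Date{} as the map it is at runtime."""
    return {
        "__struct__": "Elixir.Date",
        "calendar": "Elixir.Calendar.ISO",
        "year": d.year,
        "month": d.month,
        "day": d.day,
    }

def erlang_map_sort_key(m: dict):
    """Approximate Erlang term order for maps: size, sorted keys, then values in key order."""
    keys = sorted(m)
    return (len(m), keys, [m[k] for k in keys])

def chrono_key(m: dict):
    """A sortable key whose tuple order matches calendar order."""
    return (m["year"], m["month"], m["day"])

jan_31 = date_struct(date(2024, 1, 31))
feb_01 = date_struct(date(2024, 2, 1))

# Sorted keys are __struct__, calendar, day, month, year, so "day" is
# compared before "month" and "year": Jan 31 sorts AFTER Feb 1.
structural_jan_after_feb = erlang_map_sort_key(jan_31) > erlang_map_sort_key(feb_01)

# A (year, month, day) tuple restores chronological order.
chronological_jan_before_feb = chrono_key(jan_31) < chrono_key(feb_01)
```

In Elixir itself, the usual remedies are `Date.compare/2` for in-memory comparisons and a tuple or integer key for anything compared by ETS guards.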

**Claude Opus unique findings (not found by either other model):**
- ETS ownership semantics: heir-protection would change fail-closed behavior;
  current design means ALL consumers fail simultaneously during crash-to-restart
  window (framed as a design tension, not just a risk)
- Silent data corruption from partial API response (pagination/truncation) —
  specifically that missing rows are SILENT failures with no error propagation
  (other models mentioned API completeness but not the silence aspect)
- Consumers calling functions with Dates, not DateTimes — the API accepts Date.t()
  but doesn't specify HOW consumers should derive "today" (system-wide
  coordination problem made invisible by the API contract)
- `trading_day?/1` returning false is NOT fail-closed for ALL consumers — only
  for PDT-like "block action" consumers; for batch-trigger consumers it's
  fail-OPEN (subtle inversion of safety semantics)
- Startup ordering: background_children placement means PDT could receive orders
  before MarketCalendar finishes init, creating recurring rejection windows
  during hot deploys
- Continuous-running assumption for refresh timer (daily restarts would mean
  refresh mechanism never fires — no staleness alert exists)

**GPT-4.1 unique findings (not found by either other model):**
- No need for real-time calendar change notification (event emission gap)
- All consumers using the same module instance (configuration consistency)
- No need for historical calendar data (audit/backtesting limitation)
- Consumers correctly handling {:error, :calendar_unavailable} in practice

**Quality assessment:**
- **GPT-5** found the most assumptions (24) with the most technical specificity.
  Many are implementation-level insights (ETS term ordering, named table
  collisions, read_concurrency flags) that demonstrate deep Erlang/OTP
  knowledge. Some are slightly obvious or overlapping. The ETS term-ordering
  finding is genuinely insightful, and the concern is real: a Date struct is a
  map, and Erlang term order compares map values in ascending key order
  (`calendar`, `day`, `month`, `year`), so `day` is compared before `month`
  and `year`. Range guards on raw Date structs therefore do not follow
  chronological order. GPT-5 also provided concrete recommendations.
- **Claude Opus** found fewer assumptions (13) but several were qualitatively
  different — they identified *design tensions* and *semantic inversions*
  rather than just failure scenarios. The fail-open/fail-closed inversion
  (finding #12), the ETS ownership tension, and the "API makes timezone
  coordination invisible" findings show reasoning about the design's
  *relationship to its consumers* rather than just its internal mechanics.
  Tighter, more curated output with less filler.
- **GPT-4.1** was competent and well-structured (14 assumptions, clean table)
  but stayed within the document's own framing. Its unique findings are
  relatively generic ("consumers should handle errors correctly," "no
  historical data"). Solid baseline, no surprises.

**Key insight — two reasoning models, different analytical styles:**
GPT-5 and Opus are both reasoning models, but they reason about different
things. GPT-5 reasons DEEPER into implementation mechanics (how does ETS
actually work? what are the exact failure modes of each component?). Opus
reasons WIDER about system context (how does this component's API contract
affect the safety properties of the overall system? what tensions does this
design create that aren't visible to the author?).

GPT-5's approach: "Here are 24 things that could go wrong, many highly
technical." Opus's approach: "Here are 13 assumptions, several of which
reveal design tensions the document can't see about itself."

**Does the reasoning gap narrow with simpler docs?**
Comparing to Finding #10 (cold-start doc, 234 lines, 26 vs 14 vs 12 assumptions
for GPT-5/GPT-4.1/Mini):
- GPT-5 still dominates in raw count (24 vs 14 for GPT-4.1)
- The gap ratio is similar (~1.7x here vs ~1.9x in Finding #10)
- Document complexity doesn't appear to be the driver of the gap: across
  these two documents, reasoning tokens enabled more exhaustive exploration
  regardless of input complexity
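
The gap ratios quoted above come straight from the assumption counts. A quick check, using only the counts reported in this finding and in Finding #10 (`gap_ratio` is an illustrative helper name):

```python
def gap_ratio(reasoning_count: int, baseline_count: int) -> float:
    """Ratio of assumptions found by the reasoning model to the non-reasoning baseline."""
    return reasoning_count / baseline_count

finding_11 = gap_ratio(24, 14)  # GPT-5 vs GPT-4.1 on market-calendar.md
finding_10 = gap_ratio(26, 14)  # GPT-5 vs GPT-4.1 on the cold-start doc

print(f"{finding_11:.2f}x vs {finding_10:.2f}x")  # prints "1.71x vs 1.86x"
```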

**Claude Opus vs GPT-5 (the headline comparison):**
They're not competing on the same axis. GPT-5 is better for "find all
possible issues" (breadth + technical depth). Opus is better for "find
the assumptions that will actually surprise the author" (insight density).
If you want a security-audit-style exhaustive list: GPT-5. If you want a
design-review-style "here's what you're not seeing about your own design":
Opus. Both are better than GPT-4.1 for this task, but in different ways.

**Practical implication:** Run BOTH reasoning models on architecture docs.
GPT-5 catches implementation-level hazards the team might miss during
coding. Opus catches design-level tensions the team might miss during
planning. GPT-4.1 is sufficient as a quick sanity check but won't
surprise you.