6af8a6ee10
Break the monolithic 3249-line findings file into 29 individual files, one per experiment. Each file is named YYYY-MM-DD-NN-slug.md for easy chronological sorting and discovery. No content changes — purely structural reorganization.
125 lines
7.2 KiB
Markdown
125 lines
7.2 KiB
Markdown
# Finding 11: Hidden-assumption identification on simpler doc: reasoning models diverge in approach, both outperform non-reasoning
|
|
|
|
**Date:** 2026-05-02
|
|
**Task:** Identify hidden assumptions in gargoyle's `market-calendar.md` (238 lines)
|
|
— a simpler, single-component document vs the 234-line cold-start doc from Finding #10.
|
|
**How we used them:** Same document (full text) + same focused analytical question
|
|
to all 3 models via HAI proxy. No tools, no project context beyond the document
|
|
itself. Single prompt, no conversation history. Temperature 0.3 for GPT-4.1;
|
|
GPT-5 and Opus use their defaults (required). Same prompt across all three.
|
|
|
|
| Model | Time | Output tokens | Reasoning tokens | Assumptions found |
|
|
|---|---|---|---|---|
|
|
| GPT-4.1 | 19s | 2,554 | 0 | 14 |
|
|
| Claude Opus 4.6 | 74s | 3,288 | (internal, not reported) | 13 |
|
|
| GPT-5 | 101s | 8,417 | 5,504 | 24 |
|
|
|
|
**What they found — common ground (all 3 identified):**
|
|
- Alpaca calendar API data correctness/completeness as single source of truth
|
|
- Alpaca API availability at startup (no local cache persistence)
|
|
- ETS table atomicity during refresh (partial-state exposure risk)
|
|
- System clock/timezone alignment (dates are timezone-naive)
|
|
- NYSE emergency/unscheduled closures not reflected until refresh
|
|
- Two-year cache range sufficiency
|
|
- API response format stability
|
|
- Rate limiting / API capacity concerns
|
|
|
|
**GPT-5 unique findings (not in either other model):**
|
|
- Date struct term-ordering in ETS match specs may not match chronological
|
|
order (ETS range guards rely on Erlang term comparison, not Date semantics)
|
|
- close_time/1 returns naive Time without timezone — DST conversion burden on
|
|
consumers, one hour off twice per year
|
|
- trading_day?/1 conflates "not a trading day" with "calendar unavailable" —
|
|
operational outages invisible to callers
|
|
- ETS table name collision risk (global namespace per node)
|
|
- No other process should modify the ETS table (access mode discipline)
|
|
- Network egress and credential availability on all nodes at all times
|
|
- ETS read/write concurrency flags for contention under load
|
|
- Direct ETS access by consumers bypassing the module's error handling
|
|
- next/prev_trading_day edge cases at cache boundaries
|
|
- Alpaca API start/end parameter inclusivity (off-by-one at year boundaries)
|
|
- Half-day vs full-day distinction insufficiency for special sessions
|
|
- Small table size makes O(n) selects acceptable (scaling concern)
|
|
- Year-end refresh failure leaving gaps at boundary
|
|
- Alpaca never omits a legitimate trading day (absence = non-trading conflation)
|
|
|
|
**Claude Opus unique findings (not in either other model):**
|
|
- ETS ownership semantics: heir-protection would change fail-closed behavior;
|
|
current design means ALL consumers fail simultaneously during crash-to-restart
|
|
window (framed as a design tension, not just a risk)
|
|
- Silent data corruption from partial API response (pagination/truncation) —
|
|
specifically that missing rows are SILENT failures with no error propagation
|
|
(other models mentioned API completeness but not the silence aspect)
|
|
- Consumers calling functions with Dates, not DateTimes — the API accepts Date.t()
|
|
but doesn't specify HOW consumers should derive "today" (system-wide
|
|
coordination problem made invisible by the API contract)
|
|
- `trading_day?/1` returning false is NOT fail-closed for ALL consumers — only
|
|
for PDT-like "block action" consumers; for batch-trigger consumers it's
|
|
fail-OPEN (subtle inversion of safety semantics)
|
|
- Startup ordering: background_children placement means PDT could receive orders
|
|
before MarketCalendar finishes init, creating recurring rejection windows
|
|
during hot deploys
|
|
- Continuous-running assumption for refresh timer (daily restarts would mean
|
|
refresh mechanism never fires — no staleness alert exists)
|
|
|
|
**GPT-4.1 unique findings (not in either other model):**
|
|
- No need for real-time calendar change notification (event emission gap)
|
|
- All consumers using the same module instance (configuration consistency)
|
|
- No need for historical calendar data (audit/backtesting limitation)
|
|
- Consumers correctly handling {:error, :calendar_unavailable} in practice
|
|
|
|
**Quality assessment:**
|
|
- **GPT-5** found the most assumptions (24) with the most technical specificity.
|
|
Many are implementation-level insights (ETS term ordering, named table
|
|
collisions, read_concurrency flags) that demonstrate deep Erlang/OTP
|
|
knowledge. Some are slightly obvious or overlapping. The ETS term-ordering
|
|
finding is genuinely insightful — Date structs DO compare correctly in Erlang
|
|
term order (year > month > day fields), but questioning it shows depth of
|
|
reasoning about underlying mechanisms. Also provided concrete recommendations.
|
|
- **Claude Opus** found fewer assumptions (13) but several were qualitatively
|
|
different — they identified *design tensions* and *semantic inversions*
|
|
rather than just failure scenarios. The fail-open/fail-closed inversion
|
|
(finding #12), the ETS ownership tension, and the "API makes timezone
|
|
coordination invisible" findings show reasoning about the design's
|
|
*relationship to its consumers* rather than just its internal mechanics.
|
|
Tighter, more curated output with less filler.
|
|
- **GPT-4.1** was competent and well-structured (14 assumptions, clean table)
|
|
but stayed within the document's own framing. Its unique findings are
|
|
relatively generic ("consumers should handle errors correctly," "no
|
|
historical data"). Solid baseline, no surprises.
|
|
|
|
**Key insight — two reasoning models, different analytical styles:**
|
|
GPT-5 and Opus are both reasoning models, but they reason about different
|
|
things. GPT-5 reasons DEEPER into implementation mechanics (how does ETS
|
|
actually work? what are the exact failure modes of each component?). Opus
|
|
reasons WIDER about system context (how does this component's API contract
|
|
affect the safety properties of the overall system? what tensions does this
|
|
design create that aren't visible to the author?).
|
|
|
|
GPT-5's approach: "Here are 24 things that could go wrong, many highly
|
|
technical." Opus's approach: "Here are 13 assumptions, several of which
|
|
reveal design tensions the document can't see about itself."
|
|
|
|
**Does the reasoning gap narrow with simpler docs?**
|
|
Comparing to Finding #10 (cold-start doc, 234 lines, 26 vs 14 vs 12 assumptions
|
|
for GPT-5/GPT-4.1/Mini):
|
|
- GPT-5 still dominates in raw count (24 vs 14 for GPT-4.1)
|
|
- The gap ratio is similar (~1.7x here vs ~1.9x in Finding #10)
|
|
- Document complexity doesn't appear to be the driver of the gap —
|
|
reasoning tokens enable more exhaustive exploration regardless of
|
|
input complexity
|
|
|
|
**Claude Opus vs GPT-5 (the headline comparison):**
|
|
They're not competing on the same axis. GPT-5 is better for "find all
|
|
possible issues" (breadth + technical depth). Opus is better for "find
|
|
the assumptions that will actually surprise the author" (insight density).
|
|
If you want a security-audit-style exhaustive list: GPT-5. If you want a
|
|
design-review-style "here's what you're not seeing about your own design":
|
|
Opus. Both are better than GPT-4.1 for this task, but in different ways.
|
|
|
|
**Practical implication:** Run BOTH reasoning models on architecture docs.
|
|
GPT-5 catches implementation-level hazards the team might miss during
|
|
coding. Opus catches design-level tensions the team might miss during
|
|
planning. GPT-4.1 is sufficient as a quick sanity check but won't
|
|
surprise you.
|