Rodin 6af8a6ee10 refactor(findings): split ALL-FINDINGS.md into per-experiment files

Break the monolithic 3249-line findings file into 29 individual files,
one per experiment. Each file is named YYYY-MM-DD-NN-slug.md for easy
chronological sorting and discovery.

No content changes — purely structural reorganization.

2026-05-06 07:15:50 -07:00

Finding 11: Hidden-assumption identification on simpler doc: reasoning models diverge in approach, both outperform non-reasoning

Date: 2026-05-02

Task: Identify hidden assumptions in gargoyle's market-calendar.md (238 lines), a simpler, single-component document compared with the 234-line cold-start doc from Finding #10.

How we used them: The same document (full text) and the same focused analytical question went to all 3 models via HAI proxy. No tools, no project context beyond the document itself. Single prompt, no conversation history. Temperature 0.3 for GPT-4.1; GPT-5 and Opus use their defaults (required).

Model	Time	Output tokens	Reasoning tokens	Assumptions found
GPT-4.1	19s	2,554	0	14
Claude Opus 4.6	74s	3,288	(internal, not reported)	13
GPT-5	101s	8,417	5,504	24

What they found — common ground (all 3 identified):

Alpaca calendar API data correctness/completeness as single source of truth
Alpaca API availability at startup (no local cache persistence)
ETS table atomicity during refresh (partial-state exposure risk)
System clock/timezone alignment (dates are timezone-naive)
NYSE emergency/unscheduled closures not reflected until refresh
Two-year cache range sufficiency
API response format stability
Rate limiting / API capacity concerns

GPT-5 unique findings (not in either other model):

Date struct term-ordering in ETS match specs may not match chronological order (ETS range guards rely on Erlang term comparison, not Date semantics)
close_time/1 returns naive Time without timezone — DST conversion burden on consumers, one hour off twice per year
trading_day?/1 conflates "not a trading day" with "calendar unavailable" — operational outages invisible to callers
ETS table name collision risk (global namespace per node)
No other process should modify the ETS table (access mode discipline)
Network egress and credential availability on all nodes at all times
ETS read/write concurrency flags for contention under load
Direct ETS access by consumers bypassing the module's error handling
next/prev_trading_day edge cases at cache boundaries
Alpaca API start/end parameter inclusivity (off-by-one at year boundaries)
Half-day vs full-day distinction insufficiency for special sessions
Small table size makes O(n) selects acceptable (scaling concern)
Year-end refresh failure leaving gaps at boundary
Alpaca never omits a legitimate trading day (absence = non-trading conflation)
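
The Date term-ordering item at the top of this list can be checked with a small model. The sketch below is Python, not Elixir, and only imitates the documented Erlang rule that maps with identical keys compare value by value in ascending key order. The helper names erlang_map_order_key and structural_less_than are hypothetical, and the field layout assumes the standard Date struct keys (__struct__, calendar, day, month, year). How gargoyle actually keys its ETS table is not stated here, so this is a check on the mechanism, not on the code.

```python
from datetime import date


def erlang_map_order_key(d: date) -> tuple:
    """Model the comparison key Erlang term order would use for a Date struct.

    Assumption: the struct is a map with keys __struct__, calendar, day,
    month, year, and maps with identical keys compare value by value in
    ascending key order.
    """
    fields = {
        "__struct__": "Date",
        "calendar": "Calendar.ISO",
        "day": d.day,
        "month": d.month,
        "year": d.year,
    }
    # Sorted keys are __struct__, calendar, day, month, year,
    # so the day field is compared before month and year.
    return tuple(fields[key] for key in sorted(fields))


def structural_less_than(a: date, b: date) -> bool:
    """What a raw term-order comparison (a < b) would conclude in this model."""
    return erlang_map_order_key(a) < erlang_map_order_key(b)


jan_31 = date(2024, 1, 31)
feb_01 = date(2024, 2, 1)

chronological = jan_31 < feb_01                    # real calendar order
structural = structural_less_than(jan_31, feb_01)  # modeled term order
print(f"chronological: {chronological}, structural: {structural}")
```

In this model the two orders disagree whenever the day field decides the comparison, so a range guard over Date-struct keys would only be safe if the keys are stored in a form that sorts chronologically, such as {year, month, day} tuples.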

Claude Opus unique findings (not in either other model):

ETS ownership semantics: heir-protection would change fail-closed behavior; current design means ALL consumers fail simultaneously during crash-to-restart window (framed as a design tension, not just a risk)
Silent data corruption from partial API response (pagination/truncation) — specifically that missing rows are SILENT failures with no error propagation (other models mentioned API completeness but not the silence aspect)
Consumers calling functions with Dates, not DateTimes — the API accepts Date.t() but doesn't specify HOW consumers should derive "today" (system-wide coordination problem made invisible by the API contract)
trading_day?/1 returning false is NOT fail-closed for ALL consumers — only for PDT-like "block action" consumers; for batch-trigger consumers it's fail-OPEN (subtle inversion of safety semantics)
Startup ordering: background_children placement means PDT could receive orders before MarketCalendar finishes init, creating recurring rejection windows during hot deploys
Continuous-running assumption for refresh timer (daily restarts would mean refresh mechanism never fires — no staleness alert exists)
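
The fail-open/fail-closed inversion in this list is easier to see with two toy consumers. This is a minimal Python sketch, not the Elixir module: trading_day is a hypothetical stand-in for trading_day?/1 that returns False when the calendar is unavailable, and both consumer functions are illustrative assumptions about how a PDT-like gate and a batch trigger might use the result.

```python
from datetime import date
from typing import Optional, Set


def trading_day(day: date, calendar: Optional[Set[date]]) -> bool:
    """Hypothetical stand-in for trading_day?/1.

    Assumption: an unavailable calendar (None) is reported as False,
    the same answer as a genuine non-trading day.
    """
    if calendar is None:
        return False
    return day in calendar


def order_gate_allows(day: date, calendar: Optional[Set[date]]) -> bool:
    """PDT-like consumer: acts only when the day is a trading day."""
    return trading_day(day, calendar)


def off_day_batch_runs(day: date, calendar: Optional[Set[date]]) -> bool:
    """Batch-trigger consumer (assumed): acts when the day is NOT a trading day."""
    return not trading_day(day, calendar)


session_day = date(2026, 5, 4)
healthy_calendar = {session_day}
outage = None  # calendar unavailable

print("gate during outage:", order_gate_allows(session_day, outage))
print("batch, healthy calendar:", off_day_batch_runs(session_day, healthy_calendar))
print("batch during outage:", off_day_batch_runs(session_day, outage))
```

During the outage the gate blocks (fail-closed), while the batch consumer runs on a day that is really a session day (fail-open), even though both read the same False.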

GPT-4.1 unique findings (not in either other model):

No need for real-time calendar change notification (event emission gap)
All consumers using the same module instance (configuration consistency)
No need for historical calendar data (audit/backtesting limitation)
Consumers correctly handling {:error, :calendar_unavailable} in practice

Quality assessment:

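GPT-5 found the most assumptions (24) with the most technical specificity. Many are implementation-level insights (ETS term ordering, named table collisions, read_concurrency flags) that demonstrate deep Erlang/OTP knowledge. Some are slightly obvious or overlapping. The ETS term-ordering finding is genuinely insightful and deserves a check against the actual match specs: Elixir Date structs are maps, and Erlang term order compares map values in ascending key order (calendar, day, month, year), so structural comparison with < or > does not follow chronological order, which is why Elixir's documentation recommends Date.compare/2. Range guards are only safe if the table keys use a form that does sort chronologically, such as {year, month, day} tuples. Also provided concrete recommendations.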
Claude Opus found fewer assumptions (13) but several were qualitatively different — they identified design tensions and semantic inversions rather than just failure scenarios. The fail-open/fail-closed inversion (finding #12), the ETS ownership tension, and the "API makes timezone coordination invisible" findings show reasoning about the design's relationship to its consumers rather than just its internal mechanics. Tighter, more curated output with less filler.
GPT-4.1 was competent and well-structured (14 assumptions, clean table) but stayed within the document's own framing. Its unique findings are relatively generic ("consumers should handle errors correctly," "no historical data"). Solid baseline, no surprises.

Key insight — two reasoning models, different analytical styles: GPT-5 and Opus are both reasoning models, but they reason about different things. GPT-5 reasons DEEPER into implementation mechanics (how does ETS actually work? what are the exact failure modes of each component?). Opus reasons WIDER about system context (how does this component's API contract affect the safety properties of the overall system? what tensions does this design create that aren't visible to the author?).

GPT-5's approach: "Here are 24 things that could go wrong, many highly technical." Opus's approach: "Here are 13 assumptions, several of which reveal design tensions the document can't see about itself."

Does the reasoning gap narrow with simpler docs? Comparing to Finding #10 (cold-start doc, 234 lines, 26 vs 14 vs 12 assumptions for GPT-5/GPT-4.1/Mini):

GPT-5 still dominates in raw count (24 vs 14 for GPT-4.1)
The gap ratio is similar (~1.7x here vs ~1.9x in Finding #10)
Document complexity doesn't appear to be the driver of the gap — reasoning tokens enable more exhaustive exploration regardless of input complexity
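
The gap ratios quoted above follow directly from the assumption counts. A minimal Python sketch of the arithmetic, using only the GPT-5 and GPT-4.1 counts reported in Findings #10 and #11 (the dictionary layout and the gap_ratio name are illustrative):

```python
# Assumption counts as reported in the two findings.
counts = {
    "Finding 10": {"GPT-5": 26, "GPT-4.1": 14},
    "Finding 11": {"GPT-5": 24, "GPT-4.1": 14},
}


def gap_ratio(by_model: dict) -> float:
    """Ratio of GPT-5's assumption count to GPT-4.1's."""
    return by_model["GPT-5"] / by_model["GPT-4.1"]


ratios = {name: round(gap_ratio(by_model), 1) for name, by_model in counts.items()}
print(ratios)  # {'Finding 10': 1.9, 'Finding 11': 1.7}
```

With one document per finding, these two ratios are a comparison of single runs, not a measured trend.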

Claude Opus vs GPT-5 (the headline comparison): They're not competing on the same axis. GPT-5 is better for "find all possible issues" (exhaustive coverage + technical depth). Opus is better for "find the assumptions that will actually surprise the author" (insight density). If you want a security-audit-style exhaustive list: GPT-5. If you want a design-review-style "here's what you're not seeing about your own design": Opus. Both are better than GPT-4.1 for this task, but in different ways.

Practical implication: Run BOTH reasoning models on architecture docs. GPT-5 catches implementation-level hazards the team might miss during coding. Opus catches design-level tensions the team might miss during planning. GPT-4.1 is sufficient as a quick sanity check but won't surprise you.
