# Finding 12: Sonnet 4.6 outperforms expectations on assumption-finding; competes with reasoning models on complex docs

**Date:** 2026-05-02

**Task:** Identify hidden assumptions in gargoyle's `order-execution.md` (785 lines), a complex, multi-component document covering OrderManager, BrokerAdapter, TradeStream, and PositionReconciler.

**How we used them:** Same document (full text, no truncation) + same focused analytical question to all 3 models. GPT-5 via HAI OpenAI endpoint; Opus 4.6 and Sonnet 4.6 via HAI Anthropic endpoint. No tools, no project context beyond the document itself. Single prompt, no conversation history.

| Model | Time | Output tokens | Reasoning tokens | Assumptions found |
|---|---|---|---|---|
| GPT-5 | 93s | 8,485 | 6,016 | 20 |
| Claude Sonnet 4.6 | 106s | 4,637 | (internal) | 17 |
| Claude Opus 4.6 | 105s | 4,615 | (internal) | 12 |

**What they found (common ground, all 3 identified):**

- Synchronous broker REST calls blocking OrderManager GenServer (mailbox growth)
- TradeStream event ordering assumptions (out-of-order fills/status)
- Fill deduplication gap (no explicit fill-level idempotency)
- `cancel_all/1` with `timeout: :infinity` blocking GenServer during FLATTEN
- Recovery/restart races with TradeStream fill delivery (fills queued during `handle_continue/2`)
- Lot operation idempotency under crash recovery (partial execution)
- Replace race: fills for new broker_order_id arriving before `replaced` event
- Database write latency impact on GenServer throughput under burst fills
- ETS table scope assumptions (single-node, access mode)

**GPT-5 unique findings (not in either Claude model):**

- Rate-limit retry blocking OrderManager inline (no async retry path specified)
- Single TradeStream connection per user not enforced (duplicate detection gap)
- Kill switch FLATTEN vs degraded state interaction (OM drops cancels while degraded, but FLATTEN calls cancel_all through OM)
- ClOrdID uniqueness scope/retention at broker across
  sessions and days
- `after: datetime` filter semantics (clock skew, timezone, inclusive/exclusive)
- Reconciliation responses may exceed single-response size (no pagination)
- Event broadcasting blocking model (synchronous vs fire-and-forget)
- Credential rotation during TradeStream connection lifetime
- `market_closed` semantics varying across brokers (reject vs queue)
- Dropped Alpaca statuses (stopped/suspended/calculated) may affect accounting

**Claude Sonnet 4.6 unique findings (not in either other model):**

- Single fill per fill event assumption (broker batching multiple fills into one WebSocket message)
- Lot operations (`Lots.open/2`, `Lots.close/4`) assumed to never fail: no `{:error, _}` handling shown, crash propagation risk
- `Task.async_stream` inside GenServer creating linked tasks whose crash signals propagate to OrderManager during critical cancel_all
- Broker cancel semantics during in-flight replace at the broker level (cancel targets old broker_order_id which broker already replaced away)
- Database operations in fill processing assumed transactional (no explicit Ecto.Multi/transaction mention)
- Broker position reflects only Gargoyle's activity (external trades cause false-positive reconciliation halts)

**Claude Opus 4.6 unique findings (not in either other model):**

- `{:ok, broker_order_id}` from REST place conflated with durable OMS acceptance vs mere HTTP acknowledgment (no timeout on `submitted` state)
- Concurrent `apply_corrections/2` from periodic reconciler running in separate process conflicts with OrderManager's single-writer invariant (corrections write to same tables outside GenServer serialization)
- Reconciliation gate initialized state after `:rest_for_one` restart: ETS table EXISTS but freshly initialized vs table MISSING are different conditions with different safety properties
- Escalation state reset after crash creating double-exposure window (systematic issue persists but escalation timer resets to zero)
- `replace/3`
  error semantics: non-atomic replace (cancel + re-submit) where cancel succeeds but re-submit fails leaves original order cancelled at broker while OrderManager reverts to "working" locally

**Quality assessment:**

- **GPT-5** maintained its pattern from previous findings: broadest coverage (20 assumptions), most technically specific about implementation details. Found cross-cutting operational concerns (clock skew, credential rotation, pagination) that the Claude models didn't surface. However, several of its findings were medium-severity operational concerns rather than architectural assumptions.
- **Claude Sonnet 4.6** was the surprise performer. Found 17 assumptions, close to GPT-5's count (85%), and several of its unique findings were genuinely insightful. The `cancel_all` race with broker-side replace state (finding #16) and the lot operation failure propagation (finding #6) show deep reasoning about component interaction despite Sonnet not being positioned as a "reasoning" model. More importantly, Sonnet's findings were consistently well-structured with clear "how it could break" scenarios.
- **Claude Opus 4.6** found the fewest assumptions (12) but, consistent with Finding #11, its unique findings were qualitatively different. The concurrent `apply_corrections` write conflict, the gate initialization state distinction, and the non-atomic replace error semantics all reveal design tensions that neither GPT-5 nor Sonnet identified. Opus continues to reason about the *boundaries between components* rather than within-component mechanics.

**Key insight (Sonnet 4.6 is NOT just a faster GPT-4.1):**

In previous findings (#9, #10, #11), non-reasoning models (GPT-4.1, GPT-4.1 Mini) performed significantly below reasoning models on assumption-finding. GPT-4.1 found ~14 assumptions where GPT-5 found 24-26. Here, Sonnet 4.6 finds 17 where GPT-5 finds 20, a much smaller gap (~85% of GPT-5's count vs ~54-58% previously).
Sonnet's findings also included several that showed genuine reasoning about component interactions (not just within-frame risks). This suggests Sonnet 4.6 is qualitatively different from GPT-4.1 for analytical work: it occupies a middle ground between GPT-4.1's "competent but surface-level" and GPT-5's "exhaustive and deep." The severity distribution was also similar to GPT-5 (multiple critical/high findings), whereas GPT-4.1 in previous experiments tended toward medium-severity generic concerns.

**Updated model hierarchy for assumption-finding:**

1. GPT-5: broadest coverage, most operational-level findings (20)
2. Sonnet 4.6: strong analytical depth, good component interaction reasoning (17)
3. Opus 4.6: fewest but most architecturally insightful, finds design tensions (12)
4. GPT-4.1: competent within-frame, generic (~14 from previous experiments)
5. GPT-4.1 Mini: formulaic, surface-level (~10-12)

**Practical implication:** For architecture review, Sonnet 4.6 is now a strong candidate for volume analytical work. It's fast enough to run alongside GPT-5 and catches different things (lot operation failures, broker-side replace races). The ideal three-model review stack for architecture docs appears to be:

- GPT-5 for breadth + operational concerns
- Sonnet 4.6 for component interaction analysis
- Opus 4.6 for design-tension identification

Each consistently finds things the others miss. The cost-efficiency argument for Sonnet is strong: ~85% of GPT-5's count with more actionable findings per token generated, roughly 273 output tokens per assumption versus 424 (4,637 vs 8,485 tokens for 17 vs 20 assumptions).
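
The coverage and cost-efficiency ratios quoted above can be re-derived from the results table. The following is a minimal sketch, not part of the original experiment tooling: the `RESULTS` dictionary and both helper names are hypothetical, and the figures are copied from the table. It treats the "Output tokens" column as reported; this note does not say whether GPT-5's figure includes its reasoning tokens, so the per-assumption number for GPT-5 should be read with that caveat.

```python
# Figures copied from the results table in this finding.
# RESULTS, coverage_vs and tokens_per_assumption are illustrative names only.
RESULTS = {
    "GPT-5": {"output_tokens": 8485, "assumptions": 20},
    "Claude Sonnet 4.6": {"output_tokens": 4637, "assumptions": 17},
    "Claude Opus 4.6": {"output_tokens": 4615, "assumptions": 12},
}


def coverage_vs(baseline: str, model: str) -> float:
    """Assumption count of `model` as a fraction of `baseline`'s count."""
    return RESULTS[model]["assumptions"] / RESULTS[baseline]["assumptions"]


def tokens_per_assumption(model: str) -> float:
    """Output tokens spent per assumption found (lower means denser output)."""
    row = RESULTS[model]
    return row["output_tokens"] / row["assumptions"]


if __name__ == "__main__":
    for name in RESULTS:
        print(
            f"{name}: {coverage_vs('GPT-5', name):.0%} of GPT-5's count, "
            f"{tokens_per_assumption(name):.0f} output tokens per assumption"
        )
```

Raw assumption counts say nothing about severity or overlap between models, so these ratios only support the volume and density comparison, not the qualitative ranking.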