Finding 12: Sonnet 4.6 outperforms expectations on assumption-finding; competes with reasoning models on complex docs

Date: 2026-05-02

Task: Identify hidden assumptions in gargoyle's order-execution.md (785 lines), a complex, multi-component document covering OrderManager, BrokerAdapter, TradeStream, and PositionReconciler.

How we used them: Same document (full text, no truncation) + same focused analytical question to all 3 models. GPT-5 via HAI OpenAI endpoint; Opus 4.6 and Sonnet 4.6 via HAI Anthropic endpoint. No tools, no project context beyond the document itself. Single prompt, no conversation history.
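The setup described above (one identical single-turn prompt sent to every model) can be sketched roughly as follows. This is a minimal illustration, not the harness actually used: `build_prompt`, `run_review`, and the stub model functions are hypothetical names, and real endpoint clients would replace the stubs.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

# Hypothetical stand-in for a real endpoint client: takes a prompt, returns text.
ModelFn = Callable[[str], str]


def build_prompt(document: str, question: str) -> str:
    """Combine the focused question and the full, untruncated document."""
    return f"{question}\n\n---\n\n{document}"


def run_review(models: Dict[str, ModelFn], document: str, question: str) -> Dict[str, str]:
    """Send the same single prompt to every model, with no tools or history."""
    prompt = build_prompt(document, question)
    with ThreadPoolExecutor(max_workers=max(1, len(models))) as pool:
        futures = {name: pool.submit(fn, prompt) for name, fn in models.items()}
        return {name: future.result() for name, future in futures.items()}


if __name__ == "__main__":
    def stub(label: str) -> ModelFn:
        # Assumption: a fake model that only reports the prompt size it received.
        return lambda prompt: f"[{label}] received {len(prompt)} characters"

    results = run_review(
        {"gpt-5": stub("gpt-5"), "sonnet-4.6": stub("sonnet-4.6"), "opus-4.6": stub("opus-4.6")},
        document="(full text of order-execution.md)",
        question="Identify the hidden assumptions in this document.",
    )
    for name, output in results.items():
        print(name, output)
```

Building the prompt once and passing the same string to each callable is what keeps the comparison fair: any difference in output comes from the model, not from the input.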

| Model | Time | Output tokens | Reasoning tokens | Assumptions found |
|---|---|---|---|---|
| GPT-5 | 93s | 8,485 | 6,016 | 20 |
| Claude Sonnet 4.6 | 106s | 4,637 | (internal) | 17 |
| Claude Opus 4.6 | 105s | 4,615 | (internal) | 12 |

What they found — common ground (all 3 identified):

  • Synchronous broker REST calls blocking OrderManager GenServer (mailbox growth)
  • TradeStream event ordering assumptions (out-of-order fills/status)
  • Fill deduplication gap (no explicit fill-level idempotency)
  • cancel_all/1 with timeout: :infinity blocking GenServer during FLATTEN
  • Recovery/restart races with TradeStream fill delivery (fills queued during handle_continue/2)
  • Lot operation idempotency under crash recovery (partial execution)
  • Replace race: fills for new broker_order_id arriving before replaced event
  • Database write latency impact on GenServer throughput under burst fills
  • ETS table scope assumptions (single-node, access mode)

GPT-5 unique findings (not in either Claude model):

  • Rate-limit retry blocking OrderManager inline (no async retry path specified)
  • Single TradeStream connection per user not enforced (duplicate detection gap)
  • Kill switch FLATTEN vs degraded state interaction (OM drops cancels while degraded, but FLATTEN calls cancel_all through OM)
  • ClOrdID uniqueness scope/retention at broker across sessions and days
  • after: datetime filter semantics (clock skew, timezone, inclusive/exclusive)
  • Reconciliation responses may exceed single-response size (no pagination)
  • Event broadcasting blocking model (synchronous vs fire-and-forget)
  • Credential rotation during TradeStream connection lifetime
  • market_closed semantics varying across brokers (reject vs queue)
  • Dropped Alpaca statuses (stopped/suspended/calculated) may affect accounting

Claude Sonnet 4.6 unique findings (not in either other model):

  • Single fill per fill event assumption (broker batching multiple fills into one WebSocket message)
  • Lot operations (Lots.open/2, Lots.close/4) assumed to never fail — no {:error, _} handling shown, crash propagation risk
  • Task.async_stream inside GenServer creating linked tasks whose crash signals propagate to OrderManager during critical cancel_all
  • Broker cancel semantics during in-flight replace at the broker level (cancel targets old broker_order_id which broker already replaced away)
  • Database operations in fill processing assumed transactional (no explicit Ecto.Multi/transaction mention)
  • Broker position reflects only Gargoyle's activity (external trades cause false-positive reconciliation halts)

Claude Opus 4.6 unique findings (not in either other model):

  • {:ok, broker_order_id} from REST place conflated with durable OMS acceptance vs mere HTTP acknowledgment (no timeout on submitted state)
  • Concurrent apply_corrections/2 from periodic reconciler running in separate process conflicts with OrderManager's single-writer invariant (corrections write to same tables outside GenServer serialization)
  • Reconciliation gate initialized state after :rest_for_one restart — ETS table EXISTS but freshly initialized vs table MISSING are different conditions with different safety properties
  • Escalation state reset after crash creating double-exposure window (systematic issue persists but escalation timer resets to zero)
  • replace/3 error semantics: non-atomic replace (cancel + re-submit) where cancel succeeds but re-submit fails leaves original order cancelled at broker while OrderManager reverts to "working" locally
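As a quick cross-check, the bullet counts in the lists above can be compared with the per-model totals in the results table. The sketch below assumes each bullet corresponds to one counted assumption, which the source does not state. A positive remainder is expected, since findings shared by exactly two models appear in none of the lists; a negative remainder would suggest that the bullets and the counted assumptions do not map one-to-one (for example, one counted assumption summarized as several bullets).

```python
# Totals from the results table; bullet counts from the lists above.
TOTALS = {"GPT-5": 20, "Claude Sonnet 4.6": 17, "Claude Opus 4.6": 12}
COMMON_BULLETS = 9  # "common ground (all 3 identified)"
UNIQUE_BULLETS = {"GPT-5": 10, "Claude Sonnet 4.6": 6, "Claude Opus 4.6": 5}


def remainder(model: str) -> int:
    """Counted assumptions not covered by the common or unique lists."""
    return TOTALS[model] - COMMON_BULLETS - UNIQUE_BULLETS[model]


if __name__ == "__main__":
    for model in TOTALS:
        print(f"{model}: total={TOTALS[model]}, remainder={remainder(model)}")
```

Under that one-bullet-per-assumption assumption, the remainders are 1 for GPT-5, 2 for Sonnet 4.6, and -2 for Opus 4.6, so the Opus lists are probably summarized at a finer grain than its count of 12.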

Quality assessment:

  • GPT-5 maintained its pattern from previous findings: broadest coverage (20 assumptions), most technically specific about implementation details. Found cross-cutting operational concerns (clock skew, credential rotation, pagination) that the Claude models didn't surface. However, several of its findings were medium-severity operational concerns rather than architectural assumptions.
  • Claude Sonnet 4.6 was the surprise performer. It found 17 assumptions, 85% of GPT-5's count, and several of its unique findings were genuinely insightful. The cancel_all race with broker-side replace state (#16 in Sonnet's output) and the lot operation failure propagation (#6 in Sonnet's output) show deep reasoning about component interaction despite Sonnet not being positioned as a "reasoning" model. More importantly, Sonnet's findings were consistently well-structured, each with a clear "how it could break" scenario.
  • Claude Opus 4.6 found the fewest assumptions (12) but — consistent with Finding #11 — its unique findings were qualitatively different. The concurrent apply_corrections write conflict, the gate initialization state distinction, and the non-atomic replace error semantics all reveal design tensions that neither GPT-5 nor Sonnet identified. Opus continues to reason about the boundaries between components rather than within-component mechanics.

Key insight: Sonnet 4.6 is NOT just a faster GPT-4.1. In previous findings (#9, #10, #11), non-reasoning models (GPT-4.1, GPT-4.1 Mini) performed significantly below reasoning models on assumption-finding. GPT-4.1 found ~14 assumptions where GPT-5 found 24-26, or roughly 54-58% of GPT-5's count. Here, Sonnet 4.6 finds 17 where GPT-5 finds 20, which is 85% and a much smaller gap.
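The gap comparison is simple arithmetic on the counts quoted above. A minimal sketch (the function name `coverage_ratio` is illustrative only):

```python
def coverage_ratio(found: int, baseline: int) -> float:
    """Fraction of the baseline model's assumption count that a model reached."""
    return found / baseline


# This experiment: Sonnet 4.6 (17) against GPT-5 (20).
sonnet_vs_gpt5 = coverage_ratio(17, 20)

# Previous experiments: GPT-4.1 (~14) against GPT-5 (24 to 26).
gpt41_vs_gpt5_high = coverage_ratio(14, 24)
gpt41_vs_gpt5_low = coverage_ratio(14, 26)

if __name__ == "__main__":
    print(f"Sonnet 4.6 vs GPT-5: {sonnet_vs_gpt5:.0%}")
    print(f"GPT-4.1 vs GPT-5:    {gpt41_vs_gpt5_low:.0%} to {gpt41_vs_gpt5_high:.0%}")
```

Note that the two ratios come from different experiments on different documents, so the comparison is indicative rather than controlled.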

Sonnet's findings also included several that showed genuine reasoning about component interactions (not just within-frame risks). This suggests Sonnet 4.6 is qualitatively different from GPT-4.1 for analytical work — it occupies a middle ground between GPT-4.1's "competent but surface-level" and GPT-5's "exhaustive and deep." The severity distribution was also similar to GPT-5 (multiple critical/high findings), whereas GPT-4.1 in previous experiments tended toward medium-severity generic concerns.

Updated model hierarchy for assumption-finding:

  1. GPT-5 — broadest coverage, most operational-level findings (20)
  2. Sonnet 4.6 — strong analytical depth, good component interaction reasoning (17)
  3. Opus 4.6 — fewest but most architecturally insightful, finds design tensions (12)
  4. GPT-4.1 — competent within-frame, generic (~14 from previous experiments)
  5. GPT-4.1 Mini — formulaic, surface-level (~10-12)

Practical implication: For architecture review, Sonnet 4.6 is now a strong candidate for high-volume analytical work. Its latency is close to GPT-5's (106s vs 93s here), so the two can run side by side, and it catches different things (lot operation failures, broker-side replace races). The ideal three-model review stack for architecture docs appears to be:

  • GPT-5 for breadth + operational concerns
  • Sonnet 4.6 for component interaction analysis
  • Opus 4.6 for design-tension identification

Each consistently finds things the others miss. The cost-efficiency argument for Sonnet is strong: ~85% of GPT-5's count with more findings per output token, at roughly 273 output tokens per assumption versus roughly 424 for GPT-5 (4,637 vs 8,485 tokens for 17 vs 20 assumptions).
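The per-token comparison can be reproduced from the results table. This sketch uses only the reported output-token and assumption counts; it does not measure how actionable a finding is, and the source does not say whether GPT-5's output-token figure includes its 6,016 reasoning tokens.

```python
# (output tokens, assumptions found) as reported in the results table.
RESULTS = {
    "GPT-5": (8485, 20),
    "Claude Sonnet 4.6": (4637, 17),
    "Claude Opus 4.6": (4615, 12),
}


def tokens_per_assumption(model: str) -> float:
    """Output tokens spent per assumption found (lower is cheaper)."""
    tokens, assumptions = RESULTS[model]
    return tokens / assumptions


if __name__ == "__main__":
    for model in sorted(RESULTS, key=tokens_per_assumption):
        print(f"{model}: {tokens_per_assumption(model):.0f} output tokens per assumption")
```

By this measure Sonnet 4.6 is the cheapest per finding, Opus 4.6 sits in the middle, and GPT-5 is the most expensive.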