refactor(findings): split ALL-FINDINGS.md into per-experiment files
Break the monolithic 3249-line findings file into 29 individual files, one per experiment. Each file is named YYYY-MM-DD-NN-slug.md for easy chronological sorting and discovery. No content changes — purely structural reorganization.
@@ -0,0 +1,125 @@
# Finding 12: Sonnet 4.6 outperforms expectations on assumption-finding; competes with reasoning models on complex docs

**Date:** 2026-05-02

**Task:** Identify hidden assumptions in gargoyle's `order-execution.md` (785 lines),
a complex, multi-component document covering OrderManager, BrokerAdapter,
TradeStream, and PositionReconciler.

**How we used them:** Same document (full text, no truncation) + same focused
analytical question to all 3 models. GPT-5 via HAI OpenAI endpoint; Opus 4.6
and Sonnet 4.6 via HAI Anthropic endpoint. No tools, no project context beyond
the document itself. Single prompt, no conversation history.
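
The single-shot protocol described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `ModelCall`, `build_prompt`, `run_single_shot`, and `fake_model` are hypothetical names, and the stub callables stand in for whatever client code actually talked to the HAI endpoints, which this finding does not show.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

# Hypothetical type: a function that sends one prompt to one model and returns its text.
ModelCall = Callable[[str], str]


def build_prompt(document: str, question: str) -> str:
    """Combine the question and the full, untruncated document into one prompt."""
    return f"{question}\n\n---\n\n{document}"


def run_single_shot(models: Dict[str, ModelCall], document: str, question: str) -> Dict[str, str]:
    """Send the identical prompt to every model exactly once, with no history or tools."""
    prompt = build_prompt(document, question)
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        futures = {name: pool.submit(call, prompt) for name, call in models.items()}
        return {name: future.result() for name, future in futures.items()}


def fake_model(label: str) -> ModelCall:
    """Stub standing in for a real API client (assumption: illustration only)."""
    def call(prompt: str) -> str:
        return f"{label} saw {len(prompt)} chars"
    return call


if __name__ == "__main__":
    stubs = {"model-a": fake_model("model-a"), "model-b": fake_model("model-b")}
    print(run_single_shot(stubs, "document text", "Which assumptions are hidden?"))
```

Building the prompt once and passing the same string to every callable is what keeps the comparison fair: no model sees extra context or a different question.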

| Model | Time | Output tokens | Reasoning tokens | Assumptions found |
|---|---|---|---|---|
| GPT-5 | 93s | 8,485 | 6,016 | 20 |
| Claude Sonnet 4.6 | 106s | 4,637 | (internal) | 17 |
| Claude Opus 4.6 | 105s | 4,615 | (internal) | 12 |
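
The ratios quoted later in this finding can be recomputed directly from the table. The sketch below copies the table's figures; the function name `derived_metrics` and the dictionary layout are illustrative choices, and it treats the "Output tokens" column as-is, since the finding does not say whether GPT-5's figure includes its reasoning tokens.

```python
# Figures copied from the table above.
results = {
    "GPT-5": {"seconds": 93, "output_tokens": 8485, "assumptions": 20},
    "Claude Sonnet 4.6": {"seconds": 106, "output_tokens": 4637, "assumptions": 17},
    "Claude Opus 4.6": {"seconds": 105, "output_tokens": 4615, "assumptions": 12},
}


def derived_metrics(results, baseline="GPT-5"):
    """Return each model's share of the baseline count and its output tokens per assumption."""
    base_count = results[baseline]["assumptions"]
    metrics = {}
    for model, row in results.items():
        metrics[model] = {
            "share_of_baseline": row["assumptions"] / base_count,
            "tokens_per_assumption": row["output_tokens"] / row["assumptions"],
        }
    return metrics


metrics = derived_metrics(results)
for model, m in metrics.items():
    print(
        f"{model}: {m['share_of_baseline']:.0%} of GPT-5's count, "
        f"{m['tokens_per_assumption']:.0f} output tokens per assumption"
    )
```

On these numbers Sonnet 4.6 lands at 85% of GPT-5's count and about 273 output tokens per assumption, against about 424 for GPT-5 and about 385 for Opus 4.6.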

**What they found: common ground (all 3 identified):**
- Synchronous broker REST calls blocking OrderManager GenServer (mailbox growth)
- TradeStream event ordering assumptions (out-of-order fills/status)
- Fill deduplication gap (no explicit fill-level idempotency)
- `cancel_all/1` with `timeout: :infinity` blocking GenServer during FLATTEN
- Recovery/restart races with TradeStream fill delivery (fills queued during
  `handle_continue/2`)
- Lot operation idempotency under crash recovery (partial execution)
- Replace race: fills for new broker_order_id arriving before `replaced` event
- Database write latency impact on GenServer throughput under burst fills
- ETS table scope assumptions (single-node, access mode)

**GPT-5 unique findings (found by neither Claude model):**
- Rate-limit retry blocking OrderManager inline (no async retry path specified)
- Single TradeStream connection per user not enforced (duplicate detection gap)
- Kill switch FLATTEN vs degraded state interaction (OM drops cancels while
  degraded, but FLATTEN calls cancel_all through OM)
- ClOrdID uniqueness scope/retention at broker across sessions and days
- `after: datetime` filter semantics (clock skew, timezone, inclusive/exclusive)
- Reconciliation responses may exceed single-response size (no pagination)
- Event broadcasting blocking model (synchronous vs fire-and-forget)
- Credential rotation during TradeStream connection lifetime
- `market_closed` semantics varying across brokers (reject vs queue)
- Dropped Alpaca statuses (stopped/suspended/calculated) may affect accounting

**Claude Sonnet 4.6 unique findings (found by neither other model):**
- Single fill per fill event assumption (broker batching multiple fills into
  one WebSocket message)
- Lot operations (`Lots.open/2`, `Lots.close/4`) assumed to never fail:
  no `{:error, _}` handling shown, crash propagation risk
- `Task.async_stream` inside GenServer creating linked tasks whose crash
  signals propagate to OrderManager during critical cancel_all
- Broker cancel semantics during in-flight replace at the broker level
  (cancel targets old broker_order_id which broker already replaced away)
- Database operations in fill processing assumed transactional (no explicit
  Ecto.Multi/transaction mention)
- Broker position reflects only Gargoyle's activity (external trades cause
  false-positive reconciliation halts)

**Claude Opus 4.6 unique findings (found by neither other model):**
- `{:ok, broker_order_id}` from REST place treated as durable OMS
  acceptance rather than mere HTTP acknowledgment (no timeout on `submitted` state)
- Concurrent `apply_corrections/2` from periodic reconciler running in a
  separate process conflicts with OrderManager's single-writer invariant
  (corrections write to same tables outside GenServer serialization)
- Reconciliation gate initialized state after `:rest_for_one` restart:
  an ETS table that EXISTS but is freshly initialized and a table that is
  MISSING are different conditions with different safety properties
- Escalation state reset after crash creating double-exposure window
  (systematic issue persists but escalation timer resets to zero)
- `replace/3` error semantics: non-atomic replace (cancel + re-submit)
  where cancel succeeds but re-submit fails leaves original order cancelled
  at broker while OrderManager reverts to "working" locally

**Quality assessment:**
- **GPT-5** maintained its pattern from previous findings: broadest coverage
  (20 assumptions), most technically specific about implementation details.
  It found cross-cutting operational concerns (clock skew, credential rotation,
  pagination) that the Claude models didn't surface. However, several of its
  findings were medium-severity operational concerns rather than architectural
  assumptions.
- **Claude Sonnet 4.6** was the surprise performer. It found 17 assumptions,
  85% of GPT-5's count, and several of its unique findings were
  genuinely insightful. The `cancel_all` race with broker-side replace state
  (finding #16) and the lot operation failure propagation (finding #6) show
  deep reasoning about component interaction despite Sonnet not being
  positioned as a "reasoning" model. More importantly, Sonnet's findings were
  consistently well-structured with clear "how it could break" scenarios.
- **Claude Opus 4.6** found the fewest assumptions (12) but, consistent with
  Finding #11, its unique findings were qualitatively different. The
  concurrent `apply_corrections` write conflict, the gate initialization state
  distinction, and the non-atomic replace error semantics all reveal design
  tensions that neither GPT-5 nor Sonnet identified. Opus continues to reason
  about the *boundaries between components* rather than within-component
  mechanics.

**Key insight: Sonnet 4.6 is NOT just a faster GPT-4.1.**
In previous findings (#9, #10, #11), non-reasoning models (GPT-4.1, GPT-4.1
Mini) performed significantly below reasoning models on assumption-finding.
GPT-4.1 found ~14 assumptions where GPT-5 found 24-26. Here, Sonnet 4.6
finds 17 where GPT-5 finds 20, a much smaller gap: ~85% of GPT-5's count,
versus ~54-58% for GPT-4.1 previously.
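
The gap comparison is simple arithmetic on the counts quoted above. A minimal sketch, assuming the approximate figure of 14 for GPT-4.1 is taken at face value (the helper name `share` is illustrative):

```python
def share(count: int, baseline: int) -> float:
    """Fraction of the baseline model's assumption count that a model reached."""
    return count / baseline


# This experiment: Sonnet 4.6 vs GPT-5.
current = share(17, 20)

# Previous experiments: GPT-4.1 (~14) vs GPT-5 (24 to 26).
previous_high = share(14, 24)  # best case for GPT-4.1
previous_low = share(14, 26)   # worst case for GPT-4.1

print(f"Sonnet 4.6 now: {current:.0%}")
print(f"GPT-4.1 previously: {previous_low:.0%} to {previous_high:.0%}")
```

Because the earlier counts are approximate and come from different documents, the two percentages are indicative rather than a controlled comparison.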

Sonnet's findings also included several that showed genuine reasoning about
component interactions (not just within-frame risks). This suggests Sonnet 4.6
is qualitatively different from GPT-4.1 for analytical work: it occupies a
middle ground between GPT-4.1's "competent but surface-level" and GPT-5's
"exhaustive and deep." The severity distribution was also similar to GPT-5
(multiple critical/high findings), whereas GPT-4.1 in previous experiments
tended toward medium-severity generic concerns.

**Updated model hierarchy for assumption-finding:**
1. GPT-5: broadest coverage, most operational-level findings (20)
2. Sonnet 4.6: strong analytical depth, good component interaction reasoning (17)
3. Opus 4.6: fewest but most architecturally insightful, finds design tensions (12)
4. GPT-4.1: competent within-frame, generic (~14 from previous experiments)
5. GPT-4.1 Mini: formulaic, surface-level (~10-12)

**Practical implication:** For architecture review, Sonnet 4.6 is now a strong
candidate for volume analytical work. It's fast enough to run alongside GPT-5
and catches different things (lot operation failures, broker-side replace races).
The ideal three-model review stack for architecture docs appears to be:
- GPT-5 for breadth + operational concerns
- Sonnet 4.6 for component interaction analysis
- Opus 4.6 for design-tension identification

Each consistently finds things the others miss. The cost-efficiency argument
for Sonnet is strong: ~85% of GPT-5's count with fewer output tokens per
finding, roughly 273 versus 424 (4,637 vs 8,485 tokens for 17 vs 20 assumptions).
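
One way to combine the output of a three-model stack is to merge the findings and record which models reported each one. The sketch below is an assumption-laden illustration, not the procedure used in this experiment: `normalize`, `merge_findings`, and `unique_to` are hypothetical helpers, the sample titles are a small subset taken from the lists above, and exact-title matching is a stand-in for the manual judgment that real deduplication needs.

```python
from collections import defaultdict
from typing import Dict, List, Set


def normalize(title: str) -> str:
    """Crude matching key: lowercase and collapse whitespace (assumption: titles match verbatim)."""
    return " ".join(title.lower().split())


def merge_findings(per_model: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Map each normalized finding to the set of models that reported it."""
    found_by = defaultdict(set)
    for model, titles in per_model.items():
        for title in titles:
            found_by[normalize(title)].add(model)
    return dict(found_by)


def unique_to(merged: Dict[str, Set[str]], model: str) -> List[str]:
    """Findings reported by exactly one model, the one named."""
    return sorted(key for key, models in merged.items() if models == {model})


# Small sample drawn from the lists in this finding.
sample = {
    "GPT-5": ["Fill deduplication gap", "Credential rotation during TradeStream connection lifetime"],
    "Sonnet 4.6": ["Fill deduplication gap", "Single fill per fill event assumption"],
    "Opus 4.6": ["Fill deduplication gap", "Escalation state reset after crash"],
}

merged = merge_findings(sample)
for name in sample:
    print(name, "->", unique_to(merged, name))
```

Keeping the per-finding set of models makes it easy to separate common ground from model-specific findings, which is the split this finding reports by hand.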