Files
model-research/findings/ALL-FINDINGS.md
T
Rodin 1b108ff66e Initial publish: 29 findings, 6 prompts, methodology, open questions
Full comparative analysis of GPT-5, Claude Opus 4.6, Claude Sonnet 4.6,
GPT-4.1, and GPT-4.1 Mini on analytical tasks (not coding).

Contents:
- findings/ALL-FINDINGS.md — complete 3,249-line research log with all
  29 findings, methodology notes, and open questions
- prompts/ — 6 exact prompts used across experiments
- methodology.md — experimental setup and evaluation criteria
- open-questions.md — unanswered questions for future work
- README.md — overview and summary table

Key findings:
- Cross-document consistency: Opus is 2.4x faster with more findings
- Gap-finding: GPT-5 reasoning tokens find domain-specific gaps
- Race conditions: Opus excels at temporal interaction reasoning
- Bias detection: Signal-to-noise ratio > model capability
- Adversarial analysis: GPT-5 exhaustive, Opus qualitatively different

Signed-off-by: Rodin
2026-05-05 19:13:03 -07:00

195 KiB
Raw Blame History

Model Findings — Analytical & Research Work

Tracking what actually works (and doesn't) when using AI models for research, analysis, bias detection, and document review — not coding.

Started: 2026-04-26

Context

We use multiple models in different roles: Claude Code (Opus/Sonnet) for generation, Sonnet + GPT-5 for independent dual review, smaller models for focused analytical tasks. Most public discussion is about coding. We found almost no published methodology for using models in analytical research tasks (searched 2026-04-26). That gap is why we're tracking this.

Findings

1. Different models catch different things (confirmed)

Date: 2026-04-26 Task: PR reviews on DDD reference docs (~6,600 lines across 18 files) How we used them: Both models got the same task via pr-review skill — fetch diff, fetch full file content for changed files, review against PR description and linked issue acceptance criteria. Rich context: full diff, project CLAUDE.md conventions, issue body. Each reviewer ran independently in its own sub-agent with its own Gitea token. No cross-pollination.

  • GPT-5 caught SUMMARY.md verdict mismatches (Commanded classification, small teams classification) that Sonnet missed entirely (PR #375)
  • Sonnet caught a broken cross-reference link first that GPT-5 missed (PR #378)
  • Takeaway: Different blind spots are real. Neither model is strictly better for analytical review — they complement each other. This is why we run two independent reviewers from different model families.

2. Cheap model + narrow lens > expensive model + broad review (one data point)

Date: 2026-04-26 Task: Check 12 rewritten hypotheses for directional bias How we used them:

  • Sonnet & GPT-5: full PR review context (diff, file content, issue, AC). Broad mandate: "review this PR." Rich context but unfocused task.

  • GPT-4.1 Mini: given ONLY the 12 hypothesis texts + one focused question: "Do any of these hypotheses lead toward a predetermined conclusion?" Minimal context, laser-focused task. No diff, no project docs, no issue.

  • Both Sonnet and GPT-5 approved the hypotheses as reviewers

  • GPT-4.1 Mini found ALL 12 pushed toward predetermined conclusions

  • Words like "requires," "necessary," "must be" were flagged as directional

  • Takeaway: Task framing mattered more than model size. Rich context + broad mandate = missed the forest for the trees. Minimal context + precise question = found exactly what mattered. This needs more testing — was it the narrow framing, the lack of surrounding context, or both?

3. GPT-5 times out on complex multi-step analytical tasks (confirmed pattern)

Date: 2026-04-26 Task: Full PR review of #382 (research document rewrite) How we used it: pr-review skill — multi-phase (fetch diff, fetch files, check CI, analyze against AC, post inline comments, post summary). 7 phases, many curl calls to Gitea API, large diff context. Heavy tool-use workflow through SAP proxy (adds latency vs direct API). 300s timeout.

  • Timed out 3 times at 300s (17, 6, 6 tool calls respectively)
  • Bottleneck was model processing time, not network (~0.3s Gitea API latency)
  • Takeaway: Break analytical tasks into focused bounded pieces. Twelve small deep reviews > one rushed big one. The issue isn't GPT-5's analysis quality — it's that multi-phase tool-heavy workflows burn too much time on mechanics. Separate the data gathering from the analysis.

4. GPT-5 defaults to delegation; Claude defaults to doing the work

Date: 2026-04-26 Task: PR review delegation to sub-agents How we used them: Both spawned as sub-agents from main session with same task description, same pr-review skill file, same Gitea credentials. Difference: GPT-5 got model override to gpt5, Sonnet used default model. Both got full skill instructions.

  • GPT-5 first attempt: spawned sub-sub-agents and timed out
  • GPT-5 with "do it yourself, no sub-agents" + step-by-step: worked
  • Even with constraints, GPT-5 sometimes dumps raw tool output instead of synthesizing — needs explicit output format instructions
  • Claude (Sonnet/Opus) given the same kind of task does the work directly
  • Takeaway: GPT interprets complex task descriptions as delegation opportunities. Claude interprets them as work to do. For GPT: explicit single-actor instructions + output format. For Claude: can give broader mandate. Same skill file, very different behavior.

5. Sonnet is fast and catches structural issues; GPT-5 is slow and catches semantic issues

Date: 2026-04-26 Task: Dual review across PRs #372, #375, #378, #380, #382 How we used them: Same pr-review skill, same context (diff + files + issue + AC), same sub-agent pattern. Only variable: model. Both got rich context. Both ran the full 7-phase review skill.

  • Sonnet consistently finishes first, catches formatting, broken links, structural problems (missing sections, dangling refs)
  • GPT-5 takes longer, catches meaning-level problems (verdict mismatches, classification inconsistencies, logical gaps)
  • Takeaway: With identical rich context and identical instructions, the models naturally gravitate to different things. Sonnet is the structural reviewer; GPT-5 is the semantic reviewer. Both roles matter. Question: would Sonnet catch semantic issues if given a narrower "check for logical consistency" framing instead of broad review?

6. Single agent can't handle 1000+ line document generation (confirmed pattern)

Date: 2026-04-26 Task: DDD v2 forge analysis drafting How we used them: Single Sonnet/Opus sub-agents given full research material (~3,874 lines of research notes) + outline + instructions to write complete document. Very rich context (all research), very large output requirement (1000+ lines).

  • Five single-agent attempts died (OOM, disconnect, timeout) trying to write full documents
  • Sectional approach (5 parallel Sonnet subagents, ~500-700 lines each) succeeded immediately — each got same research but only their section's outline
  • Same pattern when Claude Code attempted full Part V rewrite — died
  • Three agents × ~320 lines each worked first try
  • Takeaway: This is a confirmed, repeatable limit for generation tasks. Not model-specific — it's a context/output length problem. Rich input context is fine; it's the output length that kills. Break output into sections, keep input context rich, draft in parallel, assemble.

7. Emerging role assignments (pattern, not conclusion)

Date: 2026-04-26 (one day of intensive work — treat as hypothesis)

  • Opus (via Claude Code): complex generation needing deep project context. Rich context: CLAUDE.md, full codebase access, design docs. Broad mandate.
  • Sonnet: parallel volume work (5 subagents drafting simultaneously). Rich context per section, constrained output scope.
  • GPT-5: independent analytical review. Rich context (diff + files + issue). Best when task is bounded and explicit.
  • GPT-4.1 Mini: focused narrow analysis (bias detection). Minimal context, precise question. Cheap and fast.
  • Takeaway: The role assignment matters, but so does the context shape. Opus gets broad context + broad mandate. Sonnet gets broad context + narrow scope. GPT-5 gets rich context + explicit task. GPT-4.1 Mini gets minimal context + laser question. We haven't tested swapping these combinations — that's where the real learning will come from.

8. Bias detection: all models catch it with any framing — when the signal isn't buried

Date: 2026-04-27 Task: Detect directional bias in 8 deliberately biased hypotheses about microservices vs monolith architecture for fintech startups. How we used them: Created fresh test material (8 hypotheses with pro- microservices bias via absolutes like "inevitably," "necessary," "must," "requires," plus one factually inverted claim about consistency guarantees). Ran 4 conditions in parallel sub-agents:

Condition Model Framing Context
A GPT-4.1 Mini Narrow: "Do any lead toward a predetermined conclusion?" Hypotheses only
B Sonnet Same narrow question Hypotheses only
C GPT-5 Same narrow question Hypotheses only
D Sonnet Broad: "Review quality, clarity, testability, and issues" Hypotheses only

Results:

  • All 4 conditions detected 8/8 biased hypotheses. No misses.
  • All 3 narrow-framing models (Mini, Sonnet, GPT-5) produced structurally similar output: per-hypothesis verdict, biasing words, neutral version, severity assessment.
  • All 3 narrow-framing models flagged H8's factual inversion (distributed transactions DON'T provide stronger consistency than monolithic ACID).
  • GPT-5 added specific counterexamples (LMAX Disruptor, Shopify, Stack Overflow, Basecamp) — marginally richer analysis.
  • Sonnet broad mandate also caught the bias — framed as one of three "systemic problems" (deterministic language, pro-microservices framing bias, underspecified constructs). Additionally provided testability and operationalization analysis that the narrow framing didn't ask for.
  • Sonnet broad took ~72s vs ~39s for narrow conditions (more output).

Takeaway: When the biased text is the ONLY input (no surrounding noise), all tested models — including the cheapest (GPT-4.1 Mini) — detect bias regardless of whether the question is narrow or broad. This appears to contradict original finding #2 ("cheap model + narrow lens > expensive model + broad review"), but the key difference is context noise:

  • Original experiment (2026-04-26): Sonnet and GPT-5 missed bias during FULL PR REVIEW with rich project context (diff, file content, issue text, acceptance criteria, project conventions). The hypotheses were buried in layers of review mechanics.
  • This experiment (2026-04-27): Even the "broad" condition gave ONLY the hypothesis text — no diff, no PR structure, no project context noise.

Refined hypothesis: The original finding #2 was about signal-to-noise ratio, not about model capability or framing precision. When biased text is presented in isolation, any model catches it. When biased text is buried in a large PR review with many other things to check, the bias signal gets lost in the noise — unless you explicitly ask about it. The "narrow lens" worked because it eliminated the noise, not because smaller models are better at bias detection.

Next experiment to confirm: Give a model the FULL PR review context (diff, files, issue, AC) but add the narrow bias question as an explicit review checklist item. If the model catches bias despite the rich context, it confirms the signal-to-noise hypothesis. If it misses, it suggests something else is at play (attention allocation, task switching cost).

9. Gap-finding in architecture docs: GPT-5 finds domain-specific gaps, GPT-4.1 is generic, Mini is formulaic

Date: 2026-05-02 Task: Identify missing failure scenarios in gargoyle's failure-modes.md (383 lines) How we used them: Same document (full text, no truncation) + same focused analytical question to all 3 models via HAI proxy (OpenAI-compatible endpoint). No tools, no project context beyond the document itself. Single prompt, no conversation history. Temperature 0.3 for GPT-4.1/Mini, default (1.0) for GPT-5 (required by the model).

Model Time Output tokens Reasoning tokens Scenarios found
GPT-4.1 Mini 16s 2,003 0 10
GPT-4.1 24s 2,575 0 15
GPT-5 45s 8,565 6,656 14

What they found — common ground (all 3 identified):

  • ETS table corruption/loss affecting gates
  • BEAM scheduler starvation / GC pauses
  • WebSocket message duplication/reordering
  • Postgres connection pool exhaustion / deadlocks
  • Clock skew / time drift
  • Process registry inconsistency

GPT-5 unique findings (not in either other model):

  • Broker rate limiting (429s) — not "connection lost" so existing logic doesn't trigger, but can't flatten during kill switch
  • Broker auth failure / credential rotation — distinct from connection loss
  • Corporate actions (splits, symbol changes) — position drift without triggering staleness detection
  • Duplicate pipeline instances for same user (DynamicSupervisor race)
  • DB "commit unknown outcome" causing restart loops (Ecto commit succeeds at Postgres but client times out → retry → unique constraint → crash loop)
  • Cross-symbol strategies with partial staleness — multi-leg signals computed from mix of fresh and stale data
  • Partial cancel_all during kill switch masked by process restarts

GPT-4.1 unique findings (not in GPT-5 or Mini):

  • Zombie processes after halt (supervisor misconfiguration)
  • Unsupervised Task crashes going unnoticed
  • Audit log writes failing silently (not in same transaction as state change)
  • ClOrdID unique constraint violation from race in sequence generation
  • Broker API semantic changes (silent breaking changes)

GPT-4.1 Mini unique findings:

  • Race between kill switch engagement and reconciliation completion (timing coordination gap) — this was more explicitly called out than in the other models, though GPT-5 touches it implicitly
  • Strategy.Worker / Aggregator partial crash inconsistency

Quality assessment:

  • GPT-5 had the most domain-relevant and actionable gaps. Broker rate limiting, auth failures, corporate actions, and the DB commit unknown-outcome scenario are all realistic production issues specific to THIS system. The cross-symbol partial staleness finding shows deeper architectural reasoning about component interactions.
  • GPT-4.1 was thorough and well-structured but more generic/defensive. Many of its unique findings (zombie processes, unsupervised Tasks, audit log loss) are general Elixir concerns rather than specific to the document's architecture. Good for a completeness checklist.
  • GPT-4.1 Mini was formulaic — each finding followed the same template and several were somewhat surface-level or restated things the document partially covers. Still found the most scenarios per dollar.

Takeaway: For gap-finding in architecture documents, GPT-5's reasoning tokens pay off. It doesn't just list "things that could go wrong" — it identifies specific interactions that the document's existing mechanisms don't cover (e.g., rate limiting bypasses the "connection lost" detection, corporate actions bypass staleness detection). GPT-4.1 is a solid middle-ground: more thorough than Mini, less insightful than GPT-5. Mini is fine for a quick sanity check but won't find the subtle gaps.

Cost-effectiveness: Mini found 10 scenarios in 16s for ~7K tokens. GPT-5 found 14 scenarios (with 7 genuinely unique insights) in 45s for ~13.5K tokens (including 6.6K reasoning). For architecture review where missing a gap could mean financial loss, the GPT-5 cost is justified. For routine doc review, Mini + human judgment is probably sufficient.

10. Hidden-assumption identification: GPT-5's reasoning produces qualitatively different (not just more) findings

Date: 2026-05-02 Task: Identify hidden assumptions in gargoyle's cold-start-and-recovery.md (234 lines) that could break under real-world production conditions. How we used them: Same document (full text) + same focused analytical question to all 3 models via HAI proxy (OpenAI-compatible endpoint). No tools, no project context beyond the document itself. Single prompt, no conversation history. Temperature 0.3 for GPT-4.1/Mini; GPT-5 uses default (required).

Model Time Output tokens Reasoning tokens Assumptions found
GPT-4.1 Mini 25s 3,090 0 12
GPT-4.1 77s 2,751 0 14
GPT-5 78s 2,649 4,096 26

What they found — common ground (all 3 identified):

  • Broker API consistency/availability during reconciliation
  • ETS table availability and fail-closed behavior
  • Single-writer/mailbox ordering guarantees holding in practice
  • User independence assumption vs shared resources (rate limits, DB)
  • Reconciliation idempotency under repeated runs
  • Corporate action data completeness/timeliness
  • Escalation threshold calibration vs changing market conditions
  • Strategy warmup with partial/missing historical data
  • Signal expiry correctness on restart

GPT-5 unique findings (not in either other model):

  • Unbounded mailbox growth during extended reconciliation (memory pressure from queued messages at market open)
  • handle_continue side effects in OTHER processes (risk, metrics) acting concurrently via different paths
  • Pre-existing GTC orders filling while gated (positions as moving target)
  • Broker position semantics mismatch (trade-date vs settled-date)
  • Strategy warmup evaluate() having non-signal side effects (metrics, caches)
  • Historical bar / live tick boundary alignment (double-processing or gaps)
  • ETS gate caching in process state creating fail-open windows
  • Correlated retry stampede when many users restart together
  • Corporate action double-application race with broker (missing idempotency keys per action/instrument/date)
  • Kill switch state vs DB unavailability at startup
  • Market data subscriptions as shared bottleneck across "independent" users
  • Time-invariant signals incorrectly expired by aggregation window logic
  • Broker fills vs positions endpoints internally inconsistent (different caches)
  • Positions changing under reconciliation while kill switch is engaged
  • Gate phase sequencing: :ready written before worker warmup completes
  • Periodic reconciler allowing 1hr of divergent trading (rate-of-change blind)

GPT-4.1 unique findings (not in GPT-5 or Mini):

  • No correlated failure handling (all failure modes treated as isolated) — only model to frame this as a meta-assumption about the failure table

GPT-4.1 Mini unique findings:

  • None that weren't also covered by the other two models

Quality assessment:

  • GPT-5 didn't just find more assumptions — it found qualitatively different kinds. Many of its unique findings involve multi-component interactions (mailbox + reconciliation + market open timing), semantic mismatches (trade-date vs settled positions), and second-order effects (metrics side effects during warmup, GTC orders filling while gated). These require reasoning about system behavior across boundaries the document doesn't explicitly draw.
  • GPT-4.1 was competent and structured, found the same core assumptions as Mini, plus one good meta-observation about correlated failures. But it stayed within the document's own framing — it found assumptions the document almost states rather than ones the document can't see.
  • GPT-4.1 Mini was formulaic. Every finding maps cleanly to a section of the document. It's essentially "what could go wrong with each stated mechanism" rather than "what does this design take for granted about the world outside itself."

Key insight — reasoning tokens change the KIND of analysis: GPT-5's 4,096 reasoning tokens aren't producing "more of the same" — they're producing a different analytical mode. The non-reasoning models (4.1 and Mini) identify risks within the document's own frame of reference. GPT-5 reasons about the document's relationship to the external world: broker semantics, deployment topology, OTP runtime behavior under load, timing correlations across independent subsystems. This is the difference between "what could this mechanism fail at" and "what must be true about the world for this mechanism to work."

Comparison to Finding #9 (gap-finding on failure-modes.md): Same pattern confirmed. GPT-5 consistently finds domain-specific, interaction-level issues that require reasoning about component boundaries. GPT-4.1 is thorough within-frame. Mini is formulaic. The gap between GPT-5 and the others is larger here than in #9 — possibly because "hidden assumptions" requires more abstraction than "missing failure scenarios." Assumption-finding requires the model to reason about what ISN'T stated, which benefits more from extended reasoning.

Practical implication: For architecture review, running GPT-5 on "identify hidden assumptions" is higher-value than the same question to non-reasoning models. The cost difference (4K extra reasoning tokens) is trivial for a document that will drive months of implementation. Use non-reasoning models for within-frame checks ("does this section have gaps") and reasoning models for cross-boundary analysis ("what must be true about the world for this to work").

11. Hidden-assumption identification on simpler doc: reasoning models diverge in approach, both outperform non-reasoning

Date: 2026-05-02 Task: Identify hidden assumptions in gargoyle's market-calendar.md (238 lines) — a simpler, single-component document vs the 234-line cold-start doc from Finding #10. How we used them: Same document (full text) + same focused analytical question to all 3 models via HAI proxy. No tools, no project context beyond the document itself. Single prompt, no conversation history. Temperature 0.3 for GPT-4.1; GPT-5 and Opus use their defaults (required). Same prompt across all three.

Model Time Output tokens Reasoning tokens Assumptions found
GPT-4.1 19s 2,554 0 14
Claude Opus 4.6 74s 3,288 (internal, not reported) 13
GPT-5 101s 8,417 5,504 24

What they found — common ground (all 3 identified):

  • Alpaca calendar API data correctness/completeness as single source of truth
  • Alpaca API availability at startup (no local cache persistence)
  • ETS table atomicity during refresh (partial-state exposure risk)
  • System clock/timezone alignment (dates are timezone-naive)
  • NYSE emergency/unscheduled closures not reflected until refresh
  • Two-year cache range sufficiency
  • API response format stability
  • Rate limiting / API capacity concerns

GPT-5 unique findings (not in either other model):

  • Date struct term-ordering in ETS match specs may not match chronological order (ETS range guards rely on Erlang term comparison, not Date semantics)
  • close_time/1 returns naive Time without timezone — DST conversion burden on consumers, one hour off twice per year
  • trading_day?/1 conflates "not a trading day" with "calendar unavailable" — operational outages invisible to callers
  • ETS table name collision risk (global namespace per node)
  • No other process should modify the ETS table (access mode discipline)
  • Network egress and credential availability on all nodes at all times
  • ETS read/write concurrency flags for contention under load
  • Direct ETS access by consumers bypassing the module's error handling
  • next/prev_trading_day edge cases at cache boundaries
  • Alpaca API start/end parameter inclusivity (off-by-one at year boundaries)
  • Half-day vs full-day distinction insufficiency for special sessions
  • Small table size makes O(n) selects acceptable (scaling concern)
  • Year-end refresh failure leaving gaps at boundary
  • Alpaca never omits a legitimate trading day (absence = non-trading conflation)

Claude Opus unique findings (not in either other model):

  • ETS ownership semantics: heir-protection would change fail-closed behavior; current design means ALL consumers fail simultaneously during crash-to-restart window (framed as a design tension, not just a risk)
  • Silent data corruption from partial API response (pagination/truncation) — specifically that missing rows are SILENT failures with no error propagation (other models mentioned API completeness but not the silence aspect)
  • Consumers calling functions with Dates, not DateTimes — the API accepts Date.t() but doesn't specify HOW consumers should derive "today" (system-wide coordination problem made invisible by the API contract)
  • trading_day?/1 returning false is NOT fail-closed for ALL consumers — only for PDT-like "block action" consumers; for batch-trigger consumers it's fail-OPEN (subtle inversion of safety semantics)
  • Startup ordering: background_children placement means PDT could receive orders before MarketCalendar finishes init, creating recurring rejection windows during hot deploys
  • Continuous-running assumption for refresh timer (daily restarts would mean refresh mechanism never fires — no staleness alert exists)

GPT-4.1 unique findings (not in either other model):

  • No need for real-time calendar change notification (event emission gap)
  • All consumers using the same module instance (configuration consistency)
  • No need for historical calendar data (audit/backtesting limitation)
  • Consumers correctly handling {:error, :calendar_unavailable} in practice

Quality assessment:

  • GPT-5 found the most assumptions (24) with the most technical specificity. Many are implementation-level insights (ETS term ordering, named table collisions, read_concurrency flags) that demonstrate deep Erlang/OTP knowledge. Some are slightly obvious or overlapping. The ETS term-ordering finding is genuinely insightful — Date structs DO compare correctly in Erlang term order (year > month > day fields), but questioning it shows depth of reasoning about underlying mechanisms. Also provided concrete recommendations.
  • Claude Opus found fewer assumptions (13) but several were qualitatively different — they identified design tensions and semantic inversions rather than just failure scenarios. The fail-open/fail-closed inversion (finding #12), the ETS ownership tension, and the "API makes timezone coordination invisible" findings show reasoning about the design's relationship to its consumers rather than just its internal mechanics. Tighter, more curated output with less filler.
  • GPT-4.1 was competent and well-structured (14 assumptions, clean table) but stayed within the document's own framing. Its unique findings are relatively generic ("consumers should handle errors correctly," "no historical data"). Solid baseline, no surprises.

Key insight — two reasoning models, different analytical styles: GPT-5 and Opus are both reasoning models, but they reason about different things. GPT-5 reasons DEEPER into implementation mechanics (how does ETS actually work? what are the exact failure modes of each component?). Opus reasons WIDER about system context (how does this component's API contract affect the safety properties of the overall system? what tensions does this design create that aren't visible to the author?).

GPT-5's approach: "Here are 24 things that could go wrong, many highly technical." Opus's approach: "Here are 13 assumptions, several of which reveal design tensions the document can't see about itself."

Does the reasoning gap narrow with simpler docs? Comparing to Finding #10 (cold-start doc, 234 lines, 26 vs 14 vs 12 assumptions for GPT-5/GPT-4.1/Mini):

  • GPT-5 still dominates in raw count (24 vs 14 for GPT-4.1)
  • The gap ratio is similar (~1.7x here vs ~1.9x in Finding #10)
  • Document complexity doesn't appear to be the driver of the gap — reasoning tokens enable more exhaustive exploration regardless of input complexity

Claude Opus vs GPT-5 (the headline comparison): They're not competing on the same axis. GPT-5 is better for "find all possible issues" (breadth + technical depth). Opus is better for "find the assumptions that will actually surprise the author" (insight density). If you want a security-audit-style exhaustive list: GPT-5. If you want a design-review-style "here's what you're not seeing about your own design": Opus. Both are better than GPT-4.1 for this task, but in different ways.

Practical implication: Run BOTH reasoning models on architecture docs. GPT-5 catches implementation-level hazards the team might miss during coding. Opus catches design-level tensions the team might miss during planning. GPT-4.1 is sufficient as a quick sanity check but won't surprise you.

12. Sonnet 4.6 outperforms expectations on assumption-finding; competes with reasoning models on complex docs

Date: 2026-05-02 Task: Identify hidden assumptions in gargoyle's order-execution.md (785 lines) — a complex, multi-component document covering OrderManager, BrokerAdapter, TradeStream, and PositionReconciler. How we used them: Same document (full text, no truncation) + same focused analytical question to all 3 models. GPT-5 via HAI OpenAI endpoint; Opus 4.6 and Sonnet 4.6 via HAI Anthropic endpoint. No tools, no project context beyond the document itself. Single prompt, no conversation history.

Model Time Output tokens Reasoning tokens Assumptions found
GPT-5 93s 8,485 6,016 20
Claude Sonnet 4.6 106s 4,637 (internal) 17
Claude Opus 4.6 105s 4,615 (internal) 12

What they found — common ground (all 3 identified):

  • Synchronous broker REST calls blocking OrderManager GenServer (mailbox growth)
  • TradeStream event ordering assumptions (out-of-order fills/status)
  • Fill deduplication gap (no explicit fill-level idempotency)
  • cancel_all/1 with timeout: :infinity blocking GenServer during FLATTEN
  • Recovery/restart races with TradeStream fill delivery (fills queued during handle_continue/2)
  • Lot operation idempotency under crash recovery (partial execution)
  • Replace race: fills for new broker_order_id arriving before replaced event
  • Database write latency impact on GenServer throughput under burst fills
  • ETS table scope assumptions (single-node, access mode)

GPT-5 unique findings (not in either Claude model):

  • Rate-limit retry blocking OrderManager inline (no async retry path specified)
  • Single TradeStream connection per user not enforced (duplicate detection gap)
  • Kill switch FLATTEN vs degraded state interaction (OM drops cancels while degraded, but FLATTEN calls cancel_all through OM)
  • ClOrdID uniqueness scope/retention at broker across sessions and days
  • after: datetime filter semantics (clock skew, timezone, inclusive/exclusive)
  • Reconciliation responses may exceed single-response size (no pagination)
  • Event broadcasting blocking model (synchronous vs fire-and-forget)
  • Credential rotation during TradeStream connection lifetime
  • market_closed semantics varying across brokers (reject vs queue)
  • Dropped Alpaca statuses (stopped/suspended/calculated) may affect accounting

Claude Sonnet 4.6 unique findings (not in either other model):

  • Single fill per fill event assumption (broker batching multiple fills into one WebSocket message)
  • Lot operations (Lots.open/2, Lots.close/4) assumed to never fail — no {:error, _} handling shown, crash propagation risk
  • Task.async_stream inside GenServer creating linked tasks whose crash signals propagate to OrderManager during critical cancel_all
  • Broker cancel semantics during in-flight replace at the broker level (cancel targets old broker_order_id which broker already replaced away)
  • Database operations in fill processing assumed transactional (no explicit Ecto.Multi/transaction mention)
  • Broker position reflects only Gargoyle's activity (external trades cause false-positive reconciliation halts)

Claude Opus 4.6 unique findings (not in either other model):

  • {:ok, broker_order_id} from REST place conflated with durable OMS acceptance vs mere HTTP acknowledgment (no timeout on submitted state)
  • Concurrent apply_corrections/2 from periodic reconciler running in separate process conflicts with OrderManager's single-writer invariant (corrections write to same tables outside GenServer serialization)
  • Reconciliation gate initialized state after :rest_for_one restart — ETS table EXISTS but freshly initialized vs table MISSING are different conditions with different safety properties
  • Escalation state reset after crash creating double-exposure window (systematic issue persists but escalation timer resets to zero)
  • replace/3 error semantics: non-atomic replace (cancel + re-submit) where cancel succeeds but re-submit fails leaves original order cancelled at broker while OrderManager reverts to "working" locally

Quality assessment:

  • GPT-5 maintained its pattern from previous findings: broadest coverage (20 assumptions), most technically specific about implementation details. Found cross-cutting operational concerns (clock skew, credential rotation, pagination) that the Claude models didn't surface. However, several of its findings were medium-severity operational concerns rather than architectural assumptions.
  • Claude Sonnet 4.6 was the surprise performer. Found 17 assumptions — close to GPT-5's count (85%) — and several of its unique findings were genuinely insightful. The cancel_all race with broker-side replace state (finding #16) and the lot operation failure propagation (finding #6) show deep reasoning about component interaction despite Sonnet not being positioned as a "reasoning" model. More importantly, Sonnet's findings were consistently well-structured with clear "how it could break" scenarios.
  • Claude Opus 4.6 found the fewest assumptions (12) but — consistent with Finding #11 — its unique findings were qualitatively different. The concurrent apply_corrections write conflict, the gate initialization state distinction, and the non-atomic replace error semantics all reveal design tensions that neither GPT-5 nor Sonnet identified. Opus continues to reason about the boundaries between components rather than within-component mechanics.

Key insight — Sonnet 4.6 is NOT just a faster GPT-4.1: In previous findings (#9, #10, #11), non-reasoning models (GPT-4.1, GPT-4.1 Mini) performed significantly below reasoning models on assumption-finding. GPT-4.1 found ~14 assumptions where GPT-5 found 24-26. Here, Sonnet 4.6 finds 17 where GPT-5 finds 20 — a much smaller gap (~85% vs ~58% previously).

Sonnet's findings also included several that showed genuine reasoning about component interactions (not just within-frame risks). This suggests Sonnet 4.6 is qualitatively different from GPT-4.1 for analytical work — it occupies a middle ground between GPT-4.1's "competent but surface-level" and GPT-5's "exhaustive and deep." The severity distribution was also similar to GPT-5 (multiple critical/high findings), whereas GPT-4.1 in previous experiments tended toward medium-severity generic concerns.

Updated model hierarchy for assumption-finding:

  1. GPT-5 — broadest coverage, most operational-level findings (20)
  2. Sonnet 4.6 — strong analytical depth, good component interaction reasoning (17)
  3. Opus 4.6 — fewest but most architecturally insightful, finds design tensions (12)
  4. GPT-4.1 — competent within-frame, generic (~14 from previous experiments)
  5. GPT-4.1 Mini — formulaic, surface-level (~10-12)

Practical implication: For architecture review, Sonnet 4.6 is now a strong candidate for volume analytical work. It's fast enough to run alongside GPT-5 and catches different things (lot operation failures, broker-side replace races). The ideal three-model review stack for architecture docs appears to be:

  • GPT-5 for breadth + operational concerns
  • Sonnet 4.6 for component interaction analysis
  • Opus 4.6 for design-tension identification

Each consistently finds things the others miss. The cost-efficiency argument for Sonnet is strong: ~85% of GPT-5's count with more actionable findings per token generated (4,637 vs 8,485 tokens for 17 vs 20 assumptions).

13. Race condition identification: Opus excels at subtle temporal interactions; GPT-5 is exhaustive; Sonnet struggles with concurrency reasoning

Date: 2026-05-03 Task: Identify race conditions, timing-dependent bugs, and ordering hazards in gargoyle's concurrent-failure-detection.md (241 lines) — a document specifically about concurrent detection logic with timers, ETS state, and multi-process events. How we used them: Same document (full text) + same focused analytical question to all 3 models via HAI proxy. Prompt specifically asked for event ordering problems, timer interaction bugs, state visibility gaps, crash/restart timing, and multi-instance coordination. Required each finding to reference specific mechanisms in the document with specific interleaving descriptions. No tools, no project context beyond the document itself.

Model Time Output tokens Reasoning tokens Race conditions found
GPT-5 116s 10,587 8,192 12
Claude Opus 4.6 ~105s 4,610 (internal) 10
Claude Sonnet 4.6 ~39s 1,404 (internal) 7

What they found — common ground (all 3 identified):

  • Stale timer messages in mailbox after cancellation (classic Erlang timer race)
  • HealthMonitor crash losing compound detection state (init from :unknown, no replay)
  • ETS vs GenServer state divergence visible to dashboard
  • Kill switch mode conflict (FLATTEN from single-process vs HALT from compound path)

GPT-5 unique findings (not in either Claude model):

  • Cross-sender message ordering: recovery events from pipeline processes vs timer expiry from runtime (Erlang preserves per-sender order, NOT cross-sender) — the "rapid recovery" safety argument in the doc relies on state being updated before timer fires, which isn't guaranteed
  • Debounce starvation: flapping component repeatedly restarting the timer, causing compound evaluation to be indefinitely postponed while ≥2 genuinely degraded
  • State regression: {:degraded} arriving after {:escalated, :kill_switch} with no guard in the event table — state machine allows regressing from :halted to :degraded
  • Cold-start window: application boots with existing degraded processes that won't re-emit events, compound detection never fires
  • Catch-all handle_info could accidentally swallow timer messages if pattern matching is ordered wrong (implementation pitfall of the described approach)
  • Debounce window growing beyond calibrated bounds from repeated timer restarts

Claude Opus unique findings (not in either other model):

  • Timer restart pushing evaluation PAST single-process escalation timeout — the debounce mechanism can DEFEAT compound detection when second degradation arrives near end of first window (resets to full window, first process escalates via single-process path before new window fires). This means system gets FLATTEN instead of HALT — exactly what compound detection was supposed to prevent.
  • Strategy worker single-atom masking via event ordering: Worker A degrades, Worker B degrades (same atom), Worker A recovers → atom set to :normal while B is still degraded. Event ordering across different workers mapped to same atom creates state loss.
  • Registry stale PID after HealthMonitor crash: if subscription is PID-based (not PubSub topic), new HM instance is deaf — events go to dead PID, silently dropped. Compound detection completely disabled for that user until subscription refresh.
  • :rest_for_one cascade + coincidental independent issue: debounce designed to filter cascade restarts, but cascade can EXPOSE a pre-existing issue (SignalRisk restarts and finds its upstream stale). Is this "compound" or "cascade + bad luck"? Semantic ambiguity the design doesn't address.
  • Compound cleared event without recovery debounce: :compound_degradation_cleared emitted immediately when last process recovers (no settling period), causing operator oscillation if recovery is transient.

Claude Sonnet unique findings:

  • ETS table creation race at startup (HealthMonitor writes before table exists)
  • Registry lookup failure during pipeline startup (events before HM registered)
  • However, Sonnet also made analytical errors: it described "multiple HealthMonitor instances for the same user" scenarios despite the document clearly stating one instance per user via DynamicSupervisor. Several of its findings assumed multi-instance coordination that doesn't match the architecture.

Quality assessment:

  • GPT-5 was the most exhaustive and technically precise. Its cross-sender ordering finding (#2) is genuinely insightful — it identifies that the document's "rapid recovery" safety argument implicitly assumes events arrive in wall-clock order, which Erlang does NOT guarantee across different senders. The debounce starvation finding (#3) identifies a real operational hazard with practical consequences. All 12 findings reference specific mechanisms and describe specific interleavings clearly.
  • Claude Opus found fewer race conditions but several were qualitatively superior. The timer-restart-defeats-compound-detection finding is the most architecturally significant race in the entire analysis — it shows that the debounce mechanism can work AGAINST the design's stated goals in specific (realistic) timing scenarios. The strategy-worker event ordering masking is also a genuine design flaw unique to the single-atom decision. Opus continues its pattern of reasoning about design TENSIONS rather than just failure modes.
  • Claude Sonnet was notably weaker here than in previous experiments. Only 1,404 output tokens vs 4,610 (Opus) and 10,587 (GPT-5). Several findings contained analytical errors (assuming multi-instance coordination that doesn't exist). It found only 7 races, and 2-3 of those were based on misreadings of the architecture. This is a significant regression from Finding #12 where Sonnet found 17 assumptions (85% of GPT-5's count).

Key insight — concurrency reasoning is a different skill than assumption-finding: In previous experiments (#10, #11, #12), Sonnet 4.6 performed well on assumption-finding (a task that requires reasoning about what's NOT stated). Here, on race condition identification (a task requiring reasoning about temporal interleavings and message ordering semantics), Sonnet drops significantly. This suggests the task type matters more than we previously thought:

  • Assumption-finding: Requires breadth of consideration ("what must be true for this to work?"). Sonnet handles this well — it's essentially pattern matching across possible failure dimensions.
  • Race condition identification: Requires SEQUENTIAL reasoning about specific interleavings ("if A happens, then B happens, then C happens, what state is visible?"). This benefits dramatically from extended reasoning tokens (GPT-5's 8,192 reasoning tokens) or from Opus's internal reasoning depth.

The lesson: don't extrapolate model performance across task types. A model that's 85% as good at assumption-finding may be 50% as good at concurrency analysis. The cognitive demands are different.

Opus's distinguishing strength — finding design contradictions: Opus's best finding (timer restart defeating compound detection) isn't just a race condition — it's identifying that the debounce mechanism can work against the design's own stated goals. This is consistent with Opus's pattern in previous findings: it finds tensions where one part of the design undermines another part. For race condition analysis specifically, this manifests as "here's where your safety mechanism becomes your vulnerability."

Practical implication for architecture review:

  • For race condition analysis: use GPT-5 (exhaustive) + Opus (design-tension)
  • Sonnet is NOT suitable for concurrency reasoning tasks — use it for assumption-finding and structural review instead
  • The three-model stack needs task-appropriate assignment:
    • Structural/assumption review: all three models contribute
    • Concurrency/race analysis: GPT-5 + Opus only
    • Bias detection: any model (per Finding #8)

14. Cross-component interaction analysis: GPT-5 Mini enters as viable mid-tier; task framing strongly shapes output quality

Date: 2026-05-03 Task: Identify cross-component interaction failures in gargoyle's continuous-risk-monitoring.md (459 lines) — a document specifying PortfolioMonitor's interactions with OrderManager, PortfolioRisk, MarketData, KillSwitch, ETS tables, and the pipeline supervision tree. How we used them: Same document (full text) + same focused analytical question to all 3 models via HAI proxy. Prompt was highly structured: specified 5 categories of cross-component failures to look for (semantic mismatches, ordering violations, feedback loops, partial visibility, supervision boundary effects) and required specific output format (components, sequence, gap, impact). No tools, no project context beyond the document itself.

Model Time Output tokens Reasoning tokens Findings
GPT-5 Mini 68s 5,445 2,240 6 (+1 cut off)
GPT-5 116s 10,604 8,128 10
Claude Sonnet 4.6 38s 1,868 (internal) 8

What they found — common ground (all 3 identified):

  • Fill-to-position query race (fill event triggers evaluation but position store hasn't yet reflected the fill)
  • Restrict flag ETS table destruction on PM crash → permissive window
  • Kill switch check vs liquidation submission race
  • Ticker subscription timing gap (new position opened but ticks not yet subscribed → breach goes undetected)

GPT-5 unique findings (not in either other model):

  • Stale prices are NOT fail-safe for drawdown (higher stale price → inflated portfolio value → understated drawdown). The document claims "fail-safe" but this only holds for exposure metrics, not drawdown. This is the most architecturally significant finding across all three models.
  • Price definition mismatch between PM (last_trade from ETS) and OrderManager/ broker (bid/ask/mid) causing mis-sized liquidation and oscillation
  • Cross-component oscillation: PM hysteresis internal vs PRisk's immediate binary restrict gate clearing (no cross-component cooldown)
  • Liquidation stuck after OM restart (terminal events lost; liquidation_in_ flight stays true indefinitely with no timeout/rehydration)
  • "Minimal risk checks" not enforced — PM goes through same OM gates as strategy orders but MarketHours/StalePrice controls may reject after-hours or stale-price liquidation attempts
  • FLATTEN mode semantics gap — PM refrains from liquidating when kill switch engaged, but FLATTEN cancels open orders without actually CLOSING positions. No component left to close positions.

Claude Sonnet 4.6 unique findings (not in either other model):

  • Liquidation feedback loop with PortfolioRisk — buy-to-cover for short positions could INCREASE net long exposure at portfolio level, paradoxically worsening concentration while fixing position-level metrics
  • High water mark reset on pipeline restart masks true intraday drawdown (restart → HWM resets to lower current value → drawdown calculated from false baseline → larger losses permitted than intended)
  • Multi-metric breach with single boolean flag — concentration liquidation for AAPL sets liquidation_in_flight, blocking simultaneous daily P&L liquidation for different positions
  • Market close/open vs after-hours fills — claims to evaluate after-hours fills but uses stale market-close prices

GPT-5 Mini unique findings (not in either other model):

  • OrderManager order splitting/remapping causing liquidation_in_flight correlation failure (parent/child order ID mapping breaks terminal-event detection). Well-reasoned but highly implementation-specific.
  • Restrict/clear oscillation loop with strategy behavior (strategies react to rejects → back off → restrict clears → strategies re-enter aggressively → re-breach). Good systems-thinking about emergent feedback.

Quality assessment:

  • GPT-5 produced the most findings (10) and the highest-quality architectural insight: the stale-price/drawdown contradiction is a genuine design flaw that contradicts the document's own safety claim. Multiple findings showed cross-boundary reasoning about semantic mismatches (price definition, FLATTEN semantics, gate bypass). Every finding named specific components and described precise event sequences.
  • Claude Sonnet 4.6 was fast (38s, only 1,868 tokens) and produced 8 solid findings. The HWM reset finding and the multi-metric/single-flag finding show genuine architectural reasoning. The liquidation feedback loop (buy-to-cover worsening portfolio concentration) is subtle and shows cross-position reasoning. However, some findings overlapped significantly with the common-ground set and added less unique depth. Sonnet performed MUCH better here than on race condition identification (Finding #13) — 8/10 ratio vs 7/12 previously.
  • GPT-5 Mini produced 6 findings in 68s with 2,240 reasoning tokens. Quality was genuinely good — the order-splitting/correlation finding and the oscillation feedback loop both show real reasoning depth. It's clearly NOT GPT-4.1 Mini — it reasons about component interactions, not just within-frame risks. However, it found fewer issues and one response was cut off (token limit or response truncation).

Key insight — task framing as the dominant variable: This experiment used a much more structured prompt than previous ones: specified 5 categories, required specific output format, explicitly excluded single-component failures. The result: ALL models produced higher-quality, more focused output than in earlier experiments with broader prompts. Even Sonnet — which struggled on race conditions (Finding #13) — performed well here. The structured categories likely helped models organize their reasoning without losing track of what they were looking for.

The prompt explicitly asked for "cross-component interaction failures" rather than general analysis. This is the narrow-lens effect from Finding #2, but applied to a complex multi-component document. The lens is narrow (only inter-component gaps) but the scope is broad (459 lines, many interactions). This combination — narrow analytical lens + broad document scope — appears to be the sweet spot for getting quality from all model tiers.

GPT-5 Mini positioning: First time testing GPT-5 Mini. Results: 6 findings in 68s vs GPT-5's 10 in 116s. That's 60% of the findings in 59% of the time, with 28% of the reasoning tokens (2,240 vs 8,128). Quality-per-finding was solid — the order correlation finding especially showed genuine systems reasoning. GPT-5 Mini appears to be a legitimate mid-tier: more capable than GPT-4.1 (which can't do this kind of cross-boundary reasoning) but less exhaustive than GPT-5. Viable for: first-pass screening, bulk document review where you'd run many docs and can't afford full GPT-5 on each.

Sonnet recovery from Finding #13: Sonnet went from 7 findings (with errors) on race conditions to 8 solid findings here. The difference: this prompt was more structured, the document was larger with more explicit interaction descriptions, and the task didn't require pure temporal/sequential reasoning. "Cross-component interaction failures" is closer to assumption-finding (Sonnet's strength) than race condition identification (Sonnet's weakness). Task taxonomy continues to matter more than raw model capability.

Updated model assignment for cross-component analysis:

  1. GPT-5 — broadest coverage, finds semantic contradictions in the doc's own claims (10 findings)
  2. Sonnet 4.6 — fast and efficient, good at portfolio-level reasoning and feedback loops (8 findings in 38s)
  3. GPT-5 Mini — viable first-pass with genuine reasoning depth (6 findings)
  4. (Opus untested for this task type — likely strong on design tensions)

20. Invariant violation path analysis: GPT-5 is maximally selective (3 findings, all genuine); Opus shows unique self-correcting analytical style; new task type favors precision over exhaustiveness

Date: 2026-05-04 Task: Identify invariant violation paths in gargoyle's user-pipeline-lifecycle.md (730 lines) — sequences of legal operations that can violate the system's stated or implied invariants. NEW analytical lens not previously tested, distinct from assumption- finding, race conditions, or coherence checking. How we used them: Same document (full text) + same focused analytical question to all 3 models via HAI proxy. Highly structured prompt specifying 5 categories of invariant violations (state machine escapes, invariant composition failures, monotonicity violations, idempotency boundary violations, authority inversion sequences). Required specific output format per finding. No tools, no project context beyond the document itself.

Model Time Output tokens Reasoning tokens Findings
GPT-5 143s 784 12,032 3
Claude Opus 4.6 113s 6,183 (internal) 7 (with 2 self-corrections)
Claude Sonnet 4.6 23s 1,266 (internal) 5

What they found — common ground (2+ models identified):

  • Periodic reconciliation overrides operator manual stop (GPT-5 #3 + Opus #5 + Sonnet #1): An admin who stops a pipeline via stop_user/1 with :admin_action has their decision overridden within 5 minutes by periodic reconciliation, because there's no "admin stopped" state in check_eligibility/1. All three models independently identified this as the clearest authority inversion.
  • DynamicSupervisor restart bypasses eligibility gate (Opus #1/#3 + Sonnet #2): When UserPipeline.Supervisor crashes and is restarted by OTP supervision, the restart bypasses start_user/1 and check_eligibility/1 entirely — potentially resuming trading while the kill switch is engaged.
  • Stale ReconciliationGate after crash (Opus #7): After a crash-triggered DynamicSupervisor restart (not via stop_user/1), the ReconciliationGate remains :ready from the previous instance because stop_user/1 (which resets it) was never called. The new OrderManager may accept orders during its own reconciliation.
  • HealthMonitor co-lifecycle violation (Opus #2 + Sonnet #4): After a DynamicSupervisor-initiated restart, the HealthMonitor is still subscribed to the old PIDs — no code re-establishes monitoring for the new pipeline processes.

GPT-5 unique findings (not in either other model):

  • Kill switch bypass for users configured DURING engagement (#1): A user who saves credentials while the kill switch is engaged is never added to the pending operator release set (only running pipelines are added at engage time). After disengage, periodic reconciliation auto-starts this user's pipeline without operator release — violating "resuming always requires human judgment." This is the most precisely reasoned finding across all three models: each step is individually correct per the spec, and the violation emerges purely from the composition of legal operations.
  • Premature release bypass (#2): If operator_release_user/1 is called while the kill switch is still engaged (a legal operation), it clears the pending release flag but start_user/1 correctly refuses. After later disengage, the flag is gone — auto-start proceeds without fresh operator judgment. The release was "spent" at the wrong time.

Claude Opus unique findings (not in either other model):

  • operator_release_system/0 clears unrelated safety obligations (#4): Operator intends to release one user from a recent event but operator_release_system/0 also releases other users still pending from an earlier, unresolved event. One release call discharges multiple independent safety obligations — monotonicity violation.
  • State machine incompleteness for blocked users (#6): Users who become configured during kill switch engagement (blocked with reason :kill_switch_engaged) have no state machine transition back to starting after disengage — they're not in the pending release set, and no event fires. System works via periodic reconciliation (up to 5 minutes delay), but the documented state machine doesn't represent this path.
  • Self-correcting analytical style: Opus explicitly withdrew two draft findings mid-analysis ("Actually, this sequence works as designed. Let me identify a real violation instead." / "this is likely handled"). This self-correction behavior was first observed in Finding #15 and is now confirmed as a consistent Opus trait for invariant-style analysis.

Claude Sonnet unique findings (not in either other model):

  • Cold-start Tier 3 failure creates supervision restart loop (#2): A persistent Tier 3 failure (phantom fills) crashes OrderManager, :rest_for_one kills the tree, DynamicSupervisor restarts it, cold-start fails again → infinite loop. State machine shows starting → stopped but supervision creates starting → starting indefinitely.
  • HealthMonitor start failure during start_user (#4): If HealthMonitor.Supervisor is momentarily crashed when start_user/1 runs step 4, the pipeline starts without monitoring. No error handling specified for this partial-start state.

Quality assessment:

  • GPT-5 was MAXIMALLY SELECTIVE — only 3 findings from 12,032 reasoning tokens (4,011 reasoning tokens per finding). This is the most extreme reasoning-to-output ratio observed: 15:1 (12,032 reasoning / 784 output tokens). For comparison, in previous experiments GPT-5 typically shows 1:1 to 2:1 ratios. Every finding is a genuine invariant violation with a precise, step-by-step sequence where each step is individually legal. ZERO false positives, zero padding, zero "this might be an issue." GPT-5 appears to have used almost all its reasoning budget for VERIFICATION — confirming that each candidate is genuinely a violation before including it.
  • Claude Opus produced the most findings (7) with its characteristic depth and self-correction. Two findings were revised mid-analysis, showing Opus actively testing its own reasoning against the document before committing to a finding. The DynamicSupervisor restart thread (findings #1, #2, #3, #7) forms a coherent cluster — Opus identified one root cause (OTP restarts bypass the lifecycle layer) and explored its multiple consequences. The operator_release_system monotonicity finding (#4) is architecturally significant and unique.
  • Claude Sonnet was extremely fast (23s, 1,266 tokens) and produced 5 findings. Quality was mixed: Finding #1 partially mirrors GPT-5's authority inversion but with vaguer reasoning ("race condition with ETS operations" — not specified). Finding #3 describes a contradiction but the scenario is internally inconsistent (step 5 says "pipeline termination fails" but then step 7 says pipeline is still running — this conflates two failure modes). Findings #2 and #4 are genuine and well-reasoned. Sonnet's precision is lower than the other two on this task.

Key insight — "Invariant violation paths" as a task type:

This is a genuinely DIFFERENT analytical task from any previously tested. It requires:

  1. Identifying the invariants (explicit or implied)
  2. Constructing a sequence of operations (creative/generative)
  3. Verifying each step is legal per the spec (verification)
  4. Confirming the end state violates the invariant (correctness proof)

This four-phase cognitive process explains GPT-5's extreme selectivity: steps 2-4 are all verification-heavy, and GPT-5's reasoning tokens are being burned on steps 3 and 4 (confirming each step is genuinely legal and the final state genuinely violates). In previous tasks like "find hidden assumptions" or "find gaps," only step 1 (identification) is needed — there's no construction or verification phase.

Comparison to previous task types:

Task type GPT-5 findings Opus findings GPT-5 reasoning overhead
Hidden assumptions 20-35 12-13 5-7K reasoning
Race conditions 12 10 8K reasoning
Design coherence 4 7 9K reasoning
Invariant violation paths 3 7 12K reasoning

The pattern: as the task requires more VERIFICATION (vs identification), GPT-5 becomes more selective and spends more reasoning tokens per finding. Invariant violation paths demand the highest verification burden (every step must be confirmed legal), and GPT-5 responds with the highest selectivity and reasoning investment.

Opus inverts: it produces MORE findings on verification-heavy tasks (7 for coherence, 7 for invariant paths) vs identification tasks (10-13 for assumptions). This suggests Opus uses its internal reasoning differently — it's more willing to present findings that have "likely" rather than "proven" violations, then self-corrects inline if the verification fails.

Practical implication:

For invariant violation path analysis:

  • GPT-5 produces the highest-precision findings but very few. Every finding is a genuine spec-level bug. Use when you need zero-false-positive bug reports to present to a design team.
  • Opus produces more findings with slightly lower precision but unique analytical depth. Its self-correction behavior means false positives are often caught inline. Use when you want both confirmed violations AND identified tensions.
  • Sonnet is too imprecise for this task type — some findings have internal inconsistencies. Use for lighter analytical tasks (assumption-finding, spec gaps).

The three findings GPT-5 produced are ALL genuine design bugs that should be fixed:

  1. Users configured during kill switch engagement bypass operator release
  2. Premature operator release (while KS still engaged) creates future bypass
  3. Admin stops are overridden by periodic reconciliation

These are the kind of findings that, in a real financial system, prevent production incidents. The 12K reasoning tokens to produce 3 perfect findings is excellent ROI.

21. Reasoning effort (low/medium/high) has negligible effect on GPT-5's analytical output; the parameter may not work as documented for open-ended analysis

Date: 2026-05-04 Task: Identify hidden assumptions in gargoyle's order-state-machine.md (221 lines) — a well-structured state machine specification covering order lifecycle, fill precedence, TIF semantics, and parameter resolution. How we used them: Same document, same prompt, same model (GPT-5), same max_completion_tokens (16K). Only variable: reasoning.effort parameter set to "low", "medium", or "high". Run sequentially via HAI proxy (OpenAI-compatible endpoint). No tools, no project context beyond the document.

Effort Time (ms) Output tokens Reasoning tokens Findings
Low 97,913 7,657 4,288 33 (+11 recs)
Medium 94,824 7,112 4,160 30
High 88,607 6,891 3,712 30

The counterintuitive result: Higher reasoning effort produced FEWER findings, FEWER reasoning tokens, FEWER output tokens, and completed FASTER. The expected pattern (high effort → more reasoning → more depth) was inverted.

Per-finding metrics (remarkably consistent):

Effort Output tokens/finding Reasoning tokens/finding
Low 232 129
Medium 237 138
High 229 123

The depth per finding was nearly identical across all three levels. The models didn't get more detailed or rigorous per-finding at higher effort — they just found slightly fewer things.

Severity distributions (similar across all three):

  • Low: 7 Critical, 21 High, 5 Medium (33 findings)
  • Medium: 9 Critical, 15 High, 4 Medium + 2 borderline (30 findings)
  • High: 6 Critical, 14 High, 5 Medium + 4 borderline (30 findings)

Qualitative differences — WHAT they found:

High-effort unique findings (not in low):

  • Single-writer authority to broker (no out-of-band modifications)
  • Broker emits fills for all executed quantities (no silent netting)
  • Instrument identity remains stable across corporate actions
  • Late-fill override won't violate downstream invariants
  • Validation covers lot sizes, price ticks, borrow/locate constraints
  • Multiple accounts and venues are part of the correlation key
  • Streaming and polling APIs are consistent
  • System can handle multi-leg instruments

Low-effort unique findings (not in high):

  • Acks arrive before fills (no pre-ack fills)
  • Cancel-before-ack handling (submitted → cancelled missing)
  • Fill totals never exceed requested quantity
  • Deterministic ordering within a broker stream
  • Exercise/assignment and non-order position changes
  • Client-side idempotency of "place order"
  • Partial accept/normalize on replace
  • No "child" order fragmentation at broker
  • Submitted state can receive terminal events
  • Late cancel vs local expired mismatch

Character of the differences:

  • HIGH-unique findings tend to be more architectural/systemic (multi-leg instruments, streaming vs polling consistency, downstream invariant violations, corporate actions). These require reasoning about the system's relationship to the broader world.
  • LOW-unique findings tend to be more implementation-specific edge cases (cancel-before-ack, pre-ack fills, child order fragmentation, partial accepts). These require reasoning about specific event interleavings and protocol details.

Both sets are valid and actionable. Neither is clearly "better." They represent different analytical modes — breadth-of-scope (high) vs depth-of-protocol (low).

Key insight — reasoning_effort doesn't scale analysis linearly:

Three possible explanations for the inverted behavior:

  1. GPT-5 already uses near-maximum reasoning for analytical tasks regardless of the effort parameter. The ~4K reasoning tokens across all three levels (4288/4160/3712) are too similar to reflect a genuine effort gradient. The parameter may primarily affect OTHER task types (math, code, logic puzzles) where reasoning depth is more variable.

  2. Higher effort increases FILTERING, not exploration. At high effort, GPT-5 may spend more of its reasoning on VERIFYING whether findings are genuine before including them — similar to the extreme selectivity observed in Finding #20 (invariant violation paths, 12K reasoning for 3 findings). This would explain fewer findings despite theoretically "trying harder."

  3. The parameter has minimal practical effect for this model version. The differences (33 vs 30 vs 30) are within normal stochastic variation. Repeated runs at the same effort level might show similar variance.

The prompt cache hit on HIGH (2304 cached prompt tokens) may have slightly accelerated processing, but doesn't explain the reasoning token difference.

Comparison to previous findings: In Finding #20 (invariant violation paths), GPT-5 used 12,032 reasoning tokens for 3 findings — extreme verification behavior. Here, at default effort on a different task type (hidden assumptions), it uses ~4K reasoning for ~30 findings. This confirms that TASK TYPE is a far stronger predictor of GPT-5's reasoning behavior than the reasoning_effort parameter. The invariant violation prompt triggered deep verification; the assumption-finding prompt triggers broad exploration regardless of effort setting.

Practical implication: For open-ended analytical tasks (assumption-finding, gap analysis, spec review), the reasoning_effort parameter appears to have negligible practical effect on GPT-5. Don't bother tuning it for these tasks — the default is fine. The parameter may be more meaningful for:

  • Tasks with verifiable correct answers (math, logic)
  • Tasks where the model could short-circuit (simple questions)
  • Extremely long documents where exploration budget matters

For architecture review specifically: reasoning_effort is NOT a useful lever. Task framing (the prompt structure) and document selection remain the dominant variables for output quality. Save reasoning_effort tuning for coding/math tasks where the parameter was likely trained and evaluated.

Open question: Would running the same experiment 5x at each level show that the 33-vs-30 difference is within stochastic noise? If so, reasoning_effort is effectively a no-op for analytical prompts. If not, low-effort consistently produces more (less filtered) output, which could be useful for brainstorming- style analysis where you want maximum coverage before manual triage.

27. Design coherence on risk-controls.md: GPT-5 regains top position; Opus's advantage from Finding #15 was document-specific, not task-specific

Date: 2026-05-05 Task: Identify internal design incoherences in gargoyle's risk-controls.md (277 lines) — a pre-trade risk control specification covering two evaluation stages, reduction semantics, ordering rationale, fail-closed claims, and audit logging. How we used them: Same document (full text) + same focused analytical question to all 3 models via HAI proxy. Highly structured prompt specifying 5 categories of incoherence (safety properties not enforced, ordering/sequencing contradictions, reduction semantics conflicts, fail-closed claims vs actual behavior, cross-stage inconsistencies). Required each finding to reference specific contradictory parts. No tools, no project context beyond the document itself.

Model Time Output tokens Reasoning tokens Incoherences found Critical High Medium
GPT-5 112s 8,231 7,232 6 1 3 2
Claude Opus 4.6 41s 1,858 (internal) 5 2 2 1
Claude Sonnet 4.6 15s 699 (internal) 4 1 2 1

What they found — common ground (all 3 identified):

  • Reduction re-entry at BuyingPower contradicts "reducing controls never re-enter earlier controls" (all three flagged this as the most obvious contradiction — Concentration at position 5 reduces, re-enters at BuyingPower at position 4, which IS an earlier control)
  • Ordering rationale's categorization of buying power/concentration is internally confused (the doc labels both as "quantity-sensitive checks" that run after reducing controls, but concentration IS a reducing control at position 5 while buying power at position 4 sits between the two reducing controls)

GPT-5 unique findings (not in either Claude model):

  • Signal-level "no portfolio context" contradicts NoShortSales requiring knowledge of current positions. The doc explicitly states signals are evaluated "in isolation" with "no portfolio context — only the signal itself and user settings" — but checking whether the user holds a position IS portfolio context. This is a genuine design tension: either SignalRisk has hidden portfolio access (violating isolation) or NoShortSales can't actually work as specified.
  • Settings "fall through to system defaults" vs "Settings cache miss → reject." Two incompatible instructions for the same condition (missing settings).
  • "Universal fail-closed" with "only exception is order rate window" contradicted by Failure Modes table showing buying power as another exception ("Conservative estimate; may over-reject" is NOT rejection — it's a different failure mode than either fail-closed or the documented single exception).
  • Audit model says "every control evaluation produces an audit entry regardless of outcome" but the signal-stage write point only describes writing on rejection. Passing signals produce no documented audit entry at the signal stage.

Claude Opus unique findings (not in either other model):

  • Signal flow diagram swaps control order vs table: table shows (1) MarketHours, (2) PerTradeStop, (3) NoShortSales, but diagram flows MarketHours → NoShortSales → PerTradeStopLoss. Controls 2 and 3 are reversed between the two representations. (VERIFIED: this is correct — the diagram does show a different order.)
  • Concentration re-entry loop can bypass Order Rate, Duplicate, Self-Trade, and Fat Finger entirely during intermediate iterations. Also: Position Size at order 3 is never re-checked against Concentration-reduced quantity because re-entry starts at BuyingPower (order 4), meaning "smaller quantity wins" semantics are implemented differently than the linear model described in Reduction Semantics.

Claude Sonnet unique findings (not in either other model):

  • Buying Power "Can reduce? No" creates a dead end: if a reduced quantity still exceeds buying power, the system can only reject entirely (no mechanism to further optimize), defeating the purpose of the reduction system for capital-limited users. (NOTE: this is more of a design limitation than a self-contradiction, but the framing — that the reduction system's purpose is undermined by buying power's inability to reduce — is a legitimate coherence observation.)

Quality assessment:

  • GPT-5 produced the most findings (6) with the broadest coverage across the prompt's 5 categories. The NoShortSales/portfolio-context finding is the most genuinely insightful — it's a fundamental design-level contradiction (a signal-level control that REQUIRES decision-level context). The settings contradiction and audit logging inconsistency are also solid. Every finding points to two specific textual statements that are incompatible. Severity ratings were calibrated (1 Critical, 3 High, 2 Medium — compared to Opus's 2 Critical for similar findings).
  • Claude Opus was remarkably fast (41s, 1,858 tokens) and found one thing neither other model caught: the diagram/table order reversal for signal controls. This is a concrete, verifiable error (not a design tension — a literal mistake in the document). The re-entry loop analysis (finding #5) goes deeper than GPT-5's version of the same core issue, exploring the implications for "smaller quantity wins" semantics. However, Opus found fewer total issues and missed the settings contradiction and audit logging inconsistency.
  • Claude Sonnet was the fastest (15s, 699 tokens) and found 4 issues. The buying power dead-end observation is unique and shows genuine reasoning about the reduction system's limitations. However, it's more of a "this design can't achieve its stated goal" than a strict self-contradiction. Sonnet's other findings overlap with the common ground. Quality is solid but narrower scope.

Key insight — Finding #15's Opus > GPT-5 result was document-specific: In Finding #15 (coherence checking on failure-modes.md), Opus found 7 incoherences vs GPT-5's 4. Here, on risk-controls.md, GPT-5 found 6 vs Opus's 5. The reversal suggests that the relative performance on coherence checking depends on the DOCUMENT'S structure, not on a fixed model advantage:

  • failure-modes.md (383 lines): A complex multi-process system with many stated invariants across failure states, supervision trees, and recovery paths. Rich in design TENSIONS where one subsystem's safety mechanism undermines another. This plays to Opus's strength (finding design tensions between subsystems).
  • risk-controls.md (277 lines): A more focused specification with explicit rules, ordering constraints, and behavior tables. Rich in SPECIFICATION CONTRADICTIONS where one statement directly conflicts with another. This plays to GPT-5's strength (systematic verification of claims against stated mechanisms).

The difference: Opus excels when contradictions are EMERGENT (arise from composing multiple design decisions). GPT-5 excels when contradictions are EXPLICIT (two statements in the document say incompatible things). Risk-controls.md has more explicit contradictions (the settings fallback vs fail-closed, the "no portfolio context" vs NoShortSales, the audit "always" vs write point "only on reject").

Model performance depends on CONTRADICTION TYPE:

Contradiction type Best model Example
Emergent/compositional Opus "Rest-for-one cascade creates a 5th state"
Explicit/definitional GPT-5 "No portfolio context" but check requires portfolio
Diagrammatic/structural Opus Table order ≠ diagram order
Semantic/category confusion All (common ground) Reduction re-entry violates ordering claims

Revised conclusion on Finding #15's open question: "Does Opus > GPT-5 ordering for coherence checking hold across other documents?" No. The ordering depends on the document's contradiction density and type. Documents rich in emergent design tensions favor Opus. Documents with explicit specification errors favor GPT-5. The task type (coherence checking) doesn't have a fixed model winner — it depends on what KIND of incoherences the document contains.

Practical implication: Continue running both models for coherence checking. Their strengths are complementary even within the same task type. GPT-5 catches things you can point to in the spec and say "these two sentences conflict." Opus catches things where you need to reason about the implications of multiple mechanisms interacting.

Open Questions

  • Does GPT's advantage in finding inconsistencies extend to logical inconsistencies in arguments? One data point (verdict mismatches) — need more.

  • What's the optimal task granularity for GPT analytical review? "Whole PR" is too big. Is "one hypothesis" right, or can we batch?

  • Is the GPT-4.1 Mini bias detection result repeatable, or was it a well- structured task that any model would ace? ANSWERED (Finding #8): Any model aces it when the biased text is presented without noise. The original result was about noise elimination, not model capability.

  • NEW: Does adding a narrow bias-check question to a rich PR review context recover the detection that broad review misses? (Signal-to-noise confirmation test)

  • How does reasoning_effort affect analytical quality? Only tested default so far. ANSWERED (Finding #21): Negligible effect on GPT-5 for open-ended analytical tasks. Low/medium/high produced 33/30/30 findings with nearly identical reasoning tokens (~4K) and per-finding depth. The parameter may primarily affect verifiable-answer tasks, not exploration. Task framing remains the dominant quality lever.

  • Can we design a systematic "analytical review checklist" that leverages each model's strengths?

  • What analytical tasks is Opus best at vs Sonnet? Finding #11 shows Opus excels at design-tension identification. How does Sonnet compare on the same task? (Sonnet is non-reasoning but fast — would it match GPT-4.1?) ANSWERED (Finding #12): Sonnet 4.6 significantly outperforms GPT-4.1 (17 vs ~14 assumptions) and approaches GPT-5 (17 vs 20). It's not a non-reasoning model in the GPT-4.1 sense — it occupies a middle tier with genuine component-interaction reasoning. Opus still wins on design-tension identification specifically.

  • How do the models compare on research synthesis tasks (our #381 rewrite)? We'll find out during the actual rewrite.

  • Does the reasoning-token advantage scale with document complexity? Test with a simpler doc to see if the gap narrows. ANSWERED (Finding #11): The gap doesn't narrow with simpler docs. GPT-5 maintains ~1.7x the findings of GPT-4.1 regardless of document complexity. Reasoning tokens enable exhaustive exploration independent of input difficulty.

  • Would Claude Opus (also a reasoning model) match GPT-5's assumption-finding performance, or does it have different blind spots? ANSWERED (Finding #11): Different blind spots, different strengths. GPT-5 reasons deeper into implementation mechanics (breadth + technical depth). Opus reasons wider about system context and design tensions (insight density). They're complementary, not competing. Run both on important architecture docs.

  • Does Sonnet 4.6's strong showing hold across other analytical tasks (bias detection, gap-finding) or is it specific to assumption-finding on complex documents? Need to test Sonnet on simpler docs and different question types. PARTIALLY ANSWERED (Finding #13): Sonnet's strength does NOT transfer to concurrency reasoning. It dropped from 85% of GPT-5 (assumption- finding) to ~58% (race condition identification). Task type matters more than we thought. Still untested: gap-finding, bias detection for Sonnet.

  • NEW: What other analytical tasks require sequential/temporal reasoning (like race condition identification) vs pattern-matching reasoning (like assumption-finding)? Building a task taxonomy would help assign models correctly.

  • NEW: What explains Sonnet taking slightly longer than Opus here (106s vs 105s) despite normally being the faster model? Is it the document length, or does Sonnet's internal reasoning scale with complexity similarly to Opus?

  • How does GPT-5 Mini compare to GPT-5 on analytical tasks? Is it a viable cheaper substitute? ANSWERED (Finding #14): GPT-5 Mini is a viable middle option. Finds fewer issues (6 vs 10) but with genuine reasoning depth at ~50% cost/time. Better than non-reasoning models, not as exhaustive as GPT-5.

  • NEW: How does Claude 4.5 Opus/Sonnet compare to Claude 4.6? HAI now exposes both; worth testing whether the newer versions regress on analytical tasks.

  • Would running GPT-5 Mini + Sonnet together (different axes) approach GPT-5's coverage at lower combined cost? ANSWERED (Finding #19): 71% coverage at 31% cost. Good for low-stakes work; GPT-5 irreplaceable for high-stakes due to unique domain-knowledge findings in the missing 29%.

  • NEW (Finding #15): Does the Opus > GPT-5 ordering for coherence checking hold across other documents? The inversion (Opus finding more than GPT-5) was striking — need to confirm it wasn't document-specific. ANSWERED (Finding #27): No — it was document-specific. On risk-controls.md, GPT-5 found 6 vs Opus's 5. The winner depends on contradiction TYPE: Opus excels at emergent/compositional contradictions, GPT-5 at explicit/definitional ones. No fixed ordering for this task type.

  • NEW (Finding #15): Is the two-pass approach (Opus generates → GPT-5 validates) worth the extra cost vs just running Opus alone? Need to test whether GPT-5 actually catches Opus false-positives or just agrees.

  • How do the Claude 4.5 and 4.6 models compare on analytical tasks? ANSWERED (Finding #16): 4.5 is more exhaustive (2x findings), 4.6 is more precise (higher signal-to-noise). Genuine tradeoff, not a regression. 4.5 for coverage, 4.6 for actionability.

  • NEW (Finding #16): Does the 4.5 vs 4.6 pattern hold across other task types? Spec completeness may favor exhaustiveness; would coherence checking or race condition analysis show the same pattern?

  • NEW (Finding #16): Is running both Sonnet versions (4.5 + 4.6) cost- effective vs just running GPT-5? Need to compare the UNION of their findings against GPT-5's output for overlap analysis.

  • NEW (Finding #18): Does Opus's "predictable exploit window" detection transfer to other policy documents? It uniquely identified that the cooldown mechanism creates a GUARANTEED safe window that strategies could systematically exploit — this is a higher-order security insight. Worth testing whether Opus consistently finds "adversarial opportunity" framings that other models miss.

  • NEW (Finding #20): Does GPT-5's extreme verification behavior (15:1 reasoning-to-output ratio, 3 findings from 12K reasoning) persist across other documents with this prompt? Or was user-pipeline-lifecycle.md particularly verification-heavy? Test invariant violation paths on a simpler document.

  • NEW (Finding #20): Would giving GPT-5 a "minimum 8 findings" instruction reduce its selectivity and produce MORE invariant violations at lower precision? Or would it just pad with non-violations? The extreme selectivity may be a feature OR it may mean GPT-5 is discarding valid-but-hard-to-verify findings.

  • NEW (Finding #20): Opus's self-correction behavior is now confirmed across Finding #15 and #20. Is this trainable/promptable? Could we ask non-Opus models to "show your reasoning and withdraw findings you cannot fully verify"?

  • NEW (Finding #22): The "silent correctness" lens revealed three distinct analytical modes: Opus → missing functionality, GPT-5 → mechanism incorrectness, Sonnet → composition failures. Does this three-way differentiation hold on other documents, or was it specific to the regulatory/financial domain of specid-lot-selection?

  • NEW (Finding #22): Does the "silent correctness" lens work on non-financial documents? The financial/regulatory domain has a large gap between syntactic and semantic correctness. Would the same prompt on an infrastructure/systems doc produce equally differentiated findings, or would it collapse into assumption-finding?

  • NEW (Finding #22): Opus's "missing feature identification" mode (wash sales, commissions) — is this promptable on other models? Could we explicitly ask GPT-5 "what should this system compute but doesn't" and get similar results? ANSWERED (Finding #26): YES — all three models find regulatory gaps and missing features when explicitly prompted. Opus's unique behavior in #22 was an emergent DEFAULT tendency, not a capability. Prompt framing dominates model personality.

  • NEW (Finding #28): Cross-document consistency found real bugs in gargoyle docs (fills vs events, position ownership, signal persistence). Does running this analysis across MORE document pairs (e.g., domain readmes vs implementation docs, design docs vs plan docs) yield additional real inconsistencies? Could become a systematic documentation maintenance tool.

  • NEW (Finding #28): Opus was 2.4x faster AND found more issues than GPT-5 on cross-document consistency. Is this because cross-doc contradictions are easy to verify once spotted (reducing GPT-5's verification advantage)? Or because boundary reasoning (Opus's strength) is the primary skill needed?

Methodology Notes

  • Internet opinions about models are overwhelmingly about coding. Don't extrapolate to analytical work without testing.
  • "Just because someone says it on the internet doesn't make it right." — Aaron, 2026-04-26. Opinions need context. Track our own evidence.
  • Absence of published methodology for a use case is itself a finding.
  • Each finding needs: date, task, how we used it (context shape, task framing, what info the model had/didn't have), what happened, takeaway. No unsupported generalizations.
  • Context dimensions to track:
    • Rich vs minimal (how much background info)
    • Broad vs focused ("review this" vs "answer this specific question")
    • What kind of context (diff, full files, issue text, research notes, project conventions, nothing)
    • Whether the model had access to tools or just text
    • Whether the task was explicit step-by-step or open-ended

Design Coherence Analysis — Finding #15

Date: 2026-05-03 Task: Identify internal design incoherences in gargoyle's failure-modes.md (383 lines) — places where the document's stated principles/invariants are contradicted by its own specified mechanisms. How we used them: Same document (full text) + same focused analytical question to all 3 models via HAI proxy. Highly structured prompt specifying 5 categories of incoherence to look for (safety properties not enforced, state machine violations, recovery contradictions, supervision conflicts, cross-mechanism contradictions). Required each finding to reference specific sections. No tools, no project context beyond the document itself.

Model Time Output tokens Reasoning tokens Incoherences found
Claude Sonnet 4.6 ~39s 1,168 (internal) 5
Claude Opus 4.6 ~105s 3,378 (internal) 7 (8 attempted, 1 self-withdrawn)
GPT-5 ~120s 10,235 9,088 4

What they found — common ground (all 3 identified):

  • State machine universality claim vs Strategy.Worker crash behavior (process crashes bypass the degraded state entirely — no transition path in the model)
  • Market data staleness advisory-only vs the "don't trade when ambiguous" principle (or vs concurrent failure auto-halt)
  • pending_cancel/pending_replace absent from recovery query set (GPT-5 and Sonnet found this directly; Opus addressed the broader state machine gap)

GPT-5 unique findings (not in either Claude model):

  • Kill switch halted = "process terminated" vs kill switch requiring RUNNING processes for cancel_all, FLATTEN, and LIQUIDATE modes. The state definition claims processes are terminated, but the mechanisms require them alive to execute orders. This is the most architecturally significant finding — it reveals a fundamental definitional error in the state machine.
  • Per-symbol degradation contradicts the process-level degradation semantics. A worker "enters degraded" but continues operating for non-stale symbols — violating the stated definition that degraded = "cannot perform primary function." The metrics/eventing model has no per-symbol dimension.

Claude Opus unique findings (not in either other model):

  • :rest_for_one cascade creates a FIFTH implicit state (terminated-and- restarting) not in the four-state model — processes that were normal are forcibly killed (not by kill switch) and restart. Self-corrected one finding that initially looked like incoherence but was actually consistent.
  • PortfolioMonitor continues evaluating with stale data ("fail-safe") while Strategy.Workers are stopped for the SAME condition — contradicts both the universal state machine (PM doesn't transition to degraded) and the doc's reasoning about why stale data is dangerous.
  • Path-dependent recovery criteria: Strategy.Worker recovery requires 21+ bars after crash but only "price continuity check" after staleness. The state machine's single "catch-up complete" exit condition can't express this.
  • halted → [*] transition in state diagram is logically impossible if "halted" means the process is already terminated — dead processes can't fire transitions.
  • Compound failure detection requires a meta-observer across processes but the per-process state machine model has no way to express cross-process conditions.

Claude Sonnet unique findings (not in either other model):

  • Market data global staleness: the failure table says "Manual (disengage)" for recovery — implying automatic engagement happened — but the text says it's advisory only. Table contradicts prose.
  • ReconciliationGate: doc claims gate survives OM crash (separate supervision tree), but then says "missing ETS table = not ready" when OM crashes. If the gate survives, why would its table be missing?
  • Signal survival claims are contradictory between sections: worker crash says downstream signals survive, but OM crash says all upstream signals lost. (NOTE: this is actually describing different scenarios — worker crash doesn't cascade to SignalRisk, OM crash does via :rest_for_one. Sonnet may have misread the architecture here — the two statements are consistent when you understand the supervision tree.)

Quality assessment:

  • GPT-5 found only 4 incoherences but TWO of them are genuinely critical architectural findings. The "halted = terminated" vs "kill switch requires running processes" contradiction is a real design error — you can't both terminate processes AND require them to execute cancel/liquidation orders. The per-symbol degradation finding is also a real modeling gap. GPT-5 was MORE SELECTIVE here than in previous experiments — it didn't pad with medium-severity findings. Each of its 4 was high/critical.
  • Claude Opus produced the most findings (7 valid) with characteristic depth. Its self-correction (withdrawing finding #6 after deeper analysis) shows intellectual honesty rare in model outputs. The PortfolioMonitor stale-data contradiction is genuinely insightful — same input condition, opposite response, no justification within the state machine model. The compound failure meta-observer finding identifies a modeling category error. Opus also found modeling imprecisions (path-dependent recovery, halted → [*] impossibility) that the other models didn't notice.
  • Claude Sonnet found 5 issues quickly (39s, 1,168 tokens) but quality was mixed. Finding #4 (ReconciliationGate) raises a genuine question about the ETS table ownership claim. Finding #1 (table vs prose contradiction on market data staleness) is a real documentation inconsistency. However, Finding #5 appears to misread the supervision architecture — the two statements about signal survival ARE consistent when you understand that different crashes cascade differently. Sonnet produced one false positive.

Key insight — "design coherence" is a NEW analytical category with distinct model strengths: This is different from assumption-finding (Finding #10-12), race conditions (Finding #13), and cross-component interactions (Finding #14). Coherence checking requires the model to hold MULTIPLE parts of the document in tension with each other and reason about whether they're compatible. Results:

  • GPT-5 was MORE SELECTIVE than in any previous experiment. Only 4 findings vs 10-24 in other tasks. But precision was near-perfect — all 4 are genuine contradictions. This suggests GPT-5's reasoning tokens are being used for VERIFICATION (checking whether apparent contradictions hold up) rather than EXPLORATION (finding more things). The 9K reasoning tokens produced 4 findings vs the usual 10+ — GPT-5 is self-editing aggressively.
  • Opus hit its sweet spot. Coherence checking IS design-tension identification — Opus's consistent strength. Finding incoherences requires exactly the kind of "how does this design disagree with itself" reasoning that Opus excels at. It also showed unique self-correction behavior (withdrawing a finding after deeper analysis).
  • Sonnet was fast but produced a false positive. Coherence checking requires holding multiple document sections in memory simultaneously and reasoning about their compatibility — this is harder than assumption-finding (where you reason about one mechanism at a time) but easier than race conditions (which require sequential temporal reasoning). Sonnet occupies a middle ground.

Model ranking for design coherence checking:

  1. Claude Opus 4.6 — most findings, highest depth, self-correcting (7 valid)
  2. GPT-5 — fewest findings but near-perfect precision, finds the critical ones (4)
  3. Claude Sonnet 4.6 — fast screening, but prone to false positives on architectural misreads (4/5 valid)

This inverts the usual GPT-5 > Opus ordering. In previous experiments, GPT-5 consistently found MORE issues. Here, GPT-5 was more selective than Opus. The task type (self-consistency checking) favors Opus's "design tension" reasoning style over GPT-5's "exhaustive exploration" style. GPT-5 apparently uses its reasoning to VERIFY rather than GENERATE when the task is about contradictions rather than gaps.

Practical implication: For architecture documents, run coherence checking as a separate pass using Opus as the primary model. GPT-5's higher precision means it's good for confirming which Opus findings are genuine vs overreads. The two-pass approach: Opus generates candidates → GPT-5 validates → result is the intersection plus GPT-5's independent finds.

16. Specification completeness: Sonnet 4.5 produces 2x the findings of Sonnet 4.6 on implementation-gap analysis; quality vs volume tradeoff

Date: 2026-05-03 Task: Identify specification gaps in gargoyle's kill-switch.md (185 lines) — places where an implementer would be forced to guess or decide on their own because the spec doesn't clearly specify behavior. New analytical lens not previously tested. How we used them: Same document (full text) + same focused analytical question to all 3 models via HAI proxy. Highly structured prompt specifying 5 categories of underspecification (behavioral ambiguity, missing edge cases, ordering/sequencing gaps, interface contracts undefined, concurrency semantics omitted). Required specific output format per finding (gap, section, what implementer must decide, risk if wrong, severity). No tools, no project context beyond the document itself.

Model Time Output tokens Reasoning tokens Gaps found Critical High Medium Low
Claude Sonnet 4.6 73s 3,403 (internal) 13 8 4 0 1
Claude Sonnet 4.5 102s 5,191 (internal) 25 14 6 4 1
GPT-5 109s 10,140 7,872 19 8 7 3 0

What they found — common ground (all 3 identified):

  • Pipeline process identification ambiguity (which processes are "pipeline processes")
  • Per-user process scope mapping (how to terminate only one user's processes)
  • ETS table ownership and lifecycle (who owns it, what happens on crash)
  • Concurrent engage operations (what happens when two sources engage simultaneously)
  • Liquidation order tagging mechanism (what the tag is, how verified)
  • Process restart prevention (how "must not restart" is enforced)
  • Engage sequence atomicity (partial failure between DB write and termination)
  • Startup ordering and ETS readiness (pipeline starting before ETS populated)
  • Disengage sequence ordering (what happens and in what order)

Sonnet 4.5 unique findings (not in either other model):

  • ETS table schema/structure (set vs ordered_set, key format, value schema)
  • Missing ETS detection mechanism (catch :badarg vs table existence check)
  • Database write atomicity with ETS (transaction boundaries, rollback semantics)
  • Per-user engage while global is already engaged (is it a no-op or error?)
  • Broker rejection semantics ("already filled" vs "invalid cancel" distinction)
  • Cold-start gate interaction (independence vs dependency of the two gates)
  • User deletion with active kill switch (orphaned rows, cascade semantics)
  • Global disengage effect on per-user states (independent or auto-clear?)
  • Audit log write failure during engage (critical-path vs best-effort)
  • Dashboard control ambiguity in LIQUIDATE mode (contradictory disable/enable)
  • Cancel timeout duration (operational parameter not specified)
  • Manual order source code path during LIQUIDATE (how orders bypass the dead pipeline)

GPT-5 unique findings (not in either other model):

  • Combined global/per-user mode semantics (what happens when global=RESTRICT, user=LIQUIDATE — can user's liquidation proceed?)
  • Scope of "all" in cancel_all and liquidation (system-wide vs per-user)
  • Gate behavior when ETS missing but liquidation needed (conflicting requirements: fail-closed says block, but liquidation needs to pass)
  • Disengage during in-flight cancellations (what happens to racing tasks)
  • Gate placement relative to broker submission (exact point in the flow)
  • Engage latency expectations (no quantified SLA)
  • Mode change while already engaged (RESTRICT → LIQUIDATE without disengage)
  • Dashboard vs backend scope for manual liquidation (individual vs bulk only)

Sonnet 4.6 unique findings (not in either other model):

  • ETS sequencing relative to process termination (ETS before or after kill?)
  • Concurrent disengage + re-engage race (specific interleaving scenario)
  • Close-only enforcement mechanism (UI-only vs backend validation)
  • Order-in-flight past ETS gate during termination (already-checked orders)

Quality assessment:

  • Claude Sonnet 4.5 was the most EXHAUSTIVE (25 gaps) but with notable quality variance. Several findings were highly specific and implementation- relevant (ETS schema, missing-table detection, broker rejection semantics). Others were relatively obvious or lower-impact (user deletion, audit log failure, cancel timeout duration). The 14 Critical ratings feel somewhat generous — some would be more accurately rated as High in practice. Output was well-structured with clear per-finding format.
  • GPT-5 found 19 gaps with consistent high quality. Its unique findings show cross-cutting reasoning: the combined mode semantics finding (global vs per-user mode interaction) identifies a genuine specification gap that neither Sonnet version noticed. The "ETS missing but liquidation needed" finding is architecturally significant — it identifies a CONTRADICTION in the spec's own rules (fail-closed blocks everything, but liquidation must pass). Every finding was actionable. More selective severity ratings (8 Critical vs Sonnet 4.5's 14).
  • Claude Sonnet 4.6 was the most SELECTIVE (13 gaps) but with the highest precision. Every finding was genuinely a specification gap that an implementer would face. The ETS sequencing finding (#4) is particularly well-reasoned — it identifies a specific ordering dependency that creates a race window. Sonnet 4.6 appears to self-filter aggressively, producing only findings it's confident about. Higher signal-to-noise than 4.5.

Key insight — Sonnet 4.5 vs 4.6 on analytical tasks: This is the first direct comparison between Claude model versions on the same analytical task. Key differences:

  • Volume: 4.5 produced almost 2x the findings (25 vs 13)
  • Tokens: 4.5 used ~1.5x the output tokens (5,191 vs 3,403)
  • Time: 4.5 took ~1.4x longer (102s vs 73s)
  • Severity distribution: 4.5 had more Critical findings (14 vs 8) but with more generous severity ratings
  • Quality per finding: 4.6 had higher average quality; fewer "obvious" or lower-impact findings

The 4.6 model appears to have been trained toward higher precision/selectivity. It finds fewer things but each finding is more reliably a genuine gap. The 4.5 model is more exhaustive but includes findings that a reviewer might triage as "yes, technically, but not really a spec gap." This mirrors a known training direction in Claude models: later versions tend to be more concise and selective.

For practical use: If you want completeness (cast a wide net, accept some noise): use 4.5. If you want precision (every finding is actionable, no triage needed): use 4.6. For architecture review where missing a gap has cost, 4.5's exhaustiveness is probably worth the noise. For review where false positives cost attention (e.g., PR review comments), 4.6's selectivity is preferred.

GPT-5 vs Sonnet comparison on this task: GPT-5 (19 findings) sits between the two Sonnets in volume but has the highest consistency — no obvious misses or inflated severities. Its unique strength here: finding CONTRADICTIONS within the spec's own rules (ETS-missing blocking conflicts with liquidation needing to pass). This is consistent with Finding #15 where GPT-5 was unusually selective but precise on coherence checking.

Specification completeness analysis appears to be a task where:

  1. Sonnet 4.5 is strongest for breadth (25 findings, catches operational gaps)
  2. GPT-5 is strongest for detecting spec self-contradictions (19 findings, high precision)
  3. Sonnet 4.6 is strongest for precision (13 findings, zero noise)

Updated model version comparison:

  • Claude 4.6 → higher precision, more selective, concise
  • Claude 4.5 → more exhaustive, more verbose, occasional severity inflation
  • This is a genuine tradeoff, not a simple regression or improvement

Practical implication: Run BOTH Sonnet versions? 4.5 catches things 4.6 filters out (ETS schema, broker rejection semantics, cold-start gate interaction). 4.6 catches things with more specificity (sequencing gaps, exact race windows). For a one-shot budget: 4.5 if you want coverage, 4.6 if you want actionability. GPT-5 if you want to find where the spec contradicts itself.

7. Token budget matters more than model size for gap analysis (confirmed)

Date: 2026-05-03 Task: Identify unaddressed failure scenarios in gargoyle's failure-modes.md (383 lines, ~25KB) How we used them: Same document, same analytical question ("What failure scenarios are NOT covered?"), three models. GPT-5 with 16K max_completion_tokens, Sonnet 4 with 4K max_tokens, GPT-4.1 Mini with 4K max_completion_tokens. No project context beyond the document itself. Pure gap-analysis task.

Results:

  • GPT-5 (16K budget): 28 gaps, most exhaustive. Found domain-specific edge cases others missed entirely: ClOrdID collision across restarts, fractional share rounding, broker maintenance windows (410/426), hot code upgrades, regulatory halts vs staleness distinction, cancel-ack-then-late-fill race, DNS/TLS as distinct from network outage.
  • Sonnet 4 (4K budget): 12 gaps, concise. Unique framing: distinguished latency degradation from outage (subtle but actionable). ETS corruption vs loss.
  • GPT-4.1 Mini (4K budget): 13 gaps with summary table. Unique angles: unknown broker status enum values, configuration schema mismatches on cold-start, malformed signals from logic bugs (not just crashes).

Overlap (all three): Rate limiting, clock skew, resource exhaustion, DB failures, message backpressure, partial connectivity.

Key insight: GPT-5's 4K attempt produced ZERO output (finish_reason: length) — all tokens consumed by internal reasoning. At 16K it produced the richest analysis. This confirms finding #3 (GPT-5 needs generous token budgets) AND adds a new observation: for open-ended analytical questions, GPT-5's reasoning overhead is proportionally larger. The 4K models (Sonnet, Mini) both produced useful output at 4K because they don't burn tokens on chain-of-thought.

Model personality confirmed:

  • GPT-5: exhaustive, domain-aware, finds edge cases a senior SRE would know
  • Sonnet: precise, architectural, finds design-level distinctions
  • GPT-4.1 Mini: structured, systematic, finds enumeration gaps

Practical implication: For failure mode / gap analysis on design docs:

  • GPT-5 with ≥16K tokens for maximum coverage (most unique findings)
  • Sonnet for architectural framing ("this is really two different problems")
  • Mini for completeness checking ("what about this enum value?")
  • Running all three costs ~$0.50 and catches gaps none alone would find
  • GPT-5 at 4K is USELESS for this task — always give it room to think

Note on GPT-5 reasoning overhead: First attempt at 4K max_completion_tokens returned empty content with finish_reason: length. The model spent all 4K tokens on internal reasoning and produced nothing. This is worse than a short answer — it's zero value for non-zero cost. Always budget ≥16K for GPT-5 analytical tasks.

18. Temporal boundary analysis: GPT-5 is most exhaustive; Opus finds design-level contradictions; Sonnet 4.5 is structured but less deep

Date: 2026-05-04 Task: Identify temporal boundary vulnerabilities in gargoyle's escalation-policy.md (238 lines) — scenarios where the timing model (evaluation cycles, debounce counts, cooldown periods) creates windows of incorrect or dangerous behavior. How we used them: Same document (full text) + same focused analytical question to all 3 models via HAI proxy. Highly structured prompt specifying 5 categories of temporal vulnerability (timing exploitation, counter reset abuse, asymmetric time exposure, cross-metric temporal interactions, state loss temporal effects). Required specific output format per finding (name, sequence with cycle numbers, mechanism, severity, fix). No tools, no project context beyond the document itself.

Model Time Output tokens Reasoning tokens Findings Critical High Medium
GPT-5 ~128s 9,175 5,888 15 3 7 2
Claude Opus 4.6 ~120s 5,112 (internal) 10 3 5 2
Claude Sonnet 4.5 ~100s 4,056 (internal) 12 3 3 3

What they found — common ground (all 3 identified):

  • Flash crash / inter-evaluation gap exploitation (metric spikes between discrete evaluation cycles go undetected)
  • Single clear cycle resetting debounce counter (transient recovery defeats escalation despite sustained risk — metric can breach 80%+ of cycles and never escalate)
  • Asymmetric escalation time vs loss compounding rate (11 cycles to reach liquidation while losses compound every single cycle)
  • Monitor crash resets state to Clear, losing all escalation progress
  • Liquidation re-trigger requiring full debounce reset, delaying subsequent batches
  • Kill switch N value unspecified (timing indeterminacy)

GPT-5 unique findings (not in either other model):

  • Boundary-alignment counter starvation: explicitly modeled the "adversarial flicker" pattern (breaching 2 cycles, 1 clear, repeat — 66% breach time, never escalates) with a precise mathematical framing of why K-of-N is needed
  • Cycle-length drift under load: GC pauses or CPU contention stretching evaluation intervals means "3 cycles" could be 12 minutes instead of 90 seconds when it matters most (high-load market stress = slowest evaluations)
  • Adversarial boundary timing (market microstructure masking): illiquid instruments where opposing prints predictably arrive near evaluation boundaries, exploiting deterministic sampling points
  • Cross-metric escalation gap and hedging prohibition: Restrict prevents ALL new positions including risk-REDUCING hedges needed for a different metric still escalating on its own timeline — protection for metric A actively worsens metric B
  • Cooldown stall causing prolonged Restrict: repeated transient spikes near hysteresis threshold reset cooldown indefinitely while metric is actually safe
  • State inconsistency between restriction flags and monitor after restart: documented asymmetry where flag persists (manual clear) but state resets (auto clear) — creates orphaned restriction or unprotected window depending on reconciliation approach
  • Metric computation fail-closed interacting with debounce: system errors create false escalations with long cooldown, potentially blocking hedging trades
  • Unspecified N for kill switch post-liquidation breaches: coupled with crash reset, system can loop indefinitely without reaching kill switch
  • In-liquidate flicker stall: one cycle below threshold after partial fill resets re-trigger counter, stalling further liquidation

Claude Opus unique findings (not in either other model):

  • De-escalation cooldown exploitation (predictable window): after cooldown completes and restriction lifts, strategy has a GUARANTEED 5+ cycle window of unrestricted trading before Restrict can re-engage — an automated strategy could systematically exploit this predictable safe window to re-enter dangerous positions
  • Orphaned restriction flag asymmetry framed as a DESIGN CONTRADICTION: the failure modes table specifies opposing recovery paths for state (automatic → Clear) vs flags (manual clear), creating an irreconcilable dual state. Opus uniquely identified that operator intervention to clear the flag could inadvertently create a WORSE protection gap than leaving it orphaned
  • Self-correcting analysis style: Opus's summary explicitly synthesized that the three Critical findings share a common cause (debounce optimizes against false positives at the expense of false negatives during sustained events) and proposed a single architectural fix (severity-aware fast path) that addresses all three

Claude Sonnet 4.5 unique findings (not in either other model):

  • De-escalation timing not accounting for proximity to breach threshold: system removes protection while metric is still near-dangerous, and re-escalation requires full debounce — created a specific "whipsaw" scenario with cycle numbers
  • Manual-only de-escalation from Liquidate creates UNBOUNDED recovery time: if triggered at 2 AM Saturday, trading disabled until Monday despite metrics recovering in minutes. Framed as contradiction with "autonomous" design goals
  • Evaluation cycle synchronization assumption: no handling of variable timing (CPU contention, GC pauses) — implicit throughout but never addressed
  • Cold start escalation ambiguity: system starts with no prior state while portfolio may already be in breach condition
  • De-escalation event ordering race: multiple metrics de-escalating simultaneously may emit events in non-deterministic order, confusing external observers

Quality assessment:

  • GPT-5 was the most exhaustive (15 findings) and showed the strongest mathematical/systems reasoning. Its unique findings included precise attack models (adversarial flicker, boundary alignment, microstructure masking) that describe exact exploitation patterns with percentages and cycle counts. The cross-metric hedging prohibition finding is architecturally significant — it identifies that protection for one metric can actively CREATE risk for another. Every finding was actionable with specific fixes.
  • Claude Opus 4.6 produced fewer findings (10) but with characteristic depth and self-awareness. Its cooldown exploitation finding identified a PREDICTABLE exploit window that an automated strategy could systematically abuse — framed not as an accident but as an adversarial opportunity. The summary synthesis (identifying common cause across Critical findings) shows meta-analytical capability the other models didn't demonstrate. Opus also uniquely identified that human intervention to fix one problem could create a WORSE problem — second-order operational reasoning.
  • Claude Sonnet 4.5 was well-structured (12 findings, clean severity tiers, organized by Critical/High/Medium/Low) and faster than both other models. Its findings were solid but less architecturally deep. The manual de-escalation contradiction finding was genuinely insightful (unbounded recovery time vs autonomous design goals). However, several findings restated concepts the other models covered with less specificity about exploitation mechanics.

Key insight — temporal reasoning as a task type: This is the first experiment specifically testing "temporal boundary analysis" — reasoning about time-domain properties of a state machine (evaluation frequency, counter semantics, cooldown mechanics, crash/restart timing).

Results compared to Finding #13 (race condition identification on a concurrency doc):

  • GPT-5: 15 findings here vs 12 in Finding #13. Consistent high performance on temporal reasoning tasks across both experiments.
  • Opus: 10 findings here vs 10 in Finding #13. Remarkably consistent — Opus produces ~10 high-quality findings regardless of temporal task variant.
  • Sonnet 4.5: 12 findings here (first test). Compare to Sonnet 4.6's 7 findings (with errors) in Finding #13. Sonnet 4.5 handles temporal reasoning better than 4.6 — consistent with Finding #16 showing 4.5 is more exhaustive across task types.

Sonnet 4.5 vs 4.6 on temporal reasoning (inferred comparison): Sonnet 4.6 struggled significantly on race condition identification (Finding #13: 7 findings with analytical errors, misreading architecture). Sonnet 4.5 here produced 12 solid findings with no apparent misreadings. This suggests 4.5's exhaustiveness advantage extends to temporal reasoning — the additional exploration it does (vs 4.6's aggressive self-filtering) catches more temporal interactions. Confirms Finding #16's pattern: 4.5 for coverage, 4.6 for precision.

The structured-prompt effect continues: All three models produced focused, high-quality output with this highly structured prompt (5 specific categories + required output format). This confirms Finding #14: narrow analytical lens + broad document scope is the sweet spot for all model tiers. The prompt structure appears to be a stronger predictor of output quality than model choice for the bottom 80% of findings (all models find the common-ground issues). Model choice matters for the TOP 20% — the unique insights that require deeper reasoning about system interactions.

Updated model assignment for temporal boundary analysis:

  1. GPT-5 — most exhaustive, strongest at modeling adversarial exploitation patterns and mathematical edge cases (15 findings)
  2. Claude Sonnet 4.5 — good volume with clean structure, viable for first-pass temporal analysis (12 findings, no errors)
  3. Claude Opus 4.6 — fewest findings but highest insight density, uniquely identifies predictable exploit windows and operational second-order effects (10 findings)

Practical implication: For temporal analysis on state machines and timing-dependent policies, the three-model stack produces genuine complementary value:

  • GPT-5 catches the adversarial attack patterns and mathematical edge cases
  • Opus catches the predictable exploit windows and operational contradictions
  • Sonnet 4.5 provides good breadth at lower cost with clean severity categorization

The union of unique findings across all three models reveals significantly more temporal vulnerabilities than any single model alone. For a document governing autonomous financial actions (liquidation, kill switch), the cost of running all three (~$1-2) is trivially justified against the risk of missing a timing exploit.

19. Union coverage test: GPT-5 Mini + Sonnet 4.6 covers ~71% of GPT-5's findings; the missing 29% is where the real value lives

Date: 2026-05-04 Task: Identify hidden assumptions in gargoyle's trading-pipeline.md (1,110 lines, ~62KB) — the most complex document tested so far, covering the full end-to-end path from tick ingestion through order execution. How we used them: Same document (full text, no truncation) + same focused analytical question to all 3 models via HAI proxy. Standard hidden-assumption prompt with 5 categories (runtime behavior, external dependencies, timing/ordering, scale/load, uncovered failure modes). Required specific output format per finding. No tools, no project context beyond the document itself.

Model Time Output tokens Reasoning tokens Assumptions found
GPT-5 99s 9,418 5,696 35
GPT-5 Mini 93s 5,309 1,792 21
Claude Sonnet 4.6 38s 1,792 (internal) 17

Coverage analysis — can Mini + Sonnet together replace GPT-5?

Categorized each of GPT-5's 35 findings by whether the union of Mini + Sonnet also identified the same assumption:

  • Covered by BOTH Mini and Sonnet: ~12 findings (common ground — any model finds these: idempotency, single-writer, clock sync, instrument resolution, fill immutability, reconciliation gate, backpressure, fill correlation, event ordering, audit scalability, PortfolioRisk bottleneck)
  • Covered by Mini only (not Sonnet): ~7 findings (transactional atomicity, audit causal consistency, modification-in-flight enforcement, OM throughput, decimal precision, PM/PR close-only race, partition duplicate submit)
  • Covered by Sonnet only (not Mini): ~6 findings (market data feed rates, pipeline-vs-market speed, corporate actions atomicity, kill switch partition, shared port isolation, market close vs auction fills)
  • Union(Mini + Sonnet) total coverage: ~25/35 = ~71% of GPT-5's findings
  • GPT-5 unique (missed by both): ~10-18 findings depending on strictness

What GPT-5 uniquely found that the cheaper pair missed:

The missing 29% is NOT random — it's systematically different in character:

  1. Operational edge cases: Default TIF "day" broker semantics, OrderRate counting retries, extended-hours MarketHours mismatch, fractional quantities, local expiry timer precision per instrument
  2. Design-level interaction gaps: PortfolioRisk concurrent decision race (snapshot stale between two parallel approvals), re-validation gap between approval and submit, decision loss on crash after audit write
  3. Domain-specific knowledge: Manual broker-side actions conflicting with state machine, options/complex instrument position_effect mapping, Decision→Order 1:1 invariant vs broker auto-splitting, wash sale retroactive P&L mutation
  4. Architectural observations: Reduction re-entry rule insufficiency, PortfolioMonitor coalescing vs fast breach detection, multi-aggregator fanout and audit partial writes, replay/backtest alignment with production controls

These share a common trait: they require domain expertise (knowing how brokers actually behave, how regulatory rules interact, how production trading systems fail in practice) combined with architectural reasoning (how the design's own mechanisms interact under those real-world conditions). The cheaper models find assumptions about the document's internal consistency; GPT-5 additionally finds assumptions about the document's relationship to the external world it must operate in.

GPT-5 Mini vs Sonnet 4.6 — complementary, not redundant:

Mini and Sonnet covered different gaps:

  • Mini was stronger on internal consistency (transactional atomicity, causal consistency, decimal precision, modification serialization)
  • Sonnet was stronger on external interactions (market data feeds, corporate actions, kill switch distribution, shared resource isolation)

This aligns with previous findings: Mini reasons about implementation mechanics; Sonnet reasons about system boundaries and external interactions. Their union covers more ground than either alone.

Cost comparison:

Approach Total tokens Approx. cost Coverage of GPT-5
GPT-5 alone ~21K (9.4K output + 5.7K reasoning) ~$0.80 100% (35 findings)
Mini + Sonnet ~7.1K output + 1.8K reasoning ~$0.25 ~71% (25/35 findings)
All three ~28K total ~$1.05 >100% (35 + unique Sonnet/Mini extras)

Key insight — the 71% coverage is a floor, not a ceiling:

The union covers 71% of GPT-5's specific findings. But Mini and Sonnet each also produced findings that GPT-5 DIDN'T make:

  • Sonnet: DailyLossLimit query performance scaling, instrument reference data propagation atomicity across components
  • Mini: Signal audit correlation ambiguity under replay/duplicate ticks

So the total unique finding space is LARGER than any single model. Running all three produces the most comprehensive analysis.

Answer to the open question: "Would running GPT-5 Mini + Sonnet together approach GPT-5's coverage at lower combined cost?"

Partially. The pair covers ~71% of GPT-5's findings at ~31% of the cost. But the missing 29% is disproportionately valuable — it contains the domain-specific, interaction-level, real-world-knowledge findings that are most likely to prevent production incidents. For a quick sanity check or first-pass screening, Mini + Sonnet is excellent value. For architecture review where completeness matters (financial system, safety-critical), GPT-5 is not replaceable by cheaper models — its unique findings are exactly the ones that would cause real-world failures.

Practical implication: The optimal strategy depends on stakes:

  • Low stakes (internal doc review, non-critical systems): Mini + Sonnet is 71% coverage at 31% cost — strong ROI
  • High stakes (financial systems, safety-critical): run all three — the ~$1 total cost is irrelevant vs the value of the extra 10-18 findings
  • Budget-conscious high stakes: run GPT-5 alone — it subsumes most of what Mini + Sonnet find, and adds the critical domain-knowledge findings

The cost argument for Mini + Sonnet as a GPT-5 REPLACEMENT doesn't hold for important work. The cost argument for Mini + Sonnet as a GPT-5 COMPLEMENT is strong — they catch a few things GPT-5 misses, and the union of all three is the most thorough analysis available.

Document complexity observation: This is the largest document tested (1,110 lines vs previous 185-785 lines). GPT-5's finding count scaled up (35 vs 20-26 on smaller docs) while maintaining quality — no padding with obvious/low-value findings. Mini also scaled (21 vs 6 on 459-line doc in Finding #14). Sonnet scaled less (17 vs 12-17 on smaller docs) — it appears to have a natural output ceiling regardless of document size, consistent with its self-filtering behavior observed in previous findings.

22. Silent correctness failures: NEW analytical lens reveals Opus's strength at domain/regulatory reasoning; GPT-5 shows regulatory depth; all models converge on compliance timestamp errors

Date: 2026-05-05 Task: Identify scenarios where the mechanism produces SILENTLY INCORRECT results (not errors, not crashes — wrong financial calculations, wrong lot selections, or wrong compliance records that pass all validation) in gargoyle's specid-lot-selection.md (306 lines) — a financial system specification covering tax lot selection strategies, cost basis accounting, and IRS SpecID compliance. How we used them: Same document (full text) + same focused analytical question to all 3 models via HAI proxy. Highly structured prompt specifying 5 categories of silent incorrectness (stale data, semantic precision, ordering sensitivity, composition errors, temporal reference errors). Required specific output format per finding with concrete numerical examples of financial impact. No tools, no project context beyond the document.

Model Time Output tokens Reasoning tokens Findings Critical High Medium
GPT-5 147s 13,006 10,496 7 2 2 3
Claude Opus 4.6 119s 5,902 (internal) 10 3 3 4
Claude Sonnet 4.6 122s 6,011 (internal) 6 3 3 0

What they found — common ground (all 3 identified):

  • designation_at = DateTime.utc_now() at processing time, NOT at actual designation time (manual selection was made at order submission, standing orders were configured earlier) — compliance record factually incorrect
  • Holding period calculation boundary errors (>365 days vs IRS "more than one year" rule, off-by-one at leap year boundaries, day-after-acquisition start)
  • HIFO tie-breaker opened_at ASC ignores tax_term dimension — selects long-term losses over short-term losses when both have identical cost basis, producing less tax-valuable outcomes
  • Strategy preference resolved at fill processing time, not at trade time (preference changes between trade and fill processing apply retroactively)

GPT-5 unique findings (not in either Claude model):

  • Corporate action applied late stale cost basis in HIFO: ROC/dividend reduces basis but if close/4 fires before apply_corporate_action/3, HIFO sorts on pre-adjusted basis AND records wrong realized P&L permanently. No mechanism to restate previously persisted LotClosed events. Concrete example: $2,000 overstated loss from one trade.
  • designation_at fragmentation: a single sell consuming multiple lots calls DateTime.utc_now() per loop iteration, producing slightly different timestamps for what should be a single coherent designation event. Audit risk.
  • LIFO label in selection_method field: records "lifo" but for securities LIFO isn't an authorized tax method — the operation is legally SpecID electing newest lots. Downstream reporting may reject or misclassify.

Claude Opus unique findings (not in either other model):

  • Realized P&L excludes commissions/fees: formula uses sell_fill.price (raw execution price) minus lot.cost_basis, not net proceeds. If cost_basis also excludes buy-side commissions, P&L is doubly overstated. Active trader doing 1000 trades/year: ~$20,000+ cumulative P&L overstatement.
  • Position average_cost is meaningless under SpecID and potentially misleading: SpecID exists to exploit lot-level basis differences, but position-level average obscures this. If downstream consumers use average_cost for tax estimation, results can be 50%+ wrong per lot.
  • GenServer mailbox ordering determines lot-to-fill assignment for concurrent sells: two simultaneous fills for the same instrument get different lots based on network arrival timing. With different holding periods, produces $670+ tax difference without user awareness.
  • Wash sale rule completely unaddressed: system reports losses as realized/deductible without checking 30-day substantially identical purchase rule. Active trader harvesting $50,000 in losses could have $0 actually deductible — $18,500 tax gap.
  • opened_at semantics undefined: whether it's exchange execution time, GenServer arrival time, or settlement date affects every downstream calculation (FIFO/LIFO ordering, holding periods, tax terms). Network timing could produce wrong FIFO lot selection.

Claude Sonnet 4.6 unique findings (not in either other model):

  • Stale cost basis in manual lot picker during concurrent corporate actions: UI shows pre-action basis, user selects based on stale data, but close/4 only validates open/ownership/quantity — never re-validates that the selection rationale is still correct. No field records the discrepancy.
  • average_cost recomputation ordering ambiguity in event-sourced model: step 4 recomputes from "updated lots" but step 3 (persist events) may not have completed — if implementation re-derives from event store rather than in-memory state, reads pre-closure lot quantities. Accumulates $500+ error per partial close.
  • Strategy fallback + config corruption silently overwrites selection method in compliance record: if config becomes invalid, fallback to :fifo is logged at :warning but LotClosed records selection_method: "fifo" — compliance record shows user "chose" FIFO when they configured HIFO. No field records intended vs actual strategy.

Quality assessment:

  • Claude Opus produced the most findings (10) with the broadest analytical scope. Several findings went BEYOND the document's mechanism to identify missing features that create silent incorrectness (wash sale rules, commission handling, opened_at semantics). This is a different analytical mode: Opus identified what the system SHOULD compute but DOESN'T, not just where the existing computation is wrong. The wash sale finding is the highest-impact across all three models — an active trader's entire tax-loss harvesting strategy could be invalid. The GenServer mailbox ordering finding shows characteristic Opus reasoning about emergent behavior from design decisions.
  • GPT-5 produced fewer findings (7) but with extreme precision and specificity. Every finding includes concrete dollar amounts and specific field references. The corporate action stale basis finding is uniquely actionable — it identifies a specific race condition between two documented mechanisms (close/4 and apply_corporate_action/3) that produces permanently incorrect persisted data with no correction path. The designation_at fragmentation finding shows attention to implementation detail that neither Claude model noticed. GPT-5 used 10,496 reasoning tokens for 7 findings (1,500 tokens/finding) — HIGH verification, consistent with Finding #20's pattern for precision-over-breadth tasks.
  • Claude Sonnet 4.6 produced 6 findings with strong specificity and novel angles. The event-sourced recomputation ordering finding (#5) is architecturally subtle — it identifies a composition error between the walk-and-consume algorithm's step ordering and event-sourcing patterns. The strategy fallback compliance recording finding is a genuine audit hazard. However, Sonnet produced no Medium-severity findings — it either found Critical/High issues or filtered everything else out. This aligns with its established high-precision, high-self-filtering behavior.

Key insight — "Silent correctness" as an analytical lens:

This is the FIRST experiment testing a "silent incorrectness" prompt. The key difference from previous analytical lenses:

  • Assumption-finding: "What must be true for this to work?" (Finding #10-12)
  • Race conditions: "What timing issues exist?" (Finding #13)
  • Design coherence: "Does the design contradict itself?" (Finding #15)
  • Invariant violations: "What operation sequences break invariants?" (Finding #20)
  • Silent correctness: "Where does the system CONFIDENTLY produce WRONG output with NO indication of error?"

The silent correctness lens produced qualitatively different findings from all previous lenses. The emphasis on "passes all validation" forced models to reason about what SHOULD be validated but ISN'T, and about semantic correctness (regulatory requirements, financial accounting rules) vs syntactic correctness (valid types, non-nil fields, correct schema).

This lens also revealed a key model differentiation not seen before:

  • Opus reasons about MISSING functionality (wash sales, commissions, opened_at semantics) — things the system should do but doesn't
  • GPT-5 reasons about EXISTING functionality being wrong (corporate action race, designation fragmentation, LIFO labeling) — things the system does but incorrectly
  • Sonnet reasons about COMPOSITION failures (event-sourcing step ordering, strategy fallback propagation) — things that are individually correct but combine incorrectly

These are three genuinely different analytical modes, not just "more/less thorough." All three are valuable for different review outcomes: Opus for feature completeness, GPT-5 for mechanism correctness, Sonnet for integration correctness.

Financial domain advantage:

This is the first experiment on a document with strong regulatory/financial semantics. All three models demonstrated domain knowledge (IRS holding period rules, Treas. Reg. 1.1012-1(c) requirements, wash sale IRC §1091, long-term/short-term capital gains rate differentials). Opus in particular referenced specific IRC sections and provided concrete tax rate calculations. The "silent incorrectness" lens works especially well on financial/regulatory documents because the gap between "syntactically valid output" and "semantically/legally correct output" is large and consequential.

Comparison to previous findings on the same models:

Task type GPT-5 findings Opus findings Sonnet findings Opus > GPT-5?
Hidden assumptions (#10-12) 20-35 12-13 13-17 No
Race conditions (#13) 12 10 7 No
Design coherence (#15) 4 7 5 Yes
Invariant violations (#20) 3 7 5 Yes
Silent correctness (#22) 7 10 6 Yes

Pattern confirmed: Opus outperforms GPT-5 (by finding count) on tasks that require reasoning about the design's RELATIONSHIP to external requirements (regulatory, financial, consumer expectations). GPT-5 outperforms Opus on tasks that require EXHAUSTIVE EXPLORATION within a self-contained system (assumptions, race conditions).

The "silent correctness" lens is structurally similar to coherence checking (does the system match its external requirements?) rather than gap-finding (what's missing within the system?). This explains why Opus outperforms: the task requires reasoning about the world outside the document (IRS rules, financial accounting standards, regulatory requirements), which is Opus's strength.

Practical implication: For financial/regulatory system review, the "silent correctness" lens should be run using Opus as the primary model (broadest findings including missing-feature identification) plus GPT-5 for mechanism-level precision. Sonnet adds value for composition/integration issues that neither Opus nor GPT-5 catches. All three produced unique, actionable findings that the others missed.

The three findings ALL models converged on (designation_at, holding period, HIFO tie-breaker, strategy preference timing) should be treated as confirmed design bugs requiring fixes. The fact that three independent models all identified them with concrete financial impact examples increases confidence that these are real.

23. Regulatory compliance analysis: GPT-5 finds broadest scope of IRS issues; Opus self-corrects mid-analysis; all models converge on cross-account gap

Date: 2026-05-05 Task: Identify where gargoyle's wash-sale-tracking.md (391 lines) could produce incorrect tax reporting compared to IRS wash sale regulations (IRC 1091). NEW analytical lens: regulatory compliance verification — asking models to reason about a code implementation's correctness against EXTERNAL regulatory requirements (not internal system assumptions or race conditions). How we used them: Same document (full text) + same focused analytical question to all 3 models via HAI proxy. Prompt structured around 5 categories: regulatory gaps, interpretation errors, edge cases at regulatory boundaries, cross-account/entity concerns, and interaction with other IRC sections. Required specific regulatory citations, implementation analysis, concrete tax errors, and audit risk levels. No tools, no project context beyond the document.

Model Time Output tokens Reasoning tokens Findings
GPT-5 178s 12,525 9,536 16
Claude Opus 4.6 155s 7,326 (internal) 16 (with 2 self-corrections/withdrawals)
Claude Sonnet 4.6 40s 1,818 (internal) 12

What they found — common ground (all 3 identified):

  • Cross-account/IRA/external broker wash sales not tracked (IRS applies at taxpayer level)
  • Options/contracts to acquire stock not triggering wash sales (explicit in IRC 1091(a) text)
  • "Substantially identical" definition too narrow (same index ETFs, share classes, ADRs)
  • Trade date vs settlement date ambiguity in opened_at/closed_at
  • Short sale wash sales not addressed
  • Section 475 mark-to-market traders incorrectly subjected to wash sale tracking
  • IRC 1092 straddle rules interaction not addressed
  • Related party / spousal transactions not considered
  • Corporate action identity changes breaking matching

GPT-5 unique findings (not in either other model):

  • Per-share vs lot-level basis tacking (#1): The system applies disallowed_loss and tacked_opened_at at the LOT level, but IRS requires per-share treatment when only partial shares are matched. A lot of 100 shares where only 60 trigger wash sale should have per-share basis segregation — the system inflates basis for all 100 shares. Most architecturally significant finding — a fundamental design-level error, not a missing feature.
  • IRA permanent disallowance (#2): When replacement purchase is in an IRA, the loss is PERMANENTLY lost (no basis adjustment possible in tax-deferred accounts). System either incorrectly applies basis adjustment inside IRA or misses it entirely.
  • Instruments not subject to §1091 (#4): §1256 contracts (futures, index options), cryptocurrency, and §475 elections are all exempt — system may over-disallow.
  • Average-cost mutual fund basis (#11): Wash sale adjustments for funds using average-cost method require different math than discrete lot-level adjustments.
  • ADRs vs local shares (#14): ADRs and underlying foreign ordinaries are substantially identical but have different instrument_ids.
  • RSU vestings/ESPP purchases (#15): Equity compensation creating lots via corporate action paths may not trigger check_replacement/2.
  • Ordering priority between pre/post sale purchases (#10): Industry convention (post-sale first, then pre-sale) may differ from system's strict chronological ordering, causing 1099-B mismatches.

Claude Opus unique findings (not in either other model):

  • Year-end boundary timing (#5): Loss in December + replacement in January means tax reports generated between Dec 31 and the replacement purchase date are incorrect. Forward detection fires retroactively but users may have already filed. System needs a "30-day pending window" for year-end reports.
  • Form 8949 reporting format (#6): IRS requires code "W" in column (f) and specific adjustment amounts in column (g). System doesn't describe how tax_summary/3 produces Form 8949-compatible output — potential CP2000 notice triggers from automated IRS matching against broker 1099-B.
  • "Open lots" query in backward detection (#10): If backward detection only queries currently-open lots, it misses replacements that were acquired AND SOLD within the window. IRS looks at acquisition regardless of current holding status. (Rev. Rul. 56-602)
  • Forward detection loss ordering unspecified (#7): When multiple prior losses compete for the same replacement shares, ordering matters — different allocation produces different basis amounts on the replacement lot.
  • DRIP reinvestments triggering wash sales (#9): Dividend reinvestment creates new lots that should trigger forward detection but may not if only buy fills produce LotOpened events.
  • Self-correcting analytical style (CONFIRMED): Opus withdrew Finding #4 entirely mid-analysis ("Revised assessment: holding period logic appears correct. I withdraw the claim of error"). Spent ~500 words reasoning through the holding period tacking logic, found it correct, and explicitly retracted. This is now confirmed across Findings #15, #20, and #23 as a consistent Opus behavior for verification-heavy regulatory analysis.

Claude Sonnet unique findings (not in either other model):

  • Entity-level tracking for partnerships/S-Corps (#4.2): Tax-transparent entities trading through the platform need K-1 reporting to partners — user-scoped model doesn't address pass-through entity wash sale reporting.
  • Constructive sale integration (IRC 1259) (#4.1): Short positions or derivatives creating constructive ownership interact with wash sale determination in ways not addressed.
  • NOL carryforward interaction (#5.3): Wash sale deferrals affect character and timing of losses contributing to NOL calculations across tax years.

Quality assessment:

  • GPT-5 produced the broadest regulatory scope (16 findings) with the most specific IRS citations (Rev. Rul. 2008-5, Pub. 550, IRC §§267, 1091, 1092, 1222, 1223, 1256, 475). Its per-share vs lot-level finding (#1) is the only one that identifies a FUNDAMENTAL DESIGN ERROR (not a missing feature). Most other models' findings are "you don't handle X" — GPT-5's #1 says "what you DO handle is handled INCORRECTLY." This distinction matters: missing features are known scope limitations; incorrect logic is a bug.
  • Claude Opus matched GPT-5's count (16 with 2 self-corrections = 14 net confirmed) but with different character. Opus excelled at identifying OPERATIONAL implications (year-end boundary timing, Form 8949 format requirements, forward detection ordering) rather than just statutory gaps. Its findings tend to describe HOW the gap manifests in practice ("user files taxes, then January purchase retroactively invalidates the filing") vs GPT-5's approach of citing the statute and describing the theoretical violation.
  • Claude Sonnet was fast (40s) and produced 12 competent findings but with less regulatory precision. Findings lacked specific IRS citations (no Rev. Rul. references, no Treas. Reg. citations). Several findings overlapped heavily with common ground items without adding unique depth. The entity-level and constructive sale findings show awareness of tax complexity but are relatively generic ("this is complex and not addressed").

Key insight — regulatory compliance as a distinct task type:

This experiment tests a fundamentally different cognitive demand than previous ones: previous tasks asked "what could go wrong with this system?" (internal reasoning). This task asks "does this system correctly implement external rules?" (external reasoning). The model must hold TWO bodies of knowledge simultaneously: the implementation spec AND the regulatory framework, then find mismatches.

All three models had strong tax law knowledge — they cited IRC sections, Revenue Rulings, and Treasury Regulations correctly. The differentiation wasn't in legal knowledge but in HOW they applied it:

  • GPT-5: Exhaustive statutory mapping ("here's every IRC section that touches wash sales; here's where the implementation falls short on each"). Breadth-first coverage. Found the most issues by sheer scope of regulatory awareness.
  • Opus: Operational consequence reasoning ("here's how this gap manifests as a real-world problem for the user/auditor"). Found issues by reasoning about the implementation's interaction with real-world workflows (filing deadlines, form formats, broker reconciliation).
  • Sonnet: Category-based analysis ("here are cross-account issues, here are entity issues, here are interaction issues"). Followed the prompt structure closely but didn't go deep within each category.

The per-share vs lot-level finding (GPT-5 #1) — why it matters:

This is the experiment's most important result. Every model found missing features (options, cross-account, short sales) — those are SCOPE limitations that the document itself acknowledges or defers. GPT-5 uniquely found a correctness bug in the IMPLEMENTED logic: the system's lot-level basis adjustment is mathematically wrong for partial wash sales.

Example: Loss lot of 100 shares, replacement lot of 60 shares. Only 60 shares trigger wash sale. System adds full 60% of disallowed loss to the entire replacement lot's basis. If the replacement lot later sells 30 shares, the per-share basis is inflated (reflects 60 shares of adjustment spread across 60 shares). This is actually correct for the replacement lot specifically — but the tacked_opened_at is applied to ALL 60 shares when only the matched shares should have tacked holding periods. For lots where adjusted_quantity < replacement_quantity, the non-matched shares have incorrect holding period characterization.

Actually, on closer inspection: if adjusted_quantity = min(loss_quantity, replacement_quantity), and the system matches 60 shares of a 60-share replacement lot, ALL shares of that lot are matched. The edge case GPT-5 identifies would require a replacement lot larger than the loss — e.g., loss of 60 shares matched against a replacement lot of 100 shares where only 60 are affected. In that case, the tacked_opened_at is set on the entire lot (100 shares) when only 60 should be affected. This IS a genuine bug: 40 shares get incorrect holding period classification.

Updated task-type taxonomy:

Task type Primary cognitive demand Best model
Hidden assumptions Breadth identification (what's not stated?) GPT-5 (exhaustive)
Race conditions Sequential temporal reasoning GPT-5 + Opus
Cross-component interactions Component boundary reasoning GPT-5 + Sonnet
Design coherence Internal consistency checking Opus
Invariant violation paths Construction + verification GPT-5 (precision)
Silent correctness External requirement matching Opus
Regulatory compliance Dual-knowledge-base comparison GPT-5 (breadth) + Opus (operations)

Regulatory compliance is closest to "silent correctness" (Finding #22) in that both require reasoning about external requirements. The key difference:

  • Silent correctness asks "does this produce correct outputs for all inputs?"
  • Regulatory compliance asks "does this implement the law correctly?"

Both favor models that reason about the system's relationship to the outside world (Opus's strength), but regulatory compliance also rewards breadth of statutory knowledge (GPT-5's strength). The combination produces the most complete picture.

Practical implication: For regulatory compliance review of financial systems:

  • Run GPT-5 for exhaustive statutory coverage (finds the most gaps)
  • Run Opus for operational impact analysis (finds how gaps manifest in practice)
  • Sonnet adds marginal value — use only if budget allows
  • GPT-5's unique strength: identifying correctness bugs in implemented logic (not just missing features)
  • Opus's unique strength: identifying timing/workflow issues (year-end, form reporting, reconciliation with broker)

24. Design improvement proposals: GPT-5 excels at defense-in-depth thinking; Opus finds subtle design contradictions; Sonnet produces generic recommendations

Date: 2026-05-05 Task: Propose specific design improvements for gargoyle's kill-switch.md (185 lines) — the primary safety mechanism that prevents rogue orders. NEW task type: generative/ creative ("what would you improve?") rather than purely analytical ("what's wrong?"). How we used them: Same document (full text) + same focused prompt to all 3 models via HAI proxy. Prompt asked for 8-15 specific improvements with: weakness, proposed change (concrete), tradeoff, severity rating. Explicitly excluded generic advice ("add more tests") and asked about runtime assumptions. No tools, no project context.

Model Time Output tokens Reasoning tokens Improvements proposed
GPT-5 118s 8,710 6,016 15
Claude Opus 4.6 127s 4,985 (internal) 15
Claude Sonnet 4.6 40s 1,636 (internal) 12

What they found — common ground (all 3 identified):

  • DB write failure blocking engagement (fail-open under DB outage) — all three proposed in-memory-first engagement with async persistence
  • Kill switch process liveness monitoring (heartbeat/watchdog)
  • Broker connectivity loss during cancellation operations
  • ETS table ownership and crash-window vulnerability
  • Supervisor restart suppression as unstated mechanism
  • Per-venue/per-broker scope extension

GPT-5 unique findings (not in either other model):

  • Infrastructure-level "hard kill" — egress proxy or service mesh that blocks broker traffic independently of the application. Belt-and-suspenders approach where the kill switch works even if the entire BEAM VM is unresponsive. This was GPT-5's highest-impact unique insight.
  • Kill fence token (epoch) — every order-carrying message includes an epoch; stale-epoch messages are dropped at the gate. Elegantly solves in-flight messages without needing drain timeouts.
  • Cluster/multi-node propagation — detailed leader election + epoch broadcast
    • fail-closed on partition design.
  • Post-engage broker verification — query broker AFTER engaging to confirm no orders slipped through during the engagement window.
  • Liquidation exposure validation — proving tagged liquidation orders actually REDUCE exposure rather than trusting the tag.
  • Recovery/cold-start order suppression — ensuring reconciliation/recovery routines can't submit orders while engaged.
  • Engage latency reordering — ETS first, terminate second, DB async.
  • Audit log tamper evidence — append-only external sink + hash chain.

Claude Opus unique findings (not in either other model):

  • Ordering contradiction in engagement sequence — identified that the documented order (DB → ETS → terminate) creates a specific risk if a crash occurs BETWEEN termination and ETS update (not just DB failure). The insight is about the window where termination has started but gate is still open. More subtle than GPT-5's version (which focused on DB-blocking-engage).
  • Concurrent engagement race (mode escalation) — multiple triggers simultaneously issuing conflicting modes (RESTRICT vs LIQUIDATE). Proposed explicit escalation rules (LIQUIDATE always wins) with GenServer serialization.
  • Shared resources under per-user scope — per-user kill switch doesn't address orders in shared broker connection buffers. Forces architectural decision about connection pooling strategy.
  • Clock/time integrity for audit log — monotonic counters + NTP validation for forensic reliability.
  • Partial multi-user engagement failures — what happens when global engage successfully terminates 4/5 user pipelines but one has orphaned processes.
  • Liquidation direction validation — similar to GPT-5's exposure validation but framed differently: checking corrupted position records could cause liquidation to OPEN positions rather than close them.
  • Process termination verification — checking that :kill signals actually worked (defense against trap_exit, NIF blocking).
  • Engagement latency SLA — defining a 50ms target with monitoring/alerting.

Claude Sonnet findings (all also present in GPT-5 or Opus, differently framed):

  • No genuinely unique improvements that GPT-5 or Opus didn't also identify.
  • Several were generic: "missing resource cleanup," "circuit breaker integration," "performance monitoring" — exactly the kind of advice the prompt tried to exclude.
  • The "missing heartbeat" and "network partition handling" proposals were solid but less detailed than the corresponding GPT-5/Opus versions.

Quality assessment:

  • GPT-5 produced the most ACTIONABLE improvements. Its proposals were architecturally concrete ("add an egress proxy," "use kill epochs in messages," "query broker post-engage") and showed defense-in-depth thinking — multiple independent layers rather than fixing one path. The infrastructure kill (#2) is genuinely novel: no other model proposed going OUTSIDE the application boundary for safety enforcement. GPT-5 consistently thought about "what if this entire runtime is compromised?" rather than just fixing within-app paths.
  • Claude Opus produced equally numerous improvements (15) with characteristic precision about failure SEQUENCES. Its unique strength: identifying design contradictions rather than just gaps (the engagement ordering issue, concurrent mode escalation, shared-resource scope mismatch). Opus's proposals were more "fix the design tension" while GPT-5's were more "add another safety layer." Opus also included the process termination verification and engagement latency SLA — operational rigor that GPT-5 skipped.
  • Claude Sonnet produced 12 proposals in 40s (fast) but quality was notably lower. Several proposals were generic software engineering advice that the prompt explicitly excluded ("add performance monitoring," "resource cleanup"). No unique insights emerged. Sonnet's proposals lacked the architectural depth of GPT-5 (no outside-the-application thinking) and the design-tension identification of Opus.

Key insight — generative vs analytical tasks:

This is the first experiment testing a GENERATIVE task ("propose improvements") rather than a purely analytical one ("find problems"). The results reveal:

  1. GPT-5's defense-in-depth thinking is unique. In analytical tasks, GPT-5 finds exhaustive lists of issues. In generative tasks, it proposes LAYERED solutions — multiple independent mechanisms that each catch what the others miss. The infrastructure kill proposal (external to the application) shows GPT-5 reasoning about failure modes that are invisible to within-app analysis.

  2. Opus's design-tension identification transfers to improvement proposals. In analytical tasks, Opus finds where parts of a design contradict each other. In generative tasks, this manifests as proposals that RESOLVE tensions rather than just adding patches. The engagement ordering contradiction and mode escalation rules are both "this design says X but the mechanism allows Y — here's how to make them consistent."

  3. Sonnet doesn't transfer well to generative tasks. In analytical tasks (assumption-finding, cross-component analysis), Sonnet performs well (85% of GPT-5 in some experiments). In generative tasks, it falls back to generic engineering advice. The task requires both identifying problems AND proposing concrete solutions — Sonnet handles the first step but not the second with sufficient depth.

Comparison to analytical task performance:

Task type GPT-5 character Opus character Sonnet character
Assumption-finding (#10-12) Exhaustive breadth Design tensions Good (85% of GPT-5)
Race conditions (#13) Technical precision Design contradictions Weak (errors)
Invariant violations (#20) Maximum selectivity Self-correcting depth Imprecise
Design improvements (#24) Defense-in-depth layers Tension resolution Generic advice

The generative task reveals model ARCHITECTURES more clearly than analytical tasks. GPT-5's reasoning enables it to construct multi-layered solutions. Opus's internal reasoning enables it to identify what a design SHOULD be (not just what's wrong). Sonnet pattern-matches against known engineering practices without deep synthesis.

Practical implication:

For design improvement sessions on safety-critical systems:

  • Run GPT-5 for defense-in-depth proposals ("what layers should exist?")
  • Run Opus for design consistency proposals ("where does the design contradict itself?")
  • Skip Sonnet — its output is indistinguishable from generic checklists
  • The combination of GPT-5 + Opus produces complementary improvements: GPT-5 adds safety layers, Opus fixes internal contradictions. Together they address both "not enough protection" and "protection mechanisms that work against each other."

Cost analysis: GPT-5: 118s, ~10.9K tokens (6K reasoning). Opus: 127s, ~5K tokens. Sonnet: 40s, ~1.6K tokens. For a safety-critical design review, running GPT-5 + Opus costs ~16K tokens and produces 30 improvements with near-zero overlap in unique insights. Excellent ROI for a kill switch design that protects real money.

25. Contradiction detection: NEW task type — Opus excels at finding genuine contradictions with precision; GPT-5 is thorough but spends heavily on reasoning; Sonnet finds surface-level issues quickly

Date: 2026-05-05 Task: Identify internal contradictions, logical inconsistencies, and conflicting rules in gargoyle's order-state-machine.md (311 lines) — a document defining states, transitions, invariants, fill precedence rules, and time-in-force behavior. How we used them: Same document (full text) + same focused analytical question to all 3 models via HAI proxy. Prompt specifically asked for: state machine contradictions, semantic conflicts, rule violations, implicit contradictions, and terminology inconsistencies. Required each finding to quote the conflicting statements, explain the logical argument, assign severity, and recommend which statement should "win." No tools, no project context beyond the document itself.

Model Time Output tokens Reasoning tokens Contradictions found
GPT-5 162s 12,074 11,008 4
Claude Opus 4.6 41s 2,056 (internal) 6
Claude Sonnet 4.6 17s 826 (internal) 4

What they found — common ground (2+ models identified):

  • Missing pending_cancel → partially_filled revert transition (GPT-5 #1 + Opus #2 + Sonnet partial): The "Rejection reverts" invariant states orders return to their "pre-modification state (working or partially_filled)", but the state diagram only shows pending_cancel → working for cancel rejection — no path back to partially_filled. All models correctly identified this as the diagram being incomplete relative to the stated invariant. GPT-5 and Opus rated CRITICAL.
  • Same issue for pending_replace revert (GPT-5 #1 + Opus #3): The state diagram only shows pending_replace → working for replace rejection, but a replace requested from partially_filled should revert to partially_filled. Same root cause as above, just the replace variant.
  • FOK "never partially fills" vs state machine allowing it (GPT-5 #2 + Opus #4): The TIF table says FOK "never partially fills" but the state machine has no guards preventing FOK orders from reaching partially_filled. Both correctly noted this is a broker-enforced guarantee but the document presents it as system-level.
  • rejection_reason described as "broker-provided" but local rejections exist (GPT-5 #4 + Opus #5 + Sonnet): pending → rejected is "local validation failure" with no broker interaction, but the field says "Broker-provided reason when rejected." All three caught this terminology inconsistency.

GPT-5 unique findings (not in either other model):

  • IOC valid terminal states exclude expired vs generic expiry transitions (#3): IOC should never reach expired (unfilled portion is cancelled immediately), but the state diagram allows any order to transition to expired without TIF guards. Well-reasoned extension of the FOK finding to IOC semantics. GPT-5 correctly identified that broker "expired-like" outcomes should map to cancelled for IOC.

Claude Opus unique findings (not in either other model):

  • Terminal states that aren't terminal — the partially_filled re-entry problem (#1): Opus identified the DEEPER contradiction beyond the surface-level "cancelled states have outgoing transitions." When cancelled → partially_filled fires via late fill, the order is now non-terminal with NO defined mechanism to re-terminate if no further fills arrive. The order is stuck in partially_filled indefinitely. This goes beyond "the diagram contradicts the definition of terminal" to "the fill precedence rule creates an unspecified operational scenario." This is the most architecturally significant finding across all three models.
  • Fill precedence label misapplication to non-terminal states (#6): The state diagram labels transitions from pending_cancel → partially_filled and pending_replace → partially_filled as "fill precedence," but the Fill Precedence Rule explicitly defines itself as overriding TERMINAL states. pending_cancel is non-terminal. The label conflates two different mechanisms (fill during pending modification vs. fill overriding terminal state), which could cause implementers to use the same code path for fundamentally different scenarios.

Claude Sonnet unique findings (not in either other model):

  • State diagram terminal arrow contradiction (#1): Sonnet was the only model to explicitly note that the Mermaid diagram shows cancelled → [*] (terminal arrow) while simultaneously showing cancelled → partially_filled (outgoing transition). A valid observation but more surface-level than Opus's deeper analysis of the same phenomenon.
  • Pending replace fill logic error (#3): Sonnet argued that receiving a fill during pending_replace creates a logical impossibility because the order parameters are in flux. This is WRONG — fills always apply to current parameters (the replace hasn't been confirmed yet), and the document actually handles this correctly. This is a FALSE POSITIVE from Sonnet.

Quality assessment:

  • Claude Opus was the clear winner for this task. Found the most contradictions (6), had the highest precision (0 false positives), and — crucially — found qualitatively deeper issues. The partially_filled re-entry problem (#1) isn't just "the diagram has a missing transition" but "the fill precedence rule creates an unresolvable operational state." The fill precedence label misapplication (#6) identifies a conceptual confusion that would genuinely cause implementation bugs. Opus completed in only 41s with 2,056 output tokens — by far the most efficient.
  • GPT-5 found 4 genuine contradictions with 0 false positives but spent an extraordinary amount of reasoning tokens (11,008) for modest output (1,066 visible content tokens, 10.3:1 reasoning ratio). The IOC finding was unique and valuable. But the cost is disproportionate: 162s and 12K tokens for 4 findings vs Opus's 41s and 2K tokens for 6 findings. GPT-5's reasoning budget seems to have been mostly spent on VERIFICATION (confirming each finding is genuine), consistent with Finding #20's observation.
  • Claude Sonnet was fastest (17s) and found 4 items, but one was a false positive (the pending_replace logic error claim is incorrect). That gives it a precision of 75% (3/4 genuine) — the lowest of the three. Its genuine findings were all also found by the other models (no unique true contributions). Sonnet appears to trade speed for accuracy on contradiction detection.

Key insight — contradiction detection favors precision-oriented models:

This task is fundamentally about LOGICAL ARGUMENTATION: proving that two statements cannot both be true. Unlike assumption-finding (which is about imagining what could go wrong) or gap-finding (which is about identifying missing content), contradiction detection requires the model to:

  1. Hold two statements in working memory simultaneously
  2. Construct a formal argument for why they conflict
  3. NOT get confused by statements that SEEM contradictory but are actually consistent

Requirement #3 is where models diverge. Sonnet produced a false positive because it didn't fully reason through whether the pending_replace fill scenario is actually inconsistent (it isn't — current parameters apply). Opus avoided this trap entirely and additionally found DEEPER contradictions that require multi-step logical reasoning (the re-entry problem, the label misapplication). GPT-5 also avoided false positives but at massive computational cost.

Opus's efficiency advantage: This is the first task where Opus is not just qualitatively better but also quantitatively more efficient. 6 findings in 41s and 2K tokens vs GPT-5's 4 findings in 162s and 12K tokens. That's 3x more findings per token and 4x faster. For contradiction detection specifically, Opus appears to have a structural advantage — possibly because its internal reasoning is better calibrated for logical argumentation than GPT-5's externalized reasoning chain.

Comparison to Finding #20 (invariant violation paths): In Finding #20, GPT-5 was maximally selective (3 findings, all genuine, 15:1 reasoning ratio). Here, GPT-5 shows the same pattern: few findings, all genuine, high reasoning ratio (10.3:1). The difference: in #20, GPT-5's selectivity meant it found UNIQUE violations others missed. Here, all of GPT-5's findings were also found by Opus (plus Opus found 2 more). GPT-5's high verification bar doesn't help when Opus is ALSO precise AND more thorough.

Updated task-model assignment:

For contradiction/consistency checking:

  1. Opus — best choice: highest precision, deepest contradictions, most efficient
  2. GPT-5 — solid backup: zero false positives, unique TIF-related insights, but expensive and slower
  3. Sonnet — NOT recommended for this task: produces false positives, no unique true contributions

This confirms the emerging pattern: each model has task types where it excels. Opus excels at logical argumentation and design tensions. GPT-5 excels at exhaustive enumeration and operational concerns. Sonnet excels at speed and structural/assumption analysis but struggles with tasks requiring formal logical reasoning (contradiction detection, concurrency analysis per Finding #13).

Practical implication: When reviewing architecture documents for internal consistency (e.g., before implementation begins), run Opus. If budget allows, add GPT-5 for TIF/edge-case coverage. Skip Sonnet for consistency checking — its speed advantage is negated by the false positive risk.

26. Missing-feature identification IS promptable across all models; prompt framing eliminates Opus's historical advantage — all three models find regulatory gaps when explicitly asked

Date: 2026-05-05 Task: Identify computations, behaviors, or features that gargoyle's corporate-actions.md (992 lines) SHOULD perform for financial correctness, regulatory compliance, or operational safety — but doesn't describe. How we used them: Same document (full text) + same focused analytical prompt to all 3 models via HAI proxy. Prompt explicitly structured around 5 categories: missing computations, missing behaviors, missing validations, missing integrations, and regulatory gaps. Required concrete findings with severity. No tools, no project context beyond the document. GPT-5 via OpenAI endpoint (16K max_completion_tokens), Opus 4.6 and Sonnet 4.6 via Anthropic endpoint (8K max_tokens).

Model Output tokens Reasoning tokens Findings Critical High Medium
GPT-5 11,354 8,512 20 3 10 7
Claude Opus 4.6 4,111 (internal) 23 6 10 7
Claude Sonnet 4.6 4,686 (internal) 15 5 6 4

What they found — common ground (all 3 identified):

  • Wash sale rule interaction with CA-driven lot closures (IRC §1091)
  • Short position treatment for corporate actions
  • Same-day corporate action ordering beyond recorded_at timestamp
  • Record date / ex-date position verification (entitlement timing)
  • Idempotency guard preventing double-application per user
  • Decimal precision/rounding policy unspecified
  • Superseded CA status has no lot rollback mechanism
  • Rights/warrants post-creation lifecycle (exercise/expiration)
  • Basis preservation invariant has no runtime enforcement
  • Manual entry authorization and audit trail

GPT-5 unique findings (not in either Claude model):

  • Per-lot eligibility based on entitlement date (not just user-level)
  • Election-based outcomes for shareholder choices (cash vs stock)
  • Instrument-level trading hold during CA application window
  • Pre-application consistency checks against broker entitlements
  • DB-level enforcement of status transitions and invariants
  • Action-type-specific date semantics per field (ex vs record vs payable)
  • Voluntary/tender actions beyond distributions
  • Backfill/initialization guard for newly onboarded users
  • Applicator retry/backoff semantics and confirmation race
  • Rights indivisibility constraints vs exact Decimal quantities

Claude Opus unique findings (not in either other model):

  • Pending order PRICE adjustment after splits (not just cancellation)
  • Multi-instrument position recalculation atomicity for mergers
  • Mixed merger basis floor at zero (can produce negative basis)
  • Tax lot identification method interaction with inherited dates
  • Corporate action effect on strategy position limits/risk params
  • Corporate actions on instruments not yet in the database
  • Partial application window: new user acquires position mid-fan-out
  • IRC §305(c) deemed distributions (taxable stock dividends)
  • CA impact on unrealized P&L display and strategy evaluation
  • Concurrent OrderManager startup + Applicator fan-out race

Claude Sonnet unique findings (not in either other model):

  • Stale orders: failure modes table contradicts "excluded" section
  • IRC §1223(1) holding period tacking verification at lot close
  • Spinoff allocation percentage — no validation child != parent instrument
  • Combined spinoff allocations exceeding meaningful bounds
  • Cash dividend bypasses OrderManager — record-date quantity snapshot lost
  • Mixed merger large-denominator exchange ratio overflow
  • Detector schedule: no intraday re-poll for same-day announcements
  • ROC cumulative basis floor tracking and IRC §301(c)(3) distinction
  • Mixed merger deferred loss not explicitly recorded in metadata

Quality assessment:

  • Claude Opus was the MOST PROLIFIC (23 findings) — a notable inversion from previous experiments where Opus typically found fewer but deeper findings. Here, the explicit "missing feature" framing appears to have unlocked Opus's breadth. Its unique findings included genuinely critical items: pending order price adjustment after splits (Critical — direct financial loss), multi-instrument atomicity for mergers (Critical — position loss), and mixed merger negative basis (High — accounting corruption). The findings were precise, well-reasoned, and showed both regulatory depth (IRC §305(c)) and operational awareness.
  • GPT-5 was slightly less prolific (20 findings) but maintained its characteristic breadth and operational-level thinking. Per-lot eligibility (not just per-user) is a subtle but important distinction. The election- based outcomes finding shows awareness of real-world corporate action complexity. The backfill/initialization guard is operationally significant. GPT-5 spent 8,512 reasoning tokens — moderate for its output volume.
  • Claude Sonnet found fewer gaps (15) but several were genuinely insightful. The internal contradiction between the failure modes table and the "excluded" section is a real document inconsistency. The cash dividend record-date quantity snapshot insight (#9) identifies a DATA LOSS problem — the opportunity to capture that data expires. The mixed merger deferred loss recording gap shows regulatory awareness. However, some findings were more surface-level or overlapped heavily with the others.

KEY INSIGHT — The original question from Finding #22 is ANSWERED:

"Opus's 'missing feature identification' mode (wash sales, commissions) — is this promptable on other models? Could we explicitly ask GPT-5 'what should this system compute but doesn't' and get similar results?"

YES. When explicitly prompted with a structured "missing feature" framing, ALL three models found regulatory gaps (wash sales, IRC sections), missing computations (basis calculations, rounding), and missing behaviors (lifecycle events, notifications). GPT-5 produced findings in the same category as what Opus uniquely found in Finding #22 (silent correctness failures on specid-lot-selection.md).

In Finding #22, Opus uniquely identified wash sales and commission tracking as missing features while GPT-5 focused on mechanism incorrectness and Sonnet on composition failures. HERE, with the explicit "what's missing" prompt, ALL three models found wash sales, ALL found regulatory gaps, and ALL found missing behaviors.

This confirms: Opus's "missing feature identification" mode in Finding #22 was NOT an inherent model capability — it was an emergent behavior from the open-ended "silent correctness failures" prompt. When you give ALL models the EXPLICIT instruction to look for missing features, they all do it. The differentiation from #22 was caused by the prompt being more open-ended, allowing each model to default to its natural analytical mode:

  • Opus → "what's missing" (features/functionality)
  • GPT-5 → "what's wrong" (mechanism failures)
  • Sonnet → "what breaks when combined" (composition)

Prompt framing dominates model personality. With the right prompt, any model can be directed into any analytical mode. The model differences that emerged in earlier open-ended experiments reflect DEFAULT TENDENCIES, not capabilities.

NEW finding about Opus on complex documents: Opus produced MORE findings than GPT-5 (23 vs 20) — the first time this has happened on a broad analytical task. Previous pattern: GPT-5 always finds more (20-33 findings) while Opus finds fewer but deeper (7-13). What changed? The document is 992 lines — the longest tested — and the task is explicitly about breadth ("find all gaps"). On this specific combination (long document + breadth-focused prompt), Opus appears to allocate its internal reasoning budget toward exploration rather than its usual depth-first design-tension mode. This suggests Opus's typical "fewer but deeper" pattern is partially a RESPONSE to shorter documents where depth is more productive than breadth.

Practical implications:

  1. For missing-feature analysis: prompt structure matters more than model choice. All three models are viable. Use the explicit 5-category prompt.
  2. Run all three for critical docs — they find different specific gaps despite finding the same categories.
  3. For open-ended analysis where you want models to find DIFFERENT things: use open-ended prompts. For analysis where you want COMPREHENSIVE coverage of one type: use structured prompts.
  4. Opus's "fewer but deeper" personality can be overridden by document length + breadth-focused prompt. On 992-line docs, it competes on volume with GPT-5.

Cost-effectiveness: Opus: 4,111 output tokens for 23 findings = 179 tokens/finding GPT-5: 11,354 output tokens (+ 8,512 reasoning) for 20 findings = 993 tokens/finding Sonnet: 4,686 output tokens for 15 findings = 312 tokens/finding

Opus is by far the most efficient: nearly 6x fewer tokens than GPT-5 per finding, with MORE findings. This is the strongest cost-effectiveness case for Opus on any tested task. On long documents with breadth-focused prompts, Opus appears to be the optimal choice for both quality AND efficiency.

28. Cross-document consistency analysis: NEW task type — GPT-5 finds deep semantic contradictions; Opus finds structural/boundary mismatches; Sonnet identifies core issues quickly

Date: 2026-05-05 Task: Identify contradictions and inconsistencies BETWEEN two architecture documents describing the same system: system-overview.md (323 lines, narrative overview with component flows, invariants, and domain events) and architecture.md (213 lines, DDD-focused with bounded contexts, context map, and message taxonomy). How we used them: BOTH documents provided as full text in a single prompt (~25KB total). Highly structured prompt specifying 5 categories of cross-document inconsistency (terminology conflicts, structural contradictions, flow/sequence conflicts, ownership/authority conflicts, philosophical contradictions). Required specific output format per finding. Explicitly excluded omissions (things one doc covers and the other doesn't) and detail-level differences. No tools, no project context beyond the two documents. This is a NEW analytical task not previously tested: reasoning about CONSISTENCY BETWEEN documents rather than internal coherence of a single document.

Model Time Output tokens Reasoning tokens Inconsistencies found Critical High Medium
GPT-5 125s 9,415 8,384 6 2 3 1
Claude Opus 4.6 52s 2,351 (internal) 7 3 3 1
Claude Sonnet 4.6 14s 776 (internal) 4 1 2 1

What they found — common ground (all 3 identified):

  • Event sourcing (all events as source of truth) vs fills-only ground truth: Document A says fills are "ground truth from which all other state can be derived," while Document B says "events are the source of truth, state is computed by replaying events." A treats fills as the recovery foundation; B treats ALL domain events as authoritative. All three models rated this Critical.
  • Bounded context naming mismatch: "Decision Engine" / "Order Management" (A) vs "Engine" / "Trading" (B) for the same functional responsibilities. GPT-5 folded this into a broader ownership analysis; Opus and Sonnet surfaced it as its own finding.
  • Signal classification conflict: Document A lists "Signal emitted" as a domain event; Document B explicitly categorizes SignalEmitted as an audit event ("not used to rebuild state"). This determines event store design and recovery semantics.

GPT-5 unique findings (not in either Claude model):

  • Signal persistence contradiction: Document A states "Signals are never persisted" while Document B lists SignalEmitted as an audit event that IS persisted and states the audit log is mandatory for trading. These are directly incompatible claims about whether signal data is stored.
  • Audit event ownership conflict: Document A says "Decision approved" events originate from PortfolioRisk. Document B states "only the decision engine writes audit events" and lists DecisionApproved as an audit event example. If PortfolioRisk is part of Risk (not Engine), this is an authority violation.
  • "Single writer per user" (A: OrderManager writes all trading state) vs per-aggregate single-writer (B: each aggregate writes its own event stream, Ledger owns positions). These are incompatible authority models — either OM centralizes writes or each domain owns its own events.

Claude Opus unique findings (not in either other model):

  • Engine → OrderManager is an internal pipeline flow (A: same subgraph, direct arrow) vs Engine → Trading is a cross-domain COMMAND (B: PlaceOrder command crossing a bounded context boundary). This structural disagreement determines whether order management is an internal pipeline stage or an independent domain with its own aggregates and command validation.
  • Signal Risk's architectural position: Document A shows a two-stage risk architecture (SignalRisk pre-aggregation, PortfolioRisk post-aggregation) where Risk is embedded in the pipeline. Document B's context map shows Risk as a separate domain that Engine merely QUERIES ("kill switch active?") — no arrow shows signal routing through Risk. Either risk logic lives inside Engine (contradicting B's context boundary) or the context map is incomplete.
  • The "reduce" step ownership: A's top-level flow labels Approved →|"reduce"| Decisions (reduction at aggregation), while A's own domain events table says "Decision reduced" originates from PortfolioRisk (reduction after aggregation). This is actually an INTRA-document inconsistency in Document A, but Opus surfaced it as part of cross-doc analysis.

Claude Sonnet unique findings:

  • None genuinely unique. All 4 findings overlapped with GPT-5/Opus common ground (event sourcing, signal persistence, context count/naming). Sonnet was efficient (14s, 776 tokens) but didn't identify any inconsistency that the other two missed.

Quality assessment:

  • GPT-5 produced 6 well-reasoned findings with the deepest analysis of OWNERSHIP conflicts. Its signal-persistence contradiction and single-writer authority conflict are genuinely important — they reveal places where the two documents would lead implementers to build fundamentally different systems. Every finding quotes specific text from both documents and explains precisely WHY they can't both be correct. The reasoning investment (8,384 tokens) was used for thorough cross-referencing between documents.
  • Claude Opus found the most inconsistencies (7) and was remarkably fast (52s vs GPT-5's 125s). Its unique strength: identifying STRUCTURAL contradictions about component boundaries and communication patterns. The Engine→Trading command vs internal pipeline finding is architecturally the most significant discovery — it reveals a fundamental disagreement about whether order management is INSIDE or OUTSIDE the decision engine's boundary. Opus also caught a bonus intra-document inconsistency (the "reduce" labeling error).
  • Claude Sonnet was the fastest (14s) and most concise (776 tokens) but found only the obvious common-ground issues. For cross-document consistency, Sonnet's speed advantage came at the cost of missing the architectural insights that make this task valuable. It did correctly identify all the Critical-level issues, making it viable as a quick first-pass screen.

Key insight — cross-document consistency is a DISTINCT task type: This is fundamentally different from single-document analysis (assumptions, race conditions, coherence). It requires:

  1. Building a mental model from Document A
  2. Building a separate mental model from Document B
  3. Finding places where the models are incompatible
  4. Reasoning about WHY they can't both be correct (not just "different")

Step 4 is what distinguishes this from simple diff-detection. Many surface differences (naming, detail level, scope) are NOT contradictions — the models must judge which differences are genuinely incompatible vs. complementary. The prompt explicitly excluded omissions and detail-level differences, and all three models respected this constraint well.

Model strengths on cross-document analysis:

  • GPT-5 excels at ownership/authority conflicts: it systematically checked "who owns this concept" in each document and found mismatches. Its findings cluster around "who writes what" and "who is authoritative."
  • Opus excels at structural/boundary contradictions: it identified where the documents draw architectural lines differently. Its findings cluster around "where are the boundaries" and "what crosses them."
  • Sonnet identifies the obvious/critical issues quickly but doesn't dig deeper. Viable for screening, not for thorough analysis.

Comparison to Finding #15 / #27 (single-document coherence checking): Single-document coherence asks "does this document contradict itself?" Cross-document consistency asks "do these documents contradict each other?" Key differences in results:

Aspect Single-doc coherence Cross-doc consistency
Opus findings 5-7 7
GPT-5 findings 4-6 6
Sonnet findings 4-5 4
Opus unique Design tensions Structural/boundary mismatches
GPT-5 unique Definitional errors Ownership/authority conflicts
Best model Task-dependent Opus (most findings + fastest)

The relative ordering is similar (Opus ≥ GPT-5 > Sonnet for coherence-style tasks), but the CHARACTER of unique findings shifted. On single-doc coherence, Opus finds design tensions within a single design. On cross-doc consistency, Opus finds BOUNDARY disagreements between two designs. GPT-5 shifts from finding definitional errors to ownership conflicts.

Are these findings REAL bugs in the gargoyle documentation? Yes — several are genuine issues worth fixing:

  1. The fills-vs-events-as-ground-truth is a real philosophical tension between the two documents that needs resolution.
  2. The Position event ownership (OrderManager vs Ledger) is a real boundary conflict that affects implementation.
  3. The Engine→Trading communication style (internal pipeline vs cross-domain command) is a genuine structural ambiguity.
  4. The signal persistence claim ("never persisted" vs SignalEmitted audit event) is a direct textual contradiction.

These are the kind of cross-document inconsistencies that cause teams to build inconsistent implementations — one engineer reads Document A and builds one way, another reads Document B and builds differently.

Practical implication: Cross-document consistency analysis is a high-value task for documentation maintenance. Run it when:

  • A system has multiple architecture docs written at different times
  • A refactoring has updated one doc but not another
  • Multiple people contribute to design documentation
  • Moving from high-level overview to detailed specification

Opus is the recommended model for this task: fastest (52s vs 125s), most findings (7 vs 6), and uniquely strong at boundary disagreements. GPT-5 adds value for ownership-specific conflicts. Sonnet is sufficient for quick screening (catches the Critical issues in 14s) but won't find the architectural insights.

Cost-effectiveness: Opus: 2,351 output tokens for 7 findings = 336 tokens/finding (52s) GPT-5: 9,415 output + 8,384 reasoning for 6 findings = 2,967 tokens/finding (125s) Sonnet: 776 output tokens for 4 findings = 194 tokens/finding (14s)

Opus is the clear winner on this task type: more findings than GPT-5, 2.4x faster, and 8.8x more token-efficient per finding. GPT-5's massive reasoning investment (8,384 tokens) produced only one fewer finding than Opus — the verification overhead is not paying off here because cross-document contradictions are relatively easy to verify once identified (just check both documents).

29. Adversarial manipulation analysis: NEW task type — GPT-5 is most exhaustive and systematic; Opus produces qualitatively different attack vectors with system-level thinking; Sonnet is adequate but less creative

Date: 2026-05-05 Task: Identify adversarial manipulation paths in gargoyle's aggregation.md (193 lines) — how a misbehaving, compromised, or buggy upstream component could exploit the aggregator's design guarantees to produce harmful trading outcomes that bypass downstream safety controls. How we used them: Same document (full text) + same focused analytical question to all 3 models via HAI proxy. Highly structured prompt specifying 5 categories of adversarial manipulation (signal injection, timing manipulation, capacity weaponization, state corruption via crash, audit evasion). Required specific output format per finding (attack vector, mechanism, exploit, why downstream controls miss it, severity). No tools, no project context beyond the document itself.

Model Time Output tokens Reasoning tokens Attack vectors found Critical High Medium
Claude Sonnet 4.6 27s 1,257 (internal) 10 3 5 2
Claude Opus 4.6 84s 3,662 (internal) 12 5 5 0
GPT-5 111s 8,808 6,336 15 2 10 3

What they found — common ground (all 3 identified):

  • Primary signal hijacking via ranking manipulation (last-tick injection in time-windowed to control decision parameters)
  • Threshold gaming via signal replay/duplication (no deduplication means N identical signals satisfy "N confirmations")
  • Capacity flooding to force premature completion or deny legitimate trades
  • Strategic crash to erase unfavorable in-flight groups
  • Timeout-masqueraded manipulation (making attacks look like normal system behavior in the audit trail)

GPT-5 unique findings (not in either Claude model):

  • Direction flip against majority via ranking: In "most recent" ranking, emit multiple SELL confirmations then inject a late BUY — the BUY becomes primary and the decision contradicts the bulk of evidence. Distinct from general primary hijack because it's specifically about directional reversal.
  • Late-arrival exclusion of counter-signals: Time signals so countervailing signals arrive just after group destruction, ensuring the decision is formed without dissenting inputs that would have altered ranking.
  • Capacity filter to curate the audit set: Pre-fill buffer with chosen signals so riskier alternatives cannot be included before capacity fires — the contributing signals list looks clean.
  • Timer nullification by crash: Crash just before a timeout that would force-complete an unfavorable decision — the timer becomes no-op on restart, no decision or expiry event is emitted.
  • Decision drop via induced forwarding failure: Exploit the "Decision forwarding fails: Decision is lost" failure mode to selectively suppress protective decisions (stops, hedges) with no automatic retry.
  • Crash to erase evidence of contrary signals: Post-crash, submit a fresh group that completes quickly; audit shows only the new set, not the earlier contradictory pre-crash signals.

Claude Opus unique findings (not in either other model):

  • Instrument fragmentation to multiply position size: Emit signals for economically equivalent exposures using different instrument identifiers. Each gets its own group, each produces a separate decision, bypassing per-group capacity limits. Combined position exceeds what any single group would allow. Identifies TOCTOU at the fan-in to PortfolioRisk.
  • Forced stale decision via timer exploitation: Emit one signal at a favorable price spike known to be transient, then deliberately withhold further signals. Timer force-completes with a stale price. The entry price WAS valid when the signal was generated — PortfolioRisk doesn't check staleness of decision prices.
  • Timeout prevention / keep-alive suppression: Manipulate market data feed to suppress signals that would reach threshold N. Group expires normally — denial-of-trading attack disguised as insufficient confirmation.
  • Crash-restart duplicate decisions: Crash after decision is forwarded but before strategy reflects it. Both restart "clean" — strategy re-emits signals, aggregator produces a second decision with a fresh ID. Same trade executes twice. PortfolioRisk can't deduplicate because IDs are different.
  • Force-complete with insufficient confirmation (capacity < threshold): If capacity limit is lower than threshold, hitting capacity ALWAYS force- completes before predicate is satisfied. Fundamentally changes a 5-confirmation strategy into a 3-confirmation strategy.
  • Pattern predicate as arbitrary decision trigger: If adversary controls predicate logic (via strategy configuration), can make pattern-complete trigger on any single signal while audit shows algorithm=pattern-complete and reason=:predicate. Trust boundary between configuration and execution.

Claude Sonnet unique findings (not in either other model):

  • Cross-group timing coordination: Coordinate signal injection across multiple instruments to synchronize completion times, creating a burst of correlated decisions that overwhelm PortfolioRisk individually-safe evaluations. (NOTE: Opus found a similar concept — instrument fragmentation — but framed it differently: Opus focused on position multiplication via instrument aliasing, Sonnet focused on burst timing overwhelming evaluation.)
  • Multi-strategy attack distribution: Spread manipulation across multiple isolated strategy aggregators so no single aggregator's behavior looks abnormal while cumulative effect is harmful.

Quality assessment:

  • GPT-5 produced the most findings (15) with the most systematic coverage across all 5 prompt categories. Its strength was in identifying SPECIFIC INTERLEAVINGS — exactly how timing, state, and ranking mechanisms interact to produce exploits. The direction-flip finding (#3) and the late-arrival exclusion finding (#6) show precise temporal reasoning about when signals arrive relative to group lifecycle events. The "decision drop via forwarding failure" finding exploits a DOCUMENTED failure mode (from the failure table) as an offensive weapon — turning a recovery mechanism into an attack vector. Every finding references specific mechanisms from the spec.
  • Claude Opus produced 12 findings with the most architecturally creative attacks. The instrument fragmentation attack is the most SYSTEMICALLY dangerous finding across all three models — it's not about manipulating one group but about the RELATIONSHIP between groups, and it identifies a TOCTOU vulnerability at the PortfolioRisk fan-in point that no other model found. The crash-restart duplication attack is also architecturally novel — it exploits the "clean state" guarantee as a weapon for invisible trade doubling. Opus consistently reasons about the system BOUNDARY (aggregator → PortfolioRisk handoff) rather than just within-component mechanics. The pattern-predicate trust boundary finding is uniquely about CONFIGURATION as an attack surface.
  • Claude Sonnet produced 10 findings in 27s — extremely efficient (127 tokens per finding). Findings were adequate and covered all 5 categories, but lacked the specificity of GPT-5 and the architectural creativity of Opus. Several findings were somewhat generic (e.g., "crash at strategic moments" without specifying exactly WHEN relative to group lifecycle). The cross-group coordination and multi-strategy distribution findings show system-level thinking but are stated at a higher abstraction level without concrete exploit sequences.

Key insight — "adversarial manipulation analysis" as a task type: This is qualitatively different from all previous analytical lenses tested. Previous tasks asked models to find problems WITH the design (assumptions, races, incoherences). This task asks models to find ways to USE the design AGAINST itself — a creative/generative adversarial task. Results:

  • GPT-5 treats it as an exhaustive enumeration exercise — systematically walks through each mechanism and asks "how could this be abused?" High count (15), thorough coverage, but some findings are minor variations of each other (e.g., crash-related findings #10, #12, #15 share the same core mechanism). Reasoning tokens (6,336) used for both generation and verification.
  • Opus treats it as a creative design exercise — asks "what would a smart adversary do that the designer didn't consider?" Fewer findings (12) but several are genuinely novel attack concepts (instrument fragmentation, crash-restart duplication, predicate trust boundary) that require reasoning about the SYSTEM rather than the COMPONENT. Opus also provided a summary table and systemic conclusion about the root design weaknesses.
  • Sonnet treats it as a categorization exercise — fills each prompt category with plausible attacks but at a higher abstraction level. Fast and adequate for a first pass but wouldn't surprise a security reviewer.

Comparison to "predictable exploit window" (Finding #18): Finding #18 noted that Opus uniquely identified predictable exploit windows in escalation-policy.md. Here, Opus again shows the strongest adversarial creativity — the instrument fragmentation attack and crash-restart duplication are both about exploiting DESIGN GUARANTEES (per-instrument grouping, clean restart) as weapons. This confirms that Opus's strength on adversarial analysis is a CONSISTENT PATTERN, not document-specific.

GPT-5 excels when the adversarial task is framed as "enumerate all possible abuses of each mechanism" (systematic coverage). Opus excels when the task requires "invent novel attack concepts that exploit design boundaries" (creative adversarial thinking).

Model hierarchy for adversarial manipulation analysis:

  1. GPT-5 — most thorough enumeration, best at mechanism-level exploitation (15)
  2. Opus — most creative, finds system-boundary attacks others miss (12)
  3. Sonnet — adequate first pass, fast, but less specific (10)

Practical implication: For security-oriented architecture review:

  • Run GPT-5 for comprehensive attack surface enumeration
  • Run Opus for novel/creative attack vectors that exploit design boundaries
  • Sonnet is sufficient only as a quick initial screen
  • The UNION of GPT-5 + Opus findings (removing overlaps) would produce the most complete adversarial analysis

New finding about the aggregator itself: Several attacks identified by multiple models point to real design weaknesses worth addressing:

  1. No signal deduplication/independence validation (all 3 models)
  2. Primary signal determines all decision parameters regardless of group composition (all 3 models)
  3. Transient state + no replay = perfect adversarial erasure tool (all 3)
  4. Capacity/timeout treated as normal events even when weaponized (all 3)
  5. No cross-group correlation at aggregator level (Opus + Sonnet)
  6. TOCTOU at PortfolioRisk fan-in for concurrent decisions (Opus)