diff --git a/README.md b/README.md
index a66b0d3..5d5f38b 100644
--- a/README.md
+++ b/README.md
@@ -1,3 +1,81 @@
-# model-research
+# Model Research — AI for Analytical Work
 
-Comparative analysis of AI models on analytical tasks — not coding. Tracking what works when using GPT-5, Claude Opus, Claude Sonnet, and GPT-4.1 for research, document review, bias detection, and architecture analysis.
\ No newline at end of file
+Comparative analysis of AI models on **analytical tasks** — not coding.
+
+Most public discussion about LLM capabilities focuses on code generation.
+We found almost no published methodology for using models in analytical
+research tasks (searched 2026-04-26). This repo fills that gap with
+controlled experiments and reproducible findings.
+
+## What We're Testing
+
+Using GPT-5, Claude Opus 4.6, Claude Sonnet 4.6, and GPT-4.1 (+ Mini) for:
+
+- Architecture document review
+- Bias and assumption detection
+- Gap-finding in design specifications
+- Cross-document consistency analysis
+- Race condition identification
+- Adversarial path analysis
+- Contradiction detection
+- Regulatory compliance review
+
+## Key Findings (Summary)
+
+| # | Task Type | Winner | Key Insight |
+|---|-----------|--------|-------------|
+| 1 | PR review | Both | Different models catch different things — Sonnet: structural, GPT-5: semantic |
+| 2 | Bias detection | Framing | Signal-to-noise ratio matters more than model capability |
+| 9 | Gap-finding | GPT-5 | Reasoning tokens find domain-specific gaps, not generic ones |
+| 10 | Hidden assumptions | GPT-5 | Reasoning produces qualitatively different (not just more) findings |
+| 13 | Race conditions | Opus | Temporal interaction reasoning is Opus's strongest domain |
+| 15 | Design coherence | Task-dependent | Single-doc: model choice depends on document complexity |
+| 25 | Contradiction detection | Opus | Precision > exhaustiveness; Opus's self-correction is unique |
+| 28 | Cross-doc consistency | Opus | 2.4x faster than GPT-5 with more findings; boundary reasoning |
+| 29 | Adversarial analysis | GPT-5 + Opus | GPT-5: exhaustive; Opus: qualitatively different attack vectors |
+
+## Methodology
+
+Each experiment:
+1. Same input document(s) to all models
+2. Same structured prompt with explicit categories to analyze
+3. No tools, no project context beyond the document(s)
+4. Independent runs — no cross-pollination between models
+5. Results evaluated for: correctness, uniqueness, actionability
+
+**Context dimensions tracked:**
+- Rich vs minimal (how much background info)
+- Broad vs focused ("review this" vs "answer this specific question")
+- What kind of context (diff, full files, issue text, nothing)
+- Whether the model had tools or just text
+- Whether the task was step-by-step or open-ended
+
+## Repository Structure
+
+```
+findings/                    # Individual findings with full analysis
+  01-different-models-different-things.md
+  02-narrow-lens-vs-broad-review.md
+  ...
+  28-cross-document-consistency.md
+  29-adversarial-manipulation.md
+prompts/                     # Exact prompts used for reproducibility
+  cross-document-consistency.md
+  design-coherence.md
+  gap-finding.md
+  hidden-assumptions.md
+  ...
+open-questions.md            # Unanswered questions for future experiments
+methodology.md               # Full methodology notes
+```
+
+## Who We Are
+
+This research is conducted by [Rodin](https://gitea.weiker.me/rodin) (AI
+assistant) and Aaron Weiker. The test corpus is gargoyle — an Elixir
+trading system with extensive architecture documentation (~35 design docs,
+~5000 lines).
+
+## License
+
+CC BY 4.0 — share and adapt with attribution.
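The "uniqueness" criterion in step 5 of the Methodology section can be sketched as a set comparison. This is a minimal illustration, assuming each model's findings have already been normalized by a reviewer into short labels; the function name `partition_findings`, the model names, and the labels are all hypothetical and are not taken from the repository.

```python
from collections import Counter


def partition_findings(findings_by_model):
    """Split findings into (common, unique_by_model).

    findings_by_model maps a model name to an iterable of normalized
    finding labels (hypothetical labels assigned by a human reviewer).
    """
    sets = {model: set(labels) for model, labels in findings_by_model.items()}
    if not sets:
        return set(), {}
    # Common ground: labels reported by every model.
    common = set.intersection(*sets.values())
    # Count how many models reported each label.
    counts = Counter(label for labels in sets.values() for label in labels)
    # Unique findings: labels reported by exactly one model.
    unique = {
        model: {label for label in labels if counts[label] == 1}
        for model, labels in sets.items()
    }
    return common, unique


# Illustrative labels only, not real experiment output.
example = {
    "model_a": {"clock_skew", "pool_exhaustion", "rate_limiting"},
    "model_b": {"clock_skew", "pool_exhaustion", "zombie_processes"},
    "model_c": {"clock_skew", "pool_exhaustion"},
}
common, unique = partition_findings(example)
print(sorted(common))
print({model: sorted(labels) for model, labels in unique.items()})
```

The hard part is the normalization step that this sketch takes as given: deciding that two differently worded findings are the same finding still needs human judgment.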
diff --git a/findings/ALL-FINDINGS.md b/findings/ALL-FINDINGS.md
new file mode 100644
index 0000000..ed0762d
--- /dev/null
+++ b/findings/ALL-FINDINGS.md
@@ -0,0 +1,3249 @@
+# Model Findings — Analytical & Research Work
+
+_Tracking what actually works (and doesn't) when using AI models for research,
+analysis, bias detection, and document review — not coding._
+
+Started: 2026-04-26
+
+## Context
+
+We use multiple models in different roles: Claude Code (Opus/Sonnet) for
+generation, Sonnet + GPT-5 for independent dual review, smaller models for
+focused analytical tasks. Most public discussion is about coding. We found
+almost no published methodology for using models in analytical research tasks
+(searched 2026-04-26). That gap is why we're tracking this.
+
+## Findings
+
+### 1. Different models catch different things (confirmed)
+
+**Date:** 2026-04-26
+**Task:** PR reviews on DDD reference docs (~6,600 lines across 18 files)
+**How we used them:** Both models got the same task via pr-review skill —
+fetch diff, fetch full file content for changed files, review against PR
+description and linked issue acceptance criteria. Rich context: full diff,
+project CLAUDE.md conventions, issue body. Each reviewer ran independently
+in its own sub-agent with its own Gitea token. No cross-pollination.
+
+- GPT-5 caught SUMMARY.md verdict mismatches (Commanded classification,
+  small teams classification) that Sonnet missed entirely (PR #375)
+- Sonnet caught a broken cross-reference link first that GPT-5 missed (PR #378)
+- **Takeaway:** Different blind spots are real. Neither model is strictly better
+  for analytical review — they complement each other. This is why we run two
+  independent reviewers from different model families.
+
+### 2. Cheap model + narrow lens > expensive model + broad review (one data point)
+
+**Date:** 2026-04-26
+**Task:** Check 12 rewritten hypotheses for directional bias
+**How we used them:**
+- Sonnet & GPT-5: full PR review context (diff, file content, issue, AC).
+  Broad mandate: "review this PR." Rich context but unfocused task.
+- GPT-4.1 Mini: given ONLY the 12 hypothesis texts + one focused question:
+  "Do any of these hypotheses lead toward a predetermined conclusion?"
+  Minimal context, laser-focused task. No diff, no project docs, no issue.
+
+- Both Sonnet and GPT-5 approved the hypotheses as reviewers
+- GPT-4.1 Mini found ALL 12 pushed toward predetermined conclusions
+- Words like "requires," "necessary," "must be" were flagged as directional
+- **Takeaway:** Task framing mattered more than model size. Rich context +
+  broad mandate = missed the forest for the trees. Minimal context + precise
+  question = found exactly what mattered. This needs more testing — was it
+  the narrow framing, the lack of surrounding context, or both?
+
+### 3. GPT-5 times out on complex multi-step analytical tasks (confirmed pattern)
+
+**Date:** 2026-04-26
+**Task:** Full PR review of #382 (research document rewrite)
+**How we used it:** pr-review skill — multi-phase (fetch diff, fetch files,
+check CI, analyze against AC, post inline comments, post summary). 7 phases,
+many curl calls to Gitea API, large diff context. Heavy tool-use workflow
+through SAP proxy (adds latency vs direct API). 300s timeout.
+
+- Timed out 3 times at 300s (17, 6, 6 tool calls respectively)
+- Bottleneck was model processing time, not network (~0.3s Gitea API latency)
+- **Takeaway:** Break analytical tasks into focused bounded pieces. Twelve
+  small deep reviews > one rushed big one. The issue isn't GPT-5's analysis
+  quality — it's that multi-phase tool-heavy workflows burn too much time
+  on mechanics. Separate the data gathering from the analysis.
+
+### 4. GPT-5 defaults to delegation; Claude defaults to doing the work
+
+**Date:** 2026-04-26
+**Task:** PR review delegation to sub-agents
+**How we used them:** Both spawned as sub-agents from main session with
+same task description, same pr-review skill file, same Gitea credentials.
+Difference: GPT-5 got model override to gpt5, Sonnet used default model.
+Both got full skill instructions.
+
+- GPT-5 first attempt: spawned sub-sub-agents and timed out
+- GPT-5 with "do it yourself, no sub-agents" + step-by-step: worked
+- Even with constraints, GPT-5 sometimes dumps raw tool output instead of
+  synthesizing — needs explicit output format instructions
+- Claude (Sonnet/Opus) given the same kind of task does the work directly
+- **Takeaway:** GPT interprets complex task descriptions as delegation
+  opportunities. Claude interprets them as work to do. For GPT: explicit
+  single-actor instructions + output format. For Claude: can give broader
+  mandate. Same skill file, very different behavior.
+
+### 5. Sonnet is fast and catches structural issues; GPT-5 is slow and catches semantic issues
+
+**Date:** 2026-04-26
+**Task:** Dual review across PRs #372, #375, #378, #380, #382
+**How we used them:** Same pr-review skill, same context (diff + files +
+issue + AC), same sub-agent pattern. Only variable: model. Both got rich
+context. Both ran the full 7-phase review skill.
+
+- Sonnet consistently finishes first, catches formatting, broken links,
+  structural problems (missing sections, dangling refs)
+- GPT-5 takes longer, catches meaning-level problems (verdict mismatches,
+  classification inconsistencies, logical gaps)
+- **Takeaway:** With identical rich context and identical instructions, the
+  models naturally gravitate to different things. Sonnet is the structural
+  reviewer; GPT-5 is the semantic reviewer. Both roles matter. Question:
+  would Sonnet catch semantic issues if given a narrower "check for logical
+  consistency" framing instead of broad review?
+
+### 6. Single agent can't handle 1000+ line document generation (confirmed pattern)
+
+**Date:** 2026-04-26
+**Task:** DDD v2 forge analysis drafting
+**How we used them:** Single Sonnet/Opus sub-agents given full research
+material (~3,874 lines of research notes) + outline + instructions to write
+complete document. Very rich context (all research), very large output
+requirement (1000+ lines).
+
+- Five single-agent attempts died (OOM, disconnect, timeout) trying to write
+  full documents
+- Sectional approach (5 parallel Sonnet subagents, ~500-700 lines each)
+  succeeded immediately — each got same research but only their section's
+  outline
+- Same pattern when Claude Code attempted full Part V rewrite — died
+- Three agents × ~320 lines each worked first try
+- **Takeaway:** This is a confirmed, repeatable limit for generation tasks.
+  Not model-specific — it's a context/output length problem. Rich input
+  context is fine; it's the output length that kills. Break output into
+  sections, keep input context rich, draft in parallel, assemble.
+
+### 7. Emerging role assignments (pattern, not conclusion)
+
+**Date:** 2026-04-26 (one day of intensive work — treat as hypothesis)
+
+- Opus (via Claude Code): complex generation needing deep project context.
+  Rich context: CLAUDE.md, full codebase access, design docs. Broad mandate.
+- Sonnet: parallel volume work (5 subagents drafting simultaneously).
+  Rich context per section, constrained output scope.
+- GPT-5: independent analytical review. Rich context (diff + files + issue).
+  Best when task is bounded and explicit.
+- GPT-4.1 Mini: focused narrow analysis (bias detection). Minimal context,
+  precise question. Cheap and fast.
+- **Takeaway:** The role assignment matters, but so does the context shape.
+  Opus gets broad context + broad mandate. Sonnet gets broad context +
+  narrow scope. GPT-5 gets rich context + explicit task. GPT-4.1 Mini gets
+  minimal context + laser question.
+  We haven't tested swapping these
+  combinations — that's where the real learning will come from.
+
+### 8. Bias detection: all models catch it with any framing — when the signal isn't buried
+
+**Date:** 2026-04-27
+**Task:** Detect directional bias in 8 deliberately biased hypotheses about
+microservices vs monolith architecture for fintech startups.
+**How we used them:** Created fresh test material (8 hypotheses with
+pro-microservices bias via absolutes like "inevitably," "necessary," "must,"
+"requires," plus one factually inverted claim about consistency guarantees).
+Ran 4 conditions in parallel sub-agents:
+
+| Condition | Model | Framing | Context |
+|---|---|---|---|
+| A | GPT-4.1 Mini | Narrow: "Do any lead toward a predetermined conclusion?" | Hypotheses only |
+| B | Sonnet | Same narrow question | Hypotheses only |
+| C | GPT-5 | Same narrow question | Hypotheses only |
+| D | Sonnet | Broad: "Review quality, clarity, testability, and issues" | Hypotheses only |
+
+**Results:**
+- **All 4 conditions detected 8/8 biased hypotheses.** No misses.
+- All 3 narrow-framing models (Mini, Sonnet, GPT-5) produced structurally
+  similar output: per-hypothesis verdict, biasing words, neutral version,
+  severity assessment.
+- All 3 narrow-framing models flagged H8's factual inversion (distributed
+  transactions DON'T provide stronger consistency than monolithic ACID).
+- GPT-5 added specific counterexamples (LMAX Disruptor, Shopify, Stack
+  Overflow, Basecamp) — marginally richer analysis.
+- Sonnet broad mandate also caught the bias — framed as one of three
+  "systemic problems" (deterministic language, pro-microservices framing
+  bias, underspecified constructs). Additionally provided testability and
+  operationalization analysis that the narrow framing didn't ask for.
+- Sonnet broad took ~72s vs ~39s for narrow conditions (more output).
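As a point of comparison for the results above, a crude lexical baseline can flag the absolutes named in the test material ("inevitably," "necessary," "must," "requires"). This is only a sketch under stated assumptions: the function name `flag_directional_terms` and the sample sentences are hypothetical, and the real hypotheses are not reproduced here. It is not what any of the models did.

```python
import re

# The four absolutes named in the test-material description above.
DIRECTIONAL_TERMS = ("inevitably", "necessary", "must", "requires")


def flag_directional_terms(hypothesis, terms=DIRECTIONAL_TERMS):
    """Return the directional terms found as whole words, in the order of `terms`."""
    lowered = hypothesis.lower()
    return [
        term
        for term in terms
        if re.search(rf"\b{re.escape(term)}\b", lowered)
    ]


# Hypothetical sentences written for this sketch, not the experiment's hypotheses.
samples = [
    "Fintech startups must adopt microservices to scale.",
    "Microservices inevitably reduce risk and are necessary for compliance.",
    "Team size may influence the choice of architecture.",
]
for sentence in samples:
    print(flag_directional_terms(sentence), "<-", sentence)
```

A keyword check like this cannot catch a claim that is neutrally worded but factually inverted, such as H8, so it works as a noise-free lower bound and not as a substitute for model review.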
+
+**Takeaway:** When the biased text is the ONLY input (no surrounding noise),
+all tested models — including the cheapest (GPT-4.1 Mini) — detect bias
+regardless of whether the question is narrow or broad. This appears to
+**contradict** original finding #2 ("cheap model + narrow lens > expensive
+model + broad review"), but the key difference is context noise:
+
+- **Original experiment (2026-04-26):** Sonnet and GPT-5 missed bias during
+  FULL PR REVIEW with rich project context (diff, file content, issue text,
+  acceptance criteria, project conventions). The hypotheses were buried in
+  layers of review mechanics.
+- **This experiment (2026-04-27):** Even the "broad" condition gave ONLY the
+  hypothesis text — no diff, no PR structure, no project context noise.
+
+**Refined hypothesis:** The original finding #2 was about **signal-to-noise
+ratio**, not about model capability or framing precision. When biased text
+is presented in isolation, any model catches it. When biased text is buried
+in a large PR review with many other things to check, the bias signal gets
+lost in the noise — unless you explicitly ask about it. The "narrow lens"
+worked because it eliminated the noise, not because smaller models are
+better at bias detection.
+
+**Next experiment to confirm:** Give a model the FULL PR review context
+(diff, files, issue, AC) but add the narrow bias question as an explicit
+review checklist item. If the model catches bias despite the rich context,
+it confirms the signal-to-noise hypothesis. If it misses, it suggests
+something else is at play (attention allocation, task switching cost).
+
+### 9. Gap-finding in architecture docs: GPT-5 finds domain-specific gaps, GPT-4.1 is generic, Mini is formulaic
+
+**Date:** 2026-05-02
+**Task:** Identify missing failure scenarios in gargoyle's `failure-modes.md` (383 lines)
+**How we used them:** Same document (full text, no truncation) + same focused
+analytical question to all 3 models via HAI proxy (OpenAI-compatible endpoint).
+No tools, no project context beyond the document itself. Single prompt, no
+conversation history. Temperature 0.3 for GPT-4.1/Mini, default (1.0) for GPT-5
+(required by the model).
+
+| Model | Time | Output tokens | Reasoning tokens | Scenarios found |
+|---|---|---|---|---|
+| GPT-4.1 Mini | 16s | 2,003 | 0 | 10 |
+| GPT-4.1 | 24s | 2,575 | 0 | 15 |
+| GPT-5 | 45s | 8,565 | 6,656 | 14 |
+
+**What they found — common ground (all 3 identified):**
+- ETS table corruption/loss affecting gates
+- BEAM scheduler starvation / GC pauses
+- WebSocket message duplication/reordering
+- Postgres connection pool exhaustion / deadlocks
+- Clock skew / time drift
+- Process registry inconsistency
+
+**GPT-5 unique findings (not in either other model):**
+- Broker rate limiting (429s) — not "connection lost" so existing logic
+  doesn't trigger, but can't flatten during kill switch
+- Broker auth failure / credential rotation — distinct from connection loss
+- Corporate actions (splits, symbol changes) — position drift without
+  triggering staleness detection
+- Duplicate pipeline instances for same user (DynamicSupervisor race)
+- DB "commit unknown outcome" causing restart loops (Ecto commit succeeds
+  at Postgres but client times out → retry → unique constraint → crash loop)
+- Cross-symbol strategies with partial staleness — multi-leg signals
+  computed from mix of fresh and stale data
+- Partial cancel_all during kill switch masked by process restarts
+
+**GPT-4.1 unique findings (not in GPT-5 or Mini):**
+- Zombie processes after halt (supervisor misconfiguration)
+- Unsupervised Task crashes going unnoticed
+- Audit log writes failing silently (not in same transaction as state change)
+- ClOrdID unique constraint violation from race in sequence generation
+- Broker API semantic changes (silent breaking changes)
+
+**GPT-4.1 Mini unique findings:**
+- Race between kill switch engagement and reconciliation completion
+  (timing coordination gap) — this was more explicitly called out than
+  in the other models, though GPT-5 touches it implicitly
+- Strategy.Worker / Aggregator partial crash inconsistency
+
+**Quality assessment:**
+- **GPT-5** had the most *domain-relevant* and *actionable* gaps. Broker
+  rate limiting, auth failures, corporate actions, and the DB commit
+  unknown-outcome scenario are all realistic production issues specific
+  to THIS system. The cross-symbol partial staleness finding shows
+  deeper architectural reasoning about component interactions.
+- **GPT-4.1** was thorough and well-structured but more generic/defensive.
+  Many of its unique findings (zombie processes, unsupervised Tasks,
+  audit log loss) are general Elixir concerns rather than specific to
+  the document's architecture. Good for a completeness checklist.
+- **GPT-4.1 Mini** was formulaic — each finding followed the same template
+  and several were somewhat surface-level or restated things the document
+  partially covers. Still found the most scenarios per dollar.
+
+**Takeaway:** For gap-finding in architecture documents, GPT-5's reasoning
+tokens pay off. It doesn't just list "things that could go wrong" — it
+identifies *specific interactions* that the document's existing mechanisms
+don't cover (e.g., rate limiting bypasses the "connection lost" detection,
+corporate actions bypass staleness detection). GPT-4.1 is a solid
+middle-ground: more thorough than Mini, less insightful than GPT-5.
+Mini is fine for a quick sanity check but won't find the subtle gaps.
+
+**Cost-effectiveness:** Mini found 10 scenarios in 16s for ~7K tokens.
+GPT-5 found 14 scenarios (with 7 genuinely unique insights) in 45s for
+~13.5K tokens (including 6.6K reasoning). For architecture review where
+missing a gap could mean financial loss, the GPT-5 cost is justified.
+For routine doc review, Mini + human judgment is probably sufficient.
+
+### 10. Hidden-assumption identification: GPT-5's reasoning produces qualitatively different (not just more) findings
+
+**Date:** 2026-05-02
+**Task:** Identify hidden assumptions in gargoyle's `cold-start-and-recovery.md` (234 lines)
+that could break under real-world production conditions.
+**How we used them:** Same document (full text) + same focused analytical question
+to all 3 models via HAI proxy (OpenAI-compatible endpoint). No tools, no project
+context beyond the document itself. Single prompt, no conversation history.
+Temperature 0.3 for GPT-4.1/Mini; GPT-5 uses default (required).
+
+| Model | Time | Output tokens | Reasoning tokens | Assumptions found |
+|---|---|---|---|---|
+| GPT-4.1 Mini | 25s | 3,090 | 0 | 12 |
+| GPT-4.1 | 77s | 2,751 | 0 | 14 |
+| GPT-5 | 78s | 2,649 | 4,096 | 26 |
+
+**What they found — common ground (all 3 identified):**
+- Broker API consistency/availability during reconciliation
+- ETS table availability and fail-closed behavior
+- Single-writer/mailbox ordering guarantees holding in practice
+- User independence assumption vs shared resources (rate limits, DB)
+- Reconciliation idempotency under repeated runs
+- Corporate action data completeness/timeliness
+- Escalation threshold calibration vs changing market conditions
+- Strategy warmup with partial/missing historical data
+- Signal expiry correctness on restart
+
+**GPT-5 unique findings (not in either other model):**
+- Unbounded mailbox growth during extended reconciliation (memory pressure
+  from queued messages at market open)
+- handle_continue side effects in OTHER processes (risk, metrics) acting
+  concurrently via different paths
+- Pre-existing GTC orders filling while gated (positions as moving target)
+- Broker position semantics mismatch (trade-date vs settled-date)
+- Strategy warmup evaluate() having non-signal side effects (metrics, caches)
+- Historical bar / live tick boundary alignment (double-processing or gaps)
+- ETS gate caching in process state creating fail-open windows
+- Correlated retry stampede when many users restart together
+- Corporate action double-application race with broker (missing idempotency
+  keys per action/instrument/date)
+- Kill switch state vs DB unavailability at startup
+- Market data subscriptions as shared bottleneck across "independent" users
+- Time-invariant signals incorrectly expired by aggregation window logic
+- Broker fills vs positions endpoints internally inconsistent (different caches)
+- Positions changing under reconciliation while kill switch is engaged
+- Gate phase sequencing: :ready written before worker warmup completes
+- Periodic reconciler allowing 1hr of divergent trading (rate-of-change blind)
+
+**GPT-4.1 unique findings (not in GPT-5 or Mini):**
+- No correlated failure handling (all failure modes treated as isolated) —
+  only model to frame this as a meta-assumption about the failure table
+
+**GPT-4.1 Mini unique findings:**
+- None that weren't also covered by the other two models
+
+**Quality assessment:**
+- **GPT-5** didn't just find more assumptions — it found *qualitatively
+  different kinds*. Many of its unique findings involve multi-component
+  interactions (mailbox + reconciliation + market open timing), semantic
+  mismatches (trade-date vs settled positions), and second-order effects
+  (metrics side effects during warmup, GTC orders filling while gated).
+  These require reasoning about system behavior across boundaries the
+  document doesn't explicitly draw.
+- **GPT-4.1** was competent and structured, found the same core assumptions
+  as Mini, plus one good meta-observation about correlated failures. But
+  it stayed within the document's own framing — it found assumptions the
+  document *almost* states rather than ones the document can't see.
+- **GPT-4.1 Mini** was formulaic. Every finding maps cleanly to a section
+  of the document. It's essentially "what could go wrong with each stated
+  mechanism" rather than "what does this design take for granted about
+  the world outside itself."
+
+**Key insight — reasoning tokens change the KIND of analysis:**
+GPT-5's 4,096 reasoning tokens aren't producing "more of the same" —
+they're producing a different analytical mode. The non-reasoning models
+(4.1 and Mini) identify risks within the document's own frame of reference.
+GPT-5 reasons about the document's relationship to the external world:
+broker semantics, deployment topology, OTP runtime behavior under load,
+timing correlations across independent subsystems. This is the difference
+between "what could this mechanism fail at" and "what must be true about
+the world for this mechanism to work."
+
+**Comparison to Finding #9 (gap-finding on failure-modes.md):**
+Same pattern confirmed. GPT-5 consistently finds domain-specific,
+interaction-level issues that require reasoning about component boundaries.
+GPT-4.1 is thorough within-frame. Mini is formulaic. The gap between
+GPT-5 and the others is larger here than in #9 — possibly because
+"hidden assumptions" requires more abstraction than "missing failure
+scenarios." Assumption-finding requires the model to reason about what
+ISN'T stated, which benefits more from extended reasoning.
+
+**Practical implication:** For architecture review, running GPT-5 on
+"identify hidden assumptions" is higher-value than the same question to
+non-reasoning models. The cost difference (4K extra reasoning tokens) is
+trivial for a document that will drive months of implementation. Use
+non-reasoning models for within-frame checks ("does this section have
+gaps") and reasoning models for cross-boundary analysis ("what must be
+true about the world for this to work").
+
+### 11. Hidden-assumption identification on simpler doc: reasoning models diverge in approach, both outperform non-reasoning
+
+**Date:** 2026-05-02
+**Task:** Identify hidden assumptions in gargoyle's `market-calendar.md` (238 lines)
+— a simpler, single-component document vs the 234-line cold-start doc from Finding #10.
+**How we used them:** Same document (full text) + same focused analytical question
+to all 3 models via HAI proxy. No tools, no project context beyond the document
+itself. Single prompt, no conversation history. Temperature 0.3 for GPT-4.1;
+GPT-5 and Opus use their defaults (required). Same prompt across all three.
+
+| Model | Time | Output tokens | Reasoning tokens | Assumptions found |
+|---|---|---|---|---|
+| GPT-4.1 | 19s | 2,554 | 0 | 14 |
+| Claude Opus 4.6 | 74s | 3,288 | (internal, not reported) | 13 |
+| GPT-5 | 101s | 8,417 | 5,504 | 24 |
+
+**What they found — common ground (all 3 identified):**
+- Alpaca calendar API data correctness/completeness as single source of truth
+- Alpaca API availability at startup (no local cache persistence)
+- ETS table atomicity during refresh (partial-state exposure risk)
+- System clock/timezone alignment (dates are timezone-naive)
+- NYSE emergency/unscheduled closures not reflected until refresh
+- Two-year cache range sufficiency
+- API response format stability
+- Rate limiting / API capacity concerns
+
+**GPT-5 unique findings (not in either other model):**
+- Date struct term-ordering in ETS match specs may not match chronological
+  order (ETS range guards rely on Erlang term comparison, not Date semantics)
+- close_time/1 returns naive Time without timezone — DST conversion burden on
+  consumers, one hour off twice per year
+- trading_day?/1 conflates "not a trading day" with "calendar unavailable" —
+  operational outages invisible to callers
+- ETS table name collision risk (global namespace per node)
+- No other process should modify the ETS table (access mode discipline)
+- Network egress and credential availability on all nodes at all times
+- ETS read/write concurrency flags for contention under load
+- Direct ETS access by consumers bypassing the module's error handling
+- next/prev_trading_day edge cases at cache boundaries
+- Alpaca API start/end parameter inclusivity (off-by-one at year boundaries)
+- Half-day vs full-day distinction insufficiency for special sessions
+- Small table size makes O(n) selects acceptable (scaling concern)
+- Year-end refresh failure leaving gaps at boundary
+- Alpaca never omits a legitimate trading day (absence = non-trading conflation)
+
+**Claude Opus unique findings (not in either other model):**
+- ETS ownership semantics: heir-protection would change fail-closed behavior;
+  current design means ALL consumers fail simultaneously during crash-to-restart
+  window (framed as a design tension, not just a risk)
+- Silent data corruption from partial API response (pagination/truncation) —
+  specifically that missing rows are SILENT failures with no error propagation
+  (other models mentioned API completeness but not the silence aspect)
+- Consumers calling functions with Dates, not DateTimes — the API accepts Date.t()
+  but doesn't specify HOW consumers should derive "today" (system-wide
+  coordination problem made invisible by the API contract)
+- `trading_day?/1` returning false is NOT fail-closed for ALL consumers — only
+  for PDT-like "block action" consumers; for batch-trigger consumers it's
+  fail-OPEN (subtle inversion of safety semantics)
+- Startup ordering: background_children placement means PDT could receive orders
+  before MarketCalendar finishes init, creating recurring rejection windows
+  during hot deploys
+- Continuous-running assumption for refresh timer (daily restarts would mean
+  refresh mechanism never fires — no staleness alert exists)
+
+**GPT-4.1 unique findings (not in either other model):**
+- No need for real-time calendar change notification (event emission gap)
+- All consumers using the same module instance (configuration consistency)
+- No need for historical calendar data (audit/backtesting limitation)
+- Consumers correctly handling {:error, :calendar_unavailable} in practice
+
+**Quality assessment:**
+- **GPT-5** found the most assumptions (24) with the most technical specificity.
+  Many are implementation-level insights (ETS term ordering, named table
+  collisions, read_concurrency flags) that demonstrate deep Erlang/OTP
+  knowledge. Some are slightly obvious or overlapping. The ETS term-ordering
+  finding is genuinely insightful and points at a real hazard: Elixir `Date`
+  structs are maps, and Erlang term order compares map values in key order
+  (`day`, then `month`, then `year`), so structural comparison does not follow
+  chronological order (`Date.compare/2` does). Whether the module is affected
+  depends on how it stores dates in ETS. Also provided concrete recommendations.
+- **Claude Opus** found fewer assumptions (13) but several were qualitatively
+  different — they identified *design tensions* and *semantic inversions*
+  rather than just failure scenarios. The fail-open/fail-closed inversion
+  (finding #12), the ETS ownership tension, and the "API makes timezone
+  coordination invisible" findings show reasoning about the design's
+  *relationship to its consumers* rather than just its internal mechanics.
+  Tighter, more curated output with less filler.
+- **GPT-4.1** was competent and well-structured (14 assumptions, clean table)
+  but stayed within the document's own framing. Its unique findings are
+  relatively generic ("consumers should handle errors correctly," "no
+  historical data"). Solid baseline, no surprises.
+
+**Key insight — two reasoning models, different analytical styles:**
+GPT-5 and Opus are both reasoning models, but they reason about different
+things. GPT-5 reasons DEEPER into implementation mechanics (how does ETS
+actually work? what are the exact failure modes of each component?). Opus
+reasons WIDER about system context (how does this component's API contract
+affect the safety properties of the overall system? what tensions does this
+design create that aren't visible to the author?).
+
+GPT-5's approach: "Here are 24 things that could go wrong, many highly
+technical." Opus's approach: "Here are 13 assumptions, several of which
+reveal design tensions the document can't see about itself."
+
+**Does the reasoning gap narrow with simpler docs?**
+Comparing to Finding #10 (cold-start doc, 234 lines, 26 vs 14 vs 12 assumptions
+for GPT-5/GPT-4.1/Mini):
+- GPT-5 still dominates in raw count (24 vs 14 for GPT-4.1)
+- The gap ratio is similar (~1.7x here vs ~1.9x in Finding #10)
+- Document complexity doesn't appear to be the driver of the gap —
+  reasoning tokens enable more exhaustive exploration regardless of
+  input complexity
+
+**Claude Opus vs GPT-5 (the headline comparison):**
+They're not competing on the same axis. GPT-5 is better for "find all
+possible issues" (breadth + technical depth). Opus is better for "find
+the assumptions that will actually surprise the author" (insight density).
+If you want a security-audit-style exhaustive list: GPT-5. If you want a
+design-review-style "here's what you're not seeing about your own design":
+Opus. Both are better than GPT-4.1 for this task, but in different ways.
+
+**Practical implication:** Run BOTH reasoning models on architecture docs.
+GPT-5 catches implementation-level hazards the team might miss during
+coding. Opus catches design-level tensions the team might miss during
+planning. GPT-4.1 is sufficient as a quick sanity check but won't
+surprise you.
+
+### 12. Sonnet 4.6 outperforms expectations on assumption-finding; competes with reasoning models on complex docs
+
+**Date:** 2026-05-02
+**Task:** Identify hidden assumptions in gargoyle's `order-execution.md` (785 lines)
+— a complex, multi-component document covering OrderManager, BrokerAdapter,
+TradeStream, and PositionReconciler.
+**How we used them:** Same document (full text, no truncation) + same focused
+analytical question to all 3 models. GPT-5 via HAI OpenAI endpoint; Opus 4.6
+and Sonnet 4.6 via HAI Anthropic endpoint. No tools, no project context beyond
+the document itself. Single prompt, no conversation history.
+
+| Model | Time | Output tokens | Reasoning tokens | Assumptions found |
+|---|---|---|---|---|
+| GPT-5 | 93s | 8,485 | 6,016 | 20 |
+| Claude Sonnet 4.6 | 106s | 4,637 | (internal) | 17 |
+| Claude Opus 4.6 | 105s | 4,615 | (internal) | 12 |
+
+**What they found — common ground (all 3 identified):**
+- Synchronous broker REST calls blocking OrderManager GenServer (mailbox growth)
+- TradeStream event ordering assumptions (out-of-order fills/status)
+- Fill deduplication gap (no explicit fill-level idempotency)
+- `cancel_all/1` with `timeout: :infinity` blocking GenServer during FLATTEN
+- Recovery/restart races with TradeStream fill delivery (fills queued during
+  `handle_continue/2`)
+- Lot operation idempotency under crash recovery (partial execution)
+- Replace race: fills for new broker_order_id arriving before `replaced` event
+- Database write latency impact on GenServer throughput under burst fills
+- ETS table scope assumptions (single-node, access mode)
+
+**GPT-5 unique findings (not in either Claude model):**
+- Rate-limit retry blocking OrderManager inline (no async retry path specified)
+- Single TradeStream connection per user not enforced (duplicate detection gap)
+- Kill switch FLATTEN vs degraded state interaction (OM drops cancels while
+  degraded, but FLATTEN calls cancel_all through OM)
+- ClOrdID uniqueness scope/retention at broker across sessions and days
+- `after: datetime` filter semantics (clock skew, timezone, inclusive/exclusive)
+- Reconciliation responses may exceed single-response size (no pagination)
+- Event broadcasting blocking model (synchronous vs fire-and-forget)
+- Credential rotation during TradeStream connection lifetime
+- `market_closed` semantics varying across brokers (reject vs queue)
+- Dropped Alpaca statuses (stopped/suspended/calculated) may affect accounting
+
+**Claude Sonnet 4.6 unique findings (not in either other model):**
+- Single fill per fill event assumption (broker batching multiple fills into
+  one WebSocket message)
+- Lot operations (`Lots.open/2`, `Lots.close/4`) assumed to never fail —
+  no `{:error, _}` handling shown, crash propagation risk
+- `Task.async_stream` inside GenServer creating linked tasks whose crash
+  signals propagate to OrderManager during critical cancel_all
+- Broker cancel semantics during in-flight replace at the broker level
+  (cancel targets old broker_order_id which broker already replaced away)
+- Database operations in fill processing assumed transactional (no explicit
+  Ecto.Multi/transaction mention)
+- Broker position reflects only Gargoyle's activity (external trades cause
+  false-positive reconciliation halts)
+
+**Claude Opus 4.6 unique findings (not in either other model):**
+- `{:ok, broker_order_id}` from REST place conflated with durable OMS
+  acceptance vs mere HTTP acknowledgment (no timeout on `submitted` state)
+- Concurrent `apply_corrections/2` from periodic reconciler running in
+  separate process conflicts with OrderManager's single-writer invariant
+  (corrections write to same tables outside GenServer serialization)
+- Reconciliation gate initialized state after `:rest_for_one` restart —
+  ETS table EXISTS but freshly initialized vs table MISSING are different
+  conditions with different safety properties
+- Escalation state reset after crash creating double-exposure window
+
(systematic issue persists but escalation timer resets to zero) +- `replace/3` error semantics: non-atomic replace (cancel + re-submit) + where cancel succeeds but re-submit fails leaves original order cancelled + at broker while OrderManager reverts to "working" locally + +**Quality assessment:** +- **GPT-5** maintained its pattern from previous findings: broadest coverage + (20 assumptions), most technically specific about implementation details. + Found cross-cutting operational concerns (clock skew, credential rotation, + pagination) that the Claude models didn't surface. However, several of its + findings were medium-severity operational concerns rather than architectural + assumptions. +- **Claude Sonnet 4.6** was the surprise performer. Found 17 assumptions — + close to GPT-5's count (85%) — and several of its unique findings were + genuinely insightful. The `cancel_all` race with broker-side replace state + (finding #16) and the lot operation failure propagation (finding #6) show + deep reasoning about component interaction despite Sonnet not being + positioned as a "reasoning" model. More importantly, Sonnet's findings were + consistently well-structured with clear "how it could break" scenarios. +- **Claude Opus 4.6** found the fewest assumptions (12) but — consistent with + Finding #11 — its unique findings were qualitatively different. The + concurrent `apply_corrections` write conflict, the gate initialization state + distinction, and the non-atomic replace error semantics all reveal design + tensions that neither GPT-5 nor Sonnet identified. Opus continues to reason + about the *boundaries between components* rather than within-component + mechanics. + +**Key insight — Sonnet 4.6 is NOT just a faster GPT-4.1:** +In previous findings (#9, #10, #11), non-reasoning models (GPT-4.1, GPT-4.1 +Mini) performed significantly below reasoning models on assumption-finding. +GPT-4.1 found ~14 assumptions where GPT-5 found 24-26. 
Here, Sonnet 4.6 +finds 17 where GPT-5 finds 20 — a much smaller gap (~85% vs ~54-58% previously). + +Sonnet's findings also included several that showed genuine reasoning about +component interactions (not just within-frame risks). This suggests Sonnet 4.6 +is qualitatively different from GPT-4.1 for analytical work — it occupies a +middle ground between GPT-4.1's "competent but surface-level" and GPT-5's +"exhaustive and deep." The severity distribution was also similar to GPT-5 +(multiple critical/high findings), whereas GPT-4.1 in previous experiments +tended toward medium-severity generic concerns. + +**Updated model hierarchy for assumption-finding:** +1. GPT-5 — broadest coverage, most operational-level findings (20) +2. Sonnet 4.6 — strong analytical depth, good component interaction reasoning (17) +3. Opus 4.6 — fewest but most architecturally insightful, finds design tensions (12) +4. GPT-4.1 — competent within-frame, generic (~14 from previous experiments) +5. GPT-4.1 Mini — formulaic, surface-level (~10-12) + +**Practical implication:** For architecture review, Sonnet 4.6 is now a strong +candidate for volume analytical work. It's fast enough to run alongside GPT-5 +and catches different things (lot operation failures, broker-side replace races). +The ideal three-model review stack for architecture docs appears to be: +- GPT-5 for breadth + operational concerns +- Sonnet 4.6 for component interaction analysis +- Opus 4.6 for design-tension identification + +Each consistently finds things the others miss. The cost-efficiency argument +for Sonnet is strong: ~85% of GPT-5's count at fewer output tokens per finding +(~273 vs ~424; 4,637 vs 8,485 tokens for 17 vs 20 assumptions). + +### 13. 
Race condition identification: Opus excels at subtle temporal interactions; GPT-5 is exhaustive; Sonnet struggles with concurrency reasoning + +**Date:** 2026-05-03 +**Task:** Identify race conditions, timing-dependent bugs, and ordering hazards in +gargoyle's `concurrent-failure-detection.md` (241 lines) — a document specifically +about concurrent detection logic with timers, ETS state, and multi-process events. +**How we used them:** Same document (full text) + same focused analytical question +to all 3 models via HAI proxy. Prompt specifically asked for event ordering problems, +timer interaction bugs, state visibility gaps, crash/restart timing, and multi-instance +coordination. Required each finding to reference specific mechanisms in the document +with specific interleaving descriptions. No tools, no project context beyond the +document itself. + +| Model | Time | Output tokens | Reasoning tokens | Race conditions found | +|---|---|---|---|---| +| GPT-5 | 116s | 10,587 | 8,192 | 12 | +| Claude Opus 4.6 | ~105s | 4,610 | (internal) | 10 | +| Claude Sonnet 4.6 | ~39s | 1,404 | (internal) | 7 | + +**What they found — common ground (all 3 identified):** +- Stale timer messages in mailbox after cancellation (classic Erlang timer race) +- HealthMonitor crash losing compound detection state (init from :unknown, no replay) +- ETS vs GenServer state divergence visible to dashboard +- Kill switch mode conflict (FLATTEN from single-process vs HALT from compound path) + +**GPT-5 unique findings (not in either Claude model):** +- Cross-sender message ordering: recovery events from pipeline processes vs timer + expiry from runtime (Erlang preserves per-sender order, NOT cross-sender) — the + "rapid recovery" safety argument in the doc relies on state being updated before + timer fires, which isn't guaranteed +- Debounce starvation: flapping component repeatedly restarting the timer, causing + compound evaluation to be indefinitely postponed while ≥2 genuinely degraded +- 
State regression: {:degraded} arriving after {:escalated, :kill_switch} with no + guard in the event table — state machine allows regressing from :halted to :degraded +- Cold-start window: application boots with existing degraded processes that won't + re-emit events, compound detection never fires +- Catch-all handle_info could accidentally swallow timer messages if pattern matching + is ordered wrong (implementation pitfall of the described approach) +- Debounce window growing beyond calibrated bounds from repeated timer restarts + +**Claude Opus unique findings (not in either other model):** +- Timer restart pushing evaluation PAST single-process escalation timeout — the + debounce mechanism can DEFEAT compound detection when second degradation arrives + near end of first window (resets to full window, first process escalates via + single-process path before new window fires). This means system gets FLATTEN + instead of HALT — exactly what compound detection was supposed to prevent. +- Strategy worker single-atom masking via event ordering: Worker A degrades, Worker + B degrades (same atom), Worker A recovers → atom set to :normal while B is still + degraded. Event ordering across different workers mapped to same atom creates + state loss. +- Registry stale PID after HealthMonitor crash: if subscription is PID-based (not + PubSub topic), new HM instance is deaf — events go to dead PID, silently dropped. + Compound detection completely disabled for that user until subscription refresh. +- :rest_for_one cascade + coincidental independent issue: debounce designed to + filter cascade restarts, but cascade can EXPOSE a pre-existing issue (SignalRisk + restarts and finds its upstream stale). Is this "compound" or "cascade + bad luck"? + Semantic ambiguity the design doesn't address. 
+- Compound cleared event without recovery debounce: :compound_degradation_cleared + emitted immediately when last process recovers (no settling period), causing + operator oscillation if recovery is transient. + +**Claude Sonnet unique findings:** +- ETS table creation race at startup (HealthMonitor writes before table exists) +- Registry lookup failure during pipeline startup (events before HM registered) +- However, Sonnet also made analytical errors: it described "multiple HealthMonitor + instances for the same user" scenarios despite the document clearly stating one + instance per user via DynamicSupervisor. Several of its findings assumed + multi-instance coordination that doesn't match the architecture. + +**Quality assessment:** +- **GPT-5** was the most exhaustive and technically precise. Its cross-sender + ordering finding (#2) is genuinely insightful — it identifies that the document's + "rapid recovery" safety argument implicitly assumes events arrive in wall-clock + order, which Erlang does NOT guarantee across different senders. The debounce + starvation finding (#3) identifies a real operational hazard with practical + consequences. All 12 findings reference specific mechanisms and describe specific + interleavings clearly. +- **Claude Opus** found fewer race conditions but several were qualitatively + superior. The timer-restart-defeats-compound-detection finding is the most + architecturally significant race in the entire analysis — it shows that the + debounce mechanism can work AGAINST the design's stated goals in specific + (realistic) timing scenarios. The strategy-worker event ordering masking is + also a genuine design flaw unique to the single-atom decision. Opus continues + its pattern of reasoning about design TENSIONS rather than just failure modes. +- **Claude Sonnet** was notably weaker here than in previous experiments. Only + 1,404 output tokens vs 4,610 (Opus) and 10,587 (GPT-5). 
Several findings + contained analytical errors (assuming multi-instance coordination that doesn't + exist). It found only 7 races, and 2-3 of those were based on misreadings of + the architecture. This is a significant regression from Finding #12 where + Sonnet found 17 assumptions (85% of GPT-5's count). + +**Key insight — concurrency reasoning is a different skill than assumption-finding:** +In the previous experiment (#12), Sonnet 4.6 performed well on +assumption-finding (a task that requires reasoning about what's NOT stated). +Here, on race condition identification (a task requiring reasoning about temporal +interleavings and message ordering semantics), Sonnet drops significantly. This +suggests the task type matters more than we previously thought: + +- **Assumption-finding:** Requires breadth of consideration ("what must be true + for this to work?"). Sonnet handles this well — it's essentially pattern + matching across possible failure dimensions. +- **Race condition identification:** Requires SEQUENTIAL reasoning about specific + interleavings ("if A happens, then B happens, then C happens, what state is + visible?"). This benefits dramatically from extended reasoning tokens (GPT-5's + 8,192 reasoning tokens) or from Opus's internal reasoning depth. + +The lesson: don't extrapolate model performance across task types. A model that's +85% as good at assumption-finding may be 50% as good at concurrency analysis. +The cognitive demands are different. + +**Opus's distinguishing strength — finding design contradictions:** +Opus's best finding (timer restart defeating compound detection) isn't just a +race condition — it's identifying that the debounce mechanism can work against +the design's own stated goals. This is consistent with Opus's pattern in +previous findings: it finds tensions where one part of the design undermines +another part. 
For race condition analysis specifically, this manifests as +"here's where your safety mechanism becomes your vulnerability." + +**Practical implication for architecture review:** +- For race condition analysis: use GPT-5 (exhaustive) + Opus (design-tension) +- Sonnet is NOT suitable for concurrency reasoning tasks — use it for + assumption-finding and structural review instead +- The three-model stack needs task-appropriate assignment: + - Structural/assumption review: all three models contribute + - Concurrency/race analysis: GPT-5 + Opus only + - Bias detection: any model (per Finding #8) + +### 14. Cross-component interaction analysis: GPT-5 Mini enters as viable mid-tier; task framing strongly shapes output quality + +**Date:** 2026-05-03 +**Task:** Identify cross-component interaction failures in gargoyle's +`continuous-risk-monitoring.md` (459 lines) — a document specifying +PortfolioMonitor's interactions with OrderManager, PortfolioRisk, MarketData, +KillSwitch, ETS tables, and the pipeline supervision tree. +**How we used them:** Same document (full text) + same focused analytical +question to all 3 models via HAI proxy. Prompt was highly structured: specified +5 categories of cross-component failures to look for (semantic mismatches, +ordering violations, feedback loops, partial visibility, supervision boundary +effects) and required specific output format (components, sequence, gap, impact). +No tools, no project context beyond the document itself. 
+ +| Model | Time | Output tokens | Reasoning tokens | Findings | +|---|---|---|---|---| +| GPT-5 Mini | 68s | 5,445 | 2,240 | 6 (+1 cut off) | +| GPT-5 | 116s | 10,604 | 8,128 | 10 | +| Claude Sonnet 4.6 | 38s | 1,868 | (internal) | 8 | + +**What they found — common ground (all 3 identified):** +- Fill-to-position query race (fill event triggers evaluation but position + store hasn't yet reflected the fill) +- Restrict flag ETS table destruction on PM crash → permissive window +- Kill switch check vs liquidation submission race +- Ticker subscription timing gap (new position opened but ticks not yet + subscribed → breach goes undetected) + +**GPT-5 unique findings (not in either other model):** +- Stale prices are NOT fail-safe for drawdown (higher stale price → inflated + portfolio value → understated drawdown). The document claims "fail-safe" + but this only holds for exposure metrics, not drawdown. This is the most + architecturally significant finding across all three models. +- Price definition mismatch between PM (last_trade from ETS) and OrderManager/ + broker (bid/ask/mid) causing mis-sized liquidation and oscillation +- Cross-component oscillation: PM hysteresis internal vs PRisk's immediate + binary restrict gate clearing (no cross-component cooldown) +- Liquidation stuck after OM restart (terminal events lost; liquidation_in_ + flight stays true indefinitely with no timeout/rehydration) +- "Minimal risk checks" not enforced — PM goes through same OM gates as + strategy orders but MarketHours/StalePrice controls may reject after-hours + or stale-price liquidation attempts +- FLATTEN mode semantics gap — PM refrains from liquidating when kill switch + engaged, but FLATTEN cancels open orders without actually CLOSING positions. + No component left to close positions. 
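
The stale-price finding in the list above can be checked with a little arithmetic. A minimal sketch, assuming drawdown is measured as the fractional drop from a high water mark (the reviewed document may define it differently); the position size and prices are made up for illustration:

```python
def drawdown(high_water_mark: float, portfolio_value: float) -> float:
    """Fractional drop from the high water mark (assumed definition)."""
    return (high_water_mark - portfolio_value) / high_water_mark


# Hypothetical single long position: 100 shares, high water mark set at $100/share.
shares = 100
high_water_mark = shares * 100.0

true_price = 90.0   # where the market actually trades now
stale_price = 98.0  # last trade still cached from before the drop

true_dd = drawdown(high_water_mark, shares * true_price)
stale_dd = drawdown(high_water_mark, shares * stale_price)

print(f"true drawdown:  {true_dd:.1%}")
print(f"stale drawdown: {stale_dd:.1%}")
```

With a hypothetical 5% drawdown limit, the stale view stays under the limit while the real loss is twice the limit, so for a long position a stale high price errs toward the permissive side rather than the safe one.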
+ +**Claude Sonnet 4.6 unique findings (not in either other model):** +- Liquidation feedback loop with PortfolioRisk — buy-to-cover for short + positions could INCREASE net long exposure at portfolio level, paradoxically + worsening concentration while fixing position-level metrics +- High water mark reset on pipeline restart masks true intraday drawdown + (restart → HWM resets to lower current value → drawdown calculated from + false baseline → larger losses permitted than intended) +- Multi-metric breach with single boolean flag — concentration liquidation + for AAPL sets liquidation_in_flight, blocking simultaneous daily P&L + liquidation for different positions +- Market close/open vs after-hours fills — claims to evaluate after-hours + fills but uses stale market-close prices + +**GPT-5 Mini unique findings (not in either other model):** +- OrderManager order splitting/remapping causing liquidation_in_flight + correlation failure (parent/child order ID mapping breaks terminal-event + detection). Well-reasoned but highly implementation-specific. +- Restrict/clear oscillation loop with strategy behavior (strategies react + to rejects → back off → restrict clears → strategies re-enter aggressively + → re-breach). Good systems-thinking about emergent feedback. + +**Quality assessment:** +- **GPT-5** produced the most findings (10) and the highest-quality + architectural insight: the stale-price/drawdown contradiction is a genuine + design flaw that contradicts the document's own safety claim. Multiple + findings showed cross-boundary reasoning about semantic mismatches (price + definition, FLATTEN semantics, gate bypass). Every finding named specific + components and described precise event sequences. +- **Claude Sonnet 4.6** was fast (38s, only 1,868 tokens) and produced 8 + solid findings. The HWM reset finding and the multi-metric/single-flag + finding show genuine architectural reasoning. 
The liquidation feedback + loop (buy-to-cover worsening portfolio concentration) is subtle and + shows cross-position reasoning. However, some findings overlapped + significantly with the common-ground set and added less unique depth. + Sonnet performed MUCH better here than on race condition identification + (Finding #13) — 8/10 ratio vs 7/12 previously. +- **GPT-5 Mini** produced 6 findings in 68s with 2,240 reasoning tokens. + Quality was genuinely good — the order-splitting/correlation finding + and the oscillation feedback loop both show real reasoning depth. It's + clearly NOT GPT-4.1 Mini — it reasons about component interactions, + not just within-frame risks. However, it found fewer issues and one + response was cut off (token limit or response truncation). + +**Key insight — task framing as the dominant variable:** +This experiment used a much more structured prompt than previous ones: +specified 5 categories, required specific output format, explicitly excluded +single-component failures. The result: ALL models produced higher-quality, +more focused output than in earlier experiments with broader prompts. Even +Sonnet — which struggled on race conditions (Finding #13) — performed well +here. The structured categories likely helped models organize their reasoning +without losing track of what they were looking for. + +The prompt explicitly asked for "cross-component interaction failures" rather +than general analysis. This is the narrow-lens effect from Finding #2, but +applied to a complex multi-component document. The lens is narrow (only +inter-component gaps) but the scope is broad (459 lines, many interactions). +This combination — narrow analytical lens + broad document scope — appears +to be the sweet spot for getting quality from all model tiers. + +**GPT-5 Mini positioning:** +First time testing GPT-5 Mini. Results: 6 findings in 68s vs GPT-5's 10 in +116s. 
That's 60% of the findings in 59% of the time, with 28% of the +reasoning tokens (2,240 vs 8,128). Quality-per-finding was solid — the order +correlation finding especially showed genuine systems reasoning. GPT-5 Mini +appears to be a legitimate mid-tier: more capable than GPT-4.1 (which can't +do this kind of cross-boundary reasoning) but less exhaustive than GPT-5. +Viable for: first-pass screening, bulk document review where you'd run many +docs and can't afford full GPT-5 on each. + +**Sonnet recovery from Finding #13:** +Sonnet went from 7 findings (with errors) on race conditions to 8 solid +findings here. The difference: this prompt was more structured, the document +was larger with more explicit interaction descriptions, and the task didn't +require pure temporal/sequential reasoning. "Cross-component interaction +failures" is closer to assumption-finding (Sonnet's strength) than race +condition identification (Sonnet's weakness). Task taxonomy continues to +matter more than raw model capability. + +**Updated model assignment for cross-component analysis:** +1. GPT-5 — broadest coverage, finds semantic contradictions in the doc's + own claims (10 findings) +2. Sonnet 4.6 — fast and efficient, good at portfolio-level reasoning and + feedback loops (8 findings in 38s) +3. GPT-5 Mini — viable first-pass with genuine reasoning depth (6 findings) +4. (Opus untested for this task type — likely strong on design tensions) + +### 20. Invariant violation path analysis: GPT-5 is maximally selective (3 findings, all genuine); Opus shows unique self-correcting analytical style; new task type favors precision over exhaustiveness + +**Date:** 2026-05-04 +**Task:** Identify invariant violation paths in gargoyle's `user-pipeline-lifecycle.md` +(730 lines) — sequences of legal operations that can violate the system's stated or +implied invariants. NEW analytical lens not previously tested, distinct from assumption- +finding, race conditions, or coherence checking. 
+**How we used them:** Same document (full text) + same focused analytical question to all +3 models via HAI proxy. Highly structured prompt specifying 5 categories of invariant +violations (state machine escapes, invariant composition failures, monotonicity violations, +idempotency boundary violations, authority inversion sequences). Required specific output +format per finding. No tools, no project context beyond the document itself. + +| Model | Time | Output tokens | Reasoning tokens | Findings | +|---|---|---|---|---| +| GPT-5 | 143s | 784 | 12,032 | 3 | +| Claude Opus 4.6 | 113s | 6,183 | (internal) | 7 (with 2 self-corrections) | +| Claude Sonnet 4.6 | 23s | 1,266 | (internal) | 5 | + +**What they found — common ground (2+ models identified):** + +- **Periodic reconciliation overrides operator manual stop** (GPT-5 #3 + Opus #5 + + Sonnet #1): An admin who stops a pipeline via `stop_user/1` with `:admin_action` + has their decision overridden within 5 minutes by periodic reconciliation, because + there's no "admin stopped" state in `check_eligibility/1`. All three models + independently identified this as the clearest authority inversion. +- **DynamicSupervisor restart bypasses eligibility gate** (Opus #1/#3 + Sonnet #2): + When `UserPipeline.Supervisor` crashes and is restarted by OTP supervision, the + restart bypasses `start_user/1` and `check_eligibility/1` entirely — potentially + resuming trading while the kill switch is engaged. +- **Stale ReconciliationGate after crash** (Opus #7): After a crash-triggered + DynamicSupervisor restart (not via `stop_user/1`), the ReconciliationGate remains + `:ready` from the previous instance because `stop_user/1` (which resets it) was + never called. The new OrderManager may accept orders during its own reconciliation. 
+- **HealthMonitor co-lifecycle violation** (Opus #2 + Sonnet #4): After a + DynamicSupervisor-initiated restart, the HealthMonitor is still subscribed to the + old PIDs — no code re-establishes monitoring for the new pipeline processes. + +**GPT-5 unique findings (not in either other model):** + +- **Kill switch bypass for users configured DURING engagement** (#1): A user who + saves credentials while the kill switch is engaged is never added to the pending + operator release set (only running pipelines are added at engage time). After + disengage, periodic reconciliation auto-starts this user's pipeline without + operator release — violating "resuming always requires human judgment." This is + the most precisely reasoned finding across all three models: each step is + individually correct per the spec, and the violation emerges purely from the + composition of legal operations. +- **Premature release bypass** (#2): If `operator_release_user/1` is called while + the kill switch is still engaged (a legal operation), it clears the pending + release flag but `start_user/1` correctly refuses. After later disengage, the + flag is gone — auto-start proceeds without fresh operator judgment. The release + was "spent" at the wrong time. + +**Claude Opus unique findings (not in either other model):** + +- **`operator_release_system/0` clears unrelated safety obligations** (#4): + Operator intends to release one user from a recent event but + `operator_release_system/0` also releases other users still pending from an + earlier, unresolved event. One release call discharges multiple independent + safety obligations — monotonicity violation. +- **State machine incompleteness for blocked users** (#6): Users who become + configured during kill switch engagement (blocked with reason + `:kill_switch_engaged`) have no state machine transition back to `starting` + after disengage — they're not in the pending release set, and no event fires. 
+ System works via periodic reconciliation (up to 5 minutes delay), but the + documented state machine doesn't represent this path. +- **Self-correcting analytical style:** Opus explicitly withdrew two draft + findings mid-analysis ("Actually, this sequence works as designed. Let me + identify a real violation instead." / "this is likely handled"). This + self-correction behavior was first observed in Finding #15 and is now + confirmed as a consistent Opus trait for invariant-style analysis. + +**Claude Sonnet unique findings (not in either other model):** + +- **Cold-start Tier 3 failure creates supervision restart loop** (#2): A + persistent Tier 3 failure (phantom fills) crashes OrderManager, `:rest_for_one` + kills the tree, DynamicSupervisor restarts it, cold-start fails again → infinite + loop. State machine shows `starting → stopped` but supervision creates + `starting → starting` indefinitely. +- **HealthMonitor start failure during start_user** (#4): If HealthMonitor.Supervisor + is momentarily crashed when `start_user/1` runs step 4, the pipeline starts + without monitoring. No error handling specified for this partial-start state. + +**Quality assessment:** + +- **GPT-5** was MAXIMALLY SELECTIVE — only 3 findings from 12,032 reasoning tokens + (4,011 reasoning tokens per finding). This is the most extreme + reasoning-to-output ratio observed: 15:1 (12,032 reasoning / 784 output tokens). + For comparison, in previous experiments GPT-5 typically shows 1:1 to 2:1 ratios. + Every finding is a genuine invariant violation with a precise, step-by-step + sequence where each step is individually legal. ZERO false positives, zero + padding, zero "this might be an issue." GPT-5 appears to have used almost all + its reasoning budget for VERIFICATION — confirming that each candidate is + genuinely a violation before including it. +- **Claude Opus** produced the most findings (7) with its characteristic depth and + self-correction. 
Two findings were revised mid-analysis, showing Opus actively + testing its own reasoning against the document before committing to a finding. + The DynamicSupervisor restart thread (findings #1, #2, #3, #7) forms a coherent + cluster — Opus identified one root cause (OTP restarts bypass the lifecycle + layer) and explored its multiple consequences. The `operator_release_system` + monotonicity finding (#4) is architecturally significant and unique. +- **Claude Sonnet** was extremely fast (23s, 1,266 tokens) and produced 5 findings. + Quality was mixed: Finding #1 partially mirrors GPT-5's authority inversion but + with vaguer reasoning ("race condition with ETS operations" — not specified). + Finding #3 describes a contradiction but the scenario is internally inconsistent + (step 5 says "pipeline termination fails" but then step 7 says pipeline is still + running — this conflates two failure modes). Findings #2 and #4 are genuine and + well-reasoned. Sonnet's precision is lower than the other two on this task. + +**Key insight — "Invariant violation paths" as a task type:** + +This is a genuinely DIFFERENT analytical task from any previously tested. It requires: +1. Identifying the invariants (explicit or implied) +2. Constructing a sequence of operations (creative/generative) +3. Verifying each step is legal per the spec (verification) +4. Confirming the end state violates the invariant (correctness proof) + +This four-phase cognitive process explains GPT-5's extreme selectivity: steps 2-4 are +all verification-heavy, and GPT-5's reasoning tokens are being burned on steps 3 and 4 +(confirming each step is genuinely legal and the final state genuinely violates). In +previous tasks like "find hidden assumptions" or "find gaps," only step 1 (identification) +is needed — there's no construction or verification phase. 
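
The four phases can be made concrete with a toy model of GPT-5's finding #1. This is a sketch built only from the finding as summarized above; the class, its method names, and the invariant check are hypothetical and are not gargoyle's code:

```python
class ToyLifecycle:
    """Hypothetical model of the kill switch and operator-release rules."""

    def __init__(self):
        self.kill_switch = False
        self.configured = set()       # users with saved credentials
        self.running = set()          # users with a live pipeline
        self.pending_release = set()  # users who need operator release
        self.log = []                 # (operation, user) pairs for replay

    def save_credentials(self, user):
        self.configured.add(user)
        self.log.append(("save_credentials", user))

    def engage_kill_switch(self):
        self.kill_switch = True
        # Per the finding: only pipelines running at engage time are recorded.
        self.pending_release |= self.running
        self.running.clear()
        self.log.append(("engage", None))

    def disengage_kill_switch(self):
        self.kill_switch = False
        self.log.append(("disengage", None))

    def operator_release(self, user):
        self.pending_release.discard(user)
        self.log.append(("operator_release", user))

    def periodic_reconciliation(self):
        # Auto-start every configured user that is not running and not gated.
        if self.kill_switch:
            return
        for user in sorted(self.configured - self.running - self.pending_release):
            self.running.add(user)
            self.log.append(("auto_start", user))


def invariant_violations(log):
    """Phases 1 and 4: users resumed after a kill switch without operator release."""
    engaged = False
    running, needs_human, violations = set(), set(), []
    for op, user in log:
        if op == "engage":
            engaged = True
            needs_human |= running
            running.clear()
        elif op == "disengage":
            engaged = False
        elif op == "save_credentials" and engaged:
            needs_human.add(user)
        elif op == "operator_release":
            needs_human.discard(user)
        elif op == "auto_start":
            if user in needs_human:
                violations.append(user)
            running.add(user)
    return violations


# Phases 2 and 3: a sequence in which every call is an ordinary, legal operation.
system = ToyLifecycle()
system.save_credentials("alice")
system.periodic_reconciliation()  # alice starts normally
system.engage_kill_switch()       # alice is added to pending release
system.save_credentials("bob")    # configuring during engagement is allowed
system.periodic_reconciliation()  # no starts while engaged
system.disengage_kill_switch()
system.periodic_reconciliation()  # bob auto-starts; alice stays gated

violations = invariant_violations(system.log)
print(violations)
```

In this toy model the user who was running at engage time stays gated, while the user configured during engagement is resumed with no human in the loop, which is the composition failure the finding describes.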
+ +**Comparison to previous task types:** + +| Task type | GPT-5 findings | Opus findings | GPT-5 reasoning overhead | +|---|---|---|---| +| Hidden assumptions | 20-35 | 12-13 | 5-7K reasoning | +| Race conditions | 12 | 10 | 8K reasoning | +| Design coherence | 4 | 7 | 9K reasoning | +| Invariant violation paths | 3 | 7 | **12K reasoning** | + +The pattern: as the task requires more VERIFICATION (vs identification), GPT-5 becomes +more selective and spends more reasoning tokens per finding. Invariant violation paths +demand the highest verification burden (every step must be confirmed legal), and GPT-5 +responds with the highest selectivity and reasoning investment. + +Opus inverts relative to GPT-5: it out-produces GPT-5 on verification-heavy tasks (7 vs 4 +for coherence, 7 vs 3 for invariant paths) but trails it on identification tasks (12-13 vs +20-35 for hidden assumptions, 10 vs 12 for race conditions). This suggests Opus uses its +internal reasoning differently: it's more willing to present findings that have "likely" +rather than "proven" violations, then self-corrects inline if the verification fails. + +**Practical implication:** + +For invariant violation path analysis: +- **GPT-5** produces the highest-precision findings but very few. Every finding is a + genuine spec-level bug. Use when you need zero-false-positive bug reports to present + to a design team. +- **Opus** produces more findings with slightly lower precision but unique analytical + depth. Its self-correction behavior means false positives are often caught inline. + Use when you want both confirmed violations AND identified tensions. +- **Sonnet** is too imprecise for this task type: some findings have internal + inconsistencies. Use for lighter analytical tasks (assumption-finding, spec gaps). + +The three findings GPT-5 produced are ALL genuine design bugs that should be fixed: +1. Users configured during kill switch engagement bypass operator release +2. Premature operator release (while KS still engaged) creates future bypass +3.
Admin stops are overridden by periodic reconciliation + +These are the kinds of findings that, in a real financial system, prevent production +incidents. Spending 12K reasoning tokens to produce 3 perfect findings is excellent ROI. + +### 21. Reasoning effort (low/medium/high) has negligible effect on GPT-5's analytical output; the parameter may not work as documented for open-ended analysis + +**Date:** 2026-05-04 +**Task:** Identify hidden assumptions in gargoyle's `order-state-machine.md` (221 lines), +a well-structured state machine specification covering order lifecycle, fill precedence, +TIF semantics, and parameter resolution. +**How we used them:** Same document, same prompt, same model (GPT-5), same +max_completion_tokens (16K). Only variable: `reasoning.effort` parameter set to +"low", "medium", or "high". Run sequentially via HAI proxy (OpenAI-compatible +endpoint). No tools, no project context beyond the document. + +| Effort | Time (ms) | Output tokens | Reasoning tokens | Findings | +|---|---|---|---|---| +| Low | 97,913 | 7,657 | 4,288 | 33 (+11 recs) | +| Medium | 94,824 | 7,112 | 4,160 | 30 | +| High | 88,607 | 6,891 | 3,712 | 30 | + +**The counterintuitive result:** Higher reasoning effort produced FEWER findings, +FEWER reasoning tokens, FEWER output tokens, and completed FASTER. The expected +pattern (high effort → more reasoning → more depth) was inverted. + +**Per-finding metrics (remarkably consistent):** + +| Effort | Output tokens/finding | Reasoning tokens/finding | +|---|---|---| +| Low | 232 | 129 | +| Medium | 237 | 138 | +| High | 229 | 123 | + +The depth per finding was nearly identical across all three levels. The model +didn't get more detailed or rigorous per-finding at higher effort; it just +found slightly fewer things.
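
The per-finding figures can be reproduced directly from the effort table. The sketch below uses only the totals reported there. The one assumption is the rounding: the published figures match integer (floor) division, so that is what the helper uses.

```python
# Totals copied from the effort table: (output tokens, reasoning tokens, findings).
runs = {
    "low": (7657, 4288, 33),
    "medium": (7112, 4160, 30),
    "high": (6891, 3712, 30),
}


def per_finding(output_tokens: int, reasoning_tokens: int, findings: int):
    """Return (output tokens per finding, reasoning tokens per finding), rounded down."""
    return output_tokens // findings, reasoning_tokens // findings


per_finding_metrics = {effort: per_finding(*totals) for effort, totals in runs.items()}

for effort, (output_pf, reasoning_pf) in per_finding_metrics.items():
    print(f"{effort:<6} output/finding={output_pf}  reasoning/finding={reasoning_pf}")
```

Output tokens per finding range from 229 to 237, a spread of under 4%, which is what "remarkably consistent" refers to.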
+ +**Severity distributions (similar across all three):** +- Low: 7 Critical, 21 High, 5 Medium (33 findings) +- Medium: 9 Critical, 15 High, 4 Medium + 2 borderline (30 findings) +- High: 6 Critical, 14 High, 5 Medium + 4 borderline (30 findings) + +**Qualitative differences — WHAT they found:** + +High-effort unique findings (not in low): +- Single-writer authority to broker (no out-of-band modifications) +- Broker emits fills for all executed quantities (no silent netting) +- Instrument identity remains stable across corporate actions +- Late-fill override won't violate downstream invariants +- Validation covers lot sizes, price ticks, borrow/locate constraints +- Multiple accounts and venues are part of the correlation key +- Streaming and polling APIs are consistent +- System can handle multi-leg instruments + +Low-effort unique findings (not in high): +- Acks arrive before fills (no pre-ack fills) +- Cancel-before-ack handling (submitted → cancelled missing) +- Fill totals never exceed requested quantity +- Deterministic ordering within a broker stream +- Exercise/assignment and non-order position changes +- Client-side idempotency of "place order" +- Partial accept/normalize on replace +- No "child" order fragmentation at broker +- Submitted state can receive terminal events +- Late cancel vs local expired mismatch + +**Character of the differences:** +- HIGH-unique findings tend to be more **architectural/systemic** (multi-leg + instruments, streaming vs polling consistency, downstream invariant violations, + corporate actions). These require reasoning about the system's relationship + to the broader world. +- LOW-unique findings tend to be more **implementation-specific edge cases** + (cancel-before-ack, pre-ack fills, child order fragmentation, partial accepts). + These require reasoning about specific event interleavings and protocol details. + +Both sets are valid and actionable. Neither is clearly "better." 
They represent +different analytical modes — breadth-of-scope (high) vs depth-of-protocol (low). + +**Key insight — reasoning_effort doesn't scale analysis linearly:** + +Three possible explanations for the inverted behavior: + +1. **GPT-5 already uses near-maximum reasoning for analytical tasks regardless + of the effort parameter.** The ~4K reasoning tokens across all three levels + (4288/4160/3712) are too similar to reflect a genuine effort gradient. The + parameter may primarily affect OTHER task types (math, code, logic puzzles) + where reasoning depth is more variable. + +2. **Higher effort increases FILTERING, not exploration.** At high effort, GPT-5 + may spend more of its reasoning on VERIFYING whether findings are genuine + before including them — similar to the extreme selectivity observed in + Finding #20 (invariant violation paths, 12K reasoning for 3 findings). This + would explain fewer findings despite theoretically "trying harder." + +3. **The parameter has minimal practical effect for this model version.** + The differences (33 vs 30 vs 30) are within normal stochastic variation. + Repeated runs at the same effort level might show similar variance. + +**The prompt cache hit on HIGH (2304 cached prompt tokens) may have slightly +accelerated processing, but doesn't explain the reasoning token difference.** + +**Comparison to previous findings:** +In Finding #20 (invariant violation paths), GPT-5 used 12,032 reasoning tokens +for 3 findings — extreme verification behavior. Here, at default effort on a +different task type (hidden assumptions), it uses ~4K reasoning for ~30 findings. +This confirms that TASK TYPE is a far stronger predictor of GPT-5's reasoning +behavior than the reasoning_effort parameter. The invariant violation prompt +triggered deep verification; the assumption-finding prompt triggers broad +exploration regardless of effort setting. 
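
Explanation 3 (the 33 vs 30 gap is stochastic noise) can be given a rough back-of-envelope check. The sketch below assumes finding counts behave like independent Poisson counts, which is an assumption made purely for illustration and has not been measured here. Under it, the difference between two counts has a standard deviation of roughly the square root of their sum.

```python
import math


def poisson_difference_z(count_a: int, count_b: int) -> float:
    """Approximate z-score for the difference of two independent Poisson counts.

    Assumption (not measured in this experiment): findings per run are
    Poisson-distributed, so Var(a - b) is estimated by a + b.
    """
    return (count_a - count_b) / math.sqrt(count_a + count_b)


z_low_vs_high = poisson_difference_z(33, 30)
print(f"z = {z_low_vs_high:.2f}")
```

A z-score this far below the conventional 1.96 threshold is consistent with noise, but it does not establish it. LLM finding counts need not be Poisson, so repeated runs at each effort level remain the real test.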
+ +**Practical implication:** +For open-ended analytical tasks (assumption-finding, gap analysis, spec review), +the reasoning_effort parameter appears to have negligible practical effect on +GPT-5. Don't bother tuning it for these tasks — the default is fine. The +parameter may be more meaningful for: +- Tasks with verifiable correct answers (math, logic) +- Tasks where the model could short-circuit (simple questions) +- Extremely long documents where exploration budget matters + +For architecture review specifically: reasoning_effort is NOT a useful lever. +Task framing (the prompt structure) and document selection remain the dominant +variables for output quality. Save reasoning_effort tuning for coding/math tasks +where the parameter was likely trained and evaluated. + +**Open question:** Would running the same experiment 5x at each level show that +the 33-vs-30 difference is within stochastic noise? If so, reasoning_effort is +effectively a no-op for analytical prompts. If not, low-effort consistently +produces more (less filtered) output, which could be useful for brainstorming- +style analysis where you want maximum coverage before manual triage. + +### 27. Design coherence on risk-controls.md: GPT-5 regains top position; Opus's advantage from Finding #15 was document-specific, not task-specific + +**Date:** 2026-05-05 +**Task:** Identify internal design incoherences in gargoyle's `risk-controls.md` (277 lines) +— a pre-trade risk control specification covering two evaluation stages, reduction semantics, +ordering rationale, fail-closed claims, and audit logging. +**How we used them:** Same document (full text) + same focused analytical question to all +3 models via HAI proxy. Highly structured prompt specifying 5 categories of incoherence +(safety properties not enforced, ordering/sequencing contradictions, reduction semantics +conflicts, fail-closed claims vs actual behavior, cross-stage inconsistencies). 
Required +each finding to reference specific contradictory parts. No tools, no project context beyond +the document itself. + +| Model | Time | Output tokens | Reasoning tokens | Incoherences found | Critical | High | Medium | +|---|---|---|---|---|---|---|---| +| GPT-5 | 112s | 8,231 | 7,232 | 6 | 1 | 3 | 2 | +| Claude Opus 4.6 | 41s | 1,858 | (internal) | 5 | 2 | 2 | 1 | +| Claude Sonnet 4.6 | 15s | 699 | (internal) | 4 | 1 | 2 | 1 | + +**What they found — common ground (all 3 identified):** +- Reduction re-entry at BuyingPower contradicts "reducing controls never re-enter + earlier controls" (all three flagged this as the most obvious contradiction — + Concentration at position 5 reduces, re-enters at BuyingPower at position 4, + which IS an earlier control) +- Ordering rationale's categorization of buying power/concentration is internally + confused (the doc labels both as "quantity-sensitive checks" that run after + reducing controls, but concentration IS a reducing control at position 5 while + buying power at position 4 sits between the two reducing controls) + +**GPT-5 unique findings (not in either Claude model):** +- Signal-level "no portfolio context" contradicts NoShortSales requiring knowledge + of current positions. The doc explicitly states signals are evaluated "in isolation" + with "no portfolio context — only the signal itself and user settings" — but checking + whether the user holds a position IS portfolio context. This is a genuine design + tension: either SignalRisk has hidden portfolio access (violating isolation) or + NoShortSales can't actually work as specified. +- Settings "fall through to system defaults" vs "Settings cache miss → reject." + Two incompatible instructions for the same condition (missing settings). 
+- "Universal fail-closed" with "only exception is order rate window" contradicted + by Failure Modes table showing buying power as another exception ("Conservative + estimate; may over-reject" is NOT rejection — it's a different failure mode than + either fail-closed or the documented single exception). +- Audit model says "every control evaluation produces an audit entry regardless of + outcome" but the signal-stage write point only describes writing on rejection. + Passing signals produce no documented audit entry at the signal stage. + +**Claude Opus unique findings (not in either other model):** +- Signal flow diagram swaps control order vs table: table shows (1) MarketHours, + (2) PerTradeStop, (3) NoShortSales, but diagram flows MarketHours → NoShortSales + → PerTradeStopLoss. Controls 2 and 3 are reversed between the two representations. + (VERIFIED: this is correct — the diagram does show a different order.) +- Concentration re-entry loop can bypass Order Rate, Duplicate, Self-Trade, and + Fat Finger entirely during intermediate iterations. Also: Position Size at order 3 + is never re-checked against Concentration-reduced quantity because re-entry starts + at BuyingPower (order 4), meaning "smaller quantity wins" semantics are implemented + differently than the linear model described in Reduction Semantics. + +**Claude Sonnet unique findings (not in either other model):** +- Buying Power "Can reduce? No" creates a dead end: if a reduced quantity still + exceeds buying power, the system can only reject entirely (no mechanism to further + optimize), defeating the purpose of the reduction system for capital-limited users. + (NOTE: this is more of a design limitation than a self-contradiction, but the + framing — that the reduction system's purpose is undermined by buying power's + inability to reduce — is a legitimate coherence observation.) 
+ +**Quality assessment:** +- **GPT-5** produced the most findings (6) with the broadest coverage across the + prompt's 5 categories. The NoShortSales/portfolio-context finding is the most + genuinely insightful — it's a fundamental design-level contradiction (a signal-level + control that REQUIRES decision-level context). The settings contradiction and + audit logging inconsistency are also solid. Every finding points to two specific + textual statements that are incompatible. Severity ratings were calibrated (1 + Critical, 3 High, 2 Medium — compared to Opus's 2 Critical for similar findings). +- **Claude Opus** was remarkably fast (41s, 1,858 tokens) and found one thing + neither other model caught: the diagram/table order reversal for signal controls. + This is a concrete, verifiable error (not a design tension — a literal mistake in + the document). The re-entry loop analysis (finding #5) goes deeper than GPT-5's + version of the same core issue, exploring the implications for "smaller quantity + wins" semantics. However, Opus found fewer total issues and missed the + settings contradiction and audit logging inconsistency. +- **Claude Sonnet** was the fastest (15s, 699 tokens) and found 4 issues. The buying + power dead-end observation is unique and shows genuine reasoning about the reduction + system's limitations. However, it's more of a "this design can't achieve its stated + goal" than a strict self-contradiction. Sonnet's other findings overlap with the + common ground. Quality is solid but narrower scope. + +**Key insight — Finding #15's Opus > GPT-5 result was document-specific:** +In Finding #15 (coherence checking on failure-modes.md), Opus found 7 incoherences +vs GPT-5's 4. Here, on risk-controls.md, GPT-5 found 6 vs Opus's 5. 
The reversal +suggests that the relative performance on coherence checking depends on the +DOCUMENT'S structure, not on a fixed model advantage: + +- **failure-modes.md** (383 lines): A complex multi-process system with many + stated invariants across failure states, supervision trees, and recovery paths. + Rich in design TENSIONS where one subsystem's safety mechanism undermines another. + This plays to Opus's strength (finding design tensions between subsystems). +- **risk-controls.md** (277 lines): A more focused specification with explicit rules, + ordering constraints, and behavior tables. Rich in SPECIFICATION CONTRADICTIONS + where one statement directly conflicts with another. This plays to GPT-5's + strength (systematic verification of claims against stated mechanisms). + +The difference: Opus excels when contradictions are EMERGENT (arise from composing +multiple design decisions). GPT-5 excels when contradictions are EXPLICIT (two +statements in the document say incompatible things). Risk-controls.md has more +explicit contradictions (the settings fallback vs fail-closed, the "no portfolio +context" vs NoShortSales, the audit "always" vs write point "only on reject"). + +**Model performance depends on CONTRADICTION TYPE:** +| Contradiction type | Best model | Example | +|---|---|---| +| Emergent/compositional | Opus | "Rest-for-one cascade creates a 5th state" | +| Explicit/definitional | GPT-5 | "No portfolio context" but check requires portfolio | +| Diagrammatic/structural | Opus | Table order ≠ diagram order | +| Semantic/category confusion | All (common ground) | Reduction re-entry violates ordering claims | + +**Revised conclusion on Finding #15's open question:** +"Does Opus > GPT-5 ordering for coherence checking hold across other documents?" +**No.** The ordering depends on the document's contradiction density and type. +Documents rich in emergent design tensions favor Opus. Documents with explicit +specification errors favor GPT-5. 
The task type (coherence checking) doesn't have +a fixed model winner — it depends on what KIND of incoherences the document contains. + +**Practical implication:** Continue running both models for coherence checking. Their +strengths are complementary even within the same task type. GPT-5 catches things you +can point to in the spec and say "these two sentences conflict." Opus catches things +where you need to reason about the implications of multiple mechanisms interacting. + +## Open Questions + +- Does GPT's advantage in finding inconsistencies extend to logical + inconsistencies in arguments? One data point (verdict mismatches) — need more. +- What's the optimal task granularity for GPT analytical review? "Whole PR" is + too big. Is "one hypothesis" right, or can we batch? +- ~~Is the GPT-4.1 Mini bias detection result repeatable, or was it a well- + structured task that any model would ace?~~ **ANSWERED (Finding #8):** Any + model aces it when the biased text is presented without noise. The original + result was about noise elimination, not model capability. +- **NEW:** Does adding a narrow bias-check question to a rich PR review + context recover the detection that broad review misses? (Signal-to-noise + confirmation test) +- ~~How does reasoning_effort affect analytical quality? Only tested default so + far.~~ **ANSWERED (Finding #21):** Negligible effect on GPT-5 for open-ended + analytical tasks. Low/medium/high produced 33/30/30 findings with nearly + identical reasoning tokens (~4K) and per-finding depth. The parameter + may primarily affect verifiable-answer tasks, not exploration. Task framing + remains the dominant quality lever. +- Can we design a systematic "analytical review checklist" that leverages each + model's strengths? +- ~~What analytical tasks is Opus best at vs Sonnet? Finding #11 shows Opus + excels at design-tension identification. How does Sonnet compare on the + same task? 
(Sonnet is non-reasoning but fast — would it match GPT-4.1?)~~ + **ANSWERED (Finding #12):** Sonnet 4.6 significantly outperforms GPT-4.1 + (17 vs ~14 assumptions) and approaches GPT-5 (17 vs 20). It's not a + non-reasoning model in the GPT-4.1 sense — it occupies a middle tier with + genuine component-interaction reasoning. Opus still wins on design-tension + identification specifically. +- How do the models compare on research synthesis tasks (our #381 rewrite)? + We'll find out during the actual rewrite. +- ~~Does the reasoning-token advantage scale with document complexity? Test + with a simpler doc to see if the gap narrows.~~ **ANSWERED (Finding #11):** + The gap doesn't narrow with simpler docs. GPT-5 maintains ~1.7x the findings + of GPT-4.1 regardless of document complexity. Reasoning tokens enable + exhaustive exploration independent of input difficulty. +- ~~Would Claude Opus (also a reasoning model) match GPT-5's assumption-finding + performance, or does it have different blind spots?~~ **ANSWERED (Finding #11):** + Different blind spots, different strengths. GPT-5 reasons deeper into + implementation mechanics (breadth + technical depth). Opus reasons wider + about system context and design tensions (insight density). They're + complementary, not competing. Run both on important architecture docs. +- ~~Does Sonnet 4.6's strong showing hold across other analytical tasks + (bias detection, gap-finding) or is it specific to assumption-finding on + complex documents? Need to test Sonnet on simpler docs and different question + types.~~ **PARTIALLY ANSWERED (Finding #13):** Sonnet's strength does NOT + transfer to concurrency reasoning. It dropped from 85% of GPT-5 (assumption- + finding) to ~58% (race condition identification). Task type matters more + than we thought. Still untested: gap-finding, bias detection for Sonnet. 
+- **NEW:** What other analytical tasks require sequential/temporal reasoning + (like race condition identification) vs pattern-matching reasoning (like + assumption-finding)? Building a task taxonomy would help assign models + correctly. +- **NEW:** What explains Sonnet taking slightly longer than Opus here (106s vs + 105s) despite normally being the faster model? Is it the document length, or + does Sonnet's internal reasoning scale with complexity similarly to Opus? +- ~~How does GPT-5 Mini compare to GPT-5 on analytical tasks? Is it a viable + cheaper substitute?~~ **ANSWERED (Finding #14):** GPT-5 Mini is a viable + middle option. Finds fewer issues (6 vs 10) but with genuine reasoning + depth at ~50% cost/time. Better than non-reasoning models, not as + exhaustive as GPT-5. +- **NEW:** How does Claude 4.5 Opus/Sonnet compare to Claude 4.6? HAI now + exposes both; worth testing whether the newer versions regress on + analytical tasks. +- ~~Would running GPT-5 Mini + Sonnet together (different axes) + approach GPT-5's coverage at lower combined cost?~~ **ANSWERED (Finding #19):** + 71% coverage at 31% cost. Good for low-stakes work; GPT-5 irreplaceable for + high-stakes due to unique domain-knowledge findings in the missing 29%. +- ~~**NEW (Finding #15):** Does the Opus > GPT-5 ordering for coherence checking + hold across other documents? The inversion (Opus finding more than GPT-5) + was striking — need to confirm it wasn't document-specific.~~ + **ANSWERED (Finding #27):** No — it was document-specific. On risk-controls.md, + GPT-5 found 6 vs Opus's 5. The winner depends on contradiction TYPE: Opus + excels at emergent/compositional contradictions, GPT-5 at explicit/definitional + ones. No fixed ordering for this task type. +- **NEW (Finding #15):** Is the two-pass approach (Opus generates → GPT-5 + validates) worth the extra cost vs just running Opus alone? Need to test + whether GPT-5 actually catches Opus false-positives or just agrees. 
+- ~~How do the Claude 4.5 and 4.6 models compare on analytical tasks?~~ + **ANSWERED (Finding #16):** 4.5 is more exhaustive (2x findings), 4.6 is + more precise (higher signal-to-noise). Genuine tradeoff, not a regression. + 4.5 for coverage, 4.6 for actionability. +- **NEW (Finding #16):** Does the 4.5 vs 4.6 pattern hold across other task + types? Spec completeness may favor exhaustiveness; would coherence checking + or race condition analysis show the same pattern? +- **NEW (Finding #16):** Is running both Sonnet versions (4.5 + 4.6) cost- + effective vs just running GPT-5? Need to compare the UNION of their findings + against GPT-5's output for overlap analysis. +- **NEW (Finding #18):** Does Opus's "predictable exploit window" detection + transfer to other policy documents? It uniquely identified that the cooldown + mechanism creates a GUARANTEED safe window that strategies could systematically + exploit — this is a higher-order security insight. Worth testing whether Opus + consistently finds "adversarial opportunity" framings that other models miss. +- **NEW (Finding #20):** Does GPT-5's extreme verification behavior (15:1 + reasoning-to-output ratio, 3 findings from 12K reasoning) persist across + other documents with this prompt? Or was user-pipeline-lifecycle.md + particularly verification-heavy? Test invariant violation paths on a simpler + document. +- **NEW (Finding #20):** Would giving GPT-5 a "minimum 8 findings" instruction + reduce its selectivity and produce MORE invariant violations at lower + precision? Or would it just pad with non-violations? The extreme selectivity + may be a feature OR it may mean GPT-5 is discarding valid-but-hard-to-verify + findings. +- **NEW (Finding #20):** Opus's self-correction behavior is now confirmed across + Finding #15 and #20. Is this trainable/promptable? Could we ask non-Opus models + to "show your reasoning and withdraw findings you cannot fully verify"? 
+- **NEW (Finding #22):** The "silent correctness" lens revealed three distinct + analytical modes: Opus → missing functionality, GPT-5 → mechanism incorrectness, + Sonnet → composition failures. Does this three-way differentiation hold on other + documents, or was it specific to the regulatory/financial domain of specid-lot-selection? +- **NEW (Finding #22):** Does the "silent correctness" lens work on non-financial + documents? The financial/regulatory domain has a large gap between syntactic and + semantic correctness. Would the same prompt on an infrastructure/systems doc produce + equally differentiated findings, or would it collapse into assumption-finding? +- ~~**NEW (Finding #22):** Opus's "missing feature identification" mode (wash sales, + commissions) — is this promptable on other models? Could we explicitly ask GPT-5 + "what should this system compute but doesn't" and get similar results?~~ + **ANSWERED (Finding #26):** YES — all three models find regulatory gaps and + missing features when explicitly prompted. Opus's unique behavior in #22 was + an emergent DEFAULT tendency, not a capability. Prompt framing dominates + model personality. + +- **NEW (Finding #28):** Cross-document consistency found real bugs in gargoyle + docs (fills vs events, position ownership, signal persistence). Does running + this analysis across MORE document pairs (e.g., domain readmes vs implementation + docs, design docs vs plan docs) yield additional real inconsistencies? Could + become a systematic documentation maintenance tool. +- **NEW (Finding #28):** Opus was 2.4x faster AND found more issues than GPT-5 + on cross-document consistency. Is this because cross-doc contradictions are + easy to verify once spotted (reducing GPT-5's verification advantage)? Or + because boundary reasoning (Opus's strength) is the primary skill needed? + +## Methodology Notes + +- Internet opinions about models are overwhelmingly about coding. 
Don't + extrapolate to analytical work without testing. +- "Just because someone says it on the internet doesn't make it right." — + Aaron, 2026-04-26. Opinions need context. Track our own evidence. +- Absence of published methodology for a use case is itself a finding. +- Each finding needs: date, task, **how we used it** (context shape, task + framing, what info the model had/didn't have), what happened, takeaway. + No unsupported generalizations. +- **Context dimensions to track:** + - Rich vs minimal (how much background info) + - Broad vs focused ("review this" vs "answer this specific question") + - What kind of context (diff, full files, issue text, research notes, + project conventions, nothing) + - Whether the model had access to tools or just text + - Whether the task was explicit step-by-step or open-ended +# Design Coherence Analysis — Finding #15 + +**Date:** 2026-05-03 +**Task:** Identify internal design incoherences in gargoyle's `failure-modes.md` (383 lines) +— places where the document's stated principles/invariants are contradicted by its own +specified mechanisms. +**How we used them:** Same document (full text) + same focused analytical question to all +3 models via HAI proxy. Highly structured prompt specifying 5 categories of incoherence +to look for (safety properties not enforced, state machine violations, recovery contradictions, +supervision conflicts, cross-mechanism contradictions). Required each finding to reference +specific sections. No tools, no project context beyond the document itself. 
+ +| Model | Time | Output tokens | Reasoning tokens | Incoherences found | +|---|---|---|---|---| +| Claude Sonnet 4.6 | ~39s | 1,168 | (internal) | 5 | +| Claude Opus 4.6 | ~105s | 3,378 | (internal) | 7 (8 attempted, 1 self-withdrawn) | +| GPT-5 | ~120s | 10,235 | 9,088 | 4 | + +**What they found — common ground (all 3 identified):** +- State machine universality claim vs Strategy.Worker crash behavior (process + crashes bypass the degraded state entirely — no transition path in the model) +- Market data staleness advisory-only vs the "don't trade when ambiguous" principle + (or vs concurrent failure auto-halt) +- `pending_cancel`/`pending_replace` absent from recovery query set (GPT-5 and + Sonnet found this directly; Opus addressed the broader state machine gap) + +**GPT-5 unique findings (not in either Claude model):** +- Kill switch halted = "process terminated" vs kill switch requiring RUNNING + processes for cancel_all, FLATTEN, and LIQUIDATE modes. The state definition + claims processes are terminated, but the mechanisms require them alive to + execute orders. **This is the most architecturally significant finding** — it + reveals a fundamental definitional error in the state machine. +- Per-symbol degradation contradicts the process-level degradation semantics. + A worker "enters degraded" but continues operating for non-stale symbols — + violating the stated definition that degraded = "cannot perform primary + function." The metrics/eventing model has no per-symbol dimension. + +**Claude Opus unique findings (not in either other model):** +- `:rest_for_one` cascade creates a FIFTH implicit state (terminated-and- + restarting) not in the four-state model — processes that were `normal` are + forcibly killed (not by kill switch) and restart. Self-corrected one finding + that initially looked like incoherence but was actually consistent. 
+- PortfolioMonitor continues evaluating with stale data ("fail-safe") while + Strategy.Workers are stopped for the SAME condition — contradicts both the + universal state machine (PM doesn't transition to degraded) and the doc's + reasoning about why stale data is dangerous. +- Path-dependent recovery criteria: Strategy.Worker recovery requires 21+ bars + after crash but only "price continuity check" after staleness. The state + machine's single "catch-up complete" exit condition can't express this. +- `halted → [*]` transition in state diagram is logically impossible if "halted" + means the process is already terminated — dead processes can't fire transitions. +- Compound failure detection requires a meta-observer across processes but the + per-process state machine model has no way to express cross-process conditions. + +**Claude Sonnet unique findings (not in either other model):** +- Market data global staleness: the failure table says "Manual (disengage)" for + recovery — implying automatic engagement happened — but the text says it's + advisory only. Table contradicts prose. +- ReconciliationGate: doc claims gate survives OM crash (separate supervision + tree), but then says "missing ETS table = not ready" when OM crashes. If the + gate survives, why would its table be missing? +- Signal survival claims are contradictory between sections: worker crash says + downstream signals survive, but OM crash says all upstream signals lost. + (NOTE: this is actually describing different scenarios — worker crash doesn't + cascade to SignalRisk, OM crash does via :rest_for_one. Sonnet may have + misread the architecture here — the two statements are consistent when you + understand the supervision tree.) + +**Quality assessment:** +- **GPT-5** found only 4 incoherences but TWO of them are genuinely critical + architectural findings. 
The "halted = terminated" vs "kill switch requires + running processes" contradiction is a real design error — you can't both + terminate processes AND require them to execute cancel/liquidation orders. + The per-symbol degradation finding is also a real modeling gap. GPT-5 was + MORE SELECTIVE here than in previous experiments — it didn't pad with + medium-severity findings. Each of its 4 was high/critical. +- **Claude Opus** produced the most findings (7 valid) with characteristic + depth. Its self-correction (withdrawing finding #6 after deeper analysis) + shows intellectual honesty rare in model outputs. The PortfolioMonitor + stale-data contradiction is genuinely insightful — same input condition, + opposite response, no justification within the state machine model. The + compound failure meta-observer finding identifies a modeling category error. + Opus also found modeling imprecisions (path-dependent recovery, halted → [*] + impossibility) that the other models didn't notice. +- **Claude Sonnet** found 5 issues quickly (39s, 1,168 tokens) but quality was + mixed. Finding #4 (ReconciliationGate) raises a genuine question about + the ETS table ownership claim. Finding #1 (table vs prose contradiction on + market data staleness) is a real documentation inconsistency. However, + Finding #5 appears to misread the supervision architecture — the two + statements about signal survival ARE consistent when you understand that + different crashes cascade differently. Sonnet produced one false positive. + +**Key insight — "design coherence" is a NEW analytical category with distinct model strengths:** +This is different from assumption-finding (Finding #10-12), race conditions +(Finding #13), and cross-component interactions (Finding #14). Coherence +checking requires the model to hold MULTIPLE parts of the document in tension +with each other and reason about whether they're compatible. Results: + +- **GPT-5** was MORE SELECTIVE than in any previous experiment. 
Only 4 findings + vs 10-24 in other tasks. But precision was near-perfect — all 4 are genuine + contradictions. This suggests GPT-5's reasoning tokens are being used for + VERIFICATION (checking whether apparent contradictions hold up) rather than + EXPLORATION (finding more things). The 9K reasoning tokens produced 4 findings + vs the usual 10+ — GPT-5 is self-editing aggressively. +- **Opus** hit its sweet spot. Coherence checking IS design-tension identification + — Opus's consistent strength. Finding incoherences requires exactly the kind + of "how does this design disagree with itself" reasoning that Opus excels at. + It also showed unique self-correction behavior (withdrawing a finding after + deeper analysis). +- **Sonnet** was fast but produced a false positive. Coherence checking requires + holding multiple document sections in memory simultaneously and reasoning about + their compatibility — this is harder than assumption-finding (where you + reason about one mechanism at a time) but easier than race conditions (which + require sequential temporal reasoning). Sonnet occupies a middle ground. + +**Model ranking for design coherence checking:** +1. Claude Opus 4.6 — most findings, highest depth, self-correcting (7 valid) +2. GPT-5 — fewest findings but near-perfect precision, finds the critical ones (4) +3. Claude Sonnet 4.6 — fast screening, but prone to false positives on + architectural misreads (4/5 valid) + +**This inverts the usual GPT-5 > Opus ordering.** In previous experiments, GPT-5 +consistently found MORE issues. Here, GPT-5 was more selective than Opus. The +task type (self-consistency checking) favors Opus's "design tension" reasoning +style over GPT-5's "exhaustive exploration" style. GPT-5 apparently uses its +reasoning to VERIFY rather than GENERATE when the task is about contradictions +rather than gaps. + +**Practical implication:** For architecture documents, run coherence checking as +a separate pass using Opus as the primary model. 
GPT-5's higher precision means +it's good for confirming which Opus findings are genuine vs overreads. The +two-pass approach: Opus generates candidates → GPT-5 validates → result is the +intersection plus GPT-5's independent finds. + +### 16. Specification completeness: Sonnet 4.5 produces 2x the findings of Sonnet 4.6 on implementation-gap analysis; quality vs volume tradeoff + +**Date:** 2026-05-03 +**Task:** Identify specification gaps in gargoyle's `kill-switch.md` (185 lines) — places +where an implementer would be forced to guess or decide on their own because the spec +doesn't clearly specify behavior. New analytical lens not previously tested. +**How we used them:** Same document (full text) + same focused analytical question to all +3 models via HAI proxy. Highly structured prompt specifying 5 categories of underspecification +(behavioral ambiguity, missing edge cases, ordering/sequencing gaps, interface contracts +undefined, concurrency semantics omitted). Required specific output format per finding +(gap, section, what implementer must decide, risk if wrong, severity). No tools, no +project context beyond the document itself. 
+
+| Model | Time | Output tokens | Reasoning tokens | Gaps found | Critical | High | Medium | Low |
+|---|---|---|---|---|---|---|---|---|
+| Claude Sonnet 4.6 | 73s | 3,403 | (internal) | 13 | 8 | 4 | 0 | 1 |
+| Claude Sonnet 4.5 | 102s | 5,191 | (internal) | 25 | 14 | 6 | 4 | 1 |
+| GPT-5 | 109s | 10,140 | 7,872 | 19 | 8 | 7 | 3 | 0 |
+
+**What they found — common ground (all 3 identified):**
+- Pipeline process identification ambiguity (which processes are "pipeline processes")
+- Per-user process scope mapping (how to terminate only one user's processes)
+- ETS table ownership and lifecycle (who owns it, what happens on crash)
+- Concurrent engage operations (what happens when two sources engage simultaneously)
+- Liquidation order tagging mechanism (what the tag is, how verified)
+- Process restart prevention (how "must not restart" is enforced)
+- Engage sequence atomicity (partial failure between DB write and termination)
+- Startup ordering and ETS readiness (pipeline starting before ETS populated)
+- Disengage sequence ordering (what happens and in what order)
+
+**Sonnet 4.5 unique findings (not in either other model):**
+- ETS table schema/structure (set vs ordered_set, key format, value schema)
+- Missing ETS detection mechanism (catch :badarg vs table existence check)
+- Database write atomicity with ETS (transaction boundaries, rollback semantics)
+- Per-user engage while global is already engaged (is it a no-op or error?)
+- Broker rejection semantics ("already filled" vs "invalid cancel" distinction)
+- Cold-start gate interaction (independence vs dependency of the two gates)
+- User deletion with active kill switch (orphaned rows, cascade semantics)
+- Global disengage effect on per-user states (independent or auto-clear?)
+- Audit log write failure during engage (critical-path vs best-effort) +- Dashboard control ambiguity in LIQUIDATE mode (contradictory disable/enable) +- Cancel timeout duration (operational parameter not specified) +- Manual order source code path during LIQUIDATE (how orders bypass the dead pipeline) + +**GPT-5 unique findings (not in either other model):** +- Combined global/per-user mode semantics (what happens when global=RESTRICT, + user=LIQUIDATE — can user's liquidation proceed?) +- Scope of "all" in cancel_all and liquidation (system-wide vs per-user) +- Gate behavior when ETS missing but liquidation needed (conflicting requirements: + fail-closed says block, but liquidation needs to pass) +- Disengage during in-flight cancellations (what happens to racing tasks) +- Gate placement relative to broker submission (exact point in the flow) +- Engage latency expectations (no quantified SLA) +- Mode change while already engaged (RESTRICT → LIQUIDATE without disengage) +- Dashboard vs backend scope for manual liquidation (individual vs bulk only) + +**Sonnet 4.6 unique findings (not in either other model):** +- ETS sequencing relative to process termination (ETS before or after kill?) +- Concurrent disengage + re-engage race (specific interleaving scenario) +- Close-only enforcement mechanism (UI-only vs backend validation) +- Order-in-flight past ETS gate during termination (already-checked orders) + +**Quality assessment:** +- **Claude Sonnet 4.5** was the most EXHAUSTIVE (25 gaps) but with notable + quality variance. Several findings were highly specific and implementation- + relevant (ETS schema, missing-table detection, broker rejection semantics). + Others were relatively obvious or lower-impact (user deletion, audit log + failure, cancel timeout duration). The 14 Critical ratings feel somewhat + generous — some would be more accurately rated as High in practice. Output + was well-structured with clear per-finding format. 
+- **GPT-5** found 19 gaps with consistent high quality. Its unique findings + show cross-cutting reasoning: the combined mode semantics finding (global + vs per-user mode interaction) identifies a genuine specification gap that + neither Sonnet version noticed. The "ETS missing but liquidation needed" + finding is architecturally significant — it identifies a CONTRADICTION in + the spec's own rules (fail-closed blocks everything, but liquidation must + pass). Every finding was actionable. More selective severity ratings + (8 Critical vs Sonnet 4.5's 14). +- **Claude Sonnet 4.6** was the most SELECTIVE (13 gaps) but with the highest + precision. Every finding was genuinely a specification gap that an + implementer would face. The ETS sequencing finding (#4) is particularly + well-reasoned — it identifies a specific ordering dependency that creates + a race window. Sonnet 4.6 appears to self-filter aggressively, producing + only findings it's confident about. Higher signal-to-noise than 4.5. + +**Key insight — Sonnet 4.5 vs 4.6 on analytical tasks:** +This is the first direct comparison between Claude model versions on the same +analytical task. Key differences: + +- **Volume:** 4.5 produced almost 2x the findings (25 vs 13) +- **Tokens:** 4.5 used ~1.5x the output tokens (5,191 vs 3,403) +- **Time:** 4.5 took ~1.4x longer (102s vs 73s) +- **Severity distribution:** 4.5 had more Critical findings (14 vs 8) but + with more generous severity ratings +- **Quality per finding:** 4.6 had higher average quality; fewer "obvious" + or lower-impact findings + +The 4.6 model appears to have been trained toward higher precision/selectivity. +It finds fewer things but each finding is more reliably a genuine gap. The 4.5 +model is more exhaustive but includes findings that a reviewer might triage as +"yes, technically, but not really a spec gap." This mirrors a known training +direction in Claude models: later versions tend to be more concise and selective. 
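The ratios quoted in this comparison can be rechecked directly from the results table. The sketch below is illustrative only: the `Run` record and the `compare` helper are hypothetical names, and the numbers are copied from the Sonnet 4.5 and Sonnet 4.6 rows of the table. It also derives output tokens per finding, a figure the write-up does not report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One row of the results table (hypothetical record type)."""
    model: str
    seconds: int
    output_tokens: int
    gaps: int


# Figures copied from the Finding #16 results table.
SONNET_45 = Run("Claude Sonnet 4.5", seconds=102, output_tokens=5191, gaps=25)
SONNET_46 = Run("Claude Sonnet 4.6", seconds=73, output_tokens=3403, gaps=13)


def compare(a: Run, b: Run) -> dict[str, float]:
    """Return a-over-b ratios plus output tokens spent per finding."""
    return {
        "findings_ratio": round(a.gaps / b.gaps, 2),
        "token_ratio": round(a.output_tokens / b.output_tokens, 2),
        "time_ratio": round(a.seconds / b.seconds, 2),
        "tokens_per_finding_a": round(a.output_tokens / a.gaps, 1),
        "tokens_per_finding_b": round(b.output_tokens / b.gaps, 1),
    }


RATIOS = compare(SONNET_45, SONNET_46)

if __name__ == "__main__":
    for name, value in RATIOS.items():
        print(f"{name}: {value}")
```

With the table's figures, 4.5 spends fewer output tokens per finding than 4.6 (about 208 vs 262), so its extra output is spread over more findings and is not longer write-ups per finding.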
+ +**For practical use:** If you want completeness (cast a wide net, accept some +noise): use 4.5. If you want precision (every finding is actionable, no triage +needed): use 4.6. For architecture review where missing a gap has cost, 4.5's +exhaustiveness is probably worth the noise. For review where false positives +cost attention (e.g., PR review comments), 4.6's selectivity is preferred. + +**GPT-5 vs Sonnet comparison on this task:** +GPT-5 (19 findings) sits between the two Sonnets in volume but has the highest +consistency — no obvious misses or inflated severities. Its unique strength +here: finding CONTRADICTIONS within the spec's own rules (ETS-missing blocking +conflicts with liquidation needing to pass). This is consistent with Finding #15 +where GPT-5 was unusually selective but precise on coherence checking. + +Specification completeness analysis appears to be a task where: +1. Sonnet 4.5 is strongest for breadth (25 findings, catches operational gaps) +2. GPT-5 is strongest for detecting spec self-contradictions (19 findings, high precision) +3. Sonnet 4.6 is strongest for precision (13 findings, zero noise) + +**Updated model version comparison:** +- Claude 4.6 → higher precision, more selective, concise +- Claude 4.5 → more exhaustive, more verbose, occasional severity inflation +- This is a genuine tradeoff, not a simple regression or improvement + +**Practical implication:** Run BOTH Sonnet versions? 4.5 catches things 4.6 +filters out (ETS schema, broker rejection semantics, cold-start gate interaction). +4.6 catches things with more specificity (sequencing gaps, exact race windows). +For a one-shot budget: 4.5 if you want coverage, 4.6 if you want actionability. +GPT-5 if you want to find where the spec contradicts itself. + +### 7. 
Token budget matters more than model size for gap analysis (confirmed) + +**Date:** 2026-05-03 +**Task:** Identify unaddressed failure scenarios in gargoyle's `failure-modes.md` (383 lines, ~25KB) +**How we used them:** Same document, same analytical question ("What failure scenarios +are NOT covered?"), three models. GPT-5 with 16K max_completion_tokens, Sonnet 4 +with 4K max_tokens, GPT-4.1 Mini with 4K max_completion_tokens. No project context +beyond the document itself. Pure gap-analysis task. + +**Results:** +- GPT-5 (16K budget): 28 gaps, most exhaustive. Found domain-specific edge cases + others missed entirely: ClOrdID collision across restarts, fractional share rounding, + broker maintenance windows (410/426), hot code upgrades, regulatory halts vs staleness + distinction, cancel-ack-then-late-fill race, DNS/TLS as distinct from network outage. +- Sonnet 4 (4K budget): 12 gaps, concise. Unique framing: distinguished latency + degradation from outage (subtle but actionable). ETS corruption vs loss. +- GPT-4.1 Mini (4K budget): 13 gaps with summary table. Unique angles: unknown broker + status enum values, configuration schema mismatches on cold-start, malformed signals + from logic bugs (not just crashes). + +**Overlap (all three):** Rate limiting, clock skew, resource exhaustion, DB failures, +message backpressure, partial connectivity. + +**Key insight:** GPT-5's 4K attempt produced ZERO output (finish_reason: length) — +all tokens consumed by internal reasoning. At 16K it produced the richest analysis. +This confirms finding #3 (GPT-5 needs generous token budgets) AND adds a new +observation: for open-ended analytical questions, GPT-5's reasoning overhead is +proportionally larger. The 4K models (Sonnet, Mini) both produced useful output at +4K because they don't burn tokens on chain-of-thought. 
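A guard for the empty-output failure described above might look like the following sketch. It assumes the response has already been parsed into a plain dict shaped like a Chat Completions payload (`choices[0].finish_reason`, `choices[0].message.content`). The helper names, the doubling policy, and the 16,000-token ceiling are illustrative choices, not part of the experiment harness.

```python
from typing import Optional


def budget_exhausted(response: dict) -> bool:
    """True when the model hit its token limit without emitting any text.

    Assumes a Chat Completions style dict; adapt the lookups if your
    proxy returns a different shape.
    """
    choice = response["choices"][0]
    content = (choice.get("message") or {}).get("content") or ""
    return choice.get("finish_reason") == "length" and not content.strip()


def next_budget(current: int, ceiling: int = 16_000) -> Optional[int]:
    """Double the budget up to `ceiling`; None once the ceiling is reached."""
    if current >= ceiling:
        return None
    return min(current * 2, ceiling)


if __name__ == "__main__":
    empty = {"choices": [{"finish_reason": "length",
                          "message": {"content": ""}}]}
    budget: Optional[int] = 4_000
    while budget is not None and budget_exhausted(empty):
        print(f"no output at {budget} tokens, retrying with a larger budget")
        budget = next_budget(budget)
```

A truncated answer that still contains text is deliberately not treated as exhausted here; only the "paid for reasoning, got nothing" case triggers a retry.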
+ +**Model personality confirmed:** +- GPT-5: exhaustive, domain-aware, finds edge cases a senior SRE would know +- Sonnet: precise, architectural, finds design-level distinctions +- GPT-4.1 Mini: structured, systematic, finds enumeration gaps + +**Practical implication:** For failure mode / gap analysis on design docs: +- GPT-5 with ≥16K tokens for maximum coverage (most unique findings) +- Sonnet for architectural framing ("this is really two different problems") +- Mini for completeness checking ("what about this enum value?") +- Running all three costs ~$0.50 and catches gaps none alone would find +- GPT-5 at 4K is USELESS for this task — always give it room to think + +**Note on GPT-5 reasoning overhead:** First attempt at 4K max_completion_tokens +returned empty content with finish_reason: length. The model spent all 4K tokens +on internal reasoning and produced nothing. This is worse than a short answer — +it's zero value for non-zero cost. Always budget ≥16K for GPT-5 analytical tasks. + +### 18. Temporal boundary analysis: GPT-5 is most exhaustive; Opus finds design-level contradictions; Sonnet 4.5 is structured but less deep + +**Date:** 2026-05-04 +**Task:** Identify temporal boundary vulnerabilities in gargoyle's `escalation-policy.md` +(238 lines) — scenarios where the timing model (evaluation cycles, debounce counts, +cooldown periods) creates windows of incorrect or dangerous behavior. +**How we used them:** Same document (full text) + same focused analytical question to all +3 models via HAI proxy. Highly structured prompt specifying 5 categories of temporal +vulnerability (timing exploitation, counter reset abuse, asymmetric time exposure, +cross-metric temporal interactions, state loss temporal effects). Required specific +output format per finding (name, sequence with cycle numbers, mechanism, severity, fix). +No tools, no project context beyond the document itself. 
+
+| Model | Time | Output tokens | Reasoning tokens | Findings | Critical | High | Medium |
+|---|---|---|---|---|---|---|---|
+| GPT-5 | ~128s | 9,175 | 5,888 | 15 | 3 | 7 | 2 |
+| Claude Opus 4.6 | ~120s | 5,112 | (internal) | 10 | 3 | 5 | 2 |
+| Claude Sonnet 4.5 | ~100s | 4,056 | (internal) | 12 | 3 | 3 | 3 |
+
+**What they found — common ground (all 3 identified):**
+- Flash crash / inter-evaluation gap exploitation (metric spikes between discrete
+  evaluation cycles go undetected)
+- Single clear cycle resetting debounce counter (transient recovery defeats escalation
+  despite sustained risk — metric can breach 80%+ of cycles and never escalate)
+- Asymmetric escalation time vs loss compounding rate (11 cycles to reach liquidation
+  while losses compound every single cycle)
+- Monitor crash resets state to Clear, losing all escalation progress
+- Liquidation re-trigger requiring full debounce reset, delaying subsequent batches
+- Kill switch N value unspecified (timing indeterminacy)
+
+**GPT-5 unique findings (not in either other model):**
+- Boundary-alignment counter starvation: explicitly modeled the "adversarial flicker"
+  pattern (breaching 2 cycles, 1 clear, repeat — 66% breach time, never escalates)
+  with a precise mathematical framing of why K-of-N is needed
+- Cycle-length drift under load: GC pauses or CPU contention stretching evaluation
+  intervals means "3 cycles" could be 12 minutes instead of 90 seconds when it
+  matters most (high-load market stress = slowest evaluations)
+- Adversarial boundary timing (market microstructure masking): illiquid instruments
+  where opposing prints predictably arrive near evaluation boundaries, exploiting
+  deterministic sampling points
+- Cross-metric escalation gap and hedging prohibition: Restrict prevents ALL new
+  positions including risk-REDUCING hedges needed for a different metric still
+  escalating on its own timeline — protection for metric A actively worsens metric B
+- Cooldown stall causing
prolonged Restrict: repeated transient spikes near hysteresis + threshold reset cooldown indefinitely while metric is actually safe +- State inconsistency between restriction flags and monitor after restart: + documented asymmetry where flag persists (manual clear) but state resets (auto + clear) — creates orphaned restriction or unprotected window depending on + reconciliation approach +- Metric computation fail-closed interacting with debounce: system errors create + false escalations with long cooldown, potentially blocking hedging trades +- Unspecified N for kill switch post-liquidation breaches: coupled with crash + reset, system can loop indefinitely without reaching kill switch +- In-liquidate flicker stall: one cycle below threshold after partial fill resets + re-trigger counter, stalling further liquidation + +**Claude Opus unique findings (not in either other model):** +- De-escalation cooldown exploitation (predictable window): after cooldown completes + and restriction lifts, strategy has a GUARANTEED 5+ cycle window of unrestricted + trading before Restrict can re-engage — an automated strategy could systematically + exploit this predictable safe window to re-enter dangerous positions +- Orphaned restriction flag asymmetry framed as a DESIGN CONTRADICTION: the failure + modes table specifies opposing recovery paths for state (automatic → Clear) vs + flags (manual clear), creating an irreconcilable dual state. 
Opus uniquely + identified that operator intervention to clear the flag could inadvertently + create a WORSE protection gap than leaving it orphaned +- Self-correcting analysis style: Opus's summary explicitly synthesized that the + three Critical findings share a common cause (debounce optimizes against false + positives at the expense of false negatives during sustained events) and proposed + a single architectural fix (severity-aware fast path) that addresses all three + +**Claude Sonnet 4.5 unique findings (not in either other model):** +- De-escalation timing not accounting for proximity to breach threshold: system + removes protection while metric is still near-dangerous, and re-escalation + requires full debounce — created a specific "whipsaw" scenario with cycle numbers +- Manual-only de-escalation from Liquidate creates UNBOUNDED recovery time: + if triggered at 2 AM Saturday, trading disabled until Monday despite metrics + recovering in minutes. Framed as contradiction with "autonomous" design goals +- Evaluation cycle synchronization assumption: no handling of variable timing + (CPU contention, GC pauses) — implicit throughout but never addressed +- Cold start escalation ambiguity: system starts with no prior state while + portfolio may already be in breach condition +- De-escalation event ordering race: multiple metrics de-escalating simultaneously + may emit events in non-deterministic order, confusing external observers + +**Quality assessment:** +- **GPT-5** was the most exhaustive (15 findings) and showed the strongest + mathematical/systems reasoning. Its unique findings included precise attack + models (adversarial flicker, boundary alignment, microstructure masking) that + describe exact exploitation patterns with percentages and cycle counts. The + cross-metric hedging prohibition finding is architecturally significant — it + identifies that protection for one metric can actively CREATE risk for another. 
+ Every finding was actionable with specific fixes. +- **Claude Opus 4.6** produced fewer findings (10) but with characteristic depth + and self-awareness. Its cooldown exploitation finding identified a PREDICTABLE + exploit window that an automated strategy could systematically abuse — framed + not as an accident but as an adversarial opportunity. The summary synthesis + (identifying common cause across Critical findings) shows meta-analytical + capability the other models didn't demonstrate. Opus also uniquely identified + that human intervention to fix one problem could create a WORSE problem — + second-order operational reasoning. +- **Claude Sonnet 4.5** was well-structured (12 findings, clean severity tiers, + organized by Critical/High/Medium/Low) and faster than both other models. + Its findings were solid but less architecturally deep. The manual de-escalation + contradiction finding was genuinely insightful (unbounded recovery time vs + autonomous design goals). However, several findings restated concepts the + other models covered with less specificity about exploitation mechanics. + +**Key insight — temporal reasoning as a task type:** +This is the first experiment specifically testing "temporal boundary analysis" — +reasoning about time-domain properties of a state machine (evaluation frequency, +counter semantics, cooldown mechanics, crash/restart timing). + +Results compared to Finding #13 (race condition identification on a concurrency doc): +- GPT-5: 15 findings here vs 12 in Finding #13. Consistent high performance + on temporal reasoning tasks across both experiments. +- Opus: 10 findings here vs 10 in Finding #13. Remarkably consistent — Opus + produces ~10 high-quality findings regardless of temporal task variant. +- Sonnet 4.5: 12 findings here (first test). Compare to Sonnet 4.6's 7 findings + (with errors) in Finding #13. 
Sonnet 4.5 handles temporal reasoning better than + 4.6 — consistent with Finding #16 showing 4.5 is more exhaustive across task types. + +**Sonnet 4.5 vs 4.6 on temporal reasoning (inferred comparison):** +Sonnet 4.6 struggled significantly on race condition identification (Finding #13: +7 findings with analytical errors, misreading architecture). Sonnet 4.5 here +produced 12 solid findings with no apparent misreadings. This suggests 4.5's +exhaustiveness advantage extends to temporal reasoning — the additional +exploration it does (vs 4.6's aggressive self-filtering) catches more temporal +interactions. Confirms Finding #16's pattern: 4.5 for coverage, 4.6 for precision. + +**The structured-prompt effect continues:** +All three models produced focused, high-quality output with this highly structured +prompt (5 specific categories + required output format). This confirms Finding #14: +narrow analytical lens + broad document scope is the sweet spot for all model tiers. +The prompt structure appears to be a stronger predictor of output quality than model +choice for the bottom 80% of findings (all models find the common-ground issues). +Model choice matters for the TOP 20% — the unique insights that require deeper +reasoning about system interactions. + +**Updated model assignment for temporal boundary analysis:** +1. GPT-5 — most exhaustive, strongest at modeling adversarial exploitation patterns + and mathematical edge cases (15 findings) +2. Claude Sonnet 4.5 — good volume with clean structure, viable for first-pass + temporal analysis (12 findings, no errors) +3. 
Claude Opus 4.6 — fewest findings but highest insight density, uniquely + identifies predictable exploit windows and operational second-order effects + (10 findings) + +**Practical implication:** For temporal analysis on state machines and timing-dependent +policies, the three-model stack produces genuine complementary value: +- GPT-5 catches the adversarial attack patterns and mathematical edge cases +- Opus catches the predictable exploit windows and operational contradictions +- Sonnet 4.5 provides good breadth at lower cost with clean severity categorization + +The union of unique findings across all three models reveals significantly more +temporal vulnerabilities than any single model alone. For a document governing +autonomous financial actions (liquidation, kill switch), the cost of running all +three (~$1-2) is trivially justified against the risk of missing a timing exploit. + +### 19. Union coverage test: GPT-5 Mini + Sonnet 4.6 covers ~71% of GPT-5's findings; the missing 29% is where the real value lives + +**Date:** 2026-05-04 +**Task:** Identify hidden assumptions in gargoyle's `trading-pipeline.md` (1,110 lines, +~62KB) — the most complex document tested so far, covering the full end-to-end path +from tick ingestion through order execution. +**How we used them:** Same document (full text, no truncation) + same focused analytical +question to all 3 models via HAI proxy. Standard hidden-assumption prompt with 5 +categories (runtime behavior, external dependencies, timing/ordering, scale/load, +uncovered failure modes). Required specific output format per finding. No tools, no +project context beyond the document itself. 
+
+| Model | Time | Output tokens | Reasoning tokens | Assumptions found |
+|---|---|---|---|---|
+| GPT-5 | 99s | 9,418 | 5,696 | 35 |
+| GPT-5 Mini | 93s | 5,309 | 1,792 | 21 |
+| Claude Sonnet 4.6 | 38s | 1,792 | (internal) | 17 |
+
+**Coverage analysis — can Mini + Sonnet together replace GPT-5?**
+
+Categorized each of GPT-5's 35 findings by whether the union of Mini + Sonnet
+also identified the same assumption:
+
+- **Covered by BOTH Mini and Sonnet:** ~12 findings (common ground — any model
+  finds these: idempotency, single-writer, clock sync, instrument resolution,
+  fill immutability, reconciliation gate, backpressure, fill correlation, event
+  ordering, audit scalability, PortfolioRisk bottleneck)
+- **Covered by Mini only (not Sonnet):** ~7 findings (transactional atomicity,
+  audit causal consistency, modification-in-flight enforcement, OM throughput,
+  decimal precision, PM/PR close-only race, partition duplicate submit)
+- **Covered by Sonnet only (not Mini):** ~6 findings (market data feed rates,
+  pipeline-vs-market speed, corporate actions atomicity, kill switch partition,
+  shared port isolation, market close vs auction fills)
+- **Union(Mini + Sonnet) total coverage:** ~25/35 = **~71%** of GPT-5's findings
+- **GPT-5 unique (missed by both):** ~10-18 findings depending on strictness
+
+**What GPT-5 uniquely found that the cheaper pair missed:**
+
+The missing 29% is NOT random — it's systematically different in character:
+
+1. **Operational edge cases:** Default TIF "day" broker semantics, OrderRate
+   counting retries, extended-hours MarketHours mismatch, fractional quantities,
+   local expiry timer precision per instrument
+2. **Design-level interaction gaps:** PortfolioRisk concurrent decision race
+   (snapshot stale between two parallel approvals), re-validation gap between
+   approval and submit, decision loss on crash after audit write
+3.
**Domain-specific knowledge:** Manual broker-side actions conflicting with
+   state machine, options/complex instrument position_effect mapping, Decision→Order
+   1:1 invariant vs broker auto-splitting, wash sale retroactive P&L mutation
+4. **Architectural observations:** Reduction re-entry rule insufficiency,
+   PortfolioMonitor coalescing vs fast breach detection, multi-aggregator fanout
+   and audit partial writes, replay/backtest alignment with production controls
+
+These share a common trait: they require **domain expertise** (knowing how brokers
+actually behave, how regulatory rules interact, how production trading systems
+fail in practice) combined with **architectural reasoning** (how the design's own
+mechanisms interact under those real-world conditions). The cheaper models find
+assumptions about the document's internal consistency; GPT-5 additionally finds
+assumptions about the document's relationship to the external world it must
+operate in.
+
+**GPT-5 Mini vs Sonnet 4.6 — complementary, not redundant:**
+
+Mini and Sonnet covered different gaps:
+- Mini was stronger on **internal consistency** (transactional atomicity, causal
+  consistency, decimal precision, modification serialization)
+- Sonnet was stronger on **external interactions** (market data feeds, corporate
+  actions, kill switch distribution, shared resource isolation)
+
+This aligns with previous findings: Mini reasons about implementation mechanics;
+Sonnet reasons about system boundaries and external interactions. Their union
+covers more ground than either alone.
+
+**Cost comparison:**
+
+| Approach | Total tokens | Approx. cost | Coverage of GPT-5 |
+|---|---|---|---|
+| GPT-5 alone | ~21K (9.4K output + 5.7K reasoning) | ~$0.80 | 100% (35 findings) |
+| Mini + Sonnet | ~7.1K output + 1.8K reasoning | ~$0.25 | ~71% (25/35 findings) |
+| All three | ~28K total | ~$1.05 | >100% (35 + unique Sonnet/Mini extras) |
+
+**Key insight — the 71% coverage is a floor, not a ceiling:**
+
+The union covers 71% of GPT-5's specific findings. But Mini and Sonnet each
+also produced findings that GPT-5 DIDN'T make:
+- Sonnet: DailyLossLimit query performance scaling, instrument reference data
+  propagation atomicity across components
+- Mini: Signal audit correlation ambiguity under replay/duplicate ticks
+
+So the total unique finding space is LARGER than any single model. Running all
+three produces the most comprehensive analysis.
+
+**Answer to the open question: "Would running GPT-5 Mini + Sonnet together
+approach GPT-5's coverage at lower combined cost?"**
+
+**Partially.** The pair covers ~71% of GPT-5's findings at ~31% of the cost.
+But the missing 29% is disproportionately valuable — it contains the
+domain-specific, interaction-level, real-world-knowledge findings that are
+most likely to prevent production incidents. For a quick sanity check or
+first-pass screening, Mini + Sonnet is excellent value. For architecture
+review where completeness matters (financial system, safety-critical), GPT-5
+is not replaceable by cheaper models — its unique findings are exactly the
+ones that would cause real-world failures.
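The coverage and cost figures above reduce to a few lines of arithmetic. The sketch below recomputes them from the approximate bucket counts and dollar estimates given in this section. The function and constant names are made up for illustration, and the results inherit the approximations (~) of the source numbers.

```python
def union_coverage(total: int, both: int, only_a: int, only_b: int) -> tuple[int, float]:
    """Return (findings covered by the pair, fraction of the reference total)."""
    covered = both + only_a + only_b
    return covered, covered / total


def cost_ratio(pair_cost: float, reference_cost: float) -> float:
    """Fraction of the reference model's cost spent by the cheaper pair."""
    return pair_cost / reference_cost


# Approximate bucket counts and cost estimates from Finding #19.
GPT5_FINDINGS = 35
COVERED_BY_BOTH = 12
MINI_ONLY = 7
SONNET_ONLY = 6
GPT5_COST_USD = 0.80
PAIR_COST_USD = 0.25

COVERED, FRACTION = union_coverage(
    GPT5_FINDINGS, COVERED_BY_BOTH, MINI_ONLY, SONNET_ONLY
)
MISSED = GPT5_FINDINGS - COVERED
RATIO = cost_ratio(PAIR_COST_USD, GPT5_COST_USD)

if __name__ == "__main__":
    print(f"pair covers {COVERED}/{GPT5_FINDINGS} = {FRACTION:.0%}")
    print(f"missed by the pair: {MISSED}")
    print(f"pair cost relative to GPT-5: {RATIO:.0%}")
```

The same helper gives the lower bound of the "~10-18 missed" range (35 - 25 = 10); the upper bound depends on how strictly a match between two findings is judged, which this arithmetic cannot capture.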
+ +**Practical implication:** The optimal strategy depends on stakes: +- **Low stakes** (internal doc review, non-critical systems): Mini + Sonnet + is 71% coverage at 31% cost — strong ROI +- **High stakes** (financial systems, safety-critical): run all three — the + ~$1 total cost is irrelevant vs the value of the extra 10-18 findings +- **Budget-conscious high stakes:** run GPT-5 alone — it subsumes most of + what Mini + Sonnet find, and adds the critical domain-knowledge findings + +The cost argument for Mini + Sonnet as a GPT-5 REPLACEMENT doesn't hold for +important work. The cost argument for Mini + Sonnet as a GPT-5 COMPLEMENT +is strong — they catch a few things GPT-5 misses, and the union of all three +is the most thorough analysis available. + +**Document complexity observation:** +This is the largest document tested (1,110 lines vs previous 185-785 lines). +GPT-5's finding count scaled up (35 vs 20-26 on smaller docs) while maintaining +quality — no padding with obvious/low-value findings. Mini also scaled (21 vs +6 on 459-line doc in Finding #14). Sonnet scaled less (17 vs 12-17 on smaller +docs) — it appears to have a natural output ceiling regardless of document size, +consistent with its self-filtering behavior observed in previous findings. + +### 22. Silent correctness failures: NEW analytical lens reveals Opus's strength at domain/regulatory reasoning; GPT-5 shows regulatory depth; all models converge on compliance timestamp errors + +**Date:** 2026-05-05 +**Task:** Identify scenarios where the mechanism produces SILENTLY INCORRECT results +(not errors, not crashes — wrong financial calculations, wrong lot selections, or wrong +compliance records that pass all validation) in gargoyle's `specid-lot-selection.md` +(306 lines) — a financial system specification covering tax lot selection strategies, +cost basis accounting, and IRS SpecID compliance. 
+**How we used them:** Same document (full text) + same focused analytical question to
+all 3 models via HAI proxy. Highly structured prompt specifying 5 categories of silent
+incorrectness (stale data, semantic precision, ordering sensitivity, composition errors,
+temporal reference errors). Required specific output format per finding with concrete
+numerical examples of financial impact. No tools, no project context beyond the document.
+
+| Model | Time | Output tokens | Reasoning tokens | Findings | Critical | High | Medium |
+|---|---|---|---|---|---|---|---|
+| GPT-5 | 147s | 13,006 | 10,496 | 7 | 2 | 2 | 3 |
+| Claude Opus 4.6 | 119s | 5,902 | (internal) | 10 | 3 | 3 | 4 |
+| Claude Sonnet 4.6 | 122s | 6,011 | (internal) | 6 | 3 | 3 | 0 |
+
+**What they found — common ground (all 3 identified):**
+- `designation_at` = `DateTime.utc_now()` at processing time, NOT at actual
+  designation time (manual selection was made at order submission, standing
+  orders were configured earlier) — compliance record factually incorrect
+- Holding period calculation boundary errors (>365 days vs IRS "more than one
+  year" rule, off-by-one at leap year boundaries, day-after-acquisition start)
+- HIFO tie-breaker `opened_at ASC` ignores tax_term dimension — selects
+  long-term losses over short-term losses when both have identical cost basis,
+  producing less tax-valuable outcomes
+- Strategy preference resolved at fill processing time, not at trade time
+  (preference changes between trade and fill processing apply retroactively)
+
+**GPT-5 unique findings (not in either Claude model):**
+- Late corporate action leaves stale cost basis in HIFO: ROC/dividend reduces
+  basis, but if close/4 fires before apply_corporate_action/3, HIFO sorts on
+  pre-adjusted basis AND records wrong realized P&L permanently. No mechanism
+  to restate previously persisted LotClosed events. Concrete example: $2,000
+  overstated loss from one trade.
+- `designation_at` fragmentation: a single sell consuming multiple lots calls + DateTime.utc_now() per loop iteration, producing slightly different timestamps + for what should be a single coherent designation event. Audit risk. +- LIFO label in `selection_method` field: records "lifo" but for securities LIFO + isn't an authorized tax method — the operation is legally SpecID electing + newest lots. Downstream reporting may reject or misclassify. + +**Claude Opus unique findings (not in either other model):** +- Realized P&L excludes commissions/fees: formula uses `sell_fill.price` (raw + execution price) minus `lot.cost_basis`, not net proceeds. If cost_basis also + excludes buy-side commissions, P&L is doubly overstated. Active trader doing + 1000 trades/year: ~$20,000+ cumulative P&L overstatement. +- Position `average_cost` is meaningless under SpecID and potentially misleading: + SpecID exists to exploit lot-level basis differences, but position-level average + obscures this. If downstream consumers use average_cost for tax estimation, + results can be 50%+ wrong per lot. +- GenServer mailbox ordering determines lot-to-fill assignment for concurrent sells: + two simultaneous fills for the same instrument get different lots based on network + arrival timing. With different holding periods, produces $670+ tax difference + without user awareness. +- Wash sale rule completely unaddressed: system reports losses as realized/deductible + without checking 30-day substantially identical purchase rule. Active trader + harvesting $50,000 in losses could have $0 actually deductible — $18,500 tax gap. +- `opened_at` semantics undefined: whether it's exchange execution time, GenServer + arrival time, or settlement date affects every downstream calculation (FIFO/LIFO + ordering, holding periods, tax terms). Network timing could produce wrong FIFO + lot selection. 
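The holding-period boundary issue from the common-ground list above can be made concrete. The sketch below is illustrative Python, not gargoyle's code: the function names are hypothetical, and the handling of Feb 29 acquisitions is a simplifying assumption (month-end conventions are not modeled). It contrasts a naive `> 365 days` check with the "more than one year" rule, where the holding period starts the day after acquisition.

```python
from datetime import date


def naive_is_long_term(acquired: date, sold: date) -> bool:
    """Day-count check of the kind flagged in the findings (hypothetical)."""
    return (sold - acquired).days > 365


def anniversary(acquired: date) -> date:
    """Same calendar date one year later.

    Assumption: a Feb 29 acquisition maps to Feb 28 of the following year.
    Month-end conventions are not modeled here.
    """
    try:
        return acquired.replace(year=acquired.year + 1)
    except ValueError:  # Feb 29 with a non-leap target year
        return acquired.replace(year=acquired.year + 1, day=28)


def calendar_is_long_term(acquired: date, sold: date) -> bool:
    """Long-term only if sold after the one-year anniversary of acquisition."""
    return sold > anniversary(acquired)


if __name__ == "__main__":
    acquired = date(2023, 3, 1)
    sold = date(2024, 3, 1)  # exactly one year later, spanning Feb 29, 2024
    print((sold - acquired).days)                 # 366
    print(naive_is_long_term(acquired, sold))     # True
    print(calendar_is_long_term(acquired, sold))  # False
```

A sale exactly one year after acquisition spans 366 days when the interval crosses Feb 29, so the day-count check marks it long-term while the calendar rule still treats it as short-term. Both results are valid booleans, which is why this kind of error passes validation silently.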
+ +**Claude Sonnet 4.6 unique findings (not in either other model):** +- Stale cost basis in manual lot picker during concurrent corporate actions: UI shows + pre-action basis, user selects based on stale data, but close/4 only validates + open/ownership/quantity — never re-validates that the selection rationale is still + correct. No field records the discrepancy. +- `average_cost` recomputation ordering ambiguity in event-sourced model: step 4 + recomputes from "updated lots" but step 3 (persist events) may not have completed + — if implementation re-derives from event store rather than in-memory state, reads + pre-closure lot quantities. Accumulates $500+ error per partial close. +- Strategy fallback + config corruption silently overwrites selection method in + compliance record: if config becomes invalid, fallback to :fifo is logged at + :warning but LotClosed records `selection_method: "fifo"` — compliance record + shows user "chose" FIFO when they configured HIFO. No field records intended vs + actual strategy. + +**Quality assessment:** +- **Claude Opus** produced the most findings (10) with the broadest analytical scope. + Several findings went BEYOND the document's mechanism to identify missing features + that create silent incorrectness (wash sale rules, commission handling, opened_at + semantics). This is a different analytical mode: Opus identified what the system + SHOULD compute but DOESN'T, not just where the existing computation is wrong. + The wash sale finding is the highest-impact across all three models — an active + trader's entire tax-loss harvesting strategy could be invalid. The GenServer + mailbox ordering finding shows characteristic Opus reasoning about emergent + behavior from design decisions. +- **GPT-5** produced fewer findings (7) but with extreme precision and specificity. + Every finding includes concrete dollar amounts and specific field references. 
+ The corporate action stale basis finding is uniquely actionable — it identifies a + specific race condition between two documented mechanisms (close/4 and + apply_corporate_action/3) that produces permanently incorrect persisted data + with no correction path. The designation_at fragmentation finding shows attention + to implementation detail that neither Claude model noticed. GPT-5 used 10,496 + reasoning tokens for 7 findings (1,500 tokens/finding) — HIGH verification, + consistent with Finding #20's pattern for precision-over-breadth tasks. +- **Claude Sonnet 4.6** produced 6 findings with strong specificity and novel angles. + The event-sourced recomputation ordering finding (#5) is architecturally subtle — + it identifies a composition error between the walk-and-consume algorithm's step + ordering and event-sourcing patterns. The strategy fallback compliance recording + finding is a genuine audit hazard. However, Sonnet produced no Medium-severity + findings — it either found Critical/High issues or filtered everything else out. + This aligns with its established high-precision, high-self-filtering behavior. + +**Key insight — "Silent correctness" as an analytical lens:** + +This is the FIRST experiment testing a "silent incorrectness" prompt. The key +difference from previous analytical lenses: +- **Assumption-finding:** "What must be true for this to work?" (Finding #10-12) +- **Race conditions:** "What timing issues exist?" (Finding #13) +- **Design coherence:** "Does the design contradict itself?" (Finding #15) +- **Invariant violations:** "What operation sequences break invariants?" (Finding #20) +- **Silent correctness:** "Where does the system CONFIDENTLY produce WRONG output + with NO indication of error?" + +The silent correctness lens produced qualitatively different findings from all +previous lenses. 
The emphasis on "passes all validation" forced models to reason +about what SHOULD be validated but ISN'T, and about semantic correctness (regulatory +requirements, financial accounting rules) vs syntactic correctness (valid types, +non-nil fields, correct schema). + +This lens also revealed a key model differentiation not seen before: +- **Opus reasons about MISSING functionality** (wash sales, commissions, opened_at + semantics) — things the system should do but doesn't +- **GPT-5 reasons about EXISTING functionality being wrong** (corporate action race, + designation fragmentation, LIFO labeling) — things the system does but incorrectly +- **Sonnet reasons about COMPOSITION failures** (event-sourcing step ordering, + strategy fallback propagation) — things that are individually correct but combine + incorrectly + +These are three genuinely different analytical modes, not just "more/less thorough." +All three are valuable for different review outcomes: Opus for feature completeness, +GPT-5 for mechanism correctness, Sonnet for integration correctness. + +**Financial domain advantage:** + +This is the first experiment on a document with strong regulatory/financial semantics. +All three models demonstrated domain knowledge (IRS holding period rules, Treas. Reg. +1.1012-1(c) requirements, wash sale IRC §1091, long-term/short-term capital gains +rate differentials). Opus in particular referenced specific IRC sections and provided +concrete tax rate calculations. The "silent incorrectness" lens works especially well +on financial/regulatory documents because the gap between "syntactically valid output" +and "semantically/legally correct output" is large and consequential. + +**Comparison to previous findings on the same models:** + +| Task type | GPT-5 findings | Opus findings | Sonnet findings | Opus > GPT-5? 
| +|---|---|---|---|---| +| Hidden assumptions (#10-12) | 20-35 | 12-13 | 13-17 | No | +| Race conditions (#13) | 12 | 10 | 7 | No | +| Design coherence (#15) | 4 | 7 | 5 | **Yes** | +| Invariant violations (#20) | 3 | 7 | 5 | **Yes** | +| Silent correctness (#22) | 7 | 10 | 6 | **Yes** | + +Pattern confirmed: Opus outperforms GPT-5 (by finding count) on tasks that require +reasoning about the design's RELATIONSHIP to external requirements (regulatory, +financial, consumer expectations). GPT-5 outperforms Opus on tasks that require +EXHAUSTIVE EXPLORATION within a self-contained system (assumptions, race conditions). + +The "silent correctness" lens is structurally similar to coherence checking (does the +system match its external requirements?) rather than gap-finding (what's missing +within the system?). This explains why Opus outperforms: the task requires reasoning +about the world outside the document (IRS rules, financial accounting standards, +regulatory requirements), which is Opus's strength. + +**Practical implication:** +For financial/regulatory system review, the "silent correctness" lens should be +run using Opus as the primary model (broadest findings including missing-feature +identification) plus GPT-5 for mechanism-level precision. Sonnet adds value for +composition/integration issues that neither Opus nor GPT-5 catches. All three +produced unique, actionable findings that the others missed. + +The four findings ALL models converged on (designation_at, holding period, HIFO +tie-breaker, strategy preference timing) should be treated as confirmed design +bugs requiring fixes. The fact that three independent models all identified them +with concrete financial impact examples increases confidence that these are real. + +### 23.
Regulatory compliance analysis: GPT-5 finds broadest scope of IRS issues; Opus self-corrects mid-analysis; all models converge on cross-account gap + +**Date:** 2026-05-05 +**Task:** Identify where gargoyle's `wash-sale-tracking.md` (391 lines) could produce +incorrect tax reporting compared to IRS wash sale regulations (IRC 1091). NEW +analytical lens: regulatory compliance verification — asking models to reason about +a code implementation's correctness against EXTERNAL regulatory requirements (not +internal system assumptions or race conditions). +**How we used them:** Same document (full text) + same focused analytical question +to all 3 models via HAI proxy. Prompt structured around 5 categories: regulatory +gaps, interpretation errors, edge cases at regulatory boundaries, cross-account/entity +concerns, and interaction with other IRC sections. Required specific regulatory +citations, implementation analysis, concrete tax errors, and audit risk levels. +No tools, no project context beyond the document. 
+ +| Model | Time | Output tokens | Reasoning tokens | Findings | +|---|---|---|---|---| +| GPT-5 | 178s | 12,525 | 9,536 | 16 | +| Claude Opus 4.6 | 155s | 7,326 | (internal) | 16 (with 2 self-corrections/withdrawals) | +| Claude Sonnet 4.6 | 40s | 1,818 | (internal) | 12 | + +**What they found — common ground (all 3 identified):** +- Cross-account/IRA/external broker wash sales not tracked (IRS applies at taxpayer level) +- Options/contracts to acquire stock not triggering wash sales (explicit in IRC 1091(a) text) +- "Substantially identical" definition too narrow (same index ETFs, share classes, ADRs) +- Trade date vs settlement date ambiguity in opened_at/closed_at +- Short sale wash sales not addressed +- Section 475 mark-to-market traders incorrectly subjected to wash sale tracking +- IRC 1092 straddle rules interaction not addressed +- Related party / spousal transactions not considered +- Corporate action identity changes breaking matching + +**GPT-5 unique findings (not in either other model):** +- **Per-share vs lot-level basis tacking** (#1): The system applies `disallowed_loss` + and `tacked_opened_at` at the LOT level, but IRS requires per-share treatment + when only partial shares are matched. A lot of 100 shares where only 60 trigger + wash sale should have per-share basis segregation — the system inflates basis for + all 100 shares. **Most architecturally significant finding** — a fundamental + design-level error, not a missing feature. +- **IRA permanent disallowance** (#2): When replacement purchase is in an IRA, the + loss is PERMANENTLY lost (no basis adjustment possible in tax-deferred accounts). + System either incorrectly applies basis adjustment inside IRA or misses it entirely. +- **Instruments not subject to §1091** (#4): §1256 contracts (futures, index options), + cryptocurrency, and §475 elections are all exempt — system may over-disallow. 
+- **Average-cost mutual fund basis** (#11): Wash sale adjustments for funds using + average-cost method require different math than discrete lot-level adjustments. +- **ADRs vs local shares** (#14): ADRs and underlying foreign ordinaries are + substantially identical but have different instrument_ids. +- **RSU vestings/ESPP purchases** (#15): Equity compensation creating lots via + corporate action paths may not trigger `check_replacement/2`. +- **Ordering priority between pre/post sale purchases** (#10): Industry convention + (post-sale first, then pre-sale) may differ from system's strict chronological + ordering, causing 1099-B mismatches. + +**Claude Opus unique findings (not in either other model):** +- **Year-end boundary timing** (#5): Loss in December + replacement in January means + tax reports generated between Dec 31 and the replacement purchase date are incorrect. + Forward detection fires retroactively but users may have already filed. System needs + a "30-day pending window" for year-end reports. +- **Form 8949 reporting format** (#6): IRS requires code "W" in column (f) and + specific adjustment amounts in column (g). System doesn't describe how `tax_summary/3` + produces Form 8949-compatible output — potential CP2000 notice triggers from + automated IRS matching against broker 1099-B. +- **"Open lots" query in backward detection** (#10): If backward detection only + queries currently-open lots, it misses replacements that were acquired AND SOLD + within the window. IRS looks at acquisition regardless of current holding status. + (Rev. Rul. 56-602) +- **Forward detection loss ordering unspecified** (#7): When multiple prior losses + compete for the same replacement shares, ordering matters — different allocation + produces different basis amounts on the replacement lot. 
+- **DRIP reinvestments triggering wash sales** (#9): Dividend reinvestment creates + new lots that should trigger forward detection but may not if only buy fills + produce `LotOpened` events. +- **Self-correcting analytical style (CONFIRMED):** Opus withdrew Finding #4 + entirely mid-analysis ("Revised assessment: holding period logic appears correct. + I withdraw the claim of error"). Spent ~500 words reasoning through the holding + period tacking logic, found it correct, and explicitly retracted. This is now + confirmed across Findings #15, #20, and #23 as a consistent Opus behavior for + verification-heavy regulatory analysis. + +**Claude Sonnet unique findings (not in either other model):** +- **Entity-level tracking for partnerships/S-Corps** (#4.2): Tax-transparent entities + trading through the platform need K-1 reporting to partners — user-scoped model + doesn't address pass-through entity wash sale reporting. +- **Constructive sale integration (IRC 1259)** (#4.1): Short positions or derivatives + creating constructive ownership interact with wash sale determination in ways not + addressed. +- **NOL carryforward interaction** (#5.3): Wash sale deferrals affect character and + timing of losses contributing to NOL calculations across tax years. + +**Quality assessment:** +- **GPT-5** produced the broadest regulatory scope (16 findings) with the most + specific IRS citations (Rev. Rul. 2008-5, Pub. 550, IRC §§267, 1091, 1092, 1222, + 1223, 1256, 475). Its per-share vs lot-level finding (#1) is the only one that + identifies a FUNDAMENTAL DESIGN ERROR (not a missing feature). Most other models' + findings are "you don't handle X" — GPT-5's #1 says "what you DO handle is + handled INCORRECTLY." This distinction matters: missing features are known scope + limitations; incorrect logic is a bug. +- **Claude Opus** matched GPT-5's count (16 with 2 self-corrections = 14 net + confirmed) but with different character. 
Opus excelled at identifying OPERATIONAL + implications (year-end boundary timing, Form 8949 format requirements, forward + detection ordering) rather than just statutory gaps. Its findings tend to describe + HOW the gap manifests in practice ("user files taxes, then January purchase + retroactively invalidates the filing") vs GPT-5's approach of citing the statute + and describing the theoretical violation. +- **Claude Sonnet** was fast (40s) and produced 12 competent findings but with less + regulatory precision. Findings lacked specific IRS citations (no Rev. Rul. + references, no Treas. Reg. citations). Several findings overlapped heavily with + common ground items without adding unique depth. The entity-level and + constructive sale findings show awareness of tax complexity but are relatively + generic ("this is complex and not addressed"). + +**Key insight — regulatory compliance as a distinct task type:** + +This experiment tests a fundamentally different cognitive demand than previous ones: +previous tasks asked "what could go wrong with this system?" (internal reasoning). +This task asks "does this system correctly implement external rules?" (external +reasoning). The model must hold TWO bodies of knowledge simultaneously: the +implementation spec AND the regulatory framework, then find mismatches. + +All three models had strong tax law knowledge — they cited IRC sections, Revenue +Rulings, and Treasury Regulations correctly. The differentiation wasn't in legal +knowledge but in HOW they applied it: + +- **GPT-5:** Exhaustive statutory mapping ("here's every IRC section that touches + wash sales; here's where the implementation falls short on each"). Breadth-first + coverage. Found the most issues by sheer scope of regulatory awareness. +- **Opus:** Operational consequence reasoning ("here's how this gap manifests as + a real-world problem for the user/auditor"). 
Found issues by reasoning about + the implementation's interaction with real-world workflows (filing deadlines, + form formats, broker reconciliation). +- **Sonnet:** Category-based analysis ("here are cross-account issues, here are + entity issues, here are interaction issues"). Followed the prompt structure + closely but didn't go deep within each category. + +**The per-share vs lot-level finding (GPT-5 #1) — why it matters:** + +This is the experiment's most important result. Every model found missing features +(options, cross-account, short sales) — those are SCOPE limitations that the +document itself acknowledges or defers. GPT-5 uniquely found a correctness bug in +the IMPLEMENTED logic: the system's lot-level adjustment is wrong when a wash +sale matches only part of a replacement lot. + +Example: a loss of 60 shares is matched against a replacement lot of 100 shares, +so `adjusted_quantity = min(loss_quantity, replacement_quantity) = 60`. Only 60 +of the replacement shares are wash-sale replacements, but the system sets +`tacked_opened_at` on the entire 100-share lot. This is a genuine bug: the 40 +unmatched shares get an incorrect holding period classification. + +The reverse case does not trigger it. With a loss lot of 100 shares and a +replacement lot of 60 shares, all 60 replacement shares are matched, so +lot-level tacking and per-share tacking give the same result.
+ +**Updated task-type taxonomy:** + +| Task type | Primary cognitive demand | Best model | +|---|---|---| +| Hidden assumptions | Breadth identification (what's not stated?) | GPT-5 (exhaustive) | +| Race conditions | Sequential temporal reasoning | GPT-5 + Opus | +| Cross-component interactions | Component boundary reasoning | GPT-5 + Sonnet | +| Design coherence | Internal consistency checking | Opus | +| Invariant violation paths | Construction + verification | GPT-5 (precision) | +| Silent correctness | External requirement matching | Opus | +| **Regulatory compliance** | **Dual-knowledge-base comparison** | **GPT-5 (breadth) + Opus (operations)** | + +Regulatory compliance is closest to "silent correctness" (Finding #22) in that +both require reasoning about external requirements. The key difference: +- Silent correctness asks "does this produce correct outputs for all inputs?" +- Regulatory compliance asks "does this implement the law correctly?" + +Both favor models that reason about the system's relationship to the outside +world (Opus's strength), but regulatory compliance also rewards breadth of +statutory knowledge (GPT-5's strength). The combination produces the most +complete picture. + +**Practical implication:** +For regulatory compliance review of financial systems: +- Run GPT-5 for exhaustive statutory coverage (finds the most gaps) +- Run Opus for operational impact analysis (finds how gaps manifest in practice) +- Sonnet adds marginal value — use only if budget allows +- GPT-5's unique strength: identifying correctness bugs in implemented logic + (not just missing features) +- Opus's unique strength: identifying timing/workflow issues (year-end, form + reporting, reconciliation with broker) + +### 24.
Design improvement proposals: GPT-5 excels at defense-in-depth thinking; Opus finds subtle design contradictions; Sonnet produces generic recommendations + +**Date:** 2026-05-05 +**Task:** Propose specific design improvements for gargoyle's `kill-switch.md` (185 lines) +— the primary safety mechanism that prevents rogue orders. NEW task type: generative/ +creative ("what would you improve?") rather than purely analytical ("what's wrong?"). +**How we used them:** Same document (full text) + same focused prompt to all 3 models +via HAI proxy. Prompt asked for 8-15 specific improvements with: weakness, proposed +change (concrete), tradeoff, severity rating. Explicitly excluded generic advice +("add more tests") and asked about runtime assumptions. No tools, no project context. + +| Model | Time | Output tokens | Reasoning tokens | Improvements proposed | +|---|---|---|---|---| +| GPT-5 | 118s | 8,710 | 6,016 | 15 | +| Claude Opus 4.6 | 127s | 4,985 | (internal) | 15 | +| Claude Sonnet 4.6 | 40s | 1,636 | (internal) | 12 | + +**What they found — common ground (all 3 identified):** +- DB write failure blocking engagement (fail-open under DB outage) — all three + proposed in-memory-first engagement with async persistence +- Kill switch process liveness monitoring (heartbeat/watchdog) +- Broker connectivity loss during cancellation operations +- ETS table ownership and crash-window vulnerability +- Supervisor restart suppression as unstated mechanism +- Per-venue/per-broker scope extension + +**GPT-5 unique findings (not in either other model):** +- **Infrastructure-level "hard kill"** — egress proxy or service mesh that blocks + broker traffic independently of the application. Belt-and-suspenders approach + where the kill switch works even if the entire BEAM VM is unresponsive. This + was GPT-5's highest-impact unique insight. +- **Kill fence token (epoch)** — every order-carrying message includes an epoch; + stale-epoch messages are dropped at the gate. 
Elegantly solves in-flight + messages without needing drain timeouts. +- **Cluster/multi-node propagation** — detailed leader election + epoch broadcast + + fail-closed on partition design. +- **Post-engage broker verification** — query broker AFTER engaging to confirm no + orders slipped through during the engagement window. +- **Liquidation exposure validation** — proving tagged liquidation orders actually + REDUCE exposure rather than trusting the tag. +- **Recovery/cold-start order suppression** — ensuring reconciliation/recovery + routines can't submit orders while engaged. +- **Engage latency reordering** — ETS first, terminate second, DB async. +- **Audit log tamper evidence** — append-only external sink + hash chain. + +**Claude Opus unique findings (not in either other model):** +- **Ordering contradiction in engagement sequence** — identified that the + documented order (DB → ETS → terminate) creates a specific risk if a crash + occurs BETWEEN termination and ETS update (not just DB failure). The insight + is about the window where termination has started but gate is still open. + More subtle than GPT-5's version (which focused on DB-blocking-engage). +- **Concurrent engagement race (mode escalation)** — multiple triggers + simultaneously issuing conflicting modes (RESTRICT vs LIQUIDATE). Proposed + explicit escalation rules (LIQUIDATE always wins) with GenServer serialization. +- **Shared resources under per-user scope** — per-user kill switch doesn't + address orders in shared broker connection buffers. Forces architectural + decision about connection pooling strategy. +- **Clock/time integrity for audit log** — monotonic counters + NTP validation + for forensic reliability. +- **Partial multi-user engagement failures** — what happens when global engage + successfully terminates 4/5 user pipelines but one has orphaned processes. 
+- **Liquidation direction validation** — similar to GPT-5's exposure validation + but framed differently: checking corrupted position records could cause + liquidation to OPEN positions rather than close them. +- **Process termination verification** — checking that `:kill` signals actually + worked (defense against trap_exit, NIF blocking). +- **Engagement latency SLA** — defining a 50ms target with monitoring/alerting. + +**Claude Sonnet findings (all also present in GPT-5 or Opus, differently framed):** +- No genuinely unique improvements that GPT-5 or Opus didn't also identify. +- Several were generic: "missing resource cleanup," "circuit breaker integration," + "performance monitoring" — exactly the kind of advice the prompt tried to + exclude. +- The "missing heartbeat" and "network partition handling" proposals were solid + but less detailed than the corresponding GPT-5/Opus versions. + +**Quality assessment:** +- **GPT-5** produced the most ACTIONABLE improvements. Its proposals were + architecturally concrete ("add an egress proxy," "use kill epochs in messages," + "query broker post-engage") and showed defense-in-depth thinking — multiple + independent layers rather than fixing one path. The infrastructure kill (#2) + is genuinely novel: no other model proposed going OUTSIDE the application + boundary for safety enforcement. GPT-5 consistently thought about "what if + this entire runtime is compromised?" rather than just fixing within-app paths. +- **Claude Opus** produced equally numerous improvements (15) with characteristic + precision about failure SEQUENCES. Its unique strength: identifying design + contradictions rather than just gaps (the engagement ordering issue, concurrent + mode escalation, shared-resource scope mismatch). Opus's proposals were more + "fix the design tension" while GPT-5's were more "add another safety layer." 
+ Opus also included the process termination verification and engagement latency + SLA — operational rigor that GPT-5 skipped. +- **Claude Sonnet** produced 12 proposals in 40s (fast) but quality was notably + lower. Several proposals were generic software engineering advice that the + prompt explicitly excluded ("add performance monitoring," "resource cleanup"). + No unique insights emerged. Sonnet's proposals lacked the architectural depth + of GPT-5 (no outside-the-application thinking) and the design-tension + identification of Opus. + +**Key insight — generative vs analytical tasks:** + +This is the first experiment testing a GENERATIVE task ("propose improvements") +rather than a purely analytical one ("find problems"). The results reveal: + +1. **GPT-5's defense-in-depth thinking is unique.** In analytical tasks, GPT-5 + finds exhaustive lists of issues. In generative tasks, it proposes LAYERED + solutions — multiple independent mechanisms that each catch what the others + miss. The infrastructure kill proposal (external to the application) shows + GPT-5 reasoning about failure modes that are invisible to within-app analysis. + +2. **Opus's design-tension identification transfers to improvement proposals.** + In analytical tasks, Opus finds where parts of a design contradict each other. + In generative tasks, this manifests as proposals that RESOLVE tensions rather + than just adding patches. The engagement ordering contradiction and mode + escalation rules are both "this design says X but the mechanism allows Y — + here's how to make them consistent." + +3. **Sonnet doesn't transfer well to generative tasks.** In analytical tasks + (assumption-finding, cross-component analysis), Sonnet performs well (85% of + GPT-5 in some experiments). In generative tasks, it falls back to generic + engineering advice. The task requires both identifying problems AND proposing + concrete solutions — Sonnet handles the first step but not the second with + sufficient depth. 
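GPT-5's kill fence token proposal, summarized earlier in this finding, can be sketched in a few lines. This is a hypothetical Python illustration, not the `kill-switch.md` design: `KillGate`, `OrderMessage`, and the choice to bump the epoch on engage are all assumptions made for the example. It shows how stamping each order-carrying message with an epoch lets the gate drop in-flight messages without a drain timeout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrderMessage:
    """Hypothetical order-carrying message, stamped with the epoch at creation."""
    order_id: str
    epoch: int


class KillGate:
    """Hypothetical gate: admits a message only if the switch is not engaged
    and the message carries the current epoch."""

    def __init__(self) -> None:
        self.epoch = 0
        self.engaged = False

    def stamp(self, order_id: str) -> OrderMessage:
        # Producers stamp each message with the epoch they observed.
        return OrderMessage(order_id, self.epoch)

    def engage(self) -> None:
        # Assumption: engaging bumps the epoch, so every message stamped
        # earlier becomes stale, including ones already in flight.
        self.engaged = True
        self.epoch += 1

    def disengage(self) -> None:
        # The epoch is left alone, so pre-engage messages stay stale.
        self.engaged = False

    def admit(self, msg: OrderMessage) -> bool:
        return (not self.engaged) and msg.epoch == self.epoch


if __name__ == "__main__":
    gate = KillGate()
    in_flight = gate.stamp("order-1")  # created before the kill
    gate.engage()
    gate.disengage()
    fresh = gate.stamp("order-2")      # created after release
    print(gate.admit(in_flight), gate.admit(fresh))  # False True
```

In this sketch the stale message stays rejected even after the switch is released, which is the property a drain timeout can only approximate.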
+ +**Comparison to analytical task performance:** + +| Task type | GPT-5 character | Opus character | Sonnet character | +|---|---|---|---| +| Assumption-finding (#10-12) | Exhaustive breadth | Design tensions | Good (85% of GPT-5) | +| Race conditions (#13) | Technical precision | Design contradictions | Weak (errors) | +| Invariant violations (#20) | Maximum selectivity | Self-correcting depth | Imprecise | +| **Design improvements (#24)** | **Defense-in-depth layers** | **Tension resolution** | **Generic advice** | + +The generative task reveals model ARCHITECTURES more clearly than analytical tasks. +GPT-5's reasoning enables it to construct multi-layered solutions. Opus's internal +reasoning enables it to identify what a design SHOULD be (not just what's wrong). +Sonnet pattern-matches against known engineering practices without deep synthesis. + +**Practical implication:** + +For design improvement sessions on safety-critical systems: +- Run GPT-5 for defense-in-depth proposals ("what layers should exist?") +- Run Opus for design consistency proposals ("where does the design contradict itself?") +- Skip Sonnet — its output is indistinguishable from generic checklists +- The combination of GPT-5 + Opus produces complementary improvements: GPT-5 adds + safety layers, Opus fixes internal contradictions. Together they address both + "not enough protection" and "protection mechanisms that work against each other." + +**Cost analysis:** +GPT-5: 118s, ~10.9K tokens (6K reasoning). Opus: 127s, ~5K tokens. Sonnet: 40s, ~1.6K tokens. +For a safety-critical design review, running GPT-5 + Opus costs ~16K tokens and produces +30 improvements with near-zero overlap in unique insights. Excellent ROI for a kill switch +design that protects real money. + +### 25. 
Contradiction detection: NEW task type — Opus excels at finding genuine contradictions with precision; GPT-5 is thorough but spends heavily on reasoning; Sonnet finds surface-level issues quickly + +**Date:** 2026-05-05 +**Task:** Identify internal contradictions, logical inconsistencies, and conflicting rules +in gargoyle's `order-state-machine.md` (311 lines) — a document defining states, +transitions, invariants, fill precedence rules, and time-in-force behavior. +**How we used them:** Same document (full text) + same focused analytical question to all +3 models via HAI proxy. Prompt specifically asked for: state machine contradictions, +semantic conflicts, rule violations, implicit contradictions, and terminology +inconsistencies. Required each finding to quote the conflicting statements, explain +the logical argument, assign severity, and recommend which statement should "win." +No tools, no project context beyond the document itself. + +| Model | Time | Output tokens | Reasoning tokens | Contradictions found | +|---|---|---|---|---| +| GPT-5 | 162s | 12,074 | 11,008 | 4 | +| Claude Opus 4.6 | 41s | 2,056 | (internal) | 6 | +| Claude Sonnet 4.6 | 17s | 826 | (internal) | 4 | + +**What they found — common ground (2+ models identified):** + +- **Missing `pending_cancel → partially_filled` revert transition** (GPT-5 #1 + + Opus #2 + Sonnet partial): The "Rejection reverts" invariant states orders return + to their "pre-modification state (`working` or `partially_filled`)", but the state + diagram only shows `pending_cancel → working` for cancel rejection — no path back + to `partially_filled`. All models correctly identified this as the diagram being + incomplete relative to the stated invariant. GPT-5 and Opus rated CRITICAL. +- **Same issue for `pending_replace` revert** (GPT-5 #1 + Opus #3): The state diagram + only shows `pending_replace → working` for replace rejection, but a replace + requested from `partially_filled` should revert to `partially_filled`. 
Same root + cause as above, just the replace variant. +- **FOK "never partially fills" vs state machine allowing it** (GPT-5 #2 + Opus #4): + The TIF table says FOK "never partially fills" but the state machine has no guards + preventing FOK orders from reaching `partially_filled`. Both correctly noted this + is a broker-enforced guarantee but the document presents it as system-level. +- **`rejection_reason` described as "broker-provided" but local rejections exist** + (GPT-5 #4 + Opus #5 + Sonnet): `pending → rejected` is "local validation failure" + with no broker interaction, but the field says "Broker-provided reason when + rejected." All three caught this terminology inconsistency. + +**GPT-5 unique findings (not in either other model):** + +- **IOC valid terminal states exclude `expired` vs generic expiry transitions** (#3): + IOC should never reach `expired` (unfilled portion is cancelled immediately), but + the state diagram allows any order to transition to `expired` without TIF guards. + Well-reasoned extension of the FOK finding to IOC semantics. GPT-5 correctly + identified that broker "expired-like" outcomes should map to `cancelled` for IOC. + +**Claude Opus unique findings (not in either other model):** + +- **Terminal states that aren't terminal — the `partially_filled` re-entry problem** + (#1): Opus identified the DEEPER contradiction beyond the surface-level "cancelled + states have outgoing transitions." When `cancelled → partially_filled` fires via + late fill, the order is now non-terminal with NO defined mechanism to re-terminate + if no further fills arrive. The order is stuck in `partially_filled` indefinitely. + This goes beyond "the diagram contradicts the definition of terminal" to "the fill + precedence rule creates an unspecified operational scenario." This is the most + architecturally significant finding across all three models. 
+- **Fill precedence label misapplication to non-terminal states** (#6): The state + diagram labels transitions from `pending_cancel → partially_filled` and + `pending_replace → partially_filled` as "fill precedence," but the Fill + Precedence Rule explicitly defines itself as overriding TERMINAL states. + `pending_cancel` is non-terminal. The label conflates two different mechanisms + (fill during pending modification vs. fill overriding terminal state), which + could cause implementers to use the same code path for fundamentally different + scenarios. + +**Claude Sonnet unique findings (not in either other model):** + +- **State diagram terminal arrow contradiction** (#1): Sonnet was the only model to + explicitly note that the Mermaid diagram shows `cancelled → [*]` (terminal arrow) + while simultaneously showing `cancelled → partially_filled` (outgoing transition). + A valid observation but more surface-level than Opus's deeper analysis of the same + phenomenon. +- **Pending replace fill logic error** (#3): Sonnet argued that receiving a fill + during `pending_replace` creates a logical impossibility because the order + parameters are in flux. This is WRONG — fills always apply to current parameters + (the replace hasn't been confirmed yet), and the document actually handles this + correctly. This is a FALSE POSITIVE from Sonnet. + +**Quality assessment:** + +- **Claude Opus** was the clear winner for this task. Found the most contradictions + (6), had the highest precision (0 false positives), and — crucially — found + qualitatively deeper issues. The `partially_filled` re-entry problem (#1) isn't + just "the diagram has a missing transition" but "the fill precedence rule creates + an unresolvable operational state." The fill precedence label misapplication (#6) + identifies a conceptual confusion that would genuinely cause implementation bugs. + Opus completed in only 41s with 2,056 output tokens — by far the most efficient. 
+- **GPT-5** found 4 genuine contradictions with 0 false positives but spent an + extraordinary amount of reasoning tokens (11,008) for modest output (1,066 visible + content tokens, 10.3:1 reasoning ratio). The IOC finding was unique and valuable. + But the cost is disproportionate: 162s and 12K tokens for 4 findings vs Opus's + 41s and 2K tokens for 6 findings. GPT-5's reasoning budget seems to have been + mostly spent on VERIFICATION (confirming each finding is genuine), consistent + with Finding #20's observation. +- **Claude Sonnet** was fastest (17s) and found 4 items, but one was a false positive + (the pending_replace logic error claim is incorrect). That gives it a precision of + 75% (3/4 genuine) — the lowest of the three. Its genuine findings were all also + found by the other models (no unique true contributions). Sonnet appears to trade + speed for accuracy on contradiction detection. + +**Key insight — contradiction detection favors precision-oriented models:** + +This task is fundamentally about LOGICAL ARGUMENTATION: proving that two statements +cannot both be true. Unlike assumption-finding (which is about imagining what could go +wrong) or gap-finding (which is about identifying missing content), contradiction +detection requires the model to: +1. Hold two statements in working memory simultaneously +2. Construct a formal argument for why they conflict +3. NOT get confused by statements that SEEM contradictory but are actually consistent + +Requirement #3 is where models diverge. Sonnet produced a false positive because it +didn't fully reason through whether the pending_replace fill scenario is actually +inconsistent (it isn't — current parameters apply). Opus avoided this trap entirely +and additionally found DEEPER contradictions that require multi-step logical reasoning +(the re-entry problem, the label misapplication). GPT-5 also avoided false positives +but at massive computational cost. 
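The precision figures in the quality assessment follow directly from the finding counts. A minimal Python sketch, assuming the counts from the results table and the single Sonnet false positive described above; the `precision` helper and the dictionary layout are illustrative and not part of the experiment tooling:

```python
# Finding counts from the Finding #25 results table; false positives as
# assessed in the write-up. The data layout is an assumption for illustration.
results = {
    "GPT-5": {"reported": 4, "false_positives": 0},
    "Claude Opus 4.6": {"reported": 6, "false_positives": 0},
    "Claude Sonnet 4.6": {"reported": 4, "false_positives": 1},
}


def precision(reported: int, false_positives: int) -> float:
    """Fraction of reported contradictions that were genuine."""
    if reported == 0:
        raise ValueError("no findings reported")
    return (reported - false_positives) / reported


precisions = {model: precision(**counts) for model, counts in results.items()}

for model, value in precisions.items():
    print(f"{model}: {value:.0%}")
```

Precision alone does not separate Opus from GPT-5 here (neither produced a false positive); the difference between them is in depth and cost.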
+ +**Opus's efficiency advantage:** +This is the first task where Opus is not just qualitatively better but also +quantitatively more efficient: 6 findings in 41s and 2K tokens vs GPT-5's 4 findings +in 162s and 12K tokens. That is roughly 8.8x more findings per token (6/2,056 vs +4/12,074) and about 4x faster. For +contradiction detection specifically, Opus appears to have a structural advantage, +possibly because its internal reasoning is better calibrated for logical argumentation +than GPT-5's externalized reasoning chain. + +**Comparison to Finding #20 (invariant violation paths):** +In Finding #20, GPT-5 was maximally selective (3 findings, all genuine, 15:1 +reasoning ratio). Here, GPT-5 shows the same pattern: few findings, all genuine, +high reasoning ratio (10.3:1). The difference: in #20, GPT-5's selectivity meant +it found UNIQUE violations others missed. Here, three of GPT-5's four findings were also +found by Opus (only the IOC finding was unique), and Opus found two contradictions that +GPT-5 missed. GPT-5's high verification bar buys little +when Opus is ALSO precise AND more thorough. + +**Updated task-model assignment:** + +For contradiction/consistency checking: +1. **Opus** (best choice): no false positives, deepest contradictions, most efficient +2. **GPT-5** (solid backup): zero false positives, unique TIF-related insights, but + expensive and slower +3. **Sonnet** (NOT recommended for this task): produces false positives, no unique + true contributions + +This confirms the emerging pattern: each model has task types where it excels. +Opus excels at logical argumentation and design tensions. GPT-5 excels at +exhaustive enumeration and operational concerns. Sonnet excels at speed and +structural/assumption analysis but struggles with tasks requiring formal logical +reasoning (contradiction detection, concurrency analysis per Finding #13). + +**Practical implication:** When reviewing architecture documents for internal +consistency (e.g., before implementation begins), run Opus. If budget allows, +add GPT-5 for TIF/edge-case coverage.
Skip Sonnet for consistency checking — +its speed advantage is negated by the false positive risk. + +### 26. Missing-feature identification IS promptable across all models; prompt framing eliminates Opus's historical advantage — all three models find regulatory gaps when explicitly asked + +**Date:** 2026-05-05 +**Task:** Identify computations, behaviors, or features that gargoyle's +`corporate-actions.md` (992 lines) SHOULD perform for financial correctness, +regulatory compliance, or operational safety — but doesn't describe. +**How we used them:** Same document (full text) + same focused analytical +prompt to all 3 models via HAI proxy. Prompt explicitly structured around 5 +categories: missing computations, missing behaviors, missing validations, +missing integrations, and regulatory gaps. Required concrete findings with +severity. No tools, no project context beyond the document. GPT-5 via +OpenAI endpoint (16K max_completion_tokens), Opus 4.6 and Sonnet 4.6 via +Anthropic endpoint (8K max_tokens). 
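The write-up names the five prompt categories but does not reproduce the prompt itself. The sketch below shows one way such a structured prompt could be assembled. The wording, the `build_prompt` helper, and the document delimiter are all hypothetical; only the category names, the severity requirement, and the file name come from the description above.

```python
# Hypothetical sketch: the actual prompt wording is not given in this write-up.
# Only the five category names are taken from the experiment description.
CATEGORIES = [
    "missing computations",
    "missing behaviors",
    "missing validations",
    "missing integrations",
    "regulatory gaps",
]


def build_prompt(document_text: str, document_name: str) -> str:
    """Assemble a structured 'missing feature' prompt (illustrative wording)."""
    category_lines = "\n".join(
        f"{i}. {name.capitalize()}" for i, name in enumerate(CATEGORIES, start=1)
    )
    return (
        f"Review `{document_name}` and list what it SHOULD describe for "
        "financial correctness, regulatory compliance, or operational safety "
        "but does not.\n\n"
        f"Organize findings under these categories:\n{category_lines}\n\n"
        "For each finding, state the concrete gap and assign a severity "
        "(Critical, High, or Medium).\n\n"
        f"--- DOCUMENT ---\n{document_text}"
    )


prompt = build_prompt("(full document text goes here)", "corporate-actions.md")
print(prompt)
```

Sending the identical string to every model keeps the comparison controlled; only the endpoint and token limit differ per model.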
+ +| Model | Output tokens | Reasoning tokens | Findings | Critical | High | Medium | +|---|---|---|---|---|---|---| +| GPT-5 | 11,354 | 8,512 | 20 | 3 | 10 | 7 | +| Claude Opus 4.6 | 4,111 | (internal) | 23 | 6 | 10 | 7 | +| Claude Sonnet 4.6 | 4,686 | (internal) | 15 | 5 | 6 | 4 | + +**What they found — common ground (all 3 identified):** +- Wash sale rule interaction with CA-driven lot closures (IRC §1091) +- Short position treatment for corporate actions +- Same-day corporate action ordering beyond `recorded_at` timestamp +- Record date / ex-date position verification (entitlement timing) +- Idempotency guard preventing double-application per user +- Decimal precision/rounding policy unspecified +- Superseded CA status has no lot rollback mechanism +- Rights/warrants post-creation lifecycle (exercise/expiration) +- Basis preservation invariant has no runtime enforcement +- Manual entry authorization and audit trail + +**GPT-5 unique findings (not in either Claude model):** +- Per-lot eligibility based on entitlement date (not just user-level) +- Election-based outcomes for shareholder choices (cash vs stock) +- Instrument-level trading hold during CA application window +- Pre-application consistency checks against broker entitlements +- DB-level enforcement of status transitions and invariants +- Action-type-specific date semantics per field (ex vs record vs payable) +- Voluntary/tender actions beyond distributions +- Backfill/initialization guard for newly onboarded users +- Applicator retry/backoff semantics and confirmation race +- Rights indivisibility constraints vs exact Decimal quantities + +**Claude Opus unique findings (not in either other model):** +- Pending order PRICE adjustment after splits (not just cancellation) +- Multi-instrument position recalculation atomicity for mergers +- Mixed merger basis floor at zero (can produce negative basis) +- Tax lot identification method interaction with inherited dates +- Corporate action effect on strategy 
position limits/risk params +- Corporate actions on instruments not yet in the database +- Partial application window: new user acquires position mid-fan-out +- IRC §305(c) deemed distributions (taxable stock dividends) +- CA impact on unrealized P&L display and strategy evaluation +- Concurrent OrderManager startup + Applicator fan-out race + +**Claude Sonnet unique findings (not in either other model):** +- Stale orders: failure modes table contradicts "excluded" section +- IRC §1223(1) holding period tacking verification at lot close +- Spinoff allocation percentage — no validation child != parent instrument +- Combined spinoff allocations exceeding meaningful bounds +- Cash dividend bypasses OrderManager — record-date quantity snapshot lost +- Mixed merger large-denominator exchange ratio overflow +- Detector schedule: no intraday re-poll for same-day announcements +- ROC cumulative basis floor tracking and IRC §301(c)(3) distinction +- Mixed merger deferred loss not explicitly recorded in metadata + +**Quality assessment:** +- **Claude Opus** was the MOST PROLIFIC (23 findings) — a notable inversion + from previous experiments where Opus typically found fewer but deeper + findings. Here, the explicit "missing feature" framing appears to have + unlocked Opus's breadth. Its unique findings included genuinely critical + items: pending order price adjustment after splits (Critical — direct + financial loss), multi-instrument atomicity for mergers (Critical — + position loss), and mixed merger negative basis (High — accounting + corruption). The findings were precise, well-reasoned, and showed both + regulatory depth (IRC §305(c)) and operational awareness. +- **GPT-5** was slightly less prolific (20 findings) but maintained its + characteristic breadth and operational-level thinking. Per-lot eligibility + (not just per-user) is a subtle but important distinction. The election- + based outcomes finding shows awareness of real-world corporate action + complexity. 
The backfill/initialization guard is operationally significant. + GPT-5 spent 8,512 reasoning tokens — moderate for its output volume. +- **Claude Sonnet** found fewer gaps (15) but several were genuinely + insightful. The internal contradiction between the failure modes table + and the "excluded" section is a real document inconsistency. The cash + dividend record-date quantity snapshot insight (#9) identifies a DATA LOSS + problem — the opportunity to capture that data expires. The mixed merger + deferred loss recording gap shows regulatory awareness. However, some + findings were more surface-level or overlapped heavily with the others. + +**KEY INSIGHT — The original question from Finding #22 is ANSWERED:** + +> "Opus's 'missing feature identification' mode (wash sales, commissions) — +> is this promptable on other models? Could we explicitly ask GPT-5 'what +> should this system compute but doesn't' and get similar results?" + +**YES.** When explicitly prompted with a structured "missing feature" +framing, ALL three models found regulatory gaps (wash sales, IRC sections), +missing computations (basis calculations, rounding), and missing behaviors +(lifecycle events, notifications). GPT-5 produced findings in the same +*category* as what Opus uniquely found in Finding #22 (silent correctness +failures on specid-lot-selection.md). + +In Finding #22, Opus uniquely identified wash sales and commission tracking +as missing features while GPT-5 focused on mechanism incorrectness and +Sonnet on composition failures. HERE, with the explicit "what's missing" +prompt, ALL three models found wash sales, ALL found regulatory gaps, and +ALL found missing behaviors. + +**This confirms:** Opus's "missing feature identification" mode in Finding +#22 was NOT an inherent model capability — it was an emergent behavior from +the open-ended "silent correctness failures" prompt. When you give ALL models +the EXPLICIT instruction to look for missing features, they all do it. 
The +differentiation from #22 was caused by the prompt being more open-ended, +allowing each model to default to its natural analytical mode: +- Opus → "what's missing" (features/functionality) +- GPT-5 → "what's wrong" (mechanism failures) +- Sonnet → "what breaks when combined" (composition) + +**Prompt framing dominates model personality.** With the right prompt, +any model can be directed into any analytical mode. The model differences +that emerged in earlier open-ended experiments reflect DEFAULT TENDENCIES, +not capabilities. + +**NEW finding about Opus on complex documents:** +Opus produced MORE findings than GPT-5 (23 vs 20) — the first time this +has happened on a broad analytical task. Previous pattern: GPT-5 always +finds more (20-33 findings) while Opus finds fewer but deeper (7-13). +What changed? The document is 992 lines — the longest tested — and the +task is explicitly about breadth ("find all gaps"). On this specific +combination (long document + breadth-focused prompt), Opus appears to +allocate its internal reasoning budget toward exploration rather than +its usual depth-first design-tension mode. This suggests Opus's typical +"fewer but deeper" pattern is partially a RESPONSE to shorter documents +where depth is more productive than breadth. + +**Practical implications:** +1. For missing-feature analysis: prompt structure matters more than model + choice. All three models are viable. Use the explicit 5-category prompt. +2. Run all three for critical docs — they find different specific gaps + despite finding the same categories. +3. For open-ended analysis where you want models to find DIFFERENT things: + use open-ended prompts. For analysis where you want COMPREHENSIVE + coverage of one type: use structured prompts. +4. Opus's "fewer but deeper" personality can be overridden by document + length + breadth-focused prompt. On 992-line docs, it competes on + volume with GPT-5. 
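The per-finding token cost for this experiment can be derived from the results table. A minimal sketch, assuming GPT-5's reasoning tokens are added to its output tokens and treating the Claude models' unreported internal reasoning as zero; the helper name and data layout are illustrative:

```python
# Token and finding counts from the Finding #26 results table.
# Assumption: Claude's internal reasoning is not reported, so it is counted
# as zero here; GPT-5's reasoning tokens are added to its output tokens.
runs = {
    "Claude Opus 4.6": {"output": 4111, "reasoning": 0, "findings": 23},
    "GPT-5": {"output": 11354, "reasoning": 8512, "findings": 20},
    "Claude Sonnet 4.6": {"output": 4686, "reasoning": 0, "findings": 15},
}


def tokens_per_finding(output: int, reasoning: int, findings: int) -> int:
    """Total tokens spent per reported finding, rounded to the nearest token."""
    return round((output + reasoning) / findings)


costs = {model: tokens_per_finding(**run) for model, run in runs.items()}
ratio_gpt5_to_opus = costs["GPT-5"] / costs["Claude Opus 4.6"]

for model, cost in costs.items():
    print(f"{model}: {cost} tokens/finding")
print(f"GPT-5 / Opus: {ratio_gpt5_to_opus:.1f}x")
```

If the reported GPT-5 output count already includes its reasoning tokens, the GPT-5 figure would be lower; the sketch follows the additive convention.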
+ +**Cost-effectiveness:** +Opus: 4,111 output tokens for 23 findings = 179 tokens/finding +GPT-5: 11,354 output tokens (+ 8,512 reasoning) for 20 findings = 993 tokens/finding +Sonnet: 4,686 output tokens for 15 findings = 312 tokens/finding + +Opus is by far the most efficient: nearly 6x fewer tokens than GPT-5 per +finding, with MORE findings. This is the strongest cost-effectiveness case +for Opus on any tested task. On long documents with breadth-focused prompts, +Opus appears to be the optimal choice for both quality AND efficiency. + +### 28. Cross-document consistency analysis: NEW task type — GPT-5 finds deep semantic contradictions; Opus finds structural/boundary mismatches; Sonnet identifies core issues quickly + +**Date:** 2026-05-05 +**Task:** Identify contradictions and inconsistencies BETWEEN two architecture documents +describing the same system: `system-overview.md` (323 lines, narrative overview with +component flows, invariants, and domain events) and `architecture.md` (213 lines, +DDD-focused with bounded contexts, context map, and message taxonomy). +**How we used them:** BOTH documents provided as full text in a single prompt (~25KB +total). Highly structured prompt specifying 5 categories of cross-document inconsistency +(terminology conflicts, structural contradictions, flow/sequence conflicts, +ownership/authority conflicts, philosophical contradictions). Required specific output +format per finding. Explicitly excluded omissions (things one doc covers and the other +doesn't) and detail-level differences. No tools, no project context beyond the two +documents. This is a NEW analytical task not previously tested: reasoning about +CONSISTENCY BETWEEN documents rather than internal coherence of a single document. 
+ +| Model | Time | Output tokens | Reasoning tokens | Inconsistencies found | Critical | High | Medium | +|---|---|---|---|---|---|---|---| +| GPT-5 | 125s | 9,415 | 8,384 | 6 | 2 | 3 | 1 | +| Claude Opus 4.6 | 52s | 2,351 | (internal) | 7 | 3 | 3 | 1 | +| Claude Sonnet 4.6 | 14s | 776 | (internal) | 4 | 1 | 2 | 1 | + +**What they found — common ground (all 3 identified):** +- Event sourcing (all events as source of truth) vs fills-only ground truth: + Document A says fills are "ground truth from which all other state can be + derived," while Document B says "events are the source of truth, state is + computed by replaying events." A treats fills as the recovery foundation; + B treats ALL domain events as authoritative. All three models rated this + Critical. +- Bounded context naming mismatch: "Decision Engine" / "Order Management" (A) + vs "Engine" / "Trading" (B) for the same functional responsibilities. + GPT-5 folded this into a broader ownership analysis; Opus and Sonnet + surfaced it as its own finding. +- Signal classification conflict: Document A lists "Signal emitted" as a domain + event; Document B explicitly categorizes `SignalEmitted` as an audit event + ("not used to rebuild state"). This determines event store design and + recovery semantics. + +**GPT-5 unique findings (not in either Claude model):** +- Signal persistence contradiction: Document A states "Signals are never + persisted" while Document B lists `SignalEmitted` as an audit event that IS + persisted and states the audit log is mandatory for trading. These are + directly incompatible claims about whether signal data is stored. +- Audit event ownership conflict: Document A says "Decision approved" events + originate from PortfolioRisk. Document B states "only the decision engine + writes audit events" and lists `DecisionApproved` as an audit event example. + If PortfolioRisk is part of Risk (not Engine), this is an authority violation. 
+- "Single writer per user" (A: OrderManager writes all trading state) vs + per-aggregate single-writer (B: each aggregate writes its own event stream, + Ledger owns positions). These are incompatible authority models — either OM + centralizes writes or each domain owns its own events. + +**Claude Opus unique findings (not in either other model):** +- Engine → OrderManager is an internal pipeline flow (A: same subgraph, direct + arrow) vs Engine → Trading is a cross-domain COMMAND (B: `PlaceOrder` command + crossing a bounded context boundary). This structural disagreement determines + whether order management is an internal pipeline stage or an independent domain + with its own aggregates and command validation. +- Signal Risk's architectural position: Document A shows a two-stage risk + architecture (SignalRisk pre-aggregation, PortfolioRisk post-aggregation) + where Risk is embedded in the pipeline. Document B's context map shows Risk + as a separate domain that Engine merely QUERIES ("kill switch active?") — + no arrow shows signal routing through Risk. Either risk logic lives inside + Engine (contradicting B's context boundary) or the context map is incomplete. +- The "reduce" step ownership: A's top-level flow labels `Approved →|"reduce"| + Decisions` (reduction at aggregation), while A's own domain events table says + "Decision reduced" originates from PortfolioRisk (reduction after aggregation). + This is actually an INTRA-document inconsistency in Document A, but Opus surfaced + it as part of cross-doc analysis. + +**Claude Sonnet unique findings:** +- None genuinely unique. All 4 findings overlapped with GPT-5/Opus common ground + (event sourcing, signal persistence, context count/naming). Sonnet was efficient + (14s, 776 tokens) but didn't identify any inconsistency that the other two missed. + +**Quality assessment:** +- **GPT-5** produced 6 well-reasoned findings with the deepest analysis of + OWNERSHIP conflicts. 
Its signal-persistence contradiction and single-writer + authority conflict are genuinely important — they reveal places where the two + documents would lead implementers to build fundamentally different systems. + Every finding quotes specific text from both documents and explains precisely + WHY they can't both be correct. The reasoning investment (8,384 tokens) was + used for thorough cross-referencing between documents. +- **Claude Opus** found the most inconsistencies (7) and was remarkably fast + (52s vs GPT-5's 125s). Its unique strength: identifying STRUCTURAL contradictions + about component boundaries and communication patterns. The Engine→Trading + command vs internal pipeline finding is architecturally the most significant + discovery — it reveals a fundamental disagreement about whether order + management is INSIDE or OUTSIDE the decision engine's boundary. Opus also + caught a bonus intra-document inconsistency (the "reduce" labeling error). +- **Claude Sonnet** was the fastest (14s) and most concise (776 tokens) but + found only the obvious common-ground issues. For cross-document consistency, + Sonnet's speed advantage came at the cost of missing the architectural + insights that make this task valuable. It did correctly identify all the + Critical-level issues, making it viable as a quick first-pass screen. + +**Key insight — cross-document consistency is a DISTINCT task type:** +This is fundamentally different from single-document analysis (assumptions, +race conditions, coherence). It requires: +1. Building a mental model from Document A +2. Building a separate mental model from Document B +3. Finding places where the models are incompatible +4. Reasoning about WHY they can't both be correct (not just "different") + +Step 4 is what distinguishes this from simple diff-detection. Many surface +differences (naming, detail level, scope) are NOT contradictions — the models +must judge which differences are genuinely incompatible vs. complementary. 
+The prompt explicitly excluded omissions and detail-level differences, and +all three models respected this constraint well. + +**Model strengths on cross-document analysis:** +- **GPT-5** excels at ownership/authority conflicts: it systematically + checked "who owns this concept" in each document and found mismatches. + Its findings cluster around "who writes what" and "who is authoritative." +- **Opus** excels at structural/boundary contradictions: it identified where + the documents draw architectural lines differently. Its findings cluster + around "where are the boundaries" and "what crosses them." +- **Sonnet** identifies the obvious/critical issues quickly but doesn't dig + deeper. Viable for screening, not for thorough analysis. + +**Comparison to Finding #15 / #27 (single-document coherence checking):** +Single-document coherence asks "does this document contradict itself?" +Cross-document consistency asks "do these documents contradict each other?" +Key differences in results: + +| Aspect | Single-doc coherence | Cross-doc consistency | +|---|---|---| +| Opus findings | 5-7 | 7 | +| GPT-5 findings | 4-6 | 6 | +| Sonnet findings | 4-5 | 4 | +| Opus unique | Design tensions | Structural/boundary mismatches | +| GPT-5 unique | Definitional errors | Ownership/authority conflicts | +| Best model | Task-dependent | Opus (most findings, 2.4x faster than GPT-5) | + +The relative ordering is similar (Opus ≥ GPT-5 > Sonnet for coherence-style +tasks), but the CHARACTER of unique findings shifted. On single-doc coherence, +Opus finds design tensions within a single design. On cross-doc consistency, +Opus finds BOUNDARY disagreements between two designs. GPT-5 shifts from +finding definitional errors to ownership conflicts. + +**Are these findings REAL bugs in the gargoyle documentation?** +Yes, several are genuine issues worth fixing: +1. The fills-vs-events ground-truth question is a real philosophical tension between + the two documents that needs resolution. +2.
The Position event ownership (OrderManager vs Ledger) is a real boundary + conflict that affects implementation. +3. The Engine→Trading communication style (internal pipeline vs cross-domain + command) is a genuine structural ambiguity. +4. The signal persistence claim ("never persisted" vs `SignalEmitted` audit + event) is a direct textual contradiction. + +These are the kinds of cross-document inconsistencies that cause teams to build +inconsistent implementations: one engineer reads Document A and builds one way, +another reads Document B and builds differently. + +**Practical implication:** Cross-document consistency analysis is a high-value +task for documentation maintenance. Run it when: +- A system has multiple architecture docs written at different times +- A refactoring has updated one doc but not another +- Multiple people contribute to design documentation +- The design is moving from high-level overview to detailed specification + +Opus is the recommended model for this task: faster than GPT-5 (52s vs 125s), most +findings (7 vs 6), and uniquely strong at boundary disagreements. GPT-5 adds +value for ownership-specific conflicts. Sonnet is sufficient for quick +screening (catches the Critical issues in 14s) but won't find the architectural +insights. + +**Cost-effectiveness:** +Opus: 2,351 output tokens for 7 findings = 336 tokens/finding (52s) +GPT-5: 9,415 output + 8,384 reasoning for 6 findings = 2,967 tokens/finding (125s) +Sonnet: 776 output tokens for 4 findings = 194 tokens/finding (14s) + +Opus is the clear winner over GPT-5 on this task type: more findings, 2.4x +faster, and 8.8x more token-efficient per finding. Sonnet is cheapest per finding +but surfaced nothing the other two missed. GPT-5's massive reasoning +investment (8,384 tokens) still yielded one fewer finding than Opus; the +verification overhead is not paying off here because cross-document contradictions +are relatively easy to verify once identified (just check both documents). + +### 29.
Adversarial manipulation analysis: NEW task type — GPT-5 is most exhaustive and systematic; Opus produces qualitatively different attack vectors with system-level thinking; Sonnet is adequate but less creative + +**Date:** 2026-05-05 +**Task:** Identify adversarial manipulation paths in gargoyle's `aggregation.md` (193 lines) +— how a misbehaving, compromised, or buggy upstream component could exploit the +aggregator's design guarantees to produce harmful trading outcomes that bypass +downstream safety controls. +**How we used them:** Same document (full text) + same focused analytical question to all +3 models via HAI proxy. Highly structured prompt specifying 5 categories of adversarial +manipulation (signal injection, timing manipulation, capacity weaponization, state +corruption via crash, audit evasion). Required specific output format per finding +(attack vector, mechanism, exploit, why downstream controls miss it, severity). No tools, +no project context beyond the document itself. 
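The required per-finding output format maps naturally onto a small record type, which also makes the severity tallies in the results table easy to reproduce. A minimal sketch; the `AttackFinding` class, its field names, and the placeholder values are hypothetical and not taken from the experiment tooling or from any model's output:

```python
from collections import Counter
from dataclasses import dataclass

SEVERITIES = ("Critical", "High", "Medium")


@dataclass(frozen=True)
class AttackFinding:
    """One finding in the five-field format the prompt asked for."""

    attack_vector: str
    mechanism: str
    exploit: str
    why_downstream_misses: str
    severity: str

    def __post_init__(self) -> None:
        # Reject severities outside the scale used in the results table.
        if self.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity!r}")


def severity_counts(findings: list[AttackFinding]) -> dict[str, int]:
    """Count findings per severity, always reporting every level."""
    counts = Counter(f.severity for f in findings)
    return {level: counts.get(level, 0) for level in SEVERITIES}


# Placeholder values only; not an actual finding or rating from any model.
example = [
    AttackFinding(
        attack_vector="<attack vector>",
        mechanism="<mechanism in the spec being abused>",
        exploit="<concrete exploit sequence>",
        why_downstream_misses="<why downstream controls miss it>",
        severity="High",
    )
]

print(severity_counts(example))
```

Recording each model's findings in one shape makes it straightforward to check that the per-severity columns sum to the reported total.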
+ +| Model | Time | Output tokens | Reasoning tokens | Attack vectors found | Critical | High | Medium | +|---|---|---|---|---|---|---|---| +| Claude Sonnet 4.6 | 27s | 1,257 | (internal) | 10 | 3 | 5 | 2 | +| Claude Opus 4.6 | 84s | 3,662 | (internal) | 12 | 5 | 5 | 0 | +| GPT-5 | 111s | 8,808 | 6,336 | 15 | 2 | 10 | 3 | + +**What they found — common ground (all 3 identified):** +- Primary signal hijacking via ranking manipulation (last-tick injection in + time-windowed to control decision parameters) +- Threshold gaming via signal replay/duplication (no deduplication means N + identical signals satisfy "N confirmations") +- Capacity flooding to force premature completion or deny legitimate trades +- Strategic crash to erase unfavorable in-flight groups +- Timeout-masqueraded manipulation (making attacks look like normal system behavior + in the audit trail) + +**GPT-5 unique findings (not in either Claude model):** +- **Direction flip against majority via ranking:** In "most recent" ranking, + emit multiple SELL confirmations then inject a late BUY — the BUY becomes + primary and the decision contradicts the bulk of evidence. Distinct from + general primary hijack because it's specifically about *directional* reversal. +- **Late-arrival exclusion of counter-signals:** Time signals so countervailing + signals arrive just after group destruction, ensuring the decision is formed + without dissenting inputs that would have altered ranking. +- **Capacity filter to curate the audit set:** Pre-fill buffer with chosen + signals so riskier alternatives cannot be included before capacity fires — + the contributing signals list looks clean. +- **Timer nullification by crash:** Crash just before a timeout that would + force-complete an unfavorable decision — the timer becomes no-op on restart, + no decision or expiry event is emitted. 
+- **Decision drop via induced forwarding failure:** Exploit the "Decision + forwarding fails: Decision is lost" failure mode to selectively suppress + protective decisions (stops, hedges) with no automatic retry. +- **Crash to erase evidence of contrary signals:** Post-crash, submit a + fresh group that completes quickly; audit shows only the new set, not the + earlier contradictory pre-crash signals. + +**Claude Opus unique findings (not in either other model):** +- **Instrument fragmentation to multiply position size:** Emit signals for + economically equivalent exposures using different instrument identifiers. + Each gets its own group, each produces a separate decision, bypassing + per-group capacity limits. Combined position exceeds what any single group + would allow. Identifies TOCTOU at the fan-in to PortfolioRisk. +- **Forced stale decision via timer exploitation:** Emit one signal at a + favorable price spike known to be transient, then deliberately withhold + further signals. Timer force-completes with a stale price. The entry price + WAS valid when the signal was generated — PortfolioRisk doesn't check + staleness of decision prices. +- **Timeout prevention / keep-alive suppression:** Manipulate market data + feed to suppress signals that would reach threshold N. Group expires + normally — denial-of-trading attack disguised as insufficient confirmation. +- **Crash-restart duplicate decisions:** Crash after decision is forwarded + but before strategy reflects it. Both restart "clean" — strategy re-emits + signals, aggregator produces a second decision with a fresh ID. Same trade + executes twice. PortfolioRisk can't deduplicate because IDs are different. +- **Force-complete with insufficient confirmation (capacity < threshold):** + If capacity limit is lower than threshold, hitting capacity ALWAYS force- + completes before predicate is satisfied. Fundamentally changes a 5-confirmation + strategy into a 3-confirmation strategy. 
+- **Pattern predicate as arbitrary decision trigger:** If an adversary controls
+  the predicate logic (via strategy configuration), they can make
+  pattern-complete trigger on any single signal while the audit shows
+  algorithm=pattern-complete and reason=:predicate. Trust boundary between
+  configuration and execution.
+
+**Claude Sonnet unique findings (not in either other model):**
+- **Cross-group timing coordination:** Coordinate signal injection across
+  multiple instruments to synchronize completion times, creating a burst of
+  correlated decisions that overwhelm PortfolioRisk's individually safe
+  evaluations. (NOTE: Opus found a similar concept — instrument fragmentation
+  — but framed it differently: Opus focused on position multiplication via
+  instrument aliasing, Sonnet focused on burst timing overwhelming evaluation.)
+- **Multi-strategy attack distribution:** Spread manipulation across multiple
+  isolated strategy aggregators so no single aggregator's behavior looks
+  abnormal while cumulative effect is harmful.
+
+**Quality assessment:**
+- **GPT-5** produced the most findings (15) with the most systematic coverage
+  across all 5 prompt categories. Its strength was in identifying SPECIFIC
+  INTERLEAVINGS — exactly how timing, state, and ranking mechanisms interact
+  to produce exploits. The direction-flip finding (#3) and the late-arrival
+  exclusion finding (#6) show precise temporal reasoning about when signals
+  arrive relative to group lifecycle events. The "decision drop via forwarding
+  failure" finding exploits a DOCUMENTED failure mode (from the failure table)
+  as an offensive weapon — turning a recovery mechanism into an attack vector.
+  Every finding references specific mechanisms from the spec.
+- **Claude Opus** produced 12 findings with the most architecturally creative
+  attacks.
+  The instrument fragmentation attack is the most SYSTEMICALLY
+  dangerous finding across all three models — it's not about manipulating one
+  group but about the RELATIONSHIP between groups, and it identifies a
+  TOCTOU vulnerability at the PortfolioRisk fan-in point that no other model
+  found. The crash-restart duplication attack is also architecturally novel —
+  it exploits the "clean state" guarantee as a weapon for invisible trade
+  doubling. Opus consistently reasons about the system BOUNDARY (aggregator
+  → PortfolioRisk handoff) rather than just within-component mechanics. The
+  pattern-predicate trust boundary finding is uniquely about CONFIGURATION
+  as an attack surface.
+- **Claude Sonnet** produced 10 findings in 27s — extremely efficient (about
+  126 output tokens per finding: 1,257 / 10). Findings were adequate and
+  covered all 5 categories, but lacked the specificity of GPT-5 and the
+  architectural creativity of Opus. Several findings were somewhat generic
+  (e.g., "crash at strategic moments" without specifying exactly WHEN relative
+  to group lifecycle). The cross-group coordination and multi-strategy
+  distribution findings show system-level thinking but are stated at a higher
+  abstraction level without concrete exploit sequences.
+
+**Key insight — "adversarial manipulation analysis" as a task type:**
+This is qualitatively different from all previous analytical lenses tested.
+Previous tasks asked models to find problems WITH the design (assumptions,
+races, incoherences). This task asks models to find ways to USE the design
+AGAINST itself — a creative/generative adversarial task. Results:
+
+- **GPT-5** treats it as an exhaustive enumeration exercise — systematically
+  walks through each mechanism and asks "how could this be abused?" High
+  count (15), thorough coverage, but some findings are minor variations of
+  each other (e.g., crash-related findings #10, #12, #15 share the same core
+  mechanism).
+  Reasoning tokens (6,336) used for both generation and verification.
+- **Opus** treats it as a creative design exercise — asks "what would a
+  smart adversary do that the designer didn't consider?" Fewer findings (12)
+  but several are genuinely novel attack concepts (instrument fragmentation,
+  crash-restart duplication, predicate trust boundary) that require reasoning
+  about the SYSTEM rather than the COMPONENT. Opus also provided a summary
+  table and systemic conclusion about the root design weaknesses.
+- **Sonnet** treats it as a categorization exercise — fills each prompt
+  category with plausible attacks but at a higher abstraction level. Fast
+  and adequate for a first pass but wouldn't surprise a security reviewer.
+
+**Comparison to "predictable exploit window" (Finding #18):**
+Finding #18 noted that Opus uniquely identified predictable exploit windows
+in escalation-policy.md. Here, Opus again shows the strongest adversarial
+creativity — the instrument fragmentation attack and crash-restart duplication
+are both about exploiting DESIGN GUARANTEES (per-instrument grouping, clean
+restart) as weapons. This confirms that Opus's strength on adversarial analysis
+is a CONSISTENT PATTERN, not document-specific.
+
+GPT-5 excels when the adversarial task is framed as "enumerate all possible
+abuses of each mechanism" (systematic coverage). Opus excels when the task
+requires "invent novel attack concepts that exploit design boundaries"
+(creative adversarial thinking).
+
+**Model hierarchy for adversarial manipulation analysis:**
+1. GPT-5 — most thorough enumeration, best at mechanism-level exploitation (15)
+2. Opus — most creative, finds system-boundary attacks others miss (12)
+3.
+   Sonnet — adequate first pass, fast, but less specific (10)
+
+**Practical implication:** For security-oriented architecture review:
+- Run GPT-5 for comprehensive attack surface enumeration
+- Run Opus for novel/creative attack vectors that exploit design boundaries
+- Sonnet is sufficient only as a quick initial screen
+- The UNION of GPT-5 + Opus findings (removing overlaps) would produce the
+  most complete adversarial analysis
+
+**New finding about the aggregator itself:** Several attacks identified by
+multiple models point to real design weaknesses worth addressing:
+1. No signal deduplication/independence validation (all 3 models)
+2. Primary signal determines all decision parameters regardless of group
+   composition (all 3 models)
+3. Transient state + no replay = perfect adversarial erasure tool (all 3)
+4. Capacity/timeout treated as normal events even when weaponized (all 3)
+5. No cross-group correlation at aggregator level (Opus + Sonnet)
+6. TOCTOU at PortfolioRisk fan-in for concurrent decisions (Opus)
diff --git a/methodology.md b/methodology.md
new file mode 100644
index 0000000..f962d51
--- /dev/null
+++ b/methodology.md
@@ -0,0 +1,76 @@
+# Methodology
+
+## Principles
+
+1. **Internet opinions about models are overwhelmingly about coding.** Don't
+   extrapolate to analytical work without testing.
+2. **"Just because someone says it on the internet doesn't make it right."**
+   Opinions need context. Track our own evidence.
+3. **Absence of published methodology for a use case is itself a finding.**
+4. **No unsupported generalizations.** Each finding needs: date, task,
+   how we used it (context shape, task framing, what info the model
+   had/didn't have), what happened, takeaway.
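Principle 4 lists the fields every finding should carry. One way to keep that checklist honest is a small record type with a completeness check. The sketch below is only an illustration: the class name `FindingRecord`, its field names, and the helper `missing_fields` are hypothetical and do not come from this repo, and the sample values are placeholders rather than a real finding.

```python
from __future__ import annotations

from dataclasses import dataclass, fields
from datetime import date


@dataclass(frozen=True)
class FindingRecord:
    """One finding with the fields Principle 4 asks for (hypothetical names)."""

    recorded_on: date     # date of the experiment
    task: str             # what the model was asked to do
    context_shape: str    # how we used it: what was in the context
    task_framing: str     # how we used it: broad vs focused framing
    info_available: str   # what info the model had
    info_withheld: str    # what info the model did not have
    what_happened: str    # observed result
    takeaway: str         # conclusion drawn from the run


def missing_fields(record: FindingRecord) -> list[str]:
    """Return the names of text fields that are empty or whitespace-only."""
    return [
        f.name
        for f in fields(record)
        if isinstance(getattr(record, f.name), str)
        and not getattr(record, f.name).strip()
    ]


# Placeholder values only; this is not a real finding from the repo.
draft = FindingRecord(
    recorded_on=date(2026, 1, 1),
    task="example task",
    context_shape="document only",
    task_framing="focused",
    info_available="full document text",
    info_withheld="",
    what_happened="example observation",
    takeaway="  ",
)
print(missing_fields(draft))
```

A draft that returns a non-empty list from `missing_fields` is not yet a finding in the sense of Principle 4.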
+
+## Experimental Setup
+
+### Models Tested
+
+| Model | Provider | Access | Notes |
+|-------|----------|--------|-------|
+| GPT-5 | OpenAI (via HAI proxy) | API | Requires `max_completion_tokens` ≥16K |
+| Claude Opus 4.6 | Anthropic (via HAI proxy) | API | Internal reasoning (not exposed) |
+| Claude Sonnet 4.6 | Anthropic (via HAI proxy) | API | Fast, cost-effective |
+| GPT-4.1 | OpenAI (via HAI proxy) | API | Non-reasoning, structured output |
+| GPT-4.1 Mini | OpenAI (via HAI proxy) | API | Cheapest, good for screening |
+| Claude Sonnet 4.5 | Anthropic (via HAI proxy) | API | Predecessor to 4.6 |
+
+### Control Variables
+
+- **Same input:** All models receive identical document text
+- **Same prompt:** Structured prompt with explicit categories and output format
+- **Same constraints:** No tools, no project context beyond the document(s)
+- **Independent runs:** No cross-pollination between model runs
+- **Temperature:** 0.3 for GPT-4.1/Mini; default (1.0) for GPT-5 (required)
+
+### Measurement
+
+- **Time:** Wall clock from request to final token
+- **Output tokens:** Total generated tokens
+- **Reasoning tokens:** For reasoning models (GPT-5), exposed separately
+- **Findings count:** Number of distinct issues identified
+- **Unique findings:** Issues found by only one model
+- **Severity distribution:** Critical / High / Medium / Low per finding
+- **Tokens per finding:** Efficiency metric
+
+### Evaluation Criteria
+
+Each finding is assessed for:
+1. **Correctness:** Is the identified issue real?
+2. **Uniqueness:** Did only this model find it?
+3. **Actionability:** Would a developer change something based on this?
+4. **Depth:** Surface observation vs architectural insight?
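The "Tokens per finding" and "Unique findings" measurements reduce to a division and a set difference once each model's findings have been normalized to shared labels by the reviewer. The following is a minimal sketch under that assumption; `ModelRun`, `tokens_per_finding`, and `unique_findings` are hypothetical names, and the models, token counts, and labels in the example are made up for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ModelRun:
    """One model run; findings are reviewer-assigned labels (an assumption)."""

    model: str
    output_tokens: int
    findings: frozenset[str]


def tokens_per_finding(run: ModelRun) -> float:
    """Efficiency metric: output tokens divided by distinct findings."""
    if not run.findings:
        raise ValueError(f"{run.model} has no findings")
    return run.output_tokens / len(run.findings)


def unique_findings(runs: list[ModelRun]) -> dict[str, frozenset[str]]:
    """For each model, the findings that no other model in `runs` reported."""
    result: dict[str, frozenset[str]] = {}
    for run in runs:
        others: frozenset[str] = frozenset().union(
            *(r.findings for r in runs if r is not run)
        )
        result[run.model] = run.findings - others
    return result


# Illustrative data only; not results from any experiment in this repo.
runs = [
    ModelRun("model-a", 1200, frozenset({"replay", "flooding", "crash-erase"})),
    ModelRun("model-b", 3000, frozenset({"replay", "fragmentation"})),
]
print(tokens_per_finding(runs[0]))
print(unique_findings(runs))
```

The set difference is only as good as the label normalization: two models describing the same issue in different words must be mapped to one label before this count means anything, which is where the single-reviewer subjectivity enters.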
+
+### Context Dimensions Tracked
+
+| Dimension | Options |
+|-----------|---------|
+| Context richness | Rich (full project) vs Minimal (document only) |
+| Task framing | Broad ("review this") vs Focused ("check for X") |
+| Context type | Diff, full files, issue text, research notes, nothing |
+| Tool access | With tools (API calls, file reads) vs text-only |
+| Task structure | Step-by-step explicit vs open-ended |
+
+## Limitations
+
+- Single test corpus (gargoyle architecture docs) — domain bias possible
+- Single researcher evaluating findings — subjectivity in quality assessment
+- Models are non-deterministic — single runs, not averaged
+- Proxy adds latency — timing comparisons are relative, not absolute
+- Internal reasoning tokens not visible for Claude models
+
+## Reproducibility
+
+Prompts for each experiment are in the `prompts/` directory. The test
+corpus is the gargoyle project's `docs/` directory (available at
+`gitea.weiker.me/grgl/gargoyle`). Each finding documents the exact document
+used, its line count, and the specific version/commit when relevant.
diff --git a/open-questions.md b/open-questions.md
new file mode 100644
index 0000000..8f32a8f
--- /dev/null
+++ b/open-questions.md
@@ -0,0 +1,58 @@
+# Open Questions
+
+Unanswered questions from experiments, ordered by potential impact.
+
+## High Priority
+
+### Signal-to-noise confirmation (from Finding #8)
+Give a model the FULL PR review context (diff, files, issue, AC) but add
+the narrow bias question as an explicit review checklist item. If the model
+catches bias despite the rich context, it confirms the signal-to-noise
+hypothesis. If it misses, it suggests something else (attention allocation,
+task switching cost).
+
+### Cross-document consistency as maintenance tool (from Finding #28)
+Does running cross-doc analysis across MORE document pairs (domain readmes
+vs implementation docs, design docs vs plan docs) yield additional real
+inconsistencies?
+Could become a systematic documentation maintenance tool.
+
+### Why Opus dominates cross-doc consistency (from Finding #28)
+Opus was 2.4x faster AND found more issues than GPT-5. Is this because
+cross-doc contradictions are easy to verify once spotted (reducing GPT-5's
+verification advantage)? Or because boundary reasoning (Opus's strength)
+is the primary skill needed?
+
+### Sonnet + narrow framing = GPT-5 level? (from Finding #5)
+Would Sonnet catch semantic issues if given a narrower "check for logical
+consistency" framing instead of broad review? The hypothesis: Sonnet's
+"structural reviewer" tendency is a framing artifact, not a capability limit.
+
+## Medium Priority
+
+### Adversarial analysis ensemble (from Finding #29)
+Run GPT-5 and Opus sequentially — give Opus access to GPT-5's findings
+and ask it to critique and extend. Does the ensemble find more than either
+alone? Does Opus's system-level thinking complement GPT-5's exhaustiveness?
+
+### Reasoning effort parameter (from Finding #21)
+Reasoning effort (low/medium/high) had negligible effect on GPT-5's
+analytical output. Is this because the parameter doesn't work for open-ended
+analysis? Or because the task was already within GPT-5's "easy" threshold?
+Test with a harder document.
+
+### Model personality vs prompt (from Finding #26)
+Missing-feature identification IS promptable across all models — prompt
+framing eliminates Opus's historical advantage. How many other "model
+personality" observations are actually just prompt framing effects?
+
+## Answered Questions
+
+- ~~Opus's "missing feature identification" mode — is it promptable?~~
+  **YES** (Finding #26): all models find regulatory gaps when explicitly
+  prompted. Opus's behavior was an emergent DEFAULT tendency, not a unique
+  capability.
+
+- ~~Is Opus > GPT-5 for coherence tasks universal?~~
+  **NO** (Finding #27): Opus's advantage from Finding #15 was
+  document-specific.
+  On risk-controls.md (992 lines, more complex), GPT-5 regained
+  top position. Document complexity and domain specialization affect ranking.
diff --git a/prompts/adversarial-manipulation.md b/prompts/adversarial-manipulation.md
new file mode 100644
index 0000000..2052e1a
--- /dev/null
+++ b/prompts/adversarial-manipulation.md
@@ -0,0 +1,59 @@
+# Prompt: Adversarial Manipulation Analysis
+
+Used in Finding #29.
+
+## Setup
+
+- Single document (full text)
+- Same prompt to all models
+- No tools, no project context beyond the document
+
+## Prompt
+
+```
+You are a red-team security analyst reviewing a trading system's
+aggregation component. Your task is to identify how a MISBEHAVING,
+COMPROMISED, or BUGGY upstream component could exploit this design
+to produce harmful trading outcomes that bypass downstream safety controls.
+
+## Categories of adversarial manipulation:
+
+1. **Signal injection** — How could a compromised strategy inject signals
+   that exploit the aggregator's logic to produce dangerous decisions?
+2. **Timing manipulation** — How could an attacker manipulate timing
+   (delays, bursts, clock skew) to exploit the aggregator's temporal logic?
+3. **Capacity weaponization** — How could the max_signals bound or group
+   completion logic be exploited to force premature or delayed decisions?
+4. **State corruption via crash** — How could deliberate crashes be used
+   to put the aggregator in an exploitable state?
+5. **Audit evasion** — How could an attacker cause the aggregator to make
+   decisions that don't appear in the audit log, or appear differently
+   than what actually happened?
+
+## For each attack vector:
+
+- **Category:** (one of the 5 above)
+- **Attack vector:** Name of the attack
+- **Mechanism:** How the attacker exploits the design
+- **Exploit:** Step-by-step attack sequence
+- **Why downstream controls miss it:** Why PortfolioRisk, BuyingPower,
+  or other downstream checks don't catch this
+- **Severity:** Critical / High / Medium
+- **Mitigation:** What the design could add to prevent it
+
+## Document:
+
+[FULL TEXT OF aggregation.md, 193 lines]
+```
+
+## Results
+
+| Model | Time | Findings | Unique vectors |
+|-------|------|----------|----------------|
+| GPT-5 | ~150s | 8 | 3 (most exhaustive) |
+| Opus | ~65s | 6 | 2 (qualitatively different) |
+| Sonnet | ~20s | 4 | 0 (subset of others) |
+
+GPT-5 was most exhaustive and systematic. Opus found qualitatively different
+attack vectors with system-level thinking (e.g., exploiting supervision tree
+restart semantics).
diff --git a/prompts/contradiction-detection.md b/prompts/contradiction-detection.md
new file mode 100644
index 0000000..be1cf1f
--- /dev/null
+++ b/prompts/contradiction-detection.md
@@ -0,0 +1,58 @@
+# Prompt: Contradiction Detection
+
+Used in Finding #25.
+
+## Setup
+
+- Single document (full text)
+- Same prompt to all models
+- No tools, no project context beyond the document
+
+## Prompt
+
+```
+You are analyzing a design document for CONTRADICTIONS — places where
+the document makes two claims that cannot both be true simultaneously.
+
+This is NOT about:
+- Missing information
+- Unclear writing
+- Design tradeoffs
+- Things that MIGHT conflict
+
+This IS about:
+- Statement A says X, Statement B says NOT-X
+- Mechanism A requires condition C, Mechanism B prevents condition C
+- Rule A applies to set S, but S includes elements that violate Rule A
+
+## Categories:
+
+1. **Direct contradictions** — Two statements that are logically incompatible
+2. **Mechanism conflicts** — Two described mechanisms that cannot coexist
+3.
+   **Scope violations** — A rule/invariant that is violated by a specific
+   case described elsewhere in the document
+4. **Temporal impossibilities** — A sequence that requires something to be
+   true before the described mechanism makes it true
+
+## For each contradiction:
+
+- **Category:** (one of the 4 above)
+- **Statement A:** (exact text, with section)
+- **Statement B:** (exact text, with section)
+- **Why contradictory:** (formal reasoning about incompatibility)
+- **Severity:** Critical (system correctness) / High (safety) / Medium (confusion)
+
+Be PRECISE. Only report genuine logical contradictions, not differences
+in emphasis or scope.
+
+## Document:
+
+[FULL TEXT OF DOCUMENT]
+```
+
+## Key Design Decision
+
+The "Be PRECISE" instruction and the explicit exclusion list ("NOT about")
+are critical. Without them, models pad findings with style/clarity issues.
+The contradiction prompt naturally favors Opus (self-correcting, withdraws
+false positives) over GPT-5 (exhaustive, includes borderline cases).
diff --git a/prompts/cross-document-consistency.md b/prompts/cross-document-consistency.md
new file mode 100644
index 0000000..f8c166e
--- /dev/null
+++ b/prompts/cross-document-consistency.md
@@ -0,0 +1,80 @@
+# Prompt: Cross-Document Consistency Analysis
+
+Used in Finding #28.
+
+## Setup
+
+- Two documents provided as full text in a single prompt (~25KB total)
+- Document A: `system-overview.md` (323 lines, narrative overview)
+- Document B: `architecture.md` (213 lines, DDD-focused)
+- No tools, no project context beyond the two documents
+- Same prompt to all 3 models independently
+
+## Prompt
+
+```
+You are analyzing two architecture documents that describe the SAME system.
+Your task is to identify places where these documents CONTRADICT each other
+— not where they differ in scope or detail level, but where they make
+incompatible claims about the same concept.
+
+## Categories of inconsistency to check:
+
+1.
+   **Terminology conflicts** — Same concept called different names in ways
+   that imply different meanings (not just abbreviation)
+2. **Structural contradictions** — Documents disagree about what is inside
+   vs outside a component boundary
+3. **Flow/sequence conflicts** — Documents describe incompatible orderings
+   or data flows for the same process
+4. **Ownership/authority conflicts** — Documents disagree about which
+   component owns, writes, or is authoritative for a concept
+5. **Philosophical contradictions** — Documents state incompatible
+   foundational assumptions (e.g., event sourcing vs CRUD)
+
+## What to EXCLUDE:
+
+- Omissions (one doc covers something the other doesn't)
+- Detail-level differences (one is more detailed than the other)
+- Naming differences that are clearly just abbreviations
+- Scope differences (one covers more topics)
+
+## Output format per finding:
+
+For each inconsistency found:
+- **Category:** (one of the 5 above)
+- **Severity:** Critical / High / Medium
+- **Document A says:** (exact quote or precise paraphrase with section ref)
+- **Document B says:** (exact quote or precise paraphrase with section ref)
+- **Why these are incompatible:** (explain why both cannot be correct)
+- **Impact:** (what would go wrong if an implementer followed both)
+
+## Document A: [system-overview.md]
+
+[FULL TEXT OF DOCUMENT A]
+
+## Document B: [architecture.md]
+
+[FULL TEXT OF DOCUMENT B]
+```
+
+## Key Design Decisions
+
+1. **Explicit exclusion of omissions** — prevents models from padding
+   findings with "Doc A mentions X but Doc B doesn't"
+2. **Five specific categories** — focuses attention without being
+   so restrictive that models miss novel inconsistency types
+3. **Required "why incompatible" explanation** — forces models to reason
+   about WHY differences matter, not just list differences
+4. **Impact field** — grounds findings in practical consequences
+5.
+   **Both documents in single prompt** — enables cross-referencing
+   without tool calls or context fragmentation
+
+## Results
+
+| Model | Time | Findings | Tokens/finding |
+|-------|------|----------|----------------|
+| Opus | 52s | 7 | 336 |
+| GPT-5 | 125s | 6 | 2,967 |
+| Sonnet | 14s | 4 | 194 |
+
+Opus recommended for this task type.
diff --git a/prompts/design-coherence.md b/prompts/design-coherence.md
new file mode 100644
index 0000000..09d4cb7
--- /dev/null
+++ b/prompts/design-coherence.md
@@ -0,0 +1,71 @@
+# Prompt: Design Coherence Analysis
+
+Used in Findings #15, #27.
+
+## Setup
+
+- Single document provided as full text
+- No tools, no project context beyond the document
+- Same prompt to all models independently
+
+## Prompt
+
+```
+You are analyzing a single design document for INTERNAL incoherence —
+places where the document contradicts itself. The document states
+principles, invariants, or guarantees in one place, then describes
+mechanisms that violate those guarantees elsewhere.
+
+## Categories of incoherence to check:
+
+1. **Safety properties not enforced** — Document claims a safety property
+   (e.g., "fail-closed") but the described mechanism has a path that
+   violates it
+2. **State machine violations** — Declared states/transitions don't match
+   the described behavior (missing transitions, unreachable states,
+   states with no exit)
+3. **Recovery contradictions** — Recovery mechanism assumes preconditions
+   that the failure scenario explicitly invalidates
+4. **Supervision conflicts** — Supervision strategy contradicts the
+   independence/coupling claims about the supervised processes
+5.
+   **Cross-mechanism contradictions** — Two different sections describe
+   incompatible behaviors for the same scenario
+
+## What to EXCLUDE:
+
+- Missing features (things the document doesn't cover)
+- Design tradeoffs that are explicitly acknowledged
+- Future work items marked as such
+
+## Output format per finding:
+
+- **Category:** (one of the 5 above)
+- **Severity:** Critical / High / Medium
+- **Section A says:** (exact quote with section reference)
+- **Section B says:** (exact quote with section reference)
+- **The incoherence:** (explain the contradiction)
+- **Why it matters:** (what would break in implementation)
+
+## Document:
+
+[FULL TEXT OF DOCUMENT]
+```
+
+## Results (Finding #15: failure-modes.md, 383 lines)
+
+| Model | Time | Findings |
+|-------|------|----------|
+| Sonnet 4.6 | 39s | 5 |
+| Opus 4.6 | 105s | 7 (8 attempted, 1 self-withdrawn) |
+| GPT-5 | 120s | 4 |
+
+## Results (Finding #27: risk-controls.md, 992 lines)
+
+| Model | Time | Findings |
+|-------|------|----------|
+| Sonnet 4.6 | 31s | 4 |
+| Opus 4.6 | 86s | 5 |
+| GPT-5 | 112s | 6 |
+
+Key insight: results are document-dependent. Opus won on the shorter doc,
+GPT-5 won on the longer, more complex one.
diff --git a/prompts/gap-finding.md b/prompts/gap-finding.md
new file mode 100644
index 0000000..cbaf053
--- /dev/null
+++ b/prompts/gap-finding.md
@@ -0,0 +1,47 @@
+# Prompt: Gap-Finding in Architecture Documents
+
+Used in Finding #9.
+
+## Setup
+
+- Single document (full text, no truncation)
+- Same focused analytical question to all models
+- No tools, no project context beyond the document
+- Temperature 0.3 for GPT-4.1/Mini, default for GPT-5
+
+## Prompt
+
+```
+You are a systems reliability engineer reviewing a failure modes document
+for a trading platform. Your task is to identify MISSING failure scenarios
+— things that COULD go wrong in this architecture but are NOT covered in
+the document.
+
+Focus on:
+1.
+   Scenarios specific to THIS architecture (not generic "server could crash")
+2. Interactions between components that could produce unexpected states
+3. External dependency failures not covered
+4. Timing/ordering issues in the described sequences
+5. Recovery procedures that have gaps
+
+For each missing scenario:
+- **Scenario:** What goes wrong
+- **Why it's specific to this system:** Why generic monitoring wouldn't catch it
+- **Impact:** What state the system ends up in
+- **Why the document misses it:** What assumption makes this invisible
+
+## Document:
+
+[FULL TEXT OF failure-modes.md, 383 lines]
+```
+
+## Results
+
+| Model | Time | Output tokens | Reasoning tokens | Scenarios found |
+|-------|------|---------------|------------------|-----------------|
+| GPT-4.1 Mini | 16s | 2,003 | 0 | 10 |
+| GPT-4.1 | 24s | 2,575 | 0 | 15 |
+| GPT-5 | 45s | 8,565 | 6,656 | 14 |
+
+GPT-5 found the most domain-specific and actionable gaps despite finding
+fewer total scenarios than GPT-4.1. Quality > quantity.
diff --git a/prompts/hidden-assumptions.md b/prompts/hidden-assumptions.md
new file mode 100644
index 0000000..ff84bbc
--- /dev/null
+++ b/prompts/hidden-assumptions.md
@@ -0,0 +1,53 @@
+# Prompt: Hidden Assumption Identification
+
+Used in Findings #10, #11, #12.
+
+## Setup
+
+- Single document (full text)
+- Same prompt to all models
+- No tools, no project context beyond the document
+- Temperature 0.3 for non-reasoning models
+
+## Prompt
+
+```
+You are reviewing a system design document for hidden assumptions —
+things the design DEPENDS ON being true but does NOT explicitly state
+or validate.
+
+A hidden assumption is different from a design decision:
+- Design decision: "We use event sourcing" (explicit choice)
+- Hidden assumption: "Events will always be delivered in order"
+  (unstated dependency that could break)
+
+For each hidden assumption found:
+- **Assumption:** What the design implicitly depends on
+- **Where it's hidden:** Which mechanism relies on it (section reference)
+- **What breaks if violated:** Concrete failure mode
+- **Likelihood of violation:** In production, how likely is this to be
+  violated? (not in theory — in the real world with network partitions,
+  clock skew, operator error, etc.)
+
+Focus on assumptions that:
+1. Are NOT explicitly stated in the document
+2. COULD realistically be violated in production
+3. Would cause SILENT incorrect behavior (not loud crashes)
+4. Are specific to THIS architecture (not generic distributed systems concerns)
+
+## Document:
+
+[FULL TEXT OF DOCUMENT]
+```
+
+## Results (Finding #10: cold-start-and-recovery.md, 234 lines)
+
+| Model | Time | Output tokens | Reasoning tokens | Assumptions found |
+|-------|------|---------------|------------------|-------------------|
+| GPT-4.1 Mini | 25s | 3,090 | 0 | 12 |
+| GPT-4.1 | 77s | 2,751 | 0 | 14 |
+| GPT-5 | 78s | 2,649 | 4,096 | 26 |
+
+GPT-5 found roughly 2x as many assumptions (26 vs 12 and 14) AND they were
+qualitatively different: multi-component interaction assumptions that require
+reasoning about system-level behavior, not just local properties.
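The single-document prompt templates mark the document slot with a bracketed placeholder such as `[FULL TEXT OF DOCUMENT]` or `[FULL TEXT OF failure-modes.md, 383 lines]`. The repo does not say how the final prompt was assembled, so the following is only a sketch of one plausible approach: the function name `fill_prompt`, the regular expression, and the line-count check are all assumptions. It handles templates with exactly one placeholder, so the two-document cross-consistency template would need a variant.

```python
import re

# Matches slots such as "[FULL TEXT OF DOCUMENT]" or
# "[FULL TEXT OF failure-modes.md, 383 lines]" (pattern is an assumption).
PLACEHOLDER = re.compile(
    r"\[FULL TEXT OF (?P<label>[^\],]+)(?:, (?P<lines>\d+) lines)?\]"
)


def fill_prompt(template: str, document_text: str) -> str:
    """Replace the single document placeholder in `template` with the text."""
    matches = list(PLACEHOLDER.finditer(template))
    if len(matches) != 1:
        raise ValueError(f"expected exactly one placeholder, found {len(matches)}")
    match = matches[0]
    declared = match.group("lines")
    if declared is not None:
        # Guard against running the prompt on a different document version.
        actual = len(document_text.splitlines())
        if actual != int(declared):
            raise ValueError(
                f"placeholder declares {declared} lines, document has {actual}"
            )
    # Slice instead of re.sub so backslashes in the document stay literal.
    return template[: match.start()] + document_text + template[match.end():]


template = "## Document:\n\n[FULL TEXT OF example.md, 2 lines]\n"
print(fill_prompt(template, "first line\nsecond line"))
```

Checking the declared line count is a cheap way to tie a run back to the document version recorded in the finding; a commit hash would be a stronger check where one is available.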