diff --git a/README.md b/README.md index 5d5f38b..e7fe256 100644 --- a/README.md +++ b/README.md @@ -53,12 +53,15 @@ Each experiment: ## Repository Structure ``` -findings/ # Individual findings with full analysis - 01-different-models-different-things.md - 02-narrow-lens-vs-broad-review.md +findings/ # Individual findings with full analysis + README.md # Context and index + YYYY-MM-DD-NN-slug.md # One file per experiment + 2026-04-26-01-different-models-catch-different-things.md + 2026-04-26-07-emerging-role-assignments-pattern-not.md + 2026-05-03-07b-token-budget-matters-more-than.md # Duplicate #7 (suffix b) + 2026-05-03-15-design-coherence-analysis.md ... - 28-cross-document-consistency.md - 29-adversarial-manipulation.md + 2026-05-05-29-adversarial-manipulation-analysis-new-task.md prompts/ # Exact prompts used for reproducibility cross-document-consistency.md design-coherence.md @@ -69,6 +72,9 @@ open-questions.md # Unanswered questions for future experiments methodology.md # Full methodology notes ``` +Findings are named `YYYY-MM-DD-NN-slug.md` for chronological sorting. +Numbers are zero-padded (01–29). The duplicate finding #7 uses a `b` suffix. + ## Who We Are This research is conducted by [Rodin](https://gitea.weiker.me/rodin) (AI diff --git a/findings/2026-04-26-01-different-models-catch-different-things.md b/findings/2026-04-26-01-different-models-catch-different-things.md new file mode 100644 index 0000000..72472b7 --- /dev/null +++ b/findings/2026-04-26-01-different-models-catch-different-things.md @@ -0,0 +1,16 @@ +# Finding 1: Different models catch different things (confirmed) + +**Date:** 2026-04-26 +**Task:** PR reviews on DDD reference docs (~6,600 lines across 18 files) +**How we used them:** Both models got the same task via pr-review skill — +fetch diff, fetch full file content for changed files, review against PR +description and linked issue acceptance criteria. Rich context: full diff, +project CLAUDE.md conventions, issue body. Each reviewer ran independently +in its own sub-agent with its own Gitea token. No cross-pollination. + +- GPT-5 caught SUMMARY.md verdict mismatches (Commanded classification, + small teams classification) that Sonnet missed entirely (PR #375) +- Sonnet caught a broken cross-reference link first that GPT-5 missed (PR #378) +- **Takeaway:** Different blind spots are real. Neither model is strictly better + for analytical review — they complement each other. This is why we run two + independent reviewers from different model families. diff --git a/findings/2026-04-26-02-cheap-model-narrow-lens-expensive.md b/findings/2026-04-26-02-cheap-model-narrow-lens-expensive.md new file mode 100644 index 0000000..230e168 --- /dev/null +++ b/findings/2026-04-26-02-cheap-model-narrow-lens-expensive.md @@ -0,0 +1,18 @@ +# Finding 2: Cheap model + narrow lens > expensive model + broad review (one data point) + +**Date:** 2026-04-26 +**Task:** Check 12 rewritten hypotheses for directional bias +**How we used them:** +- Sonnet & GPT-5: full PR review context (diff, file content, issue, AC). + Broad mandate: "review this PR." Rich context but unfocused task. +- GPT-4.1 Mini: given ONLY the 12 hypothesis texts + one focused question: + "Do any of these hypotheses lead toward a predetermined conclusion?" + Minimal context, laser-focused task. No diff, no project docs, no issue. + +- Both Sonnet and GPT-5 approved the hypotheses as reviewers +- GPT-4.1 Mini found ALL 12 pushed toward predetermined conclusions +- Words like "requires," "necessary," "must be" were flagged as directional +- **Takeaway:** Task framing mattered more than model size. Rich context + + broad mandate = missed the forest for the trees. Minimal context + precise + question = found exactly what mattered. This needs more testing — was it + the narrow framing, the lack of surrounding context, or both? diff --git a/findings/2026-04-26-03-gpt5-times-out-on-complex.md b/findings/2026-04-26-03-gpt5-times-out-on-complex.md new file mode 100644 index 0000000..30a0a4c --- /dev/null +++ b/findings/2026-04-26-03-gpt5-times-out-on-complex.md @@ -0,0 +1,15 @@ +# Finding 3: GPT-5 times out on complex multi-step analytical tasks (confirmed pattern) + +**Date:** 2026-04-26 +**Task:** Full PR review of #382 (research document rewrite) +**How we used it:** pr-review skill — multi-phase (fetch diff, fetch files, +check CI, analyze against AC, post inline comments, post summary). 7 phases, +many curl calls to Gitea API, large diff context. Heavy tool-use workflow +through SAP proxy (adds latency vs direct API). 300s timeout. + +- Timed out 3 times at 300s (17, 6, 6 tool calls respectively) +- Bottleneck was model processing time, not network (~0.3s Gitea API latency) +- **Takeaway:** Break analytical tasks into focused bounded pieces. Twelve + small deep reviews > one rushed big one. The issue isn't GPT-5's analysis + quality — it's that multi-phase tool-heavy workflows burn too much time + on mechanics. Separate the data gathering from the analysis. diff --git a/findings/2026-04-26-04-gpt5-defaults-to-delegation-claude.md b/findings/2026-04-26-04-gpt5-defaults-to-delegation-claude.md new file mode 100644 index 0000000..cc2beb7 --- /dev/null +++ b/findings/2026-04-26-04-gpt5-defaults-to-delegation-claude.md @@ -0,0 +1,18 @@ +# Finding 4: GPT-5 defaults to delegation; Claude defaults to doing the work + +**Date:** 2026-04-26 +**Task:** PR review delegation to sub-agents +**How we used them:** Both spawned as sub-agents from main session with +same task description, same pr-review skill file, same Gitea credentials. +Difference: GPT-5 got model override to gpt5, Sonnet used default model. +Both got full skill instructions. + +- GPT-5 first attempt: spawned sub-sub-agents and timed out +- GPT-5 with "do it yourself, no sub-agents" + step-by-step: worked +- Even with constraints, GPT-5 sometimes dumps raw tool output instead of + synthesizing — needs explicit output format instructions +- Claude (Sonnet/Opus) given the same kind of task does the work directly +- **Takeaway:** GPT interprets complex task descriptions as delegation + opportunities. Claude interprets them as work to do. For GPT: explicit + single-actor instructions + output format. For Claude: can give broader + mandate. Same skill file, very different behavior. diff --git a/findings/2026-04-26-05-sonnet-is-fast-and-catches.md b/findings/2026-04-26-05-sonnet-is-fast-and-catches.md new file mode 100644 index 0000000..3d94a74 --- /dev/null +++ b/findings/2026-04-26-05-sonnet-is-fast-and-catches.md @@ -0,0 +1,17 @@ +# Finding 5: Sonnet is fast and catches structural issues; GPT-5 is slow and catches semantic issues + +**Date:** 2026-04-26 +**Task:** Dual review across PRs #372, #375, #378, #380, #382 +**How we used them:** Same pr-review skill, same context (diff + files + +issue + AC), same sub-agent pattern. Only variable: model. Both got rich +context. Both ran the full 7-phase review skill. + +- Sonnet consistently finishes first, catches formatting, broken links, + structural problems (missing sections, dangling refs) +- GPT-5 takes longer, catches meaning-level problems (verdict mismatches, + classification inconsistencies, logical gaps) +- **Takeaway:** With identical rich context and identical instructions, the + models naturally gravitate to different things. Sonnet is the structural + reviewer; GPT-5 is the semantic reviewer. Both roles matter. Question: + would Sonnet catch semantic issues if given a narrower "check for logical + consistency" framing instead of broad review? diff --git a/findings/2026-04-26-06-single-agent-cant-handle-1000.md b/findings/2026-04-26-06-single-agent-cant-handle-1000.md new file mode 100644 index 0000000..6cc9df4 --- /dev/null +++ b/findings/2026-04-26-06-single-agent-cant-handle-1000.md @@ -0,0 +1,20 @@ +# Finding 6: Single agent can't handle 1000+ line document generation (confirmed pattern) + +**Date:** 2026-04-26 +**Task:** DDD v2 forge analysis drafting +**How we used them:** Single Sonnet/Opus sub-agents given full research +material (~3,874 lines of research notes) + outline + instructions to write +complete document. Very rich context (all research), very large output +requirement (1000+ lines). + +- Five single-agent attempts died (OOM, disconnect, timeout) trying to write + full documents +- Sectional approach (5 parallel Sonnet subagents, ~500-700 lines each) + succeeded immediately — each got same research but only their section's + outline +- Same pattern when Claude Code attempted full Part V rewrite — died +- Three agents × ~320 lines each worked first try +- **Takeaway:** This is a confirmed, repeatable limit for generation tasks. + Not model-specific — it's a context/output length problem. Rich input + context is fine; it's the output length that kills. Break output into + sections, keep input context rich, draft in parallel, assemble. diff --git a/findings/2026-04-26-07-emerging-role-assignments-pattern-not.md b/findings/2026-04-26-07-emerging-role-assignments-pattern-not.md new file mode 100644 index 0000000..aa8d4fb --- /dev/null +++ b/findings/2026-04-26-07-emerging-role-assignments-pattern-not.md @@ -0,0 +1,17 @@ +# Finding 7: Emerging role assignments (pattern, not conclusion) + +**Date:** 2026-04-26 (one day of intensive work — treat as hypothesis) + +- Opus (via Claude Code): complex generation needing deep project context. + Rich context: CLAUDE.md, full codebase access, design docs. Broad mandate. +- Sonnet: parallel volume work (5 subagents drafting simultaneously). + Rich context per section, constrained output scope. +- GPT-5: independent analytical review. Rich context (diff + files + issue). + Best when task is bounded and explicit. +- GPT-4.1 Mini: focused narrow analysis (bias detection). Minimal context, + precise question. Cheap and fast. +- **Takeaway:** The role assignment matters, but so does the context shape. + Opus gets broad context + broad mandate. Sonnet gets broad context + + narrow scope. GPT-5 gets rich context + explicit task. GPT-4.1 Mini gets + minimal context + laser question. We haven't tested swapping these + combinations — that's where the real learning will come from. diff --git a/findings/2026-04-27-08-bias-detection-all-models-catch.md b/findings/2026-04-27-08-bias-detection-all-models-catch.md new file mode 100644 index 0000000..24a2573 --- /dev/null +++ b/findings/2026-04-27-08-bias-detection-all-models-catch.md @@ -0,0 +1,58 @@ +# Finding 8: Bias detection: all models catch it with any framing — when the signal isn't buried + +**Date:** 2026-04-27 +**Task:** Detect directional bias in 8 deliberately biased hypotheses about +microservices vs monolith architecture for fintech startups. +**How we used them:** Created fresh test material (8 hypotheses with pro- +microservices bias via absolutes like "inevitably," "necessary," "must," +"requires," plus one factually inverted claim about consistency guarantees). +Ran 4 conditions in parallel sub-agents: + +| Condition | Model | Framing | Context | +|---|---|---|---| +| A | GPT-4.1 Mini | Narrow: "Do any lead toward a predetermined conclusion?" | Hypotheses only | +| B | Sonnet | Same narrow question | Hypotheses only | +| C | GPT-5 | Same narrow question | Hypotheses only | +| D | Sonnet | Broad: "Review quality, clarity, testability, and issues" | Hypotheses only | + +**Results:** +- **All 4 conditions detected 8/8 biased hypotheses.** No misses. +- All 3 narrow-framing models (Mini, Sonnet, GPT-5) produced structurally + similar output: per-hypothesis verdict, biasing words, neutral version, + severity assessment. +- All 3 narrow-framing models flagged H8's factual inversion (distributed + transactions DON'T provide stronger consistency than monolithic ACID). +- GPT-5 added specific counterexamples (LMAX Disruptor, Shopify, Stack + Overflow, Basecamp) — marginally richer analysis. +- Sonnet broad mandate also caught the bias — framed as one of three + "systemic problems" (deterministic language, pro-microservices framing + bias, underspecified constructs). Additionally provided testability and + operationalization analysis that the narrow framing didn't ask for. +- Sonnet broad took ~72s vs ~39s for narrow conditions (more output). + +**Takeaway:** When the biased text is the ONLY input (no surrounding noise), +all tested models — including the cheapest (GPT-4.1 Mini) — detect bias +regardless of whether the question is narrow or broad. This appears to +**contradict** original finding #2 ("cheap model + narrow lens > expensive +model + broad review"), but the key difference is context noise: + +- **Original experiment (2026-04-26):** Sonnet and GPT-5 missed bias during + FULL PR REVIEW with rich project context (diff, file content, issue text, + acceptance criteria, project conventions). The hypotheses were buried in + layers of review mechanics. +- **This experiment (2026-04-27):** Even the "broad" condition gave ONLY the + hypothesis text — no diff, no PR structure, no project context noise. + +**Refined hypothesis:** The original finding #2 was about **signal-to-noise +ratio**, not about model capability or framing precision. When biased text +is presented in isolation, any model catches it. When biased text is buried +in a large PR review with many other things to check, the bias signal gets +lost in the noise — unless you explicitly ask about it. The "narrow lens" +worked because it eliminated the noise, not because smaller models are +better at bias detection. + +**Next experiment to confirm:** Give a model the FULL PR review context +(diff, files, issue, AC) but add the narrow bias question as an explicit +review checklist item. If the model catches bias despite the rich context, +it confirms the signal-to-noise hypothesis. If it misses, it suggests +something else is at play (attention allocation, task switching cost). diff --git a/findings/2026-05-02-09-gapfinding-in-architecture-docs-gpt5.md b/findings/2026-05-02-09-gapfinding-in-architecture-docs-gpt5.md new file mode 100644 index 0000000..6dc3d2b --- /dev/null +++ b/findings/2026-05-02-09-gapfinding-in-architecture-docs-gpt5.md @@ -0,0 +1,77 @@ +# Finding 9: Gap-finding in architecture docs: GPT-5 finds domain-specific gaps, GPT-4.1 is generic, Mini is formulaic + +**Date:** 2026-05-02 +**Task:** Identify missing failure scenarios in gargoyle's `failure-modes.md` (383 lines) +**How we used them:** Same document (full text, no truncation) + same focused +analytical question to all 3 models via HAI proxy (OpenAI-compatible endpoint). +No tools, no project context beyond the document itself. Single prompt, no +conversation history. Temperature 0.3 for GPT-4.1/Mini, default (1.0) for GPT-5 +(required by the model). + +| Model | Time | Output tokens | Reasoning tokens | Scenarios found | +|---|---|---|---|---| +| GPT-4.1 Mini | 16s | 2,003 | 0 | 10 | +| GPT-4.1 | 24s | 2,575 | 0 | 15 | +| GPT-5 | 45s | 8,565 | 6,656 | 14 | + +**What they found — common ground (all 3 identified):** +- ETS table corruption/loss affecting gates +- BEAM scheduler starvation / GC pauses +- WebSocket message duplication/reordering +- Postgres connection pool exhaustion / deadlocks +- Clock skew / time drift +- Process registry inconsistency + +**GPT-5 unique findings (not in either other model):** +- Broker rate limiting (429s) — not "connection lost" so existing logic + doesn't trigger, but can't flatten during kill switch +- Broker auth failure / credential rotation — distinct from connection loss +- Corporate actions (splits, symbol changes) — position drift without + triggering staleness detection +- Duplicate pipeline instances for same user (DynamicSupervisor race) +- DB "commit unknown outcome" causing restart loops (Ecto commit succeeds + at Postgres but client times out → retry → unique constraint → crash loop) +- Cross-symbol strategies with partial staleness — multi-leg signals + computed from mix of fresh and stale data +- Partial cancel_all during kill switch masked by process restarts + +**GPT-4.1 unique findings (not in GPT-5 or Mini):** +- Zombie processes after halt (supervisor misconfiguration) +- Unsupervised Task crashes going unnoticed +- Audit log writes failing silently (not in same transaction as state change) +- ClOrdID unique constraint violation from race in sequence generation +- Broker API semantic changes (silent breaking changes) + +**GPT-4.1 Mini unique findings:** +- Race between kill switch engagement and reconciliation completion + (timing coordination gap) — this was more explicitly called out than + in the other models, though GPT-5 touches it implicitly +- Strategy.Worker / Aggregator partial crash inconsistency + +**Quality assessment:** +- **GPT-5** had the most *domain-relevant* and *actionable* gaps. Broker + rate limiting, auth failures, corporate actions, and the DB commit + unknown-outcome scenario are all realistic production issues specific + to THIS system. The cross-symbol partial staleness finding shows + deeper architectural reasoning about component interactions. +- **GPT-4.1** was thorough and well-structured but more generic/defensive. + Many of its unique findings (zombie processes, unsupervised Tasks, + audit log loss) are general Elixir concerns rather than specific to + the document's architecture. Good for a completeness checklist. +- **GPT-4.1 Mini** was formulaic — each finding followed the same template + and several were somewhat surface-level or restated things the document + partially covers. Still found the most scenarios per dollar. + +**Takeaway:** For gap-finding in architecture documents, GPT-5's reasoning +tokens pay off. It doesn't just list "things that could go wrong" — it +identifies *specific interactions* that the document's existing mechanisms +don't cover (e.g., rate limiting bypasses the "connection lost" detection, +corporate actions bypass staleness detection). GPT-4.1 is a solid +middle-ground: more thorough than Mini, less insightful than GPT-5. +Mini is fine for a quick sanity check but won't find the subtle gaps. + +**Cost-effectiveness:** Mini found 10 scenarios in 16s for ~7K tokens. +GPT-5 found 14 scenarios (with 7 genuinely unique insights) in 45s for +~13.5K tokens (including 6.6K reasoning). For architecture review where +missing a gap could mean financial loss, the GPT-5 cost is justified. +For routine doc review, Mini + human judgment is probably sufficient. diff --git a/findings/2026-05-02-10-hiddenassumption-identification-gpt5s-reasoning-produces.md b/findings/2026-05-02-10-hiddenassumption-identification-gpt5s-reasoning-produces.md new file mode 100644 index 0000000..0360f2a --- /dev/null +++ b/findings/2026-05-02-10-hiddenassumption-identification-gpt5s-reasoning-produces.md @@ -0,0 +1,98 @@ +# Finding 10: Hidden-assumption identification: GPT-5's reasoning produces qualitatively different (not just more) findings + +**Date:** 2026-05-02 +**Task:** Identify hidden assumptions in gargoyle's `cold-start-and-recovery.md` (234 lines) +that could break under real-world production conditions. +**How we used them:** Same document (full text) + same focused analytical question +to all 3 models via HAI proxy (OpenAI-compatible endpoint). No tools, no project +context beyond the document itself. Single prompt, no conversation history. +Temperature 0.3 for GPT-4.1/Mini; GPT-5 uses default (required). + +| Model | Time | Output tokens | Reasoning tokens | Assumptions found | +|---|---|---|---|---| +| GPT-4.1 Mini | 25s | 3,090 | 0 | 12 | +| GPT-4.1 | 77s | 2,751 | 0 | 14 | +| GPT-5 | 78s | 2,649 | 4,096 | 26 | + +**What they found — common ground (all 3 identified):** +- Broker API consistency/availability during reconciliation +- ETS table availability and fail-closed behavior +- Single-writer/mailbox ordering guarantees holding in practice +- User independence assumption vs shared resources (rate limits, DB) +- Reconciliation idempotency under repeated runs +- Corporate action data completeness/timeliness +- Escalation threshold calibration vs changing market conditions +- Strategy warmup with partial/missing historical data +- Signal expiry correctness on restart + +**GPT-5 unique findings (not in either other model):** +- Unbounded mailbox growth during extended reconciliation (memory pressure + from queued messages at market open) +- handle_continue side effects in OTHER processes (risk, metrics) acting + concurrently via different paths +- Pre-existing GTC orders filling while gated (positions as moving target) +- Broker position semantics mismatch (trade-date vs settled-date) +- Strategy warmup evaluate() having non-signal side effects (metrics, caches) +- Historical bar / live tick boundary alignment (double-processing or gaps) +- ETS gate caching in process state creating fail-open windows +- Correlated retry stampede when many users restart together +- Corporate action double-application race with broker (missing idempotency + keys per action/instrument/date) +- Kill switch state vs DB unavailability at startup +- Market data subscriptions as shared bottleneck across "independent" users +- Time-invariant signals incorrectly expired by aggregation window logic +- Broker fills vs positions endpoints internally inconsistent (different caches) +- Positions changing under reconciliation while kill switch is engaged +- Gate phase sequencing: :ready written before worker warmup completes +- Periodic reconciler allowing 1hr of divergent trading (rate-of-change blind) + +**GPT-4.1 unique findings (not in GPT-5 or Mini):** +- No correlated failure handling (all failure modes treated as isolated) — + only model to frame this as a meta-assumption about the failure table + +**GPT-4.1 Mini unique findings:** +- None that weren't also covered by the other two models + +**Quality assessment:** +- **GPT-5** didn't just find more assumptions — it found *qualitatively + different kinds*. Many of its unique findings involve multi-component + interactions (mailbox + reconciliation + market open timing), semantic + mismatches (trade-date vs settled positions), and second-order effects + (metrics side effects during warmup, GTC orders filling while gated). + These require reasoning about system behavior across boundaries the + document doesn't explicitly draw. +- **GPT-4.1** was competent and structured, found the same core assumptions + as Mini, plus one good meta-observation about correlated failures. But + it stayed within the document's own framing — it found assumptions the + document *almost* states rather than ones the document can't see. +- **GPT-4.1 Mini** was formulaic. Every finding maps cleanly to a section + of the document. It's essentially "what could go wrong with each stated + mechanism" rather than "what does this design take for granted about + the world outside itself." + +**Key insight — reasoning tokens change the KIND of analysis:** +GPT-5's 4,096 reasoning tokens aren't producing "more of the same" — +they're producing a different analytical mode. The non-reasoning models +(4.1 and Mini) identify risks within the document's own frame of reference. +GPT-5 reasons about the document's relationship to the external world: +broker semantics, deployment topology, OTP runtime behavior under load, +timing correlations across independent subsystems. This is the difference +between "what could this mechanism fail at" and "what must be true about +the world for this mechanism to work." + +**Comparison to Finding #9 (gap-finding on failure-modes.md):** +Same pattern confirmed. GPT-5 consistently finds domain-specific, +interaction-level issues that require reasoning about component boundaries. +GPT-4.1 is thorough within-frame. Mini is formulaic. The gap between +GPT-5 and the others is larger here than in #9 — possibly because +"hidden assumptions" requires more abstraction than "missing failure +scenarios." Assumption-finding requires the model to reason about what +ISN'T stated, which benefits more from extended reasoning. + +**Practical implication:** For architecture review, running GPT-5 on +"identify hidden assumptions" is higher-value than the same question to +non-reasoning models. The cost difference (4K extra reasoning tokens) is +trivial for a document that will drive months of implementation. Use +non-reasoning models for within-frame checks ("does this section have +gaps") and reasoning models for cross-boundary analysis ("what must be +true about the world for this to work"). diff --git a/findings/2026-05-02-11-hiddenassumption-identification-on-simpler-doc.md b/findings/2026-05-02-11-hiddenassumption-identification-on-simpler-doc.md new file mode 100644 index 0000000..31a189c --- /dev/null +++ b/findings/2026-05-02-11-hiddenassumption-identification-on-simpler-doc.md @@ -0,0 +1,124 @@ +# Finding 11: Hidden-assumption identification on simpler doc: reasoning models diverge in approach, both outperform non-reasoning + +**Date:** 2026-05-02 +**Task:** Identify hidden assumptions in gargoyle's `market-calendar.md` (238 lines) +— a simpler, single-component document vs the 234-line cold-start doc from Finding #10. +**How we used them:** Same document (full text) + same focused analytical question +to all 3 models via HAI proxy. No tools, no project context beyond the document +itself. Single prompt, no conversation history. Temperature 0.3 for GPT-4.1; +GPT-5 and Opus use their defaults (required). Same prompt across all three. + +| Model | Time | Output tokens | Reasoning tokens | Assumptions found | +|---|---|---|---|---| +| GPT-4.1 | 19s | 2,554 | 0 | 14 | +| Claude Opus 4.6 | 74s | 3,288 | (internal, not reported) | 13 | +| GPT-5 | 101s | 8,417 | 5,504 | 24 | + +**What they found — common ground (all 3 identified):** +- Alpaca calendar API data correctness/completeness as single source of truth +- Alpaca API availability at startup (no local cache persistence) +- ETS table atomicity during refresh (partial-state exposure risk) +- System clock/timezone alignment (dates are timezone-naive) +- NYSE emergency/unscheduled closures not reflected until refresh +- Two-year cache range sufficiency +- API response format stability +- Rate limiting / API capacity concerns + +**GPT-5 unique findings (not in either other model):** +- Date struct term-ordering in ETS match specs may not match chronological + order (ETS range guards rely on Erlang term comparison, not Date semantics) +- close_time/1 returns naive Time without timezone — DST conversion burden on + consumers, one hour off twice per year +- trading_day?/1 conflates "not a trading day" with "calendar unavailable" — + operational outages invisible to callers +- ETS table name collision risk (global namespace per node) +- No other process should modify the ETS table (access mode discipline) +- Network egress and credential availability on all nodes at all times +- ETS read/write concurrency flags for contention under load +- Direct ETS access by consumers bypassing the module's error handling +- next/prev_trading_day edge cases at cache boundaries +- Alpaca API start/end parameter inclusivity (off-by-one at year boundaries) +- Half-day vs full-day distinction insufficiency for special sessions +- Small table size makes O(n) selects acceptable (scaling concern) +- Year-end refresh failure leaving gaps at boundary +- Alpaca never omits a legitimate trading day (absence = non-trading conflation) + +**Claude Opus unique findings (not in either other model):** +- ETS ownership semantics: heir-protection would change fail-closed behavior; + current design means ALL consumers fail simultaneously during crash-to-restart + window (framed as a design tension, not just a risk) +- Silent data corruption from partial API response (pagination/truncation) — + specifically that missing rows are SILENT failures with no error propagation + (other models mentioned API completeness but not the silence aspect) +- Consumers calling functions with Dates, not DateTimes — the API accepts Date.t() + but doesn't specify HOW consumers should derive "today" (system-wide + coordination problem made invisible by the API contract) +- `trading_day?/1` returning false is NOT fail-closed for ALL consumers — only + for PDT-like "block action" consumers; for batch-trigger consumers it's + fail-OPEN (subtle inversion of safety semantics) +- Startup ordering: background_children placement means PDT could receive orders + before MarketCalendar finishes init, creating recurring rejection windows + during hot deploys +- Continuous-running assumption for refresh timer (daily restarts would mean + refresh mechanism never fires — no staleness alert exists) + +**GPT-4.1 unique findings (not in either other model):** +- No need for real-time calendar change notification (event emission gap) +- All consumers using the same module instance (configuration consistency) +- No need for historical calendar data (audit/backtesting limitation) +- Consumers correctly handling {:error, :calendar_unavailable} in practice + +**Quality assessment:** +- **GPT-5** found the most assumptions (24) with the most technical specificity. + Many are implementation-level insights (ETS term ordering, named table + collisions, read_concurrency flags) that demonstrate deep Erlang/OTP + knowledge. Some are slightly obvious or overlapping. The ETS term-ordering + finding is genuinely insightful — Date structs DO compare correctly in Erlang + term order (year > month > day fields), but questioning it shows depth of + reasoning about underlying mechanisms. Also provided concrete recommendations. +- **Claude Opus** found fewer assumptions (13) but several were qualitatively + different — they identified *design tensions* and *semantic inversions* + rather than just failure scenarios. The fail-open/fail-closed inversion + (finding #12), the ETS ownership tension, and the "API makes timezone + coordination invisible" findings show reasoning about the design's + *relationship to its consumers* rather than just its internal mechanics. + Tighter, more curated output with less filler. +- **GPT-4.1** was competent and well-structured (14 assumptions, clean table) + but stayed within the document's own framing. Its unique findings are + relatively generic ("consumers should handle errors correctly," "no + historical data"). Solid baseline, no surprises. + +**Key insight — two reasoning models, different analytical styles:** +GPT-5 and Opus are both reasoning models, but they reason about different +things. GPT-5 reasons DEEPER into implementation mechanics (how does ETS +actually work? what are the exact failure modes of each component?). Opus +reasons WIDER about system context (how does this component's API contract +affect the safety properties of the overall system? what tensions does this +design create that aren't visible to the author?). + +GPT-5's approach: "Here are 24 things that could go wrong, many highly +technical." Opus's approach: "Here are 13 assumptions, several of which +reveal design tensions the document can't see about itself." + +**Does the reasoning gap narrow with simpler docs?** +Comparing to Finding #10 (cold-start doc, 234 lines, 26 vs 14 vs 12 assumptions +for GPT-5/GPT-4.1/Mini): +- GPT-5 still dominates in raw count (24 vs 14 for GPT-4.1) +- The gap ratio is similar (~1.7x here vs ~1.9x in Finding #10) +- Document complexity doesn't appear to be the driver of the gap — + reasoning tokens enable more exhaustive exploration regardless of + input complexity + +**Claude Opus vs GPT-5 (the headline comparison):** +They're not competing on the same axis. GPT-5 is better for "find all +possible issues" (breadth + technical depth). Opus is better for "find +the assumptions that will actually surprise the author" (insight density). +If you want a security-audit-style exhaustive list: GPT-5. If you want a +design-review-style "here's what you're not seeing about your own design": +Opus. Both are better than GPT-4.1 for this task, but in different ways. + +**Practical implication:** Run BOTH reasoning models on architecture docs. +GPT-5 catches implementation-level hazards the team might miss during +coding. Opus catches design-level tensions the team might miss during +planning. GPT-4.1 is sufficient as a quick sanity check but won't +surprise you. diff --git a/findings/2026-05-02-12-sonnet-46-outperforms-expectations-on.md b/findings/2026-05-02-12-sonnet-46-outperforms-expectations-on.md new file mode 100644 index 0000000..9c03078 --- /dev/null +++ b/findings/2026-05-02-12-sonnet-46-outperforms-expectations-on.md @@ -0,0 +1,125 @@ +# Finding 12: Sonnet 4.6 outperforms expectations on assumption-finding; competes with reasoning models on complex docs + +**Date:** 2026-05-02 +**Task:** Identify hidden assumptions in gargoyle's `order-execution.md` (785 lines) +— a complex, multi-component document covering OrderManager, BrokerAdapter, +TradeStream, and PositionReconciler. +**How we used them:** Same document (full text, no truncation) + same focused +analytical question to all 3 models. GPT-5 via HAI OpenAI endpoint; Opus 4.6 +and Sonnet 4.6 via HAI Anthropic endpoint. No tools, no project context beyond +the document itself. Single prompt, no conversation history. + +| Model | Time | Output tokens | Reasoning tokens | Assumptions found | +|---|---|---|---|---| +| GPT-5 | 93s | 8,485 | 6,016 | 20 | +| Claude Sonnet 4.6 | 106s | 4,637 | (internal) | 17 | +| Claude Opus 4.6 | 105s | 4,615 | (internal) | 12 | + +**What they found — common ground (all 3 identified):** +- Synchronous broker REST calls blocking OrderManager GenServer (mailbox growth) +- TradeStream event ordering assumptions (out-of-order fills/status) +- Fill deduplication gap (no explicit fill-level idempotency) +- `cancel_all/1` with `timeout: :infinity` blocking GenServer during FLATTEN +- Recovery/restart races with TradeStream fill delivery (fills queued during + `handle_continue/2`) +- Lot operation idempotency under crash recovery (partial execution) +- Replace race: fills for new broker_order_id arriving before `replaced` event +- Database write latency impact on GenServer throughput under burst fills +- ETS table scope assumptions (single-node, access mode) + +**GPT-5 unique findings (not in either Claude model):** +- Rate-limit retry blocking OrderManager inline (no async retry path specified) +- Single TradeStream connection per user not enforced (duplicate detection gap) +- Kill switch FLATTEN vs degraded state interaction (OM drops cancels while + degraded, but FLATTEN calls cancel_all through OM) +- ClOrdID uniqueness scope/retention at broker across sessions and days +- `after: datetime` filter semantics (clock skew, timezone, inclusive/exclusive) +- Reconciliation responses may exceed single-response size (no pagination) +- Event broadcasting blocking model (synchronous vs fire-and-forget) +- Credential rotation during TradeStream connection lifetime +- `market_closed` semantics varying across brokers (reject vs queue) +- Dropped Alpaca statuses (stopped/suspended/calculated) may affect accounting + +**Claude Sonnet 4.6 unique findings (not in either other model):** +- Single fill per fill event assumption (broker batching multiple fills into + one WebSocket message) +- Lot operations (`Lots.open/2`, `Lots.close/4`) assumed to never fail — + no `{:error, _}` handling shown, crash propagation risk +- `Task.async_stream` inside GenServer creating linked tasks whose crash + signals propagate to OrderManager during critical cancel_all +- Broker cancel semantics during in-flight replace at the broker level + (cancel targets old broker_order_id which broker already replaced away) +- Database operations in fill processing assumed transactional (no explicit + Ecto.Multi/transaction mention) +- Broker position reflects only Gargoyle's activity (external trades cause + false-positive reconciliation halts) + +**Claude Opus 4.6 unique findings (not in either other model):** +- `{:ok, broker_order_id}` from REST place conflated with durable OMS + acceptance vs mere HTTP acknowledgment (no timeout on `submitted` state) +- Concurrent `apply_corrections/2` from periodic reconciler running in + separate process conflicts with OrderManager's single-writer invariant + (corrections write to same tables outside GenServer serialization) +- Reconciliation gate initialized state after `:rest_for_one` restart — + ETS table EXISTS but freshly initialized vs table MISSING are different + conditions with different safety properties +- Escalation state reset after crash creating double-exposure window + (systematic issue persists but escalation timer resets to zero) +- `replace/3` error semantics: non-atomic replace (cancel + re-submit) + where cancel succeeds but re-submit fails leaves original order cancelled + at broker while OrderManager reverts to "working" locally + +**Quality assessment:** +- **GPT-5** maintained its pattern from previous findings: broadest coverage + (20 assumptions), most technically specific about implementation details. + Found cross-cutting operational concerns (clock skew, credential rotation, + pagination) that the Claude models didn't surface. However, several of its + findings were medium-severity operational concerns rather than architectural + assumptions. +- **Claude Sonnet 4.6** was the surprise performer. Found 17 assumptions — + close to GPT-5's count (85%) — and several of its unique findings were + genuinely insightful. The `cancel_all` race with broker-side replace state + (finding #16) and the lot operation failure propagation (finding #6) show + deep reasoning about component interaction despite Sonnet not being + positioned as a "reasoning" model. More importantly, Sonnet's findings were + consistently well-structured with clear "how it could break" scenarios. +- **Claude Opus 4.6** found the fewest assumptions (12) but — consistent with + Finding #11 — its unique findings were qualitatively different. The + concurrent `apply_corrections` write conflict, the gate initialization state + distinction, and the non-atomic replace error semantics all reveal design + tensions that neither GPT-5 nor Sonnet identified. Opus continues to reason + about the *boundaries between components* rather than within-component + mechanics. + +**Key insight — Sonnet 4.6 is NOT just a faster GPT-4.1:** +In previous findings (#9, #10, #11), non-reasoning models (GPT-4.1, GPT-4.1 +Mini) performed significantly below reasoning models on assumption-finding. +GPT-4.1 found ~14 assumptions where GPT-5 found 24-26. Here, Sonnet 4.6 +finds 17 where GPT-5 finds 20 — a much smaller gap (~85% vs ~58% previously). + +Sonnet's findings also included several that showed genuine reasoning about +component interactions (not just within-frame risks). This suggests Sonnet 4.6 +is qualitatively different from GPT-4.1 for analytical work — it occupies a +middle ground between GPT-4.1's "competent but surface-level" and GPT-5's +"exhaustive and deep." The severity distribution was also similar to GPT-5 +(multiple critical/high findings), whereas GPT-4.1 in previous experiments +tended toward medium-severity generic concerns. + +**Updated model hierarchy for assumption-finding:** +1. GPT-5 — broadest coverage, most operational-level findings (20) +2. Sonnet 4.6 — strong analytical depth, good component interaction reasoning (17) +3. Opus 4.6 — fewest but most architecturally insightful, finds design tensions (12) +4. GPT-4.1 — competent within-frame, generic (~14 from previous experiments) +5. GPT-4.1 Mini — formulaic, surface-level (~10-12) + +**Practical implication:** For architecture review, Sonnet 4.6 is now a strong +candidate for volume analytical work. It's fast enough to run alongside GPT-5 +and catches different things (lot operation failures, broker-side replace races). +The ideal three-model review stack for architecture docs appears to be: +- GPT-5 for breadth + operational concerns +- Sonnet 4.6 for component interaction analysis +- Opus 4.6 for design-tension identification + +Each consistently finds things the others miss. The cost-efficiency argument +for Sonnet is strong: ~85% of GPT-5's count with more actionable findings +per token generated (4,637 vs 8,485 tokens for 17 vs 20 assumptions). diff --git a/findings/2026-05-03-07b-token-budget-matters-more-than.md b/findings/2026-05-03-07b-token-budget-matters-more-than.md new file mode 100644 index 0000000..f74e08d --- /dev/null +++ b/findings/2026-05-03-07b-token-budget-matters-more-than.md @@ -0,0 +1,46 @@ +# Finding 7: Token budget matters more than model size for gap analysis (confirmed) + +**Date:** 2026-05-03 +**Task:** Identify unaddressed failure scenarios in gargoyle's `failure-modes.md` (383 lines, ~25KB) +**How we used them:** Same document, same analytical question ("What failure scenarios +are NOT covered?"), three models. GPT-5 with 16K max_completion_tokens, Sonnet 4 +with 4K max_tokens, GPT-4.1 Mini with 4K max_completion_tokens. No project context +beyond the document itself. Pure gap-analysis task. + +**Results:** +- GPT-5 (16K budget): 28 gaps, most exhaustive. Found domain-specific edge cases + others missed entirely: ClOrdID collision across restarts, fractional share rounding, + broker maintenance windows (410/426), hot code upgrades, regulatory halts vs staleness + distinction, cancel-ack-then-late-fill race, DNS/TLS as distinct from network outage. +- Sonnet 4 (4K budget): 12 gaps, concise. Unique framing: distinguished latency + degradation from outage (subtle but actionable). ETS corruption vs loss. +- GPT-4.1 Mini (4K budget): 13 gaps with summary table. Unique angles: unknown broker + status enum values, configuration schema mismatches on cold-start, malformed signals + from logic bugs (not just crashes). + +**Overlap (all three):** Rate limiting, clock skew, resource exhaustion, DB failures, +message backpressure, partial connectivity. + +**Key insight:** GPT-5's 4K attempt produced ZERO output (finish_reason: length) — +all tokens consumed by internal reasoning. At 16K it produced the richest analysis. +This confirms finding #3 (GPT-5 needs generous token budgets) AND adds a new +observation: for open-ended analytical questions, GPT-5's reasoning overhead is +proportionally larger. The 4K models (Sonnet, Mini) both produced useful output at +4K because they don't burn tokens on chain-of-thought. + +**Model personality confirmed:** +- GPT-5: exhaustive, domain-aware, finds edge cases a senior SRE would know +- Sonnet: precise, architectural, finds design-level distinctions +- GPT-4.1 Mini: structured, systematic, finds enumeration gaps + +**Practical implication:** For failure mode / gap analysis on design docs: +- GPT-5 with ≥16K tokens for maximum coverage (most unique findings) +- Sonnet for architectural framing ("this is really two different problems") +- Mini for completeness checking ("what about this enum value?") +- Running all three costs ~$0.50 and catches gaps none alone would find +- GPT-5 at 4K is USELESS for this task — always give it room to think + +**Note on GPT-5 reasoning overhead:** First attempt at 4K max_completion_tokens +returned empty content with finish_reason: length. The model spent all 4K tokens +on internal reasoning and produced nothing. This is worse than a short answer — +it's zero value for non-zero cost. Always budget ≥16K for GPT-5 analytical tasks. diff --git a/findings/2026-05-03-13-race-condition-identification-opus-excels.md b/findings/2026-05-03-13-race-condition-identification-opus-excels.md new file mode 100644 index 0000000..ba006a3 --- /dev/null +++ b/findings/2026-05-03-13-race-condition-identification-opus-excels.md @@ -0,0 +1,126 @@ +# Finding 13: Race condition identification: Opus excels at subtle temporal interactions; GPT-5 is exhaustive; Sonnet struggles with concurrency reasoning + +**Date:** 2026-05-03 +**Task:** Identify race conditions, timing-dependent bugs, and ordering hazards in +gargoyle's `concurrent-failure-detection.md` (241 lines) — a document specifically +about concurrent detection logic with timers, ETS state, and multi-process events. +**How we used them:** Same document (full text) + same focused analytical question +to all 3 models via HAI proxy. Prompt specifically asked for event ordering problems, +timer interaction bugs, state visibility gaps, crash/restart timing, and multi-instance +coordination. Required each finding to reference specific mechanisms in the document +with specific interleaving descriptions. No tools, no project context beyond the +document itself. + +| Model | Time | Output tokens | Reasoning tokens | Race conditions found | +|---|---|---|---|---| +| GPT-5 | 116s | 10,587 | 8,192 | 12 | +| Claude Opus 4.6 | ~105s | 4,610 | (internal) | 10 | +| Claude Sonnet 4.6 | ~39s | 1,404 | (internal) | 7 | + +**What they found — common ground (all 3 identified):** +- Stale timer messages in mailbox after cancellation (classic Erlang timer race) +- HealthMonitor crash losing compound detection state (init from :unknown, no replay) +- ETS vs GenServer state divergence visible to dashboard +- Kill switch mode conflict (FLATTEN from single-process vs HALT from compound path) + +**GPT-5 unique findings (not in either Claude model):** +- Cross-sender message ordering: recovery events from pipeline processes vs timer + expiry from runtime (Erlang preserves per-sender order, NOT cross-sender) — the + "rapid recovery" safety argument in the doc relies on state being updated before + timer fires, which isn't guaranteed +- Debounce starvation: flapping component repeatedly restarting the timer, causing + compound evaluation to be indefinitely postponed while ≥2 genuinely degraded +- State regression: {:degraded} arriving after {:escalated, :kill_switch} with no + guard in the event table — state machine allows regressing from :halted to :degraded +- Cold-start window: application boots with existing degraded processes that won't + re-emit events, compound detection never fires +- Catch-all handle_info could accidentally swallow timer messages if pattern matching + is ordered wrong (implementation pitfall of the described approach) +- Debounce window growing beyond calibrated bounds from repeated timer restarts + +**Claude Opus unique findings (not in either other model):** +- Timer restart pushing evaluation PAST single-process escalation timeout — the + debounce mechanism can DEFEAT compound detection when second degradation arrives + near end of first window (resets to full window, first process escalates via + single-process path before new window fires). This means system gets FLATTEN + instead of HALT — exactly what compound detection was supposed to prevent. +- Strategy worker single-atom masking via event ordering: Worker A degrades, Worker + B degrades (same atom), Worker A recovers → atom set to :normal while B is still + degraded. Event ordering across different workers mapped to same atom creates + state loss. +- Registry stale PID after HealthMonitor crash: if subscription is PID-based (not + PubSub topic), new HM instance is deaf — events go to dead PID, silently dropped. + Compound detection completely disabled for that user until subscription refresh. +- :rest_for_one cascade + coincidental independent issue: debounce designed to + filter cascade restarts, but cascade can EXPOSE a pre-existing issue (SignalRisk + restarts and finds its upstream stale). Is this "compound" or "cascade + bad luck"? + Semantic ambiguity the design doesn't address. +- Compound cleared event without recovery debounce: :compound_degradation_cleared + emitted immediately when last process recovers (no settling period), causing + operator oscillation if recovery is transient. + +**Claude Sonnet unique findings:** +- ETS table creation race at startup (HealthMonitor writes before table exists) +- Registry lookup failure during pipeline startup (events before HM registered) +- However, Sonnet also made analytical errors: it described "multiple HealthMonitor + instances for the same user" scenarios despite the document clearly stating one + instance per user via DynamicSupervisor. Several of its findings assumed + multi-instance coordination that doesn't match the architecture. + +**Quality assessment:** +- **GPT-5** was the most exhaustive and technically precise. Its cross-sender + ordering finding (#2) is genuinely insightful — it identifies that the document's + "rapid recovery" safety argument implicitly assumes events arrive in wall-clock + order, which Erlang does NOT guarantee across different senders. The debounce + starvation finding (#3) identifies a real operational hazard with practical + consequences. All 12 findings reference specific mechanisms and describe specific + interleavings clearly. +- **Claude Opus** found fewer race conditions but several were qualitatively + superior. The timer-restart-defeats-compound-detection finding is the most + architecturally significant race in the entire analysis — it shows that the + debounce mechanism can work AGAINST the design's stated goals in specific + (realistic) timing scenarios. The strategy-worker event ordering masking is + also a genuine design flaw unique to the single-atom decision. Opus continues + its pattern of reasoning about design TENSIONS rather than just failure modes. +- **Claude Sonnet** was notably weaker here than in previous experiments. Only + 1,404 output tokens vs 4,610 (Opus) and 10,587 (GPT-5). Several findings + contained analytical errors (assuming multi-instance coordination that doesn't + exist). It found only 7 races, and 2-3 of those were based on misreadings of + the architecture. This is a significant regression from Finding #12 where + Sonnet found 17 assumptions (85% of GPT-5's count). + +**Key insight — concurrency reasoning is a different skill than assumption-finding:** +In previous experiments (#10, #11, #12), Sonnet 4.6 performed well on +assumption-finding (a task that requires reasoning about what's NOT stated). +Here, on race condition identification (a task requiring reasoning about temporal +interleavings and message ordering semantics), Sonnet drops significantly. This +suggests the task type matters more than we previously thought: + +- **Assumption-finding:** Requires breadth of consideration ("what must be true + for this to work?"). Sonnet handles this well — it's essentially pattern + matching across possible failure dimensions. +- **Race condition identification:** Requires SEQUENTIAL reasoning about specific + interleavings ("if A happens, then B happens, then C happens, what state is + visible?"). This benefits dramatically from extended reasoning tokens (GPT-5's + 8,192 reasoning tokens) or from Opus's internal reasoning depth. + +The lesson: don't extrapolate model performance across task types. A model that's +85% as good at assumption-finding may be 50% as good at concurrency analysis. +The cognitive demands are different. + +**Opus's distinguishing strength — finding design contradictions:** +Opus's best finding (timer restart defeating compound detection) isn't just a +race condition — it's identifying that the debounce mechanism can work against +the design's own stated goals. This is consistent with Opus's pattern in +previous findings: it finds tensions where one part of the design undermines +another part. For race condition analysis specifically, this manifests as +"here's where your safety mechanism becomes your vulnerability." + +**Practical implication for architecture review:** +- For race condition analysis: use GPT-5 (exhaustive) + Opus (design-tension) +- Sonnet is NOT suitable for concurrency reasoning tasks — use it for + assumption-finding and structural review instead +- The three-model stack needs task-appropriate assignment: + - Structural/assumption review: all three models contribute + - Concurrency/race analysis: GPT-5 + Opus only + - Bias detection: any model (per Finding #8) diff --git a/findings/2026-05-03-14-crosscomponent-interaction-analysis-gpt5-mini.md b/findings/2026-05-03-14-crosscomponent-interaction-analysis-gpt5-mini.md new file mode 100644 index 0000000..ec3e0a7 --- /dev/null +++ b/findings/2026-05-03-14-crosscomponent-interaction-analysis-gpt5-mini.md @@ -0,0 +1,131 @@ +# Finding 14: Cross-component interaction analysis: GPT-5 Mini enters as viable mid-tier; task framing strongly shapes output quality + +**Date:** 2026-05-03 +**Task:** Identify cross-component interaction failures in gargoyle's +`continuous-risk-monitoring.md` (459 lines) — a document specifying +PortfolioMonitor's interactions with OrderManager, PortfolioRisk, MarketData, +KillSwitch, ETS tables, and the pipeline supervision tree. +**How we used them:** Same document (full text) + same focused analytical +question to all 3 models via HAI proxy. Prompt was highly structured: specified +5 categories of cross-component failures to look for (semantic mismatches, +ordering violations, feedback loops, partial visibility, supervision boundary +effects) and required specific output format (components, sequence, gap, impact). +No tools, no project context beyond the document itself. + +| Model | Time | Output tokens | Reasoning tokens | Findings | +|---|---|---|---|---| +| GPT-5 Mini | 68s | 5,445 | 2,240 | 6 (+1 cut off) | +| GPT-5 | 116s | 10,604 | 8,128 | 10 | +| Claude Sonnet 4.6 | 38s | 1,868 | (internal) | 8 | + +**What they found — common ground (all 3 identified):** +- Fill-to-position query race (fill event triggers evaluation but position + store hasn't yet reflected the fill) +- Restrict flag ETS table destruction on PM crash → permissive window +- Kill switch check vs liquidation submission race +- Ticker subscription timing gap (new position opened but ticks not yet + subscribed → breach goes undetected) + +**GPT-5 unique findings (not in either other model):** +- Stale prices are NOT fail-safe for drawdown (higher stale price → inflated + portfolio value → understated drawdown). The document claims "fail-safe" + but this only holds for exposure metrics, not drawdown. This is the most + architecturally significant finding across all three models. +- Price definition mismatch between PM (last_trade from ETS) and OrderManager/ + broker (bid/ask/mid) causing mis-sized liquidation and oscillation +- Cross-component oscillation: PM hysteresis internal vs PRisk's immediate + binary restrict gate clearing (no cross-component cooldown) +- Liquidation stuck after OM restart (terminal events lost; liquidation_in_ + flight stays true indefinitely with no timeout/rehydration) +- "Minimal risk checks" not enforced — PM goes through same OM gates as + strategy orders but MarketHours/StalePrice controls may reject after-hours + or stale-price liquidation attempts +- FLATTEN mode semantics gap — PM refrains from liquidating when kill switch + engaged, but FLATTEN cancels open orders without actually CLOSING positions. + No component left to close positions. + +**Claude Sonnet 4.6 unique findings (not in either other model):** +- Liquidation feedback loop with PortfolioRisk — buy-to-cover for short + positions could INCREASE net long exposure at portfolio level, paradoxically + worsening concentration while fixing position-level metrics +- High water mark reset on pipeline restart masks true intraday drawdown + (restart → HWM resets to lower current value → drawdown calculated from + false baseline → larger losses permitted than intended) +- Multi-metric breach with single boolean flag — concentration liquidation + for AAPL sets liquidation_in_flight, blocking simultaneous daily P&L + liquidation for different positions +- Market close/open vs after-hours fills — claims to evaluate after-hours + fills but uses stale market-close prices + +**GPT-5 Mini unique findings (not in either other model):** +- OrderManager order splitting/remapping causing liquidation_in_flight + correlation failure (parent/child order ID mapping breaks terminal-event + detection). Well-reasoned but highly implementation-specific. +- Restrict/clear oscillation loop with strategy behavior (strategies react + to rejects → back off → restrict clears → strategies re-enter aggressively + → re-breach). Good systems-thinking about emergent feedback. + +**Quality assessment:** +- **GPT-5** produced the most findings (10) and the highest-quality + architectural insight: the stale-price/drawdown contradiction is a genuine + design flaw that contradicts the document's own safety claim. Multiple + findings showed cross-boundary reasoning about semantic mismatches (price + definition, FLATTEN semantics, gate bypass). Every finding named specific + components and described precise event sequences. +- **Claude Sonnet 4.6** was fast (38s, only 1,868 tokens) and produced 8 + solid findings. The HWM reset finding and the multi-metric/single-flag + finding show genuine architectural reasoning. The liquidation feedback + loop (buy-to-cover worsening portfolio concentration) is subtle and + shows cross-position reasoning. However, some findings overlapped + significantly with the common-ground set and added less unique depth. + Sonnet performed MUCH better here than on race condition identification + (Finding #13) — 8/10 ratio vs 7/12 previously. +- **GPT-5 Mini** produced 6 findings in 68s with 2,240 reasoning tokens. + Quality was genuinely good — the order-splitting/correlation finding + and the oscillation feedback loop both show real reasoning depth. It's + clearly NOT GPT-4.1 Mini — it reasons about component interactions, + not just within-frame risks. However, it found fewer issues and one + response was cut off (token limit or response truncation). + +**Key insight — task framing as the dominant variable:** +This experiment used a much more structured prompt than previous ones: +specified 5 categories, required specific output format, explicitly excluded +single-component failures. The result: ALL models produced higher-quality, +more focused output than in earlier experiments with broader prompts. Even +Sonnet — which struggled on race conditions (Finding #13) — performed well +here. The structured categories likely helped models organize their reasoning +without losing track of what they were looking for. + +The prompt explicitly asked for "cross-component interaction failures" rather +than general analysis. This is the narrow-lens effect from Finding #2, but +applied to a complex multi-component document. The lens is narrow (only +inter-component gaps) but the scope is broad (459 lines, many interactions). +This combination — narrow analytical lens + broad document scope — appears +to be the sweet spot for getting quality from all model tiers. + +**GPT-5 Mini positioning:** +First time testing GPT-5 Mini. Results: 6 findings in 68s vs GPT-5's 10 in +116s. That's 60% of the findings in 59% of the time, with 28% of the +reasoning tokens (2,240 vs 8,128). Quality-per-finding was solid — the order +correlation finding especially showed genuine systems reasoning. GPT-5 Mini +appears to be a legitimate mid-tier: more capable than GPT-4.1 (which can't +do this kind of cross-boundary reasoning) but less exhaustive than GPT-5. +Viable for: first-pass screening, bulk document review where you'd run many +docs and can't afford full GPT-5 on each. + +**Sonnet recovery from Finding #13:** +Sonnet went from 7 findings (with errors) on race conditions to 8 solid +findings here. The difference: this prompt was more structured, the document +was larger with more explicit interaction descriptions, and the task didn't +require pure temporal/sequential reasoning. "Cross-component interaction +failures" is closer to assumption-finding (Sonnet's strength) than race +condition identification (Sonnet's weakness). Task taxonomy continues to +matter more than raw model capability. + +**Updated model assignment for cross-component analysis:** +1. GPT-5 — broadest coverage, finds semantic contradictions in the doc's + own claims (10 findings) +2. Sonnet 4.6 — fast and efficient, good at portfolio-level reasoning and + feedback loops (8 findings in 38s) +3. GPT-5 Mini — viable first-pass with genuine reasoning depth (6 findings) +4. (Opus untested for this task type — likely strong on design tensions) diff --git a/findings/2026-05-03-15-design-coherence-analysis.md b/findings/2026-05-03-15-design-coherence-analysis.md new file mode 100644 index 0000000..8c930e4 --- /dev/null +++ b/findings/2026-05-03-15-design-coherence-analysis.md @@ -0,0 +1,133 @@ +# Finding 15: Design Coherence Analysis + +**Date:** 2026-05-03 +**Task:** Identify internal design incoherences in gargoyle's `failure-modes.md` (383 lines) +— places where the document's stated principles/invariants are contradicted by its own +specified mechanisms. +**How we used them:** Same document (full text) + same focused analytical question to all +3 models via HAI proxy. Highly structured prompt specifying 5 categories of incoherence +to look for (safety properties not enforced, state machine violations, recovery contradictions, +supervision conflicts, cross-mechanism contradictions). Required each finding to reference +specific sections. No tools, no project context beyond the document itself. + +| Model | Time | Output tokens | Reasoning tokens | Incoherences found | +|---|---|---|---|---| +| Claude Sonnet 4.6 | ~39s | 1,168 | (internal) | 5 | +| Claude Opus 4.6 | ~105s | 3,378 | (internal) | 7 (8 attempted, 1 self-withdrawn) | +| GPT-5 | ~120s | 10,235 | 9,088 | 4 | + +**What they found — common ground (all 3 identified):** +- State machine universality claim vs Strategy.Worker crash behavior (process + crashes bypass the degraded state entirely — no transition path in the model) +- Market data staleness advisory-only vs the "don't trade when ambiguous" principle + (or vs concurrent failure auto-halt) +- `pending_cancel`/`pending_replace` absent from recovery query set (GPT-5 and + Sonnet found this directly; Opus addressed the broader state machine gap) + +**GPT-5 unique findings (not in either Claude model):** +- Kill switch halted = "process terminated" vs kill switch requiring RUNNING + processes for cancel_all, FLATTEN, and LIQUIDATE modes. The state definition + claims processes are terminated, but the mechanisms require them alive to + execute orders. **This is the most architecturally significant finding** — it + reveals a fundamental definitional error in the state machine. +- Per-symbol degradation contradicts the process-level degradation semantics. + A worker "enters degraded" but continues operating for non-stale symbols — + violating the stated definition that degraded = "cannot perform primary + function." The metrics/eventing model has no per-symbol dimension. + +**Claude Opus unique findings (not in either other model):** +- `:rest_for_one` cascade creates a FIFTH implicit state (terminated-and- + restarting) not in the four-state model — processes that were `normal` are + forcibly killed (not by kill switch) and restart. Self-corrected one finding + that initially looked like incoherence but was actually consistent. +- PortfolioMonitor continues evaluating with stale data ("fail-safe") while + Strategy.Workers are stopped for the SAME condition — contradicts both the + universal state machine (PM doesn't transition to degraded) and the doc's + reasoning about why stale data is dangerous. +- Path-dependent recovery criteria: Strategy.Worker recovery requires 21+ bars + after crash but only "price continuity check" after staleness. The state + machine's single "catch-up complete" exit condition can't express this. +- `halted → [*]` transition in state diagram is logically impossible if "halted" + means the process is already terminated — dead processes can't fire transitions. +- Compound failure detection requires a meta-observer across processes but the + per-process state machine model has no way to express cross-process conditions. + +**Claude Sonnet unique findings (not in either other model):** +- Market data global staleness: the failure table says "Manual (disengage)" for + recovery — implying automatic engagement happened — but the text says it's + advisory only. Table contradicts prose. +- ReconciliationGate: doc claims gate survives OM crash (separate supervision + tree), but then says "missing ETS table = not ready" when OM crashes. If the + gate survives, why would its table be missing? +- Signal survival claims are contradictory between sections: worker crash says + downstream signals survive, but OM crash says all upstream signals lost. + (NOTE: this is actually describing different scenarios — worker crash doesn't + cascade to SignalRisk, OM crash does via :rest_for_one. Sonnet may have + misread the architecture here — the two statements are consistent when you + understand the supervision tree.) + +**Quality assessment:** +- **GPT-5** found only 4 incoherences but TWO of them are genuinely critical + architectural findings. The "halted = terminated" vs "kill switch requires + running processes" contradiction is a real design error — you can't both + terminate processes AND require them to execute cancel/liquidation orders. + The per-symbol degradation finding is also a real modeling gap. GPT-5 was + MORE SELECTIVE here than in previous experiments — it didn't pad with + medium-severity findings. Each of its 4 was high/critical. +- **Claude Opus** produced the most findings (7 valid) with characteristic + depth. Its self-correction (withdrawing finding #6 after deeper analysis) + shows intellectual honesty rare in model outputs. The PortfolioMonitor + stale-data contradiction is genuinely insightful — same input condition, + opposite response, no justification within the state machine model. The + compound failure meta-observer finding identifies a modeling category error. + Opus also found modeling imprecisions (path-dependent recovery, halted → [*] + impossibility) that the other models didn't notice. +- **Claude Sonnet** found 5 issues quickly (39s, 1,168 tokens) but quality was + mixed. Finding #4 (ReconciliationGate) raises a genuine question about + the ETS table ownership claim. Finding #1 (table vs prose contradiction on + market data staleness) is a real documentation inconsistency. However, + Finding #5 appears to misread the supervision architecture — the two + statements about signal survival ARE consistent when you understand that + different crashes cascade differently. Sonnet produced one false positive. + +**Key insight — "design coherence" is a NEW analytical category with distinct model strengths:** +This is different from assumption-finding (Finding #10-12), race conditions +(Finding #13), and cross-component interactions (Finding #14). Coherence +checking requires the model to hold MULTIPLE parts of the document in tension +with each other and reason about whether they're compatible. Results: + +- **GPT-5** was MORE SELECTIVE than in any previous experiment. Only 4 findings + vs 10-24 in other tasks. But precision was near-perfect — all 4 are genuine + contradictions. This suggests GPT-5's reasoning tokens are being used for + VERIFICATION (checking whether apparent contradictions hold up) rather than + EXPLORATION (finding more things). The 9K reasoning tokens produced 4 findings + vs the usual 10+ — GPT-5 is self-editing aggressively. +- **Opus** hit its sweet spot. Coherence checking IS design-tension identification + — Opus's consistent strength. Finding incoherences requires exactly the kind + of "how does this design disagree with itself" reasoning that Opus excels at. + It also showed unique self-correction behavior (withdrawing a finding after + deeper analysis). +- **Sonnet** was fast but produced a false positive. Coherence checking requires + holding multiple document sections in memory simultaneously and reasoning about + their compatibility — this is harder than assumption-finding (where you + reason about one mechanism at a time) but easier than race conditions (which + require sequential temporal reasoning). Sonnet occupies a middle ground. + +**Model ranking for design coherence checking:** +1. Claude Opus 4.6 — most findings, highest depth, self-correcting (7 valid) +2. GPT-5 — fewest findings but near-perfect precision, finds the critical ones (4) +3. Claude Sonnet 4.6 — fast screening, but prone to false positives on + architectural misreads (4/5 valid) + +**This inverts the usual GPT-5 > Opus ordering.** In previous experiments, GPT-5 +consistently found MORE issues. Here, GPT-5 was more selective than Opus. The +task type (self-consistency checking) favors Opus's "design tension" reasoning +style over GPT-5's "exhaustive exploration" style. GPT-5 apparently uses its +reasoning to VERIFY rather than GENERATE when the task is about contradictions +rather than gaps. + +**Practical implication:** For architecture documents, run coherence checking as +a separate pass using Opus as the primary model. GPT-5's higher precision means +it's good for confirming which Opus findings are genuine vs overreads. The +two-pass approach: Opus generates candidates → GPT-5 validates → result is the +intersection plus GPT-5's independent finds. diff --git a/findings/2026-05-03-16-specification-completeness-sonnet-45-produces.md b/findings/2026-05-03-16-specification-completeness-sonnet-45-produces.md new file mode 100644 index 0000000..94f7f50 --- /dev/null +++ b/findings/2026-05-03-16-specification-completeness-sonnet-45-produces.md @@ -0,0 +1,131 @@ +# Finding 16: Specification completeness: Sonnet 4.5 produces 2x the findings of Sonnet 4.6 on implementation-gap analysis; quality vs volume tradeoff + +**Date:** 2026-05-03 +**Task:** Identify specification gaps in gargoyle's `kill-switch.md` (185 lines) — places +where an implementer would be forced to guess or decide on their own because the spec +doesn't clearly specify behavior. New analytical lens not previously tested. +**How we used them:** Same document (full text) + same focused analytical question to all +3 models via HAI proxy. Highly structured prompt specifying 5 categories of underspecification +(behavioral ambiguity, missing edge cases, ordering/sequencing gaps, interface contracts +undefined, concurrency semantics omitted). Required specific output format per finding +(gap, section, what implementer must decide, risk if wrong, severity). No tools, no +project context beyond the document itself. + +| Model | Time | Output tokens | Reasoning tokens | Gaps found | Critical | High | Medium | Low | +|---|---|---|---|---|---|---|---|---| +| Claude Sonnet 4.6 | 73s | 3,403 | (internal) | 13 | 8 | 4 | 0 | 1 | +| Claude Sonnet 4.5 | 102s | 5,191 | (internal) | 25 | 14 | 6 | 4 | 1 | +| GPT-5 | 109s | 10,140 | 7,872 | 19 | 8 | 7 | 3 | 0 | + +**What they found — common ground (all 3 identified):** +- Pipeline process identification ambiguity (which processes are "pipeline processes") +- Per-user process scope mapping (how to terminate only one user's processes) +- ETS table ownership and lifecycle (who owns it, what happens on crash) +- Concurrent engage operations (what happens when two sources engage simultaneously) +- Liquidation order tagging mechanism (what the tag is, how verified) +- Process restart prevention (how "must not restart" is enforced) +- Engage sequence atomicity (partial failure between DB write and termination) +- Startup ordering and ETS readiness (pipeline starting before ETS populated) +- Disengage sequence ordering (what happens and in what order) + +**Sonnet 4.5 unique findings (not in either other model):** +- ETS table schema/structure (set vs ordered_set, key format, value schema) +- Missing ETS detection mechanism (catch :badarg vs table existence check) +- Database write atomicity with ETS (transaction boundaries, rollback semantics) +- Per-user engage while global is already engaged (is it a no-op or error?) +- Broker rejection semantics ("already filled" vs "invalid cancel" distinction) +- Cold-start gate interaction (independence vs dependency of the two gates) +- User deletion with active kill switch (orphaned rows, cascade semantics) +- Global disengage effect on per-user states (independent or auto-clear?) +- Audit log write failure during engage (critical-path vs best-effort) +- Dashboard control ambiguity in LIQUIDATE mode (contradictory disable/enable) +- Cancel timeout duration (operational parameter not specified) +- Manual order source code path during LIQUIDATE (how orders bypass the dead pipeline) + +**GPT-5 unique findings (not in either other model):** +- Combined global/per-user mode semantics (what happens when global=RESTRICT, + user=LIQUIDATE — can user's liquidation proceed?) +- Scope of "all" in cancel_all and liquidation (system-wide vs per-user) +- Gate behavior when ETS missing but liquidation needed (conflicting requirements: + fail-closed says block, but liquidation needs to pass) +- Disengage during in-flight cancellations (what happens to racing tasks) +- Gate placement relative to broker submission (exact point in the flow) +- Engage latency expectations (no quantified SLA) +- Mode change while already engaged (RESTRICT → LIQUIDATE without disengage) +- Dashboard vs backend scope for manual liquidation (individual vs bulk only) + +**Sonnet 4.6 unique findings (not in either other model):** +- ETS sequencing relative to process termination (ETS before or after kill?) +- Concurrent disengage + re-engage race (specific interleaving scenario) +- Close-only enforcement mechanism (UI-only vs backend validation) +- Order-in-flight past ETS gate during termination (already-checked orders) + +**Quality assessment:** +- **Claude Sonnet 4.5** was the most EXHAUSTIVE (25 gaps) but with notable + quality variance. Several findings were highly specific and implementation- + relevant (ETS schema, missing-table detection, broker rejection semantics). + Others were relatively obvious or lower-impact (user deletion, audit log + failure, cancel timeout duration). The 14 Critical ratings feel somewhat + generous — some would be more accurately rated as High in practice. Output + was well-structured with clear per-finding format. +- **GPT-5** found 19 gaps with consistent high quality. Its unique findings + show cross-cutting reasoning: the combined mode semantics finding (global + vs per-user mode interaction) identifies a genuine specification gap that + neither Sonnet version noticed. The "ETS missing but liquidation needed" + finding is architecturally significant — it identifies a CONTRADICTION in + the spec's own rules (fail-closed blocks everything, but liquidation must + pass). Every finding was actionable. More selective severity ratings + (8 Critical vs Sonnet 4.5's 14). +- **Claude Sonnet 4.6** was the most SELECTIVE (13 gaps) but with the highest + precision. Every finding was genuinely a specification gap that an + implementer would face. The ETS sequencing finding (#4) is particularly + well-reasoned — it identifies a specific ordering dependency that creates + a race window. Sonnet 4.6 appears to self-filter aggressively, producing + only findings it's confident about. Higher signal-to-noise than 4.5. + +**Key insight — Sonnet 4.5 vs 4.6 on analytical tasks:** +This is the first direct comparison between Claude model versions on the same +analytical task. Key differences: + +- **Volume:** 4.5 produced almost 2x the findings (25 vs 13) +- **Tokens:** 4.5 used ~1.5x the output tokens (5,191 vs 3,403) +- **Time:** 4.5 took ~1.4x longer (102s vs 73s) +- **Severity distribution:** 4.5 had more Critical findings (14 vs 8) but + with more generous severity ratings +- **Quality per finding:** 4.6 had higher average quality; fewer "obvious" + or lower-impact findings + +The 4.6 model appears to have been trained toward higher precision/selectivity. +It finds fewer things but each finding is more reliably a genuine gap. The 4.5 +model is more exhaustive but includes findings that a reviewer might triage as +"yes, technically, but not really a spec gap." This mirrors a known training +direction in Claude models: later versions tend to be more concise and selective. + +**For practical use:** If you want completeness (cast a wide net, accept some +noise): use 4.5. If you want precision (every finding is actionable, no triage +needed): use 4.6. For architecture review where missing a gap has cost, 4.5's +exhaustiveness is probably worth the noise. For review where false positives +cost attention (e.g., PR review comments), 4.6's selectivity is preferred. + +**GPT-5 vs Sonnet comparison on this task:** +GPT-5 (19 findings) sits between the two Sonnets in volume but has the highest +consistency — no obvious misses or inflated severities. Its unique strength +here: finding CONTRADICTIONS within the spec's own rules (ETS-missing blocking +conflicts with liquidation needing to pass). This is consistent with Finding #15 +where GPT-5 was unusually selective but precise on coherence checking. + +Specification completeness analysis appears to be a task where: +1. Sonnet 4.5 is strongest for breadth (25 findings, catches operational gaps) +2. GPT-5 is strongest for detecting spec self-contradictions (19 findings, high precision) +3. Sonnet 4.6 is strongest for precision (13 findings, zero noise) + +**Updated model version comparison:** +- Claude 4.6 → higher precision, more selective, concise +- Claude 4.5 → more exhaustive, more verbose, occasional severity inflation +- This is a genuine tradeoff, not a simple regression or improvement + +**Practical implication:** Run BOTH Sonnet versions? 4.5 catches things 4.6 +filters out (ETS schema, broker rejection semantics, cold-start gate interaction). +4.6 catches things with more specificity (sequencing gaps, exact race windows). +For a one-shot budget: 4.5 if you want coverage, 4.6 if you want actionability. +GPT-5 if you want to find where the spec contradicts itself. diff --git a/findings/2026-05-04-18-temporal-boundary-analysis-gpt5-is.md b/findings/2026-05-04-18-temporal-boundary-analysis-gpt5-is.md new file mode 100644 index 0000000..11e5cd4 --- /dev/null +++ b/findings/2026-05-04-18-temporal-boundary-analysis-gpt5-is.md @@ -0,0 +1,158 @@ +# Finding 18: Temporal boundary analysis: GPT-5 is most exhaustive; Opus finds design-level contradictions; Sonnet 4.5 is structured but less deep + +**Date:** 2026-05-04 +**Task:** Identify temporal boundary vulnerabilities in gargoyle's `escalation-policy.md` +(238 lines) — scenarios where the timing model (evaluation cycles, debounce counts, +cooldown periods) creates windows of incorrect or dangerous behavior. +**How we used them:** Same document (full text) + same focused analytical question to all +3 models via HAI proxy. Highly structured prompt specifying 5 categories of temporal +vulnerability (timing exploitation, counter reset abuse, asymmetric time exposure, +cross-metric temporal interactions, state loss temporal effects). Required specific +output format per finding (name, sequence with cycle numbers, mechanism, severity, fix). +No tools, no project context beyond the document itself. + +| Model | Time | Output tokens | Reasoning tokens | Findings | Critical | High | Medium | +|---|---|---|---|---|---|---|---| +| GPT-5 | ~128s | 9,175 | 5,888 | 15 | 3 | 7 | 2 | +| Claude Opus 4.6 | ~120s | 5,112 | (internal) | 10 | 3 | 5 | 2 | +| Claude Sonnet 4.5 | ~100s | 4,056 | (internal) | 12 | 3 | 3 | 3 | + +**What they found — common ground (all 3 identified):** +- Flash crash / inter-evaluation gap exploitation (metric spikes between discrete + evaluation cycles go undetected) +- Single clear cycle resetting debounce counter (transient recovery defeats escalation + despite sustained risk — metric can breach 80%+ of cycles and never escalate) +- Asymmetric escalation time vs loss compounding rate (11 cycles to reach liquidation + while losses compound every single cycle) +- Monitor crash resets state to Clear, losing all escalation progress +- Liquidation re-trigger requiring full debounce reset, delaying subsequent batches +- Kill switch N value unspecified (timing indeterminacy) + +**GPT-5 unique findings (not in either other model):** +- Boundary-alignment counter starvation: explicitly modeled the "adversarial flicker" + pattern (breaching 2 cycles, 1 clear, repeat — 66% breach time, never escalates) + with a precise mathematical framing of why K-of-N is needed +- Cycle-length drift under load: GC pauses or CPU contention stretching evaluation + intervals means "3 cycles" could be 12 minutes instead of 90 seconds when it + matters most (high-load market stress = slowest evaluations) +- Adversarial boundary timing (market microstructure masking): illiquid instruments + where opposing prints predictably arrive near evaluation boundaries, exploiting + deterministic sampling points +- Cross-metric escalation gap and hedging prohibition: Restrict prevents ALL new + positions including risk-REDUCING hedges needed for a different metric still + escalating on its own timeline — protection for metric A actively worsens metric B +- Cooldown stall causing prolonged Restrict: repeated transient spikes near hysteresis + threshold reset cooldown indefinitely while metric is actually safe +- State inconsistency between restriction flags and monitor after restart: + documented asymmetry where flag persists (manual clear) but state resets (auto + clear) — creates orphaned restriction or unprotected window depending on + reconciliation approach +- Metric computation fail-closed interacting with debounce: system errors create + false escalations with long cooldown, potentially blocking hedging trades +- Unspecified N for kill switch post-liquidation breaches: coupled with crash + reset, system can loop indefinitely without reaching kill switch +- In-liquidate flicker stall: one cycle below threshold after partial fill resets + re-trigger counter, stalling further liquidation + +**Claude Opus unique findings (not in either other model):** +- De-escalation cooldown exploitation (predictable window): after cooldown completes + and restriction lifts, strategy has a GUARANTEED 5+ cycle window of unrestricted + trading before Restrict can re-engage — an automated strategy could systematically + exploit this predictable safe window to re-enter dangerous positions +- Orphaned restriction flag asymmetry framed as a DESIGN CONTRADICTION: the failure + modes table specifies opposing recovery paths for state (automatic → Clear) vs + flags (manual clear), creating an irreconcilable dual state. Opus uniquely + identified that operator intervention to clear the flag could inadvertently + create a WORSE protection gap than leaving it orphaned +- Self-correcting analysis style: Opus's summary explicitly synthesized that the + three Critical findings share a common cause (debounce optimizes against false + positives at the expense of false negatives during sustained events) and proposed + a single architectural fix (severity-aware fast path) that addresses all three + +**Claude Sonnet 4.5 unique findings (not in either other model):** +- De-escalation timing not accounting for proximity to breach threshold: system + removes protection while metric is still near-dangerous, and re-escalation + requires full debounce — created a specific "whipsaw" scenario with cycle numbers +- Manual-only de-escalation from Liquidate creates UNBOUNDED recovery time: + if triggered at 2 AM Saturday, trading disabled until Monday despite metrics + recovering in minutes. Framed as contradiction with "autonomous" design goals +- Evaluation cycle synchronization assumption: no handling of variable timing + (CPU contention, GC pauses) — implicit throughout but never addressed +- Cold start escalation ambiguity: system starts with no prior state while + portfolio may already be in breach condition +- De-escalation event ordering race: multiple metrics de-escalating simultaneously + may emit events in non-deterministic order, confusing external observers + +**Quality assessment:** +- **GPT-5** was the most exhaustive (15 findings) and showed the strongest + mathematical/systems reasoning. Its unique findings included precise attack + models (adversarial flicker, boundary alignment, microstructure masking) that + describe exact exploitation patterns with percentages and cycle counts. The + cross-metric hedging prohibition finding is architecturally significant — it + identifies that protection for one metric can actively CREATE risk for another. + Every finding was actionable with specific fixes. +- **Claude Opus 4.6** produced fewer findings (10) but with characteristic depth + and self-awareness. Its cooldown exploitation finding identified a PREDICTABLE + exploit window that an automated strategy could systematically abuse — framed + not as an accident but as an adversarial opportunity. The summary synthesis + (identifying common cause across Critical findings) shows meta-analytical + capability the other models didn't demonstrate. Opus also uniquely identified + that human intervention to fix one problem could create a WORSE problem — + second-order operational reasoning. +- **Claude Sonnet 4.5** was well-structured (12 findings, clean severity tiers, + organized by Critical/High/Medium/Low) and faster than both other models. + Its findings were solid but less architecturally deep. The manual de-escalation + contradiction finding was genuinely insightful (unbounded recovery time vs + autonomous design goals). However, several findings restated concepts the + other models covered with less specificity about exploitation mechanics. + +**Key insight — temporal reasoning as a task type:** +This is the first experiment specifically testing "temporal boundary analysis" — +reasoning about time-domain properties of a state machine (evaluation frequency, +counter semantics, cooldown mechanics, crash/restart timing). + +Results compared to Finding #13 (race condition identification on a concurrency doc): +- GPT-5: 15 findings here vs 12 in Finding #13. Consistent high performance + on temporal reasoning tasks across both experiments. +- Opus: 10 findings here vs 10 in Finding #13. Remarkably consistent — Opus + produces ~10 high-quality findings regardless of temporal task variant. +- Sonnet 4.5: 12 findings here (first test). Compare to Sonnet 4.6's 7 findings + (with errors) in Finding #13. Sonnet 4.5 handles temporal reasoning better than + 4.6 — consistent with Finding #16 showing 4.5 is more exhaustive across task types. + +**Sonnet 4.5 vs 4.6 on temporal reasoning (inferred comparison):** +Sonnet 4.6 struggled significantly on race condition identification (Finding #13: +7 findings with analytical errors, misreading architecture). Sonnet 4.5 here +produced 12 solid findings with no apparent misreadings. This suggests 4.5's +exhaustiveness advantage extends to temporal reasoning — the additional +exploration it does (vs 4.6's aggressive self-filtering) catches more temporal +interactions. Confirms Finding #16's pattern: 4.5 for coverage, 4.6 for precision. + +**The structured-prompt effect continues:** +All three models produced focused, high-quality output with this highly structured +prompt (5 specific categories + required output format). This confirms Finding #14: +narrow analytical lens + broad document scope is the sweet spot for all model tiers. +The prompt structure appears to be a stronger predictor of output quality than model +choice for the bottom 80% of findings (all models find the common-ground issues). +Model choice matters for the TOP 20% — the unique insights that require deeper +reasoning about system interactions. + +**Updated model assignment for temporal boundary analysis:** +1. GPT-5 — most exhaustive, strongest at modeling adversarial exploitation patterns + and mathematical edge cases (15 findings) +2. Claude Sonnet 4.5 — good volume with clean structure, viable for first-pass + temporal analysis (12 findings, no errors) +3. Claude Opus 4.6 — fewest findings but highest insight density, uniquely + identifies predictable exploit windows and operational second-order effects + (10 findings) + +**Practical implication:** For temporal analysis on state machines and timing-dependent +policies, the three-model stack produces genuine complementary value: +- GPT-5 catches the adversarial attack patterns and mathematical edge cases +- Opus catches the predictable exploit windows and operational contradictions +- Sonnet 4.5 provides good breadth at lower cost with clean severity categorization + +The union of unique findings across all three models reveals significantly more +temporal vulnerabilities than any single model alone. For a document governing +autonomous financial actions (liquidation, kill switch), the cost of running all +three (~$1-2) is trivially justified against the risk of missing a timing exploit. diff --git a/findings/2026-05-04-19-union-coverage-test-gpt5-mini.md b/findings/2026-05-04-19-union-coverage-test-gpt5-mini.md new file mode 100644 index 0000000..e26b842 --- /dev/null +++ b/findings/2026-05-04-19-union-coverage-test-gpt5-mini.md @@ -0,0 +1,124 @@ +# Finding 19: Union coverage test: GPT-5 Mini + Sonnet 4.6 covers ~71% of GPT-5's findings; the missing 29% is where the real value lives + +**Date:** 2026-05-04 +**Task:** Identify hidden assumptions in gargoyle's `trading-pipeline.md` (1,110 lines, +~62KB) — the most complex document tested so far, covering the full end-to-end path +from tick ingestion through order execution. +**How we used them:** Same document (full text, no truncation) + same focused analytical +question to all 3 models via HAI proxy. Standard hidden-assumption prompt with 5 +categories (runtime behavior, external dependencies, timing/ordering, scale/load, +uncovered failure modes). Required specific output format per finding. No tools, no +project context beyond the document itself. + +| Model | Time | Output tokens | Reasoning tokens | Assumptions found | +|---|---|---|---|---| +| GPT-5 | 99s | 9,418 | 5,696 | 35 | +| GPT-5 Mini | 93s | 5,309 | 1,792 | 21 | +| Claude Sonnet 4.6 | 38s | 1,792 | (internal) | 17 | + +**Coverage analysis — can Mini + Sonnet together replace GPT-5?** + +Categorized each of GPT-5's 35 findings by whether the union of Mini + Sonnet +also identified the same assumption: + +- **Covered by BOTH Mini and Sonnet:** ~12 findings (common ground — any model + finds these: idempotency, single-writer, clock sync, instrument resolution, + fill immutability, reconciliation gate, backpressure, fill correlation, event + ordering, audit scalability, PortfolioRisk bottleneck) +- **Covered by Mini only (not Sonnet):** ~7 findings (transactional atomicity, + audit causal consistency, modification-in-flight enforcement, OM throughput, + decimal precision, PM/PR close-only race, partition duplicate submit) +- **Covered by Sonnet only (not Mini):** ~6 findings (market data feed rates, + pipeline-vs-market speed, corporate actions atomicity, kill switch partition, + shared port isolation, market close vs auction fills) +- **Union(Mini + Sonnet) total coverage:** ~25/35 = **~71%** of GPT-5's findings +- **GPT-5 unique (missed by both):** ~10-18 findings depending on strictness + +**What GPT-5 uniquely found that the cheaper pair missed:** + +The missing 29% is NOT random — it's systematically different in character: + +1. **Operational edge cases:** Default TIF "day" broker semantics, OrderRate + counting retries, extended-hours MarketHours mismatch, fractional quantities, + local expiry timer precision per instrument +2. **Design-level interaction gaps:** PortfolioRisk concurrent decision race + (snapshot stale between two parallel approvals), re-validation gap between + approval and submit, decision loss on crash after audit write +3. **Domain-specific knowledge:** Manual broker-side actions conflicting with + state machine, options/complex instrument position_effect mapping, Decision→Order + 1:1 invariant vs broker auto-splitting, wash sale retroactive P&L mutation +4. **Architectural observations:** Reduction re-entry rule insufficiency, + PortfolioMonitor coalescing vs fast breach detection, multi-aggregator fanout + and audit partial writes, replay/backtest alignment with production controls + +These share a common trait: they require **domain expertise** (knowing how brokers +actually behave, how regulatory rules interact, how production trading systems +fail in practice) combined with **architectural reasoning** (how the design's own +mechanisms interact under those real-world conditions). The cheaper models find +assumptions about the document's internal consistency; GPT-5 additionally finds +assumptions about the document's relationship to the external world it must +operate in. + +**GPT-5 Mini vs Sonnet 4.6 — complementary, not redundant:** + +Mini and Sonnet covered different gaps: +- Mini was stronger on **internal consistency** (transactional atomicity, causal + consistency, decimal precision, modification serialization) +- Sonnet was stronger on **external interactions** (market data feeds, corporate + actions, kill switch distribution, shared resource isolation) + +This aligns with previous findings: Mini reasons about implementation mechanics; +Sonnet reasons about system boundaries and external interactions. Their union +covers more ground than either alone. + +**Cost comparison:** + +| Approach | Total tokens | Approx. cost | Coverage of GPT-5 | +|---|---|---|---| +| GPT-5 alone | ~21K (9.4K output + 5.7K reasoning) | ~$0.80 | 100% (35 findings) | +| Mini + Sonnet | ~7.1K output + 1.8K reasoning | ~$0.25 | ~71% (25/35 findings) | +| All three | ~28K total | ~$1.05 | >100% (35 + unique Sonnet/Mini extras) | + +**Key insight — the 71% coverage is a floor, not a ceiling:** + +The union covers 71% of GPT-5's specific findings. But Mini and Sonnet each +also produced findings that GPT-5 DIDN'T make: +- Sonnet: DailyLossLimit query performance scaling, instrument reference data + propagation atomicity across components +- Mini: Signal audit correlation ambiguity under replay/duplicate ticks + +So the total unique finding space is LARGER than any single model. Running all +three produces the most comprehensive analysis. + +**Answer to the open question: "Would running GPT-5 Mini + Sonnet together +approach GPT-5's coverage at lower combined cost?"** + +**Partially.** The pair covers ~71% of GPT-5's findings at ~31% of the cost. +But the missing 29% is disproportionately valuable — it contains the +domain-specific, interaction-level, real-world-knowledge findings that are +most likely to prevent production incidents. For a quick sanity check or +first-pass screening, Mini + Sonnet is excellent value. For architecture +review where completeness matters (financial system, safety-critical), GPT-5 +is not replaceable by cheaper models — its unique findings are exactly the +ones that would cause real-world failures. + +**Practical implication:** The optimal strategy depends on stakes: +- **Low stakes** (internal doc review, non-critical systems): Mini + Sonnet + is 71% coverage at 31% cost — strong ROI +- **High stakes** (financial systems, safety-critical): run all three — the + ~$1 total cost is irrelevant vs the value of the extra 10-18 findings +- **Budget-conscious high stakes:** run GPT-5 alone — it subsumes most of + what Mini + Sonnet find, and adds the critical domain-knowledge findings + +The cost argument for Mini + Sonnet as a GPT-5 REPLACEMENT doesn't hold for +important work. The cost argument for Mini + Sonnet as a GPT-5 COMPLEMENT +is strong — they catch a few things GPT-5 misses, and the union of all three +is the most thorough analysis available. + +**Document complexity observation:** +This is the largest document tested (1,110 lines vs previous 185-785 lines). +GPT-5's finding count scaled up (35 vs 20-26 on smaller docs) while maintaining +quality — no padding with obvious/low-value findings. Mini also scaled (21 vs +6 on 459-line doc in Finding #14). Sonnet scaled less (17 vs 12-17 on smaller +docs) — it appears to have a natural output ceiling regardless of document size, +consistent with its self-filtering behavior observed in previous findings. diff --git a/findings/2026-05-04-20-invariant-violation-path-analysis-gpt5.md b/findings/2026-05-04-20-invariant-violation-path-analysis-gpt5.md new file mode 100644 index 0000000..f6c2be8 --- /dev/null +++ b/findings/2026-05-04-20-invariant-violation-path-analysis-gpt5.md @@ -0,0 +1,163 @@ +# Finding 20: Invariant violation path analysis: GPT-5 is maximally selective (3 findings, all genuine); Opus shows unique self-correcting analytical style; new task type favors precision over exhaustiveness + +**Date:** 2026-05-04 +**Task:** Identify invariant violation paths in gargoyle's `user-pipeline-lifecycle.md` +(730 lines) — sequences of legal operations that can violate the system's stated or +implied invariants. NEW analytical lens not previously tested, distinct from assumption- +finding, race conditions, or coherence checking. +**How we used them:** Same document (full text) + same focused analytical question to all +3 models via HAI proxy. Highly structured prompt specifying 5 categories of invariant +violations (state machine escapes, invariant composition failures, monotonicity violations, +idempotency boundary violations, authority inversion sequences). Required specific output +format per finding. No tools, no project context beyond the document itself. + +| Model | Time | Output tokens | Reasoning tokens | Findings | +|---|---|---|---|---| +| GPT-5 | 143s | 784 | 12,032 | 3 | +| Claude Opus 4.6 | 113s | 6,183 | (internal) | 7 (with 2 self-corrections) | +| Claude Sonnet 4.6 | 23s | 1,266 | (internal) | 5 | + +**What they found — common ground (2+ models identified):** + +- **Periodic reconciliation overrides operator manual stop** (GPT-5 #3 + Opus #5 + + Sonnet #1): An admin who stops a pipeline via `stop_user/1` with `:admin_action` + has their decision overridden within 5 minutes by periodic reconciliation, because + there's no "admin stopped" state in `check_eligibility/1`. All three models + independently identified this as the clearest authority inversion. +- **DynamicSupervisor restart bypasses eligibility gate** (Opus #1/#3 + Sonnet #2): + When `UserPipeline.Supervisor` crashes and is restarted by OTP supervision, the + restart bypasses `start_user/1` and `check_eligibility/1` entirely — potentially + resuming trading while the kill switch is engaged. +- **Stale ReconciliationGate after crash** (Opus #7): After a crash-triggered + DynamicSupervisor restart (not via `stop_user/1`), the ReconciliationGate remains + `:ready` from the previous instance because `stop_user/1` (which resets it) was + never called. The new OrderManager may accept orders during its own reconciliation. +- **HealthMonitor co-lifecycle violation** (Opus #2 + Sonnet #4): After a + DynamicSupervisor-initiated restart, the HealthMonitor is still subscribed to the + old PIDs — no code re-establishes monitoring for the new pipeline processes. + +**GPT-5 unique findings (not in either other model):** + +- **Kill switch bypass for users configured DURING engagement** (#1): A user who + saves credentials while the kill switch is engaged is never added to the pending + operator release set (only running pipelines are added at engage time). After + disengage, periodic reconciliation auto-starts this user's pipeline without + operator release — violating "resuming always requires human judgment." This is + the most precisely reasoned finding across all three models: each step is + individually correct per the spec, and the violation emerges purely from the + composition of legal operations. +- **Premature release bypass** (#2): If `operator_release_user/1` is called while + the kill switch is still engaged (a legal operation), it clears the pending + release flag but `start_user/1` correctly refuses. After later disengage, the + flag is gone — auto-start proceeds without fresh operator judgment. The release + was "spent" at the wrong time. + +**Claude Opus unique findings (not in either other model):** + +- **`operator_release_system/0` clears unrelated safety obligations** (#4): + Operator intends to release one user from a recent event but + `operator_release_system/0` also releases other users still pending from an + earlier, unresolved event. One release call discharges multiple independent + safety obligations — monotonicity violation. +- **State machine incompleteness for blocked users** (#6): Users who become + configured during kill switch engagement (blocked with reason + `:kill_switch_engaged`) have no state machine transition back to `starting` + after disengage — they're not in the pending release set, and no event fires. + System works via periodic reconciliation (up to 5 minutes delay), but the + documented state machine doesn't represent this path. +- **Self-correcting analytical style:** Opus explicitly withdrew two draft + findings mid-analysis ("Actually, this sequence works as designed. Let me + identify a real violation instead." / "this is likely handled"). This + self-correction behavior was first observed in Finding #15 and is now + confirmed as a consistent Opus trait for invariant-style analysis. + +**Claude Sonnet unique findings (not in either other model):** + +- **Cold-start Tier 3 failure creates supervision restart loop** (#2): A + persistent Tier 3 failure (phantom fills) crashes OrderManager, `:rest_for_one` + kills the tree, DynamicSupervisor restarts it, cold-start fails again → infinite + loop. State machine shows `starting → stopped` but supervision creates + `starting → starting` indefinitely. +- **HealthMonitor start failure during start_user** (#4): If HealthMonitor.Supervisor + is momentarily crashed when `start_user/1` runs step 4, the pipeline starts + without monitoring. No error handling specified for this partial-start state. + +**Quality assessment:** + +- **GPT-5** was MAXIMALLY SELECTIVE — only 3 findings from 12,032 reasoning tokens + (4,011 reasoning tokens per finding). This is the most extreme + reasoning-to-output ratio observed: 15:1 (12,032 reasoning / 784 output tokens). + For comparison, in previous experiments GPT-5 typically shows 1:1 to 2:1 ratios. + Every finding is a genuine invariant violation with a precise, step-by-step + sequence where each step is individually legal. ZERO false positives, zero + padding, zero "this might be an issue." GPT-5 appears to have used almost all + its reasoning budget for VERIFICATION — confirming that each candidate is + genuinely a violation before including it. +- **Claude Opus** produced the most findings (7) with its characteristic depth and + self-correction. Two findings were revised mid-analysis, showing Opus actively + testing its own reasoning against the document before committing to a finding. + The DynamicSupervisor restart thread (findings #1, #2, #3, #7) forms a coherent + cluster — Opus identified one root cause (OTP restarts bypass the lifecycle + layer) and explored its multiple consequences. The `operator_release_system` + monotonicity finding (#4) is architecturally significant and unique. +- **Claude Sonnet** was extremely fast (23s, 1,266 tokens) and produced 5 findings. + Quality was mixed: Finding #1 partially mirrors GPT-5's authority inversion but + with vaguer reasoning ("race condition with ETS operations" — not specified). + Finding #3 describes a contradiction but the scenario is internally inconsistent + (step 5 says "pipeline termination fails" but then step 7 says pipeline is still + running — this conflates two failure modes). Findings #2 and #4 are genuine and + well-reasoned. Sonnet's precision is lower than the other two on this task. + +**Key insight — "Invariant violation paths" as a task type:** + +This is a genuinely DIFFERENT analytical task from any previously tested. It requires: +1. Identifying the invariants (explicit or implied) +2. Constructing a sequence of operations (creative/generative) +3. Verifying each step is legal per the spec (verification) +4. Confirming the end state violates the invariant (correctness proof) + +This four-phase cognitive process explains GPT-5's extreme selectivity: steps 2-4 are +all verification-heavy, and GPT-5's reasoning tokens are being burned on steps 3 and 4 +(confirming each step is genuinely legal and the final state genuinely violates). In +previous tasks like "find hidden assumptions" or "find gaps," only step 1 (identification) +is needed — there's no construction or verification phase. + +**Comparison to previous task types:** + +| Task type | GPT-5 findings | Opus findings | GPT-5 reasoning overhead | +|---|---|---|---| +| Hidden assumptions | 20-35 | 12-13 | 5-7K reasoning | +| Race conditions | 12 | 10 | 8K reasoning | +| Design coherence | 4 | 7 | 9K reasoning | +| Invariant violation paths | 3 | 7 | **12K reasoning** | + +The pattern: as the task requires more VERIFICATION (vs identification), GPT-5 becomes +more selective and spends more reasoning tokens per finding. Invariant violation paths +demand the highest verification burden (every step must be confirmed legal), and GPT-5 +responds with the highest selectivity and reasoning investment. + +Opus inverts: it produces MORE findings on verification-heavy tasks (7 for coherence, +7 for invariant paths) vs identification tasks (10-13 for assumptions). This suggests +Opus uses its internal reasoning differently — it's more willing to present findings +that have "likely" rather than "proven" violations, then self-corrects inline if the +verification fails. + +**Practical implication:** + +For invariant violation path analysis: +- **GPT-5** produces the highest-precision findings but very few. Every finding is a + genuine spec-level bug. Use when you need zero-false-positive bug reports to present + to a design team. +- **Opus** produces more findings with slightly lower precision but unique analytical + depth. Its self-correction behavior means false positives are often caught inline. + Use when you want both confirmed violations AND identified tensions. +- **Sonnet** is too imprecise for this task type — some findings have internal + inconsistencies. Use for lighter analytical tasks (assumption-finding, spec gaps). + +The three findings GPT-5 produced are ALL genuine design bugs that should be fixed: +1. Users configured during kill switch engagement bypass operator release +2. Premature operator release (while KS still engaged) creates future bypass +3. Admin stops are overridden by periodic reconciliation + +These are the kind of findings that, in a real financial system, prevent production +incidents. The 12K reasoning tokens to produce 3 perfect findings is excellent ROI. diff --git a/findings/2026-05-04-21-reasoning-effort-lowmediumhigh-has-negligible.md b/findings/2026-05-04-21-reasoning-effort-lowmediumhigh-has-negligible.md new file mode 100644 index 0000000..b91e04d --- /dev/null +++ b/findings/2026-05-04-21-reasoning-effort-lowmediumhigh-has-negligible.md @@ -0,0 +1,125 @@ +# Finding 21: Reasoning effort (low/medium/high) has negligible effect on GPT-5's analytical output; the parameter may not work as documented for open-ended analysis + +**Date:** 2026-05-04 +**Task:** Identify hidden assumptions in gargoyle's `order-state-machine.md` (221 lines) +— a well-structured state machine specification covering order lifecycle, fill precedence, +TIF semantics, and parameter resolution. +**How we used them:** Same document, same prompt, same model (GPT-5), same +max_completion_tokens (16K). Only variable: `reasoning.effort` parameter set to +"low", "medium", or "high". Run sequentially via HAI proxy (OpenAI-compatible +endpoint). No tools, no project context beyond the document. + +| Effort | Time (ms) | Output tokens | Reasoning tokens | Findings | +|---|---|---|---|---| +| Low | 97,913 | 7,657 | 4,288 | 33 (+11 recs) | +| Medium | 94,824 | 7,112 | 4,160 | 30 | +| High | 88,607 | 6,891 | 3,712 | 30 | + +**The counterintuitive result:** Higher reasoning effort produced FEWER findings, +FEWER reasoning tokens, FEWER output tokens, and completed FASTER. The expected +pattern (high effort → more reasoning → more depth) was inverted. + +**Per-finding metrics (remarkably consistent):** + +| Effort | Output tokens/finding | Reasoning tokens/finding | +|---|---|---| +| Low | 232 | 129 | +| Medium | 237 | 138 | +| High | 229 | 123 | + +The depth per finding was nearly identical across all three levels. The models +didn't get more detailed or rigorous per-finding at higher effort — they just +found slightly fewer things. + +**Severity distributions (similar across all three):** +- Low: 7 Critical, 21 High, 5 Medium (33 findings) +- Medium: 9 Critical, 15 High, 4 Medium + 2 borderline (30 findings) +- High: 6 Critical, 14 High, 5 Medium + 4 borderline (30 findings) + +**Qualitative differences — WHAT they found:** + +High-effort unique findings (not in low): +- Single-writer authority to broker (no out-of-band modifications) +- Broker emits fills for all executed quantities (no silent netting) +- Instrument identity remains stable across corporate actions +- Late-fill override won't violate downstream invariants +- Validation covers lot sizes, price ticks, borrow/locate constraints +- Multiple accounts and venues are part of the correlation key +- Streaming and polling APIs are consistent +- System can handle multi-leg instruments + +Low-effort unique findings (not in high): +- Acks arrive before fills (no pre-ack fills) +- Cancel-before-ack handling (submitted → cancelled missing) +- Fill totals never exceed requested quantity +- Deterministic ordering within a broker stream +- Exercise/assignment and non-order position changes +- Client-side idempotency of "place order" +- Partial accept/normalize on replace +- No "child" order fragmentation at broker +- Submitted state can receive terminal events +- Late cancel vs local expired mismatch + +**Character of the differences:** +- HIGH-unique findings tend to be more **architectural/systemic** (multi-leg + instruments, streaming vs polling consistency, downstream invariant violations, + corporate actions). These require reasoning about the system's relationship + to the broader world. +- LOW-unique findings tend to be more **implementation-specific edge cases** + (cancel-before-ack, pre-ack fills, child order fragmentation, partial accepts). + These require reasoning about specific event interleavings and protocol details. + +Both sets are valid and actionable. Neither is clearly "better." They represent +different analytical modes — breadth-of-scope (high) vs depth-of-protocol (low). + +**Key insight — reasoning_effort doesn't scale analysis linearly:** + +Three possible explanations for the inverted behavior: + +1. **GPT-5 already uses near-maximum reasoning for analytical tasks regardless + of the effort parameter.** The ~4K reasoning tokens across all three levels + (4288/4160/3712) are too similar to reflect a genuine effort gradient. The + parameter may primarily affect OTHER task types (math, code, logic puzzles) + where reasoning depth is more variable. + +2. **Higher effort increases FILTERING, not exploration.** At high effort, GPT-5 + may spend more of its reasoning on VERIFYING whether findings are genuine + before including them — similar to the extreme selectivity observed in + Finding #20 (invariant violation paths, 12K reasoning for 3 findings). This + would explain fewer findings despite theoretically "trying harder." + +3. **The parameter has minimal practical effect for this model version.** + The differences (33 vs 30 vs 30) are within normal stochastic variation. + Repeated runs at the same effort level might show similar variance. + +**The prompt cache hit on HIGH (2304 cached prompt tokens) may have slightly +accelerated processing, but doesn't explain the reasoning token difference.** + +**Comparison to previous findings:** +In Finding #20 (invariant violation paths), GPT-5 used 12,032 reasoning tokens +for 3 findings — extreme verification behavior. Here, at default effort on a +different task type (hidden assumptions), it uses ~4K reasoning for ~30 findings. +This confirms that TASK TYPE is a far stronger predictor of GPT-5's reasoning +behavior than the reasoning_effort parameter. The invariant violation prompt +triggered deep verification; the assumption-finding prompt triggers broad +exploration regardless of effort setting. + +**Practical implication:** +For open-ended analytical tasks (assumption-finding, gap analysis, spec review), +the reasoning_effort parameter appears to have negligible practical effect on +GPT-5. Don't bother tuning it for these tasks — the default is fine. The +parameter may be more meaningful for: +- Tasks with verifiable correct answers (math, logic) +- Tasks where the model could short-circuit (simple questions) +- Extremely long documents where exploration budget matters + +For architecture review specifically: reasoning_effort is NOT a useful lever. +Task framing (the prompt structure) and document selection remain the dominant +variables for output quality. Save reasoning_effort tuning for coding/math tasks +where the parameter was likely trained and evaluated. + +**Open question:** Would running the same experiment 5x at each level show that +the 33-vs-30 difference is within stochastic noise? If so, reasoning_effort is +effectively a no-op for analytical prompts. If not, low-effort consistently +produces more (less filtered) output, which could be useful for brainstorming- +style analysis where you want maximum coverage before manual triage. diff --git a/findings/2026-05-05-22-silent-correctness-failures-new-analytical.md b/findings/2026-05-05-22-silent-correctness-failures-new-analytical.md new file mode 100644 index 0000000..7c9a78a --- /dev/null +++ b/findings/2026-05-05-22-silent-correctness-failures-new-analytical.md @@ -0,0 +1,180 @@ +# Finding 22: Silent correctness failures: NEW analytical lens reveals Opus's strength at domain/regulatory reasoning; GPT-5 shows regulatory depth; all models converge on compliance timestamp errors + +**Date:** 2026-05-05 +**Task:** Identify scenarios where the mechanism produces SILENTLY INCORRECT results +(not errors, not crashes — wrong financial calculations, wrong lot selections, or wrong +compliance records that pass all validation) in gargoyle's `specid-lot-selection.md` +(306 lines) — a financial system specification covering tax lot selection strategies, +cost basis accounting, and IRS SpecID compliance. +**How we used them:** Same document (full text) + same focused analytical question to +all 3 models via HAI proxy. Highly structured prompt specifying 5 categories of silent +incorrectness (stale data, semantic precision, ordering sensitivity, composition errors, +temporal reference errors). Required specific output format per finding with concrete +numerical examples of financial impact. No tools, no project context beyond the document. + +| Model | Time | Output tokens | Reasoning tokens | Findings | Critical | High | Medium | +|---|---|---|---|---|---|---|---| +| GPT-5 | 147s | 13,006 | 10,496 | 7 | 2 | 2 | 3 | +| Claude Opus 4.6 | 119s | 5,902 | (internal) | 10 | 3 | 3 | 4 | +| Claude Sonnet 4.6 | 122s | 6,011 | (internal) | 6 | 3 | 3 | 0 | + +**What they found — common ground (all 3 identified):** +- `designation_at` = `DateTime.utc_now()` at processing time, NOT at actual + designation time (manual selection was made at order submission, standing + orders were configured earlier) — compliance record factually incorrect +- Holding period calculation boundary errors (>365 days vs IRS "more than one + year" rule, off-by-one at leap year boundaries, day-after-acquisition start) +- HIFO tie-breaker `opened_at ASC` ignores tax_term dimension — selects + long-term losses over short-term losses when both have identical cost basis, + producing less tax-valuable outcomes +- Strategy preference resolved at fill processing time, not at trade time + (preference changes between trade and fill processing apply retroactively) + +**GPT-5 unique findings (not in either Claude model):** +- Corporate action applied late stale cost basis in HIFO: ROC/dividend reduces + basis but if close/4 fires before apply_corporate_action/3, HIFO sorts on + pre-adjusted basis AND records wrong realized P&L permanently. No mechanism + to restate previously persisted LotClosed events. Concrete example: $2,000 + overstated loss from one trade. +- `designation_at` fragmentation: a single sell consuming multiple lots calls + DateTime.utc_now() per loop iteration, producing slightly different timestamps + for what should be a single coherent designation event. Audit risk. +- LIFO label in `selection_method` field: records "lifo" but for securities LIFO + isn't an authorized tax method — the operation is legally SpecID electing + newest lots. Downstream reporting may reject or misclassify. + +**Claude Opus unique findings (not in either other model):** +- Realized P&L excludes commissions/fees: formula uses `sell_fill.price` (raw + execution price) minus `lot.cost_basis`, not net proceeds. If cost_basis also + excludes buy-side commissions, P&L is doubly overstated. Active trader doing + 1000 trades/year: ~$20,000+ cumulative P&L overstatement. +- Position `average_cost` is meaningless under SpecID and potentially misleading: + SpecID exists to exploit lot-level basis differences, but position-level average + obscures this. If downstream consumers use average_cost for tax estimation, + results can be 50%+ wrong per lot. +- GenServer mailbox ordering determines lot-to-fill assignment for concurrent sells: + two simultaneous fills for the same instrument get different lots based on network + arrival timing. With different holding periods, produces $670+ tax difference + without user awareness. +- Wash sale rule completely unaddressed: system reports losses as realized/deductible + without checking 30-day substantially identical purchase rule. Active trader + harvesting $50,000 in losses could have $0 actually deductible — $18,500 tax gap. +- `opened_at` semantics undefined: whether it's exchange execution time, GenServer + arrival time, or settlement date affects every downstream calculation (FIFO/LIFO + ordering, holding periods, tax terms). Network timing could produce wrong FIFO + lot selection. + +**Claude Sonnet 4.6 unique findings (not in either other model):** +- Stale cost basis in manual lot picker during concurrent corporate actions: UI shows + pre-action basis, user selects based on stale data, but close/4 only validates + open/ownership/quantity — never re-validates that the selection rationale is still + correct. No field records the discrepancy. +- `average_cost` recomputation ordering ambiguity in event-sourced model: step 4 + recomputes from "updated lots" but step 3 (persist events) may not have completed + — if implementation re-derives from event store rather than in-memory state, reads + pre-closure lot quantities. Accumulates $500+ error per partial close. +- Strategy fallback + config corruption silently overwrites selection method in + compliance record: if config becomes invalid, fallback to :fifo is logged at + :warning but LotClosed records `selection_method: "fifo"` — compliance record + shows user "chose" FIFO when they configured HIFO. No field records intended vs + actual strategy. + +**Quality assessment:** +- **Claude Opus** produced the most findings (10) with the broadest analytical scope. + Several findings went BEYOND the document's mechanism to identify missing features + that create silent incorrectness (wash sale rules, commission handling, opened_at + semantics). This is a different analytical mode: Opus identified what the system + SHOULD compute but DOESN'T, not just where the existing computation is wrong. + The wash sale finding is the highest-impact across all three models — an active + trader's entire tax-loss harvesting strategy could be invalid. The GenServer + mailbox ordering finding shows characteristic Opus reasoning about emergent + behavior from design decisions. +- **GPT-5** produced fewer findings (7) but with extreme precision and specificity. + Every finding includes concrete dollar amounts and specific field references. + The corporate action stale basis finding is uniquely actionable — it identifies a + specific race condition between two documented mechanisms (close/4 and + apply_corporate_action/3) that produces permanently incorrect persisted data + with no correction path. The designation_at fragmentation finding shows attention + to implementation detail that neither Claude model noticed. GPT-5 used 10,496 + reasoning tokens for 7 findings (1,500 tokens/finding) — HIGH verification, + consistent with Finding #20's pattern for precision-over-breadth tasks. +- **Claude Sonnet 4.6** produced 6 findings with strong specificity and novel angles. + The event-sourced recomputation ordering finding (#5) is architecturally subtle — + it identifies a composition error between the walk-and-consume algorithm's step + ordering and event-sourcing patterns. The strategy fallback compliance recording + finding is a genuine audit hazard. However, Sonnet produced no Medium-severity + findings — it either found Critical/High issues or filtered everything else out. + This aligns with its established high-precision, high-self-filtering behavior. + +**Key insight — "Silent correctness" as an analytical lens:** + +This is the FIRST experiment testing a "silent incorrectness" prompt. The key +difference from previous analytical lenses: +- **Assumption-finding:** "What must be true for this to work?" (Finding #10-12) +- **Race conditions:** "What timing issues exist?" (Finding #13) +- **Design coherence:** "Does the design contradict itself?" (Finding #15) +- **Invariant violations:** "What operation sequences break invariants?" (Finding #20) +- **Silent correctness:** "Where does the system CONFIDENTLY produce WRONG output + with NO indication of error?" + +The silent correctness lens produced qualitatively different findings from all +previous lenses. The emphasis on "passes all validation" forced models to reason +about what SHOULD be validated but ISN'T, and about semantic correctness (regulatory +requirements, financial accounting rules) vs syntactic correctness (valid types, +non-nil fields, correct schema). + +This lens also revealed a key model differentiation not seen before: +- **Opus reasons about MISSING functionality** (wash sales, commissions, opened_at + semantics) — things the system should do but doesn't +- **GPT-5 reasons about EXISTING functionality being wrong** (corporate action race, + designation fragmentation, LIFO labeling) — things the system does but incorrectly +- **Sonnet reasons about COMPOSITION failures** (event-sourcing step ordering, + strategy fallback propagation) — things that are individually correct but combine + incorrectly + +These are three genuinely different analytical modes, not just "more/less thorough." +All three are valuable for different review outcomes: Opus for feature completeness, +GPT-5 for mechanism correctness, Sonnet for integration correctness. + +**Financial domain advantage:** + +This is the first experiment on a document with strong regulatory/financial semantics. +All three models demonstrated domain knowledge (IRS holding period rules, Treas. Reg. +1.1012-1(c) requirements, wash sale IRC §1091, long-term/short-term capital gains +rate differentials). Opus in particular referenced specific IRC sections and provided +concrete tax rate calculations. The "silent incorrectness" lens works especially well +on financial/regulatory documents because the gap between "syntactically valid output" +and "semantically/legally correct output" is large and consequential. + +**Comparison to previous findings on the same models:** + +| Task type | GPT-5 findings | Opus findings | Sonnet findings | Opus > GPT-5? | +|---|---|---|---|---| +| Hidden assumptions (#10-12) | 20-35 | 12-13 | 13-17 | No | +| Race conditions (#13) | 12 | 10 | 7 | No | +| Design coherence (#15) | 4 | 7 | 5 | **Yes** | +| Invariant violations (#20) | 3 | 7 | 5 | **Yes** | +| Silent correctness (#22) | 7 | 10 | 6 | **Yes** | + +Pattern confirmed: Opus outperforms GPT-5 (by finding count) on tasks that require +reasoning about the design's RELATIONSHIP to external requirements (regulatory, +financial, consumer expectations). GPT-5 outperforms Opus on tasks that require +EXHAUSTIVE EXPLORATION within a self-contained system (assumptions, race conditions). + +The "silent correctness" lens is structurally similar to coherence checking (does the +system match its external requirements?) rather than gap-finding (what's missing +within the system?). This explains why Opus outperforms: the task requires reasoning +about the world outside the document (IRS rules, financial accounting standards, +regulatory requirements), which is Opus's strength. + +**Practical implication:** +For financial/regulatory system review, the "silent correctness" lens should be +run using Opus as the primary model (broadest findings including missing-feature +identification) plus GPT-5 for mechanism-level precision. Sonnet adds value for +composition/integration issues that neither Opus nor GPT-5 catches. All three +produced unique, actionable findings that the others missed. + +The three findings ALL models converged on (designation_at, holding period, HIFO +tie-breaker, strategy preference timing) should be treated as confirmed design +bugs requiring fixes. The fact that three independent models all identified them +with concrete financial impact examples increases confidence that these are real. diff --git a/findings/2026-05-05-23-regulatory-compliance-analysis-gpt5-finds.md b/findings/2026-05-05-23-regulatory-compliance-analysis-gpt5-finds.md new file mode 100644 index 0000000..8ec8ddc --- /dev/null +++ b/findings/2026-05-05-23-regulatory-compliance-analysis-gpt5-finds.md @@ -0,0 +1,193 @@ +# Finding 23: Regulatory compliance analysis: GPT-5 finds broadest scope of IRS issues; Opus self-corrects mid-analysis; all models converge on cross-account gap + +**Date:** 2026-05-05 +**Task:** Identify where gargoyle's `wash-sale-tracking.md` (391 lines) could produce +incorrect tax reporting compared to IRS wash sale regulations (IRC 1091). NEW +analytical lens: regulatory compliance verification — asking models to reason about +a code implementation's correctness against EXTERNAL regulatory requirements (not +internal system assumptions or race conditions). +**How we used them:** Same document (full text) + same focused analytical question +to all 3 models via HAI proxy. Prompt structured around 5 categories: regulatory +gaps, interpretation errors, edge cases at regulatory boundaries, cross-account/entity +concerns, and interaction with other IRC sections. Required specific regulatory +citations, implementation analysis, concrete tax errors, and audit risk levels. +No tools, no project context beyond the document. + +| Model | Time | Output tokens | Reasoning tokens | Findings | +|---|---|---|---|---| +| GPT-5 | 178s | 12,525 | 9,536 | 16 | +| Claude Opus 4.6 | 155s | 7,326 | (internal) | 16 (with 2 self-corrections/withdrawals) | +| Claude Sonnet 4.6 | 40s | 1,818 | (internal) | 12 | + +**What they found — common ground (all 3 identified):** +- Cross-account/IRA/external broker wash sales not tracked (IRS applies at taxpayer level) +- Options/contracts to acquire stock not triggering wash sales (explicit in IRC 1091(a) text) +- "Substantially identical" definition too narrow (same index ETFs, share classes, ADRs) +- Trade date vs settlement date ambiguity in opened_at/closed_at +- Short sale wash sales not addressed +- Section 475 mark-to-market traders incorrectly subjected to wash sale tracking +- IRC 1092 straddle rules interaction not addressed +- Related party / spousal transactions not considered +- Corporate action identity changes breaking matching + +**GPT-5 unique findings (not in either other model):** +- **Per-share vs lot-level basis tacking** (#1): The system applies `disallowed_loss` + and `tacked_opened_at` at the LOT level, but IRS requires per-share treatment + when only partial shares are matched. A lot of 100 shares where only 60 trigger + wash sale should have per-share basis segregation — the system inflates basis for + all 100 shares. **Most architecturally significant finding** — a fundamental + design-level error, not a missing feature. +- **IRA permanent disallowance** (#2): When replacement purchase is in an IRA, the + loss is PERMANENTLY lost (no basis adjustment possible in tax-deferred accounts). + System either incorrectly applies basis adjustment inside IRA or misses it entirely. +- **Instruments not subject to §1091** (#4): §1256 contracts (futures, index options), + cryptocurrency, and §475 elections are all exempt — system may over-disallow. +- **Average-cost mutual fund basis** (#11): Wash sale adjustments for funds using + average-cost method require different math than discrete lot-level adjustments. +- **ADRs vs local shares** (#14): ADRs and underlying foreign ordinaries are + substantially identical but have different instrument_ids. +- **RSU vestings/ESPP purchases** (#15): Equity compensation creating lots via + corporate action paths may not trigger `check_replacement/2`. +- **Ordering priority between pre/post sale purchases** (#10): Industry convention + (post-sale first, then pre-sale) may differ from system's strict chronological + ordering, causing 1099-B mismatches. + +**Claude Opus unique findings (not in either other model):** +- **Year-end boundary timing** (#5): Loss in December + replacement in January means + tax reports generated between Dec 31 and the replacement purchase date are incorrect. + Forward detection fires retroactively but users may have already filed. System needs + a "30-day pending window" for year-end reports. +- **Form 8949 reporting format** (#6): IRS requires code "W" in column (f) and + specific adjustment amounts in column (g). System doesn't describe how `tax_summary/3` + produces Form 8949-compatible output — potential CP2000 notice triggers from + automated IRS matching against broker 1099-B. +- **"Open lots" query in backward detection** (#10): If backward detection only + queries currently-open lots, it misses replacements that were acquired AND SOLD + within the window. IRS looks at acquisition regardless of current holding status. + (Rev. Rul. 56-602) +- **Forward detection loss ordering unspecified** (#7): When multiple prior losses + compete for the same replacement shares, ordering matters — different allocation + produces different basis amounts on the replacement lot. +- **DRIP reinvestments triggering wash sales** (#9): Dividend reinvestment creates + new lots that should trigger forward detection but may not if only buy fills + produce `LotOpened` events. +- **Self-correcting analytical style (CONFIRMED):** Opus withdrew Finding #4 + entirely mid-analysis ("Revised assessment: holding period logic appears correct. + I withdraw the claim of error"). Spent ~500 words reasoning through the holding + period tacking logic, found it correct, and explicitly retracted. This is now + confirmed across Findings #15, #20, and #23 as a consistent Opus behavior for + verification-heavy regulatory analysis. + +**Claude Sonnet unique findings (not in either other model):** +- **Entity-level tracking for partnerships/S-Corps** (#4.2): Tax-transparent entities + trading through the platform need K-1 reporting to partners — user-scoped model + doesn't address pass-through entity wash sale reporting. +- **Constructive sale integration (IRC 1259)** (#4.1): Short positions or derivatives + creating constructive ownership interact with wash sale determination in ways not + addressed. +- **NOL carryforward interaction** (#5.3): Wash sale deferrals affect character and + timing of losses contributing to NOL calculations across tax years. + +**Quality assessment:** +- **GPT-5** produced the broadest regulatory scope (16 findings) with the most + specific IRS citations (Rev. Rul. 2008-5, Pub. 550, IRC §§267, 1091, 1092, 1222, + 1223, 1256, 475). Its per-share vs lot-level finding (#1) is the only one that + identifies a FUNDAMENTAL DESIGN ERROR (not a missing feature). Most other models' + findings are "you don't handle X" — GPT-5's #1 says "what you DO handle is + handled INCORRECTLY." This distinction matters: missing features are known scope + limitations; incorrect logic is a bug. +- **Claude Opus** matched GPT-5's count (16 with 2 self-corrections = 14 net + confirmed) but with different character. Opus excelled at identifying OPERATIONAL + implications (year-end boundary timing, Form 8949 format requirements, forward + detection ordering) rather than just statutory gaps. Its findings tend to describe + HOW the gap manifests in practice ("user files taxes, then January purchase + retroactively invalidates the filing") vs GPT-5's approach of citing the statute + and describing the theoretical violation. +- **Claude Sonnet** was fast (40s) and produced 12 competent findings but with less + regulatory precision. Findings lacked specific IRS citations (no Rev. Rul. + references, no Treas. Reg. citations). Several findings overlapped heavily with + common ground items without adding unique depth. The entity-level and + constructive sale findings show awareness of tax complexity but are relatively + generic ("this is complex and not addressed"). + +**Key insight — regulatory compliance as a distinct task type:** + +This experiment tests a fundamentally different cognitive demand than previous ones: +previous tasks asked "what could go wrong with this system?" (internal reasoning). +This task asks "does this system correctly implement external rules?" (external +reasoning). The model must hold TWO bodies of knowledge simultaneously: the +implementation spec AND the regulatory framework, then find mismatches. + +All three models had strong tax law knowledge — they cited IRC sections, Revenue +Rulings, and Treasury Regulations correctly. The differentiation wasn't in legal +knowledge but in HOW they applied it: + +- **GPT-5:** Exhaustive statutory mapping ("here's every IRC section that touches + wash sales; here's where the implementation falls short on each"). Breadth-first + coverage. Found the most issues by sheer scope of regulatory awareness. +- **Opus:** Operational consequence reasoning ("here's how this gap manifests as + a real-world problem for the user/auditor"). Found issues by reasoning about + the implementation's interaction with real-world workflows (filing deadlines, + form formats, broker reconciliation). +- **Sonnet:** Category-based analysis ("here are cross-account issues, here are + entity issues, here are interaction issues"). Followed the prompt structure + closely but didn't go deep within each category. + +**The per-share vs lot-level finding (GPT-5 #1) — why it matters:** + +This is the experiment's most important result. Every model found missing features +(options, cross-account, short sales) — those are SCOPE limitations that the +document itself acknowledges or defers. GPT-5 uniquely found a correctness bug in +the IMPLEMENTED logic: the system's lot-level basis adjustment is mathematically +wrong for partial wash sales. + +Example: Loss lot of 100 shares, replacement lot of 60 shares. Only 60 shares +trigger wash sale. System adds full 60% of disallowed loss to the entire +replacement lot's basis. If the replacement lot later sells 30 shares, the +per-share basis is inflated (reflects 60 shares of adjustment spread across 60 +shares). This is actually correct for the replacement lot specifically — but +the `tacked_opened_at` is applied to ALL 60 shares when only the matched shares +should have tacked holding periods. For lots where `adjusted_quantity < +replacement_quantity`, the non-matched shares have incorrect holding period +characterization. + +Actually, on closer inspection: if `adjusted_quantity = min(loss_quantity, +replacement_quantity)`, and the system matches 60 shares of a 60-share +replacement lot, ALL shares of that lot are matched. The edge case GPT-5 +identifies would require a replacement lot larger than the loss — e.g., loss of +60 shares matched against a replacement lot of 100 shares where only 60 are +affected. In that case, the `tacked_opened_at` is set on the entire lot (100 +shares) when only 60 should be affected. This IS a genuine bug: 40 shares get +incorrect holding period classification. + +**Updated task-type taxonomy:** + +| Task type | Primary cognitive demand | Best model | +|---|---|---| +| Hidden assumptions | Breadth identification (what's not stated?) | GPT-5 (exhaustive) | +| Race conditions | Sequential temporal reasoning | GPT-5 + Opus | +| Cross-component interactions | Component boundary reasoning | GPT-5 + Sonnet | +| Design coherence | Internal consistency checking | Opus | +| Invariant violation paths | Construction + verification | GPT-5 (precision) | +| Silent correctness | External requirement matching | Opus | +| **Regulatory compliance** | **Dual-knowledge-base comparison** | **GPT-5 (breadth) + Opus (operations)** | + +Regulatory compliance is closest to "silent correctness" (Finding #22) in that +both require reasoning about external requirements. The key difference: +- Silent correctness asks "does this produce correct outputs for all inputs?" +- Regulatory compliance asks "does this implement the law correctly?" + +Both favor models that reason about the system's relationship to the outside +world (Opus's strength), but regulatory compliance also rewards breadth of +statutory knowledge (GPT-5's strength). The combination produces the most +complete picture. + +**Practical implication:** +For regulatory compliance review of financial systems: +- Run GPT-5 for exhaustive statutory coverage (finds the most gaps) +- Run Opus for operational impact analysis (finds how gaps manifest in practice) +- Sonnet adds marginal value — use only if budget allows +- GPT-5's unique strength: identifying correctness bugs in implemented logic + (not just missing features) +- Opus's unique strength: identifying timing/workflow issues (year-end, form + reporting, reconciliation with broker) diff --git a/findings/2026-05-05-24-design-improvement-proposals-gpt5-excels.md b/findings/2026-05-05-24-design-improvement-proposals-gpt5-excels.md new file mode 100644 index 0000000..c4b7c88 --- /dev/null +++ b/findings/2026-05-05-24-design-improvement-proposals-gpt5-excels.md @@ -0,0 +1,152 @@ +# Finding 24: Design improvement proposals: GPT-5 excels at defense-in-depth thinking; Opus finds subtle design contradictions; Sonnet produces generic recommendations + +**Date:** 2026-05-05 +**Task:** Propose specific design improvements for gargoyle's `kill-switch.md` (185 lines) +— the primary safety mechanism that prevents rogue orders. NEW task type: generative/ +creative ("what would you improve?") rather than purely analytical ("what's wrong?"). +**How we used them:** Same document (full text) + same focused prompt to all 3 models +via HAI proxy. Prompt asked for 8-15 specific improvements with: weakness, proposed +change (concrete), tradeoff, severity rating. Explicitly excluded generic advice +("add more tests") and asked about runtime assumptions. No tools, no project context. + +| Model | Time | Output tokens | Reasoning tokens | Improvements proposed | +|---|---|---|---|---| +| GPT-5 | 118s | 8,710 | 6,016 | 15 | +| Claude Opus 4.6 | 127s | 4,985 | (internal) | 15 | +| Claude Sonnet 4.6 | 40s | 1,636 | (internal) | 12 | + +**What they found — common ground (all 3 identified):** +- DB write failure blocking engagement (fail-open under DB outage) — all three + proposed in-memory-first engagement with async persistence +- Kill switch process liveness monitoring (heartbeat/watchdog) +- Broker connectivity loss during cancellation operations +- ETS table ownership and crash-window vulnerability +- Supervisor restart suppression as unstated mechanism +- Per-venue/per-broker scope extension + +**GPT-5 unique findings (not in either other model):** +- **Infrastructure-level "hard kill"** — egress proxy or service mesh that blocks + broker traffic independently of the application. Belt-and-suspenders approach + where the kill switch works even if the entire BEAM VM is unresponsive. This + was GPT-5's highest-impact unique insight. +- **Kill fence token (epoch)** — every order-carrying message includes an epoch; + stale-epoch messages are dropped at the gate. Elegantly solves in-flight + messages without needing drain timeouts. +- **Cluster/multi-node propagation** — detailed leader election + epoch broadcast + + fail-closed on partition design. +- **Post-engage broker verification** — query broker AFTER engaging to confirm no + orders slipped through during the engagement window. +- **Liquidation exposure validation** — proving tagged liquidation orders actually + REDUCE exposure rather than trusting the tag. +- **Recovery/cold-start order suppression** — ensuring reconciliation/recovery + routines can't submit orders while engaged. +- **Engage latency reordering** — ETS first, terminate second, DB async. +- **Audit log tamper evidence** — append-only external sink + hash chain. + +**Claude Opus unique findings (not in either other model):** +- **Ordering contradiction in engagement sequence** — identified that the + documented order (DB → ETS → terminate) creates a specific risk if a crash + occurs BETWEEN termination and ETS update (not just DB failure). The insight + is about the window where termination has started but gate is still open. + More subtle than GPT-5's version (which focused on DB-blocking-engage). +- **Concurrent engagement race (mode escalation)** — multiple triggers + simultaneously issuing conflicting modes (RESTRICT vs LIQUIDATE). Proposed + explicit escalation rules (LIQUIDATE always wins) with GenServer serialization. +- **Shared resources under per-user scope** — per-user kill switch doesn't + address orders in shared broker connection buffers. Forces architectural + decision about connection pooling strategy. +- **Clock/time integrity for audit log** — monotonic counters + NTP validation + for forensic reliability. +- **Partial multi-user engagement failures** — what happens when global engage + successfully terminates 4/5 user pipelines but one has orphaned processes. +- **Liquidation direction validation** — similar to GPT-5's exposure validation + but framed differently: checking corrupted position records could cause + liquidation to OPEN positions rather than close them. +- **Process termination verification** — checking that `:kill` signals actually + worked (defense against trap_exit, NIF blocking). +- **Engagement latency SLA** — defining a 50ms target with monitoring/alerting. + +**Claude Sonnet findings (all also present in GPT-5 or Opus, differently framed):** +- No genuinely unique improvements that GPT-5 or Opus didn't also identify. +- Several were generic: "missing resource cleanup," "circuit breaker integration," + "performance monitoring" — exactly the kind of advice the prompt tried to + exclude. +- The "missing heartbeat" and "network partition handling" proposals were solid + but less detailed than the corresponding GPT-5/Opus versions. + +**Quality assessment:** +- **GPT-5** produced the most ACTIONABLE improvements. Its proposals were + architecturally concrete ("add an egress proxy," "use kill epochs in messages," + "query broker post-engage") and showed defense-in-depth thinking — multiple + independent layers rather than fixing one path. The infrastructure kill (#2) + is genuinely novel: no other model proposed going OUTSIDE the application + boundary for safety enforcement. GPT-5 consistently thought about "what if + this entire runtime is compromised?" rather than just fixing within-app paths. +- **Claude Opus** produced equally numerous improvements (15) with characteristic + precision about failure SEQUENCES. Its unique strength: identifying design + contradictions rather than just gaps (the engagement ordering issue, concurrent + mode escalation, shared-resource scope mismatch). Opus's proposals were more + "fix the design tension" while GPT-5's were more "add another safety layer." + Opus also included the process termination verification and engagement latency + SLA — operational rigor that GPT-5 skipped. +- **Claude Sonnet** produced 12 proposals in 40s (fast) but quality was notably + lower. Several proposals were generic software engineering advice that the + prompt explicitly excluded ("add performance monitoring," "resource cleanup"). + No unique insights emerged. Sonnet's proposals lacked the architectural depth + of GPT-5 (no outside-the-application thinking) and the design-tension + identification of Opus. + +**Key insight — generative vs analytical tasks:** + +This is the first experiment testing a GENERATIVE task ("propose improvements") +rather than a purely analytical one ("find problems"). The results reveal: + +1. **GPT-5's defense-in-depth thinking is unique.** In analytical tasks, GPT-5 + finds exhaustive lists of issues. In generative tasks, it proposes LAYERED + solutions — multiple independent mechanisms that each catch what the others + miss. The infrastructure kill proposal (external to the application) shows + GPT-5 reasoning about failure modes that are invisible to within-app analysis. + +2. **Opus's design-tension identification transfers to improvement proposals.** + In analytical tasks, Opus finds where parts of a design contradict each other. + In generative tasks, this manifests as proposals that RESOLVE tensions rather + than just adding patches. The engagement ordering contradiction and mode + escalation rules are both "this design says X but the mechanism allows Y — + here's how to make them consistent." + +3. **Sonnet doesn't transfer well to generative tasks.** In analytical tasks + (assumption-finding, cross-component analysis), Sonnet performs well (85% of + GPT-5 in some experiments). In generative tasks, it falls back to generic + engineering advice. The task requires both identifying problems AND proposing + concrete solutions — Sonnet handles the first step but not the second with + sufficient depth. + +**Comparison to analytical task performance:** + +| Task type | GPT-5 character | Opus character | Sonnet character | +|---|---|---|---| +| Assumption-finding (#10-12) | Exhaustive breadth | Design tensions | Good (85% of GPT-5) | +| Race conditions (#13) | Technical precision | Design contradictions | Weak (errors) | +| Invariant violations (#20) | Maximum selectivity | Self-correcting depth | Imprecise | +| **Design improvements (#24)** | **Defense-in-depth layers** | **Tension resolution** | **Generic advice** | + +The generative task reveals model ARCHITECTURES more clearly than analytical tasks. +GPT-5's reasoning enables it to construct multi-layered solutions. Opus's internal +reasoning enables it to identify what a design SHOULD be (not just what's wrong). +Sonnet pattern-matches against known engineering practices without deep synthesis. + +**Practical implication:** + +For design improvement sessions on safety-critical systems: +- Run GPT-5 for defense-in-depth proposals ("what layers should exist?") +- Run Opus for design consistency proposals ("where does the design contradict itself?") +- Skip Sonnet — its output is indistinguishable from generic checklists +- The combination of GPT-5 + Opus produces complementary improvements: GPT-5 adds + safety layers, Opus fixes internal contradictions. Together they address both + "not enough protection" and "protection mechanisms that work against each other." + +**Cost analysis:** +GPT-5: 118s, ~10.9K tokens (6K reasoning). Opus: 127s, ~5K tokens. Sonnet: 40s, ~1.6K tokens. +For a safety-critical design review, running GPT-5 + Opus costs ~16K tokens and produces +30 improvements with near-zero overlap in unique insights. Excellent ROI for a kill switch +design that protects real money. diff --git a/findings/2026-05-05-25-contradiction-detection-new-task-type.md b/findings/2026-05-05-25-contradiction-detection-new-task-type.md new file mode 100644 index 0000000..fb28ab7 --- /dev/null +++ b/findings/2026-05-05-25-contradiction-detection-new-task-type.md @@ -0,0 +1,154 @@ +# Finding 25: Contradiction detection: NEW task type — Opus excels at finding genuine contradictions with precision; GPT-5 is thorough but spends heavily on reasoning; Sonnet finds surface-level issues quickly + +**Date:** 2026-05-05 +**Task:** Identify internal contradictions, logical inconsistencies, and conflicting rules +in gargoyle's `order-state-machine.md` (311 lines) — a document defining states, +transitions, invariants, fill precedence rules, and time-in-force behavior. +**How we used them:** Same document (full text) + same focused analytical question to all +3 models via HAI proxy. Prompt specifically asked for: state machine contradictions, +semantic conflicts, rule violations, implicit contradictions, and terminology +inconsistencies. Required each finding to quote the conflicting statements, explain +the logical argument, assign severity, and recommend which statement should "win." +No tools, no project context beyond the document itself. + +| Model | Time | Output tokens | Reasoning tokens | Contradictions found | +|---|---|---|---|---| +| GPT-5 | 162s | 12,074 | 11,008 | 4 | +| Claude Opus 4.6 | 41s | 2,056 | (internal) | 6 | +| Claude Sonnet 4.6 | 17s | 826 | (internal) | 4 | + +**What they found — common ground (2+ models identified):** + +- **Missing `pending_cancel → partially_filled` revert transition** (GPT-5 #1 + + Opus #2 + Sonnet partial): The "Rejection reverts" invariant states orders return + to their "pre-modification state (`working` or `partially_filled`)", but the state + diagram only shows `pending_cancel → working` for cancel rejection — no path back + to `partially_filled`. All models correctly identified this as the diagram being + incomplete relative to the stated invariant. GPT-5 and Opus rated CRITICAL. +- **Same issue for `pending_replace` revert** (GPT-5 #1 + Opus #3): The state diagram + only shows `pending_replace → working` for replace rejection, but a replace + requested from `partially_filled` should revert to `partially_filled`. Same root + cause as above, just the replace variant. +- **FOK "never partially fills" vs state machine allowing it** (GPT-5 #2 + Opus #4): + The TIF table says FOK "never partially fills" but the state machine has no guards + preventing FOK orders from reaching `partially_filled`. Both correctly noted this + is a broker-enforced guarantee but the document presents it as system-level. +- **`rejection_reason` described as "broker-provided" but local rejections exist** + (GPT-5 #4 + Opus #5 + Sonnet): `pending → rejected` is "local validation failure" + with no broker interaction, but the field says "Broker-provided reason when + rejected." All three caught this terminology inconsistency. + +**GPT-5 unique findings (not in either other model):** + +- **IOC valid terminal states exclude `expired` vs generic expiry transitions** (#3): + IOC should never reach `expired` (unfilled portion is cancelled immediately), but + the state diagram allows any order to transition to `expired` without TIF guards. + Well-reasoned extension of the FOK finding to IOC semantics. GPT-5 correctly + identified that broker "expired-like" outcomes should map to `cancelled` for IOC. + +**Claude Opus unique findings (not in either other model):** + +- **Terminal states that aren't terminal — the `partially_filled` re-entry problem** + (#1): Opus identified the DEEPER contradiction beyond the surface-level "cancelled + states have outgoing transitions." When `cancelled → partially_filled` fires via + late fill, the order is now non-terminal with NO defined mechanism to re-terminate + if no further fills arrive. The order is stuck in `partially_filled` indefinitely. + This goes beyond "the diagram contradicts the definition of terminal" to "the fill + precedence rule creates an unspecified operational scenario." This is the most + architecturally significant finding across all three models. +- **Fill precedence label misapplication to non-terminal states** (#6): The state + diagram labels transitions from `pending_cancel → partially_filled` and + `pending_replace → partially_filled` as "fill precedence," but the Fill + Precedence Rule explicitly defines itself as overriding TERMINAL states. + `pending_cancel` is non-terminal. The label conflates two different mechanisms + (fill during pending modification vs. fill overriding terminal state), which + could cause implementers to use the same code path for fundamentally different + scenarios. + +**Claude Sonnet unique findings (not in either other model):** + +- **State diagram terminal arrow contradiction** (#1): Sonnet was the only model to + explicitly note that the Mermaid diagram shows `cancelled → [*]` (terminal arrow) + while simultaneously showing `cancelled → partially_filled` (outgoing transition). + A valid observation but more surface-level than Opus's deeper analysis of the same + phenomenon. +- **Pending replace fill logic error** (#3): Sonnet argued that receiving a fill + during `pending_replace` creates a logical impossibility because the order + parameters are in flux. This is WRONG — fills always apply to current parameters + (the replace hasn't been confirmed yet), and the document actually handles this + correctly. This is a FALSE POSITIVE from Sonnet. + +**Quality assessment:** + +- **Claude Opus** was the clear winner for this task. Found the most contradictions + (6), had the highest precision (0 false positives), and — crucially — found + qualitatively deeper issues. The `partially_filled` re-entry problem (#1) isn't + just "the diagram has a missing transition" but "the fill precedence rule creates + an unresolvable operational state." The fill precedence label misapplication (#6) + identifies a conceptual confusion that would genuinely cause implementation bugs. + Opus completed in only 41s with 2,056 output tokens — by far the most efficient. +- **GPT-5** found 4 genuine contradictions with 0 false positives but spent an + extraordinary amount of reasoning tokens (11,008) for modest output (1,066 visible + content tokens, 10.3:1 reasoning ratio). The IOC finding was unique and valuable. + But the cost is disproportionate: 162s and 12K tokens for 4 findings vs Opus's + 41s and 2K tokens for 6 findings. GPT-5's reasoning budget seems to have been + mostly spent on VERIFICATION (confirming each finding is genuine), consistent + with Finding #20's observation. +- **Claude Sonnet** was fastest (17s) and found 4 items, but one was a false positive + (the pending_replace logic error claim is incorrect). That gives it a precision of + 75% (3/4 genuine) — the lowest of the three. Its genuine findings were all also + found by the other models (no unique true contributions). Sonnet appears to trade + speed for accuracy on contradiction detection. + +**Key insight — contradiction detection favors precision-oriented models:** + +This task is fundamentally about LOGICAL ARGUMENTATION: proving that two statements +cannot both be true. Unlike assumption-finding (which is about imagining what could go +wrong) or gap-finding (which is about identifying missing content), contradiction +detection requires the model to: +1. Hold two statements in working memory simultaneously +2. Construct a formal argument for why they conflict +3. NOT get confused by statements that SEEM contradictory but are actually consistent + +Requirement #3 is where models diverge. Sonnet produced a false positive because it +didn't fully reason through whether the pending_replace fill scenario is actually +inconsistent (it isn't — current parameters apply). Opus avoided this trap entirely +and additionally found DEEPER contradictions that require multi-step logical reasoning +(the re-entry problem, the label misapplication). GPT-5 also avoided false positives +but at massive computational cost. + +**Opus's efficiency advantage:** +This is the first task where Opus is not just qualitatively better but also +quantitatively more efficient. 6 findings in 41s and 2K tokens vs GPT-5's 4 findings +in 162s and 12K tokens. That's 3x more findings per token and 4x faster. For +contradiction detection specifically, Opus appears to have a structural advantage — +possibly because its internal reasoning is better calibrated for logical argumentation +than GPT-5's externalized reasoning chain. + +**Comparison to Finding #20 (invariant violation paths):** +In Finding #20, GPT-5 was maximally selective (3 findings, all genuine, 15:1 +reasoning ratio). Here, GPT-5 shows the same pattern: few findings, all genuine, +high reasoning ratio (10.3:1). The difference: in #20, GPT-5's selectivity meant +it found UNIQUE violations others missed. Here, all of GPT-5's findings were also +found by Opus (plus Opus found 2 more). GPT-5's high verification bar doesn't help +when Opus is ALSO precise AND more thorough. + +**Updated task-model assignment:** + +For contradiction/consistency checking: +1. **Opus** — best choice: highest precision, deepest contradictions, most efficient +2. **GPT-5** — solid backup: zero false positives, unique TIF-related insights, but + expensive and slower +3. **Sonnet** — NOT recommended for this task: produces false positives, no unique + true contributions + +This confirms the emerging pattern: each model has task types where it excels. +Opus excels at logical argumentation and design tensions. GPT-5 excels at +exhaustive enumeration and operational concerns. Sonnet excels at speed and +structural/assumption analysis but struggles with tasks requiring formal logical +reasoning (contradiction detection, concurrency analysis per Finding #13). + +**Practical implication:** When reviewing architecture documents for internal +consistency (e.g., before implementation begins), run Opus. If budget allows, +add GPT-5 for TIF/edge-case coverage. Skip Sonnet for consistency checking — +its speed advantage is negated by the false positive risk. diff --git a/findings/2026-05-05-26-missingfeature-identification-is-promptable-across.md b/findings/2026-05-05-26-missingfeature-identification-is-promptable-across.md new file mode 100644 index 0000000..72a5c9d --- /dev/null +++ b/findings/2026-05-05-26-missingfeature-identification-is-promptable-across.md @@ -0,0 +1,158 @@ +# Finding 26: Missing-feature identification IS promptable across all models; prompt framing eliminates Opus's historical advantage — all three models find regulatory gaps when explicitly asked + +**Date:** 2026-05-05 +**Task:** Identify computations, behaviors, or features that gargoyle's +`corporate-actions.md` (992 lines) SHOULD perform for financial correctness, +regulatory compliance, or operational safety — but doesn't describe. +**How we used them:** Same document (full text) + same focused analytical +prompt to all 3 models via HAI proxy. Prompt explicitly structured around 5 +categories: missing computations, missing behaviors, missing validations, +missing integrations, and regulatory gaps. Required concrete findings with +severity. No tools, no project context beyond the document. GPT-5 via +OpenAI endpoint (16K max_completion_tokens), Opus 4.6 and Sonnet 4.6 via +Anthropic endpoint (8K max_tokens). + +| Model | Output tokens | Reasoning tokens | Findings | Critical | High | Medium | +|---|---|---|---|---|---|---| +| GPT-5 | 11,354 | 8,512 | 20 | 3 | 10 | 7 | +| Claude Opus 4.6 | 4,111 | (internal) | 23 | 6 | 10 | 7 | +| Claude Sonnet 4.6 | 4,686 | (internal) | 15 | 5 | 6 | 4 | + +**What they found — common ground (all 3 identified):** +- Wash sale rule interaction with CA-driven lot closures (IRC §1091) +- Short position treatment for corporate actions +- Same-day corporate action ordering beyond `recorded_at` timestamp +- Record date / ex-date position verification (entitlement timing) +- Idempotency guard preventing double-application per user +- Decimal precision/rounding policy unspecified +- Superseded CA status has no lot rollback mechanism +- Rights/warrants post-creation lifecycle (exercise/expiration) +- Basis preservation invariant has no runtime enforcement +- Manual entry authorization and audit trail + +**GPT-5 unique findings (not in either Claude model):** +- Per-lot eligibility based on entitlement date (not just user-level) +- Election-based outcomes for shareholder choices (cash vs stock) +- Instrument-level trading hold during CA application window +- Pre-application consistency checks against broker entitlements +- DB-level enforcement of status transitions and invariants +- Action-type-specific date semantics per field (ex vs record vs payable) +- Voluntary/tender actions beyond distributions +- Backfill/initialization guard for newly onboarded users +- Applicator retry/backoff semantics and confirmation race +- Rights indivisibility constraints vs exact Decimal quantities + +**Claude Opus unique findings (not in either other model):** +- Pending order PRICE adjustment after splits (not just cancellation) +- Multi-instrument position recalculation atomicity for mergers +- Mixed merger basis floor at zero (can produce negative basis) +- Tax lot identification method interaction with inherited dates +- Corporate action effect on strategy position limits/risk params +- Corporate actions on instruments not yet in the database +- Partial application window: new user acquires position mid-fan-out +- IRC §305(c) deemed distributions (taxable stock dividends) +- CA impact on unrealized P&L display and strategy evaluation +- Concurrent OrderManager startup + Applicator fan-out race + +**Claude Sonnet unique findings (not in either other model):** +- Stale orders: failure modes table contradicts "excluded" section +- IRC §1223(1) holding period tacking verification at lot close +- Spinoff allocation percentage — no validation child != parent instrument +- Combined spinoff allocations exceeding meaningful bounds +- Cash dividend bypasses OrderManager — record-date quantity snapshot lost +- Mixed merger large-denominator exchange ratio overflow +- Detector schedule: no intraday re-poll for same-day announcements +- ROC cumulative basis floor tracking and IRC §301(c)(3) distinction +- Mixed merger deferred loss not explicitly recorded in metadata + +**Quality assessment:** +- **Claude Opus** was the MOST PROLIFIC (23 findings) — a notable inversion + from previous experiments where Opus typically found fewer but deeper + findings. Here, the explicit "missing feature" framing appears to have + unlocked Opus's breadth. Its unique findings included genuinely critical + items: pending order price adjustment after splits (Critical — direct + financial loss), multi-instrument atomicity for mergers (Critical — + position loss), and mixed merger negative basis (High — accounting + corruption). The findings were precise, well-reasoned, and showed both + regulatory depth (IRC §305(c)) and operational awareness. +- **GPT-5** was slightly less prolific (20 findings) but maintained its + characteristic breadth and operational-level thinking. Per-lot eligibility + (not just per-user) is a subtle but important distinction. The election- + based outcomes finding shows awareness of real-world corporate action + complexity. The backfill/initialization guard is operationally significant. + GPT-5 spent 8,512 reasoning tokens — moderate for its output volume. +- **Claude Sonnet** found fewer gaps (15) but several were genuinely + insightful. The internal contradiction between the failure modes table + and the "excluded" section is a real document inconsistency. The cash + dividend record-date quantity snapshot insight (#9) identifies a DATA LOSS + problem — the opportunity to capture that data expires. The mixed merger + deferred loss recording gap shows regulatory awareness. However, some + findings were more surface-level or overlapped heavily with the others. + +**KEY INSIGHT — The original question from Finding #22 is ANSWERED:** + +> "Opus's 'missing feature identification' mode (wash sales, commissions) — +> is this promptable on other models? Could we explicitly ask GPT-5 'what +> should this system compute but doesn't' and get similar results?" + +**YES.** When explicitly prompted with a structured "missing feature" +framing, ALL three models found regulatory gaps (wash sales, IRC sections), +missing computations (basis calculations, rounding), and missing behaviors +(lifecycle events, notifications). GPT-5 produced findings in the same +*category* as what Opus uniquely found in Finding #22 (silent correctness +failures on specid-lot-selection.md). + +In Finding #22, Opus uniquely identified wash sales and commission tracking +as missing features while GPT-5 focused on mechanism incorrectness and +Sonnet on composition failures. HERE, with the explicit "what's missing" +prompt, ALL three models found wash sales, ALL found regulatory gaps, and +ALL found missing behaviors. + +**This confirms:** Opus's "missing feature identification" mode in Finding +#22 was NOT an inherent model capability — it was an emergent behavior from +the open-ended "silent correctness failures" prompt. When you give ALL models +the EXPLICIT instruction to look for missing features, they all do it. The +differentiation from #22 was caused by the prompt being more open-ended, +allowing each model to default to its natural analytical mode: +- Opus → "what's missing" (features/functionality) +- GPT-5 → "what's wrong" (mechanism failures) +- Sonnet → "what breaks when combined" (composition) + +**Prompt framing dominates model personality.** With the right prompt, +any model can be directed into any analytical mode. The model differences +that emerged in earlier open-ended experiments reflect DEFAULT TENDENCIES, +not capabilities. + +**NEW finding about Opus on complex documents:** +Opus produced MORE findings than GPT-5 (23 vs 20) — the first time this +has happened on a broad analytical task. Previous pattern: GPT-5 always +finds more (20-33 findings) while Opus finds fewer but deeper (7-13). +What changed? The document is 992 lines — the longest tested — and the +task is explicitly about breadth ("find all gaps"). On this specific +combination (long document + breadth-focused prompt), Opus appears to +allocate its internal reasoning budget toward exploration rather than +its usual depth-first design-tension mode. This suggests Opus's typical +"fewer but deeper" pattern is partially a RESPONSE to shorter documents +where depth is more productive than breadth. + +**Practical implications:** +1. For missing-feature analysis: prompt structure matters more than model + choice. All three models are viable. Use the explicit 5-category prompt. +2. Run all three for critical docs — they find different specific gaps + despite finding the same categories. +3. For open-ended analysis where you want models to find DIFFERENT things: + use open-ended prompts. For analysis where you want COMPREHENSIVE + coverage of one type: use structured prompts. +4. Opus's "fewer but deeper" personality can be overridden by document + length + breadth-focused prompt. On 992-line docs, it competes on + volume with GPT-5. + +**Cost-effectiveness:** +Opus: 4,111 output tokens for 23 findings = 179 tokens/finding +GPT-5: 11,354 output tokens (+ 8,512 reasoning) for 20 findings = 993 tokens/finding +Sonnet: 4,686 output tokens for 15 findings = 312 tokens/finding + +Opus is by far the most efficient: nearly 6x fewer tokens than GPT-5 per +finding, with MORE findings. This is the strongest cost-effectiveness case +for Opus on any tested task. On long documents with breadth-focused prompts, +Opus appears to be the optimal choice for both quality AND efficiency. diff --git a/findings/2026-05-05-27-design-coherence-on-riskcontrolsmd-gpt5.md b/findings/2026-05-05-27-design-coherence-on-riskcontrolsmd-gpt5.md new file mode 100644 index 0000000..79562be --- /dev/null +++ b/findings/2026-05-05-27-design-coherence-on-riskcontrolsmd-gpt5.md @@ -0,0 +1,276 @@ +# Finding 27: Design coherence on risk-controls.md: GPT-5 regains top position; Opus's advantage from Finding #15 was document-specific, not task-specific + +**Date:** 2026-05-05 +**Task:** Identify internal design incoherences in gargoyle's `risk-controls.md` (277 lines) +— a pre-trade risk control specification covering two evaluation stages, reduction semantics, +ordering rationale, fail-closed claims, and audit logging. +**How we used them:** Same document (full text) + same focused analytical question to all +3 models via HAI proxy. Highly structured prompt specifying 5 categories of incoherence +(safety properties not enforced, ordering/sequencing contradictions, reduction semantics +conflicts, fail-closed claims vs actual behavior, cross-stage inconsistencies). Required +each finding to reference specific contradictory parts. No tools, no project context beyond +the document itself. + +| Model | Time | Output tokens | Reasoning tokens | Incoherences found | Critical | High | Medium | +|---|---|---|---|---|---|---|---| +| GPT-5 | 112s | 8,231 | 7,232 | 6 | 1 | 3 | 2 | +| Claude Opus 4.6 | 41s | 1,858 | (internal) | 5 | 2 | 2 | 1 | +| Claude Sonnet 4.6 | 15s | 699 | (internal) | 4 | 1 | 2 | 1 | + +**What they found — common ground (all 3 identified):** +- Reduction re-entry at BuyingPower contradicts "reducing controls never re-enter + earlier controls" (all three flagged this as the most obvious contradiction — + Concentration at position 5 reduces, re-enters at BuyingPower at position 4, + which IS an earlier control) +- Ordering rationale's categorization of buying power/concentration is internally + confused (the doc labels both as "quantity-sensitive checks" that run after + reducing controls, but concentration IS a reducing control at position 5 while + buying power at position 4 sits between the two reducing controls) + +**GPT-5 unique findings (not in either Claude model):** +- Signal-level "no portfolio context" contradicts NoShortSales requiring knowledge + of current positions. The doc explicitly states signals are evaluated "in isolation" + with "no portfolio context — only the signal itself and user settings" — but checking + whether the user holds a position IS portfolio context. This is a genuine design + tension: either SignalRisk has hidden portfolio access (violating isolation) or + NoShortSales can't actually work as specified. +- Settings "fall through to system defaults" vs "Settings cache miss → reject." + Two incompatible instructions for the same condition (missing settings). +- "Universal fail-closed" with "only exception is order rate window" contradicted + by Failure Modes table showing buying power as another exception ("Conservative + estimate; may over-reject" is NOT rejection — it's a different failure mode than + either fail-closed or the documented single exception). +- Audit model says "every control evaluation produces an audit entry regardless of + outcome" but the signal-stage write point only describes writing on rejection. + Passing signals produce no documented audit entry at the signal stage. + +**Claude Opus unique findings (not in either other model):** +- Signal flow diagram swaps control order vs table: table shows (1) MarketHours, + (2) PerTradeStop, (3) NoShortSales, but diagram flows MarketHours → NoShortSales + → PerTradeStopLoss. Controls 2 and 3 are reversed between the two representations. + (VERIFIED: this is correct — the diagram does show a different order.) +- Concentration re-entry loop can bypass Order Rate, Duplicate, Self-Trade, and + Fat Finger entirely during intermediate iterations. Also: Position Size at order 3 + is never re-checked against Concentration-reduced quantity because re-entry starts + at BuyingPower (order 4), meaning "smaller quantity wins" semantics are implemented + differently than the linear model described in Reduction Semantics. + +**Claude Sonnet unique findings (not in either other model):** +- Buying Power "Can reduce? No" creates a dead end: if a reduced quantity still + exceeds buying power, the system can only reject entirely (no mechanism to further + optimize), defeating the purpose of the reduction system for capital-limited users. + (NOTE: this is more of a design limitation than a self-contradiction, but the + framing — that the reduction system's purpose is undermined by buying power's + inability to reduce — is a legitimate coherence observation.) + +**Quality assessment:** +- **GPT-5** produced the most findings (6) with the broadest coverage across the + prompt's 5 categories. The NoShortSales/portfolio-context finding is the most + genuinely insightful — it's a fundamental design-level contradiction (a signal-level + control that REQUIRES decision-level context). The settings contradiction and + audit logging inconsistency are also solid. Every finding points to two specific + textual statements that are incompatible. Severity ratings were calibrated (1 + Critical, 3 High, 2 Medium — compared to Opus's 2 Critical for similar findings). +- **Claude Opus** was remarkably fast (41s, 1,858 tokens) and found one thing + neither other model caught: the diagram/table order reversal for signal controls. + This is a concrete, verifiable error (not a design tension — a literal mistake in + the document). The re-entry loop analysis (finding #5) goes deeper than GPT-5's + version of the same core issue, exploring the implications for "smaller quantity + wins" semantics. However, Opus found fewer total issues and missed the + settings contradiction and audit logging inconsistency. +- **Claude Sonnet** was the fastest (15s, 699 tokens) and found 4 issues. The buying + power dead-end observation is unique and shows genuine reasoning about the reduction + system's limitations. However, it's more of a "this design can't achieve its stated + goal" than a strict self-contradiction. Sonnet's other findings overlap with the + common ground. Quality is solid but narrower scope. + +**Key insight — Finding #15's Opus > GPT-5 result was document-specific:** +In Finding #15 (coherence checking on failure-modes.md), Opus found 7 incoherences +vs GPT-5's 4. Here, on risk-controls.md, GPT-5 found 6 vs Opus's 5. The reversal +suggests that the relative performance on coherence checking depends on the +DOCUMENT'S structure, not on a fixed model advantage: + +- **failure-modes.md** (383 lines): A complex multi-process system with many + stated invariants across failure states, supervision trees, and recovery paths. + Rich in design TENSIONS where one subsystem's safety mechanism undermines another. + This plays to Opus's strength (finding design tensions between subsystems). +- **risk-controls.md** (277 lines): A more focused specification with explicit rules, + ordering constraints, and behavior tables. Rich in SPECIFICATION CONTRADICTIONS + where one statement directly conflicts with another. This plays to GPT-5's + strength (systematic verification of claims against stated mechanisms). + +The difference: Opus excels when contradictions are EMERGENT (arise from composing +multiple design decisions). GPT-5 excels when contradictions are EXPLICIT (two +statements in the document say incompatible things). Risk-controls.md has more +explicit contradictions (the settings fallback vs fail-closed, the "no portfolio +context" vs NoShortSales, the audit "always" vs write point "only on reject"). + +**Model performance depends on CONTRADICTION TYPE:** +| Contradiction type | Best model | Example | +|---|---|---| +| Emergent/compositional | Opus | "Rest-for-one cascade creates a 5th state" | +| Explicit/definitional | GPT-5 | "No portfolio context" but check requires portfolio | +| Diagrammatic/structural | Opus | Table order ≠ diagram order | +| Semantic/category confusion | All (common ground) | Reduction re-entry violates ordering claims | + +**Revised conclusion on Finding #15's open question:** +"Does Opus > GPT-5 ordering for coherence checking hold across other documents?" +**No.** The ordering depends on the document's contradiction density and type. +Documents rich in emergent design tensions favor Opus. Documents with explicit +specification errors favor GPT-5. The task type (coherence checking) doesn't have +a fixed model winner — it depends on what KIND of incoherences the document contains. + +**Practical implication:** Continue running both models for coherence checking. Their +strengths are complementary even within the same task type. GPT-5 catches things you +can point to in the spec and say "these two sentences conflict." Opus catches things +where you need to reason about the implications of multiple mechanisms interacting. + +## Open Questions + +- Does GPT's advantage in finding inconsistencies extend to logical + inconsistencies in arguments? One data point (verdict mismatches) — need more. +- What's the optimal task granularity for GPT analytical review? "Whole PR" is + too big. Is "one hypothesis" right, or can we batch? +- ~~Is the GPT-4.1 Mini bias detection result repeatable, or was it a well- + structured task that any model would ace?~~ **ANSWERED (Finding #8):** Any + model aces it when the biased text is presented without noise. The original + result was about noise elimination, not model capability. +- **NEW:** Does adding a narrow bias-check question to a rich PR review + context recover the detection that broad review misses? (Signal-to-noise + confirmation test) +- ~~How does reasoning_effort affect analytical quality? Only tested default so + far.~~ **ANSWERED (Finding #21):** Negligible effect on GPT-5 for open-ended + analytical tasks. Low/medium/high produced 33/30/30 findings with nearly + identical reasoning tokens (~4K) and per-finding depth. The parameter + may primarily affect verifiable-answer tasks, not exploration. Task framing + remains the dominant quality lever. +- Can we design a systematic "analytical review checklist" that leverages each + model's strengths? +- ~~What analytical tasks is Opus best at vs Sonnet? Finding #11 shows Opus + excels at design-tension identification. How does Sonnet compare on the + same task? (Sonnet is non-reasoning but fast — would it match GPT-4.1?)~~ + **ANSWERED (Finding #12):** Sonnet 4.6 significantly outperforms GPT-4.1 + (17 vs ~14 assumptions) and approaches GPT-5 (17 vs 20). It's not a + non-reasoning model in the GPT-4.1 sense — it occupies a middle tier with + genuine component-interaction reasoning. Opus still wins on design-tension + identification specifically. +- How do the models compare on research synthesis tasks (our #381 rewrite)? + We'll find out during the actual rewrite. +- ~~Does the reasoning-token advantage scale with document complexity? Test + with a simpler doc to see if the gap narrows.~~ **ANSWERED (Finding #11):** + The gap doesn't narrow with simpler docs. GPT-5 maintains ~1.7x the findings + of GPT-4.1 regardless of document complexity. Reasoning tokens enable + exhaustive exploration independent of input difficulty. +- ~~Would Claude Opus (also a reasoning model) match GPT-5's assumption-finding + performance, or does it have different blind spots?~~ **ANSWERED (Finding #11):** + Different blind spots, different strengths. GPT-5 reasons deeper into + implementation mechanics (breadth + technical depth). Opus reasons wider + about system context and design tensions (insight density). They're + complementary, not competing. Run both on important architecture docs. +- ~~Does Sonnet 4.6's strong showing hold across other analytical tasks + (bias detection, gap-finding) or is it specific to assumption-finding on + complex documents? Need to test Sonnet on simpler docs and different question + types.~~ **PARTIALLY ANSWERED (Finding #13):** Sonnet's strength does NOT + transfer to concurrency reasoning. It dropped from 85% of GPT-5 (assumption- + finding) to ~58% (race condition identification). Task type matters more + than we thought. Still untested: gap-finding, bias detection for Sonnet. +- **NEW:** What other analytical tasks require sequential/temporal reasoning + (like race condition identification) vs pattern-matching reasoning (like + assumption-finding)? Building a task taxonomy would help assign models + correctly. +- **NEW:** What explains Sonnet taking slightly longer than Opus here (106s vs + 105s) despite normally being the faster model? Is it the document length, or + does Sonnet's internal reasoning scale with complexity similarly to Opus? +- ~~How does GPT-5 Mini compare to GPT-5 on analytical tasks? Is it a viable + cheaper substitute?~~ **ANSWERED (Finding #14):** GPT-5 Mini is a viable + middle option. Finds fewer issues (6 vs 10) but with genuine reasoning + depth at ~50% cost/time. Better than non-reasoning models, not as + exhaustive as GPT-5. +- **NEW:** How does Claude 4.5 Opus/Sonnet compare to Claude 4.6? HAI now + exposes both; worth testing whether the newer versions regress on + analytical tasks. +- ~~Would running GPT-5 Mini + Sonnet together (different axes) + approach GPT-5's coverage at lower combined cost?~~ **ANSWERED (Finding #19):** + 71% coverage at 31% cost. Good for low-stakes work; GPT-5 irreplaceable for + high-stakes due to unique domain-knowledge findings in the missing 29%. +- ~~**NEW (Finding #15):** Does the Opus > GPT-5 ordering for coherence checking + hold across other documents? The inversion (Opus finding more than GPT-5) + was striking — need to confirm it wasn't document-specific.~~ + **ANSWERED (Finding #27):** No — it was document-specific. On risk-controls.md, + GPT-5 found 6 vs Opus's 5. The winner depends on contradiction TYPE: Opus + excels at emergent/compositional contradictions, GPT-5 at explicit/definitional + ones. No fixed ordering for this task type. +- **NEW (Finding #15):** Is the two-pass approach (Opus generates → GPT-5 + validates) worth the extra cost vs just running Opus alone? Need to test + whether GPT-5 actually catches Opus false-positives or just agrees. +- ~~How do the Claude 4.5 and 4.6 models compare on analytical tasks?~~ + **ANSWERED (Finding #16):** 4.5 is more exhaustive (2x findings), 4.6 is + more precise (higher signal-to-noise). Genuine tradeoff, not a regression. + 4.5 for coverage, 4.6 for actionability. +- **NEW (Finding #16):** Does the 4.5 vs 4.6 pattern hold across other task + types? Spec completeness may favor exhaustiveness; would coherence checking + or race condition analysis show the same pattern? +- **NEW (Finding #16):** Is running both Sonnet versions (4.5 + 4.6) cost- + effective vs just running GPT-5? Need to compare the UNION of their findings + against GPT-5's output for overlap analysis. +- **NEW (Finding #18):** Does Opus's "predictable exploit window" detection + transfer to other policy documents? It uniquely identified that the cooldown + mechanism creates a GUARANTEED safe window that strategies could systematically + exploit — this is a higher-order security insight. Worth testing whether Opus + consistently finds "adversarial opportunity" framings that other models miss. +- **NEW (Finding #20):** Does GPT-5's extreme verification behavior (15:1 + reasoning-to-output ratio, 3 findings from 12K reasoning) persist across + other documents with this prompt? Or was user-pipeline-lifecycle.md + particularly verification-heavy? Test invariant violation paths on a simpler + document. +- **NEW (Finding #20):** Would giving GPT-5 a "minimum 8 findings" instruction + reduce its selectivity and produce MORE invariant violations at lower + precision? Or would it just pad with non-violations? The extreme selectivity + may be a feature OR it may mean GPT-5 is discarding valid-but-hard-to-verify + findings. +- **NEW (Finding #20):** Opus's self-correction behavior is now confirmed across + Finding #15 and #20. Is this trainable/promptable? Could we ask non-Opus models + to "show your reasoning and withdraw findings you cannot fully verify"? +- **NEW (Finding #22):** The "silent correctness" lens revealed three distinct + analytical modes: Opus → missing functionality, GPT-5 → mechanism incorrectness, + Sonnet → composition failures. Does this three-way differentiation hold on other + documents, or was it specific to the regulatory/financial domain of specid-lot-selection? +- **NEW (Finding #22):** Does the "silent correctness" lens work on non-financial + documents? The financial/regulatory domain has a large gap between syntactic and + semantic correctness. Would the same prompt on an infrastructure/systems doc produce + equally differentiated findings, or would it collapse into assumption-finding? +- ~~**NEW (Finding #22):** Opus's "missing feature identification" mode (wash sales, + commissions) — is this promptable on other models? Could we explicitly ask GPT-5 + "what should this system compute but doesn't" and get similar results?~~ + **ANSWERED (Finding #26):** YES — all three models find regulatory gaps and + missing features when explicitly prompted. Opus's unique behavior in #22 was + an emergent DEFAULT tendency, not a capability. Prompt framing dominates + model personality. + +- **NEW (Finding #28):** Cross-document consistency found real bugs in gargoyle + docs (fills vs events, position ownership, signal persistence). Does running + this analysis across MORE document pairs (e.g., domain readmes vs implementation + docs, design docs vs plan docs) yield additional real inconsistencies? Could + become a systematic documentation maintenance tool. +- **NEW (Finding #28):** Opus was 2.4x faster AND found more issues than GPT-5 + on cross-document consistency. Is this because cross-doc contradictions are + easy to verify once spotted (reducing GPT-5's verification advantage)? Or + because boundary reasoning (Opus's strength) is the primary skill needed? + +## Methodology Notes + +- Internet opinions about models are overwhelmingly about coding. Don't + extrapolate to analytical work without testing. +- "Just because someone says it on the internet doesn't make it right." — + Aaron, 2026-04-26. Opinions need context. Track our own evidence. +- Absence of published methodology for a use case is itself a finding. +- Each finding needs: date, task, **how we used it** (context shape, task + framing, what info the model had/didn't have), what happened, takeaway. + No unsupported generalizations. +- **Context dimensions to track:** + - Rich vs minimal (how much background info) + - Broad vs focused ("review this" vs "answer this specific question") + - What kind of context (diff, full files, issue text, research notes, + project conventions, nothing) + - Whether the model had access to tools or just text + - Whether the task was explicit step-by-step or open-ended diff --git a/findings/2026-05-05-28-crossdocument-consistency-analysis-new-task.md b/findings/2026-05-05-28-crossdocument-consistency-analysis-new-task.md new file mode 100644 index 0000000..a054a59 --- /dev/null +++ b/findings/2026-05-05-28-crossdocument-consistency-analysis-new-task.md @@ -0,0 +1,178 @@ +# Finding 28: Cross-document consistency analysis: NEW task type — GPT-5 finds deep semantic contradictions; Opus finds structural/boundary mismatches; Sonnet identifies core issues quickly + +**Date:** 2026-05-05 +**Task:** Identify contradictions and inconsistencies BETWEEN two architecture documents +describing the same system: `system-overview.md` (323 lines, narrative overview with +component flows, invariants, and domain events) and `architecture.md` (213 lines, +DDD-focused with bounded contexts, context map, and message taxonomy). +**How we used them:** BOTH documents provided as full text in a single prompt (~25KB +total). Highly structured prompt specifying 5 categories of cross-document inconsistency +(terminology conflicts, structural contradictions, flow/sequence conflicts, +ownership/authority conflicts, philosophical contradictions). Required specific output +format per finding. Explicitly excluded omissions (things one doc covers and the other +doesn't) and detail-level differences. No tools, no project context beyond the two +documents. This is a NEW analytical task not previously tested: reasoning about +CONSISTENCY BETWEEN documents rather than internal coherence of a single document. + +| Model | Time | Output tokens | Reasoning tokens | Inconsistencies found | Critical | High | Medium | +|---|---|---|---|---|---|---|---| +| GPT-5 | 125s | 9,415 | 8,384 | 6 | 2 | 3 | 1 | +| Claude Opus 4.6 | 52s | 2,351 | (internal) | 7 | 3 | 3 | 1 | +| Claude Sonnet 4.6 | 14s | 776 | (internal) | 4 | 1 | 2 | 1 | + +**What they found — common ground (all 3 identified):** +- Event sourcing (all events as source of truth) vs fills-only ground truth: + Document A says fills are "ground truth from which all other state can be + derived," while Document B says "events are the source of truth, state is + computed by replaying events." A treats fills as the recovery foundation; + B treats ALL domain events as authoritative. All three models rated this + Critical. +- Bounded context naming mismatch: "Decision Engine" / "Order Management" (A) + vs "Engine" / "Trading" (B) for the same functional responsibilities. + GPT-5 folded this into a broader ownership analysis; Opus and Sonnet + surfaced it as its own finding. +- Signal classification conflict: Document A lists "Signal emitted" as a domain + event; Document B explicitly categorizes `SignalEmitted` as an audit event + ("not used to rebuild state"). This determines event store design and + recovery semantics. + +**GPT-5 unique findings (not in either Claude model):** +- Signal persistence contradiction: Document A states "Signals are never + persisted" while Document B lists `SignalEmitted` as an audit event that IS + persisted and states the audit log is mandatory for trading. These are + directly incompatible claims about whether signal data is stored. +- Audit event ownership conflict: Document A says "Decision approved" events + originate from PortfolioRisk. Document B states "only the decision engine + writes audit events" and lists `DecisionApproved` as an audit event example. + If PortfolioRisk is part of Risk (not Engine), this is an authority violation. +- "Single writer per user" (A: OrderManager writes all trading state) vs + per-aggregate single-writer (B: each aggregate writes its own event stream, + Ledger owns positions). These are incompatible authority models — either OM + centralizes writes or each domain owns its own events. + +**Claude Opus unique findings (not in either other model):** +- Engine → OrderManager is an internal pipeline flow (A: same subgraph, direct + arrow) vs Engine → Trading is a cross-domain COMMAND (B: `PlaceOrder` command + crossing a bounded context boundary). This structural disagreement determines + whether order management is an internal pipeline stage or an independent domain + with its own aggregates and command validation. +- Signal Risk's architectural position: Document A shows a two-stage risk + architecture (SignalRisk pre-aggregation, PortfolioRisk post-aggregation) + where Risk is embedded in the pipeline. Document B's context map shows Risk + as a separate domain that Engine merely QUERIES ("kill switch active?") — + no arrow shows signal routing through Risk. Either risk logic lives inside + Engine (contradicting B's context boundary) or the context map is incomplete. +- The "reduce" step ownership: A's top-level flow labels `Approved →|"reduce"| + Decisions` (reduction at aggregation), while A's own domain events table says + "Decision reduced" originates from PortfolioRisk (reduction after aggregation). + This is actually an INTRA-document inconsistency in Document A, but Opus surfaced + it as part of cross-doc analysis. + +**Claude Sonnet unique findings:** +- None genuinely unique. All 4 findings overlapped with GPT-5/Opus common ground + (event sourcing, signal persistence, context count/naming). Sonnet was efficient + (14s, 776 tokens) but didn't identify any inconsistency that the other two missed. + +**Quality assessment:** +- **GPT-5** produced 6 well-reasoned findings with the deepest analysis of + OWNERSHIP conflicts. Its signal-persistence contradiction and single-writer + authority conflict are genuinely important — they reveal places where the two + documents would lead implementers to build fundamentally different systems. + Every finding quotes specific text from both documents and explains precisely + WHY they can't both be correct. The reasoning investment (8,384 tokens) was + used for thorough cross-referencing between documents. +- **Claude Opus** found the most inconsistencies (7) and was remarkably fast + (52s vs GPT-5's 125s). Its unique strength: identifying STRUCTURAL contradictions + about component boundaries and communication patterns. The Engine→Trading + command vs internal pipeline finding is architecturally the most significant + discovery — it reveals a fundamental disagreement about whether order + management is INSIDE or OUTSIDE the decision engine's boundary. Opus also + caught a bonus intra-document inconsistency (the "reduce" labeling error). +- **Claude Sonnet** was the fastest (14s) and most concise (776 tokens) but + found only the obvious common-ground issues. For cross-document consistency, + Sonnet's speed advantage came at the cost of missing the architectural + insights that make this task valuable. It did correctly identify all the + Critical-level issues, making it viable as a quick first-pass screen. + +**Key insight — cross-document consistency is a DISTINCT task type:** +This is fundamentally different from single-document analysis (assumptions, +race conditions, coherence). It requires: +1. Building a mental model from Document A +2. Building a separate mental model from Document B +3. Finding places where the models are incompatible +4. Reasoning about WHY they can't both be correct (not just "different") + +Step 4 is what distinguishes this from simple diff-detection. Many surface +differences (naming, detail level, scope) are NOT contradictions — the models +must judge which differences are genuinely incompatible vs. complementary. +The prompt explicitly excluded omissions and detail-level differences, and +all three models respected this constraint well. + +**Model strengths on cross-document analysis:** +- **GPT-5** excels at ownership/authority conflicts: it systematically + checked "who owns this concept" in each document and found mismatches. + Its findings cluster around "who writes what" and "who is authoritative." +- **Opus** excels at structural/boundary contradictions: it identified where + the documents draw architectural lines differently. Its findings cluster + around "where are the boundaries" and "what crosses them." +- **Sonnet** identifies the obvious/critical issues quickly but doesn't dig + deeper. Viable for screening, not for thorough analysis. + +**Comparison to Finding #15 / #27 (single-document coherence checking):** +Single-document coherence asks "does this document contradict itself?" +Cross-document consistency asks "do these documents contradict each other?" +Key differences in results: + +| Aspect | Single-doc coherence | Cross-doc consistency | +|---|---|---| +| Opus findings | 5-7 | 7 | +| GPT-5 findings | 4-6 | 6 | +| Sonnet findings | 4-5 | 4 | +| Opus unique | Design tensions | Structural/boundary mismatches | +| GPT-5 unique | Definitional errors | Ownership/authority conflicts | +| Best model | Task-dependent | Opus (most findings + fastest) | + +The relative ordering is similar (Opus ≥ GPT-5 > Sonnet for coherence-style +tasks), but the CHARACTER of unique findings shifted. On single-doc coherence, +Opus finds design tensions within a single design. On cross-doc consistency, +Opus finds BOUNDARY disagreements between two designs. GPT-5 shifts from +finding definitional errors to ownership conflicts. + +**Are these findings REAL bugs in the gargoyle documentation?** +Yes — several are genuine issues worth fixing: +1. The fills-vs-events-as-ground-truth is a real philosophical tension between + the two documents that needs resolution. +2. The Position event ownership (OrderManager vs Ledger) is a real boundary + conflict that affects implementation. +3. The Engine→Trading communication style (internal pipeline vs cross-domain + command) is a genuine structural ambiguity. +4. The signal persistence claim ("never persisted" vs `SignalEmitted` audit + event) is a direct textual contradiction. + +These are the kind of cross-document inconsistencies that cause teams to build +inconsistent implementations — one engineer reads Document A and builds one way, +another reads Document B and builds differently. + +**Practical implication:** Cross-document consistency analysis is a high-value +task for documentation maintenance. Run it when: +- A system has multiple architecture docs written at different times +- A refactoring has updated one doc but not another +- Multiple people contribute to design documentation +- Moving from high-level overview to detailed specification + +Opus is the recommended model for this task: fastest (52s vs 125s), most +findings (7 vs 6), and uniquely strong at boundary disagreements. GPT-5 adds +value for ownership-specific conflicts. Sonnet is sufficient for quick +screening (catches the Critical issues in 14s) but won't find the architectural +insights. + +**Cost-effectiveness:** +Opus: 2,351 output tokens for 7 findings = 336 tokens/finding (52s) +GPT-5: 9,415 output + 8,384 reasoning for 6 findings = 2,967 tokens/finding (125s) +Sonnet: 776 output tokens for 4 findings = 194 tokens/finding (14s) + +Opus is the clear winner on this task type: more findings than GPT-5, 2.4x +faster, and 8.8x more token-efficient per finding. GPT-5's massive reasoning +investment (8,384 tokens) produced only one fewer finding than Opus — the +verification overhead is not paying off here because cross-document contradictions +are relatively easy to verify once identified (just check both documents). diff --git a/findings/2026-05-05-29-adversarial-manipulation-analysis-new-task.md b/findings/2026-05-05-29-adversarial-manipulation-analysis-new-task.md new file mode 100644 index 0000000..39dacb7 --- /dev/null +++ b/findings/2026-05-05-29-adversarial-manipulation-analysis-new-task.md @@ -0,0 +1,174 @@ +# Finding 29: Adversarial manipulation analysis: NEW task type — GPT-5 is most exhaustive and systematic; Opus produces qualitatively different attack vectors with system-level thinking; Sonnet is adequate but less creative + +**Date:** 2026-05-05 +**Task:** Identify adversarial manipulation paths in gargoyle's `aggregation.md` (193 lines) +— how a misbehaving, compromised, or buggy upstream component could exploit the +aggregator's design guarantees to produce harmful trading outcomes that bypass +downstream safety controls. +**How we used them:** Same document (full text) + same focused analytical question to all +3 models via HAI proxy. Highly structured prompt specifying 5 categories of adversarial +manipulation (signal injection, timing manipulation, capacity weaponization, state +corruption via crash, audit evasion). Required specific output format per finding +(attack vector, mechanism, exploit, why downstream controls miss it, severity). No tools, +no project context beyond the document itself. + +| Model | Time | Output tokens | Reasoning tokens | Attack vectors found | Critical | High | Medium | +|---|---|---|---|---|---|---|---| +| Claude Sonnet 4.6 | 27s | 1,257 | (internal) | 10 | 3 | 5 | 2 | +| Claude Opus 4.6 | 84s | 3,662 | (internal) | 12 | 5 | 5 | 0 | +| GPT-5 | 111s | 8,808 | 6,336 | 15 | 2 | 10 | 3 | + +**What they found — common ground (all 3 identified):** +- Primary signal hijacking via ranking manipulation (last-tick injection in + time-windowed to control decision parameters) +- Threshold gaming via signal replay/duplication (no deduplication means N + identical signals satisfy "N confirmations") +- Capacity flooding to force premature completion or deny legitimate trades +- Strategic crash to erase unfavorable in-flight groups +- Timeout-masqueraded manipulation (making attacks look like normal system behavior + in the audit trail) + +**GPT-5 unique findings (not in either Claude model):** +- **Direction flip against majority via ranking:** In "most recent" ranking, + emit multiple SELL confirmations then inject a late BUY — the BUY becomes + primary and the decision contradicts the bulk of evidence. Distinct from + general primary hijack because it's specifically about *directional* reversal. +- **Late-arrival exclusion of counter-signals:** Time signals so countervailing + signals arrive just after group destruction, ensuring the decision is formed + without dissenting inputs that would have altered ranking. +- **Capacity filter to curate the audit set:** Pre-fill buffer with chosen + signals so riskier alternatives cannot be included before capacity fires — + the contributing signals list looks clean. +- **Timer nullification by crash:** Crash just before a timeout that would + force-complete an unfavorable decision — the timer becomes no-op on restart, + no decision or expiry event is emitted. +- **Decision drop via induced forwarding failure:** Exploit the "Decision + forwarding fails: Decision is lost" failure mode to selectively suppress + protective decisions (stops, hedges) with no automatic retry. +- **Crash to erase evidence of contrary signals:** Post-crash, submit a + fresh group that completes quickly; audit shows only the new set, not the + earlier contradictory pre-crash signals. + +**Claude Opus unique findings (not in either other model):** +- **Instrument fragmentation to multiply position size:** Emit signals for + economically equivalent exposures using different instrument identifiers. + Each gets its own group, each produces a separate decision, bypassing + per-group capacity limits. Combined position exceeds what any single group + would allow. Identifies TOCTOU at the fan-in to PortfolioRisk. +- **Forced stale decision via timer exploitation:** Emit one signal at a + favorable price spike known to be transient, then deliberately withhold + further signals. Timer force-completes with a stale price. The entry price + WAS valid when the signal was generated — PortfolioRisk doesn't check + staleness of decision prices. +- **Timeout prevention / keep-alive suppression:** Manipulate market data + feed to suppress signals that would reach threshold N. Group expires + normally — denial-of-trading attack disguised as insufficient confirmation. +- **Crash-restart duplicate decisions:** Crash after decision is forwarded + but before strategy reflects it. Both restart "clean" — strategy re-emits + signals, aggregator produces a second decision with a fresh ID. Same trade + executes twice. PortfolioRisk can't deduplicate because IDs are different. +- **Force-complete with insufficient confirmation (capacity < threshold):** + If capacity limit is lower than threshold, hitting capacity ALWAYS force- + completes before predicate is satisfied. Fundamentally changes a 5-confirmation + strategy into a 3-confirmation strategy. +- **Pattern predicate as arbitrary decision trigger:** If adversary controls + predicate logic (via strategy configuration), can make pattern-complete + trigger on any single signal while audit shows algorithm=pattern-complete + and reason=:predicate. Trust boundary between configuration and execution. + +**Claude Sonnet unique findings (not in either other model):** +- **Cross-group timing coordination:** Coordinate signal injection across + multiple instruments to synchronize completion times, creating a burst of + correlated decisions that overwhelm PortfolioRisk individually-safe + evaluations. (NOTE: Opus found a similar concept — instrument fragmentation + — but framed it differently: Opus focused on position multiplication via + instrument aliasing, Sonnet focused on burst timing overwhelming evaluation.) +- **Multi-strategy attack distribution:** Spread manipulation across multiple + isolated strategy aggregators so no single aggregator's behavior looks + abnormal while cumulative effect is harmful. + +**Quality assessment:** +- **GPT-5** produced the most findings (15) with the most systematic coverage + across all 5 prompt categories. Its strength was in identifying SPECIFIC + INTERLEAVINGS — exactly how timing, state, and ranking mechanisms interact + to produce exploits. The direction-flip finding (#3) and the late-arrival + exclusion finding (#6) show precise temporal reasoning about when signals + arrive relative to group lifecycle events. The "decision drop via forwarding + failure" finding exploits a DOCUMENTED failure mode (from the failure table) + as an offensive weapon — turning a recovery mechanism into an attack vector. + Every finding references specific mechanisms from the spec. +- **Claude Opus** produced 12 findings with the most architecturally creative + attacks. The instrument fragmentation attack is the most SYSTEMICALLY + dangerous finding across all three models — it's not about manipulating one + group but about the RELATIONSHIP between groups, and it identifies a + TOCTOU vulnerability at the PortfolioRisk fan-in point that no other model + found. The crash-restart duplication attack is also architecturally novel — + it exploits the "clean state" guarantee as a weapon for invisible trade + doubling. Opus consistently reasons about the system BOUNDARY (aggregator + → PortfolioRisk handoff) rather than just within-component mechanics. The + pattern-predicate trust boundary finding is uniquely about CONFIGURATION + as an attack surface. +- **Claude Sonnet** produced 10 findings in 27s — extremely efficient (127 + tokens per finding). Findings were adequate and covered all 5 categories, + but lacked the specificity of GPT-5 and the architectural creativity of + Opus. Several findings were somewhat generic (e.g., "crash at strategic + moments" without specifying exactly WHEN relative to group lifecycle). + The cross-group coordination and multi-strategy distribution findings show + system-level thinking but are stated at a higher abstraction level without + concrete exploit sequences. + +**Key insight — "adversarial manipulation analysis" as a task type:** +This is qualitatively different from all previous analytical lenses tested. +Previous tasks asked models to find problems WITH the design (assumptions, +races, incoherences). This task asks models to find ways to USE the design +AGAINST itself — a creative/generative adversarial task. Results: + +- **GPT-5** treats it as an exhaustive enumeration exercise — systematically + walks through each mechanism and asks "how could this be abused?" High + count (15), thorough coverage, but some findings are minor variations of + each other (e.g., crash-related findings #10, #12, #15 share the same core + mechanism). Reasoning tokens (6,336) used for both generation and verification. +- **Opus** treats it as a creative design exercise — asks "what would a + smart adversary do that the designer didn't consider?" Fewer findings (12) + but several are genuinely novel attack concepts (instrument fragmentation, + crash-restart duplication, predicate trust boundary) that require reasoning + about the SYSTEM rather than the COMPONENT. Opus also provided a summary + table and systemic conclusion about the root design weaknesses. +- **Sonnet** treats it as a categorization exercise — fills each prompt + category with plausible attacks but at a higher abstraction level. Fast + and adequate for a first pass but wouldn't surprise a security reviewer. + +**Comparison to "predictable exploit window" (Finding #18):** +Finding #18 noted that Opus uniquely identified predictable exploit windows +in escalation-policy.md. Here, Opus again shows the strongest adversarial +creativity — the instrument fragmentation attack and crash-restart duplication +are both about exploiting DESIGN GUARANTEES (per-instrument grouping, clean +restart) as weapons. This confirms that Opus's strength on adversarial analysis +is a CONSISTENT PATTERN, not document-specific. + +GPT-5 excels when the adversarial task is framed as "enumerate all possible +abuses of each mechanism" (systematic coverage). Opus excels when the task +requires "invent novel attack concepts that exploit design boundaries" +(creative adversarial thinking). + +**Model hierarchy for adversarial manipulation analysis:** +1. GPT-5 — most thorough enumeration, best at mechanism-level exploitation (15) +2. Opus — most creative, finds system-boundary attacks others miss (12) +3. Sonnet — adequate first pass, fast, but less specific (10) + +**Practical implication:** For security-oriented architecture review: +- Run GPT-5 for comprehensive attack surface enumeration +- Run Opus for novel/creative attack vectors that exploit design boundaries +- Sonnet is sufficient only as a quick initial screen +- The UNION of GPT-5 + Opus findings (removing overlaps) would produce the + most complete adversarial analysis + +**New finding about the aggregator itself:** Several attacks identified by +multiple models point to real design weaknesses worth addressing: +1. No signal deduplication/independence validation (all 3 models) +2. Primary signal determines all decision parameters regardless of group + composition (all 3 models) +3. Transient state + no replay = perfect adversarial erasure tool (all 3) +4. Capacity/timeout treated as normal events even when weaponized (all 3) +5. No cross-group correlation at aggregator level (Opus + Sonnet) +6. TOCTOU at PortfolioRisk fan-in for concurrent decisions (Opus) diff --git a/findings/ALL-FINDINGS.md b/findings/ALL-FINDINGS.md deleted file mode 100644 index ed0762d..0000000 --- a/findings/ALL-FINDINGS.md +++ /dev/null @@ -1,3249 +0,0 @@ -# Model Findings — Analytical & Research Work - -_Tracking what actually works (and doesn't) when using AI models for research, -analysis, bias detection, and document review — not coding._ - -Started: 2026-04-26 - -## Context - -We use multiple models in different roles: Claude Code (Opus/Sonnet) for -generation, Sonnet + GPT-5 for independent dual review, smaller models for -focused analytical tasks. Most public discussion is about coding. We found -almost no published methodology for using models in analytical research tasks -(searched 2026-04-26). That gap is why we're tracking this. - -## Findings - -### 1. Different models catch different things (confirmed) - -**Date:** 2026-04-26 -**Task:** PR reviews on DDD reference docs (~6,600 lines across 18 files) -**How we used them:** Both models got the same task via pr-review skill — -fetch diff, fetch full file content for changed files, review against PR -description and linked issue acceptance criteria. Rich context: full diff, -project CLAUDE.md conventions, issue body. Each reviewer ran independently -in its own sub-agent with its own Gitea token. No cross-pollination. - -- GPT-5 caught SUMMARY.md verdict mismatches (Commanded classification, - small teams classification) that Sonnet missed entirely (PR #375) -- Sonnet caught a broken cross-reference link first that GPT-5 missed (PR #378) -- **Takeaway:** Different blind spots are real. Neither model is strictly better - for analytical review — they complement each other. This is why we run two - independent reviewers from different model families. - -### 2. Cheap model + narrow lens > expensive model + broad review (one data point) - -**Date:** 2026-04-26 -**Task:** Check 12 rewritten hypotheses for directional bias -**How we used them:** -- Sonnet & GPT-5: full PR review context (diff, file content, issue, AC). - Broad mandate: "review this PR." Rich context but unfocused task. -- GPT-4.1 Mini: given ONLY the 12 hypothesis texts + one focused question: - "Do any of these hypotheses lead toward a predetermined conclusion?" - Minimal context, laser-focused task. No diff, no project docs, no issue. - -- Both Sonnet and GPT-5 approved the hypotheses as reviewers -- GPT-4.1 Mini found ALL 12 pushed toward predetermined conclusions -- Words like "requires," "necessary," "must be" were flagged as directional -- **Takeaway:** Task framing mattered more than model size. Rich context + - broad mandate = missed the forest for the trees. Minimal context + precise - question = found exactly what mattered. This needs more testing — was it - the narrow framing, the lack of surrounding context, or both? - -### 3. GPT-5 times out on complex multi-step analytical tasks (confirmed pattern) - -**Date:** 2026-04-26 -**Task:** Full PR review of #382 (research document rewrite) -**How we used it:** pr-review skill — multi-phase (fetch diff, fetch files, -check CI, analyze against AC, post inline comments, post summary). 7 phases, -many curl calls to Gitea API, large diff context. Heavy tool-use workflow -through SAP proxy (adds latency vs direct API). 300s timeout. - -- Timed out 3 times at 300s (17, 6, 6 tool calls respectively) -- Bottleneck was model processing time, not network (~0.3s Gitea API latency) -- **Takeaway:** Break analytical tasks into focused bounded pieces. Twelve - small deep reviews > one rushed big one. The issue isn't GPT-5's analysis - quality — it's that multi-phase tool-heavy workflows burn too much time - on mechanics. Separate the data gathering from the analysis. - -### 4. GPT-5 defaults to delegation; Claude defaults to doing the work - -**Date:** 2026-04-26 -**Task:** PR review delegation to sub-agents -**How we used them:** Both spawned as sub-agents from main session with -same task description, same pr-review skill file, same Gitea credentials. -Difference: GPT-5 got model override to gpt5, Sonnet used default model. -Both got full skill instructions. - -- GPT-5 first attempt: spawned sub-sub-agents and timed out -- GPT-5 with "do it yourself, no sub-agents" + step-by-step: worked -- Even with constraints, GPT-5 sometimes dumps raw tool output instead of - synthesizing — needs explicit output format instructions -- Claude (Sonnet/Opus) given the same kind of task does the work directly -- **Takeaway:** GPT interprets complex task descriptions as delegation - opportunities. Claude interprets them as work to do. For GPT: explicit - single-actor instructions + output format. For Claude: can give broader - mandate. Same skill file, very different behavior. - -### 5. Sonnet is fast and catches structural issues; GPT-5 is slow and catches semantic issues - -**Date:** 2026-04-26 -**Task:** Dual review across PRs #372, #375, #378, #380, #382 -**How we used them:** Same pr-review skill, same context (diff + files + -issue + AC), same sub-agent pattern. Only variable: model. Both got rich -context. Both ran the full 7-phase review skill. - -- Sonnet consistently finishes first, catches formatting, broken links, - structural problems (missing sections, dangling refs) -- GPT-5 takes longer, catches meaning-level problems (verdict mismatches, - classification inconsistencies, logical gaps) -- **Takeaway:** With identical rich context and identical instructions, the - models naturally gravitate to different things. Sonnet is the structural - reviewer; GPT-5 is the semantic reviewer. Both roles matter. Question: - would Sonnet catch semantic issues if given a narrower "check for logical - consistency" framing instead of broad review? - -### 6. Single agent can't handle 1000+ line document generation (confirmed pattern) - -**Date:** 2026-04-26 -**Task:** DDD v2 forge analysis drafting -**How we used them:** Single Sonnet/Opus sub-agents given full research -material (~3,874 lines of research notes) + outline + instructions to write -complete document. Very rich context (all research), very large output -requirement (1000+ lines). - -- Five single-agent attempts died (OOM, disconnect, timeout) trying to write - full documents -- Sectional approach (5 parallel Sonnet subagents, ~500-700 lines each) - succeeded immediately — each got same research but only their section's - outline -- Same pattern when Claude Code attempted full Part V rewrite — died -- Three agents × ~320 lines each worked first try -- **Takeaway:** This is a confirmed, repeatable limit for generation tasks. - Not model-specific — it's a context/output length problem. Rich input - context is fine; it's the output length that kills. Break output into - sections, keep input context rich, draft in parallel, assemble. - -### 7. Emerging role assignments (pattern, not conclusion) - -**Date:** 2026-04-26 (one day of intensive work — treat as hypothesis) - -- Opus (via Claude Code): complex generation needing deep project context. - Rich context: CLAUDE.md, full codebase access, design docs. Broad mandate. -- Sonnet: parallel volume work (5 subagents drafting simultaneously). - Rich context per section, constrained output scope. -- GPT-5: independent analytical review. Rich context (diff + files + issue). - Best when task is bounded and explicit. -- GPT-4.1 Mini: focused narrow analysis (bias detection). Minimal context, - precise question. Cheap and fast. -- **Takeaway:** The role assignment matters, but so does the context shape. - Opus gets broad context + broad mandate. Sonnet gets broad context + - narrow scope. GPT-5 gets rich context + explicit task. GPT-4.1 Mini gets - minimal context + laser question. We haven't tested swapping these - combinations — that's where the real learning will come from. - -### 8. Bias detection: all models catch it with any framing — when the signal isn't buried - -**Date:** 2026-04-27 -**Task:** Detect directional bias in 8 deliberately biased hypotheses about -microservices vs monolith architecture for fintech startups. -**How we used them:** Created fresh test material (8 hypotheses with pro- -microservices bias via absolutes like "inevitably," "necessary," "must," -"requires," plus one factually inverted claim about consistency guarantees). -Ran 4 conditions in parallel sub-agents: - -| Condition | Model | Framing | Context | -|---|---|---|---| -| A | GPT-4.1 Mini | Narrow: "Do any lead toward a predetermined conclusion?" | Hypotheses only | -| B | Sonnet | Same narrow question | Hypotheses only | -| C | GPT-5 | Same narrow question | Hypotheses only | -| D | Sonnet | Broad: "Review quality, clarity, testability, and issues" | Hypotheses only | - -**Results:** -- **All 4 conditions detected 8/8 biased hypotheses.** No misses. -- All 3 narrow-framing models (Mini, Sonnet, GPT-5) produced structurally - similar output: per-hypothesis verdict, biasing words, neutral version, - severity assessment. -- All 3 narrow-framing models flagged H8's factual inversion (distributed - transactions DON'T provide stronger consistency than monolithic ACID). -- GPT-5 added specific counterexamples (LMAX Disruptor, Shopify, Stack - Overflow, Basecamp) — marginally richer analysis. -- Sonnet broad mandate also caught the bias — framed as one of three - "systemic problems" (deterministic language, pro-microservices framing - bias, underspecified constructs). Additionally provided testability and - operationalization analysis that the narrow framing didn't ask for. -- Sonnet broad took ~72s vs ~39s for narrow conditions (more output). - -**Takeaway:** When the biased text is the ONLY input (no surrounding noise), -all tested models — including the cheapest (GPT-4.1 Mini) — detect bias -regardless of whether the question is narrow or broad. This appears to -**contradict** original finding #2 ("cheap model + narrow lens > expensive -model + broad review"), but the key difference is context noise: - -- **Original experiment (2026-04-26):** Sonnet and GPT-5 missed bias during - FULL PR REVIEW with rich project context (diff, file content, issue text, - acceptance criteria, project conventions). The hypotheses were buried in - layers of review mechanics. -- **This experiment (2026-04-27):** Even the "broad" condition gave ONLY the - hypothesis text — no diff, no PR structure, no project context noise. - -**Refined hypothesis:** The original finding #2 was about **signal-to-noise -ratio**, not about model capability or framing precision. When biased text -is presented in isolation, any model catches it. When biased text is buried -in a large PR review with many other things to check, the bias signal gets -lost in the noise — unless you explicitly ask about it. The "narrow lens" -worked because it eliminated the noise, not because smaller models are -better at bias detection. - -**Next experiment to confirm:** Give a model the FULL PR review context -(diff, files, issue, AC) but add the narrow bias question as an explicit -review checklist item. If the model catches bias despite the rich context, -it confirms the signal-to-noise hypothesis. If it misses, it suggests -something else is at play (attention allocation, task switching cost). - -### 9. Gap-finding in architecture docs: GPT-5 finds domain-specific gaps, GPT-4.1 is generic, Mini is formulaic - -**Date:** 2026-05-02 -**Task:** Identify missing failure scenarios in gargoyle's `failure-modes.md` (383 lines) -**How we used them:** Same document (full text, no truncation) + same focused -analytical question to all 3 models via HAI proxy (OpenAI-compatible endpoint). -No tools, no project context beyond the document itself. Single prompt, no -conversation history. Temperature 0.3 for GPT-4.1/Mini, default (1.0) for GPT-5 -(required by the model). - -| Model | Time | Output tokens | Reasoning tokens | Scenarios found | -|---|---|---|---|---| -| GPT-4.1 Mini | 16s | 2,003 | 0 | 10 | -| GPT-4.1 | 24s | 2,575 | 0 | 15 | -| GPT-5 | 45s | 8,565 | 6,656 | 14 | - -**What they found — common ground (all 3 identified):** -- ETS table corruption/loss affecting gates -- BEAM scheduler starvation / GC pauses -- WebSocket message duplication/reordering -- Postgres connection pool exhaustion / deadlocks -- Clock skew / time drift -- Process registry inconsistency - -**GPT-5 unique findings (not in either other model):** -- Broker rate limiting (429s) — not "connection lost" so existing logic - doesn't trigger, but can't flatten during kill switch -- Broker auth failure / credential rotation — distinct from connection loss -- Corporate actions (splits, symbol changes) — position drift without - triggering staleness detection -- Duplicate pipeline instances for same user (DynamicSupervisor race) -- DB "commit unknown outcome" causing restart loops (Ecto commit succeeds - at Postgres but client times out → retry → unique constraint → crash loop) -- Cross-symbol strategies with partial staleness — multi-leg signals - computed from mix of fresh and stale data -- Partial cancel_all during kill switch masked by process restarts - -**GPT-4.1 unique findings (not in GPT-5 or Mini):** -- Zombie processes after halt (supervisor misconfiguration) -- Unsupervised Task crashes going unnoticed -- Audit log writes failing silently (not in same transaction as state change) -- ClOrdID unique constraint violation from race in sequence generation -- Broker API semantic changes (silent breaking changes) - -**GPT-4.1 Mini unique findings:** -- Race between kill switch engagement and reconciliation completion - (timing coordination gap) — this was more explicitly called out than - in the other models, though GPT-5 touches it implicitly -- Strategy.Worker / Aggregator partial crash inconsistency - -**Quality assessment:** -- **GPT-5** had the most *domain-relevant* and *actionable* gaps. Broker - rate limiting, auth failures, corporate actions, and the DB commit - unknown-outcome scenario are all realistic production issues specific - to THIS system. The cross-symbol partial staleness finding shows - deeper architectural reasoning about component interactions. -- **GPT-4.1** was thorough and well-structured but more generic/defensive. - Many of its unique findings (zombie processes, unsupervised Tasks, - audit log loss) are general Elixir concerns rather than specific to - the document's architecture. Good for a completeness checklist. -- **GPT-4.1 Mini** was formulaic — each finding followed the same template - and several were somewhat surface-level or restated things the document - partially covers. Still found the most scenarios per dollar. - -**Takeaway:** For gap-finding in architecture documents, GPT-5's reasoning -tokens pay off. It doesn't just list "things that could go wrong" — it -identifies *specific interactions* that the document's existing mechanisms -don't cover (e.g., rate limiting bypasses the "connection lost" detection, -corporate actions bypass staleness detection). GPT-4.1 is a solid -middle-ground: more thorough than Mini, less insightful than GPT-5. -Mini is fine for a quick sanity check but won't find the subtle gaps. - -**Cost-effectiveness:** Mini found 10 scenarios in 16s for ~7K tokens. -GPT-5 found 14 scenarios (with 7 genuinely unique insights) in 45s for -~13.5K tokens (including 6.6K reasoning). For architecture review where -missing a gap could mean financial loss, the GPT-5 cost is justified. -For routine doc review, Mini + human judgment is probably sufficient. - -### 10. Hidden-assumption identification: GPT-5's reasoning produces qualitatively different (not just more) findings - -**Date:** 2026-05-02 -**Task:** Identify hidden assumptions in gargoyle's `cold-start-and-recovery.md` (234 lines) -that could break under real-world production conditions. -**How we used them:** Same document (full text) + same focused analytical question -to all 3 models via HAI proxy (OpenAI-compatible endpoint). No tools, no project -context beyond the document itself. Single prompt, no conversation history. -Temperature 0.3 for GPT-4.1/Mini; GPT-5 uses default (required). - -| Model | Time | Output tokens | Reasoning tokens | Assumptions found | -|---|---|---|---|---| -| GPT-4.1 Mini | 25s | 3,090 | 0 | 12 | -| GPT-4.1 | 77s | 2,751 | 0 | 14 | -| GPT-5 | 78s | 2,649 | 4,096 | 26 | - -**What they found — common ground (all 3 identified):** -- Broker API consistency/availability during reconciliation -- ETS table availability and fail-closed behavior -- Single-writer/mailbox ordering guarantees holding in practice -- User independence assumption vs shared resources (rate limits, DB) -- Reconciliation idempotency under repeated runs -- Corporate action data completeness/timeliness -- Escalation threshold calibration vs changing market conditions -- Strategy warmup with partial/missing historical data -- Signal expiry correctness on restart - -**GPT-5 unique findings (not in either other model):** -- Unbounded mailbox growth during extended reconciliation (memory pressure - from queued messages at market open) -- handle_continue side effects in OTHER processes (risk, metrics) acting - concurrently via different paths -- Pre-existing GTC orders filling while gated (positions as moving target) -- Broker position semantics mismatch (trade-date vs settled-date) -- Strategy warmup evaluate() having non-signal side effects (metrics, caches) -- Historical bar / live tick boundary alignment (double-processing or gaps) -- ETS gate caching in process state creating fail-open windows -- Correlated retry stampede when many users restart together -- Corporate action double-application race with broker (missing idempotency - keys per action/instrument/date) -- Kill switch state vs DB unavailability at startup -- Market data subscriptions as shared bottleneck across "independent" users -- Time-invariant signals incorrectly expired by aggregation window logic -- Broker fills vs positions endpoints internally inconsistent (different caches) -- Positions changing under reconciliation while kill switch is engaged -- Gate phase sequencing: :ready written before worker warmup completes -- Periodic reconciler allowing 1hr of divergent trading (rate-of-change blind) - -**GPT-4.1 unique findings (not in GPT-5 or Mini):** -- No correlated failure handling (all failure modes treated as isolated) — - only model to frame this as a meta-assumption about the failure table - -**GPT-4.1 Mini unique findings:** -- None that weren't also covered by the other two models - -**Quality assessment:** -- **GPT-5** didn't just find more assumptions — it found *qualitatively - different kinds*. Many of its unique findings involve multi-component - interactions (mailbox + reconciliation + market open timing), semantic - mismatches (trade-date vs settled positions), and second-order effects - (metrics side effects during warmup, GTC orders filling while gated). - These require reasoning about system behavior across boundaries the - document doesn't explicitly draw. -- **GPT-4.1** was competent and structured, found the same core assumptions - as Mini, plus one good meta-observation about correlated failures. But - it stayed within the document's own framing — it found assumptions the - document *almost* states rather than ones the document can't see. -- **GPT-4.1 Mini** was formulaic. Every finding maps cleanly to a section - of the document. It's essentially "what could go wrong with each stated - mechanism" rather than "what does this design take for granted about - the world outside itself." - -**Key insight — reasoning tokens change the KIND of analysis:** -GPT-5's 4,096 reasoning tokens aren't producing "more of the same" — -they're producing a different analytical mode. The non-reasoning models -(4.1 and Mini) identify risks within the document's own frame of reference. -GPT-5 reasons about the document's relationship to the external world: -broker semantics, deployment topology, OTP runtime behavior under load, -timing correlations across independent subsystems. This is the difference -between "what could this mechanism fail at" and "what must be true about -the world for this mechanism to work." - -**Comparison to Finding #9 (gap-finding on failure-modes.md):** -Same pattern confirmed. GPT-5 consistently finds domain-specific, -interaction-level issues that require reasoning about component boundaries. -GPT-4.1 is thorough within-frame. Mini is formulaic. The gap between -GPT-5 and the others is larger here than in #9 — possibly because -"hidden assumptions" requires more abstraction than "missing failure -scenarios." Assumption-finding requires the model to reason about what -ISN'T stated, which benefits more from extended reasoning. - -**Practical implication:** For architecture review, running GPT-5 on -"identify hidden assumptions" is higher-value than the same question to -non-reasoning models. The cost difference (4K extra reasoning tokens) is -trivial for a document that will drive months of implementation. Use -non-reasoning models for within-frame checks ("does this section have -gaps") and reasoning models for cross-boundary analysis ("what must be -true about the world for this to work"). - -### 11. Hidden-assumption identification on simpler doc: reasoning models diverge in approach, both outperform non-reasoning - -**Date:** 2026-05-02 -**Task:** Identify hidden assumptions in gargoyle's `market-calendar.md` (238 lines) -— a simpler, single-component document vs the 234-line cold-start doc from Finding #10. -**How we used them:** Same document (full text) + same focused analytical question -to all 3 models via HAI proxy. No tools, no project context beyond the document -itself. Single prompt, no conversation history. Temperature 0.3 for GPT-4.1; -GPT-5 and Opus use their defaults (required). Same prompt across all three. - -| Model | Time | Output tokens | Reasoning tokens | Assumptions found | -|---|---|---|---|---| -| GPT-4.1 | 19s | 2,554 | 0 | 14 | -| Claude Opus 4.6 | 74s | 3,288 | (internal, not reported) | 13 | -| GPT-5 | 101s | 8,417 | 5,504 | 24 | - -**What they found — common ground (all 3 identified):** -- Alpaca calendar API data correctness/completeness as single source of truth -- Alpaca API availability at startup (no local cache persistence) -- ETS table atomicity during refresh (partial-state exposure risk) -- System clock/timezone alignment (dates are timezone-naive) -- NYSE emergency/unscheduled closures not reflected until refresh -- Two-year cache range sufficiency -- API response format stability -- Rate limiting / API capacity concerns - -**GPT-5 unique findings (not in either other model):** -- Date struct term-ordering in ETS match specs may not match chronological - order (ETS range guards rely on Erlang term comparison, not Date semantics) -- close_time/1 returns naive Time without timezone — DST conversion burden on - consumers, one hour off twice per year -- trading_day?/1 conflates "not a trading day" with "calendar unavailable" — - operational outages invisible to callers -- ETS table name collision risk (global namespace per node) -- No other process should modify the ETS table (access mode discipline) -- Network egress and credential availability on all nodes at all times -- ETS read/write concurrency flags for contention under load -- Direct ETS access by consumers bypassing the module's error handling -- next/prev_trading_day edge cases at cache boundaries -- Alpaca API start/end parameter inclusivity (off-by-one at year boundaries) -- Half-day vs full-day distinction insufficiency for special sessions -- Small table size makes O(n) selects acceptable (scaling concern) -- Year-end refresh failure leaving gaps at boundary -- Alpaca never omits a legitimate trading day (absence = non-trading conflation) - -**Claude Opus unique findings (not in either other model):** -- ETS ownership semantics: heir-protection would change fail-closed behavior; - current design means ALL consumers fail simultaneously during crash-to-restart - window (framed as a design tension, not just a risk) -- Silent data corruption from partial API response (pagination/truncation) — - specifically that missing rows are SILENT failures with no error propagation - (other models mentioned API completeness but not the silence aspect) -- Consumers calling functions with Dates, not DateTimes — the API accepts Date.t() - but doesn't specify HOW consumers should derive "today" (system-wide - coordination problem made invisible by the API contract) -- `trading_day?/1` returning false is NOT fail-closed for ALL consumers — only - for PDT-like "block action" consumers; for batch-trigger consumers it's - fail-OPEN (subtle inversion of safety semantics) -- Startup ordering: background_children placement means PDT could receive orders - before MarketCalendar finishes init, creating recurring rejection windows - during hot deploys -- Continuous-running assumption for refresh timer (daily restarts would mean - refresh mechanism never fires — no staleness alert exists) - -**GPT-4.1 unique findings (not in either other model):** -- No need for real-time calendar change notification (event emission gap) -- All consumers using the same module instance (configuration consistency) -- No need for historical calendar data (audit/backtesting limitation) -- Consumers correctly handling {:error, :calendar_unavailable} in practice - -**Quality assessment:** -- **GPT-5** found the most assumptions (24) with the most technical specificity. - Many are implementation-level insights (ETS term ordering, named table - collisions, read_concurrency flags) that demonstrate deep Erlang/OTP - knowledge. Some are slightly obvious or overlapping. The ETS term-ordering - finding is genuinely insightful — Date structs DO compare correctly in Erlang - term order (year > month > day fields), but questioning it shows depth of - reasoning about underlying mechanisms. Also provided concrete recommendations. -- **Claude Opus** found fewer assumptions (13) but several were qualitatively - different — they identified *design tensions* and *semantic inversions* - rather than just failure scenarios. The fail-open/fail-closed inversion - (finding #12), the ETS ownership tension, and the "API makes timezone - coordination invisible" findings show reasoning about the design's - *relationship to its consumers* rather than just its internal mechanics. - Tighter, more curated output with less filler. -- **GPT-4.1** was competent and well-structured (14 assumptions, clean table) - but stayed within the document's own framing. Its unique findings are - relatively generic ("consumers should handle errors correctly," "no - historical data"). Solid baseline, no surprises. - -**Key insight — two reasoning models, different analytical styles:** -GPT-5 and Opus are both reasoning models, but they reason about different -things. GPT-5 reasons DEEPER into implementation mechanics (how does ETS -actually work? what are the exact failure modes of each component?). Opus -reasons WIDER about system context (how does this component's API contract -affect the safety properties of the overall system? what tensions does this -design create that aren't visible to the author?). - -GPT-5's approach: "Here are 24 things that could go wrong, many highly -technical." Opus's approach: "Here are 13 assumptions, several of which -reveal design tensions the document can't see about itself." - -**Does the reasoning gap narrow with simpler docs?** -Comparing to Finding #10 (cold-start doc, 234 lines, 26 vs 14 vs 12 assumptions -for GPT-5/GPT-4.1/Mini): -- GPT-5 still dominates in raw count (24 vs 14 for GPT-4.1) -- The gap ratio is similar (~1.7x here vs ~1.9x in Finding #10) -- Document complexity doesn't appear to be the driver of the gap — - reasoning tokens enable more exhaustive exploration regardless of - input complexity - -**Claude Opus vs GPT-5 (the headline comparison):** -They're not competing on the same axis. GPT-5 is better for "find all -possible issues" (breadth + technical depth). Opus is better for "find -the assumptions that will actually surprise the author" (insight density). -If you want a security-audit-style exhaustive list: GPT-5. If you want a -design-review-style "here's what you're not seeing about your own design": -Opus. Both are better than GPT-4.1 for this task, but in different ways. - -**Practical implication:** Run BOTH reasoning models on architecture docs. -GPT-5 catches implementation-level hazards the team might miss during -coding. Opus catches design-level tensions the team might miss during -planning. GPT-4.1 is sufficient as a quick sanity check but won't -surprise you. - -### 12. Sonnet 4.6 outperforms expectations on assumption-finding; competes with reasoning models on complex docs - -**Date:** 2026-05-02 -**Task:** Identify hidden assumptions in gargoyle's `order-execution.md` (785 lines) -— a complex, multi-component document covering OrderManager, BrokerAdapter, -TradeStream, and PositionReconciler. -**How we used them:** Same document (full text, no truncation) + same focused -analytical question to all 3 models. GPT-5 via HAI OpenAI endpoint; Opus 4.6 -and Sonnet 4.6 via HAI Anthropic endpoint. No tools, no project context beyond -the document itself. Single prompt, no conversation history. - -| Model | Time | Output tokens | Reasoning tokens | Assumptions found | -|---|---|---|---|---| -| GPT-5 | 93s | 8,485 | 6,016 | 20 | -| Claude Sonnet 4.6 | 106s | 4,637 | (internal) | 17 | -| Claude Opus 4.6 | 105s | 4,615 | (internal) | 12 | - -**What they found — common ground (all 3 identified):** -- Synchronous broker REST calls blocking OrderManager GenServer (mailbox growth) -- TradeStream event ordering assumptions (out-of-order fills/status) -- Fill deduplication gap (no explicit fill-level idempotency) -- `cancel_all/1` with `timeout: :infinity` blocking GenServer during FLATTEN -- Recovery/restart races with TradeStream fill delivery (fills queued during - `handle_continue/2`) -- Lot operation idempotency under crash recovery (partial execution) -- Replace race: fills for new broker_order_id arriving before `replaced` event -- Database write latency impact on GenServer throughput under burst fills -- ETS table scope assumptions (single-node, access mode) - -**GPT-5 unique findings (not in either Claude model):** -- Rate-limit retry blocking OrderManager inline (no async retry path specified) -- Single TradeStream connection per user not enforced (duplicate detection gap) -- Kill switch FLATTEN vs degraded state interaction (OM drops cancels while - degraded, but FLATTEN calls cancel_all through OM) -- ClOrdID uniqueness scope/retention at broker across sessions and days -- `after: datetime` filter semantics (clock skew, timezone, inclusive/exclusive) -- Reconciliation responses may exceed single-response size (no pagination) -- Event broadcasting blocking model (synchronous vs fire-and-forget) -- Credential rotation during TradeStream connection lifetime -- `market_closed` semantics varying across brokers (reject vs queue) -- Dropped Alpaca statuses (stopped/suspended/calculated) may affect accounting - -**Claude Sonnet 4.6 unique findings (not in either other model):** -- Single fill per fill event assumption (broker batching multiple fills into - one WebSocket message) -- Lot operations (`Lots.open/2`, `Lots.close/4`) assumed to never fail — - no `{:error, _}` handling shown, crash propagation risk -- `Task.async_stream` inside GenServer creating linked tasks whose crash - signals propagate to OrderManager during critical cancel_all -- Broker cancel semantics during in-flight replace at the broker level - (cancel targets old broker_order_id which broker already replaced away) -- Database operations in fill processing assumed transactional (no explicit - Ecto.Multi/transaction mention) -- Broker position reflects only Gargoyle's activity (external trades cause - false-positive reconciliation halts) - -**Claude Opus 4.6 unique findings (not in either other model):** -- `{:ok, broker_order_id}` from REST place conflated with durable OMS - acceptance vs mere HTTP acknowledgment (no timeout on `submitted` state) -- Concurrent `apply_corrections/2` from periodic reconciler running in - separate process conflicts with OrderManager's single-writer invariant - (corrections write to same tables outside GenServer serialization) -- Reconciliation gate initialized state after `:rest_for_one` restart — - ETS table EXISTS but freshly initialized vs table MISSING are different - conditions with different safety properties -- Escalation state reset after crash creating double-exposure window - (systematic issue persists but escalation timer resets to zero) -- `replace/3` error semantics: non-atomic replace (cancel + re-submit) - where cancel succeeds but re-submit fails leaves original order cancelled - at broker while OrderManager reverts to "working" locally - -**Quality assessment:** -- **GPT-5** maintained its pattern from previous findings: broadest coverage - (20 assumptions), most technically specific about implementation details. - Found cross-cutting operational concerns (clock skew, credential rotation, - pagination) that the Claude models didn't surface. However, several of its - findings were medium-severity operational concerns rather than architectural - assumptions. -- **Claude Sonnet 4.6** was the surprise performer. Found 17 assumptions — - close to GPT-5's count (85%) — and several of its unique findings were - genuinely insightful. The `cancel_all` race with broker-side replace state - (finding #16) and the lot operation failure propagation (finding #6) show - deep reasoning about component interaction despite Sonnet not being - positioned as a "reasoning" model. More importantly, Sonnet's findings were - consistently well-structured with clear "how it could break" scenarios. -- **Claude Opus 4.6** found the fewest assumptions (12) but — consistent with - Finding #11 — its unique findings were qualitatively different. The - concurrent `apply_corrections` write conflict, the gate initialization state - distinction, and the non-atomic replace error semantics all reveal design - tensions that neither GPT-5 nor Sonnet identified. Opus continues to reason - about the *boundaries between components* rather than within-component - mechanics. - -**Key insight — Sonnet 4.6 is NOT just a faster GPT-4.1:** -In previous findings (#9, #10, #11), non-reasoning models (GPT-4.1, GPT-4.1 -Mini) performed significantly below reasoning models on assumption-finding. -GPT-4.1 found ~14 assumptions where GPT-5 found 24-26. Here, Sonnet 4.6 -finds 17 where GPT-5 finds 20 — a much smaller gap (~85% vs ~58% previously). - -Sonnet's findings also included several that showed genuine reasoning about -component interactions (not just within-frame risks). This suggests Sonnet 4.6 -is qualitatively different from GPT-4.1 for analytical work — it occupies a -middle ground between GPT-4.1's "competent but surface-level" and GPT-5's -"exhaustive and deep." The severity distribution was also similar to GPT-5 -(multiple critical/high findings), whereas GPT-4.1 in previous experiments -tended toward medium-severity generic concerns. - -**Updated model hierarchy for assumption-finding:** -1. GPT-5 — broadest coverage, most operational-level findings (20) -2. Sonnet 4.6 — strong analytical depth, good component interaction reasoning (17) -3. Opus 4.6 — fewest but most architecturally insightful, finds design tensions (12) -4. GPT-4.1 — competent within-frame, generic (~14 from previous experiments) -5. GPT-4.1 Mini — formulaic, surface-level (~10-12) - -**Practical implication:** For architecture review, Sonnet 4.6 is now a strong -candidate for volume analytical work. It's fast enough to run alongside GPT-5 -and catches different things (lot operation failures, broker-side replace races). -The ideal three-model review stack for architecture docs appears to be: -- GPT-5 for breadth + operational concerns -- Sonnet 4.6 for component interaction analysis -- Opus 4.6 for design-tension identification - -Each consistently finds things the others miss. The cost-efficiency argument -for Sonnet is strong: ~85% of GPT-5's count with more actionable findings -per token generated (4,637 vs 8,485 tokens for 17 vs 20 assumptions). - -### 13. Race condition identification: Opus excels at subtle temporal interactions; GPT-5 is exhaustive; Sonnet struggles with concurrency reasoning - -**Date:** 2026-05-03 -**Task:** Identify race conditions, timing-dependent bugs, and ordering hazards in -gargoyle's `concurrent-failure-detection.md` (241 lines) — a document specifically -about concurrent detection logic with timers, ETS state, and multi-process events. -**How we used them:** Same document (full text) + same focused analytical question -to all 3 models via HAI proxy. Prompt specifically asked for event ordering problems, -timer interaction bugs, state visibility gaps, crash/restart timing, and multi-instance -coordination. Required each finding to reference specific mechanisms in the document -with specific interleaving descriptions. No tools, no project context beyond the -document itself. - -| Model | Time | Output tokens | Reasoning tokens | Race conditions found | -|---|---|---|---|---| -| GPT-5 | 116s | 10,587 | 8,192 | 12 | -| Claude Opus 4.6 | ~105s | 4,610 | (internal) | 10 | -| Claude Sonnet 4.6 | ~39s | 1,404 | (internal) | 7 | - -**What they found — common ground (all 3 identified):** -- Stale timer messages in mailbox after cancellation (classic Erlang timer race) -- HealthMonitor crash losing compound detection state (init from :unknown, no replay) -- ETS vs GenServer state divergence visible to dashboard -- Kill switch mode conflict (FLATTEN from single-process vs HALT from compound path) - -**GPT-5 unique findings (not in either Claude model):** -- Cross-sender message ordering: recovery events from pipeline processes vs timer - expiry from runtime (Erlang preserves per-sender order, NOT cross-sender) — the - "rapid recovery" safety argument in the doc relies on state being updated before - timer fires, which isn't guaranteed -- Debounce starvation: flapping component repeatedly restarting the timer, causing - compound evaluation to be indefinitely postponed while ≥2 genuinely degraded -- State regression: {:degraded} arriving after {:escalated, :kill_switch} with no - guard in the event table — state machine allows regressing from :halted to :degraded -- Cold-start window: application boots with existing degraded processes that won't - re-emit events, compound detection never fires -- Catch-all handle_info could accidentally swallow timer messages if pattern matching - is ordered wrong (implementation pitfall of the described approach) -- Debounce window growing beyond calibrated bounds from repeated timer restarts - -**Claude Opus unique findings (not in either other model):** -- Timer restart pushing evaluation PAST single-process escalation timeout — the - debounce mechanism can DEFEAT compound detection when second degradation arrives - near end of first window (resets to full window, first process escalates via - single-process path before new window fires). This means system gets FLATTEN - instead of HALT — exactly what compound detection was supposed to prevent. -- Strategy worker single-atom masking via event ordering: Worker A degrades, Worker - B degrades (same atom), Worker A recovers → atom set to :normal while B is still - degraded. Event ordering across different workers mapped to same atom creates - state loss. -- Registry stale PID after HealthMonitor crash: if subscription is PID-based (not - PubSub topic), new HM instance is deaf — events go to dead PID, silently dropped. - Compound detection completely disabled for that user until subscription refresh. -- :rest_for_one cascade + coincidental independent issue: debounce designed to - filter cascade restarts, but cascade can EXPOSE a pre-existing issue (SignalRisk - restarts and finds its upstream stale). Is this "compound" or "cascade + bad luck"? - Semantic ambiguity the design doesn't address. -- Compound cleared event without recovery debounce: :compound_degradation_cleared - emitted immediately when last process recovers (no settling period), causing - operator oscillation if recovery is transient. - -**Claude Sonnet unique findings:** -- ETS table creation race at startup (HealthMonitor writes before table exists) -- Registry lookup failure during pipeline startup (events before HM registered) -- However, Sonnet also made analytical errors: it described "multiple HealthMonitor - instances for the same user" scenarios despite the document clearly stating one - instance per user via DynamicSupervisor. Several of its findings assumed - multi-instance coordination that doesn't match the architecture. - -**Quality assessment:** -- **GPT-5** was the most exhaustive and technically precise. Its cross-sender - ordering finding (#2) is genuinely insightful — it identifies that the document's - "rapid recovery" safety argument implicitly assumes events arrive in wall-clock - order, which Erlang does NOT guarantee across different senders. The debounce - starvation finding (#3) identifies a real operational hazard with practical - consequences. All 12 findings reference specific mechanisms and describe specific - interleavings clearly. -- **Claude Opus** found fewer race conditions but several were qualitatively - superior. The timer-restart-defeats-compound-detection finding is the most - architecturally significant race in the entire analysis — it shows that the - debounce mechanism can work AGAINST the design's stated goals in specific - (realistic) timing scenarios. The strategy-worker event ordering masking is - also a genuine design flaw unique to the single-atom decision. Opus continues - its pattern of reasoning about design TENSIONS rather than just failure modes. -- **Claude Sonnet** was notably weaker here than in previous experiments. Only - 1,404 output tokens vs 4,610 (Opus) and 10,587 (GPT-5). Several findings - contained analytical errors (assuming multi-instance coordination that doesn't - exist). It found only 7 races, and 2-3 of those were based on misreadings of - the architecture. This is a significant regression from Finding #12 where - Sonnet found 17 assumptions (85% of GPT-5's count). - -**Key insight — concurrency reasoning is a different skill than assumption-finding:** -In previous experiments (#10, #11, #12), Sonnet 4.6 performed well on -assumption-finding (a task that requires reasoning about what's NOT stated). -Here, on race condition identification (a task requiring reasoning about temporal -interleavings and message ordering semantics), Sonnet drops significantly. This -suggests the task type matters more than we previously thought: - -- **Assumption-finding:** Requires breadth of consideration ("what must be true - for this to work?"). Sonnet handles this well — it's essentially pattern - matching across possible failure dimensions. -- **Race condition identification:** Requires SEQUENTIAL reasoning about specific - interleavings ("if A happens, then B happens, then C happens, what state is - visible?"). This benefits dramatically from extended reasoning tokens (GPT-5's - 8,192 reasoning tokens) or from Opus's internal reasoning depth. - -The lesson: don't extrapolate model performance across task types. A model that's -85% as good at assumption-finding may be 50% as good at concurrency analysis. -The cognitive demands are different. - -**Opus's distinguishing strength — finding design contradictions:** -Opus's best finding (timer restart defeating compound detection) isn't just a -race condition — it's identifying that the debounce mechanism can work against -the design's own stated goals. This is consistent with Opus's pattern in -previous findings: it finds tensions where one part of the design undermines -another part. For race condition analysis specifically, this manifests as -"here's where your safety mechanism becomes your vulnerability." - -**Practical implication for architecture review:** -- For race condition analysis: use GPT-5 (exhaustive) + Opus (design-tension) -- Sonnet is NOT suitable for concurrency reasoning tasks — use it for - assumption-finding and structural review instead -- The three-model stack needs task-appropriate assignment: - - Structural/assumption review: all three models contribute - - Concurrency/race analysis: GPT-5 + Opus only - - Bias detection: any model (per Finding #8) - -### 14. Cross-component interaction analysis: GPT-5 Mini enters as viable mid-tier; task framing strongly shapes output quality - -**Date:** 2026-05-03 -**Task:** Identify cross-component interaction failures in gargoyle's -`continuous-risk-monitoring.md` (459 lines) — a document specifying -PortfolioMonitor's interactions with OrderManager, PortfolioRisk, MarketData, -KillSwitch, ETS tables, and the pipeline supervision tree. -**How we used them:** Same document (full text) + same focused analytical -question to all 3 models via HAI proxy. Prompt was highly structured: specified -5 categories of cross-component failures to look for (semantic mismatches, -ordering violations, feedback loops, partial visibility, supervision boundary -effects) and required specific output format (components, sequence, gap, impact). -No tools, no project context beyond the document itself. - -| Model | Time | Output tokens | Reasoning tokens | Findings | -|---|---|---|---|---| -| GPT-5 Mini | 68s | 5,445 | 2,240 | 6 (+1 cut off) | -| GPT-5 | 116s | 10,604 | 8,128 | 10 | -| Claude Sonnet 4.6 | 38s | 1,868 | (internal) | 8 | - -**What they found — common ground (all 3 identified):** -- Fill-to-position query race (fill event triggers evaluation but position - store hasn't yet reflected the fill) -- Restrict flag ETS table destruction on PM crash → permissive window -- Kill switch check vs liquidation submission race -- Ticker subscription timing gap (new position opened but ticks not yet - subscribed → breach goes undetected) - -**GPT-5 unique findings (not in either other model):** -- Stale prices are NOT fail-safe for drawdown (higher stale price → inflated - portfolio value → understated drawdown). The document claims "fail-safe" - but this only holds for exposure metrics, not drawdown. This is the most - architecturally significant finding across all three models. -- Price definition mismatch between PM (last_trade from ETS) and OrderManager/ - broker (bid/ask/mid) causing mis-sized liquidation and oscillation -- Cross-component oscillation: PM hysteresis internal vs PRisk's immediate - binary restrict gate clearing (no cross-component cooldown) -- Liquidation stuck after OM restart (terminal events lost; liquidation_in_ - flight stays true indefinitely with no timeout/rehydration) -- "Minimal risk checks" not enforced — PM goes through same OM gates as - strategy orders but MarketHours/StalePrice controls may reject after-hours - or stale-price liquidation attempts -- FLATTEN mode semantics gap — PM refrains from liquidating when kill switch - engaged, but FLATTEN cancels open orders without actually CLOSING positions. - No component left to close positions. - -**Claude Sonnet 4.6 unique findings (not in either other model):** -- Liquidation feedback loop with PortfolioRisk — buy-to-cover for short - positions could INCREASE net long exposure at portfolio level, paradoxically - worsening concentration while fixing position-level metrics -- High water mark reset on pipeline restart masks true intraday drawdown - (restart → HWM resets to lower current value → drawdown calculated from - false baseline → larger losses permitted than intended) -- Multi-metric breach with single boolean flag — concentration liquidation - for AAPL sets liquidation_in_flight, blocking simultaneous daily P&L - liquidation for different positions -- Market close/open vs after-hours fills — claims to evaluate after-hours - fills but uses stale market-close prices - -**GPT-5 Mini unique findings (not in either other model):** -- OrderManager order splitting/remapping causing liquidation_in_flight - correlation failure (parent/child order ID mapping breaks terminal-event - detection). Well-reasoned but highly implementation-specific. -- Restrict/clear oscillation loop with strategy behavior (strategies react - to rejects → back off → restrict clears → strategies re-enter aggressively - → re-breach). Good systems-thinking about emergent feedback. - -**Quality assessment:** -- **GPT-5** produced the most findings (10) and the highest-quality - architectural insight: the stale-price/drawdown contradiction is a genuine - design flaw that contradicts the document's own safety claim. Multiple - findings showed cross-boundary reasoning about semantic mismatches (price - definition, FLATTEN semantics, gate bypass). Every finding named specific - components and described precise event sequences. -- **Claude Sonnet 4.6** was fast (38s, only 1,868 tokens) and produced 8 - solid findings. The HWM reset finding and the multi-metric/single-flag - finding show genuine architectural reasoning. The liquidation feedback - loop (buy-to-cover worsening portfolio concentration) is subtle and - shows cross-position reasoning. However, some findings overlapped - significantly with the common-ground set and added less unique depth. - Sonnet performed MUCH better here than on race condition identification - (Finding #13) — 8/10 ratio vs 7/12 previously. -- **GPT-5 Mini** produced 6 findings in 68s with 2,240 reasoning tokens. - Quality was genuinely good — the order-splitting/correlation finding - and the oscillation feedback loop both show real reasoning depth. It's - clearly NOT GPT-4.1 Mini — it reasons about component interactions, - not just within-frame risks. However, it found fewer issues and one - response was cut off (token limit or response truncation). - -**Key insight — task framing as the dominant variable:** -This experiment used a much more structured prompt than previous ones: -specified 5 categories, required specific output format, explicitly excluded -single-component failures. The result: ALL models produced higher-quality, -more focused output than in earlier experiments with broader prompts. Even -Sonnet — which struggled on race conditions (Finding #13) — performed well -here. The structured categories likely helped models organize their reasoning -without losing track of what they were looking for. - -The prompt explicitly asked for "cross-component interaction failures" rather -than general analysis. This is the narrow-lens effect from Finding #2, but -applied to a complex multi-component document. The lens is narrow (only -inter-component gaps) but the scope is broad (459 lines, many interactions). -This combination — narrow analytical lens + broad document scope — appears -to be the sweet spot for getting quality from all model tiers. - -**GPT-5 Mini positioning:** -First time testing GPT-5 Mini. Results: 6 findings in 68s vs GPT-5's 10 in -116s. That's 60% of the findings in 59% of the time, with 28% of the -reasoning tokens (2,240 vs 8,128). Quality-per-finding was solid — the order -correlation finding especially showed genuine systems reasoning. GPT-5 Mini -appears to be a legitimate mid-tier: more capable than GPT-4.1 (which can't -do this kind of cross-boundary reasoning) but less exhaustive than GPT-5. -Viable for: first-pass screening, bulk document review where you'd run many -docs and can't afford full GPT-5 on each. - -**Sonnet recovery from Finding #13:** -Sonnet went from 7 findings (with errors) on race conditions to 8 solid -findings here. The difference: this prompt was more structured, the document -was larger with more explicit interaction descriptions, and the task didn't -require pure temporal/sequential reasoning. "Cross-component interaction -failures" is closer to assumption-finding (Sonnet's strength) than race -condition identification (Sonnet's weakness). Task taxonomy continues to -matter more than raw model capability. - -**Updated model assignment for cross-component analysis:** -1. GPT-5 — broadest coverage, finds semantic contradictions in the doc's - own claims (10 findings) -2. Sonnet 4.6 — fast and efficient, good at portfolio-level reasoning and - feedback loops (8 findings in 38s) -3. GPT-5 Mini — viable first-pass with genuine reasoning depth (6 findings) -4. (Opus untested for this task type — likely strong on design tensions) - -### 20. Invariant violation path analysis: GPT-5 is maximally selective (3 findings, all genuine); Opus shows unique self-correcting analytical style; new task type favors precision over exhaustiveness - -**Date:** 2026-05-04 -**Task:** Identify invariant violation paths in gargoyle's `user-pipeline-lifecycle.md` -(730 lines) — sequences of legal operations that can violate the system's stated or -implied invariants. NEW analytical lens not previously tested, distinct from assumption- -finding, race conditions, or coherence checking. -**How we used them:** Same document (full text) + same focused analytical question to all -3 models via HAI proxy. Highly structured prompt specifying 5 categories of invariant -violations (state machine escapes, invariant composition failures, monotonicity violations, -idempotency boundary violations, authority inversion sequences). Required specific output -format per finding. No tools, no project context beyond the document itself. - -| Model | Time | Output tokens | Reasoning tokens | Findings | -|---|---|---|---|---| -| GPT-5 | 143s | 784 | 12,032 | 3 | -| Claude Opus 4.6 | 113s | 6,183 | (internal) | 7 (with 2 self-corrections) | -| Claude Sonnet 4.6 | 23s | 1,266 | (internal) | 5 | - -**What they found — common ground (2+ models identified):** - -- **Periodic reconciliation overrides operator manual stop** (GPT-5 #3 + Opus #5 + - Sonnet #1): An admin who stops a pipeline via `stop_user/1` with `:admin_action` - has their decision overridden within 5 minutes by periodic reconciliation, because - there's no "admin stopped" state in `check_eligibility/1`. All three models - independently identified this as the clearest authority inversion. -- **DynamicSupervisor restart bypasses eligibility gate** (Opus #1/#3 + Sonnet #2): - When `UserPipeline.Supervisor` crashes and is restarted by OTP supervision, the - restart bypasses `start_user/1` and `check_eligibility/1` entirely — potentially - resuming trading while the kill switch is engaged. -- **Stale ReconciliationGate after crash** (Opus #7): After a crash-triggered - DynamicSupervisor restart (not via `stop_user/1`), the ReconciliationGate remains - `:ready` from the previous instance because `stop_user/1` (which resets it) was - never called. The new OrderManager may accept orders during its own reconciliation. -- **HealthMonitor co-lifecycle violation** (Opus #2 + Sonnet #4): After a - DynamicSupervisor-initiated restart, the HealthMonitor is still subscribed to the - old PIDs — no code re-establishes monitoring for the new pipeline processes. - -**GPT-5 unique findings (not in either other model):** - -- **Kill switch bypass for users configured DURING engagement** (#1): A user who - saves credentials while the kill switch is engaged is never added to the pending - operator release set (only running pipelines are added at engage time). After - disengage, periodic reconciliation auto-starts this user's pipeline without - operator release — violating "resuming always requires human judgment." This is - the most precisely reasoned finding across all three models: each step is - individually correct per the spec, and the violation emerges purely from the - composition of legal operations. -- **Premature release bypass** (#2): If `operator_release_user/1` is called while - the kill switch is still engaged (a legal operation), it clears the pending - release flag but `start_user/1` correctly refuses. After later disengage, the - flag is gone — auto-start proceeds without fresh operator judgment. The release - was "spent" at the wrong time. - -**Claude Opus unique findings (not in either other model):** - -- **`operator_release_system/0` clears unrelated safety obligations** (#4): - Operator intends to release one user from a recent event but - `operator_release_system/0` also releases other users still pending from an - earlier, unresolved event. One release call discharges multiple independent - safety obligations — monotonicity violation. -- **State machine incompleteness for blocked users** (#6): Users who become - configured during kill switch engagement (blocked with reason - `:kill_switch_engaged`) have no state machine transition back to `starting` - after disengage — they're not in the pending release set, and no event fires. - System works via periodic reconciliation (up to 5 minutes delay), but the - documented state machine doesn't represent this path. -- **Self-correcting analytical style:** Opus explicitly withdrew two draft - findings mid-analysis ("Actually, this sequence works as designed. Let me - identify a real violation instead." / "this is likely handled"). This - self-correction behavior was first observed in Finding #15 and is now - confirmed as a consistent Opus trait for invariant-style analysis. - -**Claude Sonnet unique findings (not in either other model):** - -- **Cold-start Tier 3 failure creates supervision restart loop** (#2): A - persistent Tier 3 failure (phantom fills) crashes OrderManager, `:rest_for_one` - kills the tree, DynamicSupervisor restarts it, cold-start fails again → infinite - loop. State machine shows `starting → stopped` but supervision creates - `starting → starting` indefinitely. -- **HealthMonitor start failure during start_user** (#4): If HealthMonitor.Supervisor - is momentarily crashed when `start_user/1` runs step 4, the pipeline starts - without monitoring. No error handling specified for this partial-start state. - -**Quality assessment:** - -- **GPT-5** was MAXIMALLY SELECTIVE — only 3 findings from 12,032 reasoning tokens - (4,011 reasoning tokens per finding). This is the most extreme - reasoning-to-output ratio observed: 15:1 (12,032 reasoning / 784 output tokens). - For comparison, in previous experiments GPT-5 typically shows 1:1 to 2:1 ratios. - Every finding is a genuine invariant violation with a precise, step-by-step - sequence where each step is individually legal. ZERO false positives, zero - padding, zero "this might be an issue." GPT-5 appears to have used almost all - its reasoning budget for VERIFICATION — confirming that each candidate is - genuinely a violation before including it. -- **Claude Opus** produced the most findings (7) with its characteristic depth and - self-correction. Two findings were revised mid-analysis, showing Opus actively - testing its own reasoning against the document before committing to a finding. - The DynamicSupervisor restart thread (findings #1, #2, #3, #7) forms a coherent - cluster — Opus identified one root cause (OTP restarts bypass the lifecycle - layer) and explored its multiple consequences. The `operator_release_system` - monotonicity finding (#4) is architecturally significant and unique. -- **Claude Sonnet** was extremely fast (23s, 1,266 tokens) and produced 5 findings. - Quality was mixed: Finding #1 partially mirrors GPT-5's authority inversion but - with vaguer reasoning ("race condition with ETS operations" — not specified). - Finding #3 describes a contradiction but the scenario is internally inconsistent - (step 5 says "pipeline termination fails" but then step 7 says pipeline is still - running — this conflates two failure modes). Findings #2 and #4 are genuine and - well-reasoned. Sonnet's precision is lower than the other two on this task. - -**Key insight — "Invariant violation paths" as a task type:** - -This is a genuinely DIFFERENT analytical task from any previously tested. It requires: -1. Identifying the invariants (explicit or implied) -2. Constructing a sequence of operations (creative/generative) -3. Verifying each step is legal per the spec (verification) -4. Confirming the end state violates the invariant (correctness proof) - -This four-phase cognitive process explains GPT-5's extreme selectivity: steps 2-4 are -all verification-heavy, and GPT-5's reasoning tokens are being burned on steps 3 and 4 -(confirming each step is genuinely legal and the final state genuinely violates). In -previous tasks like "find hidden assumptions" or "find gaps," only step 1 (identification) -is needed — there's no construction or verification phase. - -**Comparison to previous task types:** - -| Task type | GPT-5 findings | Opus findings | GPT-5 reasoning overhead | -|---|---|---|---| -| Hidden assumptions | 20-35 | 12-13 | 5-7K reasoning | -| Race conditions | 12 | 10 | 8K reasoning | -| Design coherence | 4 | 7 | 9K reasoning | -| Invariant violation paths | 3 | 7 | **12K reasoning** | - -The pattern: as the task requires more VERIFICATION (vs identification), GPT-5 becomes -more selective and spends more reasoning tokens per finding. Invariant violation paths -demand the highest verification burden (every step must be confirmed legal), and GPT-5 -responds with the highest selectivity and reasoning investment. - -Opus inverts: it produces MORE findings on verification-heavy tasks (7 for coherence, -7 for invariant paths) vs identification tasks (10-13 for assumptions). This suggests -Opus uses its internal reasoning differently — it's more willing to present findings -that have "likely" rather than "proven" violations, then self-corrects inline if the -verification fails. - -**Practical implication:** - -For invariant violation path analysis: -- **GPT-5** produces the highest-precision findings but very few. Every finding is a - genuine spec-level bug. Use when you need zero-false-positive bug reports to present - to a design team. -- **Opus** produces more findings with slightly lower precision but unique analytical - depth. Its self-correction behavior means false positives are often caught inline. - Use when you want both confirmed violations AND identified tensions. -- **Sonnet** is too imprecise for this task type — some findings have internal - inconsistencies. Use for lighter analytical tasks (assumption-finding, spec gaps). - -The three findings GPT-5 produced are ALL genuine design bugs that should be fixed: -1. Users configured during kill switch engagement bypass operator release -2. Premature operator release (while KS still engaged) creates future bypass -3. Admin stops are overridden by periodic reconciliation - -These are the kind of findings that, in a real financial system, prevent production -incidents. The 12K reasoning tokens to produce 3 perfect findings is excellent ROI. - -### 21. Reasoning effort (low/medium/high) has negligible effect on GPT-5's analytical output; the parameter may not work as documented for open-ended analysis - -**Date:** 2026-05-04 -**Task:** Identify hidden assumptions in gargoyle's `order-state-machine.md` (221 lines) -— a well-structured state machine specification covering order lifecycle, fill precedence, -TIF semantics, and parameter resolution. -**How we used them:** Same document, same prompt, same model (GPT-5), same -max_completion_tokens (16K). Only variable: `reasoning.effort` parameter set to -"low", "medium", or "high". Run sequentially via HAI proxy (OpenAI-compatible -endpoint). No tools, no project context beyond the document. - -| Effort | Time (ms) | Output tokens | Reasoning tokens | Findings | -|---|---|---|---|---| -| Low | 97,913 | 7,657 | 4,288 | 33 (+11 recs) | -| Medium | 94,824 | 7,112 | 4,160 | 30 | -| High | 88,607 | 6,891 | 3,712 | 30 | - -**The counterintuitive result:** Higher reasoning effort produced FEWER findings, -FEWER reasoning tokens, FEWER output tokens, and completed FASTER. The expected -pattern (high effort → more reasoning → more depth) was inverted. - -**Per-finding metrics (remarkably consistent):** - -| Effort | Output tokens/finding | Reasoning tokens/finding | -|---|---|---| -| Low | 232 | 129 | -| Medium | 237 | 138 | -| High | 229 | 123 | - -The depth per finding was nearly identical across all three levels. The models -didn't get more detailed or rigorous per-finding at higher effort — they just -found slightly fewer things. - -**Severity distributions (similar across all three):** -- Low: 7 Critical, 21 High, 5 Medium (33 findings) -- Medium: 9 Critical, 15 High, 4 Medium + 2 borderline (30 findings) -- High: 6 Critical, 14 High, 5 Medium + 4 borderline (30 findings) - -**Qualitative differences — WHAT they found:** - -High-effort unique findings (not in low): -- Single-writer authority to broker (no out-of-band modifications) -- Broker emits fills for all executed quantities (no silent netting) -- Instrument identity remains stable across corporate actions -- Late-fill override won't violate downstream invariants -- Validation covers lot sizes, price ticks, borrow/locate constraints -- Multiple accounts and venues are part of the correlation key -- Streaming and polling APIs are consistent -- System can handle multi-leg instruments - -Low-effort unique findings (not in high): -- Acks arrive before fills (no pre-ack fills) -- Cancel-before-ack handling (submitted → cancelled missing) -- Fill totals never exceed requested quantity -- Deterministic ordering within a broker stream -- Exercise/assignment and non-order position changes -- Client-side idempotency of "place order" -- Partial accept/normalize on replace -- No "child" order fragmentation at broker -- Submitted state can receive terminal events -- Late cancel vs local expired mismatch - -**Character of the differences:** -- HIGH-unique findings tend to be more **architectural/systemic** (multi-leg - instruments, streaming vs polling consistency, downstream invariant violations, - corporate actions). These require reasoning about the system's relationship - to the broader world. -- LOW-unique findings tend to be more **implementation-specific edge cases** - (cancel-before-ack, pre-ack fills, child order fragmentation, partial accepts). - These require reasoning about specific event interleavings and protocol details. - -Both sets are valid and actionable. Neither is clearly "better." They represent -different analytical modes — breadth-of-scope (high) vs depth-of-protocol (low). - -**Key insight — reasoning_effort doesn't scale analysis linearly:** - -Three possible explanations for the inverted behavior: - -1. **GPT-5 already uses near-maximum reasoning for analytical tasks regardless - of the effort parameter.** The ~4K reasoning tokens across all three levels - (4288/4160/3712) are too similar to reflect a genuine effort gradient. The - parameter may primarily affect OTHER task types (math, code, logic puzzles) - where reasoning depth is more variable. - -2. **Higher effort increases FILTERING, not exploration.** At high effort, GPT-5 - may spend more of its reasoning on VERIFYING whether findings are genuine - before including them — similar to the extreme selectivity observed in - Finding #20 (invariant violation paths, 12K reasoning for 3 findings). This - would explain fewer findings despite theoretically "trying harder." - -3. **The parameter has minimal practical effect for this model version.** - The differences (33 vs 30 vs 30) are within normal stochastic variation. - Repeated runs at the same effort level might show similar variance. - -**The prompt cache hit on HIGH (2304 cached prompt tokens) may have slightly -accelerated processing, but doesn't explain the reasoning token difference.** - -**Comparison to previous findings:** -In Finding #20 (invariant violation paths), GPT-5 used 12,032 reasoning tokens -for 3 findings — extreme verification behavior. Here, at default effort on a -different task type (hidden assumptions), it uses ~4K reasoning for ~30 findings. -This confirms that TASK TYPE is a far stronger predictor of GPT-5's reasoning -behavior than the reasoning_effort parameter. The invariant violation prompt -triggered deep verification; the assumption-finding prompt triggers broad -exploration regardless of effort setting. - -**Practical implication:** -For open-ended analytical tasks (assumption-finding, gap analysis, spec review), -the reasoning_effort parameter appears to have negligible practical effect on -GPT-5. Don't bother tuning it for these tasks — the default is fine. The -parameter may be more meaningful for: -- Tasks with verifiable correct answers (math, logic) -- Tasks where the model could short-circuit (simple questions) -- Extremely long documents where exploration budget matters - -For architecture review specifically: reasoning_effort is NOT a useful lever. -Task framing (the prompt structure) and document selection remain the dominant -variables for output quality. Save reasoning_effort tuning for coding/math tasks -where the parameter was likely trained and evaluated. - -**Open question:** Would running the same experiment 5x at each level show that -the 33-vs-30 difference is within stochastic noise? If so, reasoning_effort is -effectively a no-op for analytical prompts. If not, low-effort consistently -produces more (less filtered) output, which could be useful for brainstorming- -style analysis where you want maximum coverage before manual triage. - -### 27. Design coherence on risk-controls.md: GPT-5 regains top position; Opus's advantage from Finding #15 was document-specific, not task-specific - -**Date:** 2026-05-05 -**Task:** Identify internal design incoherences in gargoyle's `risk-controls.md` (277 lines) -— a pre-trade risk control specification covering two evaluation stages, reduction semantics, -ordering rationale, fail-closed claims, and audit logging. -**How we used them:** Same document (full text) + same focused analytical question to all -3 models via HAI proxy. Highly structured prompt specifying 5 categories of incoherence -(safety properties not enforced, ordering/sequencing contradictions, reduction semantics -conflicts, fail-closed claims vs actual behavior, cross-stage inconsistencies). Required -each finding to reference specific contradictory parts. No tools, no project context beyond -the document itself. - -| Model | Time | Output tokens | Reasoning tokens | Incoherences found | Critical | High | Medium | -|---|---|---|---|---|---|---|---| -| GPT-5 | 112s | 8,231 | 7,232 | 6 | 1 | 3 | 2 | -| Claude Opus 4.6 | 41s | 1,858 | (internal) | 5 | 2 | 2 | 1 | -| Claude Sonnet 4.6 | 15s | 699 | (internal) | 4 | 1 | 2 | 1 | - -**What they found — common ground (all 3 identified):** -- Reduction re-entry at BuyingPower contradicts "reducing controls never re-enter - earlier controls" (all three flagged this as the most obvious contradiction — - Concentration at position 5 reduces, re-enters at BuyingPower at position 4, - which IS an earlier control) -- Ordering rationale's categorization of buying power/concentration is internally - confused (the doc labels both as "quantity-sensitive checks" that run after - reducing controls, but concentration IS a reducing control at position 5 while - buying power at position 4 sits between the two reducing controls) - -**GPT-5 unique findings (not in either Claude model):** -- Signal-level "no portfolio context" contradicts NoShortSales requiring knowledge - of current positions. The doc explicitly states signals are evaluated "in isolation" - with "no portfolio context — only the signal itself and user settings" — but checking - whether the user holds a position IS portfolio context. This is a genuine design - tension: either SignalRisk has hidden portfolio access (violating isolation) or - NoShortSales can't actually work as specified. -- Settings "fall through to system defaults" vs "Settings cache miss → reject." - Two incompatible instructions for the same condition (missing settings). -- "Universal fail-closed" with "only exception is order rate window" contradicted - by Failure Modes table showing buying power as another exception ("Conservative - estimate; may over-reject" is NOT rejection — it's a different failure mode than - either fail-closed or the documented single exception). -- Audit model says "every control evaluation produces an audit entry regardless of - outcome" but the signal-stage write point only describes writing on rejection. - Passing signals produce no documented audit entry at the signal stage. - -**Claude Opus unique findings (not in either other model):** -- Signal flow diagram swaps control order vs table: table shows (1) MarketHours, - (2) PerTradeStop, (3) NoShortSales, but diagram flows MarketHours → NoShortSales - → PerTradeStopLoss. Controls 2 and 3 are reversed between the two representations. - (VERIFIED: this is correct — the diagram does show a different order.) -- Concentration re-entry loop can bypass Order Rate, Duplicate, Self-Trade, and - Fat Finger entirely during intermediate iterations. Also: Position Size at order 3 - is never re-checked against Concentration-reduced quantity because re-entry starts - at BuyingPower (order 4), meaning "smaller quantity wins" semantics are implemented - differently than the linear model described in Reduction Semantics. - -**Claude Sonnet unique findings (not in either other model):** -- Buying Power "Can reduce? No" creates a dead end: if a reduced quantity still - exceeds buying power, the system can only reject entirely (no mechanism to further - optimize), defeating the purpose of the reduction system for capital-limited users. - (NOTE: this is more of a design limitation than a self-contradiction, but the - framing — that the reduction system's purpose is undermined by buying power's - inability to reduce — is a legitimate coherence observation.) - -**Quality assessment:** -- **GPT-5** produced the most findings (6) with the broadest coverage across the - prompt's 5 categories. The NoShortSales/portfolio-context finding is the most - genuinely insightful — it's a fundamental design-level contradiction (a signal-level - control that REQUIRES decision-level context). The settings contradiction and - audit logging inconsistency are also solid. Every finding points to two specific - textual statements that are incompatible. Severity ratings were calibrated (1 - Critical, 3 High, 2 Medium — compared to Opus's 2 Critical for similar findings). -- **Claude Opus** was remarkably fast (41s, 1,858 tokens) and found one thing - neither other model caught: the diagram/table order reversal for signal controls. - This is a concrete, verifiable error (not a design tension — a literal mistake in - the document). The re-entry loop analysis (finding #5) goes deeper than GPT-5's - version of the same core issue, exploring the implications for "smaller quantity - wins" semantics. However, Opus found fewer total issues and missed the - settings contradiction and audit logging inconsistency. -- **Claude Sonnet** was the fastest (15s, 699 tokens) and found 4 issues. The buying - power dead-end observation is unique and shows genuine reasoning about the reduction - system's limitations. However, it's more of a "this design can't achieve its stated - goal" than a strict self-contradiction. Sonnet's other findings overlap with the - common ground. Quality is solid but narrower scope. - -**Key insight — Finding #15's Opus > GPT-5 result was document-specific:** -In Finding #15 (coherence checking on failure-modes.md), Opus found 7 incoherences -vs GPT-5's 4. Here, on risk-controls.md, GPT-5 found 6 vs Opus's 5. The reversal -suggests that the relative performance on coherence checking depends on the -DOCUMENT'S structure, not on a fixed model advantage: - -- **failure-modes.md** (383 lines): A complex multi-process system with many - stated invariants across failure states, supervision trees, and recovery paths. - Rich in design TENSIONS where one subsystem's safety mechanism undermines another. - This plays to Opus's strength (finding design tensions between subsystems). -- **risk-controls.md** (277 lines): A more focused specification with explicit rules, - ordering constraints, and behavior tables. Rich in SPECIFICATION CONTRADICTIONS - where one statement directly conflicts with another. This plays to GPT-5's - strength (systematic verification of claims against stated mechanisms). - -The difference: Opus excels when contradictions are EMERGENT (arise from composing -multiple design decisions). GPT-5 excels when contradictions are EXPLICIT (two -statements in the document say incompatible things). Risk-controls.md has more -explicit contradictions (the settings fallback vs fail-closed, the "no portfolio -context" vs NoShortSales, the audit "always" vs write point "only on reject"). - -**Model performance depends on CONTRADICTION TYPE:** -| Contradiction type | Best model | Example | -|---|---|---| -| Emergent/compositional | Opus | "Rest-for-one cascade creates a 5th state" | -| Explicit/definitional | GPT-5 | "No portfolio context" but check requires portfolio | -| Diagrammatic/structural | Opus | Table order ≠ diagram order | -| Semantic/category confusion | All (common ground) | Reduction re-entry violates ordering claims | - -**Revised conclusion on Finding #15's open question:** -"Does Opus > GPT-5 ordering for coherence checking hold across other documents?" -**No.** The ordering depends on the document's contradiction density and type. -Documents rich in emergent design tensions favor Opus. Documents with explicit -specification errors favor GPT-5. The task type (coherence checking) doesn't have -a fixed model winner — it depends on what KIND of incoherences the document contains. - -**Practical implication:** Continue running both models for coherence checking. Their -strengths are complementary even within the same task type. GPT-5 catches things you -can point to in the spec and say "these two sentences conflict." Opus catches things -where you need to reason about the implications of multiple mechanisms interacting. - -## Open Questions - -- Does GPT's advantage in finding inconsistencies extend to logical - inconsistencies in arguments? One data point (verdict mismatches) — need more. -- What's the optimal task granularity for GPT analytical review? "Whole PR" is - too big. Is "one hypothesis" right, or can we batch? -- ~~Is the GPT-4.1 Mini bias detection result repeatable, or was it a well- - structured task that any model would ace?~~ **ANSWERED (Finding #8):** Any - model aces it when the biased text is presented without noise. The original - result was about noise elimination, not model capability. -- **NEW:** Does adding a narrow bias-check question to a rich PR review - context recover the detection that broad review misses? (Signal-to-noise - confirmation test) -- ~~How does reasoning_effort affect analytical quality? Only tested default so - far.~~ **ANSWERED (Finding #21):** Negligible effect on GPT-5 for open-ended - analytical tasks. Low/medium/high produced 33/30/30 findings with nearly - identical reasoning tokens (~4K) and per-finding depth. The parameter - may primarily affect verifiable-answer tasks, not exploration. Task framing - remains the dominant quality lever. -- Can we design a systematic "analytical review checklist" that leverages each - model's strengths? -- ~~What analytical tasks is Opus best at vs Sonnet? Finding #11 shows Opus - excels at design-tension identification. How does Sonnet compare on the - same task? (Sonnet is non-reasoning but fast — would it match GPT-4.1?)~~ - **ANSWERED (Finding #12):** Sonnet 4.6 significantly outperforms GPT-4.1 - (17 vs ~14 assumptions) and approaches GPT-5 (17 vs 20). It's not a - non-reasoning model in the GPT-4.1 sense — it occupies a middle tier with - genuine component-interaction reasoning. Opus still wins on design-tension - identification specifically. -- How do the models compare on research synthesis tasks (our #381 rewrite)? - We'll find out during the actual rewrite. -- ~~Does the reasoning-token advantage scale with document complexity? Test - with a simpler doc to see if the gap narrows.~~ **ANSWERED (Finding #11):** - The gap doesn't narrow with simpler docs. GPT-5 maintains ~1.7x the findings - of GPT-4.1 regardless of document complexity. Reasoning tokens enable - exhaustive exploration independent of input difficulty. -- ~~Would Claude Opus (also a reasoning model) match GPT-5's assumption-finding - performance, or does it have different blind spots?~~ **ANSWERED (Finding #11):** - Different blind spots, different strengths. GPT-5 reasons deeper into - implementation mechanics (breadth + technical depth). Opus reasons wider - about system context and design tensions (insight density). They're - complementary, not competing. Run both on important architecture docs. -- ~~Does Sonnet 4.6's strong showing hold across other analytical tasks - (bias detection, gap-finding) or is it specific to assumption-finding on - complex documents? Need to test Sonnet on simpler docs and different question - types.~~ **PARTIALLY ANSWERED (Finding #13):** Sonnet's strength does NOT - transfer to concurrency reasoning. It dropped from 85% of GPT-5 (assumption- - finding) to ~58% (race condition identification). Task type matters more - than we thought. Still untested: gap-finding, bias detection for Sonnet. -- **NEW:** What other analytical tasks require sequential/temporal reasoning - (like race condition identification) vs pattern-matching reasoning (like - assumption-finding)? Building a task taxonomy would help assign models - correctly. -- **NEW:** What explains Sonnet taking slightly longer than Opus here (106s vs - 105s) despite normally being the faster model? Is it the document length, or - does Sonnet's internal reasoning scale with complexity similarly to Opus? -- ~~How does GPT-5 Mini compare to GPT-5 on analytical tasks? Is it a viable - cheaper substitute?~~ **ANSWERED (Finding #14):** GPT-5 Mini is a viable - middle option. Finds fewer issues (6 vs 10) but with genuine reasoning - depth at ~50% cost/time. Better than non-reasoning models, not as - exhaustive as GPT-5. -- **NEW:** How does Claude 4.5 Opus/Sonnet compare to Claude 4.6? HAI now - exposes both; worth testing whether the newer versions regress on - analytical tasks. -- ~~Would running GPT-5 Mini + Sonnet together (different axes) - approach GPT-5's coverage at lower combined cost?~~ **ANSWERED (Finding #19):** - 71% coverage at 31% cost. Good for low-stakes work; GPT-5 irreplaceable for - high-stakes due to unique domain-knowledge findings in the missing 29%. -- ~~**NEW (Finding #15):** Does the Opus > GPT-5 ordering for coherence checking - hold across other documents? The inversion (Opus finding more than GPT-5) - was striking — need to confirm it wasn't document-specific.~~ - **ANSWERED (Finding #27):** No — it was document-specific. On risk-controls.md, - GPT-5 found 6 vs Opus's 5. The winner depends on contradiction TYPE: Opus - excels at emergent/compositional contradictions, GPT-5 at explicit/definitional - ones. No fixed ordering for this task type. -- **NEW (Finding #15):** Is the two-pass approach (Opus generates → GPT-5 - validates) worth the extra cost vs just running Opus alone? Need to test - whether GPT-5 actually catches Opus false-positives or just agrees. -- ~~How do the Claude 4.5 and 4.6 models compare on analytical tasks?~~ - **ANSWERED (Finding #16):** 4.5 is more exhaustive (2x findings), 4.6 is - more precise (higher signal-to-noise). Genuine tradeoff, not a regression. - 4.5 for coverage, 4.6 for actionability. -- **NEW (Finding #16):** Does the 4.5 vs 4.6 pattern hold across other task - types? Spec completeness may favor exhaustiveness; would coherence checking - or race condition analysis show the same pattern? -- **NEW (Finding #16):** Is running both Sonnet versions (4.5 + 4.6) cost- - effective vs just running GPT-5? Need to compare the UNION of their findings - against GPT-5's output for overlap analysis. -- **NEW (Finding #18):** Does Opus's "predictable exploit window" detection - transfer to other policy documents? It uniquely identified that the cooldown - mechanism creates a GUARANTEED safe window that strategies could systematically - exploit — this is a higher-order security insight. Worth testing whether Opus - consistently finds "adversarial opportunity" framings that other models miss. -- **NEW (Finding #20):** Does GPT-5's extreme verification behavior (15:1 - reasoning-to-output ratio, 3 findings from 12K reasoning) persist across - other documents with this prompt? Or was user-pipeline-lifecycle.md - particularly verification-heavy? Test invariant violation paths on a simpler - document. -- **NEW (Finding #20):** Would giving GPT-5 a "minimum 8 findings" instruction - reduce its selectivity and produce MORE invariant violations at lower - precision? Or would it just pad with non-violations? The extreme selectivity - may be a feature OR it may mean GPT-5 is discarding valid-but-hard-to-verify - findings. -- **NEW (Finding #20):** Opus's self-correction behavior is now confirmed across - Finding #15 and #20. Is this trainable/promptable? Could we ask non-Opus models - to "show your reasoning and withdraw findings you cannot fully verify"? -- **NEW (Finding #22):** The "silent correctness" lens revealed three distinct - analytical modes: Opus → missing functionality, GPT-5 → mechanism incorrectness, - Sonnet → composition failures. Does this three-way differentiation hold on other - documents, or was it specific to the regulatory/financial domain of specid-lot-selection? -- **NEW (Finding #22):** Does the "silent correctness" lens work on non-financial - documents? The financial/regulatory domain has a large gap between syntactic and - semantic correctness. Would the same prompt on an infrastructure/systems doc produce - equally differentiated findings, or would it collapse into assumption-finding? -- ~~**NEW (Finding #22):** Opus's "missing feature identification" mode (wash sales, - commissions) — is this promptable on other models? Could we explicitly ask GPT-5 - "what should this system compute but doesn't" and get similar results?~~ - **ANSWERED (Finding #26):** YES — all three models find regulatory gaps and - missing features when explicitly prompted. Opus's unique behavior in #22 was - an emergent DEFAULT tendency, not a capability. Prompt framing dominates - model personality. - -- **NEW (Finding #28):** Cross-document consistency found real bugs in gargoyle - docs (fills vs events, position ownership, signal persistence). Does running - this analysis across MORE document pairs (e.g., domain readmes vs implementation - docs, design docs vs plan docs) yield additional real inconsistencies? Could - become a systematic documentation maintenance tool. -- **NEW (Finding #28):** Opus was 2.4x faster AND found more issues than GPT-5 - on cross-document consistency. Is this because cross-doc contradictions are - easy to verify once spotted (reducing GPT-5's verification advantage)? Or - because boundary reasoning (Opus's strength) is the primary skill needed? - -## Methodology Notes - -- Internet opinions about models are overwhelmingly about coding. Don't - extrapolate to analytical work without testing. -- "Just because someone says it on the internet doesn't make it right." — - Aaron, 2026-04-26. Opinions need context. Track our own evidence. -- Absence of published methodology for a use case is itself a finding. -- Each finding needs: date, task, **how we used it** (context shape, task - framing, what info the model had/didn't have), what happened, takeaway. - No unsupported generalizations. -- **Context dimensions to track:** - - Rich vs minimal (how much background info) - - Broad vs focused ("review this" vs "answer this specific question") - - What kind of context (diff, full files, issue text, research notes, - project conventions, nothing) - - Whether the model had access to tools or just text - - Whether the task was explicit step-by-step or open-ended -# Design Coherence Analysis — Finding #15 - -**Date:** 2026-05-03 -**Task:** Identify internal design incoherences in gargoyle's `failure-modes.md` (383 lines) -— places where the document's stated principles/invariants are contradicted by its own -specified mechanisms. -**How we used them:** Same document (full text) + same focused analytical question to all -3 models via HAI proxy. Highly structured prompt specifying 5 categories of incoherence -to look for (safety properties not enforced, state machine violations, recovery contradictions, -supervision conflicts, cross-mechanism contradictions). Required each finding to reference -specific sections. No tools, no project context beyond the document itself. - -| Model | Time | Output tokens | Reasoning tokens | Incoherences found | -|---|---|---|---|---| -| Claude Sonnet 4.6 | ~39s | 1,168 | (internal) | 5 | -| Claude Opus 4.6 | ~105s | 3,378 | (internal) | 7 (8 attempted, 1 self-withdrawn) | -| GPT-5 | ~120s | 10,235 | 9,088 | 4 | - -**What they found — common ground (all 3 identified):** -- State machine universality claim vs Strategy.Worker crash behavior (process - crashes bypass the degraded state entirely — no transition path in the model) -- Market data staleness advisory-only vs the "don't trade when ambiguous" principle - (or vs concurrent failure auto-halt) -- `pending_cancel`/`pending_replace` absent from recovery query set (GPT-5 and - Sonnet found this directly; Opus addressed the broader state machine gap) - -**GPT-5 unique findings (not in either Claude model):** -- Kill switch halted = "process terminated" vs kill switch requiring RUNNING - processes for cancel_all, FLATTEN, and LIQUIDATE modes. The state definition - claims processes are terminated, but the mechanisms require them alive to - execute orders. **This is the most architecturally significant finding** — it - reveals a fundamental definitional error in the state machine. -- Per-symbol degradation contradicts the process-level degradation semantics. - A worker "enters degraded" but continues operating for non-stale symbols — - violating the stated definition that degraded = "cannot perform primary - function." The metrics/eventing model has no per-symbol dimension. - -**Claude Opus unique findings (not in either other model):** -- `:rest_for_one` cascade creates a FIFTH implicit state (terminated-and- - restarting) not in the four-state model — processes that were `normal` are - forcibly killed (not by kill switch) and restart. Self-corrected one finding - that initially looked like incoherence but was actually consistent. -- PortfolioMonitor continues evaluating with stale data ("fail-safe") while - Strategy.Workers are stopped for the SAME condition — contradicts both the - universal state machine (PM doesn't transition to degraded) and the doc's - reasoning about why stale data is dangerous. -- Path-dependent recovery criteria: Strategy.Worker recovery requires 21+ bars - after crash but only "price continuity check" after staleness. The state - machine's single "catch-up complete" exit condition can't express this. -- `halted → [*]` transition in state diagram is logically impossible if "halted" - means the process is already terminated — dead processes can't fire transitions. -- Compound failure detection requires a meta-observer across processes but the - per-process state machine model has no way to express cross-process conditions. - -**Claude Sonnet unique findings (not in either other model):** -- Market data global staleness: the failure table says "Manual (disengage)" for - recovery — implying automatic engagement happened — but the text says it's - advisory only. Table contradicts prose. -- ReconciliationGate: doc claims gate survives OM crash (separate supervision - tree), but then says "missing ETS table = not ready" when OM crashes. If the - gate survives, why would its table be missing? -- Signal survival claims are contradictory between sections: worker crash says - downstream signals survive, but OM crash says all upstream signals lost. - (NOTE: this is actually describing different scenarios — worker crash doesn't - cascade to SignalRisk, OM crash does via :rest_for_one. Sonnet may have - misread the architecture here — the two statements are consistent when you - understand the supervision tree.) - -**Quality assessment:** -- **GPT-5** found only 4 incoherences but TWO of them are genuinely critical - architectural findings. The "halted = terminated" vs "kill switch requires - running processes" contradiction is a real design error — you can't both - terminate processes AND require them to execute cancel/liquidation orders. - The per-symbol degradation finding is also a real modeling gap. GPT-5 was - MORE SELECTIVE here than in previous experiments — it didn't pad with - medium-severity findings. Each of its 4 was high/critical. -- **Claude Opus** produced the most findings (7 valid) with characteristic - depth. Its self-correction (withdrawing finding #6 after deeper analysis) - shows intellectual honesty rare in model outputs. The PortfolioMonitor - stale-data contradiction is genuinely insightful — same input condition, - opposite response, no justification within the state machine model. The - compound failure meta-observer finding identifies a modeling category error. - Opus also found modeling imprecisions (path-dependent recovery, halted → [*] - impossibility) that the other models didn't notice. -- **Claude Sonnet** found 5 issues quickly (39s, 1,168 tokens) but quality was - mixed. Finding #4 (ReconciliationGate) raises a genuine question about - the ETS table ownership claim. Finding #1 (table vs prose contradiction on - market data staleness) is a real documentation inconsistency. However, - Finding #5 appears to misread the supervision architecture — the two - statements about signal survival ARE consistent when you understand that - different crashes cascade differently. Sonnet produced one false positive. - -**Key insight — "design coherence" is a NEW analytical category with distinct model strengths:** -This is different from assumption-finding (Finding #10-12), race conditions -(Finding #13), and cross-component interactions (Finding #14). Coherence -checking requires the model to hold MULTIPLE parts of the document in tension -with each other and reason about whether they're compatible. Results: - -- **GPT-5** was MORE SELECTIVE than in any previous experiment. Only 4 findings - vs 10-24 in other tasks. But precision was near-perfect — all 4 are genuine - contradictions. This suggests GPT-5's reasoning tokens are being used for - VERIFICATION (checking whether apparent contradictions hold up) rather than - EXPLORATION (finding more things). The 9K reasoning tokens produced 4 findings - vs the usual 10+ — GPT-5 is self-editing aggressively. -- **Opus** hit its sweet spot. Coherence checking IS design-tension identification - — Opus's consistent strength. Finding incoherences requires exactly the kind - of "how does this design disagree with itself" reasoning that Opus excels at. - It also showed unique self-correction behavior (withdrawing a finding after - deeper analysis). -- **Sonnet** was fast but produced a false positive. Coherence checking requires - holding multiple document sections in memory simultaneously and reasoning about - their compatibility — this is harder than assumption-finding (where you - reason about one mechanism at a time) but easier than race conditions (which - require sequential temporal reasoning). Sonnet occupies a middle ground. - -**Model ranking for design coherence checking:** -1. Claude Opus 4.6 — most findings, highest depth, self-correcting (7 valid) -2. GPT-5 — fewest findings but near-perfect precision, finds the critical ones (4) -3. Claude Sonnet 4.6 — fast screening, but prone to false positives on - architectural misreads (4/5 valid) - -**This inverts the usual GPT-5 > Opus ordering.** In previous experiments, GPT-5 -consistently found MORE issues. Here, GPT-5 was more selective than Opus. The -task type (self-consistency checking) favors Opus's "design tension" reasoning -style over GPT-5's "exhaustive exploration" style. GPT-5 apparently uses its -reasoning to VERIFY rather than GENERATE when the task is about contradictions -rather than gaps. - -**Practical implication:** For architecture documents, run coherence checking as -a separate pass using Opus as the primary model. GPT-5's higher precision means -it's good for confirming which Opus findings are genuine vs overreads. The -two-pass approach: Opus generates candidates → GPT-5 validates → result is the -intersection plus GPT-5's independent finds. - -### 16. Specification completeness: Sonnet 4.5 produces 2x the findings of Sonnet 4.6 on implementation-gap analysis; quality vs volume tradeoff - -**Date:** 2026-05-03 -**Task:** Identify specification gaps in gargoyle's `kill-switch.md` (185 lines) — places -where an implementer would be forced to guess or decide on their own because the spec -doesn't clearly specify behavior. New analytical lens not previously tested. -**How we used them:** Same document (full text) + same focused analytical question to all -3 models via HAI proxy. Highly structured prompt specifying 5 categories of underspecification -(behavioral ambiguity, missing edge cases, ordering/sequencing gaps, interface contracts -undefined, concurrency semantics omitted). Required specific output format per finding -(gap, section, what implementer must decide, risk if wrong, severity). No tools, no -project context beyond the document itself. - -| Model | Time | Output tokens | Reasoning tokens | Gaps found | Critical | High | Medium | Low | -|---|---|---|---|---|---|---|---|---| -| Claude Sonnet 4.6 | 73s | 3,403 | (internal) | 13 | 8 | 4 | 0 | 1 | -| Claude Sonnet 4.5 | 102s | 5,191 | (internal) | 25 | 14 | 6 | 4 | 1 | -| GPT-5 | 109s | 10,140 | 7,872 | 19 | 8 | 7 | 3 | 0 | - -**What they found — common ground (all 3 identified):** -- Pipeline process identification ambiguity (which processes are "pipeline processes") -- Per-user process scope mapping (how to terminate only one user's processes) -- ETS table ownership and lifecycle (who owns it, what happens on crash) -- Concurrent engage operations (what happens when two sources engage simultaneously) -- Liquidation order tagging mechanism (what the tag is, how verified) -- Process restart prevention (how "must not restart" is enforced) -- Engage sequence atomicity (partial failure between DB write and termination) -- Startup ordering and ETS readiness (pipeline starting before ETS populated) -- Disengage sequence ordering (what happens and in what order) - -**Sonnet 4.5 unique findings (not in either other model):** -- ETS table schema/structure (set vs ordered_set, key format, value schema) -- Missing ETS detection mechanism (catch :badarg vs table existence check) -- Database write atomicity with ETS (transaction boundaries, rollback semantics) -- Per-user engage while global is already engaged (is it a no-op or error?) -- Broker rejection semantics ("already filled" vs "invalid cancel" distinction) -- Cold-start gate interaction (independence vs dependency of the two gates) -- User deletion with active kill switch (orphaned rows, cascade semantics) -- Global disengage effect on per-user states (independent or auto-clear?) -- Audit log write failure during engage (critical-path vs best-effort) -- Dashboard control ambiguity in LIQUIDATE mode (contradictory disable/enable) -- Cancel timeout duration (operational parameter not specified) -- Manual order source code path during LIQUIDATE (how orders bypass the dead pipeline) - -**GPT-5 unique findings (not in either other model):** -- Combined global/per-user mode semantics (what happens when global=RESTRICT, - user=LIQUIDATE — can user's liquidation proceed?) -- Scope of "all" in cancel_all and liquidation (system-wide vs per-user) -- Gate behavior when ETS missing but liquidation needed (conflicting requirements: - fail-closed says block, but liquidation needs to pass) -- Disengage during in-flight cancellations (what happens to racing tasks) -- Gate placement relative to broker submission (exact point in the flow) -- Engage latency expectations (no quantified SLA) -- Mode change while already engaged (RESTRICT → LIQUIDATE without disengage) -- Dashboard vs backend scope for manual liquidation (individual vs bulk only) - -**Sonnet 4.6 unique findings (not in either other model):** -- ETS sequencing relative to process termination (ETS before or after kill?) -- Concurrent disengage + re-engage race (specific interleaving scenario) -- Close-only enforcement mechanism (UI-only vs backend validation) -- Order-in-flight past ETS gate during termination (already-checked orders) - -**Quality assessment:** -- **Claude Sonnet 4.5** was the most EXHAUSTIVE (25 gaps) but with notable - quality variance. Several findings were highly specific and implementation- - relevant (ETS schema, missing-table detection, broker rejection semantics). - Others were relatively obvious or lower-impact (user deletion, audit log - failure, cancel timeout duration). The 14 Critical ratings feel somewhat - generous — some would be more accurately rated as High in practice. Output - was well-structured with clear per-finding format. -- **GPT-5** found 19 gaps with consistent high quality. Its unique findings - show cross-cutting reasoning: the combined mode semantics finding (global - vs per-user mode interaction) identifies a genuine specification gap that - neither Sonnet version noticed. The "ETS missing but liquidation needed" - finding is architecturally significant — it identifies a CONTRADICTION in - the spec's own rules (fail-closed blocks everything, but liquidation must - pass). Every finding was actionable. More selective severity ratings - (8 Critical vs Sonnet 4.5's 14). -- **Claude Sonnet 4.6** was the most SELECTIVE (13 gaps) but with the highest - precision. Every finding was genuinely a specification gap that an - implementer would face. The ETS sequencing finding (#4) is particularly - well-reasoned — it identifies a specific ordering dependency that creates - a race window. Sonnet 4.6 appears to self-filter aggressively, producing - only findings it's confident about. Higher signal-to-noise than 4.5. - -**Key insight — Sonnet 4.5 vs 4.6 on analytical tasks:** -This is the first direct comparison between Claude model versions on the same -analytical task. Key differences: - -- **Volume:** 4.5 produced almost 2x the findings (25 vs 13) -- **Tokens:** 4.5 used ~1.5x the output tokens (5,191 vs 3,403) -- **Time:** 4.5 took ~1.4x longer (102s vs 73s) -- **Severity distribution:** 4.5 had more Critical findings (14 vs 8) but - with more generous severity ratings -- **Quality per finding:** 4.6 had higher average quality; fewer "obvious" - or lower-impact findings - -The 4.6 model appears to have been trained toward higher precision/selectivity. -It finds fewer things but each finding is more reliably a genuine gap. The 4.5 -model is more exhaustive but includes findings that a reviewer might triage as -"yes, technically, but not really a spec gap." This mirrors a known training -direction in Claude models: later versions tend to be more concise and selective. - -**For practical use:** If you want completeness (cast a wide net, accept some -noise): use 4.5. If you want precision (every finding is actionable, no triage -needed): use 4.6. For architecture review where missing a gap has cost, 4.5's -exhaustiveness is probably worth the noise. For review where false positives -cost attention (e.g., PR review comments), 4.6's selectivity is preferred. - -**GPT-5 vs Sonnet comparison on this task:** -GPT-5 (19 findings) sits between the two Sonnets in volume but has the highest -consistency — no obvious misses or inflated severities. Its unique strength -here: finding CONTRADICTIONS within the spec's own rules (ETS-missing blocking -conflicts with liquidation needing to pass). This is consistent with Finding #15 -where GPT-5 was unusually selective but precise on coherence checking. - -Specification completeness analysis appears to be a task where: -1. Sonnet 4.5 is strongest for breadth (25 findings, catches operational gaps) -2. GPT-5 is strongest for detecting spec self-contradictions (19 findings, high precision) -3. Sonnet 4.6 is strongest for precision (13 findings, zero noise) - -**Updated model version comparison:** -- Claude 4.6 → higher precision, more selective, concise -- Claude 4.5 → more exhaustive, more verbose, occasional severity inflation -- This is a genuine tradeoff, not a simple regression or improvement - -**Practical implication:** Run BOTH Sonnet versions? 4.5 catches things 4.6 -filters out (ETS schema, broker rejection semantics, cold-start gate interaction). -4.6 catches things with more specificity (sequencing gaps, exact race windows). -For a one-shot budget: 4.5 if you want coverage, 4.6 if you want actionability. -GPT-5 if you want to find where the spec contradicts itself. - -### 7. Token budget matters more than model size for gap analysis (confirmed) - -**Date:** 2026-05-03 -**Task:** Identify unaddressed failure scenarios in gargoyle's `failure-modes.md` (383 lines, ~25KB) -**How we used them:** Same document, same analytical question ("What failure scenarios -are NOT covered?"), three models. GPT-5 with 16K max_completion_tokens, Sonnet 4 -with 4K max_tokens, GPT-4.1 Mini with 4K max_completion_tokens. No project context -beyond the document itself. Pure gap-analysis task. - -**Results:** -- GPT-5 (16K budget): 28 gaps, most exhaustive. Found domain-specific edge cases - others missed entirely: ClOrdID collision across restarts, fractional share rounding, - broker maintenance windows (410/426), hot code upgrades, regulatory halts vs staleness - distinction, cancel-ack-then-late-fill race, DNS/TLS as distinct from network outage. -- Sonnet 4 (4K budget): 12 gaps, concise. Unique framing: distinguished latency - degradation from outage (subtle but actionable). ETS corruption vs loss. -- GPT-4.1 Mini (4K budget): 13 gaps with summary table. Unique angles: unknown broker - status enum values, configuration schema mismatches on cold-start, malformed signals - from logic bugs (not just crashes). - -**Overlap (all three):** Rate limiting, clock skew, resource exhaustion, DB failures, -message backpressure, partial connectivity. - -**Key insight:** GPT-5's 4K attempt produced ZERO output (finish_reason: length) — -all tokens consumed by internal reasoning. At 16K it produced the richest analysis. -This confirms finding #3 (GPT-5 needs generous token budgets) AND adds a new -observation: for open-ended analytical questions, GPT-5's reasoning overhead is -proportionally larger. The 4K models (Sonnet, Mini) both produced useful output at -4K because they don't burn tokens on chain-of-thought. - -**Model personality confirmed:** -- GPT-5: exhaustive, domain-aware, finds edge cases a senior SRE would know -- Sonnet: precise, architectural, finds design-level distinctions -- GPT-4.1 Mini: structured, systematic, finds enumeration gaps - -**Practical implication:** For failure mode / gap analysis on design docs: -- GPT-5 with ≥16K tokens for maximum coverage (most unique findings) -- Sonnet for architectural framing ("this is really two different problems") -- Mini for completeness checking ("what about this enum value?") -- Running all three costs ~$0.50 and catches gaps none alone would find -- GPT-5 at 4K is USELESS for this task — always give it room to think - -**Note on GPT-5 reasoning overhead:** First attempt at 4K max_completion_tokens -returned empty content with finish_reason: length. The model spent all 4K tokens -on internal reasoning and produced nothing. This is worse than a short answer — -it's zero value for non-zero cost. Always budget ≥16K for GPT-5 analytical tasks. - -### 18. Temporal boundary analysis: GPT-5 is most exhaustive; Opus finds design-level contradictions; Sonnet 4.5 is structured but less deep - -**Date:** 2026-05-04 -**Task:** Identify temporal boundary vulnerabilities in gargoyle's `escalation-policy.md` -(238 lines) — scenarios where the timing model (evaluation cycles, debounce counts, -cooldown periods) creates windows of incorrect or dangerous behavior. -**How we used them:** Same document (full text) + same focused analytical question to all -3 models via HAI proxy. Highly structured prompt specifying 5 categories of temporal -vulnerability (timing exploitation, counter reset abuse, asymmetric time exposure, -cross-metric temporal interactions, state loss temporal effects). Required specific -output format per finding (name, sequence with cycle numbers, mechanism, severity, fix). -No tools, no project context beyond the document itself. - -| Model | Time | Output tokens | Reasoning tokens | Findings | Critical | High | Medium | -|---|---|---|---|---|---|---|---| -| GPT-5 | ~128s | 9,175 | 5,888 | 15 | 3 | 7 | 2 | -| Claude Opus 4.6 | ~120s | 5,112 | (internal) | 10 | 3 | 5 | 2 | -| Claude Sonnet 4.5 | ~100s | 4,056 | (internal) | 12 | 3 | 3 | 3 | - -**What they found — common ground (all 3 identified):** -- Flash crash / inter-evaluation gap exploitation (metric spikes between discrete - evaluation cycles go undetected) -- Single clear cycle resetting debounce counter (transient recovery defeats escalation - despite sustained risk — metric can breach 80%+ of cycles and never escalate) -- Asymmetric escalation time vs loss compounding rate (11 cycles to reach liquidation - while losses compound every single cycle) -- Monitor crash resets state to Clear, losing all escalation progress -- Liquidation re-trigger requiring full debounce reset, delaying subsequent batches -- Kill switch N value unspecified (timing indeterminacy) - -**GPT-5 unique findings (not in either other model):** -- Boundary-alignment counter starvation: explicitly modeled the "adversarial flicker" - pattern (breaching 2 cycles, 1 clear, repeat — 66% breach time, never escalates) - with a precise mathematical framing of why K-of-N is needed -- Cycle-length drift under load: GC pauses or CPU contention stretching evaluation - intervals means "3 cycles" could be 12 minutes instead of 90 seconds when it - matters most (high-load market stress = slowest evaluations) -- Adversarial boundary timing (market microstructure masking): illiquid instruments - where opposing prints predictably arrive near evaluation boundaries, exploiting - deterministic sampling points -- Cross-metric escalation gap and hedging prohibition: Restrict prevents ALL new - positions including risk-REDUCING hedges needed for a different metric still - escalating on its own timeline — protection for metric A actively worsens metric B -- Cooldown stall causing prolonged Restrict: repeated transient spikes near hysteresis - threshold reset cooldown indefinitely while metric is actually safe -- State inconsistency between restriction flags and monitor after restart: - documented asymmetry where flag persists (manual clear) but state resets (auto - clear) — creates orphaned restriction or unprotected window depending on - reconciliation approach -- Metric computation fail-closed interacting with debounce: system errors create - false escalations with long cooldown, potentially blocking hedging trades -- Unspecified N for kill switch post-liquidation breaches: coupled with crash - reset, system can loop indefinitely without reaching kill switch -- In-liquidate flicker stall: one cycle below threshold after partial fill resets - re-trigger counter, stalling further liquidation - -**Claude Opus unique findings (not in either other model):** -- De-escalation cooldown exploitation (predictable window): after cooldown completes - and restriction lifts, strategy has a GUARANTEED 5+ cycle window of unrestricted - trading before Restrict can re-engage — an automated strategy could systematically - exploit this predictable safe window to re-enter dangerous positions -- Orphaned restriction flag asymmetry framed as a DESIGN CONTRADICTION: the failure - modes table specifies opposing recovery paths for state (automatic → Clear) vs - flags (manual clear), creating an irreconcilable dual state. Opus uniquely - identified that operator intervention to clear the flag could inadvertently - create a WORSE protection gap than leaving it orphaned -- Self-correcting analysis style: Opus's summary explicitly synthesized that the - three Critical findings share a common cause (debounce optimizes against false - positives at the expense of false negatives during sustained events) and proposed - a single architectural fix (severity-aware fast path) that addresses all three - -**Claude Sonnet 4.5 unique findings (not in either other model):** -- De-escalation timing not accounting for proximity to breach threshold: system - removes protection while metric is still near-dangerous, and re-escalation - requires full debounce — created a specific "whipsaw" scenario with cycle numbers -- Manual-only de-escalation from Liquidate creates UNBOUNDED recovery time: - if triggered at 2 AM Saturday, trading disabled until Monday despite metrics - recovering in minutes. Framed as contradiction with "autonomous" design goals -- Evaluation cycle synchronization assumption: no handling of variable timing - (CPU contention, GC pauses) — implicit throughout but never addressed -- Cold start escalation ambiguity: system starts with no prior state while - portfolio may already be in breach condition -- De-escalation event ordering race: multiple metrics de-escalating simultaneously - may emit events in non-deterministic order, confusing external observers - -**Quality assessment:** -- **GPT-5** was the most exhaustive (15 findings) and showed the strongest - mathematical/systems reasoning. Its unique findings included precise attack - models (adversarial flicker, boundary alignment, microstructure masking) that - describe exact exploitation patterns with percentages and cycle counts. The - cross-metric hedging prohibition finding is architecturally significant — it - identifies that protection for one metric can actively CREATE risk for another. - Every finding was actionable with specific fixes. -- **Claude Opus 4.6** produced fewer findings (10) but with characteristic depth - and self-awareness. Its cooldown exploitation finding identified a PREDICTABLE - exploit window that an automated strategy could systematically abuse — framed - not as an accident but as an adversarial opportunity. The summary synthesis - (identifying common cause across Critical findings) shows meta-analytical - capability the other models didn't demonstrate. Opus also uniquely identified - that human intervention to fix one problem could create a WORSE problem — - second-order operational reasoning. -- **Claude Sonnet 4.5** was well-structured (12 findings, clean severity tiers, - organized by Critical/High/Medium/Low) and faster than both other models. - Its findings were solid but less architecturally deep. The manual de-escalation - contradiction finding was genuinely insightful (unbounded recovery time vs - autonomous design goals). However, several findings restated concepts the - other models covered with less specificity about exploitation mechanics. - -**Key insight — temporal reasoning as a task type:** -This is the first experiment specifically testing "temporal boundary analysis" — -reasoning about time-domain properties of a state machine (evaluation frequency, -counter semantics, cooldown mechanics, crash/restart timing). - -Results compared to Finding #13 (race condition identification on a concurrency doc): -- GPT-5: 15 findings here vs 12 in Finding #13. Consistent high performance - on temporal reasoning tasks across both experiments. -- Opus: 10 findings here vs 10 in Finding #13. Remarkably consistent — Opus - produces ~10 high-quality findings regardless of temporal task variant. -- Sonnet 4.5: 12 findings here (first test). Compare to Sonnet 4.6's 7 findings - (with errors) in Finding #13. Sonnet 4.5 handles temporal reasoning better than - 4.6 — consistent with Finding #16 showing 4.5 is more exhaustive across task types. - -**Sonnet 4.5 vs 4.6 on temporal reasoning (inferred comparison):** -Sonnet 4.6 struggled significantly on race condition identification (Finding #13: -7 findings with analytical errors, misreading architecture). Sonnet 4.5 here -produced 12 solid findings with no apparent misreadings. This suggests 4.5's -exhaustiveness advantage extends to temporal reasoning — the additional -exploration it does (vs 4.6's aggressive self-filtering) catches more temporal -interactions. Confirms Finding #16's pattern: 4.5 for coverage, 4.6 for precision. - -**The structured-prompt effect continues:** -All three models produced focused, high-quality output with this highly structured -prompt (5 specific categories + required output format). This confirms Finding #14: -narrow analytical lens + broad document scope is the sweet spot for all model tiers. -The prompt structure appears to be a stronger predictor of output quality than model -choice for the bottom 80% of findings (all models find the common-ground issues). -Model choice matters for the TOP 20% — the unique insights that require deeper -reasoning about system interactions. - -**Updated model assignment for temporal boundary analysis:** -1. GPT-5 — most exhaustive, strongest at modeling adversarial exploitation patterns - and mathematical edge cases (15 findings) -2. Claude Sonnet 4.5 — good volume with clean structure, viable for first-pass - temporal analysis (12 findings, no errors) -3. Claude Opus 4.6 — fewest findings but highest insight density, uniquely - identifies predictable exploit windows and operational second-order effects - (10 findings) - -**Practical implication:** For temporal analysis on state machines and timing-dependent -policies, the three-model stack produces genuine complementary value: -- GPT-5 catches the adversarial attack patterns and mathematical edge cases -- Opus catches the predictable exploit windows and operational contradictions -- Sonnet 4.5 provides good breadth at lower cost with clean severity categorization - -The union of unique findings across all three models reveals significantly more -temporal vulnerabilities than any single model alone. For a document governing -autonomous financial actions (liquidation, kill switch), the cost of running all -three (~$1-2) is trivially justified against the risk of missing a timing exploit. - -### 19. Union coverage test: GPT-5 Mini + Sonnet 4.6 covers ~71% of GPT-5's findings; the missing 29% is where the real value lives - -**Date:** 2026-05-04 -**Task:** Identify hidden assumptions in gargoyle's `trading-pipeline.md` (1,110 lines, -~62KB) — the most complex document tested so far, covering the full end-to-end path -from tick ingestion through order execution. -**How we used them:** Same document (full text, no truncation) + same focused analytical -question to all 3 models via HAI proxy. Standard hidden-assumption prompt with 5 -categories (runtime behavior, external dependencies, timing/ordering, scale/load, -uncovered failure modes). Required specific output format per finding. No tools, no -project context beyond the document itself. - -| Model | Time | Output tokens | Reasoning tokens | Assumptions found | -|---|---|---|---|---| -| GPT-5 | 99s | 9,418 | 5,696 | 35 | -| GPT-5 Mini | 93s | 5,309 | 1,792 | 21 | -| Claude Sonnet 4.6 | 38s | 1,792 | (internal) | 17 | - -**Coverage analysis — can Mini + Sonnet together replace GPT-5?** - -Categorized each of GPT-5's 35 findings by whether the union of Mini + Sonnet -also identified the same assumption: - -- **Covered by BOTH Mini and Sonnet:** ~12 findings (common ground — any model - finds these: idempotency, single-writer, clock sync, instrument resolution, - fill immutability, reconciliation gate, backpressure, fill correlation, event - ordering, audit scalability, PortfolioRisk bottleneck) -- **Covered by Mini only (not Sonnet):** ~7 findings (transactional atomicity, - audit causal consistency, modification-in-flight enforcement, OM throughput, - decimal precision, PM/PR close-only race, partition duplicate submit) -- **Covered by Sonnet only (not Mini):** ~6 findings (market data feed rates, - pipeline-vs-market speed, corporate actions atomicity, kill switch partition, - shared port isolation, market close vs auction fills) -- **Union(Mini + Sonnet) total coverage:** ~25/35 = **~71%** of GPT-5's findings -- **GPT-5 unique (missed by both):** ~10-18 findings depending on strictness - -**What GPT-5 uniquely found that the cheaper pair missed:** - -The missing 29% is NOT random — it's systematically different in character: - -1. **Operational edge cases:** Default TIF "day" broker semantics, OrderRate - counting retries, extended-hours MarketHours mismatch, fractional quantities, - local expiry timer precision per instrument -2. **Design-level interaction gaps:** PortfolioRisk concurrent decision race - (snapshot stale between two parallel approvals), re-validation gap between - approval and submit, decision loss on crash after audit write -3. **Domain-specific knowledge:** Manual broker-side actions conflicting with - state machine, options/complex instrument position_effect mapping, Decision→Order - 1:1 invariant vs broker auto-splitting, wash sale retroactive P&L mutation -4. **Architectural observations:** Reduction re-entry rule insufficiency, - PortfolioMonitor coalescing vs fast breach detection, multi-aggregator fanout - and audit partial writes, replay/backtest alignment with production controls - -These share a common trait: they require **domain expertise** (knowing how brokers -actually behave, how regulatory rules interact, how production trading systems -fail in practice) combined with **architectural reasoning** (how the design's own -mechanisms interact under those real-world conditions). The cheaper models find -assumptions about the document's internal consistency; GPT-5 additionally finds -assumptions about the document's relationship to the external world it must -operate in. - -**GPT-5 Mini vs Sonnet 4.6 — complementary, not redundant:** - -Mini and Sonnet covered different gaps: -- Mini was stronger on **internal consistency** (transactional atomicity, causal - consistency, decimal precision, modification serialization) -- Sonnet was stronger on **external interactions** (market data feeds, corporate - actions, kill switch distribution, shared resource isolation) - -This aligns with previous findings: Mini reasons about implementation mechanics; -Sonnet reasons about system boundaries and external interactions. Their union -covers more ground than either alone. - -**Cost comparison:** - -| Approach | Total tokens | Approx. cost | Coverage of GPT-5 | -|---|---|---|---| -| GPT-5 alone | ~21K (9.4K output + 5.7K reasoning) | ~$0.80 | 100% (35 findings) | -| Mini + Sonnet | ~7.1K output + 1.8K reasoning | ~$0.25 | ~71% (25/35 findings) | -| All three | ~28K total | ~$1.05 | >100% (35 + unique Sonnet/Mini extras) | - -**Key insight — the 71% coverage is a floor, not a ceiling:** - -The union covers 71% of GPT-5's specific findings. But Mini and Sonnet each -also produced findings that GPT-5 DIDN'T make: -- Sonnet: DailyLossLimit query performance scaling, instrument reference data - propagation atomicity across components -- Mini: Signal audit correlation ambiguity under replay/duplicate ticks - -So the total unique finding space is LARGER than any single model. Running all -three produces the most comprehensive analysis. - -**Answer to the open question: "Would running GPT-5 Mini + Sonnet together -approach GPT-5's coverage at lower combined cost?"** - -**Partially.** The pair covers ~71% of GPT-5's findings at ~31% of the cost. -But the missing 29% is disproportionately valuable — it contains the -domain-specific, interaction-level, real-world-knowledge findings that are -most likely to prevent production incidents. For a quick sanity check or -first-pass screening, Mini + Sonnet is excellent value. For architecture -review where completeness matters (financial system, safety-critical), GPT-5 -is not replaceable by cheaper models — its unique findings are exactly the -ones that would cause real-world failures. - -**Practical implication:** The optimal strategy depends on stakes: -- **Low stakes** (internal doc review, non-critical systems): Mini + Sonnet - is 71% coverage at 31% cost — strong ROI -- **High stakes** (financial systems, safety-critical): run all three — the - ~$1 total cost is irrelevant vs the value of the extra 10-18 findings -- **Budget-conscious high stakes:** run GPT-5 alone — it subsumes most of - what Mini + Sonnet find, and adds the critical domain-knowledge findings - -The cost argument for Mini + Sonnet as a GPT-5 REPLACEMENT doesn't hold for -important work. The cost argument for Mini + Sonnet as a GPT-5 COMPLEMENT -is strong — they catch a few things GPT-5 misses, and the union of all three -is the most thorough analysis available. - -**Document complexity observation:** -This is the largest document tested (1,110 lines vs previous 185-785 lines). -GPT-5's finding count scaled up (35 vs 20-26 on smaller docs) while maintaining -quality — no padding with obvious/low-value findings. Mini also scaled (21 vs -6 on 459-line doc in Finding #14). Sonnet scaled less (17 vs 12-17 on smaller -docs) — it appears to have a natural output ceiling regardless of document size, -consistent with its self-filtering behavior observed in previous findings. - -### 22. Silent correctness failures: NEW analytical lens reveals Opus's strength at domain/regulatory reasoning; GPT-5 shows regulatory depth; all models converge on compliance timestamp errors - -**Date:** 2026-05-05 -**Task:** Identify scenarios where the mechanism produces SILENTLY INCORRECT results -(not errors, not crashes — wrong financial calculations, wrong lot selections, or wrong -compliance records that pass all validation) in gargoyle's `specid-lot-selection.md` -(306 lines) — a financial system specification covering tax lot selection strategies, -cost basis accounting, and IRS SpecID compliance. -**How we used them:** Same document (full text) + same focused analytical question to -all 3 models via HAI proxy. Highly structured prompt specifying 5 categories of silent -incorrectness (stale data, semantic precision, ordering sensitivity, composition errors, -temporal reference errors). Required specific output format per finding with concrete -numerical examples of financial impact. No tools, no project context beyond the document. - -| Model | Time | Output tokens | Reasoning tokens | Findings | Critical | High | Medium | -|---|---|---|---|---|---|---|---| -| GPT-5 | 147s | 13,006 | 10,496 | 7 | 2 | 2 | 3 | -| Claude Opus 4.6 | 119s | 5,902 | (internal) | 10 | 3 | 3 | 4 | -| Claude Sonnet 4.6 | 122s | 6,011 | (internal) | 6 | 3 | 3 | 0 | - -**What they found — common ground (all 3 identified):** -- `designation_at` = `DateTime.utc_now()` at processing time, NOT at actual - designation time (manual selection was made at order submission, standing - orders were configured earlier) — compliance record factually incorrect -- Holding period calculation boundary errors (>365 days vs IRS "more than one - year" rule, off-by-one at leap year boundaries, day-after-acquisition start) -- HIFO tie-breaker `opened_at ASC` ignores tax_term dimension — selects - long-term losses over short-term losses when both have identical cost basis, - producing less tax-valuable outcomes -- Strategy preference resolved at fill processing time, not at trade time - (preference changes between trade and fill processing apply retroactively) - -**GPT-5 unique findings (not in either Claude model):** -- Corporate action applied late stale cost basis in HIFO: ROC/dividend reduces - basis but if close/4 fires before apply_corporate_action/3, HIFO sorts on - pre-adjusted basis AND records wrong realized P&L permanently. No mechanism - to restate previously persisted LotClosed events. Concrete example: $2,000 - overstated loss from one trade. -- `designation_at` fragmentation: a single sell consuming multiple lots calls - DateTime.utc_now() per loop iteration, producing slightly different timestamps - for what should be a single coherent designation event. Audit risk. -- LIFO label in `selection_method` field: records "lifo" but for securities LIFO - isn't an authorized tax method — the operation is legally SpecID electing - newest lots. Downstream reporting may reject or misclassify. - -**Claude Opus unique findings (not in either other model):** -- Realized P&L excludes commissions/fees: formula uses `sell_fill.price` (raw - execution price) minus `lot.cost_basis`, not net proceeds. If cost_basis also - excludes buy-side commissions, P&L is doubly overstated. Active trader doing - 1000 trades/year: ~$20,000+ cumulative P&L overstatement. -- Position `average_cost` is meaningless under SpecID and potentially misleading: - SpecID exists to exploit lot-level basis differences, but position-level average - obscures this. If downstream consumers use average_cost for tax estimation, - results can be 50%+ wrong per lot. -- GenServer mailbox ordering determines lot-to-fill assignment for concurrent sells: - two simultaneous fills for the same instrument get different lots based on network - arrival timing. With different holding periods, produces $670+ tax difference - without user awareness. -- Wash sale rule completely unaddressed: system reports losses as realized/deductible - without checking 30-day substantially identical purchase rule. Active trader - harvesting $50,000 in losses could have $0 actually deductible — $18,500 tax gap. -- `opened_at` semantics undefined: whether it's exchange execution time, GenServer - arrival time, or settlement date affects every downstream calculation (FIFO/LIFO - ordering, holding periods, tax terms). Network timing could produce wrong FIFO - lot selection. - -**Claude Sonnet 4.6 unique findings (not in either other model):** -- Stale cost basis in manual lot picker during concurrent corporate actions: UI shows - pre-action basis, user selects based on stale data, but close/4 only validates - open/ownership/quantity — never re-validates that the selection rationale is still - correct. No field records the discrepancy. -- `average_cost` recomputation ordering ambiguity in event-sourced model: step 4 - recomputes from "updated lots" but step 3 (persist events) may not have completed - — if implementation re-derives from event store rather than in-memory state, reads - pre-closure lot quantities. Accumulates $500+ error per partial close. -- Strategy fallback + config corruption silently overwrites selection method in - compliance record: if config becomes invalid, fallback to :fifo is logged at - :warning but LotClosed records `selection_method: "fifo"` — compliance record - shows user "chose" FIFO when they configured HIFO. No field records intended vs - actual strategy. - -**Quality assessment:** -- **Claude Opus** produced the most findings (10) with the broadest analytical scope. - Several findings went BEYOND the document's mechanism to identify missing features - that create silent incorrectness (wash sale rules, commission handling, opened_at - semantics). This is a different analytical mode: Opus identified what the system - SHOULD compute but DOESN'T, not just where the existing computation is wrong. - The wash sale finding is the highest-impact across all three models — an active - trader's entire tax-loss harvesting strategy could be invalid. The GenServer - mailbox ordering finding shows characteristic Opus reasoning about emergent - behavior from design decisions. -- **GPT-5** produced fewer findings (7) but with extreme precision and specificity. - Every finding includes concrete dollar amounts and specific field references. - The corporate action stale basis finding is uniquely actionable — it identifies a - specific race condition between two documented mechanisms (close/4 and - apply_corporate_action/3) that produces permanently incorrect persisted data - with no correction path. The designation_at fragmentation finding shows attention - to implementation detail that neither Claude model noticed. GPT-5 used 10,496 - reasoning tokens for 7 findings (1,500 tokens/finding) — HIGH verification, - consistent with Finding #20's pattern for precision-over-breadth tasks. -- **Claude Sonnet 4.6** produced 6 findings with strong specificity and novel angles. - The event-sourced recomputation ordering finding (#5) is architecturally subtle — - it identifies a composition error between the walk-and-consume algorithm's step - ordering and event-sourcing patterns. The strategy fallback compliance recording - finding is a genuine audit hazard. However, Sonnet produced no Medium-severity - findings — it either found Critical/High issues or filtered everything else out. - This aligns with its established high-precision, high-self-filtering behavior. - -**Key insight — "Silent correctness" as an analytical lens:** - -This is the FIRST experiment testing a "silent incorrectness" prompt. The key -difference from previous analytical lenses: -- **Assumption-finding:** "What must be true for this to work?" (Finding #10-12) -- **Race conditions:** "What timing issues exist?" (Finding #13) -- **Design coherence:** "Does the design contradict itself?" (Finding #15) -- **Invariant violations:** "What operation sequences break invariants?" (Finding #20) -- **Silent correctness:** "Where does the system CONFIDENTLY produce WRONG output - with NO indication of error?" - -The silent correctness lens produced qualitatively different findings from all -previous lenses. The emphasis on "passes all validation" forced models to reason -about what SHOULD be validated but ISN'T, and about semantic correctness (regulatory -requirements, financial accounting rules) vs syntactic correctness (valid types, -non-nil fields, correct schema). - -This lens also revealed a key model differentiation not seen before: -- **Opus reasons about MISSING functionality** (wash sales, commissions, opened_at - semantics) — things the system should do but doesn't -- **GPT-5 reasons about EXISTING functionality being wrong** (corporate action race, - designation fragmentation, LIFO labeling) — things the system does but incorrectly -- **Sonnet reasons about COMPOSITION failures** (event-sourcing step ordering, - strategy fallback propagation) — things that are individually correct but combine - incorrectly - -These are three genuinely different analytical modes, not just "more/less thorough." -All three are valuable for different review outcomes: Opus for feature completeness, -GPT-5 for mechanism correctness, Sonnet for integration correctness. - -**Financial domain advantage:** - -This is the first experiment on a document with strong regulatory/financial semantics. -All three models demonstrated domain knowledge (IRS holding period rules, Treas. Reg. -1.1012-1(c) requirements, wash sale IRC §1091, long-term/short-term capital gains -rate differentials). Opus in particular referenced specific IRC sections and provided -concrete tax rate calculations. The "silent incorrectness" lens works especially well -on financial/regulatory documents because the gap between "syntactically valid output" -and "semantically/legally correct output" is large and consequential. - -**Comparison to previous findings on the same models:** - -| Task type | GPT-5 findings | Opus findings | Sonnet findings | Opus > GPT-5? | -|---|---|---|---|---| -| Hidden assumptions (#10-12) | 20-35 | 12-13 | 13-17 | No | -| Race conditions (#13) | 12 | 10 | 7 | No | -| Design coherence (#15) | 4 | 7 | 5 | **Yes** | -| Invariant violations (#20) | 3 | 7 | 5 | **Yes** | -| Silent correctness (#22) | 7 | 10 | 6 | **Yes** | - -Pattern confirmed: Opus outperforms GPT-5 (by finding count) on tasks that require -reasoning about the design's RELATIONSHIP to external requirements (regulatory, -financial, consumer expectations). GPT-5 outperforms Opus on tasks that require -EXHAUSTIVE EXPLORATION within a self-contained system (assumptions, race conditions). - -The "silent correctness" lens is structurally similar to coherence checking (does the -system match its external requirements?) rather than gap-finding (what's missing -within the system?). This explains why Opus outperforms: the task requires reasoning -about the world outside the document (IRS rules, financial accounting standards, -regulatory requirements), which is Opus's strength. - -**Practical implication:** -For financial/regulatory system review, the "silent correctness" lens should be -run using Opus as the primary model (broadest findings including missing-feature -identification) plus GPT-5 for mechanism-level precision. Sonnet adds value for -composition/integration issues that neither Opus nor GPT-5 catches. All three -produced unique, actionable findings that the others missed. - -The three findings ALL models converged on (designation_at, holding period, HIFO -tie-breaker, strategy preference timing) should be treated as confirmed design -bugs requiring fixes. The fact that three independent models all identified them -with concrete financial impact examples increases confidence that these are real. - -### 23. Regulatory compliance analysis: GPT-5 finds broadest scope of IRS issues; Opus self-corrects mid-analysis; all models converge on cross-account gap - -**Date:** 2026-05-05 -**Task:** Identify where gargoyle's `wash-sale-tracking.md` (391 lines) could produce -incorrect tax reporting compared to IRS wash sale regulations (IRC 1091). NEW -analytical lens: regulatory compliance verification — asking models to reason about -a code implementation's correctness against EXTERNAL regulatory requirements (not -internal system assumptions or race conditions). -**How we used them:** Same document (full text) + same focused analytical question -to all 3 models via HAI proxy. Prompt structured around 5 categories: regulatory -gaps, interpretation errors, edge cases at regulatory boundaries, cross-account/entity -concerns, and interaction with other IRC sections. Required specific regulatory -citations, implementation analysis, concrete tax errors, and audit risk levels. -No tools, no project context beyond the document. - -| Model | Time | Output tokens | Reasoning tokens | Findings | -|---|---|---|---|---| -| GPT-5 | 178s | 12,525 | 9,536 | 16 | -| Claude Opus 4.6 | 155s | 7,326 | (internal) | 16 (with 2 self-corrections/withdrawals) | -| Claude Sonnet 4.6 | 40s | 1,818 | (internal) | 12 | - -**What they found — common ground (all 3 identified):** -- Cross-account/IRA/external broker wash sales not tracked (IRS applies at taxpayer level) -- Options/contracts to acquire stock not triggering wash sales (explicit in IRC 1091(a) text) -- "Substantially identical" definition too narrow (same index ETFs, share classes, ADRs) -- Trade date vs settlement date ambiguity in opened_at/closed_at -- Short sale wash sales not addressed -- Section 475 mark-to-market traders incorrectly subjected to wash sale tracking -- IRC 1092 straddle rules interaction not addressed -- Related party / spousal transactions not considered -- Corporate action identity changes breaking matching - -**GPT-5 unique findings (not in either other model):** -- **Per-share vs lot-level basis tacking** (#1): The system applies `disallowed_loss` - and `tacked_opened_at` at the LOT level, but IRS requires per-share treatment - when only partial shares are matched. A lot of 100 shares where only 60 trigger - wash sale should have per-share basis segregation — the system inflates basis for - all 100 shares. **Most architecturally significant finding** — a fundamental - design-level error, not a missing feature. -- **IRA permanent disallowance** (#2): When replacement purchase is in an IRA, the - loss is PERMANENTLY lost (no basis adjustment possible in tax-deferred accounts). - System either incorrectly applies basis adjustment inside IRA or misses it entirely. -- **Instruments not subject to §1091** (#4): §1256 contracts (futures, index options), - cryptocurrency, and §475 elections are all exempt — system may over-disallow. -- **Average-cost mutual fund basis** (#11): Wash sale adjustments for funds using - average-cost method require different math than discrete lot-level adjustments. -- **ADRs vs local shares** (#14): ADRs and underlying foreign ordinaries are - substantially identical but have different instrument_ids. -- **RSU vestings/ESPP purchases** (#15): Equity compensation creating lots via - corporate action paths may not trigger `check_replacement/2`. -- **Ordering priority between pre/post sale purchases** (#10): Industry convention - (post-sale first, then pre-sale) may differ from system's strict chronological - ordering, causing 1099-B mismatches. - -**Claude Opus unique findings (not in either other model):** -- **Year-end boundary timing** (#5): Loss in December + replacement in January means - tax reports generated between Dec 31 and the replacement purchase date are incorrect. - Forward detection fires retroactively but users may have already filed. System needs - a "30-day pending window" for year-end reports. -- **Form 8949 reporting format** (#6): IRS requires code "W" in column (f) and - specific adjustment amounts in column (g). System doesn't describe how `tax_summary/3` - produces Form 8949-compatible output — potential CP2000 notice triggers from - automated IRS matching against broker 1099-B. -- **"Open lots" query in backward detection** (#10): If backward detection only - queries currently-open lots, it misses replacements that were acquired AND SOLD - within the window. IRS looks at acquisition regardless of current holding status. - (Rev. Rul. 56-602) -- **Forward detection loss ordering unspecified** (#7): When multiple prior losses - compete for the same replacement shares, ordering matters — different allocation - produces different basis amounts on the replacement lot. -- **DRIP reinvestments triggering wash sales** (#9): Dividend reinvestment creates - new lots that should trigger forward detection but may not if only buy fills - produce `LotOpened` events. -- **Self-correcting analytical style (CONFIRMED):** Opus withdrew Finding #4 - entirely mid-analysis ("Revised assessment: holding period logic appears correct. - I withdraw the claim of error"). Spent ~500 words reasoning through the holding - period tacking logic, found it correct, and explicitly retracted. This is now - confirmed across Findings #15, #20, and #23 as a consistent Opus behavior for - verification-heavy regulatory analysis. - -**Claude Sonnet unique findings (not in either other model):** -- **Entity-level tracking for partnerships/S-Corps** (#4.2): Tax-transparent entities - trading through the platform need K-1 reporting to partners — user-scoped model - doesn't address pass-through entity wash sale reporting. -- **Constructive sale integration (IRC 1259)** (#4.1): Short positions or derivatives - creating constructive ownership interact with wash sale determination in ways not - addressed. -- **NOL carryforward interaction** (#5.3): Wash sale deferrals affect character and - timing of losses contributing to NOL calculations across tax years. - -**Quality assessment:** -- **GPT-5** produced the broadest regulatory scope (16 findings) with the most - specific IRS citations (Rev. Rul. 2008-5, Pub. 550, IRC §§267, 1091, 1092, 1222, - 1223, 1256, 475). Its per-share vs lot-level finding (#1) is the only one that - identifies a FUNDAMENTAL DESIGN ERROR (not a missing feature). Most other models' - findings are "you don't handle X" — GPT-5's #1 says "what you DO handle is - handled INCORRECTLY." This distinction matters: missing features are known scope - limitations; incorrect logic is a bug. -- **Claude Opus** matched GPT-5's count (16 with 2 self-corrections = 14 net - confirmed) but with different character. Opus excelled at identifying OPERATIONAL - implications (year-end boundary timing, Form 8949 format requirements, forward - detection ordering) rather than just statutory gaps. Its findings tend to describe - HOW the gap manifests in practice ("user files taxes, then January purchase - retroactively invalidates the filing") vs GPT-5's approach of citing the statute - and describing the theoretical violation. -- **Claude Sonnet** was fast (40s) and produced 12 competent findings but with less - regulatory precision. Findings lacked specific IRS citations (no Rev. Rul. - references, no Treas. Reg. citations). Several findings overlapped heavily with - common ground items without adding unique depth. The entity-level and - constructive sale findings show awareness of tax complexity but are relatively - generic ("this is complex and not addressed"). - -**Key insight — regulatory compliance as a distinct task type:** - -This experiment tests a fundamentally different cognitive demand than previous ones: -previous tasks asked "what could go wrong with this system?" (internal reasoning). -This task asks "does this system correctly implement external rules?" (external -reasoning). The model must hold TWO bodies of knowledge simultaneously: the -implementation spec AND the regulatory framework, then find mismatches. - -All three models had strong tax law knowledge — they cited IRC sections, Revenue -Rulings, and Treasury Regulations correctly. The differentiation wasn't in legal -knowledge but in HOW they applied it: - -- **GPT-5:** Exhaustive statutory mapping ("here's every IRC section that touches - wash sales; here's where the implementation falls short on each"). Breadth-first - coverage. Found the most issues by sheer scope of regulatory awareness. -- **Opus:** Operational consequence reasoning ("here's how this gap manifests as - a real-world problem for the user/auditor"). Found issues by reasoning about - the implementation's interaction with real-world workflows (filing deadlines, - form formats, broker reconciliation). -- **Sonnet:** Category-based analysis ("here are cross-account issues, here are - entity issues, here are interaction issues"). Followed the prompt structure - closely but didn't go deep within each category. - -**The per-share vs lot-level finding (GPT-5 #1) — why it matters:** - -This is the experiment's most important result. Every model found missing features -(options, cross-account, short sales) — those are SCOPE limitations that the -document itself acknowledges or defers. GPT-5 uniquely found a correctness bug in -the IMPLEMENTED logic: the system's lot-level basis adjustment is mathematically -wrong for partial wash sales. - -Example: Loss lot of 100 shares, replacement lot of 60 shares. Only 60 shares -trigger wash sale. System adds full 60% of disallowed loss to the entire -replacement lot's basis. If the replacement lot later sells 30 shares, the -per-share basis is inflated (reflects 60 shares of adjustment spread across 60 -shares). This is actually correct for the replacement lot specifically — but -the `tacked_opened_at` is applied to ALL 60 shares when only the matched shares -should have tacked holding periods. For lots where `adjusted_quantity < -replacement_quantity`, the non-matched shares have incorrect holding period -characterization. - -Actually, on closer inspection: if `adjusted_quantity = min(loss_quantity, -replacement_quantity)`, and the system matches 60 shares of a 60-share -replacement lot, ALL shares of that lot are matched. The edge case GPT-5 -identifies would require a replacement lot larger than the loss — e.g., loss of -60 shares matched against a replacement lot of 100 shares where only 60 are -affected. In that case, the `tacked_opened_at` is set on the entire lot (100 -shares) when only 60 should be affected. This IS a genuine bug: 40 shares get -incorrect holding period classification. - -**Updated task-type taxonomy:** - -| Task type | Primary cognitive demand | Best model | -|---|---|---| -| Hidden assumptions | Breadth identification (what's not stated?) | GPT-5 (exhaustive) | -| Race conditions | Sequential temporal reasoning | GPT-5 + Opus | -| Cross-component interactions | Component boundary reasoning | GPT-5 + Sonnet | -| Design coherence | Internal consistency checking | Opus | -| Invariant violation paths | Construction + verification | GPT-5 (precision) | -| Silent correctness | External requirement matching | Opus | -| **Regulatory compliance** | **Dual-knowledge-base comparison** | **GPT-5 (breadth) + Opus (operations)** | - -Regulatory compliance is closest to "silent correctness" (Finding #22) in that -both require reasoning about external requirements. The key difference: -- Silent correctness asks "does this produce correct outputs for all inputs?" -- Regulatory compliance asks "does this implement the law correctly?" - -Both favor models that reason about the system's relationship to the outside -world (Opus's strength), but regulatory compliance also rewards breadth of -statutory knowledge (GPT-5's strength). The combination produces the most -complete picture. - -**Practical implication:** -For regulatory compliance review of financial systems: -- Run GPT-5 for exhaustive statutory coverage (finds the most gaps) -- Run Opus for operational impact analysis (finds how gaps manifest in practice) -- Sonnet adds marginal value — use only if budget allows -- GPT-5's unique strength: identifying correctness bugs in implemented logic - (not just missing features) -- Opus's unique strength: identifying timing/workflow issues (year-end, form - reporting, reconciliation with broker) - -### 24. Design improvement proposals: GPT-5 excels at defense-in-depth thinking; Opus finds subtle design contradictions; Sonnet produces generic recommendations - -**Date:** 2026-05-05 -**Task:** Propose specific design improvements for gargoyle's `kill-switch.md` (185 lines) -— the primary safety mechanism that prevents rogue orders. NEW task type: generative/ -creative ("what would you improve?") rather than purely analytical ("what's wrong?"). -**How we used them:** Same document (full text) + same focused prompt to all 3 models -via HAI proxy. Prompt asked for 8-15 specific improvements with: weakness, proposed -change (concrete), tradeoff, severity rating. Explicitly excluded generic advice -("add more tests") and asked about runtime assumptions. No tools, no project context. - -| Model | Time | Output tokens | Reasoning tokens | Improvements proposed | -|---|---|---|---|---| -| GPT-5 | 118s | 8,710 | 6,016 | 15 | -| Claude Opus 4.6 | 127s | 4,985 | (internal) | 15 | -| Claude Sonnet 4.6 | 40s | 1,636 | (internal) | 12 | - -**What they found — common ground (all 3 identified):** -- DB write failure blocking engagement (fail-open under DB outage) — all three - proposed in-memory-first engagement with async persistence -- Kill switch process liveness monitoring (heartbeat/watchdog) -- Broker connectivity loss during cancellation operations -- ETS table ownership and crash-window vulnerability -- Supervisor restart suppression as unstated mechanism -- Per-venue/per-broker scope extension - -**GPT-5 unique findings (not in either other model):** -- **Infrastructure-level "hard kill"** — egress proxy or service mesh that blocks - broker traffic independently of the application. Belt-and-suspenders approach - where the kill switch works even if the entire BEAM VM is unresponsive. This - was GPT-5's highest-impact unique insight. -- **Kill fence token (epoch)** — every order-carrying message includes an epoch; - stale-epoch messages are dropped at the gate. Elegantly solves in-flight - messages without needing drain timeouts. -- **Cluster/multi-node propagation** — detailed leader election + epoch broadcast - + fail-closed on partition design. -- **Post-engage broker verification** — query broker AFTER engaging to confirm no - orders slipped through during the engagement window. -- **Liquidation exposure validation** — proving tagged liquidation orders actually - REDUCE exposure rather than trusting the tag. -- **Recovery/cold-start order suppression** — ensuring reconciliation/recovery - routines can't submit orders while engaged. -- **Engage latency reordering** — ETS first, terminate second, DB async. -- **Audit log tamper evidence** — append-only external sink + hash chain. - -**Claude Opus unique findings (not in either other model):** -- **Ordering contradiction in engagement sequence** — identified that the - documented order (DB → ETS → terminate) creates a specific risk if a crash - occurs BETWEEN termination and ETS update (not just DB failure). The insight - is about the window where termination has started but gate is still open. - More subtle than GPT-5's version (which focused on DB-blocking-engage). -- **Concurrent engagement race (mode escalation)** — multiple triggers - simultaneously issuing conflicting modes (RESTRICT vs LIQUIDATE). Proposed - explicit escalation rules (LIQUIDATE always wins) with GenServer serialization. -- **Shared resources under per-user scope** — per-user kill switch doesn't - address orders in shared broker connection buffers. Forces architectural - decision about connection pooling strategy. -- **Clock/time integrity for audit log** — monotonic counters + NTP validation - for forensic reliability. -- **Partial multi-user engagement failures** — what happens when global engage - successfully terminates 4/5 user pipelines but one has orphaned processes. -- **Liquidation direction validation** — similar to GPT-5's exposure validation - but framed differently: checking corrupted position records could cause - liquidation to OPEN positions rather than close them. -- **Process termination verification** — checking that `:kill` signals actually - worked (defense against trap_exit, NIF blocking). -- **Engagement latency SLA** — defining a 50ms target with monitoring/alerting. - -**Claude Sonnet findings (all also present in GPT-5 or Opus, differently framed):** -- No genuinely unique improvements that GPT-5 or Opus didn't also identify. -- Several were generic: "missing resource cleanup," "circuit breaker integration," - "performance monitoring" — exactly the kind of advice the prompt tried to - exclude. -- The "missing heartbeat" and "network partition handling" proposals were solid - but less detailed than the corresponding GPT-5/Opus versions. - -**Quality assessment:** -- **GPT-5** produced the most ACTIONABLE improvements. Its proposals were - architecturally concrete ("add an egress proxy," "use kill epochs in messages," - "query broker post-engage") and showed defense-in-depth thinking — multiple - independent layers rather than fixing one path. The infrastructure kill (#2) - is genuinely novel: no other model proposed going OUTSIDE the application - boundary for safety enforcement. GPT-5 consistently thought about "what if - this entire runtime is compromised?" rather than just fixing within-app paths. -- **Claude Opus** produced equally numerous improvements (15) with characteristic - precision about failure SEQUENCES. Its unique strength: identifying design - contradictions rather than just gaps (the engagement ordering issue, concurrent - mode escalation, shared-resource scope mismatch). Opus's proposals were more - "fix the design tension" while GPT-5's were more "add another safety layer." - Opus also included the process termination verification and engagement latency - SLA — operational rigor that GPT-5 skipped. -- **Claude Sonnet** produced 12 proposals in 40s (fast) but quality was notably - lower. Several proposals were generic software engineering advice that the - prompt explicitly excluded ("add performance monitoring," "resource cleanup"). - No unique insights emerged. Sonnet's proposals lacked the architectural depth - of GPT-5 (no outside-the-application thinking) and the design-tension - identification of Opus. - -**Key insight — generative vs analytical tasks:** - -This is the first experiment testing a GENERATIVE task ("propose improvements") -rather than a purely analytical one ("find problems"). The results reveal: - -1. **GPT-5's defense-in-depth thinking is unique.** In analytical tasks, GPT-5 - finds exhaustive lists of issues. In generative tasks, it proposes LAYERED - solutions — multiple independent mechanisms that each catch what the others - miss. The infrastructure kill proposal (external to the application) shows - GPT-5 reasoning about failure modes that are invisible to within-app analysis. - -2. **Opus's design-tension identification transfers to improvement proposals.** - In analytical tasks, Opus finds where parts of a design contradict each other. - In generative tasks, this manifests as proposals that RESOLVE tensions rather - than just adding patches. The engagement ordering contradiction and mode - escalation rules are both "this design says X but the mechanism allows Y — - here's how to make them consistent." - -3. **Sonnet doesn't transfer well to generative tasks.** In analytical tasks - (assumption-finding, cross-component analysis), Sonnet performs well (85% of - GPT-5 in some experiments). In generative tasks, it falls back to generic - engineering advice. The task requires both identifying problems AND proposing - concrete solutions — Sonnet handles the first step but not the second with - sufficient depth. - -**Comparison to analytical task performance:** - -| Task type | GPT-5 character | Opus character | Sonnet character | -|---|---|---|---| -| Assumption-finding (#10-12) | Exhaustive breadth | Design tensions | Good (85% of GPT-5) | -| Race conditions (#13) | Technical precision | Design contradictions | Weak (errors) | -| Invariant violations (#20) | Maximum selectivity | Self-correcting depth | Imprecise | -| **Design improvements (#24)** | **Defense-in-depth layers** | **Tension resolution** | **Generic advice** | - -The generative task reveals model ARCHITECTURES more clearly than analytical tasks. -GPT-5's reasoning enables it to construct multi-layered solutions. Opus's internal -reasoning enables it to identify what a design SHOULD be (not just what's wrong). -Sonnet pattern-matches against known engineering practices without deep synthesis. - -**Practical implication:** - -For design improvement sessions on safety-critical systems: -- Run GPT-5 for defense-in-depth proposals ("what layers should exist?") -- Run Opus for design consistency proposals ("where does the design contradict itself?") -- Skip Sonnet — its output is indistinguishable from generic checklists -- The combination of GPT-5 + Opus produces complementary improvements: GPT-5 adds - safety layers, Opus fixes internal contradictions. Together they address both - "not enough protection" and "protection mechanisms that work against each other." - -**Cost analysis:** -GPT-5: 118s, ~10.9K tokens (6K reasoning). Opus: 127s, ~5K tokens. Sonnet: 40s, ~1.6K tokens. -For a safety-critical design review, running GPT-5 + Opus costs ~16K tokens and produces -30 improvements with near-zero overlap in unique insights. Excellent ROI for a kill switch -design that protects real money. - -### 25. Contradiction detection: NEW task type — Opus excels at finding genuine contradictions with precision; GPT-5 is thorough but spends heavily on reasoning; Sonnet finds surface-level issues quickly - -**Date:** 2026-05-05 -**Task:** Identify internal contradictions, logical inconsistencies, and conflicting rules -in gargoyle's `order-state-machine.md` (311 lines) — a document defining states, -transitions, invariants, fill precedence rules, and time-in-force behavior. -**How we used them:** Same document (full text) + same focused analytical question to all -3 models via HAI proxy. Prompt specifically asked for: state machine contradictions, -semantic conflicts, rule violations, implicit contradictions, and terminology -inconsistencies. Required each finding to quote the conflicting statements, explain -the logical argument, assign severity, and recommend which statement should "win." -No tools, no project context beyond the document itself. - -| Model | Time | Output tokens | Reasoning tokens | Contradictions found | -|---|---|---|---|---| -| GPT-5 | 162s | 12,074 | 11,008 | 4 | -| Claude Opus 4.6 | 41s | 2,056 | (internal) | 6 | -| Claude Sonnet 4.6 | 17s | 826 | (internal) | 4 | - -**What they found — common ground (2+ models identified):** - -- **Missing `pending_cancel → partially_filled` revert transition** (GPT-5 #1 + - Opus #2 + Sonnet partial): The "Rejection reverts" invariant states orders return - to their "pre-modification state (`working` or `partially_filled`)", but the state - diagram only shows `pending_cancel → working` for cancel rejection — no path back - to `partially_filled`. All models correctly identified this as the diagram being - incomplete relative to the stated invariant. GPT-5 and Opus rated CRITICAL. -- **Same issue for `pending_replace` revert** (GPT-5 #1 + Opus #3): The state diagram - only shows `pending_replace → working` for replace rejection, but a replace - requested from `partially_filled` should revert to `partially_filled`. Same root - cause as above, just the replace variant. -- **FOK "never partially fills" vs state machine allowing it** (GPT-5 #2 + Opus #4): - The TIF table says FOK "never partially fills" but the state machine has no guards - preventing FOK orders from reaching `partially_filled`. Both correctly noted this - is a broker-enforced guarantee but the document presents it as system-level. -- **`rejection_reason` described as "broker-provided" but local rejections exist** - (GPT-5 #4 + Opus #5 + Sonnet): `pending → rejected` is "local validation failure" - with no broker interaction, but the field says "Broker-provided reason when - rejected." All three caught this terminology inconsistency. - -**GPT-5 unique findings (not in either other model):** - -- **IOC valid terminal states exclude `expired` vs generic expiry transitions** (#3): - IOC should never reach `expired` (unfilled portion is cancelled immediately), but - the state diagram allows any order to transition to `expired` without TIF guards. - Well-reasoned extension of the FOK finding to IOC semantics. GPT-5 correctly - identified that broker "expired-like" outcomes should map to `cancelled` for IOC. - -**Claude Opus unique findings (not in either other model):** - -- **Terminal states that aren't terminal — the `partially_filled` re-entry problem** - (#1): Opus identified the DEEPER contradiction beyond the surface-level "cancelled - states have outgoing transitions." When `cancelled → partially_filled` fires via - late fill, the order is now non-terminal with NO defined mechanism to re-terminate - if no further fills arrive. The order is stuck in `partially_filled` indefinitely. - This goes beyond "the diagram contradicts the definition of terminal" to "the fill - precedence rule creates an unspecified operational scenario." This is the most - architecturally significant finding across all three models. -- **Fill precedence label misapplication to non-terminal states** (#6): The state - diagram labels transitions from `pending_cancel → partially_filled` and - `pending_replace → partially_filled` as "fill precedence," but the Fill - Precedence Rule explicitly defines itself as overriding TERMINAL states. - `pending_cancel` is non-terminal. The label conflates two different mechanisms - (fill during pending modification vs. fill overriding terminal state), which - could cause implementers to use the same code path for fundamentally different - scenarios. - -**Claude Sonnet unique findings (not in either other model):** - -- **State diagram terminal arrow contradiction** (#1): Sonnet was the only model to - explicitly note that the Mermaid diagram shows `cancelled → [*]` (terminal arrow) - while simultaneously showing `cancelled → partially_filled` (outgoing transition). - A valid observation but more surface-level than Opus's deeper analysis of the same - phenomenon. -- **Pending replace fill logic error** (#3): Sonnet argued that receiving a fill - during `pending_replace` creates a logical impossibility because the order - parameters are in flux. This is WRONG — fills always apply to current parameters - (the replace hasn't been confirmed yet), and the document actually handles this - correctly. This is a FALSE POSITIVE from Sonnet. - -**Quality assessment:** - -- **Claude Opus** was the clear winner for this task. Found the most contradictions - (6), had the highest precision (0 false positives), and — crucially — found - qualitatively deeper issues. The `partially_filled` re-entry problem (#1) isn't - just "the diagram has a missing transition" but "the fill precedence rule creates - an unresolvable operational state." The fill precedence label misapplication (#6) - identifies a conceptual confusion that would genuinely cause implementation bugs. - Opus completed in only 41s with 2,056 output tokens — by far the most efficient. -- **GPT-5** found 4 genuine contradictions with 0 false positives but spent an - extraordinary amount of reasoning tokens (11,008) for modest output (1,066 visible - content tokens, 10.3:1 reasoning ratio). The IOC finding was unique and valuable. - But the cost is disproportionate: 162s and 12K tokens for 4 findings vs Opus's - 41s and 2K tokens for 6 findings. GPT-5's reasoning budget seems to have been - mostly spent on VERIFICATION (confirming each finding is genuine), consistent - with Finding #20's observation. -- **Claude Sonnet** was fastest (17s) and found 4 items, but one was a false positive - (the pending_replace logic error claim is incorrect). That gives it a precision of - 75% (3/4 genuine) — the lowest of the three. Its genuine findings were all also - found by the other models (no unique true contributions). Sonnet appears to trade - speed for accuracy on contradiction detection. - -**Key insight — contradiction detection favors precision-oriented models:** - -This task is fundamentally about LOGICAL ARGUMENTATION: proving that two statements -cannot both be true. Unlike assumption-finding (which is about imagining what could go -wrong) or gap-finding (which is about identifying missing content), contradiction -detection requires the model to: -1. Hold two statements in working memory simultaneously -2. Construct a formal argument for why they conflict -3. NOT get confused by statements that SEEM contradictory but are actually consistent - -Requirement #3 is where models diverge. Sonnet produced a false positive because it -didn't fully reason through whether the pending_replace fill scenario is actually -inconsistent (it isn't — current parameters apply). Opus avoided this trap entirely -and additionally found DEEPER contradictions that require multi-step logical reasoning -(the re-entry problem, the label misapplication). GPT-5 also avoided false positives -but at massive computational cost. - -**Opus's efficiency advantage:** -This is the first task where Opus is not just qualitatively better but also -quantitatively more efficient. 6 findings in 41s and 2K tokens vs GPT-5's 4 findings -in 162s and 12K tokens. That's 3x more findings per token and 4x faster. For -contradiction detection specifically, Opus appears to have a structural advantage — -possibly because its internal reasoning is better calibrated for logical argumentation -than GPT-5's externalized reasoning chain. - -**Comparison to Finding #20 (invariant violation paths):** -In Finding #20, GPT-5 was maximally selective (3 findings, all genuine, 15:1 -reasoning ratio). Here, GPT-5 shows the same pattern: few findings, all genuine, -high reasoning ratio (10.3:1). The difference: in #20, GPT-5's selectivity meant -it found UNIQUE violations others missed. Here, all of GPT-5's findings were also -found by Opus (plus Opus found 2 more). GPT-5's high verification bar doesn't help -when Opus is ALSO precise AND more thorough. - -**Updated task-model assignment:** - -For contradiction/consistency checking: -1. **Opus** — best choice: highest precision, deepest contradictions, most efficient -2. **GPT-5** — solid backup: zero false positives, unique TIF-related insights, but - expensive and slower -3. **Sonnet** — NOT recommended for this task: produces false positives, no unique - true contributions - -This confirms the emerging pattern: each model has task types where it excels. -Opus excels at logical argumentation and design tensions. GPT-5 excels at -exhaustive enumeration and operational concerns. Sonnet excels at speed and -structural/assumption analysis but struggles with tasks requiring formal logical -reasoning (contradiction detection, concurrency analysis per Finding #13). - -**Practical implication:** When reviewing architecture documents for internal -consistency (e.g., before implementation begins), run Opus. If budget allows, -add GPT-5 for TIF/edge-case coverage. Skip Sonnet for consistency checking — -its speed advantage is negated by the false positive risk. - -### 26. Missing-feature identification IS promptable across all models; prompt framing eliminates Opus's historical advantage — all three models find regulatory gaps when explicitly asked - -**Date:** 2026-05-05 -**Task:** Identify computations, behaviors, or features that gargoyle's -`corporate-actions.md` (992 lines) SHOULD perform for financial correctness, -regulatory compliance, or operational safety — but doesn't describe. -**How we used them:** Same document (full text) + same focused analytical -prompt to all 3 models via HAI proxy. Prompt explicitly structured around 5 -categories: missing computations, missing behaviors, missing validations, -missing integrations, and regulatory gaps. Required concrete findings with -severity. No tools, no project context beyond the document. GPT-5 via -OpenAI endpoint (16K max_completion_tokens), Opus 4.6 and Sonnet 4.6 via -Anthropic endpoint (8K max_tokens). - -| Model | Output tokens | Reasoning tokens | Findings | Critical | High | Medium | -|---|---|---|---|---|---|---| -| GPT-5 | 11,354 | 8,512 | 20 | 3 | 10 | 7 | -| Claude Opus 4.6 | 4,111 | (internal) | 23 | 6 | 10 | 7 | -| Claude Sonnet 4.6 | 4,686 | (internal) | 15 | 5 | 6 | 4 | - -**What they found — common ground (all 3 identified):** -- Wash sale rule interaction with CA-driven lot closures (IRC §1091) -- Short position treatment for corporate actions -- Same-day corporate action ordering beyond `recorded_at` timestamp -- Record date / ex-date position verification (entitlement timing) -- Idempotency guard preventing double-application per user -- Decimal precision/rounding policy unspecified -- Superseded CA status has no lot rollback mechanism -- Rights/warrants post-creation lifecycle (exercise/expiration) -- Basis preservation invariant has no runtime enforcement -- Manual entry authorization and audit trail - -**GPT-5 unique findings (not in either Claude model):** -- Per-lot eligibility based on entitlement date (not just user-level) -- Election-based outcomes for shareholder choices (cash vs stock) -- Instrument-level trading hold during CA application window -- Pre-application consistency checks against broker entitlements -- DB-level enforcement of status transitions and invariants -- Action-type-specific date semantics per field (ex vs record vs payable) -- Voluntary/tender actions beyond distributions -- Backfill/initialization guard for newly onboarded users -- Applicator retry/backoff semantics and confirmation race -- Rights indivisibility constraints vs exact Decimal quantities - -**Claude Opus unique findings (not in either other model):** -- Pending order PRICE adjustment after splits (not just cancellation) -- Multi-instrument position recalculation atomicity for mergers -- Mixed merger basis floor at zero (can produce negative basis) -- Tax lot identification method interaction with inherited dates -- Corporate action effect on strategy position limits/risk params -- Corporate actions on instruments not yet in the database -- Partial application window: new user acquires position mid-fan-out -- IRC §305(c) deemed distributions (taxable stock dividends) -- CA impact on unrealized P&L display and strategy evaluation -- Concurrent OrderManager startup + Applicator fan-out race - -**Claude Sonnet unique findings (not in either other model):** -- Stale orders: failure modes table contradicts "excluded" section -- IRC §1223(1) holding period tacking verification at lot close -- Spinoff allocation percentage — no validation child != parent instrument -- Combined spinoff allocations exceeding meaningful bounds -- Cash dividend bypasses OrderManager — record-date quantity snapshot lost -- Mixed merger large-denominator exchange ratio overflow -- Detector schedule: no intraday re-poll for same-day announcements -- ROC cumulative basis floor tracking and IRC §301(c)(3) distinction -- Mixed merger deferred loss not explicitly recorded in metadata - -**Quality assessment:** -- **Claude Opus** was the MOST PROLIFIC (23 findings) — a notable inversion - from previous experiments where Opus typically found fewer but deeper - findings. Here, the explicit "missing feature" framing appears to have - unlocked Opus's breadth. Its unique findings included genuinely critical - items: pending order price adjustment after splits (Critical — direct - financial loss), multi-instrument atomicity for mergers (Critical — - position loss), and mixed merger negative basis (High — accounting - corruption). The findings were precise, well-reasoned, and showed both - regulatory depth (IRC §305(c)) and operational awareness. -- **GPT-5** was slightly less prolific (20 findings) but maintained its - characteristic breadth and operational-level thinking. Per-lot eligibility - (not just per-user) is a subtle but important distinction. The election- - based outcomes finding shows awareness of real-world corporate action - complexity. The backfill/initialization guard is operationally significant. - GPT-5 spent 8,512 reasoning tokens — moderate for its output volume. -- **Claude Sonnet** found fewer gaps (15) but several were genuinely - insightful. The internal contradiction between the failure modes table - and the "excluded" section is a real document inconsistency. The cash - dividend record-date quantity snapshot insight (#9) identifies a DATA LOSS - problem — the opportunity to capture that data expires. The mixed merger - deferred loss recording gap shows regulatory awareness. However, some - findings were more surface-level or overlapped heavily with the others. - -**KEY INSIGHT — The original question from Finding #22 is ANSWERED:** - -> "Opus's 'missing feature identification' mode (wash sales, commissions) — -> is this promptable on other models? Could we explicitly ask GPT-5 'what -> should this system compute but doesn't' and get similar results?" - -**YES.** When explicitly prompted with a structured "missing feature" -framing, ALL three models found regulatory gaps (wash sales, IRC sections), -missing computations (basis calculations, rounding), and missing behaviors -(lifecycle events, notifications). GPT-5 produced findings in the same -*category* as what Opus uniquely found in Finding #22 (silent correctness -failures on specid-lot-selection.md). - -In Finding #22, Opus uniquely identified wash sales and commission tracking -as missing features while GPT-5 focused on mechanism incorrectness and -Sonnet on composition failures. HERE, with the explicit "what's missing" -prompt, ALL three models found wash sales, ALL found regulatory gaps, and -ALL found missing behaviors. - -**This confirms:** Opus's "missing feature identification" mode in Finding -#22 was NOT an inherent model capability — it was an emergent behavior from -the open-ended "silent correctness failures" prompt. When you give ALL models -the EXPLICIT instruction to look for missing features, they all do it. The -differentiation from #22 was caused by the prompt being more open-ended, -allowing each model to default to its natural analytical mode: -- Opus → "what's missing" (features/functionality) -- GPT-5 → "what's wrong" (mechanism failures) -- Sonnet → "what breaks when combined" (composition) - -**Prompt framing dominates model personality.** With the right prompt, -any model can be directed into any analytical mode. The model differences -that emerged in earlier open-ended experiments reflect DEFAULT TENDENCIES, -not capabilities. - -**NEW finding about Opus on complex documents:** -Opus produced MORE findings than GPT-5 (23 vs 20) — the first time this -has happened on a broad analytical task. Previous pattern: GPT-5 always -finds more (20-33 findings) while Opus finds fewer but deeper (7-13). -What changed? The document is 992 lines — the longest tested — and the -task is explicitly about breadth ("find all gaps"). On this specific -combination (long document + breadth-focused prompt), Opus appears to -allocate its internal reasoning budget toward exploration rather than -its usual depth-first design-tension mode. This suggests Opus's typical -"fewer but deeper" pattern is partially a RESPONSE to shorter documents -where depth is more productive than breadth. - -**Practical implications:** -1. For missing-feature analysis: prompt structure matters more than model - choice. All three models are viable. Use the explicit 5-category prompt. -2. Run all three for critical docs — they find different specific gaps - despite finding the same categories. -3. For open-ended analysis where you want models to find DIFFERENT things: - use open-ended prompts. For analysis where you want COMPREHENSIVE - coverage of one type: use structured prompts. -4. Opus's "fewer but deeper" personality can be overridden by document - length + breadth-focused prompt. On 992-line docs, it competes on - volume with GPT-5. - -**Cost-effectiveness:** -Opus: 4,111 output tokens for 23 findings = 179 tokens/finding -GPT-5: 11,354 output tokens (+ 8,512 reasoning) for 20 findings = 993 tokens/finding -Sonnet: 4,686 output tokens for 15 findings = 312 tokens/finding - -Opus is by far the most efficient: nearly 6x fewer tokens than GPT-5 per -finding, with MORE findings. This is the strongest cost-effectiveness case -for Opus on any tested task. On long documents with breadth-focused prompts, -Opus appears to be the optimal choice for both quality AND efficiency. - -### 28. Cross-document consistency analysis: NEW task type — GPT-5 finds deep semantic contradictions; Opus finds structural/boundary mismatches; Sonnet identifies core issues quickly - -**Date:** 2026-05-05 -**Task:** Identify contradictions and inconsistencies BETWEEN two architecture documents -describing the same system: `system-overview.md` (323 lines, narrative overview with -component flows, invariants, and domain events) and `architecture.md` (213 lines, -DDD-focused with bounded contexts, context map, and message taxonomy). -**How we used them:** BOTH documents provided as full text in a single prompt (~25KB -total). Highly structured prompt specifying 5 categories of cross-document inconsistency -(terminology conflicts, structural contradictions, flow/sequence conflicts, -ownership/authority conflicts, philosophical contradictions). Required specific output -format per finding. Explicitly excluded omissions (things one doc covers and the other -doesn't) and detail-level differences. No tools, no project context beyond the two -documents. This is a NEW analytical task not previously tested: reasoning about -CONSISTENCY BETWEEN documents rather than internal coherence of a single document. - -| Model | Time | Output tokens | Reasoning tokens | Inconsistencies found | Critical | High | Medium | -|---|---|---|---|---|---|---|---| -| GPT-5 | 125s | 9,415 | 8,384 | 6 | 2 | 3 | 1 | -| Claude Opus 4.6 | 52s | 2,351 | (internal) | 7 | 3 | 3 | 1 | -| Claude Sonnet 4.6 | 14s | 776 | (internal) | 4 | 1 | 2 | 1 | - -**What they found — common ground (all 3 identified):** -- Event sourcing (all events as source of truth) vs fills-only ground truth: - Document A says fills are "ground truth from which all other state can be - derived," while Document B says "events are the source of truth, state is - computed by replaying events." A treats fills as the recovery foundation; - B treats ALL domain events as authoritative. All three models rated this - Critical. -- Bounded context naming mismatch: "Decision Engine" / "Order Management" (A) - vs "Engine" / "Trading" (B) for the same functional responsibilities. - GPT-5 folded this into a broader ownership analysis; Opus and Sonnet - surfaced it as its own finding. -- Signal classification conflict: Document A lists "Signal emitted" as a domain - event; Document B explicitly categorizes `SignalEmitted` as an audit event - ("not used to rebuild state"). This determines event store design and - recovery semantics. - -**GPT-5 unique findings (not in either Claude model):** -- Signal persistence contradiction: Document A states "Signals are never - persisted" while Document B lists `SignalEmitted` as an audit event that IS - persisted and states the audit log is mandatory for trading. These are - directly incompatible claims about whether signal data is stored. -- Audit event ownership conflict: Document A says "Decision approved" events - originate from PortfolioRisk. Document B states "only the decision engine - writes audit events" and lists `DecisionApproved` as an audit event example. - If PortfolioRisk is part of Risk (not Engine), this is an authority violation. -- "Single writer per user" (A: OrderManager writes all trading state) vs - per-aggregate single-writer (B: each aggregate writes its own event stream, - Ledger owns positions). These are incompatible authority models — either OM - centralizes writes or each domain owns its own events. - -**Claude Opus unique findings (not in either other model):** -- Engine → OrderManager is an internal pipeline flow (A: same subgraph, direct - arrow) vs Engine → Trading is a cross-domain COMMAND (B: `PlaceOrder` command - crossing a bounded context boundary). This structural disagreement determines - whether order management is an internal pipeline stage or an independent domain - with its own aggregates and command validation. -- Signal Risk's architectural position: Document A shows a two-stage risk - architecture (SignalRisk pre-aggregation, PortfolioRisk post-aggregation) - where Risk is embedded in the pipeline. Document B's context map shows Risk - as a separate domain that Engine merely QUERIES ("kill switch active?") — - no arrow shows signal routing through Risk. Either risk logic lives inside - Engine (contradicting B's context boundary) or the context map is incomplete. -- The "reduce" step ownership: A's top-level flow labels `Approved →|"reduce"| - Decisions` (reduction at aggregation), while A's own domain events table says - "Decision reduced" originates from PortfolioRisk (reduction after aggregation). - This is actually an INTRA-document inconsistency in Document A, but Opus surfaced - it as part of cross-doc analysis. - -**Claude Sonnet unique findings:** -- None genuinely unique. All 4 findings overlapped with GPT-5/Opus common ground - (event sourcing, signal persistence, context count/naming). Sonnet was efficient - (14s, 776 tokens) but didn't identify any inconsistency that the other two missed. - -**Quality assessment:** -- **GPT-5** produced 6 well-reasoned findings with the deepest analysis of - OWNERSHIP conflicts. Its signal-persistence contradiction and single-writer - authority conflict are genuinely important — they reveal places where the two - documents would lead implementers to build fundamentally different systems. - Every finding quotes specific text from both documents and explains precisely - WHY they can't both be correct. The reasoning investment (8,384 tokens) was - used for thorough cross-referencing between documents. -- **Claude Opus** found the most inconsistencies (7) and was remarkably fast - (52s vs GPT-5's 125s). Its unique strength: identifying STRUCTURAL contradictions - about component boundaries and communication patterns. The Engine→Trading - command vs internal pipeline finding is architecturally the most significant - discovery — it reveals a fundamental disagreement about whether order - management is INSIDE or OUTSIDE the decision engine's boundary. Opus also - caught a bonus intra-document inconsistency (the "reduce" labeling error). -- **Claude Sonnet** was the fastest (14s) and most concise (776 tokens) but - found only the obvious common-ground issues. For cross-document consistency, - Sonnet's speed advantage came at the cost of missing the architectural - insights that make this task valuable. It did correctly identify all the - Critical-level issues, making it viable as a quick first-pass screen. - -**Key insight — cross-document consistency is a DISTINCT task type:** -This is fundamentally different from single-document analysis (assumptions, -race conditions, coherence). It requires: -1. Building a mental model from Document A -2. Building a separate mental model from Document B -3. Finding places where the models are incompatible -4. Reasoning about WHY they can't both be correct (not just "different") - -Step 4 is what distinguishes this from simple diff-detection. Many surface -differences (naming, detail level, scope) are NOT contradictions — the models -must judge which differences are genuinely incompatible vs. complementary. -The prompt explicitly excluded omissions and detail-level differences, and -all three models respected this constraint well. - -**Model strengths on cross-document analysis:** -- **GPT-5** excels at ownership/authority conflicts: it systematically - checked "who owns this concept" in each document and found mismatches. - Its findings cluster around "who writes what" and "who is authoritative." -- **Opus** excels at structural/boundary contradictions: it identified where - the documents draw architectural lines differently. Its findings cluster - around "where are the boundaries" and "what crosses them." -- **Sonnet** identifies the obvious/critical issues quickly but doesn't dig - deeper. Viable for screening, not for thorough analysis. - -**Comparison to Finding #15 / #27 (single-document coherence checking):** -Single-document coherence asks "does this document contradict itself?" -Cross-document consistency asks "do these documents contradict each other?" -Key differences in results: - -| Aspect | Single-doc coherence | Cross-doc consistency | -|---|---|---| -| Opus findings | 5-7 | 7 | -| GPT-5 findings | 4-6 | 6 | -| Sonnet findings | 4-5 | 4 | -| Opus unique | Design tensions | Structural/boundary mismatches | -| GPT-5 unique | Definitional errors | Ownership/authority conflicts | -| Best model | Task-dependent | Opus (most findings + fastest) | - -The relative ordering is similar (Opus ≥ GPT-5 > Sonnet for coherence-style -tasks), but the CHARACTER of unique findings shifted. On single-doc coherence, -Opus finds design tensions within a single design. On cross-doc consistency, -Opus finds BOUNDARY disagreements between two designs. GPT-5 shifts from -finding definitional errors to ownership conflicts. - -**Are these findings REAL bugs in the gargoyle documentation?** -Yes — several are genuine issues worth fixing: -1. The fills-vs-events-as-ground-truth is a real philosophical tension between - the two documents that needs resolution. -2. The Position event ownership (OrderManager vs Ledger) is a real boundary - conflict that affects implementation. -3. The Engine→Trading communication style (internal pipeline vs cross-domain - command) is a genuine structural ambiguity. -4. The signal persistence claim ("never persisted" vs `SignalEmitted` audit - event) is a direct textual contradiction. - -These are the kind of cross-document inconsistencies that cause teams to build -inconsistent implementations — one engineer reads Document A and builds one way, -another reads Document B and builds differently. - -**Practical implication:** Cross-document consistency analysis is a high-value -task for documentation maintenance. Run it when: -- A system has multiple architecture docs written at different times -- A refactoring has updated one doc but not another -- Multiple people contribute to design documentation -- Moving from high-level overview to detailed specification - -Opus is the recommended model for this task: fastest (52s vs 125s), most -findings (7 vs 6), and uniquely strong at boundary disagreements. GPT-5 adds -value for ownership-specific conflicts. Sonnet is sufficient for quick -screening (catches the Critical issues in 14s) but won't find the architectural -insights. - -**Cost-effectiveness:** -Opus: 2,351 output tokens for 7 findings = 336 tokens/finding (52s) -GPT-5: 9,415 output + 8,384 reasoning for 6 findings = 2,967 tokens/finding (125s) -Sonnet: 776 output tokens for 4 findings = 194 tokens/finding (14s) - -Opus is the clear winner on this task type: more findings than GPT-5, 2.4x -faster, and 8.8x more token-efficient per finding. GPT-5's massive reasoning -investment (8,384 tokens) produced only one fewer finding than Opus — the -verification overhead is not paying off here because cross-document contradictions -are relatively easy to verify once identified (just check both documents). - -### 29. Adversarial manipulation analysis: NEW task type — GPT-5 is most exhaustive and systematic; Opus produces qualitatively different attack vectors with system-level thinking; Sonnet is adequate but less creative - -**Date:** 2026-05-05 -**Task:** Identify adversarial manipulation paths in gargoyle's `aggregation.md` (193 lines) -— how a misbehaving, compromised, or buggy upstream component could exploit the -aggregator's design guarantees to produce harmful trading outcomes that bypass -downstream safety controls. -**How we used them:** Same document (full text) + same focused analytical question to all -3 models via HAI proxy. Highly structured prompt specifying 5 categories of adversarial -manipulation (signal injection, timing manipulation, capacity weaponization, state -corruption via crash, audit evasion). Required specific output format per finding -(attack vector, mechanism, exploit, why downstream controls miss it, severity). No tools, -no project context beyond the document itself. - -| Model | Time | Output tokens | Reasoning tokens | Attack vectors found | Critical | High | Medium | -|---|---|---|---|---|---|---|---| -| Claude Sonnet 4.6 | 27s | 1,257 | (internal) | 10 | 3 | 5 | 2 | -| Claude Opus 4.6 | 84s | 3,662 | (internal) | 12 | 5 | 5 | 0 | -| GPT-5 | 111s | 8,808 | 6,336 | 15 | 2 | 10 | 3 | - -**What they found — common ground (all 3 identified):** -- Primary signal hijacking via ranking manipulation (last-tick injection in - time-windowed to control decision parameters) -- Threshold gaming via signal replay/duplication (no deduplication means N - identical signals satisfy "N confirmations") -- Capacity flooding to force premature completion or deny legitimate trades -- Strategic crash to erase unfavorable in-flight groups -- Timeout-masqueraded manipulation (making attacks look like normal system behavior - in the audit trail) - -**GPT-5 unique findings (not in either Claude model):** -- **Direction flip against majority via ranking:** In "most recent" ranking, - emit multiple SELL confirmations then inject a late BUY — the BUY becomes - primary and the decision contradicts the bulk of evidence. Distinct from - general primary hijack because it's specifically about *directional* reversal. -- **Late-arrival exclusion of counter-signals:** Time signals so countervailing - signals arrive just after group destruction, ensuring the decision is formed - without dissenting inputs that would have altered ranking. -- **Capacity filter to curate the audit set:** Pre-fill buffer with chosen - signals so riskier alternatives cannot be included before capacity fires — - the contributing signals list looks clean. -- **Timer nullification by crash:** Crash just before a timeout that would - force-complete an unfavorable decision — the timer becomes no-op on restart, - no decision or expiry event is emitted. -- **Decision drop via induced forwarding failure:** Exploit the "Decision - forwarding fails: Decision is lost" failure mode to selectively suppress - protective decisions (stops, hedges) with no automatic retry. -- **Crash to erase evidence of contrary signals:** Post-crash, submit a - fresh group that completes quickly; audit shows only the new set, not the - earlier contradictory pre-crash signals. - -**Claude Opus unique findings (not in either other model):** -- **Instrument fragmentation to multiply position size:** Emit signals for - economically equivalent exposures using different instrument identifiers. - Each gets its own group, each produces a separate decision, bypassing - per-group capacity limits. Combined position exceeds what any single group - would allow. Identifies TOCTOU at the fan-in to PortfolioRisk. -- **Forced stale decision via timer exploitation:** Emit one signal at a - favorable price spike known to be transient, then deliberately withhold - further signals. Timer force-completes with a stale price. The entry price - WAS valid when the signal was generated — PortfolioRisk doesn't check - staleness of decision prices. -- **Timeout prevention / keep-alive suppression:** Manipulate market data - feed to suppress signals that would reach threshold N. Group expires - normally — denial-of-trading attack disguised as insufficient confirmation. -- **Crash-restart duplicate decisions:** Crash after decision is forwarded - but before strategy reflects it. Both restart "clean" — strategy re-emits - signals, aggregator produces a second decision with a fresh ID. Same trade - executes twice. PortfolioRisk can't deduplicate because IDs are different. -- **Force-complete with insufficient confirmation (capacity < threshold):** - If capacity limit is lower than threshold, hitting capacity ALWAYS force- - completes before predicate is satisfied. Fundamentally changes a 5-confirmation - strategy into a 3-confirmation strategy. -- **Pattern predicate as arbitrary decision trigger:** If adversary controls - predicate logic (via strategy configuration), can make pattern-complete - trigger on any single signal while audit shows algorithm=pattern-complete - and reason=:predicate. Trust boundary between configuration and execution. - -**Claude Sonnet unique findings (not in either other model):** -- **Cross-group timing coordination:** Coordinate signal injection across - multiple instruments to synchronize completion times, creating a burst of - correlated decisions that overwhelm PortfolioRisk individually-safe - evaluations. (NOTE: Opus found a similar concept — instrument fragmentation - — but framed it differently: Opus focused on position multiplication via - instrument aliasing, Sonnet focused on burst timing overwhelming evaluation.) -- **Multi-strategy attack distribution:** Spread manipulation across multiple - isolated strategy aggregators so no single aggregator's behavior looks - abnormal while cumulative effect is harmful. - -**Quality assessment:** -- **GPT-5** produced the most findings (15) with the most systematic coverage - across all 5 prompt categories. Its strength was in identifying SPECIFIC - INTERLEAVINGS — exactly how timing, state, and ranking mechanisms interact - to produce exploits. The direction-flip finding (#3) and the late-arrival - exclusion finding (#6) show precise temporal reasoning about when signals - arrive relative to group lifecycle events. The "decision drop via forwarding - failure" finding exploits a DOCUMENTED failure mode (from the failure table) - as an offensive weapon — turning a recovery mechanism into an attack vector. - Every finding references specific mechanisms from the spec. -- **Claude Opus** produced 12 findings with the most architecturally creative - attacks. The instrument fragmentation attack is the most SYSTEMICALLY - dangerous finding across all three models — it's not about manipulating one - group but about the RELATIONSHIP between groups, and it identifies a - TOCTOU vulnerability at the PortfolioRisk fan-in point that no other model - found. The crash-restart duplication attack is also architecturally novel — - it exploits the "clean state" guarantee as a weapon for invisible trade - doubling. Opus consistently reasons about the system BOUNDARY (aggregator - → PortfolioRisk handoff) rather than just within-component mechanics. The - pattern-predicate trust boundary finding is uniquely about CONFIGURATION - as an attack surface. -- **Claude Sonnet** produced 10 findings in 27s — extremely efficient (127 - tokens per finding). Findings were adequate and covered all 5 categories, - but lacked the specificity of GPT-5 and the architectural creativity of - Opus. Several findings were somewhat generic (e.g., "crash at strategic - moments" without specifying exactly WHEN relative to group lifecycle). - The cross-group coordination and multi-strategy distribution findings show - system-level thinking but are stated at a higher abstraction level without - concrete exploit sequences. - -**Key insight — "adversarial manipulation analysis" as a task type:** -This is qualitatively different from all previous analytical lenses tested. -Previous tasks asked models to find problems WITH the design (assumptions, -races, incoherences). This task asks models to find ways to USE the design -AGAINST itself — a creative/generative adversarial task. Results: - -- **GPT-5** treats it as an exhaustive enumeration exercise — systematically - walks through each mechanism and asks "how could this be abused?" High - count (15), thorough coverage, but some findings are minor variations of - each other (e.g., crash-related findings #10, #12, #15 share the same core - mechanism). Reasoning tokens (6,336) used for both generation and verification. -- **Opus** treats it as a creative design exercise — asks "what would a - smart adversary do that the designer didn't consider?" Fewer findings (12) - but several are genuinely novel attack concepts (instrument fragmentation, - crash-restart duplication, predicate trust boundary) that require reasoning - about the SYSTEM rather than the COMPONENT. Opus also provided a summary - table and systemic conclusion about the root design weaknesses. -- **Sonnet** treats it as a categorization exercise — fills each prompt - category with plausible attacks but at a higher abstraction level. Fast - and adequate for a first pass but wouldn't surprise a security reviewer. - -**Comparison to "predictable exploit window" (Finding #18):** -Finding #18 noted that Opus uniquely identified predictable exploit windows -in escalation-policy.md. Here, Opus again shows the strongest adversarial -creativity — the instrument fragmentation attack and crash-restart duplication -are both about exploiting DESIGN GUARANTEES (per-instrument grouping, clean -restart) as weapons. This confirms that Opus's strength on adversarial analysis -is a CONSISTENT PATTERN, not document-specific. - -GPT-5 excels when the adversarial task is framed as "enumerate all possible -abuses of each mechanism" (systematic coverage). Opus excels when the task -requires "invent novel attack concepts that exploit design boundaries" -(creative adversarial thinking). - -**Model hierarchy for adversarial manipulation analysis:** -1. GPT-5 — most thorough enumeration, best at mechanism-level exploitation (15) -2. Opus — most creative, finds system-boundary attacks others miss (12) -3. Sonnet — adequate first pass, fast, but less specific (10) - -**Practical implication:** For security-oriented architecture review: -- Run GPT-5 for comprehensive attack surface enumeration -- Run Opus for novel/creative attack vectors that exploit design boundaries -- Sonnet is sufficient only as a quick initial screen -- The UNION of GPT-5 + Opus findings (removing overlaps) would produce the - most complete adversarial analysis - -**New finding about the aggregator itself:** Several attacks identified by -multiple models point to real design weaknesses worth addressing: -1. No signal deduplication/independence validation (all 3 models) -2. Primary signal determines all decision parameters regardless of group - composition (all 3 models) -3. Transient state + no replay = perfect adversarial erasure tool (all 3) -4. Capacity/timeout treated as normal events even when weaponized (all 3) -5. No cross-group correlation at aggregator level (Opus + Sonnet) -6. TOCTOU at PortfolioRisk fan-in for concurrent decisions (Opus) diff --git a/findings/README.md b/findings/README.md new file mode 100644 index 0000000..e92e3df --- /dev/null +++ b/findings/README.md @@ -0,0 +1,16 @@ +# Model Findings — Analytical & Research Work + +_Tracking what actually works (and doesn't) when using AI models for research, +analysis, bias detection, and document review — not coding._ + +Started: 2026-04-26 + +## Context + +We use multiple models in different roles: Claude Code (Opus/Sonnet) for +generation, Sonnet + GPT-5 for independent dual review, smaller models for +focused analytical tasks. Most public discussion is about coding. We found +almost no published methodology for using models in analytical research tasks +(searched 2026-04-26). That gap is why we're tracking this. + +Each experiment lives in its own file. See individual finding files below.