refactor(findings): split ALL-FINDINGS.md into per-experiment files

Break the monolithic 3249-line findings file into 29 individual files, one per experiment. Each file is named YYYY-MM-DD-NN-slug.md for easy chronological sorting and discovery. No content changes — purely structural reorganization.
2026-05-06 07:15:50 -07:00
parent 1b108ff66e
commit 6af8a6ee10
32 changed files with 3232 additions and 3254 deletions
@@ -53,12 +53,15 @@ Each experiment:
 ## Repository Structure

 ```
-findings/           # Individual findings with full analysis
-  01-different-models-different-things.md
-  02-narrow-lens-vs-broad-review.md
+findings/                                         # Individual findings with full analysis
+  README.md                                       # Context and index
+  YYYY-MM-DD-NN-slug.md                           # One file per experiment
+  2026-04-26-01-different-models-catch-different-things.md
+  2026-04-26-07-emerging-role-assignments-pattern-not.md
+  2026-05-03-07b-token-budget-matters-more-than.md  # Duplicate #7 (suffix b)
+  2026-05-03-15-design-coherence-analysis.md
  ...
-  28-cross-document-consistency.md
-  29-adversarial-manipulation.md
+  2026-05-05-29-adversarial-manipulation-analysis-new-task.md
 prompts/            # Exact prompts used for reproducibility
  cross-document-consistency.md
  design-coherence.md
@@ -69,6 +72,9 @@ open-questions.md   # Unanswered questions for future experiments
 methodology.md      # Full methodology notes
 ```

+Findings are named `YYYY-MM-DD-NN-slug.md` for chronological sorting.
+Numbers are zero-padded (01–29). The duplicate finding #7 uses a `b` suffix.
+
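A minimal sketch of how the naming scheme sorts, assuming every finding matches `YYYY-MM-DD-NN-slug.md` with an optional single-letter suffix after the number. The regex and the `parse_finding` helper are illustrative and not part of the repository.

```python
import re

# Hypothetical pattern for YYYY-MM-DD-NN-slug.md, with an optional
# lowercase suffix after the number (e.g. "07b" for the duplicate #7).
FINDING_RE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})-(?P<num>\d{2})(?P<suffix>[a-z]?)-(?P<slug>.+)\.md$"
)


def parse_finding(filename):
    """Return (date, number, suffix, slug), or None if the name does not match."""
    m = FINDING_RE.match(filename)
    if m is None:
        return None
    return m["date"], int(m["num"]), m["suffix"], m["slug"]


names = [
    "2026-05-03-15-design-coherence-analysis.md",
    "2026-04-26-07-emerging-role-assignments-pattern-not.md",
    "2026-05-03-07b-token-budget-matters-more-than.md",
    "2026-04-26-01-different-models-catch-different-things.md",
    "README.md",
]

# Date-first, zero-padded names sort chronologically as plain strings;
# README.md is filtered out because it does not match the pattern.
findings = sorted(n for n in names if parse_finding(n) is not None)
```

Because the date comes first and the numbers are zero-padded, a plain string sort is enough; no custom sort key is needed.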
 ## Who We Are

 This research is conducted by [Rodin](https://gitea.weiker.me/rodin) (AI
@@ -0,0 +1,16 @@
+# Finding 1: Different models catch different things (confirmed)
+
+**Date:** 2026-04-26
+**Task:** PR reviews on DDD reference docs (~6,600 lines across 18 files)
+**How we used them:** Both models got the same task via pr-review skill —
+fetch diff, fetch full file content for changed files, review against PR
+description and linked issue acceptance criteria. Rich context: full diff,
+project CLAUDE.md conventions, issue body. Each reviewer ran independently
+in its own sub-agent with its own Gitea token. No cross-pollination.
+
+- GPT-5 caught SUMMARY.md verdict mismatches (Commanded classification,
+  small teams classification) that Sonnet missed entirely (PR #375)
+- Sonnet caught a broken cross-reference link that GPT-5 missed (PR #378)
+- **Takeaway:** Different blind spots are real. Neither model is strictly better
+  for analytical review — they complement each other. This is why we run two
+  independent reviewers from different model families.
@@ -0,0 +1,18 @@
+# Finding 2: Cheap model + narrow lens > expensive model + broad review (one data point)
+
+**Date:** 2026-04-26
+**Task:** Check 12 rewritten hypotheses for directional bias
+**How we used them:**
+- Sonnet & GPT-5: full PR review context (diff, file content, issue, AC).
+  Broad mandate: "review this PR." Rich context but unfocused task.
+- GPT-4.1 Mini: given ONLY the 12 hypothesis texts + one focused question:
+  "Do any of these hypotheses lead toward a predetermined conclusion?"
+  Minimal context, laser-focused task. No diff, no project docs, no issue.
+
+- Both Sonnet and GPT-5 approved the hypotheses as reviewers
+- GPT-4.1 Mini found ALL 12 pushed toward predetermined conclusions
+- Words like "requires," "necessary," "must be" were flagged as directional
+- **Takeaway:** Task framing mattered more than model size. Rich context +
+  broad mandate = missed the forest for the trees. Minimal context + precise
+  question = found exactly what mattered. This needs more testing — was it
+  the narrow framing, the lack of surrounding context, or both?
@@ -0,0 +1,15 @@
+# Finding 3: GPT-5 times out on complex multi-step analytical tasks (confirmed pattern)
+
+**Date:** 2026-04-26
+**Task:** Full PR review of #382 (research document rewrite)
+**How we used it:** pr-review skill — multi-phase (fetch diff, fetch files,
+check CI, analyze against AC, post inline comments, post summary). 7 phases,
+many curl calls to Gitea API, large diff context. Heavy tool-use workflow
+through SAP proxy (adds latency vs direct API). 300s timeout.
+
+- Timed out 3 times at 300s (17, 6, 6 tool calls respectively)
+- Bottleneck was model processing time, not network (~0.3s Gitea API latency)
+- **Takeaway:** Break analytical tasks into focused bounded pieces. Twelve
+  small deep reviews > one rushed big one. The issue isn't GPT-5's analysis
+  quality — it's that multi-phase tool-heavy workflows burn too much time
+  on mechanics. Separate the data gathering from the analysis.
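
The takeaway above can be sketched as a two-phase loop: gather everything once, then run small analysis pieces against a time budget. This is only an illustration under stated assumptions; `gather_context`, `analyze_piece`, and the file names are hypothetical stand-ins, not the pr-review skill.

```python
import time


def gather_context(pr_number):
    # Hypothetical: one up-front fetch of everything the analysis needs
    # (diff, files, CI status), with no model call involved.
    return {"pr": pr_number, "files": {"a.md": "text a", "b.md": "text b"}}


def analyze_piece(path, text):
    # Hypothetical stand-in for one bounded model call on a single file.
    return f"{path}: {len(text)} chars reviewed"


def review(pr_number, budget_s=300.0):
    """Return (done, skipped): results per file, and files left out of budget."""
    context = gather_context(pr_number)  # phase 1: mechanics only
    deadline = time.monotonic() + budget_s
    done, skipped = {}, []
    for path, text in context["files"].items():  # phase 2: analysis only
        if time.monotonic() >= deadline:
            skipped.append(path)  # report what was not reviewed
            continue
        done[path] = analyze_piece(path, text)
    return done, skipped
```

A piece that runs out of budget is reported as skipped, so a partial review still returns something rather than timing out as a whole.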
@@ -0,0 +1,18 @@
+# Finding 4: GPT-5 defaults to delegation; Claude defaults to doing the work
+
+**Date:** 2026-04-26
+**Task:** PR review delegation to sub-agents
+**How we used them:** Both spawned as sub-agents from main session with
+same task description, same pr-review skill file, same Gitea credentials.
+Difference: GPT-5 got model override to gpt5, Sonnet used default model.
+Both got full skill instructions.
+
+- GPT-5 first attempt: spawned sub-sub-agents and timed out
+- GPT-5 with "do it yourself, no sub-agents" + step-by-step: worked
+- Even with constraints, GPT-5 sometimes dumps raw tool output instead of
+  synthesizing — needs explicit output format instructions
+- Claude (Sonnet/Opus) given the same kind of task does the work directly
+- **Takeaway:** GPT interprets complex task descriptions as delegation
+  opportunities. Claude interprets them as work to do. For GPT: explicit
+  single-actor instructions + output format. For Claude: can give broader
+  mandate. Same skill file, very different behavior.
@@ -0,0 +1,17 @@
+# Finding 5: Sonnet is fast and catches structural issues; GPT-5 is slow and catches semantic issues
+
+**Date:** 2026-04-26
+**Task:** Dual review across PRs #372, #375, #378, #380, #382
+**How we used them:** Same pr-review skill, same context (diff + files +
+issue + AC), same sub-agent pattern. Only variable: model. Both got rich
+context. Both ran the full 7-phase review skill.
+
+- Sonnet consistently finishes first, catches formatting, broken links,
+  structural problems (missing sections, dangling refs)
+- GPT-5 takes longer, catches meaning-level problems (verdict mismatches,
+  classification inconsistencies, logical gaps)
+- **Takeaway:** With identical rich context and identical instructions, the
+  models naturally gravitate to different things. Sonnet is the structural
+  reviewer; GPT-5 is the semantic reviewer. Both roles matter. Question:
+  would Sonnet catch semantic issues if given a narrower "check for logical
+  consistency" framing instead of broad review?
@@ -0,0 +1,20 @@
+# Finding 6: Single agent can't handle 1000+ line document generation (confirmed pattern)
+
+**Date:** 2026-04-26
+**Task:** DDD v2 forge analysis drafting
+**How we used them:** Single Sonnet/Opus sub-agents given full research
+material (~3,874 lines of research notes) + outline + instructions to write
+complete document. Very rich context (all research), very large output
+requirement (1000+ lines).
+
+- Five single-agent attempts died (OOM, disconnect, timeout) trying to write
+  full documents
+- Sectional approach (5 parallel Sonnet subagents, ~500-700 lines each)
+  succeeded immediately — each got same research but only their section's
+  outline
+- Same pattern when Claude Code attempted full Part V rewrite — died
+- Three agents × ~320 lines each worked first try
+- **Takeaway:** This is a confirmed, repeatable limit for generation tasks.
+  Not model-specific — it's a context/output length problem. Rich input
+  context is fine; it's the output length that kills. Break output into
+  sections, keep input context rich, draft in parallel, assemble.
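
A rough sketch of the sectional approach, assuming each section can be drafted independently from the same research material. `draft_section` is a hypothetical placeholder for a sub-agent call, and the worker count of 5 only mirrors the number of parallel sub-agents mentioned above.

```python
from concurrent.futures import ThreadPoolExecutor


def draft_section(research, heading):
    # Hypothetical stand-in for one sub-agent: it receives the full
    # research notes but only its own section heading from the outline.
    return f"## {heading}\n\n(draft based on {len(research)} research lines)\n"


def draft_document(research, outline, max_workers=5):
    """Draft every outline section in parallel, then assemble in outline order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in input order, so assembly is a join.
        sections = list(pool.map(lambda h: draft_section(research, h), outline))
    return "\n".join(sections)


document = draft_document(["note 1", "note 2"], ["Part I", "Part II", "Part V"])
```

Input context stays rich (every worker sees all the research) while each worker's output stays short, which is the split the finding recommends.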
@@ -0,0 +1,17 @@
+# Finding 7: Emerging role assignments (pattern, not conclusion)
+
+**Date:** 2026-04-26 (one day of intensive work — treat as hypothesis)
+
+- Opus (via Claude Code): complex generation needing deep project context.
+  Rich context: CLAUDE.md, full codebase access, design docs. Broad mandate.
+- Sonnet: parallel volume work (5 subagents drafting simultaneously).
+  Rich context per section, constrained output scope.
+- GPT-5: independent analytical review. Rich context (diff + files + issue).
+  Best when task is bounded and explicit.
+- GPT-4.1 Mini: focused narrow analysis (bias detection). Minimal context,
+  precise question. Cheap and fast.
+- **Takeaway:** The role assignment matters, but so does the context shape.
+  Opus gets broad context + broad mandate. Sonnet gets broad context +
+  narrow scope. GPT-5 gets rich context + explicit task. GPT-4.1 Mini gets
+  minimal context + laser question. We haven't tested swapping these
+  combinations — that's where the real learning will come from.
@@ -0,0 +1,58 @@
+# Finding 8: Bias detection: all models catch it with any framing — when the signal isn't buried
+
+**Date:** 2026-04-27
+**Task:** Detect directional bias in 8 deliberately biased hypotheses about
+microservices vs monolith architecture for fintech startups.
+**How we used them:** Created fresh test material (8 hypotheses with pro-
+microservices bias via absolutes like "inevitably," "necessary," "must,"
+"requires," plus one factually inverted claim about consistency guarantees).
+Ran 4 conditions in parallel sub-agents:
+
+| Condition | Model | Framing | Context |
+|---|---|---|---|
+| A | GPT-4.1 Mini | Narrow: "Do any lead toward a predetermined conclusion?" | Hypotheses only |
+| B | Sonnet | Same narrow question | Hypotheses only |
+| C | GPT-5 | Same narrow question | Hypotheses only |
+| D | Sonnet | Broad: "Review quality, clarity, testability, and issues" | Hypotheses only |
+
+**Results:**
+- **All 4 conditions detected 8/8 biased hypotheses.** No misses.
+- All 3 narrow-framing models (Mini, Sonnet, GPT-5) produced structurally
+  similar output: per-hypothesis verdict, biasing words, neutral version,
+  severity assessment.
+- All 3 narrow-framing models flagged H8's factual inversion (distributed
+  transactions DON'T provide stronger consistency than monolithic ACID).
+- GPT-5 added specific counterexamples (LMAX Disruptor, Shopify, Stack
+  Overflow, Basecamp) — marginally richer analysis.
+- Sonnet broad mandate also caught the bias — framed as one of three
+  "systemic problems" (deterministic language, pro-microservices framing
+  bias, underspecified constructs). Additionally provided testability and
+  operationalization analysis that the narrow framing didn't ask for.
+- Sonnet broad took ~72s vs ~39s for narrow conditions (more output).
+
+**Takeaway:** When the biased text is the ONLY input (no surrounding noise),
+all tested models — including the cheapest (GPT-4.1 Mini) — detect bias
+regardless of whether the question is narrow or broad. This appears to
+**contradict** original finding #2 ("cheap model + narrow lens > expensive
+model + broad review"), but the key difference is context noise:
+
+- **Original experiment (2026-04-26):** Sonnet and GPT-5 missed bias during
+  FULL PR REVIEW with rich project context (diff, file content, issue text,
+  acceptance criteria, project conventions). The hypotheses were buried in
+  layers of review mechanics.
+- **This experiment (2026-04-27):** Even the "broad" condition gave ONLY the
+  hypothesis text — no diff, no PR structure, no project context noise.
+
+**Refined hypothesis:** The original finding #2 was about **signal-to-noise
+ratio**, not about model capability or framing precision. When biased text
+is presented in isolation, any model catches it. When biased text is buried
+in a large PR review with many other things to check, the bias signal gets
+lost in the noise — unless you explicitly ask about it. The "narrow lens"
+worked because it eliminated the noise, not because smaller models are
+better at bias detection.
+
+**Next experiment to confirm:** Give a model the FULL PR review context
+(diff, files, issue, AC) but add the narrow bias question as an explicit
+review checklist item. If the model catches bias despite the rich context,
+it confirms the signal-to-noise hypothesis. If it misses, it suggests
+something else is at play (attention allocation, task switching cost).
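
One way the proposed follow-up could be set up, as a sketch only: build the same full-context review prompt twice, once without and once with the narrow bias question as a checklist item. The section headings, the checklist format, and `build_review_prompt` are assumptions for illustration; only the bias question itself is taken from the experiments above.

```python
BIAS_QUESTION = "Do any of these hypotheses lead toward a predetermined conclusion?"


def build_review_prompt(diff, files, issue, acceptance_criteria, checklist=()):
    """Assemble a full-context PR review prompt, optionally with a checklist."""
    parts = [
        "Review this PR.",
        "## Diff\n" + diff,
        "## Files\n" + "\n\n".join(f"### {p}\n{t}" for p, t in files.items()),
        "## Issue\n" + issue,
        "## Acceptance criteria\n" + acceptance_criteria,
    ]
    if checklist:
        items = "\n".join(f"- [ ] {item}" for item in checklist)
        parts.append("## Review checklist\n" + items)
    return "\n\n".join(parts)


context = dict(
    diff="(diff text)",
    files={"hypotheses.md": "(file text)"},
    issue="(issue body)",
    acceptance_criteria="(AC text)",
)
control = build_review_prompt(**context)
treatment = build_review_prompt(**context, checklist=[BIAS_QUESTION])
```

Keeping everything except the checklist identical means any difference in bias detection can be attributed to the explicit question rather than to a change in context.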
@@ -0,0 +1,77 @@
+# Finding 9: Gap-finding in architecture docs: GPT-5 finds domain-specific gaps, GPT-4.1 is generic, Mini is formulaic
+
+**Date:** 2026-05-02
+**Task:** Identify missing failure scenarios in gargoyle's `failure-modes.md` (383 lines)
+**How we used them:** Same document (full text, no truncation) + same focused
+analytical question to all 3 models via HAI proxy (OpenAI-compatible endpoint).
+No tools, no project context beyond the document itself. Single prompt, no
+conversation history. Temperature 0.3 for GPT-4.1/Mini, default (1.0) for GPT-5
+(required by the model).
+
+| Model | Time | Output tokens | Reasoning tokens | Scenarios found |
+|---|---|---|---|---|
+| GPT-4.1 Mini | 16s | 2,003 | 0 | 10 |
+| GPT-4.1 | 24s | 2,575 | 0 | 15 |
+| GPT-5 | 45s | 8,565 | 6,656 | 14 |
+
+**What they found — common ground (all 3 identified):**
+- ETS table corruption/loss affecting gates
+- BEAM scheduler starvation / GC pauses
+- WebSocket message duplication/reordering
+- Postgres connection pool exhaustion / deadlocks
+- Clock skew / time drift
+- Process registry inconsistency
+
+**GPT-5 unique findings (not in either other model):**
+- Broker rate limiting (429s) — not "connection lost" so existing logic
+  doesn't trigger, but can't flatten during kill switch
+- Broker auth failure / credential rotation — distinct from connection loss
+- Corporate actions (splits, symbol changes) — position drift without
+  triggering staleness detection
+- Duplicate pipeline instances for same user (DynamicSupervisor race)
+- DB "commit unknown outcome" causing restart loops (Ecto commit succeeds
+  at Postgres but client times out → retry → unique constraint → crash loop)
+- Cross-symbol strategies with partial staleness — multi-leg signals
+  computed from mix of fresh and stale data
+- Partial cancel_all during kill switch masked by process restarts
+
+**GPT-4.1 unique findings (not in GPT-5 or Mini):**
+- Zombie processes after halt (supervisor misconfiguration)
+- Unsupervised Task crashes going unnoticed
+- Audit log writes failing silently (not in same transaction as state change)
+- ClOrdID unique constraint violation from race in sequence generation
+- Broker API semantic changes (silent breaking changes)
+
+**GPT-4.1 Mini unique findings:**
+- Race between kill switch engagement and reconciliation completion
+  (timing coordination gap) — this was more explicitly called out than
+  in the other models, though GPT-5 touches it implicitly
+- Strategy.Worker / Aggregator partial crash inconsistency
+
+**Quality assessment:**
+- **GPT-5** had the most *domain-relevant* and *actionable* gaps. Broker
+  rate limiting, auth failures, corporate actions, and the DB commit
+  unknown-outcome scenario are all realistic production issues specific
+  to THIS system. The cross-symbol partial staleness finding shows
+  deeper architectural reasoning about component interactions.
+- **GPT-4.1** was thorough and well-structured but more generic/defensive.
+  Many of its unique findings (zombie processes, unsupervised Tasks,
+  audit log loss) are general Elixir concerns rather than specific to
+  the document's architecture. Good for a completeness checklist.
+- **GPT-4.1 Mini** was formulaic — each finding followed the same template
+  and several were somewhat surface-level or restated things the document
+  partially covers. Still found the most scenarios per dollar.
+
+**Takeaway:** For gap-finding in architecture documents, GPT-5's reasoning
+tokens pay off. It doesn't just list "things that could go wrong" — it
+identifies *specific interactions* that the document's existing mechanisms
+don't cover (e.g., rate limiting bypasses the "connection lost" detection,
+corporate actions bypass staleness detection). GPT-4.1 is a solid
+middle ground: more thorough than Mini, less insightful than GPT-5.
+Mini is fine for a quick sanity check but won't find the subtle gaps.
+
+**Cost-effectiveness:** Mini found 10 scenarios in 16s for ~7K tokens.
+GPT-5 found 14 scenarios (with 7 genuinely unique insights) in 45s for
+~13.5K tokens (including 6.6K reasoning). For architecture review where
+missing a gap could mean financial loss, the GPT-5 cost is justified.
+For routine doc review, Mini + human judgment is probably sufficient.
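
The per-scenario cost can be worked out directly from the table in this finding. The sketch below uses output tokens only, so it will not reproduce the ~7K and ~13.5K totals quoted above, and it says nothing about the quality of each scenario.

```python
# Figures copied from the results table in this finding.
runs = {
    "GPT-4.1 Mini": {"seconds": 16, "output_tokens": 2003, "scenarios": 10},
    "GPT-4.1": {"seconds": 24, "output_tokens": 2575, "scenarios": 15},
    "GPT-5": {"seconds": 45, "output_tokens": 8565, "scenarios": 14},
}


def per_scenario(run):
    """Wall-clock seconds and output tokens spent per scenario found."""
    return {
        "seconds": round(run["seconds"] / run["scenarios"], 1),
        "output_tokens": round(run["output_tokens"] / run["scenarios"]),
    }


for model, run in runs.items():
    print(model, per_scenario(run))
```

By this measure GPT-5 spends roughly three times the output tokens per scenario of the other two models, which is the premium the finding argues is justified when a missed gap is expensive.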
@@ -0,0 +1,98 @@
+# Finding 10: Hidden-assumption identification: GPT-5's reasoning produces qualitatively different (not just more) findings
+
+**Date:** 2026-05-02
+**Task:** Identify hidden assumptions in gargoyle's `cold-start-and-recovery.md` (234 lines)
+that could break under real-world production conditions.
+**How we used them:** Same document (full text) + same focused analytical question
+to all 3 models via HAI proxy (OpenAI-compatible endpoint). No tools, no project
+context beyond the document itself. Single prompt, no conversation history.
+Temperature 0.3 for GPT-4.1/Mini; GPT-5 uses default (required).
+
+| Model | Time | Output tokens | Reasoning tokens | Assumptions found |
+|---|---|---|---|---|
+| GPT-4.1 Mini | 25s | 3,090 | 0 | 12 |
+| GPT-4.1 | 77s | 2,751 | 0 | 14 |
+| GPT-5 | 78s | 2,649 | 4,096 | 26 |
+
+**What they found — common ground (all 3 identified):**
+- Broker API consistency/availability during reconciliation
+- ETS table availability and fail-closed behavior
+- Single-writer/mailbox ordering guarantees holding in practice
+- User independence assumption vs shared resources (rate limits, DB)
+- Reconciliation idempotency under repeated runs
+- Corporate action data completeness/timeliness
+- Escalation threshold calibration vs changing market conditions
+- Strategy warmup with partial/missing historical data
+- Signal expiry correctness on restart
+
+**GPT-5 unique findings (not in either other model):**
+- Unbounded mailbox growth during extended reconciliation (memory pressure
+  from queued messages at market open)
+- handle_continue side effects in OTHER processes (risk, metrics) acting
+  concurrently via different paths
+- Pre-existing GTC orders filling while gated (positions as moving target)
+- Broker position semantics mismatch (trade-date vs settled-date)
+- Strategy warmup evaluate() having non-signal side effects (metrics, caches)
+- Historical bar / live tick boundary alignment (double-processing or gaps)
+- ETS gate caching in process state creating fail-open windows
+- Correlated retry stampede when many users restart together
+- Corporate action double-application race with broker (missing idempotency
+  keys per action/instrument/date)
+- Kill switch state vs DB unavailability at startup
+- Market data subscriptions as shared bottleneck across "independent" users
+- Time-invariant signals incorrectly expired by aggregation window logic
+- Broker fills vs positions endpoints internally inconsistent (different caches)
+- Positions changing under reconciliation while kill switch is engaged
+- Gate phase sequencing: :ready written before worker warmup completes
+- Periodic reconciler allowing 1hr of divergent trading (rate-of-change blind)
+
+**GPT-4.1 unique findings (not in GPT-5 or Mini):**
+- No correlated failure handling (all failure modes treated as isolated) —
+  only model to frame this as a meta-assumption about the failure table
+
+**GPT-4.1 Mini unique findings:**
+- None that weren't also covered by the other two models
+
+**Quality assessment:**
+- **GPT-5** didn't just find more assumptions — it found *qualitatively
+  different kinds*. Many of its unique findings involve multi-component
+  interactions (mailbox + reconciliation + market open timing), semantic
+  mismatches (trade-date vs settled positions), and second-order effects
+  (metrics side effects during warmup, GTC orders filling while gated).
+  These require reasoning about system behavior across boundaries the
+  document doesn't explicitly draw.
+- **GPT-4.1** was competent and structured, found the same core assumptions
+  as Mini, plus one good meta-observation about correlated failures. But
+  it stayed within the document's own framing — it found assumptions the
+  document *almost* states rather than ones the document can't see.
+- **GPT-4.1 Mini** was formulaic. Every finding maps cleanly to a section
+  of the document. It's essentially "what could go wrong with each stated
+  mechanism" rather than "what does this design take for granted about
+  the world outside itself."
+
+**Key insight — reasoning tokens change the KIND of analysis:**
+GPT-5's 4,096 reasoning tokens aren't producing "more of the same" —
+they're producing a different analytical mode. The non-reasoning models
+(4.1 and Mini) identify risks within the document's own frame of reference.
+GPT-5 reasons about the document's relationship to the external world:
+broker semantics, deployment topology, OTP runtime behavior under load,
+timing correlations across independent subsystems. This is the difference
+between "what could this mechanism fail at" and "what must be true about
+the world for this mechanism to work."
+
+**Comparison to Finding #9 (gap-finding on failure-modes.md):**
+Same pattern confirmed. GPT-5 consistently finds domain-specific,
+interaction-level issues that require reasoning about component boundaries.
+GPT-4.1 is thorough within-frame. Mini is formulaic. The gap between
+GPT-5 and the others is larger here than in #9 — possibly because
+"hidden assumptions" requires more abstraction than "missing failure
+scenarios." Assumption-finding requires the model to reason about what
+ISN'T stated, which benefits more from extended reasoning.
+
+**Practical implication:** For architecture review, running GPT-5 on
+"identify hidden assumptions" is higher-value than the same question to
+non-reasoning models. The cost difference (4K extra reasoning tokens) is
+trivial for a document that will drive months of implementation. Use
+non-reasoning models for within-frame checks ("does this section have
+gaps") and reasoning models for cross-boundary analysis ("what must be
+true about the world for this to work").
@@ -0,0 +1,124 @@
+# Finding 11: Hidden-assumption identification on simpler doc: reasoning models diverge in approach, both outperform non-reasoning
+
+**Date:** 2026-05-02
+**Task:** Identify hidden assumptions in gargoyle's `market-calendar.md` (238 lines)
+— a simpler, single-component document vs the 234-line cold-start doc from Finding #10.
+**How we used them:** Same document (full text) + same focused analytical question
+to all 3 models via HAI proxy. No tools, no project context beyond the document
+itself. Single prompt, no conversation history. Temperature 0.3 for GPT-4.1;
+GPT-5 and Opus use their defaults (required). Same prompt across all three.
+
+| Model | Time | Output tokens | Reasoning tokens | Assumptions found |
+|---|---|---|---|---|
+| GPT-4.1 | 19s | 2,554 | 0 | 14 |
+| Claude Opus 4.6 | 74s | 3,288 | (internal, not reported) | 13 |
+| GPT-5 | 101s | 8,417 | 5,504 | 24 |
+
+**What they found — common ground (all 3 identified):**
+- Alpaca calendar API data correctness/completeness as single source of truth
+- Alpaca API availability at startup (no local cache persistence)
+- ETS table atomicity during refresh (partial-state exposure risk)
+- System clock/timezone alignment (dates are timezone-naive)
+- NYSE emergency/unscheduled closures not reflected until refresh
+- Two-year cache range sufficiency
+- API response format stability
+- Rate limiting / API capacity concerns
+
+**GPT-5 unique findings (not in either other model):**
+- Date struct term-ordering in ETS match specs may not match chronological
+  order (ETS range guards rely on Erlang term comparison, not Date semantics)
+- close_time/1 returns naive Time without timezone — DST conversion burden on
+  consumers, one hour off twice per year
+- trading_day?/1 conflates "not a trading day" with "calendar unavailable" —
+  operational outages invisible to callers
+- ETS table name collision risk (global namespace per node)
+- No other process should modify the ETS table (access mode discipline)
+- Network egress and credential availability on all nodes at all times
+- ETS read/write concurrency flags for contention under load
+- Direct ETS access by consumers bypassing the module's error handling
+- next/prev_trading_day edge cases at cache boundaries
+- Alpaca API start/end parameter inclusivity (off-by-one at year boundaries)
+- Half-day vs full-day distinction insufficiency for special sessions
+- Small table size makes O(n) selects acceptable (scaling concern)
+- Year-end refresh failure leaving gaps at boundary
+- Alpaca never omits a legitimate trading day (absence = non-trading conflation)
+
+**Claude Opus unique findings (not in either other model):**
+- ETS ownership semantics: heir-protection would change fail-closed behavior;
+  current design means ALL consumers fail simultaneously during crash-to-restart
+  window (framed as a design tension, not just a risk)
+- Silent data corruption from partial API response (pagination/truncation) —
+  specifically that missing rows are SILENT failures with no error propagation
+  (other models mentioned API completeness but not the silence aspect)
+- Consumers calling functions with Dates, not DateTimes — the API accepts Date.t()
+  but doesn't specify HOW consumers should derive "today" (system-wide
+  coordination problem made invisible by the API contract)
+- `trading_day?/1` returning false is NOT fail-closed for ALL consumers — only
+  for PDT-like "block action" consumers; for batch-trigger consumers it's
+  fail-OPEN (subtle inversion of safety semantics)
+- Startup ordering: background_children placement means PDT could receive orders
+  before MarketCalendar finishes init, creating recurring rejection windows
+  during hot deploys
+- Continuous-running assumption for refresh timer (daily restarts would mean
+  refresh mechanism never fires — no staleness alert exists)
+
+**GPT-4.1 unique findings (not in either other model):**
+- No need for real-time calendar change notification (event emission gap)
+- All consumers using the same module instance (configuration consistency)
+- No need for historical calendar data (audit/backtesting limitation)
+- Consumers correctly handling {:error, :calendar_unavailable} in practice
+
+**Quality assessment:**
+- **GPT-5** found the most assumptions (24) with the most technical specificity.
+  Many are implementation-level insights (ETS term ordering, named table
+  collisions, read_concurrency flags) that demonstrate deep Erlang/OTP
+  knowledge. Some are slightly obvious or overlapping. The ETS term-ordering
+  finding is genuinely insightful: `Date` structs are maps, and Erlang term
+  order compares map values by sorted key (`calendar`, `day`, `month`, `year`),
+  so structural comparison is not guaranteed to be chronological. Also provided concrete recommendations.
+- **Claude Opus** found fewer assumptions (13) but several were qualitatively
+  different — they identified *design tensions* and *semantic inversions*
+  rather than just failure scenarios. The fail-open/fail-closed inversion
+  (finding #12), the ETS ownership tension, and the "API makes timezone
+  coordination invisible" findings show reasoning about the design's
+  *relationship to its consumers* rather than just its internal mechanics.
+  Tighter, more curated output with less filler.
+- **GPT-4.1** was competent and well-structured (14 assumptions, clean table)
+  but stayed within the document's own framing. Its unique findings are
+  relatively generic ("consumers should handle errors correctly," "no
+  historical data"). Solid baseline, no surprises.
+
+**Key insight — two reasoning models, different analytical styles:**
+GPT-5 and Opus are both reasoning models, but they reason about different
+things. GPT-5 reasons DEEPER into implementation mechanics (how does ETS
+actually work? what are the exact failure modes of each component?). Opus
+reasons WIDER about system context (how does this component's API contract
+affect the safety properties of the overall system? what tensions does this
+design create that aren't visible to the author?).
+
+GPT-5's approach: "Here are 24 things that could go wrong, many highly
+technical." Opus's approach: "Here are 13 assumptions, several of which
+reveal design tensions the document can't see about itself."
+
+**Does the reasoning gap narrow with simpler docs?**
+Comparing to Finding #10 (cold-start doc, 234 lines, 26 vs 14 vs 12 assumptions
+for GPT-5/GPT-4.1/Mini):
+- GPT-5 still dominates in raw count (24 vs 14 for GPT-4.1)
+- The gap ratio is similar (~1.7x here vs ~1.9x in Finding #10)
+- Document complexity doesn't appear to be the driver of the gap —
+  reasoning tokens enable more exhaustive exploration regardless of
+  input complexity
+
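The gap ratios quoted above follow from the assumption counts in the two result tables; a quick check:

```python
# Assumption counts copied from the tables in Finding #10 and Finding #11.
counts = {
    "Finding 10": {"GPT-5": 26, "GPT-4.1": 14},
    "Finding 11": {"GPT-5": 24, "GPT-4.1": 14},
}

# Ratio of GPT-5's count to GPT-4.1's count, rounded to one decimal place.
ratios = {
    name: round(c["GPT-5"] / c["GPT-4.1"], 1) for name, c in counts.items()
}
```

With only two documents, the ratio is a rough indicator rather than a measured trend.
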
+**Claude Opus vs GPT-5 (the headline comparison):**
+They're not competing on the same axis. GPT-5 is better for "find all
+possible issues" (breadth + technical depth). Opus is better for "find
+the assumptions that will actually surprise the author" (insight density).
+If you want a security-audit-style exhaustive list: GPT-5. If you want a
+design-review-style "here's what you're not seeing about your own design":
+Opus. Both are better than GPT-4.1 for this task, but in different ways.
+
+**Practical implication:** Run BOTH reasoning models on architecture docs.
+GPT-5 catches implementation-level hazards the team might miss during
+coding. Opus catches design-level tensions the team might miss during
+planning. GPT-4.1 is sufficient as a quick sanity check but won't
+surprise you.
@@ -0,0 +1,125 @@
+# Finding 12: Sonnet 4.6 outperforms expectations on assumption-finding; competes with reasoning models on complex docs
+
+**Date:** 2026-05-02
+**Task:** Identify hidden assumptions in gargoyle's `order-execution.md` (785 lines)
+— a complex, multi-component document covering OrderManager, BrokerAdapter,
+TradeStream, and PositionReconciler.
+**How we used them:** Same document (full text, no truncation) + same focused
+analytical question to all 3 models. GPT-5 via HAI OpenAI endpoint; Opus 4.6
+and Sonnet 4.6 via HAI Anthropic endpoint. No tools, no project context beyond
+the document itself. Single prompt, no conversation history.
+
+| Model | Time | Output tokens | Reasoning tokens | Assumptions found |
+|---|---|---|---|---|
+| GPT-5 | 93s | 8,485 | 6,016 | 20 |
+| Claude Sonnet 4.6 | 106s | 4,637 | (internal) | 17 |
+| Claude Opus 4.6 | 105s | 4,615 | (internal) | 12 |
+
+**What they found — common ground (all 3 identified):**
+- Synchronous broker REST calls blocking OrderManager GenServer (mailbox growth)
+- TradeStream event ordering assumptions (out-of-order fills/status)
+- Fill deduplication gap (no explicit fill-level idempotency)
+- `cancel_all/1` with `timeout: :infinity` blocking GenServer during FLATTEN
+- Recovery/restart races with TradeStream fill delivery (fills queued during
+  `handle_continue/2`)
+- Lot operation idempotency under crash recovery (partial execution)
+- Replace race: fills for new broker_order_id arriving before `replaced` event
+- Database write latency impact on GenServer throughput under burst fills
+- ETS table scope assumptions (single-node, access mode)
+
+**GPT-5 unique findings (not in either Claude model):**
+- Rate-limit retry blocking OrderManager inline (no async retry path specified)
+- Single TradeStream connection per user not enforced (duplicate detection gap)
+- Kill switch FLATTEN vs degraded state interaction (OM drops cancels while
+  degraded, but FLATTEN calls cancel_all through OM)
+- ClOrdID uniqueness scope/retention at broker across sessions and days
+- `after: datetime` filter semantics (clock skew, timezone, inclusive/exclusive)
+- Reconciliation responses may exceed single-response size (no pagination)
+- Event broadcasting blocking model (synchronous vs fire-and-forget)
+- Credential rotation during TradeStream connection lifetime
+- `market_closed` semantics varying across brokers (reject vs queue)
+- Dropped Alpaca statuses (stopped/suspended/calculated) may affect accounting
+
+**Claude Sonnet 4.6 unique findings (not in either other model):**
+- Single fill per fill event assumption (broker batching multiple fills into
+  one WebSocket message)
+- Lot operations (`Lots.open/2`, `Lots.close/4`) assumed to never fail —
+  no `{:error, _}` handling shown, crash propagation risk
+- `Task.async_stream` inside GenServer creating linked tasks whose crash
+  signals propagate to OrderManager during critical cancel_all
+- Broker cancel semantics during in-flight replace at the broker level
+  (cancel targets old broker_order_id which broker already replaced away)
+- Database operations in fill processing assumed transactional (no explicit
+  Ecto.Multi/transaction mention)
+- Broker position reflects only Gargoyle's activity (external trades cause
+  false-positive reconciliation halts)
+
+**Claude Opus 4.6 unique findings (not in either other model):**
+- `{:ok, broker_order_id}` from REST place conflated with durable OMS
+  acceptance vs mere HTTP acknowledgment (no timeout on `submitted` state)
+- Concurrent `apply_corrections/2` from periodic reconciler running in
+  separate process conflicts with OrderManager's single-writer invariant
+  (corrections write to same tables outside GenServer serialization)
+- Reconciliation gate initialized state after `:rest_for_one` restart —
+  ETS table EXISTS but freshly initialized vs table MISSING are different
+  conditions with different safety properties
+- Escalation state reset after crash creating double-exposure window
+  (systematic issue persists but escalation timer resets to zero)
+- `replace/3` error semantics: non-atomic replace (cancel + re-submit)
+  where cancel succeeds but re-submit fails leaves original order cancelled
+  at broker while OrderManager reverts to "working" locally
+
+**Quality assessment:**
+- **GPT-5** maintained its pattern from previous findings: broadest coverage
+  (20 assumptions), most technically specific about implementation details.
+  Found cross-cutting operational concerns (clock skew, credential rotation,
+  pagination) that the Claude models didn't surface. However, several of its
+  findings were medium-severity operational concerns rather than architectural
+  assumptions.
+- **Claude Sonnet 4.6** was the surprise performer. Found 17 assumptions —
+  close to GPT-5's count (85%) — and several of its unique findings were
+  genuinely insightful. The `cancel_all` race with broker-side replace state
+  (its finding #16) and the lot operation failure propagation (its finding #6) show
+  deep reasoning about component interaction despite Sonnet not being
+  positioned as a "reasoning" model. More importantly, Sonnet's findings were
+  consistently well-structured with clear "how it could break" scenarios.
+- **Claude Opus 4.6** found the fewest assumptions (12) but — consistent with
+  Finding #11 — its unique findings were qualitatively different. The
+  concurrent `apply_corrections` write conflict, the gate initialization state
+  distinction, and the non-atomic replace error semantics all reveal design
+  tensions that neither GPT-5 nor Sonnet identified. Opus continues to reason
+  about the *boundaries between components* rather than within-component
+  mechanics.
+
+**Key insight — Sonnet 4.6 is NOT just a faster GPT-4.1:**
+In previous findings (#9, #10, #11), non-reasoning models (GPT-4.1, GPT-4.1
+Mini) performed significantly below reasoning models on assumption-finding.
+GPT-4.1 found ~14 assumptions where GPT-5 found 24-26 (~54-58% of GPT-5's count).
+Here, Sonnet 4.6 finds 17 where GPT-5 finds 20 (85%), a much smaller gap.
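+
+A worked check of the two ratios, using only the counts quoted in this finding.
+The GPT-4.1 count is approximate (~14) and the earlier GPT-5 count is a range
+(24-26), so the earlier ratio is a range as well:
+
+$$
+\frac{17}{20} = 0.85, \qquad \frac{14}{26} \approx 0.54, \qquad \frac{14}{24} \approx 0.58
+$$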
+
+Sonnet's findings also included several that showed genuine reasoning about
+component interactions (not just within-frame risks). This suggests Sonnet 4.6
+is qualitatively different from GPT-4.1 for analytical work — it occupies a
+middle ground between GPT-4.1's "competent but surface-level" and GPT-5's
+"exhaustive and deep." The severity distribution was also similar to GPT-5
+(multiple critical/high findings), whereas GPT-4.1 in previous experiments
+tended toward medium-severity generic concerns.
+
+**Updated model hierarchy for assumption-finding:**
+1. GPT-5 — broadest coverage, most operational-level findings (20)
+2. Sonnet 4.6 — strong analytical depth, good component interaction reasoning (17)
+3. Opus 4.6 — fewest but most architecturally insightful, finds design tensions (12)
+4. GPT-4.1 — competent within-frame, generic (~14 from previous experiments)
+5. GPT-4.1 Mini — formulaic, surface-level (~10-12)
+
+**Practical implication:** For architecture review, Sonnet 4.6 is now a strong
+candidate for volume analytical work. It's fast enough to run alongside GPT-5
+and catches different things (lot operation failures, broker-side replace races).
+The ideal three-model review stack for architecture docs appears to be:
+- GPT-5 for breadth + operational concerns
+- Sonnet 4.6 for component interaction analysis
+- Opus 4.6 for design-tension identification
+
+Each consistently finds things the others miss. The cost-efficiency argument
+for Sonnet is strong: 85% of GPT-5's count at roughly 1.5x the findings per
+output token (17 assumptions in 4,637 tokens vs 20 in 8,485).
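+
+The per-token comparison can be made explicit as findings per 1,000 output
+tokens. This is a rough sketch: it treats the "Output tokens" column as the
+denominator for both models, and the table does not say whether GPT-5's figure
+includes its reasoning tokens, so read the ratio as indicative only:
+
+$$
+\frac{17}{4{,}637} \times 1000 \approx 3.67, \qquad
+\frac{20}{8{,}485} \times 1000 \approx 2.36, \qquad
+\frac{3.67}{2.36} \approx 1.55
+$$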
@@ -0,0 +1,46 @@
+# Finding 7: Token budget matters more than model size for gap analysis (confirmed)
+
+**Date:** 2026-05-03
+**Task:** Identify unaddressed failure scenarios in gargoyle's `failure-modes.md` (383 lines, ~25KB)
+**How we used them:** Same document, same analytical question ("What failure scenarios
+are NOT covered?"), three models. GPT-5 with 16K max_completion_tokens, Sonnet 4
+with 4K max_tokens, GPT-4.1 Mini with 4K max_completion_tokens. No project context
+beyond the document itself. Pure gap-analysis task.
+
+**Results:**
+- GPT-5 (16K budget): 28 gaps, most exhaustive. Found domain-specific edge cases
+  others missed entirely: ClOrdID collision across restarts, fractional share rounding,
+  broker maintenance windows (410/426), hot code upgrades, regulatory halts vs staleness
+  distinction, cancel-ack-then-late-fill race, DNS/TLS as distinct from network outage.
+- Sonnet 4 (4K budget): 12 gaps, concise. Unique framing: distinguished latency
+  degradation from outage (subtle but actionable). ETS corruption vs loss.
+- GPT-4.1 Mini (4K budget): 13 gaps with summary table. Unique angles: unknown broker
+  status enum values, configuration schema mismatches on cold-start, malformed signals
+  from logic bugs (not just crashes).
+
+**Overlap (all three):** Rate limiting, clock skew, resource exhaustion, DB failures,
+message backpressure, partial connectivity.
+
+**Key insight:** GPT-5's 4K attempt produced ZERO output (finish_reason: length) —
+all tokens consumed by internal reasoning. At 16K it produced the richest analysis.
+This confirms finding #3 (GPT-5 needs generous token budgets) AND adds a new
+observation: for open-ended analytical questions, GPT-5's reasoning overhead is
+proportionally larger. The 4K models (Sonnet, Mini) both produced useful output at
+4K because they don't burn tokens on chain-of-thought.
+
+**Model personality confirmed:**
+- GPT-5: exhaustive, domain-aware, finds edge cases a senior SRE would know
+- Sonnet: precise, architectural, finds design-level distinctions
+- GPT-4.1 Mini: structured, systematic, finds enumeration gaps
+
+**Practical implication:** For failure mode / gap analysis on design docs:
+- GPT-5 with ≥16K tokens for maximum coverage (most unique findings)
+- Sonnet for architectural framing ("this is really two different problems")
+- Mini for completeness checking ("what about this enum value?")
+- Running all three costs ~$0.50 and catches gaps none alone would find
+- GPT-5 at 4K is USELESS for this task — always give it room to think
+
+**Note on GPT-5 reasoning overhead:** First attempt at 4K max_completion_tokens
+returned empty content with finish_reason: length. The model spent all 4K tokens
+on internal reasoning and produced nothing. This is worse than a short answer —
+it's zero value for non-zero cost. Always budget ≥16K for GPT-5 analytical tasks.
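+
+For a rough sense of scale, compare the two budgets with the GPT-5 token counts
+reported in the tables of Findings 12-15. Those are different tasks, so this is
+an illustration rather than a measurement of this experiment:
+
+$$
+\min(6{,}016,\ 8{,}192,\ 8{,}128,\ 9{,}088) = 6{,}016 > 4\text{K}, \qquad
+\max(8{,}485,\ 10{,}587,\ 10{,}604,\ 10{,}235) = 10{,}604 < 16\text{K}
+$$
+
+The first term is GPT-5's smallest reported reasoning-token count, the second
+its largest reported output-token count.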
@@ -0,0 +1,126 @@
+# Finding 13: Race condition identification: Opus excels at subtle temporal interactions; GPT-5 is exhaustive; Sonnet struggles with concurrency reasoning
+
+**Date:** 2026-05-03
+**Task:** Identify race conditions, timing-dependent bugs, and ordering hazards in
+gargoyle's `concurrent-failure-detection.md` (241 lines) — a document specifically
+about concurrent detection logic with timers, ETS state, and multi-process events.
+**How we used them:** Same document (full text) + same focused analytical question
+to all 3 models via HAI proxy. Prompt specifically asked for event ordering problems,
+timer interaction bugs, state visibility gaps, crash/restart timing, and multi-instance
+coordination. Required each finding to reference specific mechanisms in the document
+with specific interleaving descriptions. No tools, no project context beyond the
+document itself.
+
+| Model | Time | Output tokens | Reasoning tokens | Race conditions found |
+|---|---|---|---|---|
+| GPT-5 | 116s | 10,587 | 8,192 | 12 |
+| Claude Opus 4.6 | ~105s | 4,610 | (internal) | 10 |
+| Claude Sonnet 4.6 | ~39s | 1,404 | (internal) | 7 |
+
+**What they found — common ground (all 3 identified):**
+- Stale timer messages in mailbox after cancellation (classic Erlang timer race)
+- HealthMonitor crash losing compound detection state (init from :unknown, no replay)
+- ETS vs GenServer state divergence visible to dashboard
+- Kill switch mode conflict (FLATTEN from single-process vs HALT from compound path)
+
+**GPT-5 unique findings (not in either Claude model):**
+- Cross-sender message ordering: recovery events from pipeline processes vs timer
+  expiry from runtime (Erlang preserves per-sender order, NOT cross-sender) — the
+  "rapid recovery" safety argument in the doc relies on state being updated before
+  timer fires, which isn't guaranteed
+- Debounce starvation: flapping component repeatedly restarting the timer, causing
+  compound evaluation to be indefinitely postponed while ≥2 genuinely degraded
+- State regression: {:degraded} arriving after {:escalated, :kill_switch} with no
+  guard in the event table — state machine allows regressing from :halted to :degraded
+- Cold-start window: application boots with existing degraded processes that won't
+  re-emit events, compound detection never fires
+- Catch-all handle_info could accidentally swallow timer messages if pattern matching
+  is ordered wrong (implementation pitfall of the described approach)
+- Debounce window growing beyond calibrated bounds from repeated timer restarts
+
+**Claude Opus unique findings (not in either other model):**
+- Timer restart pushing evaluation PAST single-process escalation timeout — the
+  debounce mechanism can DEFEAT compound detection when second degradation arrives
+  near end of first window (resets to full window, first process escalates via
+  single-process path before new window fires). This means system gets FLATTEN
+  instead of HALT — exactly what compound detection was supposed to prevent.
+- Strategy worker single-atom masking via event ordering: Worker A degrades, Worker
+  B degrades (same atom), Worker A recovers → atom set to :normal while B is still
+  degraded. Event ordering across different workers mapped to same atom creates
+  state loss.
+- Registry stale PID after HealthMonitor crash: if subscription is PID-based (not
+  PubSub topic), new HM instance is deaf — events go to dead PID, silently dropped.
+  Compound detection completely disabled for that user until subscription refresh.
+- :rest_for_one cascade + coincidental independent issue: debounce designed to
+  filter cascade restarts, but cascade can EXPOSE a pre-existing issue (SignalRisk
+  restarts and finds its upstream stale). Is this "compound" or "cascade + bad luck"?
+  Semantic ambiguity the design doesn't address.
+- Compound cleared event without recovery debounce: :compound_degradation_cleared
+  emitted immediately when last process recovers (no settling period), causing
+  operator oscillation if recovery is transient.
+
+**Claude Sonnet unique findings:**
+- ETS table creation race at startup (HealthMonitor writes before table exists)
+- Registry lookup failure during pipeline startup (events before HM registered)
+- However, Sonnet also made analytical errors: it described "multiple HealthMonitor
+  instances for the same user" scenarios despite the document clearly stating one
+  instance per user via DynamicSupervisor. Several of its findings assumed
+  multi-instance coordination that doesn't match the architecture.
+
+**Quality assessment:**
+- **GPT-5** was the most exhaustive and technically precise. Its cross-sender
+  ordering finding (its #2) is genuinely insightful: it identifies that the document's
+  "rapid recovery" safety argument implicitly assumes events arrive in wall-clock
+  order, which Erlang does NOT guarantee across different senders. The debounce
+  starvation finding (its #3) identifies a real operational hazard with practical
+  consequences. All 12 findings reference specific mechanisms and describe specific
+  interleavings clearly.
+- **Claude Opus** found fewer race conditions but several were qualitatively
+  superior. The timer-restart-defeats-compound-detection finding is the most
+  architecturally significant race in the entire analysis — it shows that the
+  debounce mechanism can work AGAINST the design's stated goals in specific
+  (realistic) timing scenarios. The strategy-worker event ordering masking is
+  also a genuine design flaw unique to the single-atom decision. Opus continues
+  its pattern of reasoning about design TENSIONS rather than just failure modes.
+- **Claude Sonnet** was notably weaker here than in previous experiments. Only
+  1,404 output tokens vs 4,610 (Opus) and 10,587 (GPT-5). Several findings
+  contained analytical errors (assuming multi-instance coordination that doesn't
+  exist). It found only 7 races, and 2-3 of those were based on misreadings of
+  the architecture. This is a significant regression from Finding #12 where
+  Sonnet found 17 assumptions (85% of GPT-5's count).
+
+**Key insight — concurrency reasoning is a different skill than assumption-finding:**
+In previous experiments (#10, #11, #12), Sonnet 4.6 performed well on
+assumption-finding (a task that requires reasoning about what's NOT stated).
+Here, on race condition identification (a task requiring reasoning about temporal
+interleavings and message ordering semantics), Sonnet drops significantly. This
+suggests the task type matters more than we previously thought:
+
+- **Assumption-finding:** Requires breadth of consideration ("what must be true
+  for this to work?"). Sonnet handles this well — it's essentially pattern
+  matching across possible failure dimensions.
+- **Race condition identification:** Requires SEQUENTIAL reasoning about specific
+  interleavings ("if A happens, then B happens, then C happens, what state is
+  visible?"). This benefits dramatically from extended reasoning tokens (GPT-5's
+  8,192 reasoning tokens) or from Opus's internal reasoning depth.
+
+The lesson: don't extrapolate model performance across task types. A model that
+reaches 85% of GPT-5's count on assumption-finding reached only 58% here (7 of 12),
+and less once its misread findings are excluded. The cognitive demands are different.
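+
+The counts in this finding give rough bounds on that drop. Sonnet's raw count
+is 7 against GPT-5's 12; discounting the 2-3 findings judged to rest on
+architectural misreads leaves 4-5:
+
+$$
+\frac{7}{12} \approx 0.58, \qquad \frac{5}{12} \approx 0.42, \qquad \frac{4}{12} \approx 0.33
+$$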
+
+**Opus's distinguishing strength — finding design contradictions:**
+Opus's best finding (timer restart defeating compound detection) isn't just a
+race condition — it's identifying that the debounce mechanism can work against
+the design's own stated goals. This is consistent with Opus's pattern in
+previous findings: it finds tensions where one part of the design undermines
+another part. For race condition analysis specifically, this manifests as
+"here's where your safety mechanism becomes your vulnerability."
+
+**Practical implication for architecture review:**
+- For race condition analysis: use GPT-5 (exhaustive) + Opus (design-tension)
+- Sonnet is NOT suitable for concurrency reasoning tasks — use it for
+  assumption-finding and structural review instead
+- The three-model stack needs task-appropriate assignment:
+  - Structural/assumption review: all three models contribute
+  - Concurrency/race analysis: GPT-5 + Opus only
+  - Bias detection: any model (per Finding #8)
@@ -0,0 +1,131 @@
+# Finding 14: Cross-component interaction analysis: GPT-5 Mini enters as viable mid-tier; task framing strongly shapes output quality
+
+**Date:** 2026-05-03
+**Task:** Identify cross-component interaction failures in gargoyle's
+`continuous-risk-monitoring.md` (459 lines) — a document specifying
+PortfolioMonitor's interactions with OrderManager, PortfolioRisk, MarketData,
+KillSwitch, ETS tables, and the pipeline supervision tree.
+**How we used them:** Same document (full text) + same focused analytical
+question to all 3 models via HAI proxy. Prompt was highly structured: specified
+5 categories of cross-component failures to look for (semantic mismatches,
+ordering violations, feedback loops, partial visibility, supervision boundary
+effects) and required specific output format (components, sequence, gap, impact).
+No tools, no project context beyond the document itself.
+
+| Model | Time | Output tokens | Reasoning tokens | Findings |
+|---|---|---|---|---|
+| GPT-5 Mini | 68s | 5,445 | 2,240 | 6 (+1 cut off) |
+| GPT-5 | 116s | 10,604 | 8,128 | 10 |
+| Claude Sonnet 4.6 | 38s | 1,868 | (internal) | 8 |
+
+**What they found — common ground (all 3 identified):**
+- Fill-to-position query race (fill event triggers evaluation but position
+  store hasn't yet reflected the fill)
+- Restrict flag ETS table destruction on PM crash → permissive window
+- Kill switch check vs liquidation submission race
+- Ticker subscription timing gap (new position opened but ticks not yet
+  subscribed → breach goes undetected)
+
+**GPT-5 unique findings (not in either other model):**
+- Stale prices are NOT fail-safe for drawdown (higher stale price → inflated
+  portfolio value → understated drawdown). The document claims "fail-safe"
+  but this only holds for exposure metrics, not drawdown. This is the most
+  architecturally significant finding across all three models.
+- Price definition mismatch between PM (last_trade from ETS) and OrderManager/
+  broker (bid/ask/mid) causing mis-sized liquidation and oscillation
+- Cross-component oscillation: PM hysteresis internal vs PRisk's immediate
+  binary restrict gate clearing (no cross-component cooldown)
+- Liquidation stuck after OM restart (terminal events lost; liquidation_in_
+  flight stays true indefinitely with no timeout/rehydration)
+- "Minimal risk checks" not enforced — PM goes through same OM gates as
+  strategy orders but MarketHours/StalePrice controls may reject after-hours
+  or stale-price liquidation attempts
+- FLATTEN mode semantics gap — PM refrains from liquidating when kill switch
+  engaged, but FLATTEN cancels open orders without actually CLOSING positions.
+  No component left to close positions.
+
+**Claude Sonnet 4.6 unique findings (not in either other model):**
+- Liquidation feedback loop with PortfolioRisk — buy-to-cover for short
+  positions could INCREASE net long exposure at portfolio level, paradoxically
+  worsening concentration while fixing position-level metrics
+- High water mark reset on pipeline restart masks true intraday drawdown
+  (restart → HWM resets to lower current value → drawdown calculated from
+  false baseline → larger losses permitted than intended)
+- Multi-metric breach with single boolean flag — concentration liquidation
+  for AAPL sets liquidation_in_flight, blocking simultaneous daily P&L
+  liquidation for different positions
+- Market close/open vs after-hours fills — claims to evaluate after-hours
+  fills but uses stale market-close prices
+
+**GPT-5 Mini unique findings (not in either other model):**
+- OrderManager order splitting/remapping causing liquidation_in_flight
+  correlation failure (parent/child order ID mapping breaks terminal-event
+  detection). Well-reasoned but highly implementation-specific.
+- Restrict/clear oscillation loop with strategy behavior (strategies react
+  to rejects → back off → restrict clears → strategies re-enter aggressively
+  → re-breach). Good systems-thinking about emergent feedback.
+
+**Quality assessment:**
+- **GPT-5** produced the most findings (10) and the highest-quality
+  architectural insight: the stale-price/drawdown contradiction is a genuine
+  design flaw that contradicts the document's own safety claim. Multiple
+  findings showed cross-boundary reasoning about semantic mismatches (price
+  definition, FLATTEN semantics, gate bypass). Every finding named specific
+  components and described precise event sequences.
+- **Claude Sonnet 4.6** was fast (38s, only 1,868 tokens) and produced 8
+  solid findings. The HWM reset finding and the multi-metric/single-flag
+  finding show genuine architectural reasoning. The liquidation feedback
+  loop (buy-to-cover worsening portfolio concentration) is subtle and
+  shows cross-position reasoning. However, some findings overlapped
+  significantly with the common-ground set and added less unique depth.
+  Sonnet performed MUCH better here than on race condition identification
+  (Finding #13): 8 findings to GPT-5's 10 (80%) vs 7 to 12 (58%) previously.
+- **GPT-5 Mini** produced 6 findings in 68s with 2,240 reasoning tokens.
+  Quality was genuinely good — the order-splitting/correlation finding
+  and the oscillation feedback loop both show real reasoning depth. It's
+  clearly NOT GPT-4.1 Mini — it reasons about component interactions,
+  not just within-frame risks. However, it found fewer issues and its seventh
+  finding was cut off mid-response (token limit or response truncation).
+
+**Key insight — task framing as the dominant variable:**
+This experiment used a much more structured prompt than previous ones:
+specified 5 categories, required specific output format, explicitly excluded
+single-component failures. The result: ALL models produced higher-quality,
+more focused output than in earlier experiments with broader prompts. Even
+Sonnet — which struggled on race conditions (Finding #13) — performed well
+here. The structured categories likely helped models organize their reasoning
+without losing track of what they were looking for.
+
+The prompt explicitly asked for "cross-component interaction failures" rather
+than general analysis. This is the narrow-lens effect from Finding #2, but
+applied to a complex multi-component document. The lens is narrow (only
+inter-component gaps) but the scope is broad (459 lines, many interactions).
+This combination — narrow analytical lens + broad document scope — appears
+to be the sweet spot for getting quality from all model tiers.
+
+**GPT-5 Mini positioning:**
+First time testing GPT-5 Mini. Results: 6 findings in 68s vs GPT-5's 10 in
+116s. That's 60% of the findings in 59% of the time, with 28% of the
+reasoning tokens (2,240 vs 8,128). Quality-per-finding was solid — the order
+correlation finding especially showed genuine systems reasoning. GPT-5 Mini
+appears to be a legitimate mid-tier: more capable than GPT-4.1 (which can't
+do this kind of cross-boundary reasoning) but less exhaustive than GPT-5.
+Viable for: first-pass screening, bulk document review where you'd run many
+docs and can't afford full GPT-5 on each.
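+
+The three ratios above, computed from the table (the cut-off seventh finding
+is not counted):
+
+$$
+\frac{6}{10} = 0.60, \qquad \frac{68}{116} \approx 0.59, \qquad \frac{2{,}240}{8{,}128} \approx 0.28
+$$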
+
+**Sonnet recovery from Finding #13:**
+Sonnet went from 7 findings (with errors) on race conditions to 8 solid
+findings here. The difference: this prompt was more structured, the document
+was larger with more explicit interaction descriptions, and the task didn't
+require pure temporal/sequential reasoning. "Cross-component interaction
+failures" is closer to assumption-finding (Sonnet's strength) than race
+condition identification (Sonnet's weakness). Task taxonomy continues to
+matter more than raw model capability.
+
+**Updated model assignment for cross-component analysis:**
+1. GPT-5 — broadest coverage, finds semantic contradictions in the doc's
+   own claims (10 findings)
+2. Sonnet 4.6 — fast and efficient, good at portfolio-level reasoning and
+   feedback loops (8 findings in 38s)
+3. GPT-5 Mini — viable first-pass with genuine reasoning depth (6 findings)
+4. (Opus untested for this task type — likely strong on design tensions)
@@ -0,0 +1,133 @@
+# Finding 15: Design Coherence Analysis
+
+**Date:** 2026-05-03
+**Task:** Identify internal design incoherences in gargoyle's `failure-modes.md` (383 lines)
+— places where the document's stated principles/invariants are contradicted by its own
+specified mechanisms.
+**How we used them:** Same document (full text) + same focused analytical question to all
+3 models via HAI proxy. Highly structured prompt specifying 5 categories of incoherence
+to look for (safety properties not enforced, state machine violations, recovery contradictions,
+supervision conflicts, cross-mechanism contradictions). Required each finding to reference
+specific sections. No tools, no project context beyond the document itself.
+
+| Model | Time | Output tokens | Reasoning tokens | Incoherences found |
+|---|---|---|---|---|
+| Claude Sonnet 4.6 | ~39s | 1,168 | (internal) | 5 |
+| Claude Opus 4.6 | ~105s | 3,378 | (internal) | 7 (8 attempted, 1 self-withdrawn) |
+| GPT-5 | ~120s | 10,235 | 9,088 | 4 |
+
+**What they found — common ground (all 3 identified):**
+- State machine universality claim vs Strategy.Worker crash behavior (process
+  crashes bypass the degraded state entirely — no transition path in the model)
+- Market data staleness advisory-only vs the "don't trade when ambiguous" principle
+  (or vs concurrent failure auto-halt)
+- `pending_cancel`/`pending_replace` absent from recovery query set (GPT-5 and
+  Sonnet found this directly; Opus addressed the broader state machine gap)
+
+**GPT-5 unique findings (not in either Claude model):**
+- Kill switch halted = "process terminated" vs kill switch requiring RUNNING
+  processes for cancel_all, FLATTEN, and LIQUIDATE modes. The state definition
+  claims processes are terminated, but the mechanisms require them alive to
+  execute orders. **This is the most architecturally significant finding** — it
+  reveals a fundamental definitional error in the state machine.
+- Per-symbol degradation contradicts the process-level degradation semantics.
+  A worker "enters degraded" but continues operating for non-stale symbols —
+  violating the stated definition that degraded = "cannot perform primary
+  function." The metrics/eventing model has no per-symbol dimension.
+
+**Claude Opus unique findings (not in either other model):**
+- `:rest_for_one` cascade creates a FIFTH implicit state
+  (terminated-and-restarting) not in the four-state model: processes that were
+  `normal` are forcibly killed (not by kill switch) and restart. (Separately,
+  Opus withdrew one finding that first looked like an incoherence but proved consistent.)
+- PortfolioMonitor continues evaluating with stale data ("fail-safe") while
+  Strategy.Workers are stopped for the SAME condition — contradicts both the
+  universal state machine (PM doesn't transition to degraded) and the doc's
+  reasoning about why stale data is dangerous.
+- Path-dependent recovery criteria: Strategy.Worker recovery requires 21+ bars
+  after crash but only "price continuity check" after staleness. The state
+  machine's single "catch-up complete" exit condition can't express this.
+- `halted → [*]` transition in state diagram is logically impossible if "halted"
+  means the process is already terminated — dead processes can't fire transitions.
+- Compound failure detection requires a meta-observer across processes but the
+  per-process state machine model has no way to express cross-process conditions.
+
+**Claude Sonnet unique findings (not in either other model):**
+- Market data global staleness: the failure table says "Manual (disengage)" for
+  recovery — implying automatic engagement happened — but the text says it's
+  advisory only. Table contradicts prose.
+- ReconciliationGate: doc claims gate survives OM crash (separate supervision
+  tree), but then says "missing ETS table = not ready" when OM crashes. If the
+  gate survives, why would its table be missing?
+- Signal survival claims are contradictory between sections: worker crash says
+  downstream signals survive, but OM crash says all upstream signals lost.
+  (NOTE: this is actually describing different scenarios — worker crash doesn't
+  cascade to SignalRisk, OM crash does via :rest_for_one. Sonnet may have
+  misread the architecture here — the two statements are consistent when you
+  understand the supervision tree.)
+
+**Quality assessment:**
+- **GPT-5** found only 4 incoherences but TWO of them are genuinely critical
+  architectural findings. The "halted = terminated" vs "kill switch requires
+  running processes" contradiction is a real design error — you can't both
+  terminate processes AND require them to execute cancel/liquidation orders.
+  The per-symbol degradation finding is also a real modeling gap. GPT-5 was
+  MORE SELECTIVE here than in previous experiments — it didn't pad with
+  medium-severity findings. Each of its 4 was high/critical.
+- **Claude Opus** produced the most findings (7 valid) with characteristic
+  depth. Its self-correction (withdrawing its own finding #6 after deeper analysis)
+  shows intellectual honesty rare in model outputs. The PortfolioMonitor
+  stale-data contradiction is genuinely insightful — same input condition,
+  opposite response, no justification within the state machine model. The
+  compound failure meta-observer finding identifies a modeling category error.
+  Opus also found modeling imprecisions (path-dependent recovery, halted → [*]
+  impossibility) that the other models didn't notice.
+- **Claude Sonnet** found 5 issues quickly (39s, 1,168 tokens) but quality was
+  mixed. Its finding #4 (ReconciliationGate) raises a genuine question about
+  the ETS table ownership claim. Its finding #1 (table vs prose contradiction on
+  market data staleness) is a real documentation inconsistency. However, its
+  finding #5 appears to misread the supervision architecture: the two
+  statements about signal survival ARE consistent when you understand that
+  different crashes cascade differently. Sonnet produced one false positive.
+
+**Key insight — "design coherence" is a NEW analytical category with distinct model strengths:**
+This is different from assumption-finding (Finding #10-12), race conditions
+(Finding #13), and cross-component interactions (Finding #14). Coherence
+checking requires the model to hold MULTIPLE parts of the document in tension
+with each other and reason about whether they're compatible. Results:
+
+- **GPT-5** was MORE SELECTIVE than in any previous experiment. Only 4 findings
+  vs 10-28 in other tasks. But precision was near-perfect: all 4 are genuine
+  contradictions. This suggests GPT-5's reasoning tokens are being used for
+  VERIFICATION (checking whether apparent contradictions hold up) rather than
+  EXPLORATION (finding more things). The 9K reasoning tokens produced 4 findings
+  vs the usual 10+ — GPT-5 is self-editing aggressively.
+- **Opus** hit its sweet spot. Coherence checking IS design-tension identification
+  — Opus's consistent strength. Finding incoherences requires exactly the kind
+  of "how does this design disagree with itself" reasoning that Opus excels at.
+  It also showed unique self-correction behavior (withdrawing a finding after
+  deeper analysis).
+- **Sonnet** was fast but produced a false positive. Coherence checking requires
+  holding multiple document sections in memory simultaneously and reasoning about
+  their compatibility — this is harder than assumption-finding (where you
+  reason about one mechanism at a time) but easier than race conditions (which
+  require sequential temporal reasoning). Sonnet occupies a middle ground.
+
+**Model ranking for design coherence checking:**
+1. Claude Opus 4.6 — most findings, highest depth, self-correcting (7 valid)
+2. GPT-5 — fewest findings but near-perfect precision, finds the critical ones (4)
+3. Claude Sonnet 4.6 — fast screening, but prone to false positives on
+   architectural misreads (4/5 valid)
+
+**This inverts the usual GPT-5 > Opus ordering.** In previous experiments, GPT-5
+consistently found MORE issues. Here, GPT-5 was more selective than Opus. The
+task type (self-consistency checking) favors Opus's "design tension" reasoning
+style over GPT-5's "exhaustive exploration" style. GPT-5 apparently uses its
+reasoning to VERIFY rather than GENERATE when the task is about contradictions
+rather than gaps.
+
+**Practical implication:** For architecture documents, run coherence checking as
+a separate pass using Opus as the primary model. GPT-5's higher precision means
+it's good for confirming which Opus findings are genuine vs overreads. The
+two-pass approach: Opus generates candidates → GPT-5 validates → result is the
+intersection plus GPT-5's independent finds.
@@ -0,0 +1,131 @@
+# Finding 16: Specification completeness: Sonnet 4.5 produces 2x the findings of Sonnet 4.6 on implementation-gap analysis; quality vs volume tradeoff
+
+**Date:** 2026-05-03
+**Task:** Identify specification gaps in gargoyle's `kill-switch.md` (185 lines) — places
+where an implementer would be forced to guess or decide on their own because the spec
+doesn't clearly specify behavior. New analytical lens not previously tested.
+**How we used them:** Same document (full text) + same focused analytical question to all
+3 models via HAI proxy. Highly structured prompt specifying 5 categories of underspecification
+(behavioral ambiguity, missing edge cases, ordering/sequencing gaps, interface contracts
+undefined, concurrency semantics omitted). Required specific output format per finding
+(gap, section, what implementer must decide, risk if wrong, severity). No tools, no
+project context beyond the document itself.
+
+| Model | Time | Output tokens | Reasoning tokens | Gaps found | Critical | High | Medium | Low |
+|---|---|---|---|---|---|---|---|---|
+| Claude Sonnet 4.6 | 73s | 3,403 | (internal) | 13 | 8 | 4 | 0 | 1 |
+| Claude Sonnet 4.5 | 102s | 5,191 | (internal) | 25 | 14 | 6 | 4 | 1 |
+| GPT-5 | 109s | 10,140 | 7,872 | 19 | 8 | 7 | 3 | 0 |
+
+**What they found — common ground (all 3 identified):**
+- Pipeline process identification ambiguity (which processes are "pipeline processes")
+- Per-user process scope mapping (how to terminate only one user's processes)
+- ETS table ownership and lifecycle (who owns it, what happens on crash)
+- Concurrent engage operations (what happens when two sources engage simultaneously)
+- Liquidation order tagging mechanism (what the tag is, how verified)
+- Process restart prevention (how "must not restart" is enforced)
+- Engage sequence atomicity (partial failure between DB write and termination)
+- Startup ordering and ETS readiness (pipeline starting before ETS populated)
+- Disengage sequence ordering (what happens and in what order)
+
+**Sonnet 4.5 unique findings (not in either other model):**
+- ETS table schema/structure (set vs ordered_set, key format, value schema)
+- Missing ETS detection mechanism (catch :badarg vs table existence check)
+- Database write atomicity with ETS (transaction boundaries, rollback semantics)
+- Per-user engage while global is already engaged (is it a no-op or error?)
+- Broker rejection semantics ("already filled" vs "invalid cancel" distinction)
+- Cold-start gate interaction (independence vs dependency of the two gates)
+- User deletion with active kill switch (orphaned rows, cascade semantics)
+- Global disengage effect on per-user states (independent or auto-clear?)
+- Audit log write failure during engage (critical-path vs best-effort)
+- Dashboard control ambiguity in LIQUIDATE mode (contradictory disable/enable)
+- Cancel timeout duration (operational parameter not specified)
+- Manual order source code path during LIQUIDATE (how orders bypass the dead pipeline)
+
+**GPT-5 unique findings (not in either other model):**
+- Combined global/per-user mode semantics (what happens when global=RESTRICT,
+  user=LIQUIDATE — can user's liquidation proceed?)
+- Scope of "all" in cancel_all and liquidation (system-wide vs per-user)
+- Gate behavior when ETS missing but liquidation needed (conflicting requirements:
+  fail-closed says block, but liquidation needs to pass)
+- Disengage during in-flight cancellations (what happens to racing tasks)
+- Gate placement relative to broker submission (exact point in the flow)
+- Engage latency expectations (no quantified SLA)
+- Mode change while already engaged (RESTRICT → LIQUIDATE without disengage)
+- Dashboard vs backend scope for manual liquidation (individual vs bulk only)
+
+**Sonnet 4.6 unique findings (not in either other model):**
+- ETS sequencing relative to process termination (ETS before or after kill?)
+- Concurrent disengage + re-engage race (specific interleaving scenario)
+- Close-only enforcement mechanism (UI-only vs backend validation)
+- Order-in-flight past ETS gate during termination (already-checked orders)
+
+**Quality assessment:**
+- **Claude Sonnet 4.5** was the most EXHAUSTIVE (25 gaps) but with notable
+  quality variance. Several findings were highly specific and implementation-
+  relevant (ETS schema, missing-table detection, broker rejection semantics).
+  Others were relatively obvious or lower-impact (user deletion, audit log
+  failure, cancel timeout duration). The 14 Critical ratings feel somewhat
+  generous — some would be more accurately rated as High in practice. Output
+  was well-structured with clear per-finding format.
+- **GPT-5** found 19 gaps with consistent high quality. Its unique findings
+  show cross-cutting reasoning: the combined mode semantics finding (global
+  vs per-user mode interaction) identifies a genuine specification gap that
+  neither Sonnet version noticed. The "ETS missing but liquidation needed"
+  finding is architecturally significant — it identifies a CONTRADICTION in
+  the spec's own rules (fail-closed blocks everything, but liquidation must
+  pass). Every finding was actionable. More selective severity ratings
+  (8 Critical vs Sonnet 4.5's 14).
+- **Claude Sonnet 4.6** was the most SELECTIVE (13 gaps) but with the highest
+  precision. Every finding was genuinely a specification gap that an
+  implementer would face. The ETS sequencing finding (#4) is particularly
+  well-reasoned — it identifies a specific ordering dependency that creates
+  a race window. Sonnet 4.6 appears to self-filter aggressively, producing
+  only findings it's confident about. Higher signal-to-noise than 4.5.
+
+**Key insight — Sonnet 4.5 vs 4.6 on analytical tasks:**
+This is the first direct comparison between Claude model versions on the same
+analytical task. Key differences:
+
+- **Volume:** 4.5 produced almost 2x the findings (25 vs 13)
+- **Tokens:** 4.5 used ~1.5x the output tokens (5,191 vs 3,403)
+- **Time:** 4.5 took ~1.4x longer (102s vs 73s)
+- **Severity distribution:** 4.5 had more Critical findings (14 vs 8) but
+  with more generous severity ratings
+- **Quality per finding:** 4.6 had higher average quality; fewer "obvious"
+  or lower-impact findings
+
+The 4.6 model behaves as if tuned toward higher precision and selectivity.
+It finds fewer things but each finding is more reliably a genuine gap. The 4.5
+model is more exhaustive but includes findings that a reviewer might triage as
+"yes, technically, but not really a spec gap." A single experiment cannot show
+a training trend, but the pattern fits later versions being more concise.
+
+**For practical use:** If you want completeness (cast a wide net, accept some
+noise): use 4.5. If you want precision (every finding is actionable, no triage
+needed): use 4.6. For architecture review where missing a gap has cost, 4.5's
+exhaustiveness is probably worth the noise. For review where false positives
+cost attention (e.g., PR review comments), 4.6's selectivity is preferred.
+
+**GPT-5 vs Sonnet comparison on this task:**
+GPT-5 (19 findings) sits between the two Sonnets in volume but has the highest
+consistency — no obvious misses or inflated severities. Its unique strength
+here: finding CONTRADICTIONS within the spec's own rules (ETS-missing blocking
+conflicts with liquidation needing to pass). This is consistent with Finding #15
+where GPT-5 was unusually selective but precise on coherence checking.
+
+Specification completeness analysis appears to be a task where:
+1. Sonnet 4.5 is strongest for breadth (25 findings, catches operational gaps)
+2. GPT-5 is strongest for detecting spec self-contradictions (19 findings, high precision)
+3. Sonnet 4.6 is strongest for precision (13 findings, zero noise)
+
+**Updated model version comparison:**
+- Claude 4.6 → higher precision, more selective, concise
+- Claude 4.5 → more exhaustive, more verbose, occasional severity inflation
+- This is a genuine tradeoff, not a simple regression or improvement
+
+**Practical implication:** Consider running BOTH Sonnet versions. 4.5 catches
+things 4.6 filters out (ETS schema, broker rejection semantics, cold-start gate
+interaction). 4.6 catches things with more specificity (sequencing gaps, exact
+race windows). For a one-shot budget: 4.5 if you want coverage, 4.6 if you want
+actionability. GPT-5 if you want to find where the spec contradicts itself.
@@ -0,0 +1,158 @@
+# Finding 18: Temporal boundary analysis: GPT-5 is most exhaustive; Opus finds design-level contradictions; Sonnet 4.5 is structured but less deep
+
+**Date:** 2026-05-04
+**Task:** Identify temporal boundary vulnerabilities in gargoyle's `escalation-policy.md`
+(238 lines) — scenarios where the timing model (evaluation cycles, debounce counts,
+cooldown periods) creates windows of incorrect or dangerous behavior.
+**How we used them:** Same document (full text) + same focused analytical question to all
+3 models via HAI proxy. Highly structured prompt specifying 5 categories of temporal
+vulnerability (timing exploitation, counter reset abuse, asymmetric time exposure,
+cross-metric temporal interactions, state loss temporal effects). Required specific
+output format per finding (name, sequence with cycle numbers, mechanism, severity, fix).
+No tools, no project context beyond the document itself.
+
+| Model | Time | Output tokens | Reasoning tokens | Findings | Critical | High | Medium |
+|---|---|---|---|---|---|---|---|
+| GPT-5 | ~128s | 9,175 | 5,888 | 15 | 3 | 7 | 2 |
+| Claude Opus 4.6 | ~120s | 5,112 | (internal) | 10 | 3 | 5 | 2 |
+| Claude Sonnet 4.5 | ~100s | 4,056 | (internal) | 12 | 3 | 3 | 3 |
+
+**What they found — common ground (all 3 identified):**
+- Flash crash / inter-evaluation gap exploitation (metric spikes between discrete
+  evaluation cycles go undetected)
+- Single clear cycle resetting debounce counter (transient recovery defeats escalation
+  despite sustained risk — metric can breach 80%+ of cycles and never escalate)
+- Asymmetric escalation time vs loss compounding rate (11 cycles to reach liquidation
+  while losses compound every single cycle)
+- Monitor crash resets state to Clear, losing all escalation progress
+- Liquidation re-trigger requiring full debounce reset, delaying subsequent batches
+- Kill switch N value unspecified (timing indeterminacy)
+
+**GPT-5 unique findings (not in either other model):**
+- Boundary-alignment counter starvation: explicitly modeled the "adversarial flicker"
+  pattern (breaching 2 cycles, 1 clear, repeat: ~67% breach time, never escalates)
+  with a precise mathematical framing of why K-of-N is needed
+- Cycle-length drift under load: GC pauses or CPU contention stretching evaluation
+  intervals means "3 cycles" could be 12 minutes instead of 90 seconds when it
+  matters most (high-load market stress = slowest evaluations)
+- Adversarial boundary timing (market microstructure masking): illiquid instruments
+  where opposing prints predictably arrive near evaluation boundaries, exploiting
+  deterministic sampling points
+- Cross-metric escalation gap and hedging prohibition: Restrict prevents ALL new
+  positions including risk-REDUCING hedges needed for a different metric still
+  escalating on its own timeline — protection for metric A actively worsens metric B
+- Cooldown stall causing prolonged Restrict: repeated transient spikes near hysteresis
+  threshold reset cooldown indefinitely while metric is actually safe
+- State inconsistency between restriction flags and monitor after restart:
+  documented asymmetry where flag persists (manual clear) but state resets (auto
+  clear) — creates orphaned restriction or unprotected window depending on
+  reconciliation approach
+- Metric computation fail-closed interacting with debounce: system errors create
+  false escalations with long cooldown, potentially blocking hedging trades
+- Unspecified N for kill switch post-liquidation breaches: coupled with crash
+  reset, system can loop indefinitely without reaching kill switch
+- In-liquidate flicker stall: one cycle below threshold after partial fill resets
+  re-trigger counter, stalling further liquidation
+
+**Claude Opus unique findings (not in either other model):**
+- De-escalation cooldown exploitation (predictable window): after cooldown completes
+  and restriction lifts, strategy has a GUARANTEED 5+ cycle window of unrestricted
+  trading before Restrict can re-engage — an automated strategy could systematically
+  exploit this predictable safe window to re-enter dangerous positions
+- Orphaned restriction flag asymmetry framed as a DESIGN CONTRADICTION: the failure
+  modes table specifies opposing recovery paths for state (automatic → Clear) vs
+  flags (manual clear), creating an irreconcilable dual state. Opus uniquely
+  identified that operator intervention to clear the flag could inadvertently
+  create a WORSE protection gap than leaving it orphaned
+- Root-cause synthesis: Opus's summary explicitly observed that the
+  three Critical findings share a common cause (debounce optimizes against false
+  positives at the expense of false negatives during sustained events) and proposed
+  a single architectural fix (severity-aware fast path) that addresses all three
+
+**Claude Sonnet 4.5 unique findings (not in either other model):**
+- De-escalation timing not accounting for proximity to breach threshold: system
+  removes protection while metric is still near-dangerous, and re-escalation
+  requires full debounce; illustrated with a "whipsaw" scenario and cycle numbers
+- Manual-only de-escalation from Liquidate creates UNBOUNDED recovery time:
+  if triggered at 2 AM Saturday, trading disabled until Monday despite metrics
+  recovering in minutes. Framed as contradiction with "autonomous" design goals
+- Evaluation cycle synchronization assumption: no handling of variable timing
+  (CPU contention, GC pauses) — implicit throughout but never addressed
+- Cold start escalation ambiguity: system starts with no prior state while
+  portfolio may already be in breach condition
+- De-escalation event ordering race: multiple metrics de-escalating simultaneously
+  may emit events in non-deterministic order, confusing external observers
+
+**Quality assessment:**
+- **GPT-5** was the most exhaustive (15 findings) and showed the strongest
+  mathematical/systems reasoning. Its unique findings included precise attack
+  models (adversarial flicker, boundary alignment, microstructure masking) that
+  describe exact exploitation patterns with percentages and cycle counts. The
+  cross-metric hedging prohibition finding is architecturally significant — it
+  identifies that protection for one metric can actively CREATE risk for another.
+  Every finding was actionable with specific fixes.
+- **Claude Opus 4.6** produced fewer findings (10) but with characteristic depth
+  and self-awareness. Its cooldown exploitation finding identified a PREDICTABLE
+  exploit window that an automated strategy could systematically abuse — framed
+  not as an accident but as an adversarial opportunity. The summary synthesis
+  (identifying common cause across Critical findings) shows meta-analytical
+  capability the other models didn't demonstrate. Opus also uniquely identified
+  that human intervention to fix one problem could create a WORSE problem —
+  second-order operational reasoning.
+- **Claude Sonnet 4.5** was well-structured (12 findings, clean severity tiers,
+  organized by Critical/High/Medium/Low) and faster than both other models.
+  Its findings were solid but less architecturally deep. The manual de-escalation
+  contradiction finding was genuinely insightful (unbounded recovery time vs
+  autonomous design goals). However, several findings restated concepts the
+  other models covered with less specificity about exploitation mechanics.
+
+**Key insight — temporal reasoning as a task type:**
+This is the first experiment specifically testing "temporal boundary analysis" —
+reasoning about time-domain properties of a state machine (evaluation frequency,
+counter semantics, cooldown mechanics, crash/restart timing).
+
+Results compared to Finding #13 (race condition identification on a concurrency doc):
+- GPT-5: 15 findings here vs 12 in Finding #13. Consistent high performance
+  on temporal reasoning tasks across both experiments.
+- Opus: 10 findings here vs 10 in Finding #13. Consistent so far: Opus produced
+  ~10 high-quality findings on both temporal task variants.
+- Sonnet 4.5: 12 findings here (first test). Compare to Sonnet 4.6's 7 findings
+  (with errors) in Finding #13. Sonnet 4.5 appears to handle temporal reasoning
+  better than 4.6, consistent with Finding #16 showing 4.5 is more exhaustive.
+
+**Sonnet 4.5 vs 4.6 on temporal reasoning (inferred comparison):**
+Sonnet 4.6 struggled significantly on race condition identification (Finding #13:
+7 findings with analytical errors, misreading architecture). Sonnet 4.5 here
+produced 12 solid findings with no apparent misreadings. This suggests 4.5's
+exhaustiveness advantage extends to temporal reasoning: the extra exploration it
+does (vs 4.6's aggressive self-filtering) catches more temporal interactions.
+The runs used different documents, so this supports Finding #16 only indirectly.
+
+**The structured-prompt effect continues:**
+All three models produced focused, high-quality output with this highly structured
+prompt (5 specific categories + required output format). This confirms Finding #14:
+narrow analytical lens + broad document scope is the sweet spot for all model tiers.
+The prompt structure appears to be a stronger predictor of output quality than model
+choice for the bottom 80% of findings (all models find the common-ground issues).
+Model choice matters for the TOP 20% — the unique insights that require deeper
+reasoning about system interactions.
+
+**Updated model assignment for temporal boundary analysis:**
+1. GPT-5 — most exhaustive, strongest at modeling adversarial exploitation patterns
+   and mathematical edge cases (15 findings)
+2. Claude Sonnet 4.5 — good volume with clean structure, viable for first-pass
+   temporal analysis (12 findings, no errors)
+3. Claude Opus 4.6 — fewest findings but highest insight density, uniquely
+   identifies predictable exploit windows and operational second-order effects
+   (10 findings)
+
+**Practical implication:** For temporal analysis on state machines and timing-dependent
+policies, the three-model stack produces genuine complementary value:
+- GPT-5 catches the adversarial attack patterns and mathematical edge cases
+- Opus catches the predictable exploit windows and operational contradictions
+- Sonnet 4.5 provides good breadth at lower cost with clean severity categorization
+
+The union of unique findings across all three models reveals significantly more
+temporal vulnerabilities than any single model alone. For a document governing
+autonomous financial actions (liquidation, kill switch), the cost of running all
+three (~$1-2) is trivially justified against the risk of missing a timing exploit.
@@ -0,0 +1,124 @@
+# Finding 19: Union coverage test: GPT-5 Mini + Sonnet 4.6 covers ~71% of GPT-5's findings; the missing 29% is where the real value lives
+
+**Date:** 2026-05-04
+**Task:** Identify hidden assumptions in gargoyle's `trading-pipeline.md` (1,110 lines,
+~62KB) — the most complex document tested so far, covering the full end-to-end path
+from tick ingestion through order execution.
+**How we used them:** Same document (full text, no truncation) + same focused analytical
+question to all 3 models via HAI proxy. Standard hidden-assumption prompt with 5
+categories (runtime behavior, external dependencies, timing/ordering, scale/load,
+uncovered failure modes). Required specific output format per finding. No tools, no
+project context beyond the document itself.
+
+| Model | Time | Output tokens | Reasoning tokens | Assumptions found |
+|---|---|---|---|---|
+| GPT-5 | 99s | 9,418 | 5,696 | 35 |
+| GPT-5 Mini | 93s | 5,309 | 1,792 | 21 |
+| Claude Sonnet 4.6 | 38s | 1,792 | (internal) | 17 |
+
+**Coverage analysis — can Mini + Sonnet together replace GPT-5?**
+
+Categorized each of GPT-5's 35 findings by whether the union of Mini + Sonnet
+also identified the same assumption:
+
+- **Covered by BOTH Mini and Sonnet:** ~12 findings (common ground — any model
+  finds these: idempotency, single-writer, clock sync, instrument resolution,
+  fill immutability, reconciliation gate, backpressure, fill correlation, event
+  ordering, audit scalability, PortfolioRisk bottleneck)
+- **Covered by Mini only (not Sonnet):** ~7 findings (transactional atomicity,
+  audit causal consistency, modification-in-flight enforcement, OM throughput,
+  decimal precision, PM/PR close-only race, partition duplicate submit)
+- **Covered by Sonnet only (not Mini):** ~6 findings (market data feed rates,
+  pipeline-vs-market speed, corporate actions atomicity, kill switch partition,
+  shared port isolation, market close vs auction fills)
+- **Union(Mini + Sonnet) total coverage:** ~25/35 = **~71%** of GPT-5's findings
+- **GPT-5 unique (missed by both):** ~10-18 findings, depending on match strictness
+
+**What GPT-5 uniquely found that the cheaper pair missed:**
+
+The missing 29% is NOT random — it's systematically different in character:
+
+1. **Operational edge cases:** Default TIF "day" broker semantics, OrderRate
+   counting retries, extended-hours MarketHours mismatch, fractional quantities,
+   local expiry timer precision per instrument
+2. **Design-level interaction gaps:** PortfolioRisk concurrent decision race
+   (snapshot stale between two parallel approvals), re-validation gap between
+   approval and submit, decision loss on crash after audit write
+3. **Domain-specific knowledge:** Manual broker-side actions conflicting with
+   state machine, options/complex instrument position_effect mapping, Decision→Order
+   1:1 invariant vs broker auto-splitting, wash sale retroactive P&L mutation
+4. **Architectural observations:** Reduction re-entry rule insufficiency,
+   PortfolioMonitor coalescing vs fast breach detection, multi-aggregator fanout
+   and audit partial writes, replay/backtest alignment with production controls
+
+These share a common trait: they require **domain expertise** (knowing how brokers
+actually behave, how regulatory rules interact, how production trading systems
+fail in practice) combined with **architectural reasoning** (how the design's own
+mechanisms interact under those real-world conditions). The cheaper models find
+assumptions about the document's internal consistency; GPT-5 additionally finds
+assumptions about the document's relationship to the external world it must
+operate in.
+
+**GPT-5 Mini vs Sonnet 4.6 — complementary, not redundant:**
+
+Mini and Sonnet covered different gaps:
+- Mini was stronger on **internal consistency** (transactional atomicity, causal
+  consistency, decimal precision, modification serialization)
+- Sonnet was stronger on **external interactions** (market data feeds, corporate
+  actions, kill switch distribution, shared resource isolation)
+
+This aligns with previous findings: Mini reasons about implementation mechanics;
+Sonnet reasons about system boundaries and external interactions. Their union
+covers more ground than either alone.
+
+**Cost comparison:**
+
+| Approach | Total tokens | Approx. cost | Coverage of GPT-5 |
+|---|---|---|---|
+| GPT-5 alone | ~21K (9.4K output + 5.7K reasoning) | ~$0.80 | 100% (35 findings) |
+| Mini + Sonnet | ~7.1K output + 1.8K reasoning | ~$0.25 | ~71% (25/35 findings) |
+| All three | ~28K total | ~$1.05 | >100% (35 + unique Sonnet/Mini extras) |
+
+**Key insight — the 71% coverage is a floor, not a ceiling:**
+
+The union covers 71% of GPT-5's specific findings. But Mini and Sonnet each
+also produced findings that GPT-5 DIDN'T make:
+- Sonnet: DailyLossLimit query performance scaling, instrument reference data
+  propagation atomicity across components
+- Mini: Signal audit correlation ambiguity under replay/duplicate ticks
+
+So the total unique finding space is LARGER than any single model's output.
+Running all three produces the most comprehensive analysis.
+
+**Answer to the open question: "Would running GPT-5 Mini + Sonnet together
+approach GPT-5's coverage at lower combined cost?"**
+
+**Partially.** The pair covers ~71% of GPT-5's findings at ~31% of the cost.
+But the missing 29% is disproportionately valuable — it contains the
+domain-specific, interaction-level, real-world-knowledge findings that are
+most likely to prevent production incidents. For a quick sanity check or
+first-pass screening, Mini + Sonnet is excellent value. For architecture
+review where completeness matters (financial system, safety-critical), GPT-5
+is not replaceable by cheaper models — its unique findings are exactly the
+ones that would cause real-world failures.
+
+**Practical implication:** The optimal strategy depends on stakes:
+- **Low stakes** (internal doc review, non-critical systems): Mini + Sonnet
+  is 71% coverage at 31% cost — strong ROI
+- **High stakes** (financial systems, safety-critical): run all three — the
+  ~$1 total cost is irrelevant vs the value of the extra 10-18 findings
+- **Budget-conscious high stakes:** run GPT-5 alone — it subsumes most of
+  what Mini + Sonnet find, and adds the critical domain-knowledge findings
+
+The cost argument for Mini + Sonnet as a GPT-5 REPLACEMENT doesn't hold for
+important work. The cost argument for Mini + Sonnet as a GPT-5 COMPLEMENT
+is strong — they catch a few things GPT-5 misses, and the union of all three
+is the most thorough analysis available.
+
+**Document complexity observation:**
+This is the largest document tested (1,110 lines vs previous 185-785 lines).
+GPT-5's finding count scaled up (35 vs 20-26 on smaller docs) while maintaining
+quality — no padding with obvious/low-value findings. Mini also scaled (21 vs
+6 on 459-line doc in Finding #14). Sonnet scaled less (17 vs 12-17 on smaller
+docs) — it appears to have a natural output ceiling regardless of document size,
+consistent with its self-filtering behavior observed in previous findings.
@@ -0,0 +1,163 @@
+# Finding 20: Invariant violation path analysis: GPT-5 is maximally selective (3 findings, all genuine); Opus shows unique self-correcting analytical style; new task type favors precision over exhaustiveness
+
+**Date:** 2026-05-04
+**Task:** Identify invariant violation paths in gargoyle's `user-pipeline-lifecycle.md`
+(730 lines) — sequences of legal operations that can violate the system's stated or
+implied invariants. NEW analytical lens not previously tested, distinct from assumption-
+finding, race conditions, or coherence checking.
+**How we used them:** Same document (full text) + same focused analytical question to all
+3 models via HAI proxy. Highly structured prompt specifying 5 categories of invariant
+violations (state machine escapes, invariant composition failures, monotonicity violations,
+idempotency boundary violations, authority inversion sequences). Required specific output
+format per finding. No tools, no project context beyond the document itself.
+
+| Model | Time | Output tokens | Reasoning tokens | Findings |
+|---|---|---|---|---|
+| GPT-5 | 143s | 784 | 12,032 | 3 |
+| Claude Opus 4.6 | 113s | 6,183 | (internal) | 7 (with 2 self-corrections) |
+| Claude Sonnet 4.6 | 23s | 1,266 | (internal) | 5 |
+
+**What they found — common ground (2+ models identified):**
+
+- **Periodic reconciliation overrides operator manual stop** (GPT-5 #3 + Opus #5 +
+  Sonnet #1): An admin who stops a pipeline via `stop_user/1` with `:admin_action`
+  has their decision overridden within 5 minutes by periodic reconciliation, because
+  there's no "admin stopped" state in `check_eligibility/1`. All three models
+  independently identified this as the clearest authority inversion.
+- **DynamicSupervisor restart bypasses eligibility gate** (Opus #1/#3 + Sonnet #2):
+  When `UserPipeline.Supervisor` crashes and is restarted by OTP supervision, the
+  restart bypasses `start_user/1` and `check_eligibility/1` entirely — potentially
+  resuming trading while the kill switch is engaged.
+- **Stale ReconciliationGate after crash** (Opus #7 only): After a crash-triggered
+  DynamicSupervisor restart (not via `stop_user/1`), the ReconciliationGate remains
+  `:ready` from the previous instance because `stop_user/1` (which resets it) was
+  never called. The new OrderManager may accept orders during its own reconciliation.
+- **HealthMonitor co-lifecycle violation** (Opus #2 + Sonnet #4): After a
+  DynamicSupervisor-initiated restart, the HealthMonitor is still subscribed to the
+  old PIDs — no code re-establishes monitoring for the new pipeline processes.
+
+**GPT-5 unique findings (not in either other model):**
+
+- **Kill switch bypass for users configured DURING engagement** (#1): A user who
+  saves credentials while the kill switch is engaged is never added to the pending
+  operator release set (only running pipelines are added at engage time). After
+  disengage, periodic reconciliation auto-starts this user's pipeline without
+  operator release — violating "resuming always requires human judgment." This is
+  the most precisely reasoned finding across all three models: each step is
+  individually correct per the spec, and the violation emerges purely from the
+  composition of legal operations.
+- **Premature release bypass** (#2): If `operator_release_user/1` is called while
+  the kill switch is still engaged (a legal operation), it clears the pending
+  release flag but `start_user/1` correctly refuses. After later disengage, the
+  flag is gone — auto-start proceeds without fresh operator judgment. The release
+  was "spent" at the wrong time.
+
+**Claude Opus unique findings (not in either other model):**
+
+- **`operator_release_system/0` clears unrelated safety obligations** (#4):
+  Operator intends to release one user from a recent event but
+  `operator_release_system/0` also releases other users still pending from an
+  earlier, unresolved event. One release call discharges multiple independent
+  safety obligations — monotonicity violation.
+- **State machine incompleteness for blocked users** (#6): Users who become
+  configured during kill switch engagement (blocked with reason
+  `:kill_switch_engaged`) have no state machine transition back to `starting`
+  after disengage — they're not in the pending release set, and no event fires.
+  System works via periodic reconciliation (up to 5 minutes delay), but the
+  documented state machine doesn't represent this path.
+- **Self-correcting analytical style:** Opus explicitly withdrew two draft
+  findings mid-analysis ("Actually, this sequence works as designed. Let me
+  identify a real violation instead." / "this is likely handled"). This
+  self-correction behavior was first observed in Finding #15 and is now
+  confirmed as a consistent Opus trait for invariant-style analysis.
+
+**Claude Sonnet unique findings (not in either other model):**
+
+- **Cold-start Tier 3 failure creates supervision restart loop** (#2): A
+  persistent Tier 3 failure (phantom fills) crashes OrderManager, `:rest_for_one`
+  kills the tree, DynamicSupervisor restarts it, cold-start fails again → infinite
+  loop. State machine shows `starting → stopped` but supervision creates
+  `starting → starting` indefinitely.
+- **HealthMonitor start failure during start_user** (#4): If HealthMonitor.Supervisor
+  is momentarily crashed when `start_user/1` runs step 4, the pipeline starts
+  without monitoring. No error handling specified for this partial-start state.
+
+**Quality assessment:**
+
+- **GPT-5** was MAXIMALLY SELECTIVE — only 3 findings from 12,032 reasoning tokens
+  (4,011 reasoning tokens per finding). This is the most extreme
+  reasoning-to-output ratio observed: 15:1 (12,032 reasoning / 784 output tokens).
+  For comparison, in previous experiments GPT-5 typically shows 1:1 to 2:1 ratios.
+  Every finding is a genuine invariant violation with a precise, step-by-step
+  sequence where each step is individually legal. ZERO false positives, zero
+  padding, zero "this might be an issue." GPT-5 appears to have used almost all
+  its reasoning budget for VERIFICATION — confirming that each candidate is
+  genuinely a violation before including it.
+- **Claude Opus** produced the most findings (7) with its characteristic depth and
+  self-correction. Two findings were revised mid-analysis, showing Opus actively
+  testing its own reasoning against the document before committing to a finding.
+  The DynamicSupervisor restart thread (findings #1, #2, #3, #7) forms a coherent
+  cluster — Opus identified one root cause (OTP restarts bypass the lifecycle
+  layer) and explored its multiple consequences. The `operator_release_system`
+  monotonicity finding (#4) is architecturally significant and unique.
+- **Claude Sonnet** was extremely fast (23s, 1,266 tokens) and produced 5 findings.
+  Quality was mixed: Finding #1 partially mirrors GPT-5's authority inversion but
+  with vaguer reasoning ("race condition with ETS operations" — not specified).
+  Finding #3 describes a contradiction but the scenario is internally inconsistent
+  (step 5 says "pipeline termination fails" but then step 7 says pipeline is still
+  running — this conflates two failure modes). Findings #2 and #4 are genuine and
+  well-reasoned. Sonnet's precision is lower than the other two on this task.
+
+**Key insight — "Invariant violation paths" as a task type:**
+
+This is a genuinely DIFFERENT analytical task from any previously tested. It requires:
+1. Identifying the invariants (explicit or implied)
+2. Constructing a sequence of operations (creative/generative)
+3. Verifying each step is legal per the spec (verification)
+4. Confirming the end state violates the invariant (correctness proof)
+
+This four-phase cognitive process explains GPT-5's extreme selectivity: step 2 is
+generative, steps 3 and 4 are verification-heavy, and GPT-5's reasoning tokens are
+being burned on steps 3 and 4 (confirming each step is genuinely legal and the final
+state genuinely violates the invariant). In previous tasks like "find hidden
+assumptions" or "find gaps," only step 1 (identification) is needed; there is no
+construction or verification phase.
+
+**Comparison to previous task types:**
+
+| Task type | GPT-5 findings | Opus findings | GPT-5 reasoning overhead |
+|---|---|---|---|
+| Hidden assumptions | 20-35 | 12-13 | 5-7K reasoning |
+| Race conditions | 12 | 10 | 8K reasoning |
+| Design coherence | 4 | 7 | 9K reasoning |
+| Invariant violation paths | 3 | 7 | **12K reasoning** |
+
+The pattern: as the task requires more VERIFICATION (vs identification), GPT-5 becomes
+more selective and spends more reasoning tokens per finding. Invariant violation paths
+demand the highest verification burden (every step must be confirmed legal), and GPT-5
+responds with the highest selectivity and reasoning investment.
+
+Opus inverts: it out-produces GPT-5 on verification-heavy tasks (7 vs 4 for coherence,
+7 vs 3 for invariant paths) but trails it on identification tasks (10-13 vs 12-35 for
+race conditions and assumptions). Its own count drops only from 10-13 to 7. This suggests
+Opus uses its internal reasoning differently: it's more willing to present findings
+that have "likely" rather than "proven" violations, then self-corrects inline if the
+verification fails.
+
+**Practical implication:**
+
+For invariant violation path analysis:
+- **GPT-5** produces the highest-precision findings but very few. Every finding is a
+  genuine spec-level bug. Use when you need zero-false-positive bug reports to present
+  to a design team.
+- **Opus** produces more findings with slightly lower precision but unique analytical
+  depth. Its self-correction behavior means false positives are often caught inline.
+  Use when you want both confirmed violations AND identified tensions.
+- **Sonnet** is too imprecise for this task type — some findings have internal
+  inconsistencies. Use for lighter analytical tasks (assumption-finding, spec gaps).
+
+The three findings GPT-5 produced are ALL genuine design bugs that should be fixed:
+1. Users configured during kill switch engagement bypass operator release
+2. Premature operator release (while KS still engaged) creates future bypass
+3. Admin stops are overridden by periodic reconciliation
+
+These are the kind of findings that, in a real financial system, prevent production
+incidents. The 12K reasoning tokens to produce 3 perfect findings is excellent ROI.
@@ -0,0 +1,125 @@
+# Finding 21: Reasoning effort (low/medium/high) has negligible effect on GPT-5's analytical output; the parameter may not work as documented for open-ended analysis
+
+**Date:** 2026-05-04
+**Task:** Identify hidden assumptions in gargoyle's `order-state-machine.md` (221 lines)
+— a well-structured state machine specification covering order lifecycle, fill precedence,
+TIF semantics, and parameter resolution.
+**How we used them:** Same document, same prompt, same model (GPT-5), same
+max_completion_tokens (16K). Only variable: `reasoning.effort` parameter set to
+"low", "medium", or "high". Run sequentially via HAI proxy (OpenAI-compatible
+endpoint). No tools, no project context beyond the document.
+
+| Effort | Time (ms) | Output tokens | Reasoning tokens | Findings |
+|---|---|---|---|---|
+| Low | 97,913 | 7,657 | 4,288 | 33 (+11 recs) |
+| Medium | 94,824 | 7,112 | 4,160 | 30 |
+| High | 88,607 | 6,891 | 3,712 | 30 |
+
+**The counterintuitive result:** Higher reasoning effort produced FEWER findings,
+FEWER reasoning tokens, FEWER output tokens, and completed FASTER. The expected
+pattern (high effort → more reasoning → more depth) was inverted.
+
+**Per-finding metrics (remarkably consistent):**
+
+| Effort | Output tokens/finding | Reasoning tokens/finding |
+|---|---|---|
+| Low | 232 | 129 |
+| Medium | 237 | 138 |
+| High | 229 | 123 |
+
+The depth per finding was nearly identical across all three levels. The model
+didn't get more detailed or rigorous per finding at higher effort; it just
+found slightly fewer things.
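
The per-finding figures can be recomputed from the effort table. A small sketch, assuming the table values are rounded down (integer division reproduces them exactly); the helper names are ours.

```python
# Token and finding counts copied from the effort table above.
RUNS = {
    "low": {"output": 7657, "reasoning": 4288, "findings": 33},
    "medium": {"output": 7112, "reasoning": 4160, "findings": 30},
    "high": {"output": 6891, "reasoning": 3712, "findings": 30},
}


def per_finding(run):
    """Return (output tokens, reasoning tokens) per finding, rounded down."""
    return run["output"] // run["findings"], run["reasoning"] // run["findings"]


def spread(values):
    """Relative gap between the largest and smallest value."""
    return (max(values) - min(values)) / min(values)


METRICS = {effort: per_finding(run) for effort, run in RUNS.items()}
OUTPUT_SPREAD = spread([output for output, _ in METRICS.values()])
REASONING_SPREAD = spread([reasoning for _, reasoning in METRICS.values()])
```

The spread across effort levels comes out at roughly 3% for output tokens per finding and roughly 12% for reasoning tokens per finding.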
+
+**Severity distributions (similar across all three):**
+- Low: 7 Critical, 21 High, 5 Medium (33 findings)
+- Medium: 9 Critical, 15 High, 4 Medium + 2 borderline (30 findings)
+- High: 6 Critical, 14 High, 5 Medium + 4 borderline (30 findings)
+
+**Qualitative differences — WHAT they found:**
+
+High-effort unique findings (not in low):
+- Single-writer authority to broker (no out-of-band modifications)
+- Broker emits fills for all executed quantities (no silent netting)
+- Instrument identity remains stable across corporate actions
+- Late-fill override won't violate downstream invariants
+- Validation covers lot sizes, price ticks, borrow/locate constraints
+- Multiple accounts and venues are part of the correlation key
+- Streaming and polling APIs are consistent
+- System can handle multi-leg instruments
+
+Low-effort unique findings (not in high):
+- Acks arrive before fills (no pre-ack fills)
+- Cancel-before-ack handling (submitted → cancelled missing)
+- Fill totals never exceed requested quantity
+- Deterministic ordering within a broker stream
+- Exercise/assignment and non-order position changes
+- Client-side idempotency of "place order"
+- Partial accept/normalize on replace
+- No "child" order fragmentation at broker
+- Submitted state can receive terminal events
+- Late cancel vs local expired mismatch
+
+**Character of the differences:**
+- HIGH-unique findings tend to be more **architectural/systemic** (multi-leg
+  instruments, streaming vs polling consistency, downstream invariant violations,
+  corporate actions). These require reasoning about the system's relationship
+  to the broader world.
+- LOW-unique findings tend to be more **implementation-specific edge cases**
+  (cancel-before-ack, pre-ack fills, child order fragmentation, partial accepts).
+  These require reasoning about specific event interleavings and protocol details.
+
+Both sets are valid and actionable. Neither is clearly "better." They represent
+different analytical modes — breadth-of-scope (high) vs depth-of-protocol (low).
+
+**Key insight — reasoning_effort doesn't scale analysis linearly:**
+
+Three possible explanations for the inverted behavior:
+
+1. **GPT-5 already uses near-maximum reasoning for analytical tasks regardless
+   of the effort parameter.** The ~4K reasoning tokens across all three levels
+   (4288/4160/3712) are too similar to reflect a genuine effort gradient. The
+   parameter may primarily affect OTHER task types (math, code, logic puzzles)
+   where reasoning depth is more variable.
+
+2. **Higher effort increases FILTERING, not exploration.** At high effort, GPT-5
+   may spend more of its reasoning on VERIFYING whether findings are genuine
+   before including them — similar to the extreme selectivity observed in
+   Finding #20 (invariant violation paths, 12K reasoning for 3 findings). This
+   would explain fewer findings despite theoretically "trying harder."
+
+3. **The parameter has minimal practical effect for this model version.**
+   The differences (33 vs 30 vs 30) are within normal stochastic variation.
+   Repeated runs at the same effort level might show similar variance.
+
+**The prompt cache hit on HIGH (2304 cached prompt tokens) may have slightly
+accelerated processing, but doesn't explain the reasoning token difference.**
+
+**Comparison to previous findings:**
+In Finding #20 (invariant violation paths), GPT-5 used 12,032 reasoning tokens
+for 3 findings, which is extreme verification behavior. Here, on a different task
+type (hidden assumptions), it uses ~4K reasoning for ~30 findings at every effort level.
+This confirms that TASK TYPE is a far stronger predictor of GPT-5's reasoning
+behavior than the reasoning_effort parameter. The invariant violation prompt
+triggered deep verification; the assumption-finding prompt triggers broad
+exploration regardless of effort setting.
+
+**Practical implication:**
+For open-ended analytical tasks (assumption-finding, gap analysis, spec review),
+the reasoning_effort parameter appears to have negligible practical effect on
+GPT-5. Don't bother tuning it for these tasks — the default is fine. The
+parameter may be more meaningful for:
+- Tasks with verifiable correct answers (math, logic)
+- Tasks where the model could short-circuit (simple questions)
+- Extremely long documents where exploration budget matters
+
+For architecture review specifically: reasoning_effort is NOT a useful lever.
+Task framing (the prompt structure) and document selection remain the dominant
+variables for output quality. Save reasoning_effort tuning for coding/math tasks
+where the parameter was likely trained and evaluated.
+
+**Open question:** Would running the same experiment 5x at each level show that
+the 33-vs-30 difference is within stochastic noise? If so, reasoning_effort is
+effectively a no-op for analytical prompts. If not, low-effort consistently
+produces more (less filtered) output, which could be useful for brainstorming-
+style analysis where you want maximum coverage before manual triage.
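
One way to answer the open question once repeated runs exist is an exact permutation test on the per-run finding counts. This is a sketch under stated assumptions: the function name is ours, the statistic (absolute difference of means) is one reasonable choice among several, and no repeated-run data has been collected yet.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(counts_a, counts_b):
    """Exact two-sided permutation test on the difference of mean finding counts.

    Returns the share of relabelings whose mean gap is at least as large as
    the observed one. Small values suggest the gap is not just run-to-run noise.
    """
    observed = abs(mean(counts_a) - mean(counts_b))
    pooled = list(counts_a) + list(counts_b)
    indices = range(len(pooled))
    total = 0
    at_least_as_extreme = 0
    for chosen in combinations(indices, len(counts_a)):
        picked = set(chosen)
        group_a = [pooled[i] for i in picked]
        group_b = [pooled[i] for i in indices if i not in picked]
        total += 1
        # Small tolerance guards against float rounding in the comparison.
        if abs(mean(group_a) - mean(group_b)) >= observed - 1e-12:
            at_least_as_extreme += 1
    return at_least_as_extreme / total
```

With 5 runs per effort level there are only 252 relabelings per pair of levels, so the exact test is cheap. Note that the smallest achievable p-value is then 2/252, which limits how strong a conclusion 5 runs can support.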
@@ -0,0 +1,180 @@
+# Finding 22: Silent correctness failures: NEW analytical lens reveals Opus's strength at domain/regulatory reasoning; GPT-5 shows regulatory depth; all models converge on compliance timestamp errors
+
+**Date:** 2026-05-05
+**Task:** Identify scenarios where the mechanism produces SILENTLY INCORRECT results
+(not errors, not crashes — wrong financial calculations, wrong lot selections, or wrong
+compliance records that pass all validation) in gargoyle's `specid-lot-selection.md`
+(306 lines) — a financial system specification covering tax lot selection strategies,
+cost basis accounting, and IRS SpecID compliance.
+**How we used them:** Same document (full text) + same focused analytical question to
+all 3 models via HAI proxy. Highly structured prompt specifying 5 categories of silent
+incorrectness (stale data, semantic precision, ordering sensitivity, composition errors,
+temporal reference errors). Required specific output format per finding with concrete
+numerical examples of financial impact. No tools, no project context beyond the document.
+
+| Model | Time | Output tokens | Reasoning tokens | Findings | Critical | High | Medium |
+|---|---|---|---|---|---|---|---|
+| GPT-5 | 147s | 13,006 | 10,496 | 7 | 2 | 2 | 3 |
+| Claude Opus 4.6 | 119s | 5,902 | (internal) | 10 | 3 | 3 | 4 |
+| Claude Sonnet 4.6 | 122s | 6,011 | (internal) | 6 | 3 | 3 | 0 |
+
+**What they found — common ground (all 3 identified):**
+- `designation_at` = `DateTime.utc_now()` at processing time, NOT at actual
+  designation time (manual selection was made at order submission, standing
+  orders were configured earlier) — compliance record factually incorrect
+- Holding period calculation boundary errors (>365 days vs IRS "more than one
+  year" rule, off-by-one at leap year boundaries, day-after-acquisition start)
+- HIFO tie-breaker `opened_at ASC` ignores tax_term dimension — selects
+  long-term losses over short-term losses when both have identical cost basis,
+  producing less tax-valuable outcomes
+- Strategy preference resolved at fill processing time, not at trade time
+  (preference changes between trade and fill processing apply retroactively)
+
+**GPT-5 unique findings (not in either Claude model):**
+- Corporate action applied late leaves stale cost basis in HIFO: ROC/dividend reduces
+  basis but if close/4 fires before apply_corporate_action/3, HIFO sorts on
+  pre-adjusted basis AND records wrong realized P&L permanently. No mechanism
+  to restate previously persisted LotClosed events. Concrete example: $2,000
+  overstated loss from one trade.
+- `designation_at` fragmentation: a single sell consuming multiple lots calls
+  DateTime.utc_now() per loop iteration, producing slightly different timestamps
+  for what should be a single coherent designation event. Audit risk.
+- LIFO label in `selection_method` field: records "lifo" but for securities LIFO
+  isn't an authorized tax method — the operation is legally SpecID electing
+  newest lots. Downstream reporting may reject or misclassify.
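
The `designation_at` fragmentation is easy to reproduce in miniature. A sketch in Python rather than the system's Elixir, with hypothetical function names and an injected clock so the effect is deterministic; it addresses only the per-iteration fragmentation, not the separate question of whether processing time is the right timestamp at all.

```python
from datetime import datetime, timedelta, timezone
from itertools import count


def close_lots_per_iteration(lots, now):
    """Reads the clock once per lot, as in the behavior GPT-5 flagged."""
    return [{"lot": lot, "designation_at": now()} for lot in lots]


def close_lots_single_capture(lots, now):
    """Reads the clock once per sell and reuses it for every consumed lot."""
    designated = now()
    return [{"lot": lot, "designation_at": designated} for lot in lots]


def ticking_clock(start, step=timedelta(microseconds=250)):
    """Fake clock that advances by `step` on every call."""
    ticks = count()
    return lambda: start + next(ticks) * step


START = datetime(2026, 5, 5, 14, 30, tzinfo=timezone.utc)
LOTS = ["lot-1", "lot-2", "lot-3"]

fragmented = close_lots_per_iteration(LOTS, ticking_clock(START))
coherent = close_lots_single_capture(LOTS, ticking_clock(START))
```

With the per-iteration version each of the three records carries a different timestamp; with a single capture all three share one, which is what a single designation event should look like in an audit trail.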
+
+**Claude Opus unique findings (not in either other model):**
+- Realized P&L excludes commissions/fees: formula uses `sell_fill.price` (raw
+  execution price) minus `lot.cost_basis`, not net proceeds. If cost_basis also
+  excludes buy-side commissions, P&L is doubly overstated. Active trader doing
+  1000 trades/year: ~$20,000+ cumulative P&L overstatement.
+- Position `average_cost` is meaningless under SpecID and potentially misleading:
+  SpecID exists to exploit lot-level basis differences, but position-level average
+  obscures this. If downstream consumers use average_cost for tax estimation,
+  results can be 50%+ wrong per lot.
+- GenServer mailbox ordering determines lot-to-fill assignment for concurrent sells:
+  two simultaneous fills for the same instrument get different lots based on network
+  arrival timing. With different holding periods, produces $670+ tax difference
+  without user awareness.
+- Wash sale rule completely unaddressed: system reports losses as realized/deductible
+  without checking 30-day substantially identical purchase rule. Active trader
+  harvesting $50,000 in losses could have $0 actually deductible — $18,500 tax gap.
+- `opened_at` semantics undefined: whether it's exchange execution time, GenServer
+  arrival time, or settlement date affects every downstream calculation (FIFO/LIFO
+  ordering, holding periods, tax terms). Network timing could produce wrong FIFO
+  lot selection.
+
+**Claude Sonnet 4.6 unique findings (not in either other model):**
+- Stale cost basis in manual lot picker during concurrent corporate actions: UI shows
+  pre-action basis, user selects based on stale data, but close/4 only validates
+  open/ownership/quantity — never re-validates that the selection rationale is still
+  correct. No field records the discrepancy.
+- `average_cost` recomputation ordering ambiguity in event-sourced model: step 4
+  recomputes from "updated lots" but step 3 (persist events) may not have completed
+  — if implementation re-derives from event store rather than in-memory state, reads
+  pre-closure lot quantities. Accumulates $500+ error per partial close.
+- Strategy fallback + config corruption silently overwrites selection method in
+  compliance record: if config becomes invalid, fallback to :fifo is logged at
+  :warning but LotClosed records `selection_method: "fifo"` — compliance record
+  shows user "chose" FIFO when they configured HIFO. No field records intended vs
+  actual strategy.
+
+**Quality assessment:**
+- **Claude Opus** produced the most findings (10) with the broadest analytical scope.
+  Several findings went BEYOND the document's mechanism to identify missing features
+  that create silent incorrectness (wash sale rules, commission handling, opened_at
+  semantics). This is a different analytical mode: Opus identified what the system
+  SHOULD compute but DOESN'T, not just where the existing computation is wrong.
+  The wash sale finding is the highest-impact across all three models — an active
+  trader's entire tax-loss harvesting strategy could be invalid. The GenServer
+  mailbox ordering finding shows characteristic Opus reasoning about emergent
+  behavior from design decisions.
+- **GPT-5** produced fewer findings (7) but with extreme precision and specificity.
+  Every finding includes concrete dollar amounts and specific field references.
+  The corporate action stale basis finding is uniquely actionable — it identifies a
+  specific race condition between two documented mechanisms (close/4 and
+  apply_corporate_action/3) that produces permanently incorrect persisted data
+  with no correction path. The designation_at fragmentation finding shows attention
+  to implementation detail that neither Claude model noticed. GPT-5 used 10,496
+  reasoning tokens for 7 findings (1,500 tokens/finding) — HIGH verification,
+  consistent with Finding #20's pattern for precision-over-breadth tasks.
+- **Claude Sonnet 4.6** produced 6 findings with strong specificity and novel angles.
+  The event-sourced recomputation ordering finding (#5) is architecturally subtle —
+  it identifies a composition error between the walk-and-consume algorithm's step
+  ordering and event-sourcing patterns. The strategy fallback compliance recording
+  finding is a genuine audit hazard. However, Sonnet produced no Medium-severity
+  findings — it either found Critical/High issues or filtered everything else out.
+  This aligns with its established high-precision, high-self-filtering behavior.
+
+**Key insight — "Silent correctness" as an analytical lens:**
+
+This is the FIRST experiment testing a "silent incorrectness" prompt. The key
+difference from previous analytical lenses:
+- **Assumption-finding:** "What must be true for this to work?" (Finding #10-12)
+- **Race conditions:** "What timing issues exist?" (Finding #13)
+- **Design coherence:** "Does the design contradict itself?" (Finding #15)
+- **Invariant violations:** "What operation sequences break invariants?" (Finding #20)
+- **Silent correctness:** "Where does the system CONFIDENTLY produce WRONG output
+  with NO indication of error?"
+
+The silent correctness lens produced qualitatively different findings from all
+previous lenses. The emphasis on "passes all validation" forced models to reason
+about what SHOULD be validated but ISN'T, and about semantic correctness (regulatory
+requirements, financial accounting rules) vs syntactic correctness (valid types,
+non-nil fields, correct schema).
+
+This lens also revealed a key model differentiation not seen before:
+- **Opus reasons about MISSING functionality** (wash sales, commissions, opened_at
+  semantics) — things the system should do but doesn't
+- **GPT-5 reasons about EXISTING functionality being wrong** (corporate action race,
+  designation fragmentation, LIFO labeling) — things the system does but incorrectly
+- **Sonnet reasons about COMPOSITION failures** (event-sourcing step ordering,
+  strategy fallback propagation) — things that are individually correct but combine
+  incorrectly
+
+These are three genuinely different analytical modes, not just "more/less thorough."
+All three are valuable for different review outcomes: Opus for feature completeness,
+GPT-5 for mechanism correctness, Sonnet for integration correctness.
+
+**Financial domain advantage:**
+
+This is the first experiment on a document with strong regulatory/financial semantics.
+All three models demonstrated domain knowledge (IRS holding period rules, Treas. Reg.
+1.1012-1(c) requirements, wash sale IRC §1091, long-term/short-term capital gains
+rate differentials). Opus in particular referenced specific IRC sections and provided
+concrete tax rate calculations. The "silent incorrectness" lens works especially well
+on financial/regulatory documents because the gap between "syntactically valid output"
+and "semantically/legally correct output" is large and consequential.
+
+**Comparison to previous findings on the same models:**
+
+| Task type | GPT-5 findings | Opus findings | Sonnet findings | Opus > GPT-5? |
+|---|---|---|---|---|
+| Hidden assumptions (#10-12) | 20-35 | 12-13 | 13-17 | No |
+| Race conditions (#13) | 12 | 10 | 7 | No |
+| Design coherence (#15) | 4 | 7 | 5 | **Yes** |
+| Invariant violations (#20) | 3 | 7 | 5 | **Yes** |
+| Silent correctness (#22) | 7 | 10 | 6 | **Yes** |
+
+Pattern confirmed: Opus outperforms GPT-5 (by finding count) on tasks that require
+reasoning about the design's RELATIONSHIP to external requirements (regulatory,
+financial, consumer expectations). GPT-5 outperforms Opus on tasks that require
+EXHAUSTIVE EXPLORATION within a self-contained system (assumptions, race conditions).
+
+The "silent correctness" lens is structurally similar to coherence checking (does the
+system match its external requirements?) rather than gap-finding (what's missing
+within the system?). This explains why Opus outperforms: the task requires reasoning
+about the world outside the document (IRS rules, financial accounting standards,
+regulatory requirements), which is Opus's strength.
+
+**Practical implication:**
+For financial/regulatory system review, the "silent correctness" lens should be
+run using Opus as the primary model (broadest findings including missing-feature
+identification) plus GPT-5 for mechanism-level precision. Sonnet adds value for
+composition/integration issues that neither Opus nor GPT-5 catches. All three
+produced unique, actionable findings that the others missed.
+
+The four findings ALL models converged on (designation_at, holding period, HIFO
+tie-breaker, strategy preference timing) should be treated as confirmed design
+bugs requiring fixes. The fact that three independent models all identified them
+with concrete financial impact examples increases confidence that these are real.
@@ -0,0 +1,193 @@
+# Finding 23: Regulatory compliance analysis: GPT-5 finds broadest scope of IRS issues; Opus self-corrects mid-analysis; all models converge on cross-account gap
+
+**Date:** 2026-05-05
+**Task:** Identify where gargoyle's `wash-sale-tracking.md` (391 lines) could produce
+incorrect tax reporting compared to IRS wash sale regulations (IRC 1091). NEW
+analytical lens: regulatory compliance verification — asking models to reason about
+a code implementation's correctness against EXTERNAL regulatory requirements (not
+internal system assumptions or race conditions).
+**How we used them:** Same document (full text) + same focused analytical question
+to all 3 models via HAI proxy. Prompt structured around 5 categories: regulatory
+gaps, interpretation errors, edge cases at regulatory boundaries, cross-account/entity
+concerns, and interaction with other IRC sections. Required specific regulatory
+citations, implementation analysis, concrete tax errors, and audit risk levels.
+No tools, no project context beyond the document.
+
+| Model | Time | Output tokens | Reasoning tokens | Findings |
+|---|---|---|---|---|
+| GPT-5 | 178s | 12,525 | 9,536 | 16 |
+| Claude Opus 4.6 | 155s | 7,326 | (internal) | 16 (with 2 self-corrections/withdrawals) |
+| Claude Sonnet 4.6 | 40s | 1,818 | (internal) | 12 |
+
+**What they found — common ground (all 3 identified):**
+- Cross-account/IRA/external broker wash sales not tracked (IRS applies at taxpayer level)
+- Options/contracts to acquire stock not triggering wash sales (explicit in IRC 1091(a) text)
+- "Substantially identical" definition too narrow (same index ETFs, share classes, ADRs)
+- Trade date vs settlement date ambiguity in opened_at/closed_at
+- Short sale wash sales not addressed
+- Section 475 mark-to-market traders incorrectly subjected to wash sale tracking
+- IRC 1092 straddle rules interaction not addressed
+- Related party / spousal transactions not considered
+- Corporate action identity changes breaking matching
+
+**GPT-5 unique findings (not in either other model):**
+- **Per-share vs lot-level basis tacking** (#1): The system applies `disallowed_loss`
+  and `tacked_opened_at` at the LOT level, but IRS requires per-share treatment
+  when only partial shares are matched. A lot of 100 shares where only 60 trigger
+  wash sale should have per-share basis segregation — the system inflates basis for
+  all 100 shares. **Most architecturally significant finding** — a fundamental
+  design-level error, not a missing feature.
+- **IRA permanent disallowance** (#2): When replacement purchase is in an IRA, the
+  loss is PERMANENTLY lost (no basis adjustment possible in tax-deferred accounts).
+  System either incorrectly applies basis adjustment inside IRA or misses it entirely.
+- **Instruments not subject to §1091** (#4): §1256 contracts (futures, index options),
+  cryptocurrency, and §475 elections are all exempt — system may over-disallow.
+- **Average-cost mutual fund basis** (#11): Wash sale adjustments for funds using
+  average-cost method require different math than discrete lot-level adjustments.
+- **ADRs vs local shares** (#14): ADRs and underlying foreign ordinaries are
+  substantially identical but have different instrument_ids.
+- **RSU vestings/ESPP purchases** (#15): Equity compensation creating lots via
+  corporate action paths may not trigger `check_replacement/2`.
+- **Ordering priority between pre/post sale purchases** (#10): Industry convention
+  (post-sale first, then pre-sale) may differ from system's strict chronological
+  ordering, causing 1099-B mismatches.
+
+**Claude Opus unique findings (not in either other model):**
+- **Year-end boundary timing** (#5): Loss in December + replacement in January means
+  tax reports generated between Dec 31 and the replacement purchase date are incorrect.
+  Forward detection fires retroactively but users may have already filed. System needs
+  a "30-day pending window" for year-end reports.
+- **Form 8949 reporting format** (#6): IRS requires code "W" in column (f) and
+  specific adjustment amounts in column (g). System doesn't describe how `tax_summary/3`
+  produces Form 8949-compatible output — potential CP2000 notice triggers from
+  automated IRS matching against broker 1099-B.
+- **"Open lots" query in backward detection** (#10): If backward detection only
+  queries currently-open lots, it misses replacements that were acquired AND SOLD
+  within the window. IRS looks at acquisition regardless of current holding status.
+  (Rev. Rul. 56-602)
+- **Forward detection loss ordering unspecified** (#7): When multiple prior losses
+  compete for the same replacement shares, ordering matters — different allocation
+  produces different basis amounts on the replacement lot.
+- **DRIP reinvestments triggering wash sales** (#9): Dividend reinvestment creates
+  new lots that should trigger forward detection but may not if only buy fills
+  produce `LotOpened` events.
+- **Self-correcting analytical style (CONFIRMED):** Opus withdrew Finding #4
+  entirely mid-analysis ("Revised assessment: holding period logic appears correct.
+  I withdraw the claim of error"). Spent ~500 words reasoning through the holding
+  period tacking logic, found it correct, and explicitly retracted. This is now
+  confirmed across Findings #15, #20, and #23 as a consistent Opus behavior for
+  verification-heavy regulatory analysis.
+
+**Claude Sonnet unique findings (not in either other model):**
+- **Entity-level tracking for partnerships/S-Corps** (#4.2): Tax-transparent entities
+  trading through the platform need K-1 reporting to partners — user-scoped model
+  doesn't address pass-through entity wash sale reporting.
+- **Constructive sale integration (IRC 1259)** (#4.1): Short positions or derivatives
+  creating constructive ownership interact with wash sale determination in ways not
+  addressed.
+- **NOL carryforward interaction** (#5.3): Wash sale deferrals affect character and
+  timing of losses contributing to NOL calculations across tax years.
+
+**Quality assessment:**
+- **GPT-5** produced the broadest regulatory scope (16 findings) with the most
+  specific IRS citations (Rev. Rul. 2008-5, Pub. 550, IRC §§267, 1091, 1092, 1222,
+  1223, 1256, 475). Its per-share vs lot-level finding (#1) is the only one that
+  identifies a FUNDAMENTAL DESIGN ERROR (not a missing feature). Most of the other
+  models' findings are "you don't handle X"; GPT-5's #1 says "what you DO handle is
+  handled INCORRECTLY." This distinction matters: missing features are known scope
+  limitations; incorrect logic is a bug.
+- **Claude Opus** matched GPT-5's count (16 with 2 self-corrections = 14 net
+  confirmed) but with different character. Opus excelled at identifying OPERATIONAL
+  implications (year-end boundary timing, Form 8949 format requirements, forward
+  detection ordering) rather than just statutory gaps. Its findings tend to describe
+  HOW the gap manifests in practice ("user files taxes, then January purchase
+  retroactively invalidates the filing") vs GPT-5's approach of citing the statute
+  and describing the theoretical violation.
+- **Claude Sonnet** was fast (40s) and produced 12 competent findings but with less
+  regulatory precision. Findings lacked specific IRS citations (no Rev. Rul.
+  references, no Treas. Reg. citations). Several findings overlapped heavily with
+  common ground items without adding unique depth. The entity-level and
+  constructive sale findings show awareness of tax complexity but are relatively
+  generic ("this is complex and not addressed").
+
+**Key insight — regulatory compliance as a distinct task type:**
+
+This experiment tests a fundamentally different cognitive demand than previous ones:
+previous tasks asked "what could go wrong with this system?" (internal reasoning).
+This task asks "does this system correctly implement external rules?" (external
+reasoning). The model must hold TWO bodies of knowledge simultaneously: the
+implementation spec AND the regulatory framework, then find mismatches.
+
+All three models showed strong tax law knowledge, but citation depth differed:
+GPT-5 cited IRC sections, Revenue Rulings, and Pub. 550, while Sonnet's findings
+lacked Rev. Rul. and Treas. Reg. citations. The main differentiation, though,
+wasn't in legal knowledge but in HOW they applied it:
+
+- **GPT-5:** Exhaustive statutory mapping ("here's every IRC section that touches
+  wash sales; here's where the implementation falls short on each"). Breadth-first
+  coverage. Found the most issues by sheer scope of regulatory awareness.
+- **Opus:** Operational consequence reasoning ("here's how this gap manifests as
+  a real-world problem for the user/auditor"). Found issues by reasoning about
+  the implementation's interaction with real-world workflows (filing deadlines,
+  form formats, broker reconciliation).
+- **Sonnet:** Category-based analysis ("here are cross-account issues, here are
+  entity issues, here are interaction issues"). Followed the prompt structure
+  closely but didn't go deep within each category.
+
+**The per-share vs lot-level finding (GPT-5 #1) — why it matters:**
+
+This is the experiment's most important result. Every model found missing features
+(options, cross-account, short sales), but those are SCOPE limitations that the
+document itself acknowledges or defers. GPT-5 uniquely found a correctness bug in
+the IMPLEMENTED logic: the system's lot-level wash sale adjustment is wrong for
+partial wash sales, specifically in how it tacks holding periods.
+
+Case 1 (no bug): loss lot of 100 shares, replacement lot of 60 shares. Only 60
+shares trigger the wash sale, so 60% of the loss is disallowed and added to the
+replacement lot's basis. If
+`adjusted_quantity = min(loss_quantity, replacement_quantity)`, all 60 shares of
+the replacement lot are matched, so applying the basis adjustment and
+`tacked_opened_at` to the whole lot is correct, even if the lot is later sold in
+pieces (say, 30 shares at a time).
+
+Case 2 (the bug): the replacement lot is larger than the loss, e.g., a loss of
+60 shares matched against a replacement lot of 100 shares, so
+`adjusted_quantity < replacement_quantity`. Only the 60 matched shares should
+have tacked holding periods, but `tacked_opened_at` is set on the entire lot
+(100 shares). This IS a genuine bug: the 40 non-matched shares get an incorrect
+holding period classification.
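+
+A minimal Python sketch of one way to avoid the larger-replacement-lot problem
+described above (60-share loss, 100-share replacement lot). The `Lot` record and
+the `apply_tacking` helper are hypothetical names introduced for illustration;
+only `tacked_opened_at`, `adjusted_quantity`, and the loss/replacement quantities
+come from the analysis. How the tacked date is derived is not covered by this
+finding, so the sketch simply takes it as an input.
+

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal
from typing import List, Optional


@dataclass(frozen=True)
class Lot:
    """Hypothetical lot record; field names are illustrative."""
    quantity: Decimal
    opened_at: date
    tacked_opened_at: Optional[date] = None  # set only on wash-sale-matched shares


def apply_tacking(replacement: Lot, loss_quantity: Decimal,
                  tacked_opened_at: date) -> List[Lot]:
    """Tack the holding period onto matched shares only, splitting the lot if needed."""
    adjusted_quantity = min(loss_quantity, replacement.quantity)
    matched = Lot(adjusted_quantity, replacement.opened_at, tacked_opened_at)
    remainder = replacement.quantity - adjusted_quantity
    if remainder == 0:
        # Every replacement share is matched: lot-level tacking is already correct.
        return [matched]
    # Non-matched shares keep their own acquisition date and get no tacking.
    return [matched, Lot(remainder, replacement.opened_at)]


if __name__ == "__main__":
    lots = apply_tacking(Lot(Decimal("100"), date(2026, 1, 10)),
                         Decimal("60"), date(2025, 6, 1))
    for lot in lots:
        print(lot.quantity, lot.tacked_opened_at)
```

+
+Splitting the lot is only one possible design. Recording a matched quantity on
+the lot and resolving it at sale time would serve the same purpose.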
+
+**Updated task-type taxonomy:**
+
+| Task type | Primary cognitive demand | Best model |
+|---|---|---|
+| Hidden assumptions | Breadth identification (what's not stated?) | GPT-5 (exhaustive) |
+| Race conditions | Sequential temporal reasoning | GPT-5 + Opus |
+| Cross-component interactions | Component boundary reasoning | GPT-5 + Sonnet |
+| Design coherence | Internal consistency checking | Opus |
+| Invariant violation paths | Construction + verification | GPT-5 (precision) |
+| Silent correctness | External requirement matching | Opus |
+| **Regulatory compliance** | **Dual-knowledge-base comparison** | **GPT-5 (breadth) + Opus (operations)** |
+
+Regulatory compliance is closest to "silent correctness" (Finding #22) in that
+both require reasoning about external requirements. The key difference:
+- Silent correctness asks "does this produce correct outputs for all inputs?"
+- Regulatory compliance asks "does this implement the law correctly?"
+
+Both favor models that reason about the system's relationship to the outside
+world (Opus's strength), but regulatory compliance also rewards breadth of
+statutory knowledge (GPT-5's strength). The combination produces the most
+complete picture.
+
+**Practical implication:**
+For regulatory compliance review of financial systems:
+- Run GPT-5 for exhaustive statutory coverage (finds the most gaps)
+- Run Opus for operational impact analysis (finds how gaps manifest in practice)
+- Sonnet adds marginal value — use only if budget allows
+- GPT-5's unique strength: identifying correctness bugs in implemented logic
+  (not just missing features)
+- Opus's unique strength: identifying timing/workflow issues (year-end, form
+  reporting, reconciliation with broker)
@@ -0,0 +1,152 @@
+# Finding 24: Design improvement proposals: GPT-5 excels at defense-in-depth thinking; Opus finds subtle design contradictions; Sonnet produces generic recommendations
+
+**Date:** 2026-05-05
+**Task:** Propose specific design improvements for gargoyle's `kill-switch.md` (185 lines)
+— the primary safety mechanism that prevents rogue orders. NEW task type: generative/
+creative ("what would you improve?") rather than purely analytical ("what's wrong?").
+**How we used them:** Same document (full text) + same focused prompt to all 3 models
+via HAI proxy. Prompt asked for 8-15 specific improvements with: weakness, proposed
+change (concrete), tradeoff, severity rating. Explicitly excluded generic advice
+("add more tests") and asked about runtime assumptions. No tools, no project context.
+
+| Model | Time | Output tokens | Reasoning tokens | Improvements proposed |
+|---|---|---|---|---|
+| GPT-5 | 118s | 8,710 | 6,016 | 15 |
+| Claude Opus 4.6 | 127s | 4,985 | (internal) | 15 |
+| Claude Sonnet 4.6 | 40s | 1,636 | (internal) | 12 |
+
+**What they found — common ground (all 3 identified):**
+- DB write failure blocking engagement (fail-open under DB outage) — all three
+  proposed in-memory-first engagement with async persistence
+- Kill switch process liveness monitoring (heartbeat/watchdog)
+- Broker connectivity loss during cancellation operations
+- ETS table ownership and crash-window vulnerability
+- Supervisor restart suppression as unstated mechanism
+- Per-venue/per-broker scope extension
+
+**GPT-5 unique findings (not in either other model):**
+- **Infrastructure-level "hard kill"** — egress proxy or service mesh that blocks
+  broker traffic independently of the application. Belt-and-suspenders approach
+  where the kill switch works even if the entire BEAM VM is unresponsive. This
+  was GPT-5's highest-impact unique insight.
+- **Kill fence token (epoch)** — every order-carrying message includes an epoch;
+  stale-epoch messages are dropped at the gate. Elegantly solves in-flight
+  messages without needing drain timeouts.
+- **Cluster/multi-node propagation** — detailed leader election + epoch broadcast
+  + fail-closed on partition design.
+- **Post-engage broker verification** — query broker AFTER engaging to confirm no
+  orders slipped through during the engagement window.
+- **Liquidation exposure validation** — proving tagged liquidation orders actually
+  REDUCE exposure rather than trusting the tag.
+- **Recovery/cold-start order suppression** — ensuring reconciliation/recovery
+  routines can't submit orders while engaged.
+- **Engage latency reordering** — ETS first, terminate second, DB async.
+- **Audit log tamper evidence** — append-only external sink + hash chain.
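+
+The kill fence token idea can be sketched roughly as follows. This is a Python
+illustration, not the kill switch's actual design: `KillFence`, `OrderMessage`,
+and their methods are hypothetical names, and the sketch covers only the
+stale-epoch drop for in-flight messages. Blocking new orders while engaged is
+assumed to be handled by the kill switch gate itself.
+

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class OrderMessage:
    """Hypothetical order-carrying message stamped with the epoch seen at creation."""
    order_id: str
    epoch: int


class KillFence:
    """Hypothetical gate that drops messages created before the latest engagement."""

    def __init__(self) -> None:
        self._epoch = 0
        self._lock = threading.Lock()

    def current_epoch(self) -> int:
        with self._lock:
            return self._epoch

    def engage(self) -> int:
        # Bumping the epoch invalidates every message stamped with an older one,
        # so no drain timeout is needed for in-flight messages.
        with self._lock:
            self._epoch += 1
            return self._epoch

    def admit(self, msg: OrderMessage) -> bool:
        # Stale-epoch messages are rejected at the gate.
        with self._lock:
            return msg.epoch == self._epoch


if __name__ == "__main__":
    fence = KillFence()
    in_flight = OrderMessage("order-1", fence.current_epoch())
    print(fence.admit(in_flight))
    fence.engage()
    print(fence.admit(in_flight))
```
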
+
+**Claude Opus unique findings (not in either other model):**
+- **Ordering contradiction in engagement sequence** — identified that the
+  documented order (DB → ETS → terminate) creates a specific risk if a crash
+  occurs BETWEEN termination and ETS update (not just DB failure). The insight
+  is about the window where termination has started but gate is still open.
+  More subtle than GPT-5's version (which focused on DB-blocking-engage).
+- **Concurrent engagement race (mode escalation)** — multiple triggers
+  simultaneously issuing conflicting modes (RESTRICT vs LIQUIDATE). Proposed
+  explicit escalation rules (LIQUIDATE always wins) with GenServer serialization.
+- **Shared resources under per-user scope** — per-user kill switch doesn't
+  address orders in shared broker connection buffers. Forces architectural
+  decision about connection pooling strategy.
+- **Clock/time integrity for audit log** — monotonic counters + NTP validation
+  for forensic reliability.
+- **Partial multi-user engagement failures** — what happens when global engage
+  successfully terminates 4/5 user pipelines but one has orphaned processes.
+- **Liquidation direction validation** — similar to GPT-5's exposure validation
+  but framed differently: checking corrupted position records could cause
+  liquidation to OPEN positions rather than close them.
+- **Process termination verification** — checking that `:kill` signals actually
+  worked (defense against trap_exit, NIF blocking).
+- **Engagement latency SLA** — defining a 50ms target with monitoring/alerting.
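+
+A minimal Python sketch of the "LIQUIDATE always wins" escalation rule from the
+concurrent engagement finding. `KillSwitchState` and the `DISENGAGED` mode are
+hypothetical; RESTRICT and LIQUIDATE are the modes named above. The lock stands
+in for the GenServer serialization Opus proposed and is not how the real system
+would serialize requests.
+

```python
import threading
from enum import IntEnum


class Mode(IntEnum):
    DISENGAGED = 0  # hypothetical baseline state, not named in the finding
    RESTRICT = 1
    LIQUIDATE = 2


class KillSwitchState:
    """Hypothetical holder for the current mode; requests are applied one at a time."""

    def __init__(self) -> None:
        self._mode = Mode.DISENGAGED
        self._lock = threading.Lock()  # stand-in for GenServer serialization

    def engage(self, requested: Mode) -> Mode:
        # Escalation only: a later, weaker request never downgrades the mode.
        with self._lock:
            self._mode = max(self._mode, requested)
            return self._mode


if __name__ == "__main__":
    state = KillSwitchState()
    print(state.engage(Mode.LIQUIDATE).name)
    print(state.engage(Mode.RESTRICT).name)
```

+
+With this rule the outcome no longer depends on which trigger arrives first.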
+
+**Claude Sonnet findings (all also present in GPT-5 or Opus, differently framed):**
+- No genuinely unique improvements that GPT-5 or Opus didn't also identify.
+- Several were generic: "missing resource cleanup," "circuit breaker integration,"
+  "performance monitoring" — exactly the kind of advice the prompt tried to
+  exclude.
+- The "missing heartbeat" and "network partition handling" proposals were solid
+  but less detailed than the corresponding GPT-5/Opus versions.
+
+**Quality assessment:**
+- **GPT-5** produced the most ACTIONABLE improvements. Its proposals were
+  architecturally concrete ("add an egress proxy," "use kill epochs in messages,"
+  "query broker post-engage") and showed defense-in-depth thinking — multiple
+  independent layers rather than fixing one path. The infrastructure kill (#2)
+  is genuinely novel: no other model proposed going OUTSIDE the application
+  boundary for safety enforcement. GPT-5 consistently thought about "what if
+  this entire runtime is compromised?" rather than just fixing within-app paths.
+- **Claude Opus** produced equally numerous improvements (15) with characteristic
+  precision about failure SEQUENCES. Its unique strength: identifying design
+  contradictions rather than just gaps (the engagement ordering issue, concurrent
+  mode escalation, shared-resource scope mismatch). Opus's proposals were more
+  "fix the design tension" while GPT-5's were more "add another safety layer."
+  Opus also included the process termination verification and engagement latency
+  SLA — operational rigor that GPT-5 skipped.
+- **Claude Sonnet** produced 12 proposals in 40s (fast) but quality was notably
+  lower. Several proposals were generic software engineering advice that the
+  prompt explicitly excluded ("add performance monitoring," "resource cleanup").
+  No unique insights emerged. Sonnet's proposals lacked the architectural depth
+  of GPT-5 (no outside-the-application thinking) and the design-tension
+  identification of Opus.
+
+**Key insight — generative vs analytical tasks:**
+
+This is the first experiment testing a GENERATIVE task ("propose improvements")
+rather than a purely analytical one ("find problems"). The results reveal:
+
+1. **GPT-5's defense-in-depth thinking is unique.** In analytical tasks, GPT-5
+   finds exhaustive lists of issues. In generative tasks, it proposes LAYERED
+   solutions — multiple independent mechanisms that each catch what the others
+   miss. The infrastructure kill proposal (external to the application) shows
+   GPT-5 reasoning about failure modes that are invisible to within-app analysis.
+
+2. **Opus's design-tension identification transfers to improvement proposals.**
+   In analytical tasks, Opus finds where parts of a design contradict each other.
+   In generative tasks, this manifests as proposals that RESOLVE tensions rather
+   than just adding patches. The engagement ordering contradiction and mode
+   escalation rules are both "this design says X but the mechanism allows Y —
+   here's how to make them consistent."
+
+3. **Sonnet doesn't transfer well to generative tasks.** In analytical tasks
+   (assumption-finding, cross-component analysis), Sonnet performs well (85% of
+   GPT-5 in some experiments). In generative tasks, it falls back to generic
+   engineering advice. The task requires both identifying problems AND proposing
+   concrete solutions — Sonnet handles the first step but not the second with
+   sufficient depth.
+
+**Comparison to analytical task performance:**
+
+| Task type | GPT-5 character | Opus character | Sonnet character |
+|---|---|---|---|
+| Assumption-finding (#10-12) | Exhaustive breadth | Design tensions | Good (85% of GPT-5) |
+| Race conditions (#13) | Technical precision | Design contradictions | Weak (errors) |
+| Invariant violations (#20) | Maximum selectivity | Self-correcting depth | Imprecise |
+| **Design improvements (#24)** | **Defense-in-depth layers** | **Tension resolution** | **Generic advice** |
+
+The generative task reveals each model's reasoning STYLE more clearly than
+analytical tasks do. GPT-5 constructs multi-layered solutions. Opus identifies
+what a design SHOULD be (not just what's wrong). Sonnet pattern-matches against
+known engineering practices without deep synthesis.
+
+**Practical implication:**
+
+For design improvement sessions on safety-critical systems:
+- Run GPT-5 for defense-in-depth proposals ("what layers should exist?")
+- Run Opus for design consistency proposals ("where does the design contradict itself?")
+- Skip Sonnet — its output is indistinguishable from generic checklists
+- The combination of GPT-5 + Opus produces complementary improvements: GPT-5 adds
+  safety layers, Opus fixes internal contradictions. Together they address both
+  "not enough protection" and "protection mechanisms that work against each other."
+
+**Cost analysis:**
+GPT-5: 118s, ~10.9K tokens (6K reasoning). Opus: 127s, ~5K tokens. Sonnet: 40s, ~1.6K tokens.
+For a safety-critical design review, running GPT-5 + Opus costs ~16K tokens and produces
+30 improvements with near-zero overlap in unique insights. Excellent ROI for a kill switch
+design that protects real money.
@@ -0,0 +1,154 @@
+# Finding 25: Contradiction detection: NEW task type — Opus excels at finding genuine contradictions with precision; GPT-5 is thorough but spends heavily on reasoning; Sonnet finds surface-level issues quickly
+
+**Date:** 2026-05-05
+**Task:** Identify internal contradictions, logical inconsistencies, and conflicting rules
+in gargoyle's `order-state-machine.md` (311 lines) — a document defining states,
+transitions, invariants, fill precedence rules, and time-in-force behavior.
+**How we used them:** Same document (full text) + same focused analytical question to all
+3 models via HAI proxy. Prompt specifically asked for: state machine contradictions,
+semantic conflicts, rule violations, implicit contradictions, and terminology
+inconsistencies. Required each finding to quote the conflicting statements, explain
+the logical argument, assign severity, and recommend which statement should "win."
+No tools, no project context beyond the document itself.
+
+| Model | Time | Output tokens | Reasoning tokens | Contradictions found |
+|---|---|---|---|---|
+| GPT-5 | 162s | 12,074 | 11,008 | 4 |
+| Claude Opus 4.6 | 41s | 2,056 | (internal) | 6 |
+| Claude Sonnet 4.6 | 17s | 826 | (internal) | 4 |
+
+**What they found — common ground (2+ models identified):**
+
+- **Missing `pending_cancel → partially_filled` revert transition** (GPT-5 #1 +
+  Opus #2 + Sonnet partial): The "Rejection reverts" invariant states orders return
+  to their "pre-modification state (`working` or `partially_filled`)", but the state
+  diagram only shows `pending_cancel → working` for cancel rejection — no path back
+  to `partially_filled`. All models correctly identified this as the diagram being
+  incomplete relative to the stated invariant. GPT-5 and Opus rated CRITICAL.
+- **Same issue for `pending_replace` revert** (GPT-5 #1 + Opus #3): The state diagram
+  only shows `pending_replace → working` for replace rejection, but a replace
+  requested from `partially_filled` should revert to `partially_filled`. Same root
+  cause as above, just the replace variant.
+- **FOK "never partially fills" vs state machine allowing it** (GPT-5 #2 + Opus #4):
+  The TIF table says FOK "never partially fills" but the state machine has no guards
+  preventing FOK orders from reaching `partially_filled`. Both correctly noted this
+  is a broker-enforced guarantee but the document presents it as system-level.
+- **`rejection_reason` described as "broker-provided" but local rejections exist**
+  (GPT-5 #4 + Opus #5 + Sonnet): `pending → rejected` is "local validation failure"
+  with no broker interaction, but the field says "Broker-provided reason when
+  rejected." All three caught this terminology inconsistency.
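+
+The "rejection reverts" invariant behind the first two items can be sketched as
+follows. This is a Python illustration under stated assumptions: the `Order`
+class, its `pre_modification_state` field, and the method names are hypothetical,
+and fills arriving while a modification is pending are deliberately left out.
+Only the state names come from the document under review.
+

```python
from dataclasses import dataclass
from typing import Optional

# States from which a cancel/replace can be requested, per the invariant.
REVERTIBLE = {"working", "partially_filled"}
PENDING_FOR = {"cancel": "pending_cancel", "replace": "pending_replace"}


@dataclass
class Order:
    """Hypothetical order record that remembers where a modification started."""
    state: str = "working"
    pre_modification_state: Optional[str] = None

    def request(self, kind: str) -> None:
        if self.state not in REVERTIBLE:
            raise ValueError(f"cannot request {kind} from {self.state}")
        self.pre_modification_state = self.state
        self.state = PENDING_FOR[kind]

    def reject_modification(self) -> None:
        if (self.state not in PENDING_FOR.values()
                or self.pre_modification_state is None):
            raise ValueError(f"no pending modification in state {self.state}")
        # Revert to the recorded state, not unconditionally to "working".
        self.state = self.pre_modification_state
        self.pre_modification_state = None


if __name__ == "__main__":
    order = Order(state="partially_filled")
    order.request("cancel")
    order.reject_modification()
    print(order.state)
```

+
+A diagram with only `pending_cancel → working` corresponds to hard-coding the
+revert target, which is the gap the models flagged.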
+
+**GPT-5 unique findings (not in either other model):**
+
+- **IOC valid terminal states exclude `expired` vs generic expiry transitions** (#3):
+  IOC should never reach `expired` (unfilled portion is cancelled immediately), but
+  the state diagram allows any order to transition to `expired` without TIF guards.
+  Well-reasoned extension of the FOK finding to IOC semantics. GPT-5 correctly
+  identified that broker "expired-like" outcomes should map to `cancelled` for IOC.
+
+**Claude Opus unique findings (not in either other model):**
+
+- **Terminal states that aren't terminal — the `partially_filled` re-entry problem**
+  (#1): Opus identified the DEEPER contradiction beyond the surface-level "cancelled
+  states have outgoing transitions." When `cancelled → partially_filled` fires via
+  late fill, the order is now non-terminal with NO defined mechanism to re-terminate
+  if no further fills arrive. The order is stuck in `partially_filled` indefinitely.
+  This goes beyond "the diagram contradicts the definition of terminal" to "the fill
+  precedence rule creates an unspecified operational scenario." This is the most
+  architecturally significant finding across all three models.
+- **Fill precedence label misapplication to non-terminal states** (#6): The state
+  diagram labels transitions from `pending_cancel → partially_filled` and
+  `pending_replace → partially_filled` as "fill precedence," but the Fill
+  Precedence Rule explicitly defines itself as overriding TERMINAL states.
+  `pending_cancel` is non-terminal. The label conflates two different mechanisms
+  (fill during pending modification vs. fill overriding terminal state), which
+  could cause implementers to use the same code path for fundamentally different
+  scenarios.
+
+**Claude Sonnet unique findings (not in either other model):**
+
+- **State diagram terminal arrow contradiction** (#1): Sonnet was the only model to
+  explicitly note that the Mermaid diagram shows `cancelled → [*]` (terminal arrow)
+  while simultaneously showing `cancelled → partially_filled` (outgoing transition).
+  A valid observation but more surface-level than Opus's deeper analysis of the same
+  phenomenon.
+- **Pending replace fill logic error** (#3): Sonnet argued that receiving a fill
+  during `pending_replace` creates a logical impossibility because the order
+  parameters are in flux. This is WRONG — fills always apply to current parameters
+  (the replace hasn't been confirmed yet), and the document actually handles this
+  correctly. This is a FALSE POSITIVE from Sonnet.
+
+**Quality assessment:**
+
+- **Claude Opus** was the clear winner for this task. Found the most contradictions
+  (6), had the highest precision (0 false positives), and — crucially — found
+  qualitatively deeper issues. The `partially_filled` re-entry problem (#1) isn't
+  just "the diagram has a missing transition" but "the fill precedence rule creates
+  an unresolvable operational state." The fill precedence label misapplication (#6)
+  identifies a conceptual confusion that would genuinely cause implementation bugs.
+  Opus completed in only 41s with 2,056 output tokens — by far the most efficient.
+- **GPT-5** found 4 genuine contradictions with 0 false positives but spent an
+  extraordinary amount of reasoning tokens (11,008) for modest output (1,066 visible
+  content tokens, 10.3:1 reasoning ratio). The IOC finding was unique and valuable.
+  But the cost is disproportionate: 162s and 12K tokens for 4 findings vs Opus's
+  41s and 2K tokens for 6 findings. GPT-5's reasoning budget seems to have been
+  mostly spent on VERIFICATION (confirming each finding is genuine), consistent
+  with Finding #20's observation.
+- **Claude Sonnet** was fastest (17s) and found 4 items, but one was a false positive
+  (the pending_replace logic error claim is incorrect). That gives it a precision of
+  75% (3/4 genuine), the lowest of the three. Its genuine findings all covered ground
+  the other models also reached (its terminal-arrow observation is a shallower view
+  of the phenomenon Opus analyzed in #1), so it made no unique true contribution.
+  Sonnet appears to trade accuracy for speed on contradiction detection.
+
+**Key insight — contradiction detection favors precision-oriented models:**
+
+This task is fundamentally about LOGICAL ARGUMENTATION: proving that two statements
+cannot both be true. Unlike assumption-finding (which is about imagining what could go
+wrong) or gap-finding (which is about identifying missing content), contradiction
+detection requires the model to:
+1. Hold two statements in working memory simultaneously
+2. Construct a formal argument for why they conflict
+3. NOT get confused by statements that SEEM contradictory but are actually consistent
+
+Requirement #3 is where models diverge. Sonnet produced a false positive because it
+didn't fully reason through whether the pending_replace fill scenario is actually
+inconsistent (it isn't — current parameters apply). Opus avoided this trap entirely
+and additionally found DEEPER contradictions that require multi-step logical reasoning
+(the re-entry problem, the label misapplication). GPT-5 also avoided false positives
+but at massive computational cost.
+
+**Opus's efficiency advantage:**
+This is the first task where Opus is not just qualitatively better but also
+quantitatively more efficient. 6 findings in 41s and 2K tokens vs GPT-5's 4 findings
+in 162s and 12K tokens. That's roughly 9x more findings per output token and 4x
+faster. For contradiction detection specifically, Opus appears to have a structural
+advantage, possibly because its internal reasoning is better calibrated for logical
+argumentation than GPT-5's externalized reasoning chain.
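+
+As a quick check of the efficiency comparison, the ratios can be recomputed from
+the results table above. The snippet below is a Python sketch; the dictionary
+layout and function name are illustrative, and the figures are copied from the
+table (GPT-5's output tokens include its reasoning tokens).
+

```python
# Figures copied from the results table for this experiment.
runs = {
    "GPT-5": {"seconds": 162, "output_tokens": 12_074, "findings": 4},
    "Opus": {"seconds": 41, "output_tokens": 2_056, "findings": 6},
}


def findings_per_1k_tokens(run: dict) -> float:
    """Findings produced per 1,000 output tokens."""
    return 1000 * run["findings"] / run["output_tokens"]


# How many times more findings per token Opus produced than GPT-5.
token_ratio = findings_per_1k_tokens(runs["Opus"]) / findings_per_1k_tokens(runs["GPT-5"])
# How many times faster Opus finished.
speed_ratio = runs["GPT-5"]["seconds"] / runs["Opus"]["seconds"]

print(f"findings per token, Opus vs GPT-5: {token_ratio:.1f}x")
print(f"wall-clock speed, Opus vs GPT-5: {speed_ratio:.1f}x")
```
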
+
+**Comparison to Finding #20 (invariant violation paths):**
+In Finding #20, GPT-5 was maximally selective (3 findings, all genuine, 15:1
+reasoning ratio). Here, GPT-5 shows the same pattern: few findings, all genuine,
+high reasoning ratio (10.3:1). The difference: in #20, GPT-5's selectivity meant
+it found UNIQUE violations others missed. Here, three of GPT-5's four findings
+were also found by Opus; only the IOC finding (#3) was unique, while Opus found
+2 that GPT-5 missed. GPT-5's high verification bar helps less when Opus is ALSO
+precise AND more thorough.
+
+**Updated task-model assignment:**
+
+For contradiction/consistency checking:
+1. **Opus** — best choice: highest precision, deepest contradictions, most efficient
+2. **GPT-5** — solid backup: zero false positives, unique TIF-related insights, but
+   expensive and slower
+3. **Sonnet** — NOT recommended for this task: produces false positives, no unique
+   true contributions
+
+This confirms the emerging pattern: each model has task types where it excels.
+Opus excels at logical argumentation and design tensions. GPT-5 excels at
+exhaustive enumeration and operational concerns. Sonnet excels at speed and
+structural/assumption analysis but struggles with tasks requiring formal logical
+reasoning (contradiction detection, concurrency analysis per Finding #13).
+
+**Practical implication:** When reviewing architecture documents for internal
+consistency (e.g., before implementation begins), run Opus. If budget allows,
+add GPT-5 for TIF/edge-case coverage. Skip Sonnet for consistency checking —
+its speed advantage is negated by the false positive risk.
@@ -0,0 +1,158 @@
+# Finding 26: Missing-feature identification IS promptable across all models; prompt framing eliminates Opus's historical advantage — all three models find regulatory gaps when explicitly asked
+
+**Date:** 2026-05-05
+**Task:** Identify computations, behaviors, or features that gargoyle's
+`corporate-actions.md` (992 lines) SHOULD perform for financial correctness,
+regulatory compliance, or operational safety — but doesn't describe.
+**How we used them:** Same document (full text) + same focused analytical
+prompt to all 3 models via HAI proxy. Prompt explicitly structured around 5
+categories: missing computations, missing behaviors, missing validations,
+missing integrations, and regulatory gaps. Required concrete findings with
+severity. No tools, no project context beyond the document. GPT-5 via
+OpenAI endpoint (16K max_completion_tokens), Opus 4.6 and Sonnet 4.6 via
+Anthropic endpoint (8K max_tokens).
+
+| Model | Output tokens | Reasoning tokens | Findings | Critical | High | Medium |
+|---|---|---|---|---|---|---|
+| GPT-5 | 11,354 | 8,512 | 20 | 3 | 10 | 7 |
+| Claude Opus 4.6 | 4,111 | (internal) | 23 | 6 | 10 | 7 |
+| Claude Sonnet 4.6 | 4,686 | (internal) | 15 | 5 | 6 | 4 |
+
+**What they found — common ground (all 3 identified):**
+- Wash sale rule interaction with CA-driven lot closures (IRC §1091)
+- Short position treatment for corporate actions
+- Same-day corporate action ordering beyond `recorded_at` timestamp
+- Record date / ex-date position verification (entitlement timing)
+- Idempotency guard preventing double-application per user
+- Decimal precision/rounding policy unspecified
+- Superseded CA status has no lot rollback mechanism
+- Rights/warrants post-creation lifecycle (exercise/expiration)
+- Basis preservation invariant has no runtime enforcement
+- Manual entry authorization and audit trail
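+
+The idempotency guard that all three models asked for could look roughly like
+the Python sketch below. `CorporateActionApplicator`, its method, and the
+in-memory set are hypothetical and for illustration only. A real system would
+more likely enforce this with a persistent uniqueness constraint, which the
+document under review does not describe.
+

```python
from typing import Callable, Set, Tuple


class CorporateActionApplicator:
    """Hypothetical applicator that refuses to apply the same action twice per user."""

    def __init__(self) -> None:
        # (corporate_action_id, user_id) pairs that have already been applied.
        self._applied: Set[Tuple[str, str]] = set()

    def apply(self, action_id: str, user_id: str,
              apply_fn: Callable[[], None]) -> bool:
        key = (action_id, user_id)
        if key in self._applied:
            return False  # already applied for this user: skip
        apply_fn()
        # Record only after apply_fn succeeds, so a failed attempt can be retried.
        self._applied.add(key)
        return True


if __name__ == "__main__":
    applied = []
    applicator = CorporateActionApplicator()
    print(applicator.apply("ca-1", "user-1", lambda: applied.append("split")))
    print(applicator.apply("ca-1", "user-1", lambda: applied.append("split")))
    print(len(applied))
```
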
+
+**GPT-5 unique findings (not in either Claude model):**
+- Per-lot eligibility based on entitlement date (not just user-level)
+- Election-based outcomes for shareholder choices (cash vs stock)
+- Instrument-level trading hold during CA application window
+- Pre-application consistency checks against broker entitlements
+- DB-level enforcement of status transitions and invariants
+- Action-type-specific date semantics per field (ex vs record vs payable)
+- Voluntary/tender actions beyond distributions
+- Backfill/initialization guard for newly onboarded users
+- Applicator retry/backoff semantics and confirmation race
+- Rights indivisibility constraints vs exact Decimal quantities
+
+**Claude Opus unique findings (not in either other model):**
+- Pending order PRICE adjustment after splits (not just cancellation)
+- Multi-instrument position recalculation atomicity for mergers
+- Mixed merger basis floor at zero (can produce negative basis)
+- Tax lot identification method interaction with inherited dates
+- Corporate action effect on strategy position limits/risk params
+- Corporate actions on instruments not yet in the database
+- Partial application window: new user acquires position mid-fan-out
+- IRC §305(c) deemed distributions (taxable stock dividends)
+- CA impact on unrealized P&L display and strategy evaluation
+- Concurrent OrderManager startup + Applicator fan-out race
+
+**Claude Sonnet unique findings (not in either other model):**
+- Stale orders: failure modes table contradicts "excluded" section
+- IRC §1223(1) holding period tacking verification at lot close
+- Spinoff allocation percentage: no validation that child != parent instrument
+- Combined spinoff allocations exceeding meaningful bounds
+- Cash dividend bypasses OrderManager — record-date quantity snapshot lost
+- Mixed merger large-denominator exchange ratio overflow
+- Detector schedule: no intraday re-poll for same-day announcements
+- ROC cumulative basis floor tracking and IRC §301(c)(3) distinction
+- Mixed merger deferred loss not explicitly recorded in metadata
+
+**Quality assessment:**
+- **Claude Opus** was the MOST PROLIFIC (23 findings) — a notable inversion
+  from previous experiments where Opus typically found fewer but deeper
+  findings. Here, the explicit "missing feature" framing appears to have
+  unlocked Opus's breadth. Its unique findings included genuinely critical
+  items: pending order price adjustment after splits (Critical — direct
+  financial loss), multi-instrument atomicity for mergers (Critical —
+  position loss), and mixed merger negative basis (High — accounting
+  corruption). The findings were precise, well-reasoned, and showed both
+  regulatory depth (IRC §305(c)) and operational awareness.
+- **GPT-5** was slightly less prolific (20 findings) but maintained its
+  characteristic breadth and operational-level thinking. Per-lot eligibility
+  (not just per-user) is a subtle but important distinction. The election-
+  based outcomes finding shows awareness of real-world corporate action
+  complexity. The backfill/initialization guard is operationally significant.
+  GPT-5 spent 8,512 reasoning tokens — moderate for its output volume.
+- **Claude Sonnet** found fewer gaps (15) but several were genuinely
+  insightful. The internal contradiction between the failure modes table
+  and the "excluded" section is a real document inconsistency. The cash
+  dividend record-date quantity snapshot insight (#9) identifies a DATA LOSS
+  problem — the opportunity to capture that data expires. The mixed merger
+  deferred loss recording gap shows regulatory awareness. However, some
+  findings were more surface-level or overlapped heavily with the others.
+
+**KEY INSIGHT — The original question from Finding #22 is ANSWERED:**
+
+> "Opus's 'missing feature identification' mode (wash sales, commissions) —
+> is this promptable on other models? Could we explicitly ask GPT-5 'what
+> should this system compute but doesn't' and get similar results?"
+
+**YES.** When explicitly prompted with a structured "missing feature"
+framing, ALL three models found regulatory gaps (wash sales, IRC sections),
+missing computations (basis calculations, rounding), and missing behaviors
+(lifecycle events, notifications). GPT-5 produced findings in the same
+*category* as what Opus uniquely found in Finding #22 (silent correctness
+failures on specid-lot-selection.md).
+
+In Finding #22, Opus uniquely identified wash sales and commission tracking
+as missing features while GPT-5 focused on mechanism incorrectness and
+Sonnet on composition failures. HERE, with the explicit "what's missing"
+prompt, ALL three models found wash sales, ALL found regulatory gaps, and
+ALL found missing behaviors.
+
+**This confirms:** Opus's "missing feature identification" mode in Finding
+#22 was NOT an inherent model capability — it was an emergent behavior from
+the open-ended "silent correctness failures" prompt. When you give ALL models
+the EXPLICIT instruction to look for missing features, they all do it. The
+differentiation in #22 was caused by that prompt being more open-ended,
+allowing each model to default to its natural analytical mode:
+- Opus → "what's missing" (features/functionality)
+- GPT-5 → "what's wrong" (mechanism failures)
+- Sonnet → "what breaks when combined" (composition)
+
+**Prompt framing dominates model personality.** With the right prompt,
+any model can be directed into any analytical mode. The model differences
+that emerged in earlier open-ended experiments reflect DEFAULT TENDENCIES,
+not capabilities.
+
+**NEW finding about Opus on complex documents:**
+Opus produced MORE findings than GPT-5 (23 vs 20) — the first time this
+has happened on a broad analytical task. Previous pattern: GPT-5 always
+finds more (20-33 findings) while Opus finds fewer but deeper (7-13).
+What changed? The document is 992 lines — the longest tested — and the
+task is explicitly about breadth ("find all gaps"). On this specific
+combination (long document + breadth-focused prompt), Opus appears to
+allocate its internal reasoning budget toward exploration rather than
+its usual depth-first design-tension mode. This suggests Opus's typical
+"fewer but deeper" pattern is partially a RESPONSE to shorter documents
+where depth is more productive than breadth.
+
+**Practical implications:**
+1. For missing-feature analysis: prompt structure matters more than model
+   choice. All three models are viable. Use the explicit 5-category prompt.
+2. Run all three for critical docs — they find different specific gaps
+   despite finding the same categories.
+3. For open-ended analysis where you want models to find DIFFERENT things:
+   use open-ended prompts. For analysis where you want COMPREHENSIVE
+   coverage of one type: use structured prompts.
+4. Opus's "fewer but deeper" personality can be overridden by document
+   length + breadth-focused prompt. On the one 992-line doc tested here,
+   it competed on volume with GPT-5.
+
+**Cost-effectiveness:**
+- Opus: 4,111 output tokens for 23 findings = 179 tokens/finding
+- GPT-5: 11,354 output tokens (+ 8,512 reasoning) for 20 findings = 993 tokens/finding
+- Sonnet: 4,686 output tokens for 15 findings = 312 tokens/finding
+
+Opus is by far the most efficient: about 5.5x fewer tokens per finding than
+GPT-5 (179 vs 993), with MORE findings. This is the strongest
+cost-effectiveness case for Opus on any tested task. On long documents with
+breadth-focused prompts, Opus appears to be the optimal choice for both
+quality AND efficiency.
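+
+As a check on the arithmetic above: the per-finding figures appear to add
+GPT-5's reported reasoning tokens to its output tokens and to use output
+tokens alone for Opus and Sonnet. Under that reading (an inference from the
+numbers, not something stated in the notes), the figures work out as:
+
+```latex
+\text{tokens per finding} = \frac{\text{output tokens} + \text{reported reasoning tokens}}{\text{findings}}
+
+\text{Opus: } \frac{4{,}111}{23} \approx 179 \qquad
+\text{GPT-5: } \frac{11{,}354 + 8{,}512}{20} \approx 993 \qquad
+\text{Sonnet: } \frac{4{,}686}{15} \approx 312
+
+\frac{993}{179} \approx 5.5
+```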
@@ -0,0 +1,276 @@
+# Finding 27: Design coherence on risk-controls.md: GPT-5 regains top position; Opus's advantage from Finding #15 was document-specific, not task-specific
+
+**Date:** 2026-05-05
+**Task:** Identify internal design incoherences in gargoyle's `risk-controls.md` (277 lines)
+— a pre-trade risk control specification covering two evaluation stages, reduction semantics,
+ordering rationale, fail-closed claims, and audit logging.
+**How we used them:** Same document (full text) + same focused analytical question to all
+3 models via HAI proxy. Highly structured prompt specifying 5 categories of incoherence
+(safety properties not enforced, ordering/sequencing contradictions, reduction semantics
+conflicts, fail-closed claims vs actual behavior, cross-stage inconsistencies). Required
+each finding to reference specific contradictory parts. No tools, no project context beyond
+the document itself.
+
+| Model | Time | Output tokens | Reasoning tokens | Incoherences found | Critical | High | Medium |
+|---|---|---|---|---|---|---|---|
+| GPT-5 | 112s | 8,231 | 7,232 | 6 | 1 | 3 | 2 |
+| Claude Opus 4.6 | 41s | 1,858 | (internal) | 5 | 2 | 2 | 1 |
+| Claude Sonnet 4.6 | 15s | 699 | (internal) | 4 | 1 | 2 | 1 |
+
+**What they found — common ground (all 3 identified):**
+- Reduction re-entry at BuyingPower contradicts "reducing controls never re-enter
+  earlier controls" (all three flagged this as the most obvious contradiction —
+  Concentration at position 5 reduces, re-enters at BuyingPower at position 4,
+  which IS an earlier control)
+- Ordering rationale's categorization of buying power/concentration is internally
+  confused (the doc labels both as "quantity-sensitive checks" that run after
+  reducing controls, but concentration IS a reducing control at position 5 while
+  buying power at position 4 sits between the two reducing controls)
+
+**GPT-5 unique findings (not in either Claude model):**
+- Signal-level "no portfolio context" contradicts NoShortSales requiring knowledge
+  of current positions. The doc explicitly states signals are evaluated "in isolation"
+  with "no portfolio context — only the signal itself and user settings" — but checking
+  whether the user holds a position IS portfolio context. This is a genuine design
+  tension: either SignalRisk has hidden portfolio access (violating isolation) or
+  NoShortSales can't actually work as specified.
+- Settings "fall through to system defaults" vs "Settings cache miss → reject."
+  Two incompatible instructions for the same condition (missing settings).
+- "Universal fail-closed" with "only exception is order rate window" contradicted
+  by Failure Modes table showing buying power as another exception ("Conservative
+  estimate; may over-reject" is NOT rejection — it's a different failure mode than
+  either fail-closed or the documented single exception).
+- Audit model says "every control evaluation produces an audit entry regardless of
+  outcome" but the signal-stage write point only describes writing on rejection.
+  Passing signals produce no documented audit entry at the signal stage.
+
+**Claude Opus unique findings (not in either other model):**
+- Signal flow diagram swaps control order vs table: table shows (1) MarketHours,
+  (2) PerTradeStop, (3) NoShortSales, but diagram flows MarketHours → NoShortSales
+  → PerTradeStopLoss. Controls 2 and 3 are reversed between the two representations.
+  (VERIFIED: this is correct — the diagram does show a different order.)
+- Concentration re-entry loop can bypass Order Rate, Duplicate, Self-Trade, and
+  Fat Finger entirely during intermediate iterations. Also: Position Size at order 3
+  is never re-checked against Concentration-reduced quantity because re-entry starts
+  at BuyingPower (order 4), meaning "smaller quantity wins" semantics are implemented
+  differently than the linear model described in Reduction Semantics.
+
+**Claude Sonnet unique findings (not in either other model):**
+- Buying Power "Can reduce? No" creates a dead end: if a reduced quantity still
+  exceeds buying power, the system can only reject entirely (no mechanism to further
+  optimize), defeating the purpose of the reduction system for capital-limited users.
+  (NOTE: this is more of a design limitation than a self-contradiction, but the
+  framing — that the reduction system's purpose is undermined by buying power's
+  inability to reduce — is a legitimate coherence observation.)
+
+**Quality assessment:**
+- **GPT-5** produced the most findings (6) with the broadest coverage across the
+  prompt's 5 categories. The NoShortSales/portfolio-context finding is the most
+  genuinely insightful — it's a fundamental design-level contradiction (a signal-level
+  control that REQUIRES decision-level context). The settings contradiction and
+  audit logging inconsistency are also solid. Every finding points to two specific
+  textual statements that are incompatible. Severity ratings were calibrated (1
+  Critical, 3 High, 2 Medium — compared to Opus's 2 Critical for similar findings).
+- **Claude Opus** was remarkably fast (41s, 1,858 tokens) and found one thing
+  neither other model caught: the diagram/table order reversal for signal controls.
+  This is a concrete, verifiable error (not a design tension — a literal mistake in
+  the document). The re-entry loop analysis (finding #5) goes deeper than GPT-5's
+  version of the same core issue, exploring the implications for "smaller quantity
+  wins" semantics. However, Opus found fewer total issues and missed the
+  settings contradiction and audit logging inconsistency.
+- **Claude Sonnet** was the fastest (15s, 699 tokens) and found 4 issues. The buying
+  power dead-end observation is unique and shows genuine reasoning about the reduction
+  system's limitations. However, it's more of a "this design can't achieve its stated
+  goal" observation than a strict self-contradiction. Sonnet's other findings overlap
+  with the common ground. Quality is solid but narrower in scope.
+
+**Key insight — Finding #15's Opus > GPT-5 result was document-specific:**
+In Finding #15 (coherence checking on failure-modes.md), Opus found 7 incoherences
+vs GPT-5's 4. Here, on risk-controls.md, GPT-5 found 6 vs Opus's 5. The reversal
+suggests that the relative performance on coherence checking depends on the
+DOCUMENT'S structure, not on a fixed model advantage:
+
+- **failure-modes.md** (383 lines): A complex multi-process system with many
+  stated invariants across failure states, supervision trees, and recovery paths.
+  Rich in design TENSIONS where one subsystem's safety mechanism undermines another.
+  This plays to Opus's strength (finding design tensions between subsystems).
+- **risk-controls.md** (277 lines): A more focused specification with explicit rules,
+  ordering constraints, and behavior tables. Rich in SPECIFICATION CONTRADICTIONS
+  where one statement directly conflicts with another. This plays to GPT-5's
+  strength (systematic verification of claims against stated mechanisms).
+
+The difference: Opus excels when contradictions are EMERGENT (arise from composing
+multiple design decisions). GPT-5 excels when contradictions are EXPLICIT (two
+statements in the document say incompatible things). Risk-controls.md has more
+explicit contradictions (the settings fallback vs fail-closed, the "no portfolio
+context" vs NoShortSales, the audit "always" vs write point "only on reject").
+
+**Model performance depends on CONTRADICTION TYPE:**
+| Contradiction type | Best model | Example |
+|---|---|---|
+| Emergent/compositional | Opus | "Rest-for-one cascade creates a 5th state" |
+| Explicit/definitional | GPT-5 | "No portfolio context" but check requires portfolio |
+| Diagrammatic/structural | Opus | Table order ≠ diagram order |
+| Semantic/category confusion | All (common ground) | Reduction re-entry violates ordering claims |
+
+**Revised conclusion on Finding #15's open question:**
+"Does Opus > GPT-5 ordering for coherence checking hold across other documents?"
+**No.** The ordering depends on the document's contradiction density and type.
+Documents rich in emergent design tensions favor Opus. Documents with explicit
+specification errors favor GPT-5. The task type (coherence checking) doesn't have
+a fixed model winner — it depends on what KIND of incoherences the document contains.
+
+**Practical implication:** Continue running both models for coherence checking. Their
+strengths are complementary even within the same task type. GPT-5 catches things you
+can point to in the spec and say "these two sentences conflict." Opus catches things
+where you need to reason about the implications of multiple mechanisms interacting.
+
+## Open Questions
+
+- Does GPT's advantage in finding inconsistencies extend to logical
+  inconsistencies in arguments? One data point (verdict mismatches) — need more.
+- What's the optimal task granularity for GPT analytical review? "Whole PR" is
+  too big. Is "one hypothesis" right, or can we batch?
+- ~~Is the GPT-4.1 Mini bias detection result repeatable, or was it a well-
+  structured task that any model would ace?~~ **ANSWERED (Finding #8):** Any
+  model aces it when the biased text is presented without noise. The original
+  result was about noise elimination, not model capability.
+- **NEW:** Does adding a narrow bias-check question to a rich PR review
+  context recover the detection that broad review misses? (Signal-to-noise
+  confirmation test)
+- ~~How does reasoning_effort affect analytical quality? Only tested default so
+  far.~~ **ANSWERED (Finding #21):** Negligible effect on GPT-5 for open-ended
+  analytical tasks. Low/medium/high produced 33/30/30 findings with nearly
+  identical reasoning tokens (~4K) and per-finding depth. The parameter
+  may primarily affect verifiable-answer tasks, not exploration. Task framing
+  remains the dominant quality lever.
+- Can we design a systematic "analytical review checklist" that leverages each
+  model's strengths?
+- ~~What analytical tasks is Opus best at vs Sonnet? Finding #11 shows Opus
+  excels at design-tension identification. How does Sonnet compare on the
+  same task? (Sonnet is non-reasoning but fast — would it match GPT-4.1?)~~
+  **ANSWERED (Finding #12):** Sonnet 4.6 significantly outperforms GPT-4.1
+  (17 vs ~14 assumptions) and approaches GPT-5 (17 vs 20). It's not a
+  non-reasoning model in the GPT-4.1 sense — it occupies a middle tier with
+  genuine component-interaction reasoning. Opus still wins on design-tension
+  identification specifically.
+- How do the models compare on research synthesis tasks (our #381 rewrite)?
+  We'll find out during the actual rewrite.
+- ~~Does the reasoning-token advantage scale with document complexity? Test
+  with a simpler doc to see if the gap narrows.~~ **ANSWERED (Finding #11):**
+  The gap doesn't narrow with simpler docs. GPT-5 maintains ~1.7x the findings
+  of GPT-4.1 regardless of document complexity. Reasoning tokens enable
+  exhaustive exploration independent of input difficulty.
+- ~~Would Claude Opus (also a reasoning model) match GPT-5's assumption-finding
+  performance, or does it have different blind spots?~~ **ANSWERED (Finding #11):**
+  Different blind spots, different strengths. GPT-5 reasons deeper into
+  implementation mechanics (breadth + technical depth). Opus reasons wider
+  about system context and design tensions (insight density). They're
+  complementary, not competing. Run both on important architecture docs.
+- ~~Does Sonnet 4.6's strong showing hold across other analytical tasks
+  (bias detection, gap-finding) or is it specific to assumption-finding on
+  complex documents? Need to test Sonnet on simpler docs and different question
+  types.~~ **PARTIALLY ANSWERED (Finding #13):** Sonnet's strength does NOT
+  transfer to concurrency reasoning. It dropped from 85% of GPT-5 (assumption-
+  finding) to ~58% (race condition identification). Task type matters more
+  than we thought. Still untested: gap-finding, bias detection for Sonnet.
+- **NEW:** What other analytical tasks require sequential/temporal reasoning
+  (like race condition identification) vs pattern-matching reasoning (like
+  assumption-finding)? Building a task taxonomy would help assign models
+  correctly.
+- **NEW:** What explains Sonnet taking slightly longer than Opus here (106s vs
+  105s) despite normally being the faster model? Is it the document length, or
+  does Sonnet's internal reasoning scale with complexity similarly to Opus?
+- ~~How does GPT-5 Mini compare to GPT-5 on analytical tasks? Is it a viable
+  cheaper substitute?~~ **ANSWERED (Finding #14):** GPT-5 Mini is a viable
+  middle option. Finds fewer issues (6 vs 10) but with genuine reasoning
+  depth at ~50% cost/time. Better than non-reasoning models, not as
+  exhaustive as GPT-5.
+- **NEW:** How does Claude 4.5 Opus/Sonnet compare to Claude 4.6? HAI now
+  exposes both; worth testing whether the newer versions regress on
+  analytical tasks.
+- ~~Would running GPT-5 Mini + Sonnet together (different axes)
+  approach GPT-5's coverage at lower combined cost?~~ **ANSWERED (Finding #19):**
+  71% coverage at 31% cost. Good for low-stakes work; GPT-5 irreplaceable for
+  high-stakes due to unique domain-knowledge findings in the missing 29%.
+- ~~**NEW (Finding #15):** Does the Opus > GPT-5 ordering for coherence checking
+  hold across other documents? The inversion (Opus finding more than GPT-5)
+  was striking — need to confirm it wasn't document-specific.~~
+  **ANSWERED (Finding #27):** No — it was document-specific. On risk-controls.md,
+  GPT-5 found 6 vs Opus's 5. The winner depends on contradiction TYPE: Opus
+  excels at emergent/compositional contradictions, GPT-5 at explicit/definitional
+  ones. No fixed ordering for this task type.
+- **NEW (Finding #15):** Is the two-pass approach (Opus generates → GPT-5
+  validates) worth the extra cost vs just running Opus alone? Need to test
+  whether GPT-5 actually catches Opus false-positives or just agrees.
+- ~~How do the Claude 4.5 and 4.6 models compare on analytical tasks?~~
+  **ANSWERED (Finding #16):** 4.5 is more exhaustive (2x findings), 4.6 is
+  more precise (higher signal-to-noise). Genuine tradeoff, not a regression.
+  4.5 for coverage, 4.6 for actionability.
+- **NEW (Finding #16):** Does the 4.5 vs 4.6 pattern hold across other task
+  types? Spec completeness may favor exhaustiveness; would coherence checking
+  or race condition analysis show the same pattern?
+- **NEW (Finding #16):** Is running both Sonnet versions (4.5 + 4.6) cost-
+  effective vs just running GPT-5? Need to compare the UNION of their findings
+  against GPT-5's output for overlap analysis.
+- **NEW (Finding #18):** Does Opus's "predictable exploit window" detection
+  transfer to other policy documents? It uniquely identified that the cooldown
+  mechanism creates a GUARANTEED safe window that strategies could systematically
+  exploit — this is a higher-order security insight. Worth testing whether Opus
+  consistently finds "adversarial opportunity" framings that other models miss.
+- **NEW (Finding #20):** Does GPT-5's extreme verification behavior (15:1
+  reasoning-to-output ratio, 3 findings from 12K reasoning) persist across
+  other documents with this prompt? Or was user-pipeline-lifecycle.md
+  particularly verification-heavy? Test invariant violation paths on a simpler
+  document.
+- **NEW (Finding #20):** Would giving GPT-5 a "minimum 8 findings" instruction
+  reduce its selectivity and produce MORE invariant violations at lower
+  precision? Or would it just pad with non-violations? The extreme selectivity
+  may be a feature OR it may mean GPT-5 is discarding valid-but-hard-to-verify
+  findings.
+- **NEW (Finding #20):** Opus's self-correction behavior is now confirmed across
+  Finding #15 and #20. Is this trainable/promptable? Could we ask non-Opus models
+  to "show your reasoning and withdraw findings you cannot fully verify"?
+- **NEW (Finding #22):** The "silent correctness" lens revealed three distinct
+  analytical modes: Opus → missing functionality, GPT-5 → mechanism incorrectness,
+  Sonnet → composition failures. Does this three-way differentiation hold on other
+  documents, or was it specific to the regulatory/financial domain of specid-lot-selection?
+- **NEW (Finding #22):** Does the "silent correctness" lens work on non-financial
+  documents? The financial/regulatory domain has a large gap between syntactic and
+  semantic correctness. Would the same prompt on an infrastructure/systems doc produce
+  equally differentiated findings, or would it collapse into assumption-finding?
+- ~~**NEW (Finding #22):** Opus's "missing feature identification" mode (wash sales,
+  commissions) — is this promptable on other models? Could we explicitly ask GPT-5
+  "what should this system compute but doesn't" and get similar results?~~
+  **ANSWERED (Finding #26):** YES — all three models find regulatory gaps and
+  missing features when explicitly prompted. Opus's unique behavior in #22 was
+  an emergent DEFAULT tendency, not a capability. Prompt framing dominates
+  model personality.
+
+- **NEW (Finding #28):** Cross-document consistency found real bugs in gargoyle
+  docs (fills vs events, position ownership, signal persistence). Does running
+  this analysis across MORE document pairs (e.g., domain readmes vs implementation
+  docs, design docs vs plan docs) yield additional real inconsistencies? Could
+  become a systematic documentation maintenance tool.
+- **NEW (Finding #28):** Opus was 2.4x faster AND found more issues than GPT-5
+  on cross-document consistency. Is this because cross-doc contradictions are
+  easy to verify once spotted (reducing GPT-5's verification advantage)? Or
+  because boundary reasoning (Opus's strength) is the primary skill needed?
+
+## Methodology Notes
+
+- Internet opinions about models are overwhelmingly about coding. Don't
+  extrapolate to analytical work without testing.
+- "Just because someone says it on the internet doesn't make it right." —
+  Aaron, 2026-04-26. Opinions need context. Track our own evidence.
+- Absence of published methodology for a use case is itself a finding.
+- Each finding needs: date, task, **how we used it** (context shape, task
+  framing, what info the model had/didn't have), what happened, takeaway.
+  No unsupported generalizations.
+- **Context dimensions to track:**
+  - Rich vs minimal (how much background info)
+  - Broad vs focused ("review this" vs "answer this specific question")
+  - What kind of context (diff, full files, issue text, research notes,
+    project conventions, nothing)
+  - Whether the model had access to tools or just text
+  - Whether the task was explicit step-by-step or open-ended
@@ -0,0 +1,178 @@
+# Finding 28: Cross-document consistency analysis: NEW task type — GPT-5 finds deep semantic contradictions; Opus finds structural/boundary mismatches; Sonnet identifies core issues quickly
+
+**Date:** 2026-05-05
+**Task:** Identify contradictions and inconsistencies BETWEEN two architecture documents
+describing the same system: `system-overview.md` (323 lines, narrative overview with
+component flows, invariants, and domain events) and `architecture.md` (213 lines,
+DDD-focused with bounded contexts, context map, and message taxonomy).
+**How we used them:** BOTH documents provided as full text in a single prompt (~25KB
+total). Highly structured prompt specifying 5 categories of cross-document inconsistency
+(terminology conflicts, structural contradictions, flow/sequence conflicts,
+ownership/authority conflicts, philosophical contradictions). Required specific output
+format per finding. Explicitly excluded omissions (things one doc covers and the other
+doesn't) and detail-level differences. No tools, no project context beyond the two
+documents. This is a NEW analytical task not previously tested: reasoning about
+CONSISTENCY BETWEEN documents rather than internal coherence of a single document.
+
+| Model | Time | Output tokens | Reasoning tokens | Inconsistencies found | Critical | High | Medium |
+|---|---|---|---|---|---|---|---|
+| GPT-5 | 125s | 9,415 | 8,384 | 6 | 2 | 3 | 1 |
+| Claude Opus 4.6 | 52s | 2,351 | (internal) | 7 | 3 | 3 | 1 |
+| Claude Sonnet 4.6 | 14s | 776 | (internal) | 4 | 1 | 2 | 1 |
+
+**What they found — common ground (all 3 identified):**
+- Event sourcing (all events as source of truth) vs fills-only ground truth:
+  Document A says fills are "ground truth from which all other state can be
+  derived," while Document B says "events are the source of truth, state is
+  computed by replaying events." A treats fills as the recovery foundation;
+  B treats ALL domain events as authoritative. All three models rated this
+  Critical.
+- Bounded context naming mismatch: "Decision Engine" / "Order Management" (A)
+  vs "Engine" / "Trading" (B) for the same functional responsibilities.
+  GPT-5 folded this into a broader ownership analysis; Opus and Sonnet
+  surfaced it as its own finding.
+- Signal classification conflict: Document A lists "Signal emitted" as a domain
+  event; Document B explicitly categorizes `SignalEmitted` as an audit event
+  ("not used to rebuild state"). This determines event store design and
+  recovery semantics.
+
+**GPT-5 unique findings (not in either Claude model):**
+- Signal persistence contradiction: Document A states "Signals are never
+  persisted" while Document B lists `SignalEmitted` as an audit event that IS
+  persisted and states the audit log is mandatory for trading. These are
+  directly incompatible claims about whether signal data is stored.
+- Audit event ownership conflict: Document A says "Decision approved" events
+  originate from PortfolioRisk. Document B states "only the decision engine
+  writes audit events" and lists `DecisionApproved` as an audit event example.
+  If PortfolioRisk is part of Risk (not Engine), this is an authority violation.
+- "Single writer per user" (A: OrderManager writes all trading state) vs
+  per-aggregate single-writer (B: each aggregate writes its own event stream,
+  Ledger owns positions). These are incompatible authority models — either OM
+  centralizes writes or each domain owns its own events.
+
+**Claude Opus unique findings (not in either other model):**
+- Engine → OrderManager is an internal pipeline flow (A: same subgraph, direct
+  arrow) vs Engine → Trading is a cross-domain COMMAND (B: `PlaceOrder` command
+  crossing a bounded context boundary). This structural disagreement determines
+  whether order management is an internal pipeline stage or an independent domain
+  with its own aggregates and command validation.
+- Signal Risk's architectural position: Document A shows a two-stage risk
+  architecture (SignalRisk pre-aggregation, PortfolioRisk post-aggregation)
+  where Risk is embedded in the pipeline. Document B's context map shows Risk
+  as a separate domain that Engine merely QUERIES ("kill switch active?") —
+  no arrow shows signal routing through Risk. Either risk logic lives inside
+  Engine (contradicting B's context boundary) or the context map is incomplete.
+- The "reduce" step ownership: A's top-level flow labels `Approved →|"reduce"|
+  Decisions` (reduction at aggregation), while A's own domain events table says
+  "Decision reduced" originates from PortfolioRisk (reduction after aggregation).
+  This is actually an INTRA-document inconsistency in Document A, but Opus surfaced
+  it as part of cross-doc analysis.
+
+**Claude Sonnet unique findings:**
+- None genuinely unique. All 4 findings overlapped with GPT-5/Opus common ground
+  (event sourcing, signal persistence, context count/naming). Sonnet was efficient
+  (14s, 776 tokens) but didn't identify any inconsistency that the other two missed.
+
+**Quality assessment:**
+- **GPT-5** produced 6 well-reasoned findings with the deepest analysis of
+  OWNERSHIP conflicts. Its signal-persistence contradiction and single-writer
+  authority conflict are genuinely important — they reveal places where the two
+  documents would lead implementers to build fundamentally different systems.
+  Every finding quotes specific text from both documents and explains precisely
+  WHY they can't both be correct. The reasoning investment (8,384 tokens) was
+  used for thorough cross-referencing between documents.
+- **Claude Opus** found the most inconsistencies (7) and was remarkably fast
+  (52s vs GPT-5's 125s). Its unique strength: identifying STRUCTURAL contradictions
+  about component boundaries and communication patterns. The Engine→Trading
+  command vs internal pipeline finding is architecturally the most significant
+  discovery — it reveals a fundamental disagreement about whether order
+  management is INSIDE or OUTSIDE the decision engine's boundary. Opus also
+  caught a bonus intra-document inconsistency (the "reduce" labeling error).
+- **Claude Sonnet** was the fastest (14s) and most concise (776 tokens) but
+  found only the obvious common-ground issues. For cross-document consistency,
+  Sonnet's speed advantage came at the cost of missing the architectural
+  insights that make this task valuable. It did correctly identify all the
+  Critical-level issues, making it viable as a quick first-pass screen.
+
+**Key insight — cross-document consistency is a DISTINCT task type:**
+This is fundamentally different from single-document analysis (assumptions,
+race conditions, coherence). It requires:
+1. Building a mental model from Document A
+2. Building a separate mental model from Document B
+3. Finding places where the models are incompatible
+4. Reasoning about WHY they can't both be correct (not just "different")
+
+Step 4 is what distinguishes this from simple diff-detection. Many surface
+differences (naming, detail level, scope) are NOT contradictions — the models
+must judge which differences are genuinely incompatible vs. complementary.
+The prompt explicitly excluded omissions and detail-level differences, and
+all three models respected this constraint well.
+
+**Model strengths on cross-document analysis:**
+- **GPT-5** excels at ownership/authority conflicts: it systematically
+  checked "who owns this concept" in each document and found mismatches.
+  Its findings cluster around "who writes what" and "who is authoritative."
+- **Opus** excels at structural/boundary contradictions: it identified where
+  the documents draw architectural lines differently. Its findings cluster
+  around "where are the boundaries" and "what crosses them."
+- **Sonnet** identifies the obvious/critical issues quickly but doesn't dig
+  deeper. Viable for screening, not for thorough analysis.
+
+**Comparison to Finding #15 / #27 (single-document coherence checking):**
+Single-document coherence asks "does this document contradict itself?"
+Cross-document consistency asks "do these documents contradict each other?"
+Key differences in results:
+
+| Aspect | Single-doc coherence | Cross-doc consistency |
+|---|---|---|
+| Opus findings | 5-7 | 7 |
+| GPT-5 findings | 4-6 | 6 |
+| Sonnet findings | 4-5 | 4 |
+| Opus unique | Design tensions | Structural/boundary mismatches |
+| GPT-5 unique | Definitional errors | Ownership/authority conflicts |
+| Best model | Task-dependent | Opus (most findings; 2.4x faster than GPT-5) |
+
+The relative ordering is broadly similar (Opus at or near the top for
+coherence-style tasks, though GPT-5 led in Finding #27), but the CHARACTER of
+unique findings shifted. On single-doc coherence, Opus finds design tensions
+within a single design. On cross-doc consistency, Opus finds BOUNDARY
+disagreements between two designs. GPT-5 shifts from finding definitional
+errors to ownership conflicts.
+
+**Are these findings REAL bugs in the gargoyle documentation?**
+Yes — several are genuine issues worth fixing:
+1. The fills-vs-events-as-ground-truth is a real philosophical tension between
+   the two documents that needs resolution.
+2. The Position event ownership (OrderManager vs Ledger) is a real boundary
+   conflict that affects implementation.
+3. The Engine→Trading communication style (internal pipeline vs cross-domain
+   command) is a genuine structural ambiguity.
+4. The signal persistence claim ("never persisted" vs `SignalEmitted` audit
+   event) is a direct textual contradiction.
+
+These are the kind of cross-document inconsistencies that cause teams to build
+inconsistent implementations — one engineer reads Document A and builds one way,
+another reads Document B and builds differently.
+
+**Practical implication:** Cross-document consistency analysis is a high-value
+task for documentation maintenance. Run it when:
+- A system has multiple architecture docs written at different times
+- A refactoring has updated one doc but not another
+- Multiple people contribute to design documentation
+- Moving from high-level overview to detailed specification
+
+Opus is the recommended model for this task: faster than GPT-5 (52s vs 125s), most
+findings (7 vs 6), and uniquely strong at boundary disagreements. GPT-5 adds
+value for ownership-specific conflicts. Sonnet is sufficient for quick
+screening (catches the Critical issues in 14s) but won't find the architectural
+insights.
+
+**Cost-effectiveness:**
+- Opus: 2,351 output tokens for 7 findings = 336 tokens/finding (52s)
+- GPT-5: 9,415 output + 8,384 reasoning for 6 findings = 2,967 tokens/finding (125s)
+- Sonnet: 776 output tokens for 4 findings = 194 tokens/finding (14s)
+
+Opus is the clear winner over GPT-5 on this task type: more findings, 2.4x
+faster, and 8.8x more token-efficient per finding. GPT-5's large reasoning
+investment (8,384 tokens) still produced one fewer finding than Opus. The
+verification overhead is not paying off here because cross-document contradictions
+are relatively easy to verify once identified (just check both documents).
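
The per-finding figures above come from dividing total tokens (output plus reasoning, where reported) by the number of findings. A minimal Python sketch that reproduces the arithmetic from the numbers in this section; the `Run` class and its field names are illustrative and not part of any tooling used in the experiment:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One model run, using the numbers reported in this section."""
    model: str
    output_tokens: int
    reasoning_tokens: int  # 0 where reasoning tokens were not reported
    findings: int
    seconds: int

    @property
    def tokens_per_finding(self) -> float:
        # Total tokens spent (output + reported reasoning) per finding.
        return (self.output_tokens + self.reasoning_tokens) / self.findings


runs = {
    "opus": Run("Opus", 2351, 0, 7, 52),
    "gpt5": Run("GPT-5", 9415, 8384, 6, 125),
    "sonnet": Run("Sonnet", 776, 0, 4, 14),
}

for run in runs.values():
    print(f"{run.model}: {run.tokens_per_finding:.1f} tokens/finding ({run.seconds}s)")

# Ratios quoted in the text, both relative to GPT-5.
speedup = runs["gpt5"].seconds / runs["opus"].seconds
efficiency = runs["gpt5"].tokens_per_finding / runs["opus"].tokens_per_finding
print(f"Opus vs GPT-5: {speedup:.1f}x faster, {efficiency:.1f}x fewer tokens per finding")
```

Both ratios compare Opus with GPT-5 only; on raw tokens per finding, Sonnet is the cheapest of the three.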
@@ -0,0 +1,174 @@
+# Finding 29: Adversarial manipulation analysis: NEW task type — GPT-5 is most exhaustive and systematic; Opus produces qualitatively different attack vectors with system-level thinking; Sonnet is adequate but less creative
+
+**Date:** 2026-05-05
+**Task:** Identify adversarial manipulation paths in gargoyle's `aggregation.md` (193 lines)
+— how a misbehaving, compromised, or buggy upstream component could exploit the
+aggregator's design guarantees to produce harmful trading outcomes that bypass
+downstream safety controls.
+**How we used them:** Same document (full text) + same focused analytical question to all
+3 models via HAI proxy. Highly structured prompt specifying 5 categories of adversarial
+manipulation (signal injection, timing manipulation, capacity weaponization, state
+corruption via crash, audit evasion). Required specific output format per finding
+(attack vector, mechanism, exploit, why downstream controls miss it, severity). No tools,
+no project context beyond the document itself.
+
+| Model | Time | Output tokens | Reasoning tokens | Attack vectors found | Critical | High | Medium |
+|---|---|---|---|---|---|---|---|
+| Claude Sonnet 4.6 | 27s | 1,257 | (internal) | 10 | 3 | 5 | 2 |
+| Claude Opus 4.6 | 84s | 3,662 | (internal) | 12 | 5 | 5 | 0 |
+| GPT-5 | 111s | 8,808 | 6,336 | 15 | 2 | 10 | 3 |
+
+**What they found — common ground (all 3 identified):**
+- Primary signal hijacking via ranking manipulation (last-tick injection in
+  time-windowed aggregation to control decision parameters)
+- Threshold gaming via signal replay/duplication (no deduplication means N
+  identical signals satisfy "N confirmations")
+- Capacity flooding to force premature completion or deny legitimate trades
+- Strategic crash to erase unfavorable in-flight groups
+- Timeout-masqueraded manipulation (making attacks look like normal system behavior
+  in the audit trail)
+
+**GPT-5 unique findings (not in either Claude model):**
+- **Direction flip against majority via ranking:** In "most recent" ranking,
+  emit multiple SELL confirmations then inject a late BUY — the BUY becomes
+  primary and the decision contradicts the bulk of evidence. Distinct from
+  general primary hijack because it's specifically about *directional* reversal.
+- **Late-arrival exclusion of counter-signals:** Time signals so countervailing
+  signals arrive just after group destruction, ensuring the decision is formed
+  without dissenting inputs that would have altered ranking.
+- **Capacity filter to curate the audit set:** Pre-fill buffer with chosen
+  signals so riskier alternatives cannot be included before capacity fires —
+  the contributing signals list looks clean.
+- **Timer nullification by crash:** Crash just before a timeout that would
+  force-complete an unfavorable decision — the timer becomes no-op on restart,
+  no decision or expiry event is emitted.
+- **Decision drop via induced forwarding failure:** Exploit the "Decision
+  forwarding fails: Decision is lost" failure mode to selectively suppress
+  protective decisions (stops, hedges) with no automatic retry.
+- **Crash to erase evidence of contrary signals:** Post-crash, submit a
+  fresh group that completes quickly; audit shows only the new set, not the
+  earlier contradictory pre-crash signals.
+
+**Claude Opus unique findings (found by neither other model):**
+- **Instrument fragmentation to multiply position size:** Emit signals for
+  economically equivalent exposures using different instrument identifiers.
+  Each gets its own group, each produces a separate decision, bypassing
+  per-group capacity limits. Combined position exceeds what any single group
+  would allow. Identifies TOCTOU at the fan-in to PortfolioRisk.
+- **Forced stale decision via timer exploitation:** Emit one signal at a
+  favorable price spike known to be transient, then deliberately withhold
+  further signals. Timer force-completes with a stale price. The entry price
+  WAS valid when the signal was generated — PortfolioRisk doesn't check
+  staleness of decision prices.
+- **Timeout prevention / keep-alive suppression:** Manipulate market data
+  feed to suppress signals that would reach threshold N. Group expires
+  normally — denial-of-trading attack disguised as insufficient confirmation.
+- **Crash-restart duplicate decisions:** Crash after decision is forwarded
+  but before strategy reflects it. Both restart "clean" — strategy re-emits
+  signals, aggregator produces a second decision with a fresh ID. Same trade
+  executes twice. PortfolioRisk can't deduplicate because IDs are different.
+- **Force-complete with insufficient confirmation (capacity < threshold):**
+  If capacity limit is lower than threshold, hitting capacity ALWAYS force-
+  completes before predicate is satisfied. Fundamentally changes a 5-confirmation
+  strategy into a 3-confirmation strategy.
+- **Pattern predicate as arbitrary decision trigger:** If adversary controls
+  predicate logic (via strategy configuration), can make pattern-complete
+  trigger on any single signal while audit shows algorithm=pattern-complete
+  and reason=:predicate. Trust boundary between configuration and execution.
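
The capacity-below-threshold case above is mechanical enough to express as a configuration check. A hypothetical sketch, assuming a threshold-style group config with `threshold` and `capacity` fields; these names are illustrative and are not taken from `aggregation.md`:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroupConfig:
    """Hypothetical aggregator group settings (field names are assumptions)."""
    threshold: int  # confirmations the predicate requires
    capacity: int   # signals buffered before a forced completion


def effective_confirmations(cfg: GroupConfig) -> int:
    """Confirmations a group can actually collect before it completes."""
    return min(cfg.threshold, cfg.capacity)


def validate(cfg: GroupConfig) -> None:
    """Reject configs where capacity fires before the predicate can be met."""
    if cfg.capacity < cfg.threshold:
        raise ValueError(
            f"capacity {cfg.capacity} < threshold {cfg.threshold}: groups would "
            f"force-complete after {effective_confirmations(cfg)} signals"
        )


# The example from the finding: a 5-confirmation strategy with capacity 3.
print(effective_confirmations(GroupConfig(threshold=5, capacity=3)))
```
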
+
+**Claude Sonnet unique findings (found by neither other model):**
+- **Cross-group timing coordination:** Coordinate signal injection across
+  multiple instruments to synchronize completion times, creating a burst of
+  correlated decisions that overwhelms PortfolioRisk's per-decision
+  evaluations, each of which looks safe on its own. (NOTE: Opus's instrument
+  fragmentation finding is a similar concept framed differently: Opus focused
+  on position multiplication via instrument aliasing, Sonnet focused on burst
+  timing overwhelming evaluation.)
+- **Multi-strategy attack distribution:** Spread manipulation across multiple
+  isolated strategy aggregators so no single aggregator's behavior looks
+  abnormal while cumulative effect is harmful.
+
+**Quality assessment:**
+- **GPT-5** produced the most findings (15) with the most systematic coverage
+  across all 5 prompt categories. Its strength was in identifying SPECIFIC
+  INTERLEAVINGS — exactly how timing, state, and ranking mechanisms interact
+  to produce exploits. The direction-flip finding (#3) and the late-arrival
+  exclusion finding (#6) show precise temporal reasoning about when signals
+  arrive relative to group lifecycle events. The "decision drop via forwarding
+  failure" finding exploits a DOCUMENTED failure mode (from the failure table)
+  as an offensive weapon — turning a recovery mechanism into an attack vector.
+  Every finding references specific mechanisms from the spec.
+- **Claude Opus** produced 12 findings with the most architecturally creative
+  attacks. The instrument fragmentation attack is the most SYSTEMICALLY
+  dangerous finding across all three models — it's not about manipulating one
+  group but about the RELATIONSHIP between groups, and it identifies a
+  TOCTOU vulnerability at the PortfolioRisk fan-in point that no other model
+  found. The crash-restart duplication attack is also architecturally novel —
+  it exploits the "clean state" guarantee as a weapon for invisible trade
+  doubling. Opus consistently reasons about the system BOUNDARY (aggregator
+  → PortfolioRisk handoff) rather than just within-component mechanics. The
+  pattern-predicate trust boundary finding is uniquely about CONFIGURATION
+  as an attack surface.
+- **Claude Sonnet** produced 10 findings in 27s, which is extremely efficient
+  (about 126 output tokens per finding). Findings were adequate and covered all 5 categories,
+  but lacked the specificity of GPT-5 and the architectural creativity of
+  Opus. Several findings were somewhat generic (e.g., "crash at strategic
+  moments" without specifying exactly WHEN relative to group lifecycle).
+  The cross-group coordination and multi-strategy distribution findings show
+  system-level thinking but are stated at a higher abstraction level without
+  concrete exploit sequences.
+
+**Key insight — "adversarial manipulation analysis" as a task type:**
+This is qualitatively different from all previous analytical lenses tested.
+Previous tasks asked models to find problems WITH the design (assumptions,
+races, incoherences). This task asks models to find ways to USE the design
+AGAINST itself — a creative/generative adversarial task. Results:
+
+- **GPT-5** treats it as an exhaustive enumeration exercise — systematically
+  walks through each mechanism and asks "how could this be abused?" High
+  count (15), thorough coverage, but some findings are minor variations of
+  each other (e.g., crash-related findings #10, #12, #15 share the same core
+  mechanism). Its 6,336 reasoning tokens were used for both generation and verification.
+- **Opus** treats it as a creative design exercise — asks "what would a
+  smart adversary do that the designer didn't consider?" Fewer findings (12)
+  but several are genuinely novel attack concepts (instrument fragmentation,
+  crash-restart duplication, predicate trust boundary) that require reasoning
+  about the SYSTEM rather than the COMPONENT. Opus also provided a summary
+  table and systemic conclusion about the root design weaknesses.
+- **Sonnet** treats it as a categorization exercise — fills each prompt
+  category with plausible attacks but at a higher abstraction level. Fast
+  and adequate for a first pass but wouldn't surprise a security reviewer.
+
+**Comparison to "predictable exploit window" (Finding #18):**
+Finding #18 noted that Opus uniquely identified predictable exploit windows
+in escalation-policy.md. Here, Opus again shows the strongest adversarial
+creativity — the instrument fragmentation attack and crash-restart duplication
+are both about exploiting DESIGN GUARANTEES (per-instrument grouping, clean
+restart) as weapons. This confirms that Opus's strength on adversarial analysis
+is a CONSISTENT PATTERN, not document-specific.
+
+GPT-5 excels when the adversarial task is framed as "enumerate all possible
+abuses of each mechanism" (systematic coverage). Opus excels when the task
+requires "invent novel attack concepts that exploit design boundaries"
+(creative adversarial thinking).
+
+**Model hierarchy for adversarial manipulation analysis:**
+1. GPT-5 — most thorough enumeration, best at mechanism-level exploitation (15)
+2. Opus — most creative, finds system-boundary attacks others miss (12)
+3. Sonnet — adequate first pass, fast, but less specific (10)
+
+**Practical implication:** For security-oriented architecture review:
+- Run GPT-5 for comprehensive attack surface enumeration
+- Run Opus for novel/creative attack vectors that exploit design boundaries
+- Sonnet is sufficient only as a quick initial screen
+- The UNION of GPT-5 + Opus findings (removing overlaps) would produce the
+  most complete adversarial analysis
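
Once findings are normalized to comparable labels, the union step is a plain set operation. A sketch using only the findings named in this write-up; the slugs are ad hoc labels, and the full 15-item and 12-item model outputs are not reproduced here:

```python
# Ad hoc slugs for findings named above; not identifiers from the model outputs.
COMMON = {
    "primary-hijack-via-ranking",
    "threshold-gaming-via-replay",
    "capacity-flooding",
    "strategic-crash-erasure",
    "timeout-masquerade",
}

gpt5 = COMMON | {
    "direction-flip",
    "late-arrival-exclusion",
    "capacity-curated-audit-set",
    "timer-nullification-by-crash",
    "decision-drop-via-forwarding-failure",
    "crash-erases-contrary-evidence",
}

opus = COMMON | {
    "instrument-fragmentation",
    "forced-stale-decision",
    "timeout-prevention",
    "crash-restart-duplicate-decisions",
    "capacity-below-threshold",
    "predicate-as-arbitrary-trigger",
}

union = gpt5 | opus      # everything either model reported
overlap = gpt5 & opus    # reported by both, counted once in the union

print(f"GPT-5: {len(gpt5)}, Opus: {len(opus)}")
print(f"overlap: {len(overlap)}, union: {len(union)}")
```

The hard part in practice is the normalization: two models often describe the same attack in different words, so the labels need a human pass or a fuzzy comparison before the set union means anything.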
+
+**New finding about the aggregator itself:** Several attacks identified by
+multiple models point to real design weaknesses worth addressing:
+1. No signal deduplication/independence validation (all 3 models)
+2. Primary signal determines all decision parameters regardless of group
+   composition (all 3 models)
+3. Transient state + no replay = perfect adversarial erasure tool (all 3)
+4. Capacity/timeout treated as normal events even when weaponized (all 3)
+5. No cross-group correlation at aggregator level (Opus + Sonnet)
+6. TOCTOU at PortfolioRisk fan-in for concurrent decisions (Opus)
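
Weakness 1 can be illustrated with a toy threshold check. This is a hypothetical model and not the aggregator's implementation: `Signal`, its fields, and both counting functions are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    """Toy signal; the real schema in aggregation.md may differ."""
    source: str
    instrument: str
    direction: str


def naive_threshold_met(signals: list[Signal], n: int) -> bool:
    """Counts every signal, so replays of one signal count as confirmations."""
    return len(signals) >= n


def independent_threshold_met(signals: list[Signal], n: int) -> bool:
    """Counts distinct sources only, so replays add nothing."""
    return len({s.source for s in signals}) >= n


# One signal replayed three times against a 3-confirmation threshold.
replayed = [Signal("strategy-a", "XYZ", "BUY")] * 3
print(naive_threshold_met(replayed, 3))        # True
print(independent_threshold_met(replayed, 3))  # False
```

Counting distinct sources is only one possible notion of independence; whether it fits depends on how the real signals are identified.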
@@ -0,0 +1,16 @@
+# Model Findings — Analytical & Research Work
+
+_Tracking what actually works (and doesn't) when using AI models for research,
+analysis, bias detection, and document review — not coding._
+
+Started: 2026-04-26
+
+## Context
+
+We use multiple models in different roles: Claude Code (Opus/Sonnet) for
+generation, Sonnet + GPT-5 for independent dual review, smaller models for
+focused analytical tasks. Most public discussion is about coding. We found
+almost no published methodology for using models in analytical research tasks
+(searched 2026-04-26). That gap is why we're tracking this.
+
+Each experiment lives in its own file. See individual finding files below.