diff --git a/README.md b/README.md
index 5d5f38b..e7fe256 100644
--- a/README.md
+++ b/README.md
@@ -53,12 +53,15 @@ Each experiment:
 ## Repository Structure
 
 ```
-findings/           # Individual findings with full analysis
-  01-different-models-different-things.md
-  02-narrow-lens-vs-broad-review.md
+findings/                                         # Individual findings with full analysis
+  README.md                                       # Context and index
+  YYYY-MM-DD-NN-slug.md                           # One file per experiment
+  2026-04-26-01-different-models-catch-different-things.md
+  2026-04-26-07-emerging-role-assignments-pattern-not.md
+  2026-05-03-07b-token-budget-matters-more-than.md  # Duplicate #7 (suffix b)
+  2026-05-03-15-design-coherence-analysis.md
   ...
-  28-cross-document-consistency.md
-  29-adversarial-manipulation.md
+  2026-05-05-29-adversarial-manipulation-analysis-new-task.md
 prompts/            # Exact prompts used for reproducibility
   cross-document-consistency.md
   design-coherence.md
@@ -69,6 +72,9 @@ open-questions.md   # Unanswered questions for future experiments
 methodology.md      # Full methodology notes
 ```
 
+Findings are named `YYYY-MM-DD-NN-slug.md` for chronological sorting.
+Numbers are zero-padded (01–29). The duplicate finding #7 uses a `b` suffix.
+
 ## Who We Are
 
 This research is conducted by [Rodin](https://gitea.weiker.me/rodin) (AI
diff --git a/findings/2026-04-26-01-different-models-catch-different-things.md b/findings/2026-04-26-01-different-models-catch-different-things.md
new file mode 100644
index 0000000..72472b7
--- /dev/null
+++ b/findings/2026-04-26-01-different-models-catch-different-things.md
@@ -0,0 +1,16 @@
+# Finding 1: Different models catch different things (confirmed)
+
+**Date:** 2026-04-26
+**Task:** PR reviews on DDD reference docs (~6,600 lines across 18 files)
+**How we used them:** Both models got the same task via pr-review skill —
+fetch diff, fetch full file content for changed files, review against PR
+description and linked issue acceptance criteria. Rich context: full diff,
+project CLAUDE.md conventions, issue body. Each reviewer ran independently
+in its own sub-agent with its own Gitea token. No cross-pollination.
+
+- GPT-5 caught SUMMARY.md verdict mismatches (Commanded classification,
+  small teams classification) that Sonnet missed entirely (PR #375)
+- Sonnet caught a broken cross-reference link that GPT-5 missed (PR #378)
+- **Takeaway:** Different blind spots are real. Neither model is strictly better
+  for analytical review — they complement each other. This is why we run two
+  independent reviewers from different model families.
diff --git a/findings/2026-04-26-02-cheap-model-narrow-lens-expensive.md b/findings/2026-04-26-02-cheap-model-narrow-lens-expensive.md
new file mode 100644
index 0000000..230e168
--- /dev/null
+++ b/findings/2026-04-26-02-cheap-model-narrow-lens-expensive.md
@@ -0,0 +1,18 @@
+# Finding 2: Cheap model + narrow lens > expensive model + broad review (one data point)
+
+**Date:** 2026-04-26
+**Task:** Check 12 rewritten hypotheses for directional bias
+**How we used them:**
+- Sonnet & GPT-5: full PR review context (diff, file content, issue, AC).
+  Broad mandate: "review this PR." Rich context but unfocused task.
+- GPT-4.1 Mini: given ONLY the 12 hypothesis texts + one focused question:
+  "Do any of these hypotheses lead toward a predetermined conclusion?"
+  Minimal context, laser-focused task. No diff, no project docs, no issue.
+
+- Both Sonnet and GPT-5 approved the hypotheses as reviewers
+- GPT-4.1 Mini found that ALL 12 pushed toward predetermined conclusions
+- Words like "requires," "necessary," "must be" were flagged as directional
+- **Takeaway:** Task framing mattered more than model size. Rich context +
+  broad mandate = missed the forest for the trees. Minimal context + precise
+  question = found exactly what mattered. This needs more testing — was it
+  the narrow framing, the lack of surrounding context, or both?
diff --git a/findings/2026-04-26-03-gpt5-times-out-on-complex.md b/findings/2026-04-26-03-gpt5-times-out-on-complex.md
new file mode 100644
index 0000000..30a0a4c
--- /dev/null
+++ b/findings/2026-04-26-03-gpt5-times-out-on-complex.md
@@ -0,0 +1,15 @@
+# Finding 3: GPT-5 times out on complex multi-step analytical tasks (confirmed pattern)
+
+**Date:** 2026-04-26
+**Task:** Full PR review of #382 (research document rewrite)
+**How we used it:** pr-review skill — multi-phase (fetch diff, fetch files,
+check CI, analyze against AC, post inline comments, post summary). 7 phases,
+many curl calls to Gitea API, large diff context. Heavy tool-use workflow
+through SAP proxy (adds latency vs direct API). 300s timeout.
+
+- Timed out 3 times at 300s (17, 6, 6 tool calls respectively)
+- Bottleneck was model processing time, not network (~0.3s Gitea API latency)
+- **Takeaway:** Break analytical tasks into focused bounded pieces. Twelve
+  small deep reviews > one rushed big one. The issue isn't GPT-5's analysis
+  quality — it's that multi-phase tool-heavy workflows burn too much time
+  on mechanics. Separate the data gathering from the analysis.
diff --git a/findings/2026-04-26-04-gpt5-defaults-to-delegation-claude.md b/findings/2026-04-26-04-gpt5-defaults-to-delegation-claude.md
new file mode 100644
index 0000000..cc2beb7
--- /dev/null
+++ b/findings/2026-04-26-04-gpt5-defaults-to-delegation-claude.md
@@ -0,0 +1,18 @@
+# Finding 4: GPT-5 defaults to delegation; Claude defaults to doing the work
+
+**Date:** 2026-04-26
+**Task:** PR review delegation to sub-agents
+**How we used them:** Both spawned as sub-agents from main session with
+same task description, same pr-review skill file, same Gitea credentials.
+Difference: GPT-5 got model override to gpt5, Sonnet used default model.
+Both got full skill instructions.
+
+- GPT-5 first attempt: spawned sub-sub-agents and timed out
+- GPT-5 with "do it yourself, no sub-agents" + step-by-step: worked
+- Even with constraints, GPT-5 sometimes dumps raw tool output instead of
+  synthesizing — needs explicit output format instructions
+- Claude (Sonnet/Opus) given the same kind of task does the work directly
+- **Takeaway:** GPT-5 interprets complex task descriptions as delegation
+  opportunities. Claude interprets them as work to do. For GPT-5: explicit
+  single-actor instructions + output format. For Claude: can give broader
+  mandate. Same skill file, very different behavior.
diff --git a/findings/2026-04-26-05-sonnet-is-fast-and-catches.md b/findings/2026-04-26-05-sonnet-is-fast-and-catches.md
new file mode 100644
index 0000000..3d94a74
--- /dev/null
+++ b/findings/2026-04-26-05-sonnet-is-fast-and-catches.md
@@ -0,0 +1,17 @@
+# Finding 5: Sonnet is fast and catches structural issues; GPT-5 is slow and catches semantic issues
+
+**Date:** 2026-04-26
+**Task:** Dual review across PRs #372, #375, #378, #380, #382
+**How we used them:** Same pr-review skill, same context (diff + files +
+issue + AC), same sub-agent pattern. Only variable: model. Both got rich
+context. Both ran the full 7-phase review skill.
+
+- Sonnet consistently finishes first, catches formatting, broken links,
+  structural problems (missing sections, dangling refs)
+- GPT-5 takes longer, catches meaning-level problems (verdict mismatches,
+  classification inconsistencies, logical gaps)
+- **Takeaway:** With identical rich context and identical instructions, the
+  models naturally gravitate to different things. Sonnet is the structural
+  reviewer; GPT-5 is the semantic reviewer. Both roles matter. Question:
+  would Sonnet catch semantic issues if given a narrower "check for logical
+  consistency" framing instead of broad review?
diff --git a/findings/2026-04-26-06-single-agent-cant-handle-1000.md b/findings/2026-04-26-06-single-agent-cant-handle-1000.md
new file mode 100644
index 0000000..6cc9df4
--- /dev/null
+++ b/findings/2026-04-26-06-single-agent-cant-handle-1000.md
@@ -0,0 +1,20 @@
+# Finding 6: Single agent can't handle 1000+ line document generation (confirmed pattern)
+
+**Date:** 2026-04-26
+**Task:** DDD v2 forge analysis drafting
+**How we used them:** Single Sonnet/Opus sub-agents given full research
+material (~3,874 lines of research notes) + outline + instructions to write
+complete document. Very rich context (all research), very large output
+requirement (1000+ lines).
+
+- Five single-agent attempts died (OOM, disconnect, timeout) trying to write
+  full documents
+- Sectional approach (5 parallel Sonnet subagents, ~500-700 lines each)
+  succeeded immediately — each got same research but only their section's
+  outline
+- Same pattern when Claude Code attempted full Part V rewrite — died
+- Three agents × ~320 lines each worked first try
+- **Takeaway:** This is a confirmed, repeatable limit for generation tasks.
+  Not model-specific — it's a context/output length problem. Rich input
+  context is fine; it's the output length that kills. Break output into
+  sections, keep input context rich, draft in parallel, assemble.
diff --git a/findings/2026-04-26-07-emerging-role-assignments-pattern-not.md b/findings/2026-04-26-07-emerging-role-assignments-pattern-not.md
new file mode 100644
index 0000000..aa8d4fb
--- /dev/null
+++ b/findings/2026-04-26-07-emerging-role-assignments-pattern-not.md
@@ -0,0 +1,17 @@
+# Finding 7: Emerging role assignments (pattern, not conclusion)
+
+**Date:** 2026-04-26 (one day of intensive work — treat as hypothesis)
+
+- Opus (via Claude Code): complex generation needing deep project context.
+  Rich context: CLAUDE.md, full codebase access, design docs. Broad mandate.
+- Sonnet: parallel volume work (5 subagents drafting simultaneously).
+  Rich context per section, constrained output scope.
+- GPT-5: independent analytical review. Rich context (diff + files + issue).
+  Best when task is bounded and explicit.
+- GPT-4.1 Mini: focused narrow analysis (bias detection). Minimal context,
+  precise question. Cheap and fast.
+- **Takeaway:** The role assignment matters, but so does the context shape.
+  Opus gets broad context + broad mandate. Sonnet gets broad context +
+  narrow scope. GPT-5 gets rich context + explicit task. GPT-4.1 Mini gets
+  minimal context + laser question. We haven't tested swapping these
+  combinations — that's where the real learning will come from.
diff --git a/findings/2026-04-27-08-bias-detection-all-models-catch.md b/findings/2026-04-27-08-bias-detection-all-models-catch.md
new file mode 100644
index 0000000..24a2573
--- /dev/null
+++ b/findings/2026-04-27-08-bias-detection-all-models-catch.md
@@ -0,0 +1,58 @@
+# Finding 8: Bias detection: all models catch it with any framing — when the signal isn't buried
+
+**Date:** 2026-04-27
+**Task:** Detect directional bias in 8 deliberately biased hypotheses about
+microservices vs monolith architecture for fintech startups.
+**How we used them:** Created fresh test material (8 hypotheses with pro-
+microservices bias via absolutes like "inevitably," "necessary," "must,"
+"requires," plus one factually inverted claim about consistency guarantees).
+Ran 4 conditions in parallel sub-agents:
+
+| Condition | Model | Framing | Context |
+|---|---|---|---|
+| A | GPT-4.1 Mini | Narrow: "Do any lead toward a predetermined conclusion?" | Hypotheses only |
+| B | Sonnet | Same narrow question | Hypotheses only |
+| C | GPT-5 | Same narrow question | Hypotheses only |
+| D | Sonnet | Broad: "Review quality, clarity, testability, and issues" | Hypotheses only |
+
+**Results:**
+- **All 4 conditions detected 8/8 biased hypotheses.** No misses.
+- All 3 narrow-framing models (Mini, Sonnet, GPT-5) produced structurally
+  similar output: per-hypothesis verdict, biasing words, neutral version,
+  severity assessment.
+- All 3 narrow-framing models flagged H8's factual inversion (distributed
+  transactions DON'T provide stronger consistency than monolithic ACID).
+- GPT-5 added specific counterexamples (LMAX Disruptor, Shopify, Stack
+  Overflow, Basecamp) — marginally richer analysis.
+- Sonnet broad mandate also caught the bias — framed as one of three
+  "systemic problems" (deterministic language, pro-microservices framing
+  bias, underspecified constructs). Additionally provided testability and
+  operationalization analysis that the narrow framing didn't ask for.
+- Sonnet broad took ~72s vs ~39s for narrow conditions (more output).
+
+**Takeaway:** When the biased text is the ONLY input (no surrounding noise),
+all tested models — including the cheapest (GPT-4.1 Mini) — detect bias
+regardless of whether the question is narrow or broad. This appears to
+**contradict** original finding #2 ("cheap model + narrow lens > expensive
+model + broad review"), but the key difference is context noise:
+
+- **Original experiment (2026-04-26):** Sonnet and GPT-5 missed bias during
+  FULL PR REVIEW with rich project context (diff, file content, issue text,
+  acceptance criteria, project conventions). The hypotheses were buried in
+  layers of review mechanics.
+- **This experiment (2026-04-27):** Even the "broad" condition gave ONLY the
+  hypothesis text — no diff, no PR structure, no project context noise.
+
+**Refined hypothesis:** The original finding #2 was about **signal-to-noise
+ratio**, not about model capability or framing precision. When biased text
+is presented in isolation, any model catches it. When biased text is buried
+in a large PR review with many other things to check, the bias signal gets
+lost in the noise — unless you explicitly ask about it. The "narrow lens"
+worked because it eliminated the noise, not because smaller models are
+better at bias detection.
+
+**Next experiment to confirm:** Give a model the FULL PR review context
+(diff, files, issue, AC) but add the narrow bias question as an explicit
+review checklist item. If the model catches bias despite the rich context,
+it confirms the signal-to-noise hypothesis. If it misses, it suggests
+something else is at play (attention allocation, task switching cost).
diff --git a/findings/2026-05-02-09-gapfinding-in-architecture-docs-gpt5.md b/findings/2026-05-02-09-gapfinding-in-architecture-docs-gpt5.md
new file mode 100644
index 0000000..6dc3d2b
--- /dev/null
+++ b/findings/2026-05-02-09-gapfinding-in-architecture-docs-gpt5.md
@@ -0,0 +1,77 @@
+# Finding 9: Gap-finding in architecture docs: GPT-5 finds domain-specific gaps, GPT-4.1 is generic, Mini is formulaic
+
+**Date:** 2026-05-02
+**Task:** Identify missing failure scenarios in gargoyle's `failure-modes.md` (383 lines)
+**How we used them:** Same document (full text, no truncation) + same focused
+analytical question to all 3 models via HAI proxy (OpenAI-compatible endpoint).
+No tools, no project context beyond the document itself. Single prompt, no
+conversation history. Temperature 0.3 for GPT-4.1/Mini, default (1.0) for GPT-5
+(required by the model).
+
+| Model | Time | Output tokens | Reasoning tokens | Scenarios found |
+|---|---|---|---|---|
+| GPT-4.1 Mini | 16s | 2,003 | 0 | 10 |
+| GPT-4.1 | 24s | 2,575 | 0 | 15 |
+| GPT-5 | 45s | 8,565 | 6,656 | 14 |
+
+**What they found — common ground (all 3 identified):**
+- ETS table corruption/loss affecting gates
+- BEAM scheduler starvation / GC pauses
+- WebSocket message duplication/reordering
+- Postgres connection pool exhaustion / deadlocks
+- Clock skew / time drift
+- Process registry inconsistency
+
+**GPT-5 unique findings (found by neither other model):**
+- Broker rate limiting (429s) — not "connection lost" so existing logic
+  doesn't trigger, but can't flatten during kill switch
+- Broker auth failure / credential rotation — distinct from connection loss
+- Corporate actions (splits, symbol changes) — position drift without
+  triggering staleness detection
+- Duplicate pipeline instances for same user (DynamicSupervisor race)
+- DB "commit unknown outcome" causing restart loops (Ecto commit succeeds
+  at Postgres but client times out → retry → unique constraint → crash loop)
+- Cross-symbol strategies with partial staleness — multi-leg signals
+  computed from mix of fresh and stale data
+- Partial cancel_all during kill switch masked by process restarts
+
+**GPT-4.1 unique findings (not in GPT-5 or Mini):**
+- Zombie processes after halt (supervisor misconfiguration)
+- Unsupervised Task crashes going unnoticed
+- Audit log writes failing silently (not in same transaction as state change)
+- ClOrdID unique constraint violation from race in sequence generation
+- Broker API semantic changes (silent breaking changes)
+
+**GPT-4.1 Mini unique findings:**
+- Race between kill switch engagement and reconciliation completion
+  (timing coordination gap) — this was more explicitly called out than
+  in the other models, though GPT-5 touches it implicitly
+- Strategy.Worker / Aggregator partial crash inconsistency
+
+**Quality assessment:**
+- **GPT-5** had the most *domain-relevant* and *actionable* gaps. Broker
+  rate limiting, auth failures, corporate actions, and the DB commit
+  unknown-outcome scenario are all realistic production issues specific
+  to THIS system. The cross-symbol partial staleness finding shows
+  deeper architectural reasoning about component interactions.
+- **GPT-4.1** was thorough and well-structured but more generic/defensive.
+  Many of its unique findings (zombie processes, unsupervised Tasks,
+  audit log loss) are general Elixir concerns rather than specific to
+  the document's architecture. Good for a completeness checklist.
+- **GPT-4.1 Mini** was formulaic — each finding followed the same template
+  and several were somewhat surface-level or restated things the document
+  partially covers. Still found the most scenarios per dollar.
+
+**Takeaway:** For gap-finding in architecture documents, GPT-5's reasoning
+tokens pay off. It doesn't just list "things that could go wrong" — it
+identifies *specific interactions* that the document's existing mechanisms
+don't cover (e.g., rate limiting bypasses the "connection lost" detection,
+corporate actions bypass staleness detection). GPT-4.1 is a solid
+middle-ground: more thorough than Mini, less insightful than GPT-5.
+Mini is fine for a quick sanity check but won't find the subtle gaps.
+
+**Cost-effectiveness:** Mini found 10 scenarios in 16s for ~7K tokens.
+GPT-5 found 14 scenarios (with 7 genuinely unique insights) in 45s for
+~13.5K tokens (including ~6.7K reasoning). For architecture review where
+missing a gap could mean financial loss, the GPT-5 cost is justified.
+For routine doc review, Mini + human judgment is probably sufficient.
diff --git a/findings/2026-05-02-10-hiddenassumption-identification-gpt5s-reasoning-produces.md b/findings/2026-05-02-10-hiddenassumption-identification-gpt5s-reasoning-produces.md
new file mode 100644
index 0000000..0360f2a
--- /dev/null
+++ b/findings/2026-05-02-10-hiddenassumption-identification-gpt5s-reasoning-produces.md
@@ -0,0 +1,98 @@
+# Finding 10: Hidden-assumption identification: GPT-5's reasoning produces qualitatively different (not just more) findings
+
+**Date:** 2026-05-02
+**Task:** Identify hidden assumptions in gargoyle's `cold-start-and-recovery.md` (234 lines)
+that could break under real-world production conditions.
+**How we used them:** Same document (full text) + same focused analytical question
+to all 3 models via HAI proxy (OpenAI-compatible endpoint). No tools, no project
+context beyond the document itself. Single prompt, no conversation history.
+Temperature 0.3 for GPT-4.1/Mini; GPT-5 uses default (required).
+
+| Model | Time | Output tokens | Reasoning tokens | Assumptions found |
+|---|---|---|---|---|
+| GPT-4.1 Mini | 25s | 3,090 | 0 | 12 |
+| GPT-4.1 | 77s | 2,751 | 0 | 14 |
+| GPT-5 | 78s | 2,649 | 4,096 | 26 |
+
+**What they found — common ground (all 3 identified):**
+- Broker API consistency/availability during reconciliation
+- ETS table availability and fail-closed behavior
+- Single-writer/mailbox ordering guarantees holding in practice
+- User independence assumption vs shared resources (rate limits, DB)
+- Reconciliation idempotency under repeated runs
+- Corporate action data completeness/timeliness
+- Escalation threshold calibration vs changing market conditions
+- Strategy warmup with partial/missing historical data
+- Signal expiry correctness on restart
+
+**GPT-5 unique findings (found by neither other model):**
+- Unbounded mailbox growth during extended reconciliation (memory pressure
+  from queued messages at market open)
+- handle_continue side effects in OTHER processes (risk, metrics) acting
+  concurrently via different paths
+- Pre-existing GTC orders filling while gated (positions as moving target)
+- Broker position semantics mismatch (trade-date vs settled-date)
+- Strategy warmup evaluate() having non-signal side effects (metrics, caches)
+- Historical bar / live tick boundary alignment (double-processing or gaps)
+- ETS gate caching in process state creating fail-open windows
+- Correlated retry stampede when many users restart together
+- Corporate action double-application race with broker (missing idempotency
+  keys per action/instrument/date)
+- Kill switch state vs DB unavailability at startup
+- Market data subscriptions as shared bottleneck across "independent" users
+- Time-invariant signals incorrectly expired by aggregation window logic
+- Broker fills vs positions endpoints internally inconsistent (different caches)
+- Positions changing under reconciliation while kill switch is engaged
+- Gate phase sequencing: :ready written before worker warmup completes
+- Periodic reconciler allowing 1hr of divergent trading (rate-of-change blind)
+
+**GPT-4.1 unique findings (not in GPT-5 or Mini):**
+- No correlated failure handling (all failure modes treated as isolated) —
+  only model to frame this as a meta-assumption about the failure table
+
+**GPT-4.1 Mini unique findings:**
+- None that weren't also covered by the other two models
+
+**Quality assessment:**
+- **GPT-5** didn't just find more assumptions — it found *qualitatively
+  different kinds*. Many of its unique findings involve multi-component
+  interactions (mailbox + reconciliation + market open timing), semantic
+  mismatches (trade-date vs settled positions), and second-order effects
+  (metrics side effects during warmup, GTC orders filling while gated).
+  These require reasoning about system behavior across boundaries the
+  document doesn't explicitly draw.
+- **GPT-4.1** was competent and structured, found the same core assumptions
+  as Mini, plus one good meta-observation about correlated failures. But
+  it stayed within the document's own framing — it found assumptions the
+  document *almost* states rather than ones the document can't see.
+- **GPT-4.1 Mini** was formulaic. Every finding maps cleanly to a section
+  of the document. It's essentially "what could go wrong with each stated
+  mechanism" rather than "what does this design take for granted about
+  the world outside itself."
+
+**Key insight — reasoning tokens change the KIND of analysis:**
+GPT-5's 4,096 reasoning tokens aren't producing "more of the same" —
+they're producing a different analytical mode. The non-reasoning models
+(4.1 and Mini) identify risks within the document's own frame of reference.
+GPT-5 reasons about the document's relationship to the external world:
+broker semantics, deployment topology, OTP runtime behavior under load,
+timing correlations across independent subsystems. This is the difference
+between "what could this mechanism fail at" and "what must be true about
+the world for this mechanism to work."
+
+**Comparison to Finding #9 (gap-finding on failure-modes.md):**
+Same pattern confirmed. GPT-5 consistently finds domain-specific,
+interaction-level issues that require reasoning about component boundaries.
+GPT-4.1 is thorough within-frame. Mini is formulaic. The gap between
+GPT-5 and the others is larger here than in #9 — possibly because
+"hidden assumptions" requires more abstraction than "missing failure
+scenarios." Assumption-finding requires the model to reason about what
+ISN'T stated, which benefits more from extended reasoning.
+
+**Practical implication:** For architecture review, running GPT-5 on
+"identify hidden assumptions" is higher-value than the same question to
+non-reasoning models. The cost difference (4K extra reasoning tokens) is
+trivial for a document that will drive months of implementation. Use
+non-reasoning models for within-frame checks ("does this section have
+gaps") and reasoning models for cross-boundary analysis ("what must be
+true about the world for this to work").
diff --git a/findings/2026-05-02-11-hiddenassumption-identification-on-simpler-doc.md b/findings/2026-05-02-11-hiddenassumption-identification-on-simpler-doc.md
new file mode 100644
index 0000000..31a189c
--- /dev/null
+++ b/findings/2026-05-02-11-hiddenassumption-identification-on-simpler-doc.md
@@ -0,0 +1,124 @@
+# Finding 11: Hidden-assumption identification on simpler doc: reasoning models diverge in approach, both outperform non-reasoning
+
+**Date:** 2026-05-02
+**Task:** Identify hidden assumptions in gargoyle's `market-calendar.md` (238 lines)
+— a simpler, single-component document vs the 234-line cold-start doc from Finding #10.
+**How we used them:** Same document (full text) + same focused analytical question
+to all 3 models via HAI proxy. No tools, no project context beyond the document
+itself. Single prompt, no conversation history. Temperature 0.3 for GPT-4.1;
+GPT-5 and Opus use their defaults (required). Same prompt across all three.
+
+| Model | Time | Output tokens | Reasoning tokens | Assumptions found |
+|---|---|---|---|---|
+| GPT-4.1 | 19s | 2,554 | 0 | 14 |
+| Claude Opus 4.6 | 74s | 3,288 | (internal, not reported) | 13 |
+| GPT-5 | 101s | 8,417 | 5,504 | 24 |
+
+**What they found — common ground (all 3 identified):**
+- Alpaca calendar API data correctness/completeness as single source of truth
+- Alpaca API availability at startup (no local cache persistence)
+- ETS table atomicity during refresh (partial-state exposure risk)
+- System clock/timezone alignment (dates are timezone-naive)
+- NYSE emergency/unscheduled closures not reflected until refresh
+- Two-year cache range sufficiency
+- API response format stability
+- Rate limiting / API capacity concerns
+
+**GPT-5 unique findings (found by neither other model):**
+- Date struct term-ordering in ETS match specs may not match chronological
+  order (ETS range guards rely on Erlang term comparison, not Date semantics)
+- close_time/1 returns naive Time without timezone — DST conversion burden on
+  consumers, one hour off twice per year
+- trading_day?/1 conflates "not a trading day" with "calendar unavailable" —
+  operational outages invisible to callers
+- ETS table name collision risk (global namespace per node)
+- No other process should modify the ETS table (access mode discipline)
+- Network egress and credential availability on all nodes at all times
+- ETS read/write concurrency flags for contention under load
+- Direct ETS access by consumers bypassing the module's error handling
+- next/prev_trading_day edge cases at cache boundaries
+- Alpaca API start/end parameter inclusivity (off-by-one at year boundaries)
+- Half-day vs full-day distinction insufficiency for special sessions
+- Small table size makes O(n) selects acceptable (scaling concern)
+- Year-end refresh failure leaving gaps at boundary
+- Alpaca never omits a legitimate trading day (absence = non-trading conflation)
+
+**Claude Opus unique findings (found by neither other model):**
+- ETS ownership semantics: heir-protection would change fail-closed behavior;
+  current design means ALL consumers fail simultaneously during crash-to-restart
+  window (framed as a design tension, not just a risk)
+- Silent data corruption from partial API response (pagination/truncation) —
+  specifically that missing rows are SILENT failures with no error propagation
+  (other models mentioned API completeness but not the silence aspect)
+- Consumers calling functions with Dates, not DateTimes — the API accepts Date.t()
+  but doesn't specify HOW consumers should derive "today" (system-wide
+  coordination problem made invisible by the API contract)
+- `trading_day?/1` returning false is NOT fail-closed for ALL consumers — only
+  for PDT-like "block action" consumers; for batch-trigger consumers it's
+  fail-OPEN (subtle inversion of safety semantics)
+- Startup ordering: background_children placement means PDT could receive orders
+  before MarketCalendar finishes init, creating recurring rejection windows
+  during hot deploys
+- Continuous-running assumption for refresh timer (daily restarts would mean
+  refresh mechanism never fires — no staleness alert exists)
+
+**GPT-4.1 unique findings (found by neither other model):**
+- No need for real-time calendar change notification (event emission gap)
+- All consumers using the same module instance (configuration consistency)
+- No need for historical calendar data (audit/backtesting limitation)
+- Consumers correctly handling {:error, :calendar_unavailable} in practice
+
+**Quality assessment:**
+- **GPT-5** found the most assumptions (24) with the most technical specificity.
+  Many are implementation-level insights (ETS term ordering, named table
+  collisions, read_concurrency flags) that demonstrate deep Erlang/OTP
+  knowledge. Some are slightly obvious or overlapping. The ETS term-ordering
+  finding is genuinely insightful: Date structs are maps, and Erlang term order
+  compares map values in key order (day, then month, then year), so term
+  comparison of Dates is not chronological. Also provided concrete recommendations.
+- **Claude Opus** found fewer assumptions (13) but several were qualitatively
+  different — they identified *design tensions* and *semantic inversions*
+  rather than just failure scenarios. The fail-open/fail-closed inversion
+  (finding #12), the ETS ownership tension, and the "API makes timezone
+  coordination invisible" findings show reasoning about the design's
+  *relationship to its consumers* rather than just its internal mechanics.
+  Tighter, more curated output with less filler.
+- **GPT-4.1** was competent and well-structured (14 assumptions, clean table)
+  but stayed within the document's own framing. Its unique findings are
+  relatively generic ("consumers should handle errors correctly," "no
+  historical data"). Solid baseline, no surprises.
+
+**Key insight — two reasoning models, different analytical styles:**
+GPT-5 and Opus are both reasoning models, but they reason about different
+things. GPT-5 reasons DEEPER into implementation mechanics (how does ETS
+actually work? what are the exact failure modes of each component?). Opus
+reasons WIDER about system context (how does this component's API contract
+affect the safety properties of the overall system? what tensions does this
+design create that aren't visible to the author?).
+
+GPT-5's approach: "Here are 24 things that could go wrong, many highly
+technical." Opus's approach: "Here are 13 assumptions, several of which
+reveal design tensions the document can't see about itself."
+
+**Does the reasoning gap narrow with simpler docs?**
+Comparing to Finding #10 (cold-start doc, 234 lines, 26 vs 14 vs 12 assumptions
+for GPT-5/GPT-4.1/Mini):
+- GPT-5 still dominates in raw count (24 vs 14 for GPT-4.1)
+- The gap ratio is similar (~1.7x here vs ~1.9x in Finding #10)
+- Document complexity doesn't appear to be the driver of the gap —
+  reasoning tokens enable more exhaustive exploration regardless of
+  input complexity
+
+**Claude Opus vs GPT-5 (the headline comparison):**
+They're not competing on the same axis. GPT-5 is better for "find all
+possible issues" (breadth + technical depth). Opus is better for "find
+the assumptions that will actually surprise the author" (insight density).
+If you want a security-audit-style exhaustive list: GPT-5. If you want a
+design-review-style "here's what you're not seeing about your own design":
+Opus. Both are better than GPT-4.1 for this task, but in different ways.
+
+**Practical implication:** Run BOTH reasoning models on architecture docs.
+GPT-5 catches implementation-level hazards the team might miss during
+coding. Opus catches design-level tensions the team might miss during
+planning. GPT-4.1 is sufficient as a quick sanity check but won't
+surprise you.
diff --git a/findings/2026-05-02-12-sonnet-46-outperforms-expectations-on.md b/findings/2026-05-02-12-sonnet-46-outperforms-expectations-on.md
new file mode 100644
index 0000000..9c03078
--- /dev/null
+++ b/findings/2026-05-02-12-sonnet-46-outperforms-expectations-on.md
@@ -0,0 +1,125 @@
+# Finding 12: Sonnet 4.6 outperforms expectations on assumption-finding; competes with reasoning models on complex docs
+
+**Date:** 2026-05-02
+**Task:** Identify hidden assumptions in gargoyle's `order-execution.md` (785 lines)
+— a complex, multi-component document covering OrderManager, BrokerAdapter,
+TradeStream, and PositionReconciler.
+**How we used them:** Same document (full text, no truncation) + same focused
+analytical question to all 3 models. GPT-5 via HAI OpenAI endpoint; Opus 4.6
+and Sonnet 4.6 via HAI Anthropic endpoint. No tools, no project context beyond
+the document itself. Single prompt, no conversation history.
+
+| Model | Time | Output tokens | Reasoning tokens | Assumptions found |
+|---|---|---|---|---|
+| GPT-5 | 93s | 8,485 | 6,016 | 20 |
+| Claude Sonnet 4.6 | 106s | 4,637 | (internal) | 17 |
+| Claude Opus 4.6 | 105s | 4,615 | (internal) | 12 |
+
+**What they found — common ground (all 3 identified):**
+- Synchronous broker REST calls blocking OrderManager GenServer (mailbox growth)
+- TradeStream event ordering assumptions (out-of-order fills/status)
+- Fill deduplication gap (no explicit fill-level idempotency)
+- `cancel_all/1` with `timeout: :infinity` blocking GenServer during FLATTEN
+- Recovery/restart races with TradeStream fill delivery (fills queued during
+  `handle_continue/2`)
+- Lot operation idempotency under crash recovery (partial execution)
+- Replace race: fills for new broker_order_id arriving before `replaced` event
+- Database write latency impact on GenServer throughput under burst fills
+- ETS table scope assumptions (single-node, access mode)
+
+**GPT-5 unique findings (not in either Claude model):**
+- Rate-limit retry blocking OrderManager inline (no async retry path specified)
+- Single TradeStream connection per user not enforced (duplicate detection gap)
+- Kill switch FLATTEN vs degraded state interaction (OM drops cancels while
+  degraded, but FLATTEN calls cancel_all through OM)
+- ClOrdID uniqueness scope/retention at broker across sessions and days
+- `after: datetime` filter semantics (clock skew, timezone, inclusive/exclusive)
+- Reconciliation responses may exceed single-response size (no pagination)
+- Event broadcasting blocking model (synchronous vs fire-and-forget)
+- Credential rotation during TradeStream connection lifetime
+- `market_closed` semantics varying across brokers (reject vs queue)
+- Dropped Alpaca statuses (stopped/suspended/calculated) may affect accounting
+
+**Claude Sonnet 4.6 unique findings (not in either other model):**
+- Single fill per fill event assumption (broker batching multiple fills into
+  one WebSocket message)
+- Lot operations (`Lots.open/2`, `Lots.close/4`) assumed to never fail —
+  no `{:error, _}` handling shown, crash propagation risk
+- `Task.async_stream` inside GenServer creating linked tasks whose crash
+  signals propagate to OrderManager during critical cancel_all
+- Broker cancel semantics during in-flight replace at the broker level
+  (cancel targets old broker_order_id which broker already replaced away)
+- Database operations in fill processing assumed transactional (no explicit
+  Ecto.Multi/transaction mention)
+- Broker position reflects only Gargoyle's activity (external trades cause
+  false-positive reconciliation halts)
+
+**Claude Opus 4.6 unique findings (not in either other model):**
+- `{:ok, broker_order_id}` from REST place conflated with durable OMS
+  acceptance vs mere HTTP acknowledgment (no timeout on `submitted` state)
+- Concurrent `apply_corrections/2` from periodic reconciler running in
+  separate process conflicts with OrderManager's single-writer invariant
+  (corrections write to same tables outside GenServer serialization)
+- Reconciliation gate initialized state after `:rest_for_one` restart —
+  ETS table EXISTS but freshly initialized vs table MISSING are different
+  conditions with different safety properties
+- Escalation state reset after crash creating double-exposure window
+  (systematic issue persists but escalation timer resets to zero)
+- `replace/3` error semantics: non-atomic replace (cancel + re-submit)
+  where cancel succeeds but re-submit fails leaves original order cancelled
+  at broker while OrderManager reverts to "working" locally
+
+**Quality assessment:**
+- **GPT-5** maintained its pattern from previous findings: broadest coverage
+  (20 assumptions), most technically specific about implementation details.
+  Found cross-cutting operational concerns (clock skew, credential rotation,
+  pagination) that the Claude models didn't surface. However, several of its
+  findings were medium-severity operational concerns rather than architectural
+  assumptions.
+- **Claude Sonnet 4.6** was the surprise performer. Found 17 assumptions —
+  close to GPT-5's count (85%) — and several of its unique findings were
+  genuinely insightful. The `cancel_all` race with broker-side replace state
+  (finding #16) and the lot operation failure propagation (finding #6) show
+  deep reasoning about component interaction despite Sonnet not being
+  positioned as a "reasoning" model. More importantly, Sonnet's findings were
+  consistently well-structured with clear "how it could break" scenarios.
+- **Claude Opus 4.6** found the fewest assumptions (12) but — consistent with
+  Finding #11 — its unique findings were qualitatively different. The
+  concurrent `apply_corrections` write conflict, the gate initialization state
+  distinction, and the non-atomic replace error semantics all reveal design
+  tensions that neither GPT-5 nor Sonnet identified. Opus continues to reason
+  about the *boundaries between components* rather than within-component
+  mechanics.
+
+**Key insight — Sonnet 4.6 is NOT just a faster GPT-4.1:**
+In previous findings (#9, #10, #11), non-reasoning models (GPT-4.1, GPT-4.1
+Mini) performed significantly below reasoning models on assumption-finding.
+GPT-4.1 found ~14 assumptions where GPT-5 found 24-26 (~54-58% of GPT-5's count).
+Here, Sonnet 4.6 finds 17 where GPT-5 finds 20 (~85%), a much smaller gap.
+
+Sonnet's findings also included several that showed genuine reasoning about
+component interactions (not just within-frame risks). This suggests Sonnet 4.6
+is qualitatively different from GPT-4.1 for analytical work — it occupies a
+middle ground between GPT-4.1's "competent but surface-level" and GPT-5's
+"exhaustive and deep." The severity distribution was also similar to GPT-5
+(multiple critical/high findings), whereas GPT-4.1 in previous experiments
+tended toward medium-severity generic concerns.
+
+**Updated model hierarchy for assumption-finding:**
+1. GPT-5 — broadest coverage, most operational-level findings (20)
+2. Sonnet 4.6 — strong analytical depth, good component interaction reasoning (17)
+3. Opus 4.6 — fewest but most architecturally insightful, finds design tensions (12)
+4. GPT-4.1 — competent within-frame, generic (~14 from previous experiments)
+5. GPT-4.1 Mini — formulaic, surface-level (~10-12)
+
+**Practical implication:** For architecture review, Sonnet 4.6 is now a strong
+candidate for volume analytical work. It's fast enough to run alongside GPT-5
+and catches different things (lot operation failures, broker-side replace races).
+The ideal three-model review stack for architecture docs appears to be:
+- GPT-5 for breadth + operational concerns
+- Sonnet 4.6 for component interaction analysis
+- Opus 4.6 for design-tension identification
+
+Each consistently finds things the others miss. The cost-efficiency argument
+for Sonnet is strong: ~85% of GPT-5's count with more findings per output
+token (17 in 4,637 tokens vs 20 in 8,485, or ~3.7 vs ~2.4 per 1,000 tokens).
diff --git a/findings/2026-05-03-07b-token-budget-matters-more-than.md b/findings/2026-05-03-07b-token-budget-matters-more-than.md
new file mode 100644
index 0000000..f74e08d
--- /dev/null
+++ b/findings/2026-05-03-07b-token-budget-matters-more-than.md
@@ -0,0 +1,46 @@
+# Finding 7: Token budget matters more than model size for gap analysis (confirmed)
+
+**Date:** 2026-05-03
+**Task:** Identify unaddressed failure scenarios in gargoyle's `failure-modes.md` (383 lines, ~25KB)
+**How we used them:** Same document, same analytical question ("What failure scenarios
+are NOT covered?"), three models. GPT-5 with 16K max_completion_tokens, Sonnet 4
+with 4K max_tokens, GPT-4.1 Mini with 4K max_completion_tokens. No project context
+beyond the document itself. Pure gap-analysis task.
+
+**Results:**
+- GPT-5 (16K budget): 28 gaps, most exhaustive. Found domain-specific edge cases
+  others missed entirely: ClOrdID collision across restarts, fractional share rounding,
+  broker maintenance windows (410/426), hot code upgrades, regulatory halts vs staleness
+  distinction, cancel-ack-then-late-fill race, DNS/TLS as distinct from network outage.
+- Sonnet 4 (4K budget): 12 gaps, concise. Unique framing: distinguished latency
+  degradation from outage (subtle but actionable). ETS corruption vs loss.
+- GPT-4.1 Mini (4K budget): 13 gaps with summary table. Unique angles: unknown broker
+  status enum values, configuration schema mismatches on cold-start, malformed signals
+  from logic bugs (not just crashes).
+
+**Overlap (all three):** Rate limiting, clock skew, resource exhaustion, DB failures,
+message backpressure, partial connectivity.
+
+**Key insight:** GPT-5's 4K attempt produced ZERO output (finish_reason: length) —
+all tokens consumed by internal reasoning. At 16K it produced the richest analysis.
+This confirms finding #3 (GPT-5 needs generous token budgets) AND adds a new
+observation: for open-ended analytical questions, GPT-5's reasoning overhead is
+proportionally larger. The 4K models (Sonnet, Mini) both produced useful output at
+4K because they don't burn tokens on chain-of-thought.
+
+**Model personality confirmed:**
+- GPT-5: exhaustive, domain-aware, finds edge cases a senior SRE would know
+- Sonnet: precise, architectural, finds design-level distinctions
+- GPT-4.1 Mini: structured, systematic, finds enumeration gaps
+
+**Practical implication:** For failure mode / gap analysis on design docs:
+- GPT-5 with ≥16K tokens for maximum coverage (most unique findings)
+- Sonnet for architectural framing ("this is really two different problems")
+- Mini for completeness checking ("what about this enum value?")
+- Running all three costs ~$0.50 and catches gaps none alone would find
+- GPT-5 at 4K is USELESS for this task — always give it room to think
+
+**Note on GPT-5 reasoning overhead:** First attempt at 4K max_completion_tokens
+returned empty content with finish_reason: length. The model spent all 4K tokens
+on internal reasoning and produced nothing. This is worse than a short answer —
+it's zero value for non-zero cost. Always budget ≥16K for GPT-5 analytical tasks.
diff --git a/findings/2026-05-03-13-race-condition-identification-opus-excels.md b/findings/2026-05-03-13-race-condition-identification-opus-excels.md
new file mode 100644
index 0000000..ba006a3
--- /dev/null
+++ b/findings/2026-05-03-13-race-condition-identification-opus-excels.md
@@ -0,0 +1,126 @@
+# Finding 13: Race condition identification: Opus excels at subtle temporal interactions; GPT-5 is exhaustive; Sonnet struggles with concurrency reasoning
+
+**Date:** 2026-05-03
+**Task:** Identify race conditions, timing-dependent bugs, and ordering hazards in
+gargoyle's `concurrent-failure-detection.md` (241 lines) — a document specifically
+about concurrent detection logic with timers, ETS state, and multi-process events.
+**How we used them:** Same document (full text) + same focused analytical question
+to all 3 models via HAI proxy. Prompt specifically asked for event ordering problems,
+timer interaction bugs, state visibility gaps, crash/restart timing, and multi-instance
+coordination. Required each finding to reference specific mechanisms in the document
+with specific interleaving descriptions. No tools, no project context beyond the
+document itself.
+
+| Model | Time | Output tokens | Reasoning tokens | Race conditions found |
+|---|---|---|---|---|
+| GPT-5 | 116s | 10,587 | 8,192 | 12 |
+| Claude Opus 4.6 | ~105s | 4,610 | (internal) | 10 |
+| Claude Sonnet 4.6 | ~39s | 1,404 | (internal) | 7 |
+
+**What they found — common ground (all 3 identified):**
+- Stale timer messages in mailbox after cancellation (classic Erlang timer race)
+- HealthMonitor crash losing compound detection state (init from :unknown, no replay)
+- ETS vs GenServer state divergence visible to dashboard
+- Kill switch mode conflict (FLATTEN from single-process vs HALT from compound path)
+
+**GPT-5 unique findings (not in either Claude model):**
+- Cross-sender message ordering: recovery events from pipeline processes vs timer
+  expiry from runtime (Erlang preserves per-sender order, NOT cross-sender) — the
+  "rapid recovery" safety argument in the doc relies on state being updated before
+  timer fires, which isn't guaranteed
+- Debounce starvation: flapping component repeatedly restarting the timer, causing
+  compound evaluation to be indefinitely postponed while ≥2 genuinely degraded
+- State regression: {:degraded} arriving after {:escalated, :kill_switch} with no
+  guard in the event table — state machine allows regressing from :halted to :degraded
+- Cold-start window: application boots with existing degraded processes that won't
+  re-emit events, compound detection never fires
+- Catch-all handle_info could accidentally swallow timer messages if pattern matching
+  is ordered wrong (implementation pitfall of the described approach)
+- Debounce window growing beyond calibrated bounds from repeated timer restarts
+
+**Claude Opus unique findings (not in either other model):**
+- Timer restart pushing evaluation PAST single-process escalation timeout — the
+  debounce mechanism can DEFEAT compound detection when second degradation arrives
+  near end of first window (resets to full window, first process escalates via
+  single-process path before new window fires). This means system gets FLATTEN
+  instead of HALT — exactly what compound detection was supposed to prevent.
+- Strategy worker single-atom masking via event ordering: Worker A degrades, Worker
+  B degrades (same atom), Worker A recovers → atom set to :normal while B is still
+  degraded. Event ordering across different workers mapped to same atom creates
+  state loss.
+- Registry stale PID after HealthMonitor crash: if subscription is PID-based (not
+  PubSub topic), new HM instance is deaf — events go to dead PID, silently dropped.
+  Compound detection completely disabled for that user until subscription refresh.
+- :rest_for_one cascade + coincidental independent issue: debounce designed to
+  filter cascade restarts, but cascade can EXPOSE a pre-existing issue (SignalRisk
+  restarts and finds its upstream stale). Is this "compound" or "cascade + bad luck"?
+  Semantic ambiguity the design doesn't address.
+- Compound cleared event without recovery debounce: :compound_degradation_cleared
+  emitted immediately when last process recovers (no settling period), causing
+  operator oscillation if recovery is transient.
+
+**Claude Sonnet unique findings:**
+- ETS table creation race at startup (HealthMonitor writes before table exists)
+- Registry lookup failure during pipeline startup (events before HM registered)
+- However, Sonnet also made analytical errors: it described "multiple HealthMonitor
+  instances for the same user" scenarios despite the document clearly stating one
+  instance per user via DynamicSupervisor. Several of its findings assumed
+  multi-instance coordination that doesn't match the architecture.
+
+**Quality assessment:**
+- **GPT-5** was the most exhaustive and technically precise. Its cross-sender
+  ordering finding (#2) is genuinely insightful — it identifies that the document's
+  "rapid recovery" safety argument implicitly assumes events arrive in wall-clock
+  order, which Erlang does NOT guarantee across different senders. The debounce
+  starvation finding (#3) identifies a real operational hazard with practical
+  consequences. All 12 findings reference specific mechanisms and describe specific
+  interleavings clearly.
+- **Claude Opus** found fewer race conditions but several were qualitatively
+  superior. The timer-restart-defeats-compound-detection finding is the most
+  architecturally significant race in the entire analysis — it shows that the
+  debounce mechanism can work AGAINST the design's stated goals in specific
+  (realistic) timing scenarios. The strategy-worker event ordering masking is
+  also a genuine design flaw unique to the single-atom decision. Opus continues
+  its pattern of reasoning about design TENSIONS rather than just failure modes.
+- **Claude Sonnet** was notably weaker here than in previous experiments. Only
+  1,404 output tokens vs 4,610 (Opus) and 10,587 (GPT-5). Several findings
+  contained analytical errors (assuming multi-instance coordination that doesn't
+  exist). It found only 7 races, and 2-3 of those were based on misreadings of
+  the architecture. This is a significant regression from Finding #12 where
+  Sonnet found 17 assumptions (85% of GPT-5's count).
+
+**Key insight — concurrency reasoning is a different skill than assumption-finding:**
+In the previous experiment (#12), Sonnet 4.6 performed well on
+assumption-finding (a task that requires reasoning about what's NOT stated).
+Here, on race condition identification (a task requiring reasoning about temporal
+interleavings and message ordering semantics), Sonnet drops significantly. This
+suggests the task type matters more than we previously thought:
+
+- **Assumption-finding:** Requires breadth of consideration ("what must be true
+  for this to work?"). Sonnet handles this well — it's essentially pattern
+  matching across possible failure dimensions.
+- **Race condition identification:** Requires SEQUENTIAL reasoning about specific
+  interleavings ("if A happens, then B happens, then C happens, what state is
+  visible?"). This benefits dramatically from extended reasoning tokens (GPT-5's
+  8,192 reasoning tokens) or from Opus's internal reasoning depth.
+
+The lesson: don't extrapolate model performance across task types. A model that's
+85% as good at assumption-finding may be 50% as good at concurrency analysis.
+The cognitive demands are different.
+
+**Opus's distinguishing strength — finding design contradictions:**
+Opus's best finding (timer restart defeating compound detection) isn't just a
+race condition — it's identifying that the debounce mechanism can work against
+the design's own stated goals. This is consistent with Opus's pattern in
+previous findings: it finds tensions where one part of the design undermines
+another part. For race condition analysis specifically, this manifests as
+"here's where your safety mechanism becomes your vulnerability."
+
+**Practical implication for architecture review:**
+- For race condition analysis: use GPT-5 (exhaustive) + Opus (design-tension)
+- Sonnet is NOT suitable for concurrency reasoning tasks — use it for
+  assumption-finding and structural review instead
+- The three-model stack needs task-appropriate assignment:
+  - Structural/assumption review: all three models contribute
+  - Concurrency/race analysis: GPT-5 + Opus only
+  - Bias detection: any model (per Finding #8)
diff --git a/findings/2026-05-03-14-crosscomponent-interaction-analysis-gpt5-mini.md b/findings/2026-05-03-14-crosscomponent-interaction-analysis-gpt5-mini.md
new file mode 100644
index 0000000..ec3e0a7
--- /dev/null
+++ b/findings/2026-05-03-14-crosscomponent-interaction-analysis-gpt5-mini.md
@@ -0,0 +1,131 @@
+# Finding 14: Cross-component interaction analysis: GPT-5 Mini enters as viable mid-tier; task framing strongly shapes output quality
+
+**Date:** 2026-05-03
+**Task:** Identify cross-component interaction failures in gargoyle's
+`continuous-risk-monitoring.md` (459 lines) — a document specifying
+PortfolioMonitor's interactions with OrderManager, PortfolioRisk, MarketData,
+KillSwitch, ETS tables, and the pipeline supervision tree.
+**How we used them:** Same document (full text) + same focused analytical
+question to all 3 models via HAI proxy. Prompt was highly structured: specified
+5 categories of cross-component failures to look for (semantic mismatches,
+ordering violations, feedback loops, partial visibility, supervision boundary
+effects) and required specific output format (components, sequence, gap, impact).
+No tools, no project context beyond the document itself.
+
+| Model | Time | Output tokens | Reasoning tokens | Findings |
+|---|---|---|---|---|
+| GPT-5 Mini | 68s | 5,445 | 2,240 | 6 (+1 cut off) |
+| GPT-5 | 116s | 10,604 | 8,128 | 10 |
+| Claude Sonnet 4.6 | 38s | 1,868 | (internal) | 8 |
+
+**What they found — common ground (all 3 identified):**
+- Fill-to-position query race (fill event triggers evaluation but position
+  store hasn't yet reflected the fill)
+- Restrict flag ETS table destruction on PM crash → permissive window
+- Kill switch check vs liquidation submission race
+- Ticker subscription timing gap (new position opened but ticks not yet
+  subscribed → breach goes undetected)
+
+**GPT-5 unique findings (not in either other model):**
+- Stale prices are NOT fail-safe for drawdown (higher stale price → inflated
+  portfolio value → understated drawdown). The document claims "fail-safe"
+  but this only holds for exposure metrics, not drawdown. This is the most
+  architecturally significant finding across all three models.
+- Price definition mismatch between PM (last_trade from ETS) and OrderManager/
+  broker (bid/ask/mid) causing mis-sized liquidation and oscillation
+- Cross-component oscillation: PM hysteresis internal vs PRisk's immediate
+  binary restrict gate clearing (no cross-component cooldown)
+- Liquidation stuck after OM restart (terminal events lost;
+  liquidation_in_flight stays true indefinitely with no timeout/rehydration)
+- "Minimal risk checks" not enforced — PM goes through same OM gates as
+  strategy orders but MarketHours/StalePrice controls may reject after-hours
+  or stale-price liquidation attempts
+- FLATTEN mode semantics gap — PM refrains from liquidating when kill switch
+  engaged, but FLATTEN cancels open orders without actually CLOSING positions.
+  No component left to close positions.
+
+**Claude Sonnet 4.6 unique findings (not in either other model):**
+- Liquidation feedback loop with PortfolioRisk — buy-to-cover for short
+  positions could INCREASE net long exposure at portfolio level, paradoxically
+  worsening concentration while fixing position-level metrics
+- High water mark reset on pipeline restart masks true intraday drawdown
+  (restart → HWM resets to lower current value → drawdown calculated from
+  false baseline → larger losses permitted than intended)
+- Multi-metric breach with single boolean flag — concentration liquidation
+  for AAPL sets liquidation_in_flight, blocking simultaneous daily P&L
+  liquidation for different positions
+- Market close/open vs after-hours fills — claims to evaluate after-hours
+  fills but uses stale market-close prices
+
+**GPT-5 Mini unique findings (not in either other model):**
+- OrderManager order splitting/remapping causing liquidation_in_flight
+  correlation failure (parent/child order ID mapping breaks terminal-event
+  detection). Well-reasoned but highly implementation-specific.
+- Restrict/clear oscillation loop with strategy behavior (strategies react
+  to rejects → back off → restrict clears → strategies re-enter aggressively
+  → re-breach). Good systems-thinking about emergent feedback.
+
+**Quality assessment:**
+- **GPT-5** produced the most findings (10) and the highest-quality
+  architectural insight: the stale-price/drawdown contradiction is a genuine
+  design flaw that contradicts the document's own safety claim. Multiple
+  findings showed cross-boundary reasoning about semantic mismatches (price
+  definition, FLATTEN semantics, gate bypass). Every finding named specific
+  components and described precise event sequences.
+- **Claude Sonnet 4.6** was fast (38s, only 1,868 tokens) and produced 8
+  solid findings. The HWM reset finding and the multi-metric/single-flag
+  finding show genuine architectural reasoning. The liquidation feedback
+  loop (buy-to-cover worsening portfolio concentration) is subtle and
+  shows cross-position reasoning. However, some findings overlapped
+  significantly with the common-ground set and added less unique depth.
+  Sonnet performed MUCH better here than on race condition identification
+  (Finding #13): 8 findings vs GPT-5's 10 here, compared with 7 vs 12 there.
+- **GPT-5 Mini** produced 6 findings in 68s with 2,240 reasoning tokens.
+  Quality was genuinely good — the order-splitting/correlation finding
+  and the oscillation feedback loop both show real reasoning depth. It's
+  clearly NOT GPT-4.1 Mini — it reasons about component interactions,
+  not just within-frame risks. However, it found fewer issues and its
+  seventh finding was cut off (token limit or response truncation).
+
+**Key insight — task framing as the dominant variable:**
+This experiment used a much more structured prompt than previous ones:
+specified 5 categories, required specific output format, explicitly excluded
+single-component failures. The result: ALL models produced higher-quality,
+more focused output than in earlier experiments with broader prompts. Even
+Sonnet — which struggled on race conditions (Finding #13) — performed well
+here. The structured categories likely helped models organize their reasoning
+without losing track of what they were looking for.
+
+The prompt explicitly asked for "cross-component interaction failures" rather
+than general analysis. This is the narrow-lens effect from Finding #2, but
+applied to a complex multi-component document. The lens is narrow (only
+inter-component gaps) but the scope is broad (459 lines, many interactions).
+This combination — narrow analytical lens + broad document scope — appears
+to be the sweet spot for getting quality from all model tiers.
+
+**GPT-5 Mini positioning:**
+First time testing GPT-5 Mini. Results: 6 findings in 68s vs GPT-5's 10 in
+116s. That's 60% of the findings in 59% of the time, with 28% of the
+reasoning tokens (2,240 vs 8,128). Quality-per-finding was solid — the order
+correlation finding especially showed genuine systems reasoning. GPT-5 Mini
+appears to be a legitimate mid-tier: more capable than GPT-4.1 (which stayed
+within-frame in earlier experiments) but less exhaustive than GPT-5.
+Viable for: first-pass screening, bulk document review where you'd run many
+docs and can't afford full GPT-5 on each.
+
+**Sonnet recovery from Finding #13:**
+Sonnet went from 7 findings (with errors) on race conditions to 8 solid
+findings here. The difference: this prompt was more structured, the document
+was larger with more explicit interaction descriptions, and the task didn't
+require pure temporal/sequential reasoning. "Cross-component interaction
+failures" is closer to assumption-finding (Sonnet's strength) than race
+condition identification (Sonnet's weakness). Task taxonomy continues to
+matter more than raw model capability.
+
+**Updated model assignment for cross-component analysis:**
+1. GPT-5 — broadest coverage, finds semantic contradictions in the doc's
+   own claims (10 findings)
+2. Sonnet 4.6 — fast and efficient, good at portfolio-level reasoning and
+   feedback loops (8 findings in 38s)
+3. GPT-5 Mini — viable first-pass with genuine reasoning depth (6 findings)
+4. (Opus untested for this task type — likely strong on design tensions)
diff --git a/findings/2026-05-03-15-design-coherence-analysis.md b/findings/2026-05-03-15-design-coherence-analysis.md
new file mode 100644
index 0000000..8c930e4
--- /dev/null
+++ b/findings/2026-05-03-15-design-coherence-analysis.md
@@ -0,0 +1,133 @@
+# Finding 15: Design Coherence Analysis
+
+**Date:** 2026-05-03
+**Task:** Identify internal design incoherences in gargoyle's `failure-modes.md` (383 lines)
+— places where the document's stated principles/invariants are contradicted by its own
+specified mechanisms.
+**How we used them:** Same document (full text) + same focused analytical question to all
+3 models via HAI proxy. Highly structured prompt specifying 5 categories of incoherence
+to look for (safety properties not enforced, state machine violations, recovery contradictions,
+supervision conflicts, cross-mechanism contradictions). Required each finding to reference
+specific sections. No tools, no project context beyond the document itself.
+
+| Model | Time | Output tokens | Reasoning tokens | Incoherences found |
+|---|---|---|---|---|
+| Claude Sonnet 4.6 | ~39s | 1,168 | (internal) | 5 |
+| Claude Opus 4.6 | ~105s | 3,378 | (internal) | 7 (8 attempted, 1 self-withdrawn) |
+| GPT-5 | ~120s | 10,235 | 9,088 | 4 |
+
+**What they found — common ground (all 3 identified):**
+- State machine universality claim vs Strategy.Worker crash behavior (process
+  crashes bypass the degraded state entirely — no transition path in the model)
+- Market data staleness advisory-only vs the "don't trade when ambiguous" principle
+  (or vs concurrent failure auto-halt)
+- `pending_cancel`/`pending_replace` absent from recovery query set (GPT-5 and
+  Sonnet found this directly; Opus addressed the broader state machine gap)
+
+**GPT-5 unique findings (not in either Claude model):**
+- Kill switch halted = "process terminated" vs kill switch requiring RUNNING
+  processes for cancel_all, FLATTEN, and LIQUIDATE modes. The state definition
+  claims processes are terminated, but the mechanisms require them alive to
+  execute orders. **This is the most architecturally significant finding** — it
+  reveals a fundamental definitional error in the state machine.
+- Per-symbol degradation contradicts the process-level degradation semantics.
+  A worker "enters degraded" but continues operating for non-stale symbols —
+  violating the stated definition that degraded = "cannot perform primary
+  function." The metrics/eventing model has no per-symbol dimension.
+
+**Claude Opus unique findings (not in either other model):**
+- `:rest_for_one` cascade creates a FIFTH implicit state (terminated-and-
+  restarting) not in the four-state model — processes that were `normal` are
+  forcibly killed (not by kill switch) and restart. Self-corrected one finding
+  that initially looked like incoherence but was actually consistent.
+- PortfolioMonitor continues evaluating with stale data ("fail-safe") while
+  Strategy.Workers are stopped for the SAME condition — contradicts both the
+  universal state machine (PM doesn't transition to degraded) and the doc's
+  reasoning about why stale data is dangerous.
+- Path-dependent recovery criteria: Strategy.Worker recovery requires 21+ bars
+  after crash but only "price continuity check" after staleness. The state
+  machine's single "catch-up complete" exit condition can't express this.
+- `halted → [*]` transition in state diagram is logically impossible if "halted"
+  means the process is already terminated — dead processes can't fire transitions.
+- Compound failure detection requires a meta-observer across processes but the
+  per-process state machine model has no way to express cross-process conditions.
+
+**Claude Sonnet unique findings (not in either other model):**
+- Market data global staleness: the failure table says "Manual (disengage)" for
+  recovery — implying automatic engagement happened — but the text says it's
+  advisory only. Table contradicts prose.
+- ReconciliationGate: doc claims gate survives OM crash (separate supervision
+  tree), but then says "missing ETS table = not ready" when OM crashes. If the
+  gate survives, why would its table be missing?
+- Signal survival claims are contradictory between sections: worker crash says
+  downstream signals survive, but OM crash says all upstream signals lost.
+  (NOTE: this is actually describing different scenarios — worker crash doesn't
+  cascade to SignalRisk, OM crash does via :rest_for_one. Sonnet may have
+  misread the architecture here — the two statements are consistent when you
+  understand the supervision tree.)
+
+**Quality assessment:**
+- **GPT-5** found only 4 incoherences but TWO of them are genuinely critical
+  architectural findings. The "halted = terminated" vs "kill switch requires
+  running processes" contradiction is a real design error — you can't both
+  terminate processes AND require them to execute cancel/liquidation orders.
+  The per-symbol degradation finding is also a real modeling gap. GPT-5 was
+  MORE SELECTIVE here than in previous experiments — it didn't pad with
+  medium-severity findings. Each of its 4 was high/critical.
+- **Claude Opus** produced the most findings (7 valid) with characteristic
+  depth. Its self-correction (withdrawing finding #6 after deeper analysis)
+  shows intellectual honesty rare in model outputs. The PortfolioMonitor
+  stale-data contradiction is genuinely insightful — same input condition,
+  opposite response, no justification within the state machine model. The
+  compound failure meta-observer finding identifies a modeling category error.
+  Opus also found modeling imprecisions (path-dependent recovery, halted → [*]
+  impossibility) that the other models didn't notice.
+- **Claude Sonnet** found 5 issues quickly (39s, 1,168 tokens) but quality was
+  mixed. Finding #4 (ReconciliationGate) raises a genuine question about
+  the ETS table ownership claim. Finding #1 (table vs prose contradiction on
+  market data staleness) is a real documentation inconsistency. However,
+  Finding #5 appears to misread the supervision architecture — the two
+  statements about signal survival ARE consistent when you understand that
+  different crashes cascade differently. Sonnet produced one false positive.
+
+**Key insight — "design coherence" is a NEW analytical category with distinct model strengths:**
+This is different from assumption-finding (Finding #10-12), race conditions
+(Finding #13), and cross-component interactions (Finding #14). Coherence
+checking requires the model to hold MULTIPLE parts of the document in tension
+with each other and reason about whether they're compatible. Results:
+
+- **GPT-5** was MORE SELECTIVE than in any previous experiment. Only 4 findings
+  vs 10-24 in other tasks. But precision was near-perfect — all 4 are genuine
+  contradictions. This suggests GPT-5's reasoning tokens are being used for
+  VERIFICATION (checking whether apparent contradictions hold up) rather than
+  EXPLORATION (finding more things). The 9K reasoning tokens produced 4 findings
+  vs the usual 10+ — GPT-5 is self-editing aggressively.
+- **Opus** hit its sweet spot. Coherence checking IS design-tension identification
+  — Opus's consistent strength. Finding incoherences requires exactly the kind
+  of "how does this design disagree with itself" reasoning that Opus excels at.
+  It also showed unique self-correction behavior (withdrawing a finding after
+  deeper analysis).
+- **Sonnet** was fast but produced a false positive. Coherence checking requires
+  holding multiple document sections in memory simultaneously and reasoning about
+  their compatibility — this is harder than assumption-finding (where you
+  reason about one mechanism at a time) but easier than race conditions (which
+  require sequential temporal reasoning). Sonnet's results here fall in between.
+
+**Model ranking for design coherence checking:**
+1. Claude Opus 4.6 — most findings, highest depth, self-correcting (7 valid)
+2. GPT-5 — fewest findings but near-perfect precision, finds the critical ones (4)
+3. Claude Sonnet 4.6 — fast screening, but prone to false positives on
+   architectural misreads (4/5 valid)
+
+**This inverts the usual GPT-5 > Opus ordering.** In previous experiments, GPT-5
+consistently found MORE issues. Here, GPT-5 was more selective than Opus. The
+task type (self-consistency checking) favors Opus's "design tension" reasoning
+style over GPT-5's "exhaustive exploration" style. GPT-5 apparently uses its
+reasoning to VERIFY rather than GENERATE when the task is about contradictions
+rather than gaps.
+
+**Practical implication:** For architecture documents, run coherence checking as
+a separate pass using Opus as the primary model. GPT-5's higher precision means
+it's good for confirming which Opus findings are genuine vs overreads. The
+two-pass approach: Opus generates candidates → GPT-5 validates → result is the
+intersection plus GPT-5's independent finds.
diff --git a/findings/2026-05-03-16-specification-completeness-sonnet-45-produces.md b/findings/2026-05-03-16-specification-completeness-sonnet-45-produces.md
new file mode 100644
index 0000000..94f7f50
--- /dev/null
+++ b/findings/2026-05-03-16-specification-completeness-sonnet-45-produces.md
@@ -0,0 +1,131 @@
+# Finding 16: Specification completeness: Sonnet 4.5 produces 2x the findings of Sonnet 4.6 on implementation-gap analysis; quality vs volume tradeoff
+
+**Date:** 2026-05-03
+**Task:** Identify specification gaps in gargoyle's `kill-switch.md` (185 lines) — places
+where an implementer would be forced to guess or decide on their own because the spec
+doesn't clearly specify behavior. New analytical lens not previously tested.
+**How we used them:** Same document (full text) + same focused analytical question to all
+3 models via HAI proxy. Highly structured prompt specifying 5 categories of underspecification
+(behavioral ambiguity, missing edge cases, ordering/sequencing gaps, interface contracts
+undefined, concurrency semantics omitted). Required specific output format per finding
+(gap, section, what implementer must decide, risk if wrong, severity). No tools, no
+project context beyond the document itself.
+
+| Model | Time | Output tokens | Reasoning tokens | Gaps found | Critical | High | Medium | Low |
+|---|---|---|---|---|---|---|---|---|
+| Claude Sonnet 4.6 | 73s | 3,403 | (internal) | 13 | 8 | 4 | 0 | 1 |
+| Claude Sonnet 4.5 | 102s | 5,191 | (internal) | 25 | 14 | 6 | 4 | 1 |
+| GPT-5 | 109s | 10,140 | 7,872 | 19 | 8 | 7 | 3 | 0 |
+
+**What they found — common ground (all 3 identified):**
+- Pipeline process identification ambiguity (which processes are "pipeline processes")
+- Per-user process scope mapping (how to terminate only one user's processes)
+- ETS table ownership and lifecycle (who owns it, what happens on crash)
+- Concurrent engage operations (what happens when two sources engage simultaneously)
+- Liquidation order tagging mechanism (what the tag is, how verified)
+- Process restart prevention (how "must not restart" is enforced)
+- Engage sequence atomicity (partial failure between DB write and termination)
+- Startup ordering and ETS readiness (pipeline starting before ETS populated)
+- Disengage sequence ordering (what happens and in what order)
+
+**Sonnet 4.5 unique findings (not in either other model):**
+- ETS table schema/structure (set vs ordered_set, key format, value schema)
+- Missing ETS detection mechanism (catch :badarg vs table existence check)
+- Database write atomicity with ETS (transaction boundaries, rollback semantics)
+- Per-user engage while global is already engaged (is it a no-op or error?)
+- Broker rejection semantics ("already filled" vs "invalid cancel" distinction)
+- Cold-start gate interaction (independence vs dependency of the two gates)
+- User deletion with active kill switch (orphaned rows, cascade semantics)
+- Global disengage effect on per-user states (independent or auto-clear?)
+- Audit log write failure during engage (critical-path vs best-effort)
+- Dashboard control ambiguity in LIQUIDATE mode (contradictory disable/enable)
+- Cancel timeout duration (operational parameter not specified)
+- Manual order source code path during LIQUIDATE (how orders bypass the dead pipeline)
+
+**GPT-5 unique findings (not in either other model):**
+- Combined global/per-user mode semantics (what happens when global=RESTRICT,
+  user=LIQUIDATE — can user's liquidation proceed?)
+- Scope of "all" in cancel_all and liquidation (system-wide vs per-user)
+- Gate behavior when ETS missing but liquidation needed (conflicting requirements:
+  fail-closed says block, but liquidation needs to pass)
+- Disengage during in-flight cancellations (what happens to racing tasks)
+- Gate placement relative to broker submission (exact point in the flow)
+- Engage latency expectations (no quantified SLA)
+- Mode change while already engaged (RESTRICT → LIQUIDATE without disengage)
+- Dashboard vs backend scope for manual liquidation (individual vs bulk only)
+
+**Sonnet 4.6 unique findings (not in either other model):**
+- ETS sequencing relative to process termination (ETS before or after kill?)
+- Concurrent disengage + re-engage race (specific interleaving scenario)
+- Close-only enforcement mechanism (UI-only vs backend validation)
+- Order-in-flight past ETS gate during termination (already-checked orders)
+
+**Quality assessment:**
+- **Claude Sonnet 4.5** was the most EXHAUSTIVE (25 gaps) but with notable
+  quality variance. Several findings were highly specific and implementation-
+  relevant (ETS schema, missing-table detection, broker rejection semantics).
+  Others were relatively obvious or lower-impact (user deletion, audit log
+  failure, cancel timeout duration). The 14 Critical ratings feel somewhat
+  generous — some would be more accurately rated as High in practice. Output
+  was well-structured with clear per-finding format.
+- **GPT-5** found 19 gaps with consistent high quality. Its unique findings
+  show cross-cutting reasoning: the combined mode semantics finding (global
+  vs per-user mode interaction) identifies a genuine specification gap that
+  neither Sonnet version noticed. The "ETS missing but liquidation needed"
+  finding is architecturally significant — it identifies a CONTRADICTION in
+  the spec's own rules (fail-closed blocks everything, but liquidation must
+  pass). Every finding was actionable. More selective severity ratings
+  (8 Critical vs Sonnet 4.5's 14).
+- **Claude Sonnet 4.6** was the most SELECTIVE (13 gaps) but with the highest
+  precision. Every finding was genuinely a specification gap that an
+  implementer would face. The ETS sequencing finding (#4) is particularly
+  well-reasoned — it identifies a specific ordering dependency that creates
+  a race window. Sonnet 4.6 appears to self-filter aggressively, producing
+  only findings it's confident about. Higher signal-to-noise than 4.5.
+
+**Key insight — Sonnet 4.5 vs 4.6 on analytical tasks:**
+This is the first direct comparison between Claude model versions on the same
+analytical task. Key differences:
+
+- **Volume:** 4.5 produced almost 2x the findings (25 vs 13)
+- **Tokens:** 4.5 used ~1.5x the output tokens (5,191 vs 3,403)
+- **Time:** 4.5 took ~1.4x longer (102s vs 73s)
+- **Severity distribution:** 4.5 had more Critical findings (14 vs 8) but
+  with more generous severity ratings
+- **Quality per finding:** 4.6 had higher average quality; fewer "obvious"
+  or lower-impact findings
+
+The 4.6 model appears to have been trained toward higher precision/selectivity.
+It finds fewer things but each finding is more reliably a genuine gap. The 4.5
+model is more exhaustive but includes findings that a reviewer might triage as
+"yes, technically, but not really a spec gap." This is consistent with later Claude
+versions being tuned toward concision and selectivity; one task cannot confirm it.
+
+**For practical use:** If you want completeness (cast a wide net, accept some
+noise): use 4.5. If you want precision (every finding is actionable, no triage
+needed): use 4.6. For architecture review where missing a gap has cost, 4.5's
+exhaustiveness is probably worth the noise. For review where false positives
+cost attention (e.g., PR review comments), 4.6's selectivity is preferred.
+
+**GPT-5 vs Sonnet comparison on this task:**
+GPT-5 (19 findings) sits between the two Sonnets in volume but has the highest
+consistency — no obvious misses or inflated severities. Its unique strength
+here: finding CONTRADICTIONS within the spec's own rules (ETS-missing blocking
+conflicts with liquidation needing to pass). This is consistent with Finding #15
+where GPT-5 was unusually selective but precise on coherence checking.
+
+Specification completeness analysis appears to be a task where:
+1. Sonnet 4.5 is strongest for breadth (25 findings, catches operational gaps)
+2. GPT-5 is strongest for detecting spec self-contradictions (19 findings, high precision)
+3. Sonnet 4.6 is strongest for precision (13 findings, zero noise)
+
+**Updated model version comparison:**
+- Claude 4.6 → higher precision, more selective, concise
+- Claude 4.5 → more exhaustive, more verbose, occasional severity inflation
+- This is a genuine tradeoff, not a simple regression or improvement
+
+**Practical implication:** Consider running BOTH Sonnet versions: 4.5 catches things 4.6
+filters out (ETS schema, broker rejection semantics, cold-start gate interaction).
+4.6 catches things with more specificity (sequencing gaps, exact race windows).
+For a one-shot budget: 4.5 if you want coverage, 4.6 if you want actionability.
+GPT-5 if you want to find where the spec contradicts itself.
diff --git a/findings/2026-05-04-18-temporal-boundary-analysis-gpt5-is.md b/findings/2026-05-04-18-temporal-boundary-analysis-gpt5-is.md
new file mode 100644
index 0000000..11e5cd4
--- /dev/null
+++ b/findings/2026-05-04-18-temporal-boundary-analysis-gpt5-is.md
@@ -0,0 +1,158 @@
+# Finding 18: Temporal boundary analysis: GPT-5 is most exhaustive; Opus finds design-level contradictions; Sonnet 4.5 is structured but less deep
+
+**Date:** 2026-05-04
+**Task:** Identify temporal boundary vulnerabilities in gargoyle's `escalation-policy.md`
+(238 lines) — scenarios where the timing model (evaluation cycles, debounce counts,
+cooldown periods) creates windows of incorrect or dangerous behavior.
+**How we used them:** Same document (full text) + same focused analytical question to all
+3 models via HAI proxy. Highly structured prompt specifying 5 categories of temporal
+vulnerability (timing exploitation, counter reset abuse, asymmetric time exposure,
+cross-metric temporal interactions, state loss temporal effects). Required specific
+output format per finding (name, sequence with cycle numbers, mechanism, severity, fix).
+No tools, no project context beyond the document itself.
+
+| Model | Time | Output tokens | Reasoning tokens | Findings | Critical | High | Medium |
+|---|---|---|---|---|---|---|---|
+| GPT-5 | ~128s | 9,175 | 5,888 | 15 | 3 | 7 | 2 |
+| Claude Opus 4.6 | ~120s | 5,112 | (internal) | 10 | 3 | 5 | 2 |
+| Claude Sonnet 4.5 | ~100s | 4,056 | (internal) | 12 | 3 | 3 | 3 |
+
+**What they found — common ground (all 3 identified):**
+- Flash crash / inter-evaluation gap exploitation (metric spikes between discrete
+  evaluation cycles go undetected)
+- Single clear cycle resetting debounce counter (transient recovery defeats escalation
+  despite sustained risk — metric can breach 80%+ of cycles and never escalate)
+- Asymmetric escalation time vs loss compounding rate (11 cycles to reach liquidation
+  while losses compound every single cycle)
+- Monitor crash resets state to Clear, losing all escalation progress
+- Liquidation re-trigger requiring full debounce reset, delaying subsequent batches
+- Kill switch N value unspecified (timing indeterminacy)
+
+**GPT-5 unique findings (not in either other model):**
+- Boundary-alignment counter starvation: explicitly modeled the "adversarial flicker"
+  pattern (breaching 2 cycles, 1 clear, repeat: ~67% breach time, never escalates)
+  with a precise mathematical framing of why K-of-N is needed
+- Cycle-length drift under load: GC pauses or CPU contention stretching evaluation
+  intervals means "3 cycles" could be 12 minutes instead of 90 seconds when it
+  matters most (high-load market stress = slowest evaluations)
+- Adversarial boundary timing (market microstructure masking): illiquid instruments
+  where opposing prints predictably arrive near evaluation boundaries, exploiting
+  deterministic sampling points
+- Cross-metric escalation gap and hedging prohibition: Restrict prevents ALL new
+  positions including risk-REDUCING hedges needed for a different metric still
+  escalating on its own timeline — protection for metric A actively worsens metric B
+- Cooldown stall causing prolonged Restrict: repeated transient spikes near hysteresis
+  threshold reset cooldown indefinitely while metric is actually safe
+- State inconsistency between restriction flags and monitor after restart:
+  documented asymmetry where flag persists (manual clear) but state resets (auto
+  clear) — creates orphaned restriction or unprotected window depending on
+  reconciliation approach
+- Metric computation fail-closed interacting with debounce: system errors create
+  false escalations with long cooldown, potentially blocking hedging trades
+- Unspecified N for kill switch post-liquidation breaches: coupled with crash
+  reset, system can loop indefinitely without reaching kill switch
+- In-liquidate flicker stall: one cycle below threshold after partial fill resets
+  re-trigger counter, stalling further liquidation
+
+**Claude Opus unique findings (not in either other model):**
+- De-escalation cooldown exploitation (predictable window): after cooldown completes
+  and restriction lifts, strategy has a GUARANTEED 5+ cycle window of unrestricted
+  trading before Restrict can re-engage — an automated strategy could systematically
+  exploit this predictable safe window to re-enter dangerous positions
+- Orphaned restriction flag asymmetry framed as a DESIGN CONTRADICTION: the failure
+  modes table specifies opposing recovery paths for state (automatic → Clear) vs
+  flags (manual clear), creating an irreconcilable dual state. Opus uniquely
+  identified that operator intervention to clear the flag could inadvertently
+  create a WORSE protection gap than leaving it orphaned
+- Synthesizing analysis style: Opus's summary explicitly identified that the
+  three Critical findings share a common cause (debounce optimizes against false
+  positives at the expense of false negatives during sustained events) and proposed
+  a single architectural fix (severity-aware fast path) that addresses all three
+
+**Claude Sonnet 4.5 unique findings (not in either other model):**
+- De-escalation timing not accounting for proximity to breach threshold: system
+  removes protection while metric is still near-dangerous, and re-escalation
+  requires full debounce — created a specific "whipsaw" scenario with cycle numbers
+- Manual-only de-escalation from Liquidate creates UNBOUNDED recovery time:
+  if triggered at 2 AM Saturday, trading disabled until Monday despite metrics
+  recovering in minutes. Framed as contradiction with "autonomous" design goals
+- Evaluation cycle synchronization assumption: no handling of variable timing
+  (CPU contention, GC pauses) — implicit throughout but never addressed
+- Cold start escalation ambiguity: system starts with no prior state while
+  portfolio may already be in breach condition
+- De-escalation event ordering race: multiple metrics de-escalating simultaneously
+  may emit events in non-deterministic order, confusing external observers
+
+**Quality assessment:**
+- **GPT-5** was the most exhaustive (15 findings) and showed the strongest
+  mathematical/systems reasoning. Its unique findings included precise attack
+  models (adversarial flicker, boundary alignment, microstructure masking) that
+  describe exact exploitation patterns with percentages and cycle counts. The
+  cross-metric hedging prohibition finding is architecturally significant — it
+  identifies that protection for one metric can actively CREATE risk for another.
+  Every finding was actionable with specific fixes.
+- **Claude Opus 4.6** produced fewer findings (10) but with characteristic depth
+  and self-awareness. Its cooldown exploitation finding identified a PREDICTABLE
+  exploit window that an automated strategy could systematically abuse — framed
+  not as an accident but as an adversarial opportunity. The summary synthesis
+  (identifying common cause across Critical findings) shows meta-analytical
+  capability the other models didn't demonstrate. Opus also uniquely identified
+  that human intervention to fix one problem could create a WORSE problem —
+  second-order operational reasoning.
+- **Claude Sonnet 4.5** was well-structured (12 findings, clean severity tiers,
+  organized by Critical/High/Medium/Low) and faster than both other models.
+  Its findings were solid but less architecturally deep. The manual de-escalation
+  contradiction finding was genuinely insightful (unbounded recovery time vs
+  autonomous design goals). However, several findings restated concepts the
+  other models covered with less specificity about exploitation mechanics.
+
+**Key insight — temporal reasoning as a task type:**
+This is the first experiment specifically testing "temporal boundary analysis" —
+reasoning about time-domain properties of a state machine (evaluation frequency,
+counter semantics, cooldown mechanics, crash/restart timing).
+
+Results compared to Finding #13 (race condition identification on a concurrency doc):
+- GPT-5: 15 findings here vs 12 in Finding #13. Consistent high performance
+  on temporal reasoning tasks across both experiments.
+- Opus: 10 findings here vs 10 in Finding #13. Remarkably consistent — Opus
+  produces ~10 high-quality findings regardless of temporal task variant.
+- Sonnet 4.5: 12 findings here (first test). Compare to Sonnet 4.6's 7 findings
+  (with errors) in Finding #13. Sonnet 4.5 appears to handle temporal reasoning better
+  than 4.6 (different documents, so indicative only), consistent with Finding #16.
+
+**Sonnet 4.5 vs 4.6 on temporal reasoning (inferred comparison):**
+Sonnet 4.6 struggled significantly on race condition identification (Finding #13:
+7 findings with analytical errors, misreading architecture). Sonnet 4.5 here
+produced 12 solid findings with no apparent misreadings. This suggests 4.5's
+exhaustiveness advantage extends to temporal reasoning — the additional
+exploration it does (vs 4.6's aggressive self-filtering) catches more temporal
+interactions. This fits Finding #16's pattern: 4.5 for coverage, 4.6 for precision.
+
+**The structured-prompt effect continues:**
+All three models produced focused, high-quality output with this highly structured
+prompt (5 specific categories + required output format). This confirms Finding #14:
+narrow analytical lens + broad document scope is the sweet spot for all model tiers.
+The prompt structure appears to be a stronger predictor of output quality than model
+choice for the bottom 80% of findings (all models find the common-ground issues).
+Model choice matters for the TOP 20% — the unique insights that require deeper
+reasoning about system interactions.
+
+**Updated model assignment for temporal boundary analysis:**
+1. GPT-5 — most exhaustive, strongest at modeling adversarial exploitation patterns
+   and mathematical edge cases (15 findings)
+2. Claude Sonnet 4.5 — good volume with clean structure, viable for first-pass
+   temporal analysis (12 findings, no errors)
+3. Claude Opus 4.6 — fewest findings but highest insight density, uniquely
+   identifies predictable exploit windows and operational second-order effects
+   (10 findings)
+
+**Practical implication:** For temporal analysis on state machines and timing-dependent
+policies, the three-model stack produces genuine complementary value:
+- GPT-5 catches the adversarial attack patterns and mathematical edge cases
+- Opus catches the predictable exploit windows and operational contradictions
+- Sonnet 4.5 provides good breadth at lower cost with clean severity categorization
+
+The union of unique findings across all three models reveals significantly more
+temporal vulnerabilities than any single model alone. For a document governing
+autonomous financial actions (liquidation, kill switch), the cost of running all
+three (~$1-2) is trivially justified against the risk of missing a timing exploit.
diff --git a/findings/2026-05-04-19-union-coverage-test-gpt5-mini.md b/findings/2026-05-04-19-union-coverage-test-gpt5-mini.md
new file mode 100644
index 0000000..e26b842
--- /dev/null
+++ b/findings/2026-05-04-19-union-coverage-test-gpt5-mini.md
@@ -0,0 +1,124 @@
+# Finding 19: Union coverage test: GPT-5 Mini + Sonnet 4.6 covers ~71% of GPT-5's findings; the missing 29% is where the real value lives
+
+**Date:** 2026-05-04
+**Task:** Identify hidden assumptions in gargoyle's `trading-pipeline.md` (1,110 lines,
+~62KB) — the most complex document tested so far, covering the full end-to-end path
+from tick ingestion through order execution.
+**How we used them:** Same document (full text, no truncation) + same focused analytical
+question to all 3 models via HAI proxy. Standard hidden-assumption prompt with 5
+categories (runtime behavior, external dependencies, timing/ordering, scale/load,
+uncovered failure modes). Required specific output format per finding. No tools, no
+project context beyond the document itself.
+
+| Model | Time | Output tokens | Reasoning tokens | Assumptions found |
+|---|---|---|---|---|
+| GPT-5 | 99s | 9,418 | 5,696 | 35 |
+| GPT-5 Mini | 93s | 5,309 | 1,792 | 21 |
+| Claude Sonnet 4.6 | 38s | 1,792 | (internal) | 17 |
+
+**Coverage analysis — can Mini + Sonnet together replace GPT-5?**
+
+We categorized each of GPT-5's 35 findings by whether Mini, Sonnet, or both
+also identified the same assumption:
+
+- **Covered by BOTH Mini and Sonnet:** ~12 findings (common ground — any model
+  finds these: idempotency, single-writer, clock sync, instrument resolution,
+  fill immutability, reconciliation gate, backpressure, fill correlation, event
+  ordering, audit scalability, PortfolioRisk bottleneck)
+- **Covered by Mini only (not Sonnet):** ~7 findings (transactional atomicity,
+  audit causal consistency, modification-in-flight enforcement, OM throughput,
+  decimal precision, PM/PR close-only race, partition duplicate submit)
+- **Covered by Sonnet only (not Mini):** ~6 findings (market data feed rates,
+  pipeline-vs-market speed, corporate actions atomicity, kill switch partition,
+  shared port isolation, market close vs auction fills)
+- **Union(Mini + Sonnet) total coverage:** ~25/35 = **~71%** of GPT-5's findings
+- **GPT-5 unique (missed by both):** ~10-18 findings depending on strictness
+
+**What GPT-5 uniquely found that the cheaper pair missed:**
+
+The missing 29% is NOT random — it's systematically different in character:
+
+1. **Operational edge cases:** Default TIF "day" broker semantics, OrderRate
+   counting retries, extended-hours MarketHours mismatch, fractional quantities,
+   local expiry timer precision per instrument
+2. **Design-level interaction gaps:** PortfolioRisk concurrent decision race
+   (snapshot stale between two parallel approvals), re-validation gap between
+   approval and submit, decision loss on crash after audit write
+3. **Domain-specific knowledge:** Manual broker-side actions conflicting with
+   state machine, options/complex instrument position_effect mapping, Decision→Order
+   1:1 invariant vs broker auto-splitting, wash sale retroactive P&L mutation
+4. **Architectural observations:** Reduction re-entry rule insufficiency,
+   PortfolioMonitor coalescing vs fast breach detection, multi-aggregator fanout
+   and audit partial writes, replay/backtest alignment with production controls
+
+These share a common trait: they require **domain expertise** (knowing how brokers
+actually behave, how regulatory rules interact, how production trading systems
+fail in practice) combined with **architectural reasoning** (how the design's own
+mechanisms interact under those real-world conditions). The cheaper models find
+assumptions about the document's internal consistency; GPT-5 additionally finds
+assumptions about the document's relationship to the external world it must
+operate in.
+
+**GPT-5 Mini vs Sonnet 4.6 — complementary, not redundant:**
+
+Mini and Sonnet covered different gaps:
+- Mini was stronger on **internal consistency** (transactional atomicity, causal
+  consistency, decimal precision, modification serialization)
+- Sonnet was stronger on **external interactions** (market data feeds, corporate
+  actions, kill switch distribution, shared resource isolation)
+
+This aligns with previous findings: Mini reasons about implementation mechanics;
+Sonnet reasons about system boundaries and external interactions. Their union
+covers more ground than either alone.
+
+**Cost comparison:**
+
+| Approach | Total tokens | Approx. cost | Coverage of GPT-5 |
+|---|---|---|---|
+| GPT-5 alone | ~21K (9.4K output + 5.7K reasoning) | ~$0.80 | 100% (35 findings) |
+| Mini + Sonnet | ~7.1K output + 1.8K reasoning | ~$0.25 | ~71% (25/35 findings) |
+| All three | ~28K total | ~$1.05 | >100% (35 + unique Sonnet/Mini extras) |
+
+**Key insight — the 71% coverage is a floor, not a ceiling:**
+
+The union covers 71% of GPT-5's specific findings. But Mini and Sonnet each
+also produced findings that GPT-5 DIDN'T make:
+- Sonnet: DailyLossLimit query performance scaling, instrument reference data
+  propagation atomicity across components
+- Mini: Signal audit correlation ambiguity under replay/duplicate ticks
+
+So the total unique finding space is LARGER than any single model. Running all
+three produces the most comprehensive analysis.
+
+**Answer to the open question: "Would running GPT-5 Mini + Sonnet together
+approach GPT-5's coverage at lower combined cost?"**
+
+**Partially.** The pair covers ~71% of GPT-5's findings at ~31% of the cost.
+But the missing 29% is disproportionately valuable — it contains the
+domain-specific, interaction-level, real-world-knowledge findings that are
+most likely to prevent production incidents. For a quick sanity check or
+first-pass screening, Mini + Sonnet is excellent value. For architecture
+review where completeness matters (financial system, safety-critical), GPT-5
+is not replaceable by cheaper models — its unique findings are exactly the
+ones that would cause real-world failures.
+
+**Practical implication:** The optimal strategy depends on stakes:
+- **Low stakes** (internal doc review, non-critical systems): Mini + Sonnet
+  is 71% coverage at 31% cost — strong ROI
+- **High stakes** (financial systems, safety-critical): run all three — the
+  ~$1 total cost is irrelevant vs the value of the extra 10-18 findings
+- **Budget-conscious high stakes:** run GPT-5 alone — it subsumes most of
+  what Mini + Sonnet find, and adds the critical domain-knowledge findings
+
+The cost argument for Mini + Sonnet as a GPT-5 REPLACEMENT doesn't hold for
+important work. The cost argument for Mini + Sonnet as a GPT-5 COMPLEMENT
+is strong — they catch a few things GPT-5 misses, and the union of all three
+is the most thorough analysis available.
+
+**Document complexity observation:**
+This is the largest document tested (1,110 lines vs previous 185-785 lines).
+GPT-5's finding count scaled up (35 vs 20-26 on smaller docs) while maintaining
+quality — no padding with obvious/low-value findings. Mini also scaled (21 vs
+6 on 459-line doc in Finding #14). Sonnet scaled less (17 vs 12-17 on smaller
+docs) — it appears to have a natural output ceiling regardless of document size,
+consistent with its self-filtering behavior observed in previous findings.
diff --git a/findings/2026-05-04-20-invariant-violation-path-analysis-gpt5.md b/findings/2026-05-04-20-invariant-violation-path-analysis-gpt5.md
new file mode 100644
index 0000000..f6c2be8
--- /dev/null
+++ b/findings/2026-05-04-20-invariant-violation-path-analysis-gpt5.md
@@ -0,0 +1,163 @@
+# Finding 20: Invariant violation path analysis: GPT-5 is maximally selective (3 findings, all genuine); Opus shows unique self-correcting analytical style; new task type favors precision over exhaustiveness
+
+**Date:** 2026-05-04
+**Task:** Identify invariant violation paths in gargoyle's `user-pipeline-lifecycle.md`
+(730 lines) — sequences of legal operations that can violate the system's stated or
+implied invariants. NEW analytical lens not previously tested, distinct from assumption-
+finding, race conditions, or coherence checking.
+**How we used them:** Same document (full text) + same focused analytical question to all
+3 models via HAI proxy. Highly structured prompt specifying 5 categories of invariant
+violations (state machine escapes, invariant composition failures, monotonicity violations,
+idempotency boundary violations, authority inversion sequences). Required specific output
+format per finding. No tools, no project context beyond the document itself.
+
+| Model | Time | Output tokens | Reasoning tokens | Findings |
+|---|---|---|---|---|
+| GPT-5 | 143s | 784 | 12,032 | 3 |
+| Claude Opus 4.6 | 113s | 6,183 | (internal) | 7 (with 2 self-corrections) |
+| Claude Sonnet 4.6 | 23s | 1,266 | (internal) | 5 |
+
+**What they found — common ground (2+ models identified):**
+
+- **Periodic reconciliation overrides operator manual stop** (GPT-5 #3 + Opus #5 +
+  Sonnet #1): An admin who stops a pipeline via `stop_user/1` with `:admin_action`
+  has their decision overridden within 5 minutes by periodic reconciliation, because
+  there's no "admin stopped" state in `check_eligibility/1`. All three models
+  independently identified this as the clearest authority inversion.
+- **DynamicSupervisor restart bypasses eligibility gate** (Opus #1/#3 + Sonnet #2):
+  When `UserPipeline.Supervisor` crashes and is restarted by OTP supervision, the
+  restart bypasses `start_user/1` and `check_eligibility/1` entirely — potentially
+  resuming trading while the kill switch is engaged.
+- **Stale ReconciliationGate after crash** (Opus #7 only, grouped here with the
+  restart cluster): After a crash-triggered DynamicSupervisor restart, `stop_user/1`
+  (which resets the gate) is never called, so the ReconciliationGate stays `:ready`
+  from the old instance. The new OrderManager may accept orders during its own reconciliation.
+- **HealthMonitor co-lifecycle violation** (Opus #2 + Sonnet #4): After a
+  DynamicSupervisor-initiated restart, the HealthMonitor is still subscribed to the
+  old PIDs — no code re-establishes monitoring for the new pipeline processes.
+
+**GPT-5 unique findings (not in either other model):**
+
+- **Kill switch bypass for users configured DURING engagement** (#1): A user who
+  saves credentials while the kill switch is engaged is never added to the pending
+  operator release set (only running pipelines are added at engage time). After
+  disengage, periodic reconciliation auto-starts this user's pipeline without
+  operator release — violating "resuming always requires human judgment." This is
+  the most precisely reasoned finding across all three models: each step is
+  individually correct per the spec, and the violation emerges purely from the
+  composition of legal operations.
+- **Premature release bypass** (#2): If `operator_release_user/1` is called while
+  the kill switch is still engaged (a legal operation), it clears the pending
+  release flag but `start_user/1` correctly refuses. After later disengage, the
+  flag is gone — auto-start proceeds without fresh operator judgment. The release
+  was "spent" at the wrong time.
+
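Both GPT-5 sequences can be replayed in a toy model. This is a sketch under stated assumptions, not gargoyle's implementation: the `System` class, its fields, and the rule that a pending-release flag blocks `start_user` are hypothetical stand-ins inferred from the descriptions above, and only the function names `start_user` and `operator_release_user` echo the spec.

```python
from dataclasses import dataclass, field


@dataclass
class System:
    """Toy model of the lifecycle rules described above (all names hypothetical)."""
    kill_switch: bool = False
    configured: set = field(default_factory=set)
    running: set = field(default_factory=set)
    pending_release: set = field(default_factory=set)

    def save_credentials(self, user):
        self.configured.add(user)

    def engage_kill_switch(self):
        self.kill_switch = True
        # Only pipelines running at engage time enter the pending-release set.
        self.pending_release |= self.running
        self.running.clear()

    def disengage_kill_switch(self):
        self.kill_switch = False

    def start_user(self, user):
        blocked = self.kill_switch or user in self.pending_release
        if blocked or user not in self.configured:
            return False
        self.running.add(user)
        return True

    def operator_release_user(self, user):
        # Legal while engaged: the flag is cleared, but the start is refused.
        self.pending_release.discard(user)
        return self.start_user(user)

    def periodic_reconciliation(self):
        for user in sorted(self.configured - self.running):
            self.start_user(user)


s = System()
for user in ("alice", "carol"):
    s.save_credentials(user)
    s.start_user(user)
s.engage_kill_switch()                       # alice and carol now await release
s.save_credentials("bob")                    # finding #1: configured during engagement
released = s.operator_release_user("alice")  # finding #2: release spent too early
s.disengage_kill_switch()
s.periodic_reconciliation()
print(released, sorted(s.running), sorted(s.pending_release))
```

In this model `carol` stays stopped as intended, while `alice` and `bob` resume without any operator decision made after the disengage, even though every individual call was legal.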
+**Claude Opus unique findings (not in either other model):**
+
+- **`operator_release_system/0` clears unrelated safety obligations** (#4):
+  Operator intends to release one user from a recent event but
+  `operator_release_system/0` also releases other users still pending from an
+  earlier, unresolved event. One release call discharges multiple independent
+  safety obligations — monotonicity violation.
+- **State machine incompleteness for blocked users** (#6): Users who become
+  configured during kill switch engagement (blocked with reason
+  `:kill_switch_engaged`) have no state machine transition back to `starting`
+  after disengage — they're not in the pending release set, and no event fires.
+  System works via periodic reconciliation (up to 5 minutes delay), but the
+  documented state machine doesn't represent this path.
+- **Self-correcting analytical style:** Opus explicitly withdrew two draft
+  findings mid-analysis ("Actually, this sequence works as designed. Let me
+  identify a real violation instead." / "this is likely handled"). This
+  self-correction behavior was first observed in Finding #15 and is now
+  confirmed as a consistent Opus trait for invariant-style analysis.
+
+**Claude Sonnet unique findings (not in either other model):**
+
+- **Cold-start Tier 3 failure creates supervision restart loop** (#2): A
+  persistent Tier 3 failure (phantom fills) crashes OrderManager, `:rest_for_one`
+  kills the tree, DynamicSupervisor restarts it, cold-start fails again → infinite
+  loop. State machine shows `starting → stopped` but supervision creates
+  `starting → starting` indefinitely.
+- **HealthMonitor start failure during start_user** (#4): If HealthMonitor.Supervisor
+  is momentarily crashed when `start_user/1` runs step 4, the pipeline starts
+  without monitoring. No error handling specified for this partial-start state.
+
+**Quality assessment:**
+
+- **GPT-5** was MAXIMALLY SELECTIVE — only 3 findings from 12,032 reasoning tokens
+  (4,011 reasoning tokens per finding). This is the most extreme
+  reasoning-to-output ratio observed: 15:1 (12,032 reasoning / 784 output tokens).
+  For comparison, in previous experiments GPT-5 typically shows 1:1 to 2:1 ratios.
+  Every finding is a genuine invariant violation with a precise, step-by-step
+  sequence where each step is individually legal. ZERO false positives, zero
+  padding, zero "this might be an issue." GPT-5 appears to have used almost all
+  its reasoning budget for VERIFICATION — confirming that each candidate is
+  genuinely a violation before including it.
+- **Claude Opus** produced the most findings (7) with its characteristic depth and
+  self-correction. Two findings were revised mid-analysis, showing Opus actively
+  testing its own reasoning against the document before committing to a finding.
+  The DynamicSupervisor restart thread (findings #1, #2, #3, #7) forms a coherent
+  cluster — Opus identified one root cause (OTP restarts bypass the lifecycle
+  layer) and explored its multiple consequences. The `operator_release_system`
+  monotonicity finding (#4) is architecturally significant and unique.
+- **Claude Sonnet** was extremely fast (23s, 1,266 tokens) and produced 5 findings.
+  Quality was mixed: Finding #1 partially mirrors GPT-5's authority inversion but
+  with vaguer reasoning ("race condition with ETS operations" — not specified).
+  Finding #3 describes a contradiction but the scenario is internally inconsistent
+  (step 5 says "pipeline termination fails" but then step 7 says pipeline is still
+  running — this conflates two failure modes). Findings #2 and #4 are genuine and
+  well-reasoned. Sonnet's precision is lower than the other two on this task.
+
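The ratios quoted for GPT-5 can be recomputed from the results table. A small sketch, assuming only the table values above; the `summarize` helper is hypothetical, and the Claude models get no reasoning figures because their reasoning tokens are reported as internal.

```python
runs = {
    # model: (output_tokens, reasoning_tokens or None if not reported, findings)
    "GPT-5": (784, 12_032, 3),
    "Claude Opus 4.6": (6_183, None, 7),
    "Claude Sonnet 4.6": (1_266, None, 5),
}


def summarize(output_tokens, reasoning_tokens, findings):
    """Per-finding token costs; reasoning metrics only when tokens are reported."""
    summary = {"output_per_finding": output_tokens / findings}
    if reasoning_tokens is not None:
        summary["reasoning_per_finding"] = reasoning_tokens / findings
        summary["reasoning_to_output"] = reasoning_tokens / output_tokens
    return summary


stats = {model: summarize(*row) for model, row in runs.items()}
gpt5 = stats["GPT-5"]
print(f"GPT-5: {gpt5['reasoning_per_finding']:,.0f} reasoning tokens/finding, "
      f"{gpt5['reasoning_to_output']:.1f}:1 reasoning:output")
```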
+**Key insight — "Invariant violation paths" as a task type:**
+
+This is a genuinely DIFFERENT analytical task from any previously tested. It requires:
+1. Identifying the invariants (explicit or implied)
+2. Constructing a sequence of operations (creative/generative)
+3. Verifying each step is legal per the spec (verification)
+4. Confirming the end state violates the invariant (correctness proof)
+
+This four-phase process explains GPT-5's extreme selectivity: step 2 is generative, but
+steps 3 and 4 are verification-heavy, and that is where GPT-5's reasoning tokens appear to
+go (confirming each step is genuinely legal and the final state genuinely violates). In
+previous tasks like "find hidden assumptions" or "find gaps," only step 1 (identification)
+is needed; there is no construction or verification phase.
+
+**Comparison to previous task types:**
+
+| Task type | GPT-5 findings | Opus findings | GPT-5 reasoning overhead |
+|---|---|---|---|
+| Hidden assumptions | 20-35 | 12-13 | 5-7K reasoning |
+| Race conditions | 12 | 10 | 8K reasoning |
+| Design coherence | 4 | 7 | 9K reasoning |
+| Invariant violation paths | 3 | 7 | **12K reasoning** |
+
+The pattern: as the task requires more VERIFICATION (vs identification), GPT-5 becomes
+more selective and spends more reasoning tokens per finding. Invariant violation paths
+demand the highest verification burden (every step must be confirmed legal), and GPT-5
+responds with the highest selectivity and reasoning investment.
+
+Opus inverts relative to GPT-5: it out-produces GPT-5 on verification-heavy tasks (7 vs 4
+for coherence, 7 vs 3 for invariant paths) but trails it on identification tasks (12-13
+vs 20-35 for assumptions). This suggests Opus uses its internal reasoning differently:
+it's more willing to present findings that have "likely" rather than "proven" violations,
+then self-corrects inline if the verification fails.
+
+**Practical implication:**
+
+For invariant violation path analysis:
+- **GPT-5** produces the highest-precision findings but very few. Every finding is a
+  genuine spec-level bug. Use when you need zero-false-positive bug reports to present
+  to a design team.
+- **Opus** produces more findings with slightly lower precision but unique analytical
+  depth. Its self-correction behavior means false positives are often caught inline.
+  Use when you want both confirmed violations AND identified tensions.
+- **Sonnet** is too imprecise for this task type — some findings have internal
+  inconsistencies. Use for lighter analytical tasks (assumption-finding, spec gaps).
+
+The three findings GPT-5 produced are ALL genuine design bugs that should be fixed:
+1. Users configured during kill switch engagement bypass operator release
+2. Premature operator release (while KS still engaged) creates future bypass
+3. Admin stops are overridden by periodic reconciliation
+
+These are the kind of findings that, in a real financial system, prevent production
+incidents. The 12K reasoning tokens to produce 3 perfect findings is excellent ROI.
diff --git a/findings/2026-05-04-21-reasoning-effort-lowmediumhigh-has-negligible.md b/findings/2026-05-04-21-reasoning-effort-lowmediumhigh-has-negligible.md
new file mode 100644
index 0000000..b91e04d
--- /dev/null
+++ b/findings/2026-05-04-21-reasoning-effort-lowmediumhigh-has-negligible.md
@@ -0,0 +1,125 @@
+# Finding 21: Reasoning effort (low/medium/high) has negligible effect on GPT-5's analytical output; the parameter may not work as documented for open-ended analysis
+
+**Date:** 2026-05-04
+**Task:** Identify hidden assumptions in gargoyle's `order-state-machine.md` (221 lines)
+— a well-structured state machine specification covering order lifecycle, fill precedence,
+TIF semantics, and parameter resolution.
+**How we used them:** Same document, same prompt, same model (GPT-5), same
+max_completion_tokens (16K). Only variable: `reasoning.effort` parameter set to
+"low", "medium", or "high". Run sequentially via HAI proxy (OpenAI-compatible
+endpoint). No tools, no project context beyond the document.
+
+| Effort | Time (ms) | Output tokens | Reasoning tokens | Findings |
+|---|---|---|---|---|
+| Low | 97,913 | 7,657 | 4,288 | 33 (+11 recs) |
+| Medium | 94,824 | 7,112 | 4,160 | 30 |
+| High | 88,607 | 6,891 | 3,712 | 30 |
+
+**The counterintuitive result:** Moving from low to high effort produced FEWER
+findings (33 vs 30), FEWER reasoning tokens, FEWER output tokens, and a FASTER
+completion. The expected pattern (high effort → more reasoning → more depth) was inverted.
+
+**Per-finding metrics (remarkably consistent):**
+
+| Effort | Output tokens/finding | Reasoning tokens/finding |
+|---|---|---|
+| Low | 232 | 129 |
+| Medium | 237 | 138 |
+| High | 229 | 123 |
+
+The depth per finding was nearly identical across all three levels. The model
+didn't get more detailed or rigorous per finding at higher effort; it just
+found slightly fewer things.
+
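The per-finding table can be reproduced from the raw totals. A minimal sketch using only the numbers in the first table; it uses floor division because the published per-finding values match the truncated quotients rather than the rounded ones.

```python
runs = {
    # effort: (output_tokens, reasoning_tokens, findings)
    "low": (7_657, 4_288, 33),
    "medium": (7_112, 4_160, 30),
    "high": (6_891, 3_712, 30),
}

# (output tokens per finding, reasoning tokens per finding), truncated.
per_finding = {
    effort: (output // n, reasoning // n)
    for effort, (output, reasoning, n) in runs.items()
}

for effort, (out_pf, reas_pf) in per_finding.items():
    print(f"{effort:<6} output/finding={out_pf} reasoning/finding={reas_pf}")
```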
+**Severity distributions (similar across all three):**
+- Low: 7 Critical, 21 High, 5 Medium (33 findings)
+- Medium: 9 Critical, 15 High, 4 Medium + 2 borderline (30 findings)
+- High: 6 Critical, 14 High, 5 Medium + 4 borderline (30 findings)
+
+**Qualitative differences — WHAT they found:**
+
+High-effort unique findings (not in low):
+- Single-writer authority to broker (no out-of-band modifications)
+- Broker emits fills for all executed quantities (no silent netting)
+- Instrument identity remains stable across corporate actions
+- Late-fill override won't violate downstream invariants
+- Validation covers lot sizes, price ticks, borrow/locate constraints
+- Multiple accounts and venues are part of the correlation key
+- Streaming and polling APIs are consistent
+- System can handle multi-leg instruments
+
+Low-effort unique findings (not in high):
+- Acks arrive before fills (no pre-ack fills)
+- Cancel-before-ack handling (submitted → cancelled missing)
+- Fill totals never exceed requested quantity
+- Deterministic ordering within a broker stream
+- Exercise/assignment and non-order position changes
+- Client-side idempotency of "place order"
+- Partial accept/normalize on replace
+- No "child" order fragmentation at broker
+- Submitted state can receive terminal events
+- Late cancel vs local expired mismatch
+
+**Character of the differences:**
+- HIGH-unique findings tend to be more **architectural/systemic** (multi-leg
+  instruments, streaming vs polling consistency, downstream invariant violations,
+  corporate actions). These require reasoning about the system's relationship
+  to the broader world.
+- LOW-unique findings tend to be more **implementation-specific edge cases**
+  (cancel-before-ack, pre-ack fills, child order fragmentation, partial accepts).
+  These require reasoning about specific event interleavings and protocol details.
+
+Both sets are valid and actionable. Neither is clearly "better." They represent
+different analytical modes — breadth-of-scope (high) vs depth-of-protocol (low).
+
+**Key insight — reasoning_effort doesn't scale analysis linearly:**
+
+Three possible explanations for the inverted behavior:
+
+1. **GPT-5 already uses near-maximum reasoning for analytical tasks regardless
+   of the effort parameter.** The ~4K reasoning tokens across all three levels
+   (4288/4160/3712) are too similar to reflect a genuine effort gradient. The
+   parameter may primarily affect OTHER task types (math, code, logic puzzles)
+   where reasoning depth is more variable.
+
+2. **Higher effort increases FILTERING, not exploration.** At high effort, GPT-5
+   may spend more of its reasoning on VERIFYING whether findings are genuine
+   before including them — similar to the extreme selectivity observed in
+   Finding #20 (invariant violation paths, 12K reasoning for 3 findings). This
+   would explain fewer findings despite theoretically "trying harder."
+
+3. **The parameter has minimal practical effect for this model version.**
+   The differences (33 vs 30 vs 30) are within normal stochastic variation.
+   Repeated runs at the same effort level might show similar variance.
+
+**The prompt cache hit on HIGH (2304 cached prompt tokens) may have slightly
+accelerated processing, but doesn't explain the reasoning token difference.**
+
+**Comparison to previous findings:**
+In Finding #20 (invariant violation paths), GPT-5 used 12,032 reasoning tokens
+for 3 findings, which is extreme verification behavior. Here, on a different task
+type (hidden assumptions), it uses ~4K reasoning for ~30 findings at every effort level.
+This confirms that TASK TYPE is a far stronger predictor of GPT-5's reasoning
+behavior than the reasoning_effort parameter. The invariant violation prompt
+triggered deep verification; the assumption-finding prompt triggers broad
+exploration regardless of effort setting.
+
+**Practical implication:**
+For open-ended analytical tasks (assumption-finding, gap analysis, spec review),
+the reasoning_effort parameter appears to have negligible practical effect on
+GPT-5. Don't bother tuning it for these tasks — the default is fine. The
+parameter may be more meaningful for:
+- Tasks with verifiable correct answers (math, logic)
+- Tasks where the model could short-circuit (simple questions)
+- Extremely long documents where exploration budget matters
+
+For architecture review specifically: reasoning_effort is NOT a useful lever.
+Task framing (the prompt structure) and document selection remain the dominant
+variables for output quality. Save reasoning_effort tuning for coding/math tasks
+where the parameter was likely trained and evaluated.
+
+**Open question:** Would running the same experiment 5x at each level show that
+the 33-vs-30 difference is within stochastic noise? If so, reasoning_effort is
+effectively a no-op for analytical prompts. If not, low-effort consistently
+produces more (less filtered) output, which could be useful for brainstorming-
+style analysis where you want maximum coverage before manual triage.
diff --git a/findings/2026-05-05-22-silent-correctness-failures-new-analytical.md b/findings/2026-05-05-22-silent-correctness-failures-new-analytical.md
new file mode 100644
index 0000000..7c9a78a
--- /dev/null
+++ b/findings/2026-05-05-22-silent-correctness-failures-new-analytical.md
@@ -0,0 +1,180 @@
+# Finding 22: Silent correctness failures: NEW analytical lens reveals Opus's strength at domain/regulatory reasoning; GPT-5 shows regulatory depth; all models converge on compliance timestamp errors
+
+**Date:** 2026-05-05
+**Task:** Identify scenarios where the mechanism produces SILENTLY INCORRECT results
+(not errors, not crashes — wrong financial calculations, wrong lot selections, or wrong
+compliance records that pass all validation) in gargoyle's `specid-lot-selection.md`
+(306 lines) — a financial system specification covering tax lot selection strategies,
+cost basis accounting, and IRS SpecID compliance.
+**How we used them:** Same document (full text) + same focused analytical question to
+all 3 models via HAI proxy. Highly structured prompt specifying 5 categories of silent
+incorrectness (stale data, semantic precision, ordering sensitivity, composition errors,
+temporal reference errors). Required specific output format per finding with concrete
+numerical examples of financial impact. No tools, no project context beyond the document.
+
+| Model | Time | Output tokens | Reasoning tokens | Findings | Critical | High | Medium |
+|---|---|---|---|---|---|---|---|
+| GPT-5 | 147s | 13,006 | 10,496 | 7 | 2 | 2 | 3 |
+| Claude Opus 4.6 | 119s | 5,902 | (internal) | 10 | 3 | 3 | 4 |
+| Claude Sonnet 4.6 | 122s | 6,011 | (internal) | 6 | 3 | 3 | 0 |
+
+**What they found — common ground (all 3 identified):**
+- `designation_at` = `DateTime.utc_now()` at processing time, NOT at actual
+  designation time (manual selection was made at order submission, standing
+  orders were configured earlier) — compliance record factually incorrect
+- Holding period calculation boundary errors (>365 days vs IRS "more than one
+  year" rule, off-by-one at leap year boundaries, day-after-acquisition start)
+- HIFO tie-breaker `opened_at ASC` ignores tax_term dimension — selects
+  long-term losses over short-term losses when both have identical cost basis,
+  producing less tax-valuable outcomes
+- Strategy preference resolved at fill processing time, not at trade time
+  (preference changes between trade and fill processing apply retroactively)
+
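The holding-period boundary error is easy to reproduce. The sketch below assumes a simplified reading of the "more than one year" rule (the holding period starts the day after acquisition, so a sale on the one-year anniversary is still short-term); the function names are hypothetical and the spec's actual check is only known from the description above.

```python
from datetime import date


def naive_long_term(acquired: date, sold: date) -> bool:
    """The '>365 days' check described above."""
    return (sold - acquired).days > 365


def one_year_after(d: date) -> date:
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        # Feb 29 has no anniversary in a non-leap year; fall back to Feb 28.
        return d.replace(year=d.year + 1, day=28)


def calendar_long_term(acquired: date, sold: date) -> bool:
    """Long-term only if the sale falls strictly after the one-year anniversary."""
    return sold > one_year_after(acquired)


acquired, sold = date(2024, 1, 15), date(2025, 1, 15)
elapsed_days = (sold - acquired).days
print(elapsed_days, naive_long_term(acquired, sold), calendar_long_term(acquired, sold))
```

Because the period spans February 29, 2024, the day count is 366, so the day-count check reports long-term for a position held exactly one year, while the calendar check does not.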
+**GPT-5 unique findings (not in either Claude model):**
+- Corporate action applied late leaves a stale cost basis in HIFO: ROC/dividend reduces
+  basis but if close/4 fires before apply_corporate_action/3, HIFO sorts on
+  pre-adjusted basis AND records wrong realized P&L permanently. No mechanism
+  to restate previously persisted LotClosed events. Concrete example: $2,000
+  overstated loss from one trade.
+- `designation_at` fragmentation: a single sell consuming multiple lots calls
+  DateTime.utc_now() per loop iteration, producing slightly different timestamps
+  for what should be a single coherent designation event. Audit risk.
+- LIFO label in `selection_method` field: records "lifo" but for securities LIFO
+  isn't an authorized tax method — the operation is legally SpecID electing
+  newest lots. Downstream reporting may reject or misclassify.
+
+**Claude Opus unique findings (not in either other model):**
+- Realized P&L excludes commissions/fees: formula uses `sell_fill.price` (raw
+  execution price) minus `lot.cost_basis`, not net proceeds. If cost_basis also
+  excludes buy-side commissions, P&L is doubly overstated. Active trader doing
+  1000 trades/year: ~$20,000+ cumulative P&L overstatement.
+- Position `average_cost` is meaningless under SpecID and potentially misleading:
+  SpecID exists to exploit lot-level basis differences, but position-level average
+  obscures this. If downstream consumers use average_cost for tax estimation,
+  results can be 50%+ wrong per lot.
+- GenServer mailbox ordering determines lot-to-fill assignment for concurrent sells:
+  two simultaneous fills for the same instrument get different lots based on network
+  arrival timing. With different holding periods, produces $670+ tax difference
+  without user awareness.
+- Wash sale rule completely unaddressed: system reports losses as realized/deductible
+  without checking 30-day substantially identical purchase rule. Active trader
+  harvesting $50,000 in losses could have $0 actually deductible — $18,500 tax gap.
+- `opened_at` semantics undefined: whether it's exchange execution time, GenServer
+  arrival time, or settlement date affects every downstream calculation (FIFO/LIFO
+  ordering, holding periods, tax terms). Network timing could produce wrong FIFO
+  lot selection.
+
+**Claude Sonnet 4.6 unique findings (not in either other model):**
+- Stale cost basis in manual lot picker during concurrent corporate actions: UI shows
+  pre-action basis, user selects based on stale data, but close/4 only validates
+  open/ownership/quantity — never re-validates that the selection rationale is still
+  correct. No field records the discrepancy.
+- `average_cost` recomputation ordering ambiguity in event-sourced model: step 4
+  recomputes from "updated lots" but step 3 (persist events) may not have completed
+  — if implementation re-derives from event store rather than in-memory state, reads
+  pre-closure lot quantities. Accumulates $500+ error per partial close.
+- Strategy fallback + config corruption silently overwrites selection method in
+  compliance record: if config becomes invalid, fallback to :fifo is logged at
+  :warning but LotClosed records `selection_method: "fifo"` — compliance record
+  shows user "chose" FIFO when they configured HIFO. No field records intended vs
+  actual strategy.
+
+**Quality assessment:**
+- **Claude Opus** produced the most findings (10) with the broadest analytical scope.
+  Several findings went BEYOND the document's mechanism to identify missing features
+  that create silent incorrectness (wash sale rules, commission handling, opened_at
+  semantics). This is a different analytical mode: Opus identified what the system
+  SHOULD compute but DOESN'T, not just where the existing computation is wrong.
+  The wash sale finding is the highest-impact across all three models — an active
+  trader's entire tax-loss harvesting strategy could be invalid. The GenServer
+  mailbox ordering finding shows characteristic Opus reasoning about emergent
+  behavior from design decisions.
+- **GPT-5** produced fewer findings (7) but with extreme precision and specificity.
+  Every finding includes concrete dollar amounts and specific field references.
+  The corporate action stale basis finding is uniquely actionable — it identifies a
+  specific race condition between two documented mechanisms (close/4 and
+  apply_corporate_action/3) that produces permanently incorrect persisted data
+  with no correction path. The designation_at fragmentation finding shows attention
+  to implementation detail that neither Claude model noticed. GPT-5 used 10,496
+  reasoning tokens for 7 findings (1,500 tokens/finding) — HIGH verification,
+  consistent with Finding #20's pattern for precision-over-breadth tasks.
+- **Claude Sonnet 4.6** produced 6 findings with strong specificity and novel angles.
+  The event-sourced recomputation ordering finding (#5) is architecturally subtle —
+  it identifies a composition error between the walk-and-consume algorithm's step
+  ordering and event-sourcing patterns. The strategy fallback compliance recording
+  finding is a genuine audit hazard. However, Sonnet produced no Medium-severity
+  findings — it either found Critical/High issues or filtered everything else out.
+  This aligns with its established high-precision, high-self-filtering behavior.
+
+**Key insight — "Silent correctness" as an analytical lens:**
+
+This is the FIRST experiment testing a "silent incorrectness" prompt. The key
+difference from previous analytical lenses:
+- **Assumption-finding:** "What must be true for this to work?" (Findings #10-12)
+- **Race conditions:** "What timing issues exist?" (Finding #13)
+- **Design coherence:** "Does the design contradict itself?" (Finding #15)
+- **Invariant violations:** "What operation sequences break invariants?" (Finding #20)
+- **Silent correctness:** "Where does the system CONFIDENTLY produce WRONG output
+  with NO indication of error?"
+
+The silent correctness lens produced qualitatively different findings from all
+previous lenses. The emphasis on "passes all validation" forced models to reason
+about what SHOULD be validated but ISN'T, and about semantic correctness (regulatory
+requirements, financial accounting rules) vs syntactic correctness (valid types,
+non-nil fields, correct schema).
+
+This lens also revealed a key model differentiation not seen before:
+- **Opus reasons about MISSING functionality** (wash sales, commissions, opened_at
+  semantics) — things the system should do but doesn't
+- **GPT-5 reasons about EXISTING functionality being wrong** (corporate action race,
+  designation fragmentation, LIFO labeling) — things the system does but incorrectly
+- **Sonnet reasons about COMPOSITION failures** (event-sourcing step ordering,
+  strategy fallback propagation) — things that are individually correct but combine
+  incorrectly
+
+These are three genuinely different analytical modes, not just "more/less thorough."
+All three are valuable for different review outcomes: Opus for feature completeness,
+GPT-5 for mechanism correctness, Sonnet for integration correctness.
+
+**Financial domain advantage:**
+
+This is the first experiment on a document with strong regulatory/financial semantics.
+All three models demonstrated domain knowledge (IRS holding period rules, Treas. Reg.
+1.1012-1(c) requirements, wash sale IRC §1091, long-term/short-term capital gains
+rate differentials). Opus in particular referenced specific IRC sections and provided
+concrete tax rate calculations. The "silent incorrectness" lens works especially well
+on financial/regulatory documents because the gap between "syntactically valid output"
+and "semantically/legally correct output" is large and consequential.
+
+**Comparison to previous findings on the same models:**
+
+| Task type | GPT-5 findings | Opus findings | Sonnet findings | Opus > GPT-5? |
+|---|---|---|---|---|
+| Hidden assumptions (#10-12) | 20-35 | 12-13 | 13-17 | No |
+| Race conditions (#13) | 12 | 10 | 7 | No |
+| Design coherence (#15) | 4 | 7 | 5 | **Yes** |
+| Invariant violations (#20) | 3 | 7 | 5 | **Yes** |
+| Silent correctness (#22) | 7 | 10 | 6 | **Yes** |
+
+Pattern confirmed: Opus outperforms GPT-5 (by finding count) on tasks that require
+reasoning about the design's RELATIONSHIP to external requirements (regulatory,
+financial, consumer expectations). GPT-5 outperforms Opus on tasks that require
+EXHAUSTIVE EXPLORATION within a self-contained system (assumptions, race conditions).
+
+The "silent correctness" lens is structurally similar to coherence checking (does the
+system match its external requirements?) rather than gap-finding (what's missing
+within the system?). This explains why Opus outperforms: the task requires reasoning
+about the world outside the document (IRS rules, financial accounting standards,
+regulatory requirements), which is Opus's strength.
+
+**Practical implication:**
+For financial/regulatory system review, the "silent correctness" lens should be
+run using Opus as the primary model (broadest findings including missing-feature
+identification) plus GPT-5 for mechanism-level precision. Sonnet adds value for
+composition/integration issues that neither Opus nor GPT-5 catches. All three
+produced unique, actionable findings that the others missed.
+
+The four findings ALL models converged on (designation_at, holding period, HIFO
+tie-breaker, strategy preference timing) should be treated as confirmed design
+bugs requiring fixes. The fact that three independent models all identified them
+with concrete financial impact examples increases confidence that these are real.
diff --git a/findings/2026-05-05-23-regulatory-compliance-analysis-gpt5-finds.md b/findings/2026-05-05-23-regulatory-compliance-analysis-gpt5-finds.md
new file mode 100644
index 0000000..8ec8ddc
--- /dev/null
+++ b/findings/2026-05-05-23-regulatory-compliance-analysis-gpt5-finds.md
@@ -0,0 +1,193 @@
+# Finding 23: Regulatory compliance analysis: GPT-5 finds broadest scope of IRS issues; Opus self-corrects mid-analysis; all models converge on cross-account gap
+
+**Date:** 2026-05-05
+**Task:** Identify where gargoyle's `wash-sale-tracking.md` (391 lines) could produce
+incorrect tax reporting compared to IRS wash sale regulations (IRC 1091). NEW
+analytical lens: regulatory compliance verification — asking models to reason about
+a code implementation's correctness against EXTERNAL regulatory requirements (not
+internal system assumptions or race conditions).
+**How we used them:** Same document (full text) + same focused analytical question
+to all 3 models via HAI proxy. Prompt structured around 5 categories: regulatory
+gaps, interpretation errors, edge cases at regulatory boundaries, cross-account/entity
+concerns, and interaction with other IRC sections. Required specific regulatory
+citations, implementation analysis, concrete tax errors, and audit risk levels.
+No tools, no project context beyond the document.
+
+| Model | Time | Output tokens | Reasoning tokens | Findings |
+|---|---|---|---|---|
+| GPT-5 | 178s | 12,525 | 9,536 | 16 |
+| Claude Opus 4.6 | 155s | 7,326 | (internal) | 16 (with 2 self-corrections/withdrawals) |
+| Claude Sonnet 4.6 | 40s | 1,818 | (internal) | 12 |
+
+**What they found — common ground (all 3 identified):**
+- Cross-account/IRA/external broker wash sales not tracked (IRS applies at taxpayer level)
+- Options/contracts to acquire stock not triggering wash sales (explicit in IRC 1091(a) text)
+- "Substantially identical" definition too narrow (same index ETFs, share classes, ADRs)
+- Trade date vs settlement date ambiguity in opened_at/closed_at
+- Short sale wash sales not addressed
+- Section 475 mark-to-market traders incorrectly subjected to wash sale tracking
+- IRC 1092 straddle rules interaction not addressed
+- Related party / spousal transactions not considered
+- Corporate action identity changes breaking matching
+
+**GPT-5 unique findings (not in either other model):**
+- **Per-share vs lot-level basis tacking** (#1): The system applies `disallowed_loss`
+  and `tacked_opened_at` at the LOT level, but IRS requires per-share treatment
+  when only partial shares are matched. A lot of 100 shares where only 60 trigger
+  wash sale should have per-share basis segregation — the system inflates basis for
+  all 100 shares. **Most architecturally significant finding** — a fundamental
+  design-level error, not a missing feature.
+- **IRA permanent disallowance** (#2): When replacement purchase is in an IRA, the
+  loss is PERMANENTLY lost (no basis adjustment possible in tax-deferred accounts).
+  System either incorrectly applies basis adjustment inside IRA or misses it entirely.
+- **Instruments not subject to §1091** (#4): §1256 contracts (futures, index options),
+  cryptocurrency, and §475 elections are all exempt — system may over-disallow.
+- **Average-cost mutual fund basis** (#11): Wash sale adjustments for funds using
+  average-cost method require different math than discrete lot-level adjustments.
+- **ADRs vs local shares** (#14): ADRs and underlying foreign ordinaries are
+  substantially identical but have different instrument_ids.
+- **RSU vestings/ESPP purchases** (#15): Equity compensation creating lots via
+  corporate action paths may not trigger `check_replacement/2`.
+- **Ordering priority between pre/post sale purchases** (#10): Industry convention
+  (post-sale first, then pre-sale) may differ from system's strict chronological
+  ordering, causing 1099-B mismatches.
+
+**Claude Opus unique findings (not in either other model):**
+- **Year-end boundary timing** (#5): Loss in December + replacement in January means
+  tax reports generated between Dec 31 and the replacement purchase date are incorrect.
+  Forward detection fires retroactively but users may have already filed. System needs
+  a "30-day pending window" for year-end reports.
+- **Form 8949 reporting format** (#6): IRS requires code "W" in column (f) and
+  specific adjustment amounts in column (g). System doesn't describe how `tax_summary/3`
+  produces Form 8949-compatible output — potential CP2000 notice triggers from
+  automated IRS matching against broker 1099-B.
+- **"Open lots" query in backward detection** (#10): If backward detection only
+  queries currently-open lots, it misses replacements that were acquired AND SOLD
+  within the window. IRS looks at acquisition regardless of current holding status.
+  (Rev. Rul. 56-602)
+- **Forward detection loss ordering unspecified** (#7): When multiple prior losses
+  compete for the same replacement shares, ordering matters — different allocation
+  produces different basis amounts on the replacement lot.
+- **DRIP reinvestments triggering wash sales** (#9): Dividend reinvestment creates
+  new lots that should trigger forward detection but may not if only buy fills
+  produce `LotOpened` events.
+- **Self-correcting analytical style (CONFIRMED):** Opus withdrew Finding #4
+  entirely mid-analysis ("Revised assessment: holding period logic appears correct.
+  I withdraw the claim of error"). Spent ~500 words reasoning through the holding
+  period tacking logic, found it correct, and explicitly retracted. This is now
+  confirmed across Findings #15, #20, and #23 as a consistent Opus behavior for
+  verification-heavy analysis, not just regulatory review.
+
+**Claude Sonnet unique findings (not in either other model):**
+- **Entity-level tracking for partnerships/S-Corps** (#4.2): Tax-transparent entities
+  trading through the platform need K-1 reporting to partners — user-scoped model
+  doesn't address pass-through entity wash sale reporting.
+- **Constructive sale integration (IRC 1259)** (#4.1): Short positions or derivatives
+  creating constructive ownership interact with wash sale determination in ways not
+  addressed.
+- **NOL carryforward interaction** (#5.3): Wash sale deferrals affect character and
+  timing of losses contributing to NOL calculations across tax years.
+
+**Quality assessment:**
+- **GPT-5** produced the broadest regulatory scope (16 findings) with the most
+  specific IRS citations (Rev. Rul. 2008-5, Pub. 550, IRC §§267, 1091, 1092, 1222,
+  1223, 1256, 475). Its per-share vs lot-level finding (#1) is the only one that
+  identifies a FUNDAMENTAL DESIGN ERROR (not a missing feature). Most of the other
+  models' findings are "you don't handle X"; GPT-5's #1 says "what you DO handle is
+  handled INCORRECTLY." This distinction matters: missing features are known scope
+  limitations; incorrect logic is a bug.
+- **Claude Opus** matched GPT-5's count (16 with 2 self-corrections = 14 net
+  confirmed) but with different character. Opus excelled at identifying OPERATIONAL
+  implications (year-end boundary timing, Form 8949 format requirements, forward
+  detection ordering) rather than just statutory gaps. Its findings tend to describe
+  HOW the gap manifests in practice ("user files taxes, then January purchase
+  retroactively invalidates the filing") vs GPT-5's approach of citing the statute
+  and describing the theoretical violation.
+- **Claude Sonnet** was fast (40s) and produced 12 competent findings but with less
+  regulatory precision. Findings lacked specific IRS citations (no Rev. Rul.
+  references, no Treas. Reg. citations). Several findings overlapped heavily with
+  common ground items without adding unique depth. The entity-level and
+  constructive sale findings show awareness of tax complexity but are relatively
+  generic ("this is complex and not addressed").
+
+**Key insight — regulatory compliance as a distinct task type:**
+
+This experiment tests a fundamentally different cognitive demand than previous ones:
+previous tasks asked "what could go wrong with this system?" (internal reasoning).
+This task asks "does this system correctly implement external rules?" (external
+reasoning). The model must hold TWO bodies of knowledge simultaneously: the
+implementation spec AND the regulatory framework, then find mismatches.
+
+All three models showed strong tax law knowledge, although only GPT-5 and Opus
+backed findings with specific Revenue Rulings (Sonnet cited IRC sections only). The
+differentiation was less in legal knowledge than in HOW they applied it:
+
+- **GPT-5:** Exhaustive statutory mapping ("here's every IRC section that touches
+  wash sales; here's where the implementation falls short on each"). Breadth-first
+  coverage. Found the most issues by sheer scope of regulatory awareness.
+- **Opus:** Operational consequence reasoning ("here's how this gap manifests as
+  a real-world problem for the user/auditor"). Found issues by reasoning about
+  the implementation's interaction with real-world workflows (filing deadlines,
+  form formats, broker reconciliation).
+- **Sonnet:** Category-based analysis ("here are cross-account issues, here are
+  entity issues, here are interaction issues"). Followed the prompt structure
+  closely but didn't go deep within each category.
+
+**The per-share vs lot-level finding (GPT-5 #1) — why it matters:**
+
+This is the experiment's most important result. Every model found missing features
+(options, cross-account, short sales) — those are SCOPE limitations that the
+document itself acknowledges or defers. GPT-5 uniquely found a correctness bug in
+the IMPLEMENTED logic: the system's lot-level basis adjustment is mathematically
+wrong for partial wash sales.
+
+Example: a loss lot of 60 shares is matched against a replacement lot of 100
+shares, so `adjusted_quantity = min(loss_quantity, replacement_quantity) = 60`.
+Only those 60 replacement shares should absorb the disallowed loss and inherit
+the tacked holding period. Because the system sets `disallowed_loss` and
+`tacked_opened_at` at the LOT level, the adjustment lands on all 100 shares.
+The 40 unmatched shares therefore get an incorrect holding period
+classification, and any later partial sale of the lot reports a per-share
+basis that blends adjusted and unadjusted shares.
+
+The bug does NOT appear when the replacement lot is no larger than the loss.
+With a loss lot of 100 shares and a replacement lot of 60 shares,
+`adjusted_quantity` is again 60, but now every share of the replacement lot is
+matched, so lot-level basis and holding period adjustments are correct for
+that lot. The defect is specific to lots where `adjusted_quantity <
+replacement_quantity`: the system needs to split such a lot (or track the
+adjustment per share) so the matched and unmatched portions keep separate
+basis and holding periods. This IS a genuine bug, though narrower than a
+first reading of GPT-5's finding suggests.
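+
+A minimal sketch of the per-share fix for the 60-versus-100 case above, written in
+Python purely for illustration. The `Lot` class, the `apply_wash_sale` helper, and
+all numeric values are hypothetical and not taken from gargoyle; only the
+`tacked_opened_at` name comes from the document. It assumes the disallowed loss
+and the tacked date have already been computed, and it splits the replacement lot
+so that only the matched shares are adjusted.
+
+```python
+from dataclasses import dataclass
+from datetime import date
+from decimal import Decimal
+from typing import List, Optional
+
+
+@dataclass
+class Lot:
+    quantity: int
+    opened_at: date
+    basis: Decimal                      # total cost basis of the lot
+    tacked_opened_at: Optional[date] = None
+
+
+def apply_wash_sale(replacement: Lot, loss_quantity: int,
+                    disallowed_loss: Decimal, tacked_opened_at: date) -> List[Lot]:
+    """Return the replacement lot split into matched and unmatched portions."""
+    matched_qty = min(loss_quantity, replacement.quantity)
+    per_share = replacement.basis / replacement.quantity
+    matched = Lot(
+        quantity=matched_qty,
+        opened_at=replacement.opened_at,
+        basis=per_share * matched_qty + disallowed_loss,
+        tacked_opened_at=tacked_opened_at,
+    )
+    unmatched_qty = replacement.quantity - matched_qty
+    if unmatched_qty == 0:
+        return [matched]                # whole lot matched, no split needed
+    unmatched = Lot(unmatched_qty, replacement.opened_at, per_share * unmatched_qty)
+    return [matched, unmatched]
+
+
+# Hypothetical numbers: 100-share replacement lot, 60 shares matched to a loss.
+replacement = Lot(100, date(2026, 1, 10), Decimal("1000"))
+lots = apply_wash_sale(replacement, 60, Decimal("120"), date(2025, 11, 3))
+for lot in lots:
+    print(lot.quantity, lot.basis, lot.tacked_opened_at)
+```
+
+In this sketch the matched portion alone carries the extra basis and the tacked
+date, while the remaining shares keep their original basis and holding period.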
+
+**Updated task-type taxonomy:**
+
+| Task type | Primary cognitive demand | Best model |
+|---|---|---|
+| Hidden assumptions | Breadth identification (what's not stated?) | GPT-5 (exhaustive) |
+| Race conditions | Sequential temporal reasoning | GPT-5 + Opus |
+| Cross-component interactions | Component boundary reasoning | GPT-5 + Sonnet |
+| Design coherence | Internal consistency checking | Opus |
+| Invariant violation paths | Construction + verification | GPT-5 (precision) |
+| Silent correctness | External requirement matching | Opus |
+| **Regulatory compliance** | **Dual-knowledge-base comparison** | **GPT-5 (breadth) + Opus (operations)** |
+
+Regulatory compliance is closest to "silent correctness" (Finding #22) in that
+both require reasoning about external requirements. The key difference:
+- Silent correctness asks "does this produce correct outputs for all inputs?"
+- Regulatory compliance asks "does this implement the law correctly?"
+
+Both favor models that reason about the system's relationship to the outside
+world (Opus's strength), but regulatory compliance also rewards breadth of
+statutory knowledge (GPT-5's strength). The combination produces the most
+complete picture.
+
+**Practical implication:**
+For regulatory compliance review of financial systems:
+- Run GPT-5 for exhaustive statutory coverage (finds the most gaps)
+- Run Opus for operational impact analysis (finds how gaps manifest in practice)
+- Sonnet adds marginal value — use only if budget allows
+- GPT-5's unique strength: identifying correctness bugs in implemented logic
+  (not just missing features)
+- Opus's unique strength: identifying timing/workflow issues (year-end, form
+  reporting, reconciliation with broker)
diff --git a/findings/2026-05-05-24-design-improvement-proposals-gpt5-excels.md b/findings/2026-05-05-24-design-improvement-proposals-gpt5-excels.md
new file mode 100644
index 0000000..c4b7c88
--- /dev/null
+++ b/findings/2026-05-05-24-design-improvement-proposals-gpt5-excels.md
@@ -0,0 +1,152 @@
+# Finding 24: Design improvement proposals: GPT-5 excels at defense-in-depth thinking; Opus finds subtle design contradictions; Sonnet produces generic recommendations
+
+**Date:** 2026-05-05
+**Task:** Propose specific design improvements for gargoyle's `kill-switch.md` (185 lines)
+— the primary safety mechanism that prevents rogue orders. NEW task type: generative/
+creative ("what would you improve?") rather than purely analytical ("what's wrong?").
+**How we used them:** Same document (full text) + same focused prompt to all 3 models
+via HAI proxy. Prompt asked for 8-15 specific improvements with: weakness, proposed
+change (concrete), tradeoff, severity rating. Explicitly excluded generic advice
+("add more tests") and asked about runtime assumptions. No tools, no project context.
+
+| Model | Time | Output tokens | Reasoning tokens | Improvements proposed |
+|---|---|---|---|---|
+| GPT-5 | 118s | 8,710 | 6,016 | 15 |
+| Claude Opus 4.6 | 127s | 4,985 | (internal) | 15 |
+| Claude Sonnet 4.6 | 40s | 1,636 | (internal) | 12 |
+
+**What they found — common ground (all 3 identified):**
+- DB write failure blocking engagement (fail-open under DB outage) — all three
+  proposed in-memory-first engagement with async persistence
+- Kill switch process liveness monitoring (heartbeat/watchdog)
+- Broker connectivity loss during cancellation operations
+- ETS table ownership and crash-window vulnerability
+- Supervisor restart suppression as unstated mechanism
+- Per-venue/per-broker scope extension
+
+**GPT-5 unique findings (not in either other model):**
+- **Infrastructure-level "hard kill"** — egress proxy or service mesh that blocks
+  broker traffic independently of the application. Belt-and-suspenders approach
+  where the kill switch works even if the entire BEAM VM is unresponsive. This
+  was GPT-5's highest-impact unique insight.
+- **Kill fence token (epoch)** — every order-carrying message includes an epoch;
+  stale-epoch messages are dropped at the gate. Elegantly solves in-flight
+  messages without needing drain timeouts.
+- **Cluster/multi-node propagation** — detailed leader election + epoch broadcast
+  + fail-closed on partition design.
+- **Post-engage broker verification** — query broker AFTER engaging to confirm no
+  orders slipped through during the engagement window.
+- **Liquidation exposure validation** — proving tagged liquidation orders actually
+  REDUCE exposure rather than trusting the tag.
+- **Recovery/cold-start order suppression** — ensuring reconciliation/recovery
+  routines can't submit orders while engaged.
+- **Engage latency reordering** — ETS first, terminate second, DB async.
+- **Audit log tamper evidence** — append-only external sink + hash chain.
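+
+The kill fence token idea can be illustrated with a small sketch. This is an
+assumption-laden Python illustration, not gargoyle's design: the `KillFence`
+class and its method names are hypothetical, and a lock stands in for whatever
+shared state (for example ETS) the real system would use. It shows why a message
+stamped before engagement stays blocked even after the switch is released.
+
+```python
+import threading
+
+
+class KillFence:
+    """Gate that admits only messages stamped with the current epoch."""
+
+    def __init__(self) -> None:
+        self._lock = threading.Lock()
+        self._epoch = 0
+        self._engaged = False
+
+    def current_epoch(self) -> int:
+        with self._lock:
+            return self._epoch
+
+    def engage(self) -> int:
+        """Engage the kill switch; every previously issued epoch becomes stale."""
+        with self._lock:
+            self._epoch += 1
+            self._engaged = True
+            return self._epoch
+
+    def release(self) -> None:
+        """Re-open the gate for messages stamped with the post-engage epoch."""
+        with self._lock:
+            self._engaged = False
+
+    def admit(self, message_epoch: int) -> bool:
+        with self._lock:
+            return not self._engaged and message_epoch == self._epoch
+
+
+fence = KillFence()
+in_flight = fence.current_epoch()       # order stamped before the kill
+admitted_before = fence.admit(in_flight)
+fence.engage()
+admitted_while_engaged = fence.admit(in_flight)
+fence.release()
+admitted_stale = fence.admit(in_flight)  # stale epoch, dropped at the gate
+admitted_fresh = fence.admit(fence.current_epoch())
+```
+
+Because stale messages are rejected by comparison rather than by waiting, no
+drain timeout is needed in this sketch.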
+
+**Claude Opus unique findings (not in either other model):**
+- **Ordering contradiction in engagement sequence** — identified that the
+  documented order (DB → ETS → terminate) creates a specific risk if a crash
+  occurs BETWEEN termination and ETS update (not just DB failure). The insight
+  is about the window where termination has started but gate is still open.
+  More subtle than GPT-5's version (which focused on DB-blocking-engage).
+- **Concurrent engagement race (mode escalation)** — multiple triggers
+  simultaneously issuing conflicting modes (RESTRICT vs LIQUIDATE). Proposed
+  explicit escalation rules (LIQUIDATE always wins) with GenServer serialization.
+- **Shared resources under per-user scope** — per-user kill switch doesn't
+  address orders in shared broker connection buffers. Forces architectural
+  decision about connection pooling strategy.
+- **Clock/time integrity for audit log** — monotonic counters + NTP validation
+  for forensic reliability.
+- **Partial multi-user engagement failures** — what happens when global engage
+  successfully terminates 4/5 user pipelines but one has orphaned processes.
+- **Liquidation direction validation** — similar to GPT-5's exposure validation
+  but framed differently: checking corrupted position records could cause
+  liquidation to OPEN positions rather than close them.
+- **Process termination verification** — checking that `:kill` signals actually
+  worked (defense against trap_exit, NIF blocking).
+- **Engagement latency SLA** — defining a 50ms target with monitoring/alerting.
+
+**Claude Sonnet findings (all also present in GPT-5 or Opus, differently framed):**
+- No genuinely unique improvements that GPT-5 or Opus didn't also identify.
+- Several were generic: "missing resource cleanup," "circuit breaker integration,"
+  "performance monitoring" — exactly the kind of advice the prompt tried to
+  exclude.
+- The "missing heartbeat" and "network partition handling" proposals were solid
+  but less detailed than the corresponding GPT-5/Opus versions.
+
+**Quality assessment:**
+- **GPT-5** produced the most ACTIONABLE improvements. Its proposals were
+  architecturally concrete ("add an egress proxy," "use kill epochs in messages,"
+  "query broker post-engage") and showed defense-in-depth thinking — multiple
+  independent layers rather than fixing one path. The infrastructure kill (#2)
+  is genuinely novel: no other model proposed going OUTSIDE the application
+  boundary for safety enforcement. GPT-5 consistently thought about "what if
+  this entire runtime is compromised?" rather than just fixing within-app paths.
+- **Claude Opus** produced equally numerous improvements (15) with characteristic
+  precision about failure SEQUENCES. Its unique strength: identifying design
+  contradictions rather than just gaps (the engagement ordering issue, concurrent
+  mode escalation, shared-resource scope mismatch). Opus's proposals were more
+  "fix the design tension" while GPT-5's were more "add another safety layer."
+  Opus also included the process termination verification and engagement latency
+  SLA — operational rigor that GPT-5 skipped.
+- **Claude Sonnet** produced 12 proposals in 40s (fast) but quality was notably
+  lower. Several proposals were generic software engineering advice that the
+  prompt explicitly excluded ("add performance monitoring," "resource cleanup").
+  No unique insights emerged. Sonnet's proposals lacked the architectural depth
+  of GPT-5 (no outside-the-application thinking) and the design-tension
+  identification of Opus.
+
+**Key insight — generative vs analytical tasks:**
+
+This is the first experiment testing a GENERATIVE task ("propose improvements")
+rather than a purely analytical one ("find problems"). The results reveal:
+
+1. **GPT-5's defense-in-depth thinking is unique.** In analytical tasks, GPT-5
+   finds exhaustive lists of issues. In generative tasks, it proposes LAYERED
+   solutions — multiple independent mechanisms that each catch what the others
+   miss. The infrastructure kill proposal (external to the application) shows
+   GPT-5 reasoning about failure modes that are invisible to within-app analysis.
+
+2. **Opus's design-tension identification transfers to improvement proposals.**
+   In analytical tasks, Opus finds where parts of a design contradict each other.
+   In generative tasks, this manifests as proposals that RESOLVE tensions rather
+   than just adding patches. The engagement ordering contradiction and mode
+   escalation rules are both "this design says X but the mechanism allows Y —
+   here's how to make them consistent."
+
+3. **Sonnet doesn't transfer well to generative tasks.** In analytical tasks
+   (assumption-finding, cross-component analysis), Sonnet performs well (85% of
+   GPT-5 in some experiments). In generative tasks, it falls back to generic
+   engineering advice. The task requires both identifying problems AND proposing
+   concrete solutions — Sonnet handles the first step but not the second with
+   sufficient depth.
+
+**Comparison to analytical task performance:**
+
+| Task type | GPT-5 character | Opus character | Sonnet character |
+|---|---|---|---|
+| Assumption-finding (#10-12) | Exhaustive breadth | Design tensions | Good (85% of GPT-5) |
+| Race conditions (#13) | Technical precision | Design contradictions | Weak (errors) |
+| Invariant violations (#20) | Maximum selectivity | Self-correcting depth | Imprecise |
+| **Design improvements (#24)** | **Defense-in-depth layers** | **Tension resolution** | **Generic advice** |
+
+The generative task reveals model ARCHITECTURES more clearly than analytical tasks.
+GPT-5's reasoning enables it to construct multi-layered solutions. Opus's internal
+reasoning enables it to identify what a design SHOULD be (not just what's wrong).
+Sonnet pattern-matches against known engineering practices without deep synthesis.
+
+**Practical implication:**
+
+For design improvement sessions on safety-critical systems:
+- Run GPT-5 for defense-in-depth proposals ("what layers should exist?")
+- Run Opus for design consistency proposals ("where does the design contradict itself?")
+- Skip Sonnet — its output is indistinguishable from generic checklists
+- The combination of GPT-5 + Opus produces complementary improvements: GPT-5 adds
+  safety layers, Opus fixes internal contradictions. Together they address both
+  "not enough protection" and "protection mechanisms that work against each other."
+
+**Cost analysis:**
+GPT-5: 118s, ~8.7K output tokens (6K reasoning). Opus: 127s, ~5K tokens. Sonnet: 40s, ~1.6K tokens.
+For a safety-critical design review, running GPT-5 + Opus costs ~13.7K tokens and produces
+30 improvements with near-zero overlap in unique insights. Excellent ROI for a kill switch
+design that protects real money.
diff --git a/findings/2026-05-05-25-contradiction-detection-new-task-type.md b/findings/2026-05-05-25-contradiction-detection-new-task-type.md
new file mode 100644
index 0000000..fb28ab7
--- /dev/null
+++ b/findings/2026-05-05-25-contradiction-detection-new-task-type.md
@@ -0,0 +1,154 @@
+# Finding 25: Contradiction detection: NEW task type — Opus excels at finding genuine contradictions with precision; GPT-5 is thorough but spends heavily on reasoning; Sonnet finds surface-level issues quickly
+
+**Date:** 2026-05-05
+**Task:** Identify internal contradictions, logical inconsistencies, and conflicting rules
+in gargoyle's `order-state-machine.md` (311 lines) — a document defining states,
+transitions, invariants, fill precedence rules, and time-in-force behavior.
+**How we used them:** Same document (full text) + same focused analytical question to all
+3 models via HAI proxy. Prompt specifically asked for: state machine contradictions,
+semantic conflicts, rule violations, implicit contradictions, and terminology
+inconsistencies. Required each finding to quote the conflicting statements, explain
+the logical argument, assign severity, and recommend which statement should "win."
+No tools, no project context beyond the document itself.
+
+| Model | Time | Output tokens | Reasoning tokens | Contradictions found |
+|---|---|---|---|---|
+| GPT-5 | 162s | 12,074 | 11,008 | 4 |
+| Claude Opus 4.6 | 41s | 2,056 | (internal) | 6 |
+| Claude Sonnet 4.6 | 17s | 826 | (internal) | 4 |
+
+**What they found — common ground (2+ models identified):**
+
+- **Missing `pending_cancel → partially_filled` revert transition** (GPT-5 #1 +
+  Opus #2 + Sonnet partial): The "Rejection reverts" invariant states orders return
+  to their "pre-modification state (`working` or `partially_filled`)", but the state
+  diagram only shows `pending_cancel → working` for cancel rejection — no path back
+  to `partially_filled`. All models correctly identified this as the diagram being
+  incomplete relative to the stated invariant. GPT-5 and Opus rated CRITICAL.
+- **Same issue for `pending_replace` revert** (GPT-5 #1 + Opus #3): The state diagram
+  only shows `pending_replace → working` for replace rejection, but a replace
+  requested from `partially_filled` should revert to `partially_filled`. Same root
+  cause as above, just the replace variant.
+- **FOK "never partially fills" vs state machine allowing it** (GPT-5 #2 + Opus #4):
+  The TIF table says FOK "never partially fills" but the state machine has no guards
+  preventing FOK orders from reaching `partially_filled`. Both correctly noted this
+  is a broker-enforced guarantee but the document presents it as system-level.
+- **`rejection_reason` described as "broker-provided" but local rejections exist**
+  (GPT-5 #4 + Opus #5 + Sonnet): `pending → rejected` is "local validation failure"
+  with no broker interaction, but the field says "Broker-provided reason when
+  rejected." All three caught this terminology inconsistency.
+
+**GPT-5 unique findings (not in either other model):**
+
+- **IOC valid terminal states exclude `expired` vs generic expiry transitions** (#3):
+  IOC should never reach `expired` (unfilled portion is cancelled immediately), but
+  the state diagram allows any order to transition to `expired` without TIF guards.
+  Well-reasoned extension of the FOK finding to IOC semantics. GPT-5 correctly
+  identified that broker "expired-like" outcomes should map to `cancelled` for IOC.
+
+**Claude Opus unique findings (not in either other model):**
+
+- **Terminal states that aren't terminal — the `partially_filled` re-entry problem**
+  (#1): Opus identified the DEEPER contradiction beyond the surface-level "cancelled
+  states have outgoing transitions." When `cancelled → partially_filled` fires via
+  late fill, the order is now non-terminal with NO defined mechanism to re-terminate
+  if no further fills arrive. The order is stuck in `partially_filled` indefinitely.
+  This goes beyond "the diagram contradicts the definition of terminal" to "the fill
+  precedence rule creates an unspecified operational scenario." This is the most
+  architecturally significant finding across all three models.
+- **Fill precedence label misapplication to non-terminal states** (#6): The state
+  diagram labels transitions from `pending_cancel → partially_filled` and
+  `pending_replace → partially_filled` as "fill precedence," but the Fill
+  Precedence Rule explicitly defines itself as overriding TERMINAL states.
+  `pending_cancel` is non-terminal. The label conflates two different mechanisms
+  (fill during pending modification vs. fill overriding terminal state), which
+  could cause implementers to use the same code path for fundamentally different
+  scenarios.
+
+**Claude Sonnet unique findings (not in either other model):**
+
+- **State diagram terminal arrow contradiction** (#1): Sonnet was the only model to
+  explicitly note that the Mermaid diagram shows `cancelled → [*]` (terminal arrow)
+  while simultaneously showing `cancelled → partially_filled` (outgoing transition).
+  A valid observation but more surface-level than Opus's deeper analysis of the same
+  phenomenon.
+- **Pending replace fill logic error** (#3): Sonnet argued that receiving a fill
+  during `pending_replace` creates a logical impossibility because the order
+  parameters are in flux. This is WRONG — fills always apply to current parameters
+  (the replace hasn't been confirmed yet), and the document actually handles this
+  correctly. This is a FALSE POSITIVE from Sonnet.
+
+**Quality assessment:**
+
+- **Claude Opus** was the clear winner for this task. Found the most contradictions
+  (6), had the highest precision (0 false positives), and — crucially — found
+  qualitatively deeper issues. The `partially_filled` re-entry problem (#1) isn't
+  just "the diagram has a missing transition" but "the fill precedence rule creates
+  an unresolvable operational state." The fill precedence label misapplication (#6)
+  identifies a conceptual confusion that would genuinely cause implementation bugs.
+  Opus completed in only 41s with 2,056 output tokens — by far the most efficient.
+- **GPT-5** found 4 genuine contradictions with 0 false positives but spent an
+  extraordinary amount of reasoning tokens (11,008) for modest output (1,066 visible
+  content tokens, 10.3:1 reasoning ratio). The IOC finding was unique and valuable.
+  But the cost is disproportionate: 162s and 12K tokens for 4 findings vs Opus's
+  41s and 2K tokens for 6 findings. GPT-5's reasoning budget seems to have been
+  mostly spent on VERIFICATION (confirming each finding is genuine), consistent
+  with Finding #20's observation.
+- **Claude Sonnet** was fastest (17s) and found 4 items, but one was a false positive
+  (the pending_replace logic error claim is incorrect). That gives it a precision of
+  75% (3/4 genuine) — the lowest of the three. Its genuine findings were all also
+  found by the other models (no unique true contributions). Sonnet appears to trade
+  accuracy for speed on contradiction detection.
+
+**Key insight — contradiction detection favors precision-oriented models:**
+
+This task is fundamentally about LOGICAL ARGUMENTATION: proving that two statements
+cannot both be true. Unlike assumption-finding (which is about imagining what could go
+wrong) or gap-finding (which is about identifying missing content), contradiction
+detection requires the model to:
+1. Hold two statements in working memory simultaneously
+2. Construct a formal argument for why they conflict
+3. NOT get confused by statements that SEEM contradictory but are actually consistent
+
+Requirement #3 is where models diverge. Sonnet produced a false positive because it
+didn't fully reason through whether the pending_replace fill scenario is actually
+inconsistent (it isn't — current parameters apply). Opus avoided this trap entirely
+and additionally found DEEPER contradictions that require multi-step logical reasoning
+(the re-entry problem, the label misapplication). GPT-5 also avoided false positives
+but at massive computational cost.
+
+**Opus's efficiency advantage:**
+This is the first task where Opus is not just qualitatively better but also
+quantitatively more efficient. 6 findings in 41s and 2K tokens vs GPT-5's 4 findings
+in 162s and 12K tokens. That's roughly 9x more findings per token and 4x faster. For
+contradiction detection specifically, Opus appears to have a structural advantage —
+possibly because its internal reasoning is better calibrated for logical argumentation
+than GPT-5's externalized reasoning chain.
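+
+The efficiency ratios can be recomputed directly from the table above. The sketch
+below is illustrative Python; the dictionary layout and helper name are
+hypothetical, while the figures are the table's. GPT-5's token count includes
+its reasoning tokens.
+
+```python
+# Findings, total output tokens, and wall-clock seconds from the results table.
+runs = {
+    "GPT-5": {"findings": 4, "tokens": 12_074, "seconds": 162},
+    "Opus": {"findings": 6, "tokens": 2_056, "seconds": 41},
+}
+
+
+def findings_per_1k_tokens(run: dict) -> float:
+    """Findings produced per 1,000 output tokens."""
+    return run["findings"] / run["tokens"] * 1000
+
+
+token_ratio = findings_per_1k_tokens(runs["Opus"]) / findings_per_1k_tokens(runs["GPT-5"])
+speed_ratio = runs["GPT-5"]["seconds"] / runs["Opus"]["seconds"]
+
+print(f"Opus findings per token vs GPT-5: {token_ratio:.1f}x")
+print(f"Opus wall-clock speedup vs GPT-5: {speed_ratio:.1f}x")
+```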
+
+**Comparison to Finding #20 (invariant violation paths):**
+In Finding #20, GPT-5 was maximally selective (3 findings, all genuine, 15:1
+reasoning ratio). Here, GPT-5 shows the same pattern: few findings, all genuine,
+high reasoning ratio (10.3:1). The difference: in #20, GPT-5's selectivity meant
+it found UNIQUE violations others missed. Here, three of GPT-5's four findings were
+also found by Opus (only the IOC finding was unique), and Opus added two of its own.
+GPT-5's high verification bar helps less when Opus is ALSO precise AND more thorough.
+
+**Updated task-model assignment:**
+
+For contradiction/consistency checking:
+1. **Opus** — best choice: highest precision, deepest contradictions, most efficient
+2. **GPT-5** — solid backup: zero false positives, unique TIF-related insights, but
+   expensive and slower
+3. **Sonnet** — NOT recommended for this task: produces false positives, no unique
+   true contributions
+
+This confirms the emerging pattern: each model has task types where it excels.
+Opus excels at logical argumentation and design tensions. GPT-5 excels at
+exhaustive enumeration and operational concerns. Sonnet excels at speed and
+structural/assumption analysis but struggles with tasks requiring formal logical
+reasoning (contradiction detection, concurrency analysis per Finding #13).
+
+**Practical implication:** When reviewing architecture documents for internal
+consistency (e.g., before implementation begins), run Opus. If budget allows,
+add GPT-5 for TIF/edge-case coverage. Skip Sonnet for consistency checking —
+its speed advantage is negated by the false positive risk.
diff --git a/findings/2026-05-05-26-missingfeature-identification-is-promptable-across.md b/findings/2026-05-05-26-missingfeature-identification-is-promptable-across.md
new file mode 100644
index 0000000..72a5c9d
--- /dev/null
+++ b/findings/2026-05-05-26-missingfeature-identification-is-promptable-across.md
@@ -0,0 +1,158 @@
+# Finding 26: Missing-feature identification IS promptable across all models; prompt framing eliminates Opus's historical advantage — all three models find regulatory gaps when explicitly asked
+
+**Date:** 2026-05-05
+**Task:** Identify computations, behaviors, or features that gargoyle's
+`corporate-actions.md` (992 lines) SHOULD perform for financial correctness,
+regulatory compliance, or operational safety — but doesn't describe.
+**How we used them:** Same document (full text) + same focused analytical
+prompt to all 3 models via HAI proxy. Prompt explicitly structured around 5
+categories: missing computations, missing behaviors, missing validations,
+missing integrations, and regulatory gaps. Required concrete findings with
+severity. No tools, no project context beyond the document. GPT-5 via
+OpenAI endpoint (16K max_completion_tokens), Opus 4.6 and Sonnet 4.6 via
+Anthropic endpoint (8K max_tokens).
+
+| Model | Output tokens | Reasoning tokens | Findings | Critical | High | Medium |
+|---|---|---|---|---|---|---|
+| GPT-5 | 11,354 | 8,512 | 20 | 3 | 10 | 7 |
+| Claude Opus 4.6 | 4,111 | (internal) | 23 | 6 | 10 | 7 |
+| Claude Sonnet 4.6 | 4,686 | (internal) | 15 | 5 | 6 | 4 |
+
+**What they found — common ground (all 3 identified):**
+- Wash sale rule interaction with CA-driven lot closures (IRC §1091)
+- Short position treatment for corporate actions
+- Same-day corporate action ordering beyond `recorded_at` timestamp
+- Record date / ex-date position verification (entitlement timing)
+- Idempotency guard preventing double-application per user
+- Decimal precision/rounding policy unspecified
+- Superseded CA status has no lot rollback mechanism
+- Rights/warrants post-creation lifecycle (exercise/expiration)
+- Basis preservation invariant has no runtime enforcement
+- Manual entry authorization and audit trail
+
+**GPT-5 unique findings (not in either Claude model):**
+- Per-lot eligibility based on entitlement date (not just user-level)
+- Election-based outcomes for shareholder choices (cash vs stock)
+- Instrument-level trading hold during CA application window
+- Pre-application consistency checks against broker entitlements
+- DB-level enforcement of status transitions and invariants
+- Action-type-specific date semantics per field (ex vs record vs payable)
+- Voluntary/tender actions beyond distributions
+- Backfill/initialization guard for newly onboarded users
+- Applicator retry/backoff semantics and confirmation race
+- Rights indivisibility constraints vs exact Decimal quantities
+
+**Claude Opus unique findings (not in either other model):**
+- Pending order PRICE adjustment after splits (not just cancellation)
+- Multi-instrument position recalculation atomicity for mergers
+- Mixed merger basis floor at zero (can produce negative basis)
+- Tax lot identification method interaction with inherited dates
+- Corporate action effect on strategy position limits/risk params
+- Corporate actions on instruments not yet in the database
+- Partial application window: new user acquires position mid-fan-out
+- IRC §305(c) deemed distributions (taxable stock dividends)
+- CA impact on unrealized P&L display and strategy evaluation
+- Concurrent OrderManager startup + Applicator fan-out race
+
+**Claude Sonnet unique findings (not in either other model):**
+- Stale orders: failure modes table contradicts "excluded" section
+- IRC §1223(1) holding period tacking verification at lot close
+- Spinoff allocation percentage: no validation that child != parent instrument
+- Combined spinoff allocations exceeding meaningful bounds
+- Cash dividend bypasses OrderManager — record-date quantity snapshot lost
+- Mixed merger large-denominator exchange ratio overflow
+- Detector schedule: no intraday re-poll for same-day announcements
+- ROC cumulative basis floor tracking and IRC §301(c)(3) distinction
+- Mixed merger deferred loss not explicitly recorded in metadata
+
+**Quality assessment:**
+- **Claude Opus** was the MOST PROLIFIC (23 findings) — a notable inversion
+  from previous experiments where Opus typically found fewer but deeper
+  findings. Here, the explicit "missing feature" framing appears to have
+  unlocked Opus's breadth. Its unique findings included genuinely critical
+  items: pending order price adjustment after splits (Critical — direct
+  financial loss), multi-instrument atomicity for mergers (Critical —
+  position loss), and mixed merger negative basis (High — accounting
+  corruption). The findings were precise, well-reasoned, and showed both
+  regulatory depth (IRC §305(c)) and operational awareness.
+- **GPT-5** was slightly less prolific (20 findings) but maintained its
+  characteristic breadth and operational-level thinking. Per-lot eligibility
+  (not just per-user) is a subtle but important distinction. The election-
+  based outcomes finding shows awareness of real-world corporate action
+  complexity. The backfill/initialization guard is operationally significant.
+  GPT-5 spent 8,512 reasoning tokens — moderate for its output volume.
+- **Claude Sonnet** found fewer gaps (15) but several were genuinely
+  insightful. The internal contradiction between the failure modes table
+  and the "excluded" section is a real document inconsistency. The cash
+  dividend record-date quantity snapshot insight (#9) identifies a DATA LOSS
+  problem — the opportunity to capture that data expires. The mixed merger
+  deferred loss recording gap shows regulatory awareness. However, some
+  findings were more surface-level or overlapped heavily with the others.
+
+**KEY INSIGHT — The original question from Finding #22 is ANSWERED:**
+
+> "Opus's 'missing feature identification' mode (wash sales, commissions) —
+> is this promptable on other models? Could we explicitly ask GPT-5 'what
+> should this system compute but doesn't' and get similar results?"
+
+**YES.** When explicitly prompted with a structured "missing feature"
+framing, ALL three models found regulatory gaps (wash sales, IRC sections),
+missing computations (basis calculations, rounding), and missing behaviors
+(lifecycle events, notifications). GPT-5 produced findings in the same
+*category* as what Opus uniquely found in Finding #22 (silent correctness
+failures on specid-lot-selection.md).
+
+In Finding #22, Opus uniquely identified wash sales and commission tracking
+as missing features while GPT-5 focused on mechanism incorrectness and
+Sonnet on composition failures. HERE, with the explicit "what's missing"
+prompt, ALL three models found wash sales, ALL found regulatory gaps, and
+ALL found missing behaviors.
+
+**This confirms:** Opus's "missing feature identification" mode in Finding
+#22 was NOT an inherent model capability — it was an emergent behavior from
+the open-ended "silent correctness failures" prompt. When you give ALL models
+the EXPLICIT instruction to look for missing features, they all do it. The
+differentiation in #22 arose because that prompt was more open-ended,
+allowing each model to default to its natural analytical mode:
+- Opus → "what's missing" (features/functionality)
+- GPT-5 → "what's wrong" (mechanism failures)
+- Sonnet → "what breaks when combined" (composition)
+
+**Prompt framing dominates model personality.** With the right prompt,
+any model can be directed into any analytical mode. The model differences
+that emerged in earlier open-ended experiments reflect DEFAULT TENDENCIES,
+not capabilities.
+
+**NEW finding about Opus on complex documents:**
+Opus produced MORE findings than GPT-5 (23 vs 20) — the first time this
+has happened on a broad analytical task. Previous pattern: GPT-5 always
+finds more (20-33 findings) while Opus finds fewer but deeper (7-13).
+What changed? The document is 992 lines — the longest tested — and the
+task is explicitly about breadth ("find all gaps"). On this specific
+combination (long document + breadth-focused prompt), Opus appears to
+allocate its internal reasoning budget toward exploration rather than
+its usual depth-first design-tension mode. This suggests Opus's typical
+"fewer but deeper" pattern is partially a RESPONSE to shorter documents
+where depth is more productive than breadth.
+
+**Practical implications:**
+1. For missing-feature analysis: prompt structure matters more than model
+   choice. All three models are viable. Use the explicit 5-category prompt.
+2. Run all three for critical docs — they find different specific gaps
+   despite finding the same categories.
+3. For open-ended analysis where you want models to find DIFFERENT things:
+   use open-ended prompts. For analysis where you want COMPREHENSIVE
+   coverage of one type: use structured prompts.
+4. Opus's "fewer but deeper" personality can be overridden by document
+   length + breadth-focused prompt. On this 992-line doc, it competed on
+   volume with GPT-5.
+
+**Cost-effectiveness:**
+- Opus: 4,111 output tokens / 23 findings = 179 tokens/finding
+- GPT-5: (11,354 output + 8,512 reasoning tokens) / 20 findings = 993 tokens/finding
+- Sonnet: 4,686 output tokens / 15 findings = 312 tokens/finding
+
+Opus is by far the most efficient: roughly 5.5x fewer tokens per finding than
+GPT-5 (179 vs 993), with MORE findings. This is the strongest cost-effectiveness case
+for Opus on any tested task. On long documents with breadth-focused prompts,
+Opus appears to be the optimal choice for both quality AND efficiency.
diff --git a/findings/2026-05-05-27-design-coherence-on-riskcontrolsmd-gpt5.md b/findings/2026-05-05-27-design-coherence-on-riskcontrolsmd-gpt5.md
new file mode 100644
index 0000000..79562be
--- /dev/null
+++ b/findings/2026-05-05-27-design-coherence-on-riskcontrolsmd-gpt5.md
@@ -0,0 +1,276 @@
+# Finding 27: Design coherence on risk-controls.md: GPT-5 regains top position; Opus's advantage from Finding #15 was document-specific, not task-specific
+
+**Date:** 2026-05-05
+**Task:** Identify internal design incoherences in gargoyle's `risk-controls.md` (277 lines)
+— a pre-trade risk control specification covering two evaluation stages, reduction semantics,
+ordering rationale, fail-closed claims, and audit logging.
+**How we used them:** Same document (full text) + same focused analytical question to all
+3 models via HAI proxy. Highly structured prompt specifying 5 categories of incoherence
+(safety properties not enforced, ordering/sequencing contradictions, reduction semantics
+conflicts, fail-closed claims vs actual behavior, cross-stage inconsistencies). Required
+each finding to reference specific contradictory parts. No tools, no project context beyond
+the document itself.
+
+| Model | Time | Output tokens | Reasoning tokens | Incoherences found | Critical | High | Medium |
+|---|---|---|---|---|---|---|---|
+| GPT-5 | 112s | 8,231 | 7,232 | 6 | 1 | 3 | 2 |
+| Claude Opus 4.6 | 41s | 1,858 | (internal) | 5 | 2 | 2 | 1 |
+| Claude Sonnet 4.6 | 15s | 699 | (internal) | 4 | 1 | 2 | 1 |
+
+**What they found — common ground (all 3 identified):**
+- Reduction re-entry at BuyingPower contradicts "reducing controls never re-enter
+  earlier controls" (all three flagged this as the most obvious contradiction —
+  Concentration at position 5 reduces, re-enters at BuyingPower at position 4,
+  which IS an earlier control)
+- Ordering rationale's categorization of buying power/concentration is internally
+  confused (the doc labels both as "quantity-sensitive checks" that run after
+  reducing controls, but concentration IS a reducing control at position 5 while
+  buying power at position 4 sits between the two reducing controls)
+
+**GPT-5 unique findings (not in either Claude model):**
+- Signal-level "no portfolio context" contradicts NoShortSales requiring knowledge
+  of current positions. The doc explicitly states signals are evaluated "in isolation"
+  with "no portfolio context — only the signal itself and user settings" — but checking
+  whether the user holds a position IS portfolio context. This is a genuine design
+  tension: either SignalRisk has hidden portfolio access (violating isolation) or
+  NoShortSales can't actually work as specified.
+- Settings "fall through to system defaults" vs "Settings cache miss → reject."
+  Two incompatible instructions for the same condition (missing settings).
+- "Universal fail-closed" with "only exception is order rate window" contradicted
+  by Failure Modes table showing buying power as another exception ("Conservative
+  estimate; may over-reject" is NOT rejection — it's a different failure mode than
+  either fail-closed or the documented single exception).
+- Audit model says "every control evaluation produces an audit entry regardless of
+  outcome" but the signal-stage write point only describes writing on rejection.
+  Passing signals produce no documented audit entry at the signal stage.
+
+**Claude Opus unique findings (not in either other model):**
+- Signal flow diagram swaps control order vs table: table shows (1) MarketHours,
+  (2) PerTradeStop, (3) NoShortSales, but diagram flows MarketHours → NoShortSales
+  → PerTradeStopLoss. Controls 2 and 3 are reversed between the two representations.
+  (VERIFIED: this is correct — the diagram does show a different order.)
+- Concentration re-entry loop can bypass Order Rate, Duplicate, Self-Trade, and
+  Fat Finger entirely during intermediate iterations. Also: Position Size at order 3
+  is never re-checked against Concentration-reduced quantity because re-entry starts
+  at BuyingPower (order 4), meaning "smaller quantity wins" semantics are implemented
+  differently than the linear model described in Reduction Semantics.
+
+**Claude Sonnet unique findings (not in either other model):**
+- Buying Power "Can reduce? No" creates a dead end: if a reduced quantity still
+  exceeds buying power, the system can only reject entirely (no mechanism to further
+  optimize), defeating the purpose of the reduction system for capital-limited users.
+  (NOTE: this is more of a design limitation than a self-contradiction, but the
+  framing — that the reduction system's purpose is undermined by buying power's
+  inability to reduce — is a legitimate coherence observation.)
+
+**Quality assessment:**
+- **GPT-5** produced the most findings (6) with the broadest coverage across the
+  prompt's 5 categories. The NoShortSales/portfolio-context finding is the most
+  genuinely insightful — it's a fundamental design-level contradiction (a signal-level
+  control that REQUIRES decision-level context). The settings contradiction and
+  audit logging inconsistency are also solid. Every finding points to two specific
+  textual statements that are incompatible. Severity ratings were calibrated (1
+  Critical, 3 High, 2 Medium — compared to Opus's 2 Critical for similar findings).
+- **Claude Opus** was remarkably fast (41s, 1,858 tokens) and found one thing
+  neither other model caught: the diagram/table order reversal for signal controls.
+  This is a concrete, verifiable error (not a design tension — a literal mistake in
+  the document). The re-entry loop analysis (finding #5) goes deeper than GPT-5's
+  version of the same core issue, exploring the implications for "smaller quantity
+  wins" semantics. However, Opus found fewer total issues and missed the
+  settings contradiction and audit logging inconsistency.
+- **Claude Sonnet** was the fastest (15s, 699 tokens) and found 4 issues. The buying
+  power dead-end observation is unique and shows genuine reasoning about the reduction
+  system's limitations. However, it's more of a "this design can't achieve its stated
+  goal" observation than a strict self-contradiction. Sonnet's other findings overlap
+  with the common ground. Quality is solid, but the scope is narrower.
+
+**Key insight — Finding #15's Opus > GPT-5 result was document-specific:**
+In Finding #15 (coherence checking on failure-modes.md), Opus found 7 incoherences
+vs GPT-5's 4. Here, on risk-controls.md, GPT-5 found 6 vs Opus's 5. The reversal
+suggests that the relative performance on coherence checking depends on the
+DOCUMENT'S structure, not on a fixed model advantage:
+
+- **failure-modes.md** (383 lines): A complex multi-process system with many
+  stated invariants across failure states, supervision trees, and recovery paths.
+  Rich in design TENSIONS where one subsystem's safety mechanism undermines another.
+  This plays to Opus's strength (finding design tensions between subsystems).
+- **risk-controls.md** (277 lines): A more focused specification with explicit rules,
+  ordering constraints, and behavior tables. Rich in SPECIFICATION CONTRADICTIONS
+  where one statement directly conflicts with another. This plays to GPT-5's
+  strength (systematic verification of claims against stated mechanisms).
+
+The difference: Opus excels when contradictions are EMERGENT (arise from composing
+multiple design decisions). GPT-5 excels when contradictions are EXPLICIT (two
+statements in the document say incompatible things). Risk-controls.md has more
+explicit contradictions (the settings fallback vs fail-closed, the "no portfolio
+context" vs NoShortSales, the audit "always" vs write point "only on reject").
+
+**Model performance depends on CONTRADICTION TYPE:**
+| Contradiction type | Best model | Example |
+|---|---|---|
+| Emergent/compositional | Opus | "Rest-for-one cascade creates a 5th state" |
+| Explicit/definitional | GPT-5 | "No portfolio context" but check requires portfolio |
+| Diagrammatic/structural | Opus | Table order ≠ diagram order |
+| Semantic/category confusion | All (common ground) | Reduction re-entry violates ordering claims |
+
+**Revised conclusion on Finding #15's open question:**
+"Does Opus > GPT-5 ordering for coherence checking hold across other documents?"
+**No.** The ordering depends on the document's contradiction density and type.
+Documents rich in emergent design tensions favor Opus. Documents with explicit
+specification errors favor GPT-5. The task type (coherence checking) doesn't have
+a fixed model winner — it depends on what KIND of incoherences the document contains.
+
+**Practical implication:** Continue running both models for coherence checking. Their
+strengths are complementary even within the same task type. GPT-5 catches things you
+can point to in the spec and say "these two sentences conflict." Opus catches things
+where you need to reason about the implications of multiple mechanisms interacting.
+
+## Open Questions
+
+- Does GPT's advantage in finding inconsistencies extend to logical
+  inconsistencies in arguments? One data point (verdict mismatches) — need more.
+- What's the optimal task granularity for GPT analytical review? "Whole PR" is
+  too big. Is "one hypothesis" right, or can we batch?
+- ~~Is the GPT-4.1 Mini bias detection result repeatable, or was it a well-
+  structured task that any model would ace?~~ **ANSWERED (Finding #8):** Any
+  model aces it when the biased text is presented without noise. The original
+  result was about noise elimination, not model capability.
+- **NEW:** Does adding a narrow bias-check question to a rich PR review
+  context recover the detection that broad review misses? (Signal-to-noise
+  confirmation test)
+- ~~How does reasoning_effort affect analytical quality? Only tested default so
+  far.~~ **ANSWERED (Finding #21):** Negligible effect on GPT-5 for open-ended
+  analytical tasks. Low/medium/high produced 33/30/30 findings with nearly
+  identical reasoning tokens (~4K) and per-finding depth. The parameter
+  may primarily affect verifiable-answer tasks, not exploration. Task framing
+  remains the dominant quality lever.
+- Can we design a systematic "analytical review checklist" that leverages each
+  model's strengths?
+- ~~What analytical tasks is Opus best at vs Sonnet? Finding #11 shows Opus
+  excels at design-tension identification. How does Sonnet compare on the
+  same task? (Sonnet is non-reasoning but fast — would it match GPT-4.1?)~~
+  **ANSWERED (Finding #12):** Sonnet 4.6 significantly outperforms GPT-4.1
+  (17 vs ~14 assumptions) and approaches GPT-5 (17 vs 20). It's not a
+  non-reasoning model in the GPT-4.1 sense — it occupies a middle tier with
+  genuine component-interaction reasoning. Opus still wins on design-tension
+  identification specifically.
+- How do the models compare on research synthesis tasks (our #381 rewrite)?
+  We'll find out during the actual rewrite.
+- ~~Does the reasoning-token advantage scale with document complexity? Test
+  with a simpler doc to see if the gap narrows.~~ **ANSWERED (Finding #11):**
+  The gap doesn't narrow with simpler docs. GPT-5 maintains ~1.7x the findings
+  of GPT-4.1 regardless of document complexity. Reasoning tokens enable
+  exhaustive exploration independent of input difficulty.
+- ~~Would Claude Opus (also a reasoning model) match GPT-5's assumption-finding
+  performance, or does it have different blind spots?~~ **ANSWERED (Finding #11):**
+  Different blind spots, different strengths. GPT-5 reasons deeper into
+  implementation mechanics (breadth + technical depth). Opus reasons wider
+  about system context and design tensions (insight density). They're
+  complementary, not competing. Run both on important architecture docs.
+- ~~Does Sonnet 4.6's strong showing hold across other analytical tasks
+  (bias detection, gap-finding) or is it specific to assumption-finding on
+  complex documents? Need to test Sonnet on simpler docs and different question
+  types.~~ **PARTIALLY ANSWERED (Finding #13):** Sonnet's strength does NOT
+  transfer to concurrency reasoning. It dropped from 85% of GPT-5 (assumption-
+  finding) to ~58% (race condition identification). Task type matters more
+  than we thought. Still untested: gap-finding, bias detection for Sonnet.
+- **NEW:** What other analytical tasks require sequential/temporal reasoning
+  (like race condition identification) vs pattern-matching reasoning (like
+  assumption-finding)? Building a task taxonomy would help assign models
+  correctly.
+- **NEW:** What explains Sonnet taking slightly longer than Opus here (106s vs
+  105s) despite normally being the faster model? Is it the document length, or
+  does Sonnet's internal reasoning scale with complexity similarly to Opus?
+- ~~How does GPT-5 Mini compare to GPT-5 on analytical tasks? Is it a viable
+  cheaper substitute?~~ **ANSWERED (Finding #14):** GPT-5 Mini is a viable
+  middle option. Finds fewer issues (6 vs 10) but with genuine reasoning
+  depth at ~50% cost/time. Better than non-reasoning models, not as
+  exhaustive as GPT-5.
+- **NEW:** How does Claude 4.5 Opus/Sonnet compare to Claude 4.6? HAI now
+  exposes both; worth testing whether the newer versions regress on
+  analytical tasks.
+- ~~Would running GPT-5 Mini + Sonnet together (different axes)
+  approach GPT-5's coverage at lower combined cost?~~ **ANSWERED (Finding #19):**
+  71% coverage at 31% cost. Good for low-stakes work; GPT-5 irreplaceable for
+  high-stakes due to unique domain-knowledge findings in the missing 29%.
+- ~~**NEW (Finding #15):** Does the Opus > GPT-5 ordering for coherence checking
+  hold across other documents? The inversion (Opus finding more than GPT-5)
+  was striking — need to confirm it wasn't document-specific.~~
+  **ANSWERED (Finding #27):** No — it was document-specific. On risk-controls.md,
+  GPT-5 found 6 vs Opus's 5. The winner depends on contradiction TYPE: Opus
+  excels at emergent/compositional contradictions, GPT-5 at explicit/definitional
+  ones. No fixed ordering for this task type.
+- **NEW (Finding #15):** Is the two-pass approach (Opus generates → GPT-5
+  validates) worth the extra cost vs just running Opus alone? Need to test
+  whether GPT-5 actually catches Opus false-positives or just agrees.
+- ~~How do the Claude 4.5 and 4.6 models compare on analytical tasks?~~
+  **ANSWERED (Finding #16):** 4.5 is more exhaustive (2x findings), 4.6 is
+  more precise (higher signal-to-noise). Genuine tradeoff, not a regression.
+  4.5 for coverage, 4.6 for actionability.
+- **NEW (Finding #16):** Does the 4.5 vs 4.6 pattern hold across other task
+  types? Spec completeness may favor exhaustiveness; would coherence checking
+  or race condition analysis show the same pattern?
+- **NEW (Finding #16):** Is running both Sonnet versions (4.5 + 4.6) cost-
+  effective vs just running GPT-5? Need to compare the UNION of their findings
+  against GPT-5's output for overlap analysis.
+- **NEW (Finding #18):** Does Opus's "predictable exploit window" detection
+  transfer to other policy documents? It uniquely identified that the cooldown
+  mechanism creates a GUARANTEED safe window that strategies could systematically
+  exploit — this is a higher-order security insight. Worth testing whether Opus
+  consistently finds "adversarial opportunity" framings that other models miss.
+- **NEW (Finding #20):** Does GPT-5's extreme verification behavior (15:1
+  reasoning-to-output ratio, 3 findings from 12K reasoning) persist across
+  other documents with this prompt? Or was user-pipeline-lifecycle.md
+  particularly verification-heavy? Test invariant violation paths on a simpler
+  document.
+- **NEW (Finding #20):** Would giving GPT-5 a "minimum 8 findings" instruction
+  reduce its selectivity and produce MORE invariant violations at lower
+  precision? Or would it just pad with non-violations? The extreme selectivity
+  may be a feature OR it may mean GPT-5 is discarding valid-but-hard-to-verify
+  findings.
+- **NEW (Finding #20):** Opus's self-correction behavior is now confirmed across
+  Finding #15 and #20. Is this trainable/promptable? Could we ask non-Opus models
+  to "show your reasoning and withdraw findings you cannot fully verify"?
+- **NEW (Finding #22):** The "silent correctness" lens revealed three distinct
+  analytical modes: Opus → missing functionality, GPT-5 → mechanism incorrectness,
+  Sonnet → composition failures. Does this three-way differentiation hold on other
+  documents, or was it specific to the regulatory/financial domain of specid-lot-selection?
+- **NEW (Finding #22):** Does the "silent correctness" lens work on non-financial
+  documents? The financial/regulatory domain has a large gap between syntactic and
+  semantic correctness. Would the same prompt on an infrastructure/systems doc produce
+  equally differentiated findings, or would it collapse into assumption-finding?
+- ~~**NEW (Finding #22):** Opus's "missing feature identification" mode (wash sales,
+  commissions) — is this promptable on other models? Could we explicitly ask GPT-5
+  "what should this system compute but doesn't" and get similar results?~~
+  **ANSWERED (Finding #26):** YES — all three models find regulatory gaps and
+  missing features when explicitly prompted. Opus's unique behavior in #22 was
+  an emergent DEFAULT tendency, not a capability. Prompt framing dominates
+  model personality.
+
+- **NEW (Finding #28):** Cross-document consistency found real bugs in gargoyle
+  docs (fills vs events, position ownership, signal persistence). Does running
+  this analysis across MORE document pairs (e.g., domain readmes vs implementation
+  docs, design docs vs plan docs) yield additional real inconsistencies? Could
+  become a systematic documentation maintenance tool.
+- **NEW (Finding #28):** Opus was 2.4x faster AND found more issues than GPT-5
+  on cross-document consistency. Is this because cross-doc contradictions are
+  easy to verify once spotted (reducing GPT-5's verification advantage)? Or
+  because boundary reasoning (Opus's strength) is the primary skill needed?
+
+## Methodology Notes
+
+- Internet opinions about models are overwhelmingly about coding. Don't
+  extrapolate to analytical work without testing.
+- "Just because someone says it on the internet doesn't make it right." —
+  Aaron, 2026-04-26. Opinions need context. Track our own evidence.
+- Absence of published methodology for a use case is itself a finding.
+- Each finding needs: date, task, **how we used it** (context shape, task
+  framing, what info the model had/didn't have), what happened, takeaway.
+  No unsupported generalizations.
+- **Context dimensions to track:**
+  - Rich vs minimal (how much background info)
+  - Broad vs focused ("review this" vs "answer this specific question")
+  - What kind of context (diff, full files, issue text, research notes,
+    project conventions, nothing)
+  - Whether the model had access to tools or just text
+  - Whether the task was explicit step-by-step or open-ended
diff --git a/findings/2026-05-05-28-crossdocument-consistency-analysis-new-task.md b/findings/2026-05-05-28-crossdocument-consistency-analysis-new-task.md
new file mode 100644
index 0000000..a054a59
--- /dev/null
+++ b/findings/2026-05-05-28-crossdocument-consistency-analysis-new-task.md
@@ -0,0 +1,178 @@
+# Finding 28: Cross-document consistency analysis: NEW task type — GPT-5 finds deep semantic contradictions; Opus finds structural/boundary mismatches; Sonnet identifies core issues quickly
+
+**Date:** 2026-05-05
+**Task:** Identify contradictions and inconsistencies BETWEEN two architecture documents
+describing the same system: `system-overview.md` (323 lines, narrative overview with
+component flows, invariants, and domain events) and `architecture.md` (213 lines,
+DDD-focused with bounded contexts, context map, and message taxonomy).
+**How we used them:** BOTH documents provided as full text in a single prompt (~25KB
+total). Highly structured prompt specifying 5 categories of cross-document inconsistency
+(terminology conflicts, structural contradictions, flow/sequence conflicts,
+ownership/authority conflicts, philosophical contradictions). Required specific output
+format per finding. Explicitly excluded omissions (things one doc covers and the other
+doesn't) and detail-level differences. No tools, no project context beyond the two
+documents. This is a NEW analytical task not previously tested: reasoning about
+CONSISTENCY BETWEEN documents rather than internal coherence of a single document.
+
+| Model | Time | Output tokens | Reasoning tokens | Inconsistencies found | Critical | High | Medium |
+|---|---|---|---|---|---|---|---|
+| GPT-5 | 125s | 9,415 | 8,384 | 6 | 2 | 3 | 1 |
+| Claude Opus 4.6 | 52s | 2,351 | (internal) | 7 | 3 | 3 | 1 |
+| Claude Sonnet 4.6 | 14s | 776 | (internal) | 4 | 1 | 2 | 1 |
+
+**What they found — common ground (all 3 identified):**
+- Event sourcing (all events as source of truth) vs fills-only ground truth:
+  Document A says fills are "ground truth from which all other state can be
+  derived," while Document B says "events are the source of truth, state is
+  computed by replaying events." A treats fills as the recovery foundation;
+  B treats ALL domain events as authoritative. All three models rated this
+  Critical.
+- Bounded context naming mismatch: "Decision Engine" / "Order Management" (A)
+  vs "Engine" / "Trading" (B) for the same functional responsibilities.
+  GPT-5 folded this into a broader ownership analysis; Opus and Sonnet
+  surfaced it as its own finding.
+- Signal classification conflict: Document A lists "Signal emitted" as a domain
+  event; Document B explicitly categorizes `SignalEmitted` as an audit event
+  ("not used to rebuild state"). This determines event store design and
+  recovery semantics.
+
+**GPT-5 unique findings (not in either Claude model):**
+- Signal persistence contradiction: Document A states "Signals are never
+  persisted" while Document B lists `SignalEmitted` as an audit event that IS
+  persisted and states the audit log is mandatory for trading. These are
+  directly incompatible claims about whether signal data is stored.
+- Audit event ownership conflict: Document A says "Decision approved" events
+  originate from PortfolioRisk. Document B states "only the decision engine
+  writes audit events" and lists `DecisionApproved` as an audit event example.
+  If PortfolioRisk is part of Risk (not Engine), this is an authority violation.
+- "Single writer per user" (A: OrderManager writes all trading state) vs
+  per-aggregate single-writer (B: each aggregate writes its own event stream,
+  Ledger owns positions). These are incompatible authority models — either OM
+  centralizes writes or each domain owns its own events.
+
+**Claude Opus unique findings (not in either other model):**
+- Engine → OrderManager is an internal pipeline flow (A: same subgraph, direct
+  arrow) vs Engine → Trading is a cross-domain COMMAND (B: `PlaceOrder` command
+  crossing a bounded context boundary). This structural disagreement determines
+  whether order management is an internal pipeline stage or an independent domain
+  with its own aggregates and command validation.
+- Signal Risk's architectural position: Document A shows a two-stage risk
+  architecture (SignalRisk pre-aggregation, PortfolioRisk post-aggregation)
+  where Risk is embedded in the pipeline. Document B's context map shows Risk
+  as a separate domain that Engine merely QUERIES ("kill switch active?") —
+  no arrow shows signal routing through Risk. Either risk logic lives inside
+  Engine (contradicting B's context boundary) or the context map is incomplete.
+- The "reduce" step ownership: A's top-level flow labels `Approved →|"reduce"|
+  Decisions` (reduction at aggregation), while A's own domain events table says
+  "Decision reduced" originates from PortfolioRisk (reduction after aggregation).
+  This is actually an INTRA-document inconsistency in Document A, but Opus surfaced
+  it as part of cross-doc analysis.
+
+**Claude Sonnet unique findings:**
+- None genuinely unique. All 4 findings overlapped with GPT-5/Opus common ground
+  (event sourcing, signal persistence, context count/naming). Sonnet was efficient
+  (14s, 776 tokens) but didn't identify any inconsistency that the other two missed.
+
+**Quality assessment:**
+- **GPT-5** produced 6 well-reasoned findings with the deepest analysis of
+  OWNERSHIP conflicts. Its signal-persistence contradiction and single-writer
+  authority conflict are genuinely important — they reveal places where the two
+  documents would lead implementers to build fundamentally different systems.
+  Every finding quotes specific text from both documents and explains precisely
+  WHY they can't both be correct. The reasoning investment (8,384 tokens) was
+  used for thorough cross-referencing between documents.
+- **Claude Opus** found the most inconsistencies (7) and was remarkably fast
+  (52s vs GPT-5's 125s). Its unique strength: identifying STRUCTURAL contradictions
+  about component boundaries and communication patterns. The Engine→Trading
+  command vs internal pipeline finding is architecturally the most significant
+  discovery — it reveals a fundamental disagreement about whether order
+  management is INSIDE or OUTSIDE the decision engine's boundary. Opus also
+  caught a bonus intra-document inconsistency (the "reduce" labeling error).
+- **Claude Sonnet** was the fastest (14s) and most concise (776 tokens) but
+  found only the obvious common-ground issues. For cross-document consistency,
+  Sonnet's speed advantage came at the cost of missing the architectural
+  insights that make this task valuable. It did correctly identify all the
+  Critical-level issues, making it viable as a quick first-pass screen.
+
+**Key insight — cross-document consistency is a DISTINCT task type:**
+This is fundamentally different from single-document analysis (assumptions,
+race conditions, coherence). It requires:
+1. Building a mental model from Document A
+2. Building a separate mental model from Document B
+3. Finding places where the models are incompatible
+4. Reasoning about WHY they can't both be correct (not just "different")
+
+Step 4 is what distinguishes this from simple diff-detection. Many surface
+differences (naming, detail level, scope) are NOT contradictions — the models
+must judge which differences are genuinely incompatible vs. complementary.
+The prompt explicitly excluded omissions and detail-level differences, and
+all three models respected this constraint well.
+
+**Model strengths on cross-document analysis:**
+- **GPT-5** excels at ownership/authority conflicts: it systematically
+  checked "who owns this concept" in each document and found mismatches.
+  Its findings cluster around "who writes what" and "who is authoritative."
+- **Opus** excels at structural/boundary contradictions: it identified where
+  the documents draw architectural lines differently. Its findings cluster
+  around "where are the boundaries" and "what crosses them."
+- **Sonnet** identifies the obvious/critical issues quickly but doesn't dig
+  deeper. Viable for screening, not for thorough analysis.
+
+**Comparison to Finding #15 / #27 (single-document coherence checking):**
+Single-document coherence asks "does this document contradict itself?"
+Cross-document consistency asks "do these documents contradict each other?"
+Key differences in results:
+
+| Aspect | Single-doc coherence | Cross-doc consistency |
+|---|---|---|
+| Opus findings | 5-7 | 7 |
+| GPT-5 findings | 4-6 | 6 |
+| Sonnet findings | 4-5 | 4 |
+| Opus unique | Design tensions | Structural/boundary mismatches |
+| GPT-5 unique | Definitional errors | Ownership/authority conflicts |
+| Best model | Task-dependent | Opus (most findings, 2.4x faster than GPT-5) |
+
+The relative ordering is similar (Opus ≥ GPT-5 > Sonnet for coherence-style
+tasks), but the CHARACTER of unique findings shifted. On single-doc coherence,
+Opus finds design tensions within a single design. On cross-doc consistency,
+Opus finds BOUNDARY disagreements between two designs. GPT-5 shifts from
+finding definitional errors to ownership conflicts.
+
+**Are these findings REAL bugs in the gargoyle documentation?**
+Yes — several are genuine issues worth fixing:
+1. The fills-vs-events-as-ground-truth is a real philosophical tension between
+   the two documents that needs resolution.
+2. The Position event ownership (OrderManager vs Ledger) is a real boundary
+   conflict that affects implementation.
+3. The Engine→Trading communication style (internal pipeline vs cross-domain
+   command) is a genuine structural ambiguity.
+4. The signal persistence claim ("never persisted" vs `SignalEmitted` audit
+   event) is a direct textual contradiction.
+
+These are the kind of cross-document inconsistencies that cause teams to build
+inconsistent implementations — one engineer reads Document A and builds one way,
+another reads Document B and builds differently.
+
+**Practical implication:** Cross-document consistency analysis is a high-value
+task for documentation maintenance. Run it when:
+- A system has multiple architecture docs written at different times
+- A refactoring has updated one doc but not another
+- Multiple people contribute to design documentation
+- Moving from high-level overview to detailed specification
+
+Opus is the recommended model for this task: faster than GPT-5 (52s vs 125s),
+more findings (7 vs 6), and uniquely strong at boundary disagreements. GPT-5
+adds value for ownership-specific conflicts. Sonnet is the fastest (14s) and
+sufficient for quick screening (it catches the Critical issues) but won't find
+the architectural insights.
+
+**Cost-effectiveness:**
+- Opus: 2,351 output tokens for 7 findings = 336 tokens/finding (52s)
+- GPT-5: 9,415 output + 8,384 reasoning tokens for 6 findings = 2,967 tokens/finding (125s)
+- Sonnet: 776 output tokens for 4 findings = 194 tokens/finding (14s)
+
+Opus is the clear winner on this task type. Compared with GPT-5, it produced
+more findings, ran 2.4x faster, and was 8.8x more token-efficient per finding.
+GPT-5's large reasoning investment (8,384 tokens) still yielded one fewer
+finding than Opus. The verification overhead is not paying off here because cross-document
+contradictions are relatively easy to verify once identified (just check both documents).
diff --git a/findings/2026-05-05-29-adversarial-manipulation-analysis-new-task.md b/findings/2026-05-05-29-adversarial-manipulation-analysis-new-task.md
new file mode 100644
index 0000000..39dacb7
--- /dev/null
+++ b/findings/2026-05-05-29-adversarial-manipulation-analysis-new-task.md
@@ -0,0 +1,174 @@
+# Finding 29: Adversarial manipulation analysis: NEW task type — GPT-5 is most exhaustive and systematic; Opus produces qualitatively different attack vectors with system-level thinking; Sonnet is adequate but less creative
+
+**Date:** 2026-05-05
+**Task:** Identify adversarial manipulation paths in gargoyle's `aggregation.md` (193 lines)
+— how a misbehaving, compromised, or buggy upstream component could exploit the
+aggregator's design guarantees to produce harmful trading outcomes that bypass
+downstream safety controls.
+**How we used them:** Same document (full text) + same focused analytical question to all
+3 models via HAI proxy. Highly structured prompt specifying 5 categories of adversarial
+manipulation (signal injection, timing manipulation, capacity weaponization, state
+corruption via crash, audit evasion). Required specific output format per finding
+(attack vector, mechanism, exploit, why downstream controls miss it, severity). No tools,
+no project context beyond the document itself.
+
+| Model | Time | Output tokens | Reasoning tokens | Attack vectors found | Critical | High | Medium |
+|---|---|---|---|---|---|---|---|
+| Claude Sonnet 4.6 | 27s | 1,257 | (internal) | 10 | 3 | 5 | 2 |
+| Claude Opus 4.6 | 84s | 3,662 | (internal) | 12 | 5 | 5 | 0 |
+| GPT-5 | 111s | 8,808 | 6,336 | 15 | 2 | 10 | 3 |
+
+**What they found — common ground (all 3 identified):**
+- Primary signal hijacking via ranking manipulation (last-tick injection in
+  time-windowed aggregation to control decision parameters)
+- Threshold gaming via signal replay/duplication (no deduplication means N
+  identical signals satisfy "N confirmations")
+- Capacity flooding to force premature completion or deny legitimate trades
+- Strategic crash to erase unfavorable in-flight groups
+- Timeout-masqueraded manipulation (making attacks look like normal system behavior
+  in the audit trail)
+
+**GPT-5 unique findings (not in either Claude model):**
+- **Direction flip against majority via ranking:** In "most recent" ranking,
+  emit multiple SELL confirmations then inject a late BUY — the BUY becomes
+  primary and the decision contradicts the bulk of evidence. Distinct from
+  general primary hijack because it's specifically about *directional* reversal.
+- **Late-arrival exclusion of counter-signals:** Time signals so countervailing
+  signals arrive just after group destruction, ensuring the decision is formed
+  without dissenting inputs that would have altered ranking.
+- **Capacity filter to curate the audit set:** Pre-fill buffer with chosen
+  signals so riskier alternatives cannot be included before capacity fires —
+  the contributing signals list looks clean.
+- **Timer nullification by crash:** Crash just before a timeout that would
+  force-complete an unfavorable decision — the timer becomes no-op on restart,
+  no decision or expiry event is emitted.
+- **Decision drop via induced forwarding failure:** Exploit the "Decision
+  forwarding fails: Decision is lost" failure mode to selectively suppress
+  protective decisions (stops, hedges) with no automatic retry.
+- **Crash to erase evidence of contrary signals:** Post-crash, submit a
+  fresh group that completes quickly; audit shows only the new set, not the
+  earlier contradictory pre-crash signals.
+
+**Claude Opus unique findings (not in either other model):**
+- **Instrument fragmentation to multiply position size:** Emit signals for
+  economically equivalent exposures using different instrument identifiers.
+  Each gets its own group, each produces a separate decision, bypassing
+  per-group capacity limits. Combined position exceeds what any single group
+  would allow. Identifies TOCTOU at the fan-in to PortfolioRisk.
+- **Forced stale decision via timer exploitation:** Emit one signal at a
+  favorable price spike known to be transient, then deliberately withhold
+  further signals. Timer force-completes with a stale price. The entry price
+  WAS valid when the signal was generated — PortfolioRisk doesn't check
+  staleness of decision prices.
+- **Timeout prevention / keep-alive suppression:** Manipulate market data
+  feed to suppress signals that would reach threshold N. Group expires
+  normally — denial-of-trading attack disguised as insufficient confirmation.
+- **Crash-restart duplicate decisions:** Crash after decision is forwarded
+  but before strategy reflects it. Both restart "clean" — strategy re-emits
+  signals, aggregator produces a second decision with a fresh ID. Same trade
+  executes twice. PortfolioRisk can't deduplicate because IDs are different.
+- **Force-complete with insufficient confirmation (capacity < threshold):**
+  If the capacity limit is lower than the threshold, hitting capacity ALWAYS
+  force-completes before the predicate is satisfied. With a capacity of 3, a
+  5-confirmation strategy effectively becomes a 3-confirmation strategy.
+- **Pattern predicate as arbitrary decision trigger:** If adversary controls
+  predicate logic (via strategy configuration), can make pattern-complete
+  trigger on any single signal while audit shows algorithm=pattern-complete
+  and reason=:predicate. Trust boundary between configuration and execution.
+
+**Claude Sonnet unique findings (not in either other model):**
+- **Cross-group timing coordination:** Coordinate signal injection across
+  multiple instruments to synchronize completion times, creating a burst of
+  correlated decisions that overwhelm PortfolioRisk's individually safe
+  evaluations. (NOTE: Opus found a similar concept — instrument fragmentation
+  — but framed it differently: Opus focused on position multiplication via
+  instrument aliasing, Sonnet focused on burst timing overwhelming evaluation.)
+- **Multi-strategy attack distribution:** Spread manipulation across multiple
+  isolated strategy aggregators so no single aggregator's behavior looks
+  abnormal while cumulative effect is harmful.
+
+**Quality assessment:**
+- **GPT-5** produced the most findings (15) with the most systematic coverage
+  across all 5 prompt categories. Its strength was in identifying SPECIFIC
+  INTERLEAVINGS — exactly how timing, state, and ranking mechanisms interact
+  to produce exploits. The direction-flip finding (#3) and the late-arrival
+  exclusion finding (#6) show precise temporal reasoning about when signals
+  arrive relative to group lifecycle events. The "decision drop via forwarding
+  failure" finding exploits a DOCUMENTED failure mode (from the failure table)
+  as an offensive weapon — turning a recovery mechanism into an attack vector.
+  Every finding references specific mechanisms from the spec.
+- **Claude Opus** produced 12 findings with the most architecturally creative
+  attacks. The instrument fragmentation attack is the most SYSTEMICALLY
+  dangerous finding across all three models — it's not about manipulating one
+  group but about the RELATIONSHIP between groups, and it identifies a
+  TOCTOU vulnerability at the PortfolioRisk fan-in point that no other model
+  found. The crash-restart duplication attack is also architecturally novel —
+  it exploits the "clean state" guarantee as a weapon for invisible trade
+  doubling. Opus consistently reasons about the system BOUNDARY (aggregator
+  → PortfolioRisk handoff) rather than just within-component mechanics. The
+  pattern-predicate trust boundary finding is uniquely about CONFIGURATION
+  as an attack surface.
+- **Claude Sonnet** produced 10 findings in 27s and was extremely efficient
+  (about 126 output tokens per finding). Findings were adequate and covered all 5 categories,
+  but lacked the specificity of GPT-5 and the architectural creativity of
+  Opus. Several findings were somewhat generic (e.g., "crash at strategic
+  moments" without specifying exactly WHEN relative to group lifecycle).
+  The cross-group coordination and multi-strategy distribution findings show
+  system-level thinking but are stated at a higher abstraction level without
+  concrete exploit sequences.
+
+**Key insight — "adversarial manipulation analysis" as a task type:**
+This is qualitatively different from all previous analytical lenses tested.
+Previous tasks asked models to find problems WITH the design (assumptions,
+races, incoherences). This task asks models to find ways to USE the design
+AGAINST itself — a creative/generative adversarial task. Results:
+
+- **GPT-5** treats it as an exhaustive enumeration exercise — systematically
+  walks through each mechanism and asks "how could this be abused?" High
+  count (15), thorough coverage, but some findings are minor variations of
+  each other (e.g., crash-related findings #10, #12, #15 share the same core
+  mechanism). Reasoning tokens (6,336) were used for both generation and verification.
+- **Opus** treats it as a creative design exercise — asks "what would a
+  smart adversary do that the designer didn't consider?" Fewer findings (12)
+  but several are genuinely novel attack concepts (instrument fragmentation,
+  crash-restart duplication, predicate trust boundary) that require reasoning
+  about the SYSTEM rather than the COMPONENT. Opus also provided a summary
+  table and systemic conclusion about the root design weaknesses.
+- **Sonnet** treats it as a categorization exercise — fills each prompt
+  category with plausible attacks but at a higher abstraction level. Fast
+  and adequate for a first pass but wouldn't surprise a security reviewer.
+
+**Comparison to "predictable exploit window" (Finding #18):**
+Finding #18 noted that Opus uniquely identified predictable exploit windows
+in escalation-policy.md. Here, Opus again shows the strongest adversarial
+creativity — the instrument fragmentation attack and crash-restart duplication
+are both about exploiting DESIGN GUARANTEES (per-instrument grouping, clean
+restart) as weapons. This confirms that Opus's strength on adversarial analysis
+is a CONSISTENT PATTERN, not document-specific.
+
+GPT-5 excels when the adversarial task is framed as "enumerate all possible
+abuses of each mechanism" (systematic coverage). Opus excels when the task
+requires "invent novel attack concepts that exploit design boundaries"
+(creative adversarial thinking).
+
+**Model hierarchy for adversarial manipulation analysis:**
+1. GPT-5 — most thorough enumeration, best at mechanism-level exploitation (15)
+2. Opus — most creative, finds system-boundary attacks others miss (12)
+3. Sonnet — adequate first pass, fast, but less specific (10)
+
+**Practical implication:** For security-oriented architecture review:
+- Run GPT-5 for comprehensive attack surface enumeration
+- Run Opus for novel/creative attack vectors that exploit design boundaries
+- Sonnet is sufficient only as a quick initial screen
+- The UNION of GPT-5 + Opus findings (removing overlaps) would produce the
+  most complete adversarial analysis
+
+**New finding about the aggregator itself:** Several attacks identified by
+multiple models point to real design weaknesses worth addressing:
+1. No signal deduplication/independence validation (all 3 models)
+2. Primary signal determines all decision parameters regardless of group
+   composition (all 3 models)
+3. Transient state + no replay = perfect adversarial erasure tool (all 3)
+4. Capacity/timeout treated as normal events even when weaponized (all 3)
+5. No cross-group correlation at aggregator level (Opus + Sonnet)
+6. TOCTOU at PortfolioRisk fan-in for concurrent decisions (Opus)
diff --git a/findings/ALL-FINDINGS.md b/findings/ALL-FINDINGS.md
deleted file mode 100644
index ed0762d..0000000
--- a/findings/ALL-FINDINGS.md
+++ /dev/null
@@ -1,3249 +0,0 @@
-# Model Findings — Analytical & Research Work
-
-_Tracking what actually works (and doesn't) when using AI models for research,
-analysis, bias detection, and document review — not coding._
-
-Started: 2026-04-26
-
-## Context
-
-We use multiple models in different roles: Claude Code (Opus/Sonnet) for
-generation, Sonnet + GPT-5 for independent dual review, smaller models for
-focused analytical tasks. Most public discussion is about coding. We found
-almost no published methodology for using models in analytical research tasks
-(searched 2026-04-26). That gap is why we're tracking this.
-
-## Findings
-
-### 1. Different models catch different things (confirmed)
-
-**Date:** 2026-04-26
-**Task:** PR reviews on DDD reference docs (~6,600 lines across 18 files)
-**How we used them:** Both models got the same task via pr-review skill —
-fetch diff, fetch full file content for changed files, review against PR
-description and linked issue acceptance criteria. Rich context: full diff,
-project CLAUDE.md conventions, issue body. Each reviewer ran independently
-in its own sub-agent with its own Gitea token. No cross-pollination.
-
-- GPT-5 caught SUMMARY.md verdict mismatches (Commanded classification,
-  small teams classification) that Sonnet missed entirely (PR #375)
-- Sonnet caught a broken cross-reference link first that GPT-5 missed (PR #378)
-- **Takeaway:** Different blind spots are real. Neither model is strictly better
-  for analytical review — they complement each other. This is why we run two
-  independent reviewers from different model families.
-
-### 2. Cheap model + narrow lens > expensive model + broad review (one data point)
-
-**Date:** 2026-04-26
-**Task:** Check 12 rewritten hypotheses for directional bias
-**How we used them:**
-- Sonnet & GPT-5: full PR review context (diff, file content, issue, AC).
-  Broad mandate: "review this PR." Rich context but unfocused task.
-- GPT-4.1 Mini: given ONLY the 12 hypothesis texts + one focused question:
-  "Do any of these hypotheses lead toward a predetermined conclusion?"
-  Minimal context, laser-focused task. No diff, no project docs, no issue.
-
-- Both Sonnet and GPT-5 approved the hypotheses as reviewers
-- GPT-4.1 Mini found ALL 12 pushed toward predetermined conclusions
-- Words like "requires," "necessary," "must be" were flagged as directional
-- **Takeaway:** Task framing mattered more than model size. Rich context +
-  broad mandate = missed the forest for the trees. Minimal context + precise
-  question = found exactly what mattered. This needs more testing — was it
-  the narrow framing, the lack of surrounding context, or both?
-
-### 3. GPT-5 times out on complex multi-step analytical tasks (confirmed pattern)
-
-**Date:** 2026-04-26
-**Task:** Full PR review of #382 (research document rewrite)
-**How we used it:** pr-review skill — multi-phase (fetch diff, fetch files,
-check CI, analyze against AC, post inline comments, post summary). 7 phases,
-many curl calls to Gitea API, large diff context. Heavy tool-use workflow
-through SAP proxy (adds latency vs direct API). 300s timeout.
-
-- Timed out 3 times at 300s (17, 6, 6 tool calls respectively)
-- Bottleneck was model processing time, not network (~0.3s Gitea API latency)
-- **Takeaway:** Break analytical tasks into focused bounded pieces. Twelve
-  small deep reviews > one rushed big one. The issue isn't GPT-5's analysis
-  quality — it's that multi-phase tool-heavy workflows burn too much time
-  on mechanics. Separate the data gathering from the analysis.
-
-### 4. GPT-5 defaults to delegation; Claude defaults to doing the work
-
-**Date:** 2026-04-26
-**Task:** PR review delegation to sub-agents
-**How we used them:** Both spawned as sub-agents from main session with
-same task description, same pr-review skill file, same Gitea credentials.
-Difference: GPT-5 got model override to gpt5, Sonnet used default model.
-Both got full skill instructions.
-
-- GPT-5 first attempt: spawned sub-sub-agents and timed out
-- GPT-5 with "do it yourself, no sub-agents" + step-by-step: worked
-- Even with constraints, GPT-5 sometimes dumps raw tool output instead of
-  synthesizing — needs explicit output format instructions
-- Claude (Sonnet/Opus) given the same kind of task does the work directly
-- **Takeaway:** GPT interprets complex task descriptions as delegation
-  opportunities. Claude interprets them as work to do. For GPT: explicit
-  single-actor instructions + output format. For Claude: can give broader
-  mandate. Same skill file, very different behavior.
-
-### 5. Sonnet is fast and catches structural issues; GPT-5 is slow and catches semantic issues
-
-**Date:** 2026-04-26
-**Task:** Dual review across PRs #372, #375, #378, #380, #382
-**How we used them:** Same pr-review skill, same context (diff + files +
-issue + AC), same sub-agent pattern. Only variable: model. Both got rich
-context. Both ran the full 7-phase review skill.
-
-- Sonnet consistently finishes first, catches formatting, broken links,
-  structural problems (missing sections, dangling refs)
-- GPT-5 takes longer, catches meaning-level problems (verdict mismatches,
-  classification inconsistencies, logical gaps)
-- **Takeaway:** With identical rich context and identical instructions, the
-  models naturally gravitate to different things. Sonnet is the structural
-  reviewer; GPT-5 is the semantic reviewer. Both roles matter. Question:
-  would Sonnet catch semantic issues if given a narrower "check for logical
-  consistency" framing instead of broad review?
-
-### 6. Single agent can't handle 1000+ line document generation (confirmed pattern)
-
-**Date:** 2026-04-26
-**Task:** DDD v2 forge analysis drafting
-**How we used them:** Single Sonnet/Opus sub-agents given full research
-material (~3,874 lines of research notes) + outline + instructions to write
-complete document. Very rich context (all research), very large output
-requirement (1000+ lines).
-
-- Five single-agent attempts died (OOM, disconnect, timeout) trying to write
-  full documents
-- Sectional approach (5 parallel Sonnet subagents, ~500-700 lines each)
-  succeeded immediately — each got same research but only their section's
-  outline
-- Same pattern when Claude Code attempted full Part V rewrite — died
-- Three agents × ~320 lines each worked first try
-- **Takeaway:** This is a confirmed, repeatable limit for generation tasks.
-  Not model-specific — it's a context/output length problem. Rich input
-  context is fine; it's the output length that kills. Break output into
-  sections, keep input context rich, draft in parallel, assemble.
-
-### 7. Emerging role assignments (pattern, not conclusion)
-
-**Date:** 2026-04-26 (one day of intensive work — treat as hypothesis)
-
-- Opus (via Claude Code): complex generation needing deep project context.
-  Rich context: CLAUDE.md, full codebase access, design docs. Broad mandate.
-- Sonnet: parallel volume work (5 subagents drafting simultaneously).
-  Rich context per section, constrained output scope.
-- GPT-5: independent analytical review. Rich context (diff + files + issue).
-  Best when task is bounded and explicit.
-- GPT-4.1 Mini: focused narrow analysis (bias detection). Minimal context,
-  precise question. Cheap and fast.
-- **Takeaway:** The role assignment matters, but so does the context shape.
-  Opus gets broad context + broad mandate. Sonnet gets broad context +
-  narrow scope. GPT-5 gets rich context + explicit task. GPT-4.1 Mini gets
-  minimal context + laser question. We haven't tested swapping these
-  combinations — that's where the real learning will come from.
-
-### 8. Bias detection: all models catch it with any framing — when the signal isn't buried
-
-**Date:** 2026-04-27
-**Task:** Detect directional bias in 8 deliberately biased hypotheses about
-microservices vs monolith architecture for fintech startups.
-**How we used them:** Created fresh test material (8 hypotheses with pro-
-microservices bias via absolutes like "inevitably," "necessary," "must,"
-"requires," plus one factually inverted claim about consistency guarantees).
-Ran 4 conditions in parallel sub-agents:
-
-| Condition | Model | Framing | Context |
-|---|---|---|---|
-| A | GPT-4.1 Mini | Narrow: "Do any lead toward a predetermined conclusion?" | Hypotheses only |
-| B | Sonnet | Same narrow question | Hypotheses only |
-| C | GPT-5 | Same narrow question | Hypotheses only |
-| D | Sonnet | Broad: "Review quality, clarity, testability, and issues" | Hypotheses only |
-
-**Results:**
-- **All 4 conditions detected 8/8 biased hypotheses.** No misses.
-- All 3 narrow-framing models (Mini, Sonnet, GPT-5) produced structurally
-  similar output: per-hypothesis verdict, biasing words, neutral version,
-  severity assessment.
-- All 3 narrow-framing models flagged H8's factual inversion (distributed
-  transactions DON'T provide stronger consistency than monolithic ACID).
-- GPT-5 added specific counterexamples (LMAX Disruptor, Shopify, Stack
-  Overflow, Basecamp) — marginally richer analysis.
-- Sonnet broad mandate also caught the bias — framed as one of three
-  "systemic problems" (deterministic language, pro-microservices framing
-  bias, underspecified constructs). Additionally provided testability and
-  operationalization analysis that the narrow framing didn't ask for.
-- Sonnet broad took ~72s vs ~39s for narrow conditions (more output).
-
-**Takeaway:** When the biased text is the ONLY input (no surrounding noise),
-all tested models — including the cheapest (GPT-4.1 Mini) — detect bias
-regardless of whether the question is narrow or broad. This appears to
-**contradict** original finding #2 ("cheap model + narrow lens > expensive
-model + broad review"), but the key difference is context noise:
-
-- **Original experiment (2026-04-26):** Sonnet and GPT-5 missed bias during
-  FULL PR REVIEW with rich project context (diff, file content, issue text,
-  acceptance criteria, project conventions). The hypotheses were buried in
-  layers of review mechanics.
-- **This experiment (2026-04-27):** Even the "broad" condition gave ONLY the
-  hypothesis text — no diff, no PR structure, no project context noise.
-
-**Refined hypothesis:** The original finding #2 was about **signal-to-noise
-ratio**, not about model capability or framing precision. When biased text
-is presented in isolation, any model catches it. When biased text is buried
-in a large PR review with many other things to check, the bias signal gets
-lost in the noise — unless you explicitly ask about it. The "narrow lens"
-worked because it eliminated the noise, not because smaller models are
-better at bias detection.
-
-**Next experiment to confirm:** Give a model the FULL PR review context
-(diff, files, issue, AC) but add the narrow bias question as an explicit
-review checklist item. If the model catches bias despite the rich context,
-it confirms the signal-to-noise hypothesis. If it misses, it suggests
-something else is at play (attention allocation, task switching cost).
-
-### 9. Gap-finding in architecture docs: GPT-5 finds domain-specific gaps, GPT-4.1 is generic, Mini is formulaic
-
-**Date:** 2026-05-02
-**Task:** Identify missing failure scenarios in gargoyle's `failure-modes.md` (383 lines)
-**How we used them:** Same document (full text, no truncation) + same focused
-analytical question to all 3 models via HAI proxy (OpenAI-compatible endpoint).
-No tools, no project context beyond the document itself. Single prompt, no
-conversation history. Temperature 0.3 for GPT-4.1/Mini, default (1.0) for GPT-5
-(required by the model).
-
-| Model | Time | Output tokens | Reasoning tokens | Scenarios found |
-|---|---|---|---|---|
-| GPT-4.1 Mini | 16s | 2,003 | 0 | 10 |
-| GPT-4.1 | 24s | 2,575 | 0 | 15 |
-| GPT-5 | 45s | 8,565 | 6,656 | 14 |
-
-**What they found — common ground (all 3 identified):**
-- ETS table corruption/loss affecting gates
-- BEAM scheduler starvation / GC pauses
-- WebSocket message duplication/reordering
-- Postgres connection pool exhaustion / deadlocks
-- Clock skew / time drift
-- Process registry inconsistency
-
-**GPT-5 unique findings (not in either other model):**
-- Broker rate limiting (429s) — not "connection lost" so existing logic
-  doesn't trigger, but can't flatten during kill switch
-- Broker auth failure / credential rotation — distinct from connection loss
-- Corporate actions (splits, symbol changes) — position drift without
-  triggering staleness detection
-- Duplicate pipeline instances for same user (DynamicSupervisor race)
-- DB "commit unknown outcome" causing restart loops (Ecto commit succeeds
-  at Postgres but client times out → retry → unique constraint → crash loop)
-- Cross-symbol strategies with partial staleness — multi-leg signals
-  computed from mix of fresh and stale data
-- Partial cancel_all during kill switch masked by process restarts
-
-**GPT-4.1 unique findings (not in GPT-5 or Mini):**
-- Zombie processes after halt (supervisor misconfiguration)
-- Unsupervised Task crashes going unnoticed
-- Audit log writes failing silently (not in same transaction as state change)
-- ClOrdID unique constraint violation from race in sequence generation
-- Broker API semantic changes (silent breaking changes)
-
-**GPT-4.1 Mini unique findings:**
-- Race between kill switch engagement and reconciliation completion
-  (timing coordination gap) — this was more explicitly called out than
-  in the other models, though GPT-5 touches it implicitly
-- Strategy.Worker / Aggregator partial crash inconsistency
-
-**Quality assessment:**
-- **GPT-5** had the most *domain-relevant* and *actionable* gaps. Broker
-  rate limiting, auth failures, corporate actions, and the DB commit
-  unknown-outcome scenario are all realistic production issues specific
-  to THIS system. The cross-symbol partial staleness finding shows
-  deeper architectural reasoning about component interactions.
-- **GPT-4.1** was thorough and well-structured but more generic/defensive.
-  Many of its unique findings (zombie processes, unsupervised Tasks,
-  audit log loss) are general Elixir concerns rather than specific to
-  the document's architecture. Good for a completeness checklist.
-- **GPT-4.1 Mini** was formulaic — each finding followed the same template
-  and several were somewhat surface-level or restated things the document
-  partially covers. Still found the most scenarios per dollar.
-
-**Takeaway:** For gap-finding in architecture documents, GPT-5's reasoning
-tokens pay off. It doesn't just list "things that could go wrong" — it
-identifies *specific interactions* that the document's existing mechanisms
-don't cover (e.g., rate limiting bypasses the "connection lost" detection,
-corporate actions bypass staleness detection). GPT-4.1 is a solid
-middle-ground: more thorough than Mini, less insightful than GPT-5.
-Mini is fine for a quick sanity check but won't find the subtle gaps.
-
-**Cost-effectiveness:** Mini found 10 scenarios in 16s for ~7K tokens.
-GPT-5 found 14 scenarios (with 7 genuinely unique insights) in 45s for
-~13.5K tokens (including 6.6K reasoning). For architecture review where
-missing a gap could mean financial loss, the GPT-5 cost is justified.
-For routine doc review, Mini + human judgment is probably sufficient.
-
-### 10. Hidden-assumption identification: GPT-5's reasoning produces qualitatively different (not just more) findings
-
-**Date:** 2026-05-02
-**Task:** Identify hidden assumptions in gargoyle's `cold-start-and-recovery.md` (234 lines)
-that could break under real-world production conditions.
-**How we used them:** Same document (full text) + same focused analytical question
-to all 3 models via HAI proxy (OpenAI-compatible endpoint). No tools, no project
-context beyond the document itself. Single prompt, no conversation history.
-Temperature 0.3 for GPT-4.1/Mini; GPT-5 uses default (required).
-
-| Model | Time | Output tokens | Reasoning tokens | Assumptions found |
-|---|---|---|---|---|
-| GPT-4.1 Mini | 25s | 3,090 | 0 | 12 |
-| GPT-4.1 | 77s | 2,751 | 0 | 14 |
-| GPT-5 | 78s | 2,649 | 4,096 | 26 |
-
-**What they found — common ground (all 3 identified):**
-- Broker API consistency/availability during reconciliation
-- ETS table availability and fail-closed behavior
-- Single-writer/mailbox ordering guarantees holding in practice
-- User independence assumption vs shared resources (rate limits, DB)
-- Reconciliation idempotency under repeated runs
-- Corporate action data completeness/timeliness
-- Escalation threshold calibration vs changing market conditions
-- Strategy warmup with partial/missing historical data
-- Signal expiry correctness on restart
-
-**GPT-5 unique findings (not in either other model):**
-- Unbounded mailbox growth during extended reconciliation (memory pressure
-  from queued messages at market open)
-- handle_continue side effects in OTHER processes (risk, metrics) acting
-  concurrently via different paths
-- Pre-existing GTC orders filling while gated (positions as moving target)
-- Broker position semantics mismatch (trade-date vs settled-date)
-- Strategy warmup evaluate() having non-signal side effects (metrics, caches)
-- Historical bar / live tick boundary alignment (double-processing or gaps)
-- ETS gate caching in process state creating fail-open windows
-- Correlated retry stampede when many users restart together
-- Corporate action double-application race with broker (missing idempotency
-  keys per action/instrument/date)
-- Kill switch state vs DB unavailability at startup
-- Market data subscriptions as shared bottleneck across "independent" users
-- Time-invariant signals incorrectly expired by aggregation window logic
-- Broker fills vs positions endpoints internally inconsistent (different caches)
-- Positions changing under reconciliation while kill switch is engaged
-- Gate phase sequencing: :ready written before worker warmup completes
-- Periodic reconciler allowing 1hr of divergent trading (rate-of-change blind)
-
-**GPT-4.1 unique findings (not in GPT-5 or Mini):**
-- No correlated failure handling (all failure modes treated as isolated) —
-  only model to frame this as a meta-assumption about the failure table
-
-**GPT-4.1 Mini unique findings:**
-- None that weren't also covered by the other two models
-
-**Quality assessment:**
-- **GPT-5** didn't just find more assumptions — it found *qualitatively
-  different kinds*. Many of its unique findings involve multi-component
-  interactions (mailbox + reconciliation + market open timing), semantic
-  mismatches (trade-date vs settled positions), and second-order effects
-  (metrics side effects during warmup, GTC orders filling while gated).
-  These require reasoning about system behavior across boundaries the
-  document doesn't explicitly draw.
-- **GPT-4.1** was competent and structured, found the same core assumptions
-  as Mini, plus one good meta-observation about correlated failures. But
-  it stayed within the document's own framing — it found assumptions the
-  document *almost* states rather than ones the document can't see.
-- **GPT-4.1 Mini** was formulaic. Every finding maps cleanly to a section
-  of the document. It's essentially "what could go wrong with each stated
-  mechanism" rather than "what does this design take for granted about
-  the world outside itself."
-
-**Key insight — reasoning tokens change the KIND of analysis:**
-GPT-5's 4,096 reasoning tokens aren't producing "more of the same" —
-they're producing a different analytical mode. The non-reasoning models
-(4.1 and Mini) identify risks within the document's own frame of reference.
-GPT-5 reasons about the document's relationship to the external world:
-broker semantics, deployment topology, OTP runtime behavior under load,
-timing correlations across independent subsystems. This is the difference
-between "what could this mechanism fail at" and "what must be true about
-the world for this mechanism to work."
-
-**Comparison to Finding #9 (gap-finding on failure-modes.md):**
-Same pattern confirmed. GPT-5 consistently finds domain-specific,
-interaction-level issues that require reasoning about component boundaries.
-GPT-4.1 is thorough within-frame. Mini is formulaic. The gap between
-GPT-5 and the others is larger here than in #9 — possibly because
-"hidden assumptions" requires more abstraction than "missing failure
-scenarios." Assumption-finding requires the model to reason about what
-ISN'T stated, which benefits more from extended reasoning.
-
-**Practical implication:** For architecture review, running GPT-5 on
-"identify hidden assumptions" is higher-value than the same question to
-non-reasoning models. The cost difference (4K extra reasoning tokens) is
-trivial for a document that will drive months of implementation. Use
-non-reasoning models for within-frame checks ("does this section have
-gaps") and reasoning models for cross-boundary analysis ("what must be
-true about the world for this to work").
-
-### 11. Hidden-assumption identification on simpler doc: reasoning models diverge in approach, both outperform non-reasoning
-
-**Date:** 2026-05-02
-**Task:** Identify hidden assumptions in gargoyle's `market-calendar.md` (238 lines)
-— a simpler, single-component document vs the 234-line cold-start doc from Finding #10.
-**How we used them:** Same document (full text) + same focused analytical question
-to all 3 models via HAI proxy. No tools, no project context beyond the document
-itself. Single prompt, no conversation history. Temperature 0.3 for GPT-4.1;
-GPT-5 and Opus use their defaults (required). Same prompt across all three.
-
-| Model | Time | Output tokens | Reasoning tokens | Assumptions found |
-|---|---|---|---|---|
-| GPT-4.1 | 19s | 2,554 | 0 | 14 |
-| Claude Opus 4.6 | 74s | 3,288 | (internal, not reported) | 13 |
-| GPT-5 | 101s | 8,417 | 5,504 | 24 |
-
-**What they found — common ground (all 3 identified):**
-- Alpaca calendar API data correctness/completeness as single source of truth
-- Alpaca API availability at startup (no local cache persistence)
-- ETS table atomicity during refresh (partial-state exposure risk)
-- System clock/timezone alignment (dates are timezone-naive)
-- NYSE emergency/unscheduled closures not reflected until refresh
-- Two-year cache range sufficiency
-- API response format stability
-- Rate limiting / API capacity concerns
-
-**GPT-5 unique findings (not in either other model):**
-- Date struct term-ordering in ETS match specs may not match chronological
-  order (ETS range guards rely on Erlang term comparison, not Date semantics)
-- close_time/1 returns naive Time without timezone — DST conversion burden on
-  consumers, one hour off twice per year
-- trading_day?/1 conflates "not a trading day" with "calendar unavailable" —
-  operational outages invisible to callers
-- ETS table name collision risk (global namespace per node)
-- No other process should modify the ETS table (access mode discipline)
-- Network egress and credential availability on all nodes at all times
-- ETS read/write concurrency flags for contention under load
-- Direct ETS access by consumers bypassing the module's error handling
-- next/prev_trading_day edge cases at cache boundaries
-- Alpaca API start/end parameter inclusivity (off-by-one at year boundaries)
-- Half-day vs full-day distinction insufficiency for special sessions
-- Small table size makes O(n) selects acceptable (scaling concern)
-- Year-end refresh failure leaving gaps at boundary
-- Alpaca never omits a legitimate trading day (absence = non-trading conflation)
-
-**Claude Opus unique findings (not in either other model):**
-- ETS ownership semantics: heir-protection would change fail-closed behavior;
-  current design means ALL consumers fail simultaneously during crash-to-restart
-  window (framed as a design tension, not just a risk)
-- Silent data corruption from partial API response (pagination/truncation) —
-  specifically that missing rows are SILENT failures with no error propagation
-  (other models mentioned API completeness but not the silence aspect)
-- Consumers calling functions with Dates, not DateTimes — the API accepts Date.t()
-  but doesn't specify HOW consumers should derive "today" (system-wide
-  coordination problem made invisible by the API contract)
-- `trading_day?/1` returning false is NOT fail-closed for ALL consumers — only
-  for PDT-like "block action" consumers; for batch-trigger consumers it's
-  fail-OPEN (subtle inversion of safety semantics)
-- Startup ordering: background_children placement means PDT could receive orders
-  before MarketCalendar finishes init, creating recurring rejection windows
-  during hot deploys
-- Continuous-running assumption for refresh timer (daily restarts would mean
-  refresh mechanism never fires — no staleness alert exists)
-
-**GPT-4.1 unique findings (not in either other model):**
-- No need for real-time calendar change notification (event emission gap)
-- All consumers using the same module instance (configuration consistency)
-- No need for historical calendar data (audit/backtesting limitation)
-- Consumers correctly handling {:error, :calendar_unavailable} in practice
-
-**Quality assessment:**
-- **GPT-5** found the most assumptions (24) with the most technical specificity.
-  Many are implementation-level insights (ETS term ordering, named table
-  collisions, read_concurrency flags) that demonstrate deep Erlang/OTP
-  knowledge. Some are slightly obvious or overlapping. The ETS term-ordering
-  finding is genuinely insightful: an Elixir Date is a struct (a map), and
-  Erlang term order compares map values in key order (day, then month, then
-  year), so a range guard on raw Date structs would NOT follow chronological
-  order; this is why `Date.compare/2` exists. Also provided concrete
-  recommendations.
-- **Claude Opus** found fewer assumptions (13) but several were qualitatively
-  different — they identified *design tensions* and *semantic inversions*
-  rather than just failure scenarios. The fail-open/fail-closed inversion
-  (item 12 in Opus's own list), the ETS ownership tension, and the "API
-  makes timezone coordination invisible" findings show reasoning about the
-  design's *relationship to its consumers* rather than just its internal
-  mechanics.
-  Tighter, more curated output with less filler.
-- **GPT-4.1** was competent and well-structured (14 assumptions, clean table)
-  but stayed within the document's own framing. Its unique findings are
-  relatively generic ("consumers should handle errors correctly," "no
-  historical data"). Solid baseline, no surprises.
-
-**Key insight — two reasoning models, different analytical styles:**
-GPT-5 and Opus are both reasoning models, but they reason about different
-things. GPT-5 reasons DEEPER into implementation mechanics (how does ETS
-actually work? what are the exact failure modes of each component?). Opus
-reasons WIDER about system context (how does this component's API contract
-affect the safety properties of the overall system? what tensions does this
-design create that aren't visible to the author?).
-
-GPT-5's approach: "Here are 24 things that could go wrong, many highly
-technical." Opus's approach: "Here are 13 assumptions, several of which
-reveal design tensions the document can't see about itself."
-
-**Does the reasoning gap narrow with simpler docs?**
-Comparing to Finding #10 (cold-start doc, 234 lines, 26 vs 14 vs 12 assumptions
-for GPT-5/GPT-4.1/Mini):
-- GPT-5 still dominates in raw count (24 vs 14 for GPT-4.1)
-- The gap ratio is similar (~1.7x here vs ~1.9x in Finding #10)
-- Document complexity doesn't appear to be the driver of the gap —
-  reasoning tokens enable more exhaustive exploration regardless of
-  input complexity
-
-**Claude Opus vs GPT-5 (the headline comparison):**
-They're not competing on the same axis. GPT-5 is better for "find all
-possible issues" (breadth + technical depth). Opus is better for "find
-the assumptions that will actually surprise the author" (insight density).
-If you want a security-audit-style exhaustive list: GPT-5. If you want a
-design-review-style "here's what you're not seeing about your own design":
-Opus. Both are better than GPT-4.1 for this task, but in different ways.
-
-**Practical implication:** Run BOTH reasoning models on architecture docs.
-GPT-5 catches implementation-level hazards the team might miss during
-coding. Opus catches design-level tensions the team might miss during
-planning. GPT-4.1 is sufficient as a quick sanity check but won't
-surprise you.
-
-### 12. Sonnet 4.6 outperforms expectations on assumption-finding; competes with reasoning models on complex docs
-
-**Date:** 2026-05-02
-**Task:** Identify hidden assumptions in gargoyle's `order-execution.md` (785 lines)
-— a complex, multi-component document covering OrderManager, BrokerAdapter,
-TradeStream, and PositionReconciler.
-**How we used them:** Same document (full text, no truncation) + same focused
-analytical question to all 3 models. GPT-5 via HAI OpenAI endpoint; Opus 4.6
-and Sonnet 4.6 via HAI Anthropic endpoint. No tools, no project context beyond
-the document itself. Single prompt, no conversation history.
-
-| Model | Time | Output tokens | Reasoning tokens | Assumptions found |
-|---|---|---|---|---|
-| GPT-5 | 93s | 8,485 | 6,016 | 20 |
-| Claude Sonnet 4.6 | 106s | 4,637 | (internal) | 17 |
-| Claude Opus 4.6 | 105s | 4,615 | (internal) | 12 |
-
-**What they found — common ground (all 3 identified):**
-- Synchronous broker REST calls blocking OrderManager GenServer (mailbox growth)
-- TradeStream event ordering assumptions (out-of-order fills/status)
-- Fill deduplication gap (no explicit fill-level idempotency)
-- `cancel_all/1` with `timeout: :infinity` blocking GenServer during FLATTEN
-- Recovery/restart races with TradeStream fill delivery (fills queued during
-  `handle_continue/2`)
-- Lot operation idempotency under crash recovery (partial execution)
-- Replace race: fills for new broker_order_id arriving before `replaced` event
-- Database write latency impact on GenServer throughput under burst fills
-- ETS table scope assumptions (single-node, access mode)
-
-**GPT-5 unique findings (not in either Claude model):**
-- Rate-limit retry blocking OrderManager inline (no async retry path specified)
-- Single TradeStream connection per user not enforced (duplicate detection gap)
-- Kill switch FLATTEN vs degraded state interaction (OM drops cancels while
-  degraded, but FLATTEN calls cancel_all through OM)
-- ClOrdID uniqueness scope/retention at broker across sessions and days
-- `after: datetime` filter semantics (clock skew, timezone, inclusive/exclusive)
-- Reconciliation responses may exceed single-response size (no pagination)
-- Event broadcasting blocking model (synchronous vs fire-and-forget)
-- Credential rotation during TradeStream connection lifetime
-- `market_closed` semantics varying across brokers (reject vs queue)
-- Dropped Alpaca statuses (stopped/suspended/calculated) may affect accounting
-
-**Claude Sonnet 4.6 unique findings (not in either other model):**
-- Single fill per fill event assumption (broker batching multiple fills into
-  one WebSocket message)
-- Lot operations (`Lots.open/2`, `Lots.close/4`) assumed to never fail —
-  no `{:error, _}` handling shown, crash propagation risk
-- `Task.async_stream` inside GenServer creating linked tasks whose crash
-  signals propagate to OrderManager during critical cancel_all
-- Broker cancel semantics during in-flight replace at the broker level
-  (cancel targets old broker_order_id which broker already replaced away)
-- Database operations in fill processing assumed transactional (no explicit
-  Ecto.Multi/transaction mention)
-- Broker position reflects only Gargoyle's activity (external trades cause
-  false-positive reconciliation halts)
-
-**Claude Opus 4.6 unique findings (not in either other model):**
-- `{:ok, broker_order_id}` from REST place conflated with durable OMS
-  acceptance vs mere HTTP acknowledgment (no timeout on `submitted` state)
-- Concurrent `apply_corrections/2` from periodic reconciler running in
-  separate process conflicts with OrderManager's single-writer invariant
-  (corrections write to same tables outside GenServer serialization)
-- Reconciliation gate initialized state after `:rest_for_one` restart —
-  ETS table EXISTS but freshly initialized vs table MISSING are different
-  conditions with different safety properties
-- Escalation state reset after crash creating double-exposure window
-  (systematic issue persists but escalation timer resets to zero)
-- `replace/3` error semantics: non-atomic replace (cancel + re-submit)
-  where cancel succeeds but re-submit fails leaves original order cancelled
-  at broker while OrderManager reverts to "working" locally
-
-**Quality assessment:**
-- **GPT-5** maintained its pattern from previous findings: broadest coverage
-  (20 assumptions), most technically specific about implementation details.
-  Found cross-cutting operational concerns (clock skew, credential rotation,
-  pagination) that the Claude models didn't surface. However, several of its
-  findings were medium-severity operational concerns rather than architectural
-  assumptions.
-- **Claude Sonnet 4.6** was the surprise performer. Found 17 assumptions —
-  close to GPT-5's count (85%) — and several of its unique findings were
-  genuinely insightful. The `cancel_all` race with broker-side replace state
-  (item 16 in Sonnet's own list) and the lot operation failure propagation
-  (item 6) show
-  deep reasoning about component interaction despite Sonnet not being
-  positioned as a "reasoning" model. More importantly, Sonnet's findings were
-  consistently well-structured with clear "how it could break" scenarios.
-- **Claude Opus 4.6** found the fewest assumptions (12) but — consistent with
-  Finding #11 — its unique findings were qualitatively different. The
-  concurrent `apply_corrections` write conflict, the gate initialization state
-  distinction, and the non-atomic replace error semantics all reveal design
-  tensions that neither GPT-5 nor Sonnet identified. Opus continues to reason
-  about the *boundaries between components* rather than within-component
-  mechanics.
-
-**Key insight — Sonnet 4.6 is NOT just a faster GPT-4.1:**
-In the previous assumption-finding experiments (#10, #11), non-reasoning
-models (GPT-4.1, GPT-4.1 Mini) performed significantly below reasoning models.
-GPT-4.1 found ~14 assumptions where GPT-5 found 24-26. Here, Sonnet 4.6
-finds 17 where GPT-5 finds 20 — a much smaller gap (~85% vs ~58% previously).
-
-Sonnet's findings also included several that showed genuine reasoning about
-component interactions (not just within-frame risks). This suggests Sonnet 4.6
-is qualitatively different from GPT-4.1 for analytical work — it occupies a
-middle ground between GPT-4.1's "competent but surface-level" and GPT-5's
-"exhaustive and deep." The severity distribution was also similar to GPT-5
-(multiple critical/high findings), whereas GPT-4.1 in previous experiments
-tended toward medium-severity generic concerns.
-
-**Updated model hierarchy for assumption-finding:**
-1. GPT-5 — broadest coverage, most operational-level findings (20)
-2. Sonnet 4.6 — strong analytical depth, good component interaction reasoning (17)
-3. Opus 4.6 — fewest but most architecturally insightful, finds design tensions (12)
-4. GPT-4.1 — competent within-frame, generic (~14 from previous experiments)
-5. GPT-4.1 Mini — formulaic, surface-level (~10-12)
-
-**Practical implication:** For architecture review, Sonnet 4.6 is now a strong
-candidate for volume analytical work. It's fast enough to run alongside GPT-5
-and catches different things (lot operation failures, broker-side replace races).
-The ideal three-model review stack for architecture docs appears to be:
-- GPT-5 for breadth + operational concerns
-- Sonnet 4.6 for component interaction analysis
-- Opus 4.6 for design-tension identification
-
-Each consistently finds things the others miss. The cost-efficiency argument
-for Sonnet is strong: 85% of GPT-5's count at about 55% of the output
-tokens, i.e. more findings per token generated (4,637 vs 8,485 tokens for
-17 vs 20 assumptions).
-
-### 13. Race condition identification: Opus excels at subtle temporal interactions; GPT-5 is exhaustive; Sonnet struggles with concurrency reasoning
-
-**Date:** 2026-05-03
-**Task:** Identify race conditions, timing-dependent bugs, and ordering hazards in
-gargoyle's `concurrent-failure-detection.md` (241 lines) — a document specifically
-about concurrent detection logic with timers, ETS state, and multi-process events.
-**How we used them:** Same document (full text) + same focused analytical question
-to all 3 models via HAI proxy. Prompt specifically asked for event ordering problems,
-timer interaction bugs, state visibility gaps, crash/restart timing, and multi-instance
-coordination. Required each finding to reference specific mechanisms in the document
-with specific interleaving descriptions. No tools, no project context beyond the
-document itself.
-
-| Model | Time | Output tokens | Reasoning tokens | Race conditions found |
-|---|---|---|---|---|
-| GPT-5 | 116s | 10,587 | 8,192 | 12 |
-| Claude Opus 4.6 | ~105s | 4,610 | (internal) | 10 |
-| Claude Sonnet 4.6 | ~39s | 1,404 | (internal) | 7 |
-
-**What they found — common ground (all 3 identified):**
-- Stale timer messages in mailbox after cancellation (classic Erlang timer race)
-- HealthMonitor crash losing compound detection state (init from :unknown, no replay)
-- ETS vs GenServer state divergence visible to dashboard
-- Kill switch mode conflict (FLATTEN from single-process vs HALT from compound path)
-
-**GPT-5 unique findings (not in either Claude model):**
-- Cross-sender message ordering: recovery events from pipeline processes vs timer
-  expiry from runtime (Erlang preserves per-sender order, NOT cross-sender) — the
-  "rapid recovery" safety argument in the doc relies on state being updated before
-  timer fires, which isn't guaranteed
-- Debounce starvation: a flapping component repeatedly restarts the timer, causing
-  compound evaluation to be postponed indefinitely while ≥2 components are
-  genuinely degraded
-- State regression: {:degraded} arriving after {:escalated, :kill_switch} with no
-  guard in the event table — state machine allows regressing from :halted to :degraded
-- Cold-start window: application boots with existing degraded processes that won't
-  re-emit events, compound detection never fires
-- Catch-all handle_info could accidentally swallow timer messages if pattern matching
-  is ordered wrong (implementation pitfall of the described approach)
-- Debounce window growing beyond calibrated bounds from repeated timer restarts
-
-**Claude Opus unique findings (not in either other model):**
-- Timer restart pushing evaluation PAST single-process escalation timeout — the
-  debounce mechanism can DEFEAT compound detection when second degradation arrives
-  near end of first window (resets to full window, first process escalates via
-  single-process path before new window fires). This means system gets FLATTEN
-  instead of HALT — exactly what compound detection was supposed to prevent.
-- Strategy worker single-atom masking via event ordering: Worker A degrades, Worker
-  B degrades (same atom), Worker A recovers → atom set to :normal while B is still
-  degraded. Event ordering across different workers mapped to same atom creates
-  state loss.
-- Registry stale PID after HealthMonitor crash: if subscription is PID-based (not
-  PubSub topic), new HM instance is deaf — events go to dead PID, silently dropped.
-  Compound detection completely disabled for that user until subscription refresh.
-- :rest_for_one cascade + coincidental independent issue: debounce designed to
-  filter cascade restarts, but cascade can EXPOSE a pre-existing issue (SignalRisk
-  restarts and finds its upstream stale). Is this "compound" or "cascade + bad luck"?
-  Semantic ambiguity the design doesn't address.
-- Compound cleared event without recovery debounce: :compound_degradation_cleared
-  emitted immediately when last process recovers (no settling period), causing
-  operator oscillation if recovery is transient.
-
-**Claude Sonnet unique findings:**
-- ETS table creation race at startup (HealthMonitor writes before table exists)
-- Registry lookup failure during pipeline startup (events before HM registered)
-- However, Sonnet also made analytical errors: it described "multiple HealthMonitor
-  instances for the same user" scenarios despite the document clearly stating one
-  instance per user via DynamicSupervisor. Several of its findings assumed
-  multi-instance coordination that doesn't match the architecture.
-
-**Quality assessment:**
-- **GPT-5** was the most exhaustive and technically precise. Its cross-sender
-  ordering finding (#2) is genuinely insightful — it identifies that the document's
-  "rapid recovery" safety argument implicitly assumes events arrive in wall-clock
-  order, which Erlang does NOT guarantee across different senders. The debounce
-  starvation finding (#3) identifies a real operational hazard with practical
-  consequences. All 12 findings reference specific mechanisms and describe specific
-  interleavings clearly.
-- **Claude Opus** found fewer race conditions but several were qualitatively
-  superior. The timer-restart-defeats-compound-detection finding is the most
-  architecturally significant race in the entire analysis — it shows that the
-  debounce mechanism can work AGAINST the design's stated goals in specific
-  (realistic) timing scenarios. The strategy-worker event ordering masking is
-  also a genuine design flaw unique to the single-atom decision. Opus continues
-  its pattern of reasoning about design TENSIONS rather than just failure modes.
-- **Claude Sonnet** was notably weaker here than in previous experiments. Only
-  1,404 output tokens vs 4,610 (Opus) and 10,587 (GPT-5). Several findings
-  contained analytical errors (assuming multi-instance coordination that doesn't
-  exist). It found only 7 races, and 2-3 of those were based on misreadings of
-  the architecture. This is a significant regression from Finding #12 where
-  Sonnet found 17 assumptions (85% of GPT-5's count).
-
-**Key insight — concurrency reasoning is a different skill than assumption-finding:**
-In Finding #12, Sonnet 4.6 performed well on assumption-finding (a task
-that requires reasoning about what's NOT stated).
-Here, on race condition identification (a task requiring reasoning about temporal
-interleavings and message ordering semantics), Sonnet drops significantly. This
-suggests the task type matters more than we previously thought:
-
-- **Assumption-finding:** Requires breadth of consideration ("what must be true
-  for this to work?"). Sonnet handles this well — it's essentially pattern
-  matching across possible failure dimensions.
-- **Race condition identification:** Requires SEQUENTIAL reasoning about specific
-  interleavings ("if A happens, then B happens, then C happens, what state is
-  visible?"). This benefits dramatically from extended reasoning tokens (GPT-5's
-  8,192 reasoning tokens) or from Opus's internal reasoning depth.
-
-The lesson: don't extrapolate model performance across task types. A model that
-reaches 85% of GPT-5's count on assumption-finding reached only 58% (7 of 12)
-on concurrency analysis, and less once its 2-3 misread findings are discounted.
-The cognitive demands are different.
-
-**Opus's distinguishing strength — finding design contradictions:**
-Opus's best finding (timer restart defeating compound detection) isn't just a
-race condition — it's identifying that the debounce mechanism can work against
-the design's own stated goals. This is consistent with Opus's pattern in
-previous findings: it finds tensions where one part of the design undermines
-another part. For race condition analysis specifically, this manifests as
-"here's where your safety mechanism becomes your vulnerability."
-
-**Practical implication for architecture review:**
-- For race condition analysis: use GPT-5 (exhaustive) + Opus (design-tension)
-- Sonnet is NOT suitable for concurrency reasoning tasks — use it for
-  assumption-finding and structural review instead
-- The three-model stack needs task-appropriate assignment:
-  - Structural/assumption review: all three models contribute
-  - Concurrency/race analysis: GPT-5 + Opus only
-  - Bias detection: any model (per Finding #8)
-
-### 14. Cross-component interaction analysis: GPT-5 Mini enters as viable mid-tier; task framing strongly shapes output quality
-
-**Date:** 2026-05-03
-**Task:** Identify cross-component interaction failures in gargoyle's
-`continuous-risk-monitoring.md` (459 lines) — a document specifying
-PortfolioMonitor's interactions with OrderManager, PortfolioRisk, MarketData,
-KillSwitch, ETS tables, and the pipeline supervision tree.
-**How we used them:** Same document (full text) + same focused analytical
-question to all 3 models via HAI proxy. Prompt was highly structured: specified
-5 categories of cross-component failures to look for (semantic mismatches,
-ordering violations, feedback loops, partial visibility, supervision boundary
-effects) and required specific output format (components, sequence, gap, impact).
-No tools, no project context beyond the document itself.
-
-| Model | Time | Output tokens | Reasoning tokens | Findings |
-|---|---|---|---|---|
-| GPT-5 Mini | 68s | 5,445 | 2,240 | 6 (+1 cut off) |
-| GPT-5 | 116s | 10,604 | 8,128 | 10 |
-| Claude Sonnet 4.6 | 38s | 1,868 | (internal) | 8 |
-
-**What they found — common ground (all 3 identified):**
-- Fill-to-position query race (fill event triggers evaluation but position
-  store hasn't yet reflected the fill)
-- Restrict flag ETS table destruction on PM crash → permissive window
-- Kill switch check vs liquidation submission race
-- Ticker subscription timing gap (new position opened but ticks not yet
-  subscribed → breach goes undetected)
-
-**GPT-5 unique findings (not in either other model):**
-- Stale prices are NOT fail-safe for drawdown (higher stale price → inflated
-  portfolio value → understated drawdown). The document claims "fail-safe"
-  but this only holds for exposure metrics, not drawdown. This is the most
-  architecturally significant finding across all three models.
-- Price definition mismatch between PM (last_trade from ETS) and OrderManager/
-  broker (bid/ask/mid) causing mis-sized liquidation and oscillation
-- Cross-component oscillation: PM hysteresis internal vs PRisk's immediate
-  binary restrict gate clearing (no cross-component cooldown)
-- Liquidation stuck after OM restart (terminal events lost;
-  liquidation_in_flight stays true indefinitely with no timeout/rehydration)
-- "Minimal risk checks" not enforced — PM goes through same OM gates as
-  strategy orders but MarketHours/StalePrice controls may reject after-hours
-  or stale-price liquidation attempts
-- FLATTEN mode semantics gap — PM refrains from liquidating when kill switch
-  engaged, but FLATTEN cancels open orders without actually CLOSING positions.
-  No component left to close positions.
-
-**Claude Sonnet 4.6 unique findings (not in either other model):**
-- Liquidation feedback loop with PortfolioRisk — buy-to-cover for short
-  positions could INCREASE net long exposure at portfolio level, paradoxically
-  worsening concentration while fixing position-level metrics
-- High water mark reset on pipeline restart masks true intraday drawdown
-  (restart → HWM resets to lower current value → drawdown calculated from
-  false baseline → larger losses permitted than intended)
-- Multi-metric breach with single boolean flag — concentration liquidation
-  for AAPL sets liquidation_in_flight, blocking simultaneous daily P&L
-  liquidation for different positions
-- Market close/open vs after-hours fills — claims to evaluate after-hours
-  fills but uses stale market-close prices
-
-**GPT-5 Mini unique findings (not in either other model):**
-- OrderManager order splitting/remapping causing liquidation_in_flight
-  correlation failure (parent/child order ID mapping breaks terminal-event
-  detection). Well-reasoned but highly implementation-specific.
-- Restrict/clear oscillation loop with strategy behavior (strategies react
-  to rejects → back off → restrict clears → strategies re-enter aggressively
-  → re-breach). Good systems-thinking about emergent feedback.
-
-**Quality assessment:**
-- **GPT-5** produced the most findings (10) and the highest-quality
-  architectural insight: the stale-price/drawdown contradiction is a genuine
-  design flaw that contradicts the document's own safety claim. Multiple
-  findings showed cross-boundary reasoning about semantic mismatches (price
-  definition, FLATTEN semantics, gate bypass). Every finding named specific
-  components and described precise event sequences.
-- **Claude Sonnet 4.6** was fast (38s, only 1,868 tokens) and produced 8
-  solid findings. The HWM reset finding and the multi-metric/single-flag
-  finding show genuine architectural reasoning. The liquidation feedback
-  loop (buy-to-cover worsening portfolio concentration) is subtle and
-  shows cross-position reasoning. However, some findings overlapped
-  significantly with the common-ground set and added less unique depth.
-  Sonnet performed MUCH better here than on race condition identification
-  (Finding #13): 8 findings to GPT-5's 10 here vs 7 to GPT-5's 12 previously.
-- **GPT-5 Mini** produced 6 findings in 68s with 2,240 reasoning tokens.
-  Quality was genuinely good — the order-splitting/correlation finding
-  and the oscillation feedback loop both show real reasoning depth. It's
-  clearly NOT GPT-4.1 Mini — it reasons about component interactions,
-  not just within-frame risks. However, it found fewer issues, and a
-  seventh finding was cut off (token limit or response truncation).
-
-**Key insight — task framing as the dominant variable:**
-This experiment used a much more structured prompt than previous ones:
-specified 5 categories, required specific output format, explicitly excluded
-single-component failures. The result: ALL models produced higher-quality,
-more focused output than in earlier experiments with broader prompts. Even
-Sonnet — which struggled on race conditions (Finding #13) — performed well
-here. The structured categories likely helped models organize their reasoning
-without losing track of what they were looking for.
-
-The prompt explicitly asked for "cross-component interaction failures" rather
-than general analysis. This is the narrow-lens effect from Finding #2, but
-applied to a complex multi-component document. The lens is narrow (only
-inter-component gaps) but the scope is broad (459 lines, many interactions).
-This combination — narrow analytical lens + broad document scope — appears
-to be the sweet spot for getting quality from all model tiers.
-
-**GPT-5 Mini positioning:**
-First time testing GPT-5 Mini. Results: 6 findings in 68s vs GPT-5's 10 in
-116s. That's 60% of the findings in 59% of the time, with 28% of the
-reasoning tokens (2,240 vs 8,128). Quality-per-finding was solid — the order
-correlation finding especially showed genuine systems reasoning. GPT-5 Mini
-appears to be a legitimate mid-tier: more capable than GPT-4.1 (which can't
-do this kind of cross-boundary reasoning) but less exhaustive than GPT-5.
-Viable for: first-pass screening, bulk document review where you'd run many
-docs and can't afford full GPT-5 on each.
-
-**Sonnet recovery from Finding #13:**
-Sonnet went from 7 findings (with errors) on race conditions to 8 solid
-findings here. The difference: this prompt was more structured, the document
-was larger with more explicit interaction descriptions, and the task didn't
-require pure temporal/sequential reasoning. "Cross-component interaction
-failures" is closer to assumption-finding (Sonnet's strength) than race
-condition identification (Sonnet's weakness). Task taxonomy continues to
-matter more than raw model capability.
-
-**Updated model assignment for cross-component analysis:**
-1. GPT-5 — broadest coverage, finds semantic contradictions in the doc's
-   own claims (10 findings)
-2. Sonnet 4.6 — fast and efficient, good at portfolio-level reasoning and
-   feedback loops (8 findings in 38s)
-3. GPT-5 Mini — viable first-pass with genuine reasoning depth (6 findings)
-4. (Opus untested for this task type — likely strong on design tensions)
-
-### 20. Invariant violation path analysis: GPT-5 is maximally selective (3 findings, all genuine); Opus shows unique self-correcting analytical style; new task type favors precision over exhaustiveness
-
-**Date:** 2026-05-04
-**Task:** Identify invariant violation paths in gargoyle's `user-pipeline-lifecycle.md`
-(730 lines) — sequences of legal operations that can violate the system's stated or
-implied invariants. NEW analytical lens not previously tested, distinct from assumption-
-finding, race conditions, or coherence checking.
-**How we used them:** Same document (full text) + same focused analytical question to all
-3 models via HAI proxy. Highly structured prompt specifying 5 categories of invariant
-violations (state machine escapes, invariant composition failures, monotonicity violations,
-idempotency boundary violations, authority inversion sequences). Required specific output
-format per finding. No tools, no project context beyond the document itself.
-
-| Model | Time | Output tokens | Reasoning tokens | Findings |
-|---|---|---|---|---|
-| GPT-5 | 143s | 784 | 12,032 | 3 |
-| Claude Opus 4.6 | 113s | 6,183 | (internal) | 7 (with 2 self-corrections) |
-| Claude Sonnet 4.6 | 23s | 1,266 | (internal) | 5 |
-
-**What they found — common ground (2+ models identified):**
-
-- **Periodic reconciliation overrides operator manual stop** (GPT-5 #3 + Opus #5 +
-  Sonnet #1): An admin who stops a pipeline via `stop_user/1` with `:admin_action`
-  has their decision overridden within 5 minutes by periodic reconciliation, because
-  there's no "admin stopped" state in `check_eligibility/1`. All three models
-  independently identified this as the clearest authority inversion.
-- **DynamicSupervisor restart bypasses eligibility gate** (Opus #1/#3 + Sonnet #2):
-  When `UserPipeline.Supervisor` crashes and is restarted by OTP supervision, the
-  restart bypasses `start_user/1` and `check_eligibility/1` entirely — potentially
-  resuming trading while the kill switch is engaged.
-- **Stale ReconciliationGate after crash** (Opus #7): After a crash-triggered
-  DynamicSupervisor restart (not via `stop_user/1`), the ReconciliationGate remains
-  `:ready` from the previous instance because `stop_user/1` (which resets it) was
-  never called. The new OrderManager may accept orders during its own reconciliation.
-- **HealthMonitor co-lifecycle violation** (Opus #2 + Sonnet #4): After a
-  DynamicSupervisor-initiated restart, the HealthMonitor is still subscribed to the
-  old PIDs — no code re-establishes monitoring for the new pipeline processes.
-
-**GPT-5 unique findings (not in either other model):**
-
-- **Kill switch bypass for users configured DURING engagement** (#1): A user who
-  saves credentials while the kill switch is engaged is never added to the pending
-  operator release set (only running pipelines are added at engage time). After
-  disengage, periodic reconciliation auto-starts this user's pipeline without
-  operator release — violating "resuming always requires human judgment." This is
-  the most precisely reasoned finding across all three models: each step is
-  individually correct per the spec, and the violation emerges purely from the
-  composition of legal operations.
-- **Premature release bypass** (#2): If `operator_release_user/1` is called while
-  the kill switch is still engaged (a legal operation), it clears the pending
-  release flag but `start_user/1` correctly refuses. After later disengage, the
-  flag is gone — auto-start proceeds without fresh operator judgment. The release
-  was "spent" at the wrong time.
-
-**Claude Opus unique findings (not in either other model):**
-
-- **`operator_release_system/0` clears unrelated safety obligations** (#4):
-  Operator intends to release one user from a recent event but
-  `operator_release_system/0` also releases other users still pending from an
-  earlier, unresolved event. One release call discharges multiple independent
-  safety obligations — monotonicity violation.
-- **State machine incompleteness for blocked users** (#6): Users who become
-  configured during kill switch engagement (blocked with reason
-  `:kill_switch_engaged`) have no state machine transition back to `starting`
-  after disengage — they're not in the pending release set, and no event fires.
-  System works via periodic reconciliation (up to 5 minutes delay), but the
-  documented state machine doesn't represent this path.
-- **Self-correcting analytical style:** Opus explicitly withdrew two draft
-  findings mid-analysis ("Actually, this sequence works as designed. Let me
-  identify a real violation instead." / "this is likely handled"). This
-  self-correction behavior was first observed in Finding #15 and is now
-  confirmed as a consistent Opus trait for invariant-style analysis.
-
-**Claude Sonnet unique findings (not in either other model):**
-
-- **Cold-start Tier 3 failure creates supervision restart loop** (#2): A
-  persistent Tier 3 failure (phantom fills) crashes OrderManager, `:rest_for_one`
-  kills the tree, DynamicSupervisor restarts it, cold-start fails again → infinite
-  loop. State machine shows `starting → stopped` but supervision creates
-  `starting → starting` indefinitely.
-- **HealthMonitor start failure during start_user** (#4): If HealthMonitor.Supervisor
-  is momentarily crashed when `start_user/1` runs step 4, the pipeline starts
-  without monitoring. No error handling specified for this partial-start state.
-
-**Quality assessment:**
-
-- **GPT-5** was MAXIMALLY SELECTIVE — only 3 findings from 12,032 reasoning tokens
-  (4,011 reasoning tokens per finding). This is the most extreme
-  reasoning-to-output ratio observed: 15:1 (12,032 reasoning / 784 output tokens).
-  For comparison, in previous experiments GPT-5 typically shows 1:1 to 2:1 ratios.
-  Every finding is a genuine invariant violation with a precise, step-by-step
-  sequence where each step is individually legal. ZERO false positives, zero
-  padding, zero "this might be an issue." GPT-5 appears to have used almost all
-  its reasoning budget for VERIFICATION — confirming that each candidate is
-  genuinely a violation before including it.
-- **Claude Opus** produced the most findings (7) with its characteristic depth and
-  self-correction. Two findings were revised mid-analysis, showing Opus actively
-  testing its own reasoning against the document before committing to a finding.
-  The DynamicSupervisor restart thread (findings #1, #2, #3, #7) forms a coherent
-  cluster — Opus identified one root cause (OTP restarts bypass the lifecycle
-  layer) and explored its multiple consequences. The `operator_release_system`
-  monotonicity finding (#4) is architecturally significant and unique.
-- **Claude Sonnet** was extremely fast (23s, 1,266 tokens) and produced 5 findings.
-  Quality was mixed: Finding #1 partially mirrors GPT-5's authority inversion but
-  with vaguer reasoning ("race condition with ETS operations" — not specified).
-  Finding #3 describes a contradiction but the scenario is internally inconsistent
-  (step 5 says "pipeline termination fails" but then step 7 says pipeline is still
-  running — this conflates two failure modes). Findings #2 and #4 are genuine and
-  well-reasoned. Sonnet's precision is lower than the other two on this task.
-
-**Key insight — "Invariant violation paths" as a task type:**
-
-This is a genuinely DIFFERENT analytical task from any previously tested. It requires:
-1. Identifying the invariants (explicit or implied)
-2. Constructing a sequence of operations (creative/generative)
-3. Verifying each step is legal per the spec (verification)
-4. Confirming the end state violates the invariant (correctness proof)
-
-This four-phase cognitive process explains GPT-5's extreme selectivity: step 2 is
-generative, but steps 3 and 4 are verification-heavy, and GPT-5's reasoning tokens
-appear to be spent there (confirming each step is genuinely legal and the final
-state genuinely violates the invariant). In previous tasks like "find hidden
-assumptions" or "find gaps," only identification is needed; there's no construction
-or verification phase.
-
-**Comparison to previous task types:**
-
-| Task type | GPT-5 findings | Opus findings | GPT-5 reasoning overhead |
-|---|---|---|---|
-| Hidden assumptions | 20-35 | 12-13 | 5-7K reasoning |
-| Race conditions | 12 | 10 | 8K reasoning |
-| Design coherence | 4 | 7 | 9K reasoning |
-| Invariant violation paths | 3 | 7 | **12K reasoning** |
-
-The pattern: as the task requires more VERIFICATION (vs identification), GPT-5 becomes
-more selective and spends more reasoning tokens per finding. Invariant violation paths
-demand the highest verification burden (every step must be confirmed legal), and GPT-5
-responds with the highest selectivity and reasoning investment.
-
-Opus inverts relative to GPT-5: it out-produces GPT-5 on verification-heavy tasks
-(7 vs 4 for coherence, 7 vs 3 for invariant paths) but trails it on identification
-tasks (12-13 vs 20-35 for assumptions). This suggests Opus uses its internal
-reasoning differently: it's more willing to present findings that have "likely"
-rather than "proven" violations, then self-corrects inline if the verification fails.
-
-**Practical implication:**
-
-For invariant violation path analysis:
-- **GPT-5** produces the highest-precision findings but very few. Every finding is a
-  genuine spec-level bug. Use when you need zero-false-positive bug reports to present
-  to a design team.
-- **Opus** produces more findings with slightly lower precision but unique analytical
-  depth. Its self-correction behavior means false positives are often caught inline.
-  Use when you want both confirmed violations AND identified tensions.
-- **Sonnet** is too imprecise for this task type — some findings have internal
-  inconsistencies. Use for lighter analytical tasks (assumption-finding, spec gaps).
-
-The three findings GPT-5 produced are ALL genuine design bugs that should be fixed:
-1. Users configured during kill switch engagement bypass operator release
-2. Premature operator release (while KS still engaged) creates future bypass
-3. Admin stops are overridden by periodic reconciliation
-
-These are the kind of findings that, in a real financial system, prevent production
-incidents. The 12K reasoning tokens to produce 3 perfect findings is excellent ROI.
-
-### 21. Reasoning effort (low/medium/high) has negligible effect on GPT-5's analytical output; the parameter may not work as documented for open-ended analysis
-
-**Date:** 2026-05-04
-**Task:** Identify hidden assumptions in gargoyle's `order-state-machine.md` (221 lines)
-— a well-structured state machine specification covering order lifecycle, fill precedence,
-TIF semantics, and parameter resolution.
-**How we used them:** Same document, same prompt, same model (GPT-5), same
-max_completion_tokens (16K). Only variable: `reasoning.effort` parameter set to
-"low", "medium", or "high". Run sequentially via HAI proxy (OpenAI-compatible
-endpoint). No tools, no project context beyond the document.
-
-| Effort | Time (ms) | Output tokens | Reasoning tokens | Findings |
-|---|---|---|---|---|
-| Low | 97,913 | 7,657 | 4,288 | 33 (+11 recs) |
-| Medium | 94,824 | 7,112 | 4,160 | 30 |
-| High | 88,607 | 6,891 | 3,712 | 30 |
-
-**The counterintuitive result:** Higher reasoning effort produced FEWER findings,
-FEWER reasoning tokens, FEWER output tokens, and completed FASTER. The expected
-pattern (high effort → more reasoning → more depth) was inverted.
-
-**Per-finding metrics (remarkably consistent):**
-
-| Effort | Output tokens/finding | Reasoning tokens/finding |
-|---|---|---|
-| Low | 232 | 130 |
-| Medium | 237 | 139 |
-| High | 230 | 124 |
-
-The depth per finding was nearly identical across all three levels. The model
-didn't get more detailed or rigorous per finding at higher effort; it just
-found slightly fewer things.
-
-**Severity distributions (similar across all three):**
-- Low: 7 Critical, 21 High, 5 Medium (33 findings)
-- Medium: 9 Critical, 15 High, 4 Medium + 2 borderline (30 findings)
-- High: 6 Critical, 14 High, 5 Medium + 4 borderline (30 findings)
-
-**Qualitative differences — WHAT they found:**
-
-High-effort unique findings (not in low):
-- Single-writer authority to broker (no out-of-band modifications)
-- Broker emits fills for all executed quantities (no silent netting)
-- Instrument identity remains stable across corporate actions
-- Late-fill override won't violate downstream invariants
-- Validation covers lot sizes, price ticks, borrow/locate constraints
-- Multiple accounts and venues are part of the correlation key
-- Streaming and polling APIs are consistent
-- System can handle multi-leg instruments
-
-Low-effort unique findings (not in high):
-- Acks arrive before fills (no pre-ack fills)
-- Cancel-before-ack handling (submitted → cancelled missing)
-- Fill totals never exceed requested quantity
-- Deterministic ordering within a broker stream
-- Exercise/assignment and non-order position changes
-- Client-side idempotency of "place order"
-- Partial accept/normalize on replace
-- No "child" order fragmentation at broker
-- Submitted state can receive terminal events
-- Late cancel vs local expired mismatch
-
-**Character of the differences:**
-- HIGH-unique findings tend to be more **architectural/systemic** (multi-leg
-  instruments, streaming vs polling consistency, downstream invariant violations,
-  corporate actions). These require reasoning about the system's relationship
-  to the broader world.
-- LOW-unique findings tend to be more **implementation-specific edge cases**
-  (cancel-before-ack, pre-ack fills, child order fragmentation, partial accepts).
-  These require reasoning about specific event interleavings and protocol details.
-
-Both sets are valid and actionable. Neither is clearly "better." They represent
-different analytical modes — breadth-of-scope (high) vs depth-of-protocol (low).
-
-**Key insight — reasoning_effort doesn't scale analysis linearly:**
-
-Three possible explanations for the inverted behavior:
-
-1. **GPT-5 already uses near-maximum reasoning for analytical tasks regardless
-   of the effort parameter.** The ~4K reasoning tokens across all three levels
-   (4288/4160/3712) are too similar to reflect a genuine effort gradient. The
-   parameter may primarily affect OTHER task types (math, code, logic puzzles)
-   where reasoning depth is more variable.
-
-2. **Higher effort increases FILTERING, not exploration.** At high effort, GPT-5
-   may spend more of its reasoning on VERIFYING whether findings are genuine
-   before including them — similar to the extreme selectivity observed in
-   Finding #20 (invariant violation paths, 12K reasoning for 3 findings). This
-   would explain fewer findings despite theoretically "trying harder."
-
-3. **The parameter has minimal practical effect for this model version.**
-   The differences (33 vs 30 vs 30) may be within normal stochastic variation;
-   repeated runs at the same effort level might show similar variance.
-
-**The prompt cache hit on HIGH (2304 cached prompt tokens) may have slightly
-accelerated processing, but doesn't explain the reasoning token difference.**
-
-**Comparison to previous findings:**
-In Finding #20 (invariant violation paths), GPT-5 used 12,032 reasoning tokens
-for 3 findings, an extreme case of verification behavior. Here, on a different
-task type (hidden assumptions), it used ~4K reasoning tokens for ~30 findings
-at every effort level.
-This confirms that TASK TYPE is a far stronger predictor of GPT-5's reasoning
-behavior than the reasoning_effort parameter. The invariant violation prompt
-triggered deep verification; the assumption-finding prompt triggers broad
-exploration regardless of effort setting.
-
-**Practical implication:**
-For open-ended analytical tasks (assumption-finding, gap analysis, spec review),
-the reasoning_effort parameter appears to have negligible practical effect on
-GPT-5. Don't bother tuning it for these tasks — the default is fine. The
-parameter may be more meaningful for:
-- Tasks with verifiable correct answers (math, logic)
-- Tasks where the model could short-circuit (simple questions)
-- Extremely long documents where exploration budget matters
-
-For architecture review specifically: reasoning_effort is NOT a useful lever.
-Task framing (the prompt structure) and document selection remain the dominant
-variables for output quality. Save reasoning_effort tuning for coding/math tasks
-where the parameter was likely trained and evaluated.
-
-**Open question:** Would running the same experiment 5x at each level show that
-the 33-vs-30 difference is within stochastic noise? If so, reasoning_effort is
-effectively a no-op for analytical prompts. If not, low-effort consistently
-produces more (less filtered) output, which could be useful for brainstorming-
-style analysis where you want maximum coverage before manual triage.
-
-### 27. Design coherence on risk-controls.md: GPT-5 regains top position; Opus's advantage from Finding #15 was document-specific, not task-specific
-
-**Date:** 2026-05-05
-**Task:** Identify internal design incoherences in gargoyle's `risk-controls.md` (277 lines)
-— a pre-trade risk control specification covering two evaluation stages, reduction semantics,
-ordering rationale, fail-closed claims, and audit logging.
-**How we used them:** Same document (full text) + same focused analytical question to all
-3 models via HAI proxy. Highly structured prompt specifying 5 categories of incoherence
-(safety properties not enforced, ordering/sequencing contradictions, reduction semantics
-conflicts, fail-closed claims vs actual behavior, cross-stage inconsistencies). Required
-each finding to reference specific contradictory parts. No tools, no project context beyond
-the document itself.
-
-| Model | Time | Output tokens | Reasoning tokens | Incoherences found | Critical | High | Medium |
-|---|---|---|---|---|---|---|---|
-| GPT-5 | 112s | 8,231 | 7,232 | 6 | 1 | 3 | 2 |
-| Claude Opus 4.6 | 41s | 1,858 | (internal) | 5 | 2 | 2 | 1 |
-| Claude Sonnet 4.6 | 15s | 699 | (internal) | 4 | 1 | 2 | 1 |
-
-**What they found — common ground (all 3 identified):**
-- Reduction re-entry at BuyingPower contradicts "reducing controls never re-enter
-  earlier controls" (all three flagged this as the most obvious contradiction —
-  Concentration at position 5 reduces, re-enters at BuyingPower at position 4,
-  which IS an earlier control)
-- Ordering rationale's categorization of buying power/concentration is internally
-  confused (the doc labels both as "quantity-sensitive checks" that run after
-  reducing controls, but concentration IS a reducing control at position 5 while
-  buying power at position 4 sits between the two reducing controls)
-
-**GPT-5 unique findings (not in either Claude model):**
-- Signal-level "no portfolio context" contradicts NoShortSales requiring knowledge
-  of current positions. The doc explicitly states signals are evaluated "in isolation"
-  with "no portfolio context — only the signal itself and user settings" — but checking
-  whether the user holds a position IS portfolio context. This is a genuine design
-  tension: either SignalRisk has hidden portfolio access (violating isolation) or
-  NoShortSales can't actually work as specified.
-- Settings "fall through to system defaults" vs "Settings cache miss → reject."
-  Two incompatible instructions for the same condition (missing settings).
-- "Universal fail-closed" with "only exception is order rate window" contradicted
-  by Failure Modes table showing buying power as another exception ("Conservative
-  estimate; may over-reject" is NOT rejection — it's a different failure mode than
-  either fail-closed or the documented single exception).
-- Audit model says "every control evaluation produces an audit entry regardless of
-  outcome" but the signal-stage write point only describes writing on rejection.
-  Passing signals produce no documented audit entry at the signal stage.
-
-**Claude Opus unique findings (not in either other model):**
-- Signal flow diagram swaps control order vs table: table shows (1) MarketHours,
-  (2) PerTradeStop, (3) NoShortSales, but diagram flows MarketHours → NoShortSales
-  → PerTradeStopLoss. Controls 2 and 3 are reversed between the two representations.
-  (VERIFIED: this is correct — the diagram does show a different order.)
-- Concentration re-entry loop can bypass Order Rate, Duplicate, Self-Trade, and
-  Fat Finger entirely during intermediate iterations. Also: Position Size at order 3
-  is never re-checked against Concentration-reduced quantity because re-entry starts
-  at BuyingPower (order 4), meaning "smaller quantity wins" semantics are implemented
-  differently than the linear model described in Reduction Semantics.
-
-**Claude Sonnet unique findings (not in either other model):**
-- Buying Power "Can reduce? No" creates a dead end: if a reduced quantity still
-  exceeds buying power, the system can only reject entirely (no mechanism to further
-  optimize), defeating the purpose of the reduction system for capital-limited users.
-  (NOTE: this is more of a design limitation than a self-contradiction, but the
-  framing — that the reduction system's purpose is undermined by buying power's
-  inability to reduce — is a legitimate coherence observation.)
-
-**Quality assessment:**
-- **GPT-5** produced the most findings (6) with the broadest coverage across the
-  prompt's 5 categories. The NoShortSales/portfolio-context finding is the most
-  genuinely insightful — it's a fundamental design-level contradiction (a signal-level
-  control that REQUIRES decision-level context). The settings contradiction and
-  audit logging inconsistency are also solid. Every finding points to two specific
-  textual statements that are incompatible. Severity ratings were calibrated (1
-  Critical, 3 High, 2 Medium — compared to Opus's 2 Critical for similar findings).
-- **Claude Opus** was remarkably fast (41s, 1,858 tokens) and found one thing
-  neither other model caught: the diagram/table order reversal for signal controls.
-  This is a concrete, verifiable error (not a design tension — a literal mistake in
-  the document). The re-entry loop analysis (finding #5) goes deeper than GPT-5's
-  version of the same core issue, exploring the implications for "smaller quantity
-  wins" semantics. However, Opus found fewer total issues and missed the
-  settings contradiction and audit logging inconsistency.
-- **Claude Sonnet** was the fastest (15s, 699 tokens) and found 4 issues. The buying
-  power dead-end observation is unique and shows genuine reasoning about the reduction
-  system's limitations. However, it's more of a "this design can't achieve its stated
-  goal" than a strict self-contradiction. Sonnet's other findings overlap with the
-  common ground. Quality is solid but narrower scope.
-
-**Key insight — Finding #15's Opus > GPT-5 result was document-specific:**
-In Finding #15 (coherence checking on failure-modes.md), Opus found 7 incoherences
-vs GPT-5's 4. Here, on risk-controls.md, GPT-5 found 6 vs Opus's 5. The reversal
-suggests that the relative performance on coherence checking depends on the
-DOCUMENT'S structure, not on a fixed model advantage:
-
-- **failure-modes.md** (383 lines): A complex multi-process system with many
-  stated invariants across failure states, supervision trees, and recovery paths.
-  Rich in design TENSIONS where one subsystem's safety mechanism undermines another.
-  This plays to Opus's strength (finding design tensions between subsystems).
-- **risk-controls.md** (277 lines): A more focused specification with explicit rules,
-  ordering constraints, and behavior tables. Rich in SPECIFICATION CONTRADICTIONS
-  where one statement directly conflicts with another. This plays to GPT-5's
-  strength (systematic verification of claims against stated mechanisms).
-
-The difference: Opus excels when contradictions are EMERGENT (arise from composing
-multiple design decisions). GPT-5 excels when contradictions are EXPLICIT (two
-statements in the document say incompatible things). Risk-controls.md has more
-explicit contradictions (the settings fallback vs fail-closed, the "no portfolio
-context" vs NoShortSales, the audit "always" vs write point "only on reject").
-
-**Model performance depends on CONTRADICTION TYPE:**
-| Contradiction type | Best model | Example |
-|---|---|---|
-| Emergent/compositional | Opus | "Rest-for-one cascade creates a 5th state" |
-| Explicit/definitional | GPT-5 | "No portfolio context" but check requires portfolio |
-| Diagrammatic/structural | Opus | Table order ≠ diagram order |
-| Semantic/category confusion | All (common ground) | Reduction re-entry violates ordering claims |
-
-**Revised conclusion on Finding #15's open question:**
-"Does Opus > GPT-5 ordering for coherence checking hold across other documents?"
-**No.** The ordering depends on the document's contradiction density and type.
-Documents rich in emergent design tensions favor Opus. Documents with explicit
-specification errors favor GPT-5. The task type (coherence checking) doesn't have
-a fixed model winner — it depends on what KIND of incoherences the document contains.
-
-**Practical implication:** Continue running both models for coherence checking. Their
-strengths are complementary even within the same task type. GPT-5 catches things you
-can point to in the spec and say "these two sentences conflict." Opus catches things
-where you need to reason about the implications of multiple mechanisms interacting.
-
-## Open Questions
-
-- Does GPT's advantage in finding inconsistencies extend to logical
-  inconsistencies in arguments? One data point (verdict mismatches) — need more.
-- What's the optimal task granularity for GPT analytical review? "Whole PR" is
-  too big. Is "one hypothesis" right, or can we batch?
-- ~~Is the GPT-4.1 Mini bias detection result repeatable, or was it a
-  well-structured task that any model would ace?~~ **ANSWERED (Finding #8):**
-  Any model aces it when the biased text is presented without noise. The original
-  result was about noise elimination, not model capability.
-- **NEW:** Does adding a narrow bias-check question to a rich PR review
-  context recover the detection that broad review misses? (Signal-to-noise
-  confirmation test)
-- ~~How does reasoning_effort affect analytical quality? Only tested default so
-  far.~~ **ANSWERED (Finding #21):** Negligible effect on GPT-5 for open-ended
-  analytical tasks. Low/medium/high produced 33/30/30 findings with nearly
-  identical reasoning tokens (~4K) and per-finding depth. The parameter
-  may primarily affect verifiable-answer tasks, not exploration. Task framing
-  remains the dominant quality lever.
-- Can we design a systematic "analytical review checklist" that leverages each
-  model's strengths?
-- ~~What analytical tasks is Opus best at vs Sonnet? Finding #11 shows Opus
-  excels at design-tension identification. How does Sonnet compare on the
-  same task? (Sonnet is non-reasoning but fast — would it match GPT-4.1?)~~
-  **ANSWERED (Finding #12):** Sonnet 4.6 significantly outperforms GPT-4.1
-  (17 vs ~14 assumptions) and approaches GPT-5 (17 vs 20). It's not a
-  non-reasoning model in the GPT-4.1 sense — it occupies a middle tier with
-  genuine component-interaction reasoning. Opus still wins on design-tension
-  identification specifically.
-- How do the models compare on research synthesis tasks (our #381 rewrite)?
-  We'll find out during the actual rewrite.
-- ~~Does the reasoning-token advantage scale with document complexity? Test
-  with a simpler doc to see if the gap narrows.~~ **ANSWERED (Finding #11):**
-  The gap doesn't narrow with simpler docs. GPT-5 maintains ~1.7x the findings
-  of GPT-4.1 regardless of document complexity. Reasoning tokens enable
-  exhaustive exploration independent of input difficulty.
-- ~~Would Claude Opus (also a reasoning model) match GPT-5's assumption-finding
-  performance, or does it have different blind spots?~~ **ANSWERED (Finding #11):**
-  Different blind spots, different strengths. GPT-5 reasons deeper into
-  implementation mechanics (breadth + technical depth). Opus reasons wider
-  about system context and design tensions (insight density). They're
-  complementary, not competing. Run both on important architecture docs.
-- ~~Does Sonnet 4.6's strong showing hold across other analytical tasks
-  (bias detection, gap-finding) or is it specific to assumption-finding on
-  complex documents? Need to test Sonnet on simpler docs and different question
-  types.~~ **PARTIALLY ANSWERED (Finding #13):** Sonnet's strength does NOT
-  transfer to concurrency reasoning. It dropped from 85% of GPT-5's finding count
-  (assumption-finding) to ~58% (race condition identification). Task type matters more
-  than we thought. Still untested: gap-finding, bias detection for Sonnet.
-- **NEW:** What other analytical tasks require sequential/temporal reasoning
-  (like race condition identification) vs pattern-matching reasoning (like
-  assumption-finding)? Building a task taxonomy would help assign models
-  correctly.
-- **NEW:** What explains Sonnet taking slightly longer than Opus here (106s vs
-  105s) despite normally being the faster model? Is it the document length, or
-  does Sonnet's internal reasoning scale with complexity similarly to Opus?
-- ~~How does GPT-5 Mini compare to GPT-5 on analytical tasks? Is it a viable
-  cheaper substitute?~~ **ANSWERED (Finding #14):** GPT-5 Mini is a viable
-  middle option. Finds fewer issues (6 vs 10) but with genuine reasoning
-  depth at ~50% cost/time. Better than non-reasoning models, not as
-  exhaustive as GPT-5.
-- **NEW:** How does Claude 4.5 Opus/Sonnet compare to Claude 4.6? HAI now
-  exposes both; worth testing whether the newer versions regress on
-  analytical tasks.
-- ~~Would running GPT-5 Mini + Sonnet together (different axes)
-  approach GPT-5's coverage at lower combined cost?~~ **ANSWERED (Finding #19):**
-  71% coverage at 31% cost. Good for low-stakes work; GPT-5 irreplaceable for
-  high-stakes due to unique domain-knowledge findings in the missing 29%.
-- ~~**NEW (Finding #15):** Does the Opus > GPT-5 ordering for coherence checking
-  hold across other documents? The inversion (Opus finding more than GPT-5)
-  was striking — need to confirm it wasn't document-specific.~~
-  **ANSWERED (Finding #27):** No — it was document-specific. On risk-controls.md,
-  GPT-5 found 6 vs Opus's 5. The winner depends on contradiction TYPE: Opus
-  excels at emergent/compositional contradictions, GPT-5 at explicit/definitional
-  ones. No fixed ordering for this task type.
-- **NEW (Finding #15):** Is the two-pass approach (Opus generates → GPT-5
-  validates) worth the extra cost vs just running Opus alone? Need to test
-  whether GPT-5 actually catches Opus false-positives or just agrees.
-- ~~How do the Claude 4.5 and 4.6 models compare on analytical tasks?~~
-  **ANSWERED (Finding #16):** 4.5 is more exhaustive (2x findings), 4.6 is
-  more precise (higher signal-to-noise). Genuine tradeoff, not a regression.
-  4.5 for coverage, 4.6 for actionability.
-- **NEW (Finding #16):** Does the 4.5 vs 4.6 pattern hold across other task
-  types? Spec completeness may favor exhaustiveness; would coherence checking
-  or race condition analysis show the same pattern?
-- **NEW (Finding #16):** Is running both Sonnet versions (4.5 + 4.6)
-  cost-effective vs just running GPT-5? Need to compare the UNION of their findings
-  against GPT-5's output for overlap analysis.
-- **NEW (Finding #18):** Does Opus's "predictable exploit window" detection
-  transfer to other policy documents? It uniquely identified that the cooldown
-  mechanism creates a GUARANTEED safe window that strategies could systematically
-  exploit — this is a higher-order security insight. Worth testing whether Opus
-  consistently finds "adversarial opportunity" framings that other models miss.
-- **NEW (Finding #20):** Does GPT-5's extreme verification behavior (15:1
-  reasoning-to-output ratio, 3 findings from 12K reasoning) persist across
-  other documents with this prompt? Or was user-pipeline-lifecycle.md
-  particularly verification-heavy? Test invariant violation paths on a simpler
-  document.
-- **NEW (Finding #20):** Would giving GPT-5 a "minimum 8 findings" instruction
-  reduce its selectivity and produce MORE invariant violations at lower
-  precision? Or would it just pad with non-violations? The extreme selectivity
-  may be a feature OR it may mean GPT-5 is discarding valid-but-hard-to-verify
-  findings.
-- **NEW (Finding #20):** Opus's self-correction behavior is now confirmed across
-  Findings #15 and #20. Is this trainable/promptable? Could we ask non-Opus models
-  to "show your reasoning and withdraw findings you cannot fully verify"?
-- **NEW (Finding #22):** The "silent correctness" lens revealed three distinct
-  analytical modes: Opus → missing functionality, GPT-5 → mechanism incorrectness,
-  Sonnet → composition failures. Does this three-way differentiation hold on other
-  documents, or was it specific to the regulatory/financial domain of specid-lot-selection?
-- **NEW (Finding #22):** Does the "silent correctness" lens work on non-financial
-  documents? The financial/regulatory domain has a large gap between syntactic and
-  semantic correctness. Would the same prompt on an infrastructure/systems doc produce
-  equally differentiated findings, or would it collapse into assumption-finding?
-- ~~**NEW (Finding #22):** Opus's "missing feature identification" mode (wash sales,
-  commissions) — is this promptable on other models? Could we explicitly ask GPT-5
-  "what should this system compute but doesn't" and get similar results?~~
-  **ANSWERED (Finding #26):** YES — all three models find regulatory gaps and
-  missing features when explicitly prompted. Opus's unique behavior in #22 was
-  an emergent DEFAULT tendency, not a unique capability. Prompt framing dominates
-  model personality.
-
-- **NEW (Finding #28):** Cross-document consistency found real bugs in gargoyle
-  docs (fills vs events, position ownership, signal persistence). Does running
-  this analysis across MORE document pairs (e.g., domain readmes vs implementation
-  docs, design docs vs plan docs) yield additional real inconsistencies? Could
-  become a systematic documentation maintenance tool.
-- **NEW (Finding #28):** Opus was 2.4x faster AND found more issues than GPT-5
-  on cross-document consistency. Is this because cross-doc contradictions are
-  easy to verify once spotted (reducing GPT-5's verification advantage)? Or
-  because boundary reasoning (Opus's strength) is the primary skill needed?
-
-## Methodology Notes
-
-- Internet opinions about models are overwhelmingly about coding. Don't
-  extrapolate to analytical work without testing.
-- "Just because someone says it on the internet doesn't make it right." —
-  Aaron, 2026-04-26. Opinions need context. Track our own evidence.
-- Absence of published methodology for a use case is itself a finding.
-- Each finding needs: date, task, **how we used it** (context shape, task
-  framing, what info the model had/didn't have), what happened, takeaway.
-  No unsupported generalizations.
-- **Context dimensions to track:**
-  - Rich vs minimal (how much background info)
-  - Broad vs focused ("review this" vs "answer this specific question")
-  - What kind of context (diff, full files, issue text, research notes,
-    project conventions, nothing)
-  - Whether the model had access to tools or just text
-  - Whether the task was explicit step-by-step or open-ended
-# Design Coherence Analysis — Finding #15
-
-**Date:** 2026-05-03
-**Task:** Identify internal design incoherences in gargoyle's `failure-modes.md` (383 lines)
-— places where the document's stated principles/invariants are contradicted by its own
-specified mechanisms.
-**How we used them:** Same document (full text) + same focused analytical question to all
-3 models via HAI proxy. Highly structured prompt specifying 5 categories of incoherence
-to look for (safety properties not enforced, state machine violations, recovery contradictions,
-supervision conflicts, cross-mechanism contradictions). Required each finding to reference
-specific sections. No tools, no project context beyond the document itself.
-
-| Model | Time | Output tokens | Reasoning tokens | Incoherences found |
-|---|---|---|---|---|
-| Claude Sonnet 4.6 | ~39s | 1,168 | (internal) | 5 |
-| Claude Opus 4.6 | ~105s | 3,378 | (internal) | 7 (8 attempted, 1 self-withdrawn) |
-| GPT-5 | ~120s | 10,235 | 9,088 | 4 |
-
-**What they found — common ground (all 3 identified):**
-- State machine universality claim vs Strategy.Worker crash behavior (process
-  crashes bypass the degraded state entirely — no transition path in the model)
-- Market data staleness advisory-only vs the "don't trade when ambiguous" principle
-  (or vs concurrent failure auto-halt)
-- `pending_cancel`/`pending_replace` absent from recovery query set (GPT-5 and
-  Sonnet found this directly; Opus addressed the broader state machine gap)
-
-**GPT-5 unique findings (not in either Claude model):**
-- Kill switch halted = "process terminated" vs kill switch requiring RUNNING
-  processes for cancel_all, FLATTEN, and LIQUIDATE modes. The state definition
-  claims processes are terminated, but the mechanisms require them alive to
-  execute orders. **This is the most architecturally significant finding** — it
-  reveals a fundamental definitional error in the state machine.
-- Per-symbol degradation contradicts the process-level degradation semantics.
-  A worker "enters degraded" but continues operating for non-stale symbols —
-  violating the stated definition that degraded = "cannot perform primary
-  function." The metrics/eventing model has no per-symbol dimension.
-
-**Claude Opus unique findings (not in either other model):**
-- `:rest_for_one` cascade creates a FIFTH implicit state
-  (terminated-and-restarting) not in the four-state model: processes that were `normal` are
-  forcibly killed (not by kill switch) and restart. Self-corrected one finding
-  that initially looked like incoherence but was actually consistent.
-- PortfolioMonitor continues evaluating with stale data ("fail-safe") while
-  Strategy.Workers are stopped for the SAME condition — contradicts both the
-  universal state machine (PM doesn't transition to degraded) and the doc's
-  reasoning about why stale data is dangerous.
-- Path-dependent recovery criteria: Strategy.Worker recovery requires 21+ bars
-  after crash but only "price continuity check" after staleness. The state
-  machine's single "catch-up complete" exit condition can't express this.
-- `halted → [*]` transition in state diagram is logically impossible if "halted"
-  means the process is already terminated — dead processes can't fire transitions.
-- Compound failure detection requires a meta-observer across processes but the
-  per-process state machine model has no way to express cross-process conditions.
-
-**Claude Sonnet unique findings (not in either other model):**
-- Market data global staleness: the failure table says "Manual (disengage)" for
-  recovery — implying automatic engagement happened — but the text says it's
-  advisory only. Table contradicts prose.
-- ReconciliationGate: doc claims gate survives OM crash (separate supervision
-  tree), but then says "missing ETS table = not ready" when OM crashes. If the
-  gate survives, why would its table be missing?
-- Signal survival claims are contradictory between sections: worker crash says
-  downstream signals survive, but OM crash says all upstream signals lost.
-  (NOTE: this is actually describing different scenarios — worker crash doesn't
-  cascade to SignalRisk, OM crash does via :rest_for_one. Sonnet may have
-  misread the architecture here — the two statements are consistent when you
-  understand the supervision tree.)
-
-**Quality assessment:**
-- **GPT-5** found only 4 incoherences but TWO of them are genuinely critical
-  architectural findings. The "halted = terminated" vs "kill switch requires
-  running processes" contradiction is a real design error — you can't both
-  terminate processes AND require them to execute cancel/liquidation orders.
-  The per-symbol degradation finding is also a real modeling gap. GPT-5 was
-  MORE SELECTIVE here than in previous experiments — it didn't pad with
-  medium-severity findings. Each of its 4 was high/critical.
-- **Claude Opus** produced the most findings (7 valid) with characteristic
-  depth. Its self-correction (withdrawing finding #6 after deeper analysis)
-  shows intellectual honesty rare in model outputs. The PortfolioMonitor
-  stale-data contradiction is genuinely insightful — same input condition,
-  opposite response, no justification within the state machine model. The
-  compound failure meta-observer finding identifies a modeling category error.
-  Opus also found modeling imprecisions (path-dependent recovery, halted → [*]
-  impossibility) that the other models didn't notice.
-- **Claude Sonnet** found 5 issues quickly (39s, 1,168 tokens) but quality was
-  mixed. Finding #4 (ReconciliationGate) raises a genuine question about
-  the ETS table ownership claim. Finding #1 (table vs prose contradiction on
-  market data staleness) is a real documentation inconsistency. However,
-  Finding #5 appears to misread the supervision architecture — the two
-  statements about signal survival ARE consistent when you understand that
-  different crashes cascade differently. Sonnet produced one false positive.
-
-**Key insight — "design coherence" is a NEW analytical category with distinct model strengths:**
-This is different from assumption-finding (Findings #10-12), race conditions
-(Finding #13), and cross-component interactions (Finding #14). Coherence
-checking requires the model to hold MULTIPLE parts of the document in tension
-with each other and reason about whether they're compatible. Results:
-
-- **GPT-5** was MORE SELECTIVE than in any previous experiment. Only 4 findings
-  vs 10-24 in other tasks. But precision was near-perfect — all 4 are genuine
-  contradictions. This suggests GPT-5's reasoning tokens are being used for
-  VERIFICATION (checking whether apparent contradictions hold up) rather than
-  EXPLORATION (finding more things). The 9K reasoning tokens produced 4 findings
-  vs the usual 10+ — GPT-5 is self-editing aggressively.
-- **Opus** hit its sweet spot. Coherence checking IS design-tension identification
-  — Opus's consistent strength. Finding incoherences requires exactly the kind
-  of "how does this design disagree with itself" reasoning that Opus excels at.
-  It also showed unique self-correction behavior (withdrawing a finding after
-  deeper analysis).
-- **Sonnet** was fast but produced a false positive. Coherence checking requires
-  holding multiple document sections in memory simultaneously and reasoning about
-  their compatibility — this is harder than assumption-finding (where you
-  reason about one mechanism at a time) but easier than race conditions (which
-  require sequential temporal reasoning). Sonnet occupies a middle ground.
-
-**Model ranking for design coherence checking:**
-1. Claude Opus 4.6 — most findings, highest depth, self-correcting (7 valid)
-2. GPT-5 — fewest findings but near-perfect precision, finds the critical ones (4)
-3. Claude Sonnet 4.6 — fast screening, but prone to false positives on
-   architectural misreads (4/5 valid)
-
-**This inverts the usual GPT-5 > Opus ordering.** In previous experiments, GPT-5
-consistently found MORE issues. Here, GPT-5 was more selective than Opus. The
-task type (self-consistency checking) favors Opus's "design tension" reasoning
-style over GPT-5's "exhaustive exploration" style. GPT-5 apparently uses its
-reasoning to VERIFY rather than GENERATE when the task is about contradictions
-rather than gaps.
-
-**Practical implication:** For architecture documents, run coherence checking as
-a separate pass using Opus as the primary model. GPT-5's higher precision means
-it's good for confirming which Opus findings are genuine vs overreads. The
-two-pass approach: Opus generates candidates → GPT-5 validates → result is the
-intersection plus GPT-5's independent finds.
-
-### 16. Specification completeness: Sonnet 4.5 produces 2x the findings of Sonnet 4.6 on implementation-gap analysis; quality vs volume tradeoff
-
-**Date:** 2026-05-03
-**Task:** Identify specification gaps in gargoyle's `kill-switch.md` (185 lines) — places
-where an implementer would be forced to guess or decide on their own because the spec
-doesn't clearly specify behavior. New analytical lens not previously tested.
-**How we used them:** Same document (full text) + same focused analytical question to all
-3 models via HAI proxy. Highly structured prompt specifying 5 categories of underspecification
-(behavioral ambiguity, missing edge cases, ordering/sequencing gaps, interface contracts
-undefined, concurrency semantics omitted). Required specific output format per finding
-(gap, section, what implementer must decide, risk if wrong, severity). No tools, no
-project context beyond the document itself.
-
-| Model | Time | Output tokens | Reasoning tokens | Gaps found | Critical | High | Medium | Low |
-|---|---|---|---|---|---|---|---|---|
-| Claude Sonnet 4.6 | 73s | 3,403 | (internal) | 13 | 8 | 4 | 0 | 1 |
-| Claude Sonnet 4.5 | 102s | 5,191 | (internal) | 25 | 14 | 6 | 4 | 1 |
-| GPT-5 | 109s | 10,140 | 7,872 | 19 | 8 | 7 | 3 | 0 |
-
-**What they found — common ground (all 3 identified):**
-- Pipeline process identification ambiguity (which processes are "pipeline processes")
-- Per-user process scope mapping (how to terminate only one user's processes)
-- ETS table ownership and lifecycle (who owns it, what happens on crash)
-- Concurrent engage operations (what happens when two sources engage simultaneously)
-- Liquidation order tagging mechanism (what the tag is, how verified)
-- Process restart prevention (how "must not restart" is enforced)
-- Engage sequence atomicity (partial failure between DB write and termination)
-- Startup ordering and ETS readiness (pipeline starting before ETS populated)
-- Disengage sequence ordering (what happens and in what order)
-
-**Sonnet 4.5 unique findings (not in either other model):**
-- ETS table schema/structure (set vs ordered_set, key format, value schema)
-- Missing ETS detection mechanism (catch :badarg vs table existence check)
-- Database write atomicity with ETS (transaction boundaries, rollback semantics)
-- Per-user engage while global is already engaged (is it a no-op or error?)
-- Broker rejection semantics ("already filled" vs "invalid cancel" distinction)
-- Cold-start gate interaction (independence vs dependency of the two gates)
-- User deletion with active kill switch (orphaned rows, cascade semantics)
-- Global disengage effect on per-user states (independent or auto-clear?)
-- Audit log write failure during engage (critical-path vs best-effort)
-- Dashboard control ambiguity in LIQUIDATE mode (contradictory disable/enable)
-- Cancel timeout duration (operational parameter not specified)
-- Manual order source code path during LIQUIDATE (how orders bypass the dead pipeline)
-
-**GPT-5 unique findings (not in either other model):**
-- Combined global/per-user mode semantics (what happens when global=RESTRICT,
-  user=LIQUIDATE — can user's liquidation proceed?)
-- Scope of "all" in cancel_all and liquidation (system-wide vs per-user)
-- Gate behavior when ETS missing but liquidation needed (conflicting requirements:
-  fail-closed says block, but liquidation needs to pass)
-- Disengage during in-flight cancellations (what happens to racing tasks)
-- Gate placement relative to broker submission (exact point in the flow)
-- Engage latency expectations (no quantified SLA)
-- Mode change while already engaged (RESTRICT → LIQUIDATE without disengage)
-- Dashboard vs backend scope for manual liquidation (individual vs bulk only)
-
-**Sonnet 4.6 unique findings (not in either other model):**
-- ETS sequencing relative to process termination (ETS before or after kill?)
-- Concurrent disengage + re-engage race (specific interleaving scenario)
-- Close-only enforcement mechanism (UI-only vs backend validation)
-- Order-in-flight past ETS gate during termination (already-checked orders)
-
-**Quality assessment:**
-- **Claude Sonnet 4.5** was the most EXHAUSTIVE (25 gaps) but with notable
-  quality variance. Several findings were highly specific and
-  implementation-relevant (ETS schema, missing-table detection, broker rejection semantics).
-  Others were relatively obvious or lower-impact (user deletion, audit log
-  failure, cancel timeout duration). The 14 Critical ratings feel somewhat
-  generous — some would be more accurately rated as High in practice. Output
-  was well-structured with clear per-finding format.
-- **GPT-5** found 19 gaps with consistent high quality. Its unique findings
-  show cross-cutting reasoning: the combined mode semantics finding (global
-  vs per-user mode interaction) identifies a genuine specification gap that
-  neither Sonnet version noticed. The "ETS missing but liquidation needed"
-  finding is architecturally significant — it identifies a CONTRADICTION in
-  the spec's own rules (fail-closed blocks everything, but liquidation must
-  pass). Every finding was actionable. More selective severity ratings
-  (8 Critical vs Sonnet 4.5's 14).
-- **Claude Sonnet 4.6** was the most SELECTIVE (13 gaps) but with the highest
-  precision. Every finding was genuinely a specification gap that an
-  implementer would face. The ETS sequencing finding (#4) is particularly
-  well-reasoned — it identifies a specific ordering dependency that creates
-  a race window. Sonnet 4.6 appears to self-filter aggressively, producing
-  only findings it's confident about. Higher signal-to-noise than 4.5.
-
-**Key insight — Sonnet 4.5 vs 4.6 on analytical tasks:**
-This is the first direct comparison between Claude model versions on the same
-analytical task. Key differences:
-
-- **Volume:** 4.5 produced almost 2x the findings (25 vs 13)
-- **Tokens:** 4.5 used ~1.5x the output tokens (5,191 vs 3,403)
-- **Time:** 4.5 took ~1.4x longer (102s vs 73s)
-- **Severity distribution:** 4.5 had more Critical findings (14 vs 8) but
-  with more generous severity ratings
-- **Quality per finding:** 4.6 had higher average quality; fewer "obvious"
-  or lower-impact findings
-
-The 4.6 model appears to have been trained toward higher precision/selectivity.
-It finds fewer things but each finding is more reliably a genuine gap. The 4.5
-model is more exhaustive but includes findings that a reviewer might triage as
-"yes, technically, but not really a spec gap." This would be consistent with later
-Claude versions being more concise and selective, though one task cannot confirm it.
-
-**For practical use:** If you want completeness (cast a wide net, accept some
-noise): use 4.5. If you want precision (every finding is actionable, no triage
-needed): use 4.6. For architecture review where missing a gap has cost, 4.5's
-exhaustiveness is probably worth the noise. For review where false positives
-cost attention (e.g., PR review comments), 4.6's selectivity is preferred.
-
-**GPT-5 vs Sonnet comparison on this task:**
-GPT-5 (19 findings) sits between the two Sonnets in volume but has the highest
-consistency — no obvious misses or inflated severities. Its unique strength
-here: finding CONTRADICTIONS within the spec's own rules (ETS-missing blocking
-conflicts with liquidation needing to pass). This is consistent with Finding #15
-where GPT-5 was unusually selective but precise on coherence checking.
-
-Specification completeness analysis appears to be a task where:
-1. Sonnet 4.5 is strongest for breadth (25 findings, catches operational gaps)
-2. GPT-5 is strongest for detecting spec self-contradictions (19 findings, high precision)
-3. Sonnet 4.6 is strongest for precision (13 findings, zero noise)
-
-**Updated model version comparison:**
-- Claude 4.6 → higher precision, more selective, concise
-- Claude 4.5 → more exhaustive, more verbose, occasional severity inflation
-- This is a genuine tradeoff, not a simple regression or improvement
-
-**Practical implication:** Run BOTH Sonnet versions? 4.5 catches things 4.6
-filters out (ETS schema, broker rejection semantics, cold-start gate interaction).
-4.6 catches things with more specificity (sequencing gaps, exact race windows).
-For a one-shot budget: 4.5 if you want coverage, 4.6 if you want actionability.
-GPT-5 if you want to find where the spec contradicts itself.
-
-### 7. Token budget matters more than model size for gap analysis (confirmed)
-
-**Date:** 2026-05-03
-**Task:** Identify unaddressed failure scenarios in gargoyle's `failure-modes.md` (383 lines, ~25KB)
-**How we used them:** Same document, same analytical question ("What failure scenarios
-are NOT covered?"), three models. GPT-5 with 16K max_completion_tokens, Sonnet 4
-with 4K max_tokens, GPT-4.1 Mini with 4K max_completion_tokens. No project context
-beyond the document itself. Pure gap-analysis task.
-
-**Results:**
-- GPT-5 (16K budget): 28 gaps, most exhaustive. Found domain-specific edge cases
-  others missed entirely: ClOrdID collision across restarts, fractional share rounding,
-  broker maintenance windows (410/426), hot code upgrades, regulatory halts vs staleness
-  distinction, cancel-ack-then-late-fill race, DNS/TLS as distinct from network outage.
-- Sonnet 4 (4K budget): 12 gaps, concise. Unique framing: distinguished latency
-  degradation from outage (subtle but actionable). ETS corruption vs loss.
-- GPT-4.1 Mini (4K budget): 13 gaps with summary table. Unique angles: unknown broker
-  status enum values, configuration schema mismatches on cold-start, malformed signals
-  from logic bugs (not just crashes).
-
-**Overlap (all three):** Rate limiting, clock skew, resource exhaustion, DB failures,
-message backpressure, partial connectivity.
-
-**Key insight:** GPT-5's 4K attempt produced ZERO output (finish_reason: length) —
-all tokens consumed by internal reasoning. At 16K it produced the richest analysis.
-This confirms finding #3 (GPT-5 needs generous token budgets) AND adds a new
-observation: for open-ended analytical questions, GPT-5's reasoning overhead is
-proportionally larger. The models run with a 4K budget (Sonnet, Mini) both produced
-useful output because they don't spend that budget on internal chain-of-thought.
-
-**Model personality confirmed:**
-- GPT-5: exhaustive, domain-aware, finds edge cases a senior SRE would know
-- Sonnet: precise, architectural, finds design-level distinctions
-- GPT-4.1 Mini: structured, systematic, finds enumeration gaps
-
-**Practical implication:** For failure mode / gap analysis on design docs:
-- GPT-5 with ≥16K tokens for maximum coverage (most unique findings)
-- Sonnet for architectural framing ("this is really two different problems")
-- Mini for completeness checking ("what about this enum value?")
-- Running all three costs ~$0.50 and catches gaps none alone would find
-- GPT-5 at 4K is USELESS for this task — always give it room to think
-
-**Note on GPT-5 reasoning overhead:** First attempt at 4K max_completion_tokens
-returned empty content with finish_reason: length. The model spent all 4K tokens
-on internal reasoning and produced nothing. This is worse than a short answer —
-it's zero value for non-zero cost. Always budget ≥16K for GPT-5 analytical tasks.
-
-### 18. Temporal boundary analysis: GPT-5 is most exhaustive; Opus finds design-level contradictions; Sonnet 4.5 is structured but less deep
-
-**Date:** 2026-05-04
-**Task:** Identify temporal boundary vulnerabilities in gargoyle's `escalation-policy.md`
-(238 lines) — scenarios where the timing model (evaluation cycles, debounce counts,
-cooldown periods) creates windows of incorrect or dangerous behavior.
-**How we used them:** Same document (full text) + same focused analytical question to all
-3 models via HAI proxy. Highly structured prompt specifying 5 categories of temporal
-vulnerability (timing exploitation, counter reset abuse, asymmetric time exposure,
-cross-metric temporal interactions, state loss temporal effects). Required specific
-output format per finding (name, sequence with cycle numbers, mechanism, severity, fix).
-No tools, no project context beyond the document itself.
-
-| Model | Time | Output tokens | Reasoning tokens | Findings | Critical | High | Medium |
-|---|---|---|---|---|---|---|---|
-| GPT-5 | ~128s | 9,175 | 5,888 | 15 | 3 | 7 | 2 |
-| Claude Opus 4.6 | ~120s | 5,112 | (internal) | 10 | 3 | 5 | 2 |
-| Claude Sonnet 4.5 | ~100s | 4,056 | (internal) | 12 | 3 | 3 | 3 |
-
-**What they found — common ground (all 3 identified):**
-- Flash crash / inter-evaluation gap exploitation (metric spikes between discrete
-  evaluation cycles go undetected)
-- Single clear cycle resetting debounce counter (transient recovery defeats escalation
-  despite sustained risk — metric can breach 80%+ of cycles and never escalate)
-- Asymmetric escalation time vs loss compounding rate (11 cycles to reach liquidation
-  while losses compound every single cycle)
-- Monitor crash resets state to Clear, losing all escalation progress
-- Liquidation re-trigger requiring full debounce reset, delaying subsequent batches
-- Kill switch N value unspecified (timing indeterminacy)
-
-**GPT-5 unique findings (not in either other model):**
-- Boundary-alignment counter starvation: explicitly modeled the "adversarial flicker"
-  pattern (breaching 2 cycles, 1 clear, repeat — 66% breach time, never escalates)
-  with a precise mathematical framing of why K-of-N is needed
-- Cycle-length drift under load: GC pauses or CPU contention stretching evaluation
-  intervals means "3 cycles" could be 12 minutes instead of 90 seconds when it
-  matters most (high-load market stress = slowest evaluations)
-- Adversarial boundary timing (market microstructure masking): illiquid instruments
-  where opposing prints predictably arrive near evaluation boundaries, exploiting
-  deterministic sampling points
-- Cross-metric escalation gap and hedging prohibition: Restrict prevents ALL new
-  positions including risk-REDUCING hedges needed for a different metric still
-  escalating on its own timeline — protection for metric A actively worsens metric B
-- Cooldown stall causing prolonged Restrict: repeated transient spikes near hysteresis
-  threshold reset cooldown indefinitely while metric is actually safe
-- State inconsistency between restriction flags and monitor after restart:
-  documented asymmetry where flag persists (manual clear) but state resets (auto
-  clear) — creates orphaned restriction or unprotected window depending on
-  reconciliation approach
-- Metric computation fail-closed interacting with debounce: system errors create
-  false escalations with long cooldown, potentially blocking hedging trades
-- Unspecified N for kill switch post-liquidation breaches: coupled with crash
-  reset, system can loop indefinitely without reaching kill switch
-- In-liquidate flicker stall: one cycle below threshold after partial fill resets
-  re-trigger counter, stalling further liquidation
-
-**Claude Opus unique findings (not in either other model):**
-- De-escalation cooldown exploitation (predictable window): after cooldown completes
-  and restriction lifts, strategy has a GUARANTEED 5+ cycle window of unrestricted
-  trading before Restrict can re-engage — an automated strategy could systematically
-  exploit this predictable safe window to re-enter dangerous positions
-- Orphaned restriction flag asymmetry framed as a DESIGN CONTRADICTION: the failure
-  modes table specifies opposing recovery paths for state (automatic → Clear) vs
-  flags (manual clear), creating an irreconcilable dual state. Opus uniquely
-  identified that operator intervention to clear the flag could inadvertently
-  create a WORSE protection gap than leaving it orphaned
-- Cross-finding synthesis: Opus's summary explicitly observed that the
-  three Critical findings share a common cause (debounce optimizes against false
-  positives at the expense of false negatives during sustained events) and proposed
-  a single architectural fix (severity-aware fast path) that addresses all three
-
-**Claude Sonnet 4.5 unique findings (not in either other model):**
-- De-escalation timing not accounting for proximity to breach threshold: system
-  removes protection while metric is still near-dangerous, and re-escalation
-  requires full debounce — created a specific "whipsaw" scenario with cycle numbers
-- Manual-only de-escalation from Liquidate creates UNBOUNDED recovery time:
-  if triggered at 2 AM Saturday, trading disabled until Monday despite metrics
-  recovering in minutes. Framed as contradiction with "autonomous" design goals
-- Evaluation cycle synchronization assumption: no handling of variable timing
-  (CPU contention, GC pauses) — implicit throughout but never addressed
-- Cold start escalation ambiguity: system starts with no prior state while
-  portfolio may already be in breach condition
-- De-escalation event ordering race: multiple metrics de-escalating simultaneously
-  may emit events in non-deterministic order, confusing external observers
-
-**Quality assessment:**
-- **GPT-5** was the most exhaustive (15 findings) and showed the strongest
-  mathematical/systems reasoning. Its unique findings included precise attack
-  models (adversarial flicker, boundary alignment, microstructure masking) that
-  describe exact exploitation patterns with percentages and cycle counts. The
-  cross-metric hedging prohibition finding is architecturally significant — it
-  identifies that protection for one metric can actively CREATE risk for another.
-  Every finding was actionable with specific fixes.
-- **Claude Opus 4.6** produced fewer findings (10) but with characteristic depth
-  and self-awareness. Its cooldown exploitation finding identified a PREDICTABLE
-  exploit window that an automated strategy could systematically abuse — framed
-  not as an accident but as an adversarial opportunity. The summary synthesis
-  (identifying common cause across Critical findings) shows meta-analytical
-  capability the other models didn't demonstrate. Opus also uniquely identified
-  that human intervention to fix one problem could create a WORSE problem —
-  second-order operational reasoning.
-- **Claude Sonnet 4.5** was well-structured (12 findings, cleanly organized into
-  Critical/High/Medium/Low severity tiers) and faster than both other models.
-  Its findings were solid but less architecturally deep. The manual de-escalation
-  contradiction finding was genuinely insightful (unbounded recovery time vs
-  autonomous design goals). However, several of its findings restated concepts the
-  other models covered, with less specificity about exploitation mechanics.
-
-**Key insight — temporal reasoning as a task type:**
-This is the first experiment specifically testing "temporal boundary analysis" —
-reasoning about time-domain properties of a state machine (evaluation frequency,
-counter semantics, cooldown mechanics, crash/restart timing).
-
-Results compared to Finding #13 (race condition identification on a concurrency doc):
-- GPT-5: 15 findings here vs 12 in Finding #13. Consistent high performance
-  on temporal reasoning tasks across both experiments.
-- Opus: 10 findings here vs 10 in Finding #13. Remarkably consistent — Opus
-  produces ~10 high-quality findings regardless of temporal task variant.
-- Sonnet 4.5: 12 findings here (first test). Compare to Sonnet 4.6's 7 findings
-  (with errors) in Finding #13. Sonnet 4.5 handles temporal reasoning better than
-  4.6 — consistent with Finding #16 showing 4.5 is more exhaustive across task types.
-
-**Sonnet 4.5 vs 4.6 on temporal reasoning (inferred comparison):**
-Sonnet 4.6 struggled significantly on race condition identification (Finding #13:
-7 findings with analytical errors, misreading architecture). Sonnet 4.5 here
-produced 12 solid findings with no apparent misreadings. This suggests 4.5's
-exhaustiveness advantage extends to temporal reasoning — the additional
-exploration it does (vs 4.6's aggressive self-filtering) catches more temporal
-interactions. Confirms Finding #16's pattern: 4.5 for coverage, 4.6 for precision.
-
-**The structured-prompt effect continues:**
-All three models produced focused, high-quality output with this highly structured
-prompt (5 specific categories + required output format). This confirms Finding #14:
-narrow analytical lens + broad document scope is the sweet spot for all model tiers.
-The prompt structure appears to be a stronger predictor of output quality than model
-choice for the bulk of findings (all models find the common-ground issues; call it
-roughly the bottom 80%). Model choice matters for the remaining top ~20%: the unique
-insights that require deeper reasoning about system interactions.
-
-**Updated model assignment for temporal boundary analysis:**
-1. GPT-5 — most exhaustive, strongest at modeling adversarial exploitation patterns
-   and mathematical edge cases (15 findings)
-2. Claude Sonnet 4.5 — good volume with clean structure, viable for first-pass
-   temporal analysis (12 findings, no errors)
-3. Claude Opus 4.6 — fewest findings but highest insight density, uniquely
-   identifies predictable exploit windows and operational second-order effects
-   (10 findings)
-
-**Practical implication:** For temporal analysis on state machines and timing-dependent
-policies, the three-model stack produces genuine complementary value:
-- GPT-5 catches the adversarial attack patterns and mathematical edge cases
-- Opus catches the predictable exploit windows and operational contradictions
-- Sonnet 4.5 provides good breadth at lower cost with clean severity categorization
-
-The union of unique findings across all three models reveals significantly more
-temporal vulnerabilities than any single model alone. For a document governing
-autonomous financial actions (liquidation, kill switch), the cost of running all
-three (~$1-2) is trivially justified against the risk of missing a timing exploit.
-
-### 19. Union coverage test: GPT-5 Mini + Sonnet 4.6 covers ~71% of GPT-5's findings; the missing 29% is where the real value lives
-
-**Date:** 2026-05-04
-**Task:** Identify hidden assumptions in gargoyle's `trading-pipeline.md` (1,110 lines,
-~62KB) — the most complex document tested so far, covering the full end-to-end path
-from tick ingestion through order execution.
-**How we used them:** Same document (full text, no truncation) + same focused analytical
-question to all 3 models via HAI proxy. Standard hidden-assumption prompt with 5
-categories (runtime behavior, external dependencies, timing/ordering, scale/load,
-uncovered failure modes). Required specific output format per finding. No tools, no
-project context beyond the document itself.
-
-| Model | Time | Output tokens | Reasoning tokens | Assumptions found |
-|---|---|---|---|---|
-| GPT-5 | 99s | 9,418 | 5,696 | 35 |
-| GPT-5 Mini | 93s | 5,309 | 1,792 | 21 |
-| Claude Sonnet 4.6 | 38s | 1,792 | (internal) | 17 |
-
-**Coverage analysis — can Mini + Sonnet together replace GPT-5?**
-
-Categorized each of GPT-5's 35 findings by whether the union of Mini + Sonnet
-also identified the same assumption:
-
-- **Covered by BOTH Mini and Sonnet:** ~12 findings (common ground — any model
-  finds these: idempotency, single-writer, clock sync, instrument resolution,
-  fill immutability, reconciliation gate, backpressure, fill correlation, event
-  ordering, audit scalability, PortfolioRisk bottleneck)
-- **Covered by Mini only (not Sonnet):** ~7 findings (transactional atomicity,
-  audit causal consistency, modification-in-flight enforcement, OM throughput,
-  decimal precision, PM/PR close-only race, partition duplicate submit)
-- **Covered by Sonnet only (not Mini):** ~6 findings (market data feed rates,
-  pipeline-vs-market speed, corporate actions atomicity, kill switch partition,
-  shared port isolation, market close vs auction fills)
-- **Union(Mini + Sonnet) total coverage:** ~25/35 = **~71%** of GPT-5's findings
-- **GPT-5 unique (missed by both):** ~10-18 findings depending on strictness
-
-**What GPT-5 uniquely found that the cheaper pair missed:**
-
-The missing 29% is NOT random — it's systematically different in character:
-
-1. **Operational edge cases:** Default TIF "day" broker semantics, OrderRate
-   counting retries, extended-hours MarketHours mismatch, fractional quantities,
-   local expiry timer precision per instrument
-2. **Design-level interaction gaps:** PortfolioRisk concurrent decision race
-   (snapshot stale between two parallel approvals), re-validation gap between
-   approval and submit, decision loss on crash after audit write
-3. **Domain-specific knowledge:** Manual broker-side actions conflicting with
-   state machine, options/complex instrument position_effect mapping, Decision→Order
-   1:1 invariant vs broker auto-splitting, wash sale retroactive P&L mutation
-4. **Architectural observations:** Reduction re-entry rule insufficiency,
-   PortfolioMonitor coalescing vs fast breach detection, multi-aggregator fanout
-   and audit partial writes, replay/backtest alignment with production controls
-
-These share a common trait: they require **domain expertise** (knowing how brokers
-actually behave, how regulatory rules interact, how production trading systems
-fail in practice) combined with **architectural reasoning** (how the design's own
-mechanisms interact under those real-world conditions). The cheaper models find
-assumptions about the document's internal consistency; GPT-5 additionally finds
-assumptions about the document's relationship to the external world it must
-operate in.
-
-**GPT-5 Mini vs Sonnet 4.6 — complementary, not redundant:**
-
-Mini and Sonnet covered different gaps:
-- Mini was stronger on **internal consistency** (transactional atomicity, causal
-  consistency, decimal precision, modification serialization)
-- Sonnet was stronger on **external interactions** (market data feeds, corporate
-  actions, kill switch distribution, shared resource isolation)
-
-This aligns with previous findings: Mini reasons about implementation mechanics;
-Sonnet reasons about system boundaries and external interactions. Their union
-covers more ground than either alone.
-
-**Cost comparison:**
-
-| Approach | Total tokens | Approx. cost | Coverage of GPT-5 |
-|---|---|---|---|
-| GPT-5 alone | ~21K (9.4K output + 5.7K reasoning) | ~$0.80 | 100% (35 findings) |
-| Mini + Sonnet | ~7.1K output + 1.8K reasoning | ~$0.25 | ~71% (25/35 findings) |
-| All three | ~28K total | ~$1.05 | >100% (35 + unique Sonnet/Mini extras) |
-
-**Key insight — the 71% coverage is a floor, not a ceiling:**
-
-The union covers 71% of GPT-5's specific findings. But Mini and Sonnet each
-also produced findings that GPT-5 DIDN'T make:
-- Sonnet: DailyLossLimit query performance scaling, instrument reference data
-  propagation atomicity across components
-- Mini: Signal audit correlation ambiguity under replay/duplicate ticks
-
-So the total unique finding space is LARGER than any single model. Running all
-three produces the most comprehensive analysis.
-
-**Answer to the open question: "Would running GPT-5 Mini + Sonnet together
-approach GPT-5's coverage at lower combined cost?"**
-
-**Partially.** The pair covers ~71% of GPT-5's findings at ~31% of the cost.
-But the missing 29% is disproportionately valuable — it contains the
-domain-specific, interaction-level, real-world-knowledge findings that are
-most likely to prevent production incidents. For a quick sanity check or
-first-pass screening, Mini + Sonnet is excellent value. For architecture
-review where completeness matters (financial system, safety-critical), GPT-5
-is not replaceable by cheaper models — its unique findings are exactly the
-ones that would cause real-world failures.
-
-**Practical implication:** The optimal strategy depends on stakes:
-- **Low stakes** (internal doc review, non-critical systems): Mini + Sonnet
-  is 71% coverage at 31% cost — strong ROI
-- **High stakes** (financial systems, safety-critical): run all three — the
-  ~$1 total cost is irrelevant vs the value of the extra 10-18 findings
-- **Budget-conscious high stakes:** run GPT-5 alone — it subsumes most of
-  what Mini + Sonnet find, and adds the critical domain-knowledge findings
-
-The cost argument for Mini + Sonnet as a GPT-5 REPLACEMENT doesn't hold for
-important work. The cost argument for Mini + Sonnet as a GPT-5 COMPLEMENT
-is strong — they catch a few things GPT-5 misses, and the union of all three
-is the most thorough analysis available.
-
-**Document complexity observation:**
-This is the largest document tested (1,110 lines vs previous 185-785 lines).
-GPT-5's finding count scaled up (35 vs 20-26 on smaller docs) while maintaining
-quality — no padding with obvious/low-value findings. Mini also scaled (21 vs
-6 on 459-line doc in Finding #14). Sonnet scaled less (17 vs 12-17 on smaller
-docs) — it appears to have a natural output ceiling regardless of document size,
-consistent with its self-filtering behavior observed in previous findings.
-
-### 22. Silent correctness failures: NEW analytical lens reveals Opus's strength at domain/regulatory reasoning; GPT-5 shows regulatory depth; all models converge on compliance timestamp errors
-
-**Date:** 2026-05-05
-**Task:** Identify scenarios where the mechanism produces SILENTLY INCORRECT results
-(not errors, not crashes — wrong financial calculations, wrong lot selections, or wrong
-compliance records that pass all validation) in gargoyle's `specid-lot-selection.md`
-(306 lines) — a financial system specification covering tax lot selection strategies,
-cost basis accounting, and IRS SpecID compliance.
-**How we used them:** Same document (full text) + same focused analytical question to
-all 3 models via HAI proxy. Highly structured prompt specifying 5 categories of silent
-incorrectness (stale data, semantic precision, ordering sensitivity, composition errors,
-temporal reference errors). Required specific output format per finding with concrete
-numerical examples of financial impact. No tools, no project context beyond the document.
-
-| Model | Time | Output tokens | Reasoning tokens | Findings | Critical | High | Medium |
-|---|---|---|---|---|---|---|---|
-| GPT-5 | 147s | 13,006 | 10,496 | 7 | 2 | 2 | 3 |
-| Claude Opus 4.6 | 119s | 5,902 | (internal) | 10 | 3 | 3 | 4 |
-| Claude Sonnet 4.6 | 122s | 6,011 | (internal) | 6 | 3 | 3 | 0 |
-
-**What they found — common ground (all 3 identified):**
-- `designation_at` = `DateTime.utc_now()` at processing time, NOT at actual
-  designation time (manual selection was made at order submission, standing
-  orders were configured earlier) — compliance record factually incorrect
-- Holding period calculation boundary errors (>365 days vs IRS "more than one
-  year" rule, off-by-one at leap year boundaries, day-after-acquisition start)
-- HIFO tie-breaker `opened_at ASC` ignores tax_term dimension — selects
-  long-term losses over short-term losses when both have identical cost basis,
-  producing less tax-valuable outcomes
-- Strategy preference resolved at fill processing time, not at trade time
-  (preference changes between trade and fill processing apply retroactively)
-
-**GPT-5 unique findings (not in either Claude model):**
-- Corporate action applied late leaves a stale cost basis in HIFO: ROC/dividend reduces
-  basis but if close/4 fires before apply_corporate_action/3, HIFO sorts on
-  pre-adjusted basis AND records wrong realized P&L permanently. No mechanism
-  to restate previously persisted LotClosed events. Concrete example: $2,000
-  overstated loss from one trade.
-- `designation_at` fragmentation: a single sell consuming multiple lots calls
-  DateTime.utc_now() per loop iteration, producing slightly different timestamps
-  for what should be a single coherent designation event. Audit risk.
-- LIFO label in `selection_method` field: records "lifo" but for securities LIFO
-  isn't an authorized tax method — the operation is legally SpecID electing
-  newest lots. Downstream reporting may reject or misclassify.
-
-**Claude Opus unique findings (not in either other model):**
-- Realized P&L excludes commissions/fees: formula uses `sell_fill.price` (raw
-  execution price) minus `lot.cost_basis`, not net proceeds. If cost_basis also
-  excludes buy-side commissions, P&L is doubly overstated. Active trader doing
-  1000 trades/year: ~$20,000+ cumulative P&L overstatement.
-- Position `average_cost` is meaningless under SpecID and potentially misleading:
-  SpecID exists to exploit lot-level basis differences, but position-level average
-  obscures this. If downstream consumers use average_cost for tax estimation,
-  results can be 50%+ wrong per lot.
-- GenServer mailbox ordering determines lot-to-fill assignment for concurrent sells:
-  two simultaneous fills for the same instrument get different lots based on network
-  arrival timing. With different holding periods, produces $670+ tax difference
-  without user awareness.
-- Wash sale rule completely unaddressed: system reports losses as realized/deductible
-  without checking 30-day substantially identical purchase rule. Active trader
-  harvesting $50,000 in losses could have $0 actually deductible — $18,500 tax gap.
-- `opened_at` semantics undefined: whether it's exchange execution time, GenServer
-  arrival time, or settlement date affects every downstream calculation (FIFO/LIFO
-  ordering, holding periods, tax terms). Network timing could produce wrong FIFO
-  lot selection.
-
-**Claude Sonnet 4.6 unique findings (not in either other model):**
-- Stale cost basis in manual lot picker during concurrent corporate actions: UI shows
-  pre-action basis, user selects based on stale data, but close/4 only validates
-  open/ownership/quantity — never re-validates that the selection rationale is still
-  correct. No field records the discrepancy.
-- `average_cost` recomputation ordering ambiguity in event-sourced model: step 4
-  recomputes from "updated lots" but step 3 (persist events) may not have completed
-  — if implementation re-derives from event store rather than in-memory state, reads
-  pre-closure lot quantities. Accumulates $500+ error per partial close.
-- Strategy fallback + config corruption silently overwrites selection method in
-  compliance record: if config becomes invalid, fallback to :fifo is logged at
-  :warning but LotClosed records `selection_method: "fifo"` — compliance record
-  shows user "chose" FIFO when they configured HIFO. No field records intended vs
-  actual strategy.
-
-**Quality assessment:**
-- **Claude Opus** produced the most findings (10) with the broadest analytical scope.
-  Several findings went BEYOND the document's mechanism to identify missing features
-  that create silent incorrectness (wash sale rules, commission handling, opened_at
-  semantics). This is a different analytical mode: Opus identified what the system
-  SHOULD compute but DOESN'T, not just where the existing computation is wrong.
-  The wash sale finding is the highest-impact across all three models — an active
-  trader's entire tax-loss harvesting strategy could be invalid. The GenServer
-  mailbox ordering finding shows characteristic Opus reasoning about emergent
-  behavior from design decisions.
-- **GPT-5** produced fewer findings (7) but with extreme precision and specificity.
-  Every finding includes concrete dollar amounts and specific field references.
-  The corporate action stale basis finding is uniquely actionable — it identifies a
-  specific race condition between two documented mechanisms (close/4 and
-  apply_corporate_action/3) that produces permanently incorrect persisted data
-  with no correction path. The designation_at fragmentation finding shows attention
-  to implementation detail that neither Claude model noticed. GPT-5 used 10,496
-  reasoning tokens for 7 findings (~1,500 tokens/finding): HIGH verification,
-  consistent with Finding #20's pattern for precision-over-breadth tasks.
-- **Claude Sonnet 4.6** produced 6 findings with strong specificity and novel angles.
-  The event-sourced recomputation ordering finding (#5) is architecturally subtle —
-  it identifies a composition error between the walk-and-consume algorithm's step
-  ordering and event-sourcing patterns. The strategy fallback compliance recording
-  finding is a genuine audit hazard. However, Sonnet produced no Medium-severity
-  findings — it either found Critical/High issues or filtered everything else out.
-  This aligns with its established high-precision, high-self-filtering behavior.
-
-**Key insight — "Silent correctness" as an analytical lens:**
-
-This is the FIRST experiment testing a "silent incorrectness" prompt. The key
-difference from previous analytical lenses:
-- **Assumption-finding:** "What must be true for this to work?" (Findings #10-12)
-- **Race conditions:** "What timing issues exist?" (Finding #13)
-- **Design coherence:** "Does the design contradict itself?" (Finding #15)
-- **Invariant violations:** "What operation sequences break invariants?" (Finding #20)
-- **Silent correctness:** "Where does the system CONFIDENTLY produce WRONG output
-  with NO indication of error?"
-
-The silent correctness lens produced qualitatively different findings from all
-previous lenses. The emphasis on "passes all validation" forced models to reason
-about what SHOULD be validated but ISN'T, and about semantic correctness (regulatory
-requirements, financial accounting rules) vs syntactic correctness (valid types,
-non-nil fields, correct schema).
-
-This lens also revealed a key model differentiation not seen before:
-- **Opus reasons about MISSING functionality** (wash sales, commissions, opened_at
-  semantics) — things the system should do but doesn't
-- **GPT-5 reasons about EXISTING functionality being wrong** (corporate action race,
-  designation fragmentation, LIFO labeling) — things the system does but incorrectly
-- **Sonnet reasons about COMPOSITION failures** (event-sourcing step ordering,
-  strategy fallback propagation) — things that are individually correct but combine
-  incorrectly
-
-These are three genuinely different analytical modes, not just "more/less thorough."
-All three are valuable for different review outcomes: Opus for feature completeness,
-GPT-5 for mechanism correctness, Sonnet for integration correctness.
-
-**Financial domain advantage:**
-
-This is the first experiment on a document with strong regulatory/financial semantics.
-All three models demonstrated domain knowledge (IRS holding period rules, Treas. Reg.
-1.1012-1(c) requirements, wash sale IRC §1091, long-term/short-term capital gains
-rate differentials). Opus in particular referenced specific IRC sections and provided
-concrete tax rate calculations. The "silent incorrectness" lens works especially well
-on financial/regulatory documents because the gap between "syntactically valid output"
-and "semantically/legally correct output" is large and consequential.
-
-**Comparison to previous findings on the same models:**
-
-| Task type | GPT-5 findings | Opus findings | Sonnet findings | Opus > GPT-5? |
-|---|---|---|---|---|
-| Hidden assumptions (#10-12) | 20-35 | 12-13 | 13-17 | No |
-| Race conditions (#13) | 12 | 10 | 7 | No |
-| Design coherence (#15) | 4 | 7 | 5 | **Yes** |
-| Invariant violations (#20) | 3 | 7 | 5 | **Yes** |
-| Silent correctness (#22) | 7 | 10 | 6 | **Yes** |
-
-Pattern confirmed: Opus outperforms GPT-5 (by finding count) on tasks that require
-reasoning about the design's RELATIONSHIP to external requirements (regulatory,
-financial, consumer expectations). GPT-5 outperforms Opus on tasks that require
-EXHAUSTIVE EXPLORATION within a self-contained system (assumptions, race conditions).
-
-The "silent correctness" lens is structurally similar to coherence checking (does the
-system match its external requirements?) rather than gap-finding (what's missing
-within the system?). This explains why Opus outperforms: the task requires reasoning
-about the world outside the document (IRS rules, financial accounting standards,
-regulatory requirements), which is Opus's strength.
-
-**Practical implication:**
-For financial/regulatory system review, the "silent correctness" lens should be
-run using Opus as the primary model (broadest findings including missing-feature
-identification) plus GPT-5 for mechanism-level precision. Sonnet adds value for
-composition/integration issues that neither Opus nor GPT-5 catches. All three
-produced unique, actionable findings that the others missed.
-
-The four findings ALL models converged on (designation_at, holding period, HIFO
-tie-breaker, strategy preference timing) should be treated as confirmed design
-bugs requiring fixes. The fact that three independent models all identified them
-with concrete financial impact examples increases confidence that these are real.
-
-### 23. Regulatory compliance analysis: GPT-5 finds broadest scope of IRS issues; Opus self-corrects mid-analysis; all models converge on cross-account gap
-
-**Date:** 2026-05-05
-**Task:** Identify where gargoyle's `wash-sale-tracking.md` (391 lines) could produce
-incorrect tax reporting compared to IRS wash sale regulations (IRC §1091). NEW
-analytical lens: regulatory compliance verification — asking models to reason about
-a code implementation's correctness against EXTERNAL regulatory requirements (not
-internal system assumptions or race conditions).
-**How we used them:** Same document (full text) + same focused analytical question
-to all 3 models via HAI proxy. Prompt structured around 5 categories: regulatory
-gaps, interpretation errors, edge cases at regulatory boundaries, cross-account/entity
-concerns, and interaction with other IRC sections. Required specific regulatory
-citations, implementation analysis, concrete tax errors, and audit risk levels.
-No tools, no project context beyond the document.
-
-| Model | Time | Output tokens | Reasoning tokens | Findings |
-|---|---|---|---|---|
-| GPT-5 | 178s | 12,525 | 9,536 | 16 |
-| Claude Opus 4.6 | 155s | 7,326 | (internal) | 16 (with 2 self-corrections/withdrawals) |
-| Claude Sonnet 4.6 | 40s | 1,818 | (internal) | 12 |
-
-**What they found — common ground (all 3 identified):**
-- Cross-account/IRA/external broker wash sales not tracked (IRS applies at taxpayer level)
-- Options/contracts to acquire stock not triggering wash sales (explicit in IRC §1091(a) text)
-- "Substantially identical" definition too narrow (same index ETFs, share classes, ADRs)
-- Trade date vs settlement date ambiguity in opened_at/closed_at
-- Short sale wash sales not addressed
-- Section 475 mark-to-market traders incorrectly subjected to wash sale tracking
-- IRC §1092 straddle rules interaction not addressed
-- Related party / spousal transactions not considered
-- Corporate action identity changes breaking matching
-
-**GPT-5 unique findings (not in either other model):**
-- **Per-share vs lot-level basis tacking** (#1): The system applies `disallowed_loss`
-  and `tacked_opened_at` at the LOT level, but IRS requires per-share treatment
-  when only partial shares are matched. A lot of 100 shares where only 60 trigger
-  wash sale should have per-share basis segregation — the system inflates basis for
-  all 100 shares. **Most architecturally significant finding** — a fundamental
-  design-level error, not a missing feature.
-- **IRA permanent disallowance** (#2): When replacement purchase is in an IRA, the
-  loss is PERMANENTLY lost (no basis adjustment possible in tax-deferred accounts).
-  System either incorrectly applies basis adjustment inside IRA or misses it entirely.
-- **Instruments not subject to §1091** (#4): §1256 contracts (futures, index options),
-  cryptocurrency, and §475 elections are all exempt — system may over-disallow.
-- **Average-cost mutual fund basis** (#11): Wash sale adjustments for funds using
-  average-cost method require different math than discrete lot-level adjustments.
-- **ADRs vs local shares** (#14): ADRs and underlying foreign ordinaries are
-  substantially identical but have different instrument_ids.
-- **RSU vestings/ESPP purchases** (#15): Equity compensation creating lots via
-  corporate action paths may not trigger `check_replacement/2`.
-- **Ordering priority between pre/post sale purchases** (#10): Industry convention
-  (post-sale first, then pre-sale) may differ from system's strict chronological
-  ordering, causing 1099-B mismatches.
-
-**Claude Opus unique findings (not in either other model):**
-- **Year-end boundary timing** (#5): Loss in December + replacement in January means
-  tax reports generated between Dec 31 and the replacement purchase date are incorrect.
-  Forward detection fires retroactively but users may have already filed. System needs
-  a "30-day pending window" for year-end reports.
-- **Form 8949 reporting format** (#6): IRS requires code "W" in column (f) and
-  specific adjustment amounts in column (g). System doesn't describe how `tax_summary/3`
-  produces Form 8949-compatible output — potential CP2000 notice triggers from
-  automated IRS matching against broker 1099-B.
-- **"Open lots" query in backward detection** (#10): If backward detection only
-  queries currently-open lots, it misses replacements that were acquired AND SOLD
-  within the window. IRS looks at acquisition regardless of current holding status.
-  (Rev. Rul. 56-602)
-- **Forward detection loss ordering unspecified** (#7): When multiple prior losses
-  compete for the same replacement shares, ordering matters — different allocation
-  produces different basis amounts on the replacement lot.
-- **DRIP reinvestments triggering wash sales** (#9): Dividend reinvestment creates
-  new lots that should trigger forward detection but may not if only buy fills
-  produce `LotOpened` events.
-- **Self-correcting analytical style (CONFIRMED):** Opus withdrew Finding #4
-  entirely mid-analysis ("Revised assessment: holding period logic appears correct.
-  I withdraw the claim of error"). Spent ~500 words reasoning through the holding
-  period tacking logic, found it correct, and explicitly retracted. This is now
-  confirmed across Findings #15, #20, and #23 as a consistent Opus behavior for
-  verification-heavy regulatory analysis.
-
-**Claude Sonnet unique findings (not in either other model):**
-- **Entity-level tracking for partnerships/S-Corps** (#4.2): Tax-transparent entities
-  trading through the platform need K-1 reporting to partners — user-scoped model
-  doesn't address pass-through entity wash sale reporting.
-- **Constructive sale integration (IRC 1259)** (#4.1): Short positions or derivatives
-  creating constructive ownership interact with wash sale determination in ways not
-  addressed.
-- **NOL carryforward interaction** (#5.3): Wash sale deferrals affect character and
-  timing of losses contributing to NOL calculations across tax years.
-
-**Quality assessment:**
-- **GPT-5** produced the broadest regulatory scope (16 findings) with the most
-  specific IRS citations (Rev. Rul. 2008-5, Pub. 550, IRC §§267, 1091, 1092, 1222,
-  1223, 1256, 475). Its per-share vs lot-level finding (#1) is the only one that
-  identifies a FUNDAMENTAL DESIGN ERROR (not a missing feature). Most of the other models'
-  findings are "you don't handle X" — GPT-5's #1 says "what you DO handle is
-  handled INCORRECTLY." This distinction matters: missing features are known scope
-  limitations; incorrect logic is a bug.
-- **Claude Opus** matched GPT-5's count (16 with 2 self-corrections = 14 net
-  confirmed) but with different character. Opus excelled at identifying OPERATIONAL
-  implications (year-end boundary timing, Form 8949 format requirements, forward
-  detection ordering) rather than just statutory gaps. Its findings tend to describe
-  HOW the gap manifests in practice ("user files taxes, then January purchase
-  retroactively invalidates the filing") vs GPT-5's approach of citing the statute
-  and describing the theoretical violation.
-- **Claude Sonnet** was fast (40s) and produced 12 competent findings but with less
-  regulatory precision. Findings lacked specific IRS citations (no Rev. Rul.
-  references, no Treas. Reg. citations). Several findings overlapped heavily with
-  common ground items without adding unique depth. The entity-level and
-  constructive sale findings show awareness of tax complexity but are relatively
-  generic ("this is complex and not addressed").
-
-**Key insight — regulatory compliance as a distinct task type:**
-
-This experiment tests a fundamentally different cognitive demand than previous ones:
-previous tasks asked "what could go wrong with this system?" (internal reasoning).
-This task asks "does this system correctly implement external rules?" (external
-reasoning). The model must hold TWO bodies of knowledge simultaneously: the
-implementation spec AND the regulatory framework, then find mismatches.
-
-All three models had strong tax law knowledge — they cited IRC sections, Revenue
-Rulings, and Treasury Regulations correctly. The differentiation wasn't in legal
-knowledge but in HOW they applied it:
-
-- **GPT-5:** Exhaustive statutory mapping ("here's every IRC section that touches
-  wash sales; here's where the implementation falls short on each"). Breadth-first
-  coverage. Found the most issues by sheer scope of regulatory awareness.
-- **Opus:** Operational consequence reasoning ("here's how this gap manifests as
-  a real-world problem for the user/auditor"). Found issues by reasoning about
-  the implementation's interaction with real-world workflows (filing deadlines,
-  form formats, broker reconciliation).
-- **Sonnet:** Category-based analysis ("here are cross-account issues, here are
-  entity issues, here are interaction issues"). Followed the prompt structure
-  closely but didn't go deep within each category.
-
-**The per-share vs lot-level finding (GPT-5 #1) — why it matters:**
-
-This is the experiment's most important result. Every model found missing features
-(options, cross-account, short sales) — those are SCOPE limitations that the
-document itself acknowledges or defers. GPT-5 uniquely found a correctness bug in
-the IMPLEMENTED logic: the system's lot-level basis adjustment is mathematically
-wrong for partial wash sales.
-
-Example 1 (no bug): loss lot of 100 shares, replacement lot of 60 shares. With
-`adjusted_quantity = min(loss_quantity, replacement_quantity) = 60`, all 60
-shares of the replacement lot are matched. The system adds the 60-share portion
-of the loss to the replacement lot's basis and sets `tacked_opened_at` on the
-whole lot. If the replacement lot later sells 30 shares, the per-share basis
-reflects 60 shares of adjustment spread across 60 shares, which is correct for
-this lot.
-
-Example 2 (the bug): loss of 60 shares matched against a replacement lot of 100
-shares, so `adjusted_quantity < replacement_quantity`. Only 60 of the 100
-replacement shares are affected, but `tacked_opened_at` is set on the entire
-lot when only the matched shares should have tacked holding periods. This IS a
-genuine bug: 40 shares get incorrect holding period classification. The edge
-case GPT-5 identifies therefore requires a replacement lot larger than the
-loss.
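-
-The replacement-larger-than-loss case can be sketched as a lot split. This is a
-minimal illustration, not gargoyle's implementation: `Lot`, `split_for_wash_sale`,
-and the way `tacked_date` is supplied are all hypothetical names introduced here,
-and basis adjustment is left out entirely.
-
-```python
-from dataclasses import dataclass
-from datetime import date
-from typing import Optional
-
-
-@dataclass(frozen=True)
-class Lot:
-    quantity: int
-    opened_at: date
-    tacked_opened_at: Optional[date] = None  # None means no tacking applies
-
-
-def split_for_wash_sale(replacement: Lot, loss_quantity: int, tacked_date: date) -> list[Lot]:
-    """Split a replacement lot so only the matched shares receive tacking."""
-    matched = min(loss_quantity, replacement.quantity)
-    lots = [Lot(matched, replacement.opened_at, tacked_opened_at=tacked_date)]
-    unmatched = replacement.quantity - matched
-    if unmatched > 0:
-        # Unmatched shares keep their own acquisition date, untacked.
-        lots.append(Lot(unmatched, replacement.opened_at))
-    return lots
-
-
-# Loss of 60 shares against a 100-share replacement lot: 60 tacked, 40 untouched.
-parts = split_for_wash_sale(Lot(100, date(2026, 1, 10)), 60, date(2025, 3, 1))
-print([(p.quantity, p.tacked_opened_at) for p in parts])
-```
-
-Splitting the lot keeps the per-lot fields meaningful; an alternative design
-would track matched quantity on the lot itself.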
-
-**Updated task-type taxonomy:**
-
-| Task type | Primary cognitive demand | Best model |
-|---|---|---|
-| Hidden assumptions | Breadth identification (what's not stated?) | GPT-5 (exhaustive) |
-| Race conditions | Sequential temporal reasoning | GPT-5 + Opus |
-| Cross-component interactions | Component boundary reasoning | GPT-5 + Sonnet |
-| Design coherence | Internal consistency checking | Opus |
-| Invariant violation paths | Construction + verification | GPT-5 (precision) |
-| Silent correctness | External requirement matching | Opus |
-| **Regulatory compliance** | **Dual-knowledge-base comparison** | **GPT-5 (breadth) + Opus (operations)** |
-
-Regulatory compliance is closest to "silent correctness" (Finding #22) in that
-both require reasoning about external requirements. The key difference:
-- Silent correctness asks "does this produce correct outputs for all inputs?"
-- Regulatory compliance asks "does this implement the law correctly?"
-
-Both favor models that reason about the system's relationship to the outside
-world (Opus's strength), but regulatory compliance also rewards breadth of
-statutory knowledge (GPT-5's strength). The combination produces the most
-complete picture.
-
-**Practical implication:**
-For regulatory compliance review of financial systems:
-- Run GPT-5 for exhaustive statutory coverage (finds the most gaps)
-- Run Opus for operational impact analysis (finds how gaps manifest in practice)
-- Sonnet adds marginal value — use only if budget allows
-- GPT-5's unique strength: identifying correctness bugs in implemented logic
-  (not just missing features)
-- Opus's unique strength: identifying timing/workflow issues (year-end, form
-  reporting, reconciliation with broker)
-
-### 24. Design improvement proposals: GPT-5 excels at defense-in-depth thinking; Opus finds subtle design contradictions; Sonnet produces generic recommendations
-
-**Date:** 2026-05-05
-**Task:** Propose specific design improvements for gargoyle's `kill-switch.md` (185 lines)
-— the primary safety mechanism that prevents rogue orders. NEW task type: generative/
-creative ("what would you improve?") rather than purely analytical ("what's wrong?").
-**How we used them:** Same document (full text) + same focused prompt to all 3 models
-via HAI proxy. Prompt asked for 8-15 specific improvements with: weakness, proposed
-change (concrete), tradeoff, severity rating. Explicitly excluded generic advice
-("add more tests") and asked about runtime assumptions. No tools, no project context.
-
-| Model | Time | Output tokens | Reasoning tokens | Improvements proposed |
-|---|---|---|---|---|
-| GPT-5 | 118s | 8,710 | 6,016 | 15 |
-| Claude Opus 4.6 | 127s | 4,985 | (internal) | 15 |
-| Claude Sonnet 4.6 | 40s | 1,636 | (internal) | 12 |
-
-**What they found — common ground (all 3 identified):**
-- DB write failure blocking engagement (fail-open under DB outage) — all three
-  proposed in-memory-first engagement with async persistence
-- Kill switch process liveness monitoring (heartbeat/watchdog)
-- Broker connectivity loss during cancellation operations
-- ETS table ownership and crash-window vulnerability
-- Supervisor restart suppression as unstated mechanism
-- Per-venue/per-broker scope extension
-
-**GPT-5 unique findings (not in either other model):**
-- **Infrastructure-level "hard kill"** — egress proxy or service mesh that blocks
-  broker traffic independently of the application. Belt-and-suspenders approach
-  where the kill switch works even if the entire BEAM VM is unresponsive. This
-  was GPT-5's highest-impact unique insight.
-- **Kill fence token (epoch)** — every order-carrying message includes an epoch;
-  stale-epoch messages are dropped at the gate. Elegantly solves in-flight
-  messages without needing drain timeouts.
-- **Cluster/multi-node propagation** — detailed leader election + epoch broadcast
-  + fail-closed on partition design.
-- **Post-engage broker verification** — query broker AFTER engaging to confirm no
-  orders slipped through during the engagement window.
-- **Liquidation exposure validation** — proving tagged liquidation orders actually
-  REDUCE exposure rather than trusting the tag.
-- **Recovery/cold-start order suppression** — ensuring reconciliation/recovery
-  routines can't submit orders while engaged.
-- **Engage latency reordering** — ETS first, terminate second, DB async.
-- **Audit log tamper evidence** — append-only external sink + hash chain.
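-
-The kill fence token idea can be sketched as follows. This is an illustrative
-Python model only (gargoyle runs on the BEAM); `KillFence`, `OrderMessage`, and
-the `engage`/`release`/`admit` methods are hypothetical names, and the sketch
-assumes every message is stamped with the epoch current at creation time.
-
-```python
-import threading
-from dataclasses import dataclass
-
-
-@dataclass(frozen=True)
-class OrderMessage:
-    order_id: str
-    epoch: int  # epoch observed by the sender when the message was created
-
-
-class KillFence:
-    def __init__(self) -> None:
-        self._epoch = 0
-        self._engaged = False
-        self._lock = threading.Lock()
-
-    def current_epoch(self) -> int:
-        with self._lock:
-            return self._epoch
-
-    def engage(self) -> None:
-        with self._lock:
-            self._epoch += 1  # everything stamped earlier is now stale
-            self._engaged = True
-
-    def release(self) -> None:
-        with self._lock:
-            self._engaged = False
-
-    def admit(self, msg: OrderMessage) -> bool:
-        """Gate check: reject while engaged, and reject stale-epoch messages."""
-        with self._lock:
-            return not self._engaged and msg.epoch == self._epoch
-
-
-fence = KillFence()
-in_flight = OrderMessage("order-1", fence.current_epoch())
-fence.engage()
-fence.release()
-print(fence.admit(in_flight))  # stale epoch, dropped even after release
-```
-
-The point of the epoch is the last line: a message created before engagement
-stays rejected after the switch is released, with no drain timeout needed.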
-
-**Claude Opus unique findings (not in either other model):**
-- **Ordering contradiction in engagement sequence** — identified that the
-  documented order (DB → ETS → terminate) creates a specific risk if a crash
-  occurs BETWEEN termination and ETS update (not just DB failure). The insight
-  is about the window where termination has started but gate is still open.
-  More subtle than GPT-5's version (which focused on DB-blocking-engage).
-- **Concurrent engagement race (mode escalation)** — multiple triggers
-  simultaneously issuing conflicting modes (RESTRICT vs LIQUIDATE). Proposed
-  explicit escalation rules (LIQUIDATE always wins) with GenServer serialization.
-- **Shared resources under per-user scope** — per-user kill switch doesn't
-  address orders in shared broker connection buffers. Forces architectural
-  decision about connection pooling strategy.
-- **Clock/time integrity for audit log** — monotonic counters + NTP validation
-  for forensic reliability.
-- **Partial multi-user engagement failures** — what happens when global engage
-  successfully terminates 4/5 user pipelines but one has orphaned processes.
-- **Liquidation direction validation** — similar to GPT-5's exposure validation
-  but framed differently: checking corrupted position records could cause
-  liquidation to OPEN positions rather than close them.
-- **Process termination verification** — checking that `:kill` signals actually
-  worked (defense against trap_exit, NIF blocking).
-- **Engagement latency SLA** — defining a 50ms target with monitoring/alerting.
-
-**Claude Sonnet findings (all also present in GPT-5 or Opus, differently framed):**
-- No genuinely unique improvements that GPT-5 or Opus didn't also identify.
-- Several were generic: "missing resource cleanup," "circuit breaker integration,"
-  "performance monitoring" — exactly the kind of advice the prompt tried to
-  exclude.
-- The "missing heartbeat" and "network partition handling" proposals were solid
-  but less detailed than the corresponding GPT-5/Opus versions.
-
-**Quality assessment:**
-- **GPT-5** produced the most ACTIONABLE improvements. Its proposals were
-  architecturally concrete ("add an egress proxy," "use kill epochs in messages,"
-  "query broker post-engage") and showed defense-in-depth thinking — multiple
-  independent layers rather than fixing one path. The infrastructure kill (#2)
-  is genuinely novel: no other model proposed going OUTSIDE the application
-  boundary for safety enforcement. GPT-5 consistently thought about "what if
-  this entire runtime is compromised?" rather than just fixing within-app paths.
-- **Claude Opus** produced equally numerous improvements (15) with characteristic
-  precision about failure SEQUENCES. Its unique strength: identifying design
-  contradictions rather than just gaps (the engagement ordering issue, concurrent
-  mode escalation, shared-resource scope mismatch). Opus's proposals were more
-  "fix the design tension" while GPT-5's were more "add another safety layer."
-  Opus also included the process termination verification and engagement latency
-  SLA — operational rigor that GPT-5 skipped.
-- **Claude Sonnet** produced 12 proposals in 40s (fast) but quality was notably
-  lower. Several proposals were generic software engineering advice that the
-  prompt explicitly excluded ("add performance monitoring," "resource cleanup").
-  No unique insights emerged. Sonnet's proposals lacked the architectural depth
-  of GPT-5 (no outside-the-application thinking) and the design-tension
-  identification of Opus.
-
-**Key insight — generative vs analytical tasks:**
-
-This is the first experiment testing a GENERATIVE task ("propose improvements")
-rather than a purely analytical one ("find problems"). The results reveal:
-
-1. **GPT-5's defense-in-depth thinking is unique.** In analytical tasks, GPT-5
-   finds exhaustive lists of issues. In generative tasks, it proposes LAYERED
-   solutions — multiple independent mechanisms that each catch what the others
-   miss. The infrastructure kill proposal (external to the application) shows
-   GPT-5 reasoning about failure modes that are invisible to within-app analysis.
-
-2. **Opus's design-tension identification transfers to improvement proposals.**
-   In analytical tasks, Opus finds where parts of a design contradict each other.
-   In generative tasks, this manifests as proposals that RESOLVE tensions rather
-   than just adding patches. The engagement ordering contradiction and mode
-   escalation rules are both "this design says X but the mechanism allows Y —
-   here's how to make them consistent."
-
-3. **Sonnet doesn't transfer well to generative tasks.** In analytical tasks
-   (assumption-finding, cross-component analysis), Sonnet performs well (85% of
-   GPT-5 in some experiments). In generative tasks, it falls back to generic
-   engineering advice. The task requires both identifying problems AND proposing
-   concrete solutions — Sonnet handles the first step but not the second with
-   sufficient depth.
-
-**Comparison to analytical task performance:**
-
-| Task type | GPT-5 character | Opus character | Sonnet character |
-|---|---|---|---|
-| Assumption-finding (#10-12) | Exhaustive breadth | Design tensions | Good (85% of GPT-5) |
-| Race conditions (#13) | Technical precision | Design contradictions | Weak (errors) |
-| Invariant violations (#20) | Maximum selectivity | Self-correcting depth | Imprecise |
-| **Design improvements (#24)** | **Defense-in-depth layers** | **Tension resolution** | **Generic advice** |
-
-The generative task reveals each model's reasoning STYLE more clearly than analytical tasks.
-GPT-5's reasoning appears to let it construct multi-layered solutions. Opus's internal
-reasoning appears to let it identify what a design SHOULD be (not just what's wrong).
-Sonnet pattern-matches against known engineering practices without deep synthesis.
-
-**Practical implication:**
-
-For design improvement sessions on safety-critical systems:
-- Run GPT-5 for defense-in-depth proposals ("what layers should exist?")
-- Run Opus for design consistency proposals ("where does the design contradict itself?")
-- Skip Sonnet — its output is indistinguishable from generic checklists
-- The combination of GPT-5 + Opus produces complementary improvements: GPT-5 adds
-  safety layers, Opus fixes internal contradictions. Together they address both
-  "not enough protection" and "protection mechanisms that work against each other."
-
-**Cost analysis:**
-GPT-5: 118s, ~10.9K tokens (6K reasoning). Opus: 127s, ~5K tokens. Sonnet: 40s, ~1.6K tokens.
-For a safety-critical design review, running GPT-5 + Opus costs ~16K tokens and produces
-30 improvements with near-zero overlap in unique insights. Excellent ROI for a kill switch
-design that protects real money.
-
-### 25. Contradiction detection: NEW task type — Opus excels at finding genuine contradictions with precision; GPT-5 is thorough but spends heavily on reasoning; Sonnet finds surface-level issues quickly
-
-**Date:** 2026-05-05
-**Task:** Identify internal contradictions, logical inconsistencies, and conflicting rules
-in gargoyle's `order-state-machine.md` (311 lines) — a document defining states,
-transitions, invariants, fill precedence rules, and time-in-force behavior.
-**How we used them:** Same document (full text) + same focused analytical question to all
-3 models via HAI proxy. Prompt specifically asked for: state machine contradictions,
-semantic conflicts, rule violations, implicit contradictions, and terminology
-inconsistencies. Required each finding to quote the conflicting statements, explain
-the logical argument, assign severity, and recommend which statement should "win."
-No tools, no project context beyond the document itself.
-
-| Model | Time | Output tokens | Reasoning tokens | Contradictions found |
-|---|---|---|---|---|
-| GPT-5 | 162s | 12,074 | 11,008 | 4 |
-| Claude Opus 4.6 | 41s | 2,056 | (internal) | 6 |
-| Claude Sonnet 4.6 | 17s | 826 | (internal) | 4 |
-
-**What they found — common ground (2+ models identified):**
-
-- **Missing `pending_cancel → partially_filled` revert transition** (GPT-5 #1 +
-  Opus #2 + Sonnet partial): The "Rejection reverts" invariant states orders return
-  to their "pre-modification state (`working` or `partially_filled`)", but the state
-  diagram only shows `pending_cancel → working` for cancel rejection — no path back
-  to `partially_filled`. All models correctly identified this as the diagram being
-  incomplete relative to the stated invariant. GPT-5 and Opus rated CRITICAL.
-- **Same issue for `pending_replace` revert** (GPT-5 #1 + Opus #3): The state diagram
-  only shows `pending_replace → working` for replace rejection, but a replace
-  requested from `partially_filled` should revert to `partially_filled`. Same root
-  cause as above, just the replace variant.
-- **FOK "never partially fills" vs state machine allowing it** (GPT-5 #2 + Opus #4):
-  The TIF table says FOK "never partially fills" but the state machine has no guards
-  preventing FOK orders from reaching `partially_filled`. Both correctly noted this
-  is a broker-enforced guarantee but the document presents it as system-level.
-- **`rejection_reason` described as "broker-provided" but local rejections exist**
-  (GPT-5 #4 + Opus #5 + Sonnet): `pending → rejected` is "local validation failure"
-  with no broker interaction, but the field says "Broker-provided reason when
-  rejected." All three caught this terminology inconsistency.
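-
-One way to satisfy the "Rejection reverts" invariant for both `pending_cancel`
-and `pending_replace` is to remember the pre-modification state. The sketch
-below is a hypothetical Python model, not the document's design: `Order`,
-`prior_state`, `request`, and `reject_modification` are names invented here, and
-fills arriving during a pending modification are not modeled.
-
-```python
-from dataclasses import dataclass
-from typing import Optional
-
-REVERTIBLE = {"working", "partially_filled"}
-PENDING = {"cancel": "pending_cancel", "replace": "pending_replace"}
-
-
-@dataclass
-class Order:
-    state: str = "working"
-    prior_state: Optional[str] = None  # set only while a modification is pending
-
-    def request(self, kind: str) -> None:
-        """Enter pending_cancel or pending_replace, remembering where we came from."""
-        if self.state not in REVERTIBLE:
-            raise ValueError(f"cannot request {kind} from {self.state}")
-        self.prior_state = self.state
-        self.state = PENDING[kind]
-
-    def reject_modification(self) -> None:
-        """Broker rejected the cancel/replace: revert to the pre-modification state."""
-        if self.prior_state is None:
-            raise ValueError("no pending modification to reject")
-        self.state, self.prior_state = self.prior_state, None
-
-
-order = Order(state="partially_filled")
-order.request("cancel")
-order.reject_modification()
-print(order.state)
-```
-
-With the prior state stored, the diagram needs both `pending_* → working` and
-`pending_* → partially_filled` rejection edges, which is the gap the models
-flagged.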
-
-**GPT-5 unique findings (not in either other model):**
-
-- **IOC valid terminal states exclude `expired` vs generic expiry transitions** (#3):
-  IOC should never reach `expired` (unfilled portion is cancelled immediately), but
-  the state diagram allows any order to transition to `expired` without TIF guards.
-  Well-reasoned extension of the FOK finding to IOC semantics. GPT-5 correctly
-  identified that broker "expired-like" outcomes should map to `cancelled` for IOC.
-
-**Claude Opus unique findings (not in either other model):**
-
-- **Terminal states that aren't terminal — the `partially_filled` re-entry problem**
-  (#1): Opus identified the DEEPER contradiction beyond the surface-level "cancelled
-  states have outgoing transitions." When `cancelled → partially_filled` fires via
-  late fill, the order is now non-terminal with NO defined mechanism to re-terminate
-  if no further fills arrive. The order is stuck in `partially_filled` indefinitely.
-  This goes beyond "the diagram contradicts the definition of terminal" to "the fill
-  precedence rule creates an unspecified operational scenario." This is the most
-  architecturally significant finding across all three models.
-- **Fill precedence label misapplication to non-terminal states** (#6): The state
-  diagram labels transitions from `pending_cancel → partially_filled` and
-  `pending_replace → partially_filled` as "fill precedence," but the Fill
-  Precedence Rule explicitly defines itself as overriding TERMINAL states.
-  `pending_cancel` is non-terminal. The label conflates two different mechanisms
-  (fill during pending modification vs. fill overriding terminal state), which
-  could cause implementers to use the same code path for fundamentally different
-  scenarios.
-
-**Claude Sonnet unique findings (not in either other model):**
-
-- **State diagram terminal arrow contradiction** (#1): Sonnet was the only model to
-  explicitly note that the Mermaid diagram shows `cancelled → [*]` (terminal arrow)
-  while simultaneously showing `cancelled → partially_filled` (outgoing transition).
-  A valid observation but more surface-level than Opus's deeper analysis of the same
-  phenomenon.
-- **Pending replace fill logic error** (#3): Sonnet argued that receiving a fill
-  during `pending_replace` creates a logical impossibility because the order
-  parameters are in flux. This is WRONG — fills always apply to current parameters
-  (the replace hasn't been confirmed yet), and the document actually handles this
-  correctly. This is a FALSE POSITIVE from Sonnet.
-
-**Quality assessment:**
-
-- **Claude Opus** was the clear winner for this task. Found the most contradictions
-  (6), had the highest precision (0 false positives), and — crucially — found
-  qualitatively deeper issues. The `partially_filled` re-entry problem (#1) isn't
-  just "the diagram has a missing transition" but "the fill precedence rule creates
-  an unresolvable operational state." The fill precedence label misapplication (#6)
-  identifies a conceptual confusion that would genuinely cause implementation bugs.
-  Opus completed in only 41s with 2,056 output tokens — by far the most efficient.
-- **GPT-5** found 4 genuine contradictions with 0 false positives but spent an
-  extraordinary amount of reasoning tokens (11,008) for modest output (1,066 visible
-  content tokens, 10.3:1 reasoning ratio). The IOC finding was unique and valuable.
-  But the cost is disproportionate: 162s and 12K tokens for 4 findings vs Opus's
-  41s and 2K tokens for 6 findings. GPT-5's reasoning budget seems to have been
-  mostly spent on VERIFICATION (confirming each finding is genuine), consistent
-  with Finding #20's observation.
-- **Claude Sonnet** was fastest (17s) and found 4 items, but one was a false positive
-  (the pending_replace logic error claim is incorrect). That gives it a precision of
-  75% (3/4 genuine) — the lowest of the three. Its genuine findings were all also
-  found by the other models (no unique true contributions). Sonnet appears to trade
-  speed for accuracy on contradiction detection.
-
-**Key insight — contradiction detection favors precision-oriented models:**
-
-This task is fundamentally about LOGICAL ARGUMENTATION: proving that two statements
-cannot both be true. Unlike assumption-finding (which is about imagining what could go
-wrong) or gap-finding (which is about identifying missing content), contradiction
-detection requires the model to:
-1. Hold two statements in working memory simultaneously
-2. Construct a formal argument for why they conflict
-3. NOT get confused by statements that SEEM contradictory but are actually consistent
-
-Requirement #3 is where models diverge. Sonnet produced a false positive because it
-didn't fully reason through whether the pending_replace fill scenario is actually
-inconsistent (it isn't — current parameters apply). Opus avoided this trap entirely
-and additionally found DEEPER contradictions that require multi-step logical reasoning
-(the re-entry problem, the label misapplication). GPT-5 also avoided false positives
-but at massive computational cost.
-
-**Opus's efficiency advantage:**
-This is the first task where Opus is not just qualitatively better but also
-quantitatively more efficient. 6 findings in 41s and 2K tokens vs GPT-5's 4 findings
-in 162s and 12K tokens. That's roughly 9x more findings per token (6/2,056 vs 4/12,074) and 4x faster. For
-contradiction detection specifically, Opus appears to have a structural advantage —
-possibly because its internal reasoning is better calibrated for logical argumentation
-than GPT-5's externalized reasoning chain.
-
-**Comparison to Finding #20 (invariant violation paths):**
-In Finding #20, GPT-5 was maximally selective (3 findings, all genuine, 15:1
-reasoning ratio). Here, GPT-5 shows the same pattern: few findings, all genuine,
-high reasoning ratio (10.3:1). The difference: in #20, GPT-5's selectivity meant
-it found UNIQUE violations others missed. Here, three of GPT-5's four findings were
-also found by Opus (only the IOC finding was unique), and Opus found 2 contradictions
-that GPT-5 missed. GPT-5's high verification bar helps less when Opus is ALSO precise
-AND more thorough.
-
-**Updated task-model assignment:**
-
-For contradiction/consistency checking:
-1. **Opus** — best choice: highest precision, deepest contradictions, most efficient
-2. **GPT-5** — solid backup: zero false positives, unique TIF-related insights, but
-   expensive and slower
-3. **Sonnet** — NOT recommended for this task: produces false positives, no unique
-   true contributions
-
-This confirms the emerging pattern: each model has task types where it excels.
-Opus excels at logical argumentation and design tensions. GPT-5 excels at
-exhaustive enumeration and operational concerns. Sonnet excels at speed and
-structural/assumption analysis but struggles with tasks requiring formal logical
-reasoning (contradiction detection, concurrency analysis per Finding #13).
-
-**Practical implication:** When reviewing architecture documents for internal
-consistency (e.g., before implementation begins), run Opus. If budget allows,
-add GPT-5 for TIF/edge-case coverage. Skip Sonnet for consistency checking —
-its speed advantage is negated by the false positive risk.
-
-### 26. Missing-feature identification IS promptable across all models; prompt framing eliminates Opus's historical advantage — all three models find regulatory gaps when explicitly asked
-
-**Date:** 2026-05-05
-**Task:** Identify computations, behaviors, or features that gargoyle's
-`corporate-actions.md` (992 lines) SHOULD perform for financial correctness,
-regulatory compliance, or operational safety — but doesn't describe.
-**How we used them:** Same document (full text) + same focused analytical
-prompt to all 3 models via HAI proxy. Prompt explicitly structured around 5
-categories: missing computations, missing behaviors, missing validations,
-missing integrations, and regulatory gaps. Required concrete findings with
-severity. No tools, no project context beyond the document. GPT-5 via
-OpenAI endpoint (16K max_completion_tokens), Opus 4.6 and Sonnet 4.6 via
-Anthropic endpoint (8K max_tokens).
-
-| Model | Output tokens | Reasoning tokens | Findings | Critical | High | Medium |
-|---|---|---|---|---|---|---|
-| GPT-5 | 11,354 | 8,512 | 20 | 3 | 10 | 7 |
-| Claude Opus 4.6 | 4,111 | (internal) | 23 | 6 | 10 | 7 |
-| Claude Sonnet 4.6 | 4,686 | (internal) | 15 | 5 | 6 | 4 |
-
-**What they found — common ground (all 3 identified):**
-- Wash sale rule interaction with CA-driven lot closures (IRC §1091)
-- Short position treatment for corporate actions
-- Same-day corporate action ordering beyond `recorded_at` timestamp
-- Record date / ex-date position verification (entitlement timing)
-- Idempotency guard preventing double-application per user
-- Decimal precision/rounding policy unspecified
-- Superseded CA status has no lot rollback mechanism
-- Rights/warrants post-creation lifecycle (exercise/expiration)
-- Basis preservation invariant has no runtime enforcement
-- Manual entry authorization and audit trail
-
-**GPT-5 unique findings (not in either Claude model):**
-- Per-lot eligibility based on entitlement date (not just user-level)
-- Election-based outcomes for shareholder choices (cash vs stock)
-- Instrument-level trading hold during CA application window
-- Pre-application consistency checks against broker entitlements
-- DB-level enforcement of status transitions and invariants
-- Action-type-specific date semantics per field (ex vs record vs payable)
-- Voluntary/tender actions beyond distributions
-- Backfill/initialization guard for newly onboarded users
-- Applicator retry/backoff semantics and confirmation race
-- Rights indivisibility constraints vs exact Decimal quantities
-
-**Claude Opus unique findings (not in either other model):**
-- Pending order PRICE adjustment after splits (not just cancellation)
-- Multi-instrument position recalculation atomicity for mergers
-- Mixed merger basis floor at zero (can produce negative basis)
-- Tax lot identification method interaction with inherited dates
-- Corporate action effect on strategy position limits/risk params
-- Corporate actions on instruments not yet in the database
-- Partial application window: new user acquires position mid-fan-out
-- IRC §305(c) deemed distributions (taxable stock dividends)
-- CA impact on unrealized P&L display and strategy evaluation
-- Concurrent OrderManager startup + Applicator fan-out race
-
-**Claude Sonnet unique findings (not in either other model):**
-- Stale orders: failure modes table contradicts "excluded" section
-- IRC §1223(1) holding period tacking verification at lot close
-- Spinoff allocation percentage — no validation child != parent instrument
-- Combined spinoff allocations exceeding meaningful bounds
-- Cash dividend bypasses OrderManager — record-date quantity snapshot lost
-- Mixed merger large-denominator exchange ratio overflow
-- Detector schedule: no intraday re-poll for same-day announcements
-- ROC cumulative basis floor tracking and IRC §301(c)(3) distinction
-- Mixed merger deferred loss not explicitly recorded in metadata
-
-**Quality assessment:**
-- **Claude Opus** was the MOST PROLIFIC (23 findings) — a notable inversion
-  from previous experiments where Opus typically found fewer but deeper
-  findings. Here, the explicit "missing feature" framing appears to have
-  unlocked Opus's breadth. Its unique findings included genuinely critical
-  items: pending order price adjustment after splits (Critical — direct
-  financial loss), multi-instrument atomicity for mergers (Critical —
-  position loss), and mixed merger negative basis (High — accounting
-  corruption). The findings were precise, well-reasoned, and showed both
-  regulatory depth (IRC §305(c)) and operational awareness.
-- **GPT-5** was slightly less prolific (20 findings) but maintained its
-  characteristic breadth and operational-level thinking. Per-lot eligibility
-  (not just per-user) is a subtle but important distinction. The election-
-  based outcomes finding shows awareness of real-world corporate action
-  complexity. The backfill/initialization guard is operationally significant.
-  GPT-5 spent 8,512 reasoning tokens — moderate for its output volume.
-- **Claude Sonnet** found fewer gaps (15) but several were genuinely
-  insightful. The internal contradiction between the failure modes table
-  and the "excluded" section is a real document inconsistency. The cash
-  dividend record-date quantity snapshot insight (#9) identifies a DATA LOSS
-  problem — the opportunity to capture that data expires. The mixed merger
-  deferred loss recording gap shows regulatory awareness. However, some
-  findings were more surface-level or overlapped heavily with the others.
-
-**KEY INSIGHT — The original question from Finding #22 is ANSWERED:**
-
-> "Opus's 'missing feature identification' mode (wash sales, commissions) —
-> is this promptable on other models? Could we explicitly ask GPT-5 'what
-> should this system compute but doesn't' and get similar results?"
-
-**YES.** When explicitly prompted with a structured "missing feature"
-framing, ALL three models found regulatory gaps (wash sales, IRC sections),
-missing computations (basis calculations, rounding), and missing behaviors
-(lifecycle events, notifications). GPT-5 produced findings in the same
-*category* as what Opus uniquely found in Finding #22 (silent correctness
-failures on specid-lot-selection.md).
-
-In Finding #22, Opus uniquely identified wash sales and commission tracking
-as missing features while GPT-5 focused on mechanism incorrectness and
-Sonnet on composition failures. HERE, with the explicit "what's missing"
-prompt, ALL three models found wash sales, ALL found regulatory gaps, and
-ALL found missing behaviors.
-
-**This confirms:** Opus's "missing feature identification" mode in Finding
-#22 was NOT an inherent model capability — it was an emergent behavior from
-the open-ended "silent correctness failures" prompt. When you give ALL models
-the EXPLICIT instruction to look for missing features, they all do it. The
-differentiation from #22 was caused by the prompt being more open-ended,
-allowing each model to default to its natural analytical mode:
-- Opus → "what's missing" (features/functionality)
-- GPT-5 → "what's wrong" (mechanism failures)
-- Sonnet → "what breaks when combined" (composition)
-
-**Prompt framing dominates model personality.** With the right prompt,
-any model can be directed into any analytical mode. The model differences
-that emerged in earlier open-ended experiments reflect DEFAULT TENDENCIES,
-not capabilities.
-
-**NEW finding about Opus on complex documents:**
-Opus produced MORE findings than GPT-5 (23 vs 20) — the first time this
-has happened on a broad analytical task. Previous pattern: GPT-5 always
-finds more (20-33 findings) while Opus finds fewer but deeper (7-13).
-What changed? The document is 992 lines — the longest tested — and the
-task is explicitly about breadth ("find all gaps"). On this specific
-combination (long document + breadth-focused prompt), Opus appears to
-allocate its internal reasoning budget toward exploration rather than
-its usual depth-first design-tension mode. This suggests Opus's typical
-"fewer but deeper" pattern is partially a RESPONSE to shorter documents
-where depth is more productive than breadth.
-
-**Practical implications:**
-1. For missing-feature analysis: prompt structure matters more than model
-   choice. All three models are viable. Use the explicit 5-category prompt.
-2. Run all three for critical docs — they find different specific gaps
-   despite finding the same categories.
-3. For open-ended analysis where you want models to find DIFFERENT things:
-   use open-ended prompts. For analysis where you want COMPREHENSIVE
-   coverage of one type: use structured prompts.
-4. Opus's "fewer but deeper" personality can be overridden by document
-   length + breadth-focused prompt. On 992-line docs, it competes on
-   volume with GPT-5.
-
-**Cost-effectiveness:**
-Opus: 4,111 output tokens for 23 findings = 179 tokens/finding
-GPT-5: 11,354 output tokens (+ 8,512 reasoning) for 20 findings = 993 tokens/finding
-Sonnet: 4,686 output tokens for 15 findings = 312 tokens/finding
-
-Opus is by far the most efficient: about 5.5x fewer tokens than GPT-5 per
-finding, with MORE findings. This is the strongest cost-effectiveness case
-for Opus on any tested task. On long documents with breadth-focused prompts,
-Opus appears to be the optimal choice for both quality AND efficiency.
-
-### 28. Cross-document consistency analysis: NEW task type — GPT-5 finds deep semantic contradictions; Opus finds structural/boundary mismatches; Sonnet identifies core issues quickly
-
-**Date:** 2026-05-05
-**Task:** Identify contradictions and inconsistencies BETWEEN two architecture documents
-describing the same system: `system-overview.md` (323 lines, narrative overview with
-component flows, invariants, and domain events) and `architecture.md` (213 lines,
-DDD-focused with bounded contexts, context map, and message taxonomy).
-**How we used them:** BOTH documents provided as full text in a single prompt (~25KB
-total). Highly structured prompt specifying 5 categories of cross-document inconsistency
-(terminology conflicts, structural contradictions, flow/sequence conflicts,
-ownership/authority conflicts, philosophical contradictions). Required specific output
-format per finding. Explicitly excluded omissions (things one doc covers and the other
-doesn't) and detail-level differences. No tools, no project context beyond the two
-documents. This is a NEW analytical task not previously tested: reasoning about
-CONSISTENCY BETWEEN documents rather than internal coherence of a single document.
-
-| Model | Time | Output tokens | Reasoning tokens | Inconsistencies found | Critical | High | Medium |
-|---|---|---|---|---|---|---|---|
-| GPT-5 | 125s | 9,415 | 8,384 | 6 | 2 | 3 | 1 |
-| Claude Opus 4.6 | 52s | 2,351 | (internal) | 7 | 3 | 3 | 1 |
-| Claude Sonnet 4.6 | 14s | 776 | (internal) | 4 | 1 | 2 | 1 |
-
-**What they found — common ground (all 3 identified):**
-- Event sourcing (all events as source of truth) vs fills-only ground truth:
-  Document A says fills are "ground truth from which all other state can be
-  derived," while Document B says "events are the source of truth, state is
-  computed by replaying events." A treats fills as the recovery foundation;
-  B treats ALL domain events as authoritative. All three models rated this
-  Critical.
-- Bounded context naming mismatch: "Decision Engine" / "Order Management" (A)
-  vs "Engine" / "Trading" (B) for the same functional responsibilities.
-  GPT-5 folded this into a broader ownership analysis; Opus and Sonnet
-  surfaced it as its own finding.
-- Signal classification conflict: Document A lists "Signal emitted" as a domain
-  event; Document B explicitly categorizes `SignalEmitted` as an audit event
-  ("not used to rebuild state"). This determines event store design and
-  recovery semantics.
-
-**GPT-5 unique findings (not in either Claude model):**
-- Signal persistence contradiction: Document A states "Signals are never
-  persisted" while Document B lists `SignalEmitted` as an audit event that IS
-  persisted and states the audit log is mandatory for trading. These are
-  directly incompatible claims about whether signal data is stored.
-- Audit event ownership conflict: Document A says "Decision approved" events
-  originate from PortfolioRisk. Document B states "only the decision engine
-  writes audit events" and lists `DecisionApproved` as an audit event example.
-  If PortfolioRisk is part of Risk (not Engine), this is an authority violation.
-- "Single writer per user" (A: OrderManager writes all trading state) vs
-  per-aggregate single-writer (B: each aggregate writes its own event stream,
-  Ledger owns positions). These are incompatible authority models — either OM
-  centralizes writes or each domain owns its own events.
-
-**Claude Opus unique findings (not in either other model):**
-- Engine → OrderManager is an internal pipeline flow (A: same subgraph, direct
-  arrow) vs Engine → Trading is a cross-domain COMMAND (B: `PlaceOrder` command
-  crossing a bounded context boundary). This structural disagreement determines
-  whether order management is an internal pipeline stage or an independent domain
-  with its own aggregates and command validation.
-- Signal Risk's architectural position: Document A shows a two-stage risk
-  architecture (SignalRisk pre-aggregation, PortfolioRisk post-aggregation)
-  where Risk is embedded in the pipeline. Document B's context map shows Risk
-  as a separate domain that Engine merely QUERIES ("kill switch active?") —
-  no arrow shows signal routing through Risk. Either risk logic lives inside
-  Engine (contradicting B's context boundary) or the context map is incomplete.
-- The "reduce" step ownership: A's top-level flow labels `Approved →|"reduce"|
-  Decisions` (reduction at aggregation), while A's own domain events table says
-  "Decision reduced" originates from PortfolioRisk (reduction after aggregation).
-  This is actually an INTRA-document inconsistency in Document A, but Opus surfaced
-  it as part of cross-doc analysis.
-
-**Claude Sonnet unique findings:**
-- None genuinely unique. All 4 findings overlapped with GPT-5/Opus common ground
-  (event sourcing, signal classification, context count/naming). Sonnet was efficient
-  (14s, 776 tokens) but didn't identify any inconsistency that the other two missed.
-
-**Quality assessment:**
-- **GPT-5** produced 6 well-reasoned findings with the deepest analysis of
-  OWNERSHIP conflicts. Its signal-persistence contradiction and single-writer
-  authority conflict are genuinely important — they reveal places where the two
-  documents would lead implementers to build fundamentally different systems.
-  Every finding quotes specific text from both documents and explains precisely
-  WHY they can't both be correct. The reasoning investment (8,384 tokens) was
-  used for thorough cross-referencing between documents.
-- **Claude Opus** found the most inconsistencies (7) and was remarkably fast
-  (52s vs GPT-5's 125s). Its unique strength: identifying STRUCTURAL contradictions
-  about component boundaries and communication patterns. The Engine→Trading
-  command vs internal pipeline finding is architecturally the most significant
-  discovery — it reveals a fundamental disagreement about whether order
-  management is INSIDE or OUTSIDE the decision engine's boundary. Opus also
-  caught a bonus intra-document inconsistency (the "reduce" labeling error).
-- **Claude Sonnet** was the fastest (14s) and most concise (776 tokens) but
-  found only the obvious common-ground issues. For cross-document consistency,
-  Sonnet's speed advantage came at the cost of missing the architectural
-  insights that make this task valuable. It did correctly identify all the
-  Critical-level issues, making it viable as a quick first-pass screen.
-
-**Key insight — cross-document consistency is a DISTINCT task type:**
-This is fundamentally different from single-document analysis (assumptions,
-race conditions, coherence). It requires:
-1. Building a mental model from Document A
-2. Building a separate mental model from Document B
-3. Finding places where the models are incompatible
-4. Reasoning about WHY they can't both be correct (not just "different")
-
-Step 4 is what distinguishes this from simple diff-detection. Many surface
-differences (naming, detail level, scope) are NOT contradictions — the models
-must judge which differences are genuinely incompatible vs. complementary.
-The prompt explicitly excluded omissions and detail-level differences, and
-all three models respected this constraint well.
-
-**Model strengths on cross-document analysis:**
-- **GPT-5** excels at ownership/authority conflicts: it systematically
-  checked "who owns this concept" in each document and found mismatches.
-  Its findings cluster around "who writes what" and "who is authoritative."
-- **Opus** excels at structural/boundary contradictions: it identified where
-  the documents draw architectural lines differently. Its findings cluster
-  around "where are the boundaries" and "what crosses them."
-- **Sonnet** identifies the obvious/critical issues quickly but doesn't dig
-  deeper. Viable for screening, not for thorough analysis.
-
-**Comparison to Finding #15 / #27 (single-document coherence checking):**
-Single-document coherence asks "does this document contradict itself?"
-Cross-document consistency asks "do these documents contradict each other?"
-Key differences in results:
-
-| Aspect | Single-doc coherence | Cross-doc consistency |
-|---|---|---|
-| Opus findings | 5-7 | 7 |
-| GPT-5 findings | 4-6 | 6 |
-| Sonnet findings | 4-5 | 4 |
-| Opus unique | Design tensions | Structural/boundary mismatches |
-| GPT-5 unique | Definitional errors | Ownership/authority conflicts |
-| Best model | Task-dependent | Opus (most findings, faster than GPT-5) |
-
-The relative ordering is similar (Opus ≥ GPT-5 > Sonnet for coherence-style
-tasks), but the CHARACTER of unique findings shifted. On single-doc coherence,
-Opus finds design tensions within a single design. On cross-doc consistency,
-Opus finds BOUNDARY disagreements between two designs. GPT-5 shifts from
-finding definitional errors to ownership conflicts.
-
-**Are these findings REAL bugs in the gargoyle documentation?**
-Yes — several are genuine issues worth fixing:
-1. The fills-vs-events-as-ground-truth is a real philosophical tension between
-   the two documents that needs resolution.
-2. The Position event ownership (OrderManager vs Ledger) is a real boundary
-   conflict that affects implementation.
-3. The Engine→Trading communication style (internal pipeline vs cross-domain
-   command) is a genuine structural ambiguity.
-4. The signal persistence claim ("never persisted" vs `SignalEmitted` audit
-   event) is a direct textual contradiction.
-
-These are the kind of cross-document inconsistencies that cause teams to build
-inconsistent implementations — one engineer reads Document A and builds one way,
-another reads Document B and builds differently.
-
-**Practical implication:** Cross-document consistency analysis is a high-value
-task for documentation maintenance. Run it when:
-- A system has multiple architecture docs written at different times
-- A refactoring has updated one doc but not another
-- Multiple people contribute to design documentation
-- Moving from high-level overview to detailed specification
-
-Opus is the recommended model for this task: faster than GPT-5 (52s vs 125s), most
-findings (7 vs 6), and uniquely strong at boundary disagreements. GPT-5 adds
-value for ownership-specific conflicts. Sonnet is sufficient for quick
-screening (catches the Critical issues in 14s) but won't find the architectural
-insights.
-
-**Cost-effectiveness:**
-Opus: 2,351 output tokens for 7 findings = 336 tokens/finding (52s)
-GPT-5: 9,415 output + 8,384 reasoning for 6 findings = 2,967 tokens/finding (125s)
-Sonnet: 776 output tokens for 4 findings = 194 tokens/finding (14s)
-
-Opus is the clear winner on this task type: more findings than GPT-5, 2.4x
-faster, and 8.8x more token-efficient per finding. GPT-5's massive reasoning
-investment (8,384 tokens) produced only one fewer finding than Opus — the
-verification overhead is not paying off here because cross-document contradictions
-are relatively easy to verify once identified (just check both documents).
-
-### 29. Adversarial manipulation analysis: NEW task type — GPT-5 is most exhaustive and systematic; Opus produces qualitatively different attack vectors with system-level thinking; Sonnet is adequate but less creative
-
-**Date:** 2026-05-05
-**Task:** Identify adversarial manipulation paths in gargoyle's `aggregation.md` (193 lines)
-— how a misbehaving, compromised, or buggy upstream component could exploit the
-aggregator's design guarantees to produce harmful trading outcomes that bypass
-downstream safety controls.
-**How we used them:** Same document (full text) + same focused analytical question to all
-3 models via HAI proxy. Highly structured prompt specifying 5 categories of adversarial
-manipulation (signal injection, timing manipulation, capacity weaponization, state
-corruption via crash, audit evasion). Required specific output format per finding
-(attack vector, mechanism, exploit, why downstream controls miss it, severity). No tools,
-no project context beyond the document itself.
-
-| Model | Time | Output tokens | Reasoning tokens | Attack vectors found | Critical | High | Medium |
-|---|---|---|---|---|---|---|---|
-| Claude Sonnet 4.6 | 27s | 1,257 | (internal) | 10 | 3 | 5 | 2 |
-| Claude Opus 4.6 | 84s | 3,662 | (internal) | 12 | 5 | 5 | 0 |
-| GPT-5 | 111s | 8,808 | 6,336 | 15 | 2 | 10 | 3 |
-
-**What they found — common ground (all 3 identified):**
-- Primary signal hijacking via ranking manipulation (last-tick injection in
-  time-windowed aggregation to control decision parameters)
-- Threshold gaming via signal replay/duplication (no deduplication means N
-  identical signals satisfy "N confirmations")
-- Capacity flooding to force premature completion or deny legitimate trades
-- Strategic crash to erase unfavorable in-flight groups
-- Timeout-masqueraded manipulation (making attacks look like normal system behavior
-  in the audit trail)
-
-**GPT-5 unique findings (not in either Claude model):**
-- **Direction flip against majority via ranking:** In "most recent" ranking,
-  emit multiple SELL confirmations then inject a late BUY — the BUY becomes
-  primary and the decision contradicts the bulk of evidence. Distinct from
-  general primary hijack because it's specifically about *directional* reversal.
-- **Late-arrival exclusion of counter-signals:** Time signals so countervailing
-  signals arrive just after group destruction, ensuring the decision is formed
-  without dissenting inputs that would have altered ranking.
-- **Capacity filter to curate the audit set:** Pre-fill buffer with chosen
-  signals so riskier alternatives cannot be included before capacity fires —
-  the contributing signals list looks clean.
-- **Timer nullification by crash:** Crash just before a timeout that would
-  force-complete an unfavorable decision — the timer becomes no-op on restart,
-  no decision or expiry event is emitted.
-- **Decision drop via induced forwarding failure:** Exploit the "Decision
-  forwarding fails: Decision is lost" failure mode to selectively suppress
-  protective decisions (stops, hedges) with no automatic retry.
-- **Crash to erase evidence of contrary signals:** Post-crash, submit a
-  fresh group that completes quickly; audit shows only the new set, not the
-  earlier contradictory pre-crash signals.
-
-**Claude Opus unique findings (not in either other model):**
-- **Instrument fragmentation to multiply position size:** Emit signals for
-  economically equivalent exposures using different instrument identifiers.
-  Each gets its own group, each produces a separate decision, bypassing
-  per-group capacity limits. Combined position exceeds what any single group
-  would allow. Identifies TOCTOU at the fan-in to PortfolioRisk.
-- **Forced stale decision via timer exploitation:** Emit one signal at a
-  favorable price spike known to be transient, then deliberately withhold
-  further signals. Timer force-completes with a stale price. The entry price
-  WAS valid when the signal was generated — PortfolioRisk doesn't check
-  staleness of decision prices.
-- **Timeout prevention / keep-alive suppression:** Manipulate market data
-  feed to suppress signals that would reach threshold N. Group expires
-  normally — denial-of-trading attack disguised as insufficient confirmation.
-- **Crash-restart duplicate decisions:** Crash after decision is forwarded
-  but before strategy reflects it. Both restart "clean" — strategy re-emits
-  signals, aggregator produces a second decision with a fresh ID. Same trade
-  executes twice. PortfolioRisk can't deduplicate because IDs are different.
-- **Force-complete with insufficient confirmation (capacity < threshold):**
-  If capacity limit is lower than threshold, hitting capacity ALWAYS force-
-  completes before predicate is satisfied. Fundamentally changes a 5-confirmation
-  strategy into a 3-confirmation strategy.
-- **Pattern predicate as arbitrary decision trigger:** If adversary controls
-  predicate logic (via strategy configuration), can make pattern-complete
-  trigger on any single signal while audit shows algorithm=pattern-complete
-  and reason=:predicate. Trust boundary between configuration and execution.
-
-**Claude Sonnet unique findings (not in either other model):**
-- **Cross-group timing coordination:** Coordinate signal injection across
-  multiple instruments to synchronize completion times, creating a burst of
-  correlated decisions that overwhelm PortfolioRisk's individually-safe
-  evaluations. (NOTE: Opus found a similar concept — instrument fragmentation
-  — but framed it differently: Opus focused on position multiplication via
-  instrument aliasing, Sonnet focused on burst timing overwhelming evaluation.)
-- **Multi-strategy attack distribution:** Spread manipulation across multiple
-  isolated strategy aggregators so no single aggregator's behavior looks
-  abnormal while cumulative effect is harmful.
-
-**Quality assessment:**
-- **GPT-5** produced the most findings (15) with the most systematic coverage
-  across all 5 prompt categories. Its strength was in identifying SPECIFIC
-  INTERLEAVINGS — exactly how timing, state, and ranking mechanisms interact
-  to produce exploits. The direction-flip finding (#3) and the late-arrival
-  exclusion finding (#6) show precise temporal reasoning about when signals
-  arrive relative to group lifecycle events. The "decision drop via forwarding
-  failure" finding exploits a DOCUMENTED failure mode (from the failure table)
-  as an offensive weapon — turning a recovery mechanism into an attack vector.
-  Every finding references specific mechanisms from the spec.
-- **Claude Opus** produced 12 findings with the most architecturally creative
-  attacks. The instrument fragmentation attack is the most SYSTEMICALLY
-  dangerous finding across all three models — it's not about manipulating one
-  group but about the RELATIONSHIP between groups, and it identifies a
-  TOCTOU vulnerability at the PortfolioRisk fan-in point that no other model
-  found. The crash-restart duplication attack is also architecturally novel —
-  it exploits the "clean state" guarantee as a weapon for invisible trade
-  doubling. Opus consistently reasons about the system BOUNDARY (aggregator
-  → PortfolioRisk handoff) rather than just within-component mechanics. The
-  pattern-predicate trust boundary finding is uniquely about CONFIGURATION
-  as an attack surface.
-- **Claude Sonnet** produced 10 findings in 27s, extremely efficient (126
-  tokens per finding). Findings were adequate and covered all 5 categories,
-  but lacked the specificity of GPT-5 and the architectural creativity of
-  Opus. Several findings were somewhat generic (e.g., "crash at strategic
-  moments" without specifying exactly WHEN relative to group lifecycle).
-  The cross-group coordination and multi-strategy distribution findings show
-  system-level thinking but are stated at a higher abstraction level without
-  concrete exploit sequences.
-
-**Key insight — "adversarial manipulation analysis" as a task type:**
-This is qualitatively different from all previous analytical lenses tested.
-Previous tasks asked models to find problems WITH the design (assumptions,
-races, incoherences). This task asks models to find ways to USE the design
-AGAINST itself — a creative/generative adversarial task. Results:
-
-- **GPT-5** treats it as an exhaustive enumeration exercise — systematically
-  walks through each mechanism and asks "how could this be abused?" High
-  count (15), thorough coverage, but some findings are minor variations of
-  each other (e.g., crash-related findings #10, #12, #15 share the same core
-  mechanism). Reasoning tokens (6,336) used for both generation and verification.
-- **Opus** treats it as a creative design exercise — asks "what would a
-  smart adversary do that the designer didn't consider?" Fewer findings (12)
-  but several are genuinely novel attack concepts (instrument fragmentation,
-  crash-restart duplication, predicate trust boundary) that require reasoning
-  about the SYSTEM rather than the COMPONENT. Opus also provided a summary
-  table and systemic conclusion about the root design weaknesses.
-- **Sonnet** treats it as a categorization exercise — fills each prompt
-  category with plausible attacks but at a higher abstraction level. Fast
-  and adequate for a first pass but wouldn't surprise a security reviewer.
-
-**Comparison to "predictable exploit window" (Finding #18):**
-Finding #18 noted that Opus uniquely identified predictable exploit windows
-in escalation-policy.md. Here, Opus again shows the strongest adversarial
-creativity — the instrument fragmentation attack and crash-restart duplication
-are both about exploiting DESIGN GUARANTEES (per-instrument grouping, clean
-restart) as weapons. This confirms that Opus's strength on adversarial analysis
-is a CONSISTENT PATTERN, not document-specific.
-
-GPT-5 excels when the adversarial task is framed as "enumerate all possible
-abuses of each mechanism" (systematic coverage). Opus excels when the task
-requires "invent novel attack concepts that exploit design boundaries"
-(creative adversarial thinking).
-
-**Model hierarchy for adversarial manipulation analysis:**
-1. GPT-5 — most thorough enumeration, best at mechanism-level exploitation (15)
-2. Opus — most creative, finds system-boundary attacks others miss (12)
-3. Sonnet — adequate first pass, fast, but less specific (10)
-
-**Practical implication:** For security-oriented architecture review:
-- Run GPT-5 for comprehensive attack surface enumeration
-- Run Opus for novel/creative attack vectors that exploit design boundaries
-- Sonnet is sufficient only as a quick initial screen
-- The UNION of GPT-5 + Opus findings (removing overlaps) would produce the
-  most complete adversarial analysis
-
-**New finding about the aggregator itself:** Several attacks identified by
-multiple models point to real design weaknesses worth addressing:
-1. No signal deduplication/independence validation (all 3 models)
-2. Primary signal determines all decision parameters regardless of group
-   composition (all 3 models)
-3. Transient state + no replay = perfect adversarial erasure tool (all 3)
-4. Capacity/timeout treated as normal events even when weaponized (all 3)
-5. No cross-group correlation at aggregator level (Opus + Sonnet)
-6. TOCTOU at PortfolioRisk fan-in for concurrent decisions (Opus)
diff --git a/findings/README.md b/findings/README.md
new file mode 100644
index 0000000..e92e3df
--- /dev/null
+++ b/findings/README.md
@@ -0,0 +1,16 @@
+# Model Findings — Analytical & Research Work
+
+_Tracking what actually works (and doesn't) when using AI models for research,
+analysis, bias detection, and document review — not coding._
+
+Started: 2026-04-26
+
+## Context
+
+We use multiple models in different roles: Claude Code (Opus/Sonnet) for
+generation, Sonnet + GPT-5 for independent dual review, smaller models for
+focused analytical tasks. Most public discussion is about coding. We found
+almost no published methodology for using models in analytical research tasks
+(searched 2026-04-26). That gap is why we're tracking this.
+
+Each experiment lives in its own file. See the individual finding files in this directory.