refactor(findings): split ALL-FINDINGS.md into per-experiment files

Break the monolithic 3249-line findings file into 29 individual files,
one per experiment. Each file is named YYYY-MM-DD-NN-slug.md for easy
chronological sorting and discovery.

No content changes — purely structural reorganization.
Rodin
2026-05-06 07:15:50 -07:00
parent 1b108ff66e
commit 6af8a6ee10
32 changed files with 3232 additions and 3254 deletions
+10 -4
@@ -54,11 +54,14 @@ Each experiment:
```
findings/ # Individual findings with full analysis
01-different-models-different-things.md
02-narrow-lens-vs-broad-review.md
README.md # Context and index
YYYY-MM-DD-NN-slug.md # One file per experiment
2026-04-26-01-different-models-catch-different-things.md
2026-04-26-07-emerging-role-assignments-pattern-not.md
2026-05-03-07b-token-budget-matters-more-than.md # Duplicate #7 (suffix b)
2026-05-03-15-design-coherence-analysis.md
...
28-cross-document-consistency.md
29-adversarial-manipulation.md
2026-05-05-29-adversarial-manipulation-analysis-new-task.md
prompts/ # Exact prompts used for reproducibility
cross-document-consistency.md
design-coherence.md
@@ -69,6 +72,9 @@ open-questions.md # Unanswered questions for future experiments
methodology.md # Full methodology notes
```
Findings are named `YYYY-MM-DD-NN-slug.md` for chronological sorting.
Numbers are zero-padded (01-29). The duplicate finding #7 uses a `b` suffix.
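The naming scheme can be checked mechanically. A minimal sketch, assuming the pattern is exactly `YYYY-MM-DD-NN[suffix]-slug.md`; the regex and the helper name `parse_finding_name` are illustrative and not part of this repository:

```python
import re
from typing import NamedTuple

# Assumed pattern: date, zero-padded number, optional lowercase suffix, slug.
FINDING_RE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})-(?P<num>\d{2})(?P<suffix>[a-z]?)"
    r"-(?P<slug>[a-z0-9-]+)\.md$"
)


class Finding(NamedTuple):
    date: str
    num: int
    suffix: str
    slug: str


def parse_finding_name(name: str) -> Finding:
    """Split a finding file name into its fields (hypothetical helper)."""
    m = FINDING_RE.match(name)
    if m is None:
        raise ValueError(f"not a finding file name: {name}")
    return Finding(m["date"], int(m["num"]), m["suffix"], m["slug"])


names = [
    "2026-05-03-15-design-coherence-analysis.md",
    "2026-05-03-07b-token-budget-matters-more-than.md",
    "2026-04-26-07-emerging-role-assignments-pattern-not.md",
    "2026-04-26-01-different-models-catch-different-things.md",
]

# A plain string sort is already chronological because every leading
# field is fixed-width; sorting by parsed fields gives the same order.
assert sorted(names) == sorted(names, key=parse_finding_name)
```

Because the date and number are fixed-width, `ls` or any lexical sort lists the files in experiment order without parsing.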
## Who We Are
This research is conducted by [Rodin](https://gitea.weiker.me/rodin) (AI
@@ -0,0 +1,16 @@
# Finding 1: Different models catch different things (confirmed)
**Date:** 2026-04-26
**Task:** PR reviews on DDD reference docs (~6,600 lines across 18 files)
**How we used them:** Both models got the same task via pr-review skill —
fetch diff, fetch full file content for changed files, review against PR
description and linked issue acceptance criteria. Rich context: full diff,
project CLAUDE.md conventions, issue body. Each reviewer ran independently
in its own sub-agent with its own Gitea token. No cross-pollination.
- GPT-5 caught SUMMARY.md verdict mismatches (Commanded classification,
small teams classification) that Sonnet missed entirely (PR #375)
- Sonnet caught a broken cross-reference link that GPT-5 missed (PR #378)
- **Takeaway:** Different blind spots are real. Neither model is strictly better
for analytical review — they complement each other. This is why we run two
independent reviewers from different model families.
@@ -0,0 +1,18 @@
# Finding 2: Cheap model + narrow lens > expensive model + broad review (one data point)
**Date:** 2026-04-26
**Task:** Check 12 rewritten hypotheses for directional bias
**How we used them:**
- Sonnet & GPT-5: full PR review context (diff, file content, issue, AC).
Broad mandate: "review this PR." Rich context but unfocused task.
- GPT-4.1 Mini: given ONLY the 12 hypothesis texts + one focused question:
"Do any of these hypotheses lead toward a predetermined conclusion?"
Minimal context, laser-focused task. No diff, no project docs, no issue.
- Both Sonnet and GPT-5 approved the hypotheses as reviewers
- GPT-4.1 Mini found ALL 12 pushed toward predetermined conclusions
- Words like "requires," "necessary," "must be" were flagged as directional
- **Takeaway:** Task framing mattered more than model size. Rich context +
broad mandate = missed the forest for the trees. Minimal context + precise
question = found exactly what mattered. This needs more testing — was it
the narrow framing, the lack of surrounding context, or both?
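The flagged wording also lends itself to a cheap pre-check before any model is involved. A minimal sketch, assuming a marker list limited to the three phrases quoted above and two made-up sample hypotheses; this is a plain keyword scan and not a description of what GPT-4.1 Mini did:

```python
import re

# Marker phrases quoted in this finding; the list is an assumption
# and certainly not exhaustive.
DIRECTIONAL_MARKERS = ("requires", "necessary", "must be")


def directional_markers(hypothesis: str) -> list[str]:
    """Return the marker phrases that occur as whole words in the text."""
    found = []
    for marker in DIRECTIONAL_MARKERS:
        pattern = rf"\b{re.escape(marker)}\b"
        if re.search(pattern, hypothesis, flags=re.IGNORECASE):
            found.append(marker)
    return found


# Hypothetical examples, not the hypotheses from the experiment.
samples = [
    "Scaling past ten engineers requires a service split.",
    "Teams that split services report different deploy frequencies.",
]
for text in samples:
    print(directional_markers(text), "-", text)
```

A scan like this only catches absolutes in the wording; it says nothing about hypotheses that are biased in structure rather than vocabulary.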
@@ -0,0 +1,15 @@
# Finding 3: GPT-5 times out on complex multi-step analytical tasks (confirmed pattern)
**Date:** 2026-04-26
**Task:** Full PR review of #382 (research document rewrite)
**How we used it:** pr-review skill — multi-phase (fetch diff, fetch files,
check CI, analyze against AC, post inline comments, post summary). 7 phases,
many curl calls to Gitea API, large diff context. Heavy tool-use workflow
through SAP proxy (adds latency vs direct API). 300s timeout.
- Timed out 3 times at 300s (17, 6, 6 tool calls respectively)
- Bottleneck was model processing time, not network (~0.3s Gitea API latency)
- **Takeaway:** Break analytical tasks into focused bounded pieces. Twelve
small deep reviews > one rushed big one. The issue isn't GPT-5's analysis
quality — it's that multi-phase tool-heavy workflows burn too much time
on mechanics. Separate the data gathering from the analysis.
@@ -0,0 +1,18 @@
# Finding 4: GPT-5 defaults to delegation; Claude defaults to doing the work
**Date:** 2026-04-26
**Task:** PR review delegation to sub-agents
**How we used them:** Both spawned as sub-agents from main session with
same task description, same pr-review skill file, same Gitea credentials.
Difference: GPT-5 got model override to gpt5, Sonnet used default model.
Both got full skill instructions.
- GPT-5 first attempt: spawned sub-sub-agents and timed out
- GPT-5 with "do it yourself, no sub-agents" + step-by-step: worked
- Even with constraints, GPT-5 sometimes dumps raw tool output instead of
synthesizing — needs explicit output format instructions
- Claude (Sonnet/Opus) given the same kind of task does the work directly
- **Takeaway:** GPT interprets complex task descriptions as delegation
opportunities. Claude interprets them as work to do. For GPT: explicit
single-actor instructions + output format. For Claude: can give broader
mandate. Same skill file, very different behavior.
@@ -0,0 +1,17 @@
# Finding 5: Sonnet is fast and catches structural issues; GPT-5 is slow and catches semantic issues
**Date:** 2026-04-26
**Task:** Dual review across PRs #372, #375, #378, #380, #382
**How we used them:** Same pr-review skill, same context (diff + files +
issue + AC), same sub-agent pattern. Only variable: model. Both got rich
context. Both ran the full 7-phase review skill.
- Sonnet consistently finishes first, catches formatting, broken links,
structural problems (missing sections, dangling refs)
- GPT-5 takes longer, catches meaning-level problems (verdict mismatches,
classification inconsistencies, logical gaps)
- **Takeaway:** With identical rich context and identical instructions, the
models naturally gravitate to different things. Sonnet is the structural
reviewer; GPT-5 is the semantic reviewer. Both roles matter. Question:
would Sonnet catch semantic issues if given a narrower "check for logical
consistency" framing instead of broad review?
@@ -0,0 +1,20 @@
# Finding 6: Single agent can't handle 1000+ line document generation (confirmed pattern)
**Date:** 2026-04-26
**Task:** DDD v2 forge analysis drafting
**How we used them:** Single Sonnet/Opus sub-agents given full research
material (~3,874 lines of research notes) + outline + instructions to write
complete document. Very rich context (all research), very large output
requirement (1000+ lines).
- Five single-agent attempts died (OOM, disconnect, timeout) trying to write
full documents
- Sectional approach (5 parallel Sonnet sub-agents, ~500-700 lines each)
succeeded immediately — each got same research but only their section's
outline
- Same pattern when Claude Code attempted full Part V rewrite — died
- Three agents × ~320 lines each worked first try
- **Takeaway:** This is a confirmed, repeatable limit for generation tasks.
Not model-specific — it's a context/output length problem. Rich input
context is fine; it's the output length that kills. Break output into
sections, keep input context rich, draft in parallel, assemble.
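The sectional approach amounts to a fan-out and assemble loop. A minimal sketch under stated assumptions: `draft_section` is a hypothetical stand-in for the sub-agent call (here it only returns a placeholder string), and the section names are invented for illustration:

```python
from concurrent.futures import ThreadPoolExecutor


def draft_section(research: str, section_outline: str) -> str:
    """Hypothetical stand-in for one sub-agent: full research in, one section out."""
    line_count = len(research.splitlines())
    return f"## {section_outline}\n(draft based on {line_count} lines of research)\n"


def draft_document(research: str, outline: list[str], workers: int = 5) -> str:
    # Every worker receives the same rich input but is asked for only one
    # section, which keeps each individual output short.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        sections = list(pool.map(lambda s: draft_section(research, s), outline))
    # pool.map returns results in outline order, so assembly is a plain join.
    return "\n".join(sections)


research_notes = "note\n" * 100  # placeholder for the research material
outline = ["Context", "Forces", "Options", "Decision", "Consequences"]  # invented
document = draft_document(research_notes, outline)
```

The point of the shape is that input size stays constant per worker while output size is divided by the number of sections.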
@@ -0,0 +1,17 @@
# Finding 7: Emerging role assignments (pattern, not conclusion)
**Date:** 2026-04-26 (one day of intensive work — treat as hypothesis)
- Opus (via Claude Code): complex generation needing deep project context.
Rich context: CLAUDE.md, full codebase access, design docs. Broad mandate.
- Sonnet: parallel volume work (5 sub-agents drafting simultaneously).
Rich context per section, constrained output scope.
- GPT-5: independent analytical review. Rich context (diff + files + issue).
Best when task is bounded and explicit.
- GPT-4.1 Mini: focused narrow analysis (bias detection). Minimal context,
precise question. Cheap and fast.
- **Takeaway:** The role assignment matters, but so does the context shape.
Opus gets broad context + broad mandate. Sonnet gets broad context +
narrow scope. GPT-5 gets rich context + explicit task. GPT-4.1 Mini gets
minimal context + laser question. We haven't tested swapping these
combinations — that's where the real learning will come from.
@@ -0,0 +1,58 @@
# Finding 8: Bias detection: all models catch it with any framing — when the signal isn't buried
**Date:** 2026-04-27
**Task:** Detect directional bias in 8 deliberately biased hypotheses about
microservices vs monolith architecture for fintech startups.
**How we used them:** Created fresh test material (8 hypotheses with pro-
microservices bias via absolutes like "inevitably," "necessary," "must,"
"requires," plus one factually inverted claim about consistency guarantees).
Ran 4 conditions in parallel sub-agents:
| Condition | Model | Framing | Context |
|---|---|---|---|
| A | GPT-4.1 Mini | Narrow: "Do any lead toward a predetermined conclusion?" | Hypotheses only |
| B | Sonnet | Same narrow question | Hypotheses only |
| C | GPT-5 | Same narrow question | Hypotheses only |
| D | Sonnet | Broad: "Review quality, clarity, testability, and issues" | Hypotheses only |
**Results:**
- **All 4 conditions detected 8/8 biased hypotheses.** No misses.
- All 3 narrow-framing models (Mini, Sonnet, GPT-5) produced structurally
similar output: per-hypothesis verdict, biasing words, neutral version,
severity assessment.
- All 3 narrow-framing models flagged H8's factual inversion (distributed
transactions DON'T provide stronger consistency than monolithic ACID).
- GPT-5 added specific counterexamples (LMAX Disruptor, Shopify, Stack
Overflow, Basecamp) — marginally richer analysis.
- Sonnet broad mandate also caught the bias — framed as one of three
"systemic problems" (deterministic language, pro-microservices framing
bias, underspecified constructs). Additionally provided testability and
operationalization analysis that the narrow framing didn't ask for.
- Sonnet broad took ~72s vs ~39s for narrow conditions (more output).
**Takeaway:** When the biased text is the ONLY input (no surrounding noise),
all tested models — including the cheapest (GPT-4.1 Mini) — detect bias
regardless of whether the question is narrow or broad. This appears to
**contradict** original finding #2 ("cheap model + narrow lens > expensive
model + broad review"), but the key difference is context noise:
- **Original experiment (2026-04-26):** Sonnet and GPT-5 missed bias during
FULL PR REVIEW with rich project context (diff, file content, issue text,
acceptance criteria, project conventions). The hypotheses were buried in
layers of review mechanics.
- **This experiment (2026-04-27):** Even the "broad" condition gave ONLY the
hypothesis text — no diff, no PR structure, no project context noise.
**Refined hypothesis:** The original finding #2 was about **signal-to-noise
ratio**, not about model capability or framing precision. When biased text
is presented in isolation, any model catches it. When biased text is buried
in a large PR review with many other things to check, the bias signal gets
lost in the noise — unless you explicitly ask about it. The "narrow lens"
worked because it eliminated the noise, not because smaller models are
better at bias detection.
**Next experiment to confirm:** Give a model the FULL PR review context
(diff, files, issue, AC) but add the narrow bias question as an explicit
review checklist item. If the model catches bias despite the rich context,
it confirms the signal-to-noise hypothesis. If it misses, it suggests
something else is at play (attention allocation, task switching cost).
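Laid out as a 2x2 of context against question framing, the proposed experiment fills the one cell that Findings #2 and #8 leave empty. A small sketch of that bookkeeping; the cell labels are our shorthand, and the outcomes are the ones reported in those two findings:

```python
from itertools import product

CONTEXTS = ("isolated", "full PR")
QUESTIONS = ("narrow", "broad")

# Outcomes reported in Findings #2 and #8; a missing key is an untested cell.
observed = {
    ("isolated", "narrow"): "caught",  # Finding #2 (Mini), Finding #8 A-C
    ("isolated", "broad"): "caught",   # Finding #8 condition D
    ("full PR", "broad"): "missed",    # Finding #2 (Sonnet, GPT-5)
}

for context, question in product(CONTEXTS, QUESTIONS):
    outcome = observed.get((context, question), "untested")
    print(f"{context:9} + {question:6} -> {outcome}")

untested = [cell for cell in product(CONTEXTS, QUESTIONS) if cell not in observed]
```

The only untested combination is full PR context with the narrow question, which is exactly the condition described above.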
@@ -0,0 +1,77 @@
# Finding 9: Gap-finding in architecture docs: GPT-5 finds domain-specific gaps, GPT-4.1 is generic, Mini is formulaic
**Date:** 2026-05-02
**Task:** Identify missing failure scenarios in gargoyle's `failure-modes.md` (383 lines)
**How we used them:** Same document (full text, no truncation) + same focused
analytical question to all 3 models via HAI proxy (OpenAI-compatible endpoint).
No tools, no project context beyond the document itself. Single prompt, no
conversation history. Temperature 0.3 for GPT-4.1/Mini, default (1.0) for GPT-5
(required by the model).
| Model | Time | Output tokens | Reasoning tokens | Scenarios found |
|---|---|---|---|---|
| GPT-4.1 Mini | 16s | 2,003 | 0 | 10 |
| GPT-4.1 | 24s | 2,575 | 0 | 15 |
| GPT-5 | 45s | 8,565 | 6,656 | 14 |
**What they found — common ground (all 3 identified):**
- ETS table corruption/loss affecting gates
- BEAM scheduler starvation / GC pauses
- WebSocket message duplication/reordering
- Postgres connection pool exhaustion / deadlocks
- Clock skew / time drift
- Process registry inconsistency
**GPT-5 unique findings (not in either other model):**
- Broker rate limiting (429s) — not "connection lost" so existing logic
doesn't trigger, but can't flatten during kill switch
- Broker auth failure / credential rotation — distinct from connection loss
- Corporate actions (splits, symbol changes) — position drift without
triggering staleness detection
- Duplicate pipeline instances for same user (DynamicSupervisor race)
- DB "commit unknown outcome" causing restart loops (Ecto commit succeeds
at Postgres but client times out → retry → unique constraint → crash loop)
- Cross-symbol strategies with partial staleness — multi-leg signals
computed from mix of fresh and stale data
- Partial cancel_all during kill switch masked by process restarts
**GPT-4.1 unique findings (not in GPT-5 or Mini):**
- Zombie processes after halt (supervisor misconfiguration)
- Unsupervised Task crashes going unnoticed
- Audit log writes failing silently (not in same transaction as state change)
- ClOrdID unique constraint violation from race in sequence generation
- Broker API semantic changes (silent breaking changes)
**GPT-4.1 Mini unique findings:**
- Race between kill switch engagement and reconciliation completion
(timing coordination gap) — this was more explicitly called out than
in the other models, though GPT-5 touches it implicitly
- Strategy.Worker / Aggregator partial crash inconsistency
**Quality assessment:**
- **GPT-5** had the most *domain-relevant* and *actionable* gaps. Broker
rate limiting, auth failures, corporate actions, and the DB commit
unknown-outcome scenario are all realistic production issues specific
to THIS system. The cross-symbol partial staleness finding shows
deeper architectural reasoning about component interactions.
- **GPT-4.1** was thorough and well-structured but more generic/defensive.
Many of its unique findings (zombie processes, unsupervised Tasks,
audit log loss) are general Elixir concerns rather than specific to
the document's architecture. Good for a completeness checklist.
- **GPT-4.1 Mini** was formulaic — each finding followed the same template
and several were somewhat surface-level or restated things the document
partially covers. Still found the most scenarios per dollar.
**Takeaway:** For gap-finding in architecture documents, GPT-5's reasoning
tokens pay off. It doesn't just list "things that could go wrong" — it
identifies *specific interactions* that the document's existing mechanisms
don't cover (e.g., rate limiting bypasses the "connection lost" detection,
corporate actions bypass staleness detection). GPT-4.1 is a solid
middle-ground: more thorough than Mini, less insightful than GPT-5.
Mini is fine for a quick sanity check but won't find the subtle gaps.
**Cost-effectiveness:** Mini found 10 scenarios in 16s for ~7K tokens.
GPT-5 found 14 scenarios (with 7 genuinely unique insights) in 45s for
~13.5K tokens (including ~6.7K reasoning). For architecture review where
missing a gap could mean financial loss, the GPT-5 cost is justified.
For routine doc review, Mini + human judgment is probably sufficient.
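The comparison can be made explicit with a little arithmetic. A sketch using only the figures quoted above; total tokens serve as a rough proxy for cost, since per-token prices are not recorded in this finding:

```python
# Figures taken from the results table and the cost paragraph above.
runs = {
    "GPT-4.1 Mini": {"scenarios": 10, "seconds": 16, "total_tokens": 7_000},
    "GPT-5": {"scenarios": 14, "seconds": 45, "total_tokens": 13_500},
}


def per_thousand_tokens(run: dict) -> float:
    """Scenarios found per 1,000 total tokens."""
    return run["scenarios"] / (run["total_tokens"] / 1000)


for model, run in runs.items():
    print(
        f"{model}: {per_thousand_tokens(run):.2f} scenarios per 1K tokens, "
        f"{run['seconds'] / run['scenarios']:.1f}s per scenario"
    )
```

On these numbers Mini yields about 1.4 scenarios per 1K tokens and GPT-5 about 1.0, so the case for GPT-5 rests on the quality of its unique findings, not on volume.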
@@ -0,0 +1,98 @@
# Finding 10: Hidden-assumption identification: GPT-5's reasoning produces qualitatively different (not just more) findings
**Date:** 2026-05-02
**Task:** Identify hidden assumptions in gargoyle's `cold-start-and-recovery.md` (234 lines)
that could break under real-world production conditions.
**How we used them:** Same document (full text) + same focused analytical question
to all 3 models via HAI proxy (OpenAI-compatible endpoint). No tools, no project
context beyond the document itself. Single prompt, no conversation history.
Temperature 0.3 for GPT-4.1/Mini; GPT-5 uses default (required).
| Model | Time | Output tokens | Reasoning tokens | Assumptions found |
|---|---|---|---|---|
| GPT-4.1 Mini | 25s | 3,090 | 0 | 12 |
| GPT-4.1 | 77s | 2,751 | 0 | 14 |
| GPT-5 | 78s | 2,649 | 4,096 | 26 |
**What they found — common ground (all 3 identified):**
- Broker API consistency/availability during reconciliation
- ETS table availability and fail-closed behavior
- Single-writer/mailbox ordering guarantees holding in practice
- User independence assumption vs shared resources (rate limits, DB)
- Reconciliation idempotency under repeated runs
- Corporate action data completeness/timeliness
- Escalation threshold calibration vs changing market conditions
- Strategy warmup with partial/missing historical data
- Signal expiry correctness on restart
**GPT-5 unique findings (not in either other model):**
- Unbounded mailbox growth during extended reconciliation (memory pressure
from queued messages at market open)
- handle_continue side effects in OTHER processes (risk, metrics) acting
concurrently via different paths
- Pre-existing GTC orders filling while gated (positions as moving target)
- Broker position semantics mismatch (trade-date vs settled-date)
- Strategy warmup evaluate() having non-signal side effects (metrics, caches)
- Historical bar / live tick boundary alignment (double-processing or gaps)
- ETS gate caching in process state creating fail-open windows
- Correlated retry stampede when many users restart together
- Corporate action double-application race with broker (missing idempotency
keys per action/instrument/date)
- Kill switch state vs DB unavailability at startup
- Market data subscriptions as shared bottleneck across "independent" users
- Time-invariant signals incorrectly expired by aggregation window logic
- Broker fills vs positions endpoints internally inconsistent (different caches)
- Positions changing under reconciliation while kill switch is engaged
- Gate phase sequencing: :ready written before worker warmup completes
- Periodic reconciler allowing 1hr of divergent trading (rate-of-change blind)
**GPT-4.1 unique findings (not in GPT-5 or Mini):**
- No correlated failure handling (all failure modes treated as isolated) —
only model to frame this as a meta-assumption about the failure table
**GPT-4.1 Mini unique findings:**
- None that weren't also covered by the other two models
**Quality assessment:**
- **GPT-5** didn't just find more assumptions — it found *qualitatively
different kinds*. Many of its unique findings involve multi-component
interactions (mailbox + reconciliation + market open timing), semantic
mismatches (trade-date vs settled positions), and second-order effects
(metrics side effects during warmup, GTC orders filling while gated).
These require reasoning about system behavior across boundaries the
document doesn't explicitly draw.
- **GPT-4.1** was competent and structured, found the same core assumptions
as Mini, plus one good meta-observation about correlated failures. But
it stayed within the document's own framing — it found assumptions the
document *almost* states rather than ones the document can't see.
- **GPT-4.1 Mini** was formulaic. Every finding maps cleanly to a section
of the document. It's essentially "what could go wrong with each stated
mechanism" rather than "what does this design take for granted about
the world outside itself."
**Key insight — reasoning tokens change the KIND of analysis:**
GPT-5's 4,096 reasoning tokens aren't producing "more of the same" —
they're producing a different analytical mode. The non-reasoning models
(4.1 and Mini) identify risks within the document's own frame of reference.
GPT-5 reasons about the document's relationship to the external world:
broker semantics, deployment topology, OTP runtime behavior under load,
timing correlations across independent subsystems. This is the difference
between "what could this mechanism fail at" and "what must be true about
the world for this mechanism to work."
**Comparison to Finding #9 (gap-finding on failure-modes.md):**
Same pattern confirmed. GPT-5 consistently finds domain-specific,
interaction-level issues that require reasoning about component boundaries.
GPT-4.1 is thorough within-frame. Mini is formulaic. The gap between
GPT-5 and the others is larger here than in #9 — possibly because
"hidden assumptions" requires more abstraction than "missing failure
scenarios." Assumption-finding requires the model to reason about what
ISN'T stated, which benefits more from extended reasoning.
**Practical implication:** For architecture review, running GPT-5 on
"identify hidden assumptions" is higher-value than the same question to
non-reasoning models. The cost difference (4K extra reasoning tokens) is
trivial for a document that will drive months of implementation. Use
non-reasoning models for within-frame checks ("does this section have
gaps") and reasoning models for cross-boundary analysis ("what must be
true about the world for this to work").
@@ -0,0 +1,124 @@
# Finding 11: Hidden-assumption identification on simpler doc: reasoning models diverge in approach, both outperform non-reasoning
**Date:** 2026-05-02
**Task:** Identify hidden assumptions in gargoyle's `market-calendar.md` (238 lines)
— a simpler, single-component document vs the 234-line cold-start doc from Finding #10.
**How we used them:** Same document (full text) + same focused analytical question
to all 3 models via HAI proxy. No tools, no project context beyond the document
itself. Single prompt, no conversation history. Temperature 0.3 for GPT-4.1;
GPT-5 and Opus use their defaults (required). Same prompt across all three.
| Model | Time | Output tokens | Reasoning tokens | Assumptions found |
|---|---|---|---|---|
| GPT-4.1 | 19s | 2,554 | 0 | 14 |
| Claude Opus 4.6 | 74s | 3,288 | (internal, not reported) | 13 |
| GPT-5 | 101s | 8,417 | 5,504 | 24 |
**What they found — common ground (all 3 identified):**
- Alpaca calendar API data correctness/completeness as single source of truth
- Alpaca API availability at startup (no local cache persistence)
- ETS table atomicity during refresh (partial-state exposure risk)
- System clock/timezone alignment (dates are timezone-naive)
- NYSE emergency/unscheduled closures not reflected until refresh
- Two-year cache range sufficiency
- API response format stability
- Rate limiting / API capacity concerns
**GPT-5 unique findings (not in either other model):**
- Date struct term-ordering in ETS match specs may not match chronological
order (ETS range guards rely on Erlang term comparison, not Date semantics)
- close_time/1 returns naive Time without timezone — DST conversion burden on
consumers, one hour off twice per year
- trading_day?/1 conflates "not a trading day" with "calendar unavailable" —
operational outages invisible to callers
- ETS table name collision risk (global namespace per node)
- No other process should modify the ETS table (access mode discipline)
- Network egress and credential availability on all nodes at all times
- ETS read/write concurrency flags for contention under load
- Direct ETS access by consumers bypassing the module's error handling
- next/prev_trading_day edge cases at cache boundaries
- Alpaca API start/end parameter inclusivity (off-by-one at year boundaries)
- Half-day vs full-day distinction insufficiency for special sessions
- Small table size makes O(n) selects acceptable (scaling concern)
- Year-end refresh failure leaving gaps at boundary
- Alpaca never omits a legitimate trading day (absence = non-trading conflation)
**Claude Opus unique findings (not in either other model):**
- ETS ownership semantics: heir-protection would change fail-closed behavior;
current design means ALL consumers fail simultaneously during crash-to-restart
window (framed as a design tension, not just a risk)
- Silent data corruption from partial API response (pagination/truncation) —
specifically that missing rows are SILENT failures with no error propagation
(other models mentioned API completeness but not the silence aspect)
- Consumers calling functions with Dates, not DateTimes — the API accepts Date.t()
but doesn't specify HOW consumers should derive "today" (system-wide
coordination problem made invisible by the API contract)
- `trading_day?/1` returning false is NOT fail-closed for ALL consumers — only
for PDT-like "block action" consumers; for batch-trigger consumers it's
fail-OPEN (subtle inversion of safety semantics)
- Startup ordering: background_children placement means PDT could receive orders
before MarketCalendar finishes init, creating recurring rejection windows
during hot deploys
- Continuous-running assumption for refresh timer (daily restarts would mean
refresh mechanism never fires — no staleness alert exists)
**GPT-4.1 unique findings (not in either other model):**
- No need for real-time calendar change notification (event emission gap)
- All consumers using the same module instance (configuration consistency)
- No need for historical calendar data (audit/backtesting limitation)
- Consumers correctly handling {:error, :calendar_unavailable} in practice
**Quality assessment:**
- **GPT-5** found the most assumptions (24) with the most technical specificity.
Many are implementation-level insights (ETS term ordering, named table
collisions, read_concurrency flags) that demonstrate deep Erlang/OTP
knowledge. Some are slightly obvious or overlapping. The ETS term-ordering
finding is genuinely insightful and appears to be correct: Date structs are
maps, and Erlang term order compares map values in sorted-key order (`day`,
then `month`, then `year`), so structural comparison does not follow
chronological order and `Date.compare/2` is needed for that. Also provided
concrete recommendations.
- **Claude Opus** found fewer assumptions (13) but several were qualitatively
different — they identified *design tensions* and *semantic inversions*
rather than just failure scenarios. The fail-open/fail-closed inversion
(item #12 in Opus's output), the ETS ownership tension, and the "API makes timezone
coordination invisible" findings show reasoning about the design's
*relationship to its consumers* rather than just its internal mechanics.
Tighter, more curated output with less filler.
- **GPT-4.1** was competent and well-structured (14 assumptions, clean table)
but stayed within the document's own framing. Its unique findings are
relatively generic ("consumers should handle errors correctly," "no
historical data"). Solid baseline, no surprises.
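On the ETS term-ordering point: an Elixir `Date` is a struct, which is a map, and Erlang orders maps of equal size by their sorted keys and then by values in key order. For a `Date` the `__struct__` and `calendar` values are equal, so the comparison falls to `day`, then `month`, then `year`. The sketch below models that rule in Python. It is an illustration of the ordering only; the function name is ours, and whether the calendar table actually stores Date structs in a position where term order matters is an assumption:

```python
from datetime import date


def erlang_struct_key(d: date) -> tuple[int, int, int]:
    """Model of Erlang term order for an Elixir %Date{}.

    Map values are compared in sorted-key order. __struct__ and calendar
    are identical for ISO dates, which leaves day, month, year.
    """
    return (d.day, d.month, d.year)


dates = [date(2026, 1, 31), date(2026, 2, 1), date(2025, 12, 15)]

chronological = sorted(dates)
structural = sorted(dates, key=erlang_struct_key)

print("chronological:", [d.isoformat() for d in chronological])
print("structural:   ", [d.isoformat() for d in structural])
```

Under this model 2026-02-01 sorts before 2025-12-15 because its day field is smaller, which is the kind of mismatch a range guard on raw structs would inherit.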
**Key insight — two reasoning models, different analytical styles:**
GPT-5 and Opus are both reasoning models, but they reason about different
things. GPT-5 reasons DEEPER into implementation mechanics (how does ETS
actually work? what are the exact failure modes of each component?). Opus
reasons WIDER about system context (how does this component's API contract
affect the safety properties of the overall system? what tensions does this
design create that aren't visible to the author?).
GPT-5's approach: "Here are 24 things that could go wrong, many highly
technical." Opus's approach: "Here are 13 assumptions, several of which
reveal design tensions the document can't see about itself."
**Does the reasoning gap narrow with simpler docs?**
Comparing to Finding #10 (cold-start doc, 234 lines, 26 vs 14 vs 12 assumptions
for GPT-5/GPT-4.1/Mini):
- GPT-5 still dominates in raw count (24 vs 14 for GPT-4.1)
- The gap ratio is similar (~1.7x here vs ~1.9x in Finding #10)
- Document complexity doesn't appear to be the driver of the gap —
reasoning tokens enable more exhaustive exploration regardless of
input complexity
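The two ratios come straight from the assumption counts in the results tables of Findings #10 and #11:

```python
# Assumption counts from the tables in Findings #10 and #11.
counts = {
    "Finding #10 (cold-start)": {"GPT-5": 26, "GPT-4.1": 14},
    "Finding #11 (market-calendar)": {"GPT-5": 24, "GPT-4.1": 14},
}

ratios = {name: c["GPT-5"] / c["GPT-4.1"] for name, c in counts.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

With one data point per document, a shift from roughly 1.9x to 1.7x is well within what prompt or sampling variation could produce, so "similar" is the strongest claim these counts support.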
**Claude Opus vs GPT-5 (the headline comparison):**
They're not competing on the same axis. GPT-5 is better for "find all
possible issues" (breadth + technical depth). Opus is better for "find
the assumptions that will actually surprise the author" (insight density).
If you want a security-audit-style exhaustive list: GPT-5. If you want a
design-review-style "here's what you're not seeing about your own design":
Opus. Both are better than GPT-4.1 for this task, but in different ways.
**Practical implication:** Run BOTH reasoning models on architecture docs.
GPT-5 catches implementation-level hazards the team might miss during
coding. Opus catches design-level tensions the team might miss during
planning. GPT-4.1 is sufficient as a quick sanity check but won't
surprise you.
@@ -0,0 +1,125 @@
# Finding 12: Sonnet 4.6 outperforms expectations on assumption-finding; competes with reasoning models on complex docs
**Date:** 2026-05-02
**Task:** Identify hidden assumptions in gargoyle's `order-execution.md` (785 lines)
— a complex, multi-component document covering OrderManager, BrokerAdapter,
TradeStream, and PositionReconciler.
**How we used them:** Same document (full text, no truncation) + same focused
analytical question to all 3 models. GPT-5 via HAI OpenAI endpoint; Opus 4.6
and Sonnet 4.6 via HAI Anthropic endpoint. No tools, no project context beyond
the document itself. Single prompt, no conversation history.
| Model | Time | Output tokens | Reasoning tokens | Assumptions found |
|---|---|---|---|---|
| GPT-5 | 93s | 8,485 | 6,016 | 20 |
| Claude Sonnet 4.6 | 106s | 4,637 | (internal) | 17 |
| Claude Opus 4.6 | 105s | 4,615 | (internal) | 12 |
**What they found — common ground (all 3 identified):**
- Synchronous broker REST calls blocking OrderManager GenServer (mailbox growth)
- TradeStream event ordering assumptions (out-of-order fills/status)
- Fill deduplication gap (no explicit fill-level idempotency)
- `cancel_all/1` with `timeout: :infinity` blocking GenServer during FLATTEN
- Recovery/restart races with TradeStream fill delivery (fills queued during
`handle_continue/2`)
- Lot operation idempotency under crash recovery (partial execution)
- Replace race: fills for new broker_order_id arriving before `replaced` event
- Database write latency impact on GenServer throughput under burst fills
- ETS table scope assumptions (single-node, access mode)
**GPT-5 unique findings (not in either Claude model):**
- Rate-limit retry blocking OrderManager inline (no async retry path specified)
- Single TradeStream connection per user not enforced (duplicate detection gap)
- Kill switch FLATTEN vs degraded state interaction (OM drops cancels while
degraded, but FLATTEN calls cancel_all through OM)
- ClOrdID uniqueness scope/retention at broker across sessions and days
- `after: datetime` filter semantics (clock skew, timezone, inclusive/exclusive)
- Reconciliation responses may exceed single-response size (no pagination)
- Event broadcasting blocking model (synchronous vs fire-and-forget)
- Credential rotation during TradeStream connection lifetime
- `market_closed` semantics varying across brokers (reject vs queue)
- Dropped Alpaca statuses (stopped/suspended/calculated) may affect accounting
**Claude Sonnet 4.6 unique findings (not in either other model):**
- Single fill per fill event assumption (broker batching multiple fills into
one WebSocket message)
- Lot operations (`Lots.open/2`, `Lots.close/4`) assumed to never fail —
no `{:error, _}` handling shown, crash propagation risk
- `Task.async_stream` inside GenServer creating linked tasks whose crash
signals propagate to OrderManager during critical cancel_all
- Broker cancel semantics during in-flight replace at the broker level
(cancel targets old broker_order_id which broker already replaced away)
- Database operations in fill processing assumed transactional (no explicit
Ecto.Multi/transaction mention)
- Broker position reflects only Gargoyle's activity (external trades cause
false-positive reconciliation halts)
**Claude Opus 4.6 unique findings (not in either other model):**
- `{:ok, broker_order_id}` from REST place conflated with durable OMS
acceptance vs mere HTTP acknowledgment (no timeout on `submitted` state)
- Concurrent `apply_corrections/2` from periodic reconciler running in
separate process conflicts with OrderManager's single-writer invariant
(corrections write to same tables outside GenServer serialization)
- Reconciliation gate initialized state after `:rest_for_one` restart —
ETS table EXISTS but freshly initialized vs table MISSING are different
conditions with different safety properties
- Escalation state reset after crash creating double-exposure window
(systematic issue persists but escalation timer resets to zero)
- `replace/3` error semantics: non-atomic replace (cancel + re-submit)
where cancel succeeds but re-submit fails leaves original order cancelled
at broker while OrderManager reverts to "working" locally
**Quality assessment:**
- **GPT-5** maintained its pattern from previous findings: broadest coverage
(20 assumptions), most technically specific about implementation details.
Found cross-cutting operational concerns (clock skew, credential rotation,
pagination) that the Claude models didn't surface. However, several of its
findings were medium-severity operational concerns rather than architectural
assumptions.
- **Claude Sonnet 4.6** was the surprise performer. Found 17 assumptions —
close to GPT-5's count (85%) — and several of its unique findings were
genuinely insightful. The `cancel_all` race with broker-side replace state
(finding #16) and the lot operation failure propagation (finding #6) show
deep reasoning about component interaction despite Sonnet not being
positioned as a "reasoning" model. More importantly, Sonnet's findings were
consistently well-structured with clear "how it could break" scenarios.
- **Claude Opus 4.6** found the fewest assumptions (12) but — consistent with
Finding #11 — its unique findings were qualitatively different. The
concurrent `apply_corrections` write conflict, the gate initialization state
distinction, and the non-atomic replace error semantics all reveal design
tensions that neither GPT-5 nor Sonnet identified. Opus continues to reason
about the *boundaries between components* rather than within-component
mechanics.
**Key insight — Sonnet 4.6 is NOT just a faster GPT-4.1:**
In previous findings (#9, #10, #11), non-reasoning models (GPT-4.1, GPT-4.1
Mini) performed significantly below reasoning models on assumption-finding.
GPT-4.1 found ~14 assumptions where GPT-5 found 24-26. Here, Sonnet 4.6
finds 17 where GPT-5 finds 20, a much smaller gap: ~85% of GPT-5's count,
versus ~54-58% for GPT-4.1 previously.
Sonnet's findings also included several that showed genuine reasoning about
component interactions (not just within-frame risks). This suggests Sonnet 4.6
is qualitatively different from GPT-4.1 for analytical work — it occupies a
middle ground between GPT-4.1's "competent but surface-level" and GPT-5's
"exhaustive and deep." The severity distribution was also similar to GPT-5
(multiple critical/high findings), whereas GPT-4.1 in previous experiments
tended toward medium-severity generic concerns.
**Updated model hierarchy for assumption-finding:**
1. GPT-5 — broadest coverage, most operational-level findings (20)
2. Sonnet 4.6 — strong analytical depth, good component interaction reasoning (17)
3. Opus 4.6 — fewest but most architecturally insightful, finds design tensions (12)
4. GPT-4.1 — competent within-frame, generic (~14 from previous experiments)
5. GPT-4.1 Mini — formulaic, surface-level (~10-12)
**Practical implication:** For architecture review, Sonnet 4.6 is now a strong
candidate for volume analytical work. It's fast enough to run alongside GPT-5
and catches different things (lot operation failures, broker-side replace races).
The ideal three-model review stack for architecture docs appears to be:
- GPT-5 for breadth + operational concerns
- Sonnet 4.6 for component interaction analysis
- Opus 4.6 for design-tension identification
Each consistently finds things the others miss. The cost-efficiency argument
for Sonnet is strong: ~85% of GPT-5's count with more findings per token
generated (4,637 vs 8,485 tokens for 17 vs 20 assumptions, i.e. about 3.7 vs
2.4 assumptions per 1,000 tokens).
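The per-token comparison can be reproduced with a few lines of arithmetic. This is a minimal sketch using only the counts and token totals quoted above; the "assumptions per 1,000 tokens" metric and the names `runs` and `per_thousand_tokens` are illustrative choices, not something defined by the experiment.

```python
# Counts and output-token totals are the ones quoted in the finding above.
# The density metric itself is an illustrative assumption.
runs = {
    "Sonnet 4.6": {"assumptions": 17, "output_tokens": 4637},
    "GPT-5": {"assumptions": 20, "output_tokens": 8485},
}


def per_thousand_tokens(assumptions: int, output_tokens: int) -> float:
    """Assumptions reported per 1,000 generated tokens."""
    return assumptions / output_tokens * 1000


density = {name: per_thousand_tokens(**run) for name, run in runs.items()}
count_ratio = runs["Sonnet 4.6"]["assumptions"] / runs["GPT-5"]["assumptions"]

for name, value in density.items():
    print(f"{name}: {value:.2f} assumptions per 1K tokens")
print(f"Sonnet 4.6 count as a share of GPT-5's: {count_ratio:.0%}")
```

Note that this measures volume per token only; it says nothing about severity or how actionable each assumption is.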
@@ -0,0 +1,46 @@
# Finding 7: Token budget matters more than model size for gap analysis (confirmed)
**Date:** 2026-05-03
**Task:** Identify unaddressed failure scenarios in gargoyle's `failure-modes.md` (383 lines, ~25KB)
**How we used them:** Same document, same analytical question ("What failure scenarios
are NOT covered?"), three models. GPT-5 with 16K max_completion_tokens, Sonnet 4
with 4K max_tokens, GPT-4.1 Mini with 4K max_completion_tokens. No project context
beyond the document itself. Pure gap-analysis task.
**Results:**
- GPT-5 (16K budget): 28 gaps, most exhaustive. Found domain-specific edge cases
others missed entirely: ClOrdID collision across restarts, fractional share rounding,
broker maintenance windows (410/426), hot code upgrades, regulatory halts vs staleness
distinction, cancel-ack-then-late-fill race, DNS/TLS as distinct from network outage.
- Sonnet 4 (4K budget): 12 gaps, concise. Unique framing: distinguished latency
degradation from outage (subtle but actionable). ETS corruption vs loss.
- GPT-4.1 Mini (4K budget): 13 gaps with summary table. Unique angles: unknown broker
status enum values, configuration schema mismatches on cold-start, malformed signals
from logic bugs (not just crashes).
**Overlap (all three):** Rate limiting, clock skew, resource exhaustion, DB failures,
message backpressure, partial connectivity.
**Key insight:** GPT-5's 4K attempt produced ZERO output (finish_reason: length) —
all tokens consumed by internal reasoning. At 16K it produced the richest analysis.
This confirms finding #3 (GPT-5 needs generous token budgets) AND adds a new
observation: for open-ended analytical questions, GPT-5's reasoning overhead is
proportionally larger. The models run at 4K (Sonnet, Mini) both produced useful
output because they don't burn completion tokens on hidden chain-of-thought.
**Model personality confirmed:**
- GPT-5: exhaustive, domain-aware, finds edge cases a senior SRE would know
- Sonnet: precise, architectural, finds design-level distinctions
- GPT-4.1 Mini: structured, systematic, finds enumeration gaps
**Practical implication:** For failure mode / gap analysis on design docs:
- GPT-5 with ≥16K tokens for maximum coverage (most unique findings)
- Sonnet for architectural framing ("this is really two different problems")
- Mini for completeness checking ("what about this enum value?")
- Running all three costs ~$0.50 and catches gaps none alone would find
- GPT-5 at 4K is USELESS for this task — always give it room to think
**Note on GPT-5 reasoning overhead:** First attempt at 4K max_completion_tokens
returned empty content with finish_reason: length. The model spent all 4K tokens
on internal reasoning and produced nothing. This is worse than a short answer —
it's zero value for non-zero cost. Always budget ≥16K for GPT-5 analytical tasks.
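One way to guard against the empty-response failure described above is to treat "empty content with `finish_reason: length`" as a signal to retry with a larger budget. The sketch below is hypothetical: `Completion`, `complete_with_budget_escalation`, and `fake_reasoning_model` are illustrative names, the fake model stands in for a real API call, and the 6,000-token reasoning overhead is an invented number used only to trigger the failure.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Completion:
    content: str
    finish_reason: str  # e.g. "stop" or "length"


def complete_with_budget_escalation(
    call_model: Callable[[int], Completion],
    budgets: tuple[int, ...] = (4_000, 16_000),
) -> tuple[Completion, int]:
    """Try each token budget in turn until the model returns visible content."""
    for budget in budgets:
        completion = call_model(budget)
        # The failure mode from the finding: the budget ran out before any
        # visible output was produced.
        exhausted = (
            completion.finish_reason == "length" and not completion.content.strip()
        )
        if not exhausted:
            return completion, budget
    raise RuntimeError(f"no visible output even at {budgets[-1]} tokens")


def fake_reasoning_model(max_completion_tokens: int) -> Completion:
    """Stand-in for a real client call (hypothetical behaviour)."""
    reasoning_cost = 6_000  # invented hidden-reasoning overhead
    if max_completion_tokens <= reasoning_cost:
        return Completion(content="", finish_reason="length")
    return Completion(content="28 gaps identified", finish_reason="stop")


result, used_budget = complete_with_budget_escalation(fake_reasoning_model)
print(used_budget, result.finish_reason)
```

Starting at 16K directly, as the note recommends, avoids paying for the wasted first attempt; the escalation loop is only a safety net.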
@@ -0,0 +1,126 @@
# Finding 13: Race condition identification: Opus excels at subtle temporal interactions; GPT-5 is exhaustive; Sonnet struggles with concurrency reasoning
**Date:** 2026-05-03
**Task:** Identify race conditions, timing-dependent bugs, and ordering hazards in
gargoyle's `concurrent-failure-detection.md` (241 lines) — a document specifically
about concurrent detection logic with timers, ETS state, and multi-process events.
**How we used them:** Same document (full text) + same focused analytical question
to all 3 models via HAI proxy. Prompt specifically asked for event ordering problems,
timer interaction bugs, state visibility gaps, crash/restart timing, and multi-instance
coordination. Required each finding to reference specific mechanisms in the document
with specific interleaving descriptions. No tools, no project context beyond the
document itself.
| Model | Time | Output tokens | Reasoning tokens | Race conditions found |
|---|---|---|---|---|
| GPT-5 | 116s | 10,587 | 8,192 | 12 |
| Claude Opus 4.6 | ~105s | 4,610 | (internal) | 10 |
| Claude Sonnet 4.6 | ~39s | 1,404 | (internal) | 7 |
**What they found — common ground (all 3 identified):**
- Stale timer messages in mailbox after cancellation (classic Erlang timer race)
- HealthMonitor crash losing compound detection state (init from :unknown, no replay)
- ETS vs GenServer state divergence visible to dashboard
- Kill switch mode conflict (FLATTEN from single-process vs HALT from compound path)
**GPT-5 unique findings (not in either Claude model):**
- Cross-sender message ordering: recovery events from pipeline processes vs timer
expiry from runtime (Erlang preserves per-sender order, NOT cross-sender) — the
"rapid recovery" safety argument in the doc relies on state being updated before
timer fires, which isn't guaranteed
- Debounce starvation: flapping component repeatedly restarting the timer, causing
compound evaluation to be indefinitely postponed while ≥2 genuinely degraded
- State regression: {:degraded} arriving after {:escalated, :kill_switch} with no
guard in the event table — state machine allows regressing from :halted to :degraded
- Cold-start window: application boots with existing degraded processes that won't
re-emit events, compound detection never fires
- Catch-all handle_info could accidentally swallow timer messages if pattern matching
is ordered wrong (implementation pitfall of the described approach)
- Debounce window growing beyond calibrated bounds from repeated timer restarts
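The debounce starvation finding can be illustrated with a small timing model. This is a sketch under stated assumptions: the 500 ms window, the 300 ms flapping interval, and the `max_delay_ms` cap are all hypothetical values, and the cap is one possible mitigation rather than anything the reviewed document specifies.

```python
def evaluation_time(event_times_ms, window_ms, max_delay_ms=None):
    """Time at which a restart-on-every-event debounce timer finally fires."""
    first = event_times_ms[0]
    deadline = first + window_ms
    for t in event_times_ms[1:]:
        if t >= deadline:
            break  # the timer already fired before this event arrived
        deadline = t + window_ms  # each event restarts the full window
    if max_delay_ms is not None:
        # Hypothetical mitigation: never postpone beyond a fixed cap
        # measured from the first event.
        deadline = min(deadline, first + max_delay_ms)
    return deadline


WINDOW_MS = 500  # hypothetical debounce window
flapping = list(range(0, 3001, 300))  # one event every 300 ms for 3 seconds

uncapped = evaluation_time(flapping, WINDOW_MS)
capped = evaluation_time(flapping, WINDOW_MS, max_delay_ms=1_000)
print(f"uncapped evaluation at {uncapped} ms, capped at {capped} ms")
```

Because each event arrives before the previous window expires, the uncapped evaluation cannot run until the flapping stops, no matter how long that takes.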
**Claude Opus unique findings (not in either other model):**
- Timer restart pushing evaluation PAST single-process escalation timeout — the
debounce mechanism can DEFEAT compound detection when second degradation arrives
near end of first window (resets to full window, first process escalates via
single-process path before new window fires). This means system gets FLATTEN
instead of HALT — exactly what compound detection was supposed to prevent.
- Strategy worker single-atom masking via event ordering: Worker A degrades, Worker
B degrades (same atom), Worker A recovers → atom set to :normal while B is still
degraded. Event ordering across different workers mapped to same atom creates
state loss.
- Registry stale PID after HealthMonitor crash: if subscription is PID-based (not
PubSub topic), new HM instance is deaf — events go to dead PID, silently dropped.
Compound detection completely disabled for that user until subscription refresh.
- :rest_for_one cascade + coincidental independent issue: debounce designed to
filter cascade restarts, but cascade can EXPOSE a pre-existing issue (SignalRisk
restarts and finds its upstream stale). Is this "compound" or "cascade + bad luck"?
Semantic ambiguity the design doesn't address.
- Compound cleared event without recovery debounce: :compound_degradation_cleared
emitted immediately when last process recovers (no settling period), causing
operator oscillation if recovery is transient.
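The single-atom masking finding above is easy to reproduce in miniature. The sketch below replays the A-degrades, B-degrades, A-recovers sequence against two trackers; both classes and the worker names are hypothetical illustrations, and the per-worker set is one possible alternative, not the design's stated approach.

```python
class SingleAtomTracker:
    """Last-writer-wins: one status value shared by all strategy workers."""

    def __init__(self):
        self.status = "normal"

    def on_event(self, worker_id, status):
        self.status = status  # worker identity is discarded

    def is_degraded(self):
        return self.status == "degraded"


class PerWorkerTracker:
    """Hypothetical alternative: remember which workers are degraded."""

    def __init__(self):
        self.degraded = set()

    def on_event(self, worker_id, status):
        if status == "degraded":
            self.degraded.add(worker_id)
        else:
            self.degraded.discard(worker_id)

    def is_degraded(self):
        return bool(self.degraded)


events = [
    ("worker_a", "degraded"),
    ("worker_b", "degraded"),
    ("worker_a", "normal"),  # A recovers while B is still degraded
]

single, per_worker = SingleAtomTracker(), PerWorkerTracker()
for worker_id, status in events:
    single.on_event(worker_id, status)
    per_worker.on_event(worker_id, status)

print(single.is_degraded(), per_worker.is_degraded())
```

The single-value tracker reports a healthy state after the third event even though one worker is still degraded, which is the state loss Opus described.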
**Claude Sonnet unique findings:**
- ETS table creation race at startup (HealthMonitor writes before table exists)
- Registry lookup failure during pipeline startup (events before HM registered)
- However, Sonnet also made analytical errors: it described "multiple HealthMonitor
instances for the same user" scenarios despite the document clearly stating one
instance per user via DynamicSupervisor. Several of its findings assumed
multi-instance coordination that doesn't match the architecture.
**Quality assessment:**
- **GPT-5** was the most exhaustive and technically precise. Its cross-sender
ordering finding (#2) is genuinely insightful — it identifies that the document's
"rapid recovery" safety argument implicitly assumes events arrive in wall-clock
order, which Erlang does NOT guarantee across different senders. The debounce
starvation finding (#3) identifies a real operational hazard with practical
consequences. All 12 findings reference specific mechanisms and describe specific
interleavings clearly.
- **Claude Opus** found fewer race conditions but several were qualitatively
superior. The timer-restart-defeats-compound-detection finding is the most
architecturally significant race in the entire analysis — it shows that the
debounce mechanism can work AGAINST the design's stated goals in specific
(realistic) timing scenarios. The strategy-worker event ordering masking is
also a genuine design flaw unique to the single-atom decision. Opus continues
its pattern of reasoning about design TENSIONS rather than just failure modes.
- **Claude Sonnet** was notably weaker here than in previous experiments. Only
1,404 output tokens vs 4,610 (Opus) and 10,587 (GPT-5). Several findings
contained analytical errors (assuming multi-instance coordination that doesn't
exist). It found only 7 races, and 2-3 of those were based on misreadings of
the architecture. This is a significant regression from Finding #12 where
Sonnet found 17 assumptions (85% of GPT-5's count).
**Key insight — concurrency reasoning is a different skill than assumption-finding:**
In previous experiments (#10, #11, #12), Sonnet 4.6 performed well on
assumption-finding (a task that requires reasoning about what's NOT stated).
Here, on race condition identification (a task requiring reasoning about temporal
interleavings and message ordering semantics), Sonnet drops significantly. This
suggests the task type matters more than we previously thought:
- **Assumption-finding:** Requires breadth of consideration ("what must be true
for this to work?"). Sonnet handles this well — it's essentially pattern
matching across possible failure dimensions.
- **Race condition identification:** Requires SEQUENTIAL reasoning about specific
interleavings ("if A happens, then B happens, then C happens, what state is
visible?"). This benefits dramatically from extended reasoning tokens (GPT-5's
8,192 reasoning tokens) or from Opus's internal reasoning depth.
The lesson: don't extrapolate model performance across task types. A model that's
85% as good at assumption-finding may be 50% as good at concurrency analysis.
The cognitive demands are different.
**Opus's distinguishing strength — finding design contradictions:**
Opus's best finding (timer restart defeating compound detection) isn't just a
race condition — it's identifying that the debounce mechanism can work against
the design's own stated goals. This is consistent with Opus's pattern in
previous findings: it finds tensions where one part of the design undermines
another part. For race condition analysis specifically, this manifests as
"here's where your safety mechanism becomes your vulnerability."
**Practical implication for architecture review:**
- For race condition analysis: use GPT-5 (exhaustive) + Opus (design-tension)
- Sonnet is NOT suitable for concurrency reasoning tasks — use it for
assumption-finding and structural review instead
- The three-model stack needs task-appropriate assignment:
- Structural/assumption review: all three models contribute
- Concurrency/race analysis: GPT-5 + Opus only
- Bias detection: any model (per Finding #8)
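The task-appropriate assignment above could be captured as a small routing table. This is an illustrative sketch only: the task-type keys, `ASSIGNMENTS`, and `models_for` are hypothetical names, and "any model" for bias detection is modelled as running just the first candidate.

```python
ALL_MODELS = ("GPT-5", "Claude Opus 4.6", "Claude Sonnet 4.6")

# task type -> (candidate models, how many of them to run)
ASSIGNMENTS = {
    "structural_review": (ALL_MODELS, 3),                      # all three contribute
    "concurrency_analysis": (("GPT-5", "Claude Opus 4.6"), 2),  # Sonnet excluded
    "bias_detection": (ALL_MODELS, 1),                          # any single model
}


def models_for(task_type: str) -> tuple[str, ...]:
    """Return the models to run for a given review task type."""
    candidates, count = ASSIGNMENTS[task_type]
    return candidates[:count]


for task in ASSIGNMENTS:
    print(task, "->", ", ".join(models_for(task)))
```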
@@ -0,0 +1,131 @@
# Finding 14: Cross-component interaction analysis: GPT-5 Mini enters as viable mid-tier; task framing strongly shapes output quality
**Date:** 2026-05-03
**Task:** Identify cross-component interaction failures in gargoyle's
`continuous-risk-monitoring.md` (459 lines) — a document specifying
PortfolioMonitor's interactions with OrderManager, PortfolioRisk, MarketData,
KillSwitch, ETS tables, and the pipeline supervision tree.
**How we used them:** Same document (full text) + same focused analytical
question to all 3 models via HAI proxy. Prompt was highly structured: specified
5 categories of cross-component failures to look for (semantic mismatches,
ordering violations, feedback loops, partial visibility, supervision boundary
effects) and required specific output format (components, sequence, gap, impact).
No tools, no project context beyond the document itself.
| Model | Time | Output tokens | Reasoning tokens | Findings |
|---|---|---|---|---|
| GPT-5 Mini | 68s | 5,445 | 2,240 | 6 (+1 cut off) |
| GPT-5 | 116s | 10,604 | 8,128 | 10 |
| Claude Sonnet 4.6 | 38s | 1,868 | (internal) | 8 |
**What they found — common ground (all 3 identified):**
- Fill-to-position query race (fill event triggers evaluation but position
store hasn't yet reflected the fill)
- Restrict flag ETS table destruction on PM crash → permissive window
- Kill switch check vs liquidation submission race
- Ticker subscription timing gap (new position opened but ticks not yet
subscribed → breach goes undetected)
**GPT-5 unique findings (not in either other model):**
- Stale prices are NOT fail-safe for drawdown (higher stale price → inflated
portfolio value → understated drawdown). The document claims "fail-safe"
but this only holds for exposure metrics, not drawdown. This is the most
architecturally significant finding across all three models.
- Price definition mismatch between PM (last_trade from ETS) and OrderManager/
broker (bid/ask/mid) causing mis-sized liquidation and oscillation
- Cross-component oscillation: PM hysteresis internal vs PRisk's immediate
binary restrict gate clearing (no cross-component cooldown)
- Liquidation stuck after OM restart (terminal events lost;
  liquidation_in_flight stays true indefinitely with no timeout/rehydration)
- "Minimal risk checks" not enforced — PM goes through same OM gates as
strategy orders but MarketHours/StalePrice controls may reject after-hours
or stale-price liquidation attempts
- FLATTEN mode semantics gap — PM refrains from liquidating when kill switch
engaged, but FLATTEN cancels open orders without actually CLOSING positions.
No component left to close positions.
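The stale-price finding at the top of this list can be checked with simple arithmetic. The numbers below are hypothetical (a single long position, a price that has fallen from 98 to 90 while the cached price still reads 98), and `drawdown` is an illustrative helper using the usual peak-to-current definition.

```python
def drawdown(high_water_mark: float, portfolio_value: float) -> float:
    """Fractional decline from the high water mark."""
    return (high_water_mark - portfolio_value) / high_water_mark


# Hypothetical scenario: one long position, price has dropped but the
# cached (stale) price has not caught up yet.
quantity = 1_000
high_water_mark = 100_000.0
true_price, stale_price = 90.0, 98.0

true_value = quantity * true_price
stale_value = quantity * stale_price

true_dd = drawdown(high_water_mark, true_value)
stale_dd = drawdown(high_water_mark, stale_value)

# Exposure computed from the stale price is overstated (conservative),
# while drawdown computed from it is understated (permissive).
print(f"exposure: stale {stale_value:,.0f} vs true {true_value:,.0f}")
print(f"drawdown: stale {stale_dd:.1%} vs true {true_dd:.1%}")
```

In this scenario the same stale price errs on the safe side for exposure and on the unsafe side for drawdown, which is why a blanket "fail-safe" claim does not hold.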
**Claude Sonnet 4.6 unique findings (not in either other model):**
- Liquidation feedback loop with PortfolioRisk — buy-to-cover for short
positions could INCREASE net long exposure at portfolio level, paradoxically
worsening concentration while fixing position-level metrics
- High water mark reset on pipeline restart masks true intraday drawdown
(restart → HWM resets to lower current value → drawdown calculated from
false baseline → larger losses permitted than intended)
- Multi-metric breach with single boolean flag — concentration liquidation
for AAPL sets liquidation_in_flight, blocking simultaneous daily P&L
liquidation for different positions
- Market close/open vs after-hours fills: the document claims to evaluate
  after-hours fills but does so using stale market-close prices
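The high water mark reset finding above lends itself to a worked example. All figures are hypothetical (a 10% drawdown limit, a peak of 100,000, and a restart when the portfolio is worth 95,000), and `halt_value` is an illustrative helper rather than anything named in the reviewed document.

```python
def halt_value(high_water_mark: float, max_drawdown_pct: int) -> float:
    """Portfolio value at which the drawdown limit is breached."""
    return high_water_mark * (100 - max_drawdown_pct) / 100


MAX_DRAWDOWN_PCT = 10          # hypothetical limit
true_peak = 100_000.0          # intraday high before the restart
value_at_restart = 95_000.0    # HWM is re-initialised to this on restart

threshold_before = halt_value(true_peak, MAX_DRAWDOWN_PCT)
threshold_after = halt_value(value_at_restart, MAX_DRAWDOWN_PCT)

# Loss actually permitted, measured from the true intraday peak.
effective_limit = (true_peak - threshold_after) / true_peak

print(f"halt threshold before restart: {threshold_before:,.0f}")
print(f"halt threshold after restart:  {threshold_after:,.0f}")
print(f"effective drawdown limit from true peak: {effective_limit:.1%}")
```

With these numbers the restart silently widens a 10% limit to 14.5% measured against the real intraday peak.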
**GPT-5 Mini unique findings (not in either other model):**
- OrderManager order splitting/remapping causing liquidation_in_flight
correlation failure (parent/child order ID mapping breaks terminal-event
detection). Well-reasoned but highly implementation-specific.
- Restrict/clear oscillation loop with strategy behavior (strategies react
to rejects → back off → restrict clears → strategies re-enter aggressively
→ re-breach). Good systems-thinking about emergent feedback.
**Quality assessment:**
- **GPT-5** produced the most findings (10) and the highest-quality
architectural insight: the stale-price/drawdown contradiction is a genuine
design flaw that contradicts the document's own safety claim. Multiple
findings showed cross-boundary reasoning about semantic mismatches (price
definition, FLATTEN semantics, gate bypass). Every finding named specific
components and described precise event sequences.
- **Claude Sonnet 4.6** was fast (38s, only 1,868 tokens) and produced 8
solid findings. The HWM reset finding and the multi-metric/single-flag
finding show genuine architectural reasoning. The liquidation feedback
loop (buy-to-cover worsening portfolio concentration) is subtle and
shows cross-position reasoning. However, some findings overlapped
significantly with the common-ground set and added less unique depth.
  Sonnet performed MUCH better here than on race condition identification
  (Finding #13): 8 findings against GPT-5's 10 here, versus 7 against 12 there.
- **GPT-5 Mini** produced 6 findings in 68s with 2,240 reasoning tokens.
Quality was genuinely good — the order-splitting/correlation finding
and the oscillation feedback loop both show real reasoning depth. It's
clearly NOT GPT-4.1 Mini — it reasons about component interactions,
  not just within-frame risks. However, it found fewer issues and its seventh
  finding was cut off mid-response (token limit or response truncation).
**Key insight — task framing as the dominant variable:**
This experiment used a much more structured prompt than previous ones:
specified 5 categories, required specific output format, explicitly excluded
single-component failures. The result: ALL models produced higher-quality,
more focused output than in earlier experiments with broader prompts. Even
Sonnet — which struggled on race conditions (Finding #13) — performed well
here. The structured categories likely helped models organize their reasoning
without losing track of what they were looking for.
The prompt explicitly asked for "cross-component interaction failures" rather
than general analysis. This is the narrow-lens effect from Finding #2, but
applied to a complex multi-component document. The lens is narrow (only
inter-component gaps) but the scope is broad (459 lines, many interactions).
This combination — narrow analytical lens + broad document scope — appears
to be the sweet spot for getting quality from all model tiers.
**GPT-5 Mini positioning:**
First time testing GPT-5 Mini. Results: 6 findings in 68s vs GPT-5's 10 in
116s. That's 60% of the findings in 59% of the time, with 28% of the
reasoning tokens (2,240 vs 8,128). Quality-per-finding was solid — the order
correlation finding especially showed genuine systems reasoning. GPT-5 Mini
appears to be a legitimate mid-tier: more capable than GPT-4.1 (which can't
do this kind of cross-boundary reasoning) but less exhaustive than GPT-5.
Viable for: first-pass screening, bulk document review where you'd run many
docs and can't afford full GPT-5 on each.
**Sonnet recovery from Finding #13:**
Sonnet went from 7 findings (with errors) on race conditions to 8 solid
findings here. The difference: this prompt was more structured, the document
was larger with more explicit interaction descriptions, and the task didn't
require pure temporal/sequential reasoning. "Cross-component interaction
failures" is closer to assumption-finding (Sonnet's strength) than race
condition identification (Sonnet's weakness). Task taxonomy continues to
matter more than raw model capability.
**Updated model assignment for cross-component analysis:**
1. GPT-5 — broadest coverage, finds semantic contradictions in the doc's
own claims (10 findings)
2. Sonnet 4.6 — fast and efficient, good at portfolio-level reasoning and
feedback loops (8 findings in 38s)
3. GPT-5 Mini — viable first-pass with genuine reasoning depth (6 findings)
4. (Opus untested for this task type — likely strong on design tensions)
@@ -0,0 +1,133 @@
# Finding 15: Design Coherence Analysis
**Date:** 2026-05-03
**Task:** Identify internal design incoherences in gargoyle's `failure-modes.md` (383 lines)
— places where the document's stated principles/invariants are contradicted by its own
specified mechanisms.
**How we used them:** Same document (full text) + same focused analytical question to all
3 models via HAI proxy. Highly structured prompt specifying 5 categories of incoherence
to look for (safety properties not enforced, state machine violations, recovery contradictions,
supervision conflicts, cross-mechanism contradictions). Required each finding to reference
specific sections. No tools, no project context beyond the document itself.
| Model | Time | Output tokens | Reasoning tokens | Incoherences found |
|---|---|---|---|---|
| Claude Sonnet 4.6 | ~39s | 1,168 | (internal) | 5 |
| Claude Opus 4.6 | ~105s | 3,378 | (internal) | 7 (8 attempted, 1 self-withdrawn) |
| GPT-5 | ~120s | 10,235 | 9,088 | 4 |
**What they found — common ground (all 3 identified):**
- State machine universality claim vs Strategy.Worker crash behavior (process
crashes bypass the degraded state entirely — no transition path in the model)
- Market data staleness advisory-only vs the "don't trade when ambiguous" principle
(or vs concurrent failure auto-halt)
- `pending_cancel`/`pending_replace` absent from recovery query set (GPT-5 and
Sonnet found this directly; Opus addressed the broader state machine gap)
**GPT-5 unique findings (not in either Claude model):**
- Kill switch halted = "process terminated" vs kill switch requiring RUNNING
processes for cancel_all, FLATTEN, and LIQUIDATE modes. The state definition
claims processes are terminated, but the mechanisms require them alive to
execute orders. **This is the most architecturally significant finding** — it
reveals a fundamental definitional error in the state machine.
- Per-symbol degradation contradicts the process-level degradation semantics.
A worker "enters degraded" but continues operating for non-stale symbols —
violating the stated definition that degraded = "cannot perform primary
function." The metrics/eventing model has no per-symbol dimension.
**Claude Opus unique findings (not in either other model):**
- `:rest_for_one` cascade creates a FIFTH implicit state (terminated-and-
restarting) not in the four-state model — processes that were `normal` are
  forcibly killed (not by kill switch) and restart. (Opus also withdrew one
  candidate finding that initially looked like an incoherence but turned out
  to be consistent.)
- PortfolioMonitor continues evaluating with stale data ("fail-safe") while
Strategy.Workers are stopped for the SAME condition — contradicts both the
universal state machine (PM doesn't transition to degraded) and the doc's
reasoning about why stale data is dangerous.
- Path-dependent recovery criteria: Strategy.Worker recovery requires 21+ bars
after crash but only "price continuity check" after staleness. The state
machine's single "catch-up complete" exit condition can't express this.
- `halted → [*]` transition in state diagram is logically impossible if "halted"
means the process is already terminated — dead processes can't fire transitions.
- Compound failure detection requires a meta-observer across processes but the
per-process state machine model has no way to express cross-process conditions.
**Claude Sonnet unique findings (not in either other model):**
- Market data global staleness: the failure table says "Manual (disengage)" for
recovery — implying automatic engagement happened — but the text says it's
advisory only. Table contradicts prose.
- ReconciliationGate: doc claims gate survives OM crash (separate supervision
tree), but then says "missing ETS table = not ready" when OM crashes. If the
gate survives, why would its table be missing?
- Signal survival claims are contradictory between sections: worker crash says
downstream signals survive, but OM crash says all upstream signals lost.
(NOTE: this is actually describing different scenarios — worker crash doesn't
cascade to SignalRisk, OM crash does via :rest_for_one. Sonnet may have
misread the architecture here — the two statements are consistent when you
understand the supervision tree.)
**Quality assessment:**
- **GPT-5** found only 4 incoherences but TWO of them are genuinely critical
architectural findings. The "halted = terminated" vs "kill switch requires
running processes" contradiction is a real design error — you can't both
terminate processes AND require them to execute cancel/liquidation orders.
The per-symbol degradation finding is also a real modeling gap. GPT-5 was
MORE SELECTIVE here than in previous experiments — it didn't pad with
medium-severity findings. Each of its 4 was high/critical.
- **Claude Opus** produced the most findings (7 valid) with characteristic
depth. Its self-correction (withdrawing finding #6 after deeper analysis)
shows intellectual honesty rare in model outputs. The PortfolioMonitor
stale-data contradiction is genuinely insightful — same input condition,
opposite response, no justification within the state machine model. The
compound failure meta-observer finding identifies a modeling category error.
Opus also found modeling imprecisions (path-dependent recovery, halted → [*]
impossibility) that the other models didn't notice.
- **Claude Sonnet** found 5 issues quickly (39s, 1,168 tokens) but quality was
mixed. Finding #4 (ReconciliationGate) raises a genuine question about
the ETS table ownership claim. Finding #1 (table vs prose contradiction on
market data staleness) is a real documentation inconsistency. However,
Finding #5 appears to misread the supervision architecture — the two
statements about signal survival ARE consistent when you understand that
different crashes cascade differently. Sonnet produced one false positive.
**Key insight — "design coherence" is a NEW analytical category with distinct model strengths:**
This is different from assumption-finding (Findings #10-12), race conditions
(Finding #13), and cross-component interactions (Finding #14). Coherence
checking requires the model to hold MULTIPLE parts of the document in tension
with each other and reason about whether they're compatible. Results:
- **GPT-5** was MORE SELECTIVE than in any previous experiment. Only 4 findings
  vs 10-28 in other tasks. But precision was near-perfect: all 4 are genuine
contradictions. This suggests GPT-5's reasoning tokens are being used for
VERIFICATION (checking whether apparent contradictions hold up) rather than
EXPLORATION (finding more things). The 9K reasoning tokens produced 4 findings
vs the usual 10+ — GPT-5 is self-editing aggressively.
- **Opus** hit its sweet spot. Coherence checking IS design-tension identification
— Opus's consistent strength. Finding incoherences requires exactly the kind
of "how does this design disagree with itself" reasoning that Opus excels at.
It also showed unique self-correction behavior (withdrawing a finding after
deeper analysis).
- **Sonnet** was fast but produced a false positive. Coherence checking requires
holding multiple document sections in memory simultaneously and reasoning about
their compatibility — this is harder than assumption-finding (where you
reason about one mechanism at a time) but easier than race conditions (which
require sequential temporal reasoning). Sonnet occupies a middle ground.
**Model ranking for design coherence checking:**
1. Claude Opus 4.6 — most findings, highest depth, self-correcting (7 valid)
2. GPT-5 — fewest findings but near-perfect precision, finds the critical ones (4)
3. Claude Sonnet 4.6 — fast screening, but prone to false positives on
architectural misreads (4/5 valid)
**This inverts the usual GPT-5 > Opus ordering.** In previous experiments, GPT-5
consistently found MORE issues. Here, GPT-5 was more selective than Opus. The
task type (self-consistency checking) favors Opus's "design tension" reasoning
style over GPT-5's "exhaustive exploration" style. GPT-5 apparently uses its
reasoning to VERIFY rather than GENERATE when the task is about contradictions
rather than gaps.
**Practical implication:** For architecture documents, run coherence checking as
a separate pass using Opus as the primary model. GPT-5's higher precision means
it's good for confirming which Opus findings are genuine vs overreads. The
two-pass approach: Opus generates candidates → GPT-5 validates → result is the
intersection plus GPT-5's independent finds.
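The two-pass combination described above is a pair of set operations. The sketch below uses placeholder finding IDs; which Opus candidates GPT-5 would actually confirm is not reported in this finding, so the sets are hypothetical and only the combination rule is taken from the text.

```python
def two_pass_result(
    opus_candidates: set[str],
    gpt5_validated: set[str],
    gpt5_independent: set[str],
) -> set[str]:
    """Opus candidates confirmed by GPT-5, plus GPT-5's own independent finds."""
    confirmed = opus_candidates & gpt5_validated
    return confirmed | gpt5_independent


# Placeholder IDs for illustration only, not real experiment output.
opus_candidates = {"opus-1", "opus-2", "opus-3", "opus-4"}
gpt5_validated = {"opus-1", "opus-3", "opus-4"}   # GPT-5 rejects opus-2
gpt5_independent = {"gpt5-1", "gpt5-2"}

final = two_pass_result(opus_candidates, gpt5_validated, gpt5_independent)
print(sorted(final))
```

Taking the intersection first means an Opus overread is dropped unless GPT-5 confirms it, while GPT-5's independent finds are kept without a second opinion.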
@@ -0,0 +1,131 @@
# Finding 16: Specification completeness: Sonnet 4.5 produces 2x the findings of Sonnet 4.6 on implementation-gap analysis; quality vs volume tradeoff
**Date:** 2026-05-03
**Task:** Identify specification gaps in gargoyle's `kill-switch.md` (185 lines) — places
where an implementer would be forced to guess or decide on their own because the spec
doesn't clearly specify behavior. New analytical lens not previously tested.
**How we used them:** Same document (full text) + same focused analytical question to all
3 models via HAI proxy. Highly structured prompt specifying 5 categories of underspecification
(behavioral ambiguity, missing edge cases, ordering/sequencing gaps, interface contracts
undefined, concurrency semantics omitted). Required specific output format per finding
(gap, section, what implementer must decide, risk if wrong, severity). No tools, no
project context beyond the document itself.
| Model | Time | Output tokens | Reasoning tokens | Gaps found | Critical | High | Medium | Low |
|---|---|---|---|---|---|---|---|---|
| Claude Sonnet 4.6 | 73s | 3,403 | (internal) | 13 | 8 | 4 | 0 | 1 |
| Claude Sonnet 4.5 | 102s | 5,191 | (internal) | 25 | 14 | 6 | 4 | 1 |
| GPT-5 | 109s | 10,140 | 7,872 | 19 | 8 | 7 | 3 | 0 |
**What they found — common ground (all 3 identified):**
- Pipeline process identification ambiguity (which processes are "pipeline processes")
- Per-user process scope mapping (how to terminate only one user's processes)
- ETS table ownership and lifecycle (who owns it, what happens on crash)
- Concurrent engage operations (what happens when two sources engage simultaneously)
- Liquidation order tagging mechanism (what the tag is, how verified)
- Process restart prevention (how "must not restart" is enforced)
- Engage sequence atomicity (partial failure between DB write and termination)
- Startup ordering and ETS readiness (pipeline starting before ETS populated)
- Disengage sequence ordering (what happens and in what order)
**Sonnet 4.5 unique findings (not in either other model):**
- ETS table schema/structure (set vs ordered_set, key format, value schema)
- Missing ETS detection mechanism (catch :badarg vs table existence check)
- Database write atomicity with ETS (transaction boundaries, rollback semantics)
- Per-user engage while global is already engaged (is it a no-op or error?)
- Broker rejection semantics ("already filled" vs "invalid cancel" distinction)
- Cold-start gate interaction (independence vs dependency of the two gates)
- User deletion with active kill switch (orphaned rows, cascade semantics)
- Global disengage effect on per-user states (independent or auto-clear?)
- Audit log write failure during engage (critical-path vs best-effort)
- Dashboard control ambiguity in LIQUIDATE mode (contradictory disable/enable)
- Cancel timeout duration (operational parameter not specified)
- Manual order source code path during LIQUIDATE (how orders bypass the dead pipeline)
**GPT-5 unique findings (not in either other model):**
- Combined global/per-user mode semantics (what happens when global=RESTRICT,
user=LIQUIDATE — can user's liquidation proceed?)
- Scope of "all" in cancel_all and liquidation (system-wide vs per-user)
- Gate behavior when ETS missing but liquidation needed (conflicting requirements:
fail-closed says block, but liquidation needs to pass)
- Disengage during in-flight cancellations (what happens to racing tasks)
- Gate placement relative to broker submission (exact point in the flow)
- Engage latency expectations (no quantified SLA)
- Mode change while already engaged (RESTRICT → LIQUIDATE without disengage)
- Dashboard vs backend scope for manual liquidation (individual vs bulk only)
**Sonnet 4.6 unique findings (not in either other model):**
- ETS sequencing relative to process termination (ETS before or after kill?)
- Concurrent disengage + re-engage race (specific interleaving scenario)
- Close-only enforcement mechanism (UI-only vs backend validation)
- Order-in-flight past ETS gate during termination (already-checked orders)
**Quality assessment:**
- **Claude Sonnet 4.5** was the most EXHAUSTIVE (25 gaps) but with notable
quality variance. Several findings were highly specific and implementation-
relevant (ETS schema, missing-table detection, broker rejection semantics).
Others were relatively obvious or lower-impact (user deletion, audit log
failure, cancel timeout duration). The 14 Critical ratings feel somewhat
generous — some would be more accurately rated as High in practice. Output
was well-structured with clear per-finding format.
- **GPT-5** found 19 gaps with consistent high quality. Its unique findings
show cross-cutting reasoning: the combined mode semantics finding (global
vs per-user mode interaction) identifies a genuine specification gap that
neither Sonnet version noticed. The "ETS missing but liquidation needed"
finding is architecturally significant — it identifies a CONTRADICTION in
the spec's own rules (fail-closed blocks everything, but liquidation must
pass). Every finding was actionable. More selective severity ratings
(8 Critical vs Sonnet 4.5's 14).
- **Claude Sonnet 4.6** was the most SELECTIVE (13 gaps) but with the highest
precision. Every finding was genuinely a specification gap that an
implementer would face. The ETS sequencing finding (#4) is particularly
well-reasoned — it identifies a specific ordering dependency that creates
a race window. Sonnet 4.6 appears to self-filter aggressively, producing
only findings it's confident about. Higher signal-to-noise than 4.5.
**Key insight — Sonnet 4.5 vs 4.6 on analytical tasks:**
This is the first direct comparison between Claude model versions on the same
analytical task. Key differences:
- **Volume:** 4.5 produced almost 2x the findings (25 vs 13)
- **Tokens:** 4.5 used ~1.5x the output tokens (5,191 vs 3,403)
- **Time:** 4.5 took ~1.4x longer (102s vs 73s)
- **Severity distribution:** 4.5 had more Critical findings (14 vs 8) but
with more generous severity ratings
- **Quality per finding:** 4.6 had higher average quality; fewer "obvious"
or lower-impact findings
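The ratios above follow directly from the results table. A short sketch, using only the figures reported for the two Sonnet runs (the dictionary layout is an illustration, not part of the experiment tooling):

```python
# Figures copied from the results table for this finding.
sonnet_45 = {"findings": 25, "output_tokens": 5191, "seconds": 102}
sonnet_46 = {"findings": 13, "output_tokens": 3403, "seconds": 73}

# 4.5-to-4.6 ratio for each measured quantity.
ratios = {key: round(sonnet_45[key] / sonnet_46[key], 2) for key in sonnet_45}

# Output tokens spent per reported finding.
tokens_per_finding = {
    "4.5": round(sonnet_45["output_tokens"] / sonnet_45["findings"]),
    "4.6": round(sonnet_46["output_tokens"] / sonnet_46["findings"]),
}

print(ratios)
print(tokens_per_finding)
```

On these figures 4.6 spends more output tokens per finding than 4.5 (about 262 vs 208), which fits the observation that it reports fewer findings but develops each one further.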
The 4.6 model appears to have been trained toward higher precision/selectivity.
It finds fewer things but each finding is more reliably a genuine gap. The 4.5
model is more exhaustive but includes findings that a reviewer might triage as
"yes, technically, but not really a spec gap." This would be consistent with later
Claude versions being tuned toward concision and selectivity, though a single task
cannot confirm that.
**For practical use:** If you want completeness (cast a wide net, accept some
noise): use 4.5. If you want precision (every finding is actionable, no triage
needed): use 4.6. For architecture review where missing a gap has cost, 4.5's
exhaustiveness is probably worth the noise. For review where false positives
cost attention (e.g., PR review comments), 4.6's selectivity is preferred.
**GPT-5 vs Sonnet comparison on this task:**
GPT-5 (19 findings) sits between the two Sonnets in volume but has the highest
consistency — no obvious misses or inflated severities. Its unique strength
here: finding CONTRADICTIONS within the spec's own rules (ETS-missing blocking
conflicts with liquidation needing to pass). This is consistent with Finding #15
where GPT-5 was unusually selective but precise on coherence checking.
Specification completeness analysis appears to be a task where:
1. Sonnet 4.5 is strongest for breadth (25 findings, catches operational gaps)
2. GPT-5 is strongest for detecting spec self-contradictions (19 findings, high precision)
3. Sonnet 4.6 is strongest for precision (13 findings, zero noise)
**Updated model version comparison:**
- Claude 4.6 → higher precision, more selective, concise
- Claude 4.5 → more exhaustive, more verbose, occasional severity inflation
- This is a genuine tradeoff, not a simple regression or improvement
**Practical implication:** Consider running BOTH Sonnet versions: 4.5 catches things 4.6
filters out (ETS schema, broker rejection semantics, cold-start gate interaction).
4.6 catches things with more specificity (sequencing gaps, exact race windows).
For a one-shot budget: 4.5 if you want coverage, 4.6 if you want actionability.
GPT-5 if you want to find where the spec contradicts itself.
@@ -0,0 +1,158 @@
# Finding 18: Temporal boundary analysis: GPT-5 is most exhaustive; Opus finds design-level contradictions; Sonnet 4.5 is structured but less deep
**Date:** 2026-05-04
**Task:** Identify temporal boundary vulnerabilities in gargoyle's `escalation-policy.md`
(238 lines) — scenarios where the timing model (evaluation cycles, debounce counts,
cooldown periods) creates windows of incorrect or dangerous behavior.
**How we used them:** Same document (full text) + same focused analytical question to all
3 models via HAI proxy. Highly structured prompt specifying 5 categories of temporal
vulnerability (timing exploitation, counter reset abuse, asymmetric time exposure,
cross-metric temporal interactions, state loss temporal effects). Required specific
output format per finding (name, sequence with cycle numbers, mechanism, severity, fix).
No tools, no project context beyond the document itself.
| Model | Time | Output tokens | Reasoning tokens | Findings | Critical | High | Medium |
|---|---|---|---|---|---|---|---|
| GPT-5 | ~128s | 9,175 | 5,888 | 15 | 3 | 7 | 2 |
| Claude Opus 4.6 | ~120s | 5,112 | (internal) | 10 | 3 | 5 | 2 |
| Claude Sonnet 4.5 | ~100s | 4,056 | (internal) | 12 | 3 | 3 | 3 |
**What they found — common ground (all 3 identified):**
- Flash crash / inter-evaluation gap exploitation (metric spikes between discrete
evaluation cycles go undetected)
- Single clear cycle resetting debounce counter (transient recovery defeats escalation
despite sustained risk — metric can breach 80%+ of cycles and never escalate)
- Asymmetric escalation time vs loss compounding rate (11 cycles to reach liquidation
while losses compound every single cycle)
- Monitor crash resets state to Clear, losing all escalation progress
- Liquidation re-trigger requiring full debounce reset, delaying subsequent batches
- Kill switch N value unspecified (timing indeterminacy)
**GPT-5 unique findings (not in either other model):**
- Boundary-alignment counter starvation: explicitly modeled the "adversarial flicker"
pattern (breaching 2 cycles, 1 clear, repeat — 66% breach time, never escalates)
with a precise mathematical framing of why K-of-N is needed
- Cycle-length drift under load: GC pauses or CPU contention stretching evaluation
intervals means "3 cycles" could be 12 minutes instead of 90 seconds when it
matters most (high-load market stress = slowest evaluations)
- Adversarial boundary timing (market microstructure masking): illiquid instruments
where opposing prints predictably arrive near evaluation boundaries, exploiting
deterministic sampling points
- Cross-metric escalation gap and hedging prohibition: Restrict prevents ALL new
positions including risk-REDUCING hedges needed for a different metric still
escalating on its own timeline — protection for metric A actively worsens metric B
- Cooldown stall causing prolonged Restrict: repeated transient spikes near hysteresis
threshold reset cooldown indefinitely while metric is actually safe
- State inconsistency between restriction flags and monitor after restart:
documented asymmetry where flag persists (manual clear) but state resets (auto
clear) — creates orphaned restriction or unprotected window depending on
reconciliation approach
- Metric computation fail-closed interacting with debounce: system errors create
false escalations with long cooldown, potentially blocking hedging trades
- Unspecified N for kill switch post-liquidation breaches: coupled with crash
reset, system can loop indefinitely without reaching kill switch
- In-liquidate flicker stall: one cycle below threshold after partial fill resets
re-trigger counter, stalling further liquidation
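The flicker pattern and the K-of-N alternative can be simulated in a few lines. This is a sketch under stated assumptions: the consecutive-breach threshold of 3 and the 3-of-5 window are illustrative values chosen here, not parameters quoted from `escalation-policy.md`, and both function names are hypothetical.

```python
from collections import deque


def consecutive_debounce(samples, threshold):
    """Return the 1-based cycle at which escalation fires, or None.

    Any clear cycle resets the breach counter to zero.
    """
    run = 0
    for cycle, breached in enumerate(samples, start=1):
        run = run + 1 if breached else 0
        if run >= threshold:
            return cycle
    return None


def k_of_n_debounce(samples, k, n):
    """Escalate when at least k of the last n cycles breached."""
    window = deque(maxlen=n)
    for cycle, breached in enumerate(samples, start=1):
        window.append(breached)
        if sum(window) >= k:
            return cycle
    return None


# Adversarial flicker: breach, breach, clear, repeated for 30 cycles.
flicker = [True, True, False] * 10
breach_share = sum(flicker) / len(flicker)

print(round(breach_share, 2))            # share of cycles spent in breach
print(consecutive_debounce(flicker, 3))  # consecutive counter never reaches 3
print(k_of_n_debounce(flicker, 3, 5))    # sliding window fires early
```

Under these assumed parameters the consecutive counter never escalates even though the metric is in breach for two thirds of all cycles, while the sliding window fires on the fourth cycle.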
**Claude Opus unique findings (not in either other model):**
- De-escalation cooldown exploitation (predictable window): after cooldown completes
and restriction lifts, strategy has a GUARANTEED 5+ cycle window of unrestricted
trading before Restrict can re-engage — an automated strategy could systematically
exploit this predictable safe window to re-enter dangerous positions
- Orphaned restriction flag asymmetry framed as a DESIGN CONTRADICTION: the failure
modes table specifies opposing recovery paths for state (automatic → Clear) vs
flags (manual clear), creating an irreconcilable dual state. Opus uniquely
identified that operator intervention to clear the flag could inadvertently
create a WORSE protection gap than leaving it orphaned
- Cross-finding synthesis: Opus's summary explicitly identified that the
three Critical findings share a common cause (debounce optimizes against false
positives at the expense of false negatives during sustained events) and proposed
a single architectural fix (severity-aware fast path) that addresses all three
**Claude Sonnet 4.5 unique findings (not in either other model):**
- De-escalation timing not accounting for proximity to breach threshold: system
removes protection while metric is still near-dangerous, and re-escalation
requires full debounce — created a specific "whipsaw" scenario with cycle numbers
- Manual-only de-escalation from Liquidate creates UNBOUNDED recovery time:
if triggered at 2 AM Saturday, trading disabled until Monday despite metrics
recovering in minutes. Framed as contradiction with "autonomous" design goals
- Evaluation cycle synchronization assumption: no handling of variable timing
(CPU contention, GC pauses) — implicit throughout but never addressed
- Cold start escalation ambiguity: system starts with no prior state while
portfolio may already be in breach condition
- De-escalation event ordering race: multiple metrics de-escalating simultaneously
may emit events in non-deterministic order, confusing external observers
**Quality assessment:**
- **GPT-5** was the most exhaustive (15 findings) and showed the strongest
mathematical/systems reasoning. Its unique findings included precise attack
models (adversarial flicker, boundary alignment, microstructure masking) that
describe exact exploitation patterns with percentages and cycle counts. The
cross-metric hedging prohibition finding is architecturally significant — it
identifies that protection for one metric can actively CREATE risk for another.
Every finding was actionable with specific fixes.
- **Claude Opus 4.6** produced fewer findings (10) but with characteristic depth
and self-awareness. Its cooldown exploitation finding identified a PREDICTABLE
exploit window that an automated strategy could systematically abuse — framed
not as an accident but as an adversarial opportunity. The summary synthesis
(identifying common cause across Critical findings) shows meta-analytical
capability the other models didn't demonstrate. Opus also uniquely identified
that human intervention to fix one problem could create a WORSE problem —
second-order operational reasoning.
- **Claude Sonnet 4.5** was well-structured (12 findings, clean severity tiers,
organized by Critical/High/Medium/Low) and faster than both other models.
Its findings were solid but less architecturally deep. The manual de-escalation
contradiction finding was genuinely insightful (unbounded recovery time vs
autonomous design goals). However, several findings restated concepts the
other models covered with less specificity about exploitation mechanics.
**Key insight — temporal reasoning as a task type:**
This is the first experiment specifically testing "temporal boundary analysis" —
reasoning about time-domain properties of a state machine (evaluation frequency,
counter semantics, cooldown mechanics, crash/restart timing).
Results compared to Finding #13 (race condition identification on a concurrency doc):
- GPT-5: 15 findings here vs 12 in Finding #13. Consistent high performance
on temporal reasoning tasks across both experiments.
- Opus: 10 findings here vs 10 in Finding #13. Remarkably consistent — Opus
produces ~10 high-quality findings regardless of temporal task variant.
- Sonnet 4.5: 12 findings here (first test). Compare to Sonnet 4.6's 7 findings
(with errors) in Finding #13. Sonnet 4.5 handles temporal reasoning better than
4.6 — consistent with Finding #16 showing 4.5 is more exhaustive across task types.
**Sonnet 4.5 vs 4.6 on temporal reasoning (inferred comparison):**
Sonnet 4.6 struggled significantly on race condition identification (Finding #13:
7 findings with analytical errors, misreading architecture). Sonnet 4.5 here
produced 12 solid findings with no apparent misreadings. This suggests 4.5's
exhaustiveness advantage extends to temporal reasoning — the additional
exploration it does (vs 4.6's aggressive self-filtering) catches more temporal
interactions. Confirms Finding #16's pattern: 4.5 for coverage, 4.6 for precision.
**The structured-prompt effect continues:**
All three models produced focused, high-quality output with this highly structured
prompt (5 specific categories + required output format). This confirms Finding #14:
narrow analytical lens + broad document scope is the sweet spot for all model tiers.
The prompt structure appears to be a stronger predictor of output quality than model
choice for the bottom 80% of findings (all models find the common-ground issues).
Model choice matters for the TOP 20% — the unique insights that require deeper
reasoning about system interactions.
**Updated model assignment for temporal boundary analysis:**
1. GPT-5 — most exhaustive, strongest at modeling adversarial exploitation patterns
and mathematical edge cases (15 findings)
2. Claude Sonnet 4.5 — good volume with clean structure, viable for first-pass
temporal analysis (12 findings, no errors)
3. Claude Opus 4.6 — fewest findings but highest insight density, uniquely
identifies predictable exploit windows and operational second-order effects
(10 findings)
**Practical implication:** For temporal analysis on state machines and timing-dependent
policies, the three-model stack produces genuine complementary value:
- GPT-5 catches the adversarial attack patterns and mathematical edge cases
- Opus catches the predictable exploit windows and operational contradictions
- Sonnet 4.5 provides good breadth at lower cost with clean severity categorization
The union of unique findings across all three models reveals significantly more
temporal vulnerabilities than any single model alone. For a document governing
autonomous financial actions (liquidation, kill switch), the cost of running all
three (~$1-2) is trivially justified against the risk of missing a timing exploit.
@@ -0,0 +1,124 @@
# Finding 19: Union coverage test: GPT-5 Mini + Sonnet 4.6 covers ~71% of GPT-5's findings; the missing 29% is where the real value lives
**Date:** 2026-05-04
**Task:** Identify hidden assumptions in gargoyle's `trading-pipeline.md` (1,110 lines,
~62KB) — the most complex document tested so far, covering the full end-to-end path
from tick ingestion through order execution.
**How we used them:** Same document (full text, no truncation) + same focused analytical
question to all 3 models via HAI proxy. Standard hidden-assumption prompt with 5
categories (runtime behavior, external dependencies, timing/ordering, scale/load,
uncovered failure modes). Required specific output format per finding. No tools, no
project context beyond the document itself.
| Model | Time | Output tokens | Reasoning tokens | Assumptions found |
|---|---|---|---|---|
| GPT-5 | 99s | 9,418 | 5,696 | 35 |
| GPT-5 Mini | 93s | 5,309 | 1,792 | 21 |
| Claude Sonnet 4.6 | 38s | 1,792 | (internal) | 17 |
**Coverage analysis — can Mini + Sonnet together replace GPT-5?**
Categorized each of GPT-5's 35 findings by whether the union of Mini + Sonnet
also identified the same assumption:
- **Covered by BOTH Mini and Sonnet:** ~12 findings (common ground — any model
finds these: idempotency, single-writer, clock sync, instrument resolution,
fill immutability, reconciliation gate, backpressure, fill correlation, event
ordering, audit scalability, PortfolioRisk bottleneck)
- **Covered by Mini only (not Sonnet):** ~7 findings (transactional atomicity,
audit causal consistency, modification-in-flight enforcement, OM throughput,
decimal precision, PM/PR close-only race, partition duplicate submit)
- **Covered by Sonnet only (not Mini):** ~6 findings (market data feed rates,
pipeline-vs-market speed, corporate actions atomicity, kill switch partition,
shared port isolation, market close vs auction fills)
- **Union(Mini + Sonnet) total coverage:** ~25/35 = **~71%** of GPT-5's findings
- **GPT-5 unique (missed by both):** ~10-18 findings depending on strictness
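The coverage figure is a set computation over matched findings. A minimal sketch, assuming each GPT-5 finding has been manually matched against the other models' output; the integer identifiers are placeholders sized to the approximate counts above, and `union_coverage` is a hypothetical helper.

```python
def union_coverage(reference, *others):
    """Fraction of the reference findings covered by the union of the others."""
    covered = reference & set().union(*others)
    return len(covered) / len(reference)


# Placeholder identifiers sized to the approximate counts reported above.
gpt5 = set(range(35))
both = set(range(0, 12))          # found by Mini and Sonnet
mini_only = set(range(12, 19))    # found by Mini, not Sonnet
sonnet_only = set(range(19, 25))  # found by Sonnet, not Mini

mini = both | mini_only
sonnet = both | sonnet_only

print(round(union_coverage(gpt5, mini), 2))          # Mini alone
print(round(union_coverage(gpt5, sonnet), 2))        # Sonnet alone
print(round(union_coverage(gpt5, mini, sonnet), 2))  # the pair together
```

With these counts each cheaper model alone covers roughly half of GPT-5's findings, and the pair together reaches about 71%.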
**What GPT-5 uniquely found that the cheaper pair missed:**
The missing 29% is NOT random — it's systematically different in character:
1. **Operational edge cases:** Default TIF "day" broker semantics, OrderRate
counting retries, extended-hours MarketHours mismatch, fractional quantities,
local expiry timer precision per instrument
2. **Design-level interaction gaps:** PortfolioRisk concurrent decision race
(snapshot stale between two parallel approvals), re-validation gap between
approval and submit, decision loss on crash after audit write
3. **Domain-specific knowledge:** Manual broker-side actions conflicting with
state machine, options/complex instrument position_effect mapping, Decision→Order
1:1 invariant vs broker auto-splitting, wash sale retroactive P&L mutation
4. **Architectural observations:** Reduction re-entry rule insufficiency,
PortfolioMonitor coalescing vs fast breach detection, multi-aggregator fanout
and audit partial writes, replay/backtest alignment with production controls
These share a common trait: they require **domain expertise** (knowing how brokers
actually behave, how regulatory rules interact, how production trading systems
fail in practice) combined with **architectural reasoning** (how the design's own
mechanisms interact under those real-world conditions). The cheaper models find
assumptions about the document's internal consistency; GPT-5 additionally finds
assumptions about the document's relationship to the external world it must
operate in.
**GPT-5 Mini vs Sonnet 4.6 — complementary, not redundant:**
Mini and Sonnet covered different gaps:
- Mini was stronger on **internal consistency** (transactional atomicity, causal
consistency, decimal precision, modification serialization)
- Sonnet was stronger on **external interactions** (market data feeds, corporate
actions, kill switch distribution, shared resource isolation)
This aligns with previous findings: Mini reasons about implementation mechanics;
Sonnet reasons about system boundaries and external interactions. Their union
covers more ground than either alone.
**Cost comparison:**
| Approach | Total tokens | Approx. cost | Coverage of GPT-5 |
|---|---|---|---|
| GPT-5 alone | ~21K (9.4K output + 5.7K reasoning) | ~$0.80 | 100% (35 findings) |
| Mini + Sonnet | ~7.1K output + 1.8K reasoning | ~$0.25 | ~71% (25/35 findings) |
| All three | ~28K total | ~$1.05 | >100% (35 + unique Sonnet/Mini extras) |
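The cost figures can be turned into per-finding numbers. This sketch uses only the approximate costs and counts from the table above; the variable names are illustrative.

```python
# Approximate figures from the cost table above.
gpt5_cost, pair_cost = 0.80, 0.25
gpt5_findings, pair_covered = 35, 25

# Share of GPT-5's cost that the Mini + Sonnet pair requires.
cost_fraction = round(pair_cost / gpt5_cost, 2)

# Cost per GPT-5-equivalent finding under each approach.
per_finding_gpt5 = round(gpt5_cost / gpt5_findings, 3)
per_finding_pair = round(pair_cost / pair_covered, 3)

# Cost of adding GPT-5 on top of the pair, per finding the pair missed.
missed = gpt5_findings - pair_covered
marginal_per_missed = round(gpt5_cost / missed, 2)

print(cost_fraction, per_finding_gpt5, per_finding_pair, marginal_per_missed)
```

Read this way, the findings the pair misses are the most expensive ones to obtain (about $0.08 each), but still cheap in absolute terms.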
**Key insight — the 71% coverage is a floor, not a ceiling:**
The union covers 71% of GPT-5's specific findings. But Mini and Sonnet each
also produced findings that GPT-5 DIDN'T make:
- Sonnet: DailyLossLimit query performance scaling, instrument reference data
propagation atomicity across components
- Mini: Signal audit correlation ambiguity under replay/duplicate ticks
So the total unique finding space is LARGER than any single model. Running all
three produces the most comprehensive analysis.
**Answer to the open question: "Would running GPT-5 Mini + Sonnet together
approach GPT-5's coverage at lower combined cost?"**
**Partially.** The pair covers ~71% of GPT-5's findings at ~31% of the cost.
But the missing 29% is disproportionately valuable — it contains the
domain-specific, interaction-level, real-world-knowledge findings that are
most likely to prevent production incidents. For a quick sanity check or
first-pass screening, Mini + Sonnet is excellent value. For architecture
review where completeness matters (financial system, safety-critical), GPT-5
is not replaceable by cheaper models — its unique findings are exactly the
ones that would cause real-world failures.
**Practical implication:** The optimal strategy depends on stakes:
- **Low stakes** (internal doc review, non-critical systems): Mini + Sonnet
is 71% coverage at 31% cost — strong ROI
- **High stakes** (financial systems, safety-critical): run all three — the
~$1 total cost is irrelevant vs the value of the extra 10-18 findings
- **Budget-conscious high stakes:** run GPT-5 alone — it subsumes most of
what Mini + Sonnet find, and adds the critical domain-knowledge findings
The cost argument for Mini + Sonnet as a GPT-5 REPLACEMENT doesn't hold for
important work. The cost argument for Mini + Sonnet as a GPT-5 COMPLEMENT
is strong — they catch a few things GPT-5 misses, and the union of all three
is the most thorough analysis available.
**Document complexity observation:**
This is the largest document tested (1,110 lines vs previous 185-785 lines).
GPT-5's finding count scaled up (35 vs 20-26 on smaller docs) while maintaining
quality — no padding with obvious/low-value findings. Mini also scaled (21 vs
6 on 459-line doc in Finding #14). Sonnet scaled less (17 vs 12-17 on smaller
docs) — it appears to have a natural output ceiling regardless of document size,
consistent with its self-filtering behavior observed in previous findings.
@@ -0,0 +1,163 @@
# Finding 20: Invariant violation path analysis: GPT-5 is maximally selective (3 findings, all genuine); Opus shows unique self-correcting analytical style; new task type favors precision over exhaustiveness
**Date:** 2026-05-04
**Task:** Identify invariant violation paths in gargoyle's `user-pipeline-lifecycle.md`
(730 lines) — sequences of legal operations that can violate the system's stated or
implied invariants. NEW analytical lens not previously tested, distinct from assumption-
finding, race conditions, or coherence checking.
**How we used them:** Same document (full text) + same focused analytical question to all
3 models via HAI proxy. Highly structured prompt specifying 5 categories of invariant
violations (state machine escapes, invariant composition failures, monotonicity violations,
idempotency boundary violations, authority inversion sequences). Required specific output
format per finding. No tools, no project context beyond the document itself.
| Model | Time | Output tokens | Reasoning tokens | Findings |
|---|---|---|---|---|
| GPT-5 | 143s | 784 | 12,032 | 3 |
| Claude Opus 4.6 | 113s | 6,183 | (internal) | 7 (with 2 self-corrections) |
| Claude Sonnet 4.6 | 23s | 1,266 | (internal) | 5 |
**What they found — common ground (2+ models identified):**
- **Periodic reconciliation overrides operator manual stop** (GPT-5 #3 + Opus #5 +
Sonnet #1): An admin who stops a pipeline via `stop_user/1` with `:admin_action`
has their decision overridden within 5 minutes by periodic reconciliation, because
there's no "admin stopped" state in `check_eligibility/1`. All three models
independently identified this as the clearest authority inversion.
- **DynamicSupervisor restart bypasses eligibility gate** (Opus #1/#3 + Sonnet #2):
When `UserPipeline.Supervisor` crashes and is restarted by OTP supervision, the
restart bypasses `start_user/1` and `check_eligibility/1` entirely — potentially
resuming trading while the kill switch is engaged.
- **Stale ReconciliationGate after crash** (Opus #7 only; same restart thread as above): After a crash-triggered
DynamicSupervisor restart (not via `stop_user/1`), the ReconciliationGate remains
`:ready` from the previous instance because `stop_user/1` (which resets it) was
never called. The new OrderManager may accept orders during its own reconciliation.
- **HealthMonitor co-lifecycle violation** (Opus #2 + Sonnet #4): After a
DynamicSupervisor-initiated restart, the HealthMonitor is still subscribed to the
old PIDs — no code re-establishes monitoring for the new pipeline processes.
**GPT-5 unique findings (not in either other model):**
- **Kill switch bypass for users configured DURING engagement** (#1): A user who
saves credentials while the kill switch is engaged is never added to the pending
operator release set (only running pipelines are added at engage time). After
disengage, periodic reconciliation auto-starts this user's pipeline without
operator release — violating "resuming always requires human judgment." This is
the most precisely reasoned finding across all three models: each step is
individually correct per the spec, and the violation emerges purely from the
composition of legal operations.
- **Premature release bypass** (#2): If `operator_release_user/1` is called while
the kill switch is still engaged (a legal operation), it clears the pending
release flag but `start_user/1` correctly refuses. After later disengage, the
flag is gone — auto-start proceeds without fresh operator judgment. The release
was "spent" at the wrong time.
**Claude Opus unique findings (not in either other model):**
- **`operator_release_system/0` clears unrelated safety obligations** (#4):
Operator intends to release one user from a recent event but
`operator_release_system/0` also releases other users still pending from an
earlier, unresolved event. One release call discharges multiple independent
safety obligations — monotonicity violation.
- **State machine incompleteness for blocked users** (#6): Users who become
configured during kill switch engagement (blocked with reason
`:kill_switch_engaged`) have no state machine transition back to `starting`
after disengage — they're not in the pending release set, and no event fires.
System works via periodic reconciliation (up to 5 minutes delay), but the
documented state machine doesn't represent this path.
- **Self-correcting analytical style:** Opus explicitly withdrew two draft
findings mid-analysis ("Actually, this sequence works as designed. Let me
identify a real violation instead." / "this is likely handled"). This
self-correction behavior was first observed in Finding #15; its recurrence here
suggests a consistent Opus trait for invariant-style analysis.
**Claude Sonnet unique findings (not in either other model):**
- **Cold-start Tier 3 failure creates supervision restart loop** (#2): A
persistent Tier 3 failure (phantom fills) crashes OrderManager, `:rest_for_one`
kills the tree, DynamicSupervisor restarts it, cold-start fails again → infinite
loop. State machine shows `starting → stopped` but supervision creates
`starting → starting` indefinitely.
- **HealthMonitor start failure during start_user** (#4): If HealthMonitor.Supervisor
is momentarily crashed when `start_user/1` runs step 4, the pipeline starts
without monitoring. No error handling specified for this partial-start state.
**Quality assessment:**
- **GPT-5** was MAXIMALLY SELECTIVE — only 3 findings from 12,032 reasoning tokens
(4,011 reasoning tokens per finding). This is the most extreme
reasoning-to-output ratio observed: 15:1 (12,032 reasoning / 784 output tokens).
For comparison, GPT-5's ratio in Findings #16, #18, and #19 was below 1:1 (roughly 0.6:1 to 0.8:1).
Every finding is a genuine invariant violation with a precise, step-by-step
sequence where each step is individually legal. ZERO false positives, zero
padding, zero "this might be an issue." GPT-5 appears to have used almost all
its reasoning budget for VERIFICATION — confirming that each candidate is
genuinely a violation before including it.
- **Claude Opus** produced the most findings (7) with its characteristic depth and
self-correction. Two findings were revised mid-analysis, showing Opus actively
testing its own reasoning against the document before committing to a finding.
The DynamicSupervisor restart thread (findings #1, #2, #3, #7) forms a coherent
cluster — Opus identified one root cause (OTP restarts bypass the lifecycle
layer) and explored its multiple consequences. The `operator_release_system`
monotonicity finding (#4) is architecturally significant and unique.
- **Claude Sonnet** was extremely fast (23s, 1,266 tokens) and produced 5 findings.
Quality was mixed: Finding #1 partially mirrors GPT-5's authority inversion but
with vaguer reasoning ("race condition with ETS operations" — not specified).
Finding #3 describes a contradiction but the scenario is internally inconsistent
(step 5 says "pipeline termination fails" but then step 7 says pipeline is still
running — this conflates two failure modes). Findings #2 and #4 are genuine and
well-reasoned. Sonnet's precision is lower than the other two on this task.
**Key insight — "Invariant violation paths" as a task type:**
This is a genuinely DIFFERENT analytical task from any previously tested. It requires:
1. Identifying the invariants (explicit or implied)
2. Constructing a sequence of operations (creative/generative)
3. Verifying each step is legal per the spec (verification)
4. Confirming the end state violates the invariant (correctness proof)
This four-phase cognitive process explains GPT-5's extreme selectivity: step 2 is
generative, but steps 3 and 4 are verification-heavy, and GPT-5's reasoning tokens are
being burned there (confirming each step is genuinely legal and the final state genuinely
violates). In previous tasks like "find hidden assumptions" or "find gaps," only
identification (step 1) is needed; there's no construction or verification phase.
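
The four phases above can be sketched as a small search over a toy state machine. This is a minimal illustration, not the analysed spec: the state tuple, the two operations, and the invariant are all hypothetical stand-ins loosely modelled on the "admin stop overridden by reconciliation" finding. Each operation returns `None` when it is not legal in the current state, so every path the search returns consists only of individually legal steps.

```python
from collections import deque

# Hypothetical toy state: (running, admin_stopped, configured).

def admin_stop(state):
    running, admin_stopped, configured = state
    # Only legal while the pipeline is running.
    return (False, True, configured) if running else None

def reconcile(state):
    running, admin_stopped, configured = state
    # Toy reconciliation looks only at configuration, not at admin intent.
    return (True, admin_stopped, configured) if configured and not running else None

OPERATIONS = {"admin_stop": admin_stop, "reconcile": reconcile}

def invariant(state):
    # Phase 1: an admin-stopped pipeline must never be running.
    running, admin_stopped, _ = state
    return not (admin_stopped and running)

def find_violation_path(start, max_depth=4):
    """Breadth-first search for a legal operation sequence that breaks the invariant."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, path = queue.popleft()
        if not invariant(state):          # Phase 4: end state violates the invariant
            return path
        if len(path) == max_depth:
            continue
        for name, op in OPERATIONS.items():
            nxt = op(state)               # Phase 3: None means the step is illegal here
            if nxt is not None and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [name]))  # Phase 2: extend the sequence
    return None

print(find_violation_path((True, False, True)))
```

The sketch makes the cost asymmetry visible: identification is one function, while construction and verification run once per candidate step.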
**Comparison to previous task types:**
| Task type | GPT-5 findings | Opus findings | GPT-5 reasoning overhead |
|---|---|---|---|
| Hidden assumptions | 20-35 | 12-13 | 5-7K reasoning |
| Race conditions | 12 | 10 | 8K reasoning |
| Design coherence | 4 | 7 | 9K reasoning |
| Invariant violation paths | 3 | 7 | **12K reasoning** |
The pattern: as the task requires more VERIFICATION (vs identification), GPT-5 becomes
more selective and spends more reasoning tokens per finding. Invariant violation paths
demand the highest verification burden (every step must be confirmed legal), and GPT-5
responds with the highest selectivity and reasoning investment.
Opus does not follow this pattern: its finding count drops far less on verification-heavy
tasks (7 for coherence, 7 for invariant paths, vs 10-13 on identification tasks), so it
out-produces GPT-5 exactly where GPT-5 becomes most selective. This suggests Opus uses
its internal reasoning differently: it's more willing to present findings that have
"likely" rather than "proven" violations, then self-corrects inline if the
verification fails.
**Practical implication:**
For invariant violation path analysis:
- **GPT-5** produces the highest-precision findings but very few. Every finding is a
genuine spec-level bug. Use when you need zero-false-positive bug reports to present
to a design team.
- **Opus** produces more findings with slightly lower precision but unique analytical
depth. Its self-correction behavior means false positives are often caught inline.
Use when you want both confirmed violations AND identified tensions.
- **Sonnet** is too imprecise for this task type — some findings have internal
inconsistencies. Use for lighter analytical tasks (assumption-finding, spec gaps).
The three findings GPT-5 produced are ALL genuine design bugs that should be fixed:
1. Users configured during kill switch engagement bypass operator release
2. Premature operator release (while KS still engaged) creates future bypass
3. Admin stops are overridden by periodic reconciliation
These are the kind of findings that, in a real financial system, prevent production
incidents. The 12K reasoning tokens to produce 3 perfect findings is excellent ROI.
@@ -0,0 +1,125 @@
# Finding 21: Reasoning effort (low/medium/high) has negligible effect on GPT-5's analytical output; the parameter may not work as documented for open-ended analysis
**Date:** 2026-05-04
**Task:** Identify hidden assumptions in gargoyle's `order-state-machine.md` (221 lines)
— a well-structured state machine specification covering order lifecycle, fill precedence,
TIF semantics, and parameter resolution.
**How we used them:** Same document, same prompt, same model (GPT-5), same
max_completion_tokens (16K). Only variable: `reasoning.effort` parameter set to
"low", "medium", or "high". Run sequentially via HAI proxy (OpenAI-compatible
endpoint). No tools, no project context beyond the document.
| Effort | Time (ms) | Output tokens | Reasoning tokens | Findings |
|---|---|---|---|---|
| Low | 97,913 | 7,657 | 4,288 | 33 (+11 recs) |
| Medium | 94,824 | 7,112 | 4,160 | 30 |
| High | 88,607 | 6,891 | 3,712 | 30 |
**The counterintuitive result:** Higher reasoning effort produced FEWER findings,
FEWER reasoning tokens, FEWER output tokens, and completed FASTER. The expected
pattern (high effort → more reasoning → more depth) was inverted.
**Per-finding metrics (remarkably consistent):**
| Effort | Output tokens/finding | Reasoning tokens/finding |
|---|---|---|
| Low | 232 | 129 |
| Medium | 237 | 138 |
| High | 229 | 123 |
The depth per finding was nearly identical across all three levels. The model
didn't get more detailed or rigorous per finding at higher effort; it just
found slightly fewer things.
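
The per-finding figures can be reproduced directly from the raw counts in the first table. A minimal sketch follows; the dictionary layout and the `per_finding` helper are illustrative names, and the numbers are copied from the table above. The tabulated values appear to be rounded down.

```python
# Raw counts from the effort-level table above.
runs = {
    "low":    {"output": 7657, "reasoning": 4288, "findings": 33},
    "medium": {"output": 7112, "reasoning": 4160, "findings": 30},
    "high":   {"output": 6891, "reasoning": 3712, "findings": 30},
}

def per_finding(run):
    """Return (output tokens per finding, reasoning tokens per finding)."""
    return run["output"] / run["findings"], run["reasoning"] / run["findings"]

for effort, run in runs.items():
    out_pf, reason_pf = per_finding(run)
    print(f"{effort:<6} output/finding={out_pf:6.1f}  reasoning/finding={reason_pf:6.1f}")
```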
**Severity distributions (similar across all three):**
- Low: 7 Critical, 21 High, 5 Medium (33 findings)
- Medium: 9 Critical, 15 High, 4 Medium + 2 borderline (30 findings)
- High: 6 Critical, 14 High, 5 Medium + 4 borderline (30 findings)
**Qualitative differences — WHAT they found:**
High-effort unique findings (not in low):
- Single-writer authority to broker (no out-of-band modifications)
- Broker emits fills for all executed quantities (no silent netting)
- Instrument identity remains stable across corporate actions
- Late-fill override won't violate downstream invariants
- Validation covers lot sizes, price ticks, borrow/locate constraints
- Multiple accounts and venues are part of the correlation key
- Streaming and polling APIs are consistent
- System can handle multi-leg instruments
Low-effort unique findings (not in high):
- Acks arrive before fills (no pre-ack fills)
- Cancel-before-ack handling (submitted → cancelled missing)
- Fill totals never exceed requested quantity
- Deterministic ordering within a broker stream
- Exercise/assignment and non-order position changes
- Client-side idempotency of "place order"
- Partial accept/normalize on replace
- No "child" order fragmentation at broker
- Submitted state can receive terminal events
- Late cancel vs local expired mismatch
**Character of the differences:**
- HIGH-unique findings tend to be more **architectural/systemic** (multi-leg
instruments, streaming vs polling consistency, downstream invariant violations,
corporate actions). These require reasoning about the system's relationship
to the broader world.
- LOW-unique findings tend to be more **implementation-specific edge cases**
(cancel-before-ack, pre-ack fills, child order fragmentation, partial accepts).
These require reasoning about specific event interleavings and protocol details.
Both sets are valid and actionable. Neither is clearly "better." They represent
different analytical modes — breadth-of-scope (high) vs depth-of-protocol (low).
**Key insight — reasoning_effort doesn't scale analysis linearly:**
Three possible explanations for the inverted behavior:
1. **GPT-5 already uses near-maximum reasoning for analytical tasks regardless
of the effort parameter.** The ~4K reasoning tokens across all three levels
(4288/4160/3712) are too similar to reflect a genuine effort gradient. The
parameter may primarily affect OTHER task types (math, code, logic puzzles)
where reasoning depth is more variable.
2. **Higher effort increases FILTERING, not exploration.** At high effort, GPT-5
may spend more of its reasoning on VERIFYING whether findings are genuine
before including them — similar to the extreme selectivity observed in
Finding #20 (invariant violation paths, 12K reasoning for 3 findings). This
would explain fewer findings despite theoretically "trying harder."
3. **The parameter has minimal practical effect for this model version.**
The differences (33 vs 30 vs 30) are within normal stochastic variation.
Repeated runs at the same effort level might show similar variance.
**The prompt cache hit on HIGH (2304 cached prompt tokens) may have slightly
accelerated processing, but doesn't explain the reasoning token difference.**
**Comparison to previous findings:**
In Finding #20 (invariant violation paths), GPT-5 used 12,032 reasoning tokens
for 3 findings, an extreme verification behavior. Here, across all three effort
levels on a different task type (hidden assumptions), it uses ~4K reasoning for
~30 findings.
This confirms that TASK TYPE is a far stronger predictor of GPT-5's reasoning
behavior than the reasoning_effort parameter. The invariant violation prompt
triggered deep verification; the assumption-finding prompt triggers broad
exploration regardless of effort setting.
**Practical implication:**
For open-ended analytical tasks (assumption-finding, gap analysis, spec review),
the reasoning_effort parameter appears to have negligible practical effect on
GPT-5. Don't bother tuning it for these tasks — the default is fine. The
parameter may be more meaningful for:
- Tasks with verifiable correct answers (math, logic)
- Tasks where the model could short-circuit (simple questions)
- Extremely long documents where exploration budget matters
For architecture review specifically: reasoning_effort is NOT a useful lever.
Task framing (the prompt structure) and document selection remain the dominant
variables for output quality. Save reasoning_effort tuning for coding/math tasks
where the parameter was likely trained and evaluated.
**Open question:** Would running the same experiment 5x at each level show that
the 33-vs-30 difference is within stochastic noise? If so, reasoning_effort is
effectively a no-op for analytical prompts. If not, low-effort consistently
produces more (less filtered) output, which could be useful for brainstorming-
style analysis where you want maximum coverage before manual triage.
@@ -0,0 +1,180 @@
# Finding 22: Silent correctness failures: NEW analytical lens reveals Opus's strength at domain/regulatory reasoning; GPT-5 shows regulatory depth; all models converge on compliance timestamp errors
**Date:** 2026-05-05
**Task:** Identify scenarios where the mechanism produces SILENTLY INCORRECT results
(not errors, not crashes — wrong financial calculations, wrong lot selections, or wrong
compliance records that pass all validation) in gargoyle's `specid-lot-selection.md`
(306 lines) — a financial system specification covering tax lot selection strategies,
cost basis accounting, and IRS SpecID compliance.
**How we used them:** Same document (full text) + same focused analytical question to
all 3 models via HAI proxy. Highly structured prompt specifying 5 categories of silent
incorrectness (stale data, semantic precision, ordering sensitivity, composition errors,
temporal reference errors). Required specific output format per finding with concrete
numerical examples of financial impact. No tools, no project context beyond the document.
| Model | Time | Output tokens | Reasoning tokens | Findings | Critical | High | Medium |
|---|---|---|---|---|---|---|---|
| GPT-5 | 147s | 13,006 | 10,496 | 7 | 2 | 2 | 3 |
| Claude Opus 4.6 | 119s | 5,902 | (internal) | 10 | 3 | 3 | 4 |
| Claude Sonnet 4.6 | 122s | 6,011 | (internal) | 6 | 3 | 3 | 0 |
**What they found — common ground (all 3 identified):**
- `designation_at` = `DateTime.utc_now()` at processing time, NOT at actual
designation time (manual selection was made at order submission, standing
orders were configured earlier) — compliance record factually incorrect
- Holding period calculation boundary errors (>365 days vs IRS "more than one
year" rule, off-by-one at leap year boundaries, day-after-acquisition start)
- HIFO tie-breaker `opened_at ASC` ignores tax_term dimension — selects
long-term losses over short-term losses when both have identical cost basis,
producing less tax-valuable outcomes
- Strategy preference resolved at fill processing time, not at trade time
(preference changes between trade and fill processing apply retroactively)
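
The holding-period boundary issue in the list above can be shown with two dates. This is a hedged sketch: both function names are hypothetical, the day-count check is what the finding describes rather than code taken from the spec, and the Feb 29 fallback is a simplifying assumption. The anniversary rule follows the IRS convention that the holding period starts the day after acquisition, so a sale on the one-year anniversary is still short-term.

```python
from datetime import date

def naive_is_long_term(acquired: date, sold: date) -> bool:
    # Day-count check as described in the finding: "> 365 days".
    return (sold - acquired).days > 365

def one_year_anniversary(d: date) -> date:
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        # Feb 29 has no anniversary in a non-leap year; falling back to
        # Feb 28 is an assumption made for this sketch.
        return d.replace(year=d.year + 1, day=28)

def anniversary_is_long_term(acquired: date, sold: date) -> bool:
    # "More than one year": the sale must fall after the anniversary date.
    return sold > one_year_anniversary(acquired)

# A holding period that spans Feb 29, 2024 is 366 days long on its anniversary.
acquired, sold = date(2024, 1, 15), date(2025, 1, 15)
print(naive_is_long_term(acquired, sold))        # True
print(anniversary_is_long_term(acquired, sold))  # False
```

The two rules agree outside leap-year windows, which is why the misclassification passes ordinary validation.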
**GPT-5 unique findings (not in either Claude model):**
- Corporate action applied late leaves a stale cost basis in HIFO: ROC/dividend reduces
basis but if close/4 fires before apply_corporate_action/3, HIFO sorts on
pre-adjusted basis AND records wrong realized P&L permanently. No mechanism
to restate previously persisted LotClosed events. Concrete example: $2,000
overstated loss from one trade.
- `designation_at` fragmentation: a single sell consuming multiple lots calls
DateTime.utc_now() per loop iteration, producing slightly different timestamps
for what should be a single coherent designation event. Audit risk.
- LIFO label in `selection_method` field: records "lifo" but for securities LIFO
isn't an authorized tax method — the operation is legally SpecID electing
newest lots. Downstream reporting may reject or misclassify.
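
The `designation_at` fragmentation finding above comes down to where the clock is read. The sketch below is a Python analogy for the Elixir behaviour the finding describes; the function names and the record shape are hypothetical, not taken from the spec.

```python
from datetime import datetime, timezone

def close_lots_fragmented(lot_ids):
    # Reads the clock once per lot, so one sell can carry several
    # slightly different designation timestamps.
    return [(lot_id, datetime.now(timezone.utc)) for lot_id in lot_ids]

def close_lots_single_designation(lot_ids, designation_at=None):
    # Reads the clock once (or accepts the caller's timestamp) and
    # reuses it for every lot consumed by the same sell.
    if designation_at is None:
        designation_at = datetime.now(timezone.utc)
    return [(lot_id, designation_at) for lot_id in lot_ids]

records = close_lots_single_designation(["lot-1", "lot-2", "lot-3"])
distinct_timestamps = {ts for _, ts in records}
print(len(distinct_timestamps))  # 1
```

Accepting `designation_at` as a parameter also leaves room for passing in the actual designation time, which is the separate concern raised in the common-ground findings.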
**Claude Opus unique findings (not in either other model):**
- Realized P&L excludes commissions/fees: formula uses `sell_fill.price` (raw
execution price) minus `lot.cost_basis`, not net proceeds. If cost_basis also
excludes buy-side commissions, P&L is doubly overstated. Active trader doing
1000 trades/year: ~$20,000+ cumulative P&L overstatement.
- Position `average_cost` is meaningless under SpecID and potentially misleading:
SpecID exists to exploit lot-level basis differences, but position-level average
obscures this. If downstream consumers use average_cost for tax estimation,
results can be 50%+ wrong per lot.
- GenServer mailbox ordering determines lot-to-fill assignment for concurrent sells:
two simultaneous fills for the same instrument get different lots based on network
arrival timing. With different holding periods, produces $670+ tax difference
without user awareness.
- Wash sale rule completely unaddressed: system reports losses as realized/deductible
without checking 30-day substantially identical purchase rule. Active trader
harvesting $50,000 in losses could have $0 actually deductible — $18,500 tax gap.
- `opened_at` semantics undefined: whether it's exchange execution time, GenServer
arrival time, or settlement date affects every downstream calculation (FIFO/LIFO
ordering, holding periods, tax terms). Network timing could produce wrong FIFO
lot selection.
**Claude Sonnet 4.6 unique findings (not in either other model):**
- Stale cost basis in manual lot picker during concurrent corporate actions: UI shows
pre-action basis, user selects based on stale data, but close/4 only validates
open/ownership/quantity — never re-validates that the selection rationale is still
correct. No field records the discrepancy.
- `average_cost` recomputation ordering ambiguity in event-sourced model: step 4
recomputes from "updated lots" but step 3 (persist events) may not have completed
— if implementation re-derives from event store rather than in-memory state, reads
pre-closure lot quantities. Accumulates $500+ error per partial close.
- Strategy fallback + config corruption silently overwrites selection method in
compliance record: if config becomes invalid, fallback to :fifo is logged at
:warning but LotClosed records `selection_method: "fifo"` — compliance record
shows user "chose" FIFO when they configured HIFO. No field records intended vs
actual strategy.
**Quality assessment:**
- **Claude Opus** produced the most findings (10) with the broadest analytical scope.
Several findings went BEYOND the document's mechanism to identify missing features
that create silent incorrectness (wash sale rules, commission handling, opened_at
semantics). This is a different analytical mode: Opus identified what the system
SHOULD compute but DOESN'T, not just where the existing computation is wrong.
The wash sale finding is the highest-impact across all three models — an active
trader's entire tax-loss harvesting strategy could be invalid. The GenServer
mailbox ordering finding shows characteristic Opus reasoning about emergent
behavior from design decisions.
- **GPT-5** produced fewer findings (7) but with extreme precision and specificity.
Every finding includes concrete dollar amounts and specific field references.
The corporate action stale basis finding is uniquely actionable — it identifies a
specific race condition between two documented mechanisms (close/4 and
apply_corporate_action/3) that produces permanently incorrect persisted data
with no correction path. The designation_at fragmentation finding shows attention
to implementation detail that neither Claude model noticed. GPT-5 used 10,496
reasoning tokens for 7 findings (1,500 tokens/finding) — HIGH verification,
consistent with Finding #20's pattern for precision-over-breadth tasks.
- **Claude Sonnet 4.6** produced 6 findings with strong specificity and novel angles.
The event-sourced recomputation ordering finding (#5) is architecturally subtle —
it identifies a composition error between the walk-and-consume algorithm's step
ordering and event-sourcing patterns. The strategy fallback compliance recording
finding is a genuine audit hazard. However, Sonnet produced no Medium-severity
findings — it either found Critical/High issues or filtered everything else out.
This aligns with its established high-precision, high-self-filtering behavior.
**Key insight — "Silent correctness" as an analytical lens:**
This is the FIRST experiment testing a "silent incorrectness" prompt. The key
difference from previous analytical lenses:
- **Assumption-finding:** "What must be true for this to work?" (Finding #10-12)
- **Race conditions:** "What timing issues exist?" (Finding #13)
- **Design coherence:** "Does the design contradict itself?" (Finding #15)
- **Invariant violations:** "What operation sequences break invariants?" (Finding #20)
- **Silent correctness:** "Where does the system CONFIDENTLY produce WRONG output
with NO indication of error?"
The silent correctness lens produced qualitatively different findings from all
previous lenses. The emphasis on "passes all validation" forced models to reason
about what SHOULD be validated but ISN'T, and about semantic correctness (regulatory
requirements, financial accounting rules) vs syntactic correctness (valid types,
non-nil fields, correct schema).
This lens also revealed a key model differentiation not seen before:
- **Opus reasons about MISSING functionality** (wash sales, commissions, opened_at
semantics) — things the system should do but doesn't
- **GPT-5 reasons about EXISTING functionality being wrong** (corporate action race,
designation fragmentation, LIFO labeling) — things the system does but incorrectly
- **Sonnet reasons about COMPOSITION failures** (event-sourcing step ordering,
strategy fallback propagation) — things that are individually correct but combine
incorrectly
These are three genuinely different analytical modes, not just "more/less thorough."
All three are valuable for different review outcomes: Opus for feature completeness,
GPT-5 for mechanism correctness, Sonnet for integration correctness.
**Financial domain advantage:**
This is the first experiment on a document with strong regulatory/financial semantics.
All three models demonstrated domain knowledge (IRS holding period rules, Treas. Reg.
1.1012-1(c) requirements, wash sale IRC §1091, long-term/short-term capital gains
rate differentials). Opus in particular referenced specific IRC sections and provided
concrete tax rate calculations. The "silent incorrectness" lens works especially well
on financial/regulatory documents because the gap between "syntactically valid output"
and "semantically/legally correct output" is large and consequential.
**Comparison to previous findings on the same models:**
| Task type | GPT-5 findings | Opus findings | Sonnet findings | Opus > GPT-5? |
|---|---|---|---|---|
| Hidden assumptions (#10-12) | 20-35 | 12-13 | 13-17 | No |
| Race conditions (#13) | 12 | 10 | 7 | No |
| Design coherence (#15) | 4 | 7 | 5 | **Yes** |
| Invariant violations (#20) | 3 | 7 | 5 | **Yes** |
| Silent correctness (#22) | 7 | 10 | 6 | **Yes** |
Pattern confirmed: Opus outperforms GPT-5 (by finding count) on tasks that require
reasoning about the design's RELATIONSHIP to external requirements (regulatory,
financial, consumer expectations). GPT-5 outperforms Opus on tasks that require
EXHAUSTIVE EXPLORATION within a self-contained system (assumptions, race conditions).
The "silent correctness" lens is structurally similar to coherence checking (does the
system match its external requirements?) rather than gap-finding (what's missing
within the system?). This explains why Opus outperforms: the task requires reasoning
about the world outside the document (IRS rules, financial accounting standards,
regulatory requirements), which is Opus's strength.
**Practical implication:**
For financial/regulatory system review, the "silent correctness" lens should be
run using Opus as the primary model (broadest findings including missing-feature
identification) plus GPT-5 for mechanism-level precision. Sonnet adds value for
composition/integration issues that neither Opus nor GPT-5 catches. All three
produced unique, actionable findings that the others missed.
The four findings ALL models converged on (designation_at, holding period, HIFO
tie-breaker, strategy preference timing) should be treated as confirmed design
bugs requiring fixes. The fact that three independent models all identified them
with concrete financial impact examples increases confidence that these are real.
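
Of the converged findings, the HIFO tie-breaker is the easiest to show as a sort key. The sketch below uses made-up lots and hypothetical field names. `hifo_documented` mirrors the `opened_at ASC` tie-breaker the findings describe, while `hifo_term_aware` is only one possible alternative (an assumption, not a fix proposed by any model) that prefers short-term lots when cost basis is tied.

```python
from datetime import date

# Illustrative lots only; ids, prices, and dates are invented for the sketch.
lots = [
    {"id": "A", "cost_basis": 150.0, "opened_at": date(2023, 1, 10), "tax_term": "long"},
    {"id": "B", "cost_basis": 150.0, "opened_at": date(2025, 2, 3), "tax_term": "short"},
    {"id": "C", "cost_basis": 120.0, "opened_at": date(2024, 6, 1), "tax_term": "short"},
]

def hifo_documented(lots):
    # Highest cost basis first; ties broken by oldest lot (opened_at ASC).
    return sorted(lots, key=lambda lot: (-lot["cost_basis"], lot["opened_at"]))

def hifo_term_aware(lots):
    # Same primary key, but ties prefer short-term lots before falling
    # back to opened_at. The ranking is a hypothetical policy choice.
    term_rank = {"short": 0, "long": 1}
    return sorted(
        lots,
        key=lambda lot: (-lot["cost_basis"], term_rank[lot["tax_term"]], lot["opened_at"]),
    )

print([lot["id"] for lot in hifo_documented(lots)])  # ['A', 'B', 'C']
print([lot["id"] for lot in hifo_term_aware(lots)])  # ['B', 'A', 'C']
```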
@@ -0,0 +1,193 @@
# Finding 23: Regulatory compliance analysis: GPT-5 finds broadest scope of IRS issues; Opus self-corrects mid-analysis; all models converge on cross-account gap
**Date:** 2026-05-05
**Task:** Identify where gargoyle's `wash-sale-tracking.md` (391 lines) could produce
incorrect tax reporting compared to IRS wash sale regulations (IRC 1091). NEW
analytical lens: regulatory compliance verification — asking models to reason about
a code implementation's correctness against EXTERNAL regulatory requirements (not
internal system assumptions or race conditions).
**How we used them:** Same document (full text) + same focused analytical question
to all 3 models via HAI proxy. Prompt structured around 5 categories: regulatory
gaps, interpretation errors, edge cases at regulatory boundaries, cross-account/entity
concerns, and interaction with other IRC sections. Required specific regulatory
citations, implementation analysis, concrete tax errors, and audit risk levels.
No tools, no project context beyond the document.
| Model | Time | Output tokens | Reasoning tokens | Findings |
|---|---|---|---|---|
| GPT-5 | 178s | 12,525 | 9,536 | 16 |
| Claude Opus 4.6 | 155s | 7,326 | (internal) | 16 (with 2 self-corrections/withdrawals) |
| Claude Sonnet 4.6 | 40s | 1,818 | (internal) | 12 |
**What they found — common ground (all 3 identified):**
- Cross-account/IRA/external broker wash sales not tracked (IRS applies at taxpayer level)
- Options/contracts to acquire stock not triggering wash sales (explicit in IRC 1091(a) text)
- "Substantially identical" definition too narrow (same index ETFs, share classes, ADRs)
- Trade date vs settlement date ambiguity in opened_at/closed_at
- Short sale wash sales not addressed
- Section 475 mark-to-market traders incorrectly subjected to wash sale tracking
- IRC 1092 straddle rules interaction not addressed
- Related party / spousal transactions not considered
- Corporate action identity changes breaking matching
**GPT-5 unique findings (not in either other model):**
- **Per-share vs lot-level basis tacking** (#1): The system applies `disallowed_loss`
and `tacked_opened_at` at the LOT level, but IRS requires per-share treatment
when only partial shares are matched. A lot of 100 shares where only 60 trigger
wash sale should have per-share basis segregation — the system inflates basis for
all 100 shares. **Most architecturally significant finding** — a fundamental
design-level error, not a missing feature.
- **IRA permanent disallowance** (#2): When replacement purchase is in an IRA, the
loss is PERMANENTLY lost (no basis adjustment possible in tax-deferred accounts).
System either incorrectly applies basis adjustment inside IRA or misses it entirely.
- **Instruments not subject to §1091** (#4): §1256 contracts (futures, index options),
cryptocurrency, and §475 elections are all exempt — system may over-disallow.
- **Average-cost mutual fund basis** (#11): Wash sale adjustments for funds using
average-cost method require different math than discrete lot-level adjustments.
- **ADRs vs local shares** (#14): ADRs and underlying foreign ordinaries are
substantially identical but have different instrument_ids.
- **RSU vestings/ESPP purchases** (#15): Equity compensation creating lots via
corporate action paths may not trigger `check_replacement/2`.
- **Ordering priority between pre/post sale purchases** (#10): Industry convention
(post-sale first, then pre-sale) may differ from system's strict chronological
ordering, causing 1099-B mismatches.
**Claude Opus unique findings (not in either other model):**
- **Year-end boundary timing** (#5): Loss in December + replacement in January means
tax reports generated between Dec 31 and the replacement purchase date are incorrect.
Forward detection fires retroactively but users may have already filed. System needs
a "30-day pending window" for year-end reports.
- **Form 8949 reporting format** (#6): IRS requires code "W" in column (f) and
specific adjustment amounts in column (g). System doesn't describe how `tax_summary/3`
produces Form 8949-compatible output — potential CP2000 notice triggers from
automated IRS matching against broker 1099-B.
- **"Open lots" query in backward detection** (#10): If backward detection only
queries currently-open lots, it misses replacements that were acquired AND SOLD
within the window. IRS looks at acquisition regardless of current holding status.
(Rev. Rul. 56-602)
- **Forward detection loss ordering unspecified** (#7): When multiple prior losses
compete for the same replacement shares, ordering matters — different allocation
produces different basis amounts on the replacement lot.
- **DRIP reinvestments triggering wash sales** (#9): Dividend reinvestment creates
new lots that should trigger forward detection but may not if only buy fills
produce `LotOpened` events.
- **Self-correcting analytical style (CONFIRMED):** Opus withdrew Finding #4
entirely mid-analysis ("Revised assessment: holding period logic appears correct.
I withdraw the claim of error"). Spent ~500 words reasoning through the holding
period tacking logic, found it correct, and explicitly retracted. This is now
confirmed across Findings #15, #20, and #23 as a consistent Opus behavior for
verification-heavy regulatory analysis.
**Claude Sonnet unique findings (not in either other model):**
- **Entity-level tracking for partnerships/S-Corps** (#4.2): Tax-transparent entities
trading through the platform need K-1 reporting to partners — user-scoped model
doesn't address pass-through entity wash sale reporting.
- **Constructive sale integration (IRC 1259)** (#4.1): Short positions or derivatives
creating constructive ownership interact with wash sale determination in ways not
addressed.
- **NOL carryforward interaction** (#5.3): Wash sale deferrals affect character and
timing of losses contributing to NOL calculations across tax years.
**Quality assessment:**
- **GPT-5** produced the broadest regulatory scope (16 findings) with the most
specific IRS citations (Rev. Rul. 2008-5, Pub. 550, IRC §§267, 1091, 1092, 1222,
1223, 1256, 475). Its per-share vs lot-level finding (#1) is the only one that
identifies a FUNDAMENTAL DESIGN ERROR (not a missing feature). Most of the other
models' findings are "you don't handle X"; GPT-5's #1 says "what you DO handle is
handled INCORRECTLY." This distinction matters: missing features are known scope
limitations; incorrect logic is a bug.
- **Claude Opus** matched GPT-5's count (16 with 2 self-corrections = 14 net
confirmed) but with different character. Opus excelled at identifying OPERATIONAL
implications (year-end boundary timing, Form 8949 format requirements, forward
detection ordering) rather than just statutory gaps. Its findings tend to describe
HOW the gap manifests in practice ("user files taxes, then January purchase
retroactively invalidates the filing") vs GPT-5's approach of citing the statute
and describing the theoretical violation.
- **Claude Sonnet** was fast (40s) and produced 12 competent findings but with less
regulatory precision. Findings lacked specific IRS citations (no Rev. Rul.
references, no Treas. Reg. citations). Several findings overlapped heavily with
common ground items without adding unique depth. The entity-level and
constructive sale findings show awareness of tax complexity but are relatively
generic ("this is complex and not addressed").
**Key insight — regulatory compliance as a distinct task type:**
This experiment tests a fundamentally different cognitive demand than previous ones:
previous tasks asked "what could go wrong with this system?" (internal reasoning).
This task asks "does this system correctly implement external rules?" (external
reasoning). The model must hold TWO bodies of knowledge simultaneously: the
implementation spec AND the regulatory framework, then find mismatches.
All three models had strong tax law knowledge — they cited IRC sections, Revenue
Rulings, and Treasury Regulations correctly. The differentiation wasn't in legal
knowledge but in HOW they applied it:
- **GPT-5:** Exhaustive statutory mapping ("here's every IRC section that touches
wash sales; here's where the implementation falls short on each"). Breadth-first
coverage. Found the most issues by sheer scope of regulatory awareness.
- **Opus:** Operational consequence reasoning ("here's how this gap manifests as
a real-world problem for the user/auditor"). Found issues by reasoning about
the implementation's interaction with real-world workflows (filing deadlines,
form formats, broker reconciliation).
- **Sonnet:** Category-based analysis ("here are cross-account issues, here are
entity issues, here are interaction issues"). Followed the prompt structure
closely but didn't go deep within each category.
**The per-share vs lot-level finding (GPT-5 #1) — why it matters:**
This is the experiment's most important result. Every model found missing features
(options, cross-account, short sales) — those are SCOPE limitations that the
document itself acknowledges or defers. GPT-5 uniquely found a correctness bug in
the IMPLEMENTED logic: the system's lot-level basis adjustment is mathematically
wrong for partial wash sales.
Example: a first attempt shows where the bug is NOT. Take a loss lot of 100 shares
and a replacement lot of 60 shares. With `adjusted_quantity = min(loss_quantity,
replacement_quantity)`, 60 shares are matched, and they are ALL the shares of the
replacement lot. Applying the basis adjustment and `tacked_opened_at` to the whole
replacement lot is correct here, because every share in it is a matched share.
The edge case GPT-5 identifies requires a replacement lot larger than the matched
quantity (`adjusted_quantity < replacement_quantity`): e.g., a loss of 60 shares
matched against a replacement lot of 100 shares where only 60 are affected. In
that case, the `tacked_opened_at` is set on the entire lot (100 shares) when only
60 should be affected. This IS a genuine bug: 40 shares get incorrect holding
period classification.
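
The 60-of-100 case can be sketched as a lot split. This is a minimal illustration under stated assumptions: the `Lot` dataclass, `apply_wash_sale`, and the dollar figures are hypothetical, and storing the loss lot's open date in `tacked_opened_at` is a simplification of whatever tacking calculation the spec performs.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class Lot:
    quantity: int
    basis_per_share: float
    opened_at: date
    tacked_opened_at: Optional[date] = None  # None means no tacked holding period

def apply_wash_sale(replacement, matched_qty, disallowed_per_share, loss_opened_at):
    """Split the replacement lot so only matched shares carry the adjustment."""
    matched_qty = min(matched_qty, replacement.quantity)
    matched = Lot(
        quantity=matched_qty,
        basis_per_share=replacement.basis_per_share + disallowed_per_share,
        opened_at=replacement.opened_at,
        tacked_opened_at=loss_opened_at,
    )
    remainder_qty = replacement.quantity - matched_qty
    if remainder_qty == 0:
        return [matched]
    # Unmatched shares keep their original basis and holding period.
    remainder = Lot(remainder_qty, replacement.basis_per_share, replacement.opened_at)
    return [matched, remainder]

replacement = Lot(quantity=100, basis_per_share=50.0, opened_at=date(2025, 3, 10))
parts = apply_wash_sale(replacement, 60, 2.0, date(2024, 11, 1))
for part in parts:
    print(part.quantity, part.basis_per_share, part.tacked_opened_at)
```

Splitting the lot is one way to get per-share treatment out of a lot-level data model; tracking a matched quantity on the original lot would be another.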
**Updated task-type taxonomy:**
| Task type | Primary cognitive demand | Best model |
|---|---|---|
| Hidden assumptions | Breadth identification (what's not stated?) | GPT-5 (exhaustive) |
| Race conditions | Sequential temporal reasoning | GPT-5 + Opus |
| Cross-component interactions | Component boundary reasoning | GPT-5 + Sonnet |
| Design coherence | Internal consistency checking | Opus |
| Invariant violation paths | Construction + verification | GPT-5 (precision) |
| Silent correctness | External requirement matching | Opus |
| **Regulatory compliance** | **Dual-knowledge-base comparison** | **GPT-5 (breadth) + Opus (operations)** |
Regulatory compliance is closest to "silent correctness" (Finding #22) in that
both require reasoning about external requirements. The key difference:
- Silent correctness asks "does this produce correct outputs for all inputs?"
- Regulatory compliance asks "does this implement the law correctly?"
Both favor models that reason about the system's relationship to the outside
world (Opus's strength), but regulatory compliance also rewards breadth of
statutory knowledge (GPT-5's strength). The combination produces the most
complete picture.
**Practical implication:**
For regulatory compliance review of financial systems:
- Run GPT-5 for exhaustive statutory coverage (finds the most gaps)
- Run Opus for operational impact analysis (finds how gaps manifest in practice)
- Sonnet adds marginal value — use only if budget allows
- GPT-5's unique strength: identifying correctness bugs in implemented logic
(not just missing features)
- Opus's unique strength: identifying timing/workflow issues (year-end, form
reporting, reconciliation with broker)
@@ -0,0 +1,152 @@
# Finding 24: Design improvement proposals: GPT-5 excels at defense-in-depth thinking; Opus finds subtle design contradictions; Sonnet produces generic recommendations
**Date:** 2026-05-05
**Task:** Propose specific design improvements for gargoyle's `kill-switch.md` (185 lines)
— the primary safety mechanism that prevents rogue orders. NEW task type: generative/
creative ("what would you improve?") rather than purely analytical ("what's wrong?").
**How we used them:** Same document (full text) + same focused prompt to all 3 models
via HAI proxy. Prompt asked for 8-15 specific improvements with: weakness, proposed
change (concrete), tradeoff, severity rating. Explicitly excluded generic advice
("add more tests") and asked about runtime assumptions. No tools, no project context.
| Model | Time | Output tokens | Reasoning tokens | Improvements proposed |
|---|---|---|---|---|
| GPT-5 | 118s | 8,710 | 6,016 | 15 |
| Claude Opus 4.6 | 127s | 4,985 | (internal) | 15 |
| Claude Sonnet 4.6 | 40s | 1,636 | (internal) | 12 |
**What they found — common ground (all 3 identified):**
- DB write failure blocking engagement (fail-open under DB outage) — all three
proposed in-memory-first engagement with async persistence
- Kill switch process liveness monitoring (heartbeat/watchdog)
- Broker connectivity loss during cancellation operations
- ETS table ownership and crash-window vulnerability
- Supervisor restart suppression as unstated mechanism
- Per-venue/per-broker scope extension
**GPT-5 unique findings (not in either other model):**
- **Infrastructure-level "hard kill"** — egress proxy or service mesh that blocks
broker traffic independently of the application. Belt-and-suspenders approach
where the kill switch works even if the entire BEAM VM is unresponsive. This
was GPT-5's highest-impact unique insight.
- **Kill fence token (epoch)** — every order-carrying message includes an epoch;
stale-epoch messages are dropped at the gate. Elegantly solves in-flight
messages without needing drain timeouts.
- **Cluster/multi-node propagation** — detailed leader election + epoch broadcast
+ fail-closed on partition design.
- **Post-engage broker verification** — query broker AFTER engaging to confirm no
orders slipped through during the engagement window.
- **Liquidation exposure validation** — proving tagged liquidation orders actually
REDUCE exposure rather than trusting the tag.
- **Recovery/cold-start order suppression** — ensuring reconciliation/recovery
routines can't submit orders while engaged.
- **Engage latency reordering** — ETS first, terminate second, DB async.
- **Audit log tamper evidence** — append-only external sink + hash chain.
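
The kill fence token (epoch) idea from the list above can be sketched as follows. This is an illustration only, written in Python for brevity rather than in the system's own runtime; `KillFence`, `OrderMessage`, and every method name are hypothetical and do not come from `kill-switch.md`. The assumption is that each order-carrying message records the epoch it observed at creation, and that the gate bumps the epoch on every engage and disengage.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class OrderMessage:
    order_id: str
    epoch: int  # epoch observed when the message was created


class KillFence:
    """Hypothetical gate that drops messages carrying a stale epoch."""

    def __init__(self) -> None:
        self._epoch = 0
        self._engaged = False
        self._lock = threading.Lock()

    def current_epoch(self) -> int:
        with self._lock:
            return self._epoch

    def engage(self) -> int:
        # Bumping the epoch invalidates every in-flight message at once.
        with self._lock:
            self._epoch += 1
            self._engaged = True
            return self._epoch

    def disengage(self) -> int:
        # Bump again so messages created before or during the kill
        # cannot slip through after the switch is released.
        with self._lock:
            self._epoch += 1
            self._engaged = False
            return self._epoch

    def admit(self, msg: OrderMessage) -> bool:
        with self._lock:
            return (not self._engaged) and msg.epoch == self._epoch
```

The point of the design is that in-flight messages do not need to be drained: anything stamped before the engage simply fails the epoch comparison at the gate.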
**Claude Opus unique findings (not in either other model):**
- **Ordering contradiction in engagement sequence** — identified that the
documented order (DB → ETS → terminate) creates a specific risk if a crash
occurs BETWEEN termination and ETS update (not just DB failure). The insight
is about the window where termination has started but the gate is still open.
More subtle than GPT-5's version (which focused on DB-blocking-engage).
- **Concurrent engagement race (mode escalation)** — multiple triggers
simultaneously issuing conflicting modes (RESTRICT vs LIQUIDATE). Proposed
explicit escalation rules (LIQUIDATE always wins) with GenServer serialization.
- **Shared resources under per-user scope** — per-user kill switch doesn't
address orders in shared broker connection buffers. Forces architectural
decision about connection pooling strategy.
- **Clock/time integrity for audit log** — monotonic counters + NTP validation
for forensic reliability.
- **Partial multi-user engagement failures** — what happens when global engage
successfully terminates 4/5 user pipelines but one has orphaned processes.
- **Liquidation direction validation** — similar to GPT-5's exposure validation
but framed differently: corrupted position records could cause liquidation
to OPEN positions rather than close them.
- **Process termination verification** — checking that `:kill` signals actually
worked (defense against trap_exit, NIF blocking).
- **Engagement latency SLA** — defining a 50ms target with monitoring/alerting.
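
The mode escalation rule Opus proposed (LIQUIDATE always wins) amounts to a serialized "take the maximum" update. The sketch below is a rough illustration under assumptions: the `DISENGAGED` member, the numeric ordering, and the class name are hypothetical, and a lock stands in for the GenServer serialization mentioned above.

```python
import threading
from enum import IntEnum


class Mode(IntEnum):
    DISENGAGED = 0  # hypothetical baseline, not named in the source
    RESTRICT = 1
    LIQUIDATE = 2


class KillSwitchState:
    """Serializes engage requests; a stricter mode is never downgraded."""

    def __init__(self) -> None:
        self._mode = Mode.DISENGAGED
        self._lock = threading.Lock()

    def engage(self, requested: Mode) -> Mode:
        # The lock plays the role of a single serializing process:
        # concurrent triggers are applied one at a time.
        with self._lock:
            self._mode = max(self._mode, requested)
            return self._mode
```

Whatever order the two triggers arrive in, the resulting mode is LIQUIDATE, so the outcome no longer depends on which trigger wins the race.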
**Claude Sonnet findings (all also present in GPT-5 or Opus, differently framed):**
- No genuinely unique improvements that GPT-5 or Opus didn't also identify.
- Several were generic: "missing resource cleanup," "circuit breaker integration,"
"performance monitoring" — exactly the kind of advice the prompt tried to
exclude.
- The "missing heartbeat" and "network partition handling" proposals were solid
but less detailed than the corresponding GPT-5/Opus versions.
**Quality assessment:**
- **GPT-5** produced the most ACTIONABLE improvements. Its proposals were
architecturally concrete ("add an egress proxy," "use kill epochs in messages,"
"query broker post-engage") and showed defense-in-depth thinking — multiple
independent layers rather than fixing one path. The infrastructure kill (#2)
is genuinely novel: no other model proposed going OUTSIDE the application
boundary for safety enforcement. GPT-5 consistently thought about "what if
this entire runtime is compromised?" rather than just fixing within-app paths.
- **Claude Opus** produced equally numerous improvements (15) with characteristic
precision about failure SEQUENCES. Its unique strength: identifying design
contradictions rather than just gaps (the engagement ordering issue, concurrent
mode escalation, shared-resource scope mismatch). Opus's proposals were more
"fix the design tension" while GPT-5's were more "add another safety layer."
Opus also included the process termination verification and engagement latency
SLA — operational rigor that GPT-5 skipped.
- **Claude Sonnet** produced 12 proposals in 40s (fast) but quality was notably
lower. Several proposals were generic software engineering advice that the
prompt explicitly excluded ("add performance monitoring," "resource cleanup").
No unique insights emerged. Sonnet's proposals lacked the architectural depth
of GPT-5 (no outside-the-application thinking) and the design-tension
identification of Opus.
**Key insight — generative vs analytical tasks:**
This is the first experiment testing a GENERATIVE task ("propose improvements")
rather than a purely analytical one ("find problems"). The results reveal:
1. **GPT-5's defense-in-depth thinking is unique.** In analytical tasks, GPT-5
finds exhaustive lists of issues. In generative tasks, it proposes LAYERED
solutions — multiple independent mechanisms that each catch what the others
miss. The infrastructure kill proposal (external to the application) shows
GPT-5 reasoning about failure modes that are invisible to within-app analysis.
2. **Opus's design-tension identification transfers to improvement proposals.**
In analytical tasks, Opus finds where parts of a design contradict each other.
In generative tasks, this manifests as proposals that RESOLVE tensions rather
than just adding patches. The engagement ordering contradiction and mode
escalation rules are both "this design says X but the mechanism allows Y —
here's how to make them consistent."
3. **Sonnet doesn't transfer well to generative tasks.** In analytical tasks
(assumption-finding, cross-component analysis), Sonnet performs well (85% of
GPT-5 in some experiments). In generative tasks, it falls back to generic
engineering advice. The task requires both identifying problems AND proposing
concrete solutions — Sonnet handles the first step but not the second with
sufficient depth.
**Comparison to analytical task performance:**
| Task type | GPT-5 character | Opus character | Sonnet character |
|---|---|---|---|
| Assumption-finding (#10-12) | Exhaustive breadth | Design tensions | Good (85% of GPT-5) |
| Race conditions (#13) | Technical precision | Design contradictions | Weak (errors) |
| Invariant violations (#20) | Maximum selectivity | Self-correcting depth | Imprecise |
| **Design improvements (#24)** | **Defense-in-depth layers** | **Tension resolution** | **Generic advice** |
The generative task reveals each model's reasoning STYLE more clearly than analytical tasks.
GPT-5's reasoning enables it to construct multi-layered solutions. Opus's internal
reasoning enables it to identify what a design SHOULD be (not just what's wrong).
Sonnet pattern-matches against known engineering practices without deep synthesis.
**Practical implication:**
For design improvement sessions on safety-critical systems:
- Run GPT-5 for defense-in-depth proposals ("what layers should exist?")
- Run Opus for design consistency proposals ("where does the design contradict itself?")
- Skip Sonnet — its output is indistinguishable from generic checklists
- The combination of GPT-5 + Opus produces complementary improvements: GPT-5 adds
safety layers, Opus fixes internal contradictions. Together they address both
"not enough protection" and "protection mechanisms that work against each other."
**Cost analysis:**
GPT-5: 118s, ~10.9K tokens (6K reasoning). Opus: 127s, ~5K tokens. Sonnet: 40s, ~1.6K tokens.
For a safety-critical design review, running GPT-5 + Opus costs ~16K tokens and produces
30 improvements with near-zero overlap in unique insights. Excellent ROI for a kill switch
design that protects real money.
@@ -0,0 +1,154 @@
# Finding 25: Contradiction detection: NEW task type — Opus excels at finding genuine contradictions with precision; GPT-5 is thorough but spends heavily on reasoning; Sonnet finds surface-level issues quickly
**Date:** 2026-05-05
**Task:** Identify internal contradictions, logical inconsistencies, and conflicting rules
in gargoyle's `order-state-machine.md` (311 lines) — a document defining states,
transitions, invariants, fill precedence rules, and time-in-force behavior.
**How we used them:** Same document (full text) + same focused analytical question to all
3 models via HAI proxy. Prompt specifically asked for: state machine contradictions,
semantic conflicts, rule violations, implicit contradictions, and terminology
inconsistencies. Required each finding to quote the conflicting statements, explain
the logical argument, assign severity, and recommend which statement should "win."
No tools, no project context beyond the document itself.
| Model | Time | Output tokens | Reasoning tokens | Contradictions found |
|---|---|---|---|---|
| GPT-5 | 162s | 12,074 | 11,008 | 4 |
| Claude Opus 4.6 | 41s | 2,056 | (internal) | 6 |
| Claude Sonnet 4.6 | 17s | 826 | (internal) | 4 |
**What they found — common ground (2+ models identified):**
- **Missing `pending_cancel → partially_filled` revert transition** (GPT-5 #1 +
Opus #2 + Sonnet partial): The "Rejection reverts" invariant states orders return
to their "pre-modification state (`working` or `partially_filled`)", but the state
diagram only shows `pending_cancel → working` for cancel rejection — no path back
to `partially_filled`. All models correctly identified this as the diagram being
incomplete relative to the stated invariant. GPT-5 and Opus rated CRITICAL.
- **Same issue for `pending_replace` revert** (GPT-5 #1 + Opus #3): The state diagram
only shows `pending_replace → working` for replace rejection, but a replace
requested from `partially_filled` should revert to `partially_filled`. Same root
cause as above, just the replace variant.
- **FOK "never partially fills" vs state machine allowing it** (GPT-5 #2 + Opus #4):
The TIF table says FOK "never partially fills" but the state machine has no guards
preventing FOK orders from reaching `partially_filled`. Both correctly noted this
is a broker-enforced guarantee but the document presents it as system-level.
- **`rejection_reason` described as "broker-provided" but local rejections exist**
(GPT-5 #4 + Opus #5 + Sonnet): `pending → rejected` is "local validation failure"
with no broker interaction, but the field says "Broker-provided reason when
rejected." All three caught this terminology inconsistency.
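
The missing revert transitions can be illustrated with a small sketch that remembers the pre-modification state. This is an assumption-laden example, not the design in `order-state-machine.md`: the `Order` class, its method names, and the `pre_modification_state` field are hypothetical, and fills arriving while a modification is pending are deliberately left out.

```python
from dataclasses import dataclass
from typing import Optional

# States from which a cancel or replace may be requested, per the
# "Rejection reverts" invariant quoted above.
REVERTIBLE = {"working", "partially_filled"}
PENDING = {"pending_cancel", "pending_replace"}


@dataclass
class Order:
    state: str = "working"
    pre_modification_state: Optional[str] = None

    def request_cancel(self) -> None:
        self._enter_pending("pending_cancel")

    def request_replace(self) -> None:
        self._enter_pending("pending_replace")

    def _enter_pending(self, pending_state: str) -> None:
        if self.state not in REVERTIBLE:
            raise ValueError(f"cannot modify order in state {self.state!r}")
        # Remember where to return if the broker rejects the request.
        self.pre_modification_state = self.state
        self.state = pending_state

    def modification_rejected(self) -> None:
        if self.state not in PENDING or self.pre_modification_state is None:
            raise ValueError("no pending modification to reject")
        # Revert to the remembered state instead of always `working`.
        self.state = self.pre_modification_state
        self.pre_modification_state = None
```

A diagram that only draws `pending_cancel → working` corresponds to hard-coding the revert target, which is exactly what loses the `partially_filled` case.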
**GPT-5 unique findings (not in either other model):**
- **IOC valid terminal states exclude `expired` vs generic expiry transitions** (#3):
IOC should never reach `expired` (unfilled portion is cancelled immediately), but
the state diagram allows any order to transition to `expired` without TIF guards.
Well-reasoned extension of the FOK finding to IOC semantics. GPT-5 correctly
identified that broker "expired-like" outcomes should map to `cancelled` for IOC.
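
The FOK and IOC observations both come down to adding time-in-force guards to transitions. A minimal sketch, assuming a single guard function applied before any state change: the function name and the `"DAY"` value used in the usage note are hypothetical, and the mapping of `expired` to `cancelled` for IOC follows GPT-5's recommendation rather than anything the document currently specifies.

```python
def guard_transition(tif: str, target_state: str) -> str:
    """Return the state an order may actually enter for the given TIF."""
    if tif == "FOK" and target_state == "partially_filled":
        # The TIF table says FOK never partially fills, so treat this
        # as an invariant violation rather than a normal transition.
        raise ValueError("FOK order cannot enter partially_filled")
    if tif == "IOC" and target_state == "expired":
        # The unfilled IOC remainder is cancelled, never expired.
        return "cancelled"
    return target_state
```

For example, `guard_transition("IOC", "expired")` yields `"cancelled"`, while a hypothetical `"DAY"` order passes through to `"expired"` unchanged.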
**Claude Opus unique findings (not in either other model):**
- **Terminal states that aren't terminal — the `partially_filled` re-entry problem**
(#1): Opus identified the DEEPER contradiction beyond the surface-level "cancelled
states have outgoing transitions." When `cancelled → partially_filled` fires via
late fill, the order is now non-terminal with NO defined mechanism to re-terminate
if no further fills arrive. The order is stuck in `partially_filled` indefinitely.
This goes beyond "the diagram contradicts the definition of terminal" to "the fill
precedence rule creates an unspecified operational scenario." This is the most
architecturally significant finding across all three models.
- **Fill precedence label misapplication to non-terminal states** (#6): The state
diagram labels transitions from `pending_cancel → partially_filled` and
`pending_replace → partially_filled` as "fill precedence," but the Fill
Precedence Rule explicitly defines itself as overriding TERMINAL states.
`pending_cancel` is non-terminal. The label conflates two different mechanisms
(fill during pending modification vs. fill overriding terminal state), which
could cause implementers to use the same code path for fundamentally different
scenarios.
**Claude Sonnet unique findings (not in either other model):**
- **State diagram terminal arrow contradiction** (#1): Sonnet was the only model to
explicitly note that the Mermaid diagram shows `cancelled → [*]` (terminal arrow)
while simultaneously showing `cancelled → partially_filled` (outgoing transition).
A valid observation but more surface-level than Opus's deeper analysis of the same
phenomenon.
- **Pending replace fill logic error** (#3): Sonnet argued that receiving a fill
during `pending_replace` creates a logical impossibility because the order
parameters are in flux. This is WRONG — fills always apply to current parameters
(the replace hasn't been confirmed yet), and the document actually handles this
correctly. This is a FALSE POSITIVE from Sonnet.
**Quality assessment:**
- **Claude Opus** was the clear winner for this task. Found the most contradictions
(6), had the highest precision (0 false positives), and — crucially — found
qualitatively deeper issues. The `partially_filled` re-entry problem (#1) isn't
just "the diagram has a missing transition" but "the fill precedence rule creates
an unresolvable operational state." The fill precedence label misapplication (#6)
identifies a conceptual confusion that would genuinely cause implementation bugs.
Opus completed in only 41s with 2,056 output tokens — by far the most efficient.
- **GPT-5** found 4 genuine contradictions with 0 false positives but spent an
extraordinary amount of reasoning tokens (11,008) for modest output (1,066 visible
content tokens, 10.3:1 reasoning ratio). The IOC finding was unique and valuable.
But the cost is disproportionate: 162s and 12K tokens for 4 findings vs Opus's
41s and 2K tokens for 6 findings. GPT-5's reasoning budget seems to have been
mostly spent on VERIFICATION (confirming each finding is genuine), consistent
with Finding #20's observation.
- **Claude Sonnet** was fastest (17s) and found 4 items, but one was a false positive
(the pending_replace logic error claim is incorrect). That gives it a precision of
75% (3/4 genuine) — the lowest of the three. Its genuine findings were all also
found by the other models (no unique true contributions). Sonnet appears to trade
accuracy for speed on contradiction detection.
**Key insight — contradiction detection favors precision-oriented models:**
This task is fundamentally about LOGICAL ARGUMENTATION: proving that two statements
cannot both be true. Unlike assumption-finding (which is about imagining what could go
wrong) or gap-finding (which is about identifying missing content), contradiction
detection requires the model to:
1. Hold two statements in working memory simultaneously
2. Construct a formal argument for why they conflict
3. NOT get confused by statements that SEEM contradictory but are actually consistent
Requirement #3 is where models diverge. Sonnet produced a false positive because it
didn't fully reason through whether the pending_replace fill scenario is actually
inconsistent (it isn't — current parameters apply). Opus avoided this trap entirely
and additionally found DEEPER contradictions that require multi-step logical reasoning
(the re-entry problem, the label misapplication). GPT-5 also avoided false positives
but at massive computational cost.
**Opus's efficiency advantage:**
This is the first task where Opus is not just qualitatively better but also
quantitatively more efficient. 6 findings in 41s and 2K tokens vs GPT-5's 4 findings
in 162s and 12K tokens. That's roughly 9x more findings per token (about 2.9 vs
0.33 findings per 1K output tokens) and 4x faster. For
contradiction detection specifically, Opus appears to have a structural advantage —
possibly because its internal reasoning is better calibrated for logical argumentation
than GPT-5's externalized reasoning chain.
**Comparison to Finding #20 (invariant violation paths):**
In Finding #20, GPT-5 was maximally selective (3 findings, all genuine, 15:1
reasoning ratio). Here, GPT-5 shows the same pattern: few findings, all genuine,
high reasoning ratio (10.3:1). The difference: in #20, GPT-5's selectivity meant
it found UNIQUE violations others missed. Here, three of GPT-5's four findings were
also found by Opus; only the IOC finding was unique, while Opus added two unique
findings of its own. GPT-5's high verification bar helps less when Opus is ALSO
precise AND more thorough.
**Updated task-model assignment:**
For contradiction/consistency checking:
1. **Opus** — best choice: highest precision, deepest contradictions, most efficient
2. **GPT-5** — solid backup: zero false positives, unique TIF-related insights, but
expensive and slower
3. **Sonnet** — NOT recommended for this task: produces false positives, no unique
true contributions
This confirms the emerging pattern: each model has task types where it excels.
Opus excels at logical argumentation and design tensions. GPT-5 excels at
exhaustive enumeration and operational concerns. Sonnet excels at speed and
structural/assumption analysis but struggles with tasks requiring formal logical
reasoning (contradiction detection, concurrency analysis per Finding #13).
**Practical implication:** When reviewing architecture documents for internal
consistency (e.g., before implementation begins), run Opus. If budget allows,
add GPT-5 for TIF/edge-case coverage. Skip Sonnet for consistency checking —
its speed advantage is negated by the false positive risk.
@@ -0,0 +1,158 @@
# Finding 26: Missing-feature identification IS promptable across all models; prompt framing eliminates Opus's historical advantage — all three models find regulatory gaps when explicitly asked
**Date:** 2026-05-05
**Task:** Identify computations, behaviors, or features that gargoyle's
`corporate-actions.md` (992 lines) SHOULD perform for financial correctness,
regulatory compliance, or operational safety — but doesn't describe.
**How we used them:** Same document (full text) + same focused analytical
prompt to all 3 models via HAI proxy. Prompt explicitly structured around 5
categories: missing computations, missing behaviors, missing validations,
missing integrations, and regulatory gaps. Required concrete findings with
severity. No tools, no project context beyond the document. GPT-5 via
OpenAI endpoint (16K max_completion_tokens), Opus 4.6 and Sonnet 4.6 via
Anthropic endpoint (8K max_tokens).
| Model | Output tokens | Reasoning tokens | Findings | Critical | High | Medium |
|---|---|---|---|---|---|---|
| GPT-5 | 11,354 | 8,512 | 20 | 3 | 10 | 7 |
| Claude Opus 4.6 | 4,111 | (internal) | 23 | 6 | 10 | 7 |
| Claude Sonnet 4.6 | 4,686 | (internal) | 15 | 5 | 6 | 4 |
**What they found — common ground (all 3 identified):**
- Wash sale rule interaction with CA-driven lot closures (IRC §1091)
- Short position treatment for corporate actions
- Same-day corporate action ordering beyond `recorded_at` timestamp
- Record date / ex-date position verification (entitlement timing)
- Idempotency guard preventing double-application per user
- Decimal precision/rounding policy unspecified
- Superseded CA status has no lot rollback mechanism
- Rights/warrants post-creation lifecycle (exercise/expiration)
- Basis preservation invariant has no runtime enforcement
- Manual entry authorization and audit trail
**GPT-5 unique findings (not in either Claude model):**
- Per-lot eligibility based on entitlement date (not just user-level)
- Election-based outcomes for shareholder choices (cash vs stock)
- Instrument-level trading hold during CA application window
- Pre-application consistency checks against broker entitlements
- DB-level enforcement of status transitions and invariants
- Action-type-specific date semantics per field (ex vs record vs payable)
- Voluntary/tender actions beyond distributions
- Backfill/initialization guard for newly onboarded users
- Applicator retry/backoff semantics and confirmation race
- Rights indivisibility constraints vs exact Decimal quantities
**Claude Opus unique findings (not in either other model):**
- Pending order PRICE adjustment after splits (not just cancellation)
- Multi-instrument position recalculation atomicity for mergers
- Mixed merger basis floor at zero (can produce negative basis)
- Tax lot identification method interaction with inherited dates
- Corporate action effect on strategy position limits/risk params
- Corporate actions on instruments not yet in the database
- Partial application window: new user acquires position mid-fan-out
- IRC §305(c) deemed distributions (taxable stock dividends)
- CA impact on unrealized P&L display and strategy evaluation
- Concurrent OrderManager startup + Applicator fan-out race
**Claude Sonnet unique findings (not in either other model):**
- Stale orders: failure modes table contradicts "excluded" section
- IRC §1223(1) holding period tacking verification at lot close
- Spinoff allocation percentage: no validation that child != parent instrument
- Combined spinoff allocations exceeding meaningful bounds
- Cash dividend bypasses OrderManager — record-date quantity snapshot lost
- Mixed merger large-denominator exchange ratio overflow
- Detector schedule: no intraday re-poll for same-day announcements
- ROC cumulative basis floor tracking and IRC §301(c)(3) distinction
- Mixed merger deferred loss not explicitly recorded in metadata
**Quality assessment:**
- **Claude Opus** was the MOST PROLIFIC (23 findings) — a notable inversion
from previous experiments where Opus typically found fewer but deeper
findings. Here, the explicit "missing feature" framing appears to have
unlocked Opus's breadth. Its unique findings included genuinely critical
items: pending order price adjustment after splits (Critical — direct
financial loss), multi-instrument atomicity for mergers (Critical —
position loss), and mixed merger negative basis (High — accounting
corruption). The findings were precise, well-reasoned, and showed both
regulatory depth (IRC §305(c)) and operational awareness.
- **GPT-5** was slightly less prolific (20 findings) but maintained its
characteristic breadth and operational-level thinking. Per-lot eligibility
(not just per-user) is a subtle but important distinction. The election-
based outcomes finding shows awareness of real-world corporate action
complexity. The backfill/initialization guard is operationally significant.
GPT-5 spent 8,512 reasoning tokens — moderate for its output volume.
- **Claude Sonnet** found fewer gaps (15) but several were genuinely
insightful. The internal contradiction between the failure modes table
and the "excluded" section is a real document inconsistency. The cash
dividend record-date quantity snapshot insight (#9) identifies a DATA LOSS
problem — the opportunity to capture that data expires. The mixed merger
deferred loss recording gap shows regulatory awareness. However, some
findings were more surface-level or overlapped heavily with the others.
**KEY INSIGHT — The original question from Finding #22 is ANSWERED:**
> "Opus's 'missing feature identification' mode (wash sales, commissions) —
> is this promptable on other models? Could we explicitly ask GPT-5 'what
> should this system compute but doesn't' and get similar results?"
**YES.** When explicitly prompted with a structured "missing feature"
framing, ALL three models found regulatory gaps (wash sales, IRC sections),
missing computations (basis calculations, rounding), and missing behaviors
(lifecycle events, notifications). GPT-5 produced findings in the same
*category* as what Opus uniquely found in Finding #22 (silent correctness
failures on specid-lot-selection.md).
In Finding #22, Opus uniquely identified wash sales and commission tracking
as missing features while GPT-5 focused on mechanism incorrectness and
Sonnet on composition failures. HERE, with the explicit "what's missing"
prompt, ALL three models found wash sales, ALL found regulatory gaps, and
ALL found missing behaviors.
**This confirms:** Opus's "missing feature identification" mode in Finding
#22 was NOT an inherent model capability — it was an emergent behavior from
the open-ended "silent correctness failures" prompt. When you give ALL models
the EXPLICIT instruction to look for missing features, they all do it. The
differentiation in #22 was caused by the prompt being more open-ended,
allowing each model to default to its natural analytical mode:
- Opus → "what's missing" (features/functionality)
- GPT-5 → "what's wrong" (mechanism failures)
- Sonnet → "what breaks when combined" (composition)
**Prompt framing dominates model personality.** With the right prompt,
any model can be directed into any analytical mode. The model differences
that emerged in earlier open-ended experiments reflect DEFAULT TENDENCIES,
not capabilities.
**NEW finding about Opus on complex documents:**
Opus produced MORE findings than GPT-5 (23 vs 20) — the first time this
has happened on a broad analytical task. Previous pattern: GPT-5 always
finds more (20-33 findings) while Opus finds fewer but deeper (7-13).
What changed? The document is 992 lines — the longest tested — and the
task is explicitly about breadth ("find all gaps"). On this specific
combination (long document + breadth-focused prompt), Opus appears to
allocate its internal reasoning budget toward exploration rather than
its usual depth-first design-tension mode. This suggests Opus's typical
"fewer but deeper" pattern is partially a RESPONSE to shorter documents
where depth is more productive than breadth.
**Practical implications:**
1. For missing-feature analysis: prompt structure matters more than model
choice. All three models are viable. Use the explicit 5-category prompt.
2. Run all three for critical docs — they find different specific gaps
despite finding the same categories.
3. For open-ended analysis where you want models to find DIFFERENT things:
use open-ended prompts. For analysis where you want COMPREHENSIVE
coverage of one type: use structured prompts.
4. Opus's "fewer but deeper" personality can be overridden by document
length + breadth-focused prompt. On 992-line docs, it competes on
volume with GPT-5.
**Cost-effectiveness:**
Opus: 4,111 output tokens for 23 findings = 179 tokens/finding
GPT-5: 11,354 output tokens (+ 8,512 reasoning) for 20 findings = 993 tokens/finding
Sonnet: 4,686 output tokens for 15 findings = 312 tokens/finding
Opus is by far the most efficient: nearly 6x fewer tokens than GPT-5 per
finding, with MORE findings. This is the strongest cost-effectiveness case
for Opus on any tested task. On long documents with breadth-focused prompts,
Opus appears to be the optimal choice for both quality AND efficiency.
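
The per-finding figures above can be reproduced with a few lines of arithmetic. This sketch follows the text's own accounting, which is an assumption worth flagging: GPT-5's reasoning tokens are added on top of its output tokens, and the Claude models' reasoning is counted as zero because it is reported only as "(internal)". The variable and function names are illustrative.

```python
# Token and finding counts copied from the Finding 26 results table.
runs = {
    "Opus": {"output": 4111, "reasoning": 0, "findings": 23},
    "GPT-5": {"output": 11354, "reasoning": 8512, "findings": 20},
    "Sonnet": {"output": 4686, "reasoning": 0, "findings": 15},
}


def tokens_per_finding(run: dict) -> float:
    # Assumption: reasoning tokens are billed in addition to output tokens.
    return (run["output"] + run["reasoning"]) / run["findings"]


costs = {name: tokens_per_finding(run) for name, run in runs.items()}
ratio_gpt5_to_opus = costs["GPT-5"] / costs["Opus"]

for name, cost in costs.items():
    print(f"{name}: {cost:.0f} tokens/finding")
print(f"GPT-5 / Opus: {ratio_gpt5_to_opus:.1f}x")
```

If GPT-5's output count already includes its reasoning tokens, the GPT-5 figure drops to roughly 568 tokens per finding and the gap to Opus narrows accordingly.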
@@ -0,0 +1,276 @@
# Finding 27: Design coherence on risk-controls.md: GPT-5 regains top position; Opus's advantage from Finding #15 was document-specific, not task-specific
**Date:** 2026-05-05
**Task:** Identify internal design incoherences in gargoyle's `risk-controls.md` (277 lines)
— a pre-trade risk control specification covering two evaluation stages, reduction semantics,
ordering rationale, fail-closed claims, and audit logging.
**How we used them:** Same document (full text) + same focused analytical question to all
3 models via HAI proxy. Highly structured prompt specifying 5 categories of incoherence
(safety properties not enforced, ordering/sequencing contradictions, reduction semantics
conflicts, fail-closed claims vs actual behavior, cross-stage inconsistencies). Required
each finding to reference specific contradictory parts. No tools, no project context beyond
the document itself.
| Model | Time | Output tokens | Reasoning tokens | Incoherences found | Critical | High | Medium |
|---|---|---|---|---|---|---|---|
| GPT-5 | 112s | 8,231 | 7,232 | 6 | 1 | 3 | 2 |
| Claude Opus 4.6 | 41s | 1,858 | (internal) | 5 | 2 | 2 | 1 |
| Claude Sonnet 4.6 | 15s | 699 | (internal) | 4 | 1 | 2 | 1 |
**What they found — common ground (all 3 identified):**
- Reduction re-entry at BuyingPower contradicts "reducing controls never re-enter
earlier controls" (all three flagged this as the most obvious contradiction —
Concentration at position 5 reduces, re-enters at BuyingPower at position 4,
which IS an earlier control)
- Ordering rationale's categorization of buying power/concentration is internally
confused (the doc labels both as "quantity-sensitive checks" that run after
reducing controls, but concentration IS a reducing control at position 5 while
buying power at position 4 sits between the two reducing controls)
**GPT-5 unique findings (not in either Claude model):**
- Signal-level "no portfolio context" contradicts NoShortSales requiring knowledge
of current positions. The doc explicitly states signals are evaluated "in isolation"
with "no portfolio context — only the signal itself and user settings" — but checking
whether the user holds a position IS portfolio context. This is a genuine design
tension: either SignalRisk has hidden portfolio access (violating isolation) or
NoShortSales can't actually work as specified.
- Settings "fall through to system defaults" vs "Settings cache miss → reject."
Two incompatible instructions for the same condition (missing settings).
- "Universal fail-closed" with "only exception is order rate window" contradicted
by Failure Modes table showing buying power as another exception ("Conservative
estimate; may over-reject" is NOT rejection — it's a different failure mode than
either fail-closed or the documented single exception).
- Audit model says "every control evaluation produces an audit entry regardless of
outcome" but the signal-stage write point only describes writing on rejection.
Passing signals produce no documented audit entry at the signal stage.
**Claude Opus unique findings (not in either other model):**
- Signal flow diagram swaps control order vs table: table shows (1) MarketHours,
(2) PerTradeStop, (3) NoShortSales, but diagram flows MarketHours → NoShortSales
→ PerTradeStopLoss. Controls 2 and 3 are reversed between the two representations.
(VERIFIED: this is correct — the diagram does show a different order.)
- Concentration re-entry loop can bypass Order Rate, Duplicate, Self-Trade, and
Fat Finger entirely during intermediate iterations. Also: Position Size at order 3
is never re-checked against Concentration-reduced quantity because re-entry starts
at BuyingPower (order 4), meaning "smaller quantity wins" semantics are implemented
differently than the linear model described in Reduction Semantics.
**Claude Sonnet unique findings (not in either other model):**
- Buying Power "Can reduce? No" creates a dead end: if a reduced quantity still
exceeds buying power, the system can only reject entirely (no mechanism to further
optimize), defeating the purpose of the reduction system for capital-limited users.
(NOTE: this is more of a design limitation than a self-contradiction, but the
framing — that the reduction system's purpose is undermined by buying power's
inability to reduce — is a legitimate coherence observation.)
**Quality assessment:**
- **GPT-5** produced the most findings (6) with the broadest coverage across the
prompt's 5 categories. The NoShortSales/portfolio-context finding is the most
genuinely insightful — it's a fundamental design-level contradiction (a signal-level
control that REQUIRES decision-level context). The settings contradiction and
audit logging inconsistency are also solid. Every finding points to two specific
textual statements that are incompatible. Severity ratings were calibrated (1
Critical, 3 High, 2 Medium — compared to Opus's 2 Critical for similar findings).
- **Claude Opus** was remarkably fast (41s, 1,858 tokens) and found one thing
neither other model caught: the diagram/table order reversal for signal controls.
This is a concrete, verifiable error (not a design tension — a literal mistake in
the document). The re-entry loop analysis (finding #5) goes deeper than GPT-5's
version of the same core issue, exploring the implications for "smaller quantity
wins" semantics. However, Opus found fewer total issues and missed the
settings contradiction and audit logging inconsistency.
- **Claude Sonnet** was the fastest (15s, 699 tokens) and found 4 issues. The buying
power dead-end observation is unique and shows genuine reasoning about the reduction
system's limitations. However, it's more of a "this design can't achieve its stated
goal" than a strict self-contradiction. Sonnet's other findings overlap with the
common ground. Quality is solid but narrower scope.
**Key insight — Finding #15's Opus > GPT-5 result was document-specific:**
In Finding #15 (coherence checking on failure-modes.md), Opus found 7 incoherences
vs GPT-5's 4. Here, on risk-controls.md, GPT-5 found 6 vs Opus's 5. The reversal
suggests that the relative performance on coherence checking depends on the
DOCUMENT'S structure, not on a fixed model advantage:
- **failure-modes.md** (383 lines): A complex multi-process system with many
stated invariants across failure states, supervision trees, and recovery paths.
Rich in design TENSIONS where one subsystem's safety mechanism undermines another.
This plays to Opus's strength (finding design tensions between subsystems).
- **risk-controls.md** (277 lines): A more focused specification with explicit rules,
ordering constraints, and behavior tables. Rich in SPECIFICATION CONTRADICTIONS
where one statement directly conflicts with another. This plays to GPT-5's
strength (systematic verification of claims against stated mechanisms).
The difference: Opus excels when contradictions are EMERGENT (arise from composing
multiple design decisions). GPT-5 excels when contradictions are EXPLICIT (two
statements in the document say incompatible things). Risk-controls.md has more
explicit contradictions (the settings fallback vs fail-closed, the "no portfolio
context" vs NoShortSales, the audit "always" vs write point "only on reject").
**Model performance depends on CONTRADICTION TYPE:**
| Contradiction type | Best model | Example |
|---|---|---|
| Emergent/compositional | Opus | "Rest-for-one cascade creates a 5th state" |
| Explicit/definitional | GPT-5 | "No portfolio context" but check requires portfolio |
| Diagrammatic/structural | Opus | Table order ≠ diagram order |
| Semantic/category confusion | All (common ground) | Reduction re-entry violates ordering claims |
**Revised conclusion on Finding #15's open question:**
"Does Opus > GPT-5 ordering for coherence checking hold across other documents?"
**No.** The ordering depends on the document's contradiction density and type.
Documents rich in emergent design tensions favor Opus. Documents with explicit
specification errors favor GPT-5. The task type (coherence checking) doesn't have
a fixed model winner — it depends on what KIND of incoherences the document contains.
**Practical implication:** Continue running both models for coherence checking. Their
strengths are complementary even within the same task type. GPT-5 catches things you
can point to in the spec and say "these two sentences conflict." Opus catches things
where you need to reason about the implications of multiple mechanisms interacting.
## Open Questions
- Does GPT's advantage in finding inconsistencies extend to logical
inconsistencies in arguments? One data point (verdict mismatches) — need more.
- What's the optimal task granularity for GPT analytical review? "Whole PR" is
too big. Is "one hypothesis" right, or can we batch?
- ~~Is the GPT-4.1 Mini bias detection result repeatable, or was it a well-
structured task that any model would ace?~~ **ANSWERED (Finding #8):** Any
model aces it when the biased text is presented without noise. The original
result was about noise elimination, not model capability.
- **NEW:** Does adding a narrow bias-check question to a rich PR review
context recover the detection that broad review misses? (Signal-to-noise
confirmation test)
- ~~How does reasoning_effort affect analytical quality? Only tested default so
far.~~ **ANSWERED (Finding #21):** Negligible effect on GPT-5 for open-ended
analytical tasks. Low/medium/high produced 33/30/30 findings with nearly
identical reasoning tokens (~4K) and per-finding depth. The parameter
may primarily affect verifiable-answer tasks, not exploration. Task framing
remains the dominant quality lever.
- Can we design a systematic "analytical review checklist" that leverages each
model's strengths?
- ~~What analytical tasks is Opus best at vs Sonnet? Finding #11 shows Opus
excels at design-tension identification. How does Sonnet compare on the
same task? (Sonnet is non-reasoning but fast — would it match GPT-4.1?)~~
**ANSWERED (Finding #12):** Sonnet 4.6 significantly outperforms GPT-4.1
(17 vs ~14 assumptions) and approaches GPT-5 (17 vs 20). It's not a
non-reasoning model in the GPT-4.1 sense — it occupies a middle tier with
genuine component-interaction reasoning. Opus still wins on design-tension
identification specifically.
- How do the models compare on research synthesis tasks (our #381 rewrite)?
We'll find out during the actual rewrite.
- ~~Does the reasoning-token advantage scale with document complexity? Test
with a simpler doc to see if the gap narrows.~~ **ANSWERED (Finding #11):**
The gap doesn't narrow with simpler docs. GPT-5 maintains ~1.7x the findings
of GPT-4.1 regardless of document complexity. Reasoning tokens enable
exhaustive exploration independent of input difficulty.
- ~~Would Claude Opus (also a reasoning model) match GPT-5's assumption-finding
performance, or does it have different blind spots?~~ **ANSWERED (Finding #11):**
Different blind spots, different strengths. GPT-5 reasons deeper into
implementation mechanics (breadth + technical depth). Opus reasons wider
about system context and design tensions (insight density). They're
complementary, not competing. Run both on important architecture docs.
- ~~Does Sonnet 4.6's strong showing hold across other analytical tasks
(bias detection, gap-finding) or is it specific to assumption-finding on
complex documents? Need to test Sonnet on simpler docs and different question
types.~~ **PARTIALLY ANSWERED (Finding #13):** Sonnet's strength does NOT
transfer to concurrency reasoning. It dropped from 85% of GPT-5 (assumption-
finding) to ~58% (race condition identification). Task type matters more
than we thought. Still untested: gap-finding, bias detection for Sonnet.
- **NEW:** What other analytical tasks require sequential/temporal reasoning
(like race condition identification) vs pattern-matching reasoning (like
assumption-finding)? Building a task taxonomy would help assign models
correctly.
- **NEW:** What explains Sonnet taking slightly longer than Opus here (106s vs
105s) despite normally being the faster model? Is it the document length, or
does Sonnet's internal reasoning scale with complexity similarly to Opus?
- ~~How does GPT-5 Mini compare to GPT-5 on analytical tasks? Is it a viable
cheaper substitute?~~ **ANSWERED (Finding #14):** GPT-5 Mini is a viable
middle option. Finds fewer issues (6 vs 10) but with genuine reasoning
depth at ~50% cost/time. Better than non-reasoning models, not as
exhaustive as GPT-5.
- **NEW:** How does Claude 4.5 Opus/Sonnet compare to Claude 4.6? HAI now
exposes both; worth testing whether the newer versions regress on
analytical tasks.
- ~~Would running GPT-5 Mini + Sonnet together (different axes)
approach GPT-5's coverage at lower combined cost?~~ **ANSWERED (Finding #19):**
71% coverage at 31% cost. Good for low-stakes work; GPT-5 irreplaceable for
high-stakes due to unique domain-knowledge findings in the missing 29%.
- ~~**NEW (Finding #15):** Does the Opus > GPT-5 ordering for coherence checking
hold across other documents? The inversion (Opus finding more than GPT-5)
was striking — need to confirm it wasn't document-specific.~~
**ANSWERED (Finding #27):** No — it was document-specific. On risk-controls.md,
GPT-5 found 6 vs Opus's 5. The winner depends on contradiction TYPE: Opus
excels at emergent/compositional contradictions, GPT-5 at explicit/definitional
ones. No fixed ordering for this task type.
- **NEW (Finding #15):** Is the two-pass approach (Opus generates → GPT-5
validates) worth the extra cost vs just running Opus alone? Need to test
whether GPT-5 actually catches Opus false-positives or just agrees.
- ~~How do the Claude 4.5 and 4.6 models compare on analytical tasks?~~
**ANSWERED (Finding #16):** 4.5 is more exhaustive (2x findings), 4.6 is
more precise (higher signal-to-noise). Genuine tradeoff, not a regression.
4.5 for coverage, 4.6 for actionability.
- **NEW (Finding #16):** Does the 4.5 vs 4.6 pattern hold across other task
types? Spec completeness may favor exhaustiveness; would coherence checking
or race condition analysis show the same pattern?
- **NEW (Finding #16):** Is running both Sonnet versions (4.5 + 4.6) cost-
effective vs just running GPT-5? Need to compare the UNION of their findings
against GPT-5's output for overlap analysis.
- **NEW (Finding #18):** Does Opus's "predictable exploit window" detection
transfer to other policy documents? It uniquely identified that the cooldown
mechanism creates a GUARANTEED safe window that strategies could systematically
exploit — this is a higher-order security insight. Worth testing whether Opus
consistently finds "adversarial opportunity" framings that other models miss.
- **NEW (Finding #20):** Does GPT-5's extreme verification behavior (15:1
reasoning-to-output ratio, 3 findings from 12K reasoning) persist across
other documents with this prompt? Or was user-pipeline-lifecycle.md
particularly verification-heavy? Test invariant violation paths on a simpler
document.
- **NEW (Finding #20):** Would giving GPT-5 a "minimum 8 findings" instruction
reduce its selectivity and produce MORE invariant violations at lower
precision? Or would it just pad with non-violations? The extreme selectivity
may be a feature OR it may mean GPT-5 is discarding valid-but-hard-to-verify
findings.
- **NEW (Finding #20):** Opus's self-correction behavior is now confirmed across
Finding #15 and #20. Is this trainable/promptable? Could we ask non-Opus models
to "show your reasoning and withdraw findings you cannot fully verify"?
- **NEW (Finding #22):** The "silent correctness" lens revealed three distinct
analytical modes: Opus → missing functionality, GPT-5 → mechanism incorrectness,
Sonnet → composition failures. Does this three-way differentiation hold on other
documents, or was it specific to the regulatory/financial domain of specid-lot-selection?
- **NEW (Finding #22):** Does the "silent correctness" lens work on non-financial
documents? The financial/regulatory domain has a large gap between syntactic and
semantic correctness. Would the same prompt on an infrastructure/systems doc produce
equally differentiated findings, or would it collapse into assumption-finding?
- ~~**NEW (Finding #22):** Opus's "missing feature identification" mode (wash sales,
commissions) — is this promptable on other models? Could we explicitly ask GPT-5
"what should this system compute but doesn't" and get similar results?~~
**ANSWERED (Finding #26):** YES — all three models find regulatory gaps and
missing features when explicitly prompted. Opus's unique behavior in #22 was
an emergent DEFAULT tendency, not a capability. Prompt framing dominates
model personality.
- **NEW (Finding #28):** Cross-document consistency found real bugs in gargoyle
docs (fills vs events, position ownership, signal persistence). Does running
this analysis across MORE document pairs (e.g., domain readmes vs implementation
docs, design docs vs plan docs) yield additional real inconsistencies? Could
become a systematic documentation maintenance tool.
- **NEW (Finding #28):** Opus was 2.4x faster AND found more issues than GPT-5
on cross-document consistency. Is this because cross-doc contradictions are
easy to verify once spotted (reducing GPT-5's verification advantage)? Or
because boundary reasoning (Opus's strength) is the primary skill needed?
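Several of the open questions above ask how much of one model's output is covered by the union of other models' findings. A minimal sketch of that overlap calculation, assuming findings have already been matched by hand and given shared IDs (the IDs and the `coverage` helper below are hypothetical, not data from any experiment):

```python
def coverage(reference: set[str], *others: set[str]) -> float:
    """Fraction of the reference model's findings that appear in the
    union of the other models' findings."""
    if not reference:
        raise ValueError("reference set is empty")
    union = set().union(*others)
    return len(reference & union) / len(reference)


# Hypothetical finding IDs, for illustration only.
reference_model = {"f1", "f2", "f3", "f4"}
model_a = {"f1", "f2", "x9"}
model_b = {"f2", "f3"}

combined = coverage(reference_model, model_a, model_b)
print(f"{combined:.0%} of reference findings covered")
```

The hard part is the matching step that precedes this: deciding whether two differently worded findings describe the same issue still needs human judgment.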
## Methodology Notes
- Internet opinions about models are overwhelmingly about coding. Don't
extrapolate to analytical work without testing.
- "Just because someone says it on the internet doesn't make it right." —
Aaron, 2026-04-26. Opinions need context. Track our own evidence.
- Absence of published methodology for a use case is itself a finding.
- Each finding needs: date, task, **how we used it** (context shape, task
framing, what info the model had/didn't have), what happened, takeaway.
No unsupported generalizations.
- **Context dimensions to track:**
- Rich vs minimal (how much background info)
- Broad vs focused ("review this" vs "answer this specific question")
- What kind of context (diff, full files, issue text, research notes,
project conventions, nothing)
- Whether the model had access to tools or just text
- Whether the task was explicit step-by-step or open-ended
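
The required fields and context dimensions listed above could be captured as a structured record so that incomplete findings are easy to spot. This is a sketch under assumptions: the class, enum, and field names are hypothetical, and the example values are one reading of Finding 28 rather than an official classification.

```python
from dataclasses import dataclass, field, replace
from enum import Enum


class Richness(Enum):
    RICH = "rich"
    MINIMAL = "minimal"


class Scope(Enum):
    BROAD = "broad"      # "review this"
    FOCUSED = "focused"  # "answer this specific question"


@dataclass
class FindingRecord:
    date: str
    task: str
    how_used: str
    what_happened: str
    takeaway: str
    richness: Richness
    scope: Scope
    context_kinds: list[str] = field(default_factory=list)
    had_tools: bool = False
    step_by_step: bool = False

    def missing_fields(self) -> list[str]:
        """Names of required text fields that are empty."""
        required = ("date", "task", "how_used", "what_happened", "takeaway")
        return [name for name in required if not getattr(self, name).strip()]


record = FindingRecord(
    date="2026-05-05",
    task="Cross-document consistency between two architecture documents",
    how_used="Both documents as full text in one structured prompt",
    what_happened="Opus 7, GPT-5 6, Sonnet 4 inconsistencies",
    takeaway="Opus recommended for this task type",
    richness=Richness.MINIMAL,  # assumed reading: no project context
    scope=Scope.FOCUSED,        # assumed reading: 5 explicit categories
    context_kinds=["full text of both documents"],
    had_tools=False,
)
incomplete = replace(record, takeaway="")
print(record.missing_fields(), incomplete.missing_fields())
```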
@@ -0,0 +1,178 @@
# Finding 28: Cross-document consistency analysis: NEW task type — GPT-5 finds deep semantic contradictions; Opus finds structural/boundary mismatches; Sonnet identifies core issues quickly
**Date:** 2026-05-05
**Task:** Identify contradictions and inconsistencies BETWEEN two architecture documents
describing the same system: `system-overview.md` (323 lines, narrative overview with
component flows, invariants, and domain events) and `architecture.md` (213 lines,
DDD-focused with bounded contexts, context map, and message taxonomy).
**How we used them:** BOTH documents provided as full text in a single prompt (~25KB
total). Highly structured prompt specifying 5 categories of cross-document inconsistency
(terminology conflicts, structural contradictions, flow/sequence conflicts,
ownership/authority conflicts, philosophical contradictions). Required specific output
format per finding. Explicitly excluded omissions (things one doc covers and the other
doesn't) and detail-level differences. No tools, no project context beyond the two
documents. This is a NEW analytical task not previously tested: reasoning about
CONSISTENCY BETWEEN documents rather than internal coherence of a single document.
| Model | Time | Output tokens | Reasoning tokens | Inconsistencies found | Critical | High | Medium |
|---|---|---|---|---|---|---|---|
| GPT-5 | 125s | 9,415 | 8,384 | 6 | 2 | 3 | 1 |
| Claude Opus 4.6 | 52s | 2,351 | (internal) | 7 | 3 | 3 | 1 |
| Claude Sonnet 4.6 | 14s | 776 | (internal) | 4 | 1 | 2 | 1 |
**What they found — common ground (all 3 identified):**
- Event sourcing (all events as source of truth) vs fills-only ground truth:
Document A says fills are "ground truth from which all other state can be
derived," while Document B says "events are the source of truth, state is
computed by replaying events." A treats fills as the recovery foundation;
B treats ALL domain events as authoritative. All three models rated this
Critical.
- Bounded context naming mismatch: "Decision Engine" / "Order Management" (A)
vs "Engine" / "Trading" (B) for the same functional responsibilities.
GPT-5 folded this into a broader ownership analysis; Opus and Sonnet
surfaced it as its own finding.
- Signal classification conflict: Document A lists "Signal emitted" as a domain
event; Document B explicitly categorizes `SignalEmitted` as an audit event
("not used to rebuild state"). This determines event store design and
recovery semantics.
**GPT-5 unique findings (not in either Claude model):**
- Signal persistence contradiction: Document A states "Signals are never
persisted" while Document B lists `SignalEmitted` as an audit event that IS
persisted and states the audit log is mandatory for trading. These are
directly incompatible claims about whether signal data is stored.
- Audit event ownership conflict: Document A says "Decision approved" events
originate from PortfolioRisk. Document B states "only the decision engine
writes audit events" and lists `DecisionApproved` as an audit event example.
If PortfolioRisk is part of Risk (not Engine), this is an authority violation.
- "Single writer per user" (A: OrderManager writes all trading state) vs
per-aggregate single-writer (B: each aggregate writes its own event stream,
Ledger owns positions). These are incompatible authority models — either OM
centralizes writes or each domain owns its own events.
**Claude Opus unique findings (not in either other model):**
- Engine → OrderManager is an internal pipeline flow (A: same subgraph, direct
arrow) vs Engine → Trading is a cross-domain COMMAND (B: `PlaceOrder` command
crossing a bounded context boundary). This structural disagreement determines
whether order management is an internal pipeline stage or an independent domain
with its own aggregates and command validation.
- Signal Risk's architectural position: Document A shows a two-stage risk
architecture (SignalRisk pre-aggregation, PortfolioRisk post-aggregation)
where Risk is embedded in the pipeline. Document B's context map shows Risk
as a separate domain that Engine merely QUERIES ("kill switch active?") —
no arrow shows signal routing through Risk. Either risk logic lives inside
Engine (contradicting B's context boundary) or the context map is incomplete.
- The "reduce" step ownership: A's top-level flow labels `Approved →|"reduce"|
Decisions` (reduction at aggregation), while A's own domain events table says
"Decision reduced" originates from PortfolioRisk (reduction after aggregation).
This is actually an INTRA-document inconsistency in Document A, but Opus surfaced
it as part of cross-doc analysis.
**Claude Sonnet unique findings:**
- None genuinely unique. All 4 findings overlapped with GPT-5/Opus common ground
(event sourcing, signal persistence, context count/naming). Sonnet was efficient
(14s, 776 tokens) but didn't identify any inconsistency that the other two missed.
**Quality assessment:**
- **GPT-5** produced 6 well-reasoned findings with the deepest analysis of
OWNERSHIP conflicts. Its signal-persistence contradiction and single-writer
authority conflict are genuinely important — they reveal places where the two
documents would lead implementers to build fundamentally different systems.
Every finding quotes specific text from both documents and explains precisely
WHY they can't both be correct. The reasoning investment (8,384 tokens) was
used for thorough cross-referencing between documents.
- **Claude Opus** found the most inconsistencies (7) and was remarkably fast
(52s vs GPT-5's 125s). Its unique strength: identifying STRUCTURAL contradictions
about component boundaries and communication patterns. The Engine→Trading
command vs internal pipeline finding is architecturally the most significant
discovery — it reveals a fundamental disagreement about whether order
management is INSIDE or OUTSIDE the decision engine's boundary. Opus also
caught a bonus intra-document inconsistency (the "reduce" labeling error).
- **Claude Sonnet** was the fastest (14s) and most concise (776 tokens) but
found only the obvious common-ground issues. For cross-document consistency,
Sonnet's speed advantage came at the cost of missing the architectural
insights that make this task valuable. It did correctly identify all the
Critical-level issues, making it viable as a quick first-pass screen.
**Key insight — cross-document consistency is a DISTINCT task type:**
This is fundamentally different from single-document analysis (assumptions,
race conditions, coherence). It requires:
1. Building a mental model from Document A
2. Building a separate mental model from Document B
3. Finding places where the models are incompatible
4. Reasoning about WHY they can't both be correct (not just "different")
Step 4 is what distinguishes this from simple diff-detection. Many surface
differences (naming, detail level, scope) are NOT contradictions — the models
must judge which differences are genuinely incompatible vs. complementary.
The prompt explicitly excluded omissions and detail-level differences, and
all three models respected this constraint well.
**Model strengths on cross-document analysis:**
- **GPT-5** excels at ownership/authority conflicts: it systematically
checked "who owns this concept" in each document and found mismatches.
Its findings cluster around "who writes what" and "who is authoritative."
- **Opus** excels at structural/boundary contradictions: it identified where
the documents draw architectural lines differently. Its findings cluster
around "where are the boundaries" and "what crosses them."
- **Sonnet** identifies the obvious/critical issues quickly but doesn't dig
deeper. Viable for screening, not for thorough analysis.
**Comparison to Finding #15 / #27 (single-document coherence checking):**
Single-document coherence asks "does this document contradict itself?"
Cross-document consistency asks "do these documents contradict each other?"
Key differences in results:
| Aspect | Single-doc coherence | Cross-doc consistency |
|---|---|---|
| Opus findings | 5-7 | 7 |
| GPT-5 findings | 4-6 | 6 |
| Sonnet findings | 4-5 | 4 |
| Opus unique | Design tensions | Structural/boundary mismatches |
| GPT-5 unique | Definitional errors | Ownership/authority conflicts |
| Best model | Task-dependent | Opus (most findings + fastest) |
The relative ordering is similar (Opus ≥ GPT-5 > Sonnet for coherence-style
tasks), but the CHARACTER of unique findings shifted. On single-doc coherence,
Opus finds design tensions within a single design. On cross-doc consistency,
Opus finds BOUNDARY disagreements between two designs. GPT-5 shifts from
finding definitional errors to ownership conflicts.
**Are these findings REAL bugs in the gargoyle documentation?**
Yes — several are genuine issues worth fixing:
1. The fills-vs-events-as-ground-truth is a real philosophical tension between
the two documents that needs resolution.
2. The Position event ownership (OrderManager vs Ledger) is a real boundary
conflict that affects implementation.
3. The Engine→Trading communication style (internal pipeline vs cross-domain
command) is a genuine structural ambiguity.
4. The signal persistence claim ("never persisted" vs `SignalEmitted` audit
event) is a direct textual contradiction.
These are the kind of cross-document inconsistencies that cause teams to build
inconsistent implementations — one engineer reads Document A and builds one way,
another reads Document B and builds differently.
**Practical implication:** Cross-document consistency analysis is a high-value
task for documentation maintenance. Run it when:
- A system has multiple architecture docs written at different times
- A refactoring has updated one doc but not another
- Multiple people contribute to design documentation
- Moving from high-level overview to detailed specification
Opus is the recommended model for this task: fastest (52s vs 125s), most
findings (7 vs 6), and uniquely strong at boundary disagreements. GPT-5 adds
value for ownership-specific conflicts. Sonnet is sufficient for quick
screening (catches the Critical issues in 14s) but won't find the architectural
insights.
**Cost-effectiveness:**
- Opus: 2,351 output tokens for 7 findings = 336 tokens/finding (52s)
- GPT-5: 9,415 output + 8,384 reasoning for 6 findings = 2,967 tokens/finding (125s)
- Sonnet: 776 output tokens for 4 findings = 194 tokens/finding (14s)

Opus is the clear winner on this task type: more findings than GPT-5, 2.4x
faster, and 8.8x more token-efficient per finding. GPT-5's massive reasoning
investment (8,384 tokens) produced only one fewer finding than Opus — the
verification overhead is not paying off here because cross-document contradictions
are relatively easy to verify once identified (just check both documents).
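
The per-finding figures and ratios above can be reproduced from the results table. A small sketch, assuming (as the arithmetic above does) that GPT-5's output and reasoning tokens are summed and that the Claude models' unreported internal reasoning counts as zero:

```python
# Figures from the Finding 28 results table. Reasoning tokens for the
# Claude models are not reported, so they are treated as 0 here.
runs = {
    "Opus": {"output": 2351, "reasoning": 0, "findings": 7, "seconds": 52},
    "GPT-5": {"output": 9415, "reasoning": 8384, "findings": 6, "seconds": 125},
    "Sonnet": {"output": 776, "reasoning": 0, "findings": 4, "seconds": 14},
}


def tokens_per_finding(run: dict) -> float:
    return (run["output"] + run["reasoning"]) / run["findings"]


# int(x + 0.5) rounds halves up; Python's round() would round 2966.5 down.
per_finding = {m: int(tokens_per_finding(r) + 0.5) for m, r in runs.items()}
speedup = runs["GPT-5"]["seconds"] / runs["Opus"]["seconds"]
efficiency = tokens_per_finding(runs["GPT-5"]) / tokens_per_finding(runs["Opus"])

print(per_finding)
print(f"Opus vs GPT-5: {speedup:.1f}x faster, {efficiency:.1f}x fewer tokens per finding")
```

On these numbers Sonnet has the lowest tokens per finding of the three, so the case for Opus rests on finding count and depth rather than on per-finding cost alone.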
@@ -0,0 +1,174 @@
# Finding 29: Adversarial manipulation analysis: NEW task type — GPT-5 is most exhaustive and systematic; Opus produces qualitatively different attack vectors with system-level thinking; Sonnet is adequate but less creative
**Date:** 2026-05-05
**Task:** Identify adversarial manipulation paths in gargoyle's `aggregation.md` (193 lines)
— how a misbehaving, compromised, or buggy upstream component could exploit the
aggregator's design guarantees to produce harmful trading outcomes that bypass
downstream safety controls.
**How we used them:** Same document (full text) + same focused analytical question to all
3 models via HAI proxy. Highly structured prompt specifying 5 categories of adversarial
manipulation (signal injection, timing manipulation, capacity weaponization, state
corruption via crash, audit evasion). Required specific output format per finding
(attack vector, mechanism, exploit, why downstream controls miss it, severity). No tools,
no project context beyond the document itself.
| Model | Time | Output tokens | Reasoning tokens | Attack vectors found | Critical | High | Medium |
|---|---|---|---|---|---|---|---|
| Claude Sonnet 4.6 | 27s | 1,257 | (internal) | 10 | 3 | 5 | 2 |
| Claude Opus 4.6 | 84s | 3,662 | (internal) | 12 | 5 | 5 | 0 |
| GPT-5 | 111s | 8,808 | 6,336 | 15 | 2 | 10 | 3 |
**What they found — common ground (all 3 identified):**
- Primary signal hijacking via ranking manipulation (last-tick injection in
time-windowed aggregation to control decision parameters)
- Threshold gaming via signal replay/duplication (no deduplication means N
identical signals satisfy "N confirmations")
- Capacity flooding to force premature completion or deny legitimate trades
- Strategic crash to erase unfavorable in-flight groups
- Timeout-masqueraded manipulation (making attacks look like normal system behavior
in the audit trail)
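The threshold-gaming item can be made concrete with a toy confirmation counter. This is an illustrative sketch only: the `Signal` fields and both counting functions are hypothetical and are not taken from `aggregation.md`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    # Hypothetical fields, chosen for illustration.
    signal_id: str
    instrument: str
    side: str


def confirmations_naive(signals: list[Signal]) -> int:
    """Counts every arrival, so a replayed signal counts again."""
    return len(signals)


def confirmations_deduped(signals: list[Signal]) -> int:
    """Counts distinct signal IDs, so replays collapse to one."""
    return len({s.signal_id for s in signals})


THRESHOLD = 3
replayed = [Signal("sig-1", "ABC", "BUY")] * 3  # one signal sent three times

naive_ok = confirmations_naive(replayed) >= THRESHOLD
deduped_ok = confirmations_deduped(replayed) >= THRESHOLD
print(naive_ok, deduped_ok)
```

Without deduplication the replayed signal satisfies a three-confirmation threshold on its own; with deduplication it counts once.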
**GPT-5 unique findings (not in either Claude model):**
- **Direction flip against majority via ranking:** In "most recent" ranking,
emit multiple SELL confirmations then inject a late BUY — the BUY becomes
primary and the decision contradicts the bulk of evidence. Distinct from
general primary hijack because it's specifically about *directional* reversal.
- **Late-arrival exclusion of counter-signals:** Time signals so countervailing
signals arrive just after group destruction, ensuring the decision is formed
without dissenting inputs that would have altered ranking.
- **Capacity filter to curate the audit set:** Pre-fill buffer with chosen
signals so riskier alternatives cannot be included before capacity fires —
the contributing signals list looks clean.
- **Timer nullification by crash:** Crash just before a timeout that would
force-complete an unfavorable decision. The timer becomes a no-op on restart,
and no decision or expiry event is emitted.
- **Decision drop via induced forwarding failure:** Exploit the "Decision
forwarding fails: Decision is lost" failure mode to selectively suppress
protective decisions (stops, hedges) with no automatic retry.
- **Crash to erase evidence of contrary signals:** Post-crash, submit a
fresh group that completes quickly; audit shows only the new set, not the
earlier contradictory pre-crash signals.
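The direction-flip finding comes down to the difference between picking the most recent signal and picking the majority side. A minimal sketch with hypothetical data and helper names (the tuple layout is an assumption, not the aggregator's signal format):

```python
from collections import Counter

# (arrival order, side): three SELL confirmations, then one late BUY.
signals = [(1, "SELL"), (2, "SELL"), (3, "SELL"), (4, "BUY")]


def primary_most_recent(sigs: list[tuple[int, str]]) -> str:
    """Side of the latest signal, as a 'most recent' ranking would pick."""
    return max(sigs, key=lambda s: s[0])[1]


def majority_side(sigs: list[tuple[int, str]]) -> str:
    """Side backed by the largest number of signals."""
    return Counter(side for _, side in sigs).most_common(1)[0][0]


primary = primary_most_recent(signals)
majority = majority_side(signals)
print(primary, majority)
```

A single late signal decides the direction even though three of the four signals point the other way.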
**Claude Opus unique findings (not in either other model):**
- **Instrument fragmentation to multiply position size:** Emit signals for
economically equivalent exposures using different instrument identifiers.
Each gets its own group, each produces a separate decision, bypassing
per-group capacity limits. Combined position exceeds what any single group
would allow. Identifies TOCTOU at the fan-in to PortfolioRisk.
- **Forced stale decision via timer exploitation:** Emit one signal at a
favorable price spike known to be transient, then deliberately withhold
further signals. Timer force-completes with a stale price. The entry price
WAS valid when the signal was generated — PortfolioRisk doesn't check
staleness of decision prices.
- **Timeout prevention / keep-alive suppression:** Manipulate the market data
feed to suppress signals that would reach threshold N. The group expires
normally: a denial-of-trading attack disguised as insufficient confirmation.
- **Crash-restart duplicate decisions:** Crash after decision is forwarded
but before strategy reflects it. Both restart "clean" — strategy re-emits
signals, aggregator produces a second decision with a fresh ID. Same trade
executes twice. PortfolioRisk can't deduplicate because IDs are different.
- **Force-complete with insufficient confirmation (capacity < threshold):**
If the capacity limit is lower than the threshold, hitting capacity ALWAYS force-
completes before the predicate is satisfied. This fundamentally changes a
5-confirmation strategy into a 3-confirmation strategy.
- **Pattern predicate as arbitrary decision trigger:** If an adversary controls
the predicate logic (via strategy configuration), they can make pattern-complete
trigger on any single signal while the audit shows algorithm=pattern-complete
and reason=:predicate. The weak point is the trust boundary between configuration and execution.
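The capacity-below-threshold case is easy to see in a toy completion rule. This sketch assumes a group completes on whichever of "threshold reached" or "capacity reached" fires first; the function names and reason strings are hypothetical.

```python
from typing import Optional


def completion_reason(n_signals: int, threshold: int, capacity: int) -> Optional[str]:
    """Why a group would complete after n_signals, or None if it stays open."""
    if n_signals >= threshold:
        return "predicate"
    if n_signals >= capacity:
        return "capacity"
    return None


def first_completion(threshold: int, capacity: int) -> tuple[int, str]:
    """Add signals one at a time until the group completes."""
    n = 0
    while True:
        n += 1
        reason = completion_reason(n, threshold, capacity)
        if reason is not None:
            return n, reason


misconfigured = first_completion(threshold=5, capacity=3)
sane = first_completion(threshold=5, capacity=8)
print(misconfigured, sane)
```

With capacity 3 and threshold 5 the group always completes on the third signal for the capacity reason, so the five-confirmation requirement is never exercised.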
**Claude Sonnet unique findings (not in either other model):**
- **Cross-group timing coordination:** Coordinate signal injection across
multiple instruments to synchronize completion times, creating a burst of
correlated decisions that overwhelms PortfolioRisk's individually-safe
evaluations. (NOTE: Opus's instrument fragmentation finding is a similar
concept framed differently: Opus focused on position multiplication via
instrument aliasing, Sonnet on burst timing overwhelming evaluation.)
- **Multi-strategy attack distribution:** Spread manipulation across multiple
isolated strategy aggregators so no single aggregator's behavior looks
abnormal while cumulative effect is harmful.
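
Sonnet's cross-group timing finding is stated abstractly, so here is one way to make "burst" measurable. This is a sketch under assumptions: `max_burst` is a hypothetical helper, completion times are plain floats in seconds, and the window length is an arbitrary parameter rather than anything from the spec.

```python
from collections import deque
from typing import Deque, Iterable


def max_burst(completion_times: Iterable[float], window: float) -> int:
    """Largest number of decisions completing within any span of `window` seconds."""
    recent: Deque[float] = deque()
    best = 0
    for t in sorted(completion_times):
        recent.append(t)
        # Drop completions that fall outside the window ending at t.
        while t - recent[0] > window:
            recent.popleft()
        best = max(best, len(recent))
    return best


# Four groups completing independently vs. three synchronized by an adversary.
spread_out = [0.0, 10.0, 20.0, 30.0]
synchronized = [10.0, 10.2, 10.4, 30.0]
```

Here `max_burst(spread_out, 1.0)` is 1 and `max_burst(synchronized, 1.0)` is 3, even though both inputs contain the same number of decisions. A per-decision check sees no difference between the two.
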
**Quality assessment:**
- **GPT-5** produced the most findings (15) with the most systematic coverage
across all 5 prompt categories. Its strength was in identifying SPECIFIC
INTERLEAVINGS — exactly how timing, state, and ranking mechanisms interact
to produce exploits. The direction-flip finding (#3) and the late-arrival
exclusion finding (#6) show precise temporal reasoning about when signals
arrive relative to group lifecycle events. The "decision drop via forwarding
failure" finding exploits a DOCUMENTED failure mode (from the failure table)
as an offensive weapon — turning a recovery mechanism into an attack vector.
Every finding references specific mechanisms from the spec.
- **Claude Opus** produced 12 findings with the most architecturally creative
attacks. The instrument fragmentation attack is the most SYSTEMICALLY
dangerous finding across all three models — it's not about manipulating one
group but about the RELATIONSHIP between groups, and it identifies a
TOCTOU vulnerability at the PortfolioRisk fan-in point that no other model
found. The crash-restart duplication attack is also architecturally novel —
it exploits the "clean state" guarantee as a weapon for invisible trade
doubling. Opus consistently reasons about the system BOUNDARY (aggregator
→ PortfolioRisk handoff) rather than just within-component mechanics. The
pattern-predicate trust boundary finding is uniquely about CONFIGURATION
as an attack surface.
- **Claude Sonnet** produced 10 findings in 27s — extremely efficient (127
tokens per finding). Findings were adequate and covered all 5 categories,
but lacked the specificity of GPT-5 and the architectural creativity of
Opus. Several findings were somewhat generic (e.g., "crash at strategic
moments" without specifying exactly WHEN relative to group lifecycle).
The cross-group coordination and multi-strategy distribution findings show
system-level thinking but are stated at a higher abstraction level without
concrete exploit sequences.
**Key insight — "adversarial manipulation analysis" as a task type:**
This is qualitatively different from all previous analytical lenses tested.
Previous tasks asked models to find problems WITH the design (assumptions,
races, incoherences). This task asks models to find ways to USE the design
AGAINST itself — a creative/generative adversarial task. Results:
- **GPT-5** treats it as an exhaustive enumeration exercise — systematically
walks through each mechanism and asks "how could this be abused?" High
count (15), thorough coverage, but some findings are minor variations of
each other (e.g., crash-related findings #10, #12, #15 share the same core
mechanism). Reasoning tokens (6,336) used for both generation and verification.
- **Opus** treats it as a creative design exercise — asks "what would a
smart adversary do that the designer didn't consider?" Fewer findings (12)
but several are genuinely novel attack concepts (instrument fragmentation,
crash-restart duplication, predicate trust boundary) that require reasoning
about the SYSTEM rather than the COMPONENT. Opus also provided a summary
table and systemic conclusion about the root design weaknesses.
- **Sonnet** treats it as a categorization exercise — fills each prompt
category with plausible attacks but at a higher abstraction level. Fast
and adequate for a first pass but wouldn't surprise a security reviewer.
**Comparison to "predictable exploit window" (Finding #18):**
Finding #18 noted that Opus uniquely identified predictable exploit windows
in escalation-policy.md. Here, Opus again shows the strongest adversarial
creativity — the instrument fragmentation attack and crash-restart duplication
are both about exploiting DESIGN GUARANTEES (per-instrument grouping, clean
restart) as weapons. This confirms that Opus's strength on adversarial analysis
is a CONSISTENT PATTERN, not document-specific.
GPT-5 excels when the adversarial task is framed as "enumerate all possible
abuses of each mechanism" (systematic coverage). Opus excels when the task
requires "invent novel attack concepts that exploit design boundaries"
(creative adversarial thinking).
**Model hierarchy for adversarial manipulation analysis:**
1. GPT-5 — most thorough enumeration, best at mechanism-level exploitation (15)
2. Opus — most creative, finds system-boundary attacks others miss (12)
3. Sonnet — adequate first pass, fast, but less specific (10)
**Practical implication:** For security-oriented architecture review:
- Run GPT-5 for comprehensive attack surface enumeration
- Run Opus for novel/creative attack vectors that exploit design boundaries
- Sonnet is sufficient only as a quick initial screen
- The UNION of GPT-5 + Opus findings (removing overlaps) would produce the
most complete adversarial analysis
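
Taking the union is mechanical once each finding has a shared label. A minimal sketch, assuming labels are assigned by hand (deciding that two differently worded findings are the same attack still takes human judgment). `union_findings` is a hypothetical helper, and the lists below are a small illustrative subset of findings named in this file, not the models' full output.

```python
from typing import Dict, Iterable, List, Tuple


def union_findings(
    model_findings: Iterable[Tuple[str, List[str]]]
) -> Dict[str, List[str]]:
    """Merge findings by normalized label, recording which models reported each."""
    merged: Dict[str, List[str]] = {}
    for model, findings in model_findings:
        for label in findings:
            key = label.strip().lower()
            reporters = merged.setdefault(key, [])
            if model not in reporters:
                reporters.append(model)
    return merged


# Illustrative subset only; labels are hand-assigned.
gpt5 = ["direction flip", "late-arrival exclusion", "no signal deduplication"]
opus = ["instrument fragmentation", "crash-restart duplication", "No signal deduplication"]

merged = union_findings([("GPT-5", gpt5), ("Opus", opus)])
unique_to_one_model = sorted(k for k, v in merged.items() if len(v) == 1)
```

The six input labels collapse to five merged entries, with "no signal deduplication" attributed to both models. The `unique_to_one_model` list is the part of the union that a single-model review would have missed.
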
**New finding about the aggregator itself:** Several attacks identified by
multiple models point to real design weaknesses worth addressing:
1. No signal deduplication/independence validation (all 3 models)
2. Primary signal determines all decision parameters regardless of group
composition (all 3 models)
3. Transient state + no replay = perfect adversarial erasure tool (all 3 models)
4. Capacity/timeout treated as normal events even when weaponized (all 3 models)
5. No cross-group correlation at aggregator level (Opus + Sonnet)
6. TOCTOU at PortfolioRisk fan-in for concurrent decisions (Opus)
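
Weaknesses 5 and 6 are the ones instrument fragmentation exploits, and the gap is easy to show. This is a sketch only: `PER_GROUP_LIMIT`, the `UNDERLYING` alias map, and the `X-A`/`X-B`/`X-C` identifiers are all hypothetical, since the spec excerpt names neither a limit nor a way to recognize economically equivalent instruments.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical values; nothing here comes from the spec.
PER_GROUP_LIMIT = 100
UNDERLYING = {"X-A": "X", "X-B": "X", "X-C": "X"}

Decision = Tuple[str, int]  # (instrument identifier, position size)


def passes_per_group(decisions: List[Decision]) -> bool:
    """Check each decision against the limit in isolation, as a per-group check would."""
    return all(size <= PER_GROUP_LIMIT for _, size in decisions)


def exposure_by_underlying(decisions: List[Decision]) -> Dict[str, int]:
    """Sum position sizes by underlying exposure instead of by identifier."""
    totals: Dict[str, int] = defaultdict(int)
    for instrument, size in decisions:
        totals[UNDERLYING.get(instrument, instrument)] += size
    return dict(totals)


def over_limit(decisions: List[Decision]) -> List[str]:
    """Underlyings whose combined exposure exceeds the per-group limit."""
    totals = exposure_by_underlying(decisions)
    return sorted(u for u, total in totals.items() if total > PER_GROUP_LIMIT)


# Three aliased instruments, each just under the limit.
decisions = [("X-A", 90), ("X-B", 90), ("X-C", 90)]
```

Every decision passes `passes_per_group`, yet `exposure_by_underlying(decisions)` reports 270 on the single underlying `X`. Evaluating the batch as a whole, rather than checking each decision as it arrives, is one way to narrow the check-then-act window at the fan-in, assuming an alias map like `UNDERLYING` can be maintained at all.
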
+16
@@ -0,0 +1,16 @@
# Model Findings — Analytical & Research Work
_Tracking what actually works (and doesn't) when using AI models for research,
analysis, bias detection, and document review — not coding._
Started: 2026-04-26
## Context
We use multiple models in different roles: Claude Code (Opus/Sonnet) for
generation, Sonnet + GPT-5 for independent dual review, smaller models for
focused analytical tasks. Most public discussion is about coding. We found
almost no published methodology for using models in analytical research tasks
(searched 2026-04-26). That gap is why we're tracking this.
Each experiment lives in its own file. See individual finding files below.