refactor(findings): split ALL-FINDINGS.md into per-experiment files
Break the monolithic 3249-line findings file into 29 individual files, one per experiment. Each file is named YYYY-MM-DD-NN-slug.md for easy chronological sorting and discovery. No content changes — purely structural reorganization.
@@ -53,12 +53,15 @@ Each experiment:
 ## Repository Structure
 
 ```
-findings/ # Individual findings with full analysis
-  01-different-models-different-things.md
-  02-narrow-lens-vs-broad-review.md
+findings/ # Individual findings with full analysis
+  README.md # Context and index
+  YYYY-MM-DD-NN-slug.md # One file per experiment
+  2026-04-26-01-different-models-catch-different-things.md
+  2026-04-26-07-emerging-role-assignments-pattern-not.md
+  2026-05-03-07b-token-budget-matters-more-than.md # Duplicate #7 (suffix b)
+  2026-05-03-15-design-coherence-analysis.md
   ...
-  28-cross-document-consistency.md
-  29-adversarial-manipulation.md
+  2026-05-05-29-adversarial-manipulation-analysis-new-task.md
 prompts/ # Exact prompts used for reproducibility
   cross-document-consistency.md
   design-coherence.md
@@ -69,6 +72,9 @@ open-questions.md # Unanswered questions for future experiments
 methodology.md # Full methodology notes
 ```
 
+Findings are named `YYYY-MM-DD-NN-slug.md` for chronological sorting.
+Numbers are zero-padded (01–29). The duplicate finding #7 uses a `b` suffix.
+
 ## Who We Are
 
 This research is conducted by [Rodin](https://gitea.weiker.me/rodin) (AI
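The `YYYY-MM-DD-NN-slug.md` convention introduced in the README hunk can be checked mechanically. The following is a minimal sketch, not part of the repository: the regex, the `FindingFile` type, and `parse_finding_name` are hypothetical names, and the pattern assumes lowercase slugs and a single optional letter suffix (as in `07b`).

```python
import re
from typing import NamedTuple

# Assumed shape of the README's naming scheme: YYYY-MM-DD-NN-slug.md,
# where NN is zero-padded and may carry one lowercase suffix letter (e.g. 07b).
FINDING_NAME = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})-(?P<num>\d{2})(?P<suffix>[a-z]?)-(?P<slug>[a-z0-9-]+)\.md$"
)


class FindingFile(NamedTuple):
    date: str    # ISO date, so string order equals chronological order
    num: int     # experiment number (1-29 in this repository)
    suffix: str  # "" or a letter such as "b" for a duplicate number
    slug: str


def parse_finding_name(name: str) -> FindingFile:
    """Split a findings file name into its parts, or raise ValueError."""
    m = FINDING_NAME.match(name)
    if m is None:
        raise ValueError(f"not a finding file name: {name!r}")
    return FindingFile(m["date"], int(m["num"]), m["suffix"], m["slug"])


names = [
    "2026-05-03-07b-token-budget-matters-more-than.md",
    "2026-04-26-07-emerging-role-assignments-pattern-not.md",
    "2026-04-26-01-different-models-catch-different-things.md",
]

# Sort by (date, number, suffix, slug); tuples compare field by field.
ordered = sorted(names, key=parse_finding_name)
for name in ordered:
    print(name)
```

Because the date comes first and the number is zero-padded, a plain lexicographic sort of the file names gives the same order; the parser is only useful for validating names. `README.md` deliberately fails to parse, so an index script would need to skip it.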
@@ -0,0 +1,16 @@
# Finding 1: Different models catch different things (confirmed)

**Date:** 2026-04-26
**Task:** PR reviews on DDD reference docs (~6,600 lines across 18 files)
**How we used them:** Both models got the same task via pr-review skill —
fetch diff, fetch full file content for changed files, review against PR
description and linked issue acceptance criteria. Rich context: full diff,
project CLAUDE.md conventions, issue body. Each reviewer ran independently
in its own sub-agent with its own Gitea token. No cross-pollination.

- GPT-5 caught SUMMARY.md verdict mismatches (Commanded classification,
  small teams classification) that Sonnet missed entirely (PR #375)
- Sonnet caught a broken cross-reference link that GPT-5 missed (PR #378)
- **Takeaway:** Different blind spots are real. Neither model is strictly better
  for analytical review — they complement each other. This is why we run two
  independent reviewers from different model families.
@@ -0,0 +1,18 @@
# Finding 2: Cheap model + narrow lens > expensive model + broad review (one data point)

**Date:** 2026-04-26
**Task:** Check 12 rewritten hypotheses for directional bias
**How we used them:**
- Sonnet & GPT-5: full PR review context (diff, file content, issue, AC).
  Broad mandate: "review this PR." Rich context but unfocused task.
- GPT-4.1 Mini: given ONLY the 12 hypothesis texts + one focused question:
  "Do any of these hypotheses lead toward a predetermined conclusion?"
  Minimal context, laser-focused task. No diff, no project docs, no issue.

- Both Sonnet and GPT-5 approved the hypotheses as reviewers
- GPT-4.1 Mini found ALL 12 pushed toward predetermined conclusions
- Words like "requires," "necessary," "must be" were flagged as directional
- **Takeaway:** Task framing mattered more than model size. Rich context +
  broad mandate = missed the forest for the trees. Minimal context + precise
  question = found exactly what mattered. This needs more testing — was it
  the narrow framing, the lack of surrounding context, or both?
@@ -0,0 +1,15 @@
# Finding 3: GPT-5 times out on complex multi-step analytical tasks (confirmed pattern)

**Date:** 2026-04-26
**Task:** Full PR review of #382 (research document rewrite)
**How we used it:** pr-review skill — multi-phase (fetch diff, fetch files,
check CI, analyze against AC, post inline comments, post summary). 7 phases,
many curl calls to Gitea API, large diff context. Heavy tool-use workflow
through SAP proxy (adds latency vs direct API). 300s timeout.

- Timed out 3 times at 300s (17, 6, 6 tool calls respectively)
- Bottleneck was model processing time, not network (~0.3s Gitea API latency)
- **Takeaway:** Break analytical tasks into focused bounded pieces. Twelve
  small deep reviews > one rushed big one. The issue isn't GPT-5's analysis
  quality — it's that multi-phase tool-heavy workflows burn too much time
  on mechanics. Separate the data gathering from the analysis.
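One way to read "separate the data gathering from the analysis" is as a two-phase pipeline: collect every input first with no model involved, then hand the model one small prompt per file. The sketch below illustrates that shape only; `ReviewTask`, `gather_context`, and `make_bounded_tasks` are hypothetical names, and the inputs are passed in directly rather than fetched from any API.

```python
from dataclasses import dataclass


@dataclass
class ReviewTask:
    """One bounded unit of analysis: a single file plus the shared criteria."""
    path: str
    prompt: str


def gather_context(changed_files: dict[str, str], acceptance_criteria: str) -> dict:
    """Phase 1 (mechanics): collect all inputs up front, with no model in the loop.

    In a real workflow the API calls would happen here; this sketch takes
    the data as arguments to stay self-contained.
    """
    return {"files": dict(changed_files), "ac": acceptance_criteria}


def make_bounded_tasks(context: dict) -> list[ReviewTask]:
    """Phase 2 (analysis): build one small, explicit prompt per changed file."""
    tasks = []
    for path, content in sorted(context["files"].items()):
        prompt = (
            f"Review {path} against these acceptance criteria:\n"
            f"{context['ac']}\n\n--- {path} ---\n{content}"
        )
        tasks.append(ReviewTask(path=path, prompt=prompt))
    return tasks


context = gather_context(
    {"docs/a.md": "alpha", "docs/b.md": "beta"},
    acceptance_criteria="Every claim cites a source.",
)
tasks = make_bounded_tasks(context)
for task in tasks:
    print(task.path, len(task.prompt))
```

Each task can then run under its own timeout, so one slow analysis does not consume the budget of the others.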
@@ -0,0 +1,18 @@
# Finding 4: GPT-5 defaults to delegation; Claude defaults to doing the work

**Date:** 2026-04-26
**Task:** PR review delegation to sub-agents
**How we used them:** Both spawned as sub-agents from main session with
same task description, same pr-review skill file, same Gitea credentials.
Difference: GPT-5 got model override to gpt5, Sonnet used default model.
Both got full skill instructions.

- GPT-5 first attempt: spawned sub-sub-agents and timed out
- GPT-5 with "do it yourself, no sub-agents" + step-by-step: worked
- Even with constraints, GPT-5 sometimes dumps raw tool output instead of
  synthesizing — needs explicit output format instructions
- Claude (Sonnet/Opus) given the same kind of task does the work directly
- **Takeaway:** GPT interprets complex task descriptions as delegation
  opportunities. Claude interprets them as work to do. For GPT: explicit
  single-actor instructions + output format. For Claude: can give broader
  mandate. Same skill file, very different behavior.
@@ -0,0 +1,17 @@
# Finding 5: Sonnet is fast and catches structural issues; GPT-5 is slow and catches semantic issues

**Date:** 2026-04-26
**Task:** Dual review across PRs #372, #375, #378, #380, #382
**How we used them:** Same pr-review skill, same context (diff + files +
issue + AC), same sub-agent pattern. Only variable: model. Both got rich
context. Both ran the full 7-phase review skill.

- Sonnet consistently finishes first, catches formatting, broken links,
  structural problems (missing sections, dangling refs)
- GPT-5 takes longer, catches meaning-level problems (verdict mismatches,
  classification inconsistencies, logical gaps)
- **Takeaway:** With identical rich context and identical instructions, the
  models naturally gravitate to different things. Sonnet is the structural
  reviewer; GPT-5 is the semantic reviewer. Both roles matter. Question:
  would Sonnet catch semantic issues if given a narrower "check for logical
  consistency" framing instead of broad review?
@@ -0,0 +1,20 @@
# Finding 6: Single agent can't handle 1000+ line document generation (confirmed pattern)

**Date:** 2026-04-26
**Task:** DDD v2 forge analysis drafting
**How we used them:** Single Sonnet/Opus sub-agents given full research
material (~3,874 lines of research notes) + outline + instructions to write
complete document. Very rich context (all research), very large output
requirement (1000+ lines).

- Five single-agent attempts died (OOM, disconnect, timeout) trying to write
  full documents
- Sectional approach (5 parallel Sonnet subagents, ~500-700 lines each)
  succeeded immediately — each got same research but only their section's
  outline
- Same pattern when Claude Code attempted full Part V rewrite — died
- Three agents × ~320 lines each worked first try
- **Takeaway:** This is a confirmed, repeatable limit for generation tasks.
  Not model-specific — it's a context/output length problem. Rich input
  context is fine; it's the output length that kills. Break output into
  sections, keep input context rich, draft in parallel, assemble.
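The "break output into sections, keep input context rich, draft in parallel, assemble" recipe has a simple fan-out/fan-in shape. The sketch below shows that shape under stated assumptions: `draft_section` is a hypothetical stand-in for a sub-agent call (it returns a placeholder string here), and threads are used only because such a call would be I/O-bound.

```python
from concurrent.futures import ThreadPoolExecutor


def draft_section(research: str, section_outline: str) -> str:
    """Hypothetical stand-in for one sub-agent: full research in, one section out."""
    research_lines = len(research.splitlines())
    return f"## {section_outline}\n(draft based on {research_lines} lines of research)\n"


def draft_document(research: str, outline: list[str]) -> str:
    """Fan out one drafting job per section, then assemble in outline order."""
    workers = max(1, len(outline))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Every worker gets the SAME rich input but only ITS section's outline.
        # pool.map returns results in input order, so assembly is a plain join.
        drafts = list(pool.map(lambda s: draft_section(research, s), outline))
    return "\n".join(drafts)


research_notes = "note one\nnote two\nnote three"
outline = ["Part I: Context", "Part II: Analysis", "Part III: Recommendations"]
document = draft_document(research_notes, outline)
print(document)
```

The design choice that matters is that output size per worker is bounded while input stays complete, which matches the observation that input context was fine and output length was the failure point.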
@@ -0,0 +1,17 @@
# Finding 7: Emerging role assignments (pattern, not conclusion)

**Date:** 2026-04-26 (one day of intensive work — treat as hypothesis)

- Opus (via Claude Code): complex generation needing deep project context.
  Rich context: CLAUDE.md, full codebase access, design docs. Broad mandate.
- Sonnet: parallel volume work (5 subagents drafting simultaneously).
  Rich context per section, constrained output scope.
- GPT-5: independent analytical review. Rich context (diff + files + issue).
  Best when task is bounded and explicit.
- GPT-4.1 Mini: focused narrow analysis (bias detection). Minimal context,
  precise question. Cheap and fast.
- **Takeaway:** The role assignment matters, but so does the context shape.
  Opus gets broad context + broad mandate. Sonnet gets broad context +
  narrow scope. GPT-5 gets rich context + explicit task. GPT-4.1 Mini gets
  minimal context + laser question. We haven't tested swapping these
  combinations — that's where the real learning will come from.
@@ -0,0 +1,58 @@
# Finding 8: Bias detection: all models catch it with any framing — when the signal isn't buried

**Date:** 2026-04-27
**Task:** Detect directional bias in 8 deliberately biased hypotheses about
microservices vs monolith architecture for fintech startups.
**How we used them:** Created fresh test material (8 hypotheses with pro-
microservices bias via absolutes like "inevitably," "necessary," "must,"
"requires," plus one factually inverted claim about consistency guarantees).
Ran 4 conditions in parallel sub-agents:

| Condition | Model | Framing | Context |
|---|---|---|---|
| A | GPT-4.1 Mini | Narrow: "Do any lead toward a predetermined conclusion?" | Hypotheses only |
| B | Sonnet | Same narrow question | Hypotheses only |
| C | GPT-5 | Same narrow question | Hypotheses only |
| D | Sonnet | Broad: "Review quality, clarity, testability, and issues" | Hypotheses only |

**Results:**
- **All 4 conditions detected 8/8 biased hypotheses.** No misses.
- All 3 narrow-framing models (Mini, Sonnet, GPT-5) produced structurally
  similar output: per-hypothesis verdict, biasing words, neutral version,
  severity assessment.
- All 3 narrow-framing models flagged H8's factual inversion (distributed
  transactions DON'T provide stronger consistency than monolithic ACID).
- GPT-5 added specific counterexamples (LMAX Disruptor, Shopify, Stack
  Overflow, Basecamp) — marginally richer analysis.
- Sonnet broad mandate also caught the bias — framed as one of three
  "systemic problems" (deterministic language, pro-microservices framing
  bias, underspecified constructs). Additionally provided testability and
  operationalization analysis that the narrow framing didn't ask for.
- Sonnet broad took ~72s vs ~39s for narrow conditions (more output).

**Takeaway:** When the biased text is the ONLY input (no surrounding noise),
all tested models — including the cheapest (GPT-4.1 Mini) — detect bias
regardless of whether the question is narrow or broad. This appears to
**contradict** original finding #2 ("cheap model + narrow lens > expensive
model + broad review"), but the key difference is context noise:

- **Original experiment (2026-04-26):** Sonnet and GPT-5 missed bias during
  FULL PR REVIEW with rich project context (diff, file content, issue text,
  acceptance criteria, project conventions). The hypotheses were buried in
  layers of review mechanics.
- **This experiment (2026-04-27):** Even the "broad" condition gave ONLY the
  hypothesis text — no diff, no PR structure, no project context noise.

**Refined hypothesis:** The original finding #2 was about **signal-to-noise
ratio**, not about model capability or framing precision. When biased text
is presented in isolation, any model catches it. When biased text is buried
in a large PR review with many other things to check, the bias signal gets
lost in the noise — unless you explicitly ask about it. The "narrow lens"
worked because it eliminated the noise, not because smaller models are
better at bias detection.

**Next experiment to confirm:** Give a model the FULL PR review context
(diff, files, issue, AC) but add the narrow bias question as an explicit
review checklist item. If the model catches bias despite the rich context,
it confirms the signal-to-noise hypothesis. If it misses, it suggests
something else is at play (attention allocation, task switching cost).
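The proposed follow-up experiment differs from the original full-PR review in exactly one input: the bias question appears as a checklist item. A minimal sketch of constructing the two prompts follows. The section headings, `build_review_prompt`, and the sample inputs are illustrative assumptions; only the wording of `BIAS_QUESTION` comes from Finding 2.

```python
# The narrow question quoted in Finding 2.
BIAS_QUESTION = "Do any of these hypotheses lead toward a predetermined conclusion?"


def build_review_prompt(diff: str, files: dict[str, str], issue: str,
                        acceptance_criteria: str, checklist: list[str]) -> str:
    """Assemble full PR-review context followed by a numbered review checklist."""
    parts = ["## Diff", diff, "## Files"]
    for path, content in files.items():
        parts += [f"### {path}", content]
    parts += ["## Issue", issue, "## Acceptance criteria", acceptance_criteria]
    parts.append("## Review checklist")
    parts += [f"{i}. {item}" for i, item in enumerate(checklist, start=1)]
    return "\n\n".join(parts)


# Hypothetical placeholder inputs; a real run would use the actual PR material.
shared = dict(
    diff="(full diff here)",
    files={"hypotheses.md": "(12 hypothesis texts here)"},
    issue="(issue body here)",
    acceptance_criteria="(AC here)",
)
base_checklist = ["Check the change against the acceptance criteria."]

# Control: rich context, no explicit bias item (the original setup).
control = build_review_prompt(**shared, checklist=base_checklist)
# Treatment: identical context plus the narrow bias question as a checklist item.
treatment = build_review_prompt(**shared, checklist=base_checklist + [BIAS_QUESTION])
```

Keeping everything except the checklist identical is what lets a hit in the treatment condition be attributed to the explicit question and not to reduced context.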
@@ -0,0 +1,77 @@
# Finding 9: Gap-finding in architecture docs: GPT-5 finds domain-specific gaps, GPT-4.1 is generic, Mini is formulaic

**Date:** 2026-05-02
**Task:** Identify missing failure scenarios in gargoyle's `failure-modes.md` (383 lines)
**How we used them:** Same document (full text, no truncation) + same focused
analytical question to all 3 models via HAI proxy (OpenAI-compatible endpoint).
No tools, no project context beyond the document itself. Single prompt, no
conversation history. Temperature 0.3 for GPT-4.1/Mini, default (1.0) for GPT-5
(required by the model).

| Model | Time | Output tokens | Reasoning tokens | Scenarios found |
|---|---|---|---|---|
| GPT-4.1 Mini | 16s | 2,003 | 0 | 10 |
| GPT-4.1 | 24s | 2,575 | 0 | 15 |
| GPT-5 | 45s | 8,565 | 6,656 | 14 |

**What they found — common ground (all 3 identified):**
- ETS table corruption/loss affecting gates
- BEAM scheduler starvation / GC pauses
- WebSocket message duplication/reordering
- Postgres connection pool exhaustion / deadlocks
- Clock skew / time drift
- Process registry inconsistency

**GPT-5 unique findings (not in either other model):**
- Broker rate limiting (429s) — not "connection lost" so existing logic
  doesn't trigger, but can't flatten during kill switch
- Broker auth failure / credential rotation — distinct from connection loss
- Corporate actions (splits, symbol changes) — position drift without
  triggering staleness detection
- Duplicate pipeline instances for same user (DynamicSupervisor race)
- DB "commit unknown outcome" causing restart loops (Ecto commit succeeds
  at Postgres but client times out → retry → unique constraint → crash loop)
- Cross-symbol strategies with partial staleness — multi-leg signals
  computed from mix of fresh and stale data
- Partial cancel_all during kill switch masked by process restarts

**GPT-4.1 unique findings (not in GPT-5 or Mini):**
- Zombie processes after halt (supervisor misconfiguration)
- Unsupervised Task crashes going unnoticed
- Audit log writes failing silently (not in same transaction as state change)
- ClOrdID unique constraint violation from race in sequence generation
- Broker API semantic changes (silent breaking changes)

**GPT-4.1 Mini unique findings:**
- Race between kill switch engagement and reconciliation completion
  (timing coordination gap) — this was more explicitly called out than
  in the other models, though GPT-5 touches it implicitly
- Strategy.Worker / Aggregator partial crash inconsistency

**Quality assessment:**
- **GPT-5** had the most *domain-relevant* and *actionable* gaps. Broker
  rate limiting, auth failures, corporate actions, and the DB commit
  unknown-outcome scenario are all realistic production issues specific
  to THIS system. The cross-symbol partial staleness finding shows
  deeper architectural reasoning about component interactions.
- **GPT-4.1** was thorough and well-structured but more generic/defensive.
  Many of its unique findings (zombie processes, unsupervised Tasks,
  audit log loss) are general Elixir concerns rather than specific to
  the document's architecture. Good for a completeness checklist.
- **GPT-4.1 Mini** was formulaic — each finding followed the same template
  and several were somewhat surface-level or restated things the document
  partially covers. Still found the most scenarios per dollar.

**Takeaway:** For gap-finding in architecture documents, GPT-5's reasoning
tokens pay off. It doesn't just list "things that could go wrong" — it
identifies *specific interactions* that the document's existing mechanisms
don't cover (e.g., rate limiting bypasses the "connection lost" detection,
corporate actions bypass staleness detection). GPT-4.1 is a solid
middle-ground: more thorough than Mini, less insightful than GPT-5.
Mini is fine for a quick sanity check but won't find the subtle gaps.

**Cost-effectiveness:** Mini found 10 scenarios in 16s for ~7K tokens.
GPT-5 found 14 scenarios (with 7 genuinely unique insights) in 45s for
~13.5K tokens (including 6.6K reasoning). For architecture review where
missing a gap could mean financial loss, the GPT-5 cost is justified.
For routine doc review, Mini + human judgment is probably sufficient.
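The cost-effectiveness paragraph can be restated as scenarios per 1K tokens and per second. The sketch below uses only the approximate figures quoted above (~7K and ~13.5K total tokens); it assumes nothing about per-token prices, which the finding does not report, so it measures token efficiency, not dollars.

```python
# Figures quoted in the cost-effectiveness paragraph; token totals are approximate.
runs = {
    "GPT-4.1 Mini": {"scenarios": 10, "seconds": 16, "total_tokens": 7_000},
    "GPT-5": {"scenarios": 14, "seconds": 45, "total_tokens": 13_500},
}

efficiency = {}
for model, r in runs.items():
    efficiency[model] = {
        # Scenarios found per 1,000 tokens consumed.
        "per_1k_tokens": r["scenarios"] / (r["total_tokens"] / 1000),
        # Scenarios found per second of wall-clock time.
        "per_second": r["scenarios"] / r["seconds"],
    }

for model, e in efficiency.items():
    print(f"{model}: {e['per_1k_tokens']:.2f} scenarios/1K tokens, "
          f"{e['per_second']:.2f} scenarios/s")
```

On raw counts Mini is more token-efficient, which is consistent with "most scenarios per dollar"; the argument for GPT-5 rests on the 7 unique findings, which a count-based ratio does not capture.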
@@ -0,0 +1,98 @@
# Finding 10: Hidden-assumption identification: GPT-5's reasoning produces qualitatively different (not just more) findings

**Date:** 2026-05-02
**Task:** Identify hidden assumptions in gargoyle's `cold-start-and-recovery.md` (234 lines)
that could break under real-world production conditions.
**How we used them:** Same document (full text) + same focused analytical question
to all 3 models via HAI proxy (OpenAI-compatible endpoint). No tools, no project
context beyond the document itself. Single prompt, no conversation history.
Temperature 0.3 for GPT-4.1/Mini; GPT-5 uses default (required).

| Model | Time | Output tokens | Reasoning tokens | Assumptions found |
|---|---|---|---|---|
| GPT-4.1 Mini | 25s | 3,090 | 0 | 12 |
| GPT-4.1 | 77s | 2,751 | 0 | 14 |
| GPT-5 | 78s | 2,649 | 4,096 | 26 |

**What they found — common ground (all 3 identified):**
- Broker API consistency/availability during reconciliation
- ETS table availability and fail-closed behavior
- Single-writer/mailbox ordering guarantees holding in practice
- User independence assumption vs shared resources (rate limits, DB)
- Reconciliation idempotency under repeated runs
- Corporate action data completeness/timeliness
- Escalation threshold calibration vs changing market conditions
- Strategy warmup with partial/missing historical data
- Signal expiry correctness on restart

**GPT-5 unique findings (not in either other model):**
- Unbounded mailbox growth during extended reconciliation (memory pressure
  from queued messages at market open)
- handle_continue side effects in OTHER processes (risk, metrics) acting
  concurrently via different paths
- Pre-existing GTC orders filling while gated (positions as moving target)
- Broker position semantics mismatch (trade-date vs settled-date)
- Strategy warmup evaluate() having non-signal side effects (metrics, caches)
- Historical bar / live tick boundary alignment (double-processing or gaps)
- ETS gate caching in process state creating fail-open windows
- Correlated retry stampede when many users restart together
- Corporate action double-application race with broker (missing idempotency
  keys per action/instrument/date)
- Kill switch state vs DB unavailability at startup
- Market data subscriptions as shared bottleneck across "independent" users
- Time-invariant signals incorrectly expired by aggregation window logic
- Broker fills vs positions endpoints internally inconsistent (different caches)
- Positions changing under reconciliation while kill switch is engaged
- Gate phase sequencing: :ready written before worker warmup completes
- Periodic reconciler allowing 1hr of divergent trading (rate-of-change blind)

**GPT-4.1 unique findings (not in GPT-5 or Mini):**
- No correlated failure handling (all failure modes treated as isolated) —
  only model to frame this as a meta-assumption about the failure table

**GPT-4.1 Mini unique findings:**
- None that weren't also covered by the other two models

**Quality assessment:**
- **GPT-5** didn't just find more assumptions — it found *qualitatively
  different kinds*. Many of its unique findings involve multi-component
  interactions (mailbox + reconciliation + market open timing), semantic
  mismatches (trade-date vs settled positions), and second-order effects
  (metrics side effects during warmup, GTC orders filling while gated).
  These require reasoning about system behavior across boundaries the
  document doesn't explicitly draw.
- **GPT-4.1** was competent and structured, found the same core assumptions
  as Mini, plus one good meta-observation about correlated failures. But
  it stayed within the document's own framing — it found assumptions the
  document *almost* states rather than ones the document can't see.
- **GPT-4.1 Mini** was formulaic. Every finding maps cleanly to a section
  of the document. It's essentially "what could go wrong with each stated
  mechanism" rather than "what does this design take for granted about
  the world outside itself."

**Key insight — reasoning tokens change the KIND of analysis:**
GPT-5's 4,096 reasoning tokens aren't producing "more of the same" —
they're producing a different analytical mode. The non-reasoning models
(4.1 and Mini) identify risks within the document's own frame of reference.
GPT-5 reasons about the document's relationship to the external world:
broker semantics, deployment topology, OTP runtime behavior under load,
timing correlations across independent subsystems. This is the difference
between "what could this mechanism fail at" and "what must be true about
the world for this mechanism to work."

**Comparison to Finding #9 (gap-finding on failure-modes.md):**
Same pattern confirmed. GPT-5 consistently finds domain-specific,
interaction-level issues that require reasoning about component boundaries.
GPT-4.1 is thorough within-frame. Mini is formulaic. The gap between
GPT-5 and the others is larger here than in #9 — possibly because
"hidden assumptions" requires more abstraction than "missing failure
scenarios." Assumption-finding requires the model to reason about what
ISN'T stated, which benefits more from extended reasoning.

**Practical implication:** For architecture review, running GPT-5 on
"identify hidden assumptions" is higher-value than the same question to
non-reasoning models. The cost difference (4K extra reasoning tokens) is
trivial for a document that will drive months of implementation. Use
non-reasoning models for within-frame checks ("does this section have
gaps") and reasoning models for cross-boundary analysis ("what must be
true about the world for this to work").
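The "common ground" and "unique findings" lists in these comparisons amount to a set partition over each model's findings. The sketch below shows that partition, assuming each finding has already been mapped by hand to a short canonical label, which is the hard, judgment-based step. `partition_findings` is a hypothetical helper, and the sample data is a small subset of the labels listed above.

```python
def partition_findings(by_model: dict[str, set[str]]) -> tuple[set[str], dict[str, set[str]]]:
    """Split findings into those every model reported and those only one model reported."""
    if not by_model:
        return set(), {}
    # Common ground: labels present in every model's set.
    common = set.intersection(*by_model.values())
    unique = {}
    for model, findings in by_model.items():
        # Everything any OTHER model reported.
        others = set().union(*(f for m, f in by_model.items() if m != model))
        unique[model] = findings - others
    return common, unique


# Hand-normalized labels; a small illustrative subset of Finding 10's lists.
shared_labels = {"reconciliation idempotency", "signal expiry on restart"}
by_model = {
    "GPT-4.1 Mini": set(shared_labels),
    "GPT-4.1": shared_labels | {"no correlated failure handling"},
    "GPT-5": shared_labels | {"unbounded mailbox growth", "correlated retry stampede"},
}

common, unique = partition_findings(by_model)
print("common:", sorted(common))
for model, labels in unique.items():
    print(model, "unique:", sorted(labels))
```

Findings reported by exactly two of the three models land in neither bucket, so they would need a third category if that overlap matters.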
@@ -0,0 +1,124 @@
# Finding 11: Hidden-assumption identification on simpler doc: reasoning models diverge in approach, both outperform non-reasoning

**Date:** 2026-05-02
**Task:** Identify hidden assumptions in gargoyle's `market-calendar.md` (238 lines)
— a simpler, single-component document vs the 234-line cold-start doc from Finding #10.
**How we used them:** Same document (full text) + same focused analytical question
to all 3 models via HAI proxy. No tools, no project context beyond the document
itself. Single prompt, no conversation history. Temperature 0.3 for GPT-4.1;
GPT-5 and Opus use their defaults (required). Same prompt across all three.

| Model | Time | Output tokens | Reasoning tokens | Assumptions found |
|---|---|---|---|---|
| GPT-4.1 | 19s | 2,554 | 0 | 14 |
| Claude Opus 4.6 | 74s | 3,288 | (internal, not reported) | 13 |
| GPT-5 | 101s | 8,417 | 5,504 | 24 |

**What they found — common ground (all 3 identified):**
- Alpaca calendar API data correctness/completeness as single source of truth
- Alpaca API availability at startup (no local cache persistence)
- ETS table atomicity during refresh (partial-state exposure risk)
- System clock/timezone alignment (dates are timezone-naive)
- NYSE emergency/unscheduled closures not reflected until refresh
- Two-year cache range sufficiency
- API response format stability
- Rate limiting / API capacity concerns

**GPT-5 unique findings (not in either other model):**
- Date struct term-ordering in ETS match specs may not match chronological
  order (ETS range guards rely on Erlang term comparison, not Date semantics)
- close_time/1 returns naive Time without timezone — DST conversion burden on
  consumers, one hour off twice per year
- trading_day?/1 conflates "not a trading day" with "calendar unavailable" —
  operational outages invisible to callers
- ETS table name collision risk (global namespace per node)
- No other process should modify the ETS table (access mode discipline)
- Network egress and credential availability on all nodes at all times
- ETS read/write concurrency flags for contention under load
- Direct ETS access by consumers bypassing the module's error handling
- next/prev_trading_day edge cases at cache boundaries
- Alpaca API start/end parameter inclusivity (off-by-one at year boundaries)
- Half-day vs full-day distinction insufficiency for special sessions
- Small table size makes O(n) selects acceptable (scaling concern)
- Year-end refresh failure leaving gaps at boundary
- Alpaca never omits a legitimate trading day (absence = non-trading conflation)

**Claude Opus unique findings (not in either other model):**
- ETS ownership semantics: heir-protection would change fail-closed behavior;
  current design means ALL consumers fail simultaneously during crash-to-restart
  window (framed as a design tension, not just a risk)
- Silent data corruption from partial API response (pagination/truncation) —
  specifically that missing rows are SILENT failures with no error propagation
  (other models mentioned API completeness but not the silence aspect)
- Consumers calling functions with Dates, not DateTimes — the API accepts Date.t()
  but doesn't specify HOW consumers should derive "today" (system-wide
  coordination problem made invisible by the API contract)
- `trading_day?/1` returning false is NOT fail-closed for ALL consumers — only
  for PDT-like "block action" consumers; for batch-trigger consumers it's
  fail-OPEN (subtle inversion of safety semantics)
- Startup ordering: background_children placement means PDT could receive orders
  before MarketCalendar finishes init, creating recurring rejection windows
  during hot deploys
- Continuous-running assumption for refresh timer (daily restarts would mean
  refresh mechanism never fires — no staleness alert exists)

**GPT-4.1 unique findings (not in either other model):**
- No need for real-time calendar change notification (event emission gap)
- All consumers using the same module instance (configuration consistency)
- No need for historical calendar data (audit/backtesting limitation)
- Consumers correctly handling {:error, :calendar_unavailable} in practice

**Quality assessment:**
- **GPT-5** found the most assumptions (24) with the most technical specificity.
  Many are implementation-level insights (ETS term ordering, named table
  collisions, read_concurrency flags) that demonstrate deep Erlang/OTP
  knowledge. Some are slightly obvious or overlapping. The ETS term-ordering
  finding is genuinely insightful: Elixir `Date` structs do NOT compare
  chronologically in Erlang term order (map fields compare in key order, so
  `day` is checked before `month` and `year`), which makes the concern valid if
  dates are stored as structs. Also provided concrete recommendations.
- **Claude Opus** found fewer assumptions (13) but several were qualitatively
  different — they identified *design tensions* and *semantic inversions*
  rather than just failure scenarios. The fail-open/fail-closed inversion
  (its finding #12), the ETS ownership tension, and the "API makes timezone
  coordination invisible" findings show reasoning about the design's
  *relationship to its consumers* rather than just its internal mechanics.
  Tighter, more curated output with less filler.
- **GPT-4.1** was competent and well-structured (14 assumptions, clean table)
  but stayed within the document's own framing. Its unique findings are
  relatively generic ("consumers should handle errors correctly," "no
  historical data"). Solid baseline, no surprises.

**Key insight — two reasoning models, different analytical styles:**
GPT-5 and Opus are both reasoning models, but they reason about different
things. GPT-5 reasons DEEPER into implementation mechanics (how does ETS
actually work? what are the exact failure modes of each component?). Opus
reasons WIDER about system context (how does this component's API contract
affect the safety properties of the overall system? what tensions does this
design create that aren't visible to the author?).

GPT-5's approach: "Here are 24 things that could go wrong, many highly
technical." Opus's approach: "Here are 13 assumptions, several of which
reveal design tensions the document can't see about itself."

**Does the reasoning gap narrow with simpler docs?**
Comparing to Finding #10 (cold-start doc, 234 lines, 26 vs 14 vs 12 assumptions
for GPT-5/GPT-4.1/Mini):
- GPT-5 still dominates in raw count (24 vs 14 for GPT-4.1)
- The gap ratio is similar (~1.7x here vs ~1.9x in Finding #10)
- Document complexity doesn't appear to be the driver of the gap —
  reasoning tokens enable more exhaustive exploration regardless of
  input complexity

**Claude Opus vs GPT-5 (the headline comparison):**
They're not competing on the same axis. GPT-5 is better for "find all
possible issues" (breadth + technical depth). Opus is better for "find
the assumptions that will actually surprise the author" (insight density).
If you want a security-audit-style exhaustive list: GPT-5. If you want a
design-review-style "here's what you're not seeing about your own design":
Opus. Both are better than GPT-4.1 for this task, but in different ways.

**Practical implication:** Run BOTH reasoning models on architecture docs.
GPT-5 catches implementation-level hazards the team might miss during
coding. Opus catches design-level tensions the team might miss during
planning. GPT-4.1 is sufficient as a quick sanity check but won't
surprise you.
@@ -0,0 +1,125 @@
|
||||
# Finding 12: Sonnet 4.6 outperforms expectations on assumption-finding; competes with reasoning models on complex docs
|
||||
|
||||
**Date:** 2026-05-02
|
||||
**Task:** Identify hidden assumptions in gargoyle's `order-execution.md` (785 lines)
|
||||
— a complex, multi-component document covering OrderManager, BrokerAdapter,
|
||||
TradeStream, and PositionReconciler.
|
||||
**How we used them:** Same document (full text, no truncation) + same focused
|
||||
analytical question to all 3 models. GPT-5 via HAI OpenAI endpoint; Opus 4.6
|
||||
and Sonnet 4.6 via HAI Anthropic endpoint. No tools, no project context beyond
|
||||
the document itself. Single prompt, no conversation history.
|
||||
|
||||
| Model | Time | Output tokens | Reasoning tokens | Assumptions found |
|---|---|---|---|---|
| GPT-5 | 93s | 8,485 | 6,016 | 20 |
| Claude Sonnet 4.6 | 106s | 4,637 | (internal) | 17 |
| Claude Opus 4.6 | 105s | 4,615 | (internal) | 12 |
|
||||
|
||||
**What they found — common ground (all 3 identified):**
|
||||
- Synchronous broker REST calls blocking OrderManager GenServer (mailbox growth)
|
||||
- TradeStream event ordering assumptions (out-of-order fills/status)
|
||||
- Fill deduplication gap (no explicit fill-level idempotency)
|
||||
- `cancel_all/1` with `timeout: :infinity` blocking GenServer during FLATTEN
|
||||
- Recovery/restart races with TradeStream fill delivery (fills queued during
|
||||
`handle_continue/2`)
|
||||
- Lot operation idempotency under crash recovery (partial execution)
|
||||
- Replace race: fills for new broker_order_id arriving before `replaced` event
|
||||
- Database write latency impact on GenServer throughput under burst fills
|
||||
- ETS table scope assumptions (single-node, access mode)
|
||||
|
||||
**GPT-5 unique findings (not in either Claude model):**
|
||||
- Rate-limit retry blocking OrderManager inline (no async retry path specified)
|
||||
- Single TradeStream connection per user not enforced (duplicate detection gap)
|
||||
- Kill switch FLATTEN vs degraded state interaction (OM drops cancels while
|
||||
degraded, but FLATTEN calls cancel_all through OM)
|
||||
- ClOrdID uniqueness scope/retention at broker across sessions and days
|
||||
- `after: datetime` filter semantics (clock skew, timezone, inclusive/exclusive)
|
||||
- Reconciliation responses may exceed single-response size (no pagination)
|
||||
- Event broadcasting blocking model (synchronous vs fire-and-forget)
|
||||
- Credential rotation during TradeStream connection lifetime
|
||||
- `market_closed` semantics varying across brokers (reject vs queue)
|
||||
- Dropped Alpaca statuses (stopped/suspended/calculated) may affect accounting
|
||||
|
||||
**Claude Sonnet 4.6 unique findings (not in either other model):**
|
||||
- Single fill per fill event assumption (broker batching multiple fills into
|
||||
one WebSocket message)
|
||||
- Lot operations (`Lots.open/2`, `Lots.close/4`) assumed to never fail —
|
||||
no `{:error, _}` handling shown, crash propagation risk
|
||||
- `Task.async_stream` inside GenServer creating linked tasks whose crash
|
||||
signals propagate to OrderManager during critical cancel_all
|
||||
- Broker cancel semantics during in-flight replace at the broker level
|
||||
(cancel targets old broker_order_id which broker already replaced away)
|
||||
- Database operations in fill processing assumed transactional (no explicit
|
||||
Ecto.Multi/transaction mention)
|
||||
- Broker position reflects only Gargoyle's activity (external trades cause
|
||||
false-positive reconciliation halts)
|
||||
|
||||
**Claude Opus 4.6 unique findings (not in either other model):**
|
||||
- `{:ok, broker_order_id}` from REST place conflated with durable OMS
|
||||
acceptance vs mere HTTP acknowledgment (no timeout on `submitted` state)
|
||||
- Concurrent `apply_corrections/2` from periodic reconciler running in
|
||||
separate process conflicts with OrderManager's single-writer invariant
|
||||
(corrections write to same tables outside GenServer serialization)
|
||||
- Reconciliation gate initialized state after `:rest_for_one` restart —
|
||||
ETS table EXISTS but freshly initialized vs table MISSING are different
|
||||
conditions with different safety properties
|
||||
- Escalation state reset after crash creating double-exposure window
|
||||
(systematic issue persists but escalation timer resets to zero)
|
||||
- `replace/3` error semantics: non-atomic replace (cancel + re-submit)
|
||||
where cancel succeeds but re-submit fails leaves original order cancelled
|
||||
at broker while OrderManager reverts to "working" locally
|
||||
|
||||
**Quality assessment:**
|
||||
- **GPT-5** maintained its pattern from previous findings: broadest coverage
|
||||
(20 assumptions), most technically specific about implementation details.
|
||||
Found cross-cutting operational concerns (clock skew, credential rotation,
|
||||
pagination) that the Claude models didn't surface. However, several of its
|
||||
findings were medium-severity operational concerns rather than architectural
|
||||
assumptions.
|
||||
- **Claude Sonnet 4.6** was the surprise performer. Found 17 assumptions —
|
||||
close to GPT-5's count (85%) — and several of its unique findings were
|
||||
genuinely insightful. The `cancel_all` race with broker-side replace state
(#16 in Sonnet's output) and the lot operation failure propagation (#6 in Sonnet's output) show
|
||||
deep reasoning about component interaction despite Sonnet not being
|
||||
positioned as a "reasoning" model. More importantly, Sonnet's findings were
|
||||
consistently well-structured with clear "how it could break" scenarios.
|
||||
- **Claude Opus 4.6** found the fewest assumptions (12) but — consistent with
|
||||
Finding #11 — its unique findings were qualitatively different. The
|
||||
concurrent `apply_corrections` write conflict, the gate initialization state
|
||||
distinction, and the non-atomic replace error semantics all reveal design
|
||||
tensions that neither GPT-5 nor Sonnet identified. Opus continues to reason
|
||||
about the *boundaries between components* rather than within-component
|
||||
mechanics.
|
||||
|
||||
**Key insight — Sonnet 4.6 is NOT just a faster GPT-4.1:**
|
||||
In previous findings (#9, #10, #11), non-reasoning models (GPT-4.1, GPT-4.1
|
||||
Mini) performed significantly below reasoning models on assumption-finding.
|
||||
GPT-4.1 found ~14 assumptions where GPT-5 found 24-26. Here, Sonnet 4.6
finds 17 where GPT-5 finds 20, a much smaller gap (~85% of GPT-5's count vs ~54-58% for GPT-4.1 previously).
|
||||
|
||||
Sonnet's findings also included several that showed genuine reasoning about
|
||||
component interactions (not just within-frame risks). This suggests Sonnet 4.6
|
||||
is qualitatively different from GPT-4.1 for analytical work — it occupies a
|
||||
middle ground between GPT-4.1's "competent but surface-level" and GPT-5's
|
||||
"exhaustive and deep." The severity distribution was also similar to GPT-5
|
||||
(multiple critical/high findings), whereas GPT-4.1 in previous experiments
|
||||
tended toward medium-severity generic concerns.
|
||||
|
||||
**Updated model hierarchy for assumption-finding:**
|
||||
1. GPT-5 — broadest coverage, most operational-level findings (20)
|
||||
2. Sonnet 4.6 — strong analytical depth, good component interaction reasoning (17)
|
||||
3. Opus 4.6 — fewest but most architecturally insightful, finds design tensions (12)
|
||||
4. GPT-4.1 — competent within-frame, generic (~14 from previous experiments)
|
||||
5. GPT-4.1 Mini — formulaic, surface-level (~10-12)
|
||||
|
||||
**Practical implication:** For architecture review, Sonnet 4.6 is now a strong
|
||||
candidate for volume analytical work. It's fast enough to run alongside GPT-5
|
||||
and catches different things (lot operation failures, broker-side replace races).
|
||||
The ideal three-model review stack for architecture docs appears to be:
|
||||
- GPT-5 for breadth + operational concerns
|
||||
- Sonnet 4.6 for component interaction analysis
|
||||
- Opus 4.6 for design-tension identification
|
||||
|
||||
Each consistently finds things the others miss. The cost-efficiency argument
for Sonnet is strong: ~85% of GPT-5's count with more findings per output token,
roughly 273 tokens per assumption versus roughly 424 for GPT-5
(4,637 vs 8,485 tokens for 17 vs 20 assumptions).
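
The per-token comparison can be checked with a few lines of arithmetic. The sketch below uses only the figures from the results table in this finding; the names `RESULTS`, `tokens_per_finding`, and `relative_count` are ours, introduced for illustration. It compares output tokens as reported, since reasoning tokens for the Claude models are not broken out.

```python
# Figures copied from the Finding 12 results table.
RESULTS = {
    "GPT-5": {"output_tokens": 8485, "findings": 20},
    "Claude Sonnet 4.6": {"output_tokens": 4637, "findings": 17},
    "Claude Opus 4.6": {"output_tokens": 4615, "findings": 12},
}


def tokens_per_finding(output_tokens: int, findings: int) -> float:
    """Output tokens spent per reported assumption (lower means denser output)."""
    return output_tokens / findings


def relative_count(model: str, baseline: str = "GPT-5") -> float:
    """A model's finding count as a fraction of the baseline model's count."""
    return RESULTS[model]["findings"] / RESULTS[baseline]["findings"]


if __name__ == "__main__":
    for name, row in RESULTS.items():
        density = tokens_per_finding(row["output_tokens"], row["findings"])
        print(f"{name}: {density:.0f} tokens/finding, "
              f"{relative_count(name):.0%} of GPT-5's count")
```

Token density says nothing about the quality of individual findings, so it is best read alongside the quality assessment rather than as a ranking on its own.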
|
||||
@@ -0,0 +1,46 @@
|
||||
# Finding 7: Token budget matters more than model size for gap analysis (confirmed)
|
||||
|
||||
**Date:** 2026-05-03
|
||||
**Task:** Identify unaddressed failure scenarios in gargoyle's `failure-modes.md` (383 lines, ~25KB)
|
||||
**How we used them:** Same document, same analytical question ("What failure scenarios
|
||||
are NOT covered?"), three models. GPT-5 with 16K max_completion_tokens, Sonnet 4
|
||||
with 4K max_tokens, GPT-4.1 Mini with 4K max_completion_tokens. No project context
|
||||
beyond the document itself. Pure gap-analysis task.
|
||||
|
||||
**Results:**
|
||||
- GPT-5 (16K budget): 28 gaps, most exhaustive. Found domain-specific edge cases
|
||||
others missed entirely: ClOrdID collision across restarts, fractional share rounding,
|
||||
broker maintenance windows (410/426), hot code upgrades, regulatory halts vs staleness
|
||||
distinction, cancel-ack-then-late-fill race, DNS/TLS as distinct from network outage.
|
||||
- Sonnet 4 (4K budget): 12 gaps, concise. Unique framing: distinguished latency
|
||||
degradation from outage (subtle but actionable). ETS corruption vs loss.
|
||||
- GPT-4.1 Mini (4K budget): 13 gaps with summary table. Unique angles: unknown broker
|
||||
status enum values, configuration schema mismatches on cold-start, malformed signals
|
||||
from logic bugs (not just crashes).
|
||||
|
||||
**Overlap (all three):** Rate limiting, clock skew, resource exhaustion, DB failures,
|
||||
message backpressure, partial connectivity.
|
||||
|
||||
**Key insight:** GPT-5's 4K attempt produced ZERO output (finish_reason: length) —
|
||||
all tokens consumed by internal reasoning. At 16K it produced the richest analysis.
|
||||
This confirms finding #3 (GPT-5 needs generous token budgets) AND adds a new
|
||||
observation: for open-ended analytical questions, GPT-5's reasoning overhead takes
a proportionally larger share of the token budget. Sonnet and Mini both produced useful
output at 4K because they don't spend completion tokens on chain-of-thought.
|
||||
|
||||
**Model personality confirmed:**
|
||||
- GPT-5: exhaustive, domain-aware, finds edge cases a senior SRE would know
|
||||
- Sonnet: precise, architectural, finds design-level distinctions
|
||||
- GPT-4.1 Mini: structured, systematic, finds enumeration gaps
|
||||
|
||||
**Practical implication:** For failure mode / gap analysis on design docs:
|
||||
- GPT-5 with ≥16K tokens for maximum coverage (most unique findings)
|
||||
- Sonnet for architectural framing ("this is really two different problems")
|
||||
- Mini for completeness checking ("what about this enum value?")
|
||||
- Running all three costs ~$0.50 and catches gaps none alone would find
|
||||
- GPT-5 at 4K is USELESS for this task — always give it room to think
|
||||
|
||||
**Note on GPT-5 reasoning overhead:** First attempt at 4K max_completion_tokens
|
||||
returned empty content with finish_reason: length. The model spent all 4K tokens
|
||||
on internal reasoning and produced nothing. This is worse than a short answer —
|
||||
it's zero value for non-zero cost. Always budget ≥16K for GPT-5 analytical tasks.
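
One way to guard against this failure is to detect the empty-response case and retry with a larger budget. The sketch below is a minimal illustration, not the harness used in these experiments: the response shape (`content`, `finish_reason`), the `call_model` callable, and the `fake_reasoning_model` stand-in with its arbitrary 6,000-token threshold are all hypothetical and would need to be adapted to whatever client is actually in use.

```python
from typing import Callable, Dict, Optional, Sequence

# Hypothetical response shape: {"content": str, "finish_reason": str}.
ModelCall = Callable[[int], Dict[str, str]]


def is_budget_starved(response: Dict[str, str]) -> bool:
    """True when the model hit the token limit without emitting any visible text."""
    hit_limit = response.get("finish_reason") == "length"
    empty = not response.get("content", "").strip()
    return hit_limit and empty


def call_with_escalating_budget(
    call_model: ModelCall,
    budgets: Sequence[int] = (4096, 16384),
) -> Optional[Dict[str, str]]:
    """Try each budget in order and return the first response that has text.

    A truncated but non-empty response is returned as-is; only the
    zero-output case triggers a retry.
    """
    for budget in budgets:
        response = call_model(budget)
        if not is_budget_starved(response):
            return response
    return None  # every budget was consumed before any text was produced


def fake_reasoning_model(max_tokens: int) -> Dict[str, str]:
    """Stand-in for a real client; the 6000-token threshold is arbitrary."""
    if max_tokens < 6000:
        return {"content": "", "finish_reason": "length"}
    return {"content": "analysis text", "finish_reason": "stop"}
```

Starting at the larger budget avoids paying for the failed first call, which matches the recommendation above; the retry loop is mainly useful when the right budget for a task is not yet known.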
|
||||
@@ -0,0 +1,126 @@
|
||||
# Finding 13: Race condition identification: Opus excels at subtle temporal interactions; GPT-5 is exhaustive; Sonnet struggles with concurrency reasoning
|
||||
|
||||
**Date:** 2026-05-03
|
||||
**Task:** Identify race conditions, timing-dependent bugs, and ordering hazards in
|
||||
gargoyle's `concurrent-failure-detection.md` (241 lines) — a document specifically
|
||||
about concurrent detection logic with timers, ETS state, and multi-process events.
|
||||
**How we used them:** Same document (full text) + same focused analytical question
|
||||
to all 3 models via HAI proxy. Prompt specifically asked for event ordering problems,
|
||||
timer interaction bugs, state visibility gaps, crash/restart timing, and multi-instance
|
||||
coordination. Required each finding to reference specific mechanisms in the document
|
||||
with specific interleaving descriptions. No tools, no project context beyond the
|
||||
document itself.
|
||||
|
||||
| Model | Time | Output tokens | Reasoning tokens | Race conditions found |
|---|---|---|---|---|
| GPT-5 | 116s | 10,587 | 8,192 | 12 |
| Claude Opus 4.6 | ~105s | 4,610 | (internal) | 10 |
| Claude Sonnet 4.6 | ~39s | 1,404 | (internal) | 7 |
|
||||
|
||||
**What they found — common ground (all 3 identified):**
|
||||
- Stale timer messages in mailbox after cancellation (classic Erlang timer race)
|
||||
- HealthMonitor crash losing compound detection state (init from :unknown, no replay)
|
||||
- ETS vs GenServer state divergence visible to dashboard
|
||||
- Kill switch mode conflict (FLATTEN from single-process vs HALT from compound path)
|
||||
|
||||
**GPT-5 unique findings (not in either Claude model):**
|
||||
- Cross-sender message ordering: recovery events from pipeline processes vs timer
|
||||
expiry from runtime (Erlang preserves per-sender order, NOT cross-sender) — the
|
||||
"rapid recovery" safety argument in the doc relies on state being updated before
|
||||
timer fires, which isn't guaranteed
|
||||
- Debounce starvation: a flapping component repeatedly restarts the timer, causing
compound evaluation to be postponed indefinitely while ≥2 components are genuinely degraded
|
||||
- State regression: {:degraded} arriving after {:escalated, :kill_switch} with no
|
||||
guard in the event table — state machine allows regressing from :halted to :degraded
|
||||
- Cold-start window: application boots with existing degraded processes that won't
|
||||
re-emit events, compound detection never fires
|
||||
- Catch-all handle_info could accidentally swallow timer messages if pattern matching
|
||||
is ordered wrong (implementation pitfall of the described approach)
|
||||
- Debounce window growing beyond calibrated bounds from repeated timer restarts
|
||||
|
||||
**Claude Opus unique findings (not in either other model):**
|
||||
- Timer restart pushing evaluation PAST single-process escalation timeout — the
|
||||
debounce mechanism can DEFEAT compound detection when second degradation arrives
|
||||
near end of first window (resets to full window, first process escalates via
|
||||
single-process path before new window fires). This means the system gets FLATTEN
instead of HALT, exactly what compound detection was supposed to prevent.
|
||||
- Strategy worker single-atom masking via event ordering: Worker A degrades, Worker
|
||||
B degrades (same atom), Worker A recovers → atom set to :normal while B is still
|
||||
degraded. Event ordering across different workers mapped to same atom creates
|
||||
state loss.
|
||||
- Registry stale PID after HealthMonitor crash: if subscription is PID-based (not
|
||||
PubSub topic), new HM instance is deaf — events go to dead PID, silently dropped.
|
||||
Compound detection completely disabled for that user until subscription refresh.
|
||||
- :rest_for_one cascade + coincidental independent issue: debounce designed to
|
||||
filter cascade restarts, but cascade can EXPOSE a pre-existing issue (SignalRisk
|
||||
restarts and finds its upstream stale). Is this "compound" or "cascade + bad luck"?
|
||||
Semantic ambiguity the design doesn't address.
|
||||
- Compound cleared event without recovery debounce: :compound_degradation_cleared
|
||||
emitted immediately when last process recovers (no settling period), causing
|
||||
operator oscillation if recovery is transient.
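
The single-atom masking finding in the list above can be reproduced with a small simulation. This is a sketch in Python rather than the project's Elixir, and it assumes the simplest possible reading of the design: one shared status value that each event overwrites. The per-worker set shown for contrast is a hypothetical alternative of ours, not something the reviewed document proposes.

```python
from typing import List, Set, Tuple

Event = Tuple[str, str]  # (worker_id, "degraded" | "recovered")


def single_atom_state(events: List[Event]) -> str:
    """Track all strategy workers under one shared status value."""
    state = "normal"
    for _worker, kind in events:
        # The worker identity is discarded, so the last event always wins.
        state = "degraded" if kind == "degraded" else "normal"
    return state


def per_worker_state(events: List[Event]) -> str:
    """Track the set of degraded workers; normal only when the set is empty."""
    degraded: Set[str] = set()
    for worker, kind in events:
        if kind == "degraded":
            degraded.add(worker)
        else:
            degraded.discard(worker)
    return "degraded" if degraded else "normal"


# Worker A degrades, Worker B degrades, then Worker A recovers.
EVENTS: List[Event] = [
    ("worker_a", "degraded"),
    ("worker_b", "degraded"),
    ("worker_a", "recovered"),
]

if __name__ == "__main__":
    print("single atom:", single_atom_state(EVENTS))
    print("per worker: ", per_worker_state(EVENTS))
```

With the shared value, Worker A's recovery overwrites the state while Worker B is still degraded, which is the state loss the finding describes.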
|
||||
|
||||
**Claude Sonnet unique findings:**
|
||||
- ETS table creation race at startup (HealthMonitor writes before table exists)
|
||||
- Registry lookup failure during pipeline startup (events before HM registered)
|
||||
- However, Sonnet also made analytical errors: it described "multiple HealthMonitor
|
||||
instances for the same user" scenarios despite the document clearly stating one
|
||||
instance per user via DynamicSupervisor. Several of its findings assumed
|
||||
multi-instance coordination that doesn't match the architecture.
|
||||
|
||||
**Quality assessment:**
|
||||
- **GPT-5** was the most exhaustive and technically precise. Its cross-sender
|
||||
ordering finding (#2) is genuinely insightful — it identifies that the document's
|
||||
"rapid recovery" safety argument implicitly assumes events arrive in wall-clock
|
||||
order, which Erlang does NOT guarantee across different senders. The debounce
|
||||
starvation finding (#3) identifies a real operational hazard with practical
|
||||
consequences. All 12 findings reference specific mechanisms and describe specific
|
||||
interleavings clearly.
|
||||
- **Claude Opus** found fewer race conditions but several were qualitatively
|
||||
superior. The timer-restart-defeats-compound-detection finding is the most
|
||||
architecturally significant race in the entire analysis — it shows that the
|
||||
debounce mechanism can work AGAINST the design's stated goals in specific
|
||||
(realistic) timing scenarios. The strategy-worker event ordering masking is
|
||||
also a genuine design flaw unique to the single-atom decision. Opus continues
|
||||
its pattern of reasoning about design TENSIONS rather than just failure modes.
|
||||
- **Claude Sonnet** was notably weaker here than in previous experiments. Only
|
||||
1,404 output tokens vs 4,610 (Opus) and 10,587 (GPT-5). Several findings
|
||||
contained analytical errors (assuming multi-instance coordination that doesn't
|
||||
exist). It found only 7 races, and 2-3 of those were based on misreadings of
|
||||
the architecture. This is a significant regression from Finding #12 where
|
||||
Sonnet found 17 assumptions (85% of GPT-5's count).
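
The timer-restart race credited to Opus above comes down to comparing two deadlines. The sketch below is a simplified timing model under stated assumptions: the first degradation occurs at t=0, the debounce timer restarts in full on the second degradation, and the first process escalates on its own after a fixed timeout. The 10-second window and 15-second timeout in the example are hypothetical values chosen for illustration, not figures from the reviewed document.

```python
def first_action(
    second_degradation_at: float,
    debounce_window: float,
    escalation_timeout: float,
) -> str:
    """Return which path fires first, given the first degradation at t=0.

    Ties are resolved in favour of the compound path; that choice is
    arbitrary and only affects the exact boundary case.
    """
    # The restarted debounce timer fires a full window after the second event.
    compound_eval_at = second_degradation_at + debounce_window
    # The first degraded process escalates independently of the debounce.
    single_escalation_at = escalation_timeout
    if compound_eval_at <= single_escalation_at:
        return "HALT (compound path)"
    return "FLATTEN (single-process path)"


if __name__ == "__main__":
    # Second degradation early in the window: compound evaluation wins.
    print(first_action(2.0, debounce_window=10.0, escalation_timeout=15.0))
    # Second degradation near the end of the window: single-process path wins.
    print(first_action(9.0, debounce_window=10.0, escalation_timeout=15.0))
```

Under this model the race is only possible when the escalation timeout is shorter than twice the debounce window, which suggests one way to check a concrete configuration for exposure.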
|
||||
|
||||
**Key insight — concurrency reasoning is a different skill than assumption-finding:**
|
||||
In Finding #12, Sonnet 4.6 performed well on
|
||||
assumption-finding (a task that requires reasoning about what's NOT stated).
|
||||
Here, on race condition identification (a task requiring reasoning about temporal
|
||||
interleavings and message ordering semantics), Sonnet drops significantly. This
|
||||
suggests the task type matters more than we previously thought:
|
||||
|
||||
- **Assumption-finding:** Requires breadth of consideration ("what must be true
|
||||
for this to work?"). Sonnet handles this well — it's essentially pattern
|
||||
matching across possible failure dimensions.
|
||||
- **Race condition identification:** Requires SEQUENTIAL reasoning about specific
|
||||
interleavings ("if A happens, then B happens, then C happens, what state is
|
||||
visible?"). This benefits dramatically from extended reasoning tokens (GPT-5's
|
||||
8,192 reasoning tokens) or from Opus's internal reasoning depth.
|
||||
|
||||
The lesson: don't extrapolate model performance across task types. A model that's
|
||||
85% as good at assumption-finding may be 50% as good at concurrency analysis.
|
||||
The cognitive demands are different.
|
||||
|
||||
**Opus's distinguishing strength — finding design contradictions:**
|
||||
Opus's best finding (timer restart defeating compound detection) isn't just a
|
||||
race condition — it's identifying that the debounce mechanism can work against
|
||||
the design's own stated goals. This is consistent with Opus's pattern in
|
||||
previous findings: it finds tensions where one part of the design undermines
|
||||
another part. For race condition analysis specifically, this manifests as
|
||||
"here's where your safety mechanism becomes your vulnerability."
|
||||
|
||||
**Practical implication for architecture review:**
|
||||
- For race condition analysis: use GPT-5 (exhaustive) + Opus (design-tension)
|
||||
- Sonnet is NOT suitable for concurrency reasoning tasks — use it for
|
||||
assumption-finding and structural review instead
|
||||
- The three-model stack needs task-appropriate assignment:
|
||||
- Structural/assumption review: all three models contribute
|
||||
- Concurrency/race analysis: GPT-5 + Opus only
|
||||
- Bias detection: any model (per Finding #8)
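
The task-appropriate assignment above could be encoded as a small routing table. This is a sketch only: the task keys and the `models_for` helper are hypothetical names, and "any model" for bias detection is represented here by listing all three models.

```python
from typing import Dict, List

# Assignment taken from the bullet list above; task keys are hypothetical labels.
TASK_MODELS: Dict[str, List[str]] = {
    "structural_assumption_review": ["GPT-5", "Opus 4.6", "Sonnet 4.6"],
    "concurrency_race_analysis": ["GPT-5", "Opus 4.6"],
    # "Any model" per Finding #8; all three are listed as a stand-in.
    "bias_detection": ["GPT-5", "Opus 4.6", "Sonnet 4.6"],
}


def models_for(task: str) -> List[str]:
    """Return the models to run for a task type; unknown tasks raise KeyError."""
    return list(TASK_MODELS[task])


if __name__ == "__main__":
    for task in TASK_MODELS:
        print(f"{task}: {', '.join(models_for(task))}")
```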
|
||||
@@ -0,0 +1,131 @@
|
||||
# Finding 14: Cross-component interaction analysis: GPT-5 Mini enters as viable mid-tier; task framing strongly shapes output quality
|
||||
|
||||
**Date:** 2026-05-03
|
||||
**Task:** Identify cross-component interaction failures in gargoyle's
|
||||
`continuous-risk-monitoring.md` (459 lines) — a document specifying
|
||||
PortfolioMonitor's interactions with OrderManager, PortfolioRisk, MarketData,
|
||||
KillSwitch, ETS tables, and the pipeline supervision tree.
|
||||
**How we used them:** Same document (full text) + same focused analytical
|
||||
question to all 3 models via HAI proxy. Prompt was highly structured: specified
|
||||
5 categories of cross-component failures to look for (semantic mismatches,
|
||||
ordering violations, feedback loops, partial visibility, supervision boundary
|
||||
effects) and required specific output format (components, sequence, gap, impact).
|
||||
No tools, no project context beyond the document itself.
|
||||
|
||||
| Model | Time | Output tokens | Reasoning tokens | Findings |
|---|---|---|---|---|
| GPT-5 Mini | 68s | 5,445 | 2,240 | 6 (+1 cut off) |
| GPT-5 | 116s | 10,604 | 8,128 | 10 |
| Claude Sonnet 4.6 | 38s | 1,868 | (internal) | 8 |
|
||||
|
||||
**What they found — common ground (all 3 identified):**
|
||||
- Fill-to-position query race (fill event triggers evaluation but position
|
||||
store hasn't yet reflected the fill)
|
||||
- Restrict flag ETS table destruction on PM crash → permissive window
|
||||
- Kill switch check vs liquidation submission race
|
||||
- Ticker subscription timing gap (new position opened but ticks not yet
|
||||
subscribed → breach goes undetected)
|
||||
|
||||
**GPT-5 unique findings (not in either other model):**
|
||||
- Stale prices are NOT fail-safe for drawdown (higher stale price → inflated
|
||||
portfolio value → understated drawdown). The document claims "fail-safe"
|
||||
but this only holds for exposure metrics, not drawdown. This is the most
|
||||
architecturally significant finding across all three models.
|
||||
- Price definition mismatch between PM (last_trade from ETS) and OrderManager/
|
||||
broker (bid/ask/mid) causing mis-sized liquidation and oscillation
|
||||
- Cross-component oscillation: PM's hysteresis is internal, while PRisk's binary
restrict gate clears immediately (no cross-component cooldown)
|
||||
- Liquidation stuck after OM restart (terminal events lost; liquidation_in_flight
stays true indefinitely with no timeout/rehydration)
|
||||
- "Minimal risk checks" not enforced — PM goes through same OM gates as
|
||||
strategy orders but MarketHours/StalePrice controls may reject after-hours
|
||||
or stale-price liquidation attempts
|
||||
- FLATTEN mode semantics gap — PM refrains from liquidating when kill switch
|
||||
engaged, but FLATTEN cancels open orders without actually CLOSING positions.
|
||||
No component left to close positions.
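
The stale-price finding at the top of this list is easy to see with numbers. The sketch below assumes drawdown is measured as the fractional decline of portfolio value from its high water mark; the reviewed document's exact formula is not quoted here, so treat the definition, the function names, and the 100-share position as illustrative assumptions.

```python
def portfolio_value(shares: float, price: float, cash: float = 0.0) -> float:
    """Mark-to-market value of a single long position plus cash."""
    return shares * price + cash


def drawdown(high_water_mark: float, value: float) -> float:
    """Fractional decline from the high water mark (0.0 means no drawdown)."""
    return max(0.0, (high_water_mark - value) / high_water_mark)


SHARES = 100
HWM = portfolio_value(SHARES, 100.0)  # peak value, reached at a price of 100

# The feed goes stale at 100 while the true price falls to 90.
stale_drawdown = drawdown(HWM, portfolio_value(SHARES, 100.0))
true_drawdown = drawdown(HWM, portfolio_value(SHARES, 90.0))

if __name__ == "__main__":
    print(f"drawdown seen with stale price: {stale_drawdown:.1%}")
    print(f"actual drawdown:                {true_drawdown:.1%}")
```

The same stale price overstates the position's value, so an exposure limit would trip early while a drawdown limit would trip late or not at all; that asymmetry is the point of the finding.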
|
||||
|
||||
**Claude Sonnet 4.6 unique findings (not in either other model):**
|
||||
- Liquidation feedback loop with PortfolioRisk — buy-to-cover for short
|
||||
positions could INCREASE net long exposure at portfolio level, paradoxically
|
||||
worsening concentration while fixing position-level metrics
|
||||
- High water mark reset on pipeline restart masks true intraday drawdown
|
||||
(restart → HWM resets to lower current value → drawdown calculated from
|
||||
false baseline → larger losses permitted than intended)
|
||||
- Multi-metric breach with single boolean flag — concentration liquidation
|
||||
for AAPL sets liquidation_in_flight, blocking simultaneous daily P&L
|
||||
liquidation for different positions
|
||||
- Market close/open vs after-hours fills: the document claims to evaluate
after-hours fills but uses stale market-close prices
|
||||
|
||||
**GPT-5 Mini unique findings (not in either other model):**
|
||||
- OrderManager order splitting/remapping causing liquidation_in_flight
|
||||
correlation failure (parent/child order ID mapping breaks terminal-event
|
||||
detection). Well-reasoned but highly implementation-specific.
|
||||
- Restrict/clear oscillation loop with strategy behavior (strategies react
|
||||
to rejects → back off → restrict clears → strategies re-enter aggressively
|
||||
→ re-breach). Good systems-thinking about emergent feedback.
|
||||
|
||||
**Quality assessment:**
|
||||
- **GPT-5** produced the most findings (10) and the highest-quality
|
||||
architectural insight: the stale-price/drawdown contradiction is a genuine
|
||||
design flaw that contradicts the document's own safety claim. Multiple
|
||||
findings showed cross-boundary reasoning about semantic mismatches (price
|
||||
definition, FLATTEN semantics, gate bypass). Every finding named specific
|
||||
components and described precise event sequences.
|
||||
- **Claude Sonnet 4.6** was fast (38s, only 1,868 tokens) and produced 8
|
||||
solid findings. The HWM reset finding and the multi-metric/single-flag
|
||||
finding show genuine architectural reasoning. The liquidation feedback
|
||||
loop (buy-to-cover worsening portfolio concentration) is subtle and
|
||||
shows cross-position reasoning. However, some findings overlapped
|
||||
significantly with the common-ground set and added less unique depth.
|
||||
Sonnet performed MUCH better here than on race condition identification
(Finding #13): 8 findings to GPT-5's 10 (80%) here vs 7 to 12 (58%) previously.
|
||||
- **GPT-5 Mini** produced 6 findings in 68s with 2,240 reasoning tokens.
|
||||
Quality was genuinely good — the order-splitting/correlation finding
|
||||
and the oscillation feedback loop both show real reasoning depth. It's
|
||||
clearly NOT GPT-4.1 Mini — it reasons about component interactions,
|
||||
not just within-frame risks. However, it found fewer issues and a seventh
finding was cut off mid-response (token limit or response truncation).
|
||||
|
||||
**Key insight — task framing as the dominant variable:**
|
||||
This experiment used a much more structured prompt than previous ones:
|
||||
specified 5 categories, required specific output format, explicitly excluded
|
||||
single-component failures. The result: ALL models produced higher-quality,
|
||||
more focused output than in earlier experiments with broader prompts. Even
|
||||
Sonnet — which struggled on race conditions (Finding #13) — performed well
|
||||
here. The structured categories likely helped models organize their reasoning
|
||||
without losing track of what they were looking for.
|
||||
|
||||
The prompt explicitly asked for "cross-component interaction failures" rather
|
||||
than general analysis. This is the narrow-lens effect from Finding #2, but
|
||||
applied to a complex multi-component document. The lens is narrow (only
|
||||
inter-component gaps) but the scope is broad (459 lines, many interactions).
|
||||
This combination — narrow analytical lens + broad document scope — appears
|
||||
to be the sweet spot for getting quality from all model tiers.
|
||||
|
||||
**GPT-5 Mini positioning:**
|
||||
First time testing GPT-5 Mini. Results: 6 findings in 68s vs GPT-5's 10 in
|
||||
116s. That's 60% of the findings in 59% of the time, with 28% of the
|
||||
reasoning tokens (2,240 vs 8,128). Quality-per-finding was solid — the order
|
||||
correlation finding especially showed genuine systems reasoning. GPT-5 Mini
|
||||
appears to be a legitimate mid-tier: more capable than GPT-4.1 (which stayed
within the document's own framing in earlier experiments) but less exhaustive than GPT-5.
|
||||
Viable for: first-pass screening, bulk document review where you'd run many
|
||||
docs and can't afford full GPT-5 on each.
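
The percentages quoted in this paragraph follow directly from the results table. A short check, with dictionary and function names that are ours rather than part of any harness:

```python
# Figures from the Finding 14 results table.
GPT5 = {"seconds": 116, "reasoning_tokens": 8128, "findings": 10}
GPT5_MINI = {"seconds": 68, "reasoning_tokens": 2240, "findings": 6}


def ratio(metric: str) -> float:
    """GPT-5 Mini's value for a metric as a fraction of GPT-5's."""
    return GPT5_MINI[metric] / GPT5[metric]


if __name__ == "__main__":
    for metric in ("findings", "seconds", "reasoning_tokens"):
        print(f"{metric}: {ratio(metric):.0%}")
```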
|
||||
|
||||
**Sonnet recovery from Finding #13:**
|
||||
Sonnet went from 7 findings (with errors) on race conditions to 8 solid
|
||||
findings here. The difference: this prompt was more structured, the document
|
||||
was larger with more explicit interaction descriptions, and the task didn't
|
||||
require pure temporal/sequential reasoning. "Cross-component interaction
|
||||
failures" is closer to assumption-finding (Sonnet's strength) than race
|
||||
condition identification (Sonnet's weakness). Task taxonomy continues to
|
||||
matter more than raw model capability.
|
||||
|
||||
**Updated model assignment for cross-component analysis:**
|
||||
1. GPT-5 — broadest coverage, finds semantic contradictions in the doc's
|
||||
own claims (10 findings)
|
||||
2. Sonnet 4.6 — fast and efficient, good at portfolio-level reasoning and
|
||||
feedback loops (8 findings in 38s)
|
||||
3. GPT-5 Mini — viable first-pass with genuine reasoning depth (6 findings)
|
||||
4. (Opus untested for this task type — likely strong on design tensions)
|
||||
@@ -0,0 +1,133 @@
|
||||
# Finding 15: Design Coherence Analysis
|
||||
|
||||
**Date:** 2026-05-03
|
||||
**Task:** Identify internal design incoherences in gargoyle's `failure-modes.md` (383 lines)
|
||||
— places where the document's stated principles/invariants are contradicted by its own
|
||||
specified mechanisms.
|
||||
**How we used them:** Same document (full text) + same focused analytical question to all
|
||||
3 models via HAI proxy. Highly structured prompt specifying 5 categories of incoherence
|
||||
to look for (safety properties not enforced, state machine violations, recovery contradictions,
|
||||
supervision conflicts, cross-mechanism contradictions). Required each finding to reference
|
||||
specific sections. No tools, no project context beyond the document itself.
|
||||
|
||||
| Model | Time | Output tokens | Reasoning tokens | Incoherences found |
|---|---|---|---|---|
| Claude Sonnet 4.6 | ~39s | 1,168 | (internal) | 5 |
| Claude Opus 4.6 | ~105s | 3,378 | (internal) | 7 (8 attempted, 1 self-withdrawn) |
| GPT-5 | ~120s | 10,235 | 9,088 | 4 |
|
||||
|
||||
**What they found — common ground (all 3 identified):**
|
||||
- State machine universality claim vs Strategy.Worker crash behavior (process
|
||||
crashes bypass the degraded state entirely — no transition path in the model)
|
||||
- Market data staleness advisory-only vs the "don't trade when ambiguous" principle
|
||||
(or vs concurrent failure auto-halt)
|
||||
- `pending_cancel`/`pending_replace` absent from recovery query set (GPT-5 and
|
||||
Sonnet found this directly; Opus addressed the broader state machine gap)
|
||||
|
||||
**GPT-5 unique findings (not in either Claude model):**
|
||||
- Kill switch halted = "process terminated" vs kill switch requiring RUNNING
|
||||
processes for cancel_all, FLATTEN, and LIQUIDATE modes. The state definition
|
||||
claims processes are terminated, but the mechanisms require them alive to
|
||||
execute orders. **This is the most architecturally significant finding** — it
|
||||
reveals a fundamental definitional error in the state machine.
|
||||
- Per-symbol degradation contradicts the process-level degradation semantics.
|
||||
A worker "enters degraded" but continues operating for non-stale symbols —
|
||||
violating the stated definition that degraded = "cannot perform primary
|
||||
function." The metrics/eventing model has no per-symbol dimension.
|
||||
|
||||
**Claude Opus unique findings (not in either other model):**
|
||||
- `:rest_for_one` cascade creates a FIFTH implicit state (terminated-and-restarting)
that is not in the four-state model: processes that were `normal` are
forcibly killed (not by the kill switch) and then restart. (Opus also self-corrected
one finding that initially looked like an incoherence but was actually consistent.)
|
||||
- PortfolioMonitor continues evaluating with stale data ("fail-safe") while
|
||||
Strategy.Workers are stopped for the SAME condition — contradicts both the
|
||||
universal state machine (PM doesn't transition to degraded) and the doc's
|
||||
reasoning about why stale data is dangerous.
|
||||
- Path-dependent recovery criteria: Strategy.Worker recovery requires 21+ bars
|
||||
after crash but only "price continuity check" after staleness. The state
|
||||
machine's single "catch-up complete" exit condition can't express this.
|
||||
- `halted → [*]` transition in state diagram is logically impossible if "halted"
|
||||
means the process is already terminated — dead processes can't fire transitions.
|
||||
- Compound failure detection requires a meta-observer across processes but the
|
||||
per-process state machine model has no way to express cross-process conditions.
|
||||
|
||||
**Claude Sonnet unique findings (not in either other model):**
|
||||
- Market data global staleness: the failure table says "Manual (disengage)" for
|
||||
recovery — implying automatic engagement happened — but the text says it's
|
||||
advisory only. Table contradicts prose.
|
||||
- ReconciliationGate: doc claims gate survives OM crash (separate supervision
|
||||
tree), but then says "missing ETS table = not ready" when OM crashes. If the
|
||||
gate survives, why would its table be missing?
|
||||
- Signal survival claims are contradictory between sections: worker crash says
|
||||
downstream signals survive, but OM crash says all upstream signals lost.
|
||||
(NOTE: this is actually describing different scenarios — worker crash doesn't
|
||||
cascade to SignalRisk, OM crash does via :rest_for_one. Sonnet may have
|
||||
misread the architecture here — the two statements are consistent when you
|
||||
understand the supervision tree.)
|
||||
|
||||
**Quality assessment:**
|
||||
- **GPT-5** found only 4 incoherences but TWO of them are genuinely critical
|
||||
architectural findings. The "halted = terminated" vs "kill switch requires
|
||||
running processes" contradiction is a real design error — you can't both
|
||||
terminate processes AND require them to execute cancel/liquidation orders.
|
||||
The per-symbol degradation finding is also a real modeling gap. GPT-5 was
|
||||
MORE SELECTIVE here than in previous experiments — it didn't pad with
|
||||
medium-severity findings. Each of its 4 was high/critical.
|
||||
- **Claude Opus** produced the most findings (7 valid) with characteristic
|
||||
depth. Its self-correction (withdrawing finding #6 after deeper analysis)
|
||||
shows intellectual honesty rare in model outputs. The PortfolioMonitor
|
||||
stale-data contradiction is genuinely insightful — same input condition,
|
||||
opposite response, no justification within the state machine model. The
|
||||
compound failure meta-observer finding identifies a modeling category error.
|
||||
Opus also found modeling imprecisions (path-dependent recovery, halted → [*]
|
||||
impossibility) that the other models didn't notice.
|
||||
- **Claude Sonnet** found 5 issues quickly (39s, 1,168 tokens) but quality was
|
||||
mixed. Finding #4 (ReconciliationGate) raises a genuine question about
|
||||
the ETS table ownership claim. Finding #1 (table vs prose contradiction on
|
||||
market data staleness) is a real documentation inconsistency. However,
|
||||
Finding #5 appears to misread the supervision architecture — the two
|
||||
statements about signal survival ARE consistent when you understand that
|
||||
different crashes cascade differently. Sonnet produced one false positive.
|
||||
|
||||
**Key insight — "design coherence" is a NEW analytical category with distinct model strengths:**
|
||||
This is different from assumption-finding (Finding #10-12), race conditions
|
||||
(Finding #13), and cross-component interactions (Finding #14). Coherence
|
||||
checking requires the model to hold MULTIPLE parts of the document in tension
|
||||
with each other and reason about whether they're compatible. Results:
|
||||
|
||||
- **GPT-5** was MORE SELECTIVE than in any previous experiment. Only 4 findings
|
||||
vs 10-24 in other tasks. But precision was near-perfect — all 4 are genuine
|
||||
contradictions. This suggests GPT-5's reasoning tokens are being used for
|
||||
VERIFICATION (checking whether apparent contradictions hold up) rather than
|
||||
EXPLORATION (finding more things). The 9K reasoning tokens produced 4 findings
|
||||
vs the usual 10+ — GPT-5 is self-editing aggressively.
|
||||
- **Opus** hit its sweet spot. Coherence checking IS design-tension identification
|
||||
— Opus's consistent strength. Finding incoherences requires exactly the kind
|
||||
of "how does this design disagree with itself" reasoning that Opus excels at.
|
||||
It also showed unique self-correction behavior (withdrawing a finding after
|
||||
deeper analysis).
|
||||
- **Sonnet** was fast but produced a false positive. Coherence checking requires
|
||||
holding multiple document sections in memory simultaneously and reasoning about
|
||||
their compatibility — this is harder than assumption-finding (where you
|
||||
reason about one mechanism at a time) but easier than race conditions (which
|
||||
require sequential temporal reasoning). Sonnet occupies a middle ground.
|
||||
|
||||
**Model ranking for design coherence checking:**
|
||||
1. Claude Opus 4.6 — most findings, highest depth, self-correcting (7 valid)
|
||||
2. GPT-5 — fewest findings but near-perfect precision, finds the critical ones (4)
|
||||
3. Claude Sonnet 4.6 — fast screening, but prone to false positives on
|
||||
architectural misreads (4/5 valid)
|
||||
|
||||
**This inverts the usual GPT-5 > Opus ordering.** In previous experiments, GPT-5
|
||||
consistently found MORE issues. Here, GPT-5 was more selective than Opus. The
|
||||
task type (self-consistency checking) favors Opus's "design tension" reasoning
|
||||
style over GPT-5's "exhaustive exploration" style. GPT-5 apparently uses its
|
||||
reasoning to VERIFY rather than GENERATE when the task is about contradictions
|
||||
rather than gaps.
|
||||
|
||||
**Practical implication:** For architecture documents, run coherence checking as
a separate pass with Opus as the primary model. GPT-5's higher precision makes
it well suited to confirming which Opus findings are genuine and which are overreads.
The two-pass approach is: Opus generates candidates → GPT-5 validates them → the result
is the intersection (Opus candidates that GPT-5 confirms) plus GPT-5's independent finds.
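
The two-pass approach described above can be sketched in a few lines. This is a minimal illustration, not the tooling used in the experiment: the function name `two_pass_result`, the `gpt5_confirms` callback (a stand-in for asking GPT-5 to validate one candidate), and the sample labels are all hypothetical.

```python
from typing import Callable, Iterable


def two_pass_result(
    opus_candidates: Iterable[str],
    gpt5_confirms: Callable[[str], bool],
    gpt5_independent: Iterable[str],
) -> set[str]:
    """Combine an Opus candidate pass with a GPT-5 validation pass.

    Assumption: each finding has been reduced to a normalized label so
    the same issue reported by both models compares equal.
    """
    # Pass 1 output, filtered by pass 2: keep only candidates GPT-5 confirms.
    validated = {c for c in opus_candidates if gpt5_confirms(c)}
    # Add whatever GPT-5 found on its own that Opus did not propose.
    return validated | set(gpt5_independent)


# Hypothetical labels, for illustration only.
opus = ["pm-stale-data", "halted-transition", "overread-example"]
independent = ["halted-requires-running"]
result = two_pass_result(opus, lambda c: c != "overread-example", independent)
```

Modeling validation as a per-candidate callback keeps the two GPT-5 roles separate: confirming an Opus candidate is a different question from producing a finding independently.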
|
||||
@@ -0,0 +1,131 @@
|
||||
# Finding 16: Specification completeness: Sonnet 4.5 produces nearly 2x the findings of Sonnet 4.6 on implementation-gap analysis; quality vs volume tradeoff
|
||||
|
||||
**Date:** 2026-05-03
|
||||
**Task:** Identify specification gaps in gargoyle's `kill-switch.md` (185 lines) — places
|
||||
where an implementer would be forced to guess or decide on their own because the spec
|
||||
doesn't clearly specify behavior. New analytical lens not previously tested.
|
||||
**How we used them:** Same document (full text) + same focused analytical question to all
|
||||
3 models via HAI proxy. Highly structured prompt specifying 5 categories of underspecification
|
||||
(behavioral ambiguity, missing edge cases, ordering/sequencing gaps, interface contracts
|
||||
undefined, concurrency semantics omitted). Required specific output format per finding
|
||||
(gap, section, what implementer must decide, risk if wrong, severity). No tools, no
|
||||
project context beyond the document itself.
|
||||
|
||||
| Model | Time | Output tokens | Reasoning tokens | Gaps found | Critical | High | Medium | Low |
|---|---|---|---|---|---|---|---|---|
| Claude Sonnet 4.6 | 73s | 3,403 | (internal) | 13 | 8 | 4 | 0 | 1 |
| Claude Sonnet 4.5 | 102s | 5,191 | (internal) | 25 | 14 | 6 | 4 | 1 |
| GPT-5 | 109s | 10,140 | 7,872 | 19 | 8 | 7 | 3 | 0 |
|
||||
|
||||
**What they found — common ground (all 3 identified):**
|
||||
- Pipeline process identification ambiguity (which processes are "pipeline processes")
|
||||
- Per-user process scope mapping (how to terminate only one user's processes)
|
||||
- ETS table ownership and lifecycle (who owns it, what happens on crash)
|
||||
- Concurrent engage operations (what happens when two sources engage simultaneously)
|
||||
- Liquidation order tagging mechanism (what the tag is, how verified)
|
||||
- Process restart prevention (how "must not restart" is enforced)
|
||||
- Engage sequence atomicity (partial failure between DB write and termination)
|
||||
- Startup ordering and ETS readiness (pipeline starting before ETS populated)
|
||||
- Disengage sequence ordering (what happens and in what order)
|
||||
|
||||
**Sonnet 4.5 unique findings (not in either other model):**
|
||||
- ETS table schema/structure (set vs ordered_set, key format, value schema)
|
||||
- Missing ETS detection mechanism (catch :badarg vs table existence check)
|
||||
- Database write atomicity with ETS (transaction boundaries, rollback semantics)
|
||||
- Per-user engage while global is already engaged (is it a no-op or error?)
|
||||
- Broker rejection semantics ("already filled" vs "invalid cancel" distinction)
|
||||
- Cold-start gate interaction (independence vs dependency of the two gates)
|
||||
- User deletion with active kill switch (orphaned rows, cascade semantics)
|
||||
- Global disengage effect on per-user states (independent or auto-clear?)
|
||||
- Audit log write failure during engage (critical-path vs best-effort)
|
||||
- Dashboard control ambiguity in LIQUIDATE mode (contradictory disable/enable)
|
||||
- Cancel timeout duration (operational parameter not specified)
|
||||
- Manual order source code path during LIQUIDATE (how orders bypass the dead pipeline)
|
||||
|
||||
**GPT-5 unique findings (not in either other model):**
|
||||
- Combined global/per-user mode semantics (what happens when global=RESTRICT,
|
||||
user=LIQUIDATE — can user's liquidation proceed?)
|
||||
- Scope of "all" in cancel_all and liquidation (system-wide vs per-user)
|
||||
- Gate behavior when ETS missing but liquidation needed (conflicting requirements:
|
||||
fail-closed says block, but liquidation needs to pass)
|
||||
- Disengage during in-flight cancellations (what happens to racing tasks)
|
||||
- Gate placement relative to broker submission (exact point in the flow)
|
||||
- Engage latency expectations (no quantified SLA)
|
||||
- Mode change while already engaged (RESTRICT → LIQUIDATE without disengage)
|
||||
- Dashboard vs backend scope for manual liquidation (individual vs bulk only)
|
||||
|
||||
**Sonnet 4.6 unique findings (not in either other model):**
|
||||
- ETS sequencing relative to process termination (ETS before or after kill?)
|
||||
- Concurrent disengage + re-engage race (specific interleaving scenario)
|
||||
- Close-only enforcement mechanism (UI-only vs backend validation)
|
||||
- Order-in-flight past ETS gate during termination (already-checked orders)
|
||||
|
||||
**Quality assessment:**
|
||||
- **Claude Sonnet 4.5** was the most EXHAUSTIVE (25 gaps) but with notable
|
||||
quality variance. Several findings were highly specific and implementation-
|
||||
relevant (ETS schema, missing-table detection, broker rejection semantics).
|
||||
Others were relatively obvious or lower-impact (user deletion, audit log
|
||||
failure, cancel timeout duration). The 14 Critical ratings feel somewhat
|
||||
generous — some would be more accurately rated as High in practice. Output
|
||||
was well-structured with clear per-finding format.
|
||||
- **GPT-5** found 19 gaps with consistent high quality. Its unique findings
|
||||
show cross-cutting reasoning: the combined mode semantics finding (global
|
||||
vs per-user mode interaction) identifies a genuine specification gap that
|
||||
neither Sonnet version noticed. The "ETS missing but liquidation needed"
|
||||
finding is architecturally significant — it identifies a CONTRADICTION in
|
||||
the spec's own rules (fail-closed blocks everything, but liquidation must
|
||||
pass). Every finding was actionable. More selective severity ratings
|
||||
(8 Critical vs Sonnet 4.5's 14).
|
||||
- **Claude Sonnet 4.6** was the most SELECTIVE (13 gaps) but with the highest
|
||||
precision. Every finding was genuinely a specification gap that an
|
||||
implementer would face. The ETS sequencing finding (#4) is particularly
|
||||
well-reasoned — it identifies a specific ordering dependency that creates
|
||||
a race window. Sonnet 4.6 appears to self-filter aggressively, producing
|
||||
only findings it's confident about. Higher signal-to-noise than 4.5.
|
||||
|
||||
**Key insight — Sonnet 4.5 vs 4.6 on analytical tasks:**
|
||||
This is the first direct comparison between Claude model versions on the same
|
||||
analytical task. Key differences:
|
||||
|
||||
- **Volume:** 4.5 produced almost 2x the findings (25 vs 13)
|
||||
- **Tokens:** 4.5 used ~1.5x the output tokens (5,191 vs 3,403)
|
||||
- **Time:** 4.5 took ~1.4x longer (102s vs 73s)
|
||||
- **Severity distribution:** 4.5 had more Critical findings (14 vs 8) but
|
||||
with more generous severity ratings
|
||||
- **Quality per finding:** 4.6 had higher average quality; fewer "obvious"
|
||||
or lower-impact findings
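
The ratios in the list above can be reproduced directly from the results table. The sketch below only restates that arithmetic; the dictionary layout and variable names are illustrative choices.

```python
# Figures copied from the results table for this finding.
sonnet_45 = {"gaps": 25, "output_tokens": 5191, "seconds": 102, "critical": 14}
sonnet_46 = {"gaps": 13, "output_tokens": 3403, "seconds": 73, "critical": 8}

# Ratio of 4.5 to 4.6 for every measured quantity.
ratios = {key: sonnet_45[key] / sonnet_46[key] for key in sonnet_45}
for key, value in ratios.items():
    print(f"{key}: {value:.2f}x")

# Share of each model's findings that it rated Critical.
critical_share_45 = sonnet_45["critical"] / sonnet_45["gaps"]
critical_share_46 = sonnet_46["critical"] / sonnet_46["gaps"]
```

Note that the share of findings rated Critical is 14/25 for 4.5 and 8/13 for 4.6, so 4.5's larger Critical count comes from its larger volume rather than from a higher proportion.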
|
||||
|
||||
The 4.6 model appears to favor precision and selectivity. It finds fewer things,
but each finding is more reliably a genuine gap. The 4.5 model is more exhaustive
but includes findings that a reviewer might triage as "yes, technically, but not
really a spec gap." This matches our impression that later Claude versions tend to
be more concise and selective, though one comparison on one document cannot establish a trend.
|
||||
|
||||
**For practical use:** If you want completeness (cast a wide net, accept some
|
||||
noise): use 4.5. If you want precision (every finding is actionable, no triage
|
||||
needed): use 4.6. For architecture review where missing a gap has cost, 4.5's
|
||||
exhaustiveness is probably worth the noise. For review where false positives
|
||||
cost attention (e.g., PR review comments), 4.6's selectivity is preferred.
|
||||
|
||||
**GPT-5 vs Sonnet comparison on this task:**
|
||||
GPT-5 (19 findings) sits between the two Sonnets in volume but has the highest
|
||||
consistency — no obvious misses or inflated severities. Its unique strength
|
||||
here: finding CONTRADICTIONS within the spec's own rules (ETS-missing blocking
|
||||
conflicts with liquidation needing to pass). This is consistent with Finding #15
|
||||
where GPT-5 was unusually selective but precise on coherence checking.
|
||||
|
||||
Specification completeness analysis appears to be a task where:
|
||||
1. Sonnet 4.5 is strongest for breadth (25 findings, catches operational gaps)
|
||||
2. GPT-5 is strongest for detecting spec self-contradictions (19 findings, high precision)
|
||||
3. Sonnet 4.6 is strongest for precision (13 findings, zero noise)
|
||||
|
||||
**Updated model version comparison:**
|
||||
- Claude 4.6 → higher precision, more selective, concise
|
||||
- Claude 4.5 → more exhaustive, more verbose, occasional severity inflation
|
||||
- This is a genuine tradeoff, not a simple regression or improvement
|
||||
|
||||
**Practical implication:** Consider running BOTH Sonnet versions. 4.5 catches things 4.6
filters out (ETS schema, broker rejection semantics, cold-start gate interaction).
4.6 describes what it catches with more specificity (sequencing gaps, exact race windows).
For a one-shot budget: use 4.5 if you want coverage, 4.6 if you want actionability,
and GPT-5 if you want to find where the spec contradicts itself.
|
||||
@@ -0,0 +1,158 @@
|
||||
# Finding 18: Temporal boundary analysis: GPT-5 is most exhaustive; Opus finds design-level contradictions; Sonnet 4.5 is structured but less deep
|
||||
|
||||
**Date:** 2026-05-04
|
||||
**Task:** Identify temporal boundary vulnerabilities in gargoyle's `escalation-policy.md`
|
||||
(238 lines) — scenarios where the timing model (evaluation cycles, debounce counts,
|
||||
cooldown periods) creates windows of incorrect or dangerous behavior.
|
||||
**How we used them:** Same document (full text) + same focused analytical question to all
|
||||
3 models via HAI proxy. Highly structured prompt specifying 5 categories of temporal
|
||||
vulnerability (timing exploitation, counter reset abuse, asymmetric time exposure,
|
||||
cross-metric temporal interactions, state loss temporal effects). Required specific
|
||||
output format per finding (name, sequence with cycle numbers, mechanism, severity, fix).
|
||||
No tools, no project context beyond the document itself.
|
||||
|
||||
| Model | Time | Output tokens | Reasoning tokens | Findings | Critical | High | Medium |
|---|---|---|---|---|---|---|---|
| GPT-5 | ~128s | 9,175 | 5,888 | 15 | 3 | 7 | 2 |
| Claude Opus 4.6 | ~120s | 5,112 | (internal) | 10 | 3 | 5 | 2 |
| Claude Sonnet 4.5 | ~100s | 4,056 | (internal) | 12 | 3 | 3 | 3 |
|
||||
|
||||
**What they found — common ground (all 3 identified):**
|
||||
- Flash crash / inter-evaluation gap exploitation (metric spikes between discrete
|
||||
evaluation cycles go undetected)
|
||||
- Single clear cycle resetting debounce counter (transient recovery defeats escalation
|
||||
despite sustained risk — metric can breach 80%+ of cycles and never escalate)
|
||||
- Asymmetric escalation time vs loss compounding rate (11 cycles to reach liquidation
|
||||
while losses compound every single cycle)
|
||||
- Monitor crash resets state to Clear, losing all escalation progress
|
||||
- Liquidation re-trigger requiring full debounce reset, delaying subsequent batches
|
||||
- Kill switch N value unspecified (timing indeterminacy)
|
||||
|
||||
**GPT-5 unique findings (not in either other model):**
|
||||
- Boundary-alignment counter starvation: explicitly modeled the "adversarial flicker"
pattern (breach for 2 cycles, clear for 1, repeat: in breach ~67% of the time, yet never escalates)
with a precise mathematical framing of why K-of-N is needed
|
||||
- Cycle-length drift under load: GC pauses or CPU contention stretching evaluation
|
||||
intervals means "3 cycles" could be 12 minutes instead of 90 seconds when it
|
||||
matters most (high-load market stress = slowest evaluations)
|
||||
- Adversarial boundary timing (market microstructure masking): illiquid instruments
|
||||
where opposing prints predictably arrive near evaluation boundaries, exploiting
|
||||
deterministic sampling points
|
||||
- Cross-metric escalation gap and hedging prohibition: Restrict prevents ALL new
|
||||
positions including risk-REDUCING hedges needed for a different metric still
|
||||
escalating on its own timeline — protection for metric A actively worsens metric B
|
||||
- Cooldown stall causing prolonged Restrict: repeated transient spikes near hysteresis
|
||||
threshold reset cooldown indefinitely while metric is actually safe
|
||||
- State inconsistency between restriction flags and monitor after restart:
|
||||
documented asymmetry where flag persists (manual clear) but state resets (auto
|
||||
clear) — creates orphaned restriction or unprotected window depending on
|
||||
reconciliation approach
|
||||
- Metric computation fail-closed interacting with debounce: system errors create
|
||||
false escalations with long cooldown, potentially blocking hedging trades
|
||||
- Unspecified N for kill switch post-liquidation breaches: coupled with crash
|
||||
reset, system can loop indefinitely without reaching kill switch
|
||||
- In-liquidate flicker stall: one cycle below threshold after partial fill resets
|
||||
re-trigger counter, stalling further liquidation
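
The "adversarial flicker" pattern in the first GPT-5 finding above can be simulated in a few lines. This is a sketch under stated assumptions: the consecutive-cycle threshold of 3 and the 3-of-5 window are illustrative values, not parameters quoted from `escalation-policy.md`, and both function names are hypothetical.

```python
from collections import deque
from itertools import cycle, islice


def consecutive_debounce(breaches, threshold):
    """Return the cycle at which `threshold` consecutive breaches occur, else None."""
    streak = 0
    for cycle_no, breached in enumerate(breaches, start=1):
        streak = streak + 1 if breached else 0  # one clear cycle resets the count
        if streak >= threshold:
            return cycle_no
    return None


def k_of_n_debounce(breaches, k, n):
    """Return the cycle at which at least `k` of the last `n` cycles breached, else None."""
    window = deque(maxlen=n)  # sliding window of the most recent n cycles
    for cycle_no, breached in enumerate(breaches, start=1):
        window.append(breached)
        if sum(window) >= k:
            return cycle_no
    return None


# Flicker pattern from the finding: breach, breach, clear, repeated for 30 cycles.
pattern = list(islice(cycle([True, True, False]), 30))
breach_fraction = sum(pattern) / len(pattern)

# Assumed parameters: 3 consecutive cycles vs. 3 breaches in any 5 cycles.
never = consecutive_debounce(pattern, threshold=3)
caught = k_of_n_debounce(pattern, k=3, n=5)
print(breach_fraction, never, caught)
```

Under these assumed parameters the consecutive counter never fires because the streak is reset every third cycle, while the windowed counter does, which is the reason the finding argues for K-of-N.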
|
||||
|
||||
**Claude Opus unique findings (not in either other model):**
|
||||
- De-escalation cooldown exploitation (predictable window): after cooldown completes
|
||||
and restriction lifts, strategy has a GUARANTEED 5+ cycle window of unrestricted
|
||||
trading before Restrict can re-engage — an automated strategy could systematically
|
||||
exploit this predictable safe window to re-enter dangerous positions
|
||||
- Orphaned restriction flag asymmetry framed as a DESIGN CONTRADICTION: the failure
|
||||
modes table specifies opposing recovery paths for state (automatic → Clear) vs
|
||||
flags (manual clear), creating an irreconcilable dual state. Opus uniquely
|
||||
identified that operator intervention to clear the flag could inadvertently
|
||||
create a WORSE protection gap than leaving it orphaned
|
||||
- Synthesizing analysis style: Opus's summary explicitly observed that the
three Critical findings share a common cause (debounce optimizes against false
positives at the expense of false negatives during sustained events) and proposed
a single architectural fix (severity-aware fast path) that addresses all three
|
||||
|
||||
**Claude Sonnet 4.5 unique findings (not in either other model):**
|
||||
- De-escalation timing not accounting for proximity to breach threshold: system
|
||||
removes protection while metric is still near-dangerous, and re-escalation
|
||||
requires full debounce — created a specific "whipsaw" scenario with cycle numbers
|
||||
- Manual-only de-escalation from Liquidate creates UNBOUNDED recovery time:
|
||||
if triggered at 2 AM Saturday, trading disabled until Monday despite metrics
|
||||
recovering in minutes. Framed as contradiction with "autonomous" design goals
|
||||
- Evaluation cycle synchronization assumption: no handling of variable timing
|
||||
(CPU contention, GC pauses) — implicit throughout but never addressed
|
||||
- Cold start escalation ambiguity: system starts with no prior state while
|
||||
portfolio may already be in breach condition
|
||||
- De-escalation event ordering race: multiple metrics de-escalating simultaneously
|
||||
may emit events in non-deterministic order, confusing external observers
|
||||
|
||||
**Quality assessment:**
|
||||
- **GPT-5** was the most exhaustive (15 findings) and showed the strongest
|
||||
mathematical/systems reasoning. Its unique findings included precise attack
|
||||
models (adversarial flicker, boundary alignment, microstructure masking) that
|
||||
describe exact exploitation patterns with percentages and cycle counts. The
|
||||
cross-metric hedging prohibition finding is architecturally significant — it
|
||||
identifies that protection for one metric can actively CREATE risk for another.
|
||||
Every finding was actionable with specific fixes.
|
||||
- **Claude Opus 4.6** produced fewer findings (10) but with characteristic depth
|
||||
and self-awareness. Its cooldown exploitation finding identified a PREDICTABLE
|
||||
exploit window that an automated strategy could systematically abuse — framed
|
||||
not as an accident but as an adversarial opportunity. The summary synthesis
|
||||
(identifying common cause across Critical findings) shows meta-analytical
|
||||
capability the other models didn't demonstrate. Opus also uniquely identified
|
||||
that human intervention to fix one problem could create a WORSE problem —
|
||||
second-order operational reasoning.
|
||||
- **Claude Sonnet 4.5** was well-structured (12 findings, clean severity tiers,
|
||||
organized by Critical/High/Medium/Low) and faster than both other models.
|
||||
Its findings were solid but less architecturally deep. The manual de-escalation
|
||||
contradiction finding was genuinely insightful (unbounded recovery time vs
|
||||
autonomous design goals). However, several findings restated concepts the
|
||||
other models covered with less specificity about exploitation mechanics.
|
||||
|
||||
**Key insight — temporal reasoning as a task type:**
|
||||
This is the first experiment specifically testing "temporal boundary analysis" —
|
||||
reasoning about time-domain properties of a state machine (evaluation frequency,
|
||||
counter semantics, cooldown mechanics, crash/restart timing).
|
||||
|
||||
Results compared to Finding #13 (race condition identification on a concurrency doc):
|
||||
- GPT-5: 15 findings here vs 12 in Finding #13. Consistent high performance
|
||||
on temporal reasoning tasks across both experiments.
|
||||
- Opus: 10 findings here vs 10 in Finding #13. Remarkably consistent — Opus
|
||||
produces ~10 high-quality findings regardless of temporal task variant.
|
||||
- Sonnet 4.5: 12 findings here (first test). Compare to Sonnet 4.6's 7 findings
|
||||
(with errors) in Finding #13. Sonnet 4.5 handles temporal reasoning better than
|
||||
4.6 — consistent with Finding #16 showing 4.5 is more exhaustive across task types.
|
||||
|
||||
**Sonnet 4.5 vs 4.6 on temporal reasoning (inferred comparison):**
|
||||
Sonnet 4.6 struggled significantly on race condition identification (Finding #13:
|
||||
7 findings with analytical errors, misreading architecture). Sonnet 4.5 here
|
||||
produced 12 solid findings with no apparent misreadings. This suggests 4.5's
|
||||
exhaustiveness advantage extends to temporal reasoning — the additional
|
||||
exploration it does (vs 4.6's aggressive self-filtering) catches more temporal
|
||||
interactions. Confirms Finding #16's pattern: 4.5 for coverage, 4.6 for precision.
|
||||
|
||||
**The structured-prompt effect continues:**
|
||||
All three models produced focused, high-quality output with this highly structured
|
||||
prompt (5 specific categories + required output format). This confirms Finding #14:
|
||||
narrow analytical lens + broad document scope is the sweet spot for all model tiers.
|
||||
The prompt structure appears to be a stronger predictor of output quality than model
|
||||
choice for the bottom 80% of findings (all models find the common-ground issues).
|
||||
Model choice matters for the TOP 20% — the unique insights that require deeper
|
||||
reasoning about system interactions.
|
||||
|
||||
**Updated model assignment for temporal boundary analysis:**
|
||||
1. GPT-5 — most exhaustive, strongest at modeling adversarial exploitation patterns
|
||||
and mathematical edge cases (15 findings)
|
||||
2. Claude Sonnet 4.5 — good volume with clean structure, viable for first-pass
|
||||
temporal analysis (12 findings, no errors)
|
||||
3. Claude Opus 4.6 — fewest findings but highest insight density, uniquely
|
||||
identifies predictable exploit windows and operational second-order effects
|
||||
(10 findings)
|
||||
|
||||
**Practical implication:** For temporal analysis on state machines and timing-dependent
|
||||
policies, the three-model stack produces genuine complementary value:
|
||||
- GPT-5 catches the adversarial attack patterns and mathematical edge cases
|
||||
- Opus catches the predictable exploit windows and operational contradictions
|
||||
- Sonnet 4.5 provides good breadth at lower cost with clean severity categorization
|
||||
|
||||
The union of unique findings across all three models reveals significantly more
|
||||
temporal vulnerabilities than any single model alone. For a document governing
|
||||
autonomous financial actions (liquidation, kill switch), the cost of running all
|
||||
three (~$1-2) is trivially justified against the risk of missing a timing exploit.
|
||||
@@ -0,0 +1,124 @@
|
||||
# Finding 19: Union coverage test: GPT-5 Mini + Sonnet 4.6 covers ~71% of GPT-5's findings; the missing 29% is where the real value lives
|
||||
|
||||
**Date:** 2026-05-04
|
||||
**Task:** Identify hidden assumptions in gargoyle's `trading-pipeline.md` (1,110 lines,
|
||||
~62KB) — the most complex document tested so far, covering the full end-to-end path
|
||||
from tick ingestion through order execution.
|
||||
**How we used them:** Same document (full text, no truncation) + same focused analytical
|
||||
question to all 3 models via HAI proxy. Standard hidden-assumption prompt with 5
|
||||
categories (runtime behavior, external dependencies, timing/ordering, scale/load,
|
||||
uncovered failure modes). Required specific output format per finding. No tools, no
|
||||
project context beyond the document itself.
|
||||
|
||||
| Model | Time | Output tokens | Reasoning tokens | Assumptions found |
|---|---|---|---|---|
| GPT-5 | 99s | 9,418 | 5,696 | 35 |
| GPT-5 Mini | 93s | 5,309 | 1,792 | 21 |
| Claude Sonnet 4.6 | 38s | 1,792 | (internal) | 17 |
|
||||
|
||||
**Coverage analysis — can Mini + Sonnet together replace GPT-5?**
|
||||
|
||||
We categorized each of GPT-5's 35 findings by whether Mini or Sonnet
(that is, their union) also identified the same assumption:
|
||||
|
||||
- **Covered by BOTH Mini and Sonnet:** ~12 findings (common ground — any model
|
||||
finds these: idempotency, single-writer, clock sync, instrument resolution,
|
||||
fill immutability, reconciliation gate, backpressure, fill correlation, event
|
||||
ordering, audit scalability, PortfolioRisk bottleneck)
|
||||
- **Covered by Mini only (not Sonnet):** ~7 findings (transactional atomicity,
|
||||
audit causal consistency, modification-in-flight enforcement, OM throughput,
|
||||
decimal precision, PM/PR close-only race, partition duplicate submit)
|
||||
- **Covered by Sonnet only (not Mini):** ~6 findings (market data feed rates,
|
||||
pipeline-vs-market speed, corporate actions atomicity, kill switch partition,
|
||||
shared port isolation, market close vs auction fills)
|
||||
- **Union(Mini + Sonnet) total coverage:** ~25/35 = **~71%** of GPT-5's findings
|
||||
- **GPT-5 unique (missed by both):** ~10-18 findings depending on strictness
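
The coverage figure follows from the approximate bucket counts in the list above. The sketch below restates that arithmetic; the variable names are illustrative, and the counts are the approximate ("~") values given in the list.

```python
# Approximate bucket counts from the coverage list.
gpt5_total = 35
covered_by_both = 12
covered_by_mini_only = 7
covered_by_sonnet_only = 6

# A GPT-5 finding counts as covered if either cheaper model found it.
union_covered = covered_by_both + covered_by_mini_only + covered_by_sonnet_only
coverage = union_covered / gpt5_total
missed = gpt5_total - union_covered

print(f"{union_covered}/{gpt5_total} = {coverage:.0%}, missed by both = {missed}")
```

The `missed` value corresponds to the lower end of the "~10-18" range; the upper end reflects a stricter judgment about what counts as the same assumption.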
|
||||
|
||||
**What GPT-5 uniquely found that the cheaper pair missed:**
|
||||
|
||||
The missing 29% is NOT random — it's systematically different in character:
|
||||
|
||||
1. **Operational edge cases:** Default TIF "day" broker semantics, OrderRate
|
||||
counting retries, extended-hours MarketHours mismatch, fractional quantities,
|
||||
local expiry timer precision per instrument
|
||||
2. **Design-level interaction gaps:** PortfolioRisk concurrent decision race
|
||||
(snapshot stale between two parallel approvals), re-validation gap between
|
||||
approval and submit, decision loss on crash after audit write
|
||||
3. **Domain-specific knowledge:** Manual broker-side actions conflicting with
|
||||
state machine, options/complex instrument position_effect mapping, Decision→Order
|
||||
1:1 invariant vs broker auto-splitting, wash sale retroactive P&L mutation
|
||||
4. **Architectural observations:** Reduction re-entry rule insufficiency,
|
||||
PortfolioMonitor coalescing vs fast breach detection, multi-aggregator fanout
|
||||
and audit partial writes, replay/backtest alignment with production controls
|
||||
|
||||
These share a common trait: they require **domain expertise** (knowing how brokers
|
||||
actually behave, how regulatory rules interact, how production trading systems
|
||||
fail in practice) combined with **architectural reasoning** (how the design's own
|
||||
mechanisms interact under those real-world conditions). The cheaper models find
|
||||
assumptions about the document's internal consistency; GPT-5 additionally finds
|
||||
assumptions about the document's relationship to the external world it must
|
||||
operate in.
|
||||
|
||||
**GPT-5 Mini vs Sonnet 4.6 — complementary, not redundant:**
|
||||
|
||||
Mini and Sonnet covered different gaps:
|
||||
- Mini was stronger on **internal consistency** (transactional atomicity, causal
|
||||
consistency, decimal precision, modification serialization)
|
||||
- Sonnet was stronger on **external interactions** (market data feeds, corporate
|
||||
actions, kill switch distribution, shared resource isolation)
|
||||
|
||||
This aligns with previous findings: Mini reasons about implementation mechanics;
|
||||
Sonnet reasons about system boundaries and external interactions. Their union
|
||||
covers more ground than either alone.
|
||||
|
||||
**Cost comparison:**
|
||||
|
||||
| Approach | Total tokens | Approx. cost | Coverage of GPT-5 |
|---|---|---|---|
| GPT-5 alone | ~21K (9.4K output + 5.7K reasoning) | ~$0.80 | 100% (35 findings) |
| Mini + Sonnet | ~7.1K output + 1.8K reasoning | ~$0.25 | ~71% (25/35 findings) |
| All three | ~28K total | ~$1.05 | >100% (35 + unique Sonnet/Mini extras) |
|
||||
|
||||
**Key insight — the 71% coverage is a floor, not a ceiling:**
|
||||
|
||||
The union covers 71% of GPT-5's specific findings. But Mini and Sonnet each
|
||||
also produced findings that GPT-5 DIDN'T make:
|
||||
- Sonnet: DailyLossLimit query performance scaling, instrument reference data
|
||||
propagation atomicity across components
|
||||
- Mini: Signal audit correlation ambiguity under replay/duplicate ticks
|
||||
|
||||
So the total unique finding space is LARGER than any single model. Running all
|
||||
three produces the most comprehensive analysis.
|
||||
|
||||
**Answer to the open question: "Would running GPT-5 Mini + Sonnet together
|
||||
approach GPT-5's coverage at lower combined cost?"**
|
||||
|
||||
**Partially.** The pair covers ~71% of GPT-5's findings at ~31% of the cost.
|
||||
But the missing 29% is disproportionately valuable — it contains the
|
||||
domain-specific, interaction-level, real-world-knowledge findings that are
|
||||
most likely to prevent production incidents. For a quick sanity check or
|
||||
first-pass screening, Mini + Sonnet is excellent value. For architecture
|
||||
review where completeness matters (financial system, safety-critical), GPT-5
|
||||
is not replaceable by cheaper models — its unique findings are exactly the
|
||||
ones that would cause real-world failures.
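The coverage and cost ratios quoted above can be reproduced with a few lines of arithmetic. This is a minimal sketch using only the approximate figures from the cost comparison table; the variable names are illustrative, and the "marginal cost per missed finding" is an extra derived number, not something the experiment reported.

```python
# Approximate figures copied from the cost comparison table above.
GPT5_FINDINGS = 35      # findings produced by GPT-5 alone
PAIR_COVERED = 25       # of those, findings also covered by Mini + Sonnet
GPT5_COST = 0.80        # approx. USD for the GPT-5 run
PAIR_COST = 0.25        # approx. USD for the Mini + Sonnet runs combined

coverage = PAIR_COVERED / GPT5_FINDINGS          # share of GPT-5 findings covered
cost_ratio = PAIR_COST / GPT5_COST               # pair cost relative to GPT-5
gpt5_cost_per_finding = GPT5_COST / GPT5_FINDINGS
pair_cost_per_finding = PAIR_COST / PAIR_COVERED

# Extra spend on GPT-5 divided by the findings only GPT-5 produced
# (using the stricter count of 10 missed findings).
marginal_cost_per_missed = (GPT5_COST - PAIR_COST) / (GPT5_FINDINGS - PAIR_COVERED)

print(f"coverage:                 {coverage:.0%}")
print(f"cost ratio:               {cost_ratio:.0%}")
print(f"GPT-5 cost per finding:   ${gpt5_cost_per_finding:.3f}")
print(f"pair cost per finding:    ${pair_cost_per_finding:.3f}")
print(f"marginal cost per missed: ${marginal_cost_per_missed:.3f}")
```

Under these figures the GPT-5-only findings cost a few cents each, which is the arithmetic behind treating cost as irrelevant for high-stakes reviews.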
|
||||
|
||||
**Practical implication:** The optimal strategy depends on stakes:
|
||||
- **Low stakes** (internal doc review, non-critical systems): Mini + Sonnet
|
||||
is 71% coverage at 31% cost — strong ROI
|
||||
- **High stakes** (financial systems, safety-critical): run all three — the
|
||||
~$1 total cost is irrelevant vs the value of the extra 10-18 findings
|
||||
- **Budget-conscious high stakes:** run GPT-5 alone — it subsumes most of
|
||||
what Mini + Sonnet find, and adds the critical domain-knowledge findings
|
||||
|
||||
The cost argument for Mini + Sonnet as a GPT-5 REPLACEMENT doesn't hold for
|
||||
important work. The cost argument for Mini + Sonnet as a GPT-5 COMPLEMENT
|
||||
is strong — they catch a few things GPT-5 misses, and the union of all three
|
||||
is the most thorough analysis available.
|
||||
|
||||
**Document complexity observation:**
|
||||
This is the largest document tested (1,110 lines vs previous 185-785 lines).
|
||||
GPT-5's finding count scaled up (35 vs 20-26 on smaller docs) while maintaining
|
||||
quality — no padding with obvious/low-value findings. Mini also scaled (21 vs
|
||||
6 on 459-line doc in Finding #14). Sonnet scaled less (17 vs 12-17 on smaller
|
||||
docs) — it appears to have a natural output ceiling regardless of document size,
|
||||
consistent with its self-filtering behavior observed in previous findings.
|
||||
@@ -0,0 +1,163 @@
|
||||
# Finding 20: Invariant violation path analysis: GPT-5 is maximally selective (3 findings, all genuine); Opus shows unique self-correcting analytical style; new task type favors precision over exhaustiveness
|
||||
|
||||
**Date:** 2026-05-04
|
||||
**Task:** Identify invariant violation paths in gargoyle's `user-pipeline-lifecycle.md`
|
||||
(730 lines) — sequences of legal operations that can violate the system's stated or
|
||||
implied invariants. NEW analytical lens not previously tested, distinct from
assumption-finding, race conditions, or coherence checking.
|
||||
**How we used them:** Same document (full text) + same focused analytical question to all
|
||||
3 models via HAI proxy. Highly structured prompt specifying 5 categories of invariant
|
||||
violations (state machine escapes, invariant composition failures, monotonicity violations,
|
||||
idempotency boundary violations, authority inversion sequences). Required specific output
|
||||
format per finding. No tools, no project context beyond the document itself.
|
||||
|
||||
| Model | Time | Output tokens | Reasoning tokens | Findings |
|---|---|---|---|---|
| GPT-5 | 143s | 784 | 12,032 | 3 |
| Claude Opus 4.6 | 113s | 6,183 | (internal) | 7 (with 2 self-corrections) |
| Claude Sonnet 4.6 | 23s | 1,266 | (internal) | 5 |
|
||||
|
||||
**What they found — common ground (2+ models identified):**
|
||||
|
||||
- **Periodic reconciliation overrides operator manual stop** (GPT-5 #3 + Opus #5 +
|
||||
Sonnet #1): An admin who stops a pipeline via `stop_user/1` with `:admin_action`
|
||||
has their decision overridden within 5 minutes by periodic reconciliation, because
|
||||
there's no "admin stopped" state in `check_eligibility/1`. All three models
|
||||
independently identified this as the clearest authority inversion.
|
||||
- **DynamicSupervisor restart bypasses eligibility gate** (Opus #1/#3 + Sonnet #2):
|
||||
When `UserPipeline.Supervisor` crashes and is restarted by OTP supervision, the
|
||||
restart bypasses `start_user/1` and `check_eligibility/1` entirely — potentially
|
||||
resuming trading while the kill switch is engaged.
|
||||
- **Stale ReconciliationGate after crash** (Opus #7 only, a further consequence of the
same restart path): After a crash-triggered DynamicSupervisor restart (not via
`stop_user/1`), the ReconciliationGate remains `:ready` from the previous instance
because `stop_user/1` (which resets it) was never called. The new OrderManager may
accept orders during its own reconciliation.
|
||||
- **HealthMonitor co-lifecycle violation** (Opus #2 + Sonnet #4): After a
|
||||
DynamicSupervisor-initiated restart, the HealthMonitor is still subscribed to the
|
||||
old PIDs — no code re-establishes monitoring for the new pipeline processes.
|
||||
|
||||
**GPT-5 unique findings (not in either other model):**
|
||||
|
||||
- **Kill switch bypass for users configured DURING engagement** (#1): A user who
|
||||
saves credentials while the kill switch is engaged is never added to the pending
|
||||
operator release set (only running pipelines are added at engage time). After
|
||||
disengage, periodic reconciliation auto-starts this user's pipeline without
|
||||
operator release — violating "resuming always requires human judgment." This is
|
||||
the most precisely reasoned finding across all three models: each step is
|
||||
individually correct per the spec, and the violation emerges purely from the
|
||||
composition of legal operations.
|
||||
- **Premature release bypass** (#2): If `operator_release_user/1` is called while
|
||||
the kill switch is still engaged (a legal operation), it clears the pending
|
||||
release flag but `start_user/1` correctly refuses. After later disengage, the
|
||||
flag is gone — auto-start proceeds without fresh operator judgment. The release
|
||||
was "spent" at the wrong time.
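Both GPT-5 sequences can be replayed against a toy model. The sketch below is a hypothetical Python reconstruction of the rules as summarized in this finding (pending-release set filled only from running pipelines at engage time, release allowed while engaged, periodic reconciliation auto-starting eligible users). It is not gargoyle's code: the class and method names only loosely mirror the identifiers mentioned above, and the real system is described in terms of Elixir/OTP.

```python
class ToyLifecycle:
    """Toy model of the kill-switch rules described above (assumed, simplified)."""

    def __init__(self):
        self.kill_switch_engaged = False
        self.configured = set()       # users with saved credentials
        self.running = set()          # users with a running pipeline
        self.pending_release = set()  # users that need operator release

    def save_credentials(self, user):
        self.configured.add(user)

    def engage_kill_switch(self):
        self.kill_switch_engaged = True
        # Only pipelines running at engage time are recorded as pending release.
        self.pending_release |= self.running
        self.running.clear()

    def disengage_kill_switch(self):
        self.kill_switch_engaged = False

    def operator_release_user(self, user):
        # Legal even while the kill switch is engaged; it just clears the flag.
        self.pending_release.discard(user)

    def check_eligibility(self, user):
        return (
            user in self.configured
            and not self.kill_switch_engaged
            and user not in self.pending_release
        )

    def start_user(self, user):
        if self.check_eligibility(user):
            self.running.add(user)
            return True
        return False

    def periodic_reconciliation(self):
        for user in sorted(self.configured - self.running):
            self.start_user(user)


def scenario_configured_during_engagement():
    """GPT-5 #1: bob saves credentials while the kill switch is engaged."""
    system = ToyLifecycle()
    system.save_credentials("alice")
    system.start_user("alice")
    system.engage_kill_switch()
    system.save_credentials("bob")       # never enters pending_release
    system.disengage_kill_switch()
    system.periodic_reconciliation()
    return system.running


def scenario_premature_release():
    """GPT-5 #2: the release is spent while the kill switch is still engaged."""
    system = ToyLifecycle()
    system.save_credentials("alice")
    system.start_user("alice")
    system.engage_kill_switch()
    system.operator_release_user("alice")    # clears the flag too early
    refused = not system.start_user("alice")  # correctly refused while engaged
    system.disengage_kill_switch()
    system.periodic_reconciliation()
    return refused, system.running


print(scenario_configured_during_engagement())
print(scenario_premature_release())
```

In the first scenario the user who was running before engagement stays blocked while the newly configured user auto-starts; in the second, the released user auto-starts after disengage with no fresh operator decision. Every individual call is legal under the toy rules, which is the point of the finding.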
|
||||
|
||||
**Claude Opus unique findings (not in either other model):**
|
||||
|
||||
- **`operator_release_system/0` clears unrelated safety obligations** (#4):
|
||||
Operator intends to release one user from a recent event but
|
||||
`operator_release_system/0` also releases other users still pending from an
|
||||
earlier, unresolved event. One release call discharges multiple independent
|
||||
safety obligations — monotonicity violation.
|
||||
- **State machine incompleteness for blocked users** (#6): Users who become
|
||||
configured during kill switch engagement (blocked with reason
|
||||
`:kill_switch_engaged`) have no state machine transition back to `starting`
|
||||
after disengage — they're not in the pending release set, and no event fires.
|
||||
System works via periodic reconciliation (up to 5 minutes delay), but the
|
||||
documented state machine doesn't represent this path.
|
||||
- **Self-correcting analytical style:** Opus explicitly withdrew two draft
|
||||
findings mid-analysis ("Actually, this sequence works as designed. Let me
|
||||
identify a real violation instead." / "this is likely handled"). This
|
||||
self-correction behavior was first observed in Finding #15 and is now
|
||||
confirmed as a consistent Opus trait for invariant-style analysis.
|
||||
|
||||
**Claude Sonnet unique findings (not in either other model):**
|
||||
|
||||
- **Cold-start Tier 3 failure creates supervision restart loop** (#2): A
|
||||
persistent Tier 3 failure (phantom fills) crashes OrderManager, `:rest_for_one`
|
||||
kills the tree, DynamicSupervisor restarts it, cold-start fails again → infinite
|
||||
loop. State machine shows `starting → stopped` but supervision creates
|
||||
`starting → starting` indefinitely.
|
||||
- **HealthMonitor start failure during start_user** (#4): If HealthMonitor.Supervisor
|
||||
is momentarily crashed when `start_user/1` runs step 4, the pipeline starts
|
||||
without monitoring. No error handling specified for this partial-start state.
|
||||
|
||||
**Quality assessment:**
|
||||
|
||||
- **GPT-5** was MAXIMALLY SELECTIVE — only 3 findings from 12,032 reasoning tokens
|
||||
(4,011 reasoning tokens per finding). This is the most extreme
|
||||
reasoning-to-output ratio observed: 15:1 (12,032 reasoning / 784 output tokens).
|
||||
For comparison, in previous experiments GPT-5 typically shows 1:1 to 2:1 ratios.
|
||||
Every finding is a genuine invariant violation with a precise, step-by-step
|
||||
sequence where each step is individually legal. ZERO false positives, zero
|
||||
padding, zero "this might be an issue." GPT-5 appears to have used almost all
|
||||
its reasoning budget for VERIFICATION — confirming that each candidate is
|
||||
genuinely a violation before including it.
|
||||
- **Claude Opus** produced the most findings (7) with its characteristic depth and
|
||||
self-correction. Two findings were revised mid-analysis, showing Opus actively
|
||||
testing its own reasoning against the document before committing to a finding.
|
||||
The DynamicSupervisor restart thread (findings #1, #2, #3, #7) forms a coherent
|
||||
cluster — Opus identified one root cause (OTP restarts bypass the lifecycle
|
||||
layer) and explored its multiple consequences. The `operator_release_system`
|
||||
monotonicity finding (#4) is architecturally significant and unique.
|
||||
- **Claude Sonnet** was extremely fast (23s, 1,266 tokens) and produced 5 findings.
|
||||
Quality was mixed: Finding #1 partially mirrors GPT-5's authority inversion but
|
||||
with vaguer reasoning ("race condition with ETS operations" — not specified).
|
||||
Finding #3 describes a contradiction but the scenario is internally inconsistent
|
||||
(step 5 says "pipeline termination fails" but then step 7 says pipeline is still
|
||||
running — this conflates two failure modes). Findings #2 and #4 are genuine and
|
||||
well-reasoned. Sonnet's precision is lower than the other two on this task.
|
||||
|
||||
**Key insight — "Invariant violation paths" as a task type:**
|
||||
|
||||
This is a genuinely DIFFERENT analytical task from any previously tested. It requires:
|
||||
1. Identifying the invariants (explicit or implied)
|
||||
2. Constructing a sequence of operations (creative/generative)
|
||||
3. Verifying each step is legal per the spec (verification)
|
||||
4. Confirming the end state violates the invariant (correctness proof)
|
||||
|
||||
This four-phase cognitive process explains GPT-5's extreme selectivity: steps 3 and 4 are
verification-heavy, and GPT-5's reasoning tokens appear to be spent there (confirming
that each step is genuinely legal and that the final state genuinely violates the
invariant). In previous tasks like "find hidden assumptions" or "find gaps," only step 1
(identification) is needed; there is no construction or verification phase.
|
||||
|
||||
**Comparison to previous task types:**
|
||||
|
||||
| Task type | GPT-5 findings | Opus findings | GPT-5 reasoning overhead |
|---|---|---|---|
| Hidden assumptions | 20-35 | 12-13 | 5-7K reasoning |
| Race conditions | 12 | 10 | 8K reasoning |
| Design coherence | 4 | 7 | 9K reasoning |
| Invariant violation paths | 3 | 7 | **12K reasoning** |
|
||||
|
||||
The pattern: as the task requires more VERIFICATION (vs identification), GPT-5 becomes
|
||||
more selective and spends more reasoning tokens per finding. Invariant violation paths
|
||||
demand the highest verification burden (every step must be confirmed legal), and GPT-5
|
||||
responds with the highest selectivity and reasoning investment.
|
||||
|
||||
Opus inverts relative to GPT-5: on verification-heavy tasks it reports MORE findings than
GPT-5 does (7 vs 4 for coherence, 7 vs 3 for invariant paths), whereas on identification
tasks it reports fewer (12-13 vs 20-35 for assumptions). This suggests Opus uses its
internal reasoning differently: it is more willing to present findings that have "likely"
rather than "proven" violations, then self-corrects inline if the verification fails.
|
||||
|
||||
**Practical implication:**
|
||||
|
||||
For invariant violation path analysis:
|
||||
- **GPT-5** produces the highest-precision findings but very few. Every finding is a
|
||||
genuine spec-level bug. Use when you need zero-false-positive bug reports to present
|
||||
to a design team.
|
||||
- **Opus** produces more findings with slightly lower precision but unique analytical
|
||||
depth. Its self-correction behavior means false positives are often caught inline.
|
||||
Use when you want both confirmed violations AND identified tensions.
|
||||
- **Sonnet** is too imprecise for this task type — some findings have internal
|
||||
inconsistencies. Use for lighter analytical tasks (assumption-finding, spec gaps).
|
||||
|
||||
The three findings GPT-5 produced are ALL genuine design bugs that should be fixed:
|
||||
1. Users configured during kill switch engagement bypass operator release
|
||||
2. Premature operator release (while KS still engaged) creates future bypass
|
||||
3. Admin stops are overridden by periodic reconciliation
|
||||
|
||||
These are the kind of findings that, in a real financial system, prevent production
|
||||
incidents. The 12K reasoning tokens to produce 3 perfect findings is excellent ROI.
|
||||
@@ -0,0 +1,125 @@
|
||||
# Finding 21: Reasoning effort (low/medium/high) has negligible effect on GPT-5's analytical output; the parameter may not work as documented for open-ended analysis
|
||||
|
||||
**Date:** 2026-05-04
|
||||
**Task:** Identify hidden assumptions in gargoyle's `order-state-machine.md` (221 lines)
|
||||
— a well-structured state machine specification covering order lifecycle, fill precedence,
|
||||
TIF semantics, and parameter resolution.
|
||||
**How we used them:** Same document, same prompt, same model (GPT-5), same
|
||||
max_completion_tokens (16K). Only variable: `reasoning.effort` parameter set to
|
||||
"low", "medium", or "high". Run sequentially via HAI proxy (OpenAI-compatible
|
||||
endpoint). No tools, no project context beyond the document.
|
||||
|
||||
| Effort | Time (ms) | Output tokens | Reasoning tokens | Findings |
|---|---|---|---|---|
| Low | 97,913 | 7,657 | 4,288 | 33 (+11 recs) |
| Medium | 94,824 | 7,112 | 4,160 | 30 |
| High | 88,607 | 6,891 | 3,712 | 30 |
|
||||
|
||||
**The counterintuitive result:** Higher reasoning effort produced FEWER findings,
|
||||
FEWER reasoning tokens, FEWER output tokens, and completed FASTER. The expected
|
||||
pattern (high effort → more reasoning → more depth) was inverted.
|
||||
|
||||
**Per-finding metrics (remarkably consistent):**
|
||||
|
||||
| Effort | Output tokens/finding | Reasoning tokens/finding |
|---|---|---|
| Low | 232 | 129 |
| Medium | 237 | 138 |
| High | 229 | 123 |
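The per-finding figures follow directly from the run table. A minimal sketch of the calculation, assuming the values were rounded down (integer division reproduces the table exactly; ordinary rounding would shift several entries up by one):

```python
# Token and finding counts copied from the effort comparison table.
runs = {
    "low": {"output": 7657, "reasoning": 4288, "findings": 33},
    "medium": {"output": 7112, "reasoning": 4160, "findings": 30},
    "high": {"output": 6891, "reasoning": 3712, "findings": 30},
}


def per_finding(run):
    """Return (output tokens per finding, reasoning tokens per finding), rounded down."""
    return run["output"] // run["findings"], run["reasoning"] // run["findings"]


for effort, run in runs.items():
    output_per, reasoning_per = per_finding(run)
    print(f"{effort:>6}: {output_per} output/finding, {reasoning_per} reasoning/finding")
```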
|
||||
|
||||
The depth per finding was nearly identical across all three levels. The model
didn't get more detailed or rigorous per finding at higher effort; it just
found slightly fewer things.
|
||||
|
||||
**Severity distributions (similar across all three):**
|
||||
- Low: 7 Critical, 21 High, 5 Medium (33 findings)
|
||||
- Medium: 9 Critical, 15 High, 4 Medium + 2 borderline (30 findings)
|
||||
- High: 6 Critical, 14 High, 5 Medium + 4 borderline (30 findings)
|
||||
|
||||
**Qualitative differences — WHAT they found:**
|
||||
|
||||
High-effort unique findings (not in low):
|
||||
- Single-writer authority to broker (no out-of-band modifications)
|
||||
- Broker emits fills for all executed quantities (no silent netting)
|
||||
- Instrument identity remains stable across corporate actions
|
||||
- Late-fill override won't violate downstream invariants
|
||||
- Validation covers lot sizes, price ticks, borrow/locate constraints
|
||||
- Multiple accounts and venues are part of the correlation key
|
||||
- Streaming and polling APIs are consistent
|
||||
- System can handle multi-leg instruments
|
||||
|
||||
Low-effort unique findings (not in high):
|
||||
- Acks arrive before fills (no pre-ack fills)
|
||||
- Cancel-before-ack handling (submitted → cancelled missing)
|
||||
- Fill totals never exceed requested quantity
|
||||
- Deterministic ordering within a broker stream
|
||||
- Exercise/assignment and non-order position changes
|
||||
- Client-side idempotency of "place order"
|
||||
- Partial accept/normalize on replace
|
||||
- No "child" order fragmentation at broker
|
||||
- Submitted state can receive terminal events
|
||||
- Late cancel vs local expired mismatch
|
||||
|
||||
**Character of the differences:**
|
||||
- HIGH-unique findings tend to be more **architectural/systemic** (multi-leg
|
||||
instruments, streaming vs polling consistency, downstream invariant violations,
|
||||
corporate actions). These require reasoning about the system's relationship
|
||||
to the broader world.
|
||||
- LOW-unique findings tend to be more **implementation-specific edge cases**
|
||||
(cancel-before-ack, pre-ack fills, child order fragmentation, partial accepts).
|
||||
These require reasoning about specific event interleavings and protocol details.
|
||||
|
||||
Both sets are valid and actionable. Neither is clearly "better." They represent
|
||||
different analytical modes — breadth-of-scope (high) vs depth-of-protocol (low).
|
||||
|
||||
**Key insight — reasoning_effort doesn't scale analysis linearly:**
|
||||
|
||||
Three possible explanations for the inverted behavior:
|
||||
|
||||
1. **GPT-5 already uses near-maximum reasoning for analytical tasks regardless
|
||||
of the effort parameter.** The ~4K reasoning tokens across all three levels
|
||||
(4288/4160/3712) are too similar to reflect a genuine effort gradient. The
|
||||
parameter may primarily affect OTHER task types (math, code, logic puzzles)
|
||||
where reasoning depth is more variable.
|
||||
|
||||
2. **Higher effort increases FILTERING, not exploration.** At high effort, GPT-5
|
||||
may spend more of its reasoning on VERIFYING whether findings are genuine
|
||||
before including them — similar to the extreme selectivity observed in
|
||||
Finding #20 (invariant violation paths, 12K reasoning for 3 findings). This
|
||||
would explain fewer findings despite theoretically "trying harder."
|
||||
|
||||
3. **The parameter has minimal practical effect for this model version.**
|
||||
The differences (33 vs 30 vs 30) are within normal stochastic variation.
|
||||
Repeated runs at the same effort level might show similar variance.
|
||||
|
||||
**The prompt cache hit on HIGH (2304 cached prompt tokens) may have slightly
|
||||
accelerated processing, but doesn't explain the reasoning token difference.**
|
||||
|
||||
**Comparison to previous findings:**
|
||||
In Finding #20 (invariant violation paths), GPT-5 used 12,032 reasoning tokens
|
||||
for 3 findings, an extreme case of verification behavior. Here, at every effort level on a
different task type (hidden assumptions), it uses ~4K reasoning for ~30 findings.
|
||||
This confirms that TASK TYPE is a far stronger predictor of GPT-5's reasoning
|
||||
behavior than the reasoning_effort parameter. The invariant violation prompt
|
||||
triggered deep verification; the assumption-finding prompt triggers broad
|
||||
exploration regardless of effort setting.
|
||||
|
||||
**Practical implication:**
|
||||
For open-ended analytical tasks (assumption-finding, gap analysis, spec review),
|
||||
the reasoning_effort parameter appears to have negligible practical effect on
|
||||
GPT-5. Don't bother tuning it for these tasks — the default is fine. The
|
||||
parameter may be more meaningful for:
|
||||
- Tasks with verifiable correct answers (math, logic)
|
||||
- Tasks where the model could short-circuit (simple questions)
|
||||
- Extremely long documents where exploration budget matters
|
||||
|
||||
For architecture review specifically: reasoning_effort is NOT a useful lever.
|
||||
Task framing (the prompt structure) and document selection remain the dominant
|
||||
variables for output quality. Save reasoning_effort tuning for coding/math tasks
|
||||
where the parameter was likely trained and evaluated.
|
||||
|
||||
**Open question:** Would running the same experiment 5x at each level show that
|
||||
the 33-vs-30 difference is within stochastic noise? If so, reasoning_effort is
|
||||
effectively a no-op for analytical prompts. If not, low-effort consistently
|
||||
produces more (less filtered) output, which could be useful for
brainstorming-style analysis where you want maximum coverage before manual triage.
|
||||
@@ -0,0 +1,180 @@
|
||||
# Finding 22: Silent correctness failures: NEW analytical lens reveals Opus's strength at domain/regulatory reasoning; GPT-5 shows regulatory depth; all models converge on compliance timestamp errors
|
||||
|
||||
**Date:** 2026-05-05
|
||||
**Task:** Identify scenarios where the mechanism produces SILENTLY INCORRECT results
|
||||
(not errors, not crashes — wrong financial calculations, wrong lot selections, or wrong
|
||||
compliance records that pass all validation) in gargoyle's `specid-lot-selection.md`
|
||||
(306 lines) — a financial system specification covering tax lot selection strategies,
|
||||
cost basis accounting, and IRS SpecID compliance.
|
||||
**How we used them:** Same document (full text) + same focused analytical question to
|
||||
all 3 models via HAI proxy. Highly structured prompt specifying 5 categories of silent
|
||||
incorrectness (stale data, semantic precision, ordering sensitivity, composition errors,
|
||||
temporal reference errors). Required specific output format per finding with concrete
|
||||
numerical examples of financial impact. No tools, no project context beyond the document.
|
||||
|
||||
| Model | Time | Output tokens | Reasoning tokens | Findings | Critical | High | Medium |
|---|---|---|---|---|---|---|---|
| GPT-5 | 147s | 13,006 | 10,496 | 7 | 2 | 2 | 3 |
| Claude Opus 4.6 | 119s | 5,902 | (internal) | 10 | 3 | 3 | 4 |
| Claude Sonnet 4.6 | 122s | 6,011 | (internal) | 6 | 3 | 3 | 0 |
|
||||
|
||||
**What they found — common ground (all 3 identified):**
|
||||
- `designation_at` = `DateTime.utc_now()` at processing time, NOT at actual
|
||||
designation time (manual selection was made at order submission, standing
|
||||
orders were configured earlier) — compliance record factually incorrect
|
||||
- Holding period calculation boundary errors (>365 days vs IRS "more than one
|
||||
year" rule, off-by-one at leap year boundaries, day-after-acquisition start)
|
||||
- HIFO tie-breaker `opened_at ASC` ignores tax_term dimension — selects
|
||||
long-term losses over short-term losses when both have identical cost basis,
|
||||
producing less tax-valuable outcomes
|
||||
- Strategy preference resolved at fill processing time, not at trade time
|
||||
(preference changes between trade and fill processing apply retroactively)
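The holding-period boundary issue is easy to demonstrate. The sketch below compares a naive `> 365 days` check with an anniversary-based check, assuming the "more than one year" rule means the sale must fall strictly after the one-year anniversary of acquisition. The function names are illustrative, and the February 29 fallback is an assumption of this sketch, not something the spec or the models stated.

```python
from datetime import date


def naive_is_long_term(acquired: date, sold: date) -> bool:
    """Day-count rule: long-term if more than 365 days elapsed."""
    return (sold - acquired).days > 365


def anniversary_is_long_term(acquired: date, sold: date) -> bool:
    """Calendar rule: long-term only if sold after the one-year anniversary."""
    try:
        anniversary = acquired.replace(year=acquired.year + 1)
    except ValueError:
        # Acquired on Feb 29: using Feb 28 of the next year is an assumption.
        anniversary = date(acquired.year + 1, 2, 28)
    return sold > anniversary


# A holding period that spans Feb 29, 2024, sold exactly on the anniversary.
acquired = date(2023, 6, 1)
sold = date(2024, 6, 1)

print((sold - acquired).days)                    # elapsed days across the leap day
print(naive_is_long_term(acquired, sold))        # day-count classification
print(anniversary_is_long_term(acquired, sold))  # calendar classification
```

When the holding period spans a leap day, a sale on the exact anniversary is 366 days after acquisition, so the two rules disagree: the day-count rule marks the lot long-term while the calendar rule still treats it as short-term. Both results pass type and schema validation, which is what makes the error silent.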
|
||||
|
||||
**GPT-5 unique findings (not in either Claude model):**
|
||||
- Corporate action applied late leaves a stale cost basis in HIFO: ROC/dividend reduces
|
||||
basis but if close/4 fires before apply_corporate_action/3, HIFO sorts on
|
||||
pre-adjusted basis AND records wrong realized P&L permanently. No mechanism
|
||||
to restate previously persisted LotClosed events. Concrete example: $2,000
|
||||
overstated loss from one trade.
|
||||
- `designation_at` fragmentation: a single sell consuming multiple lots calls
|
||||
DateTime.utc_now() per loop iteration, producing slightly different timestamps
|
||||
for what should be a single coherent designation event. Audit risk.
|
||||
- LIFO label in `selection_method` field: records "lifo" but for securities LIFO
|
||||
isn't an authorized tax method — the operation is legally SpecID electing
|
||||
newest lots. Downstream reporting may reject or misclassify.
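The `designation_at` fragmentation can be illustrated with an injected clock. This is a hypothetical Python sketch of the pattern only: the spec is described as calling `DateTime.utc_now()` inside the lot loop, and the function names, lot IDs, and fake clock here are invented for the illustration.

```python
from datetime import datetime, timedelta, timezone
from itertools import count


def make_fake_clock(start: datetime, step_ms: int = 3):
    """Return a clock that advances by step_ms on every call (deterministic)."""
    ticks = count()

    def clock() -> datetime:
        return start + timedelta(milliseconds=step_ms * next(ticks))

    return clock


def designate_per_lot(lot_ids, clock):
    """Fragmented: the clock is read once per lot inside the loop."""
    return {lot_id: clock() for lot_id in lot_ids}


def designate_once(lot_ids, clock):
    """Coherent: the clock is read once and shared by every lot in the sell."""
    designation_at = clock()
    return {lot_id: designation_at for lot_id in lot_ids}


start = datetime(2026, 5, 5, 14, 30, tzinfo=timezone.utc)
lots = ["lot-a", "lot-b", "lot-c"]

fragmented = designate_per_lot(lots, make_fake_clock(start))
coherent = designate_once(lots, make_fake_clock(start))

print(len(set(fragmented.values())))  # distinct timestamps for one sell
print(len(set(coherent.values())))    # single timestamp for one sell
```

Capturing the timestamp once before the loop keeps a multi-lot sell as a single designation event in the record; whether that timestamp should be the processing time at all is the separate issue raised in the common-ground findings.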
|
||||
|
||||
**Claude Opus unique findings (not in either other model):**
|
||||
- Realized P&L excludes commissions/fees: formula uses `sell_fill.price` (raw
|
||||
execution price) minus `lot.cost_basis`, not net proceeds. If cost_basis also
|
||||
excludes buy-side commissions, P&L is doubly overstated. Active trader doing
|
||||
1000 trades/year: ~$20,000+ cumulative P&L overstatement.
|
||||
- Position `average_cost` is meaningless under SpecID and potentially misleading:
|
||||
SpecID exists to exploit lot-level basis differences, but position-level average
|
||||
obscures this. If downstream consumers use average_cost for tax estimation,
|
||||
results can be 50%+ wrong per lot.
|
||||
- GenServer mailbox ordering determines lot-to-fill assignment for concurrent sells:
|
||||
two simultaneous fills for the same instrument get different lots based on network
|
||||
arrival timing. With different holding periods, produces $670+ tax difference
|
||||
without user awareness.
|
||||
- Wash sale rule completely unaddressed: system reports losses as realized/deductible
|
||||
without checking 30-day substantially identical purchase rule. Active trader
|
||||
harvesting $50,000 in losses could have $0 actually deductible — $18,500 tax gap.
|
||||
- `opened_at` semantics undefined: whether it's exchange execution time, GenServer
|
||||
arrival time, or settlement date affects every downstream calculation (FIFO/LIFO
|
||||
ordering, holding periods, tax terms). Network timing could produce wrong FIFO
|
||||
lot selection.
|
||||
|
||||
**Claude Sonnet 4.6 unique findings (not in either other model):**
|
||||
- Stale cost basis in manual lot picker during concurrent corporate actions: UI shows
|
||||
pre-action basis, user selects based on stale data, but close/4 only validates
|
||||
open/ownership/quantity — never re-validates that the selection rationale is still
|
||||
correct. No field records the discrepancy.
|
||||
- `average_cost` recomputation ordering ambiguity in event-sourced model: step 4
|
||||
recomputes from "updated lots" but step 3 (persist events) may not have completed
|
||||
— if implementation re-derives from event store rather than in-memory state, reads
|
||||
pre-closure lot quantities. Accumulates $500+ error per partial close.
|
||||
- Strategy fallback + config corruption silently overwrites selection method in
|
||||
compliance record: if config becomes invalid, fallback to :fifo is logged at
|
||||
:warning but LotClosed records `selection_method: "fifo"` — compliance record
|
||||
shows user "chose" FIFO when they configured HIFO. No field records intended vs
|
||||
actual strategy.
|
||||
|
||||
**Quality assessment:**
|
||||
- **Claude Opus** produced the most findings (10) with the broadest analytical scope.
|
||||
Several findings went BEYOND the document's mechanism to identify missing features
|
||||
that create silent incorrectness (wash sale rules, commission handling, opened_at
|
||||
semantics). This is a different analytical mode: Opus identified what the system
|
||||
SHOULD compute but DOESN'T, not just where the existing computation is wrong.
|
||||
The wash sale finding is the highest-impact across all three models — an active
|
||||
trader's entire tax-loss harvesting strategy could be invalid. The GenServer
|
||||
mailbox ordering finding shows characteristic Opus reasoning about emergent
|
||||
behavior from design decisions.
|
||||
- **GPT-5** produced fewer findings (7) but with extreme precision and specificity.
|
||||
Every finding includes concrete dollar amounts and specific field references.
|
||||
The corporate action stale basis finding is uniquely actionable — it identifies a
|
||||
specific race condition between two documented mechanisms (close/4 and
|
||||
apply_corporate_action/3) that produces permanently incorrect persisted data
|
||||
with no correction path. The designation_at fragmentation finding shows attention
|
||||
to implementation detail that neither Claude model noticed. GPT-5 used 10,496
|
||||
reasoning tokens for 7 findings (1,500 tokens/finding) — HIGH verification,
|
||||
consistent with Finding #20's pattern for precision-over-breadth tasks.
|
||||
- **Claude Sonnet 4.6** produced 6 findings with strong specificity and novel angles.
|
||||
The event-sourced recomputation ordering finding (#5) is architecturally subtle —
|
||||
it identifies a composition error between the walk-and-consume algorithm's step
|
||||
ordering and event-sourcing patterns. The strategy fallback compliance recording
|
||||
finding is a genuine audit hazard. However, Sonnet produced no Medium-severity
|
||||
findings — it either found Critical/High issues or filtered everything else out.
|
||||
This aligns with its established high-precision, high-self-filtering behavior.
|
||||
|
||||
**Key insight — "Silent correctness" as an analytical lens:**
|
||||
|
||||
This is the FIRST experiment testing a "silent incorrectness" prompt. The key
|
||||
difference from previous analytical lenses:
|
||||
- **Assumption-finding:** "What must be true for this to work?" (Finding #10-12)
|
||||
- **Race conditions:** "What timing issues exist?" (Finding #13)
|
||||
- **Design coherence:** "Does the design contradict itself?" (Finding #15)
|
||||
- **Invariant violations:** "What operation sequences break invariants?" (Finding #20)
|
||||
- **Silent correctness:** "Where does the system CONFIDENTLY produce WRONG output
|
||||
with NO indication of error?"
|
||||
|
||||
The silent correctness lens produced qualitatively different findings from all
|
||||
previous lenses. The emphasis on "passes all validation" forced models to reason
|
||||
about what SHOULD be validated but ISN'T, and about semantic correctness (regulatory
|
||||
requirements, financial accounting rules) vs syntactic correctness (valid types,
|
||||
non-nil fields, correct schema).
|
||||
|
||||
This lens also revealed a key model differentiation not seen before:
|
||||
- **Opus reasons about MISSING functionality** (wash sales, commissions, opened_at
|
||||
semantics) — things the system should do but doesn't
|
||||
- **GPT-5 reasons about EXISTING functionality being wrong** (corporate action race,
|
||||
designation fragmentation, LIFO labeling) — things the system does but incorrectly
|
||||
- **Sonnet reasons about COMPOSITION failures** (event-sourcing step ordering,
|
||||
strategy fallback propagation) — things that are individually correct but combine
|
||||
incorrectly
|
||||
|
||||
These are three genuinely different analytical modes, not just "more/less thorough."
|
||||
All three are valuable for different review outcomes: Opus for feature completeness,
|
||||
GPT-5 for mechanism correctness, Sonnet for integration correctness.
|
||||
|
||||
**Financial domain advantage:**
|
||||
|
||||
This is the first experiment on a document with strong regulatory/financial semantics.
|
||||
All three models demonstrated domain knowledge (IRS holding period rules, Treas. Reg.
|
||||
1.1012-1(c) requirements, wash sale IRC §1091, long-term/short-term capital gains
|
||||
rate differentials). Opus in particular referenced specific IRC sections and provided
|
||||
concrete tax rate calculations. The "silent incorrectness" lens works especially well
|
||||
on financial/regulatory documents because the gap between "syntactically valid output"
|
||||
and "semantically/legally correct output" is large and consequential.
|
||||
|
||||
**Comparison to previous findings on the same models:**

| Task type | GPT-5 findings | Opus findings | Sonnet findings | Opus > GPT-5? |
|---|---|---|---|---|
| Hidden assumptions (#10-12) | 20-35 | 12-13 | 13-17 | No |
| Race conditions (#13) | 12 | 10 | 7 | No |
| Design coherence (#15) | 4 | 7 | 5 | **Yes** |
| Invariant violations (#20) | 3 | 7 | 5 | **Yes** |
| Silent correctness (#22) | 7 | 10 | 6 | **Yes** |

Pattern confirmed: Opus outperforms GPT-5 (by finding count) on tasks that require
|
||||
reasoning about the design's RELATIONSHIP to external requirements (regulatory,
|
||||
financial, consumer expectations). GPT-5 outperforms Opus on tasks that require
|
||||
EXHAUSTIVE EXPLORATION within a self-contained system (assumptions, race conditions).
|
||||
|
||||
The "silent correctness" lens is structurally similar to coherence checking (does the
|
||||
system match its external requirements?) rather than gap-finding (what's missing
|
||||
within the system?). This explains why Opus outperforms: the task requires reasoning
|
||||
about the world outside the document (IRS rules, financial accounting standards,
|
||||
regulatory requirements), which is Opus's strength.
|
||||
|
||||
**Practical implication:**
|
||||
For financial/regulatory system review, the "silent correctness" lens should be
|
||||
run using Opus as the primary model (broadest findings including missing-feature
|
||||
identification) plus GPT-5 for mechanism-level precision. Sonnet adds value for
|
||||
composition/integration issues that neither Opus nor GPT-5 catches. All three
|
||||
produced unique, actionable findings that the others missed.
|
||||
|
||||
The findings all three models converged on (designation_at, holding period, HIFO
tie-breaker, strategy preference timing) should be treated as confirmed design
bugs requiring fixes. That three independent models each identified them with
concrete financial impact examples increases confidence that these are real.
|
||||
@@ -0,0 +1,193 @@
|
||||
# Finding 23: Regulatory compliance analysis: GPT-5 finds broadest scope of IRS issues; Opus self-corrects mid-analysis; all models converge on cross-account gap
|
||||
|
||||
**Date:** 2026-05-05
|
||||
**Task:** Identify where gargoyle's `wash-sale-tracking.md` (391 lines) could produce
|
||||
incorrect tax reporting compared to IRS wash sale regulations (IRC 1091). NEW
|
||||
analytical lens: regulatory compliance verification — asking models to reason about
|
||||
a code implementation's correctness against EXTERNAL regulatory requirements (not
|
||||
internal system assumptions or race conditions).
|
||||
**How we used them:** Same document (full text) + same focused analytical question
|
||||
to all 3 models via HAI proxy. Prompt structured around 5 categories: regulatory
|
||||
gaps, interpretation errors, edge cases at regulatory boundaries, cross-account/entity
|
||||
concerns, and interaction with other IRC sections. Required specific regulatory
|
||||
citations, implementation analysis, concrete tax errors, and audit risk levels.
|
||||
No tools, no project context beyond the document.
|
||||
|
||||
| Model | Time | Output tokens | Reasoning tokens | Findings |
|---|---|---|---|---|
| GPT-5 | 178s | 12,525 | 9,536 | 16 |
| Claude Opus 4.6 | 155s | 7,326 | (internal) | 16 (with 2 self-corrections/withdrawals) |
| Claude Sonnet 4.6 | 40s | 1,818 | (internal) | 12 |

**What they found — common ground (all 3 identified):**
|
||||
- Cross-account/IRA/external broker wash sales not tracked (IRS applies at taxpayer level)
|
||||
- Options/contracts to acquire stock not triggering wash sales (explicit in IRC 1091(a) text)
|
||||
- "Substantially identical" definition too narrow (same index ETFs, share classes, ADRs)
|
||||
- Trade date vs settlement date ambiguity in opened_at/closed_at
|
||||
- Short sale wash sales not addressed
|
||||
- Section 475 mark-to-market traders incorrectly subjected to wash sale tracking
|
||||
- IRC 1092 straddle rules interaction not addressed
|
||||
- Related party / spousal transactions not considered
|
||||
- Corporate action identity changes breaking matching
|
||||
|
||||
**GPT-5 unique findings (not in either other model):**
|
||||
- **Per-share vs lot-level basis tacking** (#1): The system applies `disallowed_loss`
|
||||
and `tacked_opened_at` at the LOT level, but IRS requires per-share treatment
|
||||
when only partial shares are matched. A lot of 100 shares where only 60 trigger
|
||||
wash sale should have per-share basis segregation — the system inflates basis for
|
||||
all 100 shares. **Most architecturally significant finding** — a fundamental
|
||||
design-level error, not a missing feature.
|
||||
- **IRA permanent disallowance** (#2): When replacement purchase is in an IRA, the
|
||||
loss is PERMANENTLY lost (no basis adjustment possible in tax-deferred accounts).
|
||||
System either incorrectly applies basis adjustment inside IRA or misses it entirely.
|
||||
- **Instruments not subject to §1091** (#4): §1256 contracts (futures, index options),
|
||||
cryptocurrency, and §475 elections are all exempt — system may over-disallow.
|
||||
- **Average-cost mutual fund basis** (#11): Wash sale adjustments for funds using
|
||||
average-cost method require different math than discrete lot-level adjustments.
|
||||
- **ADRs vs local shares** (#14): ADRs and underlying foreign ordinaries are
|
||||
substantially identical but have different instrument_ids.
|
||||
- **RSU vestings/ESPP purchases** (#15): Equity compensation creating lots via
|
||||
corporate action paths may not trigger `check_replacement/2`.
|
||||
- **Ordering priority between pre/post sale purchases** (#10): Industry convention
|
||||
(post-sale first, then pre-sale) may differ from system's strict chronological
|
||||
ordering, causing 1099-B mismatches.
|
||||
|
||||
**Claude Opus unique findings (not in either other model):**
|
||||
- **Year-end boundary timing** (#5): Loss in December + replacement in January means
|
||||
tax reports generated between Dec 31 and the replacement purchase date are incorrect.
|
||||
Forward detection fires retroactively but users may have already filed. System needs
|
||||
a "30-day pending window" for year-end reports.
|
||||
- **Form 8949 reporting format** (#6): IRS requires code "W" in column (f) and
|
||||
specific adjustment amounts in column (g). System doesn't describe how `tax_summary/3`
|
||||
produces Form 8949-compatible output — potential CP2000 notice triggers from
|
||||
automated IRS matching against broker 1099-B.
|
||||
- **"Open lots" query in backward detection** (#10): If backward detection only
|
||||
queries currently-open lots, it misses replacements that were acquired AND SOLD
|
||||
within the window. IRS looks at acquisition regardless of current holding status.
|
||||
(Rev. Rul. 56-602)
|
||||
- **Forward detection loss ordering unspecified** (#7): When multiple prior losses
|
||||
compete for the same replacement shares, ordering matters — different allocation
|
||||
produces different basis amounts on the replacement lot.
|
||||
- **DRIP reinvestments triggering wash sales** (#9): Dividend reinvestment creates
|
||||
new lots that should trigger forward detection but may not if only buy fills
|
||||
produce `LotOpened` events.
|
||||
- **Self-correcting analytical style (CONFIRMED):** Opus withdrew Finding #4
|
||||
entirely mid-analysis ("Revised assessment: holding period logic appears correct.
|
||||
I withdraw the claim of error"). Spent ~500 words reasoning through the holding
|
||||
period tacking logic, found it correct, and explicitly retracted. This is now
|
||||
confirmed across Findings #15, #20, and #23 as a consistent Opus behavior for
|
||||
verification-heavy regulatory analysis.
|
||||
|
||||
**Claude Sonnet unique findings (not in either other model):**
|
||||
- **Entity-level tracking for partnerships/S-Corps** (#4.2): Tax-transparent entities
|
||||
trading through the platform need K-1 reporting to partners — user-scoped model
|
||||
doesn't address pass-through entity wash sale reporting.
|
||||
- **Constructive sale integration (IRC 1259)** (#4.1): Short positions or derivatives
|
||||
creating constructive ownership interact with wash sale determination in ways not
|
||||
addressed.
|
||||
- **NOL carryforward interaction** (#5.3): Wash sale deferrals affect character and
|
||||
timing of losses contributing to NOL calculations across tax years.
|
||||
|
||||
**Quality assessment:**
|
||||
- **GPT-5** produced the broadest regulatory scope (16 findings) with the most
|
||||
specific IRS citations (Rev. Rul. 2008-5, Pub. 550, IRC §§267, 1091, 1092, 1222,
|
||||
1223, 1256, 475). Its per-share vs lot-level finding (#1) is the only one that
|
||||
identifies a FUNDAMENTAL DESIGN ERROR (not a missing feature). Most findings from
the other models say "you don't handle X"; GPT-5's #1 says "what you DO handle is
handled INCORRECTLY." This distinction matters: missing features are known scope
limitations; incorrect logic is a bug.
|
||||
- **Claude Opus** matched GPT-5's count (16 with 2 self-corrections = 14 net
|
||||
confirmed) but with different character. Opus excelled at identifying OPERATIONAL
|
||||
implications (year-end boundary timing, Form 8949 format requirements, forward
|
||||
detection ordering) rather than just statutory gaps. Its findings tend to describe
|
||||
HOW the gap manifests in practice ("user files taxes, then January purchase
|
||||
retroactively invalidates the filing") vs GPT-5's approach of citing the statute
|
||||
and describing the theoretical violation.
|
||||
- **Claude Sonnet** was fast (40s) and produced 12 competent findings but with less
|
||||
regulatory precision. Findings lacked specific IRS citations (no Rev. Rul.
|
||||
references, no Treas. Reg. citations). Several findings overlapped heavily with
|
||||
common ground items without adding unique depth. The entity-level and
|
||||
constructive sale findings show awareness of tax complexity but are relatively
|
||||
generic ("this is complex and not addressed").
|
||||
|
||||
**Key insight — regulatory compliance as a distinct task type:**
|
||||
|
||||
This experiment tests a fundamentally different cognitive demand than previous ones:
|
||||
previous tasks asked "what could go wrong with this system?" (internal reasoning).
|
||||
This task asks "does this system correctly implement external rules?" (external
|
||||
reasoning). The model must hold TWO bodies of knowledge simultaneously: the
|
||||
implementation spec AND the regulatory framework, then find mismatches.
|
||||
|
||||
All three models had strong tax law knowledge — they cited IRC sections, Revenue
|
||||
Rulings, and Treasury Regulations correctly. The differentiation wasn't in legal
|
||||
knowledge but in HOW they applied it:
|
||||
|
||||
- **GPT-5:** Exhaustive statutory mapping ("here's every IRC section that touches
|
||||
wash sales; here's where the implementation falls short on each"). Breadth-first
|
||||
coverage. Found the most issues by sheer scope of regulatory awareness.
|
||||
- **Opus:** Operational consequence reasoning ("here's how this gap manifests as
|
||||
a real-world problem for the user/auditor"). Found issues by reasoning about
|
||||
the implementation's interaction with real-world workflows (filing deadlines,
|
||||
form formats, broker reconciliation).
|
||||
- **Sonnet:** Category-based analysis ("here are cross-account issues, here are
|
||||
entity issues, here are interaction issues"). Followed the prompt structure
|
||||
closely but didn't go deep within each category.
|
||||
|
||||
**The per-share vs lot-level finding (GPT-5 #1) — why it matters:**
|
||||
|
||||
This is the experiment's most important result. Every model found missing features
|
||||
(options, cross-account, short sales) — those are SCOPE limitations that the
|
||||
document itself acknowledges or defers. GPT-5 uniquely found a correctness bug in
|
||||
the IMPLEMENTED logic: the system's lot-level basis adjustment is mathematically
|
||||
wrong for partial wash sales.
|
||||
|
||||
Example 1 (no bug): a loss lot of 100 shares is matched against a replacement lot
of 60 shares, so `adjusted_quantity = min(loss_quantity, replacement_quantity) = 60`.
Only 60 of the loss shares trigger the wash sale, and the disallowed portion (60%
of the loss) is added to the replacement lot's basis. Every share of the
replacement lot is matched, so applying the basis adjustment and the
`tacked_opened_at` to the whole lot is correct.

Example 2 (the bug): a loss of 60 shares is matched against a replacement lot of
100 shares, so only 60 of the 100 replacement shares are affected. The system
still sets `tacked_opened_at` on the entire lot. Whenever `adjusted_quantity <
replacement_quantity`, the non-matched shares (here, 40) receive an incorrect
holding period classification. This is the genuine bug GPT-5 identified.
|
||||
|
||||
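The case of a 60-share loss matched against a 100-share replacement lot can be made concrete with a minimal Python sketch of per-share handling. It assumes one possible fix, splitting the replacement lot into a matched part and an unmatched part. The `Lot` class, the `apply_wash_sale` function, and its parameters are hypothetical names used for illustration; they are not taken from gargoyle's implementation. Only `tacked_opened_at` mirrors a field named in the document, and how the tacked date itself is derived is outside the scope of this sketch.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal
from typing import List, Optional


@dataclass(frozen=True)
class Lot:
    """Hypothetical lot record; field names are illustrative."""
    quantity: int
    basis: Decimal                            # total cost basis of the lot
    opened_at: date
    tacked_opened_at: Optional[date] = None   # set only on wash-sale-adjusted shares


def apply_wash_sale(replacement: Lot, loss_quantity: int, total_loss: Decimal,
                    tacked_opened_at: date) -> List[Lot]:
    """Split a replacement lot so only the matched shares are adjusted."""
    matched = min(loss_quantity, replacement.quantity)
    per_share_basis = replacement.basis / replacement.quantity
    per_share_disallowed = total_loss / loss_quantity

    # Matched shares receive the disallowed loss and the tacked holding period.
    adjusted = Lot(
        quantity=matched,
        basis=(per_share_basis + per_share_disallowed) * matched,
        opened_at=replacement.opened_at,
        tacked_opened_at=tacked_opened_at,
    )
    remainder = replacement.quantity - matched
    if remainder == 0:
        return [adjusted]

    # Unmatched shares keep their original basis and holding period.
    untouched = Lot(
        quantity=remainder,
        basis=per_share_basis * remainder,
        opened_at=replacement.opened_at,
    )
    return [adjusted, untouched]
```

Under these assumptions, a 100-share replacement lot matched against a 60-share loss yields two lots: 60 shares with adjusted basis and a tacked date, and 40 shares left unchanged.
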
**Updated task-type taxonomy:**
|
||||
|
||||
| Task type | Primary cognitive demand | Best model |
|---|---|---|
| Hidden assumptions | Breadth identification (what's not stated?) | GPT-5 (exhaustive) |
| Race conditions | Sequential temporal reasoning | GPT-5 + Opus |
| Cross-component interactions | Component boundary reasoning | GPT-5 + Sonnet |
| Design coherence | Internal consistency checking | Opus |
| Invariant violation paths | Construction + verification | GPT-5 (precision) |
| Silent correctness | External requirement matching | Opus |
| **Regulatory compliance** | **Dual-knowledge-base comparison** | **GPT-5 (breadth) + Opus (operations)** |

Regulatory compliance is closest to "silent correctness" (Finding #22) in that
|
||||
both require reasoning about external requirements. The key difference:
|
||||
- Silent correctness asks "does this produce correct outputs for all inputs?"
|
||||
- Regulatory compliance asks "does this implement the law correctly?"
|
||||
|
||||
Both favor models that reason about the system's relationship to the outside
|
||||
world (Opus's strength), but regulatory compliance also rewards breadth of
|
||||
statutory knowledge (GPT-5's strength). The combination produces the most
|
||||
complete picture.
|
||||
|
||||
**Practical implication:**
|
||||
For regulatory compliance review of financial systems:
|
||||
- Run GPT-5 for exhaustive statutory coverage (finds the most gaps)
|
||||
- Run Opus for operational impact analysis (finds how gaps manifest in practice)
|
||||
- Sonnet adds marginal value — use only if budget allows
|
||||
- GPT-5's unique strength: identifying correctness bugs in implemented logic
|
||||
(not just missing features)
|
||||
- Opus's unique strength: identifying timing/workflow issues (year-end, form
|
||||
reporting, reconciliation with broker)
|
||||
@@ -0,0 +1,152 @@
|
||||
# Finding 24: Design improvement proposals: GPT-5 excels at defense-in-depth thinking; Opus finds subtle design contradictions; Sonnet produces generic recommendations
|
||||
|
||||
**Date:** 2026-05-05
|
||||
**Task:** Propose specific design improvements for gargoyle's `kill-switch.md` (185 lines)
|
||||
— the primary safety mechanism that prevents rogue orders. NEW task type: generative/
|
||||
creative ("what would you improve?") rather than purely analytical ("what's wrong?").
|
||||
**How we used them:** Same document (full text) + same focused prompt to all 3 models
|
||||
via HAI proxy. Prompt asked for 8-15 specific improvements with: weakness, proposed
|
||||
change (concrete), tradeoff, severity rating. Explicitly excluded generic advice
|
||||
("add more tests") and asked about runtime assumptions. No tools, no project context.
|
||||
|
||||
| Model | Time | Output tokens | Reasoning tokens | Improvements proposed |
|---|---|---|---|---|
| GPT-5 | 118s | 8,710 | 6,016 | 15 |
| Claude Opus 4.6 | 127s | 4,985 | (internal) | 15 |
| Claude Sonnet 4.6 | 40s | 1,636 | (internal) | 12 |

**What they found — common ground (all 3 identified):**
|
||||
- DB write failure blocking engagement (fail-open under DB outage) — all three
|
||||
proposed in-memory-first engagement with async persistence
|
||||
- Kill switch process liveness monitoring (heartbeat/watchdog)
|
||||
- Broker connectivity loss during cancellation operations
|
||||
- ETS table ownership and crash-window vulnerability
|
||||
- Supervisor restart suppression as unstated mechanism
|
||||
- Per-venue/per-broker scope extension
|
||||
|
||||
**GPT-5 unique findings (not in either other model):**
|
||||
- **Infrastructure-level "hard kill"** — egress proxy or service mesh that blocks
|
||||
broker traffic independently of the application. Belt-and-suspenders approach
|
||||
where the kill switch works even if the entire BEAM VM is unresponsive. This
|
||||
was GPT-5's highest-impact unique insight.
|
||||
- **Kill fence token (epoch)** — every order-carrying message includes an epoch;
|
||||
stale-epoch messages are dropped at the gate. Elegantly solves in-flight
|
||||
messages without needing drain timeouts.
|
||||
- **Cluster/multi-node propagation** — detailed leader election + epoch broadcast
|
||||
+ fail-closed on partition design.
|
||||
- **Post-engage broker verification** — query broker AFTER engaging to confirm no
|
||||
orders slipped through during the engagement window.
|
||||
- **Liquidation exposure validation** — proving tagged liquidation orders actually
|
||||
REDUCE exposure rather than trusting the tag.
|
||||
- **Recovery/cold-start order suppression** — ensuring reconciliation/recovery
|
||||
routines can't submit orders while engaged.
|
||||
- **Engage latency reordering** — ETS first, terminate second, DB async.
|
||||
- **Audit log tamper evidence** — append-only external sink + hash chain.
|
||||
|
||||
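The kill fence token idea from the list above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not gargoyle's design: `KillFence`, `OrderMessage`, and every method name are hypothetical, and a real BEAM implementation would more likely keep the epoch in ETS than behind a lock.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class OrderMessage:
    """Hypothetical order-carrying message stamped with the epoch at creation."""
    order_id: str
    epoch: int


class KillFence:
    """Gate that drops messages stamped with a stale epoch."""

    def __init__(self) -> None:
        self._epoch = 0
        self._engaged = False
        self._lock = threading.Lock()

    def current_epoch(self) -> int:
        with self._lock:
            return self._epoch

    def engage(self) -> None:
        # Bumping the epoch invalidates every message already in flight.
        with self._lock:
            self._epoch += 1
            self._engaged = True

    def disengage(self) -> None:
        # The epoch is not rolled back, so pre-engagement messages stay stale.
        with self._lock:
            self._engaged = False

    def admit(self, msg: OrderMessage) -> bool:
        with self._lock:
            return (not self._engaged) and msg.epoch == self._epoch
```

Because the epoch only moves forward, a message created before engagement is rejected even after the switch is disengaged, which is what removes the need for a drain timeout in this sketch.
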
**Claude Opus unique findings (not in either other model):**
|
||||
- **Ordering contradiction in engagement sequence** — identified that the
|
||||
documented order (DB → ETS → terminate) creates a specific risk if a crash
|
||||
occurs BETWEEN termination and ETS update (not just DB failure). The insight
|
||||
is about the window where termination has started but gate is still open.
|
||||
More subtle than GPT-5's version (which focused on DB-blocking-engage).
|
||||
- **Concurrent engagement race (mode escalation)** — multiple triggers
|
||||
simultaneously issuing conflicting modes (RESTRICT vs LIQUIDATE). Proposed
|
||||
explicit escalation rules (LIQUIDATE always wins) with GenServer serialization.
|
||||
- **Shared resources under per-user scope** — per-user kill switch doesn't
|
||||
address orders in shared broker connection buffers. Forces architectural
|
||||
decision about connection pooling strategy.
|
||||
- **Clock/time integrity for audit log** — monotonic counters + NTP validation
|
||||
for forensic reliability.
|
||||
- **Partial multi-user engagement failures** — what happens when global engage
|
||||
successfully terminates 4/5 user pipelines but one has orphaned processes.
|
||||
- **Liquidation direction validation** — similar to GPT-5's exposure validation
|
||||
but framed differently: checking corrupted position records could cause
|
||||
liquidation to OPEN positions rather than close them.
|
||||
- **Process termination verification** — checking that `:kill` signals actually
|
||||
worked (defense against trap_exit, NIF blocking).
|
||||
- **Engagement latency SLA** — defining a 50ms target with monitoring/alerting.
|
||||
|
||||
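The escalation rule Opus proposed (LIQUIDATE always wins, with serialized engagement) might look like the following Python sketch. The `Mode` values other than RESTRICT and LIQUIDATE, the `KillSwitchState` class, and the use of a lock in place of GenServer serialization are all assumptions made for illustration.

```python
import threading
from enum import IntEnum


class Mode(IntEnum):
    OFF = 0        # hypothetical "not engaged" value, not named in the document
    RESTRICT = 1
    LIQUIDATE = 2  # most severe mode; always wins


class KillSwitchState:
    """Serializes engagement requests and never downgrades the active mode."""

    def __init__(self) -> None:
        self._mode = Mode.OFF
        self._lock = threading.Lock()

    def engage(self, requested: Mode) -> Mode:
        # The lock stands in for GenServer message serialization.
        with self._lock:
            self._mode = max(self._mode, requested)
            return self._mode
```

With this rule, the order in which concurrent triggers arrive does not change the final mode: any sequence containing LIQUIDATE ends in LIQUIDATE.
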
**Claude Sonnet findings (all also present in GPT-5 or Opus, differently framed):**
|
||||
- No genuinely unique improvements that GPT-5 or Opus didn't also identify.
|
||||
- Several were generic: "missing resource cleanup," "circuit breaker integration,"
|
||||
"performance monitoring" — exactly the kind of advice the prompt tried to
|
||||
exclude.
|
||||
- The "missing heartbeat" and "network partition handling" proposals were solid
|
||||
but less detailed than the corresponding GPT-5/Opus versions.
|
||||
|
||||
**Quality assessment:**
|
||||
- **GPT-5** produced the most ACTIONABLE improvements. Its proposals were
|
||||
architecturally concrete ("add an egress proxy," "use kill epochs in messages,"
|
||||
"query broker post-engage") and showed defense-in-depth thinking — multiple
|
||||
independent layers rather than fixing one path. The infrastructure kill (#2)
|
||||
is genuinely novel: no other model proposed going OUTSIDE the application
|
||||
boundary for safety enforcement. GPT-5 consistently thought about "what if
|
||||
this entire runtime is compromised?" rather than just fixing within-app paths.
|
||||
- **Claude Opus** produced equally numerous improvements (15) with characteristic
|
||||
precision about failure SEQUENCES. Its unique strength: identifying design
|
||||
contradictions rather than just gaps (the engagement ordering issue, concurrent
|
||||
mode escalation, shared-resource scope mismatch). Opus's proposals were more
|
||||
"fix the design tension" while GPT-5's were more "add another safety layer."
|
||||
Opus also included the process termination verification and engagement latency
|
||||
SLA — operational rigor that GPT-5 skipped.
|
||||
- **Claude Sonnet** produced 12 proposals in 40s (fast) but quality was notably
|
||||
lower. Several proposals were generic software engineering advice that the
|
||||
prompt explicitly excluded ("add performance monitoring," "resource cleanup").
|
||||
No unique insights emerged. Sonnet's proposals lacked the architectural depth
|
||||
of GPT-5 (no outside-the-application thinking) and the design-tension
|
||||
identification of Opus.
|
||||
|
||||
**Key insight — generative vs analytical tasks:**
|
||||
|
||||
This is the first experiment testing a GENERATIVE task ("propose improvements")
|
||||
rather than a purely analytical one ("find problems"). The results reveal:
|
||||
|
||||
1. **GPT-5's defense-in-depth thinking is unique.** In analytical tasks, GPT-5
|
||||
finds exhaustive lists of issues. In generative tasks, it proposes LAYERED
|
||||
solutions — multiple independent mechanisms that each catch what the others
|
||||
miss. The infrastructure kill proposal (external to the application) shows
|
||||
GPT-5 reasoning about failure modes that are invisible to within-app analysis.
|
||||
|
||||
2. **Opus's design-tension identification transfers to improvement proposals.**
|
||||
In analytical tasks, Opus finds where parts of a design contradict each other.
|
||||
In generative tasks, this manifests as proposals that RESOLVE tensions rather
|
||||
than just adding patches. The engagement ordering contradiction and mode
|
||||
escalation rules are both "this design says X but the mechanism allows Y —
|
||||
here's how to make them consistent."
|
||||
|
||||
3. **Sonnet doesn't transfer well to generative tasks.** In analytical tasks
|
||||
(assumption-finding, cross-component analysis), Sonnet performs well (85% of
|
||||
GPT-5 in some experiments). In generative tasks, it falls back to generic
|
||||
engineering advice. The task requires both identifying problems AND proposing
|
||||
concrete solutions — Sonnet handles the first step but not the second with
|
||||
sufficient depth.
|
||||
|
||||
**Comparison to analytical task performance:**
|
||||
|
||||
| Task type | GPT-5 character | Opus character | Sonnet character |
|---|---|---|---|
| Assumption-finding (#10-12) | Exhaustive breadth | Design tensions | Good (85% of GPT-5) |
| Race conditions (#13) | Technical precision | Design contradictions | Weak (errors) |
| Invariant violations (#20) | Maximum selectivity | Self-correcting depth | Imprecise |
| **Design improvements (#24)** | **Defense-in-depth layers** | **Tension resolution** | **Generic advice** |

The generative task exposes each model's characteristic reasoning style more clearly
than the analytical tasks did. GPT-5 constructs multi-layered solutions. Opus
identifies what a design SHOULD be (not just what's wrong). Sonnet pattern-matches
against known engineering practices without deep synthesis.
|
||||
|
||||
**Practical implication:**
|
||||
|
||||
For design improvement sessions on safety-critical systems:
|
||||
- Run GPT-5 for defense-in-depth proposals ("what layers should exist?")
|
||||
- Run Opus for design consistency proposals ("where does the design contradict itself?")
|
||||
- Skip Sonnet — its output is indistinguishable from generic checklists
|
||||
- The combination of GPT-5 + Opus produces complementary improvements: GPT-5 adds
|
||||
safety layers, Opus fixes internal contradictions. Together they address both
|
||||
"not enough protection" and "protection mechanisms that work against each other."
|
||||
|
||||
**Cost analysis:**
|
||||
GPT-5: 118s, ~10.9K tokens (6K reasoning). Opus: 127s, ~5K tokens. Sonnet: 40s, ~1.6K tokens.
|
||||
For a safety-critical design review, running GPT-5 + Opus costs ~16K tokens and produces
|
||||
30 improvements with near-zero overlap in unique insights. Excellent ROI for a kill switch
|
||||
design that protects real money.
|
||||
@@ -0,0 +1,154 @@
|
||||
# Finding 25: Contradiction detection: NEW task type — Opus excels at finding genuine contradictions with precision; GPT-5 is thorough but spends heavily on reasoning; Sonnet finds surface-level issues quickly
|
||||
|
||||
**Date:** 2026-05-05
|
||||
**Task:** Identify internal contradictions, logical inconsistencies, and conflicting rules
|
||||
in gargoyle's `order-state-machine.md` (311 lines) — a document defining states,
|
||||
transitions, invariants, fill precedence rules, and time-in-force behavior.
|
||||
**How we used them:** Same document (full text) + same focused analytical question to all
|
||||
3 models via HAI proxy. Prompt specifically asked for: state machine contradictions,
|
||||
semantic conflicts, rule violations, implicit contradictions, and terminology
|
||||
inconsistencies. Required each finding to quote the conflicting statements, explain
|
||||
the logical argument, assign severity, and recommend which statement should "win."
|
||||
No tools, no project context beyond the document itself.
|
||||
|
||||
| Model | Time | Output tokens | Reasoning tokens | Contradictions found |
|---|---|---|---|---|
| GPT-5 | 162s | 12,074 | 11,008 | 4 |
| Claude Opus 4.6 | 41s | 2,056 | (internal) | 6 |
| Claude Sonnet 4.6 | 17s | 826 | (internal) | 4 |

**What they found — common ground (2+ models identified):**
|
||||
|
||||
- **Missing `pending_cancel → partially_filled` revert transition** (GPT-5 #1 +
|
||||
Opus #2 + Sonnet partial): The "Rejection reverts" invariant states orders return
|
||||
to their "pre-modification state (`working` or `partially_filled`)", but the state
|
||||
diagram only shows `pending_cancel → working` for cancel rejection — no path back
|
||||
to `partially_filled`. All models correctly identified this as the diagram being
|
||||
incomplete relative to the stated invariant. GPT-5 and Opus rated CRITICAL.
|
||||
- **Same issue for `pending_replace` revert** (GPT-5 #1 + Opus #3): The state diagram
|
||||
only shows `pending_replace → working` for replace rejection, but a replace
|
||||
requested from `partially_filled` should revert to `partially_filled`. Same root
|
||||
cause as above, just the replace variant.
|
||||
- **FOK "never partially fills" vs state machine allowing it** (GPT-5 #2 + Opus #4):
|
||||
The TIF table says FOK "never partially fills" but the state machine has no guards
|
||||
preventing FOK orders from reaching `partially_filled`. Both correctly noted this
|
||||
is a broker-enforced guarantee but the document presents it as system-level.
|
||||
- **`rejection_reason` described as "broker-provided" but local rejections exist**
|
||||
(GPT-5 #4 + Opus #5 + Sonnet): `pending → rejected` is "local validation failure"
|
||||
with no broker interaction, but the field says "Broker-provided reason when
|
||||
rejected." All three caught this terminology inconsistency.
|
||||
|
||||
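The missing revert transitions described above amount to remembering the pre-modification state. A minimal Python sketch follows, assuming a hypothetical `Order` class; the state names come from the document, but the class, its methods, and the stored `_pre_modification_state` field are illustrative and not taken from `order-state-machine.md`.

```python
class Order:
    """Toy order that reverts to its pre-modification state on rejection."""

    MODIFIABLE = {"working", "partially_filled"}
    PENDING = {"pending_cancel", "pending_replace"}

    def __init__(self, state: str = "working") -> None:
        self.state = state
        self._pre_modification_state = None

    def request_cancel(self) -> None:
        self._begin_modification("pending_cancel")

    def request_replace(self) -> None:
        self._begin_modification("pending_replace")

    def _begin_modification(self, pending_state: str) -> None:
        if self.state not in self.MODIFIABLE:
            raise ValueError(f"cannot modify an order in state {self.state!r}")
        # Remember where to return if the broker rejects the request.
        self._pre_modification_state = self.state
        self.state = pending_state

    def reject_modification(self) -> None:
        if self.state not in self.PENDING:
            raise ValueError(f"no pending modification in state {self.state!r}")
        # Revert to `working` OR `partially_filled`, per the stated invariant.
        self.state = self._pre_modification_state
        self._pre_modification_state = None
```

A state diagram that matches this behavior needs both `pending_cancel → partially_filled` and `pending_replace → partially_filled` as rejection paths, alongside the existing paths back to `working`.
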
**GPT-5 unique findings (not in either other model):**
|
||||
|
||||
- **IOC valid terminal states exclude `expired` vs generic expiry transitions** (#3):
|
||||
IOC should never reach `expired` (unfilled portion is cancelled immediately), but
|
||||
the state diagram allows any order to transition to `expired` without TIF guards.
|
||||
Well-reasoned extension of the FOK finding to IOC semantics. GPT-5 correctly
|
||||
identified that broker "expired-like" outcomes should map to `cancelled` for IOC.
|
||||
|
||||
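The FOK and IOC findings both come down to transitions that lack time-in-force guards. The Python sketch below shows one way such guards could be expressed. It encodes only the two constraints stated in this finding; the table name, function names, and the `"DAY"` value used in the usage note are hypothetical.

```python
# States each time-in-force must never reach, per the findings above.
FORBIDDEN_STATES_BY_TIF = {
    "FOK": frozenset({"partially_filled"}),  # FOK "never partially fills"
    "IOC": frozenset({"expired"}),           # unfilled IOC remainder is cancelled
}


def transition_allowed(tif: str, target_state: str) -> bool:
    """Return False when the target state contradicts the order's TIF."""
    return target_state not in FORBIDDEN_STATES_BY_TIF.get(tif, frozenset())


def map_broker_outcome(tif: str, broker_state: str) -> str:
    """Map broker 'expired-like' outcomes to `cancelled` for IOC orders."""
    if tif == "IOC" and broker_state == "expired":
        return "cancelled"
    return broker_state
```

For example, `transition_allowed("FOK", "partially_filled")` is rejected, while an order with an unlisted time-in-force such as `"DAY"` is unaffected by the guard.
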
**Claude Opus unique findings (not in either other model):**
|
||||
|
||||
- **Terminal states that aren't terminal — the `partially_filled` re-entry problem**
|
||||
(#1): Opus identified the DEEPER contradiction beyond the surface-level "cancelled
|
||||
states have outgoing transitions." When `cancelled → partially_filled` fires via
|
||||
late fill, the order is now non-terminal with NO defined mechanism to re-terminate
|
||||
if no further fills arrive. The order is stuck in `partially_filled` indefinitely.
|
||||
This goes beyond "the diagram contradicts the definition of terminal" to "the fill
|
||||
precedence rule creates an unspecified operational scenario." This is the most
|
||||
architecturally significant finding across all three models.
|
||||
- **Fill precedence label misapplication to non-terminal states** (#6): The state
|
||||
diagram labels transitions from `pending_cancel → partially_filled` and
|
||||
`pending_replace → partially_filled` as "fill precedence," but the Fill
|
||||
Precedence Rule explicitly defines itself as overriding TERMINAL states.
|
||||
`pending_cancel` is non-terminal. The label conflates two different mechanisms
|
||||
(fill during pending modification vs. fill overriding terminal state), which
|
||||
could cause implementers to use the same code path for fundamentally different
|
||||
scenarios.
|
||||
|
||||
**Claude Sonnet unique findings (not in either other model):**

- **State diagram terminal arrow contradiction** (#1): Sonnet was the only model to
explicitly note that the Mermaid diagram shows `cancelled → [*]` (terminal arrow)
while simultaneously showing `cancelled → partially_filled` (outgoing transition).
A valid observation but more surface-level than Opus's deeper analysis of the same
phenomenon.
- **Pending replace fill logic error** (#3): Sonnet argued that receiving a fill
during `pending_replace` creates a logical impossibility because the order
parameters are in flux. This is WRONG: fills always apply to current parameters
(the replace hasn't been confirmed yet), and the document actually handles this
correctly. This is a FALSE POSITIVE from Sonnet.

**Quality assessment:**

- **Claude Opus** was the clear winner for this task. Found the most contradictions
(6), had no false positives, and, crucially, found
qualitatively deeper issues. The `partially_filled` re-entry problem (#1) isn't
just "the diagram has a missing transition" but "the fill precedence rule creates
an unresolvable operational state." The fill precedence label misapplication (#6)
identifies a conceptual confusion that would genuinely cause implementation bugs.
Opus completed in only 41s with 2,056 output tokens, by far the most token-efficient.
- **GPT-5** found 4 genuine contradictions with 0 false positives but spent an
extraordinary amount of reasoning tokens (11,008) for modest output (1,066 visible
content tokens, 10.3:1 reasoning ratio). The IOC finding was unique and valuable.
But the cost is disproportionate: 162s and 12K tokens for 4 findings vs Opus's
41s and 2K tokens for 6 findings. GPT-5's reasoning budget seems to have been
mostly spent on VERIFICATION (confirming each finding is genuine), consistent
with Finding #20's observation.
- **Claude Sonnet** was fastest (17s) and found 4 items, but one was a false positive
(the pending_replace logic error claim is incorrect). That gives it a precision of
75% (3/4 genuine), the lowest of the three. Its genuine findings were all also
found by the other models (no unique true contributions). Sonnet appears to trade
accuracy for speed on contradiction detection.

**Key insight - contradiction detection favors precision-oriented models:**

This task is fundamentally about LOGICAL ARGUMENTATION: proving that two statements
cannot both be true. Unlike assumption-finding (which is about imagining what could go
wrong) or gap-finding (which is about identifying missing content), contradiction
detection requires the model to:
1. Hold two statements in working memory simultaneously
2. Construct a formal argument for why they conflict
3. NOT get confused by statements that SEEM contradictory but are actually consistent

Requirement #3 is where models diverge. Sonnet produced a false positive because it
didn't fully reason through whether the pending_replace fill scenario is actually
inconsistent (it isn't: current parameters apply). Opus avoided this trap entirely
and additionally found DEEPER contradictions that require multi-step logical reasoning
(the re-entry problem, the label misapplication). GPT-5 also avoided false positives
but at massive computational cost.

**Opus's efficiency advantage:**
This is the first task where Opus is not just qualitatively better but also
quantitatively more efficient. 6 findings in 41s and 2K tokens vs GPT-5's 4 findings
in 162s and 12K tokens. That's roughly 9x more findings per token and about 4x faster
in wall-clock time. For
contradiction detection specifically, Opus appears to have a structural advantage,
possibly because its internal reasoning is better calibrated for logical argumentation
than GPT-5's externalized reasoning chain.

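The efficiency comparison can be reproduced from the figures reported in the quality assessment. This is a small sketch of that arithmetic only: the `Run` dataclass and helper names are hypothetical, GPT-5's token count is taken as its reasoning tokens plus visible content tokens, and Sonnet is omitted because its token count is not given in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    findings: int  # items reported
    genuine: int   # items judged genuine
    seconds: int   # wall-clock time
    tokens: int    # output tokens plus any reported reasoning tokens


RUNS = {
    "Opus": Run(findings=6, genuine=6, seconds=41, tokens=2_056),
    "GPT-5": Run(findings=4, genuine=4, seconds=162, tokens=11_008 + 1_066),
}


def findings_per_1k_tokens(run: Run) -> float:
    return 1000 * run.findings / run.tokens


def precision(run: Run) -> float:
    return run.genuine / run.findings


opus, gpt5 = RUNS["Opus"], RUNS["GPT-5"]
token_ratio = findings_per_1k_tokens(opus) / findings_per_1k_tokens(gpt5)
speedup = gpt5.seconds / opus.seconds

print(f"findings per token: {token_ratio:.1f}x, wall-clock: {speedup:.1f}x")
# prints "findings per token: 8.8x, wall-clock: 4.0x"
```

Both models have a precision of 1.0 here, so the gap is entirely in cost per finding rather than in correctness.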
**Comparison to Finding #20 (invariant violation paths):**
In Finding #20, GPT-5 was maximally selective (3 findings, all genuine, 15:1
reasoning ratio). Here, GPT-5 shows the same pattern: few findings, all genuine,
high reasoning ratio (10.3:1). The difference: in #20, GPT-5's selectivity meant
it found UNIQUE violations others missed. Here, apart from its unique findings such
as the IOC case, GPT-5's findings were also found by Opus, which reported more
overall (6 vs 4). GPT-5's high verification bar doesn't help
when Opus is ALSO precise AND more thorough.

**Updated task-model assignment:**

For contradiction/consistency checking:
1. **Opus** - best choice: highest precision, deepest contradictions, most efficient
2. **GPT-5** - solid backup: zero false positives, unique TIF-related insights, but
expensive and slower
3. **Sonnet** - NOT recommended for this task: produces false positives, no unique
true contributions

This confirms the emerging pattern: each model has task types where it excels.
Opus excels at logical argumentation and design tensions. GPT-5 excels at
exhaustive enumeration and operational concerns. Sonnet excels at speed and
structural/assumption analysis but struggles with tasks requiring formal logical
reasoning (contradiction detection, concurrency analysis per Finding #13).

**Practical implication:** When reviewing architecture documents for internal
consistency (e.g., before implementation begins), run Opus. If budget allows,
add GPT-5 for TIF/edge-case coverage. Skip Sonnet for consistency checking:
its speed advantage is negated by the false positive risk.
@@ -0,0 +1,158 @@
# Finding 26: Missing-feature identification IS promptable across all models; prompt framing eliminates Opus's historical advantage: all three models find regulatory gaps when explicitly asked

**Date:** 2026-05-05
**Task:** Identify computations, behaviors, or features that gargoyle's
`corporate-actions.md` (992 lines) SHOULD perform for financial correctness,
regulatory compliance, or operational safety, but doesn't describe.
**How we used them:** Same document (full text) + same focused analytical
prompt to all 3 models via HAI proxy. Prompt explicitly structured around 5
categories: missing computations, missing behaviors, missing validations,
missing integrations, and regulatory gaps. Required concrete findings with
severity. No tools, no project context beyond the document. GPT-5 via
OpenAI endpoint (16K max_completion_tokens), Opus 4.6 and Sonnet 4.6 via
Anthropic endpoint (8K max_tokens).

| Model | Output tokens | Reasoning tokens | Findings | Critical | High | Medium |
|---|---|---|---|---|---|---|
| GPT-5 | 11,354 | 8,512 | 20 | 3 | 10 | 7 |
| Claude Opus 4.6 | 4,111 | (internal) | 23 | 6 | 10 | 7 |
| Claude Sonnet 4.6 | 4,686 | (internal) | 15 | 5 | 6 | 4 |

**What they found - common ground (all 3 identified):**
- Wash sale rule interaction with CA-driven lot closures (IRC §1091)
- Short position treatment for corporate actions
- Same-day corporate action ordering beyond `recorded_at` timestamp
- Record date / ex-date position verification (entitlement timing)
- Idempotency guard preventing double-application per user
- Decimal precision/rounding policy unspecified
- Superseded CA status has no lot rollback mechanism
- Rights/warrants post-creation lifecycle (exercise/expiration)
- Basis preservation invariant has no runtime enforcement
- Manual entry authorization and audit trail

**GPT-5 unique findings (not in either Claude model):**
- Per-lot eligibility based on entitlement date (not just user-level)
- Election-based outcomes for shareholder choices (cash vs stock)
- Instrument-level trading hold during CA application window
- Pre-application consistency checks against broker entitlements
- DB-level enforcement of status transitions and invariants
- Action-type-specific date semantics per field (ex vs record vs payable)
- Voluntary/tender actions beyond distributions
- Backfill/initialization guard for newly onboarded users
- Applicator retry/backoff semantics and confirmation race
- Rights indivisibility constraints vs exact Decimal quantities

**Claude Opus unique findings (not in either other model):**
- Pending order PRICE adjustment after splits (not just cancellation)
- Multi-instrument position recalculation atomicity for mergers
- Mixed merger basis floor at zero (can produce negative basis)
- Tax lot identification method interaction with inherited dates
- Corporate action effect on strategy position limits/risk params
- Corporate actions on instruments not yet in the database
- Partial application window: new user acquires position mid-fan-out
- IRC §305(c) deemed distributions (taxable stock dividends)
- CA impact on unrealized P&L display and strategy evaluation
- Concurrent OrderManager startup + Applicator fan-out race

**Claude Sonnet unique findings (not in either other model):**
- Stale orders: failure modes table contradicts "excluded" section
- IRC §1223(1) holding period tacking verification at lot close
- Spinoff allocation percentage: no validation that child != parent instrument
- Combined spinoff allocations exceeding meaningful bounds
- Cash dividend bypasses OrderManager: record-date quantity snapshot lost
- Mixed merger large-denominator exchange ratio overflow
- Detector schedule: no intraday re-poll for same-day announcements
- ROC cumulative basis floor tracking and IRC §301(c)(3) distinction
- Mixed merger deferred loss not explicitly recorded in metadata

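The common-ground and unique lists above amount to set intersection and set difference over each model's findings. The sketch below illustrates that bookkeeping under a loud assumption: findings from different models have already been matched to shared labels, however that matching was done. The labels are a small sample drawn from the lists above, and `common_ground` and `unique_to` are hypothetical helper names.

```python
# A sample of findings per model, using shortened labels from the lists above.
# Assumption: equivalent findings were already normalized to the same label.
FINDINGS = {
    "GPT-5": {
        "wash sale interaction",
        "short position treatment",
        "per-lot eligibility",
        "election-based outcomes",
    },
    "Opus": {
        "wash sale interaction",
        "short position treatment",
        "pending order price adjustment",
        "IRC 305(c) deemed distributions",
    },
    "Sonnet": {
        "wash sale interaction",
        "short position treatment",
        "holding period tacking",
        "no intraday re-poll",
    },
}


def common_ground(findings):
    """Findings reported by every model."""
    return set.intersection(*findings.values())


def unique_to(model, findings):
    """Findings reported by `model` and by no other model."""
    others = set().union(*(f for m, f in findings.items() if m != model))
    return findings[model] - others


print(sorted(common_ground(FINDINGS)))
for name in FINDINGS:
    print(name, sorted(unique_to(name, FINDINGS)))
```

The hard part in practice is the normalization step, since two models rarely phrase the same gap identically.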
**Quality assessment:**
- **Claude Opus** was the MOST PROLIFIC (23 findings), a notable inversion
from previous experiments where Opus typically found fewer but deeper
findings. Here, the explicit "missing feature" framing appears to have
unlocked Opus's breadth. Its unique findings included genuinely critical
items: pending order price adjustment after splits (Critical: direct
financial loss), multi-instrument atomicity for mergers (Critical:
position loss), and mixed merger negative basis (High: accounting
corruption). The findings were precise, well-reasoned, and showed both
regulatory depth (IRC §305(c)) and operational awareness.
- **GPT-5** was slightly less prolific (20 findings) but maintained its
characteristic breadth and operational-level thinking. Per-lot eligibility
(not just per-user) is a subtle but important distinction. The
election-based outcomes finding shows awareness of real-world corporate action
complexity. The backfill/initialization guard is operationally significant.
GPT-5 spent 8,512 reasoning tokens, moderate for its output volume.
- **Claude Sonnet** found fewer gaps (15) but several were genuinely
insightful. The internal contradiction between the failure modes table
and the "excluded" section is a real document inconsistency. The cash
dividend record-date quantity snapshot insight (#9) identifies a DATA LOSS
problem: the opportunity to capture that data expires. The mixed merger
deferred loss recording gap shows regulatory awareness. However, some
findings were more surface-level or overlapped heavily with the others.

**KEY INSIGHT - The original question from Finding #22 is ANSWERED:**

> "Opus's 'missing feature identification' mode (wash sales, commissions) -
> is this promptable on other models? Could we explicitly ask GPT-5 'what
> should this system compute but doesn't' and get similar results?"

**YES.** When explicitly prompted with a structured "missing feature"
framing, ALL three models found regulatory gaps (wash sales, IRC sections),
missing computations (basis calculations, rounding), and missing behaviors
(lifecycle events, notifications). GPT-5 produced findings in the same
*category* as what Opus uniquely found in Finding #22 (silent correctness
failures on specid-lot-selection.md).

In Finding #22, Opus uniquely identified wash sales and commission tracking
as missing features while GPT-5 focused on mechanism incorrectness and
Sonnet on composition failures. HERE, with the explicit "what's missing"
prompt, ALL three models found wash sales, ALL found regulatory gaps, and
ALL found missing behaviors.

**This confirms:** Opus's "missing feature identification" mode in Finding #22
was NOT an inherent model capability; it was an emergent behavior from
the open-ended "silent correctness failures" prompt. When you give ALL models
the EXPLICIT instruction to look for missing features, they all do it. The
differentiation from #22 was caused by the prompt being more open-ended,
allowing each model to default to its natural analytical mode:
- Opus → "what's missing" (features/functionality)
- GPT-5 → "what's wrong" (mechanism failures)
- Sonnet → "what breaks when combined" (composition)

**Prompt framing dominates model personality.** With the right prompt,
any model can be directed into any analytical mode. The model differences
that emerged in earlier open-ended experiments reflect DEFAULT TENDENCIES,
not capabilities.

**NEW finding about Opus on complex documents:**
Opus produced MORE findings than GPT-5 (23 vs 20), the first time this
has happened on a broad analytical task. Previous pattern on such tasks: GPT-5
finds more (20-33 findings) while Opus finds fewer but deeper (7-13).
What changed? The document is 992 lines (the longest tested) and the
task is explicitly about breadth ("find all gaps"). On this specific
combination (long document + breadth-focused prompt), Opus appears to
allocate its internal reasoning budget toward exploration rather than
its usual depth-first design-tension mode. This suggests Opus's typical
"fewer but deeper" pattern is partially a RESPONSE to shorter documents
where depth is more productive than breadth.

**Practical implications:**
1. For missing-feature analysis: prompt structure matters more than model
choice. All three models are viable. Use the explicit 5-category prompt.
2. Run all three for critical docs: they find different specific gaps
despite finding the same categories.
3. For open-ended analysis where you want models to find DIFFERENT things:
use open-ended prompts. For analysis where you want COMPREHENSIVE
coverage of one type: use structured prompts.
4. Opus's "fewer but deeper" personality can be overridden by document
length + breadth-focused prompt. On 992-line docs, it competes on
volume with GPT-5.

**Cost-effectiveness:**
- Opus: 4,111 output tokens for 23 findings = 179 tokens/finding
- GPT-5: 11,354 output tokens (+ 8,512 reasoning) for 20 findings = 993 tokens/finding (output and reasoning combined)
- Sonnet: 4,686 output tokens for 15 findings = 312 tokens/finding

Opus is by far the most efficient: roughly 5.5x fewer tokens than GPT-5 per
finding, with MORE findings. This is the strongest cost-effectiveness case
for Opus on any tested task. On long documents with breadth-focused prompts,
Opus appears to be the optimal choice for both quality AND efficiency.
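The per-finding figures follow directly from the results table. A minimal sketch of the arithmetic, with two assumptions stated plainly: the Claude models' internal reasoning is not reported and is counted as zero, and GPT-5's reasoning tokens are added on top of its output tokens, as the text does. Whether the reported output figure already includes reasoning tokens is not stated in the source. `tokens_per_finding` is a hypothetical helper.

```python
# Token and finding counts from the results table for this finding.
# Assumption: "(internal)" reasoning for the Claude models is counted as 0.
RUNS = {
    "Opus": {"output": 4_111, "reasoning": 0, "findings": 23},
    "GPT-5": {"output": 11_354, "reasoning": 8_512, "findings": 20},
    "Sonnet": {"output": 4_686, "reasoning": 0, "findings": 15},
}


def tokens_per_finding(run):
    """Total reported tokens divided by number of findings."""
    return (run["output"] + run["reasoning"]) / run["findings"]


for model, run in RUNS.items():
    print(f"{model}: {tokens_per_finding(run):.0f} tokens/finding")

ratio = tokens_per_finding(RUNS["GPT-5"]) / tokens_per_finding(RUNS["Opus"])
print(f"GPT-5 / Opus: {ratio:.1f}x")
```

If reasoning tokens turn out to be a subset of the output count, dropping the `reasoning` term for GPT-5 gives the alternative figure without changing anything else.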
@@ -0,0 +1,276 @@
# Finding 27: Design coherence on risk-controls.md: GPT-5 regains top position; Opus's advantage from Finding #15 was document-specific, not task-specific

**Date:** 2026-05-05
**Task:** Identify internal design incoherences in gargoyle's `risk-controls.md` (277 lines),
a pre-trade risk control specification covering two evaluation stages, reduction semantics,
ordering rationale, fail-closed claims, and audit logging.
**How we used them:** Same document (full text) + same focused analytical question to all
3 models via HAI proxy. Highly structured prompt specifying 5 categories of incoherence
(safety properties not enforced, ordering/sequencing contradictions, reduction semantics
conflicts, fail-closed claims vs actual behavior, cross-stage inconsistencies). Required
each finding to reference specific contradictory parts. No tools, no project context beyond
the document itself.

| Model | Time | Output tokens | Reasoning tokens | Incoherences found | Critical | High | Medium |
|---|---|---|---|---|---|---|---|
| GPT-5 | 112s | 8,231 | 7,232 | 6 | 1 | 3 | 2 |
| Claude Opus 4.6 | 41s | 1,858 | (internal) | 5 | 2 | 2 | 1 |
| Claude Sonnet 4.6 | 15s | 699 | (internal) | 4 | 1 | 2 | 1 |

**What they found - common ground (all 3 identified):**
- Reduction re-entry at BuyingPower contradicts "reducing controls never re-enter
earlier controls" (all three flagged this as the most obvious contradiction:
Concentration at position 5 reduces, re-enters at BuyingPower at position 4,
which IS an earlier control)
- Ordering rationale's categorization of buying power/concentration is internally
confused (the doc labels both as "quantity-sensitive checks" that run after
reducing controls, but concentration IS a reducing control at position 5 while
buying power at position 4 sits between the two reducing controls)

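The re-entry contradiction reduces to a comparison of pipeline positions. The sketch below is illustrative only: it includes just the three controls whose positions are named in this finding, treats Position Size as the other reducing control (an inference from "the two reducing controls"), and uses hypothetical names `CONTROL_ORDER`, `REENTRY`, and `reentry_violations`.

```python
# Pipeline positions named in this finding; the full decision-stage pipeline
# has more controls that are omitted here.
CONTROL_ORDER = {"PositionSize": 3, "BuyingPower": 4, "Concentration": 5}

# Re-entry point after a reduction: reducing control -> control re-entered.
REENTRY = {"Concentration": "BuyingPower"}


def reentry_violations(reentry, order):
    """Pairs where a reducing control re-enters at an EARLIER control,
    violating "reducing controls never re-enter earlier controls"."""
    return [
        (reducer, target)
        for reducer, target in reentry.items()
        if order[target] < order[reducer]
    ]


print(reentry_violations(REENTRY, CONTROL_ORDER))
# prints [('Concentration', 'BuyingPower')]
```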
**GPT-5 unique findings (not in either Claude model):**
- Signal-level "no portfolio context" contradicts NoShortSales requiring knowledge
of current positions. The doc explicitly states signals are evaluated "in isolation"
with "no portfolio context - only the signal itself and user settings", but checking
whether the user holds a position IS portfolio context. This is a genuine design
tension: either SignalRisk has hidden portfolio access (violating isolation) or
NoShortSales can't actually work as specified.
- Settings "fall through to system defaults" vs "Settings cache miss → reject."
Two incompatible instructions for the same condition (missing settings).
- "Universal fail-closed" with "only exception is order rate window" contradicted
by Failure Modes table showing buying power as another exception ("Conservative
estimate; may over-reject" is NOT rejection; it's a different failure mode than
either fail-closed or the documented single exception).
- Audit model says "every control evaluation produces an audit entry regardless of
outcome" but the signal-stage write point only describes writing on rejection.
Passing signals produce no documented audit entry at the signal stage.

**Claude Opus unique findings (not in either other model):**
- Signal flow diagram swaps control order vs table: table shows (1) MarketHours,
(2) PerTradeStop, (3) NoShortSales, but diagram flows MarketHours → NoShortSales
→ PerTradeStopLoss. Controls 2 and 3 are reversed between the two representations.
(VERIFIED: this is correct; the diagram does show a different order.)
- Concentration re-entry loop can bypass Order Rate, Duplicate, Self-Trade, and
Fat Finger entirely during intermediate iterations. Also: Position Size at order 3
is never re-checked against Concentration-reduced quantity because re-entry starts
at BuyingPower (order 4), meaning "smaller quantity wins" semantics are implemented
differently than the linear model described in Reduction Semantics.

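The diagram/table reversal is the kind of error a position-by-position comparison catches. A minimal sketch, assuming the two orderings have been transcribed by hand into lists and that the diagram's `PerTradeStopLoss` and the table's `PerTradeStop` are the same control (an assumption, since the source uses both names). `order_mismatches` is a hypothetical helper.

```python
# Signal-stage control order as transcribed from the two representations.
# Assumption: "PerTradeStopLoss" in the diagram is the table's "PerTradeStop".
TABLE_ORDER = ["MarketHours", "PerTradeStop", "NoShortSales"]
DIAGRAM_ORDER = ["MarketHours", "NoShortSales", "PerTradeStop"]


def order_mismatches(table, diagram):
    """1-based positions where the two orderings disagree."""
    return [
        (pos, in_table, in_diagram)
        for pos, (in_table, in_diagram) in enumerate(zip(table, diagram), start=1)
        if in_table != in_diagram
    ]


for pos, in_table, in_diagram in order_mismatches(TABLE_ORDER, DIAGRAM_ORDER):
    print(f"position {pos}: table={in_table}, diagram={in_diagram}")
```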
**Claude Sonnet unique findings (not in either other model):**
- Buying Power "Can reduce? No" creates a dead end: if a reduced quantity still
exceeds buying power, the system can only reject entirely (no mechanism to further
optimize), defeating the purpose of the reduction system for capital-limited users.
(NOTE: this is more of a design limitation than a self-contradiction, but the
framing (that the reduction system's purpose is undermined by buying power's
inability to reduce) is a legitimate coherence observation.)

**Quality assessment:**
- **GPT-5** produced the most findings (6) with the broadest coverage across the
prompt's 5 categories. The NoShortSales/portfolio-context finding is the most
genuinely insightful: it's a fundamental design-level contradiction (a signal-level
control that REQUIRES decision-level context). The settings contradiction and
audit logging inconsistency are also solid. Every finding points to two specific
textual statements that are incompatible. Severity ratings were calibrated (1
Critical, 3 High, 2 Medium, compared to Opus's 2 Critical for similar findings).
- **Claude Opus** was remarkably fast (41s, 1,858 tokens) and found one thing
neither other model caught: the diagram/table order reversal for signal controls.
This is a concrete, verifiable error (not a design tension but a literal mistake in
the document). The re-entry loop analysis (finding #5) goes deeper than GPT-5's
version of the same core issue, exploring the implications for "smaller quantity
wins" semantics. However, Opus found fewer total issues and missed the
settings contradiction and audit logging inconsistency.
- **Claude Sonnet** was the fastest (15s, 699 tokens) and found 4 issues. The buying
power dead-end observation is unique and shows genuine reasoning about the reduction
system's limitations. However, it's more of a "this design can't achieve its stated
goal" than a strict self-contradiction. Sonnet's other findings overlap with the
common ground. Quality is solid but narrower scope.

**Key insight - Finding #15's Opus > GPT-5 result was document-specific:**
In Finding #15 (coherence checking on failure-modes.md), Opus found 7 incoherences
vs GPT-5's 4. Here, on risk-controls.md, GPT-5 found 6 vs Opus's 5. The reversal
suggests that the relative performance on coherence checking depends on the
DOCUMENT'S structure, not on a fixed model advantage:

- **failure-modes.md** (383 lines): A complex multi-process system with many
stated invariants across failure states, supervision trees, and recovery paths.
Rich in design TENSIONS where one subsystem's safety mechanism undermines another.
This plays to Opus's strength (finding design tensions between subsystems).
- **risk-controls.md** (277 lines): A more focused specification with explicit rules,
ordering constraints, and behavior tables. Rich in SPECIFICATION CONTRADICTIONS
where one statement directly conflicts with another. This plays to GPT-5's
strength (systematic verification of claims against stated mechanisms).

The difference: Opus excels when contradictions are EMERGENT (arise from composing
multiple design decisions). GPT-5 excels when contradictions are EXPLICIT (two
statements in the document say incompatible things). Risk-controls.md has more
explicit contradictions (the settings fallback vs fail-closed, the "no portfolio
context" vs NoShortSales, the audit "always" vs write point "only on reject").

**Model performance depends on CONTRADICTION TYPE:**

| Contradiction type | Best model | Example |
|---|---|---|
| Emergent/compositional | Opus | "Rest-for-one cascade creates a 5th state" |
| Explicit/definitional | GPT-5 | "No portfolio context" but check requires portfolio |
| Diagrammatic/structural | Opus | Table order ≠ diagram order |
| Semantic/category confusion | All (common ground) | Reduction re-entry violates ordering claims |

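Read as a routing rule, the table says which models to run once you have a guess about the kinds of contradiction a document is likely to contain. The sketch below is one hypothetical way to encode it: the short keys, the `BEST_MODEL` mapping, and the `models_for` helper are illustrative names, and the fallback of running both Opus and GPT-5 for an unrecognized type is an assumption based on the recommendation to keep running both.

```python
# Contradiction type -> models that performed best, per the table above.
BEST_MODEL = {
    "emergent": {"Opus"},
    "explicit": {"GPT-5"},
    "diagrammatic": {"Opus"},
    "semantic": {"Opus", "GPT-5", "Sonnet"},  # common ground: any model
}

# Assumption: when the type is unknown, run both complementary models.
FALLBACK = {"Opus", "GPT-5"}


def models_for(expected_types):
    """Union of the best models for each expected contradiction type."""
    selected = set()
    for kind in expected_types:
        selected |= BEST_MODEL.get(kind, FALLBACK)
    return selected


print(sorted(models_for(["emergent", "explicit"])))
```

Since the contradiction types in a document are rarely known before review, the practical default under this table is still the union, which is both models.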
**Revised conclusion on Finding #15's open question:**
"Does Opus > GPT-5 ordering for coherence checking hold across other documents?"
**No.** The ordering depends on the document's contradiction density and type.
Documents rich in emergent design tensions favor Opus. Documents with explicit
specification errors favor GPT-5. The task type (coherence checking) doesn't have
a fixed model winner; it depends on what KIND of incoherences the document contains.

**Practical implication:** Continue running both models for coherence checking. Their
strengths are complementary even within the same task type. GPT-5 catches things you
can point to in the spec and say "these two sentences conflict." Opus catches things
where you need to reason about the implications of multiple mechanisms interacting.

## Open Questions

- Does GPT's advantage in finding inconsistencies extend to logical
inconsistencies in arguments? One data point (verdict mismatches); need more.
- What's the optimal task granularity for GPT analytical review? "Whole PR" is
too big. Is "one hypothesis" right, or can we batch?
- ~~Is the GPT-4.1 Mini bias detection result repeatable, or was it a
well-structured task that any model would ace?~~ **ANSWERED (Finding #8):** Any
model aces it when the biased text is presented without noise. The original
result was about noise elimination, not model capability.
- **NEW:** Does adding a narrow bias-check question to a rich PR review
context recover the detection that broad review misses? (Signal-to-noise
confirmation test)
- ~~How does reasoning_effort affect analytical quality? Only tested default so
far.~~ **ANSWERED (Finding #21):** Negligible effect on GPT-5 for open-ended
analytical tasks. Low/medium/high produced 33/30/30 findings with nearly
identical reasoning tokens (~4K) and per-finding depth. The parameter
may primarily affect verifiable-answer tasks, not exploration. Task framing
remains the dominant quality lever.
- Can we design a systematic "analytical review checklist" that leverages each
model's strengths?
- ~~What analytical tasks is Opus best at vs Sonnet? Finding #11 shows Opus
excels at design-tension identification. How does Sonnet compare on the
same task? (Sonnet is non-reasoning but fast; would it match GPT-4.1?)~~
**ANSWERED (Finding #12):** Sonnet 4.6 significantly outperforms GPT-4.1
(17 vs ~14 assumptions) and approaches GPT-5 (17 vs 20). It's not a
non-reasoning model in the GPT-4.1 sense; it occupies a middle tier with
genuine component-interaction reasoning. Opus still wins on design-tension
identification specifically.
- How do the models compare on research synthesis tasks (our #381 rewrite)?
We'll find out during the actual rewrite.
- ~~Does the reasoning-token advantage scale with document complexity? Test
with a simpler doc to see if the gap narrows.~~ **ANSWERED (Finding #11):**
The gap doesn't narrow with simpler docs. GPT-5 maintains ~1.7x the findings
of GPT-4.1 regardless of document complexity. Reasoning tokens enable
exhaustive exploration independent of input difficulty.
- ~~Would Claude Opus (also a reasoning model) match GPT-5's assumption-finding
performance, or does it have different blind spots?~~ **ANSWERED (Finding #11):**
Different blind spots, different strengths. GPT-5 reasons deeper into
implementation mechanics (breadth + technical depth). Opus reasons wider
about system context and design tensions (insight density). They're
complementary, not competing. Run both on important architecture docs.
- ~~Does Sonnet 4.6's strong showing hold across other analytical tasks
(bias detection, gap-finding) or is it specific to assumption-finding on
complex documents? Need to test Sonnet on simpler docs and different question
types.~~ **PARTIALLY ANSWERED (Finding #13):** Sonnet's strength does NOT
transfer to concurrency reasoning. It dropped from 85% of GPT-5
(assumption-finding) to ~58% (race condition identification). Task type matters more
than we thought. Still untested: gap-finding, bias detection for Sonnet.
- **NEW:** What other analytical tasks require sequential/temporal reasoning
(like race condition identification) vs pattern-matching reasoning (like
assumption-finding)? Building a task taxonomy would help assign models
correctly.
- **NEW:** What explains Sonnet taking slightly longer than Opus here (106s vs
105s) despite normally being the faster model? Is it the document length, or
does Sonnet's internal reasoning scale with complexity similarly to Opus?
- ~~How does GPT-5 Mini compare to GPT-5 on analytical tasks? Is it a viable
cheaper substitute?~~ **ANSWERED (Finding #14):** GPT-5 Mini is a viable
middle option. Finds fewer issues (6 vs 10) but with genuine reasoning
depth at ~50% cost/time. Better than non-reasoning models, not as
exhaustive as GPT-5.
- **NEW:** How does Claude 4.5 Opus/Sonnet compare to Claude 4.6? HAI now
exposes both; worth testing whether the newer versions regress on
analytical tasks.
- ~~Would running GPT-5 Mini + Sonnet together (different axes)
approach GPT-5's coverage at lower combined cost?~~ **ANSWERED (Finding #19):**
71% coverage at 31% cost. Good for low-stakes work; GPT-5 irreplaceable for
high-stakes due to unique domain-knowledge findings in the missing 29%.
- ~~**NEW (Finding #15):** Does the Opus > GPT-5 ordering for coherence checking
hold across other documents? The inversion (Opus finding more than GPT-5)
was striking; need to confirm it wasn't document-specific.~~
**ANSWERED (Finding #27):** No, it was document-specific. On risk-controls.md,
GPT-5 found 6 vs Opus's 5. The winner depends on contradiction TYPE: Opus
excels at emergent/compositional contradictions, GPT-5 at explicit/definitional
ones. No fixed ordering for this task type.
- **NEW (Finding #15):** Is the two-pass approach (Opus generates → GPT-5
validates) worth the extra cost vs just running Opus alone? Need to test
whether GPT-5 actually catches Opus false-positives or just agrees.
- ~~How do the Claude 4.5 and 4.6 models compare on analytical tasks?~~
**ANSWERED (Finding #16):** 4.5 is more exhaustive (2x findings), 4.6 is
more precise (higher signal-to-noise). Genuine tradeoff, not a regression.
4.5 for coverage, 4.6 for actionability.
- **NEW (Finding #16):** Does the 4.5 vs 4.6 pattern hold across other task
types? Spec completeness may favor exhaustiveness; would coherence checking
or race condition analysis show the same pattern?
- **NEW (Finding #16):** Is running both Sonnet versions (4.5 + 4.6)
cost-effective vs just running GPT-5? Need to compare the UNION of their findings
against GPT-5's output for overlap analysis.
- **NEW (Finding #18):** Does Opus's "predictable exploit window" detection
transfer to other policy documents? It uniquely identified that the cooldown
mechanism creates a GUARANTEED safe window that strategies could systematically
exploit; this is a higher-order security insight. Worth testing whether Opus
consistently finds "adversarial opportunity" framings that other models miss.
- **NEW (Finding #20):** Does GPT-5's extreme verification behavior (15:1
reasoning-to-output ratio, 3 findings from 12K reasoning) persist across
other documents with this prompt? Or was user-pipeline-lifecycle.md
particularly verification-heavy? Test invariant violation paths on a simpler
document.
- **NEW (Finding #20):** Would giving GPT-5 a "minimum 8 findings" instruction
reduce its selectivity and produce MORE invariant violations at lower
precision? Or would it just pad with non-violations? The extreme selectivity
may be a feature OR it may mean GPT-5 is discarding valid-but-hard-to-verify
findings.
- **NEW (Finding #20):** Opus's self-correction behavior is now confirmed across
Finding #15 and #20. Is this trainable/promptable? Could we ask non-Opus models
to "show your reasoning and withdraw findings you cannot fully verify"?
- **NEW (Finding #22):** The "silent correctness" lens revealed three distinct
analytical modes: Opus → missing functionality, GPT-5 → mechanism incorrectness,
Sonnet → composition failures. Does this three-way differentiation hold on other
documents, or was it specific to the regulatory/financial domain of specid-lot-selection?
- **NEW (Finding #22):** Does the "silent correctness" lens work on non-financial
documents? The financial/regulatory domain has a large gap between syntactic and
semantic correctness. Would the same prompt on an infrastructure/systems doc produce
equally differentiated findings, or would it collapse into assumption-finding?
- ~~**NEW (Finding #22):** Opus's "missing feature identification" mode (wash sales,
|
||||
commissions) — is this promptable on other models? Could we explicitly ask GPT-5
|
||||
"what should this system compute but doesn't" and get similar results?~~
|
||||
**ANSWERED (Finding #26):** YES — all three models find regulatory gaps and
|
||||
missing features when explicitly prompted. Opus's unique behavior in #22 was
|
||||
an emergent DEFAULT tendency, not a capability. Prompt framing dominates
|
||||
model personality.
|
||||
|
||||
- **NEW (Finding #28):** Cross-document consistency found real bugs in gargoyle
|
||||
docs (fills vs events, position ownership, signal persistence). Does running
|
||||
this analysis across MORE document pairs (e.g., domain readmes vs implementation
|
||||
docs, design docs vs plan docs) yield additional real inconsistencies? Could
|
||||
become a systematic documentation maintenance tool.
|
||||
- **NEW (Finding #28):** Opus was 2.4x faster AND found more issues than GPT-5
|
||||
on cross-document consistency. Is this because cross-doc contradictions are
|
||||
easy to verify once spotted (reducing GPT-5's verification advantage)? Or
|
||||
because boundary reasoning (Opus's strength) is the primary skill needed?
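
The overlap analysis proposed in the Finding #16 question above (union of both Sonnet versions vs GPT-5) could be sketched roughly as follows. This is a minimal illustration, not the method used in any experiment: the finding IDs are hypothetical, and it assumes equivalent findings have already been normalized by hand to share an ID.

```python
def coverage(candidate, reference):
    """Fraction of the reference model's findings also present in the candidate set."""
    if not reference:
        return 0.0
    return len(candidate & reference) / len(reference)


# Hypothetical finding IDs (not real experiment data).
sonnet_45 = {"F1", "F2", "F3", "F4"}
sonnet_46 = {"F2", "F5"}
gpt5 = {"F1", "F2", "F5", "F6", "F7"}

union = sonnet_45 | sonnet_46      # everything either Sonnet version found
shared = coverage(union, gpt5)     # share of GPT-5's findings the union reproduces
only_gpt5 = gpt5 - union           # findings that would be lost without GPT-5
only_union = union - gpt5          # findings GPT-5 did not produce

print(f"coverage of GPT-5 findings: {shared:.0%}")
print("GPT-5 only:", sorted(only_gpt5))
print("Sonnet union only:", sorted(only_union))
```

The hard part is the manual normalization step, since two models rarely word the same finding identically.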
|
||||
|
||||
## Methodology Notes
|
||||
|
||||
- Internet opinions about models are overwhelmingly about coding. Don't
|
||||
extrapolate to analytical work without testing.
|
||||
- "Just because someone says it on the internet doesn't make it right." —
|
||||
Aaron, 2026-04-26. Opinions need context. Track our own evidence.
|
||||
- Absence of published methodology for a use case is itself a finding.
|
||||
- Each finding needs: date, task, **how we used it** (context shape, task
|
||||
framing, what info the model had/didn't have), what happened, takeaway.
|
||||
No unsupported generalizations.
|
||||
- **Context dimensions to track:**
|
||||
- Rich vs minimal (how much background info)
|
||||
- Broad vs focused ("review this" vs "answer this specific question")
|
||||
- What kind of context (diff, full files, issue text, research notes,
|
||||
project conventions, nothing)
|
||||
- Whether the model had access to tools or just text
|
||||
- Whether the task was explicit step-by-step or open-ended
|
||||
@@ -0,0 +1,178 @@
|
||||
# Finding 28: Cross-document consistency analysis: NEW task type — GPT-5 finds deep semantic contradictions; Opus finds structural/boundary mismatches; Sonnet identifies core issues quickly
|
||||
|
||||
**Date:** 2026-05-05
|
||||
**Task:** Identify contradictions and inconsistencies BETWEEN two architecture documents
|
||||
describing the same system: `system-overview.md` (323 lines, narrative overview with
|
||||
component flows, invariants, and domain events) and `architecture.md` (213 lines,
|
||||
DDD-focused with bounded contexts, context map, and message taxonomy).
|
||||
**How we used them:** BOTH documents provided as full text in a single prompt (~25KB
|
||||
total). Highly structured prompt specifying 5 categories of cross-document inconsistency
|
||||
(terminology conflicts, structural contradictions, flow/sequence conflicts,
|
||||
ownership/authority conflicts, philosophical contradictions). Required specific output
|
||||
format per finding. Explicitly excluded omissions (things one doc covers and the other
|
||||
doesn't) and detail-level differences. No tools, no project context beyond the two
|
||||
documents. This is a NEW analytical task not previously tested: reasoning about
|
||||
CONSISTENCY BETWEEN documents rather than internal coherence of a single document.
|
||||
|
||||
| Model | Time | Output tokens | Reasoning tokens | Inconsistencies found | Critical | High | Medium |
|---|---|---|---|---|---|---|---|
| GPT-5 | 125s | 9,415 | 8,384 | 6 | 2 | 3 | 1 |
| Claude Opus 4.6 | 52s | 2,351 | (internal) | 7 | 3 | 3 | 1 |
| Claude Sonnet 4.6 | 14s | 776 | (internal) | 4 | 1 | 2 | 1 |
|
||||
|
||||
**What they found — common ground (all 3 identified):**
|
||||
- Event sourcing (all events as source of truth) vs fills-only ground truth:
|
||||
Document A says fills are "ground truth from which all other state can be
|
||||
derived," while Document B says "events are the source of truth, state is
|
||||
computed by replaying events." A treats fills as the recovery foundation;
|
||||
B treats ALL domain events as authoritative. All three models rated this
|
||||
Critical.
|
||||
- Bounded context naming mismatch: "Decision Engine" / "Order Management" (A)
|
||||
vs "Engine" / "Trading" (B) for the same functional responsibilities.
|
||||
GPT-5 folded this into a broader ownership analysis; Opus and Sonnet
|
||||
surfaced it as its own finding.
|
||||
- Signal classification conflict: Document A lists "Signal emitted" as a domain
|
||||
event; Document B explicitly categorizes `SignalEmitted` as an audit event
|
||||
("not used to rebuild state"). This determines event store design and
|
||||
recovery semantics.
|
||||
|
||||
**GPT-5 unique findings (not in either Claude model):**
|
||||
- Signal persistence contradiction: Document A states "Signals are never
|
||||
persisted" while Document B lists `SignalEmitted` as an audit event that IS
|
||||
persisted and states the audit log is mandatory for trading. These are
|
||||
directly incompatible claims about whether signal data is stored.
|
||||
- Audit event ownership conflict: Document A says "Decision approved" events
|
||||
originate from PortfolioRisk. Document B states "only the decision engine
|
||||
writes audit events" and lists `DecisionApproved` as an audit event example.
|
||||
If PortfolioRisk is part of Risk (not Engine), this is an authority violation.
|
||||
- "Single writer per user" (A: OrderManager writes all trading state) vs
|
||||
per-aggregate single-writer (B: each aggregate writes its own event stream,
|
||||
Ledger owns positions). These are incompatible authority models — either OM
|
||||
centralizes writes or each domain owns its own events.
|
||||
|
||||
**Claude Opus unique findings (not in either other model):**
|
||||
- Engine → OrderManager is an internal pipeline flow (A: same subgraph, direct
|
||||
arrow) vs Engine → Trading is a cross-domain COMMAND (B: `PlaceOrder` command
|
||||
crossing a bounded context boundary). This structural disagreement determines
|
||||
whether order management is an internal pipeline stage or an independent domain
|
||||
with its own aggregates and command validation.
|
||||
- Signal Risk's architectural position: Document A shows a two-stage risk
|
||||
architecture (SignalRisk pre-aggregation, PortfolioRisk post-aggregation)
|
||||
where Risk is embedded in the pipeline. Document B's context map shows Risk
|
||||
as a separate domain that Engine merely QUERIES ("kill switch active?") —
|
||||
no arrow shows signal routing through Risk. Either risk logic lives inside
|
||||
Engine (contradicting B's context boundary) or the context map is incomplete.
|
||||
- The "reduce" step ownership: A's top-level flow labels `Approved →|"reduce"|
|
||||
Decisions` (reduction at aggregation), while A's own domain events table says
|
||||
"Decision reduced" originates from PortfolioRisk (reduction after aggregation).
|
||||
This is actually an INTRA-document inconsistency in Document A, but Opus surfaced
|
||||
it as part of cross-doc analysis.
|
||||
|
||||
**Claude Sonnet unique findings:**
|
||||
- None genuinely unique. All 4 findings overlapped with GPT-5/Opus common ground
|
||||
(event sourcing, signal persistence, context count/naming). Sonnet was efficient
|
||||
(14s, 776 tokens) but didn't identify any inconsistency that the other two missed.
|
||||
|
||||
**Quality assessment:**
|
||||
- **GPT-5** produced 6 well-reasoned findings with the deepest analysis of
|
||||
OWNERSHIP conflicts. Its signal-persistence contradiction and single-writer
|
||||
authority conflict are genuinely important — they reveal places where the two
|
||||
documents would lead implementers to build fundamentally different systems.
|
||||
Every finding quotes specific text from both documents and explains precisely
|
||||
WHY they can't both be correct. The reasoning investment (8,384 tokens) was
|
||||
used for thorough cross-referencing between documents.
|
||||
- **Claude Opus** found the most inconsistencies (7) and was remarkably fast
|
||||
(52s vs GPT-5's 125s). Its unique strength: identifying STRUCTURAL contradictions
|
||||
about component boundaries and communication patterns. The Engine→Trading
|
||||
command vs internal pipeline finding is architecturally the most significant
|
||||
discovery — it reveals a fundamental disagreement about whether order
|
||||
management is INSIDE or OUTSIDE the decision engine's boundary. Opus also
|
||||
caught a bonus intra-document inconsistency (the "reduce" labeling error).
|
||||
- **Claude Sonnet** was the fastest (14s) and most concise (776 tokens) but
|
||||
found only the obvious common-ground issues. For cross-document consistency,
|
||||
Sonnet's speed advantage came at the cost of missing the architectural
|
||||
insights that make this task valuable. It did correctly identify all the
|
||||
Critical-level issues, making it viable as a quick first-pass screen.
|
||||
|
||||
**Key insight — cross-document consistency is a DISTINCT task type:**
|
||||
This is fundamentally different from single-document analysis (assumptions,
|
||||
race conditions, coherence). It requires:
|
||||
1. Building a mental model from Document A
2. Building a separate mental model from Document B
3. Finding places where the models are incompatible
4. Reasoning about WHY they can't both be correct (not just "different")
|
||||
|
||||
Step 4 is what distinguishes this from simple diff-detection. Many surface
|
||||
differences (naming, detail level, scope) are NOT contradictions — the models
|
||||
must judge which differences are genuinely incompatible vs. complementary.
|
||||
The prompt explicitly excluded omissions and detail-level differences, and
|
||||
all three models respected this constraint well.
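
As a rough illustration of steps 1 to 3, the two "mental models" can be pictured as claim tables keyed by subject and attribute. The sketch below is an assumption-laden toy, not how any model works internally: the claim keys are invented for illustration, the values paraphrase the documents, and the `deployment` entry is purely hypothetical. It only shortlists candidates; step 4 (judging whether two claims are genuinely incompatible) still needs a reviewer or a model.

```python
# Claims keyed by (subject, attribute); values paraphrase each document.
doc_a = {
    ("signals", "persistence"): "never persisted",
    ("recovery", "source of truth"): "fills",
    ("deployment", "topology"): "single node",  # hypothetical; only A mentions it
}
doc_b = {
    ("signals", "persistence"): "persisted as SignalEmitted audit event",
    ("recovery", "source of truth"): "all domain events",
}


def candidate_conflicts(a, b):
    """Keys asserted by BOTH documents with different values.

    Keys present in only one document are omissions, which the prompt
    excluded, so they are skipped here.
    """
    return {k: (a[k], b[k]) for k in a.keys() & b.keys() if a[k] != b[k]}


conflicts = candidate_conflicts(doc_a, doc_b)
for key, (claim_a, claim_b) in sorted(conflicts.items()):
    print(key, "| A:", claim_a, "| B:", claim_b)
```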
|
||||
|
||||
**Model strengths on cross-document analysis:**
|
||||
- **GPT-5** excels at ownership/authority conflicts: it systematically
|
||||
checked "who owns this concept" in each document and found mismatches.
|
||||
Its findings cluster around "who writes what" and "who is authoritative."
|
||||
- **Opus** excels at structural/boundary contradictions: it identified where
|
||||
the documents draw architectural lines differently. Its findings cluster
|
||||
around "where are the boundaries" and "what crosses them."
|
||||
- **Sonnet** identifies the obvious/critical issues quickly but doesn't dig
|
||||
deeper. Viable for screening, not for thorough analysis.
|
||||
|
||||
**Comparison to Finding #15 / #27 (single-document coherence checking):**
|
||||
Single-document coherence asks "does this document contradict itself?"
|
||||
Cross-document consistency asks "do these documents contradict each other?"
|
||||
Key differences in results:
|
||||
|
||||
| Aspect | Single-doc coherence | Cross-doc consistency |
|---|---|---|
| Opus findings | 5-7 | 7 |
| GPT-5 findings | 4-6 | 6 |
| Sonnet findings | 4-5 | 4 |
| Opus unique | Design tensions | Structural/boundary mismatches |
| GPT-5 unique | Definitional errors | Ownership/authority conflicts |
| Best model | Task-dependent | Opus (most findings; faster than GPT-5) |
|
||||
|
||||
The relative ordering is similar (Opus ≥ GPT-5 > Sonnet for coherence-style
|
||||
tasks), but the CHARACTER of unique findings shifted. On single-doc coherence,
|
||||
Opus finds design tensions within a single design. On cross-doc consistency,
|
||||
Opus finds BOUNDARY disagreements between two designs. GPT-5 shifts from
|
||||
finding definitional errors to ownership conflicts.
|
||||
|
||||
**Are these findings REAL bugs in the gargoyle documentation?**
|
||||
Yes — several are genuine issues worth fixing:
|
||||
1. The fills-vs-events-as-ground-truth is a real philosophical tension between
the two documents that needs resolution.
2. The Position event ownership (OrderManager vs Ledger) is a real boundary
conflict that affects implementation.
3. The Engine→Trading communication style (internal pipeline vs cross-domain
command) is a genuine structural ambiguity.
4. The signal persistence claim ("never persisted" vs `SignalEmitted` audit
event) is a direct textual contradiction.
|
||||
|
||||
These are the kind of cross-document inconsistencies that cause teams to build
|
||||
inconsistent implementations — one engineer reads Document A and builds one way,
|
||||
another reads Document B and builds differently.
|
||||
|
||||
**Practical implication:** Cross-document consistency analysis is a high-value
|
||||
task for documentation maintenance. Run it when:
|
||||
- A system has multiple architecture docs written at different times
- A refactoring has updated one doc but not another
- Multiple people contribute to design documentation
- Moving from high-level overview to detailed specification
|
||||
|
||||
Opus is the recommended model for this task: faster than GPT-5 (52s vs 125s),
most findings (7 vs 6), and uniquely strong at boundary disagreements. GPT-5
adds value for ownership-specific conflicts. Sonnet is sufficient for quick
screening (catches the Critical issues in 14s) but won't find the architectural
insights.
|
||||
|
||||
**Cost-effectiveness:**
|
||||
- Opus: 2,351 output tokens for 7 findings = 336 tokens/finding (52s)
- GPT-5: 9,415 output + 8,384 reasoning tokens for 6 findings = 2,967 tokens/finding (125s)
- Sonnet: 776 output tokens for 4 findings = 194 tokens/finding (14s)
|
||||
|
||||
Opus is the clear winner on this task type: more findings than GPT-5, 2.4x
|
||||
faster, and 8.8x more token-efficient per finding. GPT-5's massive reasoning
|
||||
investment (8,384 tokens) produced only one fewer finding than Opus — the
|
||||
verification overhead is not paying off here because cross-document contradictions
|
||||
are relatively easy to verify once identified (just check both documents).
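
The per-finding figures above can be reproduced with a few lines of arithmetic. This is a small sketch using the numbers from the results table; note one assumption: the Claude models' reasoning tokens are reported as "(internal)", so they are counted as 0 here, which means the Claude figures cover output tokens only.

```python
# Figures copied from the results table in this finding.
# Assumption: Claude reasoning tokens are not reported, so they are set to 0.
runs = {
    "Opus":   {"output": 2351, "reasoning": 0,    "findings": 7, "seconds": 52},
    "GPT-5":  {"output": 9415, "reasoning": 8384, "findings": 6, "seconds": 125},
    "Sonnet": {"output": 776,  "reasoning": 0,    "findings": 4, "seconds": 14},
}


def tokens_per_finding(run):
    """Total reported tokens (output + reasoning) divided by findings."""
    return (run["output"] + run["reasoning"]) / run["findings"]


per_finding = {model: tokens_per_finding(run) for model, run in runs.items()}
token_ratio = per_finding["GPT-5"] / per_finding["Opus"]                # GPT-5 vs Opus cost per finding
speed_ratio = runs["GPT-5"]["seconds"] / runs["Opus"]["seconds"]        # GPT-5 vs Opus wall time

for model, value in per_finding.items():
    print(f"{model}: {value:.1f} tokens/finding")
print(f"token ratio: {token_ratio:.1f}x, speed ratio: {speed_ratio:.1f}x")
```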
|
||||
@@ -0,0 +1,174 @@
|
||||
# Finding 29: Adversarial manipulation analysis: NEW task type — GPT-5 is most exhaustive and systematic; Opus produces qualitatively different attack vectors with system-level thinking; Sonnet is adequate but less creative
|
||||
|
||||
**Date:** 2026-05-05
|
||||
**Task:** Identify adversarial manipulation paths in gargoyle's `aggregation.md` (193 lines)
|
||||
— how a misbehaving, compromised, or buggy upstream component could exploit the
|
||||
aggregator's design guarantees to produce harmful trading outcomes that bypass
|
||||
downstream safety controls.
|
||||
**How we used them:** Same document (full text) + same focused analytical question to all
|
||||
3 models via HAI proxy. Highly structured prompt specifying 5 categories of adversarial
|
||||
manipulation (signal injection, timing manipulation, capacity weaponization, state
|
||||
corruption via crash, audit evasion). Required specific output format per finding
|
||||
(attack vector, mechanism, exploit, why downstream controls miss it, severity). No tools,
|
||||
no project context beyond the document itself.
|
||||
|
||||
| Model | Time | Output tokens | Reasoning tokens | Attack vectors found | Critical | High | Medium |
|---|---|---|---|---|---|---|---|
| Claude Sonnet 4.6 | 27s | 1,257 | (internal) | 10 | 3 | 5 | 2 |
| Claude Opus 4.6 | 84s | 3,662 | (internal) | 12 | 5 | 5 | 0 |
| GPT-5 | 111s | 8,808 | 6,336 | 15 | 2 | 10 | 3 |
|
||||
|
||||
**What they found — common ground (all 3 identified):**
|
||||
- Primary signal hijacking via ranking manipulation (last-tick injection in
time-windowed aggregation to control decision parameters)
|
||||
- Threshold gaming via signal replay/duplication (no deduplication means N
|
||||
identical signals satisfy "N confirmations")
|
||||
- Capacity flooding to force premature completion or deny legitimate trades
|
||||
- Strategic crash to erase unfavorable in-flight groups
|
||||
- Timeout-masqueraded manipulation (making attacks look like normal system behavior
|
||||
in the audit trail)
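
The threshold-gaming item in the list above comes down to counting arrivals rather than independent signals. The sketch below is illustrative only: the `Signal` fields and both counting functions are hypothetical, since the real signal schema and aggregator code are not shown in this finding.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    # Hypothetical fields; the real signal schema is not part of this finding.
    source: str
    instrument: str
    direction: str
    observed_at: int  # tick of the underlying market observation


def naive_confirmations(signals):
    """Counts every arrival, so replays of one signal inflate the count."""
    return len(signals)


def independent_confirmations(signals):
    """Counts distinct signals only; exact replays collapse to one."""
    return len(set(signals))


threshold = 3
replayed = [Signal("strategy-a", "XYZ", "BUY", 1000)] * 3  # one signal sent 3 times

print(naive_confirmations(replayed) >= threshold)        # replay meets the threshold
print(independent_confirmations(replayed) >= threshold)  # deduplicated count does not
```

Exact-match deduplication is the weakest possible check; an adversary who varies one field slightly would still pass, which is why the findings also mention independence validation.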
|
||||
|
||||
**GPT-5 unique findings (not in either Claude model):**
|
||||
- **Direction flip against majority via ranking:** In "most recent" ranking,
|
||||
emit multiple SELL confirmations then inject a late BUY — the BUY becomes
|
||||
primary and the decision contradicts the bulk of evidence. Distinct from
|
||||
general primary hijack because it's specifically about *directional* reversal.
|
||||
- **Late-arrival exclusion of counter-signals:** Time signals so countervailing
|
||||
signals arrive just after group destruction, ensuring the decision is formed
|
||||
without dissenting inputs that would have altered ranking.
|
||||
- **Capacity filter to curate the audit set:** Pre-fill buffer with chosen
|
||||
signals so riskier alternatives cannot be included before capacity fires —
|
||||
the contributing signals list looks clean.
|
||||
- **Timer nullification by crash:** Crash just before a timeout that would
|
||||
force-complete an unfavorable decision. The timer becomes a no-op on restart,
and no decision or expiry event is emitted.
|
||||
- **Decision drop via induced forwarding failure:** Exploit the "Decision
|
||||
forwarding fails: Decision is lost" failure mode to selectively suppress
|
||||
protective decisions (stops, hedges) with no automatic retry.
|
||||
- **Crash to erase evidence of contrary signals:** Post-crash, submit a
|
||||
fresh group that completes quickly; audit shows only the new set, not the
|
||||
earlier contradictory pre-crash signals.
|
||||
|
||||
**Claude Opus unique findings (not in either other model):**
|
||||
- **Instrument fragmentation to multiply position size:** Emit signals for
|
||||
economically equivalent exposures using different instrument identifiers.
|
||||
Each gets its own group, each produces a separate decision, bypassing
|
||||
per-group capacity limits. Combined position exceeds what any single group
|
||||
would allow. Identifies TOCTOU at the fan-in to PortfolioRisk.
|
||||
- **Forced stale decision via timer exploitation:** Emit one signal at a
|
||||
favorable price spike known to be transient, then deliberately withhold
|
||||
further signals. Timer force-completes with a stale price. The entry price
|
||||
WAS valid when the signal was generated — PortfolioRisk doesn't check
|
||||
staleness of decision prices.
|
||||
- **Timeout prevention / keep-alive suppression:** Manipulate market data
|
||||
feed to suppress signals that would reach threshold N. Group expires
|
||||
normally — denial-of-trading attack disguised as insufficient confirmation.
|
||||
- **Crash-restart duplicate decisions:** Crash after decision is forwarded
|
||||
but before strategy reflects it. Both restart "clean" — strategy re-emits
|
||||
signals, aggregator produces a second decision with a fresh ID. Same trade
|
||||
executes twice. PortfolioRisk can't deduplicate because IDs are different.
|
||||
- **Force-complete with insufficient confirmation (capacity < threshold):**
|
||||
If capacity limit is lower than threshold, hitting capacity ALWAYS force-
|
||||
completes before predicate is satisfied. Fundamentally changes a 5-confirmation
|
||||
strategy into a 3-confirmation strategy.
|
||||
- **Pattern predicate as arbitrary decision trigger:** If adversary controls
|
||||
predicate logic (via strategy configuration), can make pattern-complete
|
||||
trigger on any single signal while audit shows algorithm=pattern-complete
|
||||
and reason=:predicate. Trust boundary between configuration and execution.
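
The "capacity < threshold" item in the list above is easy to see in a toy model. This sketch assumes, as the finding describes, that a group force-completes when it hits capacity; the function names and the reason strings are hypothetical and not taken from `aggregation.md`.

```python
def completion_reason(signal_count, threshold, capacity):
    """Why a group would complete after `signal_count` signals, or None if it stays open."""
    if signal_count >= threshold:
        return "predicate"   # enough confirmations
    if signal_count >= capacity:
        return "capacity"    # forced completion before the predicate is satisfied
    return None


def effective_confirmations(threshold, capacity):
    """Smallest signal count at which the group completes for any reason."""
    return min(threshold, capacity)


def is_misconfigured(threshold, capacity):
    """True when capacity always fires before the threshold can be reached."""
    return capacity < threshold


# A nominal 5-confirmation strategy with a capacity limit of 3.
print(completion_reason(3, threshold=5, capacity=3))
print(effective_confirmations(threshold=5, capacity=3))
print(is_misconfigured(threshold=5, capacity=3))
```

A startup-time check along the lines of `is_misconfigured` would be one cheap way to rule this case out, assuming both values are known at configuration time.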
|
||||
|
||||
**Claude Sonnet unique findings (not in either other model):**
|
||||
- **Cross-group timing coordination:** Coordinate signal injection across
|
||||
multiple instruments to synchronize completion times, creating a burst of
correlated decisions that overwhelm PortfolioRisk's individually-safe
|
||||
evaluations. (NOTE: Opus found a similar concept — instrument fragmentation
|
||||
— but framed it differently: Opus focused on position multiplication via
|
||||
instrument aliasing, Sonnet focused on burst timing overwhelming evaluation.)
|
||||
- **Multi-strategy attack distribution:** Spread manipulation across multiple
|
||||
isolated strategy aggregators so no single aggregator's behavior looks
|
||||
abnormal while cumulative effect is harmful.
|
||||
|
||||
**Quality assessment:**
|
||||
- **GPT-5** produced the most findings (15) with the most systematic coverage
|
||||
across all 5 prompt categories. Its strength was in identifying SPECIFIC
|
||||
INTERLEAVINGS — exactly how timing, state, and ranking mechanisms interact
|
||||
to produce exploits. The direction-flip finding (#3) and the late-arrival
|
||||
exclusion finding (#6) show precise temporal reasoning about when signals
|
||||
arrive relative to group lifecycle events. The "decision drop via forwarding
|
||||
failure" finding exploits a DOCUMENTED failure mode (from the failure table)
|
||||
as an offensive weapon — turning a recovery mechanism into an attack vector.
|
||||
Every finding references specific mechanisms from the spec.
|
||||
- **Claude Opus** produced 12 findings with the most architecturally creative
|
||||
attacks. The instrument fragmentation attack is the most SYSTEMICALLY
|
||||
dangerous finding across all three models — it's not about manipulating one
|
||||
group but about the RELATIONSHIP between groups, and it identifies a
|
||||
TOCTOU vulnerability at the PortfolioRisk fan-in point that no other model
|
||||
found. The crash-restart duplication attack is also architecturally novel —
|
||||
it exploits the "clean state" guarantee as a weapon for invisible trade
|
||||
doubling. Opus consistently reasons about the system BOUNDARY (aggregator
|
||||
→ PortfolioRisk handoff) rather than just within-component mechanics. The
|
||||
pattern-predicate trust boundary finding is uniquely about CONFIGURATION
|
||||
as an attack surface.
|
||||
- **Claude Sonnet** produced 10 findings in 27s, which is extremely efficient
(about 126 output tokens per finding). Findings were adequate and covered all 5 categories,
|
||||
but lacked the specificity of GPT-5 and the architectural creativity of
|
||||
Opus. Several findings were somewhat generic (e.g., "crash at strategic
|
||||
moments" without specifying exactly WHEN relative to group lifecycle).
|
||||
The cross-group coordination and multi-strategy distribution findings show
|
||||
system-level thinking but are stated at a higher abstraction level without
|
||||
concrete exploit sequences.
|
||||
|
||||
**Key insight — "adversarial manipulation analysis" as a task type:**
|
||||
This is qualitatively different from all previous analytical lenses tested.
|
||||
Previous tasks asked models to find problems WITH the design (assumptions,
|
||||
races, incoherences). This task asks models to find ways to USE the design
|
||||
AGAINST itself — a creative/generative adversarial task. Results:
|
||||
|
||||
- **GPT-5** treats it as an exhaustive enumeration exercise — systematically
|
||||
walks through each mechanism and asks "how could this be abused?" High
|
||||
count (15), thorough coverage, but some findings are minor variations of
|
||||
each other (e.g., crash-related findings #10, #12, #15 share the same core
|
||||
mechanism). Reasoning tokens (6,336) used for both generation and verification.
|
||||
- **Opus** treats it as a creative design exercise — asks "what would a
|
||||
smart adversary do that the designer didn't consider?" Fewer findings (12)
|
||||
but several are genuinely novel attack concepts (instrument fragmentation,
|
||||
crash-restart duplication, predicate trust boundary) that require reasoning
|
||||
about the SYSTEM rather than the COMPONENT. Opus also provided a summary
|
||||
table and systemic conclusion about the root design weaknesses.
|
||||
- **Sonnet** treats it as a categorization exercise — fills each prompt
|
||||
category with plausible attacks but at a higher abstraction level. Fast
|
||||
and adequate for a first pass but wouldn't surprise a security reviewer.
|
||||
|
||||
**Comparison to "predictable exploit window" (Finding #18):**
|
||||
Finding #18 noted that Opus uniquely identified predictable exploit windows
|
||||
in escalation-policy.md. Here, Opus again shows the strongest adversarial
|
||||
creativity — the instrument fragmentation attack and crash-restart duplication
|
||||
are both about exploiting DESIGN GUARANTEES (per-instrument grouping, clean
|
||||
restart) as weapons. This confirms that Opus's strength on adversarial analysis
|
||||
is a CONSISTENT PATTERN, not document-specific.
|
||||
|
||||
GPT-5 excels when the adversarial task is framed as "enumerate all possible
|
||||
abuses of each mechanism" (systematic coverage). Opus excels when the task
|
||||
requires "invent novel attack concepts that exploit design boundaries"
|
||||
(creative adversarial thinking).
|
||||
|
||||
**Model hierarchy for adversarial manipulation analysis:**
|
||||
1. GPT-5: most thorough enumeration, best at mechanism-level exploitation (15)
2. Opus: most creative, finds system-boundary attacks others miss (12)
3. Sonnet: adequate first pass, fast, but less specific (10)
|
||||
|
||||
**Practical implication:** For security-oriented architecture review:
|
||||
- Run GPT-5 for comprehensive attack surface enumeration
|
||||
- Run Opus for novel/creative attack vectors that exploit design boundaries
|
||||
- Sonnet is sufficient only as a quick initial screen
|
||||
- The UNION of GPT-5 + Opus findings (removing overlaps) would produce the
|
||||
most complete adversarial analysis
|
||||
|
||||
**New finding about the aggregator itself:** Several attacks identified by
|
||||
multiple models point to real design weaknesses worth addressing:
|
||||
1. No signal deduplication/independence validation (all 3 models)
2. Primary signal determines all decision parameters regardless of group
composition (all 3 models)
3. Transient state + no replay = perfect adversarial erasure tool (all 3)
4. Capacity/timeout treated as normal events even when weaponized (all 3)
5. No cross-group correlation at aggregator level (Opus + Sonnet)
6. TOCTOU at PortfolioRisk fan-in for concurrent decisions (Opus)
|
||||
File diff suppressed because it is too large
@@ -0,0 +1,16 @@
|
||||
# Model Findings — Analytical & Research Work
|
||||
|
||||
_Tracking what actually works (and doesn't) when using AI models for research,
|
||||
analysis, bias detection, and document review — not coding._
|
||||
|
||||
Started: 2026-04-26
|
||||
|
||||
## Context
|
||||
|
||||
We use multiple models in different roles: Claude Code (Opus/Sonnet) for
|
||||
generation, Sonnet + GPT-5 for independent dual review, smaller models for
|
||||
focused analytical tasks. Most public discussion is about coding. We found
|
||||
almost no published methodology for using models in analytical research tasks
|
||||
(searched 2026-04-26). That gap is why we're tracking this.
|
||||
|
||||
Each experiment lives in its own file. See individual finding files below.
|
||||