# Degraded-Mode Propagation Analysis: A New Analytical Lens

**Date:** 2026-05-08
**Finding #:** 52
**Task:** Degraded-mode propagation analysis across three related gargoyle design documents:
`signal-lifecycle.md` (111 lines), `buying-power.md` (103 lines), `risk-controls.md` (315 lines),
spanning the decision engine → risk → order management boundary.

**Analytical lens:** NEW. "When one component enters a degraded state, what happens to the
components that depend on it? Trace degraded-mode behavior across document boundaries."
Distinct from assumption-finding (what's implicit?), gap-finding (what's missing?), or
race conditions (what interleavings fail?). This asks: "Do the documents' degraded
behaviors compose correctly?"

## Method

All 3 models received the same three documents (full text, 529 lines combined) and the
same structured prompt via HAI proxy. The prompt specified 5 categories of degraded-mode
propagation failures: propagation gaps, semantic mismatches, recovery ordering dependencies,
silent degradation, and degradation cascades. It required a specific output format per
finding, with quotes from both documents at each boundary. No tools, no project context
beyond the documents themselves.

## Results

| Model | Time | Output tokens | Reasoning tokens | Findings |
|---|---|---|---|---|
| GPT-5 | 125s | 11,147 | 8,960 | 7 |
| Claude Opus 4.6 | 116s | 5,239 | (internal) | 10 |
| Claude Sonnet 4.6 | 35s | 1,676 | (internal) | 8 |

## What They Found: Common Ground (all 3 identified)

1. **Buying power staleness vs fail-closed semantic mismatch**: Buying Power says "use
   last cached value (pessimistic)" while Risk Controls says "stale cache → reject."
   All three models identified this as the primary boundary gap, though with different
   depth of analysis (see below).

2. **Market data staleness affecting buying power price estimates**: Risk Controls
   scopes staleness to fat-finger only ("other controls unaffected"), but Buying Power
   uses market prices for pro-forma deduction of market orders.

3. **Order rate window reset creating silent permissiveness**: After a crash, the rate
   limiter window is empty at exactly the moment a signal burst is most likely. GPT-5 and
   Opus explored the cascade implications more deeply.

4. **Aggregator timeout splitting decisions**: Under load, the aggregator fires early,
   creating partial decisions that individually pass controls which would have caught
   the combined quantity.

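
The aggregator-split finding (item 4) can be illustrated with a small sketch. Everything here is hypothetical: `MAX_ORDER_QTY`, the `Signal` type, and modelling an early timeout as a smaller `flush_after` are assumptions made for illustration, not names or behavior taken from the gargoyle documents.

```python
from dataclasses import dataclass

MAX_ORDER_QTY = 1_000  # hypothetical per-decision quantity limit


@dataclass(frozen=True)
class Signal:
    symbol: str
    qty: int


def passes_quantity_control(qty: int) -> bool:
    """Per-decision control: it only ever sees one decision at a time."""
    return qty <= MAX_ORDER_QTY


def aggregate(signals: list[Signal], flush_after: int) -> list[int]:
    """Group signals into decisions, flushing every `flush_after` signals.

    A smaller `flush_after` stands in for an aggregator timeout firing early.
    """
    if flush_after < 1:
        raise ValueError("flush_after must be >= 1")
    return [
        sum(s.qty for s in signals[i:i + flush_after])
        for i in range(0, len(signals), flush_after)
    ]


signals = [Signal("XYZ", 600), Signal("XYZ", 600)]

normal = aggregate(signals, flush_after=2)    # one combined decision
degraded = aggregate(signals, flush_after=1)  # early flush: two partial decisions

normal_verdicts = [passes_quantity_control(q) for q in normal]
degraded_verdicts = [passes_quantity_control(q) for q in degraded]

print(normal, normal_verdicts)      # combined decision is rejected
print(degraded, degraded_verdicts)  # both partial decisions are accepted
```

In the degraded case every individual check succeeds, so nothing downstream can tell that the combined quantity exceeded the limit.
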
## GPT-5 Unique Findings (not in either Claude model)

- **Concurrent pro-forma deduction race** (CRITICAL): Under overload, "Buy order submitted"
  event delays mean `pending_buy_obligations` is stale, allowing multiple concurrent decisions
  to pass buying power simultaneously. Worse than "unavailable" because it mimics normal
  acceptance.
- **Broker restriction recovery has no re-derivation trigger**: If a restriction lifts
  without a fill or market open, buying power stays at restricted level indefinitely. No
  alert, no timeout.
- **Market-open refresh failure + prior-day cache can be optimistic**: Pre-open overnight
  debits (fees, holds) make cached value optimistic, contradicting Risk Controls' assumption
  that staleness is always conservative.

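
One way to picture the concurrent pro-forma race, and one possible mitigation, is a ledger that performs the buying-power check and the `pending_buy_obligations` update inside a single critical section, instead of relying on a delayed "Buy order submitted" event. This is a minimal sketch under assumptions: `BuyingPowerLedger`, `try_reserve`, `release`, and `headroom` are hypothetical names, and the design documents do not specify this mechanism.

```python
import threading


class BuyingPowerLedger:
    """Hypothetical ledger: check and deduction happen in one critical section."""

    def __init__(self, available: float) -> None:
        self._available = available
        self._pending_buy_obligations = 0.0
        self._lock = threading.Lock()

    def try_reserve(self, cost: float) -> bool:
        """Atomically check headroom and record the obligation."""
        with self._lock:
            headroom = self._available - self._pending_buy_obligations
            if cost > headroom:
                return False
            self._pending_buy_obligations += cost
            return True

    def release(self, cost: float) -> None:
        """Undo a reservation, e.g. after a later control rejects the order."""
        with self._lock:
            self._pending_buy_obligations = max(
                0.0, self._pending_buy_obligations - cost
            )

    def headroom(self) -> float:
        with self._lock:
            return self._available - self._pending_buy_obligations


ledger = BuyingPowerLedger(available=1_000.0)
results: list[bool] = []


def decide() -> None:
    # Each thread stands in for one concurrent decision costing 300.
    results.append(ledger.try_reserve(300.0))


threads = [threading.Thread(target=decide) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

accepted = sum(results)
print(accepted, ledger.headroom())  # only 3 of the 10 decisions fit in 1,000
```

If the deduction were instead applied by a delayed event, all ten checks could read the same stale headroom and pass, which is the failure the finding describes.
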
## Claude Opus Unique Findings (not in either other model)

- **NoShortSales contradicts Signal Risk's "no portfolio context" description** (HIGH):
  Signal-level controls are described as having "no portfolio context" but NoShortSales
  requires position knowledge. Architectural contradiction about what data Signal Risk can
  access; degraded position data behavior at this stage is unspecified.
- **Strategy crash mid-cycle leaves stale signals in aggregator buffer** (HIGH):
  Failure mode table covers crash "mid-signal" (atomicity) but NOT crash between signals
  in a multi-signal cycle. Pre-crash signals mix with post-restart fresh signals, forming
  decisions from inconsistent market state.
- **Reconciliation gate location unspecified** (HIGH): Buying Power says "blocks order flow"
  but Risk Controls doesn't mention it. If the gate is at OrderManager, Portfolio Risk
  evaluates against uninitialized buying power. If it is at signal level, aggregator buffers
  fill during reconciliation.
- **Progressive buying power divergence during connectivity loss** (MEDIUM): Sell fills
  produce "no immediate change" + failed broker queries = progressive conservative drift.
  System enters sell-only mode where continuous monitoring liquidates but re-entry is blocked.
- **Pro-forma deduction reversal for internal rejection unspecified** (MEDIUM): Buying power
  deducts at control #4 but fat finger rejects at control #9. The deduction-reversal path
  for pre-submission rejection within the risk pipeline is unspecified.
- **Lost pending order state indistinguishable from empty** (CRITICAL): Duplicate Order and
  Self-Trade controls have no specified fail-closed behavior for "pending order data
  unavailable." Empty state = permissive, potentially allowing duplicate submissions.

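
The "lost state indistinguishable from empty" finding comes down to a missing third state. A minimal sketch, assuming a hypothetical `duplicate_order_check` function and `Verdict` enum (neither appears in the design documents), represents "unavailable" as `None` so it can never be confused with an empty set:

```python
from enum import Enum
from typing import Optional


class Verdict(Enum):
    ACCEPT = "accept"
    REJECT_DUPLICATE = "reject_duplicate"
    REJECT_STATE_UNAVAILABLE = "reject_state_unavailable"


def duplicate_order_check(
    order_key: str, pending: Optional[frozenset[str]]
) -> Verdict:
    """Fail closed when pending-order state is unknown.

    `pending is None` means "state unavailable"; an empty frozenset means
    "state known, and there are no pending orders".
    """
    if pending is None:
        return Verdict.REJECT_STATE_UNAVAILABLE
    if order_key in pending:
        return Verdict.REJECT_DUPLICATE
    return Verdict.ACCEPT


known_empty = duplicate_order_check("XYZ-BUY-100", frozenset())
known_dup = duplicate_order_check("XYZ-BUY-100", frozenset({"XYZ-BUY-100"}))
unknown = duplicate_order_check("XYZ-BUY-100", None)

print(known_empty, known_dup, unknown)
```

A control that collapses `None` into an empty collection would return `ACCEPT` in the third case, which is the permissive behavior Opus flagged.
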
## Claude Sonnet Findings

Sonnet identified 8 findings, mostly overlapping with GPT-5 and Opus at lower depth.
Its unique contributions:

- **PDT equity calculation recovery ordering** (CRITICAL): If PDT equity recovers with
  different values than what buying power assumed during degradation, no reconciliation
  process exists.
- **Audit log write failure behavior** (HIGH): Whether audit write failure stops or continues
  evaluation is unspecified, creating either compliance gaps or unnecessary service disruption.

However, several Sonnet findings were less precise: Finding #1 restates the boundary gap
without exploring the optimistic case, and Finding #7 (PDT) was somewhat speculative without
strong textual evidence.

## Quality Assessment

- **GPT-5** was technically precise and found the most practically dangerous issues. The
  concurrent pro-forma race (Finding 3) is the most operationally critical finding across
  all models: it's a real concurrency bug that could cause financial loss during high-volume
  trading. All findings included precise mechanism descriptions and specific interleaving
  sequences. However, it was LESS prolific than Opus (7 vs 10) despite using more tokens.

- **Claude Opus** was the most prolific (10 findings) and found the deepest architectural
  issues. The NoShortSales/Signal-Risk contradiction (Finding 2) is a genuine spec-level
  inconsistency that no other model identified. The progressive divergence finding (Finding 8)
  shows multi-step causal reasoning about how one degradation creates a cascading economic
  effect. The pending-order-state finding (Finding 10) identifies a category of silent
  degradation that applies to an ENTIRE CLASS of controls, not just one component.
  Opus's characteristic strength, reasoning about design TENSIONS, manifests here as
  finding places where one document's degradation model creates a trap for another document's
  assumptions.

- **Claude Sonnet** was fast (35s, under a third of the time either other model took) and
  adequate but shallow. Findings were correct but didn't explore second-order effects or
  multi-step cascades. The PDT finding was speculative. Sonnet identified the right THEMES
  but didn't trace them to their architectural consequences.

## Key Insight: "Degraded-mode propagation" as an analytical lens

This is genuinely distinct from previous lenses in two ways:

1. **It's inherently cross-document.** Unlike assumption-finding (which can work on a single
   doc), degraded-mode propagation REQUIRES multiple documents because it specifically asks
   about boundaries. This makes it ideal for detecting integration issues.

2. **It finds a different CLASS of bug.** Previous lenses found issues with the happy path
   (gaps, assumptions) or the error path (race conditions, invariant violations). This lens
   finds issues with the PARTIALLY-DEGRADED path, where components are working but with
   degraded inputs they don't fully detect.

The most architecturally significant findings (concurrent pro-forma race, aggregator split
decisions, progressive divergence cascade) are all about systems that APPEAR to be working
but are silently making incorrect decisions because one upstream component is degraded in a
way that mimics normal behavior.

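
The phrase "degraded inputs they don't fully detect" can be made concrete with the buying-power staleness mismatch from the common findings. The sketch below is an illustration under assumptions: `BuyingPowerReading`, `MAX_AGE_S`, and both check functions are hypothetical, and the documents do not say which interface the real components use. It contrasts a consumer that receives a bare cached number with one that receives the number together with its age.

```python
from dataclasses import dataclass

MAX_AGE_S = 5.0  # hypothetical staleness threshold for the risk control


@dataclass(frozen=True)
class BuyingPowerReading:
    value: float
    age_s: float  # seconds since the last successful broker refresh


def risk_check_bare(order_cost: float, buying_power: float) -> bool:
    """Consumer sees only a number; a stale cache looks like a fresh value."""
    return order_cost <= buying_power


def risk_check_tagged(order_cost: float, reading: BuyingPowerReading) -> bool:
    """Consumer sees the degradation and applies "stale cache -> reject"."""
    if reading.age_s > MAX_AGE_S:
        return False
    return order_cost <= reading.value


stale = BuyingPowerReading(value=10_000.0, age_s=120.0)
fresh = BuyingPowerReading(value=10_000.0, age_s=1.0)

bare_on_stale = risk_check_bare(5_000.0, stale.value)  # degradation is invisible
tagged_on_stale = risk_check_tagged(5_000.0, stale)    # degradation is detected
tagged_on_fresh = risk_check_tagged(5_000.0, fresh)

print(bare_on_stale, tagged_on_stale, tagged_on_fresh)
```

With the bare interface, the producer's "use last cached value" policy silently overrides the consumer's "stale cache → reject" policy, because the staleness never crosses the boundary.
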
## Model Comparison for This Task Type

| Dimension | GPT-5 | Opus | Sonnet |
|---|---|---|---|
| Finding count | 7 | 10 | 8 |
| CRITICAL findings | 1 | 4 | 2 |
| Unique insights | 3 | 6 | 2 |
| Tokens per finding | 1,592 | 524 | 210 |
| Cascade reasoning | Deep | Deep | Surface |
| Cross-doc awareness | High | Highest | Moderate |

**Opus is the strongest model for this task type.** This is the first experiment where
Opus decisively outperforms GPT-5 in both quantity AND quality. The reason: degraded-mode
propagation analysis is fundamentally about design TENSIONS between documents, which is
exactly Opus's consistent strength across all previous experiments. GPT-5's strength
(exhaustive technical detail) matters less here because the findings are at the boundary
level, not the implementation level.

## Practical Implication

For cross-document integration review:

- **Opus** as primary reviewer (highest insight density, finds architectural contradictions)
- **GPT-5** as secondary reviewer (finds the operational/concurrency issues Opus misses)
- **Sonnet** for quick first-pass only (identifies themes but not consequences)

The ideal workflow: Run Opus on all cross-document boundaries first. Then run GPT-5 on
any boundary where Opus flagged design tensions. GPT-5 will find the specific race
conditions and concurrency bugs that make those tensions exploitable.

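
The two-pass workflow can be sketched as a small orchestration function. This is only an outline under assumptions: `two_pass_review` and the `Reviewer` callable type are hypothetical, and the stub reviewers stand in for real model calls, which are deliberately not shown.

```python
from typing import Callable

# A reviewer takes a boundary description and returns a list of findings.
Reviewer = Callable[[str], list[str]]


def two_pass_review(
    boundaries: list[str], primary: Reviewer, secondary: Reviewer
) -> dict[str, dict[str, list[str]]]:
    """Run the primary reviewer everywhere; run the secondary reviewer
    only on boundaries where the primary flagged at least one tension."""
    report: dict[str, dict[str, list[str]]] = {}
    for boundary in boundaries:
        tensions = primary(boundary)
        followups = secondary(boundary) if tensions else []
        report[boundary] = {"tensions": tensions, "followups": followups}
    return report


def stub_primary(boundary: str) -> list[str]:
    # Stand-in for the primary (Opus) pass.
    if "buying-power" in boundary:
        return ["stale cache semantics disagree"]
    return []


def stub_secondary(boundary: str) -> list[str]:
    # Stand-in for the secondary (GPT-5) pass.
    return [f"concurrency follow-up for {boundary}"]


report = two_pass_review(
    ["signal-lifecycle -> risk-controls", "buying-power -> risk-controls"],
    stub_primary,
    stub_secondary,
)
for boundary, result in report.items():
    print(boundary, result)
```

Gating the second pass on the first keeps the more token-hungry reviewer focused on boundaries that already show a design tension.
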
## Cost-Effectiveness

- Opus: 10 findings in 116s at 5,239 tokens = **524 tokens per finding** (best balance of
  cost and depth)
- GPT-5: 7 findings in 125s at 11,147 tokens = 1,592 tokens per finding
- Sonnet: 8 findings in 35s at 1,676 tokens = 210 tokens per finding (cheapest, but
  lowest quality per finding)

For architecture review at document boundaries, Opus delivers ~3× the insight density
per token compared to GPT-5 (1,592 vs 524 tokens per finding), while finding more issues.
This inverts the typical pattern from previous experiments where GPT-5 was most
cost-effective for exhaustive analysis.

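
The per-finding figures above follow directly from the Results table. A short check, using only numbers reported in this document (the variable names are illustrative):

```python
# (output tokens, findings, seconds) per model, copied from the Results table.
runs = {
    "GPT-5": (11_147, 7, 125),
    "Claude Opus 4.6": (5_239, 10, 116),
    "Claude Sonnet 4.6": (1_676, 8, 35),
}

tokens_per_finding = {
    model: round(tokens / findings)
    for model, (tokens, findings, _seconds) in runs.items()
}

# Ratio behind the "~3x insight density per token" claim (GPT-5 vs Opus).
gpt5_tokens, gpt5_findings, _ = runs["GPT-5"]
opus_tokens, opus_findings, _ = runs["Claude Opus 4.6"]
density_ratio = (gpt5_tokens / gpt5_findings) / (opus_tokens / opus_findings)

print(tokens_per_finding)
print(round(density_ratio, 2))
```

Note that tokens per finding treats all findings as equal in value; the quality differences discussed earlier are not captured by this metric.
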