Cross-document boundary analysis: signal-lifecycle + buying-power + risk-controls. Opus decisively outperforms GPT-5 (10 vs 7 findings) — first inversion observed. New lens finds a distinct class of bug: partially-degraded paths that mimic normal behavior.
Degraded-Mode Propagation Analysis: A New Analytical Lens
Date: 2026-05-08
Finding #: 52
Task: Degraded-mode propagation analysis across three related gargoyle design documents:
signal-lifecycle.md (111 lines), buying-power.md (103 lines), risk-controls.md (315 lines)
— spanning the decision engine → risk → order management boundary.
Analytical lens: NEW. "When one component enters a degraded state, what happens to the components that depend on it? Trace degraded-mode behavior across document boundaries." Distinct from assumption-finding (what's implicit?), gap-finding (what's missing?), or race conditions (what interleavings fail?). This asks: "Do the documents' degraded behaviors compose correctly?"
Method
All three models received the same three documents (full text, 529 lines combined) and the same structured prompt via HAI proxy. The prompt specified five categories of degraded-mode propagation failure: propagation gaps, semantic mismatches, recovery ordering dependencies, silent degradation, and degradation cascades. It required a specific output format per finding, with quotes from both documents at each boundary. The models had no tools and no project context beyond the documents themselves.
Results
| Model | Time | Output tokens | Reasoning tokens | Findings |
|---|---|---|---|---|
| GPT-5 | 125s | 11,147 | 8,960 | 7 |
| Claude Opus 4.6 | 116s | 5,239 | (internal) | 10 |
| Claude Sonnet 4.6 | 35s | 1,676 | (internal) | 8 |
What They Found — Common Ground (all 3 identified)
- Buying power staleness vs fail-closed semantic mismatch: Buying Power says "use last cached value (pessimistic)" while Risk Controls says "stale cache → reject." All three models identified this as the primary boundary gap, though with different depth of analysis (see below).
- Market data staleness affecting buying power price estimates: Risk Controls scopes staleness to fat-finger only ("other controls unaffected"), but Buying Power uses market prices for pro-forma deduction of market orders.
- Order rate window reset creating silent permissiveness: After a crash, the rate limiter is empty at exactly the moment a signal burst is most likely. GPT-5 and Opus explored the cascade implications more deeply.
- Aggregator timeout splitting decisions: Under load, the aggregator fires early, creating partial decisions that individually pass controls which would have caught the combined quantity.
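The first mismatch above can be made concrete with a minimal sketch. It assumes a hypothetical staleness threshold (`STALE_AFTER_SECONDS`) and hypothetical function and type names; neither document is known to define them. It only illustrates how the two quoted policies return different verdicts for the same stale cache.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    ACCEPT = "accept"
    REJECT = "reject"


@dataclass(frozen=True)
class CachedBuyingPower:
    value: float        # last buying power obtained from the broker
    age_seconds: float  # time since that value was refreshed


# Hypothetical threshold; the source documents' real value is not known here.
STALE_AFTER_SECONDS = 30.0


def buying_power_policy(cache: CachedBuyingPower, order_cost: float) -> Verdict:
    """Reading of Buying Power: keep using the last cached value, stale or not."""
    return Verdict.ACCEPT if order_cost <= cache.value else Verdict.REJECT


def risk_controls_policy(cache: CachedBuyingPower, order_cost: float) -> Verdict:
    """Reading of Risk Controls: a stale cache rejects regardless of the value."""
    if cache.age_seconds > STALE_AFTER_SECONDS:
        return Verdict.REJECT
    return Verdict.ACCEPT if order_cost <= cache.value else Verdict.REJECT


def policies_disagree(cache: CachedBuyingPower, order_cost: float) -> bool:
    """True when the two documents would give different answers for one order."""
    return buying_power_policy(cache, order_cost) is not risk_controls_policy(cache, order_cost)
```

Under these assumptions the policies agree on a fresh cache and diverge only when the cache is stale and the order would otherwise fit, which is the boundary case all three models flagged.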
GPT-5 Unique Findings (not in either Claude model)
- Concurrent pro-forma deduction race (CRITICAL): Under overload, "Buy order submitted" event delays mean pending_buy_obligations is stale, allowing multiple concurrent decisions to pass buying power simultaneously. Worse than "unavailable" because it mimics normal acceptance.
- Broker restriction recovery has no re-derivation trigger: If a restriction lifts without a fill or market open, buying power stays at restricted level indefinitely. No alert, no timeout.
- Market-open refresh failure + prior-day cache can be optimistic: Pre-open overnight debits (fees, holds) make cached value optimistic, contradicting Risk Controls' assumption that staleness is always conservative.
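The concurrent pro-forma race can be illustrated with a small deterministic simulation. This is a sketch under assumptions, not the documented design: the class names and the event queue are hypothetical, and only `pending_buy_obligations` and the "Buy order submitted" event come from the finding. The first ledger updates obligations only when the event is delivered; the second reserves inside the check itself.

```python
import threading
from collections import deque


class EventDrivenLedger:
    """Obligations change only when a 'Buy order submitted' event is delivered."""

    def __init__(self, buying_power: float) -> None:
        self.buying_power = buying_power
        self.pending_buy_obligations = 0.0
        self._undelivered_events: deque = deque()

    def check_and_submit(self, cost: float) -> bool:
        # The check reads obligations that may lag behind already-accepted orders.
        available = self.buying_power - self.pending_buy_obligations
        if cost > available:
            return False
        self._undelivered_events.append(cost)  # emitted, but not yet applied
        return True

    def deliver_events(self) -> None:
        # Simulates the delayed event handler catching up after the overload.
        while self._undelivered_events:
            self.pending_buy_obligations += self._undelivered_events.popleft()


class ReservingLedger:
    """Hypothetical alternative: check and reservation happen atomically."""

    def __init__(self, buying_power: float) -> None:
        self.buying_power = buying_power
        self.pending_buy_obligations = 0.0
        self._lock = threading.Lock()

    def check_and_submit(self, cost: float) -> bool:
        with self._lock:
            available = self.buying_power - self.pending_buy_obligations
            if cost > available:
                return False
            self.pending_buy_obligations += cost
            return True
```

With 10,000 of buying power and three 6,000 orders arriving before any event is delivered, the event-driven ledger accepts all three and ends with 18,000 in obligations, while the reserving ledger accepts only the first. Each acceptance in the first case looks like a normal pass, which is why the finding rates this as worse than "unavailable".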
Claude Opus Unique Findings (not in either other model)
- NoShortSales contradicts Signal Risk's "no portfolio context" description (HIGH): Signal-level controls are described as having "no portfolio context" but NoShortSales requires position knowledge. Architectural contradiction about what data Signal Risk can access; degraded position data behavior at this stage is unspecified.
- Strategy crash mid-cycle leaves stale signals in aggregator buffer (HIGH): Failure mode table covers crash "mid-signal" (atomicity) but NOT crash between signals in a multi-signal cycle. Pre-crash signals mix with post-restart fresh signals, forming decisions from inconsistent market state.
- Reconciliation gate location unspecified (HIGH): Buying Power says "blocks order flow" but Risk Controls don't mention it. If gate is at OrderManager, Portfolio Risk evaluates against uninitialized buying power. If at signal level, aggregator buffers fill during reconciliation.
- Progressive buying power divergence during connectivity loss (MEDIUM): Sell fills produce "no immediate change" + failed broker queries = progressive conservative drift. System enters sell-only mode where continuous monitoring liquidates but re-entry is blocked.
- Pro-forma deduction reversal for internal rejection unspecified (MEDIUM): Buying power deducts at control #4 but fat finger rejects at control #9. The deduction-reversal path for pre-submission rejection within the risk pipeline is unspecified.
- Lost pending order state indistinguishable from empty (CRITICAL): Duplicate Order and Self-Trade controls have no specified fail-closed behavior for "pending order data unavailable." Empty state = permissive, potentially allowing duplicate submissions.
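The last finding in this list, where lost state is indistinguishable from empty state, comes down to whether "unavailable" has its own representation. The sketch below is an assumption-laden illustration: the function names and the string order keys are hypothetical, and `None` stands in for "pending order data unavailable".

```python
from typing import FrozenSet, Optional


def duplicate_check_permissive(pending: Optional[FrozenSet[str]], order_key: str) -> bool:
    """Collapses 'unavailable' into 'empty', so lost state allows the order."""
    return order_key not in (pending or frozenset())


def duplicate_check_fail_closed(pending: Optional[FrozenSet[str]], order_key: str) -> bool:
    """Keeps 'unavailable' (None) distinct from 'no pending orders' (empty set)."""
    if pending is None:
        return False  # state unknown: reject rather than assume nothing is pending
    return order_key not in pending
```

Both functions return `True` when the order may proceed. They behave identically whenever the pending set is known, and differ only when it is `None`, which is the degraded case the control descriptions leave unspecified.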
Claude Sonnet Findings
Sonnet identified 8 findings, mostly overlapping with GPT-5 and Opus at lower depth. Its unique contributions:
- PDT equity calculation recovery ordering (CRITICAL): If PDT equity recovers with different values than what buying power assumed during degradation, no reconciliation process exists.
- Audit log write failure behavior (HIGH): Whether audit write failure stops or continues evaluation is unspecified, creating either compliance gaps or unnecessary service disruption.
However, several Sonnet findings were less precise — Finding #1 restates the boundary gap without exploring the optimistic case, and Finding #7 (PDT) was somewhat speculative without strong textual evidence.
Quality Assessment
- GPT-5 was technically precise and found the most practically dangerous issues. The concurrent pro-forma race (Finding 3) is the most operationally critical finding across all models: it is a real concurrency bug that could cause financial loss during high-volume trading. All findings included precise mechanism descriptions and specific interleaving sequences. However, it was LESS prolific than Opus (7 vs 10) despite using more tokens.
- Claude Opus was the most prolific (10 findings) and found the deepest architectural issues. The NoShortSales/Signal-Risk contradiction (Finding 2) is a genuine spec-level inconsistency that no other model identified. The progressive divergence finding (Finding 8) shows multi-step causal reasoning about how one degradation creates a cascading economic effect. The pending-order-state finding (Finding 10) identifies a category of silent degradation that applies to an ENTIRE CLASS of controls, not just one component. Opus's characteristic strength, reasoning about design TENSIONS, manifests here as finding places where one document's degradation model creates a trap for another document's assumptions.
- Claude Sonnet was fast (35s, under a third of the time taken by either other model) and adequate but shallow. Findings were correct but didn't explore second-order effects or multi-step cascades. The PDT finding was speculative. Sonnet identified the right THEMES but didn't trace them to their architectural consequences.
Key Insight — "Degraded-mode propagation" as an analytical lens
This is genuinely distinct from previous lenses in two ways:
- It's inherently cross-document. Unlike assumption-finding (which can work on a single doc), degraded-mode propagation REQUIRES multiple documents because it specifically asks about boundaries. This makes it ideal for detecting integration issues.
- It finds a different CLASS of bug. Previous lenses found issues with the happy path (gaps, assumptions) or the error path (race conditions, invariant violations). This lens finds issues with the PARTIALLY-DEGRADED path, where components are working but with degraded inputs they don't fully detect.
The most architecturally significant findings (concurrent pro-forma race, aggregator split decisions, progressive divergence cascade) are all about systems that APPEAR to be working but are silently making incorrect decisions because one upstream component is degraded in a way that mimics normal behavior.
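The aggregator split is a compact example of this pattern. The sketch below assumes a hypothetical per-decision quantity limit (`MAX_ORDER_QUANTITY`) and a simplified aggregator that sums signal quantities. It is not the documented aggregator, only an illustration of how an early timeout turns one rejected decision into two accepted ones.

```python
from typing import List, Optional

# Hypothetical per-decision limit; the real control and its value are not known here.
MAX_ORDER_QUANTITY = 1_000


def passes_quantity_control(quantity: int) -> bool:
    return quantity <= MAX_ORDER_QUANTITY


def aggregate(signal_quantities: List[int], fires_early_after: Optional[int] = None) -> List[int]:
    """Return decision quantities; an early timeout splits the cycle in two."""
    if fires_early_after is None:
        return [sum(signal_quantities)]
    return [
        sum(signal_quantities[:fires_early_after]),
        sum(signal_quantities[fires_early_after:]),
    ]


signals = [400, 400, 300, 400]

normal = aggregate(signals)                          # one combined decision
degraded = aggregate(signals, fires_early_after=2)   # aggregator timed out under load

normal_ok = all(passes_quantity_control(q) for q in normal)
degraded_ok = all(passes_quantity_control(q) for q in degraded)
```

Here `normal` is `[1500]` and fails the control, while `degraded` is `[800, 700]` and passes on both decisions even though the total exposure is unchanged. Nothing in the degraded run looks like a failure to the downstream control.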
Model Comparison for This Task Type
| Dimension | GPT-5 | Opus | Sonnet |
|---|---|---|---|
| Finding count | 7 | 10 | 8 |
| CRITICAL findings | 1 | 4 | 2 |
| Unique insights | 3 | 6 | 2 |
| Tokens per finding | 1,592 | 524 | 210 |
| Cascade reasoning | Deep | Deep | Surface |
| Cross-doc awareness | High | Highest | Moderate |
Opus is the strongest model for this task type. This is the first experiment where Opus decisively outperforms GPT-5 in both quantity AND quality. The reason: degraded-mode propagation analysis is fundamentally about design TENSIONS between documents — which is exactly Opus's consistent strength across all previous experiments. GPT-5's strength (exhaustive technical detail) matters less here because the findings are at the boundary level, not the implementation level.
Practical Implication
For cross-document integration review:
- Opus as primary reviewer (highest insight density, finds architectural contradictions)
- GPT-5 as secondary reviewer (finds the operational/concurrency issues Opus misses)
- Sonnet for quick first-pass only (identifies themes but not consequences)
The ideal workflow: Run Opus on all cross-document boundaries first. Then run GPT-5 on any boundary where Opus flagged design tensions — GPT-5 will find the specific race conditions and concurrency bugs that make those tensions exploitable.
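The two-stage workflow can be sketched as a small routing function. Everything here is hypothetical scaffolding: `Reviewer` is a stand-in for whatever call reaches each model, and "Opus flagged design tensions" is simplified to "the primary reviewer returned at least one finding".

```python
from typing import Callable, Dict, List

# A reviewer takes a boundary description and returns a list of finding summaries.
Reviewer = Callable[[str], List[str]]


def review_boundaries(
    boundaries: List[str],
    primary: Reviewer,
    secondary: Reviewer,
) -> Dict[str, List[str]]:
    """Run the primary reviewer everywhere; escalate only flagged boundaries."""
    results: Dict[str, List[str]] = {}
    for boundary in boundaries:
        findings = list(primary(boundary))
        if findings:
            # Assumption: any primary finding counts as a flagged design tension.
            findings.extend(secondary(boundary))
        results[boundary] = findings
    return results
```

The secondary reviewer is never invoked on a boundary the primary reviewer considered clean, which is the cost-saving part of the proposal. A stricter escalation rule (for example, by severity) would need a richer finding type than plain strings.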
Cost-Effectiveness
- Opus: 10 findings in 116s at 5,239 output tokens = 524 tokens per finding (best efficiency among the two deep reviewers)
- GPT-5: 7 findings in 125s at 11,147 output tokens = 1,592 tokens per finding
- Sonnet: 8 findings in 35s at 1,676 output tokens = 210 tokens per finding (cheapest, but lowest quality per finding)
For architecture review at document boundaries, Opus delivers ~3× the insight density per token compared to GPT-5, while finding more issues. This inverts the typical pattern from previous experiments where GPT-5 was most cost-effective for exhaustive analysis.
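The per-finding figures and the ~3× ratio can be reproduced from the Results table. The sketch below uses only the output-token and finding counts reported there; the variable names are illustrative, and it does not account for reasoning tokens, which the table lists separately for GPT-5 and as "(internal)" for the Claude models.

```python
# (output tokens, findings) per model, copied from the Results table.
runs = {
    "GPT-5": (11_147, 7),
    "Claude Opus 4.6": (5_239, 10),
    "Claude Sonnet 4.6": (1_676, 8),
}

# Output tokens spent per finding, rounded to whole tokens.
tokens_per_finding = {
    model: round(tokens / findings) for model, (tokens, findings) in runs.items()
}

# How many times more tokens GPT-5 spent per finding than Opus.
gpt5_vs_opus = (runs["GPT-5"][0] / runs["GPT-5"][1]) / (
    runs["Claude Opus 4.6"][0] / runs["Claude Opus 4.6"][1]
)
```

This yields 1,592, 524, and 210 tokens per finding, and a GPT-5 to Opus ratio of about 3.04. The ratio measures token cost per finding only; the quality judgment behind "insight density" comes from the assessment sections, not from this arithmetic.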