model-research/findings/2026-05-08-52-degraded-mode-propagation-analysis.md

# Degraded-Mode Propagation Analysis: A New Analytical Lens

**Date:** 2026-05-08
**Finding #:** 52
**Task:** Degraded-mode propagation analysis across three related gargoyle design documents:
`signal-lifecycle.md` (111 lines), `buying-power.md` (103 lines), `risk-controls.md` (315 lines)
— spanning the decision engine → risk → order management boundary.

**Analytical lens:** NEW. "When one component enters a degraded state, what happens to the
components that depend on it? Trace degraded-mode behavior across document boundaries."
Distinct from assumption-finding (what's implicit?), gap-finding (what's missing?), and
race-condition analysis (what interleavings fail?). This lens asks: "Do the documents'
degraded behaviors compose correctly?"

## Method

All three models received the same three documents (full text, 529 lines combined) and the
same structured prompt via the HAI proxy. The prompt specified 5 categories of degraded-mode
propagation failure: propagation gaps, semantic mismatches, recovery ordering dependencies,
silent degradation, and degradation cascades. It required a specific output format per
finding, with quotes from both documents at each boundary. The models had no tools and no
project context beyond the documents themselves.
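
The exact output format used in the prompt is not reproduced in this finding. As a rough sketch only, a per-finding record that enforces the two stated requirements (one of the five categories, and a quote from each side of the boundary) might look like the following. The class and field names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    """The five failure categories named in the prompt."""
    PROPAGATION_GAP = "propagation gap"
    SEMANTIC_MISMATCH = "semantic mismatch"
    RECOVERY_ORDERING = "recovery ordering dependency"
    SILENT_DEGRADATION = "silent degradation"
    DEGRADATION_CASCADE = "degradation cascade"


@dataclass(frozen=True)
class BoundaryFinding:
    """Hypothetical record for one finding at a document boundary."""
    category: Category
    upstream_doc: str
    upstream_quote: str
    downstream_doc: str
    downstream_quote: str

    def __post_init__(self) -> None:
        # A finding must cross a boundary, so the two documents must differ.
        if self.upstream_doc == self.downstream_doc:
            raise ValueError("a boundary finding must span two documents")
        # Both sides of the boundary need textual evidence.
        if not self.upstream_quote or not self.downstream_quote:
            raise ValueError("both sides of the boundary need a quote")


example = BoundaryFinding(
    category=Category.SEMANTIC_MISMATCH,
    upstream_doc="buying-power.md",
    upstream_quote="use last cached value (pessimistic)",
    downstream_doc="risk-controls.md",
    downstream_quote="stale cache → reject",
)
```

Rejecting records that lack a quote from either side is one way to filter out the kind of speculative finding noted later for Sonnet.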

## Results

| Model | Time | Output tokens | Reasoning tokens | Findings |
|---|---|---|---|---|
| GPT-5 | 125s | 11,147 | 8,960 | 7 |
| Claude Opus 4.6 | 116s | 5,239 | (internal) | 10 |
| Claude Sonnet 4.6 | 35s | 1,676 | (internal) | 8 |

## What They Found — Common Ground (all 3 identified)

1. **Buying power staleness vs fail-closed semantic mismatch** — Buying Power says "use
   last cached value (pessimistic)" while Risk Controls says "stale cache → reject."
   All three models identified this as the primary boundary gap, though with different
   depth of analysis (see below).

2. **Market data staleness affecting buying power price estimates** — Risk Controls
   scopes staleness to fat-finger only ("other controls unaffected"), but Buying Power
   uses market prices for pro-forma deduction of market orders.

3. **Order rate window reset creating silent permissiveness** — After crash, the rate
   limiter window is empty at exactly the moment a signal burst is most likely. GPT-5 and
   Opus explored the cascade implications more deeply than Sonnet.

4. **Aggregator timeout splitting decisions** — Under load, aggregator fires early,
   creating partial decisions that each pass controls that would have rejected
   the combined quantity.
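
The mismatch in item 1 arises because a bare number carries no freshness information across the boundary. A minimal sketch of one possible remedy, assuming the buying-power value is passed together with an explicit freshness flag, is shown below. All names are hypothetical, and the sketch adopts the Risk Controls reading (stale means reject); the documents themselves do not settle which reading is intended.

```python
from dataclasses import dataclass
from enum import Enum


class Freshness(Enum):
    FRESH = "fresh"
    STALE = "stale"


@dataclass(frozen=True)
class BuyingPowerReading:
    """Hypothetical value object: the amount plus how trustworthy it is."""
    amount: float
    freshness: Freshness


def check_buying_power(reading: BuyingPowerReading, order_cost: float) -> str:
    """Fail-closed consumer: a stale reading is rejected, never compared."""
    if reading.freshness is Freshness.STALE:
        return "REJECT_STALE"
    if order_cost > reading.amount:
        return "REJECT_INSUFFICIENT"
    return "PASS"


cached = BuyingPowerReading(amount=10_000.0, freshness=Freshness.STALE)
fresh = BuyingPowerReading(amount=10_000.0, freshness=Freshness.FRESH)

print(check_buying_power(cached, 500.0))
print(check_buying_power(fresh, 500.0))
```

With the flag in the type, the producer cannot hand over "last cached value" without the consumer seeing that it is stale, so the two documents' behaviors have to be reconciled explicitly.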

## GPT-5 Unique Findings (not in either Claude model)

- **Concurrent pro-forma deduction race** (CRITICAL): Under overload, "Buy order submitted"
  event delays mean pending_buy_obligations is stale, allowing multiple concurrent decisions
  to pass buying power simultaneously. Worse than "unavailable" because it mimics normal
  acceptance.
- **Broker restriction recovery has no re-derivation trigger**: If a restriction lifts
  without a fill or market open, buying power stays at restricted level indefinitely. No
  alert, no timeout.
- **Market-open refresh failure + prior-day cache can be optimistic**: Pre-open overnight
  debits (fees, holds) make cached value optimistic, contradicting Risk Controls' assumption
  that staleness is always conservative.
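
To illustrate the concurrent pro-forma race, the sketch below shows one common mitigation: reserving funds in the same critical section as the check, so the decision does not depend on a delayed "Buy order submitted" event. This is an assumption about how such a race could be closed, not the design in the reviewed documents. `BuyingPowerLedger`, `try_reserve`, and `release` are hypothetical names; only `pending_buy_obligations` comes from the source.

```python
import threading


class BuyingPowerLedger:
    """Hypothetical ledger that reserves funds in the same step as the check."""

    def __init__(self, available: float) -> None:
        self._available = available
        self._pending_buy_obligations = 0.0
        self._lock = threading.Lock()

    def try_reserve(self, cost: float) -> bool:
        # Check and deduct under one lock so two decisions cannot both
        # pass against the same stale pending_buy_obligations value.
        with self._lock:
            if cost > self._available - self._pending_buy_obligations:
                return False
            self._pending_buy_obligations += cost
            return True

    def release(self, cost: float) -> None:
        # Undo a reservation when the order is rejected later or cancelled.
        with self._lock:
            self._pending_buy_obligations -= cost


ledger = BuyingPowerLedger(available=10_000.0)
results = []
threads = [
    threading.Thread(target=lambda: results.append(ledger.try_reserve(6_000.0)))
    for _ in range(2)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Two 6,000 requests against 10,000: exactly one reservation succeeds.
print(sorted(results))
```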

## Claude Opus Unique Findings (not in either of the other models)

- **NoShortSales contradicts Signal Risk's "no portfolio context" description** (HIGH):
  Signal-level controls are described as having "no portfolio context" but NoShortSales
  requires position knowledge. Architectural contradiction about what data Signal Risk can
  access; degraded position data behavior at this stage is unspecified.
- **Strategy crash mid-cycle leaves stale signals in aggregator buffer** (HIGH):
  Failure mode table covers crash "mid-signal" (atomicity) but NOT crash between signals
  in a multi-signal cycle. Pre-crash signals mix with post-restart fresh signals, forming
  decisions from inconsistent market state.
- **Reconciliation gate location unspecified** (HIGH): Buying Power says "blocks order flow"
  but Risk Controls don't mention it. If gate is at OrderManager, Portfolio Risk evaluates
  against uninitialized buying power. If at signal level, aggregator buffers fill during
  reconciliation.
- **Progressive buying power divergence during connectivity loss** (MEDIUM): Sell fills
  produce "no immediate change" + failed broker queries = progressive conservative drift.
  System enters sell-only mode where continuous monitoring liquidates but re-entry is blocked.
- **Pro-forma deduction reversal for internal rejection unspecified** (MEDIUM): Buying power
  deducts at control #4 but fat finger rejects at control #9. The deduction-reversal path
  for pre-submission rejection within the risk pipeline is unspecified.
- **Lost pending order state indistinguishable from empty** (CRITICAL): Duplicate Order and
  Self-Trade controls have no specified fail-closed behavior for "pending order data
  unavailable." Empty state = permissive, potentially allowing duplicate submissions.
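
The last finding comes down to representing "data unavailable" and "no pending orders" with the same value. A minimal sketch of keeping the two apart, assuming pending orders are identified by string keys, is given below. The function name and return codes are hypothetical.

```python
from typing import Optional, Set


def duplicate_order_check(
    pending_order_keys: Optional[Set[str]], new_order_key: str
) -> str:
    """Hypothetical duplicate-order control with an explicit 'unknown' state.

    None means the pending-order data could not be loaded; an empty set
    means it was loaded and there are genuinely no pending orders.
    """
    if pending_order_keys is None:
        return "REJECT_STATE_UNAVAILABLE"  # fail closed on missing data
    if new_order_key in pending_order_keys:
        return "REJECT_DUPLICATE"
    return "PASS"


print(duplicate_order_check(None, "AAPL-BUY-100"))
print(duplicate_order_check(set(), "AAPL-BUY-100"))
print(duplicate_order_check({"AAPL-BUY-100"}, "AAPL-BUY-100"))
```

The same distinction would apply to any control that reads pending-order state, which is why the finding covers a class of controls, not a single one.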

## Claude Sonnet Findings

Sonnet identified 8 findings, mostly overlapping with GPT-5 and Opus at lower depth.
Its unique contributions:

- **PDT equity calculation recovery ordering** (CRITICAL): If PDT equity recovers with
  different values than what buying power assumed during degradation, no reconciliation
  process exists.
- **Audit log write failure behavior** (HIGH): Whether audit write failure stops or continues
  evaluation is unspecified, creating either compliance gaps or unnecessary service disruption.

However, several Sonnet findings were less precise — Finding #1 restates the boundary gap
without exploring the optimistic case, and Finding #7 (PDT) was somewhat speculative without
strong textual evidence.

## Quality Assessment

- **GPT-5** was technically precise and found the most practically dangerous issues. The
  concurrent pro-forma race (Finding 3) is the most operationally critical finding across
  all models — it's a real concurrency bug that could cause financial loss during high-volume
  trading. All findings included precise mechanism descriptions and specific interleaving
  sequences. However, it reported the fewest findings of the three models (7, versus 10 for
  Opus and 8 for Sonnet) despite using the most output tokens.

- **Claude Opus** was the most prolific (10 findings) and found the deepest architectural
  issues. The NoShortSales/Signal-Risk contradiction (Finding 2) is a genuine spec-level
  inconsistency that no other model identified. The progressive divergence finding (Finding 8)
  shows multi-step causal reasoning about how one degradation creates a cascading economic
  effect. The pending-order-state finding (Finding 10) identifies a category of silent
  degradation that applies to an ENTIRE CLASS of controls — not just one component.
  Opus's characteristic strength — reasoning about design TENSIONS — manifests here as
  finding places where one document's degradation model creates a trap for another document's
  assumptions.

- **Claude Sonnet** was fast (35s, under a third of the time either other model took) and
  adequate but shallow. Findings were
  correct but didn't explore second-order effects or multi-step cascades. The PDT finding
  was speculative. Sonnet identified the right THEMES but didn't trace them to their
  architectural consequences.

## Key Insight — "Degraded-mode propagation" as an analytical lens

This lens differs from previous lenses in two ways:

1. **It's inherently cross-document.** Unlike assumption-finding (which can work on a single
   doc), degraded-mode propagation REQUIRES multiple documents because it specifically asks
   about boundaries. This makes it ideal for detecting integration issues.

2. **It finds a different CLASS of bug.** Previous lenses found issues with the happy path
   (gaps, assumptions) or the error path (race conditions, invariant violations). This lens
   finds issues with the PARTIALLY-DEGRADED path — where components are working but with
   degraded inputs they don't fully detect.

The most architecturally significant findings (concurrent pro-forma race, aggregator split
decisions, progressive divergence cascade) are all about systems that APPEAR to be working
but are silently making incorrect decisions because one upstream component is degraded in a
way that mimics normal behavior.

## Model Comparison for This Task Type

| Dimension | GPT-5 | Opus | Sonnet |
|---|---|---|---|
| Finding count | 7 | 10 | 8 |
| CRITICAL findings | 1 | 4 | 2 |
| Unique insights | 3 | 6 | 2 |
| Tokens per finding | 1,592 | 524 | 210 |
| Cascade reasoning | Deep | Deep | Surface |
| Cross-doc awareness | High | Highest | Moderate |

**Opus is the strongest model for this task type.** This is the first experiment where
Opus decisively outperforms GPT-5 in both quantity AND quality. The reason: degraded-mode
propagation analysis is fundamentally about design TENSIONS between documents — which is
exactly Opus's consistent strength across all previous experiments. GPT-5's strength
(exhaustive technical detail) matters less here because the findings are at the boundary
level, not the implementation level.

## Practical Implication

For cross-document integration review:
- **Opus** as primary reviewer (highest insight density, finds architectural contradictions)
- **GPT-5** as secondary reviewer (finds the operational/concurrency issues Opus misses)
- **Sonnet** for quick first-pass only (identifies themes but not consequences)

The ideal workflow: Run Opus on all cross-document boundaries first. Then run GPT-5 on
any boundary where Opus flagged design tensions — GPT-5 will find the specific race
conditions and concurrency bugs that make those tensions exploitable.
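
The workflow above can be sketched as a two-stage pass. The sketch assumes each reviewer is a callable that takes a boundary name and returns a list of finding summaries, and it treats "the primary reviewer returned at least one finding" as a stand-in for "flagged design tensions". Both are simplifying assumptions, and the stub reviewers are placeholders for real model calls.

```python
from typing import Callable, Dict, List

# A reviewer takes a boundary name and returns a list of finding summaries.
Reviewer = Callable[[str], List[str]]


def two_stage_review(
    boundaries: List[str], primary: Reviewer, secondary: Reviewer
) -> Dict[str, Dict[str, List[str]]]:
    """Run the primary reviewer everywhere, the secondary only where flagged."""
    report: Dict[str, Dict[str, List[str]]] = {}
    for boundary in boundaries:
        primary_findings = primary(boundary)
        # Only spend the secondary reviewer on boundaries the primary flagged.
        secondary_findings = secondary(boundary) if primary_findings else []
        report[boundary] = {
            "primary": primary_findings,
            "secondary": secondary_findings,
        }
    return report


def stub_primary(boundary: str) -> List[str]:
    # Placeholder for the primary model call; flags a single boundary.
    return ["design tension"] if boundary.startswith("decision engine") else []


def stub_secondary(boundary: str) -> List[str]:
    # Placeholder for the secondary model call.
    return [f"race condition at {boundary}"]


report = two_stage_review(
    ["decision engine -> risk", "risk -> order management"],
    stub_primary,
    stub_secondary,
)
```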

## Cost-Effectiveness

- Opus: 10 findings in 116s at 5,239 tokens = **524 tokens per finding** (best efficiency
  among the two models that reasoned about cascades in depth)
- GPT-5: 7 findings in 125s at 11,147 tokens = 1,592 tokens per finding
- Sonnet: 8 findings in 35s at 1,676 tokens = 210 tokens per finding (cheapest, but
  lowest quality per finding)

For architecture review at document boundaries, Opus delivers ~3× the insight density
per token compared to GPT-5, while finding more issues. This inverts the typical pattern
from previous experiments where GPT-5 was most cost-effective for exhaustive analysis.
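
The per-finding figures and the ~3× ratio can be reproduced from the Results table. The short check below uses only the "Output tokens" and "Findings" columns; whether GPT-5's separately listed reasoning tokens are included in its output-token count is not stated in this finding, so the calculation takes the table at face value.

```python
runs = {
    "GPT-5": {"output_tokens": 11_147, "findings": 7},
    "Claude Opus 4.6": {"output_tokens": 5_239, "findings": 10},
    "Claude Sonnet 4.6": {"output_tokens": 1_676, "findings": 8},
}

# Output tokens spent per reported finding, per model.
tokens_per_finding = {
    model: run["output_tokens"] / run["findings"] for model, run in runs.items()
}

for model, value in tokens_per_finding.items():
    print(f"{model}: {value:.0f} tokens per finding")

# How many times more tokens GPT-5 spent per finding than Opus.
ratio = tokens_per_finding["GPT-5"] / tokens_per_finding["Claude Opus 4.6"]
print(f"GPT-5 / Opus: {ratio:.2f}x")
```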