finding #52: degraded-mode propagation analysis (new lens)

Cross-document boundary analysis: signal-lifecycle + buying-power + risk-controls.
Opus decisively outperforms GPT-5 (10 vs 7 findings) — first inversion observed.
New lens finds a distinct class of bug: partially-degraded paths that mimic normal behavior.
claw
2026-05-08 14:29:29 -07:00
parent 79915d1dc3
commit c1ca8cfe46
@@ -0,0 +1,182 @@
# Degraded-Mode Propagation Analysis: A New Analytical Lens
**Date:** 2026-05-08
**Finding #:** 52
**Task:** Degraded-mode propagation analysis across three related gargoyle design documents:
`signal-lifecycle.md` (111 lines), `buying-power.md` (103 lines), `risk-controls.md` (315 lines)
— spanning the decision engine → risk → order management boundary.
**Analytical lens:** NEW. "When one component enters a degraded state, what happens to the
components that depend on it? Trace degraded-mode behavior across document boundaries."
Distinct from assumption-finding (what's implicit?), gap-finding (what's missing?), or
race conditions (what interleavings fail?). This asks: "Do the documents' degraded
behaviors compose correctly?"
## Method
All 3 models received the same three documents (full text, 529 lines combined) and the
same structured prompt via HAI proxy. The prompt specified 5 categories of degraded-mode
propagation failures: propagation gaps, semantic mismatches, recovery ordering
dependencies, silent degradation, and degradation cascades. It required a specific output
format per finding, with quotes from both documents at each boundary. The models had no
tools and no project context beyond the documents themselves.
## Results
| Model | Time | Output tokens | Reasoning tokens | Findings |
|---|---|---|---|---|
| GPT-5 | 125s | 11,147 | 8,960 | 7 |
| Claude Opus 4.6 | 116s | 5,239 | (internal) | 10 |
| Claude Sonnet 4.6 | 35s | 1,676 | (internal) | 8 |
## What They Found — Common Ground (all 3 identified)
1. **Buying power staleness vs fail-closed semantic mismatch** — Buying Power says "use
last cached value (pessimistic)" while Risk Controls says "stale cache → reject."
All three models identified this as the primary boundary gap, though with different
depth of analysis (see below).
2. **Market data staleness affecting buying power price estimates** — Risk Controls
scopes staleness to fat-finger only ("other controls unaffected"), but Buying Power
uses market prices for pro-forma deduction of market orders.
3. **Order rate window reset creating silent permissiveness** — After crash, the rate
limiter's window is empty exactly when a signal burst is most likely. GPT-5 and Opus explored the
cascade implications more deeply.
4. **Aggregator timeout splitting decisions** — Under load, aggregator fires early,
creating partial decisions that individually pass controls which would have caught
the combined quantity.
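The first mismatch above can be made concrete with a minimal sketch. It assumes the two documents' rules are implemented literally; the names `BuyingPowerReading`, `buying_power_view`, `risk_check`, and `risk_check_without_flag` are hypothetical and do not come from the gargoyle documents. The point it illustrates: if only the number crosses the boundary, the consumer cannot apply its "stale cache → reject" rule at all.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BuyingPowerReading:
    """Hypothetical boundary type: a value plus an explicit staleness flag."""
    value: float    # dollars available for new buys
    is_stale: bool  # True when the refresh failed and this is the cached value


def buying_power_view(cached: float, fresh: Optional[float]) -> BuyingPowerReading:
    """Producer side, following the 'use last cached value' rule."""
    if fresh is None:  # refresh failed
        return BuyingPowerReading(value=cached, is_stale=True)
    return BuyingPowerReading(value=fresh, is_stale=False)


def risk_check(reading: BuyingPowerReading, order_cost: float) -> str:
    """Consumer side, following the 'stale cache -> reject' rule."""
    if reading.is_stale:
        return "REJECT_STALE"
    return "ACCEPT" if order_cost <= reading.value else "REJECT_INSUFFICIENT"


def risk_check_without_flag(value: float, order_cost: float) -> str:
    """Same check when only the bare number crosses the boundary."""
    return "ACCEPT" if order_cost <= value else "REJECT_INSUFFICIENT"
```

With a failed refresh and a cached value of 10,000, `risk_check` rejects a 5,000 order as stale, while `risk_check_without_flag` accepts it. Which behavior the real system exhibits depends on what actually crosses the boundary, which is what the documents leave unspecified.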
## GPT-5 Unique Findings (not in either Claude model)
- **Concurrent pro-forma deduction race** (CRITICAL): Under overload, "Buy order submitted"
event delays mean pending_buy_obligations is stale, allowing multiple concurrent decisions
to pass buying power simultaneously. Worse than "unavailable" because it mimics normal
acceptance.
- **Broker restriction recovery has no re-derivation trigger**: If a restriction lifts
without a fill or market open, buying power stays at restricted level indefinitely. No
alert, no timeout.
- **Market-open refresh failure + prior-day cache can be optimistic**: Pre-open overnight
debits (fees, holds) make cached value optimistic, contradicting Risk Controls' assumption
that staleness is always conservative.
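The concurrent pro-forma deduction race can be reproduced deterministically, without threads, by modeling the delayed "Buy order submitted" event as a queue. This is a sketch under stated assumptions: `BuyingPowerLedger` and its methods are hypothetical, and only the field name `pending_buy_obligations` is taken from the finding. The atomic reservation variant is one possible mitigation, not something the documents prescribe.

```python
from typing import List


class BuyingPowerLedger:
    """Hypothetical ledger: deductions arrive as delayed events."""

    def __init__(self, settled_cash: float) -> None:
        self.settled_cash = settled_cash
        self.pending_buy_obligations = 0.0
        self._delayed_events: List[float] = []  # queued "Buy order submitted" events

    def available(self) -> float:
        return self.settled_cash - self.pending_buy_obligations

    def check_then_submit(self, cost: float) -> bool:
        """Racy path: the check reads obligations that lag behind submissions."""
        if cost > self.available():
            return False
        self._delayed_events.append(cost)  # event is queued, not yet applied
        return True

    def reserve_atomically(self, cost: float) -> bool:
        """One possible mitigation: check and deduct in a single step."""
        if cost > self.available():
            return False
        self.pending_buy_obligations += cost
        return True

    def drain_events(self) -> None:
        """Apply all delayed events, as would happen once the overload clears."""
        for cost in self._delayed_events:
            self.pending_buy_obligations += cost
        self._delayed_events.clear()
```

Starting from 10,000 in settled cash, two 8,000 decisions both pass `check_then_submit`; after `drain_events()` the ledger shows -6,000 available. With `reserve_atomically`, the second decision is rejected and 2,000 remains.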
## Claude Opus Unique Findings (not in either other model)
- **NoShortSales contradicts Signal Risk's "no portfolio context" description** (HIGH):
Signal-level controls are described as having "no portfolio context" but NoShortSales
requires position knowledge. Architectural contradiction about what data Signal Risk can
access; degraded position data behavior at this stage is unspecified.
- **Strategy crash mid-cycle leaves stale signals in aggregator buffer** (HIGH):
Failure mode table covers crash "mid-signal" (atomicity) but NOT crash between signals
in a multi-signal cycle. Pre-crash signals mix with post-restart fresh signals, forming
decisions from inconsistent market state.
- **Reconciliation gate location unspecified** (HIGH): Buying Power says "blocks order flow"
but Risk Controls don't mention it. If gate is at OrderManager, Portfolio Risk evaluates
against uninitialized buying power. If at signal level, aggregator buffers fill during
reconciliation.
- **Progressive buying power divergence during connectivity loss** (MEDIUM): Sell fills
produce "no immediate change" to buying power while broker queries fail, so the cached
value drifts progressively conservative. The system enters a sell-only mode in which
continuous monitoring liquidates positions but re-entry is blocked.
- **Pro-forma deduction reversal for internal rejection unspecified** (MEDIUM): Buying power
deducts at control #4 but fat finger rejects at control #9. The deduction-reversal path
for pre-submission rejection within the risk pipeline is unspecified.
- **Lost pending order state indistinguishable from empty** (CRITICAL): Duplicate Order and
Self-Trade controls have no specified fail-closed behavior for "pending order data
unavailable." Empty state = permissive, potentially allowing duplicate submissions.
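The last finding in this list, "unavailable" collapsing into "empty", is easy to show in code. The sketch below is illustrative only: the function names and the order-key format are hypothetical, and the fail-closed behavior is an assumption about what a fix might look like, since the documents do not specify one.

```python
from typing import Callable, FrozenSet, Iterable, Optional


def load_pending_permissive(fetch: Callable[[], Iterable[str]]) -> FrozenSet[str]:
    """Failure collapses into 'no pending orders', which looks healthy."""
    try:
        return frozenset(fetch())
    except ConnectionError:
        return frozenset()


def load_pending_fail_closed(fetch: Callable[[], Iterable[str]]) -> Optional[FrozenSet[str]]:
    """Failure is kept distinct from empty by returning None."""
    try:
        return frozenset(fetch())
    except ConnectionError:
        return None


def duplicate_order_check(pending: Optional[FrozenSet[str]], new_order_key: str) -> str:
    """Hypothetical duplicate-order control that treats unknown state as a rejection."""
    if pending is None:
        return "REJECT_STATE_UNAVAILABLE"
    if new_order_key in pending:
        return "REJECT_DUPLICATE"
    return "ACCEPT"
```

When the fetch raises `ConnectionError`, the permissive loader yields an empty set and the control accepts a possible duplicate; the fail-closed loader yields `None` and the control rejects.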
## Claude Sonnet Findings
Sonnet identified 8 findings, mostly overlapping with GPT-5 and Opus at lower depth.
Its unique contributions:
- **PDT equity calculation recovery ordering** (CRITICAL): If PDT equity recovers with
different values than what buying power assumed during degradation, no reconciliation
process exists.
- **Audit log write failure behavior** (HIGH): Whether audit write failure stops or continues
evaluation is unspecified, creating either compliance gaps or unnecessary service disruption.
However, several Sonnet findings were less precise: Finding #1 restates the boundary gap
without exploring the optimistic case, and Finding #7 (PDT) was somewhat speculative,
lacking strong textual evidence.
## Quality Assessment
- **GPT-5** was technically precise and found the most practically dangerous issues. The
concurrent pro-forma race (Finding 3) is the most operationally critical finding across
all models — it's a real concurrency bug that could cause financial loss during high-volume
trading. All findings included precise mechanism descriptions and specific interleaving
sequences. However, it was LESS prolific than Opus (7 vs 10) despite using more tokens.
- **Claude Opus** was the most prolific (10 findings) and found the deepest architectural
issues. The NoShortSales/Signal-Risk contradiction (Finding 2) is a genuine spec-level
inconsistency that no other model identified. The progressive divergence finding (Finding 8)
shows multi-step causal reasoning about how one degradation creates a cascading economic
effect. The pending-order-state finding (Finding 10) identifies a category of silent
degradation that applies to an ENTIRE CLASS of controls — not just one component.
Opus's characteristic strength — reasoning about design TENSIONS — manifests here as
finding places where one document's degradation model creates a trap for another document's
assumptions.
- **Claude Sonnet** was fast (35s, under a third of either other model's time) and adequate but shallow. Findings were
correct but didn't explore second-order effects or multi-step cascades. The PDT finding
was speculative. Sonnet identified the right THEMES but didn't trace them to their
architectural consequences.
## Key Insight — "Degraded-mode propagation" as an analytical lens
This is genuinely distinct from previous lenses in two ways:
1. **It's inherently cross-document.** Unlike assumption-finding (which can work on a single
doc), degraded-mode propagation REQUIRES multiple documents because it specifically asks
about boundaries. This makes it ideal for detecting integration issues.
2. **It finds a different CLASS of bug.** Previous lenses found issues with the happy path
(gaps, assumptions) or the error path (race conditions, invariant violations). This lens
finds issues with the PARTIALLY-DEGRADED path — where components are working but with
degraded inputs they don't fully detect.
The most architecturally significant findings (concurrent pro-forma race, aggregator split
decisions, progressive divergence cascade) are all about systems that APPEAR to be working
but are silently making incorrect decisions because one upstream component is degraded in a
way that mimics normal behavior.
## Model Comparison for This Task Type
| Dimension | GPT-5 | Opus | Sonnet |
|---|---|---|---|
| Finding count | 7 | 10 | 8 |
| CRITICAL findings | 1 | 4 | 2 |
| Unique insights | 3 | 6 | 2 |
| Output tokens per finding | 1,592 | 524 | 210 |
| Cascade reasoning | Deep | Deep | Surface |
| Cross-doc awareness | High | Highest | Moderate |
**Opus is the strongest model for this task type.** This is the first experiment where
Opus decisively outperforms GPT-5 in both quantity AND quality. The reason: degraded-mode
propagation analysis is fundamentally about design TENSIONS between documents — which is
exactly Opus's consistent strength across all previous experiments. GPT-5's strength
(exhaustive technical detail) matters less here because the findings are at the boundary
level, not the implementation level.
## Practical Implication
For cross-document integration review:
- **Opus** as primary reviewer (highest insight density, finds architectural contradictions)
- **GPT-5** as secondary reviewer (finds the operational/concurrency issues Opus misses)
- **Sonnet** for quick first-pass only (identifies themes but not consequences)
The ideal workflow: Run Opus on all cross-document boundaries first. Then run GPT-5 on
any boundary where Opus flagged design tensions — GPT-5 will find the specific race
conditions and concurrency bugs that make those tensions exploitable.
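The two-stage workflow can be sketched as a small orchestration function. Everything here is hypothetical: `Reviewer` stands in for whatever call sends a boundary's documents to a model and returns its findings, and no real model API is implied.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical reviewer: takes a boundary name, returns a list of finding strings.
Reviewer = Callable[[str], List[str]]


def two_stage_review(
    boundaries: Iterable[str],
    primary: Reviewer,
    secondary: Reviewer,
) -> Dict[str, Dict[str, List[str]]]:
    """Run the primary reviewer everywhere; run the secondary only where tensions were flagged."""
    results: Dict[str, Dict[str, List[str]]] = {}
    for boundary in boundaries:
        tensions = primary(boundary)
        # The secondary pass is skipped when the primary found nothing.
        follow_ups = secondary(boundary) if tensions else []
        results[boundary] = {"tensions": tensions, "follow_ups": follow_ups}
    return results
```

Gating the second pass on the first keeps the more token-heavy reviewer off boundaries with no flagged tensions. The trade-off is that a concurrency bug on a boundary the primary reviewer considered clean would go unexamined.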
## Cost-Effectiveness
- Opus: 10 findings in 116s at 5,239 tokens = **524 tokens per finding** (best efficiency of the two models with deep cascade reasoning)
- GPT-5: 7 findings in 125s at 11,147 tokens = 1,592 tokens per finding
- Sonnet: 8 findings in 35s at 1,676 tokens = 210 tokens per finding (cheapest, but
lowest quality per finding)
For architecture review at document boundaries, Opus delivers ~3× the insight density
per token compared to GPT-5, while finding more issues. This inverts the typical pattern
from previous experiments where GPT-5 was most cost-effective for exhaustive analysis.
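The per-finding figures follow directly from the Results table. A short script reproduces them from the reported output-token and finding counts; the helper name `tokens_per_finding` is illustrative only.

```python
# (output tokens, findings) as reported in the Results table
RUNS = {
    "GPT-5": (11_147, 7),
    "Claude Opus 4.6": (5_239, 10),
    "Claude Sonnet 4.6": (1_676, 8),
}


def tokens_per_finding(output_tokens: int, findings: int) -> float:
    """Output tokens spent per reported finding (lower is cheaper, not necessarily better)."""
    return output_tokens / findings


per_finding = {model: tokens_per_finding(*run) for model, run in RUNS.items()}

# Ratio behind the "~3x" claim: GPT-5 cost per finding relative to Opus
gpt5_vs_opus = per_finding["GPT-5"] / per_finding["Claude Opus 4.6"]

for model, value in per_finding.items():
    print(f"{model}: {value:.1f} output tokens per finding")
print(f"GPT-5 / Opus ratio: {gpt5_vs_opus:.2f}")
```

The metric counts findings equally regardless of severity or depth, which is why Sonnet's lower figure does not translate into better review value.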