finding #52: degraded-mode propagation analysis (new lens)
Cross-document boundary analysis: signal-lifecycle + buying-power + risk-controls. Opus decisively outperforms GPT-5 (10 vs 7 findings) — first inversion observed. New lens finds a distinct class of bug: partially-degraded paths that mimic normal behavior.
@@ -0,0 +1,182 @@
# Degraded-Mode Propagation Analysis: A New Analytical Lens

**Date:** 2026-05-08
**Finding #:** 52
**Task:** Degraded-mode propagation analysis across three related gargoyle design documents:
`signal-lifecycle.md` (111 lines), `buying-power.md` (103 lines), `risk-controls.md` (315 lines),
spanning the decision engine → risk → order management boundary.

**Analytical lens:** NEW. "When one component enters a degraded state, what happens to the
components that depend on it? Trace degraded-mode behavior across document boundaries."
Distinct from assumption-finding (what's implicit?), gap-finding (what's missing?), or
race conditions (what interleavings fail?). This asks: "Do the documents' degraded
behaviors compose correctly?"

## Method

Same three documents (full text, 529 lines combined) + same structured prompt to all
3 models via HAI proxy. Prompt specified 5 categories of degraded-mode propagation
failures: propagation gaps, semantic mismatches, recovery ordering dependencies, silent
degradation, and degradation cascades. Required specific output format per finding with
quotes from both documents at each boundary. No tools, no project context beyond the
documents themselves.

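The per-finding output format described above can be pictured as a small record type. This is only a sketch: the five category names and the two quoted fragments come from this write-up, while the class and field names (`Category`, `BoundaryQuote`, `Finding`) are assumptions made for illustration and are not the actual prompt schema.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    """The five failure categories named in the prompt."""
    PROPAGATION_GAP = "propagation gap"
    SEMANTIC_MISMATCH = "semantic mismatch"
    RECOVERY_ORDERING = "recovery ordering dependency"
    SILENT_DEGRADATION = "silent degradation"
    DEGRADATION_CASCADE = "degradation cascade"


@dataclass(frozen=True)
class BoundaryQuote:
    document: str  # which design document the quote comes from
    text: str      # the quoted text on that side of the boundary


@dataclass(frozen=True)
class Finding:
    # Hypothetical record shape; the real prompt format is not reproduced here.
    title: str
    category: Category
    upstream: BoundaryQuote
    downstream: BoundaryQuote

    def is_cross_document(self) -> bool:
        """A boundary finding must quote two different documents."""
        return self.upstream.document != self.downstream.document


finding = Finding(
    title="Buying power staleness vs fail-closed",
    category=Category.SEMANTIC_MISMATCH,
    upstream=BoundaryQuote("buying-power.md", "use last cached value (pessimistic)"),
    downstream=BoundaryQuote("risk-controls.md", "stale cache → reject"),
)
print(finding.category.value, finding.is_cross_document())
```

Requiring a quote from each side makes it cheap to reject findings that never actually cross a document boundary.
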
## Results

| Model | Time | Output tokens | Reasoning tokens | Findings |
|---|---|---|---|---|
| GPT-5 | 125s | 11,147 | 8,960 | 7 |
| Claude Opus 4.6 | 116s | 5,239 | (internal) | 10 |
| Claude Sonnet 4.6 | 35s | 1,676 | (internal) | 8 |

## What They Found: Common Ground (all 3 identified)

1. **Buying power staleness vs fail-closed semantic mismatch**: Buying Power says "use
   last cached value (pessimistic)" while Risk Controls says "stale cache → reject."
   All three models identified this as the primary boundary gap, though with different
   depth of analysis (see below).

2. **Market data staleness affecting buying power price estimates**: Risk Controls
   scopes staleness to fat-finger only ("other controls unaffected"), but Buying Power
   uses market prices for pro-forma deduction of market orders.

3. **Order rate window reset creating silent permissiveness**: After a crash, the rate
   limiter window is empty at exactly the moment a signal burst is most likely. GPT-5 and
   Opus explored the cascade implications more deeply.

4. **Aggregator timeout splitting decisions**: Under load, the aggregator fires early,
   creating partial decisions that individually pass controls which would have caught
   the combined quantity.

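The first mismatch above can be made concrete with a toy model. This is a sketch under stated assumptions: the 30-second threshold, the `BuyingPowerReading` type, and both check functions are hypothetical and do not come from the design documents. It shows that the fail-closed rule only works if the cache age travels across the boundary with the value.

```python
from dataclasses import dataclass

MAX_CACHE_AGE_SECONDS = 30.0  # hypothetical staleness threshold


@dataclass(frozen=True)
class BuyingPowerReading:
    amount: float
    age_seconds: float  # time since the last successful broker refresh


def risk_check_with_age(reading: BuyingPowerReading, order_cost: float) -> str:
    """Risk Controls' stated behavior: a stale cache is rejected (fail-closed)."""
    if reading.age_seconds > MAX_CACHE_AGE_SECONDS:
        return "REJECT_STALE"
    return "ACCEPT" if order_cost <= reading.amount else "REJECT_INSUFFICIENT"


def risk_check_bare_amount(amount: float, order_cost: float) -> str:
    """The same check when the producer hands over only the cached number."""
    return "ACCEPT" if order_cost <= amount else "REJECT_INSUFFICIENT"


# Broker queries have failed for two minutes; Buying Power keeps serving its cache.
cached = BuyingPowerReading(amount=10_000.0, age_seconds=120.0)

with_age = risk_check_with_age(cached, order_cost=2_500.0)
bare = risk_check_bare_amount(cached.amount, order_cost=2_500.0)
print(with_age, bare)  # REJECT_STALE ACCEPT
```

In the second call the degraded value is indistinguishable from a fresh one, which is the "mimics normal behavior" pattern this lens is meant to surface.
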
## GPT-5 Unique Findings (not in either Claude model)

- **Concurrent pro-forma deduction race** (CRITICAL): Under overload, delays in the
  "Buy order submitted" event mean `pending_buy_obligations` is stale, allowing multiple
  concurrent decisions to pass buying power simultaneously. Worse than "unavailable"
  because it mimics normal acceptance.
- **Broker restriction recovery has no re-derivation trigger**: If a restriction lifts
  without a fill or market open, buying power stays at the restricted level indefinitely.
  No alert, no timeout.
- **Market-open refresh failure + prior-day cache can be optimistic**: Pre-open overnight
  debits (fees, holds) make the cached value optimistic, contradicting Risk Controls'
  assumption that staleness is always conservative.

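The concurrent pro-forma race can be reproduced deterministically without threads. The sketch below is an assumed model, not the gargoyle implementation: `BuyingPowerLedger` and its methods are hypothetical, and the delayed "Buy order submitted" event is represented by a queue that is drained later. Only the name `pending_buy_obligations` is taken from the finding.

```python
from collections import deque


class BuyingPowerLedger:
    """Toy ledger where submission events are applied asynchronously."""

    def __init__(self, cash: float) -> None:
        self.cash = cash
        self.pending_buy_obligations = 0.0
        self._submitted_events: deque = deque()  # delayed "Buy order submitted" events

    def available(self) -> float:
        return self.cash - self.pending_buy_obligations

    def check_and_submit(self, cost: float) -> bool:
        """Pass the buying-power control, then enqueue (not apply) the deduction."""
        if cost > self.available():
            return False
        self._submitted_events.append(cost)
        return True

    def drain_events(self) -> None:
        """Apply the delayed events, as would happen once the overload clears."""
        while self._submitted_events:
            self.pending_buy_obligations += self._submitted_events.popleft()


ledger = BuyingPowerLedger(cash=10_000.0)
first = ledger.check_and_submit(8_000.0)   # passes against 10,000 available
second = ledger.check_and_submit(8_000.0)  # also passes: the deduction is still queued
ledger.drain_events()
print(first, second, ledger.available())   # True True -6000.0
```

Both decisions look like ordinary acceptances, and the over-commitment only becomes visible after the events land. Applying the deduction inside `check_and_submit` (or holding a reservation) would close the gap in this toy model.
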
## Claude Opus Unique Findings (not in either of the other models)

- **NoShortSales contradicts Signal Risk's "no portfolio context" description** (HIGH):
  Signal-level controls are described as having "no portfolio context" but NoShortSales
  requires position knowledge. Architectural contradiction about what data Signal Risk can
  access; degraded position data behavior at this stage is unspecified.
- **Strategy crash mid-cycle leaves stale signals in aggregator buffer** (HIGH):
  The failure mode table covers crash "mid-signal" (atomicity) but NOT a crash between
  signals in a multi-signal cycle. Pre-crash signals mix with post-restart fresh signals,
  forming decisions from inconsistent market state.
- **Reconciliation gate location unspecified** (HIGH): Buying Power says "blocks order flow"
  but Risk Controls doesn't mention it. If the gate is at OrderManager, Portfolio Risk
  evaluates against uninitialized buying power. If at signal level, aggregator buffers fill
  during reconciliation.
- **Progressive buying power divergence during connectivity loss** (MEDIUM): Sell fills
  produce "no immediate change" + failed broker queries = progressive conservative drift.
  System enters sell-only mode where continuous monitoring liquidates but re-entry is blocked.
- **Pro-forma deduction reversal for internal rejection unspecified** (MEDIUM): Buying power
  deducts at control #4 but fat finger rejects at control #9. The deduction-reversal path
  for pre-submission rejection within the risk pipeline is unspecified.
- **Lost pending order state indistinguishable from empty** (CRITICAL): Duplicate Order and
  Self-Trade controls have no specified fail-closed behavior for "pending order data
  unavailable." Empty state = permissive, potentially allowing duplicate submissions.

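The last finding in the list turns on the difference between "no pending orders" and "pending-order data unavailable". A minimal sketch, assuming a duplicate check keyed by a string order key (both function names and the key format are hypothetical), shows how an explicit unavailable state lets the control fail closed:

```python
from typing import Optional, Set


def duplicate_check_permissive(pending: Set[str], order_key: str) -> bool:
    """Return True if the order may proceed. Lost state arrives as an empty set."""
    return order_key not in pending


def duplicate_check_fail_closed(pending: Optional[Set[str]], order_key: str) -> bool:
    """Return True if the order may proceed. None means the data is unavailable."""
    if pending is None:
        return False  # cannot prove the order is not a duplicate, so reject
    return order_key not in pending


order_key = "AAPL:BUY:100"  # hypothetical key format

# The pending-order store was lost; an order with this key was actually in flight.
print(duplicate_check_permissive(set(), order_key))   # True: duplicate slips through
print(duplicate_check_fail_closed(None, order_key))   # False: rejected
print(duplicate_check_fail_closed(set(), order_key))  # True: genuinely empty is fine
```

The same distinction would apply to any control that reads pending-order state, which is why the finding covers a class of controls rather than a single one.
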
## Claude Sonnet Findings

Sonnet identified 8 findings, mostly overlapping with GPT-5 and Opus at lower depth.
Its unique contributions:

- **PDT equity calculation recovery ordering** (CRITICAL): If PDT equity recovers with
  different values than what buying power assumed during degradation, no reconciliation
  process exists.
- **Audit log write failure behavior** (HIGH): Whether audit write failure stops or continues
  evaluation is unspecified, creating either compliance gaps or unnecessary service disruption.

However, several Sonnet findings were less precise: Finding #1 restates the boundary gap
without exploring the optimistic case, and Finding #7 (PDT) was somewhat speculative without
strong textual evidence.

## Quality Assessment

- **GPT-5** was technically precise and found the most practically dangerous issues. The
  concurrent pro-forma race (Finding 3) is the most operationally critical finding across
  all models: it's a real concurrency bug that could cause financial loss during high-volume
  trading. All findings included precise mechanism descriptions and specific interleaving
  sequences. However, it was LESS prolific than Opus (7 vs 10) despite using more tokens.

- **Claude Opus** was the most prolific (10 findings) and found the deepest architectural
  issues. The NoShortSales/Signal-Risk contradiction (Finding 2) is a genuine spec-level
  inconsistency that no other model identified. The progressive divergence finding (Finding 8)
  shows multi-step causal reasoning about how one degradation creates a cascading economic
  effect. The pending-order-state finding (Finding 10) identifies a category of silent
  degradation that applies to an ENTIRE CLASS of controls, not just one component.
  Opus's characteristic strength, reasoning about design TENSIONS, manifests here as
  finding places where one document's degradation model creates a trap for another document's
  assumptions.

- **Claude Sonnet** was fast (35s, under a third of the time of either other model) and
  adequate but shallow. Findings were correct but didn't explore second-order effects or
  multi-step cascades. The PDT finding was speculative. Sonnet identified the right THEMES
  but didn't trace them to their architectural consequences.

## Key Insight: "Degraded-mode propagation" as an analytical lens

This lens is distinct from previous lenses in two ways:

1. **It's inherently cross-document.** Unlike assumption-finding (which can work on a single
   doc), degraded-mode propagation REQUIRES multiple documents because it specifically asks
   about boundaries. This makes it well suited to detecting integration issues.

2. **It finds a different CLASS of bug.** Previous lenses found issues with the happy path
   (gaps, assumptions) or the error path (race conditions, invariant violations). This lens
   finds issues with the PARTIALLY-DEGRADED path, where components are working but with
   degraded inputs they don't fully detect.

The most architecturally significant findings (concurrent pro-forma race, aggregator split
decisions, progressive divergence cascade) are all about systems that APPEAR to be working
but are silently making incorrect decisions because one upstream component is degraded in a
way that mimics normal behavior.

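The aggregator split-decision case is a compact illustration of a partially-degraded path. In the sketch below, the quantity limit, the `aggregate` helper, and the idea of modeling an early timeout as a smaller batch size are all assumptions made for illustration; none of them come from the design documents.

```python
from typing import List

MAX_DECISION_QTY = 1_000  # hypothetical per-decision quantity limit


def passes_quantity_control(qty: int) -> bool:
    return qty <= MAX_DECISION_QTY


def aggregate(signal_qtys: List[int], batch_size: int) -> List[int]:
    """Combine signals into decisions.

    batch_size stands in for how many signals are buffered before the
    aggregator fires; an early timeout under load means a smaller batch.
    """
    return [
        sum(signal_qtys[start:start + batch_size])
        for start in range(0, len(signal_qtys), batch_size)
    ]


signals = [600, 600]  # two signals for the same instrument in one cycle

normal = aggregate(signals, batch_size=2)    # [1200]
degraded = aggregate(signals, batch_size=1)  # [600, 600]

print([passes_quantity_control(q) for q in normal])    # [False]
print([passes_quantity_control(q) for q in degraded])  # [True, True]
```

Every component returns a valid-looking result in the degraded run, yet the same 1,200 units reach the order path that the control would have blocked in the normal run.
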
## Model Comparison for This Task Type

| Dimension | GPT-5 | Opus | Sonnet |
|---|---|---|---|
| Finding count | 7 | 10 | 8 |
| CRITICAL findings | 1 | 4 | 2 |
| Unique insights | 3 | 6 | 2 |
| Output tokens per finding | 1,592 | 524 | 210 |
| Cascade reasoning | Deep | Deep | Surface |
| Cross-doc awareness | High | Highest | Moderate |

**Opus is the strongest model for this task type.** This is the first experiment where
Opus decisively outperforms GPT-5 in both quantity AND quality. The reason: degraded-mode
propagation analysis is fundamentally about design TENSIONS between documents, which is
exactly Opus's consistent strength across all previous experiments. GPT-5's strength
(exhaustive technical detail) matters less here because the findings are at the boundary
level, not the implementation level.

## Practical Implication

For cross-document integration review:
- **Opus** as primary reviewer (highest insight density, finds architectural contradictions)
- **GPT-5** as secondary reviewer (finds the operational/concurrency issues Opus misses)
- **Sonnet** for quick first-pass only (identifies themes but not consequences)

The ideal workflow: Run Opus on all cross-document boundaries first. Then run GPT-5 on
any boundary where Opus flagged design tensions; on this evidence, GPT-5 is the model most
likely to find the specific race conditions and concurrency bugs that make those tensions
exploitable.

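The two-stage workflow can be expressed as a small orchestration function. This is a hedged sketch: the reviewers are plain callables, the stubs below stand in for real model calls and return placeholder strings, and the "TENSION" prefix convention is invented for the example.

```python
from typing import Callable, Dict, Iterable, List

# A reviewer maps a boundary name to a list of finding titles (hypothetical shape).
Reviewer = Callable[[str], List[str]]


def two_stage_review(
    boundaries: Iterable[str],
    primary: Reviewer,
    secondary: Reviewer,
    flags_tension: Callable[[str], bool],
) -> Dict[str, List[str]]:
    """Run the primary reviewer everywhere; escalate only flagged boundaries."""
    results: Dict[str, List[str]] = {}
    for boundary in boundaries:
        findings = list(primary(boundary))
        if any(flags_tension(f) for f in findings):
            findings.extend(secondary(boundary))
        results[boundary] = findings
    return results


# Stub reviewers: placeholders only, not real model output.
def stub_primary(boundary: str) -> List[str]:
    if boundary == "buying-power/risk-controls":
        return ["TENSION: placeholder finding"]
    return ["NOTE: placeholder finding"]


def stub_secondary(boundary: str) -> List[str]:
    return [f"RACE: placeholder follow-up for {boundary}"]


report = two_stage_review(
    ["signal-lifecycle/risk-controls", "buying-power/risk-controls"],
    stub_primary,
    stub_secondary,
    flags_tension=lambda finding: finding.startswith("TENSION"),
)
for boundary, findings in report.items():
    print(boundary, len(findings))
```

Gating the second pass on flagged tensions keeps the more token-hungry reviewer off boundaries where the first pass found nothing structural.
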
## Cost-Effectiveness

- Opus: 10 findings in 116s at 5,239 output tokens = **524 tokens per finding** (best
  balance of cost and depth)
- GPT-5: 7 findings in 125s at 11,147 output tokens = 1,592 tokens per finding
- Sonnet: 8 findings in 35s at 1,676 output tokens = 210 tokens per finding (cheapest, but
  lowest quality per finding)

For architecture review at document boundaries, Opus delivers ~3× the insight density
(findings per output token) of GPT-5, while finding more issues. This inverts the typical
pattern from previous experiments where GPT-5 was most cost-effective for exhaustive analysis.
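The per-finding figures and the ~3× ratio follow directly from the Results table. The snippet below redoes that arithmetic using the output-token and finding counts reported there; the dictionary layout is only for illustration.

```python
# Output tokens and finding counts as reported in the Results table.
runs = {
    "GPT-5": {"findings": 7, "output_tokens": 11_147},
    "Opus": {"findings": 10, "output_tokens": 5_239},
    "Sonnet": {"findings": 8, "output_tokens": 1_676},
}

tokens_per_finding = {
    model: run["output_tokens"] / run["findings"] for model, run in runs.items()
}

for model, value in tokens_per_finding.items():
    print(f"{model}: {value:.1f} output tokens per finding")

# Findings per token is the inverse, so the density ratio is GPT-5 / Opus.
density_ratio = tokens_per_finding["GPT-5"] / tokens_per_finding["Opus"]
print(f"Opus vs GPT-5 density ratio: {density_ratio:.2f}x")
```

Note that this counts every finding equally; it says nothing about severity or depth, which is why Sonnet's low per-finding cost does not translate into the best review value.
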