finding #52: degraded-mode propagation analysis (new lens)
Cross-document boundary analysis: signal-lifecycle + buying-power + risk-controls. Opus decisively outperforms GPT-5 (10 vs 7 findings) — first inversion observed. New lens finds a distinct class of bug: partially-degraded paths that mimic normal behavior.
@@ -0,0 +1,182 @@

# Degraded-Mode Propagation Analysis: A New Analytical Lens

**Date:** 2026-05-08
**Finding #:** 52
**Task:** Degraded-mode propagation analysis across three related gargoyle design documents:
`signal-lifecycle.md` (111 lines), `buying-power.md` (103 lines), `risk-controls.md` (315 lines),
spanning the decision engine → risk → order management boundary.

**Analytical lens:** NEW. "When one component enters a degraded state, what happens to the
components that depend on it? Trace degraded-mode behavior across document boundaries."
Distinct from assumption-finding (what's implicit?), gap-finding (what's missing?), or
race conditions (what interleavings fail?). This asks: "Do the documents' degraded
behaviors compose correctly?"

## Method

The same three documents (full text, 529 lines combined) and the same structured prompt were
sent to all 3 models via HAI proxy. The prompt specified 5 categories of degraded-mode
propagation failure: propagation gaps, semantic mismatches, recovery ordering dependencies,
silent degradation, and degradation cascades. Each finding had to follow a specific output
format, with quotes from both documents at each boundary. The models had no tools and no
project context beyond the documents themselves.

## Results

| Model | Time | Output tokens | Reasoning tokens | Findings |
|---|---|---|---|---|
| GPT-5 | 125s | 11,147 | 8,960 | 7 |
| Claude Opus 4.6 | 116s | 5,239 | (internal) | 10 |
| Claude Sonnet 4.6 | 35s | 1,676 | (internal) | 8 |

## What They Found: Common Ground (all 3 identified)

1. **Buying power staleness vs fail-closed semantic mismatch**: Buying Power says "use
   last cached value (pessimistic)" while Risk Controls says "stale cache → reject."
   All three models identified this as the primary boundary gap, though with different
   depth of analysis (see below).

2. **Market data staleness affecting buying power price estimates**: Risk Controls
   scopes staleness to fat-finger only ("other controls unaffected"), but Buying Power
   uses market prices for pro-forma deduction of market orders.

3. **Order rate window reset creating silent permissiveness**: After a crash, the rate
   limiter is empty at exactly the moment a signal burst is most likely. GPT-5 and Opus
   explored the cascade implications more deeply.

4. **Aggregator timeout splitting decisions**: Under load, the aggregator fires early,
   creating partial decisions that individually pass controls which would have caught
   the combined quantity.
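
The first mismatch above can be made concrete with a minimal Python sketch. It assumes a hypothetical `BuyingPowerReading` that carries the age of the cached value, a hypothetical `MAX_AGE_SECONDS` threshold, and a hypothetical `check_buying_power` control; none of these names or values come from the gargoyle documents. The point is only that the consumer can fail closed when the producer's degraded state travels with the value instead of being silently substituted.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical staleness threshold; the real documents may use another value.
MAX_AGE_SECONDS = 30.0


@dataclass(frozen=True)
class BuyingPowerReading:
    """Hypothetical reading that keeps the cache age next to the amount."""
    amount: float
    age_seconds: float  # time since the last successful broker refresh


def check_buying_power(reading: Optional[BuyingPowerReading], order_cost: float) -> str:
    """Fail closed on missing or stale data, then apply the normal check."""
    if reading is None:
        return "REJECT: buying power unavailable"
    if reading.age_seconds > MAX_AGE_SECONDS:
        return "REJECT: buying power stale"
    if order_cost > reading.amount:
        return "REJECT: insufficient buying power"
    return "ACCEPT"
```

If the producer returned a bare number from its cache, the same control would have no way to tell a stale value from a fresh one, which is the composition failure all three models flagged.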

## GPT-5 Unique Findings (not in either Claude model)

- **Concurrent pro-forma deduction race** (CRITICAL): Under overload, "Buy order submitted"
  event delays mean pending_buy_obligations is stale, allowing multiple concurrent decisions
  to pass buying power simultaneously. Worse than "unavailable" because it mimics normal
  acceptance.
- **Broker restriction recovery has no re-derivation trigger**: If a restriction lifts
  without a fill or market open, buying power stays at restricted level indefinitely. No
  alert, no timeout.
- **Market-open refresh failure + prior-day cache can be optimistic**: Pre-open overnight
  debits (fees, holds) make cached value optimistic, contradicting Risk Controls' assumption
  that staleness is always conservative.
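
The concurrent pro-forma race is easier to see next to one possible mitigation. The sketch below is an assumption, not the gargoyle design: a hypothetical `BuyingPowerLedger` performs the check and the deduction under one lock, so the reservation no longer depends on a delayed "Buy order submitted" event. The helper `run_concurrent_buys` is also hypothetical and exists only to exercise the ledger.

```python
import threading
from typing import List


class BuyingPowerLedger:
    """Hypothetical ledger: check and deduction happen under a single lock."""

    def __init__(self, available: float) -> None:
        self._available = available
        self._reserved = 0.0
        self._lock = threading.Lock()

    def try_reserve(self, cost: float) -> bool:
        # Check-and-deduct is atomic, so two decisions cannot both pass
        # against the same unreserved balance.
        with self._lock:
            if cost > self._available - self._reserved:
                return False
            self._reserved += cost
            return True

    def release(self, cost: float) -> None:
        # Undo a reservation, e.g. when a later control rejects the order.
        with self._lock:
            self._reserved = max(0.0, self._reserved - cost)


def run_concurrent_buys(ledger: BuyingPowerLedger, cost: float, n: int) -> int:
    """Fire n concurrent reservation attempts and count how many pass."""
    results: List[bool] = []
    results_lock = threading.Lock()

    def attempt() -> None:
        ok = ledger.try_reserve(cost)
        with results_lock:
            results.append(ok)

    threads = [threading.Thread(target=attempt) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(results)
```

With 1,000 of buying power and ten concurrent 300-unit buys, only three reservations can succeed regardless of thread interleaving. The failure GPT-5 describes corresponds to doing the comparison against a balance that is updated later by an asynchronous event.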

## Claude Opus Unique Findings (not in either other model)

- **NoShortSales contradicts Signal Risk's "no portfolio context" description** (HIGH):
  Signal-level controls are described as having "no portfolio context" but NoShortSales
  requires position knowledge. Architectural contradiction about what data Signal Risk can
  access; degraded position data behavior at this stage is unspecified.
- **Strategy crash mid-cycle leaves stale signals in aggregator buffer** (HIGH):
  Failure mode table covers crash "mid-signal" (atomicity) but NOT crash between signals
  in a multi-signal cycle. Pre-crash signals mix with post-restart fresh signals, forming
  decisions from inconsistent market state.
- **Reconciliation gate location unspecified** (HIGH): Buying Power says "blocks order flow"
  but Risk Controls don't mention it. If gate is at OrderManager, Portfolio Risk evaluates
  against uninitialized buying power. If at signal level, aggregator buffers fill during
  reconciliation.
- **Progressive buying power divergence during connectivity loss** (MEDIUM): Sell fills
  produce "no immediate change" + failed broker queries = progressive conservative drift.
  System enters sell-only mode where continuous monitoring liquidates but re-entry is blocked.
- **Pro-forma deduction reversal for internal rejection unspecified** (MEDIUM): Buying power
  deducts at control #4 but fat finger rejects at control #9. The deduction-reversal path
  for pre-submission rejection within the risk pipeline is unspecified.
- **Lost pending order state indistinguishable from empty** (CRITICAL): Duplicate Order and
  Self-Trade controls have no specified fail-closed behavior for "pending order data
  unavailable." Empty state = permissive, potentially allowing duplicate submissions.
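
The last finding in this list turns on a representation problem: "no pending orders" and "pending order data lost" look identical if both are modelled as an empty collection. A minimal sketch of one way to separate them, assuming hypothetical types `PendingOrders` and `PendingOrdersUnavailable` and a hypothetical `check_duplicate` control (none of which appear in the source documents):

```python
from dataclasses import dataclass
from typing import FrozenSet, Union


@dataclass(frozen=True)
class PendingOrders:
    """Known pending-order state; an empty set really means 'none pending'."""
    order_keys: FrozenSet[str]


@dataclass(frozen=True)
class PendingOrdersUnavailable:
    """Pending-order state could not be loaded (e.g. lost after a restart)."""
    reason: str


PendingState = Union[PendingOrders, PendingOrdersUnavailable]


def check_duplicate(state: PendingState, order_key: str) -> str:
    # Unknown state is rejected rather than treated as an empty book.
    if isinstance(state, PendingOrdersUnavailable):
        return f"REJECT: pending order data unavailable ({state.reason})"
    if order_key in state.order_keys:
        return "REJECT: duplicate order"
    return "ACCEPT"
```

Because the unavailable case has its own type, a control cannot accidentally read it as an empty book; the permissive outcome then requires an explicit, known-empty state.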

## Claude Sonnet Findings

Sonnet identified 8 findings, mostly overlapping with GPT-5 and Opus at lower depth.
Its unique contributions:

- **PDT equity calculation recovery ordering** (CRITICAL): If PDT equity recovers with
  different values than what buying power assumed during degradation, no reconciliation
  process exists.
- **Audit log write failure behavior** (HIGH): Whether audit write failure stops or continues
  evaluation is unspecified, creating either compliance gaps or unnecessary service disruption.

However, several Sonnet findings were less precise: Finding #1 restates the boundary gap
without exploring the optimistic case, and Finding #7 (PDT) was somewhat speculative without
strong textual evidence.

## Quality Assessment

- **GPT-5** was technically precise and found the most practically dangerous issues. The
  concurrent pro-forma race (Finding 3) is the most operationally critical finding across
  all models: it is a real concurrency bug that could cause financial loss during high-volume
  trading. All findings included precise mechanism descriptions and specific interleaving
  sequences. However, it was LESS prolific than Opus (7 vs 10) despite using more tokens.

- **Claude Opus** was the most prolific (10 findings) and found the deepest architectural
  issues. The NoShortSales/Signal-Risk contradiction (Finding 2) is a genuine spec-level
  inconsistency that no other model identified. The progressive divergence finding (Finding 8)
  shows multi-step causal reasoning about how one degradation creates a cascading economic
  effect. The pending-order-state finding (Finding 10) identifies a category of silent
  degradation that applies to an ENTIRE CLASS of controls, not just one component.
  Opus's characteristic strength, reasoning about design TENSIONS, manifests here as
  finding places where one document's degradation model creates a trap for another document's
  assumptions.

- **Claude Sonnet** was fast (35s, roughly 1/3 the time of the other two) and adequate but
  shallow. Findings were correct but didn't explore second-order effects or multi-step
  cascades. The PDT finding was speculative. Sonnet identified the right THEMES but didn't
  trace them to their architectural consequences.

## Key Insight: "Degraded-mode propagation" as an analytical lens

This is genuinely distinct from previous lenses in two ways:

1. **It's inherently cross-document.** Unlike assumption-finding (which can work on a single
   doc), degraded-mode propagation REQUIRES multiple documents because it specifically asks
   about boundaries. This makes it ideal for detecting integration issues.

2. **It finds a different CLASS of bug.** Previous lenses found issues with the happy path
   (gaps, assumptions) or the error path (race conditions, invariant violations). This lens
   finds issues with the PARTIALLY-DEGRADED path, where components are working but with
   degraded inputs they don't fully detect.

The most architecturally significant findings (concurrent pro-forma race, aggregator split
decisions, progressive divergence cascade) are all about systems that APPEAR to be working
but are silently making incorrect decisions because one upstream component is degraded in a
way that mimics normal behavior.

## Model Comparison for This Task Type

| Dimension | GPT-5 | Opus | Sonnet |
|---|---|---|---|
| Finding count | 7 | 10 | 8 |
| CRITICAL findings | 1 | 4 | 2 |
| Unique insights | 3 | 6 | 2 |
| Tokens per finding | 1,592 | 524 | 210 |
| Cascade reasoning | Deep | Deep | Surface |
| Cross-doc awareness | High | Highest | Moderate |

**Opus is the strongest model for this task type.** This is the first experiment where
Opus decisively outperforms GPT-5 in both quantity AND quality. The reason: degraded-mode
propagation analysis is fundamentally about design TENSIONS between documents, which is
exactly Opus's consistent strength across all previous experiments. GPT-5's strength
(exhaustive technical detail) matters less here because the findings are at the boundary
level, not the implementation level.

## Practical Implication

For cross-document integration review:
- **Opus** as primary reviewer (highest insight density, finds architectural contradictions)
- **GPT-5** as secondary reviewer (finds the operational/concurrency issues Opus misses)
- **Sonnet** for quick first-pass only (identifies themes but not consequences)

The ideal workflow: run Opus on all cross-document boundaries first. Then run GPT-5 on
any boundary where Opus flagged design tensions; on this evidence, GPT-5 is the model more
likely to find the specific race conditions and concurrency bugs that make those tensions
exploitable.

## Cost-Effectiveness

- Opus: 10 findings in 116s at 5,239 tokens = **524 tokens per finding** (best efficiency
  once finding depth is taken into account)
- GPT-5: 7 findings in 125s at 11,147 tokens = 1,592 tokens per finding
- Sonnet: 8 findings in 35s at 1,676 tokens = 210 tokens per finding (cheapest per finding,
  but lowest quality per finding)

For architecture review at document boundaries, Opus delivers ~3× the insight density
per token compared to GPT-5, while finding more issues. This inverts the typical pattern
from previous experiments where GPT-5 was most cost-effective for exhaustive analysis.
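
The per-finding figures and the ~3× ratio can be re-derived from the Results table. The short Python check below is only a convenience for re-running that arithmetic; it uses the output-token column and the finding counts as reported, and the variable names are illustrative.

```python
# (output tokens, findings) per model, copied from the Results table.
results = {
    "GPT-5": (11_147, 7),
    "Claude Opus 4.6": (5_239, 10),
    "Claude Sonnet 4.6": (1_676, 8),
}

# Output tokens spent per reported finding.
tokens_per_finding = {
    model: tokens / findings for model, (tokens, findings) in results.items()
}

# How many times more tokens GPT-5 spends per finding than Opus.
gpt5_to_opus = tokens_per_finding["GPT-5"] / tokens_per_finding["Claude Opus 4.6"]

for model, tpf in tokens_per_finding.items():
    print(f"{model}: {tpf:.1f} output tokens per finding")
print(f"GPT-5 / Opus ratio: {gpt5_to_opus:.2f}")
```

Tokens per finding measures cost only; it says nothing about finding depth, which is why Sonnet's lower figure does not make it the better reviewer here.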