finding #52: degraded-mode propagation analysis (new lens)

Cross-document boundary analysis: signal-lifecycle + buying-power + risk-controls. Opus decisively outperforms GPT-5 (10 vs 7 findings) — first inversion observed. New lens finds a distinct class of bug: partially-degraded paths that mimic normal behavior.
2026-05-08 14:29:29 -07:00
parent 79915d1dc3
commit c1ca8cfe46
1 changed files with 182 additions and 0 deletions
@@ -0,0 +1,182 @@
+# Degraded-Mode Propagation Analysis: A New Analytical Lens
+
+**Date:** 2026-05-08
+**Finding #:** 52
+**Task:** Degraded-mode propagation analysis across three related gargoyle design documents:
+`signal-lifecycle.md` (111 lines), `buying-power.md` (103 lines), `risk-controls.md` (315 lines)
+— spanning the decision engine → risk → order management boundary.
+
+**Analytical lens:** NEW. "When one component enters a degraded state, what happens to the
+components that depend on it? Trace degraded-mode behavior across document boundaries."
+Distinct from assumption-finding (what's implicit?), gap-finding (what's missing?), or
+race conditions (what interleavings fail?). This asks: "Do the documents' degraded
+behaviors compose correctly?"
+
+## Method
+
+All three models received the same three documents (full text, 529 lines combined) and
+the same structured prompt via HAI proxy. The prompt specified five categories of
+degraded-mode propagation failures: propagation gaps, semantic mismatches, recovery
+ordering dependencies, silent degradation, and degradation cascades. Each finding had to
+follow a specified output format, with quotes from both documents at each boundary. The
+models had no tools and no project context beyond the documents themselves.
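+
To make the required output shape concrete, here is a minimal sketch of what one finding record could look like. The prompt's actual schema is not reproduced in this report, so every class and field name below is an assumption. Only the five category names and the requirement for a quote from each side of the boundary come from the method description.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    # The five categories named in the prompt.
    PROPAGATION_GAP = "propagation gap"
    SEMANTIC_MISMATCH = "semantic mismatch"
    RECOVERY_ORDERING = "recovery ordering dependency"
    SILENT_DEGRADATION = "silent degradation"
    DEGRADATION_CASCADE = "degradation cascade"


@dataclass(frozen=True)
class BoundaryQuote:
    document: str  # e.g. "buying-power.md"
    text: str      # quote taken from that document


@dataclass(frozen=True)
class Finding:
    # Hypothetical record layout; not the prompt's real schema.
    title: str
    category: Category
    upstream: BoundaryQuote    # quote from the side that degrades
    downstream: BoundaryQuote  # quote from the side that depends on it

    def is_cross_document(self) -> bool:
        """True when the two quotes come from different documents."""
        return self.upstream.document != self.downstream.document


example = Finding(
    title="Buying power staleness vs fail-closed",
    category=Category.SEMANTIC_MISMATCH,
    upstream=BoundaryQuote("buying-power.md", "use last cached value (pessimistic)"),
    downstream=BoundaryQuote("risk-controls.md", "stale cache → reject"),
)
print(example.is_cross_document())
```

A check like `is_cross_document()` is one way to enforce the lens's defining property: a finding that quotes only one document is not a boundary finding.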
+
+## Results
+
+| Model | Time | Output tokens | Reasoning tokens | Findings |
+|---|---|---|---|---|
+| GPT-5 | 125s | 11,147 | 8,960 | 7 |
+| Claude Opus 4.6 | 116s | 5,239 | (internal) | 10 |
+| Claude Sonnet 4.6 | 35s | 1,676 | (internal) | 8 |
+
+## What They Found — Common Ground (all 3 identified)
+
+1. **Buying power staleness vs fail-closed semantic mismatch** — Buying Power says "use
+   last cached value (pessimistic)" while Risk Controls says "stale cache → reject."
+   All three models identified this as the primary boundary gap, though with different
+   depth of analysis (see below).
+
+2. **Market data staleness affecting buying power price estimates** — Risk Controls
+   scopes staleness to fat-finger only ("other controls unaffected"), but Buying Power
+   uses market prices for pro-forma deduction of market orders.
+
+3. **Order rate window reset creating silent permissiveness** — After crash, the rate
+   limiter is empty while signal burst is most likely. GPT-5 and Opus explored the
+   cascade implications more deeply.
+
+4. **Aggregator timeout splitting decisions** — Under load, aggregator fires early,
+   creating partial decisions that individually pass controls which would have caught
+   the combined quantity.
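+
The first mismatch above can be illustrated with a short sketch. It assumes a simple cache with an age field and a hypothetical staleness threshold `MAX_AGE_S`. Neither the threshold nor the function names appear in the design documents; they only serve to show how the two readings disagree on the same input.

```python
from dataclasses import dataclass

MAX_AGE_S = 5.0  # hypothetical staleness threshold, not taken from the documents


@dataclass
class CachedBuyingPower:
    value: float
    age_s: float  # seconds since the last successful broker refresh


def buying_power_reading(cache: CachedBuyingPower, order_cost: float) -> bool:
    """Buying Power reading: a stale cache is still used ("last cached value")."""
    return order_cost <= cache.value


def risk_controls_reading(cache: CachedBuyingPower, order_cost: float) -> bool:
    """Risk Controls reading: a stale cache fails closed ("stale cache -> reject")."""
    if cache.age_s > MAX_AGE_S:
        return False
    return order_cost <= cache.value


stale = CachedBuyingPower(value=10_000.0, age_s=60.0)
accepts = buying_power_reading(stale, order_cost=2_500.0)
rejects = risk_controls_reading(stale, order_cost=2_500.0)
print(accepts, rejects)  # True False
```

The same order against the same stale cache is accepted under one document and rejected under the other, which is why the boundary needs a single stated rule.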
+
+## GPT-5 Unique Findings (not in either Claude model)
+
+- **Concurrent pro-forma deduction race** (CRITICAL): Under overload, "Buy order submitted"
+  event delays mean pending_buy_obligations is stale, allowing multiple concurrent decisions
+  to pass buying power simultaneously. Worse than "unavailable" because it mimics normal
+  acceptance.
+- **Broker restriction recovery has no re-derivation trigger**: If a restriction lifts
+  without a fill or market open, buying power stays at restricted level indefinitely. No
+  alert, no timeout.
+- **Market-open refresh failure + prior-day cache can be optimistic**: Pre-open overnight
+  debits (fees, holds) make cached value optimistic, contradicting Risk Controls' assumption
+  that staleness is always conservative.
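+
The concurrent pro-forma race can be reproduced in a deterministic toy model. This is a sketch under stated assumptions: available buying power is modeled as cash minus `pending_buy_obligations`, and overload is modeled as a queue that delays the "Buy order submitted" event. The class and method names are hypothetical.

```python
from collections import deque


class BuyingPowerView:
    """Hypothetical model: available = cash - pending_buy_obligations."""

    def __init__(self, cash: float) -> None:
        self.cash = cash
        self.pending_buy_obligations = 0.0
        self.event_queue = deque()  # delayed "Buy order submitted" events

    def check(self, cost: float) -> bool:
        # Reads pending obligations, which may lag behind real submissions.
        return cost <= self.cash - self.pending_buy_obligations

    def submit(self, cost: float) -> None:
        # Under overload the event is queued instead of applied immediately.
        self.event_queue.append(cost)

    def drain_events(self) -> None:
        while self.event_queue:
            self.pending_buy_obligations += self.event_queue.popleft()


view = BuyingPowerView(cash=10_000.0)
accepted = []
for cost in (6_000.0, 6_000.0):
    if view.check(cost):  # both checks see pending_buy_obligations == 0
        view.submit(cost)
        accepted.append(cost)

view.drain_events()
overcommitted = view.pending_buy_obligations > view.cash
print(len(accepted), overcommitted)  # 2 True
```

Both decisions pass a check that looks entirely normal, and the overcommitment only becomes visible once the delayed events are applied.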
+
+## Claude Opus Unique Findings (not in either other model)
+
+- **NoShortSales contradicts Signal Risk's "no portfolio context" description** (HIGH):
+  Signal-level controls are described as having "no portfolio context" but NoShortSales
+  requires position knowledge. Architectural contradiction about what data Signal Risk can
+  access; degraded position data behavior at this stage is unspecified.
+- **Strategy crash mid-cycle leaves stale signals in aggregator buffer** (HIGH):
+  Failure mode table covers crash "mid-signal" (atomicity) but NOT crash between signals
+  in a multi-signal cycle. Pre-crash signals mix with post-restart fresh signals, forming
+  decisions from inconsistent market state.
+- **Reconciliation gate location unspecified** (HIGH): Buying Power says "blocks order flow"
+  but Risk Controls does not mention the gate. If the gate sits at OrderManager, Portfolio
+  Risk evaluates against uninitialized buying power. If it sits at signal level, aggregator
+  buffers fill during reconciliation.
+- **Progressive buying power divergence during connectivity loss** (MEDIUM): Sell fills
+  produce "no immediate change" and broker queries fail, so the cached buying power drifts
+  progressively more conservative. The system enters a sell-only mode in which continuous
+  monitoring liquidates positions but re-entry is blocked.
+- **Pro-forma deduction reversal for internal rejection unspecified** (MEDIUM): Buying power
+  deducts at control #4 but fat finger rejects at control #9. The deduction-reversal path
+  for pre-submission rejection within the risk pipeline is unspecified.
+- **Lost pending order state indistinguishable from empty** (CRITICAL): Duplicate Order and
+  Self-Trade controls have no specified fail-closed behavior for "pending order data
+  unavailable." Empty state = permissive, potentially allowing duplicate submissions.
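+
The last finding in this list comes down to whether "unavailable" and "empty" are distinguishable to the control. A minimal sketch, assuming pending orders are tracked as a set of order keys and that `None` stands for lost state; the key format, function names, and exception type are all hypothetical.

```python
from typing import Optional


class PendingStateUnavailable(Exception):
    """Raised when pending-order data cannot be read (hypothetical error type)."""


def is_duplicate_permissive(pending: Optional[set], order_key: str) -> bool:
    # Collapses "unavailable" (None) into "no pending orders" (empty set).
    return order_key in (pending or set())


def is_duplicate_fail_closed(pending: Optional[set], order_key: str) -> bool:
    # Keeps the two states distinct so the caller can reject instead of guessing.
    if pending is None:
        raise PendingStateUnavailable("pending order data unavailable")
    return order_key in pending


key = "AAPL:BUY:100"  # hypothetical order key
lost_state = None     # the pending-order store could not be read

permissive_result = is_duplicate_permissive(lost_state, key)

try:
    is_duplicate_fail_closed(lost_state, key)
    fail_closed_rejected = False
except PendingStateUnavailable:
    fail_closed_rejected = True

print(permissive_result, fail_closed_rejected)  # False True
```

In the permissive version a lost store answers "not a duplicate", so a resubmission would pass. The fail-closed version surfaces the degraded state to the caller.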
+
+## Claude Sonnet Findings
+
+Sonnet identified 8 findings, mostly overlapping with GPT-5 and Opus at lower depth.
+Its unique contributions:
+
+- **PDT equity calculation recovery ordering** (CRITICAL): If PDT equity recovers with
+  different values than what buying power assumed during degradation, no reconciliation
+  process exists.
+- **Audit log write failure behavior** (HIGH): Whether audit write failure stops or continues
+  evaluation is unspecified, creating either compliance gaps or unnecessary service disruption.
+
+However, several Sonnet findings were less precise — Finding #1 restates the boundary gap
+without exploring the optimistic case, and Finding #7 (PDT) was somewhat speculative without
+strong textual evidence.
+
+## Quality Assessment
+
+- **GPT-5** was technically precise and found the most practically dangerous issues. The
+  concurrent pro-forma race (Finding 3) is the most operationally critical finding across
+  all models — it's a real concurrency bug that could cause financial loss during high-volume
+  trading. All findings included precise mechanism descriptions and specific interleaving
+  sequences. However, it was LESS prolific than Opus (7 vs 10) despite using more tokens.
+
+- **Claude Opus** was the most prolific (10 findings) and found the deepest architectural
+  issues. The NoShortSales/Signal-Risk contradiction (Finding 2) is a genuine spec-level
+  inconsistency that no other model identified. The progressive divergence finding (Finding 8)
+  shows multi-step causal reasoning about how one degradation creates a cascading economic
+  effect. The pending-order-state finding (Finding 10) identifies a category of silent
+  degradation that applies to an ENTIRE CLASS of controls — not just one component.
+  Opus's characteristic strength — reasoning about design TENSIONS — manifests here as
+  finding places where one document's degradation model creates a trap for another document's
+  assumptions.
+
+- **Claude Sonnet** was fast (35s, under a third of either other model's time) and adequate but shallow. Findings were
+  correct but didn't explore second-order effects or multi-step cascades. The PDT finding
+  was speculative. Sonnet identified the right THEMES but didn't trace them to their
+  architectural consequences.
+
+## Key Insight — "Degraded-mode propagation" as an analytical lens
+
+This lens is genuinely distinct from previous lenses in two ways:
+
+1. **It's inherently cross-document.** Unlike assumption-finding (which can work on a single
+   doc), degraded-mode propagation REQUIRES multiple documents because it specifically asks
+   about boundaries. This makes it ideal for detecting integration issues.
+
+2. **It finds a different CLASS of bug.** Previous lenses found issues with the happy path
+   (gaps, assumptions) or the error path (race conditions, invariant violations). This lens
+   finds issues with the PARTIALLY-DEGRADED path — where components are working but with
+   degraded inputs they don't fully detect.
+
+The most architecturally significant findings (concurrent pro-forma race, aggregator split
+decisions, progressive divergence cascade) are all about systems that APPEAR to be working
+but are silently making incorrect decisions because one upstream component is degraded in a
+way that mimics normal behavior.
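+
The aggregator split is a compact example of this pattern. The sketch below assumes a hypothetical per-decision quantity limit and models an early timeout as splitting one batch of signals into two decisions. The limit value and function names are illustrative only.

```python
MAX_QTY_PER_DECISION = 1_000  # hypothetical per-decision limit


def passes_quantity_control(qty: int) -> bool:
    return qty <= MAX_QTY_PER_DECISION


def aggregate(signals: list, fired_early: bool) -> list:
    """Return decision quantities; an early timeout splits the batch in two."""
    if not fired_early:
        return [sum(signals)]
    mid = len(signals) // 2
    return [sum(signals[:mid]), sum(signals[mid:])]


signals = [350, 350, 350, 350]
normal = [passes_quantity_control(q) for q in aggregate(signals, fired_early=False)]
degraded = [passes_quantity_control(q) for q in aggregate(signals, fired_early=True)]
print(normal, degraded)  # [False] [True, True]
```

The control itself never fails or reports an error. It is simply shown two smaller decisions instead of the one it would have rejected.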
+
+## Model Comparison for This Task Type
+
+| Dimension | GPT-5 | Opus | Sonnet |
+|---|---|---|---|
+| Finding count | 7 | 10 | 8 |
+| CRITICAL findings | 1 | 4 | 2 |
+| Unique insights | 3 | 6 | 2 |
+| Output tokens per finding | 1,592 | 524 | 210 |
+| Cascade reasoning | Deep | Deep | Surface |
+| Cross-doc awareness | High | Highest | Moderate |
+
+**Opus is the strongest model for this task type.** This is the first experiment where
+Opus decisively outperforms GPT-5 in both quantity AND quality. The reason: degraded-mode
+propagation analysis is fundamentally about design TENSIONS between documents — which is
+exactly Opus's consistent strength across all previous experiments. GPT-5's strength
+(exhaustive technical detail) matters less here because the findings are at the boundary
+level, not the implementation level.
+
+## Practical Implication
+
+For cross-document integration review:
+- **Opus** as primary reviewer (highest insight density, finds architectural contradictions)
+- **GPT-5** as secondary reviewer (finds the operational/concurrency issues Opus misses)
+- **Sonnet** for quick first-pass only (identifies themes but not consequences)
+
+The ideal workflow: Run Opus on all cross-document boundaries first. Then run GPT-5 on
+any boundary where Opus flagged design tensions — GPT-5 will find the specific race
+conditions and concurrency bugs that make those tensions exploitable.
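+
The two-stage workflow can be sketched as a small orchestration function. This is an illustration only: the reviewer callables are stubs standing in for real model calls, and the types, names, and stub findings are assumptions and do not describe any existing tooling.

```python
from typing import Callable, Dict, List, Tuple

Boundary = Tuple[str, str]                  # (upstream doc, downstream doc)
Reviewer = Callable[[Boundary], List[str]]  # returns finding titles


def two_stage_review(
    boundaries: List[Boundary], primary: Reviewer, secondary: Reviewer
) -> Dict[Boundary, List[str]]:
    """Run the primary reviewer everywhere; escalate only flagged boundaries."""
    results: Dict[Boundary, List[str]] = {}
    for boundary in boundaries:
        findings = list(primary(boundary))
        if findings:  # the secondary reviewer runs only where tensions were flagged
            findings.extend(secondary(boundary))
        results[boundary] = findings
    return results


# Stub reviewers; real ones would call the respective models.
def stub_primary(boundary: Boundary) -> List[str]:
    return ["design tension"] if "buying-power.md" in boundary else []


def stub_secondary(boundary: Boundary) -> List[str]:
    return ["concurrency issue"]


boundaries = [
    ("signal-lifecycle.md", "risk-controls.md"),
    ("buying-power.md", "risk-controls.md"),
]
report = two_stage_review(boundaries, stub_primary, stub_secondary)
for boundary, findings in report.items():
    print(boundary, findings)
```

Escalating only flagged boundaries keeps the more token-hungry second pass focused where the first pass found something to exploit.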
+
+## Cost-Effectiveness
+
+- Opus: 10 findings in 116s at 5,239 tokens = **524 tokens per finding** (best efficiency once finding quality is considered)
+- GPT-5: 7 findings in 125s at 11,147 tokens = 1,592 tokens per finding
+- Sonnet: 8 findings in 35s at 1,676 tokens = 210 tokens per finding (cheapest, but
+  lowest quality per finding)
+
+For architecture review at document boundaries, Opus delivers ~3× the insight density
+(findings per output token; 1,592 / 524 ≈ 3.0) of GPT-5, while finding more issues. This
+inverts the typical pattern from previous experiments where GPT-5 was most cost-effective
+for exhaustive analysis.
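+
The per-finding figures can be recomputed from the Results table. The sketch assumes the output-token counts are directly comparable across models. The table lists the Claude models' reasoning tokens only as "(internal)", so that assumption may not hold exactly.

```python
# Figures copied from the Results table.
runs = {
    "GPT-5": {"findings": 7, "output_tokens": 11_147, "seconds": 125},
    "Opus": {"findings": 10, "output_tokens": 5_239, "seconds": 116},
    "Sonnet": {"findings": 8, "output_tokens": 1_676, "seconds": 35},
}

tokens_per_finding = {
    model: run["output_tokens"] / run["findings"] for model, run in runs.items()
}
for model, tpf in tokens_per_finding.items():
    print(f"{model}: {tpf:.1f} output tokens per finding")

# Ratio behind the "~3x" claim: GPT-5 tokens per finding over Opus tokens per finding.
ratio = tokens_per_finding["GPT-5"] / tokens_per_finding["Opus"]
print(f"GPT-5 / Opus: {ratio:.2f}x")
```

Note that this metric counts findings, not their severity, so it says nothing on its own about which findings matter most.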