finding #52: degraded-mode propagation analysis (new lens)

Cross-document boundary analysis: signal-lifecycle + buying-power + risk-controls. Opus decisively outperforms GPT-5 (10 vs 7 findings) — first inversion observed. New lens finds a distinct class of bug: partially-degraded paths that mimic normal behavior.
2026-05-08 14:29:29 -07:00
parent 79915d1dc3
commit c1ca8cfe46
1 changed files with 182 additions and 0 deletions
@@ -0,0 +1,182 @@
 # Degraded-Mode Propagation Analysis: A New Analytical Lens
 **Date:** 2026-05-08
 **Finding #:** 52
 **Task:** Degraded-mode propagation analysis across three related gargoyle design documents:
 `signal-lifecycle.md` (111 lines), `buying-power.md` (103 lines), `risk-controls.md` (315 lines)
 — spanning the decision engine → risk → order management boundary.
 **Analytical lens:** NEW. "When one component enters a degraded state, what happens to the
 components that depend on it? Trace degraded-mode behavior across document boundaries."
 Distinct from assumption-finding (what's implicit?), gap-finding (what's missing?), or
 race conditions (what interleavings fail?). This asks: "Do the documents' degraded
 behaviors compose correctly?"
 ## Method
 All 3 models received the same three documents (full text, 529 lines combined) and the
 same structured prompt via HAI proxy. The prompt specified 5 categories of degraded-mode
 propagation failures: propagation gaps, semantic mismatches, recovery ordering dependencies,
 silent degradation, and degradation cascades. It required a specific output format per
 finding, with quotes from both documents at each boundary. The models had no tools and no
 project context beyond the documents themselves.
 ## Results
 | Model | Time | Output tokens | Reasoning tokens | Findings |
 |---|---|---|---|---|
 | GPT-5 | 125s | 11,147 | 8,960 | 7 |
 | Claude Opus 4.6 | 116s | 5,239 | (internal) | 10 |
 | Claude Sonnet 4.6 | 35s | 1,676 | (internal) | 8 |
 ## What They Found — Common Ground (all 3 identified)
 1. **Buying power staleness vs fail-closed semantic mismatch** — Buying Power says "use
   last cached value (pessimistic)" while Risk Controls says "stale cache → reject."
   All three models identified this as the primary boundary gap, though with different
   depth of analysis (see below).
 2. **Market data staleness affecting buying power price estimates** — Risk Controls
   scopes staleness to fat-finger only ("other controls unaffected"), but Buying Power
   uses market prices for pro-forma deduction of market orders.
 3. **Order rate window reset creating silent permissiveness**: After a crash, the rate
   limiter's window is empty at exactly the moment a signal burst is most likely. GPT-5
   and Opus explored the cascade implications more deeply than Sonnet.
 4. **Aggregator timeout splitting decisions** — Under load, aggregator fires early,
   creating partial decisions that individually pass controls which would have caught
   the combined quantity.
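
The first mismatch above can be made concrete with a minimal sketch. It assumes a hypothetical staleness threshold (`MAX_CACHE_AGE_SECONDS`) and hypothetical function names; neither document's real interface is known here. The point is only that the two stated policies return different answers for the same stale reading.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical staleness threshold; the documents' actual value is not known here.
MAX_CACHE_AGE_SECONDS = 30.0


@dataclass(frozen=True)
class CachedBuyingPower:
    value: float        # last value confirmed by the broker
    age_seconds: float  # time elapsed since that confirmation


def buying_power_doc_policy(cache: CachedBuyingPower) -> Optional[float]:
    """Buying Power's stated stance: keep serving the last cached value."""
    return cache.value


def risk_controls_doc_policy(cache: CachedBuyingPower) -> Optional[float]:
    """Risk Controls' stated stance: a stale cache means reject (modelled as None)."""
    if cache.age_seconds > MAX_CACHE_AGE_SECONDS:
        return None
    return cache.value


def policies_disagree(cache: CachedBuyingPower) -> bool:
    """True when one document would proceed and the other would reject."""
    served = buying_power_doc_policy(cache) is not None
    accepted = risk_controls_doc_policy(cache) is not None
    return served != accepted


stale = CachedBuyingPower(value=10_000.0, age_seconds=120.0)
fresh = CachedBuyingPower(value=10_000.0, age_seconds=5.0)
print(policies_disagree(stale), policies_disagree(fresh))
```

A check of this shape, run against each degraded state both documents mention, is one way to enumerate boundary mismatches mechanically.
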
 ## GPT-5 Unique Findings (not in either Claude model)
 - **Concurrent pro-forma deduction race** (CRITICAL): Under overload, "Buy order submitted"
  event delays mean pending_buy_obligations is stale, allowing multiple concurrent decisions
  to pass buying power simultaneously. Worse than "unavailable" because it mimics normal
  acceptance.
 - **Broker restriction recovery has no re-derivation trigger**: If a restriction lifts
  without a fill or market open, buying power stays at restricted level indefinitely. No
  alert, no timeout.
 - **Market-open refresh failure + prior-day cache can be optimistic**: Pre-open overnight
  debits (fees, holds) make cached value optimistic, contradicting Risk Controls' assumption
  that staleness is always conservative.
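
The concurrent pro-forma race can be illustrated with a small deterministic sketch. The class and method names below are hypothetical, as is the idea of fixing the race with an atomic check-and-reserve; the source documents only establish that `pending_buy_obligations` is updated by a delayed "Buy order submitted" event.

```python
import threading


class BuyingPowerLedger:
    """Toy ledger; names other than pending_buy_obligations are assumptions."""

    def __init__(self, cash: float) -> None:
        self.cash = cash
        self.pending_buy_obligations = 0.0
        self._lock = threading.Lock()

    def available(self) -> float:
        return self.cash - self.pending_buy_obligations

    def check_only(self, cost: float) -> bool:
        # Racy pattern: the obligation is only recorded later, when the
        # "Buy order submitted" event arrives.
        return cost <= self.available()

    def on_buy_order_submitted(self, cost: float) -> None:
        with self._lock:
            self.pending_buy_obligations += cost

    def check_and_reserve(self, cost: float) -> bool:
        # One possible mitigation: check and deduct in a single critical section.
        with self._lock:
            if cost > self.available():
                return False
            self.pending_buy_obligations += cost
            return True


# Two decisions are evaluated before either delayed event is processed.
racy = BuyingPowerLedger(cash=10_000.0)
racy_results = [racy.check_only(6_000.0), racy.check_only(6_000.0)]
racy.on_buy_order_submitted(6_000.0)
racy.on_buy_order_submitted(6_000.0)
overcommitted = racy.pending_buy_obligations > racy.cash

safe = BuyingPowerLedger(cash=10_000.0)
safe_results = [safe.check_and_reserve(6_000.0), safe.check_and_reserve(6_000.0)]
print(racy_results, overcommitted, safe_results)
```

In the racy path both decisions pass and the ledger ends up overcommitted, which is the "mimics normal acceptance" behavior the finding describes.
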
 ## Claude Opus Unique Findings (not in either other model)
 - **NoShortSales contradicts Signal Risk's "no portfolio context" description** (HIGH):
  Signal-level controls are described as having "no portfolio context" but NoShortSales
  requires position knowledge. Architectural contradiction about what data Signal Risk can
  access; degraded position data behavior at this stage is unspecified.
 - **Strategy crash mid-cycle leaves stale signals in aggregator buffer** (HIGH):
  Failure mode table covers crash "mid-signal" (atomicity) but NOT crash between signals
  in a multi-signal cycle. Pre-crash signals mix with post-restart fresh signals, forming
  decisions from inconsistent market state.
 - **Reconciliation gate location unspecified** (HIGH): Buying Power says "blocks order flow"
  but Risk Controls don't mention it. If gate is at OrderManager, Portfolio Risk evaluates
  against uninitialized buying power. If at signal level, aggregator buffers fill during
  reconciliation.
 - **Progressive buying power divergence during connectivity loss** (MEDIUM): Sell fills
  produce "no immediate change" and broker queries keep failing, so buying power drifts
  progressively more conservative. The system ends up in a sell-only mode where continuous
  monitoring liquidates but re-entry is blocked.
 - **Pro-forma deduction reversal for internal rejection unspecified** (MEDIUM): Buying power
  deducts at control #4 but fat finger rejects at control #9. The deduction-reversal path
  for pre-submission rejection within the risk pipeline is unspecified.
 - **Lost pending order state indistinguishable from empty** (CRITICAL): Duplicate Order and
  Self-Trade controls have no specified fail-closed behavior for "pending order data
  unavailable." Empty state = permissive, potentially allowing duplicate submissions.
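
The last finding above comes down to whether "no data" and "no pending orders" are distinguishable. A minimal sketch, assuming pending orders are represented as a set of order keys and that unavailability is modelled as `None` (both are illustrative choices, not taken from the documents):

```python
from typing import FrozenSet, Optional


class PendingStateUnavailable(Exception):
    """Raised when pending-order data cannot be read (hypothetical error type)."""


def is_duplicate_permissive(pending: Optional[FrozenSet[str]], order_key: str) -> bool:
    # Collapses "unavailable" (None) into "empty", so every order looks new.
    return order_key in (pending or frozenset())


def is_duplicate_fail_closed(pending: Optional[FrozenSet[str]], order_key: str) -> bool:
    # Keeps the two states distinct and refuses to answer without data.
    if pending is None:
        raise PendingStateUnavailable("pending order data unavailable")
    return order_key in pending


# With lost state, the permissive check reports "not a duplicate".
print(is_duplicate_permissive(None, "AAPL:BUY:100"))
```

The same distinction would apply to any control that reads pending-order state, which is why the finding covers a class of controls.
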
 ## Claude Sonnet Findings
 Sonnet identified 8 findings, mostly overlapping with GPT-5 and Opus at lower depth.
 Its unique contributions:
 - **PDT equity calculation recovery ordering** (CRITICAL): If PDT equity recovers with
  different values than what buying power assumed during degradation, no reconciliation
  process exists.
 - **Audit log write failure behavior** (HIGH): Whether audit write failure stops or continues
  evaluation is unspecified, creating either compliance gaps or unnecessary service disruption.
 However, several Sonnet findings were less precise: Finding #1 restates the boundary gap
 without exploring the optimistic case, and Finding #7 (PDT) was speculative, with little
 textual evidence to support it.
 ## Quality Assessment
 - **GPT-5** was technically precise and found the most practically dangerous issues. The
  concurrent pro-forma race (Finding 3) is the most operationally critical finding across
  all models — it's a real concurrency bug that could cause financial loss during high-volume
  trading. All findings included precise mechanism descriptions and specific interleaving
  sequences. However, it was LESS prolific than Opus (7 vs 10) despite using more tokens.
 - **Claude Opus** was the most prolific (10 findings) and found the deepest architectural
  issues. The NoShortSales/Signal-Risk contradiction (Finding 2) is a genuine spec-level
  inconsistency that no other model identified. The progressive divergence finding (Finding 8)
  shows multi-step causal reasoning about how one degradation creates a cascading economic
  effect. The pending-order-state finding (Finding 10) identifies a category of silent
  degradation that applies to an ENTIRE CLASS of controls — not just one component.
  Opus's characteristic strength — reasoning about design TENSIONS — manifests here as
  finding places where one document's degradation model creates a trap for another document's
  assumptions.
 - **Claude Sonnet** was fast (35s, under a third of the time either other model took) and
  adequate but shallow. Its findings were correct but didn't explore second-order effects or
  multi-step cascades. The PDT finding was speculative. Sonnet identified the right THEMES
  but didn't trace them to their architectural consequences.
 ## Key Insight — "Degraded-mode propagation" as an analytical lens
 This lens differs from previous lenses in two ways:
 1. **It's inherently cross-document.** Unlike assumption-finding (which can work on a single
   doc), degraded-mode propagation REQUIRES multiple documents because it specifically asks
   about boundaries. This makes it ideal for detecting integration issues.
 2. **It finds a different CLASS of bug.** Previous lenses found issues with the happy path
   (gaps, assumptions) or the error path (race conditions, invariant violations). This lens
   finds issues with the PARTIALLY-DEGRADED path — where components are working but with
   degraded inputs they don't fully detect.
 The most architecturally significant findings (concurrent pro-forma race, aggregator split
 decisions, progressive divergence cascade) are all about systems that APPEAR to be working
 but are silently making incorrect decisions because one upstream component is degraded in a
 way that mimics normal behavior.
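
The rate-limiter finding is a compact example of a degraded state that mimics normal behavior. The sketch below assumes an in-memory sliding-window limiter with hypothetical names and limits; the documents' actual limiter design is not known here. After a restart, an empty window looks identical to a quiet period.

```python
from collections import deque
from typing import Deque


class OrderRateLimiter:
    """In-memory sliding-window limiter (hypothetical design)."""

    def __init__(self, max_orders: int, window_seconds: float) -> None:
        self.max_orders = max_orders
        self.window_seconds = window_seconds
        self._timestamps: Deque[float] = deque()

    def allow(self, now: float) -> bool:
        # Drop submissions that have aged out of the window.
        while self._timestamps and now - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()
        if len(self._timestamps) >= self.max_orders:
            return False
        self._timestamps.append(now)
        return True


limiter = OrderRateLimiter(max_orders=3, window_seconds=60.0)
before_crash = [limiter.allow(t) for t in (0.0, 1.0, 2.0, 3.0)]

# A crash loses the in-memory window; the new instance starts empty.
limiter = OrderRateLimiter(max_orders=3, window_seconds=60.0)
after_restart = limiter.allow(4.0)
print(before_crash, after_restart)
```

The fourth order is rejected before the crash, yet a fifth is admitted four seconds in, and nothing in the limiter's own state signals that the limit was exceeded.
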
 ## Model Comparison for This Task Type
 | Dimension | GPT-5 | Opus | Sonnet |
 |---|---|---|---|
 | Finding count | 7 | 10 | 8 |
 | CRITICAL findings | 1 | 4 | 2 |
 | Unique insights | 3 | 6 | 2 |
 | Tokens per finding | 1,592 | 524 | 210 |
 | Cascade reasoning | Deep | Deep | Surface |
 | Cross-doc awareness | High | Highest | Moderate |
 **Opus is the strongest model for this task type.** This is the first experiment where
 Opus decisively outperforms GPT-5 in both quantity AND quality. The reason: degraded-mode
 propagation analysis is fundamentally about design TENSIONS between documents — which is
 exactly Opus's consistent strength across all previous experiments. GPT-5's strength
 (exhaustive technical detail) matters less here because the findings are at the boundary
 level, not the implementation level.
 ## Practical Implication
 For cross-document integration review:
 - **Opus** as primary reviewer (highest insight density, finds architectural contradictions)
 - **GPT-5** as secondary reviewer (finds the operational/concurrency issues Opus misses)
 - **Sonnet** for quick first-pass only (identifies themes but not consequences)
 The ideal workflow: Run Opus on all cross-document boundaries first. Then run GPT-5 on
 any boundary where Opus flagged design tensions, since in this experiment GPT-5 was the
 model that found the specific race conditions and concurrency bugs that make such
 tensions exploitable.
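
The two-pass workflow can be sketched as a small orchestration function. Everything here is an assumption for illustration: reviewers are modelled as plain callables from a boundary name to a list of finding strings, and `flags_tension` is a hypothetical predicate for deciding whether a finding warrants the second pass.

```python
from typing import Callable, Dict, Iterable, List

# A reviewer maps a boundary name to a list of findings (hypothetical interface).
Reviewer = Callable[[str], List[str]]


def two_pass_review(
    boundaries: Iterable[str],
    primary: Reviewer,
    secondary: Reviewer,
    flags_tension: Callable[[str], bool],
) -> Dict[str, List[str]]:
    """Run the primary reviewer everywhere; escalate only flagged boundaries."""
    report: Dict[str, List[str]] = {}
    for boundary in boundaries:
        findings = list(primary(boundary))
        if any(flags_tension(finding) for finding in findings):
            findings.extend(secondary(boundary))
        report[boundary] = findings
    return report
```

In practice `primary` and `secondary` would wrap calls to the respective models; the escalation rule is the part worth tuning.
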
 ## Cost-Effectiveness
 - Opus: 10 findings in 116s at 5,239 output tokens = **524 tokens per finding** (best
  efficiency once finding quality is taken into account)
 - GPT-5: 7 findings in 125s at 11,147 output tokens = 1,592 tokens per finding
 - Sonnet: 8 findings in 35s at 1,676 output tokens = 210 tokens per finding (fewest tokens
  per finding, but lowest quality per finding)
 For architecture review at document boundaries, Opus needs roughly a third of the output
 tokens per finding that GPT-5 does (524 vs 1,592, about 3× the insight density per token),
 while finding more issues. This inverts the typical pattern from previous experiments,
 where GPT-5 was most cost-effective for exhaustive analysis.
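
The per-finding figures can be reproduced from the Results table. This sketch uses only the output-token and finding counts reported above; the variable names are illustrative.

```python
# (output tokens, findings) per model, copied from the Results table.
runs = {
    "GPT-5": (11_147, 7),
    "Claude Opus 4.6": (5_239, 10),
    "Claude Sonnet 4.6": (1_676, 8),
}

tokens_per_finding = {
    model: tokens / findings for model, (tokens, findings) in runs.items()
}

# How many times more output tokens GPT-5 spent per finding than Opus.
gpt5_vs_opus = tokens_per_finding["GPT-5"] / tokens_per_finding["Claude Opus 4.6"]

for model, value in tokens_per_finding.items():
    print(f"{model}: {value:.1f} output tokens per finding")
print(f"GPT-5 / Opus ratio: {gpt5_vs_opus:.2f}")
```

Note that this metric counts findings equally regardless of severity, so it measures verbosity per finding rather than quality.
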