From c1ca8cfe4698ac159f73edb04f3903ac8d701432 Mon Sep 17 00:00:00 2001
From: claw
Date: Fri, 8 May 2026 14:29:29 -0700
Subject: [PATCH] finding #52: degraded-mode propagation analysis (new lens)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Cross-document boundary analysis: signal-lifecycle + buying-power +
risk-controls. Opus decisively outperforms GPT-5 (10 vs 7 findings) — first
inversion observed. New lens finds a distinct class of bug: partially-degraded
paths that mimic normal behavior.
---
 ...8-52-degraded-mode-propagation-analysis.md | 182 ++++++++++++++++++
 1 file changed, 182 insertions(+)
 create mode 100644 findings/2026-05-08-52-degraded-mode-propagation-analysis.md

diff --git a/findings/2026-05-08-52-degraded-mode-propagation-analysis.md b/findings/2026-05-08-52-degraded-mode-propagation-analysis.md
new file mode 100644
index 0000000..e3f1490
--- /dev/null
+++ b/findings/2026-05-08-52-degraded-mode-propagation-analysis.md
@@ -0,0 +1,182 @@
+# Degraded-Mode Propagation Analysis: A New Analytical Lens
+
+**Date:** 2026-05-08
+**Finding #:** 52
+**Task:** Degraded-mode propagation analysis across three related gargoyle design documents:
+`signal-lifecycle.md` (111 lines), `buying-power.md` (103 lines), `risk-controls.md` (315 lines)
+— spanning the decision engine → risk → order management boundary.
+
+**Analytical lens:** NEW. "When one component enters a degraded state, what happens to the
+components that depend on it? Trace degraded-mode behavior across document boundaries."
+Distinct from assumption-finding (what's implicit?), gap-finding (what's missing?), or
+race conditions (what interleavings fail?). This asks: "Do the documents' degraded
+behaviors compose correctly?"
+
+## Method
+
+Same three documents (full text, 529 lines combined) + same structured prompt to all
+3 models via HAI proxy. Prompt specified 5 categories of degraded-mode propagation
+failures: propagation gaps, semantic mismatches, recovery ordering dependencies, silent
+degradation, and degradation cascades. Required specific output format per finding with
+quotes from both documents at each boundary. No tools, no project context beyond the
+documents themselves.
+
+## Results
+
+| Model | Time | Output tokens | Reasoning tokens | Findings |
+|---|---|---|---|---|
+| GPT-5 | 125s | 11,147 | 8,960 | 7 |
+| Claude Opus 4.6 | 116s | 5,239 | (internal) | 10 |
+| Claude Sonnet 4.6 | 35s | 1,676 | (internal) | 8 |
+
+## What They Found — Common Ground (all 3 identified)
+
+1. **Buying power staleness vs fail-closed semantic mismatch** — Buying Power says "use
+   last cached value (pessimistic)" while Risk Controls says "stale cache → reject."
+   All three models identified this as the primary boundary gap, though with different
+   depth of analysis (see below).
+
+2. **Market data staleness affecting buying power price estimates** — Risk Controls
+   scopes staleness to fat-finger only ("other controls unaffected"), but Buying Power
+   uses market prices for pro-forma deduction of market orders.
+
+3. **Order rate window reset creating silent permissiveness** — After crash, the rate
+   limiter is empty while signal burst is most likely. GPT-5 and Opus explored the
+   cascade implications more deeply.
+
+4. **Aggregator timeout splitting decisions** — Under load, aggregator fires early,
+   creating partial decisions that individually pass controls which would have caught
+   the combined quantity.
+
+## GPT-5 Unique Findings (not in either Claude model)
+
+- **Concurrent pro-forma deduction race** (CRITICAL): Under overload, "Buy order submitted"
+  event delays mean pending_buy_obligations is stale, allowing multiple concurrent decisions
+  to pass buying power simultaneously. Worse than "unavailable" because it mimics normal
+  acceptance.
+- **Broker restriction recovery has no re-derivation trigger**: If a restriction lifts
+  without a fill or market open, buying power stays at restricted level indefinitely. No
+  alert, no timeout.
+- **Market-open refresh failure + prior-day cache can be optimistic**: Pre-open overnight
+  debits (fees, holds) make cached value optimistic, contradicting Risk Controls' assumption
+  that staleness is always conservative.
+
+## Claude Opus Unique Findings (not in either of the other models)
+
+- **NoShortSales contradicts Signal Risk's "no portfolio context" description** (HIGH):
+  Signal-level controls are described as having "no portfolio context" but NoShortSales
+  requires position knowledge. Architectural contradiction about what data Signal Risk can
+  access; degraded position data behavior at this stage is unspecified.
+- **Strategy crash mid-cycle leaves stale signals in aggregator buffer** (HIGH):
+  Failure mode table covers crash "mid-signal" (atomicity) but NOT crash between signals
+  in a multi-signal cycle. Pre-crash signals mix with post-restart fresh signals, forming
+  decisions from inconsistent market state.
+- **Reconciliation gate location unspecified** (HIGH): Buying Power says "blocks order flow"
+  but Risk Controls doesn't mention it. If gate is at OrderManager, Portfolio Risk evaluates
+  against uninitialized buying power. If at signal level, aggregator buffers fill during
+  reconciliation.
+- **Progressive buying power divergence during connectivity loss** (MEDIUM): Sell fills
+  produce "no immediate change" + failed broker queries = progressive conservative drift.
+  System enters sell-only mode where continuous monitoring liquidates but re-entry is blocked.
+- **Pro-forma deduction reversal for internal rejection unspecified** (MEDIUM): Buying power
+  deducts at control #4 but fat finger rejects at control #9. The deduction-reversal path
+  for pre-submission rejection within the risk pipeline is unspecified.
+- **Lost pending order state indistinguishable from empty** (CRITICAL): Duplicate Order and
+  Self-Trade controls have no specified fail-closed behavior for "pending order data
+  unavailable." Empty state = permissive, potentially allowing duplicate submissions.
+
+## Claude Sonnet Findings
+
+Sonnet identified 8 findings, mostly overlapping with GPT-5 and Opus at lower depth.
+Its unique contribution:
+
+- **PDT equity calculation recovery ordering** (CRITICAL): If PDT equity recovers with
+  different values than what buying power assumed during degradation, no reconciliation
+  process exists.
+- **Audit log write failure behavior** (HIGH): Whether audit write failure stops or continues
+  evaluation is unspecified, creating either compliance gaps or unnecessary service disruption.
+
+However, several Sonnet findings were less precise — Finding #1 restates the boundary gap
+without exploring the optimistic case, and Finding #7 (PDT) was somewhat speculative without
+strong textual evidence.
+
+## Quality Assessment
+
+- **GPT-5** was technically precise and found the most practically dangerous issues. The
+  concurrent pro-forma race (Finding 3) is the most operationally critical finding across
+  all models — it's a real concurrency bug that could cause financial loss during high-volume
+  trading. All findings included precise mechanism descriptions and specific interleaving
+  sequences. However, it was LESS prolific than Opus (7 vs 10) despite using more tokens.
+
+- **Claude Opus** was the most prolific (10 findings) and found the deepest architectural
+  issues. The NoShortSales/Signal-Risk contradiction (Finding 2) is a genuine spec-level
+  inconsistency that no other model identified. The progressive divergence finding (Finding 8)
+  shows multi-step causal reasoning about how one degradation creates a cascading economic
+  effect. The pending-order-state finding (Finding 10) identifies a category of silent
+  degradation that applies to an ENTIRE CLASS of controls — not just one component.
+  Opus's characteristic strength — reasoning about design TENSIONS — manifests here as
+  finding places where one document's degradation model creates a trap for another document's
+  assumptions.
+
+- **Claude Sonnet** was fast (35s, 1/3 the time) and adequate but shallow. Findings were
+  correct but didn't explore second-order effects or multi-step cascades. The PDT finding
+  was speculative. Sonnet identified the right THEMES but didn't trace them to their
+  architectural consequences.
+
+## Key Insight — "Degraded-mode propagation" as an analytical lens
+
+This is genuinely distinct from previous lenses in two ways:
+
+1. **It's inherently cross-document.** Unlike assumption-finding (which can work on a single
+   doc), degraded-mode propagation REQUIRES multiple documents because it specifically asks
+   about boundaries. This makes it ideal for detecting integration issues.
+
+2. **It finds a different CLASS of bug.** Previous lenses found issues with the happy path
+   (gaps, assumptions) or the error path (race conditions, invariant violations). This lens
+   finds issues with the PARTIALLY-DEGRADED path — where components are working but with
+   degraded inputs they don't fully detect.
+
+The most architecturally significant findings (concurrent pro-forma race, aggregator split
+decisions, progressive divergence cascade) are all about systems that APPEAR to be working
+but are silently making incorrect decisions because one upstream component is degraded in a
+way that mimics normal behavior.
+
+## Model Comparison for This Task Type
+
+| Dimension | GPT-5 | Opus | Sonnet |
+|---|---|---|---|
+| Finding count | 7 | 10 | 8 |
+| CRITICAL findings | 1 | 4 | 2 |
+| Unique insights | 3 | 6 | 2 |
+| Tokens per finding | 1,592 | 524 | 210 |
+| Cascade reasoning | Deep | Deep | Surface |
+| Cross-doc awareness | High | Highest | Moderate |
+
+**Opus is the strongest model for this task type.** This is the first experiment where
+Opus decisively outperforms GPT-5 in both quantity AND quality. The reason: degraded-mode
+propagation analysis is fundamentally about design TENSIONS between documents — which is
+exactly Opus's consistent strength across all previous experiments. GPT-5's strength
+(exhaustive technical detail) matters less here because the findings are at the boundary
+level, not the implementation level.
+
+## Practical Implication
+
+For cross-document integration review:
+- **Opus** as primary reviewer (highest insight density, finds architectural contradictions)
+- **GPT-5** as secondary reviewer (finds the operational/concurrency issues Opus misses)
+- **Sonnet** for quick first-pass only (identifies themes but not consequences)
+
+The ideal workflow: Run Opus on all cross-document boundaries first. Then run GPT-5 on
+any boundary where Opus flagged design tensions — GPT-5 will find the specific race
+conditions and concurrency bugs that make those tensions exploitable.
+
+## Cost-Effectiveness
+
+- Opus: 10 findings in 116s at 5,239 output tokens = **524 tokens per finding** (best quality-adjusted efficiency)
+- GPT-5: 7 findings in 125s at 11,147 output tokens = 1,592 tokens per finding
+- Sonnet: 8 findings in 35s at 1,676 output tokens = 210 tokens per finding (fewest tokens per
+  finding, but lowest quality per finding)
+
+For architecture review at document boundaries, Opus delivers ~3× the insight density
+per token compared to GPT-5, while finding more issues. This inverts the typical pattern
+from previous experiments where GPT-5 was most cost-effective for exhaustive analysis.
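
The tokens-per-finding figures and the "~3×" ratio above can be re-derived from the Results table. The following is a minimal sketch, not part of the experiment tooling: the `ModelRun` class and `density_ratio` helper are hypothetical names introduced here for illustration, and the calculation assumes only the "Output tokens" column is counted (the table lists reasoning tokens separately, and only for GPT-5).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelRun:
    """One row of the Results table (hypothetical container, output tokens only)."""
    name: str
    seconds: int
    output_tokens: int
    findings: int

    @property
    def tokens_per_finding(self) -> float:
        # Exact ratio, used for comparisons between models.
        return self.output_tokens / self.findings

    @property
    def tokens_per_finding_rounded(self) -> int:
        # Round half up using integer arithmetic only.
        return (2 * self.output_tokens + self.findings) // (2 * self.findings)


def density_ratio(baseline: ModelRun, candidate: ModelRun) -> float:
    """How many times fewer tokens per finding `candidate` needs than `baseline`."""
    return baseline.tokens_per_finding / candidate.tokens_per_finding


# Values copied from the Results table in this finding.
GPT5 = ModelRun("GPT-5", 125, 11_147, 7)
OPUS = ModelRun("Claude Opus 4.6", 116, 5_239, 10)
SONNET = ModelRun("Claude Sonnet 4.6", 35, 1_676, 8)

if __name__ == "__main__":
    for run in (GPT5, OPUS, SONNET):
        print(f"{run.name}: {run.tokens_per_finding_rounded} tokens per finding")
    print(f"Opus vs GPT-5 density ratio: {density_ratio(GPT5, OPUS):.2f}")
```

Note that this metric counts findings only and says nothing about their severity or depth, which is why Sonnet's lower tokens-per-finding figure does not translate into a better review.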