diff --git a/findings/2026-05-06-31-spec-gap-analysis-continuous-risk-monitoring.md b/findings/2026-05-06-31-spec-gap-analysis-continuous-risk-monitoring.md
new file mode 100644
index 0000000..e545770
--- /dev/null
+++ b/findings/2026-05-06-31-spec-gap-analysis-continuous-risk-monitoring.md
@@ -0,0 +1,114 @@
+# Finding #31: Spec-Gap Analysis on continuous-risk-monitoring.md
+
+**Date:** 2026-05-06
+**Document:** `continuous-risk-monitoring.md` (176 lines): real-time risk monitoring
+spec governing evaluation triggers, metric computation, escalation levels, autonomous
+liquidation, and failure modes.
+**Task type:** Specification gap/completeness analysis (NEW; previous experiments used
+adversarial gaming)
+**Analytical question:** Identify specification gaps, implicit assumptions,
+contradictions, race conditions, and edge cases that could lead to implementation bugs.
+
+## Results
+
+| Model | Time | Output tokens | Reasoning tokens | Findings | Critical | High | Medium |
+|---|---|---|---|---|---|---|---|
+| GPT-5 | 149s | 2,528 | 8,320 | 25 | 3 | 14 | 8 |
+| Claude Opus 4.6 | 68s | 2,878 | (internal) | 15 | 3 | 6 | 6 |
+| Claude Sonnet 4.6 | 24s | 1,106 | (internal) | 10 | 3 | 6 | 1 |
+
+## Category Distribution
+
+| Category | GPT-5 | Opus | Sonnet |
+|---|---|---|---|
+| Gap | 12 | 6 | 4 |
+| Race condition | 3 | 3 | 2 |
+| Edge case | 5 | 2 | 2 |
+| Contradiction | 3 | 2 | 1 |
+| Assumption | 2 | 2 | 1 |
+
+## Common Ground (all 3 identified)
+
+- Liquidation minimum quantity undefined for portfolio-level metrics
+- Race between fill-triggered evaluation and in-flight liquidation orders
+- Multi-metric independence creates conflicting liquidation targets
+- Tick coalescing window semantics under-specified
+- "Last known price" handling undefined for instruments without recent data
+
+## GPT-5 Unique Findings
+
+- **HWM restart logic is self-contradictory**: Spec claims lower HWM = stricter, but
+  initializing to current value after decline makes drawdown appear ZERO (not stricter)
+- **Stale price directionality**: Stale prices can UNDERESTIMATE for shorts or upward gaps
+  (contradicts spec's blanket "overestimates" claim)
+- **Currency conversion / contract multipliers**: Never addressed for non-equity instruments
+- **Daily P&L reset boundary**: Trading day definition (time, timezone, multi-venue) unspecified
+- **Concentration denominator**: "Total portfolio" never defined
+- **Stuck liquidation orders**: No timeout for orders that neither fill nor reject
+- **Restrict flag atomicity**: No consistency guarantee between flag set and decision-engine reads
+- **Complex order types**: "Opening order" undefined for spreads, shorts
+- **Invalid price data**: Zero/negative prices from bad ticks unhandled
+- **Domain event gaps**: No events for liquidation completion or kill-switch escalation
+- **P&L reconstruction on restart**: Unlike HWM, realized P&L restart behavior undefined
+
+## Claude Opus Unique Findings
+
+- **Realized P&L is non-reversible** ⭐: Realized losses cannot "return below alert
+  threshold": the de-escalation model assumes metrics can recover, but this metric
+  FUNDAMENTALLY CANNOT. Creates permanent restrict state with no documented recovery.
+  **Best single finding across all models.**
+- **Partial fill is not "completed"**: A partial fill is simultaneously a fill event AND
+  evidence the prior round is in-flight. Defined completion states (filled/cancelled/rejected)
+  don't cover partial fills. More precise than GPT-5's version.
+- **De-escalation hysteresis cross-document conflict**: Clearing below ALERT (not RESTRICT)
+  collapses three levels to two on the way down. Also deferred to escalation-policy.md,
+  creating potential conflict.
+- **Restart during closed session**: No fills/ticks arrive to trigger re-evaluation after
+  crash in closed/pre-post. Restrict flag stays cleared indefinitely: unbounded permissive
+  window.
+- **HWM can be HIGHER than true peak**: Portfolio appreciation during crash window means
+  restart HWM > true peak, making drawdown detection LESS strict (spec only considers the
+  lower case).
+
+## Claude Sonnet Assessment
+
+All 10 findings are subsets of GPT-5/Opus findings with less detail. No unique insights.
+Corporate actions mention is the closest to unique but lacks actionable depth.
+
+## Key Insights
+
+### Opus excels at ASSUMPTION identification (confirmed pattern)
+
+Across experiments, Opus consistently finds where the spec's OWN ASSUMPTIONS are false:
+- #30: "spec ambiguity as exploit surface," "guaranteed conditions an adversary can rely on"
+- #31: "realized P&L cannot recover": metric violates de-escalation model core assumption
+
+Opus reasons about what the spec BELIEVES to be true and checks whether those beliefs hold.
+GPT-5 reasons about what the spec FAILS TO SAY. Different but complementary.
+
+### Sonnet adds no value for spec-gap analysis
+
+In adversarial gaming (#30), Sonnet contributed unique meta-analytical synthesis. For
+completeness analysis, Sonnet found nothing the others didn't find in more detail.
+Creative/generative tasks → Sonnet adds value. Systematic/exhaustive tasks → skip Sonnet.
+
+### Task type affects model ranking
+
+| Task Type | Best for Count | Best for Insight | Sonnet Valuable? |
+|---|---|---|---|
+| Adversarial gaming | GPT-5 | GPT-5 (compound attacks) | Yes (meta-synthesis) |
+| Spec-gap analysis | GPT-5 | Opus (false assumptions) | No |
+
+## Practical Recommendation
+
+For specification review: run GPT-5 + Opus in parallel. Skip Sonnet.
+- GPT-5: exhaustive coverage of undefined terms and boundary conditions
+- Opus: finding where the spec's own assumptions are logically invalid
+
+## Efficiency
+
+| Model | Tokens/finding | Seconds/finding |
+|---|---|---|
+| GPT-5 | 511 | 6.0s |
+| Opus | 192 | 4.5s |
+| Sonnet | 111 | 2.4s |
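
The per-finding figures in the Efficiency table can be rechecked against the Results table. The following is a minimal sketch, assuming "Tokens/finding" means output tokens divided by findings and "Seconds/finding" means wall time divided by findings. The `Run` record and `per_finding` helper are illustrative names, not part of the experiment tooling; the numbers are copied from the Results table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Run:
    model: str
    seconds: float
    output_tokens: int
    reasoning_tokens: Optional[int]  # None where the table reports "(internal)"
    findings: int


# Values copied from the Results table.
RUNS = [
    Run("GPT-5", 149, 2528, 8320, 25),
    Run("Claude Opus 4.6", 68, 2878, None, 15),
    Run("Claude Sonnet 4.6", 24, 1106, None, 10),
]


def per_finding(run: Run) -> dict:
    """Compute per-finding cost under the assumed definitions."""
    all_tokens = None
    if run.reasoning_tokens is not None:
        # Only computable where reasoning tokens are reported.
        all_tokens = round((run.output_tokens + run.reasoning_tokens) / run.findings)
    return {
        "output_tokens_per_finding": round(run.output_tokens / run.findings),
        "all_tokens_per_finding": all_tokens,
        "seconds_per_finding": round(run.seconds / run.findings, 1),
    }


if __name__ == "__main__":
    for run in RUNS:
        print(run.model, per_finding(run))
```

Under these assumptions the Opus (192) and Sonnet (111) token figures and all three seconds/finding values reproduce. GPT-5 comes out at 101 on output tokens alone and 434 with reasoning tokens included, so the 511 in the table appears to rest on a basis not stated in this finding.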