model-research/findings/2026-05-06-31-spec-gap-analysis-continuous-risk-monitoring.md
Rodin ee3063997a finding #31: spec-gap analysis on continuous-risk-monitoring.md
New task type: specification gap/completeness analysis (vs adversarial gaming).
GPT-5 dominates count (25 findings), Opus produces best single insight
(realized P&L non-reversibility violates de-escalation model assumption).
Sonnet adds no unique value for this task type — skip for completeness audits.
2026-05-06 08:27:00 -07:00


Finding #31: Spec-Gap Analysis on continuous-risk-monitoring.md

Date: 2026-05-06

Document: continuous-risk-monitoring.md (176 lines), a real-time risk monitoring spec governing evaluation triggers, metric computation, escalation levels, autonomous liquidation, and failure modes.

Task type: Specification gap/completeness analysis (new; previous experiments used adversarial gaming)

Analytical question: Identify specification gaps, implicit assumptions, contradictions, race conditions, and edge cases that could lead to implementation bugs.

Results

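| Model | Time | Output tokens | Reasoning tokens | Findings | Critical | High | Medium |
|---|---|---|---|---|---|---|---|
| GPT-5 | 149s | 2,528 | 8,320 | 25 | 3 | 14 | 8 |
| Claude Opus 4.6 | 68s | 2,878 | (internal) | 15 | 3 | 6 | 6 |
| Claude Sonnet 4.6 | 24s | 1,106 | (internal) | 10 | 3 | 6 | 1 |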

Category Distribution

| Category | GPT-5 | Opus | Sonnet |
|---|---|---|---|
| Gap | 12 | 6 | 4 |
| Race condition | 3 | 3 | 2 |
| Edge case | 5 | 2 | 2 |
| Contradiction | 3 | 2 | 1 |
| Assumption | 2 | 2 | 1 |

Common Ground (all 3 identified)

  • Liquidation minimum quantity undefined for portfolio-level metrics
  • Race between fill-triggered evaluation and in-flight liquidation orders
  • Multi-metric independence creates conflicting liquidation targets
  • Tick coalescing window semantics under-specified
  • "Last known price" handling undefined for instruments without recent data

GPT-5 Unique Findings

  • HWM restart logic is self-contradictory: Spec claims lower HWM = stricter, but initializing to current value after decline makes drawdown appear ZERO (not stricter)
  • Stale price directionality: Stale prices can UNDERESTIMATE risk for short positions or after upward gaps (contradicts the spec's blanket "overestimates" claim)
  • Currency conversion / contract multipliers: Never addressed for non-equity instruments
  • Daily P&L reset boundary: Trading day definition (time, timezone, multi-venue) unspecified
  • Concentration denominator: "Total portfolio" never defined
  • Stuck liquidation orders: No timeout for orders that neither fill nor reject
  • Restrict flag atomicity: No consistency guarantee between flag set and decision-engine reads
  • Complex order types: "Opening order" undefined for spreads, shorts
  • Invalid price data: Zero/negative prices from bad ticks unhandled
  • Domain event gaps: No events for liquidation completion or kill-switch escalation
  • P&L reconstruction on restart: Unlike HWM, realized P&L restart behavior undefined
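
Two of the findings above (HWM restart and stale price directionality) reduce to simple arithmetic. The sketch below is a minimal illustration, not the spec's implementation: the function names `drawdown` and `short_unrealized_pnl`, the drawdown formula `(hwm - equity) / hwm`, and every number are assumptions chosen for the example.

```python
def drawdown(hwm: float, equity: float) -> float:
    """Fractional drawdown from the high-water mark (assumed definition)."""
    if hwm <= 0:
        raise ValueError("hwm must be positive")
    return max(0.0, (hwm - equity) / hwm)


def short_unrealized_pnl(entry: float, mark: float, qty: float) -> float:
    """Unrealized P&L of a short position of `qty` units marked at `mark`."""
    return (entry - mark) * qty


# HWM restart: the true peak was 100,000 and equity had fallen to 90,000
# when the process restarted (hypothetical numbers).
true_peak = 100_000.0
equity_at_restart = 90_000.0

# With the true peak the monitor sees a 10% drawdown.
drawdown_with_true_peak = drawdown(true_peak, equity_at_restart)

# Re-initialising the HWM to current equity hides the decline entirely.
drawdown_after_restart = drawdown(equity_at_restart, equity_at_restart)

# Stale price on a short: entered at 50, last known price is still 50,
# but the market has gapped up to 55 (hypothetical numbers).
pnl_at_stale_price = short_unrealized_pnl(entry=50.0, mark=50.0, qty=100)
pnl_at_true_price = short_unrealized_pnl(entry=50.0, mark=55.0, qty=100)
```

In this example the restart turns a 10% drawdown into 0%, and the stale mark reports a flat short while the position is actually down 500, so in both cases the monitor understates risk.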

Claude Opus Unique Findings

  • Realized P&L is non-reversible: Realized losses cannot "return below alert threshold". The de-escalation model assumes metrics can recover, but this metric FUNDAMENTALLY CANNOT, which creates a permanent restrict state with no documented recovery. Best single finding across all models.
  • Partial fill is not "completed": A partial fill is simultaneously a fill event AND evidence the prior round is in-flight. Defined completion states (filled/cancelled/rejected) don't cover partial fills. More precise than GPT-5's version.
  • De-escalation hysteresis cross-document conflict: Clearing below ALERT (not RESTRICT) collapses three levels to two on the way down. De-escalation is also deferred to escalation-policy.md, creating a potential conflict.
  • Restart during closed session: No fills/ticks arrive to trigger re-evaluation after a crash during a closed or pre/post-market session. Restrict flag stays cleared indefinitely, leaving an unbounded permissive window.
  • HWM can be HIGHER than true peak: Portfolio appreciation during crash window means restart HWM > true peak, making drawdown detection LESS strict (spec only considers the lower case).
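
The realized P&L finding can be sketched as a small replay. This is a hedged illustration only: the thresholds, the `restricted_after` helper, and the rule "set at RESTRICT, clear only below ALERT" are assumptions modelled on the finding's description, and the realized series assumes the book is flat with opening orders blocked, so no further trade can change the reading.

```python
ALERT = 1_000.0     # hypothetical alert threshold (loss, account currency)
RESTRICT = 2_000.0  # hypothetical restrict threshold


def restricted_after(losses):
    """Replay loss readings and return the final restrict flag.

    Assumed rule: the flag sets when a reading reaches RESTRICT and
    clears only when a later reading falls back below ALERT.
    """
    restricted = False
    for loss in losses:
        if loss >= RESTRICT:
            restricted = True
        elif loss < ALERT:
            restricted = False
    return restricted


# Unrealized loss follows the market, so the reading can recover.
unrealized_losses = [500.0, 2_100.0, 1_500.0, 800.0]

# Realized loss once the book is flat and opens are blocked: the reading
# is frozen, so it never falls back below ALERT.
realized_losses = [500.0, 2_100.0, 2_100.0, 2_100.0]

unrealized_still_restricted = restricted_after(unrealized_losses)
realized_still_restricted = restricted_after(realized_losses)
```

Under these assumptions the unrealized metric clears the flag once the loss recovers, while the realized metric leaves it set for every subsequent evaluation.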

Claude Sonnet Assessment

All 10 findings are subsets of GPT-5/Opus findings with less detail. No unique insights. The mention of corporate actions is the closest to a unique finding but lacks actionable depth.

Key Insights

Opus excels at ASSUMPTION identification (confirmed pattern)

Across experiments, Opus consistently finds where the spec's OWN ASSUMPTIONS are false:

  • #30: "spec ambiguity as exploit surface," "guaranteed conditions an adversary can rely on"
  • #31: "realized P&L cannot recover" — metric violates de-escalation model core assumption

Opus reasons about what the spec BELIEVES to be true and checks whether those beliefs hold. GPT-5 reasons about what the spec FAILS TO SAY. Different but complementary.

Sonnet adds no value for spec-gap analysis

In adversarial gaming (#30), Sonnet contributed unique meta-analytical synthesis. For completeness analysis, Sonnet found nothing the others didn't find in more detail. Creative/generative tasks → Sonnet adds value. Systematic/exhaustive tasks → skip Sonnet.

Task type affects model ranking

| Task Type | Best for Count | Best for Insight | Sonnet Valuable? |
|---|---|---|---|
| Adversarial gaming | GPT-5 | GPT-5 (compound attacks) | Yes (meta-synthesis) |
| Spec-gap analysis | GPT-5 | Opus (false assumptions) | No |

Practical Recommendation

For specification review: run GPT-5 + Opus in parallel. Skip Sonnet.

  • GPT-5: exhaustive coverage of undefined terms and boundary conditions
  • Opus: finding where the spec's own assumptions are logically invalid
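
One way to run two reviewers in parallel and separate common ground from unique findings is sketched below. Everything here is hypothetical scaffolding: `review_in_parallel`, `split_findings`, and the stub reviewers stand in for real model calls, and findings are matched by exact string, which a real comparison would need to replace with manual or semantic matching.

```python
from concurrent.futures import ThreadPoolExecutor


def review_in_parallel(spec_text, reviewers):
    """Run each reviewer callable on the spec concurrently.

    `reviewers` maps a name to a callable taking the spec text and
    returning a list of finding strings (hypothetical interface).
    """
    if not reviewers:
        return {}
    with ThreadPoolExecutor(max_workers=len(reviewers)) as pool:
        futures = {name: pool.submit(fn, spec_text) for name, fn in reviewers.items()}
        return {name: set(future.result()) for name, future in futures.items()}


def split_findings(results):
    """Return (findings shared by all reviewers, findings unique to each)."""
    all_sets = list(results.values())
    common = set.intersection(*all_sets) if all_sets else set()
    unique = {}
    for name, found in results.items():
        others = set().union(*(s for n, s in results.items() if n != name))
        unique[name] = found - others
    return common, unique


# Stub reviewers standing in for the two recommended models.
def gap_reviewer(spec_text):
    return ["tick coalescing window under-specified", "trading day boundary undefined"]


def assumption_reviewer(spec_text):
    return ["tick coalescing window under-specified", "realized P&L cannot recover"]


results = review_in_parallel(
    "spec text goes here",
    {"gap": gap_reviewer, "assumption": assumption_reviewer},
)
common, unique = split_findings(results)
```

Threads suit this sketch because model calls are I/O-bound; the split mirrors the Common Ground and Unique Findings sections of this report.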

Efficiency

| Model | Tokens/finding | Seconds/finding |
|---|---|---|
| GPT-5 | 511 | 6.0s |
| Opus | 192 | 4.5s |
| Sonnet | 111 | 2.4s |
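
The per-finding ratios can be recomputed from the Results table. The sketch below assumes tokens/finding means output tokens divided by findings, and seconds/finding means wall time divided by findings; the `RESULTS` dictionary and `per_finding` helper are illustrative names, with values copied from the Results table. Under that reading the Opus and Sonnet rows reproduce (192 and 111), while GPT-5 comes out at about 101 on output tokens alone or about 434 with reasoning tokens included, so the 511 figure presumably rests on a token count not shown in this report.

```python
# Values copied from the Results table; reasoning tokens are only
# reported for GPT-5 (the Claude models list them as internal).
RESULTS = {
    "GPT-5": {"seconds": 149, "output_tokens": 2528, "reasoning_tokens": 8320, "findings": 25},
    "Opus": {"seconds": 68, "output_tokens": 2878, "reasoning_tokens": None, "findings": 15},
    "Sonnet": {"seconds": 24, "output_tokens": 1106, "reasoning_tokens": None, "findings": 10},
}


def per_finding(row, include_reasoning=False):
    """Return (tokens per finding, seconds per finding) for one model row."""
    tokens = row["output_tokens"]
    if include_reasoning and row["reasoning_tokens"] is not None:
        tokens += row["reasoning_tokens"]
    return tokens / row["findings"], row["seconds"] / row["findings"]


efficiency = {
    name: tuple(round(value, 1) for value in per_finding(row))
    for name, row in RESULTS.items()
}
gpt5_with_reasoning = round(per_finding(RESULTS["GPT-5"], include_reasoning=True)[0])
```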