Rodin ee3063997a finding #31 : spec-gap analysis on continuous-risk-monitoring.md

New task type: specification gap/completeness analysis (vs adversarial gaming).
GPT-5 dominates count (25 findings), Opus produces best single insight
(realized P&L non-reversibility violates de-escalation model assumption).
Sonnet adds no unique value for this task type — skip for completeness audits.

2026-05-06 08:27:00 -07:00

5.3 KiB


Finding #31: Spec-Gap Analysis on continuous-risk-monitoring.md

Date: 2026-05-06
Document: continuous-risk-monitoring.md (176 lines), a real-time risk monitoring spec governing evaluation triggers, metric computation, escalation levels, autonomous liquidation, and failure modes.
Task type: Specification gap/completeness analysis (new; previous experiments used adversarial gaming)
Analytical question: Identify specification gaps, implicit assumptions, contradictions, race conditions, and edge cases that could lead to implementation bugs.

Results

Model	Time	Output tokens	Reasoning tokens	Findings	Critical	High	Medium
GPT-5	149s	2,528	8,320	25	3	14	8
Claude Opus 4.6	68s	2,878	(internal)	15	3	6	6
Claude Sonnet 4.6	24s	1,106	(internal)	10	3	6	1

Category Distribution

Category	GPT-5	Opus	Sonnet
Gap	12	6	4
Race condition	3	3	2
Edge case	5	2	2
Contradiction	3	2	1
Assumption	2	2	1

Common Ground (all 3 identified)

Liquidation minimum quantity undefined for portfolio-level metrics
Race between fill-triggered evaluation and in-flight liquidation orders
Multi-metric independence creates conflicting liquidation targets
Tick coalescing window semantics under-specified
"Last known price" handling undefined for instruments without recent data

GPT-5 Unique Findings

HWM restart logic is self-contradictory: Spec claims lower HWM = stricter, but initializing to current value after decline makes drawdown appear ZERO (not stricter)
Stale price directionality: Stale prices can UNDERESTIMATE for shorts or upward gaps (contradicts spec's blanket "overestimates" claim)
Currency conversion / contract multipliers: Never addressed for non-equity instruments
Daily P&L reset boundary: Trading day definition (time, timezone, multi-venue) unspecified
Concentration denominator: "Total portfolio" never defined
Stuck liquidation orders: No timeout for orders that neither fill nor reject
Restrict flag atomicity: No consistency guarantee between flag set and decision-engine reads
Complex order types: "Opening order" undefined for spreads, shorts
Invalid price data: Zero/negative prices from bad ticks unhandled
Domain event gaps: No events for liquidation completion or kill-switch escalation
P&L reconstruction on restart: Unlike HWM, realized P&L restart behavior undefined
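
The first two findings in this list can be checked with a few lines of arithmetic. The sketch below assumes the conventional drawdown definition, (HWM - value) / HWM, and a mark-to-market P&L of quantity × (price - entry). Neither formula is quoted from the spec, and all numbers are illustrative.

```python
def drawdown(hwm: float, value: float) -> float:
    """Fractional drawdown from the high-water mark (assumed definition)."""
    return (hwm - value) / hwm


def unrealized_pnl(quantity: float, entry: float, price: float) -> float:
    """Mark-to-market P&L; quantity is negative for a short position."""
    return quantity * (price - entry)


# HWM restart: the portfolio peaked at 100, fell to 80, then the process restarted.
true_dd = drawdown(hwm=100.0, value=80.0)     # measured against the real peak
restart_dd = drawdown(hwm=80.0, value=80.0)   # HWM re-initialized to the current value

# Stale price on a short: last known price is 50, the market has gapped up to 55.
stale_pnl = unrealized_pnl(quantity=-10, entry=50.0, price=50.0)   # stale mark shows no loss
actual_pnl = unrealized_pnl(quantity=-10, entry=50.0, price=55.0)  # real mark shows a loss
```

With these inputs the true drawdown is 20% while the restarted monitor reports none, and the stale mark hides a loss of 50 on the short. Both cases are less strict than reality, which is the opposite of what the spec claims.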

Claude Opus Unique Findings

Realized P&L is non-reversible ⭐: Realized losses cannot "return below alert threshold" — the de-escalation model assumes metrics can recover, but this metric FUNDAMENTALLY CANNOT. Creates permanent restrict state with no documented recovery. Best single finding across all models.
Partial fill is not "completed": A partial fill is simultaneously a fill event AND evidence the prior round is in-flight. Defined completion states (filled/cancelled/rejected) don't cover partial fills. More precise than GPT-5's version.
De-escalation hysteresis cross-document conflict: Clearing below ALERT (not RESTRICT) collapses three levels to two on the way down. De-escalation is also deferred to escalation-policy.md, creating a potential conflict between the two documents.
Restart during closed session: After a crash while the market is closed or in a pre-/post-market session, no fills or ticks arrive to trigger re-evaluation. The restrict flag stays cleared indefinitely, leaving an unbounded permissive window.
HWM can be HIGHER than true peak: Portfolio appreciation during crash window means restart HWM > true peak, making drawdown detection LESS strict (spec only considers the lower case).
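
The non-reversibility finding can be illustrated with a small replay. This is a minimal sketch, not the spec's algorithm: the thresholds, the `levels` helper, and the readings are hypothetical, and it assumes, as the finding does, that a realized loss is not offset by realized gains while the restrict state is active.

```python
ALERT, RESTRICT = 0.02, 0.05  # hypothetical thresholds, as fractions of equity


def levels(readings, alert=ALERT, restrict=RESTRICT):
    """Replay metric readings through a latch: enter RESTRICT at or above
    `restrict`, and leave it only when the metric drops below `alert`."""
    restricted = False
    out = []
    for x in readings:
        if x >= restrict:
            restricted = True
        elif x < alert:
            restricted = False
        if restricted:
            out.append("RESTRICT")
        elif x >= alert:
            out.append("ALERT")
        else:
            out.append("NORMAL")
    return out


# Unrealized drawdown can shrink again as prices recover.
unrealized = [0.01, 0.06, 0.03, 0.01]
# Realized loss for the day only accumulates (assumption stated above).
realized = [0.01, 0.06, 0.06, 0.06]

print(levels(unrealized))  # ['NORMAL', 'RESTRICT', 'RESTRICT', 'NORMAL']
print(levels(realized))    # ['NORMAL', 'RESTRICT', 'RESTRICT', 'RESTRICT']
```

The same latch that releases the recoverable metric never releases the cumulative one, so a recovery path for realized P&L would have to come from something other than the metric itself, such as a daily reset or manual clearance.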

Claude Sonnet Assessment

All 10 findings are subsets of GPT-5/Opus findings with less detail. No unique insights. Its mention of corporate actions is the closest to a unique finding but lacks actionable depth.

Key Insights

Opus excels at ASSUMPTION identification (confirmed pattern)

Across experiments, Opus consistently finds where the spec's OWN ASSUMPTIONS are false:

#30: "spec ambiguity as exploit surface," "guaranteed conditions an adversary can rely on"
#31: "realized P&L cannot recover" — metric violates de-escalation model core assumption

Opus reasons about what the spec BELIEVES to be true and checks whether those beliefs hold. GPT-5 reasons about what the spec FAILS TO SAY. Different but complementary.

Sonnet adds no value for spec-gap analysis

In adversarial gaming (#30), Sonnet contributed unique meta-analytical synthesis. For completeness analysis, it found nothing that GPT-5 or Opus did not cover in more detail. On the evidence so far, Sonnet adds value on creative/generative tasks and can be skipped for systematic/exhaustive ones.

Task type affects model ranking

Task Type	Best for Count	Best for Insight	Sonnet Valuable?
Adversarial gaming	GPT-5	GPT-5 (compound attacks)	Yes (meta-synthesis)
Spec-gap analysis	GPT-5	Opus (false assumptions)	No

Practical Recommendation

For specification review: run GPT-5 + Opus in parallel. Skip Sonnet.

GPT-5: exhaustive coverage of undefined terms and boundary conditions
Opus: finding where the spec's own assumptions are logically invalid
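
One way to wire up the parallel run is sketched below. The two reviewer functions are hypothetical stand-ins that return canned findings; real model calls would replace them. Matching findings by exact string is an assumption made for brevity, since real findings would need fuzzier matching.

```python
from concurrent.futures import ThreadPoolExecutor


def review_gpt5(spec_text: str) -> list[str]:
    """Hypothetical stand-in for a GPT-5 call; returns canned findings."""
    return ["Tick coalescing window under-specified", "Concentration denominator undefined"]


def review_opus(spec_text: str) -> list[str]:
    """Hypothetical stand-in for an Opus call; returns canned findings."""
    return ["Tick coalescing window under-specified", "Realized P&L is non-reversible"]


def review_in_parallel(spec_text: str) -> dict[str, list[str]]:
    """Run both reviewers concurrently and group findings by who reported them."""
    reviewers = {"GPT-5": review_gpt5, "Opus": review_opus}
    with ThreadPoolExecutor(max_workers=len(reviewers)) as pool:
        futures = {name: pool.submit(fn, spec_text) for name, fn in reviewers.items()}
        results = {name: set(f.result()) for name, f in futures.items()}
    common = results["GPT-5"] & results["Opus"]
    return {
        "common": sorted(common),
        "GPT-5 only": sorted(results["GPT-5"] - common),
        "Opus only": sorted(results["Opus"] - common),
    }
```

Grouping the output into common and model-only findings mirrors the layout of this write-up and makes each model's unique contribution easy to see.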

Efficiency

Model	Tokens/finding	Seconds/finding
GPT-5	434 (incl. reasoning; 101 output only)	6.0s
Opus	192	4.5s
Sonnet	111	2.4s
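
The per-finding figures can be recomputed from the Results table. The sketch below uses only numbers from that table; the `efficiency` helper and its field names are illustrative. For GPT-5 the token figure depends on whether reasoning tokens are counted, and the Claude models report reasoning as "(internal)", so only output tokens are available for them.

```python
results = {
    # model: (seconds, output_tokens, reasoning_tokens or None, findings)
    "GPT-5": (149, 2528, 8320, 25),
    "Opus": (68, 2878, None, 15),
    "Sonnet": (24, 1106, None, 10),
}


def efficiency(seconds, output_tokens, reasoning_tokens, findings):
    """Per-finding cost; reasoning tokens count as zero when not reported."""
    total = output_tokens + (reasoning_tokens or 0)
    return {
        "output_tokens_per_finding": round(output_tokens / findings),
        "total_tokens_per_finding": round(total / findings),
        "seconds_per_finding": round(seconds / findings, 1),
    }


table = {model: efficiency(*row) for model, row in results.items()}
for model, row in table.items():
    print(model, row)
```

Reporting both token figures keeps the comparison explicit: on output tokens alone GPT-5 is the cheapest per finding, while counting its reasoning tokens makes it the most expensive.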
