New task type: specification gap/completeness analysis (previous experiments used adversarial gaming). GPT-5 produces the most findings (25, versus 15 for Opus and 10 for Sonnet); Opus produces the best single insight (realized P&L is non-reversible, which violates the de-escalation model's assumption that metrics can recover). Sonnet adds no unique findings for this task type, so skip it for completeness audits.
Finding #31: Spec-Gap Analysis on continuous-risk-monitoring.md
Date: 2026-05-06
Document: continuous-risk-monitoring.md (176 lines) — real-time risk monitoring
spec governing evaluation triggers, metric computation, escalation levels, autonomous
liquidation, and failure modes.
Task type: Specification gap/completeness analysis (NEW — previous experiments used
adversarial gaming)
Analytical question: Identify specification gaps, implicit assumptions,
contradictions, race conditions, and edge cases that could lead to implementation bugs.
Results
| Model | Time | Output tokens | Reasoning tokens | Findings | Critical | High | Medium |
|---|---|---|---|---|---|---|---|
| GPT-5 | 149s | 2,528 | 8,320 | 25 | 3 | 14 | 8 |
| Claude Opus 4.6 | 68s | 2,878 | (internal) | 15 | 3 | 6 | 6 |
| Claude Sonnet 4.6 | 24s | 1,106 | (internal) | 10 | 3 | 6 | 1 |
Category Distribution
| Category | GPT-5 | Opus | Sonnet |
|---|---|---|---|
| Gap | 12 | 6 | 4 |
| Race condition | 3 | 3 | 2 |
| Edge case | 5 | 2 | 2 |
| Contradiction | 3 | 2 | 1 |
| Assumption | 2 | 2 | 1 |
Common Ground (all 3 identified)
- Liquidation minimum quantity undefined for portfolio-level metrics
- Race between fill-triggered evaluation and in-flight liquidation orders
- Multi-metric independence creates conflicting liquidation targets
- Tick coalescing window semantics under-specified
- "Last known price" handling undefined for instruments without recent data
GPT-5 Unique Findings
- HWM restart logic is self-contradictory: Spec claims lower HWM = stricter, but initializing to current value after decline makes drawdown appear ZERO (not stricter)
- Stale price directionality: Stale prices can UNDERESTIMATE for shorts or upward gaps (contradicts spec's blanket "overestimates" claim)
- Currency conversion / contract multipliers: Never addressed for non-equity instruments
- Daily P&L reset boundary: Trading day definition (time, timezone, multi-venue) unspecified
- Concentration denominator: "Total portfolio" never defined
- Stuck liquidation orders: No timeout for orders that neither fill nor reject
- Restrict flag atomicity: No consistency guarantee between flag set and decision-engine reads
- Complex order types: "Opening order" undefined for spreads, shorts
- Invalid price data: Zero/negative prices from bad ticks unhandled
- Domain event gaps: No events for liquidation completion or kill-switch escalation
- P&L reconstruction on restart: Unlike HWM, realized P&L restart behavior undefined
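
The HWM restart finding is easiest to see with numbers. The sketch below is illustrative only: the drawdown formula (fractional decline from the high-water mark) and the portfolio values are assumptions, not taken from the spec. It shows that initializing the HWM to the current value after a decline reports zero drawdown rather than a stricter one.

```python
def drawdown(hwm: float, value: float) -> float:
    """Fractional decline from the high-water mark (0.0 when at or above it).

    Assumed formula; the spec's exact definition is not reproduced here.
    """
    if hwm <= 0:
        raise ValueError("high-water mark must be positive")
    return max(0.0, (hwm - value) / hwm)


# Hypothetical values: the portfolio peaked, then fell 10% before the restart.
true_peak = 1_000_000.0
value_at_restart = 900_000.0

# Drawdown measured against the real peak.
before_restart = drawdown(true_peak, value_at_restart)

# Drawdown after a restart that re-initializes the HWM to the current value.
after_restart = drawdown(value_at_restart, value_at_restart)

print(f"before restart: {before_restart:.1%}, after restart: {after_restart:.1%}")
```

Under these assumptions the 10% decline disappears from the metric the moment the process restarts, which is the opposite of the "stricter" behaviour the spec claims.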
Claude Opus Unique Findings
- Realized P&L is non-reversible ⭐: Realized losses cannot "return below alert threshold" — the de-escalation model assumes metrics can recover, but this metric FUNDAMENTALLY CANNOT. Creates permanent restrict state with no documented recovery. Best single finding across all models.
- Partial fill is not "completed": A partial fill is simultaneously a fill event AND evidence the prior round is in-flight. Defined completion states (filled/cancelled/rejected) don't cover partial fills. More precise than GPT-5's version.
- De-escalation hysteresis cross-document conflict: Clearing only below ALERT (rather than below RESTRICT) collapses three levels to two on the way down. De-escalation is also deferred to escalation-policy.md, which creates a potential conflict between the two documents.
- Restart during closed session: After a crash during a closed or pre/post-market session, no fills or ticks arrive to trigger re-evaluation. The restrict flag stays cleared indefinitely, leaving an unbounded permissive window.
- HWM can be HIGHER than true peak: Portfolio appreciation during crash window means restart HWM > true peak, making drawdown detection LESS strict (spec only considers the lower case).
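
The realized P&L finding can be sketched in a few lines. Everything here is hypothetical: the `Book` class, the `ALERT_THRESHOLD` value, and the `can_deescalate` rule (clear once the metric is back above the alert threshold) are assumptions chosen to mirror the finding, not the spec's actual definitions.

```python
from dataclasses import dataclass

ALERT_THRESHOLD = -5_000.0  # hypothetical realized-loss alert level


@dataclass
class Book:
    realized_pnl: float = 0.0
    position: int = 0
    avg_price: float = 0.0

    def unrealized_pnl(self, price: float) -> float:
        return self.position * (price - self.avg_price)

    def close_all(self, price: float) -> None:
        """Liquidate: unrealized P&L becomes realized and the position goes flat."""
        self.realized_pnl += self.unrealized_pnl(price)
        self.position = 0


def can_deescalate(metric: float) -> bool:
    # Assumed rule: the level clears once the metric is back above the threshold.
    return metric > ALERT_THRESHOLD


# A liquidated book: the loss is realized and the position is flat.
book = Book(position=1_000, avg_price=100.0)
book.close_all(price=92.0)

# Later ticks move the price, but a flat book's realized P&L never changes.
checks = [can_deescalate(book.realized_pnl) for _price in (95.0, 100.0, 110.0)]

# For contrast, an open position's unrealized P&L does recover with price.
open_book = Book(position=1_000, avg_price=100.0)
recovers = can_deescalate(open_book.unrealized_pnl(110.0))
```

In this sketch every de-escalation check on the realized metric fails regardless of where the price goes, while the unrealized metric recovers. With opening orders blocked by the restrict flag, nothing in the model can move the realized figure back across the threshold.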
Claude Sonnet Assessment
All 10 findings are subsets of GPT-5/Opus findings with less detail. No unique insights. Its mention of corporate actions is the closest to a unique finding but lacks actionable depth.
Key Insights
Opus excels at ASSUMPTION identification (confirmed pattern)
Across experiments, Opus consistently finds where the spec's OWN ASSUMPTIONS are false:
- #30: "spec ambiguity as exploit surface," "guaranteed conditions an adversary can rely on"
- #31: "realized P&L cannot recover" — metric violates de-escalation model core assumption
Opus reasons about what the spec BELIEVES to be true and checks whether those beliefs hold. GPT-5 reasons about what the spec FAILS TO SAY. Different but complementary.
Sonnet adds no value for spec-gap analysis
In adversarial gaming (#30), Sonnet contributed unique meta-analytical synthesis. For completeness analysis, Sonnet found nothing the others didn't find in more detail. Creative/generative tasks → Sonnet adds value. Systematic/exhaustive tasks → skip Sonnet.
Task type affects model ranking
| Task Type | Best for Count | Best for Insight | Sonnet Valuable? |
|---|---|---|---|
| Adversarial gaming | GPT-5 | GPT-5 (compound attacks) | Yes (meta-synthesis) |
| Spec-gap analysis | GPT-5 | Opus (false assumptions) | No |
Practical Recommendation
For specification review: run GPT-5 + Opus in parallel. Skip Sonnet.
- GPT-5: exhaustive coverage of undefined terms and boundary conditions
- Opus: finding where the spec's own assumptions are logically invalid
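
A minimal sketch of the parallel run, assuming each reviewer can be wrapped as a plain function that takes the spec text and returns a list of finding titles. `review_with_gpt5` and `review_with_opus` are hypothetical stubs that return canned titles from this report; they do not call any real API.

```python
from concurrent.futures import ThreadPoolExecutor


def review_with_gpt5(spec_text: str) -> list[str]:
    # Hypothetical stub; a real version would call the model here.
    return ["Concentration denominator undefined",
            "Tick coalescing window under-specified"]


def review_with_opus(spec_text: str) -> list[str]:
    # Hypothetical stub; a real version would call the model here.
    return ["Realized P&L is non-reversible",
            "Tick coalescing window under-specified"]


def review_spec(spec_text: str) -> dict[str, list[str]]:
    """Run both reviewers concurrently; map each finding to the models that raised it."""
    reviewers = {"GPT-5": review_with_gpt5, "Opus": review_with_opus}
    with ThreadPoolExecutor(max_workers=len(reviewers)) as pool:
        futures = {name: pool.submit(fn, spec_text) for name, fn in reviewers.items()}
        results = {name: fut.result() for name, fut in futures.items()}

    merged: dict[str, list[str]] = {}
    for name, findings in results.items():
        for finding in findings:
            # Naive de-duplication on the normalized title only.
            merged.setdefault(finding.strip().lower(), []).append(name)
    return merged


merged = review_spec("...spec text...")
unique = {finding: models for finding, models in merged.items() if len(models) == 1}
```

Matching on exact titles is only a placeholder. Real model outputs phrase the same gap differently, so grouping overlapping findings would still need manual review or a fuzzier comparison.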
Efficiency
| Model | Tokens/finding | Seconds/finding |
|---|---|---|
| GPT-5 | 101 output (434 incl. reasoning) | 6.0s |
| Opus | 192 | 4.5s |
| Sonnet | 111 | 2.4s |
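
The per-finding figures can be recomputed from the Results table. The sketch below uses output tokens only, because reasoning tokens are reported only for GPT-5; counting GPT-5's 8,320 reasoning tokens as well raises its figure to roughly 434 tokens per finding. The names `RESULTS` and `per_finding` are illustrative.

```python
# Figures copied from the Results table: (seconds, output tokens, findings).
RESULTS = {
    "GPT-5": (149, 2528, 25),
    "Opus": (68, 2878, 15),
    "Sonnet": (24, 1106, 10),
}
GPT5_REASONING_TOKENS = 8320  # only GPT-5 reports reasoning tokens


def per_finding(seconds: float, tokens: int, findings: int) -> tuple[float, float]:
    """Return (tokens per finding, seconds per finding)."""
    return tokens / findings, seconds / findings


for model, (seconds, tokens, findings) in RESULTS.items():
    tok, sec = per_finding(seconds, tokens, findings)
    print(f"{model}: {tok:.0f} output tokens/finding, {sec:.1f} s/finding")

# GPT-5 with reasoning tokens included in the numerator.
gpt5_total = (RESULTS["GPT-5"][1] + GPT5_REASONING_TOKENS) / RESULTS["GPT-5"][2]
print(f"GPT-5 incl. reasoning: {gpt5_total:.0f} tokens/finding")
```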