ee3063997a
New task type: specification gap/completeness analysis (vs adversarial gaming). GPT-5 dominates count (25 findings), Opus produces best single insight (realized P&L non-reversibility violates de-escalation model assumption). Sonnet adds no unique value for this task type — skip for completeness audits.
115 lines
5.3 KiB
Markdown
# Finding #31: Spec-Gap Analysis on continuous-risk-monitoring.md

**Date:** 2026-05-06
**Document:** `continuous-risk-monitoring.md` (176 lines), a real-time risk monitoring
spec governing evaluation triggers, metric computation, escalation levels, autonomous
liquidation, and failure modes.
**Task type:** Specification gap/completeness analysis (NEW; previous experiments used
adversarial gaming)
**Analytical question:** Identify specification gaps, implicit assumptions,
contradictions, race conditions, and edge cases that could lead to implementation bugs.

## Results

| Model | Time | Output tokens | Reasoning tokens | Findings | Critical | High | Medium |
|---|---|---|---|---|---|---|---|
| GPT-5 | 149s | 2,528 | 8,320 | 25 | 3 | 14 | 8 |
| Claude Opus 4.6 | 68s | 2,878 | (internal) | 15 | 3 | 6 | 6 |
| Claude Sonnet 4.6 | 24s | 1,106 | (internal) | 10 | 3 | 6 | 1 |

## Category Distribution

| Category | GPT-5 | Opus | Sonnet |
|---|---|---|---|
| Gap | 12 | 6 | 4 |
| Race condition | 3 | 3 | 2 |
| Edge case | 5 | 2 | 2 |
| Contradiction | 3 | 2 | 1 |
| Assumption | 2 | 2 | 1 |

## Common Ground (all 3 identified)

- Liquidation minimum quantity undefined for portfolio-level metrics
- Race between fill-triggered evaluation and in-flight liquidation orders
- Multi-metric independence creates conflicting liquidation targets
- Tick coalescing window semantics under-specified
- "Last known price" handling undefined for instruments without recent data

## GPT-5 Unique Findings

- **HWM restart logic is self-contradictory**: Spec claims lower HWM = stricter, but
initializing to current value after decline makes drawdown appear ZERO (not stricter)
- **Stale price directionality**: Stale prices can UNDERESTIMATE for shorts or upward gaps
(contradicts spec's blanket "overestimates" claim)
- **Currency conversion / contract multipliers**: Never addressed for non-equity instruments
- **Daily P&L reset boundary**: Trading day definition (time, timezone, multi-venue) unspecified
- **Concentration denominator**: "Total portfolio" never defined
- **Stuck liquidation orders**: No timeout for orders that neither fill nor reject
- **Restrict flag atomicity**: No consistency guarantee between flag set and decision-engine reads
- **Complex order types**: "Opening order" undefined for spreads, shorts
- **Invalid price data**: Zero/negative prices from bad ticks unhandled
- **Domain event gaps**: No events for liquidation completion or kill-switch escalation
- **P&L reconstruction on restart**: Unlike HWM, realized P&L restart behavior undefined

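
The first two findings in the list above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the `drawdown` and `unrealized_pnl` formulas, the signed-quantity convention, and all numbers are assumptions, not definitions taken from `continuous-risk-monitoring.md`.

```python
def drawdown(hwm: float, equity: float) -> float:
    """Fractional drawdown from the high-water mark (0.0 at or above it).

    Assumed formula; the spec's exact definition is not reproduced here.
    """
    if hwm <= 0:
        raise ValueError("hwm must be positive")
    return max(0.0, (hwm - equity) / hwm)


# Hypothetical numbers: the portfolio peaked at 100k and then fell to 90k.
true_peak = 100_000.0
equity_after_decline = 90_000.0

# With the true peak retained, the 10% decline is visible.
dd_true = drawdown(true_peak, equity_after_decline)

# A restart that re-initializes the HWM to current equity hides the decline.
dd_restart = drawdown(equity_after_decline, equity_after_decline)


def unrealized_pnl(qty: float, entry: float, mark: float) -> float:
    """Mark-to-market P&L with signed quantity: qty > 0 is long, qty < 0 is short."""
    return qty * (mark - entry)


# Hypothetical upward gap: the last tick seen was 50, the live price is 55.
stale_mark, live_mark = 50.0, 55.0

# Error = P&L computed from the stale mark minus P&L at the live price.
long_err = unrealized_pnl(100, 50.0, stale_mark) - unrealized_pnl(100, 50.0, live_mark)
short_err = unrealized_pnl(-100, 50.0, stale_mark) - unrealized_pnl(-100, 50.0, live_mark)

print(f"drawdown with true peak:     {dd_true:.1%}")
print(f"drawdown after HWM re-init:  {dd_restart:.1%}")
print(f"stale-mark P&L error, long:  {long_err:+.0f}")
print(f"stale-mark P&L error, short: {short_err:+.0f}")
```

Under these assumptions the re-initialized HWM reports no drawdown at all, and the stale mark overstates the short position's P&L (it hides a loss), so the direction of the stale-price error depends on position side and gap direction.
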
## Claude Opus Unique Findings

- **Realized P&L is non-reversible** ⭐: Realized losses cannot "return below alert
threshold". The de-escalation model assumes metrics can recover, but this metric
FUNDAMENTALLY CANNOT, which creates a permanent restrict state with no documented recovery.
**Best single finding across all models.**
- **Partial fill is not "completed"**: A partial fill is simultaneously a fill event AND
evidence that the prior round is still in flight. The defined completion states
(filled/cancelled/rejected) don't cover partial fills. More precise than GPT-5's version.
- **De-escalation hysteresis cross-document conflict**: Clearing below ALERT (not RESTRICT)
collapses three levels to two on the way down. De-escalation is also deferred to
`escalation-policy.md`, creating a potential conflict.
- **Restart during closed session**: No fills or ticks arrive to trigger re-evaluation after
a crash in closed/pre-post sessions. The restrict flag stays cleared indefinitely, which is
an unbounded permissive window.
- **HWM can be HIGHER than true peak**: Portfolio appreciation during crash window means
restart HWM > true peak, making drawdown detection LESS strict (spec only considers the
lower case).

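
The non-reversibility and hysteresis findings above can be made concrete with a toy state machine. This is a minimal sketch under stated assumptions: the three-level `Level` enum, the threshold values, and the rule "de-escalate only once the loss is back under the alert threshold" are hypothetical stand-ins for the spec's model, and the realized series assumes no further realizing trades occur once the book is restricted.

```python
from enum import IntEnum


class Level(IntEnum):
    NORMAL = 0
    ALERT = 1
    RESTRICT = 2


# Hypothetical thresholds, not taken from the spec.
ALERT_LOSS = 1_000.0
RESTRICT_LOSS = 2_000.0


def next_level(current: Level, loss: float) -> Level:
    """Escalate on breach; clear only when the loss is back under ALERT_LOSS."""
    if loss >= RESTRICT_LOSS:
        return Level.RESTRICT
    if loss >= ALERT_LOSS:
        # Hysteresis: never step down while still at or above the alert threshold.
        return max(current, Level.ALERT)
    return Level.NORMAL


def run(losses):
    """Feed a series of loss readings through the state machine."""
    level = Level.NORMAL
    history = []
    for loss in losses:
        level = next_level(level, loss)
        history.append(level)
    return history


# An unrealized loss can shrink again as prices move back.
unrealized = [500.0, 2_500.0, 1_500.0, 800.0]

# A realized loss on a flat, restricted book stays where it is (assumption).
realized = [500.0, 2_500.0, 2_500.0, 2_500.0]

print([lvl.name for lvl in run(unrealized)])
print([lvl.name for lvl in run(realized)])
```

In this toy model the unrealized series drops from RESTRICT straight to NORMAL without passing through ALERT, which is the three-levels-to-two collapse, while the realized series never leaves RESTRICT.
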
## Claude Sonnet Assessment

All 10 findings duplicate GPT-5 or Opus findings with less detail. No unique insights.
Its mention of corporate actions comes closest to a unique finding but lacks actionable depth.

## Key Insights

### Opus excels at ASSUMPTION identification (confirmed pattern)

Across experiments, Opus consistently finds where the spec's OWN ASSUMPTIONS are false:
- #30: "spec ambiguity as exploit surface," "guaranteed conditions an adversary can rely on"
- #31: "realized P&L cannot recover", a metric that violates the de-escalation model's core assumption

Opus reasons about what the spec BELIEVES to be true and checks whether those beliefs hold.
GPT-5 reasons about what the spec FAILS TO SAY. Different but complementary.

### Sonnet adds no value for spec-gap analysis

In adversarial gaming (#30), Sonnet contributed unique meta-analytical synthesis. For
completeness analysis, Sonnet found nothing the others didn't find in more detail.
Creative/generative tasks → Sonnet adds value. Systematic/exhaustive tasks → skip Sonnet.

### Task type affects model ranking

| Task Type | Best for Count | Best for Insight | Sonnet Valuable? |
|---|---|---|---|
| Adversarial gaming | GPT-5 | GPT-5 (compound attacks) | Yes (meta-synthesis) |
| Spec-gap analysis | GPT-5 | Opus (false assumptions) | No |

## Practical Recommendation

For specification review: run GPT-5 + Opus in parallel. Skip Sonnet.
- GPT-5: exhaustive coverage of undefined terms and boundary conditions
- Opus: finding where the spec's own assumptions are logically invalid

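
One way to wire up this recommendation is a small routing table plus a thread pool. The sketch below is hypothetical throughout: `MODELS_BY_TASK`, `review_in_parallel`, and the stub reviewers are invented for illustration and do not call any real model API. The adversarial-gaming entry is an assumption inferred from the "Sonnet Valuable?" column, not a stated recommendation.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical routing table. The spec-gap row follows the recommendation above;
# the adversarial-gaming row is an inference from the task-type table.
MODELS_BY_TASK = {
    "adversarial-gaming": ["gpt-5", "opus", "sonnet"],
    "spec-gap": ["gpt-5", "opus"],
}


def review_in_parallel(task_type, spec_text, reviewers):
    """Run the reviewers selected for task_type concurrently.

    `reviewers` maps a model name to a callable taking the spec text and
    returning a list of findings. Returns findings keyed by model name.
    """
    models = MODELS_BY_TASK[task_type]
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        futures = {m: pool.submit(reviewers[m], spec_text) for m in models}
        return {m: f.result() for m, f in futures.items()}


# Stub reviewers stand in for real model calls (assumption for illustration).
stub_reviewers = {
    "gpt-5": lambda spec: ["'total portfolio' is never defined"],
    "opus": lambda spec: ["realized P&L cannot de-escalate"],
    "sonnet": lambda spec: [],
}

results = review_in_parallel("spec-gap", "spec text goes here", stub_reviewers)
for model, findings in results.items():
    print(model, findings)
```

Keeping the findings keyed by model makes it easy to see which reviewer contributed what before merging overlapping items.
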
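## Efficiency

| Model | Output tokens/finding | Seconds/finding |
|---|---|---|
| GPT-5 | 101 | 6.0s |
| Opus | 192 | 4.5s |
| Sonnet | 111 | 2.4s |

Token figures use output tokens only, because reasoning tokens are not reported for Opus
and Sonnet. Including GPT-5's 8,320 reasoning tokens gives about 434 tokens per finding.
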
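
The per-finding ratios can be recomputed directly from the Results table. This is a minimal sketch: the `runs` dictionary copies the reported times, token counts, and finding counts, while the field names and the choice to use output tokens as the common basis (reasoning tokens are only reported for GPT-5) are assumptions.

```python
# Figures copied from the Results table; None marks unreported reasoning tokens.
runs = {
    "GPT-5": {"seconds": 149, "output_tokens": 2528, "reasoning_tokens": 8320, "findings": 25},
    "Opus": {"seconds": 68, "output_tokens": 2878, "reasoning_tokens": None, "findings": 15},
    "Sonnet": {"seconds": 24, "output_tokens": 1106, "reasoning_tokens": None, "findings": 10},
}


def per_finding(run):
    """Output tokens and wall-clock seconds spent per reported finding."""
    n = run["findings"]
    return {
        "output_tokens": round(run["output_tokens"] / n),
        "seconds": round(run["seconds"] / n, 1),
    }


efficiency = {name: per_finding(run) for name, run in runs.items()}

# GPT-5 is the only model with reported reasoning tokens, so this figure
# has no counterpart for the other two models.
gpt5 = runs["GPT-5"]
gpt5_total_per_finding = round(
    (gpt5["output_tokens"] + gpt5["reasoning_tokens"]) / gpt5["findings"]
)

for name, eff in efficiency.items():
    print(f"{name}: {eff['output_tokens']} output tokens/finding, {eff['seconds']}s/finding")
print(f"GPT-5 including reasoning tokens: {gpt5_total_per_finding} tokens/finding")
```
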