Files
model-research/findings/2026-05-09-58-state-machine-completeness-analysis.md
T
Rodin 98304604ac Finding 58: State machine completeness analysis on kill-switch.md
GPT-5 finds 16 gaps, Opus 11, Sonnet 9. GPT-5 excels at exhaustive
state space enumeration; Opus finds convention-vs-enforcement gaps;
Sonnet adequate but less thorough.

Key insight: state machine completeness is a GPT-5 sweet spot due to
reasoning tokens enabling systematic combinatorial coverage.
2026-05-09 15:06:32 -07:00

91 lines
8.6 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Finding 58: State Machine Completeness Analysis — GPT-5 finds most gaps, Opus finds design contradictions, Sonnet finds crash recovery concerns
**Date:** 2026-05-09
**Task:** Identify state machine gaps (undefined transitions, conflicting rules, dead-end states, crash recovery issues, cross-component desynchronization) in gargoyle's `kill-switch.md` (293 lines)
**Document:** A complex state machine with: 3 acceptance policies (open/close-only/reject-all), 2 engagement scopes (global/per-user), 2 modes (RESTRICT/LIQUIDATE), multiple engagement sources (automated/manual), crash recovery semantics, and cross-component interactions with OrderManager and User Lifecycle.
## Methodology
Same document (full text) + same focused analytical question to all 3 models via HAI AI Core proxy. Prompt was highly structured with 5 specific gap categories and required specific output format (gap, states involved, scenario, impact, severity). No tools, no project context beyond the document.
| Model | Time | Output tokens | Reasoning tokens | Gaps found |
|---|---|---|---|---|
| GPT-5 | 123s | 10,258 | 7,808 | 16 |
| Claude Opus 4.6 | 87s | 4,210 | (internal) | 11 (cut off) |
| Claude Sonnet 4.6 | 93s | 4,536 | (internal) | 9 |
## What they found — common ground (all 3 identified)
- Global vs per-user scope interaction on disengage (what happens to per-user when global disengages?)
- Acceptance policy writer arbitration missing (multiple sources can set policy with no precedence rule)
- LIQUIDATE → RESTRICT "de-escalation" or "re-escalation" path undefined
- Crash recovery gap: in-flight liquidation orders after crash (fill/cancel race, double-liquidation risk)
- Persistence write failure scenarios (engage fails but partial state may exist)
- OrderManager acceptance policy source of truth on restart not fully specified
## GPT-5 unique findings (not in either Claude model)
- **Multi-source engagement stacking** — No idempotency/ordering for repeated engage/disengage from multiple automated systems. Disengage by one source could clear another's engagement.
- **Race between decision engine termination and OM policy update** — In-flight order could escape after engagement if policy check was already passed before kill switch engaged.
- **Crash between engine termination and OM policy update** — Window where OM continues with policy=open even though kill switch is engaged.
- **Pending operator release not specified as persisted** — Could lose the release gate on restart.
- **Effective status vs enforceable status desync** — UI may show LIQUIDATE controls while effective policy is reject-all (global RESTRICT + per-user LIQUIDATE).
- **Disengage during in-flight auto-cancellations** — Behavior undefined when operator disengages before cancel-all completes.
- **Domain event ordering relative to state changes** — Subscribers may react to events before enforcement state is updated.
- **Acceptance policy default on startup conflicts with LIQUIDATE** — reject-all default until state loads, but LIQUIDATE intent needs close-only.
## Claude Opus unique findings (not in either other model)
- **Direct disengage from RESTRICT without verification** — Document describes restrict→liquidate→(verify)→disengage as "typical" but nothing *prevents* skipping directly to disengage. The manual verification gate is convention, not enforcement.
- **Second engagement with different mode** — If already in LIQUIDATE and automated system engages RESTRICT, precedence is undefined. Either safety escalation is ignored OR in-progress liquidation is interrupted.
- **Engagement command during cold-start loading window** — If health monitor detects condition before persisted state loads, engagement could be lost or overwritten by stale persistence.
- **Partial persistence failure during mode transition** — State shows LIQUIDATE but OrderManager has `reject-all` from pre-transition; operator believes liquidation is available but orders rejected.
## Claude Sonnet unique findings (not in either other model)
- **Cancellation failure leaves fill handler in ambiguous state** — Cancel/fill race at broker leaves order in transitional internal state; subsequent fill may be misrouted.
- **Pending operator release set: per-engagement vs per-user tracking undefined** — If user is added by per-user and global engagement, does disengage of one remove them?
- **Reconciliation completion vs manual release gate** — After disengage, if reconciliation completes successfully, is manual gate enforced by code or only by convention?
- **Dashboard concurrent control with automated engagement** — Two simultaneous mode transitions from operator and automated system processed in undefined order.
## Quality assessment
- **GPT-5** produced the most exhaustive analysis (16 gaps) with the strongest operational focus. Its unique findings about acceptance policy writer arbitration (gap #1) and crash-window races (gaps #7, #8) are genuinely critical — these are the scenarios most likely to cause trading when it should be stopped. The multi-source engagement stacking finding shows deep reasoning about operational reality (multiple automated systems running concurrently). The summary at the end was operationally focused.
- **Claude Opus** produced fewer gaps (11, possibly cut off) but several were qualitatively superior. The "direct disengage from RESTRICT without verification" finding is the most architecturally significant — it identifies that the manual gate described in the document is a *convention* embedded in the "typical" escalation sequence, not a state machine constraint. This is exactly the kind of gap that implementation would miss because it matches the documented happy path. The "engagement during cold-start loading" finding shows temporal reasoning about system boot sequence.
- **Claude Sonnet** produced 9 solid findings with good structure. The cancellation/fill race handler state finding shows understanding of broker semantics. The "enforcement by code vs convention" framing (gap #8) echoes Opus's design-tension style. However, several findings overlapped heavily with GPT-5's, and the severity assessments were more conservative.
## Key insight — state machine analysis is a GPT-5 sweet spot
This task required exhaustive enumeration of state combinations and transitions. GPT-5's 7,808 reasoning tokens produced systematic coverage: it explicitly iterated through engage/disengage × global/per-user × RESTRICT/LIQUIDATE × crash/restart scenarios. The Claude models both produced fewer findings despite similar output token counts — they reasoned about *interesting* gaps rather than exhaustively exploring the state space.
This is the inverse of race condition analysis (Finding #13) where Opus excelled at finding subtle timing issues. State machine completeness is about combinatorial coverage (GPT-5's strength) rather than temporal reasoning (Opus's strength).
**Task taxonomy emerging:**
- **Exhaustive state space exploration** → GPT-5 (reasoning enables systematic enumeration)
- **Design tension / convention vs enforcement** → Opus (finds what the document can't see about itself)
- **Cross-boundary interaction** → All models competitive; Sonnet surprisingly strong on broker semantics
- **Crash recovery paths** → GPT-5 and Opus; Sonnet adequate but less thorough
## Unique GPT-5 insight worth highlighting
Gap #1 (acceptance policy writer arbitration) is genuinely important: the document says OrderManager's policy "can be set by other sources" but doesn't define precedence. GPT-5 worked through the concrete scenario: kill switch sets close-only, risk component later clears breach and sets open, trading resumes despite kill switch engaged. This is a real design flaw, not a pedantic complaint.
## Unique Opus insight worth highlighting
The "direct disengage from RESTRICT" finding reveals a meta-problem: the document describes what *should* happen (typical escalation path) but doesn't *enforce* it. Implementations following the document could build a system where operators can bypass the manual verification gate by disengaging directly. Opus is consistently the model that identifies when documentation describes convention but doesn't mandate enforcement.
## Practical implications
For state machine completeness analysis:
- Use GPT-5 for exhaustive coverage — it will enumerate state combinations systematically
- Use Opus to identify enforcement gaps — conventions that aren't constraints
- Sonnet is adequate for quick sanity checks but will miss edge cases
The ideal workflow: GPT-5 first for comprehensive gap list, then Opus to identify which gaps are enforcement issues vs documentation gaps.
## Raw data
Full model outputs preserved in `/tmp/*-state-machine.json` for reference.