model-research/findings/2026-05-09-58-state-machine-completeness-analysis.md

# Finding 58: State Machine Completeness Analysis — GPT-5 finds most gaps, Opus finds design contradictions, Sonnet finds crash recovery concerns

**Date:** 2026-05-09
**Task:** Identify state machine gaps (undefined transitions, conflicting rules, dead-end states, crash recovery issues, cross-component desynchronization) in gargoyle's `kill-switch.md` (293 lines)
**Document:** A complex state machine with: 3 acceptance policies (open/close-only/reject-all), 2 engagement scopes (global/per-user), 2 modes (RESTRICT/LIQUIDATE), multiple engagement sources (automated/manual), crash recovery semantics, and cross-component interactions with OrderManager and User Lifecycle.

## Methodology

Same document (full text) + same focused analytical question to all 3 models via HAI AI Core proxy. Prompt was highly structured with 5 specific gap categories and required specific output format (gap, states involved, scenario, impact, severity). No tools, no project context beyond the document.

| Model | Time | Output tokens | Reasoning tokens | Gaps found |
|---|---|---|---|---|
| GPT-5 | 123s | 10,258 | 7,808 | 16 |
| Claude Opus 4.6 | 87s | 4,210 | (internal) | 11 (cut off) |
| Claude Sonnet 4.6 | 93s | 4,536 | (internal) | 9 |

## What they found — common ground (all 3 identified)

- Global vs per-user scope interaction on disengage (what happens to per-user when global disengages?)
- Acceptance policy writer arbitration missing (multiple sources can set policy with no precedence rule)
- LIQUIDATE → RESTRICT "de-escalation" or "re-escalation" path undefined
- Crash recovery gap: in-flight liquidation orders after crash (fill/cancel race, double-liquidation risk)
- Persistence write failure scenarios (engage fails but partial state may exist)
- OrderManager acceptance policy source of truth on restart not fully specified

## GPT-5 unique findings (not in either Claude model)

- **Multi-source engagement stacking** — No idempotency/ordering for repeated engage/disengage from multiple automated systems. Disengage by one source could clear another's engagement.
- **Race between decision engine termination and OM policy update** — In-flight order could escape after engagement if policy check was already passed before kill switch engaged.
- **Crash between engine termination and OM policy update** — Window where OM continues with policy=open even though kill switch is engaged.
- **Pending operator release not specified as persisted** — Could lose the release gate on restart.
- **Effective status vs enforceable status desync** — UI may show LIQUIDATE controls while effective policy is reject-all (global RESTRICT + per-user LIQUIDATE).
- **Disengage during in-flight auto-cancellations** — Behavior undefined when operator disengages before cancel-all completes.
- **Domain event ordering relative to state changes** — Subscribers may react to events before enforcement state is updated.
- **Acceptance policy default on startup conflicts with LIQUIDATE** — reject-all default until state loads, but LIQUIDATE intent needs close-only.

## Claude Opus unique findings (not in either other model)

- **Direct disengage from RESTRICT without verification** — Document describes restrict→liquidate→(verify)→disengage as "typical" but nothing *prevents* skipping directly to disengage. The manual verification gate is convention, not enforcement.
- **Second engagement with different mode** — If already in LIQUIDATE and automated system engages RESTRICT, precedence is undefined. Either safety escalation is ignored OR in-progress liquidation is interrupted.
- **Engagement command during cold-start loading window** — If health monitor detects condition before persisted state loads, engagement could be lost or overwritten by stale persistence.
- **Partial persistence failure during mode transition** — State shows LIQUIDATE but OrderManager has `reject-all` from pre-transition; operator believes liquidation is available but orders rejected.

## Claude Sonnet unique findings (not in either other model)

- **Cancellation failure leaves fill handler in ambiguous state** — Cancel/fill race at broker leaves order in transitional internal state; subsequent fill may be misrouted.
- **Pending operator release set: per-engagement vs per-user tracking undefined** — If user is added by per-user and global engagement, does disengage of one remove them?
- **Reconciliation completion vs manual release gate** — After disengage, if reconciliation completes successfully, is manual gate enforced by code or only by convention?
- **Dashboard concurrent control with automated engagement** — Two simultaneous mode transitions from operator and automated system processed in undefined order.

## Quality assessment

- **GPT-5** produced the most exhaustive analysis (16 gaps) with the strongest operational focus. Its unique findings about acceptance policy writer arbitration (gap #1) and crash-window races (gaps #7, #8) are genuinely critical — these are the scenarios most likely to cause trading when it should be stopped. The multi-source engagement stacking finding shows deep reasoning about operational reality (multiple automated systems running concurrently). The summary at the end was operationally focused.

- **Claude Opus** produced fewer gaps (11, possibly cut off) but several were qualitatively superior. The "direct disengage from RESTRICT without verification" finding is the most architecturally significant — it identifies that the manual gate described in the document is a *convention* embedded in the "typical" escalation sequence, not a state machine constraint. This is exactly the kind of gap that implementation would miss because it matches the documented happy path. The "engagement during cold-start loading" finding shows temporal reasoning about system boot sequence.

- **Claude Sonnet** produced 9 solid findings with good structure. The cancellation/fill race handler state finding shows understanding of broker semantics. The "enforcement by code vs convention" framing (gap #8) echoes Opus's design-tension style. However, several findings overlapped heavily with GPT-5's, and the severity assessments were more conservative.

## Key insight — state machine analysis is a GPT-5 sweet spot

This task required exhaustive enumeration of state combinations and transitions. GPT-5's 7,808 reasoning tokens produced systematic coverage: it explicitly iterated through engage/disengage × global/per-user × RESTRICT/LIQUIDATE × crash/restart scenarios. The Claude models both produced fewer findings despite similar output token counts — they reasoned about *interesting* gaps rather than exhaustively exploring the state space.

This is the inverse of race condition analysis (Finding #13) where Opus excelled at finding subtle timing issues. State machine completeness is about combinatorial coverage (GPT-5's strength) rather than temporal reasoning (Opus's strength).

**Task taxonomy emerging:**
- **Exhaustive state space exploration** → GPT-5 (reasoning enables systematic enumeration)
- **Design tension / convention vs enforcement** → Opus (finds what the document can't see about itself)
- **Cross-boundary interaction** → All models competitive; Sonnet surprisingly strong on broker semantics
- **Crash recovery paths** → GPT-5 and Opus; Sonnet adequate but less thorough

## Unique GPT-5 insight worth highlighting

Gap #1 (acceptance policy writer arbitration) is genuinely important: the document says OrderManager's policy "can be set by other sources" but doesn't define precedence. GPT-5 worked through the concrete scenario: kill switch sets close-only, risk component later clears breach and sets open, trading resumes despite kill switch engaged. This is a real design flaw, not a pedantic complaint.

## Unique Opus insight worth highlighting

The "direct disengage from RESTRICT" finding reveals a meta-problem: the document describes what *should* happen (typical escalation path) but doesn't *enforce* it. Implementations following the document could build a system where operators can bypass the manual verification gate by disengaging directly. Opus is consistently the model that identifies when documentation describes convention but doesn't mandate enforcement.

## Practical implications

For state machine completeness analysis:
- Use GPT-5 for exhaustive coverage — it will enumerate state combinations systematically
- Use Opus to identify enforcement gaps — conventions that aren't constraints
- Sonnet is adequate for quick sanity checks but will miss edge cases

The ideal workflow: GPT-5 first for comprehensive gap list, then Opus to identify which gaps are enforcement issues vs documentation gaps.

## Raw data

Full model outputs preserved in `/tmp/*-state-machine.json` for reference.