# Experiment 53: Unstated Constraint Detection on State Machine Specification **Date:** 2026-05-09 **Document:** gargoyle `order-state-machine.md` (~260 lines) **Task:** Identify unstated constraints — invariants that MUST be true for the system to work correctly but are never explicitly stated in the document. ## Method Same document (full text) + same analytical prompt to all 3 models via HAI proxy. Prompt required structured output: constraint statement, evidence quotes, failure mode, and severity (CRITICAL/HIGH/MEDIUM). No tools, no project context beyond the document. Single prompt, no conversation history. ## Performance | Model | Time | Output tokens | Reasoning tokens | Findings | |---|---|---|---|---| | GPT-5 | 107s | 9,228 | 6,976 | 15 | | Claude Sonnet 4.6 | 61s | 3,442 | (internal) | 10 | | Claude Opus 4.6 | 65s | 3,601 | (internal) | 14 | ## Findings Comparison ### Common Ground (all 3 identified) 1. **`broker_order_id` uniqueness** — Must be unique and immutable; fill correlation depends on it 2. **`filled_quantity` ≤ `quantity`** — Cannot exceed requested amount 3. **`filled_avg_price` must be quantity-weighted** — Not simple average 4. **`terminated_at` ↔ terminal state synchronization** — Must be cleared on fill override, re-set on termination 5. **`limit_price` nullity linked to `order_type`** — Non-null iff limit order 6. **`expires_at` nullity linked to `time_in_force='gtd'`** — Non-null iff gtd 7. **`decision_id` → order is one-to-one** — No multi-order splitting per decision ### GPT-5 Unique Findings (not in either Claude model) 1. **Multi-broker correlation ambiguity** — Fill schema has no broker field; if multi-broker, `broker_order_id` must be globally unique across brokers or system must guarantee single broker source. (CRITICAL) 2. **Fill ledger deduplication requires unique fill identity** — The document mentions idempotent state transitions but fills are append-only; duplicate fill messages would corrupt the ledger unless there's a fill-level unique ID. (CRITICAL) 3. **Order retention for late fills** — Orders and `broker_order_id` mapping cannot be GCed immediately after terminal state; needed for late fill processing and reconciliation. (CRITICAL) 4. **"What" vs "how" immutability after submission** — `instrument_id`, `action`, `position_effect`, `decision_id` must not be changed by replace; only execution parameters can be modified. (HIGH) 5. **Replace cannot reduce quantity below filled_quantity** — You can't "unfill" shares; quantity < filled_quantity creates impossible state. (HIGH) 6. **`instrument_id` vs `ticker` for correlation** — Must use `instrument_id` as primary key, never `ticker` (which can change via corporate actions). (HIGH) 7. **Local expiry timers only for appropriate TIFs** — No local expiry timer should be created for GTC orders. (MEDIUM) ### Claude Opus Unique Findings (not in either other model) 1. **Pre-modification state must be tracked for revert** — `pending_cancel`/`pending_replace` rejection must revert to correct state (`working` OR `partially_filled`, not always `working`). The document mentions both as valid revert targets. (HIGH) 2. **`position_effect` consistency with actual position** — When `close`, must have existing position; when `open`, no contradictory close. Otherwise lot management corrupts P&L. (HIGH) 3. **`pending_replace` must track pending new values** — Upon broker confirmation, system must know what new parameters were requested to apply them. No field exists in Order to track this. (HIGH) 4. **Terminal state override ONLY by fills** — `cancelled`/`expired` can only be reactivated by fills, not by any other broker event. This bounds the reactivation surface. (HIGH) ### Claude Sonnet Unique Findings (not in either other model) 1. **At most one active order per instrument+action+position_effect** — No stated mechanism for concurrent orders to same instrument/direction. Without ordering guarantees, same lot could be closed twice. (HIGH) 2. **Fill events must be processed in `filled_at` order per order** — Out-of-order processing produces incorrect intermediate states even if final totals are correct; could trigger unnecessary fill-override path. (MEDIUM) ### Findings Unique to GPT-5 | # | Finding | Severity | |---|---|---| | 1 | Multi-broker correlation ambiguity | CRITICAL | | 2 | Fill deduplication requires unique fill ID | CRITICAL | | 3 | Order retention for late fills | CRITICAL | | 4 | "What" vs "how" immutability boundary | HIGH | | 5 | Replace cannot reduce quantity below fills | HIGH | | 6 | `instrument_id` over `ticker` for joins | HIGH | | 7 | No local expiry timers for GTC | MEDIUM | ### Findings Unique to Opus | # | Finding | Severity | |---|---|---| | 1 | Pre-modification state tracking for revert | HIGH | | 2 | `position_effect` consistency with position | HIGH | | 3 | `pending_replace` pending parameter tracking | HIGH | | 4 | Terminal override only by fills | HIGH | ### Findings Unique to Sonnet | # | Finding | Severity | |---|---|---| | 1 | One active order per instrument+action+effect | HIGH | | 2 | Fill processing order per broker_order_id | MEDIUM | ## Quality Assessment **GPT-5** produced the most findings (15) and found the most CRITICAL-severity issues (5). The multi-broker correlation gap and fill deduplication constraint are genuinely important — these are exactly the kinds of things that would cause production incidents. GPT-5's strength: systematically checking every field and relationship for unstated dependencies. The reasoning tokens (6,976) show deep exploration. **Claude Opus** found 14 constraints with strong focus on state machine correctness — the pre-modification state tracking and pending parameter tracking findings show Opus reasoning about the *lifecycle* of state, not just the state itself. Opus's characteristic strength (finding design tensions) manifests as finding where the document implies mechanism without specifying it. **Claude Sonnet** was fastest (61s) but found the fewest (10). The unique findings (one-active-order constraint, fill ordering) are both valid but lower severity. Sonnet identifies correct constraints but doesn't pursue the implications as deeply — e.g., it mentions fill ordering but doesn't trace the cascade to lot management the way GPT-5 would. ## Key Insight — "Unstated constraints" as an analytical lens This is a productive new lens for specification review. Unlike: - **Gap analysis** (what's missing?) — this finds what's IMPLIED but not stated - **Race condition analysis** (what timing issues?) — this finds static invariants - **Ambiguity analysis** (what's unclear?) — this finds definite constraints The findings here are all things a developer might violate because they're not documented. Each model approaches this differently: - **GPT-5**: Exhaustively checks every field for nullability, uniqueness, and consistency invariants. Catches the operational/infrastructure constraints (retention, deduplication). - **Opus**: Reasons about state machine lifecycle and what must be tracked to support transitions. Catches the "how do you implement this" constraints. - **Sonnet**: Identifies the most obvious constraints quickly but doesn't explore edge cases. ## Practical Implication For state machine specification review: 1. **Run GPT-5 first** — catches the data integrity and operational constraints 2. **Run Opus second** — catches the state lifecycle and implementation mechanism constraints 3. **Sonnet** — only if time-constrained; will miss ~30% of what the others find Union of all 3 models: 21 distinct unstated constraints identified. Single-model coverage: - GPT-5 alone: 15/21 (71%) - Opus alone: 14/21 (67%) - Sonnet alone: 10/21 (48%) The multi-model approach is especially valuable for specifications because the cost of missing a constraint is high — it becomes a production bug. ## Cost Comparison | Model | Tokens/Finding | Time/Finding | |---|---|---| | GPT-5 | 615 | 7.1s | | Opus | 257 | 4.6s | | Sonnet | 344 | 6.1s | Opus is most token-efficient for this task. GPT-5's higher token count reflects the detailed reasoning but yields more CRITICAL findings. For specification review where CRITICAL constraints matter most, GPT-5 justifies the cost.