# Experiment 53: Unstated Constraint Detection on State Machine Specification

**Date:** 2026-05-09
**Document:** gargoyle `order-state-machine.md` (~260 lines)
**Task:** Identify unstated constraints — invariants that MUST be true for the system to work correctly but are never explicitly stated in the document.

## Method

Same document (full text) + same analytical prompt to all 3 models via HAI proxy. Prompt required structured output: constraint statement, evidence quotes, failure mode, and severity (CRITICAL/HIGH/MEDIUM). No tools, no project context beyond the document. Single prompt, no conversation history.

## Performance

| Model | Time | Output tokens | Reasoning tokens | Findings |
|---|---|---|---|---|
| GPT-5 | 107s | 9,228 | 6,976 | 15 |
| Claude Sonnet 4.6 | 61s | 3,442 | (internal) | 10 |
| Claude Opus 4.6 | 65s | 3,601 | (internal) | 14 |

## Findings Comparison

### Common Ground (all 3 identified)

1. **`broker_order_id` uniqueness** — Must be unique and immutable; fill correlation depends on it
2. **`filled_quantity` ≤ `quantity`** — Cannot exceed requested amount
3. **`filled_avg_price` must be quantity-weighted** — Not simple average
4. **`terminated_at` ↔ terminal state synchronization** — Must be cleared on fill override, re-set on termination
5. **`limit_price` nullity linked to `order_type`** — Non-null iff limit order
6. **`expires_at` nullity linked to `time_in_force='gtd'`** — Non-null iff gtd
7. **`decision_id` → order is one-to-one** — No multi-order splitting per decision

### GPT-5 Unique Findings (not in either Claude model)

1. **Multi-broker correlation ambiguity** — Fill schema has no broker field; if multi-broker, `broker_order_id` must be globally unique across brokers or system must guarantee single broker source. (CRITICAL)

2. **Fill ledger deduplication requires unique fill identity** — The document mentions idempotent state transitions but fills are append-only; duplicate fill messages would corrupt the ledger unless there's a fill-level unique ID. (CRITICAL)

3. **Order retention for late fills** — Orders and `broker_order_id` mapping cannot be GCed immediately after terminal state; needed for late fill processing and reconciliation. (CRITICAL)

4. **"What" vs "how" immutability after submission** — `instrument_id`, `action`, `position_effect`, `decision_id` must not be changed by replace; only execution parameters can be modified. (HIGH)

5. **Replace cannot reduce quantity below filled_quantity** — You can't "unfill" shares; quantity < filled_quantity creates impossible state. (HIGH)

6. **`instrument_id` vs `ticker` for correlation** — Must use `instrument_id` as primary key, never `ticker` (which can change via corporate actions). (HIGH)

7. **Local expiry timers only for appropriate TIFs** — No local expiry timer should be created for GTC orders. (MEDIUM)

### Claude Opus Unique Findings (not in either other model)

1. **Pre-modification state must be tracked for revert** — `pending_cancel`/`pending_replace` rejection must revert to correct state (`working` OR `partially_filled`, not always `working`). The document mentions both as valid revert targets. (HIGH)

2. **`position_effect` consistency with actual position** — When `close`, must have existing position; when `open`, no contradictory close. Otherwise lot management corrupts P&L. (HIGH)

3. **`pending_replace` must track pending new values** — Upon broker confirmation, system must know what new parameters were requested to apply them. No field exists in Order to track this. (HIGH)

4. **Terminal state override ONLY by fills** — `cancelled`/`expired` can only be reactivated by fills, not by any other broker event. This bounds the reactivation surface. (HIGH)

### Claude Sonnet Unique Findings (not in either other model)

1. **At most one active order per instrument+action+position_effect** — No stated mechanism for concurrent orders to same instrument/direction. Without ordering guarantees, same lot could be closed twice. (HIGH)

2. **Fill events must be processed in `filled_at` order per order** — Out-of-order processing produces incorrect intermediate states even if final totals are correct; could trigger unnecessary fill-override path. (MEDIUM)

### Findings Unique to GPT-5

| # | Finding | Severity |
|---|---|---|
| 1 | Multi-broker correlation ambiguity | CRITICAL |
| 2 | Fill deduplication requires unique fill ID | CRITICAL |
| 3 | Order retention for late fills | CRITICAL |
| 4 | "What" vs "how" immutability boundary | HIGH |
| 5 | Replace cannot reduce quantity below fills | HIGH |
| 6 | `instrument_id` over `ticker` for joins | HIGH |
| 7 | No local expiry timers for GTC | MEDIUM |

### Findings Unique to Opus

| # | Finding | Severity |
|---|---|---|
| 1 | Pre-modification state tracking for revert | HIGH |
| 2 | `position_effect` consistency with position | HIGH |
| 3 | `pending_replace` pending parameter tracking | HIGH |
| 4 | Terminal override only by fills | HIGH |

### Findings Unique to Sonnet

| # | Finding | Severity |
|---|---|---|
| 1 | One active order per instrument+action+effect | HIGH |
| 2 | Fill processing order per broker_order_id | MEDIUM |

## Quality Assessment

**GPT-5** produced the most findings (15) and found the most CRITICAL-severity issues (5). The multi-broker correlation gap and fill deduplication constraint are genuinely important — these are exactly the kinds of things that would cause production incidents. GPT-5's strength: systematically checking every field and relationship for unstated dependencies. The reasoning tokens (6,976) show deep exploration.

**Claude Opus** found 14 constraints with strong focus on state machine correctness — the pre-modification state tracking and pending parameter tracking findings show Opus reasoning about the *lifecycle* of state, not just the state itself. Opus's characteristic strength (finding design tensions) manifests as finding where the document implies mechanism without specifying it.

**Claude Sonnet** was fastest (61s) but found the fewest (10). The unique findings (one-active-order constraint, fill ordering) are both valid but lower severity. Sonnet identifies correct constraints but doesn't pursue the implications as deeply — e.g., it mentions fill ordering but doesn't trace the cascade to lot management the way GPT-5 would.

## Key Insight — "Unstated constraints" as an analytical lens

This is a productive new lens for specification review. Unlike:
- **Gap analysis** (what's missing?) — this finds what's IMPLIED but not stated
- **Race condition analysis** (what timing issues?) — this finds static invariants
- **Ambiguity analysis** (what's unclear?) — this finds definite constraints

The findings here are all things a developer might violate because they're not documented. Each model approaches this differently:

- **GPT-5**: Exhaustively checks every field for nullability, uniqueness, and consistency invariants. Catches the operational/infrastructure constraints (retention, deduplication).
- **Opus**: Reasons about state machine lifecycle and what must be tracked to support transitions. Catches the "how do you implement this" constraints.
- **Sonnet**: Identifies the most obvious constraints quickly but doesn't explore edge cases.

## Practical Implication

For state machine specification review:
1. **Run GPT-5 first** — catches the data integrity and operational constraints
2. **Run Opus second** — catches the state lifecycle and implementation mechanism constraints
3. **Sonnet** — only if time-constrained; will miss ~30% of what the others find

Union of all 3 models: 21 distinct unstated constraints identified. Single-model coverage:
- GPT-5 alone: 15/21 (71%)
- Opus alone: 14/21 (67%)
- Sonnet alone: 10/21 (48%)

The multi-model approach is especially valuable for specifications because the cost of missing a constraint is high — it becomes a production bug.

## Cost Comparison

| Model | Tokens/Finding | Time/Finding |
|---|---|---|
| GPT-5 | 615 | 7.1s |
| Opus | 257 | 4.6s |
| Sonnet | 344 | 6.1s |

Opus is most token-efficient for this task. GPT-5's higher token count reflects the detailed reasoning but yields more CRITICAL findings. For specification review where CRITICAL constraints matter most, GPT-5 justifies the cost.