New analytical lens tested on gargoyle order-state-machine.md:
- GPT-5: 15 findings (most CRITICAL issues, exhaustive field analysis)
- Opus: 14 findings (state lifecycle focus, implementation mechanisms)
- Sonnet: 10 findings (fast but shallow)

Key insight: "unstated constraints" finds what's IMPLIED but not stated, distinct from gaps, race conditions, or ambiguities. GPT-5 is best for catching CRITICAL data integrity constraints; Opus for state machine implementation details.
Experiment 53: Unstated Constraint Detection on State Machine Specification
Date: 2026-05-09
Document: gargoyle order-state-machine.md (~260 lines)
Task: Identify unstated constraints — invariants that MUST be true for the system to work correctly but are never explicitly stated in the document.
Method
Same document (full text) + same analytical prompt to all 3 models via HAI proxy. Prompt required structured output: constraint statement, evidence quotes, failure mode, and severity (CRITICAL/HIGH/MEDIUM). No tools, no project context beyond the document. Single prompt, no conversation history.
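The structured output described above could be checked with a small schema like the following. This is a minimal sketch, not the format actually used in the experiment: the `Finding` class, the `parse_finding` helper, and the dictionary keys are all hypothetical names chosen for illustration; only the four required parts and the three severity levels come from the method description.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    CRITICAL = "CRITICAL"
    HIGH = "HIGH"
    MEDIUM = "MEDIUM"


@dataclass(frozen=True)
class Finding:
    constraint: str            # the unstated invariant, stated as one sentence
    evidence: tuple[str, ...]  # quotes taken from the reviewed document
    failure_mode: str          # what breaks if the invariant is violated
    severity: Severity


def parse_finding(raw: dict) -> Finding:
    """Validate one raw finding (assumed dict layout) and normalise it."""
    if not raw.get("constraint") or not raw.get("failure_mode"):
        raise ValueError("constraint and failure_mode are required")
    evidence = tuple(q.strip() for q in raw.get("evidence", []) if q.strip())
    if not evidence:
        raise ValueError("at least one evidence quote is required")
    return Finding(
        constraint=raw["constraint"].strip(),
        evidence=evidence,
        failure_mode=raw["failure_mode"].strip(),
        # Severity(...) raises ValueError for anything outside the three levels.
        severity=Severity(str(raw.get("severity", "")).upper()),
    )
```

Rejecting findings without evidence quotes keeps the comparison between models fair, since every counted finding then points back to the document text.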
Performance
| Model | Time | Output tokens | Reasoning tokens | Findings |
|---|---|---|---|---|
| GPT-5 | 107s | 9,228 | 6,976 | 15 |
| Claude Sonnet 4.6 | 61s | 3,442 | (internal) | 10 |
| Claude Opus 4.6 | 65s | 3,601 | (internal) | 14 |
Findings Comparison
Common Ground (all 3 identified)
- `broker_order_id` uniqueness: Must be unique and immutable; fill correlation depends on it
- `filled_quantity` ≤ `quantity`: Cannot exceed requested amount
- `filled_avg_price` must be quantity-weighted: Not simple average
- `terminated_at` ↔ terminal state synchronization: Must be cleared on fill override, re-set on termination
- `limit_price` nullity linked to `order_type`: Non-null iff limit order
- `expires_at` nullity linked to `time_in_force='gtd'`: Non-null iff gtd
- `decision_id` → order is one-to-one: No multi-order splitting per decision
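Several of the shared findings are simple field-level invariants, so they can be expressed as executable checks. The sketch below assumes a simplified `Order` shape. The field names follow the findings, but the `state` field, the `"limit"` order-type value, and the membership of `TERMINAL_STATES` are assumptions, not taken from the specification.

```python
from dataclasses import dataclass
from datetime import datetime
from decimal import Decimal
from typing import Optional

# Assumed set of terminal states; the source only names cancelled/expired explicitly.
TERMINAL_STATES = {"filled", "cancelled", "expired", "rejected"}


@dataclass
class Order:
    broker_order_id: str
    state: str
    order_type: str
    time_in_force: str
    quantity: Decimal
    filled_quantity: Decimal = Decimal("0")
    limit_price: Optional[Decimal] = None
    expires_at: Optional[datetime] = None
    terminated_at: Optional[datetime] = None


def invariant_violations(order: Order) -> list[str]:
    """Return a description of every field-level invariant the order breaks."""
    problems = []
    if order.filled_quantity > order.quantity:
        problems.append("filled_quantity exceeds quantity")
    if (order.limit_price is not None) != (order.order_type == "limit"):
        problems.append("limit_price must be non-null iff order_type is limit")
    if (order.expires_at is not None) != (order.time_in_force == "gtd"):
        problems.append("expires_at must be non-null iff time_in_force is gtd")
    if (order.terminated_at is not None) != (order.state in TERMINAL_STATES):
        problems.append("terminated_at must be set iff state is terminal")
    return problems


def weighted_avg_price(fills: list[tuple[Decimal, Decimal]]) -> Decimal:
    """Quantity-weighted average over (quantity, price) pairs."""
    total_qty = sum((qty for qty, _ in fills), Decimal("0"))
    if total_qty == 0:
        raise ValueError("no filled quantity")
    return sum((qty * price for qty, price in fills), Decimal("0")) / total_qty
```

For fills of 100 at 10 and 300 at 12, the weighted average is 11.5, whereas a simple average of the two prices would give 11, which is why the distinction matters for P&L.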
GPT-5 Unique Findings (not in either Claude model)
- Multi-broker correlation ambiguity: Fill schema has no broker field; if multi-broker, `broker_order_id` must be globally unique across brokers or system must guarantee single broker source. (CRITICAL)
- Fill ledger deduplication requires unique fill identity: The document mentions idempotent state transitions but fills are append-only; duplicate fill messages would corrupt the ledger unless there's a fill-level unique ID. (CRITICAL)
- Order retention for late fills: Orders and `broker_order_id` mapping cannot be GCed immediately after terminal state; needed for late fill processing and reconciliation. (CRITICAL)
- "What" vs "how" immutability after submission: `instrument_id`, `action`, `position_effect`, `decision_id` must not be changed by replace; only execution parameters can be modified. (HIGH)
- Replace cannot reduce quantity below `filled_quantity`: You can't "unfill" shares; `quantity` < `filled_quantity` creates impossible state. (HIGH)
- `instrument_id` vs `ticker` for correlation: Must use `instrument_id` as primary key, never `ticker` (which can change via corporate actions). (HIGH)
- Local expiry timers only for appropriate TIFs: No local expiry timer should be created for GTC orders. (MEDIUM)
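Two of these findings translate directly into guard code: fill deduplication and the limits on what a replace may change. The following is a rough sketch under stated assumptions. The `broker` and `fill_id` fields do not exist in the reviewed fill schema (that absence is the finding), and `FillLedger` and `validate_replace` are hypothetical names.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Fill:
    broker: str           # assumed field: scopes IDs when several brokers are used
    fill_id: str          # assumed field: fill-level unique identity
    broker_order_id: str
    quantity: Decimal
    price: Decimal


class FillLedger:
    """Append-only ledger that ignores redelivered fill messages."""

    def __init__(self) -> None:
        self._fills: list[Fill] = []
        self._seen: set[tuple[str, str]] = set()

    def append(self, fill: Fill) -> bool:
        key = (fill.broker, fill.fill_id)
        if key in self._seen:
            return False  # duplicate delivery: the ledger is left unchanged
        self._seen.add(key)
        self._fills.append(fill)
        return True

    def filled_quantity(self, broker: str, broker_order_id: str) -> Decimal:
        return sum(
            (f.quantity for f in self._fills
             if f.broker == broker and f.broker_order_id == broker_order_id),
            Decimal("0"),
        )


# The "what" fields named in the finding; a replace may only touch the "how".
IMMUTABLE_FIELDS = ("instrument_id", "action", "position_effect", "decision_id")


def validate_replace(current: dict, requested: dict, filled_quantity: Decimal) -> list[str]:
    """List the reasons a replace request must be refused (empty list means OK)."""
    errors = [
        f"{name} is immutable after submission"
        for name in IMMUTABLE_FIELDS
        if name in requested and requested[name] != current[name]
    ]
    if requested.get("quantity", current["quantity"]) < filled_quantity:
        errors.append("quantity cannot drop below filled_quantity")
    return errors
```

Keying the seen-set on `(broker, fill_id)` rather than `fill_id` alone also addresses the multi-broker ambiguity, at the cost of requiring every fill to carry its source.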
Claude Opus Unique Findings (not in either other model)
- Pre-modification state must be tracked for revert: `pending_cancel`/`pending_replace` rejection must revert to correct state (`working` OR `partially_filled`, not always `working`). The document mentions both as valid revert targets. (HIGH)
- `position_effect` consistency with actual position: When `close`, must have existing position; when `open`, no contradictory close. Otherwise lot management corrupts P&L. (HIGH)
- `pending_replace` must track pending new values: Upon broker confirmation, system must know what new parameters were requested to apply them. No field exists in Order to track this. (HIGH)
- Terminal state override ONLY by fills: `cancelled`/`expired` can only be reactivated by fills, not by any other broker event. This bounds the reactivation surface. (HIGH)
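The first and third findings describe bookkeeping the specification leaves implicit. One way to picture it is below. This is an illustrative sketch only: `prior_state` and `pending_values` are hypothetical fields (the finding notes that no such field exists in Order), and returning to the prior state after a confirmed replace is an assumption.

```python
from dataclasses import dataclass, field
from typing import Optional

# States a replace request may start from, per the revert targets in the finding.
REVERTIBLE_FROM = {"working", "partially_filled"}


@dataclass
class OrderState:
    state: str
    quantity: int
    limit_price: Optional[float] = None
    prior_state: Optional[str] = None                   # hypothetical field
    pending_values: dict = field(default_factory=dict)  # hypothetical field


def request_replace(order: OrderState, **new_values) -> None:
    if order.state not in REVERTIBLE_FROM:
        raise ValueError(f"cannot replace from state {order.state!r}")
    order.prior_state = order.state        # remember where to revert to
    order.pending_values = dict(new_values)
    order.state = "pending_replace"


def _settle(order: OrderState) -> None:
    """Leave pending_replace and clear the bookkeeping fields."""
    order.state = order.prior_state
    order.prior_state = None
    order.pending_values = {}


def on_replace_confirmed(order: OrderState) -> None:
    if order.state != "pending_replace":
        raise ValueError("no replace is pending")
    for name, value in order.pending_values.items():
        setattr(order, name, value)        # apply what was actually requested
    _settle(order)                         # assumption: resume the prior state


def on_replace_rejected(order: OrderState) -> None:
    if order.state != "pending_replace":
        raise ValueError("no replace is pending")
    _settle(order)                         # revert, not always to "working"
```

A fill arriving while the request is pending would presumably need to update `prior_state` as well (for example from `working` to `partially_filled`); the sketch does not handle that case.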
Claude Sonnet Unique Findings (not in either other model)
- At most one active order per instrument+action+position_effect: No stated mechanism for concurrent orders to same instrument/direction. Without ordering guarantees, same lot could be closed twice. (HIGH)
- Fill events must be processed in `filled_at` order per order: Out-of-order processing produces incorrect intermediate states even if final totals are correct; could trigger unnecessary fill-override path. (MEDIUM)
Findings Unique to GPT-5
| # | Finding | Severity |
|---|---|---|
| 1 | Multi-broker correlation ambiguity | CRITICAL |
| 2 | Fill deduplication requires unique fill ID | CRITICAL |
| 3 | Order retention for late fills | CRITICAL |
| 4 | "What" vs "how" immutability boundary | HIGH |
| 5 | Replace cannot reduce quantity below fills | HIGH |
| 6 | `instrument_id` over `ticker` for joins | HIGH |
| 7 | No local expiry timers for GTC | MEDIUM |
Findings Unique to Opus
| # | Finding | Severity |
|---|---|---|
| 1 | Pre-modification state tracking for revert | HIGH |
| 2 | `position_effect` consistency with position | HIGH |
| 3 | `pending_replace` pending parameter tracking | HIGH |
| 4 | Terminal override only by fills | HIGH |
Findings Unique to Sonnet
| # | Finding | Severity |
|---|---|---|
| 1 | One active order per instrument+action+effect | HIGH |
| 2 | Fill processing order per broker_order_id | MEDIUM |
Quality Assessment
GPT-5 produced the most findings (15) and found the most CRITICAL-severity issues (5). The multi-broker correlation gap and fill deduplication constraint are genuinely important — these are exactly the kinds of things that would cause production incidents. GPT-5's strength: systematically checking every field and relationship for unstated dependencies. The reasoning tokens (6,976) show deep exploration.
Claude Opus found 14 constraints with strong focus on state machine correctness — the pre-modification state tracking and pending parameter tracking findings show Opus reasoning about the lifecycle of state, not just the state itself. Opus's characteristic strength (finding design tensions) manifests as finding where the document implies mechanism without specifying it.
Claude Sonnet was fastest (61s) but found the fewest (10). The unique findings (one-active-order constraint, fill ordering) are both valid but lower severity. Sonnet identifies correct constraints but doesn't pursue the implications as deeply — e.g., it mentions fill ordering but doesn't trace the cascade to lot management the way GPT-5 would.
Key Insight — "Unstated constraints" as an analytical lens
This is a productive new lens for specification review. Unlike:
- Gap analysis (what's missing?) — this finds what's IMPLIED but not stated
- Race condition analysis (what timing issues?) — this finds static invariants
- Ambiguity analysis (what's unclear?) — this finds definite constraints
The findings here are all things a developer might violate because they're not documented. Each model approaches this differently:
- GPT-5: Exhaustively checks every field for nullability, uniqueness, and consistency invariants. Catches the operational/infrastructure constraints (retention, deduplication).
- Opus: Reasons about state machine lifecycle and what must be tracked to support transitions. Catches the "how do you implement this" constraints.
- Sonnet: Identifies the most obvious constraints quickly but doesn't explore edge cases.
Practical Implication
For state machine specification review:
- Run GPT-5 first — catches the data integrity and operational constraints
- Run Opus second — catches the state lifecycle and implementation mechanism constraints
- Run Sonnet only if time-constrained; it returned roughly 30% fewer findings than either other model (10 vs. 14 and 15)
Union of all 3 models: 21 distinct unstated constraints identified. Single-model coverage:
- GPT-5 alone: 15/21 (71%)
- Opus alone: 14/21 (67%)
- Sonnet alone: 10/21 (48%)
The multi-model approach is especially valuable for specifications because the cost of missing a constraint is high — it becomes a production bug.
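The coverage percentages above are each model's finding count divided by the 21-constraint union. A short check, using only the counts reported in this write-up (the variable names are illustrative):

```python
# Finding counts per model and the reported size of the three-model union.
findings = {"GPT-5": 15, "Opus": 14, "Sonnet": 10}
UNION_SIZE = 21

# Single-model coverage as a whole-number percentage of the union.
coverage = {model: round(100 * n / UNION_SIZE) for model, n in findings.items()}

for model, pct in coverage.items():
    print(f"{model} alone: {findings[model]}/{UNION_SIZE} ({pct}%)")
```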
Cost Comparison
| Model | Tokens/Finding | Time/Finding |
|---|---|---|
| GPT-5 | 615 | 7.1s |
| Opus | 257 | 4.6s |
| Sonnet | 344 | 6.1s |
Opus is most token-efficient for this task. GPT-5's higher token count reflects the detailed reasoning but yields more CRITICAL findings. For specification review where CRITICAL constraints matter most, GPT-5 justifies the cost.
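The per-finding figures in the table follow from the Performance table: output tokens and wall-clock time divided by the number of findings. A small sketch that reproduces them (the `runs` layout and `per_finding` helper are illustrative names, not part of the experiment tooling):

```python
# Values copied from the Performance table.
runs = {
    "GPT-5": {"seconds": 107, "output_tokens": 9228, "findings": 15},
    "Opus": {"seconds": 65, "output_tokens": 3601, "findings": 14},
    "Sonnet": {"seconds": 61, "output_tokens": 3442, "findings": 10},
}


def per_finding(run: dict) -> tuple[int, float]:
    """Return (output tokens per finding, seconds per finding)."""
    tokens = round(run["output_tokens"] / run["findings"])
    seconds = round(run["seconds"] / run["findings"], 1)
    return tokens, seconds


cost = {model: per_finding(run) for model, run in runs.items()}
```

These ratios weight every finding equally; they say nothing about severity, which is why the token-efficiency ranking and the recommendation for CRITICAL constraints can point to different models.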