model-research/findings/2026-05-09-53-unstated-constraints.md
Rodin 9d0a94bd68 Add finding #53: unstated constraint detection on state machines
New analytical lens tested on gargoyle order-state-machine.md:
- GPT-5: 15 findings (most CRITICAL issues, exhaustive field analysis)
- Opus: 14 findings (state lifecycle focus, implementation mechanisms)
- Sonnet: 10 findings (fast but shallow)

Key insight: "unstated constraints" finds what's IMPLIED but not stated,
distinct from gaps, race conditions, or ambiguities. GPT-5 is best for
catching CRITICAL data integrity constraints; Opus for state machine
implementation details.
2026-05-08 23:47:51 -07:00


Experiment 53: Unstated Constraint Detection on State Machine Specification

Date: 2026-05-09
Document: gargoyle order-state-machine.md (~260 lines)
Task: Identify unstated constraints — invariants that MUST be true for the system to work correctly but are never explicitly stated in the document.

Method

Same document (full text) + same analytical prompt to all 3 models via HAI proxy. Prompt required structured output: constraint statement, evidence quotes, failure mode, and severity (CRITICAL/HIGH/MEDIUM). No tools, no project context beyond the document. Single prompt, no conversation history.
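The structured output described above can be pictured as one record per finding. This is only a sketch: the field names (`constraint`, `evidence`, `failure_mode`, `severity`) and the rule that at least one evidence quote is required are assumptions for illustration, not the schema actually used in the prompt.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    CRITICAL = "CRITICAL"
    HIGH = "HIGH"
    MEDIUM = "MEDIUM"


@dataclass(frozen=True)
class Finding:
    constraint: str            # the unstated invariant, as one sentence
    evidence: tuple            # quotes from the document that imply it
    failure_mode: str          # what breaks if the invariant is violated
    severity: Severity


def parse_finding(raw: dict) -> Finding:
    """Validate one model-reported finding (hypothetical field names)."""
    evidence = tuple(raw["evidence"])
    if not evidence:
        # Assumed rule: a constraint with no supporting quote is rejected.
        raise ValueError("a finding needs at least one evidence quote")
    return Finding(
        constraint=raw["constraint"],
        evidence=evidence,
        failure_mode=raw["failure_mode"],
        severity=Severity(raw["severity"]),  # raises ValueError if unknown
    )
```

Parsing every model's output into the same record type is what makes the per-model counts and severity tallies directly comparable.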

Performance

| Model | Time | Output tokens | Reasoning tokens | Findings |
|---|---|---|---|---|
| GPT-5 | 107s | 9,228 | 6,976 | 15 |
| Claude Sonnet 4.6 | 61s | 3,442 | (internal) | 10 |
| Claude Opus 4.6 | 65s | 3,601 | (internal) | 14 |

Findings Comparison

Common Ground (all 3 identified)

  1. broker_order_id uniqueness — Must be unique and immutable; fill correlation depends on it
  2. filled_quantity ≤ quantity — Cannot exceed requested amount
  3. filled_avg_price must be quantity-weighted — Not simple average
  4. terminated_at ↔ terminal state synchronization — Must be cleared on fill override, re-set on termination
  5. limit_price nullity linked to order_type — Non-null iff limit order
  6. expires_at nullity linked to time_in_force='gtd' — Non-null iff gtd
  7. decision_id → order is one-to-one — No multi-order splitting per decision
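Several of the field-level constraints above can be written as executable checks. The sketch below is illustrative only: the `Order` shape is a guess at the relevant fields, the string values for `order_type` and `time_in_force` are assumed, and `TERMINAL_STATES` is a hypothetical set (the findings name only cancelled and expired explicitly).

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Assumption: the real specification defines its own terminal states.
TERMINAL_STATES = {"filled", "cancelled", "expired", "rejected"}


@dataclass
class Order:
    state: str
    order_type: str                      # assumed values: "limit", "market"
    time_in_force: str                   # assumed values: "gtd", "gtc", "day"
    quantity: int
    filled_quantity: int = 0
    limit_price: Optional[float] = None
    expires_at: Optional[datetime] = None
    terminated_at: Optional[datetime] = None


def invariant_violations(order: Order) -> list:
    """Return a message for each unstated constraint the order violates."""
    violations = []
    if not 0 <= order.filled_quantity <= order.quantity:
        violations.append("filled_quantity must be between 0 and quantity")
    if (order.limit_price is not None) != (order.order_type == "limit"):
        violations.append("limit_price must be non-null iff order_type is 'limit'")
    if (order.expires_at is not None) != (order.time_in_force == "gtd"):
        violations.append("expires_at must be non-null iff time_in_force is 'gtd'")
    if (order.terminated_at is not None) != (order.state in TERMINAL_STATES):
        violations.append("terminated_at must be set iff state is terminal")
    return violations
```

Each check uses `!=` on two booleans to express "iff", so a violation is reported in both directions, for example a market order carrying a limit price and a limit order missing one.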

GPT-5 Unique Findings (not in either Claude model)

  1. Multi-broker correlation ambiguity — Fill schema has no broker field; if multi-broker, broker_order_id must be globally unique across brokers or system must guarantee single broker source. (CRITICAL)

  2. Fill ledger deduplication requires unique fill identity — The document mentions idempotent state transitions but fills are append-only; duplicate fill messages would corrupt the ledger unless there's a fill-level unique ID. (CRITICAL)

  3. Order retention for late fills — Orders and broker_order_id mapping cannot be GCed immediately after terminal state; needed for late fill processing and reconciliation. (CRITICAL)

  4. "What" vs "how" immutability after submission — instrument_id, action, position_effect, decision_id must not be changed by replace; only execution parameters can be modified. (HIGH)

  5. Replace cannot reduce quantity below filled_quantity — You can't "unfill" shares; quantity < filled_quantity creates impossible state. (HIGH)

  6. instrument_id vs ticker for correlation — Must use instrument_id as primary key, never ticker (which can change via corporate actions). (HIGH)

  7. Local expiry timers only for appropriate TIFs — No local expiry timer should be created for GTC orders. (MEDIUM)
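The fill-related constraints (deduplication by fill identity, the quantity-weighted average price, and the floor on replace quantity) fit together in a small ledger. This is a minimal sketch under stated assumptions: `fill_id` is a hypothetical field, since the finding is precisely that the document defines no fill-level unique ID, and the class and method names are invented for illustration.

```python
from dataclasses import dataclass, field
from decimal import Decimal
from typing import Optional


@dataclass(frozen=True)
class Fill:
    fill_id: str        # hypothetical unique fill identity
    quantity: int
    price: Decimal


@dataclass
class FillLedger:
    order_quantity: int
    fills: dict = field(default_factory=dict)   # fill_id -> Fill

    @property
    def filled_quantity(self) -> int:
        return sum(f.quantity for f in self.fills.values())

    @property
    def filled_avg_price(self) -> Optional[Decimal]:
        """Quantity-weighted average, not the simple mean of fill prices."""
        total = self.filled_quantity
        if total == 0:
            return None
        notional = sum(f.price * f.quantity for f in self.fills.values())
        return notional / total

    def apply(self, fill: Fill) -> bool:
        """Record a fill; return False when it is a duplicate delivery."""
        if fill.fill_id in self.fills:
            return False
        if self.filled_quantity + fill.quantity > self.order_quantity:
            raise ValueError("fill would push filled_quantity above quantity")
        self.fills[fill.fill_id] = fill
        return True

    def can_replace_quantity(self, new_quantity: int) -> bool:
        """A replace may not reduce quantity below what is already filled."""
        return new_quantity >= self.filled_quantity


ledger = FillLedger(order_quantity=100)
ledger.apply(Fill("f1", 60, Decimal("10.00")))
ledger.apply(Fill("f2", 40, Decimal("11.00")))
ledger.apply(Fill("f2", 40, Decimal("11.00")))   # duplicate, ignored
# Weighted average is (60*10 + 40*11) / 100 = 10.40; the simple mean would be 10.50.
```

Keying the ledger on the fill identity makes redelivery harmless, which is what an append-only ledger needs in order to coexist with at-least-once broker messaging.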

Claude Opus Unique Findings (not in either other model)

  1. Pre-modification state must be tracked for revert — pending_cancel/pending_replace rejection must revert to correct state (working OR partially_filled, not always working). The document mentions both as valid revert targets. (HIGH)

  2. position_effect consistency with actual position — When close, must have existing position; when open, no contradictory close. Otherwise lot management corrupts P&L. (HIGH)

  3. pending_replace must track pending new values — Upon broker confirmation, system must know what new parameters were requested to apply them. No field exists in Order to track this. (HIGH)

  4. Terminal state override ONLY by fills — cancelled/expired can only be reactivated by fills, not by any other broker event. This bounds the reactivation surface. (HIGH)
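Findings 1 and 3 both come down to state the Order has to carry while a modification is in flight. A rough sketch follows, in which `prior_state` and `pending_params` are hypothetical fields (the finding is that no such fields exist in the document), and returning to the prior state after a confirmed replace is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

REVERT_TARGETS = {"working", "partially_filled"}
PENDING_STATES = {"pending_cancel", "pending_replace"}


@dataclass
class OrderState:
    state: str = "working"
    prior_state: Optional[str] = None       # hypothetical: state to revert to
    pending_params: Optional[dict] = None   # hypothetical: requested new values

    def _enter_pending(self, pending_state: str) -> None:
        if self.state not in REVERT_TARGETS:
            raise ValueError(f"cannot modify an order in state {self.state!r}")
        self.prior_state = self.state        # remember the exact revert target
        self.state = pending_state

    def request_cancel(self) -> None:
        self._enter_pending("pending_cancel")

    def request_replace(self, **new_params) -> None:
        self._enter_pending("pending_replace")
        self.pending_params = dict(new_params)

    def on_broker_reject(self) -> None:
        """Broker rejected the cancel or replace: restore the prior state."""
        if self.state not in PENDING_STATES:
            raise ValueError("no modification is pending")
        self.state = self.prior_state
        self.prior_state = None
        self.pending_params = None

    def on_replace_confirmed(self) -> dict:
        """Broker confirmed the replace: hand back the parameters to apply."""
        if self.state != "pending_replace":
            raise ValueError("no replace is pending")
        applied = self.pending_params
        self.state = self.prior_state        # assumption: resume prior state
        self.prior_state = None
        self.pending_params = None
        return applied
```

The sketch deliberately ignores fills that arrive while a modification is pending; handling those would require updating `prior_state` as well.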

Claude Sonnet Unique Findings (not in either other model)

  1. At most one active order per instrument+action+position_effect — No stated mechanism for concurrent orders to same instrument/direction. Without ordering guarantees, same lot could be closed twice. (HIGH)

  2. Fill events must be processed in filled_at order per order — Out-of-order processing produces incorrect intermediate states even if final totals are correct; could trigger unnecessary fill-override path. (MEDIUM)
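The fill-ordering constraint can be enforced at the point where a batch of fill events is consumed. The helper below is a sketch that assumes fills arrive as dictionaries with `broker_order_id` and `filled_at` keys; the function name and the dictionary representation are illustrative.

```python
from collections import defaultdict


def fills_in_processing_order(fills: list) -> dict:
    """Group fills by broker_order_id and sort each group by filled_at."""
    by_order = defaultdict(list)
    for fill in fills:
        by_order[fill["broker_order_id"]].append(fill)
    return {
        order_id: sorted(group, key=lambda f: f["filled_at"])
        for order_id, group in by_order.items()
    }
```

Sorting within each order is enough here because the constraint is stated per order; no ordering is implied across different orders.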

Findings Unique to GPT-5

| # | Finding | Severity |
|---|---|---|
| 1 | Multi-broker correlation ambiguity | CRITICAL |
| 2 | Fill deduplication requires unique fill ID | CRITICAL |
| 3 | Order retention for late fills | CRITICAL |
| 4 | "What" vs "how" immutability boundary | HIGH |
| 5 | Replace cannot reduce quantity below fills | HIGH |
| 6 | instrument_id over ticker for joins | HIGH |
| 7 | No local expiry timers for GTC | MEDIUM |

Findings Unique to Opus

| # | Finding | Severity |
|---|---|---|
| 1 | Pre-modification state tracking for revert | HIGH |
| 2 | position_effect consistency with position | HIGH |
| 3 | pending_replace pending parameter tracking | HIGH |
| 4 | Terminal override only by fills | HIGH |

Findings Unique to Sonnet

| # | Finding | Severity |
|---|---|---|
| 1 | One active order per instrument+action+effect | HIGH |
| 2 | Fill processing order per broker_order_id | MEDIUM |

Quality Assessment

GPT-5 produced the most findings (15) and found the most CRITICAL-severity issues (5). The multi-broker correlation gap and fill deduplication constraint are genuinely important — these are exactly the kinds of things that would cause production incidents. GPT-5's strength: systematically checking every field and relationship for unstated dependencies. The reasoning tokens (6,976) show deep exploration.

Claude Opus found 14 constraints with strong focus on state machine correctness — the pre-modification state tracking and pending parameter tracking findings show Opus reasoning about the lifecycle of state, not just the state itself. Opus's characteristic strength (finding design tensions) manifests as finding where the document implies mechanism without specifying it.

Claude Sonnet was fastest (61s) but found the fewest (10). The unique findings (one-active-order constraint, fill ordering) are both valid but lower severity. Sonnet identifies correct constraints but doesn't pursue the implications as deeply — e.g., it mentions fill ordering but doesn't trace the cascade to lot management the way GPT-5 would.

Key Insight — "Unstated constraints" as an analytical lens

This is a productive new lens for specification review. It differs from three related lenses:

  • Gap analysis (what's missing?) — this finds what's IMPLIED but not stated
  • Race condition analysis (what timing issues?) — this finds static invariants
  • Ambiguity analysis (what's unclear?) — this finds definite constraints

The findings here are all things a developer might violate because they're not documented. Each model approaches this differently:

  • GPT-5: Exhaustively checks every field for nullability, uniqueness, and consistency invariants. Catches the operational/infrastructure constraints (retention, deduplication).
  • Opus: Reasons about state machine lifecycle and what must be tracked to support transitions. Catches the "how do you implement this" constraints.
  • Sonnet: Identifies the most obvious constraints quickly but doesn't explore edge cases.

Practical Implication

For state machine specification review:

  1. Run GPT-5 first — catches the data integrity and operational constraints
  2. Run Opus second — catches the state lifecycle and implementation mechanism constraints
  3. Sonnet — only if time-constrained; will miss ~30% of what the others find

Union of all 3 models: 21 distinct unstated constraints identified. Single-model coverage:

  • GPT-5 alone: 15/21 (71%)
  • Opus alone: 14/21 (67%)
  • Sonnet alone: 10/21 (48%)
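The coverage percentages follow directly from the per-model finding counts and the 21-constraint union reported above. A short check of the arithmetic, with variable names chosen for illustration:

```python
TOTAL_DISTINCT = 21  # union of all three models, as reported above
findings = {"GPT-5": 15, "Opus": 14, "Sonnet": 10}

# Single-model coverage as a whole-number percentage of the union.
coverage = {
    model: round(100 * count / TOTAL_DISTINCT)
    for model, count in findings.items()
}
print(coverage)  # {'GPT-5': 71, 'Opus': 67, 'Sonnet': 48}
```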

The multi-model approach is especially valuable for specifications because the cost of missing a constraint is high — it becomes a production bug.

Cost Comparison

| Model | Tokens/Finding | Time/Finding |
|---|---|---|
| GPT-5 | 615 | 7.1s |
| Opus | 257 | 4.6s |
| Sonnet | 344 | 6.1s |

Opus is most token-efficient for this task. GPT-5's higher token count reflects the detailed reasoning but yields more CRITICAL findings. For specification review where CRITICAL constraints matter most, GPT-5 justifies the cost.
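The per-finding figures can be reproduced from the Performance numbers by dividing output tokens and wall-clock time by the finding count. The dictionary layout below is just a convenient way to hold those reported values.

```python
# (seconds, output tokens, findings) per model, from the Performance table.
runs = {
    "GPT-5": (107, 9228, 15),
    "Opus": (65, 3601, 14),
    "Sonnet": (61, 3442, 10),
}

cost = {
    model: {
        "tokens_per_finding": round(tokens / n),
        "seconds_per_finding": round(seconds / n, 1),
    }
    for model, (seconds, tokens, n) in runs.items()
}

for model, row in cost.items():
    print(model, row["tokens_per_finding"], row["seconds_per_finding"])
```

Only output tokens enter the calculation; GPT-5's separately reported reasoning tokens are not added on top.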