Rodin 9d0a94bd68 Add finding #53 : unstated constraint detection on state machines

New analytical lens tested on gargoyle order-state-machine.md:
- GPT-5: 15 findings (most CRITICAL issues, exhaustive field analysis)
- Opus: 14 findings (state lifecycle focus, implementation mechanisms)
- Sonnet: 10 findings (fast but shallow)

Key insight: "unstated constraints" finds what's IMPLIED but not stated,
distinct from gaps, race conditions, or ambiguities. GPT-5 is best for
catching CRITICAL data integrity constraints; Opus for state machine
implementation details.

2026-05-08 23:47:51 -07:00

8.2 KiB


Experiment 53: Unstated Constraint Detection on State Machine Specification

Date: 2026-05-09
Document: gargoyle order-state-machine.md (~260 lines)
Task: Identify unstated constraints, i.e. invariants that MUST be true for the system to work correctly but are never explicitly stated in the document.

Method

The same document (full text) and the same analytical prompt were sent to all 3 models via the HAI proxy. The prompt required structured output for each finding: constraint statement, evidence quotes, failure mode, and severity (CRITICAL/HIGH/MEDIUM). The models had no tools and no project context beyond the document. Each run was a single prompt with no conversation history.
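As an illustration of the structured output described above, the sketch below shows one way a single finding could be represented and validated. The field names, the `parse_finding` helper, and the rule that at least one evidence quote is required are assumptions made for this example; the actual prompt schema is not reproduced in this report.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    CRITICAL = "CRITICAL"
    HIGH = "HIGH"
    MEDIUM = "MEDIUM"


@dataclass(frozen=True)
class Finding:
    """One unstated constraint reported by a model (hypothetical schema)."""
    constraint: str
    evidence_quotes: tuple[str, ...]
    failure_mode: str
    severity: Severity

    def __post_init__(self) -> None:
        # Assumption: a finding without supporting quotes is rejected.
        if not self.evidence_quotes:
            raise ValueError("a finding needs at least one evidence quote")


def parse_finding(raw: dict) -> Finding:
    """Build a Finding from a decoded JSON object; raises on unknown severity."""
    return Finding(
        constraint=raw["constraint"],
        evidence_quotes=tuple(raw["evidence_quotes"]),
        failure_mode=raw["failure_mode"],
        severity=Severity(raw["severity"]),
    )
```

Parsing into a fixed record like this makes counts per model and per severity straightforward to compute, and it rejects severities outside the three levels the prompt allowed.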

Performance

Model	Time	Output tokens	Reasoning tokens	Findings
GPT-5	107s	9,228	6,976	15
Claude Sonnet 4.6	61s	3,442	(internal)	10
Claude Opus 4.6	65s	3,601	(internal)	14

Findings Comparison

Common Ground (all 3 identified)

broker_order_id uniqueness — Must be unique and immutable; fill correlation depends on it
filled_quantity ≤ quantity — Cannot exceed requested amount
filled_avg_price must be quantity-weighted — Not simple average
terminated_at ↔ terminal state synchronization — Must be cleared on fill override, re-set on termination
limit_price nullity linked to order_type — Non-null iff limit order
expires_at nullity linked to time_in_force='gtd' — Non-null iff gtd
decision_id → order is one-to-one — No multi-order splitting per decision
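Several of the shared constraints above are simple field-level invariants, so they can be sketched as executable checks. The `Order` shape, the allowed `order_type` values, and the membership of `TERMINAL_STATES` are assumptions for illustration (the source names only `cancelled` and `expired` as terminal); the real schema lives in order-state-machine.md.

```python
from dataclasses import dataclass
from datetime import datetime
from decimal import Decimal
from typing import Optional

# Assumption: "filled" and "rejected" are terminal alongside cancelled/expired.
TERMINAL_STATES = {"filled", "cancelled", "expired", "rejected"}


@dataclass
class Order:
    order_type: str          # assumed values: "limit" or "market"
    time_in_force: str       # e.g. "day", "gtc", "gtd"
    state: str
    quantity: Decimal
    filled_quantity: Decimal = Decimal("0")
    limit_price: Optional[Decimal] = None
    expires_at: Optional[datetime] = None
    terminated_at: Optional[datetime] = None


def violated_invariants(o: Order) -> list[str]:
    """Return a description of each shared constraint the order breaks."""
    problems = []
    if o.filled_quantity > o.quantity:
        problems.append("filled_quantity exceeds quantity")
    if (o.limit_price is not None) != (o.order_type == "limit"):
        problems.append("limit_price must be non-null iff order_type is limit")
    if (o.expires_at is not None) != (o.time_in_force == "gtd"):
        problems.append("expires_at must be non-null iff time_in_force is gtd")
    if (o.terminated_at is not None) != (o.state in TERMINAL_STATES):
        problems.append("terminated_at must be set iff state is terminal")
    return problems


def weighted_avg_price(fills: list[tuple[Decimal, Decimal]]) -> Decimal:
    """Quantity-weighted average over (quantity, price) pairs."""
    total_qty = sum((q for q, _ in fills), Decimal("0"))
    if total_qty == 0:
        raise ValueError("no filled quantity")
    return sum((q * p for q, p in fills), Decimal("0")) / total_qty
```

For fills of 100 @ 10 and 300 @ 12, `weighted_avg_price` gives 11.5, whereas a simple average of the two prices would give 11. That difference is the one the `filled_avg_price` constraint guards against.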

GPT-5 Unique Findings (not in either Claude model)

Multi-broker correlation ambiguity — Fill schema has no broker field; if multi-broker, broker_order_id must be globally unique across brokers or system must guarantee single broker source. (CRITICAL)
Fill ledger deduplication requires unique fill identity — The document mentions idempotent state transitions but fills are append-only; duplicate fill messages would corrupt the ledger unless there's a fill-level unique ID. (CRITICAL)
Order retention for late fills — Orders and broker_order_id mapping cannot be GCed immediately after terminal state; needed for late fill processing and reconciliation. (CRITICAL)
"What" vs "how" immutability after submission — instrument_id, action, position_effect, decision_id must not be changed by replace; only execution parameters can be modified. (HIGH)
Replace cannot reduce quantity below filled_quantity — You can't "unfill" shares; quantity < filled_quantity creates impossible state. (HIGH)
instrument_id vs ticker for correlation — Must use instrument_id as primary key, never ticker (which can change via corporate actions). (HIGH)
Local expiry timers only for appropriate TIFs — No local expiry timer should be created for GTC orders. (MEDIUM)
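Two of these findings, fill deduplication and the lower bound on replace quantity, lend themselves to a small sketch. The `fill_id` field is hypothetical: the finding is that the specification lacks a fill-level unique ID, so this only shows what such an ID would make possible. `FillLedger` and `validate_replace` are likewise names invented for this example.

```python
from dataclasses import dataclass, field
from decimal import Decimal


@dataclass(frozen=True)
class Fill:
    fill_id: str            # hypothetical: not present in the reviewed spec
    broker_order_id: str
    quantity: Decimal
    price: Decimal


@dataclass
class FillLedger:
    """Append-only ledger that ignores fills it has already recorded."""
    fills: list[Fill] = field(default_factory=list)
    _seen: set[str] = field(default_factory=set)

    def append(self, fill: Fill) -> bool:
        """Record the fill; return False if it is a duplicate delivery."""
        if fill.fill_id in self._seen:
            return False
        self._seen.add(fill.fill_id)
        self.fills.append(fill)
        return True

    def filled_quantity(self, broker_order_id: str) -> Decimal:
        return sum(
            (f.quantity for f in self.fills if f.broker_order_id == broker_order_id),
            Decimal("0"),
        )


def validate_replace(new_quantity: Decimal, filled_quantity: Decimal) -> None:
    """Reject a replace that would drop quantity below what has already filled."""
    if new_quantity < filled_quantity:
        raise ValueError(
            f"replace quantity {new_quantity} is below filled quantity {filled_quantity}"
        )
```

If more than one broker can deliver fills, the deduplication key would presumably need to include a broker identifier as well, which ties back to the multi-broker correlation finding.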

Claude Opus Unique Findings (not in either other model)

Pre-modification state must be tracked for revert — pending_cancel/pending_replace rejection must revert to correct state (working OR partially_filled, not always working). The document mentions both as valid revert targets. (HIGH)
position_effect consistency with actual position — When close, must have existing position; when open, no contradictory close. Otherwise lot management corrupts P&L. (HIGH)
pending_replace must track pending new values — Upon broker confirmation, system must know what new parameters were requested to apply them. No field exists in Order to track this. (HIGH)
Terminal state override ONLY by fills — cancelled/expired can only be reactivated by fills, not by any other broker event. This bounds the reactivation surface. (HIGH)
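The first and third findings describe state that has to be carried across a pending modification. A minimal sketch, assuming two extra fields (`prior_state` and `pending_params`) that the reviewed Order schema does not define, and assuming a confirmed replace returns the order to the state it was in before the request:

```python
from dataclasses import dataclass, field
from typing import Optional

# The two revert targets named in the finding.
REVERT_TARGETS = {"working", "partially_filled"}


@dataclass
class TrackedOrder:
    state: str
    params: dict = field(default_factory=dict)   # current execution parameters
    prior_state: Optional[str] = None            # hypothetical field
    pending_params: Optional[dict] = None        # hypothetical field

    def request_replace(self, new_params: dict) -> None:
        if self.state not in REVERT_TARGETS:
            raise ValueError(f"cannot replace from state {self.state!r}")
        self.prior_state = self.state            # remember where to revert to
        self.pending_params = dict(new_params)   # remember what was requested
        self.state = "pending_replace"

    def _leave_pending(self) -> None:
        if self.state != "pending_replace" or self.prior_state is None:
            raise ValueError("no replace in flight")
        self.state = self.prior_state
        self.prior_state = None

    def on_replace_rejected(self) -> None:
        """Broker rejected the replace: revert state, discard requested values."""
        self._leave_pending()
        self.pending_params = None

    def on_replace_confirmed(self) -> None:
        """Broker confirmed the replace: apply the values that were requested."""
        self._leave_pending()
        self.params.update(self.pending_params or {})
        self.pending_params = None
```

This sketch does not handle a fill arriving while the replace is pending, which would change the correct revert target from working to partially_filled; a fuller version would update `prior_state` on such a fill.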

Claude Sonnet Unique Findings (not in either other model)

At most one active order per instrument+action+position_effect — No stated mechanism for concurrent orders to same instrument/direction. Without ordering guarantees, same lot could be closed twice. (HIGH)
Fill events must be processed in filled_at order per order — Out-of-order processing produces incorrect intermediate states even if final totals are correct; could trigger unnecessary fill-override path. (MEDIUM)
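The fill-ordering constraint can be illustrated with a short sketch that groups a batch of fills by order and sorts each group by `filled_at` before they are applied. Representing fills as dicts and keying on `broker_order_id` are assumptions for this example.

```python
from collections import defaultdict
from datetime import datetime


def fills_in_processing_order(fills: list[dict]) -> dict[str, list[dict]]:
    """Group fills by broker_order_id and sort each group by filled_at."""
    grouped: dict[str, list[dict]] = defaultdict(list)
    for f in fills:
        grouped[f["broker_order_id"]].append(f)
    return {
        order_id: sorted(group, key=lambda f: f["filled_at"])
        for order_id, group in grouped.items()
    }
```

Sorting a batch only helps for fills that arrive together; a fill that arrives after later ones have already been applied would still need separate handling.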

Findings Unique to GPT-5

#	Finding	Severity
1	Multi-broker correlation ambiguity	CRITICAL
2	Fill deduplication requires unique fill ID	CRITICAL
3	Order retention for late fills	CRITICAL
4	"What" vs "how" immutability boundary	HIGH
5	Replace cannot reduce quantity below fills	HIGH
6	`instrument_id` over `ticker` for joins	HIGH
7	No local expiry timers for GTC	MEDIUM

Findings Unique to Opus

#	Finding	Severity
1	Pre-modification state tracking for revert	HIGH
2	`position_effect` consistency with position	HIGH
3	`pending_replace` pending parameter tracking	HIGH
4	Terminal override only by fills	HIGH

Findings Unique to Sonnet

#	Finding	Severity
1	One active order per instrument+action+effect	HIGH
2	Fill processing order per broker_order_id	MEDIUM

Quality Assessment

GPT-5 produced the most findings (15) and the most CRITICAL-severity issues (5). The multi-broker correlation gap and the fill deduplication constraint are the kind of omission that causes production incidents. GPT-5's strength is systematically checking every field and relationship for unstated dependencies; its 6,976 reasoning tokens are consistent with that broad exploration.

Claude Opus found 14 constraints with strong focus on state machine correctness — the pre-modification state tracking and pending parameter tracking findings show Opus reasoning about the lifecycle of state, not just the state itself. Opus's characteristic strength (finding design tensions) manifests as finding where the document implies mechanism without specifying it.

Claude Sonnet was fastest (61s) but produced the fewest findings (10). Its two unique findings (the one-active-order constraint and fill ordering) are both valid but lower severity. Sonnet identifies correct constraints without pursuing their implications as deeply: for example, it mentions fill ordering but does not trace the cascade to lot management the way GPT-5 would.

Key Insight — "Unstated constraints" as an analytical lens

This is a productive new lens for specification review. It differs from three related analyses:

Gap analysis asks what is missing; this lens finds what is IMPLIED but not stated.
Race condition analysis asks what timing issues exist; this lens finds static invariants.
Ambiguity analysis asks what is unclear; this lens finds definite constraints.

The findings here are all things a developer might violate because they're not documented. Each model approaches this differently:

GPT-5: Exhaustively checks every field for nullability, uniqueness, and consistency invariants. Catches the operational/infrastructure constraints (retention, deduplication).
Opus: Reasons about state machine lifecycle and what must be tracked to support transitions. Catches the "how do you implement this" constraints.
Sonnet: Identifies the most obvious constraints quickly but doesn't explore edge cases.

Practical Implication

For state machine specification review:

Run GPT-5 first: it catches the data integrity and operational constraints.
Run Opus second: it catches the state lifecycle and implementation mechanism constraints.
Run Sonnet only if time-constrained: it returned roughly 30% fewer findings than the others (10 vs. 14 and 15).

Union of all 3 models: 21 distinct unstated constraints identified. Single-model coverage:

GPT-5 alone: 15/21 (71%)
Opus alone: 14/21 (67%)
Sonnet alone: 10/21 (48%)
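The coverage percentages above follow directly from the finding counts and the reported union size. A short check of the arithmetic:

```python
UNION_SIZE = 21  # distinct constraints across all three models, as reported
findings = {"GPT-5": 15, "Opus": 14, "Sonnet": 10}

# Single-model coverage as a whole-number percentage of the union.
coverage = {model: round(100 * n / UNION_SIZE) for model, n in findings.items()}

for model, pct in coverage.items():
    print(f"{model} alone: {findings[model]}/{UNION_SIZE} ({pct}%)")
```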

The multi-model approach is especially valuable for specifications because the cost of missing a constraint is high — it becomes a production bug.

Cost Comparison

Model	Output tokens/Finding	Time/Finding
GPT-5	615	7.1s
Opus	257	4.6s
Sonnet	344	6.1s

Opus is most token-efficient for this task. GPT-5's higher token count reflects the detailed reasoning but yields more CRITICAL findings. For specification review where CRITICAL constraints matter most, GPT-5 justifies the cost.
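The per-finding figures in the table can be reproduced from the Performance table by dividing output tokens and wall-clock time by the number of findings. Reasoning tokens are not included, since they are reported only for GPT-5.

```python
# (seconds, output tokens, findings) per model, from the Performance table.
runs = {
    "GPT-5": (107, 9228, 15),
    "Opus": (65, 3601, 14),
    "Sonnet": (61, 3442, 10),
}

tokens_per_finding = {m: round(tok / n) for m, (_, tok, n) in runs.items()}
seconds_per_finding = {m: round(sec / n, 1) for m, (sec, _, n) in runs.items()}

for m in runs:
    print(f"{m}: {tokens_per_finding[m]} tokens/finding, {seconds_per_finding[m]}s/finding")
```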
