Add finding #53: unstated constraint detection on state machines
New analytical lens tested on gargoyle order-state-machine.md:
- GPT-5: 15 findings (most CRITICAL issues, exhaustive field analysis)
- Opus: 14 findings (state lifecycle focus, implementation mechanisms)
- Sonnet: 10 findings (fast but shallow)

Key insight: "unstated constraints" finds what's IMPLIED but not stated, distinct from gaps, race conditions, or ambiguities. GPT-5 is best for catching CRITICAL data integrity constraints; Opus for state machine implementation details.
@@ -0,0 +1,134 @@
# Experiment 53: Unstated Constraint Detection on State Machine Specification

**Date:** 2026-05-09
**Document:** gargoyle `order-state-machine.md` (~260 lines)
**Task:** Identify unstated constraints: invariants that MUST be true for the system to work correctly but are never explicitly stated in the document.

## Method

All 3 models received the same document (full text) and the same analytical prompt via the HAI proxy. The prompt required structured output: constraint statement, evidence quotes, failure mode, and severity (CRITICAL/HIGH/MEDIUM). No tools and no project context beyond the document were provided. Each run was a single prompt with no conversation history.
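The structured output required by the prompt can be pictured as a small record type. The following is a minimal sketch, not the prompt or schema actually used: the `Finding` class, its field names, and the `parse_finding` helper are assumptions chosen for illustration. Only the four required parts and the three severity levels come from the method description.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    CRITICAL = "CRITICAL"
    HIGH = "HIGH"
    MEDIUM = "MEDIUM"


@dataclass(frozen=True)
class Finding:
    """One unstated constraint reported by a model (hypothetical schema)."""
    constraint: str            # the invariant, stated explicitly
    evidence: tuple[str, ...]  # quotes from the document that imply it
    failure_mode: str          # what breaks if the invariant is violated
    severity: Severity


def parse_finding(raw: dict) -> Finding:
    """Validate one raw model output record and convert it to a Finding."""
    if not raw.get("evidence"):
        raise ValueError("a finding needs at least one evidence quote")
    return Finding(
        constraint=raw["constraint"],
        evidence=tuple(raw["evidence"]),
        failure_mode=raw["failure_mode"],
        severity=Severity(raw["severity"]),  # ValueError on an unknown level
    )
```

Requiring at least one evidence quote is a design choice of this sketch: it keeps each reported constraint traceable to text in the reviewed document.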

## Performance

| Model | Time | Output tokens | Reasoning tokens | Findings |
|---|---|---|---|---|
| GPT-5 | 107s | 9,228 | 6,976 | 15 |
| Claude Sonnet 4.6 | 61s | 3,442 | (internal) | 10 |
| Claude Opus 4.6 | 65s | 3,601 | (internal) | 14 |

## Findings Comparison

### Common Ground (all 3 identified)

1. **`broker_order_id` uniqueness**: Must be unique and immutable; fill correlation depends on it
2. **`filled_quantity` ≤ `quantity`**: Cannot exceed requested amount
3. **`filled_avg_price` must be quantity-weighted**: Not simple average
4. **`terminated_at` ↔ terminal state synchronization**: Must be cleared on fill override, re-set on termination
5. **`limit_price` nullity linked to `order_type`**: Non-null iff limit order
6. **`expires_at` nullity linked to `time_in_force='gtd'`**: Non-null iff gtd
7. **`decision_id` → order is one-to-one**: No multi-order splitting per decision
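Most of the shared constraints are field-level invariants that can be checked mechanically. The following is a minimal sketch under stated assumptions: the field names follow the identifiers quoted above, but the `Order` class layout, the `"limit"` order-type literal, the contents of `TERMINAL_STATES`, and the helper names are illustrative, since the reviewed schema is not reproduced here.

```python
from dataclasses import dataclass
from datetime import datetime
from decimal import Decimal
from typing import Optional

# Assumed set; the findings name only `cancelled` and `expired` explicitly.
TERMINAL_STATES = {"filled", "cancelled", "expired", "rejected"}


@dataclass
class Order:
    state: str
    order_type: str
    time_in_force: str
    quantity: Decimal
    filled_quantity: Decimal = Decimal("0")
    limit_price: Optional[Decimal] = None
    expires_at: Optional[datetime] = None
    terminated_at: Optional[datetime] = None


def violations(order: Order) -> list[str]:
    """Return the names of the shared invariants that this order breaks."""
    broken = []
    if not (Decimal("0") <= order.filled_quantity <= order.quantity):
        broken.append("filled_quantity <= quantity")
    if (order.limit_price is not None) != (order.order_type == "limit"):
        broken.append("limit_price set iff limit order")
    if (order.expires_at is not None) != (order.time_in_force == "gtd"):
        broken.append("expires_at set iff gtd")
    if (order.terminated_at is not None) != (order.state in TERMINAL_STATES):
        broken.append("terminated_at set iff terminal state")
    return broken


def weighted_avg_price(fills: list[tuple[Decimal, Decimal]]) -> Decimal:
    """Quantity-weighted average over (quantity, price) fills."""
    total_qty = sum((qty for qty, _ in fills), Decimal("0"))
    if total_qty == 0:
        raise ValueError("no filled quantity")
    return sum((qty * price for qty, price in fills), Decimal("0")) / total_qty
```

For fills of 100 at 10 and 300 at 12, `weighted_avg_price` gives 11.5, whereas a simple average of the two prices would give 11. That difference is why constraint 3 matters.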

### GPT-5 Unique Findings (not in either Claude model)

1. **Multi-broker correlation ambiguity**: The fill schema has no broker field; with multiple brokers, `broker_order_id` must be globally unique across brokers or the system must guarantee a single broker source. (CRITICAL)

2. **Fill ledger deduplication requires unique fill identity**: The document mentions idempotent state transitions, but fills are append-only; duplicate fill messages would corrupt the ledger unless there is a fill-level unique ID. (CRITICAL)

3. **Order retention for late fills**: Orders and the `broker_order_id` mapping cannot be garbage-collected immediately after a terminal state; they are needed for late fill processing and reconciliation. (CRITICAL)

4. **"What" vs "how" immutability after submission**: `instrument_id`, `action`, `position_effect`, and `decision_id` must not be changed by a replace; only execution parameters can be modified. (HIGH)

5. **Replace cannot reduce `quantity` below `filled_quantity`**: Shares cannot be "unfilled"; `quantity` < `filled_quantity` is an impossible state. (HIGH)

6. **`instrument_id` vs `ticker` for correlation**: Must use `instrument_id` as the primary key, never `ticker` (which can change via corporate actions). (HIGH)

7. **Local expiry timers only for appropriate TIFs**: No local expiry timer should be created for GTC orders. (MEDIUM)
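Findings 1, 2, and 5 lend themselves to a short illustration. This is a sketch under assumptions, not the reviewed system's design: the findings state that the fill schema has no broker field and no fill-level ID, so `broker_id` and `fill_id` below are hypothetical additions, as are the `FillLedger` class and the `validate_replace` function.

```python
from dataclasses import dataclass, field
from decimal import Decimal


@dataclass(frozen=True)
class Fill:
    broker_id: str        # hypothetical: scopes broker_order_id to one broker
    broker_order_id: str
    fill_id: str          # hypothetical: unique per fill within a broker
    quantity: Decimal
    price: Decimal


@dataclass
class FillLedger:
    """Append-only ledger that ignores redelivered fill messages."""
    fills: list[Fill] = field(default_factory=list)
    _seen: set[tuple[str, str]] = field(default_factory=set)

    def append(self, fill: Fill) -> bool:
        """Record the fill; return False if it was already recorded."""
        key = (fill.broker_id, fill.fill_id)
        if key in self._seen:
            return False
        self._seen.add(key)
        self.fills.append(fill)
        return True

    def filled_quantity(self, broker_id: str, broker_order_id: str) -> Decimal:
        """Sum the recorded fill quantities for one broker-scoped order."""
        return sum(
            (f.quantity for f in self.fills
             if (f.broker_id, f.broker_order_id) == (broker_id, broker_order_id)),
            Decimal("0"),
        )


def validate_replace(new_quantity: Decimal, filled_quantity: Decimal) -> None:
    """Reject a replace that would drop quantity below what has already filled."""
    if new_quantity < filled_quantity:
        raise ValueError("replace quantity is below filled_quantity")
```

Keying deduplication on the pair `(broker_id, fill_id)` rather than on `fill_id` alone reflects the multi-broker concern: two brokers could otherwise issue colliding identifiers.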

### Claude Opus Unique Findings (not in either other model)

1. **Pre-modification state must be tracked for revert**: A `pending_cancel`/`pending_replace` rejection must revert to the correct state (`working` OR `partially_filled`, not always `working`). The document mentions both as valid revert targets. (HIGH)

2. **`position_effect` consistency with actual position**: When `close`, a position must already exist; when `open`, there must be no contradictory close. Otherwise lot management corrupts P&L. (HIGH)

3. **`pending_replace` must track pending new values**: Upon broker confirmation, the system must know which new parameters were requested in order to apply them. No field exists in Order to track this. (HIGH)

4. **Terminal state override ONLY by fills**: `cancelled`/`expired` can only be reactivated by fills, not by any other broker event. This bounds the reactivation surface. (HIGH)
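Findings 1 and 3 both amount to extra bookkeeping while a replace is in flight. The following is a minimal sketch, assuming hypothetical fields `prior_state` and `pending_params` (the finding notes that no such field exists in Order) and hypothetical method names. Returning to the prior state on confirmation is also an assumption; only the state names come from the findings above.

```python
from dataclasses import dataclass, field
from typing import Optional

REVERTIBLE_STATES = {"working", "partially_filled"}


@dataclass
class OrderState:
    state: str = "working"
    params: dict = field(default_factory=dict)
    prior_state: Optional[str] = None      # hypothetical field
    pending_params: Optional[dict] = None  # hypothetical field

    def request_replace(self, new_params: dict) -> None:
        """Enter pending_replace, remembering where we came from and what was asked."""
        if self.state not in REVERTIBLE_STATES:
            raise ValueError(f"cannot replace from state {self.state!r}")
        self.prior_state = self.state
        self.pending_params = dict(new_params)
        self.state = "pending_replace"

    def on_replace_confirmed(self) -> None:
        """Apply the remembered parameters and leave pending_replace."""
        self._require_pending()
        self.params.update(self.pending_params)
        self.state = self.prior_state  # assumption: resume the prior state
        self._clear()

    def on_replace_rejected(self) -> None:
        """Discard the request and revert to working OR partially_filled."""
        self._require_pending()
        self.state = self.prior_state
        self._clear()

    def _require_pending(self) -> None:
        if self.state != "pending_replace" or self.prior_state is None:
            raise ValueError("no replace in flight")

    def _clear(self) -> None:
        self.prior_state = None
        self.pending_params = None
```

The sketch deliberately ignores fills that arrive while the replace is pending; a fuller version would need to update `prior_state` when such a fill moves the order from `working` to `partially_filled`.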

### Claude Sonnet Unique Findings (not in either other model)

1. **At most one active order per instrument+action+position_effect**: No stated mechanism for concurrent orders to the same instrument/direction. Without ordering guarantees, the same lot could be closed twice. (HIGH)

2. **Fill events must be processed in `filled_at` order per order**: Out-of-order processing produces incorrect intermediate states even if final totals are correct; could trigger an unnecessary fill-override path. (MEDIUM)
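The first Sonnet constraint can be checked with a simple grouping pass. This is a sketch under assumptions: orders are plain dicts for brevity, the function name is hypothetical, and treating exactly the four states named elsewhere in these findings as "active" is a guess about the reviewed specification.

```python
from collections import Counter

# States named in the findings; treating exactly these as "active" is an assumption.
ACTIVE_STATES = {"working", "partially_filled", "pending_cancel", "pending_replace"}


def conflicting_keys(orders):
    """Return (instrument_id, action, position_effect) keys with >1 active order.

    `orders` is an iterable of dicts with the keys `instrument_id`, `action`,
    `position_effect`, and `state`.
    """
    counts = Counter(
        (o["instrument_id"], o["action"], o["position_effect"])
        for o in orders
        if o["state"] in ACTIVE_STATES
    )
    return sorted(key for key, n in counts.items() if n > 1)
```

An empty result means the constraint holds for the given snapshot; any returned key identifies a group of orders that could close the same lot twice.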

### Findings Unique to GPT-5

| # | Finding | Severity |
|---|---|---|
| 1 | Multi-broker correlation ambiguity | CRITICAL |
| 2 | Fill deduplication requires unique fill ID | CRITICAL |
| 3 | Order retention for late fills | CRITICAL |
| 4 | "What" vs "how" immutability boundary | HIGH |
| 5 | Replace cannot reduce quantity below fills | HIGH |
| 6 | `instrument_id` over `ticker` for joins | HIGH |
| 7 | No local expiry timers for GTC | MEDIUM |

### Findings Unique to Opus

| # | Finding | Severity |
|---|---|---|
| 1 | Pre-modification state tracking for revert | HIGH |
| 2 | `position_effect` consistency with position | HIGH |
| 3 | `pending_replace` pending parameter tracking | HIGH |
| 4 | Terminal override only by fills | HIGH |

### Findings Unique to Sonnet

| # | Finding | Severity |
|---|---|---|
| 1 | One active order per instrument+action+effect | HIGH |
| 2 | Fill processing order per `broker_order_id` | MEDIUM |

## Quality Assessment

**GPT-5** produced the most findings (15) and the most CRITICAL-severity issues (5). The multi-broker correlation gap and the fill deduplication constraint are genuinely important: these are exactly the kinds of omissions that cause production incidents. GPT-5's strength is systematically checking every field and relationship for unstated dependencies. Its 6,976 reasoning tokens suggest extensive exploration.

**Claude Opus** found 14 constraints with a strong focus on state machine correctness. The pre-modification state tracking and pending parameter tracking findings show Opus reasoning about the *lifecycle* of state, not just the state itself. Opus's characteristic strength (finding design tensions) shows up here as spotting where the document implies a mechanism without specifying it.

**Claude Sonnet** was fastest (61s) but found the fewest constraints (10). Its unique findings (the one-active-order constraint and fill ordering) are both valid but lower severity. Sonnet identifies correct constraints but doesn't pursue the implications as deeply: for example, it mentions fill ordering but doesn't trace the cascade to lot management the way GPT-5 would.

## Key Insight: "Unstated constraints" as an analytical lens

This is a productive new lens for specification review. It differs from the existing lenses as follows:

- **Gap analysis** (what's missing?): this lens finds what's IMPLIED but not stated
- **Race condition analysis** (what timing issues?): this lens finds static invariants
- **Ambiguity analysis** (what's unclear?): this lens finds definite constraints

The findings here are all things a developer might violate because they're not documented. Each model approaches this differently:

- **GPT-5**: Exhaustively checks every field for nullability, uniqueness, and consistency invariants. Catches the operational/infrastructure constraints (retention, deduplication).
- **Opus**: Reasons about state machine lifecycle and what must be tracked to support transitions. Catches the "how do you implement this" constraints.
- **Sonnet**: Identifies the most obvious constraints quickly but doesn't explore edge cases.

## Practical Implication

For state machine specification review:

1. **Run GPT-5 first**: catches the data integrity and operational constraints
2. **Run Opus second**: catches the state lifecycle and implementation mechanism constraints
3. **Sonnet**: only if time-constrained; it reported roughly 30% fewer findings than either other model (10 vs. 14 and 15)

Union of all 3 models: 21 distinct unstated constraints identified. Single-model coverage:

- GPT-5 alone: 15/21 (71%)
- Opus alone: 14/21 (67%)
- Sonnet alone: 10/21 (48%)

The multi-model approach is especially valuable for specifications because the cost of missing a constraint is high: it becomes a production bug.
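The single-model coverage percentages follow directly from the finding counts and the reported union of 21. A small sketch of that arithmetic; the variable and function names are illustrative, and the counts are copied from the figures above.

```python
# Finding counts per model and the reported size of the union.
FINDINGS = {"GPT-5": 15, "Opus": 14, "Sonnet": 10}
UNION_SIZE = 21


def coverage_percent(found: int, union_size: int = UNION_SIZE) -> int:
    """Share of the union that one model found, rounded to a whole percent."""
    return round(100 * found / union_size)


for model, found in FINDINGS.items():
    missed = UNION_SIZE - found
    print(f"{model}: {found}/{UNION_SIZE} ({coverage_percent(found)}%), missed {missed}")
```

Seen this way, even the best single model leaves 6 of the 21 constraints undetected, which is the quantitative case for running more than one.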

## Cost Comparison

| Model | Output tokens/finding | Time/finding |
|---|---|---|
| GPT-5 | 615 | 7.1s |
| Opus | 257 | 4.6s |
| Sonnet | 344 | 6.1s |

Opus is the most token-efficient for this task and also the fastest per finding. GPT-5's higher token count reflects its detailed reasoning but yields more CRITICAL findings. For specification review, where CRITICAL constraints matter most, GPT-5 justifies the cost.
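The per-finding figures can be reproduced from the Performance table by dividing output tokens and wall-clock time by the number of findings. A short sketch, with illustrative names and the inputs copied from that table:

```python
# Inputs copied from the Performance table.
RUNS = {
    "GPT-5": {"seconds": 107, "output_tokens": 9228, "findings": 15},
    "Opus": {"seconds": 65, "output_tokens": 3601, "findings": 14},
    "Sonnet": {"seconds": 61, "output_tokens": 3442, "findings": 10},
}


def per_finding(run: dict) -> tuple[int, float]:
    """Return (output tokens per finding, seconds per finding)."""
    tokens = round(run["output_tokens"] / run["findings"])
    seconds = round(run["seconds"] / run["findings"], 1)
    return tokens, seconds


for model, run in RUNS.items():
    tokens, seconds = per_finding(run)
    print(f"{model}: {tokens} tokens/finding, {seconds}s/finding")
```

These ratios count every finding equally, so they do not capture the severity weighting that the closing paragraph uses to justify GPT-5's cost.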