Add finding #53: unstated constraint detection on state machines

New analytical lens tested on gargoyle order-state-machine.md:
- GPT-5: 15 findings (most CRITICAL issues, exhaustive field analysis)
- Opus: 14 findings (state lifecycle focus, implementation mechanisms)
- Sonnet: 10 findings (fast but shallow)

Key insight: "unstated constraints" finds what's IMPLIED but not stated, distinct from gaps, race conditions, or ambiguities. GPT-5 is best for catching CRITICAL data integrity constraints; Opus for state machine implementation details.
2026-05-08 23:47:51 -07:00
parent c1ca8cfe46
commit 9d0a94bd68
1 changed files with 134 additions and 0 deletions
@@ -0,0 +1,134 @@
+# Experiment 53: Unstated Constraint Detection on State Machine Specification
+
+**Date:** 2026-05-09
+**Document:** gargoyle `order-state-machine.md` (~260 lines)
+**Task:** Identify unstated constraints — invariants that MUST be true for the system to work correctly but are never explicitly stated in the document.
+
+## Method
+
+All three models received the same document (full text) and the same analytical prompt via the HAI proxy. The prompt required structured output for each finding: constraint statement, evidence quotes, failure mode, and severity (CRITICAL/HIGH/MEDIUM). The models had no tools and no project context beyond the document, and each run was a single prompt with no conversation history.
+
+## Performance
+
+| Model | Time | Output tokens | Reasoning tokens | Findings |
+|---|---|---|---|---|
+| GPT-5 | 107s | 9,228 | 6,976 | 15 |
+| Claude Sonnet 4.6 | 61s | 3,442 | (internal) | 10 |
+| Claude Opus 4.6 | 65s | 3,601 | (internal) | 14 |
+
+## Findings Comparison
+
+### Common Ground (all 3 identified)
+
+1. **`broker_order_id` uniqueness** — Must be unique and immutable; fill correlation depends on it
+2. **`filled_quantity` ≤ `quantity`** — Cannot exceed requested amount
+3. **`filled_avg_price` must be quantity-weighted** — Not simple average
+4. **`terminated_at` ↔ terminal state synchronization** — Must be cleared on fill override, re-set on termination
+5. **`limit_price` nullity linked to `order_type`** — Non-null iff limit order
+6. **`expires_at` nullity linked to `time_in_force='gtd'`** — Non-null iff gtd
+7. **`decision_id` → order is one-to-one** — No multi-order splitting per decision
+
+### GPT-5 Unique Findings (not in either Claude model)
+
+1. **Multi-broker correlation ambiguity** — Fill schema has no broker field; if multi-broker, `broker_order_id` must be globally unique across brokers or system must guarantee single broker source. (CRITICAL)
+
+2. **Fill ledger deduplication requires unique fill identity** — The document mentions idempotent state transitions but fills are append-only; duplicate fill messages would corrupt the ledger unless there's a fill-level unique ID. (CRITICAL)
+
+3. **Order retention for late fills** — Orders and `broker_order_id` mapping cannot be GCed immediately after terminal state; needed for late fill processing and reconciliation. (CRITICAL)
+
+4. **"What" vs "how" immutability after submission** — `instrument_id`, `action`, `position_effect`, `decision_id` must not be changed by replace; only execution parameters can be modified. (HIGH)
+
+5. **Replace cannot reduce quantity below filled_quantity** — You can't "unfill" shares; quantity < filled_quantity creates impossible state. (HIGH)
+
+6. **`instrument_id` vs `ticker` for correlation** — Must use `instrument_id` as primary key, never `ticker` (which can change via corporate actions). (HIGH)
+
+7. **Local expiry timers only for appropriate TIFs** — No local expiry timer should be created for GTC orders. (MEDIUM)
+
+### Claude Opus Unique Findings (not in either other model)
+
+1. **Pre-modification state must be tracked for revert** — `pending_cancel`/`pending_replace` rejection must revert to correct state (`working` OR `partially_filled`, not always `working`). The document mentions both as valid revert targets. (HIGH)
+
+2. **`position_effect` consistency with actual position** — When `close`, must have existing position; when `open`, no contradictory close. Otherwise lot management corrupts P&L. (HIGH)
+
+3. **`pending_replace` must track pending new values** — Upon broker confirmation, system must know what new parameters were requested to apply them. No field exists in Order to track this. (HIGH)
+
+4. **Terminal state override ONLY by fills** — `cancelled`/`expired` can only be reactivated by fills, not by any other broker event. This bounds the reactivation surface. (HIGH)
+
+### Claude Sonnet Unique Findings (not in either other model)
+
+1. **At most one active order per instrument+action+position_effect** — No stated mechanism for concurrent orders to same instrument/direction. Without ordering guarantees, same lot could be closed twice. (HIGH)
+
+2. **Fill events must be processed in `filled_at` order per order** — Out-of-order processing produces incorrect intermediate states even if final totals are correct; could trigger unnecessary fill-override path. (MEDIUM)
+
+### Findings Unique to GPT-5
+
+| # | Finding | Severity |
+|---|---|---|
+| 1 | Multi-broker correlation ambiguity | CRITICAL |
+| 2 | Fill deduplication requires unique fill ID | CRITICAL |
+| 3 | Order retention for late fills | CRITICAL |
+| 4 | "What" vs "how" immutability boundary | HIGH |
+| 5 | Replace cannot reduce quantity below fills | HIGH |
+| 6 | `instrument_id` over `ticker` for joins | HIGH |
+| 7 | No local expiry timers for GTC | MEDIUM |
+
+### Findings Unique to Opus
+
+| # | Finding | Severity |
+|---|---|---|
+| 1 | Pre-modification state tracking for revert | HIGH |
+| 2 | `position_effect` consistency with position | HIGH |
+| 3 | `pending_replace` pending parameter tracking | HIGH |
+| 4 | Terminal override only by fills | HIGH |
+
+### Findings Unique to Sonnet
+
+| # | Finding | Severity |
+|---|---|---|
+| 1 | One active order per instrument+action+effect | HIGH |
+| 2 | Fill processing in `filled_at` order, per order | MEDIUM |
+
+## Quality Assessment
+
+**GPT-5** produced the most findings (15) and the most CRITICAL-severity issues (5). The multi-broker correlation gap and the fill deduplication constraint are the kind of omissions that cause production incidents. GPT-5's strength is systematically checking every field and relationship for unstated dependencies; its 6,976 reasoning tokens are consistent with that exhaustive exploration.
+
+**Claude Opus** found 14 constraints, with a strong focus on state machine correctness. The pre-modification state tracking and pending parameter tracking findings show Opus reasoning about the *lifecycle* of state, not just the state itself. Its characteristic strength (finding design tensions) shows up here as spotting where the document implies a mechanism without specifying it.
+
+**Claude Sonnet** was fastest (61s) but found the fewest constraints (10). Its two unique findings (the one-active-order constraint and fill ordering) are both valid, but neither is CRITICAL. Sonnet identifies correct constraints but doesn't pursue their implications as deeply: for example, it mentions fill ordering but doesn't trace the cascade to lot management.
+
+## Key Insight — "Unstated constraints" as an analytical lens
+
+This is a productive new lens for specification review. It differs from three related lenses:
+- **Gap analysis** (what's missing?) — this finds what's IMPLIED but not stated
+- **Race condition analysis** (what timing issues?) — this finds static invariants
+- **Ambiguity analysis** (what's unclear?) — this finds definite constraints
+
+The findings here are all things a developer might violate because they're not documented. Each model approaches this differently:
+
+- **GPT-5**: Exhaustively checks every field for nullability, uniqueness, and consistency invariants. Catches the operational/infrastructure constraints (retention, deduplication).
+- **Opus**: Reasons about state machine lifecycle and what must be tracked to support transitions. Catches the "how do you implement this" constraints.
+- **Sonnet**: Identifies the most obvious constraints quickly but doesn't explore edge cases.
+
+## Practical Implication
+
+For state machine specification review:
+1. **Run GPT-5 first**: catches the data integrity and operational constraints
+2. **Run Opus second**: catches the state lifecycle and implementation mechanism constraints
+3. **Sonnet**: only if time-constrained; it returned roughly 30% fewer findings than either other model (10 vs. 14 and 15) and on its own covers under half of the combined set
+
+Union of all 3 models: 21 distinct unstated constraints identified. Single-model coverage:
+- GPT-5 alone: 15/21 (71%)
+- Opus alone: 14/21 (67%)
+- Sonnet alone: 10/21 (48%)
+
+The multi-model approach is especially valuable for specifications because the cost of missing a constraint is high — it becomes a production bug.
+
+## Cost Comparison
+
+| Model | Tokens/Finding | Time/Finding |
+|---|---|---|
+| GPT-5 | 615 | 7.1s |
+| Opus | 257 | 4.6s |
+| Sonnet | 344 | 6.1s |
+
+Opus is the most token-efficient model for this task and also the fastest per finding. GPT-5's higher token count reflects its detailed reasoning, and it yields more CRITICAL findings. For specification review, where CRITICAL constraints matter most, GPT-5 justifies the cost.
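
The per-finding figures in the Cost Comparison table and the single-model coverage percentages follow directly from the Performance table. Below is a minimal Python sketch that reproduces them. It assumes that tokens per finding means the "Output tokens" column divided by the finding count; the report does not say whether that column includes reasoning tokens. The `Run` class and the helper names are illustrative and not part of the experiment tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One model run, copied from the Performance table."""
    model: str
    seconds: int
    output_tokens: int
    findings: int


RUNS = [
    Run("GPT-5", 107, 9228, 15),
    Run("Opus", 65, 3601, 14),
    Run("Sonnet", 61, 3442, 10),
]

# Distinct constraints across all three models, as reported in the write-up.
UNION_SIZE = 21


def per_finding(run: Run) -> tuple[int, float]:
    """Return (output tokens per finding, seconds per finding)."""
    tokens = round(run.output_tokens / run.findings)
    seconds = round(run.seconds / run.findings, 1)
    return tokens, seconds


def coverage_pct(run: Run, union_size: int = UNION_SIZE) -> int:
    """Share of the union that this model found on its own, in percent."""
    return round(100 * run.findings / union_size)


if __name__ == "__main__":
    for run in RUNS:
        tokens, seconds = per_finding(run)
        print(
            f"{run.model}: {tokens} tokens/finding, "
            f"{seconds}s/finding, {coverage_pct(run)}% of union"
        )
```

Keeping the raw measurements in one place and deriving the ratios makes it easy to re-run the comparison when a new model or a new document is added.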