diff --git a/findings/2026-05-09-53-unstated-constraints.md b/findings/2026-05-09-53-unstated-constraints.md
new file mode 100644
index 0000000..2ff29bc
--- /dev/null
+++ b/findings/2026-05-09-53-unstated-constraints.md
@@ -0,0 +1,134 @@
+# Experiment 53: Unstated Constraint Detection on State Machine Specification
+
+**Date:** 2026-05-09
+**Document:** gargoyle `order-state-machine.md` (~260 lines)
+**Task:** Identify unstated constraints: invariants that MUST be true for the system to work correctly but are never explicitly stated in the document.
+
+## Method
+
+The same document (full text) and the same analytical prompt were sent to all 3 models via HAI proxy. The prompt required structured output: constraint statement, evidence quotes, failure mode, and severity (CRITICAL/HIGH/MEDIUM). No tools and no project context beyond the document were provided. Each run was a single prompt with no conversation history.
+
+## Performance
+
+| Model | Time | Output tokens | Reasoning tokens | Findings |
+|---|---|---|---|---|
+| GPT-5 | 107s | 9,228 | 6,976 | 15 |
+| Claude Sonnet 4.6 | 61s | 3,442 | (internal) | 10 |
+| Claude Opus 4.6 | 65s | 3,601 | (internal) | 14 |
+
+## Findings Comparison
+
+### Common Ground (all 3 identified)
+
+1. **`broker_order_id` uniqueness**: Must be unique and immutable; fill correlation depends on it
+2. **`filled_quantity` ≤ `quantity`**: Cannot exceed requested amount
+3. **`filled_avg_price` must be quantity-weighted**: Not a simple average
+4. **`terminated_at` ↔ terminal state synchronization**: Must be cleared on fill override and re-set on termination
+5. **`limit_price` nullity linked to `order_type`**: Non-null iff limit order
+6. **`expires_at` nullity linked to `time_in_force='gtd'`**: Non-null iff gtd
+7. **`decision_id` → order is one-to-one**: No multi-order splitting per decision
+
+### GPT-5 Unique Findings (not in either Claude model)
+
+1. **Multi-broker correlation ambiguity**: The fill schema has no broker field; in a multi-broker setup, `broker_order_id` must be globally unique across brokers, or the system must guarantee a single broker source. (CRITICAL)
+
+2. **Fill ledger deduplication requires unique fill identity**: The document mentions idempotent state transitions, but fills are append-only; duplicate fill messages would corrupt the ledger unless there is a fill-level unique ID. (CRITICAL)
+
+3. **Order retention for late fills**: Orders and the `broker_order_id` mapping cannot be garbage-collected immediately after a terminal state; they are needed for late fill processing and reconciliation. (CRITICAL)
+
+4. **"What" vs "how" immutability after submission**: `instrument_id`, `action`, `position_effect`, and `decision_id` must not be changed by a replace; only execution parameters can be modified. (HIGH)
+
+5. **Replace cannot reduce quantity below `filled_quantity`**: Shares cannot be "unfilled"; `quantity` < `filled_quantity` creates an impossible state. (HIGH)
+
+6. **`instrument_id` vs `ticker` for correlation**: Must use `instrument_id` as the primary key, never `ticker` (which can change via corporate actions). (HIGH)
+
+7. **Local expiry timers only for appropriate TIFs**: No local expiry timer should be created for GTC orders. (MEDIUM)
+
+### Claude Opus Unique Findings (not found by either other model)
+
+1. **Pre-modification state must be tracked for revert**: A rejected `pending_cancel`/`pending_replace` must revert to the correct prior state (`working` OR `partially_filled`, not always `working`). The document mentions both as valid revert targets. (HIGH)
+
+2. **`position_effect` consistency with actual position**: When `close`, a position must already exist; when `open`, there must be no contradictory close. Otherwise lot management corrupts P&L. (HIGH)
+
+3. **`pending_replace` must track pending new values**: Upon broker confirmation, the system must know which new parameters were requested in order to apply them. No field exists in Order to track this. (HIGH)
+
+4. **Terminal state override ONLY by fills**: `cancelled`/`expired` can only be reactivated by fills, not by any other broker event. This bounds the reactivation surface. (HIGH)
+
+### Claude Sonnet Unique Findings (not found by either other model)
+
+1. **At most one active order per instrument+action+position_effect**: The document states no mechanism for concurrent orders on the same instrument/direction. Without ordering guarantees, the same lot could be closed twice. (HIGH)
+
+2. **Fill events must be processed in `filled_at` order per order**: Out-of-order processing produces incorrect intermediate states even if final totals are correct, and could trigger an unnecessary fill-override path. (MEDIUM)
+
+### Findings Unique to GPT-5
+
+| # | Finding | Severity |
+|---|---|---|
+| 1 | Multi-broker correlation ambiguity | CRITICAL |
+| 2 | Fill deduplication requires unique fill ID | CRITICAL |
+| 3 | Order retention for late fills | CRITICAL |
+| 4 | "What" vs "how" immutability boundary | HIGH |
+| 5 | Replace cannot reduce quantity below fills | HIGH |
+| 6 | `instrument_id` over `ticker` for joins | HIGH |
+| 7 | No local expiry timers for GTC | MEDIUM |
+
+### Findings Unique to Opus
+
+| # | Finding | Severity |
+|---|---|---|
+| 1 | Pre-modification state tracking for revert | HIGH |
+| 2 | `position_effect` consistency with position | HIGH |
+| 3 | `pending_replace` pending parameter tracking | HIGH |
+| 4 | Terminal override only by fills | HIGH |
+
+### Findings Unique to Sonnet
+
+| # | Finding | Severity |
+|---|---|---|
+| 1 | One active order per instrument+action+effect | HIGH |
+| 2 | Fill processing order per `broker_order_id` | MEDIUM |
+
+## Quality Assessment
+
+**GPT-5** produced the most findings (15) and found the most CRITICAL-severity issues (5). The multi-broker correlation gap and the fill deduplication constraint are genuinely important: these are exactly the kinds of omissions that cause production incidents. GPT-5's strength is systematically checking every field and relationship for unstated dependencies. The reasoning token count (6,976) suggests deep exploration.
+
+**Claude Opus** found 14 constraints with a strong focus on state machine correctness. The pre-modification state tracking and pending parameter tracking findings show Opus reasoning about the *lifecycle* of state, not just the state itself. Opus's characteristic strength (finding design tensions) shows up here as finding where the document implies a mechanism without specifying it.
+
+**Claude Sonnet** was fastest (61s) but found the fewest constraints (10). Its unique findings (one-active-order constraint, fill ordering) are both valid but lower severity. Sonnet identifies correct constraints but doesn't pursue their implications as deeply; for example, it mentions fill ordering but doesn't trace the cascade to lot management the way GPT-5 would.
+
+## Key Insight: "Unstated constraints" as an analytical lens
+
+This is a productive new lens for specification review. It differs from related analyses:
+- **Gap analysis** (what's missing?): this lens finds what's IMPLIED but not stated
+- **Race condition analysis** (what timing issues?): this lens finds static invariants
+- **Ambiguity analysis** (what's unclear?): this lens finds definite constraints
+
+The findings here are all things a developer might violate because they're not documented. Each model approaches the task differently:
+
+- **GPT-5**: Exhaustively checks every field for nullability, uniqueness, and consistency invariants. Catches the operational/infrastructure constraints (retention, deduplication).
+- **Opus**: Reasons about state machine lifecycle and what must be tracked to support transitions. Catches the "how do you implement this" constraints.
+- **Sonnet**: Identifies the most obvious constraints quickly but doesn't explore edge cases.
+
+## Practical Implication
+
+For state machine specification review:
+1. **Run GPT-5 first**: catches the data integrity and operational constraints
+2. **Run Opus second**: catches the state lifecycle and implementation mechanism constraints
+3. **Sonnet**: only if time-constrained; will miss ~30% of what the others find
+
+Union of all 3 models: 21 distinct unstated constraints identified. Single-model coverage:
+- GPT-5 alone: 15/21 (71%)
+- Opus alone: 14/21 (67%)
+- Sonnet alone: 10/21 (48%)
+
+The multi-model approach is especially valuable for specifications because the cost of missing a constraint is high: it becomes a production bug.
+
+## Cost Comparison
+
+| Model | Tokens/Finding | Time/Finding |
+|---|---|---|
+| GPT-5 | 615 | 7.1s |
+| Opus | 257 | 4.6s |
+| Sonnet | 344 | 6.1s |
+
+Opus is the most token-efficient model for this task and also the fastest per finding. GPT-5's higher token count reflects its detailed reasoning, but it also yields more CRITICAL findings. For specification review where CRITICAL constraints matter most, GPT-5 justifies the cost.
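
The Cost Comparison and single-model coverage figures appear to follow directly from the Performance table. The Python sketch below reproduces them under two assumptions: that Tokens/Finding divides the reported output tokens by the finding count (the report does not say whether GPT-5's reasoning tokens are counted separately), and that coverage is measured against the stated union of 21 constraints. The names `Run`, `cost_row`, `RUNS`, and `UNION_FINDINGS` are illustrative and do not come from the experiment's tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One model run, copied from the Performance table."""
    model: str
    seconds: int
    output_tokens: int
    findings: int


# Figures transcribed from the Performance table.
RUNS = [
    Run("GPT-5", 107, 9228, 15),
    Run("Opus", 65, 3601, 14),
    Run("Sonnet", 61, 3442, 10),
]

# Distinct constraints across all three models, as stated in the report.
UNION_FINDINGS = 21


def cost_row(run: Run, union: int = UNION_FINDINGS) -> dict:
    """Derive per-finding cost and union coverage for a single run."""
    return {
        "model": run.model,
        # Assumption: output tokens only, rounded to the nearest integer.
        "tokens_per_finding": round(run.output_tokens / run.findings),
        "seconds_per_finding": round(run.seconds / run.findings, 1),
        "coverage_pct": round(100 * run.findings / union),
    }


ROWS = [cost_row(run) for run in RUNS]

if __name__ == "__main__":
    for row in ROWS:
        print(
            f"{row['model']}: {row['tokens_per_finding']} tokens/finding, "
            f"{row['seconds_per_finding']}s/finding, "
            f"{row['coverage_pct']}% of union"
        )
```

Under these assumptions the derived values match the reported tables, which suggests the cost figures use output tokens alone as the numerator.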