model-research/findings/2026-05-04-20-invariant-violation-path-analysis-gpt5.md

# Finding 20: Invariant violation path analysis: GPT-5 is maximally selective (3 findings, all genuine); Opus shows unique self-correcting analytical style; new task type favors precision over exhaustiveness

**Date:** 2026-05-04
**Task:** Identify invariant violation paths in gargoyle's `user-pipeline-lifecycle.md`
(730 lines) — sequences of legal operations that can violate the system's stated or
implied invariants. NEW analytical lens not previously tested, distinct from assumption-
finding, race conditions, or coherence checking.
**How we used them:** Same document (full text) + same focused analytical question to all
3 models via HAI proxy. Highly structured prompt specifying 5 categories of invariant
violations (state machine escapes, invariant composition failures, monotonicity violations,
idempotency boundary violations, authority inversion sequences). Required specific output
format per finding. No tools, no project context beyond the document itself.

| Model | Time | Output tokens | Reasoning tokens | Findings |
|---|---|---|---|---|
| GPT-5 | 143s | 784 | 12,032 | 3 |
| Claude Opus 4.6 | 113s | 6,183 | (internal) | 7 (with 2 self-corrections) |
| Claude Sonnet 4.6 | 23s | 1,266 | (internal) | 5 |

**What they found — common ground (2+ models identified):**

- **Periodic reconciliation overrides operator manual stop** (GPT-5 #3 + Opus #5 +
  Sonnet #1): An admin who stops a pipeline via `stop_user/1` with `:admin_action`
  has their decision overridden within 5 minutes by periodic reconciliation, because
  there's no "admin stopped" state in `check_eligibility/1`. All three models
  independently identified this as the clearest authority inversion.
- **DynamicSupervisor restart bypasses eligibility gate** (Opus #1/#3 + Sonnet #2):
  When `UserPipeline.Supervisor` crashes and is restarted by OTP supervision, the
  restart bypasses `start_user/1` and `check_eligibility/1` entirely — potentially
  resuming trading while the kill switch is engaged.
- **Stale ReconciliationGate after crash** (Opus #7): After a crash-triggered
  DynamicSupervisor restart (not via `stop_user/1`), the ReconciliationGate remains
  `:ready` from the previous instance because `stop_user/1` (which resets it) was
  never called. The new OrderManager may accept orders during its own reconciliation.
- **HealthMonitor co-lifecycle violation** (Opus #2 + Sonnet #4): After a
  DynamicSupervisor-initiated restart, the HealthMonitor is still subscribed to the
  old PIDs — no code re-establishes monitoring for the new pipeline processes.

**GPT-5 unique findings (not in either other model):**

- **Kill switch bypass for users configured DURING engagement** (#1): A user who
  saves credentials while the kill switch is engaged is never added to the pending
  operator release set (only running pipelines are added at engage time). After
  disengage, periodic reconciliation auto-starts this user's pipeline without
  operator release — violating "resuming always requires human judgment." This is
  the most precisely reasoned finding across all three models: each step is
  individually correct per the spec, and the violation emerges purely from the
  composition of legal operations.
- **Premature release bypass** (#2): If `operator_release_user/1` is called while
  the kill switch is still engaged (a legal operation), it clears the pending
  release flag but `start_user/1` correctly refuses. After later disengage, the
  flag is gone — auto-start proceeds without fresh operator judgment. The release
  was "spent" at the wrong time.

**Claude Opus unique findings (not in either other model):**

- **`operator_release_system/0` clears unrelated safety obligations** (#4):
  Operator intends to release one user from a recent event but
  `operator_release_system/0` also releases other users still pending from an
  earlier, unresolved event. One release call discharges multiple independent
  safety obligations — monotonicity violation.
- **State machine incompleteness for blocked users** (#6): Users who become
  configured during kill switch engagement (blocked with reason
  `:kill_switch_engaged`) have no state machine transition back to `starting`
  after disengage — they're not in the pending release set, and no event fires.
  System works via periodic reconciliation (up to 5 minutes delay), but the
  documented state machine doesn't represent this path.
- **Self-correcting analytical style:** Opus explicitly withdrew two draft
  findings mid-analysis ("Actually, this sequence works as designed. Let me
  identify a real violation instead." / "this is likely handled"). This
  self-correction behavior was first observed in Finding #15 and is now
  confirmed as a consistent Opus trait for invariant-style analysis.

**Claude Sonnet unique findings (not in either other model):**

- **Cold-start Tier 3 failure creates supervision restart loop** (#2): A
  persistent Tier 3 failure (phantom fills) crashes OrderManager, `:rest_for_one`
  kills the tree, DynamicSupervisor restarts it, cold-start fails again → infinite
  loop. State machine shows `starting → stopped` but supervision creates
  `starting → starting` indefinitely.
- **HealthMonitor start failure during start_user** (#4): If HealthMonitor.Supervisor
  is momentarily crashed when `start_user/1` runs step 4, the pipeline starts
  without monitoring. No error handling specified for this partial-start state.

**Quality assessment:**

- **GPT-5** was MAXIMALLY SELECTIVE — only 3 findings from 12,032 reasoning tokens
  (4,011 reasoning tokens per finding). This is the most extreme
  reasoning-to-output ratio observed: 15:1 (12,032 reasoning / 784 output tokens).
  For comparison, in previous experiments GPT-5 typically shows 1:1 to 2:1 ratios.
  Every finding is a genuine invariant violation with a precise, step-by-step
  sequence where each step is individually legal. ZERO false positives, zero
  padding, zero "this might be an issue." GPT-5 appears to have used almost all
  its reasoning budget for VERIFICATION — confirming that each candidate is
  genuinely a violation before including it.
- **Claude Opus** produced the most findings (7) with its characteristic depth and
  self-correction. Two findings were revised mid-analysis, showing Opus actively
  testing its own reasoning against the document before committing to a finding.
  The DynamicSupervisor restart thread (findings #1, #2, #3, #7) forms a coherent
  cluster — Opus identified one root cause (OTP restarts bypass the lifecycle
  layer) and explored its multiple consequences. The `operator_release_system`
  monotonicity finding (#4) is architecturally significant and unique.
- **Claude Sonnet** was extremely fast (23s, 1,266 tokens) and produced 5 findings.
  Quality was mixed: Finding #1 partially mirrors GPT-5's authority inversion but
  with vaguer reasoning ("race condition with ETS operations" — not specified).
  Finding #3 describes a contradiction but the scenario is internally inconsistent
  (step 5 says "pipeline termination fails" but then step 7 says pipeline is still
  running — this conflates two failure modes). Findings #2 and #4 are genuine and
  well-reasoned. Sonnet's precision is lower than the other two on this task.

**Key insight — "Invariant violation paths" as a task type:**

This is a genuinely DIFFERENT analytical task from any previously tested. It requires:
1. Identifying the invariants (explicit or implied)
2. Constructing a sequence of operations (creative/generative)
3. Verifying each step is legal per the spec (verification)
4. Confirming the end state violates the invariant (correctness proof)

This four-phase cognitive process explains GPT-5's extreme selectivity: steps 2-4 are
all verification-heavy, and GPT-5's reasoning tokens are being burned on steps 3 and 4
(confirming each step is genuinely legal and the final state genuinely violates). In
previous tasks like "find hidden assumptions" or "find gaps," only step 1 (identification)
is needed — there's no construction or verification phase.

**Comparison to previous task types:**

| Task type | GPT-5 findings | Opus findings | GPT-5 reasoning overhead |
|---|---|---|---|
| Hidden assumptions | 20-35 | 12-13 | 5-7K reasoning |
| Race conditions | 12 | 10 | 8K reasoning |
| Design coherence | 4 | 7 | 9K reasoning |
| Invariant violation paths | 3 | 7 | **12K reasoning** |

The pattern: as the task requires more VERIFICATION (vs identification), GPT-5 becomes
more selective and spends more reasoning tokens per finding. Invariant violation paths
demand the highest verification burden (every step must be confirmed legal), and GPT-5
responds with the highest selectivity and reasoning investment.

Opus inverts: it produces MORE findings on verification-heavy tasks (7 for coherence,
7 for invariant paths) vs identification tasks (10-13 for assumptions). This suggests
Opus uses its internal reasoning differently — it's more willing to present findings
that have "likely" rather than "proven" violations, then self-corrects inline if the
verification fails.

**Practical implication:**

For invariant violation path analysis:
- **GPT-5** produces the highest-precision findings but very few. Every finding is a
  genuine spec-level bug. Use when you need zero-false-positive bug reports to present
  to a design team.
- **Opus** produces more findings with slightly lower precision but unique analytical
  depth. Its self-correction behavior means false positives are often caught inline.
  Use when you want both confirmed violations AND identified tensions.
- **Sonnet** is too imprecise for this task type — some findings have internal
  inconsistencies. Use for lighter analytical tasks (assumption-finding, spec gaps).

The three findings GPT-5 produced are ALL genuine design bugs that should be fixed:
1. Users configured during kill switch engagement bypass operator release
2. Premature operator release (while KS still engaged) creates future bypass
3. Admin stops are overridden by periodic reconciliation

These are the kind of findings that, in a real financial system, prevent production
incidents. The 12K reasoning tokens to produce 3 perfect findings is excellent ROI.