Files
model-research/findings/2026-05-04-20-invariant-violation-path-analysis-gpt5.md
T
Rodin 6af8a6ee10 refactor(findings): split ALL-FINDINGS.md into per-experiment files
Break the monolithic 3249-line findings file into 29 individual files,
one per experiment. Each file is named YYYY-MM-DD-NN-slug.md for easy
chronological sorting and discovery.

No content changes — purely structural reorganization.
2026-05-06 07:15:50 -07:00

164 lines
10 KiB
Markdown

# Finding 20: Invariant violation path analysis: GPT-5 is maximally selective (3 findings, all genuine); Opus shows unique self-correcting analytical style; new task type favors precision over exhaustiveness
**Date:** 2026-05-04
**Task:** Identify invariant violation paths in gargoyle's `user-pipeline-lifecycle.md`
(730 lines) — sequences of legal operations that can violate the system's stated or
implied invariants. NEW analytical lens not previously tested, distinct from assumption-
finding, race conditions, or coherence checking.
**How we used them:** Same document (full text) + same focused analytical question to all
3 models via HAI proxy. Highly structured prompt specifying 5 categories of invariant
violations (state machine escapes, invariant composition failures, monotonicity violations,
idempotency boundary violations, authority inversion sequences). Required specific output
format per finding. No tools, no project context beyond the document itself.
| Model | Time | Output tokens | Reasoning tokens | Findings |
|---|---|---|---|---|
| GPT-5 | 143s | 784 | 12,032 | 3 |
| Claude Opus 4.6 | 113s | 6,183 | (internal) | 7 (with 2 self-corrections) |
| Claude Sonnet 4.6 | 23s | 1,266 | (internal) | 5 |
**What they found — common ground (2+ models identified):**
- **Periodic reconciliation overrides operator manual stop** (GPT-5 #3 + Opus #5 +
Sonnet #1): An admin who stops a pipeline via `stop_user/1` with `:admin_action`
has their decision overridden within 5 minutes by periodic reconciliation, because
there's no "admin stopped" state in `check_eligibility/1`. All three models
independently identified this as the clearest authority inversion.
- **DynamicSupervisor restart bypasses eligibility gate** (Opus #1/#3 + Sonnet #2):
When `UserPipeline.Supervisor` crashes and is restarted by OTP supervision, the
restart bypasses `start_user/1` and `check_eligibility/1` entirely — potentially
resuming trading while the kill switch is engaged.
- **Stale ReconciliationGate after crash** (Opus #7): After a crash-triggered
DynamicSupervisor restart (not via `stop_user/1`), the ReconciliationGate remains
`:ready` from the previous instance because `stop_user/1` (which resets it) was
never called. The new OrderManager may accept orders during its own reconciliation.
- **HealthMonitor co-lifecycle violation** (Opus #2 + Sonnet #4): After a
DynamicSupervisor-initiated restart, the HealthMonitor is still subscribed to the
old PIDs — no code re-establishes monitoring for the new pipeline processes.
**GPT-5 unique findings (not in either other model):**
- **Kill switch bypass for users configured DURING engagement** (#1): A user who
saves credentials while the kill switch is engaged is never added to the pending
operator release set (only running pipelines are added at engage time). After
disengage, periodic reconciliation auto-starts this user's pipeline without
operator release — violating "resuming always requires human judgment." This is
the most precisely reasoned finding across all three models: each step is
individually correct per the spec, and the violation emerges purely from the
composition of legal operations.
- **Premature release bypass** (#2): If `operator_release_user/1` is called while
the kill switch is still engaged (a legal operation), it clears the pending
release flag but `start_user/1` correctly refuses. After later disengage, the
flag is gone — auto-start proceeds without fresh operator judgment. The release
was "spent" at the wrong time.
**Claude Opus unique findings (not in either other model):**
- **`operator_release_system/0` clears unrelated safety obligations** (#4):
Operator intends to release one user from a recent event but
`operator_release_system/0` also releases other users still pending from an
earlier, unresolved event. One release call discharges multiple independent
safety obligations — monotonicity violation.
- **State machine incompleteness for blocked users** (#6): Users who become
configured during kill switch engagement (blocked with reason
`:kill_switch_engaged`) have no state machine transition back to `starting`
after disengage — they're not in the pending release set, and no event fires.
System works via periodic reconciliation (up to 5 minutes delay), but the
documented state machine doesn't represent this path.
- **Self-correcting analytical style:** Opus explicitly withdrew two draft
findings mid-analysis ("Actually, this sequence works as designed. Let me
identify a real violation instead." / "this is likely handled"). This
self-correction behavior was first observed in Finding #15 and is now
confirmed as a consistent Opus trait for invariant-style analysis.
**Claude Sonnet unique findings (not in either other model):**
- **Cold-start Tier 3 failure creates supervision restart loop** (#2): A
persistent Tier 3 failure (phantom fills) crashes OrderManager, `:rest_for_one`
kills the tree, DynamicSupervisor restarts it, cold-start fails again → infinite
loop. State machine shows `starting → stopped` but supervision creates
`starting → starting` indefinitely.
- **HealthMonitor start failure during start_user** (#4): If HealthMonitor.Supervisor
is momentarily crashed when `start_user/1` runs step 4, the pipeline starts
without monitoring. No error handling specified for this partial-start state.
**Quality assessment:**
- **GPT-5** was MAXIMALLY SELECTIVE — only 3 findings from 12,032 reasoning tokens
(4,011 reasoning tokens per finding). This is the most extreme
reasoning-to-output ratio observed: 15:1 (12,032 reasoning / 784 output tokens).
For comparison, in previous experiments GPT-5 typically shows 1:1 to 2:1 ratios.
Every finding is a genuine invariant violation with a precise, step-by-step
sequence where each step is individually legal. ZERO false positives, zero
padding, zero "this might be an issue." GPT-5 appears to have used almost all
its reasoning budget for VERIFICATION — confirming that each candidate is
genuinely a violation before including it.
- **Claude Opus** produced the most findings (7) with its characteristic depth and
self-correction. Two findings were revised mid-analysis, showing Opus actively
testing its own reasoning against the document before committing to a finding.
The DynamicSupervisor restart thread (findings #1, #2, #3, #7) forms a coherent
cluster — Opus identified one root cause (OTP restarts bypass the lifecycle
layer) and explored its multiple consequences. The `operator_release_system`
monotonicity finding (#4) is architecturally significant and unique.
- **Claude Sonnet** was extremely fast (23s, 1,266 tokens) and produced 5 findings.
Quality was mixed: Finding #1 partially mirrors GPT-5's authority inversion but
with vaguer reasoning ("race condition with ETS operations" — not specified).
Finding #3 describes a contradiction but the scenario is internally inconsistent
(step 5 says "pipeline termination fails" but then step 7 says pipeline is still
running — this conflates two failure modes). Findings #2 and #4 are genuine and
well-reasoned. Sonnet's precision is lower than the other two on this task.
**Key insight — "Invariant violation paths" as a task type:**
This is a genuinely DIFFERENT analytical task from any previously tested. It requires:
1. Identifying the invariants (explicit or implied)
2. Constructing a sequence of operations (creative/generative)
3. Verifying each step is legal per the spec (verification)
4. Confirming the end state violates the invariant (correctness proof)
This four-phase cognitive process explains GPT-5's extreme selectivity: steps 2-4 are
all verification-heavy, and GPT-5's reasoning tokens are being burned on steps 3 and 4
(confirming each step is genuinely legal and the final state genuinely violates). In
previous tasks like "find hidden assumptions" or "find gaps," only step 1 (identification)
is needed — there's no construction or verification phase.
**Comparison to previous task types:**
| Task type | GPT-5 findings | Opus findings | GPT-5 reasoning overhead |
|---|---|---|---|
| Hidden assumptions | 20-35 | 12-13 | 5-7K reasoning |
| Race conditions | 12 | 10 | 8K reasoning |
| Design coherence | 4 | 7 | 9K reasoning |
| Invariant violation paths | 3 | 7 | **12K reasoning** |
The pattern: as the task requires more VERIFICATION (vs identification), GPT-5 becomes
more selective and spends more reasoning tokens per finding. Invariant violation paths
demand the highest verification burden (every step must be confirmed legal), and GPT-5
responds with the highest selectivity and reasoning investment.
Opus inverts: it produces MORE findings on verification-heavy tasks (7 for coherence,
7 for invariant paths) vs identification tasks (10-13 for assumptions). This suggests
Opus uses its internal reasoning differently — it's more willing to present findings
that have "likely" rather than "proven" violations, then self-corrects inline if the
verification fails.
**Practical implication:**
For invariant violation path analysis:
- **GPT-5** produces the highest-precision findings but very few. Every finding is a
genuine spec-level bug. Use when you need zero-false-positive bug reports to present
to a design team.
- **Opus** produces more findings with slightly lower precision but unique analytical
depth. Its self-correction behavior means false positives are often caught inline.
Use when you want both confirmed violations AND identified tensions.
- **Sonnet** is too imprecise for this task type — some findings have internal
inconsistencies. Use for lighter analytical tasks (assumption-finding, spec gaps).
The three findings GPT-5 produced are ALL genuine design bugs that should be fixed:
1. Users configured during kill switch engagement bypass operator release
2. Premature operator release (while KS still engaged) creates future bypass
3. Admin stops are overridden by periodic reconciliation
These are the kind of findings that, in a real financial system, prevent production
incidents. The 12K reasoning tokens to produce 3 perfect findings is excellent ROI.