Finding 20: Invariant violation path analysis: GPT-5 is maximally selective (3 findings, all genuine); Opus shows unique self-correcting analytical style; new task type favors precision over exhaustiveness
Date: 2026-05-04
Task: Identify invariant violation paths in gargoyle's user-pipeline-lifecycle.md
(730 lines): sequences of legal operations that can violate the system's stated or
implied invariants. This is a NEW analytical lens, not previously tested and distinct
from assumption-finding, race conditions, or coherence checking.
How we used them: Same document (full text) + same focused analytical question to all
3 models via HAI proxy. Highly structured prompt specifying 5 categories of invariant
violations (state machine escapes, invariant composition failures, monotonicity violations,
idempotency boundary violations, authority inversion sequences). Required specific output
format per finding. No tools, no project context beyond the document itself.
| Model | Time | Output tokens | Reasoning tokens | Findings |
|---|---|---|---|---|
| GPT-5 | 143s | 784 | 12,032 | 3 |
| Claude Opus 4.6 | 113s | 6,183 | (internal) | 7 (with 2 self-corrections) |
| Claude Sonnet 4.6 | 23s | 1,266 | (internal) | 5 |
What they found — common ground (2+ models identified):
- Periodic reconciliation overrides operator manual stop (GPT-5 #3 + Opus #5 + Sonnet #1): An admin who stops a pipeline via `stop_user/1` with `:admin_action` has their decision overridden within 5 minutes by periodic reconciliation, because there's no "admin stopped" state in `check_eligibility/1`. All three models independently identified this as the clearest authority inversion.
- DynamicSupervisor restart bypasses eligibility gate (Opus #1/#3 + Sonnet #2): When `UserPipeline.Supervisor` crashes and is restarted by OTP supervision, the restart bypasses `start_user/1` and `check_eligibility/1` entirely, potentially resuming trading while the kill switch is engaged.
- Stale ReconciliationGate after crash (Opus #7): After a crash-triggered DynamicSupervisor restart (not via `stop_user/1`), the ReconciliationGate remains `:ready` from the previous instance because `stop_user/1` (which resets it) was never called. The new OrderManager may accept orders during its own reconciliation.
- HealthMonitor co-lifecycle violation (Opus #2 + Sonnet #4): After a DynamicSupervisor-initiated restart, the HealthMonitor is still subscribed to the old PIDs; no code re-establishes monitoring for the new pipeline processes.
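The first item (reconciliation overriding an admin stop) can be reproduced in a few lines. The sketch below is a toy Python model, not the system's own code: the `Pipeline` class, its fields, and the bodies of `check_eligibility`, `stop_user`, and `reconcile` are hypothetical simplifications of the behavior described in the finding.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Pipeline:
    """Hypothetical stand-in for one user's pipeline record."""
    configured: bool = True            # user has saved credentials
    running: bool = False
    stop_reason: Optional[str] = None  # recorded, but never consulted below


def check_eligibility(p: Pipeline, kill_switch_engaged: bool) -> bool:
    # Mirrors the gap described in the finding: no "admin stopped" check.
    return p.configured and not kill_switch_engaged


def stop_user(p: Pipeline, reason: str) -> None:
    p.running = False
    p.stop_reason = reason


def reconcile(p: Pipeline, kill_switch_engaged: bool = False) -> None:
    # Periodic pass: start anything that is eligible but not running.
    if not p.running and check_eligibility(p, kill_switch_engaged):
        p.running = True


pipeline = Pipeline(running=True)
stop_user(pipeline, "admin_action")
stopped_after_admin_action = not pipeline.running
reconcile(pipeline)  # the next periodic pass
restarted_by_reconcile = pipeline.running
```

In this model the stop reason is stored but nothing reads it, so the next reconciliation pass restarts the pipeline. Whether the real system records the reason at all is not stated in the finding.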
GPT-5 unique findings (not in either other model):
- Kill switch bypass for users configured DURING engagement (#1): A user who saves credentials while the kill switch is engaged is never added to the pending operator release set (only running pipelines are added at engage time). After disengage, periodic reconciliation auto-starts this user's pipeline without operator release — violating "resuming always requires human judgment." This is the most precisely reasoned finding across all three models: each step is individually correct per the spec, and the violation emerges purely from the composition of legal operations.
- Premature release bypass (#2): If `operator_release_user/1` is called while the kill switch is still engaged (a legal operation), it clears the pending release flag but `start_user/1` correctly refuses. After later disengage, the flag is gone, so auto-start proceeds without fresh operator judgment. The release was "spent" at the wrong time.
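Both GPT-5 sequences are compositions of individually legal steps, which makes them easy to replay in a small model. The following is a hedged Python sketch: `KillSwitchModel` and all of its methods are hypothetical names that mirror the operations described above, and the internal logic is an assumption reconstructed from the findings, not taken from the specification.

```python
class KillSwitchModel:
    """Hypothetical model of the kill switch and pending-release set."""

    def __init__(self):
        self.engaged = False
        self.configured = set()
        self.running = set()
        self.pending_release = set()

    def save_credentials(self, user):
        self.configured.add(user)

    def engage(self):
        self.engaged = True
        # Only pipelines running at engage time are recorded as pending.
        self.pending_release |= self.running
        self.running.clear()

    def disengage(self):
        self.engaged = False

    def start_user(self, user):
        eligible = (user in self.configured
                    and not self.engaged
                    and user not in self.pending_release)
        if eligible:
            self.running.add(user)
        return eligible

    def operator_release_user(self, user):
        # Legal even while engaged: the flag is cleared either way.
        self.pending_release.discard(user)
        self.start_user(user)

    def reconcile(self):
        # Periodic pass: try to start every configured, non-running user.
        for user in sorted(self.configured - self.running):
            self.start_user(user)


def configured_during_engagement():
    m = KillSwitchModel()
    m.engage()
    m.save_credentials("alice")           # configured while engaged
    was_pending = "alice" in m.pending_release
    m.disengage()
    m.reconcile()
    return was_pending, "alice" in m.running


def premature_release():
    m = KillSwitchModel()
    m.save_credentials("bob")
    m.start_user("bob")
    m.engage()
    m.operator_release_user("bob")        # release spent while engaged
    refused_while_engaged = "bob" not in m.running
    m.disengage()
    m.reconcile()
    return refused_while_engaged, "bob" in m.running
```

In this model `configured_during_engagement()` returns `(False, True)`: the user was never pending release, yet ends up running. `premature_release()` returns `(True, True)`: the start is refused while engaged, but the user auto-starts after disengage with no further operator action.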
Claude Opus unique findings (not in either other model):
- `operator_release_system/0` clears unrelated safety obligations (#4): Operator intends to release one user from a recent event but `operator_release_system/0` also releases other users still pending from an earlier, unresolved event. One release call discharges multiple independent safety obligations: a monotonicity violation.
- State machine incompleteness for blocked users (#6): Users who become configured during kill switch engagement (blocked with reason `:kill_switch_engaged`) have no state machine transition back to `starting` after disengage: they're not in the pending release set, and no event fires. System works via periodic reconciliation (up to 5 minutes delay), but the documented state machine doesn't represent this path.
- Self-correcting analytical style: Opus explicitly withdrew two draft findings mid-analysis ("Actually, this sequence works as designed. Let me identify a real violation instead." / "this is likely handled"). This self-correction behavior was first observed in Finding #15 and is now confirmed as a consistent Opus trait for invariant-style analysis.
Claude Sonnet unique findings (not in either other model):
- Cold-start Tier 3 failure creates supervision restart loop (#2): A persistent Tier 3 failure (phantom fills) crashes OrderManager, `:rest_for_one` kills the tree, DynamicSupervisor restarts it, cold-start fails again → infinite loop. State machine shows `starting → stopped` but supervision creates `starting → starting` indefinitely.
- HealthMonitor start failure during start_user (#4): If HealthMonitor.Supervisor is momentarily crashed when `start_user/1` runs step 4, the pipeline starts without monitoring. No error handling specified for this partial-start state.
Quality assessment:
- GPT-5 was MAXIMALLY SELECTIVE — only 3 findings from 12,032 reasoning tokens (4,011 reasoning tokens per finding). This is the most extreme reasoning-to-output ratio observed: 15:1 (12,032 reasoning / 784 output tokens). For comparison, in previous experiments GPT-5 typically shows 1:1 to 2:1 ratios. Every finding is a genuine invariant violation with a precise, step-by-step sequence where each step is individually legal. ZERO false positives, zero padding, zero "this might be an issue." GPT-5 appears to have used almost all its reasoning budget for VERIFICATION — confirming that each candidate is genuinely a violation before including it.
- Claude Opus produced the most findings (7) with its characteristic depth and self-correction. Two findings were revised mid-analysis, showing Opus actively testing its own reasoning against the document before committing to a finding. The DynamicSupervisor restart thread (findings #1, #2, #3, #7) forms a coherent cluster: Opus identified one root cause (OTP restarts bypass the lifecycle layer) and explored its multiple consequences. The `operator_release_system` monotonicity finding (#4) is architecturally significant and unique.
- Claude Sonnet was extremely fast (23s, 1,266 tokens) and produced 5 findings. Quality was mixed: Finding #1 partially mirrors GPT-5's authority inversion but with vaguer reasoning ("race condition with ETS operations", not specified). Finding #3 describes a contradiction but the scenario is internally inconsistent (step 5 says "pipeline termination fails" but then step 7 says pipeline is still running; this conflates two failure modes). Findings #2 and #4 are genuine and well-reasoned. Sonnet's precision is lower than the other two on this task.
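The GPT-5 ratios quoted in this assessment follow directly from the results table. A quick check of the arithmetic, using only the table's figures (the variable names are illustrative):

```python
# GPT-5 row of the results table.
reasoning_tokens = 12_032
output_tokens = 784
findings = 3

# Reasoning tokens spent per reported finding (about 4,011).
tokens_per_finding = reasoning_tokens / findings

# Reasoning-to-output ratio (about 15.3, quoted as 15:1).
reasoning_to_output = reasoning_tokens / output_tokens
```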
Key insight — "Invariant violation paths" as a task type:
This is a genuinely DIFFERENT analytical task from any previously tested. It requires:
1. Identifying the invariants (explicit or implied)
2. Constructing a sequence of operations (creative/generative)
3. Verifying each step is legal per the spec (verification)
4. Confirming the end state violates the invariant (correctness proof)
This four-phase cognitive process explains GPT-5's extreme selectivity: steps 2-4 add construction and verification work on top of identification, and GPT-5's reasoning tokens appear to be spent mostly on steps 3 and 4 (confirming each step is genuinely legal and the final state genuinely violates). In previous tasks like "find hidden assumptions" or "find gaps," only step 1 (identification) is needed; there's no construction or verification phase.
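The four phases map onto a mechanical search, sketched below in Python. This is a brute-force analogue for illustration only, not a description of how any of the models work. The two-operation system (`admin_stop`, `reconcile`), its state tuple, and the function names are all hypothetical.

```python
from itertools import product

OPS = ("admin_stop", "reconcile")
INITIAL = (True, False)  # (running, admin_stopped)


def invariant(state):
    # Phase 1: the invariant. A pipeline an admin stopped must not be running.
    running, admin_stopped = state
    return not (running and admin_stopped)


def is_legal(state, op):
    # An admin can only stop a running pipeline; reconcile is always legal.
    running, _ = state
    return running if op == "admin_stop" else True


def apply_op(state, op):
    _, admin_stopped = state
    if op == "admin_stop":
        return (False, True)
    return (True, admin_stopped)  # reconcile leaves the pipeline running


def find_violation_paths(initial, ops, max_len):
    found = []
    for n in range(1, max_len + 1):
        for path in product(ops, repeat=n):      # phase 2: construct a sequence
            state, legal = initial, True
            for op in path:
                if not is_legal(state, op):      # phase 3: every step is legal
                    legal = False
                    break
                state = apply_op(state, op)
            if legal and not invariant(state):   # phase 4: end state violates
                found.append(path)
    return found


paths = find_violation_paths(INITIAL, OPS, max_len=2)
```

With `max_len=2` the only path reported is `("admin_stop", "reconcile")`. Exhaustive enumeration stands in for the creative step here, which only works because the toy has two operations. For a 730-line spec the construction phase cannot be brute-forced this way.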
Comparison to previous task types:
| Task type | GPT-5 findings | Opus findings | GPT-5 reasoning overhead |
|---|---|---|---|
| Hidden assumptions | 20-35 | 12-13 | 5-7K reasoning |
| Race conditions | 12 | 10 | 8K reasoning |
| Design coherence | 4 | 7 | 9K reasoning |
| Invariant violation paths | 3 | 7 | 12K reasoning |
The pattern: as the task requires more VERIFICATION (vs identification), GPT-5 becomes more selective and spends more reasoning tokens per finding. Invariant violation paths demand the highest verification burden (every step must be confirmed legal), and GPT-5 responds with the highest selectivity and reasoning investment.
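The trend can be made concrete by dividing reasoning tokens by findings for each row of the table. The sketch below uses only the table's rounded "K" figures, so the results are rough. For the hidden-assumptions row, which gives ranges, it computes loose lower and upper bounds, because the table does not say which token count pairs with which finding count.

```python
# (findings range, reasoning-token range) read off the comparison table.
rows = {
    "Hidden assumptions": ((20, 35), (5_000, 7_000)),
    "Race conditions": ((12, 12), (8_000, 8_000)),
    "Design coherence": ((4, 4), (9_000, 9_000)),
    "Invariant violation paths": ((3, 3), (12_000, 12_000)),
}


def per_finding_range(findings, tokens):
    """Loose bounds on reasoning tokens per finding."""
    low = tokens[0] / findings[1]   # fewest tokens over most findings
    high = tokens[1] / findings[0]  # most tokens over fewest findings
    return round(low), round(high)


per_finding = {task: per_finding_range(f, t) for task, (f, t) in rows.items()}
```

This gives roughly 143-350 tokens per finding for hidden assumptions, 667 for race conditions, 2,250 for design coherence, and 4,000 for invariant violation paths.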
Opus inverts the pattern relative to GPT-5: it reports MORE findings than GPT-5 on verification-heavy tasks (7 vs 4 for coherence, 7 vs 3 for invariant paths) and fewer on the more identification-oriented tasks (12-13 vs 20-35 for assumptions, 10 vs 12 for race conditions). This suggests Opus uses its internal reasoning differently: it is more willing to present findings that have "likely" rather than "proven" violations, then self-corrects inline if the verification fails.
Practical implication:
For invariant violation path analysis:
- GPT-5 produces the highest-precision findings but very few. Every finding is a genuine spec-level bug. Use when you need zero-false-positive bug reports to present to a design team.
- Opus produces more findings with slightly lower precision but unique analytical depth. Its self-correction behavior means false positives are often caught inline. Use when you want both confirmed violations AND identified tensions.
- Sonnet is too imprecise for this task type — some findings have internal inconsistencies. Use for lighter analytical tasks (assumption-finding, spec gaps).
The three findings GPT-5 produced are ALL genuine design bugs that should be fixed:
- Users configured during kill switch engagement bypass operator release
- Premature operator release (while KS still engaged) creates future bypass
- Admin stops are overridden by periodic reconciliation
These are the kind of findings that, in a real financial system, prevent production incidents. The 12K reasoning tokens to produce 3 perfect findings is excellent ROI.