6af8a6ee10
Break the monolithic 3249-line findings file into 29 individual files, one per experiment. Each file is named YYYY-MM-DD-NN-slug.md for easy chronological sorting and discovery. No content changes — purely structural reorganization.
164 lines
10 KiB
Markdown
164 lines
10 KiB
Markdown
# Finding 20: Invariant violation path analysis: GPT-5 is maximally selective (3 findings, all genuine); Opus shows unique self-correcting analytical style; new task type favors precision over exhaustiveness
|
|
|
|
**Date:** 2026-05-04
|
|
**Task:** Identify invariant violation paths in gargoyle's `user-pipeline-lifecycle.md`
|
|
(730 lines) — sequences of legal operations that can violate the system's stated or
|
|
implied invariants. NEW analytical lens not previously tested, distinct from assumption-
|
|
finding, race conditions, or coherence checking.
|
|
**How we used them:** Same document (full text) + same focused analytical question to all
|
|
3 models via HAI proxy. Highly structured prompt specifying 5 categories of invariant
|
|
violations (state machine escapes, invariant composition failures, monotonicity violations,
|
|
idempotency boundary violations, authority inversion sequences). Required specific output
|
|
format per finding. No tools, no project context beyond the document itself.
|
|
|
|
| Model | Time | Output tokens | Reasoning tokens | Findings |
|
|
|---|---|---|---|---|
|
|
| GPT-5 | 143s | 784 | 12,032 | 3 |
|
|
| Claude Opus 4.6 | 113s | 6,183 | (internal) | 7 (with 2 self-corrections) |
|
|
| Claude Sonnet 4.6 | 23s | 1,266 | (internal) | 5 |
|
|
|
|
**What they found — common ground (2+ models identified):**
|
|
|
|
- **Periodic reconciliation overrides operator manual stop** (GPT-5 #3 + Opus #5 +
|
|
Sonnet #1): An admin who stops a pipeline via `stop_user/1` with `:admin_action`
|
|
has their decision overridden within 5 minutes by periodic reconciliation, because
|
|
there's no "admin stopped" state in `check_eligibility/1`. All three models
|
|
independently identified this as the clearest authority inversion.
|
|
- **DynamicSupervisor restart bypasses eligibility gate** (Opus #1/#3 + Sonnet #2):
|
|
When `UserPipeline.Supervisor` crashes and is restarted by OTP supervision, the
|
|
restart bypasses `start_user/1` and `check_eligibility/1` entirely — potentially
|
|
resuming trading while the kill switch is engaged.
|
|
- **Stale ReconciliationGate after crash** (Opus #7): After a crash-triggered
|
|
DynamicSupervisor restart (not via `stop_user/1`), the ReconciliationGate remains
|
|
`:ready` from the previous instance because `stop_user/1` (which resets it) was
|
|
never called. The new OrderManager may accept orders during its own reconciliation.
|
|
- **HealthMonitor co-lifecycle violation** (Opus #2 + Sonnet #4): After a
|
|
DynamicSupervisor-initiated restart, the HealthMonitor is still subscribed to the
|
|
old PIDs — no code re-establishes monitoring for the new pipeline processes.
|
|
|
|
**GPT-5 unique findings (not in either other model):**
|
|
|
|
- **Kill switch bypass for users configured DURING engagement** (#1): A user who
|
|
saves credentials while the kill switch is engaged is never added to the pending
|
|
operator release set (only running pipelines are added at engage time). After
|
|
disengage, periodic reconciliation auto-starts this user's pipeline without
|
|
operator release — violating "resuming always requires human judgment." This is
|
|
the most precisely reasoned finding across all three models: each step is
|
|
individually correct per the spec, and the violation emerges purely from the
|
|
composition of legal operations.
|
|
- **Premature release bypass** (#2): If `operator_release_user/1` is called while
|
|
the kill switch is still engaged (a legal operation), it clears the pending
|
|
release flag but `start_user/1` correctly refuses. After later disengage, the
|
|
flag is gone — auto-start proceeds without fresh operator judgment. The release
|
|
was "spent" at the wrong time.
|
|
|
|
**Claude Opus unique findings (not in either other model):**
|
|
|
|
- **`operator_release_system/0` clears unrelated safety obligations** (#4):
|
|
Operator intends to release one user from a recent event but
|
|
`operator_release_system/0` also releases other users still pending from an
|
|
earlier, unresolved event. One release call discharges multiple independent
|
|
safety obligations — monotonicity violation.
|
|
- **State machine incompleteness for blocked users** (#6): Users who become
|
|
configured during kill switch engagement (blocked with reason
|
|
`:kill_switch_engaged`) have no state machine transition back to `starting`
|
|
after disengage — they're not in the pending release set, and no event fires.
|
|
System works via periodic reconciliation (up to 5 minutes delay), but the
|
|
documented state machine doesn't represent this path.
|
|
- **Self-correcting analytical style:** Opus explicitly withdrew two draft
|
|
findings mid-analysis ("Actually, this sequence works as designed. Let me
|
|
identify a real violation instead." / "this is likely handled"). This
|
|
self-correction behavior was first observed in Finding #15 and is now
|
|
confirmed as a consistent Opus trait for invariant-style analysis.
|
|
|
|
**Claude Sonnet unique findings (not in either other model):**
|
|
|
|
- **Cold-start Tier 3 failure creates supervision restart loop** (#2): A
|
|
persistent Tier 3 failure (phantom fills) crashes OrderManager, `:rest_for_one`
|
|
kills the tree, DynamicSupervisor restarts it, cold-start fails again → infinite
|
|
loop. State machine shows `starting → stopped` but supervision creates
|
|
`starting → starting` indefinitely.
|
|
- **HealthMonitor start failure during start_user** (#4): If HealthMonitor.Supervisor
|
|
is momentarily crashed when `start_user/1` runs step 4, the pipeline starts
|
|
without monitoring. No error handling specified for this partial-start state.
|
|
|
|
**Quality assessment:**
|
|
|
|
- **GPT-5** was MAXIMALLY SELECTIVE — only 3 findings from 12,032 reasoning tokens
|
|
(4,011 reasoning tokens per finding). This is the most extreme
|
|
reasoning-to-output ratio observed: 15:1 (12,032 reasoning / 784 output tokens).
|
|
For comparison, in previous experiments GPT-5 typically shows 1:1 to 2:1 ratios.
|
|
Every finding is a genuine invariant violation with a precise, step-by-step
|
|
sequence where each step is individually legal. ZERO false positives, zero
|
|
padding, zero "this might be an issue." GPT-5 appears to have used almost all
|
|
its reasoning budget for VERIFICATION — confirming that each candidate is
|
|
genuinely a violation before including it.
|
|
- **Claude Opus** produced the most findings (7) with its characteristic depth and
|
|
self-correction. Two findings were revised mid-analysis, showing Opus actively
|
|
testing its own reasoning against the document before committing to a finding.
|
|
The DynamicSupervisor restart thread (findings #1, #2, #3, #7) forms a coherent
|
|
cluster — Opus identified one root cause (OTP restarts bypass the lifecycle
|
|
layer) and explored its multiple consequences. The `operator_release_system`
|
|
monotonicity finding (#4) is architecturally significant and unique.
|
|
- **Claude Sonnet** was extremely fast (23s, 1,266 tokens) and produced 5 findings.
|
|
Quality was mixed: Finding #1 partially mirrors GPT-5's authority inversion but
|
|
with vaguer reasoning ("race condition with ETS operations" — not specified).
|
|
Finding #3 describes a contradiction but the scenario is internally inconsistent
|
|
(step 5 says "pipeline termination fails" but then step 7 says pipeline is still
|
|
running — this conflates two failure modes). Findings #2 and #4 are genuine and
|
|
well-reasoned. Sonnet's precision is lower than the other two on this task.
|
|
|
|
**Key insight — "Invariant violation paths" as a task type:**
|
|
|
|
This is a genuinely DIFFERENT analytical task from any previously tested. It requires:
|
|
1. Identifying the invariants (explicit or implied)
|
|
2. Constructing a sequence of operations (creative/generative)
|
|
3. Verifying each step is legal per the spec (verification)
|
|
4. Confirming the end state violates the invariant (correctness proof)
|
|
|
|
This four-phase cognitive process explains GPT-5's extreme selectivity: steps 2-4 are
|
|
all verification-heavy, and GPT-5's reasoning tokens are being burned on steps 3 and 4
|
|
(confirming each step is genuinely legal and the final state genuinely violates). In
|
|
previous tasks like "find hidden assumptions" or "find gaps," only step 1 (identification)
|
|
is needed — there's no construction or verification phase.
|
|
|
|
**Comparison to previous task types:**
|
|
|
|
| Task type | GPT-5 findings | Opus findings | GPT-5 reasoning overhead |
|
|
|---|---|---|---|
|
|
| Hidden assumptions | 20-35 | 12-13 | 5-7K reasoning |
|
|
| Race conditions | 12 | 10 | 8K reasoning |
|
|
| Design coherence | 4 | 7 | 9K reasoning |
|
|
| Invariant violation paths | 3 | 7 | **12K reasoning** |
|
|
|
|
The pattern: as the task requires more VERIFICATION (vs identification), GPT-5 becomes
|
|
more selective and spends more reasoning tokens per finding. Invariant violation paths
|
|
demand the highest verification burden (every step must be confirmed legal), and GPT-5
|
|
responds with the highest selectivity and reasoning investment.
|
|
|
|
Opus inverts: it produces MORE findings on verification-heavy tasks (7 for coherence,
|
|
7 for invariant paths) vs identification tasks (10-13 for assumptions). This suggests
|
|
Opus uses its internal reasoning differently — it's more willing to present findings
|
|
that have "likely" rather than "proven" violations, then self-corrects inline if the
|
|
verification fails.
|
|
|
|
**Practical implication:**
|
|
|
|
For invariant violation path analysis:
|
|
- **GPT-5** produces the highest-precision findings but very few. Every finding is a
|
|
genuine spec-level bug. Use when you need zero-false-positive bug reports to present
|
|
to a design team.
|
|
- **Opus** produces more findings with slightly lower precision but unique analytical
|
|
depth. Its self-correction behavior means false positives are often caught inline.
|
|
Use when you want both confirmed violations AND identified tensions.
|
|
- **Sonnet** is too imprecise for this task type — some findings have internal
|
|
inconsistencies. Use for lighter analytical tasks (assumption-finding, spec gaps).
|
|
|
|
The three findings GPT-5 produced are ALL genuine design bugs that should be fixed:
|
|
1. Users configured during kill switch engagement bypass operator release
|
|
2. Premature operator release (while KS still engaged) creates future bypass
|
|
3. Admin stops are overridden by periodic reconciliation
|
|
|
|
These are the kind of findings that, in a real financial system, prevent production
|
|
incidents. The 12K reasoning tokens to produce 3 perfect findings is excellent ROI.
|