Files

T

Rodin 6af8a6ee10 refactor(findings): split ALL-FINDINGS.md into per-experiment files

Break the monolithic 3249-line findings file into 29 individual files,
one per experiment. Each file is named YYYY-MM-DD-NN-slug.md for easy
chronological sorting and discovery.

No content changes — purely structural reorganization.

2026-05-06 07:15:50 -07:00

10 KiB

Raw Blame History

Finding 20: Invariant violation path analysis: GPT-5 is maximally selective (3 findings, all genuine); Opus shows unique self-correcting analytical style; new task type favors precision over exhaustiveness

Date: 2026-05-04 Task: Identify invariant violation paths in gargoyle's user-pipeline-lifecycle.md (730 lines) — sequences of legal operations that can violate the system's stated or implied invariants. NEW analytical lens not previously tested, distinct from assumption- finding, race conditions, or coherence checking. How we used them: Same document (full text) + same focused analytical question to all 3 models via HAI proxy. Highly structured prompt specifying 5 categories of invariant violations (state machine escapes, invariant composition failures, monotonicity violations, idempotency boundary violations, authority inversion sequences). Required specific output format per finding. No tools, no project context beyond the document itself.

Model	Time	Output tokens	Reasoning tokens	Findings
GPT-5	143s	784	12,032	3
Claude Opus 4.6	113s	6,183	(internal)	7 (with 2 self-corrections)
Claude Sonnet 4.6	23s	1,266	(internal)	5

What they found — common ground (2+ models identified):

Periodic reconciliation overrides operator manual stop (GPT-5 #3 + Opus #5 + Sonnet #1): An admin who stops a pipeline via stop_user/1 with :admin_action has their decision overridden within 5 minutes by periodic reconciliation, because there's no "admin stopped" state in check_eligibility/1. All three models independently identified this as the clearest authority inversion.
DynamicSupervisor restart bypasses eligibility gate (Opus #1/#3 + Sonnet #2): When UserPipeline.Supervisor crashes and is restarted by OTP supervision, the restart bypasses start_user/1 and check_eligibility/1 entirely — potentially resuming trading while the kill switch is engaged.
Stale ReconciliationGate after crash (Opus #7): After a crash-triggered DynamicSupervisor restart (not via stop_user/1), the ReconciliationGate remains :ready from the previous instance because stop_user/1 (which resets it) was never called. The new OrderManager may accept orders during its own reconciliation.
HealthMonitor co-lifecycle violation (Opus #2 + Sonnet #4): After a DynamicSupervisor-initiated restart, the HealthMonitor is still subscribed to the old PIDs — no code re-establishes monitoring for the new pipeline processes.

GPT-5 unique findings (not in either other model):

Kill switch bypass for users configured DURING engagement (#1): A user who saves credentials while the kill switch is engaged is never added to the pending operator release set (only running pipelines are added at engage time). After disengage, periodic reconciliation auto-starts this user's pipeline without operator release — violating "resuming always requires human judgment." This is the most precisely reasoned finding across all three models: each step is individually correct per the spec, and the violation emerges purely from the composition of legal operations.
Premature release bypass (#2): If operator_release_user/1 is called while the kill switch is still engaged (a legal operation), it clears the pending release flag but start_user/1 correctly refuses. After later disengage, the flag is gone — auto-start proceeds without fresh operator judgment. The release was "spent" at the wrong time.

Claude Opus unique findings (not in either other model):

operator_release_system/0 clears unrelated safety obligations (#4): Operator intends to release one user from a recent event but operator_release_system/0 also releases other users still pending from an earlier, unresolved event. One release call discharges multiple independent safety obligations — monotonicity violation.
State machine incompleteness for blocked users (#6): Users who become configured during kill switch engagement (blocked with reason :kill_switch_engaged) have no state machine transition back to starting after disengage — they're not in the pending release set, and no event fires. System works via periodic reconciliation (up to 5 minutes delay), but the documented state machine doesn't represent this path.
Self-correcting analytical style: Opus explicitly withdrew two draft findings mid-analysis ("Actually, this sequence works as designed. Let me identify a real violation instead." / "this is likely handled"). This self-correction behavior was first observed in Finding #15 and is now confirmed as a consistent Opus trait for invariant-style analysis.

Claude Sonnet unique findings (not in either other model):

Cold-start Tier 3 failure creates supervision restart loop (#2): A persistent Tier 3 failure (phantom fills) crashes OrderManager, :rest_for_one kills the tree, DynamicSupervisor restarts it, cold-start fails again → infinite loop. State machine shows starting → stopped but supervision creates starting → starting indefinitely.
HealthMonitor start failure during start_user (#4): If HealthMonitor.Supervisor is momentarily crashed when start_user/1 runs step 4, the pipeline starts without monitoring. No error handling specified for this partial-start state.

Quality assessment:

GPT-5 was MAXIMALLY SELECTIVE — only 3 findings from 12,032 reasoning tokens (4,011 reasoning tokens per finding). This is the most extreme reasoning-to-output ratio observed: 15:1 (12,032 reasoning / 784 output tokens). For comparison, in previous experiments GPT-5 typically shows 1:1 to 2:1 ratios. Every finding is a genuine invariant violation with a precise, step-by-step sequence where each step is individually legal. ZERO false positives, zero padding, zero "this might be an issue." GPT-5 appears to have used almost all its reasoning budget for VERIFICATION — confirming that each candidate is genuinely a violation before including it.
Claude Opus produced the most findings (7) with its characteristic depth and self-correction. Two findings were revised mid-analysis, showing Opus actively testing its own reasoning against the document before committing to a finding. The DynamicSupervisor restart thread (findings #1, #2, #3, #7) forms a coherent cluster — Opus identified one root cause (OTP restarts bypass the lifecycle layer) and explored its multiple consequences. The operator_release_system monotonicity finding (#4) is architecturally significant and unique.
Claude Sonnet was extremely fast (23s, 1,266 tokens) and produced 5 findings. Quality was mixed: Finding #1 partially mirrors GPT-5's authority inversion but with vaguer reasoning ("race condition with ETS operations" — not specified). Finding #3 describes a contradiction but the scenario is internally inconsistent (step 5 says "pipeline termination fails" but then step 7 says pipeline is still running — this conflates two failure modes). Findings #2 and #4 are genuine and well-reasoned. Sonnet's precision is lower than the other two on this task.

Key insight — "Invariant violation paths" as a task type:

This is a genuinely DIFFERENT analytical task from any previously tested. It requires:

Identifying the invariants (explicit or implied)
Constructing a sequence of operations (creative/generative)
Verifying each step is legal per the spec (verification)
Confirming the end state violates the invariant (correctness proof)

This four-phase cognitive process explains GPT-5's extreme selectivity: steps 2-4 are all verification-heavy, and GPT-5's reasoning tokens are being burned on steps 3 and 4 (confirming each step is genuinely legal and the final state genuinely violates). In previous tasks like "find hidden assumptions" or "find gaps," only step 1 (identification) is needed — there's no construction or verification phase.

Comparison to previous task types:

Task type	GPT-5 findings	Opus findings	GPT-5 reasoning overhead
Hidden assumptions	20-35	12-13	5-7K reasoning
Race conditions	12	10	8K reasoning
Design coherence	4	7	9K reasoning
Invariant violation paths	3	7	12K reasoning

The pattern: as the task requires more VERIFICATION (vs identification), GPT-5 becomes more selective and spends more reasoning tokens per finding. Invariant violation paths demand the highest verification burden (every step must be confirmed legal), and GPT-5 responds with the highest selectivity and reasoning investment.

Opus inverts: it produces MORE findings on verification-heavy tasks (7 for coherence, 7 for invariant paths) vs identification tasks (10-13 for assumptions). This suggests Opus uses its internal reasoning differently — it's more willing to present findings that have "likely" rather than "proven" violations, then self-corrects inline if the verification fails.

Practical implication:

For invariant violation path analysis:

GPT-5 produces the highest-precision findings but very few. Every finding is a genuine spec-level bug. Use when you need zero-false-positive bug reports to present to a design team.
Opus produces more findings with slightly lower precision but unique analytical depth. Its self-correction behavior means false positives are often caught inline. Use when you want both confirmed violations AND identified tensions.
Sonnet is too imprecise for this task type — some findings have internal inconsistencies. Use for lighter analytical tasks (assumption-finding, spec gaps).

The three findings GPT-5 produced are ALL genuine design bugs that should be fixed:

Users configured during kill switch engagement bypass operator release
Premature operator release (while KS still engaged) creates future bypass
Admin stops are overridden by periodic reconciliation

These are the kind of findings that, in a real financial system, prevent production incidents. The 12K reasoning tokens to produce 3 perfect findings is excellent ROI.

10 KiB Raw Blame History

Finding 20: Invariant violation path analysis: GPT-5 is maximally selective (3 findings, all genuine); Opus shows unique self-correcting analytical style; new task type favors precision over exhaustiveness

10 KiB

Raw Blame History