claw a65c471a3f finding 41: temporal ordering dependency analysis on kill-switch.md

New analytical lens testing whether models can identify sequential operations
where order matters but isn't mechanically enforced. GPT-5 finds systemic
gaps (WHY ordering matters), Opus finds inverted dangers (WHICH direction
is dangerous), Sonnet identifies themes without unique depth.

2026-05-07 12:47:03 -07:00

6.5 KiB

Finding 41: Temporal Ordering Dependency Analysis

Date: 2026-05-07
Document: gargoyle kill-switch.md (293 lines)
Analytical lens: temporal ordering dependencies, i.e. places where operations are described sequentially but nothing mechanically enforces that ordering

Experiment Design

Task: Identify places where the document assumes operations happen in a specific sequence but doesn't mechanically enforce it, and where reordering (due to crashes, async events, operator timing, or message ordering) would violate correctness.

Key distinction from race condition analysis: This is about SEQUENTIAL operations where order matters but isn't guaranteed — not about truly concurrent events.

Prompt structure: Specified 5 focus areas (multi-step engagement/disengagement, cross-component coordination, recovery/restart, operator timing, event vs state ordering). Required per-finding format (dependency, assumed ordering, enforcement gap, violation scenario, impact). Excluded single-component bugs, hardware failures, pure race conditions.
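
The required per-finding format can be pictured as a small record type. The sketch below is illustrative only: the class name `OrderingFinding`, the helper `missing_fields`, and the example values are hypothetical and are not taken from the prompt or from any model's output. Only the five field names come from the format described above.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class OrderingFinding:
    """One finding in the five-part format the prompt required."""
    dependency: str          # which operations depend on each other
    assumed_ordering: str    # the sequence the document takes for granted
    enforcement_gap: str     # why nothing mechanically guarantees that sequence
    violation_scenario: str  # a concrete reordering that breaks it
    impact: str              # what goes wrong as a result


def missing_fields(finding: OrderingFinding) -> list[str]:
    """Return the names of fields that were left blank."""
    return [f.name for f in fields(finding) if not getattr(finding, f.name).strip()]


# Hypothetical example with one field deliberately left empty.
example = OrderingFinding(
    dependency="policy update and order cancellation",
    assumed_ordering="policy is applied, then cancels are sent",
    enforcement_gap="",
    violation_scenario="cancels are sent before the policy takes effect",
    impact="new orders slip in between the two steps",
)
```

A check like `missing_fields(example)` makes it easy to spot findings that name a theme without stating the enforcement gap or a concrete violation scenario.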

Models: GPT-5, Claude Opus 4.6, Claude Sonnet 4.6 via HAI proxy. No tools, no project context beyond the document itself. Single prompt, no conversation history.

Results

Model	Time	Output tokens	Reasoning tokens	Findings
GPT-5	122s	12,437	9,856	12
Claude Opus 4.6	70s	2,903	(internal)	9
Claude Sonnet 4.6	24s	1,231	(internal)	7

Common Ground (all 3 identified)

Persistence write before decision engine termination (crash recovery gap)
Decision engine termination vs acceptance policy update ordering
Acceptance policy change before order cancellations
Event emission vs state change visibility
Application restart: components starting before kill state loads
Restrict→liquidate transition without state verification enforcement
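
To make "mechanically enforced" concrete, here is a minimal sketch of a guarded state machine for the restrict→liquidate transition. Everything in it is an assumption for illustration: the state names, the transition table, and the precondition (a confirmed engine termination) are hypothetical and are not taken from kill-switch.md.

```python
from enum import Enum


class KillState(Enum):
    OPEN = "open"
    RESTRICT = "restrict"
    LIQUIDATE = "liquidate"


# Hypothetical transition table: anything not listed is rejected.
ALLOWED = {
    (KillState.OPEN, KillState.RESTRICT),
    (KillState.RESTRICT, KillState.LIQUIDATE),
    (KillState.RESTRICT, KillState.OPEN),
    (KillState.LIQUIDATE, KillState.OPEN),
}


class TransitionRejected(RuntimeError):
    pass


class KillSwitch:
    def __init__(self) -> None:
        self.state = KillState.OPEN
        self.engine_terminated = False  # assumed precondition for liquidation

    def confirm_engine_terminated(self) -> None:
        self.engine_terminated = True

    def transition(self, target: KillState) -> None:
        # The ordering is checked in code rather than assumed by prose.
        if (self.state, target) not in ALLOWED:
            raise TransitionRejected(f"{self.state.value} -> {target.value} not allowed")
        if target is KillState.LIQUIDATE and not self.engine_terminated:
            raise TransitionRejected("engine termination not confirmed")
        self.state = target
        if target is KillState.OPEN:
            self.engine_terminated = False  # reset for the next engagement
```

With a guard like this, an operator who requests liquidation too early gets an explicit rejection instead of a silently mis-ordered sequence.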

GPT-5 Unique Findings

Acceptance policy authority conflict: OM's "blind enforcement" + "other sources can set policy" = no lock/lease/priority prevents a risk monitor clearing a transient alert and setting OM back to "open" during engagement. Ordering matters BECAUSE no coordination exists.
Global vs per-user write ordering: OM doesn't distinguish sources. Per-user "open" arriving after global "engage" wins (last-writer-wins). No composite policy computation.
Disengage: policy open before operator release: If release fires before OM applies open policy, restarted engine's orders are rejected — false start.
Cancel-all position snapshot timing: Liquidation "load all positions" doesn't wait for fill/cancel events from step 1 to be ingested — sizing is wrong.
Event ordering: per-user disengage vs global engage: User lifecycle processes events in delivery order, not logical order — could remove user from pending-release during active global kill.
OM-unavailable: pending cancel-all not re-driven on recovery: If OM was down during engagement, cancels never fired and nothing persists a "cancel-all pending" instruction.
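
The last finding above (a cancel-all that is never re-driven) is the kind of gap a durable intent record would close. The following is a sketch under stated assumptions, not the document's design: `IntentLog`, `engage`, `recover`, and the JSON file layout are hypothetical, and the re-drive assumes that cancel-all is safe to repeat.

```python
import json
from pathlib import Path
from typing import Callable


class IntentLog:
    """Persists 'this must still happen' markers across crashes and outages."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def _read(self) -> dict:
        if not self.path.exists():
            return {}
        return json.loads(self.path.read_text())

    def _write(self, intents: dict) -> None:
        # Write to a temp file, then rename, so a crash never leaves a torn file.
        tmp = self.path.with_name(self.path.name + ".tmp")
        tmp.write_text(json.dumps(intents))
        tmp.replace(self.path)

    def record(self, name: str) -> None:
        intents = self._read()
        intents[name] = "pending"
        self._write(intents)

    def complete(self, name: str) -> None:
        intents = self._read()
        intents[name] = "done"
        self._write(intents)

    def pending(self) -> list[str]:
        return [n for n, status in self._read().items() if status == "pending"]


def engage(log: IntentLog, cancel_all: Callable[[], None]) -> bool:
    log.record("cancel-all")      # persist the intent BEFORE attempting it
    try:
        cancel_all()
    except ConnectionError:       # OM unavailable: the intent stays pending
        return False
    log.complete("cancel-all")
    return True


def recover(log: IntentLog, cancel_all: Callable[[], None]) -> list[str]:
    """Re-drive every intent that was recorded but never completed."""
    redriven = []
    for name in log.pending():
        cancel_all()
        log.complete(name)
        redriven.append(name)
    return redriven
```

The design choice is that the intent is written before the action is attempted, so an outage between the two leaves evidence that recovery can act on.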

Claude Opus Unique Findings

Close-only policy accepts dying engine's close orders (THE insight): Everyone worries about reject-all blocking cancels, but LIQUIDATE mode's close-only policy is the dangerous one — a strategy generating a close signal during the termination window gets it accepted. "A dying-but-not-yet-dead decision engine makes a trading decision that passes the relaxed acceptance policy." Safety mechanism becomes vulnerability.
OM restart: queued messages vs policy initialization: Messages from the (terminated) decision engine sit in OM's inbound queue. Queue drain may begin before policy initialization completes, allowing zombie orders through.
Cold-start reconciliation gate vs market data: After release, policy is "open" but reconciliation hasn't verified positions. Market data feed (which survived engagement) delivers ticks; engine may generate signals against stale positions before reconciliation completes.
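
The OM restart finding can be illustrated with a queue that refuses to drain until its policy has been initialised. This is a minimal sketch with hypothetical names (`RestartingOrderManager`, `initialise_policy`, `drain`). It only models "open" and "reject-all" and is not a description of how the real OM works.

```python
from collections import deque


class RestartingOrderManager:
    def __init__(self) -> None:
        self.policy = None       # None means "not initialised yet"
        self.inbound = deque()   # messages that arrived before or during restart

    def enqueue(self, order: str) -> None:
        self.inbound.append(order)

    def initialise_policy(self, policy: str) -> None:
        self.policy = policy

    def drain(self) -> list[str]:
        """Process queued orders, but only once the policy is known."""
        if self.policy is None:
            return []            # leave everything queued; do not guess a policy
        accepted = []
        while self.inbound:
            order = self.inbound.popleft()
            if self.policy == "open":
                accepted.append(order)
            # any other policy in this sketch drops the order
        return accepted
```

Under this gate, a queued order from a terminated engine is judged against the restored policy rather than against whatever default the OM happened to boot with.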

Claude Sonnet Assessment

Covered all six common-ground themes across its 7 findings, with correct structure
No unique findings beyond what the other models found
Vaguer violation scenarios ("brief window" without specific interleavings)
Muted severity assessments ("inconsistent audit trail" where others saw "financial exposure")
Good for a quick sanity check; not for deep temporal analysis

Key Insights

GPT-5 reasons about WHY ordering matters (not just THAT it matters)

GPT-5's distinctive contribution: several findings (#5, #6, #11, #12) identify temporal issues arising not from incorrect sequencing of correctly-designed operations, but from the ABSENCE of mechanisms that would make ordering irrelevant.

"Global always wins" is a temporal ordering assumption only because no composite policy computation exists. If OM computed effective policy from all sources, ordering wouldn't matter. GPT-5 identifies these "ordering matters because architecture is incomplete" findings — a level deeper than "A must happen before B."

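The difference between last-writer-wins and a composite policy can be sketched in a few lines. This is an illustration under assumptions, not the document's design: the source names, the restrictiveness ranking, and the rule "the most restrictive source wins" are all hypothetical. Only the policy names appear in the findings above.

```python
# Assumed ranking: higher number means more restrictive.
RESTRICTIVENESS = {"open": 0, "close-only": 1, "reject-all": 2}


def last_writer_wins(writes: list[tuple[str, str]]) -> str:
    """Effective policy is whatever arrived last, regardless of source."""
    policy = "open"
    for _source, value in writes:
        policy = value
    return policy


def composite(writes: list[tuple[str, str]]) -> str:
    """Keep the latest value per source, then take the most restrictive."""
    latest: dict[str, str] = {}
    for source, value in writes:
        latest[source] = value
    return max(latest.values(), key=RESTRICTIVENESS.__getitem__, default="open")


# The same two writes delivered in both possible orders.
engage_then_user_open = [("global", "reject-all"), ("per-user", "open")]
user_open_then_engage = [("per-user", "open"), ("global", "reject-all")]
```

With `last_writer_wins` the two delivery orders give different results ("open" versus "reject-all"), while `composite` gives "reject-all" for both. Ordering within a single source still matters in this sketch; only cross-source ordering is made irrelevant.
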
Opus's inversion insight

Opus finding #2 (close-only accepting dying engine's close orders) inverts the obvious concern direction. Everyone thinks "reject-all is dangerous for ordering" (blocks cancels). Opus finds LIQUIDATE mode is more dangerous temporally because it permits a SUBSET of automated orders — and a dying engine could produce exactly that subset. Consistent with Opus's pattern: finding where safety mechanisms become vulnerabilities.
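
One way to picture the close-only hazard, and one possible mitigation, is an epoch fence on incoming orders. The sketch below is hypothetical throughout: the `engine_epoch` stamp, `FencedOrderManager`, and the idea that liquidation orders carry a newer epoch are assumptions made for illustration, not mechanisms described in kill-switch.md.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Order:
    engine_epoch: int      # hypothetical: which engine generation produced it
    closes_position: bool  # True if the order only reduces an existing position


def accept_without_fence(policy: str, order: Order) -> bool:
    """Policy check alone: close-only lets any close order through."""
    if policy == "reject-all":
        return False
    if policy == "close-only":
        return order.closes_position
    return True


class FencedOrderManager:
    def __init__(self) -> None:
        self.policy = "open"
        self.min_epoch = 0

    def engage_liquidate(self, terminating_epoch: int) -> None:
        # Fence out the terminating engine generation, then relax the policy.
        self.min_epoch = terminating_epoch + 1
        self.policy = "close-only"

    def accept(self, order: Order) -> bool:
        if order.engine_epoch < self.min_epoch:
            return False  # produced by a fenced (dying) engine generation
        return accept_without_fence(self.policy, order)


# A close order emitted by engine generation 7 while it is being terminated.
dying_close = Order(engine_epoch=7, closes_position=True)
```

Without the fence, `accept_without_fence("close-only", dying_close)` accepts the order. After `engage_liquidate(7)`, the fenced manager rejects it while still accepting close orders stamped with a newer epoch.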

Task-type positioning

"Temporal ordering dependency analysis" sits between assumption-finding and race conditions:

Closer to assumption-finding (what must be true about sequencing?) → Sonnet performs adequately
Further from race-condition analysis (what happens with truly concurrent events?) → Sonnet does not fail outright
GPT-5 and Opus both excel but at different aspects (systemic gaps vs inverted dangers)

Practical Implications

For temporal ordering analysis on architecture docs:

GPT-5: exhaustive coverage + systemic insights (why does ordering matter?)
Opus: inverted/non-obvious ordering dangers (which direction is actually dangerous?)
Sonnet: adequate sanity check but zero unique insights
Total unique findings after deduplication: ~15 distinct temporal dependencies from 293 lines

Model Hierarchy for This Task Type

GPT-5 — broadest, identifies systemic ordering issues + missing mechanisms (12 findings)
Opus — fewer but includes the architecturally most significant insight (9 findings)
Sonnet — correct themes, no unique depth, fast/cheap (7 findings)
