finding 41: temporal ordering dependency analysis on kill-switch.md
New analytical lens testing whether models can identify sequential operations where order matters but isn't mechanically enforced. GPT-5 finds systemic gaps (WHY ordering matters), Opus finds inverted dangers (WHICH direction is dangerous), Sonnet identifies themes without unique depth.
@@ -0,0 +1,122 @@
# Finding 41: Temporal Ordering Dependency Analysis

**Date:** 2026-05-07
**Document:** gargoyle `kill-switch.md` (293 lines)
**Analytical lens:** Temporal ordering dependencies: places where operations are described
sequentially but nothing mechanically enforces that ordering

## Experiment Design

**Task:** Identify places where the document assumes operations happen in a specific
sequence but doesn't mechanically enforce it, and where reordering (due to crashes,
async events, operator timing, or message ordering) would violate correctness.

**Key distinction from race condition analysis:** This is about SEQUENTIAL operations
where order matters but isn't guaranteed, not about truly concurrent events.

**Prompt structure:** Specified 5 focus areas (multi-step engagement/disengagement,
cross-component coordination, recovery/restart, operator timing, event vs state ordering).
Required per-finding format (dependency, assumed ordering, enforcement gap, violation
scenario, impact). Excluded single-component bugs, hardware failures, pure race conditions.

**Models:** GPT-5, Claude Opus 4.6, Claude Sonnet 4.6 via HAI proxy. No tools, no
project context beyond the document itself. Single prompt, no conversation history.

## Results

| Model | Time | Output tokens | Reasoning tokens | Findings |
|---|---|---|---|---|
| GPT-5 | 122s | 12,437 | 9,856 | 12 |
| Claude Opus 4.6 | 70s | 2,903 | (internal) | 9 |
| Claude Sonnet 4.6 | 24s | 1,231 | (internal) | 7 |

## Common Ground (identified by all 3 models)

- Persistence write before decision engine termination (crash recovery gap)
- Decision engine termination vs acceptance policy update ordering
- Acceptance policy change before order cancellations
- Event emission vs state change visibility
- Application restart: components starting before kill state loads
- Restrict→liquidate transition without state verification enforcement

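
To make "mechanically enforced" concrete, the sketch below shows one way a sequence could refuse out-of-order steps instead of relying on prose. It is illustrative only: `StepGuard`, `OrderingViolation`, the step names, and the prerequisite map are hypothetical, loosely modeled on the items above and not taken from `kill-switch.md`.

```python
class OrderingViolation(Exception):
    """Raised when a step is attempted before its prerequisites completed."""


class StepGuard:
    """Refuses to run a step until all of its prerequisites have completed."""

    def __init__(self, prerequisites):
        self._prereqs = prerequisites   # step name -> set of required step names
        self._done = set()              # in-memory only; does not survive a crash

    def run(self, step, action):
        missing = self._prereqs.get(step, set()) - self._done
        if missing:
            raise OrderingViolation(f"{step} requires {sorted(missing)} first")
        action()
        self._done.add(step)            # recorded only after the action returns


# Hypothetical prerequisites (assumed, not from the source document).
ENGAGE_PREREQS = {
    "terminate_engine": {"persist_kill_state"},
    "cancel_orders": {"set_acceptance_policy"},
}
```

Because `_done` lives in memory, this guard covers operator and code-path reordering but not the crash-recovery gap; closing that gap would need the completed-step set to be persisted as well.
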
## GPT-5 Unique Findings

1. **Acceptance policy authority conflict**: OM's "blind enforcement" combined with "other
   sources can set policy" means no lock, lease, or priority prevents a risk monitor from
   clearing a transient alert and setting OM back to "open" during engagement. Ordering
   matters BECAUSE no coordination exists.
2. **Global vs per-user write ordering**: OM doesn't distinguish sources. A per-user "open"
   arriving after a global "engage" wins (last-writer-wins). No composite policy computation.
3. **Disengage: policy open before operator release**: If release fires before OM applies
   the open policy, the restarted engine's orders are rejected (a false start).
4. **Cancel-all position snapshot timing**: Liquidation's "load all positions" doesn't wait
   for fill/cancel events from step 1 to be ingested, so sizing is wrong.
5. **Event ordering: per-user disengage vs global engage**: User lifecycle processes events
   in delivery order, not logical order, so it could remove a user from pending-release
   during an active global kill.
6. **OM-unavailable: pending cancel-all not re-driven on recovery**: If OM was down during
   engagement, cancels never fired and nothing persists a "cancel-all pending" instruction.

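
Finding 6 comes down to a missing durable intent. As a minimal sketch of what persisting a "cancel-all pending" instruction could look like, the code below records the intent before attempting it and re-drives it on recovery. `IntentLog`, `engage`, `on_om_recovered`, and the JSON file layout are all hypothetical, and the sketch assumes cancel-all is idempotent.

```python
import json
import tempfile
from pathlib import Path


class IntentLog:
    """Durable flag recording that a cancel-all is still owed to the OM."""

    def __init__(self, path):
        self.path = Path(path)

    def _write(self, pending):
        # A production version would need an atomic write plus fsync.
        self.path.write_text(json.dumps({"cancel_all_pending": pending}))

    def mark_pending(self):
        self._write(True)

    def clear(self):
        self._write(False)

    def is_pending(self):
        if not self.path.exists():
            return False
        return json.loads(self.path.read_text())["cancel_all_pending"]


def engage(log, om_up, cancel_all):
    log.mark_pending()        # persist the intent BEFORE attempting it
    if om_up:
        cancel_all()
        log.clear()


def on_om_recovered(log, cancel_all):
    if log.is_pending():      # re-drive whatever engagement could not finish
        cancel_all()
        log.clear()


def demo():
    """OM is down during engagement, then recovers twice."""
    calls = []
    with tempfile.TemporaryDirectory() as d:
        log = IntentLog(Path(d) / "intent.json")
        engage(log, om_up=False, cancel_all=lambda: calls.append("cancel"))
        on_om_recovered(log, cancel_all=lambda: calls.append("cancel"))
        on_om_recovered(log, cancel_all=lambda: calls.append("cancel"))
    return calls
```

In `demo()`, the cancel fires once on the first recovery and not again on the second, because the flag is cleared after the cancel completes.
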
## Claude Opus Unique Findings

1. **Close-only policy accepts dying engine's close orders** (THE insight): Everyone worries
   about reject-all blocking cancels, but LIQUIDATE mode's close-only policy is the dangerous
   one: a strategy generating a close signal during the termination window gets it accepted.
   "A dying-but-not-yet-dead decision engine makes a trading decision that passes the relaxed
   acceptance policy." Safety mechanism becomes vulnerability.
2. **OM restart: queued messages vs policy initialization**: Messages from the (terminated)
   decision engine sit in OM's inbound queue. Queue drain may begin before policy initialization
   completes, allowing zombie orders through.
3. **Cold-start reconciliation gate vs market data**: After release, policy is "open" but
   reconciliation hasn't verified positions. Market data feed (which survived engagement)
   delivers ticks; engine may generate signals against stale positions before reconciliation
   completes.

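
The queue-drain hazard in finding 2 can be illustrated with a toy gate. In the sketch below, OM is read as an order manager (an assumption), and the class, the `live_epoch` field, and the per-message engine epoch are hypothetical devices for telling a terminated engine's messages apart from a fresh one's. None of this is described in the source document.

```python
from collections import deque


class OrderManager:
    """Toy OM that refuses to drain its inbound queue before policy is set."""

    def __init__(self):
        self.inbound = deque()   # (order_id, engine_epoch) pairs
        self.policy = None       # None means "not initialized yet"
        self.live_epoch = None   # epoch of the currently running engine
        self.accepted = []
        self.rejected = []

    def enqueue(self, order_id, engine_epoch):
        self.inbound.append((order_id, engine_epoch))

    def initialize_policy(self, policy, live_epoch):
        self.policy = policy
        self.live_epoch = live_epoch

    def drain(self):
        # Gate: draining before initialization is an error, not a silent pass.
        if self.policy is None:
            raise RuntimeError("queue drain attempted before policy initialization")
        while self.inbound:
            order_id, epoch = self.inbound.popleft()
            # Messages stamped with an older epoch came from a terminated engine.
            if self.policy == "open" and epoch == self.live_epoch:
                self.accepted.append(order_id)
            else:
                self.rejected.append(order_id)
```

The gate alone closes the "drain before init" window; the epoch check additionally rejects zombie orders that are still queued once the policy is back to "open".
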
## Claude Sonnet Assessment

- Covered all six common-ground themes across its 7 findings, with correct structure
- No unique findings beyond what the other models found
- Vaguer violation scenarios ("brief window" without specific interleavings)
- Muted severity assessments ("inconsistent audit trail" where others saw "financial exposure")
- Good for quick sanity check; not for deep temporal analysis

## Key Insights

### GPT-5 reasons about WHY ordering matters (not just THAT it matters)

GPT-5's distinctive contribution: several findings (#5, #6, #11, #12) identify temporal
issues arising not from incorrect sequencing of correctly-designed operations, but from
the ABSENCE of mechanisms that would make ordering irrelevant.

"Global always wins" is a temporal ordering assumption only because no composite policy
computation exists. If OM computed effective policy from all sources, ordering wouldn't
matter. GPT-5 identifies these "ordering matters because architecture is incomplete"
findings, a level deeper than "A must happen before B."

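
A small sketch can show why a composite computation removes the ordering dependency. It assumes a "most restrictive source wins" rule, which is one plausible composition and not something the source document specifies; `Policy`, `CompositePolicy`, the source names, and `last_writer_wins` are hypothetical.

```python
from enum import IntEnum


class Policy(IntEnum):
    """Acceptance policies ordered from least to most restrictive."""
    OPEN = 0
    CLOSE_ONLY = 1
    REJECT_ALL = 2


class CompositePolicy:
    """Tracks one policy per source and derives the effective policy."""

    def __init__(self):
        self._by_source = {}

    def set(self, source, policy):
        # Each source overwrites only its own slot, never another source's.
        self._by_source[source] = policy

    def effective(self):
        # max() over restrictiveness is commutative, so the arrival order
        # of writes from different sources cannot change the result.
        return max(self._by_source.values(), default=Policy.OPEN)


def last_writer_wins(writes):
    """The contrasting behavior: whichever write arrives last is in force."""
    policy = Policy.OPEN
    for _source, p in writes:
        policy = p
    return policy
```

With writes `("kill_switch", REJECT_ALL)` and `("risk_monitor", OPEN)`, `last_writer_wins` returns `OPEN` or `REJECT_ALL` depending on arrival order, while `CompositePolicy.effective()` returns `REJECT_ALL` either way. Ordering still matters within a single source, since each source's latest write replaces its earlier one.
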
### Opus's inversion insight

Opus finding #2 (close-only accepting dying engine's close orders) inverts the obvious
concern direction. Everyone thinks "reject-all is dangerous for ordering" (blocks cancels).
Opus finds LIQUIDATE mode is more dangerous temporally because it permits a SUBSET of
automated orders, and a dying engine could produce exactly that subset. Consistent with
Opus's pattern: finding where safety mechanisms become vulnerabilities.

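
The inversion can be made concrete by comparing a close-only check that looks only at order direction with one that also looks at who sent the order. This is a hedged sketch: the `Order` fields, the source tags, and the allow-list are hypothetical and are not claimed to match how the kill switch's acceptance policy is actually implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Order:
    source: str             # hypothetical origin tag, e.g. "decision_engine"
    reduces_position: bool  # True for a close, False for an open or increase


def accept_close_only(order):
    """Close-only as a pure direction check: any position-reducing order passes."""
    return order.reduces_position


def accept_close_only_fenced(order, allowed_sources):
    """Close-only plus a source allow-list for the liquidation window."""
    return order.reduces_position and order.source in allowed_sources


LIQUIDATION_SOURCES = frozenset({"liquidator"})  # assumed name
dying_engine_close = Order(source="decision_engine", reduces_position=True)
liquidator_close = Order(source="liquidator", reduces_position=True)
```

Under the direction-only check, `dying_engine_close` is accepted, which is the window Opus describes. Under the fenced check it is rejected while `liquidator_close` still passes, so the relaxed policy no longer depends on the engine being fully terminated first.
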
### Task-type positioning

"Temporal ordering dependency analysis" sits between assumption-finding and race conditions:
- Closer to assumptions (what must be true about sequencing?) → Sonnet performs adequately
- Further from races (what happens with truly concurrent events?) → Sonnet doesn't fail
- GPT-5 and Opus both excel but at different aspects (systemic gaps vs inverted dangers)

## Practical Implications

For temporal ordering analysis on architecture docs:
- **GPT-5**: exhaustive coverage + systemic insights (why does ordering matter?)
- **Opus**: inverted/non-obvious ordering dangers (which direction is actually dangerous?)
- **Sonnet**: adequate sanity check but zero unique insights
- Total unique findings after deduplication: ~15 distinct temporal dependencies from 293 lines
  (6 shared by all models, 6 unique to GPT-5, 3 unique to Opus)

## Model Hierarchy for This Task Type

1. GPT-5: broadest, identifies systemic ordering issues + missing mechanisms (12 findings)
2. Opus: fewer but includes the architecturally most significant insight (9 findings)
3. Sonnet: correct themes, no unique depth, fast/cheap (7 findings)