# Operational Burden Analysis: A New Analytical Lens

**Finding ID:** 56
**Date:** 2026-05-09
**Document:** gargoyle/docs/domain/contexts/risk/escalation-policy.md (~11.8KB, ~240 lines)
**Task type:** Operational burden analysis, a NEW analytical lens
**Prompt:** "Analyze this document for operational burden: manual steps (toil), cognitive load, implicit expertise, decision points lacking guidance, and recovery scenarios without procedures. Focus on what an on-call engineer at 3am would struggle with."
**Models compared:** GPT-5, Claude Sonnet 4.6, Claude Opus 4.6
## Experiment Design

This experiment tests a novel analytical lens: **operational burden analysis**. Unlike gap-finding or contradiction detection, this lens asks: "What cognitive and procedural load does this design place on operators?" The question is particularly relevant for systems with manual intervention points, fail-safes, and recovery procedures.

The hypothesis: models with strong real-world reasoning (GPT-5) might excel at enumerating operational scenarios, while models with strong logical structure (Opus) might better identify where documented automation creates undocumented manual work.
## Performance Metrics

| Model | Time | Input Tokens | Output Tokens | Reasoning Tokens | Findings |
|-------|------|--------------|---------------|------------------|----------|
| GPT-5 | 74s | 2,416 | 5,843 | 4,224 | 21 findings + 3 observations |
| Claude Sonnet 4.6 | 55s | 2,726 | 2,586 | (internal) | 12 findings (excerpt) |
| Claude Opus 4.6 | 52s | 2,726 | 2,397 | (internal) | 12 findings (excerpt) |
## Common Ground (All Three Models)

All three models identified the same three core operational gaps:

1. **Manual liquidate→restrict de-escalation has no procedure.** The document says "confirm the portfolio is safe" but provides no acceptance criteria, tooling, or checklist. Every model flagged this as CRITICAL.

2. **Persistent restrict flag with monitor down requires manual clear without procedure.** The failure mode table says "Manual clear" but doesn't specify how, when, or what to verify first.

3. **Kill switch recovery is absent.** The policy escalates to kill switch automatically but provides no return path. Cross-references exist, but the document itself gives no guidance.
## Model-Specific Strengths

### GPT-5: Exhaustive Enumeration + Operational Context

GPT-5 produced the most comprehensive analysis (21 findings plus 3 observations) with specific operational details:

**Unique catches:**
- Unspecified "N" in "liquidation attempted + N more breaches" before kill switch: the operator can't discover the configured value under pressure
- Multi-metric conflicts: no guidance on prioritizing remediation when metrics are at different levels
- Evaluation frequency changes require proportional debounce/cooldown adjustment, but no formula is provided
- Event stream reconstruction burden: events carry minimal fields, so operators must correlate them mentally
- Mid-incident threshold adjustments: no safe-change procedure or rollback guidance

**Characteristic:** GPT-5 thinks like an on-call engineer. It imagines specific 3am scenarios and asks "what would I need to know?" It was the only model to rate the unspecified value of N as CRITICAL, a configuration gap that Sonnet missed entirely.
### Claude Opus 4.6: Systemic Vulnerability Identification

Opus produced concise findings but identified deeper systemic issues:

**Unique catches:**
- A monitor crash that restarts from Clear is **worse than it appears**: the "Recovery: Automatic" label is misleading because the system resets to an *unsafe* state while risk conditions may persist. The debounce window (3-5s) allows the decision engine to open new positions during unmonitored risk.
- Liquidation death spiral: cascading liquidation in illiquid markets could worsen losses, but there is no guidance on when to override autonomous liquidation.
- Evaluation frequency is referenced but never authoritatively defined, so operators can't derive debounce/cooldown timing.

**Characteristic:** Opus identifies where documented behavior creates undocumented risk. Its "silent loss of escalation context" analysis goes beyond "no procedure" to explain *why* this is dangerous operationally.
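The restart hazard Opus describes can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the policy's actual state machine: the level names follow the document, but `DEBOUNCE_TICKS`, `simulate`, `unmonitored_ticks`, the `restore_level` mitigation, and the one-evaluation-per-tick model are all hypothetical, and only escalation to restrict is modeled.

```python
from enum import IntEnum


class Level(IntEnum):
    CLEAR = 0
    RESTRICT = 1
    LIQUIDATE = 2
    KILL_SWITCH = 3


# Hypothetical: consecutive breached evaluations required before escalating.
DEBOUNCE_TICKS = 3


def simulate(breached, crash_at, restore_level=False):
    """Return the level the decision engine sees after each evaluation.

    breached: list of bools, one per evaluation tick.
    crash_at: tick at which the monitor restarts, just before evaluating.
    restore_level: if True, the restarted monitor reloads its pre-crash
        level (a hypothetical mitigation); if False it restarts from CLEAR.
    """
    level = Level.CLEAR
    streak = 0
    history = []
    for tick, is_breached in enumerate(breached):
        if tick == crash_at:
            streak = 0  # the in-memory debounce counter is lost either way
            if not restore_level:
                level = Level.CLEAR
        streak = streak + 1 if is_breached else 0
        if streak >= DEBOUNCE_TICKS and level < Level.RESTRICT:
            level = Level.RESTRICT
        history.append(level)
    return history


def unmonitored_ticks(history, breached, crash_at):
    """Count post-crash ticks where risk is breached but the level is CLEAR."""
    return sum(
        1
        for tick in range(crash_at, len(history))
        if breached[tick] and history[tick] == Level.CLEAR
    )


breached = [True] * 10  # the risk condition persists for the whole run
crash_at = 5
reset = simulate(breached, crash_at)
restored = simulate(breached, crash_at, restore_level=True)
print(unmonitored_ticks(reset, breached, crash_at))
print(unmonitored_ticks(restored, breached, crash_at))
```

In this model the restart-from-Clear run leaves evaluations after the crash where the breach persists but the engine sees Clear, while the run that reloads its level leaves none. How long that gap lasts in the real system depends on the undocumented evaluation frequency.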
### Claude Sonnet 4.6: Structured, Complete, Efficient

Sonnet produced well-organized findings with clear severity stratification:

**Unique catches:**
- Authorization gap: who is authorized to perform manual de-escalation confirmations?
- Timeout uncertainty: can the liquidate state persist indefinitely?
- "Broker unavailable" definition: is it total outage or partial fill failure?

**Characteristic:** Sonnet thinks like a runbook author. It asks practical questions about authorization, timeouts, and edge definitions. Its analysis is more procedural than the others.
## Overlap Analysis

| Finding | GPT-5 | Sonnet | Opus |
|---------|-------|--------|------|
| Liquidate de-escalation has no procedure | ✅ CRITICAL | ✅ CRITICAL | ✅ CRITICAL |
| Restrict flag manual clear undefined | ✅ CRITICAL | ✅ CRITICAL | ✅ HIGH |
| Kill switch recovery absent | ✅ HIGH | ✅ CRITICAL | ✅ HIGH |
| Broker unavailability response undefined | ✅ HIGH | ✅ HIGH | ❌ |
| Monitor crash state loss danger | ❌ | ❌ | ✅ CRITICAL |
| Multi-metric state cognitive load | ✅ HIGH | ✅ HIGH | ❌ |
| Unspecified N before kill switch | ✅ CRITICAL | ❌ | ✅ HIGH |
| Liquidation death spiral risk | ❌ | ❌ | ✅ HIGH |
| Evaluation frequency undefined | ✅ HIGH | ❌ | ✅ HIGH |
| Mid-incident configuration changes | ✅ MEDIUM | ❌ | ❌ |
| Authorization for manual actions | ❌ | ✅ (implicit) | ❌ |

**Union:** 19 unique findings across all three models (the table above shows 11 of them)
**Intersection:** 3 findings all agreed on
**GPT-5 unique:** 4 findings
**Opus unique:** 2 findings (but deeper systemic analysis)
**Sonnet unique:** 2 findings
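The overlap figures are set arithmetic, which can be sketched for the rows that the table actually shows. This is an illustrative tally only: the table covers 11 of the 19 findings in the union, so the per-model unique counts it produces are lower bounds on the totals reported above, and the `TABLE` and `tally` names are hypothetical.

```python
# Rows transcribed from the overlap table: finding -> models that flagged it.
TABLE = {
    "Liquidate de-escalation has no procedure": {"GPT-5", "Sonnet", "Opus"},
    "Restrict flag manual clear undefined": {"GPT-5", "Sonnet", "Opus"},
    "Kill switch recovery absent": {"GPT-5", "Sonnet", "Opus"},
    "Broker unavailability response undefined": {"GPT-5", "Sonnet"},
    "Monitor crash state loss danger": {"Opus"},
    "Multi-metric state cognitive load": {"GPT-5", "Sonnet"},
    "Unspecified N before kill switch": {"GPT-5", "Opus"},
    "Liquidation death spiral risk": {"Opus"},
    "Evaluation frequency undefined": {"GPT-5", "Opus"},
    "Mid-incident configuration changes": {"GPT-5"},
    "Authorization for manual actions": {"Sonnet"},
}
MODELS = ("GPT-5", "Sonnet", "Opus")


def tally(table, models):
    """Summarize overlap: rows listed, rows all models share, rows unique to each."""
    shared_by_all = [f for f, who in table.items() if who == set(models)]
    unique = {
        model: [f for f, who in table.items() if who == {model}]
        for model in models
    }
    return {
        "rows": len(table),
        "intersection": len(shared_by_all),
        "unique": {model: len(found) for model, found in unique.items()},
    }


result = tally(TABLE, MODELS)
print(result)
```

For the tabulated rows, the intersection of 3 and the 2 Opus-only findings match the reported figures; the GPT-5 and Sonnet unique counts come out lower than reported because the remaining findings are not itemized in the table.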
## Key Insight: Operational Burden Analysis Is Model-Sensitive

This lens reveals clear model personalities:

1. **GPT-5** enumerates exhaustively. It walks through every section and asks "what would an operator need here?" This catches configuration gaps and parameter questions.

2. **Opus** identifies systemic vulnerabilities. It asks "where does documented automation create hidden risk?" This catches the monitor-crash-to-clear danger that the others missed.

3. **Sonnet** structures procedurally. It asks "what would a runbook need?" This catches authorization and timeout questions.

For comprehensive operational burden analysis: **run all three**. GPT-5 finds the parameters, Opus finds the systemic risks, Sonnet fills in the procedural gaps.
## Actionable Recommendations for Escalation Policy Document

Based on this analysis, the escalation policy needs:

1. **Liquidate de-escalation runbook** (CRITICAL)
   - Explicit acceptance criteria for "portfolio is safe"
   - Step-by-step procedure with tooling/interface
   - Authorization requirements

2. **Recovery procedures for each failure mode** (CRITICAL)
   - Monitor crash: reconciliation procedure, re-escalation timeline
   - Restrict flag stuck: detection, clearance procedure, verification

3. **Kill switch recovery section** (HIGH)
   - Either in this document or explicit pointer with summary

4. **Configuration reference** (HIGH)
   - Define N for "liquidation + N breaches"
   - Define default evaluation frequency
   - Document how to discover configured values

5. **Operator intervention guidance** (HIGH)
   - When to override autonomous liquidation
   - How to handle multi-metric conflicts
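To make recommendations 1 and 4 concrete, a configuration reference and a de-escalation checklist could look roughly like the sketch below. Everything in it is an assumption for illustration: `EscalationConfig`, its field names and default values, `unmet_criteria`, and the three acceptance criteria are hypothetical and do not come from the escalation policy. Expressing debounce as a number of evaluations is one possible way to keep it proportional to the evaluation interval, not a formula from the source.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class EscalationConfig:
    """Hypothetical configuration reference; names and defaults are illustrative."""

    # N in "liquidation attempted + N more breaches" before the kill switch.
    breaches_before_kill_switch: int = 3
    # Seconds between risk evaluations.
    evaluation_interval_s: float = 1.0
    # Debounce expressed in evaluations so it scales with the interval.
    debounce_evaluations: int = 3

    @property
    def debounce_window_s(self) -> float:
        return self.debounce_evaluations * self.evaluation_interval_s

    def describe(self) -> str:
        """Render every configured and derived value for an on-call operator."""
        lines = [f"{f.name} = {getattr(self, f.name)}" for f in fields(self)]
        lines.append(f"debounce_window_s = {self.debounce_window_s} (derived)")
        return "\n".join(lines)


def unmet_criteria(checks: dict[str, bool]) -> list[str]:
    """Return the de-escalation acceptance criteria that are not yet met."""
    return [name for name, passed in checks.items() if not passed]


config = EscalationConfig()
print(config.describe())

# Hypothetical acceptance criteria for "portfolio is safe".
checks = {
    "all_metrics_below_restrict_threshold": True,
    "no_open_liquidation_orders": False,
    "operator_authorized": True,
}
print(unmet_criteria(checks))
```

The point of `describe` is discoverability: an operator can see N and the effective debounce window in one place instead of deriving them under pressure. The checklist returns the criteria that still block de-escalation, so an empty list is the explicit definition of "safe".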
## Conclusion

Operational burden analysis is a valuable lens that complements gap-finding and contradiction detection. It asks a fundamentally different question: not "what's missing from the spec" but "what will the human need when this runs?"

**Model recommendation for this lens:**
- Run GPT-5 for exhaustive enumeration
- Run Opus for systemic vulnerability identification
- Optionally run Sonnet for procedural structure
- Union the findings, since they don't overlap much

**Cost-effectiveness:** Opus (52s, 2,726+2,397 tokens) was fastest and found the most dangerous systemic issue (monitor crash → silent unsafe state). For time-constrained analysis, Opus alone catches the critical issues. For thoroughness, the union of GPT-5 and Opus gives the broadest coverage.
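The cost comparison can be reproduced from the Performance Metrics table. The sketch below is a rough aid, with two assumptions stated up front: combined runs are treated as sequential (wall-clock times are summed), and the table does not say whether GPT-5's output tokens include its reasoning tokens, so they are not added separately. The `RUNS`, `summarize`, and `combined` names are hypothetical.

```python
# Figures transcribed from the Performance Metrics table.
RUNS = {
    "GPT-5": {"seconds": 74, "input_tokens": 2416, "output_tokens": 5843},
    "Claude Sonnet 4.6": {"seconds": 55, "input_tokens": 2726, "output_tokens": 2586},
    "Claude Opus 4.6": {"seconds": 52, "input_tokens": 2726, "output_tokens": 2397},
}


def summarize(runs):
    """Per-model total tokens and output throughput."""
    summary = {}
    for model, run in runs.items():
        summary[model] = {
            "total_tokens": run["input_tokens"] + run["output_tokens"],
            "output_tokens_per_s": round(run["output_tokens"] / run["seconds"], 1),
        }
    return summary


def combined(runs, models):
    """Total (seconds, tokens) for running the given models one after another."""
    seconds = sum(runs[m]["seconds"] for m in models)
    tokens = sum(runs[m]["input_tokens"] + runs[m]["output_tokens"] for m in models)
    return seconds, tokens


summary = summarize(RUNS)
for model, stats in summary.items():
    print(model, stats)

print("GPT-5 + Opus:", combined(RUNS, ["GPT-5", "Claude Opus 4.6"]))
print("All three:", combined(RUNS, list(RUNS)))
```

If the models run in parallel instead, the wall-clock cost of a union is the slowest single run, not the sum, so only the token totals would carry over.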