model-research/findings/2026-05-09-56-operational-burden-analysis.md
claw b7acbd7662 Finding #56: Operational burden analysis - new analytical lens
Tests a novel lens asking 'what cognitive/procedural load does this design
place on operators?' Applied to escalation-policy.md with GPT-5, Sonnet 4.6,
and Opus 4.6.

Key findings:
- All models identified manual liquidate→restrict has no procedure (CRITICAL)
- GPT-5 excels at exhaustive enumeration (21+ findings, config gaps)
- Opus identifies systemic vulnerabilities (monitor crash → silent unsafe state)
- Sonnet fills procedural gaps (authorization, timeouts)

Recommendation: Opus alone for time-constrained analysis, GPT-5 + Opus for
thoroughness. They find different types of issues with minimal overlap.
2026-05-09 06:46:29 -07:00

Operational Burden Analysis: A New Analytical Lens

Finding ID: 56
Date: 2026-05-09
Document: gargoyle/docs/domain/contexts/risk/escalation-policy.md (~11.8KB, ~240 lines)
Task type: Operational burden analysis (a NEW analytical lens)
Prompt: "Analyze this document for operational burden: manual steps (toil), cognitive load, implicit expertise, decision points lacking guidance, and recovery scenarios without procedures. Focus on what an on-call engineer at 3am would struggle with."
Models compared: GPT-5, Claude Sonnet 4.6, Claude Opus 4.6

Experiment Design

This experiment tests a novel analytical lens: operational burden analysis. Unlike gap-finding or contradiction detection, this lens asks: "What cognitive and procedural load does this design place on operators?" This is particularly relevant for systems with manual intervention points, fail-safes, and recovery procedures.

The hypothesis: models with strong real-world reasoning (GPT-5) might excel at enumerating operational scenarios, while models with strong logical structure (Opus) might better identify where documented automation creates undocumented manual work.

Performance Metrics

| Model | Time | Input Tokens | Output Tokens | Reasoning Tokens | Findings |
|---|---|---|---|---|---|
| GPT-5 | 74s | 2,416 | 5,843 | 4,224 | 21 findings + 3 observations |
| Claude Sonnet 4.6 | 55s | 2,726 | 2,586 | (internal) | 12 findings (excerpt) |
| Claude Opus 4.6 | 52s | 2,726 | 2,397 | (internal) | 12 findings (excerpt) |

Common Ground (All Three Models)

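All three models identified the following core operational gaps. Only the first was rated CRITICAL by every model; the other two were rated HIGH or CRITICAL depending on the model: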

  1. Manual liquidate→restrict de-escalation has no procedure — The document says "confirm the portfolio is safe" but provides no acceptance criteria, tooling, or checklist. Every model flagged this as CRITICAL.

  2. Persistent restrict flag with monitor down requires manual clear without procedure — The failure mode table says "Manual clear" but doesn't specify how, when, or what to verify first.

  3. Kill switch recovery is absent — The policy escalates to kill switch automatically but provides no return path. Cross-references exist but no in-document guidance.

Model-Specific Strengths

GPT-5: Exhaustive Enumeration + Operational Context

GPT-5 produced the most comprehensive analysis (21+ findings) with specific operational details:

Unique catches:

  • Unspecified "N" in "liquidation attempted + N more breaches" before kill switch — operator can't discover configured value under pressure
  • Multi-metric conflicts: no guidance on prioritizing remediation when metrics are at different levels
  • Evaluation frequency changes require proportional debounce/cooldown adjustment — no formula provided
  • Event stream reconstruction burden: minimal event fields require mental correlation
  • Mid-incident threshold adjustments: no safe-change procedure or rollback guidance

Characteristic: GPT-5 thinks like an on-call engineer. It imagines specific 3am scenarios and asks "what would I need to know?" It found configuration gaps (value of N) that the other models missed.

Claude Opus 4.6: Systemic Vulnerability Identification

Opus produced concise findings but identified deeper systemic issues:

Unique catches:

  • Restarting from Clear after a monitor crash is worse than it appears: the "Recovery: Automatic" label is misleading because the system resets to an unsafe state while risk conditions may persist. The debounce window (3-5s) allows the decision engine to open new positions during unmonitored risk.
  • Liquidation death spiral: cascading liquidation in illiquid markets could worsen losses, but no guidance on when to override autonomous liquidation.
  • Evaluation frequency is referenced but never authoritatively defined — operators can't derive debounce/cooldown timing without it.
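
The monitor-crash hazard above can be illustrated with a toy model. This is a minimal sketch, not the policy's actual logic: the `ToyMonitor` class, the breach-count debounce, and the rule that only the Clear level permits new positions are all assumptions made for illustration. It compares a monitor that restarts from Clear with one that restores a persisted level.

```python
from enum import IntEnum


class Level(IntEnum):
    # Level names follow the write-up; the numeric ordering is an assumption.
    CLEAR = 0
    RESTRICT = 1
    LIQUIDATE = 2
    KILL_SWITCH = 3


class ToyMonitor:
    """Hypothetical monitor: escalates to RESTRICT after N consecutive breaches."""

    def __init__(self, debounce_breaches, restored_level=Level.CLEAR):
        self.debounce_breaches = debounce_breaches
        # A restart either resets to CLEAR (default) or restores a persisted level.
        self.level = restored_level
        self.consecutive_breaches = 0

    def evaluate(self, breached):
        # Debounce: count consecutive breaches, reset the count on a clean evaluation.
        self.consecutive_breaches = self.consecutive_breaches + 1 if breached else 0
        if self.level < Level.RESTRICT and self.consecutive_breaches >= self.debounce_breaches:
            self.level = Level.RESTRICT
        return self.level

    def may_open_positions(self):
        # Assumption: only CLEAR allows the decision engine to open positions.
        return self.level == Level.CLEAR


def unguarded_evaluations(monitor, breach_stream):
    """Count evaluations where a breach was present but new positions were still allowed."""
    unguarded = 0
    for breached in breach_stream:
        monitor.evaluate(breached)
        if breached and monitor.may_open_positions():
            unguarded += 1
    return unguarded


still_breaching = [True] * 5  # risk condition persists across the restart
after_reset = unguarded_evaluations(ToyMonitor(debounce_breaches=3), still_breaching)
after_restore = unguarded_evaluations(
    ToyMonitor(debounce_breaches=3, restored_level=Level.RESTRICT), still_breaching
)
```

In this toy setup the reset-to-Clear monitor leaves a window of evaluations where a breach is present but positions can still be opened, while the monitor that restores its level leaves none. Whether the real system persists escalation state is exactly what the policy document would need to specify.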

Characteristic: Opus identifies where documented behavior creates undocumented risk. Its "silent loss of escalation context" analysis goes beyond "no procedure" to explain why this is dangerous operationally.

Claude Sonnet 4.6: Structured, Complete, Efficient

Sonnet produced well-organized findings with clear severity stratification:

Unique catches:

  • Authorization gap: who is authorized to perform manual de-escalation confirmations?
  • Timeout uncertainty: can the liquidate state persist indefinitely?
  • "Broker unavailable" definition: is it total outage or partial fill failure?

Characteristic: Sonnet thinks like a runbook author. It asks practical questions about authorization, timeouts, and edge definitions. Its analysis is more procedural than the others.

Overlap Analysis

Finding GPT-5 Sonnet Opus
Liquidate de-escalation has no procedure CRITICAL CRITICAL CRITICAL
Restrict flag manual clear undefined CRITICAL CRITICAL HIGH
Kill switch recovery absent HIGH CRITICAL HIGH
Broker unavailability response undefined HIGH HIGH
Monitor crash state loss danger CRITICAL
Multi-metric state cognitive load HIGH HIGH
Unspecified N before kill switch CRITICAL HIGH
Liquidation death spiral risk HIGH
Evaluation frequency undefined HIGH HIGH
Mid-incident configuration changes MEDIUM
Authorization for manual actions (implicit)

Union: 19 unique findings across all three models
Intersection: 3 findings flagged by all three models
GPT-5 unique: 4 findings
Opus unique: 2 findings (but deeper systemic analysis)
Sonnet unique: 2 findings
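
Union, intersection, and per-model unique counts can be computed mechanically once each model's findings are normalized to shared labels. The sketch below is a hypothetical illustration: the short labels are invented here, and the sets cover only the findings named in the Common Ground and Model-Specific Strengths sections, so the counts it yields apply to that subset rather than to the full model outputs.

```python
# Labels are hypothetical shorthand for findings named earlier in this write-up.
COMMON = {"liquidate-deescalation", "restrict-flag-clear", "kill-switch-recovery"}

findings = {
    "GPT-5": COMMON | {
        "unspecified-n",
        "multi-metric-conflicts",
        "frequency-change-formula",
        "event-stream-reconstruction",
        "mid-incident-threshold-changes",
    },
    "Opus": COMMON | {
        "monitor-crash-resets-to-clear",
        "liquidation-death-spiral",
        "evaluation-frequency-undefined",
    },
    "Sonnet": COMMON | {
        "authorization-gap",
        "liquidate-timeout",
        "broker-unavailable-definition",
    },
}


def overlap(findings):
    """Return (union, intersection, per-model unique findings) for a dict of sets."""
    all_sets = list(findings.values())
    union = set().union(*all_sets)
    shared = set.intersection(*all_sets)
    unique = {
        model: found - set().union(
            *(other for name, other in findings.items() if name != model)
        )
        for model, found in findings.items()
    }
    return union, shared, unique


union, shared, unique = overlap(findings)
for model, only_here in unique.items():
    print(f"{model}: {len(only_here)} unique of {len(findings[model])}")
print(f"union={len(union)} intersection={len(shared)}")
```

The hard part in practice is the normalization step, deciding when two differently worded findings are the same issue; the set arithmetic itself is trivial once that mapping exists.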

Key Insight: Operational Burden Analysis Is Model-Sensitive

This lens reveals clear model personalities:

  1. GPT-5 enumerates exhaustively. It walks through every section and asks "what would an operator need here?" This catches configuration gaps and parameter questions.

  2. Opus identifies systemic vulnerabilities. It asks "where does documented automation create hidden risk?" This catches the monitor-crash-to-clear danger that the others missed.

  3. Sonnet structures procedurally. It asks "what would a runbook need?" This catches authorization and timeout questions.

For comprehensive operational burden analysis: run all three. GPT-5 finds the parameters, Opus finds the systemic risks, Sonnet fills in the procedural gaps.

Actionable Recommendations for Escalation Policy Document

Based on this analysis, the escalation policy needs:

  1. Liquidate de-escalation runbook (CRITICAL)

    • Explicit acceptance criteria for "portfolio is safe"
    • Step-by-step procedure with tooling/interface
    • Authorization requirements
  2. Recovery procedures for each failure mode (CRITICAL)

    • Monitor crash: reconciliation procedure, re-escalation timeline
    • Restrict flag stuck: detection, clearance procedure, verification
  3. Kill switch recovery section (HIGH)

    • Either in this document or explicit pointer with summary
  4. Configuration reference (HIGH)

    • Define N for "liquidation + N breaches"
    • Define default evaluation frequency
    • Document how to discover configured values
  5. Operator intervention guidance (HIGH)

    • When to override autonomous liquidation
    • How to handle multi-metric conflicts
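
One way the configuration reference in item 4 could look is sketched below. Everything in it is an assumption: the `EscalationConfig` name, the field names, and all numeric values are placeholders, not values taken from the escalation policy. The sketch keeps debounce and cooldown in seconds and derives the cycle counts from the evaluation interval, so that changing the evaluation frequency does not require a manual proportional adjustment.

```python
import math
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class EscalationConfig:
    """Hypothetical single source of truth for escalation timing parameters."""

    evaluation_interval_s: float     # how often the monitor evaluates (placeholder)
    debounce_s: float                # wall-clock debounce window (placeholder)
    cooldown_s: float                # wall-clock cooldown window (placeholder)
    kill_switch_extra_breaches: int  # the "N" in "liquidation attempted + N more breaches"

    @property
    def debounce_cycles(self):
        # Derived, so it tracks evaluation_interval_s automatically.
        return math.ceil(self.debounce_s / self.evaluation_interval_s)

    @property
    def cooldown_cycles(self):
        return math.ceil(self.cooldown_s / self.evaluation_interval_s)

    def describe(self):
        """Render configured and derived values for an operator to read."""
        lines = [f"{key} = {value}" for key, value in asdict(self).items()]
        lines.append(f"debounce_cycles (derived) = {self.debounce_cycles}")
        lines.append(f"cooldown_cycles (derived) = {self.cooldown_cycles}")
        return "\n".join(lines)


# Placeholder values chosen only to make the example concrete.
cfg = EscalationConfig(
    evaluation_interval_s=1.0,
    debounce_s=4.0,
    cooldown_s=60.0,
    kill_switch_extra_breaches=3,
)
faster = replace(cfg, evaluation_interval_s=0.5)  # doubling the evaluation frequency
print(cfg.describe())
```

A `describe()`-style dump addresses the "how to discover configured values" item: an operator under pressure reads one output instead of reconstructing N and the timing windows from several sources.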

Conclusion

Operational burden analysis is a valuable lens that complements gap-finding and contradiction detection. It asks a fundamentally different question: not "what's missing from the spec" but "what will the human need when this runs?"

Model recommendation for this lens:

  • Run GPT-5 for exhaustive enumeration
  • Run Opus for systemic vulnerability identification
  • Optionally run Sonnet for procedural structure
  • Union the findings — they don't overlap much

Cost-effectiveness: Opus (52s; 2,726 input + 2,397 output tokens) was fastest and found the most dangerous systemic issue (monitor crash → silent unsafe state). For time-constrained analysis, Opus alone catches the critical issues. For thoroughness, the union of GPT-5 and Opus findings gave the best coverage in this experiment.