feat: experiment #33 — observability gap analysis on aggregation.md

New analytical lens: observability gap analysis, asking 'when something goes wrong, can you SEE it?' rather than 'what can go wrong?'

Results on aggregation.md (239 lines):
- GPT-5: 23 findings (12 unique), exhaustive telemetry architecture
- Opus: 14 findings (6 unique), operator-behavioral insights
- Sonnet: 11 findings (0 unique), no added value

Key insight: GPT-5 designs the instrumentation; Opus identifies where available signals mislead operators toward wrong remediations. Two-model (GPT-5 + Opus) optimal for this task type.
2026-05-06 11:49:05 -07:00
parent 8cfabfdc55
commit 20c0bd2492
1 changed files with 211 additions and 0 deletions
@@ -0,0 +1,211 @@
+# Experiment #33: Observability Gap Analysis on aggregation.md
+
+**Date:** 2026-05-06
+**Task type:** Observability gap analysis (NEW analytical lens)
+**Document:** gargoyle's `aggregation.md` (239 lines) — decision engine signal aggregation with
+state machines, timers, and cross-component forwarding
+
+## Hypothesis
+
+Observability gap analysis — identifying where system behavior becomes invisible, indistinguishable
+from normal, or impossible to diagnose during failures — is a distinct analytical lens from failure
+analysis or assumption-finding. Instead of asking "what can go wrong," it asks "when something goes
+wrong, can you SEE it?" Models may differ in whether they identify technical instrumentation gaps
+(missing metrics/events) vs. semantic indistinguishability problems (different failures that look
+the same from outside).
+
+## Method
+
+The same structured prompt was sent to all three models. The prompt specified five categories:
+1. Silent failures (no observable signal)
+2. Indistinguishable states (different problems, identical observable pattern)
+3. Diagnostic dead zones (unobservable time windows)
+4. Missing correlation (effects visible, causes invisible)
+5. False-normal signals (metrics healthy but correctness degraded)
+
+Required output format: Gap, Scenario, What's invisible, Impact, What the spec should add.
+
+Models: GPT-5, Claude Opus 4.6, Claude Sonnet 4.6 (all via HAI proxy on anvil).
+
+## Results
+
+| Model | Findings | Output tokens | Reasoning tokens | Latency | Tokens/finding |
+|---|---|---|---|---|---|
+| GPT-5 | 23 | 9,433 | 5,632 | 153s | 656 |
+| Opus 4.6 | 14 | 4,493 | (internal) | 103s | 321 |
+| Sonnet 4.6 | 11 | 1,562 | (internal) | 36s | 142 |
+
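The Tokens/finding column can be recomputed from the other columns. A minimal sketch, assuming the column is a token count divided by the finding count: output tokens alone reproduce the Opus and Sonnet figures (321 and 142), while for GPT-5 output tokens alone give about 410 and adding reasoning tokens gives 655, one off from the table's 656. Treating the GPT-5 figure as output plus reasoning is an assumption; the write-up does not say how it was computed.

```python
# Token and finding counts copied from the Results table.
# "reasoning" is None where the table only says "(internal)".
results = {
    "GPT-5": {"findings": 23, "output": 9433, "reasoning": 5632},
    "Opus 4.6": {"findings": 14, "output": 4493, "reasoning": None},
    "Sonnet 4.6": {"findings": 11, "output": 1562, "reasoning": None},
}


def tokens_per_finding(row, include_reasoning=False):
    """Divide a token count by the finding count for one model."""
    total = row["output"]
    if include_reasoning and row["reasoning"] is not None:
        total += row["reasoning"]
    return total / row["findings"]


# Output tokens only, rounded to the nearest whole token.
per_finding = {model: round(tokens_per_finding(row)) for model, row in results.items()}

# Assumed interpretation of the GPT-5 row: output plus reasoning tokens.
gpt5_with_reasoning = round(
    tokens_per_finding(results["GPT-5"], include_reasoning=True)
)

print(per_finding, gpt5_with_reasoning)
```

If the comparison is meant to be like for like, the output-only figures are the safer ones to compare, since reasoning tokens are not reported for the two Claude models.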
+## Common Ground (all 3 identified)
+
+- **No telemetry during buffering state** — groups are opaque while accumulating signals; only
+  terminal events (completion/expiry) produce observable signals
+- **Decision forwarding failures are silent** — decisions form (event fires) but delivery to
+  PortfolioRisk has no success/failure signal
+- **Crash loses groups with no quantification** — in-flight groups vanish but nothing reports
+  how much was lost
+- **Timeout reason is indistinguishable** — `:timeout` expiry doesn't discriminate between
+  "signals stopped arriving" vs "timeout misconfigured" vs "market conditions changed"
+- **Force-complete decisions look normal downstream** — a decision formed from 1/5 expected
+  signals is indistinguishable from a complete decision to PortfolioRisk
+
+## GPT-5 Unique Findings (not in either Claude model)
+
+1. **No group_id/decision_id correlation across events** — lifecycle events can't be joined;
+   you can't trace a decision back through its group to its constituent signals
+2. **Expired groups lack instrument context** — can't attribute expiration spikes to
+   specific instruments
+3. **Timer start/deadline not observable** — operators can't verify timers were set as intended
+4. **No configuration context on events** — timeout_ms, threshold_N, capacity_limit not
+   attached to events; can't correlate config changes with behavior changes
+5. **Pattern-complete predicate is opaque** — no visibility into evaluation count, partial-match
+   state, or "why false"; impossible to tune pattern strategies
+6. **No per-strategy memory/backpressure signals** — no gauges for buffered signal count
+   or memory footprint; misfiring strategy fills memory silently
+7. **Unknown strategy signal drops are only "logged"** — no structured metric for discarded
+   signals; operational data loss goes unmetered
+8. **No cross-service trace context propagation** — no mention of trace_id/span_id flowing
+   signal → aggregation → PortfolioRisk → OrderManager
+9. **No ranking decision transparency** — when time-windowed selects "best" signal, no
+   visibility into which candidate won, why, or what alternatives existed
+10. **Capacity-triggered force-complete vs normal completion not explicitly monitored** —
+    operators alerting on `:capacity` expirations miss capacity-triggered *completions*
+11. **No version metadata** — events don't carry build/algorithm/config version; version
+    skew causes indistinguishable behavioral drift
+12. **No forwarding queue/latency visibility** — no metric for decision dispatch latency
+    or queue depth between formation and delivery
+
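Several of the GPT-5 findings above (correlation IDs, configuration context on events) come down to what fields a lifecycle event carries. A minimal sketch of that idea, assuming a hypothetical event shape: the class, the event names, and every field other than `group_id`, `decision_id`, and `timeout_ms` are illustrative and not taken from aggregation.md.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LifecycleEvent:
    """Hypothetical event shape; names are illustrative, not from the spec."""
    name: str                          # e.g. "signal_buffered", "decision_formed"
    group_id: str                      # joins every event belonging to one group
    signal_id: Optional[str] = None    # set on per-signal events
    decision_id: Optional[str] = None  # set once a decision forms
    config: dict = field(default_factory=dict)  # e.g. {"timeout_ms": 5000}


def trace_decision(events, decision_id):
    """Return the signal IDs that fed a decision by joining events on group_id."""
    group_ids = {e.group_id for e in events if e.decision_id == decision_id}
    return sorted(
        e.signal_id
        for e in events
        if e.group_id in group_ids and e.signal_id is not None
    )


events = [
    LifecycleEvent("signal_buffered", group_id="g1", signal_id="s1"),
    LifecycleEvent("signal_buffered", group_id="g1", signal_id="s2"),
    LifecycleEvent("signal_buffered", group_id="g2", signal_id="s3"),
    LifecycleEvent("decision_formed", group_id="g1", decision_id="d1",
                   config={"timeout_ms": 5000}),
]
signals_for_d1 = trace_decision(events, "d1")
print(signals_for_d1)
```

Without a shared `group_id` on both the per-signal and the decision events, the join in `trace_decision` has nothing to key on, which is the gap the first finding describes.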
+## Opus Unique Findings (not in either other model)
+
+1. **Signals in-flight during crash window have no fate** — signals dispatched by SignalRisk
+   but not yet received by the aggregator vanish with no trace on either side. Distinguished
+   from "groups lost on crash" because these signals never entered the aggregator's state.
+   Unique insight: the acknowledgment boundary itself is invisible.
+2. **Custom predicate FAILURE is observationally identical to predicate returning false** —
+   a predicate that throws an exception vs. one that correctly returns false produce the same
+   downstream effect (group stays in Buffering, eventually times out). Operators misdiagnose
+   code bugs as strategy calibration problems.
+3. **Capacity expiry and timeout expiry require OPPOSITE remediations but share the same
+   metric pattern** — `:capacity` might mean "limit too low" OR "strategy misfiring."
+   Misfiring requires investigation; low limit requires raising it. Raising the limit on a
+   misfiring strategy converts bounded failure to unbounded memory growth.
+4. **Decision formation-to-market-conditions temporal correlation is missing** — contributing
+   signals were generated at T+0 but the decision forms at T+10min; no metric captures how
+   stale the decision's inputs are relative to current market state. Different from GPT-5's
+   "group duration" finding because this is specifically about *market relevance* decay.
+5. **Expired groups can't be correlated to missed P&L** — expired groups represent missed
+   trades but lack the business content (instrument, direction) needed to compute opportunity
+   cost against actual market moves post-expiry.
+6. **A "healthy but idle" aggregator is indistinguishable from a broken signal channel.**
+   No liveness signal distinguishes "no signals because market is quiet" from "no signals
+   because delivery channel is broken." Unique angle: this creates a false-normal condition
+   specific to the *absence* of activity rather than degradation of existing activity.
+
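Opus's second finding (a predicate that throws looks the same as one that returns false) can be illustrated with a small wrapper that records the two cases separately. This is a sketch under assumptions: the wrapper, the outcome names, and the counter are hypothetical and do not describe how the aggregator evaluates predicates.

```python
from collections import Counter
from enum import Enum


class PredicateOutcome(Enum):
    MATCHED = "matched"
    NOT_MATCHED = "not_matched"
    ERROR = "error"


def evaluate_predicate(predicate, signals, counters):
    """Run a custom predicate and count its outcome as a separate metric.

    An exception keeps the group buffering (same downstream effect as False)
    but is recorded as ERROR so it is not mistaken for a calibration issue.
    """
    try:
        matched = bool(predicate(signals))
    except Exception:
        counters[PredicateOutcome.ERROR] += 1
        return PredicateOutcome.ERROR
    outcome = PredicateOutcome.MATCHED if matched else PredicateOutcome.NOT_MATCHED
    counters[outcome] += 1
    return outcome


counters = Counter()
strict = lambda signals: len(signals) >= 3           # correctly returns False here
buggy = lambda signals: signals[0]["strength"] > 1   # raises TypeError on str input
first = evaluate_predicate(strict, ["s1", "s2"], counters)
second = evaluate_predicate(buggy, ["s1", "s2"], counters)
print(first, second, dict(counters))
```

Both calls leave the group in the same state, but the counters now differ, so a rising ERROR count points at a code bug rather than at strategy tuning.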
+## Sonnet Findings
+
+Sonnet produced 11 findings in 36s. No findings were truly unique — all overlapped substantially
+with GPT-5 or Opus findings. Sonnet's contribution was to identify the same categories of issues
+but at lower specificity:
+
+- Memory leaks from stuck groups (covered more precisely by GPT-5 #6 and Opus #2.3)
+- Decision forwarding silence (common ground)
+- Timeout indistinguishability (common ground)
+- Buffering dead zone (common ground)
+- Crash impact quantification (common ground)
+- Immediate algorithm masking excessive decision rate (covered more precisely by GPT-5 #16)
+- Signal quality hidden by completion metrics (covered by Opus #5.1, GPT-5 #10)
+- Overly permissive predicate (covered by Opus #1.3)
+
+Sonnet was the fastest (36s, 1,562 tokens) but produced no unique insights for this task type.
+
+## Quality Assessment
+
+- **GPT-5** was exhaustive and systematic — 23 findings covering all 5 categories, with specific
+  telemetry event names, measurement fields, and metadata specifications. Multiple findings
+  addressed the *instrumentation architecture* itself (trace propagation, config versioning,
+  event correlation schema). GPT-5 treated this as a telemetry engineering problem and designed
+  a complete observability layer. Its unique contributions are mostly about infrastructure
+  (correlation IDs, trace context, config hashes) that enable diagnosis rather than about
+  specific failure scenarios.
+
+- **Opus** produced fewer findings (14) but several showed qualitatively different reasoning.
+  The "acknowledgment boundary" finding (#1.2; unique finding 1 above) identifies an observability gap that exists
+  *between* components — neither side knows signals were lost because neither side records
+  the handoff. The "opposite remediations" finding (#2.3; unique finding 3 above) identifies where the same metric
+  guides operators toward WRONG actions depending on an invisible variable. Opus consistently
+  reasoned about *what operators would DO* with the available signals, not just what signals
+  are missing.
+
+- **Sonnet** produced no unique value on this task type. Every finding was a less-specific
+  version of something GPT-5 or Opus found. This is consistent with the task-type taxonomy
+  from previous experiments: Sonnet adds nothing on systematic/exhaustive analysis tasks.
+
+## Key Insight — Observability Analysis as Task Type
+
+This is genuinely different from failure analysis or assumption-finding:
+- **Failure analysis** asks: "What can go wrong?"
+- **Assumption-finding** asks: "What must be true for this to work?"
+- **Observability gap analysis** asks: "When something goes wrong, can you SEE it?"
+
+The third question requires reasoning about the system's *meta-properties* — not its behavior,
+but its *visibility*. This is a second-order question: you have to first imagine a failure, then
+ask whether any defined signal would fire, then determine whether that signal is distinguishable
+from normal operation or from other failures.
+
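The three-step question in the paragraph above (imagine a failure, ask whether any signal fires, ask whether that signal is distinguishable) can be sketched as a small classifier over signal signatures. All event names and scenario labels below are hypothetical stand-ins loosely based on the findings in this write-up; the function is an illustration of the reasoning, not a tool used in the experiment.

```python
def classify_visibility(normal_signals, failure_signals):
    """Label each failure by how it appears from outside.

    normal_signals:  set of signals emitted during healthy operation.
    failure_signals: dict mapping a failure name to the set of signals it emits.
    """
    # Group failures by their observable signature to spot collisions.
    by_signature = {}
    for failure, signals in failure_signals.items():
        by_signature.setdefault(frozenset(signals), []).append(failure)

    normal = frozenset(normal_signals)
    verdicts = {}
    for failure, signals in failure_signals.items():
        signature = frozenset(signals)
        if not signature:
            verdicts[failure] = "silent"             # no signal fires at all
        elif signature == normal:
            verdicts[failure] = "false-normal"       # looks like healthy operation
        elif len(by_signature[signature]) > 1:
            verdicts[failure] = "indistinguishable"  # shared with another failure
        else:
            verdicts[failure] = "visible"
    return verdicts


verdicts = classify_visibility(
    normal_signals={"decision_formed"},
    failure_signals={
        "in_flight_signals_lost_on_crash": set(),
        "forwarding_to_portfolio_risk_failed": {"decision_formed"},
        "signals_stopped_arriving": {"group_expired:timeout"},
        "timeout_misconfigured": {"group_expired:timeout"},
    },
)
print(verdicts)
```

The verdicts line up with three of the prompt's categories: silent failures, false-normal signals, and indistinguishable states.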
+GPT-5's approach: enumerate every possible metric/event that SHOULD exist but doesn't. Design
+the telemetry architecture. (23 specific event/metric proposals.)
+
+Opus's approach: identify the places where available signals guide operators toward WRONG actions
+or create invisible boundaries between components. (14 findings, several about operator behavior.)
+
+This distinction maps well to previous findings:
+- GPT-5 is the **telemetry architect** — "here's what you should instrument"
+- Opus is the **incident analyst** — "here's where your instrumentation will mislead you"
+
+## Model Comparison to Previous Task Types
+
+| Metric | GPT-5 | Opus | Sonnet |
+|---|---|---|---|
+| Finding count | 23 | 14 | 11 |
+| Unique findings | 12 | 6 | 0 |
+| Tokens per finding | 656 | 321 | 142 |
+| Qualitative depth | Systematic/architectural | Operator-behavioral | Surface-level |
+
+Comparison to previous experiments:
+- Finding #9 (gap-finding): GPT-5=14, Opus=n/a, Sonnet=n/a
+- Finding #10 (assumptions): GPT-5=26, Opus=13, Sonnet=n/a
+- Finding #12 (assumptions, order-execution): GPT-5=20, Opus=12, Sonnet=17
+- Finding #13 (race conditions): GPT-5=12, Opus=10, Sonnet=7
+- **This experiment (observability): GPT-5=23, Opus=14, Sonnet=11**
+
+GPT-5 produced its highest finding count (23) outside of assumption-finding tasks. This suggests
+observability gap analysis plays to GPT-5's exhaustive enumeration strength: there are many
+possible gaps, and GPT-5 tends to enumerate all of them.
+
+Sonnet's zero unique findings here vs. 6 unique findings in experiment #12 (order-execution
+assumptions) is consistent with a task-type dependency. Sonnet contributes when the task requires
+reasoning about component interactions in a complex multi-component document. On simpler
+documents or systematic enumeration tasks, it has so far added nothing.
+
+## Practical Implication
+
+For observability reviews of system specifications:
+1. **GPT-5** for comprehensive instrumentation gap enumeration — produces a complete telemetry
+   design specification (events, metrics, metadata fields)
+2. **Opus** for identifying where available signals mislead operators — finds the dangerous
+   gaps where wrong remediation appears correct
+3. **Skip Sonnet** — no unique value on this task type
+
+The two-model configuration (GPT-5 + Opus) is optimal, as it was for spec-gap and testability analysis.
+
+## New Taxonomy Entry
+
+| Task category | Best for | Sonnet value | Key question |
+|---|---|---|---|
+| Observability gap analysis | GPT-5 (breadth) + Opus (operator-behavioral) | None | "When it breaks, can you see it?" |
+
+This slots alongside:
+- Spec-gap analysis: GPT-5 + Opus (no Sonnet value)
+- Testability analysis: GPT-5 + Opus (no Sonnet value)
+- Assumption-finding: All three contribute (Sonnet at ~85% of GPT-5's finding count, 17 vs. 20 in Finding #12)
+- Race conditions: GPT-5 + Opus only (Sonnet too imprecise)
+- Cross-component interaction: All three contribute