model-research/findings/2026-05-06-33-observability-gap-analysis-aggregation.md
Rodin 20c0bd2492 feat: experiment #33 — observability gap analysis on aggregation.md
New analytical lens: observability gap analysis — asking 'when something
goes wrong, can you SEE it?' rather than 'what can go wrong?'

Results on aggregation.md (239 lines):
- GPT-5: 23 findings (12 unique), exhaustive telemetry architecture
- Opus: 14 findings (6 unique), operator-behavioral insights
- Sonnet: 11 findings (0 unique), no added value

Key insight: GPT-5 designs the instrumentation; Opus identifies where
available signals mislead operators toward wrong remediations.
Two-model (GPT-5 + Opus) optimal for this task type.
2026-05-06 11:49:05 -07:00

Experiment #33: Observability Gap Analysis on aggregation.md

Date: 2026-05-06
Task type: Observability gap analysis (NEW analytical lens)
Document: gargoyle's aggregation.md (239 lines); decision engine signal aggregation with state machines, timers, and cross-component forwarding

Hypothesis

Observability gap analysis — identifying where system behavior becomes invisible, indistinguishable from normal, or impossible to diagnose during failures — is a distinct analytical lens from failure analysis or assumption-finding. Instead of asking "what can go wrong," it asks "when something goes wrong, can you SEE it?" Models may differ in whether they identify technical instrumentation gaps (missing metrics/events) vs. semantic indistinguishability problems (different failures that look the same from outside).

Method

The same structured prompt was sent to all three models via the HAI proxy on anvil. The prompt specified five categories:

  1. Silent failures (no observable signal)
  2. Indistinguishable states (different problems, identical observable pattern)
  3. Diagnostic dead zones (unobservable time windows)
  4. Missing correlation (effects visible, causes invisible)
  5. False-normal signals (metrics healthy but correctness degraded)

Required output format: Gap, Scenario, What's invisible, Impact, What the spec should add.
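
The required output format maps onto a small record type. The following is a minimal sketch, not the prompt's actual schema: the names `GapCategory` and `GapFinding` and the field identifiers are assumptions made for illustration, and the example values paraphrase the decision-forwarding gap reported under Common Ground.

```python
from dataclasses import dataclass
from enum import Enum


class GapCategory(Enum):
    """The five prompt categories (identifier names are illustrative)."""
    SILENT_FAILURE = "silent failure"
    INDISTINGUISHABLE_STATES = "indistinguishable states"
    DIAGNOSTIC_DEAD_ZONE = "diagnostic dead zone"
    MISSING_CORRELATION = "missing correlation"
    FALSE_NORMAL_SIGNAL = "false-normal signal"


@dataclass(frozen=True)
class GapFinding:
    """One finding, with one field per element of the required format."""
    category: GapCategory
    gap: str
    scenario: str
    whats_invisible: str
    impact: str
    spec_should_add: str


# Illustrative values only, paraphrased from the common-ground list.
example = GapFinding(
    category=GapCategory.SILENT_FAILURE,
    gap="Decision forwarding failures are silent",
    scenario="A decision forms but delivery to PortfolioRisk fails",
    whats_invisible="Whether the decision was actually delivered",
    impact="Formed decisions cannot be told apart from delivered ones",
    spec_should_add="A success/failure signal for decision delivery",
)
```

Holding every model's output in one shape like this makes the later cross-model comparison (common ground vs. unique findings) easier to do by hand or by script.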

Models: GPT-5, Claude Opus 4.6, Claude Sonnet 4.6 (all via HAI proxy on anvil).

Results

| Model | Findings | Output tokens | Reasoning tokens | Latency | Tokens/finding |
|---|---|---|---|---|---|
| GPT-5 | 23 | 9,433 | 5,632 | 153s | 656 |
| Opus 4.6 | 14 | 4,493 | (internal) | 103s | 321 |
| Sonnet 4.6 | 11 | 1,562 | (internal) | 36s | 142 |
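
The Tokens/finding column appears to be total tokens divided by finding count, with GPT-5's reasoning tokens included (output tokens alone would give about 410 per finding). The sketch below reproduces that arithmetic under this assumption. Reasoning tokens for Opus and Sonnet are reported only as "(internal)", so they are treated as 0 here.

```python
# Figures copied from the results table above.
results = {
    "GPT-5": {"findings": 23, "output_tokens": 9433, "reasoning_tokens": 5632},
    "Opus 4.6": {"findings": 14, "output_tokens": 4493, "reasoning_tokens": 0},
    "Sonnet 4.6": {"findings": 11, "output_tokens": 1562, "reasoning_tokens": 0},
}


def tokens_per_finding(row):
    """Total visible tokens (output + reported reasoning) per finding."""
    total = row["output_tokens"] + row["reasoning_tokens"]
    return total / row["findings"]


per_finding = {model: round(tokens_per_finding(row)) for model, row in results.items()}
print(per_finding)
```

Under this assumption the Opus and Sonnet figures match the table (321 and 142). GPT-5 comes out at 655 rather than the reported 656, so the original may have been rounded or computed slightly differently.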

Common Ground (all 3 identified)

  • No telemetry during buffering state — groups are opaque while accumulating signals; only terminal events (completion/expiry) produce observable signals
  • Decision forwarding failures are silent — decisions form (event fires) but delivery to PortfolioRisk has no success/failure signal
  • Crash loses groups with no quantification — in-flight groups vanish but nothing reports how much was lost
  • Timeout reason is indistinguishable: a :timeout expiry doesn't discriminate between "signals stopped arriving" vs "timeout misconfigured" vs "market conditions changed"
  • Force-complete decisions look normal downstream — a decision formed from 1/5 expected signals is indistinguishable from a complete decision to PortfolioRisk

GPT-5 Unique Findings (not in either Claude model)

  1. No group_id/decision_id correlation across events — lifecycle events can't be joined; you can't trace a decision back through its group to its constituent signals
  2. Expired groups lack instrument context — can't attribute expiration spikes to specific instruments
  3. Timer start/deadline not observable — operators can't verify timers were set as intended
  4. No configuration context on events — timeout_ms, threshold_N, capacity_limit not attached to events; can't correlate config changes with behavior changes
  5. Pattern-complete predicate is opaque — no visibility into evaluation count, partial-match state, or "why false"; impossible to tune pattern strategies
  6. No per-strategy memory/backpressure signals — no gauges for buffered signal count or memory footprint; misfiring strategy fills memory silently
  7. Unknown strategy signal drops are only "logged" — no structured metric for discarded signals; operational data loss goes unmetered
  8. No cross-service trace context propagation — no mention of trace_id/span_id flowing signal → aggregation → PortfolioRisk → OrderManager
  9. No ranking decision transparency — when time-windowed selects "best" signal, no visibility into which candidate won, why, or what alternatives existed
  10. Capacity-triggered force-complete vs normal completion not explicitly monitored — operators alerting on :capacity expirations miss capacity-triggered completions
  11. No version metadata — events don't carry build/algorithm/config version; version skew causes indistinguishable behavioral drift
  12. No forwarding queue/latency visibility — no metric for decision dispatch latency or queue depth between formation and delivery
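
Several of these findings (1, 2, 4, 8, 11) come down to fields missing from lifecycle events. The following is a minimal sketch of an event envelope that would carry them, not the schema GPT-5 proposed. The identifiers `group_id`, `decision_id`, `trace_id` and `timeout_ms` come from the findings above; `LifecycleEvent`, `config_version`, `trace_decision`, the event names and the sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LifecycleEvent:
    """Hypothetical envelope attached to every aggregator lifecycle event."""
    event: str            # illustrative event name, e.g. "group_opened"
    group_id: str         # joins all events for one group (finding 1)
    trace_id: str         # propagated across services (finding 8)
    instrument: str       # lets expirations be attributed (finding 2)
    timeout_ms: int       # configuration in effect (finding 4)
    config_version: str   # version metadata (finding 11)
    decision_id: Optional[str] = None  # set once a decision forms (finding 1)


def trace_decision(events, decision_id):
    """Return every event for the group that produced decision_id, in input order."""
    group_ids = {e.group_id for e in events if e.decision_id == decision_id}
    return [e for e in events if e.group_id in group_ids]


events = [
    LifecycleEvent("group_opened", "g-1", "t-9", "ES", 5000, "cfg-42"),
    LifecycleEvent("group_opened", "g-2", "t-10", "NQ", 5000, "cfg-42"),
    LifecycleEvent("decision_formed", "g-1", "t-9", "ES", 5000, "cfg-42", decision_id="d-7"),
]
history = trace_decision(events, "d-7")
```

With shared identifiers on every event, tracing a decision back through its group becomes a plain filter. Without them, the join that finding 1 describes is not possible at all.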

Opus Unique Findings (not in either other model)

  1. Signals in-flight during crash window have no fate — signals dispatched by SignalRisk but not yet received by the aggregator vanish with no trace on either side. Distinguished from "groups lost on crash" because these signals never entered the aggregator's state. Unique insight: the acknowledgment boundary itself is invisible.
  2. Custom predicate FAILURE is observationally identical to predicate returning false — a predicate that throws an exception vs. one that correctly returns false produce the same downstream effect (group stays in Buffering, eventually times out). Operators misdiagnose code bugs as strategy calibration problems.
  3. Capacity expire and timeout expire require OPPOSITE remediations but share the same metric pattern: a :capacity expiry might mean "limit too low" OR "strategy misfiring." Misfiring requires investigation; low limit requires raising it. Raising the limit on a misfiring strategy converts bounded failure to unbounded memory growth.
  4. Decision formation-to-market-conditions temporal correlation is missing — contributing signals were generated at T+0 but the decision forms at T+10min; no metric captures how stale the decision's inputs are relative to current market state. Different from GPT-5's "group duration" finding because this is specifically about market relevance decay.
  5. Expired groups can't be correlated to missed P&L — expired groups represent missed trades but lack the business content (instrument, direction) needed to compute opportunity cost against actual market moves post-expiry.
  6. Aggregator appears "healthy but idle" indistinguishable from broken signal channel — no liveness signal distinguishes "no signals because market is quiet" from "no signals because delivery channel is broken." Unique angle: this creates a false-normal condition specific to the absence of activity rather than degradation of existing activity.
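
Finding 2 has a simple remedy: keep "the predicate raised" separate from "the predicate returned false" at the point of evaluation. The following is a sketch under assumed names (`PredicateOutcome`, `evaluate_predicate`). The aggregation spec's real predicate interface is not shown in this document, so the signature here is hypothetical.

```python
from enum import Enum
from typing import Callable, Sequence


class PredicateOutcome(Enum):
    MATCHED = "matched"
    NOT_MATCHED = "not_matched"
    ERRORED = "errored"


def evaluate_predicate(
    predicate: Callable[[Sequence[dict]], bool],
    signals: Sequence[dict],
) -> PredicateOutcome:
    """Run a pattern predicate, keeping a raised exception distinct from False."""
    try:
        matched = predicate(signals)
    except Exception:
        # A crash inside the predicate is a code bug, not a calibration issue.
        return PredicateOutcome.ERRORED
    return PredicateOutcome.MATCHED if matched else PredicateOutcome.NOT_MATCHED


def buggy_predicate(signals):
    # Raises IndexError when the group has no signals yet.
    return signals[0]["price"] > 0


def strict_predicate(signals):
    return len(signals) >= 3


empty_group = []
buggy_result = evaluate_predicate(buggy_predicate, empty_group)
strict_result = evaluate_predicate(strict_predicate, empty_group)
```

Both groups would stay in Buffering and eventually time out, but a counter keyed on the outcome would separate the bug (`ERRORED`) from the strategy that simply has not matched (`NOT_MATCHED`).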

Sonnet Findings

Sonnet produced 11 findings in 36s. No findings were truly unique — all overlapped substantially with GPT-5 or Opus findings. Sonnet's contribution was to identify the same categories of issues but at lower specificity:

  • Memory leaks from stuck groups (covered more precisely by GPT-5 #6 and Opus #2.3)
  • Decision forwarding silence (common ground)
  • Timeout indistinguishability (common ground)
  • Buffering dead zone (common ground)
  • Crash impact quantification (common ground)
  • Immediate algorithm masking excessive decision rate (covered more precisely by GPT-5 #16)
  • Signal quality hidden by completion metrics (covered by Opus #5.1, GPT-5 #10)
  • Overly permissive predicate (covered by Opus #1.3)

Sonnet was the fastest (36s, 1,562 tokens) but produced no unique insights for this task type.

Quality Assessment

  • GPT-5 was exhaustive and systematic — 23 findings covering all 5 categories, with specific telemetry event names, measurement fields, and metadata specifications. Multiple findings addressed the instrumentation architecture itself (trace propagation, config versioning, event correlation schema). GPT-5 treated this as a telemetry engineering problem and designed a complete observability layer. Its unique contributions are mostly about infrastructure (correlation IDs, trace context, config hashes) that enable diagnosis rather than about specific failure scenarios.

  • Opus produced fewer findings (14) but several showed qualitatively different reasoning. The "acknowledgment boundary" finding (#1.2) identifies an observability gap that exists between components — neither side knows signals were lost because neither side records the handoff. The "opposite remediations" finding (#2.3) identifies where the same metric guides operators toward WRONG actions depending on an invisible variable. Opus consistently reasoned about what operators would DO with the available signals, not just what signals are missing.

  • Sonnet produced no unique value on this task type. Every finding was a less-specific version of something GPT-5 or Opus found. This is consistent with the task-type taxonomy from previous experiments: Sonnet adds nothing on systematic/exhaustive analysis tasks.

Key Insight — Observability Analysis as Task Type

This is genuinely different from failure analysis or assumption-finding:

  • Failure analysis asks: "What can go wrong?"
  • Assumption-finding asks: "What must be true for this to work?"
  • Observability gap analysis asks: "When something goes wrong, can you SEE it?"

The third question requires reasoning about the system's meta-properties — not its behavior, but its visibility. This is a second-order question: you have to first imagine a failure, then ask whether any defined signal would fire, then determine whether that signal is distinguishable from normal operation or from other failures.
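
The three-step check can be made mechanical once each scenario is mapped to the set of signals it would emit. This is a minimal sketch of that idea and does not reproduce what any of the models did. The function name, scenario labels and signal names are all hypothetical.

```python
from collections import defaultdict


def classify_visibility(scenario_signals):
    """Given scenario -> set of emitted signals, find silent and look-alike scenarios."""
    # Steps 1-2: a scenario with no signals at all is a silent failure.
    silent = sorted(name for name, signals in scenario_signals.items() if not signals)

    # Step 3: scenarios sharing an identical signal set cannot be told apart.
    by_signature = defaultdict(list)
    for name, signals in scenario_signals.items():
        if signals:
            by_signature[frozenset(signals)].append(name)
    indistinguishable = sorted(
        sorted(names) for names in by_signature.values() if len(names) > 1
    )
    return silent, indistinguishable


scenarios = {
    "crash loses in-flight groups": set(),
    "decision delivered (normal)": {"decision_formed"},
    "forwarding to PortfolioRisk fails": {"decision_formed"},
    "signals stopped arriving": {"timeout_expired"},
    "timeout misconfigured": {"timeout_expired"},
}
silent, indistinguishable = classify_visibility(scenarios)
```

Including normal operation as one of the scenarios is what surfaces false-normal gaps: a failure whose signal set equals the healthy one lands in the same group.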

GPT-5's approach: enumerate every possible metric/event that SHOULD exist but doesn't. Design the telemetry architecture. (23 specific event/metric proposals.)

Opus's approach: identify the places where available signals guide operators toward WRONG actions or create invisible boundaries between components. (14 findings, several about operator behavior.)

This distinction maps well to previous findings:

  • GPT-5 is the telemetry architect — "here's what you should instrument"
  • Opus is the incident analyst — "here's where your instrumentation will mislead you"

Model Comparison to Previous Task Types

| Metric | GPT-5 | Opus | Sonnet |
|---|---|---|---|
| Finding count | 23 | 14 | 11 |
| Unique findings | 12 | 6 | 0 |
| Tokens per finding | 656 | 321 | 142 |
| Qualitative depth | Systematic/architectural | Operator-behavioral | Surface-level |
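
"Unique" here means a finding that neither of the other two models produced. Once each model's findings have been matched to shared labels by hand (the judgment-heavy step, which this sketch does not attempt), the counts are a set difference. The labels below are hypothetical placeholders, not the experiment's data.

```python
def unique_findings(by_model):
    """Return, per model, the findings that no other model produced."""
    unique = {}
    for model, findings in by_model.items():
        others = set().union(
            *(f for other, f in by_model.items() if other != model)
        )
        unique[model] = findings - others
    return unique


# Placeholder labels for illustration only.
by_model = {
    "GPT-5": {"no-buffering-telemetry", "no-correlation-ids", "no-version-metadata"},
    "Opus": {"no-buffering-telemetry", "ack-boundary-invisible"},
    "Sonnet": {"no-buffering-telemetry"},
}
unique = unique_findings(by_model)
unique_counts = {model: len(found) for model, found in unique.items()}
```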

Comparison to previous experiments:

  • Experiment #9 (gap-finding): GPT-5=14, Opus=n/a, Sonnet=n/a
  • Experiment #10 (assumptions): GPT-5=26, Opus=13, Sonnet=n/a
  • Experiment #12 (assumptions, order-execution): GPT-5=20, Sonnet=17, Opus=12
  • Experiment #13 (race conditions): GPT-5=12, Opus=10, Sonnet=7
  • This experiment (observability): GPT-5=23, Opus=14, Sonnet=11

GPT-5 produced its highest finding count (23) outside of assumption-finding tasks. This suggests observability gap analysis plays to GPT-5's exhaustive enumeration strength: a spec has many possible gaps, and GPT-5 tends to enumerate as many of them as it can.

Sonnet's zero unique findings here vs. 6 unique findings in experiment #12 (order-execution assumptions) is consistent with the task-type dependency seen earlier. Sonnet contributes when the task requires reasoning about component interactions in a complex multi-component document. On simpler documents or systematic enumeration tasks, it added nothing.

Practical Implication

For observability reviews of system specifications:

  1. GPT-5 for comprehensive instrumentation gap enumeration — produces a complete telemetry design specification (events, metrics, metadata fields)
  2. Opus for identifying where available signals mislead operators — finds the dangerous gaps where wrong remediation appears correct
  3. Skip Sonnet — no unique value on this task type

The two-model configuration (GPT-5 + Opus) was the most effective for this task type, matching the results for spec-gap and testability analysis.

New Taxonomy Entry

| Task category | Best models | Sonnet value | Key question |
|---|---|---|---|
| Observability gap analysis | GPT-5 (breadth) + Opus (operator-behavioral) | None | "When it breaks, can you see it?" |

This slots alongside:

  • Spec-gap analysis: GPT-5 + Opus (no Sonnet value)
  • Testability analysis: GPT-5 + Opus (no Sonnet value)
  • Assumption-finding: All three contribute (Sonnet at ~85%)
  • Race conditions: GPT-5 + Opus only (Sonnet too imprecise)
  • Cross-component interaction: All three contribute