model-research/findings/2026-05-06-33-observability-gap-analysis-aggregation.md

# Experiment #33: Observability Gap Analysis on aggregation.md

**Date:** 2026-05-06
**Task type:** Observability gap analysis (NEW analytical lens)
**Document:** gargoyle's `aggregation.md` (239 lines) — decision engine signal aggregation with
state machines, timers, and cross-component forwarding

## Hypothesis

Observability gap analysis — identifying where system behavior becomes invisible, indistinguishable
from normal, or impossible to diagnose during failures — is a distinct analytical lens from failure
analysis or assumption-finding. Instead of asking "what can go wrong," it asks "when something goes
wrong, can you SEE it?" Models may differ in whether they identify technical instrumentation gaps
(missing metrics/events) vs. semantic indistinguishability problems (different failures that look
the same from outside).

## Method

The same structured prompt was sent to all three models via the HAI proxy on anvil. The prompt specified five gap categories:
1. Silent failures (no observable signal)
2. Indistinguishable states (different problems, identical observable pattern)
3. Diagnostic dead zones (unobservable time windows)
4. Missing correlation (effects visible, causes invisible)
5. False-normal signals (metrics healthy but correctness degraded)

Required output format: Gap, Scenario, What's invisible, Impact, What the spec should add.

Models: GPT-5, Claude Opus 4.6, Claude Sonnet 4.6.

## Results

| Model | Findings | Output tokens | Reasoning tokens | Latency | Tokens/finding |
|---|---|---|---|---|---|
| GPT-5 | 23 | 9,433 | 5,632 | 153s | 656 |
| Opus 4.6 | 14 | 4,493 | (internal) | 103s | 321 |
| Sonnet 4.6 | 11 | 1,562 | (internal) | 36s | 142 |
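
The Tokens/finding column can be rechecked from the other columns. The sketch below recomputes it on two bases, output tokens only and output plus reasoning tokens. The arithmetic suggests the GPT-5 figure counts reasoning tokens (15,065 / 23 ≈ 655, within one of the reported 656), while the Opus and Sonnet figures use output tokens only. That reading is an inference from the numbers, not something the experiment log states, and the function name is illustrative.

```python
# Figures copied from the results table above.
RESULTS = {
    "GPT-5": {"findings": 23, "output_tokens": 9433, "reasoning_tokens": 5632},
    "Opus 4.6": {"findings": 14, "output_tokens": 4493, "reasoning_tokens": None},
    "Sonnet 4.6": {"findings": 11, "output_tokens": 1562, "reasoning_tokens": None},
}


def tokens_per_finding(row, include_reasoning=False):
    """Average tokens per finding, optionally adding reported reasoning tokens."""
    total = row["output_tokens"]
    if include_reasoning and row["reasoning_tokens"] is not None:
        total += row["reasoning_tokens"]
    return round(total / row["findings"])


for model, row in RESULTS.items():
    print(model,
          tokens_per_finding(row),
          tokens_per_finding(row, include_reasoning=True))
```

On an output-only basis GPT-5 comes to about 410 tokens per finding, so the gap to Opus (321) is smaller than the table implies when the column is read as a like-for-like comparison.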

## Common Ground (all 3 identified)

- **No telemetry during buffering state** — groups are opaque while accumulating signals; only
  terminal events (completion/expiry) produce observable signals
- **Decision forwarding failures are silent** — decisions form (event fires) but delivery to
  PortfolioRisk has no success/failure signal
- **Crash loses groups with no quantification** — in-flight groups vanish but nothing reports
  how much was lost
- **Timeout reason is indistinguishable** — `:timeout` expiry doesn't discriminate between
  "signals stopped arriving" vs "timeout misconfigured" vs "market conditions changed"
- **Force-complete decisions look normal downstream** — a decision formed from 1/5 expected
  signals is indistinguishable from a complete decision to PortfolioRisk
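
Three of the gaps above (opaque buffering, silent forwarding failures, and force-completes that look normal) could be closed with a handful of extra events. The following is a minimal sketch in Python, not gargoyle's implementation: `EventLog`, `Group`, the event names, and the `received`/`expected`/`forced` fields are all hypothetical, chosen only to illustrate the shape of the missing signals.

```python
from dataclasses import dataclass, field


@dataclass
class EventLog:
    """In-memory stand-in for a telemetry sink (hypothetical)."""
    events: list = field(default_factory=list)

    def emit(self, name, **fields):
        self.events.append({"event": name, **fields})


@dataclass
class Group:
    group_id: str
    expected: int
    signals: list = field(default_factory=list)


def buffer_signal(log, group, signal):
    # Emit on every buffered signal so the Buffering state is not opaque.
    group.signals.append(signal)
    log.emit("group.signal_buffered", group_id=group.group_id,
             received=len(group.signals), expected=group.expected)


def complete(log, group, forced=False):
    # Carry completeness on the decision itself so a 1/5 force-complete
    # is distinguishable downstream from a full decision.
    decision = {"group_id": group.group_id, "received": len(group.signals),
                "expected": group.expected, "forced": forced}
    log.emit("group.completed", **decision)
    return decision


def forward(log, decision, deliver):
    # Record the delivery outcome, not just the decision formation.
    try:
        deliver(decision)
    except Exception as exc:
        log.emit("decision.forward_failed", group_id=decision["group_id"],
                 error=type(exc).__name__)
        return False
    log.emit("decision.forwarded", group_id=decision["group_id"])
    return True


def broken_portfolio_risk(decision):
    # Simulated delivery failure for the usage example.
    raise ConnectionError("PortfolioRisk unreachable")


log = EventLog()
group = Group(group_id="g-1", expected=5)
buffer_signal(log, group, {"instrument": "XYZ"})
decision = complete(log, group, forced=True)
delivered = forward(log, decision, broken_portfolio_risk)
print([e["event"] for e in log.events])
```

The sketch does not address crash-loss quantification or the cause of a `:timeout` expiry; those need state that survives the process and context from outside the aggregator.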

## GPT-5 Unique Findings (not in either Claude model)

1. **No group_id/decision_id correlation across events** — lifecycle events can't be joined;
   you can't trace a decision back through its group to its constituent signals
2. **Expired groups lack instrument context** — can't attribute expiration spikes to
   specific instruments
3. **Timer start/deadline not observable** — operators can't verify timers were set as intended
4. **No configuration context on events** — timeout_ms, threshold_N, capacity_limit not
   attached to events; can't correlate config changes with behavior changes
5. **Pattern-complete predicate is opaque** — no visibility into evaluation count, partial-match
   state, or "why false"; impossible to tune pattern strategies
6. **No per-strategy memory/backpressure signals** — no gauges for buffered signal count
   or memory footprint; misfiring strategy fills memory silently
7. **Unknown strategy signal drops are only "logged"** — no structured metric for discarded
   signals; operational data loss goes unmetered
8. **No cross-service trace context propagation** — no mention of trace_id/span_id flowing
   signal → aggregation → PortfolioRisk → OrderManager
9. **No ranking decision transparency** — when time-windowed selects "best" signal, no
   visibility into which candidate won, why, or what alternatives existed
10. **Capacity-triggered force-complete vs normal completion not explicitly monitored** —
    operators alerting on `:capacity` expirations miss capacity-triggered *completions*
11. **No version metadata** — events don't carry build/algorithm/config version; version
    skew causes indistinguishable behavioral drift
12. **No forwarding queue/latency visibility** — no metric for decision dispatch latency
    or queue depth between formation and delivery
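
Several of GPT-5's findings (correlation IDs, instrument context, configuration context, version metadata, trace context) amount to a shared event envelope. The sketch below shows one possible shape and how it makes the join in finding 1 possible. `LifecycleEvent`, `config_version`, and the event names are assumptions for illustration; `timeout_ms` and `capacity_limit` are the configuration fields named in finding 4.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LifecycleEvent:
    """Hypothetical envelope attached to every aggregation event."""
    event: str
    group_id: str
    trace_id: str
    instrument: str
    config_version: str
    timeout_ms: int
    capacity_limit: int
    signal_id: Optional[str] = None
    decision_id: Optional[str] = None


def signals_for_decision(events, decision_id):
    """Trace a decision back to its constituent signals by joining on group_id."""
    group_ids = {e.group_id for e in events if e.decision_id == decision_id}
    return sorted(e.signal_id for e in events
                  if e.group_id in group_ids and e.signal_id is not None)


# Context shared by all events in this example.
common = dict(trace_id="t-9", instrument="XYZ", config_version="cfg-42",
              timeout_ms=60_000, capacity_limit=100)
events = [
    LifecycleEvent("signal_buffered", "g-1", signal_id="s-1", **common),
    LifecycleEvent("signal_buffered", "g-1", signal_id="s-2", **common),
    LifecycleEvent("signal_buffered", "g-2", signal_id="s-3", **common),
    LifecycleEvent("decision_formed", "g-1", decision_id="d-7", **common),
]
print(signals_for_decision(events, "d-7"))  # ['s-1', 's-2']
```

Because every event carries the same context fields, an expiry spike can be grouped by `instrument` or `config_version` without consulting any other data source.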

## Opus Unique Findings (not in either other model)

1. **Signals in-flight during crash window have no fate** — signals dispatched by SignalRisk
   but not yet received by the aggregator vanish with no trace on either side. Distinguished
   from "groups lost on crash" because these signals never entered the aggregator's state.
   Unique insight: the acknowledgment boundary itself is invisible.
2. **Custom predicate FAILURE is observationally identical to predicate returning false** —
   a predicate that throws an exception vs. one that correctly returns false produce the same
   downstream effect (group stays in Buffering, eventually times out). Operators misdiagnose
   code bugs as strategy calibration problems.
3. **The two causes of a `:capacity` expiry require OPPOSITE remediations but share the same
   metric pattern**: `:capacity` might mean "limit too low" OR "strategy misfiring."
   A misfiring strategy requires investigation; a low limit requires raising it. Raising the
   limit on a misfiring strategy converts a bounded failure into unbounded memory growth.
4. **Decision formation-to-market-conditions temporal correlation is missing** — contributing
   signals were generated at T+0 but the decision forms at T+10min; no metric captures how
   stale the decision's inputs are relative to current market state. Different from GPT-5's
   "group duration" finding because this is specifically about *market relevance* decay.
5. **Expired groups can't be correlated to missed P&L** — expired groups represent missed
   trades but lack the business content (instrument, direction) needed to compute opportunity
   cost against actual market moves post-expiry.
6. **Aggregator appears "healthy but idle" indistinguishable from broken signal channel** —
   no liveness signal distinguishes "no signals because market is quiet" from "no signals
   because delivery channel is broken." Unique angle: this creates a false-normal condition
   specific to the *absence* of activity rather than degradation of existing activity.
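
Opus's finding 2 can be illustrated with a small wrapper that records a predicate exception as its own outcome instead of letting it collapse into "returned false". This is a sketch under assumptions: `PredicateOutcome`, `evaluate`, and the two example predicates are hypothetical, and how gargoyle actually invokes custom predicates is not described in this document.

```python
from collections import Counter
from enum import Enum


class PredicateOutcome(Enum):
    MATCHED = "matched"
    NOT_MATCHED = "not_matched"
    ERRORED = "errored"


def evaluate(predicate, signals, counters):
    """Run a completion predicate and count errors separately from 'false'."""
    try:
        matched = bool(predicate(signals))
    except Exception:
        # A code bug in the predicate: the group keeps buffering, but the
        # failure is now visible as a distinct counter.
        counters[PredicateOutcome.ERRORED] += 1
        return PredicateOutcome.ERRORED
    outcome = PredicateOutcome.MATCHED if matched else PredicateOutcome.NOT_MATCHED
    counters[outcome] += 1
    return outcome


def strict_predicate(signals):
    # Correctly returns False until three signals have arrived.
    return len(signals) >= 3


def buggy_predicate(signals):
    # Raises KeyError because the example signal has no "confidence" field.
    return signals[0]["confidence"] > 0.5


counters = Counter()
signals = [{"instrument": "XYZ"}]
print(evaluate(strict_predicate, signals, counters))
print(evaluate(buggy_predicate, signals, counters))
```

With only a timeout metric, both predicates would look the same to an operator. With the extra counter, a nonzero `ERRORED` count points to a code bug rather than a calibration problem.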

## Sonnet Findings

Sonnet produced 11 findings in 36s. None were unique: each overlapped substantially with a
GPT-5 or Opus finding, usually at lower specificity. Finding numbers such as `#16` and `#2.3`
below appear to follow each model's raw output numbering rather than the lists in this document.

- Memory leaks from stuck groups (covered more precisely by GPT-5 #6 and Opus #2.3)
- Decision forwarding silence (common ground)
- Timeout indistinguishability (common ground)
- Buffering dead zone (common ground)
- Crash impact quantification (common ground)
- Immediate algorithm masking excessive decision rate (covered more precisely by GPT-5 #16)
- Signal quality hidden by completion metrics (covered by Opus #5.1, GPT-5 #10)
- Overly permissive predicate (covered by Opus #1.3)

Sonnet was the fastest (36s, 1,562 tokens) but produced no unique insights for this task type.

## Quality Assessment

- **GPT-5** was exhaustive and systematic — 23 findings covering all 5 categories, with specific
  telemetry event names, measurement fields, and metadata specifications. Multiple findings
  addressed the *instrumentation architecture* itself (trace propagation, config versioning,
  event correlation schema). GPT-5 treated this as a telemetry engineering problem and designed
  a complete observability layer. Its unique contributions are mostly about infrastructure
  (correlation IDs, trace context, config hashes) that enable diagnosis rather than about
  specific failure scenarios.

- **Opus** produced fewer findings (14) but several showed qualitatively different reasoning.
  The "acknowledgment boundary" finding (#1.2 in Opus's output, unique finding 1 above)
  identifies an observability gap that exists *between* components: neither side knows signals
  were lost because neither side records the handoff. The "opposite remediations" finding
  (#2.3, unique finding 3 above) identifies where the same metric guides operators toward
  WRONG actions depending on an invisible variable. Opus consistently reasoned about *what
  operators would DO* with the available signals, not just what signals are missing.

- **Sonnet** produced no unique value on this task type. Every finding was a less-specific
  version of something GPT-5 or Opus found. This is consistent with the task-type taxonomy
  from previous experiments: Sonnet adds nothing on systematic/exhaustive analysis tasks.

## Key Insight — Observability Analysis as Task Type

Observability gap analysis differs from failure analysis and assumption-finding in the question it asks:
- **Failure analysis** asks: "What can go wrong?"
- **Assumption-finding** asks: "What must be true for this to work?"
- **Observability gap analysis** asks: "When something goes wrong, can you SEE it?"

The third question requires reasoning about the system's *meta-properties* — not its behavior,
but its *visibility*. This is a second-order question: you have to first imagine a failure, then
ask whether any defined signal would fire, then determine whether that signal is distinguishable
from normal operation or from other failures.
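
The three-step procedure above (imagine a failure, ask what signal would fire, ask whether that signal is distinguishable) can be written down as a grouping of scenarios by their observable signature. The sketch below is illustrative only: the scenario list is a hand-picked subset drawn from findings in this document, and the event names in each signature are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Scenario(NamedTuple):
    name: str
    is_failure: bool
    signals: frozenset  # what an operator would actually observe


def classify(scenarios):
    """Group scenarios by observable signature and label the resulting gaps."""
    by_signature = defaultdict(list)
    for s in scenarios:
        by_signature[s.signals].append(s)

    report = {"silent": [], "false_normal": [], "indistinguishable": []}
    for signature, group in by_signature.items():
        failures = [s.name for s in group if s.is_failure]
        has_normal = any(not s.is_failure for s in group)
        if not signature:
            # Failure that produces no signal at all.
            report["silent"].extend(failures)
        elif has_normal:
            # Failure whose signals match a healthy scenario.
            report["false_normal"].extend(failures)
        elif len(failures) > 1:
            # Different failures with identical signals.
            report["indistinguishable"].append(sorted(failures))
    return report


scenarios = [
    Scenario("quiet market", False, frozenset()),
    Scenario("signal channel broken", True, frozenset()),
    Scenario("signals stopped arriving", True, frozenset({"group.expired:timeout"})),
    Scenario("timeout misconfigured", True, frozenset({"group.expired:timeout"})),
    Scenario("decision delivered", False, frozenset({"decision.formed"})),
    Scenario("forwarding failed", True, frozenset({"decision.formed"})),
]
report = classify(scenarios)
print(report)
```

Each non-empty bucket corresponds to one of the prompt's categories: silent failures, false-normal signals, and indistinguishable states. Adding a signal to the spec is worthwhile when it splits a bucket.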

GPT-5's approach: enumerate every possible metric/event that SHOULD exist but doesn't. Design
the telemetry architecture. (23 specific event/metric proposals.)

Opus's approach: identify the places where available signals guide operators toward WRONG actions
or create invisible boundaries between components. (14 findings, several about operator behavior.)

This distinction maps well to previous findings:
- GPT-5 is the **telemetry architect** — "here's what you should instrument"
- Opus is the **incident analyst** — "here's where your instrumentation will mislead you"

## Model Comparison to Previous Task Types

| Metric | GPT-5 | Opus | Sonnet |
|---|---|---|---|
| Finding count | 23 | 14 | 11 |
| Unique findings | 12 | 6 | 0 |
| Tokens per finding | 656 | 321 | 142 |
| Qualitative depth | Systematic/architectural | Operator-behavioral | Surface-level |

Comparison to previous experiments:
- Finding #9 (gap-finding): GPT-5=14, Opus=n/a, Sonnet=n/a
- Finding #10 (assumptions): GPT-5=26, Opus=13, Sonnet=n/a
- Finding #12 (assumptions, order-execution): GPT-5=20, Opus=12, Sonnet=17
- Finding #13 (race conditions): GPT-5=12, Opus=10, Sonnet=7
- **This experiment (observability): GPT-5=23, Opus=14, Sonnet=11**

GPT-5 produced its highest finding count (23) outside of assumption-finding tasks. This suggests
observability gap analysis plays to GPT-5's strength in exhaustive enumeration: the space of
possible gaps is large, and GPT-5 tends to list as many of them as it can.

Sonnet's zero unique findings here vs. 6 unique findings in experiment #12 (order-execution
assumptions) is consistent with a task-type dependency. Sonnet contributes when the task requires
reasoning about component interactions in a complex multi-component document. On simpler
documents or systematic enumeration tasks, it has added nothing unique in these experiments.

## Practical Implication

For observability reviews of system specifications:
1. **GPT-5** for comprehensive instrumentation gap enumeration — produces a complete telemetry
   design specification (events, metrics, metadata fields)
2. **Opus** for identifying where available signals mislead operators — finds the dangerous
   gaps where wrong remediation appears correct
3. **Skip Sonnet** — no unique value on this task type

A two-model configuration (GPT-5 + Opus) gave the best coverage here, matching the recommendation for spec-gap and testability analysis.

## New Taxonomy Entry

| Task category | Best for | Sonnet value | Key question |
|---|---|---|---|
| Observability gap analysis | GPT-5 (breadth) + Opus (operator-behavioral) | None | "When it breaks, can you see it?" |

This slots alongside:
- Spec-gap analysis: GPT-5 + Opus (no Sonnet value)
- Testability analysis: GPT-5 + Opus (no Sonnet value)
- Assumption-finding: All three contribute (Sonnet at ~85% of GPT-5's finding count, 17 vs. 20 in #12)
- Race conditions: GPT-5 + Opus only (Sonnet too imprecise)
- Cross-component interaction: All three contribute