diff --git a/findings/2026-05-06-33-observability-gap-analysis-aggregation.md b/findings/2026-05-06-33-observability-gap-analysis-aggregation.md
new file mode 100644
index 0000000..c14bd7d
--- /dev/null
+++ b/findings/2026-05-06-33-observability-gap-analysis-aggregation.md
@@ -0,0 +1,211 @@
+# Experiment #33: Observability Gap Analysis on aggregation.md
+
+**Date:** 2026-05-06
+**Task type:** Observability gap analysis (NEW analytical lens)
+**Document:** gargoyle's `aggregation.md` (239 lines) — decision engine signal aggregation with
+state machines, timers, and cross-component forwarding
+
+## Hypothesis
+
+Observability gap analysis — identifying where system behavior becomes invisible, indistinguishable
+from normal, or impossible to diagnose during failures — is a distinct analytical lens from failure
+analysis or assumption-finding. Instead of asking "what can go wrong," it asks "when something goes
+wrong, can you SEE it?" Models may differ in whether they identify technical instrumentation gaps
+(missing metrics/events) vs. semantic indistinguishability problems (different failures that look
+the same from outside).
+
+## Method
+
+The same structured prompt went to all three models via HAI proxy on anvil. It specified 5 categories:
+1. Silent failures (no observable signal)
+2. Indistinguishable states (different problems, identical observable pattern)
+3. Diagnostic dead zones (unobservable time windows)
+4. Missing correlation (effects visible, causes invisible)
+5. False-normal signals (metrics healthy but correctness degraded)
+
+Required output format: Gap, Scenario, What's invisible, Impact, What the spec should add.
+
+Models: GPT-5, Claude Opus 4.6, Claude Sonnet 4.6.
+
+## Results
+
+| Model | Findings | Output tokens | Reasoning tokens | Latency | Tokens/finding |
+|---|---|---|---|---|---|
+| GPT-5 | 23 | 9,433 | 5,632 | 153s | 656 |
+| Opus 4.6 | 14 | 4,493 | (internal) | 103s | 321 |
+| Sonnet 4.6 | 11 | 1,562 | (internal) | 36s | 142 |
+
+## Common Ground (identified by all 3)
+
+- **No telemetry during buffering state** — groups are opaque while accumulating signals; only
+  terminal events (completion/expiry) produce observable signals
+- **Decision forwarding failures are silent** — decisions form (event fires) but delivery to
+  PortfolioRisk has no success/failure signal
+- **Crash loses groups with no quantification** — in-flight groups vanish but nothing reports
+  how much was lost
+- **Timeout reason is indistinguishable** — `:timeout` expiry doesn't distinguish between
+  "signals stopped arriving", "timeout misconfigured", and "market conditions changed"
+- **Force-complete decisions look normal downstream** — to PortfolioRisk, a decision formed from
+  1 of 5 expected signals is indistinguishable from a complete decision
+
+## GPT-5 Unique Findings (not found by either Claude model)
+
+1. **No group_id/decision_id correlation across events** — lifecycle events can't be joined;
+   you can't trace a decision back through its group to its constituent signals
+2. **Expired groups lack instrument context** — can't attribute expiration spikes to
+   specific instruments
+3. **Timer start/deadline not observable** — operators can't verify timers were set as intended
+4. **No configuration context on events** — timeout_ms, threshold_N, capacity_limit not
+   attached to events; can't correlate config changes with behavior changes
+5. **Pattern-complete predicate is opaque** — no visibility into evaluation count, partial-match
+   state, or "why false"; impossible to tune pattern strategies
+6. **No per-strategy memory/backpressure signals** — no gauges for buffered signal count
+   or memory footprint; a misfiring strategy fills memory silently
+7. **Unknown strategy signal drops are only "logged"** — no structured metric for discarded
+   signals; operational data loss goes unmetered
+8. **No cross-service trace context propagation** — no mention of trace_id/span_id flowing
+   signal → aggregation → PortfolioRisk → OrderManager
+9. **No ranking decision transparency** — when time-windowed aggregation selects the "best" signal,
+   no visibility into which candidate won, why, or what alternatives existed
+10. **Capacity-triggered force-complete vs normal completion not explicitly monitored** —
+    operators alerting on `:capacity` expirations miss capacity-triggered *completions*
+11. **No version metadata** — events don't carry build/algorithm/config version; version
+    skew causes indistinguishable behavioral drift
+12. **No forwarding queue/latency visibility** — no metric for decision dispatch latency
+    or queue depth between formation and delivery
+
+## Opus Unique Findings (not found by either other model)
+
+1. **Signals in-flight during crash window have no recorded fate** — signals dispatched by
+   SignalRisk but not yet received by the aggregator vanish with no trace on either side. Distinct
+   from "groups lost on crash" because these signals never entered the aggregator's state.
+   Unique insight: the acknowledgment boundary itself is invisible.
+2. **Custom predicate FAILURE is observationally identical to predicate returning false** —
+   a predicate that throws an exception and one that correctly returns false produce the same
+   downstream effect (group stays in Buffering, eventually times out). Operators misdiagnose
+   code bugs as strategy calibration problems.
+3. **Capacity expire and timeout expire require OPPOSITE remediations but share the same
+   metric pattern** — `:capacity` might mean "limit too low" OR "strategy misfiring."
+   Misfiring requires investigation; low limit requires raising it. Raising the limit on a
+   misfiring strategy converts bounded failure to unbounded memory growth.
+4. **Decision formation-to-market-conditions temporal correlation is missing** — contributing
+   signals were generated at T+0 but the decision forms at T+10min; no metric captures how
+   stale the decision's inputs are relative to current market state. Different from GPT-5's
+   "group duration" finding because this is specifically about *market relevance* decay.
+5. **Expired groups can't be correlated to missed P&L** — expired groups represent missed
+   trades but lack the business content (instrument, direction) needed to compute opportunity
+   cost against actual market moves post-expiry.
+6. **Aggregator appears "healthy but idle" indistinguishable from broken signal channel** —
+   no liveness signal distinguishes "no signals because market is quiet" from "no signals
+   because delivery channel is broken." Unique angle: this creates a false-normal condition
+   specific to the *absence* of activity rather than degradation of existing activity.
+
+## Sonnet Findings
+
+Sonnet produced 11 findings in 36s. No findings were truly unique — all overlapped substantially
+with GPT-5 or Opus findings. Sonnet identified the same categories of issues as the other two
+models, but at lower specificity:
+
+- Memory leaks from stuck groups (covered more precisely by GPT-5 #6 and Opus #2.3)
+- Decision forwarding silence (common ground)
+- Timeout indistinguishability (common ground)
+- Buffering dead zone (common ground)
+- Crash impact quantification (common ground)
+- Immediate algorithm masking excessive decision rate (covered more precisely by GPT-5 #16)
+- Signal quality hidden by completion metrics (covered by Opus #5.1, GPT-5 #10)
+- Overly permissive predicate (covered by Opus #1.3)
+
+Sonnet was the fastest (36s, 1,562 tokens) but produced no unique insights for this task type.
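Opus's unique finding #2 (a predicate that throws looks the same as one that returns false) can be made concrete with a small sketch. The Python below is illustrative only: this document does not show gargoyle's implementation language or its predicate API, so `PredicateOutcome`, `evaluate_predicate`, the `counters` object, and both example predicates are hypothetical names. The idea is to record a raised exception as its own outcome instead of folding it into "not complete".

```python
from collections import Counter
from enum import Enum
from typing import Callable, Sequence


class PredicateOutcome(Enum):
    MATCHED = "matched"
    NOT_MATCHED = "not_matched"
    ERRORED = "errored"  # the predicate raised; previously invisible


def evaluate_predicate(
    predicate: Callable[[Sequence[dict]], bool],
    signals: Sequence[dict],
    counters: Counter,
) -> PredicateOutcome:
    """Evaluate a completion predicate and count each outcome separately."""
    try:
        result = predicate(signals)
    except Exception:
        outcome = PredicateOutcome.ERRORED
    else:
        outcome = PredicateOutcome.MATCHED if result else PredicateOutcome.NOT_MATCHED
    counters[outcome.value] += 1
    return outcome


def needs_two_signals(signals):
    # Hypothetical well-behaved predicate: complete once two signals are buffered.
    return len(signals) >= 2


def buggy_predicate(signals):
    # Hypothetical buggy predicate: raises KeyError on a field that does not exist.
    return signals[0]["missing_field"] > 0


counters = Counter()
buffered = [{"instrument": "XYZ"}]
outcome_ok = evaluate_predicate(needs_two_signals, buffered, counters)
outcome_bug = evaluate_predicate(buggy_predicate, buffered, counters)
```

In both calls the group would stay in Buffering, but the counters now differ: one `not_matched` and one `errored`. An operator looking at a rising `errored` count has a reason to suspect a code bug rather than strategy calibration.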
+
+## Quality Assessment
+
+- **GPT-5** was exhaustive and systematic — 23 findings covering all 5 categories, with specific
+  telemetry event names, measurement fields, and metadata specifications. Multiple findings
+  addressed the *instrumentation architecture* itself (trace propagation, config versioning,
+  event correlation schema). GPT-5 treated this as a telemetry engineering problem and designed
+  a complete observability layer. Its unique contributions are mostly about infrastructure
+  (correlation IDs, trace context, config hashes) that enable diagnosis rather than about
+  specific failure scenarios.
+
+- **Opus** produced fewer findings (14) but several showed qualitatively different reasoning.
+  The "acknowledgment boundary" finding (#1.2) identifies an observability gap that exists
+  *between* components — neither side knows signals were lost because neither side records
+  the handoff. The "opposite remediations" finding (#2.3) identifies where the same metric
+  guides operators toward WRONG actions depending on an invisible variable. Opus consistently
+  reasoned about *what operators would DO* with the available signals, not just what signals
+  are missing.
+
+- **Sonnet** produced no unique value on this task type. Every finding was a less-specific
+  version of something GPT-5 or Opus found. This is consistent with the task-type taxonomy
+  from previous experiments: Sonnet adds nothing on systematic/exhaustive analysis tasks.
+
+## Key Insight — Observability Analysis as Task Type
+
+This is genuinely different from failure analysis or assumption-finding:
+- **Failure analysis** asks: "What can go wrong?"
+- **Assumption-finding** asks: "What must be true for this to work?"
+- **Observability gap analysis** asks: "When something goes wrong, can you SEE it?"
+
+The third question requires reasoning about the system's *meta-properties* — not its behavior,
+but its *visibility*. This is a second-order question: you have to first imagine a failure, then
+ask whether any defined signal would fire, then determine whether that signal is distinguishable
+from normal operation or from other failures.
+
+GPT-5's approach: enumerate every possible metric/event that SHOULD exist but doesn't. Design
+the telemetry architecture. (23 specific event/metric proposals.)
+
+Opus's approach: identify the places where available signals guide operators toward WRONG actions
+or create invisible boundaries between components. (14 findings, several about operator behavior.)
+
+This distinction maps well to previous findings:
+- GPT-5 is the **telemetry architect** — "here's what you should instrument"
+- Opus is the **incident analyst** — "here's where your instrumentation will mislead you"
+
+## Model Comparison to Previous Task Types
+
+| Metric | GPT-5 | Opus | Sonnet |
+|---|---|---|---|
+| Finding count | 23 | 14 | 11 |
+| Unique findings | 12 | 6 | 0 |
+| Tokens per finding | 656 | 321 | 142 |
+| Qualitative depth | Systematic/architectural | Operator-behavioral | Surface-level |
+
+Comparison to previous experiments:
+- Finding #9 (gap-finding): GPT-5=14, Opus=n/a, Sonnet=n/a
+- Finding #10 (assumptions): GPT-5=26, Opus=13, Sonnet=n/a
+- Finding #12 (assumptions, order-execution): GPT-5=20, Opus=12, Sonnet=17
+- Finding #13 (race conditions): GPT-5=12, Opus=10, Sonnet=7
+- **This experiment (observability): GPT-5=23, Opus=14, Sonnet=11**
+
+GPT-5 produced its highest finding count (23) outside of assumption-finding tasks. This suggests
+observability gap analysis plays to GPT-5's exhaustive enumeration strength: there are many
+possible gaps and GPT-5 tends to enumerate ALL of them.
+
+Sonnet's zero unique findings here vs. 6 unique findings in experiment #12 (order-execution
+assumptions) supports the task-type dependency. Sonnet contributes when the task requires
+reasoning about component interactions in a complex multi-component document. On simpler
+documents or systematic enumeration tasks, it adds nothing.
+
+## Practical Implication
+
+For observability reviews of system specifications:
+1. **GPT-5** for comprehensive instrumentation gap enumeration — produces a complete telemetry
+   design specification (events, metrics, metadata fields)
+2. **Opus** for identifying where available signals mislead operators — finds the dangerous
+   gaps where wrong remediation appears correct
+3. **Skip Sonnet** — no unique value on this task type
+
+The two-model configuration (GPT-5 + Opus) is optimal, same as spec-gap and testability analysis.
+
+## New Taxonomy Entry
+
+| Task category | Best models | Sonnet value | Key question |
+|---|---|---|---|
+| Observability gap analysis | GPT-5 (breadth) + Opus (operator-behavioral) | None | "When it breaks, can you see it?" |
+
+This slots alongside:
+- Spec-gap analysis: GPT-5 + Opus (no Sonnet value)
+- Testability analysis: GPT-5 + Opus (no Sonnet value)
+- Assumption-finding: All three contribute (Sonnet at ~85%)
+- Race conditions: GPT-5 + Opus only (Sonnet too imprecise)
+- Cross-component interaction: All three contribute
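The taxonomy entries above amount to a lookup from task category to the set of models worth running. A minimal Python sketch of that lookup follows. The category keys, the `TAXONOMY` table, and the `models_for` helper are hypothetical names, not part of any existing tooling, and the fallback to all three models for an unlisted category is an assumption; the model assignments themselves are copied from the lists above.

```python
ALL_THREE = ("GPT-5", "Opus", "Sonnet")
GPT5_AND_OPUS = ("GPT-5", "Opus")

# Task category -> models that contributed, as recorded in the taxonomy lists.
TAXONOMY = {
    "observability-gap": GPT5_AND_OPUS,
    "spec-gap": GPT5_AND_OPUS,
    "testability": GPT5_AND_OPUS,
    "assumption-finding": ALL_THREE,
    "race-conditions": GPT5_AND_OPUS,
    "cross-component": ALL_THREE,
}


def models_for(task_category: str) -> tuple:
    """Return the model set for a category.

    Assumption: a category with no taxonomy entry yet is run on all three
    models, since there is no evidence for dropping any of them.
    """
    return TAXONOMY.get(task_category, ALL_THREE)
```

For example, `models_for("observability-gap")` returns the GPT-5 and Opus pair, while a category not yet in the table returns all three models until an experiment says otherwise.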