# Experiment #33: Observability Gap Analysis on aggregation.md

**Date:** 2026-05-06
**Task type:** Observability gap analysis (NEW analytical lens)
**Document:** gargoyle's `aggregation.md` (239 lines): decision engine signal aggregation with
state machines, timers, and cross-component forwarding

## Hypothesis

Observability gap analysis (identifying where system behavior becomes invisible, indistinguishable
from normal, or impossible to diagnose during failures) is a distinct analytical lens from failure
analysis or assumption-finding. Instead of asking "what can go wrong," it asks "when something goes
wrong, can you SEE it?" Models may differ in whether they identify technical instrumentation gaps
(missing metrics/events) vs. semantic indistinguishability problems (different failures that look
the same from outside).

## Method

The same structured prompt was sent to all three models via the HAI proxy on anvil. The prompt
specified 5 categories:
1. Silent failures (no observable signal)
2. Indistinguishable states (different problems, identical observable pattern)
3. Diagnostic dead zones (unobservable time windows)
4. Missing correlation (effects visible, causes invisible)
5. False-normal signals (metrics healthy but correctness degraded)

Required output format: Gap, Scenario, What's invisible, Impact, What the spec should add.

Models: GPT-5, Claude Opus 4.6, Claude Sonnet 4.6.
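
The five categories and the required output format can be pictured as a small record type. The
sketch below is illustrative only: the class names, field names, and the example finding's wording
are assumptions made here, not part of the actual prompt.

```python
from dataclasses import dataclass, fields
from enum import Enum
from typing import List


class GapCategory(Enum):
    """The five categories named in the prompt."""
    SILENT_FAILURE = "Silent failures"
    INDISTINGUISHABLE_STATE = "Indistinguishable states"
    DIAGNOSTIC_DEAD_ZONE = "Diagnostic dead zones"
    MISSING_CORRELATION = "Missing correlation"
    FALSE_NORMAL = "False-normal signals"


@dataclass(frozen=True)
class Finding:
    """One finding in the required output format (field names are hypothetical)."""
    category: GapCategory
    gap: str
    scenario: str
    whats_invisible: str
    impact: str
    spec_should_add: str


def missing_fields(finding: Finding) -> List[str]:
    """Return the names of required text fields that are empty or whitespace."""
    return [
        f.name
        for f in fields(finding)
        if f.name != "category" and not getattr(finding, f.name).strip()
    ]


# Illustrative wording for one of the common-ground findings.
example = Finding(
    category=GapCategory.SILENT_FAILURE,
    gap="Decision forwarding failures are silent",
    scenario="A decision forms but delivery to PortfolioRisk fails",
    whats_invisible="Whether the decision was delivered",
    impact="Decisions are lost without any alert",
    spec_should_add="",  # left empty on purpose to show validation
)
print(missing_fields(example))
```

A check like `missing_fields` makes it easy to reject model output that skips one of the five
required parts before findings are compared across models.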

## Results

| Model | Findings | Output tokens | Reasoning tokens | Latency | Tokens/finding |
|---|---|---|---|---|---|
| GPT-5 | 23 | 9,433 | 5,632 | 153s | 656 |
| Opus 4.6 | 14 | 4,493 | (internal) | 103s | 321 |
| Sonnet 4.6 | 11 | 1,562 | (internal) | 36s | 142 |
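
The Tokens/finding column is not defined in the text. The sketch below reproduces it under one
assumption: reasoning tokens, where reported, are added to output tokens before dividing by the
finding count. Under that assumption the Opus and Sonnet values match the table, and the GPT-5
value comes out at 655, one below the 656 reported.

```python
# Figures copied from the results table; reasoning tokens are None where
# the table lists them as "(internal)".
RUNS = {
    "GPT-5": {"findings": 23, "output_tokens": 9433, "reasoning_tokens": 5632},
    "Opus 4.6": {"findings": 14, "output_tokens": 4493, "reasoning_tokens": None},
    "Sonnet 4.6": {"findings": 11, "output_tokens": 1562, "reasoning_tokens": None},
}


def tokens_per_finding(run: dict) -> int:
    """Assumed definition: (output + reported reasoning tokens) / findings, rounded."""
    total = run["output_tokens"] + (run["reasoning_tokens"] or 0)
    return round(total / run["findings"])


for model, run in RUNS.items():
    print(f"{model}: {tokens_per_finding(run)} tokens/finding")
```

Because the Claude models do not report reasoning tokens, the column compares GPT-5's total
generation cost against only the visible output of the other two models.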

## Common Ground (all 3 identified)

- **No telemetry during buffering state**: groups are opaque while accumulating signals; only
  terminal events (completion/expiry) produce observable signals
- **Decision forwarding failures are silent**: decisions form (event fires) but delivery to
  PortfolioRisk has no success/failure signal
- **Crash loses groups with no quantification**: in-flight groups vanish but nothing reports
  how much was lost
- **Timeout reason is indistinguishable**: `:timeout` expiry doesn't discriminate between
  "signals stopped arriving" vs "timeout misconfigured" vs "market conditions changed"
- **Force-complete decisions look normal downstream**: a decision formed from 1/5 expected
  signals is indistinguishable from a complete decision to PortfolioRisk
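
As an illustration of the last gap, a decision record could carry its own completeness so that a
forced decision no longer looks like a complete one downstream. This is a minimal sketch: the
`Decision` type, its field names, and the `completion_reason` values are hypothetical and do not
come from `aggregation.md`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    """Hypothetical decision record that exposes how complete its inputs were."""
    group_id: str
    signals_received: int
    signals_expected: int
    completion_reason: str  # assumed values, e.g. "complete" or "forced"

    def __post_init__(self) -> None:
        if self.signals_expected <= 0:
            raise ValueError("signals_expected must be positive")
        if self.signals_received < 0:
            raise ValueError("signals_received must not be negative")

    @property
    def completeness(self) -> float:
        """Fraction of expected signals that actually contributed."""
        return self.signals_received / self.signals_expected

    @property
    def is_partial(self) -> bool:
        return self.signals_received < self.signals_expected


# The 1-of-5 case from the list above.
forced = Decision("group-1", signals_received=1, signals_expected=5,
                  completion_reason="forced")
print(forced.is_partial, forced.completeness)
```

With fields like these, a consumer such as PortfolioRisk could at least filter or down-weight
partial decisions instead of treating every decision as equally well supported.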

## GPT-5 Unique Findings (not in either Claude model)

1. **No group_id/decision_id correlation across events**: lifecycle events can't be joined;
   you can't trace a decision back through its group to its constituent signals
2. **Expired groups lack instrument context**: can't attribute expiration spikes to
   specific instruments
3. **Timer start/deadline not observable**: operators can't verify timers were set as intended
4. **No configuration context on events**: timeout_ms, threshold_N, capacity_limit not
   attached to events; can't correlate config changes with behavior changes
5. **Pattern-complete predicate is opaque**: no visibility into evaluation count, partial-match
   state, or "why false"; impossible to tune pattern strategies
6. **No per-strategy memory/backpressure signals**: no gauges for buffered signal count
   or memory footprint; misfiring strategy fills memory silently
7. **Unknown strategy signal drops are only "logged"**: no structured metric for discarded
   signals; operational data loss goes unmetered
8. **No cross-service trace context propagation**: no mention of trace_id/span_id flowing
   signal → aggregation → PortfolioRisk → OrderManager
9. **No ranking decision transparency**: when time-windowed selects "best" signal, no
   visibility into which candidate won, why, or what alternatives existed
10. **Capacity-triggered force-complete vs normal completion not explicitly monitored**:
    operators alerting on `:capacity` expirations miss capacity-triggered *completions*
11. **No version metadata**: events don't carry build/algorithm/config version; version
    skew causes indistinguishable behavioral drift
12. **No forwarding queue/latency visibility**: no metric for decision dispatch latency
    or queue depth between formation and delivery
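
Several of these findings (1, 2, 4, 8, and 11) amount to missing fields on the event envelope.
The sketch below shows one way such an envelope might look. The event names, the `EventContext`
type, and the helper functions are hypothetical; only `group_id`, `decision_id`, `trace_id`,
`timeout_ms`, `threshold_N`, and `capacity_limit` are names taken from the findings above.

```python
import time
from dataclasses import asdict, dataclass
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class EventContext:
    """Context attached to every lifecycle event of one group (hypothetical)."""
    group_id: str
    instrument: str
    trace_id: str
    config: Dict[str, Any]  # e.g. timeout_ms, threshold_N, capacity_limit
    build_version: str


def make_event(name: str, ctx: EventContext,
               decision_id: Optional[str] = None,
               **measurements: Any) -> Dict[str, Any]:
    """Build a structured event that can be joined on group_id or trace_id."""
    return {
        "event": name,
        "timestamp": time.time(),
        "decision_id": decision_id,
        **asdict(ctx),
        "measurements": measurements,
    }


def events_for_group(events: List[Dict[str, Any]], group_id: str) -> List[Dict[str, Any]]:
    """Reconstruct one group's lifecycle from a mixed event stream."""
    return [e for e in events if e["group_id"] == group_id]


ctx = EventContext(
    group_id="g-42",
    instrument="EXAMPLE",  # placeholder instrument name
    trace_id="trace-abc",
    config={"timeout_ms": 600_000, "threshold_N": 5, "capacity_limit": 100},
    build_version="0.0.0-example",
)
stream = [
    make_event("group_opened", ctx),
    make_event("decision_formed", ctx, decision_id="d-7", signals_received=5),
]
print([e["event"] for e in events_for_group(stream, "g-42")])
```

Attaching the same context to every event is what makes the join possible: a decision can be
traced back to its group, and an expiration spike can be grouped by instrument or config value.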

## Opus Unique Findings (not in either other model)

1. **Signals in-flight during crash window have no recorded fate**: signals dispatched by
   SignalRisk but not yet received by the aggregator vanish with no trace on either side.
   Distinguished from "groups lost on crash" because these signals never entered the
   aggregator's state. Unique insight: the acknowledgment boundary itself is invisible.
2. **Custom predicate FAILURE is observationally identical to predicate returning false**:
   a predicate that throws an exception and one that correctly returns false produce the same
   downstream effect (group stays in Buffering, eventually times out). Operators misdiagnose
   code bugs as strategy calibration problems.
3. **Two causes of `:capacity` expiry require OPPOSITE remediations but share the same
   metric pattern**: `:capacity` might mean "limit too low" OR "strategy misfiring."
   A misfiring strategy requires investigation; a low limit requires raising it. Raising the
   limit on a misfiring strategy converts bounded failure to unbounded memory growth.
4. **Decision formation-to-market-conditions temporal correlation is missing**: contributing
   signals were generated at T+0 but the decision forms at T+10min; no metric captures how
   stale the decision's inputs are relative to current market state. Different from GPT-5's
   "group duration" finding because this is specifically about *market relevance* decay.
5. **Expired groups can't be correlated to missed P&L**: expired groups represent missed
   trades but lack the business content (instrument, direction) needed to compute opportunity
   cost against actual market moves post-expiry.
6. **A "healthy but idle" aggregator is indistinguishable from a broken signal channel**:
   no liveness signal distinguishes "no signals because market is quiet" from "no signals
   because delivery channel is broken." Unique angle: this creates a false-normal condition
   specific to the *absence* of activity rather than degradation of existing activity.
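
Finding 2 can be made concrete with a three-valued evaluation result. The sketch below is a
minimal illustration, not the aggregator's actual predicate interface: `PredicateOutcome`,
`evaluate_predicate`, and the example predicates are all hypothetical.

```python
from enum import Enum
from typing import Any, Callable, Optional, Sequence, Tuple


class PredicateOutcome(Enum):
    MATCHED = "matched"
    NOT_MATCHED = "not_matched"
    ERROR = "error"


def evaluate_predicate(
    predicate: Callable[[Sequence[Any]], bool],
    signals: Sequence[Any],
) -> Tuple[PredicateOutcome, Optional[str]]:
    """Run a custom predicate and keep 'returned false' apart from 'raised'."""
    try:
        result = predicate(signals)
    except Exception as exc:  # a bug in the predicate, not a calibration issue
        return PredicateOutcome.ERROR, repr(exc)
    outcome = PredicateOutcome.MATCHED if result else PredicateOutcome.NOT_MATCHED
    return outcome, None


def needs_two_signals(signals: Sequence[Any]) -> bool:
    return len(signals) >= 2


def buggy_predicate(signals: Sequence[Any]) -> bool:
    # Raises KeyError when the expected key is missing.
    return signals[0]["direction"] == "long"


print(evaluate_predicate(needs_two_signals, [{"direction": "long"}]))
print(evaluate_predicate(buggy_predicate, [{}]))
```

Both calls leave the group in Buffering, but only the second one is a code bug. Emitting the
outcome as a metric label would let operators tell the two cases apart before the timeout fires.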

## Sonnet Findings

Sonnet produced 11 findings in 36s. None were truly unique: all overlapped substantially
with GPT-5 or Opus findings. Sonnet identified the same categories of issues but at lower
specificity:

- Memory leaks from stuck groups (covered more precisely by GPT-5 #6 and Opus #2.3)
- Decision forwarding silence (common ground)
- Timeout indistinguishability (common ground)
- Buffering dead zone (common ground)
- Crash impact quantification (common ground)
- Immediate algorithm masking excessive decision rate (covered more precisely by GPT-5 #16)
- Signal quality hidden by completion metrics (covered by Opus #5.1, GPT-5 #10)
- Overly permissive predicate (covered by Opus #1.3)

Sonnet was the fastest (36s, 1,562 tokens) but produced no unique insights for this task type.

## Quality Assessment

- **GPT-5** was exhaustive and systematic: 23 findings covering all 5 categories, with specific
  telemetry event names, measurement fields, and metadata specifications. Multiple findings
  addressed the *instrumentation architecture* itself (trace propagation, config versioning,
  event correlation schema). GPT-5 treated this as a telemetry engineering problem and designed
  a complete observability layer. Its unique contributions are mostly about infrastructure
  (correlation IDs, trace context, config hashes) that enable diagnosis rather than about
  specific failure scenarios.

- **Opus** produced fewer findings (14) but several showed qualitatively different reasoning.
  The "acknowledgment boundary" finding (#1.2) identifies an observability gap that exists
  *between* components: neither side knows signals were lost because neither side records
  the handoff. The "opposite remediations" finding (#2.3) identifies where the same metric
  guides operators toward WRONG actions depending on an invisible variable. Opus consistently
  reasoned about *what operators would DO* with the available signals, not just what signals
  are missing.

- **Sonnet** produced no unique value on this task type. Every finding was a less-specific
  version of something GPT-5 or Opus found. This is consistent with the task-type taxonomy
  from previous experiments: Sonnet adds nothing on systematic/exhaustive analysis tasks.
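
The "acknowledgment boundary" gap credited to Opus above can be pictured as a reconciliation
between what the sender dispatched and what the receiver saw. The sketch assumes, hypothetically,
that both sides record signal IDs durably; nothing in the source says either side does today, which
is exactly the gap.

```python
from typing import Dict, Iterable, List


def reconcile_handoff(dispatched_ids: Iterable[str],
                      received_ids: Iterable[str]) -> Dict[str, List[str]]:
    """Compare sender-side and receiver-side records of the same handoff."""
    dispatched = set(dispatched_ids)
    received = set(received_ids)
    return {
        # Sent but never seen by the receiver: lost at the boundary.
        "lost_in_flight": sorted(dispatched - received),
        # Seen by the receiver but unknown to the sender: a bookkeeping error.
        "unexpected": sorted(received - dispatched),
    }


# Hypothetical IDs: s3 was dispatched just before a crash and never received.
report = reconcile_handoff(["s1", "s2", "s3"], ["s1", "s2"])
print(report)
```

Without a record on at least one side of the handoff, the set difference cannot be computed at
all, so the loss stays invisible.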

## Key Insight: Observability Analysis as Task Type

This lens differs from both failure analysis and assumption-finding:
- **Failure analysis** asks: "What can go wrong?"
- **Assumption-finding** asks: "What must be true for this to work?"
- **Observability gap analysis** asks: "When something goes wrong, can you SEE it?"

The third question requires reasoning about the system's *meta-properties*: not its behavior,
but its *visibility*. This is a second-order question: you have to first imagine a failure, then
ask whether any defined signal would fire, then determine whether that signal is distinguishable
from normal operation or from other failures.
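
The three-step check just described can be sketched as a small classifier over signal signatures.
Everything in the example is an assumption made for illustration: the signal names, the scenario
names, and the idea that a scenario can be reduced to the set of signals it emits.

```python
from typing import Dict, FrozenSet

Signature = FrozenSet[str]


def classify_visibility(failure: str,
                        signatures: Dict[str, Signature],
                        normal: Signature) -> str:
    """Classify one imagined failure by what an observer would see."""
    observed = signatures[failure]
    # Step 2: does any defined signal fire at all?
    if not observed:
        return "silent"
    # Step 3a: is the signal pattern the same as normal operation?
    if observed == normal:
        return "false-normal"
    # Step 3b: is it the same as some other failure?
    twins = sorted(name for name, sig in signatures.items()
                   if name != failure and sig == observed)
    if twins:
        return "indistinguishable from: " + ", ".join(twins)
    return "distinguishable"


# Step 1: imagined failures, each mapped to the signals it would emit.
NORMAL: Signature = frozenset({"decision_formed"})
SIGNATURES: Dict[str, Signature] = {
    "crash_loses_groups": frozenset(),
    "forwarding_failure": frozenset({"decision_formed"}),
    "signals_stopped_arriving": frozenset({"group_expired:timeout"}),
    "timeout_misconfigured": frozenset({"group_expired:timeout"}),
}

for scenario in SIGNATURES:
    print(scenario, "->", classify_visibility(scenario, SIGNATURES, NORMAL))
```

The three non-trivial outcomes line up with the prompt's categories of silent failures,
false-normal signals, and indistinguishable states.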

GPT-5's approach: enumerate every possible metric/event that SHOULD exist but doesn't. Design
the telemetry architecture. (23 specific event/metric proposals.)

Opus's approach: identify the places where available signals guide operators toward WRONG actions
or create invisible boundaries between components. (14 findings, several about operator behavior.)

This distinction maps well to previous findings:
- GPT-5 is the **telemetry architect**: "here's what you should instrument"
- Opus is the **incident analyst**: "here's where your instrumentation will mislead you"

## Model Comparison to Previous Task Types

| Metric | GPT-5 | Opus | Sonnet |
|---|---|---|---|
| Finding count | 23 | 14 | 11 |
| Unique findings | 12 | 6 | 0 |
| Tokens per finding | 656 | 321 | 142 |
| Qualitative depth | Systematic/architectural | Operator-behavioral | Surface-level |
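
The "Unique findings" row counts findings that no other model reported. Assuming each finding has
first been reduced to a normalized label (a manual judgment step that this sketch does not
automate), the count is a set difference. The labels below are a toy subset invented for
illustration, not the full data behind the table.

```python
from typing import Dict, Set


def unique_findings(findings_by_model: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """For each model, keep only the findings no other model reported."""
    unique: Dict[str, Set[str]] = {}
    for model, own in findings_by_model.items():
        others: Set[str] = set()
        for other_model, other_findings in findings_by_model.items():
            if other_model != model:
                others |= other_findings
        unique[model] = own - others
    return unique


# Toy subset with hypothetical labels.
TOY = {
    "GPT-5": {"buffering_opaque", "no_correlation_ids", "no_trace_context"},
    "Opus": {"buffering_opaque", "ack_boundary_invisible"},
    "Sonnet": {"buffering_opaque"},
}

for model, labels in unique_findings(TOY).items():
    print(model, sorted(labels))
```

In practice the hard part is the normalization: two models rarely phrase the same gap the same
way, so deciding that two findings "overlap substantially" remains a human call.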

Comparison to previous experiments:
- Experiment #9 (gap-finding): GPT-5=14, Opus=n/a, Sonnet=n/a
- Experiment #10 (assumptions): GPT-5=26, Opus=13, Sonnet=n/a
- Experiment #12 (assumptions, order-execution): GPT-5=20, Sonnet=17, Opus=12
- Experiment #13 (race conditions): GPT-5=12, Opus=10, Sonnet=7
- **This experiment (observability): GPT-5=23, Opus=14, Sonnet=11**

GPT-5 produced its highest finding count (23) outside of assumption-finding tasks. This suggests
observability gap analysis plays to GPT-5's exhaustive enumeration strength: there are many
possible gaps and GPT-5 tends to enumerate all of them.

Sonnet's zero unique findings here vs. 6 unique findings in experiment #12 (order-execution
assumptions) is consistent with the task-type dependency. Sonnet contributes when the task
requires reasoning about component interactions in a complex multi-component document. On simpler
documents or systematic enumeration tasks, it has added nothing.

## Practical Implication

For observability reviews of system specifications:
1. **GPT-5** for comprehensive instrumentation gap enumeration: produces a complete telemetry
   design specification (events, metrics, metadata fields)
2. **Opus** for identifying where available signals mislead operators: finds the dangerous
   gaps where wrong remediation appears correct
3. **Skip Sonnet**: no unique value on this task type

The two-model configuration (GPT-5 + Opus) is optimal here, as it was for spec-gap and
testability analysis.

## New Taxonomy Entry

| Task category | Best for | Sonnet value | Key question |
|---|---|---|---|
| Observability gap analysis | GPT-5 (breadth) + Opus (operator-behavioral) | None | "When it breaks, can you see it?" |

This slots alongside:
- Spec-gap analysis: GPT-5 + Opus (no Sonnet value)
- Testability analysis: GPT-5 + Opus (no Sonnet value)
- Assumption-finding: All three contribute (Sonnet at ~85%)
- Race conditions: GPT-5 + Opus only (Sonnet too imprecise)
- Cross-component interaction: All three contribute
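
If the taxonomy is used to decide which models to run for a given review, it can be kept as a
simple lookup table. The sketch below encodes the entries listed above; the `TAXONOMY` name, the
`models_for` helper, and the lower-cased keys are assumptions made for illustration.

```python
from typing import Dict, Tuple

# Task category -> models worth running, as listed in the taxonomy above.
TAXONOMY: Dict[str, Tuple[str, ...]] = {
    "observability gap analysis": ("GPT-5", "Opus"),
    "spec-gap analysis": ("GPT-5", "Opus"),
    "testability analysis": ("GPT-5", "Opus"),
    "assumption-finding": ("GPT-5", "Opus", "Sonnet"),
    "race conditions": ("GPT-5", "Opus"),
    "cross-component interaction": ("GPT-5", "Opus", "Sonnet"),
}


def models_for(task_category: str) -> Tuple[str, ...]:
    """Look up which models to run; unknown categories raise KeyError."""
    key = task_category.strip().lower()
    if key not in TAXONOMY:
        raise KeyError(f"no taxonomy entry for task category: {task_category!r}")
    return TAXONOMY[key]


print(models_for("Observability gap analysis"))
```

Raising on unknown categories, rather than defaulting to all three models, keeps a new task type
from being routed silently before an experiment has been run for it.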