claw b5b5b64a40 finding #46: operational blind spot analysis — new task type
Novel experiment testing 'what's invisible to operators' on gargoyle's
observability.md (563 lines). GPT-5 (18 findings), Opus (12), Sonnet (10).

Key discovery: 'actively misleads' category (observability creating false
confidence) is highest-value and Opus-dominated. Distinct from assumption-
finding, race conditions, or gap analysis — requires reasoning about
negation (what ISN'T instrumented vs what production needs).
2026-05-08 00:27:23 -07:00

# Finding #46: Operational blind spot analysis — a new task type revealing model divergence on "what's invisible"
**Date:** 2026-05-08
**Task:** Identify operational blind spots in gargoyle's `observability.md` (563 lines) —
scenarios where the observability design would fail to surface problems, actively mislead
operators, provide signal too late, or create diagnostic dead-ends.
**Novel aspect:** This is a new analytical task type distinct from assumption-finding (#10-12),
race condition identification (#13), cross-component interaction (#14), or gap-finding (#9).
It requires reasoning about the ABSENCE of signal — what can't you see? — and how
instrumentation choices create false confidence.
## Setup
All three models received the same document (full text, no truncation) and the same focused
analytical question via the HAI proxy. The prompt specified four categories of blind spot
(fail to surface, actively mislead, signal too late, diagnostic dead-ends) and required a
specific output format (Scenario, Why invisible/misleading, Impact, Detection gap). It
explicitly asked for trading-specific scenarios and references to specific mechanisms in
the document.

| Model | Time | Output tokens | Reasoning tokens | Blind spots found |
|---|---|---|---|---|
| GPT-5 | ~87s | 7,969 | 5,056 | 18 |
| Claude Opus 4.6 | ~120s | 4,591 | (internal) | 12 |
| Claude Sonnet 4.6 | ~38s | 1,930 | (internal) | 10 |
## What they found — common ground (all 3 identified)
- Order state post-submission is invisible (traces terminate at `order_manager.submit`;
no metrics for fills/cancels/rejects from broker)
- Position reconciliation gap (no metric for internal-vs-broker position drift)
- Market data staleness per-symbol undetectable at scale (cardinality exclusion of `symbol`
from `quote_feed.tick` metadata means per-symbol staleness invisible in Prometheus)
- Risk controls only measure rejections, not false approvals (stale data causing
incorrect approvals is invisible)
- LogStore GenServer mailbox/write failure silent due to async cast pattern
- Aggregator signal buffering without emission has no observability
## GPT-5 unique findings (not in either Claude model)
- **Sampling + span links break end-to-end traceability**: When signal evaluation sampling
is reduced, Aggregator span links reference unsampled trace IDs → Tempo shows broken
links, operators underestimate true latency
- **Webhook correlation dead-end**: `opentelemetry_phoenix` "does not capture HTTP headers";
inbound broker webhooks have no body-parsed attributes linking back to decision_id
- **15s Prometheus scrape + 10s VM poller too coarse for trading incidents**: 5-10 second
broker hiccups vanish between scrapes; micro-incidents permanently invisible
- **Price integrity/slippage invisible**: Span attribute allow-list forbids `price_target`;
no metric for filled_avg_price vs intended → systematic slippage undetectable
- **Duplicate decisions undetectable from current labels**: Counter tags (ticker/action/strategy)
can't distinguish legitimate trades from duplicates; decision_id isn't a Prometheus label
- **End-to-end reconstruction depends on both correlation keys always present**: If a code
path omits signal_id or decision_id, the three-way join breaks silently
- **Live logs UI shares Phoenix endpoint**: Under load, trading continues but LogsLive
becomes unreachable — operators lose situational awareness exactly when needed
- **Per-symbol staleness metric contradicts cardinality rules**: `feed.stale` with ticker
tag will be removed in practice, leaving only global signals
- **Metric/trace/log pillars can disagree without alerting**: Sampling/exporter failures
make metrics look green while traces vanish; no cross-pillar consistency check
- **Broker rate-limiting/HTTP failures at call-site**: `req` not in deps means no outgoing
HTTP instrumentation; retries/429s produce no trace or metric
- **Tempo/OTLP exporter outage**: No metric for exporter drop counts; operators see "no
traces" and may assume "nothing happening" vs "monitoring broken"
- **Order submission counter "green" while broker later rejects**: Post-submit broker-side
rejections have no counter or lifecycle span
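
The scrape-granularity finding above can be illustrated with a small simulation. This is a minimal sketch, not gargoyle code: the function name, the point-in-time gauge model, and the incident timings are assumptions made for illustration. Only the 15s scrape interval comes from the findings.

```python
def scrapes_that_see_incident(start_s, duration_s, scrape_interval_s=15, horizon_s=60):
    """Return the scrape timestamps that land inside the incident window.

    Models a gauge sampled at fixed intervals: a scrape observes the
    incident only if it falls inside [start_s, start_s + duration_s).
    """
    end_s = start_s + duration_s
    scrape_times = range(0, horizon_s + 1, scrape_interval_s)
    return [t for t in scrape_times if start_s <= t < end_s]


# Hypothetical 8-second broker hiccup starting at t=16s. It sits entirely
# between the scrapes at t=15s and t=30s, so no sample records it.
missed = scrapes_that_see_incident(start_s=16, duration_s=8)

# The same hiccup starting at t=10s happens to straddle the t=15s scrape.
caught = scrapes_that_see_incident(start_s=10, duration_s=8)

print(missed)  # []
print(caught)  # [15]
```

Whether an incident shorter than the scrape interval is seen depends only on where it falls relative to the scrape clock. A cumulative counter would still reflect the errors at the next scrape; the sketch models only point-in-time values such as gauges.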
## Claude Opus unique findings (not in either of the other models)
- **Aggregator swallowing signals without emission**: Signals entering the buffer that
never trigger aggregation produce ZERO observability signal — no error, no metric, no
timeout event. Distinguished from the common "aggregator gap" finding by reasoning about
the specific mechanism: the `aggregator.aggregate` span only fires on aggregation, not
on receipt. Buffered-and-forgotten signals are completely invisible.
- **OTel context propagation failure is silent**: `OpenTelemetry.Ctx.attach(nil)` returns
a no-op token; `with_span` creates a ROOT span (different trace_id). Traces look
structurally valid but are fragmented. No validation exists for context propagation success.
- **Strategy logic bug with correct observability**: All pillars report success when a
strategy emits the wrong signal (inverted buy/sell). No metric for financial correctness
(P&L, win rate, signal-vs-price-movement) — only operational correctness.
- **Decision latency histogram is survivor-biased**: Only records COMPLETED journeys.
Stuck decisions never appear. Histogram shows healthy p99 while decisions are lost.
Distinguished from generic "mailbox" finding by identifying the specific statistical
bias in the metric design.
- **Broker WebSocket reconnect succeeds but misses fill events during gap**: Reconnect
counter says "recovered"; fills during disconnect are permanently lost. Orders stay
in "submitted" state forever with no timeout metric.
- **@keep_keys allowlist silently drops new metadata**: Developer adds critical metadata
key, forgets to update allowlist. Logs appear in viewer (message intact) but structured
queries return nothing. Diagnostic dead-end that ACTIVELY misleads.
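
The survivor-bias finding above can be made concrete with a small worked example. This is a sketch under stated assumptions: the decision counts and latencies are invented for illustration, and the nearest-rank percentile is a stand-in for whatever the real histogram computes.

```python
import math


def p99(samples):
    """Nearest-rank 99th percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = math.ceil(0.99 * len(ordered))
    return ordered[rank - 1]


# Hypothetical workload: 100 decisions start, 95 complete, 5 get stuck.
started = 100
completed_latencies_ms = [20] * 90 + [40] * 5

# Stuck decisions never reach the code that records a latency sample,
# so the histogram is built from completed journeys only.
stuck = started - len(completed_latencies_ms)
histogram_p99_ms = p99(completed_latencies_ms)
lost_fraction = stuck / started

print(histogram_p99_ms)  # 40
print(lost_fraction)     # 0.05
```

The histogram reports a p99 of 40 ms while 5% of decisions never complete. In this sketch the only way to see the loss is to compare the started count with the completed count, which the latency metric alone cannot do.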
## Claude Sonnet findings (unique aspects)
- **Partial fill accumulation without detection**: Systematic 90% fills create gradual
position drift — no metric for unfilled quantity accumulation
- **Aggregator timing bias creating systematic directional bias**: Varying processing
latencies cause buy signals to consistently arrive before sells
- **Signal ID collision corrupting audit trail**: UUID weakness creating impossible
timelines in correlation queries
- **Telemetry emission timing creating phantom latencies**: Mailbox congestion in
telemetry handlers makes latency metrics unreliable (showing delays that don't
reflect actual performance)
## Quality assessment
**GPT-5 (18 findings):** Most exhaustive as usual. Strong on operational/infrastructure-level
blind spots (scrape granularity, exporter health, HTTP instrumentation gaps, cross-pillar
disagreement). Several findings showed careful reading of the document's specific configuration
choices (15s scrape, @keep_keys, span attribute allow-list). However, some findings were
variations on the same theme (multiple findings about post-submission order lifecycle gaps).
Every finding referenced specific document mechanisms and explained the causal chain clearly.

**Claude Opus (12 findings):** Highest insight-per-finding density. Two standout findings:

1. The `Ctx.attach(nil)` silent fragmentation. This is genuinely subtle and requires
   understanding Erlang OTel SDK internals to recognize that a nil context creates
   valid-looking but uncorrelated traces. No other model caught this.
2. The `@keep_keys` metadata stripping creating a diagnostic dead-end that ACTIVELY misleads
   (investigator sees log, queries by key, gets nothing, concludes event never happened).
   This is the only finding across all models that describes observability creating a
   FALSE NEGATIVE in investigation rather than just a gap.

Opus also continued its pattern of identifying design tensions: the strategy-correctness
finding (#7) explicitly names the gap between "operational correctness" (system works) and
"financial correctness" (system works RIGHT), a fundamental architectural blind spot that
the other models only touched peripherally.
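
The `@keep_keys` false negative described above can be sketched as a metadata allowlist filter. This is an illustrative model, not gargoyle's LogStore: `KEEP_KEYS`, `store_log`, `query_by_key`, and the `broker_order_id` key are hypothetical names, and the allowlist contents are assumed.

```python
# Hypothetical allowlist; the real @keep_keys contents are not known here.
KEEP_KEYS = {"signal_id", "decision_id"}


def store_log(store, message, metadata):
    """Persist the message, keeping only allowlisted metadata keys."""
    kept = {k: v for k, v in metadata.items() if k in KEEP_KEYS}
    store.append({"message": message, "metadata": kept})


def query_by_key(store, key, value):
    """Structured query: return entries whose metadata has key == value."""
    return [e for e in store if e["metadata"].get(key) == value]


store = []

# A developer starts logging broker_order_id but forgets the allowlist.
store_log(
    store,
    "order rejected by broker",
    {"decision_id": "d-1", "broker_order_id": "b-42"},
)

visible_messages = [e["message"] for e in store]
by_new_key = query_by_key(store, "broker_order_id", "b-42")
by_old_key = query_by_key(store, "decision_id", "d-1")

print(visible_messages)  # ['order rejected by broker']
print(by_new_key)        # []
print(len(by_old_key))   # 1
```

The message is visible in the viewer, yet the query on the new key returns nothing. An investigator who trusts the query concludes the event never happened, which is the false negative the finding describes.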

**Claude Sonnet (10 findings):** Weakest performance in this experiment. Several findings
were plausible but somewhat generic or low-specificity compared to the other models:

- "Signal ID collision" assumes a UUID weakness that isn't evidenced in the document
- "Telemetry emission timing creating phantom latencies" is theoretically possible but
  doesn't reference specific document mechanisms that would cause it
- "Aggregator timing bias" is an interesting idea but doesn't explain WHY this specific
  observability design would miss it

Sonnet's best finding was the partial-fill accumulation (no metric for systematic underfill),
which is genuinely trading-specific. But overall, it produced fewer findings, with less
document-grounding and more speculation.
## Key insight — "what's invisible" requires reasoning about negation
This task type is fundamentally about NEGATION: "given what IS instrumented, what ISN'T?"
This is harder than assumption-finding (which can work from what's stated) or race condition
analysis (which works from what's specified about concurrency). Here, the model must:

1. Build a mental model of what the observability design CAN see
2. Enumerate production scenarios (requiring domain knowledge)
3. Check each scenario against the coverage model
4. Identify scenarios that fall in the gaps

GPT-5 excelled at step 1 (thorough coverage mapping) and step 3 (systematic checking).
Opus excelled at step 2 (finding subtle scenarios like context propagation failure) and
identifying findings that ACTIVELY mislead rather than passively miss.
Sonnet struggled with step 3: some of its scenarios were valid, but its explanations of
WHY the observability design specifically misses them were weaker.
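
The four steps above amount to a set difference between what scenarios need and what is instrumented. The sketch below shows that reading. All signal and scenario names are hypothetical placeholders, not taken from `observability.md`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    name: str
    needs: frozenset  # signals an operator would need in order to notice it


# Step 1: coverage model, i.e. what the design CAN see (hypothetical names).
instrumented = frozenset(
    {"order.submit.count", "risk.reject.count", "feed.stale.global"}
)

# Step 2: production scenarios, which is where domain knowledge enters.
scenarios = [
    Scenario("order submission volume drops", frozenset({"order.submit.count"})),
    Scenario("broker rejects after submit", frozenset({"order.broker_reject.count"})),
    Scenario("position drifts from broker", frozenset({"position.drift"})),
]


def blind_spots(scenarios, instrumented):
    """Steps 3 and 4: map each uncovered scenario to its missing signals."""
    gaps = {}
    for scenario in scenarios:
        missing = scenario.needs - instrumented
        if missing:
            gaps[scenario.name] = sorted(missing)
    return gaps


gaps = blind_spots(scenarios, instrumented)
for name, missing in gaps.items():
    print(f"{name}: missing {missing}")
```

A set difference only captures the "fail to surface" category. The "actively misleads" findings are about signals that are present but imply the wrong thing, so they would not show up in a check of this shape.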
## Comparison to previous task types
| Task type | GPT-5 | Opus | Sonnet |
|---|---|---|---|
| Assumption-finding (#10-12) | 20-26 | 12-13 | 17 |
| Race conditions (#13) | 12 | 10 | 7 (with errors) |
| Cross-component (#14) | 10 | — | 8 |
| **Blind spot analysis** | **18** | **12** | **10** |
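
In this experiment GPT-5 produced 1.5x as many findings as Opus (18 vs 12), and Opus 1.2x
as many as Sonnet (12 vs 10). GPT-5 leads on raw count in every task type, but the ratios
are not stable across rows: in assumption-finding Sonnet (17) outproduced Opus (12-13), and
in race conditions Opus led Sonnet by about 1.4x (10 vs 7). Quality-per-finding continues to
favor Opus for finding the most architecturally insightful issues. GPT-5's breadth advantage
is real but includes more operational/infrastructure findings vs Opus's focus on design-level
blind spots.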
## Practical implications
1. **New analytical task for architecture review:** "What can't you see?" is a distinct
and valuable question to ask of any observability or monitoring design. It's not
covered by assumption-finding, gap-finding, or consistency checking.
2. **Model assignment for blind spot analysis:**
- GPT-5: Operational blind spots (infrastructure interactions, configuration gaps,
cross-system dependencies)
- Opus: Design-level blind spots (false confidence, active misdirection, semantic
gaps between what's measured and what matters)
- Sonnet: Not recommended for this task type — insufficient document grounding
3. **The "actively misleads" category is highest-value:** Of all findings across 3 models,
the ones that describe observability CREATING false confidence (rather than just missing
signal) are the most dangerous and actionable. Opus found 3 of these; GPT-5 found 2;
Sonnet found 0. This suggests Opus should be specifically tasked with: "Where does this
design create false confidence?"
## Updated task-model matrix
| Task | Best model(s) | Why |
|---|---|---|
| Assumption-finding | GPT-5 + Opus | Breadth + design tensions |
| Race conditions | GPT-5 + Opus | Sonnet unreliable for concurrency |
| Cross-component | GPT-5 + Sonnet | Both good; Sonnet recovers with structure |
| Cross-document consistency | Opus + GPT-5 | Opus dominates boundary reasoning |
| **Operational blind spots** | **GPT-5 + Opus** | **GPT-5 for coverage mapping; Opus for false confidence** |
| Bias detection | Any (with narrow framing) | Signal-to-noise matters more than model |
## Source
- Document: `gargoyle/docs/impl/observability.md` (563 lines)
- Models: GPT-5 (via HAI OpenAI endpoint), Claude Opus 4.6, Claude Sonnet 4.6 (via HAI Anthropic endpoint)
- No tools, no project context beyond the document itself