finding #46: operational blind spot analysis — new task type
Novel experiment testing 'what's invisible to operators' on gargoyle's observability.md (563 lines). GPT-5 (18 findings), Opus (12), Sonnet (10). Key discovery: 'actively misleads' category (observability creating false confidence) is highest-value and Opus-dominated. Distinct from assumption-finding, race conditions, or gap analysis; requires reasoning about negation (what ISN'T instrumented vs what production needs).
@@ -0,0 +1,201 @@

# Finding #46: Operational blind spot analysis, a new task type revealing model divergence on "what's invisible"

**Date:** 2026-05-08
**Task:** Identify operational blind spots in gargoyle's `observability.md` (563 lines):
scenarios where the observability design would fail to surface problems, actively mislead
operators, provide signal too late, or create diagnostic dead-ends.
**Novel aspect:** This is a new analytical task type distinct from assumption-finding (#10-12),
race condition identification (#13), cross-component interaction (#14), or gap-finding (#9).
It requires reasoning about the ABSENCE of signal (what can't you see?) and how
instrumentation choices create false confidence.


## Setup

Same document (full text, no truncation) + same focused analytical question to all 3 models
via HAI proxy. Prompt specified 4 categories of blind spot (fail to surface, actively mislead,
signal too late, diagnostic dead-ends) and required specific output format (Scenario, Why
invisible/misleading, Impact, Detection gap). Explicitly asked for trading-specific scenarios
and references to specific mechanisms in the document.

| Model | Time | Output tokens | Reasoning tokens | Blind spots found |
|---|---|---|---|---|
| GPT-5 | ~87s | 7,969 | 5,056 | 18 |
| Claude Opus 4.6 | ~120s | 4,591 | (internal) | 12 |
| Claude Sonnet 4.6 | ~38s | 1,930 | (internal) | 10 |

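
The four prompt categories and the four-field output format described above map naturally onto a small record type, which makes per-model and per-category tallies easy. The sketch below is illustrative only and is written in Python for brevity: `BlindSpotCategory`, `BlindSpot`, and `count_by_category` are hypothetical names, not part of the experiment's tooling, and the sample record paraphrases the Opus `@keep_keys` finding reported later in this write-up.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class BlindSpotCategory(Enum):
    """The four categories named in the prompt."""
    FAIL_TO_SURFACE = "fail to surface"
    ACTIVELY_MISLEAD = "actively mislead"
    SIGNAL_TOO_LATE = "signal too late"
    DIAGNOSTIC_DEAD_END = "diagnostic dead-end"


@dataclass(frozen=True)
class BlindSpot:
    """One finding in the required output format (field names are assumptions)."""
    model: str
    category: BlindSpotCategory
    scenario: str
    why_invisible: str
    impact: str
    detection_gap: str


def count_by_category(findings):
    """Tally findings per category, e.g. to compare 'actively mislead' counts."""
    return Counter(f.category for f in findings)


# Sample record paraphrasing one reported finding (wording is illustrative).
example = BlindSpot(
    model="Claude Opus 4.6",
    category=BlindSpotCategory.ACTIVELY_MISLEAD,
    scenario="New metadata key is not added to the @keep_keys allowlist",
    why_invisible="Log message is intact but the structured key is dropped",
    impact="Structured queries return nothing; investigator concludes no event",
    detection_gap="Nothing reports that a key was dropped",
)
counts = count_by_category([example])
```

Keeping the category as an explicit field is what allows a per-category comparison across models instead of a raw finding count.
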

## What they found: common ground (all 3 identified)

- Order state post-submission is invisible (traces terminate at `order_manager.submit`;
  no metrics for fills/cancels/rejects from broker)
- Position reconciliation gap (no metric for internal-vs-broker position drift)
- Market data staleness per-symbol undetectable at scale (cardinality exclusion of `symbol`
  from `quote_feed.tick` metadata means per-symbol staleness invisible in Prometheus)
- Risk controls only measure rejections, not false approvals (stale data causing
  incorrect approvals is invisible)
- LogStore GenServer mailbox/write failure silent due to async cast pattern
- Aggregator signal buffering without emission has no observability

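
The per-symbol staleness item above turns on a cardinality trade-off: once `symbol` is excluded from metric labels, a single stale symbol disappears into a global signal. A minimal sketch of one way to keep the labels low-cardinality while still surfacing the problem, assuming per-symbol state can be held in process; `StalenessTracker` and its field names are hypothetical and come neither from `observability.md` nor from any model's output.

```python
class StalenessTracker:
    """Tracks last-tick time per symbol and exports only aggregate values."""

    def __init__(self, stale_after_s):
        self.stale_after_s = stale_after_s
        self._last_tick_s = {}

    def record_tick(self, symbol, now_s):
        self._last_tick_s[symbol] = now_s

    def summary(self, now_s):
        ages = {sym: now_s - t for sym, t in self._last_tick_s.items()}
        stale = sorted(sym for sym, age in ages.items() if age > self.stale_after_s)
        return {
            # These two values need no per-symbol label.
            "stale_symbol_count": len(stale),
            "max_tick_age_s": max(ages.values(), default=0.0),
            # Symbol names go to logs, not to metric labels.
            "stale_symbols": stale,
        }


staleness = StalenessTracker(stale_after_s=5.0)
staleness.record_tick("AAPL", now_s=0.0)
staleness.record_tick("MSFT", now_s=9.0)
report = staleness.summary(now_s=10.0)
```

In this example `report` flags one stale symbol with a maximum tick age of 10 seconds, even though no metric carries a `symbol` label.
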

## GPT-5 unique findings (not in either Claude model)

- **Sampling + span links break end-to-end traceability**: When signal evaluation sampling
  is reduced, Aggregator span links reference unsampled trace IDs → Tempo shows broken
  links, operators underestimate true latency
- **Webhook correlation dead-end**: `opentelemetry_phoenix` "does not capture HTTP headers";
  inbound broker webhooks have no body-parsed attributes linking back to decision_id
- **15s Prometheus scrape + 10s VM poller too coarse for trading incidents**: 5-10 second
  broker hiccups vanish between scrapes; micro-incidents permanently invisible
- **Price integrity/slippage invisible**: Span attribute allow-list forbids `price_target`;
  no metric for filled_avg_price vs intended → systematic slippage undetectable
- **Duplicate decisions undetectable from current labels**: Counter tags (ticker/action/strategy)
  can't distinguish legitimate trades from duplicates; decision_id isn't a Prometheus label
- **End-to-end reconstruction depends on both correlation keys always present**: If a code
  path omits signal_id or decision_id, the three-way join breaks silently
- **Live logs UI shares Phoenix endpoint**: Under load, trading continues but LogsLive
  becomes unreachable; operators lose situational awareness exactly when needed
- **Per-symbol staleness metric contradicts cardinality rules**: `feed.stale` with ticker
  tag will be removed in practice, leaving only global signals
- **Metric/trace/log pillars can disagree without alerting**: Sampling/exporter failures
  make metrics look green while traces vanish; no cross-pillar consistency check
- **Broker rate-limiting/HTTP failures at call-site**: `req` not in deps means no outgoing
  HTTP instrumentation; retries/429s produce no trace or metric
- **Tempo/OTLP exporter outage**: No metric for exporter drop counts; operators see "no
  traces" and may assume "nothing happening" vs "monitoring broken"
- **Order submission counter "green" while broker later rejects**: Post-submit broker-side
  rejections have no counter or lifecycle span

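
The scrape-granularity finding in the list above can be made concrete with a small simulation. The sketch assumes a hypothetical 8-second broker outage that falls between two 15-second scrapes: a health gauge sampled at scrape time never shows it, while a cumulative counter of down-seconds (an assumed alternative, not something the document defines) still reflects it at the next scrape.

```python
def scrape_health_gauge(outage_start_s, outage_end_s, interval_s, horizon_s):
    """Gauge value (1 = up, 0 = down) observed at each scrape instant."""
    return [
        (t, 0 if outage_start_s <= t < outage_end_s else 1)
        for t in range(0, horizon_s + 1, interval_s)
    ]


def scrape_down_seconds_counter(outage_start_s, outage_end_s, interval_s, horizon_s):
    """Cumulative seconds spent down, as observed at each scrape instant."""
    return [
        (t, max(0, min(t, outage_end_s) - outage_start_s))
        for t in range(0, horizon_s + 1, interval_s)
    ]


# Hypothetical outage from t=16s to t=24s, scraped every 15s for 45s.
gauge = scrape_health_gauge(16, 24, 15, 45)
counter = scrape_down_seconds_counter(16, 24, 15, 45)
```

Here `gauge` reports healthy at every scrape, while `counter` rises from 0 to 8 between the scrapes at 15s and 30s. The point is about metric type, not scrape interval: a value that only exists at the sampling instant can miss anything shorter than the interval.
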

## Claude Opus unique findings (not in either other model)

- **Aggregator swallowing signals without emission**: Signals entering the buffer that
  never trigger aggregation produce ZERO observability signal: no error, no metric, no
  timeout event. Distinguished from the common "aggregator gap" finding by reasoning about
  the specific mechanism: the `aggregator.aggregate` span only fires on aggregation, not
  on receipt. Buffered-and-forgotten signals are completely invisible.
- **OTel context propagation failure is silent**: `OpenTelemetry.Ctx.attach(nil)` returns
  a no-op token; `with_span` creates a ROOT span (different trace_id). Traces look
  structurally valid but are fragmented. No validation exists for context propagation success.
- **Strategy logic bug with correct observability**: All pillars report success when a
  strategy emits the wrong signal (inverted buy/sell). No metric for financial correctness
  (P&L, win rate, signal-vs-price-movement), only for operational correctness.
- **Decision latency histogram is survivor-biased**: Only records COMPLETED journeys.
  Stuck decisions never appear. Histogram shows healthy p99 while decisions are lost.
  Distinguished from generic "mailbox" finding by identifying the specific statistical
  bias in the metric design.
- **Broker WebSocket reconnect succeeds but misses fill events during gap**: Reconnect
  counter says "recovered"; fills during disconnect are permanently lost. Orders stay
  in "submitted" state forever with no timeout metric.
- **@keep_keys allowlist silently drops new metadata**: Developer adds critical metadata
  key, forgets to update allowlist. Logs appear in viewer (message intact) but structured
  queries return nothing. Diagnostic dead-end that ACTIVELY misleads.

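
The survivor-bias finding in the list above is easy to reproduce in miniature. A minimal sketch, assuming decisions are tracked by ID from start to completion; `DecisionTracker` is a hypothetical helper, and the in-flight age it exposes is one possible complementary signal, not a metric the document defines.

```python
class DecisionTracker:
    """Records completed latencies and keeps track of decisions still in flight."""

    def __init__(self):
        self._started_s = {}
        self.completed_latencies_s = []

    def start(self, decision_id, now_s):
        self._started_s[decision_id] = now_s

    def complete(self, decision_id, now_s):
        # Only decisions that reach this call ever enter the latency data.
        self.completed_latencies_s.append(now_s - self._started_s.pop(decision_id))

    def in_flight_count(self):
        return len(self._started_s)

    def oldest_in_flight_age_s(self, now_s):
        if not self._started_s:
            return 0.0
        return now_s - min(self._started_s.values())


decisions = DecisionTracker()
decisions.start("d1", 0.0)
decisions.start("d2", 1.0)  # this decision gets stuck and never completes
decisions.start("d3", 2.0)
decisions.complete("d1", 0.5)
decisions.complete("d3", 2.5)

worst_completed_s = max(decisions.completed_latencies_s)
stuck_count = decisions.in_flight_count()
stuck_age_s = decisions.oldest_in_flight_age_s(60.0)
```

The completed-latency data tops out at half a second, while one decision has been in flight for 59 seconds. Any percentile computed from `completed_latencies_s` alone looks healthy.
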

## Claude Sonnet findings (unique aspects)

- **Partial fill accumulation without detection**: Systematic 90% fills create gradual
  position drift; no metric for unfilled quantity accumulation
- **Aggregator timing bias creating systematic directional bias**: Varying processing
  latencies cause buy signals to consistently arrive before sells
- **Signal ID collision corrupting audit trail**: UUID weakness creating impossible
  timelines in correlation queries
- **Telemetry emission timing creating phantom latencies**: Mailbox congestion in
  telemetry handlers makes latency metrics unreliable (showing delays that don't
  reflect actual performance)

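
Sonnet's partial-fill scenario comes down to simple accumulation. The sketch below assumes requested and filled quantities are available per symbol; `UnfilledTracker` and the symbol `XYZ` are hypothetical, and the 90% fill pattern is the illustrative figure from the finding, not measured data.

```python
from collections import defaultdict


class UnfilledTracker:
    """Accumulates requested vs filled quantity per symbol."""

    def __init__(self):
        self._requested = defaultdict(int)
        self._filled = defaultdict(int)

    def record_order(self, symbol, quantity):
        self._requested[symbol] += quantity

    def record_fill(self, symbol, quantity):
        self._filled[symbol] += quantity

    def unfilled_quantity(self, symbol):
        return self._requested[symbol] - self._filled[symbol]


fills = UnfilledTracker()
for _ in range(10):
    fills.record_order("XYZ", 100)
    fills.record_fill("XYZ", 90)  # each order is only 90% filled
drift = fills.unfilled_quantity("XYZ")
```

After ten such orders the accumulated shortfall equals one full order, which is the kind of gradual drift a per-order success counter would not show.
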

## Quality assessment

**GPT-5 (18 findings):** Most exhaustive as usual. Strong on operational/infrastructure-level
blind spots (scrape granularity, exporter health, HTTP instrumentation gaps, cross-pillar
disagreement). Several findings showed careful reading of the document's specific configuration
choices (15s scrape, @keep_keys, span attribute allow-list). However, some findings were
variations on the same theme (multiple findings about post-submission order lifecycle gaps).
Every finding referenced specific document mechanisms and explained the causal chain clearly.

**Claude Opus (12 findings):** Highest insight-per-finding density. Two standout findings:
1. The `Ctx.attach(nil)` silent fragmentation: this is genuinely subtle and requires
   understanding Erlang OTel SDK internals to recognize that nil context creates valid-looking
   but uncorrelated traces. No other model caught this.
2. The @keep_keys metadata stripping creating a diagnostic dead-end that ACTIVELY misleads
   (investigator sees log, queries by key, gets nothing, concludes event never happened).
   This is the only finding across all models that describes observability creating a
   FALSE NEGATIVE in investigation rather than just a gap.

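
The false-negative mechanism in the second standout finding can be illustrated in a few lines. This is a minimal sketch of allowlist filtering in general, not gargoyle's actual `@keep_keys` implementation: the allowlist contents, `filter_metadata`, and `query_by_key` are assumptions made for illustration.

```python
KEEP_KEYS = frozenset({"signal_id", "decision_id"})  # assumed allowlist contents


def filter_metadata(metadata, keep_keys=KEEP_KEYS):
    """Return (kept, dropped_keys); surfacing dropped_keys makes the loss countable."""
    kept = {k: v for k, v in metadata.items() if k in keep_keys}
    dropped_keys = sorted(k for k in metadata if k not in keep_keys)
    return kept, dropped_keys


def query_by_key(entries, key, value):
    """Structured query over stored log entries."""
    return [e for e in entries if e["metadata"].get(key) == value]


# A developer adds `broker_order_id` but forgets to extend the allowlist.
kept, dropped_keys = filter_metadata({"decision_id": "d-1", "broker_order_id": "b-9"})
stored = [{"message": "order acknowledged", "metadata": kept}]
hits = query_by_key(stored, "broker_order_id", "b-9")
```

The log entry is stored with its message intact, yet `hits` is empty, so a query by the new key suggests the event never happened. Returning `dropped_keys` from the filter is one way such drops could be counted or logged.
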
Opus also continued its pattern of identifying design tensions: the strategy-correctness
finding (#7) explicitly names the gap between "operational correctness" (system works) and
"financial correctness" (system works RIGHT), a fundamental architectural blind spot that
the other models only touched peripherally.

**Claude Sonnet (10 findings):** Weakest performance in this experiment. Several findings
were plausible but somewhat generic or low-specificity compared to the other models:
- "Signal ID collision" assumes a UUID weakness that isn't evidenced in the document
- "Telemetry emission timing creating phantom latencies" is theoretically possible but
  doesn't reference specific document mechanisms that would cause it
- "Aggregator timing bias" is an interesting idea but doesn't explain WHY this specific
  observability design would miss it

Sonnet's best finding was the partial-fill accumulation (no metric for systematic underfill),
which is genuinely trading-specific. But overall, it produced fewer findings, with less
document-grounding and more speculation.


## Key insight: "what's invisible" requires reasoning about negation

This task type is fundamentally about NEGATION: "given what IS instrumented, what ISN'T?"
This is harder than assumption-finding (which can work from what's stated) or race condition
analysis (which works from what's specified about concurrency). Here, the model must:
1. Build a mental model of what the observability design CAN see
2. Enumerate production scenarios (requiring domain knowledge)
3. Check each scenario against the coverage model
4. Identify scenarios that fall in the gaps

GPT-5 excelled at step 1 (thorough coverage mapping) and step 3 (systematic checking).
Opus excelled at step 2 (finding subtle scenarios like context propagation failure) and
identifying findings that ACTIVELY mislead rather than passively miss.
Sonnet struggled with step 3: some of its scenarios were valid but its explanations of
WHY the observability design specifically misses them were weaker.

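
The four steps in this section amount to a set-difference check between what each scenario needs and what the design emits. A minimal sketch, assuming signals can be named as plain strings; the signal and scenario names below are hypothetical stand-ins. The hard parts (steps 1 and 2) are the inputs, not the code.

```python
def find_blind_spots(instrumented, scenarios):
    """Map each scenario to the signals it needs but the design does not emit.

    instrumented: set of signal names the design emits (step 1).
    scenarios: dict of scenario name -> set of signals needed to detect it (step 2).
    """
    gaps = {}
    for name, needed in scenarios.items():
        missing = needed - instrumented  # step 3: check against coverage
        if missing:
            gaps[name] = missing  # step 4: the scenario falls in a gap
    return gaps


# Hypothetical coverage model and scenarios.
instrumented = {"order_submit_span", "order_submit_counter", "risk_rejection_counter"}
scenarios = {
    "broker rejects after submit": {"order_submit_counter", "broker_reject_counter"},
    "risk check rejects order": {"risk_rejection_counter"},
}
blind_spots = find_blind_spots(instrumented, scenarios)
```

Only the first scenario is returned, with `broker_reject_counter` as the missing signal. A check like this cannot find the "actively misleads" cases, where a signal exists but reports the wrong thing.
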

## Comparison to previous task types

| Task type | GPT-5 | Opus | Sonnet |
|---|---|---|---|
| Assumption-finding (#10-12) | 20-26 | 12-13 | 17 |
| Race conditions (#13) | 12 | 10 | 7 (with errors) |
| Cross-component (#14) | - | - | - |
| **Blind spot analysis** | **18** | **12** | **10** |

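In this experiment GPT-5 found ~1.5x as many blind spots as Opus (18 vs 12), and Opus ~1.2x
as many as Sonnet (12 vs 10). GPT-5 leading on raw count is consistent with every earlier
task type, though the size of the gap varies and Sonnet out-counted Opus on
assumption-finding (17 vs 12-13). Quality-per-finding continues to favor Opus for finding
the most architecturally insightful issues. GPT-5's breadth advantage is real but includes
more operational/infrastructure findings vs Opus's focus on design-level blind spots.
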
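
The count ratios can be recomputed from the comparison table. The sketch below treats each cell as a (low, high) range, so single values have equal bounds. Because the table does not say how the assumption-finding ranges pair up across findings #10-12, the bounds for that row are loose. `ratio_bounds` is a hypothetical helper, and the cross-component row is omitted because it has no Opus count.

```python
# Finding counts from the comparison table, as (low, high) ranges.
COUNTS = {
    "assumption-finding": {"GPT-5": (20, 26), "Opus": (12, 13), "Sonnet": (17, 17)},
    "race conditions": {"GPT-5": (12, 12), "Opus": (10, 10), "Sonnet": (7, 7)},
    "blind spot analysis": {"GPT-5": (18, 18), "Opus": (12, 12), "Sonnet": (10, 10)},
}


def ratio_bounds(numerator, denominator):
    """Lowest and highest possible ratio of two (low, high) count ranges."""
    return (numerator[0] / denominator[1], numerator[1] / denominator[0])


ratios = {
    task: {
        "GPT-5/Opus": ratio_bounds(c["GPT-5"], c["Opus"]),
        "Opus/Sonnet": ratio_bounds(c["Opus"], c["Sonnet"]),
    }
    for task, c in COUNTS.items()
}
```

For blind spot analysis this gives 1.5 and 1.2. For assumption-finding the Opus/Sonnet ratio is below 1, since Sonnet reported more findings than Opus there.
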

## Practical implications

1. **New analytical task for architecture review:** "What can't you see?" is a distinct
   and valuable question to ask of any observability or monitoring design. It's not
   covered by assumption-finding, gap-finding, or consistency checking.

2. **Model assignment for blind spot analysis:**
   - GPT-5: Operational blind spots (infrastructure interactions, configuration gaps,
     cross-system dependencies)
   - Opus: Design-level blind spots (false confidence, active misdirection, semantic
     gaps between what's measured and what matters)
   - Sonnet: Not recommended for this task type (insufficient document grounding)

3. **The "actively misleads" category is highest-value:** Of all findings across 3 models,
   the ones that describe observability CREATING false confidence (rather than just missing
   signal) are the most dangerous and actionable. Opus found 3 of these; GPT-5 found 2;
   Sonnet found 0. This suggests Opus should be specifically tasked with: "Where does this
   design create false confidence?"


## Updated task-model matrix

| Task | Best model(s) | Why |
|---|---|---|
| Assumption-finding | GPT-5 + Opus | Breadth + design tensions |
| Race conditions | GPT-5 + Opus | Sonnet unreliable for concurrency |
| Cross-component | GPT-5 + Sonnet | Both good; Sonnet recovers with structure |
| Cross-document consistency | Opus + GPT-5 | Opus dominates boundary reasoning |
| **Operational blind spots** | **GPT-5 + Opus** | **GPT-5 for coverage mapping; Opus for false confidence** |
| Bias detection | Any (with narrow framing) | Signal-to-noise matters more than model |


## Source

- Document: `gargoyle/docs/impl/observability.md` (563 lines)
- Models: GPT-5 (via HAI OpenAI endpoint), Claude Opus 4.6, Claude Sonnet 4.6 (via HAI Anthropic endpoint)
- No tools, no project context beyond the document itself