finding #46: operational blind spot analysis — new task type

Novel experiment testing 'what's invisible to operators' on gargoyle's observability.md (563 lines). GPT-5 (18 findings), Opus (12), Sonnet (10). Key discovery: 'actively misleads' category (observability creating false confidence) is highest-value and Opus-dominated. Distinct from assumption-finding, race conditions, or gap analysis: it requires reasoning about negation (what ISN'T instrumented vs what production needs).
2026-05-08 00:27:23 -07:00
parent 64fdfebed3
commit b5b5b64a40
1 changed files with 201 additions and 0 deletions
@@ -0,0 +1,201 @@
+# Finding #46: Operational blind spot analysis — a new task type revealing model divergence on "what's invisible"
+
+**Date:** 2026-05-08
+**Task:** Identify operational blind spots in gargoyle's `observability.md` (563 lines) —
+scenarios where the observability design would fail to surface problems, actively mislead
+operators, provide signal too late, or create diagnostic dead-ends.
+**Novel aspect:** This is a new analytical task type distinct from assumption-finding (#10-12),
+race condition identification (#13), cross-component interaction (#14), or gap-finding (#9).
+It requires reasoning about the ABSENCE of signal — what can't you see? — and how
+instrumentation choices create false confidence.
+
+## Setup
+
+The same document (full text, no truncation) and the same focused analytical question were
+sent to all 3 models via HAI proxy. The prompt specified 4 categories of blind spot (fail to
+surface, actively mislead, signal too late, diagnostic dead-ends) and required a specific
+output format (Scenario, Why invisible/misleading, Impact, Detection gap). It explicitly
+asked for trading-specific scenarios and references to specific mechanisms in the document.
+
+| Model | Time | Output tokens | Reasoning tokens | Blind spots found |
+|---|---|---|---|---|
+| GPT-5 | ~87s | 7,969 | 5,056 | 18 |
+| Claude Opus 4.6 | ~120s | 4,591 | (internal) | 12 |
+| Claude Sonnet 4.6 | ~38s | 1,930 | (internal) | 10 |
+
+## What they found — common ground (all 3 identified)
+
+- Order state post-submission is invisible (traces terminate at `order_manager.submit`;
+  no metrics for fills/cancels/rejects from broker)
+- Position reconciliation gap (no metric for internal-vs-broker position drift)
+- Market data staleness per-symbol undetectable at scale (cardinality exclusion of `symbol`
+  from `quote_feed.tick` metadata means per-symbol staleness invisible in Prometheus)
+- Risk controls only measure rejections, not false approvals (stale data causing
+  incorrect approvals is invisible)
+- LogStore GenServer mailbox/write failure silent due to async cast pattern
+- Aggregator signal buffering without emission has no observability
+
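The per-symbol staleness item can be illustrated with a minimal Python sketch. It is not gargoyle code: the symbols, timestamps and 5-second threshold are invented. It shows why a feed-wide "age of last tick" value stays fresh as long as any symbol is ticking once the `symbol` label is excluded.

```python
# Hypothetical illustration: not gargoyle code. Symbols, timestamps and the
# 5-second threshold are invented for the example.

STALE_AFTER_S = 5.0

def global_tick_age(last_tick_at: dict[str, float], now: float) -> float:
    """Age of the most recent tick across ALL symbols (what a label-free metric sees)."""
    return now - max(last_tick_at.values())

def stale_symbols(last_tick_at: dict[str, float], now: float) -> list[str]:
    """Symbols whose own last tick is older than the threshold."""
    return sorted(s for s, t in last_tick_at.items() if now - t > STALE_AFTER_S)

now = 1_000.0
last_tick_at = {"AAPL": 999.8, "MSFT": 999.5, "TSLA": 940.0}  # TSLA silent for 60 s

feed_looks_healthy = global_tick_age(last_tick_at, now) <= STALE_AFTER_S
hidden = stale_symbols(last_tick_at, now)
print(feed_looks_healthy, hidden)  # True ['TSLA']
```

The aggregate reports a healthy feed while one symbol has been silent for a minute. Exporting the count of stale symbols instead of a per-symbol series would be one bounded-cardinality way to surface this, though that is a suggestion of this sketch and not something the reviewed document specifies.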
+## GPT-5 unique findings (not in either Claude model)
+
+- **Sampling + span links break end-to-end traceability**: When signal evaluation sampling
+  is reduced, Aggregator span links reference unsampled trace IDs → Tempo shows broken
+  links, operators underestimate true latency
+- **Webhook correlation dead-end**: `opentelemetry_phoenix` "does not capture HTTP headers";
+  inbound broker webhooks have no body-parsed attributes linking back to decision_id
+- **15s Prometheus scrape + 10s VM poller too coarse for trading incidents**: 5-10 second
+  broker hiccups vanish between scrapes; micro-incidents permanently invisible
+- **Price integrity/slippage invisible**: Span attribute allow-list forbids `price_target`;
+  no metric for filled_avg_price vs intended → systematic slippage undetectable
+- **Duplicate decisions undetectable from current labels**: Counter tags (ticker/action/strategy)
+  can't distinguish legitimate trades from duplicates; decision_id isn't a Prometheus label
+- **End-to-end reconstruction depends on both correlation keys always present**: If a code
+  path omits signal_id or decision_id, the three-way join breaks silently
+- **Live logs UI shares Phoenix endpoint**: Under load, trading continues but LogsLive
+  becomes unreachable — operators lose situational awareness exactly when needed
+- **Per-symbol staleness metric contradicts cardinality rules**: `feed.stale` with ticker
+  tag will be removed in practice, leaving only global signals
+- **Metric/trace/log pillars can disagree without alerting**: Sampling/exporter failures
+  make metrics look green while traces vanish; no cross-pillar consistency check
+- **Broker rate-limiting/HTTP failures at call-site**: `req` not in deps means no outgoing
+  HTTP instrumentation; retries/429s produce no trace or metric
+- **Tempo/OTLP exporter outage**: No metric for exporter drop counts; operators see "no
+  traces" and may assume "nothing happening" vs "monitoring broken"
+- **Order submission counter "green" while broker later rejects**: Post-submit broker-side
+  rejections have no counter or lifecycle span
+
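The scrape-granularity finding can be sketched as a small simulation. Only the 15 s scrape interval comes from the finding; the incident window and timeline are invented, and the health check stands in for any gauge-style value sampled at scrape time.

```python
# Hypothetical illustration: the incident window and timeline are invented;
# only the 15 s scrape interval comes from the finding above.

SCRAPE_INTERVAL_S = 15

def broker_healthy(t: int, incident: range) -> bool:
    """Instantaneous health as a gauge would report it at second t."""
    return t not in incident

def scraped_samples(duration_s: int, incident: range) -> list[bool]:
    """What a scraper polling every SCRAPE_INTERVAL_S seconds observes."""
    return [broker_healthy(t, incident)
            for t in range(0, duration_s + 1, SCRAPE_INTERVAL_S)]

incident = range(16, 24)  # 8-second broker hiccup that falls between two scrapes
samples = scraped_samples(60, incident)
unhealthy_seconds = sum(1 for t in range(61) if not broker_healthy(t, incident))

print(samples)            # [True, True, True, True, True]
print(unhealthy_seconds)  # 8
```

This applies to sampled, gauge-style values. A cumulative counter incremented on every failure would still show the increase at the next scrape, so whether the incident is truly invisible depends on which kind of metric exists for it.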
+## Claude Opus unique findings (not in either other model)
+
+- **Aggregator swallowing signals without emission**: Signals entering the buffer that
+  never trigger aggregation produce ZERO observability signal — no error, no metric, no
+  timeout event. Distinguished from the common "aggregator gap" finding by reasoning about
+  the specific mechanism: the `aggregator.aggregate` span only fires on aggregation, not
+  on receipt. Buffered-and-forgotten signals are completely invisible.
+- **OTel context propagation failure is silent**: `OpenTelemetry.Ctx.attach(nil)` returns
+  a no-op token; `with_span` creates a ROOT span (different trace_id). Traces look
+  structurally valid but are fragmented. No validation exists for context propagation success.
+- **Strategy logic bug with correct observability**: All pillars report success when a
+  strategy emits the wrong signal (inverted buy/sell). No metric for financial correctness
+  (P&L, win rate, signal-vs-price-movement) — only operational correctness.
+- **Decision latency histogram is survivor-biased**: Only records COMPLETED journeys.
+  Stuck decisions never appear. Histogram shows healthy p99 while decisions are lost.
+  Distinguished from generic "mailbox" finding by identifying the specific statistical
+  bias in the metric design.
+- **Broker WebSocket reconnect succeeds but misses fill events during gap**: Reconnect
+  counter says "recovered"; fills during disconnect are permanently lost. Orders stay
+  in "submitted" state forever with no timeout metric.
+- **@keep_keys allowlist silently drops new metadata**: Developer adds critical metadata
+  key, forgets to update allowlist. Logs appear in viewer (message intact) but structured
+  queries return nothing. Diagnostic dead-end that ACTIVELY misleads.
+
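The survivor-bias finding can be shown in a minimal sketch: a histogram fed only by completed journeys reports a healthy p99 even when a tenth of decisions never finish. The latencies, batch size and stuck fraction below are invented for illustration.

```python
# Hypothetical illustration: latencies, the 100-decision batch and the
# stuck fraction are invented for the example.
import math

def p99(values: list[float]) -> float:
    """Nearest-rank 99th percentile."""
    ordered = sorted(values)
    rank = math.ceil(0.99 * len(ordered))
    return ordered[rank - 1]

started = 100
# 90 decisions complete in 20..109 ms; 10 never complete (None).
latencies_ms = [20.0 + i for i in range(90)] + [None] * 10

# Only completions ever reach the histogram.
recorded = [v for v in latencies_ms if v is not None]
completion_ratio = len(recorded) / started

print(p99(recorded))      # 109.0, looks healthy
print(completion_ratio)   # 0.9, the signal the histogram cannot show
```

Pairing the histogram with started-versus-completed counters, or a gauge of in-flight decisions, is one way such a gap could be exposed. That pairing is an assumption of this sketch, not a mechanism described in the reviewed document.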
+## Claude Sonnet findings (unique aspects)
+
+- **Partial fill accumulation without detection**: Systematic 90% fills create gradual
+  position drift — no metric for unfilled quantity accumulation
+- **Aggregator timing bias creating systematic directional bias**: Varying processing
+  latencies cause buy signals to consistently arrive before sells
+- **Signal ID collision corrupting audit trail**: UUID weakness creating impossible
+  timelines in correlation queries
+- **Telemetry emission timing creating phantom latencies**: Mailbox congestion in
+  telemetry handlers makes latency metrics unreliable (showing delays that don't
+  reflect actual performance)
+
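The partial-fill finding comes down to simple arithmetic, sketched below with invented order sizes and a fixed 90% fill rate. Each order looks acceptable on its own while the unfilled remainder accumulates.

```python
# Hypothetical illustration: order sizes and the 90% fill rate are invented.

def position_drift(order_qtys: list[int], fill_pct: int) -> tuple[int, int, int]:
    """Return (intended, filled, unfilled) share totals after all orders."""
    intended = sum(order_qtys)
    filled = sum(q * fill_pct // 100 for q in order_qtys)  # integer shares per order
    return intended, filled, intended - filled

intended, filled, unfilled = position_drift([100] * 50, fill_pct=90)
print(intended, filled, unfilled)  # 5000 4500 500
```

After 50 orders the position is 500 shares short of what the strategy intended. With no metric for cumulative unfilled quantity, nothing records that shortfall.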
+## Quality assessment
+
+**GPT-5 (18 findings):** Most exhaustive as usual. Strong on operational/infrastructure-level
+blind spots (scrape granularity, exporter health, HTTP instrumentation gaps, cross-pillar
+disagreement). Several findings showed careful reading of the document's specific configuration
+choices (15s scrape, @keep_keys, span attribute allow-list). However, some findings were
+variations on the same theme (multiple findings about post-submission order lifecycle gaps).
+Every finding referenced specific document mechanisms and explained the causal chain clearly.
+
+**Claude Opus (12 findings):** Highest insight-per-finding density. Two standout findings:
+1. The `Ctx.attach(nil)` silent fragmentation — this is genuinely subtle and requires
+   understanding Erlang OTel SDK internals to recognize that nil context creates valid-looking
+   but uncorrelated traces. No other model caught this.
+2. The @keep_keys metadata stripping creating a diagnostic dead-end that ACTIVELY misleads
+   (investigator sees log, queries by key, gets nothing, concludes event never happened).
+   This is the only finding across all models that describes observability creating a
+   FALSE NEGATIVE in investigation rather than just a gap.
+
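The `@keep_keys` false negative can be reproduced in a few lines. This is a Python sketch, not the project's Elixir code. The key names and the log entry are invented, and only the allowlist idea comes from the finding.

```python
# Hypothetical illustration: key names and log entries are invented; only the
# allowlist idea (`@keep_keys`) comes from the finding above.

KEEP_KEYS = {"signal_id", "decision_id"}

def store(message: str, metadata: dict[str, str]) -> dict:
    """Persist a log entry, keeping only allowlisted metadata keys."""
    kept = {k: v for k, v in metadata.items() if k in KEEP_KEYS}
    return {"message": message, "metadata": kept}

def query(entries: list[dict], key: str, value: str) -> list[dict]:
    """Structured lookup by metadata key, as an investigator would run it."""
    return [e for e in entries if e["metadata"].get(key) == value]

# A developer adds `broker_order_id` but forgets to update the allowlist.
entries = [store("order rejected by broker",
                 {"decision_id": "d-1", "broker_order_id": "b-42"})]

by_text = [e for e in entries if "rejected" in e["message"]]
by_key = query(entries, "broker_order_id", "b-42")
print(len(by_text), len(by_key))  # 1 0
```

The message is visible in the viewer, yet the structured query returns nothing. An empty result here is indistinguishable from "the event never happened", which is what makes the gap misleading and not merely incomplete.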
+Opus also continued its pattern of identifying design tensions: the strategy-correctness
+finding (#7) explicitly names the gap between "operational correctness" (system works) and
+"financial correctness" (system works RIGHT) — a fundamental architectural blind spot that
+the other models only touched peripherally.
+
+**Claude Sonnet (10 findings):** Weakest performance in this experiment. Several findings
+were plausible but somewhat generic or low-specificity compared to the other models:
+- "Signal ID collision" assumes a UUID weakness that isn't evidenced in the document
+- "Telemetry emission timing creating phantom latencies" is theoretically possible but
+  doesn't reference specific document mechanisms that would cause it
+- "Aggregator timing bias" is an interesting idea but doesn't explain WHY this specific
+  observability design would miss it
+
+Sonnet's best finding was the partial-fill accumulation (no metric for systematic underfill),
+which is genuinely trading-specific. But overall, it produced fewer findings, with less
+document-grounding and more speculation.
+
+## Key insight — "what's invisible" requires reasoning about negation
+
+This task type is fundamentally about NEGATION: "given what IS instrumented, what ISN'T?"
+This is harder than assumption-finding (which can work from what's stated) or race condition
+analysis (which works from what's specified about concurrency). Here, the model must:
+1. Build a mental model of what the observability design CAN see
+2. Enumerate production scenarios (requiring domain knowledge)
+3. Check each scenario against the coverage model
+4. Identify scenarios that fall in the gaps
+
+GPT-5 excelled at step 1 (thorough coverage mapping) and step 3 (systematic checking).
+Opus excelled at step 2 (finding subtle scenarios like context propagation failure) and
+identifying findings that ACTIVELY mislead rather than passively miss.
+Sonnet struggled with step 3 — some of its scenarios were valid but its explanations of
+WHY the observability design specifically misses them were weaker.
+
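The four-step procedure above can be sketched as a set difference between what a scenario needs to be visible and what the design can observe. The event names and scenarios are simplified stand-ins loosely based on findings in this write-up. They are hypothetical and not the reviewed document's schema.

```python
# Hypothetical illustration: the coverage set and scenarios are simplified
# stand-ins, not the reviewed document's actual event names.

# Step 1: model what the design CAN see (events that produce some signal).
observable = {"signal.evaluated", "aggregator.aggregate",
              "order.submitted", "risk.rejected"}

# Step 2: enumerate production scenarios and the events needed to detect each.
scenarios = {
    "risk check rejects an order": {"risk.rejected"},
    "broker rejects after submit": {"order.submitted", "broker.rejected"},
    "signal buffered, never aggregated": {"aggregator.buffered", "aggregator.timeout"},
}

def blind_spots(observable: set[str],
                scenarios: dict[str, set[str]]) -> dict[str, set[str]]:
    """Steps 3 and 4: check each scenario against coverage and keep the gaps."""
    return {name: needs - observable
            for name, needs in scenarios.items()
            if needs - observable}

gaps = blind_spots(observable, scenarios)
for name, missing in gaps.items():
    print(f"{name}: missing {sorted(missing)}")
```

The mechanical part (steps 3 and 4) is trivial once the two inputs exist. The hard parts are building an accurate coverage model and thinking of the right scenarios, which is where the models diverged.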
+## Comparison to previous task types
+
+| Task type | GPT-5 | Opus | Sonnet |
+|---|---|---|---|
+| Assumption-finding (#10-12) | 20-26 | 12-13 | 17 |
+| Race conditions (#13) | 12 | 10 | 7 (with errors) |
+| Cross-component (#14) | 10 | — | 8 |
+| **Blind spot analysis** | **18** | **12** | **10** |
+
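+In this experiment GPT-5 found ~1.5x as many blind spots as Opus, and Opus ~1.2x as many as
+Sonnet. GPT-5 leads on raw count in every task type, but the ratios are not uniform: in
+assumption-finding Sonnet (17) outnumbered Opus (12-13), and in race conditions Opus led
+Sonnet by ~1.4x. Quality-per-finding continues to favor Opus for the most architecturally
+insightful issues. GPT-5's breadth advantage is real but includes more
+operational/infrastructure findings vs Opus's focus on design-level blind spots.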
+
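The count ratios can be recomputed directly from the table. In the sketch below, ranges are replaced by their midpoints, an assumption made only for this calculation, and the cross-component row is left out because Opus has no entry there.

```python
# Counts copied from the table above. Ranges (20-26, 12-13) are replaced by
# their midpoints, which is an assumption made only for this calculation.
counts = {
    "assumption-finding": {"gpt5": 23.0, "opus": 12.5, "sonnet": 17.0},
    "race conditions": {"gpt5": 12.0, "opus": 10.0, "sonnet": 7.0},
    "blind spots": {"gpt5": 18.0, "opus": 12.0, "sonnet": 10.0},
}

def ratios(c: dict[str, float]) -> tuple[float, float]:
    """Return (GPT-5 / Opus, Opus / Sonnet), rounded to two decimals."""
    return round(c["gpt5"] / c["opus"], 2), round(c["opus"] / c["sonnet"], 2)

table = {task: ratios(c) for task, c in counts.items()}
for task, (g_o, o_s) in table.items():
    print(f"{task}: GPT-5/Opus = {g_o}, Opus/Sonnet = {o_s}")
```

These are ratios of finding counts only. They say nothing about overlap between models or about the quality of individual findings.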
+## Practical implications
+
+1. **New analytical task for architecture review:** "What can't you see?" is a distinct
+   and valuable question to ask of any observability or monitoring design. It's not
+   covered by assumption-finding, gap-finding, or consistency checking.
+
+2. **Model assignment for blind spot analysis:**
+   - GPT-5: Operational blind spots (infrastructure interactions, configuration gaps,
+     cross-system dependencies)
+   - Opus: Design-level blind spots (false confidence, active misdirection, semantic
+     gaps between what's measured and what matters)
+   - Sonnet: Not recommended for this task type — insufficient document grounding
+
+3. **The "actively misleads" category is highest-value:** Of all findings across 3 models,
+   the ones that describe observability CREATING false confidence (rather than just missing
+   signal) are the most dangerous and actionable. Opus found 3 of these; GPT-5 found 2;
+   Sonnet found 0. This suggests Opus should be specifically tasked with: "Where does this
+   design create false confidence?"
+
+## Updated task-model matrix
+
+| Task | Best model(s) | Why |
+|---|---|---|
+| Assumption-finding | GPT-5 + Opus | Breadth + design tensions |
+| Race conditions | GPT-5 + Opus | Sonnet unreliable for concurrency |
+| Cross-component | GPT-5 + Sonnet | Both good; Sonnet recovers with structure |
+| Cross-document consistency | Opus + GPT-5 | Opus dominates boundary reasoning |
+| **Operational blind spots** | **GPT-5 + Opus** | **GPT-5 for coverage mapping; Opus for false confidence** |
+| Bias detection | Any (with narrow framing) | Signal-to-noise matters more than model |
+
+## Source
+
+- Document: `gargoyle/docs/impl/observability.md` (563 lines)
+- Models: GPT-5 (via HAI OpenAI endpoint), Claude Opus 4.6, Claude Sonnet 4.6 (via HAI Anthropic endpoint)
+- No tools, no project context beyond the document itself