From b5b5b64a40dbd19b3cf819e4b527b662760f3063 Mon Sep 17 00:00:00 2001
From: claw
Date: Fri, 8 May 2026 00:27:23 -0700
Subject: [PATCH] =?UTF-8?q?finding=20#46:=20operational=20blind=20spot=20a?=
 =?UTF-8?q?nalysis=20=E2=80=94=20new=20task=20type?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Novel experiment testing 'what's invisible to operators' on gargoyle's
observability.md (563 lines). GPT-5 (18 findings), Opus (12), Sonnet (10).
Key discovery: 'actively misleads' category (observability creating false
confidence) is highest-value and Opus-dominated. Distinct from assumption-
finding, race conditions, or gap analysis: it requires reasoning about
negation (what ISN'T instrumented vs what production needs).
---
 ...ional-blind-spot-analysis-observability.md | 201 ++++++++++++++++++
 1 file changed, 201 insertions(+)
 create mode 100644 findings/2026-05-08-46-operational-blind-spot-analysis-observability.md

diff --git a/findings/2026-05-08-46-operational-blind-spot-analysis-observability.md b/findings/2026-05-08-46-operational-blind-spot-analysis-observability.md
new file mode 100644
index 0000000..ccf004b
--- /dev/null
+++ b/findings/2026-05-08-46-operational-blind-spot-analysis-observability.md
@@ -0,0 +1,201 @@
+# Finding #46: Operational blind spot analysis, a new task type revealing model divergence on "what's invisible"
+
+**Date:** 2026-05-08
+**Task:** Identify operational blind spots in gargoyle's `observability.md` (563 lines):
+scenarios where the observability design would fail to surface problems, actively mislead
+operators, provide signal too late, or create diagnostic dead-ends.
+**Novel aspect:** This is a new analytical task type distinct from assumption-finding (#10-12),
+race condition identification (#13), cross-component interaction (#14), or gap-finding (#9).
+It requires reasoning about the ABSENCE of signal (what can't you see?) and how
+instrumentation choices create false confidence.
+
+## Setup
+
+Same document (full text, no truncation) + same focused analytical question to all 3 models
+via HAI proxy. Prompt specified 4 categories of blind spot (fail to surface, actively mislead,
+signal too late, diagnostic dead-ends) and required specific output format (Scenario, Why
+invisible/misleading, Impact, Detection gap). Explicitly asked for trading-specific scenarios
+and references to specific mechanisms in the document.
+
+| Model | Time | Output tokens | Reasoning tokens | Blind spots found |
+|---|---|---|---|---|
+| GPT-5 | ~87s | 7,969 | 5,056 | 18 |
+| Claude Opus 4.6 | ~120s | 4,591 | (internal) | 12 |
+| Claude Sonnet 4.6 | ~38s | 1,930 | (internal) | 10 |
+
+## What they found — common ground (all 3 identified)
+
+- Order state post-submission is invisible (traces terminate at `order_manager.submit`;
+  no metrics for fills/cancels/rejects from broker)
+- Position reconciliation gap (no metric for internal-vs-broker position drift)
+- Market data staleness per-symbol undetectable at scale (cardinality exclusion of `symbol`
+  from `quote_feed.tick` metadata means per-symbol staleness invisible in Prometheus)
+- Risk controls only measure rejections, not false approvals (stale data causing
+  incorrect approvals is invisible)
+- LogStore GenServer mailbox/write failure silent due to async cast pattern
+- Aggregator signal buffering without emission has no observability
+
+## GPT-5 unique findings (not in either Claude model)
+
+- **Sampling + span links break end-to-end traceability**: When signal evaluation sampling
+  is reduced, Aggregator span links reference unsampled trace IDs → Tempo shows broken
+  links, operators underestimate true latency
+- **Webhook correlation dead-end**: `opentelemetry_phoenix` "does not capture HTTP headers";
+  inbound broker webhooks have no body-parsed attributes linking back to decision_id
+- **15s Prometheus scrape + 10s VM poller too coarse for trading incidents**: 5-10 second
+  broker hiccups vanish between scrapes; micro-incidents permanently invisible
+- **Price integrity/slippage invisible**: Span attribute allow-list forbids `price_target`;
+  no metric for filled_avg_price vs intended → systematic slippage undetectable
+- **Duplicate decisions undetectable from current labels**: Counter tags (ticker/action/strategy)
+  can't distinguish legitimate trades from duplicates; decision_id isn't a Prometheus label
+- **End-to-end reconstruction depends on both correlation keys always present**: If a code
+  path omits signal_id or decision_id, the three-way join breaks silently
+- **Live logs UI shares Phoenix endpoint**: Under load, trading continues but LogsLive
+  becomes unreachable, so operators lose situational awareness exactly when needed
+- **Per-symbol staleness metric contradicts cardinality rules**: `feed.stale` with ticker
+  tag will be removed in practice, leaving only global signals
+- **Metric/trace/log pillars can disagree without alerting**: Sampling/exporter failures
+  make metrics look green while traces vanish; no cross-pillar consistency check
+- **Broker rate-limiting/HTTP failures at call-site**: `req` not in deps means no outgoing
+  HTTP instrumentation; retries/429s produce no trace or metric
+- **Tempo/OTLP exporter outage**: No metric for exporter drop counts; operators see "no
+  traces" and may assume "nothing happening" vs "monitoring broken"
+- **Order submission counter "green" while broker later rejects**: Post-submit broker-side
+  rejections have no counter or lifecycle span
+
+## Claude Opus unique findings (not in either other model)
+
+- **Aggregator swallowing signals without emission**: Signals entering the buffer that
+  never trigger aggregation produce ZERO observability signal: no error, no metric, no
+  timeout event. Distinguished from the common "aggregator gap" finding by reasoning about
+  the specific mechanism: the `aggregator.aggregate` span only fires on aggregation, not
+  on receipt. Buffered-and-forgotten signals are completely invisible.
+- **OTel context propagation failure is silent**: `OpenTelemetry.Ctx.attach(nil)` returns
+  a no-op token; `with_span` creates a ROOT span (different trace_id). Traces look
+  structurally valid but are fragmented. No validation exists for context propagation success.
+- **Strategy logic bug with correct observability**: All pillars report success when a
+  strategy emits the wrong signal (inverted buy/sell). No metric for financial correctness
+  (P&L, win rate, signal-vs-price-movement), only operational correctness.
+- **Decision latency histogram is survivor-biased**: Only records COMPLETED journeys.
+  Stuck decisions never appear. Histogram shows healthy p99 while decisions are lost.
+  Distinguished from generic "mailbox" finding by identifying the specific statistical
+  bias in the metric design.
+- **Broker WebSocket reconnect succeeds but misses fill events during gap**: Reconnect
+  counter says "recovered"; fills during disconnect are permanently lost. Orders stay
+  in "submitted" state forever with no timeout metric.
+- **@keep_keys allowlist silently drops new metadata**: Developer adds critical metadata
+  key, forgets to update allowlist. Logs appear in viewer (message intact) but structured
+  queries return nothing. Diagnostic dead-end that ACTIVELY misleads.
+
+## Claude Sonnet findings (unique aspects)
+
+- **Partial fill accumulation without detection**: Systematic 90% fills create gradual
+  position drift; there is no metric for unfilled quantity accumulation
+- **Aggregator timing bias creating systematic directional bias**: Varying processing
+  latencies cause buy signals to consistently arrive before sells
+- **Signal ID collision corrupting audit trail**: UUID weakness creating impossible
+  timelines in correlation queries
+- **Telemetry emission timing creating phantom latencies**: Mailbox congestion in
+  telemetry handlers makes latency metrics unreliable (showing delays that don't
+  reflect actual performance)
+
+## Quality assessment
+
+**GPT-5 (18 findings):** Most exhaustive as usual. Strong on operational/infrastructure-level
+blind spots (scrape granularity, exporter health, HTTP instrumentation gaps, cross-pillar
+disagreement). Several findings showed careful reading of the document's specific configuration
+choices (15s scrape, @keep_keys, span attribute allow-list). However, some findings were
+variations on the same theme (multiple findings about post-submission order lifecycle gaps).
+Every finding referenced specific document mechanisms and explained the causal chain clearly.
+
+**Claude Opus (12 findings):** Highest insight-per-finding density. Two standout findings:
+1. The `Ctx.attach(nil)` silent fragmentation: this is genuinely subtle and requires
+   understanding Erlang OTel SDK internals to recognize that nil context creates valid-looking
+   but uncorrelated traces. No other model caught this.
+2. The @keep_keys metadata stripping creating a diagnostic dead-end that ACTIVELY misleads
+   (investigator sees log, queries by key, gets nothing, concludes event never happened).
+   This is the only finding across all models that describes observability creating a
+   FALSE NEGATIVE in investigation rather than just a gap.
+
+Opus also continued its pattern of identifying design tensions: the strategy-correctness
+finding (#7) explicitly names the gap between "operational correctness" (system works) and
+"financial correctness" (system works RIGHT), a fundamental architectural blind spot that
+the other models only touched peripherally.
+
+**Claude Sonnet (10 findings):** Weakest performance in this experiment. Several findings
+were plausible but somewhat generic or low-specificity compared to the other models:
+- "Signal ID collision" assumes a UUID weakness that isn't evidenced in the document
+- "Telemetry emission timing creating phantom latencies" is theoretically possible but
+  doesn't reference specific document mechanisms that would cause it
+- "Aggregator timing bias" is an interesting idea but doesn't explain WHY this specific
+  observability design would miss it
+
+Sonnet's best finding was the partial-fill accumulation (no metric for systematic underfill),
+which is genuinely trading-specific. But overall, it produced fewer findings, with less
+document-grounding and more speculation.
+
+## Key insight: "what's invisible" requires reasoning about negation
+
+This task type is fundamentally about NEGATION: "given what IS instrumented, what ISN'T?"
+This is harder than assumption-finding (which can work from what's stated) or race condition
+analysis (which works from what's specified about concurrency). Here, the model must:
+1. Build a mental model of what the observability design CAN see
+2. Enumerate production scenarios (requiring domain knowledge)
+3. Check each scenario against the coverage model
+4. Identify scenarios that fall in the gaps
+
+GPT-5 excelled at step 1 (thorough coverage mapping) and step 3 (systematic checking).
+Opus excelled at step 2 (finding subtle scenarios like context propagation failure) and
+identifying findings that ACTIVELY mislead rather than passively miss.
+Sonnet struggled with step 3: some of its scenarios were valid but its explanations of
+WHY the observability design specifically misses them were weaker.
+
+## Comparison to previous task types
+
+| Task type | GPT-5 | Opus | Sonnet |
+|---|---|---|---|
+| Assumption-finding (#10-12) | 20-26 | 12-13 | 17 |
+| Race conditions (#13) | 12 | 10 | 7 (with errors) |
+| Cross-component (#14) | 10 | n/a | 8 |
+| **Blind spot analysis** | **18** | **12** | **10** |
+
+On this task GPT-5 found ~1.5x as many blind spots as Opus, and Opus ~1.2x as many as Sonnet.
+GPT-5 leads every row on count, but Sonnet out-counted Opus on assumption-finding (17 vs 12-13).
+Quality-per-finding continues to favor Opus for the most architecturally insightful issues;
+GPT-5's breadth skews toward operational/infrastructure findings vs Opus's design-level focus.
+
+## Practical implications
+
+1. **New analytical task for architecture review:** "What can't you see?" is a distinct
+   and valuable question to ask of any observability or monitoring design. It's not
+   covered by assumption-finding, gap-finding, or consistency checking.
+
+2. **Model assignment for blind spot analysis:**
+   - GPT-5: Operational blind spots (infrastructure interactions, configuration gaps,
+     cross-system dependencies)
+   - Opus: Design-level blind spots (false confidence, active misdirection, semantic
+     gaps between what's measured and what matters)
+   - Sonnet: Not recommended for this task type (insufficient document grounding)
+
+3. **The "actively misleads" category is highest-value:** Of all findings across 3 models,
+   the ones that describe observability CREATING false confidence (rather than just missing
+   signal) are the most dangerous and actionable. Opus found 3 of these; GPT-5 found 2;
+   Sonnet found 0. This suggests Opus should be specifically tasked with: "Where does this
+   design create false confidence?"
+
+## Updated task-model matrix
+
+| Task | Best model(s) | Why |
+|---|---|---|
+| Assumption-finding | GPT-5 + Opus | Breadth + design tensions |
+| Race conditions | GPT-5 + Opus | Sonnet unreliable for concurrency |
+| Cross-component | GPT-5 + Sonnet | Both good; Sonnet recovers with structure |
+| Cross-document consistency | Opus + GPT-5 | Opus dominates boundary reasoning |
+| **Operational blind spots** | **GPT-5 + Opus** | **GPT-5 for coverage mapping; Opus for false confidence** |
+| Bias detection | Any (with narrow framing) | Signal-to-noise matters more than model |
+
+## Source
+
+- Document: `gargoyle/docs/impl/observability.md` (563 lines)
+- Models: GPT-5 (via HAI OpenAI endpoint), Claude Opus 4.6, Claude Sonnet 4.6 (via HAI Anthropic endpoint)
+- No tools, no project context beyond the document itself
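
Note on the "survivor-biased" latency finding in the Opus list: the mechanism is easy to miss in prose, so here is a minimal Python sketch of it. Everything in it is hypothetical (the decision IDs, the latencies, the `p99` helper); it is not taken from gargoyle or from `observability.md`. It only illustrates how a histogram that records completed journeys can report a healthy p99 while some decisions never finish. The started-versus-completed comparison at the end is an assumed countermeasure, not something the reviewed document specifies.

```python
import math


def p99(samples):
    """Nearest-rank 99th percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = math.ceil(0.99 * len(ordered))
    return ordered[rank - 1]


# Hypothetical data: (decision_id, latency_ms), with None meaning "never completed".
decisions = [(f"d{i}", 40 + (i % 10)) for i in range(95)]  # completed in 40-49 ms
decisions += [(f"stuck{i}", None) for i in range(5)]       # stuck forever

# A completion-only histogram sees just the survivors; stuck decisions add nothing.
recorded = [latency for _, latency in decisions if latency is not None]

# Assumed countermeasure: count starts separately, so the gap becomes visible.
started = len(decisions)
completed = len(recorded)
stuck_ratio = (started - completed) / started

print(p99(recorded))  # 49 -> the latency panel looks healthy
print(stuck_ratio)    # 0.05 -> yet 5% of decisions never completed
```

The point of the sketch is that no percentile of `recorded` can reveal the stuck decisions; only a signal emitted at journey start (or a timeout event) exposes them.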