claw b5b5b64a40 finding #46 : operational blind spot analysis — new task type

Novel experiment testing 'what's invisible to operators' on gargoyle's
observability.md (563 lines). GPT-5 (18 findings), Opus (12), Sonnet (10).

Key discovery: 'actively misleads' category (observability creating false
confidence) is highest-value and Opus-dominated. Distinct from assumption-
finding, race conditions, or gap analysis — requires reasoning about
negation (what ISN'T instrumented vs what production needs).

2026-05-08 00:27:23 -07:00


Finding #46: Operational blind spot analysis — a new task type revealing model divergence on "what's invisible"

Date: 2026-05-08

Task: Identify operational blind spots in gargoyle's observability.md (563 lines): scenarios where the observability design would fail to surface problems, actively mislead operators, provide signal too late, or create diagnostic dead-ends.

Novel aspect: This is a new analytical task type distinct from assumption-finding (#10-12), race condition identification (#13), cross-component interaction (#14), or gap-finding (#9). It requires reasoning about the ABSENCE of signal (what can't you see?) and how instrumentation choices create false confidence.

Setup

Same document (full text, no truncation) + same focused analytical question to all 3 models via HAI proxy. Prompt specified 4 categories of blind spot (fail to surface, actively mislead, signal too late, diagnostic dead-ends) and required specific output format (Scenario, Why invisible/misleading, Impact, Detection gap). Explicitly asked for trading-specific scenarios and references to specific mechanisms in the document.

Model	Time	Output tokens	Reasoning tokens	Blind spots found
GPT-5	~87s	7,969	5,056	18
Claude Opus 4.6	~120s	4,591	(internal)	12
Claude Sonnet 4.6	~38s	1,930	(internal)	10

What they found — common ground (all 3 identified)

Order state post-submission is invisible (traces terminate at order_manager.submit; no metrics for fills/cancels/rejects from broker)
Position reconciliation gap (no metric for internal-vs-broker position drift)
Market data staleness per-symbol undetectable at scale (cardinality exclusion of symbol from quote_feed.tick metadata means per-symbol staleness invisible in Prometheus)
Risk controls only measure rejections, not false approvals (stale data causing incorrect approvals is invisible)
LogStore GenServer mailbox/write failure silent due to async cast pattern
Aggregator signal buffering without emission has no observability
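
The per-symbol staleness finding above can be illustrated with a small sketch. It assumes, for illustration only, that once the symbol label is excluded the only remaining signal is the age of the most recent tick across all symbols; the function names, symbols, timestamps, and the 5-second threshold are all hypothetical and not taken from observability.md.

```python
def global_staleness(last_tick, now):
    """Age of the newest tick across all symbols (the view left without a symbol label)."""
    return now - max(last_tick.values())


def per_symbol_staleness(last_tick, now):
    """Age of the newest tick for each individual symbol."""
    return {symbol: now - ts for symbol, ts in last_tick.items()}


# Hypothetical last-tick timestamps in seconds; the symbols are made up.
last_tick = {"AAA": 99.5, "BBB": 99.8, "CCC": 40.0}
now = 100.0

aggregate_age = global_staleness(last_tick, now)
symbol_ages = per_symbol_staleness(last_tick, now)

# Hypothetical threshold: a symbol is stale after 5 seconds without a tick.
stale = sorted(s for s, age in symbol_ages.items() if age > 5.0)
```

Here the aggregate age stays well under a second because two symbols keep ticking, while one symbol has been silent for a full minute.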

GPT-5 unique findings (not in either Claude model)

Sampling + span links break end-to-end traceability: When signal evaluation sampling is reduced, Aggregator span links reference unsampled trace IDs → Tempo shows broken links, operators underestimate true latency
Webhook correlation dead-end: opentelemetry_phoenix "does not capture HTTP headers"; inbound broker webhooks have no body-parsed attributes linking back to decision_id
15s Prometheus scrape + 10s VM poller too coarse for trading incidents: 5-10 second broker hiccups vanish between scrapes; micro-incidents permanently invisible
Price integrity/slippage invisible: Span attribute allow-list forbids price_target; no metric for filled_avg_price vs intended → systematic slippage undetectable
Duplicate decisions undetectable from current labels: Counter tags (ticker/action/strategy) can't distinguish legitimate trades from duplicates; decision_id isn't a Prometheus label
End-to-end reconstruction depends on both correlation keys always present: If a code path omits signal_id or decision_id, the three-way join breaks silently
Live logs UI shares Phoenix endpoint: Under load, trading continues but LogsLive becomes unreachable — operators lose situational awareness exactly when needed
Per-symbol staleness metric contradicts cardinality rules: feed.stale with ticker tag will be removed in practice, leaving only global signals
Metric/trace/log pillars can disagree without alerting: Sampling/exporter failures make metrics look green while traces vanish; no cross-pillar consistency check
Broker rate-limiting/HTTP failures at call-site: req not in deps means no outgoing HTTP instrumentation; retries/429s produce no trace or metric
Tempo/OTLP exporter outage: No metric for exporter drop counts; operators see "no traces" and may assume "nothing happening" vs "monitoring broken"
Order submission counter "green" while broker later rejects: Post-submit broker-side rejections have no counter or lifecycle span
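
The scrape-granularity finding can be sketched as a sampling problem. The example assumes the health signal is a point-in-time gauge (cumulative counters would still carry the errors into the next scrape); the `broker_up` function and the timing of the hiccup are hypothetical.

```python
def scrape(signal, interval_s, duration_s):
    """Sample signal(t) once every interval_s seconds, as a pull-based scraper would."""
    return [signal(t) for t in range(0, duration_s, interval_s)]


def broker_up(t):
    """Hypothetical health gauge: 1 when healthy, 0 during a 7-second hiccup (t = 16..22)."""
    return 0 if 16 <= t <= 22 else 1


every_15s = scrape(broker_up, 15, 60)  # samples at t = 0, 15, 30, 45
every_1s = scrape(broker_up, 1, 60)    # samples at every second

missed_by_15s = 0 not in every_15s     # True: the hiccup fell between scrapes
seconds_down_seen_by_1s = every_1s.count(0)
```

The 15-second series contains only healthy samples, so nothing downstream of the scraper can reconstruct the incident.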

Claude Opus unique findings (not in either other model)

Aggregator swallowing signals without emission: Signals entering the buffer that never trigger aggregation produce ZERO observability signal — no error, no metric, no timeout event. Distinguished from the common "aggregator gap" finding by reasoning about the specific mechanism: the aggregator.aggregate span only fires on aggregation, not on receipt. Buffered-and-forgotten signals are completely invisible.
OTel context propagation failure is silent: OpenTelemetry.Ctx.attach(nil) returns a no-op token; with_span creates a ROOT span (different trace_id). Traces look structurally valid but are fragmented. No validation exists for context propagation success.
Strategy logic bug with correct observability: All pillars report success when a strategy emits the wrong signal (inverted buy/sell). No metric for financial correctness (P&L, win rate, signal-vs-price-movement) — only operational correctness.
Decision latency histogram is survivor-biased: Only records COMPLETED journeys. Stuck decisions never appear. Histogram shows healthy p99 while decisions are lost. Distinguished from generic "mailbox" finding by identifying the specific statistical bias in the metric design.
Broker WebSocket reconnect succeeds but misses fill events during gap: Reconnect counter says "recovered"; fills during disconnect are permanently lost. Orders stay in "submitted" state forever with no timeout metric.
@keep_keys allowlist silently drops new metadata: Developer adds critical metadata key, forgets to update allowlist. Logs appear in viewer (message intact) but structured queries return nothing. Diagnostic dead-end that ACTIVELY misleads.
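
The survivor-bias finding above can be made concrete with a minimal sketch. It assumes a latency histogram that only receives an observation when a decision journey completes; the journey data and the nearest-rank percentile helper are illustrative, not the document's implementation.

```python
import math


def p99(values):
    """Nearest-rank 99th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(0.99 * len(ordered))
    return ordered[rank - 1]


# Hypothetical journeys: latency in ms, or None when the decision never completed.
journeys = [12.0] * 95 + [None] * 5

# Only completed journeys ever reach the histogram.
completed = [ms for ms in journeys if ms is not None]

observed_p99 = p99(completed)                         # what the histogram reports
lost_fraction = journeys.count(None) / len(journeys)  # what it cannot report
```

The reported p99 looks healthy even though 5% of decisions were lost, because stuck decisions contribute no sample at all. A separate count of started versus completed journeys would be one way to expose the difference.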

Claude Sonnet findings (unique aspects)

Partial fill accumulation without detection: Systematic 90% fills create gradual position drift — no metric for unfilled quantity accumulation
Aggregator timing bias creating systematic directional bias: Varying processing latencies cause buy signals to consistently arrive before sells
Signal ID collision corrupting audit trail: UUID weakness creating impossible timelines in correlation queries
Telemetry emission timing creating phantom latencies: Mailbox congestion in telemetry handlers makes latency metrics unreliable (showing delays that don't reflect actual performance)

Quality assessment

GPT-5 (18 findings): Most exhaustive as usual. Strong on operational/infrastructure-level blind spots (scrape granularity, exporter health, HTTP instrumentation gaps, cross-pillar disagreement). Several findings showed careful reading of the document's specific configuration choices (15s scrape, @keep_keys, span attribute allow-list). However, some findings were variations on the same theme (multiple findings about post-submission order lifecycle gaps). Every finding referenced specific document mechanisms and explained the causal chain clearly.

Claude Opus (12 findings): Highest insight-per-finding density. Two standout findings:

The Ctx.attach(nil) silent fragmentation — this is genuinely subtle and requires understanding Erlang OTel SDK internals to recognize that nil context creates valid-looking but uncorrelated traces. No other model caught this.
The @keep_keys metadata stripping creating a diagnostic dead-end that ACTIVELY misleads (investigator sees log, queries by key, gets nothing, concludes event never happened). This is the only finding across all models that describes observability creating a FALSE NEGATIVE in investigation rather than just a gap.
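
The @keep_keys failure mode can be sketched in a few lines. This is a Python stand-in for an allowlist filter, not the project's Elixir code; the allowlist contents, the `broker_order_id` key, and the in-memory log store are hypothetical.

```python
# Hypothetical allowlist contents.
KEEP_KEYS = {"signal_id", "decision_id", "ticker"}


def sanitize(metadata):
    """Drop every metadata key that is not on the allowlist, without reporting the drop."""
    return {k: v for k, v in metadata.items() if k in KEEP_KEYS}


def store(log_store, message, metadata):
    """Persist the message with only the allowlisted metadata."""
    log_store.append({"message": message, "metadata": sanitize(metadata)})


def query(log_store, key, value):
    """Structured query: entries whose metadata has key == value."""
    return [e for e in log_store if e["metadata"].get(key) == value]


logs = []
# A developer adds broker_order_id but does not update KEEP_KEYS.
store(logs, "order submitted", {"decision_id": "d-1", "broker_order_id": "b-42"})

visible_messages = [e["message"] for e in logs]
by_new_key = query(logs, "broker_order_id", "b-42")
by_old_key = query(logs, "decision_id", "d-1")
```

The message is visible in the viewer, the query on the new key returns nothing, and nothing signals that a key was dropped. That combination is what turns a gap into a false negative.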

Opus also continued its pattern of identifying design tensions: the strategy-correctness finding (#7) explicitly names the gap between "operational correctness" (system works) and "financial correctness" (system works RIGHT) — a fundamental architectural blind spot that the other models only touched peripherally.

Claude Sonnet (10 findings): Weakest performance in this experiment. Several findings were plausible but somewhat generic or low-specificity compared to the other models:

"Signal ID collision" assumes a UUID weakness that isn't evidenced in the document
"Telemetry emission timing creating phantom latencies" is theoretically possible but doesn't reference specific document mechanisms that would cause it
"Aggregator timing bias" is an interesting idea but doesn't explain WHY this specific observability design would miss it

Sonnet's best finding was the partial-fill accumulation (no metric for systematic underfill), which is genuinely trading-specific. But overall, it produced fewer findings, with less document-grounding and more speculation.

Key insight — "what's invisible" requires reasoning about negation

This task type is fundamentally about NEGATION: "given what IS instrumented, what ISN'T?" This is harder than assumption-finding (which can work from what's stated) or race condition analysis (which works from what's specified about concurrency). Here, the model must:

1. Build a mental model of what the observability design CAN see
2. Enumerate production scenarios (requiring domain knowledge)
3. Check each scenario against the coverage model
4. Identify scenarios that fall in the gaps

GPT-5 excelled at step 1 (thorough coverage mapping) and step 3 (systematic checking). Opus excelled at step 2 (finding subtle scenarios like context propagation failure) and identifying findings that ACTIVELY mislead rather than passively miss. Sonnet struggled with step 3 — some of its scenarios were valid but its explanations of WHY the observability design specifically misses them were weaker.
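
The four steps amount to a set difference between what each scenario needs and what the design covers. A minimal sketch, assuming coverage and scenario requirements can be written as sets of event names; every event name and scenario below is a hypothetical placeholder rather than an identifier from observability.md.

```python
# Step 1: what the design can see (hypothetical event names).
covered = {"signal.evaluated", "aggregator.aggregate", "order.submit", "risk.rejected"}

# Step 2: production scenarios, each mapped to the events needed to detect it.
scenarios = {
    "order rejected by broker after submit": {"order.submit", "broker.reject"},
    "signal buffered and never aggregated": {"aggregator.receive", "aggregator.aggregate"},
    "risk check rejects an order": {"risk.rejected"},
}

# Steps 3 and 4: check each scenario against coverage and keep those with missing events.
blind_spots = {
    name: sorted(needed - covered)
    for name, needed in scenarios.items()
    if needed - covered
}
```

The mechanical part is trivial. The difficulty the models faced lies in step 2, where the scenario list has to come from domain knowledge rather than from the document.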

Comparison to previous task types

Task type	GPT-5	Opus	Sonnet
Assumption-finding (#10-12)	20-26	12-13	17
Race conditions (#13)	12	10	7 (with errors)
Cross-component (#14)	10	—	8
Blind spot analysis	18	12	10

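For this experiment the count ratios are GPT-5 ~1.5x Opus and Opus ~1.2x Sonnet. GPT-5 has produced the most findings in every task type so far, but the ratios are not stable across tasks: on assumption-finding, Sonnet (17) outcounted Opus (12-13). Quality-per-finding continues to favor Opus for finding the most architecturally insightful issues. GPT-5's breadth advantage is real but includes more operational/infrastructure findings vs Opus's focus on design-level blind spots.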
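
The count ratios can be recomputed directly from the table. A minimal sketch, treating ranges as (low, high) pairs and omitting the cross-component row because Opus has no count there; the variable and function names are illustrative.

```python
# Finding counts from the comparison table; ranges are kept as (low, high).
counts = {
    "Assumption-finding": {"GPT-5": (20, 26), "Opus": (12, 13), "Sonnet": (17, 17)},
    "Race conditions": {"GPT-5": (12, 12), "Opus": (10, 10), "Sonnet": (7, 7)},
    "Blind spot analysis": {"GPT-5": (18, 18), "Opus": (12, 12), "Sonnet": (10, 10)},
}


def ratio_range(a, b):
    """Smallest and largest possible a/b given (low, high) ranges, rounded to 2 places."""
    return (round(a[0] / b[1], 2), round(a[1] / b[0], 2))


ratios = {
    task: {
        "GPT-5/Opus": ratio_range(row["GPT-5"], row["Opus"]),
        "Opus/Sonnet": ratio_range(row["Opus"], row["Sonnet"]),
    }
    for task, row in counts.items()
}
```

For the blind spot row this gives 1.5 and 1.2. The other rows can be read from `ratios` in the same way.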

Practical implications

New analytical task for architecture review: "What can't you see?" is a distinct and valuable question to ask of any observability or monitoring design. It's not covered by assumption-finding, gap-finding, or consistency checking.
Model assignment for blind spot analysis:
- GPT-5: Operational blind spots (infrastructure interactions, configuration gaps, cross-system dependencies)
- Opus: Design-level blind spots (false confidence, active misdirection, semantic gaps between what's measured and what matters)
- Sonnet: Not recommended for this task type — insufficient document grounding
The "actively misleads" category is highest-value: Of all findings across 3 models, the ones that describe observability CREATING false confidence (rather than just missing signal) are the most dangerous and actionable. Opus found 3 of these; GPT-5 found 2; Sonnet found 0. This suggests Opus should be specifically tasked with: "Where does this design create false confidence?"

Updated task-model matrix

Task	Best model(s)	Why
Assumption-finding	GPT-5 + Opus	Breadth + design tensions
Race conditions	GPT-5 + Opus	Sonnet unreliable for concurrency
Cross-component	GPT-5 + Sonnet	Both good; Sonnet recovers with structure
Cross-document consistency	Opus + GPT-5	Opus dominates boundary reasoning
Operational blind spots	GPT-5 + Opus	GPT-5 for coverage mapping; Opus for false confidence
Bias detection	Any (with narrow framing)	Signal-to-noise matters more than model

Source

Document: gargoyle/docs/impl/observability.md (563 lines)
Models: GPT-5 (via HAI OpenAI endpoint), Claude Opus 4.6, Claude Sonnet 4.6 (via HAI Anthropic endpoint)
No tools, no project context beyond the document itself
