# Finding #46: Operational blind spot analysis, a new task type revealing model divergence on "what's invisible"

**Date:** 2026-05-08
**Task:** Identify operational blind spots in gargoyle's `observability.md` (563 lines):
scenarios where the observability design would fail to surface problems, actively mislead
operators, provide signal too late, or create diagnostic dead-ends.
**Novel aspect:** This is a new analytical task type distinct from assumption-finding (#10-12),
race condition identification (#13), cross-component interaction (#14), or gap-finding (#9).
It requires reasoning about the ABSENCE of signal (what can't you see?) and how
instrumentation choices create false confidence.

## Setup

All 3 models received the same document (full text, no truncation) and the same focused
analytical question via the HAI proxy. The prompt specified 4 categories of blind spot
(fail to surface, actively mislead, signal too late, diagnostic dead-ends) and required a
specific output format (Scenario, Why invisible/misleading, Impact, Detection gap). It
explicitly asked for trading-specific scenarios and references to specific mechanisms in
the document.

| Model | Time | Output tokens | Reasoning tokens | Blind spots found |
|---|---|---|---|---|
| GPT-5 | ~87s | 7,969 | 5,056 | 18 |
| Claude Opus 4.6 | ~120s | 4,591 | (internal) | 12 |
| Claude Sonnet 4.6 | ~38s | 1,930 | (internal) | 10 |

## What they found: common ground (all 3 identified)

- Order state post-submission is invisible (traces terminate at `order_manager.submit`;
  no metrics for fills/cancels/rejects from broker)
- Position reconciliation gap (no metric for internal-vs-broker position drift)
- Market data staleness per-symbol undetectable at scale (cardinality exclusion of `symbol`
  from `quote_feed.tick` metadata means per-symbol staleness is invisible in Prometheus)
- Risk controls only measure rejections, not false approvals (stale data causing
  incorrect approvals is invisible)
- LogStore GenServer mailbox/write failure is silent due to the async cast pattern
- Aggregator signal buffering without emission has no observability

## GPT-5 unique findings (not in either Claude model)

- **Sampling + span links break end-to-end traceability**: When signal evaluation sampling
  is reduced, Aggregator span links reference unsampled trace IDs → Tempo shows broken
  links, operators underestimate true latency
- **Webhook correlation dead-end**: `opentelemetry_phoenix` "does not capture HTTP headers";
  inbound broker webhooks have no body-parsed attributes linking back to decision_id
- **15s Prometheus scrape + 10s VM poller too coarse for trading incidents**: 5-10 second
  broker hiccups vanish between scrapes; micro-incidents permanently invisible
- **Price integrity/slippage invisible**: Span attribute allow-list forbids `price_target`;
  no metric for filled_avg_price vs intended → systematic slippage undetectable
- **Duplicate decisions undetectable from current labels**: Counter tags (ticker/action/strategy)
  can't distinguish legitimate trades from duplicates; decision_id isn't a Prometheus label
- **End-to-end reconstruction depends on both correlation keys always present**: If a code
  path omits signal_id or decision_id, the three-way join breaks silently
- **Live logs UI shares Phoenix endpoint**: Under load, trading continues but LogsLive
  becomes unreachable, so operators lose situational awareness exactly when needed
- **Per-symbol staleness metric contradicts cardinality rules**: `feed.stale` with ticker
  tag will be removed in practice, leaving only global signals
- **Metric/trace/log pillars can disagree without alerting**: Sampling/exporter failures
  make metrics look green while traces vanish; no cross-pillar consistency check
- **Broker rate-limiting/HTTP failures at call-site**: `req` not in deps means no outgoing
  HTTP instrumentation; retries/429s produce no trace or metric
- **Tempo/OTLP exporter outage**: No metric for exporter drop counts; operators see "no
  traces" and may assume "nothing happening" vs "monitoring broken"
- **Order submission counter "green" while broker later rejects**: Post-submit broker-side
  rejections have no counter or lifecycle span

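
The scrape-granularity finding above can be illustrated with a small simulation. This is a minimal sketch, not gargoyle code: the function name, the incident timing, and the idea of a boolean "healthy" gauge are all assumptions made for illustration. It only models gauge-style sampling; a counter incremented on every failure would still show the incident after the next scrape.

```python
def scrape_observations(outage_start, outage_end, scrape_interval, horizon):
    """Return (time, healthy) samples as a periodic scraper of a gauge would record them."""
    samples = []
    t = 0.0
    while t <= horizon:
        # The gauge reads "unhealthy" only while the outage is in progress.
        healthy = not (outage_start <= t < outage_end)
        samples.append((t, healthy))
        t += scrape_interval
    return samples


# Hypothetical incident: broker connection down from t=16s to t=24s (8 seconds).
coarse = scrape_observations(outage_start=16.0, outage_end=24.0,
                             scrape_interval=15.0, horizon=60.0)
fine = scrape_observations(outage_start=16.0, outage_end=24.0,
                           scrape_interval=1.0, horizon=60.0)

# Scrapes at 0, 15, 30, 45, 60 all land outside the outage window.
outage_seen_coarse = any(not healthy for _, healthy in coarse)
# A 1s scrape lands inside the window several times.
outage_seen_fine = any(not healthy for _, healthy in fine)

print(outage_seen_coarse, outage_seen_fine)
```

With these assumed timings the 15s scraper records no unhealthy sample at all, which is the "micro-incidents permanently invisible" case.
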
## Claude Opus unique findings (not in either of the other models)

- **Aggregator swallowing signals without emission**: Signals entering the buffer that
  never trigger aggregation produce ZERO observability signal: no error, no metric, no
  timeout event. Distinguished from the common "aggregator gap" finding by reasoning about
  the specific mechanism: the `aggregator.aggregate` span only fires on aggregation, not
  on receipt. Buffered-and-forgotten signals are completely invisible.
- **OTel context propagation failure is silent**: `OpenTelemetry.Ctx.attach(nil)` returns
  a no-op token; `with_span` creates a ROOT span (different trace_id). Traces look
  structurally valid but are fragmented. No validation exists for context propagation success.
- **Strategy logic bug with correct observability**: All pillars report success when a
  strategy emits the wrong signal (inverted buy/sell). No metric for financial correctness
  (P&L, win rate, signal-vs-price-movement), only operational correctness.
- **Decision latency histogram is survivor-biased**: Only records COMPLETED journeys.
  Stuck decisions never appear. Histogram shows healthy p99 while decisions are lost.
  Distinguished from generic "mailbox" finding by identifying the specific statistical
  bias in the metric design.
- **Broker WebSocket reconnect succeeds but misses fill events during gap**: Reconnect
  counter says "recovered"; fills during disconnect are permanently lost. Orders stay
  in "submitted" state forever with no timeout metric.
- **@keep_keys allowlist silently drops new metadata**: Developer adds critical metadata
  key, forgets to update allowlist. Logs appear in viewer (message intact) but structured
  queries return nothing. Diagnostic dead-end that ACTIVELY misleads.

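
The survivor-bias finding above is easy to see with numbers. The following is a minimal sketch under stated assumptions: the latencies, the 10% stuck rate, and the nearest-rank percentile helper are all hypothetical and are not taken from gargoyle's metrics code.

```python
import math


def p99(values):
    """Nearest-rank 99th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(0.99 * len(ordered))
    return ordered[rank - 1]


# Hypothetical decisions: latency in ms, or None if the decision never completed.
decisions = [20.0] * 90 + [None] * 10

# A completion-only histogram can record only the journeys that finished.
completed = [d for d in decisions if d is not None]
recorded_p99 = p99(completed)

# The stuck decisions are exactly the ones the histogram cannot represent.
stuck = len(decisions) - len(completed)

print(recorded_p99, stuck)
```

In this made-up sample the histogram reports a p99 of 20 ms while 10 of 100 decisions never finished. Detecting that would need a separate signal, such as a started-minus-completed count, which the finding says is absent.
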
## Claude Sonnet findings (unique aspects)

- **Partial fill accumulation without detection**: Systematic 90% fills create gradual
  position drift; no metric for unfilled quantity accumulation
- **Aggregator timing bias creating systematic directional bias**: Varying processing
  latencies cause buy signals to consistently arrive before sells
- **Signal ID collision corrupting audit trail**: UUID weakness creating impossible
  timelines in correlation queries
- **Telemetry emission timing creating phantom latencies**: Mailbox congestion in
  telemetry handlers makes latency metrics unreliable (showing delays that don't
  reflect actual performance)

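
The partial-fill finding above comes down to simple accumulation. A minimal arithmetic sketch follows; the order size and order count are hypothetical, and only the 90% fill rate comes from the finding itself.

```python
# Hypothetical numbers: 50 orders of 100 shares each, every one filled at 90%.
ordered_qty = 100
fill_ratio_pct = 90
orders = 50

intended_position = 0
actual_position = 0
for _ in range(orders):
    intended_position += ordered_qty
    # Integer arithmetic keeps the share count exact.
    actual_position += ordered_qty * fill_ratio_pct // 100

# Unfilled quantity that no metric in the design accumulates.
drift = intended_position - actual_position

print(intended_position, actual_position, drift)
```

Each individual order looks unremarkable, but under these assumptions the unfilled quantity grows by 10 shares per order and reaches 500 shares after 50 orders.
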
## Quality assessment

**GPT-5 (18 findings):** Most exhaustive as usual. Strong on operational/infrastructure-level
blind spots (scrape granularity, exporter health, HTTP instrumentation gaps, cross-pillar
disagreement). Several findings showed careful reading of the document's specific configuration
choices (15s scrape, @keep_keys, span attribute allow-list). However, some findings were
variations on the same theme (multiple findings about post-submission order lifecycle gaps).
Every finding referenced specific document mechanisms and explained the causal chain clearly.

**Claude Opus (12 findings):** Highest insight-per-finding density. Two standout findings:

1. The `Ctx.attach(nil)` silent fragmentation. This is genuinely subtle and requires
   understanding Erlang OTel SDK internals to recognize that nil context creates valid-looking
   but uncorrelated traces. No other model caught this.
2. The @keep_keys metadata stripping creating a diagnostic dead-end that ACTIVELY misleads
   (investigator sees log, queries by key, gets nothing, concludes event never happened).
   This is the only finding across all models that describes observability creating a
   FALSE NEGATIVE in investigation rather than just a gap.

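
The investigator's path in the second standout finding can be sketched in a few lines. This is an illustration under assumptions, not the project's logging code: `KEEP_KEYS`, `store_log`, `query`, and the `broker_order_id` key are hypothetical stand-ins for the `@keep_keys` allowlist and the log store described in the document.

```python
# Hypothetical allowlist standing in for the document's @keep_keys.
KEEP_KEYS = {"signal_id", "decision_id"}


def store_log(message, metadata):
    """Keep the message, silently drop any metadata key not on the allowlist."""
    kept = {k: v for k, v in metadata.items() if k in KEEP_KEYS}
    return {"message": message, "metadata": kept}


def query(logs, key, value):
    """Structured query: return logs whose stored metadata has key == value."""
    return [log for log in logs if log["metadata"].get(key) == value]


# A developer adds a new key (broker_order_id) but does not update the allowlist.
logs = [store_log("order rejected by broker",
                  {"decision_id": "d-1", "broker_order_id": "b-42"})]

visible_in_viewer = [log["message"] for log in logs]  # the message is intact
by_new_key = query(logs, "broker_order_id", "b-42")   # the structured query is empty
by_old_key = query(logs, "decision_id", "d-1")        # allowlisted key still works

print(visible_in_viewer, len(by_new_key), len(by_old_key))
```

The empty result for the new key is indistinguishable from "the event never happened", which is what makes this a false negative rather than a plain gap.
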
Opus also continued its pattern of identifying design tensions: the strategy-correctness
finding (#7) explicitly names the gap between "operational correctness" (system works) and
"financial correctness" (system works RIGHT), a fundamental architectural blind spot that
the other models only touched peripherally.

**Claude Sonnet (10 findings):** Weakest performance in this experiment. Several findings
were plausible but somewhat generic or low-specificity compared to the other models:

- "Signal ID collision" assumes a UUID weakness that isn't evidenced in the document
- "Telemetry emission timing creating phantom latencies" is theoretically possible but
  doesn't reference specific document mechanisms that would cause it
- "Aggregator timing bias" is an interesting idea but doesn't explain WHY this specific
  observability design would miss it

Sonnet's best finding was the partial-fill accumulation (no metric for systematic underfill),
which is genuinely trading-specific. But overall, it produced fewer findings, with less
document-grounding and more speculation.

## Key insight: "what's invisible" requires reasoning about negation

This task type is fundamentally about NEGATION: "given what IS instrumented, what ISN'T?"
This is harder than assumption-finding (which can work from what's stated) or race condition
analysis (which works from what's specified about concurrency). Here, the model must:

1. Build a mental model of what the observability design CAN see
2. Enumerate production scenarios (requiring domain knowledge)
3. Check each scenario against the coverage model
4. Identify scenarios that fall in the gaps

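
The four steps above amount to a set-difference check between what a scenario needs and what the design emits. A minimal sketch, assuming hypothetical signal names and scenarios (none of these identifiers come from `observability.md`):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    name: str
    needs: frozenset  # signals an operator would need in order to notice this scenario


# Step 1: coverage model, i.e. the signals the design can emit (hypothetical names).
coverage = frozenset({"order.submitted", "risk.rejected", "feed.stale.global"})

# Step 2: production scenarios, drawn from domain knowledge.
scenarios = [
    Scenario("broker rejects after submit",
             frozenset({"order.submitted", "order.broker_rejected"})),
    Scenario("risk check rejects order", frozenset({"risk.rejected"})),
    Scenario("single symbol goes stale", frozenset({"feed.stale.per_symbol"})),
]


def blind_spots(coverage, scenarios):
    """Steps 3 and 4: check each scenario against coverage and keep the gaps."""
    gaps = {}
    for scenario in scenarios:
        missing = scenario.needs - coverage
        if missing:
            gaps[scenario.name] = sorted(missing)
    return gaps


gaps = blind_spots(coverage, scenarios)
for name, missing in gaps.items():
    print(f"{name}: missing {missing}")
```

The mechanical part (steps 3 and 4) is trivial once the inputs exist. The hard parts are building an accurate coverage set and enumerating scenarios, which is where the models diverged.
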
GPT-5 excelled at step 1 (thorough coverage mapping) and step 3 (systematic checking).
Opus excelled at step 2 (finding subtle scenarios like context propagation failure) and
identifying findings that ACTIVELY mislead rather than passively miss.
Sonnet struggled with step 3: some of its scenarios were valid but its explanations of
WHY the observability design specifically misses them were weaker.

## Comparison to previous task types

| Task type | GPT-5 | Opus | Sonnet |
|---|---|---|---|
| Assumption-finding (#10-12) | 20-26 | 12-13 | 17 |
| Race conditions (#13) | 12 | 10 | 7 (with errors) |
| Cross-component (#14) | 10 | - | 8 |
| **Blind spot analysis** | **18** | **12** | **10** |

For this task the counts give GPT-5 ~1.5x Opus and Opus ~1.2x Sonnet. GPT-5 leads on raw
count in every row, but the Opus-to-Sonnet ordering is not stable across task types: Sonnet
out-counted Opus on assumption-finding (17 vs 12-13). Quality-per-finding
continues to favor Opus for finding the most architecturally insightful issues. GPT-5's
breadth advantage is real but includes more operational/infrastructure findings vs
Opus's focus on design-level blind spots.

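
The count ratios can be recomputed directly from the table. In this small sketch the counts are copied from the table above; using the midpoint for ranged entries (20-26, 12-13) is an assumption made here for illustration, and the cross-component row is left out because it has no Opus count.

```python
# Finding counts copied from the comparison table; ranges use their midpoint (assumption).
counts = {
    "assumption-finding": {"gpt5": (20 + 26) / 2, "opus": (12 + 13) / 2, "sonnet": 17},
    "race conditions": {"gpt5": 12, "opus": 10, "sonnet": 7},
    "blind spot analysis": {"gpt5": 18, "opus": 12, "sonnet": 10},
}

# For each task: (GPT-5 / Opus, Opus / Sonnet), rounded to two decimals.
ratios = {
    task: (round(c["gpt5"] / c["opus"], 2), round(c["opus"] / c["sonnet"], 2))
    for task, c in counts.items()
}

for task, (gpt_vs_opus, opus_vs_sonnet) in ratios.items():
    print(f"{task}: GPT-5/Opus = {gpt_vs_opus}, Opus/Sonnet = {opus_vs_sonnet}")
```

The blind spot row gives 1.5 and 1.2, while the assumption-finding row gives an Opus/Sonnet ratio below 1, so the ratios are best read per task rather than as a fixed property of the models.
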
## Practical implications

1. **New analytical task for architecture review:** "What can't you see?" is a distinct
   and valuable question to ask of any observability or monitoring design. It's not
   covered by assumption-finding, gap-finding, or consistency checking.

2. **Model assignment for blind spot analysis:**
   - GPT-5: Operational blind spots (infrastructure interactions, configuration gaps,
     cross-system dependencies)
   - Opus: Design-level blind spots (false confidence, active misdirection, semantic
     gaps between what's measured and what matters)
   - Sonnet: Not recommended for this task type (insufficient document grounding)

3. **The "actively misleads" category is highest-value:** Of all findings across 3 models,
   the ones that describe observability CREATING false confidence (rather than just missing
   signal) are the most dangerous and actionable. Opus found 3 of these; GPT-5 found 2;
   Sonnet found 0. This suggests Opus should be specifically tasked with: "Where does this
   design create false confidence?"

## Updated task-model matrix

| Task | Best model(s) | Why |
|---|---|---|
| Assumption-finding | GPT-5 + Opus | Breadth + design tensions |
| Race conditions | GPT-5 + Opus | Sonnet unreliable for concurrency |
| Cross-component | GPT-5 + Sonnet | Both good; Sonnet recovers with structure |
| Cross-document consistency | Opus + GPT-5 | Opus dominates boundary reasoning |
| **Operational blind spots** | **GPT-5 + Opus** | **GPT-5 for coverage mapping; Opus for false confidence** |
| Bias detection | Any (with narrow framing) | Signal-to-noise matters more than model |

## Source

- Document: `gargoyle/docs/impl/observability.md` (563 lines)
- Models: GPT-5 (via HAI OpenAI endpoint), Claude Opus 4.6, Claude Sonnet 4.6 (via HAI Anthropic endpoint)
- No tools, no project context beyond the document itself