finding 51: implementation ambiguity analysis — new analytical lens

claw
2026-05-08 12:46:32 -07:00
parent 5b8f8caf8c
commit 79915d1dc3
@@ -0,0 +1,166 @@
# Finding 51: Implementation Ambiguity Analysis — New Analytical Lens; All Models Converge on Core Ambiguity
**Date:** 2026-05-08
**Task:** Identify implementation ambiguities in two related gargoyle design documents
(`market-sessions.md`, 102 lines, and `strategy-config.md`, 97 lines): places where the
spec is clear enough to design from but ambiguous enough that two engineers could
reasonably implement different behaviors from the same text.
**How we used them:** Both documents (full text) combined in a single prompt with a
structured analytical question. Each ambiguity required: quoted text, two interpretations,
cross-component divergence impact, severity. Tested via HAI proxy (OpenAI endpoint for
GPT-5, Anthropic endpoint for Claude models). No tools, no project context beyond the
two documents.
| Model | Time | Output tokens | Reasoning tokens | Ambiguities found |
|---|---|---|---|---|
| GPT-5 | 78s | 9,190 | 6,784 | 8 |
| Claude Opus 4.6 | 56s | 2,558 | (internal) | 7 |
| Claude Sonnet 4.6 | 48s | 2,353 | (internal) | 6 |
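
To make the table easier to compare across models, the figures can be normalized per finding. The sketch below is plain arithmetic on the numbers in the table; the dictionary layout and the `per_finding` helper are illustrative names, not part of the experiment harness.

```python
# Figures copied from the results table above.
results = {
    "GPT-5": {"seconds": 78, "output_tokens": 9190, "findings": 8},
    "Claude Opus 4.6": {"seconds": 56, "output_tokens": 2558, "findings": 7},
    "Claude Sonnet 4.6": {"seconds": 48, "output_tokens": 2353, "findings": 6},
}


def per_finding(row):
    """Return (output tokens per finding, seconds per finding)."""
    return row["output_tokens"] / row["findings"], row["seconds"] / row["findings"]


for model, row in results.items():
    tokens, seconds = per_finding(row)
    print(f"{model}: {tokens:.0f} tokens/finding, {seconds:.1f} s/finding")
```

This works out to roughly 1,149 output tokens per finding for GPT-5, 365 for Opus, and 392 for Sonnet. The table does not say whether GPT-5's output figure includes its 6,784 reasoning tokens; if it does, its visible output would be 2,406 tokens, which would change the comparison.
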
## What they found — common ground (all 3 identified):
**The central ambiguity: activation vs session lifecycle.** All three models independently
identified the same core tension: Document 1 says "strategies subscribe to market data
at session open" while Document 2 says "enabling the first strategy can trigger
activation." Does activation happen at session open, at eligibility, or both? All three
agree this is the highest-severity cross-document ambiguity.
**The config-restart-session ambiguity.** All three identified that "configuration changes
take effect on the next decision engine restart" is ambiguous relative to session
boundaries — does session open/close constitute a "restart"? Two different interpretations
lead to fundamentally different user expectations about when parameter changes apply.
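
The two readings of "next decision engine restart" can be sketched as a toy model. Everything here is hypothetical (the `Engine` class, its method names, the parameter values); it only illustrates how the same saved change becomes visible at different times depending on whether session open counts as a restart.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Engine:
    """Toy decision engine holding one parameter value."""
    session_open_is_restart: bool   # the interpretation under test
    active: int = 10                # value currently in use
    pending: Optional[int] = None   # saved change not yet applied

    def change_config(self, value: int) -> None:
        # Stored now, applied "on the next restart".
        self.pending = value

    def _apply_pending(self) -> None:
        if self.pending is not None:
            self.active, self.pending = self.pending, None

    def on_session_open(self) -> None:
        if self.session_open_is_restart:
            self._apply_pending()

    def on_process_restart(self) -> None:
        self._apply_pending()


def value_at_next_session(session_open_is_restart: bool) -> int:
    engine = Engine(session_open_is_restart)
    engine.change_config(25)   # user edits a parameter after the close
    engine.on_session_open()   # next session begins, no process restart
    return engine.active


print(value_at_next_session(True))   # 25: change applies at session open
print(value_at_next_session(False))  # 10: change waits for a process restart
```
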
**Mid-session activation semantics.** All three identified the scenario where a user
enables their first strategy during an active session — does the engine activate
immediately, or defer to next session open?
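
A minimal sketch of the mid-session case, assuming two engineers each pick one reading. The `session.opened` event name comes from the design documents as quoted in this finding; `strategy.enabled`, the `Policy` enum, and the replay function are invented for illustration.

```python
from enum import Enum


class Policy(Enum):
    ACTIVATE_AT_SESSION_OPEN = "session_open"   # reading A: defer to next open
    ACTIVATE_ON_FIRST_ENABLE = "first_enable"   # reading B: activate immediately


def engine_active_after(events, policy):
    """Replay events in order and return whether the engine ends up active."""
    enabled = 0
    active = False
    for event in events:
        if event == "session.opened":
            # Both readings activate at session open if a strategy is enabled.
            if enabled > 0:
                active = True
        elif event == "strategy.enabled":
            enabled += 1
            if policy is Policy.ACTIVATE_ON_FIRST_ENABLE and enabled == 1:
                active = True
    return active


# User enables their first strategy after the session has already opened.
mid_session = ["session.opened", "strategy.enabled"]
print(engine_active_after(mid_session, Policy.ACTIVATE_AT_SESSION_OPEN))  # False
print(engine_active_after(mid_session, Policy.ACTIVATE_ON_FIRST_ENABLE))  # True
```

When the strategy is enabled before the open, both policies return `True`, which is why the divergence only surfaces in the mid-session scenario.
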
## GPT-5 unique findings (not in either Claude model):
- **Date field in session domain events** (#5): The `date` field in `session.opened`/
`session.closed` events could be ET trading date or UTC date. Components keying off
different interpretations would attribute after-hours activity to different days.
- **P&L snapshot timing vs after-hours attribution** (#8): If P&L snapshots at 4:00 PM
but after-hours fills until 8:00 PM are attributed to the same trading day, the
"daily P&L" has no defined finalization time.
- **Aggregation lifecycle across session boundaries** (#4): Are incomplete signal groups
flushed at session close, or do they persist? If a group started at 3:50 PM has a
60-minute timeout, does it expire after-hours?
- **Strategies running outside session** (#6): Are strategies shut down outside sessions
(engine paused), or do they run continuously with risk controls blocking the output?
Different implementations affect aggregation state and warmup behavior.
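
The date-field ambiguity (#5) is easy to reproduce. The sketch below assumes a hypothetical after-hours fill at 7:30 PM ET on a winter date and uses a fixed UTC-5 offset to stay dependency-free; production code would more likely use `zoneinfo.ZoneInfo("America/New_York")`.

```python
from datetime import datetime, timedelta, timezone

# Fixed offset for Eastern Standard Time (an assumption for this sketch;
# it ignores daylight saving time).
EST = timezone(timedelta(hours=-5), "EST")

# Hypothetical after-hours fill at 7:30 PM ET.
fill = datetime(2026, 1, 15, 19, 30, tzinfo=EST)

et_trading_date = fill.date()                    # component A keys on ET date
utc_date = fill.astimezone(timezone.utc).date()  # component B keys on UTC date

print(et_trading_date)  # 2026-01-15
print(utc_date)         # 2026-01-16
```

Under daylight time (UTC-4) the 8:00 PM ET cutoff coincides with UTC midnight, so this disagreement would mostly show up in winter, which makes it the kind of bug that passes testing for months.
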
## Claude Opus unique findings (found by neither of the other models):
- **"Shut down at close" interaction with deactivation** (#2): If "some shut down at
close" causes the enabled strategy count to reach zero, does this trigger the "last
strategy disabled → deactivation" path? This could unintentionally deactivate the
entire engine when only session-sensitive strategies should pause.
- **"Each component decides independently" vs config-driven behavior** (#6): If
components decide independently what session events mean, but the config layer expects
deterministic restart semantics, there's no single authority on when "restart" occurs.
- **"Snapshots backfill" ambiguity for engine state** (#7): The missed-event recovery
mechanism says "subscribers trigger on next check" — but for the decision engine,
missing `session.closed` means potentially running all night.
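
The "shut down at close" finding (#2) comes down to what the deactivation rule counts. A sketch under stated assumptions: the `Strategy` fields, the strategy names, and the `should_deactivate` helper are hypothetical, and the rule is modeled as "deactivate when no strategy counts as enabled".

```python
from dataclasses import dataclass


@dataclass
class Strategy:
    name: str
    enabled_in_config: bool  # what the user configured
    running: bool            # runtime state; False after "shut down at close"


def should_deactivate(strategies, count_runtime_state):
    """Apply the "last strategy disabled -> deactivation" rule."""
    if count_runtime_state:
        # Reading A: a strategy shut down at close no longer counts as enabled.
        live = [s for s in strategies if s.enabled_in_config and s.running]
    else:
        # Reading B: only the user's configuration matters.
        live = [s for s in strategies if s.enabled_in_config]
    return len(live) == 0


# Both strategies are session-sensitive and have shut down at the close.
after_close = [
    Strategy("momentum", enabled_in_config=True, running=False),
    Strategy("mean_reversion", enabled_in_config=True, running=False),
]
print(should_deactivate(after_close, count_runtime_state=True))   # True
print(should_deactivate(after_close, count_runtime_state=False))  # False
```

Under reading A the whole engine deactivates every evening even though the user disabled nothing, which is the unintended path the finding describes.
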
## Claude Sonnet unique findings (found by neither of the other models):
- **"Skips the instance" for removed strategy** (#3): If a strategy is removed from
the system but a user has it configured and enabled, does "skips" mean the engine
still activates with remaining strategies? What if that's the only configured strategy?
The "at least one enabled strategy" prerequisite doesn't account for enabled-but-
unresolvable strategies.
- **High-water mark reset timing** (#4): Is the HWM reset tied to session open (the
event) or decision engine activation? For mid-session activation, these diverge —
the HWM baseline could be session-open portfolio value or activation-time value.
- **"At startup" vs "at activation"** (#5): Aggregation config is "consumed by the
aggregator at startup" — but is "startup" application boot or decision engine
activation? If boot, config changes after boot but before activation are missed.
- **Configuration change event scope** (#6): Does the "configuration change event"
fire only for enable/disable, or for any config mutation (parameter changes)?
If broader, the engine may receive events that trigger re-evaluation but shouldn't
cause hot-reload.
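
The "skips the instance" finding (#3) can be expressed as a one-flag difference in the activation prerequisite. This is a sketch, not the documented behavior: `can_activate`, its arguments, and the strategy names are assumptions made for illustration.

```python
def can_activate(configured, registry, count_unresolvable):
    """Check the "at least one enabled strategy" prerequisite.

    configured: dict of strategy name -> enabled flag (user configuration)
    registry: set of strategy names that still exist in the system
    count_unresolvable: whether enabled-but-removed strategies count
    """
    enabled = [name for name, is_on in configured.items() if is_on]
    if not count_unresolvable:
        enabled = [name for name in enabled if name in registry]
    return len(enabled) >= 1


# The user's only enabled strategy has been removed from the system.
user_config = {"legacy_breakout": True}
registry = {"momentum"}

print(can_activate(user_config, registry, count_unresolvable=True))   # True
print(can_activate(user_config, registry, count_unresolvable=False))  # False
```

In the first reading the engine activates and then skips its only strategy, leaving it running with nothing to do; in the second it never activates.
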
## Quality assessment:
- **GPT-5** found the most ambiguities (8) and was the only model to identify the
date/timezone and P&L-timing ambiguities (findings that span beyond the two documents
into system-wide consistency). Its unique findings extend further from the documents'
explicit text into operational consequences. Every finding includes both interpretations
clearly stated and a specific cross-component failure scenario. The aggregation
lifecycle finding (#4) is architecturally significant — it identifies a design gap
that neither document addresses. However, GPT-5 used 9,190 output tokens (3.6× Opus's
2,558) for 14% more findings (8 vs 7), making it the least token-efficient of the three
per finding.
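- **Claude Opus** found 7 ambiguities in 56s with only 2,558 tokens, the lowest
token cost per finding (about 365, against Sonnet's 392 and GPT-5's 1,149; Sonnet's
total output was slightly smaller at 2,353 tokens).
Its unique findings focus on *design tensions within the interaction*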
(session-driven deactivation vs config-driven deactivation, component independence
vs deterministic behavior). The "shut down at close" finding (#2) is genuinely
insightful: it identifies a scenario where a session lifecycle event could
accidentally trigger a config-layer state machine transition (deactivation). This
is Opus's characteristic strength — reasoning about where one subsystem's behavior
inadvertently triggers another subsystem's semantics.
- **Claude Sonnet** found 6 ambiguities in 48s with 2,353 tokens. Notably, Sonnet
found the most *implementation-specific* ambiguities — the "skips the instance"
edge case, the "at startup" vs "at activation" timing, and the config change event
scope. These are the kinds of ambiguities that would bite an engineer writing the
actual code (GenServer init vs application startup, event scope design). Sonnet
appears to reason more from an implementer's perspective ("if I were coding this,
what would I be unsure about?") while Opus reasons from a designer's perspective
("if I were reviewing this architecture, what tensions exist?").
## Key insight — Implementation Ambiguity Analysis as a task type:
This is a new analytical lens, not tested in earlier findings. Unlike:
- **Assumption-finding** ("what must be true for this to work?")
- **Gap-finding** ("what's missing?")
- **Race conditions** ("what ordering hazards exist?")
- **Cross-doc consistency** ("do these docs contradict?")
...implementation ambiguity asks: "where does the spec admit multiple valid
implementations?" This requires the model to:
1. Read the text as an implementer would (not a reviewer)
2. Generate two interpretations that are BOTH valid (neither is wrong)
3. Show why the divergence matters cross-component
The distinguishing characteristic: findings are not bugs or gaps but instances of
**underspecification**. The spec author wrote something reasonable,
but didn't realize it could be read two ways by different engineers working on
different components.
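
One way to hold a finding of this type is a small record with the four required fields (quoted text, two interpretations, cross-component divergence impact, severity). The sketch below is an assumed shape, not the format the models returned; the `Severity` levels and the `is_well_formed` check are illustrative.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class Severity(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class Ambiguity:
    quoted_text: Tuple[str, ...]   # verbatim excerpts from the spec(s)
    interpretation_a: str
    interpretation_b: str
    divergence_impact: str         # what breaks across components
    severity: Severity

    def is_well_formed(self) -> bool:
        """Require a quote, an impact, and two distinct non-empty readings."""
        a = self.interpretation_a.strip()
        b = self.interpretation_b.strip()
        return bool(self.quoted_text) and bool(a) and bool(b) and a != b \
            and bool(self.divergence_impact.strip())


core = Ambiguity(
    quoted_text=(
        "strategies subscribe to market data at session open",
        "enabling the first strategy can trigger activation",
    ),
    interpretation_a="Activation happens only at session open.",
    interpretation_b="Activation happens as soon as the first strategy is enabled.",
    divergence_impact="A mid-session enable is live in one component and idle in another.",
    severity=Severity.HIGH,
)
print(core.is_well_formed())  # True
```

Requiring two distinct readings is what separates this record from a bug report or a gap: a single reading with a defect would fail the check.
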
## All models converge — but that's the point:
Unlike previous experiments where models had dramatically different finding counts
(GPT-5: 20-35, Opus: 10-13, Sonnet: 7-17 for assumption-finding), here the range
is tight (6-8 findings). Likely reasons for the convergence:
1. The input documents are relatively short (199 lines combined) — less room for
divergence
2. Implementation ambiguity is a more constrained task than open-ended analysis —
you need quoted text from both docs plus two valid interpretations, which naturally
limits the space
3. The core ambiguity (activation vs session) is SO dominant that all models spend
significant output budget exploring its variations
The value differentiation is in WHICH ambiguities each model finds beyond the core:
- GPT-5 extends to system-wide concerns (timezone, P&L timing)
- Opus finds interaction tensions (session events triggering config state machines)
- Sonnet finds implementation-level confusions (init vs activation, event scope)
## Practical implication:
**Implementation ambiguity analysis is ideal for pre-implementation review.** Before
assigning two engineers to work on interacting components, run this analysis on their
respective spec documents. The findings directly identify coordination points that
need explicit resolution before implementation begins. The cost (48-78s and roughly
2.4K-9.2K output tokens per run) is trivial compared to the debugging cost of
discovering these ambiguities as bugs in production.
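
A pre-implementation review along these lines needs little more than a prompt that combines the spec documents with the structured question. The sketch below assembles such a prompt; the wording of the question, the separator format, and the `build_prompt` helper are assumptions, since the exact prompt used in this experiment is not reproduced here. Sending the prompt to a model is left out.

```python
# The four fields each ambiguity was required to include in this experiment.
REQUIRED_FIELDS = [
    "quoted text",
    "two interpretations",
    "cross-component divergence impact",
    "severity",
]


def build_prompt(documents):
    """Combine spec documents and the analytical question into one prompt.

    documents: dict of file name -> full document text
    """
    parts = [f"=== {name} ===\n{text.strip()}" for name, text in documents.items()]
    fields = "\n".join(f"- {field}" for field in REQUIRED_FIELDS)
    question = (
        "Identify places where two engineers could reasonably implement "
        "different behaviors from the same text. For each ambiguity give:\n"
        f"{fields}"
    )
    return "\n\n".join(parts + [question])


prompt = build_prompt({
    "market-sessions.md": "(full text of the first spec)",
    "strategy-config.md": "(full text of the second spec)",
})
print(prompt)
```
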
**Model recommendation for this task:**
- **Sonnet** for quick pre-implementation review (most implementer-focused findings, fastest)
- **Opus** for design-review contexts (finds where subsystem semantics leak across boundaries)
- **GPT-5** when you need exhaustive coverage including system-wide implications
All three are viable — the gap between them is smaller here than for other analytical tasks.
This may be the first task type where Sonnet's findings are qualitatively AS valuable as
the reasoning models' findings (just different in character).