finding 51: implementation ambiguity analysis — new analytical lens
# Finding 51: Implementation Ambiguity Analysis as a New Analytical Lens; All Models Converge on Core Ambiguity

**Date:** 2026-05-08

**Task:** Identify implementation ambiguities in two related gargoyle design documents
(`market-sessions.md`, 102 lines, and `strategy-config.md`, 97 lines): places where the
spec is clear enough to design from but ambiguous enough that two engineers could
reasonably implement different behaviors from the same text.

**How we used the models:** Both documents (full text) were combined in a single prompt
with a structured analytical question. Each reported ambiguity had to include the quoted
text, two interpretations, the cross-component divergence impact, and a severity. Tested
via HAI proxy (OpenAI endpoint for GPT-5, Anthropic endpoint for Claude models). No tools,
no project context beyond the two documents.

| Model | Time | Output tokens | Reasoning tokens | Ambiguities found |
|---|---|---|---|---|
| GPT-5 | 78s | 9,190 | 6,784 | 8 |
| Claude Opus 4.6 | 56s | 2,558 | (internal) | 7 |
| Claude Sonnet 4.6 | 48s | 2,353 | (internal) | 6 |

## What they found (common ground, identified by all 3):

**The central ambiguity: activation vs session lifecycle.** All three models independently
identified the same core tension: Document 1 (`market-sessions.md`) says "strategies
subscribe to market data at session open" while Document 2 (`strategy-config.md`) says
"enabling the first strategy can trigger activation." Does activation happen at session
open, at eligibility, or both? All three agree this is the highest-severity cross-document
ambiguity.

**The config-restart-session ambiguity.** All three identified that "configuration changes
take effect on the next decision engine restart" is ambiguous relative to session
boundaries: does session open/close constitute a "restart"? The two interpretations
lead to fundamentally different user expectations about when parameter changes apply.

**Mid-session activation semantics.** All three identified the scenario where a user
enables their first strategy during an active session: does the engine activate
immediately, or defer to the next session open?
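
The mid-session case can be made concrete with a minimal sketch. The function names and the 9:30 AM / 4:00 PM session bounds below are assumptions for illustration, not taken from either spec, and the sketch only models a single trading day.

```python
from datetime import time

# Assumed regular-session bounds; neither value is quoted from the specs.
SESSION_OPEN = time(9, 30)
SESSION_CLOSE = time(16, 0)


def in_session(now: time) -> bool:
    """True while the (assumed) regular session is open."""
    return SESSION_OPEN <= now < SESSION_CLOSE


def active_under_eager_reading(enabled_at: time, now: time) -> bool:
    # Reading A: enabling the first strategy activates the engine right away,
    # as long as a session is currently open.
    return enabled_at <= now and in_session(now)


def active_under_deferred_reading(enabled_at: time, now: time) -> bool:
    # Reading B: activation only happens at session open, so a strategy
    # enabled after the open waits for the next session.
    return enabled_at <= SESSION_OPEN and in_session(now)


# A user enables their first strategy at 11:00; we check engine state at 11:05.
enabled_at = time(11, 0)
now = time(11, 5)
print(active_under_eager_reading(enabled_at, now))     # True
print(active_under_deferred_reading(enabled_at, now))  # False
```

Both functions are defensible readings of the quoted text, yet they disagree for every strategy enabled between open and close.
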

## GPT-5 unique findings (not in either Claude model):

- **Date field in session domain events** (#5): The `date` field in `session.opened`/
  `session.closed` events could be the ET trading date or the UTC date. Components keying
  off different interpretations would attribute after-hours activity to different days.
- **P&L snapshot timing vs after-hours attribution** (#8): If the P&L snapshot is taken at
  4:00 PM but after-hours fills until 8:00 PM are attributed to the same trading day, the
  "daily P&L" has no defined finalization time.
- **Aggregation lifecycle across session boundaries** (#4): Are incomplete signal groups
  flushed at session close, or do they persist? If a group started at 3:50 PM has a
  60-minute timeout, does it expire after hours?
- **Strategies running outside session** (#6): Are strategies shut down outside sessions
  (engine paused), or do they run continuously with risk controls blocking the output?
  Different implementations affect aggregation state and warmup behavior.
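
To illustrate the date-field finding (#5), here is a small sketch of how one after-hours fill lands on two different calendar days. The fill time is hypothetical, and a fixed UTC-5 offset stands in for Eastern time on a winter date, so daylight-saving handling is deliberately left out.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed UTC-5 offset as a stand-in for Eastern Standard Time.
EST = timezone(timedelta(hours=-5), name="EST")

# Hypothetical after-hours fill at 7:30 PM ET on 2026-01-15.
fill_time = datetime(2026, 1, 15, 19, 30, tzinfo=EST)

# Interpretation 1: `date` is the ET trading date.
et_trading_date = fill_time.date()

# Interpretation 2: `date` is the UTC calendar date.
utc_date = fill_time.astimezone(timezone.utc).date()

print(et_trading_date)  # 2026-01-15
print(utc_date)         # 2026-01-16
```

A component that groups fills by the first value and one that groups by the second would report the same fill under different days.
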

## Claude Opus unique findings (not in either of the other models):

- **"Shut down at close" interaction with deactivation** (#2): If "some shut down at
  close" causes the enabled strategy count to reach zero, does this trigger the "last
  strategy disabled → deactivation" path? This could unintentionally deactivate the
  entire engine when only session-sensitive strategies should pause.
- **"Each component decides independently" vs config-driven behavior** (#6): If
  components decide independently what session events mean, but the config layer expects
  deterministic restart semantics, there's no single authority on when "restart" occurs.
- **"Snapshots backfill" ambiguity for engine state** (#7): The missed-event recovery
  mechanism says "subscribers trigger on next check", but for the decision engine a
  missed `session.closed` event could mean running all night.
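
The "shut down at close" finding (#2) comes down to what "enabled strategy count" measures. The sketch below is a hypothetical model: the `Strategy` fields and both function names are invented for illustration and do not come from the specs.

```python
from dataclasses import dataclass


@dataclass
class Strategy:
    name: str
    enabled: bool  # config-layer flag set by the user
    running: bool  # runtime state; False after a session-close shutdown


def should_deactivate_by_runtime(strategies: list) -> bool:
    # Reading A: a strategy that shut down at close no longer counts as enabled.
    return not any(s.enabled and s.running for s in strategies)


def should_deactivate_by_config(strategies: list) -> bool:
    # Reading B: only the user's enable/disable flag counts.
    return not any(s.enabled for s in strategies)


# After the close, the user's only strategy has shut down but is still enabled.
after_close = [Strategy("session_sensitive", enabled=True, running=False)]
print(should_deactivate_by_runtime(after_close))  # True
print(should_deactivate_by_config(after_close))   # False
```

Under reading A the engine deactivates every evening; under reading B it merely pauses.
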

## Claude Sonnet unique findings (not in either of the other models):

- **"Skips the instance" for removed strategy** (#3): If a strategy is removed from
  the system but a user has it configured and enabled, does "skips" mean the engine
  still activates with the remaining strategies? What if that's the only configured
  strategy? The "at least one enabled strategy" prerequisite doesn't account for
  enabled-but-unresolvable strategies.
- **High-water mark reset timing** (#4): Is the high-water mark (HWM) reset tied to
  session open (the event) or to decision engine activation? For mid-session activation,
  these diverge: the HWM baseline could be the session-open portfolio value or the
  activation-time value.
- **"At startup" vs "at activation"** (#5): Aggregation config is "consumed by the
  aggregator at startup", but is "startup" application boot or decision engine
  activation? If boot, config changes after boot but before activation are missed.
- **Configuration change event scope** (#6): Does the "configuration change event"
  fire only for enable/disable, or for any config mutation (parameter changes)?
  If broader, the engine may receive events that trigger re-evaluation but shouldn't
  cause hot-reload.
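
A worked example shows why the HWM baseline (#4) matters. All numbers are hypothetical, and the drawdown limit is an assumed risk control that the specs may define differently.

```python
def drawdown(hwm: float, current: float) -> float:
    """Fractional drawdown of `current` relative to the high-water mark."""
    return (hwm - current) / hwm


# Hypothetical portfolio values for a mid-session activation.
value_at_session_open = 100_000.0
value_at_activation = 97_000.0
current_value = 96_000.0
max_drawdown = 0.03  # assumed risk limit, for illustration only

# Interpretation 1: HWM resets at the session.opened event.
dd_from_open = drawdown(value_at_session_open, current_value)

# Interpretation 2: HWM resets when the decision engine activates.
dd_from_activation = drawdown(value_at_activation, current_value)

print(dd_from_open > max_drawdown)        # True
print(dd_from_activation > max_drawdown)  # False
```

With the session-open baseline the limit is already breached; with the activation baseline the engine keeps trading.
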

## Quality assessment:

- **GPT-5** found the most ambiguities (8) and was the only model to identify the
  date/timezone and P&L-timing ambiguities (findings that reach beyond the two documents
  into system-wide consistency). Its unique findings extend further from the documents'
  explicit text into operational consequences. Every finding states both interpretations
  clearly and gives a specific cross-component failure scenario. The aggregation
  lifecycle finding (#4) is architecturally significant: it identifies a design gap
  that neither document addresses. However, GPT-5 used 9,190 output tokens (3.6× Opus's
  2,558) for 14% more findings, making it less token-efficient per finding.

- **Claude Opus** found 7 ambiguities in 56s with only 2,558 tokens, the fewest tokens
  per finding of the three (Sonnet's total output was slightly smaller at 2,353 tokens).
  Its unique findings focus on *design tensions within the interaction*
  (session-driven deactivation vs config-driven deactivation, component independence
  vs deterministic behavior). The "shut down at close" finding (#2) is genuinely
  insightful: it identifies a scenario where a session lifecycle event could
  accidentally trigger a config-layer state machine transition (deactivation). This
  is Opus's characteristic strength: reasoning about where one subsystem's behavior
  inadvertently triggers another subsystem's semantics.

- **Claude Sonnet** found 6 ambiguities in 48s with 2,353 tokens. Notably, Sonnet
  found the most *implementation-specific* ambiguities: the "skips the instance"
  edge case, the "at startup" vs "at activation" timing, and the config change event
  scope. These are the kinds of ambiguities that would bite an engineer writing the
  actual code (GenServer init vs application startup, event scope design). Sonnet
  appears to reason more from an implementer's perspective ("if I were coding this,
  what would I be unsure about?") while Opus reasons from a designer's perspective
  ("if I were reviewing this architecture, what tensions exist?").
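
The token-efficiency comparison can be reproduced from the figures in the results table; the short script below only restates that arithmetic.

```python
# Figures copied from the results table.
runs = {
    "GPT-5": {"seconds": 78, "output_tokens": 9190, "findings": 8},
    "Claude Opus 4.6": {"seconds": 56, "output_tokens": 2558, "findings": 7},
    "Claude Sonnet 4.6": {"seconds": 48, "output_tokens": 2353, "findings": 6},
}

# Output tokens spent per reported ambiguity.
tokens_per_finding = {
    model: r["output_tokens"] / r["findings"] for model, r in runs.items()
}
for model, value in tokens_per_finding.items():
    print(f"{model}: {value:.0f} output tokens per finding")

# GPT-5 relative to Opus: token multiple and extra findings.
token_ratio = runs["GPT-5"]["output_tokens"] / runs["Claude Opus 4.6"]["output_tokens"]
extra_findings = runs["GPT-5"]["findings"] / runs["Claude Opus 4.6"]["findings"] - 1
print(f"GPT-5 vs Opus: {token_ratio:.1f}x tokens, {extra_findings:.0%} more findings")
```

This works out to roughly 1,149 tokens per finding for GPT-5, 365 for Opus, and 392 for Sonnet.
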

## Key insight (Implementation Ambiguity Analysis as a task type):

This is a genuinely new analytical lens that we had not previously tested. It differs from:

- **Assumption-finding** ("what must be true for this to work?")
- **Gap-finding** ("what's missing?")
- **Race conditions** ("what ordering hazards exist?")
- **Cross-doc consistency** ("do these docs contradict?")

Implementation ambiguity asks instead: "where does the spec admit multiple valid
implementations?" This requires the model to:

1. Read the text as an implementer would (not a reviewer)
2. Generate two interpretations that are both valid (neither is wrong)
3. Show why the divergence matters cross-component

The distinguishing characteristic: findings are not bugs or gaps; they are
**underspecification**. The spec author wrote something reasonable
but didn't realize it could be read two ways by different engineers working on
different components.
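
One way to hold findings of this kind is a small record type mirroring the fields each ambiguity had to include (quoted text, two interpretations, divergence impact, severity). This is a sketch: the class name, field names, and severity scale are assumptions, and the interpretation strings paraphrase the central ambiguity instead of quoting model output.

```python
from dataclasses import dataclass

SEVERITIES = ("low", "medium", "high")  # assumed scale; the source lists no levels


@dataclass(frozen=True)
class AmbiguityFinding:
    quoted_text: tuple  # verbatim spec excerpts the finding rests on
    interpretation_a: str
    interpretation_b: str
    divergence_impact: str
    severity: str

    def is_well_formed(self) -> bool:
        """Check that the finding has quotes, two distinct readings, an impact, and a known severity."""
        return (
            len(self.quoted_text) > 0
            and self.interpretation_a.strip() != ""
            and self.interpretation_b.strip() != ""
            and self.interpretation_a != self.interpretation_b
            and self.divergence_impact.strip() != ""
            and self.severity in SEVERITIES
        )


central = AmbiguityFinding(
    quoted_text=(
        "strategies subscribe to market data at session open",
        "enabling the first strategy can trigger activation",
    ),
    interpretation_a="Activation happens at session open.",
    interpretation_b="Activation happens when the first strategy is enabled.",
    divergence_impact="A strategy enabled mid-session either runs immediately or waits for the next open.",
    severity="high",
)
print(central.is_well_formed())  # True
```

The `interpretation_a != interpretation_b` check is the cheap part of step 2; whether both readings are actually valid still needs human review.
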

## All models converge, but that's the point:

Unlike previous experiments where models had dramatically different finding counts
(GPT-5: 20-35, Opus: 10-13, Sonnet: 7-17 for assumption-finding), here the range
is tight (6-8 findings). The convergence suggests that:

1. The input documents are relatively short (199 lines combined), leaving less room for
   divergence
2. Implementation ambiguity is a more constrained task than open-ended analysis:
   each finding needs quoted text from both docs plus two valid interpretations, which
   naturally limits the space
3. The core ambiguity (activation vs session) is so dominant that all models spend
   significant output budget exploring its variations

The value differentiation is in which ambiguities each model finds beyond the core:

- GPT-5 extends to system-wide concerns (timezone, P&L timing)
- Opus finds interaction tensions (session events triggering config state machines)
- Sonnet finds implementation-level confusions (init vs activation, event scope)

## Practical implication:

**Implementation ambiguity analysis is ideal for pre-implementation review.** Before
assigning two engineers to work on interacting components, run this analysis on their
respective spec documents. The findings directly identify coordination points that
need explicit resolution before implementation begins. The cost (48-78s, roughly
2.4-9.2K output tokens) is trivial compared to the debugging cost of discovering these
ambiguities as bugs in production.

**Model recommendation for this task:**

- **Sonnet** for quick pre-implementation review (most implementer-focused findings, fastest)
- **Opus** for design-review contexts (finds where subsystem semantics leak across boundaries)
- **GPT-5** when you need exhaustive coverage including system-wide implications

All three are viable; the gap between them is smaller here than for other analytical tasks.
This may be the first task type where Sonnet's findings are qualitatively as valuable as
the other two models' findings (just different in character).