# Finding 51: Implementation Ambiguity Analysis — New Analytical Lens; All Models Converge on Core Ambiguity

**Date:** 2026-05-08

**Task:** Identify implementation ambiguities in two related gargoyle design documents
(`market-sessions.md` 102 lines + `strategy-config.md` 97 lines): places where the
spec is clear enough to design from but ambiguous enough that two engineers could
reasonably implement different behaviors from the same text.

**How we used them:** Both documents (full text) combined in a single prompt with a
structured analytical question. Each ambiguity required: quoted text, two interpretations,
cross-component divergence impact, severity. Tested via HAI proxy (OpenAI endpoint for
GPT-5, Anthropic endpoint for Claude models). No tools, no project context beyond the
two documents.

| Model | Time | Output tokens | Reasoning tokens | Ambiguities found |
|---|---|---|---|---|
| GPT-5 | 78s | 9,190 | 6,784 | 8 |
| Claude Opus 4.6 | 56s | 2,558 | (internal) | 7 |
| Claude Sonnet 4.6 | 48s | 2,353 | (internal) | 6 |

## What they found (common ground, all 3 identified):

**The central ambiguity: activation vs session lifecycle.** All three models independently
identified the same core tension: Document 1 says "strategies subscribe to market data
at session open" while Document 2 says "enabling the first strategy can trigger
activation." Does activation happen at session open, at eligibility, or both? All three
agree this is the highest-severity cross-document ambiguity.

**The config-restart-session ambiguity.** All three identified that "configuration changes
take effect on the next decision engine restart" is ambiguous relative to session
boundaries: does session open/close constitute a "restart"? The two interpretations
lead to fundamentally different user expectations about when parameter changes apply.

**Mid-session activation semantics.** All three identified the scenario where a user
enables their first strategy during an active session: does the engine activate
immediately, or defer to the next session open?

## GPT-5 unique findings (not in either Claude model):

- **Date field in session domain events** (#5): The `date` field in
  `session.opened`/`session.closed` events could be ET trading date or UTC date.
  Components keying off different interpretations would attribute after-hours activity
  to different days.
- **P&L snapshot timing vs after-hours attribution** (#8): If P&L snapshots at 4:00 PM
  but after-hours fills until 8:00 PM are attributed to the same trading day, the
  "daily P&L" has no defined finalization time.
- **Aggregation lifecycle across session boundaries** (#4): Are incomplete signal groups
  flushed at session close, or do they persist? If a group started at 3:50 PM has a
  60-minute timeout, does it expire after-hours?
- **Strategies running outside session** (#6): Are strategies shut down outside sessions
  (engine paused), or do they run continuously with risk controls blocking the output?
  Different implementations affect aggregation state and warmup behavior.
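
The date-field ambiguity (#5) can be made concrete with a small sketch. Everything here is an assumption for illustration: the function names are hypothetical, the fill time is invented, and a fixed UTC-5 offset stands in for Eastern Time on a winter day (a real implementation would need a DST-aware zone). Neither function is claimed to be what the gargoyle documents specify; they are the two readings side by side.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC-5 offset approximates Eastern Standard Time (winter).
# A production system would use a tz-database zone so DST is handled.
EST = timezone(timedelta(hours=-5), name="EST")


def et_trading_date(ts: datetime) -> str:
    """Interpretation A: the event `date` is the ET calendar date."""
    return ts.astimezone(EST).date().isoformat()


def utc_date(ts: datetime) -> str:
    """Interpretation B: the event `date` is the UTC calendar date."""
    return ts.astimezone(timezone.utc).date().isoformat()


# Hypothetical after-hours fill at 7:30 PM ET, inside the 4:00-8:00 PM window.
fill_time = datetime(2026, 1, 15, 19, 30, tzinfo=EST)

print(et_trading_date(fill_time))  # 2026-01-15
print(utc_date(fill_time))         # 2026-01-16
```

Two components that each pick one reading would key the same fill to different days, which is the cross-component divergence the finding describes.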

## Claude Opus unique findings (not in either other model):

- **"Shut down at close" interaction with deactivation** (#2): If "some shut down at
  close" causes the enabled strategy count to reach zero, does this trigger the "last
  strategy disabled → deactivation" path? This could unintentionally deactivate the
  entire engine when only session-sensitive strategies should pause.
- **"Each component decides independently" vs config-driven behavior** (#6): If
  components decide independently what session events mean, but the config layer expects
  deterministic restart semantics, there's no single authority on when "restart" occurs.
- **"Snapshots backfill" ambiguity for engine state** (#7): The missed-event recovery
  mechanism says "subscribers trigger on next check", but for the decision engine,
  missing `session.closed` means potentially running all night.
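
A minimal sketch of the "shut down at close" finding (#2), under stated assumptions: the `Strategy` class, the `session_sensitive` flag, and both `should_deactivate_*` functions are hypothetical names invented here, not taken from either design document. The point is only that "enabled strategy count" can be read as a runtime count or a configuration count.

```python
from dataclasses import dataclass


@dataclass
class Strategy:
    name: str
    enabled: bool            # what the user configured
    session_sensitive: bool  # hypothetical flag: shuts down at session close
    running: bool = True     # runtime state


def on_session_closed(strategies: list[Strategy]) -> None:
    """Stop session-sensitive strategies without touching user configuration."""
    for s in strategies:
        if s.session_sensitive:
            s.running = False


def should_deactivate_by_running(strategies: list[Strategy]) -> bool:
    """Interpretation A: deactivate the engine when nothing is running."""
    return not any(s.running for s in strategies)


def should_deactivate_by_config(strategies: list[Strategy]) -> bool:
    """Interpretation B: deactivate only when nothing is enabled in config."""
    return not any(s.enabled for s in strategies)


strategies = [Strategy("momentum", enabled=True, session_sensitive=True)]
on_session_closed(strategies)

print(should_deactivate_by_running(strategies))  # True
print(should_deactivate_by_config(strategies))   # False
```

Under interpretation A a routine session close tears down the whole engine; under B the engine stays active with zero running strategies.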

## Claude Sonnet unique findings (not in either other model):

- **"Skips the instance" for removed strategy** (#3): If a strategy is removed from
  the system but a user has it configured and enabled, does "skips" mean the engine
  still activates with remaining strategies? What if that's the only configured strategy?
  The "at least one enabled strategy" prerequisite doesn't account for
  enabled-but-unresolvable strategies.
- **High-water mark reset timing** (#4): Is the HWM reset tied to session open (the
  event) or decision engine activation? For mid-session activation, these diverge:
  the HWM baseline could be session-open portfolio value or activation-time value.
- **"At startup" vs "at activation"** (#5): Aggregation config is "consumed by the
  aggregator at startup", but is "startup" application boot or decision engine
  activation? If boot, config changes after boot but before activation are missed.
- **Configuration change event scope** (#6): Does the "configuration change event"
  fire only for enable/disable, or for any config mutation (parameter changes)?
  If broader, the engine may receive events that trigger re-evaluation but shouldn't
  cause hot-reload.
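
The "skips the instance" finding (#3) reduces to which set the "at least one enabled strategy" prerequisite is evaluated over. The sketch below is illustrative only: `REGISTRY`, the config shape, and the strategy names are assumptions, not structures from `strategy-config.md`.

```python
# Hypothetical set of strategies that still exist in the system.
REGISTRY = {"mean_reversion"}

# Hypothetical user config: the only enabled strategy has been removed.
user_config = [
    {"strategy": "legacy_breakout", "enabled": True},
]


def eligible_by_config(config: list[dict]) -> bool:
    """Interpretation A: any enabled config entry satisfies the prerequisite."""
    return any(entry["enabled"] for entry in config)


def eligible_by_resolution(config: list[dict], registry: set[str]) -> bool:
    """Interpretation B: only enabled entries that still resolve are counted."""
    return any(
        entry["enabled"] and entry["strategy"] in registry for entry in config
    )


print(eligible_by_config(user_config))                # True
print(eligible_by_resolution(user_config, REGISTRY))  # False
```

With interpretation A the engine activates and then skips its only strategy, leaving it active with nothing to run; with B it never activates.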

## Quality assessment:

- **GPT-5** found the most ambiguities (8) and was the only model to identify the
  date/timezone and P&L-timing ambiguities (findings that span beyond the two documents
  into system-wide consistency). Its unique findings extend further from the documents'
  explicit text into operational consequences. Every finding includes both interpretations
  clearly stated and a specific cross-component failure scenario. The aggregation
  lifecycle finding (#4) is architecturally significant: it identifies a design gap
  that neither document addresses. However, GPT-5 used 9,190 output tokens (3.6× Opus's
  2,558) for 14% more findings (8 vs 7), making it less token-efficient per finding.

- **Claude Opus** found 7 ambiguities in 56s with only 2,558 output tokens. Its unique
  findings focus on *design tensions within the interaction*
  (session-driven deactivation vs config-driven deactivation, component independence
  vs deterministic behavior). The "shut down at close" finding (#2) is genuinely
  insightful: it identifies a scenario where a session lifecycle event could
  accidentally trigger a config-layer state machine transition (deactivation). This
  is Opus's characteristic strength: reasoning about where one subsystem's behavior
  inadvertently triggers another subsystem's semantics.

- **Claude Sonnet** found 6 ambiguities in 48s with 2,353 output tokens, the smallest
  output of the three. Notably, Sonnet
  found the most *implementation-specific* ambiguities: the "skips the instance"
  edge case, the "at startup" vs "at activation" timing, and the config change event
  scope. These are the kinds of ambiguities that would bite an engineer writing the
  actual code (GenServer init vs application startup, event scope design). Sonnet
  appears to reason more from an implementer's perspective ("if I were coding this,
  what would I be unsure about?") while Opus reasons from a designer's perspective
  ("if I were reviewing this architecture, what tensions exist?").
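
The token-efficiency comparison can be checked with a few lines of arithmetic. The numbers come from the results table; the only assumption is that the "Output tokens" column is the right basis for comparison (the table does not say whether GPT-5's 9,190 output tokens include its 6,784 reasoning tokens).

```python
# Figures copied from the results table.
runs = {
    "GPT-5": {"output_tokens": 9190, "findings": 8},
    "Claude Opus 4.6": {"output_tokens": 2558, "findings": 7},
    "Claude Sonnet 4.6": {"output_tokens": 2353, "findings": 6},
}

# Output tokens spent per ambiguity found.
tokens_per_finding = {
    model: run["output_tokens"] / run["findings"] for model, run in runs.items()
}
for model, value in tokens_per_finding.items():
    print(f"{model}: {value:.0f} output tokens per finding")
# GPT-5: 1149 output tokens per finding
# Claude Opus 4.6: 365 output tokens per finding
# Claude Sonnet 4.6: 392 output tokens per finding

# GPT-5 relative to Opus: token ratio and extra findings.
token_ratio = runs["GPT-5"]["output_tokens"] / runs["Claude Opus 4.6"]["output_tokens"]
extra_findings = runs["GPT-5"]["findings"] / runs["Claude Opus 4.6"]["findings"] - 1
print(f"{token_ratio:.1f}x tokens for {extra_findings:.0%} more findings")
# 3.6x tokens for 14% more findings
```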

## Key insight: Implementation Ambiguity Analysis as a task type

This is a genuinely NEW analytical lens not previously tested. Unlike:
- **Assumption-finding** ("what must be true for this to work?")
- **Gap-finding** ("what's missing?")
- **Race conditions** ("what ordering hazards exist?")
- **Cross-doc consistency** ("do these docs contradict?")

...implementation ambiguity asks: "where does the spec admit multiple valid
implementations?" This requires the model to:
1. Read the text as an implementer would (not a reviewer)
2. Generate two interpretations that are BOTH valid (neither is wrong)
3. Show why the divergence matters cross-component

The distinguishing characteristic: findings are not bugs or gaps. They are
**underspecification**: the spec author wrote something reasonable
but didn't realize it could be read two ways by different engineers working on
different components.
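
One way to picture what a single finding must contain is as a record with the four required parts (quoted text, two interpretations, cross-component divergence impact, severity). This is a sketch, not the format any model actually returned: the class name, field names, the `is_well_formed` check, and the severity scale are all assumptions. The quotes in the example are the two cited for the core ambiguity; the interpretation and impact strings are illustrative paraphrases.

```python
from dataclasses import dataclass

SEVERITIES = {"high", "medium", "low"}  # assumed scale, not from the prompt


@dataclass(frozen=True)
class AmbiguityFinding:
    quoted_text: tuple[str, ...]  # verbatim spec text the finding rests on
    interpretation_a: str         # one valid reading
    interpretation_b: str         # a different, equally valid reading
    divergence_impact: str        # what breaks when two components disagree
    severity: str

    def is_well_formed(self) -> bool:
        """A finding needs quotes, two distinct readings, an impact, a severity."""
        return (
            len(self.quoted_text) > 0
            and bool(self.interpretation_a.strip())
            and bool(self.interpretation_b.strip())
            and self.interpretation_a != self.interpretation_b
            and bool(self.divergence_impact.strip())
            and self.severity in SEVERITIES
        )


core = AmbiguityFinding(
    quoted_text=(
        "strategies subscribe to market data at session open",
        "enabling the first strategy can trigger activation.",
    ),
    interpretation_a="Activation happens only at session open.",
    interpretation_b="Activation happens as soon as the user becomes eligible.",
    divergence_impact="Components disagree on whether a mid-session enable starts the engine.",
    severity="high",
)
print(core.is_well_formed())  # True
```

Requiring two distinct interpretations is what separates this record from a gap or bug report, where a single "expected vs actual" pair would do.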

## All models converge, but that's the point:

Unlike previous experiments where models had dramatically different finding counts
(GPT-5: 20-35, Opus: 10-13, Sonnet: 7-17 for assumption-finding), here the range
is tight (6-8 findings). The convergence suggests that:

1. The input documents are relatively short (199 lines combined), leaving less room for
   divergence
2. Implementation ambiguity is a more constrained task than open-ended analysis:
   you need quoted text from both docs plus two valid interpretations, which naturally
   limits the space
3. The core ambiguity (activation vs session) is SO dominant that all models spend
   significant output budget exploring its variations

The value differentiation is in WHICH ambiguities each model finds beyond the core:
- GPT-5 extends to system-wide concerns (timezone, P&L timing)
- Opus finds interaction tensions (session events triggering config state machines)
- Sonnet finds implementation-level confusions (init vs activation, event scope)

## Practical implication:

**Implementation ambiguity analysis is ideal for pre-implementation review.** Before
assigning two engineers to work on interacting components, run this analysis on their
respective spec documents. The findings directly identify coordination points that
need explicit resolution before implementation begins. The cost (48-78s, 2-9K tokens)
is trivial compared to the debugging cost of discovering these ambiguities as bugs
in production.

**Model recommendation for this task:**
- **Sonnet** for quick pre-implementation review (most implementer-focused findings, fastest)
- **Opus** for design-review contexts (finds where subsystem semantics leak across boundaries)
- **GPT-5** when you need exhaustive coverage including system-wide implications

All three are viable: the gap between them is smaller here than for other analytical tasks.
This may be the first task type where Sonnet's findings are qualitatively AS valuable as
the reasoning models' findings (just different in character).