# Finding 51: Implementation Ambiguity Analysis (New Analytical Lens); All Models Converge on Core Ambiguity

**Date:** 2026-05-08
**Task:** Identify implementation ambiguities in two related gargoyle design documents
(`market-sessions.md`, 102 lines, and `strategy-config.md`, 97 lines): places where the
spec is clear enough to design from but ambiguous enough that two engineers could
reasonably implement different behaviors from the same text.
**How we used them:** Both documents (full text) were combined in a single prompt with a
structured analytical question. Each ambiguity required quoted text, two interpretations,
cross-component divergence impact, and severity. Tested via HAI proxy (OpenAI endpoint for
GPT-5, Anthropic endpoint for Claude models). No tools, no project context beyond the
two documents.

| Model | Time | Output tokens | Reasoning tokens | Ambiguities found |
|---|---|---|---|---|
| GPT-5 | 78s | 9,190 | 6,784 | 8 |
| Claude Opus 4.6 | 56s | 2,558 | (internal) | 7 |
| Claude Sonnet 4.6 | 48s | 2,353 | (internal) | 6 |

## What they found (common ground, all 3 identified):

**The central ambiguity: activation vs session lifecycle.** All three models independently
identified the same core tension: Document 1 says "strategies subscribe to market data
at session open" while Document 2 says "enabling the first strategy can trigger
activation." Does activation happen at session open, at eligibility, or both? All three
agree this is the highest-severity cross-document ambiguity.

**The config-restart-session ambiguity.** All three identified that "configuration changes
take effect on the next decision engine restart" is ambiguous relative to session
boundaries: does session open/close constitute a "restart"? The two interpretations
lead to fundamentally different user expectations about when parameter changes apply.

**Mid-session activation semantics.** All three identified the scenario where a user
enables their first strategy during an active session: does the engine activate
immediately, or defer to the next session open?

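
The mid-session scenario above can be made concrete with a small event replay. This is a minimal sketch, not the gargoyle implementation: the `strategy.enabled` event name, the `replay` function, and the two policy labels are hypothetical, and the sketch assumes session close deactivates the engine (which is itself one of the open questions).

```python
def replay(events, policy):
    """Return the engine's active flag after each event under one reading.

    policy="immediate": enabling a strategy mid-session activates right away.
    policy="deferred":  activation is only evaluated when session.opened fires.
    """
    session_open = False
    enabled = 0
    active = False
    states = []
    for event in events:
        if event == "session.opened":
            session_open = True
            active = enabled > 0          # both readings activate here if eligible
        elif event == "session.closed":
            session_open = False
            active = False                # assumption: close always deactivates
        elif event == "strategy.enabled":  # hypothetical event name
            enabled += 1
            if policy == "immediate" and session_open:
                active = True
        states.append(active)
    return states


timeline = ["session.opened", "strategy.enabled", "session.closed", "session.opened"]
immediate = replay(timeline, "immediate")
deferred = replay(timeline, "deferred")
```

The two readings disagree only on the second event: under the immediate reading the engine trades for the rest of the first session, under the deferred reading it sits idle until the next open.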
## GPT-5 unique findings (not in either Claude model):

- **Date field in session domain events** (#5): The `date` field in `session.opened`/
  `session.closed` events could be ET trading date or UTC date. Components keying off
  different interpretations would attribute after-hours activity to different days.
- **P&L snapshot timing vs after-hours attribution** (#8): If P&L snapshots at 4:00 PM
  but after-hours fills until 8:00 PM are attributed to the same trading day, the
  "daily P&L" has no defined finalization time.
- **Aggregation lifecycle across session boundaries** (#4): Are incomplete signal groups
  flushed at session close, or do they persist? If a group started at 3:50 PM has a
  60-minute timeout, does it expire after-hours?
- **Strategies running outside session** (#6): Are strategies shut down outside sessions
  (engine paused), or do they run continuously with risk controls blocking the output?
  Different implementations affect aggregation state and warmup behavior.

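
Two of the findings above (#5 and #4) reduce to simple timestamp arithmetic. The sketch below is illustrative only: the helper names are hypothetical, the dates are made up, a fixed UTC-5 offset stands in for Eastern Standard Time to keep the example dependency-free (a real system would use a proper time zone database), and the 4:00 PM session close is assumed from the timings quoted in the findings.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed UTC-5 offset as a stand-in for Eastern Standard Time.
ET = timezone(timedelta(hours=-5), "EST")


def trading_date_et(ts):
    """Reading A: the event's date is the ET trading date."""
    return ts.astimezone(ET).date()


def calendar_date_utc(ts):
    """Reading B: the event's date is the UTC calendar date."""
    return ts.astimezone(timezone.utc).date()


# Finding #5: an after-hours fill at 7:30 PM ET lands on different days.
after_hours_fill = datetime(2026, 1, 15, 19, 30, tzinfo=ET)
et_date = trading_date_et(after_hours_fill)
utc_date = calendar_date_utc(after_hours_fill)

# Finding #4: a signal group opened at 3:50 PM with a 60-minute timeout
# outlives an assumed 4:00 PM session close.
session_close = datetime(2026, 1, 15, 16, 0, tzinfo=ET)
group_started = datetime(2026, 1, 15, 15, 50, tzinfo=ET)
group_expires = group_started + timedelta(minutes=60)
expires_after_close = group_expires > session_close
```

A component keyed on `et_date` and one keyed on `utc_date` would file the same fill under consecutive days, which is the divergence the finding describes.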
## Claude Opus unique findings (not in either other model):

- **"Shut down at close" interaction with deactivation** (#2): If "some shut down at
  close" causes the enabled strategy count to reach zero, does this trigger the "last
  strategy disabled → deactivation" path? This could unintentionally deactivate the
  entire engine when only session-sensitive strategies should pause.
- **"Each component decides independently" vs config-driven behavior** (#6): If
  components decide independently what session events mean, but the config layer expects
  deterministic restart semantics, there's no single authority on when "restart" occurs.
- **"Snapshots backfill" ambiguity for engine state** (#7): The missed-event recovery
  mechanism says "subscribers trigger on next check", but for the decision engine,
  missing `session.closed` means potentially running all night.

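
The "shut down at close" finding (#2) comes down to which count feeds the deactivation check. A minimal sketch under stated assumptions: the `Strategy` fields, the strategy names, and both counting functions are hypothetical and only illustrate the two readings.

```python
from dataclasses import dataclass


@dataclass
class Strategy:
    name: str                  # hypothetical strategy names below
    enabled_in_config: bool    # what the user set in configuration
    shut_down_at_close: bool   # whether this strategy stops at session close


def enabled_count_running(strategies, session_closed):
    """Reading A: a strategy shut down at close no longer counts as enabled."""
    return sum(
        1
        for s in strategies
        if s.enabled_in_config and not (session_closed and s.shut_down_at_close)
    )


def enabled_count_config(strategies):
    """Reading B: only the configuration flag counts, whatever the session state."""
    return sum(1 for s in strategies if s.enabled_in_config)


strategies = [
    Strategy("momentum", enabled_in_config=True, shut_down_at_close=True),
    Strategy("mean_reversion", enabled_in_config=True, shut_down_at_close=True),
]

# After session.closed, does the "last strategy disabled" path fire?
deactivates_reading_a = enabled_count_running(strategies, session_closed=True) == 0
deactivates_reading_b = enabled_count_config(strategies) == 0
```

Under reading A the whole engine deactivates at every close; under reading B it stays activated and the strategies merely pause.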
## Claude Sonnet unique findings (not in either other model):

- **"Skips the instance" for removed strategy** (#3): If a strategy is removed from
  the system but a user has it configured and enabled, does "skips" mean the engine
  still activates with remaining strategies? What if that's the only configured strategy?
  The "at least one enabled strategy" prerequisite doesn't account for
  enabled-but-unresolvable strategies.
- **High-water mark reset timing** (#4): Is the HWM reset tied to session open (the
  event) or decision engine activation? For mid-session activation, these diverge:
  the HWM baseline could be session-open portfolio value or activation-time value.
- **"At startup" vs "at activation"** (#5): Aggregation config is "consumed by the
  aggregator at startup", but is "startup" application boot or decision engine
  activation? If boot, config changes after boot but before activation are missed.
- **Configuration change event scope** (#6): Does the "configuration change event"
  fire only for enable/disable, or for any config mutation (parameter changes)?
  If broader, the engine may receive events that trigger re-evaluation but shouldn't
  cause hot-reload.

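
The high-water mark finding (#4) is easy to quantify. The numbers below are invented for illustration, and the 4% drawdown limit is a hypothetical risk control, not something either document specifies.

```python
def drawdown(high_water_mark, current_value):
    """Fractional drop from the high-water mark to the current value."""
    return (high_water_mark - current_value) / high_water_mark


# Hypothetical portfolio values for a mid-session activation.
session_open_value = 100_000.0   # portfolio value when session.opened fired
activation_value = 98_000.0      # portfolio value when the engine activated
current_value = 95_000.0

dd_from_session_open = drawdown(session_open_value, current_value)
dd_from_activation = drawdown(activation_value, current_value)

# Hypothetical risk control: halt trading at a 4% drawdown.
DRAWDOWN_LIMIT = 0.04
halts_under_session_reading = dd_from_session_open >= DRAWDOWN_LIMIT
halts_under_activation_reading = dd_from_activation >= DRAWDOWN_LIMIT
```

With these values the session-open baseline reports a 5% drawdown and halts, while the activation-time baseline reports roughly 3.1% and keeps trading.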
## Quality assessment:

- **GPT-5** found the most ambiguities (8) and was the only model to identify the
  date/timezone and P&L-timing ambiguities (findings that span beyond the two documents
  into system-wide consistency). Its unique findings extend further from the documents'
  explicit text into operational consequences. Every finding includes both interpretations
  clearly stated and a specific cross-component failure scenario. The aggregation
  lifecycle finding (#4) is architecturally significant: it identifies a design gap
  that neither document addresses. However, GPT-5 used 9,190 output tokens (3.6× Opus's
  2,558) for 14% more findings (8 vs 7), so it is less token-efficient per finding.

- **Claude Opus** found 7 ambiguities in 56s with only 2,558 tokens, the most concise
  output per finding. Its unique findings focus on *design tensions within the interaction*
  (session-driven deactivation vs config-driven deactivation, component independence
  vs deterministic behavior). The "shut down at close" finding (#2) is the standout:
  it identifies a scenario where a session lifecycle event could
  accidentally trigger a config-layer state machine transition (deactivation). This
  is Opus's characteristic strength: reasoning about where one subsystem's behavior
  inadvertently triggers another subsystem's semantics.

- **Claude Sonnet** found 6 ambiguities in 48s with 2,353 tokens. Notably, Sonnet
  found the most *implementation-specific* ambiguities: the "skips the instance"
  edge case, the "at startup" vs "at activation" timing, and the config change event
  scope. These are the kinds of ambiguities that would bite an engineer writing the
  actual code (GenServer init vs application startup, event scope design). Sonnet
  appears to reason more from an implementer's perspective ("if I were coding this,
  what would I be unsure about?") while Opus reasons from a designer's perspective
  ("if I were reviewing this architecture, what tensions exist?").

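
The efficiency figures quoted above can be reproduced from the results table. The dictionary layout and variable names are just a convenient way to hold the table's numbers.

```python
# Figures copied from the results table.
runs = {
    "GPT-5": {"seconds": 78, "output_tokens": 9190, "findings": 8},
    "Claude Opus 4.6": {"seconds": 56, "output_tokens": 2558, "findings": 7},
    "Claude Sonnet 4.6": {"seconds": 48, "output_tokens": 2353, "findings": 6},
}

# Output tokens spent per ambiguity found.
tokens_per_finding = {
    model: run["output_tokens"] / run["findings"] for model, run in runs.items()
}

# GPT-5 relative to Opus: token ratio and extra findings.
token_ratio = runs["GPT-5"]["output_tokens"] / runs["Claude Opus 4.6"]["output_tokens"]
extra_findings = runs["GPT-5"]["findings"] / runs["Claude Opus 4.6"]["findings"] - 1

for model, value in tokens_per_finding.items():
    print(f"{model}: {value:.0f} output tokens per finding")
print(f"GPT-5 vs Opus: {token_ratio:.1f}x tokens, {extra_findings:.0%} more findings")
```

Per finding, GPT-5 spends about 1,149 output tokens against roughly 365 for Opus and 392 for Sonnet.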
## Key insight (Implementation Ambiguity Analysis as a task type):

This is a genuinely NEW analytical lens not previously tested. Unlike:
- **Assumption-finding** ("what must be true for this to work?")
- **Gap-finding** ("what's missing?")
- **Race conditions** ("what ordering hazards exist?")
- **Cross-doc consistency** ("do these docs contradict?")

...implementation ambiguity asks: "where does the spec admit multiple valid
implementations?" This requires the model to:
1. Read the text as an implementer would (not a reviewer)
2. Generate two BOTH-VALID interpretations (neither is wrong)
3. Show why the divergence matters cross-component

The distinguishing characteristic: findings are not bugs or gaps; they are
**specification underspecification**. The spec author wrote something reasonable,
but didn't realize it could be read two ways by different engineers working on
different components.

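
The shape of a finding in this task type can be captured as a small record. This is a sketch of one possible structure, not the format any model actually returned: the `AmbiguityFinding` class, its field names, and the three-level severity scale are assumptions; the quoted strings come from the central ambiguity described earlier.

```python
from dataclasses import dataclass

SEVERITIES = ("low", "medium", "high")  # assumed scale, not from the experiment


@dataclass(frozen=True)
class AmbiguityFinding:
    quoted_text: tuple       # verbatim quotes from the spec(s)
    interpretation_a: str    # one valid reading
    interpretation_b: str    # a second, equally valid reading
    divergence_impact: str   # why the two readings clash across components
    severity: str

    def __post_init__(self):
        if not self.quoted_text:
            raise ValueError("a finding must quote the spec text it is based on")
        if self.interpretation_a.strip() == self.interpretation_b.strip():
            raise ValueError("the two interpretations must differ")
        if self.severity not in SEVERITIES:
            raise ValueError(f"severity must be one of {SEVERITIES}")


core = AmbiguityFinding(
    quoted_text=(
        "strategies subscribe to market data at session open",
        "enabling the first strategy can trigger activation",
    ),
    interpretation_a="Activation happens only at session open.",
    interpretation_b="Activation happens as soon as a strategy is enabled.",
    divergence_impact="Components disagree on whether the engine runs mid-session.",
    severity="high",
)
```

Requiring two distinct interpretations in the record is what separates this lens from gap-finding, where a single "what's missing" statement would suffice.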
## All models converge, but that's the point:

Unlike previous experiments where models had dramatically different finding counts
(GPT-5: 20-35, Opus: 10-13, Sonnet: 7-17 for assumption-finding), here the range
is tight (6-8 findings). The convergence suggests that:

1. The input documents are relatively short (199 lines combined), leaving less room for
   divergence
2. Implementation ambiguity is a more constrained task than open-ended analysis:
   you need quoted text from both docs plus two valid interpretations, which naturally
   limits the space
3. The core ambiguity (activation vs session) is SO dominant that all models spend
   significant output budget exploring its variations

The value differentiation is in WHICH ambiguities each model finds beyond the core:
- GPT-5 extends to system-wide concerns (timezone, P&L timing)
- Opus finds interaction tensions (session events triggering config state machines)
- Sonnet finds implementation-level confusions (init vs activation, event scope)

## Practical implication:

**Implementation ambiguity analysis is ideal for pre-implementation review.** Before
assigning two engineers to work on interacting components, run this analysis on their
respective spec documents. The findings directly identify coordination points that
need explicit resolution before implementation begins. The cost (48-78s, 2-9K tokens)
is trivial compared to the debugging cost of discovering these ambiguities as bugs
in production.

**Model recommendation for this task:**
- **Sonnet** for quick pre-implementation review (most implementer-focused findings, fastest)
- **Opus** for design-review contexts (finds where subsystem semantics leak across boundaries)
- **GPT-5** when you need exhaustive coverage including system-wide implications

All three are viable; the gap between them is smaller here than for other analytical tasks.
This may be the first task type where Sonnet's findings are qualitatively AS valuable as
the reasoning models' findings (just different in character).

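
For a pre-implementation review along these lines, the combined prompt can be assembled mechanically. This is a rough sketch only: the function name, the section delimiters, and the wording of the question are assumptions rather than the prompt used in this experiment, and no model API is called.

```python
# Fields each reported ambiguity must include, per the task description.
REQUIRED_FIELDS = (
    "quoted text",
    "two interpretations",
    "cross-component divergence impact",
    "severity",
)


def build_ambiguity_prompt(docs: dict[str, str]) -> str:
    """Combine full spec texts and one structured question into a single prompt."""
    sections = [f"=== {name} ===\n{text.strip()}" for name, text in docs.items()]
    fields = "\n".join(f"- {field}" for field in REQUIRED_FIELDS)
    question = (
        "Identify places where the specs are clear enough to design from but "
        "ambiguous enough that two engineers could reasonably implement "
        "different behaviors.\nFor each ambiguity, provide:\n" + fields
    )
    return "\n\n".join(sections + [question])


prompt = build_ambiguity_prompt(
    {
        "market-sessions.md": "(full text of the first spec)",
        "strategy-config.md": "(full text of the second spec)",
    }
)
```

The resulting string can be sent to whichever model endpoint is in use; keeping both documents in one prompt is what lets the model surface cross-document ambiguities.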