Finding 51: Implementation Ambiguity Analysis — New Analytical Lens; All Models Converge on Core Ambiguity

Date: 2026-05-08
Task: Identify implementation ambiguities in two related gargoyle design documents (market-sessions.md, 102 lines, and strategy-config.md, 97 lines): places where the spec is clear enough to design from but ambiguous enough that two engineers could reasonably implement different behaviors from the same text.
How we used them: Both documents (full text) combined in a single prompt with a structured analytical question. Each ambiguity required: quoted text, two interpretations, cross-component divergence impact, severity. Tested via HAI proxy (OpenAI endpoint for GPT-5, Anthropic endpoint for Claude models). No tools, no project context beyond the two documents.

Model	Time	Output tokens	Reasoning tokens	Ambiguities found
GPT-5	78s	9,190	6,784	8
Claude Opus 4.6	56s	2,558	(internal)	7
Claude Sonnet 4.6	48s	2,353	(internal)	6

What they found — common ground (all 3 identified):

The central ambiguity: activation vs session lifecycle. All three models independently identified the same core tension: Document 1 says "strategies subscribe to market data at session open" while Document 2 says "enabling the first strategy can trigger activation." Does activation happen at session open, at eligibility, or both? All three agree this is the highest-severity cross-document ambiguity.

The config-restart-session ambiguity. All three identified that "configuration changes take effect on the next decision engine restart" is ambiguous relative to session boundaries — does session open/close constitute a "restart"? Two different interpretations lead to fundamentally different user expectations about when parameter changes apply.

Mid-session activation semantics. All three identified the scenario where a user enables their first strategy during an active session — does the engine activate immediately, or defer to next session open?
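
The two readings of mid-session activation can be made concrete with a minimal sketch. The `Engine` class, its method names, and the `policy` values are hypothetical and exist only for illustration; neither design document defines them, and the rule that an eligible engine activates at session open is assumed for both readings.

```python
from dataclasses import dataclass


@dataclass
class Engine:
    """Hypothetical decision engine; `policy` selects one reading of the spec."""
    policy: str  # "immediate" or "defer_to_open" (illustrative names)
    session_open: bool = False
    enabled_strategies: int = 0
    active: bool = False

    def on_session_opened(self) -> None:
        self.session_open = True
        # Assumed in both readings: an eligible engine activates at session open.
        if self.enabled_strategies > 0:
            self.active = True

    def on_strategy_enabled(self) -> None:
        self.enabled_strategies += 1
        # Reading A ("immediate"): enabling the first strategy activates right away.
        # Reading B ("defer_to_open"): activation waits for the next session open.
        if self.policy == "immediate" and self.session_open:
            self.active = True


def enable_mid_session(policy: str) -> bool:
    """Open a session with no strategies enabled, then enable the first one."""
    engine = Engine(policy=policy)
    engine.on_session_opened()
    engine.on_strategy_enabled()
    return engine.active
```

Under these assumptions, `enable_mid_session("immediate")` leaves the engine active while `enable_mid_session("defer_to_open")` leaves it idle until the next session, which is exactly the divergence two engineers could ship from the same text.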

GPT-5 unique findings (not in either Claude model):

Date field in session domain events (#5): The date field in session.opened/session.closed events could be ET trading date or UTC date. Components keying off different interpretations would attribute after-hours activity to different days.
P&L snapshot timing vs after-hours attribution (#8): If P&L snapshots at 4:00 PM but after-hours fills until 8:00 PM are attributed to the same trading day, the "daily P&L" has no defined finalization time.
Aggregation lifecycle across session boundaries (#4): Are incomplete signal groups flushed at session close, or do they persist? If a group started at 3:50 PM has a 60-minute timeout, does it expire after-hours?
Strategies running outside session (#6): Are strategies shut down outside sessions (engine paused), or do they run continuously with risk controls blocking the output? Different implementations affect aggregation state and warmup behavior.
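
The date-field and aggregation-timeout findings both reduce to simple time arithmetic, sketched below. Several details are assumptions for illustration: the fixed UTC-5 offset stands in for Eastern Time in winter, the example date and fill time are invented, and the 4:00 PM session close is inferred from the P&L snapshot time mentioned above.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC-5 offset stands in for Eastern Time in winter.
# A real system would use a tz database zone such as America/New_York.
ET = timezone(timedelta(hours=-5))

# Hypothetical after-hours fill at 7:30 PM ET on 2026-01-15.
fill = datetime(2026, 1, 15, 19, 30, tzinfo=ET)

et_date = fill.date()                            # calendar date in ET
utc_date = fill.astimezone(timezone.utc).date()  # calendar date in UTC

# Signal group from the text: started at 3:50 PM with a 60-minute timeout.
session_close = datetime(2026, 1, 15, 16, 0, tzinfo=ET)  # assumed close time
group_start = datetime(2026, 1, 15, 15, 50, tzinfo=ET)
group_expiry = group_start + timedelta(minutes=60)
expires_after_close = group_expiry > session_close
```

Here `et_date` is 2026-01-15 while `utc_date` is 2026-01-16, so two components keying off different readings of the date field would file the same fill under different days. The signal group's timeout lands at 4:50 PM, after the assumed close, which is why the flush-or-persist question has to be answered explicitly.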

Claude Opus unique findings (not in either other model):

"Shut down at close" interaction with deactivation (#2): If "some shut down at close" causes the enabled strategy count to reach zero, does this trigger the "last strategy disabled → deactivation" path? This could unintentionally deactivate the entire engine when only session-sensitive strategies should pause.
"Each component decides independently" vs config-driven behavior (#6): If components decide independently what session events mean, but the config layer expects deterministic restart semantics, there's no single authority on when "restart" occurs.
"Snapshots backfill" ambiguity for engine state (#7): The missed-event recovery mechanism says "subscribers trigger on next check" — but for the decision engine, missing session.closed means potentially running all night.

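The "shut down at close" finding above comes down to which count the deactivation rule reads. A minimal sketch, assuming a hypothetical `Strategy` record that separates the user's config flag from runtime state (neither document specifies such a split):

```python
from dataclasses import dataclass


@dataclass
class Strategy:
    """Hypothetical strategy record; field names are illustrative."""
    name: str
    enabled: bool  # user configuration
    running: bool  # runtime state


def should_deactivate(strategies: list[Strategy], count_by: str) -> bool:
    """Return True if the 'last strategy disabled' path would fire.

    count_by="running": a strategy that shut down at close counts as gone.
    count_by="enabled": only the user's config flag matters.
    """
    if count_by == "running":
        return not any(s.running for s in strategies)
    return not any(s.enabled for s in strategies)


# After session close: the only strategy is still enabled but has shut down.
after_close = [Strategy("session_sensitive", enabled=True, running=False)]
```

With `count_by="running"` the engine deactivates at every session close; with `count_by="enabled"` it stays active and only the strategy pauses.
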
Claude Sonnet unique findings (not in either other model):

"Skips the instance" for removed strategy (#3): If a strategy is removed from the system but a user has it configured and enabled, does "skips" mean the engine still activates with remaining strategies? What if that's the only configured strategy? The "at least one enabled strategy" prerequisite doesn't account for enabled-but-unresolvable strategies.
High-water mark reset timing (#4): Is the HWM reset tied to session open (the event) or decision engine activation? For mid-session activation, these diverge — the HWM baseline could be session-open portfolio value or activation-time value.
"At startup" vs "at activation" (#5): Aggregation config is "consumed by the aggregator at startup" — but is "startup" application boot or decision engine activation? If boot, config changes after boot but before activation are missed.
Configuration change event scope (#6): Does the "configuration change event" fire only for enable/disable, or for any config mutation (parameter changes)? If broader, the engine may receive events that trigger re-evaluation but shouldn't cause hot-reload.
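
The "skips the instance" edge case can be sketched as two versions of the activation prerequisite. The function name, the `strict` flag, and the strategy names are hypothetical; the documents only state that at least one enabled strategy is required.

```python
def engine_may_activate(enabled: set[str], registry: set[str], strict: bool) -> bool:
    """Check the 'at least one enabled strategy' prerequisite.

    strict=False: any enabled config entry satisfies the prerequisite.
    strict=True: only enabled strategies that still resolve in the registry count.
    """
    if strict:
        return len(enabled & registry) > 0
    return len(enabled) > 0


# The user's only enabled strategy has been removed from the system.
enabled = {"momentum_v1"}   # hypothetical strategy name
registry = {"mean_revert"}  # strategies the system can still resolve
```

In the lenient reading the engine activates with nothing to run; in the strict reading it never activates, and the user sees an enabled strategy that silently does nothing.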

Quality assessment:

GPT-5 found the most ambiguities (8) and was the only model to identify the date/timezone and P&L-timing ambiguities (findings that span beyond the two documents into system-wide consistency). Its unique findings extend further from the documents' explicit text into operational consequences. Every finding includes both interpretations clearly stated and a specific cross-component failure scenario. The aggregation lifecycle finding (#4) is architecturally significant — it identifies a design gap that neither document addresses. However, GPT-5 used 9,190 output tokens (3.6× Opus's 2,558) for 14% more findings — less token-efficient per finding.
Claude Opus found 7 ambiguities in 56s with only 2,558 tokens — the most concise output. Its unique findings focus on design tensions within the interaction (session-driven deactivation vs config-driven deactivation, component independence vs deterministic behavior). The "shut down at close" finding (#2) is genuinely insightful: it identifies a scenario where a session lifecycle event could accidentally trigger a config-layer state machine transition (deactivation). This is Opus's characteristic strength — reasoning about where one subsystem's behavior inadvertently triggers another subsystem's semantics.
Claude Sonnet found 6 ambiguities in 48s with 2,353 tokens. Notably, Sonnet found the most implementation-specific ambiguities — the "skips the instance" edge case, the "at startup" vs "at activation" timing, and the config change event scope. These are the kinds of ambiguities that would bite an engineer writing the actual code (GenServer init vs application startup, event scope design). Sonnet appears to reason more from an implementer's perspective ("if I were coding this, what would I be unsure about?") while Opus reasons from a designer's perspective ("if I were reviewing this architecture, what tensions exist?").
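
The token-efficiency comparison in the GPT-5 assessment can be checked directly from the results table. The snippet below only restates that arithmetic; the variable names are illustrative.

```python
# Figures taken from the results table above.
results = {
    "GPT-5": {"output_tokens": 9190, "findings": 8},
    "Claude Opus 4.6": {"output_tokens": 2558, "findings": 7},
    "Claude Sonnet 4.6": {"output_tokens": 2353, "findings": 6},
}

# Output tokens spent per ambiguity found, rounded to the nearest token.
tokens_per_finding = {
    model: round(r["output_tokens"] / r["findings"])
    for model, r in results.items()
}

gpt5, opus = results["GPT-5"], results["Claude Opus 4.6"]
token_ratio = gpt5["output_tokens"] / opus["output_tokens"]
extra_findings_pct = (gpt5["findings"] / opus["findings"] - 1) * 100
```

This gives roughly 1,149 tokens per finding for GPT-5 against 365 for Opus and 392 for Sonnet, with a token ratio of about 3.6 and about 14% more findings, matching the figures quoted above.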

Key insight — Implementation Ambiguity Analysis as a task type:

This is a genuinely NEW analytical lens not previously tested. Unlike:

Assumption-finding ("what must be true for this to work?")
Gap-finding ("what's missing?")
Race conditions ("what ordering hazards exist?")
Cross-doc consistency ("do these docs contradict?")

...implementation ambiguity asks: "where does the spec admit multiple valid implementations?" This requires the model to:

Read the text as an implementer would (not a reviewer)
Generate two interpretations that are both valid (neither is wrong)
Show why the divergence matters cross-component
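
The required shape of a finding can be captured as a small record. This is a sketch only: the class, its field names, and the severity scale are hypothetical, chosen to mirror the parts each ambiguity had to include (quoted text, two interpretations, cross-component impact, severity).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AmbiguityFinding:
    """One finding; field names are hypothetical, mirroring the task's required parts."""
    quoted_text: str
    interpretation_a: str
    interpretation_b: str
    cross_component_impact: str
    severity: str  # e.g. "high", "medium", "low" (assumed scale)

    def is_well_formed(self) -> bool:
        """Every part is present and the two interpretations actually differ."""
        parts = (self.quoted_text, self.interpretation_a,
                 self.interpretation_b, self.cross_component_impact, self.severity)
        all_present = all(p.strip() for p in parts)
        return all_present and self.interpretation_a != self.interpretation_b


# Example built from the config-restart ambiguity described earlier.
finding = AmbiguityFinding(
    quoted_text="configuration changes take effect on the next decision engine restart",
    interpretation_a="Session open/close counts as a restart.",
    interpretation_b="Only an explicit engine restart counts.",
    cross_component_impact="Users and components disagree on when parameters apply.",
    severity="high",
)
```

A check like `is_well_formed` is one way to reject outputs that quote text but offer only a single reading, which would make them gap or bug reports instead of ambiguity findings.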

The distinguishing characteristic: findings are not bugs or gaps; they are places where the spec is underspecified. The spec author wrote something reasonable, but didn't realize it could be read two ways by different engineers working on different components.

All models converge — but that's the point:

Unlike previous experiments where models had dramatically different finding counts (GPT-5: 20-35, Opus: 10-13, Sonnet: 7-17 for assumption-finding), here the range is tight (6-8 findings). The convergence suggests that:

The input documents are relatively short (199 lines combined) — less room for divergence
Implementation ambiguity is a more constrained task than open-ended analysis — you need quoted text from both docs plus two valid interpretations, which naturally limits the space
The core ambiguity (activation vs session) is SO dominant that all models spend significant output budget exploring its variations

The value differentiation is in WHICH ambiguities each model finds beyond the core:

GPT-5 extends to system-wide concerns (timezone, P&L timing)
Opus finds interaction tensions (session events triggering config state machines)
Sonnet finds implementation-level confusions (init vs activation, event scope)

Practical implication:

Implementation ambiguity analysis is ideal for pre-implementation review. Before assigning two engineers to work on interacting components, run this analysis on their respective spec documents. The findings directly identify coordination points that need explicit resolution before implementation begins. The cost (48-78s, 2-9K tokens) is trivial compared to the debugging cost of discovering these ambiguities as bugs in production.
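
A pre-implementation review of this kind could be driven by a small prompt builder like the one below. It is a sketch under stated assumptions: the function name, the section delimiters, and the question wording are illustrative and are not the prompt used in this experiment; only the four required fields come from the task description.

```python
# Required parts of each ambiguity, as listed in the task description.
REQUIRED_FIELDS = [
    "quoted text",
    "two interpretations",
    "cross-component divergence impact",
    "severity",
]


def build_prompt(docs: dict[str, str]) -> str:
    """Combine full document texts with one structured question (wording is illustrative)."""
    sections = [f"=== {name} ===\n{text}" for name, text in docs.items()]
    fields = "\n".join(f"- {field}" for field in REQUIRED_FIELDS)
    question = (
        "Identify places where two engineers could reasonably implement "
        "different behaviors from the same text. For each ambiguity give:\n"
        f"{fields}"
    )
    return "\n\n".join(sections + [question])


prompt = build_prompt({
    "market-sessions.md": "(full text of the first spec)",
    "strategy-config.md": "(full text of the second spec)",
})
```

Passing both specs in full, in one prompt, matters for this task because the highest-severity findings came from reading one document's wording against the other's.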

Model recommendation for this task:

Sonnet for quick pre-implementation review (most implementer-focused findings, fastest)
Opus for design-review contexts (finds where subsystem semantics leak across boundaries)
GPT-5 when you need exhaustive coverage including system-wide implications

All three are viable — the gap between them is smaller here than for other analytical tasks. This may be the first task type where Sonnet's findings are qualitatively AS valuable as the reasoning models' findings (just different in character).
