diff --git a/findings/2026-05-08-51-implementation-ambiguity-analysis.md b/findings/2026-05-08-51-implementation-ambiguity-analysis.md
new file mode 100644
index 0000000..0c4bd2e
--- /dev/null
+++ b/findings/2026-05-08-51-implementation-ambiguity-analysis.md
@@ -0,0 +1,166 @@
+# Finding 51: Implementation Ambiguity Analysis — New Analytical Lens; All Models Converge on Core Ambiguity
+
+**Date:** 2026-05-08
+**Task:** Identify implementation ambiguities in two related gargoyle design documents
+(`market-sessions.md` 102 lines + `strategy-config.md` 97 lines) — places where the
+spec is clear enough to design from but ambiguous enough that two engineers could
+reasonably implement different behaviors from the same text.
+**How we used them:** Both documents (full text) combined in a single prompt with a
+structured analytical question. Each ambiguity required: quoted text, two interpretations,
+cross-component divergence impact, severity. Tested via HAI proxy (OpenAI endpoint for
+GPT-5, Anthropic endpoint for Claude models). No tools, no project context beyond the
+two documents.
+
+| Model | Time | Output tokens | Reasoning tokens | Ambiguities found |
+|---|---|---|---|---|
+| GPT-5 | 78s | 9,190 | 6,784 | 8 |
+| Claude Opus 4.6 | 56s | 2,558 | (internal) | 7 |
+| Claude Sonnet 4.6 | 48s | 2,353 | (internal) | 6 |
+
+## What they found — common ground (all 3 identified):
+
+**The central ambiguity: activation vs session lifecycle.** All three models independently
+identified the same core tension: Document 1 says "strategies subscribe to market data
+at session open" while Document 2 says "enabling the first strategy can trigger
+activation." Does activation happen at session open, at eligibility, or both? All three
+agree this is the highest-severity cross-document ambiguity.
+
+**The config-restart-session ambiguity.** All three identified that "configuration changes
+take effect on the next decision engine restart" is ambiguous relative to session
+boundaries — does session open/close constitute a "restart"? Two different interpretations
+lead to fundamentally different user expectations about when parameter changes apply.
+
+**Mid-session activation semantics.** All three identified the scenario where a user
+enables their first strategy during an active session — does the engine activate
+immediately, or defer to the next session open?
+
+## GPT-5 unique findings (not in either Claude model):
+
+- **Date field in session domain events** (#5): The `date` field in `session.opened`/
+  `session.closed` events could be ET trading date or UTC date. Components keying off
+  different interpretations would attribute after-hours activity to different days.
+- **P&L snapshot timing vs after-hours attribution** (#8): If the P&L snapshot is taken at 4:00 PM
+  but after-hours fills until 8:00 PM are attributed to the same trading day, the
+  "daily P&L" has no defined finalization time.
+- **Aggregation lifecycle across session boundaries** (#4): Are incomplete signal groups
+  flushed at session close, or do they persist? If a group started at 3:50 PM has a
+  60-minute timeout, does it expire after-hours?
+- **Strategies running outside session** (#6): Are strategies shut down outside sessions
+  (engine paused), or do they run continuously with risk controls blocking the output?
+  Different implementations affect aggregation state and warmup behavior.
+
+## Claude Opus unique findings (not in either other model):
+
+- **"Shut down at close" interaction with deactivation** (#2): If "some shut down at
+  close" causes the enabled strategy count to reach zero, does this trigger the "last
+  strategy disabled → deactivation" path? This could unintentionally deactivate the
+  entire engine when only session-sensitive strategies should pause.
+- **"Each component decides independently" vs config-driven behavior** (#6): If
+  components decide independently what session events mean, but the config layer expects
+  deterministic restart semantics, there's no single authority on when "restart" occurs.
+- **"Snapshots backfill" ambiguity for engine state** (#7): The missed-event recovery
+  mechanism says "subscribers trigger on next check" — but for the decision engine,
+  missing `session.closed` means potentially running all night.
+
+## Claude Sonnet unique findings (not in either other model):
+
+- **"Skips the instance" for removed strategy** (#3): If a strategy is removed from
+  the system but a user has it configured and enabled, does "skips" mean the engine
+  still activates with remaining strategies? What if that's the only configured strategy?
+  The "at least one enabled strategy" prerequisite doesn't account for
+  enabled-but-unresolvable strategies.
+- **High-water mark reset timing** (#4): Is the HWM reset tied to session open (the
+  event) or decision engine activation? For mid-session activation, these diverge —
+  the HWM baseline could be session-open portfolio value or activation-time value.
+- **"At startup" vs "at activation"** (#5): Aggregation config is "consumed by the
+  aggregator at startup" — but is "startup" application boot or decision engine
+  activation? If boot, config changes after boot but before activation are missed.
+- **Configuration change event scope** (#6): Does the "configuration change event"
+  fire only for enable/disable, or for any config mutation (parameter changes)?
+  If broader, the engine may receive events that trigger re-evaluation but shouldn't
+  cause hot-reload.
+
+## Quality assessment:
+
+- **GPT-5** found the most ambiguities (8) and was the only model to identify the
+  date/timezone and P&L-timing ambiguities (findings that span beyond the two documents
+  into system-wide consistency). Its unique findings extend further from the documents'
+  explicit text into operational consequences. Every finding includes both interpretations
+  clearly stated and a specific cross-component failure scenario. The aggregation
+  lifecycle finding (#4) is architecturally significant — it identifies a design gap
+  that neither document addresses. However, GPT-5 used 9,190 output tokens (3.6× Opus's
+  2,558) for 14% more findings (8 vs 7): about 1,150 tokens per finding vs Opus's ~365.
+
+- **Claude Opus** found 7 ambiguities in 56s with only 2,558 tokens — the most
+  concise output. Its unique findings focus on *design tensions within the interaction*
+  (session-driven deactivation vs config-driven deactivation, component independence
+  vs deterministic behavior). The "shut down at close" finding (#2) is genuinely
+  insightful: it identifies a scenario where a session lifecycle event could
+  accidentally trigger a config-layer state machine transition (deactivation). This
+  is Opus's characteristic strength — reasoning about where one subsystem's behavior
+  inadvertently triggers another subsystem's semantics.
+
+- **Claude Sonnet** found 6 ambiguities in 48s with 2,353 tokens. Notably, Sonnet
+  found the most *implementation-specific* ambiguities — the "skips the instance"
+  edge case, the "at startup" vs "at activation" timing, and the config change event
+  scope. These are the kinds of ambiguities that would bite an engineer writing the
+  actual code (GenServer init vs application startup, event scope design). Sonnet
+  appears to reason more from an implementer's perspective ("if I were coding this,
+  what would I be unsure about?") while Opus reasons from a designer's perspective
+  ("if I were reviewing this architecture, what tensions exist?").
+
+## Key insight — Implementation Ambiguity Analysis as a task type:
+
+This is a genuinely NEW analytical lens not previously tested. Unlike:
+- **Assumption-finding** ("what must be true for this to work?")
+- **Gap-finding** ("what's missing?")
+- **Race conditions** ("what ordering hazards exist?")
+- **Cross-doc consistency** ("do these docs contradict?")
+
+...implementation ambiguity asks: "where does the spec admit multiple valid
+implementations?" This requires the model to:
+1. Read the text as an implementer would (not a reviewer)
+2. Generate two interpretations that are both valid (neither is wrong)
+3. Show why the divergence matters cross-component
+
+The distinguishing characteristic: findings are not bugs or gaps — they're
+**underspecification**. The spec author wrote something reasonable,
+but didn't realize it could be read two ways by different engineers working on
+different components.
+
+## All models converge — but that's the point:
+
+Unlike previous experiments where models had dramatically different finding counts
+(GPT-5: 20-35, Opus: 10-13, Sonnet: 7-17 for assumption-finding), here the range
+is tight (6-8 findings). The convergence suggests that:
+
+1. The input documents are relatively short (199 lines combined) — less room for
+   divergence
+2. Implementation ambiguity is a more constrained task than open-ended analysis —
+   you need quoted text from both docs plus two valid interpretations, which naturally
+   limits the space
+3. The core ambiguity (activation vs session) is SO dominant that all models spend
+   significant output budget exploring its variations
+
+The value differentiation is in WHICH ambiguities each model finds beyond the core:
+- GPT-5 extends to system-wide concerns (timezone, P&L timing)
+- Opus finds interaction tensions (session events triggering config state machines)
+- Sonnet finds implementation-level confusions (init vs activation, event scope)
+
+## Practical implication:
+
+**Implementation ambiguity analysis is ideal for pre-implementation review.** Before
+assigning two engineers to work on interacting components, run this analysis on their
+respective spec documents. The findings directly identify coordination points that
+need explicit resolution before implementation begins. The cost (48-78s, 2-9K tokens)
+is trivial compared to the debugging cost of discovering these ambiguities as bugs
+in production.
+
+**Model recommendation for this task:**
+- **Sonnet** for quick pre-implementation review (most implementer-focused findings, fastest)
+- **Opus** for design-review contexts (finds where subsystem semantics leak across boundaries)
+- **GPT-5** when you need exhaustive coverage including system-wide implications
+
+All three are viable — the gap between them is smaller here than for other analytical tasks.
+This may be the first task type where Sonnet's findings are qualitatively AS valuable as
+the reasoning models' findings (just different in character).
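
The pre-implementation review described in the finding could be scripted roughly as follows. This is a minimal sketch, assuming the prompt shape the finding reports (both documents in full, followed by one structured question asking for quoted text, two interpretations, cross-component divergence impact, and severity). `SpecDoc`, `REQUIRED_FIELDS`, `build_ambiguity_prompt`, and the `===` section delimiter are hypothetical names chosen for illustration, not the harness actually used, and the call to a model endpoint is deliberately left out.

```python
from dataclasses import dataclass

# Hypothetical constant: the four elements each reported ambiguity had to include.
REQUIRED_FIELDS = (
    "quoted text",
    "two interpretations",
    "cross-component divergence impact",
    "severity",
)


@dataclass(frozen=True)
class SpecDoc:
    """One spec document to embed in the prompt (file name plus full text)."""
    name: str
    text: str


def build_ambiguity_prompt(docs: list[SpecDoc]) -> str:
    """Combine full document texts and the structured question into one prompt."""
    if len(docs) < 2:
        raise ValueError("need at least two documents to find cross-document ambiguity")
    # One labelled section per document; the delimiter format is an assumption.
    sections = [
        f"=== {doc.name} ({len(doc.text.splitlines())} lines) ===\n{doc.text}"
        for doc in docs
    ]
    requirements = "\n".join(f"- {field}" for field in REQUIRED_FIELDS)
    question = (
        "Identify implementation ambiguities: places where the spec is clear enough "
        "to design from but ambiguous enough that two engineers could reasonably "
        "implement different behaviors from the same text.\n"
        "For each ambiguity, provide:\n" + requirements
    )
    return "\n\n".join(sections + [question])


if __name__ == "__main__":
    # Placeholder texts; in practice these would be the full spec documents.
    prompt = build_ambiguity_prompt([
        SpecDoc("market-sessions.md", "placeholder text for document 1"),
        SpecDoc("strategy-config.md", "placeholder text for document 2"),
    ])
    print(prompt)
```

Keeping the required fields in one tuple means the same list can later be reused to check that each returned finding actually contains all four elements before it is handed to the engineers.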