finding 51: implementation ambiguity analysis — new analytical lens

2026-05-08 12:46:32 -07:00
parent 5b8f8caf8c
commit 79915d1dc3
1 changed files with 166 additions and 0 deletions
@@ -0,0 +1,166 @@
 # Finding 51: Implementation Ambiguity Analysis — New Analytical Lens; All Models Converge on Core Ambiguity
 **Date:** 2026-05-08
 **Task:** Identify implementation ambiguities in two related gargoyle design documents
 (`market-sessions.md` 102 lines + `strategy-config.md` 97 lines) — places where the
 spec is clear enough to design from but ambiguous enough that two engineers could
 reasonably implement different behaviors from the same text.
 **How we used them:** Both documents (full text) combined in a single prompt with a
 structured analytical question. Each ambiguity required: quoted text, two interpretations,
 cross-component divergence impact, severity. Tested via HAI proxy (OpenAI endpoint for
 GPT-5, Anthropic endpoint for Claude models). No tools, no project context beyond the
 two documents.
 | Model | Time | Output tokens | Reasoning tokens | Ambiguities found |
 |---|---|---|---|---|
 | GPT-5 | 78s | 9,190 | 6,784 | 8 |
 | Claude Opus 4.6 | 56s | 2,558 | (internal) | 7 |
 | Claude Sonnet 4.6 | 48s | 2,353 | (internal) | 6 |
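
The four parts required of each ambiguity (quoted text, two interpretations, cross-component divergence impact, severity) can be captured as a small record type. This is a minimal sketch, not the format used in the experiment: the class name `AmbiguityFinding`, the field names, and the severity scale are all assumptions, and the example values are illustrative.

```python
from dataclasses import dataclass

# Assumed severity scale; the original prompt's scale is not specified.
SEVERITIES = ("low", "medium", "high")


@dataclass(frozen=True)
class AmbiguityFinding:
    """One implementation ambiguity, mirroring the four required parts."""

    quoted_text: str        # verbatim spec text that admits two readings
    interpretation_a: str   # first reasonable reading
    interpretation_b: str   # second reasonable reading
    divergence_impact: str  # what goes wrong across components if readings differ
    severity: str           # one of SEVERITIES (assumed scale)

    def __post_init__(self) -> None:
        if self.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity!r}")
        if self.interpretation_a == self.interpretation_b:
            raise ValueError("an ambiguity needs two distinct interpretations")


# Illustrative values based on the config-restart ambiguity; the severity is a guess.
example = AmbiguityFinding(
    quoted_text="configuration changes take effect on the next decision engine restart",
    interpretation_a="A session open/close counts as a restart.",
    interpretation_b="Only a restart of the engine process counts.",
    divergence_impact="Components disagree on when parameter changes apply.",
    severity="high",
)
```

Rejecting identical interpretations at construction time enforces the defining property of this task type: a finding without two distinct readings is not an ambiguity.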
 ## What they found — common ground (all 3 identified):
 **The central ambiguity: activation vs session lifecycle.** All three models independently
 identified the same core tension: Document 1 says "strategies subscribe to market data
 at session open" while Document 2 says "enabling the first strategy can trigger
 activation." Does activation happen at session open, when the first strategy becomes
 eligible, or at both points? All three agree this is the highest-severity
 cross-document ambiguity.
 **The config-restart-session ambiguity.** All three identified that "configuration changes
 take effect on the next decision engine restart" is ambiguous relative to session
 boundaries — does session open/close constitute a "restart"? Two different interpretations
 lead to fundamentally different user expectations about when parameter changes apply.
 **Mid-session activation semantics.** All three identified the scenario where a user
 enables their first strategy during an active session — does the engine activate
 immediately, or defer to next session open?
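
The mid-session case can be made concrete by replaying the same event sequence under both readings. This is a hedged sketch: the `Policy` enum, the `replay` function, and the `strategy.enabled` event name are hypothetical (only `session.opened` appears in the documents), and the logic is a simplification of whatever the real engine does.

```python
from enum import Enum


class Policy(Enum):
    ACTIVATE_ON_ELIGIBILITY = "eligibility"    # reading A: enabling can activate at once
    ACTIVATE_ON_SESSION_OPEN = "session_open"  # reading B: only session open activates


def replay(events: list[str], policy: Policy) -> bool:
    """Return whether the engine is active after processing the events in order."""
    session_open = False
    enabled = 0
    active = False
    for event in events:
        if event == "session.opened":
            session_open = True
            if enabled > 0:
                active = True  # both readings activate here
        elif event == "strategy.enabled":  # hypothetical event name
            enabled += 1
            if policy is Policy.ACTIVATE_ON_ELIGIBILITY and session_open:
                active = True  # only reading A activates mid-session
    return active


mid_session = ["session.opened", "strategy.enabled"]
pre_session = ["strategy.enabled", "session.opened"]

for policy in Policy:
    print(policy.name, replay(mid_session, policy), replay(pre_session, policy))
```

The two readings agree when the strategy is enabled before the session opens and diverge only when it is enabled mid-session, which is why the ambiguity can survive testing that never exercises that ordering.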
 ## GPT-5 unique findings (not in either Claude model):
 - **Date field in session domain events** (#5): The `date` field in `session.opened`/
  `session.closed` events could be ET trading date or UTC date. Components keying off
  different interpretations would attribute after-hours activity to different days.
 - **P&L snapshot timing vs after-hours attribution** (#8): If P&L snapshots at 4:00 PM
  but after-hours fills until 8:00 PM are attributed to the same trading day, the
  "daily P&L" has no defined finalization time.
 - **Aggregation lifecycle across session boundaries** (#4): Are incomplete signal groups
  flushed at session close, or do they persist? If a group started at 3:50 PM has a
  60-minute timeout, does it expire after-hours?
 - **Strategies running outside session** (#6): Are strategies shut down outside sessions
  (engine paused), or do they run continuously with risk controls blocking the output?
  Different implementations affect aggregation state and warmup behavior.
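
The date-field ambiguity (#5) is easy to reproduce with a single timestamp. The sketch below uses a hypothetical after-hours fill and, as a stated assumption, a fixed UTC-5 offset standing in for Eastern Time in January; a real implementation would use a tz database zone instead. The function names are illustrative.

```python
from datetime import date, datetime, timedelta, timezone

# Assumption: fixed UTC-5 offset stands in for Eastern Time in January (EST).
# A real implementation would use a tz database zone such as America/New_York.
EST = timezone(timedelta(hours=-5), name="EST")


def trading_date_et(ts: datetime) -> date:
    """Reading A: the event date is the Eastern Time trading date."""
    return ts.astimezone(EST).date()


def calendar_date_utc(ts: datetime) -> date:
    """Reading B: the event date is the UTC calendar date."""
    return ts.astimezone(timezone.utc).date()


# Hypothetical after-hours fill at 7:30 PM ET, inside the window that ends at 8:00 PM.
fill = datetime(2026, 1, 15, 19, 30, tzinfo=EST)

print(trading_date_et(fill))    # 2026-01-15
print(calendar_date_utc(fill))  # 2026-01-16
```

Two components keying off these two functions would attribute the same fill to different days, which is the divergence GPT-5 describes.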
 ## Claude Opus unique findings (not found by either other model):
 - **"Shut down at close" interaction with deactivation** (#2): If "some shut down at
  close" causes the enabled strategy count to reach zero, does this trigger the "last
  strategy disabled → deactivation" path? This could unintentionally deactivate the
  entire engine when only session-sensitive strategies should pause.
 - **"Each component decides independently" vs config-driven behavior** (#6): If
  components decide independently what session events mean, but the config layer expects
  deterministic restart semantics, there's no single authority on when "restart" occurs.
 - **"Snapshots backfill" ambiguity for engine state** (#7): The missed-event recovery
  mechanism says "subscribers trigger on next check", but for the decision engine a
  missed `session.closed` event could mean it keeps running all night.
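
The "shut down at close" finding (#2) comes down to which notion of "enabled" feeds the zero-count check. A minimal sketch under assumed names: the `Strategy` record, its two flags, and both predicate functions are hypothetical, since the documents do not define how the count is computed.

```python
from dataclasses import dataclass


@dataclass
class Strategy:
    name: str
    enabled_in_config: bool  # what the user configured
    running: bool            # runtime state; False after a shutdown at close


def deactivates_by_config(strategies: list[Strategy]) -> bool:
    """Reading A: only user-disabled strategies count toward zero."""
    return not any(s.enabled_in_config for s in strategies)


def deactivates_by_runtime(strategies: list[Strategy]) -> bool:
    """Reading B: a strategy that shut down at close no longer counts as enabled."""
    return not any(s.enabled_in_config and s.running for s in strategies)


# A single session-sensitive strategy that has shut down at close.
after_close = [Strategy("hypothetical_intraday", enabled_in_config=True, running=False)]

print(deactivates_by_config(after_close))   # False: the engine stays activated
print(deactivates_by_runtime(after_close))  # True: the whole engine deactivates
```

Under reading B a routine session close tears down the engine, which is the unintended deactivation path the finding warns about.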
 ## Claude Sonnet unique findings (not found by either other model):
 - **"Skips the instance" for removed strategy** (#3): If a strategy is removed from
  the system but a user has it configured and enabled, does "skips" mean the engine
  still activates with remaining strategies? What if that's the only configured strategy?
  The "at least one enabled strategy" prerequisite doesn't account for enabled-but-
  unresolvable strategies.
 - **High-water mark reset timing** (#4): Is the HWM reset tied to session open (the
  event) or decision engine activation? For mid-session activation, these diverge —
  the HWM baseline could be session-open portfolio value or activation-time value.
 - **"At startup" vs "at activation"** (#5): Aggregation config is "consumed by the
  aggregator at startup" — but is "startup" application boot or decision engine
  activation? If boot, config changes after boot but before activation are missed.
 - **Configuration change event scope** (#6): Does the "configuration change event"
  fire only for enable/disable, or for any config mutation (parameter changes)?
  If broader, the engine may receive events that trigger re-evaluation but shouldn't
  cause hot-reload.
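
The high-water mark finding (#4) can be illustrated with numbers. Everything in this sketch is an assumption made for illustration: the portfolio values, the 2% drawdown limit, and the idea that the HWM feeds a drawdown check at all, which the documents summarized here do not confirm.

```python
def drawdown(hwm: float, current: float) -> float:
    """Fractional decline of the current value from the high-water mark."""
    return (hwm - current) / hwm


# Hypothetical values for a strategy first enabled mid-session.
value_at_session_open = 100_000.0  # reading A baseline: reset at the session open event
value_at_activation = 98_000.0     # reading B baseline: reset at engine activation
current_value = 97_000.0
DRAWDOWN_LIMIT = 0.02              # hypothetical risk limit

dd_session = drawdown(value_at_session_open, current_value)
dd_activation = drawdown(value_at_activation, current_value)

print(f"session-open baseline: {dd_session:.2%}, breach={dd_session > DRAWDOWN_LIMIT}")
print(f"activation baseline:   {dd_activation:.2%}, breach={dd_activation > DRAWDOWN_LIMIT}")
```

With these numbers one reading trips the limit and the other does not, so a risk component and the decision engine built from different readings would disagree about whether trading may continue.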
 ## Quality assessment:
 - **GPT-5** found the most ambiguities (8) and was the only model to identify the
  date/timezone and P&L-timing ambiguities, which reach beyond the two documents
  into system-wide consistency and operational consequences. Every finding states both
  interpretations clearly and gives a specific cross-component failure scenario. The
  aggregation lifecycle finding (#4) is architecturally significant: it identifies a
  design gap that neither document addresses. However, GPT-5 used 9,190 output tokens
  (3.6× Opus's 2,558) for 14% more findings (8 vs 7), or roughly 1,150 tokens per
  finding against about 365 for Opus.
 - **Claude Opus** found 7 ambiguities in 56s with 2,558 tokens, the lowest cost per
  finding (about 365 tokens, against about 392 for Sonnet, whose 2,353-token output
  was the shortest overall). Its unique findings focus on *design tensions within the
  interaction* (session-driven deactivation vs config-driven deactivation, component
  independence vs deterministic behavior). The "shut down at close" finding (#2) is
  genuinely insightful: it identifies a scenario where a session lifecycle event could
  accidentally trigger a config-layer state machine transition (deactivation). This
  is Opus's characteristic strength: reasoning about where one subsystem's behavior
  inadvertently triggers another subsystem's semantics.
 - **Claude Sonnet** found 6 ambiguities in 48s with 2,353 tokens. Notably, Sonnet
  found the most *implementation-specific* ambiguities — the "skips the instance"
  edge case, the "at startup" vs "at activation" timing, and the config change event
  scope. These are the kinds of ambiguities that would bite an engineer writing the
  actual code (GenServer init vs application startup, event scope design). Sonnet
  appears to reason more from an implementer's perspective ("if I were coding this,
  what would I be unsure about?") while Opus reasons from a designer's perspective
  ("if I were reviewing this architecture, what tensions exist?").
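
The efficiency comparison can be checked directly from the results table. The sketch below recomputes tokens per finding; the `RUNS` structure and function name are illustrative. The table does not say whether GPT-5's output figure includes its reasoning tokens, so the `exclude_reasoning` option covers the alternative reading and should be treated as an assumption.

```python
# Figures copied from the results table; reasoning tokens are reported only for GPT-5.
RUNS = {
    "GPT-5": {"seconds": 78, "output_tokens": 9190, "reasoning_tokens": 6784, "findings": 8},
    "Claude Opus 4.6": {"seconds": 56, "output_tokens": 2558, "reasoning_tokens": None, "findings": 7},
    "Claude Sonnet 4.6": {"seconds": 48, "output_tokens": 2353, "reasoning_tokens": None, "findings": 6},
}


def tokens_per_finding(run: dict, exclude_reasoning: bool = False) -> float:
    """Output tokens spent per ambiguity found.

    With exclude_reasoning=True, reported reasoning tokens are subtracted first.
    That is only meaningful if the output figure includes them (an assumption).
    """
    tokens = run["output_tokens"]
    if exclude_reasoning and run["reasoning_tokens"] is not None:
        tokens -= run["reasoning_tokens"]
    return tokens / run["findings"]


for model, run in RUNS.items():
    print(f"{model}: {tokens_per_finding(run):.0f} tokens per finding")
```

Taken at face value, GPT-5 spends about 1,149 tokens per finding against about 365 for Opus and 392 for Sonnet. If the 6,784 reasoning tokens are counted inside the 9,190, GPT-5's visible output is 2,406 tokens, or about 301 per finding, and the efficiency gap disappears.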
 ## Key insight — Implementation Ambiguity Analysis as a task type:
 This is a new analytical lens, not previously tested. Unlike:
 - **Assumption-finding** ("what must be true for this to work?")
 - **Gap-finding** ("what's missing?")
 - **Race conditions** ("what ordering hazards exist?")
 - **Cross-doc consistency** ("do these docs contradict?")
 ...implementation ambiguity asks: "where does the spec admit multiple valid
 implementations?" This requires the model to:
 1. Read the text as an implementer would (not a reviewer)
 2. Generate two interpretations that are both valid (neither is wrong)
 3. Show why the divergence matters cross-component
 The distinguishing characteristic: findings are not bugs or gaps but
 **underspecification**. The spec author wrote something reasonable without
 realizing that engineers working on different components could read it two ways.
 ## All models converge — but that's the point:
 Unlike previous experiments where models had dramatically different finding counts
 (GPT-5: 20-35, Opus: 10-13, Sonnet: 7-17 for assumption-finding), here the range
 is tight (6-8 findings). The convergence suggests that:
 1. The input documents are relatively short (199 lines combined) — less room for
   divergence
 2. Implementation ambiguity is a more constrained task than open-ended analysis —
   you need quoted text from both docs plus two valid interpretations, which naturally
   limits the space
 3. The core ambiguity (activation vs session) is so dominant that all models spend
   significant output budget exploring its variations
 The value differentiation is in WHICH ambiguities each model finds beyond the core:
 - GPT-5 extends to system-wide concerns (timezone, P&L timing)
 - Opus finds interaction tensions (session events triggering config state machines)
 - Sonnet finds implementation-level confusions (init vs activation, event scope)
 ## Practical implication:
 **Implementation ambiguity analysis is ideal for pre-implementation review.** Before
 assigning two engineers to work on interacting components, run this analysis on their
 respective spec documents. The findings directly identify coordination points that
 need explicit resolution before implementation begins. The cost (48-78s, 2-9K tokens)
 is trivial compared to the debugging cost of discovering these ambiguities as bugs
 in production.
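
A pre-implementation review along these lines needs little more than a prompt that combines the full spec texts with the structured question. The sketch below shows one way to assemble it. The function name, the section delimiters, and the exact wording are assumptions and not the prompt used in this experiment, and the document bodies are placeholders.

```python
# Parts each reported ambiguity must include, following the task description.
REQUIRED_PARTS = (
    "the quoted text",
    "two interpretations that are both reasonable",
    "the cross-component impact if two engineers diverge",
    "a severity rating",
)


def build_ambiguity_prompt(docs: dict[str, str]) -> str:
    """Combine full spec texts and the structured question into one prompt."""
    if len(docs) < 2:
        raise ValueError("cross-document analysis needs at least two documents")
    sections = [f"=== {name} ===\n{text.strip()}" for name, text in docs.items()]
    requirements = "\n".join(f"- {part}" for part in REQUIRED_PARTS)
    question = (
        "Identify implementation ambiguities: places where the spec is clear enough "
        "to design from but two engineers could reasonably implement different "
        "behaviors.\nFor each ambiguity, provide:\n" + requirements
    )
    return "\n\n".join(sections + [question])


prompt = build_ambiguity_prompt({
    "market-sessions.md": "(full text of the session spec goes here)",
    "strategy-config.md": "(full text of the config spec goes here)",
})
print(prompt)
```

Keeping the required parts in one tuple makes it easy to reuse the same structure when checking that each returned finding is complete.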
 **Model recommendation for this task:**
 - **Sonnet** for quick pre-implementation review (most implementer-focused findings, fastest)
 - **Opus** for design-review contexts (finds where subsystem semantics leak across boundaries)
 - **GPT-5** when you need exhaustive coverage including system-wide implications
 All three are viable — the gap between them is smaller here than for other analytical tasks.
 This may be the first task type where Sonnet's findings are as valuable as the
 reasoning models' findings, differing in character and not in quality.