finding 51: implementation ambiguity analysis — new analytical lens

2026-05-08 12:46:32 -07:00
parent 5b8f8caf8c
commit 79915d1dc3
1 changed files with 166 additions and 0 deletions
@@ -0,0 +1,166 @@
+# Finding 51: Implementation Ambiguity Analysis — New Analytical Lens; All Models Converge on Core Ambiguity
+
+**Date:** 2026-05-08
+**Task:** Identify implementation ambiguities in two related gargoyle design documents
+(`market-sessions.md` 102 lines + `strategy-config.md` 97 lines) — places where the
+spec is clear enough to design from but ambiguous enough that two engineers could
+reasonably implement different behaviors from the same text.
+**How we used them:** Both documents (full text) combined in a single prompt with a
+structured analytical question. Each ambiguity required: quoted text, two interpretations,
+cross-component divergence impact, severity. Tested via HAI proxy (OpenAI endpoint for
+GPT-5, Anthropic endpoint for Claude models). No tools, no project context beyond the
+two documents.
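+
+A minimal sketch of the per-finding structure described above, written as a Python record type. The class name `AmbiguityFinding`, its field names, and the three severity labels are hypothetical: the experiment specified only the four required elements, not a schema or a severity scale. The two interpretations in the example paraphrase the central ambiguity reported in this finding.
+
```python
from dataclasses import dataclass

# Hypothetical severity labels; the scale used in the actual prompt is not recorded.
SEVERITIES = ("low", "medium", "high")


@dataclass(frozen=True)
class AmbiguityFinding:
    quoted_text: str        # verbatim spec text that admits two readings
    interpretation_a: str   # first reasonable implementation
    interpretation_b: str   # second reasonable implementation
    divergence_impact: str  # what breaks when components pick different readings
    severity: str           # one of SEVERITIES

    def __post_init__(self) -> None:
        # Reject findings that skip the required elements.
        if self.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity!r}")
        if not self.quoted_text.strip():
            raise ValueError("a finding must quote the spec text")


finding = AmbiguityFinding(
    quoted_text="strategies subscribe to market data at session open",
    interpretation_a="Activation happens only at session open.",
    interpretation_b="Enabling the first strategy can activate the engine at any time.",
    divergence_impact="Components disagree on whether the engine is running mid-session.",
    severity="high",
)
```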
+
+| Model | Time | Output tokens | Reasoning tokens | Ambiguities found |
+|---|---|---|---|---|
+| GPT-5 | 78s | 9,190 | 6,784 | 8 |
+| Claude Opus 4.6 | 56s | 2,558 | (internal) | 7 |
+| Claude Sonnet 4.6 | 48s | 2,353 | (internal) | 6 |
+
+## What they found — common ground (all 3 identified):
+
+**The central ambiguity: activation vs session lifecycle.** All three models independently
+identified the same core tension: Document 1 says "strategies subscribe to market data
+at session open" while Document 2 says "enabling the first strategy can trigger
+activation." Does activation happen at session open, at eligibility, or both? All three
+agree this is the highest-severity cross-document ambiguity.
+
+**The config-restart-session ambiguity.** All three identified that "configuration changes
+take effect on the next decision engine restart" is ambiguous relative to session
+boundaries — does session open/close constitute a "restart"? Two different interpretations
+lead to fundamentally different user expectations about when parameter changes apply.
+
+**Mid-session activation semantics.** All three identified the scenario where a user
+enables their first strategy during an active session — does the engine activate
+immediately, or defer to next session open?
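+
+The mid-session case can be sketched as two policies that both satisfy the quoted text. This is an illustration only: the function `engine_active`, the policy names, and the 9:30 AM open time are assumptions and do not come from either design document.
+
```python
from datetime import time

SESSION_OPEN = time(9, 30)   # assumed regular-session open (ET)
SESSION_CLOSE = time(16, 0)  # assumed regular-session close (ET)


def in_session(now: time) -> bool:
    return SESSION_OPEN <= now < SESSION_CLOSE


def engine_active(enabled_at: time, now: time, policy: str) -> bool:
    """Is the engine running at `now`, given when the first strategy was enabled?

    Only a single trading day is modelled.
    """
    if not in_session(now) or now < enabled_at:
        return False
    if policy == "immediate":
        # Reading A: enabling the first strategy activates the engine at once.
        return True
    if policy == "next_open":
        # Reading B: activation only happens on a session-open event, so a
        # strategy enabled after the open waits for the next session.
        return enabled_at <= SESSION_OPEN
    raise ValueError(f"unknown policy: {policy!r}")


enabled_at = time(11, 0)  # user enables the first strategy mid-session
now = time(13, 0)

print(engine_active(enabled_at, now, "immediate"))  # True
print(engine_active(enabled_at, now, "next_open"))  # False
```
+
+Both results are defensible readings of the same text, which is what separates an ambiguity from a bug.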
+
+## GPT-5 unique findings (not in either Claude model):
+
+- **Date field in session domain events** (#5): The `date` field in `session.opened`/
+  `session.closed` events could be ET trading date or UTC date. Components keying off
+  different interpretations would attribute after-hours activity to different days.
+- **P&L snapshot timing vs after-hours attribution** (#8): If P&L snapshots at 4:00 PM
+  but after-hours fills until 8:00 PM are attributed to the same trading day, the
+  "daily P&L" has no defined finalization time.
+- **Aggregation lifecycle across session boundaries** (#4): Are incomplete signal groups
+  flushed at session close, or do they persist? If a group started at 3:50 PM has a
+  60-minute timeout, does it expire after-hours?
+- **Strategies running outside session** (#6): Are strategies shut down outside sessions
+  (engine paused), or do they run continuously with risk controls blocking the output?
+  Different implementations affect aggregation state and warmup behavior.
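+
+The date-field ambiguity (#5) is easy to reproduce. The sketch below uses a hypothetical after-hours fill at 7:30 PM ET on a winter date and, as a simplifying assumption, a fixed UTC-5 offset in place of a real time zone database lookup.
+
```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC-5 offset stands in for Eastern Standard Time.
# Production code would use zoneinfo.ZoneInfo("America/New_York") to handle DST.
EST = timezone(timedelta(hours=-5), "EST")

# Hypothetical after-hours fill at 7:30 PM ET.
fill_time = datetime(2026, 1, 15, 19, 30, tzinfo=EST)

et_trading_date = fill_time.date()                    # component keying on ET date
utc_date = fill_time.astimezone(timezone.utc).date()  # component keying on UTC date

print(et_trading_date)  # 2026-01-15
print(utc_date)         # 2026-01-16
```
+
+Two components reading the same `date` field under different conventions would attribute this fill to different days.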
+
+## Claude Opus unique findings (not in either of the other models):
+
+- **"Shut down at close" interaction with deactivation** (#2): If "some shut down at
+  close" causes the enabled strategy count to reach zero, does this trigger the "last
+  strategy disabled → deactivation" path? This could unintentionally deactivate the
+  entire engine when only session-sensitive strategies should pause.
+- **"Each component decides independently" vs config-driven behavior** (#6): If
+  components decide independently what session events mean, but the config layer expects
+  deterministic restart semantics, there's no single authority on when "restart" occurs.
+- **"Snapshots backfill" ambiguity for engine state** (#7): The missed-event recovery
+  mechanism says "subscribers trigger on next check" — but for the decision engine,
+  missing `session.closed` means potentially running all night.
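+
+The "shut down at close" finding (#2) can be illustrated with two ways of modelling the enabled set. The function names and the strategy name `momentum` are hypothetical; neither document is known to define these.
+
```python
def deactivates_if_close_disables(enabled: set[str], session_sensitive: set[str]) -> bool:
    """Reading A: shutting down at close removes strategies from the enabled set,
    so the 'last strategy disabled' rule can fire."""
    remaining = enabled - session_sensitive
    return len(remaining) == 0


def deactivates_if_close_pauses(enabled: set[str], session_sensitive: set[str]) -> bool:
    """Reading B: close only pauses session-sensitive strategies; the enabled set
    is untouched, so deactivation depends on user config alone."""
    return len(enabled) == 0


enabled = {"momentum"}            # the user's only enabled strategy
session_sensitive = {"momentum"}  # it shuts down at session close

print(deactivates_if_close_disables(enabled, session_sensitive))  # True
print(deactivates_if_close_pauses(enabled, session_sensitive))    # False
```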
+
+## Claude Sonnet unique findings (not in either of the other models):
+
+- **"Skips the instance" for removed strategy** (#3): If a strategy is removed from
+  the system but a user has it configured and enabled, does "skips" mean the engine
+  still activates with remaining strategies? What if that's the only configured strategy?
+  The "at least one enabled strategy" prerequisite doesn't account for enabled-but-
+  unresolvable strategies.
+- **High-water mark reset timing** (#4): Is the HWM reset tied to session open (the
+  event) or decision engine activation? For mid-session activation, these diverge —
+  the HWM baseline could be session-open portfolio value or activation-time value.
+- **"At startup" vs "at activation"** (#5): Aggregation config is "consumed by the
+  aggregator at startup" — but is "startup" application boot or decision engine
+  activation? If boot, config changes after boot but before activation are missed.
+- **Configuration change event scope** (#6): Does the "configuration change event"
+  fire only for enable/disable, or for any config mutation (parameter changes)?
+  If broader, the engine may receive events that trigger re-evaluation but shouldn't
+  cause hot-reload.
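+
+A worked example of why the high-water mark timing (#4) matters. It assumes the HWM feeds a drawdown check with a 3% limit; the limit and all portfolio values are hypothetical and chosen only to make the two baselines diverge.
+
```python
def drawdown(high_water_mark: float, current_value: float) -> float:
    """Fractional decline from the high-water mark."""
    return (high_water_mark - current_value) / high_water_mark


DRAWDOWN_LIMIT = 0.03            # hypothetical risk limit

value_at_session_open = 100_000.0
value_at_activation = 98_000.0   # engine activated mid-session, after a dip
current_value = 96_000.0

dd_from_open = drawdown(value_at_session_open, current_value)
dd_from_activation = drawdown(value_at_activation, current_value)

print(f"{dd_from_open:.4f}")        # 0.0400
print(f"{dd_from_activation:.4f}")  # 0.0204

print(dd_from_open > DRAWDOWN_LIMIT)        # True: limit breached
print(dd_from_activation > DRAWDOWN_LIMIT)  # False: limit not breached
```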
+
+## Quality assessment:
+
+- **GPT-5** found the most ambiguities (8) and was the only model to identify the
+  date/timezone and P&L-timing ambiguities (findings that span beyond the two documents
+  into system-wide consistency). Its unique findings extend further from the documents'
+  explicit text into operational consequences. Every finding includes both interpretations
+  clearly stated and a specific cross-component failure scenario. The aggregation
+  lifecycle finding (#4) is architecturally significant — it identifies a design gap
+  that neither document addresses. However, GPT-5 used 9,190 output tokens (3.6× Opus's
+  2,558) for 14% more findings (8 vs 7): about 1,149 tokens per finding against Opus's 365.
+
+- **Claude Opus** found 7 ambiguities in 56s with only 2,558 tokens, the fewest tokens
+  per finding of the three (about 365; Sonnet used fewer tokens in total but about 392
+  per finding). Its unique findings focus on *design tensions within the interaction*
+  (session-driven deactivation vs config-driven deactivation, component independence
+  vs deterministic behavior). The "shut down at close" finding (#2) is genuinely
+  insightful: it identifies a scenario where a session lifecycle event could
+  accidentally trigger a config-layer state machine transition (deactivation). This
+  is Opus's characteristic strength — reasoning about where one subsystem's behavior
+  inadvertently triggers another subsystem's semantics.
+
+- **Claude Sonnet** found 6 ambiguities in 48s with 2,353 tokens. Notably, Sonnet
+  found the most *implementation-specific* ambiguities — the "skips the instance"
+  edge case, the "at startup" vs "at activation" timing, and the config change event
+  scope. These are the kinds of ambiguities that would bite an engineer writing the
+  actual code (GenServer init vs application startup, event scope design). Sonnet
+  appears to reason more from an implementer's perspective ("if I were coding this,
+  what would I be unsure about?") while Opus reasons from a designer's perspective
+  ("if I were reviewing this architecture, what tensions exist?").
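+
+The token-efficiency comparison can be checked directly from the results table. The variable names are illustrative; the figures are the table's output token and finding counts. Whether GPT-5's output token count includes its reasoning tokens is not stated, so the sketch uses the output figure as reported.
+
```python
results = {
    "GPT-5": {"output_tokens": 9190, "findings": 8},
    "Claude Opus 4.6": {"output_tokens": 2558, "findings": 7},
    "Claude Sonnet 4.6": {"output_tokens": 2353, "findings": 6},
}

tokens_per_finding = {
    model: r["output_tokens"] / r["findings"] for model, r in results.items()
}

for model, tpf in tokens_per_finding.items():
    print(f"{model}: {tpf:.0f} tokens per finding")
# GPT-5: 1149 tokens per finding
# Claude Opus 4.6: 365 tokens per finding
# Claude Sonnet 4.6: 392 tokens per finding

token_ratio = results["GPT-5"]["output_tokens"] / results["Claude Opus 4.6"]["output_tokens"]
extra_findings = results["GPT-5"]["findings"] / results["Claude Opus 4.6"]["findings"] - 1

print(f"{token_ratio:.1f}x tokens for {extra_findings:.0%} more findings")
# 3.6x tokens for 14% more findings
```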
+
+## Key insight — Implementation Ambiguity Analysis as a task type:
+
+This is a genuinely NEW analytical lens not previously tested. Unlike:
+- **Assumption-finding** ("what must be true for this to work?")
+- **Gap-finding** ("what's missing?")
+- **Race conditions** ("what ordering hazards exist?")
+- **Cross-doc consistency** ("do these docs contradict?")
+
+...implementation ambiguity asks: "where does the spec admit multiple valid
+implementations?" This requires the model to:
+1. Read the text as an implementer would (not a reviewer)
+2. Generate two interpretations that are BOTH valid (neither is wrong)
+3. Show why the divergence matters cross-component
+
+The distinguishing characteristic: findings are not bugs or gaps; they are points of
+**underspecification**. The spec author wrote something reasonable,
+but didn't realize it could be read two ways by different engineers working on
+different components.
+
+## All models converge — but that's the point:
+
+Unlike previous experiments where models had dramatically different finding counts
+(GPT-5: 20-35, Opus: 10-13, Sonnet: 7-17 for assumption-finding), here the range
+is tight (6-8 findings). The convergence suggests that:
+
+1. The input documents are relatively short (199 lines combined) — less room for
+   divergence
+2. Implementation ambiguity is a more constrained task than open-ended analysis:
+   each finding needs quoted text plus two valid interpretations, which naturally
+   limits the space
+3. The core ambiguity (activation vs session) is SO dominant that all models spend
+   significant output budget exploring its variations
+
+The value differentiation is in WHICH ambiguities each model finds beyond the core:
+- GPT-5 extends to system-wide concerns (timezone, P&L timing)
+- Opus finds interaction tensions (session events triggering config state machines)
+- Sonnet finds implementation-level confusions (init vs activation, event scope)
+
+## Practical implication:
+
+**Implementation ambiguity analysis is ideal for pre-implementation review.** Before
+assigning two engineers to work on interacting components, run this analysis on their
+respective spec documents. The findings directly identify coordination points that
+need explicit resolution before implementation begins. The cost (48-78s, 2-9K tokens)
+is trivial compared to the debugging cost of discovering these ambiguities as bugs
+in production.
+
+**Model recommendation for this task:**
+- **Sonnet** for quick pre-implementation review (most implementer-focused findings, fastest)
+- **Opus** for design-review contexts (finds where subsystem semantics leak across boundaries)
+- **GPT-5** when you need exhaustive coverage including system-wide implications
+
+All three are viable — the gap between them is smaller here than for other analytical tasks.
+This may be the first task type where Sonnet's findings are qualitatively AS valuable as
+the reasoning models' findings (just different in character).