6af8a6ee10
Break the monolithic 3249-line findings file into 29 individual files, one per experiment. Each file is named YYYY-MM-DD-NN-slug.md for easy chronological sorting and discovery. No content changes — purely structural reorganization.
132 lines
8.1 KiB
Markdown
# Finding 16: Specification completeness: Sonnet 4.5 produces 2x the findings of Sonnet 4.6 on implementation-gap analysis; quality vs volume tradeoff

**Date:** 2026-05-03

**Task:** Identify specification gaps in gargoyle's `kill-switch.md` (185 lines): places
where an implementer would be forced to guess or decide on their own because the spec
doesn't clearly specify behavior. New analytical lens not previously tested.

**How we used them:** Same document (full text) + same focused analytical question to all
3 models via HAI proxy. Highly structured prompt specifying 5 categories of underspecification
(behavioral ambiguity, missing edge cases, ordering/sequencing gaps, interface contracts
undefined, concurrency semantics omitted). Required specific output format per finding
(gap, section, what implementer must decide, risk if wrong, severity). No tools, no
project context beyond the document itself.

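The prompt structure described above can be sketched as a small schema plus a prompt builder. This is a minimal illustration only: the five categories and the five per-finding fields come from the description above, while the `Finding` class, the `build_prompt` function, and the prompt wording are hypothetical and are not the prompt that was actually sent.

```python
from dataclasses import dataclass

# The five categories of underspecification named in the prompt description.
CATEGORIES = [
    "behavioral ambiguity",
    "missing edge cases",
    "ordering/sequencing gaps",
    "interface contracts undefined",
    "concurrency semantics omitted",
]

SEVERITIES = ("Critical", "High", "Medium", "Low")


@dataclass
class Finding:
    """One reported gap, mirroring the required per-finding output format."""

    gap: str
    section: str
    implementer_must_decide: str
    risk_if_wrong: str
    severity: str

    def __post_init__(self):
        # Reject ratings outside the four severity levels used in the results.
        if self.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity!r}")


def build_prompt(spec_text: str) -> str:
    """Assemble a prompt; the wording is illustrative, not the original."""
    category_lines = "\n".join(f"- {c}" for c in CATEGORIES)
    return (
        "Identify places where an implementer would have to guess.\n"
        f"Categories of underspecification:\n{category_lines}\n"
        "For each finding report: gap, section, what the implementer must "
        "decide, risk if wrong, severity (Critical/High/Medium/Low).\n\n"
        f"--- SPEC ---\n{spec_text}"
    )
```

Parsing each model's output into the same record type is what makes the per-model counts in the results table comparable.
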

| Model | Time | Output tokens | Reasoning tokens | Gaps found | Critical | High | Medium | Low |
|---|---|---|---|---|---|---|---|---|
| Claude Sonnet 4.6 | 73s | 3,403 | (internal) | 13 | 8 | 4 | 0 | 1 |
| Claude Sonnet 4.5 | 102s | 5,191 | (internal) | 25 | 14 | 6 | 4 | 1 |
| GPT-5 | 109s | 10,140 | 7,872 | 19 | 8 | 7 | 3 | 0 |

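A quick way to sanity-check the table is to sum the severity columns per row and compute the share of findings rated Critical. The sketch below uses only the numbers from the table; the `RESULTS` structure and the `summarize` helper are illustrative names.

```python
# Rows copied from the table above: (gaps, critical, high, medium, low).
RESULTS = {
    "Claude Sonnet 4.6": (13, 8, 4, 0, 1),
    "Claude Sonnet 4.5": (25, 14, 6, 4, 1),
    "GPT-5": (19, 8, 7, 3, 0),
}


def summarize(results):
    """Return {model: (severity_sum, critical_share)} for each table row."""
    summary = {}
    for model, (gaps, critical, high, medium, low) in results.items():
        severity_sum = critical + high + medium + low
        summary[model] = (severity_sum, critical / gaps)
    return summary


for model, (severity_sum, share) in summarize(RESULTS).items():
    gaps = RESULTS[model][0]
    # Flag rows whose severity columns do not add up to the gap count.
    note = "" if severity_sum == gaps else f" (severity columns sum to {severity_sum})"
    print(f"{model}: {gaps} gaps, {share:.0%} rated Critical{note}")
```

With the numbers as reported, both Sonnet rows add up, while the GPT-5 severity columns sum to 18 against 19 reported gaps, so one finding may carry no rating or the row may be miscounted. As a share of each model's own findings, Critical ratings come to roughly 62% for Sonnet 4.6, 56% for Sonnet 4.5, and 42% for GPT-5.
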

**What they found (common ground, all 3 identified):**

- Pipeline process identification ambiguity (which processes are "pipeline processes")
- Per-user process scope mapping (how to terminate only one user's processes)
- ETS table ownership and lifecycle (who owns it, what happens on crash)
- Concurrent engage operations (what happens when two sources engage simultaneously)
- Liquidation order tagging mechanism (what the tag is, how verified)
- Process restart prevention (how "must not restart" is enforced)
- Engage sequence atomicity (partial failure between DB write and termination)
- Startup ordering and ETS readiness (pipeline starting before ETS populated)
- Disengage sequence ordering (what happens and in what order)

**Sonnet 4.5 unique findings (not in either other model):**

- ETS table schema/structure (set vs ordered_set, key format, value schema)
- Missing ETS detection mechanism (catch :badarg vs table existence check)
- Database write atomicity with ETS (transaction boundaries, rollback semantics)
- Per-user engage while global is already engaged (is it a no-op or error?)
- Broker rejection semantics ("already filled" vs "invalid cancel" distinction)
- Cold-start gate interaction (independence vs dependency of the two gates)
- User deletion with active kill switch (orphaned rows, cascade semantics)
- Global disengage effect on per-user states (independent or auto-clear?)
- Audit log write failure during engage (critical-path vs best-effort)
- Dashboard control ambiguity in LIQUIDATE mode (contradictory disable/enable)
- Cancel timeout duration (operational parameter not specified)
- Manual order source code path during LIQUIDATE (how orders bypass the dead pipeline)

**GPT-5 unique findings (not in either other model):**

- Combined global/per-user mode semantics (what happens when global=RESTRICT,
  user=LIQUIDATE: can the user's liquidation proceed?)
- Scope of "all" in cancel_all and liquidation (system-wide vs per-user)
- Gate behavior when ETS missing but liquidation needed (conflicting requirements:
  fail-closed says block, but liquidation needs to pass)
- Disengage during in-flight cancellations (what happens to racing tasks)
- Gate placement relative to broker submission (exact point in the flow)
- Engage latency expectations (no quantified SLA)
- Mode change while already engaged (RESTRICT → LIQUIDATE without disengage)
- Dashboard vs backend scope for manual liquidation (individual vs bulk only)

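The "ETS missing but liquidation needed" conflict can be made concrete by writing the two rules as predicates and enumerating the inputs on which they disagree. This is a hypothetical model for illustration: the function names and boolean inputs are assumptions, and the actual gate logic in gargoyle is not shown in this finding.

```python
# Hypothetical model of the two rules; names and signatures are not from the spec.

def fail_closed_allows(ets_available: bool) -> bool:
    """Fail-closed rule: with the ETS table missing, block every order."""
    return ets_available


def liquidation_requires_pass(is_liquidation_order: bool) -> bool:
    """Liquidation rule: liquidation orders must get through the gate."""
    return is_liquidation_order


def rules_conflict(ets_available: bool, is_liquidation_order: bool) -> bool:
    """True when one rule demands a pass and the other demands a block."""
    must_pass = liquidation_requires_pass(is_liquidation_order)
    must_block = not fail_closed_allows(ets_available)
    return must_pass and must_block


# Enumerate every input combination and keep the contradictory ones.
conflicts = [
    (ets, liq)
    for ets in (True, False)
    for liq in (True, False)
    if rules_conflict(ets, liq)
]
print(conflicts)  # [(False, True)]
```

Under this reading, exactly one case is contradictory: the table is missing and the order is a liquidation order. That is the case an implementer would have to resolve without guidance.
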

**Sonnet 4.6 unique findings (not in either other model):**

- ETS sequencing relative to process termination (ETS before or after kill?)
- Concurrent disengage + re-engage race (specific interleaving scenario)
- Close-only enforcement mechanism (UI-only vs backend validation)
- Order-in-flight past ETS gate during termination (already-checked orders)

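The split into common and model-unique findings above amounts to set operations, once each model's findings have been normalized to shared labels. That normalization is assumed here to be a manual step. The `partition` helper and the short labels below are illustrative, and the sample is a small subset of the lists above, not the full result set.

```python
def partition(findings_by_model):
    """Split findings into those every model reported and those only one did."""
    all_sets = list(findings_by_model.values())
    common = set.intersection(*all_sets)
    unique = {}
    for model, found in findings_by_model.items():
        # Everything reported by any other model.
        others = set().union(
            *(f for m, f in findings_by_model.items() if m != model)
        )
        unique[model] = found - others
    return common, unique


# Illustrative subset of the lists above, with made-up short labels.
sample = {
    "sonnet-4.5": {"ets-ownership", "concurrent-engage", "ets-schema", "cancel-timeout"},
    "sonnet-4.6": {"ets-ownership", "concurrent-engage", "ets-sequencing"},
    "gpt-5": {"ets-ownership", "concurrent-engage", "mode-change-while-engaged"},
}
common, unique = partition(sample)
print(sorted(common))
print({model: sorted(found) for model, found in unique.items()})
```

Findings reported by exactly two of the three models fall into neither bucket, which is one reason the common and unique lists need not add up to a model's total.
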

**Quality assessment:**

- **Claude Sonnet 4.5** was the most EXHAUSTIVE (25 gaps) but with notable
  quality variance. Several findings were highly specific and
  implementation-relevant (ETS schema, missing-table detection, broker rejection
  semantics). Others were relatively obvious or lower-impact (user deletion,
  audit log failure, cancel timeout duration). The 14 Critical ratings feel
  somewhat generous; some would be more accurately rated as High in practice.
  Output was well-structured with clear per-finding format.
- **GPT-5** found 19 gaps with consistent high quality. Its unique findings
  show cross-cutting reasoning: the combined mode semantics finding (global
  vs per-user mode interaction) identifies a genuine specification gap that
  neither Sonnet version noticed. The "ETS missing but liquidation needed"
  finding is architecturally significant: it identifies a CONTRADICTION in
  the spec's own rules (fail-closed blocks everything, but liquidation must
  pass). Every finding was actionable. More selective severity ratings
  (8 Critical vs Sonnet 4.5's 14).
- **Claude Sonnet 4.6** was the most SELECTIVE (13 gaps) but with the highest
  precision. Every finding was genuinely a specification gap that an
  implementer would face. The ETS sequencing finding (#4) is particularly
  well-reasoned: it identifies a specific ordering dependency that creates
  a race window. Sonnet 4.6 appears to self-filter aggressively, producing
  only findings it's confident about. Higher signal-to-noise than 4.5.

**Key insight (Sonnet 4.5 vs 4.6 on analytical tasks):**

This is the first direct comparison between Claude model versions on the same
analytical task. Key differences:

- **Volume:** 4.5 produced almost 2x the findings (25 vs 13)
- **Tokens:** 4.5 used ~1.5x the output tokens (5,191 vs 3,403)
- **Time:** 4.5 took ~1.4x longer (102s vs 73s)
- **Severity distribution:** 4.5 had more Critical findings (14 vs 8) but
  with more generous severity ratings
- **Quality per finding:** 4.6 had higher average quality; fewer "obvious"
  or lower-impact findings

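The ratios in the list above follow directly from the results table. The short calculation below reproduces them and adds output tokens per finding as a derived figure; the variable names are illustrative.

```python
# (gaps, output_tokens, seconds) from the results table.
SONNET_45 = (25, 5191, 102)
SONNET_46 = (13, 3403, 73)


def ratios(a, b):
    """Element-wise a/b, rounded to two decimals."""
    return tuple(round(x / y, 2) for x, y in zip(a, b))


volume, tokens, time_ratio = ratios(SONNET_45, SONNET_46)
print(volume, tokens, time_ratio)  # 1.92 1.53 1.4

# Output tokens spent per reported finding.
tokens_per_finding_45 = round(SONNET_45[1] / SONNET_45[0])
tokens_per_finding_46 = round(SONNET_46[1] / SONNET_46[0])
print(tokens_per_finding_45, tokens_per_finding_46)  # 208 262
```

On these numbers 4.6 spends more output tokens per finding (about 262 vs 208), so its shorter output reflects fewer findings rather than terser ones.
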

The 4.6 model appears to have been trained toward higher precision/selectivity.
It finds fewer things but each finding is more reliably a genuine gap. The 4.5
model is more exhaustive but includes findings that a reviewer might triage as
"yes, technically, but not really a spec gap." This is consistent with the
direction we have seen in Claude models so far: later versions tend to be more
concise and selective.

**For practical use:** If you want completeness (cast a wide net, accept some
noise): use 4.5. If you want precision (every finding is actionable, no triage
needed): use 4.6. For architecture review where missing a gap has cost, 4.5's
exhaustiveness is probably worth the noise. For review where false positives
cost attention (e.g., PR review comments), 4.6's selectivity is preferred.

**GPT-5 vs Sonnet comparison on this task:**

GPT-5 (19 findings) sits between the two Sonnets in volume but has the highest
consistency: no obvious misses or inflated severities. Its unique strength
here is finding CONTRADICTIONS within the spec's own rules (ETS-missing blocking
conflicts with liquidation needing to pass). This is consistent with Finding #15,
where GPT-5 was unusually selective but precise on coherence checking.

Specification completeness analysis appears to be a task where:

1. Sonnet 4.5 is strongest for breadth (25 findings, catches operational gaps)
2. GPT-5 is strongest for detecting spec self-contradictions (19 findings, high precision)
3. Sonnet 4.6 is strongest for precision (13 findings, zero noise)

**Updated model version comparison:**

- Claude 4.6 → higher precision, more selective, concise
- Claude 4.5 → more exhaustive, more verbose, occasional severity inflation
- This is a genuine tradeoff, not a simple regression or improvement

**Practical implication:** Run BOTH Sonnet versions? 4.5 catches things 4.6
filters out (ETS schema, broker rejection semantics, cold-start gate interaction).
4.6 catches things with more specificity (sequencing gaps, exact race windows).
For a one-shot budget: 4.5 if you want coverage, 4.6 if you want actionability.
GPT-5 if you want to find where the spec contradicts itself.

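If both Sonnet versions are run, their outputs need to be merged into one triage list. The sketch below shows one possible approach, assuming findings have already been normalized to shared labels. The `merge_runs` helper, the labels, and the severities in the sample are hypothetical, and keeping the higher rating on disagreement is an arbitrary, conservative choice.

```python
SEVERITY_RANK = {"Low": 0, "Medium": 1, "High": 2, "Critical": 3}


def merge_runs(runs):
    """Union findings across models, tracking who reported each one.

    `runs` maps model name -> {finding_label: severity}. When models disagree
    on severity, the higher rating is kept.
    """
    merged = {}
    for model, findings in runs.items():
        for label, severity in findings.items():
            entry = merged.setdefault(label, {"severity": severity, "models": []})
            entry["models"].append(model)
            if SEVERITY_RANK[severity] > SEVERITY_RANK[entry["severity"]]:
                entry["severity"] = severity
    # Findings confirmed by more models first, then by severity.
    return sorted(
        merged.items(),
        key=lambda kv: (-len(kv[1]["models"]), -SEVERITY_RANK[kv[1]["severity"]]),
    )


# Hypothetical labels and severities, for illustration only.
runs = {
    "sonnet-4.5": {"ets-schema": "High", "engage-atomicity": "Critical"},
    "sonnet-4.6": {"engage-atomicity": "High", "ets-sequencing": "Critical"},
}
triage = merge_runs(runs)
for label, info in triage:
    print(label, info["severity"], info["models"])
```

Sorting by agreement first puts the findings both models reported at the top, so the extra noise from the more exhaustive model lands at the bottom of the list.
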