Rodin 6af8a6ee10 refactor(findings): split ALL-FINDINGS.md into per-experiment files

Break the monolithic 3249-line findings file into 29 individual files,
one per experiment. Each file is named YYYY-MM-DD-NN-slug.md for easy
chronological sorting and discovery.

No content changes — purely structural reorganization.

2026-05-06 07:15:50 -07:00

Finding 16: Specification completeness: Sonnet 4.5 produces nearly 2x the findings of Sonnet 4.6 on implementation-gap analysis; quality vs. volume tradeoff

Date: 2026-05-03

Task: Identify specification gaps in gargoyle's kill-switch.md (185 lines): places where an implementer would be forced to guess or decide on their own because the spec doesn't clearly specify behavior. New analytical lens not previously tested.

How we used them: Same document (full text) + same focused analytical question to all 3 models via HAI proxy. Highly structured prompt specifying 5 categories of underspecification (behavioral ambiguity, missing edge cases, ordering/sequencing gaps, interface contracts undefined, concurrency semantics omitted). Required specific output format per finding (gap, section, what implementer must decide, risk if wrong, severity). No tools, no project context beyond the document itself.
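The per-finding output format and the five categories described above could be captured in a small record type, which makes the three models' outputs easier to compare mechanically. This is a minimal sketch, not the prompt or parser actually used: the class and field names are hypothetical, and the example's section, risk, severity, and category values are placeholders rather than values taken from any model's output.

```python
from dataclasses import dataclass
from enum import Enum


class GapCategory(Enum):
    """The five categories of underspecification named in the prompt."""
    BEHAVIORAL_AMBIGUITY = "behavioral ambiguity"
    MISSING_EDGE_CASE = "missing edge case"
    ORDERING_GAP = "ordering/sequencing gap"
    UNDEFINED_INTERFACE_CONTRACT = "interface contract undefined"
    CONCURRENCY_OMITTED = "concurrency semantics omitted"


class Severity(Enum):
    CRITICAL = "Critical"
    HIGH = "High"
    MEDIUM = "Medium"
    LOW = "Low"


@dataclass(frozen=True)
class GapFinding:
    """One finding in the required output format (field names are assumed)."""
    gap: str
    section: str
    implementer_must_decide: str
    risk_if_wrong: str
    severity: Severity
    category: GapCategory


# Illustrative record only: every value except the gap title is a placeholder.
example = GapFinding(
    gap="ETS table ownership and lifecycle",
    section="(placeholder section)",
    implementer_must_decide="Which process owns the table and what happens on crash",
    risk_if_wrong="(placeholder risk)",
    severity=Severity.CRITICAL,
    category=GapCategory.BEHAVIORAL_AMBIGUITY,
)
```

Parsing each model's response into records like this would allow severity counts and overlaps to be tallied by script rather than by hand.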

Model	Time	Output tokens	Reasoning tokens	Gaps found	Critical	High	Medium	Low
Claude Sonnet 4.6	73s	3,403	(internal)	13	8	4	0	1
Claude Sonnet 4.5	102s	5,191	(internal)	25	14	6	4	1
GPT-5	109s	10,140	7,872	19	8	7	3	0

What they found — common ground (all 3 identified):

Pipeline process identification ambiguity (which processes are "pipeline processes")
Per-user process scope mapping (how to terminate only one user's processes)
ETS table ownership and lifecycle (who owns it, what happens on crash)
Concurrent engage operations (what happens when two sources engage simultaneously)
Liquidation order tagging mechanism (what the tag is, how verified)
Process restart prevention (how "must not restart" is enforced)
Engage sequence atomicity (partial failure between DB write and termination)
Startup ordering and ETS readiness (pipeline starting before ETS populated)
Disengage sequence ordering (what happens and in what order)

Sonnet 4.5 unique findings (not in either other model):

ETS table schema/structure (set vs ordered_set, key format, value schema)
Missing ETS detection mechanism (catch :badarg vs table existence check)
Database write atomicity with ETS (transaction boundaries, rollback semantics)
Per-user engage while global is already engaged (is it a no-op or error?)
Broker rejection semantics ("already filled" vs "invalid cancel" distinction)
Cold-start gate interaction (independence vs dependency of the two gates)
User deletion with active kill switch (orphaned rows, cascade semantics)
Global disengage effect on per-user states (independent or auto-clear?)
Audit log write failure during engage (critical-path vs best-effort)
Dashboard control ambiguity in LIQUIDATE mode (contradictory disable/enable)
Cancel timeout duration (operational parameter not specified)
Manual order source code path during LIQUIDATE (how orders bypass the dead pipeline)

GPT-5 unique findings (not in either other model):

Combined global/per-user mode semantics (what happens when global=RESTRICT, user=LIQUIDATE — can user's liquidation proceed?)
Scope of "all" in cancel_all and liquidation (system-wide vs per-user)
Gate behavior when ETS missing but liquidation needed (conflicting requirements: fail-closed says block, but liquidation needs to pass)
Disengage during in-flight cancellations (what happens to racing tasks)
Gate placement relative to broker submission (exact point in the flow)
Engage latency expectations (no quantified SLA)
Mode change while already engaged (RESTRICT → LIQUIDATE without disengage)
Dashboard vs backend scope for manual liquidation (individual vs bulk only)

Sonnet 4.6 unique findings (not in either other model):

ETS sequencing relative to process termination (ETS before or after kill?)
Concurrent disengage + re-engage race (specific interleaving scenario)
Close-only enforcement mechanism (UI-only vs backend validation)
Order-in-flight past ETS gate during termination (already-checked orders)
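The split into common and unique findings above amounts to a set partition over each model's list of gaps. A minimal sketch follows, assuming the gap labels have already been matched by hand so that the same gap carries the same label for every model (models phrase the same gap differently, so raw output would not line up like this). The `findings` mapping holds only a small subset of the gaps listed above, and the `partition` helper is hypothetical.

```python
from collections import Counter

# Small illustrative subset of the gaps listed above, keyed by model.
findings = {
    "Sonnet 4.6": {
        "ETS table ownership and lifecycle",
        "Concurrent engage operations",
        "Close-only enforcement mechanism",
    },
    "Sonnet 4.5": {
        "ETS table ownership and lifecycle",
        "Concurrent engage operations",
        "ETS table schema/structure",
    },
    "GPT-5": {
        "ETS table ownership and lifecycle",
        "Concurrent engage operations",
        "Engage latency expectations",
    },
}


def partition(findings):
    """Return (gaps found by every model, gaps found by exactly one model)."""
    counts = Counter(label for labels in findings.values() for label in labels)
    common = {label for label, n in counts.items() if n == len(findings)}
    unique = {
        model: {label for label in labels if counts[label] == 1}
        for model, labels in findings.items()
    }
    return common, unique


common, unique = partition(findings)
```

Gaps reported by exactly two models fall into neither bucket, so they would need a third bucket if that distinction matters.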

Quality assessment:

Claude Sonnet 4.5 was the most EXHAUSTIVE (25 gaps) but with notable quality variance. Several findings were highly specific and implementation-relevant (ETS schema, missing-table detection, broker rejection semantics). Others were relatively obvious or lower-impact (user deletion, audit log failure, cancel timeout duration). The 14 Critical ratings feel somewhat generous; some would be more accurately rated as High in practice. Output was well-structured with clear per-finding format.
GPT-5 found 19 gaps with consistent high quality. Its unique findings show cross-cutting reasoning: the combined mode semantics finding (global vs per-user mode interaction) identifies a genuine specification gap that neither Sonnet version noticed. The "ETS missing but liquidation needed" finding is architecturally significant — it identifies a CONTRADICTION in the spec's own rules (fail-closed blocks everything, but liquidation must pass). Every finding was actionable. More selective severity ratings (8 Critical vs Sonnet 4.5's 14).
Claude Sonnet 4.6 was the most SELECTIVE (13 gaps) but with the highest precision. Every finding was genuinely a specification gap that an implementer would face. The ETS sequencing finding (#4) is particularly well-reasoned — it identifies a specific ordering dependency that creates a race window. Sonnet 4.6 appears to self-filter aggressively, producing only findings it's confident about. Higher signal-to-noise than 4.5.

Key insight — Sonnet 4.5 vs 4.6 on analytical tasks: This is the first direct comparison between Claude model versions on the same analytical task. Key differences:

Volume: 4.5 produced almost 2x the findings (25 vs 13)
Tokens: 4.5 used ~1.5x the output tokens (5,191 vs 3,403)
Time: 4.5 took ~1.4x longer (102s vs 73s)
Severity distribution: 4.5 had more Critical findings (14 vs 8) but with more generous severity ratings
Quality per finding: 4.6 had higher average quality; fewer "obvious" or lower-impact findings
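The ratios quoted above can be reproduced from the results table. The sketch below uses only the table's numbers; the dictionary layout and the `ratio` helper are illustrative.

```python
# Figures copied from the results table for the two Sonnet runs.
runs = {
    "Sonnet 4.6": {"seconds": 73, "output_tokens": 3403, "gaps": 13, "critical": 8},
    "Sonnet 4.5": {"seconds": 102, "output_tokens": 5191, "gaps": 25, "critical": 14},
}


def ratio(metric):
    """How many times larger the 4.5 value is than the 4.6 value."""
    return runs["Sonnet 4.5"][metric] / runs["Sonnet 4.6"][metric]


for metric in ("gaps", "output_tokens", "seconds"):
    print(f"{metric}: {ratio(metric):.2f}x")
# gaps: 1.92x
# output_tokens: 1.53x
# seconds: 1.40x

# Share of each model's findings that were rated Critical.
critical_share = {model: r["critical"] / r["gaps"] for model, r in runs.items()}
```

The Critical share works out to roughly 62% for 4.6 (8 of 13) and 56% for 4.5 (14 of 25), so 4.5's higher Critical count is a difference in absolute numbers rather than in proportion.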

The 4.6 model appears to have been trained toward higher precision/selectivity. It finds fewer things but each finding is more reliably a genuine gap. The 4.5 model is more exhaustive but includes findings that a reviewer might triage as "yes, technically, but not really a spec gap." This is consistent with the impression that later Claude versions tend to be more concise and selective, though one task is thin evidence for a general trend.

For practical use: If you want completeness (cast a wide net, accept some noise): use 4.5. If you want precision (every finding is actionable, no triage needed): use 4.6. For architecture review where missing a gap has cost, 4.5's exhaustiveness is probably worth the noise. For review where false positives cost attention (e.g., PR review comments), 4.6's selectivity is preferred.

GPT-5 vs Sonnet comparison on this task: GPT-5 (19 findings) sits between the two Sonnets in volume but has the highest consistency — no obvious misses or inflated severities. Its unique strength here: finding CONTRADICTIONS within the spec's own rules (ETS-missing blocking conflicts with liquidation needing to pass). This is consistent with Finding #15 where GPT-5 was unusually selective but precise on coherence checking.
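The contradiction GPT-5 flagged can be made concrete with a toy model of the order gate. This is a sketch under loud assumptions: the spec itself is not reproduced here, so the function name, the `"liquidation"` tag, and the use of `None` to stand in for a missing ETS table are all hypothetical. It only illustrates why a fail-closed rule and a "liquidation must pass" rule cannot both hold when the table is absent.

```python
from typing import Optional


def gate_allows(order_tag: str, table: Optional[dict]) -> bool:
    """Toy fail-closed gate; `table` is None when the state table is missing."""
    if table is None:
        # Fail-closed: with no state to consult, block every order,
        # which includes orders tagged as liquidation.
        return False
    if not table.get("engaged", False):
        return True  # kill switch not engaged: everything passes
    # Engaged (liquidation mode assumed): only liquidation orders pass.
    return order_tag == "liquidation"


# With the table present, liquidation passes while the switch is engaged.
passes_when_present = gate_allows("liquidation", {"engaged": True})

# With the table missing, the same order is blocked by the fail-closed rule.
passes_when_missing = gate_allows("liquidation", None)
```

Whichever branch an implementer picks for the missing-table case, one of the two stated requirements is violated, which is why this reads as a contradiction in the spec rather than a gap.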

Specification completeness analysis appears to be a task where:

Sonnet 4.5 is strongest for breadth (25 findings, catches operational gaps)
GPT-5 is strongest for detecting spec self-contradictions (19 findings, high precision)
Sonnet 4.6 is strongest for precision (13 findings, zero noise)

Updated model version comparison:

Claude 4.6 → higher precision, more selective, concise
Claude 4.5 → more exhaustive, more verbose, occasional severity inflation
This is a genuine tradeoff, not a simple regression or improvement

Practical implication: Run BOTH Sonnet versions? 4.5 catches things 4.6 filters out (ETS schema, broker rejection semantics, cold-start gate interaction). 4.6 catches things with more specificity (sequencing gaps, exact race windows). For a one-shot budget: 4.5 if you want coverage, 4.6 if you want actionability. GPT-5 if you want to find where the spec contradicts itself.
