From 0c632c255a34f9b41e0686b8cbaafad65e89f985 Mon Sep 17 00:00:00 2001
From: claw
Date: Thu, 7 May 2026 09:26:08 -0700
Subject: [PATCH] finding #39: narrow framing does not close Sonnet-GPT-5 gap
 for semantic consistency
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Tested open question from Finding #5: does narrow framing give Sonnet
GPT-5-level semantic analysis?

Result: NO. Narrow framing changes WHAT Sonnet looks for (redirects from
gaps to contradictions) but not HOW WELL it reasons. Sonnet narrow found
3 contradictions but only 1 was genuine (2 were analytical errors/misreads).
GPT-5 found 4 all-genuine findings with precise reasoning.

Key insight: framing controls scope, not reasoning depth. For tasks
requiring logical verification (contradictions, race conditions, invariant
violations), reasoning tokens are necessary — framing alone is insufficient.

Updated open-questions.md: marked Sonnet+narrow as answered, added new
question about Opus+narrow for contradiction detection.
---
 ...-framing-does-not-close-sonnet-gpt5-gap.md | 181 ++++++++++++++++++
 open-questions.md                             |  18 +-
 2 files changed, 196 insertions(+), 3 deletions(-)
 create mode 100644 findings/2026-05-07-39-narrow-framing-does-not-close-sonnet-gpt5-gap.md

diff --git a/findings/2026-05-07-39-narrow-framing-does-not-close-sonnet-gpt5-gap.md b/findings/2026-05-07-39-narrow-framing-does-not-close-sonnet-gpt5-gap.md
new file mode 100644
index 0000000..92aa458
--- /dev/null
+++ b/findings/2026-05-07-39-narrow-framing-does-not-close-sonnet-gpt5-gap.md
@@ -0,0 +1,181 @@
+# Finding #39: Narrow framing does NOT close the Sonnet-GPT-5 gap for semantic consistency analysis
+
+**Date:** 2026-05-07
+**Document:** `docs/domain/user-lifecycle.md` (343 lines)
+**Task type:** Internal logical consistency / self-contradiction detection
+**Models:** Claude Sonnet 4 (narrow + broad), GPT-5 (narrow)
+**Open question tested:** "Sonnet + narrow framing = GPT-5 level?" (from Finding #5)
+
+## Experiment Design
+
+Finding #5 hypothesized that Sonnet's tendency to focus on structural issues (formatting,
+links, missing sections) rather than semantic issues (logical contradictions, meaning-level
+conflicts) might be a **framing artifact** — that giving Sonnet a narrow semantic question
+would produce GPT-5-level semantic analysis.
+
+Three conditions, same document:
+
+| Condition | Model | Framing | Purpose |
+|---|---|---|---|
+| A | Sonnet 4 | Narrow: "identify contradictions within this document" | Test hypothesis |
+| B | GPT-5 | Same narrow question | Control (known semantic strength) |
+| C | Sonnet 4 | Broad: "review for quality, clarity, completeness, correctness" | Baseline |
+
+Narrow prompt explicitly excluded gap-finding, suggestions, missing features — ONLY
+self-contradictions. Required specific format: two conflicting quotes, locations,
+explanation of logical conflict, severity.
+
+## Results
+
+| Condition | Model | Framing | Time | Output tokens | Reasoning tokens | Findings |
+|---|---|---|---|---|---|---|
+| A | Sonnet 4 | Narrow semantic | 16s | 694 | (internal) | 3 |
+| B | GPT-5 | Narrow semantic | 108s | 11,280 | 10,368 | 4 |
+| C | Sonnet 4 | Broad review | 24s | 1,248 | (internal) | 13 |
+
+## What They Found
+
+### Common ground (both Sonnet narrow and GPT-5 identified):
+
+1. **Credential "validity" contradiction** — Gate #3 says "valid broker credentials"
+   but the document explicitly says "does not validate broker credentials at startup."
+   Both models caught this. GPT-5's analysis was more nuanced (distinguished presence
+   vs validity, noted the `no_credentials` reason code suggests presence-checking).
+   Sonnet called it Critical; GPT-5 called it High.
+
+### GPT-5 unique findings (not in Sonnet narrow):
+
+2. **"Terminal state" with outgoing transitions** — The table says stopped is "Default
+   and terminal state" but the state diagram shows multiple outgoing transitions from
+   stopped (→starting, →blocked). A terminal state by definition has no outgoing
+   transitions. This is a clear terminology contradiction.
+
+3. **Per-user release vs system-wide release asymmetry** — The eligibility section says
+   "every activation path (event-driven, periodic, operator-initiated) consults [the
+   eligibility check]." But per-user release says it "starts their instance" (no
+   qualifying check mentioned), while system-wide release says "starts all qualifying
+   users" (explicit check). The inconsistent language implies per-user release might
+   bypass eligibility — contradicting the universal gate rule.
+
+4. **"Running instance" ambiguity** — The document distinguishes "starting" (instance
+   exists, cold-start incomplete) from "ready" (trading active), but uses "running
+   instance" without defining which states it includes. The edge case section implies
+   "running" includes "starting," but reconciliation uses "running" in a context that
+   might mean only "ready." Contradictory interpretation paths exist.
+
+### Sonnet narrow unique findings (not in GPT-5):
+
+2. **State diagram vs Instance Start section inconsistency** — State diagram shows
+   stopped→blocked transition, but Instance Start section describes the same scenario
+   (ineligible user) without transitioning to blocked. Sonnet argued these conflict.
+   (This is actually WRONG — the Instance Start section says "record the blocking
+   reason" which corresponds to the blocked state. Sonnet misread the prose.)
+
+3. **Configuration events vs operational gates interaction** — Sonnet argued that
+   `user_configured` not triggering start (due to kill switch) contradicts "configuration
+   prerequisites changing triggers instance start automatically." This is a MISREAD —
+   the document actually says the trigger fires but eligibility fails, which is
+   consistent (trigger ≠ guaranteed start). Sonnet confused "trigger the evaluation"
+   with "trigger the start."
+
+### Sonnet broad findings (Condition C):
+
+Of 13 findings, NONE were internal logical contradictions. All were:
+- Missing features (5): recovery mechanisms, state persistence, error handling, perf specs, glossary
+- External concerns (4): race conditions, cascading failures, component boundaries, integration specs
+- Suggestions (3): formatting, calibration rationale, event ordering
+- One terminology issue (#9) that partially overlaps with GPT-5's finding #4
+
+## Key Insight — The Hypothesis is REJECTED
+
+**Sonnet + narrow framing does NOT produce GPT-5-level semantic analysis.**
+
+The narrow framing did change Sonnet's behavior — it stopped suggesting missing features
+and focused on contradictions (Condition A vs C). But the quality of its semantic reasoning
+was significantly lower than GPT-5's:
+
+1. **Quantity:** Sonnet found 3 vs GPT-5's 4 — a smaller gap than expected
+2. **Quality:** This is the real differentiator:
+   - Sonnet's finding #2 contains an analytical error (misreading the Instance Start section)
+   - Sonnet's finding #3 conflates "trigger the evaluation" with "guarantee the start"
+   - GPT-5's findings are all logically sound and precisely reasoned
+3. **Precision:** Of Sonnet's 3 findings, only 1 is genuinely correct (the credential one).
+   GPT-5's 4 findings are all legitimate contradictions or ambiguities.
+4. **Depth:** GPT-5's analysis distinguishes subtler levels (presence vs validity, per-user
+   vs system-wide wording asymmetry). Sonnet identifies surface contradictions.
+
+**Effective findings:** Sonnet narrow = 1 genuine / GPT-5 narrow = 4 genuine.
+That's a 4:1 quality gap, not a framing artifact.
+
+## Why Narrow Framing Doesn't Help
+
+The hypothesis assumed Sonnet's structural bias was caused by the broad prompt giving it
+"permission" to find easy structural issues instead of harder semantic ones. But the narrow
+prompt FORCED it to look for contradictions — and it still couldn't reason about them
+correctly.
+
+The problem isn't that Sonnet doesn't LOOK for semantic issues when framed narrowly. It does.
+The problem is that semantic consistency analysis requires:
+
+1. **Holding multiple document sections in working memory simultaneously**
+2. **Reasoning about the logical IMPLICATIONS of each statement (not just surface text)**
+3. **Testing whether interpretations are actually contradictory vs complementary**
+
+GPT-5's 10,368 reasoning tokens were spent doing exactly this — cross-referencing sections,
+testing interpretations, confirming that conflicts are genuine. Sonnet's internal reasoning
+(not reported) apparently doesn't do this verification step, leading to false-positive
+contradictions (findings where the document is actually consistent but Sonnet misread one
+part).
+
+## Comparison to Previous Findings
+
+This result is consistent with Finding #13 (race conditions), where Sonnet also struggled
+with reasoning that requires holding multiple interacting parts in working memory. And it
+CONTRASTS with Findings #12 and #14, where Sonnet performed well on assumption-finding
+and cross-component analysis.
+
+The emerging pattern:
+- **Sonnet excels at:** identification tasks ("what could go wrong?") where each finding
+  is evaluated independently
+- **Sonnet struggles with:** verification tasks ("does X actually contradict Y?") where
+  the finding requires cross-referencing and logical proof
+
+Narrow framing helps Sonnet focus, but doesn't help it REASON more deeply.
+
+## Sonnet Broad vs Sonnet Narrow
+
+The comparison between conditions A and C reveals what narrow framing DOES accomplish:
+
+- **Broad Sonnet** produced 13 findings but 0 were internal contradictions (all gaps/suggestions)
+- **Narrow Sonnet** produced 3 findings that were at least ATTEMPTING to be contradictions (1 genuine)
+
+So narrow framing successfully redirects Sonnet from its default mode (gap-finding, structural
+review) to the target analytical mode (contradiction detection). It just doesn't give Sonnet
+the reasoning depth to execute that mode well.
+
+**Implication:** Framing controls WHAT Sonnet looks for, but not HOW WELL it reasons about
+what it finds. For tasks requiring logical verification (contradictions, race conditions,
+invariant violations), reasoning tokens are necessary — framing alone is insufficient.
+
+## Updated Open Questions
+
+The original question "Sonnet + narrow framing = GPT-5 level?" is now **ANSWERED: No.**
+
+New question arising: **Would Opus + narrow framing match GPT-5 for contradiction detection?**
+Opus has demonstrated strong cross-boundary reasoning in previous experiments. If contradiction
+detection is primarily about reasoning depth (which this experiment suggests), Opus's internal
+reasoning should perform better than Sonnet's. But GPT-5's extreme selectivity (10K reasoning
+tokens for 4 precise findings) might still dominate on precision.
+
+## Practical Implication
+
+For document self-consistency analysis:
+- **Use GPT-5.** It's the only model tested that reliably distinguishes genuine contradictions
+  from apparent ones.
+- **Don't use Sonnet** — even with narrow framing, it produces false-positive contradictions
+  that would waste reviewer time.
+- **Narrow framing helps with SCOPE** (preventing gap-finding when you want contradictions)
+  but not with QUALITY (preventing false positives in the findings it does produce).
+
+The three-model stack for architecture review should assign contradiction/consistency tasks
+to GPT-5 specifically, not to Sonnet with better prompts.
diff --git a/open-questions.md b/open-questions.md
index cf7ed11..4e89821 100644
--- a/open-questions.md
+++ b/open-questions.md
@@ -22,10 +22,22 @@ cross-doc contradictions are easy to verify once spotted (reducing GPT-5's
 verification advantage)? Or because boundary reasoning (Opus's strength) is
 the primary skill needed?
 
-### Sonnet + narrow framing = GPT-5 level? (from Finding #5)
-Would Sonnet catch semantic issues if given a narrower "check for logical
+### Opus + narrow framing for contradiction detection (from Finding #39)
+Would Opus + narrow framing match GPT-5 for self-contradiction detection?
+Finding #39 showed Sonnet can't do it even with narrow framing (reasoning
+depth issue). Opus has strong cross-boundary reasoning — does its internal
+reasoning depth suffice for the verification step that Sonnet lacks?
+
+### ~~Sonnet + narrow framing = GPT-5 level? (from Finding #5)~~ → ANSWERED (Finding #39)
+~~Would Sonnet catch semantic issues if given a narrower "check for logical
 consistency" framing instead of broad review? The hypothesis: Sonnet's
-"structural reviewer" tendency is a framing artifact, not a capability limit.
+"structural reviewer" tendency is a framing artifact, not a capability limit.~~
+
+**NO.** Narrow framing changes WHAT Sonnet looks for (redirects from gaps to
+contradictions) but not HOW WELL it reasons. Sonnet narrow found 3 contradictions
+but only 1 was genuine (2 were misreadings). GPT-5 found 4 all-genuine findings.
+The gap is reasoning depth, not framing — Sonnet can't reliably verify whether
+two statements actually contradict each other.
 
 ## Medium Priority
 