finding #39: narrow framing does not close Sonnet-GPT-5 gap for semantic consistency

Tested open question from Finding #5: does narrow framing give Sonnet GPT-5-level semantic analysis? Result: NO. Narrow framing changes WHAT Sonnet looks for (redirects from gaps to contradictions) but not HOW WELL it reasons. Sonnet narrow found 3 contradictions but only 1 was genuine (2 were analytical errors/misreads). GPT-5 found 4 all-genuine findings with precise reasoning. Key insight: framing controls scope, not reasoning depth. For tasks requiring logical verification (contradictions, race conditions, invariant violations), reasoning tokens are necessary — framing alone is insufficient. Updated open-questions.md: marked Sonnet+narrow as answered, added new question about Opus+narrow for contradiction detection.
2026-05-07 09:26:08 -07:00
parent d27ce6f5e1
commit 0c632c255a
2 changed files with 196 additions and 3 deletions
@@ -0,0 +1,181 @@
+# Finding #39: Narrow framing does NOT close the Sonnet-GPT-5 gap for semantic consistency analysis
+
+**Date:** 2026-05-07
+**Document:** `docs/domain/user-lifecycle.md` (343 lines)
+**Task type:** Internal logical consistency / self-contradiction detection
+**Models:** Claude Sonnet 4 (narrow + broad), GPT-5 (narrow)
+**Open question tested:** "Sonnet + narrow framing = GPT-5 level?" (from Finding #5)
+
+## Experiment Design
+
+Finding #5 hypothesized that Sonnet's tendency to focus on structural issues (formatting,
+links, missing sections) rather than semantic issues (logical contradictions, meaning-level
+conflicts) might be a **framing artifact** — that giving Sonnet a narrow semantic question
+would produce GPT-5-level semantic analysis.
+
+Three conditions, same document:
+
+| Condition | Model | Framing | Purpose |
+|---|---|---|---|
+| A | Sonnet 4 | Narrow: "identify contradictions within this document" | Test hypothesis |
+| B | GPT-5 | Same narrow question | Control (known semantic strength) |
+| C | Sonnet 4 | Broad: "review for quality, clarity, completeness, correctness" | Baseline |
+
+The narrow prompt explicitly excluded gap-finding, suggestions, and missing features; only
+self-contradictions were in scope. Each finding had to follow a fixed format: the two
+conflicting quotes, their locations, an explanation of the logical conflict, and a severity.
+
+## Results
+
+| Condition | Model | Framing | Time | Output tokens | Reasoning tokens | Findings |
+|---|---|---|---|---|---|---|
+| A | Sonnet 4 | Narrow semantic | 16s | 694 | not reported | 3 |
+| B | GPT-5 | Narrow semantic | 108s | 11,280 | 10,368 | 4 |
+| C | Sonnet 4 | Broad review | 24s | 1,248 | not reported | 13 |
+
+## What They Found
+
+### Common ground (both Sonnet narrow and GPT-5 identified):
+
+1. **Credential "validity" contradiction** — Gate #3 says "valid broker credentials"
+   but the document explicitly says "does not validate broker credentials at startup."
+   Both models caught this. GPT-5's analysis was more nuanced (distinguished presence
+   vs validity, noted the `no_credentials` reason code suggests presence-checking).
+   Sonnet called it Critical; GPT-5 called it High.
+
+### GPT-5 unique findings (not in Sonnet narrow):
+
+2. **"Terminal state" with outgoing transitions** — The table says stopped is "Default
+   and terminal state" but the state diagram shows multiple outgoing transitions from
+   stopped (→starting, →blocked). A terminal state by definition has no outgoing
+   transitions. This is a clear terminology contradiction.
+
+3. **Per-user release vs system-wide release asymmetry** — The eligibility section says
+   "every activation path (event-driven, periodic, operator-initiated) consults [the
+   eligibility check]." But per-user release says it "starts their instance" (no
+   qualifying check mentioned), while system-wide release says "starts all qualifying
+   users" (explicit check). The inconsistent language implies per-user release might
+   bypass eligibility — contradicting the universal gate rule.
+
+4. **"Running instance" ambiguity** — The document distinguishes "starting" (instance
+   exists, cold-start incomplete) from "ready" (trading active), but uses "running
+   instance" without defining which states it includes. The edge case section implies
+   "running" includes "starting," but reconciliation uses "running" in a context that
+   might mean only "ready." Contradictory interpretation paths exist.
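
The "terminal state" conflict in finding 2 is the kind of claim that can be checked mechanically once the state diagram is written down as data. The sketch below is illustrative only: the two edges out of `stopped` come from the finding above, while every other edge, the `TRANSITIONS` name, and the `terminal_violations` helper are assumptions made for the example, not part of the reviewed document.

```python
# Hypothetical transition table. Only the two edges out of "stopped" are taken
# from the finding above; the remaining edges are illustrative assumptions.
TRANSITIONS = {
    "stopped": {"starting", "blocked"},
    "starting": {"ready", "stopped"},  # assumed
    "blocked": {"stopped"},            # assumed
    "ready": {"stopped"},              # assumed
}

# States the prose labels as "terminal" (per the state table quoted above).
DECLARED_TERMINAL = {"stopped"}


def terminal_violations(transitions, declared_terminal):
    """Map each declared-terminal state that has outgoing edges to its sorted targets."""
    return {
        state: sorted(transitions.get(state, ()))
        for state in sorted(declared_terminal)
        if transitions.get(state)
    }


violations = terminal_violations(TRANSITIONS, DECLARED_TERMINAL)
for state, targets in violations.items():
    print(f"'{state}' is declared terminal but transitions to: {', '.join(targets)}")
```

A state with an empty target set passes the check, which matches the definition used in the finding: terminal means no outgoing transitions.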
+
+### Sonnet narrow unique findings (not in GPT-5):
+
+2. **State diagram vs Instance Start section inconsistency** — State diagram shows
+   stopped→blocked transition, but Instance Start section describes the same scenario
+   (ineligible user) without transitioning to blocked. Sonnet argued these conflict.
+   (This is actually WRONG — the Instance Start section says "record the blocking
+   reason" which corresponds to the blocked state. Sonnet misread the prose.)
+
+3. **Configuration events vs operational gates interaction** — Sonnet argued that
+   `user_configured` not triggering start (due to kill switch) contradicts "configuration
+   prerequisites changing triggers instance start automatically." This is a MISREAD —
+   the document actually says the trigger fires but eligibility fails, which is
+   consistent (trigger ≠ guaranteed start). Sonnet confused "trigger the evaluation"
+   with "trigger the start."
+
+### Sonnet broad findings (Condition C):
+
+Of 13 findings, NONE were internal logical contradictions. All were:
+- Missing features (5): recovery mechanisms, state persistence, error handling, perf specs, glossary
+- External concerns (4): race conditions, cascading failures, component boundaries, integration specs
+- Suggestions (3): formatting, calibration rationale, event ordering
+- One terminology issue (#9) that partially overlaps with GPT-5's finding #4
+
+## Key Insight — The Hypothesis is REJECTED
+
+**Sonnet + narrow framing does NOT produce GPT-5-level semantic analysis.**
+
+The narrow framing did change Sonnet's behavior: it stopped suggesting missing features
+and focused on contradictions (Condition A vs C). But its semantic reasoning was markedly
+weaker than GPT-5's:
+
+1. **Quantity:** Sonnet found 3 vs GPT-5's 4 — a smaller gap than expected
+2. **Quality:** This is the real differentiator:
+   - Sonnet's finding #2 contains an analytical error (misreading the Instance Start section)
+   - Sonnet's finding #3 conflates "trigger the evaluation" with "guarantee the start"
+   - GPT-5's findings are all logically sound and precisely reasoned
+3. **Precision:** Of Sonnet's 3 findings, only 1 is genuinely correct (the credential one).
+   GPT-5's 4 findings are all legitimate contradictions or ambiguities.
+4. **Depth:** GPT-5's analysis distinguishes subtler levels (presence vs validity, per-user
+   vs system-wide wording asymmetry). Sonnet identifies surface contradictions.
+
+**Effective findings:** Sonnet narrow = 1 genuine of 3 reported / GPT-5 narrow = 4 genuine of 4 reported.
+That's a 4:1 gap in genuine findings under identical framing, so it is not a framing artifact.
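
The precision and ratio figures above follow directly from the reported and genuine counts. A minimal sketch of that arithmetic, using only the counts stated in this finding (the `RESULTS` and `precision` names are illustrative):

```python
from fractions import Fraction

# Counts taken from the results above: (reported findings, genuine findings).
RESULTS = {
    "Sonnet 4 (narrow)": (3, 1),
    "GPT-5 (narrow)": (4, 4),
}


def precision(reported, genuine):
    """Fraction of reported findings that held up as genuine."""
    if reported <= 0:
        raise ValueError("precision is undefined without reported findings")
    return Fraction(genuine, reported)


precisions = {
    name: precision(reported, genuine)
    for name, (reported, genuine) in RESULTS.items()
}

for name, (reported, genuine) in RESULTS.items():
    share = float(precisions[name])
    print(f"{name}: {genuine}/{reported} genuine ({share:.0%} precision)")

# Ratio of genuine findings, GPT-5 to Sonnet: the 4:1 gap quoted above.
genuine_ratio = Fraction(RESULTS["GPT-5 (narrow)"][1], RESULTS["Sonnet 4 (narrow)"][1])
print(f"Genuine-finding ratio (GPT-5 : Sonnet) = {genuine_ratio}:1")
```

With one document and seven reported findings in total, these numbers describe this experiment only; they are not an estimate of either model's general precision.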
+
+## Why Narrow Framing Doesn't Help
+
+The hypothesis assumed Sonnet's structural bias was caused by the broad prompt giving it
+"permission" to find easy structural issues instead of harder semantic ones. But the narrow
+prompt FORCED it to look for contradictions — and it still couldn't reason about them
+correctly.
+
+The problem isn't that Sonnet doesn't LOOK for semantic issues when framed narrowly. It does.
+The problem is that semantic consistency analysis requires:
+
+1. **Holding multiple document sections in working memory simultaneously**
+2. **Reasoning about the logical IMPLICATIONS of each statement (not just surface text)**
+3. **Testing whether interpretations are actually contradictory vs complementary**
+
+GPT-5's 10,368 reasoning tokens were spent doing exactly this — cross-referencing sections,
+testing interpretations, confirming that conflicts are genuine. Sonnet's internal reasoning
+(not reported) apparently doesn't do this verification step, leading to false-positive
+contradictions (findings where the document is actually consistent but Sonnet misread one
+part).
+
+## Comparison to Previous Findings
+
+This result is consistent with Finding #13 (race conditions), where Sonnet also struggled
+with reasoning that requires holding multiple interacting parts in working memory. And it
+CONTRASTS with Findings #12 and #14, where Sonnet performed well on assumption-finding
+and cross-component analysis.
+
+The emerging pattern:
+- **Sonnet excels at:** identification tasks ("what could go wrong?") where each finding
+  is evaluated independently
+- **Sonnet struggles with:** verification tasks ("does X actually contradict Y?") where
+  the finding requires cross-referencing and logical proof
+
+Narrow framing helps Sonnet focus, but doesn't help it REASON more deeply.
+
+## Sonnet Broad vs Sonnet Narrow
+
+The comparison between conditions A and C reveals what narrow framing DOES accomplish:
+
+- **Broad Sonnet** produced 13 findings but 0 were internal contradictions (all gaps/suggestions)
+- **Narrow Sonnet** produced 3 findings that were at least ATTEMPTING to be contradictions (1 genuine)
+
+So narrow framing successfully redirects Sonnet from its default mode (gap-finding, structural
+review) to the target analytical mode (contradiction detection). It just doesn't give Sonnet
+the reasoning depth to execute that mode well.
+
+**Implication:** Framing controls WHAT Sonnet looks for, but not HOW WELL it reasons about
+what it finds. For tasks requiring logical verification (contradictions, race conditions,
+invariant violations), reasoning tokens are necessary — framing alone is insufficient.
+
+## Updated Open Questions
+
+The original question "Sonnet + narrow framing = GPT-5 level?" is now **ANSWERED: No.**
+
+New question arising: **Would Opus + narrow framing match GPT-5 for contradiction detection?**
+Opus has demonstrated strong cross-boundary reasoning in previous experiments. If contradiction
+detection is primarily about reasoning depth (which this experiment suggests), Opus's internal
+reasoning should perform better than Sonnet's. But GPT-5's extreme selectivity (10K reasoning
+tokens for 4 precise findings) might still dominate on precision.
+
+## Practical Implication
+
+For document self-consistency analysis:
+- **Use GPT-5.** Of the two models tested here, it is the only one whose reported
+  contradictions all held up as genuine on this document.
+- **Don't use Sonnet.** Even with narrow framing, 2 of its 3 reported contradictions were
+  false positives that would waste reviewer time.
+- **Narrow framing helps with SCOPE** (preventing gap-finding when you want contradictions)
+  but not with QUALITY (preventing false positives in the findings it does produce).
+
+The three-model stack for architecture review should assign contradiction/consistency tasks
+to GPT-5 specifically, not to Sonnet with better prompts.
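
One way to encode that assignment is a small routing table keyed on whether a task is identification or verification. This is a sketch under stated assumptions: the task labels, the model labels, and `pick_model` are hypothetical names, and only the contradiction/consistency routing is supported by this finding. Routing race-condition and invariant checks the same way extrapolates from the pattern described above.

```python
from enum import Enum


class TaskKind(Enum):
    IDENTIFICATION = "identification"  # each finding can be judged on its own
    VERIFICATION = "verification"      # needs cross-referencing and logical proof


# Hypothetical task labels; the grouping follows the pattern described above.
TASK_KINDS = {
    "assumption_finding": TaskKind.IDENTIFICATION,
    "cross_component_analysis": TaskKind.IDENTIFICATION,
    "contradiction_detection": TaskKind.VERIFICATION,
    "race_condition_analysis": TaskKind.VERIFICATION,    # extrapolated
    "invariant_violation_check": TaskKind.VERIFICATION,  # extrapolated
}

# Placeholder model labels, not real API identifiers.
MODEL_FOR_KIND = {
    TaskKind.IDENTIFICATION: "sonnet",
    TaskKind.VERIFICATION: "gpt-5",
}


def pick_model(task: str) -> str:
    """Return the model label assigned to a review task type."""
    try:
        kind = TASK_KINDS[task]
    except KeyError:
        raise ValueError(f"unknown task type: {task!r}") from None
    return MODEL_FOR_KIND[kind]


print(pick_model("contradiction_detection"))
```

Opus is left out of the table because its fit for contradiction detection is still an open question.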
@@ -22,10 +22,22 @@ cross-doc contradictions are easy to verify once spotted (reducing GPT-5's
 verification advantage)? Or because boundary reasoning (Opus's strength)
 is the primary skill needed?

-### Sonnet + narrow framing = GPT-5 level? (from Finding #5)
-Would Sonnet catch semantic issues if given a narrower "check for logical
+### Opus + narrow framing for contradiction detection (from Finding #39)
+Would Opus + narrow framing match GPT-5 for self-contradiction detection?
+Finding #39 showed Sonnet can't do it even with narrow framing (reasoning
+depth issue). Opus has strong cross-boundary reasoning — does its internal
+reasoning depth suffice for the verification step that Sonnet lacks?
+
+### ~~Sonnet + narrow framing = GPT-5 level? (from Finding #5)~~ → ANSWERED (Finding #39)
+~~Would Sonnet catch semantic issues if given a narrower "check for logical
 consistency" framing instead of broad review? The hypothesis: Sonnet's
-"structural reviewer" tendency is a framing artifact, not a capability limit.
+"structural reviewer" tendency is a framing artifact, not a capability limit.~~
+
+**NO.** Narrow framing changes WHAT Sonnet looks for (redirects from gaps to
+contradictions) but not HOW WELL it reasons. Sonnet narrow found 3 contradictions
+but only 1 was genuine (2 were misreadings). GPT-5 found 4 all-genuine findings.
+The gap is reasoning depth, not framing — Sonnet can't reliably verify whether
+two statements actually contradict each other.

 ## Medium Priority