finding #39: narrow framing does not close Sonnet-GPT-5 gap for semantic consistency
Tested open question from Finding #5: does narrow framing give Sonnet GPT-5-level semantic analysis? Result: NO. Narrow framing changes WHAT Sonnet looks for (redirects from gaps to contradictions) but not HOW WELL it reasons. Sonnet narrow found 3 contradictions but only 1 was genuine (2 were analytical errors/misreads). GPT-5 found 4 all-genuine findings with precise reasoning. Key insight: framing controls scope, not reasoning depth. For tasks requiring logical verification (contradictions, race conditions, invariant violations), reasoning tokens are necessary — framing alone is insufficient. Updated open-questions.md: marked Sonnet+narrow as answered, added new question about Opus+narrow for contradiction detection.
@@ -0,0 +1,181 @@
# Finding #39: Narrow framing does NOT close the Sonnet-GPT-5 gap for semantic consistency analysis

**Date:** 2026-05-07
**Document:** `docs/domain/user-lifecycle.md` (343 lines)
**Task type:** Internal logical consistency / self-contradiction detection
**Models:** Claude Sonnet 4 (narrow + broad), GPT-5 (narrow)
**Open question tested:** "Sonnet + narrow framing = GPT-5 level?" (from Finding #5)

## Experiment Design

Finding #5 hypothesized that Sonnet's tendency to focus on structural issues (formatting,
links, missing sections) rather than semantic issues (logical contradictions, meaning-level
conflicts) might be a **framing artifact**: giving Sonnet a narrow semantic question
would produce GPT-5-level semantic analysis.

Three conditions, same document:

| Condition | Model | Framing | Purpose |
|---|---|---|---|
| A | Sonnet 4 | Narrow: "identify contradictions within this document" | Test hypothesis |
| B | GPT-5 | Same narrow question | Control (known semantic strength) |
| C | Sonnet 4 | Broad: "review for quality, clarity, completeness, correctness" | Baseline |

The narrow prompt explicitly excluded gap-finding, suggestions, and missing features; it asked
ONLY for self-contradictions. It required a specific format for each finding: the two
conflicting quotes, their locations, an explanation of the logical conflict, and a severity.

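The required finding format can be sketched as a small record type. This is a minimal illustration, not the prompt or schema used in the experiment: the `ContradictionFinding` class, its field names, and every severity label other than "Critical" and "High" are assumptions.

```python
from dataclasses import dataclass

# Hypothetical severity scale; only "Critical" and "High" appear in this write-up.
SEVERITIES = ("Critical", "High", "Medium", "Low")


@dataclass(frozen=True)
class ContradictionFinding:
    """One self-contradiction, in the shape the narrow prompt asked for."""
    quote_a: str
    location_a: str
    quote_b: str
    location_b: str
    explanation: str
    severity: str

    def problems(self) -> list[str]:
        """Return format violations; an empty list means the record is well-formed."""
        issues = []
        if not self.quote_a.strip() or not self.quote_b.strip():
            issues.append("both conflicting quotes are required")
        elif self.quote_a.strip() == self.quote_b.strip():
            issues.append("the two quotes must differ")
        if not self.location_a.strip() or not self.location_b.strip():
            issues.append("both locations are required")
        if not self.explanation.strip():
            issues.append("an explanation of the logical conflict is required")
        if self.severity not in SEVERITIES:
            issues.append(f"unknown severity: {self.severity!r}")
        return issues


# The credential finding, using the quotes reported in this write-up.
credential_finding = ContradictionFinding(
    quote_a="valid broker credentials",
    location_a="Gate #3",
    quote_b="does not validate broker credentials at startup",
    location_b="(section not stated in this write-up)",
    explanation="A gate requires validity that the document says is never checked.",
    severity="High",
)
```

A record like this only checks that a finding is well-formed. It says nothing about whether the two quotes genuinely conflict, which is the part of the task the experiment measures.
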
## Results

| Condition | Model | Framing | Time | Output tokens | Reasoning tokens | Findings |
|---|---|---|---|---|---|---|
| A | Sonnet 4 | Narrow semantic | 16s | 694 | (internal, not reported) | 3 |
| B | GPT-5 | Narrow semantic | 108s | 11,280 | 10,368 | 4 |
| C | Sonnet 4 | Broad review | 24s | 1,248 | (internal, not reported) | 13 |

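A quick calculation makes the cost difference concrete. The sketch below only rearranges the numbers in the results table. It assumes GPT-5's reported output tokens include its reasoning tokens, which the table does not state.

```python
# Figures copied from the results table.
runs = {
    "A: Sonnet 4 narrow": {"seconds": 16, "output_tokens": 694, "findings": 3},
    "B: GPT-5 narrow": {"seconds": 108, "output_tokens": 11_280, "findings": 4},
    "C: Sonnet 4 broad": {"seconds": 24, "output_tokens": 1_248, "findings": 13},
}
GPT5_REASONING_TOKENS = 10_368

tokens_per_finding = {
    name: run["output_tokens"] / run["findings"] for name, run in runs.items()
}

# Assumption: reasoning tokens are counted inside the output-token total.
gpt5 = runs["B: GPT-5 narrow"]
reasoning_share = GPT5_REASONING_TOKENS / gpt5["output_tokens"]
visible_tokens = gpt5["output_tokens"] - GPT5_REASONING_TOKENS
slowdown = gpt5["seconds"] / runs["A: Sonnet 4 narrow"]["seconds"]

for name, value in tokens_per_finding.items():
    print(f"{name}: {value:.0f} output tokens per finding")
print(f"GPT-5 reasoning share: {reasoning_share:.1%}")
print(f"GPT-5 visible output: {visible_tokens} tokens")
print(f"GPT-5 vs Sonnet narrow wall time: {slowdown:.2f}x")
```

Under that assumption, roughly 92% of GPT-5's output budget went to reasoning, and the run took 6.75 times as long as Sonnet narrow.
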
## What They Found

### Common ground (identified by both Sonnet narrow and GPT-5):

1. **Credential "validity" contradiction**: Gate #3 says "valid broker credentials"
   but the document explicitly says it "does not validate broker credentials at startup."
   Both models caught this. GPT-5's analysis was more nuanced: it distinguished presence
   from validity and noted that the `no_credentials` reason code suggests presence-checking.
   Sonnet called it Critical; GPT-5 called it High.

### GPT-5 unique findings (not in Sonnet narrow):

2. **"Terminal state" with outgoing transitions**: The table says stopped is the "Default
   and terminal state," but the state diagram shows multiple outgoing transitions from
   stopped (→starting, →blocked). A terminal state by definition has no outgoing
   transitions. This is a clear terminology contradiction.

3. **Per-user release vs system-wide release asymmetry**: The eligibility section says
   "every activation path (event-driven, periodic, operator-initiated) consults [the
   eligibility check]." But per-user release says it "starts their instance" (no
   qualifying check mentioned), while system-wide release says it "starts all qualifying
   users" (explicit check). The inconsistent language implies per-user release might
   bypass eligibility, contradicting the universal gate rule.

4. **"Running instance" ambiguity**: The document distinguishes "starting" (instance
   exists, cold-start incomplete) from "ready" (trading active), but uses "running
   instance" without defining which states it includes. The edge case section implies
   "running" includes "starting," but reconciliation uses "running" in a context that
   might mean only "ready." The two readings lead to contradictory interpretations.

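GPT-5's finding #2 is the kind of check that can be mechanized once the state diagram is written down as data. The sketch below is illustrative only: the transition list contains just the two edges named in that finding, and `terminal_state_violations` is a hypothetical helper, not something from the reviewed document.

```python
from collections import defaultdict

# Only the edges named in the finding; the real diagram has more.
transitions = [
    ("stopped", "starting"),
    ("stopped", "blocked"),
]
# States the document's table labels as terminal.
declared_terminal = {"stopped"}


def terminal_state_violations(edges, terminal):
    """Map each declared-terminal state to its outgoing targets, if it has any."""
    outgoing = defaultdict(list)
    for source, target in edges:
        outgoing[source].append(target)
    return {state: sorted(outgoing[state]) for state in terminal if outgoing[state]}


violations = terminal_state_violations(transitions, declared_terminal)
print(violations)  # {'stopped': ['blocked', 'starting']}
```

A non-empty result means the "terminal" label and the diagram disagree, which is exactly the contradiction GPT-5 reported.
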
### Sonnet narrow unique findings (not in GPT-5):

2. **State diagram vs Instance Start section inconsistency**: The state diagram shows a
   stopped→blocked transition, but the Instance Start section describes the same scenario
   (ineligible user) without transitioning to blocked. Sonnet argued these conflict.
   This is WRONG: the Instance Start section says "record the blocking reason," which
   corresponds to the blocked state. Sonnet misread the prose.

3. **Configuration events vs operational gates interaction**: Sonnet argued that
   `user_configured` not triggering a start (due to the kill switch) contradicts "configuration
   prerequisites changing triggers instance start automatically." This is a MISREAD:
   the document actually says the trigger fires but eligibility fails, which is
   consistent (trigger ≠ guaranteed start). Sonnet confused "trigger the evaluation"
   with "trigger the start."

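The distinction Sonnet missed in its finding #3 can be sketched in a few lines. This is a reading of the behavior as summarized in this write-up, not code from the system: `Eligibility`, `check_eligibility`, `handle_user_configured`, and the `kill_switch` reason string are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Eligibility:
    eligible: bool
    blocking_reason: Optional[str] = None


def check_eligibility(kill_switch_on: bool) -> Eligibility:
    """Hypothetical gate check reduced to a single gate for illustration."""
    if kill_switch_on:
        return Eligibility(False, "kill_switch")
    return Eligibility(True)


def handle_user_configured(kill_switch_on: bool) -> str:
    """The event always triggers an evaluation; only an eligible user starts."""
    result = check_eligibility(kill_switch_on)
    if result.eligible:
        return "starting"
    # Trigger fired, eligibility failed: record the reason instead of starting.
    return f"blocked:{result.blocking_reason}"


print(handle_user_configured(kill_switch_on=False))  # starting
print(handle_user_configured(kill_switch_on=True))   # blocked:kill_switch
```

Both calls run the same trigger path. The second one ends in the blocked state with a recorded reason, which is consistent with the document rather than a contradiction of it.
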
### Sonnet broad findings (Condition C):

Of 13 findings, NONE were internal logical contradictions. They broke down as:
- Missing features (5): recovery mechanisms, state persistence, error handling, perf specs, glossary
- External concerns (4): race conditions, cascading failures, component boundaries, integration specs
- Suggestions (3): formatting, calibration rationale, event ordering
- Terminology (1): issue #9, which partially overlaps with GPT-5's finding #4

## Key Insight: The Hypothesis is REJECTED

**Sonnet + narrow framing does NOT produce GPT-5-level semantic analysis.**

The narrow framing did change Sonnet's behavior: it stopped suggesting missing features
and focused on contradictions (Condition A vs C). But the quality of its semantic reasoning
was significantly lower than GPT-5's:

1. **Quantity:** Sonnet reported 3 findings vs GPT-5's 4, a smaller gap than expected.
2. **Quality:** This is the real differentiator:
   - Sonnet's finding #2 contains an analytical error (misreading the Instance Start section)
   - Sonnet's finding #3 conflates "trigger the evaluation" with "guarantee the start"
   - GPT-5's findings are all logically sound and precisely reasoned
3. **Precision:** Of Sonnet's 3 findings, only 1 is genuinely correct (the credential one).
   GPT-5's 4 findings are all legitimate contradictions or ambiguities.
4. **Depth:** GPT-5's analysis draws subtler distinctions (presence vs validity, per-user
   vs system-wide wording asymmetry). Sonnet identifies surface contradictions.

**Effective findings:** Sonnet narrow = 1 genuine / GPT-5 narrow = 4 genuine.
That 4:1 gap in genuine findings reflects reasoning quality, not a framing artifact.

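The precision gap can be stated as a calculation. The counts below come from the adjudication in this section; treating each finding as simply genuine or not is a simplification made for this sketch.

```python
from fractions import Fraction

# (total findings, genuine findings) per narrow-framing condition.
adjudicated = {
    "Sonnet 4 narrow": (3, 1),
    "GPT-5 narrow": (4, 4),
}

precision = {
    model: Fraction(genuine, total) for model, (total, genuine) in adjudicated.items()
}
false_positives = {
    model: total - genuine for model, (total, genuine) in adjudicated.items()
}
genuine_ratio = Fraction(
    adjudicated["GPT-5 narrow"][1], adjudicated["Sonnet 4 narrow"][1]
)

for model in adjudicated:
    print(f"{model}: precision {float(precision[model]):.0%}, "
          f"{false_positives[model]} false positives")
print(f"Genuine findings, GPT-5 : Sonnet = {genuine_ratio}:1")
```

The raw counts (3 vs 4) look close. Precision (1/3 vs 4/4) is where the two models separate.
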
## Why Narrow Framing Doesn't Help

The hypothesis assumed Sonnet's structural bias was caused by the broad prompt giving it
"permission" to find easy structural issues instead of harder semantic ones. But the narrow
prompt FORCED it to look for contradictions, and it still couldn't reason about them
correctly.

The problem isn't that Sonnet doesn't LOOK for semantic issues when framed narrowly. It does.
The problem is that semantic consistency analysis requires:

1. **Holding multiple document sections in working memory simultaneously**
2. **Reasoning about the logical IMPLICATIONS of each statement (not just surface text)**
3. **Testing whether interpretations are actually contradictory vs complementary**

GPT-5's 10,368 reasoning tokens were spent doing exactly this: cross-referencing sections,
testing interpretations, and confirming that conflicts are genuine. Sonnet's internal reasoning
(not reported) apparently skips this verification step, which leads to false-positive
contradictions: findings where the document is actually consistent but Sonnet misread one
part.

## Comparison to Previous Findings

This result is consistent with Finding #13 (race conditions), where Sonnet also struggled
with reasoning that requires holding multiple interacting parts in working memory. It
CONTRASTS with Findings #12 and #14, where Sonnet performed well on assumption-finding
and cross-component analysis.

The emerging pattern:
- **Sonnet excels at:** identification tasks ("what could go wrong?") where each finding
  is evaluated independently
- **Sonnet struggles with:** verification tasks ("does X actually contradict Y?") where
  the finding requires cross-referencing and logical proof

Narrow framing helps Sonnet focus, but doesn't help it REASON more deeply.

## Sonnet Broad vs Sonnet Narrow

The comparison between conditions A and C reveals what narrow framing DOES accomplish:

- **Broad Sonnet** produced 13 findings, but 0 were internal contradictions (all gaps/suggestions)
- **Narrow Sonnet** produced 3 findings that were at least ATTEMPTING to be contradictions (1 genuine)

So narrow framing successfully redirects Sonnet from its default mode (gap-finding, structural
review) to the target analytical mode (contradiction detection). It just doesn't give Sonnet
the reasoning depth to execute that mode well.

**Implication:** Framing controls WHAT Sonnet looks for, but not HOW WELL it reasons about
what it finds. For tasks requiring logical verification (contradictions, race conditions,
invariant violations), reasoning tokens are necessary; framing alone is insufficient.

## Updated Open Questions

The original question "Sonnet + narrow framing = GPT-5 level?" is now **ANSWERED: No.**

New question arising: **Would Opus + narrow framing match GPT-5 for contradiction detection?**
Opus has demonstrated strong cross-boundary reasoning in previous experiments. If contradiction
detection is primarily about reasoning depth (which this experiment suggests), Opus's internal
reasoning should perform better than Sonnet's. But GPT-5's extreme selectivity (10K reasoning
tokens for 4 precise findings) might still dominate on precision.

## Practical Implication

For document self-consistency analysis:
- **Use GPT-5.** It's the only model tested that reliably distinguishes genuine contradictions
  from apparent ones.
- **Don't use Sonnet.** Even with narrow framing, it produces false-positive contradictions
  that would waste reviewer time.
- **Narrow framing helps with SCOPE** (preventing gap-finding when you want contradictions)
  but not with QUALITY (preventing false positives in the findings it does produce).

The three-model stack for architecture review should assign contradiction/consistency tasks
to GPT-5 specifically, not to Sonnet with better prompts.
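
One way to encode this assignment is a small routing function. The sketch is hypothetical: the task-type labels, the model identifier strings, the default model, and the `pick_model` helper are invented for illustration. Only the contradiction/consistency rule follows directly from this finding.

```python
# Task types this finding assigns to GPT-5.
GPT5_TASKS = {"contradiction", "consistency"}

# Hypothetical fallback; this finding does not prescribe a default model.
DEFAULT_MODEL = "sonnet-4"


def pick_model(task_type: str, framing: str = "narrow") -> str:
    """Route a review task to a model.

    `framing` is accepted but deliberately ignored: per this finding, framing
    changes what the model looks for, not how well it verifies what it finds.
    """
    if task_type in GPT5_TASKS:
        return "gpt-5"
    return DEFAULT_MODEL


print(pick_model("contradiction"))                    # gpt-5
print(pick_model("consistency", framing="broad"))     # gpt-5
print(pick_model("gap_finding"))                      # sonnet-4
```

Ignoring the framing argument is the point of the sketch: a better prompt does not move a contradiction task back to Sonnet.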
+15
-3
@@ -22,10 +22,22 @@ cross-doc contradictions are easy to verify once spotted (reducing GPT-5's
 verification advantage)? Or because boundary reasoning (Opus's strength)
 is the primary skill needed?
 
-### Sonnet + narrow framing = GPT-5 level? (from Finding #5)
-Would Sonnet catch semantic issues if given a narrower "check for logical
+### Opus + narrow framing for contradiction detection (from Finding #39)
+Would Opus + narrow framing match GPT-5 for self-contradiction detection?
+Finding #39 showed Sonnet can't do it even with narrow framing (reasoning
+depth issue). Opus has strong cross-boundary reasoning; does its internal
+reasoning depth suffice for the verification step that Sonnet lacks?
+
+### ~~Sonnet + narrow framing = GPT-5 level? (from Finding #5)~~ → ANSWERED (Finding #39)
+~~Would Sonnet catch semantic issues if given a narrower "check for logical
 consistency" framing instead of broad review? The hypothesis: Sonnet's
-"structural reviewer" tendency is a framing artifact, not a capability limit.
+"structural reviewer" tendency is a framing artifact, not a capability limit.~~
+
+**NO.** Narrow framing changes WHAT Sonnet looks for (redirects from gaps to
+contradictions) but not HOW WELL it reasons. Sonnet narrow found 3 contradictions
+but only 1 was genuine (2 were misreadings). GPT-5 found 4 all-genuine findings.
+The gap is reasoning depth, not framing: Sonnet can't reliably verify whether
+two statements actually contradict each other.
 
 ## Medium Priority
 