model-research/findings/2026-05-07-39-narrow-framing-does-not-close-sonnet-gpt5-gap.md
claw 0c632c255a finding #39: narrow framing does not close Sonnet-GPT-5 gap for semantic consistency
Tested open question from Finding #5: does narrow framing give Sonnet
GPT-5-level semantic analysis?

Result: NO. Narrow framing changes WHAT Sonnet looks for (redirects from
gaps to contradictions) but not HOW WELL it reasons. Sonnet narrow found
3 contradictions but only 1 was genuine (2 were analytical errors/misreads).
GPT-5 found 4 all-genuine findings with precise reasoning.

Key insight: framing controls scope, not reasoning depth. For tasks
requiring logical verification (contradictions, race conditions, invariant
violations), reasoning tokens are necessary — framing alone is insufficient.

Updated open-questions.md: marked Sonnet+narrow as answered, added new
question about Opus+narrow for contradiction detection.
2026-05-07 09:26:08 -07:00


# Finding #39: Narrow framing does NOT close the Sonnet-GPT-5 gap for semantic consistency analysis
**Date:** 2026-05-07
**Document:** `docs/domain/user-lifecycle.md` (343 lines)
**Task type:** Internal logical consistency / self-contradiction detection
**Models:** Claude Sonnet 4 (narrow + broad), GPT-5 (narrow)
**Open question tested:** "Sonnet + narrow framing = GPT-5 level?" (from Finding #5)
## Experiment Design
Finding #5 hypothesized that Sonnet's tendency to focus on structural issues (formatting,
links, missing sections) rather than semantic issues (logical contradictions, meaning-level
conflicts) might be a **framing artifact** — that giving Sonnet a narrow semantic question
would produce GPT-5-level semantic analysis.
Three conditions, same document:
| Condition | Model | Framing | Purpose |
|---|---|---|---|
| A | Sonnet 4 | Narrow: "identify contradictions within this document" | Test hypothesis |
| B | GPT-5 | Same narrow question | Control (known semantic strength) |
| C | Sonnet 4 | Broad: "review for quality, clarity, completeness, correctness" | Baseline |
The narrow prompt explicitly excluded gap-finding, suggestions, and missing features, and
asked ONLY for self-contradictions. It required a specific format for each finding: the two
conflicting quotes, their locations, an explanation of the logical conflict, and a severity.
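
As an illustration of the required output format, a finding can be modeled as a small record. This is only a sketch: the class and field names are hypothetical, the severity scale beyond "Critical" and "High" is an assumption, and the experiment itself may have used free-text output rather than any structured schema.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    # "Critical" and "High" appear in the results; the other levels are assumed.
    CRITICAL = "critical"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


@dataclass(frozen=True)
class Contradiction:
    quote_a: str      # first conflicting quote, verbatim from the document
    location_a: str   # where quote_a appears
    quote_b: str      # second conflicting quote
    location_b: str   # where quote_b appears
    explanation: str  # why the two quotes cannot both hold
    severity: Severity

    def is_well_formed(self) -> bool:
        """True when every text field is filled in and the two quotes differ."""
        fields = (self.quote_a, self.location_a, self.quote_b,
                  self.location_b, self.explanation)
        return all(f.strip() for f in fields) and self.quote_a.strip() != self.quote_b.strip()


# The credential finding, with a placeholder where this write-up gives no location.
credential_finding = Contradiction(
    quote_a="valid broker credentials",
    location_a="Gate #3",
    quote_b="does not validate broker credentials at startup",
    location_b="(section not named in this write-up)",
    explanation="A gate cannot require validity if nothing validates credentials.",
    severity=Severity.HIGH,
)
print(credential_finding.is_well_formed())  # True
```

Note that a well-formed record says nothing about whether the contradiction is genuine; that still has to be judged against the document.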
## Results
| Condition | Model | Framing | Time | Output tokens | Reasoning tokens | Findings |
|---|---|---|---|---|---|---|
| A | Sonnet 4 | Narrow semantic | 16s | 694 | (internal) | 3 |
| B | GPT-5 | Narrow semantic | 108s | 11,280 | 10,368 | 4 |
| C | Sonnet 4 | Broad review | 24s | 1,248 | (internal) | 13 |
## What They Found
### Common ground (both Sonnet narrow and GPT-5 identified):
1. **Credential "validity" contradiction** — Gate #3 says "valid broker credentials"
but the document explicitly says "does not validate broker credentials at startup."
Both models caught this. GPT-5's analysis was more nuanced (distinguished presence
vs validity, noted the `no_credentials` reason code suggests presence-checking).
Sonnet called it Critical; GPT-5 called it High.
### GPT-5 unique findings (not in Sonnet narrow):
2. **"Terminal state" with outgoing transitions** — The table says stopped is "Default
and terminal state" but the state diagram shows multiple outgoing transitions from
stopped (→starting, →blocked). A terminal state by definition has no outgoing
transitions. This is a clear terminology contradiction.
3. **Per-user release vs system-wide release asymmetry** — The eligibility section says
"every activation path (event-driven, periodic, operator-initiated) consults [the
eligibility check]." But per-user release says it "starts their instance" (no
qualifying check mentioned), while system-wide release says "starts all qualifying
users" (explicit check). The inconsistent language implies per-user release might
bypass eligibility — contradicting the universal gate rule.
4. **"Running instance" ambiguity** — The document distinguishes "starting" (instance
exists, cold-start incomplete) from "ready" (trading active), but uses "running
instance" without defining which states it includes. The edge case section implies
"running" includes "starting," but reconciliation uses "running" in a context that
might mean only "ready." Contradictory interpretation paths exist.
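
GPT-5's finding #2 is the kind of contradiction that can be checked mechanically once the state diagram is written down as data. The sketch below assumes the diagram is available as a list of `(source, target)` pairs; only the two transitions out of `stopped` come from the findings above, and the function name is hypothetical.

```python
def terminal_state_violations(transitions, terminal_states):
    """Return {state: sorted targets} for each state labeled terminal
    that nevertheless has outgoing transitions."""
    violations = {}
    for source, target in transitions:
        if source in terminal_states:
            violations.setdefault(source, set()).add(target)
    return {state: sorted(targets) for state, targets in violations.items()}


# Transitions named in finding #2; the full diagram is not reproduced here.
TRANSITIONS = [("stopped", "starting"), ("stopped", "blocked")]
TERMINAL_STATES = {"stopped"}  # the table calls stopped "Default and terminal state"

print(terminal_state_violations(TRANSITIONS, TERMINAL_STATES))
# {'stopped': ['blocked', 'starting']}
```

An empty result would mean the "terminal" label is consistent with the diagram; here it is not.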
### Sonnet narrow unique findings (not in GPT-5):
2. **State diagram vs Instance Start section inconsistency** — State diagram shows
stopped→blocked transition, but Instance Start section describes the same scenario
(ineligible user) without transitioning to blocked. Sonnet argued these conflict.
(This is actually WRONG — the Instance Start section says "record the blocking
reason" which corresponds to the blocked state. Sonnet misread the prose.)
3. **Configuration events vs operational gates interaction** — Sonnet argued that
`user_configured` not triggering start (due to kill switch) contradicts "configuration
prerequisites changing triggers instance start automatically." This is a MISREAD —
the document actually says the trigger fires but eligibility fails, which is
consistent (trigger ≠ guaranteed start). Sonnet confused "trigger the evaluation"
with "trigger the start."
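
Both Sonnet misreads come down to the same distinction: a trigger starts an eligibility evaluation, and the evaluation decides between starting and recording a blocking reason. A minimal sketch of that reading follows. The `User` fields, function names, and the `kill_switch` and `not_configured` reason codes are hypothetical; only the stopped, starting, and blocked states and the idea of recording a blocking reason come from the findings above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class User:
    configured: bool
    kill_switch_on: bool
    state: str = "stopped"
    blocking_reason: Optional[str] = None


def check_eligibility(user: User) -> Optional[str]:
    """Return None when the user may start, otherwise a blocking reason code."""
    if user.kill_switch_on:
        return "kill_switch"      # hypothetical reason code
    if not user.configured:
        return "not_configured"   # hypothetical reason code
    return None


def on_trigger(user: User) -> str:
    """Handle any activation trigger: evaluate first, then start or block."""
    reason = check_eligibility(user)
    if reason is None:
        user.state = "starting"
        user.blocking_reason = None
    else:
        # Recording the blocking reason is what moves the user to blocked.
        user.state = "blocked"
        user.blocking_reason = reason
    return user.state


# A configuration event fires the trigger, but the kill switch blocks the start.
user = User(configured=True, kill_switch_on=True)
print(on_trigger(user), user.blocking_reason)  # blocked kill_switch
```

Under this reading, "the trigger fired" and "the instance did not start" are both true at once, so there is no contradiction.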
### Sonnet broad findings (Condition C):
Of 13 findings, NONE were internal logical contradictions. They break down as:
- Missing features (5): recovery mechanisms, state persistence, error handling, perf specs, glossary
- External concerns (4): race conditions, cascading failures, component boundaries, integration specs
- Suggestions (3): formatting, calibration rationale, event ordering
- Terminology (1): issue #9, which partially overlaps with GPT-5's finding #4
## Key Insight — The Hypothesis is REJECTED
**Sonnet + narrow framing does NOT produce GPT-5-level semantic analysis.**
The narrow framing did change Sonnet's behavior — it stopped suggesting missing features
and focused on contradictions (Condition A vs C). But the quality of its semantic reasoning
was significantly lower than GPT-5's:
1. **Quantity:** Sonnet found 3 vs GPT-5's 4 — a smaller gap than expected
2. **Quality:** This is the real differentiator:
   - Sonnet's finding #2 contains an analytical error (misreading the Instance Start section)
   - Sonnet's finding #3 conflates "trigger the evaluation" with "guarantee the start"
   - GPT-5's findings are all logically sound and precisely reasoned
3. **Precision:** Of Sonnet's 3 findings, only 1 is genuinely correct (the credential one).
GPT-5's 4 findings are all legitimate contradictions or ambiguities.
4. **Depth:** GPT-5's analysis distinguishes subtler levels (presence vs validity, per-user
vs system-wide wording asymmetry). Sonnet identifies surface contradictions.
**Effective findings:** Sonnet narrow = 1 genuine / GPT-5 narrow = 4 genuine.
That's a 4:1 quality gap, not a framing artifact.
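
The effective-findings comparison is a precision calculation over the counts reported above. The sketch below only restates that arithmetic; "genuine" is the manual judgment made in this finding, not something the code can determine.

```python
def precision(genuine: int, reported: int) -> float:
    """Fraction of reported findings that were judged genuine."""
    if reported <= 0:
        raise ValueError("reported must be positive")
    return genuine / reported


# (genuine, reported) per condition, taken from the results above.
results = {"Sonnet 4 narrow": (1, 3), "GPT-5 narrow": (4, 4)}

for name, (genuine, reported) in results.items():
    false_positives = reported - genuine
    print(f"{name}: {genuine}/{reported} genuine, "
          f"precision {precision(genuine, reported):.2f}, "
          f"{false_positives} false positives")
# Sonnet 4 narrow: 1/3 genuine, precision 0.33, 2 false positives
# GPT-5 narrow: 4/4 genuine, precision 1.00, 0 false positives
```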
## Why Narrow Framing Doesn't Help
The hypothesis assumed Sonnet's structural bias was caused by the broad prompt giving it
"permission" to find easy structural issues instead of harder semantic ones. But the narrow
prompt FORCED it to look for contradictions — and it still couldn't reason about them
correctly.
The problem isn't that Sonnet doesn't LOOK for semantic issues when framed narrowly. It does.
The problem is that semantic consistency analysis requires:
1. **Holding multiple document sections in working memory simultaneously**
2. **Reasoning about the logical IMPLICATIONS of each statement (not just surface text)**
3. **Testing whether interpretations are actually contradictory vs complementary**
GPT-5's 10,368 reasoning tokens were spent doing exactly this — cross-referencing sections,
testing interpretations, confirming that conflicts are genuine. Sonnet's internal reasoning
(not reported) apparently doesn't do this verification step, leading to false-positive
contradictions (findings where the document is actually consistent but Sonnet misread one
part).
## Comparison to Previous Findings
This result is consistent with Finding #13 (race conditions), where Sonnet also struggled
with reasoning that requires holding multiple interacting parts in working memory. And it
CONTRASTS with Findings #12 and #14, where Sonnet performed well on assumption-finding
and cross-component analysis.
The emerging pattern:
- **Sonnet excels at:** identification tasks ("what could go wrong?") where each finding
is evaluated independently
- **Sonnet struggles with:** verification tasks ("does X actually contradict Y?") where
the finding requires cross-referencing and logical proof
Narrow framing helps Sonnet focus, but doesn't help it REASON more deeply.
## Sonnet Broad vs Sonnet Narrow
The comparison between conditions A and C reveals what narrow framing DOES accomplish:
- **Broad Sonnet** produced 13 findings but 0 were internal contradictions (all gaps/suggestions)
- **Narrow Sonnet** produced 3 findings that were at least ATTEMPTING to be contradictions (1 genuine)
So narrow framing successfully redirects Sonnet from its default mode (gap-finding, structural
review) to the target analytical mode (contradiction detection). It just doesn't give Sonnet
the reasoning depth to execute that mode well.
**Implication:** Framing controls WHAT Sonnet looks for, but not HOW WELL it reasons about
what it finds. For tasks requiring logical verification (contradictions, race conditions,
invariant violations), reasoning tokens are necessary — framing alone is insufficient.
## Updated Open Questions
The original question "Sonnet + narrow framing = GPT-5 level?" is now **ANSWERED: No.**
New question arising: **Would Opus + narrow framing match GPT-5 for contradiction detection?**
Opus has demonstrated strong cross-boundary reasoning in previous experiments. If contradiction
detection is primarily about reasoning depth (which this experiment suggests), Opus's internal
reasoning should perform better than Sonnet's. But GPT-5's extreme selectivity (10K reasoning
tokens for 4 precise findings) might still dominate on precision.
## Practical Implication
For document self-consistency analysis:
- **Use GPT-5.** It's the only model tested that reliably distinguishes genuine contradictions
from apparent ones.
- **Don't use Sonnet** — even with narrow framing, it produces false-positive contradictions
that would waste reviewer time.
- **Narrow framing helps with SCOPE** (preventing gap-finding when you want contradictions)
but not with QUALITY (preventing false positives in the findings it does produce).
The three-model stack for architecture review should assign contradiction/consistency tasks
to GPT-5 specifically, not to Sonnet with better prompts.
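
One way to encode this assignment is a small routing function. This is a sketch under assumptions: the task-type names are hypothetical, the model strings are plain labels rather than API model identifiers, and the default for other task types is left to the caller because this finding does not settle it.

```python
# Task types this finding assigns to GPT-5 (names are hypothetical labels).
CONSISTENCY_TASKS = frozenset({"contradiction_detection", "consistency_check"})


def pick_model(task_type: str, default_model: str) -> str:
    """Route contradiction/consistency tasks to GPT-5; defer everything else."""
    if task_type in CONSISTENCY_TASKS:
        return "gpt-5"
    return default_model


print(pick_model("contradiction_detection", default_model="sonnet-4"))  # gpt-5
print(pick_model("gap_finding", default_model="sonnet-4"))              # sonnet-4
```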