
claw 0c632c255a finding #39 : narrow framing does not close Sonnet-GPT-5 gap for semantic consistency

Tested open question from Finding #5: does narrow framing give Sonnet
GPT-5-level semantic analysis?

Result: NO. Narrow framing changes WHAT Sonnet looks for (redirects from
gaps to contradictions) but not HOW WELL it reasons. Sonnet narrow found
3 contradictions but only 1 was genuine (2 were analytical errors/misreads).
GPT-5 found 4 all-genuine findings with precise reasoning.

Key insight: framing controls scope, not reasoning depth. For tasks
requiring logical verification (contradictions, race conditions, invariant
violations), reasoning tokens are necessary — framing alone is insufficient.

Updated open-questions.md: marked Sonnet+narrow as answered, added new
question about Opus+narrow for contradiction detection.

2026-05-07 09:26:08 -07:00

findings

finding #39 : narrow framing does not close Sonnet-GPT-5 gap for semantic consistency

2026-05-07 09:26:08 -07:00

prompts

Initial publish: 29 findings, 6 prompts, methodology, open questions

2026-05-05 19:13:03 -07:00

review-prompts

feat: add generic review prompts and generation guide

2026-05-06 08:00:59 -07:00

LESSONS.md

docs: add generation timestamps to REPORT.md and LESSONS.md

2026-05-06 07:26:48 -07:00

methodology.md

Initial publish: 29 findings, 6 prompts, methodology, open questions

2026-05-05 19:13:03 -07:00

open-questions.md

finding #39 : narrow framing does not close Sonnet-GPT-5 gap for semantic consistency

2026-05-07 09:26:08 -07:00

README.md

docs(readme): add Reports section with links to REPORT.md and LESSONS.md

2026-05-06 07:29:03 -07:00

REPORT.md

docs: add generation timestamps to REPORT.md and LESSONS.md

2026-05-06 07:26:48 -07:00

README.md

Model Research — AI for Analytical Work

Comparative analysis of AI models on analytical tasks — not coding.

Most public discussion about LLM capabilities focuses on code generation. We found almost no published methodology for using models in analytical research tasks (searched 2026-04-26). This repo fills that gap with controlled experiments and reproducible findings.

What We're Testing

Using GPT-5, Claude Opus 4.6, Claude Sonnet 4.6, and GPT-4.1 (+ Mini) for:

Architecture document review
Bias and assumption detection
Gap-finding in design specifications
Cross-document consistency analysis
Race condition identification
Adversarial path analysis
Contradiction detection
Regulatory compliance review

Key Findings (Summary)

#	Task Type	Winner	Key Insight
1	PR review	Both	Different models catch different things — Sonnet: structural, GPT-5: semantic
2	Bias detection	Framing	Signal-to-noise ratio matters more than model capability
9	Gap-finding	GPT-5	Reasoning tokens find domain-specific gaps, not generic ones
10	Hidden assumptions	GPT-5	Reasoning produces qualitatively different (not just more) findings
13	Race conditions	Opus	Temporal interaction reasoning is Opus's strongest domain
15	Design coherence	Task-dependent	Single-doc: model choice depends on document complexity
25	Contradiction detection	Opus	Precision > exhaustiveness; Opus's self-correction is unique
28	Cross-doc consistency	Opus	2.4x faster than GPT-5 with more findings; boundary reasoning
29	Adversarial analysis	GPT-5 + Opus	GPT-5: exhaustive; Opus: qualitatively different attack vectors

Methodology

Each experiment:

Same input document(s) to all models
Same structured prompt with explicit categories to analyze
No tools, no project context beyond the document(s)
Independent runs — no cross-pollination between models
Results evaluated for: correctness, uniqueness, actionability
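
The protocol above could be sketched roughly as follows. This is a minimal illustration, not the harness used for these experiments: `ModelFn`, `RunResult`, `build_prompt`, and `run_experiment` are hypothetical names, and each model is assumed to be a plain function from prompt text to response text.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical: a model is any callable mapping a prompt string to a response string.
ModelFn = Callable[[str], str]


@dataclass(frozen=True)
class RunResult:
    model: str
    prompt: str
    response: str


def build_prompt(document: str, categories: List[str]) -> str:
    """Build one structured prompt with explicit categories to analyze."""
    category_lines = "\n".join(f"- {c}" for c in categories)
    return f"Analyze the document below for:\n{category_lines}\n\n---\n{document}"


def run_experiment(
    document: str, categories: List[str], models: Dict[str, ModelFn]
) -> List[RunResult]:
    """Send the same prompt to every model, one independent call per model."""
    prompt = build_prompt(document, categories)  # built once, shared by all models
    results = []
    for name, call in models.items():
        # Each call sees only the prompt: no tools, no other model's output.
        results.append(RunResult(model=name, prompt=prompt, response=call(prompt)))
    return results
```

Building the prompt once and passing it unchanged to every model is what keeps the comparison controlled. Scoring for correctness, uniqueness, and actionability is left out here because it is a manual evaluation step.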

Context dimensions tracked:

Rich vs minimal (how much background info)
Broad vs focused ("review this" vs "answer this specific question")
What kind of context (diff, full files, issue text, nothing)
Whether the model had tools or just text
Whether the task was step-by-step or open-ended
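
One way to record these dimensions per run is a small typed record. The sketch below is an assumption about how that could look; the enum and field names (`Richness`, `Scope`, `ContextKind`, `TaskShape`, `ContextProfile`) are hypothetical and do not come from the repository.

```python
from dataclasses import dataclass
from enum import Enum


class Richness(Enum):
    RICH = "rich"
    MINIMAL = "minimal"


class Scope(Enum):
    BROAD = "broad"      # "review this"
    FOCUSED = "focused"  # "answer this specific question"


class ContextKind(Enum):
    DIFF = "diff"
    FULL_FILES = "full-files"
    ISSUE_TEXT = "issue-text"
    NOTHING = "nothing"


class TaskShape(Enum):
    STEP_BY_STEP = "step-by-step"
    OPEN_ENDED = "open-ended"


@dataclass(frozen=True)
class ContextProfile:
    """Hypothetical record of the context dimensions for a single run."""
    richness: Richness
    scope: Scope
    kind: ContextKind
    has_tools: bool
    shape: TaskShape

    def label(self) -> str:
        """Compact label, useful for grouping runs with identical context."""
        tools = "tools" if self.has_tools else "no-tools"
        return "/".join(
            [self.richness.value, self.scope.value, self.kind.value, tools, self.shape.value]
        )
```

Making the profile immutable and hashable means two runs can be compared or grouped by context before their findings are compared.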

Reports

REPORT.md — Full research analysis. Covers model strengths with evidence, task-type → model mappings, meta-findings about how to use models effectively, cost-effectiveness comparison, and open questions. Regenerated weekly from all findings.
LESSONS.md — Actionable summary. The distilled "here's what to actually do" version: three core rules, operational playbooks for different review types, anti-patterns to avoid, and a model personality cheat sheet. Start here if you want answers, not methodology.

Both files include a generation timestamp and are automatically regenerated every Monday at 9 AM Pacific to incorporate new experiment results.

Repository Structure

REPORT.md           # Full research report (auto-regenerated weekly)
LESSONS.md          # Actionable lessons and playbooks (auto-regenerated weekly)
findings/           # Individual experiment files (one per experiment)
  README.md         # Context and index
  YYYY-MM-DD-NN-slug.md
  2026-04-26-01-different-models-catch-different-things.md
  ...
  2026-05-05-29-adversarial-manipulation-analysis-new-task.md
prompts/            # Exact prompts used for reproducibility
  cross-document-consistency.md
  design-coherence.md
  gap-finding.md
  hidden-assumptions.md
  ...
open-questions.md   # Unanswered questions for future experiments
methodology.md      # Full methodology notes

Findings are named YYYY-MM-DD-NN-slug.md so that they sort chronologically. Numbers are zero-padded (01–29). The duplicate finding #7 uses a "b" suffix.
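
The naming convention can be checked mechanically. The sketch below is illustrative only: the helper names are hypothetical, and it assumes the duplicate-finding suffix is attached directly to the number (for example `07b`), which the text above does not spell out.

```python
import re
from datetime import date
from typing import NamedTuple

# Assumed layout: ISO date, two-digit number with optional "b" suffix, lowercase slug.
FINDING_RE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})-(?P<num>\d{2})(?P<suffix>b?)-(?P<slug>[a-z0-9-]+)\.md$"
)


class FindingName(NamedTuple):
    day: date
    number: int
    suffix: str
    slug: str


def parse_finding_name(filename: str) -> FindingName:
    """Split a findings filename into its parts, or raise ValueError."""
    match = FINDING_RE.match(filename)
    if match is None:
        raise ValueError(f"not a findings filename: {filename!r}")
    return FindingName(
        day=date.fromisoformat(match["date"]),
        number=int(match["num"]),
        suffix=match["suffix"],
        slug=match["slug"],
    )


def format_finding_name(day: date, number: int, slug: str, suffix: str = "") -> str:
    """Build a filename with a zero-padded, two-digit finding number."""
    return f"{day.isoformat()}-{number:02d}{suffix}-{slug}.md"
```

Because the date comes first and the number is zero-padded, a plain lexicographic sort of the filenames is also a chronological sort.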

Who We Are

This research is conducted by Rodin (AI assistant) and Aaron Weiker. The test corpus is gargoyle — an Elixir trading system with extensive architecture documentation (~35 design docs, ~5000 lines).

License

CC BY 4.0 — share and adapt with attribution.
