claw 0c632c255a finding #39: narrow framing does not close Sonnet-GPT-5 gap for semantic consistency
Tested open question from Finding #5: does narrow framing give Sonnet
GPT-5-level semantic analysis?

Result: NO. Narrow framing changes WHAT Sonnet looks for (it redirects
attention from gaps to contradictions) but not HOW WELL it reasons. With
narrow framing, Sonnet reported 3 contradictions, of which only 1 was
genuine (the other 2 were analytical errors or misreads). GPT-5 reported
4 findings, all genuine, with precise reasoning.

Key insight: framing controls scope, not reasoning depth. For tasks
requiring logical verification (contradictions, race conditions, invariant
violations), reasoning tokens are necessary — framing alone is insufficient.

Updated open-questions.md: marked Sonnet+narrow as answered, added new
question about Opus+narrow for contradiction detection.
2026-05-07 09:26:08 -07:00

Model Research — AI for Analytical Work

Comparative analysis of AI models on analytical tasks — not coding.

Most public discussion about LLM capabilities focuses on code generation. We found almost no published methodology for using models in analytical research tasks (searched 2026-04-26). This repo fills that gap with controlled experiments and reproducible findings.

What We're Testing

Using GPT-5, Claude Opus 4.6, Claude Sonnet 4.6, and GPT-4.1 (+ Mini) for:

  • Architecture document review
  • Bias and assumption detection
  • Gap-finding in design specifications
  • Cross-document consistency analysis
  • Race condition identification
  • Adversarial path analysis
  • Contradiction detection
  • Regulatory compliance review

Key Findings (Summary)

| #  | Task Type               | Winner         | Key Insight |
|----|-------------------------|----------------|-------------|
| 1  | PR review               | Both           | Different models catch different things (Sonnet: structural; GPT-5: semantic) |
| 2  | Bias detection          | Framing        | Signal-to-noise ratio matters more than model capability |
| 9  | Gap-finding             | GPT-5          | Reasoning tokens find domain-specific gaps, not generic ones |
| 10 | Hidden assumptions      | GPT-5          | Reasoning produces qualitatively different (not just more) findings |
| 13 | Race conditions         | Opus           | Temporal interaction reasoning is Opus's strongest domain |
| 15 | Design coherence        | Task-dependent | Single-doc: model choice depends on document complexity |
| 25 | Contradiction detection | Opus           | Precision > exhaustiveness; Opus's self-correction is unique |
| 28 | Cross-doc consistency   | Opus           | 2.4x faster than GPT-5 with more findings; boundary reasoning |
| 29 | Adversarial analysis    | GPT-5 + Opus   | GPT-5: exhaustive; Opus: qualitatively different attack vectors |
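
To make "precision" in the table concrete, here is a minimal sketch that treats it as genuine findings divided by reported findings. That definition and the `precision` helper are assumptions for illustration, not the repo's scoring code. The counts come from the finding #39 commit note: Sonnet with narrow framing reported 3 contradictions with 1 genuine, and GPT-5 reported 4 with all 4 genuine.

```python
from fractions import Fraction


def precision(genuine: int, reported: int) -> Fraction:
    """Share of reported findings that held up on review (assumed definition)."""
    if reported <= 0:
        raise ValueError("reported must be positive")
    if not 0 <= genuine <= reported:
        raise ValueError("genuine must be between 0 and reported")
    return Fraction(genuine, reported)


# Counts taken from the finding #39 commit note.
sonnet_narrow = precision(genuine=1, reported=3)
gpt5 = precision(genuine=4, reported=4)

print(f"Sonnet (narrow framing): {float(sonnet_narrow):.0%}")  # 33%
print(f"GPT-5: {float(gpt5):.0%}")  # 100%
```

Under this reading, a model that reports fewer findings can still rank higher if more of them survive review.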

Methodology

Each experiment:

  1. Same input document(s) to all models
  2. Same structured prompt with explicit categories to analyze
  3. No tools, no project context beyond the document(s)
  4. Independent runs — no cross-pollination between models
  5. Results evaluated for: correctness, uniqueness, actionability
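
The first four steps above can be sketched as a small harness. This is an illustration under stated assumptions, not the repo's actual tooling: `run_experiment`, `build_prompt`, `Run`, and the `call_model` callback are hypothetical names, and the callback stands in for whatever client actually sends text to a model.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical callback: takes (model_name, prompt) and returns the raw reply.
ModelCall = Callable[[str, str], str]


@dataclass(frozen=True)
class Run:
    model: str
    prompt: str
    output: str


def build_prompt(template: str, documents: List[str]) -> str:
    """Combine the structured prompt with the input document(s)."""
    joined = "\n\n---\n\n".join(documents)
    return f"{template}\n\n{joined}"


def run_experiment(
    models: List[str],
    template: str,
    documents: List[str],
    call_model: ModelCall,
) -> List[Run]:
    # Build the prompt once so every model receives identical text.
    prompt = build_prompt(template, documents)
    runs = []
    for model in models:
        # Each call sees only the prompt: no tools, no other model's output.
        runs.append(Run(model, prompt, call_model(model, prompt)))
    return runs


def fake_call(model: str, prompt: str) -> str:
    """Stand-in for a real API client, used only to exercise the harness."""
    return f"{model} saw {len(prompt)} characters"


runs = run_experiment(
    ["model-a", "model-b"],
    "List contradictions by category.",
    ["doc one", "doc two"],
    fake_call,
)
for run in runs:
    print(run.model, "->", run.output)
```

Building the prompt once, outside the loop, is what keeps steps 1 and 2 honest; passing nothing but the prompt into each call covers steps 3 and 4.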

Context dimensions tracked:

  • Rich vs minimal (how much background info)
  • Broad vs focused ("review this" vs "answer this specific question")
  • What kind of context (diff, full files, issue text, nothing)
  • Whether the model had tools or just text
  • Whether the task was step-by-step or open-ended
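
One way to record these dimensions per run is a small typed record. The sketch below is an assumption about how such tracking could look; `ContextRecord` and the enum names are hypothetical and do not come from the repo.

```python
from dataclasses import dataclass, fields
from enum import Enum


class Richness(Enum):
    RICH = "rich"
    MINIMAL = "minimal"


class Scope(Enum):
    BROAD = "broad"      # "review this"
    FOCUSED = "focused"  # "answer this specific question"


class ContextKind(Enum):
    DIFF = "diff"
    FULL_FILES = "full_files"
    ISSUE_TEXT = "issue_text"
    NOTHING = "nothing"


class TaskShape(Enum):
    STEP_BY_STEP = "step_by_step"
    OPEN_ENDED = "open_ended"


@dataclass(frozen=True)
class ContextRecord:
    richness: Richness
    scope: Scope
    kind: ContextKind
    has_tools: bool
    task_shape: TaskShape

    def as_row(self) -> dict:
        """Flatten to plain values, e.g. for a table of runs."""
        row = {}
        for f in fields(self):
            value = getattr(self, f.name)
            row[f.name] = value.value if isinstance(value, Enum) else value
        return row


# Illustrative values only, not a record of a real experiment.
record = ContextRecord(
    richness=Richness.MINIMAL,
    scope=Scope.FOCUSED,
    kind=ContextKind.FULL_FILES,
    has_tools=False,
    task_shape=TaskShape.OPEN_ENDED,
)
print(record.as_row())
```

Using enums rather than free text keeps the values comparable across experiments, which matters when findings are later grouped by dimension.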

Reports

  • REPORT.md — Full research analysis. Covers model strengths with evidence, task-type → model mappings, meta-findings about how to use models effectively, cost-effectiveness comparison, and open questions. Regenerated weekly from all findings.
  • LESSONS.md — Actionable summary. The distilled "here's what to actually do" version: three core rules, operational playbooks for different review types, anti-patterns to avoid, and a model personality cheat sheet. Start here if you want answers, not methodology.

Both files include a generation timestamp and are automatically regenerated every Monday at 9 AM Pacific to incorporate new experiment results.

Repository Structure

REPORT.md           # Full research report (auto-regenerated weekly)
LESSONS.md          # Actionable lessons and playbooks (auto-regenerated weekly)
findings/           # Individual experiment files (one per experiment)
  README.md         # Context and index
  YYYY-MM-DD-NN-slug.md
  2026-04-26-01-different-models-catch-different-things.md
  ...
  2026-05-05-29-adversarial-manipulation-analysis-new-task.md
prompts/            # Exact prompts used for reproducibility
  cross-document-consistency.md
  design-coherence.md
  gap-finding.md
  hidden-assumptions.md
  ...
open-questions.md   # Unanswered questions for future experiments
methodology.md      # Full methodology notes

Findings are named YYYY-MM-DD-NN-slug.md so that they sort chronologically. The NN finding number is zero-padded to two digits (01-29). The duplicate finding #7 uses a b suffix.
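
The naming scheme can be checked mechanically. The sketch below is a hedged illustration: `parse_finding_name` and `FINDING_RE` are hypothetical, and it assumes the b suffix attaches directly to the number (as in `07b`), which the README does not spell out. The third filename in the example is made up to exercise that assumption.

```python
import re
from datetime import date
from typing import NamedTuple, Optional

# Assumed pattern: date, two-digit number, optional letter suffix, slug.
FINDING_RE = re.compile(
    r"^(\d{4})-(\d{2})-(\d{2})-(\d{2})([a-z]?)-([a-z0-9]+(?:-[a-z0-9]+)*)\.md$"
)


class FindingName(NamedTuple):
    day: date
    number: int
    suffix: str
    slug: str


def parse_finding_name(filename: str) -> Optional[FindingName]:
    """Return the parsed parts, or None if the name does not fit the scheme."""
    match = FINDING_RE.match(filename)
    if match is None:
        return None
    year, month, dom, number, suffix, slug = match.groups()
    try:
        day = date(int(year), int(month), int(dom))
    except ValueError:  # e.g. month 13
        return None
    return FindingName(day, int(number), suffix, slug)


names = [
    "2026-05-05-29-adversarial-manipulation-analysis-new-task.md",
    "2026-04-26-01-different-models-catch-different-things.md",
    "2026-04-27-07b-example-slug.md",  # hypothetical name, suffix placement assumed
    "README.md",
]
for name in sorted(names):
    print(name, "->", parse_finding_name(name))
```

Because the date and the zero-padded number lead the filename, a plain lexicographic sort is already chronological, so no custom sort key is needed.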

Who We Are

This research is conducted by Rodin (AI assistant) and Aaron Weiker. The test corpus is gargoyle — an Elixir trading system with extensive architecture documentation (~35 design docs, ~5000 lines).

License

CC BY 4.0 — share and adapt with attribution.
