claw bb0c0d564b Finding #40: Silent data corruption paths in financial accounting
New analytical lens applied to lot-accounting.md (181 lines).
Tests how models identify sequences of individually correct
operations that produce silently wrong financial results.

Results:
- GPT-5: 12 findings (137s, 10688 reasoning tokens) - tax law domain knowledge
- Opus: 8 findings (121s) - concurrent systems / crash recovery focus
- Sonnet: 8 findings (111s) - structural meta-analysis, highest-leverage finding

Key insight: First experiment where domain-specific knowledge (tax law)
is the primary differentiator. Models reason from different knowledge
domains: GPT-5=tax law, Opus=distributed systems, Sonnet=architecture patterns.

Sonnet produced the most architecturally significant finding: that the
system's reconciliation mechanism confirms corruption rather than detecting
it (because it re-derives from LotClosed which is itself the corrupted source).
2026-05-07 11:09:58 -07:00

Model Research — AI for Analytical Work

Comparative analysis of AI models on analytical tasks — not coding.

Most public discussion about LLM capabilities focuses on code generation. We found almost no published methodology for using models in analytical research tasks (searched 2026-04-26). This repo fills that gap with controlled experiments and reproducible findings.

What We're Testing

We use GPT-5, Claude Opus 4.6, Claude Sonnet 4.6, and GPT-4.1 (plus GPT-4.1 Mini) for:

  • Architecture document review
  • Bias and assumption detection
  • Gap-finding in design specifications
  • Cross-document consistency analysis
  • Race condition identification
  • Adversarial path analysis
  • Contradiction detection
  • Regulatory compliance review

Key Findings (Summary)

| # | Task Type | Winner | Key Insight |
|---|-----------|--------|-------------|
| 1 | PR review | Both | Different models catch different things. Sonnet: structural; GPT-5: semantic |
| 2 | Bias detection | Framing | Signal-to-noise ratio matters more than model capability |
| 9 | Gap-finding | GPT-5 | Reasoning tokens find domain-specific gaps, not generic ones |
| 10 | Hidden assumptions | GPT-5 | Reasoning produces qualitatively different (not just more) findings |
| 13 | Race conditions | Opus | Temporal interaction reasoning is Opus's strongest domain |
| 15 | Design coherence | Task-dependent | Single-doc: model choice depends on document complexity |
| 25 | Contradiction detection | Opus | Precision > exhaustiveness; Opus's self-correction is unique |
| 28 | Cross-doc consistency | Opus | 2.4x faster than GPT-5 with more findings; boundary reasoning |
| 29 | Adversarial analysis | GPT-5 + Opus | GPT-5: exhaustive; Opus: qualitatively different attack vectors |

Methodology

Each experiment:

  1. Same input document(s) to all models
  2. Same structured prompt with explicit categories to analyze
  3. No tools, no project context beyond the document(s)
  4. Independent runs — no cross-pollination between models
  5. Results evaluated for: correctness, uniqueness, actionability
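
The protocol above can be sketched as a small harness. This is only an illustration under stated assumptions, not the repo's actual tooling: `ModelFn`, `RunResult`, `build_prompt`, and `run_experiment` are hypothetical names, and each model is stood in for by a plain function from prompt text to response text, so no real API is called.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-in: a "model" is any function from prompt text to response text.
ModelFn = Callable[[str], str]


@dataclass(frozen=True)
class RunResult:
    model: str
    prompt: str
    response: str


def build_prompt(documents: List[str], categories: List[str]) -> str:
    """Assemble one structured prompt: explicit categories, then only the document text."""
    category_lines = "\n".join(f"- {c}" for c in categories)
    body = "\n\n---\n\n".join(documents)
    return f"Analyze the document(s) for:\n{category_lines}\n\n{body}"


def run_experiment(
    models: Dict[str, ModelFn], documents: List[str], categories: List[str]
) -> List[RunResult]:
    """Send the identical prompt to every model, one independent call per model."""
    prompt = build_prompt(documents, categories)
    # No model ever sees another model's output, so runs cannot cross-pollinate.
    return [RunResult(name, prompt, fn(prompt)) for name, fn in models.items()]
```

Building the prompt once and reusing the same string is what keeps the input constant across models; evaluating correctness, uniqueness, and actionability would happen afterwards on the collected `RunResult` list.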

Context dimensions tracked:

  • Rich vs minimal (how much background info)
  • Broad vs focused ("review this" vs "answer this specific question")
  • What kind of context (diff, full files, issue text, nothing)
  • Whether the model had tools or just text
  • Whether the task was step-by-step or open-ended
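
One way to record these dimensions per run is a small typed record. The sketch below is an assumption about how such tracking could look; the class and field names (`ContextDimensions`, `richness`, `scope`, and so on) are hypothetical and do not come from the repo.

```python
from dataclasses import dataclass
from enum import Enum


class Richness(Enum):
    RICH = "rich"
    MINIMAL = "minimal"


class Scope(Enum):
    BROAD = "broad"      # e.g. "review this"
    FOCUSED = "focused"  # e.g. "answer this specific question"


class ContextKind(Enum):
    DIFF = "diff"
    FULL_FILES = "full files"
    ISSUE_TEXT = "issue text"
    NOTHING = "nothing"


@dataclass(frozen=True)
class ContextDimensions:
    richness: Richness
    scope: Scope
    kind: ContextKind
    had_tools: bool
    step_by_step: bool  # False means the task was open-ended

    def as_row(self) -> dict:
        """Flatten to plain values so runs can be compared or written to a table."""
        return {
            "richness": self.richness.value,
            "scope": self.scope.value,
            "kind": self.kind.value,
            "had_tools": self.had_tools,
            "step_by_step": self.step_by_step,
        }
```

Using enums rather than free text keeps the recorded values comparable across experiments, which matters when a finding is attributed to context framing rather than to the model.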

Reports

  • REPORT.md — Full research analysis. Covers model strengths with evidence, task-type → model mappings, meta-findings about how to use models effectively, cost-effectiveness comparison, and open questions. Regenerated weekly from all findings.
  • LESSONS.md — Actionable summary. The distilled "here's what to actually do" version: three core rules, operational playbooks for different review types, anti-patterns to avoid, and a model personality cheat sheet. Start here if you want answers, not methodology.

Both files include a generation timestamp and are automatically regenerated every Monday at 9 AM Pacific to incorporate new experiment results.

Repository Structure

REPORT.md           # Full research report (auto-regenerated weekly)
LESSONS.md          # Actionable lessons and playbooks (auto-regenerated weekly)
findings/           # Individual experiment files (one per experiment)
  README.md         # Context and index
  YYYY-MM-DD-NN-slug.md
  2026-04-26-01-different-models-catch-different-things.md
  ...
  2026-05-05-29-adversarial-manipulation-analysis-new-task.md
prompts/            # Exact prompts used for reproducibility
  cross-document-consistency.md
  design-coherence.md
  gap-finding.md
  hidden-assumptions.md
  ...
open-questions.md   # Unanswered questions for future experiments
methodology.md      # Full methodology notes

Findings are named YYYY-MM-DD-NN-slug.md for chronological sorting. Numbers are zero-padded (01-29). The duplicate finding #7 uses a b suffix.
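
As a rough sketch of this naming scheme, the helpers below build and parse finding filenames. `finding_filename`, `parse_finding`, and `FINDING_RE` are hypothetical names, and placing the duplicate suffix directly after the number (as in `07b`) is an assumption, since the README does not show that filename.

```python
import re
from datetime import date
from typing import Optional, Tuple

# Assumed layout: YYYY-MM-DD-NN[suffix]-slug.md, where suffix is an optional lowercase letter.
FINDING_RE = re.compile(r"^(\d{4}-\d{2}-\d{2})-(\d{2})([a-z]?)-([a-z0-9-]+)\.md$")


def finding_filename(day: date, number: int, slug: str, suffix: str = "") -> str:
    """Build a finding filename with a zero-padded two-digit number."""
    return f"{day.isoformat()}-{number:02d}{suffix}-{slug}.md"


def parse_finding(name: str) -> Optional[Tuple[date, int, str, str]]:
    """Return (date, number, suffix, slug), or None if the name does not match."""
    match = FINDING_RE.match(name)
    if match is None:
        return None
    day, number, suffix, slug = match.groups()
    return date.fromisoformat(day), int(number), suffix, slug
```

Because the ISO date comes first and the number is zero-padded, a plain lexicographic sort of the filenames also orders them chronologically.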

Who We Are

This research is conducted by Rodin (AI assistant) and Aaron Weiker. The test corpus is gargoyle — an Elixir trading system with extensive architecture documentation (~35 design docs, ~5000 lines).

License

CC BY 4.0 — share and adapt with attribution.
