claw bb0c0d564b Finding #40 : Silent data corruption paths in financial accounting

New analytical lens applied to lot-accounting.md (181 lines).
Tests how models identify sequences of individually correct
operations that produce silently wrong financial results.

Results:
- GPT-5: 12 findings (137s, 10688 reasoning tokens) - tax law domain knowledge
- Opus: 8 findings (121s) - concurrent systems / crash recovery focus
- Sonnet: 8 findings (111s) - structural meta-analysis, highest-leverage finding

Key insight: First experiment where domain-specific knowledge (tax law)
is the primary differentiator. Models reason from different knowledge
domains: GPT-5=tax law, Opus=distributed systems, Sonnet=architecture patterns.

Sonnet produced the most architecturally significant finding: that the
system's reconciliation mechanism confirms corruption rather than detecting
it (because it re-derives from LotClosed which is itself the corrupted source).

2026-05-07 11:09:58 -07:00
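
The reconciliation finding in the commit message above can be illustrated with a small sketch. This is not gargoyle's code (gargoyle is written in Elixir, and lot-accounting.md is not reproduced here); the event fields, the function names, and the broker-reported total are all hypothetical. It only shows why a check that re-derives totals from the same LotClosed events that built the ledger will agree with a corrupted ledger, while a check against an independent source will not.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class LotClosed:
    """Hypothetical event shape; the real LotClosed fields are not shown in the source."""
    lot_id: str
    cost_basis: Decimal
    proceeds: Decimal


def realized_pnl(events):
    """Derive realized P&L from LotClosed events."""
    return sum((e.proceeds - e.cost_basis for e in events), Decimal("0"))


def reconcile_against_events(ledger_pnl, events):
    """Self-referential check: re-derives from the same events the ledger used."""
    return ledger_pnl == realized_pnl(events)


def reconcile_against_independent(events, broker_proceeds):
    """Independent check: compares event proceeds with an external total (assumed available)."""
    return sum((e.proceeds for e in events), Decimal("0")) == broker_proceeds


# A lot whose proceeds were silently recorded as 900 instead of the true 1000.
corrupted = [LotClosed("lot-1", Decimal("800"), Decimal("900"))]
ledger_pnl = realized_pnl(corrupted)  # the ledger is built from the corrupted events

self_check = reconcile_against_events(ledger_pnl, corrupted)                    # passes
independent_check = reconcile_against_independent(corrupted, Decimal("1000"))  # fails
```

The self-referential check passes because both sides share the corrupted input; only the comparison against a source outside the event stream flags the discrepancy.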

Name	Last commit	Date
findings	Finding #40 : Silent data corruption paths in financial accounting	2026-05-07 11:09:58 -07:00
prompts	Initial publish: 29 findings, 6 prompts, methodology, open questions	2026-05-05 19:13:03 -07:00
review-prompts	feat: add generic review prompts and generation guide	2026-05-06 08:00:59 -07:00
LESSONS.md	docs: add generation timestamps to REPORT.md and LESSONS.md	2026-05-06 07:26:48 -07:00
methodology.md	Initial publish: 29 findings, 6 prompts, methodology, open questions	2026-05-05 19:13:03 -07:00
open-questions.md	finding #39 : narrow framing does not close Sonnet-GPT-5 gap for semantic consistency	2026-05-07 09:26:08 -07:00
README.md	docs(readme): add Reports section with links to REPORT.md and LESSONS.md	2026-05-06 07:29:03 -07:00
REPORT.md	docs: add generation timestamps to REPORT.md and LESSONS.md	2026-05-06 07:26:48 -07:00

README.md

Model Research — AI for Analytical Work

Comparative analysis of AI models on analytical tasks — not coding.

Most public discussion about LLM capabilities focuses on code generation. We found almost no published methodology for using models in analytical research tasks (searched 2026-04-26). This repo fills that gap with controlled experiments and reproducible findings.

What We're Testing

We use GPT-5, Claude Opus 4.6, Claude Sonnet 4.6, and GPT-4.1 (+ Mini) for:

Architecture document review
Bias and assumption detection
Gap-finding in design specifications
Cross-document consistency analysis
Race condition identification
Adversarial path analysis
Contradiction detection
Regulatory compliance review

Key Findings (Summary)

#	Task Type	Winner	Key Insight
1	PR review	Both	Different models catch different things — Sonnet: structural, GPT-5: semantic
2	Bias detection	Framing	Signal-to-noise ratio matters more than model capability
9	Gap-finding	GPT-5	Reasoning tokens find domain-specific gaps, not generic ones
10	Hidden assumptions	GPT-5	Reasoning produces qualitatively different (not just more) findings
13	Race conditions	Opus	Temporal interaction reasoning is Opus's strongest domain
15	Design coherence	Task-dependent	Single-doc: model choice depends on document complexity
25	Contradiction detection	Opus	Precision > exhaustiveness; Opus's self-correction is unique
28	Cross-doc consistency	Opus	2.4x faster than GPT-5 with more findings; boundary reasoning
29	Adversarial analysis	GPT-5 + Opus	GPT-5: exhaustive; Opus: qualitatively different attack vectors

Methodology

Each experiment:

Same input document(s) to all models
Same structured prompt with explicit categories to analyze
No tools, no project context beyond the document(s)
Independent runs — no cross-pollination between models
Results evaluated for: correctness, uniqueness, actionability

Context dimensions tracked:

Rich vs minimal (how much background info)
Broad vs focused ("review this" vs "answer this specific question")
What kind of context (diff, full files, issue text, nothing)
Whether the model had tools or just text
Whether the task was step-by-step or open-ended
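
The experiment protocol and context dimensions above can be sketched as a small harness. This is an illustrative sketch, not the repository's actual tooling: the `Context` and `Run` types, the `run_experiment` and `unique_findings` functions, and the stub models are all hypothetical. Uniqueness is approximated here by exact string match; correctness and actionability still need human judgment.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Context:
    """Context dimensions tracked per experiment (field names are illustrative)."""
    richness: str        # "rich" or "minimal"
    scope: str           # "broad" or "focused"
    kind: str            # "diff", "full files", "issue text", "nothing"
    tools: bool          # whether the model had tools or just text
    step_by_step: bool   # step-by-step vs open-ended task


@dataclass
class Run:
    model: str
    findings: List[str]


def run_experiment(document: str, prompt: str,
                   models: Dict[str, Callable[[str], List[str]]]) -> List[Run]:
    """Send the identical prompt and document to each model, independently."""
    full_prompt = f"{prompt}\n\n{document}"
    # Each call sees only full_prompt; no model's output is fed to another.
    return [Run(name, list(call(full_prompt))) for name, call in models.items()]


def unique_findings(runs: List[Run]) -> Dict[str, List[str]]:
    """Findings reported by exactly one model (a crude stand-in for 'uniqueness')."""
    counts: Dict[str, int] = {}
    for run in runs:
        for finding in set(run.findings):
            counts[finding] = counts.get(finding, 0) + 1
    return {run.model: sorted(f for f in set(run.findings) if counts[f] == 1)
            for run in runs}


# Stub models stand in for real API calls.
stub_models = {
    "model-a": lambda p: ["missing timeout", "unbounded retry"],
    "model-b": lambda p: ["missing timeout", "stale cache read"],
}
ctx = Context("minimal", "focused", "nothing", tools=False, step_by_step=False)
runs = run_experiment("doc text", "List gaps by category.", stub_models)
unique = unique_findings(runs)
```

Recording a `Context` alongside each set of runs keeps the tracked dimensions explicit, so results from differently framed experiments are not compared as if they were equivalent.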

Reports

REPORT.md — Full research analysis. Covers model strengths with evidence, task-type → model mappings, meta-findings about how to use models effectively, cost-effectiveness comparison, and open questions. Regenerated weekly from all findings.
LESSONS.md — Actionable summary. The distilled "here's what to actually do" version: three core rules, operational playbooks for different review types, anti-patterns to avoid, and a model personality cheat sheet. Start here if you want answers, not methodology.

Both files include a generation timestamp and are automatically regenerated every Monday at 9 AM Pacific to incorporate new experiment results.

Repository Structure

REPORT.md           # Full research report (auto-regenerated weekly)
LESSONS.md          # Actionable lessons and playbooks (auto-regenerated weekly)
findings/           # Individual experiment files (one per experiment)
  README.md         # Context and index
  YYYY-MM-DD-NN-slug.md
  2026-04-26-01-different-models-catch-different-things.md
  ...
  2026-05-05-29-adversarial-manipulation-analysis-new-task.md
prompts/            # Exact prompts used for reproducibility
  cross-document-consistency.md
  design-coherence.md
  gap-finding.md
  hidden-assumptions.md
  ...
open-questions.md   # Unanswered questions for future experiments
methodology.md      # Full methodology notes

Findings are named YYYY-MM-DD-NN-slug.md for chronological sorting. Numbers are zero-padded (01–29). The duplicate finding #7 uses a b suffix.
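
As a sketch of how this naming convention can be handled programmatically, the helpers below parse and build finding filenames. The function names are hypothetical, and the position of the b suffix (directly after the number, as in `07b`) is an assumption, since the source does not show the duplicate finding's filename.

```python
import re
from datetime import date

# YYYY-MM-DD-NN-slug.md, with an optional "b" after the number (assumed placement).
FINDING_RE = re.compile(r"^(\d{4})-(\d{2})-(\d{2})-(\d{2,})(b?)-([a-z0-9-]+)\.md$")


def parse_finding(filename):
    """Return the parts of a finding filename, or None if it does not match."""
    m = FINDING_RE.match(filename)
    if m is None:
        return None
    year, month, day, number, suffix, slug = m.groups()
    return {
        "date": date(int(year), int(month), int(day)),
        "number": int(number),
        "suffix": suffix,
        "slug": slug,
    }


def finding_filename(day, number, slug, suffix=""):
    """Build a filename with an ISO date and a zero-padded two-digit number."""
    return f"{day.isoformat()}-{number:02d}{suffix}-{slug}.md"


first = parse_finding("2026-04-26-01-different-models-catch-different-things.md")
last = finding_filename(date(2026, 5, 5), 29, "adversarial-manipulation-analysis-new-task")
not_a_finding = parse_finding("README.md")
```

Because the date comes first and the number is zero-padded, a plain lexicographic sort of the filenames is also a chronological sort.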

Who We Are

This research is conducted by Rodin (AI assistant) and Aaron Weiker. The test corpus is gargoyle — an Elixir trading system with extensive architecture documentation (~35 design docs, ~5000 lines).

License

CC BY 4.0 — share and adapt with attribution.
