Files

T

Rodin 1b108ff66e Initial publish: 29 findings, 6 prompts, methodology, open questions

Full comparative analysis of GPT-5, Claude Opus 4.6, Claude Sonnet 4.6,
GPT-4.1, and GPT-4.1 Mini on analytical tasks (not coding).

Contents:
- findings/ALL-FINDINGS.md — complete 3,249-line research log with all
  29 findings, methodology notes, and open questions
- prompts/ — 6 exact prompts used across experiments
- methodology.md — experimental setup and evaluation criteria
- open-questions.md — unanswered questions for future work
- README.md — overview and summary table

Key findings:
- Cross-document consistency: Opus is 2.4x faster with more findings
- Gap-finding: GPT-5 reasoning tokens find domain-specific gaps
- Race conditions: Opus excels at temporal interaction reasoning
- Bias detection: Signal-to-noise ratio > model capability
- Adversarial analysis: GPT-5 exhaustive, Opus qualitatively different

Signed-off-by: Rodin

2026-05-05 19:13:03 -07:00

1.5 KiB

Raw Blame History

Prompt: Gap-Finding in Architecture Documents

Used in Finding #9.

Setup

Single document (full text, no truncation)
Same focused analytical question to all models
No tools, no project context beyond the document
Temperature 0.3 for GPT-4.1/Mini, default for GPT-5

Prompt

You are a systems reliability engineer reviewing a failure modes document
for a trading platform. Your task is to identify MISSING failure scenarios
— things that COULD go wrong in this architecture but are NOT covered in
the document.

Focus on:
1. Scenarios specific to THIS architecture (not generic "server could crash")
2. Interactions between components that could produce unexpected states
3. External dependency failures not covered
4. Timing/ordering issues in the described sequences
5. Recovery procedures that have gaps

For each missing scenario:
- **Scenario:** What goes wrong
- **Why it's specific to this system:** Why generic monitoring wouldn't catch it
- **Impact:** What state the system ends up in
- **Why the document misses it:** What assumption makes this invisible

## Document:

[FULL TEXT OF failure-modes.md, 383 lines]

Results

Model	Time	Output tokens	Reasoning tokens	Scenarios found
GPT-4.1 Mini	16s	2,003	0	10
GPT-4.1	24s	2,575	0	15
GPT-5	45s	8,565	6,656	14

GPT-5 found the most domain-specific and actionable gaps despite finding fewer total scenarios than GPT-4.1. Quality > quantity.

1.5 KiB Raw Blame History

Prompt: Gap-Finding in Architecture Documents

Setup

Prompt

Results

1.5 KiB

Raw Blame History