Rodin 1b108ff66e Initial publish: 29 findings, 6 prompts, methodology, open questions

Full comparative analysis of GPT-5, Claude Opus 4.6, Claude Sonnet 4.6,
GPT-4.1, and GPT-4.1 Mini on analytical tasks (not coding).

Contents:
- findings/ALL-FINDINGS.md — complete 3,249-line research log with all
  29 findings, methodology notes, and open questions
- prompts/ — 6 exact prompts used across experiments
- methodology.md — experimental setup and evaluation criteria
- open-questions.md — unanswered questions for future work
- README.md — overview and summary table

Key findings:
- Cross-document consistency: Opus is 2.4x faster with more findings
- Gap-finding: GPT-5 reasoning tokens find domain-specific gaps
- Race conditions: Opus excels at temporal interaction reasoning
- Bias detection: Signal-to-noise ratio > model capability
- Adversarial analysis: GPT-5 exhaustive, Opus qualitatively different

Signed-off-by: Rodin

2026-05-05 19:13:03 -07:00

Prompt: Design Coherence Analysis

Used in Findings #15, #27.

Setup

- Single document provided as full text
- No tools, no project context beyond the document
- Same prompt sent to all models independently
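The setup above can be sketched as follows. This is a minimal illustration, not the harness actually used: `build_prompt` and `run_independently` are hypothetical names, and each model is represented by a plain callable because the source does not say which client libraries were used.

```python
from typing import Callable, Dict

# Placeholder string as it appears at the end of the prompt template.
PLACEHOLDER = "[FULL TEXT OF DOCUMENT]"


def build_prompt(template: str, document_text: str) -> str:
    """Insert the full document text into the prompt template."""
    if template.count(PLACEHOLDER) != 1:
        raise ValueError("template must contain the placeholder exactly once")
    return template.replace(PLACEHOLDER, document_text)


def run_independently(
    prompt: str, models: Dict[str, Callable[[str], str]]
) -> Dict[str, str]:
    """Send the identical prompt to every model; no output is shared between calls."""
    # Each callable is a stand-in (assumption) for a real model client.
    return {name: ask(prompt) for name, ask in models.items()}
```

Keeping the model behind a callable makes the "same prompt, no tools, no extra context" constraint easy to check: the only input any model receives is the string returned by `build_prompt`.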

Prompt

You are analyzing a single design document for INTERNAL incoherence —
places where the document contradicts itself. The document states
principles, invariants, or guarantees in one place, then describes
mechanisms that violate those guarantees elsewhere.

## Categories of incoherence to check:

1. **Safety properties not enforced** — Document claims a safety property
   (e.g., "fail-closed") but the described mechanism has a path that
   violates it
2. **State machine violations** — Declared states/transitions don't match
   the described behavior (missing transitions, unreachable states,
   states with no exit)
3. **Recovery contradictions** — Recovery mechanism assumes preconditions
   that the failure scenario explicitly invalidates
4. **Supervision conflicts** — Supervision strategy contradicts the
   independence/coupling claims about the supervised processes
5. **Cross-mechanism contradictions** — Two different sections describe
   incompatible behaviors for the same scenario

## What to EXCLUDE:

- Missing features (things the document doesn't cover)
- Design tradeoffs that are explicitly acknowledged
- Future work items marked as such

## Output format per finding:

- **Category:** (one of the 5 above)
- **Severity:** Critical / High / Medium
- **Section A says:** (exact quote with section reference)
- **Section B says:** (exact quote with section reference)
- **The incoherence:** (explain the contradiction)
- **Why it matters:** (what would break in implementation)

## Document:

[FULL TEXT OF DOCUMENT]
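The per-finding output format requested by the prompt maps naturally onto a small record type. The sketch below is an assumption about how responses could be post-processed, not part of the original experiments: `Finding` and `parse_finding` are hypothetical names, and the parser handles only single-line field values.

```python
import re
from dataclasses import dataclass

SEVERITIES = ("Critical", "High", "Medium")

# Field labels from the prompt's output format, mapped to attribute names.
FIELDS = {
    "Category": "category",
    "Severity": "severity",
    "Section A says": "section_a",
    "Section B says": "section_b",
    "The incoherence": "incoherence",
    "Why it matters": "why_it_matters",
}

# Matches lines such as "- **Severity:** High".
LINE = re.compile(r"^- \*\*(?P<label>[^*]+):\*\*\s*(?P<value>.*)$")


@dataclass
class Finding:
    category: str
    severity: str
    section_a: str
    section_b: str
    incoherence: str
    why_it_matters: str


def parse_finding(block: str) -> Finding:
    """Parse one finding block; raise ValueError if a field is missing or invalid."""
    values = {}
    for line in block.splitlines():
        match = LINE.match(line.strip())
        if match and match.group("label") in FIELDS:
            values[FIELDS[match.group("label")]] = match.group("value").strip()
    missing = [name for name in FIELDS.values() if name not in values]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if values["severity"] not in SEVERITIES:
        raise ValueError(f"unknown severity: {values['severity']!r}")
    return Finding(**values)
```

Rejecting blocks that lack any of the six fields makes it easier to count findings consistently across models, since a partially formatted answer is flagged rather than silently counted.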

Results (Finding #15: failure-modes.md, 383 lines)

| Model | Time | Findings |
|---|---|---|
| Sonnet 4.6 | 39s | 5 |
| Opus 4.6 | 105s | 7 (8 attempted, 1 self-withdrawn) |
| GPT-5 | 120s | 4 |

Results (Finding #27: risk-controls.md, 992 lines)

| Model | Time | Findings |
|---|---|---|
| Sonnet 4.6 | 31s | 4 |
| Opus 4.6 | 86s | 5 |
| GPT-5 | 112s | 6 |

Key insight: results are document-dependent. By finding count, Opus 4.6 led on the shorter document (7 findings), while GPT-5 led on the longer, more complex one (6 findings).
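The comparison can be reproduced from the two result tables with a few lines of arithmetic. The sketch below uses only the times and counts reported above; the helper names are hypothetical, and findings per minute is an illustrative derived metric that says nothing about the quality of individual findings.

```python
# (seconds, findings) per model, copied from the two result tables.
RESULTS = {
    "Finding #15 (failure-modes.md, 383 lines)": {
        "Sonnet 4.6": (39, 5),
        "Opus 4.6": (105, 7),
        "GPT-5": (120, 4),
    },
    "Finding #27 (risk-controls.md, 992 lines)": {
        "Sonnet 4.6": (31, 4),
        "Opus 4.6": (86, 5),
        "GPT-5": (112, 6),
    },
}


def most_findings(rows):
    """Return the model name with the highest finding count."""
    return max(rows, key=lambda model: rows[model][1])


def findings_per_minute(seconds, count):
    """Convert a (seconds, findings) pair into findings per minute."""
    return 60.0 * count / seconds


for document, rows in RESULTS.items():
    print(f"{document}: most findings from {most_findings(rows)}")
    for model, (seconds, count) in rows.items():
        rate = findings_per_minute(seconds, count)
        print(f"  {model}: {rate:.1f} findings/min")
```

The leader by raw count changes between the two documents, which is the document dependence noted above. Sonnet 4.6 has the highest findings-per-minute rate on both.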
