1b108ff66e
Full comparative analysis of GPT-5, Claude Opus 4.6, Claude Sonnet 4.6, GPT-4.1, and GPT-4.1 Mini on analytical tasks (not coding).

Contents:
- findings/ALL-FINDINGS.md: complete 3,249-line research log with all 29 findings, methodology notes, and open questions
- prompts/: 6 exact prompts used across experiments
- methodology.md: experimental setup and evaluation criteria
- open-questions.md: unanswered questions for future work
- README.md: overview and summary table

Key findings:
- Cross-document consistency: Opus is 2.4x faster with more findings
- Gap-finding: GPT-5 reasoning tokens find domain-specific gaps
- Race conditions: Opus excels at temporal interaction reasoning
- Bias detection: Signal-to-noise ratio > model capability
- Adversarial analysis: GPT-5 exhaustive, Opus qualitatively different

Signed-off-by: Rodin
72 lines
2.2 KiB
Markdown
# Prompt: Design Coherence Analysis

Used in Findings #15, #27.

## Setup

- Single document provided as full text
- No tools, no project context beyond the document
- Same prompt to all models independently
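
The setup above can be sketched as a small harness. This is a minimal illustration, not the harness actually used in the experiments: `run_independently`, `Run`, and the idea of passing each model as a plain callable are all assumptions made for the example, and no real provider API is shown.

```python
import time
from typing import Callable, Dict, List, NamedTuple


class Run(NamedTuple):
    """One model's result for one prompt (hypothetical record type)."""
    model: str
    seconds: float
    response: str


def run_independently(prompt: str, models: Dict[str, Callable[[str], str]]) -> List[Run]:
    """Send the identical prompt text to every model and time each call.

    Each callable receives only the prompt string: no tools, no shared
    history, and no output from any other model.
    """
    runs = []
    for name, call in models.items():
        start = time.perf_counter()
        response = call(prompt)
        runs.append(Run(name, time.perf_counter() - start, response))
    return runs
```

Each value in `models` would wrap whatever client a given provider requires; keeping that behind a `str -> str` callable makes it easy to confirm that every model saw exactly the same text.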
## Prompt

```
You are analyzing a single design document for INTERNAL incoherence —
places where the document contradicts itself. The document states
principles, invariants, or guarantees in one place, then describes
mechanisms that violate those guarantees elsewhere.

## Categories of incoherence to check:

1. **Safety properties not enforced** — Document claims a safety property
(e.g., "fail-closed") but the described mechanism has a path that
violates it
2. **State machine violations** — Declared states/transitions don't match
the described behavior (missing transitions, unreachable states,
states with no exit)
3. **Recovery contradictions** — Recovery mechanism assumes preconditions
that the failure scenario explicitly invalidates
4. **Supervision conflicts** — Supervision strategy contradicts the
independence/coupling claims about the supervised processes
5. **Cross-mechanism contradictions** — Two different sections describe
incompatible behaviors for the same scenario

## What to EXCLUDE:

- Missing features (things the document doesn't cover)
- Design tradeoffs that are explicitly acknowledged
- Future work items marked as such

## Output format per finding:

- **Category:** (one of the 5 above)
- **Severity:** Critical / High / Medium
- **Section A says:** (exact quote with section reference)
- **Section B says:** (exact quote with section reference)
- **The incoherence:** (explain the contradiction)
- **Why it matters:** (what would break in implementation)

## Document:

[FULL TEXT OF DOCUMENT]
```
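
A rough sketch of how the prompt template and its per-finding output format could be handled in code. Everything here is an assumption for illustration: `build_prompt`, `parse_findings`, and the `Finding` record are hypothetical names, and the parser assumes a model follows the bullet format literally, with a `**Category:**` line opening each finding. Real responses may deviate and need looser handling.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

PLACEHOLDER = "[FULL TEXT OF DOCUMENT]"

# Labels from the prompt's "Output format per finding" list -> attribute names.
FIELDS = {
    "Category": "category",
    "Severity": "severity",
    "Section A says": "section_a",
    "Section B says": "section_b",
    "The incoherence": "incoherence",
    "Why it matters": "why_it_matters",
}

# Matches bullets such as "- **Severity:** High".
FIELD_RE = re.compile(r"^\s*[-*]\s*\*\*(?P<label>[^*]+?):\*\*\s*(?P<value>.*)$")


@dataclass
class Finding:
    category: str = ""
    severity: str = ""
    section_a: str = ""
    section_b: str = ""
    incoherence: str = ""
    why_it_matters: str = ""


def build_prompt(template: str, document: str) -> str:
    """Substitute the document text for the template's placeholder."""
    if PLACEHOLDER not in template:
        raise ValueError("template has no document placeholder")
    return template.replace(PLACEHOLDER, document)


def parse_findings(response: str) -> List[Finding]:
    """Collect findings from a response that follows the bullet format."""
    findings: List[Finding] = []
    current: Optional[Finding] = None
    last_attr: Optional[str] = None
    for line in response.splitlines():
        stripped = line.strip()
        match = FIELD_RE.match(line)
        if match and match.group("label").strip() in FIELDS:
            attr = FIELDS[match.group("label").strip()]
            if attr == "category":  # a Category bullet opens a new finding
                current = Finding()
                findings.append(current)
            if current is not None:
                setattr(current, attr, match.group("value").strip())
                last_attr = attr
        elif not stripped or stripped.startswith("#"):
            last_attr = None  # blank lines and headings end a wrapped value
        elif current is not None and last_attr is not None:
            # Wrapped continuation of the previous field's value.
            joined = (getattr(current, last_attr) + " " + stripped).strip()
            setattr(current, last_attr, joined)
    return findings
```

Counting `len(parse_findings(response))` per model gives one way to arrive at a per-model findings total, though self-withdrawn findings would still need manual review.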
## Results (Finding #15: failure-modes.md, 383 lines)

| Model | Time | Findings |
|-------|------|----------|
| Sonnet 4.6 | 39s | 5 |
| Opus 4.6 | 105s | 7 (8 attempted, 1 self-withdrawn) |
| GPT-5 | 120s | 4 |
## Results (Finding #27: risk-controls.md, 992 lines)

| Model | Time | Findings |
|-------|------|----------|
| Sonnet 4.6 | 31s | 4 |
| Opus 4.6 | 86s | 5 |
| GPT-5 | 112s | 6 |

Key insight: results are document-dependent. By finding count, Opus 4.6 led on the shorter document (7 findings on 383 lines), while GPT-5 led on the longer, more complex one (6 findings on 992 lines). Sonnet 4.6 was the fastest on both.
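
The comparison can be reproduced with a few lines of Python. The numbers are copied from the two tables above; the `findings_per_minute` rate is an illustrative derived metric, not one the original analysis reports, and neither it nor the raw count says anything about the quality of individual findings.

```python
# (model, seconds, findings) rows copied from the result tables.
# Opus 4.6 is counted at 7 for Finding #15 (the self-withdrawn finding is excluded).
RESULTS = {
    "Finding #15 (failure-modes.md, 383 lines)": [
        ("Sonnet 4.6", 39, 5),
        ("Opus 4.6", 105, 7),
        ("GPT-5", 120, 4),
    ],
    "Finding #27 (risk-controls.md, 992 lines)": [
        ("Sonnet 4.6", 31, 4),
        ("Opus 4.6", 86, 5),
        ("GPT-5", 112, 6),
    ],
}


def most_findings(rows):
    """Return the model with the highest finding count in one experiment."""
    return max(rows, key=lambda row: row[2])[0]


def findings_per_minute(seconds, findings):
    """Illustrative throughput: findings reported per minute of wall time."""
    return findings * 60 / seconds


for label, rows in RESULTS.items():
    print(f"{label}: most findings = {most_findings(rows)}")
    for model, seconds, findings in rows:
        rate = findings_per_minute(seconds, findings)
        print(f"  {model}: {findings} in {seconds}s ({rate:.1f}/min)")
```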