1b108ff66e
Full comparative analysis of GPT-5, Claude Opus 4.6, Claude Sonnet 4.6, GPT-4.1, and GPT-4.1 Mini on analytical tasks (not coding).

Contents:
- findings/ALL-FINDINGS.md: complete 3,249-line research log with all 29 findings, methodology notes, and open questions
- prompts/: 6 exact prompts used across experiments
- methodology.md: experimental setup and evaluation criteria
- open-questions.md: unanswered questions for future work
- README.md: overview and summary table

Key findings:
- Cross-document consistency: Opus is 2.4x faster with more findings
- Gap-finding: GPT-5 reasoning tokens find domain-specific gaps
- Race conditions: Opus excels at temporal interaction reasoning
- Bias detection: Signal-to-noise ratio > model capability
- Adversarial analysis: GPT-5 exhaustive, Opus qualitatively different

Signed-off-by: Rodin
2.9 KiB
Prompt: Cross-Document Consistency Analysis
Used in Finding #28.
Setup
- Two documents provided as full text in a single prompt (~25KB total)
- Document A: system-overview.md (323 lines, narrative overview)
- Document B: architecture.md (213 lines, DDD-focused)
- No tools, no project context beyond the two documents
- Same prompt to all 3 models independently
Prompt
You are analyzing two architecture documents that describe the SAME system.
Your task is to identify places where these documents CONTRADICT each other
— not where they differ in scope or detail level, but where they make
incompatible claims about the same concept.
## Categories of inconsistency to check:
1. **Terminology conflicts** — Same concept called different names in ways
that imply different meanings (not just abbreviation)
2. **Structural contradictions** — Documents disagree about what is inside
vs outside a component boundary
3. **Flow/sequence conflicts** — Documents describe incompatible orderings
or data flows for the same process
4. **Ownership/authority conflicts** — Documents disagree about which
component owns, writes, or is authoritative for a concept
5. **Philosophical contradictions** — Documents state incompatible
foundational assumptions (e.g., event sourcing vs CRUD)
## What to EXCLUDE:
- Omissions (one doc covers something the other doesn't)
- Detail-level differences (one is more detailed than the other)
- Naming differences that are clearly just abbreviations
- Scope differences (one covers more topics)
## Output format per finding:
For each inconsistency found:
- **Category:** (one of the 5 above)
- **Severity:** Critical / High / Medium
- **Document A says:** (exact quote or precise paraphrase with section ref)
- **Document B says:** (exact quote or precise paraphrase with section ref)
- **Why these are incompatible:** (explain why both cannot be correct)
- **Impact:** (what would go wrong if an implementer followed both)
## Document A: [system-overview.md]
[FULL TEXT OF DOCUMENT A]
## Document B: [architecture.md]
[FULL TEXT OF DOCUMENT B]
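The two `[FULL TEXT OF DOCUMENT ...]` placeholders above can be filled programmatically. The following is a minimal sketch, not the tooling used in the experiments: the function names `build_prompt` and `build_prompt_from_files` are hypothetical, and `instructions` is assumed to hold the prompt text up to (but not including) the `## Document A` heading.

```python
from pathlib import Path


def build_prompt(instructions: str, name_a: str, text_a: str,
                 name_b: str, text_b: str) -> str:
    """Join the instructions and both documents into one prompt string.

    Hypothetical helper: the section headings mirror the
    "## Document A: [...]" / "## Document B: [...]" lines of the prompt.
    """
    return "\n".join([
        instructions.rstrip(),
        f"## Document A: [{name_a}]",
        text_a.strip(),
        f"## Document B: [{name_b}]",
        text_b.strip(),
    ])


def build_prompt_from_files(instructions: str, path_a: Path, path_b: Path) -> str:
    """Read both documents from disk and build the single combined prompt."""
    return build_prompt(
        instructions,
        path_a.name, path_a.read_text(encoding="utf-8"),
        path_b.name, path_b.read_text(encoding="utf-8"),
    )
```

Sending the resulting string unchanged to each model keeps the comparison consistent with the "same prompt to all models" setup.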
Key Design Decisions
- Explicit exclusion of omissions — prevents models from padding findings with "Doc A mentions X but Doc B doesn't"
- Five specific categories — focuses attention without being so restrictive that models miss novel inconsistency types
- Required "why incompatible" explanation — forces models to reason about WHY differences matter, not just list differences
- Impact field — grounds findings in practical consequences
- Both documents in single prompt — enables cross-referencing without tool calls or context fragmentation
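Because the prompt requires a fixed set of fields per finding, a response can be checked mechanically for missing fields or out-of-range severities. The sketch below is one possible way to do that, assuming the model follows the `- **Field:** value` layout literally and starts each finding with the Category field; `parse_findings` and `problems` are hypothetical names, not part of the original experiments.

```python
import re

# Field names and severity levels taken from the prompt's output format.
FIELDS = [
    "Category",
    "Severity",
    "Document A says",
    "Document B says",
    "Why these are incompatible",
    "Impact",
]
SEVERITIES = {"Critical", "High", "Medium"}

# Matches lines such as "- **Severity:** High".
FIELD_RE = re.compile(r"^\s*-\s*\*\*(?P<name>[^*]+?):\*\*\s*(?P<value>.*)$")


def parse_findings(text: str) -> list[dict[str, str]]:
    """Split a model response into one dict per finding.

    Assumption: every finding begins with a "Category" field.
    """
    findings: list[dict[str, str]] = []
    current: dict[str, str] = {}
    last = None
    for line in text.splitlines():
        match = FIELD_RE.match(line)
        if match and match.group("name") in FIELDS:
            name = match.group("name")
            if name == "Category" and current:
                findings.append(current)
                current = {}
            current[name] = match.group("value").strip()
            last = name
        elif line.lstrip().startswith("#"):
            continue  # skip any headings the model adds between findings
        elif current and last and line.strip():
            # Wrapped continuation of the previous field's value.
            current[last] += " " + line.strip()
    if current:
        findings.append(current)
    return findings


def problems(finding: dict[str, str]) -> list[str]:
    """List format violations for a single parsed finding."""
    issues = [f"missing field: {name}" for name in FIELDS if not finding.get(name)]
    severity = finding.get("Severity")
    if severity and severity not in SEVERITIES:
        issues.append(f"unknown severity: {severity}")
    return issues
```

A check like this only verifies the format; whether a reported contradiction is real still needs human review against the two documents.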
Results
| Model | Time | Findings | Tokens/finding |
|---|---|---|---|
| Opus | 52s | 7 | 336 |
| GPT-5 | 125s | 6 | 2,967 |
| Sonnet | 14s | 4 | 194 |
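Opus is recommended for this task type: it produced the most findings (7) and finished 2.4x faster than GPT-5 (52s vs 125s). Sonnet was fastest (14s) but reported only 4 findings.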
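The figures in the table can be combined into a few derived metrics. This is a rough sketch: the numbers are copied from the table, but `total_tokens` assumes that "Tokens/finding" means total tokens divided by the number of findings, which the source does not state explicitly. The names `RESULTS`, `derived`, and `speedup` are illustrative.

```python
# Values copied from the results table.
RESULTS = {
    "Opus": {"seconds": 52, "findings": 7, "tokens_per_finding": 336},
    "GPT-5": {"seconds": 125, "findings": 6, "tokens_per_finding": 2967},
    "Sonnet": {"seconds": 14, "findings": 4, "tokens_per_finding": 194},
}


def derived(row: dict) -> dict:
    """Compute per-model metrics implied by the table."""
    return {
        # Assumption: tokens/finding = total tokens / findings.
        "total_tokens": row["findings"] * row["tokens_per_finding"],
        "seconds_per_finding": row["seconds"] / row["findings"],
    }


def speedup(slower: str, faster: str) -> float:
    """Wall-clock ratio between two models."""
    return RESULTS[slower]["seconds"] / RESULTS[faster]["seconds"]


if __name__ == "__main__":
    for model, row in RESULTS.items():
        print(model, derived(row))
    # 125s / 52s rounds to 2.4, matching the "2.4x faster" claim.
    print("Opus vs GPT-5:", round(speedup("GPT-5", "Opus"), 1))
```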