Initial publish: 29 findings, 6 prompts, methodology, open questions

Full comparative analysis of GPT-5, Claude Opus 4.6, Claude Sonnet 4.6, GPT-4.1, and GPT-4.1 Mini on analytical tasks (not coding).

Contents:
- findings/ALL-FINDINGS.md: complete 3,249-line research log with all 29 findings, methodology notes, and open questions
- prompts/: 6 exact prompts used across experiments
- methodology.md: experimental setup and evaluation criteria
- open-questions.md: unanswered questions for future work
- README.md: overview and summary table

Key findings:
- Cross-document consistency: Opus is 2.4x faster than GPT-5 with more findings
- Gap-finding: GPT-5 reasoning tokens find domain-specific gaps
- Race conditions: Opus excels at temporal interaction reasoning
- Bias detection: Signal-to-noise ratio > model capability
- Adversarial analysis: GPT-5 exhaustive, Opus qualitatively different

Signed-off-by: Rodin
2026-05-05 19:13:03 -07:00
parent 4aea0d004b
commit 1b108ff66e
10 changed files with 3831 additions and 2 deletions
@@ -0,0 +1,80 @@
+# Prompt: Cross-Document Consistency Analysis
+
+Used in Finding #28.
+
+## Setup
+
+- Two documents provided as full text in a single prompt (~25KB total)
+- Document A: `system-overview.md` (323 lines, narrative overview)
+- Document B: `architecture.md` (213 lines, DDD-focused)
+- No tools, no project context beyond the two documents
+- Same prompt sent independently to each of the 3 models (Opus, GPT-5, Sonnet)
+
+## Prompt
+
+```
+You are analyzing two architecture documents that describe the SAME system.
+Your task is to identify places where these documents CONTRADICT each other
+— not where they differ in scope or detail level, but where they make
+incompatible claims about the same concept.
+
+## Categories of inconsistency to check:
+
+1. **Terminology conflicts** — Same concept called different names in ways
+   that imply different meanings (not just abbreviation)
+2. **Structural contradictions** — Documents disagree about what is inside
+   vs outside a component boundary
+3. **Flow/sequence conflicts** — Documents describe incompatible orderings
+   or data flows for the same process
+4. **Ownership/authority conflicts** — Documents disagree about which
+   component owns, writes, or is authoritative for a concept
+5. **Philosophical contradictions** — Documents state incompatible
+   foundational assumptions (e.g., event sourcing vs CRUD)
+
+## What to EXCLUDE:
+
+- Omissions (one doc covers something the other doesn't)
+- Detail-level differences (one is more detailed than the other)
+- Naming differences that are clearly just abbreviations
+- Scope differences (one covers more topics)
+
+## Output format per finding:
+
+For each inconsistency found:
+- **Category:** (one of the 5 above)
+- **Severity:** Critical / High / Medium
+- **Document A says:** (exact quote or precise paraphrase with section ref)
+- **Document B says:** (exact quote or precise paraphrase with section ref)
+- **Why these are incompatible:** (explain why both cannot be correct)
+- **Impact:** (what would go wrong if an implementer followed both)
+
+## Document A: [system-overview.md]
+
+[FULL TEXT OF DOCUMENT A]
+
+## Document B: [architecture.md]
+
+[FULL TEXT OF DOCUMENT B]
+```
+
+## Key Design Decisions
+
+1. **Explicit exclusion of omissions** — prevents models from padding
+   findings with "Doc A mentions X but Doc B doesn't"
+2. **Five specific categories** — focuses attention without being
+   so restrictive that models miss novel inconsistency types
+3. **Required "why incompatible" explanation** — forces models to reason
+   about WHY differences matter, not just list differences
+4. **Impact field** — grounds findings in practical consequences
+5. **Both documents in single prompt** — enables cross-referencing
+   without tool calls or context fragmentation
+
+## Results
+
+| Model | Time | Findings | Tokens/finding |
+|-------|------|----------|----------------|
+| Opus | 52s | 7 | 336 |
+| GPT-5 | 125s | 6 | 2,967 |
+| Sonnet | 14s | 4 | 194 |
+
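+Opus is recommended for this task type: it reported the most findings (7) and finished 2.4x faster than GPT-5 (52s vs 125s), although Sonnet was fastest overall (14s, 4 findings).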
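
The figures in the Results table can be cross-checked with a short script. This is a minimal sketch and not part of the commit: the `Result` class and `speedup` helper are hypothetical names introduced for illustration, and the total-token estimate assumes that "Tokens/finding" is simply total tokens divided by the number of findings, which the source does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """One row of the Results table (values copied from the table)."""
    model: str
    seconds: int
    findings: int
    tokens_per_finding: int

    @property
    def seconds_per_finding(self) -> float:
        return self.seconds / self.findings

    @property
    def approx_total_tokens(self) -> int:
        # Assumption: tokens/finding = total tokens / findings,
        # so this only recovers an approximate total.
        return self.findings * self.tokens_per_finding


RESULTS = [
    Result("Opus", 52, 7, 336),
    Result("GPT-5", 125, 6, 2967),
    Result("Sonnet", 14, 4, 194),
]


def speedup(baseline: Result, candidate: Result) -> float:
    """How many times faster `candidate` finished than `baseline`."""
    return baseline.seconds / candidate.seconds


by_model = {r.model: r for r in RESULTS}

for r in RESULTS:
    print(f"{r.model:7s} {r.seconds_per_finding:5.1f} s/finding  "
          f"~{r.approx_total_tokens} tokens")

ratio = speedup(by_model["GPT-5"], by_model["Opus"])
print(f"Opus vs GPT-5 speedup: {ratio:.1f}x")
```

Under these assumptions the 125s / 52s ratio rounds to the 2.4x quoted in the commit message, and the per-finding view shows the trade-off in the table: Sonnet is cheapest per finding, while Opus returns the most findings.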