model-research/prompts/hidden-assumptions.md
Rodin 1b108ff66e Initial publish: 29 findings, 6 prompts, methodology, open questions
Full comparative analysis of GPT-5, Claude Opus 4.6, Claude Sonnet 4.6,
GPT-4.1, and GPT-4.1 Mini on analytical tasks (not coding).

Contents:
- findings/ALL-FINDINGS.md — complete 3,249-line research log with all
  29 findings, methodology notes, and open questions
- prompts/ — 6 exact prompts used across experiments
- methodology.md — experimental setup and evaluation criteria
- open-questions.md — unanswered questions for future work
- README.md — overview and summary table

Key findings:
- Cross-document consistency: Opus is 2.4x faster with more findings
- Gap-finding: GPT-5 reasoning tokens find domain-specific gaps
- Race conditions: Opus excels at temporal interaction reasoning
- Bias detection: Signal-to-noise ratio > model capability
- Adversarial analysis: GPT-5 exhaustive, Opus qualitatively different

Signed-off-by: Rodin
2026-05-05 19:13:03 -07:00

# Prompt: Hidden Assumption Identification
Used in Findings #10, #11, #12.
## Setup
- Single document (full text)
- Same prompt to all models
- No tools, no project context beyond the document
- Temperature 0.3 for non-reasoning models
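
The setup rules above can be sketched as a small request builder. This is a minimal illustration, not the harness actually used: the function name `build_request`, the `is_reasoning` flag, and the shape of the returned dict are all hypothetical, and no real provider API is called. Only the placeholder string and the temperature value come from this file.

```python
# Placeholder token as it appears in the prompt template in this file.
PLACEHOLDER = "[FULL TEXT OF DOCUMENT]"


def build_request(model: str, is_reasoning: bool, template: str, document: str) -> dict:
    """Build a provider-agnostic request dict (hypothetical shape).

    Every model receives the same prompt text; only the sampling
    settings differ between reasoning and non-reasoning models.
    """
    if PLACEHOLDER not in template:
        raise ValueError("template is missing the document placeholder")
    request = {
        "model": model,
        # Single document, full text, substituted into the shared template.
        "prompt": template.replace(PLACEHOLDER, document),
        # No tools and no extra project context.
        "tools": [],
    }
    if not is_reasoning:
        # Temperature 0.3 applies to non-reasoning models only.
        request["temperature"] = 0.3
    return request
```

A call such as `build_request("GPT-4.1", False, template, doc_text)` would carry `temperature: 0.3`, while the same call with `is_reasoning=True` would leave the temperature unset.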
## Prompt
```
You are reviewing a system design document for hidden assumptions —
things the design DEPENDS ON being true but does NOT explicitly state
or validate.
A hidden assumption is different from a design decision:
- Design decision: "We use event sourcing" (explicit choice)
- Hidden assumption: "Events will always be delivered in order"
(unstated dependency that could break)
For each hidden assumption found:
- **Assumption:** What the design implicitly depends on
- **Where it's hidden:** Which mechanism relies on it (section reference)
- **What breaks if violated:** Concrete failure mode
- **Likelihood of violation:** In production, how likely is this to be
violated? (not in theory — in the real world with network partitions,
clock skew, operator error, etc.)
Focus on assumptions that:
1. Are NOT explicitly stated in the document
2. COULD realistically be violated in production
3. Would cause SILENT incorrect behavior (not loud crashes)
4. Are specific to THIS architecture (not generic distributed systems concerns)
## Document:
[FULL TEXT OF DOCUMENT]
```
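
Because the prompt asks for four labelled fields per assumption, a response can be split into entries and counted mechanically. The sketch below assumes the model follows the requested `**Field:**` labels literally and keeps only the first line of each field value; the function name `parse_assumptions` is hypothetical, and this file does not say how the counts in the results were actually produced.

```python
import re

# Matches a labelled field such as "**Assumption:** Events arrive in order".
FIELD = re.compile(r"\*\*(?P<name>[^*:]+):\*\*\s*(?P<value>.*)")


def parse_assumptions(response: str) -> list[dict[str, str]]:
    """Group labelled fields into one dict per assumption.

    A new entry starts at every "**Assumption:**" label. Lines without a
    label (including continuation lines of a long value) are skipped.
    """
    entries: list[dict[str, str]] = []
    for line in response.splitlines():
        match = FIELD.search(line)
        if match is None:
            continue
        name = match.group("name").strip()
        value = match.group("value").strip()
        if name == "Assumption":
            entries.append({})
        if entries:  # ignore stray fields that appear before the first entry
            entries[-1][name] = value
    return entries
```

`len(parse_assumptions(text))` then gives a count of assumptions for one response, under the formatting assumption stated above.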
## Results (Finding #10: cold-start-and-recovery.md, 234 lines)
| Model | Time | Output tokens | Reasoning tokens | Assumptions found |
|-------|------|---------------|------------------|-------------------|
| GPT-4.1 Mini | 25s | 3,090 | 0 | 12 |
| GPT-4.1 | 77s | 2,751 | 0 | 14 |
| GPT-5 | 78s | 2,649 | 4,096 | 26 |
GPT-5 found roughly twice as many assumptions as either GPT-4.1 model
(26 vs. 12 and 14), and they were qualitatively different:
multi-component interaction assumptions that require reasoning about
system-level behavior, not just local properties.
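
The figures in the results table can be turned into ratios with a few lines of arithmetic. The numbers below are copied from the table; the per-assumption token cost is a derived illustration that does not appear in the original findings, and the helper names are hypothetical.

```python
# Rows copied from the Finding #10 results table.
RESULTS = {
    "GPT-4.1 Mini": {"seconds": 25, "output": 3090, "reasoning": 0, "found": 12},
    "GPT-4.1": {"seconds": 77, "output": 2751, "reasoning": 0, "found": 14},
    "GPT-5": {"seconds": 78, "output": 2649, "reasoning": 4096, "found": 26},
}


def found_ratio(baseline: str, target: str = "GPT-5") -> float:
    """How many times more assumptions `target` found than `baseline`."""
    return RESULTS[target]["found"] / RESULTS[baseline]["found"]


def tokens_per_assumption(model: str) -> float:
    """Output plus reasoning tokens spent per assumption found."""
    row = RESULTS[model]
    return (row["output"] + row["reasoning"]) / row["found"]


for name in RESULTS:
    print(f"{name}: {tokens_per_assumption(name):.1f} tokens per assumption")

print(f"GPT-5 vs GPT-4.1 Mini: {found_ratio('GPT-4.1 Mini'):.2f}x")  # 2.17x
print(f"GPT-5 vs GPT-4.1: {found_ratio('GPT-4.1'):.2f}x")  # 1.86x
```

Counting reasoning tokens, GPT-5 spends about 259 tokens per assumption, close to GPT-4.1 Mini's 257.5, so its higher count comes with a proportionally larger token budget rather than a lower per-assumption cost.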