Initial publish: 29 findings, 6 prompts, methodology, open questions
Full comparative analysis of GPT-5, Claude Opus 4.6, Claude Sonnet 4.6, GPT-4.1, and GPT-4.1 Mini on analytical tasks (not coding).

Contents:
- findings/ALL-FINDINGS.md: complete 3,249-line research log with all 29 findings, methodology notes, and open questions
- prompts/: 6 exact prompts used across experiments
- methodology.md: experimental setup and evaluation criteria
- open-questions.md: unanswered questions for future work
- README.md: overview and summary table

Key findings:
- Cross-document consistency: Opus is 2.4x faster with more findings
- Gap-finding: GPT-5 reasoning tokens find domain-specific gaps
- Race conditions: Opus excels at temporal interaction reasoning
- Bias detection: Signal-to-noise ratio > model capability
- Adversarial analysis: GPT-5 exhaustive, Opus qualitatively different

Signed-off-by: Rodin
@@ -0,0 +1,53 @@
# Prompt: Hidden Assumption Identification

Used in Findings #10, #11, #12.

## Setup

- Single document (full text)
- Same prompt to all models
- No tools, no project context beyond the document
- Temperature 0.3 for non-reasoning models
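The setup rules above can be sketched as a small request builder. This is a minimal sketch under stated assumptions, not the harness used in the experiments: `build_request`, `REASONING_MODELS`, `PROMPT_TEMPLATE`, the model identifier strings, and the dictionary layout are all hypothetical. Only the rules themselves (same prompt for every model, full document text, no tools, temperature 0.3 for non-reasoning models) come from the list.

```python
# Hypothetical sketch of the setup rules; not the actual experiment harness.
PROMPT_TEMPLATE = "{instructions}\n\n## Document:\n\n{document}"

# Assumption: "gpt-5" is a placeholder name for the only reasoning model here.
REASONING_MODELS = {"gpt-5"}


def build_request(model: str, instructions: str, document: str) -> dict:
    """Build one request: identical prompt for every model, no tools."""
    request = {
        "model": model,
        "prompt": PROMPT_TEMPLATE.format(
            instructions=instructions, document=document
        ),
        "tools": [],  # no tools, no project context beyond the document
    }
    # Temperature is pinned only for non-reasoning models.
    if model not in REASONING_MODELS:
        request["temperature"] = 0.3
    return request


# Placeholder model names and texts, for illustration only.
requests = [
    build_request(name, "Find hidden assumptions.", "full document text")
    for name in ("gpt-4.1-mini", "gpt-4.1", "gpt-5")
]
```

Keeping the prompt construction in one place makes it easy to check that every model received byte-identical input, which is what the "same prompt to all models" rule requires.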
## Prompt
```
You are reviewing a system design document for hidden assumptions —
things the design DEPENDS ON being true but does NOT explicitly state
or validate.

A hidden assumption is different from a design decision:
- Design decision: "We use event sourcing" (explicit choice)
- Hidden assumption: "Events will always be delivered in order"
(unstated dependency that could break)

For each hidden assumption found:
- **Assumption:** What the design implicitly depends on
- **Where it's hidden:** Which mechanism relies on it (section reference)
- **What breaks if violated:** Concrete failure mode
- **Likelihood of violation:** In production, how likely is this to be
violated? (not in theory — in the real world with network partitions,
clock skew, operator error, etc.)

Focus on assumptions that:
1. Are NOT explicitly stated in the document
2. COULD realistically be violated in production
3. Would cause SILENT incorrect behavior (not loud crashes)
4. Are specific to THIS architecture (not generic distributed systems concerns)

## Document:

[FULL TEXT OF DOCUMENT]
```
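The prompt asks for four labelled fields per assumption, which makes responses countable by machine. The following is a minimal sketch of such a parser, assuming a model reproduces the bullet format exactly; the `HiddenAssumption` class, `parse_assumptions` function, and attribute names are hypothetical and are not part of the original experiments. Real responses may deviate from the format, so counts obtained this way would still need a manual check.

```python
import re
from dataclasses import dataclass


@dataclass
class HiddenAssumption:
    assumption: str
    where_hidden: str
    what_breaks: str
    likelihood: str


# Field labels copied from the prompt; the attribute names are hypothetical.
LABELS = {
    "Assumption": "assumption",
    "Where it's hidden": "where_hidden",
    "What breaks if violated": "what_breaks",
    "Likelihood of violation": "likelihood",
}

# Matches bullets of the form "- **Label:** value".
FIELD_RE = re.compile(r"^\s*-\s*\*\*(?P<label>[^*:]+):\*\*\s*(?P<value>.*)$")


def parse_assumptions(response: str) -> list[HiddenAssumption]:
    """Group labelled bullets into records.

    A new "Assumption" bullet starts a new record. Continuation lines of a
    wrapped value are ignored, and missing fields default to "".
    """
    records: list[dict] = []
    current = None
    for line in response.splitlines():
        match = FIELD_RE.match(line)
        if not match:
            continue
        label = match["label"].strip()
        if label not in LABELS:
            continue
        if label == "Assumption":
            current = {}
            records.append(current)
        if current is not None:
            current[LABELS[label]] = match["value"].strip()
    return [
        HiddenAssumption(**{name: rec.get(name, "") for name in LABELS.values()})
        for rec in records
    ]
```

With a parser like this, `len(parse_assumptions(text))` gives a per-response count of the kind reported in an "assumptions found" column.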
## Results (Finding #10: cold-start-and-recovery.md, 234 lines)
| Model | Time | Output tokens | Reasoning tokens | Assumptions found |
|-------|------|---------------|------------------|-------------------|
| GPT-4.1 Mini | 25s | 3,090 | 0 | 12 |
| GPT-4.1 | 77s | 2,751 | 0 | 14 |
| GPT-5 | 78s | 2,649 | 4,096 | 26 |

GPT-5 found roughly 2x as many assumptions as either GPT-4.1 model (26 vs. 14 and 12), and they were qualitatively different:
multi-component interaction assumptions that require reasoning about
system-level behavior, not just local properties.
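The arithmetic behind the "2x" comparison can be checked directly from the table. The sketch below is illustrative only: the `RESULTS` mapping and both helper functions are hypothetical names, and the tokens-per-assumption figure assumes that reasoning tokens are reported separately from output tokens rather than already included in them, which the table does not state.

```python
# (time_s, output_tokens, reasoning_tokens, assumptions) from the results table.
RESULTS = {
    "GPT-4.1 Mini": (25, 3090, 0, 12),
    "GPT-4.1": (77, 2751, 0, 14),
    "GPT-5": (78, 2649, 4096, 26),
}


def ratio_vs(baseline: str, model: str = "GPT-5") -> float:
    """How many times more assumptions `model` found than `baseline`."""
    return RESULTS[model][3] / RESULTS[baseline][3]


def tokens_per_assumption(model: str) -> float:
    """Total generated tokens per assumption.

    Assumption: reasoning tokens are counted in addition to output tokens.
    """
    _, output, reasoning, found = RESULTS[model]
    return (output + reasoning) / found


for name in RESULTS:
    print(f"{name}: {tokens_per_assumption(name):.1f} tokens per assumption")
print(f"GPT-5 vs GPT-4.1: {ratio_vs('GPT-4.1'):.2f}x")
print(f"GPT-5 vs GPT-4.1 Mini: {ratio_vs('GPT-4.1 Mini'):.2f}x")
```

The ratios come out to about 1.86x against GPT-4.1 and 2.17x against GPT-4.1 Mini, so "2x" is a rounded figure that sits between the two baselines.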