Initial publish: 29 findings, 6 prompts, methodology, open questions
Full comparative analysis of GPT-5, Claude Opus 4.6, Claude Sonnet 4.6, GPT-4.1, and GPT-4.1 Mini on analytical tasks (not coding). Contents: - findings/ALL-FINDINGS.md — complete 3,249-line research log with all 29 findings, methodology notes, and open questions - prompts/ — 6 exact prompts used across experiments - methodology.md — experimental setup and evaluation criteria - open-questions.md — unanswered questions for future work - README.md — overview and summary table Key findings: - Cross-document consistency: Opus is 2.4x faster with more findings - Gap-finding: GPT-5 reasoning tokens find domain-specific gaps - Race conditions: Opus excels at temporal interaction reasoning - Bias detection: Signal-to-noise ratio > model capability - Adversarial analysis: GPT-5 exhaustive, Opus qualitatively different Signed-off-by: Rodin
This commit is contained in:
@@ -0,0 +1,47 @@
|
||||
# Prompt: Gap-Finding in Architecture Documents
|
||||
|
||||
Used in Finding #9.
|
||||
|
||||
## Setup
|
||||
|
||||
- Single document (full text, no truncation)
|
||||
- Same focused analytical question to all models
|
||||
- No tools, no project context beyond the document
|
||||
- Temperature 0.3 for GPT-4.1/Mini, default for GPT-5
|
||||
|
||||
## Prompt
|
||||
|
||||
```
|
||||
You are a systems reliability engineer reviewing a failure modes document
|
||||
for a trading platform. Your task is to identify MISSING failure scenarios
|
||||
— things that COULD go wrong in this architecture but are NOT covered in
|
||||
the document.
|
||||
|
||||
Focus on:
|
||||
1. Scenarios specific to THIS architecture (not generic "server could crash")
|
||||
2. Interactions between components that could produce unexpected states
|
||||
3. External dependency failures not covered
|
||||
4. Timing/ordering issues in the described sequences
|
||||
5. Recovery procedures that have gaps
|
||||
|
||||
For each missing scenario:
|
||||
- **Scenario:** What goes wrong
|
||||
- **Why it's specific to this system:** Why generic monitoring wouldn't catch it
|
||||
- **Impact:** What state the system ends up in
|
||||
- **Why the document misses it:** What assumption makes this invisible
|
||||
|
||||
## Document:
|
||||
|
||||
[FULL TEXT OF failure-modes.md, 383 lines]
|
||||
```
|
||||
|
||||
## Results
|
||||
|
||||
| Model | Time | Output tokens | Reasoning tokens | Scenarios found |
|
||||
|-------|------|---------------|------------------|-----------------|
|
||||
| GPT-4.1 Mini | 16s | 2,003 | 0 | 10 |
|
||||
| GPT-4.1 | 24s | 2,575 | 0 | 15 |
|
||||
| GPT-5 | 45s | 8,565 | 6,656 | 14 |
|
||||
|
||||
GPT-5 found the most domain-specific and actionable gaps despite finding
|
||||
fewer total scenarios than GPT-4.1. Quality > quantity.
|
||||
Reference in New Issue
Block a user