Initial publish: 29 findings, 6 prompts, methodology, open questions
Full comparative analysis of GPT-5, Claude Opus 4.6, Claude Sonnet 4.6, GPT-4.1, and GPT-4.1 Mini on analytical tasks (not coding).

Contents:
- findings/ALL-FINDINGS.md: complete 3,249-line research log with all 29 findings, methodology notes, and open questions
- prompts/: 6 exact prompts used across experiments
- methodology.md: experimental setup and evaluation criteria
- open-questions.md: unanswered questions for future work
- README.md: overview and summary table

Key findings:
- Cross-document consistency: Opus is 2.4x faster with more findings
- Gap-finding: GPT-5 reasoning tokens find domain-specific gaps
- Race conditions: Opus excels at temporal interaction reasoning
- Bias detection: Signal-to-noise ratio > model capability
- Adversarial analysis: GPT-5 exhaustive, Opus qualitatively different

Signed-off-by: Rodin
@@ -1,3 +1,81 @@
# model-research

# Model Research: AI for Analytical Work

Comparative analysis of AI models on analytical tasks, not coding. Tracking what works when using GPT-5, Claude Opus, Claude Sonnet, and GPT-4.1 for research, document review, bias detection, and architecture analysis.

Comparative analysis of AI models on **analytical tasks**, not coding.

Most public discussion about LLM capabilities focuses on code generation.
We found almost no published methodology for using models in analytical
research tasks (searched 2026-04-26). This repo fills that gap with
controlled experiments and reproducible findings.

## What We're Testing

Using GPT-5, Claude Opus 4.6, Claude Sonnet 4.6, and GPT-4.1 (+ Mini) for:

- Architecture document review
- Bias and assumption detection
- Gap-finding in design specifications
- Cross-document consistency analysis
- Race condition identification
- Adversarial path analysis
- Contradiction detection
- Regulatory compliance review

## Key Findings (Summary)

| # | Task Type | Winner | Key Insight |
|---|-----------|--------|-------------|
| 1 | PR review | Both | Different models catch different things (Sonnet: structural, GPT-5: semantic) |
| 2 | Bias detection | Framing | Signal-to-noise ratio matters more than model capability |
| 9 | Gap-finding | GPT-5 | Reasoning tokens find domain-specific gaps, not generic ones |
| 10 | Hidden assumptions | GPT-5 | Reasoning produces qualitatively different (not just more) findings |
| 13 | Race conditions | Opus | Temporal interaction reasoning is Opus's strongest domain |
| 15 | Design coherence | Task-dependent | Single-doc: model choice depends on document complexity |
| 25 | Contradiction detection | Opus | Precision > exhaustiveness; Opus's self-correction is unique |
| 28 | Cross-doc consistency | Opus | 2.4x faster than GPT-5 with more findings; boundary reasoning |
| 29 | Adversarial analysis | GPT-5 + Opus | GPT-5: exhaustive; Opus: qualitatively different attack vectors |

## Methodology

Each experiment:
1. Same input document(s) to all models
2. Same structured prompt with explicit categories to analyze
3. No tools, no project context beyond the document(s)
4. Independent runs: no cross-pollination between models
5. Results evaluated for: correctness, uniqueness, actionability

**Context dimensions tracked:**
- Rich vs minimal (how much background info)
- Broad vs focused ("review this" vs "answer this specific question")
- What kind of context (diff, full files, issue text, nothing)
- Whether the model had tools or just text
- Whether the task was step-by-step or open-ended

## Repository Structure

```
findings/                # Individual findings with full analysis
  01-different-models-different-things.md
  02-narrow-lens-vs-broad-review.md
  ...
  28-cross-document-consistency.md
  29-adversarial-manipulation.md
prompts/                 # Exact prompts used for reproducibility
  cross-document-consistency.md
  design-coherence.md
  gap-finding.md
  hidden-assumptions.md
  ...
open-questions.md        # Unanswered questions for future experiments
methodology.md           # Full methodology notes
```

## Who We Are

This research is conducted by [Rodin](https://gitea.weiker.me/rodin) (AI
assistant) and Aaron Weiker. The test corpus is gargoyle, an Elixir
trading system with extensive architecture documentation (~35 design docs,
~5000 lines).

## License

CC BY 4.0: share and adapt with attribution.

File diff suppressed because it is too large
@@ -0,0 +1,76 @@
# Methodology

## Principles

1. **Internet opinions about models are overwhelmingly about coding.** Don't
   extrapolate to analytical work without testing.
2. **"Just because someone says it on the internet doesn't make it right."**
   Opinions need context. Track our own evidence.
3. **Absence of published methodology for a use case is itself a finding.**
4. **No unsupported generalizations.** Each finding needs: date, task,
   how we used it (context shape, task framing, what info the model
had/didn't have), what happened, takeaway.
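
One way to keep principle 4 checkable is to store each finding as a structured record and reject entries with empty fields. The sketch below is only an illustration: the class name, field names, and example values are hypothetical and are not the format used in the research log.

```python
from dataclasses import dataclass, fields
from datetime import date


@dataclass(frozen=True)
class FindingRecord:
    """Hypothetical record mirroring the fields principle 4 asks for."""

    recorded_on: date
    task: str
    context_shape: str    # e.g. "single document, full text"
    task_framing: str     # e.g. "focused: missing failure scenarios"
    info_available: str   # what the model had
    info_withheld: str    # what the model did not have
    what_happened: str
    takeaway: str


def missing_fields(record: FindingRecord) -> list[str]:
    """Return the names of text fields that were left empty."""
    empty = []
    for field in fields(record):
        value = getattr(record, field.name)
        if isinstance(value, str) and not value.strip():
            empty.append(field.name)
    return empty


# Placeholder values for illustration only; not taken from the research log.
example = FindingRecord(
    recorded_on=date(2026, 4, 26),
    task="gap-finding",
    context_shape="single document, full text",
    task_framing="focused",
    info_available="the design document",
    info_withheld="tools and wider project context",
    what_happened="placeholder description",
    takeaway="",
)
print(missing_fields(example))
```

A record whose `missing_fields` list is non-empty would count as an unsupported generalization under this scheme and could be held back until the gaps are filled.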
## Experimental Setup

### Models Tested

| Model | Provider | Access | Notes |
|-------|----------|--------|-------|
| GPT-5 | OpenAI (via HAI proxy) | API | Requires `max_completion_tokens` ≥16K |
| Claude Opus 4.6 | Anthropic (via HAI proxy) | API | Internal reasoning (not exposed) |
| Claude Sonnet 4.6 | Anthropic (via HAI proxy) | API | Fast, cost-effective |
| GPT-4.1 | OpenAI (via HAI proxy) | API | Non-reasoning, structured output |
| GPT-4.1 Mini | OpenAI (via HAI proxy) | API | Cheapest, good for screening |
| Claude Sonnet 4.5 | Anthropic (via HAI proxy) | API | Predecessor to 4.6 |

### Control Variables

- **Same input:** All models receive identical document text
- **Same prompt:** Structured prompt with explicit categories and output format
- **Same constraints:** No tools, no project context beyond the document(s)
- **Independent runs:** No cross-pollination between model runs
- **Temperature:** 0.3 for GPT-4.1/Mini; default (1.0) for GPT-5 (required)
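
The per-model settings above can be captured in one small lookup so every run uses the same parameters. This is a sketch under assumptions: the model labels are illustrative strings rather than verified API model IDs, 16,384 is just one value that satisfies the "≥16K" note, and the Claude settings are left empty because this document does not state them.

```python
def sampling_settings(model: str) -> dict:
    """Return request parameters for one model, following the control variables.

    Model labels are illustrative, not verified API identifiers.
    """
    if model in {"gpt-4.1", "gpt-4.1-mini"}:
        # Non-reasoning models run at temperature 0.3.
        return {"temperature": 0.3}
    if model == "gpt-5":
        # GPT-5 runs at its default temperature, so none is sent.
        # The models table asks for a completion budget of at least 16K.
        return {"max_completion_tokens": 16_384}
    # No temperature is documented here for the Claude models,
    # so this sketch leaves their settings unspecified.
    return {}


for name in ("gpt-4.1", "gpt-4.1-mini", "gpt-5", "claude-opus-4.6"):
    print(name, sampling_settings(name))
```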
### Measurement

- **Time:** Wall clock from request to final token
- **Output tokens:** Total generated tokens
- **Reasoning tokens:** For reasoning models (GPT-5), exposed separately
- **Findings count:** Number of distinct issues identified
- **Unique findings:** Issues found by only one model
- **Severity distribution:** Critical / High / Medium / Low per finding
- **Tokens per finding:** Efficiency metric
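
Two of these metrics are simple to compute once findings have been given comparable labels. The sketch below assumes a reviewer has already normalized findings into short labels by hand (the labels `F1` to `F4` and the model names `model_a` to `model_c` are hypothetical); it is not the tooling used for the experiments.

```python
def tokens_per_finding(output_tokens: int, findings: int) -> float:
    """Efficiency metric: generated tokens divided by distinct findings."""
    if findings <= 0:
        raise ValueError("findings must be positive")
    return output_tokens / findings


def unique_findings(by_model: dict[str, set[str]]) -> dict[str, set[str]]:
    """For each model, return the findings that no other model reported."""
    result = {}
    for model, found in by_model.items():
        others = set()
        for other_model, other_found in by_model.items():
            if other_model != model:
                others |= other_found
        result[model] = found - others
    return result


# Hypothetical labels, assigned manually after reading each model's output.
labelled = {
    "model_a": {"F1", "F2", "F3"},
    "model_b": {"F2", "F4"},
    "model_c": {"F2"},
}
print(unique_findings(labelled))
print(tokens_per_finding(1200, 4))
```

The set difference only works if the same underlying issue always receives the same label, which is where the single-evaluator subjectivity enters.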
### Evaluation Criteria

Each finding is assessed for:
1. **Correctness:** Is the identified issue real?
2. **Uniqueness:** Did only this model find it?
3. **Actionability:** Would a developer change something based on this?
4. **Depth:** Surface observation vs architectural insight?

### Context Dimensions Tracked

| Dimension | Options |
|-----------|---------|
| Context richness | Rich (full project) vs Minimal (document only) |
| Task framing | Broad ("review this") vs Focused ("check for X") |
| Context type | Diff, full files, issue text, research notes, nothing |
| Tool access | With tools (API calls, file reads) vs text-only |
| Task structure | Step-by-step explicit vs open-ended |

## Limitations

- Single test corpus (gargoyle architecture docs): domain bias possible
- Single researcher evaluating findings: subjectivity in quality assessment
- Models are non-deterministic: single runs, not averaged
- Proxy adds latency: timing comparisons are relative, not absolute
- Internal reasoning tokens not visible for Claude models

## Reproducibility

Prompts for each experiment are in the `prompts/` directory. The test
corpus is the gargoyle project's `docs/` directory (available at
`gitea.weiker.me/grgl/gargoyle`). Each finding documents the exact document
used, its line count, and the specific version/commit when relevant.
@@ -0,0 +1,58 @@
# Open Questions

Unanswered questions from experiments, ordered by potential impact.

## High Priority

### Signal-to-noise confirmation (from Finding #8)
Give a model the FULL PR review context (diff, files, issue, AC) but add
the narrow bias question as an explicit review checklist item. If the model
catches bias despite the rich context, it confirms the signal-to-noise
hypothesis. If it misses, it suggests something else (attention allocation,
task switching cost).

### Cross-document consistency as maintenance tool (from Finding #28)
Does running cross-doc analysis across MORE document pairs (domain readmes
vs implementation docs, design docs vs plan docs) yield additional real
inconsistencies? Could become a systematic documentation maintenance tool.

### Why Opus dominates cross-doc consistency (from Finding #28)
Opus was 2.4x faster AND found more issues than GPT-5. Is this because
cross-doc contradictions are easy to verify once spotted (reducing GPT-5's
verification advantage)? Or because boundary reasoning (Opus's strength)
is the primary skill needed?

### Sonnet + narrow framing = GPT-5 level? (from Finding #5)
Would Sonnet catch semantic issues if given a narrower "check for logical
consistency" framing instead of broad review? The hypothesis: Sonnet's
"structural reviewer" tendency is a framing artifact, not a capability limit.

## Medium Priority

### Adversarial analysis ensemble (from Finding #29)
Run GPT-5 and Opus sequentially: give Opus access to GPT-5's findings
and ask it to critique and extend. Does the ensemble find more than either
alone? Does Opus's system-level thinking complement GPT-5's exhaustiveness?
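
The sequential setup proposed here can be sketched without committing to any particular API client. In the sketch below the two models are plain callables that take a prompt and return text, and the wording of the critique instruction is an assumption, not a prompt that has been tested.

```python
from typing import Callable

# A model is represented as a function: prompt text in, response text out.
# Real runs would wrap an API call; that wrapper is outside this sketch.
ModelFn = Callable[[str], str]


def run_ensemble(first: ModelFn, second: ModelFn, base_prompt: str) -> dict[str, str]:
    """Run `first` on the base prompt, then ask `second` to critique and extend."""
    first_findings = first(base_prompt)
    critique_prompt = (
        f"{base_prompt}\n\n"
        "## Findings from a previous analyst:\n\n"
        f"{first_findings}\n\n"
        "Critique these findings, then extend them with anything they miss."
    )
    return {"first": first_findings, "second": second(critique_prompt)}


# Stub models so the control flow can be exercised offline.
def stub_exhaustive(prompt: str) -> str:
    return "1. placeholder finding"


def stub_critic(prompt: str) -> str:
    return f"critique written after reading {len(prompt)} characters"


print(run_ensemble(stub_exhaustive, stub_critic, "BASE PROMPT"))
```

To answer the question as posed, the second model's output would be compared against its own independent run on the same document, so that gains from the ensemble are separated from what it would have found anyway.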
### Reasoning effort parameter (from Finding #21)
Reasoning effort (low/medium/high) had negligible effect on GPT-5's
analytical output. Is this because the parameter doesn't work for open-ended
analysis? Or because the task was already within GPT-5's "easy" threshold?
Test with a harder document.

### Model personality vs prompt (from Finding #26)
Missing-feature identification IS promptable across all models: prompt
framing eliminates Opus's historical advantage. How many other "model
personality" observations are actually just prompt framing effects?

## Answered Questions

- ~~Opus's "missing feature identification" mode: is it promptable?~~
  **YES** (Finding #26): all models find regulatory gaps when explicitly
  prompted. Opus's behavior was an emergent DEFAULT tendency, not a unique
  capability.

- ~~Is Opus > GPT-5 for coherence tasks universal?~~
  **NO** (Finding #27): Opus's advantage from Finding #15 was
  document-specific. On risk-controls.md (992 lines, more complex), GPT-5
  regained top position. Document complexity and domain specialization
  affect ranking.
@@ -0,0 +1,59 @@
# Prompt: Adversarial Manipulation Analysis

Used in Finding #29.

## Setup

- Single document (full text)
- Same prompt to all models
- No tools, no project context beyond the document

## Prompt

```
You are a red-team security analyst reviewing a trading system's
aggregation component. Your task is to identify how a MISBEHAVING,
COMPROMISED, or BUGGY upstream component could exploit this design
to produce harmful trading outcomes that bypass downstream safety controls.

## Categories of adversarial manipulation:

1. **Signal injection** — How could a compromised strategy inject signals
that exploit the aggregator's logic to produce dangerous decisions?
2. **Timing manipulation** — How could an attacker manipulate timing
(delays, bursts, clock skew) to exploit the aggregator's temporal logic?
3. **Capacity weaponization** — How could the max_signals bound or group
completion logic be exploited to force premature or delayed decisions?
4. **State corruption via crash** — How could deliberate crashes be used
to put the aggregator in an exploitable state?
5. **Audit evasion** — How could an attacker cause the aggregator to make
decisions that don't appear in the audit log, or appear differently
than what actually happened?

## For each attack vector:

- **Category:** (one of the 5 above)
- **Attack vector:** Name of the attack
- **Mechanism:** How the attacker exploits the design
- **Exploit:** Step-by-step attack sequence
- **Why downstream controls miss it:** Why PortfolioRisk, BuyingPower,
or other downstream checks don't catch this
- **Severity:** Critical / High / Medium
- **Mitigation:** What the design could add to prevent it

## Document:

[FULL TEXT OF aggregation.md, 193 lines]
```

## Results

| Model | Time | Findings | Unique vectors |
|-------|------|----------|----------------|
| GPT-5 | ~150s | 8 | 3 (most exhaustive) |
| Opus | ~65s | 6 | 2 (qualitatively different) |
| Sonnet | ~20s | 4 | 0 (subset of others) |

GPT-5 was most exhaustive and systematic. Opus found qualitatively different
attack vectors with system-level thinking (e.g., exploiting supervision tree
restart semantics).
@@ -0,0 +1,58 @@
# Prompt: Contradiction Detection

Used in Finding #25.

## Setup

- Single document (full text)
- Same prompt to all models
- No tools, no project context beyond the document

## Prompt

```
You are analyzing a design document for CONTRADICTIONS — places where
the document makes two claims that cannot both be true simultaneously.

This is NOT about:
- Missing information
- Unclear writing
- Design tradeoffs
- Things that MIGHT conflict

This IS about:
- Statement A says X, Statement B says NOT-X
- Mechanism A requires condition C, Mechanism B prevents condition C
- Rule A applies to set S, but S includes elements that violate Rule A

## Categories:

1. **Direct contradictions** — Two statements that are logically incompatible
2. **Mechanism conflicts** — Two described mechanisms that cannot coexist
3. **Scope violations** — A rule/invariant that is violated by a specific
case described elsewhere in the document
4. **Temporal impossibilities** — A sequence that requires something to be
true before the described mechanism makes it true

## For each contradiction:

- **Category:** (one of the 4 above)
- **Statement A:** (exact text, with section)
- **Statement B:** (exact text, with section)
- **Why contradictory:** (formal reasoning about incompatibility)
- **Severity:** Critical (system correctness) / High (safety) / Medium (confusion)

Be PRECISE. Only report genuine logical contradictions, not differences
in emphasis or scope.

## Document:

[FULL TEXT OF DOCUMENT]
```

## Key Design Decision

The "Be PRECISE" instruction and the explicit exclusion list ("NOT about")
are critical. Without them, models pad findings with style/clarity issues.
The contradiction prompt naturally favors Opus (self-correcting, withdraws
false positives) over GPT-5 (exhaustive, includes borderline cases).
@@ -0,0 +1,80 @@
# Prompt: Cross-Document Consistency Analysis

Used in Finding #28.

## Setup

- Two documents provided as full text in a single prompt (~25KB total)
- Document A: `system-overview.md` (323 lines, narrative overview)
- Document B: `architecture.md` (213 lines, DDD-focused)
- No tools, no project context beyond the two documents
- Same prompt to all 3 models independently

## Prompt

```
You are analyzing two architecture documents that describe the SAME system.
Your task is to identify places where these documents CONTRADICT each other
— not where they differ in scope or detail level, but where they make
incompatible claims about the same concept.

## Categories of inconsistency to check:

1. **Terminology conflicts** — Same concept called different names in ways
that imply different meanings (not just abbreviation)
2. **Structural contradictions** — Documents disagree about what is inside
vs outside a component boundary
3. **Flow/sequence conflicts** — Documents describe incompatible orderings
or data flows for the same process
4. **Ownership/authority conflicts** — Documents disagree about which
component owns, writes, or is authoritative for a concept
5. **Philosophical contradictions** — Documents state incompatible
foundational assumptions (e.g., event sourcing vs CRUD)

## What to EXCLUDE:

- Omissions (one doc covers something the other doesn't)
- Detail-level differences (one is more detailed than the other)
- Naming differences that are clearly just abbreviations
- Scope differences (one covers more topics)

## Output format per finding:

For each inconsistency found:
- **Category:** (one of the 5 above)
- **Severity:** Critical / High / Medium
- **Document A says:** (exact quote or precise paraphrase with section ref)
- **Document B says:** (exact quote or precise paraphrase with section ref)
- **Why these are incompatible:** (explain why both cannot be correct)
- **Impact:** (what would go wrong if an implementer followed both)

## Document A: [system-overview.md]

[FULL TEXT OF DOCUMENT A]

## Document B: [architecture.md]

[FULL TEXT OF DOCUMENT B]
```

## Key Design Decisions

1. **Explicit exclusion of omissions**: prevents models from padding
   findings with "Doc A mentions X but Doc B doesn't"
2. **Five specific categories**: focuses attention without being
   so restrictive that models miss novel inconsistency types
3. **Required "why incompatible" explanation**: forces models to reason
   about WHY differences matter, not just list differences
4. **Impact field**: grounds findings in practical consequences
5. **Both documents in single prompt**: enables cross-referencing
without tool calls or context fragmentation
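
Assembling the single prompt from two files is mechanical, and doing it in code keeps the document headers identical across runs. The following is a minimal sketch, assuming the instruction text is held in a string and both documents are local UTF-8 files; the function name and the demo files are hypothetical.

```python
from pathlib import Path
import tempfile


def build_cross_doc_prompt(instructions: str, doc_a: Path, doc_b: Path) -> str:
    """Place the instructions and both full documents in a single prompt string."""
    parts = [
        instructions.rstrip(),
        f"## Document A: [{doc_a.name}]",
        doc_a.read_text(encoding="utf-8").strip(),
        f"## Document B: [{doc_b.name}]",
        doc_b.read_text(encoding="utf-8").strip(),
    ]
    return "\n\n".join(parts) + "\n"


# Demo with throwaway files; real runs would point at the two corpus documents.
with tempfile.TemporaryDirectory() as tmp:
    a = Path(tmp) / "system-overview.md"
    b = Path(tmp) / "architecture.md"
    a.write_text("Overview text.\n", encoding="utf-8")
    b.write_text("Architecture text.\n", encoding="utf-8")
    demo_prompt = build_cross_doc_prompt("Find contradictions.", a, b)

print(demo_prompt)
```

Because both documents travel in one request, the model can quote from either side when filling the "Document A says" and "Document B says" fields.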
## Results

| Model | Time | Findings | Tokens/finding |
|-------|------|----------|----------------|
| Opus | 52s | 7 | 336 |
| GPT-5 | 125s | 6 | 2,967 |
| Sonnet | 14s | 4 | 194 |

Opus is recommended for this task type: it returned the most findings (7)
and was 2.4x faster than GPT-5.
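
The 2.4x figure and the token cost behind each row follow directly from the table. The sketch below recomputes them; note that the total token counts are implied values (findings times the rounded tokens-per-finding column), so they are approximations rather than measured totals.

```python
# Values copied from the results table above.
results = {
    "Opus": {"seconds": 52, "findings": 7, "tokens_per_finding": 336},
    "GPT-5": {"seconds": 125, "findings": 6, "tokens_per_finding": 2967},
    "Sonnet": {"seconds": 14, "findings": 4, "tokens_per_finding": 194},
}


def speedup(slower: str, faster: str) -> float:
    """Wall-clock ratio between two models on the same prompt."""
    return results[slower]["seconds"] / results[faster]["seconds"]


def implied_total_tokens(model: str) -> int:
    """Approximate total tokens: findings times tokens per finding."""
    row = results[model]
    return row["findings"] * row["tokens_per_finding"]


print(f"Opus vs GPT-5: {speedup('GPT-5', 'Opus'):.1f}x faster")  # 2.4x
for model in results:
    print(model, implied_total_tokens(model))  # 2352, 17802, 776
```

These are single runs through a proxy, so the ratio is best read as a relative comparison rather than an absolute latency claim.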
@@ -0,0 +1,71 @@
# Prompt: Design Coherence Analysis

Used in Findings #15, #27.

## Setup

- Single document provided as full text
- No tools, no project context beyond the document
- Same prompt to all models independently

## Prompt

```
You are analyzing a single design document for INTERNAL incoherence —
places where the document contradicts itself. The document states
principles, invariants, or guarantees in one place, then describes
mechanisms that violate those guarantees elsewhere.

## Categories of incoherence to check:

1. **Safety properties not enforced** — Document claims a safety property
(e.g., "fail-closed") but the described mechanism has a path that
violates it
2. **State machine violations** — Declared states/transitions don't match
the described behavior (missing transitions, unreachable states,
states with no exit)
3. **Recovery contradictions** — Recovery mechanism assumes preconditions
that the failure scenario explicitly invalidates
4. **Supervision conflicts** — Supervision strategy contradicts the
independence/coupling claims about the supervised processes
5. **Cross-mechanism contradictions** — Two different sections describe
incompatible behaviors for the same scenario

## What to EXCLUDE:

- Missing features (things the document doesn't cover)
- Design tradeoffs that are explicitly acknowledged
- Future work items marked as such

## Output format per finding:

- **Category:** (one of the 5 above)
- **Severity:** Critical / High / Medium
- **Section A says:** (exact quote with section reference)
- **Section B says:** (exact quote with section reference)
- **The incoherence:** (explain the contradiction)
- **Why it matters:** (what would break in implementation)

## Document:

[FULL TEXT OF DOCUMENT]
```

## Results (Finding #15: failure-modes.md, 383 lines)

| Model | Time | Findings |
|-------|------|----------|
| Sonnet 4.6 | 39s | 5 |
| Opus 4.6 | 105s | 7 (8 attempted, 1 self-withdrawn) |
| GPT-5 | 120s | 4 |

## Results (Finding #27: risk-controls.md, 992 lines)

| Model | Time | Findings |
|-------|------|----------|
| Sonnet 4.6 | 31s | 4 |
| Opus 4.6 | 86s | 5 |
| GPT-5 | 112s | 6 |

Key insight: results are document-dependent. Opus won on the shorter doc;
GPT-5 won on the longer, more complex one.
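
The reversal is easy to see when the two tables are ranked side by side. The sketch below orders models by finding count only, which is an assumption made for illustration: the evaluation criteria also weigh correctness, uniqueness, actionability, and depth, none of which a raw count captures.

```python
# Finding counts copied from the two results tables above.
findings_by_doc = {
    "failure-modes.md (383 lines)": {"Sonnet 4.6": 5, "Opus 4.6": 7, "GPT-5": 4},
    "risk-controls.md (992 lines)": {"Sonnet 4.6": 4, "Opus 4.6": 5, "GPT-5": 6},
}


def ranking(counts: dict[str, int]) -> list[str]:
    """Order models from most to fewest findings."""
    return sorted(counts, key=lambda model: counts[model], reverse=True)


for doc, counts in findings_by_doc.items():
    print(doc, "->", ranking(counts))
```

On the shorter document the order is Opus, Sonnet, GPT-5; on the longer one it is GPT-5, Opus, Sonnet. With one run per model per document, this shows that the ranking can flip, not how often it does.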
@@ -0,0 +1,47 @@
# Prompt: Gap-Finding in Architecture Documents

Used in Finding #9.

## Setup

- Single document (full text, no truncation)
- Same focused analytical question to all models
- No tools, no project context beyond the document
- Temperature 0.3 for GPT-4.1/Mini, default for GPT-5

## Prompt

```
You are a systems reliability engineer reviewing a failure modes document
for a trading platform. Your task is to identify MISSING failure scenarios
— things that COULD go wrong in this architecture but are NOT covered in
the document.

Focus on:
1. Scenarios specific to THIS architecture (not generic "server could crash")
2. Interactions between components that could produce unexpected states
3. External dependency failures not covered
4. Timing/ordering issues in the described sequences
5. Recovery procedures that have gaps

For each missing scenario:
- **Scenario:** What goes wrong
- **Why it's specific to this system:** Why generic monitoring wouldn't catch it
- **Impact:** What state the system ends up in
- **Why the document misses it:** What assumption makes this invisible

## Document:

[FULL TEXT OF failure-modes.md, 383 lines]
```

## Results

| Model | Time | Output tokens | Reasoning tokens | Scenarios found |
|-------|------|---------------|------------------|-----------------|
| GPT-4.1 Mini | 16s | 2,003 | 0 | 10 |
| GPT-4.1 | 24s | 2,575 | 0 | 15 |
| GPT-5 | 45s | 8,565 | 6,656 | 14 |

GPT-5 found the most domain-specific and actionable gaps despite finding
fewer total scenarios than GPT-4.1 (14 vs 15). Quality > quantity.
@@ -0,0 +1,53 @@
# Prompt: Hidden Assumption Identification

Used in Findings #10, #11, #12.

## Setup

- Single document (full text)
- Same prompt to all models
- No tools, no project context beyond the document
- Temperature 0.3 for non-reasoning models

## Prompt

```
You are reviewing a system design document for hidden assumptions —
things the design DEPENDS ON being true but does NOT explicitly state
or validate.

A hidden assumption is different from a design decision:
- Design decision: "We use event sourcing" (explicit choice)
- Hidden assumption: "Events will always be delivered in order"
(unstated dependency that could break)

For each hidden assumption found:
- **Assumption:** What the design implicitly depends on
- **Where it's hidden:** Which mechanism relies on it (section reference)
- **What breaks if violated:** Concrete failure mode
- **Likelihood of violation:** In production, how likely is this to be
violated? (not in theory — in the real world with network partitions,
clock skew, operator error, etc.)

Focus on assumptions that:
1. Are NOT explicitly stated in the document
2. COULD realistically be violated in production
3. Would cause SILENT incorrect behavior (not loud crashes)
4. Are specific to THIS architecture (not generic distributed systems concerns)

## Document:

[FULL TEXT OF DOCUMENT]
```

## Results (Finding #10: cold-start-and-recovery.md, 234 lines)

| Model | Time | Output tokens | Reasoning tokens | Assumptions found |
|-------|------|---------------|------------------|-------------------|
| GPT-4.1 Mini | 25s | 3,090 | 0 | 12 |
| GPT-4.1 | 77s | 2,751 | 0 | 14 |
| GPT-5 | 78s | 2,649 | 4,096 | 26 |

GPT-5 found roughly 2x more assumptions (26 vs 12 to 14), and they were
qualitatively different: multi-component interaction assumptions that
require reasoning about system-level behavior, not just local properties.