Initial publish: 29 findings, 6 prompts, methodology, open questions
Full comparative analysis of GPT-5, Claude Opus 4.6, Claude Sonnet 4.6,
GPT-4.1, and GPT-4.1 Mini on analytical tasks (not coding).

Contents:
- findings/ALL-FINDINGS.md — complete 3,249-line research log with all 29 findings, methodology notes, and open questions
- prompts/ — 6 exact prompts used across experiments
- methodology.md — experimental setup and evaluation criteria
- open-questions.md — unanswered questions for future work
- README.md — overview and summary table

Key findings:
- Cross-document consistency: Opus is 2.4x faster with more findings
- Gap-finding: GPT-5 reasoning tokens find domain-specific gaps
- Race conditions: Opus excels at temporal interaction reasoning
- Bias detection: Signal-to-noise ratio > model capability
- Adversarial analysis: GPT-5 exhaustive, Opus qualitatively different

Signed-off-by: Rodin
@@ -0,0 +1,59 @@
# Prompt: Adversarial Manipulation Analysis

Used in Finding #29.

## Setup

- Single document (full text)
- Same prompt to all models
- No tools, no project context beyond the document

## Prompt

```
You are a red-team security analyst reviewing a trading system's
aggregation component. Your task is to identify how a MISBEHAVING,
COMPROMISED, or BUGGY upstream component could exploit this design
to produce harmful trading outcomes that bypass downstream safety controls.

## Categories of adversarial manipulation:

1. **Signal injection** — How could a compromised strategy inject signals
   that exploit the aggregator's logic to produce dangerous decisions?
2. **Timing manipulation** — How could an attacker manipulate timing
   (delays, bursts, clock skew) to exploit the aggregator's temporal logic?
3. **Capacity weaponization** — How could the max_signals bound or group
   completion logic be exploited to force premature or delayed decisions?
4. **State corruption via crash** — How could deliberate crashes be used
   to put the aggregator in an exploitable state?
5. **Audit evasion** — How could an attacker cause the aggregator to make
   decisions that don't appear in the audit log, or appear differently
   than what actually happened?

## For each attack vector:

- **Category:** (one of the 5 above)
- **Attack vector:** Name of the attack
- **Mechanism:** How the attacker exploits the design
- **Exploit:** Step-by-step attack sequence
- **Why downstream controls miss it:** Why PortfolioRisk, BuyingPower,
  or other downstream checks don't catch this
- **Severity:** Critical / High / Medium
- **Mitigation:** What the design could add to prevent it

## Document:

[FULL TEXT OF aggregation.md, 193 lines]
```

## Results

| Model | Time | Findings | Unique vectors |
|-------|------|----------|----------------|
| GPT-5 | ~150s | 8 | 3 (most exhaustive) |
| Opus | ~65s | 6 | 2 (qualitatively different) |
| Sonnet | ~20s | 4 | 0 (subset of others) |

GPT-5 was most exhaustive and systematic. Opus found qualitatively different
attack vectors with system-level thinking (e.g., exploiting supervision tree
restart semantics).
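
The per-vector output format requested in the prompt above lends itself to mechanical checking when counting findings per model. The following is a minimal sketch, assuming a response follows the requested `- **Field:** value` bullet layout. The names `split_findings` and `missing_fields` and the placeholder response are hypothetical and were not part of the original experiments.

```python
import re

# Field names copied from the "For each attack vector" section of the prompt.
REQUIRED_FIELDS = [
    "Category",
    "Attack vector",
    "Mechanism",
    "Exploit",
    "Why downstream controls miss it",
    "Severity",
    "Mitigation",
]

# Matches bullets of the form "- **Field name:** value".
FIELD_RE = re.compile(r"^- \*\*(?P<name>[^*]+):\*\*\s*(?P<value>.*)$")


def split_findings(response: str) -> list[dict[str, str]]:
    """Group field bullets into findings; each Category bullet starts a new one."""
    findings: list[dict[str, str]] = []
    current = None
    last_field = None
    for raw in response.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and markdown headings
        match = FIELD_RE.match(line)
        if match:
            name = match.group("name").strip()
            if name == "Category":
                current = {}
                findings.append(current)
            if current is not None:
                current[name] = match.group("value").strip()
                last_field = name
        elif current is not None and last_field is not None:
            # Treat any other text as a continuation of the previous field.
            current[last_field] += " " + line
    return findings


def missing_fields(finding: dict[str, str]) -> list[str]:
    """Return required fields that are absent or empty in one finding."""
    return [name for name in REQUIRED_FIELDS if not finding.get(name)]


# Hypothetical placeholder response, for illustration only.
sample = """\
- **Category:** Timing manipulation
- **Attack vector:** placeholder name
- **Mechanism:** placeholder mechanism
- **Exploit:** step one,
  then step two
- **Why downstream controls miss it:** placeholder reason
- **Severity:** High
- **Mitigation:** placeholder mitigation

- **Category:** Audit evasion
- **Severity:** Medium
"""

findings = split_findings(sample)
print(len(findings))                # 2
print(missing_fields(findings[0]))  # []
print(missing_fields(findings[1]))
```

A check like this only confirms that a response is structurally complete. Deciding whether two models reported the same vector, as in the "Unique vectors" column, still needs manual review.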
@@ -0,0 +1,58 @@
# Prompt: Contradiction Detection

Used in Finding #25.

## Setup

- Single document (full text)
- Same prompt to all models
- No tools, no project context beyond the document

## Prompt

```
You are analyzing a design document for CONTRADICTIONS — places where
the document makes two claims that cannot both be true simultaneously.

This is NOT about:
- Missing information
- Unclear writing
- Design tradeoffs
- Things that MIGHT conflict

This IS about:
- Statement A says X, Statement B says NOT-X
- Mechanism A requires condition C, Mechanism B prevents condition C
- Rule A applies to set S, but S includes elements that violate Rule A

## Categories:

1. **Direct contradictions** — Two statements that are logically incompatible
2. **Mechanism conflicts** — Two described mechanisms that cannot coexist
3. **Scope violations** — A rule/invariant that is violated by a specific
   case described elsewhere in the document
4. **Temporal impossibilities** — A sequence that requires something to be
   true before the described mechanism makes it true

## For each contradiction:

- **Category:** (one of the 4 above)
- **Statement A:** (exact text, with section)
- **Statement B:** (exact text, with section)
- **Why contradictory:** (formal reasoning about incompatibility)
- **Severity:** Critical (system correctness) / High (safety) / Medium (confusion)

Be PRECISE. Only report genuine logical contradictions, not differences
in emphasis or scope.

## Document:

[FULL TEXT OF DOCUMENT]
```

## Key Design Decision

The "Be PRECISE" instruction and the explicit exclusion list ("NOT about")
are critical. Without them, models pad findings with style and clarity issues.
The contradiction prompt naturally favors Opus (self-correcting, withdraws
false positives) over GPT-5 (exhaustive, includes borderline cases).
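
Each run substitutes the document under test for the `[FULL TEXT OF DOCUMENT]` placeholder in the template above. The following is a minimal sketch of that step, assuming the template is held as a plain string. The function `build_prompt` and the abbreviated template are hypothetical. The original harness is not shown in this commit.

```python
PLACEHOLDER = "[FULL TEXT OF DOCUMENT]"


def build_prompt(template: str, document_text: str) -> str:
    """Insert the full document text where the template has its placeholder."""
    # Fail loudly if the placeholder is missing or duplicated, so a model
    # is never sent a prompt without the document it is meant to analyze.
    count = template.count(PLACEHOLDER)
    if count != 1:
        raise ValueError(f"expected exactly one placeholder, found {count}")
    return template.replace(PLACEHOLDER, document_text)


# Abbreviated stand-in for the real template, for illustration only.
template = "Be PRECISE.\n\n## Document:\n\n[FULL TEXT OF DOCUMENT]\n"
prompt = build_prompt(template, "Section 1: example text")
print(prompt)
```

Sending the identical assembled string to every model keeps the "same prompt to all models" condition from the setup easy to verify.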
@@ -0,0 +1,80 @@
# Prompt: Cross-Document Consistency Analysis

Used in Finding #28.

## Setup

- Two documents provided as full text in a single prompt (~25KB total)
- Document A: `system-overview.md` (323 lines, narrative overview)
- Document B: `architecture.md` (213 lines, DDD-focused)
- No tools, no project context beyond the two documents
- Same prompt to all 3 models independently

## Prompt

```
You are analyzing two architecture documents that describe the SAME system.
Your task is to identify places where these documents CONTRADICT each other
— not where they differ in scope or detail level, but where they make
incompatible claims about the same concept.

## Categories of inconsistency to check:

1. **Terminology conflicts** — Same concept called different names in ways
   that imply different meanings (not just abbreviation)
2. **Structural contradictions** — Documents disagree about what is inside
   vs outside a component boundary
3. **Flow/sequence conflicts** — Documents describe incompatible orderings
   or data flows for the same process
4. **Ownership/authority conflicts** — Documents disagree about which
   component owns, writes, or is authoritative for a concept
5. **Philosophical contradictions** — Documents state incompatible
   foundational assumptions (e.g., event sourcing vs CRUD)

## What to EXCLUDE:

- Omissions (one doc covers something the other doesn't)
- Detail-level differences (one is more detailed than the other)
- Naming differences that are clearly just abbreviations
- Scope differences (one covers more topics)

## Output format per finding:

For each inconsistency found:
- **Category:** (one of the 5 above)
- **Severity:** Critical / High / Medium
- **Document A says:** (exact quote or precise paraphrase with section ref)
- **Document B says:** (exact quote or precise paraphrase with section ref)
- **Why these are incompatible:** (explain why both cannot be correct)
- **Impact:** (what would go wrong if an implementer followed both)

## Document A: [system-overview.md]

[FULL TEXT OF DOCUMENT A]

## Document B: [architecture.md]

[FULL TEXT OF DOCUMENT B]
```

## Key Design Decisions

1. **Explicit exclusion of omissions** — prevents models from padding
   findings with "Doc A mentions X but Doc B doesn't"
2. **Five specific categories** — focuses attention without being
   so restrictive that models miss novel inconsistency types
3. **Required "why incompatible" explanation** — forces models to reason
   about WHY differences matter, not just list differences
4. **Impact field** — grounds findings in practical consequences
5. **Both documents in single prompt** — enables cross-referencing
   without tool calls or context fragmentation

## Results

| Model | Time | Findings | Tokens/finding |
|-------|------|----------|----------------|
| Opus | 52s | 7 | 336 |
| GPT-5 | 125s | 6 | 2,967 |
| Sonnet | 14s | 4 | 194 |

Opus is recommended for this task type: it returned the most findings (7)
in less than half of GPT-5's time.
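
The "2.4x faster" figure in the commit summary can be re-derived from the table above. The following is a small sketch. The `Run` class and helper names are hypothetical, and the implied token totals assume that "Tokens/finding" means total tokens divided by findings, which the source does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    model: str
    seconds: int
    findings: int
    tokens_per_finding: int


# Values copied from the results table.
RUNS = [
    Run("Opus", 52, 7, 336),
    Run("GPT-5", 125, 6, 2967),
    Run("Sonnet", 14, 4, 194),
]


def speedup(fast: Run, slow: Run) -> float:
    """How many times faster `fast` finished than `slow`."""
    return slow.seconds / fast.seconds


def implied_tokens(run: Run) -> int:
    """Total tokens implied by the table, under the stated assumption."""
    return run.findings * run.tokens_per_finding


by_model = {run.model: run for run in RUNS}
print(f"{speedup(by_model['Opus'], by_model['GPT-5']):.1f}x")  # 2.4x
for run in RUNS:
    print(run.model, implied_tokens(run))
# Opus 2352
# GPT-5 17802
# Sonnet 776
```

Under that assumption GPT-5 spent roughly 7.5 times as many tokens as Opus for one fewer finding, which is consistent with the recommendation above.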
@@ -0,0 +1,71 @@
# Prompt: Design Coherence Analysis

Used in Findings #15, #27.

## Setup

- Single document provided as full text
- No tools, no project context beyond the document
- Same prompt to all models independently

## Prompt

```
You are analyzing a single design document for INTERNAL incoherence —
places where the document contradicts itself. The document states
principles, invariants, or guarantees in one place, then describes
mechanisms that violate those guarantees elsewhere.

## Categories of incoherence to check:

1. **Safety properties not enforced** — Document claims a safety property
   (e.g., "fail-closed") but the described mechanism has a path that
   violates it
2. **State machine violations** — Declared states/transitions don't match
   the described behavior (missing transitions, unreachable states,
   states with no exit)
3. **Recovery contradictions** — Recovery mechanism assumes preconditions
   that the failure scenario explicitly invalidates
4. **Supervision conflicts** — Supervision strategy contradicts the
   independence/coupling claims about the supervised processes
5. **Cross-mechanism contradictions** — Two different sections describe
   incompatible behaviors for the same scenario

## What to EXCLUDE:

- Missing features (things the document doesn't cover)
- Design tradeoffs that are explicitly acknowledged
- Future work items marked as such

## Output format per finding:

- **Category:** (one of the 5 above)
- **Severity:** Critical / High / Medium
- **Section A says:** (exact quote with section reference)
- **Section B says:** (exact quote with section reference)
- **The incoherence:** (explain the contradiction)
- **Why it matters:** (what would break in implementation)

## Document:

[FULL TEXT OF DOCUMENT]
```

## Results (Finding #15: failure-modes.md, 383 lines)

| Model | Time | Findings |
|-------|------|----------|
| Sonnet 4.6 | 39s | 5 |
| Opus 4.6 | 105s | 7 (8 attempted, 1 self-withdrawn) |
| GPT-5 | 120s | 4 |

## Results (Finding #27: risk-controls.md, 992 lines)

| Model | Time | Findings |
|-------|------|----------|
| Sonnet 4.6 | 31s | 4 |
| Opus 4.6 | 86s | 5 |
| GPT-5 | 112s | 6 |

Key insight: results are document-dependent. Opus won on the shorter doc
(7 findings vs. 5 and 4), GPT-5 won on the longer, more complex one
(6 findings vs. 5 and 4).
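
The document-dependence noted above can be checked directly against the two tables. The following is a minimal sketch, assuming that "won" means the highest finding count, which the source does not define. The `RESULTS` layout and the `leader` helper are hypothetical.

```python
# Finding counts copied from the two results tables.
RESULTS = {
    "failure-modes.md": {
        "lines": 383,
        "findings": {"Sonnet 4.6": 5, "Opus 4.6": 7, "GPT-5": 4},
    },
    "risk-controls.md": {
        "lines": 992,
        "findings": {"Sonnet 4.6": 4, "Opus 4.6": 5, "GPT-5": 6},
    },
}


def leader(findings: dict[str, int]) -> str:
    """Model with the most findings (assumed proxy for 'won')."""
    return max(findings, key=findings.get)


for name, result in RESULTS.items():
    print(f"{name} ({result['lines']} lines): {leader(result['findings'])}")
# failure-modes.md (383 lines): Opus 4.6
# risk-controls.md (992 lines): GPT-5
```

Two documents are too few to separate the effect of length from the effect of content, so the ranking flip is best read as a reason to test on the target document type and not as a general rule.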
@@ -0,0 +1,47 @@
# Prompt: Gap-Finding in Architecture Documents

Used in Finding #9.

## Setup

- Single document (full text, no truncation)
- Same focused analytical question to all models
- No tools, no project context beyond the document
- Temperature 0.3 for GPT-4.1/Mini, default for GPT-5

## Prompt

```
You are a systems reliability engineer reviewing a failure modes document
for a trading platform. Your task is to identify MISSING failure scenarios
— things that COULD go wrong in this architecture but are NOT covered in
the document.

Focus on:
1. Scenarios specific to THIS architecture (not generic "server could crash")
2. Interactions between components that could produce unexpected states
3. External dependency failures not covered
4. Timing/ordering issues in the described sequences
5. Recovery procedures that have gaps

For each missing scenario:
- **Scenario:** What goes wrong
- **Why it's specific to this system:** Why generic monitoring wouldn't catch it
- **Impact:** What state the system ends up in
- **Why the document misses it:** What assumption makes this invisible

## Document:

[FULL TEXT OF failure-modes.md, 383 lines]
```

## Results

| Model | Time | Output tokens | Reasoning tokens | Scenarios found |
|-------|------|---------------|------------------|-----------------|
| GPT-4.1 Mini | 16s | 2,003 | 0 | 10 |
| GPT-4.1 | 24s | 2,575 | 0 | 15 |
| GPT-5 | 45s | 8,565 | 6,656 | 14 |

GPT-5 found the most domain-specific and actionable gaps despite reporting
one fewer scenario than GPT-4.1 (14 vs. 15). Quality > quantity.
@@ -0,0 +1,53 @@
# Prompt: Hidden Assumption Identification

Used in Findings #10, #11, #12.

## Setup

- Single document (full text)
- Same prompt to all models
- No tools, no project context beyond the document
- Temperature 0.3 for non-reasoning models

## Prompt

```
You are reviewing a system design document for hidden assumptions —
things the design DEPENDS ON being true but does NOT explicitly state
or validate.

A hidden assumption is different from a design decision:
- Design decision: "We use event sourcing" (explicit choice)
- Hidden assumption: "Events will always be delivered in order"
  (unstated dependency that could break)

For each hidden assumption found:
- **Assumption:** What the design implicitly depends on
- **Where it's hidden:** Which mechanism relies on it (section reference)
- **What breaks if violated:** Concrete failure mode
- **Likelihood of violation:** In production, how likely is this to be
  violated? (not in theory — in the real world with network partitions,
  clock skew, operator error, etc.)

Focus on assumptions that:
1. Are NOT explicitly stated in the document
2. COULD realistically be violated in production
3. Would cause SILENT incorrect behavior (not loud crashes)
4. Are specific to THIS architecture (not generic distributed systems concerns)

## Document:

[FULL TEXT OF DOCUMENT]
```

## Results (Finding #10: cold-start-and-recovery.md, 234 lines)

| Model | Time | Output tokens | Reasoning tokens | Assumptions found |
|-------|------|---------------|------------------|-------------------|
| GPT-4.1 Mini | 25s | 3,090 | 0 | 12 |
| GPT-4.1 | 77s | 2,751 | 0 | 14 |
| GPT-5 | 78s | 2,649 | 4,096 | 26 |

GPT-5 found roughly 2x more assumptions (26 vs. 14 and 12), and they were
qualitatively different: multi-component interaction assumptions that
require reasoning about system-level behavior, not just local properties.
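
The "2x" figure above is approximate, and the exact ratios follow from the table. The following is a short sketch. The dictionary and the `ratio_to` helper are hypothetical names used only for this illustration.

```python
# Assumption counts copied from the Finding #10 results table.
ASSUMPTIONS_FOUND = {"GPT-4.1 Mini": 12, "GPT-4.1": 14, "GPT-5": 26}


def ratio_to(baseline: str, model: str = "GPT-5") -> float:
    """How many times more assumptions `model` found than `baseline`."""
    return ASSUMPTIONS_FOUND[model] / ASSUMPTIONS_FOUND[baseline]


for baseline in ("GPT-4.1", "GPT-4.1 Mini"):
    print(f"GPT-5 vs {baseline}: {ratio_to(baseline):.2f}x")
# GPT-5 vs GPT-4.1: 1.86x
# GPT-5 vs GPT-4.1 Mini: 2.17x
```

Counts alone do not capture the qualitative difference described above, so these ratios are a lower bound on what the comparison is meant to convey.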