Initial publish: 29 findings, 6 prompts, methodology, open questions

Full comparative analysis of GPT-5, Claude Opus 4.6, Claude Sonnet 4.6,
GPT-4.1, and GPT-4.1 Mini on analytical tasks (not coding).

Contents:
- findings/ALL-FINDINGS.md — complete 3,249-line research log with all
  29 findings, methodology notes, and open questions
- prompts/ — 6 exact prompts used across experiments
- methodology.md — experimental setup and evaluation criteria
- open-questions.md — unanswered questions for future work
- README.md — overview and summary table

Key findings:
- Cross-document consistency: Opus is 2.4x faster than GPT-5 and reports
  more findings (7 vs. 6)
- Gap-finding: GPT-5, using reasoning tokens, finds the most domain-specific gaps
- Race conditions: Opus excels at temporal interaction reasoning
- Bias detection: signal-to-noise ratio matters more than model capability
- Adversarial analysis: GPT-5 is exhaustive; Opus finds qualitatively different vectors

Signed-off-by: Rodin
Rodin
2026-05-05 19:13:03 -07:00
parent 4aea0d004b
commit 1b108ff66e
10 changed files with 3831 additions and 2 deletions
@@ -0,0 +1,59 @@
# Prompt: Adversarial Manipulation Analysis
Used in Finding #29.
## Setup
- Single document (full text)
- Same prompt to all models
- No tools, no project context beyond the document
## Prompt
```
You are a red-team security analyst reviewing a trading system's
aggregation component. Your task is to identify how a MISBEHAVING,
COMPROMISED, or BUGGY upstream component could exploit this design
to produce harmful trading outcomes that bypass downstream safety controls.
## Categories of adversarial manipulation:
1. **Signal injection** — How could a compromised strategy inject signals
that exploit the aggregator's logic to produce dangerous decisions?
2. **Timing manipulation** — How could an attacker manipulate timing
(delays, bursts, clock skew) to exploit the aggregator's temporal logic?
3. **Capacity weaponization** — How could the max_signals bound or group
completion logic be exploited to force premature or delayed decisions?
4. **State corruption via crash** — How could deliberate crashes be used
to put the aggregator in an exploitable state?
5. **Audit evasion** — How could an attacker cause the aggregator to make
decisions that don't appear in the audit log, or appear differently
than what actually happened?
## For each attack vector:
- **Category:** (one of the 5 above)
- **Attack vector:** Name of the attack
- **Mechanism:** How the attacker exploits the design
- **Exploit:** Step-by-step attack sequence
- **Why downstream controls miss it:** Why PortfolioRisk, BuyingPower,
or other downstream checks don't catch this
- **Severity:** Critical / High / Medium
- **Mitigation:** What the design could add to prevent it
## Document:
[FULL TEXT OF aggregation.md, 193 lines]
```
## Results
| Model | Time | Findings | Unique vectors |
|-------|------|----------|----------------|
| GPT-5 | ~150s | 8 | 3 (most exhaustive) |
| Opus | ~65s | 6 | 2 (qualitatively different) |
| Sonnet | ~20s | 4 | 0 (subset of others) |

GPT-5 was the most exhaustive and systematic. Opus found qualitatively
different attack vectors through system-level thinking (e.g., exploiting
supervision tree restart semantics).
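
Tallying findings per category is mechanical when a response follows the per-vector format requested in the prompt. The following is a minimal sketch, assuming each attack vector starts with a `- **Category:**` bullet; `parse_findings` and the sample response are hypothetical illustrations, not part of the original experiment.

```python
import re
from collections import Counter

# Matches bullets such as "- **Severity:** High"; captures the field name and value.
FIELD_RE = re.compile(r"^\s*-\s*\*\*(?P<name>[^:*]+):\*\*\s*(?P<value>.*)$")


def parse_findings(response: str) -> list[dict[str, str]]:
    """Split a response into findings; a new finding starts at each Category field."""
    findings: list[dict[str, str]] = []
    last_key = None
    for line in response.splitlines():
        match = FIELD_RE.match(line)
        if match:
            key = match.group("name").strip().lower()
            if key == "category":
                findings.append({})
            if findings:
                findings[-1][key] = match.group("value").strip()
                last_key = key
        elif findings and last_key and line.strip() and not line.lstrip().startswith("#"):
            # Wrapped continuation of the previous field (headings are skipped).
            findings[-1][last_key] += " " + line.strip()
    return findings


# Hypothetical response fragment, shortened to three fields per vector.
SAMPLE = """\
- **Category:** Signal injection
- **Attack vector:** Example vector A
- **Severity:** High
- **Category:** Timing manipulation
- **Attack vector:** Example vector B
- **Severity:** Critical
"""

findings = parse_findings(SAMPLE)
by_category = Counter(f["category"] for f in findings)
print(len(findings), dict(by_category))
```

Deciding which vectors are unique across models still requires manual comparison; this only counts what each response reports.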
@@ -0,0 +1,58 @@
# Prompt: Contradiction Detection
Used in Finding #25.
## Setup
- Single document (full text)
- Same prompt to all models
- No tools, no project context beyond the document
## Prompt
```
You are analyzing a design document for CONTRADICTIONS — places where
the document makes two claims that cannot both be true simultaneously.
This is NOT about:
- Missing information
- Unclear writing
- Design tradeoffs
- Things that MIGHT conflict
This IS about:
- Statement A says X, Statement B says NOT-X
- Mechanism A requires condition C, Mechanism B prevents condition C
- Rule A applies to set S, but S includes elements that violate Rule A
## Categories:
1. **Direct contradictions** — Two statements that are logically incompatible
2. **Mechanism conflicts** — Two described mechanisms that cannot coexist
3. **Scope violations** — A rule/invariant that is violated by a specific
case described elsewhere in the document
4. **Temporal impossibilities** — A sequence that requires something to be
true before the described mechanism makes it true
## For each contradiction:
- **Category:** (one of the 4 above)
- **Statement A:** (exact text, with section)
- **Statement B:** (exact text, with section)
- **Why contradictory:** (formal reasoning about incompatibility)
- **Severity:** Critical (system correctness) / High (safety) / Medium (confusion)
Be PRECISE. Only report genuine logical contradictions, not differences
in emphasis or scope.
## Document:
[FULL TEXT OF DOCUMENT]
```
## Key Design Decision
The "Be PRECISE" instruction and the explicit exclusion list ("NOT about")
are critical. Without them, models pad findings with style and clarity issues.
The contradiction prompt naturally favors Opus (self-correcting, withdraws
false positives) over GPT-5 (exhaustive, includes borderline cases).
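
Because the prompt asks for exact text in Statement A and Statement B, one cheap screen for false positives is to check that each quoted statement really occurs in the document. This is a sketch of that idea only, not a step the original experiment is known to have used; `normalize`, `quote_appears`, and the sample document are hypothetical.

```python
import re


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so line wrapping does not defeat matching."""
    return re.sub(r"\s+", " ", text).strip().lower()


def quote_appears(quote: str, document: str) -> bool:
    """True if the quoted statement occurs in the document, ignoring case and whitespace."""
    return normalize(quote) in normalize(document)


# Hypothetical document fragment with a sentence wrapped across two lines.
document = """## 3. Ordering
Events are processed strictly
in arrival order.
"""

print(quote_appears("Events are processed strictly in arrival order.", document))  # True
print(quote_appears("Events may be reordered.", document))  # False
```

A quote that fails this check is not necessarily a wrong finding (the model may have paraphrased), but it is a reasonable candidate for manual review.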
@@ -0,0 +1,80 @@
# Prompt: Cross-Document Consistency Analysis
Used in Finding #28.
## Setup
- Two documents provided as full text in a single prompt (~25KB total)
- Document A: `system-overview.md` (323 lines, narrative overview)
- Document B: `architecture.md` (213 lines, DDD-focused)
- No tools, no project context beyond the two documents
- Same prompt to all 3 models independently
## Prompt
```
You are analyzing two architecture documents that describe the SAME system.
Your task is to identify places where these documents CONTRADICT each other
— not where they differ in scope or detail level, but where they make
incompatible claims about the same concept.
## Categories of inconsistency to check:
1. **Terminology conflicts** — Same concept called different names in ways
that imply different meanings (not just abbreviation)
2. **Structural contradictions** — Documents disagree about what is inside
vs outside a component boundary
3. **Flow/sequence conflicts** — Documents describe incompatible orderings
or data flows for the same process
4. **Ownership/authority conflicts** — Documents disagree about which
component owns, writes, or is authoritative for a concept
5. **Philosophical contradictions** — Documents state incompatible
foundational assumptions (e.g., event sourcing vs CRUD)
## What to EXCLUDE:
- Omissions (one doc covers something the other doesn't)
- Detail-level differences (one is more detailed than the other)
- Naming differences that are clearly just abbreviations
- Scope differences (one covers more topics)
## Output format per finding:
For each inconsistency found:
- **Category:** (one of the 5 above)
- **Severity:** Critical / High / Medium
- **Document A says:** (exact quote or precise paraphrase with section ref)
- **Document B says:** (exact quote or precise paraphrase with section ref)
- **Why these are incompatible:** (explain why both cannot be correct)
- **Impact:** (what would go wrong if an implementer followed both)
## Document A: [system-overview.md]
[FULL TEXT OF DOCUMENT A]
## Document B: [architecture.md]
[FULL TEXT OF DOCUMENT B]
```
## Key Design Decisions
1. **Explicit exclusion of omissions** — prevents models from padding
findings with "Doc A mentions X but Doc B doesn't"
2. **Five specific categories** — focuses attention without being
so restrictive that models miss novel inconsistency types
3. **Required "why incompatible" explanation** — forces models to reason
about WHY differences matter, not just list differences
4. **Impact field** — grounds findings in practical consequences
5. **Both documents in single prompt** — enables cross-referencing
without tool calls or context fragmentation
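
Design decision 5 (both documents in a single prompt) can be sketched as a small assembly step. This is an illustration under assumptions: `build_cross_document_prompt` is a hypothetical helper, the instruction text is abbreviated, and the document bodies are placeholders rather than the real files.

```python
PROMPT_TEMPLATE = """{instructions}

## Document A: [{name_a}]
{text_a}

## Document B: [{name_b}]
{text_b}
"""


def build_cross_document_prompt(
    instructions: str, name_a: str, text_a: str, name_b: str, text_b: str
) -> str:
    """Place both documents, in full, after the shared instructions."""
    return PROMPT_TEMPLATE.format(
        instructions=instructions.strip(),
        name_a=name_a,
        text_a=text_a.strip(),
        name_b=name_b,
        text_b=text_b.strip(),
    )


# Placeholder inputs; the real run used the full text of both files.
prompt = build_cross_document_prompt(
    instructions="You are analyzing two architecture documents that describe the SAME system.",
    name_a="system-overview.md",
    text_a="(full text of document A)",
    name_b="architecture.md",
    text_b="(full text of document B)",
)

# Report the prompt size in bytes, to compare against the ~25KB noted in Setup.
print(len(prompt.encode("utf-8")), "bytes")
```

Keeping the labels `Document A` and `Document B` identical to the ones used in the output format lets the model cite each source unambiguously.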
## Results
| Model | Time | Findings | Tokens/finding |
|-------|------|----------|----------------|
| Opus | 52s | 7 | 336 |
| GPT-5 | 125s | 6 | 2,967 |
| Sonnet | 14s | 4 | 194 |

Opus is recommended for this task type: it reported the most findings (7)
in less than half of GPT-5's time.
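
The figures in the table can be cross-checked with simple arithmetic. The sketch below uses only the numbers reported above; it assumes "Tokens/finding" means total tokens divided by the number of findings, which the table does not state explicitly.

```python
# Values copied from the results table above.
results = {
    "Opus": {"seconds": 52, "findings": 7, "tokens_per_finding": 336},
    "GPT-5": {"seconds": 125, "findings": 6, "tokens_per_finding": 2967},
    "Sonnet": {"seconds": 14, "findings": 4, "tokens_per_finding": 194},
}

# Wall-clock ratio between GPT-5 and Opus.
speedup = results["GPT-5"]["seconds"] / results["Opus"]["seconds"]
print(f"GPT-5 time / Opus time: {speedup:.1f}x")  # 2.4x

# Implied total tokens, under the assumption stated in the lead-in.
implied = {
    model: row["findings"] * row["tokens_per_finding"]
    for model, row in results.items()
}
for model, row in results.items():
    seconds_per_finding = row["seconds"] / row["findings"]
    print(f"{model}: ~{implied[model]} tokens, {seconds_per_finding:.1f}s per finding")
```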
@@ -0,0 +1,71 @@
# Prompt: Design Coherence Analysis
Used in Findings #15, #27.
## Setup
- Single document provided as full text
- No tools, no project context beyond the document
- Same prompt to all models independently
## Prompt
```
You are analyzing a single design document for INTERNAL incoherence —
places where the document contradicts itself. The document states
principles, invariants, or guarantees in one place, then describes
mechanisms that violate those guarantees elsewhere.
## Categories of incoherence to check:
1. **Safety properties not enforced** — Document claims a safety property
(e.g., "fail-closed") but the described mechanism has a path that
violates it
2. **State machine violations** — Declared states/transitions don't match
the described behavior (missing transitions, unreachable states,
states with no exit)
3. **Recovery contradictions** — Recovery mechanism assumes preconditions
that the failure scenario explicitly invalidates
4. **Supervision conflicts** — Supervision strategy contradicts the
independence/coupling claims about the supervised processes
5. **Cross-mechanism contradictions** — Two different sections describe
incompatible behaviors for the same scenario
## What to EXCLUDE:
- Missing features (things the document doesn't cover)
- Design tradeoffs that are explicitly acknowledged
- Future work items marked as such
## Output format per finding:
- **Category:** (one of the 5 above)
- **Severity:** Critical / High / Medium
- **Section A says:** (exact quote with section reference)
- **Section B says:** (exact quote with section reference)
- **The incoherence:** (explain the contradiction)
- **Why it matters:** (what would break in implementation)
## Document:
[FULL TEXT OF DOCUMENT]
```
## Results (Finding #15: failure-modes.md, 383 lines)
| Model | Time | Findings |
|-------|------|----------|
| Sonnet 4.6 | 39s | 5 |
| Opus 4.6 | 105s | 7 (8 attempted, 1 self-withdrawn) |
| GPT-5 | 120s | 4 |
## Results (Finding #27: risk-controls.md, 992 lines)
| Model | Time | Findings |
|-------|------|----------|
| Sonnet 4.6 | 31s | 4 |
| Opus 4.6 | 86s | 5 |
| GPT-5 | 112s | 6 |

Key insight: results are document-dependent. Opus reported the most findings
on the shorter document (7 vs. 4 for GPT-5), while GPT-5 reported the most on
the longer, more complex one (6 vs. 5 for Opus).
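
The document dependence can be read directly off the two tables. A minimal sketch using only the reported times and finding counts follows; it ranks by finding count alone, which is an assumption about what "leading" means and says nothing about finding quality.

```python
# (seconds, findings) per model, copied from the two results tables above.
runs = {
    "failure-modes.md (383 lines)": {
        "Sonnet 4.6": (39, 5),
        "Opus 4.6": (105, 7),
        "GPT-5": (120, 4),
    },
    "risk-controls.md (992 lines)": {
        "Sonnet 4.6": (31, 4),
        "Opus 4.6": (86, 5),
        "GPT-5": (112, 6),
    },
}

leaders = {}
for doc, by_model in runs.items():
    # The leader is the model with the highest finding count on this document.
    leaders[doc] = max(by_model, key=lambda model: by_model[model][1])
    print(f"{doc}: most findings from {leaders[doc]}")
    for model, (seconds, found) in by_model.items():
        print(f"  {model}: {found} findings, {found / seconds * 60:.1f} per minute")
```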
@@ -0,0 +1,47 @@
# Prompt: Gap-Finding in Architecture Documents
Used in Finding #9.
## Setup
- Single document (full text, no truncation)
- Same focused analytical question to all models
- No tools, no project context beyond the document
- Temperature 0.3 for GPT-4.1 and GPT-4.1 Mini; default for GPT-5
## Prompt
```
You are a systems reliability engineer reviewing a failure modes document
for a trading platform. Your task is to identify MISSING failure scenarios
— things that COULD go wrong in this architecture but are NOT covered in
the document.
Focus on:
1. Scenarios specific to THIS architecture (not generic "server could crash")
2. Interactions between components that could produce unexpected states
3. External dependency failures not covered
4. Timing/ordering issues in the described sequences
5. Recovery procedures that have gaps
For each missing scenario:
- **Scenario:** What goes wrong
- **Why it's specific to this system:** Why generic monitoring wouldn't catch it
- **Impact:** What state the system ends up in
- **Why the document misses it:** What assumption makes this invisible
## Document:
[FULL TEXT OF failure-modes.md, 383 lines]
```
## Results
| Model | Time | Output tokens | Reasoning tokens | Scenarios found |
|-------|------|---------------|------------------|-----------------|
| GPT-4.1 Mini | 16s | 2,003 | 0 | 10 |
| GPT-4.1 | 24s | 2,575 | 0 | 15 |
| GPT-5 | 45s | 8,565 | 6,656 | 14 |

GPT-5 found the most domain-specific and actionable gaps even though it
reported fewer scenarios than GPT-4.1 (14 vs. 15): quality mattered more
than quantity.
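
The per-model temperature rule from the Setup section can be captured as a small run configuration. This is a sketch only: the model identifier strings, `RunConfig`, and `request_params` are hypothetical, and no real client library or API is called.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RunConfig:
    model: str                    # hypothetical identifier string
    temperature: Optional[float]  # None means "leave at the provider default"


# Temperature 0.3 for the non-reasoning models, default for GPT-5 (per Setup).
RUNS = [
    RunConfig("gpt-4.1-mini", 0.3),
    RunConfig("gpt-4.1", 0.3),
    RunConfig("gpt-5", None),
]


def request_params(config: RunConfig, prompt: str) -> dict:
    """Build request parameters; temperature is omitted when the default applies."""
    params: dict = {"model": config.model, "prompt": prompt}
    if config.temperature is not None:
        params["temperature"] = config.temperature
    return params


for config in RUNS:
    print(request_params(config, "<same prompt for every model>"))
```

Omitting the key, rather than passing a placeholder value, keeps the "default" case truly untouched and makes the difference between runs visible in the logged parameters.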
@@ -0,0 +1,53 @@
# Prompt: Hidden Assumption Identification
Used in Findings #10, #11, #12.
## Setup
- Single document (full text)
- Same prompt to all models
- No tools, no project context beyond the document
- Temperature 0.3 for non-reasoning models
## Prompt
```
You are reviewing a system design document for hidden assumptions —
things the design DEPENDS ON being true but does NOT explicitly state
or validate.
A hidden assumption is different from a design decision:
- Design decision: "We use event sourcing" (explicit choice)
- Hidden assumption: "Events will always be delivered in order"
(unstated dependency that could break)
For each hidden assumption found:
- **Assumption:** What the design implicitly depends on
- **Where it's hidden:** Which mechanism relies on it (section reference)
- **What breaks if violated:** Concrete failure mode
- **Likelihood of violation:** In production, how likely is this to be
violated? (not in theory — in the real world with network partitions,
clock skew, operator error, etc.)
Focus on assumptions that:
1. Are NOT explicitly stated in the document
2. COULD realistically be violated in production
3. Would cause SILENT incorrect behavior (not loud crashes)
4. Are specific to THIS architecture (not generic distributed systems concerns)
## Document:
[FULL TEXT OF DOCUMENT]
```
## Results (Finding #10: cold-start-and-recovery.md, 234 lines)
| Model | Time | Output tokens | Reasoning tokens | Assumptions found |
|-------|------|---------------|------------------|-------------------|
| GPT-4.1 Mini | 25s | 3,090 | 0 | 12 |
| GPT-4.1 | 77s | 2,751 | 0 | 14 |
| GPT-5 | 78s | 2,649 | 4,096 | 26 |

GPT-5 found roughly twice as many assumptions as the other models (26 vs.
12 and 14), and they were qualitatively different: multi-component
interaction assumptions that require reasoning about system-level behavior,
not just local properties.