Initial publish: 29 findings, 6 prompts, methodology, open questions

Full comparative analysis of GPT-5, Claude Opus 4.6, Claude Sonnet 4.6, GPT-4.1, and GPT-4.1 Mini on analytical tasks (not coding). Contents: - findings/ALL-FINDINGS.md — complete 3,249-line research log with all 29 findings, methodology notes, and open questions - prompts/ — 6 exact prompts used across experiments - methodology.md — experimental setup and evaluation criteria - open-questions.md — unanswered questions for future work - README.md — overview and summary table Key findings: - Cross-document consistency: Opus is 2.4x faster with more findings - Gap-finding: GPT-5 reasoning tokens find domain-specific gaps - Race conditions: Opus excels at temporal interaction reasoning - Bias detection: Signal-to-noise ratio > model capability - Adversarial analysis: GPT-5 exhaustive, Opus qualitatively different Signed-off-by: Rodin
2026-05-05 19:13:03 -07:00
parent 4aea0d004b
commit 1b108ff66e
10 changed files with 3831 additions and 2 deletions
@@ -0,0 +1,59 @@
+# Prompt: Adversarial Manipulation Analysis
+
+Used in Finding #29.
+
+## Setup
+
+- Single document (full text)
+- Same prompt to all models
+- No tools, no project context beyond the document
+
+## Prompt
+
+```
+You are a red-team security analyst reviewing a trading system's
+aggregation component. Your task is to identify how a MISBEHAVING,
+COMPROMISED, or BUGGY upstream component could exploit this design
+to produce harmful trading outcomes that bypass downstream safety controls.
+
+## Categories of adversarial manipulation:
+
+1. **Signal injection** — How could a compromised strategy inject signals
+   that exploit the aggregator's logic to produce dangerous decisions?
+2. **Timing manipulation** — How could an attacker manipulate timing
+   (delays, bursts, clock skew) to exploit the aggregator's temporal logic?
+3. **Capacity weaponization** — How could the max_signals bound or group
+   completion logic be exploited to force premature or delayed decisions?
+4. **State corruption via crash** — How could deliberate crashes be used
+   to put the aggregator in an exploitable state?
+5. **Audit evasion** — How could an attacker cause the aggregator to make
+   decisions that don't appear in the audit log, or appear differently
+   than what actually happened?
+
+## For each attack vector:
+
+- **Category:** (one of the 5 above)
+- **Attack vector:** Name of the attack
+- **Mechanism:** How the attacker exploits the design
+- **Exploit:** Step-by-step attack sequence
+- **Why downstream controls miss it:** Why PortfolioRisk, BuyingPower,
+  or other downstream checks don't catch this
+- **Severity:** Critical / High / Medium
+- **Mitigation:** What the design could add to prevent it
+
+## Document:
+
+[FULL TEXT OF aggregation.md, 193 lines]
+```
+
+## Results
+
+| Model | Time | Findings | Unique vectors |
+|-------|------|----------|----------------|
+| GPT-5 | ~150s | 8 | 3 (most exhaustive) |
+| Opus | ~65s | 6 | 2 (qualitatively different) |
+| Sonnet | ~20s | 4 | 0 (subset of others) |
+
+GPT-5 was most exhaustive and systematic. Opus found qualitatively different
+attack vectors with system-level thinking (e.g., exploiting supervision tree
+restart semantics).
@@ -0,0 +1,58 @@
+# Prompt: Contradiction Detection
+
+Used in Finding #25.
+
+## Setup
+
+- Single document (full text)
+- Same prompt to all models
+- No tools, no project context beyond the document
+
+## Prompt
+
+```
+You are analyzing a design document for CONTRADICTIONS — places where
+the document makes two claims that cannot both be true simultaneously.
+
+This is NOT about:
+- Missing information
+- Unclear writing
+- Design tradeoffs
+- Things that MIGHT conflict
+
+This IS about:
+- Statement A says X, Statement B says NOT-X
+- Mechanism A requires condition C, Mechanism B prevents condition C
+- Rule A applies to set S, but S includes elements that violate Rule A
+
+## Categories:
+
+1. **Direct contradictions** — Two statements that are logically incompatible
+2. **Mechanism conflicts** — Two described mechanisms that cannot coexist
+3. **Scope violations** — A rule/invariant that is violated by a specific
+   case described elsewhere in the document
+4. **Temporal impossibilities** — A sequence that requires something to be
+   true before the described mechanism makes it true
+
+## For each contradiction:
+
+- **Category:** (one of the 4 above)
+- **Statement A:** (exact text, with section)
+- **Statement B:** (exact text, with section)
+- **Why contradictory:** (formal reasoning about incompatibility)
+- **Severity:** Critical (system correctness) / High (safety) / Medium (confusion)
+
+Be PRECISE. Only report genuine logical contradictions, not differences
+in emphasis or scope.
+
+## Document:
+
+[FULL TEXT OF DOCUMENT]
+```
+
+## Key Design Decision
+
+The "Be PRECISE" instruction and explicit exclusion list ("NOT about")
+is critical. Without it, models pad findings with style/clarity issues.
+The contradiction prompt naturally favors Opus (self-correcting, withdraws
+false positives) over GPT-5 (exhaustive, includes borderline cases).
@@ -0,0 +1,80 @@
+# Prompt: Cross-Document Consistency Analysis
+
+Used in Finding #28.
+
+## Setup
+
+- Two documents provided as full text in a single prompt (~25KB total)
+- Document A: `system-overview.md` (323 lines, narrative overview)
+- Document B: `architecture.md` (213 lines, DDD-focused)
+- No tools, no project context beyond the two documents
+- Same prompt to all 3 models independently
+
+## Prompt
+
+```
+You are analyzing two architecture documents that describe the SAME system.
+Your task is to identify places where these documents CONTRADICT each other
+— not where they differ in scope or detail level, but where they make
+incompatible claims about the same concept.
+
+## Categories of inconsistency to check:
+
+1. **Terminology conflicts** — Same concept called different names in ways
+   that imply different meanings (not just abbreviation)
+2. **Structural contradictions** — Documents disagree about what is inside
+   vs outside a component boundary
+3. **Flow/sequence conflicts** — Documents describe incompatible orderings
+   or data flows for the same process
+4. **Ownership/authority conflicts** — Documents disagree about which
+   component owns, writes, or is authoritative for a concept
+5. **Philosophical contradictions** — Documents state incompatible
+   foundational assumptions (e.g., event sourcing vs CRUD)
+
+## What to EXCLUDE:
+
+- Omissions (one doc covers something the other doesn't)
+- Detail-level differences (one is more detailed than the other)
+- Naming differences that are clearly just abbreviations
+- Scope differences (one covers more topics)
+
+## Output format per finding:
+
+For each inconsistency found:
+- **Category:** (one of the 5 above)
+- **Severity:** Critical / High / Medium
+- **Document A says:** (exact quote or precise paraphrase with section ref)
+- **Document B says:** (exact quote or precise paraphrase with section ref)
+- **Why these are incompatible:** (explain why both cannot be correct)
+- **Impact:** (what would go wrong if an implementer followed both)
+
+## Document A: [system-overview.md]
+
+[FULL TEXT OF DOCUMENT A]
+
+## Document B: [architecture.md]
+
+[FULL TEXT OF DOCUMENT B]
+```
+
+## Key Design Decisions
+
+1. **Explicit exclusion of omissions** — prevents models from padding
+   findings with "Doc A mentions X but Doc B doesn't"
+2. **Five specific categories** — focuses attention without being
+   so restrictive that models miss novel inconsistency types
+3. **Required "why incompatible" explanation** — forces models to reason
+   about WHY differences matter, not just list differences
+4. **Impact field** — grounds findings in practical consequences
+5. **Both documents in single prompt** — enables cross-referencing
+   without tool calls or context fragmentation
+
+## Results
+
+| Model | Time | Findings | Tokens/finding |
+|-------|------|----------|----------------|
+| Opus | 52s | 7 | 336 |
+| GPT-5 | 125s | 6 | 2,967 |
+| Sonnet | 14s | 4 | 194 |
+
+Opus recommended for this task type.
@@ -0,0 +1,71 @@
+# Prompt: Design Coherence Analysis
+
+Used in Findings #15, #27.
+
+## Setup
+
+- Single document provided as full text
+- No tools, no project context beyond the document
+- Same prompt to all models independently
+
+## Prompt
+
+```
+You are analyzing a single design document for INTERNAL incoherence —
+places where the document contradicts itself. The document states
+principles, invariants, or guarantees in one place, then describes
+mechanisms that violate those guarantees elsewhere.
+
+## Categories of incoherence to check:
+
+1. **Safety properties not enforced** — Document claims a safety property
+   (e.g., "fail-closed") but the described mechanism has a path that
+   violates it
+2. **State machine violations** — Declared states/transitions don't match
+   the described behavior (missing transitions, unreachable states,
+   states with no exit)
+3. **Recovery contradictions** — Recovery mechanism assumes preconditions
+   that the failure scenario explicitly invalidates
+4. **Supervision conflicts** — Supervision strategy contradicts the
+   independence/coupling claims about the supervised processes
+5. **Cross-mechanism contradictions** — Two different sections describe
+   incompatible behaviors for the same scenario
+
+## What to EXCLUDE:
+
+- Missing features (things the document doesn't cover)
+- Design tradeoffs that are explicitly acknowledged
+- Future work items marked as such
+
+## Output format per finding:
+
+- **Category:** (one of the 5 above)
+- **Severity:** Critical / High / Medium
+- **Section A says:** (exact quote with section reference)
+- **Section B says:** (exact quote with section reference)
+- **The incoherence:** (explain the contradiction)
+- **Why it matters:** (what would break in implementation)
+
+## Document:
+
+[FULL TEXT OF DOCUMENT]
+```
+
+## Results (Finding #15: failure-modes.md, 383 lines)
+
+| Model | Time | Findings |
+|-------|------|----------|
+| Sonnet 4.6 | 39s | 5 |
+| Opus 4.6 | 105s | 7 (8 attempted, 1 self-withdrawn) |
+| GPT-5 | 120s | 4 |
+
+## Results (Finding #27: risk-controls.md, 992 lines)
+
+| Model | Time | Findings |
+|-------|------|----------|
+| Sonnet 4.6 | 31s | 4 |
+| Opus 4.6 | 86s | 5 |
+| GPT-5 | 112s | 6 |
+
+Key insight: results are document-dependent. Opus won on the shorter doc,
+GPT-5 won on the longer, more complex one.
@@ -0,0 +1,47 @@
+# Prompt: Gap-Finding in Architecture Documents
+
+Used in Finding #9.
+
+## Setup
+
+- Single document (full text, no truncation)
+- Same focused analytical question to all models
+- No tools, no project context beyond the document
+- Temperature 0.3 for GPT-4.1/Mini, default for GPT-5
+
+## Prompt
+
+```
+You are a systems reliability engineer reviewing a failure modes document
+for a trading platform. Your task is to identify MISSING failure scenarios
+— things that COULD go wrong in this architecture but are NOT covered in
+the document.
+
+Focus on:
+1. Scenarios specific to THIS architecture (not generic "server could crash")
+2. Interactions between components that could produce unexpected states
+3. External dependency failures not covered
+4. Timing/ordering issues in the described sequences
+5. Recovery procedures that have gaps
+
+For each missing scenario:
+- **Scenario:** What goes wrong
+- **Why it's specific to this system:** Why generic monitoring wouldn't catch it
+- **Impact:** What state the system ends up in
+- **Why the document misses it:** What assumption makes this invisible
+
+## Document:
+
+[FULL TEXT OF failure-modes.md, 383 lines]
+```
+
+## Results
+
+| Model | Time | Output tokens | Reasoning tokens | Scenarios found |
+|-------|------|---------------|------------------|-----------------|
+| GPT-4.1 Mini | 16s | 2,003 | 0 | 10 |
+| GPT-4.1 | 24s | 2,575 | 0 | 15 |
+| GPT-5 | 45s | 8,565 | 6,656 | 14 |
+
+GPT-5 found the most domain-specific and actionable gaps despite finding
+fewer total scenarios than GPT-4.1. Quality > quantity.
@@ -0,0 +1,53 @@
+# Prompt: Hidden Assumption Identification
+
+Used in Findings #10, #11, #12.
+
+## Setup
+
+- Single document (full text)
+- Same prompt to all models
+- No tools, no project context beyond the document
+- Temperature 0.3 for non-reasoning models
+
+## Prompt
+
+```
+You are reviewing a system design document for hidden assumptions —
+things the design DEPENDS ON being true but does NOT explicitly state
+or validate.
+
+A hidden assumption is different from a design decision:
+- Design decision: "We use event sourcing" (explicit choice)
+- Hidden assumption: "Events will always be delivered in order"
+  (unstated dependency that could break)
+
+For each hidden assumption found:
+- **Assumption:** What the design implicitly depends on
+- **Where it's hidden:** Which mechanism relies on it (section reference)
+- **What breaks if violated:** Concrete failure mode
+- **Likelihood of violation:** In production, how likely is this to be
+  violated? (not in theory — in the real world with network partitions,
+  clock skew, operator error, etc.)
+
+Focus on assumptions that:
+1. Are NOT explicitly stated in the document
+2. COULD realistically be violated in production
+3. Would cause SILENT incorrect behavior (not loud crashes)
+4. Are specific to THIS architecture (not generic distributed systems concerns)
+
+## Document:
+
+[FULL TEXT OF DOCUMENT]
+```
+
+## Results (Finding #10: cold-start-and-recovery.md, 234 lines)
+
+| Model | Time | Output tokens | Reasoning tokens | Assumptions found |
+|-------|------|---------------|------------------|-------------------|
+| GPT-4.1 Mini | 25s | 3,090 | 0 | 12 |
+| GPT-4.1 | 77s | 2,751 | 0 | 14 |
+| GPT-5 | 78s | 2,649 | 4,096 | 26 |
+
+GPT-5 found 2x more assumptions AND they were qualitatively different —
+multi-component interaction assumptions that require reasoning about
+system-level behavior, not just local properties.