model-research/prompts/adversarial-manipulation.md
Rodin 1b108ff66e Initial publish: 29 findings, 6 prompts, methodology, open questions
Full comparative analysis of GPT-5, Claude Opus 4.6, Claude Sonnet 4.6,
GPT-4.1, and GPT-4.1 Mini on analytical tasks (not coding).

Contents:
- findings/ALL-FINDINGS.md — complete 3,249-line research log with all
  29 findings, methodology notes, and open questions
- prompts/ — 6 exact prompts used across experiments
- methodology.md — experimental setup and evaluation criteria
- open-questions.md — unanswered questions for future work
- README.md — overview and summary table

Key findings:
- Cross-document consistency: Opus is 2.4x faster with more findings
- Gap-finding: GPT-5 reasoning tokens find domain-specific gaps
- Race conditions: Opus excels at temporal interaction reasoning
- Bias detection: Signal-to-noise ratio > model capability
- Adversarial analysis: GPT-5 exhaustive, Opus qualitatively different

Signed-off-by: Rodin
2026-05-05 19:13:03 -07:00


Prompt: Adversarial Manipulation Analysis

Used in Finding #29.

Setup

  • Single document (full text)
  • Same prompt to all models
  • No tools, no project context beyond the document
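The setup above can be sketched as a small harness. This is a minimal illustration, not the harness used in the experiments: `build_prompt`, `run_all`, and the `call_model` callback are hypothetical names, and `call_model` stands in for whatever client actually sends a prompt to a given model.

```python
from typing import Callable, Dict, List

# Hypothetical template: instructions first, then the full document text.
PROMPT_TEMPLATE = "{instructions}\n\n## Document:\n\n{document}"


def build_prompt(instructions: str, document: str) -> str:
    """Embed the full document text under the '## Document:' heading."""
    return PROMPT_TEMPLATE.format(
        instructions=instructions.strip(),
        document=document.strip(),
    )


def run_all(
    models: List[str],
    prompt: str,
    call_model: Callable[[str, str], str],
) -> Dict[str, str]:
    """Send the identical prompt to every model.

    call_model(model_name, prompt) is an assumed callback; it receives
    nothing except the prompt, mirroring the no-tools, no-extra-context setup.
    """
    return {model: call_model(model, prompt) for model in models}
```

Building the prompt once and reusing the same string for every model keeps the comparison from being skewed by per-model wording differences.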

Prompt

You are a red-team security analyst reviewing a trading system's
aggregation component. Your task is to identify how a MISBEHAVING,
COMPROMISED, or BUGGY upstream component could exploit this design
to produce harmful trading outcomes that bypass downstream safety controls.

## Categories of adversarial manipulation:

1. **Signal injection** — How could a compromised strategy inject signals
   that exploit the aggregator's logic to produce dangerous decisions?
2. **Timing manipulation** — How could an attacker manipulate timing
   (delays, bursts, clock skew) to exploit the aggregator's temporal logic?
3. **Capacity weaponization** — How could the max_signals bound or group
   completion logic be exploited to force premature or delayed decisions?
4. **State corruption via crash** — How could deliberate crashes be used
   to put the aggregator in an exploitable state?
5. **Audit evasion** — How could an attacker cause the aggregator to make
   decisions that don't appear in the audit log, or appear differently
   than what actually happened?

## For each attack vector:

- **Category:** (one of the 5 above)
- **Attack vector:** Name of the attack
- **Mechanism:** How the attacker exploits the design
- **Exploit:** Step-by-step attack sequence
- **Why downstream controls miss it:** Why PortfolioRisk, BuyingPower,
  or other downstream checks don't catch this
- **Severity:** Critical / High / Medium
- **Mitigation:** What the design could add to prevent it

## Document:

[FULL TEXT OF aggregation.md, 193 lines]
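The per-vector fields requested in the prompt map naturally onto a record type, which makes model responses easier to compare once they have been transcribed. The sketch below is one possible representation; the `AttackVector` class and its field names are assumptions for illustration and are not part of the original experiment.

```python
from dataclasses import dataclass
from typing import List

# The five categories and three severity levels named in the prompt.
CATEGORIES = (
    "Signal injection",
    "Timing manipulation",
    "Capacity weaponization",
    "State corruption via crash",
    "Audit evasion",
)
SEVERITIES = ("Critical", "High", "Medium")


@dataclass
class AttackVector:
    """One reported attack vector, using the fields the prompt asks for."""

    category: str
    attack_vector: str          # name of the attack
    mechanism: str              # how the attacker exploits the design
    exploit: List[str]          # step-by-step attack sequence
    why_downstream_misses: str  # why downstream checks do not catch it
    severity: str
    mitigation: str

    def __post_init__(self) -> None:
        # Reject values outside the sets the prompt allows.
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category!r}")
        if self.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity!r}")
```

Validating the category and severity at construction time catches responses that drift from the requested format before they enter any tally.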

Results

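| Model  | Time  | Findings | Unique vectors              |
|--------|-------|----------|-----------------------------|
| GPT-5  | ~150s | 8        | 3 (most exhaustive)         |
| Opus   | ~65s  | 6        | 2 (qualitatively different) |
| Sonnet | ~20s  | 4        | 0 (subset of others)        |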

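GPT-5 was the most exhaustive and systematic, at the cost of the longest run time. Opus found qualitatively different attack vectors that reflect system-level thinking (e.g., exploiting supervision tree restart semantics). Sonnet was the fastest, but its findings were a subset of what the other two models reported.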
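A "unique vector" here is read as an attack vector that no other model reported. Assuming each model's findings have already been normalized by hand into comparable labels (the hard part, and not automated here), the count reduces to a set difference. The function name and the labels in the usage note are hypothetical.

```python
from typing import Dict, Set


def unique_vectors(findings: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Return, per model, the vectors that no other model reported."""
    result: Dict[str, Set[str]] = {}
    for model, vectors in findings.items():
        # Union of everything reported by the other models.
        others: Set[str] = set().union(
            *(v for m, v in findings.items() if m != model)
        )
        result[model] = vectors - others
    return result
```

With placeholder labels such as `{"A": {"v1", "v2", "v3"}, "B": {"v2", "v4"}, "C": {"v2"}}`, model `C` ends up with an empty set, which corresponds to the "subset of others" case in the table.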