model-research/findings/2026-05-05-29-adversarial-manipulation-analysis-new-task.md

# Finding 29: Adversarial manipulation analysis: NEW task type — GPT-5 is most exhaustive and systematic; Opus produces qualitatively different attack vectors with system-level thinking; Sonnet is adequate but less creative

**Date:** 2026-05-05
**Task:** Identify adversarial manipulation paths in gargoyle's `aggregation.md` (193 lines)
— how a misbehaving, compromised, or buggy upstream component could exploit the
aggregator's design guarantees to produce harmful trading outcomes that bypass
downstream safety controls.
**How we used them:** Same document (full text) + same focused analytical question to all
3 models via HAI proxy. Highly structured prompt specifying 5 categories of adversarial
manipulation (signal injection, timing manipulation, capacity weaponization, state
corruption via crash, audit evasion). Required specific output format per finding
(attack vector, mechanism, exploit, why downstream controls miss it, severity). No tools,
no project context beyond the document itself.

| Model | Time | Output tokens | Reasoning tokens | Attack vectors found | Critical | High | Medium |
|---|---|---|---|---|---|---|---|
| Claude Sonnet 4.6 | 27s | 1,257 | (internal) | 10 | 3 | 5 | 2 |
| Claude Opus 4.6 | 84s | 3,662 | (internal) | 12 | 5 | 5 | 0 |
| GPT-5 | 111s | 8,808 | 6,336 | 15 | 2 | 10 | 3 |
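
The per-finding efficiency figures used later in this note follow from the table by simple division. A minimal sketch of that arithmetic, using only the numbers in the table; the dictionary layout and the `per_finding` helper are illustrative, and the output-token column is taken as-is because the table does not say whether it includes reasoning tokens:

```python
# Figures copied from the results table above.
results = {
    "Claude Sonnet 4.6": {"seconds": 27, "output_tokens": 1257, "findings": 10},
    "Claude Opus 4.6": {"seconds": 84, "output_tokens": 3662, "findings": 12},
    "GPT-5": {"seconds": 111, "output_tokens": 8808, "findings": 15},
}


def per_finding(row):
    """Return (output tokens per finding, seconds per finding)."""
    return (row["output_tokens"] / row["findings"], row["seconds"] / row["findings"])


density = {model: per_finding(row) for model, row in results.items()}

for model, (tokens, seconds) in density.items():
    print(f"{model}: {tokens:.1f} tokens/finding, {seconds:.1f} s/finding")
```

Sonnet spends roughly a fifth of GPT-5's output tokens per finding, which matches the "fast first pass" characterization below.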

**What they found — common ground (all 3 identified):**
- Primary signal hijacking via ranking manipulation (last-tick injection in
  time-windowed groups to control decision parameters)
- Threshold gaming via signal replay/duplication (no deduplication means N
  identical signals satisfy "N confirmations")
- Capacity flooding to force premature completion or deny legitimate trades
- Strategic crash to erase unfavorable in-flight groups
- Timeout-masqueraded manipulation (making attacks look like normal system behavior
  in the audit trail)
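
The threshold-gaming item can be illustrated with a small sketch. The `Signal` fields and both predicate functions are hypothetical, since the spec's actual signal schema and threshold predicate are not reproduced in this note; the point is only that counting raw signals lets a replayed signal stand in for N confirmations, while counting distinct signals does not:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    # Hypothetical fields; the real signal schema is not shown in this note.
    source: str
    instrument: str
    direction: str
    tick: int


def naive_threshold_met(signals, n):
    """Count every buffered signal, so replays count as confirmations."""
    return len(signals) >= n


def independent_threshold_met(signals, n):
    """Count only distinct signals (frozen dataclasses hash by field values)."""
    return len(set(signals)) >= n


original = Signal("strategy-a", "XYZ", "BUY", 100)
replayed = [original] * 5  # one signal replayed five times

print(naive_threshold_met(replayed, 5))        # replay satisfies "5 confirmations"
print(independent_threshold_met(replayed, 5))  # only one distinct signal exists
```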

**GPT-5 unique findings (not in either Claude model):**
- **Direction flip against majority via ranking:** In "most recent" ranking,
  emit multiple SELL confirmations then inject a late BUY — the BUY becomes
  primary and the decision contradicts the bulk of evidence. Distinct from
  general primary hijack because it's specifically about *directional* reversal.
- **Late-arrival exclusion of counter-signals:** Time signals so countervailing
  signals arrive just after group destruction, ensuring the decision is formed
  without dissenting inputs that would have altered ranking.
- **Capacity filter to curate the audit set:** Pre-fill buffer with chosen
  signals so riskier alternatives cannot be included before capacity fires —
  the contributing signals list looks clean.
- **Timer nullification by crash:** Crash just before a timeout that would
  force-complete an unfavorable decision — the timer becomes no-op on restart,
  no decision or expiry event is emitted.
- **Decision drop via induced forwarding failure:** Exploit the "Decision
  forwarding fails: Decision is lost" failure mode to selectively suppress
  protective decisions (stops, hedges) with no automatic retry.
- **Crash to erase evidence of contrary signals:** Post-crash, submit a
  fresh group that completes quickly; audit shows only the new set, not the
  earlier contradictory pre-crash signals.
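
The direction-flip finding is easy to reproduce in miniature. The sketch below assumes a "most recent" ranking that picks the signal with the latest timestamp as primary and copies its direction into the decision; the dictionary fields and function names are illustrative, not taken from `aggregation.md`:

```python
from collections import Counter


def decide_most_recent(signals):
    """Primary is the latest signal; its direction becomes the decision."""
    primary = max(signals, key=lambda s: s["ts"])
    return primary["direction"]


def majority_direction(signals):
    """Direction backed by the largest number of signals in the group."""
    return Counter(s["direction"] for s in signals).most_common(1)[0][0]


# Four SELL confirmations followed by one late BUY.
group = [{"ts": t, "direction": "SELL"} for t in (1, 2, 3, 4)]
group.append({"ts": 5, "direction": "BUY"})

print(decide_most_recent(group))   # decision follows the late signal
print(majority_direction(group))   # bulk of the evidence points the other way
```

Under these assumptions the decision is BUY even though four of the five contributing signals say SELL.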

**Claude Opus unique findings (not in either other model):**
- **Instrument fragmentation to multiply position size:** Emit signals for
  economically equivalent exposures using different instrument identifiers.
  Each gets its own group, each produces a separate decision, bypassing
  per-group capacity limits. Combined position exceeds what any single group
  would allow. Opus identifies a time-of-check to time-of-use (TOCTOU) gap at
  the fan-in to PortfolioRisk.
- **Forced stale decision via timer exploitation:** Emit one signal at a
  favorable price spike known to be transient, then deliberately withhold
  further signals. Timer force-completes with a stale price. The entry price
  WAS valid when the signal was generated — PortfolioRisk doesn't check
  staleness of decision prices.
- **Timeout prevention / keep-alive suppression:** Manipulate market data
  feed to suppress signals that would reach threshold N. Group expires
  normally — denial-of-trading attack disguised as insufficient confirmation.
- **Crash-restart duplicate decisions:** Crash after a decision is forwarded
  but before the strategy reflects it. Strategy and aggregator both restart
  "clean": the strategy re-emits signals and the aggregator produces a second
  decision with a fresh ID. The same trade executes twice. PortfolioRisk can't
  deduplicate because the IDs differ.
- **Force-complete with insufficient confirmation (capacity < threshold):**
  If the capacity limit is lower than the threshold, hitting capacity ALWAYS
  force-completes the group before the predicate is satisfied. With capacity 3
  and threshold 5, for example, a 5-confirmation strategy effectively becomes
  a 3-confirmation strategy.
- **Pattern predicate as arbitrary decision trigger:** If adversary controls
  predicate logic (via strategy configuration), can make pattern-complete
  trigger on any single signal while audit shows algorithm=pattern-complete
  and reason=:predicate. Trust boundary between configuration and execution.
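
The capacity-below-threshold finding can be checked with a toy completion loop. This is a sketch under stated assumptions: `run_group`, its parameters, and the reason labels are hypothetical, and the order of the two checks (predicate first, then capacity) is an assumption rather than something this note quotes from the spec:

```python
def run_group(signal_count, threshold, capacity):
    """Feed signals one at a time; return (completion reason, confirmations seen)."""
    buffered = 0
    for _ in range(signal_count):
        buffered += 1
        if buffered >= threshold:
            return ("predicate", buffered)  # enough confirmations
        if buffered >= capacity:
            return ("capacity", buffered)   # buffer full, group force-completed
    return ("open", buffered)               # neither limit reached yet


print(run_group(10, threshold=5, capacity=3))  # capacity fires first
print(run_group(10, threshold=5, capacity=8))  # predicate fires first
```

With capacity 3 and threshold 5 the predicate branch is unreachable, so every completed group carries only three confirmations.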

**Claude Sonnet unique findings (not in either other model):**
- **Cross-group timing coordination:** Coordinate signal injection across
  multiple instruments to synchronize completion times, creating a burst of
  correlated decisions that overwhelms PortfolioRisk's individually safe
  evaluations. (NOTE: Opus found a similar concept, instrument fragmentation,
  but framed it differently: Opus focused on position multiplication via
  instrument aliasing, Sonnet on burst timing overwhelming evaluation.)
- **Multi-strategy attack distribution:** Spread manipulation across multiple
  isolated strategy aggregators so no single aggregator's behavior looks
  abnormal while cumulative effect is harmful.

**Quality assessment:**
- **GPT-5** produced the most findings (15) with the most systematic coverage
  across all 5 prompt categories. Its strength was in identifying SPECIFIC
  INTERLEAVINGS — exactly how timing, state, and ranking mechanisms interact
  to produce exploits. The direction-flip finding (#3) and the late-arrival
  exclusion finding (#6) show precise temporal reasoning about when signals
  arrive relative to group lifecycle events. The "decision drop via forwarding
  failure" finding exploits a DOCUMENTED failure mode (from the failure table)
  as an offensive weapon — turning a recovery mechanism into an attack vector.
  Every finding references specific mechanisms from the spec.
- **Claude Opus** produced 12 findings with the most architecturally creative
  attacks. The instrument fragmentation attack is the most SYSTEMICALLY
  dangerous finding across all three models — it's not about manipulating one
  group but about the RELATIONSHIP between groups, and it identifies a
  TOCTOU vulnerability at the PortfolioRisk fan-in point that no other model
  found. The crash-restart duplication attack is also architecturally novel —
  it exploits the "clean state" guarantee as a weapon for invisible trade
  doubling. Opus consistently reasons about the system BOUNDARY (aggregator
  → PortfolioRisk handoff) rather than just within-component mechanics. The
  pattern-predicate trust boundary finding is uniquely about CONFIGURATION
  as an attack surface.
- **Claude Sonnet** produced 10 findings in 27s, which is extremely efficient
  (about 126 output tokens per finding). Findings were adequate and covered all 5 categories,
  but lacked the specificity of GPT-5 and the architectural creativity of
  Opus. Several findings were somewhat generic (e.g., "crash at strategic
  moments" without specifying exactly WHEN relative to group lifecycle).
  The cross-group coordination and multi-strategy distribution findings show
  system-level thinking but are stated at a higher abstraction level without
  concrete exploit sequences.
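
The instrument fragmentation attack singled out in the Opus assessment above can be sketched as follows. Everything here is hypothetical: the per-decision size cap, the alias map, and the identifiers are invented for illustration, and the real grouping key and limits in `aggregation.md` may differ. The sketch shows only why limits enforced per group say nothing about combined exposure:

```python
from collections import defaultdict

PER_GROUP_LIMIT = 100  # hypothetical maximum size of a single decision

# Hypothetical alias map: identifiers carrying the same economic exposure.
UNDERLYING = {"XYZ": "XYZ", "XYZ.ALT": "XYZ", "XYZ-SWAP": "XYZ"}

# (instrument identifier, requested size) pairs emitted by the adversary.
signals = [("XYZ", 100), ("XYZ.ALT", 100), ("XYZ-SWAP", 100)]


def decisions_per_group(signals):
    """One group per identifier; each decision is capped at the per-group limit."""
    groups = defaultdict(int)
    for instrument, size in signals:
        groups[instrument] = min(groups[instrument] + size, PER_GROUP_LIMIT)
    return dict(groups)


def exposure_by_underlying(decisions):
    """Sum decision sizes across identifiers that share an underlying."""
    totals = defaultdict(int)
    for instrument, size in decisions.items():
        totals[UNDERLYING[instrument]] += size
    return dict(totals)


decisions = decisions_per_group(signals)
print(decisions)                          # every decision respects the cap
print(exposure_by_underlying(decisions))  # combined exposure is three times the cap
```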

**Key insight — "adversarial manipulation analysis" as a task type:**
This is qualitatively different from all previous analytical lenses tested.
Previous tasks asked models to find problems WITH the design (assumptions,
races, incoherences). This task asks models to find ways to USE the design
AGAINST itself — a creative/generative adversarial task. Results:

- **GPT-5** treats it as an exhaustive enumeration exercise: it systematically
  walks through each mechanism and asks "how could this be abused?" High
  count (15), thorough coverage, but some findings are minor variations of
  each other (e.g., crash-related findings #10, #12, #15 share the same core
  mechanism). Its 6,336 reasoning tokens appear to have been spent on both
  generation and verification.
- **Opus** treats it as a creative design exercise — asks "what would a
  smart adversary do that the designer didn't consider?" Fewer findings (12)
  but several are genuinely novel attack concepts (instrument fragmentation,
  crash-restart duplication, predicate trust boundary) that require reasoning
  about the SYSTEM rather than the COMPONENT. Opus also provided a summary
  table and systemic conclusion about the root design weaknesses.
- **Sonnet** treats it as a categorization exercise — fills each prompt
  category with plausible attacks but at a higher abstraction level. Fast
  and adequate for a first pass but wouldn't surprise a security reviewer.

**Comparison to "predictable exploit window" (Finding #18):**
Finding #18 noted that Opus uniquely identified predictable exploit windows
in escalation-policy.md. Here, Opus again shows the strongest adversarial
creativity — the instrument fragmentation attack and crash-restart duplication
are both about exploiting DESIGN GUARANTEES (per-instrument grouping, clean
restart) as weapons. Two documents are a small sample, but this suggests that
Opus's strength on adversarial analysis is a CONSISTENT PATTERN rather than
document-specific.

GPT-5 excels when the adversarial task is framed as "enumerate all possible
abuses of each mechanism" (systematic coverage). Opus excels when the task
requires "invent novel attack concepts that exploit design boundaries"
(creative adversarial thinking).

**Model hierarchy for adversarial manipulation analysis:**
1. GPT-5: most thorough enumeration, best at mechanism-level exploitation (15 findings)
2. Opus: most creative, finds system-boundary attacks others miss (12 findings)
3. Sonnet: adequate first pass, fast, but less specific (10 findings)

**Practical implication:** For security-oriented architecture review:
- Run GPT-5 for comprehensive attack surface enumeration
- Run Opus for novel/creative attack vectors that exploit design boundaries
- Sonnet is sufficient only as a quick initial screen
- The UNION of GPT-5 + Opus findings (removing overlaps) would produce the
  most complete adversarial analysis
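
Merging the two result sets is a set union once findings have been normalized to shared labels. A minimal sketch, assuming a reviewer has already mapped each model's findings to hypothetical short labels like the ones below (the labels are abbreviations of findings listed in this note, not identifiers produced by the models):

```python
# Hypothetical normalized labels for a handful of findings from this note.
gpt5 = {"primary-hijack", "threshold-replay", "direction-flip", "timer-nullification"}
opus = {"primary-hijack", "threshold-replay", "instrument-fragmentation",
        "crash-restart-duplicate"}

combined = gpt5 | opus  # union with overlaps removed
overlap = gpt5 & opus   # findings both models reported

print(sorted(combined))
print(sorted(overlap))
```

The normalization step is the expensive part: two models rarely name the same attack the same way, so the overlap has to be judged by a reviewer before the union means anything.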

**New finding about the aggregator itself:** Several attacks identified by
multiple models point to real design weaknesses worth addressing:
1. No signal deduplication/independence validation (all 3 models)
2. Primary signal determines all decision parameters regardless of group
   composition (all 3 models)
3. Transient state + no replay = perfect adversarial erasure tool (all 3)
4. Capacity/timeout treated as normal events even when weaponized (all 3)
5. No cross-group correlation at aggregator level (Opus + Sonnet)
6. TOCTOU at PortfolioRisk fan-in for concurrent decisions (Opus)