model-research/findings/2026-05-05-29-adversarial-manipulation-analysis-new-task.md
Rodin 6af8a6ee10 refactor(findings): split ALL-FINDINGS.md into per-experiment files
Break the monolithic 3249-line findings file into 29 individual files,
one per experiment. Each file is named YYYY-MM-DD-NN-slug.md for easy
chronological sorting and discovery.

No content changes — purely structural reorganization.
2026-05-06 07:15:50 -07:00


Finding 29: Adversarial manipulation analysis: NEW task type — GPT-5 is most exhaustive and systematic; Opus produces qualitatively different attack vectors with system-level thinking; Sonnet is adequate but less creative

Date: 2026-05-05

Task: Identify adversarial manipulation paths in gargoyle's aggregation.md (193 lines): how a misbehaving, compromised, or buggy upstream component could exploit the aggregator's design guarantees to produce harmful trading outcomes that bypass downstream safety controls.

How we used them: Same document (full text) + same focused analytical question to all 3 models via HAI proxy. Highly structured prompt specifying 5 categories of adversarial manipulation (signal injection, timing manipulation, capacity weaponization, state corruption via crash, audit evasion). Required specific output format per finding (attack vector, mechanism, exploit, why downstream controls miss it, severity). No tools, no project context beyond the document itself.

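| Model | Time | Output tokens | Reasoning tokens | Attack vectors found | Critical | High | Medium |
|---|---|---|---|---|---|---|---|
| Claude Sonnet 4.6 | 27s | 1,257 | (internal) | 10 | 3 | 5 | 2 |
| Claude Opus 4.6 | 84s | 3,662 | (internal) | 12 | 5 | 5 | 0 |
| GPT-5 | 111s | 8,808 | 6,336 | 15 | 2 | 10 | 3 |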

What they found — common ground (all 3 identified):

  • Primary signal hijacking via ranking manipulation (last-tick injection in time-windowed aggregation to control decision parameters)
  • Threshold gaming via signal replay/duplication (no deduplication means N identical signals satisfy "N confirmations")
  • Capacity flooding to force premature completion or deny legitimate trades
  • Strategic crash to erase unfavorable in-flight groups
  • Timeout-masqueraded manipulation (making attacks look like normal system behavior in the audit trail)
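The threshold-gaming item above can be illustrated with a minimal sketch. It assumes a count-based confirmation check and a `signal_id` field. Both are hypothetical names for illustration and are not taken from aggregation.md.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    signal_id: str   # hypothetical unique identifier, assumed for this sketch
    instrument: str
    direction: str   # "BUY" or "SELL"


def threshold_met(signals, n):
    """Naive check: every buffered signal counts, including replays."""
    return len(signals) >= n


def threshold_met_dedup(signals, n):
    """Counts distinct signal IDs only, so a replayed signal adds nothing."""
    return len({s.signal_id for s in signals}) >= n


# One genuine signal replayed three times.
original = Signal("sig-1", "XYZ", "BUY")
replayed = [original] * 3

naive_result = threshold_met(replayed, 3)        # replay satisfies "3 confirmations"
dedup_result = threshold_met_dedup(replayed, 3)  # only one distinct signal present
```

Deduplicating by ID only addresses exact replays. Whether the confirmations are truly independent is a separate question that this sketch does not attempt to answer.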

GPT-5 unique findings (not in either Claude model):

  • Direction flip against majority via ranking: In "most recent" ranking, emit multiple SELL confirmations then inject a late BUY — the BUY becomes primary and the decision contradicts the bulk of evidence. Distinct from general primary hijack because it's specifically about directional reversal.
  • Late-arrival exclusion of counter-signals: Time signals so countervailing signals arrive just after group destruction, ensuring the decision is formed without dissenting inputs that would have altered ranking.
  • Capacity filter to curate the audit set: Pre-fill buffer with chosen signals so riskier alternatives cannot be included before capacity fires — the contributing signals list looks clean.
  • Timer nullification by crash: Crash just before a timeout that would force-complete an unfavorable decision — the timer becomes no-op on restart, no decision or expiry event is emitted.
  • Decision drop via induced forwarding failure: Exploit the "Decision forwarding fails: Decision is lost" failure mode to selectively suppress protective decisions (stops, hedges) with no automatic retry.
  • Crash to erase evidence of contrary signals: Post-crash, submit a fresh group that completes quickly; audit shows only the new set, not the earlier contradictory pre-crash signals.
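The direction-flip finding can be sketched as follows, assuming a "most recent" ranking that simply picks the signal with the latest timestamp as primary. The `Signal` fields and function names are illustrative assumptions, not the aggregator's actual interface.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    direction: str    # "BUY" or "SELL"
    timestamp: float  # arrival time in seconds, assumed for this sketch


def primary_most_recent(signals):
    """Assumed 'most recent' ranking: the latest signal becomes primary."""
    return max(signals, key=lambda s: s.timestamp)


def majority_direction(signals):
    """Direction supported by the largest number of signals in the group."""
    return Counter(s.direction for s in signals).most_common(1)[0][0]


# Three SELL confirmations followed by one late BUY.
group = [
    Signal("SELL", 1.0),
    Signal("SELL", 2.0),
    Signal("SELL", 3.0),
    Signal("BUY", 3.9),
]

primary = primary_most_recent(group)
majority = majority_direction(group)
contradicts_majority = primary.direction != majority
```

A check like `contradicts_majority` is one way a reviewer might flag such groups in the audit trail. Whether the aggregator should act on it is a design decision outside this sketch.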

Claude Opus unique findings (not in either other model):

  • Instrument fragmentation to multiply position size: Emit signals for economically equivalent exposures using different instrument identifiers. Each gets its own group, each produces a separate decision, bypassing per-group capacity limits. Combined position exceeds what any single group would allow. Identifies TOCTOU at the fan-in to PortfolioRisk.
  • Forced stale decision via timer exploitation: Emit one signal at a favorable price spike known to be transient, then deliberately withhold further signals. Timer force-completes with a stale price. The entry price WAS valid when the signal was generated — PortfolioRisk doesn't check staleness of decision prices.
  • Timeout prevention / keep-alive suppression: Manipulate market data feed to suppress signals that would reach threshold N. Group expires normally — denial-of-trading attack disguised as insufficient confirmation.
  • Crash-restart duplicate decisions: Crash after the decision is forwarded but before the strategy reflects it. Strategy and aggregator both restart "clean": the strategy re-emits its signals and the aggregator produces a second decision with a fresh ID. The same trade executes twice. PortfolioRisk can't deduplicate because the IDs are different.
  • Force-complete with insufficient confirmation (capacity < threshold): If the capacity limit is lower than the threshold, hitting capacity ALWAYS force-completes the group before the predicate is satisfied. With a threshold of 5 and a capacity of 3, for example, this fundamentally changes a 5-confirmation strategy into a 3-confirmation strategy.
  • Pattern predicate as arbitrary decision trigger: If adversary controls predicate logic (via strategy configuration), can make pattern-complete trigger on any single signal while audit shows algorithm=pattern-complete and reason=:predicate. Trust boundary between configuration and execution.
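The capacity-below-threshold finding lends itself to a small worked example. The sketch below assumes a group completes either when the confirmation count reaches the threshold or when the buffer reaches capacity. The function names and the `"predicate"` / `"capacity"` reason labels are hypothetical.

```python
def completion_reason(signal_count, threshold, capacity):
    """Return why a group completes at this count, or None if it stays open."""
    if signal_count >= threshold:
        return "predicate"
    if signal_count >= capacity:
        return "capacity"
    return None


def first_completion(threshold, capacity):
    """Add signals one at a time and report when and why the group completes."""
    count = 0
    while True:
        count += 1
        reason = completion_reason(count, threshold, capacity)
        if reason is not None:
            return count, reason


def validate_config(threshold, capacity):
    """Reject configurations where capacity can never let the predicate fire."""
    if capacity < threshold:
        raise ValueError(
            f"capacity ({capacity}) is below threshold ({threshold}); "
            "groups would always force-complete early"
        )


misconfigured = first_completion(threshold=5, capacity=3)
well_formed = first_completion(threshold=5, capacity=8)
```

With `threshold=5` and `capacity=3`, the group completes at three signals for the capacity reason, which is the 5-to-3 downgrade described above. A startup check along the lines of `validate_config` is one possible guard.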

Claude Sonnet unique findings (not in either other model):

  • Cross-group timing coordination: Coordinate signal injection across multiple instruments to synchronize completion times, creating a burst of correlated decisions that overwhelms PortfolioRisk's individually safe evaluations. (NOTE: Opus found a similar concept, instrument fragmentation, but framed it differently: Opus focused on position multiplication via instrument aliasing, Sonnet on burst timing overwhelming evaluation.)
  • Multi-strategy attack distribution: Spread manipulation across multiple isolated strategy aggregators so no single aggregator's behavior looks abnormal while cumulative effect is harmful.

Quality assessment:

  • GPT-5 produced the most findings (15) with the most systematic coverage across all 5 prompt categories. Its strength was in identifying SPECIFIC INTERLEAVINGS — exactly how timing, state, and ranking mechanisms interact to produce exploits. The direction-flip finding (#3) and the late-arrival exclusion finding (#6) show precise temporal reasoning about when signals arrive relative to group lifecycle events. The "decision drop via forwarding failure" finding exploits a DOCUMENTED failure mode (from the failure table) as an offensive weapon — turning a recovery mechanism into an attack vector. Every finding references specific mechanisms from the spec.
  • Claude Opus produced 12 findings with the most architecturally creative attacks. The instrument fragmentation attack is the most SYSTEMICALLY dangerous finding across all three models — it's not about manipulating one group but about the RELATIONSHIP between groups, and it identifies a TOCTOU vulnerability at the PortfolioRisk fan-in point that no other model found. The crash-restart duplication attack is also architecturally novel — it exploits the "clean state" guarantee as a weapon for invisible trade doubling. Opus consistently reasons about the system BOUNDARY (aggregator → PortfolioRisk handoff) rather than just within-component mechanics. The pattern-predicate trust boundary finding is uniquely about CONFIGURATION as an attack surface.
  • Claude Sonnet produced 10 findings in 27s, which is extremely efficient (about 126 output tokens per finding). Findings were adequate and covered all 5 categories, but lacked the specificity of GPT-5 and the architectural creativity of Opus. Several findings were somewhat generic (e.g., "crash at strategic moments" without specifying exactly WHEN relative to group lifecycle). The cross-group coordination and multi-strategy distribution findings show system-level thinking but are stated at a higher abstraction level without concrete exploit sequences.
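Per-finding cost for each model can be derived from the results table. The short calculation below uses the reported time, output tokens, and attack-vector counts as-is. It does not try to account for reasoning tokens, since the table does not say how they relate to the output-token figures.

```python
# Figures copied from the results table above.
runs = {
    "Claude Sonnet 4.6": {"seconds": 27, "output_tokens": 1257, "findings": 10},
    "Claude Opus 4.6": {"seconds": 84, "output_tokens": 3662, "findings": 12},
    "GPT-5": {"seconds": 111, "output_tokens": 8808, "findings": 15},
}

# Output tokens and wall-clock seconds spent per reported attack vector.
per_finding = {
    model: {
        "tokens": round(r["output_tokens"] / r["findings"], 1),
        "seconds": round(r["seconds"] / r["findings"], 1),
    }
    for model, r in runs.items()
}

for model, cost in per_finding.items():
    print(f"{model}: {cost['tokens']} tokens/finding, {cost['seconds']} s/finding")
```

This gives roughly 125.7, 305.2, and 587.2 output tokens per finding for Sonnet, Opus, and GPT-5 respectively.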

Key insight — "adversarial manipulation analysis" as a task type: This is qualitatively different from all previous analytical lenses tested. Previous tasks asked models to find problems WITH the design (assumptions, races, incoherences). This task asks models to find ways to USE the design AGAINST itself — a creative/generative adversarial task. Results:

  • GPT-5 treats it as an exhaustive enumeration exercise — systematically walks through each mechanism and asks "how could this be abused?" High count (15), thorough coverage, but some findings are minor variations of each other (e.g., crash-related findings #10, #12, #15 share the same core mechanism). Reasoning tokens (6,336) used for both generation and verification.
  • Opus treats it as a creative design exercise — asks "what would a smart adversary do that the designer didn't consider?" Fewer findings (12) but several are genuinely novel attack concepts (instrument fragmentation, crash-restart duplication, predicate trust boundary) that require reasoning about the SYSTEM rather than the COMPONENT. Opus also provided a summary table and systemic conclusion about the root design weaknesses.
  • Sonnet treats it as a categorization exercise — fills each prompt category with plausible attacks but at a higher abstraction level. Fast and adequate for a first pass but wouldn't surprise a security reviewer.

Comparison to "predictable exploit window" (Finding #18): Finding #18 noted that Opus uniquely identified predictable exploit windows in escalation-policy.md. Here, Opus again shows the strongest adversarial creativity — the instrument fragmentation attack and crash-restart duplication are both about exploiting DESIGN GUARANTEES (per-instrument grouping, clean restart) as weapons. This confirms that Opus's strength on adversarial analysis is a CONSISTENT PATTERN, not document-specific.

GPT-5 excels when the adversarial task is framed as "enumerate all possible abuses of each mechanism" (systematic coverage). Opus excels when the task requires "invent novel attack concepts that exploit design boundaries" (creative adversarial thinking).

Model hierarchy for adversarial manipulation analysis:

  1. GPT-5 — most thorough enumeration, best at mechanism-level exploitation (15)
  2. Opus — most creative, finds system-boundary attacks others miss (12)
  3. Sonnet — adequate first pass, fast, but less specific (10)

Practical implication: For security-oriented architecture review:

  • Run GPT-5 for comprehensive attack surface enumeration
  • Run Opus for novel/creative attack vectors that exploit design boundaries
  • Sonnet is sufficient only as a quick initial screen
  • The UNION of GPT-5 + Opus findings (removing overlaps) would produce the most complete adversarial analysis
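Taking the union of two models' findings could be sketched as below. The sketch assumes each finding has already been given a short mechanism label by a reviewer. The labels and the `merge_findings` helper are hypothetical. Normalizing strings is only a stand-in for the human judgment needed to decide that two findings really describe the same attack.

```python
def merge_findings(*model_findings):
    """Union findings across models, keyed by a normalized mechanism label.

    Each argument is a (model_name, list_of_labels) pair. The result maps
    each normalized label to the list of models that reported it.
    """
    merged = {}
    for model, labels in model_findings:
        for label in labels:
            key = " ".join(label.lower().split())  # case/whitespace-insensitive key
            merged.setdefault(key, []).append(model)
    return merged


# Illustrative labels only; not the models' actual output.
gpt5 = ["Threshold gaming via replay", "Direction flip via ranking"]
opus = ["threshold gaming  via replay", "Instrument fragmentation"]

merged = merge_findings(("GPT-5", gpt5), ("Opus", opus))
shared = [label for label, models in merged.items() if len(models) > 1]
```

Labels reported by more than one model (`shared`) are the overlaps to collapse. Everything else is a candidate unique finding.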

New finding about the aggregator itself: Several attacks identified by multiple models point to real design weaknesses worth addressing:

  1. No signal deduplication/independence validation (all 3 models)
  2. Primary signal determines all decision parameters regardless of group composition (all 3 models)
  3. Transient state + no replay = perfect adversarial erasure tool (all 3 models)
  4. Capacity/timeout treated as normal events even when weaponized (all 3 models)
  5. No cross-group correlation at aggregator level (Opus + Sonnet)
  6. TOCTOU at PortfolioRisk fan-in for concurrent decisions (Opus)
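Weakness 5 (no cross-group correlation) is what makes the instrument fragmentation attack possible, and a small example shows why. The alias map, quantities, and limit below are invented for illustration. Nothing here reflects how gargoyle or PortfolioRisk actually model exposure.

```python
from collections import defaultdict

# Hypothetical alias map: instrument identifier -> underlying exposure.
EXPOSURE_OF = {"XYZ": "XYZ", "XYZ.ALT": "XYZ", "XYZ-PERP": "XYZ"}


def per_decision_ok(decisions, limit):
    """Each decision is checked in isolation, as in a per-group evaluation."""
    return all(qty <= limit for _, qty in decisions)


def combined_exposure(decisions):
    """Sum quantities across identifiers that map to the same exposure."""
    totals = defaultdict(int)
    for instrument, qty in decisions:
        totals[EXPOSURE_OF.get(instrument, instrument)] += qty
    return dict(totals)


# Three decisions for economically equivalent instruments, one per group.
decisions = [("XYZ", 80), ("XYZ.ALT", 80), ("XYZ-PERP", 80)]
limit = 100

each_ok = per_decision_ok(decisions, limit)
totals = combined_exposure(decisions)
breaches = {name: qty for name, qty in totals.items() if qty > limit}
```

Every decision passes the per-decision check, yet the combined exposure is 240 against a limit of 100. A correlation step of this kind would have to run somewhere that sees all groups at once.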