model-research/findings/2026-05-05-29-adversarial-manipulation-analysis-new-task.md
Rodin 6af8a6ee10 refactor(findings): split ALL-FINDINGS.md into per-experiment files
Break the monolithic 3249-line findings file into 29 individual files,
one per experiment. Each file is named YYYY-MM-DD-NN-slug.md for easy
chronological sorting and discovery.

No content changes — purely structural reorganization.
2026-05-06 07:15:50 -07:00


# Finding 29: Adversarial manipulation analysis: NEW task type — GPT-5 is most exhaustive and systematic; Opus produces qualitatively different attack vectors with system-level thinking; Sonnet is adequate but less creative
**Date:** 2026-05-05
**Task:** Identify adversarial manipulation paths in gargoyle's `aggregation.md` (193 lines)
— how a misbehaving, compromised, or buggy upstream component could exploit the
aggregator's design guarantees to produce harmful trading outcomes that bypass
downstream safety controls.
**How we used them:** Same document (full text) + same focused analytical question to all
3 models via HAI proxy. Highly structured prompt specifying 5 categories of adversarial
manipulation (signal injection, timing manipulation, capacity weaponization, state
corruption via crash, audit evasion). Required specific output format per finding
(attack vector, mechanism, exploit, why downstream controls miss it, severity). No tools,
no project context beyond the document itself.
| Model | Time | Output tokens | Reasoning tokens | Attack vectors found | Critical | High | Medium |
|---|---|---|---|---|---|---|---|
| Claude Sonnet 4.6 | 27s | 1,257 | (internal) | 10 | 3 | 5 | 2 |
| Claude Opus 4.6 | 84s | 3,662 | (internal) | 12 | 5 | 5 | 0 |
| GPT-5 | 111s | 8,808 | 6,336 | 15 | 2 | 10 | 3 |
**What they found — common ground (all 3 identified):**
- Primary signal hijacking via ranking manipulation (last-tick injection in
time-windowed aggregation to control decision parameters)
- Threshold gaming via signal replay/duplication (no deduplication means N
identical signals satisfy "N confirmations")
- Capacity flooding to force premature completion or deny legitimate trades
- Strategic crash to erase unfavorable in-flight groups
- Timeout-masqueraded manipulation (making attacks look like normal system behavior
in the audit trail)
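
The threshold-gaming item above can be illustrated with a minimal sketch. It assumes a signal carries a source identifier and a content digest; `Signal`, `source_id`, `payload_hash`, and `THRESHOLD_N` are hypothetical names for illustration and are not taken from `aggregation.md`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    source_id: str     # hypothetical: which upstream component emitted it
    instrument: str
    direction: str     # "BUY" or "SELL"
    payload_hash: str  # hypothetical: digest of the signal contents


def naive_confirmations(signals):
    """Count every signal, as an aggregator without deduplication would."""
    return len(signals)


def independent_confirmations(signals):
    """Count distinct (source, payload) pairs so replays collapse to one."""
    return len({(s.source_id, s.payload_hash) for s in signals})


THRESHOLD_N = 3  # illustrative value, not from the spec

# One signal replayed three times by a single upstream component.
replayed = [Signal("strategy-a", "XYZ", "BUY", "abc123")] * 3

naive_ok = naive_confirmations(replayed) >= THRESHOLD_N
independent_ok = independent_confirmations(replayed) >= THRESHOLD_N
```

Under these assumptions the replayed signal satisfies the naive count but not the independence-aware count, which is the gap all three models pointed at.
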
**GPT-5 unique findings (not in either Claude model):**
- **Direction flip against majority via ranking:** In "most recent" ranking,
emit multiple SELL confirmations then inject a late BUY — the BUY becomes
primary and the decision contradicts the bulk of evidence. Distinct from
general primary hijack because it's specifically about *directional* reversal.
- **Late-arrival exclusion of counter-signals:** Time signals so countervailing
signals arrive just after group destruction, ensuring the decision is formed
without dissenting inputs that would have altered ranking.
- **Capacity filter to curate the audit set:** Pre-fill buffer with chosen
signals so riskier alternatives cannot be included before capacity fires —
the contributing signals list looks clean.
- **Timer nullification by crash:** Crash just before a timeout that would
force-complete an unfavorable decision — the timer becomes no-op on restart,
no decision or expiry event is emitted.
- **Decision drop via induced forwarding failure:** Exploit the "Decision
forwarding fails: Decision is lost" failure mode to selectively suppress
protective decisions (stops, hedges) with no automatic retry.
- **Crash to erase evidence of contrary signals:** Post-crash, submit a
fresh group that completes quickly; audit shows only the new set, not the
earlier contradictory pre-crash signals.
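
The direction-flip finding can be sketched as follows. This is an illustration only: it assumes that "most recent" ranking picks the latest-arriving signal as primary, and the tuple layout and function names are hypothetical.

```python
from collections import Counter

# (arrival_time, direction) pairs; the values are illustrative only.
group = [(1, "SELL"), (2, "SELL"), (3, "SELL"), (4, "BUY")]


def primary_most_recent(signals):
    """Pick the latest-arriving signal as primary ("most recent" ranking)."""
    return max(signals, key=lambda s: s[0])


def majority_direction(signals):
    """Return the direction carried by the largest number of signals."""
    return Counter(direction for _, direction in signals).most_common(1)[0][0]


def contradicts_majority(signals):
    """True when the primary signal disagrees with the bulk of the group."""
    return primary_most_recent(signals)[1] != majority_direction(signals)
```

A check like `contradicts_majority` is one possible way to flag the condition for audit; whether the aggregator should block or merely log such a decision is a design choice the finding does not settle.
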
**Claude Opus unique findings (not in either other model):**
- **Instrument fragmentation to multiply position size:** Emit signals for
economically equivalent exposures using different instrument identifiers.
Each gets its own group, each produces a separate decision, bypassing
per-group capacity limits. Combined position exceeds what any single group
would allow. Opus identifies a time-of-check to time-of-use (TOCTOU) gap at the fan-in to PortfolioRisk.
- **Forced stale decision via timer exploitation:** Emit one signal at a
favorable price spike known to be transient, then deliberately withhold
further signals. Timer force-completes with a stale price. The entry price
WAS valid when the signal was generated — PortfolioRisk doesn't check
staleness of decision prices.
- **Timeout prevention / keep-alive suppression:** Manipulate market data
feed to suppress signals that would reach threshold N. Group expires
normally — denial-of-trading attack disguised as insufficient confirmation.
- **Crash-restart duplicate decisions:** Crash after decision is forwarded
but before strategy reflects it. Both restart "clean" — strategy re-emits
signals, aggregator produces a second decision with a fresh ID. Same trade
executes twice. PortfolioRisk can't deduplicate because IDs are different.
- **Force-complete with insufficient confirmation (capacity < threshold):**
If the capacity limit is lower than the threshold, hitting capacity ALWAYS
force-completes the group before the predicate is satisfied. With a capacity
of 3, for example, a 5-confirmation strategy effectively becomes a
3-confirmation strategy.
- **Pattern predicate as arbitrary decision trigger:** If adversary controls
predicate logic (via strategy configuration), can make pattern-complete
trigger on any single signal while audit shows algorithm=pattern-complete
and reason=:predicate. Trust boundary between configuration and execution.
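
The capacity-below-threshold case in the list above lends itself to a small simulation. The sketch assumes the predicate is checked before the capacity limit on each arrival; the real ordering in `aggregation.md` is not shown in this finding, and `run_group` and `validate_config` are hypothetical helpers.

```python
def run_group(threshold, capacity, max_signals=100):
    """Feed signals one at a time and report how the group completes.

    Assumption: on each arrival the threshold predicate is evaluated
    first, then the capacity limit.
    """
    for count in range(1, max_signals + 1):
        if count >= threshold:
            return count, "predicate"
        if count >= capacity:
            return count, "capacity"
    return max_signals, None


def validate_config(threshold, capacity):
    """Reject configurations where capacity can never reach the threshold."""
    if capacity < threshold:
        raise ValueError(
            f"capacity {capacity} is below threshold {threshold}: "
            "every group would force-complete on capacity"
        )


weakened = run_group(threshold=5, capacity=3)
normal = run_group(threshold=5, capacity=10)
```

With threshold 5 and capacity 3, the group completes on capacity after three signals and the predicate never fires. A load-time check such as `validate_config` is one possible guard, offered here as a suggestion rather than something the spec describes.
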
**Claude Sonnet unique findings (not in either other model):**
- **Cross-group timing coordination:** Coordinate signal injection across
multiple instruments to synchronize completion times, creating a burst of
correlated decisions that overwhelm PortfolioRisk's individually safe
evaluations. (NOTE: Opus found a similar concept — instrument fragmentation
— but framed it differently: Opus focused on position multiplication via
instrument aliasing, Sonnet focused on burst timing overwhelming evaluation.)
- **Multi-strategy attack distribution:** Spread manipulation across multiple
isolated strategy aggregators so no single aggregator's behavior looks
abnormal while cumulative effect is harmful.
**Quality assessment:**
- **GPT-5** produced the most findings (15) with the most systematic coverage
across all 5 prompt categories. Its strength was in identifying SPECIFIC
INTERLEAVINGS — exactly how timing, state, and ranking mechanisms interact
to produce exploits. The direction-flip finding (#3) and the late-arrival
exclusion finding (#6) show precise temporal reasoning about when signals
arrive relative to group lifecycle events. The "decision drop via forwarding
failure" finding exploits a DOCUMENTED failure mode (from the failure table)
as an offensive weapon, turning a known limitation into an attack vector.
Every finding references specific mechanisms from the spec.
- **Claude Opus** produced 12 findings with the most architecturally creative
attacks. The instrument fragmentation attack is the most SYSTEMICALLY
dangerous finding across all three models — it's not about manipulating one
group but about the RELATIONSHIP between groups, and it identifies a
TOCTOU vulnerability at the PortfolioRisk fan-in point that no other model
found. The crash-restart duplication attack is also architecturally novel —
it exploits the "clean state" guarantee as a weapon for invisible trade
doubling. Opus consistently reasons about the system BOUNDARY (aggregator
→ PortfolioRisk handoff) rather than just within-component mechanics. The
pattern-predicate trust boundary finding is uniquely about CONFIGURATION
as an attack surface.
- **Claude Sonnet** produced 10 findings in 27s, which is extremely efficient
(about 126 output tokens per finding). Findings were adequate and covered all 5 categories,
but lacked the specificity of GPT-5 and the architectural creativity of
Opus. Several findings were somewhat generic (e.g., "crash at strategic
moments" without specifying exactly WHEN relative to group lifecycle).
The cross-group coordination and multi-strategy distribution findings show
system-level thinking but are stated at a higher abstraction level without
concrete exploit sequences.
**Key insight — "adversarial manipulation analysis" as a task type:**
This is qualitatively different from all previous analytical lenses tested.
Previous tasks asked models to find problems WITH the design (assumptions,
races, incoherences). This task asks models to find ways to USE the design
AGAINST itself — a creative/generative adversarial task. Results:
- **GPT-5** treats it as an exhaustive enumeration exercise — systematically
walks through each mechanism and asks "how could this be abused?" High
count (15), thorough coverage, but some findings are minor variations of
each other (e.g., crash-related findings #10, #12, #15 share the same core
mechanism). Its 6,336 reasoning tokens appear to have covered both generation and verification.
- **Opus** treats it as a creative design exercise — asks "what would a
smart adversary do that the designer didn't consider?" Fewer findings (12)
but several are genuinely novel attack concepts (instrument fragmentation,
crash-restart duplication, predicate trust boundary) that require reasoning
about the SYSTEM rather than the COMPONENT. Opus also provided a summary
table and systemic conclusion about the root design weaknesses.
- **Sonnet** treats it as a categorization exercise — fills each prompt
category with plausible attacks but at a higher abstraction level. Fast
and adequate for a first pass but wouldn't surprise a security reviewer.
**Comparison to "predictable exploit window" (Finding #18):**
Finding #18 noted that Opus uniquely identified predictable exploit windows
in escalation-policy.md. Here, Opus again shows the strongest adversarial
creativity — the instrument fragmentation attack and crash-restart duplication
are both about exploiting DESIGN GUARANTEES (per-instrument grouping, clean
restart) as weapons. Across two documents, this suggests that Opus's strength
on adversarial analysis is a CONSISTENT PATTERN, not document-specific.
GPT-5 excels when the adversarial task is framed as "enumerate all possible
abuses of each mechanism" (systematic coverage). Opus excels when the task
requires "invent novel attack concepts that exploit design boundaries"
(creative adversarial thinking).
**Model hierarchy for adversarial manipulation analysis:**
1. GPT-5 — most thorough enumeration, best at mechanism-level exploitation (15)
2. Opus — most creative, finds system-boundary attacks others miss (12)
3. Sonnet — adequate first pass, fast, but less specific (10)
**Practical implication:** For security-oriented architecture review:
- Run GPT-5 for comprehensive attack surface enumeration
- Run Opus for novel/creative attack vectors that exploit design boundaries
- Sonnet is sufficient only as a quick initial screen
- The UNION of GPT-5 + Opus findings (removing overlaps) would produce the
most complete adversarial analysis
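
Merging the two result sets could look like the sketch below. It assumes each finding has already been normalized by hand to a short slug; the slugs are hypothetical labels for a few of the findings named in this file, not the models' full output.

```python
# Hypothetical normalized slugs for a subset of each model's findings.
gpt5 = {
    "primary-hijack",
    "threshold-replay",
    "direction-flip",
    "late-arrival-exclusion",
}
opus = {
    "primary-hijack",
    "threshold-replay",
    "instrument-fragmentation",
    "crash-restart-duplicate",
}

combined = gpt5 | opus   # union with overlaps removed
overlap = gpt5 & opus    # findings both models reported
gpt5_only = gpt5 - opus
opus_only = opus - gpt5
```

The set operations are the easy part. Deciding that two differently worded findings describe the same attack still needs a human or a separate matching step.
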
**New finding about the aggregator itself:** Several attacks identified by
multiple models point to real design weaknesses worth addressing:
1. No signal deduplication/independence validation (all 3 models)
2. Primary signal determines all decision parameters regardless of group
composition (all 3 models)
3. Transient state + no replay = perfect adversarial erasure tool (all 3)
4. Capacity/timeout treated as normal events even when weaponized (all 3)
5. No cross-group correlation at aggregator level (Opus + Sonnet)
6. TOCTOU at PortfolioRisk fan-in for concurrent decisions (Opus)
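
Weaknesses 5 and 6 can be made concrete with a small sketch of per-decision versus aggregate checking. Everything here is assumed for illustration: the limit, the instrument identifiers, and the `EXPOSURE_KEY` mapping from identifier to underlying exposure are hypothetical and do not come from `aggregation.md` or PortfolioRisk.

```python
from collections import defaultdict

PER_DECISION_LIMIT = 100  # illustrative limit, not from the spec

# Hypothetical mapping from instrument identifier to underlying exposure.
EXPOSURE_KEY = {"XYZ": "XYZ", "XYZ.alt": "XYZ", "XYZ-swap": "XYZ"}

# Three decisions from three separate groups, one per identifier.
decisions = [("XYZ", 80), ("XYZ.alt", 80), ("XYZ-swap", 80)]


def passes_individually(decisions, limit):
    """Per-decision check: each decision is evaluated in isolation."""
    return all(size <= limit for _, size in decisions)


def aggregate_exposure(decisions, exposure_key):
    """Sum sizes per underlying exposure across all pending decisions."""
    totals = defaultdict(int)
    for instrument, size in decisions:
        totals[exposure_key.get(instrument, instrument)] += size
    return dict(totals)


def breaches(decisions, exposure_key, limit):
    """Return exposures whose combined size exceeds the limit."""
    totals = aggregate_exposure(decisions, exposure_key)
    return {key: total for key, total in totals.items() if total > limit}
```

In this example every decision passes on its own while the combined exposure is 240 against a limit of 100. Closing the TOCTOU gap would additionally require the aggregate check and the reservation of capacity to happen atomically, which this sketch does not attempt.
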