6af8a6ee10
Break the monolithic 3249-line findings file into 29 individual files, one per experiment. Each file is named YYYY-MM-DD-NN-slug.md for easy chronological sorting and discovery. No content changes — purely structural reorganization.
175 lines
11 KiB
Markdown
# Finding 29: Adversarial manipulation analysis (NEW task type): GPT-5 is most exhaustive and systematic; Opus produces qualitatively different attack vectors with system-level thinking; Sonnet is adequate but less creative

**Date:** 2026-05-05
**Task:** Identify adversarial manipulation paths in gargoyle's `aggregation.md` (193 lines):
how a misbehaving, compromised, or buggy upstream component could exploit the
aggregator's design guarantees to produce harmful trading outcomes that bypass
downstream safety controls.
**How we used them:** Same document (full text) + same focused analytical question to all
3 models via HAI proxy. Highly structured prompt specifying 5 categories of adversarial
manipulation (signal injection, timing manipulation, capacity weaponization, state
corruption via crash, audit evasion). Required specific output format per finding
(attack vector, mechanism, exploit, why downstream controls miss it, severity). No tools,
no project context beyond the document itself.

| Model | Time | Output tokens | Reasoning tokens | Attack vectors found | Critical | High | Medium |
|---|---|---|---|---|---|---|---|
| Claude Sonnet 4.6 | 27s | 1,257 | (internal) | 10 | 3 | 5 | 2 |
| Claude Opus 4.6 | 84s | 3,662 | (internal) | 12 | 5 | 5 | 0 |
| GPT-5 | 111s | 8,808 | 6,336 | 15 | 2 | 10 | 3 |

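The token and finding counts in the table give a rough cost-per-finding figure. A minimal sketch in Python, using only the numbers from the table; it assumes the "Output tokens" column is the right numerator and takes no position on whether reasoning tokens are included in it:

```python
# Figures copied from the results table: (output tokens, attack vectors found).
results = {
    "Claude Sonnet 4.6": (1257, 10),
    "Claude Opus 4.6": (3662, 12),
    "GPT-5": (8808, 15),
}


def tokens_per_finding(output_tokens: int, findings: int) -> float:
    """Average number of output tokens spent per reported attack vector."""
    return output_tokens / findings


per_finding = {
    model: round(tokens_per_finding(tokens, count), 1)
    for model, (tokens, count) in results.items()
}

for model, value in per_finding.items():
    print(f"{model}: {value} output tokens per finding")
# Claude Sonnet 4.6: 125.7 output tokens per finding
# Claude Opus 4.6: 305.2 output tokens per finding
# GPT-5: 587.2 output tokens per finding
```

This is a measure of verbosity per finding, not of quality: a longer finding may simply carry a more concrete exploit sequence.
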
**What they found, common ground (all 3 identified):**
- Primary signal hijacking via ranking manipulation (last-tick injection in
  time-windowed mode to control decision parameters)
- Threshold gaming via signal replay/duplication (no deduplication means N
  identical signals satisfy "N confirmations")
- Capacity flooding to force premature completion or deny legitimate trades
- Strategic crash to erase unfavorable in-flight groups
- Timeout-masqueraded manipulation (making attacks look like normal system behavior
  in the audit trail)

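The threshold-gaming item can be illustrated with a toy model. The sketch below is not taken from `aggregation.md`; the `Signal` fields and both counting functions are hypothetical, and it assumes that "independent" means "distinct source, instrument, and direction":

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    source: str      # hypothetical: upstream component that emitted the signal
    instrument: str
    direction: str   # "BUY" or "SELL"
    signal_id: str   # hypothetical: identifier assigned by the emitter


def naive_confirmations(signals: list[Signal]) -> int:
    """Counts every signal, so replays inflate the confirmation count."""
    return len(signals)


def independent_confirmations(signals: list[Signal]) -> int:
    """Counts each (source, instrument, direction) combination once."""
    return len({(s.source, s.instrument, s.direction) for s in signals})


threshold = 5
# One upstream component replays the same signal five times.
replayed = [Signal("strategy-A", "XYZ", "BUY", "sig-1")] * 5

print(naive_confirmations(replayed) >= threshold)        # True: threshold "met"
print(independent_confirmations(replayed) >= threshold)  # False: one real confirmation
```

Whether source-level deduplication is the right independence test for gargoyle is a design question the sketch does not answer; it only shows why counting raw arrivals is gameable.
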
**GPT-5 unique findings (not in either Claude model):**
- **Direction flip against majority via ranking:** In "most recent" ranking,
  emit multiple SELL confirmations, then inject a late BUY. The BUY becomes
  primary and the decision contradicts the bulk of evidence. Distinct from
  general primary hijack because it's specifically about *directional* reversal.
- **Late-arrival exclusion of counter-signals:** Time signals so countervailing
  signals arrive just after group destruction, ensuring the decision is formed
  without dissenting inputs that would have altered ranking.
- **Capacity filter to curate the audit set:** Pre-fill the buffer with chosen
  signals so riskier alternatives cannot be included before capacity fires;
  the contributing signals list looks clean.
- **Timer nullification by crash:** Crash just before a timeout that would
  force-complete an unfavorable decision. The timer becomes a no-op on restart,
  and no decision or expiry event is emitted.
- **Decision drop via induced forwarding failure:** Exploit the "Decision
  forwarding fails: Decision is lost" failure mode to selectively suppress
  protective decisions (stops, hedges) with no automatic retry.
- **Crash to erase evidence of contrary signals:** Post-crash, submit a
  fresh group that completes quickly; the audit shows only the new set, not the
  earlier contradictory pre-crash signals.

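The direction-flip finding is easy to reproduce in a toy model. The sketch below is an illustration only: the `Signal` shape, the "most recent" ranking function, and the majority check are assumptions for the example, not the aggregator's actual code:

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    direction: str    # "BUY" or "SELL"
    timestamp: float  # hypothetical arrival time in seconds


def primary_most_recent(group: list[Signal]) -> Signal:
    """'Most recent' ranking: the latest signal becomes primary."""
    return max(group, key=lambda s: s.timestamp)


def majority_direction(group: list[Signal]) -> str:
    """Direction backed by the largest number of signals in the group."""
    return Counter(s.direction for s in group).most_common(1)[0][0]


def contradicts_majority(group: list[Signal]) -> bool:
    """True when the primary signal disagrees with the bulk of the evidence."""
    return primary_most_recent(group).direction != majority_direction(group)


# Four SELL confirmations followed by one late BUY.
group = [Signal("SELL", t) for t in (1.0, 2.0, 3.0, 4.0)] + [Signal("BUY", 4.9)]

print(primary_most_recent(group).direction)  # BUY
print(majority_direction(group))             # SELL
print(contradicts_majority(group))           # True
```

A check like `contradicts_majority` would be one possible place to flag the decision for review rather than forward it unchanged.
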
**Claude Opus unique findings (not in either other model):**
- **Instrument fragmentation to multiply position size:** Emit signals for
  economically equivalent exposures using different instrument identifiers.
  Each gets its own group and each produces a separate decision, bypassing
  per-group capacity limits. The combined position exceeds what any single group
  would allow. Identifies TOCTOU at the fan-in to PortfolioRisk.
- **Forced stale decision via timer exploitation:** Emit one signal at a
  favorable price spike known to be transient, then deliberately withhold
  further signals. The timer force-completes with a stale price. The entry price
  WAS valid when the signal was generated, and PortfolioRisk doesn't check
  staleness of decision prices.
- **Timeout prevention / keep-alive suppression:** Manipulate the market data
  feed to suppress signals that would reach threshold N. The group expires
  normally: a denial-of-trading attack disguised as insufficient confirmation.
- **Crash-restart duplicate decisions:** Crash after the decision is forwarded
  but before the strategy reflects it. Both restart "clean": the strategy re-emits
  signals and the aggregator produces a second decision with a fresh ID. The same
  trade executes twice. PortfolioRisk can't deduplicate because the IDs are different.
- **Force-complete with insufficient confirmation (capacity < threshold):**
  If the capacity limit is lower than the threshold, hitting capacity ALWAYS
  force-completes before the predicate is satisfied. With a capacity of 3, for
  example, this fundamentally changes a 5-confirmation strategy into a
  3-confirmation strategy.
- **Pattern predicate as arbitrary decision trigger:** If the adversary controls
  predicate logic (via strategy configuration), they can make pattern-complete
  trigger on any single signal while the audit shows algorithm=pattern-complete
  and reason=:predicate. This is a trust boundary between configuration and execution.

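The capacity-below-threshold case can be sketched as a small simulation. The function names and the order of the checks are assumptions made for illustration; the spec's actual completion logic is not reproduced here:

```python
from typing import Optional


def completion_reason(count: int, threshold: int, capacity: int) -> Optional[str]:
    """Why a group holding `count` signals would complete, or None if it stays open."""
    if count >= threshold:
        return "predicate"   # enough confirmations arrived
    if count >= capacity:
        return "capacity"    # buffer is full, group is force-completed
    return None


def first_completion(threshold: int, capacity: int) -> tuple[int, str]:
    """Feed signals one at a time and report when and why the group completes."""
    count = 0
    while True:
        count += 1
        reason = completion_reason(count, threshold, capacity)
        if reason is not None:
            return count, reason


def validate_config(threshold: int, capacity: int) -> None:
    """Hypothetical start-up check that rejects the weaponizable configuration."""
    if capacity < threshold:
        raise ValueError(
            f"capacity ({capacity}) is below threshold ({threshold}): "
            "groups would always force-complete before confirmation"
        )


print(first_completion(threshold=5, capacity=3))  # (3, 'capacity')
print(first_completion(threshold=5, capacity=8))  # (5, 'predicate')
```

A configuration check along the lines of `validate_config` is one possible mitigation; whether it belongs in the aggregator or in strategy configuration loading is left open.
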
**Claude Sonnet unique findings (not in either other model):**
- **Cross-group timing coordination:** Coordinate signal injection across
  multiple instruments to synchronize completion times, creating a burst of
  correlated decisions that overwhelms PortfolioRisk's individually safe
  evaluations. (NOTE: Opus found a similar concept, instrument fragmentation,
  but framed it differently: Opus focused on position multiplication via
  instrument aliasing, Sonnet focused on burst timing overwhelming evaluation.)
- **Multi-strategy attack distribution:** Spread manipulation across multiple
  isolated strategy aggregators so no single aggregator's behavior looks
  abnormal while the cumulative effect is harmful.

**Quality assessment:**
- **GPT-5** produced the most findings (15) with the most systematic coverage
  across all 5 prompt categories. Its strength was in identifying SPECIFIC
  INTERLEAVINGS: exactly how timing, state, and ranking mechanisms interact
  to produce exploits. The direction-flip finding (#3) and the late-arrival
  exclusion finding (#6) show precise temporal reasoning about when signals
  arrive relative to group lifecycle events. The "decision drop via forwarding
  failure" finding exploits a DOCUMENTED failure mode (from the failure table)
  as an offensive weapon, turning a recovery mechanism into an attack vector.
  Every finding references specific mechanisms from the spec.
- **Claude Opus** produced 12 findings with the most architecturally creative
  attacks. The instrument fragmentation attack is the most SYSTEMICALLY
  dangerous finding across all three models: it's not about manipulating one
  group but about the RELATIONSHIP between groups, and it identifies a
  TOCTOU vulnerability at the PortfolioRisk fan-in point that no other model
  found. The crash-restart duplication attack is also architecturally novel:
  it exploits the "clean state" guarantee as a weapon for invisible trade
  doubling. Opus consistently reasons about the system BOUNDARY (aggregator
  → PortfolioRisk handoff) rather than just within-component mechanics. The
  pattern-predicate trust boundary finding is uniquely about CONFIGURATION
  as an attack surface.
- **Claude Sonnet** produced 10 findings in 27s, which is extremely efficient
  (about 126 output tokens per finding: 1,257 / 10). Findings were adequate and
  covered all 5 categories, but lacked the specificity of GPT-5 and the
  architectural creativity of Opus. Several findings were somewhat generic
  (e.g., "crash at strategic moments" without specifying exactly WHEN relative
  to group lifecycle). The cross-group coordination and multi-strategy
  distribution findings show system-level thinking but are stated at a higher
  abstraction level without concrete exploit sequences.

**Key insight: "adversarial manipulation analysis" as a task type**
This is qualitatively different from all previous analytical lenses tested.
Previous tasks asked models to find problems WITH the design (assumptions,
races, incoherences). This task asks models to find ways to USE the design
AGAINST itself, which is a creative/generative adversarial task. Results:

- **GPT-5** treats it as an exhaustive enumeration exercise: it systematically
  walks through each mechanism and asks "how could this be abused?" High
  count (15), thorough coverage, but some findings are minor variations of
  each other (e.g., crash-related findings #10, #12, #15 share the same core
  mechanism). Reasoning tokens (6,336) used for both generation and verification.
- **Opus** treats it as a creative design exercise: it asks "what would a
  smart adversary do that the designer didn't consider?" Fewer findings (12)
  but several are genuinely novel attack concepts (instrument fragmentation,
  crash-restart duplication, predicate trust boundary) that require reasoning
  about the SYSTEM rather than the COMPONENT. Opus also provided a summary
  table and a systemic conclusion about the root design weaknesses.
- **Sonnet** treats it as a categorization exercise: it fills each prompt
  category with plausible attacks but at a higher abstraction level. Fast
  and adequate for a first pass but wouldn't surprise a security reviewer.

**Comparison to "predictable exploit window" (Finding #18):**
Finding #18 noted that Opus uniquely identified predictable exploit windows
in escalation-policy.md. Here, Opus again shows the strongest adversarial
creativity: the instrument fragmentation attack and crash-restart duplication
are both about exploiting DESIGN GUARANTEES (per-instrument grouping, clean
restart) as weapons. This confirms that Opus's strength on adversarial analysis
is a CONSISTENT PATTERN, not document-specific.

GPT-5 excels when the adversarial task is framed as "enumerate all possible
abuses of each mechanism" (systematic coverage). Opus excels when the task
requires "invent novel attack concepts that exploit design boundaries"
(creative adversarial thinking).

**Model hierarchy for adversarial manipulation analysis:**
1. GPT-5: most thorough enumeration, best at mechanism-level exploitation (15)
2. Opus: most creative, finds system-boundary attacks others miss (12)
3. Sonnet: adequate first pass, fast, but less specific (10)

**Practical implication:** For security-oriented architecture review:
- Run GPT-5 for comprehensive attack surface enumeration
- Run Opus for novel/creative attack vectors that exploit design boundaries
- Sonnet is sufficient only as a quick initial screen
- The UNION of GPT-5 + Opus findings (removing overlaps) would produce the
  most complete adversarial analysis

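Taking the union with overlaps removed could look like the sketch below. It assumes each finding has already been normalized to a short label by a reviewer; the labels are illustrative shorthand for findings in this file, and exact-label matching stands in for what is really a judgment call about which findings overlap:

```python
def merge_findings(*model_findings: tuple[str, list[str]]) -> dict[str, list[str]]:
    """Union of findings across models, recording which models reported each one."""
    merged: dict[str, list[str]] = {}
    for model, findings in model_findings:
        for label in findings:
            merged.setdefault(label, []).append(model)
    return merged


# Hypothetical normalized labels, a small subset for illustration.
gpt5 = ["primary-signal-hijack", "threshold-gaming", "direction-flip"]
opus = ["primary-signal-hijack", "threshold-gaming", "instrument-fragmentation"]

merged = merge_findings(("GPT-5", gpt5), ("Opus", opus))

print(len(merged))                         # 4
print(merged["threshold-gaming"])          # ['GPT-5', 'Opus']
print(merged["instrument-fragmentation"])  # ['Opus']
```

Keeping the reporting models alongside each label preserves the "who found what" attribution that the sections above rely on.
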
**New finding about the aggregator itself:** Several attacks identified by
multiple models point to real design weaknesses worth addressing:
1. No signal deduplication/independence validation (all 3 models)
2. Primary signal determines all decision parameters regardless of group
   composition (all 3 models)
3. Transient state + no replay = perfect adversarial erasure tool (all 3)
4. Capacity/timeout treated as normal events even when weaponized (all 3)
5. No cross-group correlation at aggregator level (Opus + Sonnet)
6. TOCTOU at PortfolioRisk fan-in for concurrent decisions (Opus)
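
Weakness 5 (no cross-group correlation) is what makes instrument fragmentation work. A minimal sketch of one possible check, assuming a hypothetical alias map from instrument identifiers to an underlying exposure and a hypothetical per-underlying limit; none of these names come from the spec:

```python
from collections import defaultdict

# Hypothetical alias map: instrument identifier -> underlying exposure.
UNDERLYING = {"XYZ": "XYZ", "XYZ.ALT": "XYZ", "XYZ-SWAP": "XYZ", "ABC": "ABC"}


def exposure_by_underlying(
    decisions: list[tuple[str, float]], alias_map: dict[str, str]
) -> dict[str, float]:
    """Sum decision quantities per underlying instead of per instrument id."""
    totals: dict[str, float] = defaultdict(float)
    for instrument, quantity in decisions:
        totals[alias_map.get(instrument, instrument)] += quantity
    return dict(totals)


def breaches(
    decisions: list[tuple[str, float]], alias_map: dict[str, str], limit: float
) -> dict[str, float]:
    """Underlyings whose combined exposure exceeds the limit."""
    totals = exposure_by_underlying(decisions, alias_map)
    return {name: qty for name, qty in totals.items() if abs(qty) > limit}


limit = 100.0
# Each decision is within the limit on its own; three of them alias the same exposure.
decisions = [("XYZ", 80.0), ("XYZ.ALT", 80.0), ("XYZ-SWAP", 80.0), ("ABC", 50.0)]

print(all(abs(qty) <= limit for _, qty in decisions))  # True
print(breaches(decisions, UNDERLYING, limit))          # {'XYZ': 240.0}
```

The sketch sums over a fixed batch; it does not address the TOCTOU issue in weakness 6, which would additionally require the check and the position update to be atomic with respect to concurrent decisions.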