model-research/findings/2026-05-08-47-emergent-behavior-rule-composition.md

### 47. Emergent behavior from rule composition: a new analytical lens; GPT-5 excels at identifying feedback loops and systemic dynamics; Opus finds the most architecturally consequential single findings

**Date:** 2026-05-08
**Task:** Identify emergent behaviors from rule composition in gargoyle's `aggregation.md`
(239 lines) — situations where individually correct rules interact to produce undocumented
or unintended system behaviors.
**How we used them:** Same document (full text) + same focused analytical prompt to 3
models via HAI proxy. Prompt specified 5 categories: combinatorial state explosions, feedback
loops through external systems, resource competition between isolated instances, temporal
composition effects, and policy contradiction under composition. Required structured output
per finding (components, individual correctness, emergent behavior, why doc misses it,
severity). No tools, no project context beyond the document.
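
The required per-finding structure can be sketched as a small record type. This is an illustrative sketch only: the field names, the `Category` enum, and the `is_composition_finding` helper are assumptions derived from the prompt description above, not the prompt's actual schema, and the severity scale is left as free text because the scale is not given here.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    # The five categories named in the prompt description.
    COMBINATORIAL_STATE_EXPLOSION = 1
    EXTERNAL_FEEDBACK_LOOP = 2
    RESOURCE_COMPETITION = 3
    TEMPORAL_COMPOSITION = 4
    POLICY_CONTRADICTION = 5


@dataclass(frozen=True)
class Finding:
    category: Category
    components: tuple[str, ...]   # the rules or mechanisms that interact
    individual_correctness: str   # why each component is correct on its own
    emergent_behavior: str        # what the composition produces
    why_doc_misses_it: str
    severity: str                 # free text; the actual scale is an assumption


def is_composition_finding(f: Finding) -> bool:
    """A finding fits this lens only if at least two components interact."""
    return len(f.components) >= 2 and bool(f.emergent_behavior.strip())


example = Finding(
    category=Category.TEMPORAL_COMPOSITION,
    components=("stop-loss exit", "pattern group buffering"),
    individual_correctness="Each rule behaves as specified in isolation.",
    emergent_behavior="Re-entry into a position that was just risk-exited.",
    why_doc_misses_it="The state machine is described per group.",
    severity="CRITICAL",
)
```

Requiring at least two interacting components is one way to screen out single-rule bugs, which belong to the other lenses.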

| Model | Time | Output tokens | Reasoning tokens | Findings |
|---|---|---|---|---|
| GPT-5 | 76.0s | 6,956 | 4,480 | 13 |
| Claude Opus 4.6 | 93.2s | 3,956 | (internal) | 8 |
| Claude Sonnet 4.0 | 31.8s | 1,296 | (internal) | 6 |

**Note:** The run used Sonnet 4.0 rather than 4.6 because of how the model name
resolved. It therefore serves as a comparison point against the Sonnet 4.6 results in
previous findings.

**What they found — common ground (all 3 identified):**
- Synchronized timeout thundering herd when correlated market events start many
  time-windowed timers simultaneously, overwhelming PortfolioRisk fan-in
- Per-strategy capacity limits don't bound system-wide memory (multiplicative
  composition across strategies × instruments)
- Multiple strategies producing concurrent decisions for the same instrument or risk
  budget without coordination (capital invisibility during buffering)
- Feedback loops between downstream rejection/processing and upstream signal/
  decision generation
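
The first two common-ground findings are easy to reproduce numerically. The sketch below uses entirely hypothetical numbers (1000 groups, a 500 ms window, 10 ms buckets) and assumes one group per (strategy, instrument) pair with a per-group capacity; none of these values come from `aggregation.md`.

```python
from collections import Counter


def peak_expirations(start_times_ms, timeout_ms, bucket_ms=10):
    """Largest number of timers expiring within any single time bucket."""
    buckets = Counter((start + timeout_ms) // bucket_ms for start in start_times_ms)
    return max(buckets.values())


def worst_case_buffered(strategies, instruments, per_group_capacity):
    """Upper bound on buffered signals if every (strategy, instrument) group fills."""
    return strategies * instruments * per_group_capacity


GROUPS = 1000      # hypothetical number of open groups
TIMEOUT_MS = 500   # hypothetical shared window length

# Independent arrivals: group starts spread evenly across one window.
spread_starts = [i * TIMEOUT_MS // GROUPS for i in range(GROUPS)]
# Correlated market event: every group starts within the same 5 ms.
correlated_starts = [i % 5 for i in range(GROUPS)]

spread_peak = peak_expirations(spread_starts, TIMEOUT_MS)          # 20
correlated_peak = peak_expirations(correlated_starts, TIMEOUT_MS)  # 1000
```

With these assumed inputs, the same 1000 timers peak at 20 expirations per 10 ms bucket when starts are spread out and at 1000 when they are correlated. Likewise, `worst_case_buffered(20, 500, 100)` gives 1,000,000 buffered signals even though each group individually respects its limit of 100.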

**GPT-5 unique findings (not in either other model):**
- **Force-complete vs expire semantics create systematic portfolio bias (#5):**
  Under stress, force-complete strategies always consume risk budget while expire
  strategies drop out. Portfolio composition becomes dominated by whichever timeout
  semantics is more aggressive, independent of alpha quality. Sophisticated insight
  about fairness across heterogeneous algorithm configurations.
- **Cross-strategy state coupling via position changes (#12):** Fast exit decisions
  alter position-dependent signal generation upstream, destabilizing pattern groups
  in other strategies. The document assumes signals are exogenous but they're
  actually coupled through position state.
- **End-to-end latency variance causes duplicate economic intent (#13):** A slow
  decision acts on stale prices while a newer decision from another strategy passes
  risk on fresh prices — temporary overexposure followed by costly correction.
- **Decision identity regeneration + forwarding failures = no idempotency (#6):**
  Lost decisions re-created later with different IDs but same economic intent cause
  duplicate execution across the fan-in with no deduplication mechanism.
- **Crash/restart phase alignment creates persistent periodic spikes (#9):** After
  fleet restart, all strategies begin windows simultaneously and stay phase-aligned.
- **Telemetry storms perturb timer accuracy (#10):** Synchronous event emission
  during herds delays timer callbacks, changing which path fires.
- **Combinatorial O(S×I) groups under market-wide shocks (#11):** Timer queues and
  GC thrash cause non-deterministic path selection between predicate/timeout/capacity.
- **Risk-budget race favoring fast strategies over confirmatory ones (#8):** Arrival
  order determines which strategies consume budget, not alpha quality.
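
The risk-budget race (#8) can be shown with a toy allocator. This is a sketch under stated assumptions: the `Decision` fields, the greedy first-fit rule, and all numbers are hypothetical and are not taken from the document or from PortfolioRisk's actual logic. The alpha-ordered run is an offline contrast, not a proposed fix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    strategy: str
    arrival_ms: int
    risk: float
    expected_alpha: float


def allocate(decisions, budget, order_key):
    """Approve decisions greedily in the given order until the budget runs out."""
    approved, remaining = [], budget
    for d in sorted(decisions, key=order_key):
        if d.risk <= remaining:
            approved.append(d.strategy)
            remaining -= d.risk
    return approved


decisions = [
    Decision("fast_momentum", arrival_ms=5, risk=60.0, expected_alpha=0.2),
    Decision("slow_confirmatory", arrival_ms=400, risk=60.0, expected_alpha=0.9),
]

# Arrival order: the fast strategy takes the budget before the slow one arrives.
by_arrival = allocate(decisions, budget=100.0, order_key=lambda d: d.arrival_ms)
# Hypothetical alpha-ranked order, shown only for contrast.
by_alpha = allocate(decisions, budget=100.0, order_key=lambda d: -d.expected_alpha)
```

Here `by_arrival` is `["fast_momentum"]` while `by_alpha` is `["slow_confirmatory"]`: with a budget that fits only one decision, latency alone decides which strategy trades.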

**Claude Opus unique findings (not in either other model):**
- **Stop-loss defeated by pattern aggregation temporal composition (#5, CRITICAL):**
  A stop-loss fires immediately (closing position), but a pattern strategy's group
  is still buffering entry signals for the same instrument. When the pattern
  completes (with stale, pre-crash signal data), it forwards a re-entry decision.
  PortfolioRisk sees an empty position and approves — the system immediately
  re-enters a position that was just risk-exited. **The safety mechanism (stop-loss)
  is rendered ineffective by temporal composition with a slow strategy.** This is
  the most architecturally consequential finding across all three models.
- **Crash probability correlates with opportunity quality (#6):** Crashes correlate
  with high-volatility (memory pressure from signal bursts), but high-volatility is
  also when trading opportunities are most profitable. The system architecturally
  selects AGAINST its best opportunities through correlated crash-and-miss cycles.
  Novel "selection bias" framing not seen in other models.
- **Pre-crash forwarded decisions + post-restart new decisions create duplicates (#7):**
  A decision forwarded successfully to PortfolioRisk before a crash is still in-flight
  downstream. After restart, new signals create a new group → new decision → duplicate
  entry with different decision ID. Unique insight about the boundary between
  aggregator responsibility and downstream responsibility.
- **Market-regime-driven bimodal completion clustering (#8):** The joint distribution
  of completion times across strategies is driven by shared market conditions (hidden
  common factor). System oscillates between burst-overload and timeout-cascade with
  limited healthy middle ground.
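
The stop-loss finding (#5) reduces to a three-step timeline, replayed below. The `PortfolioRisk` class here is a hypothetical stand-in that checks only current position size, which is the behavior the finding describes; its method names, sizes, and timestamps are assumptions made for illustration and do not reflect the real component's interface.

```python
from dataclasses import dataclass, field


@dataclass
class PortfolioRisk:
    """Hypothetical risk gate that inspects only the current position size."""
    position: int = 0
    max_position: int = 100
    log: list = field(default_factory=list)

    def exit_all(self, t: int, reason: str) -> None:
        self.log.append((t, f"exit {self.position} ({reason})"))
        self.position = 0

    def request_entry(self, t: int, size: int, signal_time: int) -> bool:
        # No memory of the recent risk exit and no check on signal age:
        # an empty position simply looks like free capacity.
        approved = self.position + size <= self.max_position
        if approved:
            self.position += size
        verdict = "approved" if approved else "rejected"
        self.log.append((t, f"entry {size} from signals@t={signal_time}: {verdict}"))
        return approved


def replay() -> PortfolioRisk:
    risk = PortfolioRisk(position=100)
    oldest_buffered_signal = 0              # t=0: pattern group starts buffering
    risk.exit_all(t=1, reason="stop-loss")  # t=1: crash, stop-loss closes the position
    # t=2: the pattern completes on pre-crash signals and forwards an entry.
    risk.request_entry(t=2, size=100, signal_time=oldest_buffered_signal)
    return risk


result = replay()
```

After the replay, `result.position` is back at 100: every step is individually valid, yet the position closed by the stop-loss at t=1 is reopened at t=2 on signals from t=0.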

**Claude Sonnet 4.0 unique findings:**
- **Cross-algorithm state explosion in pattern completion (#6):** Multi-instrument
  pattern strategies create implicit dependencies between groups that the per-group
  state machine can't represent. Somewhat generic — lacks the specific mechanism
  detail of the other models' findings.
- No other truly unique findings — Sonnet's 6 findings overlap substantially with
  the common ground. Less specific, more abstract framing.

**Quality assessment:**
- **GPT-5** produced the most findings (13) and the greatest breadth. Several
  findings identified systemic dynamics — portfolio composition bias from heterogeneous
  timeout semantics, position-dependent signal coupling creating feedback into
  aggregation, and arrival-order fairness violations. GPT-5 uniquely identified the
  telemetry/timer interference pattern and the combinatorial explosion at system scale.
  Its strongest contribution is identifying multiple distinct feedback loops through
  external systems (categories 2 and 5 of the prompt).
- **Claude Opus** produced fewer findings (8) but two are qualitatively superior to
  anything in GPT-5's output: the stop-loss-defeated-by-pattern-composition finding
  (a genuine safety mechanism failure) and the crash-correlates-with-opportunity-quality
  finding (a selection bias). Opus continues its pattern from Findings #11-#20 of
  identifying design TENSIONS and CONTRADICTIONS rather than just failure modes. The
  stop-loss finding is the single most important finding across all three models because
  it shows a risk management mechanism being defeated by the architecture's own
  composition rules.
- **Claude Sonnet 4.0** was fast (31.8s) but produced the weakest output. Only 6
  findings with less specificity and more overlap with common ground. The cross-algorithm
  state explosion finding (#6) was somewhat generic. Compared to Sonnet 4.6's
  performance in Finding #12 (17 assumptions, 85% of GPT-5) and Finding #14 (8 cross-
  component findings), Sonnet 4.0 here is notably weaker. This suggests the 4.0 → 4.6
  jump was significant for analytical depth.

**Key insight — "emergent behavior from rule composition" as an analytical lens:**

This task type has not been tested in previous findings. It differs from these earlier lenses:
- **Hidden assumptions** (what must be true for this to work?)
- **Race conditions** (what temporal interleavings cause bugs?)
- **Cross-component interactions** (what happens between components?)
- **Invariant violations** (what legal sequences break invariants?)

The new lens asks: **"What happens when many correct instances of this design operate
simultaneously in a correlated environment?"** This requires reasoning about:
1. Statistical properties of the composition (correlation, synchronization)
2. Shared-resource contention hidden by logical isolation abstractions
3. Feedback loops that cross the document's scope boundaries
4. Fairness and priority under resource competition

**What makes this lens uniquely valuable:**
Most previous analytical lenses find bugs in the DESIGN. This lens finds bugs in the
DEPLOYMENT — scenarios that are invisible at the single-instance design level but emerge
at scale. The aggregation document's state machine is CORRECT for a single group. The
findings here are about what happens when many correct state machines operate
simultaneously in a correlated world. This is the gap between "correct by construction"
and "correct in production."

**Model strengths for this lens:**
- GPT-5 is best at identifying SYSTEMIC dynamics — feedback loops, fairness violations,
  resource contention patterns, and cascading effects. It reasons about the system as a
  whole operating over time.
- Opus is best at identifying CONSEQUENTIAL compositions — finding the one interaction
  that defeats a safety mechanism or creates a selection bias. Fewer findings but higher
  architectural significance per finding.
- Sonnet 4.0 is insufficient for this task: its findings are too abstract and lack
  specific mechanisms. (Sonnet 4.6 would likely perform better, based on prior results.)

**Comparison to document-centric lenses:**

| Lens type | What it finds | Best model |
|---|---|---|
| Hidden assumptions | What the doc takes for granted | All (GPT-5 most) |
| Race conditions | Temporal bugs between components | GPT-5, Opus |
| Invariant violations | Legal paths that break rules | GPT-5 (precision) |
| Cross-doc consistency | Contradictions between docs | GPT-5, Opus |
| **Emergent composition** | **Scale/deployment bugs** | **GPT-5 (breadth), Opus (depth)** |

**Practical implication:** For any system that runs many instances of the same design
(microservices, per-user pipelines, per-strategy workers), the "emergent composition"
lens should be part of architecture review. It specifically targets the gap between
design correctness and production behavior — the class of bugs that unit tests, property
tests, and single-instance review all miss.