finding 47: emergent behavior from rule composition - new analytical lens

Tests a novel analytical lens on aggregation.md (239 lines): 'what happens
when many correct instances operate simultaneously in a correlated environment?'

Results: GPT-5 (13 findings, 76s) excels at systemic dynamics and feedback
loops. Opus (8 findings, 93s) finds the most consequential single findings
(stop-loss defeated by temporal composition, crash-opportunity correlation).
Sonnet 4.0 (6 findings, 32s) too abstract for this task.

Key insight: This lens finds DEPLOYMENT bugs invisible at design time -
the gap between 'correct by construction' and 'correct in production'.
claw
2026-05-08 02:06:25 -07:00
parent b5b5b64a40
commit f3266ccc13
@@ -0,0 +1,159 @@
### 47. Emergent behavior from rule composition: a new analytical lens; GPT-5 excels at identifying feedback loops and systemic dynamics; Opus finds the most architecturally consequential single findings
**Date:** 2026-05-08
**Task:** Identify emergent behaviors from rule composition in gargoyle's `aggregation.md`
(239 lines) — situations where individually correct rules interact to produce undocumented
or unintended system behaviors.
**How we used them:** Same document (full text) + same focused analytical prompt to 3
models via HAI proxy. Prompt specified 5 categories: combinatorial state explosions, feedback
loops through external systems, resource competition between isolated instances, temporal
composition effects, and policy contradiction under composition. Required structured output
per finding (components, individual correctness, emergent behavior, why doc misses it,
severity). No tools, no project context beyond the document.
| Model | Time | Output tokens | Reasoning tokens | Findings |
|---|---|---|---|---|
| GPT-5 | 76.0s | 6,956 | 4,480 | 13 |
| Claude Opus 4.6 | 93.2s | 3,956 | (internal) | 8 |
| Claude Sonnet 4.0 | 31.8s | 1,296 | (internal) | 6 |
**Note:** The run used Sonnet 4.0 rather than 4.6 because of how the model name resolved.
The result still serves as a comparison point against the Sonnet 4.6 results in previous
findings.
**What they found — common ground (all 3 identified):**
- Synchronized timeout thundering herd when correlated market events start many
time-windowed timers simultaneously, overwhelming PortfolioRisk fan-in
- Per-strategy capacity limits don't bound system-wide memory (multiplicative
composition across strategies × instruments)
- Multiple strategies producing concurrent decisions for same instrument/risk
budget without coordination (capital invisibility during buffering)
- Feedback loops between downstream rejection/processing and upstream signal/
decision generation
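
The multiplicative composition in the second bullet can be made concrete with a little arithmetic. This is a minimal sketch, not taken from `aggregation.md`: the class `StrategyLimits`, its field names, and all numbers are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StrategyLimits:
    """Hypothetical per-strategy limits; names and values are illustrative."""
    max_signals_per_group: int  # capacity limit enforced on each group
    bytes_per_signal: int       # assumed average size of a buffered signal


def worst_case_buffered_bytes(limits: list[StrategyLimits], instruments: int) -> int:
    """Upper bound when every strategy holds one full group per instrument."""
    return sum(
        s.max_signals_per_group * s.bytes_per_signal * instruments
        for s in limits
    )


# 20 strategies, each individually capped at 100 signals per group.
strategies = [StrategyLimits(max_signals_per_group=100, bytes_per_signal=512)] * 20

per_group_bytes = 100 * 512
total_bytes = worst_case_buffered_bytes(strategies, instruments=500)

print(per_group_bytes)  # 51200
print(total_bytes)      # 512000000
```

Each group stays within its 51,200-byte cap, yet a market-wide event that fills all 20 × 500 groups at once buffers about 512 MB. No single limit is violated, which is why the per-strategy limits alone do not bound the total.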
**GPT-5 unique findings (found by neither other model):**
- **Force-complete vs expire semantics create systematic portfolio bias (#5):**
Under stress, force-complete strategies always consume risk budget while expire
strategies drop out. Portfolio composition becomes dominated by whichever timeout
semantics is more aggressive, independent of alpha quality. Sophisticated insight
about fairness across heterogeneous algorithm configurations.
- **Cross-strategy state coupling via position changes (#12):** Fast exit decisions
alter position-dependent signal generation upstream, destabilizing pattern groups
in other strategies. The document assumes signals are exogenous but they're
actually coupled through position state.
- **End-to-end latency variance causes duplicate economic intent (#13):** A slow
decision acts on stale prices while a newer decision from another strategy passes
risk on fresh prices — temporary overexposure followed by costly correction.
- **Decision identity regeneration + forwarding failures = no idempotency (#6):**
Lost decisions re-created later with different IDs but same economic intent cause
duplicate execution across the fan-in with no deduplication mechanism.
- **Crash/restart phase alignment creates persistent periodic spikes (#9):** After
fleet restart, all strategies begin windows simultaneously and stay phase-aligned.
- **Telemetry storms perturb timer accuracy (#10):** Synchronous event emission
during herds delays timer callbacks, changing which path fires.
- **Combinatorial O(S×I) groups under market-wide shocks (#11):** Timer queues and
GC thrash cause non-deterministic path selection between predicate/timeout/capacity.
- **Risk-budget race favoring fast strategies over confirmatory ones (#8):** Arrival
order determines which strategies consume budget, not alpha quality.
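
The idempotency finding (#6) can be illustrated with a toy deduplication pass. This is a sketch under stated assumptions: the `Decision` fields, `make_decision`, and `intent_key` are hypothetical and do not come from the document. The intent-based key is shown only to contrast with ID-based deduplication; it is not a mitigation proposed by any of the models.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    decision_id: str
    strategy: str
    instrument: str
    side: str
    quantity: int


def make_decision(strategy: str, instrument: str, side: str, quantity: int) -> Decision:
    # Every creation gets a fresh ID, as in the regeneration scenario.
    return Decision(str(uuid.uuid4()), strategy, instrument, side, quantity)


def dedupe(decisions, key):
    """Keep the first decision seen for each key."""
    seen = set()
    kept = []
    for d in decisions:
        k = key(d)
        if k not in seen:
            seen.add(k)
            kept.append(d)
    return kept


def intent_key(d: Decision):
    # Hypothetical definition of "same economic intent".
    return (d.strategy, d.instrument, d.side, d.quantity)


lost = make_decision("momentum", "ACME", "BUY", 100)       # forwarding outcome unknown
recreated = make_decision("momentum", "ACME", "BUY", 100)  # re-created later, new ID

by_id = dedupe([lost, recreated], key=lambda d: d.decision_id)
by_intent = dedupe([lost, recreated], key=intent_key)

print(len(by_id), len(by_intent))  # 2 1
```

Deduplicating on `decision_id` lets both copies through because the IDs differ. A real intent key would also need a time window and a way to distinguish deliberate repeat orders, both of which are omitted here.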
**Claude Opus unique findings (found by neither other model):**
- **Stop-loss defeated by pattern aggregation temporal composition (#5, CRITICAL):**
A stop-loss fires immediately (closing position), but a pattern strategy's group
is still buffering entry signals for the same instrument. When the pattern
completes (with stale, pre-crash signal data), it forwards a re-entry decision.
PortfolioRisk sees an empty position and approves — the system immediately
re-enters a position that was just risk-exited. **The safety mechanism (stop-loss)
is rendered ineffective by temporal composition with a slow strategy.** This is
the most architecturally consequential finding across all three models.
- **Crash probability correlates with opportunity quality (#6):** Crashes correlate
with high volatility (memory pressure from signal bursts), but high volatility is
also when trading opportunities are most profitable. The system architecturally
selects AGAINST its best opportunities through correlated crash-and-miss cycles.
Novel "selection bias" framing not seen in the other models' output.
- **Pre-crash forwarded decisions + post-restart new decisions create duplicates (#7):**
A decision forwarded successfully to PortfolioRisk before a crash is still in-flight
downstream. After restart, new signals create a new group → new decision → duplicate
entry with different decision ID. Unique insight about the boundary between
aggregator responsibility and downstream responsibility.
- **Market-regime-driven bimodal completion clustering (#8):** The joint distribution
of completion times across strategies is driven by shared market conditions (hidden
common factor). System oscillates between burst-overload and timeout-cascade with
limited healthy middle ground.
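
The stop-loss finding (#5) can be replayed as a short timeline. This is a toy model, assuming a risk gate that approves any entry while the position is flat. The `PortfolioRisk` class here is a stand-in written for this example, not the real component, and the timestamps and quantities are invented.

```python
from dataclasses import dataclass, field


@dataclass
class PortfolioRisk:
    """Toy risk gate: approves an entry whenever the position is flat."""
    position: int = 0
    log: list = field(default_factory=list)

    def exit_all(self, t: int, reason: str) -> None:
        self.position = 0
        self.log.append((t, "EXIT", reason))

    def submit_entry(self, t: int, qty: int, newest_signal_t: int) -> bool:
        # Looks only at the current position. It does not ask why the
        # position is flat or how old the decision's signals are.
        if self.position == 0:
            self.position = qty
            self.log.append((t, "ENTER", f"newest signal t={newest_signal_t}"))
            return True
        return False


risk = PortfolioRisk(position=100)

# t=5..8: the pattern strategy buffers entry signals (pre-crash data).
buffered_signal_times = [5, 6, 7, 8]

# t=10: the price drops; the stop-loss decision is forwarded immediately.
risk.exit_all(t=10, reason="stop-loss")

# t=12: the pattern group completes and forwards a decision built from
# the signals it buffered before the stop-loss fired.
approved = risk.submit_entry(t=12, qty=100, newest_signal_t=max(buffered_signal_times))

print(approved, risk.position)  # True 100
for event in risk.log:
    print(event)
```

Both components behave correctly in isolation: the stop-loss exits at t=10 and the gate approves an entry against a flat position at t=12. The composition re-enters on signals that predate the exit.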
**Claude Sonnet 4.0 unique findings:**
- **Cross-algorithm state explosion in pattern completion (#6):** Multi-instrument
pattern strategies create implicit dependencies between groups that the per-group
state machine can't represent. Somewhat generic — lacks the specific mechanism
detail of the other models' findings.
- No other truly unique findings: Sonnet's 6 findings overlap substantially with
the common ground and are framed more abstractly, with less specific mechanisms.
**Quality assessment:**
- **GPT-5** produced the most findings (13) with the highest breadth. Several
findings identified systemic dynamics — portfolio composition bias from heterogeneous
timeout semantics, position-dependent signal coupling creating feedback into
aggregation, and arrival-order fairness violations. GPT-5 uniquely identified the
telemetry/timer interference pattern and the combinatorial explosion at system scale.
Its strongest contribution is identifying multiple distinct feedback loops through
external systems (categories 2 and 5 of the prompt).
- **Claude Opus** produced fewer findings (8) but two are qualitatively superior to
anything in GPT-5's output: the stop-loss-defeated-by-pattern-composition finding
(a genuine safety mechanism failure) and the crash-correlates-with-opportunity-quality
finding (a selection bias). Opus continues its pattern from Findings #11-#20 of
identifying design TENSIONS and CONTRADICTIONS rather than just failure modes. The
stop-loss finding is the single most important finding across all three models because
it shows a risk management mechanism being defeated by the architecture's own
composition rules.
- **Claude Sonnet 4.0** was fast (31.8s) but produced the weakest output: only 6
findings, with less specificity and more overlap with the common ground. The cross-algorithm
state explosion finding (#6) was somewhat generic. Compared to Sonnet 4.6's
performance in Finding #12 (17 assumptions, 85% of GPT-5) and Finding #14 (8 cross-component
findings), Sonnet 4.0 here is notably weaker. Although those were different tasks, this
suggests the 4.0 → 4.6 jump was significant for analytical depth.
**Key insight — "emergent behavior from rule composition" as an analytical lens:**
This task type has not been tested in previous findings. It differs from:
- **Hidden assumptions** (what must be true for this to work?)
- **Race conditions** (what temporal interleavings cause bugs?)
- **Cross-component interactions** (what happens between components?)
- **Invariant violations** (what legal sequences break invariants?)
The new lens asks: **"What happens when many correct instances of this design operate
simultaneously in a correlated environment?"** This requires reasoning about:
1. Statistical properties of the composition (correlation, synchronization)
2. Shared-resource contention hidden by logical isolation abstractions
3. Feedback loops that cross the document's scope boundaries
4. Fairness and priority under resource competition
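
Point 1 (correlation and synchronization) can be sketched with a small simulation of timer expirations. Everything here is assumed for illustration: 1,000 groups, a 30-second timeout, one-second buckets, and a market event at t=10 that starts every timer within 0.2 seconds. None of these values come from the document.

```python
import random
from collections import Counter


def peak_expirations(start_times, timeout_s: float, bucket_s: float = 1.0) -> int:
    """Largest number of timers expiring within one time bucket."""
    buckets = Counter(int((t + timeout_s) // bucket_s) for t in start_times)
    return max(buckets.values())


rng = random.Random(0)  # fixed seed so the run is repeatable
n_groups = 1000
timeout_s = 30.0

# Uncorrelated world: groups open at unrelated times across a minute.
independent = [rng.uniform(0.0, 60.0) for _ in range(n_groups)]

# Correlated world: one market event at t=10 opens every group within 0.2s.
correlated = [10.0 + rng.uniform(0.0, 0.2) for _ in range(n_groups)]

peak_independent = peak_expirations(independent, timeout_s)
peak_correlated = peak_expirations(correlated, timeout_s)

print("independent peak per second:", peak_independent)
print("correlated peak per second:", peak_correlated)
```

In the correlated case all 1,000 timeouts land in the same one-second bucket, while the independent case spreads them over roughly sixty buckets. Each timer is identical in both runs; only the joint distribution of start times differs.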
**What makes this lens uniquely valuable:**
Most previous analytical lenses find bugs in the DESIGN. This lens finds bugs in the
DEPLOYMENT — scenarios that are invisible at the single-instance design level but emerge
at scale. The aggregation document's state machine is CORRECT for a single group. The
findings here are about what happens when many correct state machines operate
simultaneously in a correlated world. This is the gap between "correct by construction"
and "correct in production."
**Model strengths for this lens:**
- GPT-5 is best at identifying SYSTEMIC dynamics — feedback loops, fairness violations,
resource contention patterns, and cascading effects. It reasons about the system as a
whole operating over time.
- Opus is best at identifying CONSEQUENTIAL compositions — finding the one interaction
that defeats a safety mechanism or creates a selection bias. Fewer findings but higher
architectural significance per finding.
- Sonnet 4.0 is inadequate for this task: its findings are too abstract and lack specific
mechanisms. (Sonnet 4.6 would likely perform better based on prior results.)
**Comparison to document-centric lenses:**
| Lens type | What it finds | Best model |
|---|---|---|
| Hidden assumptions | What the doc takes for granted | All (GPT-5 most) |
| Race conditions | Temporal bugs between components | GPT-5, Opus |
| Invariant violations | Legal paths that break rules | GPT-5 (precision) |
| Cross-doc consistency | Contradictions between docs | GPT-5, Opus |
| **Emergent composition** | **Scale/deployment bugs** | **GPT-5 (breadth), Opus (depth)** |
**Practical implication:** For any system that runs many instances of the same design
(microservices, per-user pipelines, per-strategy workers), the "emergent composition"
lens should be part of architecture review. It specifically targets the gap between
design correctness and production behavior — the class of bugs that unit tests, property
tests, and single-instance review all miss.