From f3266ccc13875943840f589a223b09da2518dedd Mon Sep 17 00:00:00 2001
From: claw
Date: Fri, 8 May 2026 02:06:25 -0700
Subject: [PATCH] finding 47: emergent behavior from rule composition - new analytical lens

Tests a novel analytical lens on aggregation.md (239 lines): 'what
happens when many correct instances operate simultaneously in a
correlated environment?'

Results: GPT-5 (13 findings, 76s) excels at systemic dynamics and
feedback loops. Opus (8 findings, 93s) finds the most consequential
single findings (stop-loss defeated by temporal composition,
crash-opportunity correlation). Sonnet 4.0 (6 findings, 32s) too
abstract for this task.

Key insight: This lens finds DEPLOYMENT bugs invisible at design time -
the gap between 'correct by construction' and 'correct in production'.
---
 ...8-47-emergent-behavior-rule-composition.md | 159 ++++++++++++++++++
 1 file changed, 159 insertions(+)
 create mode 100644 findings/2026-05-08-47-emergent-behavior-rule-composition.md

diff --git a/findings/2026-05-08-47-emergent-behavior-rule-composition.md b/findings/2026-05-08-47-emergent-behavior-rule-composition.md
new file mode 100644
index 0000000..759e1e3
--- /dev/null
+++ b/findings/2026-05-08-47-emergent-behavior-rule-composition.md
@@ -0,0 +1,159 @@
+### 47. Emergent behavior from rule composition: a new analytical lens; GPT-5 excels at identifying feedback loops and systemic dynamics; Opus finds the most architecturally consequential single findings
+
+**Date:** 2026-05-08
+**Task:** Identify emergent behaviors from rule composition in gargoyle's `aggregation.md`
+(239 lines) — situations where individually correct rules interact to produce undocumented
+or unintended system behaviors.
+**How we used them:** Same document (full text) + same focused analytical prompt to 3
+models via HAI proxy. Prompt specified 5 categories: combinatorial state explosions, feedback
+loops through external systems, resource competition between isolated instances, temporal
+composition effects, and policy contradiction under composition. Required structured output
+per finding (components, individual correctness, emergent behavior, why doc misses it,
+severity). No tools, no project context beyond the document.
+
+| Model | Time | Output tokens | Reasoning tokens | Findings |
+|---|---|---|---|---|
+| GPT-5 | 76.0s | 6,956 | 4,480 | 13 |
+| Claude Opus 4.6 | 93.2s | 3,956 | (internal) | 8 |
+| Claude Sonnet 4.0 | 31.8s | 1,296 | (internal) | 6 |
+
+**Note:** Sonnet 4.0 was used (not 4.6) due to model name resolution. This provides a
+comparison point against the 4.6 results in previous findings.
+
+**What they found — common ground (all 3 identified):**
+- Synchronized timeout thundering herd when correlated market events start many
+  time-windowed timers simultaneously, overwhelming PortfolioRisk fan-in
+- Per-strategy capacity limits don't bound system-wide memory (multiplicative
+  composition across strategies × instruments)
+- Multiple strategies producing concurrent decisions for same instrument/risk
+  budget without coordination (capital invisibility during buffering)
+- Feedback loops between downstream rejection/processing and upstream signal/
+  decision generation
+
+**GPT-5 unique findings (not in either other model):**
+- **Force-complete vs expire semantics create systematic portfolio bias (#5):**
+  Under stress, force-complete strategies always consume risk budget while expire
+  strategies drop out. Portfolio composition becomes dominated by whichever timeout
+  semantics is more aggressive, independent of alpha quality. Sophisticated insight
+  about fairness across heterogeneous algorithm configurations.
+- **Cross-strategy state coupling via position changes (#12):** Fast exit decisions
+  alter position-dependent signal generation upstream, destabilizing pattern groups
+  in other strategies. The document assumes signals are exogenous but they're
+  actually coupled through position state.
+- **End-to-end latency variance causes duplicate economic intent (#13):** A slow
+  decision acts on stale prices while a newer decision from another strategy passes
+  risk on fresh prices — temporary overexposure followed by costly correction.
+- **Decision identity regeneration + forwarding failures = no idempotency (#6):**
+  Lost decisions re-created later with different IDs but same economic intent cause
+  duplicate execution across the fan-in with no deduplication mechanism.
+- **Crash/restart phase alignment creates persistent periodic spikes (#9):** After
+  fleet restart, all strategies begin windows simultaneously and stay phase-aligned.
+- **Telemetry storms perturb timer accuracy (#10):** Synchronous event emission
+  during herds delays timer callbacks, changing which path fires.
+- **Combinatorial O(S×I) groups under market-wide shocks (#11):** Timer queues and
+  GC thrash cause non-deterministic path selection between predicate/timeout/capacity.
+- **Risk-budget race favoring fast strategies over confirmatory ones (#8):** Arrival
+  order determines which strategies consume budget, not alpha quality.
+
+**Claude Opus unique findings (not in either other model):**
+- **Stop-loss defeated by pattern aggregation temporal composition (#5, CRITICAL):**
+  A stop-loss fires immediately (closing position), but a pattern strategy's group
+  is still buffering entry signals for the same instrument. When the pattern
+  completes (with stale, pre-crash signal data), it forwards a re-entry decision.
+  PortfolioRisk sees an empty position and approves — the system immediately
+  re-enters a position that was just risk-exited. **The safety mechanism (stop-loss)
+  is rendered ineffective by temporal composition with a slow strategy.** This is
+  the most architecturally consequential finding across all three models.
+- **Crash probability correlates with opportunity quality (#6):** Crashes correlate
+  with high-volatility (memory pressure from signal bursts), but high-volatility is
+  also when trading opportunities are most profitable. The system architecturally
+  selects AGAINST its best opportunities through correlated crash-and-miss cycles.
+  Novel "selection bias" framing not seen in other models.
+- **Pre-crash forwarded decisions + post-restart new decisions create duplicates (#7):**
+  A decision forwarded successfully to PortfolioRisk before a crash is still in-flight
+  downstream. After restart, new signals create a new group → new decision → duplicate
+  entry with different decision ID. Unique insight about the boundary between
+  aggregator responsibility and downstream responsibility.
+- **Market-regime-driven bimodal completion clustering (#8):** The joint distribution
+  of completion times across strategies is driven by shared market conditions (hidden
+  common factor). System oscillates between burst-overload and timeout-cascade with
+  limited healthy middle ground.
+
+**Claude Sonnet 4.0 unique findings:**
+- **Cross-algorithm state explosion in pattern completion (#6):** Multi-instrument
+  pattern strategies create implicit dependencies between groups that the per-group
+  state machine can't represent. Somewhat generic — lacks the specific mechanism
+  detail of the other models' findings.
+- No other truly unique findings — Sonnet's 6 findings overlap substantially with
+  the common ground. Less specific, more abstract framing.
+
+**Quality assessment:**
+- **GPT-5** produced the most findings (13) with the highest breadth. Several
+  findings identified systemic dynamics — portfolio composition bias from heterogeneous
+  timeout semantics, position-dependent signal coupling creating feedback into
+  aggregation, and arrival-order fairness violations. GPT-5 uniquely identified the
+  telemetry/timer interference pattern and the combinatorial explosion at system scale.
+  Its strongest contribution is identifying multiple distinct feedback loops through
+  external systems (categories 2 and 5 of the prompt).
+- **Claude Opus** produced fewer findings (8) but two are qualitatively superior to
+  anything in GPT-5's output: the stop-loss-defeated-by-pattern-composition finding
+  (a genuine safety mechanism failure) and the crash-correlates-with-opportunity-quality
+  finding (a selection bias). Opus continues its pattern from Findings #11-#20 of
+  identifying design TENSIONS and CONTRADICTIONS rather than just failure modes. The
+  stop-loss finding is the single most important finding across all three models because
+  it shows a risk management mechanism being defeated by the architecture's own
+  composition rules.
+- **Claude Sonnet 4.0** was fast (31.8s) but produced the weakest output. Only 6
+  findings with less specificity and more overlap with common ground. The cross-algorithm
+  state explosion finding (#6) was somewhat generic. Compared to Sonnet 4.6's
+  performance in Finding #12 (17 assumptions, 85% of GPT-5) and Finding #14 (8 cross-
+  component findings), Sonnet 4.0 here is notably weaker. This suggests the 4.0 → 4.6
+  jump was significant for analytical depth.
+
+**Key insight — "emergent behavior from rule composition" as an analytical lens:**
+
+This is a genuinely novel task type not previously tested. It differs from:
+- **Hidden assumptions** (what must be true for this to work?)
+- **Race conditions** (what temporal interleavings cause bugs?)
+- **Cross-component interactions** (what happens between components?)
+- **Invariant violations** (what legal sequences break invariants?)
+
+The new lens asks: **"What happens when many correct instances of this design operate
+simultaneously in a correlated environment?"** This requires reasoning about:
+1. Statistical properties of the composition (correlation, synchronization)
+2. Shared-resource contention hidden by logical isolation abstractions
+3. Feedback loops that cross the document's scope boundaries
+4. Fairness and priority under resource competition
+
+**What makes this lens uniquely valuable:**
+Most previous analytical lenses find bugs in the DESIGN. This lens finds bugs in the
+DEPLOYMENT — scenarios that are invisible at the single-instance design level but emerge
+at scale. The aggregation document's state machine is CORRECT for a single group. The
+findings here are about what happens when many correct state machines operate
+simultaneously in a correlated world. This is the gap between "correct by construction"
+and "correct in production."
+
+**Model strengths for this lens:**
+- GPT-5 is best at identifying SYSTEMIC dynamics — feedback loops, fairness violations,
+  resource contention patterns, and cascading effects. It reasons about the system as a
+  whole operating over time.
+- Opus is best at identifying CONSEQUENTIAL compositions — finding the one interaction
+  that defeats a safety mechanism or creates a selection bias. Fewer findings but higher
+  architectural significance per finding.
+- Sonnet 4.0 is insufficient for this task — too abstract, insufficient specificity.
+  (Sonnet 4.6 would likely perform better based on prior results.)
+
+**Comparison to document-centric lenses:**
+| Lens type | What it finds | Best model |
+|---|---|---|
+| Hidden assumptions | What the doc takes for granted | All (GPT-5 most) |
+| Race conditions | Temporal bugs between components | GPT-5, Opus |
+| Invariant violations | Legal paths that break rules | GPT-5 (precision) |
+| Cross-doc consistency | Contradictions between docs | GPT-5, Opus |
+| **Emergent composition** | **Scale/deployment bugs** | **GPT-5 (breadth), Opus (depth)** |
+
+**Practical implication:** For any system that runs many instances of the same design
+(microservices, per-user pipelines, per-strategy workers), the "emergent composition"
+lens should be part of architecture review. It specifically targets the gap between
+design correctness and production behavior — the class of bugs that unit tests, property
+tests, and single-instance review all miss.
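
The stop-loss finding (Opus #5) is the clearest case of a bug that single-instance review misses, so a toy reproduction may help reviewers see the composition. The sketch below is not gargoyle code. The `PortfolioRisk` and `PatternGroup` classes, their methods, the `XYZ` instrument, the quantities, and the integer tick clock are all hypothetical stand-ins chosen for illustration. Each component behaves correctly on its own, yet the composed timeline re-enters a position that was just risk-exited.

```python
from dataclasses import dataclass, field


@dataclass
class PortfolioRisk:
    """Hypothetical risk gate: approves an entry if the position limit allows it."""
    limit: int = 100
    positions: dict = field(default_factory=dict)

    def approve_entry(self, instrument: str, qty: int) -> bool:
        current = self.positions.get(instrument, 0)
        if current + qty > self.limit:
            return False
        self.positions[instrument] = current + qty
        return True

    def stop_loss(self, instrument: str) -> None:
        # Risk exit: flatten the position immediately.
        self.positions[instrument] = 0


@dataclass
class PatternGroup:
    """Hypothetical pattern strategy: buffers signals, emits an entry when complete."""
    instrument: str
    signals_needed: int
    qty: int
    buffered: list = field(default_factory=list)

    def on_signal(self, tick: int):
        self.buffered.append(tick)
        if len(self.buffered) < self.signals_needed:
            return None
        # The decision records the tick of its oldest input signal.
        return {
            "instrument": self.instrument,
            "qty": self.qty,
            "oldest_signal_tick": self.buffered[0],
        }


def run_scenario():
    risk = PortfolioRisk()
    group = PatternGroup(instrument="XYZ", signals_needed=3, qty=50)

    risk.approve_entry("XYZ", 50)       # tick 0: an existing position is opened
    group.on_signal(tick=1)             # ticks 1-2: entry signals start buffering
    group.on_signal(tick=2)
    stop_tick = 3
    risk.stop_loss("XYZ")               # tick 3: stop-loss flattens XYZ
    decision = group.on_signal(tick=4)  # tick 4: the pattern completes

    # The risk gate sees an empty position, so the limit check passes.
    approved = risk.approve_entry(decision["instrument"], decision["qty"])
    # The decision is built partly from signals that predate the stop-loss.
    stale = decision["oldest_signal_tick"] < stop_tick
    return approved, stale, risk.positions["XYZ"]


approved, stale, position = run_scenario()
print(f"approved={approved} stale_inputs={stale} position={position}")
```

Running the scenario prints `approved=True stale_inputs=True position=50`: the limit check and the pattern rule both pass, and the position is back 1 tick after the stop-loss. An integer tick counter is used instead of wall-clock timers so the interleaving stays deterministic.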