finding 47: emergent behavior from rule composition - new analytical lens

Tests a novel analytical lens on aggregation.md (239 lines): 'what happens when many correct instances operate simultaneously in a correlated environment?' Results: GPT-5 (13 findings, 76s) excels at systemic dynamics and feedback loops. Opus (8 findings, 93s) finds the most consequential single findings (stop-loss defeated by temporal composition, crash-opportunity correlation). Sonnet 4.0 (6 findings, 32s) too abstract for this task. Key insight: This lens finds DEPLOYMENT bugs invisible at design time - the gap between 'correct by construction' and 'correct in production'.
2026-05-08 02:06:25 -07:00
parent b5b5b64a40
commit f3266ccc13
1 changed files with 159 additions and 0 deletions
@@ -0,0 +1,159 @@
+### 47. Emergent behavior from rule composition: a new analytical lens; GPT-5 excels at identifying feedback loops and systemic dynamics; Opus finds the most architecturally consequential single findings
+
+**Date:** 2026-05-08
+**Task:** Identify emergent behaviors from rule composition in gargoyle's `aggregation.md`
+(239 lines) — situations where individually correct rules interact to produce undocumented
+or unintended system behaviors.
+**How we used them:** Same document (full text) + same focused analytical prompt to 3
+models via HAI proxy. Prompt specified 5 categories: combinatorial state explosions, feedback
+loops through external systems, resource competition between isolated instances, temporal
+composition effects, and policy contradiction under composition. Required structured output
+per finding (components, individual correctness, emergent behavior, why doc misses it,
+severity). No tools, no project context beyond the document.
+
+| Model | Time | Output tokens | Reasoning tokens | Findings |
+|---|---|---|---|---|
+| GPT-5 | 76.0s | 6,956 | 4,480 | 13 |
+| Claude Opus 4.6 | 93.2s | 3,956 | (internal) | 8 |
+| Claude Sonnet 4.0 | 31.8s | 1,296 | (internal) | 6 |
+
+**Note:** Sonnet 4.0 was used instead of 4.6 because of how the model name resolved. This
+run provides a comparison point against the 4.6 results in previous findings.
+
+**What they found — common ground (all 3 identified):**
+- Synchronized timeout thundering herd when correlated market events start many
+  time-windowed timers simultaneously, overwhelming PortfolioRisk fan-in
+- Per-strategy capacity limits don't bound system-wide memory (multiplicative
+  composition across strategies × instruments)
+- Multiple strategies producing concurrent decisions for same instrument/risk
+  budget without coordination (capital invisibility during buffering)
+- Feedback loops between downstream rejection/processing and upstream signal/
+  decision generation
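The multiplicative-composition point above can be made concrete with a little arithmetic. This is a minimal sketch, not taken from `aggregation.md`: the `Limits` type, the per-group capacity, the signal size, and the fleet dimensions are all hypothetical numbers chosen only to show how a bound that looks small per group scales with strategies × instruments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    """Hypothetical per-group limits; none of these values come from aggregation.md."""
    max_signals_per_group: int  # capacity limit enforced inside a single group
    bytes_per_signal: int       # assumed average in-memory size of one signal


def worst_case_bytes(strategies: int, instruments: int, limits: Limits) -> int:
    """Upper bound on buffered bytes if every (strategy, instrument) group fills up."""
    groups = strategies * instruments
    return groups * limits.max_signals_per_group * limits.bytes_per_signal


limits = Limits(max_signals_per_group=1_000, bytes_per_signal=512)

# One group looks harmless in isolation.
one_group = worst_case_bytes(1, 1, limits)

# The same limit applied to 40 strategies x 500 instruments (assumed fleet size).
fleet = worst_case_bytes(40, 500, limits)

print(f"single group: {one_group:,} bytes")
print(f"whole fleet:  {fleet:,} bytes")
```

The per-group limit is respected everywhere in this sketch, yet the fleet-wide bound is 20,000 times larger, which is the sense in which a per-strategy cap does not bound system-wide memory.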
+
+**GPT-5 unique findings (found by neither other model):**
+- **Force-complete vs expire semantics create systematic portfolio bias (#5):**
+  Under stress, force-complete strategies always consume risk budget while expire
+  strategies drop out. Portfolio composition becomes dominated by whichever timeout
+  semantics is more aggressive, independent of alpha quality. Sophisticated insight
+  about fairness across heterogeneous algorithm configurations.
+- **Cross-strategy state coupling via position changes (#12):** Fast exit decisions
+  alter position-dependent signal generation upstream, destabilizing pattern groups
+  in other strategies. The document assumes signals are exogenous but they're
+  actually coupled through position state.
+- **End-to-end latency variance causes duplicate economic intent (#13):** A slow
+  decision acts on stale prices while a newer decision from another strategy passes
+  risk on fresh prices — temporary overexposure followed by costly correction.
+- **Decision identity regeneration + forwarding failures = no idempotency (#6):**
+  Lost decisions re-created later with different IDs but same economic intent cause
+  duplicate execution across the fan-in with no deduplication mechanism.
+- **Crash/restart phase alignment creates persistent periodic spikes (#9):** After
+  fleet restart, all strategies begin windows simultaneously and stay phase-aligned.
+- **Telemetry storms perturb timer accuracy (#10):** Synchronous event emission
+  during herds delays timer callbacks, changing which path fires.
+- **Combinatorial O(S×I) groups under market-wide shocks (#11):** Timer-queue and GC
+  thrash make path selection among predicate, timeout, and capacity completion non-deterministic.
+- **Risk-budget race favoring fast strategies over confirmatory ones (#8):** Arrival
+  order determines which strategies consume budget, not alpha quality.
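The idempotency finding (#6) can be illustrated with a toy deduplicator. This is a sketch under stated assumptions: the `Decision` fields, the `intent_key` definition, and the scenario (a decision re-created after a forward that was reported as lost) are hypothetical and not taken from `aggregation.md`. It only shows why deduplicating on a regenerated ID cannot catch a repeat of the same economic intent.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    decision_id: str   # regenerated every time a decision is (re-)created
    strategy: str
    instrument: str
    side: str          # "buy" or "sell"
    quantity: int

    def intent_key(self) -> tuple:
        """Hypothetical 'economic intent' key that ignores the decision ID."""
        return (self.strategy, self.instrument, self.side, self.quantity)


def make_decision(strategy: str, instrument: str, side: str, quantity: int) -> Decision:
    return Decision(uuid.uuid4().hex, strategy, instrument, side, quantity)


def count_executed(decisions, key) -> int:
    """Count decisions that survive deduplication under the given key function."""
    seen = set()
    executed = 0
    for decision in decisions:
        k = key(decision)
        if k in seen:
            continue  # treated as a duplicate and dropped
        seen.add(k)
        executed += 1
    return executed


original = make_decision("momentum", "ACME", "buy", 100)
recreated = make_decision("momentum", "ACME", "buy", 100)  # same intent, new ID

by_id = count_executed([original, recreated], key=lambda d: d.decision_id)
by_intent = count_executed([original, recreated], key=Decision.intent_key)
print(by_id, by_intent)
```

Keying on the ID executes both decisions, while keying on intent executes one. A real intent key would also need a time window or sequence scope, otherwise legitimate repeat trades would be suppressed; that design choice is outside what the finding states.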
+
+**Claude Opus unique findings (found by neither other model):**
+- **Stop-loss defeated by pattern aggregation temporal composition (#5, CRITICAL):**
+  A stop-loss fires immediately (closing position), but a pattern strategy's group
+  is still buffering entry signals for the same instrument. When the pattern
+  completes (with stale, pre-crash signal data), it forwards a re-entry decision.
+  PortfolioRisk sees an empty position and approves — the system immediately
+  re-enters a position that was just risk-exited. **The safety mechanism (stop-loss)
+  is rendered ineffective by temporal composition with a slow strategy.** This is
+  the most architecturally consequential finding across all three models.
+- **Crash probability correlates with opportunity quality (#6):** Crashes correlate
+  with high volatility (memory pressure from signal bursts), but high volatility is
+  also when trading opportunities are most profitable. The system architecturally
+  selects AGAINST its best opportunities through correlated crash-and-miss cycles.
+  Novel "selection bias" framing not seen in other models.
+- **Pre-crash forwarded decisions + post-restart new decisions create duplicates (#7):**
+  A decision forwarded successfully to PortfolioRisk before a crash is still in-flight
+  downstream. After restart, new signals create a new group → new decision → duplicate
+  entry with different decision ID. Unique insight about the boundary between
+  aggregator responsibility and downstream responsibility.
+- **Market-regime-driven bimodal completion clustering (#8):** The joint distribution
+  of completion times across strategies is driven by shared market conditions (a hidden
+  common factor). The system oscillates between burst overload and timeout cascade, with
+  little healthy middle ground.
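The stop-loss finding (#5) is easiest to see as a replayed timeline. The sketch below is a toy model, not the design in `aggregation.md`: `ToyRisk` is a hypothetical stand-in for PortfolioRisk that approves an entry whenever the position is flat, and the `guard` flag is an illustrative check (reject entries whose oldest signal predates the last risk exit) that the source does not propose.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRisk:
    """Hypothetical stand-in for PortfolioRisk: approves entries on a flat position."""
    position: dict = field(default_factory=dict)
    last_risk_exit: dict = field(default_factory=dict)

    def stop_loss(self, instrument: str, now: int) -> None:
        """Close the position immediately and remember when the risk exit happened."""
        self.position[instrument] = 0
        self.last_risk_exit[instrument] = now

    def approve_entry(self, instrument: str, oldest_signal_ts: int, guard: bool) -> bool:
        if self.position.get(instrument, 0) != 0:
            return False  # already holding a position
        if guard and oldest_signal_ts <= self.last_risk_exit.get(instrument, -1):
            return False  # entry is built from signals that predate the risk exit
        self.position[instrument] = 1
        return True


def replay(guard: bool) -> bool:
    """Replay the timeline from the finding; return True if the re-entry is approved."""
    risk = ToyRisk(position={"ACME": 1})
    buffered_signal_ts = [1, 2]    # t=1..2: pattern group buffers entry signals
    risk.stop_loss("ACME", now=3)  # t=3: stop-loss fires and closes the position
    # t=4: the pattern completes and forwards an entry built from the old signals
    return risk.approve_entry("ACME", min(buffered_signal_ts), guard)


reentered_without_guard = replay(guard=False)
reentered_with_guard = replay(guard=True)
print(reentered_without_guard, reentered_with_guard)
```

Each component behaves correctly in isolation here: the stop-loss exits, the pattern group completes, and the risk check sees a flat position. The re-entry only appears when the three are composed in time.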
+
+**Claude Sonnet 4.0 unique findings:**
+- **Cross-algorithm state explosion in pattern completion (#6):** Multi-instrument
+  pattern strategies create implicit dependencies between groups that the per-group
+  state machine can't represent. Somewhat generic — lacks the specific mechanism
+  detail of the other models' findings.
+- No other truly unique findings — Sonnet's 6 findings overlap substantially with
+  the common ground. Less specific, more abstract framing.
+
+**Quality assessment:**
+- **GPT-5** produced the most findings (13) and the greatest breadth. Several
+  findings identified systemic dynamics — portfolio composition bias from heterogeneous
+  timeout semantics, position-dependent signal coupling creating feedback into
+  aggregation, and arrival-order fairness violations. GPT-5 uniquely identified the
+  telemetry/timer interference pattern and the combinatorial explosion at system scale.
+  Its strongest contribution is identifying multiple distinct feedback loops through
+  external systems (categories 2 and 5 of the prompt).
+- **Claude Opus** produced fewer findings (8) but two are qualitatively superior to
+  anything in GPT-5's output: the stop-loss-defeated-by-pattern-composition finding
+  (a genuine safety mechanism failure) and the crash-correlates-with-opportunity-quality
+  finding (a selection bias). Opus continues its pattern from Findings #11-#20 of
+  identifying design TENSIONS and CONTRADICTIONS rather than just failure modes. The
+  stop-loss finding is the single most important finding across all three models because
+  it shows a risk management mechanism being defeated by the architecture's own
+  composition rules.
+- **Claude Sonnet 4.0** was fast (31.8s) but produced the weakest output: only 6
+  findings, with less specificity and more overlap with the common ground. The
+  cross-algorithm state explosion finding (#6) was somewhat generic. Compared to Sonnet
+  4.6's performance in Finding #12 (17 assumptions, 85% of GPT-5) and Finding #14 (8
+  cross-component findings), Sonnet 4.0 here is notably weaker. This suggests the
+  4.0 → 4.6 jump was significant for analytical depth, although those findings used
+  different task types, so the comparison is indirect.
+
+**Key insight — "emergent behavior from rule composition" as an analytical lens:**
+
+This is a new task type, not tested in earlier findings. It differs from:
+- **Hidden assumptions** (what must be true for this to work?)
+- **Race conditions** (what temporal interleavings cause bugs?)
+- **Cross-component interactions** (what happens between components?)
+- **Invariant violations** (what legal sequences break invariants?)
+
+The new lens asks: **"What happens when many correct instances of this design operate
+simultaneously in a correlated environment?"** This requires reasoning about:
+1. Statistical properties of the composition (correlation, synchronization)
+2. Shared-resource contention hidden by logical isolation abstractions
+3. Feedback loops that cross the document's scope boundaries
+4. Fairness and priority under resource competition
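The first item in the list above (correlation and synchronization) can be sketched with a small simulation. All numbers here are assumptions for illustration: 200 groups, a 30-second window, one-second buckets, and a fixed random seed. It compares timers started independently over a minute with timers all started by one correlated market event.

```python
import random
from collections import Counter


def peak_expirations(start_times, window):
    """Largest number of timers expiring in the same one-second bucket."""
    expiries = Counter(start + window for start in start_times)
    return max(expiries.values())


N_GROUPS = 200   # assumed number of time-windowed groups
WINDOW_S = 30    # assumed window length in seconds
rng = random.Random(0)  # fixed seed so the run is repeatable

# Independent environment: groups open at unrelated times across one minute.
independent_starts = [rng.randrange(0, 60) for _ in range(N_GROUPS)]

# Correlated environment: one market-wide event opens every group at t=0.
correlated_starts = [0] * N_GROUPS

peak_independent = peak_expirations(independent_starts, WINDOW_S)
peak_correlated = peak_expirations(correlated_starts, WINDOW_S)
print(peak_independent, peak_correlated)
```

Every timer in both runs is individually correct. The difference in peak load on the fan-in comes only from the correlation of start times, which is a property of the environment rather than of any single group.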
+
+**What makes this lens uniquely valuable:**
+Most previous analytical lenses find bugs in the DESIGN. This lens finds bugs in the
+DEPLOYMENT — scenarios that are invisible at the single-instance design level but emerge
+at scale. The aggregation document's state machine is CORRECT for a single group. The
+findings here are about what happens when many correct state machines operate
+simultaneously in a correlated world. This is the gap between "correct by construction"
+and "correct in production."
+
+**Model strengths for this lens:**
+- GPT-5 is best at identifying SYSTEMIC dynamics — feedback loops, fairness violations,
+  resource contention patterns, and cascading effects. It reasons about the system as a
+  whole operating over time.
+- Opus is best at identifying CONSEQUENTIAL compositions — finding the one interaction
+  that defeats a safety mechanism or creates a selection bias. Fewer findings but higher
+  architectural significance per finding.
+- Sonnet 4.0 is not adequate for this task: its findings are too abstract and lack
+  specific mechanisms. (Sonnet 4.6 would likely perform better based on prior results.)
+
+**Comparison to document-centric lenses:**
+| Lens type | What it finds | Best model |
+|---|---|---|
+| Hidden assumptions | What the doc takes for granted | All (GPT-5 most) |
+| Race conditions | Temporal bugs between components | GPT-5, Opus |
+| Invariant violations | Legal paths that break rules | GPT-5 (precision) |
+| Cross-doc consistency | Contradictions between docs | GPT-5, Opus |
+| **Emergent composition** | **Scale/deployment bugs** | **GPT-5 (breadth), Opus (depth)** |
+
+**Practical implication:** For any system that runs many instances of the same design
+(microservices, per-user pipelines, per-strategy workers), the "emergent composition"
+lens should be part of architecture review. It specifically targets the gap between
+design correctness and production behavior: the class of bugs that unit tests, property
+tests, and single-instance review typically miss.