finding 47: emergent behavior from rule composition - new analytical lens
Tests a novel analytical lens on aggregation.md (239 lines): 'what happens when many correct instances operate simultaneously in a correlated environment?' Results: GPT-5 (13 findings, 76s) excels at systemic dynamics and feedback loops. Opus (8 findings, 93s) finds the most consequential single findings (stop-loss defeated by temporal composition, crash-opportunity correlation). Sonnet 4.0 (6 findings, 32s) too abstract for this task. Key insight: This lens finds DEPLOYMENT bugs invisible at design time - the gap between 'correct by construction' and 'correct in production'.
@@ -0,0 +1,159 @@
### 47. Emergent behavior from rule composition: a new analytical lens; GPT-5 excels at identifying feedback loops and systemic dynamics; Opus finds the most architecturally consequential single findings

**Date:** 2026-05-08
**Task:** Identify emergent behaviors from rule composition in gargoyle's `aggregation.md`
(239 lines): situations where individually correct rules interact to produce undocumented
or unintended system behaviors.
**How we used them:** Same document (full text) + same focused analytical prompt to 3
models via HAI proxy. Prompt specified 5 categories: combinatorial state explosions, feedback
loops through external systems, resource competition between isolated instances, temporal
composition effects, and policy contradiction under composition. Required structured output
per finding (components, individual correctness, emergent behavior, why doc misses it,
severity). No tools, no project context beyond the document.

| Model | Time | Output tokens | Reasoning tokens | Findings |
|---|---|---|---|---|
| GPT-5 | 76.0s | 6,956 | 4,480 | 13 |
| Claude Opus 4.6 | 93.2s | 3,956 | (internal) | 8 |
| Claude Sonnet 4.0 | 31.8s | 1,296 | (internal) | 6 |

**Note:** Sonnet 4.0 was used instead of 4.6 because of how the model name resolved. This
provides a comparison point against the 4.6 results in previous findings.

**What they found, common ground (all 3 identified):**
- Synchronized timeout thundering herd when correlated market events start many
  time-windowed timers simultaneously, overwhelming PortfolioRisk fan-in
- Per-strategy capacity limits don't bound system-wide memory (multiplicative
  composition across strategies × instruments)
- Multiple strategies producing concurrent decisions for same instrument/risk
  budget without coordination (capital invisibility during buffering)
- Feedback loops between downstream rejection/processing and upstream signal/
  decision generation

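To make the multiplicative composition concrete, here is a minimal arithmetic sketch. It assumes the capacity limit applies to each (strategy, instrument) group; the function name and every number are hypothetical and are not taken from `aggregation.md`. It only shows that a limit which looks small locally can imply a large system-wide worst case.

```python
def worst_case_buffered_signals(strategies: int, instruments: int,
                                per_group_capacity: int) -> int:
    """Upper bound on buffered signals if every (strategy, instrument)
    group fills to its own local capacity limit at the same time."""
    return strategies * instruments * per_group_capacity


# Hypothetical deployment sizes, chosen only for illustration.
local_limit = 100
total = worst_case_buffered_signals(strategies=20, instruments=500,
                                    per_group_capacity=local_limit)

# Each group respects its limit of 100, yet the fleet-wide bound is far larger.
print(f"per-group limit: {local_limit}, system-wide worst case: {total}")
```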
**GPT-5 unique findings (reported by neither other model):**
- **Force-complete vs expire semantics create systematic portfolio bias (#5):**
  Under stress, force-complete strategies always consume risk budget while expire
  strategies drop out. Portfolio composition becomes dominated by whichever timeout
  semantics is more aggressive, independent of alpha quality. Sophisticated insight
  about fairness across heterogeneous algorithm configurations.
- **Cross-strategy state coupling via position changes (#12):** Fast exit decisions
  alter position-dependent signal generation upstream, destabilizing pattern groups
  in other strategies. The document assumes signals are exogenous, but they are
  actually coupled through position state.
- **End-to-end latency variance causes duplicate economic intent (#13):** A slow
  decision acts on stale prices while a newer decision from another strategy passes
  risk on fresh prices, causing temporary overexposure followed by costly correction.
- **Decision identity regeneration + forwarding failures = no idempotency (#6):**
  Lost decisions re-created later with different IDs but the same economic intent cause
  duplicate execution across the fan-in with no deduplication mechanism.
- **Crash/restart phase alignment creates persistent periodic spikes (#9):** After
  fleet restart, all strategies begin windows simultaneously and stay phase-aligned.
- **Telemetry storms perturb timer accuracy (#10):** Synchronous event emission
  during herds delays timer callbacks, changing which path fires.
- **Combinatorial O(S×I) groups under market-wide shocks (#11):** Timer queues and
  GC thrash cause non-deterministic path selection between predicate/timeout/capacity.
- **Risk-budget race favoring fast strategies over confirmatory ones (#8):** Arrival
  order, not alpha quality, determines which strategies consume budget.

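The idempotency gap in finding #6 can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `Decision` fields, the `intent_key` helper, and the `FanIn` class are hypothetical and do not describe gargoyle's actual types. The point is only that two decisions with different IDs can carry the same economic intent, and that a key derived from the intent, rather than from the ID, is one possible way for a fan-in to notice.

```python
import hashlib
import uuid
from dataclasses import dataclass, field
from typing import List, Set


@dataclass(frozen=True)
class Decision:
    strategy: str
    instrument: str
    side: str          # "buy" or "sell"
    window_start: int  # start of the aggregation window that produced it
    # A fresh ID is generated on every (re-)creation, as in the finding.
    decision_id: str = field(default_factory=lambda: uuid.uuid4().hex)


def intent_key(d: Decision) -> str:
    """Hypothetical idempotency key built from economic intent, not the ID."""
    raw = f"{d.strategy}|{d.instrument}|{d.side}|{d.window_start}"
    return hashlib.sha256(raw.encode()).hexdigest()


class FanIn:
    """Toy fan-in that drops decisions whose intent was already accepted."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()
        self.accepted: List[Decision] = []

    def submit(self, d: Decision) -> bool:
        key = intent_key(d)
        if key in self._seen:
            return False  # same intent, different ID: treated as a duplicate
        self._seen.add(key)
        self.accepted.append(d)
        return True


first = Decision("momentum", "ES", "buy", 1_700_000_000)
retry = Decision("momentum", "ES", "buy", 1_700_000_000)  # re-created later

fan_in = FanIn()
results = [fan_in.submit(first), fan_in.submit(retry)]
print(first.decision_id != retry.decision_id, results)
```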
**Claude Opus unique findings (reported by neither other model):**
- **Stop-loss defeated by pattern aggregation temporal composition (#5, CRITICAL):**
  A stop-loss fires immediately (closing position), but a pattern strategy's group
  is still buffering entry signals for the same instrument. When the pattern
  completes (with stale, pre-crash signal data), it forwards a re-entry decision.
  PortfolioRisk sees an empty position and approves, so the system immediately
  re-enters a position that was just risk-exited. **The safety mechanism (stop-loss)
  is rendered ineffective by temporal composition with a slow strategy.** This is
  the most architecturally consequential finding across all three models.
- **Crash probability correlates with opportunity quality (#6):** Crashes correlate
  with high volatility (memory pressure from signal bursts), but high volatility is
  also when trading opportunities are most profitable. The system architecturally
  selects AGAINST its best opportunities through correlated crash-and-miss cycles.
  Novel "selection bias" framing not seen in other models.
- **Pre-crash forwarded decisions + post-restart new decisions create duplicates (#7):**
  A decision forwarded successfully to PortfolioRisk before a crash is still in-flight
  downstream. After restart, new signals create a new group → new decision → duplicate
  entry with a different decision ID. Unique insight about the boundary between
  aggregator responsibility and downstream responsibility.
- **Market-regime-driven bimodal completion clustering (#8):** The joint distribution
  of completion times across strategies is driven by shared market conditions (hidden
  common factor). The system oscillates between burst-overload and timeout-cascade with
  limited healthy middle ground.

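The stop-loss finding (#5) can be replayed as a toy timeline. This is a sketch under stated assumptions, not the behavior of the real PortfolioRisk: `ToyPortfolioRisk`, its `reject_stale_entries` flag, and the `newest_signal_time` field are all hypothetical. The naive variant approves any entry when no position is open; the guarded variant additionally rejects entries whose newest signal is no later than the last risk exit, which is one possible mitigation and not something the document prescribes.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Decision:
    instrument: str
    action: str              # "enter" or "exit"
    newest_signal_time: int  # time of the latest signal behind the decision
    risk_exit: bool = False  # True for stop-loss exits


class ToyPortfolioRisk:
    def __init__(self, reject_stale_entries: bool) -> None:
        self.reject_stale_entries = reject_stale_entries
        self.open: Dict[str, bool] = {}
        self.last_risk_exit: Dict[str, int] = {}

    def handle(self, d: Decision, now: int) -> bool:
        if d.action == "exit":
            self.open[d.instrument] = False
            if d.risk_exit:
                self.last_risk_exit[d.instrument] = now
            return True
        if self.open.get(d.instrument, False):
            return False  # already holding a position
        exit_time = self.last_risk_exit.get(d.instrument)
        if (self.reject_stale_entries and exit_time is not None
                and d.newest_signal_time <= exit_time):
            return False  # entry is built only on pre-exit signals
        self.open[d.instrument] = True
        return True


def run(reject_stale_entries: bool) -> bool:
    """Replay the timeline; return True if the position ends up re-opened."""
    risk = ToyPortfolioRisk(reject_stale_entries)
    risk.open["XYZ"] = True
    # t=3: stop-loss fires and closes the position.
    risk.handle(Decision("XYZ", "exit", newest_signal_time=3, risk_exit=True), now=3)
    # t=5: the pattern group completes using signals buffered up to t=2.
    risk.handle(Decision("XYZ", "enter", newest_signal_time=2), now=5)
    return risk.open["XYZ"]


naive_reentered = run(reject_stale_entries=False)
guarded_reentered = run(reject_stale_entries=True)
print(naive_reentered, guarded_reentered)
```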
**Claude Sonnet 4.0 unique findings:**
- **Cross-algorithm state explosion in pattern completion (#6):** Multi-instrument
  pattern strategies create implicit dependencies between groups that the per-group
  state machine can't represent. Somewhat generic: it lacks the specific mechanism
  detail of the other models' findings.
- No other truly unique findings. Sonnet's 6 findings overlap substantially with
  the common ground, with less specific, more abstract framing.

**Quality assessment:**
- **GPT-5** produced the most findings (13) with the highest breadth. Several
  findings identified systemic dynamics: portfolio composition bias from heterogeneous
  timeout semantics, position-dependent signal coupling creating feedback into
  aggregation, and arrival-order fairness violations. GPT-5 uniquely identified the
  telemetry/timer interference pattern and the combinatorial explosion at system scale.
  Its strongest contribution is identifying multiple distinct feedback loops through
  external systems (categories 2 and 5 of the prompt).
- **Claude Opus** produced fewer findings (8) but two are qualitatively superior to
  anything in GPT-5's output: the stop-loss-defeated-by-pattern-composition finding
  (a genuine safety mechanism failure) and the crash-correlates-with-opportunity-quality
  finding (a selection bias). Opus continues its pattern from Findings #11-#20 of
  identifying design TENSIONS and CONTRADICTIONS rather than just failure modes. The
  stop-loss finding is the single most important finding across all three models because
  it shows a risk management mechanism being defeated by the architecture's own
  composition rules.
- **Claude Sonnet 4.0** was fast (31.8s) but produced the weakest output: only 6
  findings, with less specificity and more overlap with common ground. The cross-algorithm
  state explosion finding (#6) was somewhat generic. Compared to Sonnet 4.6's
  performance in Finding #12 (17 assumptions, 85% of GPT-5) and Finding #14 (8 cross-component
  findings), Sonnet 4.0 here is notably weaker. This suggests the 4.0 → 4.6
  jump was significant for analytical depth.

**Key insight: "emergent behavior from rule composition" as an analytical lens**

This is a new task type, not tested in previous findings. It differs from:
- **Hidden assumptions** (what must be true for this to work?)
- **Race conditions** (what temporal interleavings cause bugs?)
- **Cross-component interactions** (what happens between components?)
- **Invariant violations** (what legal sequences break invariants?)

The new lens asks: **"What happens when many correct instances of this design operate
simultaneously in a correlated environment?"** This requires reasoning about:
1. Statistical properties of the composition (correlation, synchronization)
2. Shared-resource contention hidden by logical isolation abstractions
3. Feedback loops that cross the document's scope boundaries
4. Fairness and priority under resource competition

**What makes this lens uniquely valuable:**
Most previous analytical lenses find bugs in the DESIGN. This lens finds bugs in the
DEPLOYMENT: scenarios that are invisible at the single-instance design level but emerge
at scale. The aggregation document's state machine is CORRECT for a single group. The
findings here are about what happens when many correct state machines operate
simultaneously in a correlated world. This is the gap between "correct by construction"
and "correct in production."

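A minimal sketch of that gap, using the timer herd as the example: each timer below is individually correct, and only the correlation of start times changes. The group count, timeout, and the `peak_expirations` helper are hypothetical values and names chosen for illustration, not parameters from `aggregation.md`.

```python
from collections import Counter
from typing import Iterable


def peak_expirations(start_times: Iterable[int], timeout: int) -> int:
    """Largest number of timers that expire in the same tick."""
    expiries = Counter(start + timeout for start in start_times)
    return max(expiries.values())


GROUPS = 1000  # hypothetical number of concurrently open groups
TIMEOUT = 50   # hypothetical window length in ticks

# Correlated world: one market event opens every group in the same tick.
correlated_starts = [0] * GROUPS
# Uncorrelated world: group openings are spread evenly across one window.
staggered_starts = [i % TIMEOUT for i in range(GROUPS)]

peak_correlated = peak_expirations(correlated_starts, TIMEOUT)
peak_staggered = peak_expirations(staggered_starts, TIMEOUT)

# Same state machine, same timeout; only the fan-in burst size differs.
print(peak_correlated, peak_staggered)
```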
**Model strengths for this lens:**
- GPT-5 is best at identifying SYSTEMIC dynamics: feedback loops, fairness violations,
  resource contention patterns, and cascading effects. It reasons about the system as a
  whole operating over time.
- Opus is best at identifying CONSEQUENTIAL compositions: finding the one interaction
  that defeats a safety mechanism or creates a selection bias. Fewer findings but higher
  architectural significance per finding.
- Sonnet 4.0 is not adequate for this task: its findings are too abstract and lack
  specific mechanisms. (Sonnet 4.6 would likely perform better based on prior results.)

**Comparison to document-centric lenses:**

| Lens type | What it finds | Best model |
|---|---|---|
| Hidden assumptions | What the doc takes for granted | All (GPT-5 most) |
| Race conditions | Temporal bugs between components | GPT-5, Opus |
| Invariant violations | Legal paths that break rules | GPT-5 (precision) |
| Cross-doc consistency | Contradictions between docs | GPT-5, Opus |
| **Emergent composition** | **Scale/deployment bugs** | **GPT-5 (breadth), Opus (depth)** |

**Practical implication:** For any system that runs many instances of the same design
(microservices, per-user pipelines, per-strategy workers), the "emergent composition"
lens should be part of architecture review. It specifically targets the gap between
design correctness and production behavior, the class of bugs that unit tests, property
tests, and single-instance review all miss.