f3266ccc13
Tests a novel analytical lens on aggregation.md (239 lines): 'what happens when many correct instances operate simultaneously in a correlated environment?' Results: GPT-5 (13 findings, 76s) excels at systemic dynamics and feedback loops. Opus (8 findings, 93s) finds the most consequential single findings (stop-loss defeated by temporal composition, crash-opportunity correlation). Sonnet 4.0 (6 findings, 32s) too abstract for this task. Key insight: This lens finds DEPLOYMENT bugs invisible at design time - the gap between 'correct by construction' and 'correct in production'.
160 lines
10 KiB
Markdown
### 47. Emergent behavior from rule composition: a new analytical lens; GPT-5 excels at identifying feedback loops and systemic dynamics; Opus finds the most architecturally consequential single findings

**Date:** 2026-05-08

**Task:** Identify emergent behaviors from rule composition in gargoyle's `aggregation.md`
(239 lines): situations where individually correct rules interact to produce undocumented
or unintended system behaviors.

**How we used them:** Same document (full text) + same focused analytical prompt to 3
models via HAI proxy. Prompt specified 5 categories: combinatorial state explosions, feedback
loops through external systems, resource competition between isolated instances, temporal
composition effects, and policy contradiction under composition. Required structured output
per finding (components, individual correctness, emergent behavior, why doc misses it,
severity). No tools, no project context beyond the document.

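The structured output described above can be pictured as a small record type. This is only a sketch: the prompt's actual schema is not reproduced in this write-up, so the class and field names below are hypothetical, and tagging each finding with a `category` is an assumption rather than something the prompt is known to have required.

```python
from dataclasses import dataclass, fields
from enum import Enum


class Category(Enum):
    """The five prompt categories, as listed in the write-up."""
    COMBINATORIAL_STATE_EXPLOSION = "combinatorial state explosions"
    EXTERNAL_FEEDBACK_LOOP = "feedback loops through external systems"
    RESOURCE_COMPETITION = "resource competition between isolated instances"
    TEMPORAL_COMPOSITION = "temporal composition effects"
    POLICY_CONTRADICTION = "policy contradiction under composition"


@dataclass(frozen=True)
class Finding:
    """One structured finding (hypothetical field names)."""
    category: Category              # assumption: one category per finding
    components: tuple[str, ...]     # rules or components that interact
    individual_correctness: str     # why each part is correct on its own
    emergent_behavior: str          # what the composition actually does
    why_doc_misses_it: str
    severity: str                   # free text; the severity scale is not given


def missing_fields(finding: Finding) -> list[str]:
    """Return the names of text or tuple fields that were left empty."""
    return [
        f.name
        for f in fields(finding)
        if isinstance(getattr(finding, f.name), (str, tuple))
        and not getattr(finding, f.name)
    ]


# Example paraphrasing the Opus stop-loss finding reported in this section.
example = Finding(
    category=Category.TEMPORAL_COMPOSITION,
    components=("stop-loss", "pattern aggregation group", "PortfolioRisk"),
    individual_correctness="Each component follows its own documented rule.",
    emergent_behavior="A stale buffered entry re-opens a risk-exited position.",
    why_doc_misses_it="The state machine is specified for a single group.",
    severity="CRITICAL",
)
print(missing_fields(example))
```

A check like `missing_fields` makes it easy to reject model responses that skip one of the required parts before comparing findings across models.
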
| Model | Time | Output tokens | Reasoning tokens | Findings |
|---|---|---|---|---|
| GPT-5 | 76.0s | 6,956 | 4,480 | 13 |
| Claude Opus 4.6 | 93.2s | 3,956 | (internal) | 8 |
| Claude Sonnet 4.0 | 31.8s | 1,296 | (internal) | 6 |

**Note:** Sonnet 4.0 was used (not 4.6) due to model name resolution. This provides a
comparison point against the 4.6 results in previous findings.

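For a rough sense of how much text each model spent per finding, the table's figures can be divided out. This is a sketch using only the numbers above; whether GPT-5's output-token count includes its reasoning tokens is not stated in the table, so the ratios are best read as indicative.

```python
# Figures copied from the results table: (seconds, output tokens, findings).
RESULTS = {
    "GPT-5": (76.0, 6956, 13),
    "Claude Opus 4.6": (93.2, 3956, 8),
    "Claude Sonnet 4.0": (31.8, 1296, 6),
}


def per_finding(seconds, output_tokens, findings):
    """Return (output tokens per finding, seconds per finding)."""
    return output_tokens / findings, seconds / findings


for model, row in RESULTS.items():
    tokens_each, seconds_each = per_finding(*row)
    print(f"{model}: {tokens_each:.1f} tokens/finding, {seconds_each:.2f} s/finding")
```

This works out to about 535, 495, and 216 output tokens per finding for GPT-5, Opus, and Sonnet 4.0 respectively. Length is only a crude proxy for specificity, but Sonnet's much shorter findings are at least consistent with the quality assessment later in this section.
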
**What they found (common ground, identified by all 3):**
- Synchronized timeout thundering herd when correlated market events start many
  time-windowed timers simultaneously, overwhelming PortfolioRisk fan-in
- Per-strategy capacity limits don't bound system-wide memory (multiplicative
  composition across strategies × instruments)
- Multiple strategies producing concurrent decisions for same instrument/risk
  budget without coordination (capital invisibility during buffering)
- Feedback loops between downstream rejection/processing and upstream signal/
  decision generation

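The thundering-herd item above can be illustrated with a toy calculation. The sketch below assumes every group uses the same fixed time window and counts how many timers expire in the same 100 ms bucket; the function name, group counts, window length, and bucket size are all made up for illustration and are not taken from `aggregation.md`.

```python
from collections import Counter


def peak_timeouts_per_bucket(start_times_ms, window_ms, bucket_ms=100):
    """Largest number of timers that expire within a single time bucket."""
    expiries = [start + window_ms for start in start_times_ms]
    buckets = Counter(expiry // bucket_ms for expiry in expiries)
    return max(buckets.values())


WINDOW_MS = 5_000  # hypothetical window length shared by all groups

# Uncorrelated case: 1,000 groups start evenly spread over 10 seconds.
spread_starts = [i * 10 for i in range(1000)]

# Correlated case: one market event starts the same 1,000 groups within 50 ms.
correlated_starts = [i % 50 for i in range(1000)]

print(peak_timeouts_per_bucket(spread_starts, WINDOW_MS))      # 10
print(peak_timeouts_per_bucket(correlated_starts, WINDOW_MS))  # 1000
```

Each timer behaves correctly in both cases. Only the correlation of the start times changes, and that alone raises the peak fan-in load by two orders of magnitude in this example.
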
**GPT-5 unique findings (reported by neither other model):**
- **Force-complete vs expire semantics create systematic portfolio bias (#5):**
  Under stress, force-complete strategies always consume risk budget while expire
  strategies drop out. Portfolio composition becomes dominated by whichever timeout
  semantics is more aggressive, independent of alpha quality. This is a sophisticated
  insight about fairness across heterogeneous algorithm configurations.
- **Cross-strategy state coupling via position changes (#12):** Fast exit decisions
  alter position-dependent signal generation upstream, destabilizing pattern groups
  in other strategies. The document assumes signals are exogenous, but they are
  actually coupled through position state.
- **End-to-end latency variance causes duplicate economic intent (#13):** A slow
  decision acts on stale prices while a newer decision from another strategy passes
  risk on fresh prices, causing temporary overexposure followed by costly correction.
- **Decision identity regeneration + forwarding failures = no idempotency (#6):**
  Lost decisions that are re-created later with different IDs but the same economic
  intent cause duplicate execution across the fan-in, with no deduplication mechanism.
- **Crash/restart phase alignment creates persistent periodic spikes (#9):** After
  fleet restart, all strategies begin windows simultaneously and stay phase-aligned.
- **Telemetry storms perturb timer accuracy (#10):** Synchronous event emission
  during herds delays timer callbacks, changing which path fires.
- **Combinatorial O(S×I) groups under market-wide shocks (#11):** Timer queues and
  GC thrash cause non-deterministic path selection between predicate/timeout/capacity.
- **Risk-budget race favoring fast strategies over confirmatory ones (#8):** Arrival
  order, not alpha quality, determines which strategies consume budget.

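The risk-budget race (#8) in the list above is easy to reproduce in miniature. The sketch assumes a first-come-first-served budget check, which is a simplification and not necessarily how PortfolioRisk allocates budget; the strategy names and amounts are hypothetical.

```python
def approve_in_arrival_order(requests, budget):
    """Approve requests first-come-first-served until the budget runs out."""
    approved = []
    remaining = budget
    for strategy, amount in requests:
        if amount <= remaining:
            approved.append(strategy)
            remaining -= amount
    return approved


fast = ("fast_momentum", 60)           # hypothetical low-latency strategy
slow = ("confirmatory_pattern", 60)    # hypothetical slower, confirming strategy

# Same two requests, same budget; only the arrival order differs.
print(approve_in_arrival_order([fast, slow], budget=100))  # ['fast_momentum']
print(approve_in_arrival_order([slow, fast], budget=100))  # ['confirmatory_pattern']
```

Nothing in the approval rule looks at signal quality, so whichever strategy reaches the gate first wins the budget.
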
**Claude Opus unique findings (reported by neither other model):**
- **Stop-loss defeated by pattern aggregation temporal composition (#5, CRITICAL):**
  A stop-loss fires immediately (closing the position), but a pattern strategy's group
  is still buffering entry signals for the same instrument. When the pattern
  completes (with stale, pre-crash signal data), it forwards a re-entry decision.
  PortfolioRisk sees an empty position and approves, so the system immediately
  re-enters a position that was just risk-exited. **The safety mechanism (stop-loss)
  is rendered ineffective by temporal composition with a slow strategy.** This is
  the most architecturally consequential finding across all three models.
- **Crash probability correlates with opportunity quality (#6):** Crashes correlate
  with high volatility (memory pressure from signal bursts), but high volatility is
  also when trading opportunities are most profitable. The system architecturally
  selects AGAINST its best opportunities through correlated crash-and-miss cycles.
  The "selection bias" framing is novel and does not appear in the other models' output.
- **Pre-crash forwarded decisions + post-restart new decisions create duplicates (#7):**
  A decision forwarded successfully to PortfolioRisk before a crash is still in flight
  downstream. After restart, new signals create a new group → new decision → duplicate
  entry with a different decision ID. This is a unique insight about the boundary between
  aggregator responsibility and downstream responsibility.
- **Market-regime-driven bimodal completion clustering (#8):** The joint distribution
  of completion times across strategies is driven by shared market conditions (a hidden
  common factor). The system oscillates between burst-overload and timeout-cascade with
  limited healthy middle ground.

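The stop-loss finding (#5) is a pure ordering effect, which a short timeline makes concrete. In this sketch the `PortfolioRisk` class is a toy stand-in that approves any entry while the position is flat; that rule, the timestamps, and the quantities are assumptions for illustration, not the behavior specified in `aggregation.md`.

```python
from dataclasses import dataclass, field


@dataclass
class PortfolioRisk:
    """Toy risk gate: approves an entry whenever the position is flat."""
    position: int = 0
    log: list = field(default_factory=list)

    def submit(self, t, source, action, qty=0):
        if action == "exit":
            self.position = 0
            outcome = "exit approved"
        elif self.position == 0:
            self.position = qty
            outcome = "entry approved"
        else:
            outcome = "entry rejected"
        self.log.append((t, source, outcome))
        return outcome


def run_scenario():
    risk = PortfolioRisk(position=100)    # position held before the sell-off
    buffered_signal_times = [0, 1, 2]     # entry signals buffered by the pattern group

    # t=3: the stop-loss fires and closes the position immediately.
    risk.submit(3, "stop_loss", "exit")

    # t=5: the pattern group completes and forwards an entry decision built
    # only from signals that predate the stop-loss.
    newest_signal = max(buffered_signal_times)
    risk.submit(5, "pattern", "enter", qty=100)
    return risk, newest_signal


risk, newest_signal = run_scenario()
print(risk.log)
print(risk.position, newest_signal)
```

Every step is locally valid: the stop-loss exits, the group completes, and the gate sees a flat position. The final position is nonetheless back at 100, based on signals that are all older than the stop-loss itself.
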
**Claude Sonnet 4.0 unique findings:**
- **Cross-algorithm state explosion in pattern completion (#6):** Multi-instrument
  pattern strategies create implicit dependencies between groups that the per-group
  state machine can't represent. This finding is somewhat generic and lacks the specific
  mechanism detail of the other models' findings.
- No other truly unique findings. Sonnet's 6 findings overlap substantially with
  the common ground and are framed less specifically and more abstractly.

**Quality assessment:**
- **GPT-5** produced the most findings (13) with the greatest breadth. Several
  findings identified systemic dynamics: portfolio composition bias from heterogeneous
  timeout semantics, position-dependent signal coupling creating feedback into
  aggregation, and arrival-order fairness violations. GPT-5 uniquely identified the
  telemetry/timer interference pattern and the combinatorial explosion at system scale.
  Its strongest contribution is identifying multiple distinct feedback loops through
  external systems (categories 2 and 5 of the prompt).
- **Claude Opus** produced fewer findings (8), but two are qualitatively superior to
  anything in GPT-5's output: the stop-loss-defeated-by-pattern-composition finding
  (a genuine safety mechanism failure) and the crash-correlates-with-opportunity-quality
  finding (a selection bias). Opus continues its pattern from Findings #11-#20 of
  identifying design TENSIONS and CONTRADICTIONS rather than just failure modes. The
  stop-loss finding is the single most important finding across all three models because
  it shows a risk management mechanism being defeated by the architecture's own
  composition rules.
- **Claude Sonnet 4.0** was fast (31.8s) but produced the weakest output: only 6
  findings, with less specificity and more overlap with common ground. The cross-algorithm
  state explosion finding (#6) was somewhat generic. Compared to Sonnet 4.6's
  performance in Finding #12 (17 assumptions, 85% of GPT-5) and Finding #14 (8
  cross-component findings), Sonnet 4.0 here is notably weaker. This suggests the
  4.0 → 4.6 jump was significant for analytical depth.

**Key insight: "emergent behavior from rule composition" as an analytical lens**

This task type has not been tested in earlier findings. It differs from:
- **Hidden assumptions** (what must be true for this to work?)
- **Race conditions** (what temporal interleavings cause bugs?)
- **Cross-component interactions** (what happens between components?)
- **Invariant violations** (what legal sequences break invariants?)

The new lens asks: **"What happens when many correct instances of this design operate
simultaneously in a correlated environment?"** This requires reasoning about:
1. Statistical properties of the composition (correlation, synchronization)
2. Shared-resource contention hidden by logical isolation abstractions
3. Feedback loops that cross the document's scope boundaries
4. Fairness and priority under resource competition

**What makes this lens uniquely valuable:**
Most previous analytical lenses find bugs in the DESIGN. This lens finds bugs in the
DEPLOYMENT: scenarios that are invisible at the single-instance design level but emerge
at scale. The aggregation document's state machine is CORRECT for a single group. The
findings here are about what happens when many correct state machines operate
simultaneously in a correlated world. This is the gap between "correct by construction"
and "correct in production."

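The memory finding from the common ground is the simplest instance of this gap: a cap that is correct per group says nothing about the total. The sketch below assumes one group per (strategy, instrument) pair, each with its own buffer cap; the function names and the example numbers are hypothetical.

```python
def worst_case_groups(strategies, instruments):
    """Upper bound on live groups if every pair has one open group."""
    return strategies * instruments


def worst_case_buffered_signals(strategies, instruments, per_group_cap):
    """Upper bound on buffered signals when every group is at its cap."""
    return worst_case_groups(strategies, instruments) * per_group_cap


# Hypothetical deployment: 20 strategies, 500 instruments, cap of 64 signals.
print(worst_case_groups(20, 500))                 # 10000
print(worst_case_buffered_signals(20, 500, 64))   # 640000
```

A market-wide shock is exactly the condition under which many pairs open groups at once, so the worst case is also the correlated case.
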
**Model strengths for this lens:**
- GPT-5 is best at identifying SYSTEMIC dynamics: feedback loops, fairness violations,
  resource contention patterns, and cascading effects. It reasons about the system as a
  whole operating over time.
- Opus is best at identifying CONSEQUENTIAL compositions: finding the one interaction
  that defeats a safety mechanism or creates a selection bias. It produces fewer findings
  but higher architectural significance per finding.
- Sonnet 4.0 is not adequate for this task: its findings are too abstract and lack
  specific mechanisms. (Sonnet 4.6 would likely perform better based on prior results.)

**Comparison to document-centric lenses:**

| Lens type | What it finds | Best model |
|---|---|---|
| Hidden assumptions | What the doc takes for granted | All (GPT-5 most) |
| Race conditions | Temporal bugs between components | GPT-5, Opus |
| Invariant violations | Legal paths that break rules | GPT-5 (precision) |
| Cross-doc consistency | Contradictions between docs | GPT-5, Opus |
| **Emergent composition** | **Scale/deployment bugs** | **GPT-5 (breadth), Opus (depth)** |

**Practical implication:** For any system that runs many instances of the same design
(microservices, per-user pipelines, per-strategy workers), the "emergent composition"
lens should be part of architecture review. It specifically targets the gap between
design correctness and production behavior: the class of bugs that unit tests, property
tests, and single-instance review all miss.