# Finding 14: Cross-component interaction analysis: GPT-5 Mini enters as viable mid-tier; task framing strongly shapes output quality
**Date:** 2026-05-03

**Task:** Identify cross-component interaction failures in gargoyle's
`continuous-risk-monitoring.md` (459 lines), a document specifying
PortfolioMonitor's interactions with OrderManager, PortfolioRisk, MarketData,
KillSwitch, ETS tables, and the pipeline supervision tree.

**How we used them:** Same document (full text) and the same focused analytical
question sent to all 3 models via HAI proxy. The prompt was highly structured: it
specified 5 categories of cross-component failures to look for (semantic mismatches,
ordering violations, feedback loops, partial visibility, supervision boundary
effects) and required a specific output format (components, sequence, gap, impact).
No tools, no project context beyond the document itself.

| Model | Time | Output tokens | Reasoning tokens | Findings |
|---|---|---|---|---|
| GPT-5 Mini | 68s | 5,445 | 2,240 | 6 (+1 cut off) |
| GPT-5 | 116s | 10,604 | 8,128 | 10 |
| Claude Sonnet 4.6 | 38s | 1,868 | (internal) | 8 |

**What they found — common ground (all 3 identified):**
- Fill-to-position query race (fill event triggers evaluation but position
store hasn't yet reflected the fill)
- Restrict flag ETS table destroyed on PortfolioMonitor (PM) crash → permissive window
- Kill switch check vs liquidation submission race
- Ticker subscription timing gap (new position opened but ticks not yet
subscribed → breach goes undetected)
**GPT-5 unique findings (not in either other model):**
- Stale prices are NOT fail-safe for drawdown (higher stale price → inflated
portfolio value → understated drawdown). The document claims "fail-safe"
but this only holds for exposure metrics, not drawdown. This is the most
architecturally significant finding across all three models.
- Price definition mismatch between PM (last_trade from ETS) and OrderManager/
broker (bid/ask/mid) causing mis-sized liquidation and oscillation
- Cross-component oscillation: PM's hysteresis is internal, while PortfolioRisk
(PRisk) clears its binary restrict gate immediately (no cross-component cooldown)
- Liquidation stuck after OrderManager (OM) restart (terminal events lost;
liquidation_in_flight stays true indefinitely with no timeout/rehydration)
- "Minimal risk checks" not enforced — PM goes through same OM gates as
strategy orders but MarketHours/StalePrice controls may reject after-hours
or stale-price liquidation attempts
- FLATTEN mode semantics gap: PM refrains from liquidating when the kill switch
is engaged, but FLATTEN cancels open orders without actually CLOSING positions,
so no component is left to close them.
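
The stale-price finding is easiest to see with numbers. The sketch below is a toy calculation, not gargoyle's code: the function names, the single long position, and every figure are assumptions chosen for illustration. It compares what a monitor would report when it still holds a higher stale price against what the true price implies.

```python
def portfolio_value(cash: float, quantity: int, price: float) -> float:
    """Mark-to-market value of cash plus one long position."""
    return cash + quantity * price


def drawdown(high_water_mark: float, value: float) -> float:
    """Fractional drawdown from the high water mark (0.0 at or above it)."""
    return max(0.0, (high_water_mark - value) / high_water_mark)


def exposure(quantity: int, price: float, value: float) -> float:
    """Position notional as a fraction of portfolio value."""
    return quantity * price / value


# Hypothetical numbers, chosen only for illustration.
HWM = 100_000.0
CASH = 50_000.0
QTY = 500
STALE_PRICE = 100.0  # last trade still held by the monitor
TRUE_PRICE = 90.0    # where the market actually trades

stale_value = portfolio_value(CASH, QTY, STALE_PRICE)
true_value = portfolio_value(CASH, QTY, TRUE_PRICE)

stale_drawdown = drawdown(HWM, stale_value)
true_drawdown = drawdown(HWM, true_value)
stale_exposure = exposure(QTY, STALE_PRICE, stale_value)
true_exposure = exposure(QTY, TRUE_PRICE, true_value)

print(f"drawdown: stale={stale_drawdown:.1%} true={true_drawdown:.1%}")
print(f"exposure: stale={stale_exposure:.1%} true={true_exposure:.1%}")
```

With these numbers the stale price reports 0% drawdown against a true 5%, while exposure is overstated (50% vs roughly 47%). The error is conservative for exposure and permissive for drawdown, which is the asymmetry GPT-5 flagged.
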
**Claude Sonnet 4.6 unique findings (not in either other model):**
- Liquidation feedback loop with PortfolioRisk — buy-to-cover for short
positions could INCREASE net long exposure at portfolio level, paradoxically
worsening concentration while fixing position-level metrics
- High water mark reset on pipeline restart masks true intraday drawdown
(restart → HWM resets to lower current value → drawdown calculated from
false baseline → larger losses permitted than intended)
- Multi-metric breach with single boolean flag — concentration liquidation
for AAPL sets liquidation_in_flight, blocking simultaneous daily P&L
liquidation for different positions
- Market close/open vs after-hours fills: the design claims to evaluate
after-hours fills but does so using stale market-close prices
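
The high water mark finding can be reproduced with a few lines. This is a minimal sketch assuming an in-memory tracker that is re-seeded with the current portfolio value on restart; `DrawdownTracker`, the 5% limit, and the values are hypothetical and not taken from the reviewed document.

```python
class DrawdownTracker:
    """Toy high-water-mark tracker; state lives only in memory."""

    def __init__(self, initial_value: float, limit: float) -> None:
        self.high_water_mark = initial_value
        self.limit = limit

    def breached(self, value: float) -> bool:
        """Update the HWM and report whether drawdown has reached the limit."""
        self.high_water_mark = max(self.high_water_mark, value)
        drawdown = (self.high_water_mark - value) / self.high_water_mark
        return drawdown >= self.limit


LIMIT = 0.05  # hypothetical 5% intraday drawdown limit

# No restart: the tracker remembers the 100,000 peak.
steady = DrawdownTracker(100_000.0, LIMIT)
steady_results = [steady.breached(v) for v in (96_000.0, 92_000.0)]

# Restart after the first drop: the new tracker is seeded with the current value.
before_restart = DrawdownTracker(100_000.0, LIMIT)
first = before_restart.breached(96_000.0)
after_restart = DrawdownTracker(96_000.0, LIMIT)  # HWM silently lowered
second = after_restart.breached(92_000.0)
restart_results = [first, second]

print(steady_results)   # [False, True]
print(restart_results)  # [False, False]
```

The same 8% loss from the true peak trips the limit without a restart and passes unnoticed with one, because the restarted tracker measures only about 4.2% from its false baseline.
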
**GPT-5 Mini unique findings (not in either other model):**
- OrderManager order splitting/remapping causing liquidation_in_flight
correlation failure (parent/child order ID mapping breaks terminal-event
detection). Well-reasoned but highly implementation-specific.
- Restrict/clear oscillation loop with strategy behavior (strategies react
to rejects → back off → restrict clears → strategies re-enter aggressively
→ re-breach). Good systems-thinking about emergent feedback.
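
The order-splitting finding can be sketched as a correlation problem. The code below is an illustrative toy, assuming the monitor waits for a terminal event carrying the ID it submitted; `LiquidationTracker`, the `on_split` hook, and the order IDs are all hypothetical names, not part of gargoyle.

```python
from typing import List, Optional, Set


class LiquidationTracker:
    """Toy stand-in for a monitor that tracks one in-flight liquidation."""

    def __init__(self) -> None:
        self.parent_id: Optional[str] = None
        self.pending: Set[str] = set()

    @property
    def liquidation_in_flight(self) -> bool:
        return self.parent_id is not None

    def submit(self, order_id: str) -> None:
        self.parent_id = order_id
        self.pending = {order_id}

    def on_split(self, parent_id: str, child_ids: List[str]) -> None:
        """Hypothetical hook: learn that the parent was replaced by child orders."""
        if parent_id == self.parent_id:
            self.pending = set(child_ids)

    def on_terminal_event(self, order_id: str) -> None:
        """Clear the flag once every order we are waiting on has finished."""
        self.pending.discard(order_id)
        if self.parent_id is not None and not self.pending:
            self.parent_id = None


# Without split notification: child IDs never match, so the flag stays set.
naive = LiquidationTracker()
naive.submit("liq-1")
for child in ("liq-1/a", "liq-1/b"):
    naive.on_terminal_event(child)
stuck = naive.liquidation_in_flight

# With split notification: the flag clears after the last child finishes.
aware = LiquidationTracker()
aware.submit("liq-1")
aware.on_split("liq-1", ["liq-1/a", "liq-1/b"])
aware.on_terminal_event("liq-1/a")
half_done = aware.liquidation_in_flight
aware.on_terminal_event("liq-1/b")
cleared = not aware.liquidation_in_flight

print(stuck, half_done, cleared)  # True True True
```

Whether the real OrderManager splits or remaps orders is exactly the implementation detail the finding depends on, so this only shows the failure shape, not evidence that it occurs.
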
**Quality assessment:**
- **GPT-5** produced the most findings (10) and the highest-quality
architectural insight: the stale-price/drawdown contradiction is a genuine
design flaw that contradicts the document's own safety claim. Multiple
findings showed cross-boundary reasoning about semantic mismatches (price
definition, FLATTEN semantics, gate bypass). Every finding named specific
components and described precise event sequences.
- **Claude Sonnet 4.6** was fast (38s, only 1,868 tokens) and produced 8
solid findings. The HWM reset finding and the multi-metric/single-flag
finding show genuine architectural reasoning. The liquidation feedback
loop (buy-to-cover worsening portfolio concentration) is subtle and
shows cross-position reasoning. However, some findings overlapped
significantly with the common-ground set and added less unique depth.
Sonnet performed MUCH better here than on race condition identification
(Finding #13) — 8/10 ratio vs 7/12 previously.
- **GPT-5 Mini** produced 6 findings in 68s with 2,240 reasoning tokens.
Quality was genuinely good — the order-splitting/correlation finding
and the oscillation feedback loop both show real reasoning depth. It's
clearly NOT GPT-4.1 Mini — it reasons about component interactions,
not just within-frame risks. However, it found fewer issues, and a seventh
finding was cut off (token limit or response truncation).
**Key insight — task framing as the dominant variable:**
This experiment used a much more structured prompt than previous ones:
specified 5 categories, required specific output format, explicitly excluded
single-component failures. The result: ALL models produced higher-quality,
more focused output than in earlier experiments with broader prompts. Even
Sonnet — which struggled on race conditions (Finding #13) — performed well
here. The structured categories likely helped models organize their reasoning
without losing track of what they were looking for.
The prompt explicitly asked for "cross-component interaction failures" rather
than general analysis. This is the narrow-lens effect from Finding #2, but
applied to a complex multi-component document. The lens is narrow (only
inter-component gaps) but the scope is broad (459 lines, many interactions).
This combination — narrow analytical lens + broad document scope — appears
to be the sweet spot for getting quality from all model tiers.
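
The prompt shape described above can be sketched as a small template builder. The five categories and four output fields come from this experiment; the instruction wording, the `build_prompt` name, and the document delimiter are assumptions, since the exact prompt text is not reproduced in this finding.

```python
CATEGORIES = [
    "semantic mismatches",
    "ordering violations",
    "feedback loops",
    "partial visibility",
    "supervision boundary effects",
]
OUTPUT_FIELDS = ["components", "sequence", "gap", "impact"]


def build_prompt(document_text: str) -> str:
    """Assemble a narrow-lens prompt; the wording is illustrative, not the original."""
    category_lines = "\n".join(
        f"{i}. {name}" for i, name in enumerate(CATEGORIES, 1)
    )
    field_lines = "\n".join(f"- {name}" for name in OUTPUT_FIELDS)
    return (
        "Identify cross-component interaction failures in the document below.\n"
        "Ignore failures that involve only a single component.\n\n"
        f"Look for these categories:\n{category_lines}\n\n"
        f"For each finding, report:\n{field_lines}\n\n"
        f"--- DOCUMENT ---\n{document_text}\n"
    )


prompt = build_prompt("(full text of the document under review)")
print(prompt)
```

Keeping the categories and fields as data makes it easy to hold the lens fixed while swapping documents, which is the combination this experiment suggests works across model tiers.
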
**GPT-5 Mini positioning:**
First time testing GPT-5 Mini. Results: 6 findings in 68s vs GPT-5's 10 in
116s. That's 60% of the findings in 59% of the time, with 28% of the
reasoning tokens (2,240 vs 8,128). Quality-per-finding was solid — the order
correlation finding especially showed genuine systems reasoning. GPT-5 Mini
appears to be a legitimate mid-tier: more capable than GPT-4.1 (which can't
do this kind of cross-boundary reasoning) but less exhaustive than GPT-5.
Viable for: first-pass screening, bulk document review where you'd run many
docs and can't afford full GPT-5 on each.
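
The percentages above can be re-derived from the results table. This is a small check using only the figures reported in this finding; the `RUNS` structure and `mini_share` helper are illustrative names.

```python
RUNS = {
    "GPT-5 Mini": {"findings": 6, "seconds": 68, "reasoning_tokens": 2_240},
    "GPT-5": {"findings": 10, "seconds": 116, "reasoning_tokens": 8_128},
}


def mini_share(metric: str) -> float:
    """GPT-5 Mini's value for a metric as a fraction of GPT-5's."""
    return RUNS["GPT-5 Mini"][metric] / RUNS["GPT-5"][metric]


shares = {metric: mini_share(metric) for metric in RUNS["GPT-5"]}
for metric, share in shares.items():
    print(f"{metric}: {share:.0%}")
# findings: 60%
# seconds: 59%
# reasoning_tokens: 28%
```
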
**Sonnet recovery from Finding #13:**
Sonnet went from 7 findings (with errors) on race conditions to 8 solid
findings here. The difference: this prompt was more structured, the document
was larger with more explicit interaction descriptions, and the task didn't
require pure temporal/sequential reasoning. "Cross-component interaction
failures" is closer to assumption-finding (Sonnet's strength) than race
condition identification (Sonnet's weakness). Task taxonomy continues to
matter more than raw model capability.
**Updated model assignment for cross-component analysis:**
1. GPT-5 — broadest coverage, finds semantic contradictions in the doc's
own claims (10 findings)
2. Sonnet 4.6 — fast and efficient, good at portfolio-level reasoning and
feedback loops (8 findings in 38s)
3. GPT-5 Mini — viable first-pass with genuine reasoning depth (6 findings)
4. (Opus untested for this task type — likely strong on design tensions)