model-research/findings/2026-05-03-14-crosscomponent-interaction-analysis-gpt5-mini.md
Rodin 6af8a6ee10 refactor(findings): split ALL-FINDINGS.md into per-experiment files
Break the monolithic 3249-line findings file into 29 individual files,
one per experiment. Each file is named YYYY-MM-DD-NN-slug.md for easy
chronological sorting and discovery.

No content changes — purely structural reorganization.
2026-05-06 07:15:50 -07:00


Finding 14: Cross-component interaction analysis: GPT-5 Mini enters as viable mid-tier; task framing strongly shapes output quality

Date: 2026-05-03

Task: Identify cross-component interaction failures in gargoyle's continuous-risk-monitoring.md (459 lines), a document specifying PortfolioMonitor's interactions with OrderManager, PortfolioRisk, MarketData, KillSwitch, ETS tables, and the pipeline supervision tree.

How we used them: Same document (full text) + same focused analytical question to all 3 models via HAI proxy. Prompt was highly structured: specified 5 categories of cross-component failures to look for (semantic mismatches, ordering violations, feedback loops, partial visibility, supervision boundary effects) and required specific output format (components, sequence, gap, impact). No tools, no project context beyond the document itself.

Model               Time   Output tokens   Reasoning tokens   Findings
GPT-5 Mini          68s    5,445           2,240              6 (+1 cut off)
GPT-5               116s   10,604          8,128              10
Claude Sonnet 4.6   38s    1,868           (internal)         8

What they found — common ground (all 3 identified):

  • Fill-to-position query race (fill event triggers evaluation but position store hasn't yet reflected the fill)
  • Restrict flag ETS table destruction on PM crash → permissive window
  • Kill switch check vs liquidation submission race
  • Ticker subscription timing gap (new position opened but ticks not yet subscribed → breach goes undetected)

GPT-5 unique findings (not reported by either other model):

  • Stale prices are NOT fail-safe for drawdown (higher stale price → inflated portfolio value → understated drawdown). The document claims "fail-safe" but this only holds for exposure metrics, not drawdown. This is the most architecturally significant finding across all three models.
  • Price definition mismatch between PM (last_trade from ETS) and OrderManager/broker (bid/ask/mid) causing mis-sized liquidation and oscillation
  • Cross-component oscillation: PM hysteresis internal vs PRisk's immediate binary restrict gate clearing (no cross-component cooldown)
  • Liquidation stuck after OM restart (terminal events lost; liquidation_in_flight stays true indefinitely with no timeout/rehydration)
  • "Minimal risk checks" not enforced — PM goes through same OM gates as strategy orders but MarketHours/StalePrice controls may reject after-hours or stale-price liquidation attempts
  • FLATTEN mode semantics gap — PM refrains from liquidating when kill switch engaged, but FLATTEN cancels open orders without actually CLOSING positions. No component left to close positions.
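The stale-price finding above comes down to simple arithmetic, sketched here as a minimal illustration. All numbers, names, and the 5% limit are hypothetical and are not taken from continuous-risk-monitoring.md; the sketch assumes a single long position and a drawdown defined as the fractional drop from the high water mark.

```python
def drawdown(high_water_mark: float, portfolio_value: float) -> float:
    """Fractional drop from the high water mark (0.0 when at or above it)."""
    return max(0.0, (high_water_mark - portfolio_value) / high_water_mark)


# Hypothetical scenario: one long position, price has fallen but the
# monitor still holds an older, higher last-trade price.
HWM = 100_000.0      # assumed high water mark
QTY = 1_000          # assumed long quantity
STALE_PRICE = 98.0   # price the monitor still sees
TRUE_PRICE = 93.0    # where the market actually is
LIMIT = 0.05         # assumed drawdown limit

# Exposure: the stale (higher) price overstates exposure, which errs on
# the conservative side. This is the sense in which stale prices are fail-safe.
stale_exposure = QTY * STALE_PRICE
true_exposure = QTY * TRUE_PRICE

# Drawdown: the same stale price inflates portfolio value, so the
# computed drawdown is smaller than the real one.
stale_dd = drawdown(HWM, QTY * STALE_PRICE)
true_dd = drawdown(HWM, QTY * TRUE_PRICE)

print(f"exposure  stale={stale_exposure:,.0f} true={true_exposure:,.0f}")
print(f"drawdown  stale={stale_dd:.2%} true={true_dd:.2%} limit={LIMIT:.0%}")
print("breach missed:", stale_dd < LIMIT < true_dd)
```

Under these assumed numbers the stale view reports a 2% drawdown while the true drawdown is 7%, so a 5% limit is breached without the monitor seeing it.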

Claude Sonnet 4.6 unique findings (not reported by either other model):

  • Liquidation feedback loop with PortfolioRisk — buy-to-cover for short positions could INCREASE net long exposure at portfolio level, paradoxically worsening concentration while fixing position-level metrics
  • High water mark reset on pipeline restart masks true intraday drawdown (restart → HWM resets to lower current value → drawdown calculated from false baseline → larger losses permitted than intended)
  • Multi-metric breach with single boolean flag — concentration liquidation for AAPL sets liquidation_in_flight, blocking simultaneous daily P&L liquidation for different positions
  • Market close/open vs after-hours fills — claims to evaluate after-hours fills but uses stale market-close prices
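The high water mark reset described above can be illustrated with a small sketch. The `DrawdownTracker` class, the values, and the 5% limit are hypothetical; the only assumption carried over from the finding is that a restarted monitor initializes its HWM from the current portfolio value.

```python
class DrawdownTracker:
    """Hypothetical tracker: HWM starts at whatever value it is created with."""

    def __init__(self, initial_value: float) -> None:
        self.high_water_mark = initial_value

    def observe(self, value: float) -> float:
        """Update the HWM and return the fractional drawdown from it."""
        self.high_water_mark = max(self.high_water_mark, value)
        return (self.high_water_mark - value) / self.high_water_mark


LIMIT = 0.05  # assumed intraday drawdown limit

# A tracker that runs all day without interruption.
continuous = DrawdownTracker(100_000.0)
continuous.observe(96_000.0)  # portfolio is already 4% down

# The pipeline restarts at this point; the new tracker only knows the
# current value, so its baseline is 96,000 rather than 100,000.
restarted = DrawdownTracker(96_000.0)

# The portfolio keeps falling after the restart.
true_dd = continuous.observe(93_000.0)
reset_dd = restarted.observe(93_000.0)

print(f"true drawdown:  {true_dd:.2%}")
print(f"after restart:  {reset_dd:.2%}")
print("breach masked:", reset_dd < LIMIT < true_dd)
```

With these assumed values the uninterrupted tracker sees a 7% drawdown, while the restarted one sees about 3.1% and stays under the limit.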

GPT-5 Mini unique findings (not reported by either other model):

  • OrderManager order splitting/remapping causing liquidation_in_flight correlation failure (parent/child order ID mapping breaks terminal-event detection). Well-reasoned but highly implementation-specific.
  • Restrict/clear oscillation loop with strategy behavior (strategies react to rejects → back off → restrict clears → strategies re-enter aggressively → re-breach). Good systems-thinking about emergent feedback.
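The order-splitting finding can be sketched as follows. This is an illustration under assumptions, not the actual PortfolioMonitor or OrderManager logic: the `Monitor` class, the child ID scheme, and the idea that the flag clears only on an exact order ID match are all hypothetical.

```python
class Monitor:
    """Hypothetical monitor that clears its flag only on an exact ID match."""

    def __init__(self) -> None:
        self.liquidation_in_flight = False
        self.pending_order_id = None

    def submit_liquidation(self, order_id: str) -> None:
        self.liquidation_in_flight = True
        self.pending_order_id = order_id

    def on_terminal_event(self, order_id: str) -> None:
        # Only the ID the monitor submitted is recognized as its own order.
        if order_id == self.pending_order_id:
            self.liquidation_in_flight = False
            self.pending_order_id = None


def split_order(parent_id: str, parts: int) -> list[str]:
    """Hypothetical order manager behaviour: one parent becomes N child IDs."""
    return [f"{parent_id}-child-{i}" for i in range(1, parts + 1)]


monitor = Monitor()
monitor.submit_liquidation("liq-1")

# Every child order fills, but the terminal events carry child IDs.
for child_id in split_order("liq-1", 2):
    monitor.on_terminal_event(child_id)

stuck = monitor.liquidation_in_flight
print("liquidation_in_flight still set:", stuck)
```

In this sketch the liquidation has fully completed, yet the flag stays set because no terminal event ever carries the parent ID.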

Quality assessment:

  • GPT-5 produced the most findings (10) and the highest-quality architectural insight: the stale-price/drawdown contradiction is a genuine design flaw that contradicts the document's own safety claim. Multiple findings showed cross-boundary reasoning about semantic mismatches (price definition, FLATTEN semantics, gate bypass). Every finding named specific components and described precise event sequences.
  • Claude Sonnet 4.6 was fast (38s, only 1,868 tokens) and produced 8 solid findings. The HWM reset finding and the multi-metric/single-flag finding show genuine architectural reasoning. The liquidation feedback loop (buy-to-cover worsening portfolio concentration) is subtle and shows cross-position reasoning. However, some findings overlapped significantly with the common-ground set and added less unique depth. Sonnet performed MUCH better here than on race condition identification (Finding #13) — 8/10 ratio vs 7/12 previously.
  • GPT-5 Mini produced 6 findings in 68s with 2,240 reasoning tokens. Quality was genuinely good: the order-splitting/correlation finding and the oscillation feedback loop both show real reasoning depth. It's clearly NOT GPT-4.1 Mini; it reasons about component interactions, not just within-frame risks. However, it found fewer issues than the other two models, and a seventh finding was cut off (token limit or response truncation).

Key insight — task framing as the dominant variable: This experiment used a much more structured prompt than previous ones: specified 5 categories, required specific output format, explicitly excluded single-component failures. The result: ALL models produced higher-quality, more focused output than in earlier experiments with broader prompts. Even Sonnet — which struggled on race conditions (Finding #13) — performed well here. The structured categories likely helped models organize their reasoning without losing track of what they were looking for.

The prompt explicitly asked for "cross-component interaction failures" rather than general analysis. This is the narrow-lens effect from Finding #2, but applied to a complex multi-component document. The lens is narrow (only inter-component gaps) but the scope is broad (459 lines, many interactions). This combination — narrow analytical lens + broad document scope — appears to be the sweet spot for getting quality from all model tiers.

GPT-5 Mini positioning: First time testing GPT-5 Mini. Results: 6 findings in 68s vs GPT-5's 10 in 116s. That's 60% of the findings in 59% of the time, with 28% of the reasoning tokens (2,240 vs 8,128). Quality-per-finding was solid — the order correlation finding especially showed genuine systems reasoning. GPT-5 Mini appears to be a legitimate mid-tier: more capable than GPT-4.1 (which can't do this kind of cross-boundary reasoning) but less exhaustive than GPT-5. Viable for: first-pass screening, bulk document review where you'd run many docs and can't afford full GPT-5 on each.
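The ratios quoted above can be reproduced from the results table with a few lines. The figures are copied from the table; the dictionary layout and function names are just one illustrative way to organize them, and the findings-per-minute figure is a derived convenience metric rather than something measured in the experiment.

```python
# Time (seconds), reasoning tokens, and finding counts from the results table.
# Sonnet's reasoning tokens are reported as "(internal)", so they are omitted.
RUNS = {
    "GPT-5 Mini": {"seconds": 68, "reasoning_tokens": 2240, "findings": 6},
    "GPT-5": {"seconds": 116, "reasoning_tokens": 8128, "findings": 10},
    "Claude Sonnet 4.6": {"seconds": 38, "findings": 8},
}


def mini_share(metric: str) -> int:
    """GPT-5 Mini's value as a whole-number percentage of GPT-5's."""
    ratio = RUNS["GPT-5 Mini"][metric] / RUNS["GPT-5"][metric]
    return round(ratio * 100)


def findings_per_minute(model: str) -> float:
    """Findings divided by wall-clock time, scaled to one minute."""
    run = RUNS[model]
    return run["findings"] / run["seconds"] * 60


print("findings share:", mini_share("findings"), "%")                  # 60 %
print("time share:", mini_share("seconds"), "%")                       # 59 %
print("reasoning-token share:", mini_share("reasoning_tokens"), "%")   # 28 %

for model in RUNS:
    print(f"{model}: {findings_per_minute(model):.1f} findings/min")
```

Counting findings per minute treats every finding as equal, so it says nothing about the quality differences discussed in the assessment above.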

Sonnet recovery from Finding #13: Sonnet went from 7 findings (with errors) on race conditions to 8 solid findings here. The difference: this prompt was more structured, the document was larger with more explicit interaction descriptions, and the task didn't require pure temporal/sequential reasoning. "Cross-component interaction failures" is closer to assumption-finding (Sonnet's strength) than race condition identification (Sonnet's weakness). Task taxonomy continues to matter more than raw model capability.

Updated model assignment for cross-component analysis:

  1. GPT-5 — broadest coverage, finds semantic contradictions in the doc's own claims (10 findings)
  2. Sonnet 4.6 — fast and efficient, good at portfolio-level reasoning and feedback loops (8 findings in 38s)
  3. GPT-5 Mini — viable first-pass with genuine reasoning depth (6 findings)
  4. (Opus untested for this task type — likely strong on design tensions)