6af8a6ee10
Break the monolithic 3249-line findings file into 29 individual files, one per experiment. Each file is named YYYY-MM-DD-NN-slug.md for easy chronological sorting and discovery. No content changes — purely structural reorganization.
132 lines
7.8 KiB
Markdown
# Finding 14: Cross-component interaction analysis: GPT-5 Mini enters as viable mid-tier; task framing strongly shapes output quality

**Date:** 2026-05-03

**Task:** Identify cross-component interaction failures in gargoyle's
`continuous-risk-monitoring.md` (459 lines) — a document specifying
PortfolioMonitor's interactions with OrderManager, PortfolioRisk, MarketData,
KillSwitch, ETS tables, and the pipeline supervision tree.

**How we used them:** Same document (full text) + same focused analytical
question to all 3 models via HAI proxy. Prompt was highly structured: specified
5 categories of cross-component failures to look for (semantic mismatches,
ordering violations, feedback loops, partial visibility, supervision boundary
effects) and required a specific output format (components, sequence, gap, impact).
No tools, no project context beyond the document itself.

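The exact prompt text is not recorded in this finding, so the following is only a minimal sketch of how such a structured prompt could be assembled. The five categories, the four output fields, and the exclusion of single-component failures come from this write-up; the function name `build_prompt`, the wording of the instructions, and the `--- DOCUMENT ---` delimiter are assumptions made for illustration.

```python
# Hypothetical reconstruction: the real prompt wording is not in this finding.
CATEGORIES = [
    "semantic mismatches",
    "ordering violations",
    "feedback loops",
    "partial visibility",
    "supervision boundary effects",
]
OUTPUT_FIELDS = ["components", "sequence", "gap", "impact"]


def build_prompt(document_text: str) -> str:
    """Assemble a narrow-lens prompt around the full document text."""
    category_lines = "\n".join(
        f"{i}. {name}" for i, name in enumerate(CATEGORIES, start=1)
    )
    field_lines = "\n".join(f"- {field}" for field in OUTPUT_FIELDS)
    return (
        "Identify cross-component interaction failures in the document below.\n"
        "Ignore failures confined to a single component.\n\n"
        f"Look for these categories:\n{category_lines}\n\n"
        f"For each finding, report:\n{field_lines}\n\n"
        f"--- DOCUMENT ---\n{document_text}\n"
    )


if __name__ == "__main__":
    print(build_prompt("(full text of continuous-risk-monitoring.md)"))
```

Keeping the categories and fields in lists makes it easy to send an identical prompt to every model, which is what makes the per-model comparison meaningful.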
| Model | Time | Output tokens | Reasoning tokens | Findings |
|---|---|---|---|---|
| GPT-5 Mini | 68s | 5,445 | 2,240 | 6 (+1 cut off) |
| GPT-5 | 116s | 10,604 | 8,128 | 10 |
| Claude Sonnet 4.6 | 38s | 1,868 | (internal) | 8 |

**What they found — common ground (all 3 identified):**
- Fill-to-position query race (fill event triggers evaluation but position
  store hasn't yet reflected the fill)
- Restrict flag ETS table destruction on PM crash → permissive window
- Kill switch check vs liquidation submission race
- Ticker subscription timing gap (new position opened but ticks not yet
  subscribed → breach goes undetected)

**GPT-5 unique findings (not in either other model):**
- Stale prices are NOT fail-safe for drawdown (higher stale price → inflated
  portfolio value → understated drawdown). The document claims "fail-safe"
  but this only holds for exposure metrics, not drawdown. This is the most
  architecturally significant finding across all three models.
- Price definition mismatch between PM (last_trade from ETS) and
  OrderManager/broker (bid/ask/mid) causing mis-sized liquidation and
  oscillation
- Cross-component oscillation: PM's hysteresis is internal, while
  PortfolioRisk's binary restrict gate clears immediately (no cross-component
  cooldown)
- Liquidation stuck after OM restart (terminal events lost;
  liquidation_in_flight stays true indefinitely with no timeout/rehydration)
- "Minimal risk checks" not enforced — PM goes through the same OM gates as
  strategy orders, so MarketHours/StalePrice controls may reject after-hours
  or stale-price liquidation attempts
- FLATTEN mode semantics gap — PM refrains from liquidating when kill switch
  engaged, but FLATTEN cancels open orders without actually CLOSING positions.
  No component is left to close positions.

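The stale-price finding is easiest to see with numbers. The sketch below assumes drawdown is measured as the fractional drop from the high water mark and that portfolio value is cash plus position value at the marked price; the analyzed document's actual formulas are not reproduced in this finding, and every figure here is invented for illustration.

```python
def drawdown(high_water_mark: float, portfolio_value: float) -> float:
    """Fractional drop from the high water mark (assumed definition)."""
    return (high_water_mark - portfolio_value) / high_water_mark


# Illustrative numbers only; none of these come from the analyzed document.
SHARES = 1_000
CASH = 50_000.0
HIGH_WATER_MARK = 160_000.0
DRAWDOWN_LIMIT = 0.05

stale_price = 105.0  # last cached trade, no longer updating
live_price = 95.0    # where the market actually is

stale_value = CASH + SHARES * stale_price
live_value = CASH + SHARES * live_price

stale_drawdown = drawdown(HIGH_WATER_MARK, stale_value)
live_drawdown = drawdown(HIGH_WATER_MARK, live_value)

# Exposure errs on the cautious side: the stale, higher mark overstates it.
stale_exposure = SHARES * stale_price
live_exposure = SHARES * live_price

print(f"exposure  stale={stale_exposure:,.0f}  live={live_exposure:,.0f}")
print(f"drawdown  stale={stale_drawdown:.2%}  live={live_drawdown:.2%}")
print("breach seen with stale price:", stale_drawdown > DRAWDOWN_LIMIT)
print("breach with live price:      ", live_drawdown > DRAWDOWN_LIMIT)
```

With these numbers the stale mark overstates exposure (the conservative direction) but reports a drawdown of roughly 3.1% when the real figure is roughly 9.4%, so a 5% limit is breached without being detected.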
**Claude Sonnet 4.6 unique findings (not in either other model):**
- Liquidation feedback loop with PortfolioRisk — buy-to-cover for short
  positions could INCREASE net long exposure at portfolio level, paradoxically
  worsening concentration while fixing position-level metrics
- High water mark reset on pipeline restart masks true intraday drawdown
  (restart → HWM resets to lower current value → drawdown calculated from
  false baseline → larger losses permitted than intended)
- Multi-metric breach with single boolean flag — concentration liquidation
  for AAPL sets liquidation_in_flight, blocking simultaneous daily P&L
  liquidation for different positions
- Market close/open vs after-hours fills — the document claims to evaluate
  after-hours fills but uses stale market-close prices

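The high water mark reset can be worked through the same way. This is a sketch under two assumptions that this finding does not confirm in detail: that liquidation triggers when value falls a fixed fraction below the HWM, and that a restart re-initializes the HWM to the current portfolio value. All numbers are invented.

```python
def liquidation_trigger(high_water_mark: float, drawdown_limit: float) -> float:
    """Portfolio value at which the drawdown limit is hit (assumed rule)."""
    return high_water_mark * (1.0 - drawdown_limit)


# Illustrative numbers only.
DRAWDOWN_LIMIT = 0.10
true_hwm = 100_000.0         # intraday peak before the restart
value_at_restart = 94_000.0  # already 6% below that peak

trigger_before = liquidation_trigger(true_hwm, DRAWDOWN_LIMIT)

# Assumed restart behavior: HWM is re-seeded from the current value.
reset_hwm = value_at_restart
trigger_after = liquidation_trigger(reset_hwm, DRAWDOWN_LIMIT)

# Loss measured against the real peak when the reset trigger finally fires.
true_drawdown_at_trigger = (true_hwm - trigger_after) / true_hwm

print(f"trigger before restart: {trigger_before:,.0f}")
print(f"trigger after restart:  {trigger_after:,.0f}")
print(f"true drawdown at new trigger: {true_drawdown_at_trigger:.1%}")
```

Here a nominal 10% limit lets the portfolio fall about 15.4% from its real peak before anything fires. Persisting the HWM across restarts would close the gap, though whether that suits gargoyle's design is not something this experiment tested.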
**GPT-5 Mini unique findings (not in either other model):**
- OrderManager order splitting/remapping causing liquidation_in_flight
  correlation failure (parent/child order ID mapping breaks terminal-event
  detection). Well-reasoned but highly implementation-specific.
- Restrict/clear oscillation loop with strategy behavior (strategies react
  to rejects → back off → restrict clears → strategies re-enter aggressively
  → re-breach). Good systems-thinking about emergent feedback.

**Quality assessment:**
- **GPT-5** produced the most findings (10) and the highest-quality
  architectural insight: the stale-price/drawdown contradiction is a genuine
  design flaw that contradicts the document's own safety claim. Multiple
  findings showed cross-boundary reasoning about semantic mismatches (price
  definition, FLATTEN semantics, gate bypass). Every finding named specific
  components and described precise event sequences.
- **Claude Sonnet 4.6** was fast (38s, only 1,868 output tokens) and produced
  8 solid findings. The HWM reset finding and the multi-metric/single-flag
  finding show genuine architectural reasoning. The liquidation feedback
  loop (buy-to-cover worsening portfolio concentration) is subtle and
  shows cross-position reasoning. However, some findings overlapped
  significantly with the common-ground set and added less unique depth.
  Sonnet performed MUCH better here than on race condition identification
  (Finding #13): 8 findings against GPT-5's 10 here (80%), versus a 7/12
  ratio (about 58%) previously.
- **GPT-5 Mini** produced 6 findings in 68s with 2,240 reasoning tokens.
  Quality was genuinely good — the order-splitting/correlation finding
  and the oscillation feedback loop both show real reasoning depth. It's
  clearly NOT GPT-4.1 Mini — it reasons about component interactions,
  not just within-frame risks. However, it found fewer issues, and a seventh
  finding was cut off (token limit or response truncation).

**Key insight — task framing as the dominant variable:**
This experiment used a much more structured prompt than previous ones:
specified 5 categories, required a specific output format, explicitly excluded
single-component failures. The result: ALL models produced higher-quality,
more focused output than in earlier experiments with broader prompts. Even
Sonnet — which struggled on race conditions (Finding #13) — performed well
here. The structured categories likely helped models organize their reasoning
without losing track of what they were looking for.

The prompt explicitly asked for "cross-component interaction failures" rather
than general analysis. This is the narrow-lens effect from Finding #2, but
applied to a complex multi-component document. The lens is narrow (only
inter-component gaps) but the scope is broad (459 lines, many interactions).
This combination — narrow analytical lens + broad document scope — appears
to be the sweet spot for getting quality from all model tiers.

**GPT-5 Mini positioning:**
First time testing GPT-5 Mini. Results: 6 findings in 68s vs GPT-5's 10 in
116s. That's 60% of the findings in 59% of the time, with 28% of the
reasoning tokens (2,240 vs 8,128). Quality-per-finding was solid — the order
correlation finding especially showed genuine systems reasoning. GPT-5 Mini
appears to be a legitimate mid-tier: more capable than GPT-4.1 (which can't
do this kind of cross-boundary reasoning) but less exhaustive than GPT-5.
Viable for first-pass screening and for bulk document review, where you'd run
many docs and can't afford full GPT-5 on each.

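The percentages above can be reproduced directly from the results table. The snippet below is just that arithmetic; the dictionary layout and the `pct` helper are illustrative, and the output-token ratio is an extra figure derived from the same table.

```python
# Figures copied from the results table in this finding.
runs = {
    "GPT-5 Mini": {"seconds": 68, "output_tokens": 5445,
                   "reasoning_tokens": 2240, "findings": 6},
    "GPT-5": {"seconds": 116, "output_tokens": 10604,
              "reasoning_tokens": 8128, "findings": 10},
}


def pct(part: float, whole: float) -> int:
    """Ratio as a whole-number percentage."""
    return round(100 * part / whole)


mini, full = runs["GPT-5 Mini"], runs["GPT-5"]

ratios = {
    key: pct(mini[key], full[key])
    for key in ("findings", "seconds", "reasoning_tokens", "output_tokens")
}

# Findings per minute of wall-clock time, for a throughput comparison.
per_minute = {
    name: run["findings"] * 60 / run["seconds"] for name, run in runs.items()
}

print(ratios)
print({name: round(rate, 2) for name, rate in per_minute.items()})
```

The ratios come out to 60% of findings, 59% of time, and 28% of reasoning tokens, matching the text, plus 51% of output tokens. Findings per minute are close for both models (about 5.3 vs 5.2), so on this single run the Mini advantage shows up in token spend, not in findings per unit of time.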
**Sonnet recovery from Finding #13:**
Sonnet went from 7 findings (with errors) on race conditions to 8 solid
findings here. The difference: this prompt was more structured, the document
was larger with more explicit interaction descriptions, and the task didn't
require pure temporal/sequential reasoning. "Cross-component interaction
failures" is closer to assumption-finding (Sonnet's strength) than race
condition identification (Sonnet's weakness). Task taxonomy continues to
matter more than raw model capability.

**Updated model assignment for cross-component analysis:**
1. GPT-5 — broadest coverage, finds semantic contradictions in the doc's
   own claims (10 findings)
2. Sonnet 4.6 — fast and efficient, good at portfolio-level reasoning and
   feedback loops (8 findings in 38s)
3. GPT-5 Mini — viable first-pass with genuine reasoning depth (6 findings)
4. (Opus untested for this task type — likely strong on design tensions)