refactor(findings): split ALL-FINDINGS.md into per-experiment files

Break the monolithic 3249-line findings file into 29 individual files, one per experiment. Each file is named YYYY-MM-DD-NN-slug.md for easy chronological sorting and discovery. No content changes — purely structural reorganization.
2026-05-06 07:15:50 -07:00
parent 1b108ff66e
commit 6af8a6ee10
32 changed files with 3232 additions and 3254 deletions
@@ -0,0 +1,158 @@
+# Finding 18: Temporal boundary analysis: GPT-5 is most exhaustive; Opus finds design-level contradictions; Sonnet 4.5 is structured but less deep
+
+**Date:** 2026-05-04
+**Task:** Identify temporal boundary vulnerabilities in gargoyle's `escalation-policy.md`
+(238 lines) — scenarios where the timing model (evaluation cycles, debounce counts,
+cooldown periods) creates windows of incorrect or dangerous behavior.
+**How we used them:** Same document (full text) + same focused analytical question to all
+3 models via HAI proxy. Highly structured prompt specifying 5 categories of temporal
+vulnerability (timing exploitation, counter reset abuse, asymmetric time exposure,
+cross-metric temporal interactions, state loss temporal effects). Required specific
+output format per finding (name, sequence with cycle numbers, mechanism, severity, fix).
+No tools, no project context beyond the document itself.
+
+| Model | Time | Output tokens | Reasoning tokens | Findings | Critical | High | Medium |
+|---|---|---|---|---|---|---|---|
+| GPT-5 | ~128s | 9,175 | 5,888 | 15 | 3 | 7 | 2 |
+| Claude Opus 4.6 | ~120s | 5,112 | (internal) | 10 | 3 | 5 | 2 |
+| Claude Sonnet 4.5 | ~100s | 4,056 | (internal) | 12 | 3 | 3 | 3 |
+
+**What they found — common ground (all 3 identified):**
+- Flash crash / inter-evaluation gap exploitation (metric spikes between discrete
+  evaluation cycles go undetected)
+- Single clear cycle resetting debounce counter (transient recovery defeats escalation
+  despite sustained risk — metric can breach 80%+ of cycles and never escalate)
+- Asymmetric escalation time vs loss compounding rate (11 cycles to reach liquidation
+  while losses compound every single cycle)
+- Monitor crash resets state to Clear, losing all escalation progress
+- Liquidation re-trigger requiring full debounce reset, delaying subsequent batches
+- Kill switch N value unspecified (timing indeterminacy)
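
The inter-evaluation gap in the first common-ground item can be illustrated with a minimal sketch. Everything here is assumed for illustration: the 30-second cycle length, the threshold of 1.0, and the hypothetical `spike` metric are not taken from `escalation-policy.md`.

```python
def sampled_breaches(metric_at, threshold, cycle_seconds, total_seconds):
    """Return the evaluation times (in seconds) at which a breach is observed.

    The metric is only inspected at discrete cycle boundaries, which is the
    behavior the flash-crash finding relies on.
    """
    return [
        t
        for t in range(0, total_seconds + 1, cycle_seconds)
        if metric_at(t) > threshold
    ]


def spike(t):
    # Hypothetical metric: baseline 0.5, spiking to 2.0 from t=31s to t=59s,
    # i.e. entirely between the evaluations at t=30s and t=60s.
    return 2.0 if 31 <= t <= 59 else 0.5


# Assumed 30-second evaluation cycle over a 2-minute window.
seen = sampled_breaches(spike, threshold=1.0, cycle_seconds=30, total_seconds=120)

# Ground truth at 1-second resolution, for comparison.
actual_breach_seconds = [t for t in range(121) if spike(t) > 1.0]

print(seen)                        # no evaluation lands inside the spike
print(len(actual_breach_seconds))  # seconds actually spent above threshold
```

Under these assumptions the monitor observes no breach at all, even though the metric sits above the threshold for most of one cycle.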
+
+**GPT-5 unique findings (not in either of the other models):**
+- Boundary-alignment counter starvation: explicitly modeled the "adversarial flicker"
+  pattern (2 breaching cycles, 1 clear cycle, repeat: breaching ~67% of the time yet
+  never escalating), with a precise mathematical framing of why K-of-N is needed
+- Cycle-length drift under load: GC pauses or CPU contention stretching evaluation
+  intervals means "3 cycles" could be 12 minutes instead of 90 seconds when it
+  matters most (high-load market stress = slowest evaluations)
+- Adversarial boundary timing (market microstructure masking): illiquid instruments
+  where opposing prints predictably arrive near evaluation boundaries, exploiting
+  deterministic sampling points
+- Cross-metric escalation gap and hedging prohibition: Restrict prevents ALL new
+  positions including risk-REDUCING hedges needed for a different metric still
+  escalating on its own timeline — protection for metric A actively worsens metric B
+- Cooldown stall causing prolonged Restrict: repeated transient spikes near hysteresis
+  threshold reset cooldown indefinitely while metric is actually safe
+- State inconsistency between restriction flags and monitor after restart:
+  documented asymmetry where flag persists (manual clear) but state resets (auto
+  clear) — creates orphaned restriction or unprotected window depending on
+  reconciliation approach
+- Metric computation fail-closed interacting with debounce: system errors create
+  false escalations with long cooldown, potentially blocking hedging trades
+- Unspecified N for kill switch post-liquidation breaches: coupled with crash
+  reset, system can loop indefinitely without reaching kill switch
+- In-liquidate flicker stall: one cycle below threshold after partial fill resets
+  re-trigger counter, stalling further liquidation
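
The adversarial flicker pattern and the K-of-N argument can be sketched as follows. This is an illustration under assumptions, not gargoyle's implementation: the consecutive-cycle threshold of 3 and the 3-of-5 window are hypothetical values, and both function names are invented for this example.

```python
from collections import deque


def consecutive_debounce(breaches, required):
    """Return the first cycle index at which escalation fires, or None.

    Escalation needs `required` consecutive breaching cycles; a single clear
    cycle resets the streak to zero.
    """
    streak = 0
    for cycle, breached in enumerate(breaches):
        streak = streak + 1 if breached else 0
        if streak >= required:
            return cycle
    return None


def k_of_n_debounce(breaches, k, n):
    """Return the first cycle at which at least k of the last n cycles breached."""
    window = deque(maxlen=n)  # sliding window over the most recent n cycles
    for cycle, breached in enumerate(breaches):
        window.append(breached)
        if sum(window) >= k:
            return cycle
    return None


# Flicker: breach, breach, clear, repeated for 30 cycles.
flicker = [True, True, False] * 10
breach_fraction = sum(flicker) / len(flicker)

print(breach_fraction)                   # two thirds of cycles are breaching
print(consecutive_debounce(flicker, 3))  # None: the streak never reaches 3
print(k_of_n_debounce(flicker, 3, 5))    # 3: fires on the fourth cycle
```

With these assumed parameters the consecutive counter never escalates, while the sliding-window rule fires as soon as the third breach lands inside the window.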
+
+**Claude Opus unique findings (not in either of the other models):**
+- De-escalation cooldown exploitation (predictable window): after cooldown completes
+  and restriction lifts, strategy has a GUARANTEED 5+ cycle window of unrestricted
+  trading before Restrict can re-engage — an automated strategy could systematically
+  exploit this predictable safe window to re-enter dangerous positions
+- Orphaned restriction flag asymmetry framed as a DESIGN CONTRADICTION: the failure
+  modes table specifies opposing recovery paths for state (automatic → Clear) vs
+  flags (manual clear), creating an irreconcilable dual state. Opus uniquely
+  identified that operator intervention to clear the flag could inadvertently
+  create a WORSE protection gap than leaving it orphaned
+- Synthesizing analysis style: Opus's summary explicitly identified that the
+  three Critical findings share a common cause (debounce optimizes against false
+  positives at the expense of false negatives during sustained events) and proposed
+  a single architectural fix (severity-aware fast path) that addresses all three
+
+**Claude Sonnet 4.5 unique findings (not in either of the other models):**
+- De-escalation timing not accounting for proximity to breach threshold: system
+  removes protection while metric is still near-dangerous, and re-escalation
+  requires full debounce — created a specific "whipsaw" scenario with cycle numbers
+- Manual-only de-escalation from Liquidate creates UNBOUNDED recovery time:
+  if triggered at 2 AM Saturday, trading disabled until Monday despite metrics
+  recovering in minutes. Framed as contradiction with "autonomous" design goals
+- Evaluation cycle synchronization assumption: no handling of variable timing
+  (CPU contention, GC pauses) — implicit throughout but never addressed
+- Cold start escalation ambiguity: system starts with no prior state while
+  portfolio may already be in breach condition
+- De-escalation event ordering race: multiple metrics de-escalating simultaneously
+  may emit events in non-deterministic order, confusing external observers
+
+**Quality assessment:**
+- **GPT-5** was the most exhaustive (15 findings) and showed the strongest
+  mathematical/systems reasoning. Its unique findings included precise attack
+  models (adversarial flicker, boundary alignment, microstructure masking) that
+  describe exact exploitation patterns with percentages and cycle counts. The
+  cross-metric hedging prohibition finding is architecturally significant — it
+  identifies that protection for one metric can actively CREATE risk for another.
+  Every finding was actionable with specific fixes.
+- **Claude Opus 4.6** produced fewer findings (10) but with characteristic depth
+  and self-awareness. Its cooldown exploitation finding identified a PREDICTABLE
+  exploit window that an automated strategy could systematically abuse — framed
+  not as an accident but as an adversarial opportunity. The summary synthesis
+  (identifying common cause across Critical findings) shows meta-analytical
+  capability the other models didn't demonstrate. Opus also uniquely identified
+  that human intervention to fix one problem could create a WORSE problem —
+  second-order operational reasoning.
+- **Claude Sonnet 4.5** was well-structured (12 findings, clean severity tiers,
+  organized by Critical/High/Medium/Low) and faster than the other two models.
+  Its findings were solid but less architecturally deep. The manual de-escalation
+  contradiction finding was genuinely insightful (unbounded recovery time vs
+  autonomous design goals). However, several findings restated concepts the
+  other models covered with less specificity about exploitation mechanics.
+
+**Key insight — temporal reasoning as a task type:**
+This is the first experiment specifically testing "temporal boundary analysis" —
+reasoning about time-domain properties of a state machine (evaluation frequency,
+counter semantics, cooldown mechanics, crash/restart timing).
+
+Results compared to Finding #13 (race condition identification on a concurrency doc):
+- GPT-5: 15 findings here vs 12 in Finding #13. Consistent high performance
+  on temporal reasoning tasks across both experiments.
+- Opus: 10 findings here vs 10 in Finding #13. Remarkably consistent — Opus
+  produces ~10 high-quality findings regardless of temporal task variant.
+- Sonnet 4.5: 12 findings here (first test). Compare to Sonnet 4.6's 7 findings
+  (with errors) in Finding #13. Sonnet 4.5 handles temporal reasoning better than
+  4.6 — consistent with Finding #16 showing 4.5 is more exhaustive across task types.
+
+**Sonnet 4.5 vs 4.6 on temporal reasoning (inferred comparison):**
+Sonnet 4.6 struggled significantly on race condition identification (Finding #13:
+7 findings with analytical errors, misreading architecture). Sonnet 4.5 here
+produced 12 solid findings with no apparent misreadings. This suggests 4.5's
+exhaustiveness advantage extends to temporal reasoning — the additional
+exploration it does (vs 4.6's aggressive self-filtering) catches more temporal
+interactions. This is consistent with Finding #16's pattern: 4.5 for coverage, 4.6 for precision.
+
+**The structured-prompt effect continues:**
+All three models produced focused, high-quality output with this highly structured
+prompt (5 specific categories + required output format). This confirms Finding #14:
+narrow analytical lens + broad document scope is the sweet spot for all model tiers.
+The prompt structure appears to be a stronger predictor of output quality than model
+choice for the bottom 80% of findings (all models find the common-ground issues).
+Model choice matters for the TOP 20% — the unique insights that require deeper
+reasoning about system interactions.
+
+**Updated model assignment for temporal boundary analysis:**
+1. GPT-5 — most exhaustive, strongest at modeling adversarial exploitation patterns
+   and mathematical edge cases (15 findings)
+2. Claude Sonnet 4.5 — good volume with clean structure, viable for first-pass
+   temporal analysis (12 findings, no errors)
+3. Claude Opus 4.6 — fewest findings but highest insight density, uniquely
+   identifies predictable exploit windows and operational second-order effects
+   (10 findings)
+
+**Practical implication:** For temporal analysis on state machines and timing-dependent
+policies, the three-model stack produces genuinely complementary value:
+- GPT-5 catches the adversarial attack patterns and mathematical edge cases
+- Opus catches the predictable exploit windows and operational contradictions
+- Sonnet 4.5 provides good breadth at lower cost with clean severity categorization
+
+The union of unique findings across all three models reveals significantly more
+temporal vulnerabilities than any single model alone. For a document governing
+autonomous financial actions (liquidation, kill switch), the cost of running all
+three (~$1-2) is trivially justified against the risk of missing a timing exploit.