refactor(findings): split ALL-FINDINGS.md into per-experiment files

Break the monolithic 3249-line findings file into 29 individual files,
one per experiment. Each file is named YYYY-MM-DD-NN-slug.md for easy
chronological sorting and discovery.

No content changes — purely structural reorganization.
Rodin
2026-05-06 07:15:50 -07:00
parent 1b108ff66e
commit 6af8a6ee10
32 changed files with 3232 additions and 3254 deletions
@@ -0,0 +1,158 @@
# Finding 18: Temporal boundary analysis: GPT-5 is most exhaustive; Opus finds design-level contradictions; Sonnet 4.5 is structured but less deep

**Date:** 2026-05-04

**Task:** Identify temporal boundary vulnerabilities in gargoyle's `escalation-policy.md`
(238 lines) — scenarios where the timing model (evaluation cycles, debounce counts,
cooldown periods) creates windows of incorrect or dangerous behavior.

**How we used them:** Same document (full text) + same focused analytical question to all
3 models via HAI proxy. Highly structured prompt specifying 5 categories of temporal
vulnerability (timing exploitation, counter reset abuse, asymmetric time exposure,
cross-metric temporal interactions, state loss temporal effects). Required specific
output format per finding (name, sequence with cycle numbers, mechanism, severity, fix).
No tools, no project context beyond the document itself.

| Model | Time | Output tokens | Reasoning tokens | Findings | Critical | High | Medium |
|---|---|---|---|---|---|---|---|
| GPT-5 | ~128s | 9,175 | 5,888 | 15 | 3 | 7 | 2 |
| Claude Opus 4.6 | ~120s | 5,112 | (internal) | 10 | 3 | 5 | 2 |
| Claude Sonnet 4.5 | ~100s | 4,056 | (internal) | 12 | 3 | 3 | 3 |
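A quick consistency check on the table above, as a minimal sketch: the three severity columns do not add up to the Findings total for every model. The counts are copied from the table; the helper name `untiered` is hypothetical. The quality assessment notes that Sonnet 4.5 also used a Low tier, so the remainder may be Low-severity findings, but that is an assumption rather than something the table states.

```python
# Counts copied from the table: (total findings, critical, high, medium).
rows = {
    "GPT-5": (15, 3, 7, 2),
    "Claude Opus 4.6": (10, 3, 5, 2),
    "Claude Sonnet 4.5": (12, 3, 3, 3),
}


def untiered(total, critical, high, medium):
    """Findings not covered by the three severity columns shown."""
    return total - (critical + high + medium)


remainder = {model: untiered(*counts) for model, counts in rows.items()}

for model, extra in remainder.items():
    print(f"{model}: {extra} finding(s) outside Critical/High/Medium")
# GPT-5: 3 finding(s) outside Critical/High/Medium
# Claude Opus 4.6: 0 finding(s) outside Critical/High/Medium
# Claude Sonnet 4.5: 3 finding(s) outside Critical/High/Medium
```

Only the Opus row is fully accounted for by the columns shown, so severity comparisons across models should be read with that gap in mind.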
**What they found — common ground (all 3 identified):**
- Flash crash / inter-evaluation gap exploitation (metric spikes between discrete
evaluation cycles go undetected)
- Single clear cycle resetting debounce counter (transient recovery defeats escalation
despite sustained risk — metric can breach 80%+ of cycles and never escalate)
- Asymmetric escalation time vs loss compounding rate (11 cycles to reach liquidation
while losses compound every single cycle)
- Monitor crash resets state to Clear, losing all escalation progress
- Liquidation re-trigger requiring full debounce reset, delaying subsequent batches
- Kill switch N value unspecified (timing indeterminacy)
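The debounce-reset issue in the list above can be reproduced with a few lines of simulation. This is a minimal sketch that assumes escalation requires `debounce` consecutive breaching cycles and that any clear cycle resets the counter to zero; the debounce values 3 and 5 are illustrative, not taken from `escalation-policy.md`, and the function names are hypothetical.

```python
from itertools import cycle, islice


def first_escalation(breaches, debounce):
    """Return the 1-based cycle at which a consecutive-breach counter
    reaches `debounce`, or None if it never does."""
    streak = 0
    for cycle_no, breached in enumerate(breaches, start=1):
        streak = streak + 1 if breached else 0  # one clear cycle wipes all progress
        if streak >= debounce:
            return cycle_no
    return None


def flicker(debounce, cycles):
    """Worst case for the counter: breach debounce-1 cycles, clear one, repeat."""
    pattern = [True] * (debounce - 1) + [False]
    return list(islice(cycle(pattern), cycles))


for debounce in (3, 5):  # hypothetical debounce counts
    trace = flicker(debounce, cycles=300)
    share = sum(trace) / len(trace)
    print(f"debounce={debounce}: breached {share:.0%} of cycles, "
          f"escalated at {first_escalation(trace, debounce)}")
# debounce=3: breached 67% of cycles, escalated at None
# debounce=5: breached 80% of cycles, escalated at None
```

Under these assumptions a metric can be in breach for up to (debounce - 1) / debounce of all cycles without ever escalating, which is where figures such as "80%+ of cycles" come from when the debounce count is 5 or more.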
**GPT-5 unique findings (not in either other model):**
- Boundary-alignment counter starvation: explicitly modeled the "adversarial flicker"
pattern (breaching 2 cycles, 1 clear, repeat: ~67% breach time, never escalates)
with a precise mathematical framing of why K-of-N is needed
- Cycle-length drift under load: GC pauses or CPU contention stretching evaluation
intervals means "3 cycles" could be 12 minutes instead of 90 seconds when it
matters most (high-load market stress = slowest evaluations)
- Adversarial boundary timing (market microstructure masking): illiquid instruments
where opposing prints predictably arrive near evaluation boundaries, exploiting
deterministic sampling points
- Cross-metric escalation gap and hedging prohibition: Restrict prevents ALL new
positions including risk-REDUCING hedges needed for a different metric still
escalating on its own timeline — protection for metric A actively worsens metric B
- Cooldown stall causing prolonged Restrict: repeated transient spikes near hysteresis
threshold reset cooldown indefinitely while metric is actually safe
- State inconsistency between restriction flags and monitor after restart:
documented asymmetry where flag persists (manual clear) but state resets (auto
clear) — creates orphaned restriction or unprotected window depending on
reconciliation approach
- Metric computation fail-closed interacting with debounce: system errors create
false escalations with long cooldown, potentially blocking hedging trades
- Unspecified N for kill switch post-liquidation breaches: coupled with crash
reset, system can loop indefinitely without reaching kill switch
- In-liquidate flicker stall: one cycle below threshold after partial fill resets
re-trigger counter, stalling further liquidation
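The K-of-N idea from the first GPT-5 finding can be sketched as a sliding window: escalate when at least `k` of the last `n` cycles breached, instead of requiring an unbroken streak. The function name and the choice of k=3, n=5 are assumptions for illustration; the findings do not specify values.

```python
from collections import deque


def k_of_n_escalation(breaches, k, n):
    """Return the 1-based cycle at which at least k of the last n cycles
    breached, or None if that never happens."""
    window = deque(maxlen=n)  # oldest cycle drops out automatically
    for cycle_no, breached in enumerate(breaches, start=1):
        window.append(breached)
        if sum(window) >= k:
            return cycle_no
    return None


# The flicker pattern from the finding: breach 2 cycles, clear 1, repeat.
flicker = [True, True, False] * 10

# k == n behaves like a consecutive counter: the pattern never escalates.
print(k_of_n_escalation(flicker, k=3, n=3))  # None
# A 3-of-5 window tolerates the single clear cycle and escalates at cycle 4.
print(k_of_n_escalation(flicker, k=3, n=5))  # 4

# An isolated one-cycle spike still does not escalate.
spike = [False] * 4 + [True] + [False] * 4
print(k_of_n_escalation(spike, k=3, n=5))  # None
```

The window keeps the false-positive protection that debounce was meant to provide, since a single spike is ignored, while no longer letting one clear cycle erase the evidence of a sustained breach.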
**Claude Opus unique findings (not in either other model):**
- De-escalation cooldown exploitation (predictable window): after cooldown completes
and restriction lifts, strategy has a GUARANTEED 5+ cycle window of unrestricted
trading before Restrict can re-engage — an automated strategy could systematically
exploit this predictable safe window to re-enter dangerous positions
- Orphaned restriction flag asymmetry framed as a DESIGN CONTRADICTION: the failure
modes table specifies opposing recovery paths for state (automatic → Clear) vs
flags (manual clear), creating an irreconcilable dual state. Opus uniquely
identified that operator intervention to clear the flag could inadvertently
create a WORSE protection gap than leaving it orphaned
- Synthesizing analysis style: Opus's summary explicitly identified that the
three Critical findings share a common cause (debounce optimizes against false
positives at the expense of false negatives during sustained events) and proposed
a single architectural fix (severity-aware fast path) that addresses all three
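One possible reading of the "severity-aware fast path" that Opus proposed is sketched below: the number of consecutive breaching cycles required shrinks when the metric is far past its threshold. This is an interpretation, not Opus's actual proposal or anything in the policy document; `required_streak`, `base_debounce`, and `fast_path_ratio` are hypothetical names and values.

```python
def required_streak(metric, threshold, base_debounce=3, fast_path_ratio=1.5):
    """Consecutive breaching cycles needed before escalating.

    All parameters are illustrative assumptions, not values from the policy.
    """
    if metric >= threshold * fast_path_ratio:
        return 1  # severe breach: escalate on this cycle, no debounce
    return base_debounce


def first_escalation(readings, threshold):
    """Return the 1-based cycle at which escalation fires, or None."""
    streak = 0
    for cycle_no, metric in enumerate(readings, start=1):
        if metric < threshold:
            streak = 0  # a clear cycle still resets the counter
            continue
        streak += 1
        if streak >= required_streak(metric, threshold):
            return cycle_no
    return None


print(first_escalation([105, 105, 105], threshold=100))     # 3: mild breach, full debounce
print(first_escalation([90, 160], threshold=100))           # 2: severe breach, fast path
print(first_escalation([105, 90, 105, 90], threshold=100))  # None: mild flicker
```

The sketch shows the trade-off Opus described: mild breaches keep their protection against false positives, while severe ones no longer wait out the full debounce. Mild flicker is still missed here, so a fast path of this shape would complement a windowed counter rather than replace it.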
**Claude Sonnet 4.5 unique findings (not in either other model):**
- De-escalation timing not accounting for proximity to breach threshold: system
removes protection while metric is still near-dangerous, and re-escalation
requires full debounce — created a specific "whipsaw" scenario with cycle numbers
- Manual-only de-escalation from Liquidate creates UNBOUNDED recovery time:
if triggered at 2 AM Saturday, trading disabled until Monday despite metrics
recovering in minutes. Framed as contradiction with "autonomous" design goals
- Evaluation cycle synchronization assumption: no handling of variable timing
(CPU contention, GC pauses) — implicit throughout but never addressed
- Cold start escalation ambiguity: system starts with no prior state while
portfolio may already be in breach condition
- De-escalation event ordering race: multiple metrics de-escalating simultaneously
may emit events in non-deterministic order, confusing external observers
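The variable-timing concern raised by both GPT-5 and Sonnet 4.5 comes down to simple arithmetic: a debounce expressed in cycles has no fixed wall-clock duration. The sketch below uses the figures quoted in the GPT-5 finding (3 cycles, 90 seconds nominal, 12 minutes under load); the per-cycle intervals of 30 s and 240 s are derived by dividing those figures by 3, and the function name is hypothetical.

```python
def debounce_wall_clock(intervals_s, debounce_cycles):
    """Seconds elapsed before `debounce_cycles` evaluations have completed,
    given the actual duration of each evaluation interval."""
    return sum(intervals_s[:debounce_cycles])


nominal = debounce_wall_clock([30, 30, 30], debounce_cycles=3)
stretched = debounce_wall_clock([240, 240, 240], debounce_cycles=3)

print(f"nominal:   {nominal} s")                            # nominal:   90 s
print(f"stretched: {stretched} s = {stretched / 60:g} min")  # stretched: 720 s = 12 min
print(f"slowdown:  {stretched / nominal:g}x")                # slowdown:  8x
```

A cycle-counted debounce therefore reacts eight times more slowly in this scenario, at exactly the moment the system is under stress. Measuring elapsed time per cycle, rather than assuming it, would at least make the drift visible.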
**Quality assessment:**
- **GPT-5** was the most exhaustive (15 findings) and showed the strongest
mathematical/systems reasoning. Its unique findings included precise attack
models (adversarial flicker, boundary alignment, microstructure masking) that
describe exact exploitation patterns with percentages and cycle counts. The
cross-metric hedging prohibition finding is architecturally significant — it
identifies that protection for one metric can actively CREATE risk for another.
Every finding was actionable with specific fixes.
- **Claude Opus 4.6** produced fewer findings (10) but with characteristic depth
and self-awareness. Its cooldown exploitation finding identified a PREDICTABLE
exploit window that an automated strategy could systematically abuse — framed
not as an accident but as an adversarial opportunity. The summary synthesis
(identifying common cause across Critical findings) shows meta-analytical
capability the other models didn't demonstrate. Opus also uniquely identified
that human intervention to fix one problem could create a WORSE problem —
second-order operational reasoning.
- **Claude Sonnet 4.5** was well-structured (12 findings, clean severity tiers,
organized by Critical/High/Medium/Low) and faster than both other models.
Its findings were solid but less architecturally deep. The manual de-escalation
contradiction finding was genuinely insightful (unbounded recovery time vs
autonomous design goals). However, several findings restated concepts the other
models also covered, but with less specificity about exploitation mechanics.

**Key insight (temporal reasoning as a task type):**
This is the first experiment specifically testing "temporal boundary analysis" —
reasoning about time-domain properties of a state machine (evaluation frequency,
counter semantics, cooldown mechanics, crash/restart timing).
Results compared to Finding #13 (race condition identification on a concurrency doc):
- GPT-5: 15 findings here vs 12 in Finding #13. Consistent high performance
on temporal reasoning tasks across both experiments.
- Opus: 10 findings here vs 10 in Finding #13. The same count in both experiments;
so far Opus has produced ~10 high-quality findings on each temporal task variant tested.
- Sonnet 4.5: 12 findings here (first test). Compare to Sonnet 4.6's 7 findings
(with errors) in Finding #13. Sonnet 4.5 handles temporal reasoning better than
4.6, consistent with Finding #16 showing 4.5 is more exhaustive across task types.

**Sonnet 4.5 vs 4.6 on temporal reasoning (inferred comparison):**
Sonnet 4.6 struggled significantly on race condition identification (Finding #13:
7 findings with analytical errors, misreading architecture). Sonnet 4.5 here
produced 12 solid findings with no apparent misreadings. This suggests 4.5's
exhaustiveness advantage extends to temporal reasoning — the additional
exploration it does (vs 4.6's aggressive self-filtering) catches more temporal
interactions. Consistent with Finding #16's pattern: 4.5 for coverage, 4.6 for precision.

**The structured-prompt effect continues:**
All three models produced focused, high-quality output with this highly structured
prompt (5 specific categories + required output format). This confirms Finding #14:
narrow analytical lens + broad document scope is the sweet spot for all model tiers.
The prompt structure appears to be a stronger predictor of output quality than model
choice for the bottom 80% of findings (all models find the common-ground issues).
Model choice matters for the TOP 20% — the unique insights that require deeper
reasoning about system interactions.

**Updated model assignment for temporal boundary analysis:**
1. GPT-5 — most exhaustive, strongest at modeling adversarial exploitation patterns
and mathematical edge cases (15 findings)
2. Claude Sonnet 4.5 — good volume with clean structure, viable for first-pass
temporal analysis (12 findings, no errors)
3. Claude Opus 4.6 — fewest findings but highest insight density, uniquely
identifies predictable exploit windows and operational second-order effects
(10 findings)

**Practical implication:** For temporal analysis on state machines and timing-dependent
policies, the three-model stack produces genuine complementary value:
- GPT-5 catches the adversarial attack patterns and mathematical edge cases
- Opus catches the predictable exploit windows and operational contradictions
- Sonnet 4.5 provides good breadth at lower cost with clean severity categorization
The union of unique findings across all three models reveals significantly more
temporal vulnerabilities than any single model alone. For a document governing
autonomous financial actions (liquidation, kill switch), the cost of running all
three (~$1-2) is trivially justified against the risk of missing a timing exploit.