From c071ffc31fca61736198b33789bcdd7ad75eeb86 Mon Sep 17 00:00:00 2001
From: claw
Date: Thu, 7 May 2026 02:48:46 -0700
Subject: [PATCH] Finding #36: Compositional interface analysis - two-document interface assumptions

New experiment type: give models two related architecture documents and
ask them to identify assumptions each document makes about the other
that could be violated.

Results: GPT-5 (10 findings, 175s, operational/race-focused) and Opus
(10 findings, 111s, structural/architectural) both found unique
interface gaps. Sonnet (7 findings, 29s) found nothing unique - all its
findings were simplified versions of GPT-5/Opus findings.

Key insight: Interface analysis requires holding two mental models
simultaneously and is harder than single-document analysis. Sonnet
produced 0 unique findings (vs 2-6 on single-doc tasks). Extended
reasoning appears necessary for this task type.
---
 ...-07-36-compositional-interface-analysis.md | 177 ++++++++++++++++++
 1 file changed, 177 insertions(+)
 create mode 100644 findings/2026-05-07-36-compositional-interface-analysis.md

diff --git a/findings/2026-05-07-36-compositional-interface-analysis.md b/findings/2026-05-07-36-compositional-interface-analysis.md
new file mode 100644
index 0000000..5d33b6e
--- /dev/null
+++ b/findings/2026-05-07-36-compositional-interface-analysis.md
@@ -0,0 +1,177 @@
+# Finding #36: Compositional Interface Analysis — Models find qualitatively different interface gaps when analyzing two interacting design documents
+
+**Date:** 2026-05-07
+**Task:** Identify security-relevant INTERFACE ASSUMPTIONS between gargoyle's `kill-switch.md`
+(293 lines, "Document A" below) and `escalation-policy.md` (228 lines, "Document B" below) — places
+where one document assumes behavior the other doesn't guarantee, producing gaps visible only
+from the interface between both designs.
+**How we used them:** Both documents provided in full (521 lines total) + same focused
+analytical question to all 3 models.
+Prompt explicitly specified 5 categories (authority
+conflicts, state consistency gaps, timing/ordering hazards, recovery path contradictions,
+semantic mismatches) and required interface-only findings. GPT-5 via HAI OpenAI endpoint;
+Opus 4.6 and Sonnet 4.6 via HAI Anthropic endpoint. No tools, no project context beyond the
+two documents.
+
+## Results
+
+| Model | Time | Input tokens | Output tokens | Reasoning tokens | Findings |
+|---|---|---|---|---|---|
+| GPT-5 | 175s | 5,465 | 11,339 | 8,512 | 10 |
+| Claude Opus 4.6 | 111s | 6,209 | 4,496 | (internal) | 10 |
+| Claude Sonnet 4.6 | 29s | 6,209 | 1,306 | (internal) | 7 |
+
+## What they found — common ground (all 3 identified):
+
+- **Competing writers to acceptance policy** — both documents assume authority to set
+  OrderManager's acceptance policy with no defined precedence/arbitration mechanism
+- **Semantic collision on "restrict" / "liquidate"** — same terms used for escalation
+  LEVELS (Document B, autonomous, reversible) and kill switch MODES (Document A, terminal,
+  manual recovery), creating implementation ambiguity
+- **Autonomous liquidation vs manual liquidation conflict** — Document B assumes
+  autonomous surgical liquidation; Document A defines manual total liquidation.
+  Neither defines which component executes B's autonomous liquidation.
+- **Recovery/de-escalation authority conflict** — Document B assumes automatic de-escalation
+  via cooldown; Document A requires manual disengagement; incompatible gates
+- **Kill switch instant engagement vs debounce timing** — external sources can bypass
+  the entire escalation ladder, invalidating Document B's careful debounce state
+
+## GPT-5 unique findings (not in either Claude model):
+
+- **Mode selection ambiguity during kill switch escalation** (#3): Document B signals
+  "escalate to kill switch" without specifying which mode.
+  If the broker is unreachable,
+  Document A prescribes RESTRICT (no broker interaction); Document B's liquidate-level
+  context could naively map to LIQUIDATE, causing cancel attempts in a scenario where
+  Document A says "don't talk to the broker." This is a mapping gap at the command
+  interface.
+- **In-flight order race during engagement** (#4): Document A terminates the decision engine
+  BEFORE flipping acceptance policy. Orders submitted by Document B's liquidation logic
+  just before termination could arrive at OrderManager while policy is still "open."
+  Neither doc specifies atomic sequencing across their boundary.
+- **Kill switch LIQUIDATE cancel-all negating B's autonomous liquidation** (#5):
+  Document A cancels ALL open orders in LIQUIDATE mode. If Document B just submitted
+  close-only orders to reduce risk, A's cancel-all undoes B's remediation. B sees
+  insufficient reduction on next cycle, re-triggers, gets cancelled again. Deadlock
+  between safety mechanisms.
+- **Global kill switch vs per-user escalation policy writes** (#10): Document A's "global
+  always wins" precedence is defined only for kill switch states, not relative to other
+  policy writers. Document B could overwrite reject-all with close-only while global kill
+  is engaged.
+
+## Claude Opus unique findings (not in either of the other models):
+
+- **Decision engine termination kills the metric evaluator** (#2 + #7): Document B
+  requires ONGOING metric evaluation at restrict/liquidate levels (to determine whether
+  to escalate further or de-escalate). Document A terminates Portfolio Risk (the metric
+  computation component) on kill switch engagement. Neither document identifies which component
+  evaluates risk metrics while the decision engine is dead. This makes Document B's entire
+  post-restrict logic dead code if restrict = kill switch engagement.
+- **Monitor crash resets escalation state with no backstop** (#6): Document B accepts that
+  crash = state loss, restart from clear.
+  Combined with Document A never having been
+  engaged (escalation hadn't reached that point), a well-timed crash creates a window
+  where NO risk controls are active despite ongoing threshold breaches. The full
+  re-escalation sequence (14+ cycles) runs with zero protection.
+- **Dual manual gates with undefined ordering** (#8): Document A requires 3 manual steps
+  (RESTRICT→LIQUIDATE transition, disengage, release users). Document B requires 1 manual
+  step (operator confirms recovery from liquidate level). These are independent state
+  machines with their own manual gates. Neither defines whether one gate satisfies the
+  other or what order they must be performed in.
+
+## Claude Sonnet findings:
+
+Sonnet reported 7 findings, all of which mapped to findings already identified by GPT-5 or
+Opus. None was unique: each was covered at greater depth by the other two models. Its
+findings were accurate but structurally simpler — 2-3 paragraphs each vs 5-6 for GPT-5
+and Opus. The 29-second completion time and 1,306 output tokens reflect this reduced depth.
+
+## Quality Assessment
+
+- **GPT-5** produced the most operationally actionable findings. Its #3 (mode selection
+  mapping gap), #4 (in-flight race), and #5 (cancel-all vs liquidation deadlock) all
+  describe specific event sequences that would produce incorrect behavior in implementation.
+  GPT-5 also provided concrete recommendations for fixing each gap (compositional policy
+  model, command API specification, mode selection rules, atomic sequencing). Every finding
+  references specific sections in both documents and describes WHY neither document alone
+  can see the problem.
+
+- **Claude Opus** found the most architecturally fundamental gap: if the kill switch
+  terminates Portfolio Risk, then Document B's entire escalation logic above "alert"
+  becomes dead code.
+  This isn't just a race condition or authority conflict — it's a
+  structural contradiction where engaging the safety mechanism kills the component that
+  determines whether the safety mechanism should have been engaged. Opus's monitor-crash
+  finding (#6) is also unique in identifying an adversarially exploitable window. The dual
+  manual gates finding (#8) shows Opus's characteristic attention to recovery-path tensions.
+
+- **Claude Sonnet** was fast but added no unique analytical value for this task. Every
+  finding was a simplified version of something GPT-5 or Opus found in greater depth.
+  For a 29-second, 1,306-token response, it's competent as a "quick summary of obvious
+  interface issues" but wouldn't catch the subtle problems.
+
+## Key Insight — Interface Analysis as a NEW Task Type
+
+This is the first experiment testing **compositional analysis** across two documents that
+reference each other. Previous experiments (including #28 cross-document consistency) gave
+models multiple documents and asked about consistency. This experiment differs in a critical
+way: it asks specifically about **assumptions each document makes about the other's behavior**.
+
+The results suggest this task type favors reasoning models even more strongly than
+single-document analysis:
+
+| Task type | Sonnet unique findings | Opus unique findings | GPT-5 unique findings |
+|---|---|---|---|
+| Hidden assumptions (single doc) | 2-6 | 3-6 | 5-14 |
+| Race conditions (single doc) | 0 | 5 | 6 |
+| Interface analysis (two docs) | **0** | **3** | **4** |
+
+Sonnet's inability to produce ANY unique findings here (it produces 2-6 on single-document
+hidden-assumption tasks, though none on race conditions) suggests that reasoning about
+interfaces requires holding two mental models simultaneously and finding contradictions
+between them. This is a harder cognitive task than analyzing one document's internal consistency.
+Extended reasoning (GPT-5's 8,512 tokens) and deep internal reasoning (Opus) appear necessary
+for this.
+
+## Comparison to Finding #28 (Cross-Document Consistency)
+
+Finding #28 tested cross-document consistency on different document pairs. That experiment
+asked "are these documents consistent?" (verification task). This experiment asks "what does
+each assume about the other?" (generative/constructive task). The distinction matters:
+
+- **Consistency checking** (Finding #28): compare stated facts across documents. More
+  surface-level — look for contradictions in explicit claims.
+- **Interface assumption analysis** (this finding): reason about what each document takes for
+  granted about the other's implementation. Requires understanding the *implications* of each
+  design, not just the *statements*.
+
+The models that excel are the same (GPT-5 and Opus), but the nature of their findings differs:
+GPT-5's interface findings are more operational (specific race conditions, specific event
+sequences), while Opus's are more structural (fundamental architectural contradictions,
+recovery-path tensions).
+
+## Practical Implications
+
+1. **For architecture reviews of interacting components:** Run GPT-5 + Opus together. GPT-5
+   catches operational gaps (races, ordering, command interfaces). Opus catches structural
+   contradictions (dead code paths, killed components, recovery-path conflicts).
+
+2. **Sonnet is NOT suitable for interface analysis.** Use it only for single-document tasks
+   where it has proven capable (assumption-finding, structural review).
+
+3. **The "both documents together" framing is critical.** Previous experiments showed models
+   find plenty of issues in each document alone. The interface analysis prompt forces models
+   to reason about the SPACE BETWEEN the documents — which is where the real bugs live in
+   multi-component systems.
+
+4. **Recommendations should specify the integration contract.** The most valuable output from
+   this type of analysis is not "here's what's wrong" but "here's what the integration
+   contract must define" — precedence rules, command APIs, event subscriptions, atomic
+   sequencing guarantees.
+
+## Next Experiments
+
+- **Three-document interface analysis:** Add `continuous-risk-monitoring.md` as a third document
+  (it bridges both). Do models find additional interface gaps that only emerge from the
+  three-way interaction?
+- **Adversarial ensemble on interface analysis:** Give Opus GPT-5's interface findings and ask
+  it to critique + extend (per Finding #35 methodology). Does the ensemble approach produce
+  even more interface insights?
+- **Implementation-level verification:** Take the top interface findings from this experiment
+  and check them against gargoyle's actual code. Are these REAL bugs or are the documents
+  already consistent at the implementation level despite the spec-level gaps?
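
The competing-writers gap and GPT-5's finding #10 both trace back to a precedence rule that neither document defines. As an illustration of what such a rule in an integration contract might look like, here is a minimal sketch of a "most restrictive request wins" arbiter. It is an assumption for discussion, not gargoyle's code: `AcceptancePolicy`, `PolicyArbiter`, and the writer names `kill_switch` and `escalation` are hypothetical, and only the policy values (open, close-only, reject-all) come from the findings above.

```python
from enum import IntEnum


class AcceptancePolicy(IntEnum):
    """Policy values named in the findings, ordered least to most restrictive."""
    OPEN = 0
    CLOSE_ONLY = 1
    REJECT_ALL = 2


class PolicyArbiter:
    """Hypothetical arbiter: each writer holds its own request, and the
    effective policy is the most restrictive request currently held."""

    def __init__(self):
        self._requests = {}  # writer name -> requested AcceptancePolicy

    def request(self, writer, policy):
        # Record or update this writer's request; other writers are untouched.
        self._requests[writer] = policy
        return self.effective()

    def release(self, writer):
        # Withdraw this writer's request, if it has one.
        self._requests.pop(writer, None)
        return self.effective()

    def effective(self):
        # With no outstanding requests, fall back to OPEN.
        return max(self._requests.values(), default=AcceptancePolicy.OPEN)


# Scenario from finding #10: escalation writes close-only while the kill
# switch is holding reject-all.
arbiter = PolicyArbiter()
arbiter.request("kill_switch", AcceptancePolicy.REJECT_ALL)
after_escalation_write = arbiter.request("escalation", AcceptancePolicy.CLOSE_ONLY)
after_kill_release = arbiter.release("kill_switch")

print(after_escalation_write.name)  # REJECT_ALL: the weaker write does not win
print(after_kill_release.name)      # CLOSE_ONLY: escalation's request still stands
```

Under this assumed rule, a later write from the escalation side cannot weaken an engaged kill switch, and releasing the kill switch falls back to whatever the escalation side still requests. Whether that fallback is the desired recovery behavior is one of the questions the contract would need to answer.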