Finding #36: Compositional interface analysis - two-document interface assumptions

New experiment type: give models two related architecture documents and ask them to identify assumptions each document makes about the other that could be violated. Results: GPT-5 (10 findings, 175s, operational/race-focused) and Opus (10 findings, 111s, structural/architectural) both found unique interface gaps. Sonnet (7 findings, 29s) found nothing unique - all its findings were simplified versions of GPT-5/Opus findings. Key insight: Interface analysis requires holding two mental models simultaneously and is harder than single-document analysis. Sonnet produced 0 unique findings (vs 2-6 on single-doc tasks). Extended reasoning appears necessary for this task type.
2026-05-07 02:48:46 -07:00
parent d8ddbc9861
commit c071ffc31f
1 changed files with 177 additions and 0 deletions
@@ -0,0 +1,177 @@
+# Finding #36: Compositional Interface Analysis — Models find qualitatively different interface gaps when analyzing two interacting design documents
+
+**Date:** 2026-05-07
+**Task:** Identify security-relevant INTERFACE ASSUMPTIONS between gargoyle's `kill-switch.md`
+(293 lines) and `escalation-policy.md` (228 lines) — places where one document assumes
+behavior the other doesn't guarantee, producing gaps visible only from the interface between
+both designs.
+**How we used them:** All 3 models received both documents in full (521 lines total) plus
+the same focused analytical question. The prompt explicitly specified 5 categories (authority
+conflicts, state consistency gaps, timing/ordering hazards, recovery path contradictions,
+semantic mismatches) and required interface-only findings. GPT-5 ran via the HAI OpenAI
+endpoint; Opus 4.6 and Sonnet 4.6 ran via the HAI Anthropic endpoint. No tools, no project
+context beyond the two documents.
+
+## Results
+
+| Model | Time | Input tokens | Output tokens | Reasoning tokens | Findings |
+|---|---|---|---|---|---|
+| GPT-5 | 175s | 5,465 | 11,339 | 8,512 | 10 |
+| Claude Opus 4.6 | 111s | 6,209 | 4,496 | (internal) | 10 |
+| Claude Sonnet 4.6 | 29s | 6,209 | 1,306 | (internal) | 7 |
+
+## What they found — common ground (all 3 identified):
+
+- **Competing writers to acceptance policy** — both documents assume authority to set
+  OrderManager's acceptance policy with no defined precedence/arbitration mechanism
+- **Semantic collision on "restrict" / "liquidate"** — same terms used for escalation
+  LEVELS (Document B, autonomous, reversible) and kill switch MODES (Document A, terminal,
+  manual recovery), creating implementation ambiguity
+- **Autonomous liquidation vs manual liquidation conflict** — Document B assumes
+  autonomous surgical liquidation; Document A defines manual total liquidation.
+  Neither defines which component executes B's autonomous liquidation.
+- **Recovery/de-escalation authority conflict** — Document B assumes automatic de-escalation
+  via cooldown; Document A requires manual disengagement; incompatible gates
+- **Kill switch instant engagement vs debounce timing** — external sources can bypass
+  the entire escalation ladder, invalidating Document B's careful debounce state
+
+## GPT-5 unique findings (not reported by either Claude model):
+
+- **Mode selection ambiguity during kill switch escalation** (#3): Document B signals
+  "escalate to kill switch" without specifying which mode. If broker is unreachable,
+  Document A prescribes RESTRICT (no broker interaction); Document B's liquidate-level
+  context could naively map to LIQUIDATE, causing cancel attempts in a scenario where
+  Document A says "don't talk to the broker." This is a mapping gap at the command
+  interface.
+- **In-flight order race during engagement** (#4): Document A terminates decision engine
+  BEFORE flipping acceptance policy. Orders submitted by Document B's liquidation logic
+  just before termination could arrive at OrderManager while policy is still "open."
+  Neither doc specifies atomic sequencing across their boundary.
+- **Kill switch LIQUIDATE cancel-all negating B's autonomous liquidation** (#5):
+  Document A cancels ALL open orders in LIQUIDATE mode. If Document B just submitted
+  close-only orders to reduce risk, A's cancel-all undoes B's remediation. B sees
+  insufficient reduction on next cycle, re-triggers, gets cancelled again. The result is a
+  livelock between safety mechanisms: both keep acting, but risk is never reduced.
+- **Global kill switch vs per-user escalation policy writes** (#10): Document A's "global
+  always wins" precedence is defined only for kill switch states, not relative to other
+  policy writers. Document B could overwrite reject-all with close-only while global kill
+  is engaged.
+
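The competing-writers gap (common ground) and the global-vs-per-user gap (#10) both come down to a missing arbitration rule. The block below is a minimal sketch of one possible rule, "most restrictive request wins". It is not taken from either document: `PolicyArbiter`, the writer names, and the restrictiveness ordering of the three policies are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Policy(IntEnum):
    """Acceptance policies, ordered least to most restrictive (assumed ordering)."""
    OPEN = 0
    CLOSE_ONLY = 1
    REJECT_ALL = 2


@dataclass
class PolicyArbiter:
    """Hypothetical arbiter: every writer records its own request, and the
    effective policy is the most restrictive request currently held."""
    requests: dict[str, Policy] = field(default_factory=dict)

    def request(self, writer: str, policy: Policy) -> Policy:
        # A writer only ever replaces its own request, never another writer's.
        self.requests[writer] = policy
        return self.effective()

    def release(self, writer: str) -> Policy:
        self.requests.pop(writer, None)
        return self.effective()

    def effective(self) -> Policy:
        # With no outstanding requests, fall back to OPEN.
        return max(self.requests.values(), default=Policy.OPEN)


arbiter = PolicyArbiter()
arbiter.request("kill_switch_global", Policy.REJECT_ALL)
# The escalation policy asking for close-only cannot loosen reject-all.
effective = arbiter.request("escalation_policy", Policy.CLOSE_ONLY)
print(effective.name)       # REJECT_ALL
after_release = arbiter.release("kill_switch_global")
print(after_release.name)   # CLOSE_ONLY
```

Under this rule, a per-user close-only write issued while the global kill switch is engaged is recorded but has no effect until the kill switch releases its own request.
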
+## Claude Opus unique findings (not reported by either other model):
+
+- **Decision engine termination kills the metric evaluator** (#2 + #7): Document B
+  requires ONGOING metric evaluation at restrict/liquidate levels (to determine whether
+  to escalate further or de-escalate). Document A terminates Portfolio Risk (the metric
+  computation component) on kill switch engagement. Neither document identifies which
+  component evaluates risk metrics while the decision engine is dead. This makes Document B's
+  entire post-restrict logic dead code if restrict is implemented as kill switch engagement.
+- **Monitor crash resets escalation state with no backstop** (#6): Document B accepts that
+  a crash means state loss and a restart from clear. If Document A's kill switch has not yet
+  been engaged (because escalation had not reached that point), a well-timed crash creates a
+  window where NO risk controls are active despite ongoing threshold breaches. The full
+  re-escalation sequence (14+ cycles) runs with zero protection.
+- **Dual manual gates with undefined ordering** (#8): Document A requires 3 manual steps
+  (RESTRICT→LIQUIDATE transition, disengage, release users). Document B requires 1 manual
+  step (operator confirms recovery from liquidate level). These are independent state
+  machines with their own manual gates. Neither defines whether one gate satisfies the
+  other or what order they must be performed in.
+
+## Claude Sonnet findings:
+
+Sonnet produced 7 findings, all of which map to findings that GPT-5 or Opus also reported
+in greater depth. It contributed no unique findings. Its findings were accurate but
+structurally simpler: 2-3 paragraphs each versus 5-6 for GPT-5 and Opus. The 29-second
+completion time and 1,306 output tokens reflect this reduced depth.
+
+## Quality Assessment
+
+- **GPT-5** produced the most operationally actionable findings. Its #3 (mode selection
+  mapping gap), #4 (in-flight race), and #5 (cancel-all vs liquidation loop) all
+  describe specific event sequences that would produce incorrect behavior in implementation.
+  GPT-5 also provided concrete recommendations for fixing each gap (compositional policy
+  model, command API specification, mode selection rules, atomic sequencing). Every finding
+  references specific sections in both documents and describes WHY neither document alone
+  can see the problem.
+
+- **Claude Opus** found the most architecturally fundamental gap: if the kill switch
+  terminates Portfolio Risk, then Document B's entire escalation logic above "alert"
+  becomes dead code. This isn't just a race condition or authority conflict — it's a
+  structural contradiction where engaging the safety mechanism kills the component that
+  determines whether the safety mechanism should have been engaged. Opus's monitor-crash
+  finding (#6) is also unique in identifying an adversarially exploitable window. The dual
+  manual gates finding (#8) shows Opus's characteristic attention to recovery-path tensions.
+
+- **Claude Sonnet** was fast but added no unique analytical value for this task. Every
+  finding was a simplified version of something GPT-5 or Opus found in greater depth.
+  For a 29-second, 1,306-token response, it is competent as a "quick summary of obvious
+  interface issues" but did not surface the subtler problems.
+
+## Key Insight — Interface Analysis as a NEW Task Type
+
+This is the first experiment testing **compositional analysis** across two documents that
+reference each other. Previous experiments (including #28 cross-document consistency) gave
+models multiple documents and asked about consistency. This experiment differs in a critical
+way: it asks specifically about **assumptions each document makes about the other's behavior**.
+
+The results suggest this task type favors reasoning models even more strongly than
+single-document analysis:
+
+| Task type | Sonnet unique findings | Opus unique findings | GPT-5 unique findings |
+|---|---|---|---|
+| Hidden assumptions (single doc) | 2-6 | 3-6 | 5-14 |
+| Race conditions (single doc) | 0 | 5 | 6 |
+| Interface analysis (two docs) | **0** | **3** | **4** |
+
+Sonnet's failure to produce ANY unique findings here, when it produces 2-6 on
+single-document hidden-assumption tasks, suggests that reasoning about interfaces requires
+holding two mental models simultaneously and finding contradictions between them. (Sonnet
+also produced 0 unique findings on single-document race-condition analysis, so the drop is
+relative to assumption-finding only.) This is a harder task than analyzing one document's
+internal consistency. Extended reasoning (GPT-5's 8,512 reasoning tokens) and deep internal
+reasoning (Opus) appear necessary for this.
+
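The "unique findings" counts in the table are set differences: a finding is unique to a model if no other model reported it. A minimal sketch of that bookkeeping follows. The labels are shorthand for findings named in this write-up and form a partial, illustrative mapping, not the full per-model finding lists; `unique_findings` is a hypothetical helper.

```python
def unique_findings(findings_by_model: dict[str, set[str]]) -> dict[str, set[str]]:
    """Return, per model, the findings that no other model reported."""
    unique: dict[str, set[str]] = {}
    for model, findings in findings_by_model.items():
        # Union of everything reported by the *other* models.
        others = set().union(
            *(f for m, f in findings_by_model.items() if m != model)
        )
        unique[model] = findings - others
    return unique


# Shorthand labels (assumed names); the five common-ground findings are shared by all models.
common = {"competing-writers", "term-collision", "auto-vs-manual-liquidation",
          "recovery-authority", "instant-engage-vs-debounce"}
findings_by_model = {
    "GPT-5": common | {"mode-selection", "in-flight-race", "cancel-all-loop", "global-vs-per-user"},
    "Opus": common | {"evaluator-terminated", "monitor-crash", "dual-manual-gates"},
    "Sonnet": set(common),
}

counts = {model: len(found) for model, found in unique_findings(findings_by_model).items()}
print(counts)  # {'GPT-5': 4, 'Opus': 3, 'Sonnet': 0}
```

Deciding whether two differently worded findings are "the same" remains a manual judgment; the set difference only counts once that mapping exists.
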
+## Comparison to Finding #28 (Cross-Document Consistency)
+
+Finding #28 tested cross-document consistency on different document pairs. That experiment
+asked "are these documents consistent?" (verification task). This experiment asks "what does
+each assume about the other?" (generative/constructive task). The distinction matters:
+
+- **Consistency checking** (Finding #28): compare stated facts across documents. More surface-
+  level — look for contradictions in explicit claims.
+- **Interface assumption analysis** (this finding): reason about what each document takes for
+  granted about the other's implementation. Requires understanding the *implications* of each
+  design, not just the *statements*.
+
+The models that excel are the same (GPT-5 and Opus), but the nature of their findings differs:
+GPT-5's interface findings are more operational (specific race conditions, specific event
+sequences), while Opus's are more structural (fundamental architectural contradictions,
+recovery-path tensions).
+
+## Practical Implications
+
+1. **For architecture reviews of interacting components:** Run GPT-5 + Opus together. GPT-5
+   catches operational gaps (races, ordering, command interfaces). Opus catches structural
+   contradictions (dead code paths, killed components, recovery-path conflicts).
+
+2. **Sonnet is NOT suitable for interface analysis.** Use it only for single-document tasks
+   where it has proven capable (assumption-finding, structural review).
+
+3. **The "both documents together" framing is critical.** Previous experiments showed models
+   find plenty of issues in each document alone. The interface analysis prompt forces models
+   to reason about the SPACE BETWEEN the documents — which is where the real bugs live in
+   multi-component systems.
+
+4. **Recommendations should specify the integration contract.** The most valuable output from
+   this type of analysis is not "here's what's wrong" but "here's what the integration
+   contract must define" — precedence rules, command APIs, event subscriptions, atomic
+   sequencing guarantees.
+
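Item 4 can be turned into a simple checklist: represent the integration contract as a record with one field per clause and report which clauses are still undefined. This is a sketch only. `IntegrationContract` and `undefined_clauses` are hypothetical names, and the four fields simply mirror the clause types listed above.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class IntegrationContract:
    """One field per clause the two designs must agree on (None = not yet defined)."""
    precedence_rules: Optional[str] = None
    command_api: Optional[str] = None
    event_subscriptions: Optional[str] = None
    atomic_sequencing: Optional[str] = None


def undefined_clauses(contract: IntegrationContract) -> list[str]:
    """List the clauses that nobody has specified yet, in declaration order."""
    return [f.name for f in fields(contract) if getattr(contract, f.name) is None]


# Example draft: only the precedence rule has been written down so far.
draft = IntegrationContract(precedence_rules="most restrictive acceptance policy wins")
print(undefined_clauses(draft))
# ['command_api', 'event_subscriptions', 'atomic_sequencing']
```
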
+## Next Experiments
+
+- **Three-document interface analysis:** Add `continuous-risk-monitoring.md` as a third document
+  (it bridges both). Do models find additional interface gaps that only emerge from the
+  three-way interaction?
+- **Adversarial ensemble on interface analysis:** Give Opus GPT-5's interface findings and ask
+  it to critique + extend (per Finding #35 methodology). Does the ensemble approach produce
+  even more interface insights?
+- **Implementation-level verification:** Take the top interface findings from this experiment
+  and check them against gargoyle's actual code. Are these REAL bugs or are the documents
+  already consistent at the implementation level despite the spec-level gaps?