model-research/findings/2026-05-07-36-compositional-interface-analysis.md

# Finding #36: Compositional Interface Analysis — Models find qualitatively different interface gaps when analyzing two interacting design documents

**Date:** 2026-05-07
**Task:** Identify security-relevant INTERFACE ASSUMPTIONS between gargoyle's `kill-switch.md`
(293 lines) and `escalation-policy.md` (228 lines) — places where one document assumes
behavior the other doesn't guarantee, producing gaps visible only from the interface between
both designs.
**How we used them:** Both documents provided in full (521 lines total) + same focused
analytical question to all 3 models. Prompt explicitly specified 5 categories (authority
conflicts, state consistency gaps, timing/ordering hazards, recovery path contradictions,
semantic mismatches) and required interface-only findings. GPT-5 via HAI OpenAI endpoint;
Opus 4.6 and Sonnet 4.6 via HAI Anthropic endpoint. No tools, no project context beyond the
two documents.

## Results

| Model | Time | Input tokens | Output tokens | Reasoning tokens | Findings |
|---|---|---|---|---|---|
| GPT-5 | 175s | 5,465 | 11,339 | 8,512 | 10 |
| Claude Opus 4.6 | 111s | 6,209 | 4,496 | (internal) | 10 |
| Claude Sonnet 4.6 | 29s | 6,209 | 1,306 | (internal) | 7 |

## What they found — common ground (all 3 identified):

- **Competing writers to acceptance policy** — both documents assume authority to set
  OrderManager's acceptance policy with no defined precedence/arbitration mechanism
- **Semantic collision on "restrict" / "liquidate"** — same terms used for escalation
  LEVELS (Document B, autonomous, reversible) and kill switch MODES (Document A, terminal,
  manual recovery), creating implementation ambiguity
- **Autonomous liquidation vs manual liquidation conflict** — Document B assumes
  autonomous surgical liquidation; Document A defines manual total liquidation.
  Neither defines which component executes B's autonomous liquidation.
- **Recovery/de-escalation authority conflict** — Document B assumes automatic de-escalation
  via cooldown; Document A requires manual disengagement; incompatible gates
- **Kill switch instant engagement vs debounce timing** — external sources can bypass
  the entire escalation ladder, invalidating Document B's careful debounce state

## GPT-5 unique findings (not in either Claude model):

- **Mode selection ambiguity during kill switch escalation** (#3): Document B signals
  "escalate to kill switch" without specifying which mode. If broker is unreachable,
  Document A prescribes RESTRICT (no broker interaction); Document B's liquidate-level
  context could naively map to LIQUIDATE, causing cancel attempts in a scenario where
  Document A says "don't talk to the broker." This is a mapping gap at the command
  interface.
- **In-flight order race during engagement** (#4): Document A terminates decision engine
  BEFORE flipping acceptance policy. Orders submitted by Document B's liquidation logic
  just before termination could arrive at OrderManager while policy is still "open."
  Neither doc specifies atomic sequencing across their boundary.
- **Kill switch LIQUIDATE cancel-all negating B's autonomous liquidation** (#5):
  Document A cancels ALL open orders in LIQUIDATE mode. If Document B just submitted
  close-only orders to reduce risk, A's cancel-all undoes B's remediation. B sees
  insufficient reduction on next cycle, re-triggers, gets cancelled again. Deadlock
  between safety mechanisms.
- **Global kill switch vs per-user escalation policy writes** (#10): Document A's "global
  always wins" precedence is defined only for kill switch states, not relative to other
  policy writers. Document B could overwrite reject-all with close-only while global kill
  is engaged.
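
The competing-writer and global-vs-per-user findings both come down to a missing arbitration rule. A minimal sketch of one possible rule ("most restrictive request wins") follows. `PolicyArbiter`, the writer names, and the restrictiveness ordering are hypothetical illustrations, not anything defined in either document:

```python
from enum import IntEnum


class Policy(IntEnum):
    """Acceptance policies, ordered least to most restrictive (assumed ordering)."""
    OPEN = 0
    CLOSE_ONLY = 1
    REJECT_ALL = 2


class PolicyArbiter:
    """Hypothetical arbiter: each writer holds one request; the effective policy is the max."""

    def __init__(self) -> None:
        self._requests: dict[str, Policy] = {}

    def request(self, writer: str, policy: Policy) -> Policy:
        # Record (or replace) this writer's request, then recompute.
        self._requests[writer] = policy
        return self.effective()

    def release(self, writer: str) -> Policy:
        # Withdraw this writer's request; other writers' requests still apply.
        self._requests.pop(writer, None)
        return self.effective()

    def effective(self) -> Policy:
        # With no outstanding requests, fall back to OPEN.
        return max(self._requests.values(), default=Policy.OPEN)


arbiter = PolicyArbiter()
arbiter.request("kill_switch_global", Policy.REJECT_ALL)
# A later, weaker write from the escalation policy cannot loosen the effective policy.
effective = arbiter.request("escalation_policy", Policy.CLOSE_ONLY)
```

Under this rule, the scenario in #10 (close-only overwriting reject-all while the global kill switch is engaged) cannot occur, because a write only adds a request and never replaces another writer's. Whether "most restrictive wins" is the right rule for gargoyle is a design decision the documents would need to make.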

## Claude Opus unique findings (not in the other two models):

- **Decision engine termination kills the metric evaluator** (#2 + #7): Document B
  requires ONGOING metric evaluation at restrict/liquidate levels (to determine whether
  to escalate further or de-escalate). Document A terminates Portfolio Risk (the metric
  computation component) on kill switch engagement. No document identifies which component
  evaluates risk metrics while the decision engine is dead. This makes Document B's entire
  post-restrict logic dead code if restrict = kill switch engagement.
- **Monitor crash resets escalation state with no backstop** (#6): Document B accepts that
  crash = state loss, restart from clear. Combined with Document A never having been
  engaged (escalation hadn't reached that point), a well-timed crash creates a window
  where NO risk controls are active despite ongoing threshold breaches. The full
  re-escalation sequence (14+ cycles) runs with zero protection.
- **Dual manual gates with undefined ordering** (#8): Document A requires 3 manual steps
  (RESTRICT→LIQUIDATE transition, disengage, release users). Document B requires 1 manual
  step (operator confirms recovery from liquidate level). These are independent state
  machines with their own manual gates. Neither defines whether one gate satisfies the
  other or what order they must be performed in.
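
The crash window in #6 can be illustrated with a toy debounce counter. This is a single-level simplification with a made-up debounce threshold of 5 cycles; the real ladder, its thresholds, and the function name `first_escalation_cycle` are assumptions for illustration only:

```python
def first_escalation_cycle(breach_by_cycle, debounce, crash_at=None):
    """Return the index of the cycle at which escalation first fires, or None.

    debounce: consecutive breached cycles required before escalating (hypothetical value).
    crash_at: cycle at which the monitor restarts and loses its in-memory streak.
    """
    streak = 0
    for cycle, breached in enumerate(breach_by_cycle):
        if cycle == crash_at:
            streak = 0  # crash = state loss, restart from clear
        streak = streak + 1 if breached else 0
        if streak >= debounce:
            return cycle
    return None


breaches = [True] * 12  # the threshold is breached on every cycle
no_crash = first_escalation_cycle(breaches, debounce=5)
with_crash = first_escalation_cycle(breaches, debounce=5, crash_at=4)
```

Here escalation fires at cycle 4 without a crash and at cycle 8 with one, even though the breach never stopped. With several debounced levels stacked on top of each other, the same reset repeats the whole sequence, which is the unprotected window the finding describes.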

## Claude Sonnet findings:

Sonnet produced 7 findings, all of which mapped to findings already identified (at greater
depth) by GPT-5 or Opus. Its findings were accurate but structurally simpler: 2-3
paragraphs each vs 5-6 for GPT-5 and Opus. The 29-second completion time and 1,306 output
tokens reflect this reduced depth.

## Quality Assessment

- **GPT-5** produced the most operationally actionable findings. Its #3 (mode selection
  mapping gap), #4 (in-flight race), and #5 (cancel-all vs liquidation deadlock) all
  describe specific event sequences that would produce incorrect behavior in implementation.
  GPT-5 also provided concrete recommendations for fixing each gap (compositional policy
  model, command API specification, mode selection rules, atomic sequencing). Every finding
  references specific sections in both documents and describes WHY neither document alone
  can see the problem.

- **Claude Opus** found the most architecturally fundamental gap: if the kill switch
  terminates Portfolio Risk, then Document B's entire escalation logic above "alert"
  becomes dead code. This isn't just a race condition or authority conflict; it's a
  structural contradiction where engaging the safety mechanism kills the component that
  determines whether to escalate further or de-escalate. Opus's monitor-crash
  finding (#6) is also unique in identifying an adversarially exploitable window. The dual
  manual gates finding (#8) shows Opus's characteristic attention to recovery-path tensions.

- **Claude Sonnet** was fast but added no unique analytical value for this task. Every
  finding was a simplified version of something GPT-5 or Opus found in greater depth.
  For a 29-second, 1,306-token response, it is competent as a "quick summary of obvious
  interface issues" but did not catch the subtler problems.

## Key Insight — Interface Analysis as a NEW Task Type

This is the first experiment testing **compositional analysis** across two documents that
reference each other. Previous experiments (including #28 cross-document consistency) gave
models multiple documents and asked about consistency. This experiment differs in a critical
way: it asks specifically about **assumptions each document makes about the other's behavior**.

The results suggest this task type favors reasoning models even more strongly than
single-document analysis:

| Task type | Sonnet unique findings | Opus unique findings | GPT-5 unique findings |
|---|---|---|---|
| Hidden assumptions (single doc) | 2-6 | 3-6 | 5-14 |
| Race conditions (single doc) | 0 | 5 | 6 |
| Interface analysis (two docs) | **0** | **3** | **4** |

Sonnet produced no unique findings here, as on the single-document race-condition task,
although it did produce some on single-document hidden-assumption tasks. One plausible
explanation is that reasoning about interfaces requires holding two mental models
simultaneously and finding contradictions between them, which is a harder task than
analyzing one document's internal consistency. Extended reasoning (GPT-5's 8,512
reasoning tokens) and deep internal reasoning (Opus) appear necessary for this.
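
The "unique findings" counts for this experiment amount to a set difference over findings mapped to canonical issues. A minimal sketch, assuming short hypothetical issue labels derived from the headings above; Sonnet's two findings beyond the common five are omitted because this note does not say which issues they mapped to:

```python
# Issue labels are hypothetical shorthand for the findings listed in this note.
COMMON = {
    "competing-policy-writers",
    "restrict-liquidate-semantics",
    "autonomous-vs-manual-liquidation",
    "recovery-authority",
    "instant-engagement-vs-debounce",
}

findings = {
    "gpt5": COMMON | {"mode-selection", "in-flight-race",
                      "cancel-all-deadlock", "global-vs-per-user"},
    "opus": COMMON | {"dead-metric-evaluator", "crash-reset", "dual-manual-gates"},
    "sonnet": set(COMMON),  # remaining two mappings are not given in this note
}


def unique_findings(by_model):
    """For each model, keep only the issues that no other model reported."""
    return {
        model: issues - set().union(
            *(other for name, other in by_model.items() if name != model)
        )
        for model, issues in by_model.items()
    }


unique_counts = {model: len(u) for model, u in unique_findings(findings).items()}
```

The resulting counts (GPT-5: 4, Opus: 3, Sonnet: 0) match the interface-analysis row of the table. The step this sketch hides is the manual one: deciding that two differently worded findings refer to the same underlying issue.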

## Comparison to Finding #28 (Cross-Document Consistency)

Finding #28 tested cross-document consistency on different document pairs. That experiment
asked "are these documents consistent?" (verification task). This experiment asks "what does
each assume about the other?" (generative/constructive task). The distinction matters:

- **Consistency checking** (Finding #28): compare stated facts across documents. More
  surface-level: look for contradictions in explicit claims.
- **Interface assumption analysis** (this finding): reason about what each document takes for
  granted about the other's implementation. Requires understanding the *implications* of each
  design, not just the *statements*.

The models that excel are the same (GPT-5 and Opus), but the nature of their findings differs:
GPT-5's interface findings are more operational (specific race conditions, specific event
sequences), while Opus's are more structural (fundamental architectural contradictions,
recovery-path tensions).

## Practical Implications

1. **For architecture reviews of interacting components:** Run GPT-5 + Opus together. GPT-5
   catches operational gaps (races, ordering, command interfaces). Opus catches structural
   contradictions (dead code paths, killed components, recovery-path conflicts).

2. **Sonnet is NOT suitable as the sole reviewer for interface analysis.** In this run it
   surfaced only issues the other two models also found. Use it for single-document tasks
   where it has proven capable (assumption-finding, structural review).

3. **The "both documents together" framing is critical.** Previous experiments showed models
   find plenty of issues in each document alone. The interface analysis prompt forces models
   to reason about the SPACE BETWEEN the documents — which is where the real bugs live in
   multi-component systems.

4. **Recommendations should specify the integration contract.** The most valuable output from
   this type of analysis is not "here's what's wrong" but "here's what the integration
   contract must define" — precedence rules, command APIs, event subscriptions, atomic
   sequencing guarantees.
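
One way to make point 4 concrete is to treat the integration contract as a checklist with the four sections named there. The sketch below is illustrative only: `IntegrationContract` and the example precedence rule are hypothetical, not taken from gargoyle's documents.

```python
from dataclasses import dataclass, field, fields


@dataclass
class IntegrationContract:
    """Hypothetical checklist of what a two-component contract should pin down."""
    precedence_rules: list[str] = field(default_factory=list)
    command_apis: list[str] = field(default_factory=list)
    event_subscriptions: list[str] = field(default_factory=list)
    atomic_sequencing: list[str] = field(default_factory=list)

    def undefined_sections(self) -> list[str]:
        """Return the names of sections that have no entries yet."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Example entry is illustrative; the real rule is for the design owners to decide.
contract = IntegrationContract(
    precedence_rules=["global kill switch outranks escalation-policy writes"],
)
gaps = contract.undefined_sections()
```

Asking each model to fill in a structure like this, rather than list problems, would put its output directly in the form the review needs, and the empty sections show where the spec is still silent.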

## Next Experiments

- **Three-document interface analysis:** Add `continuous-risk-monitoring.md` as a third document
  (it bridges both). Do models find additional interface gaps that only emerge from the
  three-way interaction?
- **Adversarial ensemble on interface analysis:** Give Opus GPT-5's interface findings and ask
  it to critique + extend (per Finding #35 methodology). Does the ensemble approach produce
  even more interface insights?
- **Implementation-level verification:** Take the top interface findings from this experiment
  and check them against gargoyle's actual code. Are these REAL bugs or are the documents
  already consistent at the implementation level despite the spec-level gaps?