model-research/findings/2026-05-07-36-compositional-interface-analysis.md
claw c071ffc31f Finding #36: Compositional interface analysis - two-document interface assumptions
New experiment type: give models two related architecture documents and ask
them to identify assumptions each document makes about the other that could
be violated.

Results: GPT-5 (10 findings, 175s, operational/race-focused) and Opus (10
findings, 111s, structural/architectural) both found unique interface gaps.
Sonnet (7 findings, 29s) found nothing unique - all its findings were
simplified versions of GPT-5/Opus findings.

Key insight: Interface analysis requires holding two mental models simultaneously
and is harder than single-document analysis. Sonnet produced 0 unique findings
(vs 2-6 on single-doc tasks). Extended reasoning appears necessary for this
task type.
2026-05-07 02:48:46 -07:00


Finding #36: Compositional Interface Analysis — Models find qualitatively different interface gaps when analyzing two interacting design documents

Date: 2026-05-07

Task: Identify security-relevant INTERFACE ASSUMPTIONS between gargoyle's kill-switch.md (Document A, 293 lines) and escalation-policy.md (Document B, 228 lines): places where one document assumes behavior the other doesn't guarantee, producing gaps visible only from the interface between both designs.

How we used them: Both documents provided in full (521 lines total) plus the same focused analytical question to all 3 models. Prompt explicitly specified 5 categories (authority conflicts, state consistency gaps, timing/ordering hazards, recovery path contradictions, semantic mismatches) and required interface-only findings. GPT-5 via HAI OpenAI endpoint; Opus 4.6 and Sonnet 4.6 via HAI Anthropic endpoint. No tools, no project context beyond the two documents.

Results

| Model | Time | Input tokens | Output tokens | Reasoning tokens | Findings |
| --- | --- | --- | --- | --- | --- |
| GPT-5 | 175s | 5,465 | 11,339 | 8,512 | 10 |
| Claude Opus 4.6 | 111s | 6,209 | 4,496 | (internal) | 10 |
| Claude Sonnet 4.6 | 29s | 6,209 | 1,306 | (internal) | 7 |
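
To compare cost per finding across the three runs, the figures in the results table can be normalized by finding count. The sketch below uses only numbers from that table; the dictionary layout and the `per_finding` helper are illustrative names. The reasoning-share line assumes GPT-5's reasoning tokens are counted inside its output tokens, which the table itself does not state.

```python
# Figures copied from the results table; the structure and names are illustrative.
runs = {
    "GPT-5": {"seconds": 175, "output_tokens": 11339, "findings": 10},
    "Claude Opus 4.6": {"seconds": 111, "output_tokens": 4496, "findings": 10},
    "Claude Sonnet 4.6": {"seconds": 29, "output_tokens": 1306, "findings": 7},
}


def per_finding(run):
    """Normalize wall-clock time and output tokens by the number of findings."""
    return {
        "seconds_per_finding": round(run["seconds"] / run["findings"], 1),
        "tokens_per_finding": round(run["output_tokens"] / run["findings"]),
    }


summary = {name: per_finding(run) for name, run in runs.items()}

# ASSUMPTION: the 8,512 reasoning tokens are a subset of the 11,339 output tokens.
gpt5_reasoning_share = round(8512 / 11339, 2)

for name, stats in summary.items():
    print(name, stats)
print("GPT-5 reasoning share:", gpt5_reasoning_share)
```

Per-finding cost says nothing about finding quality or uniqueness, so it is best read alongside the uniqueness counts rather than as a ranking.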

What they found — common ground (all 3 identified):

  • Competing writers to acceptance policy — both documents assume authority to set OrderManager's acceptance policy with no defined precedence/arbitration mechanism
  • Semantic collision on "restrict" / "liquidate" — same terms used for escalation LEVELS (Document B, autonomous, reversible) and kill switch MODES (Document A, terminal, manual recovery), creating implementation ambiguity
  • Autonomous liquidation vs manual liquidation conflict — Document B assumes autonomous surgical liquidation; Document A defines manual total liquidation. Neither defines which component executes B's autonomous liquidation.
  • Recovery/de-escalation authority conflict — Document B assumes automatic de-escalation via cooldown; Document A requires manual disengagement; incompatible gates
  • Kill switch instant engagement vs debounce timing — external sources can bypass the entire escalation ladder, invalidating Document B's careful debounce state
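
The first common finding (two writers, no arbitration) can be made concrete with a toy arbiter. This is a minimal sketch of one possible precedence rule, "most restrictive request wins"; neither document is known to specify it. `AcceptancePolicy`, `PolicyArbiter`, and the writer names are hypothetical; only the policy values (open, close-only, reject-all) come from the findings.

```python
from enum import IntEnum


class AcceptancePolicy(IntEnum):
    """Ordered so that a larger value is always more restrictive."""
    OPEN = 0
    CLOSE_ONLY = 1
    REJECT_ALL = 2


class PolicyArbiter:
    """Hypothetical single owner of the acceptance policy.

    Writers register requests instead of writing the policy directly;
    the effective policy is the most restrictive outstanding request.
    """

    def __init__(self):
        self._requests = {}

    def request(self, writer, policy):
        self._requests[writer] = policy

    def withdraw(self, writer):
        self._requests.pop(writer, None)

    def effective(self):
        return max(self._requests.values(), default=AcceptancePolicy.OPEN)


arbiter = PolicyArbiter()
arbiter.request("kill_switch", AcceptancePolicy.REJECT_ALL)
# A later, weaker request from the escalation side cannot loosen the policy.
arbiter.request("escalation", AcceptancePolicy.CLOSE_ONLY)
print(arbiter.effective().name)
```

With last-writer-wins semantics the second call would have downgraded reject-all to close-only; under this rule the policy only loosens when the stricter writer withdraws its request.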

GPT-5 unique findings (not in either Claude model):

  • Mode selection ambiguity during kill switch escalation (#3): Document B signals "escalate to kill switch" without specifying which mode. If broker is unreachable, Document A prescribes RESTRICT (no broker interaction); Document B's liquidate-level context could naively map to LIQUIDATE, causing cancel attempts in a scenario where Document A says "don't talk to the broker." This is a mapping gap at the command interface.
  • In-flight order race during engagement (#4): Document A terminates decision engine BEFORE flipping acceptance policy. Orders submitted by Document B's liquidation logic just before termination could arrive at OrderManager while policy is still "open." Neither doc specifies atomic sequencing across their boundary.
  • Kill switch LIQUIDATE cancel-all negating B's autonomous liquidation (#5): Document A cancels ALL open orders in LIQUIDATE mode. If Document B just submitted close-only orders to reduce risk, A's cancel-all undoes B's remediation. B sees insufficient reduction on next cycle, re-triggers, gets cancelled again. Deadlock between safety mechanisms.
  • Global kill switch vs per-user escalation policy writes (#10): Document A's "global always wins" precedence is defined only for kill switch states, not relative to other policy writers. Document B could overwrite reject-all with close-only while global kill is engaged.
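
The cancel-all loop in finding #5 is easy to reproduce in a toy simulation. The sketch below is not derived from gargoyle's code: the exposure numbers, the `run_cycles` function, and the "spare closing orders" variant are all hypothetical, chosen only to show why the two mechanisms can cancel each other out.

```python
def run_cycles(exposure, threshold, cycles, cancel_spares_closing_orders):
    """Return the exposure after each cycle of a toy B-submits / A-cancels loop."""
    history = []
    for _ in range(cycles):
        open_orders = []
        # Document B side: submit a close-only order while over threshold.
        if exposure > threshold:
            open_orders.append({"kind": "close", "qty": exposure - threshold})
        # Document A side: cancel-all on LIQUIDATE, optionally sparing
        # risk-reducing orders (a hypothetical variant, not in either doc).
        if cancel_spares_closing_orders:
            open_orders = [o for o in open_orders if o["kind"] == "close"]
        else:
            open_orders = []
        # Whatever survives the cancel is assumed to fill completely.
        for order in open_orders:
            exposure -= order["qty"]
        history.append(exposure)
    return history


print(run_cycles(100, 40, 3, cancel_spares_closing_orders=False))
print(run_cycles(100, 40, 3, cancel_spares_closing_orders=True))
```

In the first run exposure never moves, so B would re-trigger on every cycle; in the second it drops to the threshold after one cycle. Whether sparing closing orders is acceptable is a design decision for the integration contract, not something this sketch settles.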

Claude Opus unique findings (not in either other model):

  • Decision engine termination kills the metric evaluator (#2 + #7): Document B requires ONGOING metric evaluation at restrict/liquidate levels (to determine whether to escalate further or de-escalate). Document A terminates Portfolio Risk (the metric computation component) on kill switch engagement. No document identifies which component evaluates risk metrics while the decision engine is dead. This makes Document B's entire post-restrict logic dead code if restrict = kill switch engagement.
  • Monitor crash resets escalation state with no backstop (#6): Document B accepts that crash = state loss, restart from clear. Combined with Document A never having been engaged (escalation hadn't reached that point), a well-timed crash creates a window where NO risk controls are active despite ongoing threshold breaches. The full re-escalation sequence (14+ cycles) runs with zero protection.
  • Dual manual gates with undefined ordering (#8): Document A requires 3 manual steps (RESTRICT→LIQUIDATE transition, disengage, release users). Document B requires 1 manual step (operator confirms recovery from liquidate level). These are independent state machines with their own manual gates. Neither defines whether one gate satisfies the other or what order they must be performed in.
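
Opus's first finding amounts to a dependency check across the interface: what one document still needs running after the other has stopped it. A minimal sketch follows, assuming each document can be reduced to a set of component labels. The labels and the `orphaned_dependencies` helper are hypothetical, and the inclusion of `order_manager` in B's requirements is an assumption for illustration.

```python
def orphaned_dependencies(terminated_on_engage, required_after_engage):
    """Components that one design still needs after the other has terminated them."""
    return sorted(set(required_after_engage) & set(terminated_on_engage))


# Hypothetical labels loosely based on the components named in the findings.
doc_a_terminates = {"decision_engine", "portfolio_risk"}
doc_b_requires = {"portfolio_risk", "order_manager"}

print(orphaned_dependencies(doc_a_terminates, doc_b_requires))
```

A non-empty result is the signal to look for: here it flags the metric computation component that B's post-restrict logic depends on.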

Claude Sonnet findings:

Sonnet reported 7 findings, all of which mapped to findings already identified by GPT-5 or Opus. None was unique, and each was covered at greater depth by one of the other two models. Its findings were accurate but structurally simpler: 2-3 paragraphs each vs 5-6 for GPT-5 and Opus. The 29-second completion time and 1,306 output tokens reflect this reduced depth.
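
"Unique" here means a finding that no other model reported. Once findings have been mapped by hand to shared labels, the count is a set difference. The sketch below uses made-up labels and model names, not the experiment's actual mapping, and `unique_findings` is an illustrative helper.

```python
def unique_findings(findings_by_model):
    """For each model, return the findings that no other model reported."""
    unique = {}
    for model, findings in findings_by_model.items():
        others = set().union(
            *(f for m, f in findings_by_model.items() if m != model)
        )
        unique[model] = findings - others
    return unique


# Toy data only: labels are placeholders, not the real finding mapping.
toy = {
    "model_a": {"shared-1", "shared-2", "race-1"},
    "model_b": {"shared-1", "shared-2", "struct-1"},
    "model_c": {"shared-1", "shared-2"},
}

result = unique_findings(toy)
print({model: sorted(found) for model, found in result.items()})
```

The hard part is the manual mapping step, since a simplified finding and a deeper one must be judged to be the same issue before they can share a label.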

Quality Assessment

  • GPT-5 produced the most operationally actionable findings. Its #3 (mode selection mapping gap), #4 (in-flight race), and #5 (cancel-all vs liquidation deadlock) all describe specific event sequences that would produce incorrect behavior in implementation. GPT-5 also provided concrete recommendations for fixing each gap (compositional policy model, command API specification, mode selection rules, atomic sequencing). Every finding references specific sections in both documents and describes WHY neither document alone can see the problem.

  • Claude Opus found the most architecturally fundamental gap: if the kill switch terminates Portfolio Risk, then Document B's entire escalation logic above "alert" becomes dead code. This isn't just a race condition or authority conflict — it's a structural contradiction where engaging the safety mechanism kills the component that determines whether the safety mechanism should have been engaged. Opus's monitor-crash finding (#6) is also unique in identifying an adversarially exploitable window. The dual manual gates finding (#8) shows Opus's characteristic attention to recovery-path tensions.

  • Claude Sonnet was fast but added no unique analytical value for this task. Every finding was a simplified version of something GPT-5 or Opus found in greater depth. For a 29-second, 1,306-token response, it is competent as a "quick summary of obvious interface issues" but did not catch the subtler problems the other two models found.

Key Insight — Interface Analysis as a NEW Task Type

This is the first experiment testing compositional analysis across two documents that reference each other. Previous experiments (including #28 cross-document consistency) gave models multiple documents and asked about consistency. This experiment differs in a critical way: it asks specifically about assumptions each document makes about the other's behavior.

The results suggest this task type favors reasoning models at least as strongly as single-document race-condition analysis, and more strongly than single-document hidden-assumption analysis:

| Task type | Sonnet unique findings | Opus unique findings | GPT-5 unique findings |
| --- | --- | --- | --- |
| Hidden assumptions (single doc) | 2-6 | 3-6 | 5-14 |
| Race conditions (single doc) | 0 | 5 | 6 |
| Interface analysis (two docs) | 0 | 3 | 4 |

Sonnet's inability to produce ANY unique findings here, when it produced 2-6 on the single-document hidden-assumptions task, suggests that reasoning about interfaces requires holding two mental models simultaneously and finding contradictions between them. This is a harder cognitive task than analyzing one document's internal consistency. Sonnet also produced 0 unique findings on single-document race conditions, so the drop is relative to hidden-assumption analysis specifically. Extended reasoning (GPT-5's 8,512 reasoning tokens) and deep internal reasoning (Opus) appear necessary for this.

Comparison to Finding #28 (Cross-Document Consistency)

Finding #28 tested cross-document consistency on different document pairs. That experiment asked "are these documents consistent?" (verification task). This experiment asks "what does each assume about the other?" (generative/constructive task). The distinction matters:

  • Consistency checking (Finding #28): compare stated facts across documents. More surface-level: look for contradictions in explicit claims.
  • Interface assumption analysis (this finding): reason about what each document takes for granted about the other's implementation. Requires understanding the implications of each design, not just the statements.

The models that excel are the same (GPT-5 and Opus), but the nature of their findings differs: GPT-5's interface findings are more operational (specific race conditions, specific event sequences), while Opus's are more structural (fundamental architectural contradictions, recovery-path tensions).

Practical Implications

  1. For architecture reviews of interacting components: Run GPT-5 + Opus together. GPT-5 catches operational gaps (races, ordering, command interfaces). Opus catches structural contradictions (dead code paths, killed components, recovery-path conflicts).

  2. Sonnet is NOT suitable for interface analysis. Use it only for single-document tasks where it has proven capable (assumption-finding, structural review).

  3. The "both documents together" framing is critical. Previous experiments showed models find plenty of issues in each document alone. The interface analysis prompt forces models to reason about the SPACE BETWEEN the documents — which is where the real bugs live in multi-component systems.

  4. Recommendations should specify the integration contract. The most valuable output from this type of analysis is not "here's what's wrong" but "here's what the integration contract must define" — precedence rules, command APIs, event subscriptions, atomic sequencing guarantees.
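
As an example of what an "atomic sequencing guarantee" in such a contract could look like, the sketch below closes the order gate before terminating the decision engine, so no order can be accepted in between. This is one possible contract choice, not the sequence either document defines. `OrderGate`, `engage_kill_switch`, and the order strings are hypothetical stand-ins.

```python
import threading


class OrderGate:
    """Toy acceptance gate; a hypothetical stand-in for an order manager's policy check."""

    def __init__(self):
        self._lock = threading.Lock()
        self._accepting = True
        self.accepted = []

    def submit(self, order):
        # The check and the append happen under one lock, so an order is
        # either accepted before close() or rejected after it.
        with self._lock:
            if not self._accepting:
                return False
            self.accepted.append(order)
            return True

    def close(self):
        with self._lock:
            self._accepting = False


def engage_kill_switch(gate, terminate_engine):
    """Assumed contract: flip the acceptance policy first, then stop the engine."""
    gate.close()
    terminate_engine()


gate = OrderGate()
gate.submit("close-100")
engine_state = {"running": True}
engage_kill_switch(gate, lambda: engine_state.update(running=False))
late_order_accepted = gate.submit("close-50")
print(gate.accepted, late_order_accepted, engine_state)
```

Orders submitted before engagement still need a defined fate (fill, cancel, or drain), which is a separate clause the contract would have to state.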

Next Experiments

  • Three-document interface analysis: Add continuous-risk-monitoring.md as a third document (it bridges both). Do models find additional interface gaps that only emerge from the three-way interaction?
  • Adversarial ensemble on interface analysis: Give Opus GPT-5's interface findings and ask it to critique + extend (per Finding #35 methodology). Does the ensemble approach produce even more interface insights?
  • Implementation-level verification: Take the top interface findings from this experiment and check them against gargoyle's actual code. Are these REAL bugs or are the documents already consistent at the implementation level despite the spec-level gaps?