Finding #36: Compositional interface analysis - two-document interface assumptions

New experiment type: give models two related architecture documents and ask
them to identify assumptions each document makes about the other that could
be violated.

Results: GPT-5 (10 findings, 175s, operational/race-focused) and Opus (10
findings, 111s, structural/architectural) both found unique interface gaps.
Sonnet (7 findings, 29s) found nothing unique - all its findings were
simplified versions of GPT-5/Opus findings.

Key insight: Interface analysis requires holding two mental models simultaneously
and is harder than single-document analysis. Sonnet produced 0 unique findings
(vs 2-6 on single-doc hidden-assumption tasks). Extended reasoning appears
necessary for this task type.
claw
2026-05-07 02:48:46 -07:00
parent d8ddbc9861
commit c071ffc31f
@@ -0,0 +1,177 @@
# Finding #36: Compositional Interface Analysis — Models find qualitatively different interface gaps when analyzing two interacting design documents
**Date:** 2026-05-07
**Task:** Identify security-relevant INTERFACE ASSUMPTIONS between gargoyle's `kill-switch.md`
(293 lines) and `escalation-policy.md` (228 lines) — places where one document assumes
behavior the other doesn't guarantee, producing gaps visible only from the interface between
both designs.
**How we used them:** Both documents provided in full (521 lines total) + same focused
analytical question to all 3 models. Prompt explicitly specified 5 categories (authority
conflicts, state consistency gaps, timing/ordering hazards, recovery path contradictions,
semantic mismatches) and required interface-only findings. GPT-5 via HAI OpenAI endpoint;
Opus 4.6 and Sonnet 4.6 via HAI Anthropic endpoint. No tools, no project context beyond the
two documents.
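The prompt setup described above can be sketched roughly as follows. This is a minimal illustration, not the prompt actually used: the function name `build_interface_prompt` and all instruction wording are assumptions, and only the five category names and the Document A / Document B labelling come from this finding.

```python
# Hypothetical prompt builder. The real prompt text is not reproduced in this
# finding; only the five categories and the interface-only requirement are.
CATEGORIES = [
    "authority conflicts",
    "state consistency gaps",
    "timing/ordering hazards",
    "recovery path contradictions",
    "semantic mismatches",
]


def build_interface_prompt(doc_a: str, doc_b: str) -> str:
    """Assemble one prompt holding both documents and the focused question."""
    category_lines = "\n".join(
        f"{i}. {name}" for i, name in enumerate(CATEGORIES, start=1)
    )
    return (
        "Document A (kill-switch.md):\n"
        f"{doc_a}\n\n"
        "Document B (escalation-policy.md):\n"
        f"{doc_b}\n\n"
        "Identify security-relevant interface assumptions: places where one "
        "document assumes behavior the other does not guarantee.\n"
        "Report only findings visible from the interface between the two "
        "documents, in these categories:\n"
        f"{category_lines}\n"
    )
```

Sending the identical string to every model keeps the comparison limited to model behavior rather than prompt differences.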
## Results
| Model | Time | Input tokens | Output tokens | Reasoning tokens | Findings |
|---|---|---|---|---|---|
| GPT-5 | 175s | 5,465 | 11,339 | 8,512 | 10 |
| Claude Opus 4.6 | 111s | 6,209 | 4,496 | (internal) | 10 |
| Claude Sonnet 4.6 | 29s | 6,209 | 1,306 | (internal) | 7 |
## What they found — common ground (all 3 identified):
- **Competing writers to acceptance policy** — both documents assume authority to set
OrderManager's acceptance policy with no defined precedence/arbitration mechanism
- **Semantic collision on "restrict" / "liquidate"** — same terms used for escalation
LEVELS (Document B, autonomous, reversible) and kill switch MODES (Document A, terminal,
manual recovery), creating implementation ambiguity
- **Autonomous liquidation vs manual liquidation conflict** — Document B assumes
autonomous surgical liquidation; Document A defines manual total liquidation.
Neither defines which component executes B's autonomous liquidation.
- **Recovery/de-escalation authority conflict** — Document B assumes automatic de-escalation
via cooldown; Document A requires manual disengagement; incompatible gates
- **Kill switch instant engagement vs debounce timing** — external sources can bypass
the entire escalation ladder, invalidating Document B's careful debounce state
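The semantic collision can be made concrete with a small sketch. The enum names and the level ordering below are assumptions for illustration; the finding only establishes that "restrict" and "liquidate" name both an escalation level (Document B) and a kill switch mode (Document A).

```python
from enum import Enum


class EscalationLevel(Enum):
    """Document B levels: autonomous and reversible (ordering is assumed)."""
    CLEAR = "clear"
    ALERT = "alert"
    RESTRICT = "restrict"
    LIQUIDATE = "liquidate"


class KillSwitchMode(Enum):
    """Document A modes: terminal, with manual recovery."""
    RESTRICT = "restrict"
    LIQUIDATE = "liquidate"


def ambiguous_terms() -> set[str]:
    """Return the terms that name both an escalation level and a kill switch mode."""
    levels = {level.value for level in EscalationLevel}
    modes = {mode.value for mode in KillSwitchMode}
    return levels & modes


def describe(command: object) -> str:
    """Resolve a command by type, which a bare string like 'restrict' cannot do."""
    if isinstance(command, KillSwitchMode):
        return f"kill switch mode {command.value}: terminal, manual recovery"
    if isinstance(command, EscalationLevel):
        return f"escalation level {command.value}: autonomous, reversible"
    raise TypeError("untyped command is ambiguous at the interface")
```

A string-keyed command such as `"restrict"` matches both vocabularies, while distinct types force the caller to say which state machine is meant.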
## GPT-5 unique findings (not in either Claude model):
- **Mode selection ambiguity during kill switch escalation** (#3): Document B signals
"escalate to kill switch" without specifying which mode. If broker is unreachable,
Document A prescribes RESTRICT (no broker interaction); Document B's liquidate-level
context could naively map to LIQUIDATE, causing cancel attempts in a scenario where
Document A says "don't talk to the broker." This is a mapping gap at the command
interface.
- **In-flight order race during engagement** (#4): Document A terminates decision engine
BEFORE flipping acceptance policy. Orders submitted by Document B's liquidation logic
just before termination could arrive at OrderManager while policy is still "open."
Neither doc specifies atomic sequencing across their boundary.
- **Kill switch LIQUIDATE cancel-all negating B's autonomous liquidation** (#5):
Document A cancels ALL open orders in LIQUIDATE mode. If Document B just submitted
close-only orders to reduce risk, A's cancel-all undoes B's remediation. B sees
insufficient reduction on next cycle, re-triggers, gets cancelled again. The two
safety mechanisms deadlock in effect: a submit/cancel loop that never reduces the position.
- **Global kill switch vs per-user escalation policy writes** (#10): Document A's "global
always wins" precedence is defined only for kill switch states, not relative to other
policy writers. Document B could overwrite reject-all with close-only while global kill
is engaged.
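The submit/cancel loop in finding #5 can be illustrated with a toy simulation. It is a sketch under strong assumptions: the position size, the per-cycle reduction, and the premise that the cancel always lands before any fill are all hypothetical and come from neither document.

```python
def simulate(cycles: int, cancel_all_active: bool,
             position: int = 100, reduce_by: int = 25) -> list[int]:
    """Return the position after each cycle.

    Each cycle, the escalation logic (Document B) submits one close-only
    order. If the kill switch LIQUIDATE cancel-all (Document A) is active,
    the order is cancelled before it can fill (assumed worst-case timing).
    """
    history = []
    for _ in range(cycles):
        open_orders = []
        if position > 0:
            # B: submit a close-only order to reduce exposure.
            open_orders.append(min(reduce_by, position))
        if cancel_all_active:
            # A: cancel ALL open orders, including B's remediation.
            open_orders.clear()
        # Whatever survived the cancel is filled.
        position -= sum(open_orders)
        history.append(position)
    return history


if __name__ == "__main__":
    print(simulate(3, cancel_all_active=True))   # position never moves
    print(simulate(3, cancel_all_active=False))  # position shrinks each cycle
```

With cancel-all active the position stays flat no matter how many cycles run, which is the behavior B interprets as "insufficient reduction" and re-triggers on.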
## Claude Opus unique findings (not in GPT-5 or Sonnet):
- **Decision engine termination kills the metric evaluator** (#2 + #7): Document B
requires ONGOING metric evaluation at restrict/liquidate levels (to determine whether
to escalate further or de-escalate). Document A terminates Portfolio Risk (the metric
computation component) on kill switch engagement. No document identifies which component
evaluates risk metrics while the decision engine is dead. This makes Document B's entire
post-restrict logic dead code if restrict = kill switch engagement.
- **Monitor crash resets escalation state with no backstop** (#6): Document B accepts that
crash = state loss, restart from clear. If the kill switch (Document A) was never
engaged because escalation had not yet reached that point, a well-timed crash creates
a window where NO risk controls are active despite ongoing threshold breaches. The full
re-escalation sequence (14+ cycles) runs with zero protection.
- **Dual manual gates with undefined ordering** (#8): Document A requires 3 manual steps
(RESTRICT→LIQUIDATE transition, disengage, release users). Document B requires 1 manual
step (operator confirms recovery from liquidate level). These are independent state
machines with their own manual gates. Neither defines whether one gate satisfies the
other or what order they must be performed in.
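The monitor-crash window in finding #6 can be sketched with a toy debounce ladder. The class name, the level ordering, and the debounce of 3 cycles per level are assumptions chosen for illustration; the actual debounce counts are in `escalation-policy.md` and are not reproduced here.

```python
LEVELS = ["clear", "alert", "restrict", "liquidate"]  # ordering is assumed


class EscalationMonitor:
    """In-memory escalation state: lost entirely if the process crashes."""

    def __init__(self, debounce: int):
        self.debounce = debounce   # consecutive breaches needed per level
        self.level = 0             # index into LEVELS, starts at "clear"
        self.breach_streak = 0

    def observe(self, breached: bool) -> str:
        """Process one evaluation cycle and return the resulting level."""
        if breached:
            self.breach_streak += 1
            can_escalate = self.level < len(LEVELS) - 1
            if self.breach_streak >= self.debounce and can_escalate:
                self.level += 1
                self.breach_streak = 0
        else:
            self.breach_streak = 0
        return LEVELS[self.level]


def cycles_to_reach(target: str, debounce: int) -> int:
    """Count continuously breaching cycles needed to go from clear to target."""
    monitor = EscalationMonitor(debounce)
    cycles = 1
    while monitor.observe(True) != target:
        cycles += 1
    return cycles
```

A crash is modelled by constructing a fresh `EscalationMonitor`: the new instance reports `clear` on its first breaching cycle and must repeat the whole climb, during which neither the ladder nor the never-engaged kill switch restricts anything.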
## Claude Sonnet findings:
Sonnet reported 7 findings, all of which map to findings that GPT-5 or Opus identified
at greater depth, so it contributed no unique findings. Its findings were accurate but
structurally simpler: 2-3 paragraphs each vs 5-6 for GPT-5 and Opus. The 29-second
completion time and 1,306 output tokens reflect this reduced depth.
## Quality Assessment
- **GPT-5** produced the most operationally actionable findings. Its #3 (mode selection
mapping gap), #4 (in-flight race), and #5 (cancel-all vs liquidation deadlock) all
describe specific event sequences that would produce incorrect behavior in implementation.
GPT-5 also provided concrete recommendations for fixing each gap (compositional policy
model, command API specification, mode selection rules, atomic sequencing). Every finding
references specific sections in both documents and describes WHY neither document alone
can see the problem.
- **Claude Opus** found the most architecturally fundamental gap: if the kill switch
terminates Portfolio Risk, then Document B's entire escalation logic above "alert"
becomes dead code. This isn't just a race condition or authority conflict — it's a
structural contradiction where engaging the safety mechanism kills the component that
determines whether the safety mechanism should have been engaged. Opus's monitor-crash
finding (#6) is also unique in identifying an adversarially exploitable window. The dual
manual gates finding (#8) shows Opus's characteristic attention to recovery-path tensions.
- **Claude Sonnet** was fast but added no unique analytical value for this task. Every
finding was a simplified version of something GPT-5 or Opus found in greater depth.
For a 29-second, 1,306-token response, it's competent as a "quick summary of obvious
interface issues" but did not catch the subtle problems.
## Key Insight — Interface Analysis as a NEW Task Type
This is the first experiment testing **compositional analysis** across two documents that
reference each other. Previous experiments (including #28 cross-document consistency) gave
models multiple documents and asked about consistency. This experiment differs in a critical
way: it asks specifically about **assumptions each document makes about the other's behavior**.
The results suggest this task type favors reasoning models even more strongly than
single-document analysis:
| Task type | Sonnet unique findings | Opus unique findings | GPT-5 unique findings |
|---|---|---|---|
| Hidden assumptions (single doc) | 2-6 | 3-6 | 5-14 |
| Race conditions (single doc) | 0 | 5 | 6 |
| Interface analysis (two docs) | **0** | **3** | **4** |
Sonnet produced no unique findings here, although it produces 2-6 on single-document
hidden-assumption tasks (it also produced none on single-document race conditions). This
suggests that reasoning about interfaces requires holding two mental models simultaneously
and finding contradictions between them, which is a harder cognitive task than analyzing
one document's internal consistency. Extended reasoning (GPT-5's 8,512 tokens) and deep
internal reasoning (Opus) appear necessary for this.
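The "unique findings" counts above amount to a set difference over deduplicated issues. A minimal sketch of that bookkeeping follows; the issue labels in the example are illustrative placeholders covering only a subset of the findings, not the full mapping used in this experiment.

```python
def unique_findings(findings_by_model: dict[str, set[str]]) -> dict[str, set[str]]:
    """For each model, keep only the issues no other model reported."""
    unique = {}
    for model, issues in findings_by_model.items():
        others = set().union(
            *(found for name, found in findings_by_model.items() if name != model)
        )
        unique[model] = issues - others
    return unique


if __name__ == "__main__":
    # Placeholder labels: each model's raw findings are assumed to have been
    # mapped to canonical issue names by hand before this step.
    example = {
        "gpt-5": {"policy-writers", "term-collision", "mode-mapping"},
        "opus": {"policy-writers", "term-collision", "evaluator-killed"},
        "sonnet": {"policy-writers", "term-collision"},
    }
    for model, issues in unique_findings(example).items():
        print(model, sorted(issues))
```

The manual mapping step matters more than the arithmetic: two findings that describe the same gap in different words must share one canonical label, otherwise both models are credited with a unique finding.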
## Comparison to Finding #28 (Cross-Document Consistency)
Finding #28 tested cross-document consistency on different document pairs. That experiment
asked "are these documents consistent?" (verification task). This experiment asks "what does
each assume about the other?" (generative/constructive task). The distinction matters:
- **Consistency checking** (Finding #28): compare stated facts across documents. This is
more surface-level: look for contradictions in explicit claims.
- **Interface assumption analysis** (this finding): reason about what each document takes for
granted about the other's implementation. Requires understanding the *implications* of each
design, not just the *statements*.
The models that excel are the same (GPT-5 and Opus), but the nature of their findings differs:
GPT-5's interface findings are more operational (specific race conditions, specific event
sequences), while Opus's are more structural (fundamental architectural contradictions,
recovery-path tensions).
## Practical Implications
1. **For architecture reviews of interacting components:** Run GPT-5 + Opus together. GPT-5
catches operational gaps (races, ordering, command interfaces). Opus catches structural
contradictions (dead code paths, killed components, recovery-path conflicts).
2. **Sonnet is NOT suitable for interface analysis.** It contributed 0 unique findings in
this experiment. Use it only for single-document tasks where it has proven capable
(assumption-finding, structural review).
3. **The "both documents together" framing is critical.** Previous experiments showed models
find plenty of issues in each document alone. The interface analysis prompt forces models
to reason about the SPACE BETWEEN the documents — which is where the real bugs live in
multi-component systems.
4. **Recommendations should specify the integration contract.** The most valuable output from
this type of analysis is not "here's what's wrong" but "here's what the integration
contract must define" — precedence rules, command APIs, event subscriptions, atomic
sequencing guarantees.
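As one illustration of what a precedence rule in such a contract could look like, the sketch below arbitrates between competing acceptance-policy writers with a "most restrictive wins" rule. This rule, the `PolicyArbiter` class, and the writer names are assumptions for illustration; neither document defines them, and they are not the specific fix GPT-5 recommended.

```python
from enum import IntEnum


class AcceptancePolicy(IntEnum):
    """Policy values named in the findings, ordered by restrictiveness (assumed)."""
    OPEN = 0
    CLOSE_ONLY = 1
    REJECT_ALL = 2


class PolicyArbiter:
    """Single owner of the effective policy; writers only register requests."""

    def __init__(self) -> None:
        self._requests: dict[str, AcceptancePolicy] = {}

    def request(self, writer: str, policy: AcceptancePolicy) -> AcceptancePolicy:
        """Record a writer's requested policy and return the effective one."""
        self._requests[writer] = policy
        return self.effective()

    def release(self, writer: str) -> AcceptancePolicy:
        """Withdraw a writer's request and return the effective policy."""
        self._requests.pop(writer, None)
        return self.effective()

    def effective(self) -> AcceptancePolicy:
        """Most restrictive outstanding request wins; OPEN if there are none."""
        return max(self._requests.values(), default=AcceptancePolicy.OPEN)


if __name__ == "__main__":
    arbiter = PolicyArbiter()
    arbiter.request("kill_switch", AcceptancePolicy.REJECT_ALL)
    # The escalation policy can no longer weaken reject-all to close-only.
    print(arbiter.request("escalation", AcceptancePolicy.CLOSE_ONLY).name)
```

Under this rule the overwrite described in GPT-5's #10 cannot happen, because a later close-only request never lowers an outstanding reject-all. Whether de-escalation should be allowed to release a request automatically is a separate question the contract would still need to answer.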
## Next Experiments
- **Three-document interface analysis:** Add `continuous-risk-monitoring.md` as a third document
(it bridges both). Do models find additional interface gaps that only emerge from the
three-way interaction?
- **Adversarial ensemble on interface analysis:** Give Opus GPT-5's interface findings and ask
it to critique + extend (per Finding #35 methodology). Does the ensemble approach produce
even more interface insights?
- **Implementation-level verification:** Take the top interface findings from this experiment
and check them against gargoyle's actual code. Are these REAL bugs or are the documents
already consistent at the implementation level despite the spec-level gaps?