New experiment type: give models two related architecture documents and ask them to identify assumptions each document makes about the other that could be violated. Results: GPT-5 (10 findings, 175s, operational/race-focused) and Opus (10 findings, 111s, structural/architectural) both found unique interface gaps. Sonnet (7 findings, 29s) found nothing unique - all its findings were simplified versions of GPT-5/Opus findings. Key insight: Interface analysis requires holding two mental models simultaneously and is harder than single-document analysis. Sonnet produced 0 unique findings (vs 2-6 on single-doc tasks). Extended reasoning appears necessary for this task type.
# Finding #36: Compositional Interface Analysis — Models find qualitatively different interface gaps when analyzing two interacting design documents

**Date:** 2026-05-07

**Task:** Identify security-relevant INTERFACE ASSUMPTIONS between gargoyle's `kill-switch.md`
(293 lines, Document A below) and `escalation-policy.md` (228 lines, Document B below): places
where one document assumes behavior the other doesn't guarantee, producing gaps visible only
at the interface between the two designs.

**How we used them:** Both documents were provided in full (521 lines total) with the same
focused analytical question to all 3 models. The prompt explicitly specified 5 categories
(authority conflicts, state consistency gaps, timing/ordering hazards, recovery path
contradictions, semantic mismatches) and required interface-only findings. GPT-5 ran via the
HAI OpenAI endpoint; Opus 4.6 and Sonnet 4.6 ran via the HAI Anthropic endpoint. No tools, no
project context beyond the two documents.

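The setup above can be sketched as a single prompt builder. This is a minimal illustration, not the prompt actually used: the wording, the `build_interface_prompt` name, and the `=== Document A/B ===` delimiters are assumptions; only the five category names and the interface-only requirement come from the description above.

```python
# Hypothetical prompt builder; the real prompt text is not reproduced in this note.
CATEGORIES = [
    "authority conflicts",
    "state consistency gaps",
    "timing/ordering hazards",
    "recovery path contradictions",
    "semantic mismatches",
]


def build_interface_prompt(doc_a: str, doc_b: str) -> str:
    """Assemble one prompt containing both documents and the focused question."""
    category_lines = "\n".join(
        f"{i}. {name}" for i, name in enumerate(CATEGORIES, 1)
    )
    return (
        "You are given two design documents that interact.\n\n"
        f"=== Document A ===\n{doc_a}\n\n"
        f"=== Document B ===\n{doc_b}\n\n"
        "List places where one document assumes behavior the other does not guarantee.\n"
        "Report interface-only findings (skip issues visible in a single document).\n"
        "Cover these categories:\n"
        f"{category_lines}\n"
    )


# Placeholders stand in for the full file contents.
prompt = build_interface_prompt(
    "<contents of kill-switch.md>",
    "<contents of escalation-policy.md>",
)
```

Sending the identical string to every model keeps the comparison about the models rather than about prompt differences.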
## Results

| Model | Time | Input tokens | Output tokens | Reasoning tokens | Findings |
|---|---|---|---|---|---|
| GPT-5 | 175s | 5,465 | 11,339 | 8,512 | 10 |
| Claude Opus 4.6 | 111s | 6,209 | 4,496 | (internal) | 10 |
| Claude Sonnet 4.6 | 29s | 6,209 | 1,306 | (internal) | 7 |

## What they found — common ground (all 3 identified):

- **Competing writers to acceptance policy:** both documents assume authority to set
  OrderManager's acceptance policy with no defined precedence/arbitration mechanism
- **Semantic collision on "restrict" / "liquidate":** same terms used for escalation
  LEVELS (Document B, autonomous, reversible) and kill switch MODES (Document A, terminal,
  manual recovery), creating implementation ambiguity
- **Autonomous liquidation vs manual liquidation conflict:** Document B assumes
  autonomous surgical liquidation; Document A defines manual total liquidation.
  Neither defines which component executes B's autonomous liquidation.
- **Recovery/de-escalation authority conflict:** Document B assumes automatic de-escalation
  via cooldown; Document A requires manual disengagement; incompatible gates
- **Kill switch instant engagement vs debounce timing:** external sources can bypass
  the entire escalation ladder, invalidating Document B's careful debounce state

## GPT-5 unique findings (not in either Claude model):

- **Mode selection ambiguity during kill switch escalation** (#3): Document B signals
  "escalate to kill switch" without specifying which mode. If broker is unreachable,
  Document A prescribes RESTRICT (no broker interaction); Document B's liquidate-level
  context could naively map to LIQUIDATE, causing cancel attempts in a scenario where
  Document A says "don't talk to the broker." This is a mapping gap at the command
  interface.
- **In-flight order race during engagement** (#4): Document A terminates decision engine
  BEFORE flipping acceptance policy. Orders submitted by Document B's liquidation logic
  just before termination could arrive at OrderManager while policy is still "open."
  Neither doc specifies atomic sequencing across their boundary.
- **Kill switch LIQUIDATE cancel-all negating B's autonomous liquidation** (#5):
  Document A cancels ALL open orders in LIQUIDATE mode. If Document B just submitted
  close-only orders to reduce risk, A's cancel-all undoes B's remediation. B sees
  insufficient reduction on next cycle, re-triggers, gets cancelled again. Deadlock
  between safety mechanisms.
- **Global kill switch vs per-user escalation policy writes** (#10): Document A's "global
  always wins" precedence is defined only for kill switch states, not relative to other
  policy writers. Document B could overwrite reject-all with close-only while global kill
  is engaged.

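One way to close the competing-writers gap and finding #10 is to have every policy writer go through a single arbiter. The sketch below is illustrative only: the `PolicyArbiter` class, the writer names, and the "most restrictive request wins" rule are assumptions, not something either document specifies. Only the policy names (open, close-only, reject-all) come from the findings above.

```python
from enum import IntEnum


class AcceptancePolicy(IntEnum):
    """Higher value = more restrictive (ordering is an assumption)."""
    OPEN = 0
    CLOSE_ONLY = 1
    REJECT_ALL = 2


class PolicyArbiter:
    """Hypothetical single owner of OrderManager's acceptance policy.

    Each writer records its own requested policy; the effective policy is
    the most restrictive outstanding request, so one writer can never
    loosen a restriction held by another.
    """

    def __init__(self):
        self._requests = {}

    def request(self, writer, policy):
        self._requests[writer] = policy
        return self.effective()

    def release(self, writer):
        self._requests.pop(writer, None)
        return self.effective()

    def effective(self):
        return max(self._requests.values(), default=AcceptancePolicy.OPEN)


arbiter = PolicyArbiter()
arbiter.request("kill_switch", AcceptancePolicy.REJECT_ALL)

# The escalation policy writes close-only while the global kill is engaged:
after_escalation_write = arbiter.request("escalation", AcceptancePolicy.CLOSE_ONLY)

# Only when the kill switch releases does the weaker restriction take effect:
after_kill_release = arbiter.release("kill_switch")
```

With direct writes, the second call would have replaced reject-all with close-only. Under this rule the effective policy stays at reject-all until the kill switch itself releases.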
## Claude Opus unique findings (not in either other model):

- **Decision engine termination kills the metric evaluator** (#2 + #7): Document B
  requires ONGOING metric evaluation at restrict/liquidate levels (to determine whether
  to escalate further or de-escalate). Document A terminates Portfolio Risk (the metric
  computation component) on kill switch engagement. Neither document identifies which
  component evaluates risk metrics while the decision engine is dead. This makes Document
  B's entire post-restrict logic dead code if the restrict level is implemented as kill
  switch engagement.
- **Monitor crash resets escalation state with no backstop** (#6): Document B accepts that
  crash = state loss, restart from clear. Combined with Document A never having been
  engaged (escalation hadn't reached that point), a well-timed crash creates a window
  where NO risk controls are active despite ongoing threshold breaches. The full
  re-escalation sequence (14+ cycles) runs with zero protection.
- **Dual manual gates with undefined ordering** (#8): Document A requires 3 manual steps
  (RESTRICT→LIQUIDATE transition, disengage, release users). Document B requires 1 manual
  step (operator confirms recovery from liquidate level). These are independent state
  machines with their own manual gates. Neither defines whether one gate satisfies the
  other or what order they must be performed in.

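The crash window in finding #6 can be illustrated with a toy simulation. This is a sketch under stated assumptions: the level names come from the findings above, but the debounce of 3 consecutive breach cycles per level is hypothetical and does not reproduce the 14+ cycle figure from the actual document.

```python
# Toy model of a debounced escalation ladder with in-memory state only.
LEVELS = ["clear", "alert", "restrict", "liquidate"]
DEBOUNCE = 3  # hypothetical: consecutive breach cycles needed per level


def cycles_until(target, crash_at=None):
    """Count cycles of continuous breach until `target` level is reached.

    If `crash_at` is set, the monitor crashes on that cycle and restarts
    from "clear" with its debounce streak lost.
    """
    level, streak, cycle = 0, 0, 0
    while LEVELS[level] != target:
        cycle += 1
        if cycle == crash_at:
            level, streak = 0, 0  # state loss: restart from clear
            continue
        streak += 1
        if streak == DEBOUNCE:
            level, streak = level + 1, 0
    return cycle


no_crash = cycles_until("restrict")                # 6 cycles
with_crash = cycles_until("restrict", crash_at=6)  # 12 cycles
```

In this toy model, a crash on the cycle that would have reached "restrict" doubles the time spent without an order restriction, even though the threshold breach never stopped.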
## Claude Sonnet findings:

Sonnet produced 7 findings in total. All of them mapped to findings already identified by
GPT-5 or Opus; none was unique, and each was covered at greater depth by one of the other
two models. Its findings were accurate but structurally simpler: 2-3 paragraphs each vs 5-6
for GPT-5 and Opus. The 29-second completion time and 1,306 output tokens reflect this
reduced depth.

## Quality Assessment

- **GPT-5** produced the most operationally actionable findings. Its #3 (mode selection
  mapping gap), #4 (in-flight race), and #5 (cancel-all vs liquidation deadlock) all
  describe specific event sequences that would produce incorrect behavior in implementation.
  GPT-5 also provided concrete recommendations for fixing each gap (compositional policy
  model, command API specification, mode selection rules, atomic sequencing). Every finding
  references specific sections in both documents and describes WHY neither document alone
  can see the problem.

- **Claude Opus** found the most architecturally fundamental gap: if the kill switch
  terminates Portfolio Risk, then Document B's entire escalation logic above "alert"
  becomes dead code. This isn't just a race condition or authority conflict; it's a
  structural contradiction where engaging the safety mechanism kills the component that
  determines whether the safety mechanism should have been engaged. Opus's monitor-crash
  finding (#6) is also unique in identifying an adversarially exploitable window. The dual
  manual gates finding (#8) shows Opus's characteristic attention to recovery-path tensions.

- **Claude Sonnet** was fast but added no unique analytical value for this task. Every
  finding was a simplified version of something GPT-5 or Opus found in greater depth.
  For a 29-second, 1,306-token response, it's competent as a "quick summary of obvious
  interface issues" but wouldn't catch the subtle problems.

## Key Insight — Interface Analysis as a NEW Task Type

This is the first experiment testing **compositional analysis** across two documents that
reference each other. Previous experiments (including #28 cross-document consistency) gave
models multiple documents and asked about consistency. This experiment differs in a critical
way: it asks specifically about **assumptions each document makes about the other's behavior**.

The results suggest this task type favors reasoning models even more strongly than
single-document analysis:

| Task type | Sonnet unique findings | Opus unique findings | GPT-5 unique findings |
|---|---|---|---|
| Hidden assumptions (single doc) | 2-6 | 3-6 | 5-14 |
| Race conditions (single doc) | 0 | 5 | 6 |
| Interface analysis (two docs) | **0** | **3** | **4** |

Sonnet produced no unique findings here, although it produced 2-6 on single-document
hidden-assumption tasks (it also produced none on single-document race conditions). This
suggests that reasoning about interfaces requires holding two mental models simultaneously
and finding contradictions between them, which is a harder cognitive task than analyzing one
document's internal consistency. Extended reasoning (GPT-5's 8,512 tokens) and deep internal
reasoning (Opus) appear necessary for this.

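The "unique findings" counts in the table are a set difference: a finding is unique to a model if no other model reported it. A minimal sketch of that tally follows. The shorthand labels and the three-model toy subset are illustrative, not the full finding lists, and it assumes findings have already been mapped by hand to shared canonical labels.

```python
def unique_findings(by_model):
    """Return, per model, the findings that no other model reported."""
    unique = {}
    for model, found in by_model.items():
        others = set().union(
            *(f for m, f in by_model.items() if m != model)
        )
        unique[model] = found - others
    return unique


# Toy subset with shorthand labels (hypothetical names for real findings).
by_model = {
    "gpt5": {"policy-writers", "term-collision", "mode-mapping", "in-flight-race"},
    "opus": {"policy-writers", "term-collision", "evaluator-killed"},
    "sonnet": {"policy-writers", "term-collision"},
}

counts = {model: len(found) for model, found in unique_findings(by_model).items()}
```

The hard part of the tally is the manual mapping step, since two models rarely describe the same gap in the same words. The set arithmetic itself is trivial once labels are agreed.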
## Comparison to Finding #28 (Cross-Document Consistency)

Finding #28 tested cross-document consistency on different document pairs. That experiment
asked "are these documents consistent?" (verification task). This experiment asks "what does
each assume about the other?" (generative/constructive task). The distinction matters:

- **Consistency checking** (Finding #28): compare stated facts across documents. More
  surface-level: look for contradictions in explicit claims.
- **Interface assumption analysis** (this finding): reason about what each document takes for
  granted about the other's implementation. Requires understanding the *implications* of each
  design, not just the *statements*.

The models that excel are the same (GPT-5 and Opus), but the nature of their findings differs:
GPT-5's interface findings are more operational (specific race conditions, specific event
sequences), while Opus's are more structural (fundamental architectural contradictions,
recovery-path tensions).

## Practical Implications

1. **For architecture reviews of interacting components:** Run GPT-5 + Opus together. GPT-5
   catches operational gaps (races, ordering, command interfaces). Opus catches structural
   contradictions (dead code paths, killed components, recovery-path conflicts).

2. **Sonnet is NOT suitable for interface analysis.** Use it only for single-document tasks
   where it has proven capable (assumption-finding, structural review).

3. **The "both documents together" framing is critical.** Previous experiments showed models
   find plenty of issues in each document alone. The interface analysis prompt forces models
   to reason about the SPACE BETWEEN the documents, which is where the real bugs live in
   multi-component systems.

4. **Recommendations should specify the integration contract.** The most valuable output from
   this type of analysis is not "here's what's wrong" but "here's what the integration
   contract must define": precedence rules, command APIs, event subscriptions, atomic
   sequencing guarantees.

## Next Experiments

- **Three-document interface analysis:** Add `continuous-risk-monitoring.md` as a third document
  (it bridges both). Do models find additional interface gaps that only emerge from the
  three-way interaction?
- **Adversarial ensemble on interface analysis:** Give Opus GPT-5's interface findings and ask
  it to critique + extend (per Finding #35 methodology). Does the ensemble approach produce
  even more interface insights?
- **Implementation-level verification:** Take the top interface findings from this experiment
  and check them against gargoyle's actual code. Are these REAL bugs or are the documents
  already consistent at the implementation level despite the spec-level gaps?