Finding #36: Compositional interface analysis - two-document interface assumptions
New experiment type: give models two related architecture documents and ask them to identify assumptions each document makes about the other that could be violated. Results: GPT-5 (10 findings, 175s, operational/race-focused) and Opus (10 findings, 111s, structural/architectural) each found unique interface gaps. Sonnet (7 findings, 29s) found nothing unique: all of its findings were simplified versions of GPT-5/Opus findings. Key insight: interface analysis requires holding two mental models simultaneously and appears harder than single-document analysis. Sonnet produced 0 unique findings here (vs 2-6 on single-document hidden-assumption tasks). Extended reasoning appears necessary for this task type.
# Finding #36: Compositional Interface Analysis — Models find qualitatively different interface gaps when analyzing two interacting design documents

**Date:** 2026-05-07

**Task:** Identify security-relevant INTERFACE ASSUMPTIONS between gargoyle's `kill-switch.md`
(Document A, 293 lines) and `escalation-policy.md` (Document B, 228 lines): places where one
document assumes behavior the other doesn't guarantee, producing gaps visible only from the
interface between the two designs.

**How we used them:** Both documents were provided in full (521 lines total), with the same
focused analytical question given to all 3 models. The prompt explicitly specified 5 categories
(authority conflicts, state consistency gaps, timing/ordering hazards, recovery path
contradictions, semantic mismatches) and required interface-only findings. GPT-5 ran via the
HAI OpenAI endpoint; Opus 4.6 and Sonnet 4.6 ran via the HAI Anthropic endpoint. No tools, no
project context beyond the two documents.

## Results

| Model | Time | Input tokens | Output tokens | Reasoning tokens | Findings |
|---|---|---|---|---|---|
| GPT-5 | 175s | 5,465 | 11,339 | 8,512 | 10 |
| Claude Opus 4.6 | 111s | 6,209 | 4,496 | (internal) | 10 |
| Claude Sonnet 4.6 | 29s | 6,209 | 1,306 | (internal) | 7 |

## What they found: common ground (all 3 identified)

- **Competing writers to acceptance policy**: both documents assume authority to set
OrderManager's acceptance policy, with no defined precedence or arbitration mechanism.
- **Semantic collision on "restrict" / "liquidate"**: the same terms are used for escalation
LEVELS (Document B, autonomous, reversible) and kill switch MODES (Document A, terminal,
manual recovery), creating implementation ambiguity.
- **Autonomous liquidation vs manual liquidation conflict**: Document B assumes
autonomous surgical liquidation; Document A defines manual total liquidation.
Neither defines which component executes B's autonomous liquidation.
- **Recovery/de-escalation authority conflict**: Document B assumes automatic de-escalation
via cooldown; Document A requires manual disengagement. The two gates are incompatible.
- **Kill switch instant engagement vs debounce timing**: external sources can engage the
kill switch instantly, bypassing the entire escalation ladder and invalidating Document B's
carefully maintained debounce state.
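
One way to make the "restrict" / "liquidate" collision visible at implementation time is to give the two vocabularies separate types. The sketch below is illustrative only: the class names, the `engage_kill_switch` function, and the exact list of escalation levels are assumptions, not taken from either document.

```python
from enum import Enum


class EscalationLevel(Enum):
    """Escalation LEVELS (Document B): autonomous and reversible.

    The member list is an assumption based on the level names used in this write-up.
    """
    CLEAR = "clear"
    ALERT = "alert"
    RESTRICT = "restrict"
    LIQUIDATE = "liquidate"


class KillSwitchMode(Enum):
    """Kill switch MODES (Document A): terminal, manual recovery."""
    RESTRICT = "restrict"
    LIQUIDATE = "liquidate"


def engage_kill_switch(mode: KillSwitchMode) -> str:
    """Hypothetical command entry point that refuses escalation levels."""
    if not isinstance(mode, KillSwitchMode):
        raise TypeError(f"expected KillSwitchMode, got {type(mode).__name__}")
    return f"kill switch engaged: {mode.name}"


# Same spelling, different meaning: members of different Enum classes never compare equal.
same = EscalationLevel.RESTRICT == KillSwitchMode.RESTRICT
print(same)  # False
print(engage_kill_switch(KillSwitchMode.RESTRICT))
```

With separate types, passing `EscalationLevel.RESTRICT` where a kill switch mode is expected fails loudly instead of silently selecting a terminal mode.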

## GPT-5 unique findings (not in either Claude model):

- **Mode selection ambiguity during kill switch escalation** (#3): Document B signals
"escalate to kill switch" without specifying which mode. If the broker is unreachable,
Document A prescribes RESTRICT (no broker interaction); Document B's liquidate-level
context could naively map to LIQUIDATE, causing cancel attempts in a scenario where
Document A says "don't talk to the broker." This is a mapping gap at the command
interface.
- **In-flight order race during engagement** (#4): Document A terminates the decision engine
BEFORE flipping the acceptance policy. Orders submitted by Document B's liquidation logic
just before termination could arrive at OrderManager while the policy is still "open."
Neither doc specifies atomic sequencing across their boundary.
- **Kill switch LIQUIDATE cancel-all negating B's autonomous liquidation** (#5):
Document A cancels ALL open orders in LIQUIDATE mode. If Document B has just submitted
close-only orders to reduce risk, A's cancel-all undoes B's remediation. B sees
insufficient reduction on the next cycle, re-triggers, and gets cancelled again. The two
safety mechanisms end up in a cancel/re-trigger loop (strictly a livelock rather than a
deadlock).
- **Global kill switch vs per-user escalation policy writes** (#10): Document A's "global
always wins" precedence is defined only for kill switch states, not relative to other
policy writers. Document B could overwrite reject-all with close-only while the global kill
switch is engaged.
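
The policy-overwrite gap in #10 (and the competing-writers finding above) can be illustrated with a small arbiter in which the most restrictive request always wins. This is a minimal sketch under stated assumptions: `PolicyArbiter`, the writer names, and the ordering of policies are hypothetical and are not defined by either document.

```python
from enum import IntEnum


class AcceptancePolicy(IntEnum):
    """Policy names from this write-up, ordered least to most restrictive (assumed order)."""
    OPEN = 0
    CLOSE_ONLY = 1
    REJECT_ALL = 2


class PolicyArbiter:
    """Hypothetical arbiter: remembers each writer's request, applies the strictest."""

    def __init__(self) -> None:
        self._requests: dict[str, AcceptancePolicy] = {}

    def request(self, writer: str, policy: AcceptancePolicy) -> AcceptancePolicy:
        """Record a writer's request and return the resulting effective policy."""
        self._requests[writer] = policy
        return self.effective()

    def release(self, writer: str) -> AcceptancePolicy:
        """Drop a writer's request (e.g. on disengage) and return the new effective policy."""
        self._requests.pop(writer, None)
        return self.effective()

    def effective(self) -> AcceptancePolicy:
        """Strictest outstanding request; OPEN when nobody is requesting anything."""
        return max(self._requests.values(), default=AcceptancePolicy.OPEN)


arbiter = PolicyArbiter()
arbiter.request("kill_switch_global", AcceptancePolicy.REJECT_ALL)
# The escalation policy's later, weaker write no longer overwrites reject-all.
effective = arbiter.request("escalation_policy", AcceptancePolicy.CLOSE_ONLY)
print(effective.name)  # REJECT_ALL
```

"Strictest wins" is only one possible precedence rule; the point is that some explicit rule has to exist at the boundary, because neither document defines one.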

## Claude Opus unique findings (not in GPT-5 or Sonnet):

- **Decision engine termination kills the metric evaluator** (#2 + #7): Document B
requires ONGOING metric evaluation at restrict/liquidate levels (to determine whether
to escalate further or de-escalate). Document A terminates Portfolio Risk (the metric
computation component) on kill switch engagement. Neither document identifies which
component evaluates risk metrics while the decision engine is dead. This makes Document B's
entire post-restrict logic dead code if reaching restrict means engaging the kill switch.
- **Monitor crash resets escalation state with no backstop** (#6): Document B accepts that
a crash means state loss and a restart from clear. If Document A's kill switch has never
been engaged (because escalation had not reached that point), a well-timed crash creates a
window where NO risk controls are active despite ongoing threshold breaches. The full
re-escalation sequence (14+ cycles) runs with zero protection.
- **Dual manual gates with undefined ordering** (#8): Document A requires 3 manual steps
(RESTRICT→LIQUIDATE transition, disengage, release users). Document B requires 1 manual
step (operator confirms recovery from liquidate level). These are independent state
machines with their own manual gates. Neither defines whether one gate satisfies the
other or in what order they must be performed.

## Claude Sonnet findings:

Sonnet produced 7 findings. All of them mapped to findings already identified by GPT-5 or
Opus, which covered the same ground in more depth, so none were unique. Its findings were
accurate but structurally simpler: 2-3 paragraphs each vs 5-6 for GPT-5 and Opus. The
29-second completion time and 1,306 output tokens reflect this reduced depth.
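
The "unique findings" counts used throughout this write-up amount to a set difference per model. The sketch below assumes the findings have already been mapped by hand to shared labels; the labels are a small illustrative subset drawn from the sections above, not the full mapping of all 27 findings, and `unique_findings` is a hypothetical helper.

```python
def unique_findings(findings_by_model: dict[str, set[str]]) -> dict[str, set[str]]:
    """Return, per model, the finding labels that no other model reported."""
    unique = {}
    for model, found in findings_by_model.items():
        # Union of everything reported by the other models.
        others = set().union(
            *(f for m, f in findings_by_model.items() if m != model)
        )
        unique[model] = found - others
    return unique


# Illustrative subset only: two common-ground findings plus two unique ones per model.
findings_by_model = {
    "gpt-5": {
        "competing-policy-writers", "restrict-liquidate-collision",
        "mode-selection-ambiguity", "in-flight-order-race",
    },
    "opus": {
        "competing-policy-writers", "restrict-liquidate-collision",
        "metric-evaluator-terminated", "dual-manual-gates",
    },
    "sonnet": {"competing-policy-writers", "restrict-liquidate-collision"},
}

unique = unique_findings(findings_by_model)
for model in findings_by_model:
    print(model, sorted(unique[model]))
```

The set arithmetic is the easy part; deciding that two differently worded findings describe the same gap is the manual judgment this sketch takes as given.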

## Quality Assessment

- **GPT-5** produced the most operationally actionable findings. Its #3 (mode selection
mapping gap), #4 (in-flight race), and #5 (cancel-all vs liquidation loop) all
describe specific event sequences that would produce incorrect behavior in implementation.
GPT-5 also provided concrete recommendations for fixing each gap (compositional policy
model, command API specification, mode selection rules, atomic sequencing). Every finding
references specific sections in both documents and describes WHY neither document alone
can see the problem.

- **Claude Opus** found the most architecturally fundamental gap: if the kill switch
terminates Portfolio Risk, then Document B's entire escalation logic above "alert"
becomes dead code. This isn't just a race condition or authority conflict. It is a
structural contradiction in which engaging the safety mechanism kills the component that
decides whether to escalate further or de-escalate. Opus's monitor-crash
finding (#6) is also unique in identifying an adversarially exploitable window. The dual
manual gates finding (#8) shows Opus's characteristic attention to recovery-path tensions.

- **Claude Sonnet** was fast but added no unique analytical value for this task. Every
finding was a simplified version of something GPT-5 or Opus found in greater depth.
For a 29-second, 1,306-token response, it's competent as a "quick summary of obvious
interface issues" but wouldn't catch the subtle problems.

## Key Insight: Interface Analysis as a NEW Task Type

This is the first experiment testing **compositional analysis** across two documents that
reference each other. Previous experiments (including #28 cross-document consistency) gave
models multiple documents and asked about consistency. This experiment differs in a critical
way: it asks specifically about **assumptions each document makes about the other's behavior**.

The results suggest this task type favors reasoning models more strongly than
single-document hidden-assumption analysis, and about as strongly as single-document
race-condition analysis:

| Task type | Sonnet unique findings | Opus unique findings | GPT-5 unique findings |
|---|---|---|---|
| Hidden assumptions (single doc) | 2-6 | 3-6 | 5-14 |
| Race conditions (single doc) | 0 | 5 | 6 |
| Interface analysis (two docs) | **0** | **3** | **4** |

Sonnet's inability to produce ANY unique findings here, when it produced 2-6 on
single-document hidden-assumption tasks (though also 0 on single-document race conditions),
suggests that reasoning about interfaces requires holding two mental models simultaneously
and finding contradictions between them. This is plausibly a harder cognitive task than
analyzing one document's internal consistency. Extended reasoning (GPT-5 used 8,512
reasoning tokens) and deep internal reasoning (Opus) appear necessary for this, at least on
the evidence of this one experiment.

## Comparison to Finding #28 (Cross-Document Consistency)

Finding #28 tested cross-document consistency on different document pairs. That experiment
asked "are these documents consistent?" (verification task). This experiment asks "what does
each assume about the other?" (generative/constructive task). The distinction matters:

- **Consistency checking** (Finding #28): compare stated facts across documents. This is
more surface-level: look for contradictions in explicit claims.
- **Interface assumption analysis** (this finding): reason about what each document takes for
granted about the other's implementation. Requires understanding the *implications* of each
design, not just the *statements*.

The models that excel are the same (GPT-5 and Opus), but the nature of their findings differs:
GPT-5's interface findings are more operational (specific race conditions, specific event
sequences), while Opus's are more structural (fundamental architectural contradictions,
recovery-path tensions).

## Practical Implications

1. **For architecture reviews of interacting components:** Run GPT-5 + Opus together. GPT-5
catches operational gaps (races, ordering, command interfaces). Opus catches structural
contradictions (dead code paths, killed components, recovery-path conflicts).

2. **Sonnet is NOT suitable for interface analysis** on the evidence of this experiment. Use
it only for single-document tasks where it has proven capable (assumption-finding,
structural review).

3. **The "both documents together" framing is critical.** Previous experiments showed models
find plenty of issues in each document alone. The interface analysis prompt forces models
to reason about the SPACE BETWEEN the documents, which is where the real bugs live in
multi-component systems.

4. **Recommendations should specify the integration contract.** The most valuable output from
this type of analysis is not "here's what's wrong" but "here's what the integration
contract must define": precedence rules, command APIs, event subscriptions, atomic
sequencing guarantees.
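
As an illustration of what an "atomic sequencing guarantee" in such a contract might look like, the sketch below closes the order gate before stopping the producer, so a late order from the in-flight race (#4) is rejected rather than accepted. This is one possible ordering under stated assumptions, not the fix either document or GPT-5 prescribes; this `OrderManager` class, `engage_kill_switch`, and the `stop_engine` callback are hypothetical stand-ins.

```python
import threading


class OrderManager:
    """Toy order gate: policy checks and policy changes share one lock."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._policy = "open"
        self.accepted: list[str] = []
        self.rejected: list[str] = []

    def set_policy(self, policy: str) -> None:
        with self._lock:
            self._policy = policy

    def submit(self, order: str) -> bool:
        """Accept the order only if the policy is still open at check time."""
        with self._lock:
            if self._policy == "open":
                self.accepted.append(order)
                return True
            self.rejected.append(order)
            return False


def engage_kill_switch(order_manager: OrderManager, stop_engine) -> None:
    """Assumed contract: flip the acceptance policy first, then stop the engine."""
    order_manager.set_policy("reject-all")
    stop_engine()


om = OrderManager()
engine_stopped = []
early = om.submit("close-position-1")             # arrives before engagement
engage_kill_switch(om, lambda: engine_stopped.append(True))
late = om.submit("close-position-2")              # in-flight order arriving afterwards
print(early, late)  # True False
```

Orders accepted before the flip still need handling (Document A's cancel-all in LIQUIDATE mode covers them), so the contract would also have to say who owns those.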

## Next Experiments

- **Three-document interface analysis:** Add `continuous-risk-monitoring.md` as a third document
(it bridges both). Do models find additional interface gaps that only emerge from the
three-way interaction?
- **Adversarial ensemble on interface analysis:** Give Opus GPT-5's interface findings and ask
it to critique + extend (per Finding #35 methodology). Does the ensemble approach produce
even more interface insights?
- **Implementation-level verification:** Take the top interface findings from this experiment
and check them against gargoyle's actual code. Are these REAL bugs, or are the documents
already consistent at the implementation level despite the spec-level gaps?