model-research/findings/2026-05-07-35-adversarial-ensemble-critique-extend.md

# Finding #35: Adversarial Ensemble (Critique + Extend) Produces Higher Total Coverage Without Redundancy

**Date:** 2026-05-07
**Task:** Identify hidden assumptions in gargoyle's `dtbp-margin-call.md` (363 lines) —
a document specifying day-trading buying power mode selection, local DTBP tracking, and
the margin call state machine in a GenServer.
**Experiment:** Test the "adversarial ensemble" approach from the open questions: does
giving Opus access to GPT-5's findings and asking it to critique and extend them produce
more than either model alone?

## Method

Three runs, same document, same analytical lens ("hidden assumptions"):

| Run | Model | Input | Role |
|---|---|---|---|
| A: GPT-5 independent | GPT-5 | Document only | Find all hidden assumptions |
| B: Opus independent | Claude Opus 4.6 | Document only | Find all hidden assumptions (baseline) |
| C: Opus ensemble | Claude Opus 4.6 | Document + GPT-5's findings | Critique GPT-5's findings, then extend with new ones |

## Results

| Run | Time | Input tokens | Output tokens | Reasoning tokens | Findings |
|---|---|---|---|---|---|
| A: GPT-5 independent | ~104s | 3,979 | 9,431 | 5,440 | 43 |
| B: Opus independent | ~100s | 4,833 | 4,994 | (internal) | 28 |
| C: Opus ensemble | ~120s | 9,569 | 6,819 | (internal) | 43 critiques + 13 new |

## Ensemble Critique Breakdown

Of GPT-5's 43 findings, Opus assessed:
- **31 AGREE** (72%) — correct and well-reasoned
- **12 PARTIALLY AGREE** (28%) — real issue but overstated, understated, or imprecise
- **0 DISAGREE** (0%) — none rejected entirely

Zero full disagreements is striking. Opus never said "this isn't actually an issue."
The partial agreements were consistently about severity calibration or scope assumptions:
- 5 cases of severity downgrade (GPT-5 overstated risk for this Alpaca-specific system)
- 4 cases of framing refinement (correct concern, wrong root cause identified)
- 3 cases of scope limitation (valid if system expands, not currently relevant)
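
The breakdown above can be reproduced from per-finding records. The following is a minimal sketch, assuming a hypothetical `Critique` record; the field names, label strings, and placeholder finding IDs are illustrative and are not the format Opus actually emitted. Only the counts come from this section.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Critique:
    """One ensemble verdict on a GPT-5 finding (hypothetical schema)."""
    finding_id: int
    verdict: str                      # "agree" | "partial" | "disagree"
    refinement: Optional[str] = None  # set only for partial agreements


def build_critiques() -> list[Critique]:
    # Counts are taken from the breakdown above; finding IDs are placeholders.
    partial_kinds = ["severity_downgrade"] * 5 + ["framing"] * 4 + ["scope"] * 3
    critiques = [Critique(i, "agree") for i in range(1, 32)]  # 31 agreements
    critiques += [
        Critique(31 + i, "partial", kind)
        for i, kind in enumerate(partial_kinds, start=1)
    ]
    return critiques


def verdict_shares(critiques: list[Critique]) -> dict[str, int]:
    """Return each verdict's share of all critiques as a rounded percentage."""
    counts = Counter(c.verdict for c in critiques)
    total = len(critiques)
    return {
        verdict: round(100 * counts.get(verdict, 0) / total)
        for verdict in ("agree", "partial", "disagree")
    }


critiques = build_critiques()
shares = verdict_shares(critiques)
refinements = Counter(c.refinement for c in critiques if c.refinement)
print(shares)             # {'agree': 72, 'partial': 28, 'disagree': 0}
print(dict(refinements))
```

Keeping the refinement kind on each record, rather than only the verdict, is what lets the partial agreements be split into severity, framing, and scope cases afterwards.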

## Ensemble Extensions (13 New Findings)

Opus found 13 findings GPT-5 missed entirely:

| # | Finding | Severity | Category |
|---|---|---|---|
| 1 | Race between mode transition and order acceptance (cross-process TOCTOU) | High | Timing |
| 2 | dtbp_used_today reset on broker re-query creates accounting discontinuities (double-count risk) | High | Data model |
| 3 | No handling of partial day-trade completion | Medium | Financial |
| 4 | `:met` transient state has no timeout/crash recovery | Low | State machine |
| 5 | No mechanism for intra-day deposits increasing DTBP | Medium | Financial |
| 6 | "Expected 4×" threshold for call detection is undefined | High | Detection logic |
| 7 | No coordination between DTBP and order cancellation/expiration events | Medium | Coupling |
| 8 | T+1 settlement cycle impact on overnight buying power | Medium | Financial |
| 9 | No backpressure or circuit-breaking on broker API queries | High | Operational |
| 10 | Assumes single market open/close per day (no halt modeling) | Medium | Timing |
| 11 | Broker forced liquidations treated as normal day-trade sells | Medium | Financial |
| 12 | Persistence model has no audit trail | High | Compliance |
| 13 | No concept of "day-trade call amount" (only binary state) | Medium | Financial |

## Comparison: Ensemble vs Independent

### Total unique findings:
- **GPT-5 independent:** 43
- **Opus independent:** 28
- **Opus ensemble (GPT-5's 43 + 13 new):** 56 total unique findings

### Overlap analysis:
Comparing Opus independent (28) against GPT-5's findings (43):
- ~18 of Opus's 28 independent findings overlap with GPT-5's (same core concern)
- ~10 of Opus's independent findings are unique to Opus
- Of those 10 Opus-unique independent findings, ~7 appear in the ensemble's 13 extensions
- The ensemble found ~6 additional findings that Opus ALSO missed independently
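
The overlap arithmetic above can be checked with plain counts. This sketch uses the approximate figures reported in this section; the variable names are illustrative, and `not_recovered` is derived here from those approximations rather than measured separately.

```python
# Approximate counts reported in the overlap analysis above.
gpt5_findings = 43
opus_independent = 28
opus_overlap_with_gpt5 = 18      # same core concern as a GPT-5 finding
ensemble_extensions = 13
opus_unique_in_extensions = 7    # Opus-only findings that reappear as extensions

# Findings only Opus surfaced when working independently.
opus_unique = opus_independent - opus_overlap_with_gpt5

# Extensions that neither independent run surfaced.
new_to_both = ensemble_extensions - opus_unique_in_extensions

# Opus-only independent findings the ensemble did not reproduce (derived).
not_recovered = opus_unique - opus_unique_in_extensions

# De-duplicated totals for the two strategies.
independent_union = gpt5_findings + opus_unique
ensemble_total = gpt5_findings + ensemble_extensions

print(opus_unique, new_to_both, not_recovered)  # 10 6 3
print(independent_union, ensemble_total)        # 53 56
```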

### The key result:
The ensemble produced **56 total unique findings** vs GPT-5's 43 alone (30% increase)
or Opus's 28 alone (100% increase). More importantly:
- The 13 new findings are genuinely novel relative to GPT-5's list (not reframings of its findings)
- The critique refined the severity or framing of 12 findings without dropping any of them
- Ensemble Opus surfaced ~6 findings that independent Opus missed

## Why the Ensemble Works

### 1. Reduced search space
Independent Opus must find ALL assumptions from scratch. Ensemble Opus knows what's
already covered and can focus its reasoning on gaps — areas GPT-5's analytical style
tends to miss.

### 2. Complementary blind spots become visible
GPT-5's findings reveal its analytical frame (operational, implementation-level,
per-component). Seeing this frame explicitly helps Opus identify what's OUTSIDE it:
cross-component timing issues, accounting discontinuities, compliance gaps.

### 3. The critique phase calibrates without discarding
Zero disagreements suggests GPT-5's findings are reliable signal rather than noise. The
12 partial agreements ADD information (severity calibration, scope limitations) rather
than removing findings, so the ensemble output carries everything in GPT-5's list plus
Opus's assessment of each item.

### 4. Opus's strengths amplified in the extension phase
Opus independently found 28 assumptions in ~100s. Given GPT-5's 43 as a starting
point, Opus found 13 more in ~120s. The marginal cost of the 13 extensions was minimal
(~20s extra due to larger input), and they include 4 High-severity findings that
neither model found independently.

## Cost Analysis

| Approach | Total tokens (in+out) | Findings | Tokens per finding |
|---|---|---|---|
| GPT-5 alone | 13,410 | 43 | 312 |
| Opus alone | 9,827 | 28 | 351 |
| Ensemble (GPT-5 + Opus critique) | 13,410 + 16,388 = 29,798 | 56 | 532 |
| Both independently (GPT-5 + Opus) | 13,410 + 9,827 = 23,237 | ~53 (with ~18 overlap) | 438 |

The ensemble costs ~28% more tokens than running both independently but produces:
- 3 more unique findings (56 vs ~53 de-duplicated)
- Severity calibration on all 43 of GPT-5's findings
- Explicit identification of which findings are Alpaca-specific vs general
- No duplicate findings to merge by hand (overlap is resolved in the critique step)
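
The figures in the cost table can be recomputed from the per-run token counts in the Results section. This is a small sketch; the `Run` class and helper name are illustrative. Like the table, it counts tokens only and says nothing about per-token pricing, which differs between models.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    name: str
    input_tokens: int
    output_tokens: int

    @property
    def total_tokens(self) -> int:
        return self.input_tokens + self.output_tokens


# Token counts copied from the Results table.
gpt5 = Run("GPT-5 independent", 3_979, 9_431)
opus = Run("Opus independent", 4_833, 4_994)
ensemble_opus = Run("Opus ensemble", 9_569, 6_819)


def tokens_per_finding(total_tokens: int, findings: int) -> int:
    """Tokens spent per finding, rounded to the nearest whole token."""
    return round(total_tokens / findings)


# The ensemble needs the GPT-5 run as input, so its cost includes both runs.
ensemble_tokens = gpt5.total_tokens + ensemble_opus.total_tokens
independent_tokens = gpt5.total_tokens + opus.total_tokens
overhead_pct = round(100 * (ensemble_tokens / independent_tokens - 1))

print(tokens_per_finding(gpt5.total_tokens, 43))   # 312
print(tokens_per_finding(opus.total_tokens, 28))   # 351
print(tokens_per_finding(ensemble_tokens, 56))     # 532
print(tokens_per_finding(independent_tokens, 53))  # 438
print(overhead_pct)                                # 28
```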

## Key Insight: The Ensemble's Value Isn't Just "More Findings"

The most valuable output from the ensemble isn't the 13 new findings — it's the
**structured critique** of GPT-5's 43 findings. In production use, an architecture
team receiving 43 findings needs to know:
- Which are genuinely critical vs overstated?
- Which apply to their specific broker vs being general concerns?
- Which are design decisions (acknowledged tradeoffs) vs hidden assumptions?

The ensemble provides this triage automatically. A team receiving the ensemble output
gets an actionable, prioritized list rather than a raw dump of concerns. This is
qualitatively different from receiving two independent lists and having to merge them.

## Practical Implications

### When to use the adversarial ensemble:
- Architecture documents heading into implementation (worth the extra tokens)
- Documents where severity calibration matters (financial, safety-critical)
- When the team needs actionable output, not just a concern list

### When independent runs suffice:
- Exploratory analysis (finding IS the goal, not prioritizing)
- Cost-sensitive scenarios (the ensemble uses ~28% more tokens)
- Documents where overlap is minimal (highly specialized vs general)

### Optimal workflow:
1. Run GPT-5 first (broadest coverage, most operational concerns)
2. Feed GPT-5's output to Opus for critique + extension
3. Use Opus's output as the final deliverable (calibrated + extended)

For documents that warrant the extra tokens, this is preferable to running both
independently and merging by hand, because the ensemble avoids duplicate findings and
produces a structured assessment of each finding's validity.
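
The three workflow steps could be wired together roughly as follows. This is a sketch, not the harness used for this experiment: `ModelFn`, `EnsembleResult`, `run_ensemble`, and the prompt wording are all hypothetical, and the model calls are injected as plain callables so that no particular vendor SDK is assumed. The stub models at the bottom exist only to make the example runnable.

```python
from dataclasses import dataclass
from typing import Callable

# A model call is any function from prompt text to response text (assumption:
# a real harness would wrap each vendor's client behind something like this).
ModelFn = Callable[[str], str]


@dataclass(frozen=True)
class EnsembleResult:
    first_pass: str  # raw findings from the first model
    final: str       # critique of each finding plus new findings


def run_ensemble(document: str, finder: ModelFn, critic: ModelFn,
                 lens: str = "hidden assumptions") -> EnsembleResult:
    # Step 1: the first model sees only the document.
    first_pass = finder(
        f"Find all {lens} in the following document.\n\n{document}"
    )
    # Step 2: the second model sees the document plus the first model's findings.
    final = critic(
        f"Below is a document and another reviewer's list of {lens}.\n"
        "For each finding, answer AGREE, PARTIALLY AGREE, or DISAGREE with a reason.\n"
        "Then list any findings the reviewer missed.\n\n"
        f"## Document\n{document}\n\n## Reviewer findings\n{first_pass}"
    )
    # Step 3: the critic's output is the deliverable.
    return EnsembleResult(first_pass=first_pass, final=final)


# Stub models standing in for real API calls.
seen_prompts: list[str] = []


def stub_finder(prompt: str) -> str:
    seen_prompts.append(prompt)
    return "1. Assumes the broker-reported DTBP is always current."


def stub_critic(prompt: str) -> str:
    seen_prompts.append(prompt)
    return "1. AGREE. New finding: no timeout on the :met state."


result = run_ensemble("# DTBP spec (placeholder)", stub_finder, stub_critic)
print(result.final)
```

Passing the models in as callables keeps the ordering question (which model goes first) a one-line change at the call site.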

## Updated Open Questions

- **Does the ensemble benefit diminish with simpler documents?** This 363-line
  financial doc has many implicit assumptions. Would a simpler, less domain-specific
  doc show the same 30% improvement?
- **Would GPT-5 as the critic (Opus first, GPT-5 critiques) work equally well?**
  Given GPT-5's tendency toward exhaustiveness, it might add more extensions, but
  Opus's precision in severity calibration might be lost.
- **Is there a three-model ensemble worth testing?** (GPT-5 → Opus critique → Sonnet
  for accessibility/communication of findings to non-experts)