Tests GPT-5 → Opus critique+extend pipeline on dtbp-margin-call.md. Key results:

- Ensemble produces 56 unique findings vs 43 (GPT-5) or 28 (Opus) alone
- Zero full disagreements: GPT-5's coverage is reliable signal
- Critique phase (severity calibration) more valuable than extension phase
- 28% more tokens for 30% more coverage + structured prioritization
- Answers open question about adversarial ensemble value

166 lines
8.1 KiB
Markdown
# Finding #35: Adversarial Ensemble (Critique + Extend) Produces Higher Total Coverage Without Redundancy

**Date:** 2026-05-07
**Task:** Identify hidden assumptions in gargoyle's `dtbp-margin-call.md` (363 lines),
a document specifying day-trading buying power mode selection, local DTBP tracking, and
the margin call state machine in a GenServer.
**Experiment:** Test the "adversarial ensemble" approach from open questions: Does giving
Opus access to GPT-5's findings and asking it to critique + extend produce more than
either model alone?

## Method

Three runs, same document, same analytical lens ("hidden assumptions"):

| Run | Model | Input | Role |
|---|---|---|---|
| A: GPT-5 independent | GPT-5 | Document only | Find all hidden assumptions |
| B: Opus independent | Claude Opus 4.6 | Document only | Find all hidden assumptions (baseline) |
| C: Opus ensemble | Claude Opus 4.6 | Document + GPT-5's findings | Critique GPT-5's findings, then extend with new ones |

## Results

| Run | Time | Input tokens | Output tokens | Reasoning tokens | Findings |
|---|---|---|---|---|---|
| A: GPT-5 independent | ~104s | 3,979 | 9,431 | 5,440 | 43 |
| B: Opus independent | ~100s | 4,833 | 4,994 | (internal) | 28 |
| C: Opus ensemble | ~120s | 9,569 | 6,819 | (internal) | 43 critiques + 13 new |

## Ensemble Critique Breakdown

Of GPT-5's 43 findings, Opus assessed:
- **31 AGREE** (72%): correct and well-reasoned
- **12 PARTIALLY AGREE** (28%): real issue but overstated, understated, or imprecise
- **0 DISAGREE** (0%): none rejected entirely

Zero full disagreements is striking. Opus never said "this isn't actually an issue."
The partial agreements were consistently about severity calibration or scope assumptions:
- 5 cases of severity downgrade (GPT-5 overstated risk for this Alpaca-specific system)
- 4 cases of framing refinement (correct concern, wrong root cause identified)
- 3 cases of scope limitation (valid if system expands, not currently relevant)

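
The breakdown above can be re-derived with a few lines of Python. This is only a sanity check on the stated counts; the dictionary keys are labels chosen here for illustration, not an output format produced by either model.

```python
from collections import Counter

# Verdict counts copied from the breakdown above.
verdicts = Counter({"AGREE": 31, "PARTIALLY AGREE": 12, "DISAGREE": 0})

# Reasons given for the partial agreements, also copied from the text above.
partial_reasons = {
    "severity downgrade": 5,
    "framing refinement": 4,
    "scope limitation": 3,
}

total = sum(verdicts.values())
shares = {name: round(100 * count / total) for name, count in verdicts.items()}

print(total)   # 43
print(shares)  # {'AGREE': 72, 'PARTIALLY AGREE': 28, 'DISAGREE': 0}

# Every partial agreement should be explained by exactly one reason.
reasons_cover_partials = sum(partial_reasons.values()) == verdicts["PARTIALLY AGREE"]
print(reasons_cover_partials)  # True
```
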
## Ensemble Extensions (13 New Findings)

Opus found 13 findings GPT-5 missed entirely:

| # | Finding | Severity | Category |
|---|---|---|---|
| 1 | Race between mode transition and order acceptance (cross-process TOCTOU) | High | Timing |
| 2 | `dtbp_used_today` reset on broker re-query creates accounting discontinuities (double-count risk) | High | Data model |
| 3 | No handling of partial day-trade completion | Medium | Financial |
| 4 | `:met` transient state has no timeout/crash recovery | Low | State machine |
| 5 | No mechanism for intra-day deposits increasing DTBP | Medium | Financial |
| 6 | "Expected 4×" threshold for call detection is undefined | High | Detection logic |
| 7 | No coordination between DTBP and order cancellation/expiration events | Medium | Coupling |
| 8 | T+1 settlement cycle impact on overnight buying power | Medium | Financial |
| 9 | No backpressure or circuit-breaking on broker API queries | High | Operational |
| 10 | Assumes single market open/close per day (no halt modeling) | Medium | Timing |
| 11 | Broker forced liquidations treated as normal day-trade sells | Medium | Financial |
| 12 | Persistence model has no audit trail | High | Compliance |
| 13 | No concept of "day-trade call amount" (only binary state) | Medium | Financial |

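
A quick tally of the table above by severity and by category, as a minimal Python sketch. The `(severity, category)` pairs are transcribed from rows 1 to 13; the tuple layout is an illustrative choice, not a format used in the experiment.

```python
from collections import Counter

# (severity, category) for extension findings 1-13, in table order.
extensions = [
    ("High", "Timing"),
    ("High", "Data model"),
    ("Medium", "Financial"),
    ("Low", "State machine"),
    ("Medium", "Financial"),
    ("High", "Detection logic"),
    ("Medium", "Coupling"),
    ("Medium", "Financial"),
    ("High", "Operational"),
    ("Medium", "Timing"),
    ("Medium", "Financial"),
    ("High", "Compliance"),
    ("Medium", "Financial"),
]

by_severity = Counter(severity for severity, _ in extensions)
by_category = Counter(category for _, category in extensions)

print(by_severity)                 # Counter({'Medium': 7, 'High': 5, 'Low': 1})
print(by_category.most_common(2))  # [('Financial', 5), ('Timing', 2)]
```
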
## Comparison: Ensemble vs Independent

### Total unique findings:
- **GPT-5 independent:** 43
- **Opus independent:** 28
- **Opus ensemble (GPT-5's 43 + 13 new):** 56 total unique findings

### Overlap analysis:
Comparing Opus independent (28) against GPT-5's findings (43):
- ~18 of Opus's 28 independent findings overlap with GPT-5's (same core concern)
- ~10 of Opus's independent findings are unique to Opus
- Of those 10 Opus-unique independent findings, ~7 appear in the ensemble's 13 extensions; the other ~3 do not reappear in the ensemble output
- The remaining ~6 of the 13 extensions are findings that Opus ALSO missed independently

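
The overlap arithmetic can be written out explicitly. The sketch below uses only the approximate counts stated in this section; the variable names are illustrative, and the last quantity is derived here by subtraction.

```python
# Approximate counts taken from the overlap analysis above.
gpt5_findings = 43
opus_findings = 28
overlap = 18                 # Opus independent findings that match a GPT-5 finding
extensions = 13              # new findings from the ensemble run
opus_unique_in_ext = 7       # Opus-unique independent findings that reappear as extensions

opus_unique = opus_findings - overlap               # unique to independent Opus
independent_union = gpt5_findings + opus_unique     # de-duplicated union of runs A and B
ensemble_total = gpt5_findings + extensions         # GPT-5's list plus the extensions
new_to_both = extensions - opus_unique_in_ext       # extensions neither independent run found
# Derived by subtraction: Opus-unique findings absent from the ensemble output.
opus_unique_not_in_ensemble = opus_unique - opus_unique_in_ext

print(opus_unique, independent_union, ensemble_total)  # 10 53 56
print(new_to_both, opus_unique_not_in_ensemble)        # 6 3
```
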
### The key result:
The ensemble produced **56 total unique findings** vs GPT-5's 43 alone (30% increase)
or Opus's 28 alone (100% increase). More importantly:
- The 13 new findings are genuinely novel relative to GPT-5's 43 (not reframings)
- The critique refined 12 findings' severity/framing without losing information
- The ensemble Opus found ~6 findings its independent counterpart missed

## Why the Ensemble Works

### 1. Reduced search space
Independent Opus must find ALL assumptions from scratch. Ensemble Opus knows what's
already covered and can focus its reasoning on gaps: areas GPT-5's analytical style
tends to miss.

### 2. Complementary blind spots become visible
GPT-5's findings reveal its analytical frame (operational, implementation-level,
per-component). Seeing this frame explicitly helps Opus identify what's OUTSIDE it:
cross-component timing issues, accounting discontinuities, compliance gaps.

### 3. The critique phase calibrates without discarding
Zero disagreements suggests GPT-5's findings are reliable signal (not noise), at least as
judged by Opus on this document. The 12 partial agreements ADD information (severity
calibration, scope limitations) rather than removing findings. The ensemble output is
strictly more informative than either input alone.

### 4. Opus's strengths amplified in the extension phase
Opus independently found 28 assumptions in ~100s. Given GPT-5's 43 as a starting
point, Opus found 13 more in ~120s. The marginal cost of the 13 extensions was minimal
(~20s extra due to larger input), and they include 4 High-severity findings that
neither model found independently.

## Cost Analysis

| Approach | Total tokens (in+out) | Findings | Tokens per finding |
|---|---|---|---|
| GPT-5 alone | 13,410 | 43 | 312 |
| Opus alone | 9,827 | 28 | 351 |
| Ensemble (GPT-5 + Opus critique) | 13,410 + 16,388 = 29,798 | 56 | 532 |
| Both independently (GPT-5 + Opus) | 13,410 + 9,827 = 23,237 | ~53 (with ~18 overlap) | 438 |

The ensemble costs ~28% more tokens than running both independently but produces:
- 3 more unique findings (56 vs ~53 de-duplicated)
- Severity calibration on all 43 of GPT-5's findings
- Explicit identification of which findings are Alpaca-specific vs general
- Zero wasted effort on overlapping findings

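
The cost figures above follow directly from the Results table. A minimal Python sketch that recomputes them is given here; the dictionary keys are labels invented for this example, and the finding counts (including the approximate 53) are the ones stated in the table.

```python
# Input and output token counts copied from the Results table.
runs = {
    "gpt5_independent": (3_979, 9_431),
    "opus_independent": (4_833, 4_994),
    "opus_ensemble": (9_569, 6_819),
}
totals = {name: tokens_in + tokens_out for name, (tokens_in, tokens_out) in runs.items()}

# The ensemble needs the GPT-5 run plus the Opus critique run.
ensemble_tokens = totals["gpt5_independent"] + totals["opus_ensemble"]
independent_tokens = totals["gpt5_independent"] + totals["opus_independent"]

# (total tokens, findings) per approach, with finding counts as stated above.
approaches = {
    "GPT-5 alone": (totals["gpt5_independent"], 43),
    "Opus alone": (totals["opus_independent"], 28),
    "Ensemble": (ensemble_tokens, 56),
    "Both independently": (independent_tokens, 53),
}
tokens_per_finding = {
    name: round(tokens / findings) for name, (tokens, findings) in approaches.items()
}
overhead_pct = round(100 * (ensemble_tokens / independent_tokens - 1))

print(tokens_per_finding)
# {'GPT-5 alone': 312, 'Opus alone': 351, 'Ensemble': 532, 'Both independently': 438}
print(overhead_pct)  # 28
```
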
## Key Insight: The Ensemble's Value Isn't Just "More Findings"

The most valuable output from the ensemble isn't the 13 new findings; it's the
**structured critique** of GPT-5's 43 findings. In production use, an architecture
team receiving 43 findings needs to know:
- Which are genuinely critical vs overstated?
- Which apply to their specific broker vs being general concerns?
- Which are design decisions (acknowledged tradeoffs) vs hidden assumptions?

The ensemble provides this triage automatically. A team receiving the ensemble output
gets an actionable, prioritized list rather than a raw dump of concerns. This is
qualitatively different from receiving two independent lists and having to merge them.

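
One way to picture the triage described above is as a sort over critiqued findings. The sketch below is hypothetical: the `CritiquedFinding` fields, the `triage` ordering, and the placeholder records are all assumptions made for illustration, not the actual output schema of the ensemble run.

```python
from dataclasses import dataclass

SEVERITY_ORDER = {"High": 0, "Medium": 1, "Low": 2}

@dataclass(frozen=True)
class CritiquedFinding:
    """Hypothetical record for one finding after the critique phase."""
    title: str
    verdict: str           # "AGREE" or "PARTIALLY AGREE"
    severity: str          # severity after the critic's calibration
    broker_specific: bool  # True if the concern only applies to the current broker
    design_decision: bool  # True if judged an acknowledged tradeoff

def triage(findings):
    """Order findings: hidden assumptions first, then by severity, then full agreements."""
    return sorted(
        findings,
        key=lambda f: (f.design_decision, SEVERITY_ORDER[f.severity], f.verdict != "AGREE"),
    )

# Placeholder records only; these are not findings from the experiment.
placeholders = [
    CritiquedFinding("placeholder C", "PARTIALLY AGREE", "Low", True, False),
    CritiquedFinding("placeholder A", "AGREE", "High", False, False),
    CritiquedFinding("placeholder D", "AGREE", "High", False, True),
    CritiquedFinding("placeholder B", "PARTIALLY AGREE", "High", True, False),
]

ordered_titles = [f.title for f in triage(placeholders)]
print(ordered_titles)
# ['placeholder A', 'placeholder B', 'placeholder C', 'placeholder D']
```

The ordering key is a design choice for this sketch: acknowledged tradeoffs sink to the bottom regardless of severity, so the top of the list contains only hidden assumptions.
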
## Practical Implications

### When to use the adversarial ensemble:
- Architecture documents heading into implementation (worth the extra tokens)
- Documents where severity calibration matters (financial, safety-critical)
- When the team needs actionable output, not just a concern list

### When independent runs suffice:
- Exploratory analysis (finding IS the goal, not prioritizing)
- Cost-sensitive scenarios (the ensemble uses ~28% more tokens than two independent runs)
- Documents where overlap is minimal (highly specialized vs general)

### Optimal workflow:
1. Run GPT-5 first (broadest coverage, most operational concerns)
2. Feed GPT-5's output to Opus for critique + extension
3. Use Opus's output as the final deliverable (calibrated + extended)

In this experiment, this worked better than running both independently and manually
merging, because the ensemble eliminates duplicate effort and produces a structured
assessment of each finding's validity.

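
The three workflow steps can be sketched as a small Python function. The model calls are injected as plain callables because the actual prompts and API clients used in the experiment are not specified here; `run_ensemble`, `stub_find`, `stub_critic`, and the result keys are all hypothetical names.

```python
from typing import Callable, Dict, List

Finder = Callable[[str], List[str]]
Critic = Callable[[str, List[str]], Dict[str, list]]

def run_ensemble(document: str, find: Finder, critique_and_extend: Critic) -> Dict[str, list]:
    """Run the first model, then let the second model critique and extend its findings."""
    first_pass = find(document)                         # step 1: broad first pass
    result = critique_and_extend(document, first_pass)  # step 2: critique + extension
    # Sanity check: the critic should assess every first-pass finding exactly once.
    if len(result["critiques"]) != len(first_pass):
        raise ValueError("critic skipped or duplicated a finding")
    return result                                       # step 3: the final deliverable

# Stubs standing in for real model calls, so the sketch runs without any API.
def stub_find(document: str) -> List[str]:
    return ["finding 1", "finding 2"]

def stub_critic(document: str, findings: List[str]) -> Dict[str, list]:
    return {
        "critiques": [(finding, "AGREE") for finding in findings],
        "extensions": ["new finding"],
    }

output = run_ensemble("document text", stub_find, stub_critic)
total_findings = len(output["critiques"]) + len(output["extensions"])
print(total_findings)  # 3
```

In real use, the two callables would wrap whatever client each model provider offers; keeping them as parameters also makes it easy to swap the order of the models, which one of the open questions below asks about.
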
## Updated Open Questions

- **Does the ensemble benefit diminish with simpler documents?** This 363-line
financial doc has many implicit assumptions. Would a simpler, less domain-specific
doc show the same 30% improvement over GPT-5 alone?
- **Would GPT-5 as the critic (Opus first, GPT-5 critiques) work equally well?**
Given GPT-5's tendency toward exhaustiveness, it might add more extensions. But
Opus's precision in severity calibration might be lost.
- **Is there a three-model ensemble worth testing?** (GPT-5 → Opus critique → Sonnet
for accessibility/communication of findings to non-experts)