finding #35: adversarial ensemble (critique+extend) produces 30% more coverage

Tests GPT-5 → Opus critique+extend pipeline on dtbp-margin-call.md. Key results: - Ensemble produces 56 unique findings vs 43 (GPT-5) or 28 (Opus) alone - Zero full disagreements — GPT-5's coverage is reliable signal - Critique phase (severity calibration) more valuable than extension phase - 28% more tokens for 30% more coverage + structured prioritization - Answers open question about adversarial ensemble value
2026-05-06 21:29:17 -07:00
parent 4a69a99d05
commit 8338ae3019
1 changed files with 165 additions and 0 deletions
@@ -0,0 +1,165 @@
 # Finding #35: Adversarial Ensemble (Critique + Extend) Produces Higher Total Coverage Without Redundancy
 **Date:** 2026-05-07
 **Task:** Identify hidden assumptions in gargoyle's `dtbp-margin-call.md` (363 lines) —
 a document specifying day-trading buying power mode selection, local DTBP tracking, and
 the margin call state machine in a GenServer.
 **Experiment:** Test the "adversarial ensemble" approach from open questions: Does giving
 Opus access to GPT-5's findings and asking it to critique + extend produce more than
 either model alone?
 ## Method
 Three runs, same document, same analytical lens ("hidden assumptions"):
 | Run | Model | Input | Role |
 |---|---|---|---|
 | A: GPT-5 independent | GPT-5 | Document only | Find all hidden assumptions |
 | B: Opus independent | Claude Opus 4.6 | Document only | Find all hidden assumptions (baseline) |
 | C: Opus ensemble | Claude Opus 4.6 | Document + GPT-5's findings | Critique GPT-5's findings, then extend with new ones |
 ## Results
 | Run | Time | Input tokens | Output tokens | Reasoning tokens | Findings |
 |---|---|---|---|---|---|
 | A: GPT-5 independent | ~104s | 3,979 | 9,431 | 5,440 | 43 |
 | B: Opus independent | ~100s | 4,833 | 4,994 | (internal) | 28 |
 | C: Opus ensemble | ~120s | 9,569 | 6,819 | (internal) | 43 critiques + 13 new |
 ## Ensemble Critique Breakdown
 Of GPT-5's 43 findings, Opus assessed:
 - **31 AGREE** (72%) — correct and well-reasoned
 - **12 PARTIALLY AGREE** (28%) — real issue but overstated, understated, or imprecise
 - **0 DISAGREE** (0%) — none rejected entirely
 Zero full disagreements is striking. Opus never said "this isn't actually an issue."
 The partial agreements were consistently about severity calibration or scope assumptions:
 - 5 cases of severity downgrade (GPT-5 overstated risk for this Alpaca-specific system)
 - 4 cases of framing refinement (correct concern, wrong root cause identified)
 - 3 cases of scope limitation (valid if system expands, not currently relevant)
 ## Ensemble Extensions (13 New Findings)
 Opus found 13 findings GPT-5 missed entirely:
 | # | Finding | Severity | Category |
 |---|---|---|---|
 | 1 | Race between mode transition and order acceptance (cross-process TOCTOU) | High | Timing |
 | 2 | dtbp_used_today reset on broker re-query creates accounting discontinuities (double-count risk) | High | Data model |
 | 3 | No handling of partial day-trade completion | Medium | Financial |
 | 4 | `:met` transient state has no timeout/crash recovery | Low | State machine |
 | 5 | No mechanism for intra-day deposits increasing DTBP | Medium | Financial |
 | 6 | "Expected 4×" threshold for call detection is undefined | High | Detection logic |
 | 7 | No coordination between DTBP and order cancellation/expiration events | Medium | Coupling |
 | 8 | T+1 settlement cycle impact on overnight buying power | Medium | Financial |
 | 9 | No backpressure or circuit-breaking on broker API queries | High | Operational |
 | 10 | Assumes single market open/close per day (no halt modeling) | Medium | Timing |
 | 11 | Broker forced liquidations treated as normal day-trade sells | Medium | Financial |
 | 12 | Persistence model has no audit trail | High | Compliance |
 | 13 | No concept of "day-trade call amount" (only binary state) | Medium | Financial |
 ## Comparison: Ensemble vs Independent
 ### Total unique findings:
 - **GPT-5 independent:** 43
 - **Opus independent:** 28
 - **Opus ensemble (GPT-5's 43 + 13 new):** 56 total unique findings
 ### Overlap analysis:
 Comparing Opus independent (28) against GPT-5's findings (43):
 - ~18 of Opus's 28 independent findings overlap with GPT-5's (same core concern)
 - ~10 of Opus's independent findings are unique to Opus
 - Of those 10 Opus-unique independent findings, ~7 appear in the ensemble's 13 extensions
 - The ensemble found ~6 additional findings that Opus ALSO missed independently
 ### The key result:
 The ensemble produced **56 total unique findings** vs GPT-5's 43 alone (30% increase)
 or Opus's 28 alone (100% increase). More importantly:
 - The 13 new findings are genuinely novel (not reframings)
 - The critique refined 12 findings' severity/framing without losing information
 - The ensemble Opus found 6 things its independent counterpart missed
 ## Why the Ensemble Works
 ### 1. Reduced search space
 Independent Opus must find ALL assumptions from scratch. Ensemble Opus knows what's
 already covered and can focus its reasoning on gaps — areas GPT-5's analytical style
 tends to miss.
 ### 2. Complementary blind spots become visible
 GPT-5's findings reveal its analytical frame (operational, implementation-level,
 per-component). Seeing this frame explicitly helps Opus identify what's OUTSIDE it:
 cross-component timing issues, accounting discontinuities, compliance gaps.
 ### 3. The critique phase calibrates without discarding
 Zero disagreements means GPT-5's findings are reliable signal (not noise). The 12
 partial agreements ADD information (severity calibration, scope limitations) rather
 than removing findings. The ensemble output is strictly more informative than either
 input alone.
 ### 4. Opus's strengths amplified in the extension phase
 Opus independently found 28 assumptions in ~100s. Given GPT-5's 43 as a starting
 point, Opus found 13 more in ~120s. The marginal cost of the 13 extensions was minimal
 (~20s extra due to larger input), and they include 4 High-severity findings that
 neither model found independently.
 ## Cost Analysis
 | Approach | Total tokens (in+out) | Findings | Tokens per finding |
 |---|---|---|---|
 | GPT-5 alone | 13,410 | 43 | 312 |
 | Opus alone | 9,827 | 28 | 351 |
 | Ensemble (GPT-5 + Opus critique) | 13,410 + 16,388 = 29,798 | 56 | 532 |
 | Both independently (GPT-5 + Opus) | 13,410 + 9,827 = 23,237 | ~53 (with ~18 overlap) | 438 |
 The ensemble costs ~28% more tokens than running both independently but produces:
 - 3 more unique findings (56 vs ~53 de-duplicated)
 - Severity calibration on all 43 of GPT-5's findings
 - Explicit identification of which findings are Alpaca-specific vs general
 - Zero wasted effort on overlapping findings
 ## Key Insight: The Ensemble's Value Isn't Just "More Findings"
 The most valuable output from the ensemble isn't the 13 new findings — it's the
 **structured critique** of GPT-5's 43 findings. In production use, an architecture
 team receiving 43 findings needs to know:
 - Which are genuinely critical vs overstated?
 - Which apply to their specific broker vs being general concerns?
 - Which are design decisions (acknowledged tradeoffs) vs hidden assumptions?
 The ensemble provides this triage automatically. A team receiving the ensemble output
 gets an actionable, prioritized list rather than a raw dump of concerns. This is
 qualitatively different from receiving two independent lists and having to merge them.
 ## Practical Implications
 ### When to use the adversarial ensemble:
 - Architecture documents heading into implementation (worth the extra tokens)
 - Documents where severity calibration matters (financial, safety-critical)
 - When the team needs actionable output, not just a concern list
 ### When independent runs suffice:
 - Exploratory analysis (finding IS the goal, not prioritizing)
 - Cost-sensitive scenarios (the ensemble is ~28% more expensive)
 - Documents where overlap is minimal (highly specialized vs general)
 ### Optimal workflow:
 1. Run GPT-5 first (broadest coverage, most operational concerns)
 2. Feed GPT-5's output to Opus for critique + extension
 3. Use Opus's output as the final deliverable (calibrated + extended)
 This is strictly better than running both independently and manually merging,
 because the ensemble eliminates duplicate effort and produces structured assessment
 of each finding's validity.
 ## Updated Open Questions
 - **Does the ensemble benefit diminish with simpler documents?** This 363-line
  financial doc has many implicit assumptions. Would a simpler, less domain-specific
  doc show the same 30% improvement?
 - **Would GPT-5 as the critiquer (Opus first, GPT-5 critiques) work equally well?**
  Given GPT-5's tendency toward exhaustiveness, it might add more extensions. But
  Opus's precision in severity calibration might be lost.
 - **Is there a three-model ensemble worth testing?** (GPT-5 → Opus critique → Sonnet
  for accessibility/communication of findings to non-experts)