From afbc013e2e5118f13c92ba90ce42343e6bf4392c Mon Sep 17 00:00:00 2001
From: Rodin
Date: Fri, 15 May 2026 08:37:01 +0000
Subject: [PATCH] finding #80: config-a/b dispatcher malfunction detected in multi-model review pipeline (3.5x cost overage)

---
 ...-05-15-config-ab-dispatcher-malfunction.md | 130 ++++++++++++++++++
 1 file changed, 130 insertions(+)
 create mode 100644 findings/2026-05-15-config-ab-dispatcher-malfunction.md

diff --git a/findings/2026-05-15-config-ab-dispatcher-malfunction.md b/findings/2026-05-15-config-ab-dispatcher-malfunction.md
new file mode 100644
index 0000000..59e6589
--- /dev/null
+++ b/findings/2026-05-15-config-ab-dispatcher-malfunction.md
@@ -0,0 +1,130 @@
+# Finding #80: Config-A/B Dispatcher Malfunction in Multi-Model Review Pipeline
+
+**Date:** 2026-05-15
+**Severity:** HIGH (cost impact, measurement invalidation)
+**Component:** gargoyle AI review pipeline (PR #776)
+**Impact:** Phase 2 (lint-suppression) deployment blocked
+
+## Issue Summary
+
+The Config-A/B even/odd PR# parity routing mechanism in gargoyle's multi-model review pipeline is **NOT operational**. Instead of alternating reviewers by PR parity, all 6 reviewers fire on all PRs simultaneously, resulting in:
+
+- **3.5x API cost overage** (14+ reviews per PR instead of 4)
+- **Invalidated baseline metrics** (Phase 1 data collected with broken dispatcher)
+- **Blocked Phase 2 deployment** (can't measure lint-suppression improvement without working parity)
+
+## Expected vs. Actual Behavior
+
+### Expected (Config-A/B Parity)
+```
+Even PR# (e.g., #784) → Config A only
+- GPT-5 (investigates)
+- Opus (judges)
+- Security reviewer (specialized)
+
+Odd PR# (e.g., #781) → Config B only
+- Opus (investigates)
+- GPT-5 (judges)
+- Security reviewer (specialized)
+```
+
+### Actual (Broken Dispatcher)
+```
+All PR# → ALL 6 reviewers, always
+- Elixir-otp-reviewer (multiple passes)
+- Security-reviewer (multiple passes)
+- Trading-domain-reviewer
+- Event-sourcing-reviewer
+- Operational-gaps-reviewer
+- Structural-reviewer
+```
+
+## Evidence
+
+### PR #784 (DashboardLive Real-Time Monitoring)
+- **Created:** 2026-05-15 07:24:18Z
+- **Expected:** Config A (even PR#)
+- **Actual:** 14+ reviews from all 6 reviewers across multiple passes
+
+**Review timeline (fetched from Gitea API):**
+
+| Timestamp | Reviewer | State | Issue |
+|-----------|----------|-------|-------|
+| 07:24:43 | Elixir-OTP | ✅ APPROVED | Patterns good |
+| 07:25:33 | Security | ⚠️ REQUEST_CHANGES | Auth/trust missing |
+| 07:25:58 | Event-sourcing | ✅ APPROVED | Projection layer OK |
+| 07:25:59 | Trading-domain | ✅ APPROVED | No logic concerns |
+| 07:26:26 | Structural | ✅ APPROVED | Doc format OK |
+| 07:27:11 | Operational-gaps | ⚠️ REQUEST_CHANGES | P&L inconsistent |
+| (Pass 2 triggered by PR update) | | |
+| 07:36:19 | Elixir-OTP | ⚠️ REQUEST_CHANGES | CI lint-docs failing |
+| 07:36:56 | Elixir-OTP | ✅ APPROVED | Patterns validated |
+| 07:38:30 | Security | ⚠️ REQUEST_CHANGES | PubSub hardening |
+| 07:38:52 | Trading-domain | ⚠️ REQUEST_CHANGES | CI lint-docs failing |
+| 07:39:33 | Structural | ⚠️ REQUEST_CHANGES | CI + consistency |
+| 07:40:49 | Operational-gaps | ⚠️ REQUEST_CHANGES | Assumptions unclear |
+| 07:43:07 | Elixir-OTP | ✅ APPROVED | Lifecycle validated |
+
+## Root Causes (Hypotheses)
+
+1. **PR #776 implementation gap** — Config-A/B parity logic not deployed
+2. **Router misconfiguration** — Broadcasts to all reviewers instead of filtering by PR# parity
+3. **Webhook configuration** — All reviewers subscribed to global webhook (no parity filter)
+4. **Code-to-config mismatch** — Even/odd logic exists in code but not used by dispatcher
+
+## Operational Impact
+
+| Metric | Expected | Actual | Δ | Business Impact |
+|--------|----------|--------|---|---|
+| Reviews/PR | 4 | 14+ | 3.5x | **Cost: 3.5x API spend** |
+| Passes/PR | 1 | 2+ | 2x | Slow feedback (multi-pass) |
+| Config comparison | Measurable | Conflated | — | **Can't measure A vs B** |
+| Phase 1 baseline | Valid | Questionable | — | **Metrics contaminated** |
+| Phase 2 deployment | Ready | Blocked | — | **Can't proceed** |
+
+## Real Issues Found (Legitimate)
+
+Despite dispatcher malfunction, reviews ARE catching genuine issues in PR #784:
+
+**Security concerns:**
+- PubSub payload validation assumed, not explicit
+- Message/PubSub surfaces need hardening
+- Authorization and trust-boundary details missing
+
+**Operational gaps:**
+- Trading-day boundary handling inconsistent
+- P&L assumptions underspecified
+- Could produce wrong operational results
+
+**CI failures:**
+- lint-docs check failing (prevents merge)
+
+## Recommendations
+
+### Immediate (Blocking)
+1. **Investigate PR #776 implementation** — Verify Config-A/B parity routing deployed
+2. **Check gargoyle webhook/router** — Why are all 6 reviewers firing on all PRs?
+3. **Cost review** — Confirm if 3.5x API spend is acceptable
+4. **Decision:** Fix parity first, or proceed with lint-suppression despite broken dispatcher?
+
+### Short-term (Phase 2)
+- Revalidate Phase 1 baseline metrics with working dispatcher
+- Measure true Config-A quality vs. Config-B (currently conflated)
+- Establish correct baseline before lint-suppression rollout
+
+### Long-term (Process)
+- Add parity routing verification to PR review checklist
+- Monitor API costs per PR for anomalies
+- Automated test: Verify even/odd dispatch ratios in test runs
+
+## Status
+
+🔴 **BLOCKING** — Cannot proceed with Phase 2 (lint-suppression) deployment until parity routing is verified working.
+
+**Next checkpoint:** 2026-05-15 ~09:00 UTC after Aaron investigation.
+
+---
+
+**Logged by:** rodin (dev-loop cron)
+**Session:** 5342ac81-4bbc-4e4c-a123-347a7788d50c
+**Tracker:** `/home/ubuntu/.openclaw/workspace/memory/review-experiments/tracker.md`
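
Review note: the even/odd parity routing and the "verify even/odd dispatch ratios" recommendation in this finding could be checked with something like the sketch below. It is a minimal illustration in Python, not gargoyle's actual dispatcher. The roster strings and the function names (`select_config`, `reviewers_for`, `dispatch_ratio`, `cost_multiplier`) are hypothetical placeholders. The only facts taken from the finding are the parity rule (even PR# gets Config A, odd gets Config B) and the 14-versus-4 review counts.

```python
from collections import Counter

# Hypothetical rosters mirroring the "Expected" block of the finding.
# The strings are placeholders, not real gargoyle reviewer identifiers.
CONFIG_A = ("gpt-5:investigates", "opus:judges", "security-reviewer")
CONFIG_B = ("opus:investigates", "gpt-5:judges", "security-reviewer")


def select_config(pr_number: int) -> str:
    """Return 'A' for even PR numbers and 'B' for odd ones."""
    if pr_number <= 0:
        raise ValueError("PR number must be a positive integer")
    return "A" if pr_number % 2 == 0 else "B"


def reviewers_for(pr_number: int) -> tuple:
    """Return only the roster for the config selected by parity."""
    return CONFIG_A if select_config(pr_number) == "A" else CONFIG_B


def dispatch_ratio(pr_numbers) -> dict:
    """Share of PRs routed to each config, for a dispatch-ratio test."""
    counts = Counter(select_config(n) for n in pr_numbers)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no PR numbers supplied")
    return {cfg: counts[cfg] / total for cfg in ("A", "B")}


def cost_multiplier(actual_reviews: int, expected_reviews: int) -> float:
    """Ratio of observed reviews per PR to the expected count."""
    return actual_reviews / expected_reviews


if __name__ == "__main__":
    print(select_config(784), select_config(781))
    print(dispatch_ratio(range(776, 786)))
    print(cost_multiplier(14, 4))
```

A test built this way would fail as soon as both rosters fire for the same PR number, which is the symptom reported for PR #784. The `cost_multiplier(14, 4)` call reproduces the 3.5x figure from the finding's own counts.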