Finding #78: Dev Loop Effectiveness Analysis — Gargoyle Autonomous Development Pipeline

Date: 2026-05-14
Task: Measure the effectiveness of the autonomous dev loop (pre-code, multi-model review, post-merge review) for the gargoyle repo (grgl/gargoyle on gitea.weiker.me).
Task type: Process quality analysis (quantitative audit of an AI-driven development pipeline using Gitea API data).

Methodology

All data pulled from Gitea API (gitea.weiker.me/api/v1) using the rodin token. Data sources:

  • All closed PRs: 9 paginated requests, ~469 total closed PRs
  • PR reviews: sampled 24 PRs spanning the full repo lifetime (PRs 507-774)
  • Issues with ai-review label: 100 closed + 0 open = 100 total post-merge findings
  • PR bodies and labels for design/planning signals
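
The paginated pull described above can be sketched roughly as follows. This is a minimal illustration, not the script actually used for this analysis: the `https` scheme, the page size of 50, and the helper names `fetch_page` and `paginate` are assumptions, and the token is passed in rather than hard-coded.

```python
import json
import urllib.request
from typing import Callable, Iterator

# Host taken from the methodology; the https scheme is an assumption.
BASE_URL = "https://gitea.weiker.me/api/v1"


def fetch_page(path: str, page: int, limit: int, token: str) -> list[dict]:
    """Fetch one page of a Gitea list endpoint and decode the JSON array."""
    sep = "&" if "?" in path else "?"
    url = f"{BASE_URL}{path}{sep}page={page}&limit={limit}"
    req = urllib.request.Request(url, headers={"Authorization": f"token {token}"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def paginate(get_page: Callable[[int], list[dict]]) -> Iterator[dict]:
    """Yield items page by page until the endpoint returns an empty page."""
    page = 1
    while True:
        items = get_page(page)
        if not items:
            return
        yield from items
        page += 1


# Hypothetical usage (performs network calls, so it is left commented out):
# closed_prs = list(paginate(
#     lambda p: fetch_page("/repos/grgl/gargoyle/pulls?state=closed", p, 50, TOKEN)
# ))
```

The same loop applies to per-PR review listings. A count taken from a single page would be capped at the page size, so review totals should be collected across all pages.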

Important caveat: This analysis is based on observed data patterns from the API. The repo's autonomous dev loop evolved significantly over the ~30-day observation window (April 13 to May 14, 2026). Conclusions about trends should be treated as directional, not statistically rigorous, given the short timeframe.


1. Repository Overview

| Metric | Value |
| --- | --- |
| Repo age | ~31 days (created ~2026-04-13) |
| Total closed PRs | ~469 |
| AI review started | ~2026-05-01 (PR #507) |
| PRs with AI review | ~267 (PRs 507-774) |
| PRs WITHOUT AI review (human-only era) | ~202 (PRs 1-506) |
| Total post-merge issues filed (ai-review label) | 100 (closed) + 0 (open) |
| Open issues total | ~30 |
| PR velocity | ~15-20 PRs/day during active periods |

The repo is almost entirely autonomous — nearly all PRs are opened by rodin, with aweiker as reviewer/approver.


2. Review Pipeline Analysis

2.1 Reviewer Evolution

The review pipeline went through three distinct phases:

Phase 1 (PRs 1-506, April 13 to May 1): No automated AI review

  • No AI review bots. aweiker was the sole reviewer.
  • PRs were mostly short-lived with no structured review process.
  • No REQUEST_CHANGES observed (no data on review quality from this phase).

Phase 2 (PRs 507-748, May 2-11): Dual-bot review (sonnet-review-bot + gpt-review-bot)

  • Both bots reviewed every PR with distinct personas and model strengths.
  • Review bodies had rich formatted findings tables with MAJOR/MINOR/NIT severities.
  • High REQUEST_CHANGES rates, especially on substantive feature PRs.
  • Sonnet-review-bot consistently wrote longer, more prescriptive reviews (avg ~2,500 chars vs ~2,000 for gpt-review-bot in REQUEST_CHANGES).

Phase 3 (PRs 749-774, May 11-14): gpt-review-bot only

  • Sonnet-review-bot dropped from the pipeline around PR 748/749.
  • Volume of reviews per PR increased dramatically (PR 749: 30 reviews; PR 755: 30; PR 774: 22).
  • The increase appears to be repeated review passes across pushes, not more reviewers.
  • REQUEST_CHANGES dropped significantly in this phase (see Section 2.2).

2.2 REQUEST_CHANGES Rate by PR Sample

| PR # | Date | Total Reviews | REQUEST_CHANGES | Bots Active |
| --- | --- | --- | --- | --- |
| 507 | May-02 | 7 | 4 | gpt + sonnet |
| 542 | May-03 | 5 | 0 | gpt + sonnet |
| 609 | May-05 | 7 | 4 | gpt + sonnet |
| 619 | May-06 | 8 | 4 | gpt + sonnet |
| 628 | May-06 | 7 | 4 | gpt + sonnet |
| 633 | May-07 | 11 | 8 | gpt + sonnet |
| 644 | May-07 | 9 | 0 | gpt + sonnet |
| 654 | May-07 | 7 | 2 | gpt + sonnet |
| 664 | May-08 | 14 | 8 | gpt + sonnet |
| 670 | May-08 | 10 | 3 | gpt + sonnet |
| 681 | May-09 | 8 | 0 | gpt + sonnet |
| 706 | May-10 | 18 | 2 | gpt + sonnet |
| 718 | May-10 | 5 | 1 | gpt + sonnet |
| 724 | May-10 | 18 | 2 | gpt + sonnet |
| 734 | May-10 | 5 | 2 | gpt + sonnet |
| 737 | May-11 | 9 | 1 | gpt + sonnet |
| 749 | May-11 | 30 | 0 | gpt only |
| 755 | May-12 | 30 | 1 | gpt only |
| 762 | May-13 | 30 | 1 | gpt only |
| 767 | May-13 | 10 | 1 | gpt only |
| 771 | May-13 | 14 | 0 | gpt only |
| 774 | May-14 | 22 | 0 | gpt only |

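Dual-bot era (PRs 507-737, May 2-11): the 16 sampled PRs in the table total 45 REQUEST_CHANGES across 148 reviews, a REQUEST_CHANGES rate of about 30%. (The ~32% figure used elsewhere in this report corresponds to a 14-PR, 141-review tally of the same era: 45/141.)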

Single-bot era (PRs 749-774, May 11-14): 6 sampled PRs, 3 REQUEST_CHANGES across 136 total reviews = 2% REQUEST_CHANGES rate.

This is a sharp drop — but it's ambiguous: either (a) gpt-review-bot alone is less demanding, (b) code quality improved, or (c) the massive review volume (30 reviews per PR) represents repeat passes on already-approved state.
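
The per-era rates can be recomputed directly from the sample table in Section 2.2. The sketch below is only an arithmetic check: the tuple layout and the `rc_rate` helper are illustrative, and the era labels `"dual"` and `"single"` stand in for the "Bots Active" column. Summing the rows as listed gives 16 dual-bot PRs with 148 reviews and 6 single-bot PRs with 136 reviews.

```python
# (PR number, total reviews, REQUEST_CHANGES, era), copied from the Section 2.2 table.
SAMPLES = [
    (507, 7, 4, "dual"), (542, 5, 0, "dual"), (609, 7, 4, "dual"),
    (619, 8, 4, "dual"), (628, 7, 4, "dual"), (633, 11, 8, "dual"),
    (644, 9, 0, "dual"), (654, 7, 2, "dual"), (664, 14, 8, "dual"),
    (670, 10, 3, "dual"), (681, 8, 0, "dual"), (706, 18, 2, "dual"),
    (718, 5, 1, "dual"), (724, 18, 2, "dual"), (734, 5, 2, "dual"),
    (737, 9, 1, "dual"),
    (749, 30, 0, "single"), (755, 30, 1, "single"), (762, 30, 1, "single"),
    (767, 10, 1, "single"), (771, 14, 0, "single"), (774, 22, 0, "single"),
]


def rc_rate(rows, era):
    """Return (REQUEST_CHANGES count, review count, rate) for one era."""
    reviews = sum(total for _, total, _, e in rows if e == era)
    changes = sum(rc for _, _, rc, e in rows if e == era)
    return changes, reviews, changes / reviews


for era in ("dual", "single"):
    changes, reviews, rate = rc_rate(SAMPLES, era)
    print(f"{era}: {changes}/{reviews} = {rate:.1%}")
# dual: 45/148 = 30.4%
# single: 3/136 = 2.2%
```

Because the rate is pooled over reviews rather than averaged per PR, PRs with many review passes (such as the 30-review PRs) weigh more heavily in the single-bot figure.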

2.3 Findings Depth Analysis

Review body lengths for REQUEST_CHANGES reviews:

  • Sonnet-review-bot: avg ~4,000 chars for REQUEST_CHANGES (range: 1,752-6,902)
  • gpt-review-bot: avg ~3,200 chars for REQUEST_CHANGES (range: 2,402-4,859)
  • Both produced structured tables with | # | Severity | File | Line | Finding | format.
  • PR #633 (DailyPnl.Snapshotter) received the most intense review: 8 REQUEST_CHANGES across 11 total reviews. Issues found: CI failures, missing @impl, unhandled error tuples, bad test design (testing EventStore directly instead of GenServer), concurrency issues.
  • PR #664 (QuoteFeed telemetry): 8 REQUEST_CHANGES, finding duplicated documentation examples that didn't match implementation, and high-cardinality :symbol telemetry tags.
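
Since both bots emit findings in the same `| # | Severity | File | Line | Finding |` table shape, severity counts can be extracted mechanically from review bodies. The following is a rough sketch under that assumption; the sample body, file paths, and function names are invented for illustration and are not taken from an actual review.

```python
from collections import Counter

# Hypothetical review body; real bot output may differ in detail.
SAMPLE_BODY = """\
Summary of findings:

| # | Severity | File | Line | Finding |
|---|----------|------|------|---------|
| 1 | MAJOR | lib/example.ex | 42 | Unhandled error tuple |
| 2 | NIT | test/example_test.exs | 7 | Process.sleep in test |
"""


def parse_findings(body: str) -> list[dict]:
    """Extract data rows from a five-column findings table in a review body."""
    findings = []
    for raw in body.splitlines():
        line = raw.strip()
        if not (line.startswith("|") and line.endswith("|")):
            continue
        # Split into at most five cells so a '|' inside the finding text survives.
        cells = [c.strip() for c in line[1:-1].split("|", 4)]
        if len(cells) != 5 or not cells[0].isdigit():
            continue  # header row, separator row, or an unrelated table
        findings.append({
            "severity": cells[1].upper(),
            "file": cells[2],
            "line": cells[3],
            "finding": cells[4],
        })
    return findings


def severity_counts(findings: list[dict]) -> Counter:
    """Count findings per severity label (MAJOR / MINOR / NIT)."""
    return Counter(f["severity"] for f in findings)


print(severity_counts(parse_findings(SAMPLE_BODY)))
```

Counting severities this way would give a depth measure that is less sensitive to verbosity than review body length.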

2.4 Bot Disagreement Pattern

In the dual-bot era, bots disagreed frequently:

  • PR #706: sonnet filed 2 REQUEST_CHANGES, gpt approved across 18 total reviews → code went through multiple revision rounds
  • PR #724: gpt filed REQUEST_CHANGES first, sonnet later filed REQUEST_CHANGES (round 2) → created push-pull dynamic
  • PR #634: sonnet kept filing REQUEST_CHANGES (5 rounds) even after gpt approved → Sonnet acted as the more persistent blocker
  • PR #718: gpt filed REQUEST_CHANGES, sonnet approved immediately → gpt more demanding on refactoring PR

The dual-bot disagreement pattern acted as a natural quality ratchet — a PR couldn't merge until both bots were satisfied.


3. Post-Merge Review Findings

3.1 Total Volume

100 issues filed by rodin with ai-review label (all closed)

This represents findings that slipped through the review pipeline and were caught post-merge. All were subsequently fixed (all closed).

3.2 Distribution by Source PR

Top PRs by post-merge findings:

| Source PR | Findings | Date | PR Type |
| --- | --- | --- | --- |
| PR #566 | 7 | May-04 | docs: add 8 domain-layer documents |
| PR #633 | 3 | May-07 | feat: DailyPnl.Snapshotter GenServer |
| PR #657 | 3 | May-07/08 | feat: QuoteFeed WebSocket |
| PR #590 | 3 | May-05 | docs: extract Ledger narratives |
| PR #592 | 3 | May-05 | docs: extract Decision Engine narratives |
| PR #724 | 2 | May-10 | feat: PositionReconciler |
| PR #550 | 2 | May-03 | docs: kill switch design |
| PR #598 | 2 | May-05 | docs: replace trading-pipeline.md |

PRs with 1 post-merge finding: 508, 518, 519, 521, 523, 527, 530, 547, 555, 567, 609, 621, 626, 664, 686, 692, 704, 717, 721, 728, 737, 739, 767, 771 (24 PRs)

3.3 Finding Categories

Analyzing 100 post-merge finding titles:

| Category | Count | % | Notes |
| --- | --- | --- | --- |
| Missing test coverage | 22 | 22% | Dedicated test files, uncovered paths, retry/error paths |
| Missing issue link | 14 | 14% | PR merged without tracking issue (early era) |
| Missing diagrams/doc gaps | 13 | 13% | Mermaid diagrams, failure modes tables |
| Acceptance criteria not met | 8 | 8% | Test plan unchecked, ACs incomplete |
| Logger/telemetry violations | 7 | 7% | String interpolation in Logger, high-cardinality tags |
| Missing @behaviour/@spec | 7 | 7% | Behaviour declarations, orphaned @callback |
| Concurrency/race conditions | 5 | 5% | async: false, TOCTOU races, ETS isolation |
| Process.sleep anti-pattern | 4 | 4% | Timing hacks in tests |
| Deferred work not tracked | 3 | 3% | TODO deferred without issue, scope slippage |
| CI/lint violations | 2 | 2% | Lint-docs failures, duplicate dividers |
| Other | 15 | 15% | Various |
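
A title-based categorization like the one above can be approximated with ordered keyword rules. The sketch below is illustrative only: the keyword lists are assumptions chosen to match the category names, not the rules actually used to produce the table, and substring matching will misfile some titles.

```python
from collections import Counter

# Ordered (category, keywords) rules; the first match wins. Keywords are assumptions.
RULES = [
    ("Missing test coverage", ("test coverage", "uncovered", "untested")),
    ("Missing issue link", ("issue link", "tracking issue")),
    ("Missing diagrams/doc gaps", ("diagram", "failure modes")),
    ("Acceptance criteria not met", ("acceptance criteria", "test plan")),
    ("Logger/telemetry violations", ("logger", "telemetry")),
    ("Missing @behaviour/@spec", ("@behaviour", "@spec", "@callback")),
    ("Concurrency/race conditions", ("race condition", "toctou")),
    ("Process.sleep anti-pattern", ("process.sleep",)),
    ("Deferred work not tracked", ("deferred", "todo")),
    ("CI/lint violations", ("lint", "ci failure")),
]


def categorize(title: str) -> str:
    """Return the first category whose keyword appears in the title."""
    lowered = title.lower()
    for category, keywords in RULES:
        if any(keyword in lowered for keyword in keywords):
            return category
    return "Other"


def tally(titles: list[str]) -> Counter:
    """Count issue titles per category."""
    return Counter(categorize(t) for t in titles)


# One title quoted in this report plus one made-up example.
print(tally([
    "PR #737: Logging namespace renames deferred without follow-up issue",
    "Process.sleep used in GenServer test",
]))
```

Rule order matters: a title that mentions both a test plan and a missing diagram lands in whichever category is listed first, so ambiguous titles are better reviewed by hand.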

3.4 Post-Merge Finding Rate Over Time

| Period | PRs Active | Post-Merge Issues Filed | Rate |
| --- | --- | --- | --- |
| April 24-30 | ~early era | ~20 | ~high (no review yet) |
| May 1-7 | ~200 PRs | ~57 | early review era |
| May 8-14 | ~70 PRs | ~23 | more recent |

Findings rate appears to decline over time as the pipeline matured and common failure modes (missing issue links, test coverage gaps) were repeatedly caught and addressed. Early post-merge reviews surfaced systemic problems (no issue links on any PRs, all test plan items unchecked) that were then fixed at the process level.


4. Pre-Code / Design Phase Analysis

4.1 Design Label Coverage

From the open issues list:

  • Issues with design label: ~18 open issues (all future-pipeline items like options, backtesting, notification systems)
  • These design issues appear to be ahead-of-implementation planning items, not pre-code docs for completed work.

4.2 Evidence of Pre-Code Practice

Looking at PR bodies for design doc references:

  • PR #633 (feat(daily-pnl): implement DailyPnl.Snapshotter GenServer): Body explicitly references "design in docs/domain/contexts/reporting/daily-pnl.md" — this PR followed a design doc.
  • PR #755 (feat(trading): OrderManager PubSub broadcasts): Body describes the feature scope in detail ("What: Order placement broadcasts...") with clear acceptance criteria.
  • PR bodies generally follow a "Why / What" structure suggesting pre-planned work.

The design label on issues represents future work in the pipeline — implementation issues reference the design documents without necessarily going through a formal pre-code review cycle.

4.3 Design vs Implementation Quality Comparison

I could not directly compare "had formal pre-code review" vs "did not" because the distinction is not captured in labels or issue references consistently. However, observable proxy signals:

PRs with linked design docs (like #633): Still received 8 REQUEST_CHANGES, still had 3 post-merge issues. The design doc reduced scope ambiguity but didn't prevent implementation bugs.

Rapid implementation PRs (like #674, single-reviewer approvals): Tended to have more post-merge findings per PR on average.

The design issue pipeline covers future work only — there's no evidence that in-progress feature work goes through a formal pre-code review step before coding begins. This represents a gap.


5. Quality Trend Over Time

5.1 Monthly Summary (compressed — only ~30 days of history)

| Week | PRs Merged | Avg Reviews/PR | % with RC | Post-Merge Issues Filed |
| --- | --- | --- | --- | --- |
| Apr 13-30 | ~150+ | 0 | 0% (no AI) | ~27 (discovered retroactively) |
| May 1-7 | ~130 | ~7 | ~45% | 57 |
| May 8-11 | ~70 | ~12 | ~25% | 23 |
| May 12-14 | ~50 | ~24 | ~5% | 5 |

Interpretation: The ~45% REQUEST_CHANGES rate in May 1-7 reflects the review pipeline catching real issues in a codebase that hadn't been reviewed. The declining rate in May 12-14 reflects one or more of (a) the codebase maturing, (b) reviewers adapting to common patterns, or (c) the sonnet-review-bot dropout.

5.2 Review Round Inflation

A notable trend: review round count per PR increased dramatically over time.

  • Early PRs (May 2-6): 2-11 reviews per PR
  • Mid-period (May 8-11): 8-18 reviews per PR
  • Recent PRs (May 11-14): 22-30 reviews per PR

This suggests a pattern of multiple push-review-fix cycles per PR. PR #706 had 18 reviews across ~8 rounds. This indicates the review loop is working (catching things) but also that PRs take many iterations before they're clean.
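
One way to separate "reviews" from "rounds", as in the PR #706 figure of 18 reviews across ~8 rounds, is to count the distinct head commits that were reviewed. The sketch below assumes each review record carries a `commit_id` and a `state` field, as Gitea's review objects are understood to do, and that every push produces a new head commit. The sample data is made up.

```python
# Hypothetical review records for a single PR (not real data).
SAMPLE_REVIEWS = [
    {"user": "gpt-review-bot", "commit_id": "aaa111", "state": "REQUEST_CHANGES"},
    {"user": "sonnet-review-bot", "commit_id": "aaa111", "state": "REQUEST_CHANGES"},
    {"user": "gpt-review-bot", "commit_id": "bbb222", "state": "APPROVED"},
    {"user": "gpt-review-bot", "commit_id": "ccc333", "state": "APPROVED"},
    {"user": "sonnet-review-bot", "commit_id": "ccc333", "state": "APPROVED"},
]


def summarize(reviews: list[dict]) -> dict:
    """Summarize one PR: total reviews, distinct reviewed commits, and RC count."""
    return {
        "reviews": len(reviews),
        # One round is counted per distinct head commit that received a review.
        "rounds": len({r["commit_id"] for r in reviews}),
        "request_changes": sum(r["state"] == "REQUEST_CHANGES" for r in reviews),
    }


print(summarize(SAMPLE_REVIEWS))
# {'reviews': 5, 'rounds': 3, 'request_changes': 2}
```

A PR whose review count is much larger than its round count is being re-reviewed on unchanged commits, which is the inflation pattern described here.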


6. Top 5 Improvement Areas by Phase

6.1 Pre-Code / Plan Generation

  1. No formal pre-code gate exists. Design issues exist in the backlog, but there's no signal that implementation work requires a pre-code review before coding starts. 22% of post-merge findings were missing test coverage — suggesting test plans aren't being written before implementation.

  2. Test plan acceptance criteria are unchecked at merge. Multiple post-merge issues flagged "all 5 test plan items unchecked at merge." The review pipeline doesn't verify ACs before merge; it only checks code.

  3. Design docs trail implementation. The docs/readme: rebuild design sequencing map PRs appear frequently (~15 PRs) — these are reactive documentation updates after implementation, not pre-code design.

  4. Issue sizing discipline is inconsistent. Several issues lack size: labels or have needs-split — preventing realistic scope estimation before work begins.

  5. No post-implementation retrospective link. When post-merge issues are filed, they're not linked back to the originating design doc or issue, making it hard to audit what the pre-code design missed.

6.2 Review Pipeline (Multi-Model)

  1. Sonnet-review-bot dropout removed the quality ratchet. The dual-bot disagreement pattern (where Sonnet kept requesting changes even after GPT approved) was a feature, not a bug. Single-bot review dropped REQUEST_CHANGES from ~32% to ~2%. Recommendation: Restore dual-bot review.

  2. Review volume inflation (30 reviews/PR) doesn't equal depth. The recent spike to 30 reviews/PR appears to reflect repeated passes on already-approved code rather than deeper analysis. Review round management needs improvement, perhaps with a counter or change check that avoids re-reviewing unchanged sections.

  3. Bot findings are siloed. Each bot reviews independently without reading the other's findings. This leads to duplicate findings in some cases and misses cross-cutting issues that would emerge from comparing perspectives. A synthesis step (after both bots review) would add value.

  4. No domain-specific reviewer for business logic. The review pipeline has a trading-domain reviewer job in CI, but it's focused on patterns rather than business correctness. A reviewer that understands event-sourcing invariants (e.g., "does this state transition preserve aggregate consistency?") would catch more logic bugs.

  5. REQUEST_CHANGES without actionable blockers. Several REQUEST_CHANGES reviews flagged CI failures as the primary blocker — correct, but unhelpful when the CI failure was due to the bot's own environment rather than the code. Cleaner distinction between "block on code issue" vs "block on CI" would reduce noise.

6.3 Post-Merge Review

  1. 22% of findings are missing test coverage. This is the most persistent failure mode. The inline review catches missing @impl, type violations, and structural issues — but repeatedly misses "this happy path has no error-path test." A test coverage check as a first-class review step would help.

  2. Logger/telemetry violations are recurring. The same violations (Logger string interpolation, high-cardinality telemetry tags) appeared in PRs #633, #654, #664, #657, #671, #769 — across a two-week period. These should become linting rules (mix credo custom check or CI step) rather than relying on reviewers to catch them.

  3. Post-merge review runs after the fact, not at PR time. The post-merge review is triggered by rodin on closed PRs, filing issues that must then be separately worked. This creates a lag between introduction and fix. Moving more of this checklist into the inline PR review would prevent introduction, not just detection.

  4. Deferred work not tracked. Issues like "PR #737: Logging namespace renames deferred without follow-up issue" show that the implementation PR did partial work and deferred the rest without creating tracking issues. The post-merge review catches this, but an inline check for "TODO without issue reference" would catch it sooner.

  5. No trend analysis on finding recurrence. The same categories (missing tests, Logger violations, missing diagrams) reappear weekly. There's no mechanism to track "this failure mode was found N times" and escalate it to a process fix. A running recurrence tracker would enable systemic fixes rather than whack-a-mole.
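
The recurrence tracker suggested in item 5 could start as something very small. The sketch below flags any finding category that appears in at least `min_weeks` distinct ISO weeks. The function name, the threshold, and the sample dates are all hypothetical.

```python
from collections import defaultdict
from datetime import date


def recurring_categories(findings, min_weeks):
    """Map category -> number of distinct ISO weeks it appeared in, keeping
    only categories seen in at least `min_weeks` weeks."""
    weeks_by_category = defaultdict(set)
    for filed_on, category in findings:
        iso = filed_on.isocalendar()
        weeks_by_category[category].add((iso[0], iso[1]))  # (ISO year, ISO week)
    return {
        category: len(weeks)
        for category, weeks in weeks_by_category.items()
        if len(weeks) >= min_weeks
    }


# Made-up sample: (date the post-merge issue was filed, category).
SAMPLE_FINDINGS = [
    (date(2026, 5, 4), "Logger/telemetry violations"),
    (date(2026, 5, 7), "Logger/telemetry violations"),
    (date(2026, 5, 12), "Logger/telemetry violations"),
    (date(2026, 5, 5), "Missing diagrams/doc gaps"),
]

print(recurring_categories(SAMPLE_FINDINGS, min_weeks=2))
# {'Logger/telemetry violations': 2}
```

Counting distinct weeks rather than raw occurrences keeps a single noisy PR from triggering an escalation. A category that recurs across weeks is a better candidate for a process or tooling fix.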


7. Takeaways

What's working:

  • The multi-model review pipeline (when both bots are active) is highly effective. It found real structural bugs (unhandled errors in PR #633, TOCTOU races in PR #429, missing behaviours in PR #418) that would have been expensive to fix later.
  • Post-merge review is responsive — all 100 filed issues were closed, and the finding rate is declining as common patterns are addressed.
  • The ai-review label creates a clean audit trail for escaped defects.
  • PR velocity is high (~15-20/day) without sacrificing review rigor when both bots are active.

What's not working:

  • Sonnet-review-bot dropout degraded the review gate significantly (32% → 2% REQUEST_CHANGES rate).
  • Pre-code design isn't gated — implementation starts without formal review of the plan.
  • Recurring violations (Logger, telemetry, test coverage) are caught reactively rather than prevented by tooling.
  • Review round inflation (30 reviews/PR) creates noise without proportional quality benefit.

Highest-leverage improvements, in order:

  1. Restore dual-bot review (immediate: add Sonnet back to CI pipeline)
  2. Add a test coverage checklist as a first-class review step
  3. Add mix credo custom checks for Logger and telemetry violations
  4. Implement pre-code gate: implementation PR requires linked design issue with accepted ACs
  5. Add recurrence tracking for post-merge finding categories
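
As a stopgap for item 3, the Logger string-interpolation pattern can be caught by a plain CI script before a proper Credo check exists. The sketch below is a regex scan written in Python, not a Credo check (which would be written in Elixir). The pattern and function name are assumptions, and the scan only inspects the first string literal on a `Logger` call line, so it will miss some cases.

```python
import re

# Matches a Logger call whose first string literal on the line contains "#{".
LOGGER_INTERPOLATION = re.compile(
    r'Logger\.(?:debug|info|notice|warning|warn|error|critical|alert|emergency)\b'
    r'[^\n"]*"[^"\n]*#\{'
)


def find_logger_interpolation(source: str) -> list[tuple[int, str]]:
    """Return (line number, stripped line) for each suspicious Logger call."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        if LOGGER_INTERPOLATION.search(line):
            hits.append((number, line.strip()))
    return hits


# Hypothetical Elixir snippet used only to exercise the scan.
SAMPLE_SOURCE = '''
def handle(symbol) do
  Logger.info("fetched #{symbol}")
  Logger.info("fetched quote", symbol: symbol)
end
'''

for number, line in find_logger_interpolation(SAMPLE_SOURCE):
    print(f"line {number}: {line}")
# line 3: Logger.info("fetched #{symbol}")
```

A script like this could fail the CI job when the hit list is non-empty. Multi-line calls and heredoc messages would need an AST-based check instead.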

Data Summary

| Metric | Value |
| --- | --- |
| Total closed PRs analyzed | ~469 |
| PRs with AI review | ~267 |
| Total post-merge issues filed | 100 |
| Post-merge issue closure rate | 100% |
| Avg REQUEST_CHANGES rate (dual-bot) | ~32% |
| Avg REQUEST_CHANGES rate (single-bot) | ~2% |
| Most-reviewed PRs | PRs #749, #755, #762 (30 reviews each) |
| Most post-merge findings in one PR | PR #566 (7 findings) |
| Most active review period | May 7-10 (>4 REQUEST_CHANGES per PR on complex features) |