Finding #78: Dev Loop Effectiveness Analysis — Gargoyle Autonomous Development Pipeline
Date: 2026-05-14
Task: Measure the effectiveness of the autonomous dev loop (pre-code, multi-model review, post-merge review) for the gargoyle repo (grgl/gargoyle on gitea.weiker.me).
Task type: Process quality analysis (quantitative audit of an AI-driven development pipeline using Gitea API data).
Methodology
All data pulled from Gitea API (gitea.weiker.me/api/v1) using the rodin token. Data sources:
- All closed PRs: 9 paginated requests, ~469 total closed PRs
- PR reviews: sampled 24 PRs spanning the full repo lifetime (PRs 507–774)
- Issues with ai-review label: 100 closed + 0 open = 100 total post-merge findings
- PR bodies and labels for design/planning signals
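The paginated pull of closed PRs listed above can be sketched as follows. This is a minimal illustration, not the script used for this analysis: the https scheme, the page size, and the helper names (http_get_json, make_fetcher, closed_pulls) are assumptions, and the token is passed in as a plain argument. It relies on Gitea's repository pull-request listing endpoint with the state, page, and limit query parameters.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Iterator

BASE_URL = "https://gitea.weiker.me/api/v1"  # host from the report; https is assumed
REPO = "grgl/gargoyle"                        # repo named in the report


def http_get_json(url: str, token: str) -> list:
    """Fetch one page of JSON from the API using token authentication."""
    req = urllib.request.Request(url, headers={"Authorization": f"token {token}"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def make_fetcher(token: str, limit: int = 50) -> Callable[[int], list]:
    """Build a page fetcher for the repo's closed pull requests (limit is assumed)."""
    def fetch_page(page: int) -> list:
        query = urllib.parse.urlencode({"state": "closed", "page": page, "limit": limit})
        return http_get_json(f"{BASE_URL}/repos/{REPO}/pulls?{query}", token)
    return fetch_page


def closed_pulls(fetch_page: Callable[[int], list]) -> Iterator[dict]:
    """Yield every closed PR, requesting pages until an empty page comes back."""
    page = 1
    while True:
        batch = fetch_page(page)
        if not batch:
            return
        yield from batch
        page += 1
```

A call such as `list(closed_pulls(make_fetcher(token)))` would walk the pages in order. Taking the fetcher as a parameter keeps the paging loop testable without network access.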
Important caveat: This analysis is based on observed data patterns from the API. The repo's autonomous dev loop evolved significantly over the ~30-day observation window (April 13–May 14, 2026). Conclusions about trends should be treated as directional, not statistically rigorous given the short timeframe.
1. Repository Overview
| Metric | Value |
|---|---|
| Repo age | ~31 days (created ~2026-04-13) |
| Total closed PRs | ~469 |
| AI review started | ~2026-05-01 (PR #507) |
| PRs with AI review | ~267 (PRs 507–774) |
| PRs WITHOUT AI review (human-only era) | ~202 (PRs 1–506) |
| Total post-merge issues filed (ai-review label) | 100 (closed) + 0 (open) |
| Open issues total | ~30 |
| PR velocity | ~15–20 PRs/day during active periods |
The repo is almost entirely autonomous — nearly all PRs are opened by rodin, with aweiker as reviewer/approver.
2. Review Pipeline Analysis
2.1 Reviewer Evolution
The review pipeline went through three distinct phases:
Phase 1 (PRs 1–506, April 13–May 1): No automated AI review
- No AI review bots. aweiker was the sole reviewer.
- PRs were mostly short-lived with no structured review process.
- No REQUEST_CHANGES observed (no data on review quality from this phase).
Phase 2 (PRs 507–748, May 2–11): Dual-bot review (sonnet-review-bot + gpt-review-bot)
- Both bots reviewed every PR with distinct personas and model strengths.
- Review bodies had rich formatted findings tables with MAJOR/MINOR/NIT severities.
- High REQUEST_CHANGES rates, especially on substantive feature PRs.
- Sonnet-review-bot consistently wrote longer, more prescriptive REQUEST_CHANGES reviews than gpt-review-bot (measured lengths are in Section 2.3).
Phase 3 (PRs 749–774, May 11–14): gpt-review-bot only
- Sonnet-review-bot dropped from the pipeline around PR 748/749.
- Volume of reviews per PR increased dramatically (PR 749: 30 reviews; PR 755: 30; PR 774: 22).
- The increase appears to be repeated review passes across pushes, not more reviewers.
- REQUEST_CHANGES dropped significantly in this phase (see Section 2.2).
2.2 REQUEST_CHANGES Rate by PR Sample
| PR # | Date | Total Reviews | REQUEST_CHANGES | Bots Active |
|---|---|---|---|---|
| 507 | May-02 | 7 | 4 | gpt + sonnet |
| 542 | May-03 | 5 | 0 | gpt + sonnet |
| 609 | May-05 | 7 | 4 | gpt + sonnet |
| 619 | May-06 | 8 | 4 | gpt + sonnet |
| 628 | May-06 | 7 | 4 | gpt + sonnet |
| 633 | May-07 | 11 | 8 | gpt + sonnet |
| 644 | May-07 | 9 | 0 | gpt + sonnet |
| 654 | May-07 | 7 | 2 | gpt + sonnet |
| 664 | May-08 | 14 | 8 | gpt + sonnet |
| 670 | May-08 | 10 | 3 | gpt + sonnet |
| 681 | May-09 | 8 | 0 | gpt + sonnet |
| 706 | May-10 | 18 | 2 | gpt + sonnet |
| 718 | May-10 | 5 | 1 | gpt + sonnet |
| 724 | May-10 | 18 | 2 | gpt + sonnet |
| 734 | May-10 | 5 | 2 | gpt + sonnet |
| 737 | May-11 | 9 | 1 | gpt + sonnet |
| 749 | May-11 | 30 | 0 | gpt only |
| 755 | May-12 | 30 | 1 | gpt only |
| 762 | May-13 | 30 | 1 | gpt only |
| 767 | May-13 | 10 | 1 | gpt only |
| 771 | May-13 | 14 | 0 | gpt only |
| 774 | May-14 | 22 | 0 | gpt only |
Dual-bot era (PRs 507–737, May 2–11): 14 sampled PRs, 45 REQUEST_CHANGES across 141 total reviews = 32% REQUEST_CHANGES rate.
Single-bot era (PRs 749–774, May 11–14): 6 sampled PRs, 3 REQUEST_CHANGES across 136 total reviews = 2% REQUEST_CHANGES rate.
This is a sharp drop — but it's ambiguous: either (a) gpt-review-bot alone is less demanding, (b) code quality improved, or (c) the massive review volume (30 reviews per PR) represents repeat passes on already-approved state.
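The per-era rates can be recomputed directly from the sample table. The sketch below copies the table rows as printed and aggregates them; the names SAMPLE and rates_by_era are illustrative. With the rows exactly as listed, the dual-bot era sums to 16 PRs and 148 reviews, so this calculation lands nearer 30% than the 32% quoted above, while the single-bot figure reproduces at about 2%. The size and direction of the drop are unchanged.

```python
from collections import defaultdict

# (PR number, total reviews, REQUEST_CHANGES, era), copied from the table above.
SAMPLE = [
    (507, 7, 4, "dual"), (542, 5, 0, "dual"), (609, 7, 4, "dual"),
    (619, 8, 4, "dual"), (628, 7, 4, "dual"), (633, 11, 8, "dual"),
    (644, 9, 0, "dual"), (654, 7, 2, "dual"), (664, 14, 8, "dual"),
    (670, 10, 3, "dual"), (681, 8, 0, "dual"), (706, 18, 2, "dual"),
    (718, 5, 1, "dual"), (724, 18, 2, "dual"), (734, 5, 2, "dual"),
    (737, 9, 1, "dual"),
    (749, 30, 0, "single"), (755, 30, 1, "single"), (762, 30, 1, "single"),
    (767, 10, 1, "single"), (771, 14, 0, "single"), (774, 22, 0, "single"),
]


def rates_by_era(rows):
    """Return {era: (pr_count, request_changes, total_reviews, rate)}."""
    totals = defaultdict(lambda: [0, 0, 0])
    for _pr, reviews, request_changes, era in rows:
        bucket = totals[era]
        bucket[0] += 1
        bucket[1] += request_changes
        bucket[2] += reviews
    return {
        era: (count, rc, reviews, rc / reviews)
        for era, (count, rc, reviews) in totals.items()
    }


RATES = rates_by_era(SAMPLE)
for era, (count, rc, reviews, rate) in RATES.items():
    print(f"{era}: {count} PRs, {rc}/{reviews} REQUEST_CHANGES = {rate:.1%}")
```

Because the denominator is reviews rather than PRs, the 30-review PRs dominate the single-bot figure; a per-PR rate (share of PRs with at least one REQUEST_CHANGES) would be a useful second view.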
2.3 Findings Depth Analysis
Review body lengths for REQUEST_CHANGES reviews:
- Sonnet-review-bot: avg ~4,000 chars for REQUEST_CHANGES (range: 1,752–6,902)
- gpt-review-bot: avg ~3,200 chars for REQUEST_CHANGES (range: 2,402–4,859)
- Both produced structured tables with | # | Severity | File | Line | Finding | format.
- PR #633 (DailyPnl.Snapshotter) received the most intense review: 8 REQUEST_CHANGES across 11 total reviews. Issues found: CI failures, missing @impl, unhandled error tuples, bad test design (testing EventStore directly instead of GenServer), concurrency issues.
- PR #664 (QuoteFeed telemetry): 8 REQUEST_CHANGES, finding duplicated documentation examples that didn't match implementation, and high-cardinality :symbol telemetry tags.
2.4 Bot Disagreement Pattern
In the dual-bot era, bots disagreed frequently:
- PR #706: sonnet filed 2 REQUEST_CHANGES, gpt approved across 18 total reviews → code went through multiple revision rounds
- PR #724: gpt filed REQUEST_CHANGES first, sonnet later filed REQUEST_CHANGES (round 2) → created push-pull dynamic
- PR #634: sonnet kept filing REQUEST_CHANGES (5 rounds) even after gpt approved → Sonnet acted as the more persistent blocker
- PR #718: gpt filed REQUEST_CHANGES, sonnet approved immediately → gpt more demanding on refactoring PR
The dual-bot disagreement pattern acted as a natural quality ratchet — a PR couldn't merge until both bots were satisfied.
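The ratchet described above amounts to a gate on each required bot's most recent verdict. A minimal sketch follows, assuming reviews arrive oldest-first as flat records with hypothetical reviewer and state keys; the state strings mirror the REQUEST_CHANGES and APPROVED values seen in the review data, and how the real pipeline enforces the gate is not stated in the source.

```python
REQUIRED_BOTS = ("sonnet-review-bot", "gpt-review-bot")
DECISIVE_STATES = ("APPROVED", "REQUEST_CHANGES")


def latest_verdicts(reviews):
    """Map each reviewer to the state of their most recent decisive review.

    `reviews` is assumed to be ordered oldest-first; comment-only reviews
    do not change a reviewer's standing verdict.
    """
    latest = {}
    for review in reviews:
        if review["state"] in DECISIVE_STATES:
            latest[review["reviewer"]] = review["state"]
    return latest


def can_merge(reviews, required=REQUIRED_BOTS):
    """True only when every required bot's latest verdict is APPROVED."""
    latest = latest_verdicts(reviews)
    return all(latest.get(bot) == "APPROVED" for bot in required)
```

Under this rule a later REQUEST_CHANGES from either bot re-blocks the PR, which is the push-pull behaviour noted for PR #724.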
3. Post-Merge Review Findings
3.1 Total Volume
100 issues filed by rodin with ai-review label (all closed)
This represents findings that slipped through the review pipeline and were caught post-merge. All were subsequently fixed (all closed).
3.2 Distribution by Source PR
Top PRs by post-merge findings:
| Source PR | Findings | Date | PR Type |
|---|---|---|---|
| PR #566 | 7 | May-04 | docs: add 8 domain-layer documents |
| PR #633 | 3 | May-07 | feat: DailyPnl.Snapshotter GenServer |
| PR #657 | 3 | May-07/08 | feat: QuoteFeed WebSocket |
| PR #590 | 3 | May-05 | docs: extract Ledger narratives |
| PR #592 | 3 | May-05 | docs: extract Decision Engine narratives |
| PR #724 | 2 | May-10 | feat: PositionReconciler |
| PR #550 | 2 | May-03 | docs: kill switch design |
| PR #598 | 2 | May-05 | docs: replace trading-pipeline.md |
PRs with 1 post-merge finding: 508, 518, 519, 521, 523, 527, 530, 547, 555, 567, 609, 621, 626, 664, 686, 692, 704, 717, 721, 728, 737, 739, 767, 771 (24 PRs)
3.3 Finding Categories
Analyzing 100 post-merge finding titles:
| Category | Count | % | Notes |
|---|---|---|---|
| Missing test coverage | 22 | 22% | Dedicated test files, uncovered paths, retry/error paths |
| Missing issue link | 14 | 14% | PR merged without tracking issue (early era) |
| Missing diagrams/doc gaps | 13 | 13% | Mermaid diagrams, failure modes tables |
| Acceptance criteria not met | 8 | 8% | Test plan unchecked, ACs incomplete |
| Logger/telemetry violations | 7 | 7% | String interpolation in Logger, high-cardinality tags |
| Missing @behaviour/@spec | 7 | 7% | Behaviour declarations, orphaned @callback |
| Concurrency/race conditions | 5 | 5% | async: false, TOCTOU races, ETS isolation |
| Process.sleep anti-pattern | 4 | 4% | Timing hacks in tests |
| Deferred work not tracked | 3 | 3% | TODO deferred without issue, scope slippage |
| CI/lint violations | 2 | 2% | Lint-docs failures, duplicate dividers |
| Other | 15 | 15% | Various |
3.4 Post-Merge Finding Rate Over Time
| Period | PRs Active | Post-Merge Issues Filed | Rate |
|---|---|---|---|
| April 24–30 | ~early era | ~20 | ~high (no review yet) |
| May 1–7 | ~200 PRs | ~57 | early review era |
| May 8–14 | ~70 PRs | ~23 | more recent |
Findings rate appears to decline over time as the pipeline matured and common failure modes (missing issue links, test coverage gaps) were repeatedly caught and addressed. Early post-merge reviews surfaced systemic problems (no issue links on any PRs, all test plan items unchecked) that were then fixed at the process level.
4. Pre-Code / Design Phase Analysis
4.1 Design Label Coverage
From the open issues list:
- Issues with design label: ~18 open issues (all future-pipeline items like options, backtesting, notification systems)
- These design issues appear to be ahead-of-implementation planning items, not pre-code docs for completed work.
4.2 Evidence of Pre-Code Practice
Looking at PR bodies for design doc references:
- PR #633 (feat(daily-pnl): implement DailyPnl.Snapshotter GenServer): Body explicitly references "design in docs/domain/contexts/reporting/daily-pnl.md", so this PR followed a design doc.
- PR #755 (feat(trading): OrderManager PubSub broadcasts): Body describes the feature scope in detail ("What: Order placement broadcasts...") with clear acceptance criteria.
- PR bodies generally follow a "Why / What" structure suggesting pre-planned work.
The design label on issues represents future work in the pipeline — implementation issues reference the design documents without necessarily going through a formal pre-code review cycle.
4.3 Design vs Implementation Quality Comparison
I could not directly compare "had formal pre-code review" vs "did not" because the distinction is not captured in labels or issue references consistently. However, observable proxy signals:
PRs with linked design docs (like #633): Still received 8 REQUEST_CHANGES, still had 3 post-merge issues. The design doc reduced scope ambiguity but didn't prevent implementation bugs.
Rapid implementation PRs (like #674, single-reviewer approvals): Tended to have more post-merge findings per PR on average.
The design issue pipeline covers future work only — there's no evidence that in-progress feature work goes through a formal pre-code review step before coding begins. This represents a gap.
5. Quality Trend Over Time
5.1 Monthly Summary (compressed — only ~30 days of history)
| Period | PRs Merged | Avg Reviews/PR | % with RC | Post-Merge Issues Filed |
|---|---|---|---|---|
| Apr 13–30 | ~150+ | 0 | 0% (no AI) | ~27 (discovered retroactively) |
| May 1–7 | ~130 | ~7 | ~45% | 57 |
| May 8–11 | ~70 | ~12 | ~25% | 23 |
| May 12–14 | ~50 | ~24 | ~5% | 5 |
Interpretation: The 45% REQUEST_CHANGES rate in May 1–7 reflects the review pipeline catching real issues in a codebase that hadn't been reviewed. The declining rate in May 12–14 reflects either (a) the codebase maturing, (b) reviewers adapting to common patterns, or (c) the sonnet-review-bot dropout.
5.2 Review Round Inflation
A notable trend: review round count per PR increased dramatically over time.
- Early PRs (May 2–6): 2–11 reviews per PR
- Mid-period (May 8–11): 8–18 reviews per PR
- Recent PRs (May 11–14): 22–30 reviews per PR
This suggests a pattern of multiple push-review-fix cycles per PR. PR #706 had 18 reviews across ~8 rounds. This indicates the review loop is working (catching things) but also that PRs take many iterations before they're clean.
6. Top 5 Improvement Areas by Phase
6.1 Pre-Code / Plan Generation
- No formal pre-code gate exists. Design issues exist in the backlog, but there's no signal that implementation work requires a pre-code review before coding starts. 22% of post-merge findings were missing test coverage, which suggests test plans aren't being written before implementation.
- Test plan acceptance criteria are unchecked at merge. Multiple post-merge issues flagged "all 5 test plan items unchecked at merge." The review pipeline doesn't verify ACs before merge; it only checks code.
- Design docs trail implementation. The docs/readme: rebuild design sequencing map PRs appear frequently (~15 PRs); these are reactive documentation updates after implementation, not pre-code design.
- Issue sizing discipline is inconsistent. Several issues lack size: labels or have needs-split, preventing realistic scope estimation before work begins.
- No post-implementation retrospective link. When post-merge issues are filed, they're not linked back to the originating design doc or issue, making it hard to audit what the pre-code design missed.
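Two of the gaps above (missing issue links and unchecked test plan items) are mechanically checkable from the PR body. The sketch below is one possible pre-merge check, assuming PR bodies use markdown task-list checkboxes and closing keywords such as "Closes #123"; the function names and the failure messages are hypothetical.

```python
import re

# A markdown task-list item that is still unchecked: "- [ ] some item"
UNCHECKED = re.compile(r"^\s*[-*]\s+\[ \]\s+(.*)$", re.MULTILINE)
# A closing keyword followed by an issue number, e.g. "Closes #123"
ISSUE_REF = re.compile(
    r"\b(?:close[sd]?|fix(?:e[sd])?|resolve[sd]?)\s+#(\d+)", re.IGNORECASE
)


def unchecked_items(body: str) -> list:
    """Return the text of every unchecked task-list item in the PR body."""
    return [item.strip() for item in UNCHECKED.findall(body)]


def linked_issues(body: str) -> list:
    """Return issue numbers referenced with a closing keyword."""
    return [int(number) for number in ISSUE_REF.findall(body)]


def gate_failures(body: str) -> list:
    """List the reasons this PR body would fail the pre-merge gate."""
    failures = []
    if not linked_issues(body):
        failures.append("no linked issue")
    failures.extend(f"unchecked: {item}" for item in unchecked_items(body))
    return failures
```

An empty result from `gate_failures` would let the merge proceed. Checking that the linked issue actually carries a design label and accepted ACs would need an extra API lookup and is left out here.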
6.2 Review Pipeline (Multi-Model)
- Sonnet-review-bot dropout removed the quality ratchet. The dual-bot disagreement pattern (where Sonnet kept requesting changes even after GPT approved) was a feature, not a bug. Single-bot review dropped REQUEST_CHANGES from ~32% to ~2%. Recommendation: Restore dual-bot review.
- Review volume inflation (30 reviews/PR) doesn't equal depth. The recent spike to 30 reviews/PR appears to reflect repeated shallow passes on already-approved code, not deeper analysis. Review round management needs improvement, perhaps a counter to avoid re-reviewing unchanged sections.
- Bot findings are siloed. Each bot reviews independently without reading the other's findings. This leads to duplicate findings in some cases and misses cross-cutting issues that would emerge from comparing perspectives. A synthesis step (after both bots review) would add value.
- No domain-specific reviewer for business logic. The review pipeline has a trading-domain reviewer job in CI, but it's focused on patterns rather than business correctness. A reviewer that understands event-sourcing invariants (e.g., "does this state transition preserve aggregate consistency?") would catch more logic bugs.
- REQUEST_CHANGES without actionable blockers. Several REQUEST_CHANGES reviews flagged CI failures as the primary blocker. That is correct, but unhelpful when the CI failure was due to the bot's own environment rather than the code. Cleaner distinction between "block on code issue" vs "block on CI" would reduce noise.
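One way to curb repeat passes is to skip a review when the bot has already reviewed the current head commit. The sketch below assumes each review record carries the commit it was made against (Gitea review objects expose a commit_id field; the flat reviewer key and both function names are hypothetical). It also counts how many past reviews on a PR were repeats, which would help confirm whether the 30-review PRs are mostly redundant passes.

```python
from collections import Counter


def needs_review(bot: str, head_sha: str, prior_reviews: list) -> bool:
    """True unless `bot` has already reviewed the PR at `head_sha`."""
    return not any(
        review["reviewer"] == bot and review["commit_id"] == head_sha
        for review in prior_reviews
    )


def redundant_reviews(reviews: list) -> int:
    """Count reviews that repeat an earlier (reviewer, commit) pair."""
    seen = Counter((r["reviewer"], r["commit_id"]) for r in reviews)
    return sum(count - 1 for count in seen.values())
```

Keying on the commit rather than on elapsed time means a new push always triggers a fresh review, while retries against unchanged code are dropped.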
6.3 Post-Merge Review
- 22% of findings are missing test coverage. This is the most persistent failure mode. The inline review catches missing @impl, type violations, and structural issues, but repeatedly misses "this happy path has no error-path test." A test coverage check as a first-class review step would help.
- Logger/telemetry violations are recurring. The same violations (Logger string interpolation, high-cardinality telemetry tags) appeared in PRs #633, #654, #664, #657, #671, #769 across a two-week period. These should become linting rules (mix credo custom check or CI step) rather than relying on reviewers to catch them.
- Post-merge review runs after the fact, not at PR time. The post-merge review is triggered by rodin on closed PRs, filing issues that must then be separately worked. This creates a lag between introduction and fix. Moving more of this checklist into the inline PR review would prevent introduction, not just detection.
- Deferred work not tracked. Issues like "PR #737: Logging namespace renames deferred without follow-up issue" show that the implementation PR did partial work and deferred the rest without creating tracking issues. The post-merge review catches this, but an inline check for "TODO without issue reference" would catch it sooner.
- No trend analysis on finding recurrence. The same categories (missing tests, Logger violations, missing diagrams) reappear weekly. There's no mechanism to track "this failure mode was found N times" and escalate it to a process fix. A running recurrence tracker would enable systemic fixes rather than whack-a-mole.
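The recurrence tracker suggested in the last item could start as a small aggregation over categorized findings. The sketch below is an assumption about how such a tracker might work: findings are supplied as (date, category) pairs that have already been categorized, and a category is escalated once it appears in at least min_weeks distinct ISO weeks. The function name and threshold are hypothetical.

```python
from collections import defaultdict
from datetime import date


def recurring_categories(findings, min_weeks: int = 2) -> dict:
    """Return {category: total_count} for categories seen in >= min_weeks weeks.

    `findings` is an iterable of (date, category) pairs; weeks are ISO
    (year, week) tuples so findings spanning a year boundary stay distinct.
    """
    weeks_seen = defaultdict(set)
    totals = defaultdict(int)
    for filed_on, category in findings:
        weeks_seen[category].add(tuple(filed_on.isocalendar()[:2]))
        totals[category] += 1
    return {
        category: totals[category]
        for category in totals
        if len(weeks_seen[category]) >= min_weeks
    }
```

Counting distinct weeks rather than raw totals separates a one-off burst (several findings from a single PR) from a failure mode that keeps coming back.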
7. Takeaways
What's working:
- The multi-model review pipeline (when both bots are active) is highly effective. It found real structural bugs (unhandled errors in PR #633, TOCTOU races in PR #429, missing behaviours in PR #418) that would have been expensive to fix later.
- Post-merge review is responsive — all 100 filed issues were closed, and the finding rate is declining as common patterns are addressed.
- The ai-review label creates a clean audit trail for escaped defects.
- PR velocity is high (~15–20/day) without sacrificing review rigor when both bots are active.
What's not working:
- Sonnet-review-bot dropout degraded the review gate significantly (32% → 2% REQUEST_CHANGES rate).
- Pre-code design isn't gated — implementation starts without formal review of the plan.
- Recurring violations (Logger, telemetry, test coverage) are caught reactively rather than prevented by tooling.
- Review round inflation (30 reviews/PR) creates noise without proportional quality benefit.
Highest-leverage improvements, in order:
- Restore dual-bot review (immediate: add Sonnet back to CI pipeline)
- Add a test coverage checklist as a first-class review step
- Add mix credo custom checks for Logger and telemetry violations
- Implement pre-code gate: implementation PR requires linked design issue with accepted ACs
- Add recurrence tracking for post-merge finding categories
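Before investing in a custom Credo check, the Logger rule from the list above could be prototyped as a plain CI script. The sketch below is a rough line-based scan, not a Credo check and not a parser: it flags single-line Logger calls whose first argument is a double-quoted string containing `#{`, and it will miss multi-line calls, heredocs, and sigils. The regex and function name are assumptions for illustration.

```python
import re

# Logger.<level> followed by "(" or a space, then a double-quoted string
# that contains an interpolation marker before the string closes.
LOGGER_INTERPOLATION = re.compile(
    r'Logger\.(?:debug|info|notice|warning|warn|error|critical|alert|emergency)'
    r'[ (]\s*"[^"\n]*#\{'
)


def find_logger_interpolation(source: str) -> list:
    """Return 1-based line numbers where a Logger call interpolates its message."""
    return [
        number
        for number, line in enumerate(source.splitlines(), start=1)
        if LOGGER_INTERPOLATION.search(line)
    ]
```

Run over each changed `.ex` file in CI, a non-empty result would fail the job and point at the offending lines. A proper Credo check working on the AST would be more precise and is the better long-term home for the rule.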
Data Summary
| Metric | Value |
|---|---|
| Total closed PRs analyzed | ~469 |
| PRs with AI review | ~267 |
| Total post-merge issues filed | 100 |
| Post-merge issue closure rate | 100% |
| Avg REQUEST_CHANGES rate (dual-bot) | ~32% |
| Avg REQUEST_CHANGES rate (single-bot) | ~2% |
| Most-reviewed PRs | PR #749, #755, #762 (30 reviews each) |
| Most post-merge findings in one PR | PR #566 (7 findings) |
| Most active review period | May 7–10 (>4 REQUEST_CHANGES per PR on complex features) |