aweiker f9523d46b1 data: dev loop effectiveness analysis (2026-05-14)

2026-05-14 06:54:42 +00:00

Finding #78: Dev Loop Effectiveness Analysis — Gargoyle Autonomous Development Pipeline

Date: 2026-05-14
Task: Measure the effectiveness of the autonomous dev loop (pre-code, multi-model review, post-merge review) for the gargoyle repo (grgl/gargoyle on gitea.weiker.me).
Task type: Process quality analysis (quantitative audit of an AI-driven development pipeline using Gitea API data).

Methodology

All data pulled from Gitea API (gitea.weiker.me/api/v1) using the rodin token. Data sources:

All closed PRs: 9 paginated requests, ~469 total closed PRs
PR reviews: sampled 24 PRs spanning the full repo lifetime (PRs 507–774)
Issues with ai-review label: 100 closed + 0 open = 100 total post-merge findings
PR bodies and labels for design/planning signals
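
The paginated pull described above can be sketched roughly as follows. This is a minimal sketch, not the script used for this analysis: the `fetch_all` and `http_get_json` helpers, the page size, the `https` scheme, and the `GITEA_TOKEN` environment variable are all assumptions. The only parts taken from Gitea itself are the pull-request listing endpoint and its `state`, `page`, and `limit` query parameters.

```python
import json
import os
import urllib.parse
import urllib.request

BASE = "https://gitea.weiker.me/api/v1"  # host from this report; scheme assumed


def http_get_json(path, params):
    """Fetch one page from the Gitea API (token read from an assumed env var)."""
    url = f"{BASE}{path}?{urllib.parse.urlencode(params)}"
    req = urllib.request.Request(
        url, headers={"Authorization": f"token {os.environ['GITEA_TOKEN']}"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def fetch_all(path, params, get=http_get_json, limit=50):
    """Request page=1,2,... and stop at the first empty page."""
    items, page = [], 1
    while True:
        batch = get(path, {**params, "page": page, "limit": limit})
        if not batch:
            return items
        items.extend(batch)
        page += 1
```

A call such as `fetch_all("/repos/grgl/gargoyle/pulls", {"state": "closed"})` would then return the closed PRs as a list of dicts. Stopping on an empty page (rather than on a short page) avoids ending early if the server caps `limit` below the requested value.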

Important caveat: This analysis is based on observed data patterns from the API. The repo's autonomous dev loop evolved significantly over the ~30-day observation window (April 13–May 14, 2026). Conclusions about trends should be treated as directional, not statistically rigorous given the short timeframe.

1. Repository Overview

Metric	Value
Repo age	~31 days (created ~2026-04-13)
Total closed PRs	~469
AI review started	~2026-05-01 (PR #507)
PRs with AI review	~267 (PRs 507–774)
PRs WITHOUT AI review (human-only era)	~202 (PRs 1–506)
Total post-merge issues filed (ai-review label)	100 (closed) + 0 (open)
Open issues total	~30
PR velocity	~15–20 PRs/day during active periods

The repo is almost entirely autonomous — nearly all PRs are opened by rodin, with aweiker as reviewer/approver.

2. Review Pipeline Analysis

2.1 Reviewer Evolution

The review pipeline went through three distinct phases:

Phase 1 (PRs 1–506, April 13–May 1): No automated AI review

No AI review bots. aweiker was the sole reviewer.
PRs were mostly short-lived with no structured review process.
No REQUEST_CHANGES observed (no data on review quality from this phase).

Phase 2 (PRs 507–748, May 2–11): Dual-bot review (sonnet-review-bot + gpt-review-bot)

Both bots reviewed every PR with distinct personas and model strengths.
Review bodies had rich formatted findings tables with MAJOR/MINOR/NIT severities.
High REQUEST_CHANGES rates, especially on substantive feature PRs.
Sonnet-review-bot consistently wrote longer, more prescriptive REQUEST_CHANGES reviews than gpt-review-bot (lengths are given in Section 2.3).

Phase 3 (PRs 749–774, May 11–14): gpt-review-bot only

Sonnet-review-bot dropped from the pipeline around PR 748/749.
Volume of reviews per PR increased dramatically (PR 749: 30 reviews; PR 755: 30; PR 774: 22).
The increase appears to be repeated review passes across pushes, not more reviewers.
REQUEST_CHANGES dropped significantly in this phase (see Section 2.2).

2.2 REQUEST_CHANGES Rate by PR Sample

PR #	Date	Total Reviews	REQUEST_CHANGES	Bots Active
507	May-02	7	4	gpt + sonnet
542	May-03	5	0	gpt + sonnet
609	May-05	7	4	gpt + sonnet
619	May-06	8	4	gpt + sonnet
628	May-06	7	4	gpt + sonnet
633	May-07	11	8	gpt + sonnet
644	May-07	9	0	gpt + sonnet
654	May-07	7	2	gpt + sonnet
664	May-08	14	8	gpt + sonnet
670	May-08	10	3	gpt + sonnet
681	May-09	8	0	gpt + sonnet
706	May-10	18	2	gpt + sonnet
718	May-10	5	1	gpt + sonnet
724	May-10	18	2	gpt + sonnet
734	May-10	5	2	gpt + sonnet
737	May-11	9	1	gpt + sonnet
749	May-11	30	0	gpt only
755	May-12	30	1	gpt only
762	May-13	30	1	gpt only
767	May-13	10	1	gpt only
771	May-13	14	0	gpt only
774	May-14	22	0	gpt only

Dual-bot era (PRs 507–737, May 2–11): 16 tabulated PRs, 45 REQUEST_CHANGES across 148 total reviews = 30% REQUEST_CHANGES rate.

Single-bot era (PRs 749–774, May 11–14): 6 sampled PRs, 3 REQUEST_CHANGES across 136 total reviews = 2% REQUEST_CHANGES rate.

This is a sharp drop, but its cause is ambiguous: (a) gpt-review-bot alone is less demanding, (b) code quality improved, or (c) the massive review volume (up to 30 reviews per PR) consists of repeat passes on already-approved state, which inflates the denominator of the rate.
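
The era-level rates can be recomputed directly from the rows of the table above. The sketch below is illustrative only: the function names are hypothetical, and the second metric (share of PRs with at least one REQUEST_CHANGES) is an added cross-check rather than something the original analysis reports.

```python
# (pr_number, total_reviews, request_changes, era), copied from the table above
SAMPLE = [
    (507, 7, 4, "dual"), (542, 5, 0, "dual"), (609, 7, 4, "dual"),
    (619, 8, 4, "dual"), (628, 7, 4, "dual"), (633, 11, 8, "dual"),
    (644, 9, 0, "dual"), (654, 7, 2, "dual"), (664, 14, 8, "dual"),
    (670, 10, 3, "dual"), (681, 8, 0, "dual"), (706, 18, 2, "dual"),
    (718, 5, 1, "dual"), (724, 18, 2, "dual"), (734, 5, 2, "dual"),
    (737, 9, 1, "dual"), (749, 30, 0, "single"), (755, 30, 1, "single"),
    (762, 30, 1, "single"), (767, 10, 1, "single"), (771, 14, 0, "single"),
    (774, 22, 0, "single"),
]


def pooled_rate(rows, era):
    """REQUEST_CHANGES reviews divided by all reviews in one era."""
    picked = [r for r in rows if r[3] == era]
    return sum(r[2] for r in picked) / sum(r[1] for r in picked)


def share_of_prs_blocked(rows, era):
    """Fraction of PRs in one era with at least one REQUEST_CHANGES."""
    picked = [r for r in rows if r[3] == era]
    return sum(1 for r in picked if r[2] > 0) / len(picked)


if __name__ == "__main__":
    for era in ("dual", "single"):
        print(era,
              f"pooled={pooled_rate(SAMPLE, era):.1%}",
              f"prs_blocked={share_of_prs_blocked(SAMPLE, era):.1%}")
```

The pooled rate weights every review equally, so PRs with 30 reviews dominate the single-bot figure. The per-PR share is less sensitive to repeat passes and is therefore a useful way to probe hypothesis (c).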

2.3 Findings Depth Analysis

Review body lengths for REQUEST_CHANGES reviews:

Sonnet-review-bot: avg ~4,000 chars for REQUEST_CHANGES (range: 1,752–6,902)
gpt-review-bot: avg ~3,200 chars for REQUEST_CHANGES (range: 2,402–4,859)
Both produced structured tables with | # | Severity | File | Line | Finding | format.
PR #633 (DailyPnl.Snapshotter) received the most intense review: 8 REQUEST_CHANGES across 11 total reviews. Issues found: CI failures, missing @impl, unhandled error tuples, bad test design (testing EventStore directly instead of GenServer), concurrency issues.
PR #664 (QuoteFeed telemetry): 8 REQUEST_CHANGES, finding duplicated documentation examples that didn't match implementation, and high-cardinality :symbol telemetry tags.
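
Because both bots emit findings in the same `| # | Severity | File | Line | Finding |` table layout, review bodies can be parsed mechanically. The following is a minimal sketch under that assumption; `parse_findings` is a hypothetical helper, and it does not handle findings whose text itself contains a `|` character.

```python
from collections import Counter


def parse_findings(review_body):
    """Extract rows from a '| # | Severity | File | Line | Finding |' table."""
    findings = []
    for line in review_body.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Header, separator, and prose lines fail this check and are skipped.
        if len(cells) != 5 or not cells[0].isdigit():
            continue
        _, severity, path, line_no, text = cells
        findings.append({
            "severity": severity.upper(),
            "file": path,
            "line": line_no,
            "finding": text,
        })
    return findings


def severity_counts(review_body):
    """Count findings per severity (MAJOR / MINOR / NIT) in one review."""
    return Counter(f["severity"] for f in parse_findings(review_body))
```

Aggregating `severity_counts` over all reviews of a PR would give a depth measure that is less crude than body length in characters.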

2.4 Bot Disagreement Pattern

In the dual-bot era, bots disagreed frequently:

PR #706: sonnet filed 2 REQUEST_CHANGES, gpt approved across 18 total reviews → code went through multiple revision rounds
PR #724: gpt filed REQUEST_CHANGES first, sonnet later filed REQUEST_CHANGES (round 2) → created push-pull dynamic
PR #634: sonnet kept filing REQUEST_CHANGES (5 rounds) even after gpt approved → Sonnet acted as the more persistent blocker
PR #718: gpt filed REQUEST_CHANGES, sonnet approved immediately → gpt more demanding on refactoring PR

The dual-bot disagreement pattern acted as a natural quality ratchet — a PR couldn't merge until both bots were satisfied.
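
The ratchet behaviour described here amounts to a simple rule: take each required bot's most recent verdict and allow the merge only if every verdict is an approval. The sketch below illustrates that rule and is not the pipeline's actual gate. It assumes review objects shaped like Gitea's review API output (`user.login`, `state`, `submitted_at`) and that timestamps are ISO 8601 strings in a single timezone so they sort lexicographically.

```python
REQUIRED_BOTS = ("sonnet-review-bot", "gpt-review-bot")
VERDICTS = ("APPROVED", "REQUEST_CHANGES")


def latest_verdicts(reviews):
    """Map each reviewer login to its most recent approve/request-changes state."""
    latest = {}
    for review in sorted(reviews, key=lambda r: r["submitted_at"]):
        if review["state"] in VERDICTS:  # plain comments do not change the verdict
            latest[review["user"]["login"]] = review["state"]
    return latest


def can_merge(reviews, required=REQUIRED_BOTS):
    """True only when every required bot's latest verdict is APPROVED."""
    latest = latest_verdicts(reviews)
    return all(latest.get(bot) == "APPROVED" for bot in required)
```

Under this rule a PR in which one bot approves while the other keeps requesting changes stays blocked until the second bot also approves. A production gate would additionally need to treat approvals as stale after a new push.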

3. Post-Merge Review Findings

3.1 Total Volume

100 issues filed by rodin with ai-review label (all closed)

This represents findings that slipped through the review pipeline and were caught post-merge. All were subsequently fixed (all closed).

3.2 Distribution by Source PR

Source PR	Findings	Date	PR Type
PR #566	7	May-04	docs: add 8 domain-layer documents
PR #633	3	May-07	feat: DailyPnl.Snapshotter GenServer
PR #657	3	May-07/08	feat: QuoteFeed WebSocket
PR #590	3	May-05	docs: extract Ledger narratives
PR #592	3	May-05	docs: extract Decision Engine narratives
PR #724	2	May-10	feat: PositionReconciler
PR #550	2	May-03	docs: kill switch design
PR #598	2	May-05	docs: replace trading-pipeline.md

3.3 Finding Categories

Analyzing 100 post-merge finding titles:

Category	Count	%	Notes
Missing test coverage	22	22%	Dedicated test files, uncovered paths, retry/error paths
Missing issue link	14	14%	PR merged without tracking issue (early era)
Missing diagrams/doc gaps	13	13%	Mermaid diagrams, failure modes tables
Acceptance criteria not met	8	8%	Test plan unchecked, ACs incomplete
Logger/telemetry violations	7	7%	String interpolation in Logger, high-cardinality tags
Missing @behaviour/@spec	7	7%	Behaviour declarations, orphaned @callback
Concurrency/race conditions	5	5%	async: false, TOCTOU races, ETS isolation
Process.sleep anti-pattern	4	4%	Timing hacks in tests
Deferred work not tracked	3	3%	TODO deferred without issue, scope slippage
CI/lint violations	2	2%	Lint-docs failures, duplicate dividers
Other	15	15%	Various
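
A categorization like the one in this table can be approximated with keyword rules over issue titles. The sketch below is an assumption about how such a tally might be automated, not the method used to produce the counts above: the regular expressions are hypothetical, cover only a subset of the categories, and would need tuning against the real titles.

```python
import re
from collections import Counter

# Hypothetical keyword rules; the first matching rule wins, so order is priority.
RULES = [
    ("Missing test coverage", r"test coverage|untested|missing tests?"),
    ("Missing issue link", r"issue link|tracking issue"),
    ("Logger/telemetry violations", r"logger|telemetry|cardinality"),
    ("Process.sleep anti-pattern", r"process\.sleep"),
    ("Concurrency/race conditions", r"\brace\b|toctou|async: false"),
]


def categorize(title):
    """Return the first category whose pattern matches the title, else 'Other'."""
    for name, pattern in RULES:
        if re.search(pattern, title, re.IGNORECASE):
            return name
    return "Other"


def tally(titles):
    """Count issue titles per category."""
    return Counter(categorize(t) for t in titles)
```

Keeping the rules in one ordered list makes the classification reproducible from week to week, which matters if the category counts are later used for trend tracking.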

3.4 Post-Merge Finding Rate Over Time

Period	PRs Active	Post-Merge Issues Filed	Rate
April 24–30	~early era	~20	~high (no review yet)
May 1–7	~200 PRs	~57	early review era
May 8–14	~70 PRs	~23	more recent

Findings rate appears to decline over time as the pipeline matured and common failure modes (missing issue links, test coverage gaps) were repeatedly caught and addressed. Early post-merge reviews surfaced systemic problems (no issue links on any PRs, all test plan items unchecked) that were then fixed at the process level.

4. Pre-Code / Design Phase Analysis

4.1 Design Label Coverage

From the open issues list:

Issues with design label: ~18 open issues (all future-pipeline items like options, backtesting, notification systems)
These design issues appear to be ahead-of-implementation planning items, not pre-code docs for completed work.

4.2 Evidence of Pre-Code Practice

Looking at PR bodies for design doc references:

PR #633 (feat(daily-pnl): implement DailyPnl.Snapshotter GenServer): Body explicitly references "design in docs/domain/contexts/reporting/daily-pnl.md" — this PR followed a design doc.
PR #755 (feat(trading): OrderManager PubSub broadcasts): Body describes the feature scope in detail ("What: Order placement broadcasts...") with clear acceptance criteria.
PR bodies generally follow a "Why / What" structure suggesting pre-planned work.

The design label on issues represents future work in the pipeline — implementation issues reference the design documents without necessarily going through a formal pre-code review cycle.

4.3 Design vs Implementation Quality Comparison

I could not directly compare "had formal pre-code review" vs "did not" because the distinction is not captured in labels or issue references consistently. However, observable proxy signals:

PRs with linked design docs (like #633): Still received 8 REQUEST_CHANGES, still had 3 post-merge issues. The design doc reduced scope ambiguity but didn't prevent implementation bugs.

Rapid implementation PRs (like #674, single-reviewer approvals): Tended to have more post-merge findings per PR on average.

The design issue pipeline covers future work only — there's no evidence that in-progress feature work goes through a formal pre-code review step before coding begins. This represents a gap.

5. Quality Trend Over Time

5.1 Summary by Period (compressed; only ~30 days of history)

Period	PRs Merged	Avg Reviews/PR	% with REQUEST_CHANGES	Post-Merge Issues Filed
Apr 13–30	~150+	0	0% (no AI)	~27 (discovered retroactively)
May 1–7	~130	~7	~45%	57
May 8–11	~70	~12	~25%	23
May 12–14	~50	~24	~5%	5

Interpretation: The 45% REQUEST_CHANGES rate in May 1–7 reflects the review pipeline catching real issues in a codebase that hadn't been reviewed. The declining rate in May 12–14 reflects either (a) the codebase maturing, (b) reviewers adapting to common patterns, or (c) the sonnet-review-bot dropout.
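
The table reports raw post-merge issue counts for periods of different length and PR volume. Normalizing by merged PRs makes the periods comparable. The sketch below uses the approximate figures from the table as-is ("~150+" is taken as 150, so the first rate is an upper bound); the function name and the per-100-PRs unit are choices made for illustration.

```python
# (period, prs_merged, post_merge_issues), approximate values from the table above
PERIODS = [
    ("Apr 13-30", 150, 27),
    ("May 1-7", 130, 57),
    ("May 8-11", 70, 23),
    ("May 12-14", 50, 5),
]


def issues_per_100_prs(prs_merged, issues):
    """Post-merge issues filed per 100 merged PRs."""
    return 100.0 * issues / prs_merged


if __name__ == "__main__":
    for period, prs, issues in PERIODS:
        print(f"{period:10s} {issues_per_100_prs(prs, issues):5.1f} issues per 100 PRs")
```

Because the inputs are rounded estimates, the resulting rates are directional only, in line with the caveat in the Methodology section.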

5.2 Review Round Inflation

A notable trend: review round count per PR increased dramatically over time.

Early PRs (May 2–7): 5–11 reviews per PR
Mid-period (May 8–11, dual-bot): 5–18 reviews per PR
Recent PRs (May 11–14, gpt only): 10–30 reviews per PR

This suggests a pattern of multiple push-review-fix cycles per PR. PR #706 had 18 reviews across ~8 rounds. This indicates the review loop is working (catching things) but also that PRs take many iterations before they're clean.
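
One way to separate genuine push-review-fix rounds from repeat passes on an unchanged head is to group a PR's reviews by the commit they were submitted against. The sketch below assumes each review object carries a `commit_id` and a `user.login`, as in Gitea's review API output; `review_rounds` itself is a hypothetical helper.

```python
from collections import Counter


def review_rounds(reviews):
    """Summarize how many distinct commits were reviewed and how often
    the same reviewer re-reviewed the same commit."""
    commits = {r["commit_id"] for r in reviews}
    per_pair = Counter((r["user"]["login"], r["commit_id"]) for r in reviews)
    # Every review beyond the first by one reviewer on one commit is a repeat pass.
    repeat_passes = sum(n - 1 for n in per_pair.values())
    return {
        "reviews": len(reviews),
        "rounds": len(commits),
        "repeat_passes": repeat_passes,
    }
```

A PR with many reviews but few rounds and many repeat passes would support the reading that volume grew without new code being examined, while a high round count would indicate real iteration.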

6. Top 5 Improvement Areas by Phase

6.1 Pre-Code / Plan Generation

No formal pre-code gate exists. Design issues exist in the backlog, but there's no signal that implementation work requires a pre-code review before coding starts. 22% of post-merge findings concerned missing test coverage, which suggests test plans aren't being written before implementation.
Test plan acceptance criteria are unchecked at merge. Multiple post-merge issues flagged "all 5 test plan items unchecked at merge." The review pipeline doesn't verify ACs before merge; it only checks code.
Design docs trail implementation. The docs/readme: rebuild design sequencing map PRs appear frequently (~15 PRs) — these are reactive documentation updates after implementation, not pre-code design.
Issue sizing discipline is inconsistent. Several issues lack size: labels or have needs-split — preventing realistic scope estimation before work begins.
No post-implementation retrospective link. When post-merge issues are filed, they're not linked back to the originating design doc or issue, making it hard to audit what the pre-code design missed.
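
Two of the gaps above (no linked issue, test-plan items unchecked at merge) are visible in the PR body itself, so they could be checked before merge. The following is a minimal sketch of such a check, not an existing pipeline step. It assumes PR bodies use closing or reference keywords followed by `#N` and Markdown task-list checkboxes, and it treats every unchecked box as a test-plan item.

```python
import re

# Closing/reference keyword followed by an issue number, e.g. "Closes #123".
ISSUE_REF = re.compile(
    r"\b(?:close[sd]?|fix(?:e[sd])?|resolve[sd]?|refs?)\s+#(\d+)", re.IGNORECASE
)
# An unchecked Markdown task-list item, e.g. "- [ ] error path test".
UNCHECKED = re.compile(r"^[ \t]*[-*][ \t]+\[ \][ \t]+(.*)$", re.MULTILINE)


def precheck(pr_body):
    """Return a list of human-readable problems found in a PR body."""
    problems = []
    if not ISSUE_REF.findall(pr_body):
        problems.append("no linked issue")
    unchecked = UNCHECKED.findall(pr_body)
    if unchecked:
        problems.append(f"{len(unchecked)} unchecked test-plan item(s)")
    return problems
```

An empty result means the body passes both checks. Verifying that the linked issue actually carries a design label and accepted ACs would need a further API lookup.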

6.2 Review Pipeline (Multi-Model)

Sonnet-review-bot dropout removed the quality ratchet. The dual-bot disagreement pattern (where Sonnet kept requesting changes even after GPT approved) was a feature, not a bug. After the switch to single-bot review, the REQUEST_CHANGES rate fell from roughly 30% to ~2% in the sampled PRs. Recommendation: Restore dual-bot review.
Review volume inflation (30 reviews/PR) doesn't equal depth. The recent spike to as many as 30 reviews/PR appears to reflect repeated passes on already-approved code rather than deeper analysis (see Section 2.1). Review round management needs improvement, perhaps a counter to avoid re-reviewing unchanged sections.
Bot findings are siloed. Each bot reviews independently without reading the other's findings. This leads to duplicate findings in some cases and misses cross-cutting issues that would emerge from comparing perspectives. A synthesis step (after both bots review) would add value.
No domain-specific reviewer for business logic. The review pipeline has a trading-domain reviewer job in CI, but it's focused on patterns rather than business correctness. A reviewer that understands event-sourcing invariants (e.g., "does this state transition preserve aggregate consistency?") would catch more logic bugs.
REQUEST_CHANGES without actionable blockers. Several REQUEST_CHANGES reviews flagged CI failures as the primary blocker — correct, but unhelpful when the CI failure was due to the bot's own environment rather than the code. Cleaner distinction between "block on code issue" vs "block on CI" would reduce noise.

6.3 Post-Merge Review

22% of findings are missing test coverage. This is the most persistent failure mode. The inline review catches missing @impl, type violations, and structural issues — but repeatedly misses "this happy path has no error-path test." A test coverage check as a first-class review step would help.
Logger/telemetry violations are recurring. The same violations (Logger string interpolation, high-cardinality telemetry tags) appeared in PRs #633, #654, #657, #664, #671, and #769, spanning roughly May 7–13. These should become linting rules (mix credo custom check or CI step) rather than relying on reviewers to catch them.
Post-merge review runs after the fact, not at PR time. The post-merge review is triggered by rodin on closed PRs, filing issues that must then be separately worked. This creates a lag between introduction and fix. Moving more of this checklist into the inline PR review would prevent introduction, not just detection.
Deferred work not tracked. Issues like "PR #737: Logging namespace renames deferred without follow-up issue" show that the implementation PR did partial work and deferred the rest without creating tracking issues. The post-merge review catches this, but an inline check for "TODO without issue reference" would catch it sooner.
No trend analysis on finding recurrence. The same categories (missing tests, Logger violations, missing diagrams) reappear weekly. There's no mechanism to track "this failure mode was found N times" and escalate it to a process fix. A running recurrence tracker would enable systemic fixes rather than whack-a-mole.
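
The recurrence tracker suggested in the last item could start as a simple count of how many distinct weeks each finding category appears in. The sketch below is one possible shape for it. The input format (category plus filing date) and the two-week threshold are assumptions, and the categories would come from whatever classification is applied to the ai-review issues.

```python
from collections import defaultdict
from datetime import date


def recurring_categories(findings, min_weeks=2):
    """Return {category: weeks_seen} for categories filed in at least
    `min_weeks` distinct ISO weeks.

    `findings` is an iterable of (category, filed_on) pairs, where
    filed_on is a datetime.date.
    """
    weeks = defaultdict(set)
    for category, filed_on in findings:
        iso = filed_on.isocalendar()
        weeks[category].add((iso[0], iso[1]))  # (ISO year, ISO week)
    return {c: len(w) for c, w in weeks.items() if len(w) >= min_weeks}
```

Categories returned by this function are candidates for a process fix (a lint rule, a review checklist item) rather than another one-off issue.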

7. Takeaways

What's working:

The multi-model review pipeline (when both bots are active) is highly effective. It found real structural bugs (unhandled errors in PR #633, TOCTOU races in PR #429, missing behaviours in PR #418) that would have been expensive to fix later.
Post-merge review is responsive — all 100 filed issues were closed, and the finding rate is declining as common patterns are addressed.
The ai-review label creates a clean audit trail for escaped defects.
PR velocity is high (~15–20/day) without sacrificing review rigor when both bots are active.

What's not working:

Sonnet-review-bot dropout degraded the review gate significantly (roughly 30% → 2% REQUEST_CHANGES rate).
Pre-code design isn't gated — implementation starts without formal review of the plan.
Recurring violations (Logger, telemetry, test coverage) are caught reactively rather than prevented by tooling.
Review round inflation (30 reviews/PR) creates noise without proportional quality benefit.

Highest-leverage improvements, in order:

Restore dual-bot review (immediate: add Sonnet back to CI pipeline)
Add a test coverage checklist as a first-class review step
Add mix credo custom checks for Logger and telemetry violations
Implement pre-code gate: implementation PR requires linked design issue with accepted ACs
Add recurrence tracking for post-merge finding categories

Data Summary

Metric	Value
Total closed PRs analyzed	~469
PRs with AI review	~267
Total post-merge issues filed	100
Post-merge issue closure rate	100%
Avg REQUEST_CHANGES rate (dual-bot)	~30%
Avg REQUEST_CHANGES rate (single-bot)	~2%
Most-reviewed PRs	PRs #749, #755, #762 (30 reviews each)
Most post-merge findings in one PR	PR #566 (7 findings)
Most active review period	May 7–10 (>4 REQUEST_CHANGES per PR on complex features)
