Investigate stale commit reviews on PR #687 #52

New Issue

2026-05-09T14:48:46Z

rodin commented

2026-05-09 14:48:46 +00:00

Problem

On gargoyle PR #687, GPT-5 filed 14 consecutive REQUEST_CHANGES reviews against commits that had already been superseded. The reviews referenced line numbers and code that no longer existed at HEAD.

Timeline (2026-05-09)

HEAD moved to 758910f with fixes at ~7:04am
GPT-5 reviews continued citing "Evaluated against ea1c6028" and "71828e5" (older commits)
Reviews raised findings that were already addressed in the current code
Eventually review #1790 correctly evaluated 758910f and found a legitimate new issue

Evidence

Review #1788: "Evaluated against 71828e5e" — cited line 380, 513, 404 which did not match actual code locations
Review #1786: "Evaluated against ea1c6028" — same stale findings
15 total REQUEST_CHANGES reviews, ~14 against stale commits

Questions to investigate

Is review-bot fetching the diff/files from an outdated ref?
Is there a race condition where the review triggers before the push is visible?
Is prior review text being included in context, causing the model to reinforce its own stale findings?
Why did it eventually self-correct on review #1790?

Impact

Wasted CI cycles (15 review runs)
Noise in PR timeline
Could have blocked a legitimate fix if not manually investigated

Suggested fix

Verify the commit SHA being reviewed matches the PR head SHA at review time
Consider adding a sanity check that the evaluated commit matches what was requested
Log the exact ref being checked out vs the PR head for debugging

## Problem On gargoyle PR #687, GPT-5 filed 14 consecutive `REQUEST_CHANGES` reviews against commits that had already been superseded. The reviews referenced line numbers and code that no longer existed at HEAD. ### Timeline (2026-05-09) - HEAD moved to `758910f` with fixes at ~7:04am - GPT-5 reviews continued citing "Evaluated against ea1c6028" and "71828e5" (older commits) - Reviews raised findings that were already addressed in the current code - Eventually review #1790 correctly evaluated `758910f` and found a legitimate new issue ### Evidence - Review #1788: "Evaluated against 71828e5e" — cited line 380, 513, 404 which did not match actual code locations - Review #1786: "Evaluated against ea1c6028" — same stale findings - 15 total REQUEST_CHANGES reviews, ~14 against stale commits ### Questions to investigate 1. Is review-bot fetching the diff/files from an outdated ref? 2. Is there a race condition where the review triggers before the push is visible? 3. Is prior review text being included in context, causing the model to reinforce its own stale findings? 4. Why did it eventually self-correct on review #1790? ### Impact - Wasted CI cycles (15 review runs) - Noise in PR timeline - Could have blocked a legitimate fix if not manually investigated ## Suggested fix - Verify the commit SHA being reviewed matches the PR head SHA at review time - Consider adding a sanity check that the evaluated commit matches what was requested - Log the exact ref being checked out vs the PR head for debugging

rodin commented

2026-05-09 15:02:33 +00:00

Investigation Findings

Investigated the review history on PR #687. Here's what I found:

What Actually Happened

Each review correctly reports the commit it evaluated:

Review ID	Commit	State
1759	8ceacfc1	APPROVED
1761	e48d8810	REQUEST_CHANGES
1765	4f0dba1b	REQUEST_CHANGES
... (12 more)	...	REQUEST_CHANGES
1790	758910f0	REQUEST_CHANGES
1795	8a7c319d	APPROVED

The reviews ARE evaluating different commits, not the same stale commit repeatedly. The "Evaluated against" footer matches the commit_id Gitea stores for each review.

Root Cause

The issue isn't stale commit evaluation - it's review-per-commit triggering:

The workflow triggers on synchronize (every push)
Each push to the PR branch triggers a fresh review
PR #687 had 20 commits, generating 16 reviews from GPT-5 alone
Fixes addressed one issue but exposed related issues, causing repeated REQUEST_CHANGES

The findings look similar across reviews because:

The underlying code pattern (index failure handling) was iteratively fixed
Each review found slightly different manifestations of the same concern
Line numbers shifted as code was added/modified

Why It Felt Like Stale Reviews

The rapid succession of REQUEST_CHANGES reviews created noise. When a fix was pushed, a new review would fire and often find a related but distinct issue. This felt like "the same complaint over and over" but each review was legitimately evaluating different code.

Suggested Improvements

Dedupe by finding similarity: Before posting, compare new findings against existing non-superseded reviews. If >80% similarity, skip posting or note "same concern as review #X"
Rate-limit reviews per PR: Add a cooldown (e.g., 5 min) between reviews on the same PR. Use job concurrency or a state file.
Skip intermediate commits: Instead of reviewing every synchronize, only review:
- When PR is opened/reopened
- When PR moves from draft to ready
- When explicitly requested (comment trigger)
- Periodically (e.g., every 30 min if there are new commits)
Acknowledge fixes: Include diff between last-reviewed commit and current commit in context, so the model can recognize "this concern from review #X appears addressed"

Not a Bug

The review-bot is working as designed - it reviews the commit it's asked to review. The issue is the workflow configuration triggering too many reviews. This is more of a gargoyle CI configuration issue than a review-bot bug.

Recommend either:

Changing gargoyle's workflow to not trigger on every synchronize
Or implementing deduplication in review-bot itself

## Investigation Findings Investigated the review history on PR #687. Here's what I found: ### What Actually Happened Each review correctly reports the commit it evaluated: | Review ID | Commit | State | |-----------|--------|-------| | 1759 | 8ceacfc1 | APPROVED | | 1761 | e48d8810 | REQUEST_CHANGES | | 1765 | 4f0dba1b | REQUEST_CHANGES | | ... (12 more) | ... | REQUEST_CHANGES | | 1790 | 758910f0 | REQUEST_CHANGES | | 1795 | 8a7c319d | APPROVED | The reviews ARE evaluating different commits, not the same stale commit repeatedly. The "Evaluated against" footer matches the `commit_id` Gitea stores for each review. ### Root Cause The issue isn't stale commit evaluation - it's **review-per-commit triggering**: 1. The workflow triggers on `synchronize` (every push) 2. Each push to the PR branch triggers a fresh review 3. PR #687 had 20 commits, generating 16 reviews from GPT-5 alone 4. Fixes addressed one issue but exposed related issues, causing repeated REQUEST_CHANGES The findings look similar across reviews because: - The underlying code pattern (index failure handling) was iteratively fixed - Each review found slightly different manifestations of the same concern - Line numbers shifted as code was added/modified ### Why It Felt Like Stale Reviews The rapid succession of REQUEST_CHANGES reviews created noise. When a fix was pushed, a new review would fire and often find a related but distinct issue. This felt like "the same complaint over and over" but each review was legitimately evaluating different code. ### Suggested Improvements 1. **Dedupe by finding similarity**: Before posting, compare new findings against existing non-superseded reviews. If >80% similarity, skip posting or note "same concern as review #X" 2. **Rate-limit reviews per PR**: Add a cooldown (e.g., 5 min) between reviews on the same PR. Use job concurrency or a state file. 3. **Skip intermediate commits**: Instead of reviewing every `synchronize`, only review: - When PR is opened/reopened - When PR moves from draft to ready - When explicitly requested (comment trigger) - Periodically (e.g., every 30 min if there are new commits) 4. **Acknowledge fixes**: Include diff between last-reviewed commit and current commit in context, so the model can recognize "this concern from review #X appears addressed" ### Not a Bug The review-bot is working as designed - it reviews the commit it's asked to review. The issue is the workflow configuration triggering too many reviews. This is more of a gargoyle CI configuration issue than a review-bot bug. Recommend either: - Changing gargoyle's workflow to not trigger on every synchronize - Or implementing deduplication in review-bot itself

rodin referenced this issue from a commit

2026-05-10 01:08:10 +00:00

fix: skip posting review when HEAD moves during evaluation

rodin referenced a pull request that will close this issue

2026-05-10 01:08:15 +00:00

fix: skip posting review when HEAD moves during evaluation #53

rodin referenced this issue

2026-05-10 06:06:05 +00:00

fix: skip posting review when HEAD moves during evaluation #53

rodin referenced this issue from a commit

2026-05-10 06:18:42 +00:00

fix: skip posting review when HEAD moves during evaluation

rodin referenced this issue

2026-05-10 06:46:29 +00:00

fix: skip posting review when HEAD moves during evaluation #53

rodin commented

2026-05-10 07:10:04 +00:00

Follow-up Investigation: Stale Commit Reviews Analysis

Summary

After reviewing the evidence from PR #687 and PR #53, I can confirm and expand on the previous investigation findings.

Timeline Verification

Looking at the GPT review history on PR #687:

Review	Commit	Time	State
1759	`8ceacfc1`	07:06:45Z	✅ APPROVED
1761	`e48d8810`	07:19:52Z	❌ REQUEST_CHANGES
1765	`4f0dba1b`	07:55:47Z	❌ REQUEST_CHANGES
... (10 more reviews on different commits) ...
1788	`71828e5e`	13:57:01Z	❌ REQUEST_CHANGES
1790	`758910f0`	14:27:20Z	❌ REQUEST_CHANGES
1795	`8a7c319d`	14:41:30Z	✅ APPROVED

Key observation: Each review IS evaluating a different commit SHA. The commit_id in Gitea's review metadata matches the "Evaluated against" footer in the review body.

Root Cause Analysis

The original issue description hypothesized:

❓ Is review-bot fetching the diff from an outdated ref?
❓ Is there a race condition where review triggers before push is visible?
❓ Is prior review text being included in context, causing the model to reinforce stale findings?

Answer: None of the above. The real issue is:

The workflow triggers on every synchronize event (push to PR branch)
PR #687 had 20+ commits pushed in rapid succession
Each push triggered a fresh review workflow
GPT-5 found legitimate but related issues across commits, generating repeated REQUEST_CHANGES

The findings looked repetitive because the underlying domain concerns (index failure handling, idempotency) required multiple iterations to fully address.

Does PR #53 Address the Root Cause?

Partially. PR #53 adds a shouldSkipStaleReview() check that:

Re-fetches PR metadata before posting
Compares evaluated SHA vs current HEAD
Skips posting if HEAD has moved

This would help if a review was in-flight when a new push arrived. However:

PR #53 does NOT address:

Review-per-commit triggering — the workflow still triggers on every synchronize
Similar findings across commits — no deduplication logic
Review frequency — no cooldown between reviews on same PR

Recommendations

Near-term (PR #53 is valuable):

Merge PR #53 — it prevents posting truly stale reviews when HEAD moves during evaluation
This is a defensive guard that reduces noise

For fuller resolution (separate work):

Workflow-level rate limiting — add job concurrency controls in gargoyle's CI workflow
Dedupe by similarity — compare new findings against existing open reviews before posting
Batch reviews — instead of reviewing every push, review on intervals or triggers

Verdict

PR #53 addresses one edge case (HEAD moves during evaluation) but doesn't solve the core problem demonstrated in PR #687, which is too many reviews triggered by rapid pushes. The fix is still valuable and should merge, but the issue should remain open for follow-up work on review frequency/deduplication.

If this is considered sufficient for now, close the issue with a note about future improvements. If the full problem needs solving, this issue should spawn follow-up tasks for workflow rate-limiting and finding deduplication.

## Follow-up Investigation: Stale Commit Reviews Analysis ### Summary After reviewing the evidence from PR #687 and PR #53, I can confirm and expand on the previous investigation findings. ### Timeline Verification Looking at the GPT review history on PR #687: | Review | Commit | Time | State | |--------|--------|------|-------| | 1759 | `8ceacfc1` | 07:06:45Z | ✅ APPROVED | | 1761 | `e48d8810` | 07:19:52Z | ❌ REQUEST_CHANGES | | 1765 | `4f0dba1b` | 07:55:47Z | ❌ REQUEST_CHANGES | | ... (10 more reviews on different commits) ... | | 1788 | `71828e5e` | 13:57:01Z | ❌ REQUEST_CHANGES | | 1790 | `758910f0` | 14:27:20Z | ❌ REQUEST_CHANGES | | 1795 | `8a7c319d` | 14:41:30Z | ✅ APPROVED | **Key observation:** Each review IS evaluating a different commit SHA. The `commit_id` in Gitea's review metadata matches the "Evaluated against" footer in the review body. ### Root Cause Analysis The original issue description hypothesized: 1. ❓ Is review-bot fetching the diff from an outdated ref? 2. ❓ Is there a race condition where review triggers before push is visible? 3. ❓ Is prior review text being included in context, causing the model to reinforce stale findings? **Answer:** None of the above. The real issue is: - The workflow triggers on every `synchronize` event (push to PR branch) - PR #687 had 20+ commits pushed in rapid succession - Each push triggered a fresh review workflow - GPT-5 found legitimate but related issues across commits, generating repeated REQUEST_CHANGES The findings **looked** repetitive because the underlying domain concerns (index failure handling, idempotency) required multiple iterations to fully address. ### Does PR #53 Address the Root Cause? **Partially.** PR #53 adds a `shouldSkipStaleReview()` check that: - Re-fetches PR metadata before posting - Compares evaluated SHA vs current HEAD - Skips posting if HEAD has moved This would help if a review was in-flight when a new push arrived. However: **PR #53 does NOT address:** 1. **Review-per-commit triggering** — the workflow still triggers on every synchronize 2. **Similar findings across commits** — no deduplication logic 3. **Review frequency** — no cooldown between reviews on same PR ### Recommendations **Near-term (PR #53 is valuable):** - Merge PR #53 — it prevents posting truly stale reviews when HEAD moves during evaluation - This is a defensive guard that reduces noise **For fuller resolution (separate work):** 1. **Workflow-level rate limiting** — add job concurrency controls in gargoyle's CI workflow 2. **Dedupe by similarity** — compare new findings against existing open reviews before posting 3. **Batch reviews** — instead of reviewing every push, review on intervals or triggers ### Verdict **PR #53 addresses one edge case** (HEAD moves during evaluation) but doesn't solve the core problem demonstrated in PR #687, which is **too many reviews triggered by rapid pushes**. The fix is still valuable and should merge, but the issue should remain open for follow-up work on review frequency/deduplication. If this is considered sufficient for now, close the issue with a note about future improvements. If the full problem needs solving, this issue should spawn follow-up tasks for workflow rate-limiting and finding deduplication.

rodin commented

2026-05-10 08:58:37 +00:00

Investigation Complete

Summary

This issue has been thoroughly investigated (see previous comments #12314 and #13592). The key findings:

Root cause is NOT stale commits: Each review correctly evaluates the commit it was triggered for. The "stale" perception came from many reviews firing rapidly on successive pushes.
PR #53 addresses the edge case: When HEAD moves during review processing (between fetch and post), the new shouldSkipStaleReview() check prevents posting outdated findings.
PR #53 current status:
- Code changes are complete and tested
- CI failures are environmental (some models like gpt-4.1, gpt-5-mini, claude-sonnet-4-6 may not be available in the API)
- Aaron's feedback about CI workflow changes has been addressed (those changes were reverted)

Recommended Action

PR #53 should be merged once CI is green (may need to disable unavailable model reviewers in CI matrix).

For further improvements beyond PR #53:

Early HEAD check before LLM — save tokens by checking if HEAD moved before calling the LLM
Review rate limiting — workflow-level concurrency or cooldowns
Finding deduplication — compare against existing reviews before posting

These are nice-to-haves and can be addressed in follow-up issues if needed.

No Additional Code Needed

This investigation issue doesn't require new implementation work — PR #53 already contains the fix. Closing this issue in favor of tracking via PR #53.

## Investigation Complete ### Summary This issue has been thoroughly investigated (see previous comments #12314 and #13592). The key findings: 1. **Root cause is NOT stale commits**: Each review correctly evaluates the commit it was triggered for. The "stale" perception came from many reviews firing rapidly on successive pushes. 2. **PR #53 addresses the edge case**: When HEAD moves *during* review processing (between fetch and post), the new `shouldSkipStaleReview()` check prevents posting outdated findings. 3. **PR #53 current status**: - Code changes are complete and tested - CI failures are environmental (some models like gpt-4.1, gpt-5-mini, claude-sonnet-4-6 may not be available in the API) - Aaron's feedback about CI workflow changes has been addressed (those changes were reverted) ### Recommended Action PR #53 should be merged once CI is green (may need to disable unavailable model reviewers in CI matrix). For further improvements beyond PR #53: 1. **Early HEAD check before LLM** — save tokens by checking if HEAD moved before calling the LLM 2. **Review rate limiting** — workflow-level concurrency or cooldowns 3. **Finding deduplication** — compare against existing reviews before posting These are nice-to-haves and can be addressed in follow-up issues if needed. ### No Additional Code Needed This investigation issue doesn't require new implementation work — PR #53 already contains the fix. Closing this issue in favor of tracking via PR #53.

rodin closed this issue

2026-05-10 08:58:42 +00:00

Sign in to join this conversation.

Branches Tags

main

ci/cleanup

ci-selfreview-gate

issue-150

issue-157

issue-141

issue-154

review-bot-dev-loop

issue-143

issue-146

pr-153

review-bot-issue-130-work

issue-148

issue-139

issue-137

review-bot-fixes

review-bot-issue-133

review-bot-issue-130

issue-130

github-support

issue-123-work

issue-123

review-bot-issue-120

fix/125-readme-cli-example

issue-125

issue-124

issue-120

feature/github-support

review-bot-issue-116

review-bot-issue-115

review-bot-issue-114

review-bot-issue-96

review-bot-issue-107

review-bot-issue-82

review-bot-issue-95

review-bot-issue-92

review-bot-issue-94

review-bot-issue-81

review-bot-issue-91

review-bot-issue-97

issue-80-c-file-reader

issue-80-b-pr-reader

issue-80-a-client

review-bot-issue-80

review-bot-issue-87

review-bot-issue-79

review-bot-issue-84

review-bot-issue-78

issue-73

issue-70

issue-68

issue-66

issue-64

issue-60-remote-personas

issue-60

issue-57

allow-deps

feat/aicore-provider-v2

issue-51

ci/pr-ready-gate

fix/stale-commit-check

feat/aicore-provider

fix/response-body-truncation

fix/json-repair

fix/sonnet-reviewer

fix/consistent-path-escape

feat/inline-review-comments

feat/6-update-existing-review

fix/19-context-overflow

feat/18-anthropic-api

fix/url-escaping-and-shadow

fix/quick-wins

fix/context-and-encapsulation

docs/code-review-report

ci/release-workflow

1 Participants

Notifications

Due Date

No due date set.

Dependencies

No dependencies set.

Reference: rodin/review-bot#52