
How to Use AI Tools to Review and Filter Pull Requests

Purpose

Can I use AI to fight AI-generated pull request spam? That’s the question I asked myself when my open source project started receiving dozens of low-quality AI-generated PRs per week.

The answer is yes—but it requires a thoughtful approach combining deterministic tools with LLM-based analysis.

Environment

  • GitHub-hosted open source project
  • Python codebase
  • OpenAI API access
  • GitHub Actions for automation

What Happened?

I was spending hours each week reviewing PRs that had obviously been generated by AI and submitted without any understanding of the project. The code looked superficially correct but had:

  • Missing error handling
  • Generic variable names
  • No tests
  • No connection to project conventions

I thought: “If AI created this mess, can AI help clean it up?”

The Reddit consensus was clear:

“Fight AI with AI - use it to evaluate the PR. Garbage = close”

How It Works

Approach 1: GitHub Actions Workflow

I created an automated PR review system:

.github/workflows/ai-pr-review.yml
name: AI PR Review

# Note: for PRs opened from forks, the pull_request event does not expose
# repository secrets and GITHUB_TOKEN is read-only by default, so the API call
# and the comment/close steps only work for same-repository branches as written.
on:
  pull_request:
    types: [opened, synchronize]

jobs:
  ai-review:
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write
      contents: read
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Get PR diff
        run: |
          # Write the diff to a file instead of a step output: diffs can be
          # large and may contain the output delimiter.
          git diff "origin/${{ github.base_ref }}...HEAD" > pr.diff

      - name: Run deterministic checks
        id: static
        run: |
          # Substitute your project's own lint and type-check commands here.
          # Redirect stdout first, then stderr, so both end up in the file.
          npm run lint > lint-output.txt 2>&1 || true
          npm run typecheck > typecheck-output.txt 2>&1 || true
          CHANGED_FILES=$(git diff --name-only "origin/${{ github.base_ref }}...HEAD" | wc -l)
          echo "changed_files=$CHANGED_FILES" >> "$GITHUB_OUTPUT"

      - name: AI Quality Assessment
        id: ai-assessment
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          # Pass untrusted PR text through the environment rather than
          # interpolating it into the script, which would allow shell injection.
          PR_TITLE: ${{ github.event.pull_request.title }}
          PR_BODY: ${{ github.event.pull_request.body }}
        run: |
          pip install --quiet openai
          # The review script only reads the first 8000 characters of the diff.
          SCORE=$(python scripts/ai-pr-review.py \
            --diff "$(head -c 8000 pr.diff)" \
            --pr-title "$PR_TITLE" \
            --pr-body "$PR_BODY" \
            --format score)
          echo "quality_score=$SCORE" >> "$GITHUB_OUTPUT"

      - name: Evaluate and Respond
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          QUALITY_THRESHOLD: 60
          SCORE: ${{ steps.ai-assessment.outputs.quality_score }}
          PR_NUMBER: ${{ github.event.pull_request.number }}
        run: |
          if [ "$SCORE" -lt "$QUALITY_THRESHOLD" ]; then
            gh pr comment "$PR_NUMBER" --body \
          "## Automated Review Result
          This PR has been closed due to low quality score: **$SCORE/100**
          If you believe this is an error, please improve the PR and reopen."
            gh pr close "$PR_NUMBER"
          else
            gh pr comment "$PR_NUMBER" --body \
          "## Automated Quality Check
          Quality Score: **$SCORE/100**
          This PR has passed initial screening and awaits human review."
          fi
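
The workflow collects lint output and a changed-file count, but the final decision uses only the LLM score. One way the deterministic signals could feed into that decision is sketched below. The `StaticSignals` shape, the `adjust_score` name, the penalty sizes, and the 50-file cutoff are all illustrative assumptions, not part of the original setup.

```python
from dataclasses import dataclass


@dataclass
class StaticSignals:
    """Deterministic facts gathered before the LLM runs (hypothetical shape)."""
    lint_passed: bool
    typecheck_passed: bool
    changed_files: int
    touches_tests: bool


def adjust_score(llm_score: int, signals: StaticSignals) -> int:
    """Apply fixed penalties for failed deterministic checks.

    The penalty sizes are arbitrary placeholders; tune them per project.
    """
    score = llm_score
    if not signals.lint_passed:
        score -= 15
    if not signals.typecheck_passed:
        score -= 15
    if not signals.touches_tests:
        score -= 10
    if signals.changed_files > 50:  # very large PRs are harder to trust
        score -= 10
    # Keep the result on the same 0-100 scale as the LLM score.
    return max(0, min(100, score))
```

Because the penalties are plain arithmetic, a contributor can see exactly why a score dropped, which is harder to guarantee for the LLM part.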

Approach 2: LLM Review Script

The core logic is in a Python script that evaluates PR quality:

scripts/ai-pr-review.py
#!/usr/bin/env python3
"""
AI PR Review Script
Evaluates pull requests using LLM analysis and returns quality score.
"""
import argparse
import json
import os

import openai


def analyze_with_llm(diff: str, pr_title: str, pr_body: str) -> dict:
    """Use LLM to analyze PR quality."""
    # The prompt text is kept flush left so no indentation leaks into it.
    prompt = f"""You are an expert code reviewer. Analyze this pull request and provide scores.
PR Title: {pr_title}
PR Description:
{pr_body or "No description provided"}
Code Changes:
```
{diff[:8000]}
```
Score each category as an integer within its stated range (the maximums sum to 100):
1. **Code Quality** (0-25): Follows patterns, no bugs, appropriate abstractions
2. **Relevance** (0-25): Addresses real issue, meaningful change
3. **Documentation** (0-20): Comments, tests, clear description
4. **Risk Score** (0-20): Security, breaking changes (lower = better)
5. **Context** (0-10): Issue links, design decisions explained
Return ONLY a JSON object with keys: code_quality, relevance, documentation, risk_score, context
"""
    client = openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0.3,
    )
    return json.loads(response.choices[0].message.content)


def calculate_total_score(criteria: dict) -> int:
    """Calculate total quality score from criteria."""
    # Risk score is inverted (lower risk = higher score)
    risk_contribution = 20 - min(criteria.get("risk_score", 20), 20)
    total = (
        criteria.get("code_quality", 0)
        + criteria.get("relevance", 0)
        + criteria.get("documentation", 0)
        + risk_contribution
        + criteria.get("context", 0)
    )
    # Cast to int so the shell's integer comparison never sees "72.5".
    return int(min(100, max(0, total)))


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--diff", required=True)
    parser.add_argument("--pr-title", required=True)
    parser.add_argument("--pr-body", default="")
    parser.add_argument("--format", choices=["score", "json"], default="score")
    args = parser.parse_args()

    criteria = analyze_with_llm(args.diff, args.pr_title, args.pr_body)
    total_score = calculate_total_score(criteria)
    if args.format == "json":
        print(json.dumps({"total_score": total_score, "breakdown": criteria}))
    else:
        print(total_score)
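
The script trusts whatever JSON the model returns. In practice a value can come back as a string, fall outside its range, or be missing. A small sanitizing step, sketched here under the assumption that a missing value should count as the worst case, could sit between `analyze_with_llm` and `calculate_total_score`. The `RANGES` table and the `sanitize_criteria` name are hypothetical helpers, not part of the original script.

```python
# Maximum per rubric key, mirroring the ranges stated in the prompt.
RANGES = {
    "code_quality": 25,
    "relevance": 25,
    "documentation": 20,
    "risk_score": 20,
    "context": 10,
}


def sanitize_criteria(raw: dict) -> dict:
    """Coerce each rubric value to an int clamped to [0, max].

    Missing or non-numeric values fall back to the worst case:
    0 for positive criteria, the maximum for risk_score.
    """
    clean = {}
    for key, upper in RANGES.items():
        worst = upper if key == "risk_score" else 0
        try:
            value = int(float(raw.get(key, worst)))
        except (TypeError, ValueError, OverflowError):
            value = worst
        clean[key] = max(0, min(upper, value))
    return clean
```

Falling back to the worst case is a deliberately conservative choice: a malformed response then lowers the score instead of silently inflating it.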

Approach 3: Training a Classifier (Advanced)

For projects with historical PR data, you can train a custom classifier:

PR quality classifier
import os

import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline


class PRQualityClassifier:
    """ML classifier for PR quality assessment."""

    def __init__(self, model_path: str = "pr_classifier.joblib"):
        self.model_path = model_path
        self.pipeline = None
        # Reuse a previously trained model if one exists on disk.
        if os.path.exists(model_path):
            self.pipeline = joblib.load(model_path)

    def extract_features(self, pr_data: dict) -> str:
        """Extract features from PR for classification."""
        return (
            f"{pr_data.get('title', '')} {pr_data.get('body', '')} "
            f"files:{pr_data.get('changed_files', 0)} "
            f"additions:{pr_data.get('additions', 0)}"
        )

    def train(self, training_data: list):
        """Train on historical PRs with labels (1=good, 0=bad)."""
        X = [self.extract_features(pr) for pr in training_data]
        y = [pr.get('label', 0) for pr in training_data]
        self.pipeline = Pipeline([
            ('tfidf', TfidfVectorizer(max_features=5000)),
            ('clf', RandomForestClassifier(n_estimators=100)),
        ])
        self.pipeline.fit(X, y)
        # Persist the fitted pipeline so __init__ can reload it next time.
        joblib.dump(self.pipeline, self.model_path)

    def predict(self, pr_data: dict) -> dict:
        """Predict PR quality."""
        if self.pipeline is None:
            raise RuntimeError("No trained model available: call train() first")
        features = self.extract_features(pr_data)
        # proba[1] is the probability of the "good" class (label 1).
        proba = self.pipeline.predict_proba([features])[0]
        return {
            "quality_score": int(proba[1] * 100),
            "should_auto_close": proba[1] < 0.4,
        }
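
The 0.4 auto-close cutoff in `predict` is a fixed number. Before relying on it, it may be worth measuring on held-out historical PRs how many good contributions it would have closed. The sketch below uses only the standard library and assumes you already have `(probability_good, label)` pairs from such a held-out set. The `evaluate_threshold` helper is hypothetical.

```python
def evaluate_threshold(scored: list[tuple[float, int]], threshold: float) -> dict:
    """Summarize what auto-closing below `threshold` would have done.

    `scored` holds (probability_good, label) pairs with label 1=good, 0=bad,
    matching the labels used for training.
    """
    closed = [(p, y) for p, y in scored if p < threshold]
    good_total = sum(1 for _, y in scored if y == 1)
    bad_total = sum(1 for _, y in scored if y == 0)
    good_closed = sum(1 for _, y in closed if y == 1)
    bad_closed = len(closed) - good_closed
    return {
        "closed": len(closed),
        # Share of good PRs that would have been closed by mistake.
        "false_close_rate": good_closed / good_total if good_total else 0.0,
        # Share of bad PRs that the threshold actually catches.
        "bad_caught_rate": bad_closed / bad_total if bad_total else 0.0,
    }
```

Running this for a few candidate thresholds makes the trade-off explicit: a lower cutoff closes fewer good PRs by mistake but lets more spam through to the review queue.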

The Reason

The key insight is that AI can quickly evaluate code against multiple dimensions that would take a human much longer to assess:

┌─────────────────┐     ┌─────────────────┐
│   PR Content    │ ──→ │  LLM Analysis   │
└─────────────────┘     └────────┬────────┘
                                 ▼
                        ┌─────────────────┐
                        │  Quality Score  │
                        │     (0-100)     │
                        └────────┬────────┘
                    ┌────────────┼────────────┐
                    ▼            ▼            ▼
               Score < 40   Score 40-60   Score > 60
               Auto-close   Flag review   Normal queue
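
The three-way split in the diagram can be sketched as a small function. The `Action` enum and `triage` name are illustrative, and treating the boundary values 40 and 60 as "flag for review" is an assumption, since the diagram does not say which side they fall on.

```python
from enum import Enum


class Action(Enum):
    AUTO_CLOSE = "auto-close"
    FLAG_FOR_REVIEW = "flag-review"
    NORMAL_QUEUE = "normal-queue"


def triage(score: int, close_below: int = 40, pass_above: int = 60) -> Action:
    """Map a 0-100 quality score to one of the three outcomes."""
    if score < close_below:
        return Action.AUTO_CLOSE
    if score <= pass_above:
        # Borderline band (40-60 inclusive): a human decides.
        return Action.FLAG_FOR_REVIEW
    return Action.NORMAL_QUEUE
```

Keeping both cutoffs as parameters makes it easy to start with a narrow auto-close band and widen it only once the scores have proven reliable.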

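But there’s a critical rule: never use AI as the sole decision maker. Always have human oversight for borderline cases. Note that the workflow in Approach 1 collapses this into a single threshold of 60 and closes everything below it; the three-tier split shown here is what leaves the 40-60 band to a human.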

Summary

In this post, I showed how to use AI tools to review and filter pull requests. The key points are:

  1. Combine deterministic tools (linters, type checkers) with LLM analysis
  2. Score PRs on multiple dimensions - quality, relevance, documentation, risk
  3. Set appropriate thresholds - auto-close obviously bad PRs, flag borderline ones
  4. Provide transparent feedback - contributors deserve to know why PRs were closed
  5. Keep humans in the loop - AI assists, doesn’t replace, maintainer judgment
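
For point 4, the comment posted on a PR could include the per-category breakdown rather than only the total. This is a minimal sketch assuming the breakdown uses the same keys as the review script's JSON output. The `build_feedback_comment` function and the exact wording are hypothetical.

```python
def build_feedback_comment(total: int, breakdown: dict) -> str:
    """Render a Markdown comment listing each rubric score."""
    # Display label and maximum for each key returned by the review script.
    labels = {
        "code_quality": ("Code quality", 25),
        "relevance": ("Relevance", 25),
        "documentation": ("Documentation", 20),
        "risk_score": ("Risk (lower is better)", 20),
        "context": ("Context", 10),
    }
    lines = ["## Automated Review Result", "", f"Total score: **{total}/100**", ""]
    for key, (label, upper) in labels.items():
        value = breakdown.get(key, "n/a")
        lines.append(f"- {label}: {value}/{upper}")
    lines.append("")
    lines.append("A maintainer can override this result. Reply here if you think it is wrong.")
    return "\n".join(lines)
```

The resulting string can be passed to `gh pr comment --body` in place of the fixed message used in the workflow.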

The goal isn’t to replace human review. It’s to amplify maintainer effectiveness by automating the obvious cases so I can focus on PRs that deserve attention.

Final Words + More Resources

My intention with this article was to share my knowledge and experience in the hope that it helps others. If you want to get in touch, you can reach me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!
