How to Use AI Tools to Review and Filter Pull Requests
Purpose
Can I use AI to fight AI-generated pull request spam? That’s the question I asked myself when my open source project started receiving dozens of low-quality AI-generated PRs per week.
The answer is yes—but it requires a thoughtful approach combining deterministic tools with LLM-based analysis.
Environment
- GitHub-hosted open source project
- Python codebase
- OpenAI API access
- GitHub Actions for automation
What Happened?
I was spending hours each week reviewing PRs that were obviously generated by AI and submitted without any understanding of the codebase. The code looked superficially correct but had:
- Missing error handling
- Generic variable names
- No tests
- No connection to project conventions
I thought: “If AI created this mess, can AI help clean it up?”
The Reddit consensus was clear:
“Fight AI with AI - use it to evaluate the PR. Garbage = close”
How It Works
Approach 1: GitHub Actions Workflow
I created an automated PR review system:
name: AI PR Review

on:
  pull_request:
    types: [opened, synchronize]

# Note: for PRs opened from forks, the pull_request event does not expose
# repository secrets and GITHUB_TOKEN is read-only by default.
jobs:
  ai-review:
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write
      contents: read

    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Get PR diff
        id: diff
        run: |
          echo "diff<<EOF" >> "$GITHUB_OUTPUT"
          git diff origin/${{ github.base_ref }}...HEAD >> "$GITHUB_OUTPUT"
          echo "EOF" >> "$GITHUB_OUTPUT"

      - name: Run deterministic checks
        id: static
        run: |
          # Swap in your project's own lint and type-check commands.
          # Redirect stdout first, then stderr, so both end up in the file.
          npm run lint > lint-output.txt 2>&1 || true
          npm run typecheck > typecheck-output.txt 2>&1 || true

          CHANGED_FILES=$(git diff --name-only origin/${{ github.base_ref }}...HEAD | wc -l)
          echo "changed_files=$CHANGED_FILES" >> "$GITHUB_OUTPUT"

      - name: AI Quality Assessment
        id: ai-assessment
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          # PR text is untrusted: pass it through environment variables
          # instead of expanding it inside the shell script.
          PR_DIFF: ${{ steps.diff.outputs.diff }}
          PR_TITLE: ${{ github.event.pull_request.title }}
          PR_BODY: ${{ github.event.pull_request.body }}
        run: |
          pip install openai

          SCORE=$(python scripts/ai-pr-review.py \
            --diff "$PR_DIFF" \
            --pr-title "$PR_TITLE" \
            --pr-body "$PR_BODY" \
            --format score)

          echo "quality_score=$SCORE" >> "$GITHUB_OUTPUT"

      - name: Evaluate and Respond
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          QUALITY_THRESHOLD: 60
        run: |
          SCORE=${{ steps.ai-assessment.outputs.quality_score }}

          if [ "$SCORE" -lt "$QUALITY_THRESHOLD" ]; then
            gh pr comment ${{ github.event.pull_request.number }} --body \
          "## Automated Review Result

          This PR has been closed due to low quality score: **$SCORE/100**

          If you believe this is an error, please improve the PR and reopen."

            gh pr close ${{ github.event.pull_request.number }}
          else
            gh pr comment ${{ github.event.pull_request.number }} --body \
          "## Automated Quality Check

          Quality Score: **$SCORE/100**

          This PR has passed initial screening and awaits human review."
          fi

Approach 2: LLM Review Script
The core logic is in a Python script that evaluates PR quality:
#!/usr/bin/env python3
"""
AI PR Review Script
Evaluates pull requests using LLM analysis and returns quality score.
"""

import argparse
import json
import os

import openai


def analyze_with_llm(diff: str, pr_title: str, pr_body: str) -> dict:
    """Use LLM to analyze PR quality."""

    # Only the first 8000 characters of the diff are sent to the model.
    prompt = f"""You are an expert code reviewer. Analyze this pull request and provide scores.

PR Title: {pr_title}

PR Description:
{pr_body or "No description provided"}

Code Changes:
```
{diff[:8000]}
```

Score each category as an integer within the range shown:

1. **Code Quality** (0-25): Follows patterns, no bugs, appropriate abstractions
2. **Relevance** (0-25): Addresses real issue, meaningful change
3. **Documentation** (0-20): Comments, tests, clear description
4. **Risk Score** (0-20): Security, breaking changes (lower = better)
5. **Context** (0-10): Issue links, design decisions explained

Return ONLY a JSON object with keys: code_quality, relevance, documentation, risk_score, context
"""

    client = openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"])

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0.3,
    )

    return json.loads(response.choices[0].message.content)


def calculate_total_score(criteria: dict) -> int:
    """Calculate total quality score from criteria."""
    # Risk score is inverted (lower risk = higher score) and kept within 0-20
    risk_contribution = 20 - max(0, min(criteria.get("risk_score", 20), 20))

    total = (
        criteria.get("code_quality", 0)
        + criteria.get("relevance", 0)
        + criteria.get("documentation", 0)
        + risk_contribution
        + criteria.get("context", 0)
    )

    # Cast to int so the shell comparison in the workflow always gets a whole number
    return min(100, max(0, int(total)))


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--diff", required=True)
    parser.add_argument("--pr-title", required=True)
    parser.add_argument("--pr-body", default="")
    parser.add_argument("--format", choices=["score", "json"], default="score")
    args = parser.parse_args()

    criteria = analyze_with_llm(args.diff, args.pr_title, args.pr_body)
    total_score = calculate_total_score(criteria)

    if args.format == "json":
        print(json.dumps({"total_score": total_score, "breakdown": criteria}))
    else:
        print(total_score)

Approach 3: Training a Classifier (Advanced)
For projects with historical PR data, you can train a custom classifier:
import os

import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline


class PRQualityClassifier:
    """ML classifier for PR quality assessment."""

    def __init__(self, model_path: str = "pr_classifier.joblib"):
        self.model_path = model_path
        self.pipeline = None
        # Reuse a previously trained model if one exists on disk
        if os.path.exists(model_path):
            self.pipeline = joblib.load(model_path)

    def extract_features(self, pr_data: dict) -> str:
        """Extract features from PR for classification."""
        return (
            f"{pr_data.get('title', '')} {pr_data.get('body', '')} "
            f"files:{pr_data.get('changed_files', 0)} "
            f"additions:{pr_data.get('additions', 0)}"
        )

    def train(self, training_data: list):
        """Train on historical PRs with labels (1=good, 0=bad)."""
        X = [self.extract_features(pr) for pr in training_data]
        y = [pr.get('label', 0) for pr in training_data]

        self.pipeline = Pipeline([
            ('tfidf', TfidfVectorizer(max_features=5000)),
            ('clf', RandomForestClassifier(n_estimators=100)),
        ])
        self.pipeline.fit(X, y)
        # Persist the model so the next run can load it in __init__
        joblib.dump(self.pipeline, self.model_path)

    def predict(self, pr_data: dict) -> dict:
        """Predict PR quality."""
        if self.pipeline is None:
            raise RuntimeError("No trained model available; call train() first")

        features = self.extract_features(pr_data)
        # proba[1] is the probability of the "good" class (label 1)
        proba = self.pipeline.predict_proba([features])[0]

        return {
            "quality_score": int(proba[1] * 100),
            "should_auto_close": bool(proba[1] < 0.4),
        }

The Reason
The key insight is that AI can quickly evaluate code against multiple dimensions that would take a human much longer to assess:
┌─────────────────┐     ┌─────────────────┐
│   PR Content    │ ──→ │  LLM Analysis   │
└─────────────────┘     └────────┬────────┘
                                 │
                                 ▼
                        ┌─────────────────┐
                        │  Quality Score  │
                        │     (0-100)     │
                        └────────┬────────┘
                                 │
                    ┌────────────┼────────────┐
                    ▼            ▼            ▼
               Score < 40   Score 40-60   Score > 60
               Auto-close   Flag review   Normal queue

But there’s a critical rule: Never use AI as the sole decision maker. Always have human oversight for borderline cases.
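The three-way routing in the diagram can be sketched as a small function. This is a minimal illustration, not the code used in the workflow above: the names `Action` and `route_pr` are hypothetical, and the boundaries assume that a score of exactly 40 or exactly 60 falls into the "flag for review" band.

```python
from enum import Enum


class Action(Enum):
    """Possible outcomes for a scored PR (hypothetical names)."""
    AUTO_CLOSE = "auto-close"
    FLAG_FOR_REVIEW = "flag-for-review"
    NORMAL_QUEUE = "normal-queue"


def route_pr(score: int, close_below: int = 40, pass_above: int = 60) -> Action:
    """Map a 0-100 quality score to one of the three outcomes in the diagram."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    if score < close_below:
        return Action.AUTO_CLOSE
    if score <= pass_above:
        # Borderline band: a human makes the call
        return Action.FLAG_FOR_REVIEW
    return Action.NORMAL_QUEUE


if __name__ == "__main__":
    for example_score in (25, 40, 60, 85):
        print(example_score, route_pr(example_score).value)
```

Keeping the thresholds as parameters makes it easy to start with a conservative `close_below` and adjust it once you have seen how the scores are distributed on your own project.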
Summary
In this post, I showed how to use AI tools to review and filter pull requests. The key points are:
- Combine deterministic tools (linters, type checkers) with LLM analysis
- Score PRs on multiple dimensions - quality, relevance, documentation, risk
- Set appropriate thresholds - auto-close obviously bad PRs, flag borderline ones
- Provide transparent feedback - contributors deserve to know why PRs were closed
- Keep humans in the loop - AI assists, doesn’t replace, maintainer judgment
The goal isn’t to replace human review. It’s to amplify maintainer effectiveness by automating the obvious cases so I can focus on PRs that deserve attention.
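To make the feedback more transparent than a single number, the comment posted on the PR could include the per-category breakdown. The sketch below assumes the breakdown uses the same keys and maximum points as the scoring prompt in Approach 2; the names `MAX_POINTS` and `build_comment` are hypothetical.

```python
# Maximum points per category, mirroring the scoring prompt in Approach 2
MAX_POINTS = {
    "code_quality": 25,
    "relevance": 25,
    "documentation": 20,
    "risk_score": 20,
    "context": 10,
}


def build_comment(total_score: int, breakdown: dict) -> str:
    """Render a Markdown PR comment that shows the score for each category."""
    lines = [
        "## Automated Review Result",
        "",
        f"Total score: **{total_score}/100**",
        "",
    ]
    for key, max_points in MAX_POINTS.items():
        label = key.replace("_", " ").title()
        value = breakdown.get(key)
        if value is None:
            lines.append(f"- {label}: not scored")
            continue
        # Risk is the one category where a lower number is the better result
        note = " (lower is better)" if key == "risk_score" else ""
        lines.append(f"- {label}: {value}/{max_points}{note}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_comment(72, {
        "code_quality": 20,
        "relevance": 22,
        "documentation": 15,
        "risk_score": 5,
        "context": 0,
    }))
```

The resulting string can be passed to `gh pr comment` with `--body`, in place of the fixed message in the workflow.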
Final Words + More Resources
My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.
Oh, and if you found this article useful, don’t forget to support me by starring the repo on GitHub!