How to Use AI Tools to Review and Filter Pull Requests

Mar 20, 2026

Purpose

Can I use AI to fight AI-generated pull request spam? That’s the question I asked myself when my open source project started receiving dozens of low-quality AI-generated PRs per week.

The answer is yes—but it requires a thoughtful approach combining deterministic tools with LLM-based analysis.

Environment

GitHub-hosted open source project
Python codebase
OpenAI API access
GitHub Actions for automation

What Happened?

I was spending hours each week reviewing PRs that had obviously been generated by AI and submitted without any understanding of the code. The changes looked superficially correct but had:

Missing error handling
Generic variable names
No tests
No connection to project conventions

I thought: “If AI created this mess, can AI help clean it up?”

The Reddit consensus was clear:

“Fight AI with AI - use it to evaluate the PR. Garbage = close”

How It Works

Approach 1: GitHub Actions Workflow

I created an automated PR review system:

name: AI PR Review

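# Note: for PRs opened from forks, the pull_request event runs without repository
# secrets and with a read-only GITHUB_TOKEN, so the API call and the comment/close
# steps below need a different trigger or a follow-up workflow for those PRs.
on:
  pull_request:
    types: [opened, synchronize]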

jobs:
  ai-review:
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write
      contents: read

    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Get PR diff
        id: diff
        run: |
          echo "diff<<EOF" >> $GITHUB_OUTPUT
          git diff origin/${{ github.base_ref }}...HEAD >> $GITHUB_OUTPUT
          echo "EOF" >> $GITHUB_OUTPUT

      - name: Run deterministic checks
        id: static
        run: |
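          # Redirect stdout first, then stderr, so both streams land in the file
          npm run lint > lint-output.txt 2>&1 || true
          npm run typecheck > typecheck-output.txt 2>&1 || true

          # Compare against the remote-tracking base branch, as in the diff step
          CHANGED_FILES=$(git diff --name-only origin/${{ github.base_ref }}...HEAD | wc -l)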
          echo "changed_files=$CHANGED_FILES" >> $GITHUB_OUTPUT

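      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"

      - name: Install review dependencies
        run: pip install openai

      - name: AI Quality Assessment
        id: ai-assessment
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          # Pass untrusted PR text via env vars; inlining it in the script allows shell injection
          PR_DIFF: ${{ steps.diff.outputs.diff }}
          PR_TITLE: ${{ github.event.pull_request.title }}
          PR_BODY: ${{ github.event.pull_request.body }}
        run: |
          SCORE=$(python scripts/ai-pr-review.py \
            --diff "$PR_DIFF" \
            --pr-title "$PR_TITLE" \
            --pr-body "$PR_BODY" \
            --format score)

          echo "quality_score=$SCORE" >> $GITHUB_OUTPUT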

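      - name: Evaluate and Respond
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          # Auto-close only below 40; scores of 40 and above go on to human review
          QUALITY_THRESHOLD: 40
          SCORE: ${{ steps.ai-assessment.outputs.quality_score }}
        run: |
          # Stop here if the review script did not produce a numeric score
          if ! [[ "$SCORE" =~ ^[0-9]+$ ]]; then
            echo "No numeric quality score; skipping automated response" >&2
            exit 1
          fi

          if [ "$SCORE" -lt "$QUALITY_THRESHOLD" ]; then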
            gh pr comment ${{ github.event.pull_request.number }} --body \
              "## Automated Review Result

              This PR has been closed due to low quality score: **$SCORE/100**

              If you believe this is an error, please improve the PR and reopen."

            gh pr close ${{ github.event.pull_request.number }}
          else
            gh pr comment ${{ github.event.pull_request.number }} --body \
              "## Automated Quality Check

              Quality Score: **$SCORE/100**

              This PR has passed initial screening and awaits human review."
          fi
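
The workflow collects lint output and a changed-file count, and those deterministic signals can gate the LLM call so that API spend goes only to PRs that pass the cheap checks. The following is a minimal sketch, not part of the project: the helper name `prefilter`, the `max_files` limit, and the rule that any output line mentioning "error" counts as a lint failure are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class PrefilterResult:
    run_llm: bool   # whether the LLM assessment is worth running
    reason: str     # short explanation suitable for a PR comment


def count_error_lines(lint_output: str) -> int:
    """Count lines mentioning 'error' (assumed convention; real linters differ)."""
    return sum(1 for line in lint_output.splitlines() if "error" in line.lower())


def prefilter(lint_output: str, changed_files: int, max_files: int = 50) -> PrefilterResult:
    """Decide from deterministic signals alone whether to call the LLM."""
    if changed_files == 0:
        return PrefilterResult(False, "PR changes no files")
    if changed_files > max_files:
        return PrefilterResult(False, f"PR touches {changed_files} files (limit {max_files})")
    errors = count_error_lines(lint_output)
    if errors:
        return PrefilterResult(False, f"lint reported {errors} error line(s)")
    return PrefilterResult(True, "deterministic checks passed")
```

A script step could read `lint-output.txt`, call `prefilter`, and skip the assessment step when `run_llm` is false, posting `reason` as the comment instead.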

Approach 2: LLM Review Script

The core logic is in a Python script that evaluates PR quality:

#!/usr/bin/env python3
"""
AI PR Review Script
Evaluates pull requests using LLM analysis and returns quality score.
"""

import argparse
import json
import os
import openai


def analyze_with_llm(diff: str, pr_title: str, pr_body: str) -> dict:
    """Use LLM to analyze PR quality."""

    prompt = f"""You are an expert code reviewer. Analyze this pull request and provide scores.

PR Title: {pr_title}

PR Description:
{pr_body or "No description provided"}

Code Changes:
```
{diff[:8000]}
```

Score each category within the range shown next to it:

1. **Code Quality** (0-25): Follows patterns, no bugs, appropriate abstractions
2. **Relevance** (0-25): Addresses real issue, meaningful change
3. **Documentation** (0-20): Comments, tests, clear description
4. **Risk Score** (0-20): Security, breaking changes (lower = better)
5. **Context** (0-10): Issue links, design decisions explained

Return ONLY a JSON object with keys: code_quality, relevance, documentation, risk_score, context
"""

    client = openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"])

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0.3,
    )

    return json.loads(response.choices[0].message.content)


def calculate_total_score(criteria: dict) -> int:
    """Calculate total quality score from criteria."""
    # Risk score is inverted (lower risk = higher score)
    risk_contribution = 20 - min(criteria.get("risk_score", 20), 20)

    total = (
        criteria.get("code_quality", 0) +
        criteria.get("relevance", 0) +
        criteria.get("documentation", 0) +
        risk_contribution +
        criteria.get("context", 0)
    )

    return min(100, max(0, total))


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--diff", required=True)
    parser.add_argument("--pr-title", required=True)
    parser.add_argument("--pr-body", default="")
    parser.add_argument("--format", choices=["score", "json"], default="score")
    args = parser.parse_args()

    criteria = analyze_with_llm(args.diff, args.pr_title, args.pr_body)
    total_score = calculate_total_score(criteria)

    if args.format == "json":
        print(json.dumps({"total_score": total_score, "breakdown": criteria}))
    else:
        print(total_score)
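
Because the model's JSON is untrusted input, it can help to validate it before scoring: a missing key, a string such as "20", or an out-of-range value could otherwise skew or crash `calculate_total_score`. Here is a minimal sketch, assuming a hypothetical helper `normalize_criteria` that clamps each category to the maximum given in the prompt and treats missing or non-numeric values pessimistically. That policy is an assumption, not something the original script does.

```python
# Maximum points per category, taken from the ranges listed in the prompt.
CATEGORY_MAX = {
    "code_quality": 25,
    "relevance": 25,
    "documentation": 20,
    "risk_score": 20,
    "context": 10,
}


def normalize_criteria(raw: dict) -> dict:
    """Return a dict with every category present as an int within its range.

    Missing or non-numeric values get the worst outcome: 0 for the positive
    categories and the maximum for risk_score (hypothetical policy).
    """
    cleaned = {}
    for key, maximum in CATEGORY_MAX.items():
        worst = maximum if key == "risk_score" else 0
        value = raw.get(key, worst)
        try:
            number = int(float(value))
        except (TypeError, ValueError, OverflowError):
            number = worst
        # Clamp into the allowed range for this category
        cleaned[key] = max(0, min(maximum, number))
    return cleaned
```

With this in place, `calculate_total_score(normalize_criteria(criteria))` would always receive integers in range.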

Approach 3: Training a Classifier (Advanced)

For projects with historical PR data, you can train a custom classifier:

import os  # needed for the model-file existence check below

import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline


class PRQualityClassifier:
    """ML classifier for PR quality assessment."""

    def __init__(self, model_path: str = "pr_classifier.joblib"):
        self.pipeline = None
        if os.path.exists(model_path):
            self.pipeline = joblib.load(model_path)

    def extract_features(self, pr_data: dict) -> str:
        """Extract features from PR for classification."""
        return f"{pr_data.get('title', '')} {pr_data.get('body', '')} " \
               f"files:{pr_data.get('changed_files', 0)} " \
               f"additions:{pr_data.get('additions', 0)}"

    def train(self, training_data: list):
        """Train on historical PRs with labels (1=good, 0=bad)."""
        X = [self.extract_features(pr) for pr in training_data]
        y = [pr.get('label', 0) for pr in training_data]

        self.pipeline = Pipeline([
            ('tfidf', TfidfVectorizer(max_features=5000)),
            ('clf', RandomForestClassifier(n_estimators=100))
        ])
        self.pipeline.fit(X, y)

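    def predict(self, pr_data: dict) -> dict:
        """Predict PR quality."""
        # Without a loaded or trained pipeline there is nothing to predict with
        if self.pipeline is None:
            raise RuntimeError("No trained model loaded; call train() first")
        features = self.extract_features(pr_data)
        proba = self.pipeline.predict_proba([features])[0]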

        return {
            "quality_score": int(proba[1] * 100),
            "should_auto_close": proba[1] < 0.4
        }
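
Before letting `should_auto_close` act on real PRs, it is worth checking on held-out labelled PRs how often the 0.4 cutoff would close a good contribution. The following is a sketch using only the standard library. The function name `auto_close_report` and the sample numbers are hypothetical, and the `(probability, label)` pairs are assumed to come from `predict_proba` on PRs that were not used for training.

```python
def auto_close_report(scored: list[tuple[float, int]], threshold: float = 0.4) -> dict:
    """Summarize what auto-closing below `threshold` would do.

    `scored` holds (probability_good, true_label) pairs, label 1=good, 0=bad.
    """
    closed = [(p, y) for p, y in scored if p < threshold]
    good_total = sum(1 for _, y in scored if y == 1)
    bad_total = len(scored) - good_total
    good_closed = sum(1 for _, y in closed if y == 1)
    bad_closed = len(closed) - good_closed
    return {
        "closed": len(closed),
        # Share of good PRs that would be wrongly closed
        "false_close_rate": good_closed / good_total if good_total else 0.0,
        # Share of bad PRs that the cutoff actually catches
        "bad_caught_rate": bad_closed / bad_total if bad_total else 0.0,
    }


# Made-up held-out predictions, for illustration only
sample = [(0.9, 1), (0.7, 1), (0.35, 1), (0.2, 0), (0.1, 0), (0.55, 0)]
report = auto_close_report(sample)
```

If `false_close_rate` is not close to zero on your own data, lowering the cutoff or routing those PRs to review instead of closing them is the safer choice.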

The Reason

The key insight is that AI can quickly evaluate code against multiple dimensions that would take a human much longer to assess:

┌─────────────────┐     ┌─────────────────┐
│   PR Content    │ ──→ │   LLM Analysis  │
└─────────────────┘     └────────┬────────┘
                                 │
                                 ▼
                        ┌─────────────────┐
                        │  Quality Score  │
                        │   (0-100)       │
                        └────────┬────────┘
                                 │
                    ┌────────────┼────────────┐
                    ▼            ▼            ▼
              Score < 40    Score 40-60   Score > 60
              Auto-close     Flag review    Normal queue

But there’s a critical rule: Never use AI as the sole decision maker. Always have human oversight for borderline cases.
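
The three bands in the diagram can be expressed as a small routing function. This is a sketch only: the name `triage` and the handling of the boundary values (40 and 60 both fall in the review band here) are assumptions, since the diagram does not say which side the boundaries belong to.

```python
def triage(score: int) -> str:
    """Map a 0-100 quality score to one of the three bands in the diagram."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    if score < 40:
        return "auto-close"
    if score <= 60:
        return "flag-review"   # borderline: a human decides
    return "normal-queue"
```

Keeping the borderline band separate is what leaves room for human oversight: only the lowest band triggers an automatic action.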

Summary

In this post, I showed how to use AI tools to review and filter pull requests. The key points are:

Combine deterministic tools (linters, type checkers) with LLM analysis
Score PRs on multiple dimensions - quality, relevance, documentation, risk
Set appropriate thresholds - auto-close obviously bad PRs, flag borderline ones
Provide transparent feedback - contributors deserve to know why PRs were closed
Keep humans in the loop - AI assists, doesn’t replace, maintainer judgment

The goal isn’t to replace human review. It’s to amplify maintainer effectiveness by automating the obvious cases so I can focus on PRs that deserve attention.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!