How Do You Automate CRM Data Hygiene with Python Scripts?

Mar 13, 2026

The Problem

Every Monday morning, I spent 30 minutes cleaning up our CRM database. I manually checked for duplicate contacts, missing follow-up dates, and stale pipeline stages. It was tedious, error-prone, and I still missed things.

The worst part? I couldn’t scale this. As our sales team grew, so did the data quality issues. Deals fell through because follow-up dates were missing. Sales reps wasted time calling duplicate contacts. Trust in the CRM plummeted.

I needed automation. But not just any automation—I needed something that wouldn’t accidentally delete critical data.

The Solution: A Python Hygiene Script

I built a Python script that runs daily at 5am. It scans our CRM for common issues, auto-fixes what’s safe to fix, and flags everything else for human review.

Here’s the basic architecture:

Scheduled Python Script (daily 5am)
    |
Database Connection (SQLAlchemy)
    |
Hygiene Checks (multiple validation functions)
    |
Auto-Fix Logic (safe, deterministic fixes)
    |
Exception Flagging (edge cases for human review)
    |
Notification System (email/Slack digest)
    |
Audit Log (PostgreSQL table for compliance)

Let me walk you through how I built it.

Step 1: Database Connection Setup

First, I set up the database connection using SQLAlchemy:

import sqlalchemy
from sqlalchemy import create_engine, text
from datetime import datetime, timedelta
import logging
import os

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Database connection - use environment variables
DATABASE_URL = os.getenv('DATABASE_URL', 'postgresql://user:pass@host:5432/crm_db')
engine = create_engine(DATABASE_URL)

def get_connection():
    return engine.connect()

I learned the hard way to use environment variables. Never hardcode credentials.

Step 2: Hygiene Check Functions

Each hygiene check follows a pattern: query the database, identify issues, categorize them (auto-fix vs flag).

Check 1: Duplicate Emails

def check_duplicate_emails(conn):
    """Find contacts with duplicate email addresses."""
    query = text("""
        SELECT
            email,
            array_agg(id) as contact_ids,
            array_agg(name) as names,
            COUNT(*) as duplicate_count
        FROM contacts
        WHERE email IS NOT NULL
        GROUP BY LOWER(email)
        HAVING COUNT(*) > 1
    """)

    duplicates = conn.execute(query).fetchall()
    results = []

    for dup in duplicates:
        # Flag for merge - too risky to auto-delete
        results.append({
            'type': 'duplicate_email',
            'email': dup.email,
            'contact_ids': dup.contact_ids,
            'action': 'flag_for_merge',
            'priority': 'high'
        })

    return results

Notice I flagged these instead of auto-fixing. Deleting duplicates is dangerous—one might have important relationship history.

Check 2: Missing Follow-up Dates

def check_missing_followup_dates(conn):
    """Find active prospects without next follow-up date."""
    query = text("""
        SELECT id, name, email, stage, last_activity_date
        FROM contacts
        WHERE stage IN ('prospect', 'qualified', 'demo_scheduled')
          AND (next_followup_date IS NULL OR next_followup_date < CURRENT_DATE)
          AND status = 'active'
    """)

    missing_dates = conn.execute(query).fetchall()
    results = []

    for contact in missing_dates:
        days_since_activity = (datetime.now() - contact.last_activity_date).days if contact.last_activity_date else 30

        if days_since_activity <= 7:
            # Auto-fix: set follow-up to 7 days from now if recently active
            new_date = datetime.now() + timedelta(days=7)
            results.append({
                'type': 'missing_followup',
                'contact_id': contact.id,
                'action': 'auto_fix',
                'new_date': new_date,
                'priority': 'medium'
            })
        else:
            # Flag if no recent activity - might be dead lead
            results.append({
                'type': 'missing_followup_stale',
                'contact_id': contact.id,
                'action': 'flag_for_review',
                'priority': 'high'
            })

    return results

This is safe to auto-fix for recently active contacts. For stale contacts, I flag them.

Check 3: Stale Pipeline Stages

def check_stale_pipeline_stages(conn, days_threshold=90):
    """Find contacts stuck in same stage too long."""
    query = text("""
        SELECT
            id, name, email, stage,
            stage_entered_date,
            CURRENT_DATE - stage_entered_date as days_in_stage
        FROM contacts
        WHERE stage_entered_date < CURRENT_DATE - INTERVAL '%s days'
          AND stage NOT IN ('closed_won', 'closed_lost')
          AND status = 'active'
        ORDER BY days_in_stage DESC
    """ % days_threshold)

    stale_contacts = conn.execute(query).fetchall()
    results = []

    for contact in stale_contacts:
        results.append({
            'type': 'stale_pipeline_stage',
            'contact_id': contact.id,
            'stage': contact.stage,
            'days_in_stage': contact.days_in_stage,
            'action': 'flag_for_review',
            'priority': 'high'
        })

    return results

A deal stuck in “demo scheduled” for 90 days? Something’s wrong. I flag these for review.

Check 4: Timing Violations

def check_timing_violations(conn, min_days_between=3):
    """Find contacts emailed too recently."""
    query = text("""
        SELECT
            c.id, c.name, c.email,
            c.last_email_date,
            COUNT(e.id) as emails_last_week
        FROM contacts c
        LEFT JOIN email_activities e
          ON c.id = e.contact_id
          AND e.sent_date > CURRENT_DATE - INTERVAL '7 days'
        WHERE c.last_email_date > CURRENT_DATE - INTERVAL '%s days'
          AND c.status = 'active'
        GROUP BY c.id, c.name, c.email, c.last_email_date
        HAVING COUNT(e.id) > 2
    """ % min_days_between)

    violations = conn.execute(query).fetchall()
    results = []

    for contact in violations:
        results.append({
            'type': 'timing_violation',
            'contact_id': contact.id,
            'emails_last_week': contact.emails_last_week,
            'last_email_date': contact.last_email_date,
            'action': 'flag_for_review',
            'priority': 'medium'
        })

    return results

Contacting prospects too frequently hurts relationships. I flag these for review.

Step 3: Auto-Fix Logic

Only safe, deterministic fixes should be automatic:

import json

def apply_auto_fixes(conn, issues):
    """Apply safe, deterministic fixes automatically."""
    fixed_count = 0

    for issue in issues:
        if issue['action'] != 'auto_fix':
            continue

        try:
            if issue['type'] == 'missing_followup':
                update_query = text("""
                    UPDATE contacts
                    SET next_followup_date = :new_date,
                        updated_at = CURRENT_TIMESTAMP,
                        updated_by = 'hygiene_script'
                    WHERE id = :contact_id
                """)
                conn.execute(update_query, {
                    'new_date': issue['new_date'],
                    'contact_id': issue['contact_id']
                })
                fixed_count += 1
                logger.info(f"Fixed missing follow-up for contact {issue['contact_id']}")

        except Exception as e:
            logger.error(f"Failed to fix issue {issue}: {e}")
            conn.rollback()

    conn.commit()
    return fixed_count


def log_hygiene_results(conn, issues, fixed_count):
    """Log all findings for audit trail."""
    audit_query = text("""
        INSERT INTO hygiene_audit_log
        (run_date, total_issues, auto_fixed, flagged_for_review, details)
        VALUES
        (CURRENT_TIMESTAMP, :total, :fixed, :flagged, :details)
    """)

    flagged_count = len([i for i in issues if i['action'] == 'flag_for_review'])

    conn.execute(audit_query, {
        'total': len(issues),
        'fixed': fixed_count,
        'flagged': flagged_count,
        'details': json.dumps(issues, default=str)
    })
    conn.commit()

Every change gets logged with before/after values. This saved me more than once when debugging.

Step 4: Notification System

I send a daily digest instead of individual alerts to avoid notification fatigue:

import smtplib
from email.mime.text import MIMEText
from email.mime.multipart import MIMEMultipart
from datetime import datetime

def send_digest_email(flagged_issues):
    """Send daily digest of issues requiring human review."""
    if not flagged_issues:
        return

    # Group by type
    grouped = {}
    for issue in flagged_issues:
        issue_type = issue['type']
        if issue_type not in grouped:
            grouped[issue_type] = []
        grouped[issue_type].append(issue)

    # Build email body
    body = f"CRM Data Hygiene Report - {datetime.now().strftime('%Y-%m-%d')}\n\n"
    body += f"Total issues requiring review: {len(flagged_issues)}\n\n"

    for issue_type, issues in grouped.items():
        body += f"\n{issue_type.upper().replace('_', ' ')} ({len(issues)} issues):\n"
        for issue in issues[:5]:  # Show first 5 of each type
            body += f"  - Contact ID {issue.get('contact_id')}: {issue}\n"

    # Create email
    msg = MIMEMultipart()
    msg['From'] = '[email protected]'
    msg['To'] = '[email protected]'
    msg['Subject'] = f'CRM Hygiene Digest - {len(flagged_issues)} issues need attention'
    msg.attach(MIMEText(body, 'plain'))

    # Uncomment for actual sending
    # smtp_server = smtplib.SMTP('smtp.company.com', 587)
    # smtp_server.send_message(msg)
    print(f"Would send email with {len(flagged_issues)} issues")

Step 5: Main Orchestration

Now put it all together:

def run_daily_hygiene():
    """Main function to run all hygiene checks."""
    logger.info("Starting CRM data hygiene run...")

    all_issues = []

    with get_connection() as conn:
        # Run all checks
        all_issues.extend(check_duplicate_emails(conn))
        all_issues.extend(check_missing_followup_dates(conn))
        all_issues.extend(check_stale_pipeline_stages(conn))
        all_issues.extend(check_timing_violations(conn))

        logger.info(f"Found {len(all_issues)} total issues")

        # Apply auto-fixes
        fixed_count = apply_auto_fixes(conn, all_issues)
        logger.info(f"Auto-fixed {fixed_count} issues")

        # Log results
        log_hygiene_results(conn, all_issues, fixed_count)

        # Send digest for flagged items
        flagged = [i for i in all_issues if i['action'] in ('flag_for_review', 'flag_for_merge')]
        send_digest_email(flagged)

    logger.info("CRM data hygiene run complete")
    return all_issues


if __name__ == "__main__":
    run_daily_hygiene()

Step 6: Scheduling

I use cron for scheduling. Simple and reliable:

# /etc/cron.d/crm-hygiene
# Run at 5am every day
0 5 * * * crm_user /usr/bin/python3 /opt/crm-hygiene/run_hygiene.py >> /var/log/crm-hygiene.log 2>&1

Or use Python’s schedule library for more control:

import schedule
import time

schedule.every().day.at("05:00").do(run_daily_hygiene)

while True:
    schedule.run_pending()
    time.sleep(60)

The Results

After implementing this script:

Metric	Before	After
Weekly manual cleanup	30+ minutes	0 minutes
Issues caught	Maybe 70%	Near 100%
Data quality consistency	Variable	Consistent
Audit trail	None	Complete

Now I just check the daily digest email. The tedious work just… happens.

Common Mistakes I Made

Mistake 1: Over-automation too early

I initially tried to auto-fix duplicates by keeping the most recent record. I deleted an important contact with years of relationship history. Start conservative. Add auto-fixes incrementally.

Mistake 2: No audit trail

My first version made changes without logging. When data looked wrong, I couldn’t explain why. Every update now writes to an audit table with timestamp, script ID, and before/after values.

Mistake 3: Alert fatigue

I sent individual emails per issue at first. People ignored them within a week. Now I send a daily digest grouped by type, showing only the top issues with priority in the subject line.

Mistake 4: Hard-coded thresholds

“90 days in stage” was hard-coded. Every time we wanted to tune it, I had to modify code. Now thresholds live in a config table.

Mistake 5: Running during business hours

The script locked tables and slowed down the CRM for users. I moved it to 5am when traffic is minimal.

Mistake 6: No staging environment

I never tested auto-fix logic against real edge cases. Now I maintain a staging CRM with production-like data for testing.

When to Auto-fix vs Flag

Safe to auto-fix:

Missing follow-up dates on recently active contacts
Standardized format issues (phone numbers, state abbreviations)
Obvious data entry errors with clear correction

Flag for review:

Duplicate contacts (which has the relationship history?)
Stale pipeline stages (deal might still be alive)
Timing violations (might be intentional for hot leads)
Any deletion operation

Configuration for Production

Here’s my production-ready configuration:

import os
from dataclasses import dataclass

@dataclass
class HygieneConfig:
    database_url: str = os.getenv('DATABASE_URL')
    days_stale_threshold: int = int(os.getenv('DAYS_STALE_THRESHOLD', '90'))
    min_days_between_emails: int = int(os.getenv('MIN_DAYS_BETWEEN_EMAILS', '3'))
    auto_fix_enabled: bool = os.getenv('AUTO_FIX_ENABLED', 'false').lower() == 'true'
    digest_email: str = os.getenv('DIGEST_EMAIL', '[email protected]')
    run_time: str = os.getenv('RUN_TIME', '05:00')

config = HygieneConfig()

Summary

Automating CRM data hygiene with Python transforms tedious weekly maintenance into a reliable daily process. Build hygiene check functions for each issue type. Auto-fix only safe, deterministic problems. Flag everything else for human review. Always maintain an audit trail.

Start with read-only checks. Test thoroughly in staging. Add auto-fixes gradually. You will catch more issues than manual reviews ever could while reclaiming hours of your time.

Final Words + More Resources

My intention with this article was to help others share my knowledge and experience. If you want to contact me, you can contact by email: Email me

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!