Skip to content

How Do You Automate CRM Data Hygiene with Python Scripts?

The Problem

Every Monday morning, I spent 30 minutes cleaning up our CRM database. I manually checked for duplicate contacts, missing follow-up dates, and stale pipeline stages. It was tedious, error-prone, and I still missed things.

The worst part? I couldn’t scale this. As our sales team grew, so did the data quality issues. Deals fell through because follow-up dates were missing. Sales reps wasted time calling duplicate contacts. Trust in the CRM plummeted.

I needed automation. But not just any automation—I needed something that wouldn’t accidentally delete critical data.

The Solution: A Python Hygiene Script

I built a Python script that runs daily at 5am. It scans our CRM for common issues, auto-fixes what’s safe to fix, and flags everything else for human review.

Here’s the basic architecture:

architecture.txt
Scheduled Python Script (daily 5am)
|
Database Connection (SQLAlchemy)
|
Hygiene Checks (multiple validation functions)
|
Auto-Fix Logic (safe, deterministic fixes)
|
Exception Flagging (edge cases for human review)
|
Notification System (email/Slack digest)
|
Audit Log (PostgreSQL table for compliance)

Let me walk you through how I built it.

Step 1: Database Connection Setup

First, I set up the database connection using SQLAlchemy:

hygiene_config.py
import sqlalchemy
from sqlalchemy import create_engine, text
from datetime import datetime, timedelta
import logging
import os
# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
# Database connection - use environment variables
DATABASE_URL = os.getenv('DATABASE_URL', 'postgresql://user:pass@host:5432/crm_db')
engine = create_engine(DATABASE_URL)
def get_connection():
return engine.connect()

I learned the hard way to use environment variables. Never hardcode credentials.

Step 2: Hygiene Check Functions

Each hygiene check follows a pattern: query the database, identify issues, categorize them (auto-fix vs flag).

Check 1: Duplicate Emails

check_duplicates.py
def check_duplicate_emails(conn):
"""Find contacts with duplicate email addresses."""
query = text("""
SELECT
email,
array_agg(id) as contact_ids,
array_agg(name) as names,
COUNT(*) as duplicate_count
FROM contacts
WHERE email IS NOT NULL
GROUP BY LOWER(email)
HAVING COUNT(*) > 1
""")
duplicates = conn.execute(query).fetchall()
results = []
for dup in duplicates:
# Flag for merge - too risky to auto-delete
results.append({
'type': 'duplicate_email',
'email': dup.email,
'contact_ids': dup.contact_ids,
'action': 'flag_for_merge',
'priority': 'high'
})
return results

Notice I flagged these instead of auto-fixing. Deleting duplicates is dangerous—one might have important relationship history.

Check 2: Missing Follow-up Dates

check_followup.py
def check_missing_followup_dates(conn):
"""Find active prospects without next follow-up date."""
query = text("""
SELECT id, name, email, stage, last_activity_date
FROM contacts
WHERE stage IN ('prospect', 'qualified', 'demo_scheduled')
AND (next_followup_date IS NULL OR next_followup_date < CURRENT_DATE)
AND status = 'active'
""")
missing_dates = conn.execute(query).fetchall()
results = []
for contact in missing_dates:
days_since_activity = (datetime.now() - contact.last_activity_date).days if contact.last_activity_date else 30
if days_since_activity <= 7:
# Auto-fix: set follow-up to 7 days from now if recently active
new_date = datetime.now() + timedelta(days=7)
results.append({
'type': 'missing_followup',
'contact_id': contact.id,
'action': 'auto_fix',
'new_date': new_date,
'priority': 'medium'
})
else:
# Flag if no recent activity - might be dead lead
results.append({
'type': 'missing_followup_stale',
'contact_id': contact.id,
'action': 'flag_for_review',
'priority': 'high'
})
return results

This is safe to auto-fix for recently active contacts. For stale contacts, I flag them.

Check 3: Stale Pipeline Stages

check_stale.py
def check_stale_pipeline_stages(conn, days_threshold=90):
"""Find contacts stuck in same stage too long."""
query = text("""
SELECT
id, name, email, stage,
stage_entered_date,
CURRENT_DATE - stage_entered_date as days_in_stage
FROM contacts
WHERE stage_entered_date < CURRENT_DATE - INTERVAL '%s days'
AND stage NOT IN ('closed_won', 'closed_lost')
AND status = 'active'
ORDER BY days_in_stage DESC
""" % days_threshold)
stale_contacts = conn.execute(query).fetchall()
results = []
for contact in stale_contacts:
results.append({
'type': 'stale_pipeline_stage',
'contact_id': contact.id,
'stage': contact.stage,
'days_in_stage': contact.days_in_stage,
'action': 'flag_for_review',
'priority': 'high'
})
return results

A deal stuck in “demo scheduled” for 90 days? Something’s wrong. I flag these for review.

Check 4: Timing Violations

check_timing.py
def check_timing_violations(conn, min_days_between=3):
"""Find contacts emailed too recently."""
query = text("""
SELECT
c.id, c.name, c.email,
c.last_email_date,
COUNT(e.id) as emails_last_week
FROM contacts c
LEFT JOIN email_activities e
ON c.id = e.contact_id
AND e.sent_date > CURRENT_DATE - INTERVAL '7 days'
WHERE c.last_email_date > CURRENT_DATE - INTERVAL '%s days'
AND c.status = 'active'
GROUP BY c.id, c.name, c.email, c.last_email_date
HAVING COUNT(e.id) > 2
""" % min_days_between)
violations = conn.execute(query).fetchall()
results = []
for contact in violations:
results.append({
'type': 'timing_violation',
'contact_id': contact.id,
'emails_last_week': contact.emails_last_week,
'last_email_date': contact.last_email_date,
'action': 'flag_for_review',
'priority': 'medium'
})
return results

Contacting prospects too frequently hurts relationships. I flag these for review.

Step 3: Auto-Fix Logic

Only safe, deterministic fixes should be automatic:

auto_fix.py
import json
def apply_auto_fixes(conn, issues):
"""Apply safe, deterministic fixes automatically."""
fixed_count = 0
for issue in issues:
if issue['action'] != 'auto_fix':
continue
try:
if issue['type'] == 'missing_followup':
update_query = text("""
UPDATE contacts
SET next_followup_date = :new_date,
updated_at = CURRENT_TIMESTAMP,
updated_by = 'hygiene_script'
WHERE id = :contact_id
""")
conn.execute(update_query, {
'new_date': issue['new_date'],
'contact_id': issue['contact_id']
})
fixed_count += 1
logger.info(f"Fixed missing follow-up for contact {issue['contact_id']}")
except Exception as e:
logger.error(f"Failed to fix issue {issue}: {e}")
conn.rollback()
conn.commit()
return fixed_count
def log_hygiene_results(conn, issues, fixed_count):
"""Log all findings for audit trail."""
audit_query = text("""
INSERT INTO hygiene_audit_log
(run_date, total_issues, auto_fixed, flagged_for_review, details)
VALUES
(CURRENT_TIMESTAMP, :total, :fixed, :flagged, :details)
""")
flagged_count = len([i for i in issues if i['action'] == 'flag_for_review'])
conn.execute(audit_query, {
'total': len(issues),
'fixed': fixed_count,
'flagged': flagged_count,
'details': json.dumps(issues, default=str)
})
conn.commit()

Every change gets logged with before/after values. This saved me more than once when debugging.

Step 4: Notification System

I send a daily digest instead of individual alerts to avoid notification fatigue:

notifications.py
import smtplib
from email.mime.text import MIMEText
from email.mime.multipart import MIMEMultipart
from datetime import datetime
def send_digest_email(flagged_issues):
"""Send daily digest of issues requiring human review."""
if not flagged_issues:
return
# Group by type
grouped = {}
for issue in flagged_issues:
issue_type = issue['type']
if issue_type not in grouped:
grouped[issue_type] = []
grouped[issue_type].append(issue)
# Build email body
body = f"CRM Data Hygiene Report - {datetime.now().strftime('%Y-%m-%d')}\n\n"
body += f"Total issues requiring review: {len(flagged_issues)}\n\n"
for issue_type, issues in grouped.items():
body += f"\n{issue_type.upper().replace('_', ' ')} ({len(issues)} issues):\n"
for issue in issues[:5]: # Show first 5 of each type
body += f" - Contact ID {issue.get('contact_id')}: {issue}\n"
# Create email
msg = MIMEMultipart()
msg['From'] = '[email protected]'
msg['To'] = '[email protected]'
msg['Subject'] = f'CRM Hygiene Digest - {len(flagged_issues)} issues need attention'
msg.attach(MIMEText(body, 'plain'))
# Uncomment for actual sending
# smtp_server = smtplib.SMTP('smtp.company.com', 587)
# smtp_server.send_message(msg)
print(f"Would send email with {len(flagged_issues)} issues")

Step 5: Main Orchestration

Now put it all together:

run_hygiene.py
def run_daily_hygiene():
"""Main function to run all hygiene checks."""
logger.info("Starting CRM data hygiene run...")
all_issues = []
with get_connection() as conn:
# Run all checks
all_issues.extend(check_duplicate_emails(conn))
all_issues.extend(check_missing_followup_dates(conn))
all_issues.extend(check_stale_pipeline_stages(conn))
all_issues.extend(check_timing_violations(conn))
logger.info(f"Found {len(all_issues)} total issues")
# Apply auto-fixes
fixed_count = apply_auto_fixes(conn, all_issues)
logger.info(f"Auto-fixed {fixed_count} issues")
# Log results
log_hygiene_results(conn, all_issues, fixed_count)
# Send digest for flagged items
flagged = [i for i in all_issues if i['action'] in ('flag_for_review', 'flag_for_merge')]
send_digest_email(flagged)
logger.info("CRM data hygiene run complete")
return all_issues
if __name__ == "__main__":
run_daily_hygiene()

Step 6: Scheduling

I use cron for scheduling. Simple and reliable:

cronjob.sh
# /etc/cron.d/crm-hygiene
# Run at 5am every day
0 5 * * * crm_user /usr/bin/python3 /opt/crm-hygiene/run_hygiene.py >> /var/log/crm-hygiene.log 2>&1

Or use Python’s schedule library for more control:

scheduler.py
import schedule
import time
schedule.every().day.at("05:00").do(run_daily_hygiene)
while True:
schedule.run_pending()
time.sleep(60)

The Results

After implementing this script:

MetricBeforeAfter
Weekly manual cleanup30+ minutes0 minutes
Issues caughtMaybe 70%Near 100%
Data quality consistencyVariableConsistent
Audit trailNoneComplete

Now I just check the daily digest email. The tedious work just… happens.

Common Mistakes I Made

Mistake 1: Over-automation too early

I initially tried to auto-fix duplicates by keeping the most recent record. I deleted an important contact with years of relationship history. Start conservative. Add auto-fixes incrementally.

Mistake 2: No audit trail

My first version made changes without logging. When data looked wrong, I couldn’t explain why. Every update now writes to an audit table with timestamp, script ID, and before/after values.

Mistake 3: Alert fatigue

I sent individual emails per issue at first. People ignored them within a week. Now I send a daily digest grouped by type, showing only the top issues with priority in the subject line.

Mistake 4: Hard-coded thresholds

“90 days in stage” was hard-coded. Every time we wanted to tune it, I had to modify code. Now thresholds live in a config table.

Mistake 5: Running during business hours

The script locked tables and slowed down the CRM for users. I moved it to 5am when traffic is minimal.

Mistake 6: No staging environment

I never tested auto-fix logic against real edge cases. Now I maintain a staging CRM with production-like data for testing.

When to Auto-fix vs Flag

Safe to auto-fix:

  • Missing follow-up dates on recently active contacts
  • Standardized format issues (phone numbers, state abbreviations)
  • Obvious data entry errors with clear correction

Flag for review:

  • Duplicate contacts (which has the relationship history?)
  • Stale pipeline stages (deal might still be alive)
  • Timing violations (might be intentional for hot leads)
  • Any deletion operation

Configuration for Production

Here’s my production-ready configuration:

config.py
import os
from dataclasses import dataclass
@dataclass
class HygieneConfig:
database_url: str = os.getenv('DATABASE_URL')
days_stale_threshold: int = int(os.getenv('DAYS_STALE_THRESHOLD', '90'))
min_days_between_emails: int = int(os.getenv('MIN_DAYS_BETWEEN_EMAILS', '3'))
auto_fix_enabled: bool = os.getenv('AUTO_FIX_ENABLED', 'false').lower() == 'true'
digest_email: str = os.getenv('DIGEST_EMAIL', '[email protected]')
run_time: str = os.getenv('RUN_TIME', '05:00')
config = HygieneConfig()

Summary

Automating CRM data hygiene with Python transforms tedious weekly maintenance into a reliable daily process. Build hygiene check functions for each issue type. Auto-fix only safe, deterministic problems. Flag everything else for human review. Always maintain an audit trail.

Start with read-only checks. Test thoroughly in staging. Add auto-fixes gradually. You will catch more issues than manual reviews ever could while reclaiming hours of your time.

Final Words + More Resources

My intention with this article was to help others share my knowledge and experience. If you want to contact me, you can contact by email: Email me

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!

Comments