How Do You Automate CRM Data Hygiene with Python Scripts?
The Problem
Every Monday morning, I spent 30 minutes cleaning up our CRM database. I manually checked for duplicate contacts, missing follow-up dates, and stale pipeline stages. It was tedious, error-prone, and I still missed things.
The worst part? I couldn’t scale this. As our sales team grew, so did the data quality issues. Deals fell through because follow-up dates were missing. Sales reps wasted time calling duplicate contacts. Trust in the CRM plummeted.
I needed automation. But not just any automation—I needed something that wouldn’t accidentally delete critical data.
The Solution: A Python Hygiene Script
I built a Python script that runs daily at 5am. It scans our CRM for common issues, auto-fixes what’s safe to fix, and flags everything else for human review.
Here’s the basic architecture:
Scheduled Python Script (daily 5am) |Database Connection (SQLAlchemy) |Hygiene Checks (multiple validation functions) |Auto-Fix Logic (safe, deterministic fixes) |Exception Flagging (edge cases for human review) |Notification System (email/Slack digest) |Audit Log (PostgreSQL table for compliance)Let me walk you through how I built it.
Step 1: Database Connection Setup
First, I set up the database connection using SQLAlchemy:
import sqlalchemyfrom sqlalchemy import create_engine, textfrom datetime import datetime, timedeltaimport loggingimport os
# Configure logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__)
# Database connection - use environment variablesDATABASE_URL = os.getenv('DATABASE_URL', 'postgresql://user:pass@host:5432/crm_db')engine = create_engine(DATABASE_URL)
def get_connection(): return engine.connect()I learned the hard way to use environment variables. Never hardcode credentials.
Step 2: Hygiene Check Functions
Each hygiene check follows a pattern: query the database, identify issues, categorize them (auto-fix vs flag).
Check 1: Duplicate Emails
def check_duplicate_emails(conn): """Find contacts with duplicate email addresses.""" query = text(""" SELECT email, array_agg(id) as contact_ids, array_agg(name) as names, COUNT(*) as duplicate_count FROM contacts WHERE email IS NOT NULL GROUP BY LOWER(email) HAVING COUNT(*) > 1 """)
duplicates = conn.execute(query).fetchall() results = []
for dup in duplicates: # Flag for merge - too risky to auto-delete results.append({ 'type': 'duplicate_email', 'email': dup.email, 'contact_ids': dup.contact_ids, 'action': 'flag_for_merge', 'priority': 'high' })
return resultsNotice I flagged these instead of auto-fixing. Deleting duplicates is dangerous—one might have important relationship history.
Check 2: Missing Follow-up Dates
def check_missing_followup_dates(conn): """Find active prospects without next follow-up date.""" query = text(""" SELECT id, name, email, stage, last_activity_date FROM contacts WHERE stage IN ('prospect', 'qualified', 'demo_scheduled') AND (next_followup_date IS NULL OR next_followup_date < CURRENT_DATE) AND status = 'active' """)
missing_dates = conn.execute(query).fetchall() results = []
for contact in missing_dates: days_since_activity = (datetime.now() - contact.last_activity_date).days if contact.last_activity_date else 30
if days_since_activity <= 7: # Auto-fix: set follow-up to 7 days from now if recently active new_date = datetime.now() + timedelta(days=7) results.append({ 'type': 'missing_followup', 'contact_id': contact.id, 'action': 'auto_fix', 'new_date': new_date, 'priority': 'medium' }) else: # Flag if no recent activity - might be dead lead results.append({ 'type': 'missing_followup_stale', 'contact_id': contact.id, 'action': 'flag_for_review', 'priority': 'high' })
return resultsThis is safe to auto-fix for recently active contacts. For stale contacts, I flag them.
Check 3: Stale Pipeline Stages
def check_stale_pipeline_stages(conn, days_threshold=90): """Find contacts stuck in same stage too long.""" query = text(""" SELECT id, name, email, stage, stage_entered_date, CURRENT_DATE - stage_entered_date as days_in_stage FROM contacts WHERE stage_entered_date < CURRENT_DATE - INTERVAL '%s days' AND stage NOT IN ('closed_won', 'closed_lost') AND status = 'active' ORDER BY days_in_stage DESC """ % days_threshold)
stale_contacts = conn.execute(query).fetchall() results = []
for contact in stale_contacts: results.append({ 'type': 'stale_pipeline_stage', 'contact_id': contact.id, 'stage': contact.stage, 'days_in_stage': contact.days_in_stage, 'action': 'flag_for_review', 'priority': 'high' })
return resultsA deal stuck in “demo scheduled” for 90 days? Something’s wrong. I flag these for review.
Check 4: Timing Violations
def check_timing_violations(conn, min_days_between=3): """Find contacts emailed too recently.""" query = text(""" SELECT c.id, c.name, c.email, c.last_email_date, COUNT(e.id) as emails_last_week FROM contacts c LEFT JOIN email_activities e ON c.id = e.contact_id AND e.sent_date > CURRENT_DATE - INTERVAL '7 days' WHERE c.last_email_date > CURRENT_DATE - INTERVAL '%s days' AND c.status = 'active' GROUP BY c.id, c.name, c.email, c.last_email_date HAVING COUNT(e.id) > 2 """ % min_days_between)
violations = conn.execute(query).fetchall() results = []
for contact in violations: results.append({ 'type': 'timing_violation', 'contact_id': contact.id, 'emails_last_week': contact.emails_last_week, 'last_email_date': contact.last_email_date, 'action': 'flag_for_review', 'priority': 'medium' })
return resultsContacting prospects too frequently hurts relationships. I flag these for review.
Step 3: Auto-Fix Logic
Only safe, deterministic fixes should be automatic:
import json
def apply_auto_fixes(conn, issues): """Apply safe, deterministic fixes automatically.""" fixed_count = 0
for issue in issues: if issue['action'] != 'auto_fix': continue
try: if issue['type'] == 'missing_followup': update_query = text(""" UPDATE contacts SET next_followup_date = :new_date, updated_at = CURRENT_TIMESTAMP, updated_by = 'hygiene_script' WHERE id = :contact_id """) conn.execute(update_query, { 'new_date': issue['new_date'], 'contact_id': issue['contact_id'] }) fixed_count += 1 logger.info(f"Fixed missing follow-up for contact {issue['contact_id']}")
except Exception as e: logger.error(f"Failed to fix issue {issue}: {e}") conn.rollback()
conn.commit() return fixed_count
def log_hygiene_results(conn, issues, fixed_count): """Log all findings for audit trail.""" audit_query = text(""" INSERT INTO hygiene_audit_log (run_date, total_issues, auto_fixed, flagged_for_review, details) VALUES (CURRENT_TIMESTAMP, :total, :fixed, :flagged, :details) """)
flagged_count = len([i for i in issues if i['action'] == 'flag_for_review'])
conn.execute(audit_query, { 'total': len(issues), 'fixed': fixed_count, 'flagged': flagged_count, 'details': json.dumps(issues, default=str) }) conn.commit()Every change gets logged with before/after values. This saved me more than once when debugging.
Step 4: Notification System
I send a daily digest instead of individual alerts to avoid notification fatigue:
import smtplibfrom email.mime.text import MIMETextfrom email.mime.multipart import MIMEMultipartfrom datetime import datetime
def send_digest_email(flagged_issues): """Send daily digest of issues requiring human review.""" if not flagged_issues: return
# Group by type grouped = {} for issue in flagged_issues: issue_type = issue['type'] if issue_type not in grouped: grouped[issue_type] = [] grouped[issue_type].append(issue)
# Build email body body = f"CRM Data Hygiene Report - {datetime.now().strftime('%Y-%m-%d')}\n\n" body += f"Total issues requiring review: {len(flagged_issues)}\n\n"
for issue_type, issues in grouped.items(): body += f"\n{issue_type.upper().replace('_', ' ')} ({len(issues)} issues):\n" for issue in issues[:5]: # Show first 5 of each type body += f" - Contact ID {issue.get('contact_id')}: {issue}\n"
# Create email msg = MIMEMultipart() msg['Subject'] = f'CRM Hygiene Digest - {len(flagged_issues)} issues need attention' msg.attach(MIMEText(body, 'plain'))
# Uncomment for actual sending # smtp_server = smtplib.SMTP('smtp.company.com', 587) # smtp_server.send_message(msg) print(f"Would send email with {len(flagged_issues)} issues")Step 5: Main Orchestration
Now put it all together:
def run_daily_hygiene(): """Main function to run all hygiene checks.""" logger.info("Starting CRM data hygiene run...")
all_issues = []
with get_connection() as conn: # Run all checks all_issues.extend(check_duplicate_emails(conn)) all_issues.extend(check_missing_followup_dates(conn)) all_issues.extend(check_stale_pipeline_stages(conn)) all_issues.extend(check_timing_violations(conn))
logger.info(f"Found {len(all_issues)} total issues")
# Apply auto-fixes fixed_count = apply_auto_fixes(conn, all_issues) logger.info(f"Auto-fixed {fixed_count} issues")
# Log results log_hygiene_results(conn, all_issues, fixed_count)
# Send digest for flagged items flagged = [i for i in all_issues if i['action'] in ('flag_for_review', 'flag_for_merge')] send_digest_email(flagged)
logger.info("CRM data hygiene run complete") return all_issues
if __name__ == "__main__": run_daily_hygiene()Step 6: Scheduling
I use cron for scheduling. Simple and reliable:
# /etc/cron.d/crm-hygiene# Run at 5am every day0 5 * * * crm_user /usr/bin/python3 /opt/crm-hygiene/run_hygiene.py >> /var/log/crm-hygiene.log 2>&1Or use Python’s schedule library for more control:
import scheduleimport time
schedule.every().day.at("05:00").do(run_daily_hygiene)
while True: schedule.run_pending() time.sleep(60)The Results
After implementing this script:
| Metric | Before | After |
|---|---|---|
| Weekly manual cleanup | 30+ minutes | 0 minutes |
| Issues caught | Maybe 70% | Near 100% |
| Data quality consistency | Variable | Consistent |
| Audit trail | None | Complete |
Now I just check the daily digest email. The tedious work just… happens.
Common Mistakes I Made
Mistake 1: Over-automation too early
I initially tried to auto-fix duplicates by keeping the most recent record. I deleted an important contact with years of relationship history. Start conservative. Add auto-fixes incrementally.
Mistake 2: No audit trail
My first version made changes without logging. When data looked wrong, I couldn’t explain why. Every update now writes to an audit table with timestamp, script ID, and before/after values.
Mistake 3: Alert fatigue
I sent individual emails per issue at first. People ignored them within a week. Now I send a daily digest grouped by type, showing only the top issues with priority in the subject line.
Mistake 4: Hard-coded thresholds
“90 days in stage” was hard-coded. Every time we wanted to tune it, I had to modify code. Now thresholds live in a config table.
Mistake 5: Running during business hours
The script locked tables and slowed down the CRM for users. I moved it to 5am when traffic is minimal.
Mistake 6: No staging environment
I never tested auto-fix logic against real edge cases. Now I maintain a staging CRM with production-like data for testing.
When to Auto-fix vs Flag
Safe to auto-fix:
- Missing follow-up dates on recently active contacts
- Standardized format issues (phone numbers, state abbreviations)
- Obvious data entry errors with clear correction
Flag for review:
- Duplicate contacts (which has the relationship history?)
- Stale pipeline stages (deal might still be alive)
- Timing violations (might be intentional for hot leads)
- Any deletion operation
Configuration for Production
Here’s my production-ready configuration:
import osfrom dataclasses import dataclass
@dataclassclass HygieneConfig: database_url: str = os.getenv('DATABASE_URL') days_stale_threshold: int = int(os.getenv('DAYS_STALE_THRESHOLD', '90')) min_days_between_emails: int = int(os.getenv('MIN_DAYS_BETWEEN_EMAILS', '3')) auto_fix_enabled: bool = os.getenv('AUTO_FIX_ENABLED', 'false').lower() == 'true' run_time: str = os.getenv('RUN_TIME', '05:00')
config = HygieneConfig()Summary
Automating CRM data hygiene with Python transforms tedious weekly maintenance into a reliable daily process. Build hygiene check functions for each issue type. Auto-fix only safe, deterministic problems. Flag everything else for human review. Always maintain an audit trail.
Start with read-only checks. Test thoroughly in staging. Add auto-fixes gradually. You will catch more issues than manual reviews ever could while reclaiming hours of your time.
Final Words + More Resources
My intention with this article was to help others share my knowledge and experience. If you want to contact me, you can contact by email: Email me
Here are also the most important links from this article along with some further resources that will help you in this scope:
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!
Comments