How to Securely Redact Text in PDFs with Python
I needed to redact Social Security numbers from a batch of PDF documents before sharing them with a client. My first thought was simple—draw black rectangles over the sensitive text. Problem solved, right?
Wrong. A colleague discovered they could copy-paste the “redacted” text right out from under the black boxes. The SSNs were still there, fully searchable, fully extractable. I had created security theater, not actual redaction.
This is a common and dangerous mistake. Let me show you why visual overlays fail and how to do real redaction in Python.
The Problem with Visual Overlays
When you draw a black rectangle over text in a PDF, you’re just adding a new layer on top. The original content remains intact underneath.
Here’s what I initially tried:
    from io import BytesIO

    from pypdf import PdfReader, PdfWriter
    from reportlab.lib.colors import black
    from reportlab.pdfgen import canvas


    def fake_redact(input_pdf, output_pdf, coords):
        """
        WRONG: This just draws black boxes - NOT secure!
        Text remains searchable and copyable.
        """
        reader = PdfReader(input_pdf)
        writer = PdfWriter()

        for page in reader.pages:
            # Create overlay with black rectangle
            packet = BytesIO()
            # reportlab expects (width, height), not the four-value mediabox
            size = (float(page.mediabox.width), float(page.mediabox.height))
            c = canvas.Canvas(packet, pagesize=size)
            c.setFillColor(black)
            c.rect(coords['x'], coords['y'], coords['width'], coords['height'], fill=1)
            c.save()
            packet.seek(0)

            # Merge overlay onto page
            overlay = PdfReader(packet)
            page.merge_page(overlay.pages[0])
            writer.add_page(page)

        writer.write(output_pdf)


    # Usage - but this is NOT secure!
    fake_redact("sensitive.pdf", "fake_redacted.pdf",
                {'x': 100, 'y': 700, 'width': 200, 'height': 20})

This looked fine on screen. A black box covered the SSN. But here’s how I verified it failed:
    from pypdf import PdfReader


    def check_for_text(pdf_path, search_text):
        reader = PdfReader(pdf_path)
        for page_num, page in enumerate(reader.pages):
            text = page.extract_text()
            if search_text in text:
                print(f"FAIL: '{search_text}' found on page {page_num + 1}!")
                return False
        print(f"PASS: '{search_text}' not found in text extraction")
        return True


    # The "redacted" document still contains the SSN!
    check_for_text("fake_redacted.pdf", "123-45-6789")
    # Output: FAIL: '123-45-6789' found on page 1!

The text was still there. I had created a false sense of security.
Why PDF Structure Preserves Content
PDFs are designed for document fidelity, not security. The format preserves content in multiple layers:
- Content streams: The actual text objects with positioning
- Metadata: Document properties, history, author information
- Hidden layers: Optional content groups, form fields
- Embedded files: Attachments that may contain copies
- Annotations: Comments, highlights, and yes, overlay shapes
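When I drew that black rectangle, I only added new drawing commands on top of the page (merge_page appends them to the page content rather than creating an annotation). The original text stream was untouched.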
    from pypdf import PdfReader


    def inspect_pdf(pdf_path):
        reader = PdfReader(pdf_path)

        print("=== Document Info ===")
        print(reader.metadata)

        print("\n=== Page Content Stream ===")
        for i, page in enumerate(reader.pages):
            print(f"\nPage {i + 1} content:")
            contents = page.get_contents()  # decoded content stream, or None
            if contents is not None:
                # Print at most the first 500 bytes of the decoded stream
                print(contents.get_data()[:500])

        print("\n=== Annotations ===")
        for i, page in enumerate(reader.pages):
            if '/Annots' in page:
                print(f"Page {i + 1} has {len(page['/Annots'])} annotations")


    inspect_pdf("fake_redacted.pdf")

This inspection showed the rectangle I had added sitting on top of the page, with all the original content still intact.
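To see why the overlay cannot hide anything, here is a minimal sketch using only the standard library. It assumes a toy, uncompressed content stream, and the regex in `extract_shown_text` is a hypothetical stand-in for a real text extractor, not how pypdf works internally.

```python
import re

# A toy, uncompressed page content stream: one text object showing an SSN.
original_stream = b"BT /F1 12 Tf 100 705 Td (SSN: 123-45-6789) Tj ET\n"

# What a black-box overlay contributes: set fill to black, draw a rectangle, fill it.
overlay_stream = b"0 0 0 rg 100 700 200 20 re f\n"

# Merging puts the overlay's drawing commands after the existing ones.
merged_stream = original_stream + overlay_stream


def extract_shown_text(stream: bytes) -> list[str]:
    """Return the string operand of every Tj (show text) operator."""
    return [m.decode("latin-1") for m in re.findall(rb"\((.*?)\)\s*Tj", stream)]


print(extract_shown_text(merged_stream))
```

The rectangle is painted later, so it hides the text on screen, but the Tj operator and its string are still in the stream for any extractor to read.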
Real Redaction: Content Removal
True redaction requires permanently removing content from the PDF structure. GoPdfSuit provides this through its internal decryption and scrubbing capabilities.
    from pypdfsuit import PdfRedactor

    redactor = PdfRedactor("sensitive_document.pdf")

    # Define redaction patterns using regex
    redactor.add_redaction(
        pattern=r"\b\d{3}-\d{2}-\d{4}\b",  # SSN pattern
        replacement="[REDACTED]"
    )

    redactor.add_redaction(
        pattern=r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}",  # Email
        replacement="[EMAIL REMOVED]"
    )

    redactor.add_redaction(
        pattern=r"\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b",  # Credit card
        replacement="[CARD REDACTED]"
    )

    # Apply redaction permanently
    redactor.apply("redacted_document.pdf")

The key difference: apply() doesn’t just overlay. It removes the text objects from the content stream and rebuilds the PDF structure.
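Before pointing patterns like these at real documents, it helps to dry-run them on plain strings. The sketch below uses only Python's `re` module; the sample text and the `preview` helper are made up for illustration and are not part of any library.

```python
import re

SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
EMAIL = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")
CARD = re.compile(r"\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b")


def preview(text: str) -> str:
    """Apply the three substitutions to a plain string (no PDF involved)."""
    text = SSN.sub("[REDACTED]", text)
    text = EMAIL.sub("[EMAIL REMOVED]", text)
    text = CARD.sub("[CARD REDACTED]", text)
    return text


sample = ("SSN 123-45-6789, mail jane.doe@example.com, "
          "card 4111 1111 1111 1111, order 1234-56-78901")
print(preview(sample))
```

The order number survives because the word boundaries keep the SSN pattern from matching inside a longer digit run. The card pattern only covers four groups of four digits, so a 15-digit number grouped differently would slip through and needs its own pattern.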
How Internal Decryption Works
The Reddit thread mentioned GoPdfSuit uses “internal decryption” for redaction. Here’s what that means:
- Parse content streams: Decode the PDF’s internal object structure
- Identify text objects: Find TJ and Tj operators (text drawing commands)
- Match patterns: Apply regex to find sensitive content
- Remove objects: Delete the matched text from the stream
- Rebuild PDF: Recreate the document without redacted content
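The steps above can be sketched on a toy content stream with the standard library alone. This is an illustration of the idea, not GoPdfSuit's implementation: it assumes a single Flate-compressed stream, text shown only through simple `(...) Tj` operators, and no font re-encoding. Real PDFs also use TJ arrays, hex strings, and custom encodings, which this does not handle.

```python
import re
import zlib

SSN = re.compile(rb"\b\d{3}-\d{2}-\d{4}\b")
SHOW_TEXT = re.compile(rb"\((.*?)\)\s*Tj")  # literal string followed by Tj


def redact_stream(compressed: bytes, pattern: re.Pattern, replacement: bytes) -> bytes:
    """Decode a stream, scrub matching text operands, and re-encode it."""
    # 1. Parse: decode the Flate-compressed stream.
    data = zlib.decompress(compressed)

    # 2-4. Identify each Tj operand, match the pattern, drop the matched bytes.
    def scrub(match: re.Match) -> bytes:
        cleaned = pattern.sub(replacement, match.group(1))
        return b"(" + cleaned + b") Tj"

    data = SHOW_TEXT.sub(scrub, data)

    # 5. Rebuild: re-compress the cleaned stream.
    return zlib.compress(data)


page = zlib.compress(b"BT /F1 12 Tf 100 705 Td (SSN: 123-45-6789) Tj ET")
print(zlib.decompress(redact_stream(page, SSN, b"[REDACTED]")))
```

The point of the sketch is that the sensitive bytes are gone from the stream itself, so there is nothing left underneath for copy-paste or text extraction to find.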
    from pypdf import PdfReader
    from pypdfsuit import PdfRedactor

    # GoPdfSuit also supports region-based redaction
    redactor = PdfRedactor("contract.pdf")

    # Redact by area (coordinates in points, from bottom-left)
    redactor.add_region_redaction(
        page=1,
        x=100,
        y=700,
        width=200,
        height=20
    )

    # This removes ALL content in that region, not just overlays
    redactor.apply("contract_redacted.pdf")

    # Verify: nothing should be extractable from that region
    reader = PdfReader("contract_redacted.pdf")
    for page in reader.pages:
        print(page.extract_text())
    # The region is now blank - text objects are gone

This is fundamentally different from drawing black boxes. The content is removed from the PDF’s object structure.
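Region-based redaction comes down to a rectangle intersection test in PDF user space (points, origin at the bottom-left). The following is a minimal sketch of that test; the `Box` type, the `split_by_region` helper, and the sample text positions are all hypothetical and only illustrate the geometry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    x: float       # left edge, in points
    y: float       # bottom edge, in points (origin is bottom-left)
    width: float
    height: float

    def intersects(self, other: "Box") -> bool:
        """True if the two rectangles overlap with non-zero area."""
        return (self.x < other.x + other.width and other.x < self.x + self.width
                and self.y < other.y + other.height and other.y < self.y + self.height)


def split_by_region(items, region):
    """Separate (text, box) items into those kept and those removed."""
    kept, removed = [], []
    for text, box in items:
        (removed if box.intersects(region) else kept).append(text)
    return kept, removed


items = [
    ("Account holder", Box(100, 740, 90, 12)),
    ("123-45-6789", Box(110, 704, 70, 12)),
    ("Signature", Box(100, 100, 60, 12)),
]
kept, removed = split_by_region(items, Box(100, 700, 200, 20))
print(kept, removed)
```

Anything that even partially overlaps the region is treated as removed here, which errs on the side of deleting too much rather than leaving half a number behind.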
Metadata Cleanup
Redacting visible text isn’t enough. PDFs store metadata that can leak information:
    from pypdfsuit import PdfRedactor

    redactor = PdfRedactor("document.pdf")

    # Add content redactions
    redactor.add_redaction(pattern=r"Confidential", replacement="[CLASSIFIED]")

    # Clean metadata - this is crucial!
    redactor.clean_metadata()

    # Or set new metadata
    redactor.set_metadata({
        "Title": "Redacted Document",
        "Author": "Redacted",
        "Subject": "Redacted",
        "Creator": "Redaction Process",
        "Producer": "Secure Redaction Tool"
    })

    redactor.apply("fully_redacted.pdf")

Common metadata leaks I’ve encountered:
- Author field containing employee names
- Title revealing document purpose
- Custom fields with client information
- Modification history showing original content
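A quick way to catch leaks like these is to scan every metadata field against the same terms being redacted. Below is a minimal sketch that works on any plain mapping of field names to values, such as the one pypdf exposes as `reader.metadata`; the `find_metadata_leaks` helper and the sample values are invented for illustration.

```python
import re


def find_metadata_leaks(metadata: dict, patterns: list[str]) -> list[tuple[str, str]]:
    """Return (field, pattern) pairs where a field value matches a pattern."""
    leaks = []
    for key, value in metadata.items():
        for pattern in patterns:
            if re.search(pattern, str(value), flags=re.IGNORECASE):
                leaks.append((key, pattern))
    return leaks


sample_metadata = {
    "/Author": "Jane Doe",
    "/Title": "Acme Corp layoff plan",
    "/Producer": "LibreOffice",
    "/ClientRef": "ACME-2024-017",   # a custom field
}
for field, pattern in find_metadata_leaks(sample_metadata, [r"jane\s+doe", r"acme"]):
    print(f"{field} matches {pattern}")
```

Custom fields are checked along with the standard ones because the loop walks whatever keys are present rather than a fixed list.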
Link Scrubbing
Hyperlinks are another overlooked vector. The text might be redacted, but the link target remains:
    from pypdfsuit import PdfRedactor

    redactor = PdfRedactor("report.pdf")

    # Redact text content
    redactor.add_redaction(pattern=r"secret-project", replacement="[PROJECT]")

    # Remove all hyperlinks
    redactor.remove_links()

    # Or remove specific links by pattern
    redactor.remove_links_matching(r".*internal\.company\.com.*")

    redactor.apply("report_clean.pdf")

I once redacted a URL from a document, only to realize the hyperlink was still clickable. The redacted text said [REDACTED], but clicking it opened the original URL.
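The reason this happens is that a link's target lives in a separate annotation object, not in the visible text. Here is a rough sketch of spotting leftover targets by scanning for `/URI` actions. It assumes the annotation dictionaries are stored uncompressed, which is often not the case when a PDF uses object streams, so an empty result from this check proves nothing on its own. The `find_link_targets` helper and the sample bytes are made up for illustration.

```python
import re

# A URI action stores its target as a literal string after the /URI key.
URI_ACTION = re.compile(rb"/URI\s*\((.*?)\)")


def find_link_targets(pdf_bytes: bytes) -> list[str]:
    """Return every URI action target found in uncompressed PDF bytes."""
    return [m.decode("latin-1") for m in URI_ACTION.findall(pdf_bytes)]


# The visible text was replaced, but the link annotation was left alone.
sample = (b"BT /F1 12 Tf 100 705 Td ([REDACTED]) Tj ET\n"
          b"<< /Type /Annot /Subtype /Link /Rect [100 700 300 720] "
          b"/A << /S /URI /URI (https://internal.company.com/secret-project) >> >>")
print(find_link_targets(sample))
```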
Verification: The Most Important Step
Never trust redaction without verification. Here’s my verification checklist:
    import re

    from pypdf import PdfReader


    def verify_redaction(pdf_path, sensitive_patterns):
        """
        Comprehensive redaction verification.
        Each entry in sensitive_patterns is treated as a regex.
        Returns True if nothing sensitive is found, False otherwise.
        """
        reader = PdfReader(pdf_path)
        errors = []

        for page_num, page in enumerate(reader.pages, 1):
            # 1. Text extraction test
            text = page.extract_text() or ""
            # 2. Raw content stream test (decoded bytes, not the object repr)
            contents = page.get_contents()
            raw_content = contents.get_data().decode("latin-1") if contents is not None else ""
            for pattern in sensitive_patterns:
                if re.search(pattern, text):
                    errors.append(f"Text '{pattern}' found on page {page_num}")
                if re.search(pattern, raw_content):
                    errors.append(f"Raw content '{pattern}' on page {page_num}")

        # 3. Metadata test
        if reader.metadata:
            for key, value in reader.metadata.items():
                for pattern in sensitive_patterns:
                    if re.search(pattern, str(value)):
                        errors.append(f"Metadata '{key}' contains '{pattern}'")

        if errors:
            print("VERIFICATION FAILED:")
            for error in errors:
                print(f"  - {error}")
            return False

        print("VERIFICATION PASSED: No sensitive content found")
        return True


    # Usage
    verify_redaction("redacted_document.pdf", [r"\b\d{3}-\d{2}-\d{4}\b"])

For command-line verification (essential for production):

    # Extract text and search for sensitive terms
    pdftotext redacted_document.pdf - | grep -i "ssn\|confidential\|secret"

    # Check metadata
    pdfinfo redacted_document.pdf

    # Deep inspection with qpdf
    qpdf --show-object=1 redacted_document.pdf

    # List page objects and their content streams
    qpdf --show-pages redacted_document.pdf

Common Mistakes I Made
Mistake 1: Redacting without verifying
I assumed the redaction worked because the black boxes looked right. Always verify programmatically.
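A related trap is "verifying" by searching the PDF file's raw bytes. Content streams are usually Flate-compressed, so a byte search can come back empty while the text is fully recoverable. Here is a small standard-library sketch of that effect on a toy stream; the stream contents are made up for illustration.

```python
import zlib

content = (
    b"BT /F1 12 Tf 100 705 Td (Name: Jane Doe) Tj ET\n"
    b"BT /F1 12 Tf 100 690 Td (SSN: 123-45-6789) Tj ET\n"
    b"BT /F1 12 Tf 100 675 Td (Status: Active) Tj ET\n"
)
stored = zlib.compress(content)  # roughly what sits in the file on disk

needle = b"123-45-6789"
found_in_stored_bytes = needle in stored
found_after_decoding = needle in zlib.decompress(stored)

print("byte search on stored stream:", found_in_stored_bytes)
print("search after decoding:       ", found_after_decoding)
```

This is why a verification step should decode the streams, or go through a real text extractor, instead of grepping the file as stored.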
    # WRONG: Trust without verification
    redactor.apply("output.pdf")
    # Hope it worked...

    # RIGHT: Always verify
    redactor.apply("output.pdf")
    assert verify_redaction("output.pdf", sensitive_patterns), "Redaction failed!"

Mistake 2: Forgetting embedded images
Text in images isn’t redacted by text-based patterns. I had to OCR and redact separately:
    from pypdfsuit import PdfRedactor

    redactor = PdfRedactor("scanned_doc.pdf")

    # For scanned documents, use OCR-based redaction
    redactor.enable_ocr_redaction()
    redactor.add_redaction(pattern=r"\b\d{3}-\d{2}-\d{4}\b", replacement="[REDACTED]")

    redactor.apply("scanned_redacted.pdf")

Mistake 3: Ignoring fonts
Some redaction tools replace text with spaces but keep the font, which can hint at word length:
    # Ensure replacement text doesn't leak information
    redactor.add_redaction(
        pattern=r"\b\d{3}-\d{2}-\d{4}\b",
        replacement="[XXX-XX-XXXX]"  # Wrong: reveals format
    )

    # Better: use consistent replacement
    redactor.add_redaction(
        pattern=r"\b\d{3}-\d{2}-\d{4}\b",
        replacement="[REDACTED]"  # Right: no format hints
    )

Mistake 4: Not handling page breaks
Long content spanning pages might be partially redacted:
    # GoPdfSuit handles cross-page content automatically
    redactor = PdfRedactor("document.pdf")

    # Pattern matching works across page boundaries
    redactor.add_redaction(
        pattern=r"BEGIN_SECRET.*?END_SECRET",  # Multi-line pattern
        replacement="[CONTENT REDACTED]",
        multiline=True
    )

    redactor.apply("document_redacted.pdf")

Legal and Compliance Considerations
The Reddit moderators made an important point: “security-centric programs… should not be treated as a security solution unless they’ve been audited by a third party.”
For compliance (GDPR, HIPAA, etc.):
- Audit trails: Log all redaction operations
- Verification records: Store verification results
- Third-party audit: For production use, get security review
- Documentation: Maintain redaction policies and procedures
    import json
    import os
    from datetime import datetime, timezone

    from pypdfsuit import PdfRedactor


    def redact_with_audit(input_path, output_path, patterns, auditor=None):
        audit_record = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "input_file": input_path,
            "output_file": output_path,
            "patterns": [p['pattern'] for p in patterns],
            "status": "pending"
        }

        try:
            redactor = PdfRedactor(input_path)

            for p in patterns:
                redactor.add_redaction(pattern=p['pattern'], replacement=p['replacement'])

            redactor.apply(output_path)

            # Verify (verify_redaction is the function from the verification section)
            verification = verify_redaction(output_path, [p['pattern'] for p in patterns])

            audit_record["status"] = "success" if verification else "failed"
            audit_record["verified"] = verification

        except Exception as e:
            audit_record["status"] = "error"
            audit_record["error"] = str(e)
            raise

        finally:
            # Write audit log, even when redaction raised
            os.makedirs("audit", exist_ok=True)
            stamp = datetime.now(timezone.utc).strftime('%Y%m%d_%H%M%S')
            with open(f"audit/{stamp}.json", 'w') as f:
                json.dump(audit_record, f, indent=2)

        return audit_record

Alternatives to GoPdfSuit
For environments where GoPdfSuit isn’t suitable:
    # PyMuPDF (fitz) - has redaction support
    import fitz  # PyMuPDF

    doc = fitz.open("document.pdf")
    page = doc[0]

    # Add redaction annotation
    page.add_redact_annot((100, 700, 300, 720), fill=(0, 0, 0))

    # Apply redactions
    page.apply_redactions()
    doc.save("redacted.pdf")

    # PDFTron - commercial solution with redaction
    # from pdftron.PDF import PDFDoc
    # (commercial license required)

PyMuPDF’s redaction is more limited than GoPdfSuit: apply_redactions() does not clean document metadata, and links outside the redacted rectangles are left in place, so both need separate steps.
Bottom Line
Drawing black boxes over PDF text isn’t redaction. It’s decoration that creates false confidence.
True redaction requires:
- Removing text from content streams
- Cleaning metadata
- Scrubbing links
- Rebuilding PDF structure
- Verifying the result
GoPdfSuit handles all of this through internal decryption. But always verify your output, and for production systems handling sensitive data, consider a third-party security audit.
The security of your redacted documents depends not on how they look, but on what’s actually left in the file.
Final Words + More Resources
My intention with this article was to share my knowledge and experience so that others can avoid the same mistake. If you want to contact me, you can reach me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
- GoPdfSuit GitHub
- Reddit r/Python Discussion
- PDF Specification (ISO 32000)
- qpdf Command Line Tool
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!