
How to Securely Redact Text in PDFs with Python

I needed to redact Social Security numbers from a batch of PDF documents before sharing them with a client. My first thought was simple—draw black rectangles over the sensitive text. Problem solved, right?

Wrong. A colleague discovered they could copy-paste the “redacted” text right out from under the black boxes. The SSNs were still there, fully searchable, fully extractable. I had created security theater, not actual redaction.

This is a common and dangerous mistake. Let me show you why visual overlays fail and how to do real redaction in Python.

The Problem with Visual Overlays

When you draw a black rectangle over text in a PDF, you’re just adding a new layer on top. The original content remains intact underneath.

Here’s what I initially tried:

fake_redact.py
from io import BytesIO

from pypdf import PdfReader, PdfWriter
from reportlab.lib.colors import black
from reportlab.pdfgen import canvas


def fake_redact(input_pdf, output_pdf, coords):
    """
    WRONG: This just draws black boxes - NOT secure!
    Text remains searchable and copyable.
    """
    reader = PdfReader(input_pdf)
    writer = PdfWriter()
    for page in reader.pages:
        # Create overlay with black rectangle.
        # reportlab expects pagesize as (width, height), not the 4-value mediabox.
        packet = BytesIO()
        page_size = (float(page.mediabox.width), float(page.mediabox.height))
        c = canvas.Canvas(packet, pagesize=page_size)
        c.setFillColor(black)
        c.rect(coords['x'], coords['y'], coords['width'], coords['height'], fill=1)
        c.save()
        packet.seek(0)
        # Merge overlay onto page
        overlay = PdfReader(packet)
        page.merge_page(overlay.pages[0])
        writer.add_page(page)
    writer.write(output_pdf)


# Usage - but this is NOT secure!
fake_redact("sensitive.pdf", "fake_redacted.pdf", {'x': 100, 'y': 700, 'width': 200, 'height': 20})

This looked fine on screen. A black box covered the SSN. But here’s how I verified it failed:

verify_failure.py
from pypdf import PdfReader


def check_for_text(pdf_path, search_text):
    reader = PdfReader(pdf_path)
    for page_num, page in enumerate(reader.pages):
        text = page.extract_text()
        if search_text in text:
            print(f"FAIL: '{search_text}' found on page {page_num + 1}!")
            return False
    print(f"PASS: '{search_text}' not found in text extraction")
    return True


# The "redacted" document still contains the SSN!
check_for_text("fake_redacted.pdf", "123-45-6789")
# Output: FAIL: '123-45-6789' found on page 1!

The text was still there. I had created a false sense of security.

Why PDF Structure Preserves Content

PDFs are designed for document fidelity, not security. The format preserves content in multiple layers:

  1. Content streams: The actual text objects with positioning
  2. Metadata: Document properties, history, author information
  3. Hidden layers: Optional content groups, form fields
  4. Embedded files: Attachments that may contain copies
  5. Annotations: Comments, highlights, and yes, overlay shapes

When I merged that black rectangle onto the page, I only added drawing operators on top of what was already there. The original text stream was untouched.
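To see why the text survives, here is a minimal sketch on a hand-written, uncompressed content stream. The operators are standard PDF syntax, but the stream itself is made up for illustration, and a real merge also wraps things in save/restore operators. The point is only that an overlay adds bytes and rewrites nothing:

```python
# A made-up, uncompressed page content stream: one line of text.
original_stream = b"BT /F1 12 Tf 100 705 Td (SSN: 123-45-6789) Tj ET\n"

# What an overlay contributes: set fill colour to black, define a rectangle, fill it.
overlay_ops = b"0 0 0 rg 100 700 200 20 re f\n"

# Merging appends the overlay; nothing in the original stream is rewritten.
merged_stream = original_stream + overlay_ops

text_still_present = b"123-45-6789" in merged_stream
print(text_still_present)  # True
```

The rectangle is painted later than the text, so it hides the glyphs on screen, but any tool that reads the stream still sees the string.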

inspect_pdf_structure.py
from pypdf import PdfReader


def inspect_pdf(pdf_path):
    reader = PdfReader(pdf_path)
    print("=== Document Info ===")
    print(reader.metadata)
    print("\n=== Page Content Stream ===")
    for i, page in enumerate(reader.pages):
        print(f"\nPage {i + 1} content:")
        # get_contents() returns the page's decoded content stream, or None
        contents = page.get_contents()
        if contents is not None:
            print(contents.get_data()[:500])  # first 500 bytes of operators
    print("\n=== Annotations ===")
    for i, page in enumerate(reader.pages):
        if '/Annots' in page:
            print(f"Page {i + 1} has {len(page['/Annots'])} annotations")


inspect_pdf("fake_redacted.pdf")

This inspection showed the original text operators still intact, with the rectangle’s drawing operators sitting alongside them in the page content.

Real Redaction: Content Removal

True redaction requires permanently removing content from the PDF structure. GoPdfSuit provides this through its internal decryption and scrubbing capabilities.

real_redaction.py
from pypdfsuit import PdfRedactor

redactor = PdfRedactor("sensitive_document.pdf")
# Define redaction patterns using regex
redactor.add_redaction(
    pattern=r"\b\d{3}-\d{2}-\d{4}\b",  # SSN pattern
    replacement="[REDACTED]"
)
redactor.add_redaction(
    pattern=r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}",  # Email
    replacement="[EMAIL REMOVED]"
)
redactor.add_redaction(
    pattern=r"\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b",  # Credit card
    replacement="[CARD REDACTED]"
)
# Apply redaction permanently
redactor.apply("redacted_document.pdf")

The key difference: apply() doesn’t just overlay—it removes the text objects from the content stream and rebuilds the PDF structure.

How Internal Decryption Works

The Reddit thread mentioned GoPdfSuit uses “internal decryption” for redaction. Here’s what that means:

  1. Parse content streams: Decode the PDF’s internal object structure
  2. Identify text objects: Find TJ and Tj operators (text drawing commands)
  3. Match patterns: Apply regex to find sensitive content
  4. Remove objects: Delete the matched text from the stream
  5. Rebuild PDF: Recreate the document without redacted content

understanding_redaction.py
from pypdf import PdfReader
from pypdfsuit import PdfRedactor

# GoPdfSuit also supports region-based redaction
redactor = PdfRedactor("contract.pdf")
# Redact by area (coordinates in points, from bottom-left)
redactor.add_region_redaction(
    page=1,
    x=100,
    y=700,
    width=200,
    height=20
)
# This removes ALL content in that region, not just overlays
redactor.apply("contract_redacted.pdf")
# Verify: nothing should be extractable from that region
reader = PdfReader("contract_redacted.pdf")
print(reader.pages[0].extract_text())
# The region is now blank - text objects are gone

This is fundamentally different from drawing black boxes. The content is removed from the PDF’s object structure.
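The parse, match, remove steps listed earlier can be sketched on a toy content stream. This is not GoPdfSuit's implementation, just an illustration under loud assumptions: the stream is uncompressed, text is shown with literal strings and the Tj operator only (no TJ arrays, hex strings, escaped parentheses, or custom font encodings), and `scrub_stream` is a hypothetical helper name.

```python
import re

# Matches a literal-string text-showing operator such as "(hello) Tj".
# Assumption: no nested or escaped parentheses inside the string.
TJ_LITERAL = re.compile(rb"\(([^()\\]*)\)\s*Tj")


def scrub_stream(stream: bytes, pattern: str) -> bytes:
    """Drop every '(...) Tj' operator whose string matches `pattern`."""
    sensitive = re.compile(pattern.encode("latin-1"))

    def drop_if_sensitive(match: re.Match) -> bytes:
        shown_text = match.group(1)
        if sensitive.search(shown_text):
            return b""          # remove the operator entirely
        return match.group(0)   # keep harmless text untouched

    return TJ_LITERAL.sub(drop_if_sensitive, stream)


stream = (
    b"BT /F1 12 Tf 100 720 Td (Name: Jane Doe) Tj "
    b"0 -15 Td (SSN: 123-45-6789) Tj ET"
)
cleaned = scrub_stream(stream, r"\d{3}-\d{2}-\d{4}")
print(cleaned)
```

After this pass the SSN bytes no longer exist in the stream, which is the property a black rectangle can never give you. A real tool also has to decompress streams, decode fonts, and rewrite the file so the old objects are not left behind.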

Metadata Cleanup

Redacting visible text isn’t enough. PDFs store metadata that can leak information:

metadata_cleanup.py
from pypdfsuit import PdfRedactor

redactor = PdfRedactor("document.pdf")
# Add content redactions
redactor.add_redaction(pattern=r"Confidential", replacement="[CLASSIFIED]")
# Clean metadata - this is crucial!
redactor.clean_metadata()
# Or set new metadata
redactor.set_metadata({
    "Title": "Redacted Document",
    "Author": "Redacted",
    "Subject": "Redacted",
    "Creator": "Redaction Process",
    "Producer": "Secure Redaction Tool"
})
redactor.apply("fully_redacted.pdf")

Common metadata leaks I’ve encountered:

  • Author field containing employee names
  • Title revealing document purpose
  • Custom fields with client information
  • Modification history showing original content
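A quick way to catch leaks like these is to scan every metadata field, including custom ones, against the same patterns used for the body text. The sketch below works on a plain dict shaped like what a PDF library returns for document info; `find_metadata_leaks` and the sample values are hypothetical.

```python
import re


def find_metadata_leaks(metadata: dict, patterns: list[str]) -> list[tuple[str, str]]:
    """Return (field, pattern) pairs where a metadata value matches a pattern."""
    leaks = []
    for field, value in metadata.items():
        for pattern in patterns:
            if re.search(pattern, str(value), flags=re.IGNORECASE):
                leaks.append((field, pattern))
    return leaks


# Hypothetical metadata, shaped like the dict a PDF library would return.
sample_metadata = {
    "/Author": "Jane Doe",
    "/Title": "Acme Corp settlement draft",
    "/ClientRef": "ACME-2291",
    "/Producer": "Some PDF Library",
}
leaks = find_metadata_leaks(sample_metadata, [r"jane\s+doe", r"acme"])
for field, pattern in leaks:
    print(f"{field} matches {pattern}")
```

Iterating over all keys rather than a fixed list of standard fields matters here, because custom fields such as the made-up `/ClientRef` are exactly the ones that get forgotten.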

Hyperlinks are another overlooked vector. The text might be redacted, but the link target remains:

link_scrubbing.py
from pypdfsuit import PdfRedactor
redactor = PdfRedactor("report.pdf")
# Redact text content
redactor.add_redaction(pattern=r"secret-project", replacement="[PROJECT]")
# Remove all hyperlinks
redactor.remove_links()
# Or remove specific links by pattern
redactor.remove_links_matching(r".*internal\.company\.com.*")
redactor.apply("report_clean.pdf")

I once redacted a URL from a document, only to realize the hyperlink was still clickable. The redacted text said [REDACTED], but clicking it opened the original URL.
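A crude check for leftover link targets is to scan the raw file bytes for `/URI` entries, which is how link actions store their destination. This is only a sketch: it sees literal strings in uncompressed objects and will miss targets stored inside compressed object streams, so treat an empty result as inconclusive. The helper name and the sample fragment are made up.

```python
import re

# Link actions carry their target as "/URI (...)".
# Assumption: the target is a literal string without nested parentheses.
URI_ENTRY = re.compile(rb"/URI\s*\(([^)]*)\)")


def find_link_targets(pdf_bytes: bytes) -> list[str]:
    """Return every literal /URI target found in the raw file bytes."""
    return [m.group(1).decode("latin-1") for m in URI_ENTRY.finditer(pdf_bytes)]


# Hand-written fragment of a link annotation, for illustration only.
fragment = (
    b"<< /Type /Annot /Subtype /Link /Rect [100 700 300 720] "
    b"/A << /S /URI /URI (https://internal.company.com/secret-project) >> >>"
)
targets = find_link_targets(fragment)
print(targets)
```

On a real file you would pass in the bytes read from disk, for example `find_link_targets(open("report_clean.pdf", "rb").read())`, and fail the check if any target matches a sensitive pattern.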

Verification: The Most Important Step

Never trust redaction without verification. Here’s my verification checklist:

verify_redaction.py
import re

from pypdf import PdfReader


def verify_redaction(pdf_path, sensitive_patterns):
    """
    Comprehensive redaction verification.
    Each entry in sensitive_patterns is a regular expression.
    Returns True if nothing matched, False otherwise.
    """
    reader = PdfReader(pdf_path)
    errors = []
    # 1. Text extraction test
    for page_num, page in enumerate(reader.pages, 1):
        text = page.extract_text() or ""
        for pattern in sensitive_patterns:
            if re.search(pattern, text):
                errors.append(f"Text '{pattern}' found on page {page_num}")
    # 2. Raw content stream test (decoded operators, not the object's repr)
    for page_num, page in enumerate(reader.pages, 1):
        contents = page.get_contents()
        if contents is not None:
            raw_content = contents.get_data().decode("latin-1")
            for pattern in sensitive_patterns:
                if re.search(pattern, raw_content):
                    errors.append(f"Raw content '{pattern}' on page {page_num}")
    # 3. Metadata test
    if reader.metadata:
        for key, value in reader.metadata.items():
            for pattern in sensitive_patterns:
                if re.search(pattern, str(value)):
                    errors.append(f"Metadata '{key}' contains '{pattern}'")
    if errors:
        print("VERIFICATION FAILED:")
        for error in errors:
            print(f" - {error}")
        return False
    print("VERIFICATION PASSED: No sensitive content found")
    return True


# Usage: escape literal strings so they are matched verbatim
literals = ["123-45-6789", "[email protected]", "CONFIDENTIAL"]
verify_redaction("redacted.pdf", [re.escape(s) for s in literals])
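One caveat with searching raw content streams: PDF producers often split a line of text into several string fragments inside a TJ array to adjust kerning, so a sensitive value may never appear as one contiguous run of bytes. The sketch below joins the fragments of each TJ array before searching. It assumes an uncompressed stream with literal strings and no escapes, and `joined_tj_text` is a hypothetical helper.

```python
import re

TJ_ARRAY = re.compile(rb"\[(.*?)\]\s*TJ", re.DOTALL)
LITERAL = re.compile(rb"\(([^()\\]*)\)")


def joined_tj_text(stream: bytes) -> list[str]:
    """Concatenate the literal string pieces inside each TJ array."""
    lines = []
    for array in TJ_ARRAY.finditer(stream):
        pieces = LITERAL.findall(array.group(1))
        lines.append(b"".join(pieces).decode("latin-1"))
    return lines


# Made-up stream: the SSN is split in two by a kerning adjustment (-20).
stream = b"BT /F1 12 Tf 100 705 Td [(SSN: 123-45) -20 (-6789)] TJ ET"

naive_hit = b"123-45-6789" in stream
joined_hit = any("123-45-6789" in line for line in joined_tj_text(stream))
print(naive_hit, joined_hit)  # False True
```

This is why the text extraction test and the raw stream test complement each other: a clean result from a byte-level search alone does not prove the value is gone.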

For command-line verification (essential for production):

cli_verification.sh
# Extract text and search for sensitive terms
pdftotext redacted_document.pdf - | grep -i "ssn\|confidential\|secret"
# Check metadata
pdfinfo redacted_document.pdf
# Inspect a single object with qpdf (here, object 1)
qpdf --show-object=1 redacted_document.pdf
# List each page's object number and its content streams
qpdf --show-pages redacted_document.pdf

Common Mistakes I Made

Mistake 1: Redacting without verifying

I assumed the redaction worked because the black boxes looked right. Always verify programmatically.

# WRONG: Trust without verification
redactor.apply("output.pdf")
# Hope it worked...
# RIGHT: Always verify
redactor.apply("output.pdf")
assert verify_redaction("output.pdf", sensitive_patterns), "Redaction failed!"

Mistake 2: Forgetting embedded images

Text in images isn’t redacted by text-based patterns. I had to OCR and redact separately:

image_redaction.py
from pypdfsuit import PdfRedactor
redactor = PdfRedactor("scanned_doc.pdf")
# For scanned documents, use OCR-based redaction
redactor.enable_ocr_redaction()
redactor.add_redaction(pattern=r"\b\d{3}-\d{2}-\d{4}\b", replacement="[REDACTED]")
redactor.apply("scanned_redacted.pdf")

Mistake 3: Ignoring fonts

Some redaction tools replace text with spaces but keep the original width, which hints at word length. Replacement text that mirrors the original format leaks in the same way:

# Ensure replacement text doesn't leak information
redactor.add_redaction(
    pattern=r"\b\d{3}-\d{2}-\d{4}\b",
    replacement="[XXX-XX-XXXX]"  # Wrong: reveals format
)
# Better: use consistent replacement
redactor.add_redaction(
    pattern=r"\b\d{3}-\d{2}-\d{4}\b",
    replacement="[REDACTED]"  # Right: no format hints
)
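The leak is easy to demonstrate with plain strings, independent of any PDF library. In this sketch the sample text and the phone pattern are made up; the shape-preserving mask lets a reader tell an SSN from a phone number, while the uniform token does not.

```python
import re

SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
PHONE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")
text = "SSN 123-45-6789, phone 555-867-5309"


def mask_same_shape(match: re.Match) -> str:
    # Replaces digits with X but keeps separators: the shape survives.
    return re.sub(r"\d", "X", match.group())


shaped = PHONE.sub(mask_same_shape, SSN.sub(mask_same_shape, text))
uniform = PHONE.sub("[REDACTED]", SSN.sub("[REDACTED]", text))

print(shaped)   # SSN XXX-XX-XXXX, phone XXX-XXX-XXXX
print(uniform)  # SSN [REDACTED], phone [REDACTED]
```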

Mistake 4: Not handling page breaks

Long content spanning pages might be partially redacted:

handle_page_breaks.py
from pypdfsuit import PdfRedactor

# GoPdfSuit handles cross-page content automatically
redactor = PdfRedactor("document.pdf")
# Pattern matching works across page boundaries
redactor.add_redaction(
    pattern=r"BEGIN_SECRET.*?END_SECRET",  # Multi-line pattern
    replacement="[CONTENT REDACTED]",
    multiline=True
)
redactor.apply("document_redacted.pdf")
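Whatever tool does the redaction, it is worth checking for cross-page spans yourself, because a per-page search will not see them. A minimal sketch on plain page texts (the kind of strings a text extractor returns), with a hypothetical helper that reports which pages each match touches:

```python
import re


def find_cross_page_spans(pages: list[str], pattern: str) -> list[tuple[int, int]]:
    """Return (first_page, last_page) for every match, pages numbered from 1."""
    # Record where each page starts in the joined text.
    starts, offset = [], 0
    for page_text in pages:
        starts.append(offset)
        offset += len(page_text) + 1  # +1 for the "\n" joiner

    joined = "\n".join(pages)

    def page_of(position: int) -> int:
        # Last page whose start offset is <= position.
        return max(i for i, start in enumerate(starts) if start <= position) + 1

    return [
        (page_of(m.start()), page_of(m.end() - 1))
        for m in re.finditer(pattern, joined, flags=re.DOTALL)
    ]


pages = ["Intro text. BEGIN_SECRET launch codes", "continue here END_SECRET. Appendix."]
secret = r"BEGIN_SECRET.*?END_SECRET"
per_page_hits = [bool(re.search(secret, p, re.DOTALL)) for p in pages]
spans = find_cross_page_spans(pages, secret)
print(per_page_hits, spans)  # [False, False] [(1, 2)]
```

Neither page matches on its own, yet the joined text does, so a verifier that only loops page by page would report a clean document.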

The Reddit moderators made an important point: “security-centric programs… should not be treated as a security solution unless they’ve been audited by a third party.”

For compliance (GDPR, HIPAA, etc.):

  1. Audit trails: Log all redaction operations
  2. Verification records: Store verification results
  3. Third-party audit: For production use, get security review
  4. Documentation: Maintain redaction policies and procedures

audit_trail.py
import json
import os
from datetime import datetime, timezone

from pypdfsuit import PdfRedactor
from verify_redaction import verify_redaction  # the checker from the verification section


def redact_with_audit(input_path, output_path, patterns, auditor=None):
    now = datetime.now(timezone.utc)
    audit_record = {
        "timestamp": now.isoformat(),
        "input_file": input_path,
        "output_file": output_path,
        "patterns": [p['pattern'] for p in patterns],
        "status": "pending"
    }
    try:
        redactor = PdfRedactor(input_path)
        for p in patterns:
            redactor.add_redaction(pattern=p['pattern'], replacement=p['replacement'])
        redactor.apply(output_path)
        # Verify (these are regexes, so the checker must match them as regexes)
        verification = verify_redaction(output_path, [p['pattern'] for p in patterns])
        audit_record["status"] = "success" if verification else "failed"
        audit_record["verified"] = verification
    except Exception as e:
        audit_record["status"] = "error"
        audit_record["error"] = str(e)
        raise
    finally:
        # Write audit log (create the directory on first use)
        os.makedirs("audit", exist_ok=True)
        with open(f"audit/{now.strftime('%Y%m%d_%H%M%S')}.json", 'w') as f:
            json.dump(audit_record, f, indent=2)
    return audit_record
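File names alone make a weak audit trail, since files get renamed and overwritten. One option is to record a hash of the input and output alongside the other fields, so a later reviewer can confirm which exact files a log entry refers to. A stdlib-only sketch; `sha256_of` and `fingerprint` are hypothetical helpers, and the demo uses throwaway files in place of real PDFs.

```python
import hashlib
import os
import tempfile


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large PDFs do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fingerprint(input_path: str, output_path: str) -> dict:
    """Fields that could be merged into an audit record."""
    return {
        "input_sha256": sha256_of(input_path),
        "output_sha256": sha256_of(output_path),
    }


# Demo with throwaway files standing in for the real PDFs.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "in.pdf")
    dst = os.path.join(tmp, "out.pdf")
    with open(src, "wb") as f:
        f.write(b"original bytes")
    with open(dst, "wb") as f:
        f.write(b"redacted bytes")
    record = fingerprint(src, dst)

print(record)
```

Log the patterns and hashes, never the matched values themselves, or the audit log becomes a second copy of the data you just removed.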

Alternatives to GoPdfSuit

For environments where GoPdfSuit isn’t suitable:

alternatives.py
# PyMuPDF (fitz) - has redaction support
import fitz # PyMuPDF
doc = fitz.open("document.pdf")
page = doc[0]
# Add redaction annotation
page.add_redact_annot((100, 700, 300, 720), fill=(0, 0, 0))
# Apply redactions
page.apply_redactions()
doc.save("redacted.pdf")
# PDFTron - commercial solution with redaction
# from pdftron.PDF import PDFDoc
# (commercial license required)

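PyMuPDF’s redaction works on page regions and is more limited than GoPdfSuit: it doesn’t clean metadata automatically, and links outside the redacted regions are left in place. Metadata has to be handled as a separate step, for example with doc.set_metadata({}) or doc.scrub().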

Bottom Line

Drawing black boxes over PDF text isn’t redaction. It’s decoration that creates false confidence.

True redaction requires:

  1. Removing text from content streams
  2. Cleaning metadata
  3. Scrubbing links
  4. Rebuilding PDF structure
  5. Verifying the result

GoPdfSuit handles all of this through internal decryption. But always verify your output, and for production systems handling sensitive data, consider a third-party security audit.

The security of your redacted documents depends not on how they look, but on what’s actually left in the file.

