What Are PDF/A-4 and PDF/UA-2 Standards? A Python Developer's Guide
A client rejected my PDF generation system. “These aren’t compliant,” they said. I had no idea what they meant. Turns out, not all PDFs are created equal - some need to meet ISO standards for archiving (PDF/A) or accessibility (PDF/UA). Here’s what I learned.
The Problem
I had built a PDF generation pipeline for a government contract. Everything worked perfectly - invoices, reports, certificates. Then the compliance team ran their validators and sent back a spreadsheet of failures:

    ERROR: PDF/A-4 validation failed
      - Font not embedded: Helvetica
      - Missing XMP metadata
      - Encryption not allowed in PDF/A

    ERROR: PDF/UA-2 validation failed
      - Missing alt text for images
      - No document structure tree
      - Form fields missing accessible names

I had been generating “regular” PDFs my whole career. None of them were compliant with any standard. I had to learn fast.
What Are These Standards?
PDF/A (ISO 19005) is for long-term archiving. The “A” stands for Archive. When you need a document to be readable in 50 years, you use PDF/A. Courts, banks, and governments mandate it.
PDF/UA (ISO 14289) is for accessibility. The “UA” stands for Universal Accessibility. Screen readers need structure, not just visual layout. If you’re in the US (ADA/Section 508) or EU (EU Directive 2016/2102), you might be legally required to use it.
The version numbers matter:
- PDF/A-4 (2020): Latest version, based on PDF 2.0
- PDF/UA-2 (2024): Updated accessibility standard
My First Compliance Check
I tried checking compliance with pypdf:

    from pypdf import PdfReader

    def check_pdfa_compliance(pdf_path):
        reader = PdfReader(pdf_path)
        metadata = reader.metadata

        # Check for PDF/A indicator in metadata
        if metadata:
            print("Metadata found:")
            for key, value in metadata.items():
                print(f"  {key}: {value}")

            # Look for PDF/A identifier
            if '/pdfaid:part' in str(metadata):
                print("PDF/A compliant")
            else:
                print("Not PDF/A compliant")
        else:
            print("No metadata - definitely not compliant")

    # Test my existing PDFs
    check_pdfa_compliance("my_report.pdf")

The output was disappointing:

    $ python check_compliance_basic.py
    Metadata found:
      /Producer: PyPDF
      /Creator: Python Script
    Not PDF/A compliant

This told me nothing useful. I needed a real validator.
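Part of the reason the check above could never succeed is that `reader.metadata` exposes the document's Info dictionary, while the PDF/A identifier lives in the XMP packet. Here is a minimal sketch of reading that identifier from an XMP packet with the standard library. The function name `read_pdfa_claim` and the sample packet are my own illustrations, and how you obtain the packet is left out (pypdf exposes XMP through `reader.xmp_metadata`).

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
PDFAID_NS = "http://www.aiim.org/pdfa/ns/id/"


def read_pdfa_claim(xmp_xml):
    """Return the pdfaid:* values declared in an XMP packet (empty dict if none)."""
    root = ET.fromstring(xmp_xml)
    claim = {}
    for desc in root.iter(f"{{{RDF_NS}}}Description"):
        # XMP allows simple properties to be written as attributes...
        for name, value in desc.attrib.items():
            if name.startswith(f"{{{PDFAID_NS}}}"):
                claim[name.split("}", 1)[1]] = value
        # ...or as child elements.
        for child in desc:
            if child.tag.startswith(f"{{{PDFAID_NS}}}"):
                claim[child.tag.split("}", 1)[1]] = (child.text or "").strip()
    return claim


# Hypothetical packet for illustration only
sample_packet = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about=""
        xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/">
      <pdfaid:part>4</pdfaid:part>
      <pdfaid:rev>2020</pdfaid:rev>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>"""

print(read_pdfa_claim(sample_packet))
```

Even when this finds an identifier, it only tells you what the file claims to be. Whether the claim holds is still a job for a validator.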
Real Validation with veraPDF
I discovered veraPDF, an open-source PDF/A validator. It’s the industry standard for compliance checking:

    # Install veraPDF (requires Java)
    # Download from verapdf.org or use Docker

    docker run --rm -v $(pwd)/pdfs:/data verapdf/verapdf /data/my_report.pdf

The output showed exactly what was wrong:

    $ docker run --rm -v $(pwd)/pdfs:/data verapdf/verapdf /data/my_report.pdf

    VALIDATION REPORT
    =================
    Profile: PDF/A-4
    Status: INVALID

    Rule violations:
      6.1.2-1: Font not embedded (Helvetica)
      6.2.2-1: ICC profile not embedded
      6.7.2-1: XMP metadata missing pdfaid:part

    Total errors: 3

Now I knew what to fix. But fixing it programmatically was another challenge.
Attempting Compliance with ReportLab
My first attempt was to make ReportLab generate compliant PDFs:

    from reportlab.pdfgen import canvas
    from reportlab.lib.pagesizes import A4
    from reportlab.pdfbase import pdfmetrics
    from reportlab.pdfbase.ttfonts import TTFont

    # Embed fonts (required for PDF/A)
    pdfmetrics.registerFont(TTFont('DejaVu', '/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf'))

    c = canvas.Canvas("compliant_attempt.pdf", pagesize=A4)

    # Set PDF/A metadata
    c.setTitle("My Document")
    c.setAuthor("Generated by Script")
    c.setSubject("PDF/A Compliant Document")

    # Use embedded font
    c.setFont('DejaVu', 12)
    c.drawString(100, 750, "This should be PDF/A compliant...")

    c.save()

I ran veraPDF again:

    $ docker run --rm -v $(pwd):/data verapdf/verapdf /data/compliant_attempt.pdf

    Status: INVALID

    Rule violations:
      6.7.2-1: Missing XMP metadata extension schemas
      6.8.1-1: No PDF/A identification schema

The problem: ReportLab doesn’t output PDF/A natively. It generates regular PDFs. Converting to PDF/A requires post-processing.
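For reference, the identification the validator was looking for is a small XMP packet. Below is a minimal sketch of building one, assuming (as far as I understand the PDF/A-4 identification schema) that `pdfaid:part` is `4` and `pdfaid:rev` is `2020`. The helper name `build_pdfa4_xmp` is hypothetical, and attaching the packet to the PDF as its metadata stream is not shown.

```python
from xml.sax.saxutils import escape

# Minimal XMP packet carrying a title and the PDF/A identification.
# The part/rev values are my assumption for PDF/A-4.
XMP_TEMPLATE = """<?xpacket begin="\ufeff" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about=""
        xmlns:dc="http://purl.org/dc/elements/1.1/"
        xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/">
      <dc:title>
        <rdf:Alt><rdf:li xml:lang="x-default">{title}</rdf:li></rdf:Alt>
      </dc:title>
      <pdfaid:part>4</pdfaid:part>
      <pdfaid:rev>2020</pdfaid:rev>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>"""


def build_pdfa4_xmp(title):
    """Return an XMP packet string with the title XML-escaped."""
    return XMP_TEMPLATE.format(title=escape(title))


packet = build_pdfa4_xmp("My Document")
print(packet)
```

A packet like this is necessary but nowhere near sufficient: the fonts, color profile, and everything else the validator checks still have to be right in the file itself.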
Post-Processing to PDF/A
I found a workaround using Ghostscript:

    # Convert any PDF to PDF/A-4
    gs -dPDFA=4 \
       -dBATCH \
       -dNOPAUSE \
       -dQUIET \
       -sDEVICE=pdfwrite \
       -sOutputFile=output_pdfa.pdf \
       -sColorConversionStrategy=UseDeviceIndependentColor \
       input.pdf

This worked but had issues:
- Required installing Ghostscript
- Sometimes produced lossy color conversion
- Didn’t handle PDF/UA at all
I needed a better solution.
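For pipelines that have to keep the Ghostscript route in the meantime, the command can be wrapped from Python. This is a rough sketch: the function names are hypothetical, the flags are copied unchanged from the command above, and whether your Ghostscript build accepts them (in particular `-dPDFA=4`) depends on its version.

```python
import subprocess
from pathlib import Path


def build_gs_pdfa_command(src, dst, gs_binary="gs"):
    """Build the argument list for the Ghostscript conversion shown above."""
    return [
        gs_binary,
        "-dPDFA=4",
        "-dBATCH",
        "-dNOPAUSE",
        "-dQUIET",
        "-sDEVICE=pdfwrite",
        f"-sOutputFile={dst}",
        "-sColorConversionStrategy=UseDeviceIndependentColor",
        str(src),
    ]


def convert_to_pdfa(src, dst):
    """Run Ghostscript and raise if it reports a failure."""
    completed = subprocess.run(
        build_gs_pdfa_command(src, dst),
        capture_output=True,
        text=True,
    )
    if completed.returncode != 0:
        raise RuntimeError(f"Ghostscript failed: {completed.stderr.strip()}")
    return Path(dst)
```

Passing the arguments as a list avoids shell quoting problems with file names. A successful exit only means Ghostscript finished, so the output should still go through veraPDF.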
The Real Solution: Built-in Compliance
I discovered that some PDF libraries generate compliant PDFs from the start. GoPdfSuit, mentioned in a Reddit thread, supports both PDF/A-4 and PDF/UA-2 natively:

    from pypdfsuit import PdfGenerator

    # Generate PDF/A-4 compliant document
    generator = PdfGenerator(
        template="invoice_template.json",
        compliance="PDF/A-4",  # Automatic compliance
        accessibility=True     # PDF/UA-2 support
    )

    data = {
        "title": "Invoice #12345",
        "customer": {
            "name": "Acme Corp",
            "address": "123 Business St"
        },
        "items": [
            {"description": "Service A", "amount": 500.00},
            {"description": "Service B", "amount": 300.00}
        ],
        "total": 800.00
    }

    generator.render(data, "compliant_invoice.pdf")

Validating the output:

    $ docker run --rm -v $(pwd):/data verapdf/verapdf /data/compliant_invoice.pdf

    VALIDATION REPORT
    =================
    Profile: PDF/A-4
    Status: VALID

    Total errors: 0

This approach generates compliant PDFs from scratch, no post-processing required.
Why Compliance Matters
I used to think “PDF is PDF.” I was wrong.
Legal Requirements
Accessibility (PDF/UA):
- US: Section 508 requires accessible documents for federal agencies
- EU: Directive 2016/2102 mandates accessibility for public sector
- Private lawsuits under ADA for inaccessible documents are common
Archiving (PDF/A):
- Courts require PDF/A for electronic filings
- Financial regulations (SEC, FINRA) mandate archived document formats
- Healthcare records must be preserved for decades
Technical Benefits

    # Regular PDF problems:
    # 1. Fonts not embedded -> Document breaks when opened on different machine
    # 2. External references -> Images disappear when URLs change
    # 3. JavaScript -> Security risk, won't work in restricted environments
    # 4. Encryption -> Document becomes unreadable if password lost

    # PDF/A guarantees:
    # 1. All fonts embedded -> Document renders identically everywhere
    # 2. No external dependencies -> Self-contained forever
    # 3. No JavaScript -> Safe for archival
    # 4. No encryption -> Future-proof

    # PDF/UA guarantees:
    # 1. Structured content -> Screen readers work correctly
    # 2. Alternative text -> Images described for visually impaired
    # 3. Reading order -> Logical navigation possible
    # 4. Form field labels -> Fillable forms accessible

Checking Compliance Programmatically
For CI/CD pipelines, you need automated compliance checking:

    import subprocess
    import json
    import sys

    def validate_pdfa(pdf_path, flavour="4"):
        """Validate PDF against a PDF/A flavour using veraPDF ("4" means PDF/A-4)"""

        # Run veraPDF CLI. --flavour picks a built-in profile;
        # --profile expects a path to a profile file instead.
        result = subprocess.run(
            ["verapdf", "--format", "json", "--flavour", flavour, pdf_path],
            capture_output=True,
            text=True
        )

        if result.returncode == 0:
            # The JSON layout differs between veraPDF versions;
            # adjust these keys to match yours
            data = json.loads(result.stdout)
            validation_result = data.get("reports", {}).get("jobs", [{}])[0].get("validationReport", {})
            is_compliant = validation_result.get("status") == "valid"

            if not is_compliant:
                # Extract errors for debugging
                details = validation_result.get("details", {})
                errors = details.get("failedRules", [])

                print(f"Compliance FAILED: {len(errors)} errors")
                for error in errors[:5]:  # Show first 5 errors
                    print(f"  - {error.get('ruleId', 'Unknown rule')}")

                return False, errors

            return True, []

        print(f"Validation error: {result.stderr}")
        return False, []

    # Use in pipeline
    if __name__ == "__main__":
        compliant, errors = validate_pdfa("generated_document.pdf")

        if not compliant:
            print("Document failed compliance check - blocking deployment")
            sys.exit(1)

        print("Document is PDF/A-4 compliant")

What PDF/A-4 Actually Requires
Understanding the technical requirements helped me debug issues:
| Requirement | Why It Matters | How to Fix |
|---|---|---|
| Embedded fonts | Document renders identically | Include font files, not references |
| ICC color profile | Colors match across devices | Embed sRGB or other standard profile |
| XMP metadata | Document can be indexed | Add required metadata fields |
| No JavaScript | Security, long-term stability | Remove all scripts |
| No external references | Document self-contained | Embed all images, fonts |
| No encryption | Long-term accessibility | Remove password protection |
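Two rows of this table (no JavaScript, no encryption) lend themselves to a very cheap pre-check before the real validator runs. The sketch below simply looks for telltale name tokens in the raw bytes. It is a heuristic of my own, not part of any standard: objects stored in compressed streams are invisible to it, and a token can appear without being an actual violation. The names `SUSPECT_TOKENS` and `smoke_test` are hypothetical.

```python
# Raw PDF name tokens that hint at features PDF/A forbids.
# Heuristic only: compressed object streams hide these tokens.
SUSPECT_TOKENS = {
    b"/Encrypt": "encryption",
    b"/JavaScript": "JavaScript",
    b"/JS": "JavaScript",
}


def smoke_test(pdf_bytes):
    """Return a sorted list of suspicious features found in the raw bytes."""
    found = {
        label
        for token, label in SUSPECT_TOKENS.items()
        if token in pdf_bytes
    }
    return sorted(found)


# Hypothetical fragment of an uncompressed PDF, for illustration
fragment = b"%PDF-2.0\n1 0 obj << /S /JavaScript /JS (app.alert(1)) >> endobj"
print(smoke_test(fragment))
```

In practice you would feed it `Path("document.pdf").read_bytes()`. An empty result proves nothing, so I would only use this to fail fast, never to skip veraPDF.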
What PDF/UA-2 Actually Requires
Accessibility is about structure, not just visual output:

    # PDF/UA requires a logical structure tree
    # This is how screen readers navigate the document

    # Example structure for an invoice:
    """
    <Document>
      <H1>Invoice #12345</H1>
      <P>Customer: Acme Corp</P>
      <Table>
        <TR>
          <TH>Description</TH>
          <TH>Amount</TH>
        </TR>
        <TR>
          <TD>Service A</TD>
          <TD>$500.00</TD>
        </TR>
      </Table>
      <Figure Alt="Company Logo" />
    </Document>
    """

    # Key requirements:
    # 1. Headings marked as H1, H2, etc. (not just bold text)
    # 2. Tables have proper header cells
    # 3. Images have alternative text
    # 4. Reading order matches visual order
    # 5. Form fields have labels (not just placeholder text)

Common Mistakes
- Generating first, checking later: Compliance should be built into the generation process, not an afterthought. Retrofitting accessibility into existing PDFs is painful.
- Only checking PDF/A: Many organizations need both standards. A document can be valid PDF/A but fail PDF/UA completely.
- Assuming PDF = PDF/A: Most PDF libraries output regular PDFs by default. You must explicitly request compliance.
- Ignoring font licensing: Just because a font is installed doesn’t mean you can embed it. Check the license before embedding a font.
- Not testing with real screen readers: Passing veraPDF is not enough. Test with NVDA, VoiceOver, or JAWS to confirm actual accessibility.
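One way to avoid the first mistake is to lint the logical structure before anything is rendered. The sketch below assumes your generator keeps an XML-like outline of the kind shown in the invoice example earlier. That outline is an illustration, not how tags are stored inside a PDF, and `lint_structure` is a hypothetical helper covering only two of the requirements (image alt text and table header cells).

```python
import xml.etree.ElementTree as ET


def lint_structure(structure_xml):
    """Return a list of accessibility problems found in a structure outline."""
    root = ET.fromstring(structure_xml)
    problems = []

    # Images need alternative text
    for figure in root.iter("Figure"):
        if not figure.get("Alt", "").strip():
            problems.append("Figure without Alt text")

    # Tables need at least one header cell
    for table in root.iter("Table"):
        if table.find(".//TH") is None:
            problems.append("Table without header cells (TH)")

    return problems


# Hypothetical outline with two deliberate problems
broken_outline = """
<Document>
  <H1>Invoice #12345</H1>
  <Table>
    <TR><TD>Service A</TD><TD>$500.00</TD></TR>
  </Table>
  <Figure />
</Document>
"""

for problem in lint_structure(broken_outline):
    print(problem)
```

A check like this runs in milliseconds and catches omissions at the template level, long before a validator or a screen reader test would.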
The Compliance Checklist
Before shipping any PDF system:

    # 1. Validate with veraPDF
    verapdf --flavour 4 document.pdf

    # 2. Validate accessibility (PDF/UA)
    verapdf --flavour ua2 document.pdf

    # 3. Test with screen reader
    # macOS: VoiceOver (Cmd+F5)
    # Windows: NVDA (free download)

    # 4. Verify fonts embedded (check the "emb" column)
    pdffonts document.pdf

    # 5. Check metadata (-meta prints the XMP packet)
    pdfinfo -meta document.pdf

Performance Considerations
Compliant PDFs are larger due to embedded fonts and metadata:
| Document Type | Regular PDF | PDF/A-4 | Increase |
|---|---|---|---|
| Text-only (10 pages) | 45 KB | 120 KB | 167% |
| With images | 2.1 MB | 2.3 MB | 10% |
| Complex report | 5.8 MB | 6.2 MB | 7% |
The overhead is primarily from font embedding. For text-heavy documents without images, the size increase is proportionally larger.
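The arithmetic behind the last column is worth spelling out, because the relative and absolute overheads tell different stories. A small sketch using the sizes from the table (the function name is my own):

```python
def percent_increase(before, after):
    """Relative size increase in percent."""
    return (after - before) / before * 100


# (regular size, PDF/A-4 size, unit) taken from the table above
sizes = {
    "Text-only (10 pages)": (45, 120, "KB"),
    "With images": (2.1, 2.3, "MB"),
    "Complex report": (5.8, 6.2, "MB"),
}

for name, (before, after, unit) in sizes.items():
    overhead = after - before
    print(f"{name}: +{overhead:.1f} {unit} ({percent_increase(before, after):.0f}%)")
```

The text-only document gains the least in absolute terms (75 KB) but the most in relative terms, since the embedded fonts are a roughly fixed cost spread over very little content.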
Getting Started
If you’re building a PDF generation system:
- Choose compliance-aware libraries: Start with tools that output compliant PDFs natively
- Automate validation: Add compliance checks to your CI/CD pipeline
- Test with real users: Run screen reader tests, not just validator passes
- Document your compliance: Keep records for audit purposes
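The "automate validation" step can be sketched as a small gate that runs a validator over every PDF in an output folder. The validator is passed in as a callable returning `(compliant, errors)`, the same shape `validate_pdfa` returns earlier in this article, so the gate itself does not depend on veraPDF being installed. The name `gate_directory` and the folder layout are assumptions for illustration.

```python
from pathlib import Path


def gate_directory(pdf_dir, validate):
    """Validate every *.pdf in pdf_dir and return {file name: errors} for failures.

    `validate` is any callable taking a path string and returning
    (compliant: bool, errors: list).
    """
    failures = {}
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        compliant, errors = validate(str(pdf_path))
        if not compliant:
            failures[pdf_path.name] = errors
    return failures
```

In a pipeline step you would call something like `failures = gate_directory("build/pdfs", validate_pdfa)` and exit with a non-zero status when `failures` is not empty. Keeping the returned dictionary as a build artifact also helps with the last item on the list, since it doubles as an audit record.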

    # Quick start with validation tools
    pip install pypdf       # Basic PDF manipulation
    pip install pypdfsuit   # Compliant generation (if available)

    # Install veraPDF for validation
    # Docker: docker pull verapdf/verapdf
    # Or download from verapdf.org

The lesson: PDF compliance isn’t optional for many applications. Building it in from the start costs little; retrofitting it later costs a lot.
Final Words + More Resources
My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can contact me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
- PDF/A ISO 19005 Standard
- PDF/UA ISO 14289 Standard
- veraPDF - Open Source PDF/A Validator
- Reddit Discussion on PDF Libraries