Should Test Data Be Bundled in Python Packages?
Problem: Porn in My Python Package
I was debugging a dependency issue in my Anaconda environment. I navigated to site-packages/protego/tests and saw:
```
www.youporn.com
www.pornhub.com
test_xxx_sites.txt
```
My first reaction: “Did I get hacked?”
I opened the files. They weren’t what they looked like. They were test fixtures for a robots.txt parser library. The Protego library uses real domain names to test robots.txt parsing logic.
Technically valid. Professionally disastrous.
I started wondering: Should test data even be included in Python packages? What’s the best practice here?
The answer isn’t simple. Let me walk you through what I learned.
What Counts as Test Data?
Test data in Python packages falls into three categories:
Synthetic fixtures: Generated data for testing
- Mock JSON responses
- Sample config files
- Fake URLs and domain names
Real-world samples: Actual data from production
- Real robots.txt files (like Protego uses)
- Sample PDFs/HTML files
- Production config snippets
Test corpora: Large datasets for validation
- 10,000 document samples
- Multi-gigabyte training data
- Comprehensive test suites
The question is: Which of these should you bundle in your PyPI package?
The Case FOR Bundling Test Data
There are legitimate reasons to include test data.
User convenience: Tests run immediately after installation
```
pip install protego
pytest protego/tests  # Just works
```
No downloading test data separately. No setup scripts. Contributors can clone and run tests instantly.
Real-world validation: Synthetic data isn’t enough
Consider a robots.txt parser. You need actual robots.txt files from real domains:
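```python
from pathlib import Path
from urllib.robotparser import RobotFileParser

def test_parse_example_com():
    """Test parsing of a real robots.txt saved from example.com."""
    parser = RobotFileParser()
    # read() only fetches from a URL, so feed the saved fixture to parse()
    fixture = Path("tests/fixtures/www.example.com")
    parser.parse(fixture.read_text(encoding="utf-8").splitlines())
    assert parser.can_fetch("*", "/allowed")
```
I tried writing synthetic robots.txt files. Edge cases emerged that I never predicted: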
- Non-standard crawl-delay formats
- Weird wildcard patterns
- Conflicting allow/disallow rules
- Unicode in paths
Real data caught bugs synthetic data missed.
CI/CD simplicity: No test data setup in pipelines
Your CI just runs pytest. No caching test data, no download scripts, no network dependencies.
The Case AGAINST Bundling Test Data
Then I discovered the downsides. The hard way.
Package size bloat: 50KB becomes 3.2MB
I maintain a config parser library. Added 20 sample configs. Package size jumped from 50KB to 3.2MB.
Users noticed:
- Longer install times
- Bigger Docker images
- Wasted bandwidth for CI (hundreds of installs per day)
Professionalism issues: The www.youporn.com incident
Protego included real domain names as test fixtures. Technically correct. Horrible for professional environments:
```
Windows search indexing: "Found 3 results for 'porn' in C:\anaconda3"
HR reviewing screen: "Why is porn in your Python packages?"
Security scanner: "SUSPICIOUS FILENAME DETECTED: www.youporn.com"
```
I work in corporate environments. IT departments run security scans. HR policies flag inappropriate content. My screen gets audited.
Do I want to explain why www.youporn.com is in my site-packages? No.
Installation clutter: Pollution in production
End users installing your package don’t need tests. They see:
```
pip install mypackage
# Installs to site-packages:
# - mypackage/__init__.py
# - mypackage/core.py
# - mypackage/tests/data/sample_01.json
# - mypackage/tests/data/sample_02.json
# - mypackage/tests/data/sample_03.json
# ... 50 more test files
```
Clutter. Confusion. “Why do I have test data in production?”
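To see whether a package you already installed ships its tests, you can ask the standard library what files the distribution recorded. This is a minimal sketch: the helper names `is_test_path` and `installed_test_files` are mine, and it assumes that anything under a directory called `tests` or `test` counts as test content.

```python
from importlib import metadata


def is_test_path(path: str) -> bool:
    """Return True if any directory component is named 'tests' or 'test'."""
    parts = path.replace("\\", "/").split("/")
    return any(part in {"tests", "test"} for part in parts[:-1])


def installed_test_files(dist_name: str) -> list[str]:
    """List files a distribution installed under a tests directory."""
    try:
        # files() returns None when the install record is missing
        files = metadata.files(dist_name) or []
    except metadata.PackageNotFoundError:
        return []
    return [str(f) for f in files if is_test_path(str(f))]


if __name__ == "__main__":
    for path in installed_test_files("protego"):
        print(path)
```

An empty result means either the package is not installed or it recorded no files under a tests directory.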
How Python Packaging Handles Test Data
I dug into Python packaging documentation. The official guidance is nuanced.
setup.cfg approach (traditional):
```ini
[options]
packages = find:

[options.package_data]
mypackage = tests/data/*.json, tests/data/*.txt

[options.extras_require]
test =
    pytest
    pytest-cov
```
pyproject.toml approach (modern):
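```toml
[tool.setuptools]
package-data = { "mypackage" = ["tests/data/*.json", "tests/data/*.txt"] }

[project.optional-dependencies]
test = ["pytest>=7.0", "pytest-cov"]
```
Critical detail: MANIFEST.in and package-data control different artifacts.
As I understand the setuptools documentation, this means:
- MANIFEST.in controls which extra files go into the source distribution (sdist)
- A wheel contains only the packages setuptools discovers plus their package data; files outside those packages are left out
- With include-package-data enabled (the default for pyproject.toml-based setuptools configs), data files inside a package that made it into the sdist are also copied into the wheel. With a plugin such as setuptools-scm, that can mean every file tracked by version control
- To exclude test data from wheels: keep tests outside the package directory, exclude the test packages from discovery, or list the files under exclude-package-data
My mistake: I thought MANIFEST.in controlled everything. It doesn’t.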
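Rather than reasoning about configuration, it is often quicker to open the built wheel and look. A wheel is a zip archive, so the standard library can list it. The sketch below assumes test content lives under a directory named `tests`; the function name `wheel_test_entries` and the example wheel filename are illustrative only.

```python
import sys
import zipfile


def wheel_test_entries(wheel_path: str) -> list[str]:
    """Return archive members that sit under a tests/ directory."""
    with zipfile.ZipFile(wheel_path) as wheel:
        # Prefix with "/" so a top-level tests/ directory matches too
        return [name for name in wheel.namelist() if "/tests/" in f"/{name}"]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_wheel.py dist/mypackage-1.0-py3-none-any.whl
    for entry in wheel_test_entries(sys.argv[1]):
        print(entry)
```

Running this against every wheel in `dist/` before uploading shows exactly which fixtures end users will receive.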
When SHOULD You Include Test Data?
After researching and experimenting, here’s my decision framework:
Include test data if:
- Total size < 100KB
- Files are professionally named
- Testing real-world format parsing (configs, APIs, file formats)
- Package is a library (not an end-user app)
- Core functionality depends on this data
Real-world examples where it makes sense:
Protego (robots.txt parser):
- Needs real robots.txt files
- ~20 small text files
- Justifies inclusion
PyPDF2 (PDF parser):
- Needs sample PDFs to test parsing
- Small fixture files
- Legitimate use case
Requests (HTTP library):
- Mock HTTP responses
- Essential for testing
- Professionally named
My rule of thumb: If the data is small (< 100KB total) and professionally named, bundling is acceptable.
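The 100KB rule of thumb is easy to enforce mechanically. Here is a small sketch that sums the size of a fixture directory and compares it to a budget; `FIXTURE_BUDGET_BYTES`, `fixture_bytes`, and `within_budget` are names I made up for illustration.

```python
from pathlib import Path

FIXTURE_BUDGET_BYTES = 100 * 1024  # the 100KB rule of thumb


def fixture_bytes(fixture_dir: Path) -> int:
    """Sum the sizes of all regular files below fixture_dir."""
    return sum(p.stat().st_size for p in fixture_dir.rglob("*") if p.is_file())


def within_budget(fixture_dir: Path, budget: int = FIXTURE_BUDGET_BYTES) -> bool:
    """Return True if the fixtures are strictly under the budget."""
    return fixture_bytes(fixture_dir) < budget
```

Calling `within_budget(Path(__file__).parent / "data")` from a test makes the build fail as soon as someone adds an oversized fixture.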
When SHOULD You NOT Include Test Data?
Exclude test data if:
- Total size > 100KB
- Filenames are unprofessional (learn from my mistake)
- Package is an end-user application
- Data can be generated synthetically
- Tests are for development, not validation
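The include and exclude criteria can be folded into one function, which makes the trade-offs explicit. This is only a sketch of my own checklist, not an established standard; the function name, parameter names, and return strings are all invented for illustration.

```python
SIZE_LIMIT_BYTES = 100 * 1024  # 100KB threshold used throughout this article


def bundling_advice(
    size_bytes: int,
    names_professional: bool,
    is_library: bool,
    needs_real_samples: bool,
) -> str:
    """Map the article's criteria to a recommendation string."""
    if size_bytes >= SIZE_LIMIT_BYTES:
        return "exclude: too large, download on demand or ship separately"
    if not is_library:
        return "exclude: end-user applications do not need tests installed"
    if not needs_real_samples:
        return "exclude: generate the data synthetically"
    if not names_professional:
        return "rename fixtures, then bundle"
    return "bundle"
```

For example, `bundling_advice(20 * 1024, True, True, True)` returns `"bundle"`, while the same call with `names_professional=False` asks for a rename first.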
Large datasets: Use the download pattern
Instead of bundling 10MB of test data, download it on first use:
```python
import urllib.request

import pytest


@pytest.fixture(scope="session")
def test_data(tmp_path_factory):
    """Download test data once per test session."""
    data_dir = tmp_path_factory.getbasetemp() / "test_data"

    if not data_dir.exists():
        data_dir.mkdir(parents=True)
        urllib.request.urlretrieve(
            "https://example.com/test-corpus.tar.gz",
            data_dir / "corpus.tar.gz",
        )
        # Extract...

    return data_dir
```
Inappropriate content: Learn from my pain
Rename files before bundling:
```
# BAD
www.youporn.com
test_pornhub.txt
illegal_content.dat

# GOOD
sample_robots_commercial_01.txt
test_robots_adult_industry.txt
robots_restricted_access.txt
```
The filenames describe the CONTENT, not the SOURCE.
Add source attribution in comments:
```
# Source: www.youporn.com/robots.txt (retrieved 2024-01-15)
# Purpose: Test parsing of Crawl-delay directive
User-agent: *
Crawl-delay: 5
Disallow: /login
```
End-user applications: Don’t bundle tests
If you’re building a tool users run (not import):
```python
from setuptools import find_packages

packages = find_packages(exclude=["tests", "tests.*", "docs", "docs.*"])
```
Users who run pip install mycliapp don’t need your test fixtures.
Best Practices I Learned the Hard Way
1. Naming conventions matter
Professional, descriptive names:
```
sample_robots_github_01.txt
sample_nginx_config_prod.conf
mock_api_response_200.json
test_data_edge_case_01.csv
```
Avoid:
- Domain names (use with caution)
- Offensive language (obviously)
- Cryptic abbreviations
- Industry-specific jargon
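A naming audit can run in CI so that awkward filenames never reach a release. The sketch below flags names that look like domains or contain a blocked word. Both the `DOMAIN_LIKE` pattern and the default blocklist are assumptions of mine and would need tuning for a real project.

```python
import re

# Assumption: a name looks like a domain if it starts with "www."
# or ends in one of a few common top-level domains.
DOMAIN_LIKE = re.compile(r"^www\.|\.(com|net|org|io)$", re.IGNORECASE)


def flag_fixture_names(names, blocked_words=("porn", "xxx")):
    """Return the names that look like domains or contain a blocked word."""
    flagged = []
    for name in names:
        lowered = name.lower()
        if DOMAIN_LIKE.search(lowered) or any(w in lowered for w in blocked_words):
            flagged.append(name)
    return flagged
```

Feeding it `p.name for p in Path("tests").rglob("*")` and failing the build on a non-empty result turns the convention into a check.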
2. Size optimization
I reduced my package size from 3.2MB to 89KB by:
- Compressing text fixtures (gzip)
- Using minimal representative samples (not full datasets)
- Removing duplicate test data
- Generating synthetic data where possible
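For the gzip step, tests need a loader that reads compressed and plain fixtures the same way. This is a minimal sketch with a hypothetical helper name, `load_text_fixture`. Note that wheels and sdists are already compressed archives, so the saving shows up mostly in installed size rather than download size.

```python
import gzip
from pathlib import Path


def load_text_fixture(path: Path) -> str:
    """Read a fixture as text, transparently handling a .gz suffix."""
    if path.suffix == ".gz":
        with gzip.open(path, "rt", encoding="utf-8") as handle:
            return handle.read()
    return path.read_text(encoding="utf-8")
```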
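```python
import pytest

# Instead of 100 sample configs, use 5 representative ones
# and generate variations programmatically


@pytest.fixture(params=[1, 2, 4, 8])
def nginx_config(request):
    """Provide one nginx config variation per worker count."""
    # A fixture may yield only once, so the variations come from params.
    # load_fixture is the project's own helper that returns the file text.
    base_config = load_fixture("nginx_base.conf")
    return base_config.replace(
        "worker_processes 4", f"worker_processes {request.param}"
    )
```
3. Documentation is mandatory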
Your README should explain:
```markdown
## Test Data

This package includes test fixtures for validation:
- Total size: ~89KB
- Purpose: Testing real-world config parsing
- Source: Publicly available sample files
- Installation: Use `pip install mypackage[test]` for development

To run tests: `pytest`
```
In CONTRIBUTING.md:
```markdown
## Test Data Conventions

- Keep fixtures under 100KB total
- Use professional naming (sample_*, not domain names)
- Document source in file comments
- Prefer synthetic data over real files
```
4. Separate test extras
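Don’t force test dependencies on end users. Extras only add dependencies; they don’t change which files ship in the wheel, so the fixtures themselves must either be left out of the wheel or live in a separate distribution that the extra pulls in:
```
# End users (no test dependencies)
pip install mypackage

# Developers/contributors (adds the test dependencies)
pip install mypackage[test]
```
Configure in pyproject.toml:
```toml
[project.optional-dependencies]
test = [
    "pytest>=7.0",
    "pytest-cov>=4.0",
]
```
The Decision Framework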
Here’s the checklist I use now:
```
Does test data exceed 100KB?
├─ Yes → Don't bundle (use download pattern or separate package)
└─ No → Continue

Are filenames professional?
├─ No → Rename before bundling
└─ Yes → Continue

Is it a library (not app)?
├─ Yes → Consider bundling
└─ No → Don't bundle

Does it test real-world formats?
├─ Yes → Bundle it (with professional naming)
└─ No → Generate synthetically
```
Alternative Approaches
1. Separate test data package
For large datasets (>1MB), create a companion package:
```
mypackage           # Main library (50KB)
mypackage-testdata  # Test fixtures (5MB)
```
Users:
```
pip install mypackage        # For use
pip install mypackage[test]  # For development (installs testdata)
```
This works when mypackage-testdata is listed as a dependency of the test extra. It keeps the main package small while still providing convenient access.
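Tests then need a way to find the companion package's files and to degrade gracefully when it is absent. A minimal sketch, assuming the companion distribution exposes an importable package named `mypackage_testdata` (a hypothetical name matching the example above):

```python
import importlib.util
from importlib import resources


def testdata_root(package: str = "mypackage_testdata"):
    """Return the package's resource root, or None if it is not installed."""
    if importlib.util.find_spec(package) is None:
        return None
    return resources.files(package)
```

In a pytest fixture, a `None` result is a natural place to call `pytest.skip("test data package not installed")` instead of failing.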
2. pytest-datafiles plugin
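Copy test data into a temporary directory for each test:
```python
import pytest


@pytest.mark.datafiles("tests/data/sample_01.json")
def test_sample(datafiles):
    """datafiles is a temporary directory holding a copy of the listed files."""
    assert (datafiles / "sample_01.json").exists()
```
The plugin does not download anything. It copies existing files so tests can modify them safely. For download on demand, use a session-scoped fixture like the one in the download pattern above.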
3. Git LFS for large files
Store test data in Git LFS:
```
git lfs track "tests/corpora/*.json"
git add .gitattributes
```
Contributors run git lfs pull to get test data. Keep tests/corpora outside the packaged directories so it stays out of PyPI wheels.
Trade-off: More complex contributor setup.
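One way to soften that trade-off is to detect when a contributor has not fetched the LFS objects yet. An unfetched file is a small text pointer whose first line names the LFS spec, so a test helper can recognize it. The helper name `is_lfs_pointer` is my own.

```python
from pathlib import Path

# Every Git LFS pointer file starts with this line
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path: Path) -> bool:
    """Return True if the file is an unfetched Git LFS pointer stub."""
    with path.open("rb") as handle:
        return handle.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX
```

A fixture can check one corpus file with this helper and skip the affected tests with a message telling the contributor to run `git lfs pull`.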
What I Do Now
After the “Porn in Conda” incident, I changed my approach:
For small libraries (< 100KB test data):
- Bundle professionally named fixtures
- Use package-data in pyproject.toml
- Document in README
- Provide [test] extras
For larger libraries:
- Use a download-on-demand fixture
- Or create separate testdata package
- Keep main package lean
For all projects:
- Audit test file names
- Add source attribution in comments
- Document test data decisions
- Use synthetic data where possible
The Bottom Line
Should test data be bundled in Python packages?
Answer: It depends.
Small, professionally named test data that validates real-world format parsing? Yes, bundle it.
Large datasets, inappropriate filenames, or end-user applications? No, use alternatives.
The key is balancing developer convenience with professional standards. The www.youporn.com test fixture taught me that technical correctness isn’t enough. You need to consider how your choices appear in corporate environments, security scans, and HR audits.
Name your files professionally. Document your decisions. Keep packages small.
And maybe don’t use porn site URLs as test fixtures.
Final Words + More Resources
My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
- Python Packaging User Guide - Including Data Files
- Setuptools Documentation - package_data
- Pytest Documentation - Fixtures
- Python Packaging User Guide - MANIFEST.in
- PyPA - Declaring Project Metadata
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!