
Should Test Data Be Bundled in Python Packages?

Problem: Porn in My Python Package

I was debugging a dependency issue in my Anaconda environment. I navigated to site-packages/protego/tests and saw:

www.youporn.com
www.pornhub.com
test_xxx_sites.txt

My first reaction: “Did I get hacked?”

I opened the files. They weren’t what they looked like. They were test fixtures for a robots.txt parser library. The Protego library uses real domain names to test robots.txt parsing logic.

Technically valid. Professionally disastrous.

I started wondering: Should test data even be included in Python packages? What’s the best practice here?

The answer isn’t simple. Let me walk you through what I learned.

What Counts as Test Data?

Test data in Python packages falls into three categories:

Synthetic fixtures: Generated data for testing

  • Mock JSON responses
  • Sample config files
  • Fake URLs and domain names

Real-world samples: Actual data from production

  • Real robots.txt files (like Protego uses)
  • Sample PDFs/HTML files
  • Production config snippets

Test corpora: Large datasets for validation

  • 10,000 document samples
  • Multi-gigabyte training data
  • Comprehensive test suites

The question is: Which of these should you bundle in your PyPI package?

The Case FOR Bundling Test Data

There are legitimate reasons to include test data.

User convenience: Tests run immediately after installation

Terminal window
pip install protego
pytest --pyargs protego.tests # Works if the tests ship as an importable package

No downloading test data separately. No setup scripts. Contributors can clone and run tests instantly.

Real-world validation: Synthetic data isn’t enough

Consider a robots.txt parser. You need actual robots.txt files from real domains:

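tests/test_robots_parser.py
from pathlib import Path
from urllib.robotparser import RobotFileParser

def test_parse_example_com():
    """Test parsing of real robots.txt from example.com"""
    parser = RobotFileParser()
    # read() fetches from a URL, so feed the local fixture's lines to parse()
    parser.parse(Path("tests/fixtures/www.example.com").read_text().splitlines())
    assert parser.can_fetch("*", "/allowed")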

I tried writing synthetic robots.txt files. Edge cases emerged that I never predicted:

  • Non-standard crawl-delay formats
  • Weird wildcard patterns
  • Conflicting allow/disallow rules
  • Unicode in paths

Real data caught bugs synthetic data missed.

CI/CD simplicity: No test data setup in pipelines

Your CI just runs pytest. No caching test data, no download scripts, no network dependencies.

The Case AGAINST Bundling Test Data

Then I discovered the downsides. The hard way.

Package size bloat: 50KB becomes 3.2MB

I maintain a config parser library. Added 20 sample configs. Package size jumped from 50KB to 3.2MB.

Users noticed:

  • Longer install times
  • Bigger Docker images
  • Wasted bandwidth for CI (hundreds of installs per day)

Professionalism issues: The www.youporn.com incident

Protego included real domain names as test fixtures. Technically correct. Horrible for professional environments:

Windows search indexing: "Found 3 results for 'porn' in C:\anaconda3"
HR reviewing screen: "Why is porn in your Python packages?"
Security scanner: "SUSPICIOUS FILENAME DETECTED: www.youporn.com"

I work in corporate environments. IT departments run security scans. HR policies flag inappropriate content. My screen gets audited.

Do I want to explain why www.youporn.com is in my site-packages? No.

Installation clutter: Pollution in production

End users installing your package don’t need tests. They see:

pip install mypackage
# Installs to site-packages:
# - mypackage/__init__.py
# - mypackage/core.py
# - mypackage/tests/data/sample_01.json
# - mypackage/tests/data/sample_02.json
# - mypackage/tests/data/sample_03.json
# ... 50 more test files

Clutter. Confusion. “Why do I have test data in production?”

How Python Packaging Handles Test Data

I dug into Python packaging documentation. The official guidance is nuanced.

setup.cfg approach (traditional):

[options]
packages = find:

[options.package_data]
mypackage = tests/data/*.json, tests/data/*.txt

[options.extras_require]
test =
    pytest
    pytest-cov

pyproject.toml approach (modern):

[tool.setuptools]
package-data = { "mypackage" = ["tests/data/*.json", "tests/data/*.txt"] }
[project.optional-dependencies]
test = ["pytest>=7.0", "pytest-cov"]
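Once fixtures are declared as package data, tests can read them through importlib.resources instead of hard-coded paths. A minimal sketch, assuming Python 3.9 or newer; read_fixture is a name I made up for illustration, and mypackage is the placeholder package from the config above:

```python
from importlib.resources import files


def read_fixture(package: str, relative_path: str) -> str:
    """Return the text of a data file shipped inside an installed package."""
    resource = files(package)
    # Walk one path segment at a time so this also works for zipped packages
    for part in relative_path.split("/"):
        resource = resource / part
    return resource.read_text(encoding="utf-8")


# Hypothetical usage, with the placeholder names from the config:
# text = read_fixture("mypackage", "tests/data/sample_01.json")
```

This only finds files that really made it into the installed package, which makes it a quick way to notice a package_data pattern that matches nothing.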

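Critical detail: MANIFEST.in alone does not decide what ends up in a wheel.

As far as I can tell from the setuptools documentation:

  • MANIFEST.in controls which extra files go into source distributions (sdist)
  • Wheels contain the discovered packages plus any files matched by package_data
  • With include_package_data enabled (the default for pyproject.toml projects), data files inside a package that made it into the sdist also land in the wheel; with a plugin such as setuptools-scm, that means every tracked file inside the package
  • To exclude test data from wheels: keep the tests package out of package discovery, or list the files under exclude-package-data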

My mistake: I thought MANIFEST.in controlled everything. It doesn’t.
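Rather than reasoning about configuration, it is often quicker to look inside the built wheel, which is an ordinary zip archive. A small sketch; bundled_test_files is a hypothetical helper, and the wheel filename in the comment is only an example:

```python
import zipfile
from pathlib import PurePosixPath


def bundled_test_files(wheel_path):
    """List archive members of a wheel that sit under a tests/ directory."""
    with zipfile.ZipFile(wheel_path) as wheel:
        return [
            name
            for name in wheel.namelist()
            # Look only at directory components, not at the filename itself
            if "tests" in PurePosixPath(name).parts[:-1]
        ]


# Hypothetical usage after building the project:
# print(bundled_test_files("dist/mypackage-1.0-py3-none-any.whl"))
```

An empty list means no test files shipped; anything else shows exactly which fixtures your users will receive.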

When SHOULD You Include Test Data?

After researching and experimenting, here’s my decision framework:

Include test data if:

  • Total size < 100KB
  • Files are professionally named
  • Testing real-world format parsing (configs, APIs, file formats)
  • Package is a library (not an end-user app)
  • Core functionality depends on this data

Real-world examples where it makes sense:

Protego (robots.txt parser):

  • Needs real robots.txt files
  • ~20 small text files
  • Justifies inclusion

PyPDF2 (PDF parser):

  • Needs sample PDFs to test parsing
  • Small fixture files
  • Legitimate use case

Requests (HTTP library):

  • Mock HTTP responses
  • Essential for testing
  • Professionally named

My rule of thumb: If the data is small (< 100KB total) and professionally named, bundling is acceptable.
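The size half of that rule is easy to check automatically. A rough sketch, assuming fixtures live in one directory tree; fixture_size, within_budget, and the tests/data path in the comment are illustrative names, not part of any tool:

```python
from pathlib import Path

SIZE_BUDGET_BYTES = 100 * 1024  # the 100KB rule of thumb from this article


def fixture_size(directory):
    """Total size in bytes of all regular files under a directory."""
    return sum(p.stat().st_size for p in Path(directory).rglob("*") if p.is_file())


def within_budget(directory, budget=SIZE_BUDGET_BYTES):
    """True when the fixtures stay below the size budget."""
    return fixture_size(directory) < budget


# Hypothetical usage in CI:
# assert within_budget("tests/data"), "test fixtures exceed 100KB"
```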

When SHOULD You NOT Include Test Data?

Exclude test data if:

  • Total size > 100KB
  • Filenames are unprofessional (learn from my mistake)
  • Package is an end-user application
  • Data can be generated synthetically
  • Tests are for development, not validation

Large datasets: Use the download pattern

Instead of bundling 10MB of test data, download it on first use:

tests/conftest.py
import pytest
from pathlib import Path
import urllib.request

@pytest.fixture(scope="session")
def test_data(tmp_path_factory):
    """Download test data once per test session"""
    data_dir = tmp_path_factory.getbasetemp() / "test_data"
    if not data_dir.exists():
        data_dir.mkdir(parents=True)
        urllib.request.urlretrieve(
            "https://example.com/test-corpus.tar.gz",
            data_dir / "corpus.tar.gz"
        )
        # Extract...
    return data_dir

Inappropriate content: Learn from my pain

Rename files before bundling:

# BAD
www.youporn.com
test_pornhub.txt
illegal_content.dat
# GOOD
sample_robots_commercial_01.txt
test_robots_adult_industry.txt
robots_restricted_access.txt

The filenames describe the CONTENT, not the SOURCE.

Add source attribution in comments:

tests/fixtures/sample_robots_01.txt
# Source: www.youporn.com/robots.txt (retrieved 2024-01-15)
# Purpose: Test parsing of Crawl-delay directive
User-agent: *
Crawl-delay: 5
Disallow: /login
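The rename-and-attribute step can be scripted. A rough sketch, assuming the fixture format accepts # comments (robots.txt does); neutralize_fixture and its parameters are names I made up for illustration:

```python
from pathlib import Path


def neutralize_fixture(source_file, target_file, source_url, retrieved, purpose):
    """Copy a fixture under a neutral name and prepend attribution comments."""
    body = Path(source_file).read_text(encoding="utf-8")
    header = (
        f"# Source: {source_url} (retrieved {retrieved})\n"
        f"# Purpose: {purpose}\n"
    )
    target = Path(target_file)
    target.write_text(header + body, encoding="utf-8")
    return target
```

The original file is left untouched, so you can delete it once the tests point at the neutral name.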

End-user applications: Don’t bundle tests

If you’re building a tool users run (not import):

setup.py
from setuptools import find_packages, setup
# Leave the tests and docs packages out of the built distribution
setup(packages=find_packages(exclude=["tests", "tests.*", "docs", "docs.*"]))

Users running pip install mycliapp don’t need your test fixtures.

Best Practices I Learned the Hard Way

1. Naming conventions matter

Professional, descriptive names:

sample_robots_github_01.txt
sample_nginx_config_prod.conf
mock_api_response_200.json
test_data_edge_case_01.csv

Avoid:

  • Domain names (use with caution)
  • Offensive language (obviously)
  • Cryptic abbreviations
  • Industry-specific jargon
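A naming audit like this can run in CI before anything ships. A minimal sketch under stated assumptions: the word list and the domain heuristic are illustrative and deliberately crude, and audit_names is a hypothetical helper:

```python
import re
from pathlib import Path

# Illustrative denylist; extend it to match your own workplace policy
FLAGGED_WORDS = ("porn", "xxx")
# Rough heuristic for names that look like hostnames, e.g. www.example.com
DOMAIN_PATTERN = re.compile(
    r"^(www\.)?[a-z0-9-]+(\.[a-z0-9-]+)*\.(com|org|net|io)$", re.IGNORECASE
)


def audit_names(directory):
    """Return (path, reason) pairs for fixture names worth a second look."""
    findings = []
    for path in sorted(Path(directory).rglob("*")):
        if not path.is_file():
            continue
        name = path.name.lower()
        if any(word in name for word in FLAGGED_WORDS):
            findings.append((path, "flagged word"))
        elif DOMAIN_PATTERN.match(name):
            findings.append((path, "looks like a domain name"))
    return findings
```

A non-empty result is a prompt for a human decision, not proof of a problem.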

2. Size optimization

I reduced my package size from 3.2MB to 89KB by:

  • Compressing text fixtures (gzip)
  • Using minimal representative samples (not full datasets)
  • Removing duplicate test data
  • Generating synthetic data where possible

# Instead of 100 sample configs, use 5 representative ones
# and generate variations programmatically
@pytest.fixture
def nginx_configs():
    """Generate nginx config variations"""
    # load_fixture: the project's own helper for reading a fixture file (not shown)
    base_config = load_fixture("nginx_base.conf")
    # A pytest fixture may yield only once, so return every variation in a list
    return [
        base_config.replace("worker_processes 4", f"worker_processes {workers}")
        for workers in [1, 2, 4, 8]
    ]

3. Documentation is mandatory

Your README should explain:

## Test Data
This package includes test fixtures for validation:
- Total size: ~89KB
- Purpose: Testing real-world config parsing
- Source: Publicly available sample files
- Installation: Use `pip install mypackage[test]` for development
To run tests: `pytest`

In CONTRIBUTING.md:

## Test Data Conventions
- Keep fixtures under 100KB total
- Use professional naming (sample_*, not domain names)
- Document source in file comments
- Prefer synthetic data over real files

4. Separate test extras

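Don’t force test tooling on end users. An extra only adds dependencies; it cannot add or remove files inside the wheel itself:

Terminal window
# End users (no test dependencies)
pip install mypackage
# Developers/contributors (adds pytest and pytest-cov)
pip install mypackage[test]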

Configure in pyproject.toml:

[project.optional-dependencies]
test = [
"pytest>=7.0",
"pytest-cov>=4.0",
]

The Decision Framework

Here’s the checklist I use now:

Does test data exceed 100KB?
├─ Yes → Don't bundle (use download pattern or separate package)
└─ No → Continue
Are filenames professional?
├─ No → Rename before bundling
└─ Yes → Continue
Is it a library (not app)?
├─ Yes → Consider bundling
└─ No → Don't bundle
Does it test real-world formats?
├─ Yes → Bundle it (with professional naming)
└─ No → Generate synthetically
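The same checklist, written as a function so it can sit next to a release script. This is just a sketch of the tree above; should_bundle and its parameter names are my own, and the rename branch returns early, so call it again once the files are renamed:

```python
def should_bundle(size_kb, names_are_professional, is_library, tests_real_formats):
    """Walk the checklist top to bottom and return a recommendation string."""
    if size_kb > 100:
        return "don't bundle: use the download pattern or a separate package"
    if not names_are_professional:
        return "rename before bundling, then re-check"
    if not is_library:
        return "don't bundle"
    if tests_real_formats:
        return "bundle it"
    return "generate synthetically"
```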

Alternative Approaches

1. Separate test data package

For large datasets (>1MB), create a companion package:

mypackage # Main library (50KB)
mypackage-testdata # Test fixtures (5MB)

Users:

Terminal window
pip install mypackage # For use
pip install mypackage[test] # For development (installs testdata)

This keeps the main package small while still providing convenient access.
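With a companion package, the test suite needs to cope with the data not being installed. A small sketch, assuming the companion is importable as mypackage_testdata (a hypothetical module name); find_testdata_dir is an illustrative helper:

```python
from importlib.util import find_spec
from pathlib import Path


def find_testdata_dir(module_name="mypackage_testdata"):
    """Return the install directory of the data package, or None if absent."""
    spec = find_spec(module_name)
    if spec is None or not spec.submodule_search_locations:
        return None
    return Path(next(iter(spec.submodule_search_locations)))
```

In a conftest.py, a fixture could call this and skip the data-heavy tests when it returns None, so a plain pip install mypackage never fails a test run for lack of fixtures.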

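2. pytest-datafiles plugin

Copy fixtures into a per-test temporary directory:

tests/test_fixtures.py
from pathlib import Path
import pytest

FIXTURE_DIR = Path(__file__).parent / "fixtures"

@pytest.mark.datafiles(FIXTURE_DIR / "sample_robots_01.txt")
def test_sample_robots(datafiles):
    """datafiles points at a temp directory holding a copy of the marked file"""
    assert (datafiles / "sample_robots_01.txt").exists()

The plugin copies local files so tests can modify them safely; it does not download anything. For remote data, combine it with the download pattern shown earlier.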

3. Git LFS for large files

Store test data in Git LFS:

Terminal window
git lfs track "tests/corpora/*.json"
git add .gitattributes

Contributors run git lfs pull to get test data. The files stay out of PyPI wheels as long as the directory is not declared as package data.

Trade-off: More complex contributor setup.
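One way to soften that setup cost is to detect when a contributor has not fetched the real files yet. Until git lfs pull runs, each tracked file is a small text pointer whose first line names the LFS spec. A minimal sketch; is_lfs_pointer is a hypothetical helper:

```python
from pathlib import Path

# First line of every Git LFS pointer file, per the LFS pointer format
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """True if the file is still an LFS pointer rather than the real content."""
    with Path(path).open("rb") as handle:
        return handle.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX
```

A test fixture can use this to skip with a message like "run git lfs pull first" instead of failing on unparseable data.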

What I Do Now

After the “Porn in Conda” incident, I changed my approach:

For small libraries (< 100KB of test data):

  • Bundle professionally named fixtures
  • Use package-data in pyproject.toml
  • Document in README
  • Provide [test] extras

For larger libraries:

  • Use a download-on-demand fixture
  • Or create separate testdata package
  • Keep main package lean

For all projects:

  • Audit test file names
  • Add source attribution in comments
  • Document test data decisions
  • Use synthetic data where possible

The Bottom Line

Should test data be bundled in Python packages?

Answer: It depends.

Small, professionally named test data that validates real-world format parsing? Yes, bundle it.

Large datasets, inappropriate filenames, or end-user applications? No, use alternatives.

The key is balancing developer convenience with professional standards. The www.youporn.com test fixture taught me that technical correctness isn’t enough. You need to consider how your choices appear in corporate environments, security scans, and HR audits.

Name your files professionally. Document your decisions. Keep packages small.

And maybe don’t use porn site URLs as test fixtures.

Final Words + More Resources

My intention with this article was to share my knowledge and experience to help others. If you want to contact me, you can reach me by email.
