
How to Build a High-Throughput Web Scraper with Python aiohttp: Concurrent Requests Guide

Purpose

I needed to scrape thousands of URLs for a data collection project. Using the synchronous requests library, each URL took about 1 second. Scraping 1000 URLs would take over 16 minutes. I wanted to speed this up using async HTTP requests.

Environment

  • Python 3.11
  • aiohttp for async HTTP
  • asyncio for concurrent execution
  • Ubuntu 22.04

The Problem with Sequential Scraping

I started with a simple scraper using requests:

sequential_scraper.py
import requests
import time
def fetch(url):
    try:
        response = requests.get(url, timeout=10)
        return response.text
    except Exception as e:
        print(f"Error fetching {url}: {e}")
        return None
urls = ["https://example.com/page1", "https://example.com/page2"] * 50  # 100 URLs
start = time.time()
for url in urls:
    fetch(url)
print(f"Time: {time.time() - start:.2f}s")

I ran this:

Run sequential scraper
python sequential_scraper.py

Output:

Sequential output
Time: 100.52s

100 URLs took 100 seconds. Each request blocked while waiting for the response.

The aiohttp Solution

I rewrote the scraper using aiohttp with concurrent requests:

async_scraper.py
import aiohttp
import asyncio
import time
async def fetch(session, url):
    """Fetch a single URL with timeout and error handling."""
    try:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as response:
            return await response.text()
    except Exception as e:
        print(f"Error fetching {url}: {e}")
        return None
async def main(urls):
    """Fetch all URLs concurrently with connection limits."""
    connector = aiohttp.TCPConnector(limit_per_host=100)
    async with aiohttp.ClientSession(connector=connector) as session:
        tasks = [fetch(session, url) for url in urls]
        results = await asyncio.gather(*tasks)
        return results
urls = ["https://example.com/page1", "https://example.com/page2"] * 50  # 100 URLs
start = time.time()
results = asyncio.run(main(urls))
print(f"Fetched {len([r for r in results if r])} pages")
print(f"Time: {time.time() - start:.2f}s")

Running the async scraper:

Run async scraper
python async_scraper.py

Output:

Async output
Fetched 100 pages
Time: 2.15s

100 URLs completed in about 2 seconds instead of 100 seconds. That’s roughly a 47x speedup (100.52s / 2.15s).

How It Works

The key components:

  1. ClientSession - One session for all requests. Connection pooling keeps TCP connections open between requests.

  2. TCPConnector - Controls concurrency. limit_per_host=100 means up to 100 concurrent connections per host. A lower value is gentler on servers that cannot take that much load.

  3. asyncio.gather() - Runs all fetch tasks concurrently. Instead of waiting for one request at a time, all requests start together.

  4. ClientTimeout - Prevents stalled requests. A request that hangs won’t block forever.

Sequential vs async timing comparison
Sequential (requests):
┌────┐ ┌────┐ ┌────┐ ┌────┐
│ 1s │ │ 1s │ │ 1s │ │ 1s │ ... = 100s total
└────┘ └────┘ └────┘ └────┘
Async (aiohttp):
┌────┐
│ 1s │ (all 100 requests start together)
└────┘ = ~2s total
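
To see the effect from the diagram without touching a real server, here is a small simulation. It is only a sketch: fake_fetch is a hypothetical stand-in that sleeps instead of making an HTTP request, and the 0.05-second delay and the URLs are arbitrary choices.

```python
import asyncio
import time


async def fake_fetch(url: str, delay: float = 0.05) -> str:
    """Hypothetical stand-in for a network request: sleeps instead of doing I/O."""
    await asyncio.sleep(delay)
    return f"body of {url}"


async def run_sequential(urls: list[str]) -> list[str]:
    # Each await finishes before the next request starts.
    return [await fake_fetch(url) for url in urls]


async def run_concurrent(urls: list[str]) -> list[str]:
    # gather() schedules every coroutine at once and returns results in input order.
    return await asyncio.gather(*(fake_fetch(url) for url in urls))


def timed(coro_factory, urls):
    """Run one of the coroutines above and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = asyncio.run(coro_factory(urls))
    return results, time.perf_counter() - start


urls = [f"https://example.com/page{i}" for i in range(10)]
seq_results, seq_time = timed(run_sequential, urls)
con_results, con_time = timed(run_concurrent, urls)
print(f"sequential: {seq_time:.2f}s, concurrent: {con_time:.2f}s")
```

The sequential run has to wait out all ten sleeps one after another, while the concurrent run overlaps them. Both versions return the same list, because gather() keeps results in the order of its inputs.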

Common Mistakes

I made several mistakes before getting this working:

1. Creating session per request:

wrong_session.py
# BAD: New session for each request (slow!)
async def fetch(url):
    async with aiohttp.ClientSession() as session: # New session every time!
        async with session.get(url) as response:
            return await response.text()

This wastes connections. Each session has its own connection pool, so nothing is reused and every request has to open a new TCP connection. Create one session and reuse it.

2. No connection limits:

no_limits.py
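# BAD: No explicit limits, can overwhelm a server
async with aiohttp.ClientSession() as session: # Default connector: 100 connections total, no per-host cap
    tasks = [fetch(session, url) for url in urls]

Servers may block you for opening too many connections. The default connector caps the pool at 100 connections in total but sets no limit per host, so all of them can hit the same server. Use TCPConnector with explicit limits.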
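
Besides the connector limit, a common complementary way to cap load is an asyncio.Semaphore around each request. This is a different technique from TCPConnector, shown here only as a sketch: MAX_IN_FLIGHT and limited_fetch are hypothetical names, and asyncio.sleep stands in for the real session.get() call so the example runs without a network.

```python
import asyncio

MAX_IN_FLIGHT = 5  # hypothetical cap; tune it to what the target server tolerates


async def limited_fetch(sem: asyncio.Semaphore, url: str, stats: dict) -> str:
    """Run one simulated request while holding a semaphore slot."""
    async with sem:  # waits here while MAX_IN_FLIGHT requests are already running
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        try:
            await asyncio.sleep(0.01)  # placeholder for `session.get(url)`
            return url
        finally:
            stats["active"] -= 1


async def crawl(urls: list[str]):
    sem = asyncio.Semaphore(MAX_IN_FLIGHT)
    stats = {"active": 0, "peak": 0}  # tracks how many requests overlap
    results = await asyncio.gather(*(limited_fetch(sem, u, stats) for u in urls))
    return results, stats["peak"]


results, peak = asyncio.run(crawl([f"https://example.com/page{i}" for i in range(50)]))
print(f"fetched {len(results)} URLs, peak concurrency {peak}")
```

All 50 tasks are created up front, but the semaphore lets only MAX_IN_FLIGHT of them run the request body at any moment. The peak counter is there purely to make that visible.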

3. No timeout:

no_timeout.py
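# BAD: No explicit timeout, a stalled request can hang for minutes
async with session.get(url) as response: # Falls back to aiohttp's default timeout
    return await response.text()

Without an explicit timeout, aiohttp falls back to its default total timeout of 5 minutes, so a stalled request ties up a connection for far too long. Always set ClientTimeout.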
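
The idea of a deadline can be tried out without a real server. The sketch below uses asyncio.wait_for from the standard library as a stand-in for ClientTimeout; stalled_request and fetch_with_deadline are hypothetical names, and the 60-second sleep simulates a server that never answers in time.

```python
import asyncio


async def stalled_request() -> str:
    """Hypothetical request that hangs far longer than we are willing to wait."""
    await asyncio.sleep(60)
    return "late response"


async def fetch_with_deadline(seconds: float):
    """Give the request a deadline and return None when it is missed."""
    try:
        return await asyncio.wait_for(stalled_request(), timeout=seconds)
    except asyncio.TimeoutError:
        # wait_for cancels the stalled coroutine before raising.
        print(f"Gave up after {seconds}s")
        return None


result = asyncio.run(fetch_with_deadline(0.05))
print(result)
```

Returning None on a missed deadline mirrors how fetch() in async_scraper.py reports a failed URL, so the caller can filter those out afterwards.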

4. No error handling:

no_error_handling.py
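# BAD: One failure crashes entire scraper
async with session.get(url) as response:
    return await response.text() # Raises exception on failure

By default, asyncio.gather() re-raises the first exception to its caller, so one failed URL loses the results of every other request. Wrap the request in try/except for resilience.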
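
An alternative to try/except inside each fetch is to pass return_exceptions=True to asyncio.gather(), which hands exceptions back as ordinary results. The following is a minimal sketch, not the approach used in async_scraper.py: flaky_fetch and fetch_all are hypothetical names, and the URL containing "bad" is made to fail on purpose.

```python
import asyncio


async def flaky_fetch(url: str) -> str:
    """Hypothetical fetch that fails for any URL containing 'bad'."""
    await asyncio.sleep(0.01)  # placeholder for the real request
    if "bad" in url:
        raise ConnectionError(f"cannot reach {url}")
    return f"body of {url}"


async def fetch_all(urls: list[str]):
    # With return_exceptions=True, a failure becomes an item in the result list
    # instead of propagating out of gather().
    outcomes = await asyncio.gather(
        *(flaky_fetch(u) for u in urls), return_exceptions=True
    )
    pages = {u: o for u, o in zip(urls, outcomes) if not isinstance(o, BaseException)}
    failures = {u: o for u, o in zip(urls, outcomes) if isinstance(o, BaseException)}
    return pages, failures


urls = [
    "https://example.com/page1",
    "https://example.com/bad",
    "https://example.com/page2",
]
pages, failures = asyncio.run(fetch_all(urls))
print(f"{len(pages)} succeeded, {len(failures)} failed")
```

Keeping the exception objects, rather than swallowing them, makes it possible to log or retry specific URLs later.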

Summary

In this post, I showed how to build a high-throughput web scraper with aiohttp. The key points are: use one ClientSession for all requests, set connection limits with TCPConnector, add timeouts, and handle errors. This approach reduced my scraping time for 100 URLs from about 100 seconds to about 2 seconds.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can contact me by email.
