How to Build a High-Throughput Web Scraper with Python aiohttp: Concurrent Requests Guide
Purpose
I needed to scrape thousands of URLs for a data collection project. Using the synchronous requests library, each URL took about 1 second. Scraping 1000 URLs would take over 16 minutes. I wanted to speed this up using async HTTP requests.
Environment
- Python 3.11
- aiohttp for async HTTP
- asyncio for concurrent execution
- Ubuntu 22.04
The Problem with Sequential Scraping
I started with a simple scraper using requests:
import requests
import time

def fetch(url):
    try:
        response = requests.get(url, timeout=10)
        return response.text
    except Exception as e:
        print(f"Error fetching {url}: {e}")
        return None

urls = ["https://example.com/page1", "https://example.com/page2"] * 50  # 100 URLs

start = time.time()
for url in urls:
    fetch(url)
print(f"Time: {time.time() - start:.2f}s")

I ran this:

python sequential_scraper.py

Output:

Time: 100.52s

100 URLs took 100 seconds. Each request blocked while waiting for the response.
The aiohttp Solution
I rewrote the scraper using aiohttp with concurrent requests:
import aiohttp
import asyncio
import time

async def fetch(session, url):
    """Fetch a single URL with timeout and error handling."""
    try:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as response:
            return await response.text()
    except Exception as e:
        print(f"Error fetching {url}: {e}")
        return None

async def main(urls):
    """Fetch all URLs concurrently with connection limits."""
    connector = aiohttp.TCPConnector(limit_per_host=100)
    async with aiohttp.ClientSession(connector=connector) as session:
        tasks = [fetch(session, url) for url in urls]
        results = await asyncio.gather(*tasks)
        return results

urls = ["https://example.com/page1", "https://example.com/page2"] * 50  # 100 URLs

start = time.time()
results = asyncio.run(main(urls))
# Count only fetches that did not fail (fetch returns None on error)
print(f"Fetched {len([r for r in results if r is not None])} pages")
print(f"Time: {time.time() - start:.2f}s")

Running the async scraper:

python async_scraper.py

Output:

Fetched 100 pages
Time: 2.15s

100 URLs completed in about 2 seconds instead of 100 seconds. That’s roughly a 47x speedup (100.52s / 2.15s).
How It Works
The key components:
- ClientSession - One session for all requests. Connection pooling keeps TCP connections open between requests.
- TCPConnector - Controls concurrency. limit_per_host=100 means up to 100 concurrent connections per host, which keeps the scraper from overwhelming the server. The connector's overall limit also defaults to 100 connections.
- asyncio.gather() - Runs all fetch tasks concurrently. Instead of waiting for one request at a time, all requests start together.
- ClientTimeout - Prevents stalled requests. A request that hangs is abandoned after the timeout instead of holding up the run.
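To make the effect of asyncio.gather() concrete without touching the network, here is a minimal sketch that replaces the HTTP call with asyncio.sleep(). The names fake_fetch, run_sequential and run_concurrent are made up for this illustration and are not part of aiohttp or of the scraper above.

```python
import asyncio
import time


async def fake_fetch(url: str, delay: float = 0.05) -> str:
    """Stand-in for a network request: sleeps instead of doing real I/O."""
    await asyncio.sleep(delay)
    return f"body of {url}"


async def run_sequential(urls: list[str]) -> float:
    """Await each request before starting the next; return elapsed seconds."""
    start = time.perf_counter()
    for url in urls:
        await fake_fetch(url)
    return time.perf_counter() - start


async def run_concurrent(urls: list[str]) -> float:
    """Start every request at once with gather; return elapsed seconds."""
    start = time.perf_counter()
    await asyncio.gather(*(fake_fetch(url) for url in urls))
    return time.perf_counter() - start


if __name__ == "__main__":
    urls = [f"https://example.com/page{i}" for i in range(10)]
    print(f"sequential: {asyncio.run(run_sequential(urls)):.2f}s")
    print(f"concurrent: {asyncio.run(run_concurrent(urls)):.2f}s")
```

The sequential run takes roughly the sum of all delays, while the concurrent run takes roughly the longest single delay. That is the same pattern as the 100s versus 2s measurements above.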
Sequential (requests):
┌────┐ ┌────┐ ┌────┐ ┌────┐
│ 1s │ │ 1s │ │ 1s │ │ 1s │ ... = 100s total
└────┘ └────┘ └────┘ └────┘

Async (aiohttp):
┌────┐
│ 1s │ (all 100 requests start together)
└────┘ = ~2s total

Common Mistakes
I made several mistakes before getting this working:
1. Creating session per request:
# BAD: New session for each request (slow!)
async def fetch(url):
    async with aiohttp.ClientSession() as session:  # New session every time!
        async with session.get(url) as response:
            return await response.text()

This wastes connections. Each session has its own connection pool, so nothing is reused and every request opens a new TCP connection. Create one session and reuse it.
2. No connection limits:
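# BAD: No explicit limits, can overwhelm servers
async with aiohttp.ClientSession() as session:  # Default connector: 100 connections total, no per-host cap
    tasks = [fetch(session, url) for url in urls]

Servers may block you for opening too many connections. aiohttp's default connector allows up to 100 simultaneous connections with no per-host limit, so all of them can hit the same server. Use TCPConnector with limits that match what the target can tolerate.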
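Connector limits cap open connections. A complementary technique (not used in the scraper above, so treat it as an assumption on my part) is to cap the number of in-flight tasks with asyncio.Semaphore. The sketch below simulates requests with asyncio.sleep(); limited_fetch, crawl and ConcurrencyTracker are hypothetical names for this illustration.

```python
import asyncio


class ConcurrencyTracker:
    """Records how many simulated requests are in flight at once."""

    def __init__(self) -> None:
        self.current = 0
        self.peak = 0


async def limited_fetch(url, semaphore, tracker):
    """Simulated request that only runs while holding a semaphore slot."""
    async with semaphore:
        tracker.current += 1
        tracker.peak = max(tracker.peak, tracker.current)
        try:
            await asyncio.sleep(0.01)  # stand-in for session.get(url)
            return f"body of {url}"
        finally:
            tracker.current -= 1


async def crawl(urls, max_in_flight=5):
    """Run all simulated fetches, never more than max_in_flight at a time."""
    semaphore = asyncio.Semaphore(max_in_flight)
    tracker = ConcurrencyTracker()
    results = await asyncio.gather(
        *(limited_fetch(url, semaphore, tracker) for url in urls)
    )
    return results, tracker.peak


if __name__ == "__main__":
    urls = [f"https://example.com/page{i}" for i in range(20)]
    results, peak = asyncio.run(crawl(urls))
    print(f"fetched {len(results)} pages, peak concurrency {peak}")
```

In a real scraper the sleep would be replaced by the session.get() call, and the semaphore would sit alongside the TCPConnector limits rather than replace them.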
3. No timeout:
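# BAD: No explicit timeout, stalled requests hold up the run
async with session.get(url) as response:  # Falls back to aiohttp's default timeout
    return await response.text()

Without an explicit timeout, a hanging request falls back to aiohttp's default total timeout of 5 minutes, which stalls the scraper far longer than necessary. Always set ClientTimeout.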
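The idea behind a total timeout can be shown with plain asyncio. This is only a sketch of the concept using asyncio.wait_for(), not a description of how ClientTimeout is implemented. slow_operation and fetch_with_deadline are hypothetical names.

```python
import asyncio


async def slow_operation(delay: float) -> str:
    """Stand-in for a request that takes `delay` seconds to respond."""
    await asyncio.sleep(delay)
    return "response"


async def fetch_with_deadline(delay: float, deadline: float):
    """Return the result, or None if the operation exceeds the deadline."""
    try:
        return await asyncio.wait_for(slow_operation(delay), timeout=deadline)
    except asyncio.TimeoutError:
        # The slow operation was cancelled; report the failure as None,
        # mirroring how fetch() above returns None on error.
        return None


if __name__ == "__main__":
    print(asyncio.run(fetch_with_deadline(delay=0.01, deadline=0.5)))
    print(asyncio.run(fetch_with_deadline(delay=0.5, deadline=0.01)))
```

The first call finishes within its deadline and returns the response. The second is cancelled once the deadline passes and returns None.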
4. No error handling:
# BAD: One failure aborts the entire scraper
async with session.get(url) as response:
    return await response.text()  # Raises exception on failure

An unhandled exception in one fetch propagates out of asyncio.gather(), so the results from every other URL are lost. Use try/except for resilience.
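An alternative to catching exceptions inside each fetch is to pass return_exceptions=True to asyncio.gather(), which returns exceptions as ordinary results so they can be sorted out afterwards. The sketch below simulates failures instead of making real requests. flaky_fetch and fetch_all are hypothetical names.

```python
import asyncio


async def flaky_fetch(url: str) -> str:
    """Simulated fetch that fails for any URL containing 'bad'."""
    await asyncio.sleep(0.01)
    if "bad" in url:
        raise ConnectionError(f"could not reach {url}")
    return f"body of {url}"


async def fetch_all(urls):
    """Collect successes and failures separately instead of stopping early."""
    outcomes = await asyncio.gather(
        *(flaky_fetch(url) for url in urls), return_exceptions=True
    )
    pages, errors = {}, {}
    # gather preserves input order, so outcomes line up with urls
    for url, outcome in zip(urls, outcomes):
        if isinstance(outcome, Exception):
            errors[url] = outcome
        else:
            pages[url] = outcome
    return pages, errors


if __name__ == "__main__":
    urls = [
        "https://example.com/ok1",
        "https://example.com/bad",
        "https://example.com/ok2",
    ]
    pages, errors = asyncio.run(fetch_all(urls))
    print(f"{len(pages)} succeeded, {len(errors)} failed")
```

Keeping the failed URLs in a separate mapping makes it easy to log or retry them later.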
Summary
In this post, I showed how to build a high-throughput web scraper with aiohttp. The key points are: use one ClientSession for all requests, set connection limits with TCPConnector, add timeouts, and handle errors. This approach reduced my scraping time from 100 seconds to 2 seconds.