How to Start Web Scraping with Python for Beginners

The Problem

I wanted to scrape product prices from an e-commerce site. I’d heard web scraping was easy with Python. So I installed BeautifulSoup and wrote this:

first_attempt.py
from bs4 import BeautifulSoup
import requests
url = "https://example-store.com/products"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
prices = soup.find_all('span', class_='price')
print(prices)

The result? An empty list. No prices. No errors. Just nothing.

I spent three hours debugging. Was my selector wrong? Was the site blocking me? Was BeautifulSoup broken?

It turned out the prices were loaded by JavaScript after the page rendered. requests downloads only the initial HTML and never runs scripts, so BeautifulSoup sees that initial HTML and nothing JavaScript creates afterwards.

I had jumped into web scraping without understanding the fundamentals. Here’s what I wish I’d known from the start.

What You Actually Need Before Scraping

Before writing any scraper, I needed three things:

1. Basic Python Knowledge

Variables, functions, loops, and lists. If you can write a function that processes a list, you’re ready.

2. HTML Structure Understanding

I needed to know what tags, classes, and IDs are. When I inspect a page, I should understand:

<div class="product-card" id="item-123">
  <h2 class="title">Product Name</h2>
  <span class="price">$99.99</span>
</div>

The div is a container. class="product-card" is its CSS class, which many elements can share. id="item-123" is an identifier that should be unique on the page. The h2 and span are child elements of the div.
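
To make the tag, class, and id vocabulary concrete, here is a minimal sketch that walks the snippet above with Python's built-in html.parser, so it runs before anything is installed. The names StructurePrinter and describe are made up for this illustration; they are not part of any library.

```python
from html.parser import HTMLParser

SNIPPET = """
<div class="product-card" id="item-123">
  <h2 class="title">Product Name</h2>
  <span class="price">$99.99</span>
</div>
"""


class StructurePrinter(HTMLParser):
    """Record (depth, tag, class, id) for every opening tag."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.rows = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        self.rows.append((self.depth, tag, attrs.get("class"), attrs.get("id")))
        self.depth += 1  # anything that follows is nested inside this tag

    def handle_endtag(self, tag):
        self.depth -= 1


def describe(html):
    """Return the list of (depth, tag, class, id) tuples for an HTML string."""
    parser = StructurePrinter()
    parser.feed(html)
    parser.close()
    return parser.rows


for depth, tag, css_class, element_id in describe(SNIPPET):
    print("  " * depth + f"<{tag}> class={css_class!r} id={element_id!r}")
```

The indentation in the output mirrors the nesting: the h2 and span sit one level inside the div, which is what "child elements" means when you write a selector.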

3. Browser Developer Tools

Right-click any element and select “Inspect.” This shows the HTML structure. I use this constantly to find the right selectors.

The Simplest Stack: requests + BeautifulSoup

For static websites (content loads with the page), this combination works:

terminal
pip install requests beautifulsoup4

Here’s my working first scraper:

basic_scraper.py
import requests
from bs4 import BeautifulSoup
def scrape_headlines(url):
    # Pretend to be a real browser
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    }
    # Fetch the page (the timeout stops the script from hanging forever)
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # Raise an error for 4xx/5xx responses
    # Parse HTML
    soup = BeautifulSoup(response.text, 'html.parser')
    # Find headlines (placeholder selector: adjust it for your target)
    headlines = soup.find_all('h2', class_='article-title')
    # Extract text and link from each headline
    articles = []
    for h in headlines:
        title = h.get_text(strip=True)
        anchor = h.find('a')
        link = anchor.get('href') if anchor else None
        articles.append({'title': title, 'link': link})
    return articles
# Usage
url = 'https://news.ycombinator.com'
articles = scrape_headlines(url)
for i, article in enumerate(articles[:5], 1):
    print(f"{i}. {article['title']}")

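This approach works for static sites such as Hacker News because the HTML contains all the data when the page loads. The h2 / article-title selector is a placeholder: inspect the target page and substitute the tag and class it actually uses, or the function returns an empty list.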

The Real-World Problems I Hit

Problem 1: Getting Blocked

My first scraper worked once, then stopped. The site detected my script and blocked my IP.

Solution: Add headers and delays

polite_scraper.py
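import requests
import time
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept': 'text/html,application/xhtml+xml',
    'Accept-Language': 'en-US,en;q=0.9'
}
# Placeholder URLs: replace with the pages you want to fetch
urls = [
    'https://example-store.com/products?page=1',
    'https://example-store.com/products?page=2'
]
for url in urls:
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    # Wait between requests
    time.sleep(2)  # 2 seconds between requests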
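
A fixed time.sleep(2) always waits the full two seconds, even when parsing the previous page already took that long. One possible refinement, sketched below, is a small helper that only sleeps for whatever is left of the minimum gap. The Throttle class and its parameters are hypothetical names for this example, and the clock and sleep arguments exist only so the helper can be tested without really waiting.

```python
import time


class Throttle:
    """Enforce a minimum gap (in seconds) between consecutive wait() calls."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Usage sketch (requests, urls and headers are assumed to exist as above):
# throttle = Throttle(2.0)
# for url in urls:
#     throttle.wait()
#     response = requests.get(url, headers=headers, timeout=10)
```

The first call returns immediately; later calls block only long enough to keep requests at least min_interval apart.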

Problem 2: Dynamic Content (My Original Problem)

When content loads via JavaScript, BeautifulSoup can’t see it. I needed to check if a site uses dynamic content:

  1. Right-click and “View Page Source” (Ctrl+U)
  2. Search for the data you want
  3. If it’s not there, it’s loaded by JavaScript
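
The same three-step check can be scripted. This is a minimal sketch: appears_in_source is a made-up helper name, and the two page sources are invented strings so the example runs offline.

```python
def appears_in_source(page_source, expected_text):
    """Return True if expected_text occurs in the raw HTML, ignoring case."""
    return expected_text.casefold() in page_source.casefold()


# Hypothetical page sources, inlined so the example runs without a network.
STATIC_PAGE = '<div class="product"><span class="price">$99.99</span></div>'
DYNAMIC_PAGE = '<div id="app"></div><script src="/static/bundle.js"></script>'

for name, source in [("static", STATIC_PAGE), ("dynamic", DYNAMIC_PAGE)]:
    found = appears_in_source(source, "$99.99")
    print(f"{name}: price in initial HTML? {found}")
```

For a real page, pass response.text from requests as page_source and a value you can see in the browser as expected_text. A miss strongly suggests the value is added by JavaScript, although it can also mean the text is encoded differently in the source, so treat the result as a hint rather than proof.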

Solution: Use Selenium for dynamic sites

selenium_scraper.py
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
def scrape_dynamic(url, selector):
    # Set up Chrome in headless mode
    options = Options()
    options.add_argument('--headless')
    options.add_argument('--no-sandbox')
    driver = webdriver.Chrome(options=options)
    wait = WebDriverWait(driver, 10)
    try:
        driver.get(url)
        # Wait up to 10 seconds for the content to load
        wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, selector)))
        # Extract content
        elements = driver.find_elements(By.CSS_SELECTOR, selector)
        results = [e.text for e in elements]
        return results
    finally:
        # Always close the browser, even if the wait times out
        driver.quit()
# Usage
prices = scrape_dynamic(
    'https://spa-example.com/products',
    '.product-price'
)

Problem 3: Pagination

Most sites spread data across multiple pages. I needed to handle “Next” buttons.

pagination_scraper.py
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import time
def scrape_all_pages(base_url, start_url, max_pages=10):
    session = requests.Session()
    session.headers.update({
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    })
    all_items = []
    current_url = start_url
    pages_scraped = 0
    while current_url and pages_scraped < max_pages:
        print(f"Scraping page {pages_scraped + 1}")
        response = session.get(current_url)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        # Extract items (customize for your site)
        items = soup.select('.product-item')
        for item in items:
            title = item.select_one('.title')
            price = item.select_one('.price')
            if title:
                all_items.append({
                    'title': title.get_text(strip=True),
                    'price': price.get_text(strip=True) if price else 'N/A'
                })
        # Find next page
        next_link = soup.select_one('a.next, a[rel="next"]')
        if next_link and next_link.get('href'):
            current_url = urljoin(base_url, next_link['href'])
            pages_scraped += 1
            time.sleep(1)  # Be polite
        else:
            break
    return all_items
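
The urljoin call is what turns whatever is in the "Next" link's href into a URL you can fetch. The sketch below shows how it resolves a few common href shapes. The function name resolve_next and the URLs are invented for illustration. Note that it resolves against the URL of the page the link was found on rather than a fixed base_url; that is my substitution, chosen because it is how a browser treats relative links, and both approaches give the same result for hrefs that start with "/" or a full scheme.

```python
from urllib.parse import urljoin


def resolve_next(current_url, href):
    """Resolve a 'next page' href the way a browser would on current_url."""
    return urljoin(current_url, href)


# Hypothetical current page and hrefs
current_url = "https://example-store.com/products/list?page=1"
hrefs = [
    "?page=2",                           # query-only link
    "/products/list?page=2",             # root-relative link
    "page2.html",                        # relative to the current directory
    "https://cdn.example-store.com/p/3"  # already absolute
]
for href in hrefs:
    print(f"{href!r} -> {resolve_next(current_url, href)}")
```

Printing the resolved URL for the first page or two is a cheap way to confirm the scraper is following the links you expect.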

When to Upgrade Your Tools

I started with BeautifulSoup, but there’s a progression:

+---------------+------------------------+------------+
| Tool          | Best For               | Difficulty |
+---------------+------------------------+------------+
| BeautifulSoup | Learning, small tasks  | Beginner   |
| Scrapy        | Large-scale scraping   | Medium     |
| Selenium      | JavaScript-heavy sites | Medium     |
| Playwright    | Modern dynamic sites   | Medium+    |
+---------------+------------------------+------------+

Decision flow:

Is the content in the initial HTML?
|
+-- YES --> Use requests + BeautifulSoup
|
+-- NO --> Is it a JavaScript-heavy site?
           |
           +-- YES --> Use Selenium or Playwright
           |
           +-- Need to scrape thousands of pages?
               |
               +-- YES --> Use Scrapy

A Practical Project: Price Tracker

Here’s a complete project that taught me the most:

price_tracker.py
import requests
from bs4 import BeautifulSoup
import csv
from datetime import datetime
import time
class PriceTracker:
    def __init__(self, csv_file='prices.csv'):
        self.csv_file = csv_file
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        }
    def get_price(self, url, price_selector, name_selector=None):
        """Extract price from a product page."""
        try:
            response = requests.get(url, headers=self.headers, timeout=10)
            response.raise_for_status()
            soup = BeautifulSoup(response.text, 'html.parser')
            # Extract price
            price_elem = soup.select_one(price_selector)
            if not price_elem:
                return None
            # Clean price text
            price_text = price_elem.get_text(strip=True)
            # Remove currency symbols, keep digits and decimal
            price = float(''.join(c for c in price_text if c.isdigit() or c == '.'))
            # Extract name
            name = None
            if name_selector:
                name_elem = soup.select_one(name_selector)
                if name_elem:
                    name = name_elem.get_text(strip=True)
            return {
                'name': name,
                'price': price,
                'url': url,
                'timestamp': datetime.now().isoformat()
            }
        except Exception as e:
            # Covers network errors and prices that cannot be parsed as a number
            print(f"Error: {e}")
            return None
    def save_price(self, data):
        """Save to CSV file."""
        # Write the header row only when the file does not exist yet
        file_exists = False
        try:
            with open(self.csv_file, 'r'):
                file_exists = True
        except FileNotFoundError:
            pass
        with open(self.csv_file, 'a', newline='') as f:
            fieldnames = ['name', 'price', 'url', 'timestamp']
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            if not file_exists:
                writer.writeheader()
            writer.writerow(data)
    def track_products(self, products):
        """Track multiple products."""
        for product in products:
            data = self.get_price(
                product['url'],
                product['price_selector'],
                product.get('name_selector')
            )
            if data:
                self.save_price(data)
                print(f"Tracked: {data['name']} - ${data['price']}")
            time.sleep(2)  # Be polite
# Usage
tracker = PriceTracker()
products = [
    {
        'url': 'https://example-store.com/product1',
        'price_selector': '.current-price',
        'name_selector': 'h1.product-title'
    }
]
tracker.track_products(products)

This project taught me:

  • HTTP requests with headers
  • HTML parsing with CSS selectors
  • Data cleaning and transformation
  • File I/O with CSV
  • Error handling
  • Rate limiting

Common Mistakes I Made

Mistake 1: Starting with Complex Sites

I tried scraping a React app first. Big mistake. Start with static sites like:

  • News websites
  • Blogs
  • Wikipedia
  • Government data portals

Mistake 2: Ignoring robots.txt

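Before scraping, check the site’s robots.txt, which lives at the root of the domain (for example https://example.com/robots.txt). It lists the paths the site asks crawlers to stay away from.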
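
Python's standard library can read robots.txt for you. The sketch below uses urllib.robotparser on an invented robots.txt string so it runs offline; the user agent name "my-scraper" and the rules themselves are assumptions for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_text):
    """Create a RobotFileParser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
for url in ["https://example.com/products", "https://example.com/private/data"]:
    allowed = parser.can_fetch("my-scraper", url)
    print(f"{url}: allowed={allowed}")
print("Crawl delay:", parser.crawl_delay("my-scraper"))
```

Against a live site you would call parser.set_url('https://example.com/robots.txt') followed by parser.read() instead of parse(), then use can_fetch the same way before each request.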

Mistake 3: No Error Handling

My scrapers crashed on the first unexpected response. Now I wrap every request in try/except and set a timeout:

safe_scraper.py
try:
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
except requests.Timeout:
    print("Request timed out")
except requests.RequestException as e:
    print(f"Request failed: {e}")
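
Printing the error keeps the script alive, but a timeout is often temporary. One possible next step, which goes slightly beyond the try/except above, is to retry a few times with a growing delay. This is a sketch under assumptions: call_with_retries is a made-up helper, and the sleep parameter exists only so it can be tested without waiting.

```python
import time


def call_with_retries(func, attempts=3, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call func(); on a listed exception, wait and retry.

    The delay doubles after each failure (base_delay, 2x, 4x, ...).
    The last failure is re-raised so the caller can still handle it.
    """
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Usage sketch (requests, url and headers are assumed to exist as above):
# def fetch():
#     response = requests.get(url, headers=headers, timeout=10)
#     response.raise_for_status()
#     return response
#
# response = call_with_retries(fetch, retry_on=(requests.RequestException,))
```

Limiting retry_on to requests.RequestException means programming errors such as a typo in a variable name still fail immediately instead of being retried.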

Mistake 4: Hardcoded Selectors

Sites change. My scraper broke when they updated their HTML. Now I use multiple fallback selectors:

flexible_selector.py
# Try multiple selectors; the first one that matches wins
price = (
    soup.select_one('.price-current') or
    soup.select_one('.sale-price') or
    soup.select_one('[data-price]')
)
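
The or-chain works, but it does not tell you which selector matched, so you never notice when the site has changed and you are running on a fallback. A small helper can report that. This is a sketch: first_match is a hypothetical name, and it accepts any function that maps a selector to an element or None, which is the contract of soup.select_one.

```python
def first_match(select_one, selectors):
    """Return (selector, element) for the first selector that matches.

    select_one is any callable that takes a selector string and returns
    an element or None, for example soup.select_one.
    Returns (None, None) when nothing matches.
    """
    for selector in selectors:
        element = select_one(selector)
        if element is not None:
            return selector, element
    return None, None


# Stand-in for a parsed page so the example runs without a network:
# only the second selector "exists" on this fake page.
fake_page = {".sale-price": "<span class='sale-price'>$79.99</span>"}

selector, element = first_match(
    fake_page.get, [".price-current", ".sale-price", "[data-price]"]
)
print(f"matched {selector!r}: {element}")
```

With a real page you would call first_match(soup.select_one, [...]) and log a warning whenever the returned selector is not the first one in the list.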

The Learning Path That Worked

Week 1: Scrape a static news site. Learn requests, BeautifulSoup, and CSS selectors.

Week 2: Build a price tracker. Add CSV storage and error handling.

Week 3: Handle pagination. Scrape multiple pages of results.

Week 4: Try Selenium on a JavaScript-heavy site.

Summary

In this post, I showed how to start web scraping with Python using requests and BeautifulSoup, then progress to handling dynamic content with Selenium.

The key points:

  • Start with static sites using requests + BeautifulSoup
  • Add headers and delays to avoid getting blocked
  • Use Selenium when content loads via JavaScript
  • Handle pagination for multi-page data
  • Always include error handling

The best first project is scraping something you actually care about: sports scores, product prices, or news headlines. Personal interest keeps you motivated through debugging.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!
