How to Start Web Scraping with Python for Beginners

Mar 16, 2026

The Problem

I wanted to scrape product prices from an e-commerce site. I’d heard web scraping was easy with Python. So I installed BeautifulSoup and wrote this:

from bs4 import BeautifulSoup
import requests

url = "https://example-store.com/products"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
prices = soup.find_all('span', class_='price')
print(prices)

The result? An empty list. No prices. No errors. Just nothing.

I spent three hours debugging. Was my selector wrong? Was the site blocking me? Was BeautifulSoup broken?

Turns out, the prices were loaded by JavaScript after the initial page load. requests downloads only the HTML the server sends, so BeautifulSoup never sees what JavaScript adds afterwards.

I had jumped into web scraping without understanding the fundamentals. Here’s what I wish I’d known from the start.

What You Actually Need Before Scraping

Before writing any scraper, I needed three things:

1. Basic Python Knowledge

Variables, functions, loops, and lists. If you can write a function that processes a list, you’re ready.

2. HTML Structure Understanding

I needed to know what tags, classes, and IDs are. When I inspect a page, I should be able to read markup like this:

<div class="product-card" id="item-123">
    <h2 class="title">Product Name</h2>
    <span class="price">$99.99</span>
</div>

The div is a container. class="product-card" is its CSS class, which many elements can share. id="item-123" is an identifier that should be unique on the page. The h2 and span are child elements of the div.
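
To connect this markup to code, here is a minimal sketch that parses the same snippet with BeautifulSoup and reads the class, the ID, and the child elements. It assumes beautifulsoup4 is installed; the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

html = """
<div class="product-card" id="item-123">
    <h2 class="title">Product Name</h2>
    <span class="price">$99.99</span>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# Look up the container by its unique ID
card = soup.find(id='item-123')

# class is multi-valued in HTML, so BeautifulSoup returns a list
card_classes = card['class']

# Child elements are found by tag name and class
title = card.find('h2', class_='title').get_text(strip=True)
price = card.find('span', class_='price').get_text(strip=True)

print(card_classes, title, price)
```

The same three lookups (by ID, by class, by tag) cover most of what a beginner scraper needs.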

3. Browser Developer Tools

Right-click any element and select “Inspect.” This shows the page's live HTML structure, including anything JavaScript has added since the page loaded. I use this constantly to find the right selectors.

The Simplest Stack: requests + BeautifulSoup

For static websites (where the content is already in the HTML the server returns), this combination works:

pip install requests beautifulsoup4

Here’s my working first scraper:

import requests
from bs4 import BeautifulSoup

def scrape_headlines(url):
    # Pretend to be a real browser
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    }

    # Fetch the page (the timeout stops the request from hanging forever)
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # Raise an error for 4xx/5xx responses

    # Parse HTML
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find headlines (adjust selector for your target)
    headlines = soup.find_all('h2', class_='article-title')

    # Extract the title text and the link (if the headline contains one)
    articles = []
    for h in headlines:
        title = h.get_text(strip=True)
        anchor = h.find('a')
        link = anchor.get('href') if anchor else None
        articles.append({'title': title, 'link': link})

    return articles

# Usage
url = 'https://news.ycombinator.com'
articles = scrape_headlines(url)

for i, article in enumerate(articles[:5], 1):
    print(f"{i}. {article['title']}")

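This approach works for Hacker News because it’s a static site: the HTML contains all the data when the page loads. The h2.article-title selector is only a placeholder, though. Hacker News marks up its headlines differently (at the time of writing, as links inside span.titleline elements), so inspect the page and adjust the selector before running this.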

The Real-World Problems I Hit

Problem 1: Getting Blocked

My first scraper worked once, then stopped. The site detected my script and blocked my IP.

Solution: Add headers and delays

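import time
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept': 'text/html,application/xhtml+xml',
    'Accept-Language': 'en-US,en;q=0.9'
}

# Placeholder URLs; replace with the pages you want to fetch
urls = [
    'https://example.com/page/1',
    'https://example.com/page/2'
]

for url in urls:
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    time.sleep(2)  # Wait 2 seconds between requests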
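
A fixed sleep works, but it also waits when enough time has already passed. One possible refinement is a small throttle that enforces a minimum gap between calls. The Throttle class below is a hypothetical helper, not part of requests; it is a sketch that uses only the standard library, with a short interval so the demo finishes quickly.

```python
import time

class Throttle:
    """Enforce a minimum number of seconds between successive calls to wait()."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last_call = None  # monotonic timestamp of the previous call

    def wait(self):
        now = time.monotonic()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
        self._last_call = time.monotonic()

# Usage: call throttle.wait() right before each request
throttle = Throttle(min_interval=0.2)  # something like 2 seconds suits real sites
start = time.monotonic()
for _ in range(3):
    throttle.wait()
    # A real scraper would call requests.get(...) here
elapsed = time.monotonic() - start
print(f"3 calls took {elapsed:.2f}s")
```

The first call goes through immediately; only the later calls are delayed, and only by as much as is still needed.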

Problem 2: Dynamic Content (My Original Problem)

When content loads via JavaScript, it isn’t in the HTML that requests downloads, so BeautifulSoup can’t see it. I needed to check if a site uses dynamic content:

Right-click and “View Page Source” (Ctrl+U)
Search for the data you want
If it’s not there, it’s loaded by JavaScript
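
The same check can be scripted. The sketch below assumes you pass it the raw HTML that requests returned (response.text). appears_in_source is a hypothetical helper name, and the two sample pages are made up for illustration.

```python
def appears_in_source(html, expected_text):
    """Return True if the text you see in the browser is present in the raw HTML."""
    return expected_text.lower() in html.lower()

# Static page: the price is part of the HTML the server sent
static_html = '<span class="price">$99.99</span>'

# Dynamic page: the server sends an empty container that JavaScript fills in later
dynamic_html = '<div id="app"></div><script src="/bundle.js"></script>'

print(appears_in_source(static_html, '$99.99'))   # True
print(appears_in_source(dynamic_html, '$99.99'))  # False
```

A False result for text you can clearly see in the browser is a strong hint that requests + BeautifulSoup alone won't find it.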

Solution: Use Selenium for dynamic sites

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def scrape_dynamic(url, selector):
    # Setup Chrome in headless mode
    options = Options()
    options.add_argument('--headless')
    options.add_argument('--no-sandbox')

    driver = webdriver.Chrome(options=options)
    wait = WebDriverWait(driver, 10)

    try:
        driver.get(url)

        # Wait for content to load
        wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, selector)))

        # Extract content
        elements = driver.find_elements(By.CSS_SELECTOR, selector)
        results = [e.text for e in elements]

        return results
    finally:
        driver.quit()

# Usage
prices = scrape_dynamic(
    'https://spa-example.com/products',
    '.product-price'
)

Problem 3: Pagination

Most sites spread data across multiple pages. I needed to handle “Next” buttons.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import time

def scrape_all_pages(base_url, start_url, max_pages=10):
    session = requests.Session()
    session.headers.update({
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    })

    all_items = []
    current_url = start_url
    pages_scraped = 0

    while current_url and pages_scraped < max_pages:
        print(f"Scraping page {pages_scraped + 1}")

        response = session.get(current_url, timeout=10)
        response.raise_for_status()

        soup = BeautifulSoup(response.text, 'html.parser')

        # Extract items (customize for your site)
        items = soup.select('.product-item')
        for item in items:
            title = item.select_one('.title')
            price = item.select_one('.price')
            if title:
                all_items.append({
                    'title': title.get_text(strip=True),
                    'price': price.get_text(strip=True) if price else 'N/A'
                })

        # Find next page
        next_link = soup.select_one('a.next, a[rel="next"]')
        if next_link and next_link.get('href'):
            current_url = urljoin(base_url, next_link['href'])
            pages_scraped += 1
            time.sleep(1)  # Be polite
        else:
            break

    return all_items
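
How the next link gets turned into a full URL depends on what urljoin is given as its base. The sketch below, using made-up URLs, shows how absolute paths, query-only links, and relative paths resolve. Relative links resolve against the page they appear on, so passing the current page's URL as the base is usually the safer choice.

```python
from urllib.parse import urljoin

current_page = 'https://example.com/shop/list?page=1'

# Absolute path: replaces the whole path of the base
absolute = urljoin(current_page, '/shop/list?page=2')

# Query-only link: keeps the path, swaps the query string
query_only = urljoin(current_page, '?page=2')

# Relative path: resolved against the directory of the current page
relative = urljoin(current_page, 'page2.html')

# Full URL: returned unchanged
full = urljoin(current_page, 'https://example.com/other')

print(absolute)    # https://example.com/shop/list?page=2
print(query_only)  # https://example.com/shop/list?page=2
print(relative)    # https://example.com/shop/page2.html
print(full)        # https://example.com/other
```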

When to Upgrade Your Tools

I started with BeautifulSoup, but there’s a progression:

+------------------+------------------------+------------+
| Tool             | Best For               | Difficulty |
+------------------+------------------------+------------+
| BeautifulSoup    | Learning, small tasks  | Beginner   |
| Scrapy           | Large-scale scraping   | Medium     |
| Selenium         | JavaScript-heavy sites | Medium     |
| Playwright       | Modern dynamic sites   | Medium+    |
+------------------+------------------------+------------+

Decision flow:

Is the content in the initial HTML?
    |
    +-- YES --> Need to scrape thousands of pages?
    |               |
    |               +-- YES --> Use Scrapy
    |               |
    |               +-- NO --> Use requests + BeautifulSoup
    |
    +-- NO --> The content is rendered by JavaScript
                    |
                    +--> Use Selenium or Playwright

A Practical Project: Price Tracker

Here’s a complete project that taught me the most:

import requests
from bs4 import BeautifulSoup
import csv
from datetime import datetime
import time

class PriceTracker:
    def __init__(self, csv_file='prices.csv'):
        self.csv_file = csv_file
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        }

    def get_price(self, url, price_selector, name_selector=None):
        """Extract price from a product page."""
        try:
            response = requests.get(url, headers=self.headers, timeout=10)
            response.raise_for_status()

            soup = BeautifulSoup(response.text, 'html.parser')

            # Extract price
            price_elem = soup.select_one(price_selector)
            if not price_elem:
                return None

            # Clean price text
            price_text = price_elem.get_text(strip=True)
            # Remove currency symbols, keep digits and decimal
            price = float(''.join(c for c in price_text if c.isdigit() or c == '.'))

            # Extract name
            name = None
            if name_selector:
                name_elem = soup.select_one(name_selector)
                if name_elem:
                    name = name_elem.get_text(strip=True)

            return {
                'name': name,
                'price': price,
                'url': url,
                'timestamp': datetime.now().isoformat()
            }

        except Exception as e:
            print(f"Error: {e}")
            return None

    def save_price(self, data):
        """Save to CSV file."""
        file_exists = False
        try:
            with open(self.csv_file, 'r'):
                file_exists = True
        except FileNotFoundError:
            pass

        with open(self.csv_file, 'a', newline='') as f:
            fieldnames = ['name', 'price', 'url', 'timestamp']
            writer = csv.DictWriter(f, fieldnames=fieldnames)

            if not file_exists:
                writer.writeheader()

            writer.writerow(data)

    def track_products(self, products):
        """Track multiple products."""
        for product in products:
            data = self.get_price(
                product['url'],
                product['price_selector'],
                product.get('name_selector')
            )

            if data:
                self.save_price(data)
                print(f"Tracked: {data['name']} - ${data['price']}")

            time.sleep(2)  # Be polite

# Usage
tracker = PriceTracker()

products = [
    {
        'url': 'https://example-store.com/product1',
        'price_selector': '.current-price',
        'name_selector': 'h1.product-title'
    }
]

tracker.track_products(products)

This project taught me:

HTTP requests with headers
HTML parsing with CSS selectors
Data cleaning and transformation
File I/O with CSV
Error handling
Rate limiting
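
On the data-cleaning step: filtering characters one by one breaks on text such as "From $99.99 to $120.00", where the digits of both prices get glued together. A regular expression that grabs the first number is one alternative. This is only a sketch, and parse_price is a hypothetical helper; it assumes prices use a comma for thousands and a dot for decimals.

```python
import re
from decimal import Decimal

# First run of digits, optionally with comma separators and a decimal part
PRICE_PATTERN = re.compile(r'\d[\d,]*(?:\.\d+)?')

def parse_price(text):
    """Return the first price found in text as a Decimal, or None if there is none."""
    match = PRICE_PATTERN.search(text)
    if match is None:
        return None
    return Decimal(match.group().replace(',', ''))

print(parse_price('$1,299.99'))
print(parse_price('From $99.99 to $120.00'))
print(parse_price('Sold out'))
```

Decimal is used here instead of float so that stored prices are not affected by binary floating-point rounding.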

Common Mistakes I Made

Mistake 1: Starting with Complex Sites

I tried scraping a React app first. Big mistake. Start with static sites like:

News websites
Blogs
Wikipedia
Government data portals

Mistake 2: Ignoring robots.txt

Before scraping, check the site's robots.txt file (for example, https://example.com/robots.txt). It lists which paths the site asks automated clients not to fetch.
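
Python's standard library can read these rules for you. The sketch below feeds a made-up robots.txt to urllib.robotparser and asks whether two URLs may be fetched. The rules and the 'my-scraper' user agent name are assumptions for illustration. For a real site you would point the parser at the live file with set_url() and read() instead of parse().

```python
from urllib.robotparser import RobotFileParser

# Invented rules, standing in for a site's real robots.txt
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

allowed = parser.can_fetch('my-scraper', 'https://example.com/products')
blocked = parser.can_fetch('my-scraper', 'https://example.com/private/data')
delay = parser.crawl_delay('my-scraper')

print(allowed)  # True
print(blocked)  # False
print(delay)    # 2
```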

Mistake 3: No Error Handling

My scrapers crashed on the first unexpected response. Now I wrap everything in try/except:

try:
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
except requests.Timeout:
    print("Request timed out")
except requests.RequestException as e:
    print(f"Request failed: {e}")
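
Catching the exception is only half of the job; transient failures are often worth retrying. Below is a sketch of a retry helper with exponential backoff. retry_with_backoff and flaky_fetch are hypothetical names, and flaky_fetch simulates a request that fails twice so the example runs without a network. In a real scraper you would pass a function that calls requests.get and catch requests.RequestException.

```python
import time

def retry_with_backoff(func, retries=3, base_delay=1.0, exceptions=(Exception,)):
    """Call func(); on failure wait base_delay, then 2x, 4x, ... and try again."""
    for attempt in range(retries + 1):
        try:
            return func()
        except exceptions as error:
            if attempt == retries:
                raise  # out of retries: let the caller see the error
            delay = base_delay * (2 ** attempt)
            print(f"Attempt {attempt + 1} failed ({error}); retrying in {delay}s")
            time.sleep(delay)

# Simulated request that fails twice, then succeeds
calls = {'count': 0}

def flaky_fetch():
    calls['count'] += 1
    if calls['count'] < 3:
        raise ConnectionError('temporary failure')
    return '<html>ok</html>'

result = retry_with_backoff(flaky_fetch, retries=3, base_delay=0.01,
                            exceptions=(ConnectionError,))
print(result, calls['count'])
```

The delays are tiny here so the demo finishes quickly; a base delay of a second or more is a more realistic starting point.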

Mistake 4: Hardcoded Selectors

Sites change. My scraper broke when they updated their HTML. Now I use multiple fallback selectors:

# Try multiple selectors
price = (
    soup.select_one('.price-current') or
    soup.select_one('.sale-price') or
    soup.select_one('[data-price]')
)
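
The same idea can be wrapped in a small helper so the selector list lives in one place. select_first is a hypothetical function, not part of BeautifulSoup, and the sample HTML is invented for the example.

```python
from bs4 import BeautifulSoup

def select_first(soup, selectors):
    """Return the first element matched by any selector, trying them in order."""
    for selector in selectors:
        element = soup.select_one(selector)
        if element is not None:
            return element
    return None

PRICE_SELECTORS = ['.price-current', '.sale-price', '[data-price]']

# The site dropped .price-current, but the second selector still matches
html = '<div><span class="sale-price">$79.99</span></div>'
soup = BeautifulSoup(html, 'html.parser')

price = select_first(soup, PRICE_SELECTORS)
print(price.get_text(strip=True) if price else 'N/A')
```

When a site changes again, only the PRICE_SELECTORS list needs updating.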

The Learning Path That Worked

Week 1: Scrape a static news site. Learn requests, BeautifulSoup, and CSS selectors.

Week 2: Build a price tracker. Add CSV storage and error handling.

Week 3: Handle pagination. Scrape multiple pages of results.

Week 4: Try Selenium on a JavaScript-heavy site.

Summary

In this post, I showed how to start web scraping with Python using requests and BeautifulSoup, then progress to handling dynamic content with Selenium.

The key points:

Start with static sites using requests + BeautifulSoup
Add headers and delays to avoid getting blocked
Use Selenium when content loads via JavaScript
Handle pagination for multi-page data
Always include error handling

The best first project is scraping something you actually care about: sports scores, product prices, or news headlines. Personal interest keeps you motivated through debugging.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can email me.

Here are the most important links from this article, along with some further resources on the topic:

👨‍💻 BeautifulSoup Documentation
👨‍💻 Python requests library
👨‍💻 Selenium with Python
👨‍💻 Scrapy Framework
👨‍💻 r/learnpython Web Scraping Discussion

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!