How to Start Web Scraping with Python for Beginners
The Problem
I wanted to scrape product prices from an e-commerce site. I’d heard web scraping was easy with Python. So I installed BeautifulSoup and wrote this:
    from bs4 import BeautifulSoup
    import requests

    url = "https://example-store.com/products"
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    prices = soup.find_all('span', class_='price')
    print(prices)

The result? An empty list. No prices. No errors. Just nothing.
I spent three hours debugging. Was my selector wrong? Was the site blocking me? Was BeautifulSoup broken?
Turns out, the prices were loaded by JavaScript after the page rendered. BeautifulSoup only sees the initial HTML, not what JavaScript creates.
I had jumped into web scraping without understanding the fundamentals. Here’s what I wish I’d known from the start.
What You Actually Need Before Scraping
Before writing any scraper, I needed three things:
1. Basic Python Knowledge
Variables, functions, loops, and lists. If you can write a function that processes a list, you’re ready.
2. HTML Structure Understanding
I needed to know what tags, classes, and IDs are. When I inspect a page, I should understand:
    <div class="product-card" id="item-123">
      <h2 class="title">Product Name</h2>
      <span class="price">$99.99</span>
    </div>

The div is a container. class="product-card" is the CSS class. id="item-123" is unique. The h2 and span are child elements.
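To make the container/child relationship concrete, here is a minimal sketch that walks the same snippet and prints each tag with its nesting depth, class, and id. It uses only the standard library's html.parser so it runs without installing anything; the names StructureRecorder and outline are my own for illustration, and BeautifulSoup does the same job with far less code.

```python
from html.parser import HTMLParser

SAMPLE = """
<div class="product-card" id="item-123">
  <h2 class="title">Product Name</h2>
  <span class="price">$99.99</span>
</div>
"""

class StructureRecorder(HTMLParser):
    """Record each opening tag with its nesting depth, class, and id."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.rows = []  # tuples of (depth, tag, class, id)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        self.rows.append((self.depth, tag, attrs.get("class"), attrs.get("id")))
        self.depth += 1  # anything that follows is nested inside this tag

    def handle_endtag(self, tag):
        # Simplification: assumes every tag is closed (no <br>, <img>, etc.)
        self.depth -= 1

def outline(html):
    """Return the recorded (depth, tag, class, id) rows for an HTML string."""
    parser = StructureRecorder()
    parser.feed(html)
    parser.close()
    return parser.rows

for depth, tag, css_class, elem_id in outline(SAMPLE):
    print("  " * depth + f"{tag} class={css_class} id={elem_id}")
```

The h2 and span come out one level deeper than the div, which is exactly what "child elements" means when you write selectors.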
3. Browser Developer Tools
Right-click any element and select “Inspect.” This shows the HTML structure. I use this constantly to find the right selectors.
The Simplest Stack: requests + BeautifulSoup
For static websites (content loads with the page), this combination works:
    pip install requests beautifulsoup4

Here’s my working first scraper:
    import requests
    from bs4 import BeautifulSoup

    def scrape_headlines(url):
        # Pretend to be a real browser
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        }

        # Fetch the page
        response = requests.get(url, headers=headers)
        response.raise_for_status()  # Raise error for bad responses

        # Parse HTML
        soup = BeautifulSoup(response.text, 'html.parser')

        # Find headlines (adjust selector for your target)
        headlines = soup.find_all('h2', class_='article-title')

        # Extract text
        articles = []
        for h in headlines:
            title = h.get_text(strip=True)
            anchor = h.find('a')  # look the link up once, it may be missing
            link = anchor.get('href') if anchor else None
            articles.append({'title': title, 'link': link})

        return articles

    # Usage
    url = 'https://news.ycombinator.com'
    articles = scrape_headlines(url)

    for i, article in enumerate(articles[:5], 1):
        print(f"{i}. {article['title']}")

This approach works for a site like Hacker News because it’s a static site: the HTML contains all the data when the page loads. The h2 / article-title selector is a placeholder, so inspect the page and substitute the tag and class your target actually uses. A selector that matches nothing returns an empty list without raising an error.
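One detail the scraper above leaves open: it stores each href exactly as it appears in the HTML, and many sites use relative links. Here is a minimal sketch of resolving them with urllib.parse.urljoin from the standard library. The helper name absolutize and the sample data are made up for illustration.

```python
from urllib.parse import urljoin

def absolutize(base_url, articles):
    """Return copies of the article dicts with 'link' resolved against base_url."""
    resolved = []
    for article in articles:
        link = article.get("link")
        resolved.append({
            **article,
            # Absolute links pass through unchanged; relative ones get the base
            "link": urljoin(base_url, link) if link else None,
        })
    return resolved

# Made-up sample shaped like the output of scrape_headlines()
sample = [
    {"title": "External story", "link": "https://example.com/story"},
    {"title": "Site-relative link", "link": "item?id=1"},
    {"title": "No link", "link": None},
]

for article in absolutize("https://news.ycombinator.com/", sample):
    print(article["title"], "->", article["link"])
```

I'd run this right after scraping, so everything saved later is a link that still works outside the page it came from.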
The Real-World Problems I Hit
Problem 1: Getting Blocked
My first scraper worked once, then stopped. The site detected my script and blocked my IP.
Solution: Add headers and delays
    import requests
    import time

    url = "https://example-store.com/products"  # placeholder target

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Accept': 'text/html,application/xhtml+xml',
        'Accept-Language': 'en-US,en;q=0.9'
    }

    # Wait between requests
    time.sleep(2)  # 2 seconds between requests
    response = requests.get(url, headers=headers)

Problem 2: Dynamic Content (My Original Problem)
When content loads via JavaScript, BeautifulSoup can’t see it. I needed to check if a site uses dynamic content:
- Right-click and “View Page Source” (Ctrl+U)
- Search for the data you want
- If it’s not there, it’s loaded by JavaScript
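The manual check above can be roughly automated. This is a minimal sketch, assuming you already have the raw HTML as a string (for example response.text). It looks for the expected text outside script and style blocks, using only the standard library. The names VisibleText and in_static_html and both sample pages are hypothetical.

```python
from html.parser import HTMLParser

class VisibleText(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())

def in_static_html(html, expected_text):
    """True if expected_text appears in the page's own markup, not only in scripts."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return expected_text in " ".join(parser.chunks)

# Two made-up pages: one ships the price in the HTML, one fills it in later
static_page = '<span class="price">$99.99</span>'
dynamic_page = (
    '<span class="price"></span>'
    '<script>document.querySelector(".price").textContent = "$99.99";</script>'
)

print(in_static_html(static_page, "$99.99"))   # True
print(in_static_html(dynamic_page, "$99.99"))  # False
```

A False here is the same signal as not finding the data in “View Page Source”: requests + BeautifulSoup alone won’t be enough for that page.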
Solution: Use Selenium for dynamic sites
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    def scrape_dynamic(url, selector):
        # Setup Chrome in headless mode
        options = Options()
        options.add_argument('--headless')
        options.add_argument('--no-sandbox')

        driver = webdriver.Chrome(options=options)
        wait = WebDriverWait(driver, 10)

        try:
            driver.get(url)

            # Wait for content to load
            wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, selector)))

            # Extract content
            elements = driver.find_elements(By.CSS_SELECTOR, selector)
            results = [e.text for e in elements]

            return results
        finally:
            driver.quit()

    # Usage
    prices = scrape_dynamic(
        'https://spa-example.com/products',
        '.product-price'
    )

Problem 3: Pagination
Most sites spread data across multiple pages. I needed to handle “Next” buttons.
    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin
    import time

    def scrape_all_pages(base_url, start_url, max_pages=10):
        session = requests.Session()
        session.headers.update({
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        })

        all_items = []
        current_url = start_url
        pages_scraped = 0

        while current_url and pages_scraped < max_pages:
            print(f"Scraping page {pages_scraped + 1}")

            response = session.get(current_url)
            response.raise_for_status()

            soup = BeautifulSoup(response.text, 'html.parser')

            # Extract items (customize for your site)
            items = soup.select('.product-item')
            for item in items:
                title = item.select_one('.title')
                price = item.select_one('.price')
                if title:
                    all_items.append({
                        'title': title.get_text(strip=True),
                        'price': price.get_text(strip=True) if price else 'N/A'
                    })

            # Find next page
            next_link = soup.select_one('a.next, a[rel="next"]')
            if next_link and next_link.get('href'):
                current_url = urljoin(base_url, next_link['href'])
                pages_scraped += 1
                time.sleep(1)  # Be polite
            else:
                break

        return all_items

When to Upgrade Your Tools
I started with BeautifulSoup, but there’s a progression:
+------------------+------------------------+------------+
| Tool             | Best For               | Difficulty |
+------------------+------------------------+------------+
| BeautifulSoup    | Learning, small tasks  | Beginner   |
| Scrapy           | Large-scale scraping   | Medium     |
| Selenium         | JavaScript-heavy sites | Medium     |
| Playwright       | Modern dynamic sites   | Medium+    |
+------------------+------------------------+------------+

Decision flow:
Is the content in the initial HTML?
  |
  +-- YES --> Use requests + BeautifulSoup
  |
  +-- NO --> Is it a JavaScript-heavy site?
        |
        +-- YES --> Use Selenium or Playwright
        |
        +-- Need to scrape thousands of pages?
              |
              +-- YES --> Use Scrapy

A Practical Project: Price Tracker
Here’s a complete project that taught me the most:
    import requests
    from bs4 import BeautifulSoup
    import csv
    from datetime import datetime
    import time

    class PriceTracker:
        def __init__(self, csv_file='prices.csv'):
            self.csv_file = csv_file
            self.headers = {
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
            }

        def get_price(self, url, price_selector, name_selector=None):
            """Extract price from a product page."""
            try:
                response = requests.get(url, headers=self.headers)
                response.raise_for_status()

                soup = BeautifulSoup(response.text, 'html.parser')

                # Extract price
                price_elem = soup.select_one(price_selector)
                if not price_elem:
                    return None

                # Clean price text
                price_text = price_elem.get_text(strip=True)
                # Remove currency symbols, keep digits and decimal
                price = float(''.join(c for c in price_text if c.isdigit() or c == '.'))

                # Extract name
                name = None
                if name_selector:
                    name_elem = soup.select_one(name_selector)
                    if name_elem:
                        name = name_elem.get_text(strip=True)

                return {
                    'name': name,
                    'price': price,
                    'url': url,
                    'timestamp': datetime.now().isoformat()
                }

            except Exception as e:
                print(f"Error: {e}")
                return None

        def save_price(self, data):
            """Save to CSV file."""
            file_exists = False
            try:
                with open(self.csv_file, 'r'):
                    file_exists = True
            except FileNotFoundError:
                pass

            with open(self.csv_file, 'a', newline='') as f:
                fieldnames = ['name', 'price', 'url', 'timestamp']
                writer = csv.DictWriter(f, fieldnames=fieldnames)

                if not file_exists:
                    writer.writeheader()

                writer.writerow(data)

        def track_products(self, products):
            """Track multiple products."""
            for product in products:
                data = self.get_price(
                    product['url'],
                    product['price_selector'],
                    product.get('name_selector')
                )

                if data:
                    self.save_price(data)
                    print(f"Tracked: {data['name']} - ${data['price']}")

                time.sleep(2)  # Be polite

    # Usage
    tracker = PriceTracker()

    products = [
        {
            'url': 'https://example-store.com/product1',
            'price_selector': '.current-price',
            'name_selector': 'h1.product-title'
        }
    ]

    tracker.track_products(products)

This project taught me:
- HTTP requests with headers
- HTML parsing with CSS selectors
- Data cleaning and transformation
- File I/O with CSV
- Error handling
- Rate limiting
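On the data cleaning point: keeping every digit and dot works for a clean string like "$99.99", but it breaks when the element holds two prices or no price at all. Here is a slightly sturdier sketch, assuming US-style formatting (comma for thousands, dot for decimals). The name parse_price and the sample strings are made up, and it returns a Decimal rather than a float so that currency values aren't subject to float rounding.

```python
import re
from decimal import Decimal

# A digit, then digits or thousands commas, then an optional decimal part
PRICE_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")

def parse_price(text):
    """Return the first price found in text as a Decimal, or None if there is none.

    Assumes US-style formatting: ',' groups thousands and '.' marks decimals.
    """
    match = PRICE_RE.search(text)
    if match is None:
        return None
    return Decimal(match.group().replace(",", ""))

print(parse_price("$1,299.99"))
print(parse_price("Now $19.50 (was $25.00)"))
print(parse_price("Out of stock"))
```

Returning None for text with no number lets the caller skip the row instead of crashing on float('').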
Common Mistakes I Made
Mistake 1: Starting with Complex Sites
I tried scraping a React app first. Big mistake. Start with static sites like:
- News websites
- Blogs
- Wikipedia
- Government data portals
Mistake 2: Ignoring robots.txt
Before scraping, check the site’s robots.txt file, for example https://example.com/robots.txt. It lists the paths the site asks crawlers to stay away from.
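Python can do this check for you with urllib.robotparser from the standard library. Here is a minimal sketch that parses a made-up robots.txt from a string so it runs offline; the rules, the "my-scraper" user agent, and the helper name load_rules are all assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

def load_rules(robots_lines):
    """Build a parser from the lines of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser

# Made-up rules, not taken from any real site
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /checkout/
Allow: /products/
""".splitlines()

rules = load_rules(SAMPLE_ROBOTS)
print(rules.can_fetch("my-scraper", "https://example.com/products/42"))    # True
print(rules.can_fetch("my-scraper", "https://example.com/checkout/cart"))  # False
```

Against a live site, you would call set_url() with the robots.txt address and then read() instead of parse(), and check can_fetch() before each request.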
Mistake 3: No Error Handling
My scrapers crashed on the first unexpected response. Now I wrap everything in try/except:
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
    except requests.Timeout:
        print("Request timed out")
    except requests.RequestException as e:
        print(f"Request failed: {e}")

Mistake 4: Hardcoded Selectors
Sites change. My scraper broke when they updated their HTML. Now I use multiple fallback selectors:
    # Try multiple selectors
    price = (
        soup.select_one('.price-current') or
        soup.select_one('.sale-price') or
        soup.select_one('[data-price]')
    )

The Learning Path That Worked
Week 1: Scrape a static news site. Learn requests, BeautifulSoup, and CSS selectors.
Week 2: Build a price tracker. Add CSV storage and error handling.
Week 3: Handle pagination. Scrape multiple pages of results.
Week 4: Try Selenium on a JavaScript-heavy site.
Summary
In this post, I showed how to start web scraping with Python using requests and BeautifulSoup, then progress to handling dynamic content with Selenium.
The key points:
- Start with static sites using requests + BeautifulSoup
- Add headers and delays to avoid getting blocked
- Use Selenium when content loads via JavaScript
- Handle pagination for multi-page data
- Always include error handling
The best first project is scraping something you actually care about: sports scores, product prices, or news headlines. Personal interest keeps you motivated through debugging.
Final Words + More Resources
My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
- BeautifulSoup Documentation
- Python requests library
- Selenium with Python
- Scrapy Framework
- r/learnpython Web Scraping Discussion
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!