Project Overview
Build a professional web scraping application that demonstrates your Python skills with HTTP requests, HTML parsing, data extraction, and file handling. Your scraper will collect book data from a practice website, handle multiple pages, implement robust error handling, and export clean CSV files.
What You Will Build
A fully functional web scraper that extracts book data:
```
$ python scraper.py
==============================================
BOOKSCRAPER - Web Scraping Tool
==============================================
Target URL: https://books.toscrape.com
Categories: All Books
[INFO] Starting scraper...
[INFO] Fetching page 1...
[INFO] Found 20 books on page 1
[INFO] Fetching page 2...
[INFO] Found 20 books on page 2
...
[INFO] Fetching page 50...
[INFO] Found 20 books on page 50
==============================================
SCRAPING COMPLETE!
==============================================
Total Books Scraped: 1000
Categories Found: 50
Export File: data/books_2025-01-15.csv
Time Elapsed: 45.2 seconds
[SUCCESS] Data exported to CSV!
```
Skills You Will Apply
HTTP Requests
GET requests, headers, sessions, and response handling
HTML Parsing
BeautifulSoup, CSS selectors, DOM navigation
Data Extraction
Pattern matching, data cleaning, validation
CSV Export
File writing, structured data, encoding
Learning Objectives
Technical Skills
- Make HTTP requests using the requests library
- Parse HTML content with BeautifulSoup
- Navigate DOM trees and extract specific elements
- Handle pagination and multiple pages
- Export data to CSV with proper formatting
Professional Skills
- Implement robust error handling
- Add rate limiting and polite scraping
- Write modular, reusable code
- Document scraping workflow
- Understand web scraping ethics and legality
Project Scenario
DataHarvest Analytics
You have been hired as a Python Developer at DataHarvest Analytics, a data collection startup. The company needs to build a web scraping tool that can extract book information from online bookstores for market research and price comparison analysis. The tool should be reliable, respectful of website resources, and produce clean, analysis-ready data.
"We need a scraper that can collect book data - titles, prices, ratings, and availability. It should handle multiple pages, deal with errors gracefully, and output everything to CSV so our analysts can work with the data. Make sure it's polite to the servers - we don't want to get blocked!"
Core Features Required
- Send HTTP GET requests to target URLs
- Handle response status codes
- Set proper User-Agent headers
- Implement request timeouts
- Parse HTML with BeautifulSoup
- Use CSS selectors to find elements
- Extract text, attributes, and links
- Handle missing or malformed data
- Extract book titles, prices, ratings
- Get availability status and categories
- Follow links to detail pages
- Clean and normalize extracted data
- Export data to CSV format
- Handle Unicode characters properly
- Create timestamped output files
- Include summary statistics
The Dataset
Your scraper will extract data similar to real-world book datasets. We provide sample data to help you understand the expected output format and validate your scraping results.
Dataset Download
Download the sample dataset from Kaggle (Amazon Top 50 Bestselling Books 2009-2019) to understand the expected output format and compare your scraped results.
Original Data Source
This project is inspired by the Amazon Top 50 Bestselling Books 2009-2019 dataset from Kaggle - a popular dataset containing 550 books scraped from Amazon. The dataset demonstrates real-world book data with titles, authors, ratings, reviews, prices, and genres that you will learn to extract through web scraping.
Dataset Schema (Kaggle Format)
| Column | Type | Description |
|---|---|---|
| Name | String | Book title |
| Author | String | Author name |
| User Rating | Decimal | Average user rating (0.0-5.0) |
| Reviews | Integer | Number of user reviews |
| Price | Integer | Price in USD |
| Year | Integer | Year of publication (2009-2019) |
| Genre | String | Fiction or Non Fiction |
Sample Data Preview
Here is sample data from the Kaggle dataset (Amazon bestsellers):
| Name | Author | User Rating | Reviews | Price | Year | Genre |
|---|---|---|---|---|---|---|
| Becoming | Michelle Obama | 4.8 | 61133 | $11 | 2019 | Non Fiction |
| Where the Crawdads Sing | Delia Owens | 4.8 | 87841 | $15 | 2019 | Fiction |
| Educated: A Memoir | Tara Westover | 4.7 | 42865 | $14 | 2018 | Non Fiction |
Project Requirements
Your project must include all of the following components. Structure your code with clear organization, proper documentation, and follow Python best practices.
Project Structure
Organize your code into the following structure:
```
web-scraper/
├── scraper.py              # Main entry point
├── fetcher.py              # HTTP request handling
├── parser.py               # BeautifulSoup parsing logic
├── exporter.py             # CSV export functionality
├── config.py               # Configuration settings
├── utils.py                # Helper functions
├── data/
│   ├── books_YYYY-MM-DD.csv  # Scraped output (auto-generated)
│   └── categories.csv        # Category summary
├── tests/
│   ├── test_fetcher.py
│   ├── test_parser.py
│   └── test_exporter.py
├── requirements.txt        # Dependencies
└── README.md               # Project documentation
```
Fetcher Module (HTTP Requests)
Handle all HTTP operations:
- get_page(url): Fetch a single URL and return response
- get_pages(urls): Fetch multiple URLs with rate limiting
- Headers: Set User-Agent and Accept headers
- Timeouts: Implement request timeouts (10 seconds)
- Retries: Retry failed requests up to 3 times
- Rate limiting: Wait 1 second between requests
```python
import requests
import time
from typing import Optional


class Fetcher:
    BASE_URL = "https://books.toscrape.com"

    def __init__(self, delay: float = 1.0, max_retries: int = 3):
        self.delay = delay
        self.max_retries = max_retries
        self.session = requests.Session()
        self.session.headers.update({
            'User-Agent': 'BookScraper/1.0 (Student Project)',
            'Accept': 'text/html,application/xhtml+xml'
        })

    def get_page(self, url: str) -> Optional[str]:
        """Fetch a single page and return HTML content, retrying on failure."""
        for attempt in range(1, self.max_retries + 1):
            try:
                time.sleep(self.delay)  # Rate limiting: be polite to the server
                response = self.session.get(url, timeout=10)
                response.raise_for_status()
                return response.text
            except requests.RequestException as e:
                print(f"Error fetching {url} (attempt {attempt}/{self.max_retries}): {e}")
        return None
```
Parser Module (BeautifulSoup)
Parse HTML and extract data:
- parse_book_list(html): Extract all books from a listing page
- parse_book_detail(html): Extract full details from a book page
- parse_categories(html): Extract all category links
- parse_pagination(html): Find next page link if exists
- clean_price(text): Convert "£51.77" to float 51.77
- clean_rating(class_name): Convert "star-rating Three" to 3
```python
from bs4 import BeautifulSoup
from dataclasses import dataclass
from typing import List


@dataclass
class Book:
    title: str
    price: float
    rating: int
    availability: str
    category: str = ""
    upc: str = ""
    description: str = ""
    url: str = ""


class Parser:
    RATING_MAP = {
        'One': 1, 'Two': 2, 'Three': 3,
        'Four': 4, 'Five': 5
    }

    def parse_book_list(self, html: str) -> List[Book]:
        """Extract all books from a listing page."""
        soup = BeautifulSoup(html, 'html.parser')
        books = []
        for article in soup.select('article.product_pod'):
            title = article.h3.a['title']
            price = self._clean_price(article.select_one('.price_color').text)
            rating = self._get_rating(article.select_one('.star-rating'))
            availability = article.select_one('.availability').text.strip()
            url = article.h3.a['href']
            books.append(Book(title, price, rating, availability, url=url))
        return books

    def _clean_price(self, text: str) -> float:
        """Convert a price string such as '£51.77' to the float 51.77."""
        return float(text.replace('£', '').strip())

    def _get_rating(self, tag) -> int:
        """Map a 'star-rating Three' class list to the integer 3."""
        for css_class in tag.get('class', []):
            if css_class in self.RATING_MAP:
                return self.RATING_MAP[css_class]
        return 0
```
Exporter Module (CSV)
Export data to CSV files:
- export_books(books, filepath): Export book list to CSV
- export_categories(categories, filepath): Export category summary
- Encoding: Use UTF-8 encoding for Unicode support
- Timestamps: Include date in filename
- Headers: Include column headers in first row
```python
import csv
import os
from datetime import datetime
from typing import List, Optional

from parser import Book


class Exporter:
    def export_books(self, books: List[Book], filepath: Optional[str] = None) -> str:
        """Export books to a CSV file and return the path written."""
        if filepath is None:
            date_str = datetime.now().strftime('%Y-%m-%d')
            filepath = f"data/books_{date_str}.csv"
        # Create the data directory if it does not exist yet
        os.makedirs(os.path.dirname(filepath) or '.', exist_ok=True)
        with open(filepath, 'w', newline='', encoding='utf-8') as f:
            writer = csv.writer(f)
            writer.writerow(['title', 'price', 'rating', 'availability',
                             'category', 'upc', 'description', 'url'])
            for book in books:
                writer.writerow([
                    book.title, book.price, book.rating, book.availability,
                    book.category, book.upc, book.description, book.url
                ])
        return filepath
```
Error Handling
Implement robust error handling:
- Network errors: Handle connection timeouts and failures
- HTTP errors: Handle 404, 500, and other status codes
- Parsing errors: Handle missing elements gracefully
- Logging: Log all errors with timestamps
- Recovery: Continue scraping even if some pages fail
CLI Interface
Command-line arguments:
- python scraper.py: Scrape all books
- python scraper.py --category Travel: Scrape a specific category
- python scraper.py --pages 5: Limit to first N pages
- python scraper.py --output data/my_books.csv: Custom output file
- python scraper.py --verbose: Show detailed progress
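These flags map directly onto the stdlib argparse module; here is a minimal sketch (flag names come from the spec, help strings and the `argv` parameter are assumptions made for testability):

```python
import argparse
from typing import Optional, Sequence


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Parse the CLI flags listed above."""
    cli = argparse.ArgumentParser(description='BookScraper - scrape book data to CSV')
    cli.add_argument('--category', help='Scrape only the given category (e.g. Travel)')
    cli.add_argument('--pages', type=int, help='Limit scraping to the first N pages')
    cli.add_argument('--output', help='Custom output CSV path')
    cli.add_argument('--verbose', action='store_true', help='Show detailed progress')
    return cli.parse_args(argv)  # argv=None falls back to sys.argv
```

Accepting an explicit `argv` lets your unit tests call `parse_args(['--pages', '5'])` without touching `sys.argv`.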
Feature Specifications
Implement the following features with proper error handling. Each feature should be testable independently.
- Fetch pages with proper headers
- Implement 1-second delay between requests
- Retry failed requests (max 3 times)
- Handle timeout errors (10s limit)
- Log response status codes
- Support session persistence
- Parse book listings with BeautifulSoup
- Extract title, price, rating, availability
- Navigate to detail pages for more info
- Extract UPC, description, category
- Handle missing elements gracefully
- Clean and normalize extracted text
- Detect if next page exists
- Build correct next page URL
- Loop through all available pages
- Track total pages processed
- Option to limit number of pages
- Handle last page gracefully
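The next-page detection above can be sketched in a few lines; the `li.next > a` selector matches the pager markup used by books.toscrape.com, and relative links are resolved against the current page URL:

```python
from typing import Optional
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def parse_pagination(html: str, current_url: str) -> Optional[str]:
    """Return the absolute URL of the next page, or None on the last page."""
    soup = BeautifulSoup(html, 'html.parser')
    next_link = soup.select_one('li.next > a')
    if next_link is None:
        return None  # No "next" button: this is the last page
    return urljoin(current_url, next_link['href'])
```

Returning None doubles as the loop's stop condition, so the last page is handled without special-casing.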
- Export to CSV with UTF-8 encoding
- Include header row
- Handle special characters in text
- Create timestamped filenames
- Create data directory if missing
- Return filepath after export
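One way to sanity-check an export is to read the file straight back with `csv.DictReader` (a small helper sketch, not part of the required API):

```python
import csv
from typing import Dict, List


def load_books_csv(filepath: str) -> List[Dict[str, str]]:
    """Read an exported CSV back as a list of dicts for a quick sanity check."""
    with open(filepath, newline='', encoding='utf-8') as f:
        return list(csv.DictReader(f))
```

Because DictReader keys rows by the header line, a missing or reordered column shows up immediately when you inspect the first dict.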
- Catch network connection errors
- Handle HTTP 4xx/5xx responses
- Deal with missing HTML elements
- Log errors with context
- Continue on individual failures
- Report summary of errors at end
- Total books scraped count
- Categories found count
- Pages processed count
- Errors encountered count
- Time elapsed
- Average price and rating
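The averages in the summary can be computed with the stdlib `statistics` module; a minimal sketch (function name and rounding are assumptions):

```python
from statistics import mean
from typing import List, Tuple


def price_and_rating_stats(prices: List[float], ratings: List[int]) -> Tuple[float, float]:
    """Average price and rating for the end-of-run summary, rounded to 2 dp."""
    if not prices or not ratings:
        return 0.0, 0.0  # Avoid StatisticsError on an empty scrape
    return round(mean(prices), 2), round(mean(ratings), 2)
```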
Sample Output: Scraping Complete
```
$ python scraper.py --verbose
==============================================
BOOKSCRAPER - Web Scraping Tool
==============================================
[INFO] Initializing scraper...
[INFO] Target: https://books.toscrape.com
[INFO] Rate limit: 1.0 seconds between requests
[PAGE 1/50] Fetching catalogue/page-1.html
  ✓ Found 20 books
  → A Light in the Attic (£51.77, 3 stars)
  → Tipping the Velvet (£53.74, 1 star)
  → Soumission (£50.10, 1 star)
  ...
[PAGE 2/50] Fetching catalogue/page-2.html
  ✓ Found 20 books
...
==============================================
SCRAPING COMPLETE!
==============================================
Summary:
  Total Books: 1000
  Categories: 50
  Pages Scraped: 50
  Errors: 0
  Time: 52.3 seconds
Output Files:
  → data/books_2025-01-15.csv (1000 rows)
  → data/categories.csv (50 rows)
[SUCCESS] All data exported!
```
Web Scraping Ethics
Web scraping comes with ethical and legal responsibilities. Always follow these guidelines when building scrapers.
- Check robots.txt - Respect website's crawling rules
- Rate limit requests - Add delays between requests (1+ seconds)
- Identify yourself - Set a descriptive User-Agent header
- Cache responses - Don't re-scrape unchanged pages
- Handle errors gracefully - Don't hammer failing servers
- Use practice sites - Like books.toscrape.com for learning
- Don't ignore robots.txt - It's a guideline to follow
- Don't scrape too fast - Can overload servers, get blocked
- Don't scrape personal data - Respect privacy laws (GDPR)
- Don't bypass authentication - Only scrape public data
- Don't violate ToS - Read website terms of service
- Don't redistribute data - Check copyright restrictions
Checking robots.txt
```python
# Always check robots.txt before scraping.
# Note: robots.txt lives at the site root, so build its URL from the domain,
# not from the page path.
import urllib.robotparser
from urllib.parse import urlparse


def can_scrape(url: str, user_agent: str = '*') -> bool:
    """Check if scraping the given URL is allowed by robots.txt."""
    parts = urlparse(url)
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()
    return rp.can_fetch(user_agent, url)


# Example usage
if can_scrape('https://books.toscrape.com/catalogue/'):
    print("Scraping allowed!")
else:
    print("Scraping not allowed by robots.txt")
```
Submission Requirements
Create a public GitHub repository with the exact name shown below:
Required Repository Name
python-web-scraper
Required Project Structure
```
python-web-scraper/
├── scraper.py              # Main entry point (run this)
├── fetcher.py              # HTTP request handling
├── parser.py               # BeautifulSoup parsing
├── exporter.py             # CSV export functionality
├── config.py               # Configuration settings
├── utils.py                # Helper functions
├── data/
│   └── bestsellers with categories.csv  # Kaggle dataset
├── tests/
│   ├── test_fetcher.py     # Unit tests for Fetcher
│   ├── test_parser.py      # Unit tests for Parser
│   └── test_exporter.py    # Unit tests for Exporter
├── screenshots/
│   ├── scraping.png        # Screenshot of scraper running
│   ├── output.png          # Screenshot of CSV output
│   └── stats.png           # Screenshot of statistics
├── requirements.txt        # Dependencies
└── README.md               # Project documentation
```
README.md Required Sections
1. Project Header
- Project title and badges
- Brief description
- Your name and submission date
2. Features
- List all implemented features
- Highlight bonus features
- Libraries used
3. Installation
- Clone command
- Python version (3.8+)
- pip install requirements
4. Usage
- How to run the scraper
- CLI arguments explained
- Example commands
5. Output Format
- CSV column descriptions
- Sample output rows
- Output file locations
6. Project Structure
- Explain each module
- Class diagrams (optional)
7. Testing
- How to run tests
- Test coverage info
8. Ethical Considerations
- robots.txt compliance
- Rate limiting explanation
Do Include
- All Python modules with docstrings
- Sample scraped data CSV files
- Unit tests for core modules
- Screenshots of scraper output
- requirements.txt with dependencies
- Clear README with examples
Do Not Include
- __pycache__ folders
- .pyc compiled files
- Virtual environment folder
- Large data files (>10MB)
- Cached HTML pages
- API keys or credentials
requirements.txt must include: requests, beautifulsoup4, and lxml (optional parser).
Enter your GitHub username - we will verify your repository automatically
Grading Rubric
Your project will be graded on the following criteria. Total: 450 points.
| Criteria | Points | Description |
|---|---|---|
| HTTP Requests | 60 | Proper request handling, headers, timeouts, retries |
| HTML Parsing | 80 | BeautifulSoup usage, CSS selectors, data extraction |
| Pagination | 50 | Handle multiple pages, next page detection |
| CSV Export | 60 | Proper CSV formatting, UTF-8 encoding, headers |
| Error Handling | 70 | Graceful failures, logging, recovery |
| Code Quality | 50 | Modular design, docstrings, type hints |
| Testing | 40 | Unit tests for core modules |
| Documentation | 40 | README, comments, usage examples |
| Total | 450 | |
Grading Levels
| Level | Points | Percentage |
|---|---|---|
| Excellent | 405-450 | 90%+ |
| Good | 360-404 | 80-89% |
| Satisfactory | 315-359 | 70-79% |
| Needs Work | <315 | <70% |
Bonus Points (up to 50 extra)
+15 Points
Add SQLite database storage option in addition to CSV
+20 Points
Implement async scraping with aiohttp for faster performance
+15 Points
Add data visualization of scraped results with matplotlib
Ready to Submit?
Make sure you have completed all requirements and reviewed the grading rubric above.