Project Overview
Build a professional web scraping application that demonstrates your Python skills with HTTP requests, HTML parsing, data extraction, and file handling. Your scraper will collect book data from a practice website, handle multiple pages, implement robust error handling, and export clean CSV files.
What You Will Build
A fully functional web scraper that extracts book data:
```
$ python scraper.py
==============================================
BOOKSCRAPER - Web Scraping Tool
==============================================
Target URL: https://books.toscrape.com
Categories: All Books
[INFO] Starting scraper...
[INFO] Fetching page 1...
[INFO] Found 20 books on page 1
[INFO] Fetching page 2...
[INFO] Found 20 books on page 2
...
[INFO] Fetching page 50...
[INFO] Found 20 books on page 50
==============================================
SCRAPING COMPLETE!
==============================================
Total Books Scraped: 1000
Categories Found: 50
Export File: data/books_2025-01-15.csv
Time Elapsed: 45.2 seconds
[SUCCESS] Data exported to CSV!
```
Skills You Will Apply
HTTP Requests
GET requests, headers, sessions, and response handling
HTML Parsing
BeautifulSoup, CSS selectors, DOM navigation
Data Extraction
Pattern matching, data cleaning, validation
CSV Export
File writing, structured data, encoding
Learning Objectives
Technical Skills
- Make HTTP requests using the requests library
- Parse HTML content with BeautifulSoup
- Navigate DOM trees and extract specific elements
- Handle pagination and multiple pages
- Export data to CSV with proper formatting
Professional Skills
- Implement robust error handling
- Add rate limiting and polite scraping
- Write modular, reusable code
- Document scraping workflow
- Understand web scraping ethics and legality
Project Scenario
DataHarvest Analytics
You have been hired as a Python Developer at DataHarvest Analytics, a data collection startup. The company needs to build a web scraping tool that can extract book information from online bookstores for market research and price comparison analysis. The tool should be reliable, respectful of website resources, and produce clean, analysis-ready data.
"We need a scraper that can collect book data - titles, prices, ratings, and availability. It should handle multiple pages, deal with errors gracefully, and output everything to CSV so our analysts can work with the data. Make sure it's polite to the servers - we don't want to get blocked!"
Core Features Required
- Send HTTP GET requests to target URLs
- Handle response status codes
- Set proper User-Agent headers
- Implement request timeouts
- Parse HTML with BeautifulSoup
- Use CSS selectors to find elements
- Extract text, attributes, and links
- Handle missing or malformed data
- Extract book titles, prices, ratings
- Get availability status and categories
- Follow links to detail pages
- Clean and normalize extracted data
- Export data to CSV format
- Handle Unicode characters properly
- Create timestamped output files
- Include summary statistics
The Dataset
Your scraper will extract data similar to real-world book datasets. We provide sample data to help you understand the expected output format and validate your scraping results.
Dataset Download
Download the sample dataset from Kaggle (Amazon Top 50 Bestselling Books 2009-2019) to understand the expected output format and compare your scraped results.
Original Data Source
This project is inspired by the Amazon Top 50 Bestselling Books 2009-2019 dataset from Kaggle - a popular dataset containing 550 books scraped from Amazon. The dataset demonstrates real-world book data with titles, authors, ratings, reviews, prices, and genres that you will learn to extract through web scraping.
Dataset Schema (Kaggle Format)
| Column | Type | Description |
|---|---|---|
| Name | String | Book title |
| Author | String | Author name |
| User Rating | Decimal | Average user rating (0.0-5.0) |
| Reviews | Integer | Number of user reviews |
| Price | Integer | Price in USD |
| Year | Integer | Year of publication (2009-2019) |
| Genre | String | Fiction or Non Fiction |
Sample Data Preview
Here is sample data from the Kaggle dataset (Amazon bestsellers):
| Name | Author | User Rating | Reviews | Price | Year | Genre |
|---|---|---|---|---|---|---|
| Becoming | Michelle Obama | 4.8 | 61133 | $11 | 2019 | Non Fiction |
| Where the Crawdads Sing | Delia Owens | 4.8 | 87841 | $15 | 2019 | Fiction |
| Educated: A Memoir | Tara Westover | 4.7 | 42865 | $14 | 2018 | Non Fiction |
Project Requirements
Your project must include all of the following components. Structure your code with clear organization, proper documentation, and follow Python best practices.
Project Structure
Organize your code into the following structure:
```
web-scraper/
├── scraper.py              # Main entry point
├── fetcher.py              # HTTP request handling
├── parser.py               # BeautifulSoup parsing logic
├── exporter.py             # CSV export functionality
├── config.py               # Configuration settings
├── utils.py                # Helper functions
├── data/
│   ├── books_YYYY-MM-DD.csv  # Scraped output (auto-generated)
│   └── categories.csv        # Category summary
├── tests/
│   ├── test_fetcher.py
│   ├── test_parser.py
│   └── test_exporter.py
├── requirements.txt        # Dependencies
└── README.md               # Project documentation
```
Fetcher Module (HTTP Requests)
Handle all HTTP operations:
- get_page(url): Fetch a single URL and return response
- get_pages(urls): Fetch multiple URLs with rate limiting
- Headers: Set User-Agent and Accept headers
- Timeouts: Implement request timeouts (10 seconds)
- Retries: Retry failed requests up to 3 times
- Rate limiting: Wait 1 second between requests
```python
import requests
import time
from typing import Optional


class Fetcher:
    BASE_URL = "https://books.toscrape.com"

    def __init__(self, delay: float = 1.0, max_retries: int = 3):
        self.delay = delay
        self.max_retries = max_retries
        self.session = requests.Session()
        self.session.headers.update({
            'User-Agent': 'BookScraper/1.0 (Student Project)',
            'Accept': 'text/html,application/xhtml+xml'
        })

    def get_page(self, url: str) -> Optional[str]:
        """Fetch a single page and return HTML content, retrying on failure."""
        for attempt in range(1, self.max_retries + 1):
            try:
                time.sleep(self.delay)  # Rate limiting: be polite to the server
                response = self.session.get(url, timeout=10)
                response.raise_for_status()
                return response.text
            except requests.RequestException as e:
                print(f"Error fetching {url} (attempt {attempt}/{self.max_retries}): {e}")
        return None
```
Parser Module (BeautifulSoup)
Parse HTML and extract data:
- parse_book_list(html): Extract all books from a listing page
- parse_book_detail(html): Extract full details from a book page
- parse_categories(html): Extract all category links
- parse_pagination(html): Find next page link if exists
- clean_price(text): Convert "£51.77" to float 51.77
- clean_rating(class_name): Convert "star-rating Three" to 3
```python
from bs4 import BeautifulSoup
from dataclasses import dataclass
from typing import List


@dataclass
class Book:
    title: str
    price: float
    rating: int
    availability: str
    category: str = ""
    upc: str = ""
    description: str = ""
    url: str = ""


class Parser:
    RATING_MAP = {
        'One': 1, 'Two': 2, 'Three': 3,
        'Four': 4, 'Five': 5
    }

    def parse_book_list(self, html: str) -> List[Book]:
        """Extract all books from a listing page."""
        soup = BeautifulSoup(html, 'html.parser')
        books = []
        for article in soup.select('article.product_pod'):
            title = article.h3.a['title']
            price = self._clean_price(article.select_one('.price_color').text)
            rating = self._get_rating(article.select_one('.star-rating'))
            availability = article.select_one('.availability').text.strip()
            url = article.h3.a['href']
            books.append(Book(title, price, rating, availability, url=url))
        return books

    def _clean_price(self, text: str) -> float:
        """Convert a price string such as '£51.77' to the float 51.77."""
        return float(text.replace('£', '').strip())

    def _get_rating(self, tag) -> int:
        """Map a 'star-rating Three' class list to the integer 3."""
        for css_class in tag.get('class', []):
            if css_class in self.RATING_MAP:
                return self.RATING_MAP[css_class]
        return 0
```
Exporter Module (CSV)
Export data to CSV files:
- export_books(books, filepath): Export book list to CSV
- export_categories(categories, filepath): Export category summary
- Encoding: Use UTF-8 encoding for Unicode support
- Timestamps: Include date in filename
- Headers: Include column headers in first row
```python
import csv
import os
from datetime import datetime
from typing import List, Optional

from parser import Book


class Exporter:
    def export_books(self, books: List[Book], filepath: Optional[str] = None) -> str:
        """Export books to a CSV file and return the path written."""
        if filepath is None:
            date_str = datetime.now().strftime('%Y-%m-%d')
            filepath = f"data/books_{date_str}.csv"
        # Create the data directory if it does not exist yet
        os.makedirs(os.path.dirname(filepath) or '.', exist_ok=True)
        with open(filepath, 'w', newline='', encoding='utf-8') as f:
            writer = csv.writer(f)
            writer.writerow(['title', 'price', 'rating', 'availability',
                             'category', 'upc', 'description', 'url'])
            for book in books:
                writer.writerow([
                    book.title, book.price, book.rating, book.availability,
                    book.category, book.upc, book.description, book.url
                ])
        return filepath
```
Error Handling
Implement robust error handling:
- Network errors: Handle connection timeouts and failures
- HTTP errors: Handle 404, 500, and other status codes
- Parsing errors: Handle missing elements gracefully
- Logging: Log all errors with timestamps
- Recovery: Continue scraping even if some pages fail
CLI Interface
Command-line arguments:
- python scraper.py: Scrape all books
- python scraper.py --category Travel: Scrape a specific category
- python scraper.py --pages 5: Limit to first N pages
- python scraper.py --output data/my_books.csv: Custom output file
- python scraper.py --verbose: Show detailed progress
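These flags map directly onto the stdlib argparse module; here is a minimal sketch (flag names come from the spec, help strings and the `argv` parameter are assumptions made for testability):

```python
import argparse
from typing import Optional, Sequence


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Parse the CLI flags listed above."""
    cli = argparse.ArgumentParser(description='BookScraper - scrape book data to CSV')
    cli.add_argument('--category', help='Scrape only the given category (e.g. Travel)')
    cli.add_argument('--pages', type=int, help='Limit scraping to the first N pages')
    cli.add_argument('--output', help='Custom output CSV path')
    cli.add_argument('--verbose', action='store_true', help='Show detailed progress')
    return cli.parse_args(argv)  # argv=None falls back to sys.argv
```

Accepting an explicit `argv` lets your unit tests call `parse_args(['--pages', '5'])` without touching `sys.argv`.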
Feature Specifications
Implement the following features with proper error handling. Each feature should be testable independently.
- Fetch pages with proper headers
- Implement 1-second delay between requests
- Retry failed requests (max 3 times)
- Handle timeout errors (10s limit)
- Log response status codes
- Support session persistence
- Parse book listings with BeautifulSoup
- Extract title, price, rating, availability
- Navigate to detail pages for more info
- Extract UPC, description, category
- Handle missing elements gracefully
- Clean and normalize extracted text
- Detect if next page exists
- Build correct next page URL
- Loop through all available pages
- Track total pages processed
- Option to limit number of pages
- Handle last page gracefully
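The next-page detection above can be sketched in a few lines; the `li.next > a` selector matches the pager markup used by books.toscrape.com, and relative links are resolved against the current page URL:

```python
from typing import Optional
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def parse_pagination(html: str, current_url: str) -> Optional[str]:
    """Return the absolute URL of the next page, or None on the last page."""
    soup = BeautifulSoup(html, 'html.parser')
    next_link = soup.select_one('li.next > a')
    if next_link is None:
        return None  # No "next" button: this is the last page
    return urljoin(current_url, next_link['href'])
```

Returning None doubles as the loop's stop condition, so the last page is handled without special-casing.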
- Export to CSV with UTF-8 encoding
- Include header row
- Handle special characters in text
- Create timestamped filenames
- Create data directory if missing
- Return filepath after export
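One way to sanity-check an export is to read the file straight back with `csv.DictReader` (a small helper sketch, not part of the required API):

```python
import csv
from typing import Dict, List


def load_books_csv(filepath: str) -> List[Dict[str, str]]:
    """Read an exported CSV back as a list of dicts for a quick sanity check."""
    with open(filepath, newline='', encoding='utf-8') as f:
        return list(csv.DictReader(f))
```

Because DictReader keys rows by the header line, a missing or reordered column shows up immediately when you inspect the first dict.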
- Catch network connection errors
- Handle HTTP 4xx/5xx responses
- Deal with missing HTML elements
- Log errors with context
- Continue on individual failures
- Report summary of errors at end
- Total books scraped count
- Categories found count
- Pages processed count
- Errors encountered count
- Time elapsed
- Average price and rating
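The averages in the summary can be computed with the stdlib `statistics` module; a minimal sketch (function name and rounding are assumptions):

```python
from statistics import mean
from typing import List, Tuple


def price_and_rating_stats(prices: List[float], ratings: List[int]) -> Tuple[float, float]:
    """Average price and rating for the end-of-run summary, rounded to 2 dp."""
    if not prices or not ratings:
        return 0.0, 0.0  # Avoid StatisticsError on an empty scrape
    return round(mean(prices), 2), round(mean(ratings), 2)
```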
Sample Output: Scraping Complete
```
$ python scraper.py --verbose
==============================================
BOOKSCRAPER - Web Scraping Tool
==============================================
[INFO] Initializing scraper...
[INFO] Target: https://books.toscrape.com
[INFO] Rate limit: 1.0 seconds between requests
[PAGE 1/50] Fetching catalogue/page-1.html
  ✓ Found 20 books
  → A Light in the Attic (£51.77, 3 stars)
  → Tipping the Velvet (£53.74, 1 star)
  → Soumission (£50.10, 1 star)
  ...
[PAGE 2/50] Fetching catalogue/page-2.html
  ✓ Found 20 books
...
==============================================
SCRAPING COMPLETE!
==============================================
Summary:
  Total Books: 1000
  Categories: 50
  Pages Scraped: 50
  Errors: 0
  Time: 52.3 seconds
Output Files:
  → data/books_2025-01-15.csv (1000 rows)
  → data/categories.csv (50 rows)
[SUCCESS] All data exported!
```
Web Scraping Ethics
Web scraping comes with ethical and legal responsibilities. Always follow these guidelines when building scrapers.
- Check robots.txt - Respect website's crawling rules
- Rate limit requests - Add delays between requests (1+ seconds)
- Identify yourself - Set a descriptive User-Agent header
- Cache responses - Don't re-scrape unchanged pages
- Handle errors gracefully - Don't hammer failing servers
- Use practice sites - Like books.toscrape.com for learning
- Don't ignore robots.txt - It's a guideline to follow
- Don't scrape too fast - Can overload servers, get blocked
- Don't scrape personal data - Respect privacy laws (GDPR)
- Don't bypass authentication - Only scrape public data
- Don't violate ToS - Read website terms of service
- Don't redistribute data - Check copyright restrictions
Checking robots.txt
```python
# Always check robots.txt before scraping.
# Note: robots.txt lives at the site root, so build its URL from the domain,
# not from the page path.
import urllib.robotparser
from urllib.parse import urlparse


def can_scrape(url: str, user_agent: str = '*') -> bool:
    """Check if scraping the given URL is allowed by robots.txt."""
    parts = urlparse(url)
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()
    return rp.can_fetch(user_agent, url)


# Example usage
if can_scrape('https://books.toscrape.com/catalogue/'):
    print("Scraping allowed!")
else:
    print("Scraping not allowed by robots.txt")
```
Submission Requirements
Create a public GitHub repository with the exact name shown below:
Required Repository Name
python-web-scraper
Required Project Structure
```
python-web-scraper/
├── scraper.py              # Main entry point (run this)
├── fetcher.py              # HTTP request handling
├── parser.py               # BeautifulSoup parsing
├── exporter.py             # CSV export functionality
├── config.py               # Configuration settings
├── utils.py                # Helper functions
├── data/
│   └── bestsellers with categories.csv  # Kaggle dataset
├── tests/
│   ├── test_fetcher.py     # Unit tests for Fetcher
│   ├── test_parser.py      # Unit tests for Parser
│   └── test_exporter.py    # Unit tests for Exporter
├── screenshots/
│   ├── scraping.png        # Screenshot of scraper running
│   ├── output.png          # Screenshot of CSV output
│   └── stats.png           # Screenshot of statistics
├── requirements.txt        # Dependencies
└── README.md               # Project documentation
```
README.md Required Sections
1. Project Header
- Project title and badges
- Brief description
- Your name and submission date
2. Features
- List all implemented features
- Highlight bonus features
- Libraries used
3. Installation
- Clone command
- Python version (3.8+)
- pip install requirements
4. Usage
- How to run the scraper
- CLI arguments explained
- Example commands
5. Output Format
- CSV column descriptions
- Sample output rows
- Output file locations
6. Project Structure
- Explain each module
- Class diagrams (optional)
7. Testing
- How to run tests
- Test coverage info
8. Ethical Considerations
- robots.txt compliance
- Rate limiting explanation
Do Include
- All Python modules with docstrings
- Sample scraped data CSV files
- Unit tests for core modules
- Screenshots of scraper output
- requirements.txt with dependencies
- Clear README with examples
Do Not Include
- __pycache__ folders
- .pyc compiled files
- Virtual environment folder
- Large data files (>10MB)
- Cached HTML pages
- API keys or credentials
requirements.txt must include: requests, beautifulsoup4, and lxml (optional parser).
Enter your GitHub username - we will verify your repository automatically
Grading Rubric
Your project will be graded on the following criteria. Total: 450 points.
| Criteria | Points | Description |
|---|---|---|
| HTTP Requests | 60 | Proper request handling, headers, timeouts, retries |
| HTML Parsing | 80 | BeautifulSoup usage, CSS selectors, data extraction |
| Pagination | 50 | Handle multiple pages, next page detection |
| CSV Export | 60 | Proper CSV formatting, UTF-8 encoding, headers |
| Error Handling | 70 | Graceful failures, logging, recovery |
| Code Quality | 50 | Modular design, docstrings, type hints |
| Testing | 40 | Unit tests for core modules |
| Documentation | 40 | README, comments, usage examples |
| Total | 450 | |
Grading Levels
| Level | Points | Percentage |
|---|---|---|
| Excellent | 405-450 | 90%+ |
| Good | 360-404 | 80-89% |
| Satisfactory | 315-359 | 70-79% |
| Needs Work | <315 | <70% |
Bonus Points (up to 50 extra)
+15 Points
Add SQLite database storage option in addition to CSV
+20 Points
Implement async scraping with aiohttp for faster performance
+15 Points
Add data visualization of scraped results with matplotlib
Ready to Submit?
Make sure you have completed all requirements and reviewed the grading rubric above.