Module 4.2

Data Loading

Learn to import data from anywhere! Master reading CSV, Excel, and JSON files, connecting to SQL databases, and scraping tables directly from web pages using Pandas.

40 min read
Beginner
Hands-on Examples
What You'll Learn
  • Reading CSV files with various options
  • Loading Excel workbooks and sheets
  • Parsing JSON data structures
  • Connecting to SQL databases
  • Scraping web tables with read_html()
Contents
01

Introduction to Data Loading

Real-world data science projects rarely start with data already in Python. Your data lives in files, databases, APIs, and web pages. Pandas provides a powerful suite of I/O (input/output) functions that can read data from virtually any source and convert it into DataFrames ready for analysis. Mastering these functions is essential because data loading is the first step in every data science workflow.

Why Data Loading Matters

Before you can analyze, visualize, or model data, you need to get it into Python. This might seem trivial, but in practice, data loading can be surprisingly complex. Files might have different encodings, missing values represented in various ways, dates in non-standard formats, or columns that need special parsing. A single misread column can invalidate your entire analysis. That is why Pandas provides dozens of parameters to handle these edge cases, and understanding them will save you hours of debugging later.

Consider a typical data science project: you receive a CSV export from a CRM system, an Excel spreadsheet from the finance team, JSON data from an API, and you need to query a production database for historical records. Each source has its own quirks and formats. Pandas unifies all of these into a consistent DataFrame interface, letting you focus on analysis rather than data wrangling. The key is knowing which function to use and how to configure it properly.

Key Concept

Pandas I/O Functions

Pandas provides read_* functions for importing data and to_* methods for exporting. The most common are read_csv(), read_excel(), read_json(), and read_sql(). Each function returns a DataFrame and accepts parameters to customize how data is parsed.

Why it matters: Using the right parameters during loading prevents data quality issues downstream. It is much easier to parse dates correctly during import than to fix them after the fact.
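A minimal sketch of the difference, assuming a hypothetical events.csv file with an event_date column:

import pandas as pd

# Parse dates while loading (preferred): one parameter, done
df = pd.read_csv('events.csv', parse_dates=['event_date'])

# Fix the type after the fact (works, but easy to forget and repeat)
df2 = pd.read_csv('events.csv')
df2['event_date'] = pd.to_datetime(df2['event_date'])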

Common Data Formats

Data comes in many formats, each with strengths and weaknesses. Understanding these formats helps you choose the right loading function and anticipate potential issues. CSV files are universal but lack data type information. Excel files preserve formatting but can be slow to parse. JSON handles nested data well but can be verbose. SQL databases are powerful but require connection setup. Knowing the trade-offs helps you work efficiently with whatever data source you encounter.

CSV Files

Comma-Separated Values are the most common format. Simple, universal, and human-readable. Use pd.read_csv() with encoding and separator options.

Excel Files

Spreadsheets with multiple sheets, formulas, and formatting. Common in business settings. Use pd.read_excel() with sheet_name parameter.

JSON Data

JavaScript Object Notation handles nested, hierarchical data. Standard for APIs. Use pd.read_json() or pd.json_normalize() for nested structures.

The Read-Analyze-Write Workflow

Every data analysis project follows a similar pattern: read data from a source, perform analysis and transformations, then write results to a destination. Pandas makes this workflow seamless. You read a CSV file into a DataFrame, clean and analyze it, then export the results to Excel for stakeholders or save to a database for production use. The consistency of the DataFrame interface means your analysis code works the same regardless of where the data came from.

# The typical data workflow
import pandas as pd

# 1. READ: Load data from any source
df = pd.read_csv('sales_data.csv')

# 2. ANALYZE: Transform and analyze
df['revenue'] = df['quantity'] * df['price']
monthly_sales = df.groupby('month')['revenue'].sum()

# 3. WRITE: Export results
monthly_sales.to_excel('monthly_report.xlsx')
print("Workflow complete!")  # Workflow complete!

This simple pattern scales from quick analyses to production data pipelines. The key insight is that once data is in a DataFrame, all Pandas operations work identically. Whether you loaded from CSV, Excel, JSON, or SQL, the analysis code remains the same. This abstraction is what makes Pandas so powerful for data science workflows.

Pro Tip: Always inspect your data immediately after loading with df.head(), df.info(), and df.dtypes. This catches parsing issues early before they cause problems in your analysis.
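For example, a quick post-load check on the sales_data.csv file from the workflow above might look like this:

import pandas as pd

df = pd.read_csv('sales_data.csv')

# Three quick checks before any analysis
print(df.head())     # do the first rows look right?
df.info()            # row count, non-null counts, and dtypes per column
print(df.dtypes)     # spot columns that parsed as the wrong type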

Practice Questions: Data Loading Basics

Test your data loading skills with these coding tasks.

Task: Load a CSV file called 'sales_data.csv' and display the first 3 rows.

Expected output (first 3 rows of the DataFrame):

     product  quantity  price
0     Laptop         5  45000
1      Mouse        25    500
2   Keyboard        15   1200
Show Solution
import pandas as pd

df = pd.read_csv('sales_data.csv')
print(df.head(3))

Given:

df = pd.read_csv('customers.csv')

Task: Write code to display the shape, column names, and data types of the DataFrame.

Show Solution
import pandas as pd

df = pd.read_csv('customers.csv')

print("Shape:", df.shape)
print("\nColumns:", df.columns.tolist())
print("\nData Types:")
print(df.dtypes)

Task: Load 'sales_data.csv', compute the revenue for each row (quantity × price), then display the total revenue and which product had the highest revenue.

Show Solution
import pandas as pd

df = pd.read_csv('sales_data.csv')
df['revenue'] = df['quantity'] * df['price']  # revenue is not in the file; derive it first
total = df['revenue'].sum()
print(f"Total Revenue: ₹{total:,}")
top_product = df.loc[df['revenue'].idxmax(), 'product']
print(f"Top Product: {top_product}")
02

Reading CSV Files

CSV (Comma-Separated Values) is the most common format for tabular data. Almost every data source can export to CSV, making it the universal exchange format for data. The read_csv() function is Pandas' workhorse for data loading, offering dozens of parameters to handle virtually any CSV variation you might encounter. Understanding these parameters will save you countless hours of data cleaning downstream.

Basic CSV Loading

The simplest way to load a CSV file is with just the filename. Pandas automatically detects the delimiter (usually a comma), reads the first row as column headers, and infers data types for each column. This works perfectly for well-formatted files, but real-world data often requires additional parameters. Always inspect the result with head() and info() to verify the data loaded correctly before proceeding with analysis.

Practice File: customers.csv - Download this file to follow along
import pandas as pd

# Basic CSV loading
df = pd.read_csv('customers.csv')

print("First 5 rows:")
print(df.head())
#    customer_id            name            email     city  zip_code signup_date  age
# 0          101   Alice Johnson  alice@email.com   Boston      2101  2023-01-15   28
# 1          102       Bob Smith    bob@email.com  Seattle     98101  2023-02-20   35
# 2          103  Carol Williams  carol@email.com  Chicago     60601  2023-03-10   42
# ...

print("\nDataFrame info:")
print(f"Shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
# Shape: (8, 7)
# Columns: ['customer_id', 'name', 'email', 'city', 'zip_code', 'signup_date', 'age']

Notice that Pandas identified all the columns and assigned sensible data types. The customer_id, zip_code, and age columns are integers, while name, email, city, and signup_date are objects (strings). However, look closely: signup_date should be a datetime type, not a string, and zip_code has silently lost its leading zero because it was parsed as an integer. These are common issues; we will fix the dates below with the parse_dates parameter and the ZIP codes later with dtype.

Essential read_csv() Parameters

The read_csv() function accepts over 50 parameters, but you will use a handful regularly. The most important ones control the delimiter, header row, column selection, data types, and date parsing. Learning these parameters transforms you from someone who can only read perfect files to someone who can handle any data thrown at them. Let us explore the most useful ones with practical examples.

Key Parameters

read_csv() Options

sep - Delimiter character (default: ',')
header - Row number for column names (default: 0)
usecols - Columns to read (list of names or indices)
dtype - Data types for columns
parse_dates - Columns to parse as dates
na_values - Values to treat as missing
encoding - File encoding (default: 'utf-8')

Pro tip: When loading large files, use usecols to load only needed columns and nrows to preview the first N rows. This dramatically speeds up initial exploration.

# Load with specific options
df = pd.read_csv('customers.csv',
                 usecols=['customer_id', 'name', 'city', 'signup_date', 'age'],
                 parse_dates=['signup_date'])

print(df.dtypes)
# customer_id             int64
# name                   object
# city                   object
# signup_date    datetime64[ns]
# age                     int64

print(f"\nSignup date type: {type(df['signup_date'].iloc[0])}")
# Signup date type: <class 'pandas._libs.tslibs.timestamps.Timestamp'>

Now signup_date is a proper datetime type, which enables powerful date operations. You can extract the year, month, or day, calculate time differences, filter by date ranges, and more. This is why parsing dates during loading is so important. Doing it later requires extra code and is more error-prone.
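As a brief illustration, reloading customers.csv with parse_dates (the derived column names here are chosen just for the example):

import pandas as pd

df = pd.read_csv('customers.csv', parse_dates=['signup_date'])

# Date parts come straight off the .dt accessor
df['signup_year'] = df['signup_date'].dt.year

# Filter by a date range
march_onwards = df[df['signup_date'] >= '2023-03-01']

# Elapsed time since signup
df['days_since_signup'] = (pd.Timestamp.now() - df['signup_date']).dt.days
print(df[['name', 'signup_year', 'days_since_signup']].head())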

Handling Different Delimiters

Not all "CSV" files use commas. Some use tabs, semicolons, pipes, or other characters as delimiters. European systems often use semicolons because commas are used as decimal separators in those locales. Tab-separated values (TSV) files are common for scientific data. The sep parameter handles all these variations. If you are unsure about the delimiter, open the file in a text editor and look at the raw data.

# Tab-separated file
df_tab = pd.read_csv('data.tsv', sep='\t')

# Semicolon-separated (common in European exports)
df_semi = pd.read_csv('data_euro.csv', sep=';')

# Pipe-separated
df_pipe = pd.read_csv('data.txt', sep='|')

# Auto-detect separator (Python engine required)
df_auto = pd.read_csv('unknown.csv', sep=None, engine='python')
print("Auto-detected format loaded successfully!")
Common Gotcha: If your CSV has commas inside quoted fields (like addresses: "123 Main St, Suite 100"), Pandas handles this correctly by default. However, if the file uses a different quote character, specify it with quotechar.
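A small sketch of the quotechar option; the file names and the single-quote convention here are hypothetical:

import pandas as pd

# Default behavior: double quotes protect embedded commas
# e.g. "123 Main St, Suite 100" stays in one field
df = pd.read_csv('addresses.csv')

# If a file wraps fields in single quotes instead, say so explicitly
df_alt = pd.read_csv('addresses_single_quoted.csv', quotechar="'")
print(df_alt.head())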

Handling Missing Values

Real-world data has missing values represented in various ways: empty cells, "N/A", "NULL", "-", "missing", or even specific codes like -999. By default, Pandas recognizes empty strings and "NA" as missing. Use the na_values parameter to add custom missing value markers. This ensures missing data is properly represented as NaN (Not a Number) in your DataFrame, allowing you to use Pandas' missing value methods like isna(), fillna(), and dropna().

# Specify custom missing value markers
df = pd.read_csv('data_with_missing.csv',
                 na_values=['N/A', 'NULL', '-', 'missing', '', 'n/a'])

# Check for missing values
print("Missing values per column:")
print(df.isna().sum())

# Fill missing values if needed
df_filled = df.fillna({'age': df['age'].median(),
                       'city': 'Unknown'})
print("\nAfter filling:")
print(df_filled.isna().sum())

Specifying Data Types

Pandas infers data types automatically, but sometimes it gets it wrong. ZIP codes that start with zeros become integers (02101 becomes 2101). IDs that should be strings become integers. Phone numbers lose their formatting. The dtype parameter lets you explicitly set column types during loading, preventing these issues. This is especially important for columns that should remain as strings even though they contain numbers.

# Force specific data types
df = pd.read_csv('customers.csv',
                 dtype={'customer_id': str,
                        'zip_code': str,
                        'age': 'Int64'})  # Int64 handles nullable integers

print(df.dtypes)
# customer_id    object
# name           object
# email          object
# city           object
# zip_code       object
# signup_date    object
# age             Int64

print(f"\nZIP code preserved: {df['zip_code'].iloc[0]}")
# ZIP code preserved: 02101

Notice we used 'Int64' (capital I) instead of int. The capital-I version is Pandas' nullable integer type, which can hold missing values. Regular int64 columns cannot store NaN, so a column with missing integers would be silently converted to floats. Using 'Int64' keeps it as integers.

Memory Optimization: For large datasets, specify smaller types like 'int32' instead of 'int64', or 'float32' instead of 'float64'. This can cut memory usage in half with minimal precision loss.
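A rough sketch of the idea using sales_data.csv from earlier; the exact savings depend on your data:

import pandas as pd

# Default load: numeric columns become 64-bit types
df_default = pd.read_csv('sales_data.csv')

# Downcast at load time with explicit smaller dtypes
df_small = pd.read_csv('sales_data.csv',
                       dtype={'quantity': 'int32', 'price': 'float32'})

print(df_default.memory_usage(deep=True).sum(), "bytes (default)")
print(df_small.memory_usage(deep=True).sum(), "bytes (downcast)")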

Practice Questions: CSV Loading

Test your CSV loading skills with these coding tasks.

Task: Load only the 'name', 'city', and 'age' columns from customers.csv.

Expected columns: ['name', 'city', 'age']

Show Solution
import pandas as pd

df = pd.read_csv('customers.csv', usecols=['name', 'city', 'age'])
print(df.columns.tolist())  # ['name', 'city', 'age']
print(df.head())

Task: Load customers.csv with signup_date parsed as a datetime. Then extract the year from the first signup date.

Show Solution
import pandas as pd

df = pd.read_csv('customers.csv', parse_dates=['signup_date'])
first_year = df['signup_date'].iloc[0].year
print(f"First signup year: {first_year}")  # 2023

Task: Load customers.csv keeping zip_code as a string (to preserve leading zeros), and verify that the Boston customer's ZIP code is '02101' (not 2101).

Show Solution
import pandas as pd

df = pd.read_csv('customers.csv', dtype={'zip_code': str})
boston_zip = df[df['city'] == 'Boston']['zip_code'].iloc[0]
print(f"Boston ZIP: {boston_zip}")  # 02101
print(f"Has leading zero: {boston_zip.startswith('0')}")  # True
03

Loading Excel Files

Excel files are everywhere in business environments. Financial reports, inventory tracking, customer lists, and project plans often live in spreadsheets. Unlike CSV files, Excel workbooks can contain multiple sheets, formulas, formatting, and charts. The read_excel() function handles all of this, extracting the data you need while ignoring the visual formatting. Understanding Excel loading is essential for any data scientist working with business stakeholders.

Required Dependencies

Before reading Excel files, you need an engine library installed. For modern Excel files (.xlsx), the most common engine is openpyxl. For older Excel files (.xls), you need xlrd. Pandas does not include these by default to keep the installation lightweight. Install them once and they work automatically whenever you call read_excel().

# Install the Excel engine (run once in terminal)
# pip install openpyxl

# For older .xls files
# pip install xlrd

import pandas as pd

# Now you can read Excel files
df = pd.read_excel('sales_report.xlsx')
print(df.head())
Note: If you see ImportError: Missing optional dependency 'openpyxl', just run pip install openpyxl in your terminal and try again.

Basic Excel Loading

Reading an Excel file is similar to reading a CSV. By default, Pandas reads the first sheet and uses the first row as column headers. Excel files often have better data type preservation than CSV files because Excel stores type information internally. Dates, numbers, and text are usually correctly typed without additional parsing. However, you should still verify with dtypes to ensure everything loaded as expected.

import pandas as pd

# Read the first sheet (default behavior)
df = pd.read_excel('quarterly_sales.xlsx')

print("Sheet data:")
print(df.head())
print(f"\nShape: {df.shape}")
print(f"\nData types:\n{df.dtypes}")

Working with Multiple Sheets

Excel workbooks often contain multiple sheets. A financial report might have sheets for Q1, Q2, Q3, and Q4. A CRM export might separate customers by region. The sheet_name parameter lets you specify which sheet to read. You can use the sheet name as a string, the sheet index (starting from 0), or even load multiple sheets at once into a dictionary of DataFrames.

# Read a specific sheet by name
df_q1 = pd.read_excel('quarterly_sales.xlsx', sheet_name='Q1')

# Read by index (0 = first sheet, 1 = second, etc.)
df_second = pd.read_excel('quarterly_sales.xlsx', sheet_name=1)

# Read ALL sheets into a dictionary
all_sheets = pd.read_excel('quarterly_sales.xlsx', sheet_name=None)
print(f"Sheet names: {list(all_sheets.keys())}")
# Sheet names: ['Q1', 'Q2', 'Q3', 'Q4']

# Access individual sheets from the dictionary
df_q3 = all_sheets['Q3']
print(f"Q3 data shape: {df_q3.shape}")

Loading all sheets with sheet_name=None returns a dictionary where keys are sheet names and values are DataFrames. This is useful when you need to process multiple sheets or combine them into one DataFrame. You can iterate through the dictionary or access sheets by name.

Key Concept

Combining Multiple Sheets

When sheets have the same structure (same columns), you can combine them with pd.concat(). This is common for time-series data split across sheets like monthly or quarterly reports. Add a column to track which sheet each row came from.

Example: combined = pd.concat(all_sheets.values(), keys=all_sheets.keys())

# Combine all quarterly sheets into one DataFrame
all_sheets = pd.read_excel('quarterly_sales.xlsx', sheet_name=None)

# Method 1: Simple concat
combined = pd.concat(all_sheets.values(), ignore_index=True)

# Method 2: Add quarter column to track source
dfs = []
for quarter, df in all_sheets.items():
    df['quarter'] = quarter
    dfs.append(df)
combined = pd.concat(dfs, ignore_index=True)

print(f"Combined shape: {combined.shape}")
print(combined['quarter'].value_counts())

Skipping Rows and Headers

Business spreadsheets often have title rows, blank rows, or metadata before the actual data starts. The skiprows parameter lets you skip these rows. Similarly, header specifies which row contains the column names. If your data has no header row, set header=None and Pandas will assign numeric column names (0, 1, 2, etc.). You can supply your own column names with the names parameter.

# Skip the first 3 rows (title, blank, description)
df = pd.read_excel('report.xlsx', skiprows=3)

# Header is in row 5 (0-indexed), skip rows 0-4
df = pd.read_excel('report.xlsx', header=5)

# No header row, provide custom column names
df = pd.read_excel('raw_data.xlsx', 
                   header=None,
                   names=['date', 'product', 'quantity', 'revenue'])

print(df.head())

Reading Specific Cell Ranges

Sometimes you only need a portion of a spreadsheet. The usecols parameter selects specific columns, and nrows limits the number of rows. For Excel files, you can also use Excel-style column letters like "A:C" or "B,D,F". This is particularly useful for large spreadsheets where loading everything would be slow and memory-intensive.

# Read only columns A through D
df = pd.read_excel('large_report.xlsx', usecols='A:D')

# Read specific columns by letter
df = pd.read_excel('large_report.xlsx', usecols='A,C,E,G')

# Read specific columns by name
df = pd.read_excel('sales.xlsx', usecols=['Date', 'Product', 'Revenue'])

# Read only first 100 rows (great for previewing)
df = pd.read_excel('huge_file.xlsx', nrows=100)

print(f"Preview shape: {df.shape}")
Performance Tip: Excel files are slower to read than CSV files because the format is more complex. For very large datasets, consider exporting to CSV first, or use nrows to preview before loading the full file.
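One way to apply this tip, assuming the huge_file.xlsx example above and a cache file name chosen here just for illustration:

import pandas as pd

# One-time conversion: read the slow Excel file, cache it as CSV
df = pd.read_excel('huge_file.xlsx')
df.to_csv('huge_file_cache.csv', index=False)

# Later runs read the much faster CSV cache instead
df = pd.read_csv('huge_file_cache.csv')
print(f"Loaded {len(df)} rows from the cached CSV")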

Practice Questions: Excel Files

Test your Excel loading skills with these coding tasks.

Task: Read the sheet named 'Q2' from a file called 'quarterly_sales.xlsx' and display its shape.

Show Solution
import pandas as pd

df = pd.read_excel('quarterly_sales.xlsx', sheet_name='Q2')
print(f"Shape: {df.shape}")

Task: Load all sheets from 'quarterly_sales.xlsx' and print the names of all sheets in the workbook.

Show Solution
import pandas as pd

all_sheets = pd.read_excel('quarterly_sales.xlsx', sheet_name=None)
print(f"Sheet names: {list(all_sheets.keys())}")

Task: Load all sheets from 'quarterly_sales.xlsx', add a 'source_sheet' column to each DataFrame, and combine them into one DataFrame.

Show Solution
import pandas as pd

all_sheets = pd.read_excel('quarterly_sales.xlsx', sheet_name=None)

dfs = []
for sheet_name, df in all_sheets.items():
    df['source_sheet'] = sheet_name
    dfs.append(df)

combined = pd.concat(dfs, ignore_index=True)
print(f"Combined shape: {combined.shape}")
print(combined['source_sheet'].value_counts())
04

Working with JSON Data

JSON (JavaScript Object Notation) has become the standard format for web APIs and modern data exchange. When you fetch data from a REST API, download datasets from data portals, or work with NoSQL databases, you are likely dealing with JSON. Unlike CSV's flat tabular structure, JSON can represent nested, hierarchical data. Pandas provides both read_json() for simple structures and json_normalize() for complex nested data.

Basic JSON Loading

For simple JSON files containing an array of objects (records), read_json() works seamlessly. Each object becomes a row, and each key becomes a column. This pattern is common in API responses and data exports. The function automatically detects the JSON structure and creates an appropriate DataFrame. Always verify the result because JSON can have subtle structural variations.

Practice File: customers.json - Download this file to follow along
import pandas as pd

# Load simple JSON array
df = pd.read_json('customers.json')

print("Loaded JSON data:")
print(df.head())
# customer_id   name  age             email
# 0         101  Alice   28  alice@email.com
# 1         102    Bob   35    bob@email.com
# 2         103  Carol   42  carol@email.com

print(f"\nShape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")

The JSON file contained an array of customer objects, and Pandas converted it directly to a DataFrame. Each JSON object became a row, and the keys (customer_id, name, age, email) became columns. This is the ideal case where JSON structure maps cleanly to tabular data. But not all JSON is this simple.

Understanding JSON Orientation

JSON data can be structured in different ways, called "orientations". The default is "columns" where each key is a column name mapping to a list of values. But you might also encounter "records" (list of row objects), "index" (dict of index to row), or "split" (separate lists for columns, data, and index). The orient parameter tells Pandas how to interpret the structure.

Records Orientation

[
  {"name": "Alice", "age": 28},
  {"name": "Bob", "age": 35}
]

Columns Orientation

{
  "name": ["Alice", "Bob"],
  "age": [28, 35]
}
# Records orientation (list of objects) - most common from APIs
df = pd.read_json('data.json', orient='records')

# Columns orientation (dict of column arrays)
df = pd.read_json('data.json', orient='columns')

# Index orientation (dict of index to row dict)
df = pd.read_json('data.json', orient='index')

# Split orientation (separate columns, data, index)
df = pd.read_json('data.json', orient='split')

print(df.head())

Handling Nested JSON with json_normalize()

Real-world JSON from APIs often contains nested structures. A customer record might have an embedded address object with street, city, and zip fields. A product might have a nested metadata object with dimensions and weight. The read_json() function cannot flatten these nested structures automatically. For this, you need pd.json_normalize(), which flattens nested dictionaries into a flat table with dot-notation column names.

Practice File: customers_nested.json - Download this file to follow along
import pandas as pd
import json

# First, let's see what nested JSON looks like
nested_data = [
    {
        "id": 1,
        "name": "Alice",
        "address": {
            "street": "123 Main St",
            "city": "Boston",
            "zip": "02101"
        }
    },
    {
        "id": 2,
        "name": "Bob",
        "address": {
            "street": "456 Oak Ave",
            "city": "Seattle",
            "zip": "98101"
        }
    }
]

# json_normalize flattens nested structures
df = pd.json_normalize(nested_data)

print(df.columns.tolist())
# ['id', 'name', 'address.street', 'address.city', 'address.zip']

print(df)
#    id   name  address.street address.city address.zip
# 0   1  Alice    123 Main St       Boston       02101
# 1   2    Bob     456 Oak Ave      Seattle       98101

Notice how the nested address object was flattened into columns with dot notation: address.street, address.city, address.zip. You can customize the separator with the sep parameter if you prefer underscores or another character. This flattening is essential for analysis because Pandas works best with flat tabular data.

Key Concept

json_normalize() Parameters

record_path - Path to the list of records to normalize
meta - Fields to include from parent records
sep - Separator for nested column names (default: '.')
max_level - Maximum nesting depth to flatten

Common use: API responses often have nested lists. Use record_path to specify which nested list to expand, and meta to keep parent-level fields.
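A small sketch of record_path and meta with invented sample data, where each customer carries a nested list of orders:

import pandas as pd

api_data = [
    {"id": 1, "name": "Alice",
     "orders": [{"order_id": "A1", "amount": 250},
                {"order_id": "A2", "amount": 120}]},
    {"id": 2, "name": "Bob",
     "orders": [{"order_id": "B1", "amount": 90}]}
]

# record_path expands the nested 'orders' list into rows;
# meta keeps the parent-level fields alongside each order
orders = pd.json_normalize(api_data, record_path='orders', meta=['id', 'name'])
print(orders)
#   order_id  amount  id   name
# 0       A1     250   1  Alice
# 1       A2     120   1  Alice
# 2       B1      90   2    Bob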

Loading JSON from Web APIs

You can load JSON directly from a URL, which is perfect for working with REST APIs. Pandas' read_json() accepts URLs just like file paths. For more complex API interactions (authentication, headers, pagination), use the requests library to fetch the data first, then normalize it with Pandas. This gives you full control over the HTTP request while leveraging Pandas for data parsing.

import pandas as pd
import requests

# Method 1: Direct URL (for simple public APIs)
url = 'https://api.example.com/data.json'
df = pd.read_json(url)

# Method 2: Using requests (for complex APIs)
response = requests.get('https://api.example.com/users',
                       headers={'Authorization': 'Bearer token123'})
data = response.json()

# Normalize the response
df = pd.json_normalize(data['users'])
print(f"Loaded {len(df)} users from API")
API Best Practice: When working with paginated APIs, use a loop to fetch all pages, collect the results in a list, then concatenate them once into a single DataFrame, as sketched below. This is much faster than growing a DataFrame with repeated concat calls inside the loop.
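A hedged sketch of that pattern; the endpoint and the 'results'/'next' response keys are assumptions about a hypothetical API:

import pandas as pd
import requests

url = 'https://api.example.com/users?page=1'
pages = []

while url:
    response = requests.get(url, headers={'Authorization': 'Bearer token123'})
    payload = response.json()
    pages.append(pd.json_normalize(payload['results']))
    url = payload.get('next')   # None when there are no more pages

df = pd.concat(pages, ignore_index=True)
print(f"Fetched {len(df)} users across {len(pages)} pages")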

Practice Questions: JSON Data

Test your JSON loading skills with these coding tasks.

Task: Load 'customers.json' and display the column names and number of rows.

Show Solution
import pandas as pd

df = pd.read_json('customers.json')
print(f"Columns: {df.columns.tolist()}")
print(f"Rows: {len(df)}")

Given: A list of nested dictionaries:

data = [
    {"name": "Alice", "location": {"city": "NYC", "country": "USA"}},
    {"name": "Bob", "location": {"city": "London", "country": "UK"}}
]

Task: Use json_normalize to flatten this into a DataFrame with columns: name, location.city, location.country

Show Solution
import pandas as pd

data = [
    {"name": "Alice", "location": {"city": "NYC", "country": "USA"}},
    {"name": "Bob", "location": {"city": "London", "country": "UK"}}
]

df = pd.json_normalize(data)
print(df.columns.tolist())
# ['name', 'location.city', 'location.country']
print(df)

Given: The same nested data as Q2.

Task: Flatten the data but use underscore (_) as the separator instead of dot, so columns become: name, location_city, location_country

Show Solution
import pandas as pd

data = [
    {"name": "Alice", "location": {"city": "NYC", "country": "USA"}},
    {"name": "Bob", "location": {"city": "London", "country": "UK"}}
]

df = pd.json_normalize(data, sep='_')
print(df.columns.tolist())
# ['name', 'location_city', 'location_country']
05

Connecting to SQL Databases

Production data often lives in relational databases like PostgreSQL, MySQL, SQLite, or SQL Server. These databases handle transactions, enforce data integrity, and support concurrent access from multiple applications. Pandas can connect directly to these databases, execute SQL queries, and load results into DataFrames. This bridges the gap between your database systems and Python analytics, letting you work with live production data.

Database Connection Basics

To connect to a database, you need two things: a database driver and a connection string. The driver is a Python library that knows how to communicate with your specific database type. The connection string contains the database location, credentials, and other settings. Pandas works with SQLAlchemy under the hood, which provides a unified interface for different database types. Once connected, you can run SQL queries and receive results as DataFrames.

# Install database drivers (run once in terminal)
# pip install sqlalchemy
# pip install psycopg2-binary  # PostgreSQL
# pip install pymysql          # MySQL
# pip install pyodbc           # SQL Server

import pandas as pd
from sqlalchemy import create_engine

# Create a database connection (SQLite example - no server needed)
engine = create_engine('sqlite:///company.db')

# Read entire table
df = pd.read_sql_table('customers', engine)
print(f"Loaded {len(df)} customers from database")
Key Concept

Connection Strings

Connection strings follow the pattern: dialect+driver://user:password@host:port/database

SQLite: sqlite:///path/to/file.db
PostgreSQL: postgresql://user:pass@localhost:5432/dbname
MySQL: mysql+pymysql://user:pass@localhost:3306/dbname
SQL Server: mssql+pyodbc://user:pass@server/db?driver=ODBC+Driver+17+for+SQL+Server

Security tip: Never hardcode passwords in your code. Use environment variables or configuration files that are not committed to version control.
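One way to follow this advice, assuming a PostgreSQL database and environment variables named DB_USER, DB_PASSWORD, DB_HOST, and DB_NAME (requires psycopg2-binary):

import os
import pandas as pd
from sqlalchemy import create_engine

# Credentials come from the environment, not from the source code
user = os.environ['DB_USER']
password = os.environ['DB_PASSWORD']
host = os.environ.get('DB_HOST', 'localhost')
dbname = os.environ['DB_NAME']

engine = create_engine(f'postgresql://{user}:{password}@{host}:5432/{dbname}')
df = pd.read_sql_query('SELECT * FROM customers LIMIT 10', engine)
print(df.head())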

Running SQL Queries

While read_sql_table() loads an entire table, read_sql_query() lets you run arbitrary SQL queries. This is more powerful because you can filter, join, aggregate, and transform data in the database before loading it into Python. For large datasets, doing aggregations in the database is much faster than loading all rows and aggregating in Pandas. The database is optimized for these operations and can use indexes efficiently.

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite:///company.db')

# Run a custom SQL query
query = """
    SELECT 
        c.name,
        c.city,
        COUNT(o.order_id) as order_count,
        SUM(o.amount) as total_spent
    FROM customers c
    LEFT JOIN orders o ON c.customer_id = o.customer_id
    GROUP BY c.customer_id
    HAVING total_spent > 1000
    ORDER BY total_spent DESC
"""

df = pd.read_sql_query(query, engine)
print(df.head())
#        name     city  order_count  total_spent
# 0     Alice   Boston           15        12500
# 1       Bob  Seattle           12        10200
# 2     Carol  Chicago            8         8750

This query demonstrates the power of combining SQL and Pandas. The database handles the join, aggregation, filtering (HAVING), and sorting before any data reaches Python. Only the final summary rows are transferred. This is dramatically faster than loading all customers and orders separately and joining them in Pandas.

Parameterized Queries

When your query needs to include user input or variable values, never use string concatenation. This opens your code to SQL injection attacks. Instead, use parameterized queries where Pandas safely substitutes values into placeholders. This protects your database and ensures values are properly escaped.

# WRONG - Never do this (SQL injection risk!)
city = "Boston"
query = f"SELECT * FROM customers WHERE city = '{city}'"  # Dangerous!

# CORRECT - Use parameterized queries
from sqlalchemy import text

city = "Boston"
min_age = 25

query = text("SELECT * FROM customers WHERE city = :city AND age >= :min_age")
df = pd.read_sql_query(query, engine, params={"city": city, "min_age": min_age})

print(f"Found {len(df)} customers in {city} aged {min_age}+")
Security Warning: SQL injection is one of the most common security vulnerabilities. Always use parameterized queries when incorporating variable values into SQL statements. Never trust user input.

Writing Data to Databases

Pandas can also write DataFrames back to databases using to_sql(). This is useful for storing analysis results, creating derived tables, or migrating data between systems. The if_exists parameter controls what happens if the table already exists: 'fail' (raise error), 'replace' (drop and recreate), or 'append' (add rows to existing table).

import pandas as pd
from sqlalchemy import create_engine

# Create sample DataFrame
results = pd.DataFrame({
    'customer_id': [101, 102, 103],
    'segment': ['Gold', 'Silver', 'Gold'],
    'predicted_churn': [0.12, 0.45, 0.08]
})

engine = create_engine('sqlite:///company.db')

# Write to database
results.to_sql('customer_segments', 
               engine, 
               if_exists='replace',  # Options: 'fail', 'replace', 'append'
               index=False)          # Don't save DataFrame index as column

print("Data written to database successfully!")

# Verify by reading it back
df_verify = pd.read_sql_table('customer_segments', engine)
print(df_verify)
Performance Tip: For large DataFrames, use chunksize parameter to write in batches. Also consider method='multi' for databases that support multi-row INSERT statements.
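A minimal sketch of batched writes, reusing the company.db engine; the table name and sizes here are only for illustration:

import numpy as np
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite:///company.db')

# A larger DataFrame to make batching worthwhile
big = pd.DataFrame({'customer_id': np.arange(50_000),
                    'score': np.random.rand(50_000)})

big.to_sql('scores', engine,
           if_exists='replace',
           index=False,
           chunksize=400,       # write 400 rows per batch
           method='multi')      # multi-row INSERTs where the backend supports them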

Practice Questions: SQL Databases

Test your database connection skills with these coding tasks.

Task: Create a SQLAlchemy engine for a SQLite database file called 'mydata.db' and read all rows from a table named 'products'.

Show Solution
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite:///mydata.db')
df = pd.read_sql_table('products', engine)
print(df.head())

Task: Write a SQL query to get the average price per category from a 'products' table, then load results into a DataFrame.

Show Solution
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite:///mydata.db')

query = """
    SELECT category, AVG(price) as avg_price
    FROM products
    GROUP BY category
    ORDER BY avg_price DESC
"""

df = pd.read_sql_query(query, engine)
print(df)

Task: Create a DataFrame with columns 'name' and 'score', then write it to a new table called 'leaderboard' in a SQLite database. If the table exists, replace it.

Show Solution
import pandas as pd
from sqlalchemy import create_engine

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Carol'],
    'score': [95, 87, 92]
})

engine = create_engine('sqlite:///mydata.db')
df.to_sql('leaderboard', engine, if_exists='replace', index=False)

# Verify
result = pd.read_sql_table('leaderboard', engine)
print(result)
06

Web Scraping with read_html()

The web is full of data trapped in HTML tables. Wikipedia articles, government statistics, sports scores, financial data, and countless other datasets are displayed as tables on web pages. Manually copying this data would be tedious and error-prone. The read_html() function automatically extracts all tables from a web page and converts them into DataFrames. This is one of Pandas' most magical features for quick data gathering.

Basic Table Extraction

The read_html() function accepts a URL or HTML string and returns a list of DataFrames, one for each table found on the page. This is remarkably simple compared to traditional web scraping with BeautifulSoup or Scrapy. For many use cases, especially grabbing data from Wikipedia or government sites, this one function is all you need. The function intelligently parses table headers, handles merged cells, and cleans up whitespace.

# Install lxml for faster parsing (optional but recommended)
# pip install lxml html5lib

import pandas as pd

# Scrape all tables from a Wikipedia page
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)'
tables = pd.read_html(url)

print(f"Found {len(tables)} tables on the page")
# Found 4 tables on the page

# Each table is a DataFrame - inspect the first one
df = tables[0]
print(f"\nFirst table shape: {df.shape}")
print(df.head())

The function returns a list because web pages often contain multiple tables. You need to inspect each one to find the data you want. Use len(tables) to see how many were found, then examine each with tables[0].head(), tables[1].head(), etc. Once you identify the right table, assign it to a variable and continue your analysis.

Filtering and Matching Tables

When a page has many tables, searching through them manually is inefficient. The match parameter lets you filter tables by searching for a regex pattern in the table content. Only tables containing matching text are returned. This is useful when you know a keyword that appears in your target table but not in navigation tables or other irrelevant content.

import pandas as pd

# Only get tables that contain "GDP" or "Population"
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)'
tables = pd.read_html(url, match='GDP')

print(f"Tables matching 'GDP': {len(tables)}")

# Get the first matching table
df = tables[0]
print(df.head())
Key Parameters

read_html() Options

match - Regex pattern to filter tables
header - Row number to use as header (or list for multi-level)
index_col - Column to use as row index
attrs - Dict of HTML attributes to match (id, class, etc.)
encoding - Character encoding of the page

Advanced tip: Use attrs={'id': 'specific-table-id'} to target a specific table by its HTML id attribute, which is more precise than text matching.
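For example (the URL and the id value are placeholders; inspect the page source to find the real id):

import pandas as pd

# Target one specific table by its HTML id attribute
tables = pd.read_html('https://example.com/stats',
                      attrs={'id': 'results-table'})
df = tables[0]
print(df.head())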

Handling Common Issues

Web scraping is not always smooth. Tables might have merged header cells, footnotes mixed with data, or encoding issues. The header parameter can handle multi-level headers by passing a list of row numbers. Some sites block automated access, requiring you to use the requests library with custom headers to fetch the HTML first.

import pandas as pd
import requests

# Some sites block Pandas' default user agent
# Solution: Fetch with requests first
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}
response = requests.get('https://example.com/data', headers=headers)

# Parse the HTML content
tables = pd.read_html(response.text)

# Handle multi-level headers (the header spans 2 rows)
df = pd.read_html(response.text, header=[0, 1])[0]
print(df.columns)  # MultiIndex columns
Legal Note: Always check a website's robots.txt and terms of service before scraping. Some sites prohibit automated access. For large-scale scraping, consider rate limiting to avoid overwhelming servers.

Scraping Local HTML Files

You can also use read_html() on local HTML files or HTML strings. This is useful when you have downloaded HTML files, received HTML in an email, or want to test your parsing logic offline before hitting live websites. The function accepts file paths and raw HTML strings identically to URLs.

import pandas as pd

# Read from a local HTML file
tables = pd.read_html('report.html')

# Parse an HTML string directly
html_string = """
<table>
  <tr><th>Name</th><th>Score</th></tr>
  <tr><td>Alice</td><td>95</td></tr>
  <tr><td>Bob</td><td>87</td></tr>
  <tr><td>Carol</td><td>92</td></tr>
</table>
"""

tables = pd.read_html(html_string)
df = tables[0]
print(df)
#     Name  Score
# 0  Alice     95
# 1    Bob     87
# 2  Carol     92
Pro Tip: When scraping fails, save the webpage locally and examine the HTML structure. Look for the table's id or class attributes, which you can use with the attrs parameter for precise targeting.

Practice Questions: Web Scraping

Test your web scraping skills with these coding tasks.

Given:

html = """
<table>
  <tr><th>Product</th><th>Price</th></tr>
  <tr><td>Laptop</td><td>999</td></tr>
  <tr><td>Mouse</td><td>29</td></tr>
</table>
"""

Task: Parse this HTML string and print the resulting DataFrame.

Show Solution
import pandas as pd

html = """
<table>
  <tr><th>Product</th><th>Price</th></tr>
  <tr><td>Laptop</td><td>999</td></tr>
  <tr><td>Mouse</td><td>29</td></tr>
</table>
"""

tables = pd.read_html(html)
df = tables[0]
print(df)
#   Product  Price
# 0  Laptop    999
# 1   Mouse     29

Task: Load all tables from a Wikipedia URL and print how many tables were found, along with the shape of each table.

Show Solution
import pandas as pd

url = 'https://en.wikipedia.org/wiki/Python_(programming_language)'
tables = pd.read_html(url)

print(f"Found {len(tables)} tables\n")
for i, table in enumerate(tables):
    print(f"Table {i}: shape {table.shape}")

Task: Scrape a Wikipedia page and use the match parameter to only get tables containing the word "Release". Print the number of matching tables.

Show Solution
import pandas as pd

url = 'https://en.wikipedia.org/wiki/Python_(programming_language)'
tables = pd.read_html(url, match='Release')

print(f"Tables containing 'Release': {len(tables)}")
if len(tables) > 0:
    print(tables[0].head())

Key Takeaways

CSV is King

read_csv() handles delimiters, encodings, missing values, and data types. Master its parameters like sep, na_values, and dtype for robust data loading.

Excel Made Easy

Use sheet_name to load specific sheets, sheet_name=None for all sheets as a dict, and pd.concat() to combine them into one DataFrame.

JSON Normalization

Simple JSON loads directly. For nested structures, use json_normalize() with record_path and meta to flatten complex hierarchies.

SQL Connections

Create engines with SQLAlchemy, use read_sql() for queries, and to_sql() to write back. Use parameterized queries for security and chunksize for large datasets.

Web Table Magic

read_html() automatically extracts all tables from web pages. Use match to filter tables and attrs to target specific elements.

Memory Efficiency

Specify dtype to reduce memory, use usecols to load only needed columns, and leverage chunksize for processing files larger than RAM.

Knowledge Check

Test your understanding of data loading concepts:

Question 1 of 6

Which parameter changes the column separator in read_csv()?

Question 2 of 6

How do you load all sheets from an Excel file as a dictionary?

Question 3 of 6

Which function flattens nested JSON data into a DataFrame?

Question 4 of 6

What does the if_exists='append' parameter do in to_sql()?

Question 5 of 6

What does pd.read_html() return?

Question 6 of 6

Which parameter specifies custom values to treat as missing in read_csv()?
