Introduction to Data Loading
Real-world data science projects rarely start with data already in Python. Your data lives in files, databases, APIs, and web pages. Pandas provides a powerful suite of I/O (input/output) functions that can read data from virtually any source and convert it into DataFrames ready for analysis. Mastering these functions is essential because data loading is the first step in every data science workflow.
Why Data Loading Matters
Before you can analyze, visualize, or model data, you need to get it into Python. This might seem trivial, but in practice, data loading can be surprisingly complex. Files might have different encodings, missing values represented in various ways, dates in non-standard formats, or columns that need special parsing. A single misread column can invalidate your entire analysis. That is why Pandas provides dozens of parameters to handle these edge cases, and understanding them will save you hours of debugging later.
Consider a typical data science project: you receive a CSV export from a CRM system, an Excel spreadsheet from the finance team, JSON data from an API, and you need to query a production database for historical records. Each source has its own quirks and formats. Pandas unifies all of these into a consistent DataFrame interface, letting you focus on analysis rather than data wrangling. The key is knowing which function to use and how to configure it properly.
Pandas I/O Functions
Pandas provides read_* functions for importing data and to_* methods
for exporting. The most common are read_csv(), read_excel(),
read_json(), and read_sql(). Each function returns a DataFrame
and accepts parameters to customize how data is parsed.
Why it matters: Using the right parameters during loading prevents data quality issues downstream. It is much easier to parse dates correctly during import than to fix them after the fact.
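The difference between parsing at import time and fixing afterward can be sketched with a tiny inline CSV (the column names here are illustrative, not from the original files):

```python
import io
import pandas as pd

# Hypothetical inline CSV standing in for a real file on disk
csv_data = io.StringIO("order_id,order_date\n1,2023-01-15\n2,2023-02-20\n")

# Parsing dates during import: one parameter, correct types immediately
df = pd.read_csv(csv_data, parse_dates=['order_date'])
print(df['order_date'].dtype)  # datetime64[ns]

# The after-the-fact fix requires an extra conversion step
csv_data.seek(0)
df2 = pd.read_csv(csv_data)  # order_date loads as object (string)
df2['order_date'] = pd.to_datetime(df2['order_date'])
```

Both paths end with a datetime column, but the first keeps the fix next to the load, where it is hardest to forget.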
Common Data Formats
Data comes in many formats, each with strengths and weaknesses. Understanding these formats helps you choose the right loading function and anticipate potential issues. CSV files are universal but lack data type information. Excel files preserve formatting but can be slow to parse. JSON handles nested data well but can be verbose. SQL databases are powerful but require connection setup. Knowing the trade-offs helps you work efficiently with whatever data source you encounter.
CSV Files
Comma-Separated Values are the most common format. Simple, universal, and human-readable.
Use pd.read_csv() with encoding and separator options.
Excel Files
Spreadsheets with multiple sheets, formulas, and formatting. Common in business settings.
Use pd.read_excel() with sheet_name parameter.
JSON Data
JavaScript Object Notation handles nested, hierarchical data. Standard for APIs.
Use pd.read_json() or pd.json_normalize() for nested structures.
The Read-Analyze-Write Workflow
Every data analysis project follows a similar pattern: read data from a source, perform analysis and transformations, then write results to a destination. Pandas makes this workflow seamless. You read a CSV file into a DataFrame, clean and analyze it, then export the results to Excel for stakeholders or save to a database for production use. The consistency of the DataFrame interface means your analysis code works the same regardless of where the data came from.
# The typical data workflow
import pandas as pd
# 1. READ: Load data from any source
df = pd.read_csv('sales_data.csv')
# 2. ANALYZE: Transform and analyze
df['revenue'] = df['quantity'] * df['price']
monthly_sales = df.groupby('month')['revenue'].sum()
# 3. WRITE: Export results
monthly_sales.to_excel('monthly_report.xlsx')
print("Workflow complete!") # Workflow complete!
This simple pattern scales from quick analyses to production data pipelines. The key insight is that once data is in a DataFrame, all Pandas operations work identically. Whether you loaded from CSV, Excel, JSON, or SQL, the analysis code remains the same. This abstraction is what makes Pandas so powerful for data science workflows.
After every load, inspect the result with df.head(), df.info(), and df.dtypes. This catches
parsing issues early, before they cause problems in your analysis.
Practice Questions: Data Loading Basics
Test your data loading skills with these coding tasks.
Task: Load a CSV file called 'sales_data.csv' and display the first 3 rows.
Expected output (first 3 rows of the DataFrame):
product quantity price
0 Laptop 5 45000
1 Mouse 25 500
2 Keyboard 15 1200
Show Solution
import pandas as pd
df = pd.read_csv('sales_data.csv')
print(df.head(3))
Given:
df = pd.read_csv('customers.csv')
Task: Write code to display the shape, column names, and data types of the DataFrame.
Show Solution
import pandas as pd
df = pd.read_csv('customers.csv')
print("Shape:", df.shape)
print("\nColumns:", df.columns.tolist())
print("\nData Types:")
print(df.dtypes)
Task: Load 'sales_data.csv' and calculate the total revenue, then display which product had the highest revenue.
Show Solution
import pandas as pd
df = pd.read_csv('sales_data.csv')
df['revenue'] = df['quantity'] * df['price']  # derive revenue first
total = df['revenue'].sum()
print(f"Total Revenue: ₹{total:,}")
top_product = df.loc[df['revenue'].idxmax(), 'product']
print(f"Top Product: {top_product}")
Reading CSV Files
CSV (Comma-Separated Values) is the most common format for tabular data. Almost every data source
can export to CSV, making it the universal exchange format for data. The read_csv()
function is Pandas' workhorse for data loading, offering dozens of parameters to handle virtually
any CSV variation you might encounter. Understanding these parameters will save you countless hours
of data cleaning downstream.
Basic CSV Loading
The simplest way to load a CSV file is with just the filename. Pandas automatically detects the
delimiter (usually a comma), reads the first row as column headers, and infers data types for each
column. This works perfectly for well-formatted files, but real-world data often requires additional
parameters. Always inspect the result with head() and info() to verify the
data loaded correctly before proceeding with analysis.
import pandas as pd
# Basic CSV loading
df = pd.read_csv('customers.csv')
print("First 5 rows:")
print(df.head())
# customer_id name email city zip_code signup_date age
# 0 101 Alice Johnson alice@email.com Boston 2101 2023-01-15 28
# 1 102 Bob Smith bob@email.com Seattle 98101 2023-02-20 35
# 2 103 Carol Williams carol@email.com Chicago 60601 2023-03-10 42
# ...
print("\nDataFrame info:")
print(f"Shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
# Shape: (8, 7)
# Columns: ['customer_id', 'name', 'email', 'city', 'zip_code', 'signup_date', 'age']
Notice that Pandas correctly identified all columns and assigned sensible data types. The customer_id
and age are integers, while name, email, city, and signup_date are objects (strings). However, look
closely at signup_date. It should be a datetime type, not a string. This is a common issue that we
will fix using the parse_dates parameter.
Essential read_csv() Parameters
The read_csv() function accepts over 50 parameters, but you will use a handful regularly.
The most important ones control the delimiter, header row, column selection, data types, and date
parsing. Learning these parameters transforms you from someone who can only read perfect files to
someone who can handle any data thrown at them. Let us explore the most useful ones with practical
examples.
read_csv() Options
sep - Delimiter character (default: ',')
header - Row number for column names (default: 0)
usecols - Columns to read (list of names or indices)
dtype - Data types for columns
parse_dates - Columns to parse as dates
na_values - Values to treat as missing
encoding - File encoding (default: 'utf-8')
Pro tip: When loading large files, use usecols to load only
needed columns and nrows to preview the first N rows. This dramatically speeds
up initial exploration.
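The pro tip above can be sketched with a small inline CSV (the column names are hypothetical):

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a large file on disk
big_csv = io.StringIO("id,name,city,notes\n" + "\n".join(
    f"{i},user{i},Boston,lorem" for i in range(1000)))

# Preview just the first 5 rows of only the columns you care about
preview = pd.read_csv(big_csv, usecols=['id', 'city'], nrows=5)
print(preview.shape)  # (5, 2)
```

On a multi-gigabyte file, this combination turns a minutes-long load into a near-instant preview.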
# Load with specific options
df = pd.read_csv('customers.csv',
usecols=['customer_id', 'name', 'city', 'signup_date', 'age'],
parse_dates=['signup_date'])
print(df.dtypes)
# customer_id int64
# name object
# city object
# signup_date datetime64[ns]
# age int64
print(f"\nSignup date type: {type(df['signup_date'].iloc[0])}")
# Signup date type: <class 'pandas._libs.tslibs.timestamps.Timestamp'>
Now signup_date is a proper datetime type, which enables powerful date operations. You can extract the year, month, or day, calculate time differences, filter by date ranges, and more. This is why parsing dates during loading is so important. Doing it later requires extra code and is more error-prone.
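The date operations mentioned above all live under the .dt accessor. A small standalone example (the sample dates mirror the signup dates shown earlier):

```python
import pandas as pd

# A small datetime Series illustrating the operations mentioned above
dates = pd.Series(pd.to_datetime(['2023-01-15', '2023-02-20', '2023-03-10']))

# Extract components
print(dates.dt.year.tolist())          # [2023, 2023, 2023]
print(dates.dt.month_name().tolist())  # ['January', 'February', 'March']

# Time differences and date-range filtering
span = dates.max() - dates.min()
print(span.days)                       # 54
feb_onward = dates[dates >= '2023-02-01']
print(len(feb_onward))                 # 2
```

None of this works on a plain string column, which is why parsing during load pays off immediately.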
Handling Different Delimiters
Not all "CSV" files use commas. Some use tabs, semicolons, pipes, or other characters as delimiters.
European systems often use semicolons because commas are used as decimal separators in those locales.
Tab-separated values (TSV) files are common for scientific data. The sep parameter
handles all these variations. If you are unsure about the delimiter, open the file in a text editor
and look at the raw data.
# Tab-separated file
df_tab = pd.read_csv('data.tsv', sep='\t')
# Semicolon-separated (common in European exports)
df_semi = pd.read_csv('data_euro.csv', sep=';')
# Pipe-separated
df_pipe = pd.read_csv('data.txt', sep='|')
# Auto-detect separator (Python engine required)
df_auto = pd.read_csv('unknown.csv', sep=None, engine='python')
print("Auto-detected format loaded successfully!")
Note: Quoted fields that contain the delimiter (such as "Boston, MA") are handled automatically; if your file uses a non-standard quote character, set it with quotechar.
Handling Missing Values
Real-world data has missing values represented in various ways: empty cells, "N/A", "NULL", "-",
"missing", or even specific codes like -999. By default, Pandas recognizes empty strings and "NA"
as missing. Use the na_values parameter to add custom missing value markers. This
ensures missing data is properly represented as NaN (Not a Number) in your DataFrame, allowing
you to use Pandas' missing value methods like isna(), fillna(), and dropna().
# Specify custom missing value markers
df = pd.read_csv('data_with_missing.csv',
na_values=['N/A', 'NULL', '-', 'missing', '', 'n/a'])
# Check for missing values
print("Missing values per column:")
print(df.isna().sum())
# Fill missing values if needed
df_filled = df.fillna({'age': df['age'].median(),
'city': 'Unknown'})
print("\nAfter filling:")
print(df_filled.isna().sum())
Specifying Data Types
Pandas infers data types automatically, but sometimes it gets it wrong. ZIP codes that start with
zeros become integers (02101 becomes 2101). IDs that should be strings become integers. Phone
numbers lose their formatting. The dtype parameter lets you explicitly set column types
during loading, preventing these issues. This is especially important for columns that should remain
as strings even though they contain numbers.
# Force specific data types
df = pd.read_csv('customers.csv',
dtype={'customer_id': str,
'zip_code': str,
'age': 'Int64'}) # Int64 handles nullable integers
print(df.dtypes)
# customer_id object
# name object
# email object
# city object
# zip_code object
# signup_date object
# age Int64
print(f"\nZIP code preserved: {df['zip_code'].iloc[0]}")
# ZIP code preserved: 02101
Notice we used 'Int64' (capital I) instead of int. The capital-I version
is Pandas' nullable integer type, which can hold missing values. Regular NumPy integer columns cannot
represent NaN, so columns with missing integers would otherwise be converted to floats. Using 'Int64' keeps
them as integers.
Memory tip: For large numeric datasets, request 'int32' instead of 'int64', or 'float32' instead of
'float64', via the dtype parameter. This can cut memory usage in half with minimal precision loss.
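The memory saving is easy to verify. A sketch using synthetic data (for a real file, you would pass the same dtype mapping to read_csv):

```python
import numpy as np
import pandas as pd

# Compare the memory footprint of the same column at two precisions
n = 100_000
df64 = pd.DataFrame({'x': np.arange(n, dtype='int64')})
df32 = pd.DataFrame({'x': np.arange(n, dtype='int32')})

print(df64['x'].memory_usage(index=False))  # 800000 bytes
print(df32['x'].memory_usage(index=False))  # 400000 bytes: exactly half
```

Check memory_usage(deep=True) on the whole DataFrame when object (string) columns dominate, since those are usually the bigger cost.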
Practice Questions: CSV Loading
Test your CSV loading skills with these coding tasks.
Task: Load only the 'name', 'city', and 'age' columns from customers.csv.
Expected columns: ['name', 'city', 'age']
Show Solution
import pandas as pd
df = pd.read_csv('customers.csv', usecols=['name', 'city', 'age'])
print(df.columns.tolist()) # ['name', 'city', 'age']
print(df.head())
Task: Load customers.csv with signup_date parsed as a datetime. Then extract the year from the first signup date.
Show Solution
import pandas as pd
df = pd.read_csv('customers.csv', parse_dates=['signup_date'])
first_year = df['signup_date'].iloc[0].year
print(f"First signup year: {first_year}") # 2023
Task: Load customers.csv keeping zip_code as a string (to preserve leading zeros), and verify that the Boston customer's ZIP code is '02101' (not 2101).
Show Solution
import pandas as pd
df = pd.read_csv('customers.csv', dtype={'zip_code': str})
boston_zip = df[df['city'] == 'Boston']['zip_code'].iloc[0]
print(f"Boston ZIP: {boston_zip}") # 02101
print(f"Has leading zero: {boston_zip.startswith('0')}") # True
Loading Excel Files
Excel files are everywhere in business environments. Financial reports, inventory tracking,
customer lists, and project plans often live in spreadsheets. Unlike CSV files, Excel workbooks
can contain multiple sheets, formulas, formatting, and charts. The read_excel()
function handles all of this, extracting the data you need while ignoring the visual formatting.
Understanding Excel loading is essential for any data scientist working with business stakeholders.
Required Dependencies
Before reading Excel files, you need an engine library installed. For modern Excel files (.xlsx),
the most common engine is openpyxl. For older Excel files (.xls), you need
xlrd. Pandas does not include these by default to keep the installation lightweight.
Install them once and they work automatically whenever you call read_excel().
# Install the Excel engine (run once in terminal)
# pip install openpyxl
# For older .xls files
# pip install xlrd
import pandas as pd
# Now you can read Excel files
df = pd.read_excel('sales_report.xlsx')
print(df.head())
If you see ImportError: Missing optional dependency 'openpyxl', just run
pip install openpyxl in your terminal and try again.
Basic Excel Loading
Reading an Excel file is similar to reading a CSV. By default, Pandas reads the first sheet and
uses the first row as column headers. Excel files often have better data type preservation than
CSV files because Excel stores type information internally. Dates, numbers, and text are usually
correctly typed without additional parsing. However, you should still verify with dtypes
to ensure everything loaded as expected.
import pandas as pd
# Read the first sheet (default behavior)
df = pd.read_excel('quarterly_sales.xlsx')
print("Sheet data:")
print(df.head())
print(f"\nShape: {df.shape}")
print(f"\nData types:\n{df.dtypes}")
Working with Multiple Sheets
Excel workbooks often contain multiple sheets. A financial report might have sheets for Q1, Q2,
Q3, and Q4. A CRM export might separate customers by region. The sheet_name parameter
lets you specify which sheet to read. You can use the sheet name as a string, the sheet index
(starting from 0), or even load multiple sheets at once into a dictionary of DataFrames.
# Read a specific sheet by name
df_q1 = pd.read_excel('quarterly_sales.xlsx', sheet_name='Q1')
# Read by index (0 = first sheet, 1 = second, etc.)
df_second = pd.read_excel('quarterly_sales.xlsx', sheet_name=1)
# Read ALL sheets into a dictionary
all_sheets = pd.read_excel('quarterly_sales.xlsx', sheet_name=None)
print(f"Sheet names: {list(all_sheets.keys())}")
# Sheet names: ['Q1', 'Q2', 'Q3', 'Q4']
# Access individual sheets from the dictionary
df_q3 = all_sheets['Q3']
print(f"Q3 data shape: {df_q3.shape}")
Loading all sheets with sheet_name=None returns a dictionary where keys are sheet
names and values are DataFrames. This is useful when you need to process multiple sheets or
combine them into one DataFrame. You can iterate through the dictionary or access sheets by name.
Combining Multiple Sheets
When sheets have the same structure (same columns), you can combine them with
pd.concat(). This is common for time-series data split across sheets
like monthly or quarterly reports. Add a label to track which sheet each row came from: either a new column, or pd.concat's keys argument, which creates a hierarchical index.
Example: combined = pd.concat(all_sheets.values(), keys=all_sheets.keys())
# Combine all quarterly sheets into one DataFrame
all_sheets = pd.read_excel('quarterly_sales.xlsx', sheet_name=None)
# Method 1: Simple concat
combined = pd.concat(all_sheets.values(), ignore_index=True)
# Method 2: Add quarter column to track source
dfs = []
for quarter, df in all_sheets.items():
df['quarter'] = quarter
dfs.append(df)
combined = pd.concat(dfs, ignore_index=True)
print(f"Combined shape: {combined.shape}")
print(combined['quarter'].value_counts())
Skipping Rows and Headers
Business spreadsheets often have title rows, blank rows, or metadata before the actual data starts.
The skiprows parameter lets you skip these rows. Similarly, header specifies
which row contains the column names. If your data has no header row, set header=None and
Pandas will assign numeric column names (0, 1, 2, etc.). You can then rename them with names.
# Skip the first 3 rows (title, blank, description)
df = pd.read_excel('report.xlsx', skiprows=3)
# Header is in row 5 (0-indexed), skip rows 0-4
df = pd.read_excel('report.xlsx', header=5)
# No header row, provide custom column names
df = pd.read_excel('raw_data.xlsx',
header=None,
names=['date', 'product', 'quantity', 'revenue'])
print(df.head())
Reading Specific Cell Ranges
Sometimes you only need a portion of a spreadsheet. The usecols parameter selects
specific columns, and nrows limits the number of rows. For Excel files, you can
also use Excel-style column letters like "A:C" or "B,D,F". This is particularly useful for
large spreadsheets where loading everything would be slow and memory-intensive.
# Read only columns A through D
df = pd.read_excel('large_report.xlsx', usecols='A:D')
# Read specific columns by letter
df = pd.read_excel('large_report.xlsx', usecols='A,C,E,G')
# Read specific columns by name
df = pd.read_excel('sales.xlsx', usecols=['Date', 'Product', 'Revenue'])
# Read only first 100 rows (great for previewing)
df = pd.read_excel('huge_file.xlsx', nrows=100)
print(f"Preview shape: {df.shape}")
Tip: Excel files parse more slowly than CSVs, so use nrows to preview before loading the full file.
Practice Questions: Excel Files
Test your Excel loading skills with these coding tasks.
Task: Read the sheet named 'Q2' from a file called 'quarterly_sales.xlsx' and display its shape.
Show Solution
import pandas as pd
df = pd.read_excel('quarterly_sales.xlsx', sheet_name='Q2')
print(f"Shape: {df.shape}")
Task: Load all sheets from 'quarterly_sales.xlsx' and print the names of all sheets in the workbook.
Show Solution
import pandas as pd
all_sheets = pd.read_excel('quarterly_sales.xlsx', sheet_name=None)
print(f"Sheet names: {list(all_sheets.keys())}")
Task: Load all sheets from 'quarterly_sales.xlsx', add a 'source_sheet' column to each DataFrame, and combine them into one DataFrame.
Show Solution
import pandas as pd
all_sheets = pd.read_excel('quarterly_sales.xlsx', sheet_name=None)
dfs = []
for sheet_name, df in all_sheets.items():
df['source_sheet'] = sheet_name
dfs.append(df)
combined = pd.concat(dfs, ignore_index=True)
print(f"Combined shape: {combined.shape}")
print(combined['source_sheet'].value_counts())
Working with JSON Data
JSON (JavaScript Object Notation) has become the standard format for web APIs and modern data
exchange. When you fetch data from a REST API, download datasets from data portals, or work with
NoSQL databases, you are likely dealing with JSON. Unlike CSV's flat tabular structure, JSON can
represent nested, hierarchical data. Pandas provides both read_json() for simple
structures and json_normalize() for complex nested data.
Basic JSON Loading
For simple JSON files containing an array of objects (records), read_json() works
seamlessly. Each object becomes a row, and each key becomes a column. This pattern is common
in API responses and data exports. The function automatically detects the JSON structure and
creates an appropriate DataFrame. Always verify the result because JSON can have subtle
structural variations.
import pandas as pd
# Load simple JSON array
df = pd.read_json('customers.json')
print("Loaded JSON data:")
print(df.head())
# customer_id name age email
# 0 101 Alice 28 alice@email.com
# 1 102 Bob 35 bob@email.com
# 2 103 Carol 42 carol@email.com
print(f"\nShape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
The JSON file contained an array of customer objects, and Pandas converted it directly to a DataFrame. Each JSON object became a row, and the keys (customer_id, name, age, email) became columns. This is the ideal case where JSON structure maps cleanly to tabular data. But not all JSON is this simple.
Understanding JSON Orientation
JSON data can be structured in different ways, called "orientations". The default is "columns"
where each key is a column name mapping to a list of values. But you might also encounter
"records" (list of row objects), "index" (dict of index to row), or "split" (separate lists
for columns, data, and index). The orient parameter tells Pandas how to interpret
the structure.
Records Orientation
[
{"name": "Alice", "age": 28},
{"name": "Bob", "age": 35}
]
Columns Orientation
{
"name": ["Alice", "Bob"],
"age": [28, 35]
}
# Records orientation (list of objects) - most common from APIs
df = pd.read_json('data.json', orient='records')
# Columns orientation (dict of column arrays)
df = pd.read_json('data.json', orient='columns')
# Index orientation (dict of index to row dict)
df = pd.read_json('data.json', orient='index')
# Split orientation (separate columns, data, index)
df = pd.read_json('data.json', orient='split')
print(df.head())
Handling Nested JSON with json_normalize()
Real-world JSON from APIs often contains nested structures. A customer record might have an
embedded address object with street, city, and zip fields. A product might have a nested
metadata object with dimensions and weight. The read_json() function cannot
flatten these nested structures automatically. For this, you need pd.json_normalize(),
which flattens nested dictionaries into a flat table with dot-notation column names.
import pandas as pd
import json
# First, let's see what nested JSON looks like
nested_data = [
{
"id": 1,
"name": "Alice",
"address": {
"street": "123 Main St",
"city": "Boston",
"zip": "02101"
}
},
{
"id": 2,
"name": "Bob",
"address": {
"street": "456 Oak Ave",
"city": "Seattle",
"zip": "98101"
}
}
]
# json_normalize flattens nested structures
df = pd.json_normalize(nested_data)
print(df.columns.tolist())
# ['id', 'name', 'address.street', 'address.city', 'address.zip']
print(df)
# id name address.street address.city address.zip
# 0 1 Alice 123 Main St Boston 02101
# 1 2 Bob 456 Oak Ave Seattle 98101
Notice how the nested address object was flattened into columns with dot notation:
address.street, address.city, address.zip. You can
customize the separator with the sep parameter if you prefer underscores or
another character. This flattening is essential for analysis because Pandas works best with
flat tabular data.
json_normalize() Parameters
record_path - Path to the list of records to normalize
meta - Fields to include from parent records
sep - Separator for nested column names (default: '.')
max_level - Maximum nesting depth to flatten
Common use: API responses often have nested lists. Use record_path
to specify which nested list to expand, and meta to keep parent-level fields.
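A sketch of record_path and meta together, using a made-up API-style response (the field names are hypothetical):

```python
import pandas as pd

# Hypothetical API response: a nested list of orders inside each customer
response = {
    "customers": [
        {"name": "Alice", "city": "Boston",
         "orders": [{"order_id": 1, "amount": 250},
                    {"order_id": 2, "amount": 125}]},
        {"name": "Bob", "city": "Seattle",
         "orders": [{"order_id": 3, "amount": 300}]},
    ]
}

# record_path expands the nested 'orders' lists into rows;
# meta repeats the parent-level fields onto each of those rows
df = pd.json_normalize(response["customers"],
                       record_path="orders",
                       meta=["name", "city"])
print(df)
#    order_id  amount   name     city
# 0         1     250  Alice   Boston
# 1         2     125  Alice   Boston
# 2         3     300    Bob  Seattle
```

Each order becomes its own row, with the customer's name and city carried along, which is exactly the flat shape groupby and merge expect.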
Loading JSON from Web APIs
You can load JSON directly from a URL, which is perfect for working with REST APIs. Pandas'
read_json() accepts URLs just like file paths. For more complex API interactions
(authentication, headers, pagination), use the requests library to fetch the
data first, then normalize it with Pandas. This gives you full control over the HTTP request
while leveraging Pandas for data parsing.
import pandas as pd
import requests
# Method 1: Direct URL (for simple public APIs)
url = 'https://api.example.com/data.json'
df = pd.read_json(url)
# Method 2: Using requests (for complex APIs)
response = requests.get('https://api.example.com/users',
headers={'Authorization': 'Bearer token123'})
data = response.json()
# Normalize the response
df = pd.json_normalize(data['users'])
print(f"Loaded {len(df)} users from API")
Practice Questions: JSON Data
Test your JSON loading skills with these coding tasks.
Task: Load 'customers.json' and display the column names and number of rows.
Show Solution
import pandas as pd
df = pd.read_json('customers.json')
print(f"Columns: {df.columns.tolist()}")
print(f"Rows: {len(df)}")
Given: A list of nested dictionaries:
data = [
{"name": "Alice", "location": {"city": "NYC", "country": "USA"}},
{"name": "Bob", "location": {"city": "London", "country": "UK"}}
]
Task: Use json_normalize to flatten this into a DataFrame with columns: name, location.city, location.country
Show Solution
import pandas as pd
data = [
{"name": "Alice", "location": {"city": "NYC", "country": "USA"}},
{"name": "Bob", "location": {"city": "London", "country": "UK"}}
]
df = pd.json_normalize(data)
print(df.columns.tolist())
# ['name', 'location.city', 'location.country']
print(df)
Given: The same nested data as Q2.
Task: Flatten the data but use underscore (_) as the separator instead of dot, so columns become: name, location_city, location_country
Show Solution
import pandas as pd
data = [
{"name": "Alice", "location": {"city": "NYC", "country": "USA"}},
{"name": "Bob", "location": {"city": "London", "country": "UK"}}
]
df = pd.json_normalize(data, sep='_')
print(df.columns.tolist())
# ['name', 'location_city', 'location_country']
Connecting to SQL Databases
Production data often lives in relational databases like PostgreSQL, MySQL, SQLite, or SQL Server. These databases handle transactions, enforce data integrity, and support concurrent access from multiple applications. Pandas can connect directly to these databases, execute SQL queries, and load results into DataFrames. This bridges the gap between your database systems and Python analytics, letting you work with live production data.
Database Connection Basics
To connect to a database, you need two things: a database driver and a connection string. The driver is a Python library that knows how to communicate with your specific database type. The connection string contains the database location, credentials, and other settings. Pandas works with SQLAlchemy under the hood, which provides a unified interface for different database types. Once connected, you can run SQL queries and receive results as DataFrames.
# Install database drivers (run once in terminal)
# pip install sqlalchemy
# pip install psycopg2-binary # PostgreSQL
# pip install pymysql # MySQL
# pip install pyodbc # SQL Server
import pandas as pd
from sqlalchemy import create_engine
# Create a database connection (SQLite example - no server needed)
engine = create_engine('sqlite:///company.db')
# Read entire table
df = pd.read_sql_table('customers', engine)
print(f"Loaded {len(df)} customers from database")
Connection Strings
Connection strings follow the pattern: dialect+driver://user:password@host:port/database
SQLite: sqlite:///path/to/file.db
PostgreSQL: postgresql://user:pass@localhost:5432/dbname
MySQL: mysql+pymysql://user:pass@localhost:3306/dbname
SQL Server: mssql+pyodbc://user:pass@server/db?driver=ODBC+Driver+17
Security tip: Never hardcode passwords in your code. Use environment variables or configuration files that are not committed to version control.
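One common pattern for the security tip above: read credentials from environment variables. The variable names DB_USER and DB_PASS here are arbitrary choices, not a convention Pandas or SQLAlchemy requires:

```python
import os

# Hypothetical environment variables; set them outside your code, e.g.
#   export DB_USER=analyst DB_PASS=s3cret   (in your shell)
user = os.environ.get("DB_USER", "")
password = os.environ.get("DB_PASS", "")

# Build the connection string without hardcoding credentials
url = f"postgresql://{user}:{password}@localhost:5432/company"
# engine = create_engine(url)  # once sqlalchemy is installed and vars are set
```

Because the secret never appears in the source file, the code can be committed to version control safely.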
Running SQL Queries
While read_sql_table() loads an entire table, read_sql_query() lets
you run arbitrary SQL queries. This is more powerful because you can filter, join, aggregate,
and transform data in the database before loading it into Python. For large datasets, doing
aggregations in the database is much faster than loading all rows and aggregating in Pandas.
The database is optimized for these operations and can use indexes efficiently.
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine('sqlite:///company.db')
# Run a custom SQL query
query = """
SELECT
c.name,
c.city,
COUNT(o.order_id) as order_count,
SUM(o.amount) as total_spent
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id
HAVING total_spent > 1000
ORDER BY total_spent DESC
"""
df = pd.read_sql_query(query, engine)
print(df.head())
# name city order_count total_spent
# 0 Alice Boston 15 12500
# 1 Bob Seattle 12 10200
# 2 Carol Chicago 8 8750
This query demonstrates the power of combining SQL and Pandas. The database handles the join, aggregation, filtering (HAVING), and sorting before any data reaches Python. Only the final summary rows are transferred. This is dramatically faster than loading all customers and orders separately and joining them in Pandas.
Parameterized Queries
When your query needs to include user input or variable values, never use string concatenation. This opens your code to SQL injection attacks. Instead, use parameterized queries where Pandas safely substitutes values into placeholders. This protects your database and ensures values are properly escaped.
# WRONG - Never do this (SQL injection risk!)
city = "Boston"
query = f"SELECT * FROM customers WHERE city = '{city}'" # Dangerous!
# CORRECT - Use parameterized queries
from sqlalchemy import text
city = "Boston"
min_age = 25
query = text("SELECT * FROM customers WHERE city = :city AND age >= :min_age")
df = pd.read_sql_query(query, engine, params={"city": city, "min_age": min_age})
print(f"Found {len(df)} customers in {city} aged {min_age}+")
Writing Data to Databases
Pandas can also write DataFrames back to databases using to_sql(). This is useful
for storing analysis results, creating derived tables, or migrating data between systems. The
if_exists parameter controls what happens if the table already exists: 'fail'
(raise error), 'replace' (drop and recreate), or 'append' (add rows to existing table).
import pandas as pd
from sqlalchemy import create_engine
# Create sample DataFrame
results = pd.DataFrame({
'customer_id': [101, 102, 103],
'segment': ['Gold', 'Silver', 'Gold'],
'predicted_churn': [0.12, 0.45, 0.08]
})
engine = create_engine('sqlite:///company.db')
# Write to database
results.to_sql('customer_segments',
engine,
if_exists='replace', # Options: 'fail', 'replace', 'append'
index=False) # Don't save DataFrame index as column
print("Data written to database successfully!")
# Verify by reading it back
df_verify = pd.read_sql_table('customer_segments', engine)
print(df_verify)
For large DataFrames, use the chunksize parameter
to write in batches. Also consider method='multi' for databases that support
multi-row INSERT statements.
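A minimal sketch of batched writing, using an in-memory SQLite database so it needs no server (the table and column names are invented for the example):

```python
import sqlite3
import pandas as pd

# In-memory SQLite database for illustration
conn = sqlite3.connect(':memory:')
df = pd.DataFrame({'id': range(10_000), 'value': range(10_000)})

# chunksize writes rows in batches of 1,000 instead of one giant statement
df.to_sql('measurements', conn, if_exists='replace',
          index=False, chunksize=1_000)

count = pd.read_sql_query('SELECT COUNT(*) AS n FROM measurements', conn)
print(count['n'].iloc[0])  # 10000
```

For production databases, batching keeps memory usage flat and avoids hitting server-side limits on statement size.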
Practice Questions: SQL Databases
Test your database connection skills with these coding tasks.
Task: Create a SQLAlchemy engine for a SQLite database file called 'mydata.db' and read all rows from a table named 'products'.
Show Solution
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine('sqlite:///mydata.db')
df = pd.read_sql_table('products', engine)
print(df.head())
Task: Write a SQL query to get the average price per category from a 'products' table, then load results into a DataFrame.
Show Solution
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine('sqlite:///mydata.db')
query = """
SELECT category, AVG(price) as avg_price
FROM products
GROUP BY category
ORDER BY avg_price DESC
"""
df = pd.read_sql_query(query, engine)
print(df)
Task: Create a DataFrame with columns 'name' and 'score', then write it to a new table called 'leaderboard' in a SQLite database. If the table exists, replace it.
Show Solution
import pandas as pd
from sqlalchemy import create_engine
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Carol'],
'score': [95, 87, 92]
})
engine = create_engine('sqlite:///mydata.db')
df.to_sql('leaderboard', engine, if_exists='replace', index=False)
# Verify
result = pd.read_sql_table('leaderboard', engine)
print(result)
Web Scraping with read_html()
The web is full of data trapped in HTML tables. Wikipedia articles, government statistics,
sports scores, financial data, and countless other datasets are displayed as tables on web
pages. Manually copying this data would be tedious and error-prone. The read_html()
function automatically extracts all tables from a web page and converts them into DataFrames.
This is one of Pandas' most magical features for quick data gathering.
Basic Table Extraction
The read_html() function accepts a URL or HTML string and returns a list of
DataFrames, one for each table found on the page. This is remarkably simple compared to
traditional web scraping with BeautifulSoup or Scrapy. For many use cases, especially
grabbing data from Wikipedia or government sites, this one function is all you need. The
function intelligently parses table headers, handles merged cells, and cleans up whitespace.
# Install lxml for faster parsing (optional but recommended)
# pip install lxml html5lib
import pandas as pd
# Scrape all tables from a Wikipedia page
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)'
tables = pd.read_html(url)
print(f"Found {len(tables)} tables on the page")
# Found 4 tables on the page
# Each table is a DataFrame - inspect the first one
df = tables[0]
print(f"\nFirst table shape: {df.shape}")
print(df.head())
The function returns a list because web pages often contain multiple tables. You need to
inspect each one to find the data you want. Use len(tables) to see how many
were found, then examine each with tables[0].head(), tables[1].head(),
etc. Once you identify the right table, assign it to a variable and continue your analysis.
Filtering and Matching Tables
When a page has many tables, searching through them manually is inefficient. The
match parameter lets you filter tables by searching for a regex pattern in the
table content. Only tables containing matching text are returned. This is useful when you
know a keyword that appears in your target table but not in navigation tables or other
irrelevant content.
import pandas as pd
# Only get tables that contain "GDP" or "Population"
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)'
tables = pd.read_html(url, match='GDP')
print(f"Tables matching 'GDP': {len(tables)}")
# Get the first matching table
df = tables[0]
print(df.head())
read_html() Options
match - Regex pattern to filter tables
header - Row number to use as header (or list for multi-level)
index_col - Column to use as row index
attrs - Dict of HTML attributes to match (id, class, etc.)
encoding - Character encoding of the page
Advanced tip: Use attrs={'id': 'specific-table-id'} to
target a specific table by its HTML id attribute, which is more precise than text matching.
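A self-contained sketch of attrs targeting, using an inline HTML snippet instead of a live URL (the table ids here are made up for illustration; StringIO keeps the example runnable offline):

```python
import pandas as pd
from io import StringIO

# Two tables on the "page"; we only want the one with id="prices"
html = """
<table id="nav">
  <tr><th>Link</th></tr>
  <tr><td>Home</td></tr>
</table>
<table id="prices">
  <tr><th>Item</th><th>Price</th></tr>
  <tr><td>Laptop</td><td>999</td></tr>
</table>
"""

# attrs filters by HTML attributes, so the navigation table is skipped
tables = pd.read_html(StringIO(html), attrs={'id': 'prices'})
print(len(tables))  # 1 - only the matching table is returned
print(tables[0])
```

The same attrs dict works against a URL; matching on an id is more robust than text matching because ids rarely change when page content is updated.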
Handling Common Issues
Web scraping is not always smooth. Tables might have merged header cells, footnotes mixed
with data, or encoding issues. The header parameter can handle multi-level
headers by passing a list of row numbers. Some sites block automated access, requiring
you to use the requests library with custom headers to fetch the HTML first.
import pandas as pd
import requests
# Some sites block Pandas' default user agent
# Solution: Fetch with requests first
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}
response = requests.get('https://example.com/data', headers=headers)
# Parse the HTML content
tables = pd.read_html(response.text)
# Handle multi-level headers (header spans 2 rows)
df = pd.read_html(response.text, header=[0, 1])[0]
print(df.columns)  # MultiIndex columns
Always check a site's robots.txt and terms of service before scraping. Some sites prohibit
automated access. For large-scale scraping, consider rate limiting to avoid overwhelming servers.
Scraping Local HTML Files
You can also use read_html() on local HTML files or HTML strings. This is
useful when you have downloaded HTML files, received HTML in an email, or want to test
your parsing logic offline before hitting live websites. The function accepts file paths
and raw HTML strings identically to URLs.
import pandas as pd
# Read from a local HTML file
tables = pd.read_html('report.html')
# Parse an HTML string directly
html_string = """
<table>
  <tr><th>Name</th><th>Score</th></tr>
  <tr><td>Alice</td><td>95</td></tr>
  <tr><td>Bob</td><td>87</td></tr>
  <tr><td>Carol</td><td>92</td></tr>
</table>
"""
tables = pd.read_html(html_string)
df = tables[0]
print(df)
#     Name  Score
# 0  Alice     95
# 1    Bob     87
# 2  Carol     92
Tip: For pages with many tables, combine match with the attrs parameter for precise targeting.
Practice Questions: Web Scraping
Test your web scraping skills with these coding tasks.
Given:
html = """
<table>
<tr><th>Product</th><th>Price</th></tr>
<tr><td>Laptop</td><td>999</td></tr>
<tr><td>Mouse</td><td>29</td></tr>
</table>
"""
Task: Parse this HTML string and print the resulting DataFrame.
Show Solution
import pandas as pd
html = """
<table>
<tr><th>Product</th><th>Price</th></tr>
<tr><td>Laptop</td><td>999</td></tr>
<tr><td>Mouse</td><td>29</td></tr>
</table>
"""
tables = pd.read_html(html)
df = tables[0]
print(df)
Task: Load all tables from a Wikipedia URL and print how many tables were found, along with the shape of each table.
Show Solution
import pandas as pd
url = 'https://en.wikipedia.org/wiki/Python_(programming_language)'
tables = pd.read_html(url)
print(f"Found {len(tables)} tables\n")
for i, table in enumerate(tables):
print(f"Table {i}: shape {table.shape}")
Task: Scrape a Wikipedia page and use the match parameter to only get tables containing the word "Release". Print the number of matching tables.
Show Solution
import pandas as pd
url = 'https://en.wikipedia.org/wiki/Python_(programming_language)'
tables = pd.read_html(url, match='Release')
print(f"Tables containing 'Release': {len(tables)}")
if len(tables) > 0:
print(tables[0].head())
Key Takeaways
CSV is King
read_csv() handles delimiters, encodings, missing values, and data types.
Master its parameters like sep, na_values, and dtype
for robust data loading.
Excel Made Easy
Use sheet_name to load specific sheets, sheet_name=None for
all sheets as a dict, and pd.concat() to combine them into one DataFrame.
JSON Normalization
Simple JSON loads directly. For nested structures, use json_normalize()
with record_path and meta to flatten complex hierarchies.
SQL Connections
Create engines with SQLAlchemy, use read_sql() for queries, and
to_sql() to write back. Use parameterized queries for security and
chunksize for large datasets.
Web Table Magic
read_html() automatically extracts all tables from web pages. Use
match to filter tables and attrs to target specific elements.
Memory Efficiency
Specify dtype to reduce memory, use usecols to load only
needed columns, and leverage chunksize for processing files larger than RAM.
Knowledge Check
Test your understanding of data loading concepts:
Which parameter changes the column separator in read_csv()?
How do you load all sheets from an Excel file as a dictionary?
Which function flattens nested JSON data into a DataFrame?
What does the if_exists='append' parameter do in to_sql()?
What does pd.read_html() return?
Which parameter specifies custom values to treat as missing in read_csv()?