Introduction to Pandas
Pandas is the most widely used Python library for data analysis. Whether you're cleaning messy datasets, calculating business metrics, or preparing data for visualization, Pandas is your go-to tool. In this lesson, you'll learn the fundamentals that data analysts use every day.
Why Pandas Matters
Imagine you have a spreadsheet with 100,000 rows of sales data. You need to filter by region, calculate monthly totals, and find your top products. In Excel, this means slow, error-prone manual work. In Pandas, it takes three lines of code and runs in seconds.
Pandas (Python Data Analysis Library) was created in 2008 by Wes McKinney while working at a hedge fund. Today, it's used by data analysts, scientists, and engineers at virtually every company that works with data—from startups to Fortune 500 companies.
The Two Core Data Structures
Everything in Pandas revolves around two structures: Series (1D) and DataFrame (2D). Think of a Series as a single column in Excel, and a DataFrame as the entire spreadsheet.
Series
A Series is a one-dimensional labeled array. It's like a single column in a spreadsheet, but with a labeled index that lets you access values by name, not just position.
Why it matters: When you select one column from a DataFrame (like df['revenue']),
you get a Series. Understanding Series operations is essential for data manipulation.
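For instance, selecting one column from a small DataFrame returns a Series that shares the DataFrame's row index (the column names here are just illustrative):

```python
import pandas as pd

df = pd.DataFrame({'region': ['North', 'South'], 'revenue': [12000, 9500]})
col = df['revenue']           # selecting one column returns a Series
print(type(col).__name__)     # Series
print(col.index.tolist())     # [0, 1] -- same row index as the DataFrame
```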
DataFrame
A DataFrame is a two-dimensional table with labeled rows and columns. Each column is a Series, and all columns share the same row index. This is where you'll spend 90% of your time in Pandas.
Think of it as: An Excel spreadsheet or a SQL table—but with Python's power for filtering, transforming, and analyzing data programmatically.
Installing and Importing Pandas
Before using Pandas, you need to install it (if not already installed) and import it into your Python script.
The community convention is to import Pandas as pd—you'll see this in every tutorial and codebase.
# Install pandas (run once in terminal)
pip install pandas
# Import pandas with the standard alias
import pandas as pd
# Check your pandas version
print(pd.__version__) # e.g., 2.1.0
Always use import pandas as pd. This is the universal convention, and using any other alias will confuse other developers reading your code.
Your First DataFrame
Let's create a simple DataFrame from scratch to understand how it works. The most common way is from a Python dictionary where keys become column names.
import pandas as pd
# Create a DataFrame from a dictionary
df = pd.DataFrame({
    'region': ['North', 'South', 'East', 'West'],
    'revenue': [12000, 9500, 15000, 8000],
    'customers': [150, 120, 180, 95]
})
print(df)
Output:
region revenue customers
0 North 12000 150
1 South 9500 120
2 East 15000 180
3 West 8000 95
Notice the numbers 0, 1, 2, 3 on the left? That's the index—a label for each row.
Pandas automatically creates a numeric index if you don't specify one.
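You can also supply your own row labels instead of the automatic 0, 1, 2, 3. A quick sketch using the same regional data:

```python
import pandas as pd

# Pass index= to label rows yourself instead of using 0..n-1
df = pd.DataFrame(
    {'revenue': [12000, 9500, 15000, 8000]},
    index=['North', 'South', 'East', 'West']
)
print(df.loc['East', 'revenue'])  # access a row by its label -> 15000
```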
Practice Questions: Introduction
Test your understanding with these hands-on exercises.
Task: Create a DataFrame with 3 products (Laptop, Mouse, Keyboard) and their prices (45000, 500, 1500). Print the DataFrame.
Show Solution
import pandas as pd
df = pd.DataFrame({
    'product': ['Laptop', 'Mouse', 'Keyboard'],
    'price': [45000, 500, 1500]
})
print(df)
Task: Load any CSV file and print its shape (rows, columns) and column names.
Show Solution
import pandas as pd
df = pd.read_csv('sales_data.csv')
print(f"Shape: {df.shape}") # (rows, cols)
print(f"Columns: {df.columns.tolist()}")
Given:
df = pd.DataFrame({
    'product': ['Laptop', 'Mouse', 'Keyboard'],
    'price': [45000, 500, 1500],
    'quantity': [5, 50, 30]
})
Task: Add a new column total that equals price * quantity.
Show Solution
df['total'] = df['price'] * df['quantity']
print(df)
Working with Series
A Series is the building block of everything in Pandas. Before you can work with full datasets, you need to master Series operations. When you select a single column from a DataFrame, you get a Series—so understanding them is essential.
Creating a Series
There are several ways to create a Series. The most common is from a Python list or dictionary. When you use a dictionary, the keys become the index labels.
import pandas as pd
# From a list (auto-generates numeric index)
revenue = pd.Series([12000, 9500, 15000, 8000])
print(revenue)
# Output:
# 0 12000
# 1 9500
# 2 15000
# 3 8000
# dtype: int64
# From a dictionary (keys become index labels)
revenue = pd.Series({
    'North': 12000,
    'South': 9500,
    'East': 15000,
    'West': 8000
})
print(revenue)
# Output:
# North 12000
# South 9500
# East 15000
# West 8000
# dtype: int64
Tip: Prefer accessing values by labels (revenue['North']) instead of positions (revenue[0]). This makes your code more readable and less error-prone when rows get reordered.
Accessing Series Values
You can access Series values in multiple ways: by label, by position, or with boolean conditions.
| Method | Syntax | Description | Example |
|---|---|---|---|
| By label | s['label'] | Access using index label | revenue['North'] → 12000 |
| By position | s.iloc[n] | Access using numeric position | revenue.iloc[0] → 12000 |
| Multiple labels | s[['a','b']] | Select multiple values | revenue[['North','East']] |
| Boolean mask | s[condition] | Filter by condition | revenue[revenue > 10000] |
# Access by label
print(revenue['North']) # 12000
# Access by position
print(revenue.iloc[0]) # 12000
# Select multiple regions
print(revenue[['North', 'East']])
# North 12000
# East 15000
# Filter: only regions with revenue > 10000
high_revenue = revenue[revenue > 10000]
print(high_revenue)
# North 12000
# East 15000
Series Operations
One of Pandas' superpowers is vectorized operations—you can perform calculations on entire Series without writing loops. This is both faster and more readable.
# Mathematical operations (applied to all elements)
revenue_in_lakhs = revenue / 100000 # Convert to lakhs
print(revenue_in_lakhs)
# North 0.12
# South 0.095
# East 0.15
# West 0.08
# Add 10% bonus to all regions
bonus = revenue * 0.10
print(bonus)
# North 1200.0
# South 950.0
# East 1500.0
# West 800.0
Common Series Methods
These are the methods you'll use daily when analyzing data:
# Statistical summaries
print(revenue.sum()) # 44500 (total)
print(revenue.mean()) # 11125.0 (average)
print(revenue.max()) # 15000 (highest)
print(revenue.min()) # 8000 (lowest)
print(revenue.std()) # Standard deviation
# Useful information
print(revenue.count()) # 4 (number of values)
print(revenue.idxmax()) # 'East' (label of max value)
print(revenue.idxmin()) # 'West' (label of min value)
# Get all values as NumPy array
print(revenue.values) # array([12000, 9500, 15000, 8000])
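If you want several of these statistics at once, describe() bundles them into a single Series. A minimal sketch with the same revenue data:

```python
import pandas as pd

revenue = pd.Series({'North': 12000, 'South': 9500, 'East': 15000, 'West': 8000})

# describe() returns count, mean, std, min, quartiles, and max in one call
stats = revenue.describe()
print(stats['count'])  # 4.0
print(stats['mean'])   # 11125.0
```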
The map() Method
The map() method lets you transform values using a dictionary or function.
This is incredibly useful for converting categories to numeric values for analysis.
Use case: Converting text ratings (Good, Average, Poor) to numeric scores (3, 2, 1) for calculating averages.
# Map categories to numbers
priority_scores = {'high': 3, 'medium': 2, 'low': 1}
priorities = pd.Series(['high', 'low', 'medium', 'high'])
numeric_scores = priorities.map(priority_scores)
print(numeric_scores)
# 0 3
# 1 1
# 2 2
# 3 3
# Now you can calculate the average priority!
print(numeric_scores.mean()) # 2.25
Practice Questions: Series
Apply what you learned about Series operations.
Task: Create a Series with months (Jan, Feb, Mar) as index and sales values (5000, 7500, 6200). Print the total sales.
Show Solution
sales = pd.Series({
    'Jan': 5000,
    'Feb': 7500,
    'Mar': 6200
})
print(f"Total sales: {sales.sum()}") # 18700
Given:
scores = pd.Series({'Alice': 85, 'Bob': 92, 'Charlie': 78, 'Diana': 95})
Task: Find who has the highest score and what it is.
Show Solution
top_scorer = scores.idxmax()
top_score = scores.max()
print(f"{top_scorer} scored {top_score}") # Diana scored 95
Given:
revenue = pd.Series({
    'North': 12000, 'South': 9500, 'East': 15000, 'West': 8000
})
Task: Find regions with revenue above average and calculate their percentage of total revenue.
Show Solution
avg = revenue.mean()
above_avg = revenue[revenue > avg]
pct = (above_avg.sum() / revenue.sum()) * 100
print(f"Above average regions: {list(above_avg.index)}")
print(f"Their share: {pct:.1f}%") # 60.7%
Given:
ratings = pd.Series(['Excellent', 'Good', 'Poor', 'Good', 'Excellent'])
Task: Map these to scores (Excellent=5, Good=3, Poor=1) and find the average rating.
Show Solution
score_map = {'Excellent': 5, 'Good': 3, 'Poor': 1}
numeric = ratings.map(score_map)
print(f"Average rating: {numeric.mean()}") # 3.4
DataFrame Operations
The DataFrame is where you'll spend 90% of your time in Pandas. It's a 2D table with rows and columns, similar to an Excel spreadsheet or SQL table. This section covers the essential operations every data analyst needs to master.
Exploring Your Data
When you first load a dataset, you need to understand what you're working with. These are the first commands every analyst runs:
import pandas as pd
# Load sample data
df = pd.read_csv('sales_data.csv')
# Quick look at the data
print(df.head()) # First 5 rows
print(df.tail(3)) # Last 3 rows
print(df.shape) # (rows, columns)
print(df.columns) # List of column names
print(df.dtypes) # Data type of each column
df.info()
The info() method gives you a complete overview: column names, data types,
non-null counts, and memory usage. It's the best single command to understand your dataset.
Pro tip: Run df.info() immediately after loading any dataset
to spot missing values and incorrect data types.
# Complete dataset overview
df.info()
# Output:
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 1000 entries, 0 to 999
# Data columns (total 5 columns):
# # Column Non-Null Count Dtype
# --- ------ -------------- -----
# 0 date 1000 non-null object
# 1 region 1000 non-null object
# 2 revenue 985 non-null float64 # <-- 15 missing values!
# 3 units 1000 non-null int64
# 4 product 1000 non-null object
# memory usage: 39.1 KB
Selecting Columns
There are two ways to select columns: bracket notation (always works) and dot notation (convenient but limited).
# Bracket notation (recommended - always works)
revenue = df['revenue'] # Returns a Series
subset = df[['region', 'revenue']] # Returns a DataFrame with 2 columns
# Dot notation (convenient but limited)
revenue = df.revenue # Works only if column name is valid Python identifier
# df.revenue target # ERROR - space in name
# df['revenue target'] # This works!
Always use bracket notation (df['column']) for column names with spaces, special characters, or names that match Python keywords like class or type.
Filtering Rows
Filtering is how you select specific rows based on conditions. This is one of the most common operations in data analysis.
# Single condition
high_revenue = df[df['revenue'] > 10000]
# Multiple conditions (use & for AND, | for OR)
# IMPORTANT: Wrap each condition in parentheses!
north_high = df[(df['region'] == 'North') & (df['revenue'] > 10000)]
low_or_south = df[(df['revenue'] < 5000) | (df['region'] == 'South')]
# Using isin() for multiple values
target_regions = df[df['region'].isin(['North', 'East', 'West'])]
# String contains (for text searching)
laptop_sales = df[df['product'].str.contains('Laptop', case=False)]
Common pitfall: forgetting the parentheses. df[df['a'] > 5 & df['b'] < 10] causes an error because & binds more tightly than the comparisons. Always use df[(df['a'] > 5) & (df['b'] < 10)].
Creating New Columns
Adding calculated columns is a core skill. You can perform arithmetic, apply functions, or use conditional logic.
# Simple arithmetic
df['profit_margin'] = (df['revenue'] - df['cost']) / df['revenue'] * 100
# Using apply() for custom functions
def categorize_revenue(amount):
    if amount > 15000:
        return 'High'
    elif amount > 8000:
        return 'Medium'
    else:
        return 'Low'
df['revenue_category'] = df['revenue'].apply(categorize_revenue)
# Using np.where() for simple if-else (faster than apply)
import numpy as np
df['is_profitable'] = np.where(df['revenue'] > df['cost'], 'Yes', 'No')
loc and iloc: Precise Selection
For selecting specific rows and columns together, use loc (by label) or iloc (by position).
| Method | Selects By | Example | Description |
|---|---|---|---|
| loc | Label | df.loc[0:5, 'revenue'] | Rows 0-5 (inclusive), 'revenue' column |
| iloc | Position | df.iloc[0:5, 2] | First 5 rows, 3rd column |
| loc | Label | df.loc[df['region']=='North', ['revenue','units']] | Filter + select columns |
# Select specific rows and columns by label
df.loc[0:10, ['region', 'revenue', 'units']]
# Select by position (first 10 rows, columns 1-3)
df.iloc[0:10, 1:4]
# Update values using loc
df.loc[df['region'] == 'North', 'revenue'] = df.loc[df['region'] == 'North', 'revenue'] * 1.1
Handling Missing Data
Real-world data is messy. Missing values (shown as NaN) are common.
Here's how to find and handle them:
# Check for missing values
print(df.isnull().sum()) # Count NaN per column
# Drop rows with any missing values
df_clean = df.dropna()
# Drop rows with missing values in specific columns
df_clean = df.dropna(subset=['revenue', 'region'])
# Fill missing values
df['revenue'] = df['revenue'].fillna(0) # Fill with zero
df['revenue'] = df['revenue'].fillna(df['revenue'].mean()) # Fill with average
df['category'] = df['category'].fillna('Unknown') # Fill with placeholder
Practice Questions: DataFrame
Practice essential DataFrame operations.
Task: Filter a sales DataFrame to show only rows where revenue is greater than 10000.
Show Solution
high_revenue = df[df['revenue'] > 10000]
print(high_revenue)
Task: Create a new DataFrame with only the 'product', 'quantity', and 'price' columns.
Show Solution
subset = df[['product', 'quantity', 'price']]
print(subset.head())
Given: A DataFrame with 'revenue' and 'cost' columns.
Task: Create a 'profit' column (revenue - cost) and a 'profit_margin' column (profit/revenue * 100).
Show Solution
df['profit'] = df['revenue'] - df['cost']
df['profit_margin'] = (df['profit'] / df['revenue']) * 100
print(df[['revenue', 'cost', 'profit', 'profit_margin']].head())
Task: Find all rows where region is 'North' AND revenue is above 10000.
Show Solution
result = df[(df['region'] == 'North') & (df['revenue'] > 10000)]
print(result)
Task: Fill missing revenue values with the median, then create a 'revenue_tier' column that categorizes as 'High' (>15000), 'Medium' (>8000), or 'Low'.
Show Solution
# Fill missing values
df['revenue'] = df['revenue'].fillna(df['revenue'].median())
# Create revenue tiers
def categorize(val):
    if val > 15000:
        return 'High'
    elif val > 8000:
        return 'Medium'
    else:
        return 'Low'
df['revenue_tier'] = df['revenue'].apply(categorize)
print(df['revenue_tier'].value_counts())
Reading & Writing Data
Real data doesn't live in Python code—it's in CSV files, Excel spreadsheets, databases, and APIs. Pandas makes it easy to load data from these sources and save your results back. This section covers the most common I/O operations you'll use daily.
Reading CSV Files
CSV (Comma-Separated Values) is the most common format for data exchange.
The read_csv() function handles most scenarios with the right parameters.
import pandas as pd
# Basic CSV read
df = pd.read_csv('sales_data.csv')
# With common options
df = pd.read_csv('sales_data.csv',
    parse_dates=['date'],                   # Convert date columns to datetime
    usecols=['date', 'revenue', 'region'],  # Load only specific columns
    nrows=1000,                             # Read only first 1000 rows (for testing)
    na_values=['N/A', 'missing']            # Treat these as NaN
)
| Parameter | Purpose | Example |
|---|---|---|
| parse_dates | Convert columns to datetime | parse_dates=['date', 'created_at'] |
| usecols | Load only specific columns (faster) | usecols=['id', 'name', 'revenue'] |
| nrows | Read only first N rows | nrows=100 |
| skiprows | Skip rows at start of file | skiprows=2 (skip header rows) |
| encoding | Handle special characters | encoding='utf-8' or 'latin-1' |
Performance tip: for large files, use usecols to load only the columns you need. This can reduce memory usage by 50-80% and dramatically speed up loading.
Reading Excel Files
Pandas can read Excel files (.xlsx, .xls) directly. You'll need the openpyxl library installed.
# Install openpyxl first: pip install openpyxl
# Read Excel file
df = pd.read_excel('report.xlsx')
# Read specific sheet
df = pd.read_excel('report.xlsx', sheet_name='Q1 Sales')
# Read all sheets (returns a dictionary)
all_sheets = pd.read_excel('report.xlsx', sheet_name=None)
print(all_sheets.keys()) # dict_keys(['Q1 Sales', 'Q2 Sales', 'Q3 Sales'])
Writing Data
After cleaning and transforming your data, you'll want to save the results.
Use to_csv() for CSV and to_excel() for Excel.
# Save to CSV (most common)
df.to_csv('cleaned_data.csv', index=False) # index=False prevents extra column
# Save to Excel
df.to_excel('report.xlsx', sheet_name='Results', index=False)
# Save multiple sheets to Excel
with pd.ExcelWriter('report.xlsx') as writer:
    df_north.to_excel(writer, sheet_name='North', index=False)
    df_south.to_excel(writer, sheet_name='South', index=False)
    df_summary.to_excel(writer, sheet_name='Summary', index=False)
Always pass index=False when saving to CSV unless you specifically need the index. Otherwise, you'll get an extra unnamed column when you reload the file.
Practice Questions: Reading & Writing Data
Practice loading and saving data efficiently.
Task: Load only the 'date', 'region', and 'revenue' columns from 'sales_data.csv' with dates parsed as datetime.
Show Solution
df = pd.read_csv('sales_data.csv',
    usecols=['date', 'region', 'revenue'],
    parse_dates=['date']
)
print(df.dtypes)
Task: Drop rows with missing revenue, then save to 'sales_cleaned.csv' without the index column.
Show Solution
df_clean = df.dropna(subset=['revenue'])
df_clean.to_csv('sales_cleaned.csv', index=False)
print(f"Saved {len(df_clean)} rows")
Task: Read the 'Q1_Data' sheet from 'quarterly_report.xlsx' and print the shape.
Show Solution
df = pd.read_excel('quarterly_report.xlsx', sheet_name='Q1_Data')
print(f"Shape: {df.shape}")
print(df.head())
Task: Split data by region into separate DataFrames and save each as a different sheet in 'regional_report.xlsx'.
Show Solution
with pd.ExcelWriter('regional_report.xlsx') as writer:
    for region in df['region'].unique():
        region_df = df[df['region'] == region]
        region_df.to_excel(writer, sheet_name=region, index=False)
print("Report saved with separate sheets per region")
Groupby & Aggregation
Groupby is where Pandas shines brightest. Need to calculate total sales by region? Average revenue by month? Top products by category? Groupby handles all of these. It's the Pandas equivalent of SQL's GROUP BY clause, but more powerful and flexible.
Understanding Groupby
The groupby operation follows a "split-apply-combine" pattern:
- Split: Divide data into groups based on one or more columns
- Apply: Calculate something for each group (sum, mean, count, etc.)
- Combine: Merge results back into a single DataFrame
Split-Apply-Combine
When you call df.groupby('region'), Pandas creates separate groups for each unique region.
Then you apply an aggregation (like .sum()) to each group independently.
Finally, results are combined into a new DataFrame.
Real example: "What's the total revenue for each region?" →
df.groupby('region')['revenue'].sum()
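If you want to see the "split" step explicitly, a GroupBy object lets you inspect the groups before applying any aggregation. A small sketch with illustrative data:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'South', 'North', 'East'],
    'revenue': [45000, 500, 800, 52000]
})

grouped = df.groupby('region')
print(sorted(grouped.groups))   # the group keys: ['East', 'North', 'South']

# get_group() returns the rows belonging to one group
print(grouped.get_group('North')['revenue'].tolist())  # [45000, 800]
```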
Basic Groupby Operations
import pandas as pd
# Sample data
df = pd.DataFrame({
    'region': ['North', 'South', 'North', 'East', 'South', 'East'],
    'product': ['Laptop', 'Mouse', 'Mouse', 'Laptop', 'Keyboard', 'Mouse'],
    'revenue': [45000, 500, 800, 52000, 1500, 600]
})
# Basic groupby - sum revenue by region
by_region = df.groupby('region')['revenue'].sum()
print(by_region)
# region
# East 52600
# North 45800
# South 2000
# Multiple aggregations
summary = df.groupby('region')['revenue'].agg(['sum', 'mean', 'count'])
print(summary)
# sum mean count
# East 52600 26300.0 2
# North 45800 22900.0 2
# South 2000 1000.0 2
Named Aggregations (Modern Syntax)
The modern agg() syntax lets you give custom names to your aggregated columns,
making your output cleaner and more readable.
# Named aggregations (cleaner output)
result = df.groupby('region').agg(
    total_revenue=('revenue', 'sum'),
    avg_revenue=('revenue', 'mean'),
    num_transactions=('revenue', 'count'),
    max_sale=('revenue', 'max')
).reset_index()
print(result)
# region total_revenue avg_revenue num_transactions max_sale
# 0 East 52600 26300.0 2 52000
# 1 North 45800 22900.0 2 45000
# 2 South 2000 1000.0 2 1500
By default, groupby returns a result indexed by the group keys. Calling .reset_index() converts the keys back into regular columns, which is usually what you want for further analysis or export.
Grouping by Multiple Columns
You can group by multiple columns to get more granular breakdowns—like revenue by region AND product.
# Group by multiple columns
breakdown = df.groupby(['region', 'product']).agg(
    total_revenue=('revenue', 'sum'),
    count=('revenue', 'count')
).reset_index()
print(breakdown)
# region product total_revenue count
# 0 East Laptop 52000 1
# 1 East Mouse 600 1
# 2 North Laptop 45000 1
# 3 North Mouse 800 1
# 4 South Keyboard 1500 1
# 5 South Mouse 500 1
Common Aggregation Functions
| Function | Description | Example |
|---|---|---|
| sum | Total of all values | ('revenue', 'sum') |
| mean | Average value | ('revenue', 'mean') |
| count | Number of non-null values | ('revenue', 'count') |
| min / max | Minimum / Maximum value | ('revenue', 'min') |
| std | Standard deviation | ('revenue', 'std') |
| first / last | First / Last value in group | ('date', 'first') |
| nunique | Number of unique values | ('product', 'nunique') |
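You can also pass a plain list of these function names to agg() when you don't need custom column names. A minimal sketch using the sample data from this section:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'South', 'North', 'East', 'South', 'East'],
    'revenue': [45000, 500, 800, 52000, 1500, 600]
})

# A list of function names produces one column per function
out = df.groupby('region')['revenue'].agg(['min', 'max', 'std'])
print(out.loc['North', 'min'])  # 800
print(out.loc['East', 'max'])   # 52000
```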
Transform: Keep Original Rows
Sometimes you want to add group-level calculations back to each row (like "this row's revenue vs. regional average").
The transform() method does exactly this.
# Add regional average to each row
df['regional_avg'] = df.groupby('region')['revenue'].transform('mean')
df['vs_avg'] = df['revenue'] - df['regional_avg']
print(df[['region', 'revenue', 'regional_avg', 'vs_avg']])
# region revenue regional_avg vs_avg
# 0 North 45000 22900.0 22100.0
# 1 South 500 1000.0 -500.0
# 2 North 800 22900.0 -22100.0
# 3 East 52000 26300.0 25700.0
# 4 South 1500 1000.0 500.0
# 5 East 600 26300.0 -25700.0
Practice Questions: Groupby & Aggregation
Master the art of aggregating and summarizing data.
Task: Calculate total revenue for each product category.
Show Solution
by_product = df.groupby('product')['revenue'].sum()
print(by_product.sort_values(ascending=False))
Task: Count how many transactions occurred in each region.
Show Solution
transaction_count = df.groupby('region').size()
# OR
transaction_count = df['region'].value_counts()
print(transaction_count)
Task: For each region, calculate: total revenue, average revenue, max single sale, and number of unique products.
Show Solution
summary = df.groupby('region').agg(
    total_revenue=('revenue', 'sum'),
    avg_revenue=('revenue', 'mean'),
    max_sale=('revenue', 'max'),
    unique_products=('product', 'nunique')
).reset_index()
print(summary)
Task: Calculate each region's percentage share of total revenue.
Show Solution
region_totals = df.groupby('region')['revenue'].sum().reset_index()
region_totals['pct_share'] = (region_totals['revenue'] / region_totals['revenue'].sum()) * 100
print(region_totals.sort_values('pct_share', ascending=False))
Task: Add a column showing whether each transaction's revenue is above or below its regional average.
Show Solution
df['regional_avg'] = df.groupby('region')['revenue'].transform('mean')
df['performance'] = df.apply(
    lambda x: 'Above Average' if x['revenue'] > x['regional_avg'] else 'Below Average',
    axis=1
)
print(df[['region', 'revenue', 'regional_avg', 'performance']])
Key Takeaways
Series & DataFrame
Pandas core data structures with labeled axes
Easy I/O
Read and write CSV or Excel with one function call
Groupby
Aggregate and summarize data efficiently
Data Cleaning
Fill missing values and transform columns for analysis
Transformations
Apply functions and compute new columns
Export
Save cleaned data for visualization and modeling
Knowledge Check
Test your understanding of Pandas fundamentals. Answer all questions to check your score.