Introduction to Pandas
Pandas is the most widely used Python library for data analysis. Whether you're cleaning messy datasets, calculating business metrics, or preparing data for visualization, Pandas is your go-to tool. In this lesson, you'll learn the fundamentals that data analysts use every day.
Why Pandas Matters
Imagine you have a spreadsheet with 100,000 rows of sales data. You need to filter by region, calculate monthly totals, and find your top products. In Excel, this means slow, error-prone manual work. In Pandas, it takes three lines of code and runs in seconds.
Pandas (Python Data Analysis Library) was created in 2008 by Wes McKinney while working at a hedge fund. Today, it's used by data analysts, scientists, and engineers at virtually every company that works with data—from startups to Fortune 500 companies.
The Two Core Data Structures
Everything in Pandas revolves around two structures: Series (1D) and DataFrame (2D). Think of a Series as a single column in Excel, and a DataFrame as the entire spreadsheet.
Series
A Series is a one-dimensional labeled array. It's like a single column in a spreadsheet, but with a labeled index that lets you access values by name, not just position.
Why it matters: When you select one column from a DataFrame (like df['revenue']),
you get a Series. Understanding Series operations is essential for data manipulation.
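For instance, selecting one column from a small DataFrame returns a Series that shares the DataFrame's row index (the column names here are just illustrative):

```python
import pandas as pd

df = pd.DataFrame({'region': ['North', 'South'], 'revenue': [12000, 9500]})
col = df['revenue']           # selecting one column returns a Series
print(type(col).__name__)     # Series
print(col.index.tolist())     # [0, 1] -- same row index as the DataFrame
```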
DataFrame
A DataFrame is a two-dimensional table with labeled rows and columns. Each column is a Series, and all columns share the same row index. This is where you'll spend 90% of your time in Pandas.
Think of it as: An Excel spreadsheet or a SQL table—but with Python's power for filtering, transforming, and analyzing data programmatically.
Installing and Importing Pandas
Before using Pandas, you need to install it (if not already installed) and import it into your Python script.
The community convention is to import Pandas as pd—you'll see this in every tutorial and codebase.
# Install pandas (run once in terminal)
pip install pandas
# Import pandas with the standard alias
import pandas as pd
# Check your pandas version
print(pd.__version__) # e.g., 2.1.0
Always use import pandas as pd. This is the universal convention, and using any other alias will confuse other developers reading your code.
Your First DataFrame
Let's create a simple DataFrame from scratch to understand how it works. The most common way is from a Python dictionary where keys become column names.
import pandas as pd
# Create a DataFrame from a dictionary
df = pd.DataFrame({
    'region': ['North', 'South', 'East', 'West'],
    'revenue': [12000, 9500, 15000, 8000],
    'customers': [150, 120, 180, 95]
})
print(df)
Output:
region revenue customers
0 North 12000 150
1 South 9500 120
2 East 15000 180
3 West 8000 95
Notice the numbers 0, 1, 2, 3 on the left? That's the index—a label for each row.
Pandas automatically creates a numeric index if you don't specify one.
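You can also supply your own row labels instead of the automatic 0, 1, 2, 3. A quick sketch using the same regional data:

```python
import pandas as pd

# Pass index= to label rows yourself instead of using 0..n-1
df = pd.DataFrame(
    {'revenue': [12000, 9500, 15000, 8000]},
    index=['North', 'South', 'East', 'West']
)
print(df.loc['East', 'revenue'])  # access a row by its label -> 15000
```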
Practice Questions: Introduction
Test your understanding with these hands-on exercises.
Task: Create a DataFrame with 3 products (Laptop, Mouse, Keyboard) and their prices (45000, 500, 1500). Print the DataFrame.
Show Solution
import pandas as pd
df = pd.DataFrame({
    'product': ['Laptop', 'Mouse', 'Keyboard'],
    'price': [45000, 500, 1500]
})
print(df)
Task: Load any CSV file and print its shape (rows, columns) and column names.
Show Solution
import pandas as pd
df = pd.read_csv('sales_data.csv')
print(f"Shape: {df.shape}") # (rows, cols)
print(f"Columns: {df.columns.tolist()}")
Given:
df = pd.DataFrame({
    'product': ['Laptop', 'Mouse', 'Keyboard'],
    'price': [45000, 500, 1500],
    'quantity': [5, 50, 30]
})
Task: Add a new column total that equals price * quantity.
Show Solution
df['total'] = df['price'] * df['quantity']
print(df)
Working with Series
A Series is the building block of everything in Pandas. Before you can work with full datasets, you need to master Series operations. When you select a single column from a DataFrame, you get a Series—so understanding them is essential.
Creating a Series
There are several ways to create a Series. The most common is from a Python list or dictionary. When you use a dictionary, the keys become the index labels.
import pandas as pd
# From a list (auto-generates numeric index)
revenue = pd.Series([12000, 9500, 15000, 8000])
print(revenue)
# Output:
# 0 12000
# 1 9500
# 2 15000
# 3 8000
# dtype: int64
# From a dictionary (keys become index labels)
revenue = pd.Series({
    'North': 12000,
    'South': 9500,
    'East': 15000,
    'West': 8000
})
print(revenue)
# Output:
# North 12000
# South 9500
# East 15000
# West 8000
# dtype: int64
Tip: Prefer accessing values by labels (revenue['North']) instead of positions (revenue[0]). This makes your code more readable and less error-prone when rows get reordered.
Accessing Series Values
You can access Series values in multiple ways: by label, by position, or with boolean conditions.
| Method | Syntax | Description | Example |
|---|---|---|---|
| By label | s['label'] | Access using index label | revenue['North'] → 12000 |
| By position | s.iloc[n] | Access using numeric position | revenue.iloc[0] → 12000 |
| Multiple labels | s[['a','b']] | Select multiple values | revenue[['North','East']] |
| Boolean mask | s[condition] | Filter by condition | revenue[revenue > 10000] |
# Access by label
print(revenue['North']) # 12000
# Access by position
print(revenue.iloc[0]) # 12000
# Select multiple regions
print(revenue[['North', 'East']])
# North 12000
# East 15000
# Filter: only regions with revenue > 10000
high_revenue = revenue[revenue > 10000]
print(high_revenue)
# North 12000
# East 15000
Series Operations
One of Pandas' superpowers is vectorized operations—you can perform calculations on entire Series without writing loops. This is both faster and more readable.
# Mathematical operations (applied to all elements)
revenue_in_lakhs = revenue / 100000 # Convert to lakhs
print(revenue_in_lakhs)
# North 0.12
# South 0.095
# East 0.15
# West 0.08
# Add 10% bonus to all regions
bonus = revenue * 0.10
print(bonus)
# North 1200.0
# South 950.0
# East 1500.0
# West 800.0
Common Series Methods
These are the methods you'll use daily when analyzing data:
# Statistical summaries
print(revenue.sum()) # 44500 (total)
print(revenue.mean()) # 11125.0 (average)
print(revenue.max()) # 15000 (highest)
print(revenue.min()) # 8000 (lowest)
print(revenue.std()) # Standard deviation
# Useful information
print(revenue.count()) # 4 (number of values)
print(revenue.idxmax()) # 'East' (label of max value)
print(revenue.idxmin()) # 'West' (label of min value)
# Get all values as NumPy array
print(revenue.values) # array([12000, 9500, 15000, 8000])
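If you want several of these statistics at once, describe() bundles them into a single Series. A minimal sketch with the same revenue data:

```python
import pandas as pd

revenue = pd.Series({'North': 12000, 'South': 9500, 'East': 15000, 'West': 8000})

# describe() returns count, mean, std, min, quartiles, and max in one call
stats = revenue.describe()
print(stats['count'])  # 4.0
print(stats['mean'])   # 11125.0
```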
The map() Method
The map() method lets you transform values using a dictionary or function.
This is incredibly useful for converting categories to numeric values for analysis.
Use case: Converting text ratings (Good, Average, Poor) to numeric scores (3, 2, 1) for calculating averages.
# Map categories to numbers
priority_scores = {'high': 3, 'medium': 2, 'low': 1}
priorities = pd.Series(['high', 'low', 'medium', 'high'])
numeric_scores = priorities.map(priority_scores)
print(numeric_scores)
# 0 3
# 1 1
# 2 2
# 3 3
# Now you can calculate the average priority!
print(numeric_scores.mean()) # 2.25
Practice Questions: Series
Apply what you learned about Series operations.
Task: Create a Series with months (Jan, Feb, Mar) as index and sales values (5000, 7500, 6200). Print the total sales.
Show Solution
sales = pd.Series({
    'Jan': 5000,
    'Feb': 7500,
    'Mar': 6200
})
print(f"Total sales: {sales.sum()}") # 18700
Given:
scores = pd.Series({'Alice': 85, 'Bob': 92, 'Charlie': 78, 'Diana': 95})
Task: Find who has the highest score and what it is.
Show Solution
top_scorer = scores.idxmax()
top_score = scores.max()
print(f"{top_scorer} scored {top_score}") # Diana scored 95
Given:
revenue = pd.Series({
    'North': 12000, 'South': 9500, 'East': 15000, 'West': 8000
})
Task: Find regions with revenue above average and calculate their percentage of total revenue.
Show Solution
avg = revenue.mean()
above_avg = revenue[revenue > avg]
pct = (above_avg.sum() / revenue.sum()) * 100
print(f"Above average regions: {list(above_avg.index)}")
print(f"Their share: {pct:.1f}%") # 60.7%
Given:
ratings = pd.Series(['Excellent', 'Good', 'Poor', 'Good', 'Excellent'])
Task: Map these to scores (Excellent=5, Good=3, Poor=1) and find the average rating.
Show Solution
score_map = {'Excellent': 5, 'Good': 3, 'Poor': 1}
numeric = ratings.map(score_map)
print(f"Average rating: {numeric.mean()}") # 3.4
DataFrame Operations
The DataFrame is where you'll spend 90% of your time in Pandas. It's a 2D table with rows and columns, similar to an Excel spreadsheet or SQL table. This section covers the essential operations every data analyst needs to master.
Exploring Your Data
When you first load a dataset, you need to understand what you're working with. These are the first commands every analyst runs:
import pandas as pd
# Load sample data
df = pd.read_csv('sales_data.csv')
# Quick look at the data
print(df.head()) # First 5 rows
print(df.tail(3)) # Last 3 rows
print(df.shape) # (rows, columns)
print(df.columns) # List of column names
print(df.dtypes) # Data type of each column
df.info()
The info() method gives you a complete overview: column names, data types,
non-null counts, and memory usage. It's the best single command to understand your dataset.
Pro tip: Run df.info() immediately after loading any dataset
to spot missing values and incorrect data types.
# Complete dataset overview
df.info()
# Output:
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 1000 entries, 0 to 999
# Data columns (total 5 columns):
# # Column Non-Null Count Dtype
# --- ------ -------------- -----
# 0 date 1000 non-null object
# 1 region 1000 non-null object
# 2 revenue 985 non-null float64 # <-- 15 missing values!
# 3 units 1000 non-null int64
# 4 product 1000 non-null object
# memory usage: 39.1 KB
Selecting Columns
There are two ways to select columns: bracket notation (always works) and dot notation (convenient but limited).
# Bracket notation (recommended - always works)
revenue = df['revenue'] # Returns a Series
subset = df[['region', 'revenue']] # Returns a DataFrame with 2 columns
# Dot notation (convenient but limited)
revenue = df.revenue # Works only if column name is valid Python identifier
# df.revenue target # ERROR - space in name
# df['revenue target'] # This works!
Always use bracket notation (df['column']) for column names with spaces, special characters, or names that match Python keywords like class or type.
Filtering Rows
Filtering is how you select specific rows based on conditions. This is one of the most common operations in data analysis.
# Single condition
high_revenue = df[df['revenue'] > 10000]
# Multiple conditions (use & for AND, | for OR)
# IMPORTANT: Wrap each condition in parentheses!
north_high = df[(df['region'] == 'North') & (df['revenue'] > 10000)]
low_or_south = df[(df['revenue'] < 5000) | (df['region'] == 'South')]
# Using isin() for multiple values
target_regions = df[df['region'].isin(['North', 'East', 'West'])]
# String contains (for text searching)
laptop_sales = df[df['product'].str.contains('Laptop', case=False)]
Common pitfall: forgetting the parentheses. df[df['a'] > 5 & df['b'] < 10] causes an error because & binds more tightly than the comparisons. Always use df[(df['a'] > 5) & (df['b'] < 10)].
Creating New Columns
Adding calculated columns is a core skill. You can perform arithmetic, apply functions, or use conditional logic.
# Simple arithmetic
df['profit_margin'] = (df['revenue'] - df['cost']) / df['revenue'] * 100
# Using apply() for custom functions
def categorize_revenue(amount):
    if amount > 15000:
        return 'High'
    elif amount > 8000:
        return 'Medium'
    else:
        return 'Low'
df['revenue_category'] = df['revenue'].apply(categorize_revenue)
# Using np.where() for simple if-else (faster than apply)
import numpy as np
df['is_profitable'] = np.where(df['revenue'] > df['cost'], 'Yes', 'No')
loc and iloc: Precise Selection
For selecting specific rows and columns together, use loc (by label) or iloc (by position).
| Method | Selects By | Example | Description |
|---|---|---|---|
| loc | Label | df.loc[0:5, 'revenue'] | Rows 0-5 (inclusive), 'revenue' column |
| iloc | Position | df.iloc[0:5, 2] | First 5 rows, 3rd column |
| loc | Label | df.loc[df['region']=='North', ['revenue','units']] | Filter + select columns |
# Select specific rows and columns by label
df.loc[0:10, ['region', 'revenue', 'units']]
# Select by position (first 10 rows, columns 1-3)
df.iloc[0:10, 1:4]
# Update values using loc
df.loc[df['region'] == 'North', 'revenue'] = df.loc[df['region'] == 'North', 'revenue'] * 1.1
Handling Missing Data
Real-world data is messy. Missing values (shown as NaN) are common.
Here's how to find and handle them:
# Check for missing values
print(df.isnull().sum()) # Count NaN per column
# Drop rows with any missing values
df_clean = df.dropna()
# Drop rows with missing values in specific columns
df_clean = df.dropna(subset=['revenue', 'region'])
# Fill missing values
df['revenue'] = df['revenue'].fillna(0) # Fill with zero
df['revenue'] = df['revenue'].fillna(df['revenue'].mean()) # Fill with average
df['category'] = df['category'].fillna('Unknown') # Fill with placeholder
Practice Questions: DataFrame
Practice essential DataFrame operations.
Task: Filter a sales DataFrame to show only rows where revenue is greater than 10000.
Show Solution
high_revenue = df[df['revenue'] > 10000]
print(high_revenue)
Task: Create a new DataFrame with only the 'product', 'quantity', and 'price' columns.
Show Solution
subset = df[['product', 'quantity', 'price']]
print(subset.head())
Given: A DataFrame with 'revenue' and 'cost' columns.
Task: Create a 'profit' column (revenue - cost) and a 'profit_margin' column (profit/revenue * 100).
Show Solution
df['profit'] = df['revenue'] - df['cost']
df['profit_margin'] = (df['profit'] / df['revenue']) * 100
print(df[['revenue', 'cost', 'profit', 'profit_margin']].head())
Task: Find all rows where region is 'North' AND revenue is above 10000.
Show Solution
result = df[(df['region'] == 'North') & (df['revenue'] > 10000)]
print(result)
Task: Fill missing revenue values with the median, then create a 'revenue_tier' column that categorizes as 'High' (>15000), 'Medium' (>8000), or 'Low'.
Show Solution
# Fill missing values
df['revenue'] = df['revenue'].fillna(df['revenue'].median())
# Create revenue tiers
def categorize(val):
    if val > 15000:
        return 'High'
    elif val > 8000:
        return 'Medium'
    else:
        return 'Low'
df['revenue_tier'] = df['revenue'].apply(categorize)
print(df['revenue_tier'].value_counts())
Reading & Writing Data
Real data doesn't live in Python code—it's in CSV files, Excel spreadsheets, databases, and APIs. Pandas makes it easy to load data from these sources and save your results back. This section covers the most common I/O operations you'll use daily.
Reading CSV Files
CSV (Comma-Separated Values) is the most common format for data exchange.
The read_csv() function handles most scenarios with the right parameters.
import pandas as pd
# Basic CSV read
df = pd.read_csv('sales_data.csv')
# With common options
df = pd.read_csv('sales_data.csv',
    parse_dates=['date'],                   # Convert date columns to datetime
    usecols=['date', 'revenue', 'region'],  # Load only specific columns
    nrows=1000,                             # Read only first 1000 rows (for testing)
    na_values=['N/A', 'missing']            # Treat these as NaN
)
| Parameter | Purpose | Example |
|---|---|---|
| parse_dates | Convert columns to datetime | parse_dates=['date', 'created_at'] |
| usecols | Load only specific columns (faster) | usecols=['id', 'name', 'revenue'] |
| nrows | Read only first N rows | nrows=100 |
| skiprows | Skip rows at start of file | skiprows=2 (skip header rows) |
| encoding | Handle special characters | encoding='utf-8' or 'latin-1' |
Performance tip: for large files, use usecols to load only the columns you need. This can reduce memory usage by 50-80% and dramatically speed up loading.
Reading Excel Files
Pandas can read Excel files (.xlsx, .xls) directly. You'll need the openpyxl library installed.
# Install openpyxl first: pip install openpyxl
# Read Excel file
df = pd.read_excel('report.xlsx')
# Read specific sheet
df = pd.read_excel('report.xlsx', sheet_name='Q1 Sales')
# Read all sheets (returns a dictionary)
all_sheets = pd.read_excel('report.xlsx', sheet_name=None)
print(all_sheets.keys()) # dict_keys(['Q1 Sales', 'Q2 Sales', 'Q3 Sales'])
Writing Data
After cleaning and transforming your data, you'll want to save the results.
Use to_csv() for CSV and to_excel() for Excel.
# Save to CSV (most common)
df.to_csv('cleaned_data.csv', index=False) # index=False prevents extra column
# Save to Excel
df.to_excel('report.xlsx', sheet_name='Results', index=False)
# Save multiple sheets to Excel
with pd.ExcelWriter('report.xlsx') as writer:
    df_north.to_excel(writer, sheet_name='North', index=False)
    df_south.to_excel(writer, sheet_name='South', index=False)
    df_summary.to_excel(writer, sheet_name='Summary', index=False)
Always pass index=False when saving to CSV unless you specifically need the index. Otherwise, you'll get an extra unnamed column when you reload the file.
Practice Questions: Reading & Writing Data
Practice loading and saving data efficiently.
Task: Load only the 'date', 'region', and 'revenue' columns from 'sales_data.csv' with dates parsed as datetime.
Show Solution
df = pd.read_csv('sales_data.csv',
    usecols=['date', 'region', 'revenue'],
    parse_dates=['date']
)
print(df.dtypes)
Task: Drop rows with missing revenue, then save to 'sales_cleaned.csv' without the index column.
Show Solution
df_clean = df.dropna(subset=['revenue'])
df_clean.to_csv('sales_cleaned.csv', index=False)
print(f"Saved {len(df_clean)} rows")
Task: Read the 'Q1_Data' sheet from 'quarterly_report.xlsx' and print the shape.
Show Solution
df = pd.read_excel('quarterly_report.xlsx', sheet_name='Q1_Data')
print(f"Shape: {df.shape}")
print(df.head())
Task: Split data by region into separate DataFrames and save each as a different sheet in 'regional_report.xlsx'.
Show Solution
with pd.ExcelWriter('regional_report.xlsx') as writer:
    for region in df['region'].unique():
        region_df = df[df['region'] == region]
        region_df.to_excel(writer, sheet_name=region, index=False)
print("Report saved with separate sheets per region")
Groupby & Aggregation
Groupby is where Pandas shines brightest. Need to calculate total sales by region? Average revenue by month? Top products by category? Groupby handles all of these. It's the Pandas equivalent of SQL's GROUP BY clause, but more powerful and flexible.
Understanding Groupby
The groupby operation follows a "split-apply-combine" pattern:
- Split: Divide data into groups based on one or more columns
- Apply: Calculate something for each group (sum, mean, count, etc.)
- Combine: Merge results back into a single DataFrame
Split-Apply-Combine
When you call df.groupby('region'), Pandas creates separate groups for each unique region.
Then you apply an aggregation (like .sum()) to each group independently.
Finally, results are combined into a new DataFrame.
Real example: "What's the total revenue for each region?" →
df.groupby('region')['revenue'].sum()
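If you want to see the "split" step explicitly, a GroupBy object lets you inspect the groups before applying any aggregation. A small sketch with illustrative data:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'South', 'North', 'East'],
    'revenue': [45000, 500, 800, 52000]
})

grouped = df.groupby('region')
print(sorted(grouped.groups))   # the group keys: ['East', 'North', 'South']

# get_group() returns the rows belonging to one group
print(grouped.get_group('North')['revenue'].tolist())  # [45000, 800]
```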
Basic Groupby Operations
import pandas as pd
# Sample data
df = pd.DataFrame({
    'region': ['North', 'South', 'North', 'East', 'South', 'East'],
    'product': ['Laptop', 'Mouse', 'Mouse', 'Laptop', 'Keyboard', 'Mouse'],
    'revenue': [45000, 500, 800, 52000, 1500, 600]
})
# Basic groupby - sum revenue by region
by_region = df.groupby('region')['revenue'].sum()
print(by_region)
# region
# East 52600
# North 45800
# South 2000
# Multiple aggregations
summary = df.groupby('region')['revenue'].agg(['sum', 'mean', 'count'])
print(summary)
# sum mean count
# East 52600 26300.0 2
# North 45800 22900.0 2
# South 2000 1000.0 2
Named Aggregations (Modern Syntax)
The modern agg() syntax lets you give custom names to your aggregated columns,
making your output cleaner and more readable.
# Named aggregations (cleaner output)
result = df.groupby('region').agg(
    total_revenue=('revenue', 'sum'),
    avg_revenue=('revenue', 'mean'),
    num_transactions=('revenue', 'count'),
    max_sale=('revenue', 'max')
).reset_index()
print(result)
# region total_revenue avg_revenue num_transactions max_sale
# 0 East 52600 26300.0 2 52000
# 1 North 45800 22900.0 2 45000
# 2 South 2000 1000.0 2 1500
By default, groupby returns a result indexed by the group keys. Calling .reset_index() converts the keys back into regular columns, which is usually what you want for further analysis or export.
Grouping by Multiple Columns
You can group by multiple columns to get more granular breakdowns—like revenue by region AND product.
# Group by multiple columns
breakdown = df.groupby(['region', 'product']).agg(
    total_revenue=('revenue', 'sum'),
    count=('revenue', 'count')
).reset_index()
print(breakdown)
# region product total_revenue count
# 0 East Laptop 52000 1
# 1 East Mouse 600 1
# 2 North Laptop 45000 1
# 3 North Mouse 800 1
# 4 South Keyboard 1500 1
# 5 South Mouse 500 1
Common Aggregation Functions
| Function | Description | Example |
|---|---|---|
| sum | Total of all values | ('revenue', 'sum') |
| mean | Average value | ('revenue', 'mean') |
| count | Number of non-null values | ('revenue', 'count') |
| min / max | Minimum / Maximum value | ('revenue', 'min') |
| std | Standard deviation | ('revenue', 'std') |
| first / last | First / Last value in group | ('date', 'first') |
| nunique | Number of unique values | ('product', 'nunique') |
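You can also pass a plain list of these function names to agg() when you don't need custom column names. A minimal sketch using the sample data from this section:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'South', 'North', 'East', 'South', 'East'],
    'revenue': [45000, 500, 800, 52000, 1500, 600]
})

# A list of function names produces one column per function
out = df.groupby('region')['revenue'].agg(['min', 'max', 'std'])
print(out.loc['North', 'min'])  # 800
print(out.loc['East', 'max'])   # 52000
```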
Transform: Keep Original Rows
Sometimes you want to add group-level calculations back to each row (like "this row's revenue vs. regional average").
The transform() method does exactly this.
# Add regional average to each row
df['regional_avg'] = df.groupby('region')['revenue'].transform('mean')
df['vs_avg'] = df['revenue'] - df['regional_avg']
print(df[['region', 'revenue', 'regional_avg', 'vs_avg']])
# region revenue regional_avg vs_avg
# 0 North 45000 22900.0 22100.0
# 1 South 500 1000.0 -500.0
# 2 North 800 22900.0 -22100.0
# 3 East 52000 26300.0 25700.0
# 4 South 1500 1000.0 500.0
# 5 East 600 26300.0 -25700.0
Practice Questions: Groupby & Aggregation
Master the art of aggregating and summarizing data.
Task: Calculate total revenue for each product category.
Show Solution
by_product = df.groupby('product')['revenue'].sum()
print(by_product.sort_values(ascending=False))
Task: Count how many transactions occurred in each region.
Show Solution
transaction_count = df.groupby('region').size()
# OR
transaction_count = df['region'].value_counts()
print(transaction_count)
Task: For each region, calculate: total revenue, average revenue, max single sale, and number of unique products.
Show Solution
summary = df.groupby('region').agg(
    total_revenue=('revenue', 'sum'),
    avg_revenue=('revenue', 'mean'),
    max_sale=('revenue', 'max'),
    unique_products=('product', 'nunique')
).reset_index()
print(summary)
Task: Calculate each region's percentage share of total revenue.
Show Solution
region_totals = df.groupby('region')['revenue'].sum().reset_index()
region_totals['pct_share'] = (region_totals['revenue'] / region_totals['revenue'].sum()) * 100
print(region_totals.sort_values('pct_share', ascending=False))
Task: Add a column showing whether each transaction's revenue is above or below its regional average.
Show Solution
df['regional_avg'] = df.groupby('region')['revenue'].transform('mean')
df['performance'] = df.apply(
    lambda x: 'Above Average' if x['revenue'] > x['regional_avg'] else 'Below Average',
    axis=1
)
print(df[['region', 'revenue', 'regional_avg', 'performance']])
Key Takeaways
Series & DataFrame
Pandas core data structures with labeled axes
Easy I/O
Read and write CSV or Excel with one function call
Groupby
Aggregate and summarize data efficiently
Data Cleaning
Fill missing values and transform columns for analysis
Transformations
Apply functions and compute new columns
Export
Save cleaned data for visualization and modeling
Knowledge Check
Test your understanding of Pandas fundamentals. Answer all questions to check your score.