Introduction to Data Transformation
Data transformation is the process of reshaping, restructuring, and combining data to prepare it for analysis. While data cleaning focuses on fixing data quality issues, transformation focuses on organizing data into the right structure for your specific analytical needs. Pandas provides powerful tools for aggregating data across groups, combining multiple datasets, and applying custom transformations.
Why Data Transformation Matters
Real-world data rarely comes in the exact format you need. Sales data might be stored by transaction, but you need monthly summaries. Customer data might be split across multiple tables that need joining. Raw values might need to be categorized or normalized. Data transformation bridges the gap between how data is stored and how you need to analyze it.
Aggregation
Summarize detailed data into meaningful statistics. Calculate totals, averages, counts, and other metrics across groups of records.
Combining
Merge data from multiple sources into unified datasets. Join tables on common keys, concatenate DataFrames, and handle relationships.
Reshaping
Restructure data layout with pivot tables and melting operations. Convert between wide and long formats for different analysis needs.
Custom Functions
Apply your own logic to transform values. Use apply, map, and transform to implement complex business rules and calculations.
Sample Dataset for This Lesson
Throughout this lesson, we'll work with sales and employee data to demonstrate transformation techniques. Let's start by creating our sample datasets:
import pandas as pd
import numpy as np
# Sales data
sales = pd.DataFrame({
'date': pd.date_range('2024-01-01', periods=10, freq='D'),
'store': ['NYC', 'LA', 'NYC', 'LA', 'NYC', 'LA', 'NYC', 'LA', 'NYC', 'LA'],
'product': ['Widget', 'Widget', 'Gadget', 'Gadget', 'Widget',
'Gadget', 'Widget', 'Widget', 'Gadget', 'Gadget'],
'quantity': [10, 15, 8, 12, 20, 5, 18, 22, 9, 11],
'price': [25.0, 25.0, 45.0, 45.0, 25.0, 45.0, 25.0, 25.0, 45.0, 45.0]
})
# Add revenue column
sales['revenue'] = sales['quantity'] * sales['price']
print(sales)
# date store product quantity price revenue
# 0 2024-01-01 NYC Widget 10 25.0 250.0
# 1 2024-01-02 LA Widget 15 25.0 375.0
# 2 2024-01-03 NYC Gadget 8 45.0 360.0
# 3 2024-01-04 LA Gadget 12 45.0 540.0
# ...
Tip: Before transforming a dataset, get oriented with df.info(), df.head(), and df.describe(). Understanding the current shape and types helps you choose the right transformation approach.
GroupBy Aggregations
The GroupBy operation is one of the most powerful features in Pandas. It follows the "split-apply-combine" pattern: split the data into groups based on one or more keys, apply a function to each group independently, and combine the results into a new DataFrame. This is essential for calculating summary statistics across categories, time periods, or any other grouping.
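To make the split-apply-combine pattern concrete, here is a minimal sketch (using a small made-up two-store frame) that performs the split and apply steps explicitly before showing the usual one-line equivalent:

```python
import pandas as pd

sales = pd.DataFrame({
    'store': ['NYC', 'LA', 'NYC', 'LA'],
    'revenue': [250.0, 375.0, 360.0, 540.0],
})

# Split: groupby() yields (key, sub-DataFrame) pairs
for store, group in sales.groupby('store'):
    # Apply: compute one statistic per group
    print(store, group['revenue'].sum())

# Combine: the same split-apply-combine in one line
totals = sales.groupby('store')['revenue'].sum()
print(totals)
# store
# LA     915.0
# NYC    610.0
# Name: revenue, dtype: float64
```

In practice you rarely iterate groups manually; the one-liner is both shorter and faster.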
Basic GroupBy Syntax
The groupby() method creates a GroupBy object that you can then apply aggregation functions to.
The simplest pattern is grouping by a single column and calculating one statistic:
# Group by store and calculate total revenue
store_revenue = sales.groupby('store')['revenue'].sum()
print(store_revenue)
# store
# LA      2185.0
# NYC     1965.0
# Name: revenue, dtype: float64
# Group by product and get multiple stats
product_stats = sales.groupby('product')['quantity'].agg(['sum', 'mean', 'count'])
print(product_stats)
#          sum  mean  count
# product
# Gadget    45   9.0      5
# Widget    85  17.0      5
Multiple Aggregations with agg()
The agg() method allows you to apply multiple aggregation functions at once, and even
apply different functions to different columns. You can use built-in function names as strings,
NumPy functions, or custom functions:
# Multiple aggregations on single column
revenue_analysis = sales.groupby('store')['revenue'].agg([
'sum', 'mean', 'std', 'min', 'max'
])
print(revenue_analysis)
#           sum   mean      std    min    max
# store
# LA     2185.0  437.0  137.414  225.0  550.0
# NYC    1965.0  393.0   95.368  250.0  500.0
# Different aggregations for different columns
summary = sales.groupby('store').agg({
'quantity': ['sum', 'mean'],
'revenue': ['sum', 'max'],
'product': 'count'
})
print(summary)
#       quantity       revenue        product
#            sum  mean     sum    max   count
# store
# LA          65  13.0  2185.0  550.0       5
# NYC         65  13.0  1965.0  500.0       5
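The text above notes that agg() also accepts custom functions. A brief sketch, using a hypothetical revenue_range helper to measure the spread within each store:

```python
import pandas as pd

sales = pd.DataFrame({
    'store': ['NYC', 'LA', 'NYC', 'LA'],
    'revenue': [250.0, 375.0, 500.0, 540.0],
})

# Custom aggregation: spread between the largest and smallest sale
def revenue_range(series):
    return series.max() - series.min()

# Mix a built-in name with the custom function; the column for
# the custom function takes its __name__
result = sales.groupby('store')['revenue'].agg(['sum', revenue_range])
print(result)
#          sum  revenue_range
# store
# LA     915.0          165.0
# NYC    750.0          250.0
```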
Grouping by Multiple Columns
You can group by multiple columns to create hierarchical groupings. This is useful for analyzing data across multiple dimensions, like store AND product:
# Group by store AND product
multi_group = sales.groupby(['store', 'product'])['revenue'].sum()
print(multi_group)
# store  product
# LA     Gadget    1260.0
#        Widget     925.0
# NYC    Gadget     765.0
#        Widget    1200.0
# Name: revenue, dtype: float64
# Convert to DataFrame and reset index
multi_df = multi_group.reset_index()
print(multi_df)
#   store product  revenue
# 0    LA  Gadget   1260.0
# 1    LA  Widget    925.0
# 2   NYC  Gadget    765.0
# 3   NYC  Widget   1200.0
Named Aggregations
Pandas (version 0.25 and later) supports named aggregations for cleaner, more readable output. This gives you control over the resulting column names:
# Named aggregations for clean column names
store_summary = sales.groupby('store').agg(
total_revenue=('revenue', 'sum'),
avg_quantity=('quantity', 'mean'),
num_transactions=('quantity', 'count'),
max_sale=('revenue', 'max')
)
print(store_summary)
#        total_revenue  avg_quantity  num_transactions  max_sale
# store
# LA            2185.0          13.0                 5     550.0
# NYC           1965.0          13.0                 5     500.0
Filtering Groups
The filter() method lets you keep or discard entire groups based on a condition.
This is useful for focusing on groups that meet certain criteria:
# Keep only products with total quantity > 50
high_volume = sales.groupby('product').filter(lambda x: x['quantity'].sum() > 50)
print(high_volume)
# Only Widget rows remain (total quantity = 85)
# Keep stores with average revenue > 400
high_revenue_stores = sales.groupby('store').filter(
    lambda x: x['revenue'].mean() > 400
)
print(high_revenue_stores['store'].unique())
# ['LA']  (LA's mean revenue is 437.0; NYC's 393.0 doesn't qualify)
Note: By default, groupby() excludes rows whose grouping key is NaN; keep them with groupby(dropna=False). Also remember that the result
of groupby().agg() has the grouping column(s) as the index; use
reset_index() to convert them back to regular columns.
Practice Questions: GroupBy Aggregations
Test your understanding with these coding exercises.
Given:
df = pd.DataFrame({
'category': ['A', 'B', 'A', 'B', 'A'],
'value': [10, 20, 15, 25, 12]
})
Task: Calculate the average value for each category.
Expected output:
category
A 12.333333
B 22.500000
Show Solution
avg_by_category = df.groupby('category')['value'].mean()
print(avg_by_category)
Given:
orders = pd.DataFrame({
'customer': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob', 'Alice'],
'product': ['Widget', 'Gadget', 'Gadget', 'Widget', 'Widget', 'Widget']
})
Task: Count how many orders each customer made.
Show Solution
order_counts = orders.groupby('customer').size()
# OR
order_counts = orders.groupby('customer')['product'].count()
print(order_counts)
Given:
sales = pd.DataFrame({
'region': ['East', 'West', 'East', 'West', 'East'],
'amount': [100, 150, 200, 175, 125]
})
Task: For each region, calculate the sum, mean, and max of amount.
Show Solution
region_stats = sales.groupby('region')['amount'].agg(['sum', 'mean', 'max'])
print(region_stats)
Given:
transactions = pd.DataFrame({
'store': ['A', 'B', 'A', 'B', 'A', 'B'],
'quantity': [5, 8, 3, 10, 7, 6],
'revenue': [50, 80, 30, 100, 70, 60]
})
Task: Create a summary with named columns: total_qty, avg_revenue, num_sales for each store.
Show Solution
summary = transactions.groupby('store').agg(
total_qty=('quantity', 'sum'),
avg_revenue=('revenue', 'mean'),
num_sales=('quantity', 'count')
)
print(summary)
Merge, Join, Concat
Real-world data is often spread across multiple tables or files. Combining data from different sources
is a fundamental skill in data analysis. Pandas provides three main ways to combine DataFrames:
merge() for database-style joins, join() for index-based combining, and
concat() for stacking DataFrames together.
Setup: Sample DataFrames
Let's create sample DataFrames to demonstrate combining operations:
# Customers table
customers = pd.DataFrame({
'customer_id': [1, 2, 3, 4],
'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'city': ['NYC', 'LA', 'NYC', 'Chicago']
})
# Orders table
orders = pd.DataFrame({
'order_id': [101, 102, 103, 104, 105],
'customer_id': [1, 2, 1, 3, 5], # Note: customer 5 doesn't exist
'amount': [250, 150, 300, 200, 175]
})
# Products table
products = pd.DataFrame({
'product_id': ['A', 'B', 'C'],
'product_name': ['Widget', 'Gadget', 'Gizmo'],
'price': [25.0, 45.0, 35.0]
})
Merge: Database-Style Joins
The merge() function works like SQL JOIN operations. You specify which columns to
match on and what type of join to perform:
# Inner join (default) - only matching rows
inner = pd.merge(orders, customers, on='customer_id')
print(inner)
# order_id customer_id amount name city
# 0 101 1 250 Alice NYC
# 1 103 1 300 Alice NYC
# 2 102 2 150 Bob LA
# 3 104 3 200 Charlie NYC
# Note: order 105 (customer 5) excluded - no matching customer
# Left join - all from left, matching from right
left = pd.merge(orders, customers, on='customer_id', how='left')
print(left)
# order_id customer_id amount name city
# 0 101 1 250 Alice NYC
# 1 102 2 150 Bob LA
# 2 103 1 300 Alice NYC
# 3 104 3 200 Charlie NYC
# 4 105 5 175 NaN NaN # No match
Join Types Explained
| Join Type | Description | Use Case |
|---|---|---|
| inner | Only rows with matches in BOTH tables | When you only want complete records |
| left | All rows from left, matches from right | Keep all primary records, add info if available |
| right | All rows from right, matches from left | Less common; a left join with the tables swapped |
| outer | All rows from BOTH tables | When you need the complete picture of both datasets |
# Right join
right = pd.merge(orders, customers, on='customer_id', how='right')
print(right)
# Customer 4 (Diana) appears with NaN for order columns
# Outer join - all rows from both
outer = pd.merge(orders, customers, on='customer_id', how='outer')
print(outer)
# Includes both order 105 (no customer) and Diana (no orders)
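When a join produces unexpected NaNs, merge()'s indicator=True option (not used in the examples above) adds a _merge column showing where each row came from. A quick diagnostic sketch with small made-up tables:

```python
import pandas as pd

orders = pd.DataFrame({'order_id': [101, 105], 'customer_id': [1, 5]})
customers = pd.DataFrame({'customer_id': [1, 4], 'name': ['Alice', 'Diana']})

# indicator=True adds a _merge column: 'both', 'left_only', or 'right_only'
checked = pd.merge(orders, customers, on='customer_id',
                   how='outer', indicator=True)
print(checked[['customer_id', '_merge']])
# customer 1 matched, 5 exists only in orders, 4 only in customers
```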
Merging on Different Column Names
When the columns you want to join on have different names in each DataFrame, use
left_on and right_on:
# Different column names
employees = pd.DataFrame({
'emp_id': [1, 2, 3],
'emp_name': ['John', 'Jane', 'Joe']
})
departments = pd.DataFrame({
'employee_id': [1, 2, 3],
'department': ['Sales', 'Marketing', 'IT']
})
# Merge with different column names
result = pd.merge(employees, departments,
left_on='emp_id', right_on='employee_id')
print(result)
# emp_id emp_name employee_id department
# 0 1 John 1 Sales
# 1 2 Jane 2 Marketing
# 2 3 Joe 3 IT
Concatenating DataFrames
Use concat() to stack DataFrames vertically (adding rows) or horizontally (adding columns):
# Vertical concatenation (stacking rows)
q1_sales = pd.DataFrame({'product': ['A', 'B'], 'sales': [100, 150]})
q2_sales = pd.DataFrame({'product': ['A', 'B'], 'sales': [120, 180]})
yearly = pd.concat([q1_sales, q2_sales], ignore_index=True)
print(yearly)
# product sales
# 0 A 100
# 1 B 150
# 2 A 120
# 3 B 180
# Horizontal concatenation (adding columns)
names = pd.DataFrame({'name': ['Alice', 'Bob']})
ages = pd.DataFrame({'age': [25, 30]})
combined = pd.concat([names, ages], axis=1)
print(combined)
# name age
# 0 Alice 25
# 1 Bob 30
The join() Method
The join() method is a convenient way to combine DataFrames on their indices:
# Set index for joining
customers_indexed = customers.set_index('customer_id')
orders_indexed = orders.set_index('customer_id')
# Join on index
joined = customers_indexed.join(orders_indexed, how='inner')
print(joined)
# name city order_id amount
# customer_id
# 1 Alice NYC 101 250
# 1 Alice NYC 103 300
# 2 Bob LA 102 150
# 3 Charlie NYC 104 200
Use merge() when joining on column values (most common).
Use join() when your DataFrames already have meaningful indices. Use
concat() when stacking DataFrames with the same structure.
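One more safeguard worth knowing: merge() accepts a validate argument that raises a MergeError when the join's cardinality isn't what you expect, such as a supposedly unique key that turns out to be duplicated. A sketch with made-up tables:

```python
import pandas as pd

customers = pd.DataFrame({'customer_id': [1, 2], 'name': ['Alice', 'Bob']})
orders = pd.DataFrame({'customer_id': [1, 1, 2], 'amount': [250, 300, 150]})

# Assert each customer appears at most once on the left side
result = pd.merge(customers, orders, on='customer_id',
                  validate='one_to_many')
print(result)

# A duplicated customer row makes the same merge raise:
dup_customers = pd.concat([customers, customers])
try:
    pd.merge(dup_customers, orders, on='customer_id', validate='one_to_many')
except pd.errors.MergeError:
    print('duplicate keys detected on the left side')
```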
Practice Questions: Merge, Join, Concat
Test your understanding with these coding exercises.
Given:
students = pd.DataFrame({'id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Charlie']})
grades = pd.DataFrame({'student_id': [1, 2, 4], 'grade': ['A', 'B', 'C']})
Task: Merge students with grades using an inner join.
Show Solution
result = pd.merge(students, grades, left_on='id', right_on='student_id')
print(result)
# Only Alice and Bob appear (student 4 has no match)
Given:
jan = pd.DataFrame({'month': ['Jan'], 'sales': [1000]})
feb = pd.DataFrame({'month': ['Feb'], 'sales': [1200]})
mar = pd.DataFrame({'month': ['Mar'], 'sales': [1100]})
Task: Stack all three DataFrames into one with a clean index.
Show Solution
quarterly = pd.concat([jan, feb, mar], ignore_index=True)
print(quarterly)
Given:
products = pd.DataFrame({'id': [1, 2, 3], 'name': ['Widget', 'Gadget', 'Gizmo']})
inventory = pd.DataFrame({'product_id': [1, 3], 'stock': [50, 30]})
Task: Get all products with their stock (NaN for missing inventory).
Show Solution
result = pd.merge(products, inventory, left_on='id',
right_on='product_id', how='left')
print(result)
# Gadget will have NaN for stock
Given:
orders = pd.DataFrame({'order_id': [1, 2], 'customer_id': [10, 20], 'product_id': ['A', 'B']})
customers = pd.DataFrame({'customer_id': [10, 20], 'name': ['Alice', 'Bob']})
products = pd.DataFrame({'product_id': ['A', 'B'], 'product_name': ['Widget', 'Gadget']})
Task: Create a view with order_id, customer name, and product name.
Show Solution
# Chain merges
result = (orders
.merge(customers, on='customer_id')
.merge(products, on='product_id')
[['order_id', 'name', 'product_name']]
)
print(result)
Pivot Tables & Cross-tabulations
Pivot tables are one of the most powerful tools for data analysis, allowing you to reshape data and create summary views. They transform data from a long format (many rows) to a wide format (categories as columns), making patterns easier to spot. Cross-tabulations (crosstabs) are specialized pivot tables for counting occurrences across categories.
Creating Pivot Tables
The pivot_table() function creates a spreadsheet-style pivot table. You specify which
columns become the index (rows), columns, and what values to aggregate:
# Sample data
sales = pd.DataFrame({
'date': pd.date_range('2024-01-01', periods=8, freq='D'),
'store': ['NYC', 'LA', 'NYC', 'LA', 'NYC', 'LA', 'NYC', 'LA'],
'product': ['Widget', 'Widget', 'Gadget', 'Gadget',
'Widget', 'Gadget', 'Gadget', 'Widget'],
'quantity': [10, 15, 8, 12, 20, 5, 9, 22],
'revenue': [250, 375, 360, 540, 500, 225, 405, 550]
})
# Basic pivot table
pivot = pd.pivot_table(
sales,
values='revenue', # What to aggregate
index='store', # Row labels
columns='product', # Column labels
aggfunc='sum' # How to aggregate
)
print(pivot)
# product Gadget Widget
# store
# LA 765.0 925.0
# NYC 765.0 750.0
Multiple Aggregations
You can apply multiple aggregation functions and summarize multiple value columns:
# Multiple aggregation functions
pivot_multi = pd.pivot_table(
sales,
values='revenue',
index='store',
columns='product',
aggfunc=['sum', 'mean', 'count']
)
print(pivot_multi)
# sum mean count
# product Gadget Widget Gadget Widget Gadget Widget
# store
# LA 765.0 925.0 382.5 462.500000 2 2
# NYC 765.0 750.0 382.5 375.000000 2 2
# Multiple values with different aggregations
pivot_values = pd.pivot_table(
sales,
values=['quantity', 'revenue'],
index='store',
columns='product',
aggfunc={'quantity': 'sum', 'revenue': 'mean'}
)
print(pivot_values)
Adding Margins (Totals)
The margins=True parameter adds row and column totals, similar to Excel pivot tables:
# Pivot table with totals
pivot_totals = pd.pivot_table(
sales,
values='revenue',
index='store',
columns='product',
aggfunc='sum',
margins=True, # Add totals
margins_name='Total' # Name for totals row/column
)
print(pivot_totals)
# product Gadget Widget Total
# store
# LA 765.0 925.0 1690.0
# NYC 765.0 750.0 1515.0
# Total 1530.0 1675.0 3205.0
Cross-Tabulation (crosstab)
Cross-tabulation is perfect for analyzing the frequency of combinations between two categorical variables. It's commonly used for categorical data analysis:
# Count occurrences of each combination
crosstab = pd.crosstab(sales['store'], sales['product'])
print(crosstab)
# product Gadget Widget
# store
# LA 2 2
# NYC 2 2
# Add margins
crosstab_margins = pd.crosstab(
sales['store'],
sales['product'],
margins=True
)
print(crosstab_margins)
# product Gadget Widget All
# store
# LA 2 2 4
# NYC 2 2 4
# All 4 4 8
# Normalize to get percentages
crosstab_pct = pd.crosstab(
sales['store'],
sales['product'],
normalize='all' # 'index', 'columns', or 'all'
)
print(crosstab_pct)
# product Gadget Widget
# store
# LA 0.25 0.25
# NYC 0.25 0.25
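Besides counting, crosstab() can aggregate a values column when you pass values and aggfunc, making it a compact alternative to pivot_table() for two categorical axes. A brief sketch with a small made-up frame:

```python
import pandas as pd

sales = pd.DataFrame({
    'store': ['NYC', 'LA', 'NYC', 'LA'],
    'product': ['Widget', 'Widget', 'Gadget', 'Gadget'],
    'revenue': [250, 375, 360, 540],
})

# Sum revenue per store/product cell instead of counting rows
revenue_tab = pd.crosstab(sales['store'], sales['product'],
                          values=sales['revenue'], aggfunc='sum')
print(revenue_tab)
```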
Pivot vs Pivot Table
Pandas also has a simpler pivot() method for reshaping without aggregation. Use it
when each index/column combination appears exactly once:
# Simple pivot (no aggregation - needs unique combinations)
data = pd.DataFrame({
'date': ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02'],
'metric': ['views', 'clicks', 'views', 'clicks'],
'value': [1000, 50, 1200, 65]
})
reshaped = data.pivot(index='date', columns='metric', values='value')
print(reshaped)
# metric clicks views
# date
# 2024-01-01 50 1000
# 2024-01-02 65 1200
Note: pivot() will raise an error if there are duplicate
index/column combinations. Use pivot_table() when you need to aggregate duplicates. Also,
pivot operations can create many NaN values if combinations don't exist in your data.
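The lesson introduction also mentioned melting, the reverse of pivoting: pd.melt() turns a wide frame back into long format. A sketch using a wide frame like the one produced above, recreated inline:

```python
import pandas as pd

# Wide format: one column per metric
wide = pd.DataFrame({
    'date': ['2024-01-01', '2024-01-02'],
    'views': [1000, 1200],
    'clicks': [50, 65],
})

# Melt back to long format: one (date, metric, value) row per cell
long = pd.melt(wide, id_vars='date',
               var_name='metric', value_name='value')
print(long)
#          date  metric  value
# 0  2024-01-01   views   1000
# 1  2024-01-02   views   1200
# 2  2024-01-01  clicks     50
# 3  2024-01-02  clicks     65
```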
Practice Questions: Pivot Tables
Test your understanding with these coding exercises.
Given:
survey = pd.DataFrame({
'gender': ['M', 'F', 'M', 'F', 'M', 'F'],
'response': ['Yes', 'Yes', 'No', 'Yes', 'Yes', 'No']
})
Task: Create a crosstab showing the count of each response by gender.
Show Solution
result = pd.crosstab(survey['gender'], survey['response'])
print(result)
Given:
sales = pd.DataFrame({
'region': ['East', 'West', 'East', 'West'],
'product': ['A', 'A', 'B', 'B'],
'sales': [100, 150, 200, 175]
})
Task: Create a pivot table showing total sales by region and product.
Show Solution
pivot = pd.pivot_table(sales, values='sales',
index='region', columns='product',
aggfunc='sum')
print(pivot)
Given:
orders = pd.DataFrame({
'month': ['Jan', 'Jan', 'Feb', 'Feb', 'Jan', 'Feb'],
'category': ['Electronics', 'Clothing', 'Electronics', 'Clothing', 'Clothing', 'Electronics'],
'amount': [500, 200, 600, 250, 300, 450]
})
Task: Create a pivot table with average amounts, including row and column totals.
Show Solution
pivot = pd.pivot_table(orders, values='amount',
index='month', columns='category',
aggfunc='mean', margins=True)
print(pivot)
Given:
customers = pd.DataFrame({
'age_group': ['18-25', '26-35', '18-25', '36-45', '26-35', '18-25', '36-45', '26-35'],
'purchased': ['Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes']
})
Task: Create a crosstab showing what percentage of each age group purchased (normalize by index/row).
Show Solution
result = pd.crosstab(customers['age_group'],
customers['purchased'],
normalize='index') # Normalize by row
print(result)
# Shows percentage of Yes/No within each age group
Apply, Map, Transform
When built-in Pandas functions don't cover your needs, you can apply custom functions to your data.
Pandas provides three main methods for this: apply() for flexible function application,
map() for element-wise transformations, and transform() for group-aware
operations that preserve the original shape.
The apply() Method
apply() is the most versatile method. It can work on entire columns, rows, or DataFrames,
applying any function you provide:
# Sample data
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie'],
'age': [25, 30, 35],
'salary': [50000, 60000, 75000]
})
# Apply to a single column (Series)
df['age_doubled'] = df['age'].apply(lambda x: x * 2)
print(df['age_doubled'])
# 0 50
# 1 60
# 2 70
# Apply a named function
def categorize_salary(salary):
if salary < 55000:
return 'Entry'
elif salary < 70000:
return 'Mid'
else:
return 'Senior'
df['level'] = df['salary'].apply(categorize_salary)
print(df[['name', 'salary', 'level']])
# name salary level
# 0 Alice 50000 Entry
# 1 Bob 60000 Mid
# 2 Charlie 75000 Senior
Apply Across Rows
Use axis=1 to apply a function across each row, accessing multiple columns:
# Apply across rows (access multiple columns)
df['description'] = df.apply(
lambda row: f"{row['name']} is {row['age']} years old",
axis=1
)
print(df['description'])
# 0 Alice is 25 years old
# 1 Bob is 30 years old
# 2 Charlie is 35 years old
# Calculate bonus based on multiple columns
def calculate_bonus(row):
base_bonus = row['salary'] * 0.1
if row['age'] > 30:
return base_bonus * 1.2 # Extra for seniority
return base_bonus
df['bonus'] = df.apply(calculate_bonus, axis=1)
print(df[['name', 'salary', 'bonus']])
The map() Method
map() is specifically for Series (single columns) and is great for value substitution
using a dictionary or function:
# Map with a dictionary (value substitution)
level_mapping = {
'Entry': 1,
'Mid': 2,
'Senior': 3
}
df['level_code'] = df['level'].map(level_mapping)
print(df[['name', 'level', 'level_code']])
# name level level_code
# 0 Alice Entry 1
# 1 Bob Mid 2
# 2 Charlie Senior 3
# Map with a function
df['name_upper'] = df['name'].map(str.upper)
print(df['name_upper'])
# 0 ALICE
# 1 BOB
# 2 CHARLIE
# Map with lambda
df['age_category'] = df['age'].map(lambda x: 'Young' if x < 30 else 'Mature')
print(df[['name', 'age', 'age_category']])
The transform() Method
transform() is special because it returns data with the same shape as the input.
This is particularly useful with GroupBy for creating group-level statistics aligned with original rows:
# Sample data
sales = pd.DataFrame({
'store': ['NYC', 'NYC', 'LA', 'LA', 'NYC'],
'product': ['Widget', 'Gadget', 'Widget', 'Gadget', 'Widget'],
'revenue': [100, 150, 200, 120, 180]
})
# Transform: add group mean to each row (same shape as original)
sales['store_avg'] = sales.groupby('store')['revenue'].transform('mean')
print(sales)
# store product revenue store_avg
# 0 NYC Widget 100 143.333333
# 1 NYC Gadget 150 143.333333
# 2 LA Widget 200 160.000000
# 3 LA Gadget 120 160.000000
# 4 NYC Widget 180 143.333333
# Calculate percentage of store total
sales['pct_of_store'] = (sales['revenue'] /
sales.groupby('store')['revenue'].transform('sum') * 100)
print(sales[['store', 'revenue', 'pct_of_store']])
Comparing apply, map, and transform
| Method | Works On | Returns | Best For |
|---|---|---|---|
| map() | Series only | Same-size Series | Simple value substitution, dictionary mapping |
| apply() | Series or DataFrame | Flexible (can change shape) | Complex row/column operations, custom logic |
| transform() | Series or GroupBy | Same size as input | Group statistics aligned with original data |
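A classic use of transform()'s shape-preserving behavior is normalizing values within groups, for example a per-group z-score. A minimal sketch with a made-up two-store frame:

```python
import pandas as pd

sales = pd.DataFrame({
    'store': ['NYC', 'NYC', 'LA', 'LA'],
    'revenue': [100.0, 200.0, 300.0, 500.0],
})

# transform() broadcasts each group's mean and std back to the
# original rows, so the arithmetic aligns element-wise
grouped = sales.groupby('store')['revenue']
sales['zscore'] = ((sales['revenue'] - grouped.transform('mean'))
                   / grouped.transform('std'))
print(sales)
# Each store's revenues now have mean 0 and sample std 1 within the group
```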
Performance Tip: Vectorization
While apply() is flexible, it's often slower than vectorized operations.
When possible, use built-in Pandas/NumPy functions:
# Slow: using apply
df['tax'] = df['salary'].apply(lambda x: x * 0.2)
# Fast: vectorized operation (preferred)
df['tax'] = df['salary'] * 0.2
# Slow: apply with condition
df['bonus'] = df['age'].apply(lambda x: 1000 if x > 30 else 500)
# Fast: np.where (preferred)
import numpy as np
df['bonus'] = np.where(df['age'] > 30, 1000, 500)
Use apply() when you need complex logic that can't be
expressed with vectorized operations. For simple transformations, always prefer built-in
Pandas methods or NumPy functions for better performance.
Practice Questions: Apply, Map, Transform
Test your understanding with these coding exercises.
Given:
df = pd.DataFrame({
'grade': ['A', 'B', 'C', 'A', 'B']
})
Task: Create a new column 'points' that maps A=4, B=3, C=2.
Show Solution
grade_points = {'A': 4, 'B': 3, 'C': 2}
df['points'] = df['grade'].map(grade_points)
print(df)
Given:
df = pd.DataFrame({
'price': [10.5, 20.0, 15.75, 8.25]
})
Task: Create a 'price_with_tax' column that adds 10% tax to each price.
Show Solution
# Using apply
df['price_with_tax'] = df['price'].apply(lambda x: x * 1.1)
# Better: vectorized
df['price_with_tax'] = df['price'] * 1.1
print(df)
Given:
df = pd.DataFrame({
'first_name': ['John', 'Jane', 'Bob'],
'last_name': ['Doe', 'Smith', 'Johnson']
})
Task: Create a 'full_name' column combining first and last names.
Show Solution
# Using apply with axis=1
df['full_name'] = df.apply(
lambda row: f"{row['first_name']} {row['last_name']}",
axis=1
)
# Better: string concatenation
df['full_name'] = df['first_name'] + ' ' + df['last_name']
print(df)
Given:
sales = pd.DataFrame({
'region': ['East', 'East', 'West', 'West', 'East'],
'amount': [100, 150, 200, 180, 120]
})
Task: Add a column showing how each sale compares to its region's average (difference from mean).
Show Solution
sales['diff_from_avg'] = (sales['amount'] -
sales.groupby('region')['amount'].transform('mean'))
print(sales)
# Shows positive for above-average, negative for below
Key Takeaways
GroupBy Power
Use groupby() to split data into groups and aggregate with
sum(), mean(), or custom functions via agg()
Merge Mastery
Combine DataFrames with merge() using inner, left, right, or outer joins.
Use concat() to stack DataFrames vertically or horizontally
Pivot Tables
Reshape data with pivot_table() for summary views.
Use crosstab() to count category combinations with margins
Custom Functions
Apply custom logic with apply(). Use map() for value
substitution and transform() for group-aligned operations
Performance Matters
Prefer vectorized operations over apply() when possible.
Use NumPy functions like np.where() for conditional logic
Named Aggregations
Use named aggregations for cleaner output:
agg(total=('col', 'sum')) gives you control over column names
Ready for the Next Step?
You've mastered data transformation! With these skills, you can aggregate, reshape, and combine any dataset. Next, complete the Pandas assignment to practice these techniques on real-world data scenarios.
Knowledge Check
Test your understanding of data transformation concepts:
What does groupby('category')['value'].sum() return?
Which merge type keeps ALL rows from BOTH DataFrames?
What is the purpose of margins=True in a pivot table?
What is the key difference between apply() and transform()?
Which function is best for substituting values using a dictionary?
When should you use pd.concat() instead of pd.merge()?