Module 4.4

Data Transformation

Learn powerful data transformation techniques to reshape, aggregate, and combine datasets. Master GroupBy operations, merging DataFrames, pivot tables, and custom transformations for comprehensive data analysis.

50 min
Intermediate
Hands-on
What You'll Learn
  • Aggregate data with GroupBy operations
  • Merge, join, and concat DataFrames
  • Create pivot tables and cross-tabulations
  • Apply custom functions to data
  • Use map and transform methods
01

Introduction to Data Transformation

Data transformation is the process of reshaping, restructuring, and combining data to prepare it for analysis. While data cleaning focuses on fixing data quality issues, transformation focuses on organizing data into the right structure for your specific analytical needs. Pandas provides powerful tools for aggregating data across groups, combining multiple datasets, and applying custom transformations.

Why Data Transformation Matters

Real-world data rarely comes in the exact format you need. Sales data might be stored by transaction, but you need monthly summaries. Customer data might be split across multiple tables that need joining. Raw values might need to be categorized or normalized. Data transformation bridges the gap between how data is stored and how you need to analyze it.

Aggregation

Summarize detailed data into meaningful statistics. Calculate totals, averages, counts, and other metrics across groups of records.

Combining

Merge data from multiple sources into unified datasets. Join tables on common keys, concatenate DataFrames, and handle relationships.

Reshaping

Restructure data layout with pivot tables and melting operations. Convert between wide and long formats for different analysis needs.

Custom Functions

Apply your own logic to transform values. Use apply, map, and transform to implement complex business rules and calculations.

Sample Dataset for This Lesson

Throughout this lesson, we'll work with sales, customer, and employee data to demonstrate transformation techniques. Let's start by creating the sales dataset:

import pandas as pd
import numpy as np

# Sales data
sales = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=10, freq='D'),
    'store': ['NYC', 'LA', 'NYC', 'LA', 'NYC', 'LA', 'NYC', 'LA', 'NYC', 'LA'],
    'product': ['Widget', 'Widget', 'Gadget', 'Gadget', 'Widget', 
                'Gadget', 'Widget', 'Widget', 'Gadget', 'Gadget'],
    'quantity': [10, 15, 8, 12, 20, 5, 18, 22, 9, 11],
    'price': [25.0, 25.0, 45.0, 45.0, 25.0, 45.0, 25.0, 25.0, 45.0, 45.0]
})

# Add revenue column
sales['revenue'] = sales['quantity'] * sales['price']

print(sales)
#         date store product  quantity  price  revenue
# 0 2024-01-01   NYC  Widget        10   25.0    250.0
# 1 2024-01-02    LA  Widget        15   25.0    375.0
# 2 2024-01-03   NYC  Gadget         8   45.0    360.0
# 3 2024-01-04    LA  Gadget        12   45.0    540.0
# ...
Pro tip: Before transforming data, always understand its current structure using df.info(), df.head(), and df.describe(). This helps you choose the right transformation approach.
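To make that habit concrete, here is a minimal sketch of those three inspection calls on a tiny throwaway frame (the values are invented for illustration):

```python
import pandas as pd

# Tiny throwaway frame, just to illustrate the inspection calls
df = pd.DataFrame({
    'store': ['NYC', 'LA', 'NYC'],
    'revenue': [250.0, 375.0, 360.0]
})

df.info()             # column dtypes and non-null counts
print(df.head(2))     # first rows, to eyeball the layout
print(df.describe())  # summary statistics for numeric columns
```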
02

GroupBy Aggregations

The GroupBy operation is one of the most powerful features in Pandas. It follows the "split-apply-combine" pattern: split the data into groups based on one or more keys, apply a function to each group independently, and combine the results into a new DataFrame. This is essential for calculating summary statistics across categories, time periods, or any other grouping.
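The pattern can be sketched by hand before reaching for groupby(); this toy example does the split, apply, and combine steps explicitly, then shows the equivalent one-liner:

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'b', 'a', 'b'], 'val': [1, 2, 3, 4]})

# Split into groups, apply sum to each, combine into a Series by hand
manual = pd.Series({name: grp['val'].sum()
                    for name, grp in df.groupby('key')})

# The same split-apply-combine in one step
auto = df.groupby('key')['val'].sum()

print(manual.to_dict())  # {'a': 4, 'b': 6}
```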

Basic GroupBy Syntax

The groupby() method creates a GroupBy object that you can then apply aggregation functions to. The simplest pattern is grouping by a single column and calculating one statistic:

# Group by store and calculate total revenue
store_revenue = sales.groupby('store')['revenue'].sum()
print(store_revenue)
# store
# LA     2185.0
# NYC    1965.0
# Name: revenue, dtype: float64

# Group by product and get multiple stats
product_stats = sales.groupby('product')['quantity'].agg(['sum', 'mean', 'count'])
print(product_stats)
#          sum  mean  count
# product                  
# Gadget    45   9.0      5
# Widget    85  17.0      5

Multiple Aggregations with agg()

The agg() method allows you to apply multiple aggregation functions at once, and even apply different functions to different columns. You can use built-in function names as strings, NumPy functions, or custom functions:

# Multiple aggregations on single column
revenue_analysis = sales.groupby('store')['revenue'].agg([
    'sum', 'mean', 'std', 'min', 'max'
])
print(revenue_analysis)
#          sum    mean        std    min    max
# store                                         
# LA     2185.0  437.0  137.414  225.0  550.0
# NYC    1965.0  393.0   95.368  250.0  500.0

# Different aggregations for different columns
summary = sales.groupby('store').agg({
    'quantity': ['sum', 'mean'],
    'revenue': ['sum', 'max'],
    'product': 'count'
})
print(summary)
#       quantity        revenue          product
#            sum   mean     sum    max    count
# store                                         
# LA          65  13.0  2185.0  550.0        5
# NYC         65  13.0  1965.0  500.0        5

Grouping by Multiple Columns

You can group by multiple columns to create hierarchical groupings. This is useful for analyzing data across multiple dimensions, like store AND product:

# Group by store AND product
multi_group = sales.groupby(['store', 'product'])['revenue'].sum()
print(multi_group)
# store  product
# LA     Gadget     1260.0
#        Widget      925.0
# NYC    Gadget      765.0
#        Widget     1200.0
# Name: revenue, dtype: float64

# Convert to DataFrame and reset index
multi_df = multi_group.reset_index()
print(multi_df)
#   store product  revenue
# 0    LA  Gadget   1260.0
# 1    LA  Widget    925.0
# 2   NYC  Gadget    765.0
# 3   NYC  Widget   1200.0

Named Aggregations

Named aggregations (added in pandas 0.25) produce cleaner, more readable output and give you control over the resulting column names:

# Named aggregations for clean column names
store_summary = sales.groupby('store').agg(
    total_revenue=('revenue', 'sum'),
    avg_quantity=('quantity', 'mean'),
    num_transactions=('quantity', 'count'),
    max_sale=('revenue', 'max')
)
print(store_summary)
#       total_revenue  avg_quantity  num_transactions  max_sale
# store                                                         
# LA           2185.0          13.0                 5     550.0
# NYC          1965.0          13.0                 5     500.0

Filtering Groups

The filter() method lets you keep or discard entire groups based on a condition. This is useful for focusing on groups that meet certain criteria:

# Keep only products with total quantity > 50
high_volume = sales.groupby('product').filter(lambda x: x['quantity'].sum() > 50)
print(high_volume)
# Only Widget rows remain (total quantity = 85)

# Keep stores with average revenue > 400
high_revenue_stores = sales.groupby('store').filter(
    lambda x: x['revenue'].mean() > 400
)
print(high_revenue_stores['store'].unique())
# ['LA']
Watch out: GroupBy excludes rows with NaN in the grouping key by default. If you need NaN as its own group, pass dropna=False to groupby(). Also remember that the result of groupby().agg() has the grouping column(s) as the index; use reset_index() to convert them back to regular columns.
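Both caveats are easy to demonstrate; this minimal sketch uses None as the missing grouping key:

```python
import pandas as pd

df = pd.DataFrame({'cat': ['a', None, 'a', 'b'], 'val': [1, 2, 3, 4]})

# Default: the row with the missing key silently disappears
default = df.groupby('cat')['val'].sum()
print(default.to_dict())        # {'a': 4, 'b': 4}

# dropna=False keeps NaN as its own group
kept = df.groupby('cat', dropna=False)['val'].sum()
print(len(kept))                # 3 groups, including NaN

# agg results carry the grouping key in the index; reset_index() flattens it
flat = default.reset_index()
print(flat.columns.tolist())    # ['cat', 'val']
```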

Practice Questions: GroupBy Aggregations

Test your understanding with these coding exercises.

Given:

df = pd.DataFrame({
    'category': ['A', 'B', 'A', 'B', 'A'],
    'value': [10, 20, 15, 25, 12]
})

Task: Calculate the average value for each category.

Expected output:

category
A    12.333333
B    22.500000
Solution:
avg_by_category = df.groupby('category')['value'].mean()
print(avg_by_category)

Given:

orders = pd.DataFrame({
    'customer': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob', 'Alice'],
    'product': ['Widget', 'Gadget', 'Gadget', 'Widget', 'Widget', 'Widget']
})

Task: Count how many orders each customer made.

Solution:
order_counts = orders.groupby('customer').size()
# OR
order_counts = orders.groupby('customer')['product'].count()
print(order_counts)

Given:

sales = pd.DataFrame({
    'region': ['East', 'West', 'East', 'West', 'East'],
    'amount': [100, 150, 200, 175, 125]
})

Task: For each region, calculate the sum, mean, and max of amount.

Solution:
region_stats = sales.groupby('region')['amount'].agg(['sum', 'mean', 'max'])
print(region_stats)

Given:

transactions = pd.DataFrame({
    'store': ['A', 'B', 'A', 'B', 'A', 'B'],
    'quantity': [5, 8, 3, 10, 7, 6],
    'revenue': [50, 80, 30, 100, 70, 60]
})

Task: Create a summary with named columns: total_qty, avg_revenue, num_sales for each store.

Solution:
summary = transactions.groupby('store').agg(
    total_qty=('quantity', 'sum'),
    avg_revenue=('revenue', 'mean'),
    num_sales=('quantity', 'count')
)
print(summary)
03

Merge, Join, Concat

Real-world data is often spread across multiple tables or files. Combining data from different sources is a fundamental skill in data analysis. Pandas provides three main ways to combine DataFrames: merge() for database-style joins, join() for index-based combining, and concat() for stacking DataFrames together.

Setup: Sample DataFrames

Let's create sample DataFrames to demonstrate combining operations:

# Customers table
customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'city': ['NYC', 'LA', 'NYC', 'Chicago']
})

# Orders table
orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104, 105],
    'customer_id': [1, 2, 1, 3, 5],  # Note: customer 5 doesn't exist
    'amount': [250, 150, 300, 200, 175]
})

# Products table
products = pd.DataFrame({
    'product_id': ['A', 'B', 'C'],
    'product_name': ['Widget', 'Gadget', 'Gizmo'],
    'price': [25.0, 45.0, 35.0]
})

Merge: Database-Style Joins

The merge() function works like SQL JOIN operations. You specify which columns to match on and what type of join to perform:

# Inner join (default) - only matching rows
inner = pd.merge(orders, customers, on='customer_id')
print(inner)
#    order_id  customer_id  amount     name city
# 0       101            1     250    Alice  NYC
# 1       102            2     150      Bob   LA
# 2       103            1     300    Alice  NYC
# 3       104            3     200  Charlie  NYC
# Note: order 105 (customer 5) excluded - no matching customer

# Left join - all from left, matching from right
left = pd.merge(orders, customers, on='customer_id', how='left')
print(left)
#    order_id  customer_id  amount     name   city
# 0       101            1     250    Alice    NYC
# 1       102            2     150      Bob     LA
# 2       103            1     300    Alice    NYC
# 3       104            3     200  Charlie    NYC
# 4       105            5     175      NaN    NaN  # No match

Join Types Explained

  • inner: only rows with matches in BOTH tables. Use when you only want complete records.
  • left: all rows from the left table, plus matching info from the right. Keeps all primary records, adding info where available.
  • right: all rows from the right table, plus matches from the left. Less common; equivalent to a left join with the tables swapped.
  • outer: all rows from BOTH tables. Use when you need the complete picture of both datasets.
# Right join
right = pd.merge(orders, customers, on='customer_id', how='right')
print(right)
# Customer 4 (Diana) appears with NaN for order columns

# Outer join - all rows from both
outer = pd.merge(orders, customers, on='customer_id', how='outer')
print(outer)
# Includes both order 105 (no customer) and Diana (no orders)
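When checking which rows matched, merge() accepts indicator=True, which adds a _merge column tagging each row as 'both', 'left_only', or 'right_only'. A sketch with trimmed-down versions of the tables above, recreated so the snippet runs on its own:

```python
import pandas as pd

orders = pd.DataFrame({'order_id': [101, 102, 105],
                       'customer_id': [1, 2, 5]})
customers = pd.DataFrame({'customer_id': [1, 2, 4],
                          'name': ['Alice', 'Bob', 'Diana']})

checked = pd.merge(orders, customers, on='customer_id',
                   how='outer', indicator=True)
print(checked[['customer_id', '_merge']])
# customer 5 is left_only (order with no customer record),
# customer 4 is right_only (customer with no orders)
```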

Merging on Different Column Names

When the columns you want to join on have different names in each DataFrame, use left_on and right_on:

# Different column names
employees = pd.DataFrame({
    'emp_id': [1, 2, 3],
    'emp_name': ['John', 'Jane', 'Joe']
})

departments = pd.DataFrame({
    'employee_id': [1, 2, 3],
    'department': ['Sales', 'Marketing', 'IT']
})

# Merge with different column names
result = pd.merge(employees, departments, 
                  left_on='emp_id', right_on='employee_id')
print(result)
#    emp_id emp_name  employee_id department
# 0       1     John            1      Sales
# 1       2     Jane            2  Marketing
# 2       3      Joe            3         IT

Concatenating DataFrames

Use concat() to stack DataFrames vertically (adding rows) or horizontally (adding columns):

# Vertical concatenation (stacking rows)
q1_sales = pd.DataFrame({'product': ['A', 'B'], 'sales': [100, 150]})
q2_sales = pd.DataFrame({'product': ['A', 'B'], 'sales': [120, 180]})

yearly = pd.concat([q1_sales, q2_sales], ignore_index=True)
print(yearly)
#   product  sales
# 0       A    100
# 1       B    150
# 2       A    120
# 3       B    180

# Horizontal concatenation (adding columns)
names = pd.DataFrame({'name': ['Alice', 'Bob']})
ages = pd.DataFrame({'age': [25, 30]})

combined = pd.concat([names, ages], axis=1)
print(combined)
#     name  age
# 0  Alice   25
# 1    Bob   30
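concat() can also record where each piece came from: passing keys= builds a MultiIndex whose outer level labels the source frame. A small sketch reusing the quarterly frames:

```python
import pandas as pd

q1_sales = pd.DataFrame({'product': ['A', 'B'], 'sales': [100, 150]})
q2_sales = pd.DataFrame({'product': ['A', 'B'], 'sales': [120, 180]})

labeled = pd.concat([q1_sales, q2_sales], keys=['Q1', 'Q2'])
print(labeled)

# The outer index level lets you slice by source
print(labeled.loc['Q2'])
```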

The join() Method

The join() method is a convenient way to combine DataFrames on their indices:

# Set index for joining
customers_indexed = customers.set_index('customer_id')
orders_indexed = orders.set_index('customer_id')

# Join on index
joined = customers_indexed.join(orders_indexed, how='inner')
print(joined)
#                  name     city  order_id  amount
# customer_id                                      
# 1               Alice      NYC       101     250
# 1               Alice      NYC       103     300
# 2                 Bob       LA       102     150
# 3             Charlie      NYC       104     200
Pro tip: Use merge() when joining on column values (most common). Use join() when your DataFrames already have meaningful indices. Use concat() when stacking DataFrames with the same structure.

Practice Questions: Merge, Join, Concat

Test your understanding with these coding exercises.

Given:

students = pd.DataFrame({'id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Charlie']})
grades = pd.DataFrame({'student_id': [1, 2, 4], 'grade': ['A', 'B', 'C']})

Task: Merge students with grades using an inner join.

Solution:
result = pd.merge(students, grades, left_on='id', right_on='student_id')
print(result)
# Only Alice and Bob appear (student 4 has no match)

Given:

jan = pd.DataFrame({'month': ['Jan'], 'sales': [1000]})
feb = pd.DataFrame({'month': ['Feb'], 'sales': [1200]})
mar = pd.DataFrame({'month': ['Mar'], 'sales': [1100]})

Task: Stack all three DataFrames into one with a clean index.

Solution:
quarterly = pd.concat([jan, feb, mar], ignore_index=True)
print(quarterly)

Given:

products = pd.DataFrame({'id': [1, 2, 3], 'name': ['Widget', 'Gadget', 'Gizmo']})
inventory = pd.DataFrame({'product_id': [1, 3], 'stock': [50, 30]})

Task: Get all products with their stock (NaN for missing inventory).

Solution:
result = pd.merge(products, inventory, left_on='id', 
                  right_on='product_id', how='left')
print(result)
# Gadget will have NaN for stock

Given:

orders = pd.DataFrame({'order_id': [1, 2], 'customer_id': [10, 20], 'product_id': ['A', 'B']})
customers = pd.DataFrame({'customer_id': [10, 20], 'name': ['Alice', 'Bob']})
products = pd.DataFrame({'product_id': ['A', 'B'], 'product_name': ['Widget', 'Gadget']})

Task: Create a view with order_id, customer name, and product name.

Solution:
# Chain merges
result = (orders
    .merge(customers, on='customer_id')
    .merge(products, on='product_id')
    [['order_id', 'name', 'product_name']]
)
print(result)
04

Pivot Tables & Cross-tabulations

Pivot tables are one of the most powerful tools for data analysis, allowing you to reshape data and create summary views. They transform data from a long format (many rows) to a wide format (categories as columns), making patterns easier to spot. Cross-tabulations (crosstabs) are specialized pivot tables for counting occurrences across categories.

Creating Pivot Tables

The pivot_table() function creates a spreadsheet-style pivot table. You specify which columns become the index (rows), columns, and what values to aggregate:

# Sample data
sales = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=8, freq='D'),
    'store': ['NYC', 'LA', 'NYC', 'LA', 'NYC', 'LA', 'NYC', 'LA'],
    'product': ['Widget', 'Widget', 'Gadget', 'Gadget', 
                'Widget', 'Gadget', 'Gadget', 'Widget'],
    'quantity': [10, 15, 8, 12, 20, 5, 9, 22],
    'revenue': [250, 375, 360, 540, 500, 225, 405, 550]
})

# Basic pivot table
pivot = pd.pivot_table(
    sales,
    values='revenue',      # What to aggregate
    index='store',         # Row labels
    columns='product',     # Column labels
    aggfunc='sum'          # How to aggregate
)
print(pivot)
# product  Gadget  Widget
# store                   
# LA        765.0   925.0
# NYC       765.0   750.0

Multiple Aggregations

You can apply multiple aggregation functions and summarize multiple value columns:

# Multiple aggregation functions
pivot_multi = pd.pivot_table(
    sales,
    values='revenue',
    index='store',
    columns='product',
    aggfunc=['sum', 'mean', 'count']
)
print(pivot_multi)
#           sum            mean                count        
# product Gadget Widget  Gadget     Widget   Gadget Widget
# store                                                     
# LA       765.0  925.0   382.5  462.500000      2      2
# NYC      765.0  750.0   382.5  375.000000      2      2

# Multiple values with different aggregations
pivot_values = pd.pivot_table(
    sales,
    values=['quantity', 'revenue'],
    index='store',
    columns='product',
    aggfunc={'quantity': 'sum', 'revenue': 'mean'}
)
print(pivot_values)

Adding Margins (Totals)

The margins=True parameter adds row and column totals, similar to Excel pivot tables:

# Pivot table with totals
pivot_totals = pd.pivot_table(
    sales,
    values='revenue',
    index='store',
    columns='product',
    aggfunc='sum',
    margins=True,           # Add totals
    margins_name='Total'    # Name for totals row/column
)
print(pivot_totals)
# product  Gadget  Widget   Total
# store                          
# LA        765.0   925.0  1690.0
# NYC       765.0   750.0  1515.0
# Total    1530.0  1675.0  3205.0

Cross-Tabulation (crosstab)

Cross-tabulation is perfect for analyzing the frequency of combinations between two categorical variables. It's commonly used for categorical data analysis:

# Count occurrences of each combination
crosstab = pd.crosstab(sales['store'], sales['product'])
print(crosstab)
# product  Gadget  Widget
# store                   
# LA            2       2
# NYC           2       2

# Add margins
crosstab_margins = pd.crosstab(
    sales['store'], 
    sales['product'], 
    margins=True
)
print(crosstab_margins)
# product  Gadget  Widget  All
# store                       
# LA            2       2    4
# NYC           2       2    4
# All           4       4    8

# Normalize to get percentages
crosstab_pct = pd.crosstab(
    sales['store'], 
    sales['product'], 
    normalize='all'  # 'index', 'columns', or 'all'
)
print(crosstab_pct)
# product  Gadget  Widget
# store                   
# LA         0.25    0.25
# NYC        0.25    0.25

Pivot vs Pivot Table

Pandas also has a simpler pivot() method for reshaping without aggregation. Use it when your data has unique combinations:

# Simple pivot (no aggregation - needs unique combinations)
data = pd.DataFrame({
    'date': ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02'],
    'metric': ['views', 'clicks', 'views', 'clicks'],
    'value': [1000, 50, 1200, 65]
})

reshaped = data.pivot(index='date', columns='metric', values='value')
print(reshaped)
# metric       clicks   views
# date                        
# 2024-01-01       50    1000
# 2024-01-02       65    1200
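The inverse of this reshaping is melt(), which the lesson overview calls "melting": it turns wide columns back into long rows. A sketch that un-pivots a wide frame like the one just produced:

```python
import pandas as pd

wide = pd.DataFrame({
    'date': ['2024-01-01', '2024-01-02'],
    'clicks': [50, 65],
    'views': [1000, 1200]
})

tidy = pd.melt(wide, id_vars='date',
               var_name='metric', value_name='value')
print(tidy)
# Each (date, metric) combination is its own row again
```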
Watch out: pivot() will raise an error if there are duplicate combinations. Use pivot_table() when you need to aggregate duplicates. Also, pivot operations can create many NaN values if combinations don't exist in your data.
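The failure mode is easy to reproduce; this sketch feeds a duplicate (index, columns) combination to both functions:

```python
import pandas as pd

# Two rows with the same (date, metric) pair - a duplicate combination
dup = pd.DataFrame({
    'date': ['2024-01-01', '2024-01-01'],
    'metric': ['views', 'views'],
    'value': [1000, 1100]
})

try:
    dup.pivot(index='date', columns='metric', values='value')
except ValueError as err:
    print('pivot failed:', err)

# pivot_table aggregates the duplicates instead (mean by default)
ok = pd.pivot_table(dup, index='date', columns='metric',
                    values='value', aggfunc='mean')
print(ok)   # views = 1050.0
```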

Practice Questions: Pivot Tables

Test your understanding with these coding exercises.

Given:

survey = pd.DataFrame({
    'gender': ['M', 'F', 'M', 'F', 'M', 'F'],
    'response': ['Yes', 'Yes', 'No', 'Yes', 'Yes', 'No']
})

Task: Create a crosstab showing the count of each response by gender.

Solution:
result = pd.crosstab(survey['gender'], survey['response'])
print(result)

Given:

sales = pd.DataFrame({
    'region': ['East', 'West', 'East', 'West'],
    'product': ['A', 'A', 'B', 'B'],
    'sales': [100, 150, 200, 175]
})

Task: Create a pivot table showing total sales by region and product.

Solution:
pivot = pd.pivot_table(sales, values='sales', 
                       index='region', columns='product', 
                       aggfunc='sum')
print(pivot)

Given:

orders = pd.DataFrame({
    'month': ['Jan', 'Jan', 'Feb', 'Feb', 'Jan', 'Feb'],
    'category': ['Electronics', 'Clothing', 'Electronics', 'Clothing', 'Clothing', 'Electronics'],
    'amount': [500, 200, 600, 250, 300, 450]
})

Task: Create a pivot table with average amounts, including row and column totals.

Solution:
pivot = pd.pivot_table(orders, values='amount',
                       index='month', columns='category',
                       aggfunc='mean', margins=True)
print(pivot)

Given:

customers = pd.DataFrame({
    'age_group': ['18-25', '26-35', '18-25', '36-45', '26-35', '18-25', '36-45', '26-35'],
    'purchased': ['Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes']
})

Task: Create a crosstab showing what percentage of each age group purchased (normalize by index/row).

Solution:
result = pd.crosstab(customers['age_group'], 
                     customers['purchased'],
                     normalize='index')  # Normalize by row
print(result)
# Shows percentage of Yes/No within each age group
05

Apply, Map, Transform

When built-in Pandas functions don't cover your needs, you can apply custom functions to your data. Pandas provides three main methods for this: apply() for flexible function application, map() for element-wise transformations, and transform() for group-aware operations that preserve the original shape.

The apply() Method

apply() is the most versatile method. It can work on entire columns, rows, or DataFrames, applying any function you provide:

# Sample data
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'salary': [50000, 60000, 75000]
})

# Apply to a single column (Series)
df['age_doubled'] = df['age'].apply(lambda x: x * 2)
print(df['age_doubled'])
# 0    50
# 1    60
# 2    70

# Apply a named function
def categorize_salary(salary):
    if salary < 55000:
        return 'Entry'
    elif salary < 70000:
        return 'Mid'
    else:
        return 'Senior'

df['level'] = df['salary'].apply(categorize_salary)
print(df[['name', 'salary', 'level']])
#       name  salary   level
# 0    Alice   50000   Entry
# 1      Bob   60000     Mid
# 2  Charlie   75000  Senior

Apply Across Rows

Use axis=1 to apply a function across each row, accessing multiple columns:

# Apply across rows (access multiple columns)
df['description'] = df.apply(
    lambda row: f"{row['name']} is {row['age']} years old",
    axis=1
)
print(df['description'])
# 0       Alice is 25 years old
# 1         Bob is 30 years old
# 2    Charlie is 35 years old

# Calculate bonus based on multiple columns
def calculate_bonus(row):
    base_bonus = row['salary'] * 0.1
    if row['age'] > 30:
        return base_bonus * 1.2  # Extra for seniority
    return base_bonus

df['bonus'] = df.apply(calculate_bonus, axis=1)
print(df[['name', 'salary', 'bonus']])

The map() Method

map() is specifically for Series (single columns) and is great for value substitution using a dictionary or function:

# Map with a dictionary (value substitution)
level_mapping = {
    'Entry': 1,
    'Mid': 2,
    'Senior': 3
}
df['level_code'] = df['level'].map(level_mapping)
print(df[['name', 'level', 'level_code']])
#       name   level  level_code
# 0    Alice   Entry           1
# 1      Bob     Mid           2
# 2  Charlie  Senior           3

# Map with a function
df['name_upper'] = df['name'].map(str.upper)
print(df['name_upper'])
# 0      ALICE
# 1        BOB
# 2    CHARLIE

# Map with lambda
df['age_category'] = df['age'].map(lambda x: 'Young' if x < 30 else 'Mature')
print(df[['name', 'age', 'age_category']])

The transform() Method

transform() is special because it returns data with the same shape as the input. This is particularly useful with GroupBy for creating group-level statistics aligned with original rows:

# Sample data
sales = pd.DataFrame({
    'store': ['NYC', 'NYC', 'LA', 'LA', 'NYC'],
    'product': ['Widget', 'Gadget', 'Widget', 'Gadget', 'Widget'],
    'revenue': [100, 150, 200, 120, 180]
})

# Transform: add group mean to each row (same shape as original)
sales['store_avg'] = sales.groupby('store')['revenue'].transform('mean')
print(sales)
#   store product  revenue  store_avg
# 0   NYC  Widget      100  143.333333
# 1   NYC  Gadget      150  143.333333
# 2    LA  Widget      200  160.000000
# 3    LA  Gadget      120  160.000000
# 4   NYC  Widget      180  143.333333

# Calculate percentage of store total
sales['pct_of_store'] = (sales['revenue'] / 
    sales.groupby('store')['revenue'].transform('sum') * 100)
print(sales[['store', 'revenue', 'pct_of_store']])

Comparing apply, map, and transform

  • map(): Series only; returns a same-size Series. Best for simple value substitution and dictionary mapping.
  • apply(): Series or DataFrame; return shape is flexible (can change shape). Best for complex row/column operations and custom logic.
  • transform(): Series or GroupBy; returns output the same size as its input. Best for group statistics aligned with the original data.
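The shape difference is the key point in that comparison, and a two-line experiment makes it visible: aggregating collapses each group to one row, while transform() returns one value per original row:

```python
import pandas as pd

df = pd.DataFrame({'store': ['NYC', 'NYC', 'LA'],
                   'revenue': [100, 150, 200]})
g = df.groupby('store')['revenue']

collapsed = g.mean()           # one row per group
aligned = g.transform('mean')  # one row per original row

print(len(collapsed), len(aligned))  # 2 3
print(aligned.tolist())              # [125.0, 125.0, 200.0]
```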

Performance Tip: Vectorization

While apply() is flexible, it's often slower than vectorized operations. When possible, use built-in Pandas/NumPy functions:

# Slow: using apply
df['tax'] = df['salary'].apply(lambda x: x * 0.2)

# Fast: vectorized operation (preferred)
df['tax'] = df['salary'] * 0.2

# Slow: apply with condition
df['bonus'] = df['age'].apply(lambda x: 1000 if x > 30 else 500)

# Fast: np.where (preferred)
import numpy as np
df['bonus'] = np.where(df['age'] > 30, 1000, 500)
Pro tip: Use apply() when you need complex logic that can't be expressed with vectorized operations. For simple transformations, always prefer built-in Pandas methods or NumPy functions for better performance.

Practice Questions: Apply, Map, Transform

Test your understanding with these coding exercises.

Given:

df = pd.DataFrame({
    'grade': ['A', 'B', 'C', 'A', 'B']
})

Task: Create a new column 'points' that maps A=4, B=3, C=2.

Solution:
grade_points = {'A': 4, 'B': 3, 'C': 2}
df['points'] = df['grade'].map(grade_points)
print(df)

Given:

df = pd.DataFrame({
    'price': [10.5, 20.0, 15.75, 8.25]
})

Task: Create a 'price_with_tax' column that adds 10% tax to each price.

Solution:
# Using apply
df['price_with_tax'] = df['price'].apply(lambda x: x * 1.1)

# Better: vectorized
df['price_with_tax'] = df['price'] * 1.1
print(df)

Given:

df = pd.DataFrame({
    'first_name': ['John', 'Jane', 'Bob'],
    'last_name': ['Doe', 'Smith', 'Johnson']
})

Task: Create a 'full_name' column combining first and last names.

Solution:
# Using apply with axis=1
df['full_name'] = df.apply(
    lambda row: f"{row['first_name']} {row['last_name']}", 
    axis=1
)

# Better: string concatenation
df['full_name'] = df['first_name'] + ' ' + df['last_name']
print(df)

Given:

sales = pd.DataFrame({
    'region': ['East', 'East', 'West', 'West', 'East'],
    'amount': [100, 150, 200, 180, 120]
})

Task: Add a column showing how each sale compares to its region's average (difference from mean).

Solution:
sales['diff_from_avg'] = (sales['amount'] - 
    sales.groupby('region')['amount'].transform('mean'))
print(sales)
# Shows positive for above-average, negative for below

Key Takeaways

GroupBy Power

Use groupby() to split data into groups and aggregate with sum(), mean(), or custom functions via agg()

Merge Mastery

Combine DataFrames with merge() using inner, left, right, or outer joins. Use concat() to stack DataFrames vertically or horizontally

Pivot Tables

Reshape data with pivot_table() for summary views. Use crosstab() to count category combinations with margins

Custom Functions

Apply custom logic with apply(). Use map() for value substitution and transform() for group-aligned operations

Performance Matters

Prefer vectorized operations over apply() when possible. Use NumPy functions like np.where() for conditional logic

Named Aggregations

Use named aggregations for cleaner output: agg(total=('col', 'sum')) gives you control over column names

Knowledge Check

Test your understanding of data transformation concepts:

  • What does groupby('category')['value'].sum() return?
  • Which merge type keeps ALL rows from BOTH DataFrames?
  • What is the purpose of margins=True in a pivot table?
  • What is the key difference between apply() and transform()?
  • Which function is best for substituting values using a dictionary?
  • When should you use pd.concat() instead of pd.merge()?