Module 4.4

Data Transformation

Learn powerful data transformation techniques to reshape, aggregate, and combine datasets. Master GroupBy operations, merging DataFrames, pivot tables, and custom transformations for comprehensive data analysis.

50 min
Intermediate
Hands-on
What You'll Learn
  • Aggregate data with GroupBy operations
  • Merge, join, and concat DataFrames
  • Create pivot tables and cross-tabulations
  • Apply custom functions to data
  • Use map and transform methods
01

Introduction to Data Transformation

Data transformation is the process of reshaping, restructuring, and combining data to prepare it for analysis. While data cleaning focuses on fixing data quality issues, transformation focuses on organizing data into the right structure for your specific analytical needs. Pandas provides powerful tools for aggregating data across groups, combining multiple datasets, and applying custom transformations.

Why Data Transformation Matters

Real-world data rarely comes in the exact format you need. Sales data might be stored by transaction, but you need monthly summaries. Customer data might be split across multiple tables that need joining. Raw values might need to be categorized or normalized. Data transformation bridges the gap between how data is stored and how you need to analyze it.

Aggregation

Summarize detailed data into meaningful statistics. Calculate totals, averages, counts, and other metrics across groups of records.

Combining

Merge data from multiple sources into unified datasets. Join tables on common keys, concatenate DataFrames, and handle relationships.

Reshaping

Restructure data layout with pivot tables and melting operations. Convert between wide and long formats for different analysis needs.

Custom Functions

Apply your own logic to transform values. Use apply, map, and transform to implement complex business rules and calculations.

Sample Dataset for This Lesson

Throughout this lesson, we'll work with sales, customer, and employee data to demonstrate transformation techniques. Let's start by creating the sales dataset:

import pandas as pd
import numpy as np

# Sales data
sales = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=10, freq='D'),
    'store': ['NYC', 'LA', 'NYC', 'LA', 'NYC', 'LA', 'NYC', 'LA', 'NYC', 'LA'],
    'product': ['Widget', 'Widget', 'Gadget', 'Gadget', 'Widget', 
                'Gadget', 'Widget', 'Widget', 'Gadget', 'Gadget'],
    'quantity': [10, 15, 8, 12, 20, 5, 18, 22, 9, 11],
    'price': [25.0, 25.0, 45.0, 45.0, 25.0, 45.0, 25.0, 25.0, 45.0, 45.0]
})

# Add revenue column
sales['revenue'] = sales['quantity'] * sales['price']

print(sales)
#         date store product  quantity  price  revenue
# 0 2024-01-01   NYC  Widget        10   25.0    250.0
# 1 2024-01-02    LA  Widget        15   25.0    375.0
# 2 2024-01-03   NYC  Gadget         8   45.0    360.0
# 3 2024-01-04    LA  Gadget        12   45.0    540.0
# ...
Pro tip: Before transforming data, always understand its current structure using df.info(), df.head(), and df.describe(). This helps you choose the right transformation approach.
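To make that habit concrete, here is a minimal sketch of those three inspection calls on a tiny throwaway frame (the values are invented for illustration):

```python
import pandas as pd

# Tiny throwaway frame, just to illustrate the inspection calls
df = pd.DataFrame({
    'store': ['NYC', 'LA', 'NYC'],
    'revenue': [250.0, 375.0, 360.0]
})

df.info()             # column dtypes and non-null counts
print(df.head(2))     # first rows, to eyeball the layout
print(df.describe())  # summary statistics for numeric columns
```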
02

GroupBy Aggregations

The GroupBy operation is one of the most powerful features in Pandas. It follows the "split-apply-combine" pattern: split the data into groups based on one or more keys, apply a function to each group independently, and combine the results into a new DataFrame. This is essential for calculating summary statistics across categories, time periods, or any other grouping.
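The pattern can be sketched by hand before reaching for groupby(); this toy example does the split, apply, and combine steps explicitly, then shows the equivalent one-liner:

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'b', 'a', 'b'], 'val': [1, 2, 3, 4]})

# Split into groups, apply sum to each, combine into a Series by hand
manual = pd.Series({name: grp['val'].sum()
                    for name, grp in df.groupby('key')})

# The same split-apply-combine in one step
auto = df.groupby('key')['val'].sum()

print(manual.to_dict())  # {'a': 4, 'b': 6}
```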

Basic GroupBy Syntax

The groupby() method creates a GroupBy object that you can then apply aggregation functions to. The simplest pattern is grouping by a single column and calculating one statistic:

# Group by store and calculate total revenue
store_revenue = sales.groupby('store')['revenue'].sum()
print(store_revenue)
# store
# LA     2185.0
# NYC    1965.0
# Name: revenue, dtype: float64

# Group by product and get multiple stats
product_stats = sales.groupby('product')['quantity'].agg(['sum', 'mean', 'count'])
print(product_stats)
#          sum  mean  count
# product                  
# Gadget    45   9.0      5
# Widget    85  17.0      5

Multiple Aggregations with agg()

The agg() method allows you to apply multiple aggregation functions at once, and even apply different functions to different columns. You can use built-in function names as strings, NumPy functions, or custom functions:

# Multiple aggregations on single column
revenue_analysis = sales.groupby('store')['revenue'].agg([
    'sum', 'mean', 'std', 'min', 'max'
])
print(revenue_analysis)
#          sum    mean        std    min    max
# store                                         
# LA     2185.0  437.0  137.414  225.0  550.0
# NYC    1965.0  393.0   95.368  250.0  500.0

# Different aggregations for different columns
summary = sales.groupby('store').agg({
    'quantity': ['sum', 'mean'],
    'revenue': ['sum', 'max'],
    'product': 'count'
})
print(summary)
#       quantity        revenue          product
#            sum   mean     sum    max    count
# store                                         
# LA          65  13.0  2185.0  550.0        5
# NYC         65  13.0  1965.0  500.0        5

Grouping by Multiple Columns

You can group by multiple columns to create hierarchical groupings. This is useful for analyzing data across multiple dimensions, like store AND product:

# Group by store AND product
multi_group = sales.groupby(['store', 'product'])['revenue'].sum()
print(multi_group)
# store  product
# LA     Gadget     1260.0
#        Widget      925.0
# NYC    Gadget      765.0
#        Widget     1200.0
# Name: revenue, dtype: float64

# Convert to DataFrame and reset index
multi_df = multi_group.reset_index()
print(multi_df)
#   store product  revenue
# 0    LA  Gadget   1260.0
# 1    LA  Widget    925.0
# 2   NYC  Gadget    765.0
# 3   NYC  Widget   1200.0

Named Aggregations

Named aggregations (added in pandas 0.25) produce cleaner, more readable output and give you control over the resulting column names:

# Named aggregations for clean column names
store_summary = sales.groupby('store').agg(
    total_revenue=('revenue', 'sum'),
    avg_quantity=('quantity', 'mean'),
    num_transactions=('quantity', 'count'),
    max_sale=('revenue', 'max')
)
print(store_summary)
#       total_revenue  avg_quantity  num_transactions  max_sale
# store                                                         
# LA           2185.0          13.0                 5     550.0
# NYC          1965.0          13.0                 5     500.0

Filtering Groups

The filter() method lets you keep or discard entire groups based on a condition. This is useful for focusing on groups that meet certain criteria:

# Keep only products with total quantity > 50
high_volume = sales.groupby('product').filter(lambda x: x['quantity'].sum() > 50)
print(high_volume)
# Only Widget rows remain (total quantity = 85)

# Keep stores with average revenue > 400
high_revenue_stores = sales.groupby('store').filter(
    lambda x: x['revenue'].mean() > 400
)
print(high_revenue_stores['store'].unique())
# ['LA']
Watch out: GroupBy excludes rows with NaN in the grouping key by default. If you need NaN as its own group, pass dropna=False to groupby(). Also remember that the result of groupby().agg() has the grouping column(s) as the index; use reset_index() to convert them back to regular columns.
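Both caveats are easy to demonstrate; this minimal sketch uses None as the missing grouping key:

```python
import pandas as pd

df = pd.DataFrame({'cat': ['a', None, 'a', 'b'], 'val': [1, 2, 3, 4]})

# Default: the row with the missing key silently disappears
default = df.groupby('cat')['val'].sum()
print(default.to_dict())        # {'a': 4, 'b': 4}

# dropna=False keeps NaN as its own group
kept = df.groupby('cat', dropna=False)['val'].sum()
print(len(kept))                # 3 groups, including NaN

# agg results carry the grouping key in the index; reset_index() flattens it
flat = default.reset_index()
print(flat.columns.tolist())    # ['cat', 'val']
```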

Practice Questions: GroupBy Aggregations

Test your understanding with these coding exercises.

Given:

df = pd.DataFrame({
    'category': ['A', 'B', 'A', 'B', 'A'],
    'value': [10, 20, 15, 25, 12]
})

Task: Calculate the average value for each category.

Expected output:

category
A    12.333333
B    22.500000
Solution:
avg_by_category = df.groupby('category')['value'].mean()
print(avg_by_category)

Given:

orders = pd.DataFrame({
    'customer': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob', 'Alice'],
    'product': ['Widget', 'Gadget', 'Gadget', 'Widget', 'Widget', 'Widget']
})

Task: Count how many orders each customer made.

Solution:
order_counts = orders.groupby('customer').size()
# OR
order_counts = orders.groupby('customer')['product'].count()
print(order_counts)

Given:

sales = pd.DataFrame({
    'region': ['East', 'West', 'East', 'West', 'East'],
    'amount': [100, 150, 200, 175, 125]
})

Task: For each region, calculate the sum, mean, and max of amount.

Solution:
region_stats = sales.groupby('region')['amount'].agg(['sum', 'mean', 'max'])
print(region_stats)

Given:

transactions = pd.DataFrame({
    'store': ['A', 'B', 'A', 'B', 'A', 'B'],
    'quantity': [5, 8, 3, 10, 7, 6],
    'revenue': [50, 80, 30, 100, 70, 60]
})

Task: Create a summary with named columns: total_qty, avg_revenue, num_sales for each store.

Solution:
summary = transactions.groupby('store').agg(
    total_qty=('quantity', 'sum'),
    avg_revenue=('revenue', 'mean'),
    num_sales=('quantity', 'count')
)
print(summary)
03

Merge, Join, Concat

Real-world data is often spread across multiple tables or files. Combining data from different sources is a fundamental skill in data analysis. Pandas provides three main ways to combine DataFrames: merge() for database-style joins, join() for index-based combining, and concat() for stacking DataFrames together.

Setup: Sample DataFrames

Let's create sample DataFrames to demonstrate combining operations:

# Customers table
customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'city': ['NYC', 'LA', 'NYC', 'Chicago']
})

# Orders table
orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104, 105],
    'customer_id': [1, 2, 1, 3, 5],  # Note: customer 5 doesn't exist
    'amount': [250, 150, 300, 200, 175]
})

# Products table
products = pd.DataFrame({
    'product_id': ['A', 'B', 'C'],
    'product_name': ['Widget', 'Gadget', 'Gizmo'],
    'price': [25.0, 45.0, 35.0]
})

Merge: Database-Style Joins

The merge() function works like SQL JOIN operations. You specify which columns to match on and what type of join to perform:

# Inner join (default) - only matching rows
inner = pd.merge(orders, customers, on='customer_id')
print(inner)
#    order_id  customer_id  amount     name city
# 0       101            1     250    Alice  NYC
# 1       102            2     150      Bob   LA
# 2       103            1     300    Alice  NYC
# 3       104            3     200  Charlie  NYC
# Note: order 105 (customer 5) excluded - no matching customer

# Left join - all from left, matching from right
left = pd.merge(orders, customers, on='customer_id', how='left')
print(left)
#    order_id  customer_id  amount     name   city
# 0       101            1     250    Alice    NYC
# 1       102            2     150      Bob     LA
# 2       103            1     300    Alice    NYC
# 3       104            3     200  Charlie    NYC
# 4       105            5     175      NaN    NaN  # No match

Join Types Explained

  • inner: only rows with matches in BOTH tables. Use when you only want complete records.
  • left: all rows from the left table, plus matching info from the right. Keeps all primary records, adding info where available.
  • right: all rows from the right table, plus matches from the left. Less common; equivalent to a left join with the tables swapped.
  • outer: all rows from BOTH tables. Use when you need the complete picture of both datasets.
# Right join
right = pd.merge(orders, customers, on='customer_id', how='right')
print(right)
# Customer 4 (Diana) appears with NaN for order columns

# Outer join - all rows from both
outer = pd.merge(orders, customers, on='customer_id', how='outer')
print(outer)
# Includes both order 105 (no customer) and Diana (no orders)
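When checking which rows matched, merge() accepts indicator=True, which adds a _merge column tagging each row as 'both', 'left_only', or 'right_only'. A sketch with trimmed-down versions of the tables above, recreated so the snippet runs on its own:

```python
import pandas as pd

orders = pd.DataFrame({'order_id': [101, 102, 105],
                       'customer_id': [1, 2, 5]})
customers = pd.DataFrame({'customer_id': [1, 2, 4],
                          'name': ['Alice', 'Bob', 'Diana']})

checked = pd.merge(orders, customers, on='customer_id',
                   how='outer', indicator=True)
print(checked[['customer_id', '_merge']])
# customer 5 is left_only (order with no customer record),
# customer 4 is right_only (customer with no orders)
```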

Merging on Different Column Names

When the columns you want to join on have different names in each DataFrame, use left_on and right_on:

# Different column names
employees = pd.DataFrame({
    'emp_id': [1, 2, 3],
    'emp_name': ['John', 'Jane', 'Joe']
})

departments = pd.DataFrame({
    'employee_id': [1, 2, 3],
    'department': ['Sales', 'Marketing', 'IT']
})

# Merge with different column names
result = pd.merge(employees, departments, 
                  left_on='emp_id', right_on='employee_id')
print(result)
#    emp_id emp_name  employee_id department
# 0       1     John            1      Sales
# 1       2     Jane            2  Marketing
# 2       3      Joe            3         IT

Concatenating DataFrames

Use concat() to stack DataFrames vertically (adding rows) or horizontally (adding columns):

# Vertical concatenation (stacking rows)
q1_sales = pd.DataFrame({'product': ['A', 'B'], 'sales': [100, 150]})
q2_sales = pd.DataFrame({'product': ['A', 'B'], 'sales': [120, 180]})

yearly = pd.concat([q1_sales, q2_sales], ignore_index=True)
print(yearly)
#   product  sales
# 0       A    100
# 1       B    150
# 2       A    120
# 3       B    180

# Horizontal concatenation (adding columns)
names = pd.DataFrame({'name': ['Alice', 'Bob']})
ages = pd.DataFrame({'age': [25, 30]})

combined = pd.concat([names, ages], axis=1)
print(combined)
#     name  age
# 0  Alice   25
# 1    Bob   30
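concat() can also record where each piece came from: passing keys= builds a MultiIndex whose outer level labels the source frame. A small sketch reusing the quarterly frames:

```python
import pandas as pd

q1_sales = pd.DataFrame({'product': ['A', 'B'], 'sales': [100, 150]})
q2_sales = pd.DataFrame({'product': ['A', 'B'], 'sales': [120, 180]})

labeled = pd.concat([q1_sales, q2_sales], keys=['Q1', 'Q2'])
print(labeled)

# The outer index level lets you slice by source
print(labeled.loc['Q2'])
```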

The join() Method

The join() method is a convenient way to combine DataFrames on their indices:

# Set index for joining
customers_indexed = customers.set_index('customer_id')
orders_indexed = orders.set_index('customer_id')

# Join on index
joined = customers_indexed.join(orders_indexed, how='inner')
print(joined)
#                  name     city  order_id  amount
# customer_id                                      
# 1               Alice      NYC       101     250
# 1               Alice      NYC       103     300
# 2                 Bob       LA       102     150
# 3             Charlie      NYC       104     200
Pro tip: Use merge() when joining on column values (most common). Use join() when your DataFrames already have meaningful indices. Use concat() when stacking DataFrames with the same structure.

Practice Questions: Merge, Join, Concat

Test your understanding with these coding exercises.

Given:

students = pd.DataFrame({'id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Charlie']})
grades = pd.DataFrame({'student_id': [1, 2, 4], 'grade': ['A', 'B', 'C']})

Task: Merge students with grades using an inner join.

Solution:
result = pd.merge(students, grades, left_on='id', right_on='student_id')
print(result)
# Only Alice and Bob appear (student 4 has no match)

Given:

jan = pd.DataFrame({'month': ['Jan'], 'sales': [1000]})
feb = pd.DataFrame({'month': ['Feb'], 'sales': [1200]})
mar = pd.DataFrame({'month': ['Mar'], 'sales': [1100]})

Task: Stack all three DataFrames into one with a clean index.

Solution:
quarterly = pd.concat([jan, feb, mar], ignore_index=True)
print(quarterly)

Given:

products = pd.DataFrame({'id': [1, 2, 3], 'name': ['Widget', 'Gadget', 'Gizmo']})
inventory = pd.DataFrame({'product_id': [1, 3], 'stock': [50, 30]})

Task: Get all products with their stock (NaN for missing inventory).

Solution:
result = pd.merge(products, inventory, left_on='id', 
                  right_on='product_id', how='left')
print(result)
# Gadget will have NaN for stock

Given:

orders = pd.DataFrame({'order_id': [1, 2], 'customer_id': [10, 20], 'product_id': ['A', 'B']})
customers = pd.DataFrame({'customer_id': [10, 20], 'name': ['Alice', 'Bob']})
products = pd.DataFrame({'product_id': ['A', 'B'], 'product_name': ['Widget', 'Gadget']})

Task: Create a view with order_id, customer name, and product name.

Solution:
# Chain merges
result = (orders
    .merge(customers, on='customer_id')
    .merge(products, on='product_id')
    [['order_id', 'name', 'product_name']]
)
print(result)
04

Pivot Tables & Cross-tabulations

Pivot tables are one of the most powerful tools for data analysis, allowing you to reshape data and create summary views. They transform data from a long format (many rows) to a wide format (categories as columns), making patterns easier to spot. Cross-tabulations (crosstabs) are specialized pivot tables for counting occurrences across categories.

Creating Pivot Tables

The pivot_table() function creates a spreadsheet-style pivot table. You specify which columns become the index (rows), columns, and what values to aggregate:

# Sample data
sales = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=8, freq='D'),
    'store': ['NYC', 'LA', 'NYC', 'LA', 'NYC', 'LA', 'NYC', 'LA'],
    'product': ['Widget', 'Widget', 'Gadget', 'Gadget', 
                'Widget', 'Gadget', 'Gadget', 'Widget'],
    'quantity': [10, 15, 8, 12, 20, 5, 9, 22],
    'revenue': [250, 375, 360, 540, 500, 225, 405, 550]
})

# Basic pivot table
pivot = pd.pivot_table(
    sales,
    values='revenue',      # What to aggregate
    index='store',         # Row labels
    columns='product',     # Column labels
    aggfunc='sum'          # How to aggregate
)
print(pivot)
# product  Gadget  Widget
# store                   
# LA        765.0   925.0
# NYC       765.0   750.0

Multiple Aggregations

You can apply multiple aggregation functions and summarize multiple value columns:

# Multiple aggregation functions
pivot_multi = pd.pivot_table(
    sales,
    values='revenue',
    index='store',
    columns='product',
    aggfunc=['sum', 'mean', 'count']
)
print(pivot_multi)
#           sum            mean                count        
# product Gadget Widget  Gadget     Widget   Gadget Widget
# store                                                     
# LA       765.0  925.0   382.5  462.500000      2      2
# NYC      765.0  750.0   382.5  375.000000      2      2

# Multiple values with different aggregations
pivot_values = pd.pivot_table(
    sales,
    values=['quantity', 'revenue'],
    index='store',
    columns='product',
    aggfunc={'quantity': 'sum', 'revenue': 'mean'}
)
print(pivot_values)

Adding Margins (Totals)

The margins=True parameter adds row and column totals, similar to Excel pivot tables:

# Pivot table with totals
pivot_totals = pd.pivot_table(
    sales,
    values='revenue',
    index='store',
    columns='product',
    aggfunc='sum',
    margins=True,           # Add totals
    margins_name='Total'    # Name for totals row/column
)
print(pivot_totals)
# product  Gadget  Widget   Total
# store                          
# LA        765.0   925.0  1690.0
# NYC       765.0   750.0  1515.0
# Total    1530.0  1675.0  3205.0

Cross-Tabulation (crosstab)

Cross-tabulation is perfect for analyzing the frequency of combinations between two categorical variables. It's commonly used for categorical data analysis:

# Count occurrences of each combination
crosstab = pd.crosstab(sales['store'], sales['product'])
print(crosstab)
# product  Gadget  Widget
# store                   
# LA            2       2
# NYC           2       2

# Add margins
crosstab_margins = pd.crosstab(
    sales['store'], 
    sales['product'], 
    margins=True
)
print(crosstab_margins)
# product  Gadget  Widget  All
# store                       
# LA            2       2    4
# NYC           2       2    4
# All           4       4    8

# Normalize to get percentages
crosstab_pct = pd.crosstab(
    sales['store'], 
    sales['product'], 
    normalize='all'  # 'index', 'columns', or 'all'
)
print(crosstab_pct)
# product  Gadget  Widget
# store                   
# LA         0.25    0.25
# NYC        0.25    0.25

Pivot vs Pivot Table

Pandas also has a simpler pivot() method for reshaping without aggregation. Use it when your data has unique combinations:

# Simple pivot (no aggregation - needs unique combinations)
data = pd.DataFrame({
    'date': ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02'],
    'metric': ['views', 'clicks', 'views', 'clicks'],
    'value': [1000, 50, 1200, 65]
})

reshaped = data.pivot(index='date', columns='metric', values='value')
print(reshaped)
# metric       clicks   views
# date                        
# 2024-01-01       50    1000
# 2024-01-02       65    1200
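The inverse of this reshaping is melt(), which the lesson overview calls "melting": it turns wide columns back into long rows. A sketch that un-pivots a wide frame like the one just produced:

```python
import pandas as pd

wide = pd.DataFrame({
    'date': ['2024-01-01', '2024-01-02'],
    'clicks': [50, 65],
    'views': [1000, 1200]
})

tidy = pd.melt(wide, id_vars='date',
               var_name='metric', value_name='value')
print(tidy)
# Each (date, metric) combination is its own row again
```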
Watch out: pivot() will raise an error if there are duplicate combinations. Use pivot_table() when you need to aggregate duplicates. Also, pivot operations can create many NaN values if combinations don't exist in your data.
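The failure mode is easy to reproduce; this sketch feeds a duplicate (index, columns) combination to both functions:

```python
import pandas as pd

# Two rows with the same (date, metric) pair - a duplicate combination
dup = pd.DataFrame({
    'date': ['2024-01-01', '2024-01-01'],
    'metric': ['views', 'views'],
    'value': [1000, 1100]
})

try:
    dup.pivot(index='date', columns='metric', values='value')
except ValueError as err:
    print('pivot failed:', err)

# pivot_table aggregates the duplicates instead (mean by default)
ok = pd.pivot_table(dup, index='date', columns='metric',
                    values='value', aggfunc='mean')
print(ok)   # views = 1050.0
```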

Practice Questions: Pivot Tables

Test your understanding with these coding exercises.

Given:

survey = pd.DataFrame({
    'gender': ['M', 'F', 'M', 'F', 'M', 'F'],
    'response': ['Yes', 'Yes', 'No', 'Yes', 'Yes', 'No']
})

Task: Create a crosstab showing the count of each response by gender.

Solution:
result = pd.crosstab(survey['gender'], survey['response'])
print(result)

Given:

sales = pd.DataFrame({
    'region': ['East', 'West', 'East', 'West'],
    'product': ['A', 'A', 'B', 'B'],
    'sales': [100, 150, 200, 175]
})

Task: Create a pivot table showing total sales by region and product.

Solution:
pivot = pd.pivot_table(sales, values='sales', 
                       index='region', columns='product', 
                       aggfunc='sum')
print(pivot)

Given:

orders = pd.DataFrame({
    'month': ['Jan', 'Jan', 'Feb', 'Feb', 'Jan', 'Feb'],
    'category': ['Electronics', 'Clothing', 'Electronics', 'Clothing', 'Clothing', 'Electronics'],
    'amount': [500, 200, 600, 250, 300, 450]
})

Task: Create a pivot table with average amounts, including row and column totals.

Solution:
pivot = pd.pivot_table(orders, values='amount',
                       index='month', columns='category',
                       aggfunc='mean', margins=True)
print(pivot)

Given:

customers = pd.DataFrame({
    'age_group': ['18-25', '26-35', '18-25', '36-45', '26-35', '18-25', '36-45', '26-35'],
    'purchased': ['Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes']
})

Task: Create a crosstab showing what percentage of each age group purchased (normalize by index/row).

Solution:
result = pd.crosstab(customers['age_group'], 
                     customers['purchased'],
                     normalize='index')  # Normalize by row
print(result)
# Shows percentage of Yes/No within each age group
05

Apply, Map, Transform

When built-in Pandas functions don't cover your needs, you can apply custom functions to your data. Pandas provides three main methods for this: apply() for flexible function application, map() for element-wise transformations, and transform() for group-aware operations that preserve the original shape.

The apply() Method

apply() is the most versatile method. It can work on entire columns, rows, or DataFrames, applying any function you provide:

# Sample data
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'salary': [50000, 60000, 75000]
})

# Apply to a single column (Series)
df['age_doubled'] = df['age'].apply(lambda x: x * 2)
print(df['age_doubled'])
# 0    50
# 1    60
# 2    70

# Apply a named function
def categorize_salary(salary):
    if salary < 55000:
        return 'Entry'
    elif salary < 70000:
        return 'Mid'
    else:
        return 'Senior'

df['level'] = df['salary'].apply(categorize_salary)
print(df[['name', 'salary', 'level']])
#       name  salary   level
# 0    Alice   50000   Entry
# 1      Bob   60000     Mid
# 2  Charlie   75000  Senior

Apply Across Rows

Use axis=1 to apply a function across each row, accessing multiple columns:

# Apply across rows (access multiple columns)
df['description'] = df.apply(
    lambda row: f"{row['name']} is {row['age']} years old",
    axis=1
)
print(df['description'])
# 0       Alice is 25 years old
# 1         Bob is 30 years old
# 2    Charlie is 35 years old

# Calculate bonus based on multiple columns
def calculate_bonus(row):
    base_bonus = row['salary'] * 0.1
    if row['age'] > 30:
        return base_bonus * 1.2  # Extra for seniority
    return base_bonus

df['bonus'] = df.apply(calculate_bonus, axis=1)
print(df[['name', 'salary', 'bonus']])

The map() Method

map() is specifically for Series (single columns) and is great for value substitution using a dictionary or function:

# Map with a dictionary (value substitution)
level_mapping = {
    'Entry': 1,
    'Mid': 2,
    'Senior': 3
}
df['level_code'] = df['level'].map(level_mapping)
print(df[['name', 'level', 'level_code']])
#       name   level  level_code
# 0    Alice   Entry           1
# 1      Bob     Mid           2
# 2  Charlie  Senior           3

# Map with a function
df['name_upper'] = df['name'].map(str.upper)
print(df['name_upper'])
# 0      ALICE
# 1        BOB
# 2    CHARLIE

# Map with lambda
df['age_category'] = df['age'].map(lambda x: 'Young' if x < 30 else 'Mature')
print(df[['name', 'age', 'age_category']])

The transform() Method

transform() is special because it returns data with the same shape as the input. This is particularly useful with GroupBy for creating group-level statistics aligned with original rows:

# Sample data
sales = pd.DataFrame({
    'store': ['NYC', 'NYC', 'LA', 'LA', 'NYC'],
    'product': ['Widget', 'Gadget', 'Widget', 'Gadget', 'Widget'],
    'revenue': [100, 150, 200, 120, 180]
})

# Transform: add group mean to each row (same shape as original)
sales['store_avg'] = sales.groupby('store')['revenue'].transform('mean')
print(sales)
#   store product  revenue  store_avg
# 0   NYC  Widget      100  143.333333
# 1   NYC  Gadget      150  143.333333
# 2    LA  Widget      200  160.000000
# 3    LA  Gadget      120  160.000000
# 4   NYC  Widget      180  143.333333

# Calculate percentage of store total
sales['pct_of_store'] = (sales['revenue'] / 
    sales.groupby('store')['revenue'].transform('sum') * 100)
print(sales[['store', 'revenue', 'pct_of_store']])

Comparing apply, map, and transform

  • map(): Series only; returns a same-size Series. Best for simple value substitution and dictionary mapping.
  • apply(): Series or DataFrame; return shape is flexible (can change shape). Best for complex row/column operations and custom logic.
  • transform(): Series or GroupBy; returns output the same size as its input. Best for group statistics aligned with the original data.
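The shape difference is the key point in that comparison, and a two-line experiment makes it visible: aggregating collapses each group to one row, while transform() returns one value per original row:

```python
import pandas as pd

df = pd.DataFrame({'store': ['NYC', 'NYC', 'LA'],
                   'revenue': [100, 150, 200]})
g = df.groupby('store')['revenue']

collapsed = g.mean()           # one row per group
aligned = g.transform('mean')  # one row per original row

print(len(collapsed), len(aligned))  # 2 3
print(aligned.tolist())              # [125.0, 125.0, 200.0]
```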

Performance Tip: Vectorization

While apply() is flexible, it's often slower than vectorized operations. When possible, use built-in Pandas/NumPy functions:

# Slow: using apply
df['tax'] = df['salary'].apply(lambda x: x * 0.2)

# Fast: vectorized operation (preferred)
df['tax'] = df['salary'] * 0.2

# Slow: apply with condition
df['bonus'] = df['age'].apply(lambda x: 1000 if x > 30 else 500)

# Fast: np.where (preferred)
import numpy as np
df['bonus'] = np.where(df['age'] > 30, 1000, 500)
Pro tip: Use apply() when you need complex logic that can't be expressed with vectorized operations. For simple transformations, always prefer built-in Pandas methods or NumPy functions for better performance.

Practice Questions: Apply, Map, Transform

Test your understanding with these coding exercises.

Given:

df = pd.DataFrame({
    'grade': ['A', 'B', 'C', 'A', 'B']
})

Task: Create a new column 'points' that maps A=4, B=3, C=2.

Solution:
grade_points = {'A': 4, 'B': 3, 'C': 2}
df['points'] = df['grade'].map(grade_points)
print(df)

Given:

df = pd.DataFrame({
    'price': [10.5, 20.0, 15.75, 8.25]
})

Task: Create a 'price_with_tax' column that adds 10% tax to each price.

Solution:
# Using apply
df['price_with_tax'] = df['price'].apply(lambda x: x * 1.1)

# Better: vectorized
df['price_with_tax'] = df['price'] * 1.1
print(df)

Given:

df = pd.DataFrame({
    'first_name': ['John', 'Jane', 'Bob'],
    'last_name': ['Doe', 'Smith', 'Johnson']
})

Task: Create a 'full_name' column combining first and last names.

Solution:
# Using apply with axis=1
df['full_name'] = df.apply(
    lambda row: f"{row['first_name']} {row['last_name']}", 
    axis=1
)

# Better: string concatenation
df['full_name'] = df['first_name'] + ' ' + df['last_name']
print(df)

Given:

sales = pd.DataFrame({
    'region': ['East', 'East', 'West', 'West', 'East'],
    'amount': [100, 150, 200, 180, 120]
})

Task: Add a column showing how each sale compares to its region's average (difference from mean).

Solution:
sales['diff_from_avg'] = (sales['amount'] - 
    sales.groupby('region')['amount'].transform('mean'))
print(sales)
# Shows positive for above-average, negative for below

Key Takeaways

GroupBy Power

Use groupby() to split data into groups and aggregate with sum(), mean(), or custom functions via agg()

Merge Mastery

Combine DataFrames with merge() using inner, left, right, or outer joins. Use concat() to stack DataFrames vertically or horizontally

Pivot Tables

Reshape data with pivot_table() for summary views. Use crosstab() to count category combinations with margins

Custom Functions

Apply custom logic with apply(). Use map() for value substitution and transform() for group-aligned operations

Performance Matters

Prefer vectorized operations over apply() when possible. Use NumPy functions like np.where() for conditional logic

Named Aggregations

Use named aggregations for cleaner output: agg(total=('col', 'sum')) gives you control over column names

Knowledge Check

Test your understanding of data transformation concepts:

  • What does groupby('category')['value'].sum() return?
  • Which merge type keeps ALL rows from BOTH DataFrames?
  • What is the purpose of margins=True in a pivot table?
  • What is the key difference between apply() and transform()?
  • Which function is best for substituting values using a dictionary?
  • When should you use pd.concat() instead of pd.merge()?