Categorical Encoding Basics
Imagine trying to teach a calculator about colors — it only understands numbers! Machine learning models are the same way. When your data has categories like "red/blue/green" or "small/medium/large", you need to convert them into numbers the model can work with. But here's the trick: how you convert them matters a lot! If you assign red=1, blue=2, green=3, the model might think green is "bigger" than red — which makes no sense for colors. This section teaches you the right way to encode different types of categorical data.
Types of Categorical Variables
Nominal (No Order): Categories with no natural ranking — like colors (red isn't "better" than blue), countries, or blood types. You can't sort them in any meaningful way. Ordinal (Has Order): Categories that have a natural sequence — like T-shirt sizes (S < M < L < XL), education levels (High School < Bachelor < Master < PhD), or star ratings (1★ < 2★ < 3★). The order matters and your encoding should preserve it!
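If you want the nominal/ordinal distinction to live in the data itself, pandas offers an ordered Categorical dtype (a small optional illustration; the encoders below don't require it):

```python
import pandas as pd

# Nominal: just categories, no order declared
colors = pd.Categorical(['red', 'blue', 'green'])

# Ordinal: declare the order explicitly
sizes = pd.Categorical(['M', 'S', 'XL', 'L'],
                       categories=['S', 'M', 'L', 'XL'],
                       ordered=True)
s = pd.Series(sizes)

print(s.sort_values().tolist())  # sorts by size order, not alphabetically
print((s > 'M').tolist())        # comparisons respect the declared order
```

Sorting and comparisons now follow the declared S < M < L < XL order instead of alphabetical order.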
One-Hot Encoding
The Problem: If we encode colors as red=1, blue=2, green=3, the model might learn nonsense arithmetic like "red + blue = green" (1 + 2 = 3). The Solution: One-hot encoding creates a separate yes/no column for each category. Instead of one "color" column, you get "is_red", "is_blue", "is_green" columns with 1s and 0s. Now there's no fake ordering, and the model treats each color independently!
# Step 1: Import and create sample data
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
data = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'red', 'blue'],
    'size': ['S', 'M', 'L', 'M', 'S'],
    'price': [10, 20, 15, 12, 18]
})
Setting up our example: We have a simple product dataset with two categorical columns: color (nominal — no natural order between red, blue, green) and size (ordinal — S < M < L makes sense). The price column is already numerical, so it doesn't need encoding. Our goal: convert color and size into numbers the right way!
# Step 2: Apply one-hot encoding with sklearn
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
color_encoded = encoder.fit_transform(data[['color']])
# Create DataFrame with meaningful column names
color_df = pd.DataFrame(
    color_encoded,
    columns=encoder.get_feature_names_out(['color'])
)
print(color_df)
What's happening: OneHotEncoder transforms our single "color" column into three separate columns: color_blue, color_green, color_red. Each row gets a 1 in exactly one column (the matching color) and 0s everywhere else.
- sparse_output=False: Returns a regular array instead of a memory-efficient sparse matrix (easier to read and debug)
- handle_unknown='ignore': If the model sees a new color during prediction (like "yellow"), it won't crash — it just puts 0s in all columns
- get_feature_names_out(): Gives us nice column names like "color_red" instead of "x0_red"
# Step 3: Pandas alternative (simpler for exploration)
color_dummies = pd.get_dummies(data['color'], prefix='color')
print(color_dummies)
# Drop first to avoid multicollinearity (for linear models)
color_dummies_drop = pd.get_dummies(data['color'], prefix='color', drop_first=True)
print("\nWith drop_first=True:")
print(color_dummies_drop)
The quick pandas way: pd.get_dummies() does the same thing in one line! Great for quick exploration, but there's a catch: it doesn't "remember" the categories. If your test data has different colors than training data, you'll get mismatched columns.
Why drop_first=True? If you know "is_red" and "is_blue" are both 0, then it MUST be green — you don't need a third column! Dropping one column avoids redundancy (called the "dummy variable trap") which can confuse linear models. Tree-based models don't care, so you can skip this for Random Forest or XGBoost.
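To see the "doesn't remember the categories" pitfall concretely, here is a minimal sketch: test data missing some colors yields fewer dummy columns, and reindexing against the training columns restores alignment (this fix is one common workaround, not the only one):

```python
import pandas as pd

train = pd.Series(['red', 'blue', 'green'])
test = pd.Series(['red', 'red'])  # 'blue' and 'green' never appear

train_d = pd.get_dummies(train, prefix='color')
test_d = pd.get_dummies(test, prefix='color')
print(train_d.columns.tolist())  # three columns
print(test_d.columns.tolist())   # only one column: mismatch!

# Fix: align the test dummies to the training columns, filling gaps with 0
test_aligned = test_d.reindex(columns=train_d.columns, fill_value=0)
print(test_aligned.columns.tolist())
```

This is exactly the bookkeeping that OneHotEncoder's fit/transform split handles for you automatically.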
Label Encoding
Simple but dangerous! Label encoding just assigns numbers: red=0, blue=1, green=2. It's compact (one column instead of three), but creates a fake ordering. Use it only when: (1) your categories actually have an order, or (2) you're using tree-based models (like Random Forest or XGBoost) which only look at "is value above or below this threshold?" and don't care about the actual numbers:
from sklearn.preprocessing import LabelEncoder
# Label encoding for target variable
le = LabelEncoder()
data['color_label'] = le.fit_transform(data['color'])
print(data[['color', 'color_label']])
# See the mapping
print("\nMapping:", dict(zip(le.classes_, range(len(le.classes_)))))
How it works: LabelEncoder alphabetically sorts categories and assigns numbers: blue=0, green=1, red=2. The .classes_ attribute shows you the order it learned.
Big Warning: This implies red > green > blue (2 > 1 > 0), which is nonsense for colors! A linear model would think "if I increase the color number, the prediction changes" — but colors don't work that way. Only use LabelEncoder for: (1) truly ordinal data, (2) the target variable (y), or (3) tree-based models that don't assume order.
Ordinal Encoding
When order matters, encode it correctly! T-shirt sizes have a real order: S < M < L < XL. OrdinalEncoder lets you explicitly define this order so the model knows that XL > L > M > S. This is better than LabelEncoder because YOU control the order instead of relying on alphabetical sorting (which would give L=0, M=1, S=2, XL=3 — completely wrong!):
from sklearn.preprocessing import OrdinalEncoder
# Define the order explicitly
size_order = [['S', 'M', 'L', 'XL']] # Small to Large
ordinal_encoder = OrdinalEncoder(categories=size_order)
data['size_ordinal'] = ordinal_encoder.fit_transform(data[['size']])
print(data[['size', 'size_ordinal']])
You're in control: By specifying categories=[['S', 'M', 'L', 'XL']], we guarantee S=0, M=1, L=2, XL=3 — the correct order! Now the model can learn things like "larger sizes cost more" because the numbers actually reflect the size relationship.
Pro tip: The categories parameter takes a list of lists (one list per column). If you have multiple ordinal columns, you'd do: categories=[size_order, rating_order]
Quick guide to choosing an encoder:
- One-Hot: Nominal data with few categories (<10)
- Label/Ordinal: Ordinal data or tree-based models
- Avoid: One-hot with many categories (causes dimensionality explosion)
Practice Questions
Task: Encode education levels ['High School', 'Bachelor', 'Master', 'PhD'] using OrdinalEncoder with the correct order.
Show Solution
from sklearn.preprocessing import OrdinalEncoder
import pandas as pd
data = pd.DataFrame({
    'education': ['Bachelor', 'PhD', 'High School', 'Master', 'Bachelor']
})
# Define order from lowest to highest
edu_order = [['High School', 'Bachelor', 'Master', 'PhD']]
encoder = OrdinalEncoder(categories=edu_order)
data['education_encoded'] = encoder.fit_transform(data[['education']])
print(data)
# High School=0, Bachelor=1, Master=2, PhD=3
Task: One-hot encode a 'color' column with values ['red', 'blue', 'green', 'red', 'blue']. Use drop='first' to avoid the dummy variable trap.
Show Solution
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
data = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'red', 'blue']
})
# One-hot encode with drop='first'
encoder = OneHotEncoder(sparse_output=False, drop='first')
encoded = encoder.fit_transform(data[['color']])
# Create DataFrame with column names
feature_names = encoder.get_feature_names_out(['color'])
encoded_df = pd.DataFrame(encoded, columns=feature_names)
print("Original:")
print(data)
print("\nOne-Hot Encoded (dropped first):")
print(encoded_df)
Task: Train a OneHotEncoder on colors ['red', 'blue'], then transform new data that contains 'green'. Use handle_unknown='ignore' to handle the unseen category.
Show Solution
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
# Training data
train_data = pd.DataFrame({'color': ['red', 'blue', 'red', 'blue']})
# Test data with unseen category
test_data = pd.DataFrame({'color': ['red', 'green', 'blue']})
# Encoder that ignores unknown categories
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoder.fit(train_data[['color']])
# Transform training data
train_encoded = encoder.transform(train_data[['color']])
print("Train encoded:")
print(train_encoded)
# Transform test data (green becomes all zeros)
test_encoded = encoder.transform(test_data[['color']])
print("\nTest encoded (green = all zeros):")
print(test_encoded)
Task: Use OrdinalEncoder to encode two ordinal columns: 'size' (S, M, L, XL) and 'satisfaction' (Poor, Fair, Good, Excellent) with their correct orders.
Show Solution
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
data = pd.DataFrame({
    'size': ['M', 'XL', 'S', 'L', 'M'],
    'satisfaction': ['Good', 'Poor', 'Excellent', 'Fair', 'Good']
})
# Define orders for both columns
size_order = ['S', 'M', 'L', 'XL']
satisfaction_order = ['Poor', 'Fair', 'Good', 'Excellent']
encoder = OrdinalEncoder(categories=[size_order, satisfaction_order])
data[['size_encoded', 'satisfaction_encoded']] = encoder.fit_transform(
    data[['size', 'satisfaction']]
)
print(data)
# size: S=0, M=1, L=2, XL=3
# satisfaction: Poor=0, Fair=1, Good=2, Excellent=3
Task: For a 'city' column with 5 cities, apply Label, One-Hot, and Ordinal encoding. Print the resulting shapes and discuss which is best for which model type.
Show Solution
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder
data = pd.DataFrame({
    'city': ['NYC', 'LA', 'Chicago', 'Houston', 'Phoenix', 'NYC', 'LA']
})
# Label Encoding - 1 column
le = LabelEncoder()
label_encoded = le.fit_transform(data['city'])
print(f"Label Encoded shape: {label_encoded.reshape(-1,1).shape}")
# One-Hot Encoding - 5 columns (or 4 with drop='first')
ohe = OneHotEncoder(sparse_output=False)
onehot_encoded = ohe.fit_transform(data[['city']])
print(f"One-Hot Encoded shape: {onehot_encoded.shape}")
# Ordinal Encoding - 1 column
oe = OrdinalEncoder()
ordinal_encoded = oe.fit_transform(data[['city']])
print(f"Ordinal Encoded shape: {ordinal_encoded.shape}")
print("""
Best uses:
- Label/Ordinal: Tree-based models (XGBoost, RandomForest)
- One-Hot: Linear models (Logistic Regression, SVM, Neural Networks)
- One-Hot: When categories have no natural order
""")
Task: One-hot encode a column, then use inverse_transform to convert the encoded data back to original categories. This is useful for interpreting model predictions.
Show Solution
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
# Original data
data = pd.DataFrame({
    'fruit': ['apple', 'banana', 'cherry', 'apple', 'banana']
})
# Encode
encoder = OneHotEncoder(sparse_output=False)
encoded = encoder.fit_transform(data[['fruit']])
print("Encoded:")
print(encoded)
# Inverse transform back to original
decoded = encoder.inverse_transform(encoded)
print("\nDecoded back:")
print(decoded.flatten())
# Verify they match
print(f"\nOriginal matches decoded: {np.array_equal(data['fruit'].values, decoded.flatten())}")
Advanced Encoding Techniques
What if your "city" column has 10,000 different cities? One-hot encoding would create 10,000 new columns — your model would drown in data! This is called the high-cardinality problem. Advanced encoding techniques solve this by creating smarter, more compact representations. Target encoding replaces each city with how the target variable (like price or sales) behaves in that city. Frequency encoding uses how common each city is. These techniques keep your data manageable while preserving useful information.
Target Encoding (Mean Encoding)
The clever idea: Instead of creating columns for each city, replace the city name with what we actually care about — the average of our target variable for that city! If NYC apartments average $500K, replace "NYC" with 500000. Now one column captures the essence of thousands of categories. The danger: This uses information from the target variable, which can cause "data leakage" if not done carefully:
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold
# Sample data
data = pd.DataFrame({
    'city': ['NYC', 'LA', 'NYC', 'Chicago', 'LA', 'NYC', 'Chicago', 'LA'],
    'price': [500, 300, 550, 200, 350, 480, 220, 320]
})
# Simple target encoding (prone to leakage - for demo only)
city_means = data.groupby('city')['price'].mean()
data['city_target_encoded'] = data['city'].map(city_means)
print(data)
What's happening: We calculate the average price for each city (NYC→$510, LA→$323, Chicago→$210) and replace the city names with these averages. Now instead of 3 columns (one-hot) or arbitrary numbers (label), we have ONE column with meaningful values!
Data Leakage Warning: This simple version "cheats" by using ALL the data (including the rows we're encoding) to calculate the means. In a real project, this leaks information from the test set into training. The next code block shows the proper way!
# Proper target encoding with cross-validation (prevents leakage)
from sklearn.model_selection import KFold
def target_encode_cv(df, col, target, n_splits=5):
    """Target encoding with CV to prevent leakage."""
    df = df.copy()
    df['target_encoded'] = np.nan
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    for train_idx, val_idx in kf.split(df):
        # Calculate means on training fold only
        means = df.iloc[train_idx].groupby(col)[target].mean()
        # Apply to validation fold
        df.loc[val_idx, 'target_encoded'] = df.loc[val_idx, col].map(means)
    # Fill any NaN with global mean
    df['target_encoded'] = df['target_encoded'].fillna(df[target].mean())
    return df['target_encoded']
data['city_target_cv'] = target_encode_cv(data, 'city', 'price')
print(data[['city', 'price', 'city_target_cv']])
The right way — Cross-Validated Target Encoding: We split the data into folds (like in cross-validation). For each row, we calculate the city's mean using ONLY the other folds — never using the row's own target value. This prevents the model from "memorizing" the answers.
- n_splits=5: Data is split into 5 parts; each part gets encoded using means from the other 4 parts
- fillna with global mean: If a city appears only in one fold, we use the overall average as a fallback
- Result: Slightly different encoding values than the simple version, but much more reliable for real predictions!
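The CV scheme above produces the training-set encoding; for a genuinely held-out test set, one common convention (an assumption about your workflow, not spelled out above) is to map means computed on the full training set, with the global mean as the fallback for unseen categories:

```python
import pandas as pd

train = pd.DataFrame({
    'city': ['NYC', 'LA', 'NYC', 'Chicago', 'LA', 'NYC'],
    'price': [500, 300, 550, 200, 350, 480]
})
test = pd.DataFrame({'city': ['NYC', 'Chicago', 'Boston']})  # Boston is unseen

# Category means come from training data only; the test target is never touched
means = train.groupby('city')['price'].mean()
global_mean = train['price'].mean()
test['city_encoded'] = test['city'].map(means).fillna(global_mean)
print(test)
```

There is no leakage here because the test rows contribute nothing to the means they receive.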
Frequency Encoding
Simple and safe! Sometimes how common a category is tells you something useful. Popular products might behave differently than rare ones. Big cities might have different patterns than small towns. Frequency encoding replaces each category with how often it appears — and since we're not using the target variable, there's zero risk of data leakage:
# Frequency encoding
freq = data['city'].value_counts()
data['city_freq'] = data['city'].map(freq)
# Or as proportion
prop = data['city'].value_counts(normalize=True)
data['city_prop'] = data['city'].map(prop)
print(data[['city', 'city_freq', 'city_prop']])
Two flavors of frequency:
- Count (city_freq): NYC appears 3 times, LA appears 3 times, Chicago appears 2 times. Raw counts work well when the total dataset size is fixed.
- Proportion (city_prop): NYC is 37.5% of the data, etc. Proportions are better when you might have different-sized datasets (like training vs. test).
When to use: Great when popularity/rarity matters! For example, rare product categories might have higher prices (limited edition), or frequent customer segments might have better retention.
Feature Hashing
For extreme situations! What if you have millions of categories (like user IDs or product SKUs)? Feature hashing uses a mathematical trick: it converts each category into a fixed number of columns using a "hash function". Think of it like assigning locker numbers in a school — 10,000 students get assigned to 100 lockers (some share). It's not perfect (some categories "collide" into the same slot), but it's incredibly memory-efficient:
from sklearn.feature_extraction import FeatureHasher
# High cardinality example
cities = [{'city': 'New York'}, {'city': 'Los Angeles'},
          {'city': 'Chicago'}, {'city': 'Houston'}]
# Hash into 4 features (usually use more in practice)
hasher = FeatureHasher(n_features=4, input_type='dict')
hashed = hasher.transform(cities)
print("Hashed features:")
print(hashed.toarray())
How it works: The hash function converts "New York" into a pattern of numbers across 4 slots. Each city gets a different pattern. Sometimes two cities might get similar patterns (collision), but with enough slots (usually 1000+), this is rare.
- n_features=4: We're using just 4 slots for demo. In practice, use 1000+ to minimize collisions.
- input_type='dict': We're passing dictionaries like {'city': 'New York'}
- Memory magic: Even with 1 million unique categories, you only have 1000 columns (not 1 million!)
Practice Questions
Task: Implement leave-one-out target encoding where each row uses the mean of all other rows in its category (excluding itself).
Show Solution
import pandas as pd
import numpy as np
data = pd.DataFrame({
    'category': ['A', 'A', 'A', 'B', 'B', 'C'],
    'target': [10, 20, 30, 100, 120, 50]
})

def loo_target_encode(df, cat_col, target_col):
    """Leave-one-out target encoding."""
    df = df.copy()
    # Sum and count per category
    agg = df.groupby(cat_col)[target_col].agg(['sum', 'count'])
    # For each row, calculate mean excluding itself
    df['cat_sum'] = df[cat_col].map(agg['sum'])
    df['cat_count'] = df[cat_col].map(agg['count'])
    df['loo_encoded'] = (df['cat_sum'] - df[target_col]) / (df['cat_count'] - 1)
    # Categories with only 1 sample give 0/0 = NaN; fall back to the global mean
    df['loo_encoded'] = df['loo_encoded'].fillna(df[target_col].mean())
    return df['loo_encoded']
data['target_loo'] = loo_target_encode(data, 'category', 'target')
print(data)
Task: Apply frequency encoding to a 'browser' column with values ['Chrome', 'Firefox', 'Chrome', 'Safari', 'Chrome', 'Firefox']. Show both count and proportion versions.
Show Solution
import pandas as pd
data = pd.DataFrame({
    'browser': ['Chrome', 'Firefox', 'Chrome', 'Safari', 'Chrome', 'Firefox']
})
# Count-based frequency encoding
freq_count = data['browser'].value_counts()
data['browser_count'] = data['browser'].map(freq_count)
# Proportion-based frequency encoding
freq_prop = data['browser'].value_counts(normalize=True)
data['browser_proportion'] = data['browser'].map(freq_prop)
print(data)
# Chrome: count=3, proportion=0.5
# Firefox: count=2, proportion=0.333
# Safari: count=1, proportion=0.167
Task: Implement target encoding with smoothing. Categories with few samples should be pulled toward the global mean to reduce noise.
Show Solution
import pandas as pd
import numpy as np
data = pd.DataFrame({
    'category': ['A', 'A', 'A', 'A', 'A', 'B', 'C'],
    'target': [100, 110, 90, 105, 95, 50, 200]  # C has only 1 sample
})

def smoothed_target_encode(df, cat_col, target_col, smoothing=10):
    """Target encoding with smoothing toward global mean."""
    global_mean = df[target_col].mean()
    agg = df.groupby(cat_col)[target_col].agg(['mean', 'count'])
    # Smoothing formula: weighted average of category mean and global mean
    # More samples = trust category mean more
    smoothed = (agg['count'] * agg['mean'] + smoothing * global_mean) / (agg['count'] + smoothing)
    return df[cat_col].map(smoothed)

data['encoded_no_smooth'] = data['category'].map(
    data.groupby('category')['target'].mean()
)
data['encoded_smoothed'] = smoothed_target_encode(data, 'category', 'target', smoothing=5)
print(data)
print(f"\nGlobal mean: {data['target'].mean():.2f}")
print("Notice: C (1 sample) is pulled toward global mean with smoothing!")
Task: Use FeatureHasher to encode product names into a fixed-size feature vector. Compare hashing 1000 products into 10 vs 100 features.
Show Solution
from sklearn.feature_extraction import FeatureHasher
import numpy as np
# Simulate 1000 unique products
products = [{'product': f'Product_{i}'} for i in range(1000)]
# Hash into 10 features (lots of collisions expected)
hasher_10 = FeatureHasher(n_features=10, input_type='dict')
hashed_10 = hasher_10.transform(products)
# Hash into 100 features (fewer collisions)
hasher_100 = FeatureHasher(n_features=100, input_type='dict')
hashed_100 = hasher_100.transform(products)
print(f"Original: 1000 unique products")
print(f"Hashed to 10 features: shape = {hashed_10.shape}")
print(f"Hashed to 100 features: shape = {hashed_100.shape}")
# Check for collisions (identical rows)
unique_10 = len(np.unique(hashed_10.toarray(), axis=0))
unique_100 = len(np.unique(hashed_100.toarray(), axis=0))
print(f"\nUnique patterns with 10 features: {unique_10}")
print(f"Unique patterns with 100 features: {unique_100}")
Task: Implement a simple binary encoding: convert category indices to binary representation. For 8 categories, you only need 3 columns (2³=8).
Show Solution
import pandas as pd
import numpy as np
data = pd.DataFrame({
    'day': ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun', 'Mon']
})
# First, label encode
days_order = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
day_to_num = {day: i for i, day in enumerate(days_order)}
data['day_num'] = data['day'].map(day_to_num)
# Binary encode (need 3 bits for 7 values: 0-6)
def to_binary(num, n_bits=3):
    return [int(b) for b in format(num, f'0{n_bits}b')]
binary_cols = data['day_num'].apply(lambda x: pd.Series(to_binary(x, 3)))
binary_cols.columns = ['day_b0', 'day_b1', 'day_b2']
result = pd.concat([data, binary_cols], axis=1)
print(result)
print(f"\nOne-Hot would need 7 columns, Binary only needs 3!")
Task: For a dataset with: ordinal 'satisfaction', nominal 'color', and high-cardinality 'zip_code' columns - apply the appropriate encoding to each.
Show Solution
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
from sklearn.feature_extraction import FeatureHasher
data = pd.DataFrame({
    'satisfaction': ['Good', 'Poor', 'Excellent', 'Fair', 'Good'],
    'color': ['red', 'blue', 'green', 'red', 'blue'],
    'zip_code': ['10001', '90210', '60601', '10001', '33101']
})
# 1. Ordinal encoding for satisfaction
sat_order = [['Poor', 'Fair', 'Good', 'Excellent']]
ordinal_enc = OrdinalEncoder(categories=sat_order)
data['satisfaction_encoded'] = ordinal_enc.fit_transform(data[['satisfaction']])
# 2. One-hot encoding for color (low cardinality)
onehot_enc = OneHotEncoder(sparse_output=False, drop='first')
color_encoded = onehot_enc.fit_transform(data[['color']])
color_cols = pd.DataFrame(color_encoded,
                          columns=onehot_enc.get_feature_names_out(['color']))
data = pd.concat([data, color_cols], axis=1)
# 3. Frequency encoding for zip_code (high cardinality)
zip_freq = data['zip_code'].value_counts(normalize=True)
data['zip_frequency'] = data['zip_code'].map(zip_freq)
print(data.drop(['satisfaction', 'color', 'zip_code'], axis=1))
Feature Scaling Techniques
Picture this: you're calculating the distance between two houses based on their price ($500,000 vs $300,000) and number of bedrooms (3 vs 5). The price difference is 200,000, while the bedroom difference is just 2. The price completely dominates! Many ML algorithms (especially distance-based ones like KNN and gradient-based ones like neural networks) get confused when features have wildly different scales. Feature scaling puts all features on a level playing field so each one contributes fairly to the model's decisions.
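The house example can be checked with a few lines: before scaling, the Euclidean distance between two houses is essentially just the price difference; after scaling, the bedroom count contributes too (a sketch with made-up prices):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# [price, bedrooms] for a handful of houses
X = np.array([[500_000, 3],
              [300_000, 5],
              [400_000, 4],
              [350_000, 2]], dtype=float)

# Raw distance between house 0 and house 1: the 2-bedroom gap is invisible
raw = np.linalg.norm(X[0] - X[1])
print(f"raw distance:    {raw:,.2f}")  # ~200,000, price alone

scaled = StandardScaler().fit_transform(X)
scaled_dist = np.linalg.norm(scaled[0] - scaled[1])
print(f"scaled distance: {scaled_dist:.2f}")  # both features now matter
```

After scaling, a 2-bedroom difference and a $200K price difference are each measured in standard deviations, so neither silently dominates.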
StandardScaler (Z-score Normalization)
The most popular choice! StandardScaler transforms each feature so that it has a mean of 0 and standard deviation of 1. Think of it like converting everyone's height to "how many standard deviations above/below average?" A value of +2 means "2 standard deviations above average", regardless of whether we're measuring height in inches, centimeters, or feet. This works best when your data is roughly normally distributed (bell-curve shaped):
# Step 1: Import and create data
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
data = pd.DataFrame({
    'age': [25, 30, 35, 40, 45, 50, 55, 60],
    'income': [30000, 45000, 50000, 65000, 70000, 80000, 90000, 100000],
    'score': [0.5, 0.6, 0.55, 0.7, 0.75, 0.8, 0.85, 0.9]
})
print("Original data statistics:")
print(data.describe())
See the problem? Our features have wildly different ranges:
- age: 25 to 60 (range of 35)
- income: 30,000 to 100,000 (range of 70,000!)
- score: 0.5 to 0.9 (range of just 0.4)
If a distance-based model like KNN calculates "distance" between two people, income would dominate completely. A $10,000 income difference would seem huge compared to a 10-year age difference, even though both might be equally important!
# Step 2: Apply StandardScaler
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)
data_scaled_df = pd.DataFrame(data_scaled, columns=data.columns)
print("\nAfter StandardScaler:")
print(data_scaled_df.describe().round(2))
The magic formula: z = (x - mean) / std
- Subtract the mean → now the average is 0
- Divide by standard deviation → now the spread is 1
Result: All three features now have mean≈0 and std≈1. A value of -1.5 means "1.5 standard deviations below average" whether it's age, income, or score. Now they're on equal footing!
MinMaxScaler (Normalization)
When you need specific bounds! MinMaxScaler squeezes all values into a fixed range, usually [0, 1]. The smallest value becomes 0, the largest becomes 1, and everything else falls proportionally in between. This is perfect for neural networks (which often expect inputs between 0 and 1) and algorithms that need bounded values. One caveat: because the minimum is shifted to 0, an original value of 0 only stays 0 if it was already the minimum (MaxAbsScaler, covered below, is the scaler that truly preserves zeros):
from sklearn.preprocessing import MinMaxScaler
# Scale to [0, 1]
minmax_scaler = MinMaxScaler()
data_minmax = minmax_scaler.fit_transform(data)
data_minmax_df = pd.DataFrame(data_minmax, columns=data.columns)
print("After MinMaxScaler [0, 1]:")
print(data_minmax_df.describe().round(2))
# Custom range [-1, 1]
minmax_custom = MinMaxScaler(feature_range=(-1, 1))
data_custom = minmax_custom.fit_transform(data)
print("\nCustom range [-1, 1]:")
print(pd.DataFrame(data_custom, columns=data.columns).describe().round(2))
The formula: x_scaled = (x - min) / (max - min)
- Default [0, 1]: Minimum value → 0, maximum value → 1, everything else proportionally between
- Custom range [-1, 1]: Use feature_range=(-1, 1) for algorithms that prefer centered but still bounded data
- Zeros move: the minimum maps to the bottom of the range, so original zeros shift unless 0 was the minimum (use MaxAbsScaler when zeros must stay at zero)
Caution with outliers: If your max value is an extreme outlier (like one income of $10 million), most values will be squeezed near 0!
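That squeeze is easy to demonstrate (a small sketch; the $10M income is an invented outlier):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

incomes = np.array([[30_000], [45_000], [60_000], [80_000], [10_000_000]])
scaled = MinMaxScaler().fit_transform(incomes)
print(scaled.ravel())
# The four ordinary incomes all land below 0.006; only the outlier reaches 1.0
```

All the real variation in the ordinary incomes is compressed into a sliver of the range, which is exactly the situation where RobustScaler (next) is a better fit.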
RobustScaler
Outlier-proof scaling! What if your data has extreme outliers? StandardScaler uses mean and standard deviation, which are heavily influenced by outliers. RobustScaler uses the median (middle value) and IQR (interquartile range — the spread of the middle 50%). Since outliers don't affect the median or IQR, the scaling stays sensible even with crazy outliers:
from sklearn.preprocessing import RobustScaler
# Data with outliers
data_outliers = pd.DataFrame({
    'value': [10, 12, 11, 13, 12, 11, 100, 14, 12]  # 100 is an outlier
})
# Compare scalers
standard = StandardScaler()
robust = RobustScaler()
data_outliers['standard'] = standard.fit_transform(data_outliers[['value']])
data_outliers['robust'] = robust.fit_transform(data_outliers[['value']])
print(data_outliers)
print("\nNotice how RobustScaler handles the outlier better")
See the difference:
- StandardScaler: The outlier (100) pulls the mean way up and inflates the std, making normal values look like they're all clustered near -0.5
- RobustScaler: Uses median (12) and IQR, which ignore the outlier. Normal values get reasonable scaled values, while the outlier just becomes a big number
The formula: x_scaled = (x - median) / IQR. Since median and IQR aren't affected by extreme values, your normal data gets scaled sensibly!
When to Scale
Scale Required
- Gradient descent (Linear/Logistic Regression, Neural Networks)
- Distance-based (KNN, K-Means, SVM with RBF)
- Regularized models (Ridge, Lasso, Elastic Net)
- PCA and other dimensionality reduction
Scale Not Required
- Tree-based models (Decision Trees, Random Forest, XGBoost)
- Naive Bayes
- Rule-based models
- Models that only use feature ordering, not magnitude
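For the scale-required models, wrapping the scaler and the estimator in a Pipeline keeps the fit/transform bookkeeping correct, including under cross-validation. A minimal sketch on synthetic data, where the label depends only on the small-scale feature (the data and threshold here are invented for illustration):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data: two independent features on very different scales
rng = np.random.default_rng(0)
income = rng.normal(50_000, 10_000, 200)  # large scale, pure noise here
age = rng.normal(35, 10, 200)             # small scale, carries the signal
X = np.column_stack([income, age])
y = (age > 35).astype(int)  # label depends only on the small-scale feature

raw_score = cross_val_score(KNeighborsClassifier(), X, y).mean()
scaled_score = cross_val_score(
    make_pipeline(StandardScaler(), KNeighborsClassifier()), X, y
).mean()

print(f"KNN without scaling: {raw_score:.2f}")  # near chance level
print(f"KNN with scaling:    {scaled_score:.2f}")
```

Because the scaler sits inside the pipeline, each CV fold fits the scaler on its own training portion only, so the comparison is leakage-free.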
Practice Questions
Task: Create right-skewed data with outliers and compare StandardScaler, MinMaxScaler, and RobustScaler visually using histograms.
Show Solution
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
# Skewed data with outliers
np.random.seed(42)
data = np.concatenate([
    np.random.exponential(2, 200),
    [50, 60, 70]  # Outliers
]).reshape(-1, 1)
scalers = {
    'Original': None,
    'Standard': StandardScaler(),
    'MinMax': MinMaxScaler(),
    'Robust': RobustScaler()
}
fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for ax, (name, scaler) in zip(axes, scalers.items()):
    if scaler is None:
        scaled = data
    else:
        scaled = scaler.fit_transform(data)
    ax.hist(scaled, bins=30, edgecolor='black')
    ax.set_title(name)
plt.tight_layout()
plt.show()
Task: Demonstrate the correct way to scale: fit on training data, transform both train and test. Show what happens if you fit on test data (data leakage).
Show Solution
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# Sample data
np.random.seed(42)
X = np.random.randn(100, 2) * [10, 100] + [50, 500]
y = np.random.randint(0, 2, 100)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
# CORRECT: Fit on train only, transform both
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test) # Only transform!
print("CORRECT - Fit on train:")
print(f"Train mean: {X_train_scaled.mean(axis=0)}") # Should be ~0
print(f"Test mean: {X_test_scaled.mean(axis=0)}") # Slightly off, that's OK!
# WRONG: Fit on all data (data leakage!)
scaler_wrong = StandardScaler()
X_all_scaled = scaler_wrong.fit_transform(X)
print("\nWRONG - Fit on all data (leakage!):")
print("Test statistics leak into training preprocessing!")
Task: Use MinMaxScaler to scale data to range [-1, 1] instead of the default [0, 1]. Verify the min and max values.
Show Solution
import numpy as np
from sklearn.preprocessing import MinMaxScaler
# Sample data
data = np.array([[10], [20], [30], [40], [50]])
# Default [0, 1] scaling
scaler_default = MinMaxScaler()
scaled_default = scaler_default.fit_transform(data)
# Custom [-1, 1] scaling
scaler_custom = MinMaxScaler(feature_range=(-1, 1))
scaled_custom = scaler_custom.fit_transform(data)
print("Original:", data.flatten())
print("\nScaled [0, 1]:", scaled_default.flatten())
print(f" Min: {scaled_default.min()}, Max: {scaled_default.max()}")
print("\nScaled [-1, 1]:", scaled_custom.flatten())
print(f" Min: {scaled_custom.min()}, Max: {scaled_custom.max()}")
Task: Use MaxAbsScaler to scale data to [-1, 1] while preserving zeros. This is important for sparse matrices where zeros are meaningful.
Show Solution
import numpy as np
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler
# Data with meaningful zeros
data = np.array([[-100], [0], [0], [50], [0], [25]])
# MaxAbsScaler - preserves zeros
maxabs = MaxAbsScaler()
scaled_maxabs = maxabs.fit_transform(data)
# MinMaxScaler - does NOT preserve zeros
minmax = MinMaxScaler(feature_range=(-1, 1))
scaled_minmax = minmax.fit_transform(data)
print("Original:", data.flatten())
print("\nMaxAbsScaler (zeros preserved):", scaled_maxabs.flatten())
print("MinMaxScaler (zeros shifted):", scaled_minmax.flatten())
print("\nMaxAbsScaler formula: x / max(|x|)")
print("Zero stays zero because 0 / anything = 0")
Task: Scale a target variable (house prices), train a simple model, make predictions, then inverse_transform to get predictions back in original units (dollars).
Show Solution
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
# Sample data: house size -> price
X = np.array([[1000], [1500], [2000], [2500], [3000]]) # sq ft
y = np.array([200000, 300000, 400000, 500000, 600000]) # price in $
# Scale the target (common for neural networks)
y_scaler = StandardScaler()
y_scaled = y_scaler.fit_transform(y.reshape(-1, 1))
# Train model on scaled target
model = LinearRegression()
model.fit(X, y_scaled.ravel())
# Predict for new house
new_house = np.array([[1800]])
prediction_scaled = model.predict(new_house)
# Convert prediction back to dollars!
prediction_dollars = y_scaler.inverse_transform(prediction_scaled.reshape(-1, 1))
print(f"House size: {new_house[0][0]} sq ft")
print(f"Prediction (scaled): {prediction_scaled[0]:.4f}")
print(f"Prediction (dollars): ${prediction_dollars[0][0]:,.0f}")
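scikit-learn can automate this scale-then-inverse-transform dance: `TransformedTargetRegressor` wraps a regressor and a target transformer, so predictions come back in original units with no manual bookkeeping. A minimal sketch on the same house data:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

X = np.array([[1000], [1500], [2000], [2500], [3000]])  # sq ft
y = np.array([200000, 300000, 400000, 500000, 600000])  # dollars

# Scales y internally, trains on the scaled target,
# and inverse-transforms predictions automatically
model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    transformer=StandardScaler()
)
model.fit(X, y)
print(f"${model.predict([[1800]])[0]:,.0f}")  # already in dollars
```

Same result as the manual version, but the inverse transform can never be forgotten at prediction time.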
Task: Use QuantileTransformer to transform heavily skewed data into a normal distribution. Compare the before/after distributions.
Show Solution
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import QuantileTransformer
from scipy import stats
# Create heavily skewed data (exponential)
np.random.seed(42)
data = np.random.exponential(scale=2, size=1000).reshape(-1, 1)
# Transform to normal distribution
qt = QuantileTransformer(output_distribution='normal', random_state=42)
data_normal = qt.fit_transform(data)
# Plot comparison
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].hist(data, bins=50, edgecolor='black')
axes[0].set_title(f'Original (skewness: {stats.skew(data.flatten()):.2f})')
axes[1].hist(data_normal, bins=50, edgecolor='black')
axes[1].set_title(f'After QuantileTransformer (skewness: {stats.skew(data_normal.flatten()):.2f})')
plt.tight_layout()
plt.show()
print("QuantileTransformer maps data to uniform, then to normal distribution")
print("Great for algorithms that assume normally distributed features!")
Handling Missing Values
Real-world data is messy! Surveys have unanswered questions, sensors fail, databases have gaps. If you just delete every row with a missing value, you might lose half your data! Imputation is the art of intelligently filling in missing values. Think of it like a detective filling in the blanks using clues from the rest of the data. Scikit-learn's SimpleImputer and KNNImputer make this easy.
Simple Imputation Strategies
# Step 1: Create data with missing values
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
data = pd.DataFrame({
'age': [25, 30, np.nan, 40, 45, np.nan, 55, 60],
'income': [30000, np.nan, 50000, 65000, np.nan, 80000, 90000, 100000],
'category': ['A', 'B', np.nan, 'A', 'B', 'C', np.nan, 'A']
})
print("Data with missing values:")
print(data)
print(f"\nMissing counts:\n{data.isnull().sum()}")
The problem: Our data has NaN ("Not a Number") values scattered around. Python uses np.nan to represent missing data.
- Numerical missing: age and income have some NaN values
- Categorical missing: category column also has NaN
We need different strategies for different types of data. You wouldn't fill in a missing "color" with an average!
# Step 2: Impute numerical columns
# Mean imputation
mean_imputer = SimpleImputer(strategy='mean')
data['age_mean'] = mean_imputer.fit_transform(data[['age']])
# Median imputation (better for skewed data)
median_imputer = SimpleImputer(strategy='median')
data['income_median'] = median_imputer.fit_transform(data[['income']])
print("After numerical imputation:")
print(data[['age', 'age_mean', 'income', 'income_median']])
Two main strategies for numbers:
- Mean (average): Great when data is normally distributed. "What's the typical age? Let's use that."
- Median (middle value): Better when you have outliers or skewed data. If one person earns $10M, the mean income is misleading — use median instead!
Pro tip: Always check your data distribution first! Median is the safer default for most real-world data.
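A quick illustration of why the median is the safer default when outliers lurk in the data (hypothetical incomes, one extreme value):

```python
import numpy as np

# Four typical incomes plus one extreme outlier
incomes = np.array([30_000, 35_000, 40_000, 45_000, 10_000_000])

print(f"Mean:   {np.mean(incomes):,.0f}")    # dragged way up by the outlier
print(f"Median: {np.median(incomes):,.0f}")  # still a 'typical' income
```

The mean lands above $2M — a value no one in the data actually earns — while the median stays at $40,000.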
# Step 3: Impute categorical columns
cat_imputer = SimpleImputer(strategy='most_frequent')
data['category_mode'] = cat_imputer.fit_transform(data[['category']]).ravel()
# Constant value imputation
const_imputer = SimpleImputer(strategy='constant', fill_value='Unknown')
data['category_unknown'] = const_imputer.fit_transform(data[['category']]).ravel()
print("\nAfter categorical imputation:")
print(data[['category', 'category_mode', 'category_unknown']])
Two strategies for categories:
- most_frequent (mode): Fill with the most common value. "Most customers choose 'Standard' shipping, so let's assume missing ones did too."
- constant: Fill with a specific value like 'Unknown' or 'Missing'. This is powerful because the fact that data was missing might be meaningful!
When to use 'Unknown': If people who didn't answer "income" might be systematically different (maybe embarrassed?), preserving this info helps!
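If you want to preserve the "was missing" signal without a separate constant category, SimpleImputer's `add_indicator` parameter appends binary indicator columns alongside the imputed values. A small sketch:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[25.0], [np.nan], [35.0], [np.nan]])

imputer = SimpleImputer(strategy='median', add_indicator=True)
result = imputer.fit_transform(X)
# Column 0: imputed values (median = 30); column 1: 1 where the value was missing
print(result)
```

The model can then learn from both the filled-in value and the fact that it was filled in.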
KNN Imputation
The smart approach! Instead of just using the average of everyone, what if we looked at similar people? If someone's age is missing but we know they earn $80k and live in NYC, we can find other NYC earners of $80k and use their average age. That's smarter than using the global average!
from sklearn.impute import KNNImputer
# KNN imputation for numerical data
knn_imputer = KNNImputer(n_neighbors=3)
data_knn = knn_imputer.fit_transform(data[['age', 'income']])
print("KNN Imputed (uses similar samples):")
print(pd.DataFrame(data_knn, columns=['age', 'income']).head())
How KNNImputer works:
- n_neighbors=3: Look at the 3 most similar samples (based on features that aren't missing)
- Calculate distance: Find samples with similar values in other columns
- Average their values: Use the average of those k neighbors to fill the gap
Trade-off: KNN imputation is smarter but slower. For small datasets, it's great! For millions of rows, simple imputation might be more practical.
Practice Questions
Task: Create a function that adds binary columns indicating which values were missing before imputation.
Show Solution
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
def impute_with_indicators(df, columns):
    """Impute each column and add a binary was-missing indicator."""
    df = df.copy()
    for col in columns:
        # Add indicator
        df[f'{col}_was_missing'] = df[col].isnull().astype(int)
        # Impute
        imputer = SimpleImputer(strategy='median')
        df[col] = imputer.fit_transform(df[[col]])
    return df
# Test
data = pd.DataFrame({
'age': [25, np.nan, 35, np.nan, 45],
'income': [50000, 60000, np.nan, 80000, np.nan]
})
result = impute_with_indicators(data, ['age', 'income'])
print(result)
Task: Create a feature that counts how many values are missing per row. This can be predictive if missingness has a pattern!
Show Solution
import pandas as pd
import numpy as np
data = pd.DataFrame({
'a': [1, np.nan, 3, np.nan, 5],
'b': [np.nan, 2, np.nan, 4, 5],
'c': [np.nan, np.nan, 3, 4, 5]
})
# Count missing values per row (over the original feature columns)
feature_cols = ['a', 'b', 'c']
data['missing_count'] = data[feature_cols].isnull().sum(axis=1)
# Proportion of missing values per row
data['missing_ratio'] = data[feature_cols].isnull().mean(axis=1)
print(data)
print("\nRows with more missing values might be different!")
print("Example: Users who skip many survey questions")
Task: For skewed data with outliers, compare mean, median, and most_frequent imputation. Which preserves the distribution best?
Show Solution
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
# Skewed data with outliers
np.random.seed(42)
data = np.concatenate([
np.random.exponential(2, 20),
[50, 60] # Outliers
])
# Add some missing values
data[5] = np.nan
data[10] = np.nan
data[15] = np.nan
df = pd.DataFrame({'value': data})
print(f"Original median: {np.nanmedian(data):.2f}")
print(f"Original mean: {np.nanmean(data):.2f}")
# Try all three strategies from the task
for strategy in ['mean', 'median', 'most_frequent']:
    imputer = SimpleImputer(strategy=strategy)
    imputed = imputer.fit_transform(df[['value']])
    fill_value = imputer.statistics_[0]
    print(f"\n{strategy.upper()} imputation fills with: {fill_value:.2f}")
print("\nMedian is better for skewed data - outliers don't affect it!")
Task: Impute missing 'salary' values using the median salary within each 'department'. This is smarter than using the global median.
Show Solution
import pandas as pd
import numpy as np
data = pd.DataFrame({
'department': ['IT', 'IT', 'IT', 'HR', 'HR', 'HR', 'Sales', 'Sales'],
'salary': [80000, np.nan, 90000, 50000, np.nan, 55000, 60000, np.nan]
})
print("Before imputation:")
print(data)
# Group-wise imputation
def impute_by_group(df, group_col, value_col):
    df = df.copy()
    df[value_col] = df.groupby(group_col)[value_col].transform(
        lambda x: x.fillna(x.median())
    )
    return df
data_imputed = impute_by_group(data, 'department', 'salary')
print("\nAfter group-wise imputation:")
print(data_imputed)
print("\nEach missing salary filled with department median!")
print("IT: median of 80k, 90k = 85k")
print("HR: median of 50k, 55k = 52.5k")
Task: Demonstrate why you should scale data before KNN imputation. Compare results with and without scaling.
Show Solution
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler
data = pd.DataFrame({
'age': [25, 30, np.nan, 40, 45], # Range: ~20
'income': [30000, 45000, 50000, np.nan, 70000] # Range: ~40000
})
print("Original data:")
print(data)
# KNN without scaling (income dominates the distance)
knn_unscaled = KNNImputer(n_neighbors=2)
imputed_unscaled = knn_unscaled.fit_transform(data)
# KNN with scaling: standardize with NaN-aware pandas ops,
# impute, then map back to original units
means, stds = data.mean(), data.std()
knn_scaled = KNNImputer(n_neighbors=2)
imputed_scaled = knn_scaled.fit_transform(
    (data - means) / stds
) * stds.values + means.values
print("\nKNN without scaling:")
print(pd.DataFrame(imputed_unscaled, columns=data.columns))
print("\nKNN with scaling:")
print(pd.DataFrame(imputed_scaled, columns=data.columns))
print("\nNote: Without scaling, 'income' dominates the distance calculation!")
print("Scale your data before KNN imputation for better results.")
Task: Use IterativeImputer (MICE - Multiple Imputation by Chained Equations) to impute missing values using relationships between features.
Show Solution
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer, SimpleImputer
# Data with correlated features
np.random.seed(42)
n = 100
age = np.random.uniform(20, 60, n)
income = age * 1000 + np.random.normal(0, 5000, n) # Income correlates with age
data = pd.DataFrame({'age': age, 'income': income})
# Add missing values
data.loc[10:20, 'age'] = np.nan
data.loc[30:40, 'income'] = np.nan
# Compare Simple vs Iterative imputation
simple_imp = SimpleImputer(strategy='mean')
data_simple = pd.DataFrame(
simple_imp.fit_transform(data),
columns=data.columns
)
iter_imp = IterativeImputer(random_state=42, max_iter=10)
data_iter = pd.DataFrame(
iter_imp.fit_transform(data),
columns=data.columns
)
print("Simple Imputer uses global mean (ignores correlations)")
print("Iterative Imputer uses other features to predict missing values")
print("\nSample imputed rows (originally missing age):")
print(f"Simple age mean: {data_simple.loc[10:15, 'age'].mean():.0f}")
print(f"Iterative uses income to predict age!")
Building Preprocessing Pipelines
Here's a nightmare scenario: you preprocess training data (scale, encode, impute), train a model... then realize you need to apply the exact same transformations to test data, but with the training statistics (mean, std, etc.). Miss a step or use wrong stats? Your model fails silently. Pipelines solve this by bundling all steps together into one object. Fit it once, and it remembers everything. Deploy it, and it handles all preprocessing automatically!
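The discipline pipelines automate is "fit on train, transform on test." A minimal sketch of doing it by hand for a single scaler — multiply this by every transformer in a real project and the room for error becomes clear:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[10.0], [20.0], [30.0]])
X_test = np.array([[25.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from train
X_test_scaled = scaler.transform(X_test)        # reuse TRAIN statistics

# Calling fit_transform on X_test instead would be data leakage:
# the test point would be scaled with its own statistics
print(X_test_scaled)
```

A Pipeline performs exactly this fit/transform split for every step, so the mistake becomes impossible to make.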
ColumnTransformer for Mixed Data Types
The real-world challenge: Your data has numbers (scale them!) and categories (encode them!). You can't use the same transformer on both. ColumnTransformer is like a traffic controller — it routes each column type to the right preprocessing pipeline:
# Step 1: Set up the data
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
# Sample data
data = pd.DataFrame({
'age': [25, 30, np.nan, 40, 45, 50],
'income': [30000, 45000, 50000, np.nan, 70000, 80000],
'city': ['NYC', 'LA', 'NYC', 'Chicago', 'LA', 'NYC'],
'education': ['Bachelor', 'Master', 'PhD', 'Bachelor', 'Master', 'PhD'],
'purchased': [0, 1, 1, 0, 1, 1]
})
X = data.drop('purchased', axis=1)
y = data['purchased']
Setting the stage: Look at our mixed data:
- Numerical: age, income → need imputation (missing values) + scaling
- Categorical: city, education → need imputation + one-hot encoding
- Target: purchased → what we're predicting (0 or 1)
Without pipelines, you'd manually preprocess each column, fit transformers on training data, remember to transform (not fit!) test data... lots of room for mistakes!
# Step 2: Define preprocessing for each column type
numerical_features = ['age', 'income']
categorical_features = ['city', 'education']
# Numerical pipeline: impute then scale
numerical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())
])
# Categorical pipeline: impute then encode
categorical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='most_frequent')),
('encoder', OneHotEncoder(handle_unknown='ignore'))
])
Building mini-pipelines: Each transformer type gets its own chain of steps:
- Numerical pipeline: First impute (fill missing with median), then scale (StandardScaler)
- Categorical pipeline: First impute (fill missing with most common value), then one-hot encode
- handle_unknown='ignore': If test data has a city never seen in training, don't crash — encode it as all zeros instead!
Order matters! Impute first, then scale. You can't scale NaN values!
# Step 3: Combine with ColumnTransformer
preprocessor = ColumnTransformer(
transformers=[
('num', numerical_transformer, numerical_features),
('cat', categorical_transformer, categorical_features)
]
)
# Step 4: Create full pipeline with model
full_pipeline = Pipeline(steps=[
('preprocessor', preprocessor),
('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])
# Fit and predict (all preprocessing happens automatically)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
full_pipeline.fit(X_train, y_train)
score = full_pipeline.score(X_test, y_test)
print(f"Accuracy: {score:.4f}")
The magic happens here:
- ColumnTransformer: Routes numerical columns to numerical_transformer, categorical to categorical_transformer
- Full pipeline: Chains preprocessor → classifier. One object does everything!
- fit(): Learns all statistics (mean, categories, etc.) from training data ONLY
- predict(): Applies transformations using those learned statistics, then predicts
No data leakage! When you call fit(X_train), it only sees training data. Test data statistics never leak into preprocessing.
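Because the whole pipeline behaves like a single estimator, it also plugs straight into cross-validation, with the preprocessing re-fit inside each fold so no fold's statistics leak into another. A sketch on a synthetic dataset (the six-row example above is too small for 5-fold CV):

```python
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100, n_features=5, random_state=42)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression())
])

# The scaler is re-fit on each fold's training split only
scores = cross_val_score(pipeline, X, y, cv=5)
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Doing this without a pipeline (scaling X once up front, then cross-validating) would quietly leak each validation fold's statistics into training.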
Using make_column_selector
Even lazier (in a good way!): Instead of listing column names, let sklearn auto-detect columns by their data type. Numbers go one way, text goes another — automatically:
from sklearn.compose import make_column_selector
# Automatically select by dtype
preprocessor_auto = ColumnTransformer(
transformers=[
('num', numerical_transformer, make_column_selector(dtype_include=np.number)),
('cat', categorical_transformer, make_column_selector(dtype_include=object))
]
)
# This automatically handles any numerical or categorical column
Why this is awesome:
- dtype_include=np.number: Automatically grabs all numerical columns (int, float)
- dtype_include=object: Automatically grabs all text/categorical columns
- Future-proof: Add a new numerical column? It's automatically preprocessed correctly!
Pro tip: Great for Kaggle competitions where you don't want to hardcode column names. Just ensure your dtypes are correct!
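Since make_column_selector routes columns purely by dtype, a numeric-coded category — a zip code, say — would be silently sent down the numeric branch and scaled like a measurement. A sketch of the dtype check and fix (hypothetical two-column data):

```python
import pandas as pd

df = pd.DataFrame({
    'income': [50000, 60000],
    'zipcode': [10001, 90210]  # numeric dtype, but really categorical
})
print(df.dtypes)  # both show up as int64

# Convert so make_column_selector(dtype_include=object) picks it up
df['zipcode'] = df['zipcode'].astype(str)
print(df.dtypes)  # zipcode is now object
```

One `df.dtypes` call before building the ColumnTransformer catches most of these surprises.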
What a full pipeline buys you:
- Prevents data leakage (fit only on training data)
- Simplifies code (one object to fit/predict)
- Easy to save and deploy (pickle the whole pipeline)
- Works with GridSearchCV for hyperparameter tuning
Practice Questions
Task: Create a pipeline that preprocesses the titanic-style data and tunes both preprocessing (imputer strategy) and model (n_estimators) hyperparameters using GridSearchCV.
Show Solution
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
import pandas as pd
import numpy as np
# Create titanic-style data
data = pd.DataFrame({
'age': [22, 38, np.nan, 35, np.nan, 54, 2, 27, 14, np.nan],
'fare': [7.25, 71.28, 7.92, 53.1, 8.05, 51.86, 21.07, 11.13, 30.07, 7.87],
'sex': ['male', 'female', 'female', 'female', 'male', 'male', 'male', 'male', 'female', 'male'],
'embarked': ['S', 'C', 'S', 'S', np.nan, 'S', 'S', 'S', 'C', 'S'],
'survived': [0, 1, 1, 1, 0, 0, 0, 0, 1, 1]
})
X = data.drop('survived', axis=1)
y = data['survived']
# Build pipeline
num_transformer = Pipeline([
('imputer', SimpleImputer()),
('scaler', StandardScaler())
])
cat_transformer = Pipeline([
('imputer', SimpleImputer(strategy='most_frequent')),
('encoder', OneHotEncoder(handle_unknown='ignore'))
])
preprocessor = ColumnTransformer([
('num', num_transformer, ['age', 'fare']),
('cat', cat_transformer, ['sex', 'embarked'])
])
pipeline = Pipeline([
('prep', preprocessor),
('clf', RandomForestClassifier(random_state=42))
])
# Grid search over preprocessing AND model params
param_grid = {
'prep__num__imputer__strategy': ['mean', 'median'],
'clf__n_estimators': [50, 100],
'clf__max_depth': [3, 5, None]
}
grid = GridSearchCV(pipeline, param_grid, cv=3, scoring='accuracy')
grid.fit(X, y)
print(f"Best params: {grid.best_params_}")
print(f"Best score: {grid.best_score_:.4f}")
Task: Create a basic pipeline that applies StandardScaler followed by LogisticRegression. Fit on training data and evaluate on test data.
Show Solution
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
# Create sample data
X, y = make_classification(n_samples=100, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Create pipeline
pipeline = Pipeline([
('scaler', StandardScaler()),
('classifier', LogisticRegression())
])
# Fit and evaluate
pipeline.fit(X_train, y_train)
train_score = pipeline.score(X_train, y_train)
test_score = pipeline.score(X_test, y_test)
print(f"Train accuracy: {train_score:.4f}")
print(f"Test accuracy: {test_score:.4f}")
# The pipeline automatically scales test data using training statistics!
Task: After fitting a pipeline, access the fitted scaler to see the learned mean and std, and access the classifier to see its coefficients.
Show Solution
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
# Create and fit pipeline
X, y = make_classification(n_samples=100, n_features=3, random_state=42)
pipeline = Pipeline([
('scaler', StandardScaler()),
('classifier', LogisticRegression())
])
pipeline.fit(X, y)
# Access components using named_steps
scaler = pipeline.named_steps['scaler']
classifier = pipeline.named_steps['classifier']
print("Scaler learned statistics:")
print(f" Mean: {scaler.mean_}")
print(f" Std: {scaler.scale_}")
print("\nClassifier coefficients:")
print(f" Coefficients: {classifier.coef_}")
print(f" Intercept: {classifier.intercept_}")
# Alternative: access by index
print(f"\nAccess by index: {pipeline[0]} is the scaler")
Task: Fit a pipeline, save it to a file using joblib, then load it back and make predictions. This is essential for deployment!
Show Solution
import joblib
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
import numpy as np
# Create and fit pipeline
X, y = make_classification(n_samples=100, n_features=5, random_state=42)
pipeline = Pipeline([
('scaler', StandardScaler()),
('classifier', RandomForestClassifier(n_estimators=10, random_state=42))
])
pipeline.fit(X, y)
# Save the entire pipeline (including fitted transformers!)
joblib.dump(pipeline, 'my_pipeline.pkl')
print("Pipeline saved!")
# Later, in production...
loaded_pipeline = joblib.load('my_pipeline.pkl')
print("Pipeline loaded!")
# Make predictions with new data
new_data = np.random.randn(5, 5) # 5 new samples
predictions = loaded_pipeline.predict(new_data)
print(f"Predictions: {predictions}")
# The loaded pipeline includes ALL preprocessing steps!
Task: Create a pipeline that scales data, selects the top 5 features using SelectKBest, then trains a classifier.
Show Solution
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
import numpy as np
# Create data with many features
X, y = make_classification(n_samples=200, n_features=20,
n_informative=5, random_state=42)
# Pipeline: scale -> select features -> classify
pipeline = Pipeline([
('scaler', StandardScaler()),
('selector', SelectKBest(score_func=f_classif, k=5)),
('classifier', LogisticRegression())
])
pipeline.fit(X, y)
print(f"Original features: {X.shape[1]}")
# See which features were selected
selector = pipeline.named_steps['selector']
selected_mask = selector.get_support()
print(f"Selected feature indices: {np.where(selected_mask)[0]}")
print(f"Feature scores: {selector.scores_[:10]}...") # First 10
print(f"\nTest accuracy: {pipeline.score(X, y):.4f}")
Task: Create a custom transformer that adds polynomial features (squares of each column), then use it in a pipeline.
Show Solution
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
import numpy as np
# Custom transformer
class SquareFeatures(BaseEstimator, TransformerMixin):
    """Adds squared versions of all features."""
    def fit(self, X, y=None):
        self.n_features_ = X.shape[1]  # remember input width for feature names
        return self
    def transform(self, X):
        X_squared = X ** 2
        return np.hstack([X, X_squared])
    def get_feature_names_out(self, input_features=None):
        if input_features is None:
            input_features = [f'x{i}' for i in range(self.n_features_)]
        squared_names = [f'{name}_squared' for name in input_features]
        return list(input_features) + squared_names
# Use in pipeline
X = np.random.randn(100, 3)
y = X[:, 0] + X[:, 1]**2 + np.random.randn(100) * 0.1 # Nonlinear target
pipeline = Pipeline([
('add_squares', SquareFeatures()),
('scaler', StandardScaler()),
('regressor', LinearRegression())
])
pipeline.fit(X, y)
print(f"Input features: {X.shape[1]}")
print(f"After SquareFeatures: {pipeline.named_steps['add_squares'].transform(X).shape[1]}")
print(f"R² score: {pipeline.score(X, y):.4f}")
Key Takeaways
Match Encoding to Data Type
Use OneHotEncoder for nominal categories with few values. Use OrdinalEncoder for ordinal data with explicit ordering.
Target Encoding for High Cardinality
When one-hot creates too many columns, use target encoding with cross-validation or frequency encoding to avoid data leakage.
Scale for Gradient & Distance Models
StandardScaler for normal data, MinMaxScaler for bounded inputs, RobustScaler when outliers exist.
Handle Missing Values Carefully
Use SimpleImputer with mean/median for numerical, mode/constant for categorical. Consider adding missing indicators.
Use Pipelines for Everything
Pipeline + ColumnTransformer ensures consistent preprocessing, prevents leakage, and simplifies deployment.
Prevent Data Leakage
Always fit transformers on training data only. Pipelines handle this automatically when used with train_test_split or cross-validation.
Knowledge Check
When should you use OneHotEncoder vs OrdinalEncoder?
What is the main advantage of RobustScaler over StandardScaler?
Why is target encoding prone to data leakage?
Which algorithms do NOT require feature scaling?
What does ColumnTransformer allow you to do?
What imputation strategy is best for categorical variables?