Categorical Encoding Basics
Imagine trying to teach a calculator about colors — it only understands numbers! Machine learning models are the same way. When your data has categories like "red/blue/green" or "small/medium/large", you need to convert them into numbers the model can work with. But here's the trick: how you convert them matters a lot! If you assign red=1, blue=2, green=3, the model might think green is "bigger" than red — which makes no sense for colors. This section teaches you the right way to encode different types of categorical data.
Types of Categorical Variables
Nominal (No Order): Categories with no natural ranking — like colors (red isn't "better" than blue), countries, or blood types. You can't sort them in any meaningful way. Ordinal (Has Order): Categories that have a natural sequence — like T-shirt sizes (S < M < L < XL), education levels (High School < Bachelor < Master < PhD), or star ratings (1★ < 2★ < 3★). The order matters and your encoding should preserve it!
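If you want the nominal/ordinal distinction to live in the data itself, pandas offers an ordered Categorical dtype (a small optional illustration; the encoders below don't require it):

```python
import pandas as pd

# Nominal: just categories, no order declared
colors = pd.Categorical(['red', 'blue', 'green'])

# Ordinal: declare the order explicitly
sizes = pd.Categorical(['M', 'S', 'XL', 'L'],
                       categories=['S', 'M', 'L', 'XL'],
                       ordered=True)
s = pd.Series(sizes)

print(s.sort_values().tolist())  # sorts by size order, not alphabetically
print((s > 'M').tolist())        # comparisons respect the declared order
```

Sorting and comparisons now follow the declared S < M < L < XL order instead of alphabetical order.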
One-Hot Encoding
The Problem: If we encode colors as red=1, blue=2, green=3, the model might learn nonsense arithmetic like "red + blue = green" (1 + 2 = 3). The Solution: One-hot encoding creates a separate yes/no column for each category. Instead of one "color" column, you get "is_red", "is_blue", "is_green" columns with 1s and 0s. Now there's no fake ordering, and the model treats each color independently!
# Step 1: Import and create sample data
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
data = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'red', 'blue'],
    'size': ['S', 'M', 'L', 'M', 'S'],
    'price': [10, 20, 15, 12, 18]
})
Setting up our example: We have a simple product dataset with two categorical columns: color (nominal — no natural order between red, blue, green) and size (ordinal — S < M < L makes sense). The price column is already numerical, so it doesn't need encoding. Our goal: convert color and size into numbers the right way!
# Step 2: Apply one-hot encoding with sklearn
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
color_encoded = encoder.fit_transform(data[['color']])
# Create DataFrame with meaningful column names
color_df = pd.DataFrame(
    color_encoded,
    columns=encoder.get_feature_names_out(['color'])
)
print(color_df)
What's happening: OneHotEncoder transforms our single "color" column into three separate columns: color_blue, color_green, color_red. Each row gets a 1 in exactly one column (the matching color) and 0s everywhere else.
- sparse_output=False: Returns a regular array instead of a memory-efficient sparse matrix (easier to read and debug)
- handle_unknown='ignore': If the model sees a new color during prediction (like "yellow"), it won't crash — it just puts 0s in all columns
- get_feature_names_out(): Gives us nice column names like "color_red" instead of "x0_red"
# Step 3: Pandas alternative (simpler for exploration)
color_dummies = pd.get_dummies(data['color'], prefix='color')
print(color_dummies)
# Drop first to avoid multicollinearity (for linear models)
color_dummies_drop = pd.get_dummies(data['color'], prefix='color', drop_first=True)
print("\nWith drop_first=True:")
print(color_dummies_drop)
The quick pandas way: pd.get_dummies() does the same thing in one line! Great for quick exploration, but there's a catch: it doesn't "remember" the categories. If your test data has different colors than training data, you'll get mismatched columns.
Why drop_first=True? If you know "is_red" and "is_blue" are both 0, then it MUST be green — you don't need a third column! Dropping one column avoids redundancy (called the "dummy variable trap") which can confuse linear models. Tree-based models don't care, so you can skip this for Random Forest or XGBoost.
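To see the "doesn't remember the categories" pitfall concretely, here is a minimal sketch: test data missing some colors yields fewer dummy columns, and reindexing against the training columns restores alignment (this fix is one common workaround, not the only one):

```python
import pandas as pd

train = pd.Series(['red', 'blue', 'green'])
test = pd.Series(['red', 'red'])  # 'blue' and 'green' never appear

train_d = pd.get_dummies(train, prefix='color')
test_d = pd.get_dummies(test, prefix='color')
print(train_d.columns.tolist())  # three columns
print(test_d.columns.tolist())   # only one column: mismatch!

# Fix: align the test dummies to the training columns, filling gaps with 0
test_aligned = test_d.reindex(columns=train_d.columns, fill_value=0)
print(test_aligned.columns.tolist())
```

This is exactly the bookkeeping that OneHotEncoder's fit/transform split handles for you automatically.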
Label Encoding
Simple but dangerous! Label encoding just assigns numbers: red=0, blue=1, green=2. It's compact (one column instead of three), but creates a fake ordering. Use it only when: (1) your categories actually have an order, or (2) you're using tree-based models (like Random Forest or XGBoost) which only look at "is value above or below this threshold?" and don't care about the actual numbers:
from sklearn.preprocessing import LabelEncoder
# Label encoding for target variable
le = LabelEncoder()
data['color_label'] = le.fit_transform(data['color'])
print(data[['color', 'color_label']])
# See the mapping
print("\nMapping:", dict(zip(le.classes_, range(len(le.classes_)))))
How it works: LabelEncoder alphabetically sorts categories and assigns numbers: blue=0, green=1, red=2. The .classes_ attribute shows you the order it learned.
Big Warning: This implies red > green > blue (2 > 1 > 0), which is nonsense for colors! A linear model would think "if I increase the color number, the prediction changes" — but colors don't work that way. Only use LabelEncoder for: (1) truly ordinal data, (2) the target variable (y), or (3) tree-based models that don't assume order.
Ordinal Encoding
When order matters, encode it correctly! T-shirt sizes have a real order: S < M < L < XL. OrdinalEncoder lets you explicitly define this order so the model knows that XL > L > M > S. This is better than LabelEncoder because YOU control the order instead of relying on alphabetical sorting (which would give L=0, M=1, S=2, XL=3 — completely wrong!):
from sklearn.preprocessing import OrdinalEncoder
# Define the order explicitly
size_order = [['S', 'M', 'L', 'XL']] # Small to Large
ordinal_encoder = OrdinalEncoder(categories=size_order)
data['size_ordinal'] = ordinal_encoder.fit_transform(data[['size']])
print(data[['size', 'size_ordinal']])
You're in control: By specifying categories=[['S', 'M', 'L', 'XL']], we guarantee S=0, M=1, L=2, XL=3 — the correct order! Now the model can learn things like "larger sizes cost more" because the numbers actually reflect the size relationship.
Pro tip: The categories parameter takes a list of lists (one list per column). If you have multiple ordinal columns, you'd do: categories=[size_order, rating_order]
Quick guide to choosing an encoder:
- One-Hot: Nominal data with few categories (<10)
- Label/Ordinal: Ordinal data or tree-based models
- Avoid: One-hot with many categories (causes dimensionality explosion)
Practice Questions
Task: Encode education levels ['High School', 'Bachelor', 'Master', 'PhD'] using OrdinalEncoder with the correct order.
Show Solution
from sklearn.preprocessing import OrdinalEncoder
import pandas as pd
data = pd.DataFrame({
    'education': ['Bachelor', 'PhD', 'High School', 'Master', 'Bachelor']
})
# Define order from lowest to highest
edu_order = [['High School', 'Bachelor', 'Master', 'PhD']]
encoder = OrdinalEncoder(categories=edu_order)
data['education_encoded'] = encoder.fit_transform(data[['education']])
print(data)
# High School=0, Bachelor=1, Master=2, PhD=3
Task: One-hot encode a 'color' column with values ['red', 'blue', 'green', 'red', 'blue']. Use drop='first' to avoid the dummy variable trap.
Show Solution
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
data = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'red', 'blue']
})
# One-hot encode with drop='first'
encoder = OneHotEncoder(sparse_output=False, drop='first')
encoded = encoder.fit_transform(data[['color']])
# Create DataFrame with column names
feature_names = encoder.get_feature_names_out(['color'])
encoded_df = pd.DataFrame(encoded, columns=feature_names)
print("Original:")
print(data)
print("\nOne-Hot Encoded (dropped first):")
print(encoded_df)
Task: Train a OneHotEncoder on colors ['red', 'blue'], then transform new data that contains 'green'. Use handle_unknown='ignore' to handle the unseen category.
Show Solution
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
# Training data
train_data = pd.DataFrame({'color': ['red', 'blue', 'red', 'blue']})
# Test data with unseen category
test_data = pd.DataFrame({'color': ['red', 'green', 'blue']})
# Encoder that ignores unknown categories
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoder.fit(train_data[['color']])
# Transform training data
train_encoded = encoder.transform(train_data[['color']])
print("Train encoded:")
print(train_encoded)
# Transform test data (green becomes all zeros)
test_encoded = encoder.transform(test_data[['color']])
print("\nTest encoded (green = all zeros):")
print(test_encoded)
Task: Use OrdinalEncoder to encode two ordinal columns: 'size' (S, M, L, XL) and 'satisfaction' (Poor, Fair, Good, Excellent) with their correct orders.
Show Solution
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
data = pd.DataFrame({
    'size': ['M', 'XL', 'S', 'L', 'M'],
    'satisfaction': ['Good', 'Poor', 'Excellent', 'Fair', 'Good']
})
# Define orders for both columns
size_order = ['S', 'M', 'L', 'XL']
satisfaction_order = ['Poor', 'Fair', 'Good', 'Excellent']
encoder = OrdinalEncoder(categories=[size_order, satisfaction_order])
data[['size_encoded', 'satisfaction_encoded']] = encoder.fit_transform(
    data[['size', 'satisfaction']]
)
print(data)
# size: S=0, M=1, L=2, XL=3
# satisfaction: Poor=0, Fair=1, Good=2, Excellent=3
Task: For a 'city' column with 5 cities, apply Label, One-Hot, and Ordinal encoding. Print the resulting shapes and discuss which is best for which model type.
Show Solution
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder
data = pd.DataFrame({
    'city': ['NYC', 'LA', 'Chicago', 'Houston', 'Phoenix', 'NYC', 'LA']
})
# Label Encoding - 1 column
le = LabelEncoder()
label_encoded = le.fit_transform(data['city'])
print(f"Label Encoded shape: {label_encoded.reshape(-1,1).shape}")
# One-Hot Encoding - 5 columns (or 4 with drop='first')
ohe = OneHotEncoder(sparse_output=False)
onehot_encoded = ohe.fit_transform(data[['city']])
print(f"One-Hot Encoded shape: {onehot_encoded.shape}")
# Ordinal Encoding - 1 column
oe = OrdinalEncoder()
ordinal_encoded = oe.fit_transform(data[['city']])
print(f"Ordinal Encoded shape: {ordinal_encoded.shape}")
print("""
Best uses:
- Label/Ordinal: Tree-based models (XGBoost, RandomForest)
- One-Hot: Linear models (Logistic Regression, SVM, Neural Networks)
- One-Hot: When categories have no natural order
""")
Task: One-hot encode a column, then use inverse_transform to convert the encoded data back to original categories. This is useful for interpreting model predictions.
Show Solution
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
# Original data
data = pd.DataFrame({
    'fruit': ['apple', 'banana', 'cherry', 'apple', 'banana']
})
# Encode
encoder = OneHotEncoder(sparse_output=False)
encoded = encoder.fit_transform(data[['fruit']])
print("Encoded:")
print(encoded)
# Inverse transform back to original
decoded = encoder.inverse_transform(encoded)
print("\nDecoded back:")
print(decoded.flatten())
# Verify they match
print(f"\nOriginal matches decoded: {np.array_equal(data['fruit'].values, decoded.flatten())}")
Advanced Encoding Techniques
What if your "city" column has 10,000 different cities? One-hot encoding would create 10,000 new columns — your model would drown in data! This is called the high-cardinality problem. Advanced encoding techniques solve this by creating smarter, more compact representations. Target encoding replaces each city with how the target variable (like price or sales) behaves in that city. Frequency encoding uses how common each city is. These techniques keep your data manageable while preserving useful information.
Target Encoding (Mean Encoding)
The clever idea: Instead of creating columns for each city, replace the city name with what we actually care about — the average of our target variable for that city! If NYC apartments average $500K, replace "NYC" with 500000. Now one column captures the essence of thousands of categories. The danger: This uses information from the target variable, which can cause "data leakage" if not done carefully:
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold
# Sample data
data = pd.DataFrame({
    'city': ['NYC', 'LA', 'NYC', 'Chicago', 'LA', 'NYC', 'Chicago', 'LA'],
    'price': [500, 300, 550, 200, 350, 480, 220, 320]
})
# Simple target encoding (prone to leakage - for demo only)
city_means = data.groupby('city')['price'].mean()
data['city_target_encoded'] = data['city'].map(city_means)
print(data)
What's happening: We calculate the average price for each city (NYC→$510, LA→$323, Chicago→$210) and replace the city names with these averages. Now instead of 3 columns (one-hot) or arbitrary numbers (label), we have ONE column with meaningful values!
Data Leakage Warning: This simple version "cheats" by using ALL the data (including the rows we're encoding) to calculate the means. In a real project, this leaks information from the test set into training. The next code block shows the proper way!
# Proper target encoding with cross-validation (prevents leakage)
from sklearn.model_selection import KFold
def target_encode_cv(df, col, target, n_splits=5):
    """Target encoding with CV to prevent leakage."""
    df = df.copy()
    df['target_encoded'] = np.nan
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    for train_idx, val_idx in kf.split(df):
        # Calculate means on training fold only
        means = df.iloc[train_idx].groupby(col)[target].mean()
        # Apply to validation fold
        df.loc[val_idx, 'target_encoded'] = df.loc[val_idx, col].map(means)
    # Fill any NaN with global mean
    df['target_encoded'] = df['target_encoded'].fillna(df[target].mean())
    return df['target_encoded']
data['city_target_cv'] = target_encode_cv(data, 'city', 'price')
print(data[['city', 'price', 'city_target_cv']])
The right way — Cross-Validated Target Encoding: We split the data into folds (like in cross-validation). For each row, we calculate the city's mean using ONLY the other folds — never using the row's own target value. This prevents the model from "memorizing" the answers.
- n_splits=5: Data is split into 5 parts; each part gets encoded using means from the other 4 parts
- fillna with global mean: If a city appears only in one fold, we use the overall average as a fallback
- Result: Slightly different encoding values than the simple version, but much more reliable for real predictions!
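The CV scheme above produces the training-set encoding; for a genuinely held-out test set, one common convention (an assumption about your workflow, not spelled out above) is to map means computed on the full training set, with the global mean as the fallback for unseen categories:

```python
import pandas as pd

train = pd.DataFrame({
    'city': ['NYC', 'LA', 'NYC', 'Chicago', 'LA', 'NYC'],
    'price': [500, 300, 550, 200, 350, 480]
})
test = pd.DataFrame({'city': ['NYC', 'Chicago', 'Boston']})  # Boston is unseen

# Category means come from training data only; the test target is never touched
means = train.groupby('city')['price'].mean()
global_mean = train['price'].mean()
test['city_encoded'] = test['city'].map(means).fillna(global_mean)
print(test)
```

There is no leakage here because the test rows contribute nothing to the means they receive.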
Frequency Encoding
Simple and safe! Sometimes how common a category is tells you something useful. Popular products might behave differently than rare ones. Big cities might have different patterns than small towns. Frequency encoding replaces each category with how often it appears — and since we're not using the target variable, there's zero risk of data leakage:
# Frequency encoding
freq = data['city'].value_counts()
data['city_freq'] = data['city'].map(freq)
# Or as proportion
prop = data['city'].value_counts(normalize=True)
data['city_prop'] = data['city'].map(prop)
print(data[['city', 'city_freq', 'city_prop']])
Two flavors of frequency:
- Count (city_freq): NYC appears 3 times, LA appears 3 times, Chicago appears 2 times. Raw counts work well when the total dataset size is fixed.
- Proportion (city_prop): NYC is 37.5% of the data, etc. Proportions are better when you might have different-sized datasets (like training vs. test).
When to use: Great when popularity/rarity matters! For example, rare product categories might have higher prices (limited edition), or frequent customer segments might have better retention.
Feature Hashing
For extreme situations! What if you have millions of categories (like user IDs or product SKUs)? Feature hashing uses a mathematical trick: it converts each category into a fixed number of columns using a "hash function". Think of it like assigning locker numbers in a school — 10,000 students get assigned to 100 lockers (some share). It's not perfect (some categories "collide" into the same slot), but it's incredibly memory-efficient:
from sklearn.feature_extraction import FeatureHasher
# High cardinality example
cities = [{'city': 'New York'}, {'city': 'Los Angeles'},
          {'city': 'Chicago'}, {'city': 'Houston'}]
# Hash into 4 features (usually use more in practice)
hasher = FeatureHasher(n_features=4, input_type='dict')
hashed = hasher.transform(cities)
print("Hashed features:")
print(hashed.toarray())
How it works: The hash function converts "New York" into a pattern of numbers across 4 slots. Each city gets a different pattern. Sometimes two cities might get similar patterns (collision), but with enough slots (usually 1000+), this is rare.
- n_features=4: We're using just 4 slots for demo. In practice, use 1000+ to minimize collisions.
- input_type='dict': We're passing dictionaries like {'city': 'New York'}
- Memory magic: Even with 1 million unique categories, you only have 1000 columns (not 1 million!)
Practice Questions
Task: Implement leave-one-out target encoding where each row uses the mean of all other rows in its category (excluding itself).
Show Solution
import pandas as pd
import numpy as np
data = pd.DataFrame({
    'category': ['A', 'A', 'A', 'B', 'B', 'C'],
    'target': [10, 20, 30, 100, 120, 50]
})

def loo_target_encode(df, cat_col, target_col):
    """Leave-one-out target encoding."""
    df = df.copy()
    # Sum and count per category
    agg = df.groupby(cat_col)[target_col].agg(['sum', 'count'])
    # For each row, calculate mean excluding itself
    df['cat_sum'] = df[cat_col].map(agg['sum'])
    df['cat_count'] = df[cat_col].map(agg['count'])
    df['loo_encoded'] = (df['cat_sum'] - df[target_col]) / (df['cat_count'] - 1)
    # Categories with only 1 sample give 0/0 = NaN; fall back to the global mean
    df['loo_encoded'] = df['loo_encoded'].fillna(df[target_col].mean())
    return df['loo_encoded']
data['target_loo'] = loo_target_encode(data, 'category', 'target')
print(data)
Task: Apply frequency encoding to a 'browser' column with values ['Chrome', 'Firefox', 'Chrome', 'Safari', 'Chrome', 'Firefox']. Show both count and proportion versions.
Show Solution
import pandas as pd
data = pd.DataFrame({
    'browser': ['Chrome', 'Firefox', 'Chrome', 'Safari', 'Chrome', 'Firefox']
})
# Count-based frequency encoding
freq_count = data['browser'].value_counts()
data['browser_count'] = data['browser'].map(freq_count)
# Proportion-based frequency encoding
freq_prop = data['browser'].value_counts(normalize=True)
data['browser_proportion'] = data['browser'].map(freq_prop)
print(data)
# Chrome: count=3, proportion=0.5
# Firefox: count=2, proportion=0.333
# Safari: count=1, proportion=0.167
Task: Implement target encoding with smoothing. Categories with few samples should be pulled toward the global mean to reduce noise.
Show Solution
import pandas as pd
import numpy as np
data = pd.DataFrame({
    'category': ['A', 'A', 'A', 'A', 'A', 'B', 'C'],
    'target': [100, 110, 90, 105, 95, 50, 200]  # C has only 1 sample
})

def smoothed_target_encode(df, cat_col, target_col, smoothing=10):
    """Target encoding with smoothing toward global mean."""
    global_mean = df[target_col].mean()
    agg = df.groupby(cat_col)[target_col].agg(['mean', 'count'])
    # Smoothing formula: weighted average of category mean and global mean
    # More samples = trust category mean more
    smoothed = (agg['count'] * agg['mean'] + smoothing * global_mean) / (agg['count'] + smoothing)
    return df[cat_col].map(smoothed)

data['encoded_no_smooth'] = data['category'].map(
    data.groupby('category')['target'].mean()
)
data['encoded_smoothed'] = smoothed_target_encode(data, 'category', 'target', smoothing=5)
print(data)
print(f"\nGlobal mean: {data['target'].mean():.2f}")
print("Notice: C (1 sample) is pulled toward global mean with smoothing!")
Task: Use FeatureHasher to encode product names into a fixed-size feature vector. Compare hashing 1000 products into 10 vs 100 features.
Show Solution
from sklearn.feature_extraction import FeatureHasher
import numpy as np
# Simulate 1000 unique products
products = [{'product': f'Product_{i}'} for i in range(1000)]
# Hash into 10 features (lots of collisions expected)
hasher_10 = FeatureHasher(n_features=10, input_type='dict')
hashed_10 = hasher_10.transform(products)
# Hash into 100 features (fewer collisions)
hasher_100 = FeatureHasher(n_features=100, input_type='dict')
hashed_100 = hasher_100.transform(products)
print(f"Original: 1000 unique products")
print(f"Hashed to 10 features: shape = {hashed_10.shape}")
print(f"Hashed to 100 features: shape = {hashed_100.shape}")
# Check for collisions (identical rows)
unique_10 = len(np.unique(hashed_10.toarray(), axis=0))
unique_100 = len(np.unique(hashed_100.toarray(), axis=0))
print(f"\nUnique patterns with 10 features: {unique_10}")
print(f"Unique patterns with 100 features: {unique_100}")
Task: Implement a simple binary encoding: convert category indices to binary representation. For 8 categories, you only need 3 columns (2³=8).
Show Solution
import pandas as pd
import numpy as np
data = pd.DataFrame({
    'day': ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun', 'Mon']
})
# First, label encode
days_order = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
day_to_num = {day: i for i, day in enumerate(days_order)}
data['day_num'] = data['day'].map(day_to_num)
# Binary encode (need 3 bits for 7 values: 0-6)
def to_binary(num, n_bits=3):
    return [int(b) for b in format(num, f'0{n_bits}b')]
binary_cols = data['day_num'].apply(lambda x: pd.Series(to_binary(x, 3)))
binary_cols.columns = ['day_b0', 'day_b1', 'day_b2']
result = pd.concat([data, binary_cols], axis=1)
print(result)
print(f"\nOne-Hot would need 7 columns, Binary only needs 3!")
Task: For a dataset with: ordinal 'satisfaction', nominal 'color', and high-cardinality 'zip_code' columns - apply the appropriate encoding to each.
Show Solution
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
from sklearn.feature_extraction import FeatureHasher
data = pd.DataFrame({
    'satisfaction': ['Good', 'Poor', 'Excellent', 'Fair', 'Good'],
    'color': ['red', 'blue', 'green', 'red', 'blue'],
    'zip_code': ['10001', '90210', '60601', '10001', '33101']
})
# 1. Ordinal encoding for satisfaction
sat_order = [['Poor', 'Fair', 'Good', 'Excellent']]
ordinal_enc = OrdinalEncoder(categories=sat_order)
data['satisfaction_encoded'] = ordinal_enc.fit_transform(data[['satisfaction']])
# 2. One-hot encoding for color (low cardinality)
onehot_enc = OneHotEncoder(sparse_output=False, drop='first')
color_encoded = onehot_enc.fit_transform(data[['color']])
color_cols = pd.DataFrame(color_encoded,
                          columns=onehot_enc.get_feature_names_out(['color']))
data = pd.concat([data, color_cols], axis=1)
# 3. Frequency encoding for zip_code (high cardinality)
zip_freq = data['zip_code'].value_counts(normalize=True)
data['zip_frequency'] = data['zip_code'].map(zip_freq)
print(data.drop(['satisfaction', 'color', 'zip_code'], axis=1))
Feature Scaling Techniques
Picture this: you're calculating the distance between two houses based on their price ($500,000 vs $300,000) and number of bedrooms (3 vs 5). The price difference is 200,000, while the bedroom difference is just 2. The price completely dominates! Many ML algorithms (especially distance-based ones like KNN and gradient-based ones like neural networks) get confused when features have wildly different scales. Feature scaling puts all features on a level playing field so each one contributes fairly to the model's decisions.
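The house example can be checked with a few lines: before scaling, the Euclidean distance between two houses is essentially just the price difference; after scaling, the bedroom count contributes too (a sketch with made-up prices):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# [price, bedrooms] for a handful of houses
X = np.array([[500_000, 3],
              [300_000, 5],
              [400_000, 4],
              [350_000, 2]], dtype=float)

# Raw distance between house 0 and house 1: the 2-bedroom gap is invisible
raw = np.linalg.norm(X[0] - X[1])
print(f"raw distance:    {raw:,.2f}")  # ~200,000, price alone

scaled = StandardScaler().fit_transform(X)
scaled_dist = np.linalg.norm(scaled[0] - scaled[1])
print(f"scaled distance: {scaled_dist:.2f}")  # both features now matter
```

After scaling, a 2-bedroom difference and a $200K price difference are each measured in standard deviations, so neither silently dominates.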
StandardScaler (Z-score Normalization)
The most popular choice! StandardScaler transforms each feature so that it has a mean of 0 and standard deviation of 1. Think of it like converting everyone's height to "how many standard deviations above/below average?" A value of +2 means "2 standard deviations above average", regardless of whether we're measuring height in inches, centimeters, or feet. This works best when your data is roughly normally distributed (bell-curve shaped):
# Step 1: Import and create data
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
data = pd.DataFrame({
    'age': [25, 30, 35, 40, 45, 50, 55, 60],
    'income': [30000, 45000, 50000, 65000, 70000, 80000, 90000, 100000],
    'score': [0.5, 0.6, 0.55, 0.7, 0.75, 0.8, 0.85, 0.9]
})
print("Original data statistics:")
print(data.describe())
See the problem? Our features have wildly different ranges:
- age: 25 to 60 (range of 35)
- income: 30,000 to 100,000 (range of 70,000!)
- score: 0.5 to 0.9 (range of just 0.4)
If a distance-based model like KNN calculates "distance" between two people, income would dominate completely. A $10,000 income difference would seem huge compared to a 10-year age difference, even though both might be equally important!
# Step 2: Apply StandardScaler
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)
data_scaled_df = pd.DataFrame(data_scaled, columns=data.columns)
print("\nAfter StandardScaler:")
print(data_scaled_df.describe().round(2))
The magic formula: z = (x - mean) / std
- Subtract the mean → now the average is 0
- Divide by standard deviation → now the spread is 1
Result: All three features now have mean≈0 and std≈1. A value of -1.5 means "1.5 standard deviations below average" whether it's age, income, or score. Now they're on equal footing!
MinMaxScaler (Normalization)
When you need specific bounds! MinMaxScaler squeezes all values into a fixed range, usually [0, 1]. The smallest value becomes 0, the largest becomes 1, and everything else falls proportionally in between. This is perfect for neural networks (which often expect inputs between 0 and 1) and algorithms that need bounded values. One caveat: because the minimum is shifted to 0, an original value of 0 only stays 0 if it was already the minimum (MaxAbsScaler, covered below, is the scaler that truly preserves zeros):
from sklearn.preprocessing import MinMaxScaler
# Scale to [0, 1]
minmax_scaler = MinMaxScaler()
data_minmax = minmax_scaler.fit_transform(data)
data_minmax_df = pd.DataFrame(data_minmax, columns=data.columns)
print("After MinMaxScaler [0, 1]:")
print(data_minmax_df.describe().round(2))
# Custom range [-1, 1]
minmax_custom = MinMaxScaler(feature_range=(-1, 1))
data_custom = minmax_custom.fit_transform(data)
print("\nCustom range [-1, 1]:")
print(pd.DataFrame(data_custom, columns=data.columns).describe().round(2))
The formula: x_scaled = (x - min) / (max - min)
- Default [0, 1]: Minimum value → 0, maximum value → 1, everything else proportionally between
- Custom range [-1, 1]: Use feature_range=(-1, 1) for algorithms that prefer centered but still bounded data
- Zeros move: the minimum maps to the bottom of the range, so original zeros shift unless 0 was the minimum (use MaxAbsScaler when zeros must stay at zero)
Caution with outliers: If your max value is an extreme outlier (like one income of $10 million), most values will be squeezed near 0!
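That squeeze is easy to demonstrate (a small sketch; the $10M income is an invented outlier):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

incomes = np.array([[30_000], [45_000], [60_000], [80_000], [10_000_000]])
scaled = MinMaxScaler().fit_transform(incomes)
print(scaled.ravel())
# The four ordinary incomes all land below 0.006; only the outlier reaches 1.0
```

All the real variation in the ordinary incomes is compressed into a sliver of the range, which is exactly the situation where RobustScaler (next) is a better fit.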
RobustScaler
Outlier-proof scaling! What if your data has extreme outliers? StandardScaler uses mean and standard deviation, which are heavily influenced by outliers. RobustScaler uses the median (middle value) and IQR (interquartile range — the spread of the middle 50%). Since outliers don't affect the median or IQR, the scaling stays sensible even with crazy outliers:
from sklearn.preprocessing import RobustScaler
# Data with outliers
data_outliers = pd.DataFrame({
    'value': [10, 12, 11, 13, 12, 11, 100, 14, 12]  # 100 is an outlier
})
# Compare scalers
standard = StandardScaler()
robust = RobustScaler()
data_outliers['standard'] = standard.fit_transform(data_outliers[['value']])
data_outliers['robust'] = robust.fit_transform(data_outliers[['value']])
print(data_outliers)
print("\nNotice how RobustScaler handles the outlier better")
See the difference:
- StandardScaler: The outlier (100) pulls the mean way up and inflates the std, making normal values look like they're all clustered near -0.5
- RobustScaler: Uses median (12) and IQR, which ignore the outlier. Normal values get reasonable scaled values, while the outlier just becomes a big number
The formula: x_scaled = (x - median) / IQR. Since median and IQR aren't affected by extreme values, your normal data gets scaled sensibly!
When to Scale
Scale Required
- Gradient descent (Linear/Logistic Regression, Neural Networks)
- Distance-based (KNN, K-Means, SVM with RBF)
- Regularized models (Ridge, Lasso, Elastic Net)
- PCA and other dimensionality reduction
Scale Not Required
- Tree-based models (Decision Trees, Random Forest, XGBoost)
- Naive Bayes
- Rule-based models
- Models that only use feature ordering, not magnitude
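For the scale-required models, wrapping the scaler and the estimator in a Pipeline keeps the fit/transform bookkeeping correct, including under cross-validation. A minimal sketch on synthetic data, where the label depends only on the small-scale feature (the data and threshold here are invented for illustration):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data: two independent features on very different scales
rng = np.random.default_rng(0)
income = rng.normal(50_000, 10_000, 200)  # large scale, pure noise here
age = rng.normal(35, 10, 200)             # small scale, carries the signal
X = np.column_stack([income, age])
y = (age > 35).astype(int)  # label depends only on the small-scale feature

raw_score = cross_val_score(KNeighborsClassifier(), X, y).mean()
scaled_score = cross_val_score(
    make_pipeline(StandardScaler(), KNeighborsClassifier()), X, y
).mean()

print(f"KNN without scaling: {raw_score:.2f}")  # near chance level
print(f"KNN with scaling:    {scaled_score:.2f}")
```

Because the scaler sits inside the pipeline, each CV fold fits the scaler on its own training portion only, so the comparison is leakage-free.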
Practice Questions
Task: Create right-skewed data with outliers and compare StandardScaler, MinMaxScaler, and RobustScaler visually using histograms.
Show Solution
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
# Skewed data with outliers
np.random.seed(42)
data = np.concatenate([
    np.random.exponential(2, 200),
    [50, 60, 70]  # Outliers
]).reshape(-1, 1)
scalers = {
    'Original': None,
    'Standard': StandardScaler(),
    'MinMax': MinMaxScaler(),
    'Robust': RobustScaler()
}
fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for ax, (name, scaler) in zip(axes, scalers.items()):
    if scaler is None:
        scaled = data
    else:
        scaled = scaler.fit_transform(data)
    ax.hist(scaled, bins=30, edgecolor='black')
    ax.set_title(name)
plt.tight_layout()
plt.show()
Task: Demonstrate the correct way to scale: fit on training data, transform both train and test. Show what happens if you fit on test data (data leakage).
Show Solution
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# Sample data
np.random.seed(42)
X = np.random.randn(100, 2) * [10, 100] + [50, 500]
y = np.random.randint(0, 2, 100)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
# CORRECT: Fit on train only, transform both
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test) # Only transform!
print("CORRECT - Fit on train:")
print(f"Train mean: {X_train_scaled.mean(axis=0)}") # Should be ~0
print(f"Test mean: {X_test_scaled.mean(axis=0)}") # Slightly off, that's OK!
# WRONG: Fit on all data (data leakage!)
scaler_wrong = StandardScaler()
X_all_scaled = scaler_wrong.fit_transform(X)
print("\nWRONG - Fit on all data (leakage!):")
print("Test statistics leak into training preprocessing!")
Task: Use MinMaxScaler to scale data to range [-1, 1] instead of the default [0, 1]. Verify the min and max values.
Show Solution
import numpy as np
from sklearn.preprocessing import MinMaxScaler
# Sample data
data = np.array([[10], [20], [30], [40], [50]])
# Default [0, 1] scaling
scaler_default = MinMaxScaler()
scaled_default = scaler_default.fit_transform(data)
# Custom [-1, 1] scaling
scaler_custom = MinMaxScaler(feature_range=(-1, 1))
scaled_custom = scaler_custom.fit_transform(data)
print("Original:", data.flatten())
print("\nScaled [0, 1]:", scaled_default.flatten())
print(f" Min: {scaled_default.min()}, Max: {scaled_default.max()}")
print("\nScaled [-1, 1]:", scaled_custom.flatten())
print(f" Min: {scaled_custom.min()}, Max: {scaled_custom.max()}")
Task: Use MaxAbsScaler to scale data to [-1, 1] while preserving zeros. This is important for sparse matrices where zeros are meaningful.
Show Solution
import numpy as np
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler
# Data with meaningful zeros
data = np.array([[-100], [0], [0], [50], [0], [25]])
# MaxAbsScaler - preserves zeros
maxabs = MaxAbsScaler()
scaled_maxabs = maxabs.fit_transform(data)
# MinMaxScaler - does NOT preserve zeros
minmax = MinMaxScaler(feature_range=(-1, 1))
scaled_minmax = minmax.fit_transform(data)
print("Original:", data.flatten())
print("\nMaxAbsScaler (zeros preserved):", scaled_maxabs.flatten())
print("MinMaxScaler (zeros shifted):", scaled_minmax.flatten())
print("\nMaxAbsScaler formula: x / max(|x|)")
print("Zero stays zero because 0 / anything = 0")
Task: Scale a target variable (house prices), train a simple model, make predictions, then inverse_transform to get predictions back in original units (dollars).
Show Solution
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
# Sample data: house size -> price
X = np.array([[1000], [1500], [2000], [2500], [3000]]) # sq ft
y = np.array([200000, 300000, 400000, 500000, 600000]) # price in $
# Scale the target (common for neural networks)
y_scaler = StandardScaler()
y_scaled = y_scaler.fit_transform(y.reshape(-1, 1))
# Train model on scaled target
model = LinearRegression()
model.fit(X, y_scaled.ravel())
# Predict for new house
new_house = np.array([[1800]])
prediction_scaled = model.predict(new_house)
# Convert prediction back to dollars!
prediction_dollars = y_scaler.inverse_transform(prediction_scaled.reshape(-1, 1))
print(f"House size: {new_house[0][0]} sq ft")
print(f"Prediction (scaled): {prediction_scaled[0]:.4f}")
print(f"Prediction (dollars): ${prediction_dollars[0][0]:,.0f}")
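scikit-learn can automate this scale-then-inverse-transform dance: `TransformedTargetRegressor` wraps a regressor and a target transformer, so predictions come back in original units with no manual bookkeeping. A minimal sketch on the same house data:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

X = np.array([[1000], [1500], [2000], [2500], [3000]])  # sq ft
y = np.array([200000, 300000, 400000, 500000, 600000])  # dollars

# Scales y internally, trains on the scaled target,
# and inverse-transforms predictions automatically
model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    transformer=StandardScaler()
)
model.fit(X, y)
print(f"${model.predict([[1800]])[0]:,.0f}")  # already in dollars
```

Same result as the manual version, but the inverse transform can never be forgotten at prediction time.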
Task: Use QuantileTransformer to transform heavily skewed data into a normal distribution. Compare the before/after distributions.
Show Solution
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import QuantileTransformer
from scipy import stats
# Create heavily skewed data (exponential)
np.random.seed(42)
data = np.random.exponential(scale=2, size=1000).reshape(-1, 1)
# Transform to normal distribution
qt = QuantileTransformer(output_distribution='normal', random_state=42)
data_normal = qt.fit_transform(data)
# Plot comparison
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].hist(data, bins=50, edgecolor='black')
axes[0].set_title(f'Original (skewness: {stats.skew(data.flatten()):.2f})')
axes[1].hist(data_normal, bins=50, edgecolor='black')
axes[1].set_title(f'After QuantileTransformer (skewness: {stats.skew(data_normal.flatten()):.2f})')
plt.tight_layout()
plt.show()
print("QuantileTransformer maps data to uniform, then to normal distribution")
print("Great for algorithms that assume normally distributed features!")
Handling Missing Values
Real-world data is messy! Surveys have unanswered questions, sensors fail, databases have gaps. If you just delete every row with a missing value, you might lose half your data! Imputation is the art of intelligently filling in missing values. Think of it like a detective filling in the blanks using clues from the rest of the data. Scikit-learn's SimpleImputer and KNNImputer make this easy.
Simple Imputation Strategies
# Step 1: Create data with missing values
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
data = pd.DataFrame({
'age': [25, 30, np.nan, 40, 45, np.nan, 55, 60],
'income': [30000, np.nan, 50000, 65000, np.nan, 80000, 90000, 100000],
'category': ['A', 'B', np.nan, 'A', 'B', 'C', np.nan, 'A']
})
print("Data with missing values:")
print(data)
print(f"\nMissing counts:\n{data.isnull().sum()}")
The problem: Our data has NaN ("Not a Number") values scattered around. Python uses np.nan to represent missing data.
- Numerical missing: age and income have some NaN values
- Categorical missing: category column also has NaN
We need different strategies for different types of data. You wouldn't fill in a missing "color" with an average!
# Step 2: Impute numerical columns
# Mean imputation
mean_imputer = SimpleImputer(strategy='mean')
data['age_mean'] = mean_imputer.fit_transform(data[['age']])
# Median imputation (better for skewed data)
median_imputer = SimpleImputer(strategy='median')
data['income_median'] = median_imputer.fit_transform(data[['income']])
print("After numerical imputation:")
print(data[['age', 'age_mean', 'income', 'income_median']])
Two main strategies for numbers:
- Mean (average): Great when data is normally distributed. "What's the typical age? Let's use that."
- Median (middle value): Better when you have outliers or skewed data. If one person earns $10M, the mean income is misleading — use median instead!
Pro tip: Always check your data distribution first! Median is the safer default for most real-world data.
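A quick illustration of why the median is the safer default when outliers lurk in the data (hypothetical incomes, one extreme value):

```python
import numpy as np

# Four typical incomes plus one extreme outlier
incomes = np.array([30_000, 35_000, 40_000, 45_000, 10_000_000])

print(f"Mean:   {np.mean(incomes):,.0f}")    # dragged way up by the outlier
print(f"Median: {np.median(incomes):,.0f}")  # still a 'typical' income
```

The mean lands above $2M — a value no one in the data actually earns — while the median stays at $40,000.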
# Step 3: Impute categorical columns
cat_imputer = SimpleImputer(strategy='most_frequent')
data['category_mode'] = cat_imputer.fit_transform(data[['category']]).ravel()
# Constant value imputation
const_imputer = SimpleImputer(strategy='constant', fill_value='Unknown')
data['category_unknown'] = const_imputer.fit_transform(data[['category']]).ravel()
print("\nAfter categorical imputation:")
print(data[['category', 'category_mode', 'category_unknown']])
Two strategies for categories:
- most_frequent (mode): Fill with the most common value. "Most customers choose 'Standard' shipping, so let's assume missing ones did too."
- constant: Fill with a specific value like 'Unknown' or 'Missing'. This is powerful because the fact that data was missing might be meaningful!
When to use 'Unknown': If people who didn't answer "income" might be systematically different (maybe embarrassed?), preserving this info helps!
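If you want to preserve the "was missing" signal without a separate constant category, SimpleImputer's `add_indicator` parameter appends binary indicator columns alongside the imputed values. A small sketch:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[25.0], [np.nan], [35.0], [np.nan]])

imputer = SimpleImputer(strategy='median', add_indicator=True)
result = imputer.fit_transform(X)
# Column 0: imputed values (median = 30); column 1: 1 where the value was missing
print(result)
```

The model can then learn from both the filled-in value and the fact that it was filled in.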
KNN Imputation
The smart approach! Instead of just using the average of everyone, what if we looked at similar people? If someone's age is missing but we know they earn $80k and live in NYC, we can find other NYC earners of $80k and use their average age. That's smarter than using the global average!
from sklearn.impute import KNNImputer
# KNN imputation for numerical data
knn_imputer = KNNImputer(n_neighbors=3)
data_knn = knn_imputer.fit_transform(data[['age', 'income']])
print("KNN Imputed (uses similar samples):")
print(pd.DataFrame(data_knn, columns=['age', 'income']).head())
How KNNImputer works:
- n_neighbors=3: Look at the 3 most similar samples (based on features that aren't missing)
- Calculate distance: Find samples with similar values in other columns
- Average their values: Use the average of those k neighbors to fill the gap
Trade-off: KNN imputation is smarter but slower. For small datasets, it's great! For millions of rows, simple imputation might be more practical.
Practice Questions
Task: Create a function that adds binary columns indicating which values were missing before imputation.
Show Solution
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
def impute_with_indicators(df, columns):
    """Impute each column and add a binary was-missing indicator."""
    df = df.copy()
    for col in columns:
        # Add indicator
        df[f'{col}_was_missing'] = df[col].isnull().astype(int)
        # Impute
        imputer = SimpleImputer(strategy='median')
        df[col] = imputer.fit_transform(df[[col]])
    return df
# Test
data = pd.DataFrame({
'age': [25, np.nan, 35, np.nan, 45],
'income': [50000, 60000, np.nan, 80000, np.nan]
})
result = impute_with_indicators(data, ['age', 'income'])
print(result)
Task: Create a feature that counts how many values are missing per row. This can be predictive if missingness has a pattern!
Show Solution
import pandas as pd
import numpy as np
data = pd.DataFrame({
'a': [1, np.nan, 3, np.nan, 5],
'b': [np.nan, 2, np.nan, 4, 5],
'c': [np.nan, np.nan, 3, 4, 5]
})
# Count missing values per row (over the original feature columns)
feature_cols = ['a', 'b', 'c']
data['missing_count'] = data[feature_cols].isnull().sum(axis=1)
# Proportion of missing values per row
data['missing_ratio'] = data[feature_cols].isnull().mean(axis=1)
print(data)
print("\nRows with more missing values might be different!")
print("Example: Users who skip many survey questions")
Task: For skewed data with outliers, compare mean, median, and most_frequent imputation. Which preserves the distribution best?
Show Solution
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
# Skewed data with outliers
np.random.seed(42)
data = np.concatenate([
np.random.exponential(2, 20),
[50, 60] # Outliers
])
# Add some missing values
data[5] = np.nan
data[10] = np.nan
data[15] = np.nan
df = pd.DataFrame({'value': data})
print(f"Original median: {np.nanmedian(data):.2f}")
print(f"Original mean: {np.nanmean(data):.2f}")
# Try all three strategies from the task
for strategy in ['mean', 'median', 'most_frequent']:
    imputer = SimpleImputer(strategy=strategy)
    imputed = imputer.fit_transform(df[['value']])
    fill_value = imputer.statistics_[0]
    print(f"\n{strategy.upper()} imputation fills with: {fill_value:.2f}")
print("\nMedian is better for skewed data - outliers don't affect it!")
Task: Impute missing 'salary' values using the median salary within each 'department'. This is smarter than using the global median.
Show Solution
import pandas as pd
import numpy as np
data = pd.DataFrame({
'department': ['IT', 'IT', 'IT', 'HR', 'HR', 'HR', 'Sales', 'Sales'],
'salary': [80000, np.nan, 90000, 50000, np.nan, 55000, 60000, np.nan]
})
print("Before imputation:")
print(data)
# Group-wise imputation
def impute_by_group(df, group_col, value_col):
    df = df.copy()
    df[value_col] = df.groupby(group_col)[value_col].transform(
        lambda x: x.fillna(x.median())
    )
    return df
data_imputed = impute_by_group(data, 'department', 'salary')
print("\nAfter group-wise imputation:")
print(data_imputed)
print("\nEach missing salary filled with department median!")
print("IT: median of 80k, 90k = 85k")
print("HR: median of 50k, 55k = 52.5k")
Task: Demonstrate why you should scale data before KNN imputation. Compare results with and without scaling.
Show Solution
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler
data = pd.DataFrame({
'age': [25, 30, np.nan, 40, 45], # Range: ~20
'income': [30000, 45000, 50000, np.nan, 70000] # Range: ~40000
})
print("Original data:")
print(data)
# KNN without scaling (income dominates the distance)
knn_unscaled = KNNImputer(n_neighbors=2)
imputed_unscaled = knn_unscaled.fit_transform(data)
# KNN with scaling: standardize with NaN-aware pandas ops,
# impute, then map back to original units
means, stds = data.mean(), data.std()
knn_scaled = KNNImputer(n_neighbors=2)
imputed_scaled = knn_scaled.fit_transform(
    (data - means) / stds
) * stds.values + means.values
print("\nKNN without scaling:")
print(pd.DataFrame(imputed_unscaled, columns=data.columns))
print("\nKNN with scaling:")
print(pd.DataFrame(imputed_scaled, columns=data.columns))
print("\nNote: Without scaling, 'income' dominates the distance calculation!")
print("Scale your data before KNN imputation for better results.")
Task: Use IterativeImputer (MICE - Multiple Imputation by Chained Equations) to impute missing values using relationships between features.
Show Solution
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer, SimpleImputer
# Data with correlated features
np.random.seed(42)
n = 100
age = np.random.uniform(20, 60, n)
income = age * 1000 + np.random.normal(0, 5000, n) # Income correlates with age
data = pd.DataFrame({'age': age, 'income': income})
# Add missing values
data.loc[10:20, 'age'] = np.nan
data.loc[30:40, 'income'] = np.nan
# Compare Simple vs Iterative imputation
simple_imp = SimpleImputer(strategy='mean')
data_simple = pd.DataFrame(
simple_imp.fit_transform(data),
columns=data.columns
)
iter_imp = IterativeImputer(random_state=42, max_iter=10)
data_iter = pd.DataFrame(
iter_imp.fit_transform(data),
columns=data.columns
)
print("Simple Imputer uses global mean (ignores correlations)")
print("Iterative Imputer uses other features to predict missing values")
print("\nSample imputed rows (originally missing age):")
print(f"Simple age mean: {data_simple.loc[10:15, 'age'].mean():.0f}")
print(f"Iterative uses income to predict age!")
Building Preprocessing Pipelines
Here's a nightmare scenario: you preprocess training data (scale, encode, impute), train a model... then realize you need to apply the exact same transformations to test data, but with the training statistics (mean, std, etc.). Miss a step or use wrong stats? Your model fails silently. Pipelines solve this by bundling all steps together into one object. Fit it once, and it remembers everything. Deploy it, and it handles all preprocessing automatically!
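The discipline pipelines automate is "fit on train, transform on test." A minimal sketch of doing it by hand for a single scaler — multiply this by every transformer in a real project and the room for error becomes clear:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[10.0], [20.0], [30.0]])
X_test = np.array([[25.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from train
X_test_scaled = scaler.transform(X_test)        # reuse TRAIN statistics

# Calling fit_transform on X_test instead would be data leakage:
# the test point would be scaled with its own statistics
print(X_test_scaled)
```

A Pipeline performs exactly this fit/transform split for every step, so the mistake becomes impossible to make.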
ColumnTransformer for Mixed Data Types
The real-world challenge: Your data has numbers (scale them!) and categories (encode them!). You can't use the same transformer on both. ColumnTransformer is like a traffic controller — it routes each column type to the right preprocessing pipeline:
# Step 1: Set up the data
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
# Sample data
data = pd.DataFrame({
'age': [25, 30, np.nan, 40, 45, 50],
'income': [30000, 45000, 50000, np.nan, 70000, 80000],
'city': ['NYC', 'LA', 'NYC', 'Chicago', 'LA', 'NYC'],
'education': ['Bachelor', 'Master', 'PhD', 'Bachelor', 'Master', 'PhD'],
'purchased': [0, 1, 1, 0, 1, 1]
})
X = data.drop('purchased', axis=1)
y = data['purchased']
Setting the stage: Look at our mixed data:
- Numerical: age, income → need imputation (missing values) + scaling
- Categorical: city, education → need imputation + one-hot encoding
- Target: purchased → what we're predicting (0 or 1)
Without pipelines, you'd manually preprocess each column, fit transformers on training data, remember to transform (not fit!) test data... lots of room for mistakes!
# Step 2: Define preprocessing for each column type
numerical_features = ['age', 'income']
categorical_features = ['city', 'education']
# Numerical pipeline: impute then scale
numerical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())
])
# Categorical pipeline: impute then encode
categorical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='most_frequent')),
('encoder', OneHotEncoder(handle_unknown='ignore'))
])
Building mini-pipelines: Each transformer type gets its own chain of steps:
- Numerical pipeline: First impute (fill missing with median), then scale (StandardScaler)
- Categorical pipeline: First impute (fill missing with most common value), then one-hot encode
- handle_unknown='ignore': If test data has a city never seen in training, don't crash — encode it as all zeros instead!
Order matters! Impute first, then scale. You can't scale NaN values!
# Step 3: Combine with ColumnTransformer
preprocessor = ColumnTransformer(
transformers=[
('num', numerical_transformer, numerical_features),
('cat', categorical_transformer, categorical_features)
]
)
# Step 4: Create full pipeline with model
full_pipeline = Pipeline(steps=[
('preprocessor', preprocessor),
('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])
# Fit and predict (all preprocessing happens automatically)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
full_pipeline.fit(X_train, y_train)
score = full_pipeline.score(X_test, y_test)
print(f"Accuracy: {score:.4f}")
The magic happens here:
- ColumnTransformer: Routes numerical columns to numerical_transformer, categorical to categorical_transformer
- Full pipeline: Chains preprocessor → classifier. One object does everything!
- fit(): Learns all statistics (mean, categories, etc.) from training data ONLY
- predict(): Applies transformations using those learned statistics, then predicts
No data leakage! When you call fit(X_train), it only sees training data. Test data statistics never leak into preprocessing.
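Because the whole pipeline behaves like a single estimator, it also plugs straight into cross-validation, with the preprocessing re-fit inside each fold so no fold's statistics leak into another. A sketch on a synthetic dataset (the six-row example above is too small for 5-fold CV):

```python
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100, n_features=5, random_state=42)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression())
])

# The scaler is re-fit on each fold's training split only
scores = cross_val_score(pipeline, X, y, cv=5)
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Doing this without a pipeline (scaling X once up front, then cross-validating) would quietly leak each validation fold's statistics into training.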
Using make_column_selector
Even lazier (in a good way!): Instead of listing column names, let sklearn auto-detect columns by their data type. Numbers go one way, text goes another — automatically:
from sklearn.compose import make_column_selector
# Automatically select by dtype
preprocessor_auto = ColumnTransformer(
transformers=[
('num', numerical_transformer, make_column_selector(dtype_include=np.number)),
('cat', categorical_transformer, make_column_selector(dtype_include=object))
]
)
# This automatically handles any numerical or categorical column
Why this is awesome:
- dtype_include=np.number: Automatically grabs all numerical columns (int, float)
- dtype_include=object: Automatically grabs all text/categorical columns
- Future-proof: Add a new numerical column? It's automatically preprocessed correctly!
Pro tip: Great for Kaggle competitions where you don't want to hardcode column names. Just ensure your dtypes are correct!
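Since make_column_selector routes columns purely by dtype, a numeric-coded category — a zip code, say — would be silently sent down the numeric branch and scaled like a measurement. A sketch of the dtype check and fix (hypothetical two-column data):

```python
import pandas as pd

df = pd.DataFrame({
    'income': [50000, 60000],
    'zipcode': [10001, 90210]  # numeric dtype, but really categorical
})
print(df.dtypes)  # both show up as int64

# Convert so make_column_selector(dtype_include=object) picks it up
df['zipcode'] = df['zipcode'].astype(str)
print(df.dtypes)  # zipcode is now object
```

One `df.dtypes` call before building the ColumnTransformer catches most of these surprises.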
What a full pipeline buys you:
- Prevents data leakage (fit only on training data)
- Simplifies code (one object to fit/predict)
- Easy to save and deploy (pickle the whole pipeline)
- Works with GridSearchCV for hyperparameter tuning
Practice Questions
Task: Create a pipeline that preprocesses the titanic-style data and tunes both preprocessing (imputer strategy) and model (n_estimators) hyperparameters using GridSearchCV.
Show Solution
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
import pandas as pd
import numpy as np
# Create titanic-style data
data = pd.DataFrame({
'age': [22, 38, np.nan, 35, np.nan, 54, 2, 27, 14, np.nan],
'fare': [7.25, 71.28, 7.92, 53.1, 8.05, 51.86, 21.07, 11.13, 30.07, 7.87],
'sex': ['male', 'female', 'female', 'female', 'male', 'male', 'male', 'male', 'female', 'male'],
'embarked': ['S', 'C', 'S', 'S', np.nan, 'S', 'S', 'S', 'C', 'S'],
'survived': [0, 1, 1, 1, 0, 0, 0, 0, 1, 1]
})
X = data.drop('survived', axis=1)
y = data['survived']
# Build pipeline
num_transformer = Pipeline([
('imputer', SimpleImputer()),
('scaler', StandardScaler())
])
cat_transformer = Pipeline([
('imputer', SimpleImputer(strategy='most_frequent')),
('encoder', OneHotEncoder(handle_unknown='ignore'))
])
preprocessor = ColumnTransformer([
('num', num_transformer, ['age', 'fare']),
('cat', cat_transformer, ['sex', 'embarked'])
])
pipeline = Pipeline([
('prep', preprocessor),
('clf', RandomForestClassifier(random_state=42))
])
# Grid search over preprocessing AND model params
param_grid = {
'prep__num__imputer__strategy': ['mean', 'median'],
'clf__n_estimators': [50, 100],
'clf__max_depth': [3, 5, None]
}
grid = GridSearchCV(pipeline, param_grid, cv=3, scoring='accuracy')
grid.fit(X, y)
print(f"Best params: {grid.best_params_}")
print(f"Best score: {grid.best_score_:.4f}")
Task: Create a basic pipeline that applies StandardScaler followed by LogisticRegression. Fit on training data and evaluate on test data.
Show Solution
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
# Create sample data
X, y = make_classification(n_samples=100, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Create pipeline
pipeline = Pipeline([
('scaler', StandardScaler()),
('classifier', LogisticRegression())
])
# Fit and evaluate
pipeline.fit(X_train, y_train)
train_score = pipeline.score(X_train, y_train)
test_score = pipeline.score(X_test, y_test)
print(f"Train accuracy: {train_score:.4f}")
print(f"Test accuracy: {test_score:.4f}")
# The pipeline automatically scales test data using training statistics!
Task: After fitting a pipeline, access the fitted scaler to see the learned mean and std, and access the classifier to see its coefficients.
Show Solution
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
# Create and fit pipeline
X, y = make_classification(n_samples=100, n_features=3, random_state=42)
pipeline = Pipeline([
('scaler', StandardScaler()),
('classifier', LogisticRegression())
])
pipeline.fit(X, y)
# Access components using named_steps
scaler = pipeline.named_steps['scaler']
classifier = pipeline.named_steps['classifier']
print("Scaler learned statistics:")
print(f" Mean: {scaler.mean_}")
print(f" Std: {scaler.scale_}")
print("\nClassifier coefficients:")
print(f" Coefficients: {classifier.coef_}")
print(f" Intercept: {classifier.intercept_}")
# Alternative: access by index
print(f"\nAccess by index: {pipeline[0]} is the scaler")
Task: Fit a pipeline, save it to a file using joblib, then load it back and make predictions. This is essential for deployment!
Show Solution
import joblib
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
import numpy as np
# Create and fit pipeline
X, y = make_classification(n_samples=100, n_features=5, random_state=42)
pipeline = Pipeline([
('scaler', StandardScaler()),
('classifier', RandomForestClassifier(n_estimators=10, random_state=42))
])
pipeline.fit(X, y)
# Save the entire pipeline (including fitted transformers!)
joblib.dump(pipeline, 'my_pipeline.pkl')
print("Pipeline saved!")
# Later, in production...
loaded_pipeline = joblib.load('my_pipeline.pkl')
print("Pipeline loaded!")
# Make predictions with new data
new_data = np.random.randn(5, 5) # 5 new samples
predictions = loaded_pipeline.predict(new_data)
print(f"Predictions: {predictions}")
# The loaded pipeline includes ALL preprocessing steps!
Task: Create a pipeline that scales data, selects the top 5 features using SelectKBest, then trains a classifier.
Show Solution
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
import numpy as np
# Create data with many features
X, y = make_classification(n_samples=200, n_features=20,
n_informative=5, random_state=42)
# Pipeline: scale -> select features -> classify
pipeline = Pipeline([
('scaler', StandardScaler()),
('selector', SelectKBest(score_func=f_classif, k=5)),
('classifier', LogisticRegression())
])
pipeline.fit(X, y)
print(f"Original features: {X.shape[1]}")
# See which features were selected
selector = pipeline.named_steps['selector']
selected_mask = selector.get_support()
print(f"Selected feature indices: {np.where(selected_mask)[0]}")
print(f"Feature scores: {selector.scores_[:10]}...") # First 10
print(f"\nTest accuracy: {pipeline.score(X, y):.4f}")
Task: Create a custom transformer that adds polynomial features (squares of each column), then use it in a pipeline.
Show Solution
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
import numpy as np
# Custom transformer
class SquareFeatures(BaseEstimator, TransformerMixin):
    """Adds squared versions of all features."""
    def fit(self, X, y=None):
        self.n_features_ = X.shape[1]  # remember input width for feature names
        return self
    def transform(self, X):
        X_squared = X ** 2
        return np.hstack([X, X_squared])
    def get_feature_names_out(self, input_features=None):
        if input_features is None:
            input_features = [f'x{i}' for i in range(self.n_features_)]
        squared_names = [f'{name}_squared' for name in input_features]
        return list(input_features) + squared_names
# Use in pipeline
X = np.random.randn(100, 3)
y = X[:, 0] + X[:, 1]**2 + np.random.randn(100) * 0.1 # Nonlinear target
pipeline = Pipeline([
('add_squares', SquareFeatures()),
('scaler', StandardScaler()),
('regressor', LinearRegression())
])
pipeline.fit(X, y)
print(f"Input features: {X.shape[1]}")
print(f"After SquareFeatures: {pipeline.named_steps['add_squares'].transform(X).shape[1]}")
print(f"R² score: {pipeline.score(X, y):.4f}")
Key Takeaways
Match Encoding to Data Type
Use OneHotEncoder for nominal categories with few values. Use OrdinalEncoder for ordinal data with explicit ordering.
Target Encoding for High Cardinality
When one-hot creates too many columns, use target encoding with cross-validation or frequency encoding to avoid data leakage.
Scale for Gradient & Distance Models
StandardScaler for normal data, MinMaxScaler for bounded inputs, RobustScaler when outliers exist.
Handle Missing Values Carefully
Use SimpleImputer with mean/median for numerical, mode/constant for categorical. Consider adding missing indicators.
Use Pipelines for Everything
Pipeline + ColumnTransformer ensures consistent preprocessing, prevents leakage, and simplifies deployment.
Prevent Data Leakage
Always fit transformers on training data only. Pipelines handle this automatically when used with train_test_split or cross-validation.
Knowledge Check
When should you use OneHotEncoder vs OrdinalEncoder?
What is the main advantage of RobustScaler over StandardScaler?
Why is target encoding prone to data leakage?
Which algorithms do NOT require feature scaling?
What does ColumnTransformer allow you to do?
What imputation strategy is best for categorical variables?