What is Feature Engineering?
Feature engineering is the art and science of transforming raw data into meaningful inputs for machine learning models. It is often the most impactful step in the data science pipeline, as better features lead to better model performance. Many Kaggle competition winners attribute their success more to clever feature engineering than to choosing the right algorithm.
Why Feature Engineering Matters
Beginner-Friendly Explanation
Imagine you're a detective trying to identify criminals from photos. If you only have raw pixel values (millions of numbers from 0-255), it's nearly impossible to find patterns. But if someone extracts useful features like "has beard", "wears glasses", "age range", "height" - suddenly identification becomes much easier!
That's feature engineering! You transform raw, messy data into clean, meaningful features that highlight the patterns your model needs to learn. It's like translating a foreign language into one the model understands.
Real-world house price example: Your raw data might include the house's build date (e.g., 1985). But what really matters to buyers is the age of the house - calculated as current year minus build year. A house built in 1985 is now 41 years old. That transformation from "build date" → "age" is feature engineering in action!
The fundamental truth: Machine learning models can only learn from the patterns you give them. If important relationships are hidden deep in your raw data, even the most sophisticated algorithm (neural networks, XGBoost, etc.) will struggle. Feature engineering makes those patterns explicit and learnable.
- Raw data: birth_date = "1990-05-15"
- Model sees: Just a string or timestamp
- Problem: Can't learn age patterns directly
- Result: Poor model performance
- Engineered: age = 2026 - 1990 = 36
- Model sees: Clear numerical value
- Benefit: Can learn age-based patterns easily
- Result: Much better predictions!
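As a minimal sketch of that transformation (assuming a pandas DataFrame with a birth_date column and using 2026 as the "current" year, as above):

```python
import pandas as pd

# Hypothetical raw data: birth dates stored as strings
people = pd.DataFrame({'birth_date': ['1990-05-15', '1985-11-02', '2000-01-20']})

# Parse the strings into real datetimes, then derive an age feature the model can learn from
people['birth_date'] = pd.to_datetime(people['birth_date'])
reference_year = 2026  # assumed "current" year for this illustration
people['age'] = reference_year - people['birth_date'].dt.year

print(people)
#   birth_date  age
# 0 1990-05-15   36
# 1 1985-11-02   41
# 2 2000-01-20   26
```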
Feature Engineering (The Secret Sauce)
Definition: The process of using domain knowledge to create, transform, and select features (input variables) that make machine learning algorithms work more effectively. It bridges the gap between raw, messy data and the clean, meaningful patterns models need to learn.
Expert Quote:
"Applied machine learning is basically feature engineering"
- Andrew Ng (Stanford Professor, Founder of Coursera, former Google Brain lead)
Why is this so important?
- Kaggle winners often report spending around 70% of their time on feature engineering and only about 30% on model selection
- Good features with a simple model often beat poor features with a complex model
- Can improve accuracy by 10-20% without changing the algorithm at all!
The Feature Engineering Pipeline
Feature engineering encompasses several types of transformations. Understanding when to apply each technique is crucial for building effective models. Each technique solves a specific problem!
| Technique | Purpose | Example | When to Use |
|---|---|---|---|
| Encoding | Convert categorical to numerical | "red", "blue" → 0, 1 or [1,0], [0,1] | When you have text labels that models can't understand |
| Scaling | Normalize numerical ranges | Income (0-1M) → (0-1) | When features have different units or ranges (e.g., age vs. income) |
| Transformation | Change distribution shape | log(income) for skewed data | When data is heavily skewed or has outliers |
| Creation | Derive new features | age = 2025 - birth_year | When you can calculate meaningful features from existing ones |
| Binning | Discretize continuous values | age → "young", "adult", "senior" | When categories are more meaningful than exact numbers |
❌ Without Pipeline
- Apply transformations inconsistently
- Forget steps when deploying model
- Data leakage from test set
- Harder to debug problems
✅ With Pipeline
- Consistent transformation workflow
- Easy to reproduce and deploy
- Prevents data leakage automatically
- Clean, organized code
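To make the pipeline idea concrete before we dive in, here is a minimal sketch of that workflow with scikit-learn's ColumnTransformer and Pipeline (both are introduced properly later in this lesson; the column names 'age', 'income', and 'city' are illustrative):

```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression

# Each column type gets its own transformer, applied the same way every time
preprocess = ColumnTransformer([
    ('numeric', StandardScaler(), ['age', 'income']),                   # scale numeric columns
    ('categorical', OneHotEncoder(handle_unknown='ignore'), ['city']),  # encode text columns
])

# Preprocessing and model travel together: fit once on training data,
# then the exact same transformations are re-applied at prediction/deployment time
model = Pipeline([('prep', preprocess), ('clf', LogisticRegression())])
# model.fit(X_train, y_train); model.predict(X_new)
```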
Setting Up Your Environment
Let's import the libraries we'll use throughout this lesson. We'll work with pandas for data manipulation and scikit-learn for preprocessing transformations. Don't worry! We'll explain each tool when we use it.
# ===== CORE LIBRARIES FOR FEATURE ENGINEERING =====
# pandas: The Excel of Python - handles our data tables (DataFrames)
# WHY? We need to load, view, and manipulate our data
import pandas as pd
# numpy: Handles mathematical operations and arrays
# WHY? Many ML algorithms work with numpy arrays under the hood
import numpy as np
# ===== SCALING TOOLS (Make features comparable) =====
# StandardScaler: Converts data to mean=0, std=1 (like converting to z-scores)
# WHEN TO USE: Most common choice, works with normally distributed data
from sklearn.preprocessing import StandardScaler
# MinMaxScaler: Squishes all values between 0 and 1
# WHEN TO USE: When you need exact 0-1 range, or data has clear boundaries
from sklearn.preprocessing import MinMaxScaler
# RobustScaler: Like StandardScaler but ignores outliers
# WHEN TO USE: When your data has lots of extreme values/outliers
from sklearn.preprocessing import RobustScaler
# ===== ENCODING TOOLS (Convert text to numbers) =====
# LabelEncoder: Converts categories to simple numbers (0, 1, 2, 3...)
# WHEN TO USE: For target variable, or ordinal categories (small < medium < large)
from sklearn.preprocessing import LabelEncoder
# OneHotEncoder: Creates separate columns for each category (0s and 1s)
# WHEN TO USE: For non-ordinal categories where no order exists (red, blue, green)
from sklearn.preprocessing import OneHotEncoder
# ===== FEATURE CREATION TOOLS =====
# PolynomialFeatures: Creates interactions like x1*x2, x1^2, etc.
# WHEN TO USE: When you suspect features interact (e.g., length × width = area)
from sklearn.preprocessing import PolynomialFeatures
# ===== DATA SPLITTING =====
# train_test_split: Divides data into training and testing sets
# WHY? To evaluate if our model works on NEW, unseen data
from sklearn.model_selection import train_test_split
print("Libraries loaded successfully!")
print("🎯 Ready to engineer some features!")
Pro Tip: Import Only What You Need
In real projects, don't import everything! Import only the specific tools you'll use. This makes your code faster and easier to understand. Think of it like packing for a trip - only bring what you need!
Creating Sample Data
Let's create a sample dataset to practice feature engineering techniques. This dataset represents customer information for a subscription service. Real-world scenario!
# Create sample customer dataset
data = {
'customer_id': [1, 2, 3, 4, 5, 6, 7, 8],
'age': [25, 45, 35, 52, 28, 61, 33, 40],
'income': [35000, 72000, 55000, 89000, 42000, 95000, 48000, 67000],
'education': ['High School', 'Bachelor', 'Master', 'PhD', 'Bachelor', 'Master', 'High School', 'Bachelor'],
'city': ['Mumbai', 'Delhi', 'Mumbai', 'Bangalore', 'Delhi', 'Mumbai', 'Bangalore', 'Delhi'],
'subscription_months': [6, 24, 12, 36, 3, 48, 9, 18],
'churned': [1, 0, 0, 0, 1, 0, 1, 0]
}
df = pd.DataFrame(data)
print(df.head())
# Check data types and structure
print(df.dtypes)
# customer_id int64
# age int64
# income int64
# education object
# city object
# subscription_months int64
# churned int64
int64 means numbers (can be used directly by ML models).
object means text (needs to be converted to numbers first). Notice that 'education' and 'city' are objects -
we'll need to encode these!
Practice Questions: Feature Engineering Basics
Test your understanding with these hands-on exercises.
Task: Given the sample dataframe, write code to count how many numerical and categorical columns exist.
Show Solution
# Count numerical and categorical columns
numerical_cols = df.select_dtypes(include=['int64', 'float64']).columns
categorical_cols = df.select_dtypes(include=['object']).columns
print(f"Numerical columns ({len(numerical_cols)}): {list(numerical_cols)}")
print(f"Categorical columns ({len(categorical_cols)}): {list(categorical_cols)}")
# Numerical columns (5): ['customer_id', 'age', 'income', 'subscription_months', 'churned']
# Categorical columns (2): ['education', 'city']
select_dtypes() filters columns by their data type.
We found 5 numerical columns (numbers) and 2 categorical columns (text). This helps us know
which features need encoding before modeling!
Task: Create a new feature called 'tenure_years' by converting subscription_months to years (divide by 12).
Show Solution
# Create tenure in years
df['tenure_years'] = df['subscription_months'] / 12
print(df[['customer_id', 'subscription_months', 'tenure_years']])
# customer_id subscription_months tenure_years
# 0 1 6 0.50
# 1 2 24 2.00
# 2 3 12 1.00
# ...
Task: Create a feature 'income_per_tenure_month' that divides income by subscription_months.
Show Solution
# Calculate income per month of tenure
df['income_per_tenure_month'] = df['income'] / df['subscription_months']
print(df[['customer_id', 'income', 'subscription_months', 'income_per_tenure_month']].head())
# customer_id income subscription_months income_per_tenure_month
# 0 1 35000 6 5833.333333
# 1 2 72000 24 3000.000000
Task: Create an 'age_group' feature that categorizes customers as 'Young' (under 30), 'Adult' (30-50), or 'Senior' (over 50).
Show Solution
# Create age groups using np.select or pd.cut
def categorize_age(age):
    if age < 30:
        return 'Young'
    elif age <= 50:
        return 'Adult'
    else:
        return 'Senior'
df['age_group'] = df['age'].apply(categorize_age)
# Alternative using pd.cut
# df['age_group'] = pd.cut(df['age'], bins=[0, 30, 50, 100], labels=['Young', 'Adult', 'Senior'])
print(df[['customer_id', 'age', 'age_group']])
# customer_id age age_group
# 0 1 25 Young
# 1 2 45 Adult
# 2 3 35 Adult
# 3 4 52 Senior
Encoding Categorical Variables
Machine learning algorithms work with numbers, not text. Categorical variables like "color" or "city" must be converted to numerical representations before models can use them effectively. Choosing the right encoding method depends on whether categories have a natural order and how many unique values exist.
Types of Categorical Variables
Before encoding, identify whether your categorical variable is nominal (no order) or ordinal (has order). This determines which encoding technique to use. Critical decision!
Nominal (No Order)
🤔 Can you rank these? NO! They're just different options with no "better" or "worse".
- City: Mumbai, Delhi, Bangalore (no city is "higher" than another)
- Color: Red, Blue, Green (no natural order)
- Payment: Cash, Card, UPI (all equal methods)
- Gender: Male, Female, Other (no ranking possible)
Creates separate columns:
city_Mumbai, city_Delhi, city_Bangalore
Ordinal (Has Order)
🤔 Can you rank these? YES! They have a clear progression from low to high.
- Education: High School < Bachelor < Master < PhD (clear progression)
- Size: Small < Medium < Large (obvious order)
- Rating: Poor < Fair < Good < Excellent (quality scale)
- Priority: Low < Medium < High (importance level)
Assigns numbers that respect order:
PhD=3 > Master=2 > Bachelor=1
⚠️ Common Mistake: Using Wrong Encoding
DON'T use Label Encoding for Nominal variables! If you encode cities as Mumbai=0, Delhi=1, Bangalore=2, the model thinks "Bangalore is twice as much as Mumbai" which makes no sense! Always use One-Hot Encoding for categories without order.
Label Encoding (Ordinal Variables)
Label encoding converts categories to integers (0, 1, 2, ...). This works well for ordinal variables where the numeric order matches the category order. ⚠️ Warning: Be careful with nominal variables - the model might incorrectly assume "Delhi (1)" is "greater than" "Mumbai (0)".
# WHY? ML algorithms need numbers, not text
# WHAT? LabelEncoder assigns a unique integer to each category
from sklearn.preprocessing import LabelEncoder
# Create label encoder
# WHAT IT DOES: Learns all unique categories and assigns 0, 1, 2...
label_encoder = LabelEncoder()
# Encode education (ordinal - has natural order)
# WARNING: LabelEncoder assigns numbers ALPHABETICALLY, not by order!
df['education_encoded'] = label_encoder.fit_transform(df['education'])
print(df[['education', 'education_encoded']])
# education education_encoded
# 0 High School 1 # ❌ Should be 0 (lowest education)
# 1 Bachelor 0 # ❌ Should be 1
# 2 Master 2 # Correct
# 3 PhD 3 # Correct
# 4 Bachelor 0 # ❌ Should be 1
# PROBLEM: The numbers don't match the actual education order!
- Bachelor=0 (starts with 'B')
- High School=1 (starts with 'H')
- Master=2 (starts with 'M')
- PhD=3 (starts with 'P')
This does NOT match the true education order! The model will think Bachelor (0) is less educated than High School (1). We need to fix this manually!
Solution: Manual Ordinal Encoding
# THE RIGHT WAY: Define the order explicitly using a dictionary
# WHY? You control the exact mapping to match real-world meaning
# WHAT? Create a dictionary that maps each category to its correct rank
education_order = {
'High School': 0, # Lowest education level
'Bachelor': 1, # Next level up
'Master': 2, # Advanced degree
'PhD': 3 # Highest education level
}
# Apply the mapping using .map() method
# WHAT IT DOES: Looks up each education value and replaces with its number
df['education_encoded'] = df['education'].map(education_order)
print(df[['education', 'education_encoded']])
# education education_encoded
# 0 High School 0 # Correct! Lowest level
# 1 Bachelor 1 # Correct!
# 2 Master 2 # Correct!
# 3 PhD 3 # Correct! Highest level
# NOW THE NUMBERS MATCH THE REAL ORDER! 🎉
Pro Tip: Always Use Manual Mapping for Ordinal!
For ordinal variables, always define your own mapping dictionary. Don't trust automatic encoding to get the order right! Think about: "What's the logical progression?" and map accordingly.
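If you want the same explicit ordering inside a scikit-learn pipeline, OrdinalEncoder accepts the category order directly. A small sketch, assuming the same education column:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

edu = pd.DataFrame({'education': ['High School', 'Bachelor', 'Master', 'PhD', 'Bachelor']})

# Pass categories in their real-world order so 0 = lowest and 3 = highest
encoder = OrdinalEncoder(categories=[['High School', 'Bachelor', 'Master', 'PhD']])
edu['education_encoded'] = encoder.fit_transform(edu[['education']]).ravel()

print(edu)
#      education  education_encoded
# 0  High School                0.0
# 1     Bachelor                1.0
# 2       Master                2.0
# 3          PhD                3.0
# 4     Bachelor                1.0
```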
One-Hot Encoding (Nominal Variables)
One-hot encoding creates a separate binary column for each category. This is the standard approach for nominal variables because it does not imply any ordering between categories. Best practice for cities, colors, payment methods!
❌ BEFORE (Text)

| city |
|---|
| Mumbai |
| Delhi |
| Mumbai |
| Bangalore |

❌ Model can't use this!

✅ AFTER (Numbers)

| Bangalore | Delhi | Mumbai |
|---|---|---|
| 0 | 0 | 1 |
| 0 | 1 | 0 |
| 0 | 0 | 1 |
| 1 | 0 | 0 |

✅ Model can use this!
Method 1: Using Pandas get_dummies()
# EASIEST METHOD: Use pandas get_dummies()
# WHY? Quick and simple for exploration
# WHAT IT DOES: Creates a 1 where category matches, 0 everywhere else
# One-hot encode city using pandas get_dummies
# prefix='city' adds 'city_' before each column name for clarity
city_encoded = pd.get_dummies(df['city'], prefix='city')
print(city_encoded)
# city_Bangalore city_Delhi city_Mumbai
# 0 False False True # ← This row is Mumbai (only Mumbai column = 1)
# 1 False True False # ← This row is Delhi
# 2 False False True # ← Mumbai again
# 3 True False False # ← Bangalore
# INTERPRETATION: Each row has exactly ONE "True" (1), indicating which city it is
# No city is "better" than another - they're just different!
# Add encoded columns to original dataframe
# axis=1 means "add as columns" (not rows)
# WHY? Keep original data + new encoded columns together
df_encoded = pd.concat([df, city_encoded], axis=1)
print(df_encoded[['customer_id', 'city', 'city_Bangalore', 'city_Delhi', 'city_Mumbai']].head())
# customer_id city city_Bangalore city_Delhi city_Mumbai
# 0 1 Mumbai False False True
# 1 2 Delhi False True False
# ===== SHORTCUT: Do everything in one line! =====
# This drops the original 'city' column and replaces with encoded columns
df_encoded = pd.get_dummies(df, columns=['city'], prefix=['city'])
print(df_encoded.columns.tolist()) # See all columns including new city_ ones
One-hot encoding creates redundant columns: if you know two of the three city flags, you can always infer the third (the dummy variable trap, which causes multicollinearity in linear models). Use drop_first=True to avoid this!
# Drop first category to avoid multicollinearity (dummy variable trap)
# WHY? If city_Delhi=0 and city_Mumbai=0, we KNOW it must be Bangalore
# This is called the "reference category" or "baseline"
city_encoded = pd.get_dummies(df['city'], prefix='city', drop_first=True)
print(city_encoded)
# city_Delhi city_Mumbai
# 0 False True # Mumbai (Bangalore is implied when both are 0)
# 1 True False # Delhi
# 2 False True # Mumbai
# 3 False False # ← Bangalore (the "reference" - neither Delhi nor Mumbai)
# INTERPRETATION:
# - Bangalore is the baseline (0, 0)
# - Delhi is represented by (1, 0)
# - Mumbai is represented by (0, 1)
# We reduced from 3 columns to 2, but kept all the information!
Scikit-learn OneHotEncoder
For machine learning pipelines, scikit-learn's OneHotEncoder is more robust. It remembers categories from training data and handles new categories gracefully.
from sklearn.preprocessing import OneHotEncoder
# Create encoder
onehot_encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
# Fit and transform city column
city_array = df[['city']] # Must be 2D array
city_encoded = onehot_encoder.fit_transform(city_array)
# Get feature names
feature_names = onehot_encoder.get_feature_names_out(['city'])
print(feature_names) # ['city_Bangalore' 'city_Delhi' 'city_Mumbai']
# Create DataFrame with encoded values
city_df = pd.DataFrame(city_encoded, columns=feature_names)
print(city_df.head())
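To see why this matters for deployment, here is a small sketch of what happens when a city that was never in the training data shows up at prediction time (the city 'Chennai' below is a hypothetical unseen value):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Fit on the cities seen during training
train_cities = pd.DataFrame({'city': ['Mumbai', 'Delhi', 'Bangalore']})
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoder.fit(train_cities)

# At prediction time a brand-new city appears
new_data = pd.DataFrame({'city': ['Mumbai', 'Chennai']})
print(encoder.transform(new_data))
# [[0. 0. 1.]    <- Mumbai maps to its usual column
#  [0. 0. 0.]]   <- unseen 'Chennai' becomes all zeros instead of raising an error
```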
| Method | Best For | Pros | Cons |
|---|---|---|---|
| LabelEncoder | Ordinal variables | Simple, no extra columns | Implies order (bad for nominal) |
| pd.get_dummies() | Quick exploration | Easy to use, readable | Does not remember categories |
| OneHotEncoder | ML pipelines | Handles unknowns, pipeline-ready | More verbose syntax |
Practice Questions: Encoding
Test your understanding with these hands-on exercises.
Given:
payments = pd.DataFrame({'method': ['Cash', 'Card', 'UPI', 'Card', 'Cash']})
Task: One-hot encode the payment method column.
Show Solution
payments = pd.DataFrame({'method': ['Cash', 'Card', 'UPI', 'Card', 'Cash']})
# One-hot encode
encoded = pd.get_dummies(payments['method'], prefix='payment')
print(encoded)
# payment_Card payment_Cash payment_UPI
# 0 False True False
# 1 True False False
# 2 False False True
Given:
ratings = pd.DataFrame({'satisfaction': ['Poor', 'Good', 'Excellent', 'Fair', 'Good']})
Task: Create ordinal encoding where Poor=0, Fair=1, Good=2, Excellent=3.
Show Solution
ratings = pd.DataFrame({'satisfaction': ['Poor', 'Good', 'Excellent', 'Fair', 'Good']})
# Define order mapping
order_map = {'Poor': 0, 'Fair': 1, 'Good': 2, 'Excellent': 3}
# Apply mapping
ratings['satisfaction_encoded'] = ratings['satisfaction'].map(order_map)
print(ratings)
# satisfaction satisfaction_encoded
# 0 Poor 0
# 1 Good 2
# 2 Excellent 3
# 3 Fair 1
# 4 Good 2
Task: Use sklearn's OneHotEncoder to encode the 'city' column, then create a DataFrame with proper column names.
Show Solution
from sklearn.preprocessing import OneHotEncoder
# Create and fit encoder
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
city_encoded = encoder.fit_transform(df[['city']])
# Get feature names and create DataFrame
feature_names = encoder.get_feature_names_out(['city'])
city_df = pd.DataFrame(city_encoded, columns=feature_names)
# Combine with original data
result = pd.concat([df.drop('city', axis=1), city_df], axis=1)
print(result.head())
The encoder remembers the categories it saw during fit, and any category it has never seen is safely encoded as all zeros (thanks to handle_unknown='ignore').
Perfect for ML pipelines that will be deployed to production!
Given:
products = pd.DataFrame({'category': ['A', 'B', 'A', 'C', 'A', 'B', 'A', 'D', 'A', 'B']})
Task: Encode categories by their frequency (how often they appear).
Show Solution
products = pd.DataFrame({'category': ['A', 'B', 'A', 'C', 'A', 'B', 'A', 'D', 'A', 'B']})
# Calculate frequency of each category
freq_map = products['category'].value_counts(normalize=True).to_dict()
# Apply frequency encoding
products['category_freq'] = products['category'].map(freq_map)
print(products)
# category category_freq
# 0 A 0.5
# 1 B 0.3
# 2 A 0.5
# 3 C 0.1
# 4 A 0.5
Scaling Numerical Features
When features have vastly different scales (e.g., age in years vs. income in thousands), many algorithms struggle to learn effectively. Feature scaling ensures all variables contribute equally and prevents features with larger values from dominating the model.
Why Scaling Matters
Consider a customer dataset with age (20-70) and income (20000-200000). Without scaling, a machine learning algorithm might think income is "more important" simply because its values are 1000x larger! Distance-based algorithms like KNN and gradient-based optimizers are especially sensitive to scale.
❌ WITHOUT Scaling
Customer 1: age=25, income=35000
Customer 2: age=45, income=72000
Distance = √[(45-25)² + (72000-35000)²]
= √[400 + 1,369,000,000]
= ~37,000
Problem: Income difference dominates! Age barely matters.
✅ WITH Scaling
Customer 1: age=0.0, income=0.0
Customer 2: age=0.5, income=0.6
Distance = √[(0.5-0.0)² + (0.6-0.0)²]
= √[0.25 + 0.36]
= ~0.78
Solution: Both features contribute equally!
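Here is a quick sketch reproducing those two distance calculations with NumPy, so you can verify the effect yourself (the min/max values used for scaling are assumed dataset-wide ranges, purely for illustration):

```python
import numpy as np

# Two customers described by [age, income]
c1 = np.array([25, 35000])
c2 = np.array([45, 72000])

# Without scaling, the income term completely dominates the Euclidean distance
raw_distance = np.sqrt(((c2 - c1) ** 2).sum())
print(f"Raw distance: {raw_distance:,.1f}")        # ~37,000 - driven almost entirely by income

# After min-max scaling each feature to 0-1 (assumed ranges: age 20-60, income 20k-100k)
age_min, age_max = 20, 60
inc_min, inc_max = 20_000, 100_000
c1_s = np.array([(c1[0] - age_min) / (age_max - age_min), (c1[1] - inc_min) / (inc_max - inc_min)])
c2_s = np.array([(c2[0] - age_min) / (age_max - age_min), (c2[1] - inc_min) / (inc_max - inc_min)])
scaled_distance = np.sqrt(((c2_s - c1_s) ** 2).sum())
print(f"Scaled distance: {scaled_distance:.2f}")   # both features now contribute comparably
```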
- Distance-based: K-Nearest Neighbors, K-Means Clustering, SVM
- Gradient-based: Neural Networks, Logistic Regression, Linear Regression
- Component-based: Principal Component Analysis (PCA)
DON'T need scaling: Tree-based models (Random Forest, XGBoost, Decision Trees) - they split on thresholds, not distances!
# WHY? Let's see how different our feature scales are
# WHAT IT SHOWS: The huge difference in magnitude between age and income
print("Age range:", df['age'].min(), "-", df['age'].max())
print("Income range:", df['income'].min(), "-", df['income'].max())
# Age range: 25 - 61 # ← Spans about 36 units
# Income range: 35000 - 95000 # ← Spans 60,000 units! ~1600x bigger!
StandardScaler (Z-score Normalization)
StandardScaler transforms data to have zero mean (μ=0) and unit variance (σ²=1). Each value becomes a z-score: how many standard deviations it is from the mean. Most commonly used scaler!
StandardScaler
z = (x - mean) / std
What it means:
- z = 0: Exactly at average
- z = 1: One std above average
- z = -1: One std below average
- z = 2: Two std above (rare!)
Best for:
- Normally distributed data
- Algorithms assuming zero-centered data
- SVM, Logistic Regression, Neural Networks
- PCA and gradient descent
# WHY? Most ML algorithms work better with zero-centered data
# WHAT? Transform data so mean=0 and std=1
from sklearn.preprocessing import StandardScaler
# Create scaler object
# WHAT IT DOES: Will learn the mean and std from training data
scaler = StandardScaler()
# Select numerical features to scale
# WHY THESE? Only numerical features need scaling, not categorical/encoded ones
numerical_features = ['age', 'income', 'subscription_months']
# Fit and transform in one step
# fit() = Learn the mean and std from this data
# transform() = Apply the formula: (x - mean) / std
# fit_transform() = Do both at once!
df_scaled = df.copy() # Make a copy to keep original
df_scaled[numerical_features] = scaler.fit_transform(df[numerical_features])
print(df_scaled[numerical_features].head())
# age income subscription_months
# 0 -1.296623 -1.370214 -0.918559 # ← Age is about 1.3 std BELOW average
# 1 0.446735 0.448545 0.306186 # ← Age is about 0.45 std ABOVE average
# 2 -0.424944 -0.387101 -0.510310 # ← All three values a bit below average
# INTERPRETATION:
# - Negative values = below average
# - Positive values = above average
# - Values near 0 = close to average
# - Most values between -3 and +3 (99.7% rule for normal distribution)
# Verify that scaling worked correctly
# EXPECTED: Mean should be ~0, Standard deviation should be ~1
print("Mean after scaling:", df_scaled[numerical_features].mean().values)
print("Std after scaling:", df_scaled[numerical_features].std().values)
# Mean after scaling: [~0, ~0, ~0] # ← All close to zero!
# Std after scaling: [~1, ~1, ~1] # ← All close to one!
# WHY CHECK? To confirm StandardScaler worked as expected
# WHAT IF NOT? Would indicate a problem with your code
Pro Tip: When to Use StandardScaler
Use StandardScaler when your data is roughly normally distributed (bell curve shape). It's the default choice for most ML tasks. If you have severe outliers, consider RobustScaler instead!
MinMaxScaler (Normalization)
MinMaxScaler transforms data to a fixed range, typically 0 to 1. This is useful when you need bounded values or when the data is not normally distributed. Perfect for neural networks!
MinMaxScaler
x_scaled = (x - min) / (max - min)
What it means:
- Smallest value → 0
- Largest value → 1
- Everything else → Between 0 and 1
- Preserves the shape of distribution
Best for:
- Neural networks (sigmoid/tanh activation)
- Image pixel values (already 0-255)
- Bounded domains (percentages, probabilities)
- When you need exact 0-1 range
# WHY? Some algorithms (like neural networks) work best with 0-1 inputs
# WHAT? Squish all values to exactly 0-1 range
from sklearn.preprocessing import MinMaxScaler
# Create scaler
# WHAT IT DOES: Will find min and max from training data
minmax_scaler = MinMaxScaler() # Default range is (0, 1)
# Fit and transform
# Formula: (x - min) / (max - min)
df_minmax = df.copy()
df_minmax[numerical_features] = minmax_scaler.fit_transform(df[numerical_features])
print(df_minmax[numerical_features].head())
# age income subscription_months
# 0 0.000000 0.000000 0.066667 # ← Age is the MINIMUM (25), so it becomes 0
# 1 0.555556 0.616667 0.466667 # ← Age (45) is just over halfway from min to max
# 2 0.277778 0.333333 0.200000 # ← Age (35) is about 28% of the way from min to max
# INTERPRETATION:
# - 0.0 = Minimum value in original data
# - 1.0 = Maximum value in original data
# - 0.5 = Exactly in the middle
# - All values GUARANTEED to be between 0 and 1!
# Verify range - should be exactly 0 and 1
# WHY CHECK? To confirm MinMaxScaler worked correctly
print("Min after scaling:", df_minmax[numerical_features].min().values)
print("Max after scaling:", df_minmax[numerical_features].max().values)
# Min after scaling: [0.0, 0.0, 0.0] # ← All features start at exactly 0!
# Max after scaling: [1.0, 1.0, 1.0] # ← All features end at exactly 1!
# PERFECT! All features now on the same 0-1 scale
RobustScaler (Outlier-Resistant)
RobustScaler uses the median and interquartile range (IQR) instead of mean and standard deviation. This makes it resistant to outliers that would skew StandardScaler or completely ruin MinMaxScaler. Best choice when you have outliers!
❌ StandardScaler: Distorted by the Outlier
Data: [10, 12, 11, 13, 12, 500, 11]
StandardScaler:
Mean = 81.3 (ruined by 500!), Std = 182.5 (huge!)
Scaled: [-0.39, -0.38, -0.39, ... 2.30]
All the normal values get squashed into a narrow cluster near -0.4 while the outlier dominates.
✅ RobustScaler Solution
Data: [10, 12, 11, 13, 12, 500, 11]
RobustScaler:
Median = 12 (unaffected!), IQR = 1.5 (ignores extremes)
Scaled: [-1.33, 0, -0.67, 0.67, 0, *, -0.67]
The outlier is still an outlier, but it doesn't distort the other values!
# WHY? When you have extreme values that shouldn't influence normal data
# WHAT? Uses median (50th percentile) and IQR (25th to 75th percentile range)
from sklearn.preprocessing import RobustScaler
# Create robust scaler
# WHAT IT DOES: Uses statistics that ignore extreme values
robust_scaler = RobustScaler()
# Fit and transform
# Formula: (x - median) / IQR
# WHERE: IQR = Q3 (75th percentile) - Q1 (25th percentile)
df_robust = df.copy()
df_robust[numerical_features] = robust_scaler.fit_transform(df[numerical_features])
print(df_robust[numerical_features].head())
# age income subscription_months
# 0 -0.833333 -0.873950 -0.480000 # ← Scaled relative to MEDIAN, not mean
# 1 0.500000 0.369748 0.480000
# 2 -0.166667 -0.201681 -0.160000 # ← Close to median!
# INTERPRETATION:
# - Values centered around MEDIAN (not mean)
# - Scaled by IQR (middle 50% of data)
# - Outliers don't distort the scaling!
Pro Tip: Detecting If You Need RobustScaler
Check your data: If mean - median is large, or if you see extreme values in df.describe()
(like max >> 75th percentile), you have outliers! Use RobustScaler instead of StandardScaler.
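A quick way to run that check on the customer dataframe from earlier (the "far apart" judgement is a rule of thumb, not a hard threshold):

```python
# Compare mean vs. median and look at the upper tail for each numerical column
for col in ['age', 'income', 'subscription_months']:
    mean, median = df[col].mean(), df[col].median()
    q75, col_max = df[col].quantile(0.75), df[col].max()
    print(f"{col}: mean={mean:.1f}, median={median:.1f}, 75th pct={q75:.1f}, max={col_max}")
    # If the mean sits far from the median, or max is much larger than the 75th percentile,
    # the column likely has outliers -> consider RobustScaler instead of StandardScaler
```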
Comprehensive Scaler Comparison
| Scaler | Formula | Output Range | Best For | Outlier Sensitive? |
|---|---|---|---|---|
| StandardScaler (most common) | (x - mean) / std | Unbounded, usually -3 to +3 | Normal distributions; SVM, Logistic Regression; PCA, Neural Networks; gradient descent algorithms | Yes - outliers shift mean/std |
| MinMaxScaler (fixed range) | (x - min) / (max - min) | 0 to 1, exact bounds | Neural networks; image data (pixels); bounded features; when an exact 0-1 range is needed | Extremely - one outlier ruins everything |
| RobustScaler (outlier-proof) | (x - median) / IQR | Unbounded, similar to StandardScaler | Data with outliers; financial data; real-world messy data; skewed distributions | No - uses robust statistics |
🎯 Decision Guide: Which Scaler Should I Use?
Choose StandardScaler if:
- Data is roughly normally distributed
- No significant outliers
- Using SVM, PCA, or Logistic Regression
- Default choice!
Choose MinMaxScaler if:
- Need exact 0-1 range
- Using Neural Networks (sigmoid/tanh)
- Image/pixel data
- NO outliers present
Choose RobustScaler if:
- Data has outliers
- Skewed distribution
- Financial or real-world messy data
- StandardScaler gives weird results
Avoiding Data Leakage
A critical mistake is fitting scalers on the entire dataset before splitting into train/test. This causes data leakage because test data statistics (mean, std, min, max) influence the training process. Your model appears to work great, but fails miserably on real new data! Career-ending bug!
❌ WRONG: Data Leakage
# ❌ BAD: Scale BEFORE splitting
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X) # Uses ALL data!
# Then split
X_train, X_test = train_test_split(X_scaled)
# PROBLEM: Test data mean/std influenced training!
# Model has "seen" test data statistics
# Results are FAKE GOOD!
✅ CORRECT: No Leakage
# GOOD: Split FIRST, then scale
X_train, X_test = train_test_split(X)
# Fit scaler on TRAINING data ONLY
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
# Transform test using TRAINING statistics
X_test_scaled = scaler.transform(X_test) # No fit!
# PERFECT: Test data is truly unseen!
# CORRECT WORKFLOW: Preventing Data Leakage
# WHY? Must simulate real-world: you only have training data when building model!
from sklearn.model_selection import train_test_split
# Step 1: Define features and target
# WHAT? X = input features, y = what we want to predict
X = df[numerical_features] # Our input data
y = df['churned'] # What we're trying to predict
# Step 2: Split into train and test FIRST (before any scaling!)
# WHY? This is the REAL separation - test set is "future unseen data"
# test_size=0.2 means 20% for testing, 80% for training
# random_state=42 makes the split reproducible (same split every time)
X_train, X_test, y_train, y_test = train_test_split(
X, y,
test_size=0.2, # 20% test, 80% train
random_state=42 # For reproducibility
)
print(f"Training samples: {len(X_train)}") # e.g., 6 samples
print(f"Test samples: {len(X_test)}") # e.g., 2 samples
# Step 3: Create scaler (doesn't know anything yet)
scaler = StandardScaler()
# Step 4: Fit AND transform on training data
# fit() = Learn the mean and std from TRAINING data ONLY
# transform() = Apply the scaling using those learned statistics
# fit_transform() = Do both at once
X_train_scaled = scaler.fit_transform(X_train) # Learn from train, apply to train
print("Scaler learned from training data only!")
# Step 5: Transform test data using TRAINING statistics (no fit!)
# WHY NO FIT? Because we pretend test data doesn't exist yet!
# We use the SAME mean/std we learned from training data
X_test_scaled = scaler.transform(X_test) # Apply training stats to test
print("Test data scaled using training statistics!")
# RESULT: Test data is TRULY unseen!
# The scaler never learned anything from test data
# This simulates real-world: you'll get new data that wasn't in training
- Split FIRST: Separate train and test before ANY preprocessing
- fit_transform() on train: Learn statistics and apply them
- transform() on test: Apply training statistics (no learning!)
- NEVER fit() on test data - pretend it doesn't exist during training!
Pro Tip: Why This Matters So Much
In real-world ML, you train on historical data, then deploy to predict NEW data you've never seen. If you leak test data into training, your validation metrics will look amazing (95% accuracy!), but production performance will be terrible (60% accuracy). Always simulate the real scenario: test data is future data you don't have yet!
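One way to make leakage prevention automatic is to bundle the scaler and model into a Pipeline, so that cross-validation re-fits the scaler on each fold's training portion only. A minimal sketch reusing the X and y defined above (cv=2 only because this toy dataset has just 8 rows):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Scaler and model travel together: fit() always learns scaling from the training fold only
leak_free = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression()),
])

# Each CV fold re-fits the whole pipeline, so test-fold statistics never leak into training
scores = cross_val_score(leak_free, X, y, cv=2)
print("Fold accuracies:", scores)
```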
Practice Questions: Scaling
Test your understanding with these hands-on exercises.
Given:
scores = np.array([[85], [92], [78], [95], [88]])
Task: Scale these scores to a 0-1 range using MinMaxScaler.
Show Solution
from sklearn.preprocessing import MinMaxScaler
import numpy as np
scores = np.array([[85], [92], [78], [95], [88]])
scaler = MinMaxScaler()
scores_scaled = scaler.fit_transform(scores)
print(scores_scaled)
# [[0.41176471]
# [0.82352941]
# [0. ]
# [1. ]
# [0.58823529]]
Given:
salaries = pd.DataFrame({'salary': [50000, 75000, 120000, 45000, 200000]})
Task: Standardize the salary column and verify the mean is approximately 0.
Show Solution
from sklearn.preprocessing import StandardScaler
import pandas as pd
salaries = pd.DataFrame({'salary': [50000, 75000, 120000, 45000, 200000]})
scaler = StandardScaler()
salaries['salary_scaled'] = scaler.fit_transform(salaries[['salary']])
print(salaries)
print(f"\nMean: {salaries['salary_scaled'].mean():.10f}") # Very close to 0
print(f"Std: {salaries['salary_scaled'].std():.2f}") # Close to 1
Task: Split the customer dataframe into train/test (80/20), then properly scale numerical features avoiding data leakage.
Show Solution
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Features and target
X = df[['age', 'income', 'subscription_months']]
y = df['churned']
# Split first (before any scaling!)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Create and fit scaler on training data only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
# Transform test data using training statistics
X_test_scaled = scaler.transform(X_test)
print("Training shape:", X_train_scaled.shape)
print("Test shape:", X_test_scaled.shape)
print("No data leakage!")
Given:
data = pd.DataFrame({'value': [10, 12, 11, 13, 12, 100, 11, 14]})
Task: Scale this data using RobustScaler (note the outlier value of 100).
Show Solution
from sklearn.preprocessing import RobustScaler
import pandas as pd
data = pd.DataFrame({'value': [10, 12, 11, 13, 12, 100, 11, 14]})
scaler = RobustScaler()
data['value_scaled'] = scaler.fit_transform(data[['value']])
print(data)
# The outlier (100) is scaled but doesn't distort other values
# because RobustScaler uses median and IQR
Creating New Features
Sometimes the best features are not in your original dataset - you create them. Interaction features, polynomial terms, and domain-specific transformations can unlock hidden patterns that raw features alone cannot express. Often improves accuracy by 10-20%!
Mathematical Transformations
Basic mathematical operations can create powerful new features. Ratios, differences, and aggregations often capture relationships that models would otherwise struggle to learn. Think: "What would make sense to a human?"
Common Feature Creation Patterns
Ratios
total_spent / num_orders = avg_order_value
clicks / impressions = click_rate
Differences
current_year - birth_year = age
end_date - start_date = duration
Aggregates
sum(purchases) = total_purchases
mean(ratings) = avg_rating
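The ratio and difference patterns are demonstrated in the code that follows; the aggregate pattern usually involves rolling up a finer-grained table. A small sketch with hypothetical transaction-level data:

```python
import pandas as pd

# Hypothetical transaction-level data (one row per purchase)
txns = pd.DataFrame({
    'customer_id': [1, 1, 1, 2, 2, 3],
    'amount':      [500, 1200, 300, 2500, 1800, 900],
    'rating':      [4, 5, 3, 5, 4, 2],
})

# Aggregate pattern: roll transactions up to one row (and several features) per customer
customer_features = txns.groupby('customer_id').agg(
    total_purchases=('amount', 'sum'),
    avg_rating=('rating', 'mean'),
    num_transactions=('amount', 'count'),
).reset_index()

print(customer_features)
#    customer_id  total_purchases  avg_rating  num_transactions
# 0            1             2000         4.0                 3
# 1            2             4300         4.5                 2
# 2            3              900         2.0                 1
```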
# WHY? Let's create a realistic e-commerce scenario
# WHAT? Customer behavior data for an online store
# GOAL: Predict who will become VIP customers
orders = pd.DataFrame({
'customer_id': [1, 2, 3, 4, 5],
'total_orders': [12, 5, 28, 8, 15], # How many times they ordered
'total_spent': [45000, 12000, 125000, 32000, 58000], # Total money spent
'account_age_days': [365, 120, 730, 200, 450] # How long they've been a customer
})
print(orders)
# customer_id total_orders total_spent account_age_days
# 0 1 12 45000 365
# 1 2 5 12000 120
# 2 3 28 125000 730 # ← This customer looks very valuable!
# 3 4 8 32000 200
# 4 5 15 58000 450
# ===== CREATE RATIO FEATURES =====
# WHY? Raw numbers don't tell the full story!
# Customer 3 spent 125k, but over 730 days. Customer 1 spent 45k in 365 days.
# Who's more valuable per day? We need to CALCULATE that!
# Feature 1: Average Order Value
# WHAT IT MEANS: How much does this customer spend per order?
# WHY VALUABLE? High AOV = big spender = VIP customer
orders['avg_order_value'] = orders['total_spent'] / orders['total_orders']
# Feature 2: Orders Per Month
# WHAT IT MEANS: How frequently does this customer order?
# WHY VALUABLE? High frequency = engaged customer = less likely to leave
orders['orders_per_month'] = orders['total_orders'] / (orders['account_age_days'] / 30)
# Feature 3: Spend Per Day
# WHAT IT MEANS: Average daily spending rate
# WHY VALUABLE? Normalizes spending by account age - fairer comparison
orders['spend_per_day'] = orders['total_spent'] / orders['account_age_days']
print(orders[['customer_id', 'avg_order_value', 'orders_per_month', 'spend_per_day']])
# customer_id avg_order_value orders_per_month spend_per_day
# 0 1 3750.000000 0.986301 123.287671 # ← Good spender!
# 1 2 2400.000000 1.250000 100.000000 # ← Frequent but small orders
# 2 3 4464.285714 1.150685 171.232877 # ← BEST CUSTOMER! High value + frequent
# 3 4 4000.000000 1.200000 160.000000 # ← Also great!
# 4 5 3866.666667 1.000000 128.888889
# INTERPRETATION:
# Customer 3: High AOV ($4,464), frequent orders (1.15/month), high daily spend ($171)
# This customer is MUCH more valuable than Customer 2 (frequent but cheap orders)
# These engineered features reveal insights the raw numbers hid!
Date and Time Features
Dates contain rich information that models cannot use directly. Extract components like year, month, day of week, and calculate durations to unlock temporal patterns.
# Sample data with dates and times
transactions = pd.DataFrame({
'transaction_id': [1, 2, 3, 4, 5],
'date': pd.to_datetime(['2024-01-15 09:30:00', '2024-03-22 14:45:00', '2024-07-04 20:15:00', '2024-11-28 11:00:00', '2024-12-25 16:30:00']),
'amount': [1500, 2300, 890, 4500, 3200]
})
# Extract date components
transactions['year'] = transactions['date'].dt.year
transactions['month'] = transactions['date'].dt.month
transactions['day'] = transactions['date'].dt.day
transactions['hour'] = transactions['date'].dt.hour # Hour of day (0-23)
transactions['day_of_week'] = transactions['date'].dt.dayofweek # 0=Monday
transactions['is_weekend'] = transactions['day_of_week'].isin([5, 6]).astype(int)
print(transactions[['date', 'month', 'hour', 'day_of_week', 'is_weekend']])
# Create time-of-day categories from hour
def get_time_period(hour):
    if hour < 6:
        return 'Night'
    elif hour < 12:
        return 'Morning'
    elif hour < 18:
        return 'Afternoon'
    else:
        return 'Evening'
transactions['time_period'] = transactions['hour'].apply(get_time_period)
print(transactions[['date', 'hour', 'time_period']])
# date hour time_period
# 0 2024-01-15 09:30:00 9 Morning
# 1 2024-03-22 14:45:00 14 Afternoon
# 2 2024-07-04 20:15:00 20 Evening
# Calculate days since a reference date
reference_date = pd.to_datetime('2024-01-01')
transactions['days_since_year_start'] = (transactions['date'] - reference_date).dt.days
# Quarter and season
transactions['quarter'] = transactions['date'].dt.quarter
print(transactions[['date', 'quarter', 'days_since_year_start']])
Interaction Features
Interaction features capture the combined effect of two or more variables. For example, the effect of "experience" on salary might be different for different "education levels".
# Create interaction features manually
df['age_income_interaction'] = df['age'] * df['income']
df['income_per_age'] = df['income'] / df['age']
print(df[['age', 'income', 'age_income_interaction', 'income_per_age']].head())
# age income age_income_interaction income_per_age
# 0 25 35000 875000 1400.000000
# 1 45 72000 3240000 1600.000000
Polynomial Features
Polynomial features create higher-order terms and interactions automatically. This is useful for capturing non-linear relationships in linear models.
from sklearn.preprocessing import PolynomialFeatures
# Sample data
X = df[['age', 'income']].head(3)
print("Original features:")
print(X)
# Create polynomial features (degree=2)
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
# Get feature names
feature_names = poly.get_feature_names_out(['age', 'income'])
print("\nPolynomial feature names:")
print(feature_names)
# ['age', 'income', 'age^2', 'age income', 'income^2']
# Create DataFrame with polynomial features
X_poly_df = pd.DataFrame(X_poly, columns=feature_names)
print(X_poly_df)
# age income age^2 age income income^2
# 0 25.0 35000.0 625.0 875000.0 1225000000.0
# 1 45.0 72000.0 2025.0 3240000.0 5184000000.0
# 2 35.0 55000.0 1225.0 1925000.0 3025000000.0
Tip: polynomial features grow quickly with degree and feature count. Set interaction_only=True to limit the output to just the interaction terms (like age × income) without the power terms (age², income²).
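For instance, a quick sketch of what interaction_only changes:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X_small = np.array([[25, 35000], [45, 72000]])

# interaction_only=True keeps the cross-term but drops the squared terms
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
poly.fit(X_small)
print(poly.get_feature_names_out(['age', 'income']))
# ['age' 'income' 'age income']   <- no 'age^2' or 'income^2'
```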
Text-Based Feature Extraction
Text fields often contain valuable information. Extract features like length, word count, or presence of specific keywords.
# Sample product reviews
reviews = pd.DataFrame({
'review_id': [1, 2, 3],
'text': [
'Great product! Works perfectly.',
'Terrible quality. Broke after one week. DO NOT BUY!',
'Good value for money'
]
})
# Extract text features
reviews['char_count'] = reviews['text'].str.len()
reviews['word_count'] = reviews['text'].str.split().str.len()
reviews['avg_word_length'] = reviews['char_count'] / reviews['word_count']
reviews['has_exclamation'] = reviews['text'].str.contains('!').astype(int)
reviews['is_uppercase_heavy'] = (reviews['text'].str.count(r'[A-Z]') > 5).astype(int)
print(reviews[['text', 'word_count', 'has_exclamation', 'is_uppercase_heavy']])
Practice Questions: Creating Features
Test your understanding with these hands-on exercises.
Given:
customers = pd.DataFrame({
'revenue': [5000, 12000, 8000, 3000],
'acquisition_cost': [500, 800, 600, 400]
})
Task: Create a 'roi' feature as revenue divided by acquisition_cost.
Show Solution
customers = pd.DataFrame({
'revenue': [5000, 12000, 8000, 3000],
'acquisition_cost': [500, 800, 600, 400]
})
customers['roi'] = customers['revenue'] / customers['acquisition_cost']
print(customers)
# revenue acquisition_cost roi
# 0 5000 500 10.0
# 1 12000 800 15.0
# 2 8000 600 13.33
# 3 3000 400 7.5
Given:
orders = pd.DataFrame({
'order_date': pd.to_datetime(['2024-06-15', '2024-12-25', '2024-03-08'])
})
Task: Extract month, is_weekend, and is_holiday (Dec 25) features.
Show Solution
orders = pd.DataFrame({
'order_date': pd.to_datetime(['2024-06-15', '2024-12-25', '2024-03-08'])
})
orders['month'] = orders['order_date'].dt.month
orders['day_of_week'] = orders['order_date'].dt.dayofweek
orders['is_weekend'] = orders['day_of_week'].isin([5, 6]).astype(int)
orders['is_holiday'] = ((orders['order_date'].dt.month == 12) &
(orders['order_date'].dt.day == 25)).astype(int)
print(orders)
# order_date month day_of_week is_weekend is_holiday
# 0 2024-06-15 6 5 1 0
# 1 2024-12-25 12 2 0 1
# 2 2024-03-08 3 4 0 0
Given:
X = pd.DataFrame({'x1': [1, 2, 3], 'x2': [4, 5, 6]})
Task: Create degree-2 polynomial features and display the resulting feature names.
Show Solution
from sklearn.preprocessing import PolynomialFeatures
import pandas as pd
X = pd.DataFrame({'x1': [1, 2, 3], 'x2': [4, 5, 6]})
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
feature_names = poly.get_feature_names_out(['x1', 'x2'])
X_poly_df = pd.DataFrame(X_poly, columns=feature_names)
print("Feature names:", feature_names)
print(X_poly_df)
# Feature names: ['x1' 'x2' 'x1^2' 'x1 x2' 'x2^2']
Given:
products = pd.DataFrame({
'description': ['Premium quality laptop', 'Budget phone with great battery life', 'Watch']
})
Task: Create word_count and avg_word_length features.
Show Solution
products = pd.DataFrame({
'description': ['Premium quality laptop', 'Budget phone with great battery life', 'Watch']
})
products['word_count'] = products['description'].str.split().str.len()
products['char_count'] = products['description'].str.len()
products['avg_word_length'] = products['char_count'] / products['word_count']
print(products)
# description word_count char_count avg_word_length
# 0 Premium quality laptop 3 22 7.333333
# 1 Budget phone with great battery life 6 36 6.000000
# 2 Watch 1 5 5.000000
Binning and Discretization
Converting continuous variables into discrete bins can reduce noise, handle outliers, and capture non-linear relationships that simple linear models would miss. Binning transforms numerical data into categorical groups based on value ranges.
When to Use Binning
Binning is particularly useful when the exact numerical value matters less than which range or category it falls into. Think about age groups for marketing, income brackets for loans, or temperature ranges for weather classification.
Good candidates for binning:
- Age groups for marketing segments
- Income brackets for credit scoring
- Time of day for traffic analysis
- Reducing impact of outliers
- When business rules use categories

Avoid binning when:
- Exact values are important
- The feature has a linear relationship with the target
- Using tree-based models (they bin naturally)
- Working with small datasets (binning loses information)
- The continuous version is more predictive
Equal-Width Binning with pd.cut()
Equal-width binning divides the range into bins of equal size. Use this when you want consistent intervals regardless of how data is distributed.
# Sample age data
ages = pd.DataFrame({'age': [22, 35, 45, 19, 67, 52, 28, 41, 73, 31]})
# Create 4 equal-width bins
ages['age_bin'] = pd.cut(ages['age'], bins=4)
print(ages)
# age age_bin
# 0 22 (18.946, 32.5]
# 1 35 (32.5, 46.0]
# 2 45 (32.5, 46.0]
# 3 19 (18.946, 32.5]
# 4 67 (59.5, 73.0]
# Custom bin edges with labels
age_bins = [0, 18, 30, 50, 65, 100]
age_labels = ['Minor', 'Young Adult', 'Adult', 'Middle Age', 'Senior']
ages['age_group'] = pd.cut(ages['age'], bins=age_bins, labels=age_labels)
print(ages[['age', 'age_group']])
# age age_group
# 0 22 Young Adult
# 1 35 Adult
# 2 45 Adult
# 3 19 Young Adult
# 4 67 Senior
# 5 52 Middle Age
Equal-Frequency Binning with pd.qcut()
Equal-frequency (quantile) binning puts approximately the same number of observations in each bin. This is useful when you want balanced groups regardless of value distribution.
# Create 4 equal-frequency bins (quartiles)
ages['age_quartile'] = pd.qcut(ages['age'], q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
print(ages[['age', 'age_quartile']])
# age age_quartile
# 0 22 Q1
# 1 35 Q2
# 2 45 Q3
# 3 19 Q1
# 4 67 Q4
# Check bin distribution
print(ages['age_quartile'].value_counts())
# Each quartile has approximately equal count
# Quantile binning with custom quantiles
income = pd.DataFrame({'income': [25000, 45000, 72000, 38000, 150000, 55000, 42000, 89000]})
# Create percentile-based bins
income['income_tier'] = pd.qcut(
income['income'],
q=[0, 0.25, 0.5, 0.75, 1.0],
labels=['Low', 'Medium', 'High', 'Premium']
)
print(income)
Comparing cut() vs qcut()
| Feature | pd.cut() (Equal-Width) | pd.qcut() (Equal-Frequency) |
|---|---|---|
| Bin sizes | Same range width | Same number of items |
| Best for | Uniformly distributed data | Skewed distributions |
| Custom edges | Yes, via bins parameter | No (uses quantiles) |
| Outlier handling | May create sparse bins | Distributes evenly |
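A small sketch makes the difference concrete on skewed data (the income values below are illustrative):

```python
import pandas as pd

# Right-skewed values: most are small, a couple are very large
incomes = pd.Series([20, 22, 25, 28, 30, 35, 40, 45, 300, 900])

# Equal-width bins: nearly everything lands in the first bin, the upper bins are almost empty
print(pd.cut(incomes, bins=3).value_counts().sort_index())

# Equal-frequency bins: each bin gets roughly the same number of observations
print(pd.qcut(incomes, q=3).value_counts().sort_index())
```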
Scikit-learn KBinsDiscretizer
For machine learning pipelines, scikit-learn's KBinsDiscretizer offers binning strategies that integrate seamlessly with other transformers.
from sklearn.preprocessing import KBinsDiscretizer
# Sample data
X = np.array([[22], [35], [45], [19], [67], [52], [28], [41]])
# Uniform strategy (equal-width)
discretizer = KBinsDiscretizer(n_bins=4, encode='ordinal', strategy='uniform')
X_binned = discretizer.fit_transform(X)
print("Original vs Binned:")
for orig, binned in zip(X.flatten(), X_binned.flatten()):
    print(f"  {orig} -> Bin {int(binned)}")
# Different encoding strategies
# 'ordinal': Returns bin indices (0, 1, 2, ...)
# 'onehot': Returns one-hot encoded sparse matrix
# 'onehot-dense': Returns one-hot encoded dense matrix
discretizer_onehot = KBinsDiscretizer(n_bins=3, encode='onehot-dense', strategy='quantile')
X_onehot = discretizer_onehot.fit_transform(X)
print("One-hot encoded bins shape:", X_onehot.shape) # (8, 3) for 3 bins
Practical Example: Customer Segmentation
Let's combine multiple binning techniques for customer segmentation in a real-world scenario.
# Create customer dataset
customers = pd.DataFrame({
'customer_id': range(1, 11),
'age': [22, 35, 45, 28, 67, 52, 31, 41, 58, 24],
'annual_spend': [1200, 8500, 15000, 3200, 25000, 12000, 4500, 9800, 18000, 2100],
'transactions': [12, 45, 89, 23, 156, 67, 34, 52, 98, 15]
})
# Bin age into life stages
age_bins = [0, 25, 35, 50, 65, 100]
age_labels = ['Gen Z', 'Millennial', 'Gen X', 'Boomer', 'Silent']
customers['generation'] = pd.cut(customers['age'], bins=age_bins, labels=age_labels)
# Bin spend into value tiers (quantile-based)
customers['value_tier'] = pd.qcut(
customers['annual_spend'],
q=3,
labels=['Bronze', 'Silver', 'Gold']
)
# Bin transaction frequency
customers['activity_level'] = pd.cut(
customers['transactions'],
bins=[0, 25, 75, float('inf')],
labels=['Low', 'Medium', 'High']
)
print(customers[['customer_id', 'generation', 'value_tier', 'activity_level']])
Practice Questions: Binning
Test your understanding with these hands-on exercises.
Given:
temps = pd.DataFrame({'temp_celsius': [5, 15, 25, 32, 8, 22, 38, 12]})
Task: Create a 'weather' column with categories: Cold (0-10), Mild (10-20), Warm (20-30), Hot (30+).
Show Solution
temps = pd.DataFrame({'temp_celsius': [5, 15, 25, 32, 8, 22, 38, 12]})
bins = [0, 10, 20, 30, 50]
labels = ['Cold', 'Mild', 'Warm', 'Hot']
temps['weather'] = pd.cut(temps['temp_celsius'], bins=bins, labels=labels)
print(temps)
# temp_celsius weather
# 0 5 Cold
# 1 15 Mild
# 2 25 Warm
# 3 32 Hot
Given:
spend = pd.DataFrame({'amount': [100, 5000, 250, 12000, 800, 3500, 15000, 450]})
Task: Use qcut to create 4 equal-frequency spending tiers.
Show Solution
spend = pd.DataFrame({'amount': [100, 5000, 250, 12000, 800, 3500, 15000, 450]})
spend['tier'] = pd.qcut(spend['amount'], q=4, labels=['Tier 1', 'Tier 2', 'Tier 3', 'Tier 4'])
print(spend)
print("\nCounts per tier:")
print(spend['tier'].value_counts())
# Each tier has 2 customers
Task: Use KBinsDiscretizer to bin the 'age' column into 5 quantile-based bins with one-hot encoding.
Show Solution
from sklearn.preprocessing import KBinsDiscretizer
import numpy as np
ages = np.array([[22], [35], [45], [28], [67], [52], [31], [41], [58], [24]])
discretizer = KBinsDiscretizer(
n_bins=5,
encode='onehot-dense',
strategy='quantile'
)
ages_binned = discretizer.fit_transform(ages)
print("Shape:", ages_binned.shape) # (10, 5)
print("First few rows:")
print(ages_binned[:5])
Given:
events = pd.DataFrame({'hour': [6, 14, 22, 3, 9, 18, 12, 20]})
Task: Create time_period with Night (0-6), Morning (6-12), Afternoon (12-18), Evening (18-24).
Show Solution
events = pd.DataFrame({'hour': [6, 14, 22, 3, 9, 18, 12, 20]})
bins = [0, 6, 12, 18, 24]
labels = ['Night', 'Morning', 'Afternoon', 'Evening']
events['time_period'] = pd.cut(
    events['hour'],
    bins=bins,
    labels=labels,
    right=False  # left-closed bins so hour 6 → Morning and 18 → Evening, as the task requires
)
print(events)
# hour time_period
# 0 6 Morning
# 1 14 Afternoon
# 2 22 Evening
# 3 3 Night
Interactive: Scaling and Encoding Visualizers
The interactive version of this lesson includes two companion tools: a scaling comparison tool that shows, for the same input values, how StandardScaler ((x - mean) / std), MinMaxScaler ((x - min) / (max - min)), and RobustScaler ((x - median) / IQR) each transform the data, and an encoding comparison tool that contrasts label encoding (which implies a false ordering for nominal data such as Red | Blue | Green) with one-hot encoding.
Recommendation
For nominal (unordered) categories like colors, use One-Hot Encoding to avoid implying a false order.
Key Takeaways
Features Drive Model Success
Better features often matter more than choosing a fancier algorithm - invest time in feature engineering
Encode Categoricals Wisely
Use one-hot for nominal, label/ordinal for ordered categories - wrong encoding hurts models
Scale Features Appropriately
StandardScaler for normal distributions, MinMaxScaler for bounded ranges, RobustScaler for outliers
Create Meaningful Features
Interaction terms, ratios, and domain-specific features often capture patterns raw data cannot
Binning Reduces Noise
Convert continuous to categorical when exact values matter less than ranges or categories
Avoid Data Leakage
Fit transformers on training data only, then apply to test - never peek at test data statistics