One-Hot Encoding
One-hot encoding is the most widely used technique for converting nominal categorical variables into numerical format. It creates a new binary column for each unique category, where 1 indicates the presence of that category and 0 indicates absence. This approach is ideal when categories have no natural ordering and works exceptionally well with linear models, neural networks, and distance-based algorithms.
One-Hot Encoding Definition
A technique that transforms a categorical column with n unique values into n binary columns (or n-1 to avoid multicollinearity). Each row has exactly one "1" across these new columns, representing its category.
Using Pandas get_dummies()
The simplest way to perform one-hot encoding in Python is using pd.get_dummies().
This function automatically detects categorical columns and creates binary indicator variables.
Fastest method for exploration!
# WHY? Machine learning models need numbers, not text!
# WHAT? pd.get_dummies() converts categories to binary (0/1) columns
import pandas as pd
# Sample customer data with categorical 'city' and 'plan' columns
# SCENARIO: Telecom company analyzing customer subscriptions by location
customers = pd.DataFrame({
'customer_id': [1, 2, 3, 4, 5],
'city': ['New York', 'Boston', 'Chicago', 'New York', 'Boston'],
'plan': ['Premium', 'Basic', 'Premium', 'Basic', 'Standard']
})
print("Original Data:")
print(customers)
# Output:
# customer_id city plan
# 0 1 New York Premium # ← Customer 1 lives in New York, has Premium plan
# 1 2 Boston Basic # ← Customer 2 lives in Boston, has Basic plan
# 2 3 Chicago Premium # ← Customer 3 lives in Chicago, has Premium plan
# 3 4 New York Basic # ← Customer 4 lives in New York, has Basic plan
# 4 5 Boston Standard # ← Customer 5 lives in Boston, has Standard plan
# STEP 1: One-hot encode ONLY the 'city' column
# WHY? We want to convert city names to numbers without losing information
# WHAT IT DOES: Creates 3 new columns (city_Boston, city_Chicago, city_New York)
# prefix='city' adds 'city_' before each column name for clarity
encoded = pd.get_dummies(customers, columns=['city'], prefix='city')
print("\nOne-Hot Encoded:")
print(encoded)
# Output:
# customer_id plan city_Boston city_Chicago city_New York
# 0 1 Premium 0 0 1 # ← New York = [0, 0, 1]
# 1 2 Basic 1 0 0 # ← Boston = [1, 0, 0]
# 2 3 Premium 0 1 0 # ← Chicago = [0, 1, 0]
# 3 4 Basic 0 0 1 # ← New York = [0, 0, 1]
# 4 5 Standard 1 0 0 # ← Boston = [1, 0, 0]
# INTERPRETATION:
# - Each city gets its own binary column
# - Row 0: city_New York = 1 (customer is in New York), other city columns = 0
# - Row 1: city_Boston = 1 (customer is in Boston), other city columns = 0
# - Each row has exactly ONE "1" in the city columns (mutually exclusive categories)
# - No city is considered "higher" or "better" than another - just different!
The Dummy Variable Trap
Problem: When using one-hot encoding with linear regression or similar models, having ALL n columns creates perfect multicollinearity. This happens because if you know the values of n-1 columns, you can perfectly predict the nth column!
Example: If city_Boston=0 and city_Chicago=0, we KNOW city_New York MUST be 1. The third column is redundant!
Solution: Use pd.get_dummies(df, drop_first=True) to automatically drop the first category (called the "reference category" or "baseline").
Note: Tree-based models (Random Forest, XGBoost) are NOT affected by this issue!
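The redundancy described above is easy to verify directly. A minimal sketch (using a small standalone frame rather than the customers data, so it runs on its own):

```python
import pandas as pd

# With ALL n dummy columns, every row sums to exactly 1,
# so any one column is perfectly predictable from the others.
df = pd.DataFrame({'city': ['New York', 'Boston', 'Chicago', 'New York']})

full = pd.get_dummies(df, columns=['city'])                      # n = 3 columns
reduced = pd.get_dummies(df, columns=['city'], drop_first=True)  # n - 1 = 2 columns

print(full.sum(axis=1).tolist())  # every row sums to 1 -> perfect multicollinearity
print(list(reduced.columns))      # Boston (alphabetically first) was dropped
```

Because the row sums are constant, a linear model sees one column as an exact linear combination of the others; dropping any single column breaks that dependency without losing information.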
# AVOIDING THE DUMMY VARIABLE TRAP
# WHY? Linear models can't handle perfect multicollinearity
# WHAT? Drop the first category as a "baseline" - it's implied when all others are 0
# Encode BOTH 'city' and 'plan' columns with drop_first=True
# WHAT IT DOES: Creates n-1 columns for each categorical variable
# The first category (alphabetically) becomes the baseline/reference
encoded_safe = pd.get_dummies(customers, columns=['city', 'plan'], drop_first=True)
print(encoded_safe)
# Output:
# customer_id city_Chicago city_New York plan_Premium plan_Standard
# 0 1 0 1 1 0 # New York, Premium
# 1 2 0 0 0 0 # Boston (baseline!), Basic (baseline!)
# 2 3 1 0 1 0 # Chicago, Premium
# 3 4 0 1 0 0 # New York, Basic (baseline!)
# 4 5 0 0 0 1 # Boston (baseline!), Standard
# INTERPRETATION:
# City baselines: Boston is the baseline city (dropped)
# - city_Chicago=0, city_New York=0 → Boston (implied)
# - city_Chicago=1, city_New York=0 → Chicago
# - city_Chicago=0, city_New York=1 → New York
#
# Plan baselines: Basic is the baseline plan (dropped)
# - plan_Premium=0, plan_Standard=0 → Basic (implied)
# - plan_Premium=1, plan_Standard=0 → Premium
# - plan_Premium=0, plan_Standard=1 → Standard
#
# WHY THIS WORKS: We reduced from 3+3=6 columns to 2+2=4 columns, but kept ALL information!
Pro Tip: When to Drop First
Always drop_first=True for: Linear Regression, Logistic Regression, Ridge, Lasso.
Don't need to drop for: Random Forest, XGBoost, LightGBM, Decision Trees (they handle it fine).
Why? Tree-based models split on features independently, so multicollinearity doesn't matter!
Using Scikit-learn OneHotEncoder
For machine learning pipelines, sklearn.preprocessing.OneHotEncoder is preferred because it can be fitted once on training data and consistently applied to test data, handling unseen categories gracefully.
Production-ready approach!
- Consistency: Encoder remembers categories from training data
- Unseen categories: Handles new categories in test data gracefully
- Pipeline integration: Works seamlessly with sklearn pipelines
- Production deployment: Save encoder, use on new data later
# WHY? For production ML pipelines, we need reproducible encoding
# WHAT? OneHotEncoder learns categories from training data and applies consistently
from sklearn.preprocessing import OneHotEncoder
import numpy as np
# STEP 1: Create encoder with important parameters
# sparse_output=False → Get regular array instead of sparse matrix (easier to work with)
# handle_unknown='ignore' → If test data has NEW categories not in training, give them all 0s
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
# STEP 2: Fit encoder on training data
# WHY [[city]]? OneHotEncoder expects 2D array (DataFrame with double brackets)
cities = customers[['city']] # 2D DataFrame (5 rows, 1 column)
print("Shape of input:", cities.shape) # (5, 1)
# fit_transform() = Learn the unique categories AND encode them in one step
# WHAT IT LEARNS: "There are 3 cities: Boston, Chicago, New York"
encoded_array = encoder.fit_transform(cities)
print("\nEncoded array shape:", encoded_array.shape) # (5, 3) - 5 rows, 3 columns
# STEP 3: Get feature names (column names for the encoded data)
# WHY? We need to know which column represents which city
feature_names = encoder.get_feature_names_out(['city'])
print("\nFeature names:", feature_names)
# Output: ['city_Boston' 'city_Chicago' 'city_New York']
# STEP 4: Convert numpy array back to pandas DataFrame for readability
encoded_df = pd.DataFrame(encoded_array, columns=feature_names)
print("\nEncoded DataFrame:")
print(encoded_df)
# Output:
# city_Boston city_Chicago city_New York
# 0 0.0 0.0 1.0 # ← New York
# 1 1.0 0.0 0.0 # ← Boston
# 2 0.0 1.0 0.0 # ← Chicago
# 3 0.0 0.0 1.0 # ← New York
# 4 1.0 0.0 0.0 # ← Boston
# INTERPRETATION:
# - The encoder created columns in ALPHABETICAL order: Boston, Chicago, New York
# - Each row has exactly one 1.0 and two 0.0s
# - Unlike get_dummies(), this encoder is now "trained" and can be reused!
Once fitted, the encoder can be reused on new data with encoder.transform(new_data) - no retraining needed!
When to use one-hot encoding:
- Nominal categories with no natural order
- Low to medium cardinality (fewer than 15-20 unique values)
- Linear models, neural networks, SVM
- When interpretability matters
When to avoid one-hot encoding:
- High cardinality features (100+ categories)
- Memory-constrained environments
- Tree-based models (simpler encodings work fine)
- When sparse matrices cause performance issues
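The dimensionality cost is easy to see on synthetic data - a quick sketch (the zip_code column is made up for illustration):

```python
import pandas as pd
import numpy as np

# One high-cardinality column explodes into hundreds of mostly-zero columns.
rng = np.random.default_rng(0)
users = pd.DataFrame({'zip_code': rng.integers(10000, 99999, size=1000).astype(str)})

encoded = pd.get_dummies(users, columns=['zip_code'])
print(users['zip_code'].nunique(), "unique zips ->", encoded.shape[1], "columns")
```

One binary column per unique zip code: nearly a thousand new features for a single original column, almost all zeros in every row.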
Handling Unseen Categories
A common challenge is handling categories in test data that were not seen during training.
Scikit-learn's handle_unknown='ignore' parameter solves this by setting all columns to 0 for unknown categories.
Critical for production!
# SCENARIO: Train on 3 cities, then encounter NEW city in production
# WHY THIS MATTERS: Real-world data is messy - you can't predict all future categories!
# STEP 1: Train encoder on training data (only 3 cities)
# SIMULATION: This is what you know during model development
train_cities = pd.DataFrame({'city': ['New York', 'Boston', 'Chicago']})
# Create encoder with handle_unknown='ignore'
# WHAT IT DOES: If it sees an unknown category, it assigns all 0s (safe default)
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoder.fit(train_cities) # LEARN: "I know 3 cities: Boston, Chicago, New York"
print("Categories learned during training:", encoder.categories_)
# Output: [array(['Boston', 'Chicago', 'New York'], dtype=object)]
# STEP 2: Test data contains UNSEEN category 'Seattle'
# SIMULATION: This is what happens in production with real-world data
test_cities = pd.DataFrame({'city': ['Boston', 'Seattle', 'New York']})
print("\nTest data (notice Seattle is NEW!):")
print(test_cities)
# city
# 0 Boston # ← Known category
# 1 Seattle # ← UNKNOWN CATEGORY! Not in training data
# 2 New York # ← Known category
# STEP 3: Transform test data using the trained encoder
# WHAT HAPPENS: Boston and New York are encoded normally, Seattle gets all 0s
encoded_test = encoder.transform(test_cities)
print("\nEncoded test data:")
encoded_df = pd.DataFrame(encoded_test, columns=encoder.get_feature_names_out())
print(encoded_df)
# Output:
# city_Boston city_Chicago city_New York
# 0 1.0 0.0 0.0 # ← Boston: Encoded normally
# 1 0.0 0.0 0.0 # ← Seattle: ALL ZEROS (unknown category)
# 2 0.0 0.0 1.0 # ← New York: Encoded normally
# INTERPRETATION:
# Row 0 (Boston): city_Boston=1, others=0 → Correct encoding
# Row 1 (Seattle): ALL ZEROS → "I don't know this city, treat it as missing/unknown"
# Row 2 (New York): city_New York=1, others=0 → Correct encoding
#
# WHY ALL ZEROS FOR SEATTLE? It's a safe default that says:
# "This data point doesn't match any category I know, so I'll treat it as neutral"
# The model can still make predictions, but won't use city information for Seattle
Without handle_unknown='ignore'
encoder = OneHotEncoder()
encoder.fit(train_cities)
encoder.transform(test_cities)
# ERROR: ValueError: Found unknown categories ['Seattle'] during transform
Result: Production system crashes! ❌
With handle_unknown='ignore'
encoder = OneHotEncoder(
handle_unknown='ignore'
)
encoder.fit(train_cities)
encoded = encoder.transform(test_cities)
# SUCCESS: Seattle gets [0, 0, 0]
Result: System handles it gracefully! ✅
Pro Tip: Alternative Strategies
1. Rare category grouping: Before encoding, replace rare categories (appearing <3 times) with "Other"
2. Add "Unknown" category: Include an explicit "Unknown" category in training data
3. Use frequency encoding: For high-cardinality features, frequency encoding handles unknowns naturally
4. Monitor unknowns: Log how often unknown categories appear - might indicate data drift!
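Strategy 3 can be sketched in a few lines; the city names here are illustrative:

```python
import pandas as pd

# Frequency encoding: replace each category with its share of the training data.
df = pd.DataFrame({'city': ['NYC', 'NYC', 'NYC', 'LA', 'LA', 'Tokyo']})

freq = df['city'].value_counts(normalize=True)  # NYC: 0.5, LA: ~0.33, Tokyo: ~0.17
df['city_freq'] = df['city'].map(freq)

# An unseen category simply maps to NaN, which we fill with 0
unseen = pd.Series(['Paris']).map(freq).fillna(0)
print(df)
print(unseen.tolist())  # [0.0]
```

Unlike one-hot encoding, this produces a single numeric column regardless of cardinality, and unknowns degrade gracefully to 0 rather than crashing.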
Practice Questions
Task: Given a DataFrame with product categories, one-hot encode the 'category' column and drop the first category to avoid multicollinearity.
# Starter code
import pandas as pd
products = pd.DataFrame({
'product_id': [101, 102, 103, 104, 105],
'name': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Headphones'],
'category': ['Electronics', 'Accessories', 'Accessories', 'Electronics', 'Audio']
})
# Your code here: One-hot encode 'category' with drop_first=True
View Solution
import pandas as pd
products = pd.DataFrame({
'product_id': [101, 102, 103, 104, 105],
'name': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Headphones'],
'category': ['Electronics', 'Accessories', 'Accessories', 'Electronics', 'Audio']
})
# One-hot encode with drop_first to avoid dummy variable trap
encoded_products = pd.get_dummies(products, columns=['category'], drop_first=True)
print(encoded_products)
# Output:
# product_id name category_Audio category_Electronics
# 0 101 Laptop 0 1
# 1 102 Mouse 0 0 <- Accessories (dropped)
# 2 103 Keyboard 0 0
# 3 104 Monitor 0 1
# 4 105 Headphones 1 0
With drop_first=True, we dropped 'Accessories' (alphabetically first) to avoid multicollinearity.
Now we have 2 binary columns: category_Audio and category_Electronics. When both are 0, we know it's Accessories!
Task: Create a scikit-learn OneHotEncoder that can handle unseen categories, fit it on training data, and transform both training and test data.
# Starter code
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
import numpy as np
train_data = pd.DataFrame({
'color': ['Red', 'Blue', 'Green', 'Red', 'Blue'],
'size': ['Small', 'Large', 'Medium', 'Large', 'Small']
})
test_data = pd.DataFrame({
'color': ['Blue', 'Yellow', 'Red'], # Yellow is unseen!
'size': ['Small', 'XL', 'Large'] # XL is unseen!
})
# Your code here: Create encoder, fit on train, transform both
View Solution
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
import numpy as np
train_data = pd.DataFrame({
'color': ['Red', 'Blue', 'Green', 'Red', 'Blue'],
'size': ['Small', 'Large', 'Medium', 'Large', 'Small']
})
test_data = pd.DataFrame({
'color': ['Blue', 'Yellow', 'Red'], # Yellow is unseen!
'size': ['Small', 'XL', 'Large'] # XL is unseen!
})
# Create encoder that ignores unseen categories
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
# Fit on training data only
encoder.fit(train_data)
# Transform both datasets
train_encoded = encoder.transform(train_data)
test_encoded = encoder.transform(test_data)
# Get feature names
feature_names = encoder.get_feature_names_out(['color', 'size'])
# Convert to DataFrames
train_df = pd.DataFrame(train_encoded, columns=feature_names)
test_df = pd.DataFrame(test_encoded, columns=feature_names)
print("Encoded Training Data:")
print(train_df)
print("\nEncoded Test Data (notice Yellow and XL get all zeros):")
print(test_df)
The key parameter is handle_unknown='ignore'. Without it, the encoder would crash with a ValueError.
This approach allows the model to make predictions even with new categories, treating them as "unknown/other".
Task: When a categorical column has too many unique values, group rare categories into "Other" before one-hot encoding. Keep only categories that appear at least 3 times.
# Starter code
import pandas as pd
orders = pd.DataFrame({
'order_id': range(1, 16),
'country': ['USA', 'USA', 'USA', 'USA', 'Canada', 'Canada', 'Canada',
'UK', 'UK', 'France', 'Germany', 'Spain', 'Italy', 'Japan', 'Australia']
})
# Your code here: Group rare countries (< 3 occurrences) into 'Other', then one-hot encode
View Solution
import pandas as pd
orders = pd.DataFrame({
'order_id': range(1, 16),
'country': ['USA', 'USA', 'USA', 'USA', 'Canada', 'Canada', 'Canada',
'UK', 'UK', 'France', 'Germany', 'Spain', 'Italy', 'Japan', 'Australia']
})
# Count occurrences of each country
country_counts = orders['country'].value_counts()
print("Country counts:\n", country_counts)
# Find categories with fewer than 3 occurrences
rare_countries = country_counts[country_counts < 3].index.tolist()
print("\nRare countries (< 3):", rare_countries)
# Replace rare countries with 'Other'
orders['country_grouped'] = orders['country'].apply(
lambda x: 'Other' if x in rare_countries else x
)
print("\nGrouped values:\n", orders['country_grouped'].value_counts())
# One-hot encode the grouped column
encoded = pd.get_dummies(orders, columns=['country_grouped'], prefix='country')
print("\nEncoded DataFrame:")
print(encoded)
Label & Ordinal Encoding
Label encoding and ordinal encoding both convert categories to integers, but they serve different purposes. Label encoding assigns arbitrary numbers to categories, while ordinal encoding preserves meaningful order. Understanding when to use each is crucial for building effective machine learning models.
Label Encoding
Assigns a unique integer (0, 1, 2, ...) to each category in sorted (alphabetical) order. The numbers have no inherent meaning or relationship.
Ordinal Encoding
Assigns integers that reflect the natural order or ranking of categories. For example, "Low" = 0, "Medium" = 1, "High" = 2.
Label Encoding with Scikit-learn
Label encoding is straightforward but should be used carefully. Since it assigns arbitrary numbers, models may incorrectly assume numerical relationships between categories.
from sklearn.preprocessing import LabelEncoder
import pandas as pd
# Sample data with color categories
products = pd.DataFrame({
'product': ['Shirt', 'Pants', 'Hat', 'Shoes', 'Jacket'],
'color': ['Red', 'Blue', 'Green', 'Red', 'Blue']
})
# Create and fit label encoder
label_encoder = LabelEncoder()
products['color_encoded'] = label_encoder.fit_transform(products['color'])
print(products)
# Output:
# product color color_encoded
# 0 Shirt Red 2
# 1 Pants Blue 0
# 2 Hat Green 1
# 3 Shoes Red 2
# 4 Jacket Blue 0
# View the mapping
print("\nLabel mapping:")
for i, label in enumerate(label_encoder.classes_):
print(f" {label} -> {i}")
# Output:
# Blue -> 0
# Green -> 1
# Red -> 2
Label Encoding Pitfall
Label encoding can mislead algorithms into thinking there is an order (Red > Green > Blue) when none exists. For nominal categories, one-hot encoding is usually safer. However, tree-based models (Random Forest, XGBoost) handle label encoding well because they split on individual values rather than assuming order.
Ordinal Encoding with Scikit-learn
When categories have a natural ranking, ordinal encoding preserves this information. You must explicitly define the category order.
from sklearn.preprocessing import OrdinalEncoder
import pandas as pd
# Customer data with ordinal categories
customers = pd.DataFrame({
'customer_id': [1, 2, 3, 4, 5],
'education': ['High School', 'Bachelor', 'Master', 'PhD', 'Bachelor'],
'satisfaction': ['Low', 'Medium', 'High', 'High', 'Medium']
})
# Define the order for each category
education_order = ['High School', 'Bachelor', 'Master', 'PhD']
satisfaction_order = ['Low', 'Medium', 'High']
# Create ordinal encoder with specified order
ordinal_encoder = OrdinalEncoder(
categories=[education_order, satisfaction_order]
)
# Fit and transform
customers[['education_encoded', 'satisfaction_encoded']] = ordinal_encoder.fit_transform(
customers[['education', 'satisfaction']]
)
print(customers)
# Output:
# customer_id education satisfaction education_encoded satisfaction_encoded
# 0 1 High School Low 0.0 0.0
# 1 2 Bachelor Medium 1.0 1.0
# 2 3 Master High 2.0 2.0
# 3 4 PhD High 3.0 2.0
# 4 5 Bachelor Medium 1.0 1.0
Manual Ordinal Mapping with Pandas
For simple cases or when you need more control, you can create ordinal mappings manually using a dictionary.
import pandas as pd
# Survey responses
survey = pd.DataFrame({
'response_id': [1, 2, 3, 4, 5],
'experience': ['Beginner', 'Expert', 'Intermediate', 'Beginner', 'Expert'],
'priority': ['Low', 'Critical', 'Medium', 'High', 'Medium']
})
# Define ordinal mappings
experience_map = {'Beginner': 0, 'Intermediate': 1, 'Expert': 2}
priority_map = {'Low': 0, 'Medium': 1, 'High': 2, 'Critical': 3}
# Apply mappings using .map()
survey['experience_encoded'] = survey['experience'].map(experience_map)
survey['priority_encoded'] = survey['priority'].map(priority_map)
print(survey)
# Output:
# response_id experience priority experience_encoded priority_encoded
# 0 1 Beginner Low 0 0
# 1 2 Expert Critical 2 3
# 2 3 Intermediate Medium 1 1
# 3 4 Beginner High 0 2
# 4 5 Expert Medium 2 1
The .map() method gives you full control over the encoding.
You can see that experience levels increase: Beginner(0) → Intermediate(1) → Expert(2), and priority increases: Low(0) → Medium(1) → High(2) → Critical(3).
This manual approach is great for small datasets or when you need custom orderings. Unlike OrdinalEncoder, it's also easier to read and doesn't require sklearn.
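One caveat worth knowing: .map() silently produces NaN for any category missing from the dictionary, so decide explicitly how to treat unknowns. A short sketch ('Novice' is a made-up unseen value):

```python
import pandas as pd

experience_map = {'Beginner': 0, 'Intermediate': 1, 'Expert': 2}
new_responses = pd.Series(['Expert', 'Novice', 'Beginner'])  # 'Novice' is not in the map

encoded = new_responses.map(experience_map)
print(encoded.tolist())  # [2.0, nan, 0.0] - 'Novice' silently became NaN

# One option: a sentinel value of -1, mirroring OrdinalEncoder's unknown_value
encoded = encoded.fillna(-1).astype(int)
print(encoded.tolist())  # [2, -1, 0]
```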
| Aspect | Label Encoding | Ordinal Encoding |
|---|---|---|
| Order Matters? | No - assigns arbitrary integers | Yes - preserves natural ranking |
| Use Case | Tree-based models, target variable | Ordinal features like ratings, sizes |
| Examples | Colors, countries, product IDs | Education level, size (S/M/L), rating |
| Risk | May imply false ordering | Must define correct order manually |
| Sklearn Class | LabelEncoder | OrdinalEncoder |
Inverse Transform
Both encoders support converting encoded values back to original categories, which is useful for interpreting model predictions.
# Inverse transform with LabelEncoder
original_colors = label_encoder.inverse_transform([0, 1, 2])
print("Inverse transform:", original_colors) # ['Blue' 'Green' 'Red']
# Inverse transform with OrdinalEncoder
original_values = ordinal_encoder.inverse_transform([[2.0, 1.0], [3.0, 2.0]])
print("Inverse transform:", original_values)
# [['Master' 'Medium']
# ['PhD' 'High']]
The inverse_transform() method is useful for interpreting model predictions or displaying results in a human-readable format.
For example, if your model predicts education level as 2.0, you can convert it back to "Master" for easier understanding.
Practice Questions
Task: Use LabelEncoder to encode the 'product_type' column and print the mapping between original values and encoded integers.
# Starter code
from sklearn.preprocessing import LabelEncoder
import pandas as pd
inventory = pd.DataFrame({
'item_id': [1, 2, 3, 4, 5, 6],
'product_type': ['Electronics', 'Clothing', 'Food', 'Electronics', 'Clothing', 'Furniture']
})
# Your code here: Label encode and show the mapping
View Solution
from sklearn.preprocessing import LabelEncoder
import pandas as pd
inventory = pd.DataFrame({
'item_id': [1, 2, 3, 4, 5, 6],
'product_type': ['Electronics', 'Clothing', 'Food', 'Electronics', 'Clothing', 'Furniture']
})
# Create and fit label encoder
encoder = LabelEncoder()
inventory['type_encoded'] = encoder.fit_transform(inventory['product_type'])
# Display result
print(inventory)
# Show the mapping
print("\nEncoding Mapping:")
for i, label in enumerate(encoder.classes_):
print(f" {label} -> {i}")
# Clothing -> 0
# Electronics -> 1
# Food -> 2
# Furniture -> 3
Task: The 'size' column contains T-shirt sizes with a natural order. Use OrdinalEncoder to encode them preserving the correct order: XS < S < M < L < XL < XXL.
# Starter code
from sklearn.preprocessing import OrdinalEncoder
import pandas as pd
tshirts = pd.DataFrame({
'sku': ['TS001', 'TS002', 'TS003', 'TS004', 'TS005'],
'size': ['M', 'XL', 'S', 'XXL', 'XS']
})
# Your code here: Ordinal encode with correct size order
View Solution
from sklearn.preprocessing import OrdinalEncoder
import pandas as pd
tshirts = pd.DataFrame({
'sku': ['TS001', 'TS002', 'TS003', 'TS004', 'TS005'],
'size': ['M', 'XL', 'S', 'XXL', 'XS']
})
# Define the correct order from smallest to largest
size_order = [['XS', 'S', 'M', 'L', 'XL', 'XXL']]
# Create ordinal encoder with the specified order
encoder = OrdinalEncoder(categories=size_order)
# Fit and transform
tshirts['size_encoded'] = encoder.fit_transform(tshirts[['size']])
print(tshirts)
# Output:
# sku size size_encoded
# 0 TS001 M 2.0 (XS=0, S=1, M=2)
# 1 TS002 XL 4.0
# 2 TS003 S 1.0
# 3 TS004 XXL 5.0
# 4 TS005 XS 0.0
# Verify the mapping
print("\nSize order mapping:")
for i, size in enumerate(size_order[0]):
print(f" {size} -> {i}")
Task: Fit an OrdinalEncoder on training data and apply it to test data. Handle an unknown category in test data by using handle_unknown='use_encoded_value' with unknown_value=-1.
# Starter code
from sklearn.preprocessing import OrdinalEncoder
import pandas as pd
import numpy as np
train = pd.DataFrame({
'rating': ['Poor', 'Good', 'Excellent', 'Good', 'Poor']
})
test = pd.DataFrame({
'rating': ['Good', 'Average', 'Excellent'] # 'Average' is unseen!
})
# Your code here: Handle unknown categories with -1
View Solution
from sklearn.preprocessing import OrdinalEncoder
import pandas as pd
import numpy as np
train = pd.DataFrame({
'rating': ['Poor', 'Good', 'Excellent', 'Good', 'Poor']
})
test = pd.DataFrame({
'rating': ['Good', 'Average', 'Excellent'] # 'Average' is unseen!
})
# Define ordinal categories
rating_order = [['Poor', 'Good', 'Excellent']]
# Create encoder that assigns -1 to unknown categories
encoder = OrdinalEncoder(
categories=rating_order,
handle_unknown='use_encoded_value',
unknown_value=-1
)
# Fit on training data
encoder.fit(train[['rating']])
# Transform both datasets
train['rating_encoded'] = encoder.transform(train[['rating']])
test['rating_encoded'] = encoder.transform(test[['rating']])
print("Training Data:")
print(train)
print("\nTest Data (notice 'Average' gets -1):")
print(test)
# Output:
# rating rating_encoded
# 0 Good 1.0
# 1 Average -1.0 <- Unknown category
# 2 Excellent 2.0
By using handle_unknown='use_encoded_value' with unknown_value=-1, we handle unseen categories gracefully: the encoder assigns -1 to 'Average' (not seen during training).
This is better than crashing, and the -1 signals to downstream processes that this category is unknown.
You can then handle it appropriately (e.g., assign global mean, create an "Other" category, or flag for review).
Target Encoding
Target encoding (also called mean encoding) replaces each category with a statistic derived from the target variable, typically the mean. This technique is particularly powerful for high-cardinality categorical features where one-hot encoding would create too many columns. However, it must be applied carefully to avoid data leakage and overfitting.
Target Encoding
A technique that replaces each category with the mean (or other aggregate) of the target variable for that category. For example, if customers from "California" have an average purchase of $150, "California" is encoded as 150.
Basic Target Encoding
The simplest form of target encoding calculates the mean of the target for each category. This creates a strong predictive signal but requires careful handling to prevent leakage.
import pandas as pd
import numpy as np
# Customer purchase data
customers = pd.DataFrame({
'customer_id': range(1, 11),
'city': ['NYC', 'LA', 'NYC', 'Chicago', 'LA', 'NYC', 'Chicago', 'LA', 'NYC', 'Chicago'],
'purchase_amount': [150, 200, 180, 90, 220, 160, 100, 190, 170, 110]
})
# Calculate target mean for each city
city_means = customers.groupby('city')['purchase_amount'].mean()
print("Mean purchase by city:")
print(city_means)
# Output:
# city
# Chicago 100.000000
# LA 203.333333
# NYC 165.000000
# Apply target encoding
customers['city_encoded'] = customers['city'].map(city_means)
print("\nTarget Encoded Data:")
print(customers[['customer_id', 'city', 'purchase_amount', 'city_encoded']])
# Output:
# customer_id city purchase_amount city_encoded
# 0 1 NYC 150 165.000000
# 1 2 LA 200 203.333333
# 2 3 NYC 180 165.000000
# ...
Data Leakage Warning
The basic approach above causes data leakage because each row's target value influences its own encoding. This leads to overfitting, especially for rare categories. Always use one of these solutions:
- Leave-one-out encoding: Exclude the current row when calculating the mean
- K-fold target encoding: Use cross-validation folds to calculate means
- Smoothing: Blend category mean with global mean based on category size
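The first bullet (leave-one-out) can be implemented with a simple vectorized identity - a sketch on made-up data:

```python
import pandas as pd

# Leave-one-out: each row's encoding is the category mean computed WITHOUT
# that row's own target, via (category_sum - y_i) / (category_count - 1).
df = pd.DataFrame({
    'city': ['NYC', 'NYC', 'NYC', 'LA', 'LA'],
    'amount': [100, 200, 300, 50, 150]
})

grp = df.groupby('city')['amount']
cat_sum = df['city'].map(grp.sum())
cat_cnt = df['city'].map(grp.count())

# Note: a category with a single row gives a zero denominator -
# real implementations fall back to the global mean in that case.
df['city_loo'] = (cat_sum - df['amount']) / (cat_cnt - 1)
print(df['city_loo'].tolist())  # [250.0, 200.0, 150.0, 150.0, 50.0]
```

Notice that no row's own amount leaks into its encoding: the first NYC row (amount=100) is encoded as (200+300)/2 = 250.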
Target Encoding with Smoothing
Smoothing (also called regularization) blends the category mean with the global mean. Categories with few samples rely more on the global mean, preventing overfitting on rare categories.
import pandas as pd
import numpy as np
def target_encode_smooth(df, column, target, m=10):
"""
Target encoding with smoothing.
m: smoothing parameter (higher = more smoothing toward global mean)
"""
# Calculate global mean
global_mean = df[target].mean()
# Calculate category statistics
agg = df.groupby(column)[target].agg(['mean', 'count'])
# Apply smoothing formula: (count * category_mean + m * global_mean) / (count + m)
smoothed = (agg['count'] * agg['mean'] + m * global_mean) / (agg['count'] + m)
return df[column].map(smoothed)
# Example data with rare category
sales = pd.DataFrame({
'region': ['East', 'East', 'East', 'East', 'East',
'West', 'West', 'West',
'North', 'North',
'South'], # South has only 1 sample!
'revenue': [100, 120, 110, 130, 115,
200, 180, 190,
80, 85,
500] # South's single high value could cause overfitting
})
# Without smoothing
sales['region_raw'] = sales['region'].map(
sales.groupby('region')['revenue'].mean()
)
# With smoothing (m=3)
sales['region_smooth'] = target_encode_smooth(sales, 'region', 'revenue', m=3)
print(sales[['region', 'revenue', 'region_raw', 'region_smooth']])
# Notice South goes from 500 (raw) to ~248 (smoothed):
# (1*500 + 3*164.5) / (1+3) ≈ 248, pulled strongly toward the global mean of ~164.5
K-Fold Target Encoding
The most robust approach uses cross-validation: for each fold, calculate the encoding using only the out-of-fold data. This completely prevents target leakage within the training set.
from sklearn.model_selection import KFold
import pandas as pd
import numpy as np
def kfold_target_encode(df, column, target, n_splits=5):
"""K-fold target encoding to prevent leakage."""
df = df.copy()
df['encoded'] = np.nan
kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
for train_idx, val_idx in kf.split(df):
# Calculate means using ONLY training fold data
train_means = df.iloc[train_idx].groupby(column)[target].mean()
# Apply to validation fold
df.loc[df.index[val_idx], 'encoded'] = df.iloc[val_idx][column].map(train_means)
# Fill any NaN (unseen categories) with global mean
global_mean = df[target].mean()
df['encoded'] = df['encoded'].fillna(global_mean)
return df['encoded']
# Apply k-fold encoding
sales['region_kfold'] = kfold_target_encode(sales, 'region', 'revenue', n_splits=3)
print(sales[['region', 'revenue', 'region_kfold']])
Using Category Encoders Library
The category_encoders library provides production-ready target encoding with built-in smoothing and cross-validation support.
# Install: pip install category-encoders
import category_encoders as ce
import pandas as pd
from sklearn.model_selection import train_test_split
# Prepare data
X = sales[['region']]
y = sales['revenue']
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Create target encoder with smoothing
encoder = ce.TargetEncoder(cols=['region'], smoothing=1.0)
# Fit on training data ONLY (pass y_train to prevent leakage)
X_train_encoded = encoder.fit_transform(X_train, y_train)
X_test_encoded = encoder.transform(X_test)
print("Training encoded:")
print(X_train_encoded)
print("\nTest encoded:")
print(X_test_encoded)
The category_encoders library provides production-ready target encoding with built-in leakage prevention.
By fitting on training data with y_train, it learns the category-to-target relationships.
When transforming test data, it applies the learned encoding WITHOUT looking at test target values (preventing leakage).
The smoothing parameter (1.0 here) adds regularization to prevent overfitting on rare categories.
| Aspect | Target Encoding | One-Hot Encoding |
|---|---|---|
| Dimensionality | 1 column (regardless of cardinality) | n columns (one per category) |
| High Cardinality | Excellent - no dimension explosion | Poor - creates sparse, high-dim data |
| Leakage Risk | High - must use regularization | None |
| Information | Encodes relationship with target | No target information |
| Best For | Tree-based models, high cardinality | Linear models, low cardinality |
Practice Questions
Task: Calculate the mean salary for each department and create a target-encoded column.
# Starter code
import pandas as pd
employees = pd.DataFrame({
'emp_id': [1, 2, 3, 4, 5, 6, 7, 8],
'department': ['Sales', 'Engineering', 'Sales', 'HR', 'Engineering', 'Sales', 'HR', 'Engineering'],
'salary': [50000, 80000, 55000, 45000, 85000, 52000, 48000, 90000]
})
# Your code here: Target encode 'department' using mean salary
View Solution
import pandas as pd
employees = pd.DataFrame({
'emp_id': [1, 2, 3, 4, 5, 6, 7, 8],
'department': ['Sales', 'Engineering', 'Sales', 'HR', 'Engineering', 'Sales', 'HR', 'Engineering'],
'salary': [50000, 80000, 55000, 45000, 85000, 52000, 48000, 90000]
})
# Calculate mean salary per department
dept_means = employees.groupby('department')['salary'].mean()
print("Mean salary by department:")
print(dept_means)
# Target encode
employees['dept_encoded'] = employees['department'].map(dept_means)
print("\nEncoded DataFrame:")
print(employees)
# Engineering -> 85000, HR -> 46500, Sales -> 52333.33
Task: Implement smoothed target encoding using the formula: (n * category_mean + m * global_mean) / (n + m), where n is category count and m is the smoothing parameter.
# Starter code
import pandas as pd
products = pd.DataFrame({
'product': range(1, 13),
'brand': ['Apple', 'Apple', 'Apple', 'Apple', 'Apple',
'Samsung', 'Samsung', 'Samsung',
'LG', 'LG',
'Sony', 'Nokia'], # Sony and Nokia are rare
'rating': [4.5, 4.2, 4.8, 4.3, 4.6,
4.0, 3.8, 4.1,
3.5, 3.6,
5.0, 2.0] # Extreme values for rare brands
})
# Your code here: Implement smoothed target encoding with m=3
View Solution
import pandas as pd
products = pd.DataFrame({
'product': range(1, 13),
'brand': ['Apple', 'Apple', 'Apple', 'Apple', 'Apple',
'Samsung', 'Samsung', 'Samsung',
'LG', 'LG',
'Sony', 'Nokia'], # Sony and Nokia are rare
'rating': [4.5, 4.2, 4.8, 4.3, 4.6,
4.0, 3.8, 4.1,
3.5, 3.6,
5.0, 2.0] # Extreme values for rare brands
})
# Smoothing parameter
m = 3
# Calculate global mean
global_mean = products['rating'].mean()
print(f"Global mean rating: {global_mean:.2f}")
# Calculate brand statistics
brand_stats = products.groupby('brand')['rating'].agg(['mean', 'count'])
print("\nBrand statistics:")
print(brand_stats)
# Apply smoothing formula
brand_stats['smoothed'] = (
(brand_stats['count'] * brand_stats['mean'] + m * global_mean) /
(brand_stats['count'] + m)
)
print("\nSmoothed values:")
print(brand_stats[['mean', 'smoothed']])
# Apply to dataframe
products['brand_encoded'] = products['brand'].map(brand_stats['smoothed'])
print("\nFinal DataFrame:")
print(products[['product', 'brand', 'rating', 'brand_encoded']])
# Notice Sony (5.0) and Nokia (2.0) are smoothed toward global mean
Task: Implement proper target encoding using K-fold cross-validation to prevent data leakage in training data, then apply the learned encoding to test data.
# Starter code
from sklearn.model_selection import KFold, train_test_split
import pandas as pd
import numpy as np
# Create sample data
data = pd.DataFrame({
'store': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B', 'A', 'C'] * 3,
'sales': np.random.randint(100, 500, 30)
})
# Split into train and test
train_df, test_df = train_test_split(data, test_size=0.3, random_state=42)
# Your code here:
# 1. K-fold encode training data (no leakage)
# 2. Calculate final encoding from all training data
# 3. Apply to test data
View Solution
from sklearn.model_selection import KFold, train_test_split
import pandas as pd
import numpy as np
# Create sample data
np.random.seed(42)
data = pd.DataFrame({
'store': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B', 'A', 'C'] * 3,
'sales': np.random.randint(100, 500, 30)
})
# Split into train and test
train_df, test_df = train_test_split(data, test_size=0.3, random_state=42)
train_df = train_df.reset_index(drop=True)
test_df = test_df.reset_index(drop=True)
# Step 1: K-fold encode training data
train_df['store_encoded'] = np.nan
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in kf.split(train_df):
# Calculate means from training fold only
fold_means = train_df.iloc[train_idx].groupby('store')['sales'].mean()
# Apply to validation fold
train_df.loc[train_df.index[val_idx], 'store_encoded'] = \
train_df.iloc[val_idx]['store'].map(fold_means)
# Fill NaN with global mean
global_mean = train_df['sales'].mean()
train_df['store_encoded'] = train_df['store_encoded'].fillna(global_mean)
# Step 2: Calculate final encoding from ALL training data
final_encoding = train_df.groupby('store')['sales'].mean()
# Step 3: Apply to test data
test_df['store_encoded'] = test_df['store'].map(final_encoding)
test_df['store_encoded'] = test_df['store_encoded'].fillna(global_mean)
print("Training data with K-fold encoding:")
print(train_df.head(10))
print("\nTest data with training-based encoding:")
print(test_df)
print("\nFinal encoding mapping:")
print(final_encoding)
Frequency & Binary Encoding
Frequency encoding and binary encoding are powerful alternatives for handling categorical variables, especially with high cardinality. Frequency encoding uses occurrence counts, while binary encoding converts label-encoded integers to binary digits. Both methods balance dimensionality reduction with information preservation.
Frequency Encoding
Replaces each category with its frequency (count or proportion) in the dataset. Categories that appear often get higher values. Simple and effective when frequency correlates with the target.
Binary Encoding
First applies label encoding, then converts each integer to its binary representation across multiple columns. Creates log2(n) columns instead of n, dramatically reducing dimensionality.
Frequency Encoding
Frequency encoding is intuitive: common categories get higher values, rare categories get lower values. This often works well because popular categories may have different behavior patterns than rare ones.
import pandas as pd
# E-commerce order data
orders = pd.DataFrame({
'order_id': range(1, 16),
'product_category': ['Electronics', 'Electronics', 'Electronics', 'Electronics', 'Electronics',
'Clothing', 'Clothing', 'Clothing', 'Clothing',
'Books', 'Books', 'Books',
'Toys', 'Toys',
'Jewelry']
})
# Count frequency of each category
freq_map = orders['product_category'].value_counts()
print("Category frequencies:")
print(freq_map)
# Output:
# Electronics 5
# Clothing 4
# Books 3
# Toys 2
# Jewelry 1
# Frequency encoding (count)
orders['category_freq_count'] = orders['product_category'].map(freq_map)
# Frequency encoding (proportion)
orders['category_freq_prop'] = orders['product_category'].map(freq_map / len(orders))
print("\nEncoded DataFrame:")
print(orders)
# Output:
# order_id product_category category_freq_count category_freq_prop
# 0 1 Electronics 5 0.333333
# 1 2 Electronics 5 0.333333
# ...
# 14 15 Jewelry 1 0.066667
When Frequency Encoding Shines
Frequency encoding works best when the popularity of a category is predictive. For example, in fraud detection, rare transaction types might be more suspicious. In recommendation systems, popular items may have different click-through rates.
Handling Train/Test Frequency Encoding
When applying frequency encoding to test data, use the frequencies calculated from training data only. This prevents data leakage and handles unseen categories gracefully.
import pandas as pd
from sklearn.model_selection import train_test_split
# Sample data
data = pd.DataFrame({
'user_id': range(1, 21),
'browser': ['Chrome', 'Chrome', 'Chrome', 'Chrome', 'Chrome', 'Chrome',
'Firefox', 'Firefox', 'Firefox', 'Firefox',
'Safari', 'Safari', 'Safari',
'Edge', 'Edge',
'Chrome', 'Firefox', 'Safari', 'Opera', 'IE'] # Opera, IE appear once
})
# Split data
train_df, test_df = train_test_split(data, test_size=0.3, random_state=42)
# Calculate frequencies from TRAINING data only
train_freq = train_df['browser'].value_counts()
print("Training frequencies:")
print(train_freq)
# Apply to both datasets
train_df['browser_freq'] = train_df['browser'].map(train_freq)
test_df['browser_freq'] = test_df['browser'].map(train_freq)
# Handle unseen categories with 0 or median frequency
test_df['browser_freq'] = test_df['browser_freq'].fillna(0)
print("\nTest data with frequency encoding:")
print(test_df)
# Unseen categories get 0
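The median-frequency fallback mentioned above is a less pessimistic alternative: an unseen category is treated as "typically rare" rather than "never seen". A self-contained sketch with made-up training frequencies:

```python
import pandas as pd

# Hypothetical training frequencies (as value_counts would produce them)
train_freq = pd.Series({'Chrome': 6, 'Firefox': 4, 'Safari': 3, 'Edge': 2})

test_browsers = pd.Series(['Chrome', 'Opera', 'IE'])  # Opera, IE unseen

# Option 1: unseen categories get 0
freq_zero = test_browsers.map(train_freq).fillna(0)

# Option 2: unseen categories get the median training frequency
freq_median = test_browsers.map(train_freq).fillna(train_freq.median())

print(freq_zero.tolist())    # [6.0, 0.0, 0.0]
print(freq_median.tolist())  # [6.0, 3.5, 3.5]
```

Which fallback is better depends on the task: 0 makes unseen categories look maximally unusual (often useful in fraud detection), while the median keeps them in a plausible range for general prediction.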
Binary Encoding
Binary encoding is a clever compromise between label encoding and one-hot encoding. It first assigns integers to categories, then converts those integers to binary representation. For 8 categories (requiring 3 bits), you get 3 columns instead of 8.
import pandas as pd
import numpy as np
# Sample data with 8 categories
colors = pd.DataFrame({
'item_id': range(1, 9),
'color': ['Red', 'Blue', 'Green', 'Yellow', 'Orange', 'Purple', 'Pink', 'Brown']
})
# Step 1: Label encode (0-7)
color_map = {color: i for i, color in enumerate(colors['color'].unique())}
colors['label_encoded'] = colors['color'].map(color_map)
# Step 2: Convert to binary (3 bits needed for 8 values)
def int_to_binary_columns(df, column, n_bits):
"""Convert integer column to binary representation."""
binary_cols = []
for i in range(n_bits):
col_name = f'{column}_bit_{i}'
df[col_name] = (df[column] >> i) & 1
binary_cols.append(col_name)
return binary_cols
n_bits = int(np.ceil(np.log2(len(color_map)))) # 3 bits for 8 categories
binary_cols = int_to_binary_columns(colors, 'label_encoded', n_bits)
print(f"Number of bits needed: {n_bits}")
print("\nBinary Encoded DataFrame:")
print(colors)
# Output (the helper names the bit columns label_encoded_bit_0..2):
# item_id color label_encoded label_encoded_bit_0 label_encoded_bit_1 label_encoded_bit_2
# 0 1 Red 0 0 0 0
# 1 2 Blue 1 1 0 0
# 2 3 Green 2 0 1 0
# 3 4 Yellow 3 1 1 0
# 4 5 Orange 4 0 0 1
# 5 6 Purple 5 1 0 1
# 6 7 Pink 6 0 1 1
# 7 8 Brown 7 1 1 1
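Binary encoding is lossless within the fitted vocabulary: the bit columns can be recombined to recover the original integer label. A small standalone sketch (not tied to the DataFrame above) using the same LSB-first bit ordering:

```python
import numpy as np

# Recover integer labels from LSB-first bit columns: label = sum(bit_i * 2**i)
bits = np.array([
    [0, 0, 0],   # label 0
    [1, 0, 0],   # label 1
    [0, 1, 0],   # label 2
    [1, 1, 1],   # label 7
])
powers = 2 ** np.arange(bits.shape[1])   # [1, 2, 4]
labels = bits @ powers                   # dot product per row
print(labels)  # [0 1 2 7]
```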
Using Category Encoders Library
The category_encoders library provides optimized implementations of both frequency and binary encoding.
# Install: pip install category-encoders
import category_encoders as ce
import pandas as pd
# Sample data
df = pd.DataFrame({
'city': ['NYC', 'LA', 'Chicago', 'Houston', 'Phoenix',
'Philadelphia', 'San Antonio', 'San Diego'],
'population': [8.3, 4.0, 2.7, 2.3, 1.7, 1.6, 1.5, 1.4]
})
# Binary Encoding
binary_encoder = ce.BinaryEncoder(cols=['city'])
df_binary = binary_encoder.fit_transform(df)
print("Binary Encoded:")
print(df_binary)
# Note: ce.BinaryEncoder's ordinal codes start at 1 (0 is reserved for
# unknown/missing), so 8 cities may produce 4 binary columns rather than 3
# Count/Frequency Encoding
count_encoder = ce.CountEncoder(cols=['city'])
df_count = count_encoder.fit_transform(df)
print("\nCount Encoded:")
print(df_count)
The category_encoders library makes encoding much easier!
BinaryEncoder creates roughly log₂(n) binary columns automatically (one extra column can appear because its ordinal codes start at 1, with 0 reserved for unseen categories). For 8 cities that is 3-4 binary columns instead of 8 one-hot columns.
CountEncoder counts category occurrences (frequency encoding by count). Both encoders can be saved and reused in production,
handle unseen categories, and integrate seamlessly with sklearn pipelines!
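To sketch that pipeline integration, here is a runnable example using sklearn's built-in OneHotEncoder (so it works without extra installs); a ce.BinaryEncoder or ce.CountEncoder would slot into the same 'encode' step. The data and column names are illustrative:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data: predict churn from city (hypothetical)
X = pd.DataFrame({'city': ['NYC', 'LA', 'NYC', 'Chicago', 'LA', 'NYC']})
y = [0, 1, 0, 1, 1, 0]

pipe = Pipeline([
    # Encoding lives inside the pipeline, so it is fit on training data only
    ('encode', ColumnTransformer(
        [('city', OneHotEncoder(handle_unknown='ignore'), ['city'])])),
    ('model', LogisticRegression()),
])
pipe.fit(X, y)

# An unseen city ('Boston') is handled gracefully: it becomes an all-zero row
preds = pipe.predict(pd.DataFrame({'city': ['NYC', 'Boston']}))
print(preds)
```

Keeping the encoder inside the pipeline means cross-validation refits it per fold automatically, which is exactly the leakage-prevention discipline discussed for target encoding.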
| Encoding | Columns Created | 100 Categories | 1000 Categories | Best Use Case |
|---|---|---|---|---|
| One-Hot | n | 100 columns | 1000 columns | Low cardinality, linear models |
| Label | 1 | 1 column | 1 column | Tree models, target variable |
| Binary | log2(n) | 7 columns | 10 columns | High cardinality, any model |
| Frequency | 1 | 1 column | 1 column | When frequency is predictive |
| Target | 1 | 1 column | 1 column | High cardinality, supervised tasks |
Choosing the Right Encoding Method
The choice of encoding depends on several factors. Here is a decision guide to help you choose the right method for your situation.
- Is there a natural order? → Use Ordinal Encoding
- Low cardinality (< 15) + Linear Model? → Use One-Hot Encoding
- Any cardinality + Tree Model? → Use Label or Ordinal Encoding
- High cardinality + Supervised Task? → Use Target Encoding (with regularization)
- High cardinality + Frequency matters? → Use Frequency Encoding
- High cardinality + Need compromise? → Use Binary Encoding
- Memory constrained? → Avoid One-Hot, prefer Label/Binary/Frequency
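The decision guide above can be sketched as a small helper function. The thresholds (15 for "low cardinality") and return strings are illustrative rules of thumb, not canonical values:

```python
def choose_encoding(n_unique, ordered=False, model='linear',
                    supervised=True, frequency_predictive=False):
    """Suggest an encoding following the decision guide above.

    Thresholds are rules of thumb, not hard limits.
    """
    if ordered:
        return 'ordinal'                     # natural order exists
    if model == 'tree':
        return 'label'                       # trees handle integer codes
    if n_unique < 15 and model == 'linear':
        return 'one-hot'                     # low cardinality, linear model
    if frequency_predictive:
        return 'frequency'                   # popularity carries signal
    if supervised:
        return 'target (with smoothing/CV)'  # high cardinality, supervised
    return 'binary'                          # high-cardinality compromise

print(choose_encoding(5, model='linear'))    # one-hot
print(choose_encoding(500, model='tree'))    # label
print(choose_encoding(500))                  # target (with smoothing/CV)
```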
Practice Questions
Task: Frequency encode the 'payment_method' column using both count and proportion.
# Starter code
import pandas as pd
transactions = pd.DataFrame({
'transaction_id': range(1, 13),
'payment_method': ['Credit Card', 'Credit Card', 'Credit Card', 'Credit Card',
'PayPal', 'PayPal', 'PayPal',
'Debit Card', 'Debit Card',
'Bank Transfer', 'Bank Transfer',
'Crypto']
})
# Your code here: Frequency encode with count and proportion
View Solution
import pandas as pd
transactions = pd.DataFrame({
'transaction_id': range(1, 13),
'payment_method': ['Credit Card', 'Credit Card', 'Credit Card', 'Credit Card',
'PayPal', 'PayPal', 'PayPal',
'Debit Card', 'Debit Card',
'Bank Transfer', 'Bank Transfer',
'Crypto']
})
# Calculate frequency map
freq_map = transactions['payment_method'].value_counts()
print("Frequency counts:")
print(freq_map)
# Frequency encoding - count
transactions['payment_freq_count'] = transactions['payment_method'].map(freq_map)
# Frequency encoding - proportion
total = len(transactions)
transactions['payment_freq_prop'] = transactions['payment_method'].map(freq_map / total)
print("\nEncoded DataFrame:")
print(transactions)
# Credit Card: 4 (0.333), PayPal: 3 (0.25), etc.
Task: Implement binary encoding without using category_encoders. First label encode, then convert to binary columns.
# Starter code
import pandas as pd
import numpy as np
countries = pd.DataFrame({
'user_id': range(1, 11),
'country': ['USA', 'Canada', 'UK', 'Germany', 'France',
'Japan', 'Australia', 'Brazil', 'India', 'Mexico']
})
# Your code here:
# 1. Label encode the countries (0-9)
# 2. Calculate number of bits needed
# 3. Create binary columns
View Solution
import pandas as pd
import numpy as np
countries = pd.DataFrame({
'user_id': range(1, 11),
'country': ['USA', 'Canada', 'UK', 'Germany', 'France',
'Japan', 'Australia', 'Brazil', 'India', 'Mexico']
})
# Step 1: Label encode
unique_countries = countries['country'].unique()
country_map = {country: i for i, country in enumerate(unique_countries)}
countries['country_label'] = countries['country'].map(country_map)
print("Label encoding map:")
for country, label in country_map.items():
print(f" {country}: {label}")
# Step 2: Calculate bits needed (10 countries need 4 bits: 2^4 = 16)
n_categories = len(unique_countries)
n_bits = int(np.ceil(np.log2(n_categories)))
print(f"\nCategories: {n_categories}, Bits needed: {n_bits}")
# Step 3: Create binary columns
for bit in range(n_bits):
countries[f'country_bit_{bit}'] = (countries['country_label'] >> bit) & 1
print("\nBinary Encoded DataFrame:")
print(countries)
# Verify: One-hot would need 10 columns, binary only needs 4!
Task: Apply one-hot, label, binary, and frequency encoding to the same categorical column. Compare the resulting number of features and summarize when to use each.
# Starter code
import pandas as pd
import numpy as np
# 20 different product brands
np.random.seed(42)
brands = [f'Brand_{chr(65+i)}' for i in range(20)] # Brand_A to Brand_T
products = pd.DataFrame({
'product_id': range(1, 101),
'brand': np.random.choice(brands, 100)
})
# Your code here:
# 1. One-hot encode
# 2. Label encode
# 3. Binary encode
# 4. Frequency encode
# 5. Print comparison of dimensions
View Solution
import pandas as pd
import numpy as np
# 20 different product brands
np.random.seed(42)
brands = [f'Brand_{chr(65+i)}' for i in range(20)] # Brand_A to Brand_T
products = pd.DataFrame({
'product_id': range(1, 101),
'brand': np.random.choice(brands, 100)
})
print(f"Original data: {products.shape[0]} rows, {len(products['brand'].unique())} unique brands\n")
# 1. One-Hot Encoding
df_onehot = pd.get_dummies(products, columns=['brand'], prefix='brand')
onehot_cols = len([c for c in df_onehot.columns if c.startswith('brand_')])
print(f"One-Hot Encoding: {onehot_cols} new columns")
# 2. Label Encoding
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
products['brand_label'] = le.fit_transform(products['brand'])
print(f"Label Encoding: 1 new column")
# 3. Binary Encoding
n_bits = int(np.ceil(np.log2(len(brands)))) # 5 bits for 20 brands
for bit in range(n_bits):
products[f'brand_bit_{bit}'] = (products['brand_label'] >> bit) & 1
print(f"Binary Encoding: {n_bits} new columns")
# 4. Frequency Encoding
freq_map = products['brand'].value_counts() / len(products)
products['brand_freq'] = products['brand'].map(freq_map)
print(f"Frequency Encoding: 1 new column")
# Summary comparison
print("\n" + "="*50)
print("ENCODING COMPARISON SUMMARY")
print("="*50)
print(f"{'Encoding':<20} {'Columns':<10} {'Best For'}")
print("-"*50)
print(f"{'One-Hot':<20} {onehot_cols:<10} Linear models, low cardinality")
print(f"{'Label':<20} {1:<10} Tree models, target encoding base")
print(f"{'Binary':<20} {n_bits:<10} High cardinality, dimension reduction")
print(f"{'Frequency':<20} {1:<10} When frequency correlates with target")
# Show sample of final dataframe
print("\nSample encoded data:")
print(products.head())
Interactive Demo
Explore how different encoding methods transform categorical data in real-time. Use these interactive tools to visualize the differences between one-hot, label, binary, and frequency encoding.
Encoding Comparison Tool
Enter comma-separated categories to see how each encoding method transforms them.
Cardinality Impact Visualizer
See how the number of unique categories affects the dimensionality of different encoding methods.
Key Takeaways
One-Hot Encoding
Creates binary columns for each category. Best for nominal variables with low cardinality. Beware of the dummy variable trap.
Label & Ordinal
Label encoding assigns integers arbitrarily. Ordinal encoding respects natural order. Use ordinal for ranked categories like education levels.
Target Encoding
Replaces categories with target mean. Powerful for high cardinality but prone to overfitting. Use smoothing and cross-validation.
Frequency Encoding
Encodes categories by their occurrence count. Simple and effective when frequency correlates with target. No explosion of dimensions.
Binary Encoding
Combines label encoding with binary representation. Reduces dimensions compared to one-hot. Great for high-cardinality features.
Choosing Wisely
Consider cardinality, ordinality, and model type. Tree-based models handle label encoding well. Linear models often need one-hot.
Knowledge Check
Quick Quiz
Test what you've learned about categorical encoding techniques