Module 6.1

Feature Creation & Transformation

Think of features as the ingredients in a recipe — the better your ingredients, the better your dish! In machine learning, features are the input variables your model uses to make predictions. Feature creation is the art of transforming raw data into more meaningful inputs that help your model understand patterns better. You'll learn how to combine existing features, extract hidden information from dates and text, and create entirely new variables that capture important relationships in your data.

45 min
Intermediate
Hands-on
What You'll Learn
  • Domain-based feature creation
  • Polynomial & interaction features
  • Time-based feature extraction
  • Binning & discretization
  • Mathematical transformations
Contents
01

Domain Knowledge-Based Feature Creation

Imagine you're a doctor trying to predict heart disease. Just knowing a patient's weight and height separately isn't as useful as knowing their BMI (weight divided by height squared). That's domain knowledge in action! By understanding what your data actually represents, you can create new features that capture meaningful relationships. A bank knows that debt-to-income ratio matters more than raw debt, and an e-commerce company knows that "price per item" is more insightful than total price alone. This section teaches you to think like a domain expert and create features that make sense for your specific problem.
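The BMI idea above is a one-liner once the data is in a DataFrame. Here's a minimal sketch with made-up measurements:

```python
import pandas as pd

# Hypothetical patient measurements (illustrative values only)
patients = pd.DataFrame({'weight_kg': [80, 55], 'height_cm': [175, 160]})

# BMI = weight (kg) / height (m)^2 -- the domain-knowledge feature
patients['bmi'] = patients['weight_kg'] / (patients['height_cm'] / 100) ** 2
print(patients['bmi'].round(1).tolist())  # [26.1, 21.5]
```

One derived column captures a relationship (weight relative to height) that the model would otherwise have to discover on its own.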

Key Concept

What is Feature Engineering?

In simple terms: Feature engineering is like being a translator between raw data and your machine learning model. Raw data often contains hidden information that your model can't see directly. For example, a timestamp like "2024-03-15 14:30" doesn't mean much to a model — but "Friday afternoon" might be very predictive for traffic patterns or shopping behavior!
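As a quick sanity check, pandas can recover exactly that hidden "Friday afternoon" information from the raw timestamp:

```python
import pandas as pd

# The raw timestamp from the example above
ts = pd.Timestamp('2024-03-15 14:30')
print(ts.day_name())                                # Friday
print('afternoon' if ts.hour >= 12 else 'morning')  # afternoon
```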

Feature engineering transforms your data into a format that better tells the story hidden within it. You might:

  • Combine features: Create "price per square foot" from price and area
  • Extract information: Pull "day of week" from a date
  • Create ratios: Calculate "debt-to-income ratio" from debt and income
  • Aggregate data: Compute "average purchase amount" from transaction history

Why it matters: Here's a secret that experienced data scientists know — good feature engineering often matters more than choosing the fanciest algorithm! A simple model with great features will usually beat a complex model with poor features. It's often the difference between a model that barely works and one that truly understands your data.

Creating Business-Logic Features

Domain features capture relationships that algorithms can't automatically discover. Here are examples from different domains:

# Step 1: Import libraries and create sample data
import pandas as pd
import numpy as np

# E-commerce example
ecommerce_data = pd.DataFrame({
    'price': [100, 200, 50, 300],
    'quantity': [2, 1, 5, 2],
    'shipping': [10, 15, 5, 20],
    'discount_pct': [0.1, 0.05, 0.2, 0.0]
})

Setting up sample data: We create a simple e-commerce DataFrame with 4 orders. Each order has: price (unit price), quantity (how many items), shipping (shipping cost), and discount_pct (discount as a decimal, so 0.1 = 10% off). These are our "raw" features — they're useful, but we can create even more meaningful features by combining them intelligently!

# Step 2: Create domain-specific features
# Revenue and cost calculations
ecommerce_data['total_price'] = ecommerce_data['price'] * ecommerce_data['quantity']
ecommerce_data['discount_amount'] = ecommerce_data['total_price'] * ecommerce_data['discount_pct']
ecommerce_data['final_revenue'] = ecommerce_data['total_price'] - ecommerce_data['discount_amount']

# Ratio features (often very predictive)
ecommerce_data['shipping_ratio'] = ecommerce_data['shipping'] / ecommerce_data['total_price']
ecommerce_data['effective_item_price'] = ecommerce_data['final_revenue'] / ecommerce_data['quantity']

print(ecommerce_data)

Creating derived features: Watch what we're doing here!

  • total_price: price × quantity = what the customer would pay before discounts
  • discount_amount: How much money the discount saves (total_price × discount_pct)
  • final_revenue: What we actually receive after the discount
  • shipping_ratio: Shipping cost as a percentage of total price. This is a ratio feature — it tells us if shipping is expensive relative to the order. A $10 shipping on a $20 order (50%) feels different from $10 on a $200 order (5%)!
  • effective_item_price: What the customer actually pays per item after the discount (final_revenue / quantity). Note that price / quantity would be meaningless here — price is already the unit price, a common slip when deriving ratios. This feature helps distinguish someone buying cheap items in bulk from someone buying expensive items individually.

Financial Domain Features

Financial data often benefits from derived metrics and ratios:

# Financial metrics example
finance_data = pd.DataFrame({
    'income': [50000, 75000, 30000, 120000],
    'expenses': [40000, 50000, 28000, 80000],
    'debt': [10000, 50000, 5000, 200000],
    'assets': [100000, 200000, 30000, 500000],
    'age': [25, 35, 22, 45]
})

# Create financial health indicators
finance_data['savings'] = finance_data['income'] - finance_data['expenses']
finance_data['savings_rate'] = finance_data['savings'] / finance_data['income']
finance_data['debt_to_income'] = finance_data['debt'] / finance_data['income']
finance_data['net_worth'] = finance_data['assets'] - finance_data['debt']
finance_data['debt_to_asset_ratio'] = finance_data['debt'] / finance_data['assets']

# Age-based features
finance_data['income_per_age'] = finance_data['income'] / finance_data['age']
finance_data['assets_per_age'] = finance_data['assets'] / finance_data['age']

print(finance_data[['savings_rate', 'debt_to_income', 'net_worth']].round(2))

Financial health indicators explained:

  • savings: Income minus expenses — how much money is left over each year
  • savings_rate: What percentage of income is saved (savings/income). A person earning $100k and saving $20k has a 20% savings rate — same as someone earning $50k and saving $10k. This makes comparison fair!
  • debt_to_income: The famous DTI ratio banks use! If your debt is $50k and income is $100k, your DTI is 0.5 (50%). Banks typically want this below 0.36 (36%) for mortgages.
  • net_worth: Assets minus debt — what you'd have if you sold everything and paid off all debts
  • income_per_age: Income normalized by age. Earning $100k at 25 is more impressive than at 45 — this captures that!

Aggregation Features

When you have grouped or transactional data, aggregations create powerful summary features:

# Customer transaction data
transactions = pd.DataFrame({
    'customer_id': [1, 1, 1, 2, 2, 3, 3, 3, 3],
    'amount': [100, 200, 150, 500, 300, 50, 75, 100, 25],
    'category': ['food', 'electronics', 'food', 'electronics', 'travel', 
                 'food', 'food', 'entertainment', 'food']
})

# Create customer-level aggregation features
customer_features = transactions.groupby('customer_id').agg({
    'amount': ['sum', 'mean', 'std', 'count', 'min', 'max']
}).reset_index()

# Flatten column names
customer_features.columns = ['customer_id', 'total_spent', 'avg_transaction', 
                             'std_transaction', 'num_transactions', 
                             'min_transaction', 'max_transaction']

# Additional derived features
customer_features['transaction_range'] = (customer_features['max_transaction'] - 
                                          customer_features['min_transaction'])
customer_features['cv_transaction'] = (customer_features['std_transaction'] / 
                                       customer_features['avg_transaction'])  # Coefficient of variation

print(customer_features)

Aggregation magic explained: We have 9 transactions from 3 customers, but our model needs one row per customer. groupby('customer_id').agg() groups transactions by customer and calculates summary statistics:

  • sum: Total amount spent by this customer
  • mean: Average transaction size — do they make small frequent purchases or large occasional ones?
  • std: Standard deviation — how variable are their purchases?
  • count: How many transactions they made (frequency)
  • min/max: Smallest and largest purchase
  • cv_transaction (coefficient of variation): std/mean — this is brilliant! If std=50 and mean=100, CV=0.5. Low CV = consistent spender, High CV = erratic spending. A customer who always spends ~$100 (CV≈0) behaves differently than one who spends $20 one day and $500 the next (high CV)!

Best Practice: Always consult with domain experts when possible. They understand which relationships matter, what ratios are meaningful, and which aggregations capture important patterns.

Practice Questions

Task: Given house data with bedrooms, bathrooms, sqft, and price, create features: price_per_sqft, bathroom_ratio (bathrooms/bedrooms), and total_rooms.

Show Solution
import pandas as pd

houses = pd.DataFrame({
    'bedrooms': [3, 4, 2, 5],
    'bathrooms': [2, 3, 1, 4],
    'sqft': [1500, 2200, 1000, 3000],
    'price': [300000, 450000, 200000, 600000]
})

# Create features
houses['price_per_sqft'] = houses['price'] / houses['sqft']
houses['bathroom_ratio'] = houses['bathrooms'] / houses['bedrooms']
houses['total_rooms'] = houses['bedrooms'] + houses['bathrooms']

print(houses)

Task: Given patient data with height (cm), weight (kg), age, and blood pressure (systolic/diastolic), create: BMI, blood_pressure_ratio, age_adjusted_bp, and is_hypertensive (systolic > 140 or diastolic > 90).

Show Solution
import pandas as pd

patients = pd.DataFrame({
    'height_cm': [175, 160, 180, 165],
    'weight_kg': [80, 55, 90, 70],
    'age': [35, 45, 28, 55],
    'systolic': [120, 145, 115, 160],
    'diastolic': [80, 95, 75, 100]
})

# BMI = weight (kg) / height (m)^2
patients['height_m'] = patients['height_cm'] / 100
patients['bmi'] = patients['weight_kg'] / (patients['height_m'] ** 2)

# Blood pressure features
patients['bp_ratio'] = patients['systolic'] / patients['diastolic']
patients['age_adjusted_systolic'] = patients['systolic'] / patients['age']

# Binary indicator
patients['is_hypertensive'] = ((patients['systolic'] > 140) | 
                               (patients['diastolic'] > 90)).astype(int)

print(patients[['bmi', 'bp_ratio', 'is_hypertensive']])

Task: Given customer data with total_orders, total_spent, and days_since_signup, create: avg_order_value, orders_per_month, and is_high_value (spent > $500).

Show Solution
import pandas as pd

customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'total_orders': [5, 12, 3, 25, 8],
    'total_spent': [250, 890, 150, 1500, 420],
    'days_since_signup': [90, 365, 30, 720, 180]
})

# Average order value
customers['avg_order_value'] = customers['total_spent'] / customers['total_orders']

# Orders per month (30 days)
customers['orders_per_month'] = customers['total_orders'] / (customers['days_since_signup'] / 30)

# High value indicator
customers['is_high_value'] = (customers['total_spent'] > 500).astype(int)

print(customers)

Task: Given company financial data with revenue, expenses, assets, liabilities, and equity, create: profit_margin, debt_to_equity, asset_turnover, and current_ratio.

Show Solution
import pandas as pd

companies = pd.DataFrame({
    'company': ['A', 'B', 'C', 'D'],
    'revenue': [1000000, 500000, 2500000, 750000],
    'expenses': [800000, 450000, 2000000, 700000],
    'assets': [1500000, 600000, 3000000, 900000],
    'liabilities': [500000, 300000, 1200000, 400000],
    'equity': [1000000, 300000, 1800000, 500000]
})

# Profit margin = (revenue - expenses) / revenue
companies['profit_margin'] = (companies['revenue'] - companies['expenses']) / companies['revenue']

# Debt to equity ratio
companies['debt_to_equity'] = companies['liabilities'] / companies['equity']

# Asset turnover = revenue / assets
companies['asset_turnover'] = companies['revenue'] / companies['assets']

# Current ratio (simplified) = assets / liabilities
companies['current_ratio'] = companies['assets'] / companies['liabilities']

print(companies[['company', 'profit_margin', 'debt_to_equity', 'asset_turnover']])

Task: Given student data with math_score, reading_score, writing_score, and study_hours, create: total_score, average_score, score_std (variability), and study_efficiency (avg_score/study_hours).

Show Solution
import pandas as pd
import numpy as np

students = pd.DataFrame({
    'student_id': [1, 2, 3, 4, 5],
    'math_score': [85, 72, 90, 65, 78],
    'reading_score': [78, 88, 85, 70, 82],
    'writing_score': [80, 85, 88, 68, 75],
    'study_hours': [15, 20, 18, 10, 12]
})

# Total and average score
students['total_score'] = students[['math_score', 'reading_score', 'writing_score']].sum(axis=1)
students['average_score'] = students[['math_score', 'reading_score', 'writing_score']].mean(axis=1)

# Score variability (standard deviation across subjects)
students['score_std'] = students[['math_score', 'reading_score', 'writing_score']].std(axis=1)

# Study efficiency
students['study_efficiency'] = students['average_score'] / students['study_hours']

print(students)

Task: Given transaction data with customer_id, amount, and category, create per-customer features: total_transactions, avg_transaction, max_transaction, and favorite_category (most frequent).

Show Solution
import pandas as pd

transactions = pd.DataFrame({
    'customer_id': [1, 1, 1, 2, 2, 3, 3, 3, 3],
    'amount': [50, 75, 120, 200, 180, 30, 45, 60, 35],
    'category': ['food', 'electronics', 'food', 'electronics', 'electronics', 
                 'clothing', 'food', 'clothing', 'clothing']
})

# Aggregate features per customer
customer_features = transactions.groupby('customer_id').agg(
    total_transactions=('amount', 'count'),
    total_spent=('amount', 'sum'),
    avg_transaction=('amount', 'mean'),
    max_transaction=('amount', 'max')
).reset_index()

# Favorite category (most frequent)
favorite_cat = transactions.groupby('customer_id')['category'].agg(
    lambda x: x.value_counts().index[0]
).reset_index()
favorite_cat.columns = ['customer_id', 'favorite_category']

# Merge features
customer_features = customer_features.merge(favorite_cat, on='customer_id')
print(customer_features)

Task: Given website session data with pages_viewed, time_on_site (seconds), and bounced (0/1), create: pages_per_minute, avg_time_per_page, engagement_score (pages * time / 100), and is_engaged (viewed >3 pages AND didn't bounce).

Show Solution
import pandas as pd

sessions = pd.DataFrame({
    'session_id': range(1, 9),
    'pages_viewed': [1, 5, 8, 2, 12, 3, 6, 1],
    'time_on_site': [15, 180, 300, 45, 420, 90, 240, 10],
    'bounced': [1, 0, 0, 1, 0, 0, 0, 1]
})

# Pages per minute
sessions['pages_per_minute'] = sessions['pages_viewed'] / (sessions['time_on_site'] / 60)

# Average time per page
sessions['avg_time_per_page'] = sessions['time_on_site'] / sessions['pages_viewed']

# Engagement score
sessions['engagement_score'] = (sessions['pages_viewed'] * sessions['time_on_site']) / 100

# Is engaged (viewed >3 pages AND didn't bounce)
sessions['is_engaged'] = ((sessions['pages_viewed'] > 3) & 
                          (sessions['bounced'] == 0)).astype(int)

print(sessions)
02

Polynomial & Interaction Features

Sometimes the relationship between your features and target isn't a straight line. Imagine predicting a ball's trajectory — it follows a curve, not a straight path! Polynomial features let your model capture these curved relationships by adding squared, cubed, or higher-power versions of your original features. Interaction features capture how two variables work together — like how both "study hours" AND "prior knowledge" combine to affect exam scores. Scikit-learn makes creating these features easy with PolynomialFeatures.

Understanding Polynomial Features

What are they? Polynomial features are new features created by raising your original features to different powers. If you have a feature called "age", polynomial features would include age, age², age³, and so on. Why does this help? Linear regression can only draw straight lines, but with polynomial features, it can fit curves! For features [a, b], polynomial features of degree 2 include: [1, a, b, a², ab, b²]. The "ab" term is an interaction — it captures how a and b work together.

# Step 1: Import and create sample data
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Create non-linear data
np.random.seed(42)
X = np.linspace(0, 10, 100).reshape(-1, 1)
y = 2 * X.ravel()**2 - 5 * X.ravel() + 3 + np.random.randn(100) * 10

We create synthetic data following a quadratic pattern (y = 2x² - 5x + 3) with some noise. This represents the kind of non-linear relationship that polynomial features can capture.

# Step 2: Compare linear vs polynomial regression
# Linear regression (will underfit)
linear_model = LinearRegression()
linear_model.fit(X, y)
y_linear = linear_model.predict(X)

# Polynomial regression (degree 2)
poly_model = Pipeline([
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
    ('linear', LinearRegression())
])
poly_model.fit(X, y)
y_poly = poly_model.predict(X)

Building the models step by step:

  • Linear model: Plain LinearRegression tries to fit a straight line to our curved data. It will underfit (miss the curve pattern).
  • Polynomial model: We use a Pipeline (like an assembly line) that first transforms X into [X, X²] using PolynomialFeatures, then feeds those to LinearRegression. Now the model can learn: y = a + b·X + c·X², which is a parabola!
  • include_bias=False: Prevents adding a column of 1s (the intercept is handled by LinearRegression already)
  • Why Pipeline? It ensures the transformation and model stay together — crucial for proper cross-validation and predictions on new data!
# Step 3: Visualize the difference
plt.figure(figsize=(10, 5))
plt.scatter(X, y, alpha=0.5, label='Data')
plt.plot(X, y_linear, 'r-', linewidth=2, label='Linear')
plt.plot(X, y_poly, 'g-', linewidth=2, label='Polynomial (degree=2)')
plt.xlabel('X')
plt.ylabel('y')
plt.legend()
plt.title('Linear vs Polynomial Regression')
plt.show()

print(f"Linear R²: {linear_model.score(X, y):.4f}")
print(f"Polynomial R²: {poly_model.score(X, y):.4f}")

The plot makes it crystal clear! The red line (linear) cuts straight through the data, missing the curve entirely. The green line (polynomial) follows the parabola shape perfectly. The R² score confirms this numerically: 1 is a perfect fit, values near 0 mean the model explains almost nothing (and R² can even go negative for fits worse than a horizontal line). You'll see something like Linear R²: 0.75 vs Polynomial R²: 0.95 — a huge improvement just by adding X² as a feature! This is the power of polynomial features: enabling simple linear models to fit complex curves.

Interaction Features

Interaction features capture how variables work together. Setting interaction_only=True generates only interaction terms without powers.

from sklearn.preprocessing import PolynomialFeatures
import pandas as pd

# Sample data with two features
data = pd.DataFrame({
    'experience': [1, 3, 5, 10, 15],
    'education': [12, 14, 16, 18, 20]  # Years of education
})

# Create interaction features only
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
interaction_features = poly.fit_transform(data)

# Get feature names
feature_names = poly.get_feature_names_out(['experience', 'education'])
interaction_df = pd.DataFrame(interaction_features, columns=feature_names)

print("Original features:")
print(data)
print("\nWith interaction features:")
print(interaction_df)

Interaction features explained: With interaction_only=True, we skip squared terms (no experience² or education²) and only create the multiplication: experience × education. Why does this matter? Imagine predicting salary:

  • 10 years experience + 12 years education = good salary
  • 5 years experience + 20 years education (PhD) = good salary
  • But 10 years experience + 20 years education = GREAT salary!

The interaction term captures this "multiplier effect" — education might boost salary more at higher experience levels. Without interactions, the model treats experience and education as completely independent, missing this synergy!
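A small synthetic experiment (made-up toy data) shows the multiplier effect in action: when the target depends purely on the product of two features, a plain linear model can't fully fit it, but adding the interaction term makes the fit essentially exact:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Toy data: the target is pure synergy between the two inputs
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 2))  # e.g. [experience, education]
y = X[:, 0] * X[:, 1]

# Plain linear model treats the inputs as independent
r2_plain = LinearRegression().fit(X, y).score(X, y)

# Add the interaction term x1*x2, then fit the same linear model
X_int = PolynomialFeatures(degree=2, interaction_only=True,
                           include_bias=False).fit_transform(X)
r2_int = LinearRegression().fit(X_int, y).score(X_int, y)

print(f"R² without interaction: {r2_plain:.3f}")
print(f"R² with interaction:    {r2_int:.3f}")  # essentially 1.0
```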

Controlling Feature Explosion

Polynomial features grow combinatorially with the number of inputs and the degree. For 10 features with degree 3, you'd get 285 features (286 counting the bias term)! Use strategies to manage this:

from sklearn.preprocessing import PolynomialFeatures
import numpy as np

# Show feature count growth
original_features = [2, 5, 10, 20]
degrees = [2, 3, 4]

print("Number of polynomial features generated:")
print(f"{'Original':<12}", end='')
for d in degrees:
    print(f"Degree {d:<8}", end='')
print()

for n in original_features:
    X = np.random.randn(10, n)  # 10 samples, n features
    print(f"{n:<12}", end='')
    for d in degrees:
        poly = PolynomialFeatures(degree=d, include_bias=False)
        n_features = poly.fit_transform(X).shape[1]
        print(f"{n_features:<12}", end='')
    print()

The feature explosion problem: This table reveals a scary truth! Starting with just 10 features:

  • Degree 2: Creates 65 features (manageable)
  • Degree 3: Creates 285 features (getting crowded)
  • Degree 4: Creates 1,000 features (too many!)

Why is this bad? More features than samples leads to overfitting — your model memorizes noise instead of learning patterns. Solutions: (1) Use interaction_only=True to avoid squared/cubed terms, (2) Keep degree ≤ 3, (3) Apply feature selection afterward (like SelectKBest or L1 regularization) to keep only the most useful polynomial features.
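You can check these counts without generating any data — the number of polynomial features for n inputs at degree d (with include_bias=False) has a closed form, C(n+d, d) − 1:

```python
from math import comb

def n_poly_features(n_inputs, degree):
    # C(n + d, d) counts all monomials up to degree d, including the bias term;
    # subtracting 1 drops the bias (matching include_bias=False)
    return comb(n_inputs + degree, degree) - 1

for d in (2, 3, 4):
    print(f"10 features, degree {d}: {n_poly_features(10, d)}")
# 10 features, degree 2: 65
# 10 features, degree 3: 285
# 10 features, degree 4: 1000
```

The same formula explains the table above: the count grows combinatorially in both n and d.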

Caution: High-degree polynomials (degree > 3) can lead to severe overfitting. Always use cross-validation to check if polynomial features actually improve your model.

Practice Questions

Task: Create a pipeline with PolynomialFeatures and Ridge regression. Use cross-validation to find the best polynomial degree from [1, 2, 3, 4] on the diabetes dataset.

Show Solution
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_diabetes
import numpy as np

X, y = load_diabetes(return_X_y=True)

best_score = -np.inf
best_degree = 1

for degree in [1, 2, 3, 4]:
    pipeline = Pipeline([
        ('poly', PolynomialFeatures(degree=degree, include_bias=False)),
        ('scaler', StandardScaler()),
        ('ridge', Ridge(alpha=1.0))
    ])
    
    scores = cross_val_score(pipeline, X, y, cv=5, scoring='r2')
    mean_score = scores.mean()
    print(f"Degree {degree}: R² = {mean_score:.4f} (+/- {scores.std()*2:.4f})")
    
    if mean_score > best_score:
        best_score = mean_score
        best_degree = degree

print(f"\nBest degree: {best_degree} with R² = {best_score:.4f}")

Task: Given a DataFrame with features x1 and x2, manually create degree-2 polynomial features: x1², x2², and x1*x2.

Show Solution
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'x1': [1, 2, 3, 4, 5],
    'x2': [2, 4, 6, 8, 10]
})

# Polynomial features manually
df['x1_squared'] = df['x1'] ** 2
df['x2_squared'] = df['x2'] ** 2
df['x1_x2'] = df['x1'] * df['x2']

print(df)

# Verify with sklearn
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(df[['x1', 'x2']])
print(f"\nFeature names: {poly.get_feature_names_out()}")

Task: Generate only interaction features (no squared terms) for a dataset with 4 features. Print the number of features before and after transformation.

Show Solution
from sklearn.preprocessing import PolynomialFeatures
import numpy as np

# Create sample data with 4 features
X = np.random.randn(100, 4)
print(f"Original features: {X.shape[1]}")

# Interaction only (no squared terms)
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_interactions = poly.fit_transform(X)

print(f"After interactions: {X_interactions.shape[1]}")
print(f"Feature names: {poly.get_feature_names_out()}")

# Compare with full polynomial
poly_full = PolynomialFeatures(degree=2, include_bias=False)
X_full = poly_full.fit_transform(X)
print(f"\nFull polynomial: {X_full.shape[1]} features")

Task: Create polynomial features, then use SelectKBest to select the top 10 most important polynomial features. Show which features were selected.

Show Solution
from sklearn.preprocessing import PolynomialFeatures
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.datasets import load_diabetes
import numpy as np

# Load data
X, y = load_diabetes(return_X_y=True)
print(f"Original features: {X.shape[1]}")

# Create polynomial features
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
print(f"Polynomial features: {X_poly.shape[1]}")

# Select top 10
selector = SelectKBest(f_regression, k=10)
X_selected = selector.fit_transform(X_poly, y)

# Get selected feature names
feature_names = poly.get_feature_names_out()
selected_mask = selector.get_support()
selected_features = feature_names[selected_mask]

print(f"\nTop 10 features:")
for name, score in zip(selected_features, selector.scores_[selected_mask]):
    print(f"  {name}: {score:.2f}")

Task: Create synthetic curved data and visualize linear vs polynomial (degree 2 and 3) regression fits on the same plot.

Show Solution
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Create curved data
np.random.seed(42)
X = np.linspace(0, 10, 50).reshape(-1, 1)
y = 2 * X.ravel() ** 2 - 5 * X.ravel() + 10 + np.random.randn(50) * 10

# Fit models
models = {
    'Linear': make_pipeline(PolynomialFeatures(1), LinearRegression()),
    'Degree 2': make_pipeline(PolynomialFeatures(2), LinearRegression()),
    'Degree 3': make_pipeline(PolynomialFeatures(3), LinearRegression())
}

plt.figure(figsize=(10, 6))
plt.scatter(X, y, alpha=0.7, label='Data')

X_plot = np.linspace(0, 10, 100).reshape(-1, 1)
for name, model in models.items():
    model.fit(X, y)
    y_pred = model.predict(X_plot)
    plt.plot(X_plot, y_pred, label=f'{name}')

plt.legend()
plt.title('Polynomial Regression Comparison')
plt.show()

Task: For a dataset with 5 original features, calculate how many features you'll have after polynomial transformation of degree 2, 3, and 4 (without bias).

Show Solution
from sklearn.preprocessing import PolynomialFeatures
import numpy as np
from math import comb

n_features = 5

for degree in [2, 3, 4]:
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    X = np.random.randn(10, n_features)
    X_poly = poly.fit_transform(X)
    
    # Formula: C(n+d, d) - 1 (excluding bias)
    # Where n = features, d = degree
    formula_result = comb(n_features + degree, degree) - 1
    
    print(f"Degree {degree}: {X_poly.shape[1]} features")
    print(f"  Formula verification: {formula_result}")
    print(f"  Feature names: {len(poly.get_feature_names_out())}\n")

Task: Compare Linear, Ridge, and Lasso regression with degree-3 polynomial features. Show which coefficients Lasso sets to zero.

Show Solution
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_diabetes
import numpy as np

X, y = load_diabetes(return_X_y=True)

models = {
    'Linear': LinearRegression(),
    'Ridge': Ridge(alpha=1.0),
    'Lasso': Lasso(alpha=0.1)
}

for name, model in models.items():
    pipeline = Pipeline([
        ('poly', PolynomialFeatures(degree=3, include_bias=False)),
        ('scaler', StandardScaler()),
        ('model', model)
    ])
    scores = cross_val_score(pipeline, X, y, cv=5, scoring='r2')
    print(f"{name}: R² = {scores.mean():.4f} (+/- {scores.std()*2:.4f})")

# Show Lasso sparsity
pipeline.fit(X, y)
lasso_coef = pipeline.named_steps['model'].coef_
zero_coefs = np.sum(np.abs(lasso_coef) < 0.01)
print(f"\nLasso zeroed {zero_coefs}/{len(lasso_coef)} coefficients")
03

Time-Based Feature Extraction

Dates and times are gold mines of hidden information! A raw timestamp like "2024-03-15 14:30:00" might look meaningless to a model, but it contains valuable patterns: Is it a weekday or weekend? Morning or evening? Beginning of the month (when people get paid) or end? Is it a holiday? These patterns often strongly influence behavior. For example, restaurant traffic varies by day of week, online shopping peaks in evenings, and crime rates change by hour. This section teaches you to unlock these hidden patterns by extracting useful features from datetime columns.

Extracting Date Components

Pandas makes this incredibly easy! Every datetime column has a .dt accessor that lets you pull out components like year, month, day, hour, and even whether it's a weekend. Let's see how:

# Step 1: Create sample datetime data
import pandas as pd
import numpy as np

# Sample transaction data
transactions = pd.DataFrame({
    'transaction_id': range(1, 101),
    'timestamp': pd.date_range(start='2024-01-01', periods=100, freq='D'),
    'amount': np.random.randint(10, 500, 100)
})

print(transactions.head())

Setting up datetime data: We create 100 transactions over 100 consecutive days using pd.date_range(). The freq='D' means daily frequency. Each row has a transaction_id, timestamp, and random amount. The key here is that timestamp is a proper datetime column (not just text!), which unlocks all the .dt accessor magic we'll use next.

# Step 2: Extract basic date components
transactions['year'] = transactions['timestamp'].dt.year
transactions['month'] = transactions['timestamp'].dt.month
transactions['day'] = transactions['timestamp'].dt.day
transactions['day_of_week'] = transactions['timestamp'].dt.dayofweek  # 0=Monday, 6=Sunday
transactions['day_name'] = transactions['timestamp'].dt.day_name()
transactions['week_of_year'] = transactions['timestamp'].dt.isocalendar().week
transactions['quarter'] = transactions['timestamp'].dt.quarter

print(transactions[['timestamp', 'day_of_week', 'day_name', 'week_of_year', 'quarter']].head(10))

The .dt accessor is your best friend! It lets you extract any component from a datetime:

  • .year, .month, .day: The obvious components
  • .dayofweek: 0=Monday, 1=Tuesday, ..., 6=Sunday. Perfect for capturing weekly patterns (sales spike on weekends, etc.)
  • .day_name(): Human-readable day name (great for visualization, though you'd one-hot encode it for modeling)
  • .isocalendar().week: ISO week number 1-52 (some years have a week 53). Captures seasonal patterns at weekly granularity
  • .quarter: 1-4, representing Q1-Q4. Financial data often follows quarterly patterns (earnings reports, etc.)
# Step 3: Create binary indicator features
transactions['is_weekend'] = transactions['day_of_week'].isin([5, 6]).astype(int)
transactions['is_month_start'] = transactions['timestamp'].dt.is_month_start.astype(int)
transactions['is_month_end'] = transactions['timestamp'].dt.is_month_end.astype(int)
transactions['is_quarter_end'] = transactions['timestamp'].dt.is_quarter_end.astype(int)

print(transactions[['timestamp', 'is_weekend', 'is_month_start', 'is_month_end']].head(10))

Binary indicators simplify patterns: Instead of 7 day-of-week values, sometimes you just need "is it a weekend?" (0 or 1). These simple yes/no features can be very powerful:

  • is_weekend: Saturday (5) or Sunday (6) → 1, else 0. Captures weekend shopping behavior!
  • is_month_start/end: Pandas has built-in checks for first/last day of month. Payday is often at month end — this captures that spending spike!
  • is_quarter_end: Important for business/financial data where quarters drive behavior (quarterly bonuses, sales quotas, etc.)

Pro tip: .astype(int) converts True/False to 1/0 — most ML models prefer numbers!

Cyclical Encoding

Features like hour (0-23) and month (1-12) are cyclical: hour 23 is close to hour 0, and December is close to January. Use sine/cosine encoding to capture this:

# Create hourly data for cyclical encoding demo
hourly_data = pd.DataFrame({
    'timestamp': pd.date_range(start='2024-01-01', periods=168, freq='H'),
    'value': np.random.randn(168)
})

# Extract hour
hourly_data['hour'] = hourly_data['timestamp'].dt.hour

# Cyclical encoding
hourly_data['hour_sin'] = np.sin(2 * np.pi * hourly_data['hour'] / 24)
hourly_data['hour_cos'] = np.cos(2 * np.pi * hourly_data['hour'] / 24)

# Similarly for day of week
hourly_data['day_of_week'] = hourly_data['timestamp'].dt.dayofweek
hourly_data['dow_sin'] = np.sin(2 * np.pi * hourly_data['day_of_week'] / 7)
hourly_data['dow_cos'] = np.cos(2 * np.pi * hourly_data['day_of_week'] / 7)

print(hourly_data[['hour', 'hour_sin', 'hour_cos']].head(10))

Why sine and cosine? Here's the problem: hour 23 and hour 0 are actually neighbors (just 1 hour apart!), but if you use raw numbers, the model thinks they're far apart (23 vs 0). Sine and cosine encoding maps hours onto a circle, so midnight (0) and 11 PM (23) end up near each other. The formula 2π × value / cycle_length converts your value to an angle on a circle. You need BOTH sine and cosine because sine alone can't distinguish some pairs of hours — 3 AM and 9 AM both have sine ≈ 0.71, but their cosines differ (+0.71 vs −0.71), so the pair together pins down a unique position on the circle.
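You can verify this geometrically with a few lines of NumPy — a quick sketch (not part of the pipeline above) that measures straight-line distances between encoded hours:

```python
import numpy as np

def encode_hour(h):
    """Map an hour (0-23) to a point on the unit circle."""
    angle = 2 * np.pi * h / 24
    return np.array([np.sin(angle), np.cos(angle)])

# As raw numbers, 23 and 0 look 23 hours apart...
# ...but on the circle they are nearest neighbours, while 23 and 12 sit opposite.
d_23_0 = np.linalg.norm(encode_hour(23) - encode_hour(0))
d_23_12 = np.linalg.norm(encode_hour(23) - encode_hour(12))

print(f"Distance 23h to 0h:  {d_23_0:.3f}")   # small
print(f"Distance 23h to 12h: {d_23_12:.3f}")  # large
```

Any model that works with distances (k-NN, clustering, linear models) now sees the wrap-around correctly.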

Time-Based Calculations

Calculate durations, time since events, and recency features:

# Customer data with signup and last purchase dates
customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4],
    'signup_date': pd.to_datetime(['2022-01-15', '2023-06-20', '2021-03-10', '2024-01-01']),
    'last_purchase': pd.to_datetime(['2024-10-01', '2024-09-15', '2024-08-01', '2024-10-10'])
})

# Reference date (today)
today = pd.to_datetime('2024-10-15')

# Calculate time-based features
customers['days_since_signup'] = (today - customers['signup_date']).dt.days
customers['days_since_purchase'] = (today - customers['last_purchase']).dt.days
customers['customer_tenure_months'] = customers['days_since_signup'] / 30.44
customers['is_new_customer'] = (customers['days_since_signup'] < 90).astype(int)
customers['is_active'] = (customers['days_since_purchase'] < 30).astype(int)

print(customers[['customer_id', 'days_since_signup', 'days_since_purchase', 'is_active']])

Recency features are gold for predictions! Here's what we're calculating:

  • days_since_signup: Customer tenure — how long they've been with you. Subtracting dates gives a timedelta, and .dt.days extracts just the number of days.
  • days_since_purchase: THE most important feature for churn prediction! Customers who haven't purchased in 60+ days might be leaving.
  • customer_tenure_months: Days divided by 30.44 (average days/month). Monthly units are easier to interpret than raw days.
  • is_new_customer: Signed up less than 90 days ago? They might need onboarding support!
  • is_active: Purchased in the last 30 days? This simple binary flag can be your target variable for churn models!

Practice Questions

Task: Given flight departure times, extract: hour, is_morning_flight (5-11), is_evening_flight (17-21), is_redeye (22-5), and cyclical hour encoding.

Show Solution
import pandas as pd
import numpy as np

flights = pd.DataFrame({
    'flight_id': range(1, 11),
    'departure': pd.to_datetime([
        '2024-06-15 06:30', '2024-06-15 08:15', '2024-06-15 12:00',
        '2024-06-15 14:30', '2024-06-15 18:00', '2024-06-15 20:45',
        '2024-06-15 23:30', '2024-06-16 01:15', '2024-06-16 05:00',
        '2024-06-16 09:30'
    ])
})

# Extract hour
flights['hour'] = flights['departure'].dt.hour

# Time of day categories
flights['is_morning'] = flights['hour'].between(5, 11).astype(int)
flights['is_evening'] = flights['hour'].between(17, 21).astype(int)
flights['is_redeye'] = ((flights['hour'] >= 22) | (flights['hour'] < 5)).astype(int)

# Cyclical encoding
flights['hour_sin'] = np.sin(2 * np.pi * flights['hour'] / 24)
flights['hour_cos'] = np.cos(2 * np.pi * flights['hour'] / 24)

print(flights[['departure', 'hour', 'is_morning', 'is_evening', 'is_redeye']])

Task: Given a DataFrame with a 'date' column, extract: year, month, day, day_of_week (0=Monday), and quarter.

Show Solution
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2024-01-15', '2024-03-22', '2024-06-01', 
                            '2024-09-10', '2024-12-25'])
})

# Extract components
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['day_of_week'] = df['date'].dt.dayofweek  # 0=Monday, 6=Sunday
df['quarter'] = df['date'].dt.quarter

print(df)

Task: Encode months cyclically so that December (12) is close to January (1). Create sin and cos encodings for months.

Show Solution
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=12, freq='MS')
})

df['month'] = df['date'].dt.month

# Cyclical encoding for months (12 months in a year)
df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)

# Verify: December (12) should be close to January (1)
print(df[['month', 'month_sin', 'month_cos']])

# Check distance between Dec and Jan vs Dec and July
from scipy.spatial.distance import euclidean
dec = df.loc[df['month'] == 12, ['month_sin', 'month_cos']].values[0]
jan = df.loc[df['month'] == 1, ['month_sin', 'month_cos']].values[0]
jul = df.loc[df['month'] == 7, ['month_sin', 'month_cos']].values[0]

print(f"\nDistance Dec-Jan: {euclidean(dec, jan):.3f}")
print(f"Distance Dec-Jul: {euclidean(dec, jul):.3f}")

Task: Given customer data with signup_date and last_purchase_date, calculate: days_since_signup, days_since_purchase, and is_recent_customer (signup < 90 days ago).

Show Solution
import pandas as pd
from datetime import datetime

customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4],
    'signup_date': pd.to_datetime(['2024-01-15', '2023-06-01', '2024-10-01', '2022-03-20']),
    'last_purchase': pd.to_datetime(['2024-11-01', '2024-10-15', '2024-11-10', '2024-05-01'])
})

# Reference date (could be current date)
reference_date = pd.to_datetime('2024-11-15')

# Calculate days since events
customers['days_since_signup'] = (reference_date - customers['signup_date']).dt.days
customers['days_since_purchase'] = (reference_date - customers['last_purchase']).dt.days

# Recent customer (signed up within 90 days)
customers['is_recent_customer'] = (customers['days_since_signup'] < 90).astype(int)

print(customers)

Task: Given daily sales data, create lag features (sales from 1, 7, and 30 days ago) and rolling averages (7-day and 30-day).

Show Solution
import pandas as pd
import numpy as np

# Create sample daily sales
np.random.seed(42)
dates = pd.date_range('2024-01-01', periods=90, freq='D')
sales = pd.DataFrame({
    'date': dates,
    'sales': np.random.randint(100, 500, 90) + \
             np.sin(np.arange(90) * 2 * np.pi / 7) * 50  # Weekly pattern
})

# Lag features
sales['sales_lag_1'] = sales['sales'].shift(1)   # Yesterday
sales['sales_lag_7'] = sales['sales'].shift(7)   # Last week
sales['sales_lag_30'] = sales['sales'].shift(30) # Last month

# Rolling averages
sales['sales_ma_7'] = sales['sales'].rolling(window=7).mean()
sales['sales_ma_30'] = sales['sales'].rolling(window=30).mean()

# Drop rows with NaN
print(sales.dropna().head(10))

Task: Given a date column, create an is_holiday feature based on a list of US holiday dates for 2024.

Show Solution
import pandas as pd

# Sample transactions
df = pd.DataFrame({
    'date': pd.to_datetime(['2024-01-01', '2024-01-15', '2024-07-04', 
                            '2024-07-10', '2024-12-25', '2024-12-26'])
})

# US Holidays 2024 (simplified list)
holidays_2024 = pd.to_datetime([
    '2024-01-01',  # New Year's Day
    '2024-01-15',  # MLK Day
    '2024-07-04',  # Independence Day
    '2024-11-28',  # Thanksgiving
    '2024-12-25',  # Christmas
])

# Create holiday indicator
df['is_holiday'] = df['date'].isin(holidays_2024).astype(int)

# Also flag day before/after holiday
df['near_holiday'] = (df['date'].isin(holidays_2024) | 
                      df['date'].isin(holidays_2024 - pd.Timedelta(days=1)) |
                      df['date'].isin(holidays_2024 + pd.Timedelta(days=1))).astype(int)

print(df)

Task: Given transaction data with timestamps, for each transaction calculate: transactions_last_7_days, amount_last_30_days, and avg_amount_per_transaction for that customer.

Show Solution
import pandas as pd
import numpy as np

# Sample transaction data
np.random.seed(42)
transactions = pd.DataFrame({
    'customer_id': np.random.choice([1, 2, 3], 20),
    'date': pd.date_range('2024-01-01', periods=20, freq='2D'),
    'amount': np.random.randint(20, 200, 20)
}).sort_values(['customer_id', 'date'])

def calculate_rolling_features(group):
    group = group.sort_values('date')
    
    # Count transactions in last 7 days (excluding current)
    group['txn_last_7d'] = group['date'].apply(
        lambda d: ((group['date'] >= d - pd.Timedelta(days=7)) & 
                   (group['date'] < d)).sum()
    )
    
    # Sum amount in last 30 days
    group['amount_last_30d'] = group['date'].apply(
        lambda d: group.loc[(group['date'] >= d - pd.Timedelta(days=30)) & 
                            (group['date'] < d), 'amount'].sum()
    )
    
    # Cumulative average (up to but not including current)
    group['cum_avg_amount'] = group['amount'].expanding().mean().shift(1)
    
    return group

result = transactions.groupby('customer_id').apply(calculate_rolling_features)
print(result.reset_index(drop=True))
04

Binning & Discretization Strategies

Sometimes exact numbers contain too much noise, or they don't matter as much as categories. Consider age: the difference between 25 and 26 probably doesn't matter much for most predictions, but the difference between "young adult" (18-30) and "middle-aged" (40-55) might! Binning (also called discretization) converts continuous numbers into discrete categories or "bins". Think of it like organizing books on a shelf — instead of tracking the exact page count of each book, you might just label them as "short", "medium", or "long". This can reduce noise, handle outliers gracefully, and sometimes reveal patterns that exact numbers hide.

Equal-Width Binning

The simplest approach: divide the entire range into equal-sized bins. If ages range from 18 to 80, and you want 4 bins, each bin covers (80-18)/4 = 15.5 years. Simple but can be problematic with outliers (one very old person could create a nearly-empty bin):

# Step 1: Import and create data
import numpy as np
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

# Age data (continuous)
np.random.seed(42)
ages = np.random.normal(40, 15, 1000).clip(18, 80).reshape(-1, 1)
df = pd.DataFrame(ages, columns=['age'])

Creating realistic age data: We generate 1000 ages from a normal distribution centered at 40 with standard deviation 15. The .clip(18, 80) ensures no impossible ages (negative or 150 years old!). .reshape(-1, 1) converts the 1D array to a 2D column — sklearn's KBinsDiscretizer requires this shape. This gives us a realistic age distribution to experiment with different binning strategies.

# Step 2: Apply equal-width binning
uniform_binner = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')
df['age_uniform_bin'] = uniform_binner.fit_transform(df[['age']])

# See the bin edges
print("Uniform bin edges:", uniform_binner.bin_edges_[0].round(1))

Equal-width binning explained:

  • n_bins=5: Create 5 bins (categories)
  • strategy='uniform': Each bin covers the same range of values. Ages 18-80 divided into 5 bins means each bin spans (80-18)/5 = 12.4 years: [18-30.4], [30.4-42.8], [42.8-55.2], [55.2-67.6], [67.6-80]
  • encode='ordinal': Returns integers 0, 1, 2, 3, 4 (ordered categories). Other options: 'onehot' for one-hot encoding
  • bin_edges_: Shows exactly where each bin starts and ends — useful for understanding and communicating your binning!

The catch: If most people are 30-50, those middle bins will be crowded while edge bins might be nearly empty!

Equal-Frequency (Quantile) Binning

Ensures each bin has approximately the same number of samples:

# Equal-frequency binning
quantile_binner = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='quantile')
df['age_quantile_bin'] = quantile_binner.fit_transform(df[['age']])

print("\nQuantile bin edges:", quantile_binner.bin_edges_[0].round(1))

# Compare distributions
print("\nSamples per bin:")
print("Uniform bins:", df['age_uniform_bin'].value_counts().sort_index().values)
print("Quantile bins:", df['age_quantile_bin'].value_counts().sort_index().values)

Quantile binning solves the imbalance problem: Instead of equal width, each bin gets an equal number of samples (1000 samples ÷ 5 bins = 200 per bin). The bin edges adapt to your data distribution!

  • For skewed data: If 80% of customers are under 40, uniform binning would cram them into 2 bins. Quantile binning spreads them across all 5 bins.
  • Compare the outputs: Uniform bins might show [50, 300, 400, 200, 50] samples per bin (uneven!). Quantile bins will show [200, 200, 200, 200, 200] (balanced!).
  • When to use: When you want equal representation in each category, or when your data is heavily skewed.

K-Means Binning

Uses K-means clustering to find natural groupings in the data:

# K-means based binning
kmeans_binner = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='kmeans')
df['age_kmeans_bin'] = kmeans_binner.fit_transform(df[['age']])

print("K-means bin edges:", kmeans_binner.bin_edges_[0].round(1))

Let the data decide! K-means binning runs the K-means clustering algorithm on your data to find natural groupings. If your data has natural clusters (like age groups with distinct behaviors), K-means will find them!

  • How it works: K-means finds 5 cluster centers, then bin edges are placed at the midpoints between adjacent centers.
  • Best for: Data with natural groupings that you don't know in advance — K-means discovers them!
  • Downside: Results can vary slightly between runs (sklearn may subsample large datasets before clustering). Set random_state for reproducibility.

Custom Binning with Domain Knowledge

Often the best bins come from domain expertise:

# Custom age groups based on life stages
def age_category(age):
    if age < 25:
        return 'Young Adult'
    elif age < 35:
        return 'Early Career'
    elif age < 50:
        return 'Mid Career'
    elif age < 65:
        return 'Late Career'
    else:
        return 'Retired'

df['age_category'] = df['age'].apply(age_category)

# Using pd.cut for custom bins
bins = [0, 25, 35, 50, 65, 100]
labels = ['Young Adult', 'Early Career', 'Mid Career', 'Late Career', 'Retired']
df['age_category_cut'] = pd.cut(df['age'], bins=bins, labels=labels)

print(df['age_category'].value_counts())

Domain knowledge often beats algorithms! Here we define bins based on life stages that actually mean something:

  • Young Adult (18-25): Just starting out, different spending patterns
  • Early Career (25-35): Building career, maybe starting family
  • Mid Career (35-50): Peak earning years, established lifestyle
  • Late Career (50-65): Planning for retirement
  • Retired (65+): Fixed income, different priorities

Two ways to implement: (1) .apply() with a custom function — flexible but slower, or (2) pd.cut() with predefined bins and labels — faster and cleaner. They agree except at the exact boundaries: pd.cut's default right=True puts an age of exactly 25 in 'Young Adult', while the function puts it in 'Early Career'. Pass right=False to pd.cut to match the function exactly.

When to Use Binning: Binning helps when you suspect step-function relationships (e.g., insurance rates changing at age thresholds), when reducing noise in continuous variables, or when you need interpretable categorical features.
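To make the step-function point concrete, here is a minimal sketch on synthetic data (the jump points at 30 and 60 are invented for illustration): a linear model on the raw feature underfits the plateaus, while the binned one-hot version can learn one level per bin.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical step-shaped relationship: target jumps at x = 30 and x = 60
rng = np.random.default_rng(0)
x = rng.uniform(18, 80, 500).reshape(-1, 1)
y = np.select([x.ravel() < 30, x.ravel() < 60], [100, 200], default=350)
y = y + rng.normal(0, 10, 500)

# A straight line can't capture the plateaus...
r2_raw = LinearRegression().fit(x, y).score(x, y)

# ...but one-hot bins let the model fit one level per bin
x_binned = KBinsDiscretizer(n_bins=10, encode='onehot-dense',
                            strategy='uniform').fit_transform(x)
r2_binned = LinearRegression().fit(x_binned, y).score(x_binned, y)

print(f"R2 raw feature:    {r2_raw:.3f}")
print(f"R2 binned feature: {r2_binned:.3f}")
```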

Practice Questions

Task: Create custom income bins: Low (<30k), Medium (30k-70k), High (70k-150k), Very High (>150k) using pd.cut.

Show Solution
import pandas as pd
import numpy as np

incomes = pd.DataFrame({
    'income': [25000, 45000, 80000, 35000, 180000, 62000, 95000, 28000]
})

bins = [0, 30000, 70000, 150000, float('inf')]
labels = ['Low', 'Medium', 'High', 'Very High']

incomes['income_bin'] = pd.cut(incomes['income'], bins=bins, labels=labels)

print(incomes)
print("\nValue counts:")
print(incomes['income_bin'].value_counts())

Task: Create a skewed dataset and compare equal-width vs quantile binning. Show the distribution of samples in each bin for both methods.

Show Solution
import numpy as np
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

# Create skewed data (log-normal)
np.random.seed(42)
data = np.random.lognormal(mean=3, sigma=1, size=1000).reshape(-1, 1)

# Equal-width binning
uniform_binner = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')
bins_uniform = uniform_binner.fit_transform(data)

# Quantile binning
quantile_binner = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='quantile')
bins_quantile = quantile_binner.fit_transform(data)

print("Equal-width bins distribution:")
print(pd.Series(bins_uniform.flatten()).value_counts().sort_index())

print("\nQuantile bins distribution:")
print(pd.Series(bins_quantile.flatten()).value_counts().sort_index())

print("\nEqual-width edges:", uniform_binner.bin_edges_[0])
print("Quantile edges:", quantile_binner.bin_edges_[0])

Task: Create age groups: Child (0-12), Teen (13-19), Adult (20-64), Senior (65+) using pd.cut.

Show Solution
import pandas as pd

people = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'age': [8, 15, 35, 72, 45]
})

bins = [0, 12, 19, 64, float('inf')]
labels = ['Child', 'Teen', 'Adult', 'Senior']

people['age_group'] = pd.cut(people['age'], bins=bins, labels=labels)

print(people)

Task: Use KBinsDiscretizer with one-hot encoding to bin a feature. Show the resulting sparse matrix dimensions.

Show Solution
from sklearn.preprocessing import KBinsDiscretizer
import numpy as np

# Sample data
np.random.seed(42)
X = np.random.randn(100, 2) * 10 + 50

print(f"Original shape: {X.shape}")

# One-hot encoded binning — encode='onehot' would return a SciPy sparse matrix;
# 'onehot-dense' gives a plain array with the same dimensions, easier to print
binner = KBinsDiscretizer(n_bins=4, encode='onehot-dense', strategy='quantile')
X_binned = binner.fit_transform(X)

print(f"After binning: {X_binned.shape}")
print(f"Expected: 100 rows x (4 bins * 2 features) = 100 x 8")

# Show first row
print(f"\nFirst row (original): {X[0]}")
print(f"First row (binned): {X_binned[0]}")

Task: Create a pipeline that bins continuous features and applies logistic regression. Use ColumnTransformer to bin only selected columns.

Show Solution
from sklearn.preprocessing import KBinsDiscretizer, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris
import numpy as np

# Load data
X, y = load_iris(return_X_y=True)
y = (y == 2).astype(int)  # Binary classification

# Define which columns to bin vs scale
bin_columns = [0, 1]    # Bin first two features
scale_columns = [2, 3]  # Scale last two features

preprocessor = ColumnTransformer([
    ('binner', KBinsDiscretizer(n_bins=5, encode='onehot-dense', strategy='quantile'), bin_columns),
    ('scaler', StandardScaler(), scale_columns)
])

pipeline = Pipeline([
    ('preprocess', preprocessor),
    ('classifier', LogisticRegression(random_state=42))
])

scores = cross_val_score(pipeline, X, y, cv=5, scoring='accuracy')
print(f"Accuracy: {scores.mean():.4f} (+/- {scores.std()*2:.4f})")

Task: Use pd.qcut to create percentile-based bins (Bottom 25%, Middle 50%, Top 25%) for exam scores.

Show Solution
import pandas as pd
import numpy as np

np.random.seed(42)
students = pd.DataFrame({
    'student_id': range(1, 101),
    'score': np.random.normal(70, 15, 100).clip(0, 100)
})

# Using qcut with custom labels
students['percentile_group'] = pd.qcut(
    students['score'], 
    q=[0, 0.25, 0.75, 1.0],
    labels=['Bottom 25%', 'Middle 50%', 'Top 25%']
)

# Alternative: get the actual percentile values
students['percentile_rank'] = students['score'].rank(pct=True)

print(students.head(10))
print("\nDistribution:")
print(students['percentile_group'].value_counts())

Task: Create price tiers for products: Budget ($0-25), Value ($25-50), Premium ($50-100), Luxury ($100+).

Show Solution
import pandas as pd

products = pd.DataFrame({
    'product': ['Widget A', 'Widget B', 'Gadget X', 'Gadget Y', 'Device Z'],
    'price': [15.99, 45.00, 89.99, 125.00, 32.50]
})

bins = [0, 25, 50, 100, float('inf')]
labels = ['Budget', 'Value', 'Premium', 'Luxury']

products['price_tier'] = pd.cut(products['price'], bins=bins, labels=labels)

print(products)
05

Mathematical Transformations

Real-world data is often messy — values might be heavily skewed (like income, where most people earn moderate amounts but a few earn millions), or relationships might not be linear. Mathematical transformations are like adjusting the lens through which your model sees the data. By applying functions like logarithm or square root, you can make lopsided distributions more balanced, compress extreme values, and reveal linear patterns hiding in curved relationships. Many machine learning algorithms assume data is normally distributed, so these transformations can significantly boost performance!

Log Transformation

When to use: When your data is right-skewed (has a long tail to the right), like income, house prices, or population. Log transformation is like a zoom lens that compresses large values while spreading out small values, making the distribution more symmetric. Intuition: The difference between earning $50k and $60k feels bigger than between $500k and $510k, even though both are $10k — that's logarithmic thinking!

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Highly skewed data (income-like)
np.random.seed(42)
data = pd.DataFrame({
    'income': np.random.lognormal(10.5, 1, 1000)  # Right-skewed
})

# Apply log transformation
data['income_log'] = np.log1p(data['income'])  # log1p = log(1 + x), handles zeros

# Compare distributions
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

axes[0].hist(data['income'], bins=50, edgecolor='black')
axes[0].set_title(f"Original Income (skew: {data['income'].skew():.2f})")

axes[1].hist(data['income_log'], bins=50, edgecolor='black')
axes[1].set_title(f"Log Transformed (skew: {data['income_log'].skew():.2f})")

plt.tight_layout()
plt.show()

What's happening: Log transformation "squishes" large values and "stretches" small values. Think of it like plotting on a ruler where each mark represents 10x more (1, 10, 100, 1000). We use log1p (which calculates log(1+x)) instead of regular log because log(0) is undefined — adding 1 avoids this problem. The "skewness" number tells us how lopsided the distribution is: close to 0 means symmetric, positive means right-skewed (long tail on right). Notice how skewness drops dramatically after transformation!
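Both the zero-handling claim and the "logarithmic thinking" intuition are easy to check directly:

```python
import numpy as np

# log(0) is undefined (numpy returns -inf with a warning); log1p(0) is simply 0
print(np.log1p(0))  # 0.0

# Equal $10k gaps shrink as values grow — exactly the intuition above
print(np.log1p(60_000) - np.log1p(50_000))    # ~0.182
print(np.log1p(510_000) - np.log1p(500_000))  # ~0.020
```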

Power Transformations (Box-Cox & Yeo-Johnson)

The smart approach: Instead of guessing which transformation to use, let the algorithm figure it out! PowerTransformer automatically finds the best power (like square root, log, or something in between) to make your data as normal as possible. Box-Cox only works with positive values, while Yeo-Johnson handles negative values too — so Yeo-Johnson is usually the safer choice.

from sklearn.preprocessing import PowerTransformer

# Create skewed data
data = pd.DataFrame({
    'feature1': np.random.exponential(2, 500),      # Right-skewed
    'feature2': 10 - np.random.exponential(2, 500)  # Left-skewed (can be negative)
})

# Yeo-Johnson (works with negative values)
yj_transformer = PowerTransformer(method='yeo-johnson', standardize=True)
data_transformed = yj_transformer.fit_transform(data)
data_transformed = pd.DataFrame(data_transformed, columns=['feature1_yj', 'feature2_yj'])

# Compare skewness
print("Original skewness:")
print(f"  feature1: {data['feature1'].skew():.3f}")
print(f"  feature2: {data['feature2'].skew():.3f}")

print("\nAfter Yeo-Johnson:")
print(f"  feature1: {data_transformed['feature1_yj'].skew():.3f}")
print(f"  feature2: {data_transformed['feature2_yj'].skew():.3f}")

What's happening: PowerTransformer is like an automatic "normalize" button — it tests different power transformations and picks the one that makes your data closest to a normal distribution (the famous bell curve). With standardize=True, it also scales the result to have mean=0 and standard deviation=1, which is great for most ML algorithms. Notice how both features end up with skewness near zero, regardless of whether they were originally right-skewed or left-skewed!
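If you're curious which power the transformer actually chose, the fitted exponents are stored in its lambdas_ attribute (one per column) — a short sketch on similarly skewed synthetic data:

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# One right-skewed and one left-skewed column, as in the example above
rng = np.random.default_rng(42)
X = np.column_stack([
    rng.exponential(2, 500),       # right-skewed
    10 - rng.exponential(2, 500),  # left-skewed
])

pt = PowerTransformer(method='yeo-johnson', standardize=True).fit(X)

# A lambda below 1 compresses a right tail; above 1 compresses a left tail
print("Fitted lambdas:", pt.lambdas_.round(3))
```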

Quantile Transformation

Forces any distribution to follow a uniform or normal distribution:

from sklearn.preprocessing import QuantileTransformer

# Highly irregular data
data = pd.DataFrame({
    'value': np.concatenate([
        np.random.normal(10, 2, 300),
        np.random.normal(50, 5, 200)
    ])  # Bimodal distribution
})

# Transform to normal distribution (n_quantiles can't exceed the 500 samples)
qt = QuantileTransformer(output_distribution='normal', n_quantiles=500, random_state=42)
data['value_normal'] = qt.fit_transform(data[['value']])

# Transform to uniform distribution
qt_uniform = QuantileTransformer(output_distribution='uniform', n_quantiles=500, random_state=42)
data['value_uniform'] = qt_uniform.fit_transform(data[['value']])

print("Skewness after transformation:")
print(f"  Normal: {data['value_normal'].skew():.3f}")
print(f"  Uniform: {data['value_uniform'].skew():.3f}")

The sledgehammer approach: QuantileTransformer works by ranking all values, then mapping those ranks to a normal or uniform distribution. It doesn't care about your data's original shape — it WILL make it normal or uniform, even if the original is bimodal (two peaks) like our example!

  • output_distribution='normal': Forces a bell curve shape — great for algorithms that expect normally distributed features
  • output_distribution='uniform': Spreads values evenly between 0 and 1 — every range contains the same proportion of data
  • The tradeoff: You lose information about the original magnitudes. A value of $10 and $100 might end up close together if there are many values in between.

When to use: When you have really stubborn distributions that other transformations can't fix, or when you specifically need uniform/normal data and don't care about preserving exact relationships.
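A quick sketch of the rank-based behavior on the same kind of bimodal data: ordering is preserved exactly, but the outputs are squeezed into [0, 1], so original magnitudes are gone.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

# Bimodal data, as in the example above
rng = np.random.default_rng(42)
x = np.concatenate([rng.normal(10, 2, 300),
                    rng.normal(50, 5, 200)]).reshape(-1, 1)

qt = QuantileTransformer(output_distribution='uniform',
                         n_quantiles=500, random_state=42)
x_u = qt.fit_transform(x)

# Order is preserved exactly — only ranks matter, not magnitudes
assert np.array_equal(np.argsort(x.ravel()), np.argsort(x_u.ravel()))
print("Uniform output range:", x_u.min(), "to", x_u.max())
```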

Choosing Transformations:
  • Log: Right-skewed positive data (income, prices, counts)
  • Square root: Moderately skewed counts or variances
  • Box-Cox: Positive data, automatic optimal power
  • Yeo-Johnson: Any data including negatives
  • Quantile: Force any distribution to normal/uniform
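These rules of thumb can be folded into a tiny helper — suggest_transform is a hypothetical convenience function (not a library API), and the 0.5/1.0 skewness cutoffs are common conventions rather than hard rules:

```python
import numpy as np
import pandas as pd

def suggest_transform(s: pd.Series) -> str:
    """Hypothetical rule-of-thumb helper (not a library function)."""
    skew = s.skew()
    if abs(skew) < 0.5:
        return "none needed (roughly symmetric)"
    if (s <= 0).any():
        return "yeo-johnson (handles zeros/negatives)"
    return "log" if skew > 1 else "sqrt"

rng = np.random.default_rng(0)
print(suggest_transform(pd.Series(rng.lognormal(3, 1, 1000))))  # -> log
print(suggest_transform(pd.Series(rng.normal(0, 1, 1000))))     # -> none needed (roughly symmetric)
```

Always confirm the suggestion with cross-validation — skewness alone doesn't guarantee a transformation helps your model.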

Practice Questions

Task: Compare linear regression performance on raw vs log-transformed target variable using the diabetes dataset.

Show Solution
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
import numpy as np

X, y = load_diabetes(return_X_y=True)

# Original target
model = LinearRegression()
scores_original = cross_val_score(model, X, y, cv=5, scoring='r2')
print(f"Original target R²: {scores_original.mean():.4f}")

# Log-transformed target
y_log = np.log1p(y)
scores_log = cross_val_score(model, X, y_log, cv=5, scoring='r2')
print(f"Log-transformed target R²: {scores_log.mean():.4f}")

# Note: R² on different scales aren't directly comparable!
# Better to compare predictions transformed back to original scale

Task: Apply log transformation to a right-skewed income column. Use np.log1p and show skewness before and after.

Show Solution
import pandas as pd
import numpy as np

# Right-skewed income data
np.random.seed(42)
df = pd.DataFrame({
    'income': np.random.lognormal(10, 1, 1000)
})

print(f"Original skewness: {df['income'].skew():.3f}")
print(f"Original range: ${df['income'].min():,.0f} - ${df['income'].max():,.0f}")

# Apply log transformation
df['income_log'] = np.log1p(df['income'])

print(f"\nAfter log transform skewness: {df['income_log'].skew():.3f}")
print(f"Log range: {df['income_log'].min():.2f} - {df['income_log'].max():.2f}")

Task: Create a dataset with both positive and negative values. Show that Box-Cox fails but Yeo-Johnson works.

Show Solution
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Data with negative values
np.random.seed(42)
data = np.random.randn(100, 1) * 10  # Contains negatives

print(f"Data range: {data.min():.2f} to {data.max():.2f}")
print(f"Contains negatives: {(data < 0).sum()} values")

# Try Box-Cox (will fail with negative values)
try:
    bc = PowerTransformer(method='box-cox')
    bc.fit_transform(data)
    print("Box-Cox: Success")
except ValueError as e:
    print(f"Box-Cox: Failed - {e}")

# Yeo-Johnson works with negatives
yj = PowerTransformer(method='yeo-johnson')
data_yj = yj.fit_transform(data)
print(f"\nYeo-Johnson: Success!")
print(f"Transformed range: {data_yj.min():.2f} to {data_yj.max():.2f}")

Task: Apply square root transformation to count data (number of website visits). Compare with log transformation.

Show Solution
import pandas as pd
import numpy as np

# Count data (Poisson-like distribution)
np.random.seed(42)
df = pd.DataFrame({
    'visits': np.random.poisson(lam=10, size=1000)
})

print(f"Original skewness: {df['visits'].skew():.3f}")

# Square root transformation
df['visits_sqrt'] = np.sqrt(df['visits'])
print(f"After sqrt skewness: {df['visits_sqrt'].skew():.3f}")

# Log transformation
df['visits_log'] = np.log1p(df['visits'])
print(f"After log skewness: {df['visits_log'].skew():.3f}")

# For moderate skew, sqrt often works better than log
print("\nSquare root is often better for count data with moderate skew")

Task: Train a model on log-transformed target, make predictions, then inverse transform back to original scale. Calculate RMSE on original scale.

Show Solution
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Create skewed target data
np.random.seed(42)
X = np.random.randn(200, 5)
y = np.exp(1 + X @ np.array([0.5, -0.3, 0.2, 0.4, -0.1]) + np.random.randn(200) * 0.5)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Transform target
y_train_log = np.log(y_train)

# Train on log scale
model = LinearRegression()
model.fit(X_train, y_train_log)

# Predict on log scale
y_pred_log = model.predict(X_test)

# Inverse transform to original scale
y_pred = np.exp(y_pred_log)

# Calculate RMSE on original scale
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"RMSE on original scale: {rmse:.4f}")

# Compare: train without transformation
model_raw = LinearRegression().fit(X_train, y_train)
y_pred_raw = model_raw.predict(X_test)
rmse_raw = np.sqrt(mean_squared_error(y_test, y_pred_raw))
print(f"RMSE without transformation: {rmse_raw:.4f}")

Task: Use MinMaxScaler to scale features to the [0, 1] range. Show original and transformed values.

Show Solution
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({
    'age': [25, 35, 45, 55, 65],
    'income': [30000, 50000, 75000, 100000, 150000],
    'score': [60, 75, 80, 85, 95]
})

print("Original data:")
print(df)

scaler = MinMaxScaler()
df_scaled = pd.DataFrame(
    scaler.fit_transform(df),
    columns=df.columns
)

print("\nScaled to [0, 1]:")
print(df_scaled)

Task: Compare model performance using: (1) StandardScaler, (2) QuantileTransformer (normal), (3) PowerTransformer. Use cross-validation.

Show Solution
from sklearn.preprocessing import StandardScaler, QuantileTransformer, PowerTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.datasets import fetch_california_housing

# Load data with skewed features
X, y = fetch_california_housing(return_X_y=True)

transformers = {
    'StandardScaler': StandardScaler(),
    'QuantileTransformer': QuantileTransformer(output_distribution='normal', random_state=42),
    'PowerTransformer': PowerTransformer(method='yeo-johnson')
}

for name, transformer in transformers.items():
    pipeline = Pipeline([
        ('transform', transformer),
        ('model', Ridge(alpha=1.0))
    ])
    
    scores = cross_val_score(pipeline, X, y, cv=5, scoring='r2')
    print(f"{name}: R² = {scores.mean():.4f} (+/- {scores.std()*2:.4f})")

Key Takeaways

Domain Features Are Most Powerful

Features based on domain knowledge (ratios, aggregations, business logic) often outperform automated methods. Consult domain experts when possible.
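As a minimal sketch of ratio features (the column names and values here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'debt': [12000, 5000, 30000],
    'income': [60000, 50000, 75000],
    'total_price': [120.0, 45.0, 300.0],
    'n_items': [3, 1, 10],
})

# Ratio features encode domain knowledge directly:
# a lender cares about debt relative to income, not raw debt;
# a retailer cares about unit price, not just order total.
df['debt_to_income'] = df['debt'] / df['income']
df['price_per_item'] = df['total_price'] / df['n_items']
print(df[['debt_to_income', 'price_per_item']])
```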

Polynomial Features for Non-linearity

Use PolynomialFeatures to capture curved relationships. Keep degree ≤ 3 to avoid overfitting. Use interaction_only=True to limit feature explosion.

Extract Rich Time Features

Datetime columns yield many features: components (hour, day), cyclical encoding (sin/cos), recency (days since), and indicators (is_weekend, is_holiday).
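A minimal sketch of cyclical encoding for hour of day, which places 23:00 and 00:00 next to each other on a circle instead of 23 units apart:

```python
import numpy as np
import pandas as pd

ts = pd.to_datetime(['2024-03-15 23:30', '2024-03-16 00:30'])
df = pd.DataFrame({'hour': ts.hour})

# Map the 24-hour clock onto the unit circle
df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)
print(df)
```

The raw `hour` column makes 23 and 0 look maximally distant; the (sin, cos) pair makes them nearly identical, which matches how time actually behaves.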

Binning Reduces Noise

Use KBinsDiscretizer for automatic binning or pd.cut() for custom, domain-based bins. Quantile binning handles skewed data better than equal-width bins because each bin receives roughly the same number of samples.

Transform Skewed Distributions

Log transform for right-skewed positive data. Use PowerTransformer (Yeo-Johnson) for automatic optimal transformation. QuantileTransformer forces any distribution to normal.

Always Validate Features

Use cross-validation to check if new features actually improve model performance. More features aren't always better—they can cause overfitting.

Knowledge Check

1 What does PolynomialFeatures(degree=2, interaction_only=True) generate for features [a, b]?
2 Why use cyclical encoding (sin/cos) for hour of day?
3 Which binning strategy ensures equal sample counts in each bin?
4 Which transformation is best for right-skewed positive data like income?
5 What advantage does Yeo-Johnson have over Box-Cox transformation?
6 Which pandas accessor is used to extract datetime components?