Logistic Regression
Despite its name, Logistic Regression is a classification algorithm, not a regression one. It predicts the probability of an outcome belonging to a specific category - making it the foundation for understanding classification in machine learning.
What is Logistic Regression?
While linear regression predicts continuous values (like house prices or temperature), logistic regression predicts probabilities that get mapped to categories (like "spam" or "not spam"). Think of it as answering the question: "What's the chance this email is spam?" instead of "What's the temperature?"
It uses the sigmoid function (also called logistic function) to transform any number into a probability between 0 and 1. For example, if a linear calculation gives us 2.5 or -3.7, the sigmoid function squashes it to values like 0.92 or 0.02 - perfect for representing probabilities!
Logistic Regression
A statistical model that predicts the probability of a binary outcome using the logistic (sigmoid) function. Despite its name, it's used for classification, not regression.
Sigmoid Function: σ(z) = 1 / (1 + e⁻ᶻ) maps any input to a value between 0 and 1.
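The squashing behavior described above is easy to verify numerically. A minimal sketch:

```python
import numpy as np

def sigmoid(z):
    """Map any real number to a probability in (0, 1)."""
    return 1 / (1 + np.exp(-z))

print(sigmoid(0.0))   # 0.5 — the decision boundary
print(sigmoid(2.5))   # ~0.924
print(sigmoid(-3.7))  # ~0.024
```

No matter how large or small the input, the output stays strictly between 0 and 1, which is exactly what a probability requires.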
How It Works
Logistic regression works in three simple steps. Let's say we're predicting if a customer will buy a product:

1. Linear combination: z = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ
   First, multiply each feature (age, income, etc.) by a weight and sum them up. This gives us a score that can be any number from -∞ to +∞.
2. Sigmoid transformation: P(y=1) = 1 / (1 + e⁻ᶻ)
   Then, pass this score through the sigmoid function, which converts it to a probability between 0 and 1. A score of 0 becomes a 0.5 probability; positive scores map to higher probabilities, negative scores to lower ones.
3. Classification decision: If P ≥ 0.5 → Class 1, else → Class 0
   Finally, if the probability is 50% or higher, we predict "Yes" (Class 1), otherwise we predict "No" (Class 0). You can adjust this threshold based on your needs!
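The three steps above can be sketched end to end in a few lines. The weights and feature values here are hypothetical, chosen only to illustrate the arithmetic:

```python
import numpy as np

# Hypothetical learned weights and one customer's (already scaled) features
beta_0 = -1.0                 # intercept
betas = np.array([0.8, 1.2])  # weights for two features
x = np.array([0.5, 1.0])      # e.g. scaled age and income

# Step 1: linear combination → a score anywhere on the real line
z = beta_0 + betas @ x        # -1.0 + 0.4 + 1.2 = 0.6

# Step 2: sigmoid transformation → a probability
p = 1 / (1 + np.exp(-z))      # ~0.646

# Step 3: classification decision at the 0.5 threshold
prediction = int(p >= 0.5)    # 1 → "will buy"
print(f"z={z:.2f}, P(buy)={p:.3f}, class={prediction}")
```

Raising the threshold (say to 0.7) makes the "Yes" prediction more conservative without retraining anything.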
When to Use Logistic Regression
- Binary classification problems
  Like spam/not spam, fraud/legitimate, pass/fail, buy/not buy. Works best when you need to classify into two distinct groups.
- Need probability estimates
  Unlike some algorithms that only give yes/no, logistic regression provides probabilities (e.g., "75% chance of spam"). Useful for ranking or setting custom thresholds.
- Want interpretable coefficients
  Each feature has a weight you can interpret: a positive coefficient increases the predicted probability, a negative one decreases it. Great when you need to explain decisions to stakeholders.
- Linear decision boundary works
  When classes can be roughly separated by a straight line (or hyperplane). If the data requires complex curves, consider trees or an SVM with an RBF kernel instead.
- Baseline model for comparison
  Fast to train, easy to implement. Always start here before trying complex models - you might be surprised how well it works!
Limitations
- Assumes linear decision boundary
  Can't learn XOR patterns or complex curves. If your data looks like concentric circles or spirals, logistic regression will struggle.
- Struggles with complex patterns
  Can't automatically discover feature interactions (e.g., "high income AND young age"). You'd need to manually create interaction features.
- Sensitive to outliers
  Extreme values can pull the decision boundary in the wrong direction. Consider removing outliers or using robust preprocessing.
- Requires feature scaling
  Features with larger ranges dominate the model. Always use StandardScaler or MinMaxScaler before training; age (0-100) and salary ($20K-$200K) need the same scale.
- Can't capture feature interactions
  Treats each feature independently. If "young + tech-savvy" together matters more than each feature separately, you need to create interaction features manually.
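Creating an interaction feature by hand is a one-liner. A minimal sketch, using hypothetical column names (`age`, `tech_savvy`) and an assumed age cutoff of 30:

```python
import pandas as pd

df = pd.DataFrame({
    'age': [22, 24, 45, 50],
    'tech_savvy': [1, 1, 0, 1],
})

# Manually add an interaction term so the model can weight
# "young AND tech-savvy" beyond what the two features contribute separately
df['young_and_tech'] = ((df['age'] < 30) & (df['tech_savvy'] == 1)).astype(int)
print(df)
```

The new column can then be passed to LogisticRegression alongside the original features.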
Implementation in Python
Let's implement logistic regression using scikit-learn. We'll use a classic example: predicting whether a customer will purchase a product based on their age and estimated salary.
STEP 1: Import Libraries and Create Sample Data
# Import required libraries
import numpy as np # For numerical operations
import pandas as pd # For data manipulation and analysis
from sklearn.model_selection import train_test_split # Split data into train/test
from sklearn.preprocessing import StandardScaler # Scale features to similar ranges
from sklearn.linear_model import LogisticRegression # Our classification algorithm
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report # Evaluation tools
# Sample data: Age, Salary → Purchased (0/1)
np.random.seed(42) # Set random seed for reproducibility
n_samples = 200 # Create 200 customer records
# Generate random ages between 18 and 65
age = np.random.randint(18, 65, n_samples)
# Generate random salaries between $20,000 and $150,000
salary = np.random.randint(20000, 150000, n_samples)
# Create target variable with a pattern:
# Higher age (>35) AND higher salary (>60000) → more likely to purchase
purchased = ((age > 35) & (salary > 60000)).astype(int) # 1 if both conditions true, else 0
# Add some noise (randomness) to make it realistic
# Randomly flip 20 labels to simulate real-world unpredictability
noise_idx = np.random.choice(n_samples, 20, replace=False)
purchased[noise_idx] = 1 - purchased[noise_idx] # Flip 0→1 and 1→0
# Create DataFrame for better visualization
df = pd.DataFrame({'Age': age, 'Salary': salary, 'Purchased': purchased})
print("First 5 customers:")
print(df.head())
print(f"\nPurchased rate: {purchased.sum() / len(purchased):.1%}") # Show class balance
We generate synthetic customer data with age and salary as features, and purchased (0 or 1) as the target. The pattern: customers over 35 with salary above $60,000 are more likely to purchase, with some random noise added.
STEP 2: Prepare and Split Data
# Prepare features (X) and target (y)
X = df[['Age', 'Salary']] # Features: what we use to make predictions
y = df['Purchased'] # Target: what we want to predict (0 or 1)
# Split data: 80% for training, 20% for testing
# WHY? We train on 80% and test on the unseen 20% to measure real-world performance
X_train, X_test, y_train, y_test = train_test_split(
X, y,
test_size=0.2, # 20% for testing
random_state=42, # Same seed = same split every time (reproducible results)
stratify=y # Keep same class ratio in train and test sets
)
# Scale features - CRITICAL for logistic regression!
# WHY? Age (18-65) and Salary (20000-150000) have very different ranges
# Without scaling, salary would dominate because its numbers are much larger
scaler = StandardScaler() # Converts each feature to mean=0, std=1
# fit_transform on training data: learn mean & std, then scale
X_train_scaled = scaler.fit_transform(X_train)
# transform on test data: use training mean & std to scale
# NEVER fit on test data! That would be cheating (data leakage)
X_test_scaled = scaler.transform(X_test)
print(f"Training samples: {len(X_train)} ({len(X_train)/len(X)*100:.0f}%)")
print(f"Test samples: {len(X_test)} ({len(X_test)/len(X)*100:.0f}%)")
print(f"\nBefore scaling - Age range: {X_train['Age'].min()}-{X_train['Age'].max()}")
print(f"Before scaling - Salary range: ${X_train['Salary'].min():,}-${X_train['Salary'].max():,}")
print(f"\nAfter scaling - Both features now have similar range around 0!")
We separate features (X) from target (y), split into train/test sets, and importantly scale the features. StandardScaler converts both Age and Salary to similar ranges (mean=0, std=1), preventing the larger salary values from dominating the model.
STEP 3: Train and Evaluate Model
# Create the logistic regression model
# Default parameters are usually good to start with
model = LogisticRegression(
random_state=42, # For reproducible results
max_iter=1000 # Maximum iterations to find optimal weights
)
# Train the model on scaled training data
# This is where the algorithm learns the relationship between features and target
model.fit(X_train_scaled, y_train)
# Make predictions on test data
y_pred = model.predict(X_test_scaled) # Binary predictions: 0 or 1
y_prob = model.predict_proba(X_test_scaled) # Probability estimates: [P(class 0), P(class 1)]
# Evaluate performance
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2%}") # What percentage did we get right?
print(f"\nThis means we correctly predicted {accuracy:.1%} of customers!")
# Detailed classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['Not Purchased', 'Purchased']))
print("\nMetrics Explained:")
print(" • Precision: Of all we predicted 'Purchased', how many actually purchased?")
print(" • Recall: Of all who actually purchased, how many did we catch?")
print(" • F1-score: Harmonic mean of precision and recall (balanced metric)")
print(" • Support: Number of actual occurrences in test set")
The model learns the relationship between age/salary and purchase probability. predict() gives
binary predictions, while predict_proba() provides probability estimates for each class.
Understanding the Output
# Model coefficients show feature importance and direction
print("=== Model Coefficients (Weights) ===")
for feature, coef in zip(['Age', 'Salary'], model.coef_[0]):
direction = "increases" if coef > 0 else "decreases"
print(f" {feature}: {coef:.4f}")
print(f" → Each unit increase in {feature} {direction} purchase probability")
print(f" → {'Strong' if abs(coef) > 1 else 'Moderate' if abs(coef) > 0.5 else 'Weak'} effect\n")
print(f"Intercept (bias): {model.intercept_[0]:.4f}")
print(" → This is the baseline log-odds when all features are 0\n")
# Predict probabilities for a new customer
print("=== Example Prediction ===")
new_customer = [[45, 75000]] # 45 years old, $75,000 salary
new_customer_scaled = scaler.transform(new_customer) # Must scale new data too!
probability = model.predict_proba(new_customer_scaled)[0]
prediction = model.predict(new_customer_scaled)[0]
print(f"New customer: Age=45, Salary=$75,000")
print(f"\nProbabilities:")
print(f" • P(Not Purchase) = {probability[0]:.1%}")
print(f" • P(Purchase) = {probability[1]:.1%}")
print(f"\nPrediction: {'Will Purchase ✓' if prediction == 1 else 'Will Not Purchase ✗'}")
print(f"Confidence: {max(probability):.1%}")
# Let's test a few more scenarios
print("\n=== Testing Different Customer Profiles ===")
test_profiles = [
[25, 35000, "Young & Low Income"],
[30, 90000, "Young & High Income"],
[50, 45000, "Older & Low Income"],
[55, 120000, "Older & High Income"]
]
for age, salary, desc in test_profiles:
profile_scaled = scaler.transform([[age, salary]])
prob = model.predict_proba(profile_scaled)[0][1] # Probability of purchase
print(f"{desc:25} → {prob:.1%} chance of purchase")
Multi-class Classification
# Logistic Regression with multiple classes
from sklearn.datasets import load_iris
# Load Iris dataset (3 classes)
iris = load_iris()
X_iris, y_iris = iris.data, iris.target
# Split and scale
X_train_i, X_test_i, y_train_i, y_test_i = train_test_split(
X_iris, y_iris, test_size=0.2, random_state=42
)
# Multi-class logistic regression: with the lbfgs solver, scikit-learn
# uses the multinomial (softmax) formulation by default
multi_model = LogisticRegression(
solver='lbfgs', # Supports the multinomial loss
max_iter=1000,
random_state=42
)
multi_model.fit(X_train_i, y_train_i)
print(f"Multi-class Accuracy: {multi_model.score(X_test_i, y_test_i):.2%}")
Practice Questions: Logistic Regression
Test your understanding with these hands-on exercises.
Question: Create a logistic regression model to predict whether a student passes (1) or fails (0) based on hours studied and previous test score. Use the sample data below.
Solution:
# STEP 1: Create sample student data
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
data = {
'hours_studied': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'previous_score': [45, 50, 55, 60, 65, 70, 75, 80, 85, 90],
'passed': [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
}
df = pd.DataFrame(data)
# STEP 2: Prepare and split data
X = df[['hours_studied', 'previous_score']]
y = df['passed']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# STEP 3: Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# STEP 4: Train and evaluate
model = LogisticRegression()
model.fit(X_train_scaled, y_train)
print(f"Accuracy: {model.score(X_test_scaled, y_test):.2%}")
# STEP 5: Make a prediction for a new student
new_student = [[5, 68]] # 5 hours studied, 68 previous score
new_student_scaled = scaler.transform(new_student)
prob = model.predict_proba(new_student_scaled)[0]
print(f"\nNew student - Pass probability: {prob[1]:.2%}")
Question: Modify the LogisticRegression model to use stronger L2 regularization (C=0.1) and compare the results with the default regularization (C=1.0).
Solution:
# STEP 1: Train with default regularization
model_default = LogisticRegression(C=1.0, random_state=42)
model_default.fit(X_train_scaled, y_train)
default_acc = model_default.score(X_test_scaled, y_test)
# STEP 2: Train with stronger regularization
# C is inverse of regularization strength: smaller C = stronger regularization
model_regularized = LogisticRegression(
C=0.1, # Stronger regularization
penalty='l2', # L2 regularization (default)
random_state=42
)
model_regularized.fit(X_train_scaled, y_train)
reg_acc = model_regularized.score(X_test_scaled, y_test)
# STEP 3: Compare results
print(f"Default (C=1.0): {default_acc:.2%}")
print(f"Regularized (C=0.1): {reg_acc:.2%}")
print(f"\nDefault coefficients: {model_default.coef_[0]}")
print(f"Regularized coefficients: {model_regularized.coef_[0]}")
print("\nNote: Regularization shrinks coefficients, reducing overfitting!")
Question: Instead of the default 0.5 threshold, classify as positive only if probability ≥ 0.7. Compare precision/recall with different thresholds.
Solution:
from sklearn.metrics import classification_report
# STEP 1: Get probability predictions
y_prob = model.predict_proba(X_test_scaled)[:, 1] # Probability of class 1
# STEP 2: Default threshold (0.5)
y_pred_default = model.predict(X_test_scaled)
print("=== Default Threshold (0.5) ===")
print(classification_report(y_test, y_pred_default))
# STEP 3: Custom threshold (0.7) - more conservative
threshold = 0.7
y_pred_custom = (y_prob >= threshold).astype(int)
print("\n=== Custom Threshold (0.7) ===")
print(classification_report(y_test, y_pred_custom))
# STEP 4: Analyze trade-offs
print("\nInsight: Higher threshold means:")
print(" - Fewer positive predictions (more conservative)")
print(" - Higher precision (more confident when saying 'positive')")
print(" - Lower recall (miss some true positives)")
print(" - Use when false positives are costly!")
Question: Build a multi-class logistic regression classifier for the Iris dataset. Use 5-fold cross-validation to evaluate performance and tune the C parameter.
Solution:
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, GridSearchCV
import numpy as np
# STEP 1: Load Iris data (3 classes)
iris = load_iris()
X, y = iris.data, iris.target
# STEP 2: Test multi-class logistic regression with cross-validation
# Softmax (multinomial) regression is the default with the lbfgs solver
model = LogisticRegression(
solver='lbfgs',
max_iter=1000,
random_state=42
)
scores = cross_val_score(model, X, y, cv=5)
print(f"Cross-validation scores: {scores}")
print(f"Mean accuracy: {scores.mean():.2%} (+/- {scores.std():.2%})")
# STEP 3: Tune C parameter with GridSearchCV
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100]}
grid_search = GridSearchCV(model, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X, y)
# STEP 4: Display results
print(f"\nBest C value: {grid_search.best_params_['C']}")
print(f"Best CV accuracy: {grid_search.best_score_:.2%}")
# STEP 5: Train final model and show per-class performance
best_model = grid_search.best_estimator_
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
best_model.fit(X_train, y_train)
y_pred = best_model.predict(X_test)
print("\n=== Per-Class Performance ===")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
Decision Trees & Random Forests
Decision Trees are intuitive, interpretable models that make decisions by asking a series of yes/no questions. Random Forests combine multiple trees to create a powerful, robust ensemble that often outperforms individual models.
How Decision Trees Work
Imagine you're trying to decide if someone will like a movie. You might ask: "Do they like action movies?" If yes, you ask: "Are they over 25?" If yes again, you predict they'll like it. This is exactly how a decision tree works - it asks a series of yes/no questions to reach a decision.
A decision tree splits data based on feature values, creating a tree-like structure. At each node (decision point), it asks a question like "Is age > 30?" or "Is income > $50,000?" and branches left or right accordingly. It keeps asking questions until it reaches a leaf node (final answer) where it makes a prediction. The tree automatically learns which questions to ask and in what order by analyzing the training data!
Decision Tree
A flowchart-like model that makes predictions by learning simple decision rules from data features. Each internal node represents a "test" on a feature, each branch represents the outcome, and each leaf node holds a class label.
Key metrics: Gini Impurity and Entropy measure how "pure" each node is. The algorithm chooses splits that maximize purity (minimize impurity).
Gini Impurity
Simple Explanation: Measures how "mixed up" the classes are at a node.
Gini = 1 - Σ(pᵢ)²
Where pᵢ is the probability of class i. Gini = 0 means all samples belong to the same class (perfectly pure). Gini = 0.5 means classes are evenly mixed (most impure). The tree tries to minimize Gini impurity at each split.
Entropy
Simple Explanation: Measures disorder or randomness in the data.
Entropy = -Σ(pᵢ × log₂(pᵢ))
Entropy = 0 means all samples are the same class (ordered). High entropy means classes are randomly mixed (disordered). Information Gain measures how much a split reduces entropy - the tree picks splits with highest information gain.
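Both formulas above are simple enough to compute directly. A minimal sketch that checks the pure and maximally mixed cases for a binary node:

```python
import numpy as np

def gini(p):
    """Gini impurity: 1 - sum(p_i^2) over the class probabilities p."""
    p = np.asarray(p)
    return 1 - np.sum(p ** 2)

def entropy(p):
    """Entropy: -sum(p_i * log2(p_i)), treating 0 * log(0) as 0."""
    p = np.asarray(p)
    p = p[p > 0]  # skip empty classes to avoid log(0)
    return -np.sum(p * np.log2(p))

print(gini([1.0, 0.0]))     # 0.0 — pure node, all samples one class
print(gini([0.5, 0.5]))     # 0.5 — maximally mixed (binary case)
print(entropy([0.5, 0.5]))  # 1.0 — maximum disorder (binary case)
```

A split's information gain is just the parent's impurity minus the weighted average impurity of its children.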
Decision Tree Implementation
STEP 1: Load Data and Train Decision Tree
# Decision Tree for Classification
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import pandas as pd # Used below for the feature-importance table
# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target
feature_names = iris.feature_names
class_names = iris.target_names
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Create and train Decision Tree
tree_model = DecisionTreeClassifier(
max_depth=3, # Limit depth to prevent overfitting
criterion='gini', # 'gini' or 'entropy'
random_state=42
)
tree_model.fit(X_train, y_train)
print(f"Training Accuracy: {tree_model.score(X_train, y_train):.2%}")
print(f"Test Accuracy: {tree_model.score(X_test, y_test):.2%}")
We limit max_depth=3 to prevent overfitting. The tree learns rules like "if petal_width ≤ 0.8 then Setosa"
by choosing splits that maximize information gain or minimize Gini impurity at each node.
STEP 2: Visualize the Tree Structure
# Visualize the Decision Tree
plt.figure(figsize=(20, 10))
plot_tree(
tree_model,
feature_names=feature_names,
class_names=class_names,
filled=True,
rounded=True,
fontsize=10
)
plt.title("Decision Tree for Iris Classification")
plt.tight_layout()
plt.show()
# Feature importance - which features matter most?
importance = pd.DataFrame({
'Feature': feature_names,
'Importance': tree_model.feature_importances_
}).sort_values('Importance', ascending=False)
print("Feature Importance:")
print(importance.to_string(index=False))
If training accuracy is much higher than test accuracy, the tree is overfitting; rein it in with max_depth, min_samples_split, or pruning.
Random Forests: The Power of Ensembles
A Random Forest is like asking 100 different experts for their opinion and taking a vote, instead of trusting just one expert. It builds many decision trees (often 100-500 trees) and combines their predictions. Each tree is trained on a different random subset of data (called bagging or bootstrap sampling) and considers different random subsets of features at each split. This randomness makes each tree unique and prevents them from all making the same mistakes.
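The bootstrap-and-vote idea can be sketched by hand before reaching for the built-in class. This is only an illustration of the mechanism (25 trees on Iris, each seeing a bootstrap sample and a random subset of features per split):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(42)

# Bagging by hand: each tree trains on a bootstrap sample (with replacement)
trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeClassifier(max_features='sqrt', random_state=0)
    trees.append(tree.fit(X[idx], y[idx]))

# Majority vote across all trees
votes = np.array([t.predict(X[:5]) for t in trees])  # shape (25, 5)
majority = np.array([np.bincount(col).argmax() for col in votes.T])
print(majority)  # ensemble prediction for the first 5 samples
```

In practice you would use RandomForestClassifier, which does all of this (plus out-of-bag scoring and parallelism) for you.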
# Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier
# Create Random Forest with 100 trees
rf_model = RandomForestClassifier(
n_estimators=100, # Number of trees
max_depth=5, # Depth per tree
min_samples_split=5, # Min samples to split
random_state=42,
n_jobs=-1 # Use all CPU cores
)
rf_model.fit(X_train, y_train)
print(f"Random Forest Training Accuracy: {rf_model.score(X_train, y_train):.2%}")
print(f"Random Forest Test Accuracy: {rf_model.score(X_test, y_test):.2%}")
# Compare single tree vs Random Forest
from sklearn.metrics import classification_report
# Single tree predictions
tree_pred = tree_model.predict(X_test)
# Random Forest predictions
rf_pred = rf_model.predict(X_test)
print("=== Single Decision Tree ===")
print(classification_report(y_test, tree_pred, target_names=class_names))
print("\n=== Random Forest (100 Trees) ===")
print(classification_report(y_test, rf_pred, target_names=class_names))
# Random Forest Feature Importance
rf_importance = pd.DataFrame({
'Feature': feature_names,
'Importance': rf_model.feature_importances_
}).sort_values('Importance', ascending=False)
print("Random Forest Feature Importance:")
print(rf_importance.to_string(index=False))
# Random Forest importance is more reliable than single tree
# because it's averaged across many trees
Key Hyperparameters
| Parameter | Description | Default | Tuning Tip |
|---|---|---|---|
| n_estimators | Number of trees in the forest | 100 | More trees = better, but slower. 100-500 is usually good. |
| max_depth | Maximum depth of each tree | None | Limit to 5-15 to prevent overfitting. |
| min_samples_split | Minimum samples to split a node | 2 | Higher values (5-10) add regularization. |
| max_features | Features to consider at each split | 'sqrt' | 'sqrt' or 'log2' for diversity between trees. |
Practice Questions: Trees & Forests
Practice building and optimizing tree-based models.
Question: Create a decision tree classifier for the Iris dataset, limit max_depth to 3, and extract the feature importances.
Solution:
# STEP 1: Load data and create decision tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import pandas as pd
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# STEP 2: Train decision tree with limited depth
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)
print(f"Test Accuracy: {tree.score(X_test, y_test):.2%}")
# STEP 3: Extract feature importances
importances = pd.DataFrame({
'Feature': iris.feature_names,
'Importance': tree.feature_importances_
}).sort_values('Importance', ascending=False)
print("\nFeature Importances:")
print(importances.to_string(index=False))
# STEP 4: Understand the output
print("\nThe most important feature has the highest impact on predictions!")
Question: Use GridSearchCV to find the optimal combination of n_estimators, max_depth, and min_samples_split for a Random Forest classifier.
Solution:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
# STEP 1: Define parameter grid
param_grid = {
'n_estimators': [50, 100, 200],
'max_depth': [3, 5, 7, 10],
'min_samples_split': [2, 5, 10]
}
# STEP 2: Perform grid search
grid_search = GridSearchCV(
RandomForestClassifier(random_state=42),
param_grid,
cv=5,
scoring='accuracy',
n_jobs=-1, # Use all CPU cores
verbose=1
)
grid_search.fit(X_train, y_train)
# STEP 3: Display results
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.2%}")
print(f"Test score: {grid_search.score(X_test, y_test):.2%}")
# STEP 4: Get feature importances from best model
best_rf = grid_search.best_estimator_
feature_imp = pd.DataFrame({
'Feature': iris.feature_names,
'Importance': best_rf.feature_importances_
}).sort_values('Importance', ascending=False)
print("\nOptimal Random Forest Feature Importances:")
print(feature_imp.to_string(index=False))
Question: Train both a single decision tree and a random forest on the same data, then compare their training vs test accuracy to demonstrate overfitting reduction.
Solution:
# STEP 1: Train decision tree without depth limit (will overfit)
tree_overfit = DecisionTreeClassifier(random_state=42) # No max_depth
tree_overfit.fit(X_train, y_train)
# STEP 2: Train random forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
# STEP 3: Compare training vs test accuracy
print("=== Single Decision Tree ===")
print(f"Training Accuracy: {tree_overfit.score(X_train, y_train):.2%}")
print(f"Test Accuracy: {tree_overfit.score(X_test, y_test):.2%}")
print(f"Overfitting gap: {tree_overfit.score(X_train, y_train) - tree_overfit.score(X_test, y_test):.2%}")
print("\n=== Random Forest ===")
print(f"Training Accuracy: {rf.score(X_train, y_train):.2%}")
print(f"Test Accuracy: {rf.score(X_test, y_test):.2%}")
print(f"Overfitting gap: {rf.score(X_train, y_train) - rf.score(X_test, y_test):.2%}")
print("\nInsight: Random Forest has smaller train-test gap = less overfitting!")
Question: Use cost-complexity pruning (ccp_alpha) to find the optimal tree complexity that balances bias and variance.
Solution:
import matplotlib.pyplot as plt
# STEP 1: Get pruning path
tree_full = DecisionTreeClassifier(random_state=42)
path = tree_full.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas[:-1] # Remove max alpha
# STEP 2: Train trees with different alpha values
train_scores = []
test_scores = []
for ccp_alpha in ccp_alphas:
tree = DecisionTreeClassifier(ccp_alpha=ccp_alpha, random_state=42)
tree.fit(X_train, y_train)
train_scores.append(tree.score(X_train, y_train))
test_scores.append(tree.score(X_test, y_test))
# STEP 3: Find optimal alpha
best_idx = test_scores.index(max(test_scores))
best_alpha = ccp_alphas[best_idx]
print(f"Optimal ccp_alpha: {best_alpha:.4f}")
print(f"Best test accuracy: {max(test_scores):.2%}")
# STEP 4: Visualize pruning effect
plt.figure(figsize=(10, 5))
plt.plot(ccp_alphas, train_scores, marker='o', label='Train', drawstyle="steps-post")
plt.plot(ccp_alphas, test_scores, marker='o', label='Test', drawstyle="steps-post")
plt.axvline(best_alpha, color='r', linestyle='--', label=f'Best alpha={best_alpha:.4f}')
plt.xlabel('ccp_alpha (complexity parameter)')
plt.ylabel('Accuracy')
plt.title('Pruning Path: Finding Optimal Tree Complexity')
plt.legend()
plt.grid(True)
plt.show()
print("\nAs alpha increases, tree becomes simpler (less overfitting but may underfit)")
Support Vector Machines (SVM)
SVM is a powerful classification algorithm that finds the optimal hyperplane separating different classes. It excels in high-dimensional spaces and can handle non-linear boundaries using the "kernel trick."
The Intuition Behind SVM
Imagine you have red dots and blue dots on a piece of paper, and you need to draw a line to separate them. There are many lines you could draw, but SVM finds the best line - the one with the most "breathing room" on both sides. This "breathing room" is called the margin.
Think of it like drawing a road between two neighborhoods. SVM makes the road as wide as possible while still separating the neighborhoods. The houses closest to the road (called support vectors) determine where the road goes - that's why they're so important! In higher dimensions (more than 2 features), the "line" becomes a hyperplane, but the concept is the same: maximize the margin for better generalization.
Support Vector Machine
A supervised learning algorithm that finds the hyperplane with maximum margin between classes. The data points closest to the decision boundary are called support vectors - they "support" and define the position of the hyperplane.
Key concepts: Hyperplane (decision boundary), Margin (distance from hyperplane to nearest points), Support Vectors (critical boundary points), Kernel (for non-linear separation).
Maximum Margin
SVM maximizes the "street" between classes for better generalization.
Support Vectors
Only the points closest to the boundary matter - the rest are ignored.
Kernel Trick
Transform data to higher dimensions for non-linear separation.
Linear SVM Implementation
Let's start with a simple linear SVM to understand the basics. We'll generate synthetic data that's linearly separable (can be divided by a straight line).
# Support Vector Machine for Classification
from sklearn.svm import SVC # Support Vector Classifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
import numpy as np
# Generate sample 2D data that's linearly separable
# WHY 2D? Easy to visualize, but concepts extend to any dimensions
X, y = make_classification(
n_samples=200, # 200 data points
n_features=2, # 2 features (x and y coordinates for visualization)
n_informative=2, # Both features are useful for classification
n_redundant=0, # No redundant/correlated features
n_clusters_per_class=1, # Each class forms one cluster
random_state=42 # Reproducible results
)
# Split data into train (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Scale features - ABSOLUTELY CRITICAL for SVM!
# WHY? SVM uses distances between points. Features with larger ranges
# would dominate the distance calculations, making other features irrelevant
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # Learn scaling from training data
X_test_scaled = scaler.transform(X_test) # Apply same scaling to test data
print("Data shape:", X_train.shape)
print(f"Class distribution: {np.bincount(y_train)} (should be roughly balanced)\n")
# Create linear SVM classifier
# 'linear' kernel = find straight line (hyperplane) to separate classes
svm_linear = SVC(
kernel='linear', # Use linear kernel (straight line decision boundary)
C=1.0, # Regularization: lower = simpler model, higher = fit training data better
random_state=42
)
# Train the model - this finds the optimal hyperplane
svm_linear.fit(X_train_scaled, y_train)
# Evaluate
accuracy = svm_linear.score(X_test_scaled, y_test)
print(f"Linear SVM Accuracy: {accuracy:.2%}")
# Support vectors are the critical points that define the decision boundary
print(f"Number of support vectors: {len(svm_linear.support_vectors_)}")
print(f" → Only {len(svm_linear.support_vectors_)} out of {len(X_train)} training points actually matter!")
print(f" → These are the points closest to the decision boundary")
print(f" → Moving other points won't change the boundary at all")
The Kernel Trick
Real-world data often isn't linearly separable - you can't draw a straight line to separate the classes. Imagine trying to separate two groups of dots arranged in circles - no straight line works!
The kernel trick is like a magic spell that transforms your data into a higher dimension where a straight line (or hyperplane) can separate them. Think of it like this: if you have overlapping circles on a flat paper (2D), you could lift one circle up into 3D space, and now a flat plane can separate them! The amazing part is that SVM does this transformation implicitly - it never actually moves the data, but mathematically acts as if it did. This makes it computationally efficient even in very high dimensions.
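The "lift the circle into 3D" picture can be made concrete with an explicit feature map. This sketch adds the squared radius as a third feature; accuracies are computed on the training data purely for illustration:

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings — no straight line separates them in 2D
X, y = make_circles(n_samples=200, noise=0.05, factor=0.3, random_state=42)

# Lift to 3D by appending the squared radius x1² + x2².
# In the new space the rings sit at different "heights",
# so a flat plane (a linear SVM) separates them easily.
r2 = (X ** 2).sum(axis=1, keepdims=True)
X_3d = np.hstack([X, r2])

acc_2d = SVC(kernel='linear').fit(X, y).score(X, y)
acc_3d = SVC(kernel='linear').fit(X_3d, y).score(X_3d, y)
print(f"Linear SVM in 2D: {acc_2d:.2%}")
print(f"Linear SVM in 3D: {acc_3d:.2%}")
```

A kernel SVM achieves the same effect without ever constructing the extra feature: the kernel computes dot products *as if* the data lived in the lifted space.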
| Kernel | Best For | How It Works | Formula |
|---|---|---|---|
| linear | Linearly separable data, text classification, high-dimensional data | Finds a straight line/hyperplane. Fastest option. Use when classes are clearly separated or you have many features (>10,000). | K(x, y) = x·y |
| rbf (Radial Basis Function) | Most common choice, handles non-linear patterns, general-purpose | Creates circular decision boundaries. Can fit complex curves. The 'go-to' kernel when you're not sure. Most flexible. | K(x, y) = exp(-γ‖x-y‖²), where γ controls the influence radius |
| poly (Polynomial) | Polynomial relationships, image classification | Creates polynomial curves as boundaries. The degree parameter controls complexity (2 = quadratic, 3 = cubic, etc.). Can be unstable. | K(x, y) = (γx·y + r)^d, where d is the degree |
| sigmoid | Neural network-like behavior (rarely used) | Similar to a neural network activation. Historically used, but RBF usually performs better. Mainly for research. | K(x, y) = tanh(γx·y + r) |
Quick Decision Guide:
1. Try linear first - fastest, works surprisingly often
2. If linear fails, use RBF - handles most non-linear cases
3. Consider poly only if you know data has polynomial relationship
4. Always tune C and gamma parameters using GridSearchCV
# RBF Kernel - handles non-linear decision boundaries
from sklearn.datasets import make_circles
# Create non-linearly separable data
X_circles, y_circles = make_circles(
n_samples=200,
noise=0.1,
factor=0.3,
random_state=42
)
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(
X_circles, y_circles, test_size=0.2, random_state=42
)
# Compare Linear vs RBF kernel
svm_linear = SVC(kernel='linear')
svm_rbf = SVC(kernel='rbf', gamma='scale')
svm_linear.fit(X_train_c, y_train_c)
svm_rbf.fit(X_train_c, y_train_c)
print(f"Linear kernel on circles: {svm_linear.score(X_test_c, y_test_c):.2%}")
print(f"RBF kernel on circles: {svm_rbf.score(X_test_c, y_test_c):.2%}")
Key Hyperparameters Explained in Detail
# Important SVM parameters - let's understand what they do
svm_tuned = SVC(
kernel='rbf', # Kernel choice (we'll use RBF for non-linear data)
C=1.0, # Regularization parameter
gamma='scale', # Kernel coefficient
random_state=42
)
# ========================================
# PARAMETER 1: C (Regularization)
# ========================================
# C controls the trade-off between:
# 1. Having a smooth decision boundary (generalization)
# 2. Classifying training points correctly (accuracy on training data)
# Low C (e.g., 0.1):
# → MORE regularization = simpler model
# → Allows some misclassifications
# → Wider margin (more tolerance)
# → Better for noisy data
# → Risk: May underfit (too simple)
# High C (e.g., 100):
# → LESS regularization = complex model
# → Tries to classify every training point correctly
# → Narrow margin (less tolerance)
# → Fits training data very closely
# → Risk: May overfit (too specific to training data)
# Default C=1.0 is usually a good starting point
print("C Parameter Examples:")
for C_val in [0.1, 1.0, 10.0]:
svm = SVC(kernel='rbf', C=C_val, random_state=42)
svm.fit(X_train_scaled, y_train)
train_acc = svm.score(X_train_scaled, y_train)
test_acc = svm.score(X_test_scaled, y_test)
print(f"C={C_val:5.1f}: Train={train_acc:.2%}, Test={test_acc:.2%}, "
f"Support Vectors={len(svm.support_vectors_)}")
print("\n→ Notice: Higher C = higher training accuracy and usually fewer support vectors (narrower margin)")
print("→ Goal: Find C where test accuracy is highest (not training!)\n")
# ========================================
# PARAMETER 2: gamma (Kernel Coefficient)
# ========================================
# gamma controls how far the influence of a single training example reaches
# Think of it as the "reach" or "spread" of each point
# Low gamma (e.g., 0.001):
# → Far reach = each point influences distant points
# → Smoother decision boundaries
# → More general, less complex
# → Risk: May underfit
# High gamma (e.g., 10):
# → Close reach = only nearby points are influenced
# → Tight, wiggly decision boundaries
# → Very specific to training data
# → Risk: Overfits easily
# 'scale' (default): gamma = 1 / (n_features * X.var())
# 'auto': gamma = 1 / n_features
print("gamma Parameter Examples:")
for gamma_val in [0.01, 0.1, 1.0]:
svm = SVC(kernel='rbf', gamma=gamma_val, random_state=42)
svm.fit(X_train_scaled, y_train)
train_acc = svm.score(X_train_scaled, y_train)
test_acc = svm.score(X_test_scaled, y_test)
print(f"gamma={gamma_val:5.2f}: Train={train_acc:.2%}, Test={test_acc:.2%}")
print("\n→ Notice: Higher gamma can lead to overfitting (high train, low test)")
print("→ Start with 'scale' or 'auto', then tune if needed\n")
# PRACTICAL ADVICE:
print("=== Practical Tuning Strategy ===")
print("1. Start with C=1.0 and gamma='scale'")
print("2. If underfitting: increase C (try 10, 100)")
print("3. If overfitting: decrease C (try 0.1, 0.01)")
print("4. Fine-tune gamma only after C is reasonable")
print("5. Use GridSearchCV to test combinations systematically")
# Grid search for optimal C and gamma
from sklearn.model_selection import GridSearchCV
param_grid = {
'C': [0.1, 1, 10, 100],
'gamma': ['scale', 0.1, 0.01, 0.001],
'kernel': ['rbf']
}
grid_search = GridSearchCV(SVC(), param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train_scaled, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV accuracy: {grid_search.best_score_:.2%}")
Practice Questions: SVM
Master SVM with these kernel and hyperparameter tuning exercises.
Question: SVM by default doesn't output probabilities. Train an SVM that provides probability estimates for each prediction.
Show Solution
# STEP 1: Enable probability estimates
svm_prob = SVC(kernel='rbf', probability=True, random_state=42)
svm_prob.fit(X_train_scaled, y_train)
# STEP 2: Get probability predictions
probs = svm_prob.predict_proba(X_test_scaled)
print(f"First sample probabilities: {probs[0]}")
print(f"Shape: {probs.shape} # (n_samples, n_classes)")
# STEP 3: Get class with highest probability
predicted_class = probs[0].argmax()
confidence = probs[0][predicted_class]
print(f"\nPredicted class: {predicted_class} with {confidence:.2%} confidence")
print("\nNote: probability=True uses 5-fold CV internally, making training slower!")
Question: Create non-linearly separable data using make_moons, then train both linear and RBF kernel SVMs to compare their performance.
Show Solution
from sklearn.datasets import make_moons
# STEP 1: Create non-linear data (moon shapes)
X, y = make_moons(n_samples=200, noise=0.15, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# STEP 2: Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# STEP 3: Train with linear kernel
svm_linear = SVC(kernel='linear')
svm_linear.fit(X_train_scaled, y_train)
linear_score = svm_linear.score(X_test_scaled, y_test)
# STEP 4: Train with RBF kernel
svm_rbf = SVC(kernel='rbf', gamma='scale')
svm_rbf.fit(X_train_scaled, y_train)
rbf_score = svm_rbf.score(X_test_scaled, y_test)
# STEP 5: Compare results
print(f"Linear kernel accuracy: {linear_score:.2%}")
print(f"RBF kernel accuracy: {rbf_score:.2%}")
print(f"\nRBF wins by: {rbf_score - linear_score:.2%}")
print("\nLesson: RBF kernel handles non-linear boundaries much better!")
Question: Train SVMs with different C values (0.1, 1, 10, 100) and observe how it affects training vs test accuracy.
Show Solution
# STEP 1: Test different C values
C_values = [0.1, 1, 10, 100]
results = []
for C in C_values:
svm = SVC(kernel='rbf', C=C, random_state=42)
svm.fit(X_train_scaled, y_train)
train_acc = svm.score(X_train_scaled, y_train)
test_acc = svm.score(X_test_scaled, y_test)
gap = train_acc - test_acc
results.append({
'C': C,
'Train': train_acc,
'Test': test_acc,
'Gap': gap
})
# STEP 2: Display results
import pandas as pd
df_results = pd.DataFrame(results)
print(df_results.to_string(index=False))
# STEP 3: Interpret
print("\nObservations:")
print("- Low C (0.1): More regularization, simpler boundary, may underfit")
print("- High C (100): Less regularization, complex boundary, may overfit")
print("- Optimal C: Balance between train and test accuracy")
Question: Perform comprehensive hyperparameter tuning for SVM, testing multiple kernels (linear, rbf, poly), C values, and gamma values. Find the best combination.
Show Solution
from sklearn.model_selection import GridSearchCV
# STEP 1: Define comprehensive parameter grid
param_grid = {
'kernel': ['linear', 'rbf', 'poly'],
'C': [0.1, 1, 10, 100],
'gamma': ['scale', 'auto', 0.1, 0.01, 0.001]
}
# STEP 2: Perform grid search
grid_search = GridSearchCV(
SVC(random_state=42),
param_grid,
cv=5,
scoring='accuracy',
n_jobs=-1,
verbose=1
)
grid_search.fit(X_train_scaled, y_train)
# STEP 3: Display best parameters
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.2%}")
print(f"Test score: {grid_search.score(X_test_scaled, y_test):.2%}")
# STEP 4: Analyze top 5 configurations
import pandas as pd
results_df = pd.DataFrame(grid_search.cv_results_)
top_5 = results_df.nlargest(5, 'mean_test_score')[[
'param_kernel', 'param_C', 'param_gamma', 'mean_test_score'
]]
print("\nTop 5 Configurations:")
print(top_5.to_string(index=False))
# STEP 5: Get best model
best_svm = grid_search.best_estimator_
print(f"\nNumber of support vectors: {len(best_svm.support_vectors_)}")
K-Nearest Neighbors (KNN)
KNN is one of the simplest yet most intuitive classification algorithms. It classifies a new data point based on the majority vote of its K nearest neighbors - literally "tell me who your friends are, and I'll tell you who you are."
How KNN Works
KNN is called a lazy learner (or instance-based learner) because it doesn't actually "learn" anything during training - it just memorizes all the training data! Think of it like a student who doesn't study before the exam but brings all their notes and textbooks into the test.
When you ask KNN to predict a new data point, here's what happens: It looks at your new point and finds the K closest training examples (neighbors). If K=5, it finds the 5 nearest neighbors. Then it takes a democratic vote - if 4 out of 5 neighbors are "Class A", the new point is predicted as "Class A". The key idea: "You are who your friends are." Similar data points should have similar labels. This makes KNN very intuitive but can be slow on large datasets since it needs to calculate distances to all training points for every prediction.
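The prediction procedure described above can be sketched in a few lines of NumPy. This is a toy implementation for illustration only, not a replacement for scikit-learn's optimized KNeighborsClassifier:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    # 1. Compute the Euclidean distance from x_new to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # 2. Find the indices of the k closest neighbors
    nearest = np.argsort(distances)[:k]
    # 3. Majority vote among their labels
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

# Tiny example: two clusters of "friends"
X_train = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([1.5, 1.5]), k=3))  # near cluster 0
print(knn_predict(X_train, y_train, np.array([8.5, 8.5]), k=3))  # near cluster 1
```

Note that all the work happens at prediction time: there is no `fit` step beyond storing the training arrays, which is exactly why KNN is called a lazy learner.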
K-Nearest Neighbors
A non-parametric algorithm that classifies data points based on the class of their K nearest neighbors in the feature space. The distance between points is typically measured using Euclidean distance.
Key idea: "Similar things are near each other." A point is likely to belong to the same class as its neighbors.
- Simple to understand and implement
- No training phase (lazy learning)
- Naturally handles multi-class problems
- Non-parametric (no assumptions about data)
- Can adapt to new data instantly
- Slow prediction on large datasets
- Sensitive to irrelevant features
- Requires feature scaling
- Suffers from curse of dimensionality
- Memory-intensive (stores all data)
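The sensitivity to irrelevant features is easy to demonstrate. In this sketch (assumptions: Iris data, columns of pure random noise appended as fake features), distance calculations become dominated by the noise dimensions and accuracy degrades - a small-scale taste of the curse of dimensionality:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(42)

scores = []
for n_noise in [0, 10, 50]:
    # Append n_noise columns of pure random noise as irrelevant "features"
    noise = rng.normal(size=(X.shape[0], n_noise))
    X_noisy = np.hstack([X, noise])
    score = cross_val_score(KNeighborsClassifier(n_neighbors=5), X_noisy, y, cv=5).mean()
    scores.append(score)
    print(f"{n_noise:2d} noise features: CV accuracy = {score:.2%}")
```

This is also why feature selection and scaling matter so much more for KNN than for tree-based models.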
KNN Implementation
STEP 1: Prepare Data with Scaling
# K-Nearest Neighbors Classifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
import numpy as np
# Load data
iris = load_iris()
X, y = iris.data, iris.target
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Scale features - CRITICAL for KNN!
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Create KNN classifier with K=5
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)
print(f"KNN (K=5) Accuracy: {knn.score(X_test_scaled, y_test):.2%}")
Unlike other algorithms, KNN doesn't learn a model during training - it simply stores all training data. When predicting, it finds the 5 closest training samples and takes a majority vote. Scaling is critical because KNN uses distance calculations!
Choosing the Right K
The value of K is crucial and dramatically affects how your model behaves. Let's understand the trade-offs:
- Small K (e.g., 1-3): More sensitive to noise, complex decision boundaries, may overfit
  → Like asking only your closest friend for advice - if they're wrong, you're wrong. With K=1, every outlier or noisy point affects predictions. The decision boundary becomes very wiggly and complex.
- Large K (e.g., 15-25): Smoother boundaries, more robust, but may underfit
  → Like asking 20 people for advice - a more stable consensus, but it might miss subtle patterns. Too large and it becomes like asking the entire dataset, losing the "local" aspect of nearest neighbors.
- Rule of thumb: Start with K = √n (where n is the training set size)
  → For 100 samples, try K=10. This balances local and global information.
- Best practice: Use odd K for binary classification to avoid ties
  → With K=4, you might get 2 votes for each class - which one wins? Using K=5 prevents ties!
# Find optimal K using cross-validation
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt
k_range = range(1, 31)
k_scores = []
for k in k_range:
knn_temp = KNeighborsClassifier(n_neighbors=k)
# 5-fold cross-validation
scores = cross_val_score(knn_temp, X_train_scaled, y_train, cv=5)
k_scores.append(scores.mean())
# Find best K
best_k = k_range[np.argmax(k_scores)]
print(f"Best K: {best_k} with CV accuracy: {max(k_scores):.2%}")
# Plot K vs Accuracy
plt.figure(figsize=(10, 5))
plt.plot(k_range, k_scores, marker='o')
plt.xlabel('K (Number of Neighbors)')
plt.ylabel('Cross-Validation Accuracy')
plt.title('Finding Optimal K')
plt.axvline(x=best_k, color='r', linestyle='--', label=f'Best K={best_k}')
plt.legend()
plt.grid(True)
plt.show()
Distance Metrics: How KNN Measures "Closeness"
KNN relies on measuring distance between points. But what does "distance" mean? There are multiple ways to calculate it, and the choice can significantly impact your results!
| Metric | Formula | Intuition | Best For |
|---|---|---|---|
| euclidean (default) | √(Σ(xᵢ - yᵢ)²) | "As the crow flies": straight-line distance, like measuring with a ruler on a map. | Most common choice. Works well for continuous features. Natural for physical distances. |
| manhattan (city block) | Σ\|xᵢ - yᵢ\| | "City block distance": walking distance in a grid city (only horizontal/vertical moves). | High-dimensional data, when dimensions are independent. Less sensitive to outliers than Euclidean. |
| minkowski | (Σ\|xᵢ - yᵢ\|^p)^(1/p) | "Generalization": p=1 → Manhattan, p=2 → Euclidean. A flexible family of metrics. | When you want to experiment. Tune the p parameter to find what works best for your data. |
| chebyshev | max(\|xᵢ - yᵢ\|) | "Maximum difference": only the largest difference matters; other dimensions are ignored. | When one feature's difference dominates. Chess king moves (max of horizontal or vertical). |
| cosine | 1 - (x·y)/(‖x‖ ‖y‖) | "Angle between vectors": measures direction similarity, not magnitude. | Text data, recommender systems. When you care about patterns, not absolute values. |
Use Euclidean when:
• Continuous numerical features
• Physical measurements (height, weight, distance)
• Features have similar scales (after scaling)
• When you're not sure - it's the standard choice
Use Manhattan when:
• High-dimensional data (many features)
• Features are independent
• Data has outliers (Manhattan is more robust)
• Grid-like structure in your problem
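The formulas in the table are easy to verify by hand. A quick check on two small vectors in pure NumPy (SciPy's spatial.distance module provides the same metrics as functions):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])
diff = np.abs(x - y)  # per-dimension absolute differences: [3, 2, 0]

euclidean = np.sqrt((diff ** 2).sum())  # sqrt(9 + 4 + 0) = sqrt(13)
manhattan = diff.sum()                  # 3 + 2 + 0 = 5
chebyshev = diff.max()                  # max(3, 2, 0) = 3

print(f"Euclidean: {euclidean:.3f}")
print(f"Manhattan: {manhattan:.1f}")
print(f"Chebyshev: {chebyshev:.1f}")
```

Notice that Manhattan ≥ Euclidean ≥ Chebyshev always holds for the same pair of points, which is one reason Manhattan reacts less dramatically to a single large outlier dimension than squared Euclidean terms do.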
# Compare different distance metrics
from sklearn.neighbors import KNeighborsClassifier
print("=== Comparing Distance Metrics ===")
metrics = ['euclidean', 'manhattan', 'chebyshev', 'minkowski']
for metric in metrics:
# Train KNN with different distance metrics
knn = KNeighborsClassifier(
n_neighbors=5,
metric=metric,
n_jobs=-1 # Use all CPU cores
)
knn.fit(X_train_scaled, y_train)
train_acc = knn.score(X_train_scaled, y_train)
test_acc = knn.score(X_test_scaled, y_test)
print(f"{metric:12} - Train: {train_acc:.2%}, Test: {test_acc:.2%}")
print("\n→ Try all metrics! Performance varies by dataset")
print("→ Usually Euclidean or Manhattan work best")
print("→ Cosine is special - use for text/sparse data")
Weighted KNN
# Weighted KNN - closer neighbors have more influence
knn_weighted = KNeighborsClassifier(
n_neighbors=5,
weights='distance' # 'uniform' (default) or 'distance'
)
knn_weighted.fit(X_train_scaled, y_train)
print(f"Uniform weights: {knn.score(X_test_scaled, y_test):.2%}")
print(f"Distance weights: {knn_weighted.score(X_test_scaled, y_test):.2%}")
# Distance weighting: closer neighbors have more vote power
# Often helps when K is large
Practice Questions: KNN
Practice distance-based classification and finding optimal K.
Question: Create a KNN classifier with K=3 for the Iris dataset. Don't forget to scale features!
Show Solution
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
# STEP 1: Load and split data
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# STEP 2: Scale features (CRITICAL for KNN!)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# STEP 3: Train KNN with K=3
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train_scaled, y_train)
# STEP 4: Evaluate
print(f"Training Accuracy: {knn.score(X_train_scaled, y_train):.2%}")
print(f"Test Accuracy: {knn.score(X_test_scaled, y_test):.2%}")
print("\nRemember: KNN is a lazy learner - no training phase, just stores data!")
Question: Test K values from 1 to 30 and use cross-validation to find the optimal K that maximizes accuracy.
Show Solution
from sklearn.model_selection import cross_val_score
import numpy as np
import matplotlib.pyplot as plt
# STEP 1: Test different K values
k_range = range(1, 31)
k_scores = []
for k in k_range:
knn = KNeighborsClassifier(n_neighbors=k)
# 5-fold cross-validation
scores = cross_val_score(knn, X_train_scaled, y_train, cv=5, scoring='accuracy')
k_scores.append(scores.mean())
# STEP 2: Find best K
best_k = k_range[np.argmax(k_scores)]
best_score = max(k_scores)
print(f"Best K: {best_k}")
print(f"Best CV Accuracy: {best_score:.2%}")
# STEP 3: Visualize K vs Accuracy
plt.figure(figsize=(10, 5))
plt.plot(k_range, k_scores, marker='o', linewidth=2)
plt.axvline(best_k, color='r', linestyle='--', label=f'Best K={best_k}')
plt.xlabel('K (Number of Neighbors)', fontsize=12)
plt.ylabel('Cross-Validation Accuracy', fontsize=12)
plt.title('Finding Optimal K for KNN', fontsize=14)
plt.grid(True, alpha=0.3)
plt.legend()
plt.show()
# STEP 4: Train with optimal K
knn_best = KNeighborsClassifier(n_neighbors=best_k)
knn_best.fit(X_train_scaled, y_train)
print(f"\nTest Accuracy with K={best_k}: {knn_best.score(X_test_scaled, y_test):.2%}")
Question: Train KNN classifiers with different distance metrics (euclidean, manhattan, chebyshev) and compare their performance.
Show Solution
# STEP 1: Test different distance metrics
metrics = ['euclidean', 'manhattan', 'chebyshev', 'minkowski']
results = []
for metric in metrics:
knn = KNeighborsClassifier(n_neighbors=5, metric=metric)
knn.fit(X_train_scaled, y_train)
train_acc = knn.score(X_train_scaled, y_train)
test_acc = knn.score(X_test_scaled, y_test)
results.append({
'Metric': metric,
'Train Acc': f"{train_acc:.2%}",
'Test Acc': f"{test_acc:.2%}"
})
# STEP 2: Display results
import pandas as pd
df_results = pd.DataFrame(results)
print(df_results.to_string(index=False))
# STEP 3: Explain metrics
print("\nDistance Metrics Explained:")
print("- Euclidean: Straight-line distance (most common)")
print("- Manhattan: Sum of absolute differences (grid-like paths)")
print("- Chebyshev: Maximum difference along any dimension")
print("- Minkowski: Generalization (p=2 is Euclidean)")
Question: Compare uniform vs distance-weighted KNN. Show how distance weighting gives more influence to closer neighbors.
Show Solution
# STEP 1: Train with uniform weights (default)
knn_uniform = KNeighborsClassifier(
n_neighbors=10, # Use larger K to see difference
weights='uniform' # All neighbors vote equally
)
knn_uniform.fit(X_train_scaled, y_train)
# STEP 2: Train with distance weights
knn_weighted = KNeighborsClassifier(
n_neighbors=10,
weights='distance' # Closer neighbors have more influence
)
knn_weighted.fit(X_train_scaled, y_train)
# STEP 3: Get predictions with probabilities
sample = X_test_scaled[0:1]
prob_uniform = knn_uniform.predict_proba(sample)[0]
prob_weighted = knn_weighted.predict_proba(sample)[0]
print("=== Sample Prediction Probabilities ===")
print(f"Uniform weights: {prob_uniform}")
print(f"Distance weights: {prob_weighted}")
# STEP 4: Compare accuracies
print("\n=== Accuracy Comparison ===")
print(f"Uniform KNN: {knn_uniform.score(X_test_scaled, y_test):.2%}")
print(f"Weighted KNN: {knn_weighted.score(X_test_scaled, y_test):.2%}")
# STEP 5: Test on different K values
print("\n=== K Value Impact ===")
for k in [3, 5, 10, 15, 20]:
knn_u = KNeighborsClassifier(n_neighbors=k, weights='uniform')
knn_w = KNeighborsClassifier(n_neighbors=k, weights='distance')
knn_u.fit(X_train_scaled, y_train)
knn_w.fit(X_train_scaled, y_train)
acc_u = knn_u.score(X_test_scaled, y_test)
acc_w = knn_w.score(X_test_scaled, y_test)
print(f"K={k:2d} Uniform: {acc_u:.2%} Weighted: {acc_w:.2%} Diff: {acc_w-acc_u:+.2%}")
print("\nInsight: Distance weighting often helps with larger K values!")
Naive Bayes
Naive Bayes is a fast, probabilistic classifier based on Bayes' theorem with a "naive" assumption of feature independence. Despite its simplicity, it often performs surprisingly well, especially for text classification tasks like spam detection.
Bayes' Theorem
Naive Bayes applies Bayes' theorem to calculate probabilities. Let's understand it with an example: If you see an email with words "free", "winner", and "claim", what's the probability it's spam?
Bayes' theorem lets us flip the question around. Instead of asking "What's P(spam | these words)?", we calculate it from easier-to-find probabilities: "What's P(these words | spam)?" and "What's P(spam in general)?" This clever reversal makes the math much simpler!
Naive Bayes Classifier
A probabilistic classifier that uses Bayes' theorem with the "naive" assumption that all features are independent of each other given the class label. Despite this unrealistic assumption, it often performs remarkably well in practice.
Bayes' Theorem: P(Class|Features) = P(Features|Class) × P(Class) / P(Features)
For example, in spam detection, it assumes seeing the word "free" is independent of seeing "winner" - but spam emails often use these words together! In movie reviews, words like "amazing" and "brilliant" tend to appear together, not independently.
Despite this unrealistic assumption, Naive Bayes works surprisingly well in practice! Why? Because for classification, we just need to know which class is more likely, not the exact probability. Even if the calculated probability is wrong, the ranking (Class A more likely than Class B) is often correct. Plus, the simplification makes it incredibly fast - perfect for real-time applications!
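The "naive" product can be written out directly. This is a toy hand-calculation with made-up word likelihoods and priors (purely for illustration, not estimated from any real corpus), showing how the independence assumption turns classification into simple multiplication:

```python
# Made-up likelihoods P(word | class) from a hypothetical corpus
p_word_given_spam = {'free': 0.30, 'winner': 0.20, 'meeting': 0.01}
p_word_given_ham  = {'free': 0.02, 'winner': 0.01, 'meeting': 0.25}
p_spam, p_ham = 0.4, 0.6  # class priors

def naive_score(words, likelihoods, prior):
    # Naive assumption: multiply per-word likelihoods as if independent
    score = prior
    for w in words:
        score *= likelihoods[w]
    return score

email = ['free', 'winner']
spam_score = naive_score(email, p_word_given_spam, p_spam)  # 0.4 × 0.30 × 0.20 = 0.024
ham_score = naive_score(email, p_word_given_ham, p_ham)     # 0.6 × 0.02 × 0.01 = 0.00012

# For classification we only need the ranking, not exact probabilities
print('SPAM' if spam_score > ham_score else 'HAM')
# Normalizing recovers P(spam | words) under the naive model
print(f"P(spam | words) ≈ {spam_score / (spam_score + ham_score):.2%}")
```

Even if the independence assumption distorts both scores, the classifier is still correct as long as the spam score stays larger than the ham score.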
Types of Naive Bayes: Which One to Use?
There are three main types of Naive Bayes classifiers, each designed for different kinds of data. Choosing the right one is crucial for good performance!
Gaussian NB - For: Continuous Features
Assumption: Each feature follows a normal (bell-curve) distribution within each class.
Examples:
- Height, weight, age
- Temperature, pressure
- Sensor measurements
- Physical measurements
When to use: When your features are real numbers that could reasonably be normally distributed.
Multinomial NB - For: Discrete Count Features
Assumption: Features represent counts or frequencies (non-negative integers).
Examples:
- Word counts in documents
- TF-IDF scores
- Number of occurrences
- Frequency distributions
When to use: BEST FOR TEXT! Spam detection, sentiment analysis, document classification.
Bernoulli NB - For: Binary Features
Assumption: Features are binary (0 or 1, present or absent, yes or no).
Examples:
- Word present/absent
- Feature exists or not
- Boolean flags
- Has symptom yes/no
When to use: When you only care if something exists, not how many times. Explicitly models absence.
Text classification? → Use Multinomial NB with TF-IDF
Continuous measurements? → Use Gaussian NB
Binary features only? → Use Bernoulli NB
Mixed features? → Try Gaussian first, or encode categorical as binary
Gaussian Naive Bayes
STEP 1: Train Gaussian NB on Continuous Features
# Gaussian Naive Bayes for continuous features
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
# Load Iris data (continuous features)
iris = load_iris()
X, y = iris.data, iris.target
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Train Gaussian Naive Bayes
gnb = GaussianNB()
gnb.fit(X_train, y_train)
# Predictions
y_pred = gnb.predict(X_test)
print(f"Gaussian NB Accuracy: {accuracy_score(y_test, y_pred):.2%}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
Gaussian NB assumes each feature follows a normal distribution within each class. It calculates mean and standard deviation for each feature per class during training, then uses Bayes' theorem to compute class probabilities for new samples.
STEP 2: Examine Probability Outputs
# Examine class probabilities
sample = X_test[0:1]
proba = gnb.predict_proba(sample)
print(f"Sample features: {sample[0]}")
print(f"Class probabilities: {proba[0]}")
print(f"Predicted class: {iris.target_names[gnb.predict(sample)[0]]}")
Multinomial Naive Bayes for Text
Multinomial NB is the go-to algorithm for text classification. It works with word counts or TF-IDF features.
STEP 1: Convert Text to Features and Train
# Text Classification with Multinomial Naive Bayes
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
# Sample text data
texts = [
"Free money!!! Click here to win big prizes",
"Urgent: You have won a lottery!",
"Meeting tomorrow at 10am in conference room",
"Project deadline extended to next Friday",
"Congratulations! You've been selected for a prize",
"Can we schedule a call for next week?",
"FREE iPhone!!! Act now limited time offer",
"Quarterly report is ready for review"
]
labels = [1, 1, 0, 0, 1, 0, 1, 0] # 1 = spam, 0 = not spam
# Convert text to features using TF-IDF
vectorizer = TfidfVectorizer()
X_text = vectorizer.fit_transform(texts)
# Train Multinomial Naive Bayes
mnb = MultinomialNB()
mnb.fit(X_text, labels)
# Test on new messages
new_messages = [
"Win a free vacation today!",
"Meeting rescheduled to 3pm"
]
new_vectors = vectorizer.transform(new_messages)
predictions = mnb.predict(new_vectors)
for msg, pred in zip(new_messages, predictions):
print(f"'{msg}' → {'SPAM' if pred else 'NOT SPAM'}")
The vectorizer converts text to numerical features (word importance scores). Multinomial NB then calculates P(Spam|words) vs P(Not Spam|words) using Bayes' theorem. Words like "free", "win", "urgent" increase spam probability.
Bernoulli Naive Bayes
# Bernoulli NB for binary features
from sklearn.naive_bayes import BernoulliNB
# Binary features (word present = 1, absent = 0)
vectorizer_binary = CountVectorizer(binary=True)
X_binary = vectorizer_binary.fit_transform(texts)
# Train Bernoulli NB
bnb = BernoulliNB()
bnb.fit(X_binary, labels)
# Bernoulli NB explicitly models absence of features
# Good when "word NOT present" is informative
Advantages & Use Cases
| Advantage | Explanation |
|---|---|
| Extremely Fast | Training and prediction are very quick, O(n×d) complexity |
| Works with Small Data | Requires less training data than complex models |
| Handles High Dimensions | Works well with thousands of features (text classification) |
| Probabilistic Output | Provides probability estimates, not just predictions |
| No Feature Scaling Needed | Immune to feature scale differences |
- Text classification (spam, sentiment, topic categorization)
- Real-time prediction needed (very fast)
- Working with limited training data
- Baseline model for comparison
- Features are somewhat independent
# Smoothing parameter (Laplace smoothing)
# Prevents zero probability for unseen feature values
mnb_smoothed = MultinomialNB(alpha=1.0) # Default Laplace smoothing
mnb_no_smooth = MultinomialNB(alpha=0.0) # No smoothing (risky!)
# alpha = 1.0: Laplace smoothing (adds 1 to all counts)
# alpha = 0.5: Lidstone smoothing
# alpha = 0.0: No smoothing (not recommended)
print("Smoothing prevents zero probabilities from unseen words!")
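Why smoothing matters can be shown with raw counts. In this toy sketch (hypothetical counts, purely for illustration), a single word never seen in the spam class zeroes out the entire product without smoothing, while adding alpha keeps the estimate positive:

```python
# Toy word counts for the spam class (hypothetical corpus)
spam_counts = {'free': 30, 'winner': 20, 'hello': 0}  # 'hello' never seen in spam
total = sum(spam_counts.values())  # 50
vocab = len(spam_counts)           # 3

def word_prob(word, alpha):
    # Laplace/Lidstone estimate: (count + alpha) / (total + alpha * vocab_size)
    return (spam_counts[word] + alpha) / (total + alpha * vocab)

# Without smoothing: P('hello' | spam) = 0, so ANY email containing
# 'hello' gets a spam score of exactly 0, however spammy the rest is
p_no_smooth = word_prob('free', 0) * word_prob('hello', 0)
p_smooth = word_prob('free', 1.0) * word_prob('hello', 1.0)

print(f"No smoothing: {p_no_smooth}")
print(f"alpha=1.0:    {p_smooth:.6f}")
```

This is exactly the failure mode alpha > 0 guards against in MultinomialNB and BernoulliNB.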
Practice Questions: Naive Bayes
Build probabilistic classifiers for text and numerical data.
Question: Create a complete spam classifier using Pipeline with TfidfVectorizer and MultinomialNB. Make it reusable and easy to deploy.
Show Solution
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
# STEP 1: Create training data
texts = [
"Free money!!! Click here to win",
"Urgent: You won a lottery!",
"Meeting at 10am in conference room",
"Project deadline extended",
"Congratulations! You've been selected",
"Can we schedule a call?",
"FREE iPhone!!! Act now",
"Quarterly report ready"
]
labels = [1, 1, 0, 0, 1, 0, 1, 0] # 1=spam, 0=ham
# STEP 2: Create pipeline
spam_classifier = Pipeline([
('vectorizer', TfidfVectorizer(
stop_words='english', # Remove common words
max_features=100 # Limit vocabulary
)),
('classifier', MultinomialNB())
])
# STEP 3: Train (one command!)
spam_classifier.fit(texts, labels)
# STEP 4: Test on new messages
test_msgs = [
"Buy now! Limited offer!",
"See you at lunch tomorrow",
"WINNER! Claim your prize now!",
"Can you review the document?"
]
predictions = spam_classifier.predict(test_msgs)
probs = spam_classifier.predict_proba(test_msgs)
for msg, pred, prob in zip(test_msgs, predictions, probs):
label = 'SPAM' if pred else 'HAM'
confidence = prob[pred]
print(f"{label} ({confidence:.2%}): {msg}")
Question: Train both GaussianNB (for continuous features) and MultinomialNB (for count features) on the Iris dataset and compare their performance.
Show Solution
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
# STEP 1: Load Iris data
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# STEP 2: Train Gaussian NB (for continuous features)
gnb = GaussianNB()
gnb.fit(X_train, y_train)
gnb_score = gnb.score(X_test, y_test)
gnb_cv = cross_val_score(gnb, X, y, cv=5).mean()
# STEP 3: Train Multinomial NB (requires non-negative features)
# Iris features are already non-negative, but MinMaxScaler puts them on a comparable [0, 1] scale
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
X_scaled = scaler.fit_transform(X)
mnb = MultinomialNB()
mnb.fit(X_train_scaled, y_train)
mnb_score = mnb.score(X_test_scaled, y_test)
mnb_cv = cross_val_score(mnb, X_scaled, y, cv=5).mean()
# STEP 4: Compare results
print("=== Performance Comparison ===")
print(f"Gaussian NB: Test={gnb_score:.2%} CV={gnb_cv:.2%}")
print(f"Multinomial NB: Test={mnb_score:.2%} CV={mnb_cv:.2%}")
print("\nGaussian NB is better for continuous features like Iris!")
Question: Test different alpha (smoothing) values for MultinomialNB and find the optimal one using cross-validation.
Solution:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.naive_bayes import MultinomialNB
import numpy as np
# STEP 1: Prepare text data (texts and labels from the previous spam example)
vectorizer = TfidfVectorizer(stop_words='english')
X_text = vectorizer.fit_transform(texts)
# STEP 2: Test different alpha values
alphas = [0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0]
results = []
for alpha in alphas:
    mnb = MultinomialNB(alpha=alpha)
    scores = cross_val_score(mnb, X_text, labels, cv=3)
    results.append({
        'Alpha': alpha,
        'CV Score': scores.mean(),
        'Std': scores.std()
    })
# STEP 3: Display results
import pandas as pd
df_results = pd.DataFrame(results)
print(df_results.to_string(index=False))
# STEP 4: Use GridSearchCV for optimal alpha
param_grid = {'alpha': np.logspace(-3, 1, 20)}
grid = GridSearchCV(MultinomialNB(), param_grid, cv=3, scoring='accuracy')
grid.fit(X_text, labels)
print(f"\nBest alpha: {grid.best_params_['alpha']:.4f}")
print(f"Best CV score: {grid.best_score_:.2%}")
# STEP 5: Explain smoothing
print("\nSmoothing (alpha) explained:")
print(" alpha=0: No smoothing (risky - zero probabilities possible)")
print(" alpha=1: Laplace smoothing (default, adds 1 to all counts)")
print(" alpha>1: More smoothing (for very sparse data)")
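The effect of alpha can be checked by hand. Multinomial Naive Bayes estimates P(word | class) = (count + alpha) / (total + alpha * vocab_size). A tiny illustration with made-up counts (not taken from the dataset above):

```python
def smoothed_prob(count, total, vocab_size, alpha):
    """Smoothed estimate of P(word | class) as used by Multinomial NB."""
    return (count + alpha) / (total + alpha * vocab_size)

# A word never seen in the "ham" class; vocabulary of 1000 words, 5000 ham tokens
print(smoothed_prob(0, 5000, 1000, alpha=0))  # -> 0.0: a zero wipes out the whole product
print(smoothed_prob(0, 5000, 1000, alpha=1))  # -> 1/6000: Laplace smoothing keeps it non-zero
```

With alpha=1, every word gets one phantom occurrence, so unseen words contribute a small but non-zero factor instead of zeroing the class probability.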
Question: Create a sentiment classifier (positive/negative reviews) and identify which words are most predictive of each sentiment.
Solution:
# STEP 1: Create sentiment dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
reviews = [
    "This movie was absolutely amazing! Loved it!",
    "Terrible film, waste of time and money",
    "Brilliant acting and great storyline",
    "Boring and predictable, very disappointed",
    "Outstanding performance, highly recommend",
    "Awful movie, couldn't wait for it to end",
    "Fantastic cinematography and soundtrack",
    "Poor script and bad acting",
    "Incredible experience, must watch!",
    "Worst movie I've ever seen"
]
sentiments = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]  # 1=positive, 0=negative
# STEP 2: Train sentiment classifier
vectorizer = TfidfVectorizer(stop_words='english', max_features=50)
X_reviews = vectorizer.fit_transform(reviews)
mnb = MultinomialNB()
mnb.fit(X_reviews, sentiments)
# STEP 3: Extract per-class word log-probabilities
import pandas as pd
feature_names = vectorizer.get_feature_names_out()
log_prob_pos = mnb.feature_log_prob_[1]  # Positive class
log_prob_neg = mnb.feature_log_prob_[0]  # Negative class
# Words most indicative of positive sentiment
pos_words = pd.DataFrame({
    'Word': feature_names,
    'Positive Score': log_prob_pos
}).nlargest(10, 'Positive Score')
# Words most indicative of negative sentiment
neg_words = pd.DataFrame({
    'Word': feature_names,
    'Negative Score': log_prob_neg
}).nlargest(10, 'Negative Score')
print("=== Top Positive Words ===")
print(pos_words.to_string(index=False))
print("\n=== Top Negative Words ===")
print(neg_words.to_string(index=False))
# STEP 4: Test on new reviews
test_reviews = [
    "Amazing movie, absolutely brilliant!",
    "Terrible and boring, don't waste your time"
]
test_vectors = vectorizer.transform(test_reviews)
predictions = mnb.predict(test_vectors)
probs = mnb.predict_proba(test_vectors)
print("\n=== Sentiment Predictions ===")
for review, pred, prob in zip(test_reviews, predictions, probs):
    sentiment = 'POSITIVE' if pred else 'NEGATIVE'
    confidence = prob[pred]
    print(f"{sentiment} ({confidence:.2%}): {review}")
print("\nThis demonstrates how Naive Bayes learns word-sentiment associations!")
Algorithm Comparison & Selection Guide
With 6 different classification algorithms, how do you choose? This guide compares them across key dimensions to help you make the right choice for your problem.
Comprehensive Algorithm Comparison
| Algorithm | Speed | Interpretability | Handles Non-Linear | Feature Scaling | Missing Values | Best Use Cases |
|---|---|---|---|---|---|---|
| Logistic Regression | Very fast | Excellent: coefficients can be interpreted directly | No: only linear boundaries | Required | Cannot handle: must impute first | Binary classification; when interpretability matters; baseline model; linear relationships |
| Decision Tree | Fast | Excellent: easy to visualize and explain | Yes: non-parametric | Not needed | Partial: can handle, but imputation is better | Clear decision rules; mixed feature types; feature interactions; quick prototyping |
| Random Forest | Medium: trains multiple trees | Medium: feature importance available | Yes: ensemble handles complexity | Not needed | Partial: like Decision Tree | High accuracy needed; reducing overfitting; imbalanced data; feature importance |
| SVM | Slow: especially on large datasets | Poor: hard to interpret | Yes: with RBF/poly kernels | Critical: must scale! | Cannot handle: must impute first | Small to medium data; high-dimensional data; clear margin separation; complex boundaries |
| KNN | Training: instant; prediction: slow | Medium: can visualize neighborhoods | Yes: no assumptions about data | Critical: distance-based! | Cannot handle: must impute first | Recommender systems; anomaly detection; small datasets; pattern recognition |
| Naive Bayes | Very fast | Good: probability-based | Limited: depends on variant | Not needed | Partial: depends on implementation | Text classification; spam filtering; real-time prediction; high-dimensional data |
Quick Reference: Problem → Algorithm
Problem: Spam / text classification → Best Choice: Multinomial Naive Bayes
Why?
- Text data (word counts/frequencies)
- Very fast training and prediction
- Handles high-dimensional data (many unique words)
- Industry standard for spam filtering
Problem: Medical diagnosis → Best Choice: Random Forest or Logistic Regression
Why?
- Random Forest: High accuracy, handles non-linear patterns
- Logistic Regression: Interpretable (doctors need to explain!)
- Both work well with medical measurements
- Consider Random Forest for accuracy, Logistic for interpretability
Problem: Recommender system → Best Choice: KNN
Why?
- Finds similar users based on ratings
- No training phase (lazy learning)
- Naturally captures similarity patterns
- Used by Netflix, Spotify in their systems
Problem: Fraud detection → Best Choice: Random Forest or SVM
Why?
- Random Forest: Handles imbalanced data well, high accuracy
- SVM: Good at finding rare fraud patterns
- Both handle complex, non-linear patterns in transactions
- Real-time prediction important → pre-train the model
Problem: Handwritten digit recognition → Best Choice: SVM with RBF kernel
Why?
- High-dimensional data (pixels as features)
- Excellent for pattern recognition
- Handles non-linear boundaries well
- MNIST dataset classic use case
- Note: Deep Learning (CNNs) even better for large image datasets!
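As a small sketch of this use case, scikit-learn's built-in digits dataset (8x8 images, a lightweight stand-in for MNIST) can be classified with a scaled RBF SVM; the pipeline is illustrative, not a tuned model:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# 8x8 grayscale digits; each of the 64 pixels is a feature
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scaling first is essential for SVM, so bundle it into a pipeline
svm_clf = make_pipeline(StandardScaler(), SVC(kernel='rbf', gamma='scale'))
svm_clf.fit(X_train, y_train)
print(f"Test accuracy: {svm_clf.score(X_test, y_test):.2%}")
```

Even without tuning, the RBF kernel handles the non-linear pixel patterns well on this small dataset.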
Problem: Risk scoring → Best Choice: Random Forest or Logistic Regression
Why?
- Random Forest: High accuracy, feature importance
- Logistic Regression: Fast, interpretable coefficients
- Both provide probability scores (useful for ranking risk)
- Can explain to business stakeholders
- Start with a simple baseline (Logistic Regression or Decision Tree)
- Try multiple algorithms
- Use cross-validation to compare fairly
- Consider your constraints (speed, interpretability, accuracy)
- Let the data and results guide your final choice
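The checklist above can be sketched end to end. A minimal illustration, assuming the Iris dataset stands in for your problem: fit several classifiers and compare them fairly with the same cross-validation splits:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42),
    'Gaussian NB': GaussianNB(),
}
cv_results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)  # same 5 folds for every model
    cv_results[name] = scores.mean()
    print(f"{name:20s} CV accuracy: {scores.mean():.2%} (+/- {scores.std():.2%})")
```

The printed means and standard deviations make the trade-offs concrete before you commit to one algorithm.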
Key Takeaways
Logistic Regression First
Start with logistic regression as a baseline. It's interpretable, fast, and often performs well on linearly separable data.
Random Forest > Single Tree
Random Forests combine multiple trees to reduce overfitting and improve accuracy. Prefer ensembles over single trees.
SVM Kernel Choice
Use RBF kernel for non-linear data, linear kernel for text/high-dimensional data. Always scale features first!
KNN Needs Scaling
KNN is distance-based, so feature scaling is critical. Use cross-validation to find optimal K value.
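A minimal sketch of both points, using the Iris dataset for illustration: putting the scaler inside a Pipeline means each cross-validation fold is scaled without leaking test-fold statistics, and GridSearchCV searches for the best K:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scaling inside the pipeline: fit on each training fold only (no leakage)
pipe = Pipeline([('scale', StandardScaler()), ('knn', KNeighborsClassifier())])
grid = GridSearchCV(pipe, {'knn__n_neighbors': range(1, 21)}, cv=5)
grid.fit(X, y)
print(f"Best K: {grid.best_params_['knn__n_neighbors']}, CV accuracy: {grid.best_score_:.2%}")
```

Without the scaler, features with large ranges would dominate the distance calculation and distort the neighborhoods.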
Naive Bayes for Text
Naive Bayes excels at text classification. It's fast, handles high dimensions, and works with small datasets.
No Free Lunch
No algorithm is best for all problems. Always try multiple classifiers and compare using cross-validation.
Knowledge Check
Quick Quiz
Test what you've learned about classification algorithms