Logistic Regression
Despite its name, Logistic Regression is a classification algorithm, not a regression one. It predicts the probability of an outcome belonging to a specific category - making it the foundation for understanding classification in machine learning.
What is Logistic Regression?
While linear regression predicts continuous values (like house prices or temperature), logistic regression predicts probabilities that get mapped to categories (like "spam" or "not spam"). Think of it as answering the question: "What's the chance this email is spam?" instead of "What's the temperature?"
It uses the sigmoid function (also called logistic function) to transform any number into a probability between 0 and 1. For example, if a linear calculation gives us 2.5 or -3.7, the sigmoid function squashes it to values like 0.92 or 0.02 - perfect for representing probabilities!
Logistic Regression
A statistical model that predicts the probability of a binary outcome using the logistic (sigmoid) function. Despite its name, it's used for classification, not regression.
Sigmoid Function: σ(z) = 1 / (1 + e⁻ᶻ) maps any input to a value between 0 and 1.
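The squashing behavior described above is easy to verify numerically. A minimal sketch:

```python
import numpy as np

def sigmoid(z):
    """Map any real number to a probability in (0, 1)."""
    return 1 / (1 + np.exp(-z))

print(sigmoid(0.0))   # 0.5 — the decision boundary
print(sigmoid(2.5))   # ~0.924
print(sigmoid(-3.7))  # ~0.024
```

No matter how large or small the input, the output stays strictly between 0 and 1, which is exactly what a probability requires.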
How It Works
Logistic regression works in three simple steps. Let's say we're predicting if a customer will buy a product:

1. Linear combination: z = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ
   First, multiply each feature (age, income, etc.) by a weight and sum them up. This gives us a score that can be any number from -∞ to +∞.
2. Sigmoid transformation: P(y=1) = 1 / (1 + e⁻ᶻ)
   Then, pass this score through the sigmoid function, which converts it to a probability between 0 and 1. A score of 0 becomes a 0.5 probability; positive scores map to higher probabilities, negative scores to lower ones.
3. Classification decision: If P ≥ 0.5 → Class 1, else → Class 0
   Finally, if the probability is 50% or higher, we predict "Yes" (Class 1), otherwise we predict "No" (Class 0). You can adjust this threshold based on your needs!
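The three steps above can be sketched end to end in a few lines. The weights and feature values here are hypothetical, chosen only to illustrate the arithmetic:

```python
import numpy as np

# Hypothetical learned weights and one customer's (already scaled) features
beta_0 = -1.0                 # intercept
betas = np.array([0.8, 1.2])  # weights for two features
x = np.array([0.5, 1.0])      # e.g. scaled age and income

# Step 1: linear combination → a score anywhere on the real line
z = beta_0 + betas @ x        # -1.0 + 0.4 + 1.2 = 0.6

# Step 2: sigmoid transformation → a probability
p = 1 / (1 + np.exp(-z))      # ~0.646

# Step 3: classification decision at the 0.5 threshold
prediction = int(p >= 0.5)    # 1 → "will buy"
print(f"z={z:.2f}, P(buy)={p:.3f}, class={prediction}")
```

Raising the threshold (say to 0.7) makes the "Yes" prediction more conservative without retraining anything.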
When to Use Logistic Regression
- Binary classification problems
  Like spam/not spam, fraud/legitimate, pass/fail, buy/not buy. Works best when you need to classify into two distinct groups.
- Need probability estimates
  Unlike some algorithms that only give yes/no, logistic regression provides probabilities (e.g., "75% chance of spam"). Useful for ranking or setting custom thresholds.
- Want interpretable coefficients
  Each feature has a weight you can interpret: a positive coefficient increases the predicted probability, a negative one decreases it. Great when you need to explain decisions to stakeholders.
- Linear decision boundary works
  When classes can be roughly separated by a straight line (or hyperplane). If the data requires complex curves, consider trees or an SVM with an RBF kernel instead.
- Baseline model for comparison
  Fast to train, easy to implement. Always start here before trying complex models - you might be surprised how well it works!
Limitations
- Assumes linear decision boundary
  Can't learn XOR patterns or complex curves. If your data looks like concentric circles or spirals, logistic regression will struggle.
- Struggles with complex patterns
  Can't automatically discover feature interactions (e.g., "high income AND young age"). You'd need to manually create interaction features.
- Sensitive to outliers
  Extreme values can pull the decision boundary in the wrong direction. Consider removing outliers or using robust preprocessing.
- Requires feature scaling
  Features with larger ranges dominate the model. Always use StandardScaler or MinMaxScaler before training; age (0-100) and salary ($20K-$200K) need the same scale.
- Can't capture feature interactions
  Treats each feature independently. If "young + tech-savvy" together matters more than each feature separately, you need to create interaction features manually.
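Creating an interaction feature by hand is a one-liner. A minimal sketch, using hypothetical column names (`age`, `tech_savvy`) and an assumed age cutoff of 30:

```python
import pandas as pd

df = pd.DataFrame({
    'age': [22, 24, 45, 50],
    'tech_savvy': [1, 1, 0, 1],
})

# Manually add an interaction term so the model can weight
# "young AND tech-savvy" beyond what the two features contribute separately
df['young_and_tech'] = ((df['age'] < 30) & (df['tech_savvy'] == 1)).astype(int)
print(df)
```

The new column can then be passed to LogisticRegression alongside the original features.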
Implementation in Python
Let's implement logistic regression using scikit-learn. We'll use a classic example: predicting whether a customer will purchase a product based on their age and estimated salary.
STEP 1: Import Libraries and Create Sample Data
# Import required libraries
import numpy as np # For numerical operations
import pandas as pd # For data manipulation and analysis
from sklearn.model_selection import train_test_split # Split data into train/test
from sklearn.preprocessing import StandardScaler # Scale features to similar ranges
from sklearn.linear_model import LogisticRegression # Our classification algorithm
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report # Evaluation tools
# Sample data: Age, Salary → Purchased (0/1)
np.random.seed(42) # Set random seed for reproducibility
n_samples = 200 # Create 200 customer records
# Generate random ages between 18 and 65
age = np.random.randint(18, 65, n_samples)
# Generate random salaries between $20,000 and $150,000
salary = np.random.randint(20000, 150000, n_samples)
# Create target variable with a pattern:
# Higher age (>35) AND higher salary (>60000) → more likely to purchase
purchased = ((age > 35) & (salary > 60000)).astype(int) # 1 if both conditions true, else 0
# Add some noise (randomness) to make it realistic
# Randomly flip 20 labels to simulate real-world unpredictability
noise_idx = np.random.choice(n_samples, 20, replace=False)
purchased[noise_idx] = 1 - purchased[noise_idx] # Flip 0→1 and 1→0
# Create DataFrame for better visualization
df = pd.DataFrame({'Age': age, 'Salary': salary, 'Purchased': purchased})
print("First 5 customers:")
print(df.head())
print(f"\nPurchased rate: {purchased.sum() / len(purchased):.1%}") # Show class balance
We generate synthetic customer data with age and salary as features, and purchased (0 or 1) as the target. The pattern: customers over 35 with salary above $60,000 are more likely to purchase, with some random noise added.
STEP 2: Prepare and Split Data
# Prepare features (X) and target (y)
X = df[['Age', 'Salary']] # Features: what we use to make predictions
y = df['Purchased'] # Target: what we want to predict (0 or 1)
# Split data: 80% for training, 20% for testing
# WHY? We train on 80% and test on the unseen 20% to measure real-world performance
X_train, X_test, y_train, y_test = train_test_split(
X, y,
test_size=0.2, # 20% for testing
random_state=42, # Same seed = same split every time (reproducible results)
stratify=y # Keep same class ratio in train and test sets
)
# Scale features - CRITICAL for logistic regression!
# WHY? Age (18-65) and Salary (20000-150000) have very different ranges
# Without scaling, salary would dominate because its numbers are much larger
scaler = StandardScaler() # Converts each feature to mean=0, std=1
# fit_transform on training data: learn mean & std, then scale
X_train_scaled = scaler.fit_transform(X_train)
# transform on test data: use training mean & std to scale
# NEVER fit on test data! That would be cheating (data leakage)
X_test_scaled = scaler.transform(X_test)
print(f"Training samples: {len(X_train)} ({len(X_train)/len(X)*100:.0f}%)")
print(f"Test samples: {len(X_test)} ({len(X_test)/len(X)*100:.0f}%)")
print(f"\nBefore scaling - Age range: {X_train['Age'].min()}-{X_train['Age'].max()}")
print(f"Before scaling - Salary range: ${X_train['Salary'].min():,}-${X_train['Salary'].max():,}")
print(f"\nAfter scaling - Both features now have similar range around 0!")
We separate features (X) from target (y), split into train/test sets, and importantly scale the features. StandardScaler converts both Age and Salary to similar ranges (mean=0, std=1), preventing the larger salary values from dominating the model.
STEP 3: Train and Evaluate Model
# Create the logistic regression model
# Default parameters are usually good to start with
model = LogisticRegression(
random_state=42, # For reproducible results
max_iter=1000 # Maximum iterations to find optimal weights
)
# Train the model on scaled training data
# This is where the algorithm learns the relationship between features and target
model.fit(X_train_scaled, y_train)
# Make predictions on test data
y_pred = model.predict(X_test_scaled) # Binary predictions: 0 or 1
y_prob = model.predict_proba(X_test_scaled) # Probability estimates: [P(class 0), P(class 1)]
# Evaluate performance
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2%}") # What percentage did we get right?
print(f"\nThis means we correctly predicted {accuracy:.1%} of customers!")
# Detailed classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['Not Purchased', 'Purchased']))
print("\nMetrics Explained:")
print(" • Precision: Of all we predicted 'Purchased', how many actually purchased?")
print(" • Recall: Of all who actually purchased, how many did we catch?")
print(" • F1-score: Harmonic mean of precision and recall (balanced metric)")
print(" • Support: Number of actual occurrences in test set")
The model learns the relationship between age/salary and purchase probability. predict() gives
binary predictions, while predict_proba() provides probability estimates for each class.
Understanding the Output
# Model coefficients show feature importance and direction
print("=== Model Coefficients (Weights) ===")
for feature, coef in zip(['Age', 'Salary'], model.coef_[0]):
direction = "increases" if coef > 0 else "decreases"
print(f" {feature}: {coef:.4f}")
print(f" → Each unit increase in {feature} {direction} purchase probability")
print(f" → {'Strong' if abs(coef) > 1 else 'Moderate' if abs(coef) > 0.5 else 'Weak'} effect\n")
print(f"Intercept (bias): {model.intercept_[0]:.4f}")
print(" → This is the baseline log-odds when all features are 0\n")
# Predict probabilities for a new customer
print("=== Example Prediction ===")
new_customer = [[45, 75000]] # 45 years old, $75,000 salary
new_customer_scaled = scaler.transform(new_customer) # Must scale new data too!
probability = model.predict_proba(new_customer_scaled)[0]
prediction = model.predict(new_customer_scaled)[0]
print(f"New customer: Age=45, Salary=$75,000")
print(f"\nProbabilities:")
print(f" • P(Not Purchase) = {probability[0]:.1%}")
print(f" • P(Purchase) = {probability[1]:.1%}")
print(f"\nPrediction: {'Will Purchase ✓' if prediction == 1 else 'Will Not Purchase ✗'}")
print(f"Confidence: {max(probability):.1%}")
# Let's test a few more scenarios
print("\n=== Testing Different Customer Profiles ===")
test_profiles = [
[25, 35000, "Young & Low Income"],
[30, 90000, "Young & High Income"],
[50, 45000, "Older & Low Income"],
[55, 120000, "Older & High Income"]
]
for age, salary, desc in test_profiles:
profile_scaled = scaler.transform([[age, salary]])
prob = model.predict_proba(profile_scaled)[0][1] # Probability of purchase
print(f"{desc:25} → {prob:.1%} chance of purchase")
Multi-class Classification
# Logistic Regression with multiple classes
from sklearn.datasets import load_iris
# Load Iris dataset (3 classes)
iris = load_iris()
X_iris, y_iris = iris.data, iris.target
# Split and scale
X_train_i, X_test_i, y_train_i, y_test_i = train_test_split(
X_iris, y_iris, test_size=0.2, random_state=42
)
# Multi-class logistic regression: with the lbfgs solver, scikit-learn
# uses the multinomial (softmax) formulation by default
multi_model = LogisticRegression(
solver='lbfgs', # Supports the multinomial loss
max_iter=1000,
random_state=42
)
multi_model.fit(X_train_i, y_train_i)
print(f"Multi-class Accuracy: {multi_model.score(X_test_i, y_test_i):.2%}")
Practice Questions: Logistic Regression
Test your understanding with these hands-on exercises.
Question: Create a logistic regression model to predict whether a student passes (1) or fails (0) based on hours studied and previous test score. Use the sample data below.
Solution:
# STEP 1: Create sample student data
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
data = {
'hours_studied': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'previous_score': [45, 50, 55, 60, 65, 70, 75, 80, 85, 90],
'passed': [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
}
df = pd.DataFrame(data)
# STEP 2: Prepare and split data
X = df[['hours_studied', 'previous_score']]
y = df['passed']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# STEP 3: Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# STEP 4: Train and evaluate
model = LogisticRegression()
model.fit(X_train_scaled, y_train)
print(f"Accuracy: {model.score(X_test_scaled, y_test):.2%}")
# STEP 5: Make a prediction for a new student
new_student = [[5, 68]] # 5 hours studied, 68 previous score
new_student_scaled = scaler.transform(new_student)
prob = model.predict_proba(new_student_scaled)[0]
print(f"\nNew student - Pass probability: {prob[1]:.2%}")
Question: Modify the LogisticRegression model to use stronger L2 regularization (C=0.1) and compare the results with the default regularization (C=1.0).
Solution:
# STEP 1: Train with default regularization
model_default = LogisticRegression(C=1.0, random_state=42)
model_default.fit(X_train_scaled, y_train)
default_acc = model_default.score(X_test_scaled, y_test)
# STEP 2: Train with stronger regularization
# C is inverse of regularization strength: smaller C = stronger regularization
model_regularized = LogisticRegression(
C=0.1, # Stronger regularization
penalty='l2', # L2 regularization (default)
random_state=42
)
model_regularized.fit(X_train_scaled, y_train)
reg_acc = model_regularized.score(X_test_scaled, y_test)
# STEP 3: Compare results
print(f"Default (C=1.0): {default_acc:.2%}")
print(f"Regularized (C=0.1): {reg_acc:.2%}")
print(f"\nDefault coefficients: {model_default.coef_[0]}")
print(f"Regularized coefficients: {model_regularized.coef_[0]}")
print("\nNote: Regularization shrinks coefficients, reducing overfitting!")
Question: Instead of the default 0.5 threshold, classify as positive only if probability ≥ 0.7. Compare precision/recall with different thresholds.
Solution:
from sklearn.metrics import classification_report
# STEP 1: Get probability predictions
y_prob = model.predict_proba(X_test_scaled)[:, 1] # Probability of class 1
# STEP 2: Default threshold (0.5)
y_pred_default = model.predict(X_test_scaled)
print("=== Default Threshold (0.5) ===")
print(classification_report(y_test, y_pred_default))
# STEP 3: Custom threshold (0.7) - more conservative
threshold = 0.7
y_pred_custom = (y_prob >= threshold).astype(int)
print("\n=== Custom Threshold (0.7) ===")
print(classification_report(y_test, y_pred_custom))
# STEP 4: Analyze trade-offs
print("\nInsight: Higher threshold means:")
print(" - Fewer positive predictions (more conservative)")
print(" - Higher precision (more confident when saying 'positive')")
print(" - Lower recall (miss some true positives)")
print(" - Use when false positives are costly!")
Question: Build a multi-class logistic regression classifier for the Iris dataset. Use 5-fold cross-validation to evaluate performance and tune the C parameter.
Solution:
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, GridSearchCV
import numpy as np
# STEP 1: Load Iris data (3 classes)
iris = load_iris()
X, y = iris.data, iris.target
# STEP 2: Test multi-class logistic regression with cross-validation
# Softmax (multinomial) regression is the default with the lbfgs solver
model = LogisticRegression(
solver='lbfgs',
max_iter=1000,
random_state=42
)
scores = cross_val_score(model, X, y, cv=5)
print(f"Cross-validation scores: {scores}")
print(f"Mean accuracy: {scores.mean():.2%} (+/- {scores.std():.2%})")
# STEP 3: Tune C parameter with GridSearchCV
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100]}
grid_search = GridSearchCV(model, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X, y)
# STEP 4: Display results
print(f"\nBest C value: {grid_search.best_params_['C']}")
print(f"Best CV accuracy: {grid_search.best_score_:.2%}")
# STEP 5: Train final model and show per-class performance
best_model = grid_search.best_estimator_
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
best_model.fit(X_train, y_train)
y_pred = best_model.predict(X_test)
print("\n=== Per-Class Performance ===")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
Decision Trees & Random Forests
Decision Trees are intuitive, interpretable models that make decisions by asking a series of yes/no questions. Random Forests combine multiple trees to create a powerful, robust ensemble that often outperforms individual models.
How Decision Trees Work
Imagine you're trying to decide if someone will like a movie. You might ask: "Do they like action movies?" If yes, you ask: "Are they over 25?" If yes again, you predict they'll like it. This is exactly how a decision tree works - it asks a series of yes/no questions to reach a decision.
A decision tree splits data based on feature values, creating a tree-like structure. At each node (decision point), it asks a question like "Is age > 30?" or "Is income > $50,000?" and branches left or right accordingly. It keeps asking questions until it reaches a leaf node (final answer) where it makes a prediction. The tree automatically learns which questions to ask and in what order by analyzing the training data!
Decision Tree
A flowchart-like model that makes predictions by learning simple decision rules from data features. Each internal node represents a "test" on a feature, each branch represents the outcome, and each leaf node holds a class label.
Key metrics: Gini Impurity and Entropy measure how "pure" each node is. The algorithm chooses splits that maximize purity (minimize impurity).
Gini Impurity
Simple Explanation: Measures how "mixed up" the classes are at a node.
Gini = 1 - Σ(pᵢ)²
Where pᵢ is the probability of class i. Gini = 0 means all samples belong to the same class (perfectly pure). Gini = 0.5 means classes are evenly mixed (most impure). The tree tries to minimize Gini impurity at each split.
Entropy
Simple Explanation: Measures disorder or randomness in the data.
Entropy = -Σ(pᵢ × log₂(pᵢ))
Entropy = 0 means all samples are the same class (ordered). High entropy means classes are randomly mixed (disordered). Information Gain measures how much a split reduces entropy - the tree picks splits with highest information gain.
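Both formulas above are simple enough to compute directly. A minimal sketch that checks the pure and maximally mixed cases for a binary node:

```python
import numpy as np

def gini(p):
    """Gini impurity: 1 - sum(p_i^2) over the class probabilities p."""
    p = np.asarray(p)
    return 1 - np.sum(p ** 2)

def entropy(p):
    """Entropy: -sum(p_i * log2(p_i)), treating 0 * log(0) as 0."""
    p = np.asarray(p)
    p = p[p > 0]  # skip empty classes to avoid log(0)
    return -np.sum(p * np.log2(p))

print(gini([1.0, 0.0]))     # 0.0 — pure node, all samples one class
print(gini([0.5, 0.5]))     # 0.5 — maximally mixed (binary case)
print(entropy([0.5, 0.5]))  # 1.0 — maximum disorder (binary case)
```

A split's information gain is just the parent's impurity minus the weighted average impurity of its children.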
Decision Tree Implementation
STEP 1: Load Data and Train Decision Tree
# Decision Tree for Classification
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import pandas as pd # Used below for the feature-importance table
# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target
feature_names = iris.feature_names
class_names = iris.target_names
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Create and train Decision Tree
tree_model = DecisionTreeClassifier(
max_depth=3, # Limit depth to prevent overfitting
criterion='gini', # 'gini' or 'entropy'
random_state=42
)
tree_model.fit(X_train, y_train)
print(f"Training Accuracy: {tree_model.score(X_train, y_train):.2%}")
print(f"Test Accuracy: {tree_model.score(X_test, y_test):.2%}")
We limit max_depth=3 to prevent overfitting. The tree learns rules like "if petal_width ≤ 0.8 then Setosa"
by choosing splits that maximize information gain or minimize Gini impurity at each node.
STEP 2: Visualize the Tree Structure
# Visualize the Decision Tree
plt.figure(figsize=(20, 10))
plot_tree(
tree_model,
feature_names=feature_names,
class_names=class_names,
filled=True,
rounded=True,
fontsize=10
)
plt.title("Decision Tree for Iris Classification")
plt.tight_layout()
plt.show()
# Feature importance - which features matter most?
importance = pd.DataFrame({
'Feature': feature_names,
'Importance': tree_model.feature_importances_
}).sort_values('Importance', ascending=False)
print("Feature Importance:")
print(importance.to_string(index=False))
If training accuracy is much higher than test accuracy, the tree is overfitting; rein it in with max_depth, min_samples_split, or pruning.
Random Forests: The Power of Ensembles
A Random Forest is like asking 100 different experts for their opinion and taking a vote, instead of trusting just one expert. It builds many decision trees (often 100-500 trees) and combines their predictions. Each tree is trained on a different random subset of data (called bagging or bootstrap sampling) and considers different random subsets of features at each split. This randomness makes each tree unique and prevents them from all making the same mistakes.
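The bootstrap-and-vote idea can be sketched by hand before reaching for the built-in class. This is only an illustration of the mechanism (25 trees on Iris, each seeing a bootstrap sample and a random subset of features per split):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(42)

# Bagging by hand: each tree trains on a bootstrap sample (with replacement)
trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeClassifier(max_features='sqrt', random_state=0)
    trees.append(tree.fit(X[idx], y[idx]))

# Majority vote across all trees
votes = np.array([t.predict(X[:5]) for t in trees])  # shape (25, 5)
majority = np.array([np.bincount(col).argmax() for col in votes.T])
print(majority)  # ensemble prediction for the first 5 samples
```

In practice you would use RandomForestClassifier, which does all of this (plus out-of-bag scoring and parallelism) for you.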
# Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier
# Create Random Forest with 100 trees
rf_model = RandomForestClassifier(
n_estimators=100, # Number of trees
max_depth=5, # Depth per tree
min_samples_split=5, # Min samples to split
random_state=42,
n_jobs=-1 # Use all CPU cores
)
rf_model.fit(X_train, y_train)
print(f"Random Forest Training Accuracy: {rf_model.score(X_train, y_train):.2%}")
print(f"Random Forest Test Accuracy: {rf_model.score(X_test, y_test):.2%}")
# Compare single tree vs Random Forest
from sklearn.metrics import classification_report
# Single tree predictions
tree_pred = tree_model.predict(X_test)
# Random Forest predictions
rf_pred = rf_model.predict(X_test)
print("=== Single Decision Tree ===")
print(classification_report(y_test, tree_pred, target_names=class_names))
print("\n=== Random Forest (100 Trees) ===")
print(classification_report(y_test, rf_pred, target_names=class_names))
# Random Forest Feature Importance
rf_importance = pd.DataFrame({
'Feature': feature_names,
'Importance': rf_model.feature_importances_
}).sort_values('Importance', ascending=False)
print("Random Forest Feature Importance:")
print(rf_importance.to_string(index=False))
# Random Forest importance is more reliable than single tree
# because it's averaged across many trees
Key Hyperparameters
| Parameter | Description | Default | Tuning Tip |
|---|---|---|---|
| n_estimators | Number of trees in the forest | 100 | More trees = better, but slower. 100-500 is usually good. |
| max_depth | Maximum depth of each tree | None | Limit to 5-15 to prevent overfitting. |
| min_samples_split | Minimum samples to split a node | 2 | Higher values (5-10) add regularization. |
| max_features | Features to consider at each split | 'sqrt' | 'sqrt' or 'log2' for diversity between trees. |
Practice Questions: Trees & Forests
Practice building and optimizing tree-based models.
Question: Create a decision tree classifier for the Iris dataset, limit max_depth to 3, and extract the feature importances.
Solution:
# STEP 1: Load data and create decision tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import pandas as pd
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# STEP 2: Train decision tree with limited depth
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)
print(f"Test Accuracy: {tree.score(X_test, y_test):.2%}")
# STEP 3: Extract feature importances
importances = pd.DataFrame({
'Feature': iris.feature_names,
'Importance': tree.feature_importances_
}).sort_values('Importance', ascending=False)
print("\nFeature Importances:")
print(importances.to_string(index=False))
# STEP 4: Understand the output
print("\nThe most important feature has the highest impact on predictions!")
Question: Use GridSearchCV to find the optimal combination of n_estimators, max_depth, and min_samples_split for a Random Forest classifier.
Solution:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
# STEP 1: Define parameter grid
param_grid = {
'n_estimators': [50, 100, 200],
'max_depth': [3, 5, 7, 10],
'min_samples_split': [2, 5, 10]
}
# STEP 2: Perform grid search
grid_search = GridSearchCV(
RandomForestClassifier(random_state=42),
param_grid,
cv=5,
scoring='accuracy',
n_jobs=-1, # Use all CPU cores
verbose=1
)
grid_search.fit(X_train, y_train)
# STEP 3: Display results
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.2%}")
print(f"Test score: {grid_search.score(X_test, y_test):.2%}")
# STEP 4: Get feature importances from best model
best_rf = grid_search.best_estimator_
feature_imp = pd.DataFrame({
'Feature': iris.feature_names,
'Importance': best_rf.feature_importances_
}).sort_values('Importance', ascending=False)
print("\nOptimal Random Forest Feature Importances:")
print(feature_imp.to_string(index=False))
Question: Train both a single decision tree and a random forest on the same data, then compare their training vs test accuracy to demonstrate overfitting reduction.
Solution:
# STEP 1: Train decision tree without depth limit (will overfit)
tree_overfit = DecisionTreeClassifier(random_state=42) # No max_depth
tree_overfit.fit(X_train, y_train)
# STEP 2: Train random forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
# STEP 3: Compare training vs test accuracy
print("=== Single Decision Tree ===")
print(f"Training Accuracy: {tree_overfit.score(X_train, y_train):.2%}")
print(f"Test Accuracy: {tree_overfit.score(X_test, y_test):.2%}")
print(f"Overfitting gap: {tree_overfit.score(X_train, y_train) - tree_overfit.score(X_test, y_test):.2%}")
print("\n=== Random Forest ===")
print(f"Training Accuracy: {rf.score(X_train, y_train):.2%}")
print(f"Test Accuracy: {rf.score(X_test, y_test):.2%}")
print(f"Overfitting gap: {rf.score(X_train, y_train) - rf.score(X_test, y_test):.2%}")
print("\nInsight: Random Forest has smaller train-test gap = less overfitting!")
Question: Use cost-complexity pruning (ccp_alpha) to find the optimal tree complexity that balances bias and variance.
Solution:
import matplotlib.pyplot as plt
# STEP 1: Get pruning path
tree_full = DecisionTreeClassifier(random_state=42)
path = tree_full.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas[:-1] # Remove max alpha
# STEP 2: Train trees with different alpha values
train_scores = []
test_scores = []
for ccp_alpha in ccp_alphas:
tree = DecisionTreeClassifier(ccp_alpha=ccp_alpha, random_state=42)
tree.fit(X_train, y_train)
train_scores.append(tree.score(X_train, y_train))
test_scores.append(tree.score(X_test, y_test))
# STEP 3: Find optimal alpha
best_idx = test_scores.index(max(test_scores))
best_alpha = ccp_alphas[best_idx]
print(f"Optimal ccp_alpha: {best_alpha:.4f}")
print(f"Best test accuracy: {max(test_scores):.2%}")
# STEP 4: Visualize pruning effect
plt.figure(figsize=(10, 5))
plt.plot(ccp_alphas, train_scores, marker='o', label='Train', drawstyle="steps-post")
plt.plot(ccp_alphas, test_scores, marker='o', label='Test', drawstyle="steps-post")
plt.axvline(best_alpha, color='r', linestyle='--', label=f'Best alpha={best_alpha:.4f}')
plt.xlabel('ccp_alpha (complexity parameter)')
plt.ylabel('Accuracy')
plt.title('Pruning Path: Finding Optimal Tree Complexity')
plt.legend()
plt.grid(True)
plt.show()
print("\nAs alpha increases, tree becomes simpler (less overfitting but may underfit)")
Support Vector Machines (SVM)
SVM is a powerful classification algorithm that finds the optimal hyperplane separating different classes. It excels in high-dimensional spaces and can handle non-linear boundaries using the "kernel trick."
The Intuition Behind SVM
Imagine you have red dots and blue dots on a piece of paper, and you need to draw a line to separate them. There are many lines you could draw, but SVM finds the best line - the one with the most "breathing room" on both sides. This "breathing room" is called the margin.
Think of it like drawing a road between two neighborhoods. SVM makes the road as wide as possible while still separating the neighborhoods. The houses closest to the road (called support vectors) determine where the road goes - that's why they're so important! In higher dimensions (more than 2 features), the "line" becomes a hyperplane, but the concept is the same: maximize the margin for better generalization.
Support Vector Machine
A supervised learning algorithm that finds the hyperplane with maximum margin between classes. The data points closest to the decision boundary are called support vectors - they "support" and define the position of the hyperplane.
Key concepts: Hyperplane (decision boundary), Margin (distance from hyperplane to nearest points), Support Vectors (critical boundary points), Kernel (for non-linear separation).
Maximum Margin
SVM maximizes the "street" between classes for better generalization.
Support Vectors
Only the points closest to the boundary matter - the rest are ignored.
Kernel Trick
Transform data to higher dimensions for non-linear separation.
Linear SVM Implementation
Let's start with a simple linear SVM to understand the basics. We'll generate synthetic data that's linearly separable (can be divided by a straight line).
# Support Vector Machine for Classification
from sklearn.svm import SVC # Support Vector Classifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
import numpy as np
# Generate sample 2D data that's linearly separable
# WHY 2D? Easy to visualize, but concepts extend to any dimensions
X, y = make_classification(
n_samples=200, # 200 data points
n_features=2, # 2 features (x and y coordinates for visualization)
n_informative=2, # Both features are useful for classification
n_redundant=0, # No redundant/correlated features
n_clusters_per_class=1, # Each class forms one cluster
random_state=42 # Reproducible results
)
# Split data into train (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Scale features - ABSOLUTELY CRITICAL for SVM!
# WHY? SVM uses distances between points. Features with larger ranges
# would dominate the distance calculations, making other features irrelevant
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # Learn scaling from training data
X_test_scaled = scaler.transform(X_test) # Apply same scaling to test data
print("Data shape:", X_train.shape)
print(f"Class distribution: {np.bincount(y_train)} (should be roughly balanced)\n")
# Create linear SVM classifier
# 'linear' kernel = find straight line (hyperplane) to separate classes
svm_linear = SVC(
kernel='linear', # Use linear kernel (straight line decision boundary)
C=1.0, # Regularization: lower = simpler model, higher = fit training data better
random_state=42
)
# Train the model - this finds the optimal hyperplane
svm_linear.fit(X_train_scaled, y_train)
# Evaluate
accuracy = svm_linear.score(X_test_scaled, y_test)
print(f"Linear SVM Accuracy: {accuracy:.2%}")
# Support vectors are the critical points that define the decision boundary
print(f"Number of support vectors: {len(svm_linear.support_vectors_)}")
print(f" → Only {len(svm_linear.support_vectors_)} out of {len(X_train)} training points actually matter!")
print(f" → These are the points closest to the decision boundary")
print(f" → Moving other points won't change the boundary at all")
The Kernel Trick
Real-world data often isn't linearly separable - you can't draw a straight line to separate the classes. Imagine trying to separate two groups of dots arranged in circles - no straight line works!
The kernel trick is like a magic spell that transforms your data into a higher dimension where a straight line (or hyperplane) can separate them. Think of it like this: if you have overlapping circles on a flat paper (2D), you could lift one circle up into 3D space, and now a flat plane can separate them! The amazing part is that SVM does this transformation implicitly - it never actually moves the data, but mathematically acts as if it did. This makes it computationally efficient even in very high dimensions.
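The "lift the circle into 3D" picture can be made concrete with an explicit feature map. This sketch adds the squared radius as a third feature; accuracies are computed on the training data purely for illustration:

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings — no straight line separates them in 2D
X, y = make_circles(n_samples=200, noise=0.05, factor=0.3, random_state=42)

# Lift to 3D by appending the squared radius x1² + x2².
# In the new space the rings sit at different "heights",
# so a flat plane (a linear SVM) separates them easily.
r2 = (X ** 2).sum(axis=1, keepdims=True)
X_3d = np.hstack([X, r2])

acc_2d = SVC(kernel='linear').fit(X, y).score(X, y)
acc_3d = SVC(kernel='linear').fit(X_3d, y).score(X_3d, y)
print(f"Linear SVM in 2D: {acc_2d:.2%}")
print(f"Linear SVM in 3D: {acc_3d:.2%}")
```

A kernel SVM achieves the same effect without ever constructing the extra feature: the kernel computes dot products *as if* the data lived in the lifted space.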
| Kernel | Best For | How It Works | Formula |
|---|---|---|---|
| linear | Linearly separable data, text classification, high-dimensional data | Finds a straight line/hyperplane. Fastest option. Use when classes are clearly separated or you have many features (>10,000). | K(x, y) = x·y |
| rbf (Radial Basis Function) | Most common choice, handles non-linear patterns, general-purpose | Creates circular decision boundaries. Can fit complex curves. The 'go-to' kernel when you're not sure. Most flexible. | K(x, y) = exp(-γ‖x-y‖²), where γ controls the influence radius |
| poly (Polynomial) | Polynomial relationships, image classification | Creates polynomial curves as boundaries. The degree parameter controls complexity (2 = quadratic, 3 = cubic, etc.). Can be unstable. | K(x, y) = (γx·y + r)^d, where d is the degree |
| sigmoid | Neural network-like behavior (rarely used) | Similar to a neural network activation. Historically used, but RBF usually performs better. Mainly for research. | K(x, y) = tanh(γx·y + r) |
Quick Decision Guide:
1. Try linear first - fastest, works surprisingly often
2. If linear fails, use RBF - handles most non-linear cases
3. Consider poly only if you know data has polynomial relationship
4. Always tune C and gamma parameters using GridSearchCV
# RBF Kernel - handles non-linear decision boundaries
from sklearn.datasets import make_circles
# Create non-linearly separable data
X_circles, y_circles = make_circles(
n_samples=200,
noise=0.1,
factor=0.3,
random_state=42
)
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(
X_circles, y_circles, test_size=0.2, random_state=42
)
# Compare Linear vs RBF kernel
svm_linear = SVC(kernel='linear')
svm_rbf = SVC(kernel='rbf', gamma='scale')
svm_linear.fit(X_train_c, y_train_c)
svm_rbf.fit(X_train_c, y_train_c)
print(f"Linear kernel on circles: {svm_linear.score(X_test_c, y_test_c):.2%}")
print(f"RBF kernel on circles: {svm_rbf.score(X_test_c, y_test_c):.2%}")
Key Hyperparameters Explained in Detail
# Important SVM parameters - let's understand what they do
svm_tuned = SVC(
kernel='rbf', # Kernel choice (we'll use RBF for non-linear data)
C=1.0, # Regularization parameter
gamma='scale', # Kernel coefficient
random_state=42
)
# ========================================
# PARAMETER 1: C (Regularization)
# ========================================
# C controls the trade-off between:
# 1. Having a smooth decision boundary (generalization)
# 2. Classifying training points correctly (accuracy on training data)
# Low C (e.g., 0.1):
# → MORE regularization = simpler model
# → Allows some misclassifications
# → Wider margin (more tolerance)
# → Better for noisy data
# → Risk: May underfit (too simple)
# High C (e.g., 100):
# → LESS regularization = complex model
# → Tries to classify every training point correctly
# → Narrow margin (less tolerance)
# → Fits training data very closely
# → Risk: May overfit (too specific to training data)
# Default C=1.0 is usually a good starting point
print("C Parameter Examples:")
for C_val in [0.1, 1.0, 10.0]:
svm = SVC(kernel='rbf', C=C_val, random_state=42)
svm.fit(X_train_scaled, y_train)
train_acc = svm.score(X_train_scaled, y_train)
test_acc = svm.score(X_test_scaled, y_test)
print(f"C={C_val:5.1f}: Train={train_acc:.2%}, Test={test_acc:.2%}, "
f"Support Vectors={len(svm.support_vectors_)}")
print("\n→ Notice: Higher C = higher training accuracy and usually fewer support vectors (narrower margin)")
print("→ Goal: Find C where test accuracy is highest (not training!)\n")
# ========================================
# PARAMETER 2: gamma (Kernel Coefficient)
# ========================================
# gamma controls how far the influence of a single training example reaches
# Think of it as the "reach" or "spread" of each point
# Low gamma (e.g., 0.001):
# → Far reach = each point influences distant points
# → Smoother decision boundaries
# → More general, less complex
# → Risk: May underfit
# High gamma (e.g., 10):
# → Close reach = only nearby points are influenced
# → Tight, wiggly decision boundaries
# → Very specific to training data
# → Risk: Overfits easily
# 'scale' (default): gamma = 1 / (n_features * X.var())
# 'auto': gamma = 1 / n_features
print("gamma Parameter Examples:")
for gamma_val in [0.01, 0.1, 1.0]:
svm = SVC(kernel='rbf', gamma=gamma_val, random_state=42)
svm.fit(X_train_scaled, y_train)
train_acc = svm.score(X_train_scaled, y_train)
test_acc = svm.score(X_test_scaled, y_test)
print(f"gamma={gamma_val:5.2f}: Train={train_acc:.2%}, Test={test_acc:.2%}")
print("\n→ Notice: Higher gamma can lead to overfitting (high train, low test)")
print("→ Start with 'scale' or 'auto', then tune if needed\n")
# PRACTICAL ADVICE:
print("=== Practical Tuning Strategy ===")
print("1. Start with C=1.0 and gamma='scale'")
print("2. If underfitting: increase C (try 10, 100)")
print("3. If overfitting: decrease C (try 0.1, 0.01)")
print("4. Fine-tune gamma only after C is reasonable")
print("5. Use GridSearchCV to test combinations systematically")
# Grid search for optimal C and gamma
from sklearn.model_selection import GridSearchCV
param_grid = {
'C': [0.1, 1, 10, 100],
'gamma': ['scale', 0.1, 0.01, 0.001],
'kernel': ['rbf']
}
grid_search = GridSearchCV(SVC(), param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train_scaled, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV accuracy: {grid_search.best_score_:.2%}")
Practice Questions: SVM
Master SVM with these kernel and hyperparameter tuning exercises.
Question: SVM by default doesn't output probabilities. Train an SVM that provides probability estimates for each prediction.
Show Solution
# STEP 1: Enable probability estimates
svm_prob = SVC(kernel='rbf', probability=True, random_state=42)
svm_prob.fit(X_train_scaled, y_train)
# STEP 2: Get probability predictions
probs = svm_prob.predict_proba(X_test_scaled)
print(f"First sample probabilities: {probs[0]}")
print(f"Shape: {probs.shape} # (n_samples, n_classes)")
# STEP 3: Get class with highest probability
predicted_class = probs[0].argmax()
confidence = probs[0][predicted_class]
print(f"\nPredicted class: {predicted_class} with {confidence:.2%} confidence")
print("\nNote: probability=True uses 5-fold CV internally, making training slower!")
Question: Create non-linearly separable data using make_moons, then train both linear and RBF kernel SVMs to compare their performance.
Show Solution
from sklearn.datasets import make_moons
# STEP 1: Create non-linear data (moon shapes)
X, y = make_moons(n_samples=200, noise=0.15, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# STEP 2: Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# STEP 3: Train with linear kernel
svm_linear = SVC(kernel='linear')
svm_linear.fit(X_train_scaled, y_train)
linear_score = svm_linear.score(X_test_scaled, y_test)
# STEP 4: Train with RBF kernel
svm_rbf = SVC(kernel='rbf', gamma='scale')
svm_rbf.fit(X_train_scaled, y_train)
rbf_score = svm_rbf.score(X_test_scaled, y_test)
# STEP 5: Compare results
print(f"Linear kernel accuracy: {linear_score:.2%}")
print(f"RBF kernel accuracy: {rbf_score:.2%}")
print(f"\nRBF wins by: {rbf_score - linear_score:.2%}")
print("\nLesson: RBF kernel handles non-linear boundaries much better!")
Question: Train SVMs with different C values (0.1, 1, 10, 100) and observe how it affects training vs test accuracy.
Show Solution
# STEP 1: Test different C values
C_values = [0.1, 1, 10, 100]
results = []
for C in C_values:
svm = SVC(kernel='rbf', C=C, random_state=42)
svm.fit(X_train_scaled, y_train)
train_acc = svm.score(X_train_scaled, y_train)
test_acc = svm.score(X_test_scaled, y_test)
gap = train_acc - test_acc
results.append({
'C': C,
'Train': train_acc,
'Test': test_acc,
'Gap': gap
})
# STEP 2: Display results
import pandas as pd
df_results = pd.DataFrame(results)
print(df_results.to_string(index=False))
# STEP 3: Interpret
print("\nObservations:")
print("- Low C (0.1): More regularization, simpler boundary, may underfit")
print("- High C (100): Less regularization, complex boundary, may overfit")
print("- Optimal C: Balance between train and test accuracy")
Question: Perform comprehensive hyperparameter tuning for SVM, testing multiple kernels (linear, rbf, poly), C values, and gamma values. Find the best combination.
Show Solution
from sklearn.model_selection import GridSearchCV
# STEP 1: Define comprehensive parameter grid
param_grid = {
'kernel': ['linear', 'rbf', 'poly'],
'C': [0.1, 1, 10, 100],
'gamma': ['scale', 'auto', 0.1, 0.01, 0.001]
}
# STEP 2: Perform grid search
grid_search = GridSearchCV(
SVC(random_state=42),
param_grid,
cv=5,
scoring='accuracy',
n_jobs=-1,
verbose=1
)
grid_search.fit(X_train_scaled, y_train)
# STEP 3: Display best parameters
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.2%}")
print(f"Test score: {grid_search.score(X_test_scaled, y_test):.2%}")
# STEP 4: Analyze top 5 configurations
import pandas as pd
results_df = pd.DataFrame(grid_search.cv_results_)
top_5 = results_df.nlargest(5, 'mean_test_score')[[
'param_kernel', 'param_C', 'param_gamma', 'mean_test_score'
]]
print("\nTop 5 Configurations:")
print(top_5.to_string(index=False))
# STEP 5: Get best model
best_svm = grid_search.best_estimator_
print(f"\nNumber of support vectors: {len(best_svm.support_vectors_)}")
K-Nearest Neighbors (KNN)
KNN is one of the simplest yet most intuitive classification algorithms. It classifies a new data point based on the majority vote of its K nearest neighbors - literally "tell me who your friends are, and I'll tell you who you are."
How KNN Works
KNN is called a lazy learner (or instance-based learner) because it doesn't actually "learn" anything during training - it just memorizes all the training data! Think of it like a student who doesn't study before the exam but brings all their notes and textbooks into the test.
When you ask KNN to predict a new data point, here's what happens: It looks at your new point and finds the K closest training examples (neighbors). If K=5, it finds the 5 nearest neighbors. Then it takes a democratic vote - if 4 out of 5 neighbors are "Class A", the new point is predicted as "Class A". The key idea: "You are who your friends are." Similar data points should have similar labels. This makes KNN very intuitive but can be slow on large datasets since it needs to calculate distances to all training points for every prediction.
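The prediction procedure described above can be sketched in a few lines of NumPy. This is a toy implementation for illustration only, not a replacement for scikit-learn's optimized KNeighborsClassifier:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    # 1. Compute the Euclidean distance from x_new to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # 2. Find the indices of the k closest neighbors
    nearest = np.argsort(distances)[:k]
    # 3. Majority vote among their labels
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

# Tiny example: two clusters of "friends"
X_train = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([1.5, 1.5]), k=3))  # near cluster 0
print(knn_predict(X_train, y_train, np.array([8.5, 8.5]), k=3))  # near cluster 1
```

Note that all the work happens at prediction time: there is no `fit` step beyond storing the training arrays, which is exactly why KNN is called a lazy learner.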
K-Nearest Neighbors
A non-parametric algorithm that classifies data points based on the class of their K nearest neighbors in the feature space. The distance between points is typically measured using Euclidean distance.
Key idea: "Similar things are near each other." A point is likely to belong to the same class as its neighbors.
- Simple to understand and implement
- No training phase (lazy learning)
- Naturally handles multi-class problems
- Non-parametric (no assumptions about data)
- Can adapt to new data instantly
- Slow prediction on large datasets
- Sensitive to irrelevant features
- Requires feature scaling
- Suffers from curse of dimensionality
- Memory-intensive (stores all data)
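The sensitivity to irrelevant features is easy to demonstrate. In this sketch (assumptions: Iris data, columns of pure random noise appended as fake features), distance calculations become dominated by the noise dimensions and accuracy degrades - a small-scale taste of the curse of dimensionality:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(42)

scores = []
for n_noise in [0, 10, 50]:
    # Append n_noise columns of pure random noise as irrelevant "features"
    noise = rng.normal(size=(X.shape[0], n_noise))
    X_noisy = np.hstack([X, noise])
    score = cross_val_score(KNeighborsClassifier(n_neighbors=5), X_noisy, y, cv=5).mean()
    scores.append(score)
    print(f"{n_noise:2d} noise features: CV accuracy = {score:.2%}")
```

This is also why feature selection and scaling matter so much more for KNN than for tree-based models.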
KNN Implementation
STEP 1: Prepare Data with Scaling
# K-Nearest Neighbors Classifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
import numpy as np
# Load data
iris = load_iris()
X, y = iris.data, iris.target
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Scale features - CRITICAL for KNN!
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Create KNN classifier with K=5
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)
print(f"KNN (K=5) Accuracy: {knn.score(X_test_scaled, y_test):.2%}")
Unlike other algorithms, KNN doesn't learn a model during training - it simply stores all training data. When predicting, it finds the 5 closest training samples and takes a majority vote. Scaling is critical because KNN uses distance calculations!
Choosing the Right K
The value of K is crucial and dramatically affects how your model behaves. Let's understand the trade-offs:
- Small K (e.g., 1-3): More sensitive to noise, complex decision boundaries, may overfit
  → Like asking only your closest friend for advice - if they're wrong, you're wrong. With K=1, every outlier or noisy point affects predictions. The decision boundary becomes very wiggly and complex.
- Large K (e.g., 15-25): Smoother boundaries, more robust, but may underfit
  → Like asking 20 people for advice - a more stable consensus, but it might miss subtle patterns. Too large and it becomes like asking the entire dataset, losing the "local" aspect of nearest neighbors.
- Rule of thumb: Start with K = √n (where n is the training set size)
  → For 100 samples, try K=10. This balances local and global information.
- Best practice: Use odd K for binary classification to avoid ties
  → With K=4, you might get 2 votes for each class - which one wins? Using K=5 prevents ties!
# Find optimal K using cross-validation
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt
k_range = range(1, 31)
k_scores = []
for k in k_range:
knn_temp = KNeighborsClassifier(n_neighbors=k)
# 5-fold cross-validation
scores = cross_val_score(knn_temp, X_train_scaled, y_train, cv=5)
k_scores.append(scores.mean())
# Find best K
best_k = k_range[np.argmax(k_scores)]
print(f"Best K: {best_k} with CV accuracy: {max(k_scores):.2%}")
# Plot K vs Accuracy
plt.figure(figsize=(10, 5))
plt.plot(k_range, k_scores, marker='o')
plt.xlabel('K (Number of Neighbors)')
plt.ylabel('Cross-Validation Accuracy')
plt.title('Finding Optimal K')
plt.axvline(x=best_k, color='r', linestyle='--', label=f'Best K={best_k}')
plt.legend()
plt.grid(True)
plt.show()
Distance Metrics: How KNN Measures "Closeness"
KNN relies on measuring distance between points. But what does "distance" mean? There are multiple ways to calculate it, and the choice can significantly impact your results!
| Metric | Formula | Intuition | Best For |
|---|---|---|---|
| euclidean (default) | √(Σ(xᵢ - yᵢ)²) | "As the crow flies": straight-line distance, like measuring with a ruler on a map. | Most common choice. Works well for continuous features. Natural for physical distances. |
| manhattan (city block) | Σ\|xᵢ - yᵢ\| | "City block distance": walking distance in a grid city (only horizontal/vertical moves). | High-dimensional data, when dimensions are independent. Less sensitive to outliers than Euclidean. |
| minkowski | (Σ\|xᵢ - yᵢ\|^p)^(1/p) | "Generalization": p=1 → Manhattan, p=2 → Euclidean. A flexible family of metrics. | When you want to experiment. Tune the p parameter to find what works best for your data. |
| chebyshev | max(\|xᵢ - yᵢ\|) | "Maximum difference": only the largest difference matters; other dimensions are ignored. | When one feature's difference dominates. Chess king moves (max of horizontal or vertical). |
| cosine | 1 - (x·y)/(‖x‖ ‖y‖) | "Angle between vectors": measures direction similarity, not magnitude. | Text data, recommender systems. When you care about patterns, not absolute values. |
Use Euclidean when:
• Continuous numerical features
• Physical measurements (height, weight, distance)
• Features have similar scales (after scaling)
• When you're not sure - it's the standard choice
Use Manhattan when:
• High-dimensional data (many features)
• Features are independent
• Data has outliers (Manhattan is more robust)
• Grid-like structure in your problem
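The formulas in the table are easy to verify by hand. A quick check on two small vectors in pure NumPy (SciPy's spatial.distance module provides the same metrics as functions):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])
diff = np.abs(x - y)  # per-dimension absolute differences: [3, 2, 0]

euclidean = np.sqrt((diff ** 2).sum())  # sqrt(9 + 4 + 0) = sqrt(13)
manhattan = diff.sum()                  # 3 + 2 + 0 = 5
chebyshev = diff.max()                  # max(3, 2, 0) = 3

print(f"Euclidean: {euclidean:.3f}")
print(f"Manhattan: {manhattan:.1f}")
print(f"Chebyshev: {chebyshev:.1f}")
```

Notice that Manhattan ≥ Euclidean ≥ Chebyshev always holds for the same pair of points, which is one reason Manhattan reacts less dramatically to a single large outlier dimension than squared Euclidean terms do.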
# Compare different distance metrics
from sklearn.neighbors import KNeighborsClassifier
print("=== Comparing Distance Metrics ===")
metrics = ['euclidean', 'manhattan', 'chebyshev', 'minkowski']
for metric in metrics:
# Train KNN with different distance metrics
knn = KNeighborsClassifier(
n_neighbors=5,
metric=metric,
n_jobs=-1 # Use all CPU cores
)
knn.fit(X_train_scaled, y_train)
train_acc = knn.score(X_train_scaled, y_train)
test_acc = knn.score(X_test_scaled, y_test)
print(f"{metric:12} - Train: {train_acc:.2%}, Test: {test_acc:.2%}")
print("\n→ Try all metrics! Performance varies by dataset")
print("→ Usually Euclidean or Manhattan work best")
print("→ Cosine is special - use for text/sparse data")
Weighted KNN
# Weighted KNN - closer neighbors have more influence
knn_weighted = KNeighborsClassifier(
n_neighbors=5,
weights='distance' # 'uniform' (default) or 'distance'
)
knn_weighted.fit(X_train_scaled, y_train)
print(f"Uniform weights: {knn.score(X_test_scaled, y_test):.2%}")
print(f"Distance weights: {knn_weighted.score(X_test_scaled, y_test):.2%}")
# Distance weighting: closer neighbors have more vote power
# Often helps when K is large
Practice Questions: KNN
Practice distance-based classification and finding optimal K.
Question: Create a KNN classifier with K=3 for the Iris dataset. Don't forget to scale features!
Show Solution
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
# STEP 1: Load and split data
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# STEP 2: Scale features (CRITICAL for KNN!)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# STEP 3: Train KNN with K=3
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train_scaled, y_train)
# STEP 4: Evaluate
print(f"Training Accuracy: {knn.score(X_train_scaled, y_train):.2%}")
print(f"Test Accuracy: {knn.score(X_test_scaled, y_test):.2%}")
print("\nRemember: KNN is a lazy learner - no training phase, just stores data!")
Question: Test K values from 1 to 30 and use cross-validation to find the optimal K that maximizes accuracy.
Show Solution
from sklearn.model_selection import cross_val_score
import numpy as np
import matplotlib.pyplot as plt
# STEP 1: Test different K values
k_range = range(1, 31)
k_scores = []
for k in k_range:
knn = KNeighborsClassifier(n_neighbors=k)
# 5-fold cross-validation
scores = cross_val_score(knn, X_train_scaled, y_train, cv=5, scoring='accuracy')
k_scores.append(scores.mean())
# STEP 2: Find best K
best_k = k_range[np.argmax(k_scores)]
best_score = max(k_scores)
print(f"Best K: {best_k}")
print(f"Best CV Accuracy: {best_score:.2%}")
# STEP 3: Visualize K vs Accuracy
plt.figure(figsize=(10, 5))
plt.plot(k_range, k_scores, marker='o', linewidth=2)
plt.axvline(best_k, color='r', linestyle='--', label=f'Best K={best_k}')
plt.xlabel('K (Number of Neighbors)', fontsize=12)
plt.ylabel('Cross-Validation Accuracy', fontsize=12)
plt.title('Finding Optimal K for KNN', fontsize=14)
plt.grid(True, alpha=0.3)
plt.legend()
plt.show()
# STEP 4: Train with optimal K
knn_best = KNeighborsClassifier(n_neighbors=best_k)
knn_best.fit(X_train_scaled, y_train)
print(f"\nTest Accuracy with K={best_k}: {knn_best.score(X_test_scaled, y_test):.2%}")
Question: Train KNN classifiers with different distance metrics (euclidean, manhattan, chebyshev) and compare their performance.
Show Solution
# STEP 1: Test different distance metrics
metrics = ['euclidean', 'manhattan', 'chebyshev', 'minkowski']
results = []
for metric in metrics:
knn = KNeighborsClassifier(n_neighbors=5, metric=metric)
knn.fit(X_train_scaled, y_train)
train_acc = knn.score(X_train_scaled, y_train)
test_acc = knn.score(X_test_scaled, y_test)
results.append({
'Metric': metric,
'Train Acc': f"{train_acc:.2%}",
'Test Acc': f"{test_acc:.2%}"
})
# STEP 2: Display results
import pandas as pd
df_results = pd.DataFrame(results)
print(df_results.to_string(index=False))
# STEP 3: Explain metrics
print("\nDistance Metrics Explained:")
print("- Euclidean: Straight-line distance (most common)")
print("- Manhattan: Sum of absolute differences (grid-like paths)")
print("- Chebyshev: Maximum difference along any dimension")
print("- Minkowski: Generalization (p=2 is Euclidean)")
Question: Compare uniform vs distance-weighted KNN. Show how distance weighting gives more influence to closer neighbors.
Show Solution
# STEP 1: Train with uniform weights (default)
knn_uniform = KNeighborsClassifier(
n_neighbors=10, # Use larger K to see difference
weights='uniform' # All neighbors vote equally
)
knn_uniform.fit(X_train_scaled, y_train)
# STEP 2: Train with distance weights
knn_weighted = KNeighborsClassifier(
n_neighbors=10,
weights='distance' # Closer neighbors have more influence
)
knn_weighted.fit(X_train_scaled, y_train)
# STEP 3: Get predictions with probabilities
sample = X_test_scaled[0:1]
prob_uniform = knn_uniform.predict_proba(sample)[0]
prob_weighted = knn_weighted.predict_proba(sample)[0]
print("=== Sample Prediction Probabilities ===")
print(f"Uniform weights: {prob_uniform}")
print(f"Distance weights: {prob_weighted}")
# STEP 4: Compare accuracies
print("\n=== Accuracy Comparison ===")
print(f"Uniform KNN: {knn_uniform.score(X_test_scaled, y_test):.2%}")
print(f"Weighted KNN: {knn_weighted.score(X_test_scaled, y_test):.2%}")
# STEP 5: Test on different K values
print("\n=== K Value Impact ===")
for k in [3, 5, 10, 15, 20]:
knn_u = KNeighborsClassifier(n_neighbors=k, weights='uniform')
knn_w = KNeighborsClassifier(n_neighbors=k, weights='distance')
knn_u.fit(X_train_scaled, y_train)
knn_w.fit(X_train_scaled, y_train)
acc_u = knn_u.score(X_test_scaled, y_test)
acc_w = knn_w.score(X_test_scaled, y_test)
print(f"K={k:2d} Uniform: {acc_u:.2%} Weighted: {acc_w:.2%} Diff: {acc_w-acc_u:+.2%}")
print("\nInsight: Distance weighting often helps with larger K values!")
Naive Bayes
Naive Bayes is a fast, probabilistic classifier based on Bayes' theorem with a "naive" assumption of feature independence. Despite its simplicity, it often performs surprisingly well, especially for text classification tasks like spam detection.
Bayes' Theorem
Naive Bayes applies Bayes' theorem to calculate probabilities. Let's understand it with an example: If you see an email with words "free", "winner", and "claim", what's the probability it's spam?
Bayes' theorem lets us flip the question around. Instead of asking "What's P(spam | these words)?", we calculate it from easier-to-find probabilities: "What's P(these words | spam)?" and "What's P(spam in general)?" This clever reversal makes the math much simpler!
Naive Bayes Classifier
A probabilistic classifier that uses Bayes' theorem with the "naive" assumption that all features are independent of each other given the class label. Despite this unrealistic assumption, it often performs remarkably well in practice.
Bayes' Theorem: P(Class|Features) = P(Features|Class) × P(Class) / P(Features)
For example, in spam detection, it assumes seeing the word "free" is independent of seeing "winner" - but spam emails often use these words together! In movie reviews, words like "amazing" and "brilliant" tend to appear together, not independently.
Despite this unrealistic assumption, Naive Bayes works surprisingly well in practice! Why? Because for classification, we just need to know which class is more likely, not the exact probability. Even if the calculated probability is wrong, the ranking (Class A more likely than Class B) is often correct. Plus, the simplification makes it incredibly fast - perfect for real-time applications!
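The "naive" product can be written out directly. This is a toy hand-calculation with made-up word likelihoods and priors (purely for illustration, not estimated from any real corpus), showing how the independence assumption turns classification into simple multiplication:

```python
# Made-up likelihoods P(word | class) from a hypothetical corpus
p_word_given_spam = {'free': 0.30, 'winner': 0.20, 'meeting': 0.01}
p_word_given_ham  = {'free': 0.02, 'winner': 0.01, 'meeting': 0.25}
p_spam, p_ham = 0.4, 0.6  # class priors

def naive_score(words, likelihoods, prior):
    # Naive assumption: multiply per-word likelihoods as if independent
    score = prior
    for w in words:
        score *= likelihoods[w]
    return score

email = ['free', 'winner']
spam_score = naive_score(email, p_word_given_spam, p_spam)  # 0.4 × 0.30 × 0.20 = 0.024
ham_score = naive_score(email, p_word_given_ham, p_ham)     # 0.6 × 0.02 × 0.01 = 0.00012

# For classification we only need the ranking, not exact probabilities
print('SPAM' if spam_score > ham_score else 'HAM')
# Normalizing recovers P(spam | words) under the naive model
print(f"P(spam | words) ≈ {spam_score / (spam_score + ham_score):.2%}")
```

Even if the independence assumption distorts both scores, the classifier is still correct as long as the spam score stays larger than the ham score.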
Types of Naive Bayes: Which One to Use?
There are three main types of Naive Bayes classifiers, each designed for different kinds of data. Choosing the right one is crucial for good performance!
Gaussian NB - For: Continuous Features
Assumption: Each feature follows a normal (bell-curve) distribution within each class.
Examples:
- Height, weight, age
- Temperature, pressure
- Sensor measurements
- Physical measurements
When to use: When your features are real numbers that could reasonably be normally distributed.
Multinomial NB - For: Discrete Count Features
Assumption: Features represent counts or frequencies (non-negative integers).
Examples:
- Word counts in documents
- TF-IDF scores
- Number of occurrences
- Frequency distributions
When to use: BEST FOR TEXT! Spam detection, sentiment analysis, document classification.
Bernoulli NB - For: Binary Features
Assumption: Features are binary (0 or 1, present or absent, yes or no).
Examples:
- Word present/absent
- Feature exists or not
- Boolean flags
- Has symptom yes/no
When to use: When you only care if something exists, not how many times. Explicitly models absence.
Text classification? → Use Multinomial NB with TF-IDF
Continuous measurements? → Use Gaussian NB
Binary features only? → Use Bernoulli NB
Mixed features? → Try Gaussian first, or encode categorical as binary
Gaussian Naive Bayes
STEP 1: Train Gaussian NB on Continuous Features
# Gaussian Naive Bayes for continuous features
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
# Load Iris data (continuous features)
iris = load_iris()
X, y = iris.data, iris.target
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Train Gaussian Naive Bayes
gnb = GaussianNB()
gnb.fit(X_train, y_train)
# Predictions
y_pred = gnb.predict(X_test)
print(f"Gaussian NB Accuracy: {accuracy_score(y_test, y_pred):.2%}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
Gaussian NB assumes each feature follows a normal distribution within each class. It calculates mean and standard deviation for each feature per class during training, then uses Bayes' theorem to compute class probabilities for new samples.
STEP 2: Examine Probability Outputs
# Examine class probabilities
sample = X_test[0:1]
proba = gnb.predict_proba(sample)
print(f"Sample features: {sample[0]}")
print(f"Class probabilities: {proba[0]}")
print(f"Predicted class: {iris.target_names[gnb.predict(sample)[0]]}")
Multinomial Naive Bayes for Text
Multinomial NB is the go-to algorithm for text classification. It works with word counts or TF-IDF features.
STEP 1: Convert Text to Features and Train
# Text Classification with Multinomial Naive Bayes
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
# Sample text data
texts = [
"Free money!!! Click here to win big prizes",
"Urgent: You have won a lottery!",
"Meeting tomorrow at 10am in conference room",
"Project deadline extended to next Friday",
"Congratulations! You've been selected for a prize",
"Can we schedule a call for next week?",
"FREE iPhone!!! Act now limited time offer",
"Quarterly report is ready for review"
]
labels = [1, 1, 0, 0, 1, 0, 1, 0] # 1 = spam, 0 = not spam
# Convert text to features using TF-IDF
vectorizer = TfidfVectorizer()
X_text = vectorizer.fit_transform(texts)
# Train Multinomial Naive Bayes
mnb = MultinomialNB()
mnb.fit(X_text, labels)
# Test on new messages
new_messages = [
"Win a free vacation today!",
"Meeting rescheduled to 3pm"
]
new_vectors = vectorizer.transform(new_messages)
predictions = mnb.predict(new_vectors)
for msg, pred in zip(new_messages, predictions):
print(f"'{msg}' → {'SPAM' if pred else 'NOT SPAM'}")
The vectorizer converts text to numerical features (word importance scores). Multinomial NB then calculates P(Spam|words) vs P(Not Spam|words) using Bayes' theorem. Words like "free", "win", "urgent" increase spam probability.
Bernoulli Naive Bayes
# Bernoulli NB for binary features
from sklearn.naive_bayes import BernoulliNB
# Binary features (word present = 1, absent = 0)
vectorizer_binary = CountVectorizer(binary=True)
X_binary = vectorizer_binary.fit_transform(texts)
# Train Bernoulli NB
bnb = BernoulliNB()
bnb.fit(X_binary, labels)
# Bernoulli NB explicitly models absence of features
# Good when "word NOT present" is informative
Advantages & Use Cases
| Advantage | Explanation |
|---|---|
| Extremely Fast | Training and prediction are very quick, O(n×d) complexity |
| Works with Small Data | Requires less training data than complex models |
| Handles High Dimensions | Works well with thousands of features (text classification) |
| Probabilistic Output | Provides probability estimates, not just predictions |
| No Feature Scaling Needed | Immune to feature scale differences |
- Text classification (spam, sentiment, topic categorization)
- Real-time prediction needed (very fast)
- Working with limited training data
- Baseline model for comparison
- Features are somewhat independent
# Smoothing parameter (Laplace smoothing)
# Prevents zero probability for unseen feature values
mnb_smoothed = MultinomialNB(alpha=1.0) # Default Laplace smoothing
mnb_no_smooth = MultinomialNB(alpha=0.0) # No smoothing (risky!)
# alpha = 1.0: Laplace smoothing (adds 1 to all counts)
# alpha = 0.5: Lidstone smoothing
# alpha = 0.0: No smoothing (not recommended)
print("Smoothing prevents zero probabilities from unseen words!")
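Why smoothing matters can be shown with raw counts. In this toy sketch (hypothetical counts, purely for illustration), a single word never seen in the spam class zeroes out the entire product without smoothing, while adding alpha keeps the estimate positive:

```python
# Toy word counts for the spam class (hypothetical corpus)
spam_counts = {'free': 30, 'winner': 20, 'hello': 0}  # 'hello' never seen in spam
total = sum(spam_counts.values())  # 50
vocab = len(spam_counts)           # 3

def word_prob(word, alpha):
    # Laplace/Lidstone estimate: (count + alpha) / (total + alpha * vocab_size)
    return (spam_counts[word] + alpha) / (total + alpha * vocab)

# Without smoothing: P('hello' | spam) = 0, so ANY email containing
# 'hello' gets a spam score of exactly 0, however spammy the rest is
p_no_smooth = word_prob('free', 0) * word_prob('hello', 0)
p_smooth = word_prob('free', 1.0) * word_prob('hello', 1.0)

print(f"No smoothing: {p_no_smooth}")
print(f"alpha=1.0:    {p_smooth:.6f}")
```

This is exactly the failure mode alpha > 0 guards against in MultinomialNB and BernoulliNB.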
Practice Questions: Naive Bayes
Build probabilistic classifiers for text and numerical data.
Question: Create a complete spam classifier using Pipeline with TfidfVectorizer and MultinomialNB. Make it reusable and easy to deploy.
Show Solution
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
# STEP 1: Create training data
texts = [
"Free money!!! Click here to win",
"Urgent: You won a lottery!",
"Meeting at 10am in conference room",
"Project deadline extended",
"Congratulations! You've been selected",
"Can we schedule a call?",
"FREE iPhone!!! Act now",
"Quarterly report ready"
]
labels = [1, 1, 0, 0, 1, 0, 1, 0] # 1=spam, 0=ham
# STEP 2: Create pipeline
spam_classifier = Pipeline([
('vectorizer', TfidfVectorizer(
stop_words='english', # Remove common words
max_features=100 # Limit vocabulary
)),
('classifier', MultinomialNB())
])
# STEP 3: Train (one command!)
spam_classifier.fit(texts, labels)
# STEP 4: Test on new messages
test_msgs = [
"Buy now! Limited offer!",
"See you at lunch tomorrow",
"WINNER! Claim your prize now!",
"Can you review the document?"
]
predictions = spam_classifier.predict(test_msgs)
probs = spam_classifier.predict_proba(test_msgs)
for msg, pred, prob in zip(test_msgs, predictions, probs):
label = 'SPAM' if pred else 'HAM'
confidence = prob[pred]
print(f"{label} ({confidence:.2%}): {msg}")
Question: Train both GaussianNB (for continuous features) and MultinomialNB (for count features) on the Iris dataset and compare their performance.
Show Solution
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
# STEP 1: Load Iris data
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# STEP 2: Train Gaussian NB (for continuous features)
gnb = GaussianNB()
gnb.fit(X_train, y_train)
gnb_score = gnb.score(X_test, y_test)
gnb_cv = cross_val_score(gnb, X, y, cv=5).mean()
# STEP 3: Train Multinomial NB (requires non-negative features)
# Iris features are already non-negative, but MinMaxScaler puts them on a comparable [0, 1] scale
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
X_scaled = scaler.fit_transform(X)
mnb = MultinomialNB()
mnb.fit(X_train_scaled, y_train)
mnb_score = mnb.score(X_test_scaled, y_test)
mnb_cv = cross_val_score(mnb, X_scaled, y, cv=5).mean()
# STEP 4: Compare results
print("=== Performance Comparison ===")
print(f"Gaussian NB: Test={gnb_score:.2%} CV={gnb_cv:.2%}")
print(f"Multinomial NB: Test={mnb_score:.2%} CV={mnb_cv:.2%}")
print("\nGaussian NB is better for continuous features like Iris!")
Question: Test different alpha (smoothing) values for MultinomialNB and find the optimal one using cross-validation.
Solution:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.naive_bayes import MultinomialNB
import numpy as np
# STEP 1: Prepare text data (texts and labels from the previous spam example)
vectorizer = TfidfVectorizer(stop_words='english')
X_text = vectorizer.fit_transform(texts)
# STEP 2: Test different alpha values
alphas = [0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0]
results = []
for alpha in alphas:
    mnb = MultinomialNB(alpha=alpha)
    scores = cross_val_score(mnb, X_text, labels, cv=3)
    results.append({
        'Alpha': alpha,
        'CV Score': scores.mean(),
        'Std': scores.std()
    })
# STEP 3: Display results
import pandas as pd
df_results = pd.DataFrame(results)
print(df_results.to_string(index=False))
# STEP 4: Use GridSearchCV for optimal alpha
param_grid = {'alpha': np.logspace(-3, 1, 20)}
grid = GridSearchCV(MultinomialNB(), param_grid, cv=3, scoring='accuracy')
grid.fit(X_text, labels)
print(f"\nBest alpha: {grid.best_params_['alpha']:.4f}")
print(f"Best CV score: {grid.best_score_:.2%}")
# STEP 5: Explain smoothing
print("\nSmoothing (alpha) explained:")
print(" alpha=0: No smoothing (risky - zero probabilities possible)")
print(" alpha=1: Laplace smoothing (default, adds 1 to all counts)")
print(" alpha>1: More smoothing (for very sparse data)")
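The effect of alpha can be checked by hand. Multinomial Naive Bayes estimates P(word | class) = (count + alpha) / (total + alpha * vocab_size). A tiny illustration with made-up counts (not taken from the dataset above):

```python
def smoothed_prob(count, total, vocab_size, alpha):
    """Smoothed estimate of P(word | class) as used by Multinomial NB."""
    return (count + alpha) / (total + alpha * vocab_size)

# A word never seen in the "ham" class; vocabulary of 1000 words, 5000 ham tokens
print(smoothed_prob(0, 5000, 1000, alpha=0))  # -> 0.0: a zero wipes out the whole product
print(smoothed_prob(0, 5000, 1000, alpha=1))  # -> 1/6000: Laplace smoothing keeps it non-zero
```

With alpha=1, every word gets one phantom occurrence, so unseen words contribute a small but non-zero factor instead of zeroing the class probability.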
Question: Create a sentiment classifier (positive/negative reviews) and identify which words are most predictive of each sentiment.
Solution:
# STEP 1: Create sentiment dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
reviews = [
    "This movie was absolutely amazing! Loved it!",
    "Terrible film, waste of time and money",
    "Brilliant acting and great storyline",
    "Boring and predictable, very disappointed",
    "Outstanding performance, highly recommend",
    "Awful movie, couldn't wait for it to end",
    "Fantastic cinematography and soundtrack",
    "Poor script and bad acting",
    "Incredible experience, must watch!",
    "Worst movie I've ever seen"
]
sentiments = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]  # 1=positive, 0=negative
# STEP 2: Train sentiment classifier
vectorizer = TfidfVectorizer(stop_words='english', max_features=50)
X_reviews = vectorizer.fit_transform(reviews)
mnb = MultinomialNB()
mnb.fit(X_reviews, sentiments)
# STEP 3: Extract per-class word log-probabilities
import pandas as pd
feature_names = vectorizer.get_feature_names_out()
log_prob_pos = mnb.feature_log_prob_[1]  # Positive class
log_prob_neg = mnb.feature_log_prob_[0]  # Negative class
# Words most indicative of positive sentiment
pos_words = pd.DataFrame({
    'Word': feature_names,
    'Positive Score': log_prob_pos
}).nlargest(10, 'Positive Score')
# Words most indicative of negative sentiment
neg_words = pd.DataFrame({
    'Word': feature_names,
    'Negative Score': log_prob_neg
}).nlargest(10, 'Negative Score')
print("=== Top Positive Words ===")
print(pos_words.to_string(index=False))
print("\n=== Top Negative Words ===")
print(neg_words.to_string(index=False))
# STEP 4: Test on new reviews
test_reviews = [
    "Amazing movie, absolutely brilliant!",
    "Terrible and boring, don't waste your time"
]
test_vectors = vectorizer.transform(test_reviews)
predictions = mnb.predict(test_vectors)
probs = mnb.predict_proba(test_vectors)
print("\n=== Sentiment Predictions ===")
for review, pred, prob in zip(test_reviews, predictions, probs):
    sentiment = 'POSITIVE' if pred else 'NEGATIVE'
    confidence = prob[pred]
    print(f"{sentiment} ({confidence:.2%}): {review}")
print("\nThis demonstrates how Naive Bayes learns word-sentiment associations!")
Algorithm Comparison & Selection Guide
With 6 different classification algorithms, how do you choose? This guide compares them across key dimensions to help you make the right choice for your problem.
Comprehensive Algorithm Comparison
| Algorithm | Speed | Interpretability | Handles Non-Linear | Feature Scaling | Missing Values | Best Use Cases |
|---|---|---|---|---|---|---|
| Logistic Regression | Very fast | Excellent: coefficients can be interpreted directly | No: only linear boundaries | Required | Cannot handle: must impute first | Binary classification; when interpretability matters; baseline model; linear relationships |
| Decision Tree | Fast | Excellent: easy to visualize and explain | Yes: non-parametric | Not needed | Partial: can handle, but imputation is better | Clear decision rules; mixed feature types; feature interactions; quick prototyping |
| Random Forest | Medium: trains multiple trees | Medium: feature importance available | Yes: ensemble handles complexity | Not needed | Partial: like Decision Tree | High accuracy needed; reducing overfitting; imbalanced data; feature importance |
| SVM | Slow: especially on large datasets | Poor: hard to interpret | Yes: with RBF/poly kernels | Critical: must scale! | Cannot handle: must impute first | Small to medium data; high-dimensional data; clear margin separation; complex boundaries |
| KNN | Training: instant; prediction: slow | Medium: can visualize neighborhoods | Yes: no assumptions about data | Critical: distance-based! | Cannot handle: must impute first | Recommender systems; anomaly detection; small datasets; pattern recognition |
| Naive Bayes | Very fast | Good: probability-based | Limited: depends on variant | Not needed | Partial: depends on implementation | Text classification; spam filtering; real-time prediction; high-dimensional data |
Quick Reference: Problem → Algorithm
Problem: Spam / text classification → Best Choice: Multinomial Naive Bayes
Why?
- Text data (word counts/frequencies)
- Very fast training and prediction
- Handles high-dimensional data (many unique words)
- Industry standard for spam filtering
Problem: Medical diagnosis → Best Choice: Random Forest or Logistic Regression
Why?
- Random Forest: High accuracy, handles non-linear patterns
- Logistic Regression: Interpretable (doctors need to explain!)
- Both work well with medical measurements
- Consider Random Forest for accuracy, Logistic for interpretability
Problem: Recommender system → Best Choice: KNN
Why?
- Finds similar users based on ratings
- No training phase (lazy learning)
- Naturally captures similarity patterns
- Used by Netflix, Spotify in their systems
Problem: Fraud detection → Best Choice: Random Forest or SVM
Why?
- Random Forest: Handles imbalanced data well, high accuracy
- SVM: Good at finding rare fraud patterns
- Both handle complex, non-linear patterns in transactions
- Real-time prediction important → pre-train the model
Problem: Handwritten digit recognition → Best Choice: SVM with RBF kernel
Why?
- High-dimensional data (pixels as features)
- Excellent for pattern recognition
- Handles non-linear boundaries well
- MNIST dataset classic use case
- Note: Deep Learning (CNNs) even better for large image datasets!
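As a small sketch of this use case, scikit-learn's built-in digits dataset (8x8 images, a lightweight stand-in for MNIST) can be classified with a scaled RBF SVM; the pipeline is illustrative, not a tuned model:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# 8x8 grayscale digits; each of the 64 pixels is a feature
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scaling first is essential for SVM, so bundle it into a pipeline
svm_clf = make_pipeline(StandardScaler(), SVC(kernel='rbf', gamma='scale'))
svm_clf.fit(X_train, y_train)
print(f"Test accuracy: {svm_clf.score(X_test, y_test):.2%}")
```

Even without tuning, the RBF kernel handles the non-linear pixel patterns well on this small dataset.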
Problem: Risk scoring → Best Choice: Random Forest or Logistic Regression
Why?
- Random Forest: High accuracy, feature importance
- Logistic Regression: Fast, interpretable coefficients
- Both provide probability scores (useful for ranking risk)
- Can explain to business stakeholders
- Start with a simple baseline (Logistic Regression or Decision Tree)
- Try multiple algorithms
- Use cross-validation to compare fairly
- Consider your constraints (speed, interpretability, accuracy)
- Let the data and results guide your final choice
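The checklist above can be sketched end to end. A minimal illustration, assuming the Iris dataset stands in for your problem: fit several classifiers and compare them fairly with the same cross-validation splits:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42),
    'Gaussian NB': GaussianNB(),
}
cv_results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)  # same 5 folds for every model
    cv_results[name] = scores.mean()
    print(f"{name:20s} CV accuracy: {scores.mean():.2%} (+/- {scores.std():.2%})")
```

The printed means and standard deviations make the trade-offs concrete before you commit to one algorithm.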
Key Takeaways
Logistic Regression First
Start with logistic regression as a baseline. It's interpretable, fast, and often performs well on linearly separable data.
Random Forest > Single Tree
Random Forests combine multiple trees to reduce overfitting and improve accuracy. Prefer ensembles over single trees.
SVM Kernel Choice
Use RBF kernel for non-linear data, linear kernel for text/high-dimensional data. Always scale features first!
KNN Needs Scaling
KNN is distance-based, so feature scaling is critical. Use cross-validation to find optimal K value.
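A minimal sketch of both points, using the Iris dataset for illustration: putting the scaler inside a Pipeline means each cross-validation fold is scaled without leaking test-fold statistics, and GridSearchCV searches for the best K:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scaling inside the pipeline: fit on each training fold only (no leakage)
pipe = Pipeline([('scale', StandardScaler()), ('knn', KNeighborsClassifier())])
grid = GridSearchCV(pipe, {'knn__n_neighbors': range(1, 21)}, cv=5)
grid.fit(X, y)
print(f"Best K: {grid.best_params_['knn__n_neighbors']}, CV accuracy: {grid.best_score_:.2%}")
```

Without the scaler, features with large ranges would dominate the distance calculation and distort the neighborhoods.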
Naive Bayes for Text
Naive Bayes excels at text classification. It's fast, handles high dimensions, and works with small datasets.
No Free Lunch
No algorithm is best for all problems. Always try multiple classifiers and compare using cross-validation.
Knowledge Check
Quick Quiz
Test what you've learned about classification algorithms