Module 2.1

Machine Learning Foundations

Learn the core concepts that underpin all machine learning systems. Understand how machines learn from data, the difference between learning paradigms, and how to properly evaluate model performance to build reliable AI systems.

45 min
Beginner
Hands-on
What You'll Learn
  • Distinguish supervised from unsupervised learning
  • Split data into training, validation, and test sets
  • Identify and prevent overfitting and underfitting
  • Apply cross-validation for robust evaluation
  • Choose appropriate evaluation metrics for your task
Contents
01

What is Machine Learning?

Machine learning is a subset of artificial intelligence that enables computers to learn patterns from data and make decisions without being explicitly programmed for each task. Instead of writing rules manually, you provide data and let the algorithm discover the underlying patterns. This approach has revolutionized how we solve complex problems, from recognizing faces in photos to predicting stock prices and detecting fraud. In this section, you'll understand what makes machine learning different from traditional programming and how it forms the foundation for modern AI systems.

Key Concept

Machine Learning

Machine learning is a field of computer science that gives computers the ability to learn from data and improve their performance on a specific task without being explicitly programmed. The system builds a mathematical model based on training data to make predictions or decisions.

Arthur Samuel (1959): "Machine Learning is the field of study that gives computers the ability to learn without being explicitly programmed"

Traditional Programming vs Machine Learning

In traditional programming, you write explicit rules that tell the computer exactly what to do in every situation. For example, to detect spam emails, you might write rules like "if the email contains the word 'lottery', mark it as spam." The problem is that spammers constantly change their tactics, and you would need to manually update rules forever. Machine learning flips this approach. Instead of writing rules, you provide examples of spam and legitimate emails, and the algorithm learns to distinguish between them automatically. This data-driven approach adapts to new patterns without human intervention.
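
The rule-based approach can be sketched in a few lines. This is a hypothetical filter, invented purely for illustration: the keyword list and matching logic are assumptions, not taken from any real spam system.

```python
# A hypothetical rule-based spam filter illustrating the traditional
# approach: every rule is written by hand, so every new spam tactic
# requires a manual update to this list.
SPAM_KEYWORDS = {"lottery", "winner", "free", "prize", "urgent"}

def is_spam_rule_based(email_text):
    """Flag an email as spam if it contains any known spam keyword."""
    words = set(email_text.lower().split())
    return bool(words & SPAM_KEYWORDS)

print(is_spam_rule_based("You are a lottery winner!"))  # True
print(is_spam_rule_based("Meeting moved to 3pm"))       # False
```

Note the brittleness: a spammer who writes "l0ttery" slips straight past this filter, which is exactly the gap the data-driven approach closes.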

Traditional Programming

Input: Data + Rules

Output: Answers

You write explicit rules to process data and produce results. Works well for well-defined problems with clear logic

Machine Learning

Input: Data + Answers

Output: Rules (Model)

You provide examples with correct answers, and the algorithm discovers the rules automatically

How Machine Learning Works

The machine learning process follows a systematic workflow that transforms raw data into predictive models. First, you collect and prepare data relevant to your problem. Then you choose an appropriate algorithm and train it on your data, allowing it to learn patterns. The trained model is then evaluated on new data to measure its performance. If the results are satisfactory, the model can be deployed to make predictions on real-world data. This iterative process of training, evaluating, and refining is at the heart of all machine learning projects.

# Simple example: Predicting house prices
# Traditional approach - manual rules
def predict_price_traditional(sqft, bedrooms, location):
    base_price = 100000
    price = base_price + (sqft * 200)  # $200 per sqft
    price += bedrooms * 15000          # $15k per bedroom
    if location == "downtown":
        price *= 1.5                    # 50% premium
    return price

# Machine Learning approach - learn from data
from sklearn.linear_model import LinearRegression

# Training data: features and actual prices
features = [[1500, 3], [2000, 4], [1200, 2], [1800, 3]]  # [sqft, bedrooms]
prices = [300000, 450000, 250000, 380000]

# Train the model - it learns the patterns automatically
model = LinearRegression()
model.fit(features, prices)

# Predict price for a new house
new_house = [[1600, 3]]
predicted_price = model.predict(new_house)
print(f"Predicted price: ${predicted_price[0]:,.0f}")  # Predicted price: $338,947

The Machine Learning Pipeline

Every machine learning project follows a structured pipeline that ensures systematic development and reliable results. Understanding this pipeline is essential because skipping steps or doing them incorrectly leads to poor model performance. The pipeline includes data collection, preprocessing, feature engineering, model selection, training, evaluation, and deployment. Each step builds on the previous one, and iteration between steps is common as you refine your approach based on results.

1. Collect

Gather relevant data from various sources

2. Clean

Handle missing values and outliers

3. Engineer

Create and select meaningful features

4. Train

Fit model to training data

5. Evaluate

Measure performance on test data

6. Deploy

Put model into production
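
The six steps above can be sketched end to end. This is a minimal illustration on synthetic data; the dataset, engineered feature, and model choice are assumptions for demonstration, not recommendations.

```python
# End-to-end sketch of the six pipeline steps on synthetic data
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1. Collect: simulate gathering 200 labeled samples
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# 2. Clean: handle missing values (none here, shown for completeness)
X = np.nan_to_num(X)

# 3. Engineer: add an interaction feature
X = np.column_stack([X, X[:, 0] * X[:, 1]])

# 4. Train: fit the scaler and model on the training portion only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler().fit(X_train)
model = LogisticRegression()
model.fit(scaler.transform(X_train), y_train)

# 5. Evaluate: measure performance on held-out data
acc = accuracy_score(y_test, model.predict(scaler.transform(X_test)))
print(f"Test accuracy: {acc:.2%}")

# 6. Deploy: in production you would persist the fitted scaler and
# model (e.g. with joblib) and serve predictions behind an API
```

Iteration between steps is normal: a disappointing score at step 5 usually sends you back to steps 2 and 3 before you touch the model.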

Types of Machine Learning Problems

Machine learning problems generally fall into distinct categories based on the nature of the output you're trying to predict. Classification problems involve predicting discrete categories, like whether an email is spam or not spam. Regression problems involve predicting continuous values, like the price of a house or tomorrow's temperature. Clustering problems group similar data points together without predefined labels. Understanding which type of problem you're solving helps you choose the right algorithm and evaluation metrics.

# Classification: Predict discrete categories
from sklearn.tree import DecisionTreeClassifier

# Data: [age, income] -> will buy product? (0=No, 1=Yes)
features = [[25, 40000], [35, 60000], [45, 80000], [22, 35000]]
labels = [0, 1, 1, 0]  # Classification labels

classifier = DecisionTreeClassifier(random_state=42)
classifier.fit(features, labels)

prediction = classifier.predict([[30, 55000]])
# Both age and income separate this tiny dataset perfectly, so the
# prediction depends on which split the tree happens to choose
print(f"Will buy: {'Yes' if prediction[0] == 1 else 'No'}")

# Regression: Predict continuous values
from sklearn.linear_model import LinearRegression

# Data: [years experience] -> salary
experience = [[1], [3], [5], [7], [10]]
salaries = [45000, 55000, 65000, 78000, 95000]

regressor = LinearRegression()
regressor.fit(experience, salaries)

salary_pred = regressor.predict([[4]])
print(f"Predicted salary: ${salary_pred[0]:,.0f}")  # Predicted salary: $60,877
Pro Tip: The quality of your data is more important than the complexity of your algorithm. A simple model trained on high-quality, relevant data will often outperform a complex model trained on poor data. Always invest time in understanding and cleaning your data before jumping to model building.

Practice Questions

Task: Create a linear regression model to predict test scores based on hours studied. Train it on the given data and predict the score for 7 hours of study.

Given:

hours_studied = [[2], [3], [4], [5], [6]]
test_scores = [55, 62, 70, 78, 85]

Expected output: A predicted score around 92-93

Solution:
from sklearn.linear_model import LinearRegression

hours_studied = [[2], [3], [4], [5], [6]]
test_scores = [55, 62, 70, 78, 85]

model = LinearRegression()
model.fit(hours_studied, test_scores)

prediction = model.predict([[7]])
print(f"Predicted score: {prediction[0]:.1f}")  # Predicted score: 92.8

Task: Create a decision tree classifier to predict if a customer will churn based on their account age (months) and monthly charges. Train and predict for a new customer.

Given:

# [account_age_months, monthly_charges]
customer_data = [[12, 50], [24, 60], [6, 80], [36, 45], [3, 90]]
churned = [0, 0, 1, 0, 1]  # 0=stayed, 1=churned

# New customer to predict
new_customer = [[8, 75]]
Solution:
from sklearn.tree import DecisionTreeClassifier

customer_data = [[12, 50], [24, 60], [6, 80], [36, 45], [3, 90]]
churned = [0, 0, 1, 0, 1]

model = DecisionTreeClassifier(random_state=42)
model.fit(customer_data, churned)

new_customer = [[8, 75]]
prediction = model.predict(new_customer)
print(f"Will churn: {'Yes' if prediction[0] == 1 else 'No'}")  # Will churn: Yes

Task: Train a linear regression model on the house data below. Extract and print the coefficient and intercept. Then calculate what the model predicts as the price increase per additional square foot.

Given:

sqft = [[1000], [1500], [2000], [2500], [3000]]
prices = [150000, 225000, 300000, 375000, 450000]
Solution:
from sklearn.linear_model import LinearRegression

sqft = [[1000], [1500], [2000], [2500], [3000]]
prices = [150000, 225000, 300000, 375000, 450000]

model = LinearRegression()
model.fit(sqft, prices)

print(f"Coefficient (price per sqft): ${model.coef_[0]:.2f}")  # $150.00
print(f"Intercept (base price): ${model.intercept_:,.2f}")    # $0.00

# The coefficient tells us: each additional sqft adds $150 to the price
02

Learning Paradigms

Machine learning algorithms are categorized into three main paradigms based on how they learn from data. Supervised learning uses labeled examples to learn mappings from inputs to outputs. Unsupervised learning discovers hidden patterns in unlabeled data. Reinforcement learning learns through trial and error by receiving rewards or penalties. Understanding these paradigms is crucial because the type of problem you're solving determines which approach to use, what data you need, and how you evaluate success.

Supervised Learning

Supervised learning is the most common paradigm in machine learning. You provide the algorithm with input-output pairs (labeled data), and it learns to map inputs to outputs. Think of it like learning with a teacher who provides correct answers. The algorithm sees many examples, learns the relationship between inputs and outputs, and then applies that knowledge to predict outputs for new, unseen inputs. Supervised learning is used for classification (predicting categories) and regression (predicting continuous values).

Key Concept

Supervised Learning

A type of machine learning where the model is trained on labeled data, meaning each training example includes both the input features and the correct output (label). The model learns to predict the output for new, unseen inputs.

Examples: Email spam detection, image classification, house price prediction, medical diagnosis, credit scoring

# Supervised Learning Example: Email Spam Classification
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

# Labeled training data - we know which emails are spam
emails = [
    "Win a free iPhone now! Click here!",
    "Meeting scheduled for tomorrow at 3pm",
    "Congratulations! You won $1,000,000",
    "Please review the attached quarterly report",
    "Get rich quick! Limited time offer!"
]
labels = [1, 0, 1, 0, 1]  # 1 = spam, 0 = not spam

# Convert text to numerical features
vectorizer = CountVectorizer()
features = vectorizer.fit_transform(emails)

# Train the classifier
classifier = MultinomialNB()
classifier.fit(features, labels)

# Predict on new email
new_email = ["Free money! Act now before it's too late!"]
new_features = vectorizer.transform(new_email)
prediction = classifier.predict(new_features)
print(f"Spam: {'Yes' if prediction[0] == 1 else 'No'}")  # Spam: Yes

Unsupervised Learning

Unsupervised learning works with unlabeled data, where you don't provide correct answers to the algorithm. Instead, the algorithm explores the data and discovers hidden patterns, structures, or groupings on its own. This is like learning without a teacher. The algorithm finds natural clusters in the data, reduces dimensionality to find important features, or detects anomalies that don't fit the normal pattern. Unsupervised learning is invaluable when you have lots of data but no labels, or when you want to discover unknown patterns.

Key Concept

Unsupervised Learning

A type of machine learning where the model is trained on unlabeled data. The algorithm discovers hidden patterns, groupings, or structures in the data without being told what to look for.

Examples: Customer segmentation, anomaly detection, recommendation systems, topic modeling, data compression

# Unsupervised Learning Example: Customer Segmentation
from sklearn.cluster import KMeans
import numpy as np

# Customer data: [annual_income, spending_score] - NO LABELS
customers = np.array([
    [15, 39], [16, 81], [17, 6], [18, 77], [19, 40],
    [70, 29], [71, 72], [72, 5], [73, 75], [74, 35],
    [130, 45], [131, 78], [132, 12], [133, 70], [134, 42]
])

# Find 3 natural clusters in the data
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
kmeans.fit(customers)

# See which cluster each customer belongs to
print("Cluster assignments:", kmeans.labels_)
# e.g. [0 0 0 0 0 1 1 1 1 1 2 2 2 2 2] (exact numbering may vary)

# The algorithm found 3 segments: low-, medium-, and high-income
# customers; which numeric label each segment receives is arbitrary

# Predict the segment for a new customer (here, the medium-income group)
new_customer = [[75, 50]]
cluster = kmeans.predict(new_customer)
print(f"New customer belongs to cluster: {cluster[0]}")

Reinforcement Learning

Reinforcement learning takes a different approach where an agent learns by interacting with an environment. The agent takes actions, receives feedback in the form of rewards or penalties, and learns to maximize cumulative rewards over time. This is similar to how humans learn through trial and error. Reinforcement learning powers game-playing AI like AlphaGo, self-driving cars, and robotic control systems. While powerful, it requires careful design of the reward system and typically needs many interactions to learn effectively.

Key Concept

Reinforcement Learning

A type of machine learning where an agent learns to make decisions by performing actions in an environment and receiving rewards or penalties. The goal is to learn a policy that maximizes cumulative reward.

Examples: Game playing (Chess, Go), robotics, autonomous vehicles, trading systems, resource management
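
Unlike the supervised and unsupervised paradigms above, reinforcement learning has no dataset to fit; a sketch needs an environment to interact with. The toy 5-state corridor, rewards, and hyperparameters below are invented for illustration, using tabular Q-learning, the simplest classic RL algorithm.

```python
# Minimal tabular Q-learning sketch on a hypothetical 5-state corridor.
# The agent starts at state 0 and earns a reward of +1 only on reaching
# state 4; every other step gives 0 reward.
import random

N_STATES = 5
ACTIONS = [-1, +1]  # move left, move right
alpha, gamma, epsilon = 0.5, 0.9, 0.1  # learning rate, discount, exploration

Q = [[0.0, 0.0] for _ in range(N_STATES)]  # Q[state][action]

random.seed(0)
for episode in range(500):
    state = 0
    while state != N_STATES - 1:
        # Epsilon-greedy selection: mostly exploit, occasionally explore
        if random.random() < epsilon:
            a = random.randrange(2)
        else:
            a = 0 if Q[state][0] > Q[state][1] else 1
        next_state = min(max(state + ACTIONS[a], 0), N_STATES - 1)
        reward = 1.0 if next_state == N_STATES - 1 else 0.0
        # Q-learning update: nudge Q toward reward + discounted best future value
        Q[state][a] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][a])
        state = next_state

# After training, the greedy policy moves right in every non-terminal state
policy = ["left" if q[0] > q[1] else "right" for q in Q[:-1]]
print(policy)  # ['right', 'right', 'right', 'right']
```

The same update rule, scaled up with neural networks in place of the Q table, is the basis of deep RL systems like those behind game-playing agents.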

Comparing Learning Paradigms

Each learning paradigm has distinct characteristics that make it suitable for different types of problems. The key differences lie in the type of data required, how the algorithm learns, and what kind of problems it can solve. Choosing the right paradigm is one of the first decisions you'll make in any machine learning project, and it fundamentally shapes your approach.

| Aspect | Supervised | Unsupervised | Reinforcement |
|---|---|---|---|
| Data Type | Labeled (input-output pairs) | Unlabeled (inputs only) | State-action-reward sequences |
| Goal | Predict outputs for new inputs | Discover hidden patterns | Maximize cumulative reward |
| Feedback | Correct answers provided | No feedback | Delayed reward signals |
| Common Tasks | Classification, Regression | Clustering, Dimensionality Reduction | Game playing, Control |
| Example Algorithms | Linear Regression, Random Forest, SVM | K-Means, PCA, DBSCAN | Q-Learning, Policy Gradient |
# Quick reference: Choosing the right paradigm
def choose_paradigm():
    """
    Decision guide for selecting an ML paradigm:
    - Have labels + predicting an output -> Supervised
    - No labels + finding patterns       -> Unsupervised
    - Sequential decisions + rewards     -> Reinforcement
    """
    examples = {
        "supervised": [
            "Predict house prices (regression)",
            "Classify images as cat/dog (classification)",
            "Detect fraudulent transactions"
        ],
        "unsupervised": [
            "Group customers by behavior (clustering)",
            "Find unusual network activity (anomaly detection)",
            "Reduce features for visualization (dimensionality reduction)"
        ],
        "reinforcement": [
            "Train a robot to walk",
            "Optimize ad placement strategy",
            "Play video games"
        ]
    }
    return examples

# Practical tip: Start with supervised if you have labeled data,
# as it's the most straightforward and well-understood approach
Hybrid Approaches: Real-world problems often combine paradigms. Semi-supervised learning uses a small amount of labeled data with lots of unlabeled data. Self-supervised learning creates labels from the data itself. Transfer learning applies knowledge from one task to another. Don't feel constrained to use just one paradigm.

Practice Questions

Task: Use K-Means to cluster the following 2D points into 2 groups and print the cluster centers.

Given:

import numpy as np
points = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])
Solution:
from sklearn.cluster import KMeans
import numpy as np

points = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])

kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
kmeans.fit(points)

print("Cluster labels:", kmeans.labels_)  # e.g. [1 1 0 0 1 0]
print("Cluster centers:")
print(kmeans.cluster_centers_)
# One center near [7.33, 9.0] and the other near [1.17, 1.47]
# (which label each gets, and the printed order, may vary)

Task: Load the iris dataset, train a Random Forest classifier, and print the accuracy on the test set.

Solution:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train model
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Evaluate
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy:.2%}")  # Accuracy: 100.00%

Task: Write a function that takes a problem description and returns the appropriate ML paradigm. Test it with the scenarios below.

Scenarios:

scenarios = [
    "Predict stock prices using historical data with known outcomes",
    "Group news articles by topic without predefined categories",
    "Train a drone to navigate through obstacles"
]
Solution:
def identify_paradigm(description):
    description = description.lower()
    
    # Check for reinforcement learning keywords
    rl_keywords = ["train", "navigate", "agent", "reward", "game", "robot"]
    if any(kw in description for kw in rl_keywords):
        if "obstacle" in description or "environment" in description:
            return "Reinforcement Learning"
    
    # Check for supervised learning keywords
    supervised_keywords = ["predict", "classify", "known outcome", "labeled"]
    if any(kw in description for kw in supervised_keywords):
        return "Supervised Learning"
    
    # Check for unsupervised learning keywords
    unsupervised_keywords = ["group", "cluster", "without", "discover", "segment"]
    if any(kw in description for kw in unsupervised_keywords):
        return "Unsupervised Learning"
    
    return "Unknown"

scenarios = [
    "Predict stock prices using historical data with known outcomes",
    "Group news articles by topic without predefined categories",
    "Train a drone to navigate through obstacles"
]

for scenario in scenarios:
    print(f"{scenario[:50]}... -> {identify_paradigm(scenario)}")
# Predict stock prices... -> Supervised Learning
# Group news articles... -> Unsupervised Learning
# Train a drone to navigate... -> Reinforcement Learning
03

Data Splitting Strategies

One of the most critical aspects of machine learning is properly evaluating how well your model will perform on new, unseen data. If you train and test on the same data, you'll get misleading results because the model has already "seen" the test examples. Data splitting solves this by dividing your dataset into separate portions for training, validation, and testing. Getting this right is essential for building models that generalize well to real-world data rather than just memorizing the training examples.

Why Split Your Data?

Imagine studying for an exam by memorizing the exact questions and answers, then being tested on those same questions. You'd score perfectly, but you wouldn't actually understand the material. Machine learning faces the same challenge. A model that perfectly predicts its training data might fail completely on new data because it memorized specific examples rather than learning general patterns. By testing on data the model has never seen, you get an honest estimate of real-world performance.

Training Set

Purpose: Teach the model

Typical Size: 60-80% of data

The model learns patterns from this data by adjusting its parameters to minimize errors

Validation Set

Purpose: Tune hyperparameters

Typical Size: 10-20% of data

Used to compare different model configurations and select the best one

Test Set

Purpose: Final evaluation

Typical Size: 10-20% of data

Provides unbiased estimate of model performance. Use only once at the end

The Train-Test Split

The simplest splitting strategy divides data into two parts: training and testing. The training set is used to fit the model, and the test set evaluates performance. A common split is 80% for training and 20% for testing, though this can vary based on dataset size. For small datasets, you might use 70-30 to ensure enough test samples. For very large datasets, even 90-10 or 95-5 can work because you still have plenty of test examples. The key is randomizing the split to avoid bias.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Sample dataset
X = [[1, 2], [2, 3], [3, 4], [4, 5], [5, 6], [6, 7], [7, 8], [8, 9], [9, 10], [10, 11]]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

# Split: 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2,      # 20% for testing
    random_state=42     # For reproducibility
)

print(f"Training samples: {len(X_train)}")  # Training samples: 8
print(f"Test samples: {len(X_test)}")       # Test samples: 2

# Train model on training data only
model = LogisticRegression()
model.fit(X_train, y_train)

# Evaluate on test data (never seen during training)
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Test accuracy: {accuracy:.2%}")  # Test accuracy: 100.00%

Adding a Validation Set

When you need to tune hyperparameters (settings that control the learning process), using the test set for this purpose is a mistake. If you keep adjusting your model based on test performance, you're effectively using the test set for training decisions, which defeats its purpose. The solution is a three-way split: training, validation, and test sets. You train on the training set, tune hyperparameters using validation performance, and only touch the test set once for final evaluation.

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import numpy as np

# Create sample data
np.random.seed(42)
X = np.random.randn(1000, 10)  # 1000 samples, 10 features
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # Binary classification

# First split: separate out test set (20%)
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Second split: separate train and validation from remaining 80%
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42  # 0.25 of 80% = 20%
)

print(f"Training: {len(X_train)} samples (60%)")   # Training: 600 samples
print(f"Validation: {len(X_val)} samples (20%)")   # Validation: 200 samples
print(f"Test: {len(X_test)} samples (20%)")        # Test: 200 samples

# Tune hyperparameters using validation set
best_accuracy = 0
best_n_estimators = 10

for n_estimators in [10, 50, 100, 200]:
    model = RandomForestClassifier(n_estimators=n_estimators, random_state=42)
    model.fit(X_train, y_train)
    val_accuracy = model.score(X_val, y_val)
    print(f"n_estimators={n_estimators}: validation accuracy = {val_accuracy:.2%}")
    
    if val_accuracy > best_accuracy:
        best_accuracy = val_accuracy
        best_n_estimators = n_estimators

# Final evaluation on test set (only once!)
final_model = RandomForestClassifier(n_estimators=best_n_estimators, random_state=42)
final_model.fit(X_train, y_train)
test_accuracy = final_model.score(X_test, y_test)
print(f"\nBest model (n_estimators={best_n_estimators})")
print(f"Final test accuracy: {test_accuracy:.2%}")

Stratified Splitting

When your dataset has imbalanced classes (for example, 95% negative and 5% positive), random splitting can create problems. You might end up with a test set that has very few or zero examples of the minority class, making evaluation unreliable. Stratified splitting maintains the same class proportions in each split as in the original dataset. This ensures that both training and test sets are representative of the overall data distribution.

from sklearn.model_selection import train_test_split
import numpy as np

# Imbalanced dataset: 90% class 0, 10% class 1
np.random.seed(42)
X = np.random.randn(100, 5)
y = np.array([0] * 90 + [1] * 10)  # 90 zeros, 10 ones

# Regular split - might create unbalanced test set
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print("Regular split:")
print(f"  Train class 1: {sum(y_train_reg)}/{len(y_train_reg)}")  # May vary
print(f"  Test class 1: {sum(y_test_reg)}/{len(y_test_reg)}")     # May vary

# Stratified split - maintains class proportions
X_train_str, X_test_str, y_train_str, y_test_str = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print("\nStratified split:")
print(f"  Train class 1: {sum(y_train_str)}/{len(y_train_str)}")  # 8/80 (10%)
print(f"  Test class 1: {sum(y_test_str)}/{len(y_test_str)}")     # 2/20 (10%)
Critical Rule: Never use test data for any decision during model development. Don't look at test performance to choose features, algorithms, or hyperparameters. The test set must remain completely untouched until your final evaluation. Breaking this rule leads to overly optimistic performance estimates that won't hold up in production.

Common Splitting Mistakes

Even experienced practitioners make splitting mistakes that lead to misleading results. Data leakage occurs when information from the test set accidentally influences training, like fitting a scaler on all data before splitting. Time series data requires chronological splits, not random ones, because future data shouldn't be used to predict the past. Groups of related samples (like multiple images of the same person) should stay together in one split to avoid inflated accuracy. Being aware of these pitfalls helps you create reliable evaluation procedures.

| Mistake | Problem | Solution |
|---|---|---|
| Scaling before splitting | Test data statistics leak into training | Fit scaler on training data only, then transform test data |
| Random split for time series | Future data used to predict past | Use chronological split or TimeSeriesSplit |
| Splitting grouped data | Related samples in both sets | Use GroupKFold to keep groups together |
| Too small test set | High variance in evaluation | Use at least 20% or cross-validation |
| No random state | Results not reproducible | Always set random_state for reproducibility |
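
The first fix in the table can be sketched with scikit-learn's Pipeline, which re-fits the scaler on the training fold only, inside each cross-validation split, so no test-fold statistics ever leak into training. The synthetic data below is an assumption for illustration.

```python
# Leakage-safe scaling: bundle the scaler and model in a Pipeline so
# cross-validation fits the scaler on each training fold only
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] - X[:, 2] > 0).astype(int)

# WRONG (leaky): calling StandardScaler().fit(X) on the full dataset
# before splitting lets test statistics influence training.
# RIGHT: the pipeline scales inside each fold automatically.
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(pipeline, X, y, cv=5)
print(f"Cross-validated accuracy: {scores.mean():.2%} (+/- {scores.std():.2%})")
```

The same pipeline object can then be fit once on the full training set and shipped as a single deployable unit, so the exact preprocessing travels with the model.
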
Best Practice: Always set random_state when splitting data to ensure reproducible results. This makes debugging easier and allows others to replicate your experiments exactly. Different random states can lead to different performance estimates, especially on small datasets.

Practice Questions

Task: Split the iris dataset into 70% training and 30% testing. Print the number of samples in each set.

Solution:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(f"Training samples: {len(X_train)}")  # Training samples: 105
print(f"Test samples: {len(X_test)}")       # Test samples: 45

Task: Split the iris dataset into 60% training, 20% validation, and 20% test sets. Verify the sizes are correct.

Solution:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target

# First split: 80% temp, 20% test
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Second split: 75% train, 25% val (from temp)
# 0.75 * 80% = 60%, 0.25 * 80% = 20%
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42
)

print(f"Train: {len(X_train)} ({len(X_train)/150:.0%})")  # Train: 90 (60%)
print(f"Val: {len(X_val)} ({len(X_val)/150:.0%})")        # Val: 30 (20%)
print(f"Test: {len(X_test)} ({len(X_test)/150:.0%})")     # Test: 30 (20%)

Task: Create an imbalanced dataset with 95% class 0 and 5% class 1. Perform a stratified split and verify that both train and test sets maintain the 95-5 ratio.

Solution:
import numpy as np
from sklearn.model_selection import train_test_split

# Create imbalanced data: 95% class 0, 5% class 1
np.random.seed(42)
X = np.random.randn(200, 5)
y = np.array([0] * 190 + [1] * 10)

# Stratified split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Calculate class proportions
train_ratio = sum(y_train == 1) / len(y_train) * 100
test_ratio = sum(y_test == 1) / len(y_test) * 100

print(f"Original: {sum(y==1)/len(y)*100:.1f}% class 1")  # 5.0%
print(f"Train: {train_ratio:.1f}% class 1")              # 5.0%
print(f"Test: {test_ratio:.1f}% class 1")                # 5.0%
04

Overfitting & Underfitting

The fundamental challenge in machine learning is building models that generalize well to new, unseen data. Two common problems prevent good generalization: overfitting and underfitting. Overfitting occurs when a model learns the training data too well, including noise and random fluctuations, and fails on new data. Underfitting happens when a model is too simple to capture the underlying patterns. Understanding the bias-variance tradeoff helps you find the sweet spot between these extremes and build models that perform reliably in production.

Understanding Overfitting

Overfitting is like a student who memorizes exam answers instead of understanding concepts. They'll ace the practice test but fail when questions are worded differently. In machine learning, an overfitted model has essentially memorized the training data, including noise and outliers that don't represent the true underlying pattern. The telltale sign is high training accuracy but poor test accuracy. The model looks great on training data but disappoints in the real world.

Key Concept

Overfitting

Overfitting occurs when a model learns the training data too well, capturing noise and random fluctuations rather than just the underlying pattern. The model has high variance and fails to generalize to new data.

Symptoms: High training accuracy, low test accuracy, complex decision boundaries, model changes dramatically with small data changes

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Generate simple data with some noise
np.random.seed(42)
X = np.linspace(0, 10, 20).reshape(-1, 1)
y = 2 * X.flatten() + 3 + np.random.normal(0, 2, 20)  # True pattern: y = 2x + 3

# Split data
X_train, X_test = X[:15], X[15:]
y_train, y_test = y[:15], y[15:]

# Simple model (appropriate complexity)
model_simple = LinearRegression()
model_simple.fit(X_train, y_train)
train_error_simple = mean_squared_error(y_train, model_simple.predict(X_train))
test_error_simple = mean_squared_error(y_test, model_simple.predict(X_test))

print("Simple Linear Model:")
print(f"  Train MSE: {train_error_simple:.2f}")  # Train MSE: 3.21
print(f"  Test MSE: {test_error_simple:.2f}")    # Test MSE: 4.15

# Overfit model (too complex - degree 15 polynomial)
poly = PolynomialFeatures(degree=15)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

model_overfit = LinearRegression()
model_overfit.fit(X_train_poly, y_train)
train_error_overfit = mean_squared_error(y_train, model_overfit.predict(X_train_poly))
test_error_overfit = mean_squared_error(y_test, model_overfit.predict(X_test_poly))

print("\nOverfit Polynomial Model (degree=15):")
print(f"  Train MSE: {train_error_overfit:.2f}")  # Train MSE: 0.01 (nearly perfect!)
print(f"  Test MSE: {test_error_overfit:.2f}")    # Test MSE: 892.45 (terrible!)

Understanding Underfitting

Underfitting is the opposite problem: the model is too simple to capture the patterns in the data. It's like trying to draw a circle using only straight lines. You can get close, but you'll never capture the true shape. Underfitted models perform poorly on both training and test data because they lack the capacity to learn the underlying relationships. This often happens when using overly simple algorithms for complex problems or when important features are missing.

Key Concept

Underfitting

Underfitting occurs when a model is too simple to capture the underlying pattern in the data. The model has high bias and performs poorly on both training and test data.

Symptoms: Low training accuracy, low test accuracy, overly simple decision boundaries, model ignores important patterns

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error

# Generate quadratic data (y = x^2)
np.random.seed(42)
X = np.linspace(-5, 5, 50).reshape(-1, 1)
y = X.flatten() ** 2 + np.random.normal(0, 2, 50)  # True pattern: y = x^2

X_train, X_test = X[:40], X[40:]
y_train, y_test = y[:40], y[40:]

# Underfit model (linear for quadratic data)
model_underfit = LinearRegression()
model_underfit.fit(X_train, y_train)

train_error = mean_squared_error(y_train, model_underfit.predict(X_train))
test_error = mean_squared_error(y_test, model_underfit.predict(X_test))

print("Underfit Model (Linear for quadratic data):")
print(f"  Train MSE: {train_error:.2f}")  # Train MSE: 28.45 (bad)
print(f"  Test MSE: {test_error:.2f}")    # Test MSE: 31.20 (also bad)

# Appropriate model (quadratic for quadratic data)
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

model_good = LinearRegression()
model_good.fit(X_train_poly, y_train)

train_error_good = mean_squared_error(y_train, model_good.predict(X_train_poly))
test_error_good = mean_squared_error(y_test, model_good.predict(X_test_poly))

print("\nAppropriate Model (Quadratic):")
print(f"  Train MSE: {train_error_good:.2f}")  # Train MSE: 3.85 (good)
print(f"  Test MSE: {test_error_good:.2f}")    # Test MSE: 4.12 (also good)

The Bias-Variance Tradeoff

The bias-variance tradeoff is a fundamental concept that explains the tension between overfitting and underfitting. Bias is the error from overly simple assumptions. High bias models underfit by ignoring patterns. Variance is the error from being too sensitive to training data fluctuations. High variance models overfit by memorizing noise. The total error combines both. As you increase model complexity, bias decreases but variance increases. The goal is finding the complexity level that minimizes total error.

High Bias (Underfitting)

  • Model too simple
  • Misses relevant patterns
  • Poor on training AND test data
  • Consistent but wrong predictions
  • Fix: Add features, use complex model

High Variance (Overfitting)

  • Model too complex
  • Memorizes noise
  • Great on training, poor on test
  • Predictions vary wildly
  • Fix: Regularization, more data, simpler model
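The tradeoff described above can be made concrete with a small sweep over polynomial degree. This sketch uses an illustrative synthetic dataset (quadratic ground truth, noise level, and degree list are my choices, not the lesson's examples); training error keeps falling as complexity grows, while test error bottoms out near the true complexity and then typically rises again:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Quadratic ground truth with noise; random x locations so the test
# set falls inside the training range (no extrapolation)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = X.flatten() ** 2 + rng.normal(0, 1, 100)
X_train, X_test = X[:70], X[70:]
y_train, y_test = y[:70], y[70:]

results = {}
for degree in [1, 2, 5, 10, 15]:
    poly = PolynomialFeatures(degree=degree)
    Xtr = poly.fit_transform(X_train)
    Xte = poly.transform(X_test)
    model = LinearRegression().fit(Xtr, y_train)
    tr_mse = mean_squared_error(y_train, model.predict(Xtr))
    te_mse = mean_squared_error(y_test, model.predict(Xte))
    results[degree] = (tr_mse, te_mse)
    print(f"degree={degree:2d}  train MSE={tr_mse:6.2f}  test MSE={te_mse:6.2f}")
# Train MSE falls monotonically with degree (nested feature sets);
# test MSE is lowest near degree 2, the true complexity
```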

Techniques to Prevent Overfitting

Several techniques help prevent overfitting and improve generalization. Regularization adds a penalty for model complexity, discouraging overly large coefficients. Cross-validation provides more robust performance estimates by testing on multiple data splits. Early stopping halts training when validation performance starts degrading. Dropout randomly disables neurons during neural network training. Ensemble methods combine multiple models to reduce individual model variance. The right technique depends on your model type and problem characteristics.

from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error
import numpy as np

# Generate data
np.random.seed(42)
X = np.linspace(0, 10, 30).reshape(-1, 1)
y = 2 * X.flatten() + 3 + np.random.normal(0, 2, 30)

X_train, X_test = X[:24], X[24:]
y_train, y_test = y[:24], y[24:]

# Create polynomial features (prone to overfitting)
poly = PolynomialFeatures(degree=10)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

# Without regularization (overfits)
from sklearn.linear_model import LinearRegression
model_no_reg = LinearRegression()
model_no_reg.fit(X_train_poly, y_train)
print("No Regularization:")
print(f"  Train MSE: {mean_squared_error(y_train, model_no_reg.predict(X_train_poly)):.2f}")
print(f"  Test MSE: {mean_squared_error(y_test, model_no_reg.predict(X_test_poly)):.2f}")

# Ridge regularization (L2 penalty)
model_ridge = Ridge(alpha=1.0)
model_ridge.fit(X_train_poly, y_train)
print("\nRidge Regularization (alpha=1.0):")
print(f"  Train MSE: {mean_squared_error(y_train, model_ridge.predict(X_train_poly)):.2f}")
print(f"  Test MSE: {mean_squared_error(y_test, model_ridge.predict(X_test_poly)):.2f}")

# Lasso regularization (L1 penalty - also does feature selection)
model_lasso = Lasso(alpha=0.1)
model_lasso.fit(X_train_poly, y_train)
print("\nLasso Regularization (alpha=0.1):")
print(f"  Train MSE: {mean_squared_error(y_train, model_lasso.predict(X_train_poly)):.2f}")
print(f"  Test MSE: {mean_squared_error(y_test, model_lasso.predict(X_test_poly)):.2f}")
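The paragraph above also mentions ensemble methods. As a minimal sketch of the variance-reduction effect, this compares one fully grown decision tree against a bagged random forest on the same split (the synthetic dataset and hyperparameters here are illustrative assumptions, not part of the lesson's examples):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Noisy classification data (flip_y adds label noise a single deep tree will memorize)
X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# One unconstrained tree (high variance) vs an average of 200 trees
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

for name, m in [("Single tree", tree), ("Random forest", forest)]:
    print(f"{name}: train={m.score(X_tr, y_tr):.2%}  test={m.score(X_te, y_te):.2%}")
# The lone tree fits training data perfectly; the forest trades a little
# training accuracy for a noticeably smaller train-test gap
```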

Diagnosing Fitting Problems

Learning curves are powerful diagnostic tools that help you understand whether your model is overfitting, underfitting, or performing well. By plotting training and validation errors as a function of training set size or model complexity, you can visualize the bias-variance tradeoff and make informed decisions about how to improve your model. A large gap between training and validation curves suggests overfitting, while both curves plateauing at high error indicates underfitting.

Symptom | Diagnosis | Solution
High train error, high test error | Underfitting (high bias) | Increase model complexity, add features, reduce regularization
Low train error, high test error | Overfitting (high variance) | Add regularization, get more data, reduce complexity
Low train error, low test error | Good fit! | Monitor for drift, deploy with confidence
Train and test errors don't converge | Need more data | Collect more training examples
Rule of Thumb: Start with a simple model and gradually increase complexity. If your training accuracy is much higher than test accuracy (gap > 10%), you're likely overfitting. If both are low, you're underfitting. The goal is similar performance on both with acceptable accuracy levels

Interactive: Model Complexity Simulator

Try It!

In the interactive version of this page, a slider adjusts model complexity from Simple (High Bias) to Complex (High Variance). At the Good Fit setting, training accuracy is 85%, test accuracy is 82%, and the gap is 3%: both accuracies are high with a small gap, so the model generalizes well.

Practice Questions

Task: Given the following model results, write code to determine if the model is overfitting, underfitting, or has a good fit. Print the diagnosis.

Given:

train_accuracy = 0.98
test_accuracy = 0.65
Show Solution
train_accuracy = 0.98
test_accuracy = 0.65

gap = train_accuracy - test_accuracy

if gap > 0.15:
    diagnosis = "Overfitting (High Variance)"
elif train_accuracy < 0.70 and test_accuracy < 0.70:
    diagnosis = "Underfitting (High Bias)"
else:
    diagnosis = "Good Fit"

print(f"Train: {train_accuracy:.0%}, Test: {test_accuracy:.0%}")
print(f"Gap: {gap:.0%}")
print(f"Diagnosis: {diagnosis}")  # Diagnosis: Overfitting (High Variance)

Task: Train polynomial models with different Ridge alpha values and find the one with the best test performance.

Given:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
np.random.seed(42)
X = np.linspace(0, 5, 50).reshape(-1, 1)
y = 3 * X.flatten() + np.random.normal(0, 1, 50)
X_train, X_test = X[:40], X[40:]
y_train, y_test = y[:40], y[40:]

alphas = [0.01, 0.1, 1.0, 10.0]
Show Solution
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

np.random.seed(42)
X = np.linspace(0, 5, 50).reshape(-1, 1)
y = 3 * X.flatten() + np.random.normal(0, 1, 50)
X_train, X_test = X[:40], X[40:]
y_train, y_test = y[:40], y[40:]

poly = PolynomialFeatures(degree=8)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

alphas = [0.01, 0.1, 1.0, 10.0]
best_alpha = None
best_mse = float('inf')

for alpha in alphas:
    model = Ridge(alpha=alpha)
    model.fit(X_train_poly, y_train)
    test_mse = mean_squared_error(y_test, model.predict(X_test_poly))
    print(f"Alpha={alpha}: Test MSE = {test_mse:.4f}")
    if test_mse < best_mse:
        best_mse = test_mse
        best_alpha = alpha

print(f"\nBest alpha: {best_alpha} with Test MSE: {best_mse:.4f}")

Task: Use sklearn's learning_curve to plot how training and validation scores change with training set size. Determine if more data would help.

Show Solution
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
import numpy as np

iris = load_iris()
X, y = iris.data, iris.target

model = DecisionTreeClassifier(max_depth=3)

train_sizes, train_scores, val_scores = learning_curve(
    model, X, y, 
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5,
    scoring='accuracy'
)

train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)

print("Training Size | Train Score | Val Score | Gap")
print("-" * 50)
for size, train, val in zip(train_sizes, train_mean, val_mean):
    print(f"{size:13} | {train:.3f}       | {val:.3f}     | {train-val:.3f}")

# If gap is small and both scores are high -> good fit
# If gap is large -> overfitting, more data might help
# If both scores are low -> underfitting
05

Cross-Validation Techniques

A single train-test split can produce unreliable performance estimates, especially with small datasets. The specific samples that end up in training versus testing can significantly affect measured accuracy. Cross-validation addresses this by testing your model on multiple different splits and averaging the results. This gives you a more robust estimate of how your model will perform on new data, reduces the variance of your evaluation, and makes better use of limited data. Cross-validation is essential for reliable model comparison and hyperparameter tuning.
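To see why a single split can mislead, this quick sketch (a decision tree on iris is an illustrative choice) compares the accuracies you get from several random 70/30 splits against a single 5-fold cross-validated estimate:

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

# Accuracy from a single 70/30 split changes with the random seed
single_split_scores = []
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y)
    single_split_scores.append(model.fit(X_tr, y_tr).score(X_te, y_te))
print("Single-split accuracies:", np.round(single_split_scores, 3))

# Cross-validation averages over several splits for a stabler estimate
cv_scores = cross_val_score(model, X, y, cv=5)
print(f"5-fold CV mean: {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")
```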

K-Fold Cross-Validation

K-Fold cross-validation is the most popular cross-validation technique. It divides your data into K equal-sized folds (typically 5 or 10). The model is trained K times, each time using K-1 folds for training and 1 fold for validation. Every data point gets used for validation exactly once. The final performance is the average across all K folds. This provides a more reliable estimate than a single split and uses your data more efficiently.

Key Concept

K-Fold Cross-Validation

A resampling technique that divides data into K equal folds. The model is trained K times, each time holding out a different fold for validation. The final score is the average of all K validation scores.

Common choices: K=5 or K=10. Higher K gives better estimates but increases computation time. K=5 balances bias and variance well for most problems

from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
import numpy as np

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Create model
model = LogisticRegression(max_iter=200)

# 5-Fold Cross-Validation
cv_scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')

print("5-Fold Cross-Validation Results:")
print(f"Fold scores: {cv_scores}")
# Fold scores: [0.97 1.0  0.93 0.97 0.93]

print(f"Mean accuracy: {cv_scores.mean():.2%}")      # Mean accuracy: 96.00%
print(f"Std deviation: {cv_scores.std():.2%}")       # Std deviation: 2.67%
print(f"Mean +/- 2 std: {cv_scores.mean():.2%} (+/- {cv_scores.std()*2:.2%})")
# Mean +/- 2 std: 96.00% (+/- 5.33%) -- a rough spread, not a formal 95% CI

# More control with KFold object
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores_shuffled = cross_val_score(model, X, y, cv=kfold, scoring='accuracy')
print(f"\nShuffled 5-Fold mean: {cv_scores_shuffled.mean():.2%}")

Stratified K-Fold

Standard K-Fold doesn't consider class distribution. In imbalanced datasets, some folds might have very few examples of minority classes, leading to unreliable estimates. Stratified K-Fold maintains the same class proportions in each fold as in the original dataset. This is especially important for classification problems and is the default behavior when you pass an integer to cv in scikit-learn's cross-validation functions for classifiers.

from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier
import numpy as np

# Imbalanced dataset
np.random.seed(42)
X = np.random.randn(100, 4)
y = np.array([0] * 90 + [1] * 10)  # 90% class 0, 10% class 1

model = DecisionTreeClassifier(random_state=42)

# Regular KFold - might create folds with no minority class
from sklearn.model_selection import KFold
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

print("Regular KFold - Class distribution per fold:")
for i, (train_idx, val_idx) in enumerate(kfold.split(X)):
    class_1_count = sum(y[val_idx] == 1)
    print(f"  Fold {i+1}: {class_1_count} class 1 samples in validation")

# Stratified KFold - maintains class proportions
skfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

print("\nStratified KFold - Class distribution per fold:")
for i, (train_idx, val_idx) in enumerate(skfold.split(X, y)):
    class_1_count = sum(y[val_idx] == 1)
    print(f"  Fold {i+1}: {class_1_count} class 1 samples in validation")
    # Each fold has exactly 2 class 1 samples (10% of 20)

# Cross-validation scores
cv_scores = cross_val_score(model, X, y, cv=skfold, scoring='accuracy')
print(f"\nStratified CV Mean Accuracy: {cv_scores.mean():.2%}")

Leave-One-Out Cross-Validation

Leave-One-Out (LOO) is an extreme form of K-Fold where K equals the number of samples. Each iteration uses all but one sample for training and tests on the single held-out sample. LOO gives an unbiased estimate of model performance but is computationally expensive for large datasets. It's most useful for very small datasets where you can't afford to hold out many samples for testing, but be aware that it has high variance.

from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Small dataset example
X = [[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

model = KNeighborsClassifier(n_neighbors=3)

# Leave-One-Out
loo = LeaveOneOut()
loo_scores = cross_val_score(model, X, y, cv=loo, scoring='accuracy')

print("LOO Cross-Validation:")
print(f"Number of iterations: {len(loo_scores)}")   # 10 (one per sample)
print(f"Individual scores: {loo_scores}")            # each score is 0 or 1
print(f"Mean accuracy: {loo_scores.mean():.2%}")     # high, though samples at the class boundary may miss

# Compare with 5-Fold
cv5_scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"\n5-Fold Mean accuracy: {cv5_scores.mean():.2%}")

# LOO is more expensive but useful for tiny datasets
# Time complexity: O(n) training runs vs O(k) for k-fold

Cross-Validation for Hyperparameter Tuning

Cross-validation is crucial for hyperparameter tuning because it provides reliable performance estimates for different parameter combinations. Using a single validation set can lead to overfitting to that specific split. GridSearchCV and RandomizedSearchCV combine parameter search with cross-validation, automatically trying different parameter combinations and selecting the best based on cross-validated scores. This ensures your hyperparameter choices generalize well.

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Define model and parameter grid
model = RandomForestClassifier(random_state=42)
param_grid = {
    'n_estimators': [10, 50, 100],
    'max_depth': [3, 5, None],
    'min_samples_split': [2, 5, 10]
}

# GridSearchCV with 5-fold cross-validation
grid_search = GridSearchCV(
    model, 
    param_grid, 
    cv=5, 
    scoring='accuracy',
    n_jobs=-1,  # Use all CPU cores
    verbose=1
)

grid_search.fit(X, y)

print(f"\nBest parameters: {grid_search.best_params_}")
# Best parameters: {'max_depth': 5, 'min_samples_split': 2, 'n_estimators': 50}

print(f"Best CV score: {grid_search.best_score_:.2%}")
# Best CV score: 96.67%

# Access all results
import pandas as pd
results = pd.DataFrame(grid_search.cv_results_)
print("\nTop 5 parameter combinations:")
print(results.nsmallest(5, 'rank_test_score')[['params', 'mean_test_score', 'std_test_score']])

Choosing the Right Cross-Validation Strategy

Different problems require different cross-validation strategies. Standard K-Fold works for most cases, but special situations need special handling. Time series data requires time-based splits where future data never leaks into training. Grouped data (like multiple samples from the same patient) needs GroupKFold to keep groups together. Very imbalanced data benefits from Stratified K-Fold. The table below summarizes when to use each strategy.

Strategy | When to Use | Key Feature
KFold | General purpose, regression | Simple, equal-sized folds
StratifiedKFold | Classification, imbalanced data | Preserves class proportions
LeaveOneOut | Very small datasets | Maximum data usage, high variance
TimeSeriesSplit | Time-ordered data | No future data leakage
GroupKFold | Grouped/clustered data | Keeps groups together
RepeatedKFold | Need lower-variance estimates | Repeats K-Fold multiple times
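The time-series and group strategies can be inspected directly by printing the fold indices they produce (the 12-sample array and group labels below are toy assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit, GroupKFold

X = np.arange(12).reshape(-1, 1)  # 12 time-ordered samples

# TimeSeriesSplit: training indices always precede validation indices,
# so the model never trains on the future
tscv = TimeSeriesSplit(n_splits=3)
for i, (train_idx, val_idx) in enumerate(tscv.split(X)):
    print(f"Split {i+1}: train={train_idx.tolist()} val={val_idx.tolist()}")

# GroupKFold: all samples from a group (e.g. one patient) stay in the same fold
groups = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3])
gkf = GroupKFold(n_splits=4)
for i, (train_idx, val_idx) in enumerate(gkf.split(X, groups=groups)):
    print(f"Fold {i+1}: validation groups = {sorted(set(groups[val_idx]))}")
```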
Best Practice: For most classification problems, use StratifiedKFold with k=5 or k=10. For hyperparameter tuning, use cross-validation (not a single validation set) to avoid overfitting to specific data splits. Always set random_state for reproducibility and report both mean and standard deviation of CV scores
Remember: Cross-validation helps you estimate model performance and select hyperparameters, but you should still have a held-out test set for final evaluation. The workflow is: use CV for model selection and tuning, then evaluate the final chosen model on the test set exactly once
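That workflow can be sketched end to end: hold out a test set, run cross-validated model selection on the training portion only, then evaluate the chosen model on the test set exactly once (the model and parameter grid here are illustrative):

```python
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# 1. Hold out a test set the search never sees
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# 2. Cross-validated model selection on the training portion only
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    {"n_estimators": [25, 100], "max_depth": [3, None]},
    cv=5,
)
search.fit(X_train, y_train)
print("Best params:", search.best_params_)
print(f"Best CV score: {search.best_score_:.2%}")

# 3. Evaluate the refit best model on the untouched test set, once
print(f"Final test accuracy: {search.score(X_test, y_test):.2%}")
```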

Practice Questions

Task: Use 10-fold cross-validation to evaluate a Decision Tree classifier on the iris dataset. Report the mean accuracy and standard deviation.

Show Solution
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

model = DecisionTreeClassifier(random_state=42)
cv_scores = cross_val_score(model, X, y, cv=10, scoring='accuracy')

print(f"10-Fold CV Scores: {cv_scores}")
print(f"Mean Accuracy: {cv_scores.mean():.2%}")  # ~96%
print(f"Std Deviation: {cv_scores.std():.2%}")   # ~4%

Task: Compare Logistic Regression, Decision Tree, and Random Forest using 5-fold CV on the iris dataset. Print which model performs best.

Show Solution
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

models = {
    'Logistic Regression': LogisticRegression(max_iter=200),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42)
}

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    results[name] = scores.mean()
    print(f"{name}: {scores.mean():.2%} (+/- {scores.std():.2%})")

best_model = max(results, key=results.get)
print(f"\nBest model: {best_model} ({results[best_model]:.2%})")

Task: Use GridSearchCV with 5-fold CV to find the best hyperparameters for SVM on the iris dataset. Search over different values of C and kernel.

Show Solution
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

param_grid = {
    'C': [0.1, 1, 10, 100],
    'kernel': ['linear', 'rbf', 'poly'],
    'gamma': ['scale', 'auto']
}

svm = SVC(random_state=42)
grid_search = GridSearchCV(svm, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X, y)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.2%}")

# Show top 3 combinations
import pandas as pd
results = pd.DataFrame(grid_search.cv_results_)
top3 = results.nsmallest(3, 'rank_test_score')[['params', 'mean_test_score']]
print("\nTop 3 parameter combinations:")
print(top3.to_string())

Key Takeaways

ML Definition

Machine learning enables computers to learn patterns from data without explicit programming

Supervised Learning

Uses labeled data to learn mappings from inputs to outputs for prediction tasks

Unsupervised Learning

Discovers hidden patterns and structures in unlabeled data through clustering

Data Splits

Always separate data into training, validation, and test sets for reliable evaluation

Bias-Variance Tradeoff

Balance model complexity to avoid both overfitting and underfitting

Cross-Validation

Use k-fold cross-validation for robust model evaluation on limited data

Knowledge Check

Test your understanding of machine learning foundations:

Question 1 of 6

Which type of learning uses labeled data to train the model?

Question 2 of 6

What is the primary purpose of the test set?

Question 3 of 6

A model has high training accuracy but low test accuracy. What problem does this indicate?

Question 4 of 6

In 5-fold cross-validation, how many times is each data point used for testing?

Question 5 of 6

Which task is an example of unsupervised learning?

Question 6 of 6

What is the typical train-test split ratio used in machine learning?

Answer all questions to check your score