Module 3.2

Decision Trees & Ensemble Methods

Master the power of tree-based models and ensemble techniques. Learn how Decision Trees make predictions, how Random Forests combine multiple trees, and how Boosting algorithms like XGBoost achieve state-of-the-art results!

55 min read
Intermediate
What You'll Learn
  • Decision tree construction and pruning
  • Random Forests and feature importance
  • Bagging ensemble technique
  • Boosting with AdaBoost and Gradient Boosting
  • XGBoost and LightGBM for classification
Contents
01

Decision Trees

Imagine you are playing a game of 20 Questions where you try to guess an object by asking yes/no questions. Decision Trees work exactly like this game - they ask a series of questions about your data features, with each answer leading to another question until reaching a final prediction. Think of it like a flowchart that guides you from the top (root) to a final decision (leaf) based on the characteristics of your data. Decision Trees are one of the most intuitive and widely-used machine learning algorithms because they mimic how humans naturally make decisions through a series of if-then-else rules.

Real-World Example: Think about how a doctor diagnoses a patient. They might ask: "Do you have a fever?" If yes, "Is it above 102°F?" If yes, "Do you have a cough?" Each answer narrows down the possible diagnoses until reaching a conclusion. This is exactly how Decision Trees work - a series of questions leading to a final prediction!

How Decision Trees Work

A Decision Tree is a flowchart-like structure where each internal node represents a test on a feature (like "Is age greater than 30?"), each branch represents the outcome of that test, and each leaf node represents a class label or prediction. The tree starts at the root and splits the data based on the feature that provides the best separation between classes. Picture a tree turned upside down - the root is at the top, and the leaves (final predictions) are at the bottom.

Here's how the algorithm builds a tree step by step:

  1. Start with all data at the root: The entire training dataset begins at the top node.
  2. Find the best question to ask: The algorithm evaluates every feature and every possible split point to find the question that best separates the classes.
  3. Split the data: Based on the answer (Yes/No), data flows to left or right child nodes.
  4. Repeat recursively: Each child node repeats the process, finding the next best question for its subset of data.
  5. Stop when done: The process stops when a node is "pure" (all samples belong to one class) or when stopping criteria are met (like maximum depth).
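Step 2 above ("find the best question") is the heart of the algorithm, and scikit-learn hides it from you. Here is a minimal sketch of that search for a single feature - the toy data, function names, and use of Gini impurity are illustrative choices, not part of scikit-learn's internals:

```python
import numpy as np

def gini(labels):
    # Gini impurity of a set of class labels
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

def best_split(feature, labels):
    # Try every candidate threshold; keep the one that minimizes
    # the weighted impurity of the two resulting child nodes
    best_thresh, best_impurity = None, float("inf")
    for t in np.unique(feature):
        left, right = labels[feature <= t], labels[feature > t]
        if len(left) == 0 or len(right) == 0:
            continue  # a split must send samples both ways
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best_impurity:
            best_thresh, best_impurity = t, weighted
    return best_thresh, best_impurity

# Toy data: the classes separate cleanly between 4 and 6
feature = np.array([1, 2, 3, 4, 6, 7, 8, 9], dtype=float)
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])

thresh, impurity = best_split(feature, labels)
print(f"Best threshold: {thresh}, weighted Gini: {impurity:.3f}")
```

A real tree builder runs this search over every feature at every node, then recurses into the two children - exactly steps 2-4 above.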
Root Node

The topmost node representing the entire dataset. It is split into child nodes based on the best feature. Think of it as the first and most important question to ask.

Leaf Node

Terminal nodes that contain the final prediction or class label. No more questions are asked once you reach a leaf - you have your answer!

# Building a simple Decision Tree Classifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the famous Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Create and train the Decision Tree
tree_clf = DecisionTreeClassifier(max_depth=3, random_state=42)
tree_clf.fit(X_train, y_train)

# Evaluate accuracy
accuracy = tree_clf.score(X_test, y_test)
print(f"Decision Tree Accuracy: {accuracy:.2%}")  # typically 95-100% on this split

Splitting Criteria: Gini Impurity vs Entropy

When building a Decision Tree, the algorithm needs to decide which feature and what threshold to use for splitting at each node. It does this by measuring how "pure" or "impure" each potential split would make the resulting groups. A pure node contains samples from only one class, while an impure node has a mix of classes. Think of it like sorting a bag of mixed candies - we want each pile to contain only one type of candy.

Why does purity matter? Imagine you have a bag with 50 red candies and 50 blue candies (very impure - maximum confusion). After sorting, you want one pile with all red and another with all blue (perfectly pure - no confusion). The algorithm tries every possible way to split and picks the one that creates the purest groups. There are two common ways to measure impurity:

Simple Analogy: Imagine a classroom of students. A "pure" classroom has only boys OR only girls. An "impure" classroom has a mix. Gini and Entropy are just two different mathematical ways to measure "how mixed" a group is. Both work well - Gini is slightly faster to compute, while Entropy is more theoretically grounded.
Splitting Criterion

Gini Impurity

$$Gini = 1 - \sum_{i=1}^{n} p_i^2$$

Measures the probability of incorrectly classifying a randomly chosen element. Gini = 0 means perfect purity (all samples belong to one class). Faster to compute than entropy.

Splitting Criterion

Entropy (Information Gain)

$$Entropy = -\sum_{i=1}^{n} p_i \log_2(p_i)$$

Measures the disorder or randomness in a set. Entropy = 0 means perfect order (all samples same class). Used in algorithms like ID3 and C4.5.
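To make the two formulas concrete, here is a quick hand-check (a sketch; the helper functions are my own, not scikit-learn's) evaluating both measures on a maximally mixed node and a pure node:

```python
import math

def gini(probs):
    # Gini impurity: 1 minus the sum of squared class probabilities
    return 1 - sum(p ** 2 for p in probs)

def entropy(probs):
    # Entropy in bits: -sum of p * log2(p), skipping zero probabilities
    return -sum(p * math.log2(p) for p in probs if p > 0)

# 50/50 mix of two classes: maximum impurity
print(f"Mixed - Gini: {gini([0.5, 0.5]):.3f}, Entropy: {entropy([0.5, 0.5]):.3f}")

# All samples in one class: both measures hit zero
print(f"Pure  - Gini: {gini([1.0, 0.0]):.3f}, Entropy: {entropy([1.0, 0.0]):.3f}")
```

Note the different scales: for two classes, Gini maxes out at 0.5 while entropy maxes out at 1.0 bit - but both rank splits the same way in most cases, which is why the cross-validation comparison below usually shows similar accuracy.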

# Comparing Gini vs Entropy splitting criteria
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score

# Load wine dataset
wine = load_wine()
X, y = wine.data, wine.target

# Decision Tree with Gini (default)
tree_gini = DecisionTreeClassifier(criterion='gini', max_depth=4, random_state=42)
gini_scores = cross_val_score(tree_gini, X, y, cv=5)

# Decision Tree with Entropy
tree_entropy = DecisionTreeClassifier(criterion='entropy', max_depth=4, random_state=42)
entropy_scores = cross_val_score(tree_entropy, X, y, cv=5)

print(f"Gini Impurity - Mean Accuracy: {gini_scores.mean():.2%}")
print(f"Entropy - Mean Accuracy: {entropy_scores.mean():.2%}")

Tree Pruning: Preventing Overfitting

Left unchecked, a Decision Tree will keep growing until each leaf is perfectly pure - meaning it memorizes the training data exactly. This is called overfitting, and it causes poor performance on new, unseen data. Pruning is like trimming a garden tree - we cut back branches to keep the tree manageable and prevent it from becoming too complex. There are two main approaches: pre-pruning (stop growing early) and post-pruning (grow fully then cut back).

Beginner's Warning - Overfitting Explained: Imagine a student who memorizes every answer in a textbook word-for-word but doesn't understand the concepts. They'll ace the practice test but fail the real exam with new questions. An unpruned Decision Tree does the same thing - it memorizes training data perfectly (100% training accuracy) but performs poorly on new data. Pruning forces the tree to learn general patterns instead of memorizing specific examples.

Common Pruning Parameters Explained:

  • max_depth: Maximum levels from root to leaf. Like limiting a quiz to 5 questions maximum. Smaller values = simpler tree.
  • min_samples_split: Minimum samples required to create a new split. "Only ask another question if you have at least 20 samples."
  • min_samples_leaf: Minimum samples required in each leaf node. "Each final answer must represent at least 5 examples."
  • max_leaf_nodes: Maximum number of leaf nodes allowed. Directly limits the number of possible predictions.
# Pruning strategies to prevent overfitting
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Create a dataset with some noise
X, y = make_classification(n_samples=1000, n_features=20, 
                           n_informative=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Unpruned tree (will overfit)
unpruned = DecisionTreeClassifier(random_state=42)
unpruned.fit(X_train, y_train)
print(f"Unpruned - Train: {unpruned.score(X_train, y_train):.2%}, Test: {unpruned.score(X_test, y_test):.2%}")

# Pre-pruning with max_depth
pruned_depth = DecisionTreeClassifier(max_depth=5, random_state=42)
pruned_depth.fit(X_train, y_train)
print(f"max_depth=5 - Train: {pruned_depth.score(X_train, y_train):.2%}, Test: {pruned_depth.score(X_test, y_test):.2%}")

# Pre-pruning with min_samples_leaf
pruned_leaf = DecisionTreeClassifier(min_samples_leaf=10, random_state=42)
pruned_leaf.fit(X_train, y_train)
print(f"min_samples_leaf=10 - Train: {pruned_leaf.score(X_train, y_train):.2%}, Test: {pruned_leaf.score(X_test, y_test):.2%}")
Key Pruning Parameters: Use max_depth to limit tree height, min_samples_split to require minimum samples for a split, min_samples_leaf to require minimum samples in each leaf, and max_leaf_nodes to limit total leaves.
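The examples above all use pre-pruning. Scikit-learn also supports the post-pruning approach mentioned earlier (grow fully, then cut back) via cost-complexity pruning with the ccp_alpha parameter - larger values prune more aggressively. A minimal sketch (the dataset and alpha values here are illustrative choices):

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Larger ccp_alpha removes more branches, yielding a smaller tree
leaf_counts = []
for alpha in [0.0, 0.005, 0.02]:
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
    tree.fit(X_train, y_train)
    leaf_counts.append(tree.get_n_leaves())
    print(f"ccp_alpha={alpha}: leaves={tree.get_n_leaves()}, "
          f"test accuracy={tree.score(X_test, y_test):.2%}")
```

In practice you don't guess alphas: `cost_complexity_pruning_path(X_train, y_train)` returns the full sequence of effective alphas for a fitted tree, which you can then sweep with cross-validation.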

Visualizing Decision Trees

One of the biggest advantages of Decision Trees is their interpretability - you can actually see and understand why the model makes each prediction. Unlike "black box" models like neural networks, Decision Trees show you exactly which questions led to each decision. Scikit-learn provides tools to visualize the tree structure, making it easy to explain model decisions to stakeholders who may not have a technical background.

What you'll see in a tree visualization:

  • Feature and threshold: Each box shows the question being asked (e.g., "petal length <= 2.45")
  • Gini or Entropy value: How pure or impure the node is (lower = purer)
  • Samples: How many training samples reached this node
  • Value: Distribution of classes at this node (e.g., [50, 0, 0] means 50 samples of class 0)
  • Class: The predicted class if we stopped here (majority class)
Pro Tip: Decision Tree visualizations are excellent for explaining ML predictions to non-technical stakeholders. You can literally point to the path a specific sample takes and show them exactly why the model made its prediction. This "explainability" is why Decision Trees remain popular in healthcare, finance, and legal applications where understanding the reasoning is as important as the prediction itself.
# Visualizing a Decision Tree
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt

# Train a small tree on Iris
iris = load_iris()
tree_clf = DecisionTreeClassifier(max_depth=3, random_state=42)
tree_clf.fit(iris.data, iris.target)

# Plot the tree
plt.figure(figsize=(20, 10))
plot_tree(tree_clf, 
          feature_names=iris.feature_names,
          class_names=iris.target_names,
          filled=True,           # Color nodes by class
          rounded=True,          # Round node corners
          fontsize=10)
plt.title("Decision Tree for Iris Classification")
plt.tight_layout()
plt.savefig("decision_tree.png", dpi=150)
plt.show()

Practice Questions: Decision Trees

Test your understanding with these coding challenges.

Task: Load the Wine dataset, split it 80/20, train a DecisionTreeClassifier with max_depth=4, and print the accuracy.

Show Solution
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=42
)

tree = DecisionTreeClassifier(max_depth=4, random_state=42)
tree.fit(X_train, y_train)

print(f"Accuracy: {tree.score(X_test, y_test):.2%}")

Task: Train both an unpruned tree and one with min_samples_leaf=5. Compare train vs test accuracy to show overfitting.

Show Solution
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Unpruned
unpruned = DecisionTreeClassifier(random_state=42)
unpruned.fit(X_train, y_train)
print(f"Unpruned - Train: {unpruned.score(X_train, y_train):.2%}, Test: {unpruned.score(X_test, y_test):.2%}")

# Pruned
pruned = DecisionTreeClassifier(min_samples_leaf=5, random_state=42)
pruned.fit(X_train, y_train)
print(f"Pruned - Train: {pruned.score(X_train, y_train):.2%}, Test: {pruned.score(X_test, y_test):.2%}")

Task: Test max_depth values from 1 to 15 using 5-fold cross-validation. Print the best depth and its score.

Show Solution
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score

digits = load_digits()
X, y = digits.data, digits.target

best_depth, best_score = 1, 0

for depth in range(1, 16):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    scores = cross_val_score(tree, X, y, cv=5)
    mean_score = scores.mean()
    if mean_score > best_score:
        best_depth, best_score = depth, mean_score
    print(f"depth={depth:2d}: {mean_score:.4f}")

print(f"\nBest: max_depth={best_depth} with accuracy {best_score:.4f}")
02

Random Forest

Imagine asking one friend for advice versus asking 100 friends and going with the majority opinion. Random Forest applies this "wisdom of the crowd" principle to machine learning. Instead of relying on a single Decision Tree (which might be biased or overfit), Random Forest builds hundreds of trees on random subsets of your data and features, then combines their predictions through voting. The result is almost always more accurate and robust than any individual tree. This is one of the most powerful and widely-used machine learning algorithms, often achieving excellent results with minimal tuning.

The Wisdom of Crowds: In 1906, Francis Galton discovered that the average guess of 800 people at a county fair was almost exactly correct for the weight of an ox - more accurate than any individual expert! Random Forest applies this same principle: individual trees might make mistakes, but when you average many diverse trees together, the errors cancel out and the wisdom emerges.

How Random Forest Works

Random Forest introduces two types of randomness to create diverse trees: Bootstrap Sampling (each tree trains on a random sample of the data with replacement) and Feature Randomness (each split considers only a random subset of features). This diversity is crucial - if all trees were identical, voting would be pointless. By making each tree slightly different, errors tend to cancel out when we aggregate predictions.

Step-by-step process:

  1. Create bootstrap samples: For each tree, randomly draw N samples from your data WITH replacement (some samples repeat, others are left out).
  2. Build diverse trees: Train each tree on its bootstrap sample, but at each split, only consider a random subset of features (typically sqrt of total features).
  3. Repeat many times: Build 100, 500, or even 1000 trees - more is usually better until returns diminish.
  4. Aggregate predictions: For classification, each tree votes and the majority wins. For regression, average all tree predictions.
Bootstrap Sampling

Each tree is trained on a random sample drawn with replacement from the training data. Some samples appear multiple times, others not at all. This creates different training sets for each tree.

Feature Randomness

At each split, only a random subset of features is considered. For classification, typically sqrt(n_features) features are used. This prevents strong features from dominating every tree.

# Building a Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load breast cancer dataset
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Create Random Forest with 100 trees
rf_clf = RandomForestClassifier(
    n_estimators=100,     # Number of trees
    max_depth=10,         # Limit depth to prevent overfitting
    random_state=42,
    n_jobs=-1             # Use all CPU cores
)
rf_clf.fit(X_train, y_train)

# Evaluate
print(f"Random Forest Accuracy: {rf_clf.score(X_test, y_test):.2%}")  # ~96%+

Feature Importance

One of Random Forest's most valuable features is its ability to rank the importance of each input feature. The algorithm measures how much each feature contributes to reducing impurity across all trees. Features that appear near the top of trees and lead to large impurity reductions are considered more important. This helps you understand which variables drive your predictions and can guide feature selection.

Why Feature Importance Matters: Imagine you're predicting house prices with 50 features. Feature importance tells you that "square footage" and "location" matter most, while "paint color" barely matters at all. This insight helps you: (1) understand what drives predictions, (2) remove useless features to simplify the model, (3) focus data collection efforts on important features, and (4) explain the model to stakeholders.

How importance is calculated: For each feature, Random Forest sums up how much that feature reduced impurity (Gini or Entropy) across all splits in all trees. Features that frequently appear early in trees and create large purity improvements get high importance scores. The scores are then normalized to sum to 1.0.

# Extracting and visualizing feature importance
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer

# Train Random Forest
data = load_breast_cancer()
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(data.data, data.target)

# Get feature importances
importances = rf.feature_importances_
indices = np.argsort(importances)[::-1][:10]  # Top 10

# Plot top 10 features
plt.figure(figsize=(10, 6))
plt.title("Top 10 Feature Importances")
plt.bar(range(10), importances[indices], align="center")
plt.xticks(range(10), [data.feature_names[i] for i in indices], rotation=45, ha='right')
plt.ylabel("Importance")
plt.tight_layout()
plt.show()

# Print top 5
print("Top 5 Most Important Features:")
for i in indices[:5]:
    print(f"  {data.feature_names[i]}: {importances[i]:.4f}")

Out-of-Bag (OOB) Score

Since each tree only sees about 63% of the training data (due to bootstrap sampling), the remaining 37% can be used as a built-in validation set. This is called the Out-of-Bag (OOB) score. It provides a free estimate of generalization performance without needing a separate validation set, making it especially useful when data is limited.

Free Validation - No Extra Data Needed! Here's the magic: when Tree #1 makes predictions, it's tested on the ~37% of samples it never saw during training. Tree #2 is tested on its unseen samples, and so on. By combining these "out-of-bag" predictions, we get a validation score without holding out a separate validation set! The OOB score is typically very close to cross-validation scores but computed for free during training.

When to use OOB score:

  • When you have limited data and can't afford to hold out a validation set
  • For quick hyperparameter tuning without running cross-validation
  • To monitor training progress and detect overfitting
  • When you want a reliable accuracy estimate with minimal code
# Using Out-of-Bag score for validation
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=42
)

# Enable OOB scoring
rf = RandomForestClassifier(
    n_estimators=100,
    oob_score=True,      # Enable OOB scoring
    random_state=42,
    n_jobs=-1
)
rf.fit(X_train, y_train)

print(f"OOB Score (built-in validation): {rf.oob_score_:.2%}")
print(f"Test Score: {rf.score(X_test, y_test):.2%}")
Random Forest Strengths: Works well out-of-the-box with minimal tuning, handles high-dimensional data, provides feature importance, resistant to overfitting, and can handle missing values.

Practice Questions: Random Forest

Test your understanding with these coding challenges.

Task: Create a RandomForestClassifier with 200 trees on the Iris dataset. Print the accuracy.

Show Solution
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)

print(f"Accuracy: {rf.score(X_test, y_test):.2%}")

Task: Train a Random Forest on the Wine dataset and print the top 5 most important features with their scores.

Show Solution
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_wine

wine = load_wine()
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(wine.data, wine.target)

importances = rf.feature_importances_
indices = np.argsort(importances)[::-1]

print("Top 5 Important Features:")
for i in indices[:5]:
    print(f"  {wine.feature_names[i]}: {importances[i]:.4f}")

Task: Train Random Forests with 10, 50, 100, 200 trees. Compare OOB score vs test score for each.

Show Solution
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=42
)

print("n_trees | OOB Score | Test Score")
print("-" * 35)

for n_trees in [10, 50, 100, 200]:
    rf = RandomForestClassifier(n_estimators=n_trees, oob_score=True, random_state=42)
    rf.fit(X_train, y_train)
    print(f"{n_trees:7d} | {rf.oob_score_:.4f}    | {rf.score(X_test, y_test):.4f}")
03

Bagging Technique

Bagging, short for Bootstrap Aggregating, is the foundational ensemble technique that Random Forest is built upon. The core idea is simple but powerful: instead of training one model on all your data, train multiple models on different random samples of your data, then combine their predictions. This reduces variance and helps prevent overfitting. Think of it as getting multiple opinions from doctors who have each seen different patient cases, then combining their diagnoses for a more reliable conclusion. Bagging is one of the most important concepts in ensemble learning and forms the basis for many powerful algorithms.

The Power of Averaging: A single Decision Tree might make wild predictions based on quirks in its training data. But if you train 100 trees on different data samples and average their predictions, the random errors cancel out! This is called "variance reduction" - individual models might be wrong, but their average is usually much more stable and accurate.

Bootstrap Sampling Explained

Bootstrap sampling means drawing samples with replacement from your original dataset. If you have 100 data points, you draw 100 samples randomly - but the same point can be selected multiple times. On average, each bootstrap sample contains about 63.2% of the original unique data points. The remaining ~36.8% (called Out-of-Bag samples) can be used for validation.

Why "with replacement" matters: Imagine picking names from a hat. Normally, once you pick "Alice," you can't pick her again. But with replacement, you put the name back after each pick. This means:

  • Some data points appear multiple times in a sample (they got picked more than once)
  • Some data points don't appear at all (they were never picked)
  • Each bootstrap sample is slightly different, creating diversity among models
  • The ~37% of "left out" samples provide free validation data
# Understanding Bootstrap Sampling
import numpy as np

# Original dataset
original_data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# Create bootstrap samples
np.random.seed(42)
for i in range(3):
    # Sample WITH replacement (same size as original)
    bootstrap_sample = np.random.choice(original_data, size=len(original_data), replace=True)
    unique_count = len(np.unique(bootstrap_sample))
    print(f"Bootstrap {i+1}: {bootstrap_sample}")
    print(f"  Unique elements: {unique_count}/10 ({unique_count/10:.1%})")
    print()

BaggingClassifier in Scikit-learn

While Random Forest is specifically designed for Decision Trees, Scikit-learn's BaggingClassifier allows you to apply bagging to any base estimator. You can bag SVMs, Neural Networks, or any other classifier. This flexibility makes it a powerful tool for improving the stability of any high-variance model.

# Bagging with different base estimators
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=42
)

# Bagging with Decision Trees (similar to Random Forest)
bag_tree = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=50,
    max_samples=0.8,       # Use 80% of samples for each tree
    max_features=0.8,      # Use 80% of features
    bootstrap=True,
    random_state=42,
    n_jobs=-1
)
bag_tree.fit(X_train, y_train)
print(f"Bagged Trees Accuracy: {bag_tree.score(X_test, y_test):.2%}")

# Bagging with SVM
bag_svm = BaggingClassifier(
    estimator=SVC(),
    n_estimators=20,
    random_state=42,
    n_jobs=-1
)
bag_svm.fit(X_train, y_train)
print(f"Bagged SVM Accuracy: {bag_svm.score(X_test, y_test):.2%}")
n_estimators

Number of base models to train. More models = better but slower. Typically 50-500 trees.

max_samples

Fraction of samples for each bootstrap. 1.0 means same size as training set.

max_features

Fraction of features for each model. Lower values increase diversity.

Bagging vs Single Model

Let's demonstrate how bagging reduces variance and improves generalization. A single Decision Tree tends to overfit, but an ensemble of bagged trees produces more stable predictions.

# Comparing single tree vs bagged trees
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# Create a challenging dataset
X, y = make_classification(n_samples=1000, n_features=20, 
                           n_informative=10, n_redundant=5,
                           random_state=42)

# Single Decision Tree
single_tree = DecisionTreeClassifier(random_state=42)
single_scores = cross_val_score(single_tree, X, y, cv=10)

# Bagged Trees
bagged_trees = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=50,
    random_state=42
)
bagged_scores = cross_val_score(bagged_trees, X, y, cv=10)

print("Single Tree:")
print(f"  Mean: {single_scores.mean():.2%}, Std: {single_scores.std():.4f}")
print("\nBagged Trees (50):")
print(f"  Mean: {bagged_scores.mean():.2%}, Std: {bagged_scores.std():.4f}")

Practice Questions: Bagging

Test your understanding with these coding challenges.

Task: Train a BaggingClassifier with 30 Decision Tree estimators on the Iris dataset.

Show Solution
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=30,
    random_state=42
)
bag.fit(X_train, y_train)

print(f"Accuracy: {bag.score(X_test, y_test):.2%}")

Task: From an array of 20 elements, create 5 bootstrap samples and print the percentage of unique elements in each.

Show Solution
import numpy as np

np.random.seed(42)
data = np.arange(1, 21)  # 1 to 20

for i in range(5):
    sample = np.random.choice(data, size=len(data), replace=True)
    unique_pct = len(np.unique(sample)) / len(data) * 100
    print(f"Sample {i+1}: {unique_pct:.1f}% unique elements")

Task: Use 10-fold CV to compare standard deviation of scores between single tree and bagging. Show variance reduction.

Show Solution
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score

wine = load_wine()
X, y = wine.data, wine.target

single = DecisionTreeClassifier(random_state=42)
single_scores = cross_val_score(single, X, y, cv=10)

bagged = BaggingClassifier(estimator=DecisionTreeClassifier(), 
                           n_estimators=50, random_state=42)
bag_scores = cross_val_score(bagged, X, y, cv=10)

print(f"Single Tree - Mean: {single_scores.mean():.2%}, Std: {single_scores.std():.4f}")
print(f"Bagged (50) - Mean: {bag_scores.mean():.2%}, Std: {bag_scores.std():.4f}")
print(f"\nVariance reduction: {(1 - bag_scores.std()/single_scores.std()):.1%}")
04

Boosting Algorithms

While Bagging trains models independently in parallel, Boosting takes a fundamentally different approach: it trains models sequentially, with each new model focusing on correcting the mistakes of previous ones. Think of it like learning from your errors - after each test, you focus extra study time on the questions you got wrong. This iterative improvement often leads to even better performance than Bagging, making Boosting algorithms among the most powerful in machine learning. Boosting has won more Kaggle competitions than almost any other technique!

Learning from Mistakes - A Student Analogy: Imagine a student preparing for an exam. After each practice test, they don't just study randomly - they focus extra time on the questions they got wrong. The next practice test, they might miss different questions, so they adjust again. Over many iterations, they gradually eliminate their weak spots. Boosting works exactly like this: each new model pays extra attention to the examples that previous models got wrong, gradually building a very strong overall model.

Boosting vs Bagging

Understanding the difference between Bagging and Boosting is crucial for choosing the right ensemble method:

Bagging (Parallel)

Models trained independently on bootstrap samples. Reduces variance. Each model has equal vote. Can run in parallel for speed. Best when individual models overfit (high variance).

Boosting (Sequential)

Models trained sequentially, each fixing previous errors. Reduces bias. Later models may have higher weights. Must run sequentially. Best when individual models are too simple (high bias).

AdaBoost (Adaptive Boosting)

AdaBoost was one of the first successful boosting algorithms. It works by assigning weights to training samples - samples that are misclassified get higher weights so the next model pays more attention to them. Each weak learner (often a simple decision stump with just one split) is then combined into a strong learner through weighted voting.

Teacher Analogy: Imagine multiple teachers helping a student. The first teacher teaches a topic and gives a test. The student misses some questions. The second teacher now knows which questions were missed and spends extra time on those specific topics. The third teacher focuses even more on the remaining weak areas. After many teachers, the student has gradually improved on all topics. AdaBoost works similarly - each new model focuses on the "difficult" examples that previous models got wrong.

How AdaBoost Works Step by Step

  • Start: Give all training samples equal weight (everyone is equally important)
  • Train Model 1: Build a simple "decision stump" (tree with just one split)
  • Find Mistakes: Identify which samples were classified incorrectly
  • Increase Weights: Give misclassified samples higher weights (they're now more important)
  • Train Model 2: Build another stump - it will focus more on the high-weight samples
  • Repeat: Continue for N iterations (typically 50-100 models)
  • Final Vote: Combine all models with weighted voting (models with fewer errors get higher votes)
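The weight-update loop above can be sketched in a few lines of NumPy. This is an illustrative simplification for the binary case (labels in {-1, +1}), not scikit-learn's exact implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)   # toy labels in {-1, +1}

n_samples = len(y)
weights = np.full(n_samples, 1 / n_samples)   # Start: equal weights
stumps, alphas = [], []

for _ in range(20):
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=weights)        # train on weighted data
    pred = stump.predict(X)
    err = np.clip(weights[pred != y].sum(), 1e-10, 1 - 1e-10)  # weighted error
    alpha = 0.5 * np.log((1 - err) / err)         # this model's voting weight
    weights *= np.exp(-alpha * y * pred)          # increase weights of mistakes
    weights /= weights.sum()                      # renormalize to sum to 1
    stumps.append(stump)
    alphas.append(alpha)

# Final vote: sign of the alpha-weighted sum of stump predictions
vote = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
ensemble_pred = np.sign(vote)
print(f"Training accuracy: {(ensemble_pred == y).mean():.2%}")
```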
# AdaBoost Classifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# AdaBoost with decision stumps (default)
# By default, uses DecisionTreeClassifier with max_depth=1 (a "stump")
ada_clf = AdaBoostClassifier(
    n_estimators=100,      # Number of weak learners to combine
    learning_rate=0.5,     # Shrinkage - lower = more robust but needs more trees
    random_state=42
)
ada_clf.fit(X_train, y_train)
print(f"AdaBoost Accuracy: {ada_clf.score(X_test, y_test):.2%}")

# AdaBoost with deeper trees as base estimator
# Using deeper trees can capture more complex patterns
ada_deep = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=3),  # Slightly deeper trees
    n_estimators=50,       # Can use fewer if base is stronger
    learning_rate=0.5,
    random_state=42
)
ada_deep.fit(X_train, y_train)
print(f"AdaBoost (depth=3) Accuracy: {ada_deep.score(X_test, y_test):.2%}")

Gradient Boosting

Gradient Boosting takes a different approach: instead of adjusting sample weights, it trains each new model to predict the residual errors (the difference between actual and predicted values) of the previous ensemble. By repeatedly adding models that correct residual errors, the ensemble gradually improves. This technique is the foundation for XGBoost, LightGBM, and CatBoost.

Archer Analogy: Imagine you're an archer trying to hit a target. You shoot and miss by 10cm to the right. Your coach (the next model) helps you correct - but instead of aiming at the target, they help you aim at the "10cm to the left" point (the error). You shoot again and now you're only 3cm right of target. The next coach helps you correct that 3cm error. Each coach focuses on correcting the remaining error until you're hitting the bullseye. That's Gradient Boosting - each tree learns to predict and correct the errors!

How Gradient Boosting Works Step by Step

  • Start: Make an initial prediction (often just the average for regression, or log-odds for classification)
  • Calculate Errors: For each sample, calculate residual = actual - predicted
  • Train Tree 1: Build a tree to predict these residuals (not the original targets!)
  • Update Predictions: Add Tree 1's predictions (scaled by learning rate) to current predictions
  • Calculate New Errors: Residuals are now smaller
  • Train Tree 2: Build another tree to predict the remaining residuals
  • Repeat: Continue until residuals are tiny or you've built N trees
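For regression, the residual-fitting loop above can be written out directly. A minimal sketch using squared-error loss on synthetic data (an illustration of the idea, not a production implementation):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)   # noisy sine wave

learning_rate = 0.1
pred = np.full_like(y, y.mean())    # Start: initial prediction = the mean
trees = []

for _ in range(100):
    residuals = y - pred                          # Calculate current errors
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(X, residuals)                        # Train a tree on the residuals
    pred += learning_rate * tree.predict(X)       # Shrunken update of predictions
    trees.append(tree)

mse = np.mean((y - pred) ** 2)
print(f"Training MSE after {len(trees)} trees: {mse:.4f}")
```

Each iteration shrinks the residuals a little; the learning rate keeps any single tree from overcorrecting.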
# Gradient Boosting Classifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=42
)

# Gradient Boosting - most popular boosting method
gb_clf = GradientBoostingClassifier(
    n_estimators=100,      # Number of trees
    learning_rate=0.1,     # Step size - how much each tree contributes
    max_depth=3,           # Depth of each tree (usually keep shallow: 3-5)
    random_state=42
)
gb_clf.fit(X_train, y_train)

print(f"Gradient Boosting Accuracy: {gb_clf.score(X_test, y_test):.2%}")
Learning Rate Trade-off: The learning rate controls how much each tree contributes. Lower learning rates (e.g., 0.01-0.1) require more trees but often achieve better generalization because each step is more cautious. Higher rates (e.g., 0.3-1.0) train faster but may overfit because each tree makes big changes. Start with 0.1 and tune. A common strategy: decrease learning rate, increase n_estimators proportionally.
Boosting Pitfall - Overfitting to Noise: Because Boosting focuses on hard examples, it can overfit to noisy data (outliers or mislabeled samples). If those "hard" examples are actually noise or errors, Boosting will try very hard to fit them! Use validation sets and early stopping to prevent this.
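Scikit-learn's GradientBoostingClassifier supports the validation-plus-early-stopping advice directly via n_iter_no_change, which holds out a fraction of the training data and stops when the validation score stalls. A minimal sketch:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    *load_breast_cancer(return_X_y=True), test_size=0.2, random_state=42
)

gb = GradientBoostingClassifier(
    n_estimators=500,          # upper bound - early stopping decides when to halt
    validation_fraction=0.1,   # held out internally to monitor the score
    n_iter_no_change=10,       # stop after 10 rounds without improvement
    random_state=42
)
gb.fit(X_train, y_train)

print(f"Trees actually built: {gb.n_estimators_}")   # usually well below 500
print(f"Test accuracy: {gb.score(X_test, y_test):.2%}")
```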

Comparing Boosting Algorithms

# Comparing AdaBoost vs Gradient Boosting
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score

wine = load_wine()
X, y = wine.data, wine.target

# AdaBoost
ada = AdaBoostClassifier(n_estimators=100, random_state=42)
ada_scores = cross_val_score(ada, X, y, cv=5)

# Gradient Boosting
gb = GradientBoostingClassifier(n_estimators=100, random_state=42)
gb_scores = cross_val_score(gb, X, y, cv=5)

print("Algorithm Comparison (5-fold CV):")
print(f"  AdaBoost:          {ada_scores.mean():.2%} (+/- {ada_scores.std()*2:.2%})")
print(f"  Gradient Boosting: {gb_scores.mean():.2%} (+/- {gb_scores.std()*2:.2%})")

Practice Questions: Boosting

Test your understanding with these coding challenges.

Task: Create an AdaBoostClassifier with 50 estimators and learning_rate=0.5 on the Iris dataset.

Show Solution
from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

ada = AdaBoostClassifier(n_estimators=50, learning_rate=0.5, random_state=42)
ada.fit(X_train, y_train)

print(f"Accuracy: {ada.score(X_test, y_test):.2%}")

Task: Test learning_rate values [0.01, 0.1, 0.5, 1.0] with Gradient Boosting (100 trees). Show how it affects accuracy.

Show Solution
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

print("Learning Rate | Accuracy")
print("-" * 25)
for lr in [0.01, 0.1, 0.5, 1.0]:
    gb = GradientBoostingClassifier(n_estimators=100, learning_rate=lr, random_state=42)
    gb.fit(X_train, y_train)
    print(f"{lr:12.2f} | {gb.score(X_test, y_test):.2%}")

Task: Train Gradient Boosting with 200 estimators. Plot staged_predict accuracy for training and test sets to visualize overfitting.

Show Solution
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import numpy as np

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=42
)

gb = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, random_state=42)
gb.fit(X_train, y_train)

train_scores = [accuracy_score(y_train, y_pred) 
                for y_pred in gb.staged_predict(X_train)]
test_scores = [accuracy_score(y_test, y_pred) 
               for y_pred in gb.staged_predict(X_test)]

# Find best iteration
best_iter = np.argmax(test_scores)
print(f"Best at {best_iter+1} estimators: {test_scores[best_iter]:.2%}")
print(f"Train at best: {train_scores[best_iter]:.2%}")
05

XGBoost & LightGBM

XGBoost and LightGBM are industrial-strength gradient boosting libraries that have dominated machine learning competitions and power production systems at companies like Microsoft, Airbnb, and Alibaba. They take the gradient boosting concept and optimize it for speed, scalability, and performance. If you need the best possible accuracy on tabular data, these are often your best choices.

Why Learn These? XGBoost and LightGBM are the most practical ML algorithms you can learn. For tabular data (spreadsheets, databases, CSV files), they often beat neural networks while being faster to train, easier to tune, and more interpretable. Every data scientist should have these in their toolkit!

XGBoost (Extreme Gradient Boosting)

XGBoost adds several key improvements over standard gradient boosting: regularization to prevent overfitting, efficient handling of sparse data, built-in cross-validation, and parallelized tree construction. It has won more Kaggle competitions than any other algorithm and remains the go-to choice for many practitioners.

What Makes XGBoost "Extreme"?
  • Regularization: Built-in L1 and L2 regularization prevents overfitting (scikit-learn GB doesn't have this)
  • Parallel Processing: Builds trees in parallel, making it much faster
  • Smart Splits: Uses histograms to find best splits faster
  • Missing Values: Automatically handles missing data (no imputation needed!)
  • Early Stopping: Can stop training when validation score stops improving
# XGBoost Classifier
# Install: pip install xgboost
from xgboost import XGBClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# XGBoost with key hyperparameters explained
xgb_clf = XGBClassifier(
    n_estimators=100,        # Number of trees to build
    max_depth=5,             # How deep each tree can grow (prevent overfitting)
    learning_rate=0.1,       # Step size - smaller = more trees needed but better
    subsample=0.8,           # Use 80% of rows per tree (like bagging!)
    colsample_bytree=0.8,    # Use 80% of columns per tree (randomness)
    reg_alpha=0.1,           # L1 regularization on leaf weights (can zero some out)
    reg_lambda=1.0,          # L2 regularization on leaf weights (shrinks them)
    random_state=42,
    eval_metric='logloss'    # use_label_encoder is no longer needed in recent XGBoost
)
xgb_clf.fit(X_train, y_train)

print(f"XGBoost Accuracy: {xgb_clf.score(X_test, y_test):.2%}")

LightGBM (Light Gradient Boosting Machine)

LightGBM, developed by Microsoft, focuses on speed and efficiency. It uses histogram-based algorithms and leaf-wise (best-first) tree growth instead of level-wise growth. This makes it significantly faster than XGBoost on large datasets while often achieving comparable or better accuracy.

What Makes LightGBM "Light" and Fast?
  • Histogram-based: Groups feature values into bins, reducing computation
  • Leaf-wise Growth: Grows the leaf that reduces error most (not level-by-level)
  • Gradient-based Sampling: Focuses on samples with larger gradients (bigger errors)
  • Memory Efficient: Uses less memory than XGBoost for same dataset
  • Categorical Support: Handles categorical features natively (no encoding needed!)
LightGBM Gotcha: Because of leaf-wise growth, LightGBM can overfit on small datasets. Use num_leaves carefully - it's the main parameter to control complexity. A good rule: num_leaves should be less than 2^max_depth.
# LightGBM Classifier
# Install: pip install lightgbm
from lightgbm import LGBMClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=42
)

# LightGBM with key hyperparameters explained
lgbm_clf = LGBMClassifier(
    n_estimators=100,        # Number of trees
    max_depth=10,            # Maximum depth (can go deeper with leaf-wise)
    learning_rate=0.1,       # Step size
    num_leaves=31,           # KEY PARAMETER! Controls model complexity
    subsample=0.8,           # Row sampling
    colsample_bytree=0.8,    # Column sampling
    reg_alpha=0.1,           # L1 regularization
    reg_lambda=0.1,          # L2 regularization
    random_state=42,
    verbose=-1               # Suppress training output
)
lgbm_clf.fit(X_train, y_train)

print(f"LightGBM Accuracy: {lgbm_clf.score(X_test, y_test):.2%}")

Comparison: XGBoost vs LightGBM vs Scikit-learn

Let's compare all the boosting methods we've learned. This will help you understand when to use each:

# Comprehensive comparison
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
import time

digits = load_digits()
X, y = digits.data, digits.target

models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Sklearn GB": GradientBoostingClassifier(n_estimators=100, random_state=42),
    "XGBoost": XGBClassifier(n_estimators=100, random_state=42, use_label_encoder=False, eval_metric='mlogloss', verbosity=0),
    "LightGBM": LGBMClassifier(n_estimators=100, random_state=42, verbose=-1)
}

print("Model Comparison (5-fold CV):")
print("-" * 50)

for name, model in models.items():
    start = time.time()
    scores = cross_val_score(model, X, y, cv=5)
    elapsed = time.time() - start
    print(f"{name:15s}: {scores.mean():.2%} (+/- {scores.std()*2:.2%}) | Time: {elapsed:.2f}s")
XGBoost Strengths
  • Built-in regularization (L1 & L2)
  • Handles missing values automatically
  • Parallel tree construction
  • Built-in cross-validation
  • Feature importance scores
  • Early stopping support
LightGBM Strengths
  • Faster training on large datasets
  • Lower memory usage (histogram-based)
  • Leaf-wise tree growth (often better accuracy)
  • Native categorical feature support
  • Handles large feature spaces well
  • GPU training support

Early Stopping

Both XGBoost and LightGBM support early stopping, which automatically halts training when validation performance stops improving. This prevents overfitting and saves training time - you don't need to guess the optimal number of trees.

What is Early Stopping? Imagine you're training for a marathon. Training more makes you better at first, but at some point, training too much leads to injury (overfitting!). Early stopping is like having a coach who monitors your performance on a validation test after each training session. When your performance stops improving for several sessions in a row, the coach says "Stop! You've peaked." This is exactly what early stopping does - it monitors validation score and stops training when improvement stalls.

How Early Stopping Works

  • Set a high number of trees (e.g., 1000) - we won't actually build all of them
  • After building each tree, check performance on a validation set
  • Keep track of the best validation score seen so far
  • If validation score hasn't improved for N rounds (e.g., 10 rounds), stop training
  • Return the model from the best iteration, not the last one
Pro Tip: Early stopping gives you two benefits: (1) you don't need to tune n_estimators - just set it high and let early stopping find the optimal number, and (2) you save training time by not building unnecessary trees. Always use early stopping when you have a validation set!
# Early stopping with XGBoost
from xgboost import XGBClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# XGBoost with early stopping
xgb = XGBClassifier(
    n_estimators=1000,         # Set HIGH - early stopping will find the optimum
    learning_rate=0.1,         # Lower learning rate = finer steps toward the optimum
    early_stopping_rounds=10,  # Stop after 10 rounds without improvement
    eval_metric='logloss',
    random_state=42
)

# Fit with early stopping - KEY PARAMETER:
# eval_set: validation data to monitor (the test set is used here for simplicity;
# in practice, use a separate validation split, not your final test set)
xgb.fit(
    X_train, y_train,
    eval_set=[(X_test, y_test)],  # Monitor performance after each tree
    verbose=False                  # Don't print every iteration
)

print(f"Best iteration: {xgb.best_iteration}")  # Where it stopped!
print(f"Accuracy: {xgb.score(X_test, y_test):.2%}")

Which Algorithm Should I Use?

Decision Guide
  • Small dataset (<10K rows): XGBoost or Random Forest (LightGBM may overfit; RF is robust)
  • Large dataset (>100K rows): LightGBM (much faster, lower memory usage)
  • Need interpretability: Single Decision Tree or Random Forest (easier to explain to stakeholders)
  • Kaggle competition: XGBoost or LightGBM, or an ensemble of both (best raw performance)
  • Production system: LightGBM (fast inference, low memory)
  • Baseline model: Random Forest (works well with default parameters)

Practice Questions: XGBoost & LightGBM

Test your understanding with these coding challenges.

Task: Create an XGBClassifier with 50 estimators on the Wine dataset and print accuracy.

Show Solution
from xgboost import XGBClassifier
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=42
)

xgb = XGBClassifier(n_estimators=50, random_state=42, eval_metric='mlogloss')
xgb.fit(X_train, y_train)

print(f"Accuracy: {xgb.score(X_test, y_test):.2%}")

Task: Train both XGBoost and LightGBM with 200 trees on Digits dataset. Time each and compare accuracy.

Show Solution
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
import time

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=42
)

# XGBoost
start = time.time()
xgb = XGBClassifier(n_estimators=200, random_state=42, verbosity=0)
xgb.fit(X_train, y_train)
xgb_time = time.time() - start
xgb_acc = xgb.score(X_test, y_test)

# LightGBM
start = time.time()
lgbm = LGBMClassifier(n_estimators=200, random_state=42, verbose=-1)
lgbm.fit(X_train, y_train)
lgbm_time = time.time() - start
lgbm_acc = lgbm.score(X_test, y_test)

print(f"XGBoost:  {xgb_acc:.2%} in {xgb_time:.3f}s")
print(f"LightGBM: {lgbm_acc:.2%} in {lgbm_time:.3f}s")

Task: Use XGBoost with early_stopping_rounds=10 to find optimal number of trees. Print best iteration and final accuracy.

Show Solution
from xgboost import XGBClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

xgb = XGBClassifier(
    n_estimators=500,
    learning_rate=0.1,
    random_state=42,
    eval_metric='logloss',
    early_stopping_rounds=10
)

xgb.fit(X_train, y_train, 
        eval_set=[(X_test, y_test)],
        verbose=False)

print(f"Best iteration: {xgb.best_iteration}")
print(f"Best score: {xgb.best_score:.4f}")
print(f"Final accuracy: {xgb.score(X_test, y_test):.2%}")

Key Takeaways

Decision Trees

Tree-based models split data using questions, are highly interpretable, but prone to overfitting without pruning

Random Forest

Ensemble of diverse trees trained on bootstrap samples - reduces variance and provides feature importance

Bagging

Bootstrap Aggregating trains models in parallel on random samples, reducing variance and preventing overfitting

Boosting

Sequential training where each model corrects previous errors - reduces bias and often achieves superior accuracy

XGBoost

Optimized gradient boosting with regularization, handling missing values, and parallel tree construction

LightGBM

Fast histogram-based boosting with leaf-wise growth - ideal for large datasets with speed-critical applications

Knowledge Check

Test your understanding of Decision Trees and Ensemble Methods:

Question 1 of 9

What is the primary purpose of tree pruning in Decision Trees?

Question 2 of 9

Which splitting criterion measures the probability of incorrectly classifying a random element?

Question 3 of 9

How does Random Forest introduce diversity among its trees?

Question 4 of 9

What is the Out-of-Bag (OOB) score in Random Forest?

Question 5 of 9

What is the key difference between Bagging and Boosting?

Question 6 of 9

How does Gradient Boosting improve upon previous predictions?

Question 7 of 9

What is the purpose of the learning_rate parameter in boosting algorithms?

Question 8 of 9

Which advantage does LightGBM have over traditional Gradient Boosting?

Question 9 of 9

What does early stopping prevent in XGBoost and LightGBM?
