Module 3.1

Logistic Regression

Master binary classification with logistic regression. Learn the sigmoid function, probability estimation, decision boundaries, and how to train models for real-world classification problems!

50 min read
Intermediate
What You'll Learn
  • Binary classification fundamentals
  • Sigmoid function and probability
  • Decision boundaries and thresholds
  • Cost function and gradient descent
  • Training with scikit-learn
Contents
01

Classification Fundamentals

So far, we have learned about Linear Regression, which predicts continuous values like house prices or temperature. But many real-world problems require different answers. Should we approve this loan? Will this patient develop diabetes? Is this email spam? These are classification problems where we predict categories, not numbers. Logistic Regression is the go-to algorithm for binary classification - predicting one of two possible outcomes.

Regression vs Classification

Let's clarify the difference between the two main types of supervised learning. Regression predicts continuous numerical outputs from a range of infinite possibilities. Classification predicts discrete categorical outputs from a finite set of classes. Despite its name, Logistic Regression is actually a classification algorithm, not a regression algorithm.

# Understanding the difference: Regression vs Classification

# REGRESSION: Predicting continuous values
# Example: House price prediction
house_features = {"bedrooms": 3, "sqft": 1500, "location": "suburban"}
predicted_price = 285000.50  # Could be any number
print(f"Predicted house price: ${predicted_price:,.2f}")

# CLASSIFICATION: Predicting discrete categories
# Example: Email spam detection
email_features = {"has_link": True, "urgent_words": 5, "sender": "unknown"}
predicted_class = "spam"  # Only two options: "spam" or "not spam"
print(f"Email classification: {predicted_class}")

# Key difference:
print("\nRegression output: Any number (infinite possibilities)")
print("Classification output: One of fixed categories (finite choices)")
Regression Example

Question: What will the house price be?
Output: $450,000 (continuous value)

Classification Example

Question: Will this email be spam?
Output: Yes or No (discrete class)

Binary vs Multiclass Classification

Classification problems come in two main flavors. Binary classification has two possible outcomes (spam or not spam, disease or healthy). Multiclass classification has three or more categories (is this email spam, promotional, or personal?). Logistic Regression handles binary classification directly, though we can extend it to multiclass problems using techniques like One-vs-Rest.

# Binary vs Multiclass Classification Examples

# BINARY CLASSIFICATION (2 classes)
# Logistic Regression is perfect for this!
binary_examples = {
    "Email": ["spam", "not spam"],
    "Medical": ["disease", "healthy"],
    "Loan": ["approved", "rejected"],
    "Fraud": ["fraudulent", "legitimate"]
}

print("Binary Classification Examples:")
for task, classes in binary_examples.items():
    print(f"  {task}: {classes[0]} vs {classes[1]}")

# MULTICLASS CLASSIFICATION (3+ classes)
# Requires extension like One-vs-Rest
multiclass_examples = {
    "Digit Recognition": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    "Email Type": ["spam", "promotional", "personal", "work"],
    "Sentiment": ["negative", "neutral", "positive"]
}

print("\nMulticlass Classification Examples:")
for task, classes in multiclass_examples.items():
    print(f"  {task}: {len(classes)} classes")
Real-World Applications: Credit approval (approve/reject), medical diagnosis (disease/healthy), email filtering (spam/not spam), fraud detection (fraudulent/legitimate transaction), sentiment analysis (positive/negative).
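As a quick illustration (a minimal sketch with made-up toy data), scikit-learn's LogisticRegression can be fit on a multiclass target directly; internally it handles the extension to three or more classes (via a multinomial or one-vs-rest scheme) for you:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: one feature, three well-separated classes (illustrative only)
X = np.array([[1], [2], [3], [10], [11], [12], [20], [21], [22]])
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

clf = LogisticRegression()
clf.fit(X, y)

# One predicted label per region; probabilities sum to 1 across all 3 classes
print(clf.predict([[2], [11], [21]]))
print(clf.predict_proba([[11]]).sum())
```

The key point: nothing changes in how you call the API; `predict_proba` simply returns one column of probabilities per class instead of two.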

Why Linear Regression Won't Work

You might ask: why can't we just use Linear Regression and round the output to 0 or 1? Let's see what happens. Recall that Linear Regression creates a line (or hyperplane in multiple dimensions) that fits the data. For classification, we want probabilities between 0 and 1, but Linear Regression can predict values far outside this range.

# Linear Regression on classification data
from sklearn.linear_model import LinearRegression
import numpy as np

# Example: predicting if student passes (1) or fails (0) based on hours studied
X = np.array([[2], [3], [4], [5], [6], [7], [8], [9]])
y = np.array([0, 0, 0, 1, 1, 1, 1, 1])

# Fit Linear Regression
model = LinearRegression()
model.fit(X, y)

# Predictions
predictions = model.predict(np.array([[0], [10], [15]]))
print("Hour 0:", predictions[0])   # ≈ -0.36 (a negative "probability"!)
print("Hour 10:", predictions[1])  # ≈ 1.43 (a "probability" greater than 1!)
print("Hour 15:", predictions[2])  # ≈ 2.32 (nonsensical!)

As you can see, Linear Regression produces predictions outside the valid probability range [0, 1]. This is why we need a special algorithm for classification: Logistic Regression. It uses the sigmoid function to squash any input value into the range 0 to 1.

Key Insight: We need an algorithm that guarantees outputs between 0 and 1 and interprets them as class probabilities. This is exactly what Logistic Regression provides.

Practice Questions: Classification Fundamentals

Test your understanding with these coding challenges.

Given:

problems = [
    "Predicting house prices",
    "Identifying if a photo contains a cat",
    "Forecasting next month's revenue",
    "Detecting credit card fraud"
]

Task: Create a dictionary mapping each problem to its type: "regression" or "classification"

Show Solution
problem_types = {
    "Predicting house prices": "regression",
    "Identifying if a photo contains a cat": "classification",
    "Forecasting next month's revenue": "regression",
    "Detecting credit card fraud": "classification"
}

for problem, ptype in problem_types.items():
    print(f"{problem}: {ptype}")

Given:

from sklearn.linear_model import LinearRegression
import numpy as np

X = np.array([[2], [3], [4], [5], [6], [7], [8], [9]])
y = np.array([0, 0, 0, 1, 1, 1, 1, 1])

Task: Train LinearRegression and predict for X=0 and X=12. Print the predictions and explain why they are invalid probabilities.

Show Solution
from sklearn.linear_model import LinearRegression
import numpy as np

X = np.array([[2], [3], [4], [5], [6], [7], [8], [9]])
y = np.array([0, 0, 0, 1, 1, 1, 1, 1])

model = LinearRegression()
model.fit(X, y)

pred_0 = model.predict([[0]])[0]
pred_12 = model.predict([[12]])[0]

print(f"Prediction at X=0: {pred_0:.3f}")   # Negative!
print(f"Prediction at X=12: {pred_12:.3f}") # Greater than 1!
print("\nThese are invalid probabilities!")

Given:

import numpy as np
X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

Task: Train both LinearRegression and LogisticRegression. Predict for X values [-2, 5, 12] and compare results side by side.

Show Solution
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

linear_model = LinearRegression()
logistic_model = LogisticRegression()

linear_model.fit(X, y)
logistic_model.fit(X, y)

test_values = [[-2], [5], [12]]

print("X value | Linear | Logistic Prob")
print("-" * 35)
for x in test_values:
    lin_pred = linear_model.predict([x])[0]
    log_prob = logistic_model.predict_proba([x])[0, 1]
    print(f"{x[0]:7} | {lin_pred:6.3f} | {log_prob:.3f}")
02

The Sigmoid Function

The heart of Logistic Regression is the sigmoid function, an elegant mathematical tool that transforms any input value into a probability between 0 and 1. This S-shaped curve is perfect for classification because it maps negative infinity to 0, zero to 0.5, and positive infinity to 1. The sigmoid function is the key innovation that makes Logistic Regression work.

The Sigmoid Function Equation

The sigmoid function, also called the logistic function, is mathematically expressed as:

Mathematical Definition

Sigmoid Function

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Where: e is Euler's number (approximately 2.71828), and z is any real number (the linear combination of inputs).

This simple equation is incredibly powerful. Let's understand what happens at key points:

  • z = -5: σ(z) ≈ 0.0067 (very close to 0)
  • z = 0: σ(z) = 0.5 (exactly halfway)
  • z = 5: σ(z) ≈ 0.9933 (very close to 1)

Let's implement this in Python step by step:

Step 1: Define the Sigmoid Function

First, we create a simple function that takes any number z and returns the sigmoid value. We use Python's math.exp() for the exponential calculation.

# Import the math module for exponential function
import math

def sigmoid(z):
    """Calculate sigmoid of z"""
    return 1 / (1 + math.exp(-z))
Why math.exp()? The function math.exp(-z) calculates $e^{-z}$. This is more accurate and readable than writing 2.71828 ** (-z).
Step 2: Test with Negative Input (z = -5)

When z is negative, the sigmoid outputs a value close to 0. This represents low probability of belonging to class 1.

# When z = -5 (negative input)
z = -5
exp_term = math.exp(-z)  # e^5 ≈ 148.41
result = 1 / (1 + exp_term)

print(f"z = {z}: sigmoid = {result:.4f}")
# Output: z = -5: sigmoid = 0.0067
How does this calculation work?

When we plug z = -5 into the sigmoid formula, we first calculate math.exp(-(-5)) which equals math.exp(5). This gives us approximately 148.41 (that's $e$ raised to the power of 5).

Next, we add 1 to this value: 1 + 148.41 = 149.41. This is our denominator.


Finally, we divide 1 by this denominator: 1 / 149.41 = 0.0067. This tiny number (close to 0) tells us the model predicts this sample is very unlikely to belong to class 1.

Step 3: Test with Zero Input (z = 0)

When z = 0, the sigmoid returns exactly 0.5. This is the decision boundary where the model is equally uncertain about both classes.

# When z = 0 (neutral input)
z = 0
exp_term = math.exp(-z)  # e^0 = 1
result = 1 / (1 + exp_term)

print(f"z = {z}: sigmoid = {result:.4f}")
# Output: z = 0: sigmoid = 0.5000
Why is this special?

When z = 0, we calculate math.exp(-0) which equals math.exp(0). Any number raised to the power of 0 equals 1, so $e^0 = 1$.

Our denominator becomes 1 + 1 = 2.


The final result is 1 / 2 = 0.5 exactly. This is the decision boundary! A probability of 0.5 means the model is completely uncertain. It's a coin flip between class 0 and class 1.

Step 4: Test with Positive Input (z = 5)

When z is positive, the sigmoid outputs a value close to 1. This represents high probability of belonging to class 1.

# When z = 5 (positive input)
z = 5
exp_term = math.exp(-z)  # e^-5 ≈ 0.0067
result = 1 / (1 + exp_term)

print(f"z = {z}: sigmoid = {result:.4f}")
# Output: z = 5: sigmoid = 0.9933
What happens with positive z?

When z = 5, we calculate math.exp(-5). This is $e$ raised to the power of -5, which gives us a very small number: approximately 0.0067.

Our denominator becomes 1 + 0.0067 = 1.0067. Notice how adding such a tiny number barely changes 1.


The final result is 1 / 1.0067 = 0.9933. This is very close to 1, meaning the model is highly confident this sample belongs to class 1. The larger the positive z, the closer the sigmoid gets to 1.

Negative z

σ(z) → 0
Low probability

Zero z

σ(z) = 0.5
Decision boundary

Positive z

σ(z) → 1
High probability

The sigmoid function has a beautiful S-shaped curve. Small changes in input near z=0 cause large changes in output, while extreme values flatten out. This makes the sigmoid ideal for representing probabilities.
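We can verify this numerically: the same step of 0.5 in z moves the output far more near the center than out in the flat tails (a quick sketch):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# The same 0.5 step in z produces very different changes in output
for z0 in [0.0, 4.0]:
    delta = sigmoid(z0 + 0.5) - sigmoid(z0)
    print(f"z: {z0} -> {z0 + 0.5}: sigmoid changes by {delta:.4f}")
```

Near z = 0 the step changes the output by roughly 0.12; out at z = 4 the same step changes it by less than 0.01.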

Visualizing the Sigmoid Function

A picture is worth a thousand words. Let's visualize the sigmoid curve to understand its behavior better.

Setting Up the Visualization

We'll use NumPy for efficient calculations and Matplotlib for plotting. NumPy's version of sigmoid can handle arrays of values at once.

# Import visualization libraries
import numpy as np
import matplotlib.pyplot as plt

# Define sigmoid for arrays (NumPy version)
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
Generate Input Values

We create 100 evenly spaced points between -5 and 5 to get a smooth curve.

# Generate input values from -5 to 5
z = np.linspace(-5, 5, 100)

# Calculate sigmoid for all values at once
sigma_z = sigmoid(z)

print(f"Input range: {z.min():.1f} to {z.max():.1f}")
print(f"Output range: {sigma_z.min():.4f} to {sigma_z.max():.4f}")
Create the Plot

Now we plot the S-curve with helpful reference lines at the decision boundary (0.5) and at z=0.

# Create the visualization
plt.figure(figsize=(10, 6))
plt.plot(z, sigma_z, 'b-', linewidth=2, label='σ(z)')

# Add reference lines
plt.axhline(y=0.5, color='r', linestyle='--', alpha=0.5, label='Threshold')
plt.axvline(x=0, color='gray', linestyle='-', alpha=0.3)

# Customize appearance
plt.xlabel('z (input)', fontsize=12)
plt.ylabel('σ(z) (probability)', fontsize=12)
plt.title('The Sigmoid Function', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)
plt.ylim(-0.1, 1.1)
plt.show()
What to notice: The curve is steepest at z=0, meaning small changes in input cause big changes in probability near the decision boundary. At extreme values, the curve flattens out (saturates), meaning the model becomes very confident in its prediction.
# Print sigmoid values across a range of inputs
for z_val in [-5, -3, -1, 0, 1, 3, 5]:
    print(f"σ({z_val:2}) = {sigmoid(z_val):.4f}")

From Linear to Logistic Model

In Linear Regression, we computed a weighted sum of input features:

Linear Regression

Linear Combination

$$y = w_0 + w_1x_1 + w_2x_2 + \cdots + w_nx_n$$

Where: $w_0$ is the intercept (bias), $w_1, w_2, \ldots, w_n$ are the weights for each feature, and $x_1, x_2, \ldots, x_n$ are the input features.

In Logistic Regression, we wrap this linear combination in the sigmoid function:

Logistic Regression

Probability Model

$$P(y=1|x) = \sigma(w_0 + w_1x_1 + w_2x_2 + \cdots + w_nx_n)$$

Result: This outputs the probability that $y=1$ given the input features $x$. The sigmoid $\sigma$ squashes any value into the range [0, 1].

The sigmoid function ensures that the output is always a valid probability between 0 and 1, regardless of how large or small the linear combination becomes.

Let's see how this works with a real example. Imagine we're predicting whether a student will pass an exam based on two features: hours studied and previous test score.

Step 1: Set Up the Model

First, we import NumPy and define our sigmoid function. Then we set up the model weights. In a real scenario, these weights would be learned from training data, but here we'll use pre-defined values.

# Import library and define sigmoid
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))
What is NumPy?

NumPy is Python's powerful numerical computing library. We use np.exp() instead of math.exp() because NumPy can handle arrays of numbers at once, which is essential for machine learning.

Step 2: Define Model Weights

The model has three weights: an intercept (bias) and one weight for each feature. These weights determine how much each feature influences the prediction.

# Model weights (learned from training data)
w0 = -4.0   # Intercept (bias)
w1 = 0.5    # Weight for hours studied
w2 = 0.3    # Weight for previous score

print("Model Weights:")
print(f"  w0 (intercept): {w0}")
print(f"  w1 (hours studied): {w1}")
print(f"  w2 (previous score): {w2}")
Understanding the Weights

Intercept (w0 = -4.0): This negative value means a student starts with a disadvantage. They need positive contributions from studying and previous scores to overcome this.

Hours Studied (w1 = 0.5): Each hour of study adds 0.5 to the linear combination. More study time increases the probability of passing.

Previous Score (w2 = 0.3): Each point in previous score adds 0.3. Past performance also helps predict success.
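To make this concrete, here is a small sketch using the same weights (the previous score of 10 is an arbitrary choice to keep z in the steep region of the curve). Each extra hour always adds exactly w1 = 0.5 to z, but the resulting change in probability depends on where we sit on the sigmoid:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

w0, w1, w2 = -4.0, 0.5, 0.3  # weights from the example above

score = 10  # arbitrary low score, chosen so z stays near the steep region
for hours in [4, 5, 6]:
    z = w0 + w1 * hours + w2 * score
    print(f"hours={hours}: z={z:.1f}, P(pass)={sigmoid(z):.4f}")
```

Each hour adds the same 0.5 to z, yet the probability gains shrink as z grows: the curve is steepest near z = 0 and flattens as confidence increases.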

Step 3: Predict for Student 1 (Good Student)

Let's predict whether a student who studied 8 hours and had a previous score of 70 will pass.

# Student 1: studied 8 hours, previous score 70
hours_studied = 8
previous_score = 70

# Step A: Calculate linear combination
z = w0 + w1*hours_studied + w2*previous_score
print(f"Linear combination: {w0} + {w1}*{hours_studied} + {w2}*{previous_score}")
print(f"z = {z}")

# Step B: Apply sigmoid
probability = sigmoid(z)
print(f"Probability = sigmoid({z}) = {probability:.4f}")

# Step C: Make prediction
prediction = "PASS" if probability >= 0.5 else "FAIL"
print(f"Prediction: {prediction} ({probability*100:.1f}% confidence)")
How did we get this result?

Step A - Linear Combination: We calculate $z = -4.0 + 0.5 \times 8 + 0.3 \times 70 = -4.0 + 4.0 + 21.0 = 21.0$

Step B - Apply Sigmoid: We plug z=21 into the sigmoid. Since 21 is a large positive number, sigmoid(21) is very close to 1, approximately 0.9999.


Step C - Prediction: Since 0.9999 is greater than 0.5, we predict PASS. The model is 99.99% confident this student will pass!

Step 4: Predict for Student 2 (Struggling Student)

Now let's predict for a student who only studied 2 hours and had a previous score of 40.

# Student 2: studied 2 hours, previous score 40
hours_studied = 2
previous_score = 40

# Calculate linear combination
z = w0 + w1*hours_studied + w2*previous_score
print(f"Linear combination: {w0} + {w1}*{hours_studied} + {w2}*{previous_score}")
print(f"z = {z}")

# Apply sigmoid
probability = sigmoid(z)
print(f"Probability = sigmoid({z}) = {probability:.4f}")

# Make prediction
prediction = "PASS" if probability >= 0.5 else "FAIL"
print(f"Prediction: {prediction} ({probability*100:.1f}% confidence)")
Does this student fail?

Linear Combination: $z = -4.0 + 0.5 \times 2 + 0.3 \times 40 = -4.0 + 1.0 + 12.0 = 9.0$

Wait, z=9 is positive! Yes, but let's see: sigmoid(9) = 0.9999. Actually, this student would also pass!


The Key Insight: The linear combination needs to be negative for the sigmoid to output less than 0.5. With w2=0.3, even a modest previous score contributes significantly. This shows why choosing the right weights during training is crucial.

Student 1
  • Hours Studied: 8
  • Previous Score: 70
  • Linear Combination (z): 21.0
  • Probability: 99.99%
  • Prediction: PASS
Student 2
  • Hours Studied: 2
  • Previous Score: 40
  • Linear Combination (z): 9.0
  • Probability: 99.99%
  • Prediction: PASS
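A short sketch (using the same illustrative weights) makes the key insight concrete: with a previous score of 40, z = -4 + 0.5*hours + 12 stays positive no matter how little the student studies. A FAIL prediction requires z < 0, i.e. 0.5*hours + 0.3*score < 4:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

w0, w1, w2 = -4.0, 0.5, 0.3

# With previous_score = 40: z = -4 + 0.5*hours + 12 >= 8, positive even at 0 hours
print("score=40, hours=0 ->", round(sigmoid(w0 + w2 * 40), 4))

# FAIL needs z < 0; e.g. a low previous score of 10 and no studying gives z = -1
print("score=10, hours=0 ->", round(sigmoid(w0 + w2 * 10), 4))
```

With these weights, only students with very weak prior scores can land below the 0.5 boundary; training on real data would produce weights that separate the classes more sensibly.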

Understanding the Sigmoid Mathematically

Let's break down the sigmoid function to understand why it has its special S-shape. The key is the exponential term $e^{-z}$. We'll visualize each component step by step.

What You'll Learn

By the end of this section, you'll understand:

  • What each part of the sigmoid formula does
  • Why the curve has its characteristic S-shape
  • How inputs get transformed into probabilities
  • Why sigmoid is perfect for classification problems
Step 1: Set Up the Functions

Before we can visualize anything, we need to import our tools and define our functions. Think of this as gathering your ingredients before cooking.

# Import libraries
import numpy as np
import matplotlib.pyplot as plt
What are these libraries?

NumPy (np): A library for working with numbers and arrays. It lets us do math on many numbers at once, which is much faster than using regular Python loops.

Matplotlib (plt): A library for creating charts and graphs. We use it to visualize our data and understand patterns visually.

Now let's define our two functions. We'll create them separately so we can examine each piece:

# Define sigmoid function
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Define just the exponential component
def exp_term(z):
    return np.exp(-z)
Understanding the Building Blocks

exp_term(z): This function calculates $e^{-z}$. The letter "e" is a special mathematical constant (approximately 2.71828), and we're raising it to the power of negative z. This is called the "exponential component" of sigmoid.

sigmoid(z): This is our complete sigmoid function. It takes the exponential term, adds 1 to it, and then divides 1 by the whole thing. The formula is: $\sigma(z) = \frac{1}{1 + e^{-z}}$


Why two functions? By separating them, we can visualize each piece and understand how they combine to create the S-curve. It's like understanding how flour, eggs, and sugar combine to make a cake!

Step 2: Generate Input Values

To draw a smooth curve, we need many points. Instead of typing each number manually, we use NumPy's linspace function to generate them automatically.

# Generate 100 points from -5 to 5
z = np.linspace(-5, 5, 100)

print(f"First 5 values: {z[:5]}")
print(f"Last 5 values: {z[-5:]}")
What does linspace do?

np.linspace(-5, 5, 100) creates exactly 100 numbers, evenly spaced between -5 and 5.

It's like marking 100 equally-spaced tick marks on a ruler from -5 to 5. The first mark is at -5, the last is at 5, and all the others are spread evenly in between.

Why 100 points? More points = smoother curve. 100 is usually enough to look smooth without being slow to compute.

Step 3: Visualize the Exponential Component

The exponential term $e^{-z}$ is the secret ingredient that gives sigmoid its special properties. Let's see what it looks like when we plot it.

# Plot the exponential component
plt.figure(figsize=(8, 5))
plt.plot(z, exp_term(z), 'r-', linewidth=2)
plt.title('Component: exp(-z)', fontsize=14)
plt.xlabel('z', fontsize=12)
plt.ylabel('exp(-z)', fontsize=12)
plt.grid(True, alpha=0.3)
plt.ylim(0, 150)
plt.show()
Understanding the Plotting Code

plt.figure(figsize=(8, 5)): Creates a new figure that's 8 inches wide and 5 inches tall

plt.plot(z, exp_term(z), 'r-', linewidth=2): Draws a red solid line ('r-') connecting all our points. The linewidth=2 makes it thicker and easier to see.

plt.grid(True, alpha=0.3): Adds a light grid in the background to help read values. alpha=0.3 makes it semi-transparent.

What do we see in the graph?

The graph shows a curve that shoots up dramatically on the left and flattens out on the right. Here's why:

When z is negative (left side of graph): The negative sign in $e^{-z}$ flips the sign, so $e^{-(-5)} = e^{5}$. This is a positive exponent, and $e^{5} \approx 148$ - a huge number!

When z is zero (middle): $e^{-0} = e^{0} = 1$. Any number raised to power 0 equals 1.

When z is positive (right side): $e^{-5} \approx 0.007$ - a tiny number close to zero. The larger z gets, the closer this value gets to 0.

Step 4: Visualize the Denominator

Now we add 1 to the exponential term. This might seem like a small change, but it's actually very important! This sum becomes the denominator of our sigmoid formula: $1 + e^{-z}$.

# Plot the denominator
plt.figure(figsize=(8, 5))
plt.plot(z, 1 + exp_term(z), 'g-', linewidth=2)
plt.title('Component: 1 + exp(-z)', fontsize=14)
plt.xlabel('z', fontsize=12)
plt.ylabel('1 + exp(-z)', fontsize=12)
plt.grid(True, alpha=0.3)
plt.show()
Why do we add 1? This is the key insight!

Adding 1 guarantees that the denominator is always greater than 1. Since we're going to divide 1 by this denominator, this ensures our final answer is always less than 1.

Let's trace through what happens:

When z = -5 (very negative):

  • Exponential: $e^{5} \approx 148$
  • Denominator: $1 + 148 = 149$
  • Final sigmoid: $\frac{1}{149} \approx 0.007$ (close to 0!)

When z = 5 (very positive):

  • Exponential: $e^{-5} \approx 0.007$
  • Denominator: $1 + 0.007 = 1.007$
  • Final sigmoid: $\frac{1}{1.007} \approx 0.993$ (close to 1!)
Step 5: The Final Sigmoid Curve

Now for the grand finale! We divide 1 by the denominator to get the sigmoid. This creates the famous S-shaped curve that makes logistic regression work.

# Plot the final sigmoid
plt.figure(figsize=(8, 5))
plt.plot(z, sigmoid(z), 'b-', linewidth=2)
plt.axhline(y=0.5, color='r', linestyle='--', alpha=0.5)
plt.axvline(x=0, color='gray', linestyle='-', alpha=0.3)
plt.title('Result: σ(z) = 1 / (1 + exp(-z))', fontsize=14)
plt.xlabel('z', fontsize=12)
plt.ylabel('σ(z)', fontsize=12)
plt.grid(True, alpha=0.3)
plt.ylim(-0.1, 1.1)
plt.show()
What do the extra lines mean?

plt.axhline(y=0.5, ...): Draws a horizontal line at y=0.5. This is our decision boundary - the point where the model is 50% uncertain.

plt.axvline(x=0, ...): Draws a vertical line at x=0. This helps us see that when z=0, the sigmoid output is exactly 0.5.

The Beautiful S-Curve Explained

Look at the shape - it looks like a stretched letter "S" lying on its side! Here's what's happening at each part:

Left flat region (z < -3): Output is nearly 0. The model is very confident this is NOT class 1. No matter how much more negative z gets, the output stays near 0.

Steep middle region (-3 < z < 3): This is where the action happens! Small changes in z cause big changes in probability. This is the "decision zone" where the model is less certain.

Right flat region (z > 3): Output is nearly 1. The model is very confident this IS class 1. No matter how much more positive z gets, the output stays near 1.


Key insight: The curve never actually reaches 0 or 1 - it just gets infinitely close. This is mathematically convenient and reflects real-world uncertainty: we can never be 100% certain about predictions!
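We can check this numerically, at least up to the limits of floating-point precision:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# The output keeps gaining 9s but stays strictly below 1
for z in [10, 20, 30]:
    print(f"sigmoid({z}) = {sigmoid(z):.15f}")

print(sigmoid(30) < 1.0)  # still strictly below 1 in double precision
```

One caveat: for very large z (roughly z > 37 in 64-bit floats), the tiny $e^{-z}$ term rounds away entirely and the computed value becomes exactly 1.0, even though mathematically the sigmoid never reaches it.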

Step 6: See the Transformation in Action

Let's see exactly how different input values get transformed into probabilities. This will help you build intuition about what the sigmoid is doing.

# Show the transformation
print("Linear output -> Sigmoid output (probability)")
print("-" * 40)

for linear_output in [-10, -5, 0, 5, 10]:
    prob = sigmoid(linear_output)
    print(f"  z = {linear_output:3} -> σ(z) = {prob:.6f}")
Understanding the Code

for linear_output in [-10, -5, 0, 5, 10]: We loop through 5 test values, from very negative to very positive.

{prob:.6f}: This formats the probability to show 6 decimal places, so we can see how close values get to 0 or 1.

Here are the results in a nice table:

Input (z) | Sigmoid Output | As Percentage | Interpretation
----------|----------------|---------------|------------------------------------------
   -10    | 0.000045       | 0.0045%       | Almost certainly Class 0
    -5    | 0.006693       | 0.67%         | Very likely Class 0
     0    | 0.500000       | 50%           | Completely uncertain (decision boundary)
     5    | 0.993307       | 99.33%        | Very likely Class 1
    10    | 0.999955       | 99.9955%      | Almost certainly Class 1
Key Observations for Beginners

1. Notice the symmetry: z = -5 gives 0.67%, and z = +5 gives 99.33%. These add up to 100%! This is because sigmoid has a special property: $\sigma(-z) = 1 - \sigma(z)$

2. The decision point: When z = 0, we get exactly 50%. This is why z = 0 is called the "decision boundary" - it's where we're equally unsure about both classes.

3. Extreme values saturate: At z = 10, we're already at 99.9955%. Going to z = 100 wouldn't add much more confidence. The curve "saturates" at the extremes.

Key Property: The sigmoid function is symmetric around the point (0, 0.5). This means σ(-z) = 1 - σ(z). This symmetry is useful for mathematical derivations in Logistic Regression.
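The symmetry property follows in a few steps by multiplying the numerator and denominator by $e^{-z}$:

$$\sigma(-z) = \frac{1}{1 + e^{z}} = \frac{e^{-z}}{e^{-z} + 1} = \frac{(1 + e^{-z}) - 1}{1 + e^{-z}} = 1 - \frac{1}{1 + e^{-z}} = 1 - \sigma(z)$$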

Practice Questions: Sigmoid Function

Test your understanding with these coding challenges.

Task: Calculate σ(0) by hand using the formula σ(z) = 1 / (1 + e^-z), then verify with Python.

Expected output: 0.5

Show Solution
import math

# σ(0) = 1 / (1 + e^-0)
#      = 1 / (1 + e^0)
#      = 1 / (1 + 1)
#      = 1 / 2
#      = 0.5

z = 0
sigma = 1 / (1 + math.exp(-z))
print(f"σ(0) = {sigma}")  # 0.5

Given:

test_values = [-5, -2, 0, 2, 5]

Task: Implement sigmoid function and verify that σ(-z) = 1 - σ(z) for each test value.

Show Solution
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

test_values = [-5, -2, 0, 2, 5]

print("Testing σ(-z) = 1 - σ(z):")
for z in test_values:
    sig_z = sigmoid(z)
    sig_neg_z = sigmoid(-z)
    one_minus = 1 - sig_z
    match = np.isclose(sig_neg_z, one_minus)
    print(f"z={z:2}: σ(-z)={sig_neg_z:.4f}, 1-σ(z)={one_minus:.4f}, Match: {match}")

Task: The derivative of sigmoid is σ'(z) = σ(z) * (1 - σ(z)). Implement both sigmoid and its derivative, then find the maximum value of σ'(z).

Expected: Maximum derivative is 0.25 at z=0

Show Solution
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(z):
    sig = sigmoid(z)
    return sig * (1 - sig)

z = np.linspace(-5, 5, 100)

plt.figure(figsize=(10, 5))
plt.plot(z, sigmoid(z), 'b-', label='σ(z)')
plt.plot(z, sigmoid_derivative(z), 'r-', label="σ'(z)")
plt.axhline(y=0.25, color='g', linestyle='--', alpha=0.5)
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print(f"Maximum derivative: σ'(0) = {sigmoid_derivative(0):.4f}")
03

Decision Boundaries

While Logistic Regression outputs probabilities, we often need to make a definitive class prediction: yes or no, spam or not spam, disease or healthy. This is where the decision boundary comes in. The decision boundary is a threshold that separates the two classes in our classification problem. Understanding how to set and interpret this boundary is crucial for building effective classifiers.

The 0.5 Probability Threshold

Default Decision Boundary: 0.5

By default, Logistic Regression uses a decision boundary of 0.5. This means:

Positive Class

If P(y=1|x) ≥ 0.5, predict class 1

Negative Class

If P(y=1|x) < 0.5, predict class 0

Why 0.5? This threshold is natural because it's the equilibrium point of the sigmoid function - where both classes are equally likely. However, this default isn't always optimal, especially for imbalanced datasets or when different misclassification costs apply.
# Understanding the 0.5 Threshold: Step by Step
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Simulate predictions for 10 samples
np.random.seed(42)
probabilities = np.random.uniform(0, 1, 10)

print("Decision Boundary at 0.5:")
print("="*50)
print(f"{'Sample':<10}{'Probability':<15}{'Prediction':<12}{'Reason'}")
print("-"*50)

for i, prob in enumerate(probabilities):
    if prob >= 0.5:
        prediction = "Class 1"
        reason = f"{prob:.3f} >= 0.5"
    else:
        prediction = "Class 0"
        reason = f"{prob:.3f} < 0.5"
    
    print(f"{i+1:<10}{prob:<15.4f}{prediction:<12}{reason}")

# Summary
class_1_count = sum(1 for p in probabilities if p >= 0.5)
class_0_count = len(probabilities) - class_1_count
print("-"*50)
print(f"Total Class 0: {class_0_count}, Class 1: {class_1_count}")
Understanding the Code: Step-by-Step Breakdown

This code demonstrates exactly how logistic regression makes the final decision to classify something as Class 0 or Class 1. Let's break it down piece by piece:

1. The Sigmoid Function

def sigmoid(z): return 1 / (1 + np.exp(-z))

This is the mathematical function that converts any number (positive, negative, large, or small) into a probability between 0 and 1. For example, if z = 2, sigmoid gives us 0.88 (88% probability). If z = -2, we get 0.12 (12% probability). This ensures our predictions are always valid probabilities.

2. Generating Sample Probabilities

probabilities = np.random.uniform(0, 1, 10)

Here we create 10 random probability values between 0 and 1 to simulate what a trained logistic regression model might output for 10 different samples. In real scenarios, these would come from your model after it processes the input features. For example, for an email, it might output 0.85 (85% probability it's spam).

3. Applying the Decision Rule

if prob >= 0.5: prediction = "Class 1"

This is the critical step where we convert probabilities to actual predictions. The rule is simple: if the probability is 0.5 or higher, we predict Class 1; otherwise, we predict Class 0. So a probability of 0.51 becomes "Class 1", while 0.49 becomes "Class 0". This is called the decision threshold.

4. Displaying Results

The code prints a formatted table showing each sample's probability, what class it was assigned to, and the mathematical reason (comparison with 0.5). This helps you see exactly how the threshold works for each case.

5. Summary Statistics

class_1_count = sum(1 for p in probabilities if p >= 0.5)

Finally, we count how many samples ended up in each class. This gives you a quick overview: out of 10 samples, maybe 6 were classified as Class 1 and 4 as Class 0.

Why This Matters for Beginners:

Understanding this threshold is crucial because it directly affects your model's behavior. Consider these real-world examples:

  • Email Spam Filter: A probability of 0.49 (49% spam) would be classified as "Not Spam" using the 0.5 threshold. But is that safe? If catching spam matters most, you might lower the threshold to 0.3, accepting that more legitimate email will be flagged.
  • Disease Detection: A probability of 0.51 (51% disease) would be classified as "Disease Present". But in medical scenarios, you might want a higher threshold like 0.8 to avoid false alarms that could worry patients unnecessarily.
  • Credit Approval: The 0.5 threshold might approve too many risky loans. Banks might use 0.7 or higher to be more conservative.

Key Takeaway: The 0.5 threshold is just the default starting point. As you gain experience, you'll learn to adjust it based on your specific problem, the costs of different types of errors, and your business requirements. The code above shows you exactly how this threshold affects every single prediction your model makes.

Decision Boundary (1D Case)

Predicted class = 1 if σ(z) >= threshold, else 0
where z = w₀ + w₁x is the linear combination
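In the 1D case the boundary is a single point where z crosses zero, namely x = -w₀/w₁. A quick sketch with assumed example weights (w0 = -2.0, w1 = 0.8, not taken from any trained model):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Assumed example parameters for a 1D model
w0, w1 = -2.0, 0.8

# Boundary: z = w0 + w1*x = 0  ->  x = -w0/w1
x_boundary = -w0 / w1
print(f"Decision boundary at x = {x_boundary:.2f}")   # x = 2.50

# At the boundary, sigmoid(z) is exactly 0.5
print(f"sigmoid at boundary: {sigmoid(w0 + w1 * x_boundary):.2f}")  # 0.50

# Points on either side of the boundary fall into different classes
for x in [1.0, 2.5, 4.0]:
    p = sigmoid(w0 + w1 * x)
    print(f"x={x}: p={p:.3f} -> class {int(p >= 0.5)}")
```

Everything left of x = 2.5 is predicted Class 0, everything to the right Class 1.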

Geometric Interpretation

In higher dimensions, the decision boundary is a line or hyperplane that separates the feature space into regions for each class. Let's visualize this with a 2D example.

Step 1: Define Model Parameters
import numpy as np

# Trained model parameters (example)
w0 = -3.0    # Intercept (bias)
w1 = 1.5     # Weight for feature x1
w2 = 2.0     # Weight for feature x2

We start by defining the parameters that our logistic regression model has learned during training. w0 is the intercept (like the y-intercept in a line equation). w1 and w2 are the weights that tell us how important each feature is. In this example, feature x2 (weight = 2.0) has more influence than feature x1 (weight = 1.5) on the prediction.

Step 2: Understand the Linear Combination
print("Linear combination: z = w0 + w1*x1 + w2*x2")
print(f"z = {w0} + {w1}*x1 + {w2}*x2")

# Output: z = -3.0 + 1.5*x1 + 2.0*x2

The linear combination z is the raw output before we apply the sigmoid function. It's calculated by multiplying each feature by its weight, adding them together, and adding the intercept. For example, if x1=2 and x2=1, then z = -3.0 + 1.5(2) + 2.0(1) = -3.0 + 3.0 + 2.0 = 2.0. This z value will then be passed through the sigmoid function to get a probability.
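Running that worked example end to end, with the parameters defined above:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Parameters and sample point from the example above
w0, w1, w2 = -3.0, 1.5, 2.0
x1, x2 = 2, 1

z = w0 + w1 * x1 + w2 * x2
p = sigmoid(z)
print(f"z = {z}")                 # z = 2.0
print(f"sigmoid(z) = {p:.3f}")    # 0.881 -> predict Class 1
```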

Step 3: Find Where the Decision Boundary Occurs
# Decision boundary occurs where sigmoid(z) = 0.5
# This happens when z = 0
print("Decision boundary is where z = 0 (sigmoid = 0.5)")
print(f"{w0} + {w1}*x1 + {w2}*x2 = 0")

# Output: -3.0 + 1.5*x1 + 2.0*x2 = 0

The decision boundary is the special line where the probability equals exactly 0.5. Mathematically, sigmoid(0) = 0.5, so the boundary occurs where z = 0. Any point on this line will have a 50-50 chance of being in either class. Points on one side of the line will have z > 0 (probability > 0.5, predict Class 1), and points on the other side will have z < 0 (probability < 0.5, predict Class 0).

Step 4: Solve for the Boundary Line Equation
# Solve for x2 in terms of x1
print("Solving for x2:")
print(f"x2 = ({-w0} - {w1}*x1) / {w2}")
print(f"x2 = {-w0/w2:.2f} - {w1/w2:.2f}*x1")

# Output: x2 = 1.50 - 0.75*x1

By rearranging the equation -3.0 + 1.5*x1 + 2.0*x2 = 0, we solve for x2 to get the standard form of a line: x2 = 1.50 - 0.75*x1. This is just like y = mx + b in algebra! The slope is -0.75 and the intercept on the x2 axis is 1.50. This equation tells us: for any value of x1, we can calculate the corresponding x2 value that lies exactly on the decision boundary. This line divides our 2D feature space into two regions.

Step 5: Verify with Actual Points
# Verify with some points on the boundary
print("Points on the decision boundary:")
for x1 in [0, 1, 2, 3]:
    x2 = (-w0 - w1*x1) / w2
    z = w0 + w1*x1 + w2*x2
    print(f"  x1={x1}: x2={x2:.2f} -> z={z:.4f}")

# Output:
#   x1=0: x2=1.50 -> z=0.0000
#   x1=1: x2=0.75 -> z=0.0000
#   x1=2: x2=0.00 -> z=0.0000
#   x1=3: x2=-0.75 -> z=0.0000

We test several points to confirm they lie on the boundary. For each x1 value (0, 1, 2, 3), we calculate the corresponding x2 using our boundary equation, then verify that z = 0 for all these points. Notice that z is always 0.0000 (or very close), confirming these points are on the boundary.

Real-world meaning: If you had a customer with features (x1=2, x2=0.00), the model would give exactly 50% probability. If x2 is higher than 0.00 (above the line), the model predicts Class 1. If x2 is lower (below the line), it predicts Class 0. This geometric visualization helps you understand how your model makes decisions based on feature values.
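We can confirm this behavior numerically with the same weights, checking a point above, on, and below the boundary at x1 = 2:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Same hand-picked weights as in the example above
w0, w1, w2 = -3.0, 1.5, 2.0

# At x1 = 2, the boundary sits at x2 = 0.00
for x2 in [0.5, 0.0, -0.5]:   # above, on, and below the line
    z = w0 + w1 * 2 + w2 * x2
    p = sigmoid(z)
    side = "above" if x2 > 0 else "on" if x2 == 0 else "below"
    # note: p = 0.5 exactly on the line; >= 0.5 rounds to Class 1
    print(f"x2={x2:+.1f} ({side} the line): p={p:.3f} -> Class {int(p >= 0.5)}")
```

Above the line the probability exceeds 0.5 (Class 1); below it, the probability drops under 0.5 (Class 0).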

Step 6: Visualize the Decision Boundary
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_blobs

# Create synthetic 2D dataset
np.random.seed(42)
X, y = make_blobs(n_samples=100, n_features=2, 
                  centers=2, random_state=42)

# Train model
model = LogisticRegression()
model.fit(X, y)

We create a synthetic dataset with 100 samples and 2 features using make_blobs, which generates clusters of points (like two groups of customers with different characteristics). Then we train a logistic regression model on this data. The model will learn the optimal weights to separate these two clusters.

Step 7: Create a Mesh Grid for Visualization
# Create mesh grid covering the feature space
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                      np.linspace(y_min, y_max, 100))

# Get probability predictions for every point
Z = model.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
Z = Z.reshape(xx.shape)

To visualize the decision boundary, we need to check the model's prediction at many points across the entire feature space. meshgrid creates a grid of 10,000 points (100×100) covering the range of our data. For each point, we calculate the probability of Class 1. This creates a "probability map" where we can see how predictions change across the space. Areas with high probabilities will appear in one color, low probabilities in another, and the boundary (0.5) will be the line between them.

Step 8: Plot Everything Together
# Create the plot
fig, ax = plt.subplots(figsize=(10, 8))

# Color regions by predicted class
ax.contourf(xx, yy, Z, levels=[0, 0.5, 1], 
           colors=['lightblue', 'lightcoral'], alpha=0.6)

# Draw decision boundary (0.5 probability line)
ax.contour(xx, yy, Z, levels=[0.5], 
          colors='black', linewidths=2)

# Add probability contour lines
contours = ax.contour(xx, yy, Z, 
                     levels=[0.1, 0.3, 0.7, 0.9], 
                     colors='gray', alpha=0.4)
ax.clabel(contours, inline=True, fontsize=8)

# Plot actual data points
ax.scatter(X[y==0, 0], X[y==0, 1], c='blue', 
          marker='o', s=100, label='Class 0')
ax.scatter(X[y==1, 0], X[y==1, 1], c='red', 
          marker='s', s=100, label='Class 1')

ax.set_xlabel('Feature 1', fontsize=12)
ax.set_ylabel('Feature 2', fontsize=12)
ax.set_title('Decision Boundary with Probability Contours', 
            fontsize=14)
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

This code creates a comprehensive visualization with multiple layers:

  • Background colors: Light blue for regions where the model predicts Class 0, light coral for Class 1 regions
  • Black line: The decision boundary (probability = 0.5) that separates the two classes
  • Gray contour lines: Show probability levels (0.1, 0.3, 0.7, 0.9). Points near these lines have those exact probabilities
  • Data points: Blue circles are actual Class 0 samples, red squares are Class 1 samples

How to read this plot: The further a point is from the decision boundary, the more confident the model is in its prediction. Points very close to the boundary (near the black line) have probabilities close to 0.5, meaning the model is uncertain. Points far from the boundary have probabilities close to 0 or 1, meaning the model is very confident.
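The distance-confidence relationship can be checked numerically. A small sketch using the hand-picked weights from earlier (not the model trained above): stepping away from a boundary point along the boundary's normal direction steadily raises the predicted probability.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Assumed weights; z grows in proportion to the signed distance from the boundary
w = np.array([1.5, 2.0])
b = -3.0

# Unit vector perpendicular to the boundary, and a point on the boundary (z = 0)
n = w / np.linalg.norm(w)
boundary_point = np.array([2.0, 0.0])

for d in [0.0, 0.5, 1.0, 2.0]:
    point = boundary_point + d * n   # d units away from the boundary
    p = sigmoid(w @ point + b)
    print(f"distance {d:.1f}: P(class 1) = {p:.3f}")
# probabilities rise from 0.500 toward 1.0 as distance grows
```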

Adjusting the Threshold

Different problems require different thresholds. A medical diagnostic test might need a lower threshold (more positive predictions, fewer false negatives) to catch all potential cases. A spam filter might prefer a higher threshold (fewer false positives) to avoid marking legitimate emails as spam.

Threshold Trade-off: Lowering the threshold catches more positive cases but increases false positives. Raising the threshold reduces false positives but misses more positive cases. This is the classic precision-recall trade-off.
Step 1: Setup the Scenario
import numpy as np

# Model probabilities for 10 patients (disease screening)
patient_probs = [0.15, 0.25, 0.35, 0.45, 0.55, 
                 0.65, 0.75, 0.85, 0.90, 0.95]

# Ground truth: who actually has the disease
actual_disease = [0, 0, 1, 0, 1, 1, 1, 1, 1, 1]

We simulate a disease detection scenario with 10 patients. The patient_probs list contains the model's predicted probability that each patient has the disease (ranging from 15% to 95%). The actual_disease list shows the ground truth: 0 means the patient is healthy, 1 means they actually have the disease. Notice that 7 of the 10 patients actually have the disease, but the model's probabilities vary: some diseased patients get low probabilities (patient 3 has a 35% probability but actually has the disease), while some healthy patients get higher ones (patient 4 has 45% but is healthy).

Step 2: Test Different Thresholds
print("Impact of Different Thresholds:")
print("="*60)

for threshold in [0.3, 0.5, 0.7]:
    # Apply threshold to convert probabilities to predictions
    predicted = [1 if p >= threshold else 0 
                 for p in patient_probs]
    
    print(f"\nThreshold: {threshold}")
    print(f"  Patients diagnosed: {sum(predicted)}")

We test three different thresholds: 0.3 (low), 0.5 (default), and 0.7 (high). For each threshold, we convert probabilities to binary predictions using the rule: if probability ≥ threshold, predict 1 (disease), else predict 0 (healthy). A lower threshold like 0.3 will diagnose more patients (catching more actual cases but also more false alarms), while a higher threshold like 0.7 will be more conservative (fewer diagnoses but might miss some actual cases).

Step 3: Calculate Performance Metrics
    # Count different types of outcomes
    tp = sum(1 for p, a in zip(predicted, actual_disease) 
             if p == 1 and a == 1)  # Correctly found disease
    
    fp = sum(1 for p, a in zip(predicted, actual_disease) 
             if p == 1 and a == 0)  # False alarm (healthy diagnosed as sick)
    
    fn = sum(1 for p, a in zip(predicted, actual_disease) 
             if p == 0 and a == 1)  # Missed case (sick patient missed)
    
    print(f"  True Positives: {tp} (correctly found disease)")
    print(f"  False Positives: {fp} (unnecessary worry)")
    print(f"  Missed Cases: {fn} (dangerous!)")

For each threshold, we calculate three critical metrics:

  • True Positives (TP): Patients who have disease AND were correctly diagnosed. This is what we want to maximize.
  • False Positives (FP): Healthy patients incorrectly diagnosed with disease. Causes unnecessary worry and treatment costs.
  • False Negatives (FN): Sick patients who were missed by the test. This is dangerous because they won't get treatment!

The threshold choice directly affects these numbers. Lower threshold → More TP but also more FP. Higher threshold → Fewer FP but more FN. In medical scenarios, missing a disease (FN) is often worse than a false alarm (FP), so lower thresholds are preferred.

Step 4: Interpret the Results
    if threshold == 0.3:
        print("  -> Low threshold: Catches more disease")
        print("     but more false alarms")
    elif threshold == 0.7:
        print("  -> High threshold: Fewer false alarms")
        print("     but may miss disease!")

# Example output:
# Threshold: 0.3 -> Diagnoses 8 patients, TP=7, FP=1, FN=0
# Threshold: 0.5 -> Diagnoses 6 patients, TP=6, FP=0, FN=1
# Threshold: 0.7 -> Diagnoses 4 patients, TP=4, FP=0, FN=3

Real-World Impact:

  • Threshold 0.3: Catches all 7 diseased patients but wrongly diagnoses 1 healthy person. Safe but causes 1 false alarm.
  • Threshold 0.5: Catches 6 diseased patients, no false alarms, but misses 1 sick patient. Balanced but risky.
  • Threshold 0.7: Only diagnoses 4 patients, misses 3 sick patients! Very dangerous despite no false alarms.

For disease detection, most doctors would choose threshold 0.3 to ensure no sick patients are missed, even if it means some healthy people get retested.

Step 5: Real Dataset Example - Load and Prepare
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load breast cancer dataset
# scikit-learn encodes 0 = malignant, 1 = benign, so flip the labels
# to match our convention of 1 = malignant (the positive class)
data = load_breast_cancer()
X, y = data.data[:, :2], 1 - data.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

Now we move to a real medical dataset: breast cancer detection. The dataset contains measurements from cell samples, and the target is whether the tumor is malignant (1) or benign (0). We use only the first 2 features for simplicity. We split the data into 70% training and 30% testing to evaluate how different thresholds perform on unseen data.

Step 6: Train Model and Get Probabilities
# Train logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Get probability predictions for test set
y_prob = model.predict_proba(X_test)[:, 1]

print(f"Model trained on {len(X_train)} samples")
print(f"Testing on {len(X_test)} samples")
print(f"Probability range: {y_prob.min():.3f} to {y_prob.max():.3f}")

After training, we get probability predictions for all test samples using predict_proba. The [:, 1] extracts probabilities for the positive class (malignant tumor). These probabilities range from near 0 (very confident it's benign) to near 1 (very confident it's malignant). We'll now test how different thresholds affect our diagnostic accuracy.

Step 7: Compare Threshold Performance
from sklearn.metrics import confusion_matrix

thresholds = [0.3, 0.5, 0.7]
print("Threshold Performance Comparison:")
print("-" * 60)

for threshold in thresholds:
    # Apply custom threshold
    y_pred = (y_prob >= threshold).astype(int)
    
    # Get confusion matrix
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    
    # Calculate metrics
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    
    print(f"\nThreshold: {threshold}")
    print(f"  TP={tp}, FP={fp}, FN={fn}, TN={tn}")
    print(f"  Precision: {precision:.3f}")
    print(f"  Recall: {recall:.3f}")

For each threshold, we evaluate the complete confusion matrix and calculate:

  • Precision: Of all patients we diagnosed as malignant, what percentage actually were? High precision = fewer false alarms.
  • Recall: Of all patients who actually had malignant tumors, what percentage did we catch? High recall = fewer missed cases.

You'll notice that as threshold increases, precision tends to go up (fewer false positives) but recall goes down (more false negatives). The optimal threshold depends on whether it's more important to avoid false alarms or to catch all cases. In cancer detection, high recall is usually prioritized because missing a malignant tumor is far more dangerous than a false positive that leads to additional testing.

Class Probability and Confidence

The closer a probability is to 0 or 1, the more confident the model is in its prediction. Probabilities near 0.5 indicate uncertainty. This information can be valuable for decision-making.

# Understanding Prediction Confidence
import numpy as np
from sklearn.linear_model import LogisticRegression

# Sample predictions with different confidence levels
probabilities = np.array([0.05, 0.15, 0.45, 0.51, 0.85, 0.95])
threshold = 0.5

print("Prediction Confidence Analysis:")
print("-" * 50)
for prob in probabilities:
    prediction = 1 if prob >= threshold else 0
    confidence = abs(prob - 0.5) * 2  # 0-1 scale: 0 = coin flip, 1 = certain

    print(f"Probability: {prob:.2f} -> Predict: {prediction}, Confidence: {confidence:.1%}")

This code demonstrates how to interpret model confidence from probability values. Let's break down what's happening:

The Probability Array:

We have 6 sample predictions ranging from very confident Class 0 (0.05) to very confident Class 1 (0.95). Notice probabilities 0.45 and 0.51 are close to the decision boundary (0.5), indicating uncertainty.

Confidence Calculation:

The formula confidence = abs(prob - 0.5) * 2 converts probability to a confidence score from 0% to 100%. Here's why this works:

  • Prob = 0.05: abs(0.05 - 0.5) * 2 = 0.9 → 90% confidence in Class 0
  • Prob = 0.50: abs(0.50 - 0.5) * 2 = 0.0 → 0% confidence (maximum uncertainty)
  • Prob = 0.95: abs(0.95 - 0.5) * 2 = 0.9 → 90% confidence in Class 1

Practical Interpretation:

  • High Confidence (prob near 0 or 1): Model is very sure. You can trust these predictions more. Example: prob=0.95 means "I'm 95% sure this is Class 1"
  • Low Confidence (prob near 0.5): Model is uncertain. May need human review or additional features. Example: prob=0.51 means "It's a coin flip, barely leaning toward Class 1"

Real-World Application: In fraud detection, you might automatically block high-confidence fraud cases (prob > 0.9), send low-confidence cases (0.4 < prob < 0.6) for manual review, and automatically approve high-confidence legitimate transactions (prob < 0.1). This three-tier system uses confidence to make smarter decisions than a simple yes/no classification.
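A minimal sketch of such a triage policy (the function name and threshold values are illustrative choices from the example above, not a standard API):

```python
def triage(prob, low=0.1, review_band=(0.4, 0.6), high=0.9):
    """Route a fraud probability to an action (assumed example thresholds)."""
    if prob > high:
        return "block"            # high-confidence fraud
    if review_band[0] < prob < review_band[1]:
        return "manual review"    # model is uncertain
    if prob < low:
        return "approve"          # high-confidence legitimate
    return "standard checks"      # everything in between

for p in [0.02, 0.50, 0.95, 0.25]:
    print(f"prob={p:.2f} -> {triage(p)}")
```

In practice the band boundaries would be tuned on validation data against the costs of each action.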

Practice Questions: Decision Boundaries

Test your understanding with these coding challenges.

Given:

probabilities = [0.2, 0.5, 0.7, 0.49]
threshold = 0.5

Task: Create predictions list using the threshold. Expected output: [0, 1, 1, 0]

Show Solution
probabilities = [0.2, 0.5, 0.7, 0.49]
threshold = 0.5

predictions = [1 if p >= threshold else 0 for p in probabilities]
print(f"Probabilities: {probabilities}")
print(f"Predictions:   {predictions}")  # [0, 1, 1, 0]

Given:

import numpy as np
from sklearn.metrics import confusion_matrix

y_prob = np.array([0.1, 0.3, 0.4, 0.6, 0.7, 0.9])
y_true = np.array([0, 0, 0, 1, 1, 1])

Task: Test thresholds [0.3, 0.5, 0.7] and print accuracy, TP, FP, FN, TN for each.

Show Solution
import numpy as np
from sklearn.metrics import confusion_matrix

y_prob = np.array([0.1, 0.3, 0.4, 0.6, 0.7, 0.9])
y_true = np.array([0, 0, 0, 1, 1, 1])

for threshold in [0.3, 0.5, 0.7]:
    y_pred = (y_prob >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    accuracy = (tp + tn) / len(y_true)
    print(f"Threshold {threshold}: Acc={accuracy:.1%}, TP={tp}, FP={fp}, FN={fn}, TN={tn}")

Task: Train a LogisticRegression model, generate precision-recall curve, and find the threshold that maximizes F1-score.

Show Solution
import numpy as np
from sklearn.metrics import precision_recall_curve
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=5, random_state=42)

model = LogisticRegression()
model.fit(X, y)
y_prob = model.predict_proba(X)[:, 1]

precision, recall, thresholds = precision_recall_curve(y, y_prob)
f1_scores = 2 * (precision[:-1] * recall[:-1]) / (precision[:-1] + recall[:-1] + 1e-10)
best_idx = np.argmax(f1_scores)
best_threshold = thresholds[best_idx]

print(f"Optimal threshold: {best_threshold:.4f}")
print(f"Best F1-score: {f1_scores[best_idx]:.4f}")
04

Model Training & Evaluation

Now that we understand the theory behind Logistic Regression, it's time to build real models. In this section, we'll learn how to train Logistic Regression models using scikit-learn, understand the optimization process, evaluate model performance, and interpret results. You'll go from theory to practical implementation that you can use immediately.

Cost Function: Log Loss

To train a Logistic Regression model, we need to minimize a cost function. Linear Regression uses Mean Squared Error (MSE), but Logistic Regression uses Log Loss (also called binary cross-entropy). This cost function penalizes confident wrong predictions heavily while rewarding confident correct predictions.

Log Loss (Binary Cross-Entropy)

$$J(w) = -\frac{1}{m} \sum_{i=1}^{m} [y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)})]$$

where m is the number of samples, y is the true label (0 or 1), and ŷ is the predicted probability.
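As a sanity check, the formula can be implemented in a single vectorized expression and compared against scikit-learn's log_loss (the labels and probabilities below are made-up examples):

```python
import numpy as np
from sklearn.metrics import log_loss

def binary_cross_entropy(y, y_hat, eps=1e-15):
    """J(w) from the formula above, averaged over m samples."""
    y_hat = np.clip(y_hat, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y_true = np.array([1, 0, 1, 1, 0])
y_prob = np.array([0.9, 0.1, 0.8, 0.3, 0.2])

ours = binary_cross_entropy(y_true, y_prob)
theirs = log_loss(y_true, y_prob)
print(f"Our log loss:     {ours:.6f}")
print(f"sklearn log_loss: {theirs:.6f}")  # the two should agree
```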

Elegant Interpretation of Log Loss

This formula has an elegant interpretation based on the true label:

When y = 1 (Positive Class)

Cost = -log(ŷ)

  • Prediction close to 1: Cost ≈ 0. Model is confident and correct → low penalty
  • Prediction close to 0: Cost → ∞. Model is confident but wrong → huge penalty!

When y = 0 (Negative Class)

Cost = -log(1 - ŷ)

  • Prediction close to 0: Cost ≈ 0. Model is confident and correct → low penalty
  • Prediction close to 1: Cost → ∞. Model is confident but wrong → huge penalty!

Why This Is Brilliant: The logarithm naturally creates an asymmetric penalty. Being slightly wrong (ŷ=0.4 when y=1) has moderate cost, but being confidently wrong (ŷ=0.01 when y=1) results in massive cost. This forces the model to be both accurate and appropriately confident.

# Understanding Log Loss: Why It Works
import numpy as np

def single_sample_loss(y_true, y_pred):
    """Calculate loss for a single sample"""
    epsilon = 1e-15  # Prevent log(0)
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    
    if y_true == 1:
        return -np.log(y_pred)
    else:
        return -np.log(1 - y_pred)

print("Log Loss Explained with Examples:")
print("="*60)

# Case 1: True label is 1 (positive class)
print("\nWhen TRUE LABEL = 1 (positive class):")
print("-"*40)
for pred in [0.99, 0.8, 0.5, 0.2, 0.01]:
    loss = single_sample_loss(1, pred)
    quality = "Excellent!" if loss < 0.1 else "Good" if loss < 0.5 else "Poor" if loss < 1 else "Terrible!"
    print(f"  Predict {pred:.2f} -> Loss: {loss:.4f} ({quality})")

# Case 2: True label is 0 (negative class)
print("\nWhen TRUE LABEL = 0 (negative class):")
print("-"*40)
for pred in [0.01, 0.2, 0.5, 0.8, 0.99]:
    loss = single_sample_loss(0, pred)
    quality = "Excellent!" if loss < 0.1 else "Good" if loss < 0.5 else "Poor" if loss < 1 else "Terrible!"
    print(f"  Predict {pred:.2f} -> Loss: {loss:.4f} ({quality})")

print("\nKey insight: Confident WRONG predictions are penalized heavily!")
Understanding the Code: Log Loss in Action

This code demonstrates exactly how log loss penalizes predictions based on how wrong they are. Let's break it down:

The Loss Function

The single_sample_loss function implements the two cases of log loss:

  • If y_true = 1: Returns -log(y_pred). The closer the prediction to 1, the smaller the loss.
  • If y_true = 0: Returns -log(1 - y_pred). The closer the prediction to 0, the smaller the loss.

Epsilon clipping: We clip predictions into the range [1e-15, 1 - 1e-15] to prevent log(0), which is undefined (the loss would be infinite).

Case 1: When True Label = 1

Testing predictions from 0.99 (very confident correct) to 0.01 (very confident wrong):

  • Predict 0.99: Loss ≈ 0.01 (Excellent!) - Nearly perfect prediction, minimal penalty
  • Predict 0.80: Loss ≈ 0.22 (Good) - Confident and correct, small penalty
  • Predict 0.50: Loss ≈ 0.69 (Poor) - Uncertain prediction, moderate penalty
  • Predict 0.20: Loss ≈ 1.61 (Terrible!) - Wrong direction, large penalty
  • Predict 0.01: Loss ≈ 4.61 (Terrible!) - Confidently wrong, massive penalty!

Notice how the loss grows faster and faster as the prediction moves away from the true value, blowing up as it approaches the wrong extreme.

Case 2: When True Label = 0

Testing predictions from 0.01 (very confident correct) to 0.99 (very confident wrong):

  • Predict 0.01: Loss ≈ 0.01 (Excellent!) - Correctly says "almost certainly 0"
  • Predict 0.20: Loss ≈ 0.22 (Good) - Leaning toward 0, small penalty
  • Predict 0.50: Loss ≈ 0.69 (Poor) - Uncertain, moderate penalty
  • Predict 0.80: Loss ≈ 1.61 (Terrible!) - Leaning wrong way, large penalty
  • Predict 0.99: Loss ≈ 4.61 (Terrible!) - Confidently wrong, massive penalty!

The Key Insight:

Log loss is asymmetric. Look at the difference:

  • Predicting 0.8 when truth is 1: Loss = 0.22 (tolerable)
  • Predicting 0.2 when truth is 1: Loss = 1.61 (7× worse!)
  • Predicting 0.01 when truth is 1: Loss = 4.61 (20× worse!!)

This exponential penalty prevents the model from making overconfident mistakes. It's better to be uncertain (predict 0.5) than to be confidently wrong (predict 0.01 when answer is 1). This is why log loss works so well for training classification models - it naturally encourages both accuracy and appropriate confidence levels.

Gradient Descent Optimization

Logistic Regression uses gradient descent to minimize the log loss. Gradient descent is an iterative algorithm that takes small steps toward the minimum of the cost function. The direction of each step is determined by the gradient (derivative) of the cost function.

import numpy as np

print("Gradient Descent - Mountain Analogy:")
print("="*50)
# Simplified 1D example: Finding minimum of f(x) = x^2
def f(x):
    return x ** 2

def gradient(x):
    return 2 * x  # Derivative of x^2

# Start at x = 5 and step downhill
x = 5.0
learning_rate = 0.1

print(f"Start position: x = {x:.4f}, f(x) = {f(x):.4f}")
print("-"*50)
for step in range(10):
    grad = gradient(x)  # Check slope direction
    x = x - learning_rate * grad  # Move opposite to slope
    print(f"Step {step+1}: grad={grad:+.3f}, move to x={x:.4f}, f(x)={f(x):.4f}")
print("-"*50)
print(f"Final position: x = {x:.4f} (approaching the minimum at x = 0)")
print("\nIn Logistic Regression, we do the same but with multiple weights!")
# Full logistic regression trained with gradient descent on synthetic data
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

def log_loss(y, y_pred):
    """Binary cross-entropy loss"""
    epsilon = 1e-15
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return -np.mean(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred))

def gradient_descent_step(X, y, w, b, learning_rate):
    """One step of gradient descent"""
    m = X.shape[0]

    # Forward pass
    z = np.dot(X, w) + b
    y_pred = sigmoid(z)

    # Backward pass (gradients)
    dw = np.dot(X.T, (y_pred - y)) / m
    db = np.mean(y_pred - y)

    # Update weights
    w_new = w - learning_rate * dw
    b_new = b - learning_rate * db

    # Compute new loss
    z_new = np.dot(X, w_new) + b_new
    y_pred_new = sigmoid(z_new)
    loss = log_loss(y, y_pred_new)

    return w_new, b_new, loss

# Synthetic, linearly separable data
np.random.seed(42)
X = np.random.randn(50, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(float)

# Initialize parameters and train for 100 epochs
w = np.random.randn(2) * 0.01
b = 0
learning_rate = 0.1
losses = []

for epoch in range(100):
    w, b, loss = gradient_descent_step(X, y, w, b, learning_rate)
    losses.append(loss)

# Plot the training curve
plt.figure(figsize=(10, 5))
plt.plot(losses, linewidth=2)
plt.xlabel('Iteration')
plt.ylabel('Log Loss')
plt.title('Gradient Descent: Loss Decreasing Over Iterations')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"Initial loss: {losses[0]:.4f}")
print(f"Final loss: {losses[-1]:.4f}")

Training with Scikit-learn

In practice, we don't implement gradient descent from scratch. Scikit-learn's LogisticRegression class handles everything for us. Let's train a real model on actual data.

from sklearn.linear_model import LogisticRegression
import numpy as np

# Training data: hours studied vs. pass/fail outcome
hours_studied = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
passed_exam = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # 0=fail, 1=pass

model = LogisticRegression()
model.fit(hours_studied, passed_exam)

# Predict for new students; wrap each sample so sklearn gets a 2D array
test_hours = [[3.5], [4.5], [5.5]]
for hours in test_hours:
    prob = model.predict_proba([hours])[0]
    prediction = model.predict([hours])[0]
    result = "PASS" if prediction == 1 else "FAIL"
    print(f"Hours: {hours[0]} -> Probability: {prob[1]:.2%} -> {result}")

print(f"\nModel learned: coef={model.coef_[0][0]:.3f}, intercept={model.intercept_[0]:.3f}")
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the breast cancer dataset (all 30 features this time)
data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features: important for gradient-based solvers
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LogisticRegression(max_iter=10000)
model.fit(X_train_scaled, y_train)

# Class predictions and probabilities
y_train_pred = model.predict(X_train_scaled)
y_test_pred = model.predict(X_test_scaled)
y_train_proba = model.predict_proba(X_train_scaled)
y_test_proba = model.predict_proba(X_test_scaled)

print(f"Training accuracy: {model.score(X_train_scaled, y_train):.4f}")
print(f"Testing accuracy: {model.score(X_test_scaled, y_test):.4f}")
print(f"\nModel coefficients shape: {model.coef_.shape}")
print(f"First 5 coefficients: {model.coef_[0][:5]}")
print(f"Intercept: {model.intercept_[0]:.4f}")

Model Evaluation Metrics

Accuracy alone isn't enough to evaluate classification models, especially with imbalanced data. We need multiple metrics to understand different aspects of model performance.
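To see why accuracy can mislead, consider a dataset that is 90% negative: a model that always predicts the majority class scores 90% accuracy while catching zero positives. A quick sketch:

```python
import numpy as np

# Imbalanced labels: 90 negatives, 10 positives
y_true = np.array([0] * 90 + [1] * 10)

# A useless "model" that always predicts the majority class
y_pred = np.zeros(100, dtype=int)

accuracy = np.mean(y_pred == y_true)
recall = np.sum((y_pred == 1) & (y_true == 1)) / np.sum(y_true == 1)

print(f"Accuracy: {accuracy:.0%}")  # 90% -- looks great!
print(f"Recall:   {recall:.0%}")    # 0% -- catches no positives
```

This is why precision, recall, and F1 are reported alongside accuracy.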

# Worked example: spam detection with 10 emails
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]  # 4 spam, 6 not-spam
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]  # Our model's predictions

# Confusion matrix counts (tallied by hand from the lists above)
tp = 2  # True Positive: Correctly predicted spam
fp = 1  # False Positive: Predicted spam but was not spam
fn = 2  # False Negative: Predicted not-spam but was spam
tn = 5  # True Negative: Correctly predicted not-spam

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f"\nAccuracy = (TP+TN)/(TP+TN+FP+FN)")
print(f"         = ({tp}+{tn})/({tp}+{tn}+{fp}+{fn}) = {accuracy:.2f}")
print(f"  Meaning: {accuracy*100:.0f}% of all predictions were correct")

precision = tp / (tp + fp)
print(f"\nPrecision = TP/(TP+FP) = {tp}/({tp}+{fp}) = {precision:.2f}")
print(f"  Meaning: {precision*100:.0f}% of predicted spam was actually spam")

recall = tp / (tp + fn)
print(f"\nRecall = TP/(TP+FN) = {tp}/({tp}+{fn}) = {recall:.2f}")
print(f"  Meaning: We caught {recall*100:.0f}% of all spam")

f1 = 2 * (precision * recall) / (precision + recall)
print(f"\nF1-Score = 2*(Precision*Recall)/(Precision+Recall) = {f1:.2f}")
print(f"  Meaning: Balanced score considering both precision and recall")
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (accuracy_score, precision_score, recall_score, 
                             f1_score, confusion_matrix, roc_auc_score, 
                             roc_curve, auc)
import matplotlib.pyplot as plt

# Imbalanced dataset: roughly 90% class 0, 10% class 1
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LogisticRegression()
model.fit(X_train_scaled, y_train)

y_pred = model.predict(X_test_scaled)
y_proba = model.predict_proba(X_test_scaled)[:, 1]

print("Classification Metrics:")
print("=" * 40)
print(f"Accuracy:  {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision: {precision_score(y_test, y_pred):.4f}")
print(f"Recall:    {recall_score(y_test, y_pred):.4f}")
print(f"F1-Score:  {f1_score(y_test, y_pred):.4f}")
print(f"ROC-AUC:   {roc_auc_score(y_test, y_proba):.4f}")

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("\nConfusion Matrix:")
print(f"TP: {tp}, FP: {fp}")
print(f"FN: {fn}, TN: {tn}")

# ROC curve
fpr, tpr, _ = roc_curve(y_test, y_proba)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(10, 6))
plt.plot(fpr, tpr, color='b', lw=2, label=f'ROC curve (AUC = {roc_auc:.3f})')
plt.plot([0, 1], [0, 1], color='r', lw=2, linestyle='--', label='Random')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

Feature Importance

In Logistic Regression, the magnitude of a coefficient indicates feature importance: larger absolute coefficients mean the feature has more influence on the prediction. Keep in mind that coefficients are only directly comparable when the features share a scale, so standardize your data before interpreting them. This built-in interpretability makes Logistic Regression easier to explain than many other algorithms.

# Understanding Feature Importance in Logistic Regression
import numpy as np

# Suppose we're predicting if a customer will buy (1) or not (0)
# We have 3 features with these coefficients:
coefficients = {
    "visit_duration": 0.8,    # Time spent on website (minutes)
    "items_viewed": 1.5,      # Number of products viewed
    "previous_purchases": -0.3  # Already bought before (paradoxically negative!)
}

print("Feature Coefficients Explained:")
print("="*55)

for feature, coef in coefficients.items():
    direction = "increases" if coef > 0 else "decreases"
    impact = abs(coef)
    
    print(f"\n{feature}: {coef:+.2f}")
    print(f"  - Each unit increase {direction} log-odds by {impact:.2f}")
    print(f"  - Importance ranking: {'High' if impact > 1 else 'Medium' if impact > 0.5 else 'Low'}")

# Which feature matters most? Look at absolute values!
importance = {f: abs(c) for f, c in coefficients.items()}
sorted_features = sorted(importance.items(), key=lambda x: x[1], reverse=True)

print("\nFeature Importance (by absolute coefficient):")
for i, (feature, imp) in enumerate(sorted_features, 1):
    print(f"  {i}. {feature}: {imp:.2f}")

# Feature Importance from Coefficients
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load data
data = load_breast_cancer()
X = data.data
y = data.target
feature_names = data.feature_names

# Scale and train
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

model = LogisticRegression(max_iter=10000)
model.fit(X_scaled, y)

# Get feature importance (coefficients)
importance = pd.DataFrame({
    'Feature': feature_names,
    'Coefficient': model.coef_[0],
    'Abs_Coefficient': np.abs(model.coef_[0])
}).sort_values('Abs_Coefficient', ascending=False)

# Plot top 10 features
plt.figure(figsize=(10, 6))
top_features = importance.head(10)
colors = ['green' if c > 0 else 'red' for c in top_features['Coefficient']]
plt.barh(range(len(top_features)), top_features['Coefficient'], color=colors)
plt.yticks(range(len(top_features)), top_features['Feature'])
plt.xlabel('Coefficient Value')
plt.title('Top 10 Most Important Features')
plt.tight_layout()
plt.show()

print("\nTop 5 Most Important Features:")
print(importance.head())

Practice Questions: Model Training

Test your understanding with these coding challenges.

Task: Load the Iris dataset, filter to binary classification (classes 0 and 1 only), train LogisticRegression, and report accuracy.

Show Solution
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target

# Binary classification: use only classes 0 and 1
mask = y != 2
X, y = X[mask], y[mask]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"Accuracy: {accuracy:.4f}")

Task: Train LogisticRegression on breast cancer dataset. Print precision and recall for thresholds [0.3, 0.4, 0.5, 0.6, 0.7].

Show Solution
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import precision_score, recall_score

data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LogisticRegression()
model.fit(X_train_scaled, y_train)
y_proba = model.predict_proba(X_test_scaled)[:, 1]

print("Threshold | Precision | Recall")
print("-" * 35)
for threshold in [0.3, 0.4, 0.5, 0.6, 0.7]:
    y_pred = (y_proba >= threshold).astype(int)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    print(f"{threshold:>9.1f} | {precision:>9.3f} | {recall:>6.3f}")

Task: Use GridSearchCV to find optimal C and penalty parameters. Report best parameters and test score.

Show Solution
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2']
}

model = LogisticRegression(solver='liblinear', max_iter=10000)
grid_search = GridSearchCV(model, param_grid, cv=5, scoring='f1')
grid_search.fit(X_train_scaled, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.4f}")
print(f"Test score: {grid_search.score(X_test_scaled, y_test):.4f}")
05

Naive Bayes Classifier

Naive Bayes is one of the simplest yet surprisingly effective classification algorithms. Despite its "naive" assumption that all features are independent of each other (which is rarely true in real life), it often performs remarkably well, especially for text classification tasks like spam detection and sentiment analysis. The algorithm is based on Bayes' Theorem from probability theory, and its simplicity means it's incredibly fast to train and predict - perfect for large datasets or real-time applications.

Bayes' Theorem: The Foundation

At its core, Naive Bayes uses Bayes' Theorem to calculate the probability of a class given the observed features. The theorem states:

Bayes' Theorem

P(Class | Features) = P(Features | Class) × P(Class) / P(Features)

  • P(Class | Features) - Posterior probability: What we want to find - the probability of a class given the features
  • P(Features | Class) - Likelihood: How probable are these features if we know the class
  • P(Class) - Prior probability: How common is this class in general
  • P(Features) - Evidence: The probability of seeing these features (constant for all classes)
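To make the theorem concrete, here is a small worked example with made-up spam statistics (the 30%, 60%, and 5% figures below are illustrative assumptions, not real data):

```python
# Worked Bayes' theorem example with made-up spam statistics.
# Suppose 30% of emails are spam, the word "free" appears in 60% of
# spam emails and in only 5% of legitimate (ham) emails.
# What is P(spam | "free")?

p_spam = 0.30             # P(Class): prior probability of spam
p_free_given_spam = 0.60  # P(Features | Class): likelihood for spam
p_free_given_ham = 0.05   # likelihood for ham
p_ham = 1 - p_spam

# P(Features): total probability of seeing "free" in any email
p_free = p_free_given_spam * p_spam + p_free_given_ham * p_ham

# Posterior via Bayes' theorem
p_spam_given_free = p_free_given_spam * p_spam / p_free
print(f"P(spam | 'free') = {p_spam_given_free:.3f}")  # ≈ 0.837
```

Seeing "free" raises the spam probability from the 30% prior to about 84%, which is exactly the posterior update Naive Bayes performs for every feature.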

The "Naive" Assumption

The "naive" part comes from assuming that all features are conditionally independent given the class. This means we assume that knowing the value of one feature tells us nothing about other features (given the class). For example, in spam detection, we assume that the presence of the word "free" is independent of the word "money" - which is obviously not true! Yet this simplification makes the math tractable and often works surprisingly well in practice.
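A minimal sketch of what that independence assumption buys us: the joint likelihood of several words factorizes into a product of per-word likelihoods. The per-word probabilities below are made-up numbers for illustration:

```python
# Under the naive assumption, the joint likelihood factorizes:
# P("free", "money" | spam) ≈ P("free" | spam) * P("money" | spam)
# (probabilities below are illustrative, not from real data)

p_words_given_spam = {"free": 0.60, "money": 0.40}
p_words_given_ham = {"free": 0.05, "money": 0.10}

def naive_likelihood(words, p_word_given_class):
    """Multiply the per-word likelihoods (the 'naive' factorization)."""
    likelihood = 1.0
    for w in words:
        likelihood *= p_word_given_class[w]
    return likelihood

print(naive_likelihood(["free", "money"], p_words_given_spam))  # 0.24
print(naive_likelihood(["free", "money"], p_words_given_ham))   # 0.005
```

Instead of estimating one probability per word *combination* (exponentially many), we only estimate one per word, which is why Naive Bayes stays tractable even with thousands of text features.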

Gaussian NB

For continuous features that follow a normal (Gaussian) distribution. Most common choice for general numeric data.

  • Assumes features are normally distributed
  • Works well with continuous data
  • Fast training and prediction
Multinomial NB

For discrete count features like word counts in text. The go-to choice for document classification and NLP tasks.

  • Perfect for text classification
  • Uses word frequency counts
  • Excellent for spam detection
Bernoulli NB

For binary/boolean features (present or not). Good for document classification with binary term occurrence.

  • Binary feature vectors only
  • Word presence, not frequency
  • Good for short text documents

Implementing Gaussian Naive Bayes

Let's implement Gaussian Naive Bayes for classification on a standard dataset. This variant assumes features follow a normal distribution.

from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import numpy as np

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and train Gaussian Naive Bayes classifier
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Make predictions
y_pred = gnb.predict(X_test)

# Evaluate the model
print("Gaussian Naive Bayes Results:")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))

Text Classification with Multinomial Naive Bayes

Naive Bayes truly shines in text classification. Here's how to build a spam detector using Multinomial Naive Bayes:

from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline

# Sample email data
emails = [
    "Get rich quick! Free money waiting for you!",
    "Meeting scheduled for tomorrow at 3pm",
    "Congratulations! You've won a million dollars!",
    "Please review the attached project proposal",
    "Limited time offer! Buy now and save 90%!",
    "Can we reschedule our call to next week?",
    "Claim your free prize now! Act immediately!",
    "The quarterly report is ready for your review"
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]  # 1 = spam, 0 = not spam

# Create a pipeline with text vectorization and Naive Bayes
spam_detector = Pipeline([
    ('vectorizer', TfidfVectorizer(stop_words='english')),
    ('classifier', MultinomialNB(alpha=1.0))  # alpha is Laplace smoothing
])

# Train the model
spam_detector.fit(emails, labels)

# Test on new emails
new_emails = [
    "Free lottery tickets! Claim now!",
    "Let's discuss the project timeline"
]

predictions = spam_detector.predict(new_emails)
probabilities = spam_detector.predict_proba(new_emails)

for email, pred, prob in zip(new_emails, predictions, probabilities):
    label = "SPAM" if pred == 1 else "NOT SPAM"
    confidence = max(prob) * 100
    print(f"'{email[:40]}...' -> {label} ({confidence:.1f}% confident)")
When to Use Naive Bayes: Naive Bayes excels when you have limited training data, need fast training/prediction, or work with high-dimensional data like text. It's often used as a baseline model and can outperform more complex algorithms for text classification tasks.

Practice Questions

Problem: Train GaussianNB and compare it with Logistic Regression on the Iris dataset. Which performs better?

Show Solution
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score

iris = load_iris()
X, y = iris.data, iris.target

# Compare models using cross-validation
gnb = GaussianNB()
lr = LogisticRegression(max_iter=200)

gnb_scores = cross_val_score(gnb, X, y, cv=5)
lr_scores = cross_val_score(lr, X, y, cv=5)

print(f"Gaussian NB: {gnb_scores.mean():.4f} (+/- {gnb_scores.std():.4f})")
print(f"Logistic Regression: {lr_scores.mean():.4f} (+/- {lr_scores.std():.4f})")

Explanation: Both algorithms often perform similarly on well-separated datasets like Iris. Naive Bayes trains faster but may be less accurate when features are correlated.

Problem: Test different alpha values (0.01, 0.1, 1.0, 10.0) for MultinomialNB on a text dataset. How does alpha affect performance?

Show Solution
from sklearn.naive_bayes import MultinomialNB
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Load subset of 20 newsgroups
categories = ['sci.space', 'rec.sport.hockey']
newsgroups = fetch_20newsgroups(subset='train', categories=categories)

alphas = [0.01, 0.1, 1.0, 10.0]

for alpha in alphas:
    pipeline = Pipeline([
        ('tfidf', TfidfVectorizer(max_features=5000)),
        ('nb', MultinomialNB(alpha=alpha))
    ])
    scores = cross_val_score(pipeline, newsgroups.data, newsgroups.target, cv=5)
    print(f"Alpha={alpha:5.2f}: {scores.mean():.4f} (+/- {scores.std():.4f})")

Explanation: Alpha is the Laplace smoothing parameter. Higher values add more smoothing, preventing zero probabilities but potentially oversimplifying. Usually alpha=1.0 (default) works well.

Problem: Build a sentiment analyzer using Naive Bayes that classifies movie reviews as positive or negative. Use TF-IDF features.

Show Solution
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Sample movie reviews
reviews = [
    "This movie was absolutely fantastic! Great acting and plot.",
    "Terrible waste of time. Boring and predictable.",
    "I loved every minute of it. A masterpiece!",
    "Worst movie I've ever seen. Don't watch it.",
    "Brilliant performances by the entire cast. Highly recommend!",
    "So boring I fell asleep. Complete disappointment.",
    "An incredible journey with memorable characters.",
    "Awful script and terrible direction. Save your money."
]
sentiments = [1, 0, 1, 0, 1, 0, 1, 0]  # 1 = positive, 0 = negative

X_train, X_test, y_train, y_test = train_test_split(
    reviews, sentiments, test_size=0.25, random_state=42
)

# Create pipeline
sentiment_model = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2), max_features=1000)),
    ('nb', MultinomialNB())
])

sentiment_model.fit(X_train, y_train)
predictions = sentiment_model.predict(X_test)

print("Sentiment Analysis Results:")
print(classification_report(y_test, predictions, target_names=['Negative', 'Positive']))

Explanation: TF-IDF with n-grams captures both individual words and phrases. Multinomial NB works well for this text classification task.

06

K-Nearest Neighbors (KNN)

K-Nearest Neighbors is one of the most intuitive machine learning algorithms - it literally says "tell me who your neighbors are, and I'll tell you who you are." When classifying a new data point, KNN looks at the K closest training examples and assigns the class that appears most frequently among those neighbors. It's a "lazy learner" because it doesn't build a model during training; instead, it stores all the training data and does the work at prediction time. This makes it simple to understand and implement, but potentially slow for large datasets.

How KNN Works

The KNN Algorithm

  1. Store all training data - No actual training happens; we just memorize the examples
  2. For a new point, calculate distance - Compute distance from the new point to every training point
  3. Find K nearest neighbors - Select the K training points closest to the new point
  4. Vote for the class - The class that appears most often among the K neighbors wins
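The four steps above can be sketched in a few lines of NumPy. This is a toy implementation with a hypothetical 2-D dataset, not the scikit-learn version used later:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Minimal KNN: distance to every training point, vote among k nearest."""
    # Step 2: Euclidean distance from x_new to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 3: indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Step 4: majority vote among the k neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Two well-separated clusters (made-up data)
X_train = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([1.5, 1.5])))  # 0
print(knn_predict(X_train, y_train, np.array([8.5, 8.5])))  # 1
```

Note that "Step 1: store the training data" is just keeping the arrays around, which is why KNN training is instantaneous and prediction does all the work.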

Choosing the Right K

The choice of K (number of neighbors) dramatically affects the model's behavior:

Small K (e.g., K=1)
  • Very sensitive to noise and outliers
  • Creates complex, jagged decision boundaries
  • High variance, low bias (overfitting risk)
  • Single point determines the prediction
Large K (e.g., K=15)
  • More robust to noise and outliers
  • Creates smoother decision boundaries
  • High bias, low variance (underfitting risk)
  • Multiple points influence the decision
Pro Tip: A common rule of thumb is to start with K = √n (square root of training samples) and use cross-validation to find the optimal value. Also, use odd K values for binary classification to avoid ties.
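The K = √n rule of thumb from the tip above is easy to compute; this sketch uses a hypothetical training-set size and rounds to the nearest odd integer to avoid ties in binary classification:

```python
import math

n_train = 120  # hypothetical number of training samples

# Rule-of-thumb starting point: K ≈ √n, nudged to an odd integer
# so binary-classification votes cannot tie
k = round(math.sqrt(n_train))
if k % 2 == 0:
    k += 1
print(f"Starting K for {n_train} samples: {k}")  # √120 ≈ 10.95 -> 11
```

Treat this only as a starting value for the cross-validation search, not as the final answer.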

Distance Metrics

The "closeness" of neighbors depends on how we measure distance. Different metrics work better for different types of data:

  • Euclidean: √(Σ(xᵢ - yᵢ)²). Best for continuous features and general use (most common).
  • Manhattan: Σ|xᵢ - yᵢ|. Best for high-dimensional data and grid-like movement.
  • Minkowski: (Σ|xᵢ - yᵢ|ᵖ)^(1/p). Generalizes Euclidean (p=2) and Manhattan (p=1).
  • Cosine: 1 - cos(θ). Best for text data, when magnitude doesn't matter.
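The metrics above are all a few lines of NumPy; here they are computed for one made-up pair of points so you can see how the same two vectors get different "distances":

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 3.0])

euclidean = np.sqrt(np.sum((x - y) ** 2))           # √(9 + 16 + 0) = 5
manhattan = np.sum(np.abs(x - y))                   # 3 + 4 + 0 = 7
p = 3
minkowski = np.sum(np.abs(x - y) ** p) ** (1 / p)   # (27 + 64)^(1/3)
cosine = 1 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

print(f"Euclidean: {euclidean:.3f}")
print(f"Manhattan: {manhattan:.3f}")
print(f"Minkowski (p=3): {minkowski:.3f}")
print(f"Cosine distance: {cosine:.3f}")
```

In scikit-learn you select these via `KNeighborsClassifier(metric=...)` rather than computing them by hand.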

Implementing KNN

from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report
import numpy as np

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# IMPORTANT: Scale features for KNN (distance-based algorithm)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Create and train KNN classifier
knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn.fit(X_train_scaled, y_train)

# Predict and evaluate
y_pred = knn.predict(X_test_scaled)
print(f"KNN Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))

Finding the Optimal K

import matplotlib.pyplot as plt

# Test different K values
k_values = range(1, 31)
train_scores = []
test_scores = []

for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train_scaled, y_train)
    train_scores.append(knn.score(X_train_scaled, y_train))
    test_scores.append(knn.score(X_test_scaled, y_test))

# Plot results
plt.figure(figsize=(10, 6))
plt.plot(k_values, train_scores, 'b-o', label='Training Accuracy')
plt.plot(k_values, test_scores, 'r-s', label='Test Accuracy')
plt.xlabel('Number of Neighbors (K)')
plt.ylabel('Accuracy')
plt.title('KNN: Finding the Optimal K')
plt.legend()
plt.grid(True, alpha=0.3)

plt.show()

# Find best K
best_k = k_values[np.argmax(test_scores)]
print(f"Best K: {best_k} with test accuracy: {max(test_scores):.4f}")
Critical: Feature Scaling! KNN is distance-based, so features with larger scales will dominate the distance calculation. Always standardize or normalize your features before using KNN. This is one of the most common mistakes beginners make!

Practice Questions

Problem: Compare KNN accuracy with and without feature scaling on the Iris dataset. How much difference does scaling make?

Show Solution
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score

iris = load_iris()
X, y = iris.data, iris.target

# Without scaling
knn_no_scale = KNeighborsClassifier(n_neighbors=5)
scores_no_scale = cross_val_score(knn_no_scale, X, y, cv=5)

# With scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
knn_scaled = KNeighborsClassifier(n_neighbors=5)
scores_scaled = cross_val_score(knn_scaled, X_scaled, y, cv=5)

print(f"Without scaling: {scores_no_scale.mean():.4f}")
print(f"With scaling: {scores_scaled.mean():.4f}")

Explanation: Scaling often improves KNN performance, especially when features have different scales. The improvement depends on the dataset.

Problem: Test Euclidean, Manhattan, and Chebyshev distances on a dataset. Which works best?

Show Solution
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X = StandardScaler().fit_transform(wine.data)
y = wine.target

metrics = ['euclidean', 'manhattan', 'chebyshev']

for metric in metrics:
    knn = KNeighborsClassifier(n_neighbors=5, metric=metric)
    scores = cross_val_score(knn, X, y, cv=5)
    print(f"{metric:12s}: {scores.mean():.4f} (+/- {scores.std():.4f})")

Explanation: Different metrics capture different notions of "closeness". Euclidean is most common, but Manhattan can work better in high dimensions.

Problem: Compare uniform weights vs distance-weighted KNN. When does distance weighting help?

Show Solution
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

# Create dataset with some overlap
X, y = make_classification(n_samples=500, n_features=10, 
                          n_informative=5, n_redundant=2,
                          n_clusters_per_class=2, random_state=42)
X = StandardScaler().fit_transform(X)

for k in [3, 7, 15]:
    knn_uniform = KNeighborsClassifier(n_neighbors=k, weights='uniform')
    knn_distance = KNeighborsClassifier(n_neighbors=k, weights='distance')
    
    scores_uniform = cross_val_score(knn_uniform, X, y, cv=5).mean()
    scores_distance = cross_val_score(knn_distance, X, y, cv=5).mean()
    
    print(f"K={k:2d}: Uniform={scores_uniform:.4f}, Distance={scores_distance:.4f}")

Explanation: Distance weighting gives closer neighbors more influence. It often helps with noisy data and allows using larger K values without losing local sensitivity.

07

Multi-class Classification Strategies

Many real-world problems have more than two classes: classifying emails into spam/promotional/personal/social, recognizing handwritten digits (0-9), or identifying plant species. While some algorithms like Decision Trees and KNN naturally handle multiple classes, others like Logistic Regression and SVM are inherently binary classifiers. In this section, we'll explore strategies to extend binary classifiers to multi-class problems, and how to use algorithms that natively support multiple classes.

One-vs-Rest (OvR) Strategy

One-vs-Rest (One-vs-All)

For N classes, train N separate binary classifiers. Each classifier learns to distinguish one class from all other classes combined. At prediction time, run all N classifiers and pick the class with the highest confidence score.

Example (3 classes: A, B, C):

  • Classifier 1: Is it class A? (A vs not-A)
  • Classifier 2: Is it class B? (B vs not-B)
  • Classifier 3: Is it class C? (C vs not-C)
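The OvR decision rule for the three classifiers above is just an argmax over their confidence scores. The scores below are made-up values for one hypothetical sample:

```python
# Made-up confidence scores from the three binary classifiers
# (A vs not-A, B vs not-B, C vs not-C) for one new sample
scores = {"A": 0.30, "B": 0.85, "C": 0.10}

# OvR decision rule: pick the class whose classifier is most confident
predicted = max(scores, key=scores.get)
print(f"OvR prediction: {predicted}")  # B
```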

One-vs-One (OvO) Strategy

One-vs-One

For N classes, train N×(N-1)/2 binary classifiers, one for each pair of classes. At prediction time, each classifier votes for one class, and the class with the most votes wins.

Example (3 classes: A, B, C):

  • Classifier 1: A vs B
  • Classifier 2: A vs C
  • Classifier 3: B vs C
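The OvO decision rule is a majority vote over the pairwise classifiers. The votes below are made-up outcomes for one hypothetical sample:

```python
from collections import Counter

# Made-up votes from the three pairwise classifiers for one sample:
# A-vs-B picked A, A-vs-C picked C, B-vs-C picked C
votes = ["A", "C", "C"]

# OvO decision rule: the class with the most pairwise wins
predicted = Counter(votes).most_common(1)[0][0]
print(f"OvO prediction: {predicted}")  # C
```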
One-vs-Rest Advantages
  • Only N classifiers needed (more efficient)
  • Each classifier sees all training data
  • Default strategy in scikit-learn
  • Works well when classes are well-separated
One-vs-One Advantages
  • Each classifier is simpler (only 2 classes)
  • Better for imbalanced datasets
  • Default for scikit-learn's SVC (many small pairwise problems train faster than one big one)
  • More robust to outliers in other classes

Multinomial (Softmax) Logistic Regression

Instead of using multiple binary classifiers, we can extend Logistic Regression to directly predict probabilities across all classes using the softmax function. This is more elegant and often more effective:
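The softmax function itself is short enough to sketch directly; this standalone version uses hypothetical raw class scores (logits), not output from a trained model:

```python
import numpy as np

def softmax(z):
    """Convert raw class scores into probabilities that sum to 1."""
    z = z - z.max()          # subtract the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

# Hypothetical raw scores (logits) for 3 classes
scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs)        # highest score -> highest probability
print(probs.sum())  # probabilities sum to 1
```

For two classes, softmax reduces to the sigmoid, which is why multinomial Logistic Regression is the natural multi-class generalization.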

from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# OvR (One-vs-Rest) strategy
# Note: the multi_class parameter is deprecated in scikit-learn >= 1.5;
# on newer versions, wrap the model in OneVsRestClassifier for explicit OvR
lr_ovr = LogisticRegression(multi_class='ovr', max_iter=200)
lr_ovr.fit(X_train, y_train)
print(f"One-vs-Rest Accuracy: {lr_ovr.score(X_test, y_test):.4f}")

# Multinomial (Softmax) - native multi-class
lr_multinomial = LogisticRegression(multi_class='multinomial', max_iter=200)
lr_multinomial.fit(X_train, y_train)
print(f"Multinomial Accuracy: {lr_multinomial.score(X_test, y_test):.4f}")

# Compare probability outputs
sample = X_test[0:1]
print(f"\nProbabilities for first sample:")
print(f"OvR: {lr_ovr.predict_proba(sample)[0]}")
print(f"Multinomial: {lr_multinomial.predict_proba(sample)[0]}")

Multi-class with Different Algorithms

from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

# Load digit recognition dataset (10 classes: 0-9)
digits = load_digits()
X = StandardScaler().fit_transform(digits.data)
y = digits.target

print(f"Dataset: {X.shape[0]} samples, {X.shape[1]} features, {len(set(y))} classes\n")

# Compare different multi-class approaches
models = {
    'Logistic (OvR)': LogisticRegression(multi_class='ovr', max_iter=1000),
    'Logistic (Multinomial)': LogisticRegression(multi_class='multinomial', max_iter=1000),
    'Naive Bayes': GaussianNB(),
    'KNN (k=5)': KNeighborsClassifier(n_neighbors=5),
    'Decision Tree': DecisionTreeClassifier(max_depth=10),
    'SVM (OvO)': SVC(kernel='rbf')
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name:25s}: {scores.mean():.4f} (+/- {scores.std():.4f})")

Explicit OvR and OvO Wrappers

from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Explicitly wrap a binary classifier for multi-class
# One-vs-Rest wrapper
ovr_classifier = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr_classifier.fit(X_train, y_train)
print(f"Explicit OvR: {ovr_classifier.score(X_test, y_test):.4f}")
print(f"Number of classifiers: {len(ovr_classifier.estimators_)}")

# One-vs-One wrapper
ovo_classifier = OneVsOneClassifier(SVC(kernel='linear'))
ovo_classifier.fit(X_train, y_train)
print(f"Explicit OvO: {ovo_classifier.score(X_test, y_test):.4f}")
print(f"Number of classifiers: {len(ovo_classifier.estimators_)}")
Which Strategy to Choose?
  • Multinomial: Use when available (Logistic Regression) - most elegant solution
  • One-vs-Rest: Good default for most algorithms, fewer classifiers
  • One-vs-One: Better for SVMs, imbalanced data, or when classes overlap significantly
  • Native multi-class: Trees, KNN, Naive Bayes handle it automatically!

Practice Questions

Problem: If you have 10 classes (like digit recognition), how many classifiers does OvO need? Verify with code.

Show Solution
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC
from sklearn.datasets import load_digits

digits = load_digits()
X, y = digits.data, digits.target

# Calculate theoretically
n_classes = len(set(y))
n_classifiers = n_classes * (n_classes - 1) // 2
print(f"Theoretical: {n_classes} classes -> {n_classifiers} classifiers")

# Verify with code
ovo = OneVsOneClassifier(SVC())
ovo.fit(X, y)
print(f"Actual: {len(ovo.estimators_)} classifiers")

Explanation: OvO needs N×(N-1)/2 classifiers. For 10 classes: 10×9/2 = 45 classifiers!

Problem: Compare OvR vs Multinomial Logistic Regression on the digits dataset. Which is more accurate? Which is faster?

Show Solution
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
import time

digits = load_digits()
X, y = digits.data, digits.target

for strategy in ['ovr', 'multinomial']:
    model = LogisticRegression(multi_class=strategy, max_iter=5000, solver='lbfgs')
    
    start = time.time()
    scores = cross_val_score(model, X, y, cv=5)
    elapsed = time.time() - start
    
    print(f"{strategy:12s}: Accuracy={scores.mean():.4f}, Time={elapsed:.2f}s")

Explanation: Multinomial often achieves similar or better accuracy with less computation since it optimizes all classes jointly.

Problem: Train a multi-class classifier on digits and create a confusion matrix heatmap. Which digits are most commonly confused?

Show Solution
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.3, random_state=42
)

model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Create confusion matrix
cm = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(10, 8))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=digits.target_names,
            yticklabels=digits.target_names)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Multi-class Confusion Matrix: Digit Recognition')
plt.show()

# Find most confused pairs
import numpy as np
np.fill_diagonal(cm, 0)  # Ignore correct predictions
max_idx = np.unravel_index(cm.argmax(), cm.shape)
print(f"Most confused: {max_idx[0]} misclassified as {max_idx[1]} ({cm[max_idx]} times)")

Explanation: Common confusions include 3↔8, 4↔9, and 1↔7 due to visual similarity.

Key Takeaways

Binary Classification

Logistic regression is a powerful algorithm for predicting one of two categories based on input features

Sigmoid Function

The sigmoid function converts linear combinations into probability scores between 0 and 1

Decision Boundaries

Classification occurs at decision boundaries, typically at a probability threshold of 0.5

Log Loss Function

Logistic regression minimizes log loss to find the best-fitting model parameters

Gradient Descent

Iterative optimization algorithm that updates model parameters to minimize the cost function

Naive Bayes Classifier

Probabilistic classifier using Bayes' theorem with feature independence assumption—fast, effective, especially for text

K-Nearest Neighbors (KNN)

Instance-based learning that classifies based on majority vote of K closest neighbors—simple yet powerful

Multi-class Strategies

Extend binary classifiers using One-vs-Rest, One-vs-One, or native multinomial/softmax approaches

Real-World Applications

Email spam detection, disease diagnosis, credit approval, and countless classification tasks

Knowledge Check

Test your understanding of Logistic Regression:

Question 1 of 9

What is the primary purpose of Logistic Regression?

Question 2 of 9

What is the sigmoid function's output range?

Question 3 of 9

What does a decision boundary of 0.5 mean in Logistic Regression?

Question 4 of 9

Which cost function is typically used in Logistic Regression?

Question 5 of 9

What does Gradient Descent do in Logistic Regression?

Question 6 of 9

Which evaluation metric is best for imbalanced binary classification?

Question 7 of 9

What is the key assumption of Naive Bayes classifiers?

Question 8 of 9

Why is feature scaling important for K-Nearest Neighbors (KNN)?

Question 9 of 9

What is the One-vs-Rest (OvR) strategy for multi-class classification?


Interactive Demo: Sigmoid Function Explorer

Sigmoid Visualization

Drag the slider to see how the linear input (z) transforms through the sigmoid function.
