Module 3.1

Logistic Regression

Master binary classification with logistic regression. Learn the sigmoid function, probability estimation, decision boundaries, and how to train models for real-world classification problems!

50 min read
Intermediate
What You'll Learn
  • Binary classification fundamentals
  • Sigmoid function and probability
  • Decision boundaries and thresholds
  • Cost function and gradient descent
  • Training with scikit-learn
Contents
01

Classification Fundamentals

So far, we have learned about Linear Regression, which predicts continuous values like house prices or temperature. But many real-world problems require different answers. Should we approve this loan? Will this patient develop diabetes? Is this email spam? These are classification problems where we predict categories, not numbers. Logistic Regression is the go-to algorithm for binary classification - predicting one of two possible outcomes.

Regression vs Classification

Let's clarify the difference between the two main types of supervised learning. Regression predicts continuous numerical outputs from a range of infinite possibilities. Classification predicts discrete categorical outputs from a finite set of classes. Despite its name, Logistic Regression is actually a classification algorithm, not a regression algorithm.

# Understanding the difference: Regression vs Classification

# REGRESSION: Predicting continuous values
# Example: House price prediction
house_features = {"bedrooms": 3, "sqft": 1500, "location": "suburban"}
predicted_price = 285000.50  # Could be any number
print(f"Predicted house price: ${predicted_price:,.2f}")

# CLASSIFICATION: Predicting discrete categories
# Example: Email spam detection
email_features = {"has_link": True, "urgent_words": 5, "sender": "unknown"}
predicted_class = "spam"  # Only two options: "spam" or "not spam"
print(f"Email classification: {predicted_class}")

# Key difference:
print("\nRegression output: Any number (infinite possibilities)")
print("Classification output: One of fixed categories (finite choices)")
Regression Example

Question: What will the house price be?
Output: $450,000 (continuous value)

Classification Example

Question: Will this email be spam?
Output: Yes or No (discrete class)

Binary vs Multiclass Classification

Classification problems come in two main flavors. Binary classification has two possible outcomes (spam or not spam, disease or healthy). Multiclass classification has three or more categories (is this email spam, promotional, or personal?). Logistic Regression handles binary classification directly, though we can extend it to multiclass problems using techniques like One-vs-Rest.

# Binary vs Multiclass Classification Examples

# BINARY CLASSIFICATION (2 classes)
# Logistic Regression is perfect for this!
binary_examples = {
    "Email": ["spam", "not spam"],
    "Medical": ["disease", "healthy"],
    "Loan": ["approved", "rejected"],
    "Fraud": ["fraudulent", "legitimate"]
}

print("Binary Classification Examples:")
for task, classes in binary_examples.items():
    print(f"  {task}: {classes[0]} vs {classes[1]}")

# MULTICLASS CLASSIFICATION (3+ classes)
# Requires extension like One-vs-Rest
multiclass_examples = {
    "Digit Recognition": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    "Email Type": ["spam", "promotional", "personal", "work"],
    "Sentiment": ["negative", "neutral", "positive"]
}

print("\nMulticlass Classification Examples:")
for task, classes in multiclass_examples.items():
    print(f"  {task}: {len(classes)} classes")
Real-World Applications: Credit approval (approve/reject), medical diagnosis (disease/healthy), email filtering (spam/not spam), fraud detection (fraudulent/legitimate transaction), sentiment analysis (positive/negative).
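As a quick illustration (a minimal sketch with made-up toy data), scikit-learn's LogisticRegression can be fit on a multiclass target directly; internally it handles the extension to three or more classes (via a multinomial or one-vs-rest scheme) for you:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: one feature, three well-separated classes (illustrative only)
X = np.array([[1], [2], [3], [10], [11], [12], [20], [21], [22]])
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

clf = LogisticRegression()
clf.fit(X, y)

# One predicted label per region; probabilities sum to 1 across all 3 classes
print(clf.predict([[2], [11], [21]]))
print(clf.predict_proba([[11]]).sum())
```

The key point: nothing changes in how you call the API; `predict_proba` simply returns one column of probabilities per class instead of two.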

Why Linear Regression Won't Work

You might ask: why can't we just use Linear Regression and round the output to 0 or 1? Let's see what happens. Recall that Linear Regression creates a line (or hyperplane in multiple dimensions) that fits the data. For classification, we want probabilities between 0 and 1, but Linear Regression can predict values far outside this range.

# Linear Regression on classification data
from sklearn.linear_model import LinearRegression
import numpy as np

# Example: predicting if student passes (1) or fails (0) based on hours studied
X = np.array([[2], [3], [4], [5], [6], [7], [8], [9]])
y = np.array([0, 0, 0, 1, 1, 1, 1, 1])

# Fit Linear Regression
model = LinearRegression()
model.fit(X, y)

# Predictions
predictions = model.predict(np.array([[0], [10], [15]]))
print("Hour 0:", predictions[0])   # ≈ -0.36 (a negative "probability"!)
print("Hour 10:", predictions[1])  # ≈ 1.43 (a "probability" greater than 1!)
print("Hour 15:", predictions[2])  # ≈ 2.32 (nonsensical!)

As you can see, Linear Regression produces predictions outside the valid probability range [0, 1]. This is why we need a special algorithm for classification: Logistic Regression. It uses the sigmoid function to squash any input value into the range 0 to 1.

Key Insight: We need an algorithm that guarantees outputs between 0 and 1 and interprets them as class probabilities. This is exactly what Logistic Regression provides.

Practice Questions: Classification Fundamentals

Test your understanding with these coding challenges.

Given:

problems = [
    "Predicting house prices",
    "Identifying if a photo contains a cat",
    "Forecasting next month's revenue",
    "Detecting credit card fraud"
]

Task: Create a dictionary mapping each problem to its type: "regression" or "classification"

Show Solution
problem_types = {
    "Predicting house prices": "regression",
    "Identifying if a photo contains a cat": "classification",
    "Forecasting next month's revenue": "regression",
    "Detecting credit card fraud": "classification"
}

for problem, ptype in problem_types.items():
    print(f"{problem}: {ptype}")

Given:

from sklearn.linear_model import LinearRegression
import numpy as np

X = np.array([[2], [3], [4], [5], [6], [7], [8], [9]])
y = np.array([0, 0, 0, 1, 1, 1, 1, 1])

Task: Train LinearRegression and predict for X=0 and X=12. Print the predictions and explain why they are invalid probabilities.

Show Solution
from sklearn.linear_model import LinearRegression
import numpy as np

X = np.array([[2], [3], [4], [5], [6], [7], [8], [9]])
y = np.array([0, 0, 0, 1, 1, 1, 1, 1])

model = LinearRegression()
model.fit(X, y)

pred_0 = model.predict([[0]])[0]
pred_12 = model.predict([[12]])[0]

print(f"Prediction at X=0: {pred_0:.3f}")   # Negative!
print(f"Prediction at X=12: {pred_12:.3f}") # Greater than 1!
print("\nThese are invalid probabilities!")

Given:

import numpy as np
X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

Task: Train both LinearRegression and LogisticRegression. Predict for X values [-2, 5, 12] and compare results side by side.

Show Solution
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

linear_model = LinearRegression()
logistic_model = LogisticRegression()

linear_model.fit(X, y)
logistic_model.fit(X, y)

test_values = [[-2], [5], [12]]

print("X value | Linear | Logistic Prob")
print("-" * 35)
for x in test_values:
    lin_pred = linear_model.predict([x])[0]
    log_prob = logistic_model.predict_proba([x])[0, 1]
    print(f"{x[0]:7} | {lin_pred:6.3f} | {log_prob:.3f}")
02

The Sigmoid Function

The heart of Logistic Regression is the sigmoid function, an elegant mathematical tool that transforms any input value into a probability between 0 and 1. This S-shaped curve is perfect for classification because it maps negative infinity to 0, zero to 0.5, and positive infinity to 1. The sigmoid function is the key innovation that makes Logistic Regression work.

The Sigmoid Function Equation

The sigmoid function, also called the logistic function, is mathematically expressed as:

Mathematical Definition

Sigmoid Function

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Where: e is Euler's number (approximately 2.71828), and z is any real number (the linear combination of inputs).

This simple equation is incredibly powerful. Let's understand what happens at key points:

  • z = -5: σ(z) ≈ 0.0067 (very close to 0)
  • z = 0: σ(z) = 0.5 (exactly halfway)
  • z = 5: σ(z) ≈ 0.9933 (very close to 1)

Let's implement this in Python step by step:

Step 1: Define the Sigmoid Function

First, we create a simple function that takes any number z and returns the sigmoid value. We use Python's math.exp() for the exponential calculation.

# Import the math module for exponential function
import math

def sigmoid(z):
    """Calculate sigmoid of z"""
    return 1 / (1 + math.exp(-z))
Why math.exp()? The function math.exp(-z) calculates $e^{-z}$. This is more accurate and readable than writing 2.71828 ** (-z).
Step 2: Test with Negative Input (z = -5)

When z is negative, the sigmoid outputs a value close to 0. This represents low probability of belonging to class 1.

# When z = -5 (negative input)
z = -5
exp_term = math.exp(-z)  # e^5 ≈ 148.41
result = 1 / (1 + exp_term)

print(f"z = {z}: sigmoid = {result:.4f}")
# Output: z = -5: sigmoid = 0.0067
How does this calculation work?

When we plug z = -5 into the sigmoid formula, we first calculate math.exp(-(-5)) which equals math.exp(5). This gives us approximately 148.41 (that's $e$ raised to the power of 5).

Next, we add 1 to this value: 1 + 148.41 = 149.41. This is our denominator.


Finally, we divide 1 by this denominator: 1 / 149.41 = 0.0067. This tiny number (close to 0) tells us the model predicts this sample is very unlikely to belong to class 1.

Step 3: Test with Zero Input (z = 0)

When z = 0, the sigmoid returns exactly 0.5. This is the decision boundary where the model is equally uncertain about both classes.

# When z = 0 (neutral input)
z = 0
exp_term = math.exp(-z)  # e^0 = 1
result = 1 / (1 + exp_term)

print(f"z = {z}: sigmoid = {result:.4f}")
# Output: z = 0: sigmoid = 0.5000
Why is this special?

When z = 0, we calculate math.exp(-0) which equals math.exp(0). Any number raised to the power of 0 equals 1, so $e^0 = 1$.

Our denominator becomes 1 + 1 = 2.


The final result is 1 / 2 = 0.5 exactly. This is the decision boundary! A probability of 0.5 means the model is completely uncertain. It's a coin flip between class 0 and class 1.

Step 4: Test with Positive Input (z = 5)

When z is positive, the sigmoid outputs a value close to 1. This represents high probability of belonging to class 1.

# When z = 5 (positive input)
z = 5
exp_term = math.exp(-z)  # e^-5 ≈ 0.0067
result = 1 / (1 + exp_term)

print(f"z = {z}: sigmoid = {result:.4f}")
# Output: z = 5: sigmoid = 0.9933
What happens with positive z?

When z = 5, we calculate math.exp(-5). This is $e$ raised to the power of -5, which gives us a very small number: approximately 0.0067.

Our denominator becomes 1 + 0.0067 = 1.0067. Notice how adding such a tiny number barely changes 1.


The final result is 1 / 1.0067 = 0.9933. This is very close to 1, meaning the model is highly confident this sample belongs to class 1. The larger the positive z, the closer the sigmoid gets to 1.

Negative z

σ(z) → 0
Low probability

Zero z

σ(z) = 0.5
Decision boundary

Positive z

σ(z) → 1
High probability

The sigmoid function has a beautiful S-shaped curve. Small changes in input near z=0 cause large changes in output, while extreme values flatten out. This makes the sigmoid ideal for representing probabilities.
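We can verify this numerically: the same step of 0.5 in z moves the output far more near the center than out in the flat tails (a quick sketch):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# The same 0.5 step in z produces very different changes in output
for z0 in [0.0, 4.0]:
    delta = sigmoid(z0 + 0.5) - sigmoid(z0)
    print(f"z: {z0} -> {z0 + 0.5}: sigmoid changes by {delta:.4f}")
```

Near z = 0 the step changes the output by roughly 0.12; out at z = 4 the same step changes it by less than 0.01.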

Visualizing the Sigmoid Function

A picture is worth a thousand words. Let's visualize the sigmoid curve to understand its behavior better.

Setting Up the Visualization

We'll use NumPy for efficient calculations and Matplotlib for plotting. NumPy's version of sigmoid can handle arrays of values at once.

# Import visualization libraries
import numpy as np
import matplotlib.pyplot as plt

# Define sigmoid for arrays (NumPy version)
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
Generate Input Values

We create 100 evenly spaced points between -5 and 5 to get a smooth curve.

# Generate input values from -5 to 5
z = np.linspace(-5, 5, 100)

# Calculate sigmoid for all values at once
sigma_z = sigmoid(z)

print(f"Input range: {z.min():.1f} to {z.max():.1f}")
print(f"Output range: {sigma_z.min():.4f} to {sigma_z.max():.4f}")
Create the Plot

Now we plot the S-curve with helpful reference lines at the decision boundary (0.5) and at z=0.

# Create the visualization
plt.figure(figsize=(10, 6))
plt.plot(z, sigma_z, 'b-', linewidth=2, label='σ(z)')

# Add reference lines
plt.axhline(y=0.5, color='r', linestyle='--', alpha=0.5, label='Threshold')
plt.axvline(x=0, color='gray', linestyle='-', alpha=0.3)

# Customize appearance
plt.xlabel('z (input)', fontsize=12)
plt.ylabel('σ(z) (probability)', fontsize=12)
plt.title('The Sigmoid Function', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)
plt.ylim(-0.1, 1.1)
plt.show()
What to notice: The curve is steepest at z=0, meaning small changes in input cause big changes in probability near the decision boundary. At extreme values, the curve flattens out (saturates), meaning the model becomes very confident in its prediction.
# Print sigmoid values across a range of inputs
for z_val in [-5, -3, -1, 0, 1, 3, 5]:
    print(f"σ({z_val:2}) = {sigmoid(z_val):.4f}")

From Linear to Logistic Model

In Linear Regression, we computed a weighted sum of input features:

Linear Regression

Linear Combination

$$y = w_0 + w_1x_1 + w_2x_2 + \cdots + w_nx_n$$

Where: $w_0$ is the intercept (bias), $w_1, w_2, \ldots, w_n$ are the weights for each feature, and $x_1, x_2, \ldots, x_n$ are the input features.

In Logistic Regression, we wrap this linear combination in the sigmoid function:

Logistic Regression

Probability Model

$$P(y=1|x) = \sigma(w_0 + w_1x_1 + w_2x_2 + \cdots + w_nx_n)$$

Result: This outputs the probability that $y=1$ given the input features $x$. The sigmoid $\sigma$ squashes any value into the range [0, 1].

The sigmoid function ensures that the output is always a valid probability between 0 and 1, regardless of how large or small the linear combination becomes.

Let's see how this works with a real example. Imagine we're predicting whether a student will pass an exam based on two features: hours studied and previous test score.

Step 1: Set Up the Model

First, we import NumPy and define our sigmoid function. Then we set up the model weights. In a real scenario, these weights would be learned from training data, but here we'll use pre-defined values.

# Import library and define sigmoid
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))
What is NumPy?

NumPy is Python's powerful numerical computing library. We use np.exp() instead of math.exp() because NumPy can handle arrays of numbers at once, which is essential for machine learning.

Step 2: Define Model Weights

The model has three weights: an intercept (bias) and one weight for each feature. These weights determine how much each feature influences the prediction.

# Model weights (learned from training data)
w0 = -4.0   # Intercept (bias)
w1 = 0.5    # Weight for hours studied
w2 = 0.3    # Weight for previous score

print("Model Weights:")
print(f"  w0 (intercept): {w0}")
print(f"  w1 (hours studied): {w1}")
print(f"  w2 (previous score): {w2}")
Understanding the Weights

Intercept (w0 = -4.0): This negative value means a student starts with a disadvantage. They need positive contributions from studying and previous scores to overcome this.

Hours Studied (w1 = 0.5): Each hour of study adds 0.5 to the linear combination. More study time increases the probability of passing.

Previous Score (w2 = 0.3): Each point in previous score adds 0.3. Past performance also helps predict success.
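To make this concrete, here is a small sketch using the same weights (the previous score of 10 is an arbitrary choice to keep z in the steep region of the curve). Each extra hour always adds exactly w1 = 0.5 to z, but the resulting change in probability depends on where we sit on the sigmoid:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

w0, w1, w2 = -4.0, 0.5, 0.3  # weights from the example above

score = 10  # arbitrary low score, chosen so z stays near the steep region
for hours in [4, 5, 6]:
    z = w0 + w1 * hours + w2 * score
    print(f"hours={hours}: z={z:.1f}, P(pass)={sigmoid(z):.4f}")
```

Each hour adds the same 0.5 to z, yet the probability gains shrink as z grows: the curve is steepest near z = 0 and flattens as confidence increases.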

Step 3: Predict for Student 1 (Good Student)

Let's predict whether a student who studied 8 hours and had a previous score of 70 will pass.

# Student 1: studied 8 hours, previous score 70
hours_studied = 8
previous_score = 70

# Step A: Calculate linear combination
z = w0 + w1*hours_studied + w2*previous_score
print(f"Linear combination: {w0} + {w1}*{hours_studied} + {w2}*{previous_score}")
print(f"z = {z}")

# Step B: Apply sigmoid
probability = sigmoid(z)
print(f"Probability = sigmoid({z}) = {probability:.4f}")

# Step C: Make prediction
prediction = "PASS" if probability >= 0.5 else "FAIL"
print(f"Prediction: {prediction} ({probability*100:.1f}% confidence)")
How did we get this result?

Step A - Linear Combination: We calculate $z = -4.0 + 0.5 \times 8 + 0.3 \times 70 = -4.0 + 4.0 + 21.0 = 21.0$

Step B - Apply Sigmoid: We plug z=21 into the sigmoid. Since 21 is a large positive number, sigmoid(21) is very close to 1, approximately 0.9999.


Step C - Prediction: Since 0.9999 is greater than 0.5, we predict PASS. The model is 99.99% confident this student will pass!

Step 4: Predict for Student 2 (Struggling Student)

Now let's predict for a student who only studied 2 hours and had a previous score of 40.

# Student 2: studied 2 hours, previous score 40
hours_studied = 2
previous_score = 40

# Calculate linear combination
z = w0 + w1*hours_studied + w2*previous_score
print(f"Linear combination: {w0} + {w1}*{hours_studied} + {w2}*{previous_score}")
print(f"z = {z}")

# Apply sigmoid
probability = sigmoid(z)
print(f"Probability = sigmoid({z}) = {probability:.4f}")

# Make prediction
prediction = "PASS" if probability >= 0.5 else "FAIL"
print(f"Prediction: {prediction} ({probability*100:.1f}% confidence)")
Does this student fail?

Linear Combination: $z = -4.0 + 0.5 \times 2 + 0.3 \times 40 = -4.0 + 1.0 + 12.0 = 9.0$

Wait, z=9 is positive! Yes, but let's see: sigmoid(9) = 0.9999. Actually, this student would also pass!


The Key Insight: The linear combination needs to be negative for the sigmoid to output less than 0.5. With w2=0.3, even a modest previous score contributes significantly. This shows why choosing the right weights during training is crucial.

Student 1
  • Hours Studied: 8
  • Previous Score: 70
  • Linear Combination (z): 21.0
  • Probability: 99.99%
  • Prediction: PASS
Student 2
  • Hours Studied: 2
  • Previous Score: 40
  • Linear Combination (z): 9.0
  • Probability: 99.99%
  • Prediction: PASS
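A short sketch (using the same illustrative weights) makes the key insight concrete: with a previous score of 40, z = -4 + 0.5*hours + 12 stays positive no matter how little the student studies. A FAIL prediction requires z < 0, i.e. 0.5*hours + 0.3*score < 4:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

w0, w1, w2 = -4.0, 0.5, 0.3

# With previous_score = 40: z = -4 + 0.5*hours + 12 >= 8, positive even at 0 hours
print("score=40, hours=0 ->", round(sigmoid(w0 + w2 * 40), 4))

# FAIL needs z < 0; e.g. a low previous score of 10 and no studying gives z = -1
print("score=10, hours=0 ->", round(sigmoid(w0 + w2 * 10), 4))
```

With these weights, only students with very weak prior scores can land below the 0.5 boundary; training on real data would produce weights that separate the classes more sensibly.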

Understanding the Sigmoid Mathematically

Let's break down the sigmoid function to understand why it has its special S-shape. The key is the exponential term $e^{-z}$. We'll visualize each component step by step.

What You'll Learn

By the end of this section, you'll understand:

  • What each part of the sigmoid formula does
  • Why the curve has its characteristic S-shape
  • How inputs get transformed into probabilities
  • Why sigmoid is perfect for classification problems
Step 1: Set Up the Functions

Before we can visualize anything, we need to import our tools and define our functions. Think of this as gathering your ingredients before cooking.

# Import libraries
import numpy as np
import matplotlib.pyplot as plt
What are these libraries?

NumPy (np): A library for working with numbers and arrays. It lets us do math on many numbers at once, which is much faster than using regular Python loops.

Matplotlib (plt): A library for creating charts and graphs. We use it to visualize our data and understand patterns visually.

Now let's define our two functions. We'll create them separately so we can examine each piece:

# Define sigmoid function
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Define just the exponential component
def exp_term(z):
    return np.exp(-z)
Understanding the Building Blocks

exp_term(z): This function calculates $e^{-z}$. The letter "e" is a special mathematical constant (approximately 2.71828), and we're raising it to the power of negative z. This is called the "exponential component" of sigmoid.

sigmoid(z): This is our complete sigmoid function. It takes the exponential term, adds 1 to it, and then divides 1 by the whole thing. The formula is: $\sigma(z) = \frac{1}{1 + e^{-z}}$


Why two functions? By separating them, we can visualize each piece and understand how they combine to create the S-curve. It's like understanding how flour, eggs, and sugar combine to make a cake!

Step 2: Generate Input Values

To draw a smooth curve, we need many points. Instead of typing each number manually, we use NumPy's linspace function to generate them automatically.

# Generate 100 points from -5 to 5
z = np.linspace(-5, 5, 100)

print(f"First 5 values: {z[:5]}")
print(f"Last 5 values: {z[-5:]}")
What does linspace do?

np.linspace(-5, 5, 100) creates exactly 100 numbers, evenly spaced between -5 and 5.

It's like marking 100 equally-spaced tick marks on a ruler from -5 to 5. The first mark is at -5, the last is at 5, and all the others are spread evenly in between.

Why 100 points? More points = smoother curve. 100 is usually enough to look smooth without being slow to compute.

Step 3: Visualize the Exponential Component

The exponential term $e^{-z}$ is the secret ingredient that gives sigmoid its special properties. Let's see what it looks like when we plot it.

# Plot the exponential component
plt.figure(figsize=(8, 5))
plt.plot(z, exp_term(z), 'r-', linewidth=2)
plt.title('Component: exp(-z)', fontsize=14)
plt.xlabel('z', fontsize=12)
plt.ylabel('exp(-z)', fontsize=12)
plt.grid(True, alpha=0.3)
plt.ylim(0, 150)
plt.show()
Understanding the Plotting Code

plt.figure(figsize=(8, 5)): Creates a new figure that's 8 inches wide and 5 inches tall

plt.plot(z, exp_term(z), 'r-', linewidth=2): Draws a red solid line ('r-') connecting all our points. The linewidth=2 makes it thicker and easier to see.

plt.grid(True, alpha=0.3): Adds a light grid in the background to help read values. alpha=0.3 makes it semi-transparent.

What do we see in the graph?

The graph shows a curve that shoots up dramatically on the left and flattens out on the right. Here's why:

When z is negative (left side of graph): The negative sign in $e^{-z}$ flips the sign, so $e^{-(-5)} = e^{5}$. This is a positive exponent, and $e^{5} \approx 148$ - a huge number!

When z is zero (middle): $e^{-0} = e^{0} = 1$. Any number raised to power 0 equals 1.

When z is positive (right side): $e^{-5} \approx 0.007$ - a tiny number close to zero. The larger z gets, the closer this value gets to 0.

Step 4: Visualize the Denominator

Now we add 1 to the exponential term. This might seem like a small change, but it's actually very important! This sum becomes the denominator of our sigmoid formula: $1 + e^{-z}$.

# Plot the denominator
plt.figure(figsize=(8, 5))
plt.plot(z, 1 + exp_term(z), 'g-', linewidth=2)
plt.title('Component: 1 + exp(-z)', fontsize=14)
plt.xlabel('z', fontsize=12)
plt.ylabel('1 + exp(-z)', fontsize=12)
plt.grid(True, alpha=0.3)
plt.show()
Why do we add 1? This is the key insight!

Adding 1 guarantees that the denominator is always greater than 1. Since we're going to divide 1 by this denominator, this ensures our final answer is always less than 1.

Let's trace through what happens:

When z = -5 (very negative):

  • Exponential: $e^{5} \approx 148$
  • Denominator: $1 + 148 = 149$
  • Final sigmoid: $\frac{1}{149} \approx 0.007$ (close to 0!)

When z = 5 (very positive):

  • Exponential: $e^{-5} \approx 0.007$
  • Denominator: $1 + 0.007 = 1.007$
  • Final sigmoid: $\frac{1}{1.007} \approx 0.993$ (close to 1!)
Step 5: The Final Sigmoid Curve

Now for the grand finale! We divide 1 by the denominator to get the sigmoid. This creates the famous S-shaped curve that makes logistic regression work.

# Plot the final sigmoid
plt.figure(figsize=(8, 5))
plt.plot(z, sigmoid(z), 'b-', linewidth=2)
plt.axhline(y=0.5, color='r', linestyle='--', alpha=0.5)
plt.axvline(x=0, color='gray', linestyle='-', alpha=0.3)
plt.title('Result: σ(z) = 1 / (1 + exp(-z))', fontsize=14)
plt.xlabel('z', fontsize=12)
plt.ylabel('σ(z)', fontsize=12)
plt.grid(True, alpha=0.3)
plt.ylim(-0.1, 1.1)
plt.show()
What do the extra lines mean?

plt.axhline(y=0.5, ...): Draws a horizontal line at y=0.5. This is our decision boundary - the point where the model is 50% uncertain.

plt.axvline(x=0, ...): Draws a vertical line at x=0. This helps us see that when z=0, the sigmoid output is exactly 0.5.

The Beautiful S-Curve Explained

Look at the shape - it looks like a stretched letter "S" lying on its side! Here's what's happening at each part:

Left flat region (z < -3): Output is nearly 0. The model is very confident this is NOT class 1. No matter how much more negative z gets, the output stays near 0.

Steep middle region (-3 < z < 3): This is where the action happens! Small changes in z cause big changes in probability. This is the "decision zone" where the model is less certain.

Right flat region (z > 3): Output is nearly 1. The model is very confident this IS class 1. No matter how much more positive z gets, the output stays near 1.


Key insight: The curve never actually reaches 0 or 1 - it just gets infinitely close. This is mathematically convenient and reflects real-world uncertainty: we can never be 100% certain about predictions!
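We can check this numerically, at least up to the limits of floating-point precision:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# The output keeps gaining 9s but stays strictly below 1
for z in [10, 20, 30]:
    print(f"sigmoid({z}) = {sigmoid(z):.15f}")

print(sigmoid(30) < 1.0)  # still strictly below 1 in double precision
```

One caveat: for very large z (roughly z > 37 in 64-bit floats), the tiny $e^{-z}$ term rounds away entirely and the computed value becomes exactly 1.0, even though mathematically the sigmoid never reaches it.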

Step 6: See the Transformation in Action

Let's see exactly how different input values get transformed into probabilities. This will help you build intuition about what the sigmoid is doing.

# Show the transformation
print("Linear output -> Sigmoid output (probability)")
print("-" * 40)

for linear_output in [-10, -5, 0, 5, 10]:
    prob = sigmoid(linear_output)
    print(f"  z = {linear_output:3} -> σ(z) = {prob:.6f}")
Understanding the Code

for linear_output in [-10, -5, 0, 5, 10]: We loop through 5 test values, from very negative to very positive.

{prob:.6f}: This formats the probability to show 6 decimal places, so we can see how close values get to 0 or 1.

Here are the results in a nice table:

Input (z) | Sigmoid Output | As Percentage | Interpretation
----------|----------------|---------------|------------------------------------------
   -10    | 0.000045       | 0.0045%       | Almost certainly Class 0
    -5    | 0.006693       | 0.67%         | Very likely Class 0
     0    | 0.500000       | 50%           | Completely uncertain (decision boundary)
     5    | 0.993307       | 99.33%        | Very likely Class 1
    10    | 0.999955       | 99.9955%      | Almost certainly Class 1
Key Observations for Beginners

1. Notice the symmetry: z = -5 gives 0.67%, and z = +5 gives 99.33%. These add up to 100%! This is because sigmoid has a special property: $\sigma(-z) = 1 - \sigma(z)$

2. The decision point: When z = 0, we get exactly 50%. This is why z = 0 is called the "decision boundary" - it's where we're equally unsure about both classes.

3. Extreme values saturate: At z = 10, we're already at 99.9955%. Going to z = 100 wouldn't add much more confidence. The curve "saturates" at the extremes.

Key Property: The sigmoid function is symmetric around the point (0, 0.5). This means σ(-z) = 1 - σ(z). This symmetry is useful for mathematical derivations in Logistic Regression.
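The symmetry property follows in a few steps by multiplying the numerator and denominator by $e^{-z}$:

$$\sigma(-z) = \frac{1}{1 + e^{z}} = \frac{e^{-z}}{e^{-z} + 1} = \frac{(1 + e^{-z}) - 1}{1 + e^{-z}} = 1 - \frac{1}{1 + e^{-z}} = 1 - \sigma(z)$$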

Practice Questions: Sigmoid Function

Test your understanding with these coding challenges.

Task: Calculate σ(0) by hand using the formula σ(z) = 1 / (1 + e^-z), then verify with Python.

Expected output: 0.5

Show Solution
import math

# σ(0) = 1 / (1 + e^-0)
#      = 1 / (1 + e^0)
#      = 1 / (1 + 1)
#      = 1 / 2
#      = 0.5

z = 0
sigma = 1 / (1 + math.exp(-z))
print(f"σ(0) = {sigma}")  # 0.5

Given:

test_values = [-5, -2, 0, 2, 5]

Task: Implement sigmoid function and verify that σ(-z) = 1 - σ(z) for each test value.

Show Solution
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

test_values = [-5, -2, 0, 2, 5]

print("Testing σ(-z) = 1 - σ(z):")
for z in test_values:
    sig_z = sigmoid(z)
    sig_neg_z = sigmoid(-z)
    one_minus = 1 - sig_z
    match = np.isclose(sig_neg_z, one_minus)
    print(f"z={z:2}: σ(-z)={sig_neg_z:.4f}, 1-σ(z)={one_minus:.4f}, Match: {match}")

Task: The derivative of sigmoid is σ'(z) = σ(z) * (1 - σ(z)). Implement both sigmoid and its derivative, then find the maximum value of σ'(z).

Expected: Maximum derivative is 0.25 at z=0

Show Solution
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(z):
    sig = sigmoid(z)
    return sig * (1 - sig)

z = np.linspace(-5, 5, 100)

plt.figure(figsize=(10, 5))
plt.plot(z, sigmoid(z), 'b-', label='σ(z)')
plt.plot(z, sigmoid_derivative(z), 'r-', label="σ'(z)")
plt.axhline(y=0.25, color='g', linestyle='--', alpha=0.5)
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print(f"Maximum derivative: σ'(0) = {sigmoid_derivative(0):.4f}")
03

Decision Boundaries

While Logistic Regression outputs probabilities, we often need to make a definitive class prediction: yes or no, spam or not spam, disease or healthy. This is where the decision boundary comes in. The decision boundary is a threshold that separates the two classes in our classification problem. Understanding how to set and interpret this boundary is crucial for building effective classifiers.

The 0.5 Probability Threshold

Default Decision Boundary: 0.5

By default, Logistic Regression uses a decision boundary of 0.5. This means:

Positive Class

If P(y=1|x) ≥ 0.5, predict class 1

Negative Class

If P(y=1|x) < 0.5, predict class 0

Why 0.5? This threshold is natural because it's the equilibrium point of the sigmoid function - where both classes are equally likely. However, this default isn't always optimal, especially for imbalanced datasets or when different misclassification costs apply.
# Understanding the 0.5 Threshold: Step by Step
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Simulate predictions for 10 samples
np.random.seed(42)
probabilities = np.random.uniform(0, 1, 10)

print("Decision Boundary at 0.5:")
print("="*50)
print(f"{'Sample':<10}{'Probability':<15}{'Prediction':<12}{'Reason'}")
print("-"*50)

for i, prob in enumerate(probabilities):
    if prob >= 0.5:
        prediction = "Class 1"
        reason = f"{prob:.3f} >= 0.5"
    else:
        prediction = "Class 0"
        reason = f"{prob:.3f} < 0.5"
    
    print(f"{i+1:<10}{prob:<15.4f}{prediction:<12}{reason}")

# Summary
class_1_count = sum(1 for p in probabilities if p >= 0.5)
class_0_count = len(probabilities) - class_1_count
print("-"*50)
print(f"Total Class 0: {class_0_count}, Class 1: {class_1_count}")
Understanding the Code: Step-by-Step Breakdown

This code demonstrates exactly how logistic regression makes the final decision to classify something as Class 0 or Class 1. Let's break it down piece by piece:

1. The Sigmoid Function

def sigmoid(z): return 1 / (1 + np.exp(-z))

This is the mathematical function that converts any number (positive, negative, large, or small) into a probability between 0 and 1. For example, if z = 2, sigmoid gives us 0.88 (88% probability). If z = -2, we get 0.12 (12% probability). This ensures our predictions are always valid probabilities.

2. Generating Sample Probabilities

probabilities = np.random.uniform(0, 1, 10)

Here we create 10 random probability values between 0 and 1 to simulate what a trained logistic regression model might output for 10 different samples. In real scenarios, these would come from your model after it processes the input features. For example, for an email, it might output 0.85 (85% probability it's spam).

3. Applying the Decision Rule

if prob >= 0.5: prediction = "Class 1"

This is the critical step where we convert probabilities to actual predictions. The rule is simple: if the probability is 0.5 or higher, we predict Class 1; otherwise, we predict Class 0. So a probability of 0.51 becomes "Class 1", while 0.49 becomes "Class 0". This is called the decision threshold.

4. Displaying Results

The code prints a formatted table showing each sample's probability, what class it was assigned to, and the mathematical reason (comparison with 0.5). This helps you see exactly how the threshold works for each case.

5. Summary Statistics

class_1_count = sum(1 for p in probabilities if p >= 0.5)

Finally, we count how many samples ended up in each class. This gives you a quick overview: out of 10 samples, maybe 6 were classified as Class 1 and 4 as Class 0.

Why This Matters for Beginners:

Understanding this threshold is crucial because it directly affects your model's behavior. Consider these real-world examples:

  • Email Spam Filter: A probability of 0.49 (49% spam) would be classified as "Not Spam" using the 0.5 threshold. But is that safe? If catching spam matters most, you might lower the threshold to 0.3, accepting that more legitimate email will be flagged.
  • Disease Detection: A probability of 0.51 (51% disease) would be classified as "Disease Present". But in medical scenarios, you might want a higher threshold like 0.8 to avoid false alarms that could worry patients unnecessarily.
  • Credit Approval: The 0.5 threshold might approve too many risky loans. Banks might use 0.7 or higher to be more conservative.

Key Takeaway: The 0.5 threshold is just the default starting point. As you gain experience, you'll learn to adjust it based on your specific problem, the costs of different types of errors, and your business requirements. The code above shows you exactly how this threshold affects every single prediction your model makes.

Decision Boundary (1D Case)

Predicted class = 1 if σ(z) >= threshold, else 0
where z = w₀ + w₁x is the linear combination
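In the 1D case the boundary is a single point where z crosses zero, namely x = -w₀/w₁. A quick sketch with assumed example weights (w0 = -2.0, w1 = 0.8, not taken from any trained model):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Assumed example parameters for a 1D model
w0, w1 = -2.0, 0.8

# Boundary: z = w0 + w1*x = 0  ->  x = -w0/w1
x_boundary = -w0 / w1
print(f"Decision boundary at x = {x_boundary:.2f}")   # x = 2.50

# At the boundary, sigmoid(z) is exactly 0.5
print(f"sigmoid at boundary: {sigmoid(w0 + w1 * x_boundary):.2f}")  # 0.50

# Points on either side of the boundary fall into different classes
for x in [1.0, 2.5, 4.0]:
    p = sigmoid(w0 + w1 * x)
    print(f"x={x}: p={p:.3f} -> class {int(p >= 0.5)}")
```

Everything left of x = 2.5 is predicted Class 0, everything to the right Class 1.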

Geometric Interpretation

In higher dimensions, the decision boundary is a line or hyperplane that separates the feature space into regions for each class. Let's visualize this with a 2D example.

Step 1: Define Model Parameters
import numpy as np

# Trained model parameters (example)
w0 = -3.0    # Intercept (bias)
w1 = 1.5     # Weight for feature x1
w2 = 2.0     # Weight for feature x2

We start by defining the parameters that our logistic regression model has learned during training. w0 is the intercept (like the y-intercept in a line equation). w1 and w2 are the weights that tell us how important each feature is. In this example, feature x2 (weight = 2.0) has more influence than feature x1 (weight = 1.5) on the prediction.

Step 2: Understand the Linear Combination
print("Linear combination: z = w0 + w1*x1 + w2*x2")
print(f"z = {w0} + {w1}*x1 + {w2}*x2")

# Output: z = -3.0 + 1.5*x1 + 2.0*x2

The linear combination z is the raw output before we apply the sigmoid function. It's calculated by multiplying each feature by its weight, adding them together, and adding the intercept. For example, if x1=2 and x2=1, then z = -3.0 + 1.5(2) + 2.0(1) = -3.0 + 3.0 + 2.0 = 2.0. This z value will then be passed through the sigmoid function to get a probability.
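Running that worked example end to end, with the parameters defined above:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Parameters and sample point from the example above
w0, w1, w2 = -3.0, 1.5, 2.0
x1, x2 = 2, 1

z = w0 + w1 * x1 + w2 * x2
p = sigmoid(z)
print(f"z = {z}")                 # z = 2.0
print(f"sigmoid(z) = {p:.3f}")    # 0.881 -> predict Class 1
```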

Step 3: Find Where the Decision Boundary Occurs
# Decision boundary occurs where sigmoid(z) = 0.5
# This happens when z = 0
print("Decision boundary is where z = 0 (sigmoid = 0.5)")
print(f"{w0} + {w1}*x1 + {w2}*x2 = 0")

# Output: -3.0 + 1.5*x1 + 2.0*x2 = 0

The decision boundary is the special line where the probability equals exactly 0.5. Mathematically, sigmoid(0) = 0.5, so the boundary occurs where z = 0. Any point on this line will have a 50-50 chance of being in either class. Points on one side of the line will have z > 0 (probability > 0.5, predict Class 1), and points on the other side will have z < 0 (probability < 0.5, predict Class 0).

Step 4: Solve for the Boundary Line Equation
# Solve for x2 in terms of x1
print("Solving for x2:")
print(f"x2 = ({-w0} - {w1}*x1) / {w2}")
print(f"x2 = {-w0/w2:.2f} - {w1/w2:.2f}*x1")

# Output: x2 = 1.50 - 0.75*x1

By rearranging the equation -3.0 + 1.5*x1 + 2.0*x2 = 0, we solve for x2 to get the standard form of a line: x2 = 1.50 - 0.75*x1. This is just like y = mx + b in algebra! The slope is -0.75 and the intercept on the x2 axis is 1.50. This equation tells us: for any value of x1, we can calculate the corresponding x2 value that lies exactly on the decision boundary. This line divides our 2D feature space into two regions.

Step 5: Verify with Actual Points
# Verify with some points on the boundary
print("Points on the decision boundary:")
for x1 in [0, 1, 2, 3]:
    x2 = (-w0 - w1*x1) / w2
    z = w0 + w1*x1 + w2*x2
    print(f"  x1={x1}: x2={x2:.2f} -> z={z:.4f}")

# Output:
#   x1=0: x2=1.50 -> z=0.0000
#   x1=1: x2=0.75 -> z=0.0000
#   x1=2: x2=0.00 -> z=0.0000
#   x1=3: x2=-0.75 -> z=0.0000

We test several points to confirm they lie on the boundary. For each x1 value (0, 1, 2, 3), we calculate the corresponding x2 using our boundary equation, then verify that z = 0 for all these points. Notice that z is always 0.0000 (or very close), confirming these points are on the boundary.

Real-world meaning: If you had a customer with features (x1=2, x2=0.00), the model would give exactly 50% probability. If x2 is higher than 0.00 (above the line), the model predicts Class 1. If x2 is lower (below the line), it predicts Class 0. This geometric visualization helps you understand how your model makes decisions based on feature values.
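We can confirm this behavior numerically with the same weights, checking a point above, on, and below the boundary at x1 = 2:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Same hand-picked weights as in the example above
w0, w1, w2 = -3.0, 1.5, 2.0

# At x1 = 2, the boundary sits at x2 = 0.00
for x2 in [0.5, 0.0, -0.5]:   # above, on, and below the line
    z = w0 + w1 * 2 + w2 * x2
    p = sigmoid(z)
    side = "above" if x2 > 0 else "on" if x2 == 0 else "below"
    # note: p = 0.5 exactly on the line; >= 0.5 rounds to Class 1
    print(f"x2={x2:+.1f} ({side} the line): p={p:.3f} -> Class {int(p >= 0.5)}")
```

Above the line the probability exceeds 0.5 (Class 1); below it, the probability drops under 0.5 (Class 0).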

Step 6: Visualize the Decision Boundary
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_blobs

# Create synthetic 2D dataset
np.random.seed(42)
X, y = make_blobs(n_samples=100, n_features=2, 
                  centers=2, random_state=42)

# Train model
model = LogisticRegression()
model.fit(X, y)

We create a synthetic dataset with 100 samples and 2 features using make_blobs, which generates clusters of points (like two groups of customers with different characteristics). Then we train a logistic regression model on this data. The model will learn the optimal weights to separate these two clusters.

Step 7: Create a Mesh Grid for Visualization
# Create mesh grid covering the feature space
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                      np.linspace(y_min, y_max, 100))

# Get probability predictions for every point
Z = model.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
Z = Z.reshape(xx.shape)

To visualize the decision boundary, we need to check the model's prediction at many points across the entire feature space. meshgrid creates a grid of 10,000 points (100×100) covering the range of our data. For each point, we calculate the probability of Class 1. This creates a "probability map" where we can see how predictions change across the space. Areas with high probabilities will appear in one color, low probabilities in another, and the boundary (0.5) will be the line between them.

Step 8: Plot Everything Together
# Create the plot
fig, ax = plt.subplots(figsize=(10, 8))

# Color regions by predicted class
ax.contourf(xx, yy, Z, levels=[0, 0.5, 1], 
           colors=['lightblue', 'lightcoral'], alpha=0.6)

# Draw decision boundary (0.5 probability line)
ax.contour(xx, yy, Z, levels=[0.5], 
          colors='black', linewidths=2)

# Add probability contour lines
contours = ax.contour(xx, yy, Z, 
                     levels=[0.1, 0.3, 0.7, 0.9], 
                     colors='gray', alpha=0.4)
ax.clabel(contours, inline=True, fontsize=8)

# Plot actual data points
ax.scatter(X[y==0, 0], X[y==0, 1], c='blue', 
          marker='o', s=100, label='Class 0')
ax.scatter(X[y==1, 0], X[y==1, 1], c='red', 
          marker='s', s=100, label='Class 1')

ax.set_xlabel('Feature 1', fontsize=12)
ax.set_ylabel('Feature 2', fontsize=12)
ax.set_title('Decision Boundary with Probability Contours', 
            fontsize=14)
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

This code creates a comprehensive visualization with multiple layers:

  • Background colors: Light blue for regions where the model predicts Class 0, light coral for Class 1 regions
  • Black line: The decision boundary (probability = 0.5) that separates the two classes
  • Gray contour lines: Show probability levels (0.1, 0.3, 0.7, 0.9). Points near these lines have those exact probabilities
  • Data points: Blue circles are actual Class 0 samples, red squares are Class 1 samples

How to read this plot: The further a point is from the decision boundary, the more confident the model is in its prediction. Points very close to the boundary (near the black line) have probabilities close to 0.5, meaning the model is uncertain. Points far from the boundary have probabilities close to 0 or 1, meaning the model is very confident.
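The distance-confidence relationship can be checked numerically. A small sketch using the hand-picked weights from earlier (not the model trained above): stepping away from a boundary point along the boundary's normal direction steadily raises the predicted probability.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Assumed weights; z grows in proportion to the signed distance from the boundary
w = np.array([1.5, 2.0])
b = -3.0

# Unit vector perpendicular to the boundary, and a point on the boundary (z = 0)
n = w / np.linalg.norm(w)
boundary_point = np.array([2.0, 0.0])

for d in [0.0, 0.5, 1.0, 2.0]:
    point = boundary_point + d * n   # d units away from the boundary
    p = sigmoid(w @ point + b)
    print(f"distance {d:.1f}: P(class 1) = {p:.3f}")
# probabilities rise from 0.500 toward 1.0 as distance grows
```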

Adjusting the Threshold

Different problems require different thresholds. A medical diagnostic test might need a lower threshold (more positive predictions, fewer false negatives) to catch all potential cases. A spam filter might prefer a higher threshold (fewer false positives) to avoid marking legitimate emails as spam.

Threshold Trade-off: Lowering the threshold catches more positive cases but increases false positives. Raising the threshold reduces false positives but misses more positive cases. This is the classic precision-recall trade-off.
Step 1: Setup the Scenario
import numpy as np

# Model probabilities for 10 patients (disease screening)
patient_probs = [0.15, 0.25, 0.35, 0.45, 0.55, 
                 0.65, 0.75, 0.85, 0.90, 0.95]

# Ground truth: who actually has the disease
actual_disease = [0, 0, 1, 0, 1, 1, 1, 1, 1, 1]

We simulate a disease detection scenario with 10 patients. The patient_probs list contains the model's predicted probability that each patient has the disease (ranging from 15% to 95%). The actual_disease list shows the ground truth: 0 means the patient is healthy, 1 means they actually have the disease. Notice that 7 of the 10 patients actually have the disease, but the model's probabilities vary: some diseased patients get low probabilities (patient 3 has a 35% probability but actually has the disease), while some healthy patients get higher ones (patient 4 has 45% but is healthy).

Step 2: Test Different Thresholds
print("Impact of Different Thresholds:")
print("="*60)

for threshold in [0.3, 0.5, 0.7]:
    # Apply threshold to convert probabilities to predictions
    predicted = [1 if p >= threshold else 0 
                 for p in patient_probs]
    
    print(f"\nThreshold: {threshold}")
    print(f"  Patients diagnosed: {sum(predicted)}")

We test three different thresholds: 0.3 (low), 0.5 (default), and 0.7 (high). For each threshold, we convert probabilities to binary predictions using the rule: if probability ≥ threshold, predict 1 (disease), else predict 0 (healthy). A lower threshold like 0.3 will diagnose more patients (catching more actual cases but also more false alarms), while a higher threshold like 0.7 will be more conservative (fewer diagnoses but might miss some actual cases).

Step 3: Calculate Performance Metrics
    # Count different types of outcomes
    tp = sum(1 for p, a in zip(predicted, actual_disease) 
             if p == 1 and a == 1)  # Correctly found disease
    
    fp = sum(1 for p, a in zip(predicted, actual_disease) 
             if p == 1 and a == 0)  # False alarm (healthy diagnosed as sick)
    
    fn = sum(1 for p, a in zip(predicted, actual_disease) 
             if p == 0 and a == 1)  # Missed case (sick patient missed)
    
    print(f"  True Positives: {tp} (correctly found disease)")
    print(f"  False Positives: {fp} (unnecessary worry)")
    print(f"  Missed Cases: {fn} (dangerous!)")

For each threshold, we calculate three critical metrics:

  • True Positives (TP): Patients who have disease AND were correctly diagnosed. This is what we want to maximize.
  • False Positives (FP): Healthy patients incorrectly diagnosed with disease. Causes unnecessary worry and treatment costs.
  • False Negatives (FN): Sick patients who were missed by the test. This is dangerous because they won't get treatment!

The threshold choice directly affects these numbers. Lower threshold → More TP but also more FP. Higher threshold → Fewer FP but more FN. In medical scenarios, missing a disease (FN) is often worse than a false alarm (FP), so lower thresholds are preferred.

Step 4: Interpret the Results
    if threshold == 0.3:
        print("  -> Low threshold: Catches more disease")
        print("     but more false alarms")
    elif threshold == 0.7:
        print("  -> High threshold: Fewer false alarms")
        print("     but may miss disease!")

# Example output:
# Threshold: 0.3 -> Diagnoses 8 patients, TP=7, FP=1, FN=0
# Threshold: 0.5 -> Diagnoses 6 patients, TP=6, FP=0, FN=1
# Threshold: 0.7 -> Diagnoses 4 patients, TP=4, FP=0, FN=3

Real-World Impact:

  • Threshold 0.3: Catches all 7 diseased patients but wrongly diagnoses 1 healthy person. Safe but causes 1 false alarm.
  • Threshold 0.5: Catches 6 diseased patients, no false alarms, but misses 1 sick patient. Balanced but risky.
  • Threshold 0.7: Only diagnoses 4 patients, misses 3 sick patients! Very dangerous despite no false alarms.

For disease detection, most doctors would choose threshold 0.3 to ensure no sick patients are missed, even if it means some healthy people get retested.

Step 5: Real Dataset Example - Load and Prepare
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load breast cancer dataset
# scikit-learn encodes 0 = malignant, 1 = benign, so flip the labels
# to match our convention of 1 = malignant (the positive class)
data = load_breast_cancer()
X, y = data.data[:, :2], 1 - data.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

Now we move to a real medical dataset: breast cancer detection. The dataset contains measurements from cell samples, and the target is whether the tumor is malignant (1) or benign (0). We use only the first 2 features for simplicity. We split the data into 70% training and 30% testing to evaluate how different thresholds perform on unseen data.

Step 6: Train Model and Get Probabilities
# Train logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Get probability predictions for test set
y_prob = model.predict_proba(X_test)[:, 1]

print(f"Model trained on {len(X_train)} samples")
print(f"Testing on {len(X_test)} samples")
print(f"Probability range: {y_prob.min():.3f} to {y_prob.max():.3f}")

After training, we get probability predictions for all test samples using predict_proba. The [:, 1] extracts probabilities for the positive class (malignant tumor). These probabilities range from near 0 (very confident it's benign) to near 1 (very confident it's malignant). We'll now test how different thresholds affect our diagnostic accuracy.

Step 7: Compare Threshold Performance
from sklearn.metrics import confusion_matrix

thresholds = [0.3, 0.5, 0.7]
print("Threshold Performance Comparison:")
print("-" * 60)

for threshold in thresholds:
    # Apply custom threshold
    y_pred = (y_prob >= threshold).astype(int)
    
    # Get confusion matrix
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    
    # Calculate metrics
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    
    print(f"\nThreshold: {threshold}")
    print(f"  TP={tp}, FP={fp}, FN={fn}, TN={tn}")
    print(f"  Precision: {precision:.3f}")
    print(f"  Recall: {recall:.3f}")

For each threshold, we evaluate the complete confusion matrix and calculate:

  • Precision: Of all patients we diagnosed as malignant, what percentage actually were? High precision = fewer false alarms.
  • Recall: Of all patients who actually had malignant tumors, what percentage did we catch? High recall = fewer missed cases.

You'll notice that as threshold increases, precision tends to go up (fewer false positives) but recall goes down (more false negatives). The optimal threshold depends on whether it's more important to avoid false alarms or to catch all cases. In cancer detection, high recall is usually prioritized because missing a malignant tumor is far more dangerous than a false positive that leads to additional testing.

Class Probability and Confidence

The closer a probability is to 0 or 1, the more confident the model is in its prediction. Probabilities near 0.5 indicate uncertainty. This information can be valuable for decision-making.

# Understanding Prediction Confidence
import numpy as np
from sklearn.linear_model import LogisticRegression

# Sample predictions with different confidence levels
probabilities = np.array([0.05, 0.15, 0.45, 0.51, 0.85, 0.95])
threshold = 0.5

print("Prediction Confidence Analysis:")
print("-" * 50)
for prob in probabilities:
    prediction = 1 if prob >= threshold else 0
    confidence = abs(prob - 0.5) * 2  # 0-1 scale: 0 = coin flip, 1 = certain

    print(f"Probability: {prob:.2f} -> Predict: {prediction}, Confidence: {confidence:.1%}")

This code demonstrates how to interpret model confidence from probability values. Let's break down what's happening:

The Probability Array:

We have 6 sample predictions ranging from very confident Class 0 (0.05) to very confident Class 1 (0.95). Notice probabilities 0.45 and 0.51 are close to the decision boundary (0.5), indicating uncertainty.

Confidence Calculation:

The formula confidence = abs(prob - 0.5) * 2 converts probability to a confidence score from 0% to 100%. Here's why this works:

  • Prob = 0.05: abs(0.05 - 0.5) * 2 = 0.9 → 90% confidence in Class 0
  • Prob = 0.50: abs(0.50 - 0.5) * 2 = 0.0 → 0% confidence (maximum uncertainty)
  • Prob = 0.95: abs(0.95 - 0.5) * 2 = 0.9 → 90% confidence in Class 1

Practical Interpretation:

  • High Confidence (prob near 0 or 1): Model is very sure. You can trust these predictions more. Example: prob=0.95 means "I'm 95% sure this is Class 1"
  • Low Confidence (prob near 0.5): Model is uncertain. May need human review or additional features. Example: prob=0.51 means "It's a coin flip, barely leaning toward Class 1"

Real-World Application: In fraud detection, you might automatically block high-confidence fraud cases (prob > 0.9), send low-confidence cases (0.4 < prob < 0.6) for manual review, and automatically approve high-confidence legitimate transactions (prob < 0.1). This three-tier system uses confidence to make smarter decisions than a simple yes/no classification.
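A minimal sketch of such a triage policy (the function name and threshold values are illustrative choices from the example above, not a standard API):

```python
def triage(prob, low=0.1, review_band=(0.4, 0.6), high=0.9):
    """Route a fraud probability to an action (assumed example thresholds)."""
    if prob > high:
        return "block"            # high-confidence fraud
    if review_band[0] < prob < review_band[1]:
        return "manual review"    # model is uncertain
    if prob < low:
        return "approve"          # high-confidence legitimate
    return "standard checks"      # everything in between

for p in [0.02, 0.50, 0.95, 0.25]:
    print(f"prob={p:.2f} -> {triage(p)}")
```

In practice the band boundaries would be tuned on validation data against the costs of each action.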

Practice Questions: Decision Boundaries

Test your understanding with these coding challenges.

Given:

probabilities = [0.2, 0.5, 0.7, 0.49]
threshold = 0.5

Task: Create predictions list using the threshold. Expected output: [0, 1, 1, 0]

Show Solution
probabilities = [0.2, 0.5, 0.7, 0.49]
threshold = 0.5

predictions = [1 if p >= threshold else 0 for p in probabilities]
print(f"Probabilities: {probabilities}")
print(f"Predictions:   {predictions}")  # [0, 1, 1, 0]

Given:

import numpy as np
from sklearn.metrics import confusion_matrix

y_prob = np.array([0.1, 0.3, 0.4, 0.6, 0.7, 0.9])
y_true = np.array([0, 0, 0, 1, 1, 1])

Task: Test thresholds [0.3, 0.5, 0.7] and print accuracy, TP, FP, FN, TN for each.

Show Solution
import numpy as np
from sklearn.metrics import confusion_matrix

y_prob = np.array([0.1, 0.3, 0.4, 0.6, 0.7, 0.9])
y_true = np.array([0, 0, 0, 1, 1, 1])

for threshold in [0.3, 0.5, 0.7]:
    y_pred = (y_prob >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    accuracy = (tp + tn) / len(y_true)
    print(f"Threshold {threshold}: Acc={accuracy:.1%}, TP={tp}, FP={fp}, FN={fn}, TN={tn}")

Task: Train a LogisticRegression model, generate precision-recall curve, and find the threshold that maximizes F1-score.

Show Solution
import numpy as np
from sklearn.metrics import precision_recall_curve
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=5, random_state=42)

model = LogisticRegression()
model.fit(X, y)
y_prob = model.predict_proba(X)[:, 1]

precision, recall, thresholds = precision_recall_curve(y, y_prob)
f1_scores = 2 * (precision[:-1] * recall[:-1]) / (precision[:-1] + recall[:-1] + 1e-10)
best_idx = np.argmax(f1_scores)
best_threshold = thresholds[best_idx]

print(f"Optimal threshold: {best_threshold:.4f}")
print(f"Best F1-score: {f1_scores[best_idx]:.4f}")
04

Model Training & Evaluation

Now that we understand the theory behind Logistic Regression, it's time to build real models. In this section, we'll learn how to train Logistic Regression models using scikit-learn, understand the optimization process, evaluate model performance, and interpret results. You'll go from theory to practical implementation that you can use immediately.

Cost Function: Log Loss

To train a Logistic Regression model, we need to minimize a cost function. Linear Regression uses Mean Squared Error (MSE), but Logistic Regression uses Log Loss (also called binary cross-entropy). This cost function penalizes confident wrong predictions heavily while rewarding confident correct predictions.

Log Loss (Binary Cross-Entropy)

$$J(w) = -\frac{1}{m} \sum_{i=1}^{m} [y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)})]$$

where m is the number of samples, y is the true label (0 or 1), and ŷ is the predicted probability.
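As a sanity check, the formula can be implemented in a single vectorized expression and compared against scikit-learn's log_loss (the labels and probabilities below are made-up examples):

```python
import numpy as np
from sklearn.metrics import log_loss

def binary_cross_entropy(y, y_hat, eps=1e-15):
    """J(w) from the formula above, averaged over m samples."""
    y_hat = np.clip(y_hat, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y_true = np.array([1, 0, 1, 1, 0])
y_prob = np.array([0.9, 0.1, 0.8, 0.3, 0.2])

ours = binary_cross_entropy(y_true, y_prob)
theirs = log_loss(y_true, y_prob)
print(f"Our log loss:     {ours:.6f}")
print(f"sklearn log_loss: {theirs:.6f}")  # the two should agree
```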

Elegant Interpretation of Log Loss

This formula has an elegant interpretation based on the true label:

When y = 1 (Positive Class)

Cost = -log(ŷ)

  • Prediction close to 1: Cost ≈ 0. Model is confident and correct → low penalty
  • Prediction close to 0: Cost → ∞. Model is confident but wrong → huge penalty!

When y = 0 (Negative Class)

Cost = -log(1 - ŷ)

  • Prediction close to 0: Cost ≈ 0. Model is confident and correct → low penalty
  • Prediction close to 1: Cost → ∞. Model is confident but wrong → huge penalty!

Why This Is Brilliant: The logarithm naturally creates an asymmetric penalty. Being slightly wrong (ŷ=0.4 when y=1) has moderate cost, but being confidently wrong (ŷ=0.01 when y=1) results in massive cost. This forces the model to be both accurate and appropriately confident.

# Understanding Log Loss: Why It Works
import numpy as np

def single_sample_loss(y_true, y_pred):
    """Calculate loss for a single sample"""
    epsilon = 1e-15  # Prevent log(0)
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    
    if y_true == 1:
        return -np.log(y_pred)
    else:
        return -np.log(1 - y_pred)

print("Log Loss Explained with Examples:")
print("="*60)

# Case 1: True label is 1 (positive class)
print("\nWhen TRUE LABEL = 1 (positive class):")
print("-"*40)
for pred in [0.99, 0.8, 0.5, 0.2, 0.01]:
    loss = single_sample_loss(1, pred)
    quality = "Excellent!" if loss < 0.1 else "Good" if loss < 0.5 else "Poor" if loss < 1 else "Terrible!"
    print(f"  Predict {pred:.2f} -> Loss: {loss:.4f} ({quality})")

# Case 2: True label is 0 (negative class)
print("\nWhen TRUE LABEL = 0 (negative class):")
print("-"*40)
for pred in [0.01, 0.2, 0.5, 0.8, 0.99]:
    loss = single_sample_loss(0, pred)
    quality = "Excellent!" if loss < 0.1 else "Good" if loss < 0.5 else "Poor" if loss < 1 else "Terrible!"
    print(f"  Predict {pred:.2f} -> Loss: {loss:.4f} ({quality})")

print("\nKey insight: Confident WRONG predictions are penalized heavily!")
Understanding the Code: Log Loss in Action

This code demonstrates exactly how log loss penalizes predictions based on how wrong they are. Let's break it down:

The Loss Function

The single_sample_loss function implements the two cases of log loss:

  • If y_true = 1: Returns -log(y_pred). The closer the prediction to 1, the smaller the loss.
  • If y_true = 0: Returns -log(1 - y_pred). The closer the prediction to 0, the smaller the loss.

Epsilon clipping: We clip predictions into the range [1e-15, 1 - 1e-15] to prevent log(0), which is undefined (the loss would be infinite).

Case 1: When True Label = 1

Testing predictions from 0.99 (very confident correct) to 0.01 (very confident wrong):

  • Predict 0.99: Loss ≈ 0.01 (Excellent!) - Nearly perfect prediction, minimal penalty
  • Predict 0.80: Loss ≈ 0.22 (Good) - Confident and correct, small penalty
  • Predict 0.50: Loss ≈ 0.69 (Poor) - Uncertain prediction, moderate penalty
  • Predict 0.20: Loss ≈ 1.61 (Terrible!) - Wrong direction, large penalty
  • Predict 0.01: Loss ≈ 4.61 (Terrible!) - Confidently wrong, massive penalty!

Notice how the loss grows faster and faster as the prediction moves away from the true value, blowing up as it approaches the wrong extreme.

Case 2: When True Label = 0

Testing predictions from 0.01 (very confident correct) to 0.99 (very confident wrong):

  • Predict 0.01: Loss ≈ 0.01 (Excellent!) - Correctly says "almost certainly 0"
  • Predict 0.20: Loss ≈ 0.22 (Good) - Leaning toward 0, small penalty
  • Predict 0.50: Loss ≈ 0.69 (Poor) - Uncertain, moderate penalty
  • Predict 0.80: Loss ≈ 1.61 (Terrible!) - Leaning wrong way, large penalty
  • Predict 0.99: Loss ≈ 4.61 (Terrible!) - Confidently wrong, massive penalty!

The Key Insight:

Log loss is asymmetric. Look at the difference:

  • Predicting 0.8 when truth is 1: Loss = 0.22 (tolerable)
  • Predicting 0.2 when truth is 1: Loss = 1.61 (7× worse!)
  • Predicting 0.01 when truth is 1: Loss = 4.61 (20× worse!!)

This exponential penalty prevents the model from making overconfident mistakes. It's better to be uncertain (predict 0.5) than to be confidently wrong (predict 0.01 when answer is 1). This is why log loss works so well for training classification models - it naturally encourages both accuracy and appropriate confidence levels.

Gradient Descent Optimization

Logistic Regression uses gradient descent to minimize the log loss. Gradient descent is an iterative algorithm that takes small steps toward the minimum of the cost function. The direction of each step is determined by the gradient (derivative) of the cost function.

import numpy as np

print("Gradient Descent - Mountain Analogy:")
print("="*50)
# Simplified 1D example: Finding minimum of f(x) = x^2
def f(x):
    return x ** 2

def gradient(x):
    return 2 * x  # Derivative of x^2

# Start at x = 5 and step downhill
x = 5.0
learning_rate = 0.1

print(f"Start position: x = {x:.4f}, f(x) = {f(x):.4f}")
print("-"*50)
for step in range(10):
    grad = gradient(x)  # Check slope direction
    x = x - learning_rate * grad  # Move opposite to slope
    print(f"Step {step+1}: grad={grad:+.3f}, move to x={x:.4f}, f(x)={f(x):.4f}")
print("-"*50)
print(f"Final position: x = {x:.4f} (approaching the minimum at x = 0)")
print("\nIn Logistic Regression, we do the same but with multiple weights!")
# Full logistic regression trained with gradient descent on synthetic data
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

def log_loss(y, y_pred):
    """Binary cross-entropy loss"""
    epsilon = 1e-15
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return -np.mean(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred))

def gradient_descent_step(X, y, w, b, learning_rate):
    """One step of gradient descent"""
    m = X.shape[0]

    # Forward pass
    z = np.dot(X, w) + b
    y_pred = sigmoid(z)

    # Backward pass (gradients)
    dw = np.dot(X.T, (y_pred - y)) / m
    db = np.mean(y_pred - y)

    # Update weights
    w_new = w - learning_rate * dw
    b_new = b - learning_rate * db

    # Compute new loss
    z_new = np.dot(X, w_new) + b_new
    y_pred_new = sigmoid(z_new)
    loss = log_loss(y, y_pred_new)

    return w_new, b_new, loss

# Synthetic, linearly separable data
np.random.seed(42)
X = np.random.randn(50, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(float)

# Initialize parameters and train for 100 epochs
w = np.random.randn(2) * 0.01
b = 0
learning_rate = 0.1
losses = []

for epoch in range(100):
    w, b, loss = gradient_descent_step(X, y, w, b, learning_rate)
    losses.append(loss)

# Plot the training curve
plt.figure(figsize=(10, 5))
plt.plot(losses, linewidth=2)
plt.xlabel('Iteration')
plt.ylabel('Log Loss')
plt.title('Gradient Descent: Loss Decreasing Over Iterations')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"Initial loss: {losses[0]:.4f}")
print(f"Final loss: {losses[-1]:.4f}")

Training with Scikit-learn

In practice, we don't implement gradient descent from scratch. Scikit-learn's LogisticRegression class handles everything for us. Let's train a real model on actual data.

from sklearn.linear_model import LogisticRegression
import numpy as np

# Training data: hours studied vs. pass/fail outcome
hours_studied = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
passed_exam = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # 0=fail, 1=pass

model = LogisticRegression()
model.fit(hours_studied, passed_exam)

# Predict for new students; wrap each sample so sklearn gets a 2D array
test_hours = [[3.5], [4.5], [5.5]]
for hours in test_hours:
    prob = model.predict_proba([hours])[0]
    prediction = model.predict([hours])[0]
    result = "PASS" if prediction == 1 else "FAIL"
    print(f"Hours: {hours[0]} -> Probability: {prob[1]:.2%} -> {result}")

print(f"\nModel learned: coef={model.coef_[0][0]:.3f}, intercept={model.intercept_[0]:.3f}")
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the breast cancer dataset (all 30 features this time)
data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features: important for gradient-based solvers
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LogisticRegression(max_iter=10000)
model.fit(X_train_scaled, y_train)

# Class predictions and probabilities
y_train_pred = model.predict(X_train_scaled)
y_test_pred = model.predict(X_test_scaled)
y_train_proba = model.predict_proba(X_train_scaled)
y_test_proba = model.predict_proba(X_test_scaled)

print(f"Training accuracy: {model.score(X_train_scaled, y_train):.4f}")
print(f"Testing accuracy: {model.score(X_test_scaled, y_test):.4f}")
print(f"\nModel coefficients shape: {model.coef_.shape}")
print(f"First 5 coefficients: {model.coef_[0][:5]}")
print(f"Intercept: {model.intercept_[0]:.4f}")

Model Evaluation Metrics

Accuracy alone isn't enough to evaluate classification models, especially with imbalanced data. We need multiple metrics to understand different aspects of model performance.
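To see why accuracy can mislead, consider a dataset that is 90% negative: a model that always predicts the majority class scores 90% accuracy while catching zero positives. A quick sketch:

```python
import numpy as np

# Imbalanced labels: 90 negatives, 10 positives
y_true = np.array([0] * 90 + [1] * 10)

# A useless "model" that always predicts the majority class
y_pred = np.zeros(100, dtype=int)

accuracy = np.mean(y_pred == y_true)
recall = np.sum((y_pred == 1) & (y_true == 1)) / np.sum(y_true == 1)

print(f"Accuracy: {accuracy:.0%}")  # 90% -- looks great!
print(f"Recall:   {recall:.0%}")    # 0% -- catches no positives
```

This is why precision, recall, and F1 are reported alongside accuracy.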

# Worked example: spam detection with 10 emails
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]  # 4 spam, 6 not-spam
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]  # Our model's predictions

# Confusion matrix counts (tallied by hand from the lists above)
tp = 2  # True Positive: Correctly predicted spam
fp = 1  # False Positive: Predicted spam but was not spam
fn = 2  # False Negative: Predicted not-spam but was spam
tn = 5  # True Negative: Correctly predicted not-spam

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f"\nAccuracy = (TP+TN)/(TP+TN+FP+FN)")
print(f"         = ({tp}+{tn})/({tp}+{tn}+{fp}+{fn}) = {accuracy:.2f}")
print(f"  Meaning: {accuracy*100:.0f}% of all predictions were correct")

precision = tp / (tp + fp)
print(f"\nPrecision = TP/(TP+FP) = {tp}/({tp}+{fp}) = {precision:.2f}")
print(f"  Meaning: {precision*100:.0f}% of predicted spam was actually spam")

recall = tp / (tp + fn)
print(f"\nRecall = TP/(TP+FN) = {tp}/({tp}+{fn}) = {recall:.2f}")
print(f"  Meaning: We caught {recall*100:.0f}% of all spam")

f1 = 2 * (precision * recall) / (precision + recall)
print(f"\nF1-Score = 2*(Precision*Recall)/(Precision+Recall) = {f1:.2f}")
print(f"  Meaning: Balanced score considering both precision and recall")
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (accuracy_score, precision_score, recall_score, 
                             f1_score, confusion_matrix, roc_auc_score, 
                             roc_curve, auc)
import matplotlib.pyplot as plt

# Imbalanced dataset: roughly 90% class 0, 10% class 1
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LogisticRegression()
model.fit(X_train_scaled, y_train)

y_pred = model.predict(X_test_scaled)
y_proba = model.predict_proba(X_test_scaled)[:, 1]

print("Classification Metrics:")
print("=" * 40)
print(f"Accuracy:  {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision: {precision_score(y_test, y_pred):.4f}")
print(f"Recall:    {recall_score(y_test, y_pred):.4f}")
print(f"F1-Score:  {f1_score(y_test, y_pred):.4f}")
print(f"ROC-AUC:   {roc_auc_score(y_test, y_proba):.4f}")

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("\nConfusion Matrix:")
print(f"TP: {tp}, FP: {fp}")
print(f"FN: {fn}, TN: {tn}")

# ROC curve
fpr, tpr, _ = roc_curve(y_test, y_proba)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(10, 6))
plt.plot(fpr, tpr, color='b', lw=2, label=f'ROC curve (AUC = {roc_auc:.3f})')
plt.plot([0, 1], [0, 1], color='r', lw=2, linestyle='--', label='Random')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

Feature Importance

In Logistic Regression, the magnitude of a coefficient indicates feature importance: larger absolute coefficients mean the feature has more influence on the prediction. Keep in mind that coefficients are only directly comparable when the features share a scale, so standardize your data before interpreting them. This built-in interpretability makes Logistic Regression easier to explain than many other algorithms.

# Understanding Feature Importance in Logistic Regression
import numpy as np

# Suppose we're predicting if a customer will buy (1) or not (0)
# We have 3 features with these coefficients:
coefficients = {
    "visit_duration": 0.8,    # Time spent on website (minutes)
    "items_viewed": 1.5,      # Number of products viewed
    "previous_purchases": -0.3  # Already bought before (paradoxically negative!)
}

print("Feature Coefficients Explained:")
print("="*55)

for feature, coef in coefficients.items():
    direction = "increases" if coef > 0 else "decreases"
    impact = abs(coef)
    
    print(f"\n{feature}: {coef:+.2f}")
    print(f"  - Each unit increase {direction} log-odds by {impact:.2f}")
    print(f"  - Importance ranking: {'High' if impact > 1 else 'Medium' if impact > 0.5 else 'Low'}")

# Which feature matters most? Look at absolute values!
importance = {f: abs(c) for f, c in coefficients.items()}
sorted_features = sorted(importance.items(), key=lambda x: x[1], reverse=True)

print("\nFeature Importance (by absolute coefficient):")
for i, (feature, imp) in enumerate(sorted_features, 1):
    print(f"  {i}. {feature}: {imp:.2f}")

# Feature Importance from Coefficients
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load data
data = load_breast_cancer()
X = data.data
y = data.target
feature_names = data.feature_names

# Scale and train
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

model = LogisticRegression(max_iter=10000)
model.fit(X_scaled, y)

# Get feature importance (coefficients)
importance = pd.DataFrame({
    'Feature': feature_names,
    'Coefficient': model.coef_[0],
    'Abs_Coefficient': np.abs(model.coef_[0])
}).sort_values('Abs_Coefficient', ascending=False)

# Plot top 10 features
plt.figure(figsize=(10, 6))
top_features = importance.head(10)
colors = ['green' if c > 0 else 'red' for c in top_features['Coefficient']]
plt.barh(range(len(top_features)), top_features['Coefficient'], color=colors)
plt.yticks(range(len(top_features)), top_features['Feature'])
plt.xlabel('Coefficient Value')
plt.title('Top 10 Most Important Features')
plt.tight_layout()
plt.show()

print("\nTop 5 Most Important Features:")
print(importance.head())

Practice Questions: Model Training

Test your understanding with these coding challenges.

Task: Load the Iris dataset, filter to binary classification (classes 0 and 1 only), train LogisticRegression, and report accuracy.

Show Solution
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target

# Binary classification: use only classes 0 and 1
mask = y != 2
X, y = X[mask], y[mask]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"Accuracy: {accuracy:.4f}")

Task: Train LogisticRegression on breast cancer dataset. Print precision and recall for thresholds [0.3, 0.4, 0.5, 0.6, 0.7].

Show Solution
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import precision_score, recall_score

data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LogisticRegression()
model.fit(X_train_scaled, y_train)
y_proba = model.predict_proba(X_test_scaled)[:, 1]

print("Threshold | Precision | Recall")
print("-" * 35)
for threshold in [0.3, 0.4, 0.5, 0.6, 0.7]:
    y_pred = (y_proba >= threshold).astype(int)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    print(f"{threshold:>9.1f} | {precision:>9.3f} | {recall:>6.3f}")

Task: Use GridSearchCV to find optimal C and penalty parameters. Report best parameters and test score.

Show Solution
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2']
}

model = LogisticRegression(solver='liblinear', max_iter=10000)
grid_search = GridSearchCV(model, param_grid, cv=5, scoring='f1')
grid_search.fit(X_train_scaled, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.4f}")
print(f"Test score: {grid_search.score(X_test_scaled, y_test):.4f}")
05

Naive Bayes Classifier

Naive Bayes is one of the simplest yet surprisingly effective classification algorithms. Despite its "naive" assumption that all features are independent of each other (which is rarely true in real life), it often performs remarkably well, especially for text classification tasks like spam detection and sentiment analysis. The algorithm is based on Bayes' Theorem from probability theory, and its simplicity means it's incredibly fast to train and predict - perfect for large datasets or real-time applications.

Bayes' Theorem: The Foundation

At its core, Naive Bayes uses Bayes' Theorem to calculate the probability of a class given the observed features. The theorem states:

Bayes' Theorem

P(Class | Features) = P(Features | Class) × P(Class) / P(Features)

  • P(Class | Features) - Posterior probability: What we want to find - the probability of a class given the features
  • P(Features | Class) - Likelihood: How probable are these features if we know the class
  • P(Class) - Prior probability: How common is this class in general
  • P(Features) - Evidence: The probability of seeing these features (constant for all classes)
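To make the theorem concrete, here is a small worked example with made-up spam statistics (the 30%, 60%, and 5% figures below are illustrative assumptions, not real data):

```python
# Worked Bayes' theorem example with made-up spam statistics.
# Suppose 30% of emails are spam, the word "free" appears in 60% of
# spam emails and in only 5% of legitimate (ham) emails.
# What is P(spam | "free")?

p_spam = 0.30             # P(Class): prior probability of spam
p_free_given_spam = 0.60  # P(Features | Class): likelihood for spam
p_free_given_ham = 0.05   # likelihood for ham
p_ham = 1 - p_spam

# P(Features): total probability of seeing "free" in any email
p_free = p_free_given_spam * p_spam + p_free_given_ham * p_ham

# Posterior via Bayes' theorem
p_spam_given_free = p_free_given_spam * p_spam / p_free
print(f"P(spam | 'free') = {p_spam_given_free:.3f}")  # ≈ 0.837
```

Seeing "free" raises the spam probability from the 30% prior to about 84%, which is exactly the posterior update Naive Bayes performs for every feature.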

The "Naive" Assumption

The "naive" part comes from assuming that all features are conditionally independent given the class. This means we assume that knowing the value of one feature tells us nothing about other features (given the class). For example, in spam detection, we assume that the presence of the word "free" is independent of the word "money" - which is obviously not true! Yet this simplification makes the math tractable and often works surprisingly well in practice.
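A minimal sketch of what that independence assumption buys us: the joint likelihood of several words factorizes into a product of per-word likelihoods. The per-word probabilities below are made-up numbers for illustration:

```python
# Under the naive assumption, the joint likelihood factorizes:
# P("free", "money" | spam) ≈ P("free" | spam) * P("money" | spam)
# (probabilities below are illustrative, not from real data)

p_words_given_spam = {"free": 0.60, "money": 0.40}
p_words_given_ham = {"free": 0.05, "money": 0.10}

def naive_likelihood(words, p_word_given_class):
    """Multiply the per-word likelihoods (the 'naive' factorization)."""
    likelihood = 1.0
    for w in words:
        likelihood *= p_word_given_class[w]
    return likelihood

print(naive_likelihood(["free", "money"], p_words_given_spam))  # 0.24
print(naive_likelihood(["free", "money"], p_words_given_ham))   # 0.005
```

Instead of estimating one probability per word *combination* (exponentially many), we only estimate one per word, which is why Naive Bayes stays tractable even with thousands of text features.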

Gaussian NB

For continuous features that follow a normal (Gaussian) distribution. Most common choice for general numeric data.

  • Assumes features are normally distributed
  • Works well with continuous data
  • Fast training and prediction
Multinomial NB

For discrete count features like word counts in text. The go-to choice for document classification and NLP tasks.

  • Perfect for text classification
  • Uses word frequency counts
  • Excellent for spam detection
Bernoulli NB

For binary/boolean features (present or not). Good for document classification with binary term occurrence.

  • Binary feature vectors only
  • Word presence, not frequency
  • Good for short text documents

Implementing Gaussian Naive Bayes

Let's implement Gaussian Naive Bayes for classification on a standard dataset. This variant assumes features follow a normal distribution.

from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import numpy as np

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and train Gaussian Naive Bayes classifier
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Make predictions
y_pred = gnb.predict(X_test)

# Evaluate the model
print("Gaussian Naive Bayes Results:")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))

Text Classification with Multinomial Naive Bayes

Naive Bayes truly shines in text classification. Here's how to build a spam detector using Multinomial Naive Bayes:

from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline

# Sample email data
emails = [
    "Get rich quick! Free money waiting for you!",
    "Meeting scheduled for tomorrow at 3pm",
    "Congratulations! You've won a million dollars!",
    "Please review the attached project proposal",
    "Limited time offer! Buy now and save 90%!",
    "Can we reschedule our call to next week?",
    "Claim your free prize now! Act immediately!",
    "The quarterly report is ready for your review"
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]  # 1 = spam, 0 = not spam

# Create a pipeline with text vectorization and Naive Bayes
spam_detector = Pipeline([
    ('vectorizer', TfidfVectorizer(stop_words='english')),
    ('classifier', MultinomialNB(alpha=1.0))  # alpha is Laplace smoothing
])

# Train the model
spam_detector.fit(emails, labels)

# Test on new emails
new_emails = [
    "Free lottery tickets! Claim now!",
    "Let's discuss the project timeline"
]

predictions = spam_detector.predict(new_emails)
probabilities = spam_detector.predict_proba(new_emails)

for email, pred, prob in zip(new_emails, predictions, probabilities):
    label = "SPAM" if pred == 1 else "NOT SPAM"
    confidence = max(prob) * 100
    print(f"'{email[:40]}...' -> {label} ({confidence:.1f}% confident)")
When to Use Naive Bayes: Naive Bayes excels when you have limited training data, need fast training/prediction, or work with high-dimensional data like text. It's often used as a baseline model and can outperform more complex algorithms for text classification tasks.

Practice Questions

Problem: Train GaussianNB and compare it with Logistic Regression on the Iris dataset. Which performs better?

Show Solution
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score

iris = load_iris()
X, y = iris.data, iris.target

# Compare models using cross-validation
gnb = GaussianNB()
lr = LogisticRegression(max_iter=200)

gnb_scores = cross_val_score(gnb, X, y, cv=5)
lr_scores = cross_val_score(lr, X, y, cv=5)

print(f"Gaussian NB: {gnb_scores.mean():.4f} (+/- {gnb_scores.std():.4f})")
print(f"Logistic Regression: {lr_scores.mean():.4f} (+/- {lr_scores.std():.4f})")

Explanation: Both algorithms often perform similarly on well-separated datasets like Iris. Naive Bayes trains faster but may be less accurate when features are correlated.

Problem: Test different alpha values (0.01, 0.1, 1.0, 10.0) for MultinomialNB on a text dataset. How does alpha affect performance?

Show Solution
from sklearn.naive_bayes import MultinomialNB
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Load subset of 20 newsgroups
categories = ['sci.space', 'rec.sport.hockey']
newsgroups = fetch_20newsgroups(subset='train', categories=categories)

alphas = [0.01, 0.1, 1.0, 10.0]

for alpha in alphas:
    pipeline = Pipeline([
        ('tfidf', TfidfVectorizer(max_features=5000)),
        ('nb', MultinomialNB(alpha=alpha))
    ])
    scores = cross_val_score(pipeline, newsgroups.data, newsgroups.target, cv=5)
    print(f"Alpha={alpha:5.2f}: {scores.mean():.4f} (+/- {scores.std():.4f})")

Explanation: Alpha is the Laplace smoothing parameter. Higher values add more smoothing, preventing zero probabilities but potentially oversimplifying. Usually alpha=1.0 (default) works well.

Problem: Build a sentiment analyzer using Naive Bayes that classifies movie reviews as positive or negative. Use TF-IDF features.

Show Solution
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Sample movie reviews
reviews = [
    "This movie was absolutely fantastic! Great acting and plot.",
    "Terrible waste of time. Boring and predictable.",
    "I loved every minute of it. A masterpiece!",
    "Worst movie I've ever seen. Don't watch it.",
    "Brilliant performances by the entire cast. Highly recommend!",
    "So boring I fell asleep. Complete disappointment.",
    "An incredible journey with memorable characters.",
    "Awful script and terrible direction. Save your money."
]
sentiments = [1, 0, 1, 0, 1, 0, 1, 0]  # 1 = positive, 0 = negative

X_train, X_test, y_train, y_test = train_test_split(
    reviews, sentiments, test_size=0.25, random_state=42
)

# Create pipeline
sentiment_model = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2), max_features=1000)),
    ('nb', MultinomialNB())
])

sentiment_model.fit(X_train, y_train)
predictions = sentiment_model.predict(X_test)

print("Sentiment Analysis Results:")
print(classification_report(y_test, predictions, target_names=['Negative', 'Positive']))

Explanation: TF-IDF with n-grams captures both individual words and phrases. Multinomial NB works well for this text classification task.

06

K-Nearest Neighbors (KNN)

K-Nearest Neighbors is one of the most intuitive machine learning algorithms - it literally says "tell me who your neighbors are, and I'll tell you who you are." When classifying a new data point, KNN looks at the K closest training examples and assigns the class that appears most frequently among those neighbors. It's a "lazy learner" because it doesn't build a model during training; instead, it stores all the training data and does the work at prediction time. This makes it simple to understand and implement, but potentially slow for large datasets.

How KNN Works

The KNN Algorithm

  1. Store all training data - No actual training happens; we just memorize the examples
  2. For a new point, calculate distance - Compute distance from the new point to every training point
  3. Find K nearest neighbors - Select the K training points closest to the new point
  4. Vote for the class - The class that appears most often among the K neighbors wins
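The four steps above can be sketched in a few lines of NumPy. This is a toy implementation with a hypothetical 2-D dataset, not the scikit-learn version used later:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Minimal KNN: distance to every training point, vote among k nearest."""
    # Step 2: Euclidean distance from x_new to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 3: indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Step 4: majority vote among the k neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Two well-separated clusters (made-up data)
X_train = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([1.5, 1.5])))  # 0
print(knn_predict(X_train, y_train, np.array([8.5, 8.5])))  # 1
```

Note that "Step 1: store the training data" is just keeping the arrays around, which is why KNN training is instantaneous and prediction does all the work.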

Choosing the Right K

The choice of K (number of neighbors) dramatically affects the model's behavior:

Small K (e.g., K=1)
  • Very sensitive to noise and outliers
  • Creates complex, jagged decision boundaries
  • High variance, low bias (overfitting risk)
  • Single point determines the prediction
Large K (e.g., K=15)
  • More robust to noise and outliers
  • Creates smoother decision boundaries
  • High bias, low variance (underfitting risk)
  • Multiple points influence the decision
Pro Tip: A common rule of thumb is to start with K = √n (square root of training samples) and use cross-validation to find the optimal value. Also, use odd K values for binary classification to avoid ties.
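The K = √n rule of thumb from the tip above is easy to compute; this sketch uses a hypothetical training-set size and rounds to the nearest odd integer to avoid ties in binary classification:

```python
import math

n_train = 120  # hypothetical number of training samples

# Rule-of-thumb starting point: K ≈ √n, nudged to an odd integer
# so binary-classification votes cannot tie
k = round(math.sqrt(n_train))
if k % 2 == 0:
    k += 1
print(f"Starting K for {n_train} samples: {k}")  # √120 ≈ 10.95 -> 11
```

Treat this only as a starting value for the cross-validation search, not as the final answer.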

Distance Metrics

The "closeness" of neighbors depends on how we measure distance. Different metrics work better for different types of data:

  • Euclidean: √(Σ(xᵢ - yᵢ)²). Best for continuous features and general use (most common).
  • Manhattan: Σ|xᵢ - yᵢ|. Best for high-dimensional data and grid-like movement.
  • Minkowski: (Σ|xᵢ - yᵢ|ᵖ)^(1/p). Generalizes Euclidean (p=2) and Manhattan (p=1).
  • Cosine: 1 - cos(θ). Best for text data, when magnitude doesn't matter.
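The metrics above are all a few lines of NumPy; here they are computed for one made-up pair of points so you can see how the same two vectors get different "distances":

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 3.0])

euclidean = np.sqrt(np.sum((x - y) ** 2))           # √(9 + 16 + 0) = 5
manhattan = np.sum(np.abs(x - y))                   # 3 + 4 + 0 = 7
p = 3
minkowski = np.sum(np.abs(x - y) ** p) ** (1 / p)   # (27 + 64)^(1/3)
cosine = 1 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

print(f"Euclidean: {euclidean:.3f}")
print(f"Manhattan: {manhattan:.3f}")
print(f"Minkowski (p=3): {minkowski:.3f}")
print(f"Cosine distance: {cosine:.3f}")
```

In scikit-learn you select these via `KNeighborsClassifier(metric=...)` rather than computing them by hand.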

Implementing KNN

from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report
import numpy as np

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# IMPORTANT: Scale features for KNN (distance-based algorithm)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Create and train KNN classifier
knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn.fit(X_train_scaled, y_train)

# Predict and evaluate
y_pred = knn.predict(X_test_scaled)
print(f"KNN Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))

Finding the Optimal K

import matplotlib.pyplot as plt

# Test different K values
k_values = range(1, 31)
train_scores = []
test_scores = []

for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train_scaled, y_train)
    train_scores.append(knn.score(X_train_scaled, y_train))
    test_scores.append(knn.score(X_test_scaled, y_test))

# Plot results
plt.figure(figsize=(10, 6))
plt.plot(k_values, train_scores, 'b-o', label='Training Accuracy')
plt.plot(k_values, test_scores, 'r-s', label='Test Accuracy')
plt.xlabel('Number of Neighbors (K)')
plt.ylabel('Accuracy')
plt.title('KNN: Finding the Optimal K')
plt.legend()
plt.grid(True, alpha=0.3)

plt.show()

# Find best K
best_k = k_values[np.argmax(test_scores)]
print(f"Best K: {best_k} with test accuracy: {max(test_scores):.4f}")
Critical: Feature Scaling! KNN is distance-based, so features with larger scales will dominate the distance calculation. Always standardize or normalize your features before using KNN. This is one of the most common mistakes beginners make!

Practice Questions

Problem: Compare KNN accuracy with and without feature scaling on the Iris dataset. How much difference does scaling make?

Show Solution
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score

iris = load_iris()
X, y = iris.data, iris.target

# Without scaling
knn_no_scale = KNeighborsClassifier(n_neighbors=5)
scores_no_scale = cross_val_score(knn_no_scale, X, y, cv=5)

# With scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
knn_scaled = KNeighborsClassifier(n_neighbors=5)
scores_scaled = cross_val_score(knn_scaled, X_scaled, y, cv=5)

print(f"Without scaling: {scores_no_scale.mean():.4f}")
print(f"With scaling: {scores_scaled.mean():.4f}")

Explanation: Scaling often improves KNN performance, especially when features have different scales. The improvement depends on the dataset.

Problem: Test Euclidean, Manhattan, and Chebyshev distances on a dataset. Which works best?

Show Solution
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X = StandardScaler().fit_transform(wine.data)
y = wine.target

metrics = ['euclidean', 'manhattan', 'chebyshev']

for metric in metrics:
    knn = KNeighborsClassifier(n_neighbors=5, metric=metric)
    scores = cross_val_score(knn, X, y, cv=5)
    print(f"{metric:12s}: {scores.mean():.4f} (+/- {scores.std():.4f})")

Explanation: Different metrics capture different notions of "closeness". Euclidean is most common, but Manhattan can work better in high dimensions.

Problem: Compare uniform weights vs distance-weighted KNN. When does distance weighting help?

Show Solution
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

# Create dataset with some overlap
X, y = make_classification(n_samples=500, n_features=10, 
                          n_informative=5, n_redundant=2,
                          n_clusters_per_class=2, random_state=42)
X = StandardScaler().fit_transform(X)

for k in [3, 7, 15]:
    knn_uniform = KNeighborsClassifier(n_neighbors=k, weights='uniform')
    knn_distance = KNeighborsClassifier(n_neighbors=k, weights='distance')
    
    scores_uniform = cross_val_score(knn_uniform, X, y, cv=5).mean()
    scores_distance = cross_val_score(knn_distance, X, y, cv=5).mean()
    
    print(f"K={k:2d}: Uniform={scores_uniform:.4f}, Distance={scores_distance:.4f}")

Explanation: Distance weighting gives closer neighbors more influence. It often helps with noisy data and allows using larger K values without losing local sensitivity.

07

Multi-class Classification Strategies

Many real-world problems have more than two classes: classifying emails into spam/promotional/personal/social, recognizing handwritten digits (0-9), or identifying plant species. While some algorithms like Decision Trees and KNN naturally handle multiple classes, others like Logistic Regression and SVM are inherently binary classifiers. In this section, we'll explore strategies to extend binary classifiers to multi-class problems, and how to use algorithms that natively support multiple classes.

One-vs-Rest (OvR) Strategy

One-vs-Rest (One-vs-All)

For N classes, train N separate binary classifiers. Each classifier learns to distinguish one class from all other classes combined. At prediction time, run all N classifiers and pick the class with the highest confidence score.

Example (3 classes: A, B, C):

  • Classifier 1: Is it class A? (A vs not-A)
  • Classifier 2: Is it class B? (B vs not-B)
  • Classifier 3: Is it class C? (C vs not-C)
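The OvR decision rule for the three classifiers above is just an argmax over their confidence scores. The scores below are made-up values for one hypothetical sample:

```python
# Made-up confidence scores from the three binary classifiers
# (A vs not-A, B vs not-B, C vs not-C) for one new sample
scores = {"A": 0.30, "B": 0.85, "C": 0.10}

# OvR decision rule: pick the class whose classifier is most confident
predicted = max(scores, key=scores.get)
print(f"OvR prediction: {predicted}")  # B
```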

One-vs-One (OvO) Strategy

One-vs-One

For N classes, train N×(N-1)/2 binary classifiers, one for each pair of classes. At prediction time, each classifier votes for one class, and the class with the most votes wins.

Example (3 classes: A, B, C):

  • Classifier 1: A vs B
  • Classifier 2: A vs C
  • Classifier 3: B vs C
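The OvO decision rule is a majority vote over the pairwise classifiers. The votes below are made-up outcomes for one hypothetical sample:

```python
from collections import Counter

# Made-up votes from the three pairwise classifiers for one sample:
# A-vs-B picked A, A-vs-C picked C, B-vs-C picked C
votes = ["A", "C", "C"]

# OvO decision rule: the class with the most pairwise wins
predicted = Counter(votes).most_common(1)[0][0]
print(f"OvO prediction: {predicted}")  # C
```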
One-vs-Rest Advantages
  • Only N classifiers needed (more efficient)
  • Each classifier sees all training data
  • Default strategy in scikit-learn
  • Works well when classes are well-separated
One-vs-One Advantages
  • Each classifier is simpler (only 2 classes)
  • Better for imbalanced datasets
  • Default for scikit-learn's SVC (many small pairwise problems train faster than one big one)
  • More robust to outliers in other classes

Multinomial (Softmax) Logistic Regression

Instead of using multiple binary classifiers, we can extend Logistic Regression to directly predict probabilities across all classes using the softmax function. This is more elegant and often more effective:
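The softmax function itself is short enough to sketch directly; this standalone version uses hypothetical raw class scores (logits), not output from a trained model:

```python
import numpy as np

def softmax(z):
    """Convert raw class scores into probabilities that sum to 1."""
    z = z - z.max()          # subtract the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

# Hypothetical raw scores (logits) for 3 classes
scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs)        # highest score -> highest probability
print(probs.sum())  # probabilities sum to 1
```

For two classes, softmax reduces to the sigmoid, which is why multinomial Logistic Regression is the natural multi-class generalization.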

from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# OvR (One-vs-Rest) strategy
# Note: the multi_class parameter is deprecated in scikit-learn >= 1.5;
# on newer versions, wrap the model in OneVsRestClassifier for explicit OvR
lr_ovr = LogisticRegression(multi_class='ovr', max_iter=200)
lr_ovr.fit(X_train, y_train)
print(f"One-vs-Rest Accuracy: {lr_ovr.score(X_test, y_test):.4f}")

# Multinomial (Softmax) - native multi-class
lr_multinomial = LogisticRegression(multi_class='multinomial', max_iter=200)
lr_multinomial.fit(X_train, y_train)
print(f"Multinomial Accuracy: {lr_multinomial.score(X_test, y_test):.4f}")

# Compare probability outputs
sample = X_test[0:1]
print(f"\nProbabilities for first sample:")
print(f"OvR: {lr_ovr.predict_proba(sample)[0]}")
print(f"Multinomial: {lr_multinomial.predict_proba(sample)[0]}")

Multi-class with Different Algorithms

from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

# Load digit recognition dataset (10 classes: 0-9)
digits = load_digits()
X = StandardScaler().fit_transform(digits.data)
y = digits.target

print(f"Dataset: {X.shape[0]} samples, {X.shape[1]} features, {len(set(y))} classes\n")

# Compare different multi-class approaches
models = {
    'Logistic (OvR)': LogisticRegression(multi_class='ovr', max_iter=1000),
    'Logistic (Multinomial)': LogisticRegression(multi_class='multinomial', max_iter=1000),
    'Naive Bayes': GaussianNB(),
    'KNN (k=5)': KNeighborsClassifier(n_neighbors=5),
    'Decision Tree': DecisionTreeClassifier(max_depth=10),
    'SVM (OvO)': SVC(kernel='rbf')
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name:25s}: {scores.mean():.4f} (+/- {scores.std():.4f})")

Explicit OvR and OvO Wrappers

from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Explicitly wrap a binary classifier for multi-class
# One-vs-Rest wrapper
ovr_classifier = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr_classifier.fit(X_train, y_train)
print(f"Explicit OvR: {ovr_classifier.score(X_test, y_test):.4f}")
print(f"Number of classifiers: {len(ovr_classifier.estimators_)}")

# One-vs-One wrapper
ovo_classifier = OneVsOneClassifier(SVC(kernel='linear'))
ovo_classifier.fit(X_train, y_train)
print(f"Explicit OvO: {ovo_classifier.score(X_test, y_test):.4f}")
print(f"Number of classifiers: {len(ovo_classifier.estimators_)}")
Which Strategy to Choose?
  • Multinomial: Use when available (Logistic Regression) - most elegant solution
  • One-vs-Rest: Good default for most algorithms, fewer classifiers
  • One-vs-One: Better for SVMs, imbalanced data, or when classes overlap significantly
  • Native multi-class: Trees, KNN, Naive Bayes handle it automatically!

Practice Questions

Problem: If you have 10 classes (like digit recognition), how many classifiers does OvO need? Verify with code.

Show Solution
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC
from sklearn.datasets import load_digits

digits = load_digits()
X, y = digits.data, digits.target

# Calculate theoretically
n_classes = len(set(y))
n_classifiers = n_classes * (n_classes - 1) // 2
print(f"Theoretical: {n_classes} classes -> {n_classifiers} classifiers")

# Verify with code
ovo = OneVsOneClassifier(SVC())
ovo.fit(X, y)
print(f"Actual: {len(ovo.estimators_)} classifiers")

Explanation: OvO needs N×(N-1)/2 classifiers. For 10 classes: 10×9/2 = 45 classifiers!

Problem: Compare OvR vs Multinomial Logistic Regression on the digits dataset. Which is more accurate? Which is faster?

Show Solution
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
import time

digits = load_digits()
X, y = digits.data, digits.target

for strategy in ['ovr', 'multinomial']:
    model = LogisticRegression(multi_class=strategy, max_iter=5000, solver='lbfgs')
    
    start = time.time()
    scores = cross_val_score(model, X, y, cv=5)
    elapsed = time.time() - start
    
    print(f"{strategy:12s}: Accuracy={scores.mean():.4f}, Time={elapsed:.2f}s")

Explanation: Multinomial often achieves similar or better accuracy with less computation since it optimizes all classes jointly.

Problem: Train a multi-class classifier on digits and create a confusion matrix heatmap. Which digits are most commonly confused?

Show Solution
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.3, random_state=42
)

model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Create confusion matrix
cm = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(10, 8))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=digits.target_names,
            yticklabels=digits.target_names)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Multi-class Confusion Matrix: Digit Recognition')
plt.show()

# Find most confused pairs
import numpy as np
np.fill_diagonal(cm, 0)  # Ignore correct predictions
max_idx = np.unravel_index(cm.argmax(), cm.shape)
print(f"Most confused: {max_idx[0]} misclassified as {max_idx[1]} ({cm[max_idx]} times)")

Explanation: Common confusions include 3↔8, 4↔9, and 1↔7 due to visual similarity.

Key Takeaways

Binary Classification

Logistic regression is a powerful algorithm for predicting one of two categories based on input features

Sigmoid Function

The sigmoid function converts linear combinations into probability scores between 0 and 1

Decision Boundaries

Classification occurs at decision boundaries, typically at a probability threshold of 0.5

Log Loss Function

Logistic regression minimizes log loss to find the best-fitting model parameters

Gradient Descent

Iterative optimization algorithm that updates model parameters to minimize the cost function

Naive Bayes Classifier

Probabilistic classifier using Bayes' theorem with feature independence assumption—fast, effective, especially for text

K-Nearest Neighbors (KNN)

Instance-based learning that classifies based on majority vote of K closest neighbors—simple yet powerful

Multi-class Strategies

Extend binary classifiers using One-vs-Rest, One-vs-One, or native multinomial/softmax approaches

Real-World Applications

Email spam detection, disease diagnosis, credit approval, and countless classification tasks

Knowledge Check

Test your understanding of Logistic Regression:

Question 1 of 9

What is the primary purpose of Logistic Regression?

Question 2 of 9

What is the sigmoid function's output range?

Question 3 of 9

What does a decision boundary of 0.5 mean in Logistic Regression?

Question 4 of 9

Which cost function is typically used in Logistic Regression?

Question 5 of 9

What does Gradient Descent do in Logistic Regression?

Question 6 of 9

Which evaluation metric is best for imbalanced binary classification?

Question 7 of 9

What is the key assumption of Naive Bayes classifiers?

Question 8 of 9

Why is feature scaling important for K-Nearest Neighbors (KNN)?

Question 9 of 9

What is the One-vs-Rest (OvR) strategy for multi-class classification?


Interactive Demo: Sigmoid Function Explorer

Sigmoid Visualization

Drag the slider to see how the linear input (z) transforms through the sigmoid function.
