Assignment Overview
In this assignment, you will build a complete Model Evaluation Pipeline for fraud detection. This project requires you to apply core model evaluation concepts: cross-validation, performance metrics, visualization techniques, hyperparameter tuning, and bias-variance analysis, all essential skills for any machine learning practitioner.
Cross-Validation
Stratified K-Fold, reliable performance estimates, variance reduction
Metrics & Visualization
Confusion matrix, ROC curves, PR curves, F1-score, Matthews Correlation
Hyperparameter Tuning
GridSearchCV, RandomizedSearchCV, learning curves, bias-variance
The Scenario
FinSecure Bank - Fraud Detection Evaluation
You have been hired as a Machine Learning Engineer at FinSecure Bank. The data science team has developed several ML models for credit card fraud detection, but they need a rigorous evaluation framework. The Head of Risk Management has given you this task:
"We have multiple fraud detection models but don't know which one performs best in production. Accuracy isn't enough - we need to understand precision vs recall tradeoffs. Can you build a comprehensive evaluation pipeline that tells us which model to deploy and why?"
Your Task
Create a Jupyter Notebook called fraud_evaluation.ipynb that implements a complete
model evaluation framework. Your code must train multiple models, evaluate them using various metrics,
visualize performance, tune hyperparameters, and generate a detailed recommendation report.
The Dataset
Create a synthetic credit card fraud dataset (credit_fraud.csv) with the following structure:
File: credit_fraud.csv (Credit Card Transactions)
transaction_id,amount,time_of_day,day_of_week,distance_from_home,distance_from_last_transaction,ratio_to_median_purchase,repeat_retailer,used_chip,used_pin,online_order,is_fraud
1,125.50,14,2,15.3,5.2,1.2,1,1,0,0,0
2,5420.00,3,5,250.8,180.5,8.5,0,0,0,1,1
3,45.00,10,1,2.1,0.5,0.4,1,1,1,0,0
...
Columns Explained
- transaction_id - Unique transaction identifier (integer)
- amount - Transaction amount in USD (float)
- time_of_day - Hour of transaction, 0-23 (integer)
- day_of_week - Day of week, 0-6 (integer)
- distance_from_home - Distance from cardholder's home in miles (float)
- distance_from_last_transaction - Distance from previous transaction in miles (float)
- ratio_to_median_purchase - Ratio of amount to cardholder's median purchase (float)
- repeat_retailer - Whether retailer was used before, 0/1 (binary)
- used_chip - Whether chip was used, 0/1 (binary)
- used_pin - Whether PIN was used, 0/1 (binary)
- online_order - Whether it was an online order, 0/1 (binary)
- is_fraud - Target: fraudulent transaction, 0/1 (binary)
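One possible way to generate a dataset with this schema is sketched below. The ~3% fraud rate and the specific feature distributions are assumptions for illustration, not part of the assignment spec; any generator that produces the required columns and a realistic class imbalance is acceptable.

```python
# Sketch: synthesize a fraud dataset with the required columns.
# Fraudulent rows are skewed toward larger amounts and distances (assumption).
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 10_000
is_fraud = (rng.random(n) < 0.03).astype(int)  # ~3% fraud rate

df = pd.DataFrame({
    "transaction_id": np.arange(1, n + 1),
    "amount": np.round(rng.lognormal(4.0 + 1.5 * is_fraud, 1.0, n), 2),
    "time_of_day": rng.integers(0, 24, n),
    "day_of_week": rng.integers(0, 7, n),
    "distance_from_home": np.round(rng.exponential(10 + 40 * is_fraud, n), 1),
    "distance_from_last_transaction": np.round(rng.exponential(5 + 30 * is_fraud, n), 1),
    "ratio_to_median_purchase": np.round(rng.exponential(1 + 3 * is_fraud, n), 2),
    "repeat_retailer": rng.integers(0, 2, n),
    "used_chip": rng.integers(0, 2, n),
    "used_pin": rng.integers(0, 2, n),
    "online_order": rng.integers(0, 2, n),
    "is_fraud": is_fraud,
})
df.to_csv("credit_fraud.csv", index=False)
```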
Requirements
Your fraud_evaluation.ipynb must implement ALL of the following functions.
Each function is mandatory and will be tested individually.
Load and Preprocess Data
Create a function load_and_preprocess(filepath) that:
- Loads the CSV file and handles missing values
- Scales numerical features using StandardScaler
- Splits into features (X) and target (y)
- Returns X, y, and feature names
def load_and_preprocess(filepath):
    """Load data and prepare for modeling."""
    # Must use: StandardScaler for numerical features
    # Return: X (scaled features), y (target), feature_names
    pass
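A minimal sketch of one possible implementation, assuming the column layout shown above; dropping rows with missing values is the simplest strategy, and imputation is a reasonable alternative:

```python
# Hedged sketch of load_and_preprocess; not the reference solution.
import pandas as pd
from sklearn.preprocessing import StandardScaler

NUMERIC_COLS = ["amount", "time_of_day", "day_of_week", "distance_from_home",
                "distance_from_last_transaction", "ratio_to_median_purchase"]

def load_and_preprocess(filepath):
    """Load data and prepare for modeling."""
    df = pd.read_csv(filepath).dropna()   # simplest missing-value strategy
    X = df.drop(columns=["transaction_id", "is_fraud"]).copy()
    y = df["is_fraud"].to_numpy()
    # Scale only the numeric columns; binary flags stay 0/1
    X[NUMERIC_COLS] = StandardScaler().fit_transform(X[NUMERIC_COLS])
    return X.to_numpy(), y, list(X.columns)
```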
Create Stratified Train-Test Split
Create a function create_train_test_split(X, y, test_size=0.2, stratify=True) that:
- Uses stratified sampling if stratify=True to preserve class distribution
- Sets random_state for reproducibility
- Prints class distribution before and after split
- Returns X_train, X_test, y_train, y_test
def create_train_test_split(X, y, test_size=0.2, stratify=True):
    """Create stratified train-test split."""
    # Must use: train_test_split with stratify parameter
    pass
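A sketch under the stated requirements; `random_state=42` is an assumption, any fixed seed satisfies the reproducibility requirement:

```python
# Hedged sketch of create_train_test_split.
import numpy as np
from sklearn.model_selection import train_test_split

def create_train_test_split(X, y, test_size=0.2, stratify=True):
    """Create stratified train-test split."""
    y = np.asarray(y)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=42,
        stratify=y if stratify else None)
    # Report the class distribution before and after the split
    for name, labels in [("full", y), ("train", y_train), ("test", y_test)]:
        print(f"{name:>5} fraud rate: {labels.mean():.3f}")
    return X_train, X_test, y_train, y_test
```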
Perform Cross-Validation
Create a function perform_cross_validation(model, X, y, cv=5) that:
- Uses StratifiedKFold for imbalanced data
- Calculates multiple metrics per fold (accuracy, precision, recall, F1, ROC-AUC)
- Returns mean and std for each metric
- Prints formatted results table
def perform_cross_validation(model, X, y, cv=5):
    """Perform stratified k-fold cross-validation."""
    # Must use: StratifiedKFold, cross_val_score or manual fold iteration
    # Return: dict with metric names and (mean, std) tuples
    pass
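One possible implementation using sklearn's `cross_validate`, which evaluates several scorers in one pass; the exact table formatting is up to you:

```python
# Hedged sketch of perform_cross_validation.
from sklearn.model_selection import StratifiedKFold, cross_validate

def perform_cross_validation(model, X, y, cv=5):
    """Perform stratified k-fold cross-validation."""
    skf = StratifiedKFold(n_splits=cv, shuffle=True, random_state=42)
    scoring = ["accuracy", "precision", "recall", "f1", "roc_auc"]
    scores = cross_validate(model, X, y, cv=skf, scoring=scoring)
    results = {}
    print(f"{'metric':<10} {'mean':>8} {'std':>8}")
    for metric in scoring:
        vals = scores[f"test_{metric}"]
        results[metric] = (vals.mean(), vals.std())
        print(f"{metric:<10} {vals.mean():>8.3f} {vals.std():>8.3f}")
    return results
```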
Calculate Comprehensive Metrics
Create a function calculate_metrics(y_true, y_pred, y_proba=None) that:
- Calculates accuracy, precision, recall, F1-score
- Calculates ROC-AUC if probabilities provided
- Calculates Matthews Correlation Coefficient (MCC)
- Returns dictionary of all metrics
def calculate_metrics(y_true, y_pred, y_proba=None):
    """Calculate comprehensive classification metrics."""
    # Must include: accuracy, precision, recall, f1, roc_auc, mcc
    pass
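A minimal sketch; `zero_division=0` guards against folds where the model predicts no positives:

```python
# Hedged sketch of calculate_metrics.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, matthews_corrcoef)

def calculate_metrics(y_true, y_pred, y_proba=None):
    """Calculate comprehensive classification metrics."""
    metrics = {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
        "mcc": matthews_corrcoef(y_true, y_pred),
    }
    if y_proba is not None:  # ROC-AUC needs scores, not hard labels
        metrics["roc_auc"] = roc_auc_score(y_true, y_proba)
    return metrics
```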
Plot Confusion Matrix
Create a function plot_confusion_matrix(y_true, y_pred, model_name) that:
- Uses seaborn heatmap with annotations
- Shows both counts and percentages
- Labels axes clearly (Predicted vs Actual)
- Saves as PNG file
def plot_confusion_matrix(y_true, y_pred, model_name):
    """Create and visualize confusion matrix."""
    # Must use: confusion_matrix from sklearn, seaborn heatmap
    # Save to: confusion_matrices.png
    pass
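A sketch of one approach, assuming seaborn is installed; string annotations let each cell show the count and the percentage together. The `Agg` backend is used so the figure saves without a display:

```python
# Hedged sketch of plot_confusion_matrix.
import matplotlib
matplotlib.use("Agg")  # headless backend so savefig works everywhere
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.metrics import confusion_matrix

def plot_confusion_matrix(y_true, y_pred, model_name):
    """Create and visualize confusion matrix."""
    cm = confusion_matrix(y_true, y_pred)
    pct = cm / cm.sum() * 100
    # Combine count and percentage into one annotation per cell
    labels = np.array([[f"{c}\n({p:.1f}%)" for c, p in zip(rc, rp)]
                       for rc, rp in zip(cm, pct)])
    fig, ax = plt.subplots(figsize=(5, 4))
    sns.heatmap(cm, annot=labels, fmt="", cmap="Blues", ax=ax)
    ax.set_xlabel("Predicted")
    ax.set_ylabel("Actual")
    ax.set_title(f"Confusion Matrix: {model_name}")
    fig.savefig("confusion_matrices.png", bbox_inches="tight")
    plt.close(fig)
```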
Plot ROC Curves
Create a function plot_roc_curve(models_dict, X_test, y_test) that:
- Plots ROC curve for each model on the same figure
- Includes diagonal reference line (random classifier)
- Displays AUC in legend for each model
- Saves as roc_curves.png
def plot_roc_curve(models_dict, X_test, y_test):
    """Plot ROC curves for multiple models."""
    # Must use: roc_curve, auc from sklearn
    # Save to: roc_curves.png
    pass
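A sketch assuming each model in `models_dict` is already fitted and supports `predict_proba`; for models that only expose `decision_function` (e.g. SVC without `probability=True`), use that score instead:

```python
# Hedged sketch of plot_roc_curve.
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

def plot_roc_curve(models_dict, X_test, y_test):
    """Plot ROC curves for multiple models."""
    fig, ax = plt.subplots(figsize=(6, 5))
    for name, model in models_dict.items():
        proba = model.predict_proba(X_test)[:, 1]
        fpr, tpr, _ = roc_curve(y_test, proba)
        ax.plot(fpr, tpr, label=f"{name} (AUC = {auc(fpr, tpr):.3f})")
    ax.plot([0, 1], [0, 1], "k--", label="Random classifier")  # reference line
    ax.set_xlabel("False Positive Rate")
    ax.set_ylabel("True Positive Rate")
    ax.legend(loc="lower right")
    fig.savefig("roc_curves.png", bbox_inches="tight")
    plt.close(fig)
```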
Plot Precision-Recall Curves
Create a function plot_precision_recall_curve(models_dict, X_test, y_test) that:
- Plots PR curve for each model
- Includes baseline (no skill) reference line
- Displays Average Precision in legend
- Saves as pr_curves.png
def plot_precision_recall_curve(models_dict, X_test, y_test):
    """Plot precision-recall curves for multiple models."""
    # Must use: precision_recall_curve, average_precision_score
    # Save to: pr_curves.png
    pass
Compare Models
Create a function compare_models(models_dict, X_train, X_test, y_train, y_test) that:
- Trains each model on training data
- Evaluates on test data using all metrics
- Creates comparison DataFrame sorted by F1-score
- Visualizes comparison as bar chart (saved as model_comparison.png)
def compare_models(models_dict, X_train, X_test, y_train, y_test):
    """Train and compare multiple models."""
    # Evaluate: LogisticRegression, RandomForest, XGBoost, SVM (minimum)
    # Save to: model_comparison.png
    pass
Perform Grid Search
Create a function perform_grid_search(model, param_grid, X_train, y_train, cv=5) that:
- Uses StratifiedKFold for cross-validation
- Optimizes for F1-score (better for imbalanced data)
- Returns best model and best parameters
- Prints search results summary
def perform_grid_search(model, param_grid, X_train, y_train, cv=5):
    """Perform hyperparameter tuning with GridSearchCV."""
    # Must use: GridSearchCV with scoring='f1'
    # Return: best_model, best_params
    pass
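A minimal sketch; `n_jobs=-1` parallelizes the search and is optional:

```python
# Hedged sketch of perform_grid_search.
from sklearn.model_selection import GridSearchCV, StratifiedKFold

def perform_grid_search(model, param_grid, X_train, y_train, cv=5):
    """Perform hyperparameter tuning with GridSearchCV."""
    skf = StratifiedKFold(n_splits=cv, shuffle=True, random_state=42)
    search = GridSearchCV(model, param_grid, scoring="f1", cv=skf, n_jobs=-1)
    search.fit(X_train, y_train)
    print(f"Best F1: {search.best_score_:.3f}  params: {search.best_params_}")
    return search.best_estimator_, search.best_params_
```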
Perform Randomized Search
Create a function perform_randomized_search(model, param_distributions, X_train, y_train, n_iter=50) that:
- Samples n_iter combinations from parameter distributions
- Uses StratifiedKFold for cross-validation
- Compares search time and results with GridSearchCV
- Returns best model and parameters
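A sketch of one approach; timing the fit with `time.perf_counter` gives you the number to compare against GridSearchCV. Continuous distributions such as `scipy.stats.loguniform` are optional but make randomized search more effective than a fixed grid:

```python
# Hedged sketch of perform_randomized_search.
import time
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

def perform_randomized_search(model, param_distributions, X_train, y_train, n_iter=50):
    """Perform hyperparameter tuning with RandomizedSearchCV."""
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    search = RandomizedSearchCV(model, param_distributions, n_iter=n_iter,
                                scoring="f1", cv=skf, random_state=42, n_jobs=-1)
    start = time.perf_counter()
    search.fit(X_train, y_train)
    elapsed = time.perf_counter() - start
    print(f"Randomized search: {elapsed:.1f}s, best F1 = {search.best_score_:.3f}")
    return search.best_estimator_, search.best_params_
```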
Plot Learning Curves
Create a function plot_learning_curve(model, X, y, model_name) that:
- Uses sklearn's learning_curve function
- Plots training and validation scores vs training size
- Adds shaded regions for standard deviation
- Interprets overfitting/underfitting and saves as learning_curves.png
def plot_learning_curve(model, X, y, model_name):
    """Plot learning curves to analyze bias-variance tradeoff."""
    # Must use: learning_curve from sklearn.model_selection
    # Save to: learning_curves.png
    pass
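A sketch of one approach; the F1 scoring, the five training sizes, and `cv=3` are assumptions you can tune. The shaded bands show one standard deviation across folds:

```python
# Hedged sketch of plot_learning_curve.
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve

def plot_learning_curve(model, X, y, model_name):
    """Plot learning curves to analyze bias-variance tradeoff."""
    sizes, train_scores, val_scores = learning_curve(
        model, X, y, cv=3, scoring="f1",
        train_sizes=np.linspace(0.2, 1.0, 5), shuffle=True, random_state=42)
    fig, ax = plt.subplots(figsize=(6, 4))
    for scores, label in [(train_scores, "Training"), (val_scores, "Validation")]:
        mean, std = scores.mean(axis=1), scores.std(axis=1)
        ax.plot(sizes, mean, marker="o", label=f"{label} F1")
        ax.fill_between(sizes, mean - std, mean + std, alpha=0.2)
    ax.set_xlabel("Training set size")
    ax.set_ylabel("F1-score")
    ax.set_title(f"Learning Curve: {model_name}")
    ax.legend()
    fig.savefig("learning_curves.png", bbox_inches="tight")
    plt.close(fig)
```

A large persistent gap between the two curves suggests overfitting; two low, converged curves suggest underfitting.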
Generate Evaluation Report
Create a function generate_evaluation_report(results, best_model_name) that:
- Summarizes all model performances
- Explains why best model was selected
- Includes trade-off analysis (precision vs recall)
- Writes to evaluation_report.txt
Main Program
Create a main() function that:
- Loads and preprocesses the data
- Creates train-test split
- Evaluates at least 4 different models
- Performs hyperparameter tuning on best model
- Generates all visualizations and reports
- Prints final recommendations
def main():
    # Load and preprocess
    X, y, features = load_and_preprocess("credit_fraud.csv")
    X_train, X_test, y_train, y_test = create_train_test_split(X, y)

    # Define models
    models = {
        'Logistic Regression': LogisticRegression(),
        'Random Forest': RandomForestClassifier(),
        'XGBoost': XGBClassifier(),
        'SVM': SVC(probability=True)
    }

    # Compare and evaluate
    results = compare_models(models, X_train, X_test, y_train, y_test)

    # Tune best model
    best_model, best_params = perform_grid_search(...)

    # Generate outputs
    generate_evaluation_report(results, best_model_name)
    print("Model evaluation complete!")

if __name__ == "__main__":
    main()
Submission
Create a public GitHub repository with the exact name shown below:
Required Repository Name
fraud-detection-evaluation
Required Files
fraud-detection-evaluation/
├── fraud_evaluation.ipynb # Your Jupyter Notebook with ALL 13 functions
├── credit_fraud.csv # Synthetic dataset (10,000+ transactions)
├── confusion_matrices.png # Confusion matrices for all models
├── roc_curves.png # ROC curves comparing all models
├── pr_curves.png # Precision-Recall curves for all models
├── learning_curves.png # Learning curves for best model
├── model_comparison.png # Bar chart comparing all models
├── evaluation_report.txt # Detailed evaluation report
└── README.md # REQUIRED - see contents below
README.md Must Include:
- Your full name and submission date
- Summary of all models evaluated and their metrics
- Your best model selection and reasoning
- Discussion of precision vs recall tradeoff for fraud detection
- Instructions to run your notebook
Do Include
- All 13 functions implemented and working
- At least 4 different models evaluated
- Stratified sampling for imbalanced data
- All visualizations saved as PNG files
- Hyperparameter tuning with GridSearchCV
- README.md with all required sections
Do Not Include
- Accuracy as the only metric (use F1, AUC, etc.)
- Non-stratified splits for imbalanced data
- Any .pyc or __pycache__ files
- Virtual environment folders
- Code that doesn't run without errors
- Hardcoded results without reproducible code
Grading Rubric
Your assignment will be graded on the following criteria:
| Criteria | Points | Description |
|---|---|---|
| Function Implementation | 80 | All 13 functions correctly implemented with proper logic and return types |
| Cross-Validation | 25 | Proper stratified k-fold implementation with multiple metrics |
| Visualizations | 35 | Clear, properly labeled plots (confusion matrix, ROC, PR, learning curves) |
| Hyperparameter Tuning | 25 | Effective use of GridSearchCV or RandomizedSearchCV |
| Evaluation Report | 20 | Comprehensive analysis with justified model recommendations |
| Code Quality | 15 | Docstrings, comments, naming conventions, clean organization |
| Total | 200 | |
Ready to Submit?
Make sure you have completed all requirements and reviewed the grading rubric above.
What You Will Practice
Cross-Validation Techniques
Implementing stratified k-fold CV for reliable performance estimates on imbalanced datasets
Performance Metrics
Beyond accuracy: precision, recall, F1-score, ROC-AUC, PR curves, Matthews Correlation Coefficient
Hyperparameter Optimization
Using GridSearchCV and RandomizedSearchCV to find optimal model parameters
Bias-Variance Analysis
Learning curves to diagnose overfitting/underfitting and determine optimal training size
Pro Tips
Why F1-Score for Imbalanced Data?
- Accuracy is misleading with imbalanced classes
- On a dataset with 3% fraud, a model predicting all "not fraud" gets 97% accuracy
- F1-score balances precision and recall
- Use weighted F1 for multi-class problems
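The accuracy trap above is easy to verify in a couple of lines:

```python
# Demonstration: on a 97/3 class split, the "always not fraud" model
# scores high accuracy but zero F1.
from sklearn.metrics import accuracy_score, f1_score

y_true = [0] * 97 + [1] * 3    # 3% fraud
y_pred = [0] * 100             # model predicts "not fraud" everywhere

print(accuracy_score(y_true, y_pred))               # 0.97
print(f1_score(y_true, y_pred, zero_division=0))    # 0.0
```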
PR Curves > ROC for Fraud
- ROC can be overly optimistic for rare events
- PR curves focus on the minority class
- Average Precision summarizes PR curve
- Use both for complete picture
Stratified Sampling is Key
- Preserves class distribution in train/test
- Prevents all-minority or no-minority splits
- Use StratifiedKFold for cross-validation
- Essential for datasets with <10% minority class
Learning Curve Interpretation
- Train >> Validation score: overfitting
- Both scores low: underfitting
- Converging scores: good fit
- Gap persists: need regularization