Assignment 5-A

Model Evaluation & Validation

Build a comprehensive model evaluation pipeline for credit card fraud detection. Apply cross-validation, calculate metrics, create visualizations, tune hyperparameters, and analyze bias-variance tradeoff through learning curves.

4-5 hours
Intermediate
200 Points
What You'll Practice
  • Implement k-fold cross-validation
  • Calculate classification metrics
  • Plot ROC and Precision-Recall curves
  • Tune hyperparameters with GridSearch
  • Analyze bias-variance tradeoff
Contents
01

Assignment Overview

In this assignment, you will build a complete model evaluation pipeline for fraud detection. The project requires you to apply the core model evaluation concepts: cross-validation, performance metrics, visualization techniques, hyperparameter tuning, and bias-variance analysis, all essential skills for any machine learning practitioner.

Imbalanced Data Focus: Fraud detection datasets are highly imbalanced (~2-3% fraud rate). You must use appropriate techniques like stratified sampling and evaluate with metrics beyond accuracy (F1-score, AUC, PR curves).
Skills Applied: This assignment tests your understanding of cross-validation techniques, classification metrics, ROC/PR curves, GridSearchCV, and learning curves from Module 5.
Cross-Validation

Stratified K-Fold, reliable performance estimates, variance reduction

Metrics & Visualization

Confusion matrix, ROC curves, PR curves, F1-score, Matthews Correlation

Hyperparameter Tuning

GridSearchCV, RandomizedSearchCV, learning curves, bias-variance

02

The Scenario

FinSecure Bank - Fraud Detection Evaluation

You have been hired as a Machine Learning Engineer at FinSecure Bank. The data science team has developed several ML models for credit card fraud detection, but they need a rigorous evaluation framework. The Head of Risk Management has given you this task:

"We have multiple fraud detection models but don't know which one performs best in production. Accuracy isn't enough - we need to understand precision vs recall tradeoffs. Can you build a comprehensive evaluation pipeline that tells us which model to deploy and why?"

Your Task

Create a Jupyter Notebook called fraud_evaluation.ipynb that implements a complete model evaluation framework. Your code must train multiple models, evaluate them using various metrics, visualize performance, tune hyperparameters, and generate a detailed recommendation report.

03

The Dataset

Create a synthetic credit card fraud dataset (credit_fraud.csv) with the following structure:

File: credit_fraud.csv (Credit Card Transactions)

transaction_id,amount,time_of_day,day_of_week,distance_from_home,distance_from_last_transaction,ratio_to_median_purchase,repeat_retailer,used_chip,used_pin,online_order,is_fraud
1,125.50,14,2,15.3,5.2,1.2,1,1,0,0,0
2,5420.00,3,5,250.8,180.5,8.5,0,0,0,1,1
3,45.00,10,1,2.1,0.5,0.4,1,1,1,0,0
...
Columns Explained
  • transaction_id - Unique transaction identifier (integer)
  • amount - Transaction amount in USD (float)
  • time_of_day - Hour of transaction, 0-23 (integer)
  • day_of_week - Day of week, 0-6 (integer)
  • distance_from_home - Distance from cardholder's home in miles (float)
  • distance_from_last_transaction - Distance from previous transaction in miles (float)
  • ratio_to_median_purchase - Ratio of amount to cardholder's median purchase (float)
  • repeat_retailer - Whether retailer was used before, 0/1 (binary)
  • used_chip - Whether chip was used, 0/1 (binary)
  • used_pin - Whether PIN was used, 0/1 (binary)
  • online_order - Whether it was an online order, 0/1 (binary)
  • is_fraud - Target: Fraudulent transaction, 0/1 (binary)
Dataset Requirements: Generate at least 10,000 transactions with approximately 2-3% fraud rate to simulate real-world imbalanced data. Use numpy's random functions with a seed for reproducibility.
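One possible sketch of a generator meeting these requirements is shown below; the specific column distributions (gamma-distributed amounts, exponential distances, the fraud multipliers) and the `make_fraud_data` name are illustrative assumptions, not part of the brief.

```python
# Hypothetical sketch: synthetic fraud dataset with a seeded RNG.
# Distributions below are assumptions chosen to mimic the sample rows.
import numpy as np
import pandas as pd

def make_fraud_data(n=10_000, fraud_rate=0.025, seed=42):
    rng = np.random.default_rng(seed)
    is_fraud = (rng.random(n) < fraud_rate).astype(int)
    # Fraudulent transactions skew toward larger amounts and distances.
    df = pd.DataFrame({
        "transaction_id": np.arange(1, n + 1),
        "amount": np.round(rng.gamma(2.0, 60.0, n) * np.where(is_fraud, 5.0, 1.0), 2),
        "time_of_day": rng.integers(0, 24, n),
        "day_of_week": rng.integers(0, 7, n),
        "distance_from_home": np.round(rng.exponential(20, n) * np.where(is_fraud, 8, 1), 1),
        "distance_from_last_transaction": np.round(rng.exponential(10, n) * np.where(is_fraud, 10, 1), 1),
        "ratio_to_median_purchase": np.round(rng.gamma(2.0, 0.8, n) * np.where(is_fraud, 4, 1), 2),
        "repeat_retailer": rng.integers(0, 2, n),
        "used_chip": np.where(is_fraud, rng.random(n) < 0.2, rng.random(n) < 0.8).astype(int),
        "used_pin": rng.integers(0, 2, n),
        "online_order": np.where(is_fraud, rng.random(n) < 0.8, rng.random(n) < 0.4).astype(int),
        "is_fraud": is_fraud,
    })
    return df

df = make_fraud_data()
df.to_csv("credit_fraud.csv", index=False)
```

Because the RNG is seeded, regenerating the file always yields the same transactions, which keeps your metrics reproducible across runs.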
04

Requirements

Your fraud_evaluation.ipynb must implement ALL of the following functions. Each function is mandatory and will be tested individually.

1
Load and Preprocess Data

Create a function load_and_preprocess(filepath) that:

  • Loads the CSV file and handles missing values
  • Scales numerical features using StandardScaler
  • Splits into features (X) and target (y)
  • Returns X, y, and feature names
def load_and_preprocess(filepath):
    """Load data and prepare for modeling."""
    # Must use: StandardScaler for numerical features
    # Return: X (scaled features), y (target), feature_names
    pass
2
Create Stratified Train-Test Split

Create a function create_train_test_split(X, y, test_size=0.2, stratify=True) that:

  • Uses stratified sampling if stratify=True to preserve class distribution
  • Sets random_state for reproducibility
  • Prints class distribution before and after split
  • Returns X_train, X_test, y_train, y_test
def create_train_test_split(X, y, test_size=0.2, stratify=True):
    """Create stratified train-test split."""
    # Must use: train_test_split with stratify parameter
    pass
3
Perform Cross-Validation

Create a function perform_cross_validation(model, X, y, cv=5) that:

  • Uses StratifiedKFold for imbalanced data
  • Calculates multiple metrics per fold (accuracy, precision, recall, F1, ROC-AUC)
  • Returns mean and std for each metric
  • Prints formatted results table
def perform_cross_validation(model, X, y, cv=5):
    """Perform stratified k-fold cross-validation."""
    # Must use: StratifiedKFold, cross_val_score or manual fold iteration
    # Return: dict with metric names and (mean, std) tuples
    pass
4
Calculate Comprehensive Metrics

Create a function calculate_metrics(y_true, y_pred, y_proba=None) that:

  • Calculates accuracy, precision, recall, F1-score
  • Calculates ROC-AUC if probabilities provided
  • Calculates Matthews Correlation Coefficient (MCC)
  • Returns dictionary of all metrics
def calculate_metrics(y_true, y_pred, y_proba=None):
    """Calculate comprehensive classification metrics."""
    # Must include: accuracy, precision, recall, f1, roc_auc, mcc
    pass
5
Plot Confusion Matrix

Create a function plot_confusion_matrix(y_true, y_pred, model_name) that:

  • Uses seaborn heatmap with annotations
  • Shows both counts and percentages
  • Labels axes clearly (Predicted vs Actual)
  • Saves as PNG file
def plot_confusion_matrix(y_true, y_pred, model_name):
    """Create and visualize confusion matrix."""
    # Must use: confusion_matrix from sklearn, seaborn heatmap
    # Save to: confusion_matrices.png
    pass
6
Plot ROC Curves

Create a function plot_roc_curve(models_dict, X_test, y_test) that:

  • Plots ROC curve for each model on the same figure
  • Includes diagonal reference line (random classifier)
  • Displays AUC in legend for each model
  • Saves as roc_curves.png
def plot_roc_curve(models_dict, X_test, y_test):
    """Plot ROC curves for multiple models."""
    # Must use: roc_curve, auc from sklearn
    # Save to: roc_curves.png
    pass
7
Plot Precision-Recall Curves

Create a function plot_precision_recall_curve(models_dict, X_test, y_test) that:

  • Plots PR curve for each model
  • Includes baseline (no skill) reference line
  • Displays Average Precision in legend
  • Saves as pr_curves.png
def plot_precision_recall_curve(models_dict, X_test, y_test):
    """Plot precision-recall curves for multiple models."""
    # Must use: precision_recall_curve, average_precision_score
    # Save to: pr_curves.png
    pass
8
Compare Models

Create a function compare_models(models_dict, X_train, X_test, y_train, y_test) that:

  • Trains each model on training data
  • Evaluates on test data using all metrics
  • Creates comparison DataFrame sorted by F1-score
  • Visualizes comparison as bar chart (saved as model_comparison.png)
def compare_models(models_dict, X_train, X_test, y_train, y_test):
    """Train and compare multiple models."""
    # Evaluate: LogisticRegression, RandomForest, XGBoost, SVM (minimum)
    # Save to: model_comparison.png
    pass
9
Perform Grid Search

Create a function perform_grid_search(model, param_grid, X_train, y_train, cv=5) that:

  • Uses StratifiedKFold for cross-validation
  • Optimizes for F1-score (better for imbalanced data)
  • Returns best model and best parameters
  • Prints search results summary
def perform_grid_search(model, param_grid, X_train, y_train, cv=5):
    """Perform hyperparameter tuning with GridSearchCV."""
    # Must use: GridSearchCV with scoring='f1'
    # Return: best_model, best_params
    pass
10
Perform Randomized Search

Create a function perform_randomized_search(model, param_distributions, X_train, y_train, n_iter=50) that:

  • Samples n_iter combinations from parameter distributions
  • Uses StratifiedKFold cross-validation
  • Compares search time and results with GridSearch
  • Returns best model and parameters
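A skeleton matching the pattern of the other requirements (same signature as specified above) might look like:

```python
def perform_randomized_search(model, param_distributions, X_train, y_train, n_iter=50):
    """Perform hyperparameter tuning with RandomizedSearchCV."""
    # Must use: RandomizedSearchCV with StratifiedKFold, scoring='f1'
    # Return: best_model, best_params
    pass
```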
11
Plot Learning Curves

Create a function plot_learning_curve(model, X, y, model_name) that:

  • Uses sklearn's learning_curve function
  • Plots training and validation scores vs training size
  • Adds shaded regions for standard deviation
  • Interprets overfitting/underfitting and saves as learning_curves.png
def plot_learning_curve(model, X, y, model_name):
    """Plot learning curves to analyze bias-variance tradeoff."""
    # Must use: learning_curve from sklearn.model_selection
    # Save to: learning_curves.png
    pass
12
Generate Evaluation Report

Create a function generate_evaluation_report(results, best_model_name) that:

  • Summarizes all model performances
  • Explains why best model was selected
  • Includes trade-off analysis (precision vs recall)
  • Writes to evaluation_report.txt
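A skeleton matching the pattern of the other requirements (same signature as specified above) might look like:

```python
def generate_evaluation_report(results, best_model_name):
    """Summarize model performances and justify the recommendation."""
    # Must include: per-model metrics, precision vs recall trade-off analysis
    # Save to: evaluation_report.txt
    pass
```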
13
Main Program

Create a main() function that:

  • Loads and preprocesses the data
  • Creates train-test split
  • Evaluates at least 4 different models
  • Performs hyperparameter tuning on best model
  • Generates all visualizations and reports
  • Prints final recommendations
def main():
    # Load and preprocess
    X, y, features = load_and_preprocess("credit_fraud.csv")
    X_train, X_test, y_train, y_test = create_train_test_split(X, y)
    
    # Define models
    models = {
        'Logistic Regression': LogisticRegression(),
        'Random Forest': RandomForestClassifier(),
        'XGBoost': XGBClassifier(),
        'SVM': SVC(probability=True)
    }
    
    # Compare and evaluate
    results = compare_models(models, X_train, X_test, y_train, y_test)
    
    # Tune best model
    best_model, best_params = perform_grid_search(...)
    
    # Generate outputs
    generate_evaluation_report(results, best_model_name)
    
    print("Model evaluation complete!")

if __name__ == "__main__":
    main()
05

Submission

Create a public GitHub repository with the exact name shown below:

Required Repository Name
fraud-detection-evaluation
github.com/<your-username>/fraud-detection-evaluation
Required Files
fraud-detection-evaluation/
├── fraud_evaluation.ipynb    # Your Jupyter Notebook with ALL 13 functions
├── credit_fraud.csv          # Synthetic dataset (10,000+ transactions)
├── confusion_matrices.png    # Confusion matrices for all models
├── roc_curves.png            # ROC curves comparing all models
├── pr_curves.png             # Precision-Recall curves for all models
├── learning_curves.png       # Learning curves for best model
├── model_comparison.png      # Bar chart comparing all models
├── evaluation_report.txt     # Detailed evaluation report
└── README.md                 # REQUIRED - see contents below
README.md Must Include:
  • Your full name and submission date
  • Summary of all models evaluated and their metrics
  • Your best model selection and reasoning
  • Discussion of precision vs recall tradeoff for fraud detection
  • Instructions to run your notebook
Do Include
  • All 13 functions implemented and working
  • At least 4 different models evaluated
  • Stratified sampling for imbalanced data
  • All visualizations saved as PNG files
  • Hyperparameter tuning with GridSearchCV
  • README.md with all required sections
Do Not Include
  • Accuracy as the only metric (use F1, AUC, etc.)
  • Non-stratified splits for imbalanced data
  • Any .pyc or __pycache__ files
  • Virtual environment folders
  • Code that doesn't run without errors
  • Hardcoded results without reproducible code
Important: Before submitting, run all cells in your notebook to ensure it executes without errors and generates all output files correctly!
Submit Your Assignment

Enter your GitHub username - we'll verify your repository automatically

06

Grading Rubric

Your assignment will be graded on the following criteria:

Criteria                   Points   Description
Function Implementation    80       All 13 functions correctly implemented with proper logic and return types
Cross-Validation           25       Proper stratified k-fold implementation with multiple metrics
Visualizations             35       Clear, properly labeled plots (confusion matrix, ROC, PR, learning curves)
Hyperparameter Tuning      25       Effective use of GridSearchCV or RandomizedSearchCV
Evaluation Report          20       Comprehensive analysis with justified model recommendations
Code Quality               15       Docstrings, comments, naming conventions, clean organization
Total                      200

Ready to Submit?

Make sure you have completed all requirements and reviewed the grading rubric above.

07

What You Will Practice

Cross-Validation Techniques

Implementing stratified k-fold CV for reliable performance estimates on imbalanced datasets

Performance Metrics

Beyond accuracy: precision, recall, F1-score, ROC-AUC, PR curves, Matthews Correlation Coefficient

Hyperparameter Optimization

Using GridSearchCV and RandomizedSearchCV to find optimal model parameters

Bias-Variance Analysis

Learning curves to diagnose overfitting/underfitting and determine optimal training size

08

Pro Tips

Why F1-Score for Imbalanced Data?
  • Accuracy is misleading with imbalanced classes
  • A model that predicts "not fraud" for every transaction still scores ~97% accuracy
  • F1-score balances precision and recall
  • Use weighted F1 for multi-class problems
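The points above can be demonstrated in a few lines; the 3% fraud rate below is a toy illustration matching this assignment's dataset:

```python
# Why accuracy misleads on imbalanced data: the "always not fraud" model.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

y_true = np.array([0] * 970 + [1] * 30)   # 970 legit, 30 fraud (3%)
y_pred = np.zeros_like(y_true)            # model that never flags fraud

print(accuracy_score(y_true, y_pred))             # 0.97 - looks great
print(f1_score(y_true, y_pred, zero_division=0))  # 0.0  - catches nothing
```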
PR Curves > ROC for Fraud
  • ROC can be overly optimistic for rare events
  • PR curves focus on the minority class
  • Average Precision summarizes PR curve
  • Use both for complete picture
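For the PR plot, a reasonable "no skill" reference is the positive-class prevalence: a random classifier's precision equals the fraud rate. A minimal sketch (3% prevalence assumed for illustration):

```python
# The "no skill" PR baseline is the positive-class prevalence.
import numpy as np

y_test = np.array([0] * 970 + [1] * 30)
baseline = y_test.mean()   # 0.03: precision of random guessing
# In plot_precision_recall_curve, draw it as a horizontal reference line:
# plt.axhline(baseline, linestyle="--", label=f"No skill ({baseline:.2f})")
```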
Stratified Sampling is Key
  • Preserves class distribution in train/test
  • Prevents all-minority or no-minority splits
  • Use StratifiedKFold for cross-validation
  • Essential for datasets with <10% minority class
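The preservation of class distribution is easy to verify directly; this toy check uses exactly 3% positives so every fold's rate comes out identical:

```python
# StratifiedKFold keeps the fraud rate the same in every fold.
import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([0] * 970 + [1] * 30)   # 3% positives
X = np.zeros((len(y), 1))            # features irrelevant to the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_rates = [y[test_idx].mean() for _, test_idx in skf.split(X, y)]
print(fold_rates)                    # every fold holds ~3% positives
```

A plain KFold on the same data can produce folds with far more or far fewer positives, which is exactly the variance stratification removes.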
Learning Curve Interpretation
  • Train >> Validation score: overfitting
  • Both scores low: underfitting
  • Converging scores: good fit
  • Gap persists: need regularization
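These diagnoses can be read off the arrays that `learning_curve` returns; the sketch below uses a synthetic imbalanced dataset via `make_classification`, with illustrative parameters:

```python
# Reading bias vs variance from learning_curve output.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=600, weights=[0.9], random_state=0)
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=[0.3, 0.6, 1.0], cv=5, scoring="f1")

gap = train_scores.mean(axis=1) - val_scores.mean(axis=1)
# Large, persistent gap -> overfitting; both curves low -> underfitting;
# curves converging at a high score -> good fit.
```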
09

Pre-Submission Checklist

Code Requirements
Repository Requirements