Assignment Overview
In this assignment, you will build a complete Model Evaluation Pipeline for fraud detection. This project requires you to apply core model evaluation concepts: cross-validation, performance metrics, visualization techniques, hyperparameter tuning, and bias-variance analysis, all essential skills for any machine learning practitioner.
Cross-Validation
Stratified K-Fold, reliable performance estimates, variance reduction
Metrics & Visualization
Confusion matrix, ROC curves, PR curves, F1-score, Matthews Correlation
Hyperparameter Tuning
GridSearchCV, RandomizedSearchCV, learning curves, bias-variance
The Scenario
FinSecure Bank - Fraud Detection Evaluation
You have been hired as a Machine Learning Engineer at FinSecure Bank. The data science team has developed several ML models for credit card fraud detection, but they need a rigorous evaluation framework. The Head of Risk Management has given you this task:
"We have multiple fraud detection models but don't know which one performs best in production. Accuracy isn't enough - we need to understand precision vs recall tradeoffs. Can you build a comprehensive evaluation pipeline that tells us which model to deploy and why?"
Your Task
Create a Jupyter Notebook called fraud_evaluation.ipynb that implements a complete
model evaluation framework. Your code must train multiple models, evaluate them using various metrics,
visualize performance, tune hyperparameters, and generate a detailed recommendation report.
The Dataset
Create a synthetic credit card fraud dataset (credit_fraud.csv) with the following structure:
File: credit_fraud.csv (Credit Card Transactions)
transaction_id,amount,time_of_day,day_of_week,distance_from_home,distance_from_last_transaction,ratio_to_median_purchase,repeat_retailer,used_chip,used_pin,online_order,is_fraud
1,125.50,14,2,15.3,5.2,1.2,1,1,0,0,0
2,5420.00,3,5,250.8,180.5,8.5,0,0,0,1,1
3,45.00,10,1,2.1,0.5,0.4,1,1,1,0,0
...
Columns Explained
- transaction_id - Unique transaction identifier (integer)
- amount - Transaction amount in USD (float)
- time_of_day - Hour of transaction, 0-23 (integer)
- day_of_week - Day of week, 0-6 (integer)
- distance_from_home - Distance from cardholder's home in miles (float)
- distance_from_last_transaction - Distance from previous transaction in miles (float)
- ratio_to_median_purchase - Ratio of amount to cardholder's median purchase (float)
- repeat_retailer - Whether retailer was used before, 0/1 (binary)
- used_chip - Whether chip was used, 0/1 (binary)
- used_pin - Whether PIN was used, 0/1 (binary)
- online_order - Whether it was an online order, 0/1 (binary)
- is_fraud - Target: fraudulent transaction, 0/1 (binary)
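One possible way to generate a dataset with this schema is sketched below. The ~3% fraud rate and the specific feature distributions are assumptions for illustration, not part of the assignment spec; any generator that produces the required columns and a realistic class imbalance is acceptable.

```python
# Sketch: synthesize a fraud dataset with the required columns.
# Fraudulent rows are skewed toward larger amounts and distances (assumption).
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 10_000
is_fraud = (rng.random(n) < 0.03).astype(int)  # ~3% fraud rate

df = pd.DataFrame({
    "transaction_id": np.arange(1, n + 1),
    "amount": np.round(rng.lognormal(4.0 + 1.5 * is_fraud, 1.0, n), 2),
    "time_of_day": rng.integers(0, 24, n),
    "day_of_week": rng.integers(0, 7, n),
    "distance_from_home": np.round(rng.exponential(10 + 40 * is_fraud, n), 1),
    "distance_from_last_transaction": np.round(rng.exponential(5 + 30 * is_fraud, n), 1),
    "ratio_to_median_purchase": np.round(rng.exponential(1 + 3 * is_fraud, n), 2),
    "repeat_retailer": rng.integers(0, 2, n),
    "used_chip": rng.integers(0, 2, n),
    "used_pin": rng.integers(0, 2, n),
    "online_order": rng.integers(0, 2, n),
    "is_fraud": is_fraud,
})
df.to_csv("credit_fraud.csv", index=False)
```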
Requirements
Your fraud_evaluation.ipynb must implement ALL of the following functions.
Each function is mandatory and will be tested individually.
Load and Preprocess Data
Create a function load_and_preprocess(filepath) that:
- Loads the CSV file and handles missing values
- Scales numerical features using StandardScaler
- Splits into features (X) and target (y)
- Returns X, y, and feature names
def load_and_preprocess(filepath):
    """Load data and prepare for modeling."""
    # Must use: StandardScaler for numerical features
    # Return: X (scaled features), y (target), feature_names
    pass
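A minimal sketch of one possible implementation, assuming the column layout shown above; dropping rows with missing values is the simplest strategy, and imputation is a reasonable alternative:

```python
# Hedged sketch of load_and_preprocess; not the reference solution.
import pandas as pd
from sklearn.preprocessing import StandardScaler

NUMERIC_COLS = ["amount", "time_of_day", "day_of_week", "distance_from_home",
                "distance_from_last_transaction", "ratio_to_median_purchase"]

def load_and_preprocess(filepath):
    """Load data and prepare for modeling."""
    df = pd.read_csv(filepath).dropna()   # simplest missing-value strategy
    X = df.drop(columns=["transaction_id", "is_fraud"]).copy()
    y = df["is_fraud"].to_numpy()
    # Scale only the numeric columns; binary flags stay 0/1
    X[NUMERIC_COLS] = StandardScaler().fit_transform(X[NUMERIC_COLS])
    return X.to_numpy(), y, list(X.columns)
```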
Create Stratified Train-Test Split
Create a function create_train_test_split(X, y, test_size=0.2, stratify=True) that:
- Uses stratified sampling if stratify=True to preserve class distribution
- Sets random_state for reproducibility
- Prints class distribution before and after split
- Returns X_train, X_test, y_train, y_test
def create_train_test_split(X, y, test_size=0.2, stratify=True):
    """Create stratified train-test split."""
    # Must use: train_test_split with stratify parameter
    pass
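A sketch under the stated requirements; `random_state=42` is an assumption, any fixed seed satisfies the reproducibility requirement:

```python
# Hedged sketch of create_train_test_split.
import numpy as np
from sklearn.model_selection import train_test_split

def create_train_test_split(X, y, test_size=0.2, stratify=True):
    """Create stratified train-test split."""
    y = np.asarray(y)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=42,
        stratify=y if stratify else None)
    # Report the class distribution before and after the split
    for name, labels in [("full", y), ("train", y_train), ("test", y_test)]:
        print(f"{name:>5} fraud rate: {labels.mean():.3f}")
    return X_train, X_test, y_train, y_test
```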
Perform Cross-Validation
Create a function perform_cross_validation(model, X, y, cv=5) that:
- Uses StratifiedKFold for imbalanced data
- Calculates multiple metrics per fold (accuracy, precision, recall, F1, ROC-AUC)
- Returns mean and std for each metric
- Prints formatted results table
def perform_cross_validation(model, X, y, cv=5):
    """Perform stratified k-fold cross-validation."""
    # Must use: StratifiedKFold, cross_val_score or manual fold iteration
    # Return: dict with metric names and (mean, std) tuples
    pass
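One possible implementation using sklearn's `cross_validate`, which evaluates several scorers in one pass; the exact table formatting is up to you:

```python
# Hedged sketch of perform_cross_validation.
from sklearn.model_selection import StratifiedKFold, cross_validate

def perform_cross_validation(model, X, y, cv=5):
    """Perform stratified k-fold cross-validation."""
    skf = StratifiedKFold(n_splits=cv, shuffle=True, random_state=42)
    scoring = ["accuracy", "precision", "recall", "f1", "roc_auc"]
    scores = cross_validate(model, X, y, cv=skf, scoring=scoring)
    results = {}
    print(f"{'metric':<10} {'mean':>8} {'std':>8}")
    for metric in scoring:
        vals = scores[f"test_{metric}"]
        results[metric] = (vals.mean(), vals.std())
        print(f"{metric:<10} {vals.mean():>8.3f} {vals.std():>8.3f}")
    return results
```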
Calculate Comprehensive Metrics
Create a function calculate_metrics(y_true, y_pred, y_proba=None) that:
- Calculates accuracy, precision, recall, F1-score
- Calculates ROC-AUC if probabilities provided
- Calculates Matthews Correlation Coefficient (MCC)
- Returns dictionary of all metrics
def calculate_metrics(y_true, y_pred, y_proba=None):
    """Calculate comprehensive classification metrics."""
    # Must include: accuracy, precision, recall, f1, roc_auc, mcc
    pass
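A minimal sketch; `zero_division=0` guards against folds where the model predicts no positives:

```python
# Hedged sketch of calculate_metrics.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, matthews_corrcoef)

def calculate_metrics(y_true, y_pred, y_proba=None):
    """Calculate comprehensive classification metrics."""
    metrics = {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
        "mcc": matthews_corrcoef(y_true, y_pred),
    }
    if y_proba is not None:  # ROC-AUC needs scores, not hard labels
        metrics["roc_auc"] = roc_auc_score(y_true, y_proba)
    return metrics
```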
Plot Confusion Matrix
Create a function plot_confusion_matrix(y_true, y_pred, model_name) that:
- Uses seaborn heatmap with annotations
- Shows both counts and percentages
- Labels axes clearly (Predicted vs Actual)
- Saves as PNG file
def plot_confusion_matrix(y_true, y_pred, model_name):
    """Create and visualize confusion matrix."""
    # Must use: confusion_matrix from sklearn, seaborn heatmap
    # Save to: confusion_matrices.png
    pass
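A sketch of one approach, assuming seaborn is installed; string annotations let each cell show the count and the percentage together. The `Agg` backend is used so the figure saves without a display:

```python
# Hedged sketch of plot_confusion_matrix.
import matplotlib
matplotlib.use("Agg")  # headless backend so savefig works everywhere
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.metrics import confusion_matrix

def plot_confusion_matrix(y_true, y_pred, model_name):
    """Create and visualize confusion matrix."""
    cm = confusion_matrix(y_true, y_pred)
    pct = cm / cm.sum() * 100
    # Combine count and percentage into one annotation per cell
    labels = np.array([[f"{c}\n({p:.1f}%)" for c, p in zip(rc, rp)]
                       for rc, rp in zip(cm, pct)])
    fig, ax = plt.subplots(figsize=(5, 4))
    sns.heatmap(cm, annot=labels, fmt="", cmap="Blues", ax=ax)
    ax.set_xlabel("Predicted")
    ax.set_ylabel("Actual")
    ax.set_title(f"Confusion Matrix: {model_name}")
    fig.savefig("confusion_matrices.png", bbox_inches="tight")
    plt.close(fig)
```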
Plot ROC Curves
Create a function plot_roc_curve(models_dict, X_test, y_test) that:
- Plots ROC curve for each model on the same figure
- Includes diagonal reference line (random classifier)
- Displays AUC in legend for each model
- Saves as roc_curves.png
def plot_roc_curve(models_dict, X_test, y_test):
    """Plot ROC curves for multiple models."""
    # Must use: roc_curve, auc from sklearn
    # Save to: roc_curves.png
    pass
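A sketch assuming each model in `models_dict` is already fitted and supports `predict_proba`; for models that only expose `decision_function` (e.g. SVC without `probability=True`), use that score instead:

```python
# Hedged sketch of plot_roc_curve.
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

def plot_roc_curve(models_dict, X_test, y_test):
    """Plot ROC curves for multiple models."""
    fig, ax = plt.subplots(figsize=(6, 5))
    for name, model in models_dict.items():
        proba = model.predict_proba(X_test)[:, 1]
        fpr, tpr, _ = roc_curve(y_test, proba)
        ax.plot(fpr, tpr, label=f"{name} (AUC = {auc(fpr, tpr):.3f})")
    ax.plot([0, 1], [0, 1], "k--", label="Random classifier")  # reference line
    ax.set_xlabel("False Positive Rate")
    ax.set_ylabel("True Positive Rate")
    ax.legend(loc="lower right")
    fig.savefig("roc_curves.png", bbox_inches="tight")
    plt.close(fig)
```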
Plot Precision-Recall Curves
Create a function plot_precision_recall_curve(models_dict, X_test, y_test) that:
- Plots PR curve for each model
- Includes baseline (no skill) reference line
- Displays Average Precision in legend
- Saves as pr_curves.png
def plot_precision_recall_curve(models_dict, X_test, y_test):
    """Plot precision-recall curves for multiple models."""
    # Must use: precision_recall_curve, average_precision_score
    # Save to: pr_curves.png
    pass
Compare Models
Create a function compare_models(models_dict, X_train, X_test, y_train, y_test) that:
- Trains each model on training data
- Evaluates on test data using all metrics
- Creates comparison DataFrame sorted by F1-score
- Visualizes comparison as bar chart (saved as model_comparison.png)
def compare_models(models_dict, X_train, X_test, y_train, y_test):
    """Train and compare multiple models."""
    # Evaluate: LogisticRegression, RandomForest, XGBoost, SVM (minimum)
    # Save to: model_comparison.png
    pass
Perform Grid Search
Create a function perform_grid_search(model, param_grid, X_train, y_train, cv=5) that:
- Uses StratifiedKFold for cross-validation
- Optimizes for F1-score (better for imbalanced data)
- Returns best model and best parameters
- Prints search results summary
def perform_grid_search(model, param_grid, X_train, y_train, cv=5):
    """Perform hyperparameter tuning with GridSearchCV."""
    # Must use: GridSearchCV with scoring='f1'
    # Return: best_model, best_params
    pass
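A minimal sketch; `n_jobs=-1` parallelizes the search and is optional:

```python
# Hedged sketch of perform_grid_search.
from sklearn.model_selection import GridSearchCV, StratifiedKFold

def perform_grid_search(model, param_grid, X_train, y_train, cv=5):
    """Perform hyperparameter tuning with GridSearchCV."""
    skf = StratifiedKFold(n_splits=cv, shuffle=True, random_state=42)
    search = GridSearchCV(model, param_grid, scoring="f1", cv=skf, n_jobs=-1)
    search.fit(X_train, y_train)
    print(f"Best F1: {search.best_score_:.3f}  params: {search.best_params_}")
    return search.best_estimator_, search.best_params_
```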
Perform Randomized Search
Create a function perform_randomized_search(model, param_distributions, X_train, y_train, n_iter=50) that:
- Samples n_iter combinations from parameter distributions
- Uses StratifiedKFold for cross-validation
- Compares search time and results with GridSearchCV
- Returns best model and parameters
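A sketch of one approach; timing the fit with `time.perf_counter` gives you the number to compare against GridSearchCV. Continuous distributions such as `scipy.stats.loguniform` are optional but make randomized search more effective than a fixed grid:

```python
# Hedged sketch of perform_randomized_search.
import time
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

def perform_randomized_search(model, param_distributions, X_train, y_train, n_iter=50):
    """Perform hyperparameter tuning with RandomizedSearchCV."""
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    search = RandomizedSearchCV(model, param_distributions, n_iter=n_iter,
                                scoring="f1", cv=skf, random_state=42, n_jobs=-1)
    start = time.perf_counter()
    search.fit(X_train, y_train)
    elapsed = time.perf_counter() - start
    print(f"Randomized search: {elapsed:.1f}s, best F1 = {search.best_score_:.3f}")
    return search.best_estimator_, search.best_params_
```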
Plot Learning Curves
Create a function plot_learning_curve(model, X, y, model_name) that:
- Uses sklearn's learning_curve function
- Plots training and validation scores vs training size
- Adds shaded regions for standard deviation
- Interprets overfitting/underfitting and saves as learning_curves.png
def plot_learning_curve(model, X, y, model_name):
    """Plot learning curves to analyze bias-variance tradeoff."""
    # Must use: learning_curve from sklearn.model_selection
    # Save to: learning_curves.png
    pass
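A sketch of one approach; the F1 scoring, the five training sizes, and `cv=3` are assumptions you can tune. The shaded bands show one standard deviation across folds:

```python
# Hedged sketch of plot_learning_curve.
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve

def plot_learning_curve(model, X, y, model_name):
    """Plot learning curves to analyze bias-variance tradeoff."""
    sizes, train_scores, val_scores = learning_curve(
        model, X, y, cv=3, scoring="f1",
        train_sizes=np.linspace(0.2, 1.0, 5), shuffle=True, random_state=42)
    fig, ax = plt.subplots(figsize=(6, 4))
    for scores, label in [(train_scores, "Training"), (val_scores, "Validation")]:
        mean, std = scores.mean(axis=1), scores.std(axis=1)
        ax.plot(sizes, mean, marker="o", label=f"{label} F1")
        ax.fill_between(sizes, mean - std, mean + std, alpha=0.2)
    ax.set_xlabel("Training set size")
    ax.set_ylabel("F1-score")
    ax.set_title(f"Learning Curve: {model_name}")
    ax.legend()
    fig.savefig("learning_curves.png", bbox_inches="tight")
    plt.close(fig)
```

A large persistent gap between the two curves suggests overfitting; two low, converged curves suggest underfitting.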
Generate Evaluation Report
Create a function generate_evaluation_report(results, best_model_name) that:
- Summarizes all model performances
- Explains why best model was selected
- Includes trade-off analysis (precision vs recall)
- Writes to evaluation_report.txt
Main Program
Create a main() function that:
- Loads and preprocesses the data
- Creates train-test split
- Evaluates at least 4 different models
- Performs hyperparameter tuning on best model
- Generates all visualizations and reports
- Prints final recommendations
def main():
    # Load and preprocess
    X, y, features = load_and_preprocess("credit_fraud.csv")
    X_train, X_test, y_train, y_test = create_train_test_split(X, y)

    # Define models
    models = {
        'Logistic Regression': LogisticRegression(),
        'Random Forest': RandomForestClassifier(),
        'XGBoost': XGBClassifier(),
        'SVM': SVC(probability=True)
    }

    # Compare and evaluate
    results = compare_models(models, X_train, X_test, y_train, y_test)

    # Tune best model
    best_model, best_params = perform_grid_search(...)

    # Generate outputs
    generate_evaluation_report(results, best_model_name)
    print("Model evaluation complete!")

if __name__ == "__main__":
    main()
Submission
Create a public GitHub repository with the exact name shown below:
Required Repository Name
fraud-detection-evaluation
Required Files
fraud-detection-evaluation/
├── fraud_evaluation.ipynb # Your Jupyter Notebook with ALL 13 functions
├── credit_fraud.csv # Synthetic dataset (10,000+ transactions)
├── confusion_matrices.png # Confusion matrices for all models
├── roc_curves.png # ROC curves comparing all models
├── pr_curves.png # Precision-Recall curves for all models
├── learning_curves.png # Learning curves for best model
├── model_comparison.png # Bar chart comparing all models
├── evaluation_report.txt # Detailed evaluation report
└── README.md # REQUIRED - see contents below
README.md Must Include:
- Your full name and submission date
- Summary of all models evaluated and their metrics
- Your best model selection and reasoning
- Discussion of precision vs recall tradeoff for fraud detection
- Instructions to run your notebook
Do Include
- All 13 functions implemented and working
- At least 4 different models evaluated
- Stratified sampling for imbalanced data
- All visualizations saved as PNG files
- Hyperparameter tuning with GridSearchCV
- README.md with all required sections
Do Not Include
- Accuracy as the only metric (use F1, AUC, etc.)
- Non-stratified splits for imbalanced data
- Any .pyc or __pycache__ files
- Virtual environment folders
- Code that doesn't run without errors
- Hardcoded results without reproducible code
Grading Rubric
Your assignment will be graded on the following criteria:
| Criteria | Points | Description |
|---|---|---|
| Function Implementation | 80 | All 13 functions correctly implemented with proper logic and return types |
| Cross-Validation | 25 | Proper stratified k-fold implementation with multiple metrics |
| Visualizations | 35 | Clear, properly labeled plots (confusion matrix, ROC, PR, learning curves) |
| Hyperparameter Tuning | 25 | Effective use of GridSearchCV or RandomizedSearchCV |
| Evaluation Report | 20 | Comprehensive analysis with justified model recommendations |
| Code Quality | 15 | Docstrings, comments, naming conventions, clean organization |
| Total | 200 | |
Ready to Submit?
Make sure you have completed all requirements and reviewed the grading rubric above.
What You Will Practice
Cross-Validation Techniques
Implementing stratified k-fold CV for reliable performance estimates on imbalanced datasets
Performance Metrics
Beyond accuracy: precision, recall, F1-score, ROC-AUC, PR curves, Matthews Correlation Coefficient
Hyperparameter Optimization
Using GridSearchCV and RandomizedSearchCV to find optimal model parameters
Bias-Variance Analysis
Learning curves to diagnose overfitting/underfitting and determine optimal training size
Pro Tips
Why F1-Score for Imbalanced Data?
- Accuracy is misleading with imbalanced classes
- On a dataset with 3% fraud, a model predicting all "not fraud" gets 97% accuracy
- F1-score balances precision and recall
- Use weighted F1 for multi-class problems
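The accuracy trap above is easy to verify in a couple of lines:

```python
# Demonstration: on a 97/3 class split, the "always not fraud" model
# scores high accuracy but zero F1.
from sklearn.metrics import accuracy_score, f1_score

y_true = [0] * 97 + [1] * 3    # 3% fraud
y_pred = [0] * 100             # model predicts "not fraud" everywhere

print(accuracy_score(y_true, y_pred))               # 0.97
print(f1_score(y_true, y_pred, zero_division=0))    # 0.0
```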
PR Curves > ROC for Fraud
- ROC can be overly optimistic for rare events
- PR curves focus on the minority class
- Average Precision summarizes PR curve
- Use both for complete picture
Stratified Sampling is Key
- Preserves class distribution in train/test
- Prevents all-minority or no-minority splits
- Use StratifiedKFold for cross-validation
- Essential for datasets with <10% minority class
Learning Curve Interpretation
- Train >> Validation score: overfitting
- Both scores low: underfitting
- Converging scores: good fit
- Gap persists: need regularization