Assignment 9-A

Machine Learning Pipeline Challenge

Build end-to-end machine learning solutions: develop classification and regression models, implement cross-validation strategies, tune hyperparameters, and evaluate model performance using industry-standard metrics and techniques.

Estimated time: 8-10 hours · Difficulty: Advanced · 300 Points
What You'll Practice
  • Classification model development
  • Regression analysis techniques
  • Cross-validation strategies
  • Hyperparameter optimization
  • Model evaluation and comparison
01

Assignment Overview

In this assignment, you will build a complete machine learning system using scikit-learn. This comprehensive project requires you to apply ALL concepts from Module 9: ML fundamentals, regression models, classification algorithms, and model evaluation techniques to solve real-world prediction problems.

Important: You must use scikit-learn for all machine learning tasks. You may use pandas for data manipulation, numpy for numerical operations, and matplotlib/seaborn for visualization.
Skills Applied: This assignment tests your understanding of ML Concepts (Topic 9.1), Regression Models (Topic 9.2), Classification (Topic 9.3), and Model Evaluation (Topic 9.4) from Module 9.
ML Concepts (9.1)

Supervised learning, train-test splits, feature engineering, pipeline design

Regression (9.2)

Linear regression, polynomial features, regularization, price prediction

Classification (9.3)

Logistic regression, decision trees, random forests, customer churn

Evaluation (9.4)

Cross-validation, confusion matrix, ROC curves, hyperparameter tuning

02

The Scenario

TechMetrics Analytics Inc.

You have been hired as a Machine Learning Engineer at TechMetrics Analytics, a consulting firm that provides predictive analytics solutions to e-commerce clients. Your manager has assigned you two critical projects:

"We have two urgent client requests. First, an online retailer needs a model to predict customer lifetime value based on their purchasing behavior. Second, a subscription service wants to identify customers likely to churn so they can implement retention strategies. Build robust ML pipelines for both problems with proper validation and optimization."

Your Tasks

Create a Jupyter notebook called ml_pipeline.ipynb that implements complete machine learning solutions for both regression (customer value prediction) and classification (churn prediction) problems. Your code must include data preprocessing, model training, hyperparameter tuning, and comprehensive evaluation.

Project 1: Customer Value Prediction

Build a regression model to predict customer lifetime value based on:

  • Purchase history and frequency
  • Average order value
  • Customer demographics
  • Website engagement metrics
Project 2: Churn Prediction

Build a classification model to predict customer churn based on:

  • Subscription duration and plan type
  • Usage patterns and engagement
  • Support ticket history
  • Payment behavior
03

The Datasets

You will work with three real-world datasets for building your machine learning pipelines. Download the CSV files and explore them to understand the features and target variables.

TechMetrics Datasets (3 CSV files)

Business analytics data for regression and classification tasks
Your Task: Load these datasets, explore the features, identify the target variables, and engineer additional features as needed for your machine learning pipelines. The datasets contain customer behavior, purchase patterns, and engagement metrics suitable for both regression and classification tasks.
04

Requirements

Your ml_pipeline.ipynb must implement ALL of the following components. Each section is mandatory and will be graded individually.

Part 1: Regression - Customer Value Prediction (120 points)

1
Data Preprocessing

Implement a preprocessing pipeline that:

  • Handles missing values appropriately
  • Scales numerical features using StandardScaler or MinMaxScaler
  • Creates train-test split (80/20) with random_state=42
  • Identifies and handles outliers if present
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

def preprocess_regression_data(df, target_col):
    """
    Preprocess data for regression task.
    Returns: X_train, X_test, y_train, y_test, scaler
    """
    # Your implementation
    pass
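One possible shape for this function, as a sketch (median imputation and StandardScaler are one reasonable choice here, not a requirement):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def preprocess_regression_data(df, target_col):
    """Impute, split, then scale. Returns X_train, X_test, y_train, y_test, scaler."""
    X = df.drop(columns=[target_col])
    y = df[target_col]
    X = X.fillna(X.median(numeric_only=True))   # simple median imputation
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)     # fit on the training set only,
    X_test = scaler.transform(X_test)           # so test statistics never leak in
    return X_train, X_test, y_train, y_test, scaler
```

Note the order: split first, then fit the scaler on the training portion alone; fitting it on the full dataset leaks test-set statistics into training.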
2
Baseline Model

Create and evaluate a baseline Linear Regression model:

  • Train a simple LinearRegression model
  • Calculate R-squared, MAE, RMSE on test set
  • Store results for comparison
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

def train_baseline_regression(X_train, y_train, X_test, y_test):
    """
    Train baseline linear regression.
    Returns: model, metrics_dict
    """
    # Your implementation
    pass
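A minimal sketch of this step (the metric keys in the returned dict are a suggestion; just use them consistently across models):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

def train_baseline_regression(X_train, y_train, X_test, y_test):
    """Fit a plain LinearRegression and score it on the held-out test set."""
    model = LinearRegression().fit(X_train, y_train)
    pred = model.predict(X_test)
    metrics = {
        "r2": r2_score(y_test, pred),
        "mae": mean_absolute_error(y_test, pred),
        "rmse": np.sqrt(mean_squared_error(y_test, pred)),  # RMSE = sqrt(MSE)
    }
    return model, metrics
```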
3
Advanced Regression Models

Implement at least TWO additional regression models:

  • Ridge Regression with regularization
  • Random Forest Regressor
  • Gradient Boosting Regressor (optional bonus)
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

def train_advanced_regression_models(X_train, y_train, X_test, y_test):
    """
    Train multiple regression models.
    Returns: dict of {model_name: (model, metrics_dict)}
    """
    # Your implementation
    pass
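A compact pattern is to loop over a dict of candidate estimators. A sketch (the hyperparameter values shown are placeholders you will tune later):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

def train_advanced_regression_models(X_train, y_train, X_test, y_test):
    """Fit several regressors and collect test-set metrics for each."""
    candidates = {
        "ridge": Ridge(alpha=1.0),
        "random_forest": RandomForestRegressor(n_estimators=100, random_state=42),
    }
    results = {}
    for name, model in candidates.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        results[name] = (model, {
            "r2": r2_score(y_test, pred),
            "mae": mean_absolute_error(y_test, pred),
            "rmse": np.sqrt(mean_squared_error(y_test, pred)),
        })
    return results
```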
4
Cross-Validation

Implement k-fold cross-validation for model selection:

  • Use 5-fold cross-validation
  • Compare mean CV scores across models
  • Report standard deviation of scores
from sklearn.model_selection import cross_val_score

def cross_validate_models(models, X, y, cv=5):
    """
    Perform cross-validation for multiple models.
    Returns: dict of {model_name: (mean_score, std_score)}
    """
    # Your implementation
    pass
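A sketch, assuming `models` maps names to unfitted estimators and that R-squared is the scoring metric for this regression task:

```python
from sklearn.model_selection import cross_val_score

def cross_validate_models(models, X, y, cv=5):
    """Run k-fold cross-validation for each estimator and summarize score spread."""
    summary = {}
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
        summary[name] = (scores.mean(), scores.std())
    return summary
```

A high standard deviation relative to the mean flags a model whose performance is unstable across folds, which is exactly what this summary is meant to surface.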
5
Hyperparameter Tuning

Optimize the best performing model using GridSearchCV:

  • Define parameter grid for Random Forest or Gradient Boosting
  • Use GridSearchCV with 5-fold CV
  • Report best parameters and best score
  • Evaluate tuned model on test set
from sklearn.model_selection import GridSearchCV

def tune_regression_model(model, param_grid, X_train, y_train):
    """
    Tune hyperparameters using GridSearchCV.
    Returns: best_model, best_params, best_score
    """
    # Your implementation
    pass
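GridSearchCV refits the best configuration on all of X_train by default, so `best_estimator_` is ready to evaluate on the test set afterwards. A sketch:

```python
from sklearn.model_selection import GridSearchCV

def tune_regression_model(model, param_grid, X_train, y_train):
    """Exhaustive grid search with 5-fold CV; returns the refit best model."""
    search = GridSearchCV(model, param_grid, cv=5, scoring="r2", n_jobs=-1)
    search.fit(X_train, y_train)
    return search.best_estimator_, search.best_params_, search.best_score_
```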
6
Feature Importance Analysis

Analyze and visualize feature importance:

  • Extract feature importances from tree-based model
  • Create horizontal bar chart of top 10 features
  • Interpret which features drive customer value
def plot_feature_importance(model, feature_names, top_n=10):
    """
    Plot feature importance chart.
    Saves plot to 'visualizations/regression_feature_importance.png'
    """
    # Your implementation
    pass
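A sketch built on the `feature_importances_` attribute of a fitted tree-based model (the Agg backend is selected so the function also runs headless):

```python
import os
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt

def plot_feature_importance(model, feature_names, top_n=10):
    """Horizontal bar chart of the top_n most important features."""
    importances = model.feature_importances_
    order = np.argsort(importances)[-top_n:]  # indices of the top_n, ascending
    fig, ax = plt.subplots(figsize=(8, 6))
    ax.barh([feature_names[i] for i in order], importances[order])
    ax.set_xlabel("Importance")
    ax.set_title(f"Top {top_n} features by importance")
    os.makedirs("visualizations", exist_ok=True)
    fig.savefig("visualizations/regression_feature_importance.png",
                dpi=150, bbox_inches="tight")
    plt.close(fig)
```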

Part 2: Classification - Churn Prediction (120 points)

7
Data Preprocessing for Classification

Implement preprocessing for the classification task:

  • Encode categorical variables (plan_type) using OneHotEncoder or LabelEncoder
  • Scale numerical features
  • Handle class imbalance if present (check class distribution)
  • Create stratified train-test split
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

def preprocess_classification_data(df, target_col):
    """
    Preprocess data for classification task.
    Returns: X_train, X_test, y_train, y_test, preprocessor
    """
    # Your implementation
    pass
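One way to wire this up is a ColumnTransformer that one-hot encodes the object-dtype columns and scales the rest. A sketch (it infers categorical columns by dtype, which assumes plan_type is stored as strings):

```python
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def preprocess_classification_data(df, target_col):
    """Encode categoricals, scale numerics, and split with stratification."""
    X = df.drop(columns=[target_col])
    y = df[target_col]
    cat_cols = X.select_dtypes(include="object").columns.tolist()
    num_cols = X.select_dtypes(exclude="object").columns.tolist()
    preprocessor = ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
        ("num", StandardScaler(), num_cols),
    ])
    # stratify=y keeps the churn ratio identical in both splits
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y)
    X_train = preprocessor.fit_transform(X_train)  # fit on training data only
    X_test = preprocessor.transform(X_test)
    return X_train, X_test, y_train, y_test, preprocessor
```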
8
Baseline Classifier

Create and evaluate a baseline Logistic Regression model:

  • Train LogisticRegression with default parameters
  • Calculate accuracy, precision, recall, F1-score
  • Generate classification report
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score

def train_baseline_classifier(X_train, y_train, X_test, y_test):
    """
    Train baseline logistic regression classifier.
    Returns: model, classification_report_dict
    """
    # Your implementation
    pass
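A sketch; the one default changed here is `max_iter`, raised purely to avoid convergence warnings on real-world data:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

def train_baseline_classifier(X_train, y_train, X_test, y_test):
    """Fit logistic regression and return a dict-form classification report."""
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    report = classification_report(y_test, model.predict(X_test),
                                   output_dict=True)  # accuracy, precision,
    return model, report                              # recall, F1 per class
```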
9
Advanced Classification Models

Implement at least TWO additional classifiers:

  • Decision Tree Classifier
  • Random Forest Classifier
  • Support Vector Machine (optional bonus)
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

def train_advanced_classifiers(X_train, y_train, X_test, y_test):
    """
    Train multiple classification models.
    Returns: dict of {model_name: (model, metrics_dict)}
    """
    # Your implementation
    pass
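The same dict-of-candidates loop works here; a sketch reporting accuracy and F1 (add any other metrics you track):

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

def train_advanced_classifiers(X_train, y_train, X_test, y_test):
    """Fit several classifiers and collect test-set metrics for each."""
    candidates = {
        "decision_tree": DecisionTreeClassifier(random_state=42),
        "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
    }
    results = {}
    for name, model in candidates.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        results[name] = (model, {
            "accuracy": accuracy_score(y_test, pred),
            "f1": f1_score(y_test, pred),
        })
    return results
```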
10
Confusion Matrix Visualization

Create confusion matrix visualizations for each model:

  • Use ConfusionMatrixDisplay or seaborn heatmap
  • Show both normalized and raw counts
  • Save visualizations to files
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

def plot_confusion_matrices(models, X_test, y_test):
    """
    Plot confusion matrix for each model.
    Saves plots to 'visualizations/confusion_matrix_*.png'
    """
    # Your implementation
    pass
11
ROC Curve Analysis

Generate ROC curves and calculate AUC for all classifiers:

  • Plot ROC curves for all models on same figure
  • Calculate and display AUC scores in legend
  • Include diagonal reference line
from sklearn.metrics import roc_curve, auc, RocCurveDisplay

def plot_roc_curves(models, X_test, y_test):
    """
    Plot ROC curves comparing all classifiers.
    Saves plot to 'visualizations/roc_curves.png'
    """
    # Your implementation
    pass
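A sketch, assuming every model in `models` supports `predict_proba` (true for all the classifiers required above):

```python
import os
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

def plot_roc_curves(models, X_test, y_test):
    """Overlay one ROC curve per model, with AUC in the legend."""
    fig, ax = plt.subplots(figsize=(7, 6))
    for name, model in models.items():
        scores = model.predict_proba(X_test)[:, 1]  # P(positive class)
        fpr, tpr, _ = roc_curve(y_test, scores)
        ax.plot(fpr, tpr, label=f"{name} (AUC = {auc(fpr, tpr):.3f})")
    ax.plot([0, 1], [0, 1], "k--", label="Chance")  # diagonal reference line
    ax.set_xlabel("False positive rate")
    ax.set_ylabel("True positive rate")
    ax.legend(loc="lower right")
    os.makedirs("visualizations", exist_ok=True)
    fig.savefig("visualizations/roc_curves.png", bbox_inches="tight")
    plt.close(fig)
```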
12
Hyperparameter Tuning for Classifier

Optimize the best classifier using GridSearchCV:

  • Define comprehensive parameter grid
  • Use stratified k-fold cross-validation
  • Optimize for F1-score (better for imbalanced data)
  • Report best parameters and final test performance
def tune_classifier(model, param_grid, X_train, y_train, scoring='f1'):
    """
    Tune classifier hyperparameters.
    Returns: best_model, best_params, cv_results
    """
    # Your implementation
    pass
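Same GridSearchCV pattern as the regression tuner, but with an explicit StratifiedKFold so every fold preserves the churn/no-churn ratio. A sketch:

```python
from sklearn.model_selection import GridSearchCV, StratifiedKFold

def tune_classifier(model, param_grid, X_train, y_train, scoring='f1'):
    """Grid search with stratified 5-fold CV, optimizing the given score."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    search = GridSearchCV(model, param_grid, cv=cv, scoring=scoring, n_jobs=-1)
    search.fit(X_train, y_train)
    return search.best_estimator_, search.best_params_, search.cv_results_
```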

Part 3: Model Comparison Report (60 points)

13
Summary Comparison Table

Create a comprehensive comparison of all models:

  • Regression: Compare R-squared, MAE, RMSE for all models
  • Classification: Compare Accuracy, Precision, Recall, F1, AUC
  • Include both baseline and tuned model performance
  • Highlight best model for each task
def create_model_comparison_table(regression_results, classification_results):
    """
    Create summary DataFrames comparing all models.
    Returns: regression_comparison_df, classification_comparison_df
    """
    # Your implementation
    pass
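If you first strip the fitted model objects out of your results dicts, building the tables reduces to a DataFrame constructor. A sketch assuming each argument maps model name to a plain metrics dict:

```python
import pandas as pd

def create_model_comparison_table(regression_results, classification_results):
    """Turn {model_name: metrics_dict} mappings into comparison DataFrames."""
    # DataFrame(dict-of-dicts) puts metrics on the index; transpose so each
    # row is a model and each column a metric
    regression_comparison_df = pd.DataFrame(regression_results).T
    classification_comparison_df = pd.DataFrame(classification_results).T
    return regression_comparison_df, classification_comparison_df
```

From there, `idxmax` per column (or `df.style.highlight_max()` in the notebook) is one way to flag the best model for each metric.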
14
Business Recommendations

Write a markdown cell with business insights:

  • Which features most influence customer lifetime value?
  • What are the key indicators of customer churn?
  • What actions should the business take based on your models?
  • What are the limitations of your analysis?
05

Submission Instructions

Submit your completed assignment via GitHub following these instructions:

1
Create Jupyter Notebook

Create a single notebook called ml_pipeline.ipynb containing all requirements:

  • Organize with clear markdown headers for each part
  • Each function must have docstrings explaining inputs and outputs
  • Include markdown cells explaining your methodology
  • Run all cells top to bottom before submission
2
Save Visualizations

Export all plots to the visualizations/ folder:

  • regression_feature_importance.png
  • confusion_matrix_logistic.png
  • confusion_matrix_rf.png
  • roc_curves.png
  • model_comparison.png (optional)
3
Create README

Create README.md that includes:

  • Your name and assignment title
  • Summary of models built and their performance
  • Instructions to run your notebook
  • Key findings and business recommendations
4
Create requirements.txt
numpy==1.24.0
pandas==2.0.0
scikit-learn==1.3.0
matplotlib==3.7.0
seaborn==0.12.0
5
Repository Structure

Your GitHub repository should look like this:

techmetrics-ml-pipeline/
├── README.md
├── requirements.txt
├── ml_pipeline.ipynb
└── visualizations/
    ├── regression_feature_importance.png
    ├── confusion_matrix_logistic.png
    ├── confusion_matrix_rf.png
    └── roc_curves.png
6
Submit via Form

Once your repository is ready:

  • Make sure your repository is public
  • Click the "Submit Assignment" button below
  • Fill in the submission form with your GitHub username
Important: Make sure all cells in your notebook run without errors and all visualizations are saved before submitting!
06

Grading Rubric

Your assignment will be graded on the following criteria:

| Criteria | Points | Description |
|---|---|---|
| Regression Pipeline | 60 | Preprocessing, baseline model, advanced models, cross-validation |
| Regression Optimization | 40 | Hyperparameter tuning, feature importance, model interpretation |
| Classification Pipeline | 60 | Preprocessing, encoding, baseline and advanced classifiers |
| Classification Evaluation | 40 | Confusion matrices, ROC curves, AUC analysis, tuning |
| Model Comparison & Insights | 40 | Comprehensive comparison tables, business recommendations |
| Visualizations | 30 | Clear, well-labeled charts saved to visualizations folder |
| Code Quality | 30 | Docstrings, comments, clean organization, reproducibility |
| Total | 300 | |

Ready to Submit?

Make sure you have completed all requirements and reviewed the grading rubric above.

07

What You Will Practice

ML Concepts (9.1)

Supervised learning workflows, train-test splits, feature scaling, and pipeline design patterns

Regression (9.2)

Linear regression, Ridge regularization, Random Forest for regression, and customer value prediction

Classification (9.3)

Logistic regression, decision trees, random forests, and churn prediction modeling

Evaluation (9.4)

Cross-validation, confusion matrices, ROC curves, AUC, and hyperparameter optimization

08

Pro Tips

Model Selection
  • Start with simple models before complex ones
  • Use cross-validation for reliable comparisons
  • Consider model interpretability vs performance
  • Random Forest often works well as default
Evaluation Metrics
  • Use RMSE for regression, F1 for imbalanced classification
  • Always check multiple metrics, not just accuracy
  • Confusion matrix reveals error patterns
  • ROC-AUC is robust to class imbalance
Hyperparameter Tuning
  • Start with coarse grid, then refine
  • Use RandomizedSearchCV for large param spaces
  • Set random_state for reproducibility
  • Watch out for overfitting on validation set
Common Mistakes
  • Data leakage: scaling before split
  • Not using stratified split for classification
  • Forgetting to handle categorical variables
  • Not saving visualizations before submission
09

Pre-Submission Checklist

Regression Requirements
  • Preprocessing pipeline with scaling and an 80/20 split (random_state=42)
  • Baseline LinearRegression with R-squared, MAE, and RMSE reported
  • At least two advanced models (e.g., Ridge and Random Forest)
  • 5-fold cross-validation comparison with mean and standard deviation
  • GridSearchCV tuning with best parameters and test performance reported
  • Feature importance chart saved to visualizations/
Classification Requirements
  • Categorical encoding and a stratified train-test split
  • Baseline LogisticRegression with a full classification report
  • At least two advanced classifiers (e.g., Decision Tree and Random Forest)
  • Confusion matrices (raw and normalized) saved for each model
  • ROC curves with AUC scores saved to visualizations/
  • Best classifier tuned with stratified cross-validation, optimizing F1