Assignment Overview
In this assignment, you will build a complete machine learning system using scikit-learn. This comprehensive project requires you to apply ALL of the Module 9 concepts (ML fundamentals, regression models, classification algorithms, and model evaluation techniques) to solve real-world prediction problems.
ML Concepts (9.1)
Supervised learning workflows, train-test splits, feature engineering and scaling, pipeline design
Regression (9.2)
Linear regression, polynomial features, Ridge regularization, customer value prediction
Classification (9.3)
Logistic regression, decision trees, random forests, churn prediction
Evaluation (9.4)
Cross-validation, confusion matrices, ROC curves, AUC, hyperparameter tuning
The Scenario
TechMetrics Analytics Inc.
You have been hired as a Machine Learning Engineer at TechMetrics Analytics, a consulting firm that provides predictive analytics solutions to e-commerce clients. Your manager has assigned you two critical projects:
"We have two urgent client requests. First, an online retailer needs a model to predict customer lifetime value based on their purchasing behavior. Second, a subscription service wants to identify customers likely to churn so they can implement retention strategies. Build robust ML pipelines for both problems with proper validation and optimization."
Your Tasks
Create a Jupyter notebook called ml_pipeline.ipynb that implements complete machine learning
solutions for both regression (customer value prediction) and classification (churn prediction) problems.
Your code must include data preprocessing, model training, hyperparameter tuning, and comprehensive evaluation.
Project 1: Customer Value Prediction
Build a regression model to predict customer lifetime value based on:
- Purchase history and frequency
- Average order value
- Customer demographics
- Website engagement metrics
Project 2: Churn Prediction
Build a classification model to predict customer churn based on:
- Subscription duration and plan type
- Usage patterns and engagement
- Support ticket history
- Payment behavior
The Datasets
You will work with three real-world datasets for building your machine learning pipelines. Download the CSV files and explore them to understand the features and target variables.
TechMetrics Datasets
Business analytics data for regression and classification tasks
Requirements
Your ml_pipeline.ipynb must implement ALL of the following components.
Each section is mandatory and will be graded individually.
Part 1: Regression - Customer Value Prediction (120 points)
Data Preprocessing
Implement a preprocessing pipeline that:
- Handles missing values appropriately
- Scales numerical features using StandardScaler or MinMaxScaler
- Creates train-test split (80/20) with random_state=42
- Identifies and handles outliers if present
```python
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

def preprocess_regression_data(df, target_col):
    """
    Preprocess data for regression task.
    Returns: X_train, X_test, y_train, y_test, scaler
    """
    # Your implementation
    pass
```
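For reference, the split-then-scale order matters: fit the scaler on training data only, then reuse its statistics on the test set. A minimal sketch on synthetic data (illustrative only, not the assignment solution; variable names are placeholders):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy feature matrix and target (100 samples, 3 features)
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

# 80/20 split FIRST, then fit the scaler on training data only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler().fit(X_train)   # learns mean/std from train split
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)      # reuses the train statistics
```

Fitting the scaler before splitting would leak test-set statistics into training, which is one of the "Common Mistakes" listed at the end of this page.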
Baseline Model
Create and evaluate a baseline Linear Regression model:
- Train a simple LinearRegression model
- Calculate R-squared, MAE, RMSE on test set
- Store results for comparison
```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

def train_baseline_regression(X_train, y_train, X_test, y_test):
    """
    Train baseline linear regression.
    Returns: model, metrics_dict
    """
    # Your implementation
    pass
```
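Computing the three metrics on a toy regression problem might look like this (a sketch on synthetic data, not the assignment solution; taking the square root of MSE keeps the RMSE calculation compatible across scikit-learn versions):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=4, noise=5.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_tr, y_tr)
pred = model.predict(X_te)
metrics = {
    "r2": r2_score(y_te, pred),
    "mae": mean_absolute_error(y_te, pred),
    # RMSE via sqrt of MSE works on old and new sklearn alike
    "rmse": np.sqrt(mean_squared_error(y_te, pred)),
}
```

Note that RMSE is always at least as large as MAE, which is a quick sanity check on your results dictionary.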
Advanced Regression Models
Implement at least TWO additional regression models:
- Ridge Regression with regularization
- Random Forest Regressor
- Gradient Boosting Regressor (optional bonus)
```python
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

def train_advanced_regression_models(X_train, y_train, X_test, y_test):
    """
    Train multiple regression models.
    Returns: dict of {model_name: (model, metrics_dict)}
    """
    # Your implementation
    pass
```
Cross-Validation
Implement k-fold cross-validation for model selection:
- Use 5-fold cross-validation
- Compare mean CV scores across models
- Report standard deviation of scores
```python
from sklearn.model_selection import cross_val_score

def cross_validate_models(models, X, y, cv=5):
    """
    Perform cross-validation for multiple models.
    Returns: dict of {model_name: (mean_score, std_score)}
    """
    # Your implementation
    pass
```
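The mean/std pattern the docstring asks for can be sketched on synthetic data (illustrative only; the model names here are placeholders):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

results = {}
for name, model in {"linear": LinearRegression(), "ridge": Ridge(alpha=1.0)}.items():
    # 5-fold CV returns one R^2 score per fold
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    results[name] = (scores.mean(), scores.std())
```

Reporting the standard deviation alongside the mean shows how stable each model is across folds, which is the point of this requirement.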
Hyperparameter Tuning
Optimize the best performing model using GridSearchCV:
- Define parameter grid for Random Forest or Gradient Boosting
- Use GridSearchCV with 5-fold CV
- Report best parameters and best score
- Evaluate tuned model on test set
```python
from sklearn.model_selection import GridSearchCV

def tune_regression_model(model, param_grid, X_train, y_train):
    """
    Tune hyperparameters using GridSearchCV.
    Returns: best_model, best_params, best_score
    """
    # Your implementation
    pass
```
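A minimal GridSearchCV sketch on synthetic data (the grid here is deliberately tiny so the example runs quickly; your real grid should be broader):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=150, n_features=5, noise=5.0, random_state=42)

# 2 x 2 = 4 candidate settings, each evaluated with 5-fold CV
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=5,
    scoring="r2",
)
search.fit(X, y)

best_model = search.best_estimator_   # refit on all training data
```

`search.best_params_` and `search.best_score_` give you the values the requirement asks you to report; evaluate `best_model` on the held-out test set afterwards.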
Feature Importance Analysis
Analyze and visualize feature importance:
- Extract feature importances from tree-based model
- Create horizontal bar chart of top 10 features
- Interpret which features drive customer value
```python
import matplotlib.pyplot as plt

def plot_feature_importance(model, feature_names, top_n=10):
    """
    Plot feature importance chart.
    Saves plot to 'visualizations/regression_feature_importance.png'
    """
    # Your implementation
    pass
```
Part 2: Classification - Churn Prediction (120 points)
Data Preprocessing for Classification
Implement preprocessing for the classification task:
- Encode categorical variables (plan_type) using OneHotEncoder (note: LabelEncoder is intended for target labels; use OrdinalEncoder if you want integer codes for features)
- Scale numerical features
- Handle class imbalance if present (check class distribution)
- Create stratified train-test split
```python
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

def preprocess_classification_data(df, target_col):
    """
    Preprocess data for classification task.
    Returns: X_train, X_test, y_train, y_test, preprocessor
    """
    # Your implementation
    pass
```
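One way to combine encoding, scaling, and a stratified split is a ColumnTransformer. A sketch on a hypothetical toy frame (the column names `plan_type`, `monthly_usage`, and `churned` are made up for illustration; the real dataset's columns may differ):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame: 60 rows, one categorical and one numeric feature
df = pd.DataFrame({
    "plan_type": ["basic", "pro"] * 30,
    "monthly_usage": range(60),
    "churned": [0, 1] * 30,
})
X, y = df.drop(columns="churned"), df["churned"]

# stratify=y keeps the churn ratio identical in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# One-hot encode categoricals and scale numerics in one preprocessor
pre = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["plan_type"]),
    ("num", StandardScaler(), ["monthly_usage"]),
])
X_tr_t = pre.fit_transform(X_tr)   # fit on train only
X_te_t = pre.transform(X_te)
```

`handle_unknown="ignore"` keeps the encoder from crashing if a category appears only in the test split.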
Baseline Classifier
Create and evaluate a baseline Logistic Regression model:
- Train LogisticRegression with default parameters
- Calculate accuracy, precision, recall, F1-score
- Generate classification report
```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score

def train_baseline_classifier(X_train, y_train, X_test, y_test):
    """
    Train baseline logistic regression classifier.
    Returns: model, classification_report_dict
    """
    # Your implementation
    pass
```
Advanced Classification Models
Implement at least TWO additional classifiers:
- Decision Tree Classifier
- Random Forest Classifier
- Support Vector Machine (optional bonus)
```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

def train_advanced_classifiers(X_train, y_train, X_test, y_test):
    """
    Train multiple classification models.
    Returns: dict of {model_name: (model, metrics_dict)}
    """
    # Your implementation
    pass
```
Confusion Matrix Visualization
Create confusion matrix visualizations for each model:
- Use ConfusionMatrixDisplay or seaborn heatmap
- Show both normalized and raw counts
- Save visualizations to files
```python
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

def plot_confusion_matrices(models, X_test, y_test):
    """
    Plot confusion matrix for each model.
    Saves plots to 'visualizations/confusion_matrix_*.png'
    """
    # Your implementation
    pass
```
ROC Curve Analysis
Generate ROC curves and calculate AUC for all classifiers:
- Plot ROC curves for all models on same figure
- Calculate and display AUC scores in legend
- Include diagonal reference line
```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc, RocCurveDisplay

def plot_roc_curves(models, X_test, y_test):
    """
    Plot ROC curves comparing all classifiers.
    Saves plot to 'visualizations/roc_curves.png'
    """
    # Your implementation
    pass
```
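The data behind each curve comes from probability scores, not hard class predictions. A sketch of computing one curve's points and its AUC on synthetic data (plotting omitted; not the assignment solution):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split

# Toy binary classification problem
X, y = make_classification(n_samples=300, n_features=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# ROC needs scores for the positive class, hence predict_proba, not predict
scores = clf.predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, scores)
roc_auc = auc(fpr, tpr)   # 0.5 = random guessing, 1.0 = perfect ranking
```

Plot `fpr` against `tpr` for each model on one axes, put `roc_auc` in the legend label, and add the diagonal with `plt.plot([0, 1], [0, 1], "--")` as the reference line.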
Hyperparameter Tuning for Classifier
Optimize the best classifier using GridSearchCV:
- Define comprehensive parameter grid
- Use stratified k-fold cross-validation
- Optimize for F1-score (better for imbalanced data)
- Report best parameters and final test performance
```python
from sklearn.model_selection import GridSearchCV, StratifiedKFold

def tune_classifier(model, param_grid, X_train, y_train, scoring='f1'):
    """
    Tune classifier hyperparameters.
    Returns: best_model, best_params, cv_results
    """
    # Your implementation
    pass
```
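Passing an explicit StratifiedKFold and `scoring="f1"` covers the stratified-CV and F1 requirements. A sketch on an imbalanced synthetic dataset (tiny grid so it runs quickly; not the assignment solution):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Roughly 70/30 class imbalance to mimic a churn problem
X, y = make_classification(
    n_samples=200, n_features=6, weights=[0.7, 0.3], random_state=42
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    {"n_estimators": [50, 100], "max_depth": [3, None]},
    cv=cv,                 # class ratio preserved in every fold
    scoring="f1",          # optimize F1 rather than accuracy
)
search.fit(X, y)
```

For classifiers, plain integer `cv=5` also stratifies by default, but constructing StratifiedKFold yourself makes the choice explicit and lets you shuffle with a fixed seed.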
Part 3: Model Comparison Report (60 points)
Summary Comparison Table
Create a comprehensive comparison of all models:
- Regression: Compare R2, MAE, RMSE for all models
- Classification: Compare Accuracy, Precision, Recall, F1, AUC
- Include both baseline and tuned model performance
- Highlight best model for each task
```python
import pandas as pd

def create_model_comparison_table(regression_results, classification_results):
    """
    Create summary DataFrames comparing all models.
    Returns: regression_comparison_df, classification_comparison_df
    """
    # Your implementation
    pass
```
Business Recommendations
Write a markdown cell with business insights:
- Which features most influence customer lifetime value?
- What are the key indicators of customer churn?
- What actions should the business take based on your models?
- What are the limitations of your analysis?
Submission Instructions
Submit your completed assignment via GitHub following these instructions:
Create Jupyter Notebook
Create a single notebook called ml_pipeline.ipynb containing all requirements:
- Organize with clear markdown headers for each part
- Each function must have docstrings explaining inputs and outputs
- Include markdown cells explaining your methodology
- Run all cells top to bottom before submission
Save Visualizations
Export all plots to the visualizations/ folder:
- regression_feature_importance.png
- confusion_matrix_logistic.png
- confusion_matrix_rf.png
- roc_curves.png
- model_comparison.png (optional)
Create README
Create README.md that includes:
- Your name and assignment title
- Summary of models built and their performance
- Instructions to run your notebook
- Key findings and business recommendations
Create requirements.txt
numpy==1.24.0
pandas==2.0.0
scikit-learn==1.3.0
matplotlib==3.7.0
seaborn==0.12.0
Repository Structure
Your GitHub repository should look like this:
techmetrics-ml-pipeline/
├── README.md
├── requirements.txt
├── ml_pipeline.ipynb
└── visualizations/
├── regression_feature_importance.png
├── confusion_matrix_logistic.png
├── confusion_matrix_rf.png
└── roc_curves.png
Submit via Form
Once your repository is ready:
- Make sure your repository is public
- Click the "Submit Assignment" button below
- Fill in the submission form with your GitHub username
Grading Rubric
Your assignment will be graded on the following criteria:
| Criteria | Points | Description |
|---|---|---|
| Regression Pipeline | 60 | Preprocessing, baseline model, advanced models, cross-validation |
| Regression Optimization | 40 | Hyperparameter tuning, feature importance, model interpretation |
| Classification Pipeline | 60 | Preprocessing, encoding, baseline and advanced classifiers |
| Classification Evaluation | 40 | Confusion matrices, ROC curves, AUC analysis, tuning |
| Model Comparison & Insights | 40 | Comprehensive comparison tables, business recommendations |
| Visualizations | 30 | Clear, well-labeled charts saved to visualizations folder |
| Code Quality | 30 | Docstrings, comments, clean organization, reproducibility |
| Total | 300 | |
Ready to Submit?
Make sure you have completed all requirements and reviewed the grading rubric above.
Pro Tips
Model Selection
- Start with simple models before complex ones
- Use cross-validation for reliable comparisons
- Consider model interpretability vs performance
- Random Forest often works well as default
Evaluation Metrics
- Use RMSE for regression, F1 for imbalanced classification
- Always check multiple metrics, not just accuracy
- Confusion matrix reveals error patterns
- ROC-AUC is robust to class imbalance
Hyperparameter Tuning
- Start with coarse grid, then refine
- Use RandomizedSearchCV for large param spaces
- Set random_state for reproducibility
- Watch out for overfitting on validation set
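The RandomizedSearchCV tip above can be sketched as follows: instead of exhausting every grid combination, sample a fixed number of random candidates (illustrative only; `n_iter` and the parameter ranges are arbitrary here):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=200, n_features=6, random_state=42)

# 4 x 3 = 12 possible combinations, but only 5 are sampled and evaluated
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    {"n_estimators": [50, 100, 150, 200], "max_depth": [3, 5, None]},
    n_iter=5,
    cv=3,
    random_state=42,   # makes the sampled candidates reproducible
)
search.fit(X, y)
```

With continuous ranges you can pass scipy distributions (e.g. `scipy.stats.randint`) instead of lists, which is where randomized search pays off most.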
Common Mistakes
- Data leakage: fitting the scaler on the full dataset before the train-test split
- Not using stratified split for classification
- Forgetting to handle categorical variables
- Not saving visualizations before submission