Assignment Overview
In this assignment, you will build a complete machine learning system using scikit-learn. This comprehensive project requires you to apply ALL of the Module 9 concepts (ML fundamentals, regression models, classification algorithms, and model evaluation techniques) to solve real-world prediction problems.
ML Concepts (9.1)
Supervised learning workflows, train-test splits, feature engineering and scaling, pipeline design
Regression (9.2)
Linear regression, polynomial features, Ridge regularization, customer value prediction
Classification (9.3)
Logistic regression, decision trees, random forests, churn prediction
Evaluation (9.4)
Cross-validation, confusion matrices, ROC curves, AUC, hyperparameter tuning
The Scenario
TechMetrics Analytics Inc.
You have been hired as a Machine Learning Engineer at TechMetrics Analytics, a consulting firm that provides predictive analytics solutions to e-commerce clients. Your manager has assigned you two critical projects:
"We have two urgent client requests. First, an online retailer needs a model to predict customer lifetime value based on their purchasing behavior. Second, a subscription service wants to identify customers likely to churn so they can implement retention strategies. Build robust ML pipelines for both problems with proper validation and optimization."
Your Tasks
Create a Jupyter notebook called ml_pipeline.ipynb that implements complete machine learning
solutions for both regression (customer value prediction) and classification (churn prediction) problems.
Your code must include data preprocessing, model training, hyperparameter tuning, and comprehensive evaluation.
Project 1: Customer Value Prediction
Build a regression model to predict customer lifetime value based on:
- Purchase history and frequency
- Average order value
- Customer demographics
- Website engagement metrics
Project 2: Churn Prediction
Build a classification model to predict customer churn based on:
- Subscription duration and plan type
- Usage patterns and engagement
- Support ticket history
- Payment behavior
The Datasets
You will work with three real-world datasets for building your machine learning pipelines. Download the CSV files and explore them to understand the features and target variables.
TechMetrics Datasets
Business analytics data for regression and classification tasks
Requirements
Your ml_pipeline.ipynb must implement ALL of the following components.
Each section is mandatory and will be graded individually.
Part 1: Regression - Customer Value Prediction (120 points)
Data Preprocessing
Implement a preprocessing pipeline that:
- Handles missing values appropriately
- Scales numerical features using StandardScaler or MinMaxScaler
- Creates train-test split (80/20) with random_state=42
- Identifies and handles outliers if present
```python
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

def preprocess_regression_data(df, target_col):
    """
    Preprocess data for regression task.
    Returns: X_train, X_test, y_train, y_test, scaler
    """
    # Your implementation
    pass
```
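For reference, the split-then-scale order matters: fit the scaler on training data only, then reuse its statistics on the test set. A minimal sketch on synthetic data (illustrative only, not the assignment solution; variable names are placeholders):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy feature matrix and target (100 samples, 3 features)
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

# 80/20 split FIRST, then fit the scaler on training data only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler().fit(X_train)   # learns mean/std from train split
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)      # reuses the train statistics
```

Fitting the scaler before splitting would leak test-set statistics into training, which is one of the "Common Mistakes" listed at the end of this page.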
Baseline Model
Create and evaluate a baseline Linear Regression model:
- Train a simple LinearRegression model
- Calculate R-squared, MAE, RMSE on test set
- Store results for comparison
```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

def train_baseline_regression(X_train, y_train, X_test, y_test):
    """
    Train baseline linear regression.
    Returns: model, metrics_dict
    """
    # Your implementation
    pass
```
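Computing the three metrics on a toy regression problem might look like this (a sketch on synthetic data, not the assignment solution; taking the square root of MSE keeps the RMSE calculation compatible across scikit-learn versions):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=4, noise=5.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_tr, y_tr)
pred = model.predict(X_te)
metrics = {
    "r2": r2_score(y_te, pred),
    "mae": mean_absolute_error(y_te, pred),
    # RMSE via sqrt of MSE works on old and new sklearn alike
    "rmse": np.sqrt(mean_squared_error(y_te, pred)),
}
```

Note that RMSE is always at least as large as MAE, which is a quick sanity check on your results dictionary.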
Advanced Regression Models
Implement at least TWO additional regression models:
- Ridge Regression with regularization
- Random Forest Regressor
- Gradient Boosting Regressor (optional bonus)
```python
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

def train_advanced_regression_models(X_train, y_train, X_test, y_test):
    """
    Train multiple regression models.
    Returns: dict of {model_name: (model, metrics_dict)}
    """
    # Your implementation
    pass
```
Cross-Validation
Implement k-fold cross-validation for model selection:
- Use 5-fold cross-validation
- Compare mean CV scores across models
- Report standard deviation of scores
```python
from sklearn.model_selection import cross_val_score

def cross_validate_models(models, X, y, cv=5):
    """
    Perform cross-validation for multiple models.
    Returns: dict of {model_name: (mean_score, std_score)}
    """
    # Your implementation
    pass
```
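The mean/std pattern the docstring asks for can be sketched on synthetic data (illustrative only; the model names here are placeholders):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

results = {}
for name, model in {"linear": LinearRegression(), "ridge": Ridge(alpha=1.0)}.items():
    # 5-fold CV returns one R^2 score per fold
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    results[name] = (scores.mean(), scores.std())
```

Reporting the standard deviation alongside the mean shows how stable each model is across folds, which is the point of this requirement.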
Hyperparameter Tuning
Optimize the best performing model using GridSearchCV:
- Define parameter grid for Random Forest or Gradient Boosting
- Use GridSearchCV with 5-fold CV
- Report best parameters and best score
- Evaluate tuned model on test set
```python
from sklearn.model_selection import GridSearchCV

def tune_regression_model(model, param_grid, X_train, y_train):
    """
    Tune hyperparameters using GridSearchCV.
    Returns: best_model, best_params, best_score
    """
    # Your implementation
    pass
```
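A minimal GridSearchCV sketch on synthetic data (the grid here is deliberately tiny so the example runs quickly; your real grid should be broader):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=150, n_features=5, noise=5.0, random_state=42)

# 2 x 2 = 4 candidate settings, each evaluated with 5-fold CV
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=5,
    scoring="r2",
)
search.fit(X, y)

best_model = search.best_estimator_   # refit on all training data
```

`search.best_params_` and `search.best_score_` give you the values the requirement asks you to report; evaluate `best_model` on the held-out test set afterwards.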
Feature Importance Analysis
Analyze and visualize feature importance:
- Extract feature importances from tree-based model
- Create horizontal bar chart of top 10 features
- Interpret which features drive customer value
```python
import matplotlib.pyplot as plt

def plot_feature_importance(model, feature_names, top_n=10):
    """
    Plot feature importance chart.
    Saves plot to 'visualizations/regression_feature_importance.png'
    """
    # Your implementation
    pass
```
Part 2: Classification - Churn Prediction (120 points)
Data Preprocessing for Classification
Implement preprocessing for the classification task:
- Encode categorical variables (plan_type) using OneHotEncoder (note: LabelEncoder is intended for target labels; use OrdinalEncoder if you want integer codes for features)
- Scale numerical features
- Handle class imbalance if present (check class distribution)
- Create stratified train-test split
```python
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

def preprocess_classification_data(df, target_col):
    """
    Preprocess data for classification task.
    Returns: X_train, X_test, y_train, y_test, preprocessor
    """
    # Your implementation
    pass
```
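One way to combine encoding, scaling, and a stratified split is a ColumnTransformer. A sketch on a hypothetical toy frame (the column names `plan_type`, `monthly_usage`, and `churned` are made up for illustration; the real dataset's columns may differ):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame: 60 rows, one categorical and one numeric feature
df = pd.DataFrame({
    "plan_type": ["basic", "pro"] * 30,
    "monthly_usage": range(60),
    "churned": [0, 1] * 30,
})
X, y = df.drop(columns="churned"), df["churned"]

# stratify=y keeps the churn ratio identical in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# One-hot encode categoricals and scale numerics in one preprocessor
pre = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["plan_type"]),
    ("num", StandardScaler(), ["monthly_usage"]),
])
X_tr_t = pre.fit_transform(X_tr)   # fit on train only
X_te_t = pre.transform(X_te)
```

`handle_unknown="ignore"` keeps the encoder from crashing if a category appears only in the test split.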
Baseline Classifier
Create and evaluate a baseline Logistic Regression model:
- Train LogisticRegression with default parameters
- Calculate accuracy, precision, recall, F1-score
- Generate classification report
```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score

def train_baseline_classifier(X_train, y_train, X_test, y_test):
    """
    Train baseline logistic regression classifier.
    Returns: model, classification_report_dict
    """
    # Your implementation
    pass
```
Advanced Classification Models
Implement at least TWO additional classifiers:
- Decision Tree Classifier
- Random Forest Classifier
- Support Vector Machine (optional bonus)
```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

def train_advanced_classifiers(X_train, y_train, X_test, y_test):
    """
    Train multiple classification models.
    Returns: dict of {model_name: (model, metrics_dict)}
    """
    # Your implementation
    pass
```
Confusion Matrix Visualization
Create confusion matrix visualizations for each model:
- Use ConfusionMatrixDisplay or seaborn heatmap
- Show both normalized and raw counts
- Save visualizations to files
```python
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

def plot_confusion_matrices(models, X_test, y_test):
    """
    Plot confusion matrix for each model.
    Saves plots to 'visualizations/confusion_matrix_*.png'
    """
    # Your implementation
    pass
```
ROC Curve Analysis
Generate ROC curves and calculate AUC for all classifiers:
- Plot ROC curves for all models on same figure
- Calculate and display AUC scores in legend
- Include diagonal reference line
```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc, RocCurveDisplay

def plot_roc_curves(models, X_test, y_test):
    """
    Plot ROC curves comparing all classifiers.
    Saves plot to 'visualizations/roc_curves.png'
    """
    # Your implementation
    pass
```
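The data behind each curve comes from probability scores, not hard class predictions. A sketch of computing one curve's points and its AUC on synthetic data (plotting omitted; not the assignment solution):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split

# Toy binary classification problem
X, y = make_classification(n_samples=300, n_features=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# ROC needs scores for the positive class, hence predict_proba, not predict
scores = clf.predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, scores)
roc_auc = auc(fpr, tpr)   # 0.5 = random guessing, 1.0 = perfect ranking
```

Plot `fpr` against `tpr` for each model on one axes, put `roc_auc` in the legend label, and add the diagonal with `plt.plot([0, 1], [0, 1], "--")` as the reference line.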
Hyperparameter Tuning for Classifier
Optimize the best classifier using GridSearchCV:
- Define comprehensive parameter grid
- Use stratified k-fold cross-validation
- Optimize for F1-score (better for imbalanced data)
- Report best parameters and final test performance
```python
from sklearn.model_selection import GridSearchCV, StratifiedKFold

def tune_classifier(model, param_grid, X_train, y_train, scoring='f1'):
    """
    Tune classifier hyperparameters.
    Returns: best_model, best_params, cv_results
    """
    # Your implementation
    pass
```
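Passing an explicit StratifiedKFold and `scoring="f1"` covers the stratified-CV and F1 requirements. A sketch on an imbalanced synthetic dataset (tiny grid so it runs quickly; not the assignment solution):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Roughly 70/30 class imbalance to mimic a churn problem
X, y = make_classification(
    n_samples=200, n_features=6, weights=[0.7, 0.3], random_state=42
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    {"n_estimators": [50, 100], "max_depth": [3, None]},
    cv=cv,                 # class ratio preserved in every fold
    scoring="f1",          # optimize F1 rather than accuracy
)
search.fit(X, y)
```

For classifiers, plain integer `cv=5` also stratifies by default, but constructing StratifiedKFold yourself makes the choice explicit and lets you shuffle with a fixed seed.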
Part 3: Model Comparison Report (60 points)
Summary Comparison Table
Create a comprehensive comparison of all models:
- Regression: Compare R2, MAE, RMSE for all models
- Classification: Compare Accuracy, Precision, Recall, F1, AUC
- Include both baseline and tuned model performance
- Highlight best model for each task
```python
import pandas as pd

def create_model_comparison_table(regression_results, classification_results):
    """
    Create summary DataFrames comparing all models.
    Returns: regression_comparison_df, classification_comparison_df
    """
    # Your implementation
    pass
```
Business Recommendations
Write a markdown cell with business insights:
- Which features most influence customer lifetime value?
- What are the key indicators of customer churn?
- What actions should the business take based on your models?
- What are the limitations of your analysis?
Submission Instructions
Submit your completed assignment via GitHub following these instructions:
Create Jupyter Notebook
Create a single notebook called ml_pipeline.ipynb containing all requirements:
- Organize with clear markdown headers for each part
- Each function must have docstrings explaining inputs and outputs
- Include markdown cells explaining your methodology
- Run all cells top to bottom before submission
Save Visualizations
Export all plots to the visualizations/ folder:
- regression_feature_importance.png
- confusion_matrix_logistic.png
- confusion_matrix_rf.png
- roc_curves.png
- model_comparison.png (optional)
Create README
Create README.md that includes:
- Your name and assignment title
- Summary of models built and their performance
- Instructions to run your notebook
- Key findings and business recommendations
Create requirements.txt
numpy==1.24.0
pandas==2.0.0
scikit-learn==1.3.0
matplotlib==3.7.0
seaborn==0.12.0
Repository Structure
Your GitHub repository should look like this:
techmetrics-ml-pipeline/
├── README.md
├── requirements.txt
├── ml_pipeline.ipynb
└── visualizations/
├── regression_feature_importance.png
├── confusion_matrix_logistic.png
├── confusion_matrix_rf.png
└── roc_curves.png
Submit via Form
Once your repository is ready:
- Make sure your repository is public
- Click the "Submit Assignment" button below
- Fill in the submission form with your GitHub username
Grading Rubric
Your assignment will be graded on the following criteria:
| Criteria | Points | Description |
|---|---|---|
| Regression Pipeline | 60 | Preprocessing, baseline model, advanced models, cross-validation |
| Regression Optimization | 40 | Hyperparameter tuning, feature importance, model interpretation |
| Classification Pipeline | 60 | Preprocessing, encoding, baseline and advanced classifiers |
| Classification Evaluation | 40 | Confusion matrices, ROC curves, AUC analysis, tuning |
| Model Comparison & Insights | 40 | Comprehensive comparison tables, business recommendations |
| Visualizations | 30 | Clear, well-labeled charts saved to visualizations folder |
| Code Quality | 30 | Docstrings, comments, clean organization, reproducibility |
| Total | 300 | |
Ready to Submit?
Make sure you have completed all requirements and reviewed the grading rubric above.
Pro Tips
Model Selection
- Start with simple models before complex ones
- Use cross-validation for reliable comparisons
- Consider model interpretability vs performance
- Random Forest often works well as default
Evaluation Metrics
- Use RMSE for regression, F1 for imbalanced classification
- Always check multiple metrics, not just accuracy
- Confusion matrix reveals error patterns
- ROC-AUC is robust to class imbalance
Hyperparameter Tuning
- Start with coarse grid, then refine
- Use RandomizedSearchCV for large param spaces
- Set random_state for reproducibility
- Watch out for overfitting on validation set
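The RandomizedSearchCV tip above can be sketched as follows: instead of exhausting every grid combination, sample a fixed number of random candidates (illustrative only; `n_iter` and the parameter ranges are arbitrary here):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=200, n_features=6, random_state=42)

# 4 x 3 = 12 possible combinations, but only 5 are sampled and evaluated
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    {"n_estimators": [50, 100, 150, 200], "max_depth": [3, 5, None]},
    n_iter=5,
    cv=3,
    random_state=42,   # makes the sampled candidates reproducible
)
search.fit(X, y)
```

With continuous ranges you can pass scipy distributions (e.g. `scipy.stats.randint`) instead of lists, which is where randomized search pays off most.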
Common Mistakes
- Data leakage: fitting the scaler on the full dataset before the train-test split
- Not using stratified split for classification
- Forgetting to handle categorical variables
- Not saving visualizations before submission