Assignment Overview
In this assignment, you will build a complete machine learning pipeline from scratch. This comprehensive project requires you to apply ALL concepts from Module 2: data loading, preprocessing, train/test splitting, classification models (KNN, Decision Trees), regression models (Linear Regression), and model evaluation with industry-standard metrics.
You will use numpy, pandas, scikit-learn, and matplotlib: the standard industry tools for machine learning tasks.
- Data Preparation: loading, cleaning, splitting, and scaling datasets
- Classification: KNN, Decision Trees, accuracy, precision, recall, F1
- Regression: Linear Regression, MSE, RMSE, MAE, R² score
The Scenario
TechPredict Analytics
You've just joined TechPredict Analytics, a consulting firm that helps businesses make data-driven decisions. Your manager has assigned you two key projects:
"We have two clients who need predictive models. First, a telecom company wants to identify customers likely to cancel their subscription. Second, a real estate agency needs an automated system to estimate house prices. Can you build ML models for both problems?"
Project 1: Classification
Customer Churn Prediction
Build a classifier to predict which customers are likely to leave the telecom company, so retention offers can be made proactively.
Project 2: Regression
House Price Estimation
Build a regression model to estimate house sale prices based on property features like square footage, bedrooms, and age.
Your Task
Create a Python file called ml_pipeline.py that implements a complete
machine learning pipeline. Your code must load datasets, train multiple models, evaluate
their performance, and generate visualizations comparing results.
The Datasets
You will work with TWO datasets. Create these CSV files or generate synthetic data:
customer_churn.csv
Telecom customer data with churn labels
| Column | Description |
|---|---|
| tenure | Months as customer |
| monthly_charges | Monthly bill amount |
| total_charges | Total amount paid |
| contract_type | 0=Monthly, 1=Yearly |
| churn | 0=Stayed, 1=Left (target) |
house_prices.csv
House features and sale prices
| Column | Description |
|---|---|
| sqft | Square footage |
| bedrooms | Number of bedrooms |
| bathrooms | Number of bathrooms |
| age | House age in years |
| price | Sale price $ (target) |
If you prefer synthetic data, scikit-learn's make_classification and make_regression can generate stand-in datasets.
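As a sketch, here is one way the churn CSV could be generated synthetically. The column names follow the table above, but the feature values are standardized noise from make_classification, not realistic tenures or charges:

```python
import os

import pandas as pd
from sklearn.datasets import make_classification

# Generate a synthetic stand-in for customer_churn.csv.
# Shape and column names match the assignment spec; values are illustrative.
X, y = make_classification(n_samples=500, n_features=4, n_informative=3,
                           n_redundant=1, random_state=42)
df = pd.DataFrame(X, columns=["tenure", "monthly_charges",
                              "total_charges", "contract_type"])
df["churn"] = y

os.makedirs("data", exist_ok=True)
df.to_csv("data/customer_churn.csv", index=False)
```

make_regression can be used the same way for the house-price dataset.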
Requirements
Implement the following 13 functions in your ml_pipeline.py file.
Each function should follow the exact signature provided:
load_dataset
Load a CSV file into a pandas DataFrame. Handle missing values by dropping rows with any null values.
```python
def load_dataset(filepath: str) -> pd.DataFrame:
    """
    Load and clean a CSV dataset.

    Args:
        filepath: Path to the CSV file

    Returns:
        Cleaned pandas DataFrame with no missing values
    """
    pass
```
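A minimal sketch of this function, assuming pandas' dropna covers the required cleaning:

```python
import pandas as pd

def load_dataset(filepath: str) -> pd.DataFrame:
    """Load a CSV and drop any rows containing null values."""
    df = pd.read_csv(filepath)
    # reset_index keeps row labels contiguous after the drop
    return df.dropna().reset_index(drop=True)
```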
split_features_target
Separate the dataset into features (X) and target variable (y).
```python
def split_features_target(df: pd.DataFrame, target_col: str) -> tuple:
    """
    Split DataFrame into features and target.

    Args:
        df: Input DataFrame
        target_col: Name of the target column

    Returns:
        Tuple of (X, y) where X is features DataFrame and y is target Series
    """
    pass
```
create_train_test_split
Split data into training and testing sets. Use an 80/20 split ratio and set a random state for reproducibility.
```python
def create_train_test_split(X, y, test_size=0.2, random_state=42) -> tuple:
    """
    Create train/test split of the data.

    Args:
        X: Features
        y: Target
        test_size: Proportion of data for testing (default 0.2)
        random_state: Random seed for reproducibility

    Returns:
        Tuple of (X_train, X_test, y_train, y_test)
    """
    pass
```
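One way to satisfy this requirement is a thin wrapper over scikit-learn's splitter, which already returns the four arrays in the expected order:

```python
from sklearn.model_selection import train_test_split

def create_train_test_split(X, y, test_size=0.2, random_state=42) -> tuple:
    """Split data 80/20 by default, with a fixed seed for reproducibility."""
    return train_test_split(X, y, test_size=test_size, random_state=random_state)
```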
scale_features
Standardize features using StandardScaler. Fit on training data only, then transform both train and test.
```python
def scale_features(X_train, X_test) -> tuple:
    """
    Standardize features using StandardScaler.

    Args:
        X_train: Training features
        X_test: Test features

    Returns:
        Tuple of (X_train_scaled, X_test_scaled, scaler)
    """
    pass
```
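A sketch illustrating the fit-on-train-only rule: the scaler learns its mean and standard deviation from the training split and merely applies them to the test split.

```python
from sklearn.preprocessing import StandardScaler

def scale_features(X_train, X_test) -> tuple:
    """Fit a StandardScaler on the training data only, then transform
    both splits, so no test-set statistics leak into training."""
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)  # reuse train statistics
    return X_train_scaled, X_test_scaled, scaler
```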
train_knn_classifier
Train a K-Nearest Neighbors classifier with configurable number of neighbors.
```python
def train_knn_classifier(X_train, y_train, n_neighbors=5):
    """
    Train a KNN classifier.

    Args:
        X_train: Training features
        y_train: Training labels
        n_neighbors: Number of neighbors (default 5)

    Returns:
        Trained KNeighborsClassifier model
    """
    pass
```
train_decision_tree_classifier
Train a Decision Tree classifier with maximum depth to prevent overfitting.
```python
def train_decision_tree_classifier(X_train, y_train, max_depth=5, random_state=42):
    """
    Train a Decision Tree classifier.

    Args:
        X_train: Training features
        y_train: Training labels
        max_depth: Maximum tree depth (default 5)
        random_state: Random seed for reproducibility

    Returns:
        Trained DecisionTreeClassifier model
    """
    pass
```
train_linear_regression
Train a Linear Regression model for the house price prediction task.
```python
def train_linear_regression(X_train, y_train):
    """
    Train a Linear Regression model.

    Args:
        X_train: Training features
        y_train: Training target values

    Returns:
        Trained LinearRegression model
    """
    pass
```
evaluate_classifier
Evaluate a classification model. Calculate accuracy, precision, recall, and F1-score.
```python
def evaluate_classifier(model, X_test, y_test) -> dict:
    """
    Evaluate a classification model.

    Args:
        model: Trained classifier
        X_test: Test features
        y_test: True labels

    Returns:
        Dictionary with 'accuracy', 'precision', 'recall', 'f1' scores
    """
    pass
```
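A possible sketch using sklearn.metrics. The zero_division=0 argument is a choice, not a requirement: it avoids warnings when a class is never predicted.

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

def evaluate_classifier(model, X_test, y_test) -> dict:
    """Compute the four standard classification metrics on the test set."""
    y_pred = model.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "recall": recall_score(y_test, y_pred, zero_division=0),
        "f1": f1_score(y_test, y_pred, zero_division=0),
    }
```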
evaluate_regressor
Evaluate a regression model. Calculate MSE, RMSE, MAE, and R² score.
```python
def evaluate_regressor(model, X_test, y_test) -> dict:
    """
    Evaluate a regression model.

    Args:
        model: Trained regressor
        X_test: Test features
        y_test: True target values

    Returns:
        Dictionary with 'mse', 'rmse', 'mae', 'r2' scores
    """
    pass
```
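One sketch; taking the square root of MSE yourself sidesteps version-specific keyword arguments for RMSE that have changed across scikit-learn releases.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

def evaluate_regressor(model, X_test, y_test) -> dict:
    """Compute MSE, RMSE, MAE, and R² on the test set."""
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    return {
        "mse": mse,
        "rmse": float(np.sqrt(mse)),  # RMSE is the square root of MSE
        "mae": mean_absolute_error(y_test, y_pred),
        "r2": r2_score(y_test, y_pred),
    }
```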
compare_classifiers
Compare KNN and Decision Tree classifiers by training and evaluating both.
```python
def compare_classifiers(X_train, X_test, y_train, y_test) -> dict:
    """
    Train and compare KNN and Decision Tree classifiers.

    Args:
        X_train, X_test: Train and test features
        y_train, y_test: Train and test labels

    Returns:
        Dictionary with model names as keys and evaluation dicts as values
        Example: {'KNN': {'accuracy': 0.85, ...}, 'DecisionTree': {...}}
    """
    pass
```
plot_confusion_matrix
Create and save a confusion matrix visualization for a classifier.
```python
def plot_confusion_matrix(model, X_test, y_test, save_path: str) -> None:
    """
    Plot and save confusion matrix.

    Args:
        model: Trained classifier
        X_test: Test features
        y_test: True labels
        save_path: File path to save the plot
    """
    pass
```
plot_regression_results
Create a scatter plot comparing actual vs predicted values for regression.
```python
def plot_regression_results(model, X_test, y_test, save_path: str) -> None:
    """
    Plot actual vs predicted values for regression.

    Args:
        model: Trained regressor
        X_test: Test features
        y_test: True target values
        save_path: File path to save the plot
    """
    pass
```
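One possible sketch: points close to the dashed diagonal are accurate predictions. The axis labels assume the house-price task:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; works without a display
import matplotlib.pyplot as plt

def plot_regression_results(model, X_test, y_test, save_path: str) -> None:
    """Scatter actual vs. predicted values and save the figure."""
    y_pred = model.predict(X_test)
    plt.figure()
    plt.scatter(y_test, y_pred, alpha=0.6)
    lims = [min(y_test), max(y_test)]
    plt.plot(lims, lims, "r--", label="Perfect prediction")
    plt.xlabel("Actual price")
    plt.ylabel("Predicted price")
    plt.legend()
    plt.savefig(save_path, bbox_inches="tight")
    plt.close()
```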
main (Pipeline Execution)
Create a main function that orchestrates the entire ML pipeline.
```python
def main():
    """
    Execute the complete ML pipeline:
    1. Load and prepare both datasets
    2. Train classification models (KNN, Decision Tree)
    3. Train regression model (Linear Regression)
    4. Evaluate all models
    5. Generate visualizations
    6. Print summary report
    """
    pass


if __name__ == "__main__":
    main()
```
Submission
Create a public GitHub repository with the exact name shown below:
Required Repository Name
ai-ml-basics-assignment
Required Files
ai-ml-basics-assignment/
├── ml_pipeline.py # Main implementation file with ALL 13 functions
├── data/
│ ├── customer_churn.csv # Classification dataset
│ └── house_prices.csv # Regression dataset
├── outputs/
│ ├── confusion_matrix.png
│ └── regression_plot.png
├── requirements.txt # Dependencies
└── README.md # REQUIRED - see contents below
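For reference, a minimal requirements.txt could be as simple as the following; pin exact versions if you want your results to be reproducible on the grader's machine:

```
numpy
pandas
scikit-learn
matplotlib
```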
README.md Must Include:
- Your full name and submission date
- Results Summary: A table showing model performance metrics
- Key Findings: Which model performed better and why
- Challenges: Any difficulties you encountered and how you solved them
- Instructions to run your code
Do Include
- All 13 functions implemented and working
- Docstrings for every function
- Type hints in function signatures
- Both output visualizations from running your code
- PEP 8 compliant code style
- README.md with all required sections
Do Not Include
- Hardcoded file paths
- Any .pyc or __pycache__ files (use .gitignore)
- Virtual environment folders
- Code that doesn't run without errors
- Deprecated sklearn functions
- Code copied without understanding
Enter your GitHub username when submitting; your repository will be verified automatically.
Grading Rubric
| Criteria | Points | Description |
|---|---|---|
| Data Loading & Prep | 30 | Correctly loads CSVs, handles missing values, splits features/target |
| Train/Test Split | 20 | Proper 80/20 split with random state, feature scaling implemented |
| Classification Models | 50 | KNN and Decision Tree correctly implemented and trained |
| Regression Model | 30 | Linear Regression correctly implemented and trained |
| Model Evaluation | 40 | All metrics calculated correctly (accuracy, precision, recall, F1, MSE, R²) |
| Visualizations | 30 | Confusion matrix and regression plot generated and saved |
| Code Quality | 30 | PEP 8, docstrings, type hints, clean structure, error handling |
| Documentation | 20 | Complete README with setup, results, and analysis |
| Total | 250 | |
Ready to Submit?
Make sure you have completed all requirements and reviewed the grading rubric above.
Pro Tips
Start Simple, Then Iterate
- Start with loading data and printing its shape
- Add train/test split, then add one model
- Test each step before moving to the next
- Build complexity gradually, not all at once
Feature Scaling
- Always scale before KNN (distance-based)
- Fit scaler on train data only, transform both
- Decision Trees don't require scaling
- Never fit on test data—that's data leakage!
Evaluation Tips
- Use classification_report() for all metrics
- Check for class imbalance with y.value_counts()
- Accuracy alone can be misleading
- Also look at precision, recall, and F1
Common Mistakes
- Using deprecated sklearn functions
- Hardcoding file paths instead of parameters
- Fitting scaler on entire dataset (data leakage)
- Not setting random_state for reproducibility