Assignment Overview
In this assignment, you will build a complete ML Data Pipeline for predicting employee attrition. This comprehensive project requires you to apply ALL concepts from Module 1: understanding ML types, data exploration, preprocessing techniques, feature engineering, train-test splitting, and implementing basic evaluation metrics.
You will use pandas, numpy, matplotlib, seaborn, and scikit-learn for this assignment.
ML Concepts (1.1)
Supervised vs unsupervised, classification vs regression, model types
Data Preprocessing (1.2)
Missing values, encoding, scaling, outlier handling
ML Workflow (1.3)
Train-test split, cross-validation, evaluation metrics
The Scenario
TechCorp HR Analytics
You have been hired as a Junior Machine Learning Engineer at TechCorp Industries, a technology company facing high employee turnover. The HR director has given you this task:
"We have historical employee data including demographics, job details, and satisfaction scores. We need you to build a data pipeline that prepares this data for an attrition prediction model. Focus on proper preprocessing - the model accuracy depends on clean, well-prepared data!"
Your Task
Create a Jupyter Notebook called ml_pipeline.ipynb that implements a complete
ML data preprocessing and evaluation pipeline. Your code must load the dataset, perform
exploratory data analysis, preprocess features, split data properly, and calculate
evaluation metrics on a baseline model.
The Dataset
You will work with an Employee Attrition dataset. Create this CSV file as shown below:
File: employees.csv (Employee Data)
employee_id,age,gender,department,job_role,years_at_company,monthly_income,satisfaction_score,work_life_balance,overtime,distance_from_home,education_level,performance_rating,attrition
E001,32,Male,Engineering,Software Engineer,5,85000,4.2,3,Yes,12,Bachelor,4,No
E002,28,Female,Sales,Sales Rep,2,45000,2.8,2,Yes,25,Bachelor,3,Yes
E003,45,Male,HR,HR Manager,12,95000,4.5,4,No,5,Master,5,No
E004,35,Female,Engineering,Data Scientist,3,92000,3.5,3,Yes,18,PhD,4,No
E005,26,Male,Marketing,Marketing Analyst,1,48000,2.2,2,Yes,30,Bachelor,2,Yes
E006,41,Female,Finance,Financial Analyst,8,78000,4.0,4,No,8,Master,4,No
E007,29,Male,Engineering,DevOps Engineer,4,82000,3.8,3,No,15,Bachelor,4,No
E008,38,Female,Sales,Sales Manager,7,88000,3.2,2,Yes,22,Master,3,Yes
E009,24,Male,Marketing,Content Writer,1,42000,2.5,2,Yes,35,Bachelor,3,Yes
E010,52,Female,HR,HR Director,15,125000,4.8,5,No,3,Master,5,No
E011,31,Male,Engineering,Frontend Dev,3,75000,3.0,3,Yes,20,Bachelor,3,No
E012,27,Female,Finance,Accountant,2,55000,2.9,2,Yes,28,Bachelor,3,Yes
E013,44,Male,Engineering,Tech Lead,10,115000,4.4,4,No,7,Master,5,No
E014,33,Female,Sales,Sales Rep,4,52000,3.3,3,No,16,Bachelor,3,No
E015,36,Male,Marketing,Marketing Manager,6,85000,4.1,4,No,10,Master,4,No
E016,25,Female,Engineering,Junior Dev,1,55000,2.6,2,Yes,32,Bachelor,2,Yes
E017,48,Male,Finance,CFO,14,150000,4.7,5,No,4,MBA,5,No
E018,30,Female,HR,Recruiter,3,52000,3.4,3,Yes,19,Bachelor,3,No
E019,34,Male,Engineering,Backend Dev,5,88000,3.9,3,No,14,Master,4,No
E020,22,Female,Marketing,Intern,0,28000,2.0,1,Yes,40,High School,2,Yes
Columns Explained
- employee_id - Unique identifier (string)
- age - Employee age (integer)
- gender - Gender (categorical: Male/Female)
- department - Department name (categorical)
- job_role - Job title (categorical)
- years_at_company - Tenure in years (integer)
- monthly_income - Monthly salary (integer)
- satisfaction_score - Job satisfaction 1-5 (float)
- work_life_balance - WLB rating 1-5 (integer)
- overtime - Works overtime (categorical: Yes/No)
- distance_from_home - Commute distance in km (integer)
- education_level - Highest education (categorical)
- performance_rating - Performance score 1-5 (integer)
- attrition - Left company (target: Yes/No)
Requirements
Your ml_pipeline.ipynb must implement ALL of the following functions.
Each function is mandatory and will be tested individually.
Load and Explore Data
Create a function load_and_explore(filename) that:
- Loads the CSV file using pandas
- Returns a dictionary with: shape, dtypes, missing values count, basic statistics
- Prints a summary of the dataset
def load_and_explore(filename):
    """Load dataset and return exploration summary."""
    # Must return: dict with 'shape', 'dtypes', 'missing', 'stats'
    pass
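One possible shape for this function (a sketch, not the only valid implementation) keyed to the four required dictionary entries:

```python
import pandas as pd

def load_and_explore(filename):
    """Load dataset and return exploration summary."""
    df = pd.read_csv(filename)
    summary = {
        "shape": df.shape,                          # (rows, columns)
        "dtypes": df.dtypes.astype(str).to_dict(),  # column -> dtype name
        "missing": df.isna().sum().to_dict(),       # column -> NaN count
        "stats": df.describe(),                     # numeric summary statistics
    }
    print(f"Loaded {df.shape[0]} rows x {df.shape[1]} columns")
    print("Columns with missing values:",
          [c for c, n in summary["missing"].items() if n > 0])
    return summary
```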
Visualize Data Distributions
Create a function visualize_distributions(df) that:
- Creates histograms for numerical columns
- Creates bar charts for categorical columns
- Shows class distribution of the target variable
- Saves plots as eda_plots.png
def visualize_distributions(df):
    """Create and save EDA visualizations."""
    # Must save: eda_plots.png
    pass
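A minimal sketch using matplotlib alone (seaborn works equally well); it treats every non-numeric column, including the attrition target, as categorical:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so no display is needed
import matplotlib.pyplot as plt

def visualize_distributions(df):
    """Create and save EDA visualizations to eda_plots.png."""
    num_cols = list(df.select_dtypes(include="number").columns)
    cat_cols = list(df.select_dtypes(exclude="number").columns)
    cols = num_cols + cat_cols
    nrows = (len(cols) + 2) // 3
    fig, axes = plt.subplots(nrows, 3, figsize=(15, 3.5 * nrows))
    axes = axes.ravel()
    for ax, col in zip(axes, cols):
        if col in num_cols:
            ax.hist(df[col], bins=10)               # numeric: histogram
        else:
            df[col].value_counts().plot.bar(ax=ax)  # categorical (incl. target): bar chart
        ax.set_title(col)
    for ax in axes[len(cols):]:
        ax.set_visible(False)                       # hide unused panels
    fig.tight_layout()
    fig.savefig("eda_plots.png")
    plt.close(fig)
```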
Handle Missing Values
Create a function handle_missing_values(df, strategy='auto') that:
- Detects missing values in each column
- Fills numerical columns with median (or mean based on strategy)
- Fills categorical columns with mode
- Returns the cleaned DataFrame and a report of changes
def handle_missing_values(df, strategy='auto'):
    """Handle missing values with specified strategy."""
    # Return: (cleaned_df, missing_report_dict)
    pass
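One way to meet the spec (median by default, mean when requested, mode for categoricals):

```python
import pandas as pd

def handle_missing_values(df, strategy='auto'):
    """Handle missing values with specified strategy."""
    df = df.copy()
    report = {}
    for col in df.columns:
        n_missing = int(df[col].isna().sum())
        if n_missing == 0:
            continue
        if pd.api.types.is_numeric_dtype(df[col]):
            # 'auto' and 'median' both fall back to the median
            fill = df[col].mean() if strategy == 'mean' else df[col].median()
        else:
            fill = df[col].mode().iloc[0]  # most frequent category
        df[col] = df[col].fillna(fill)
        report[col] = {"filled": n_missing, "value": fill}
    return df, report
```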
Detect and Handle Outliers
Create a function handle_outliers(df, columns, method='iqr') that:
- Uses IQR method to detect outliers in specified numerical columns
- Returns outlier counts per column
- Optionally caps outliers at threshold values
def handle_outliers(df, columns, method='iqr'):
    """Detect and handle outliers using IQR method."""
    # Return: (cleaned_df, outliers_report_dict)
    pass
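A sketch of the standard 1.5×IQR rule, capping values at the fences rather than dropping rows:

```python
import pandas as pd

def handle_outliers(df, columns, method='iqr'):
    """Detect and handle outliers using the 1.5*IQR rule."""
    df = df.copy()
    report = {}
    for col in columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        mask = (df[col] < lower) | (df[col] > upper)
        report[col] = int(mask.sum())       # outlier count per column
        df[col] = df[col].clip(lower, upper)  # cap at the fences
    return df, report
```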
Encode Categorical Variables
Create a function encode_categorical(df, columns) that:
- Uses Label Encoding for binary categorical columns
- Uses One-Hot Encoding for multi-class categorical columns
- Returns the encoded DataFrame and encoder objects
def encode_categorical(df, columns):
    """Encode categorical variables appropriately."""
    # Return: (encoded_df, encoders_dict)
    pass
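A possible implementation; note that LabelEncoder assigns codes alphabetically, so No becomes 0 and Yes becomes 1:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def encode_categorical(df, columns):
    """Label-encode binary columns, one-hot encode multi-class columns."""
    df = df.copy()
    encoders = {}
    for col in columns:
        if df[col].nunique() == 2:
            le = LabelEncoder()
            df[col] = le.fit_transform(df[col])
            encoders[col] = le
        else:
            dummies = pd.get_dummies(df[col], prefix=col, dtype=int)
            df = pd.concat([df.drop(columns=col), dummies], axis=1)
            encoders[col] = list(dummies.columns)  # record the new columns
    return df, encoders
```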
Scale Numerical Features
Create a function scale_features(df, columns, method='standard') that:
- Applies StandardScaler or MinMaxScaler based on method
- Returns scaled DataFrame and scaler object
- Preserves column names after scaling
def scale_features(df, columns, method='standard'):
    """Scale numerical features using specified method."""
    # Return: (scaled_df, scaler_object)
    pass
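A minimal sketch; it fits on whatever frame it receives, so in a leakage-safe pipeline you would fit on the training split only (see Pro Tips):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

def scale_features(df, columns, method='standard'):
    """Scale the given columns, preserving column names."""
    df = df.copy()
    scaler = StandardScaler() if method == 'standard' else MinMaxScaler()
    # Assigning back into the same columns keeps the original names
    df[columns] = scaler.fit_transform(df[columns])
    return df, scaler
```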
Create Train-Test Split
Create a function split_data(df, target_column, test_size=0.2, stratify=True) that:
- Separates features (X) and target (y)
- Performs stratified split to maintain class balance
- Returns X_train, X_test, y_train, y_test
- Prints class distribution in both sets
def split_data(df, target_column, test_size=0.2, stratify=True):
    """Split data into train and test sets with stratification."""
    # Return: (X_train, X_test, y_train, y_test)
    pass
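One possible implementation; the fixed random_state is an assumption added for reproducibility:

```python
from sklearn.model_selection import train_test_split

def split_data(df, target_column, test_size=0.2, stratify=True):
    """Split data into train and test sets with stratification."""
    X = df.drop(columns=target_column)
    y = df[target_column]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=42,
        stratify=y if stratify else None)  # preserve class proportions
    print("Train class distribution:\n", y_train.value_counts(normalize=True))
    print("Test class distribution:\n", y_test.value_counts(normalize=True))
    return X_train, X_test, y_train, y_test
```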
Create Feature Correlation Matrix
Create a function get_correlation_matrix(df, threshold=0.7) that:
- Calculates correlation matrix for numerical features
- Identifies highly correlated feature pairs above threshold
- Saves correlation heatmap as correlation_matrix.png
- Returns list of highly correlated pairs
def get_correlation_matrix(df, threshold=0.7):
    """Generate correlation matrix and identify high correlations."""
    # Return: list of (col1, col2, correlation) tuples
    pass
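A sketch drawing the heatmap with plain matplotlib (seaborn's heatmap is a drop-in alternative):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

def get_correlation_matrix(df, threshold=0.7):
    """Generate correlation matrix and identify high correlations."""
    corr = df.select_dtypes(include="number").corr()
    fig, ax = plt.subplots(figsize=(8, 6))
    im = ax.imshow(corr.values, cmap="coolwarm", vmin=-1, vmax=1)
    ax.set_xticks(range(len(corr.columns)))
    ax.set_xticklabels(corr.columns, rotation=90)
    ax.set_yticks(range(len(corr.columns)))
    ax.set_yticklabels(corr.columns)
    fig.colorbar(im, ax=ax)
    fig.tight_layout()
    fig.savefig("correlation_matrix.png")
    plt.close(fig)
    # Collect upper-triangle pairs above the threshold
    pairs = []
    cols = list(corr.columns)
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if abs(corr.iloc[i, j]) > threshold:
                pairs.append((cols[i], cols[j], round(float(corr.iloc[i, j]), 3)))
    return pairs
```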
Calculate Evaluation Metrics
Create a function calculate_metrics(y_true, y_pred) that:
- Calculates accuracy, precision, recall, and F1-score
- Generates confusion matrix
- Returns a dictionary with all metrics
def calculate_metrics(y_true, y_pred):
    """Calculate classification evaluation metrics."""
    # Return: dict with 'accuracy', 'precision', 'recall', 'f1', 'confusion_matrix'
    pass
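A sketch built on scikit-learn's metric functions; pos_label="Yes" is an assumption based on the string labels in the attrition column (drop it if labels are already 0/1):

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

def calculate_metrics(y_true, y_pred):
    """Calculate classification evaluation metrics."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, pos_label="Yes"),
        "recall": recall_score(y_true, y_pred, pos_label="Yes"),
        "f1": f1_score(y_true, y_pred, pos_label="Yes"),
        "confusion_matrix": confusion_matrix(y_true, y_pred),
    }
```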
Train Baseline Model
Create a function train_baseline(X_train, X_test, y_train, y_test) that:
- Trains a simple Logistic Regression as baseline
- Makes predictions on test set
- Returns model and predictions
def train_baseline(X_train, X_test, y_train, y_test):
    """Train a baseline logistic regression model."""
    # Return: (model, y_pred)
    pass
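The baseline can be as simple as:

```python
from sklearn.linear_model import LogisticRegression

def train_baseline(X_train, X_test, y_train, y_test):
    """Train a baseline logistic regression model."""
    # max_iter raised from the default 100 to help convergence on scaled data
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return model, y_pred
```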
Save Preprocessed Data
Create a function save_processed_data(X_train, X_test, y_train, y_test, prefix='processed') that:
- Saves train and test sets as CSV files
- Creates processed_X_train.csv, processed_X_test.csv, etc.
- Returns list of saved file paths
def save_processed_data(X_train, X_test, y_train, y_test, prefix='processed'):
    """Save preprocessed train and test data to CSV files."""
    # Return: list of saved file paths
    pass
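A compact sketch that builds the file names from the prefix:

```python
def save_processed_data(X_train, X_test, y_train, y_test, prefix='processed'):
    """Save preprocessed train and test data to CSV files."""
    parts = {"X_train": X_train, "X_test": X_test,
             "y_train": y_train, "y_test": y_test}
    paths = []
    for name, data in parts.items():
        path = f"{prefix}_{name}.csv"   # e.g. processed_X_train.csv
        data.to_csv(path, index=False)
        paths.append(path)
    return paths
```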
Main Pipeline
Create a main() function that:
- Runs the complete pipeline from loading to evaluation
- Prints summary statistics at each step
- Generates all required output files
- Prints final baseline model metrics
def main():
    # 1. Load and explore data
    exploration = load_and_explore("employees.csv")
    df = pd.read_csv("employees.csv")
    # Drop the identifier: it is unique per row and has no predictive value
    df = df.drop(columns=['employee_id'])
    # 2. Visualize distributions
    visualize_distributions(df)
    # 3. Handle missing values
    df_clean, missing_report = handle_missing_values(df)
    # 4. Handle outliers
    numerical_cols = ['age', 'monthly_income', 'distance_from_home']
    df_clean, outlier_report = handle_outliers(df_clean, numerical_cols)
    # 5. Encode categorical variables (job_role included so no string features remain)
    cat_cols = ['gender', 'department', 'job_role', 'overtime', 'education_level']
    df_encoded, encoders = encode_categorical(df_clean, cat_cols)
    # 6. Scale features (for simplicity the whole frame is scaled here;
    #    to fully avoid leakage, fit the scaler on the training split only)
    scale_cols = ['age', 'years_at_company', 'monthly_income', 'distance_from_home']
    df_scaled, scaler = scale_features(df_encoded, scale_cols)
    # 7. Split data
    X_train, X_test, y_train, y_test = split_data(df_scaled, 'attrition')
    # 8. Get correlations
    correlations = get_correlation_matrix(df_scaled)
    # 9. Train baseline model
    model, y_pred = train_baseline(X_train, X_test, y_train, y_test)
    # 10. Calculate metrics
    metrics = calculate_metrics(y_test, y_pred)
    print(f"Baseline Model Metrics: {metrics}")
    # 11. Save processed data
    save_processed_data(X_train, X_test, y_train, y_test)
    print("ML Pipeline Complete!")

if __name__ == "__main__":
    main()
Submission
Create a public GitHub repository with the exact name shown below:
Required Repository Name
employee-attrition-pipeline
Required Files
employee-attrition-pipeline/
├── ml_pipeline.ipynb # Your Jupyter Notebook with ALL 12 functions
├── employees.csv # Input dataset (as provided or extended)
├── eda_plots.png # Generated EDA visualizations
├── correlation_matrix.png # Generated correlation heatmap
├── processed_X_train.csv # Preprocessed training features
├── processed_X_test.csv # Preprocessed test features
├── processed_y_train.csv # Training labels
├── processed_y_test.csv # Test labels
└── README.md # REQUIRED - see contents below
README.md Must Include:
- Your full name and submission date
- Brief description of your preprocessing approach
- Summary of baseline model metrics achieved
- Any challenges faced and how you solved them
- Instructions to run your notebook
Do Include
- All 12 functions implemented and working
- Docstrings for every function
- Comments explaining preprocessing decisions
- All output files from running your notebook
- Visualizations with proper labels and titles
- README.md with all required sections
Do Not Include
- Any .pyc or __pycache__ files (use .gitignore)
- Virtual environment folders
- Large model files (just the code)
- Code that doesn't run without errors
- Hardcoded paths (use relative paths)
Enter your GitHub username - we'll verify your repository automatically
Grading Rubric
Your assignment will be graded on the following criteria:
| Criteria | Points | Description |
|---|---|---|
| Data Loading & EDA | 20 | Correct data loading, exploration summary, and meaningful visualizations |
| Preprocessing Functions | 40 | Proper handling of missing values, outliers, encoding, and scaling |
| Train-Test Split | 15 | Correct stratified splitting with proper feature-target separation |
| Evaluation Metrics | 25 | Accurate calculation of all classification metrics and confusion matrix |
| Baseline Model | 20 | Working baseline model with predictions and proper evaluation |
| Code Quality | 30 | Docstrings, comments, naming conventions, and clean organization |
| Total | 150 |
Ready to Submit?
Make sure you have completed all requirements and reviewed the grading rubric above.
What You Will Practice
Data Exploration (1.1)
Understanding data structure, distributions, identifying data quality issues
Data Preprocessing (1.2)
Handling missing values, outliers, encoding categorical variables, feature scaling
Data Splitting (1.3)
Proper train-test splitting with stratification to prevent data leakage
Model Evaluation
Understanding accuracy, precision, recall, F1-score, and confusion matrix
Pro Tips
Preprocessing Best Practices
- Always explore before preprocessing
- Scale AFTER splitting to prevent data leakage
- Use fit_transform on train, transform on test
- Document your preprocessing decisions
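The fit-on-train, transform-on-test rule from the tips above can be sketched as follows (synthetic data for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data: 10 samples, 2 features, balanced binary target
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0, 1] * 5)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)  # learn mean/std from train only
X_test_s = scaler.transform(X_test)        # reuse the train statistics
```

Because the scaler never sees the test rows, no information leaks from test to train; the test set is transformed with statistics learned elsewhere, exactly as unseen production data would be.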
Code Quality
- Write modular, reusable functions
- Add docstrings explaining parameters
- Handle edge cases (empty data, all NaN)
- Return meaningful results from functions
Evaluation Tips
- Accuracy can be misleading for imbalanced data
- Use precision for minimizing false positives
- Use recall for minimizing false negatives
- F1-score balances precision and recall
Common Mistakes
- Scaling before splitting (data leakage!)
- Not using stratified split for classification
- Forgetting to handle the target variable
- Using wrong encoder type for columns