Assignment 1-A

ML Fundamentals Practice Problems

Build a complete ML pipeline that demonstrates all Module 1 concepts: data loading and exploration, preprocessing, feature engineering, train-test splitting, and basic model evaluation metrics.

4-6 hours
Intermediate
150 Points
What You'll Practice
  • Load and explore datasets
  • Handle missing values & outliers
  • Encode categorical variables
  • Scale and normalize features
  • Split data and evaluate models
01

Assignment Overview

In this assignment, you will build a complete ML Data Pipeline for predicting employee attrition. This comprehensive project requires you to apply ALL concepts from Module 1: understanding ML types, data exploration, preprocessing techniques, feature engineering, train-test splitting, and implementing basic evaluation metrics.

Libraries Allowed: You may use pandas, numpy, matplotlib, seaborn, and scikit-learn for this assignment.
Skills Applied: This assignment tests your understanding of ML basics (Topic 1.1), data preprocessing (Topic 1.2), and ML workflow (Topic 1.3) from Module 1.
ML Concepts (1.1)

Supervised vs unsupervised, classification vs regression, model types

Data Preprocessing (1.2)

Missing values, encoding, scaling, outlier handling

ML Workflow (1.3)

Train-test split, cross-validation, evaluation metrics

02

The Scenario

TechCorp HR Analytics

You have been hired as a Junior Machine Learning Engineer at TechCorp Industries, a technology company facing high employee turnover. The HR director has given you this task:

"We have historical employee data including demographics, job details, and satisfaction scores. We need you to build a data pipeline that prepares this data for an attrition prediction model. Focus on proper preprocessing - the model's accuracy depends on clean, well-prepared data!"

Your Task

Create a Jupyter Notebook called ml_pipeline.ipynb that implements a complete ML data preprocessing and evaluation pipeline. Your code must load the dataset, perform exploratory data analysis, preprocess features, split data properly, and calculate evaluation metrics on a baseline model.

03

The Dataset

You will work with an Employee Attrition dataset. Create this CSV file as shown below:

File: employees.csv (Employee Data)

employee_id,age,gender,department,job_role,years_at_company,monthly_income,satisfaction_score,work_life_balance,overtime,distance_from_home,education_level,performance_rating,attrition
E001,32,Male,Engineering,Software Engineer,5,85000,4.2,3,Yes,12,Bachelor,4,No
E002,28,Female,Sales,Sales Rep,2,45000,2.8,2,Yes,25,Bachelor,3,Yes
E003,45,Male,HR,HR Manager,12,95000,4.5,4,No,5,Master,5,No
E004,35,Female,Engineering,Data Scientist,3,92000,3.5,3,Yes,18,PhD,4,No
E005,26,Male,Marketing,Marketing Analyst,1,48000,2.2,2,Yes,30,Bachelor,2,Yes
E006,41,Female,Finance,Financial Analyst,8,78000,4.0,4,No,8,Master,4,No
E007,29,Male,Engineering,DevOps Engineer,4,82000,3.8,3,No,15,Bachelor,4,No
E008,38,Female,Sales,Sales Manager,7,88000,3.2,2,Yes,22,Master,3,Yes
E009,24,Male,Marketing,Content Writer,1,42000,2.5,2,Yes,35,Bachelor,3,Yes
E010,52,Female,HR,HR Director,15,125000,4.8,5,No,3,Master,5,No
E011,31,Male,Engineering,Frontend Dev,3,75000,3.0,3,Yes,20,Bachelor,3,No
E012,27,Female,Finance,Accountant,2,55000,2.9,2,Yes,28,Bachelor,3,Yes
E013,44,Male,Engineering,Tech Lead,10,115000,4.4,4,No,7,Master,5,No
E014,33,Female,Sales,Sales Rep,4,52000,3.3,3,No,16,Bachelor,3,No
E015,36,Male,Marketing,Marketing Manager,6,85000,4.1,4,No,10,Master,4,No
E016,25,Female,Engineering,Junior Dev,1,55000,2.6,2,Yes,32,Bachelor,2,Yes
E017,48,Male,Finance,CFO,14,150000,4.7,5,No,4,MBA,5,No
E018,30,Female,HR,Recruiter,3,52000,3.4,3,Yes,19,Bachelor,3,No
E019,34,Male,Engineering,Backend Dev,5,88000,3.9,3,No,14,Master,4,No
E020,22,Female,Marketing,Intern,0,28000,2.0,1,Yes,40,High School,2,Yes
Columns Explained
  • employee_id - Unique identifier (string)
  • age - Employee age (integer)
  • gender - Gender (categorical: Male/Female)
  • department - Department name (categorical)
  • job_role - Job title (categorical)
  • years_at_company - Tenure in years (integer)
  • monthly_income - Monthly salary (integer)
  • satisfaction_score - Job satisfaction 1-5 (float)
  • work_life_balance - WLB rating 1-5 (integer)
  • overtime - Works overtime (categorical: Yes/No)
  • distance_from_home - Commute distance in km (integer)
  • education_level - Highest education (categorical)
  • performance_rating - Performance score 1-5 (integer)
  • attrition - Left company (target: Yes/No)
Note: You may add additional rows with intentional missing values and outliers to test your preprocessing functions.
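For instance, a couple of illustrative extra rows could be appended like this (row values are made up; only a few columns are shown for brevity, and the omitted columns become NaN on concat, which also exercises your imputation):

```python
import numpy as np
import pandas as pd

# Two made-up rows: E021 has a missing age, E022 has an implausibly high income.
extra = pd.DataFrame([
    {"employee_id": "E021", "age": np.nan, "monthly_income": 70000, "attrition": "No"},
    {"employee_id": "E022", "age": 29, "monthly_income": 900000, "attrition": "Yes"},
])

# df = pd.read_csv("employees.csv")
# df = pd.concat([df, extra], ignore_index=True)
missing_count = int(extra.isna().sum().sum())
```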
04

Requirements

Your ml_pipeline.ipynb must implement ALL of the following functions. Each function is mandatory and will be tested individually.

1
Load and Explore Data

Create a function load_and_explore(filename) that:

  • Loads the CSV file using pandas
  • Returns a dictionary with: shape, dtypes, missing values count, basic statistics
  • Prints a summary of the dataset
def load_and_explore(filename):
    """Load dataset and return exploration summary."""
    # Must return: dict with 'shape', 'dtypes', 'missing', 'stats'
    pass
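The exact contents are up to you, but as a sketch of the required return structure (on a tiny toy frame, not the real dataset):

```python
import pandas as pd

# Tiny illustrative frame to show the summary structure.
df = pd.DataFrame({"age": [32, None, 45], "overtime": ["Yes", "No", "No"]})

summary = {
    "shape": df.shape,                          # (rows, columns)
    "dtypes": df.dtypes.astype(str).to_dict(),  # column -> dtype name
    "missing": df.isna().sum().to_dict(),       # column -> missing count
    "stats": df.describe(include="all"),        # basic statistics
}
```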
2
Visualize Data Distribution

Create a function visualize_distributions(df) that:

  • Creates histograms for numerical columns
  • Creates bar charts for categorical columns
  • Shows class distribution of the target variable
  • Saves plots as eda_plots.png
def visualize_distributions(df):
    """Create and save EDA visualizations."""
    # Must save: eda_plots.png
    pass
3
Handle Missing Values

Create a function handle_missing_values(df, strategy='auto') that:

  • Detects missing values in each column
  • Fills numerical columns with median (or mean based on strategy)
  • Fills categorical columns with mode
  • Returns the cleaned DataFrame and a report of changes
def handle_missing_values(df, strategy='auto'):
    """Handle missing values with specified strategy."""
    # Return: (cleaned_df, missing_report_dict)
    pass
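A minimal median/mode sketch on toy data (the real function should also honor the strategy argument):

```python
import pandas as pd

# Toy frame with one numeric and one categorical gap.
df = pd.DataFrame({
    "age": [25.0, None, 35.0, 45.0],
    "gender": ["Male", None, "Female", "Male"],
})

report = {}
for col in df.columns:
    n_missing = int(df[col].isna().sum())
    if n_missing == 0:
        continue
    if pd.api.types.is_numeric_dtype(df[col]):
        fill = df[col].median()   # robust to outliers
    else:
        fill = df[col].mode()[0]  # most frequent category
    df[col] = df[col].fillna(fill)
    report[col] = {"filled": n_missing, "value": fill}
```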
4
Detect and Handle Outliers

Create a function handle_outliers(df, columns, method='iqr') that:

  • Uses IQR method to detect outliers in specified numerical columns
  • Returns outlier counts per column
  • Optionally caps outliers at threshold values
def handle_outliers(df, columns, method='iqr'):
    """Detect and handle outliers using IQR method."""
    # Return: (cleaned_df, outliers_report_dict)
    pass
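As a refresher, the IQR rule flags values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]. A sketch on a toy income series:

```python
import pandas as pd

# Toy income column; 500000 is the planted outlier.
s = pd.Series([28000, 42000, 45000, 48000, 52000, 55000, 85000, 500000])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = s[(s < lower) | (s > upper)]    # detection
capped = s.clip(lower=lower, upper=upper)  # optional capping
```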
5
Encode Categorical Variables

Create a function encode_categorical(df, columns) that:

  • Uses Label Encoding for binary categorical columns
  • Uses One-Hot Encoding for multi-class categorical columns
  • Returns the encoded DataFrame and encoder objects
def encode_categorical(df, columns):
    """Encode categorical variables appropriately."""
    # Return: (encoded_df, encoders_dict)
    pass
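The binary-vs-multi-class rule might look like this on toy columns (note that LabelEncoder orders classes alphabetically, so "No" maps to 0):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame: 'overtime' is binary, 'department' is multi-class.
df = pd.DataFrame({
    "overtime": ["Yes", "No", "Yes"],
    "department": ["Engineering", "Sales", "HR"],
})

le = LabelEncoder()
df["overtime"] = le.fit_transform(df["overtime"])  # No -> 0, Yes -> 1
df = pd.get_dummies(df, columns=["department"], prefix="department")

encoders = {"overtime": le}  # keep encoders so you can invert or reuse them
```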
6
Scale Numerical Features

Create a function scale_features(df, columns, method='standard') that:

  • Applies StandardScaler or MinMaxScaler based on method
  • Returns scaled DataFrame and scaler object
  • Preserves column names after scaling
def scale_features(df, columns, method='standard'):
    """Scale numerical features using specified method."""
    # Return: (scaled_df, scaler_object)
    pass
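One way to preserve column names: StandardScaler returns a bare NumPy array, so wrap the result back into a DataFrame (toy data shown):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy numeric frame; real code would pass your selected columns.
df = pd.DataFrame({"age": [25, 35, 45], "monthly_income": [45000, 85000, 125000]})

scaler = StandardScaler()
# Wrap the scaled array back into a DataFrame to keep names and index.
scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns, index=df.index)
```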
7
Create Train-Test Split

Create a function split_data(df, target_column, test_size=0.2, stratify=True) that:

  • Separates features (X) and target (y)
  • Performs stratified split to maintain class balance
  • Returns X_train, X_test, y_train, y_test
  • Prints class distribution in both sets
def split_data(df, target_column, test_size=0.2, stratify=True):
    """Split data into train and test sets with stratification."""
    # Return: (X_train, X_test, y_train, y_test)
    pass
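The split itself is a thin wrapper around scikit-learn's train_test_split; a sketch on a toy frame:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame: 4 leavers, 6 stayers.
df = pd.DataFrame({"x": range(10), "attrition": ["Yes"] * 4 + ["No"] * 6})

X = df.drop(columns=["attrition"])
y = df["attrition"]
# stratify=y keeps the Yes/No ratio similar in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```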
8
Create Feature Correlation Matrix

Create a function get_correlation_matrix(df, threshold=0.7) that:

  • Calculates correlation matrix for numerical features
  • Identifies highly correlated feature pairs above threshold
  • Saves correlation heatmap as correlation_matrix.png
  • Returns list of highly correlated pairs
def get_correlation_matrix(df, threshold=0.7):
    """Generate correlation matrix and identify high correlations."""
    # Return: list of (col1, col2, correlation) tuples
    pass
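A sketch of the pair-finding logic on a toy frame (the heatmap step is omitted here):

```python
import pandas as pd

# Toy frame: 'b' is exactly 2*'a', so they are perfectly correlated.
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [4, 3, 1, 1]})

corr = df.corr()
threshold = 0.7
# Walk the upper triangle only, so each pair appears once.
pairs = [
    (c1, c2, corr.loc[c1, c2])
    for i, c1 in enumerate(corr.columns)
    for c2 in corr.columns[i + 1:]
    if abs(corr.loc[c1, c2]) > threshold
]
```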
9
Calculate Evaluation Metrics

Create a function calculate_metrics(y_true, y_pred) that:

  • Calculates accuracy, precision, recall, and F1-score
  • Generates confusion matrix
  • Returns a dictionary with all metrics
def calculate_metrics(y_true, y_pred):
    """Calculate classification evaluation metrics."""
    # Return: dict with 'accuracy', 'precision', 'recall', 'f1', 'confusion_matrix'
    pass
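These all come straight from sklearn.metrics; a sketch on made-up labels:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Made-up labels: 1 = left the company, 0 = stayed.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
    "confusion_matrix": confusion_matrix(y_true, y_pred),
}
```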
10
Train Baseline Model

Create a function train_baseline(X_train, X_test, y_train, y_test) that:

  • Trains a simple Logistic Regression as baseline
  • Makes predictions on test set
  • Returns model and predictions
def train_baseline(X_train, X_test, y_train, y_test):
    """Train a baseline logistic regression model."""
    # Return: (model, y_pred)
    pass
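A sketch using synthetic data in place of your preprocessed split:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed attrition data.
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = LogisticRegression(max_iter=1000)  # raise max_iter so small data converges
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
```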
11
Save Preprocessed Data

Create a function save_processed_data(X_train, X_test, y_train, y_test, prefix='processed') that:

  • Saves train and test sets as CSV files
  • Creates: processed_X_train.csv, processed_X_test.csv, etc.
  • Returns list of saved file paths
def save_processed_data(X_train, X_test, y_train, y_test, prefix='processed'):
    """Save preprocessed train and test data to CSV files."""
    # Return: list of saved file paths
    pass
12
Main Pipeline

Create a main() function that:

  • Runs the complete pipeline from loading to evaluation
  • Prints summary statistics at each step
  • Generates all required output files
  • Prints final baseline model metrics
import pandas as pd

def main():
    # 1. Load and explore data
    df = pd.read_csv("employees.csv")
    exploration = load_and_explore("employees.csv")
    
    # 2. Visualize distributions
    visualize_distributions(df)
    
    # 3. Handle missing values
    df_clean, missing_report = handle_missing_values(df)
    
    # 4. Handle outliers
    numerical_cols = ['age', 'monthly_income', 'distance_from_home']
    df_clean, outlier_report = handle_outliers(df_clean, numerical_cols)
    
    # 5. Encode categorical variables (job_role is categorical too;
    # employee_id is an identifier, not a feature, so drop it)
    df_clean = df_clean.drop(columns=['employee_id'])
    cat_cols = ['gender', 'department', 'job_role', 'overtime', 'education_level']
    df_encoded, encoders = encode_categorical(df_clean, cat_cols)
    
    # 6. Scale features
    scale_cols = ['age', 'years_at_company', 'monthly_income', 'distance_from_home']
    df_scaled, scaler = scale_features(df_encoded, scale_cols)
    
    # 7. Split data
    X_train, X_test, y_train, y_test = split_data(df_scaled, 'attrition')
    
    # 8. Get correlations
    correlations = get_correlation_matrix(df_scaled)
    
    # 9. Train baseline model
    model, y_pred = train_baseline(X_train, X_test, y_train, y_test)
    
    # 10. Calculate metrics
    metrics = calculate_metrics(y_test, y_pred)
    print(f"Baseline Model Metrics: {metrics}")
    
    # 11. Save processed data
    save_processed_data(X_train, X_test, y_train, y_test)
    
    print("ML Pipeline Complete!")

if __name__ == "__main__":
    main()
05

Submission

Create a public GitHub repository with the exact name shown below:

Required Repository Name
employee-attrition-pipeline
github.com/<your-username>/employee-attrition-pipeline
Required Files
employee-attrition-pipeline/
├── ml_pipeline.ipynb         # Your Jupyter Notebook with ALL 12 functions
├── employees.csv             # Input dataset (as provided or extended)
├── eda_plots.png             # Generated EDA visualizations
├── correlation_matrix.png    # Generated correlation heatmap
├── processed_X_train.csv     # Preprocessed training features
├── processed_X_test.csv      # Preprocessed test features
├── processed_y_train.csv     # Training labels
├── processed_y_test.csv      # Test labels
└── README.md                 # REQUIRED - see contents below
README.md Must Include:
  • Your full name and submission date
  • Brief description of your preprocessing approach
  • Summary of baseline model metrics achieved
  • Any challenges faced and how you solved them
  • Instructions to run your notebook
Do Include
  • All 12 functions implemented and working
  • Docstrings for every function
  • Comments explaining preprocessing decisions
  • All output files from running your notebook
  • Visualizations with proper labels and titles
  • README.md with all required sections
Do Not Include
  • Any .pyc or __pycache__ files (use .gitignore)
  • Virtual environment folders
  • Large model files (just the code)
  • Code that doesn't run without errors
  • Hardcoded paths (use relative paths)
Important: Before submitting, run all cells in your notebook to make sure it executes without errors and generates all output files correctly!
Submit Your Assignment

Enter your GitHub username - we'll verify your repository automatically

06

Grading Rubric

Your assignment will be graded on the following criteria:

Criteria                | Points | Description
Data Loading & EDA      | 20     | Correct data loading, exploration summary, and meaningful visualizations
Preprocessing Functions | 40     | Proper handling of missing values, outliers, encoding, and scaling
Train-Test Split        | 15     | Correct stratified splitting with proper feature-target separation
Evaluation Metrics      | 25     | Accurate calculation of all classification metrics and confusion matrix
Baseline Model          | 20     | Working baseline model with predictions and proper evaluation
Code Quality            | 30     | Docstrings, comments, naming conventions, and clean organization
Total                   | 150    |

Ready to Submit?

Make sure you have completed all requirements and reviewed the grading rubric above.

07

What You Will Practice

Data Exploration (1.1)

Understanding data structure, distributions, identifying data quality issues

Data Preprocessing (1.2)

Handling missing values, outliers, encoding categorical variables, feature scaling

Data Splitting (1.3)

Proper train-test splitting with stratification to prevent data leakage

Model Evaluation

Understanding accuracy, precision, recall, F1-score, and confusion matrix

08

Pro Tips

Preprocessing Best Practices
  • Always explore before preprocessing
  • Scale AFTER splitting to prevent data leakage
  • Use fit_transform on train, transform on test
  • Document your preprocessing decisions
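The second and third tips together look like this in code (toy data):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.arange(20, dtype=float).reshape(-1, 1)  # toy feature matrix
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics learned from train only
X_test_scaled = scaler.transform(X_test)        # reuse them; the test set stays unseen
```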
Code Quality
  • Write modular, reusable functions
  • Add docstrings explaining parameters
  • Handle edge cases (empty data, all NaN)
  • Return meaningful results from functions
Evaluation Tips
  • Accuracy can be misleading for imbalanced data
  • Use precision for minimizing false positives
  • Use recall for minimizing false negatives
  • F1-score balances precision and recall
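A quick illustration of the first tip, using a made-up imbalanced sample:

```python
from sklearn.metrics import accuracy_score, recall_score

# 95 stayers, 5 leavers; a useless model that always predicts "stayed" (0).
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

acc = accuracy_score(y_true, y_pred)  # looks impressive
rec = recall_score(y_true, y_pred)    # catches none of the leavers
```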
Common Mistakes
  • Scaling before splitting (data leakage!)
  • Not using stratified split for classification
  • Forgetting to handle the target variable
  • Using wrong encoder type for columns
09

Pre-Submission Checklist

Code Requirements
Repository Requirements