Assignment Overview
In this assignment, you will build a complete ML Data Pipeline for predicting employee attrition. This comprehensive project requires you to apply ALL concepts from Module 1: understanding ML types, data exploration, preprocessing techniques, feature engineering, train-test splitting, and implementing basic evaluation metrics.
You will use pandas, numpy, matplotlib, seaborn, and scikit-learn for this assignment.
ML Concepts (1.1)
Supervised vs unsupervised, classification vs regression, model types
Data Preprocessing (1.2)
Missing values, encoding, scaling, outlier handling
ML Workflow (1.3)
Train-test split, cross-validation, evaluation metrics
The Scenario
TechCorp HR Analytics
You have been hired as a Junior Machine Learning Engineer at TechCorp Industries, a technology company facing high employee turnover. The HR director has given you this task:
"We have historical employee data including demographics, job details, and satisfaction scores. We need you to build a data pipeline that prepares this data for an attrition prediction model. Focus on proper preprocessing - the model accuracy depends on clean, well-prepared data!"
Your Task
Create a Jupyter Notebook called ml_pipeline.ipynb that implements a complete
ML data preprocessing and evaluation pipeline. Your code must load the dataset, perform
exploratory data analysis, preprocess features, split data properly, and calculate
evaluation metrics on a baseline model.
The Dataset
You will work with an Employee Attrition dataset. Create this CSV file as shown below:
File: employees.csv (Employee Data)
employee_id,age,gender,department,job_role,years_at_company,monthly_income,satisfaction_score,work_life_balance,overtime,distance_from_home,education_level,performance_rating,attrition
E001,32,Male,Engineering,Software Engineer,5,85000,4.2,3,Yes,12,Bachelor,4,No
E002,28,Female,Sales,Sales Rep,2,45000,2.8,2,Yes,25,Bachelor,3,Yes
E003,45,Male,HR,HR Manager,12,95000,4.5,4,No,5,Master,5,No
E004,35,Female,Engineering,Data Scientist,3,92000,3.5,3,Yes,18,PhD,4,No
E005,26,Male,Marketing,Marketing Analyst,1,48000,2.2,2,Yes,30,Bachelor,2,Yes
E006,41,Female,Finance,Financial Analyst,8,78000,4.0,4,No,8,Master,4,No
E007,29,Male,Engineering,DevOps Engineer,4,82000,3.8,3,No,15,Bachelor,4,No
E008,38,Female,Sales,Sales Manager,7,88000,3.2,2,Yes,22,Master,3,Yes
E009,24,Male,Marketing,Content Writer,1,42000,2.5,2,Yes,35,Bachelor,3,Yes
E010,52,Female,HR,HR Director,15,125000,4.8,5,No,3,Master,5,No
E011,31,Male,Engineering,Frontend Dev,3,75000,3.0,3,Yes,20,Bachelor,3,No
E012,27,Female,Finance,Accountant,2,55000,2.9,2,Yes,28,Bachelor,3,Yes
E013,44,Male,Engineering,Tech Lead,10,115000,4.4,4,No,7,Master,5,No
E014,33,Female,Sales,Sales Rep,4,52000,3.3,3,No,16,Bachelor,3,No
E015,36,Male,Marketing,Marketing Manager,6,85000,4.1,4,No,10,Master,4,No
E016,25,Female,Engineering,Junior Dev,1,55000,2.6,2,Yes,32,Bachelor,2,Yes
E017,48,Male,Finance,CFO,14,150000,4.7,5,No,4,MBA,5,No
E018,30,Female,HR,Recruiter,3,52000,3.4,3,Yes,19,Bachelor,3,No
E019,34,Male,Engineering,Backend Dev,5,88000,3.9,3,No,14,Master,4,No
E020,22,Female,Marketing,Intern,0,28000,2.0,1,Yes,40,High School,2,Yes
Columns Explained
- employee_id - Unique identifier (string)
- age - Employee age (integer)
- gender - Gender (categorical: Male/Female)
- department - Department name (categorical)
- job_role - Job title (categorical)
- years_at_company - Tenure in years (integer)
- monthly_income - Monthly salary (integer)
- satisfaction_score - Job satisfaction 1-5 (float)
- work_life_balance - WLB rating 1-5 (integer)
- overtime - Works overtime (categorical: Yes/No)
- distance_from_home - Commute distance in km (integer)
- education_level - Highest education (categorical)
- performance_rating - Performance score 1-5 (integer)
- attrition - Left company (target: Yes/No)
Requirements
Your ml_pipeline.ipynb must implement ALL of the following functions.
Each function is mandatory and will be tested individually.
Load and Explore Data
Create a function load_and_explore(filename) that:
- Loads the CSV file using pandas
- Returns a dictionary with: shape, dtypes, missing values count, basic statistics
- Prints a summary of the dataset
def load_and_explore(filename):
    """Load dataset and return exploration summary."""
    # Must return: dict with 'shape', 'dtypes', 'missing', 'stats'
    pass
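One possible shape for this function (a sketch, not the only valid implementation) keyed to the four required dictionary entries:

```python
import pandas as pd

def load_and_explore(filename):
    """Load dataset and return exploration summary."""
    df = pd.read_csv(filename)
    summary = {
        "shape": df.shape,                          # (rows, columns)
        "dtypes": df.dtypes.astype(str).to_dict(),  # column -> dtype name
        "missing": df.isna().sum().to_dict(),       # column -> NaN count
        "stats": df.describe(),                     # numeric summary statistics
    }
    print(f"Loaded {df.shape[0]} rows x {df.shape[1]} columns")
    print("Columns with missing values:",
          [c for c, n in summary["missing"].items() if n > 0])
    return summary
```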
Visualize Data Distributions
Create a function visualize_distributions(df) that:
- Creates histograms for numerical columns
- Creates bar charts for categorical columns
- Shows class distribution of the target variable
- Saves plots as eda_plots.png
def visualize_distributions(df):
    """Create and save EDA visualizations."""
    # Must save: eda_plots.png
    pass
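A minimal sketch using matplotlib alone (seaborn works equally well); it treats every non-numeric column, including the attrition target, as categorical:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so no display is needed
import matplotlib.pyplot as plt

def visualize_distributions(df):
    """Create and save EDA visualizations to eda_plots.png."""
    num_cols = list(df.select_dtypes(include="number").columns)
    cat_cols = list(df.select_dtypes(exclude="number").columns)
    cols = num_cols + cat_cols
    nrows = (len(cols) + 2) // 3
    fig, axes = plt.subplots(nrows, 3, figsize=(15, 3.5 * nrows))
    axes = axes.ravel()
    for ax, col in zip(axes, cols):
        if col in num_cols:
            ax.hist(df[col], bins=10)               # numeric: histogram
        else:
            df[col].value_counts().plot.bar(ax=ax)  # categorical (incl. target): bar chart
        ax.set_title(col)
    for ax in axes[len(cols):]:
        ax.set_visible(False)                       # hide unused panels
    fig.tight_layout()
    fig.savefig("eda_plots.png")
    plt.close(fig)
```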
Handle Missing Values
Create a function handle_missing_values(df, strategy='auto') that:
- Detects missing values in each column
- Fills numerical columns with median (or mean based on strategy)
- Fills categorical columns with mode
- Returns the cleaned DataFrame and a report of changes
def handle_missing_values(df, strategy='auto'):
    """Handle missing values with specified strategy."""
    # Return: (cleaned_df, missing_report_dict)
    pass
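One way to meet the spec (median by default, mean when requested, mode for categoricals):

```python
import pandas as pd

def handle_missing_values(df, strategy='auto'):
    """Handle missing values with specified strategy."""
    df = df.copy()
    report = {}
    for col in df.columns:
        n_missing = int(df[col].isna().sum())
        if n_missing == 0:
            continue
        if pd.api.types.is_numeric_dtype(df[col]):
            # 'auto' and 'median' both fall back to the median
            fill = df[col].mean() if strategy == 'mean' else df[col].median()
        else:
            fill = df[col].mode().iloc[0]  # most frequent category
        df[col] = df[col].fillna(fill)
        report[col] = {"filled": n_missing, "value": fill}
    return df, report
```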
Detect and Handle Outliers
Create a function handle_outliers(df, columns, method='iqr') that:
- Uses IQR method to detect outliers in specified numerical columns
- Returns outlier counts per column
- Optionally caps outliers at threshold values
def handle_outliers(df, columns, method='iqr'):
    """Detect and handle outliers using IQR method."""
    # Return: (cleaned_df, outliers_report_dict)
    pass
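A sketch of the standard 1.5×IQR rule, capping values at the fences rather than dropping rows:

```python
import pandas as pd

def handle_outliers(df, columns, method='iqr'):
    """Detect and handle outliers using the 1.5*IQR rule."""
    df = df.copy()
    report = {}
    for col in columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        mask = (df[col] < lower) | (df[col] > upper)
        report[col] = int(mask.sum())       # outlier count per column
        df[col] = df[col].clip(lower, upper)  # cap at the fences
    return df, report
```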
Encode Categorical Variables
Create a function encode_categorical(df, columns) that:
- Uses Label Encoding for binary categorical columns
- Uses One-Hot Encoding for multi-class categorical columns
- Returns the encoded DataFrame and encoder objects
def encode_categorical(df, columns):
    """Encode categorical variables appropriately."""
    # Return: (encoded_df, encoders_dict)
    pass
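A possible implementation; note that LabelEncoder assigns codes alphabetically, so No becomes 0 and Yes becomes 1:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def encode_categorical(df, columns):
    """Label-encode binary columns, one-hot encode multi-class columns."""
    df = df.copy()
    encoders = {}
    for col in columns:
        if df[col].nunique() == 2:
            le = LabelEncoder()
            df[col] = le.fit_transform(df[col])
            encoders[col] = le
        else:
            dummies = pd.get_dummies(df[col], prefix=col, dtype=int)
            df = pd.concat([df.drop(columns=col), dummies], axis=1)
            encoders[col] = list(dummies.columns)  # record the new columns
    return df, encoders
```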
Scale Numerical Features
Create a function scale_features(df, columns, method='standard') that:
- Applies StandardScaler or MinMaxScaler based on method
- Returns scaled DataFrame and scaler object
- Preserves column names after scaling
def scale_features(df, columns, method='standard'):
    """Scale numerical features using specified method."""
    # Return: (scaled_df, scaler_object)
    pass
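A minimal sketch; it fits on whatever frame it receives, so in a leakage-safe pipeline you would fit on the training split only (see Pro Tips):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

def scale_features(df, columns, method='standard'):
    """Scale the given columns, preserving column names."""
    df = df.copy()
    scaler = StandardScaler() if method == 'standard' else MinMaxScaler()
    # Assigning back into the same columns keeps the original names
    df[columns] = scaler.fit_transform(df[columns])
    return df, scaler
```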
Create Train-Test Split
Create a function split_data(df, target_column, test_size=0.2, stratify=True) that:
- Separates features (X) and target (y)
- Performs stratified split to maintain class balance
- Returns X_train, X_test, y_train, y_test
- Prints class distribution in both sets
def split_data(df, target_column, test_size=0.2, stratify=True):
    """Split data into train and test sets with stratification."""
    # Return: (X_train, X_test, y_train, y_test)
    pass
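One possible implementation; the fixed random_state is an assumption added for reproducibility:

```python
from sklearn.model_selection import train_test_split

def split_data(df, target_column, test_size=0.2, stratify=True):
    """Split data into train and test sets with stratification."""
    X = df.drop(columns=target_column)
    y = df[target_column]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=42,
        stratify=y if stratify else None)  # preserve class proportions
    print("Train class distribution:\n", y_train.value_counts(normalize=True))
    print("Test class distribution:\n", y_test.value_counts(normalize=True))
    return X_train, X_test, y_train, y_test
```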
Create Feature Correlation Matrix
Create a function get_correlation_matrix(df, threshold=0.7) that:
- Calculates correlation matrix for numerical features
- Identifies highly correlated feature pairs above threshold
- Saves correlation heatmap as correlation_matrix.png
- Returns list of highly correlated pairs
def get_correlation_matrix(df, threshold=0.7):
    """Generate correlation matrix and identify high correlations."""
    # Return: list of (col1, col2, correlation) tuples
    pass
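A sketch drawing the heatmap with plain matplotlib (seaborn's heatmap is a drop-in alternative):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

def get_correlation_matrix(df, threshold=0.7):
    """Generate correlation matrix and identify high correlations."""
    corr = df.select_dtypes(include="number").corr()
    fig, ax = plt.subplots(figsize=(8, 6))
    im = ax.imshow(corr.values, cmap="coolwarm", vmin=-1, vmax=1)
    ax.set_xticks(range(len(corr.columns)))
    ax.set_xticklabels(corr.columns, rotation=90)
    ax.set_yticks(range(len(corr.columns)))
    ax.set_yticklabels(corr.columns)
    fig.colorbar(im, ax=ax)
    fig.tight_layout()
    fig.savefig("correlation_matrix.png")
    plt.close(fig)
    # Collect upper-triangle pairs above the threshold
    pairs = []
    cols = list(corr.columns)
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if abs(corr.iloc[i, j]) > threshold:
                pairs.append((cols[i], cols[j], round(float(corr.iloc[i, j]), 3)))
    return pairs
```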
Calculate Evaluation Metrics
Create a function calculate_metrics(y_true, y_pred) that:
- Calculates accuracy, precision, recall, and F1-score
- Generates confusion matrix
- Returns a dictionary with all metrics
def calculate_metrics(y_true, y_pred):
    """Calculate classification evaluation metrics."""
    # Return: dict with 'accuracy', 'precision', 'recall', 'f1', 'confusion_matrix'
    pass
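A sketch built on scikit-learn's metric functions; pos_label="Yes" is an assumption based on the string labels in the attrition column (drop it if labels are already 0/1):

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

def calculate_metrics(y_true, y_pred):
    """Calculate classification evaluation metrics."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, pos_label="Yes"),
        "recall": recall_score(y_true, y_pred, pos_label="Yes"),
        "f1": f1_score(y_true, y_pred, pos_label="Yes"),
        "confusion_matrix": confusion_matrix(y_true, y_pred),
    }
```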
Train Baseline Model
Create a function train_baseline(X_train, X_test, y_train, y_test) that:
- Trains a simple Logistic Regression as baseline
- Makes predictions on test set
- Returns model and predictions
def train_baseline(X_train, X_test, y_train, y_test):
    """Train a baseline logistic regression model."""
    # Return: (model, y_pred)
    pass
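The baseline can be as simple as:

```python
from sklearn.linear_model import LogisticRegression

def train_baseline(X_train, X_test, y_train, y_test):
    """Train a baseline logistic regression model."""
    # max_iter raised from the default 100 to help convergence on scaled data
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return model, y_pred
```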
Save Preprocessed Data
Create a function save_processed_data(X_train, X_test, y_train, y_test, prefix='processed') that:
- Saves train and test sets as CSV files
- Creates processed_X_train.csv, processed_X_test.csv, etc.
- Returns list of saved file paths
def save_processed_data(X_train, X_test, y_train, y_test, prefix='processed'):
    """Save preprocessed train and test data to CSV files."""
    # Return: list of saved file paths
    pass
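A compact sketch that builds the file names from the prefix:

```python
def save_processed_data(X_train, X_test, y_train, y_test, prefix='processed'):
    """Save preprocessed train and test data to CSV files."""
    parts = {"X_train": X_train, "X_test": X_test,
             "y_train": y_train, "y_test": y_test}
    paths = []
    for name, data in parts.items():
        path = f"{prefix}_{name}.csv"   # e.g. processed_X_train.csv
        data.to_csv(path, index=False)
        paths.append(path)
    return paths
```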
Main Pipeline
Create a main() function that:
- Runs the complete pipeline from loading to evaluation
- Prints summary statistics at each step
- Generates all required output files
- Prints final baseline model metrics
def main():
    # 1. Load and explore data
    exploration = load_and_explore("employees.csv")
    df = pd.read_csv("employees.csv")
    # Drop the identifier: it is unique per row and has no predictive value
    df = df.drop(columns=['employee_id'])
    # 2. Visualize distributions
    visualize_distributions(df)
    # 3. Handle missing values
    df_clean, missing_report = handle_missing_values(df)
    # 4. Handle outliers
    numerical_cols = ['age', 'monthly_income', 'distance_from_home']
    df_clean, outlier_report = handle_outliers(df_clean, numerical_cols)
    # 5. Encode categorical variables (job_role included so no string features remain)
    cat_cols = ['gender', 'department', 'job_role', 'overtime', 'education_level']
    df_encoded, encoders = encode_categorical(df_clean, cat_cols)
    # 6. Scale features (for simplicity the whole frame is scaled here;
    #    to fully avoid leakage, fit the scaler on the training split only)
    scale_cols = ['age', 'years_at_company', 'monthly_income', 'distance_from_home']
    df_scaled, scaler = scale_features(df_encoded, scale_cols)
    # 7. Split data
    X_train, X_test, y_train, y_test = split_data(df_scaled, 'attrition')
    # 8. Get correlations
    correlations = get_correlation_matrix(df_scaled)
    # 9. Train baseline model
    model, y_pred = train_baseline(X_train, X_test, y_train, y_test)
    # 10. Calculate metrics
    metrics = calculate_metrics(y_test, y_pred)
    print(f"Baseline Model Metrics: {metrics}")
    # 11. Save processed data
    save_processed_data(X_train, X_test, y_train, y_test)
    print("ML Pipeline Complete!")

if __name__ == "__main__":
    main()
Submission
Create a public GitHub repository with the exact name shown below:
Required Repository Name
employee-attrition-pipeline
Required Files
employee-attrition-pipeline/
├── ml_pipeline.ipynb # Your Jupyter Notebook with ALL 12 functions
├── employees.csv # Input dataset (as provided or extended)
├── eda_plots.png # Generated EDA visualizations
├── correlation_matrix.png # Generated correlation heatmap
├── processed_X_train.csv # Preprocessed training features
├── processed_X_test.csv # Preprocessed test features
├── processed_y_train.csv # Training labels
├── processed_y_test.csv # Test labels
└── README.md # REQUIRED - see contents below
README.md Must Include:
- Your full name and submission date
- Brief description of your preprocessing approach
- Summary of baseline model metrics achieved
- Any challenges faced and how you solved them
- Instructions to run your notebook
Do Include
- All 12 functions implemented and working
- Docstrings for every function
- Comments explaining preprocessing decisions
- All output files from running your notebook
- Visualizations with proper labels and titles
- README.md with all required sections
Do Not Include
- Any .pyc or __pycache__ files (use .gitignore)
- Virtual environment folders
- Large model files (just the code)
- Code that doesn't run without errors
- Hardcoded paths (use relative paths)
Enter your GitHub username - we'll verify your repository automatically
Grading Rubric
Your assignment will be graded on the following criteria:
| Criteria | Points | Description |
|---|---|---|
| Data Loading & EDA | 20 | Correct data loading, exploration summary, and meaningful visualizations |
| Preprocessing Functions | 40 | Proper handling of missing values, outliers, encoding, and scaling |
| Train-Test Split | 15 | Correct stratified splitting with proper feature-target separation |
| Evaluation Metrics | 25 | Accurate calculation of all classification metrics and confusion matrix |
| Baseline Model | 20 | Working baseline model with predictions and proper evaluation |
| Code Quality | 30 | Docstrings, comments, naming conventions, and clean organization |
| Total | 150 |
Ready to Submit?
Make sure you have completed all requirements and reviewed the grading rubric above.
What You Will Practice
Data Exploration (1.1)
Understanding data structure, distributions, identifying data quality issues
Data Preprocessing (1.2)
Handling missing values, outliers, encoding categorical variables, feature scaling
Data Splitting (1.3)
Proper train-test splitting with stratification to prevent data leakage
Model Evaluation
Understanding accuracy, precision, recall, F1-score, and confusion matrix
Pro Tips
Preprocessing Best Practices
- Always explore before preprocessing
- Scale AFTER splitting to prevent data leakage
- Use fit_transform on train, transform on test
- Document your preprocessing decisions
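The fit-on-train, transform-on-test rule from the tips above can be sketched as follows (synthetic data for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data: 10 samples, 2 features, balanced binary target
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0, 1] * 5)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)  # learn mean/std from train only
X_test_s = scaler.transform(X_test)        # reuse the train statistics
```

Because the scaler never sees the test rows, no information leaks from test to train; the test set is transformed with statistics learned elsewhere, exactly as unseen production data would be.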
Code Quality
- Write modular, reusable functions
- Add docstrings explaining parameters
- Handle edge cases (empty data, all NaN)
- Return meaningful results from functions
Evaluation Tips
- Accuracy can be misleading for imbalanced data
- Use precision for minimizing false positives
- Use recall for minimizing false negatives
- F1-score balances precision and recall
Common Mistakes
- Scaling before splitting (data leakage!)
- Not using stratified split for classification
- Forgetting to handle the target variable
- Using wrong encoder type for columns