The ML Workflow Overview
Every successful ML project follows a structured workflow. Understanding this process is crucial before diving into algorithms. Let's explore the 7 key steps that transform raw data into production-ready ML solutions.
Problem Definition
Define the business problem, target variable, and success metrics clearly.
Data Collection
Gather data from databases, APIs, files, and external sources.
Data Preparation
Clean, handle missing values, remove duplicates, and split data.
Feature Engineering
Create, transform, scale, and select the most predictive features.
Model Training
Train algorithms, tune hyperparameters, and use cross-validation.
Evaluation
Measure performance with metrics, confusion matrix, and validation.
Deployment
Deploy to production, create APIs, monitor, and plan retraining.
Problem Definition (Step 1 of 7)
Define the business problem clearly. What are you trying to predict? What does success look like? This step sets the foundation for everything else.
Key Activities
- Stakeholder interviews
- Define target variable
- Set success metrics
- Assess feasibility
Common Pitfalls
Skipping this step, vague objectives, not involving stakeholders, choosing wrong success metrics.
Pro Tips
Start with a simple baseline. Define what "good enough" looks like before building anything.
Step 1: Problem Definition
Before writing any code, you must clearly understand the problem. A poorly defined problem leads to wasted effort and failed projects.
Questions to Ask
- What exactly are we trying to predict or classify?
- Is this a classification, regression, or clustering problem?
- What data do we have available?
- How will success be measured?
- What are the business constraints (time, accuracy, interpretability)?
Example Problems
- Classification: Will this customer churn? (Yes/No)
- Regression: What will be the house price? ($X)
- Clustering: What customer segments exist?
- Ranking: Which products should we recommend?
- Anomaly: Is this transaction fraudulent?
Problem Statement Template
# Problem Statement Template
problem = {
    "objective": "Predict customer churn within 30 days",
    "type": "Binary Classification",
    "target_variable": "churned (0 or 1)",
    "success_metric": "F1-score >= 0.85",
    "constraints": {
        "latency": "< 100ms per prediction",
        "interpretability": "Must explain top 3 factors",
        "refresh": "Retrain weekly"
    }
}
Template Fields Explained
objective
Clear, measurable goal statement. What exactly are you predicting and within what timeframe?
type
ML problem category: Classification (Binary/Multi), Regression, Clustering, Ranking, or Anomaly Detection.
target_variable
The column/feature you're predicting, including its data type and possible values (0/1, continuous, etc.).
success_metric
How you'll measure success. Always include metric name + threshold.
constraints
Real-world limitations that affect model choices, such as latency budgets, interpretability requirements, and retraining cadence.
Step 2: Data Collection
Data is the fuel for ML. The quality and quantity of your data directly impacts model performance. Garbage in, garbage out!
Internal Sources
- Company databases
- CRM systems
- Transaction logs
- User behavior data
- Sensor/IoT data
External Sources
- Public APIs
- Open datasets (Kaggle, UCI)
- Government data
- Third-party providers
- Web scraping
Considerations
- Data privacy (GDPR, CCPA)
- Data quality issues
- Sampling bias
- Licensing restrictions
- Freshness/timeliness
Loading Data with pandas
import pandas as pd
# From CSV file
df = pd.read_csv('customers.csv')
# From Excel file
df = pd.read_excel('sales_data.xlsx', sheet_name='2024')
# From SQL database
import sqlite3
conn = sqlite3.connect('database.db')
df = pd.read_sql('SELECT * FROM customers', conn)
# From API (JSON)
import requests
response = requests.get('https://api.example.com/data')
df = pd.DataFrame(response.json())
# Quick data overview
print(f"Shape: {df.shape}") # (rows, columns)
print(f"Columns: {df.columns.tolist()}")
df.head() # First 5 rows
pandas Data Loading Methods
read_csv()
Most common format. Comma-separated values.
read_excel()
Excel workbooks. Specify sheet_name for multi-sheet files.
read_sql()
Query databases directly. Requires connection object.
DataFrame()
From JSON/API responses. Convert dict/list to DataFrame.
Quick Data Overview Methods
df.shape
→ (rows, columns)
df.head()
→ First 5 rows
df.info()
→ Data types & nulls
df.describe()
→ Statistics
Run df.info() and df.describe() immediately after loading.
This reveals data types, missing values, and statistical distributions at a glance.
Step 3: Data Preparation
This is where you spend 60-80% of your time! Real-world data is messy. You need to clean, transform, and prepare it before feeding it to ML algorithms.
Common Data Issues
Missing Values
# Check for missing values
print(df.isnull().sum())
# Drop rows with missing values
df_clean = df.dropna()
# Fill with mean/median (avoid inplace=True; it is deprecated in modern pandas)
df['age'] = df['age'].fillna(df['age'].median())
# Fill with most frequent (mode)
df['category'] = df['category'].fillna(df['category'].mode()[0])
# Forward fill (time series); fillna(method='ffill') is deprecated
df['price'] = df['price'].ffill()
Methods Explained
dropna()
Remove rows with any missing values. Use when missing data is minimal.
fillna(median)
Replace with median. Robust to outliers for numerical data.
fillna(mode)
Replace with most frequent value. Best for categorical columns.
ffill / bfill
Forward/backward fill. Use for time series data.
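Beyond fillna, scikit-learn's SimpleImputer learns the fill value from data, so the same statistic can later be reused on test data. A minimal sketch with a made-up `age` column:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Tiny frame with one gap (fabricated values for illustration)
df = pd.DataFrame({"age": [25.0, np.nan, 40.0, 35.0]})

# The imputer learns the median of the observed values (35.0)
imputer = SimpleImputer(strategy="median")
df["age"] = imputer.fit_transform(df[["age"]]).ravel()

print(df["age"].tolist())  # [25.0, 35.0, 40.0, 35.0]
```

Because the imputer stores the learned median, calling `transform` on new data fills with the *training* median, which avoids test-set leakage.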
Duplicates & Outliers
# Remove duplicates
df = df.drop_duplicates()
# Find duplicates
duplicates = df[df.duplicated()]
print(f"Found {len(duplicates)} duplicates")
# Detect outliers using IQR
Q1 = df['price'].quantile(0.25)
Q3 = df['price'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['price'] < Q1 - 1.5 * IQR) |
              (df['price'] > Q3 + 1.5 * IQR)]
# Remove outliers
df = df[~df.index.isin(outliers.index)]
Methods Explained
drop_duplicates()
Removes exact duplicate rows. Keeps first occurrence by default.
duplicated()
Returns boolean mask. True for duplicate rows.
IQR Method
Interquartile Range: Values outside Q1 - 1.5×IQR to Q3 + 1.5×IQR are considered outliers. This method is robust and doesn't assume normal distribution.
Train-Test Split
Splitting your dataset is one of the most critical steps in ML. You need to evaluate your model on data it has never seen before to get an honest estimate of real-world performance.
The training set is used to teach your model patterns. The test set is kept completely separate and only used at the very end to evaluate how well your model generalizes to new, unseen data.
Always split BEFORE preprocessing! If you scale, encode, or impute on the full dataset first, information from the test set leaks into training, giving overly optimistic results that won't hold in production.
from sklearn.model_selection import train_test_split
# Features (X) and target (y)
X = df.drop('target', axis=1)
y = df['target']
# Split: 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,     # 20% for testing
    random_state=42,   # Reproducibility
    stratify=y         # Keep class proportions (for classification)
)
print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")
train_test_split() Parameters
X, y
X = Features (input columns). y = Target (what you predict). Separate them before splitting.
test_size
Fraction for testing (0.2 = 20%). Common values: 0.2, 0.25, 0.3. Larger datasets can use smaller test sizes.
random_state
Seed for reproducibility. Same number = same split every time. Use 42, 0, or any integer.
stratify
Preserves class proportions. If 30% positive in original, both splits have ~30%. Critical for imbalanced data!
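To see what stratify does, here is a minimal sketch with a fabricated imbalanced target (90 negatives, 10 positives): both splits keep the 10% positive rate exactly.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Fabricated imbalanced target: 90 negatives, 10 positives
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Both splits preserve the original 10% positive rate
print(y_tr.mean(), y_te.mean())  # 0.1 0.1
```

Without `stratify=y`, a random 20% test set could easily end up with 0 or 4 positives, distorting evaluation on rare classes.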
Visual: 80/20 Split
Step 4: Feature Engineering
Feature engineering is the art of creating and selecting the right input variables. Good features can make a simple model outperform a complex one!
Scaling Features
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# StandardScaler: mean=0, std=1
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test) # Use same scaler!
# MinMaxScaler: range [0, 1]
minmax = MinMaxScaler()
X_train_norm = minmax.fit_transform(X_train)
X_test_norm = minmax.transform(X_test)
Why Scale Features?
Many ML algorithms (like KNN, SVM, Neural Networks) are sensitive to the magnitude of features. If "age" ranges 0-100 and "income" ranges 0-1,000,000, the model will think income is more important just because it has bigger numbers!
StandardScaler
Transforms to mean=0, std=1. A good default for most algorithms, but extreme outliers still skew the learned mean and std (consider RobustScaler for heavy outliers).
MinMaxScaler
Transforms to range [0, 1]. Good for neural networks. Sensitive to outliers.
- fit_transform(X_train): learn parameters (mean, std) from the training data AND transform it
- transform(X_test): apply the SAME learned parameters to the test data. Never fit on test!
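A minimal sketch on a toy column makes the two transforms concrete:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

ages = np.array([[20.0], [30.0], [40.0]])  # toy feature column

standardized = StandardScaler().fit_transform(ages)
normalized = MinMaxScaler().fit_transform(ages)

print(standardized.ravel())  # approx [-1.22, 0.0, 1.22]: mean 0, std 1
print(normalized.ravel())    # [0.0, 0.5, 1.0]: squeezed into [0, 1]
```

Note that both scalers preserve the *ordering* of values; only the scale changes, which is exactly why distance-based models stop over-weighting large-magnitude features.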
Encoding Categories
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import pandas as pd
# Label Encoding (ordinal categories)
le = LabelEncoder()
df['size_encoded'] = le.fit_transform(df['size'])
# S=0, M=1, L=2
# One-Hot Encoding (nominal categories)
df_encoded = pd.get_dummies(df, columns=['color'])
# Creates: color_red, color_blue, color_green
# Or using sklearn (sparse_output requires sklearn >= 1.2; older versions used sparse=False)
ohe = OneHotEncoder(sparse_output=False)
encoded = ohe.fit_transform(df[['color']])
Why Encode Categories?
ML algorithms work with numbers, not text! You must convert categorical columns like "color" or "size" into numerical format. The encoding method depends on whether categories have a natural order.
LabelEncoder
Ordinal data (has order): S→0, M→1, L→2. The numbers reflect real ranking.
OneHotEncoder
Nominal data (no order): Red, Blue, Green become separate 0/1 columns.
Creating New Features
Feature engineering is where data science becomes an art! Creating smart new features from existing data can dramatically improve your model's performance — sometimes more than changing the algorithm itself.
# Feature creation examples
df['age_group'] = pd.cut(df['age'], bins=[0, 18, 35, 50, 100],
                         labels=['teen', 'young', 'middle', 'senior'])
# Date features
df['signup_date'] = pd.to_datetime(df['signup_date'])
df['signup_year'] = df['signup_date'].dt.year
df['signup_month'] = df['signup_date'].dt.month
df['signup_dayofweek'] = df['signup_date'].dt.dayofweek
df['days_since_signup'] = (pd.Timestamp.now() - df['signup_date']).dt.days
# Interaction features
df['price_per_sqft'] = df['price'] / df['sqft']
df['total_spend'] = df['quantity'] * df['unit_price']
# Log transform for skewed data
import numpy as np
df['log_income'] = np.log1p(df['income']) # log(1+x) handles zeros
Feature Engineering Techniques Explained
pd.cut()
Binning
Converts continuous numbers into categories. Groups ages 0-18 as "teen", 19-35 as "young", etc.
.dt accessor
Date Extraction
Extracts year, month, day, weekday from dates. Captures patterns like "more sales on weekends" or "seasonal trends".
.dt.year, .dt.month, .dt.dayofweek, .dt.hour, .dt.quarter
Column Math
Interactions
Combines columns to create meaningful ratios. "Price per sqft" is more informative than price and sqft separately!
np.log1p()
Transform
Compresses skewed data (like income: few millionaires, many middle-class). Makes distribution more normal.
log(1+x) safely handles zero values. Use np.expm1() to reverse it.
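A tiny round-trip sketch (with fabricated income values) shows why log1p is safe at zero and how expm1 reverses it:

```python
import numpy as np

# Skewed toy incomes, including a zero (plain log(0) = -inf; log1p is safe)
income = np.array([0.0, 1_000.0, 50_000.0, 2_000_000.0])

logged = np.log1p(income)    # compresses the huge range
restored = np.expm1(logged)  # exact inverse transform

print(np.allclose(restored, income))  # True
```

Models train on the compressed `logged` values; predictions are mapped back to dollars with `np.expm1` before reporting.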
Step 5: Model Training
Now comes the exciting part - training your ML model! Start simple, then iterate to more complex models if needed.
The Basic Training Pattern
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
# Step 1: Choose a model
model = RandomForestClassifier(n_estimators=100, random_state=42)
# Step 2: Train (fit) on training data
model.fit(X_train, y_train)
# Step 3: Make predictions
y_pred = model.predict(X_test)
# Step 4: Get probability scores (if needed)
y_proba = model.predict_proba(X_test)[:, 1] # Probability of class 1
Understanding Each Step
- Choose a model: pick an algorithm that suits your problem type (classification, regression) and data size. For RandomForestClassifier, n_estimators sets the number of trees and random_state makes results reproducible.
- Train: model.fit(X_train, y_train) learns patterns from the training features (X_train) and target labels (y_train). This is where the "magic" happens!
- Predict: model.predict(X_test) returns hard class labels (0, 1) for new, unseen test data.
- Score probabilities: model.predict_proba(X_test)[:, 1] returns the positive-class probability, from 0.0 to 1.0. Useful for ranking or setting custom thresholds.
Every sklearn model shares the same API: .fit() → .predict() → .score(). Once you learn it for one model, you know it for all 50+ models in sklearn!
Start with simple baselines like LogisticRegression (classification) or LinearRegression (regression) before trying complex models.
Never call .fit() on test data: that's cheating and causes data leakage!
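A minimal sketch (using the built-in iris dataset purely for illustration) shows both ideas at once: a naive DummyClassifier as the floor to beat, and two real models driven through the identical fit/score API.

```python
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42, stratify=y)

# Identical API for a naive baseline and two real models
models = [
    DummyClassifier(strategy="most_frequent"),  # predicts the majority class
    LogisticRegression(max_iter=1000),
    RandomForestClassifier(n_estimators=100, random_state=42),
]
for model in models:
    model.fit(X_tr, y_tr)               # learn from training data
    accuracy = model.score(X_te, y_te)  # evaluate on unseen data
    print(f"{type(model).__name__}: {accuracy:.2f}")
```

If a complex model barely beats the dummy baseline, that is a strong hint the features, not the algorithm, are the problem.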
Hyperparameter Tuning
Hyperparameters are settings you choose before training (unlike model parameters learned during training). Finding the best combination can significantly boost your model's performance!
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
# Define parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
# Grid Search (tries all combinations)
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,          # 5-fold cross-validation
    scoring='f1',  # Optimize for F1-score
    n_jobs=-1      # Use all CPU cores
)
grid_search.fit(X_train, y_train)
# Best parameters and score
print(f"Best params: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.4f}")
# Use best model
best_model = grid_search.best_estimator_
Understanding the Code
Parameter Grid
A dictionary defining which values to try for each hyperparameter. GridSearch will test all combinations:
n_estimators
Number of trees in forest
max_depth
How deep each tree grows
min_samples_split
Min samples to split node
min_samples_leaf
Min samples in leaf node
Total combinations: 3 × 4 × 3 × 3 = 108. With cv=5, that means 108 × 5 = 540 model fits!
cv=5
Uses 5-fold cross-validation for each combination. More reliable than a single train-test split!
scoring='f1'
Metric to optimize. Options: 'accuracy', 'precision', 'recall', 'roc_auc'
n_jobs=-1
Use all CPU cores for parallel processing. Makes tuning much faster!
Accessing Results
.best_params_
The winning hyperparameter combination
.best_score_
Best cross-validation score achieved
.best_estimator_
The trained model with best params
For large grids, use RandomizedSearchCV instead: it samples random combinations, with n_iter controlling how many are tried.
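A minimal RandomizedSearchCV sketch on synthetic stand-in data; the grid values here are illustrative, not a recommendation:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in dataset
X, y = make_classification(n_samples=300, random_state=42)

param_dist = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5, 10],
}

# Samples n_iter random combinations instead of all 27
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_dist,
    n_iter=5,        # try only 5 of the 27 combinations
    cv=3,
    random_state=42,
    n_jobs=-1,
)
search.fit(X, y)

print(search.best_params_)
print(f"Best CV score: {search.best_score_:.3f}")
```

The results API is the same as GridSearchCV: best_params_, best_score_, and best_estimator_ all work identically.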
Cross-Validation
Why Cross-Validation?
Single Split Problems
- Results depend heavily on which samples are in test set
- May overestimate or underestimate true performance
- Hard to know if your model will generalize well
- Wastes data - some samples never used for training
Cross-Validation Benefits
- Uses ALL data for both training and validation
- Provides mean AND standard deviation of performance
- Flags instability (high variance across folds suggests overfitting)
- More reliable estimate of real-world performance
How K-Fold Cross-Validation Works
from sklearn.model_selection import cross_val_score
# 5-fold cross-validation
scores = cross_val_score(
    model, X_train, y_train,
    cv=5,
    scoring='accuracy'
)
print(f"Scores: {scores}")
print(f"Mean: {scores.mean():.4f}")
print(f"Std: {scores.std():.4f}")
# Interpretation:
# Mean = expected performance
# Low std = stable model (good!)
# High std = unstable model, possible overfitting (bad!)
Understanding cross_val_score
model
Your ML model (unfitted)
cv=5
Number of folds
scoring
Metric to evaluate
scores
Array of 5 scores
.mean() = average performance, .std() = consistency (lower is better, means stable across folds)
# Other CV strategies
from sklearn.model_selection import (
    StratifiedKFold,   # Preserves class balance
    LeaveOneOut,       # N folds for N samples
    TimeSeriesSplit    # For time series data
)
# Stratified K-Fold (recommended for classification)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf)
# Time Series Split (respects temporal order)
tscv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(model, X, y, cv=tscv)
CV Strategies Explained
StratifiedKFold
Ensures each fold has same ratio of classes. Essential for imbalanced data (e.g., 90% class A, 10% class B).
LeaveOneOut
Uses N-1 samples for training, 1 for testing. Repeats N times. Thorough but very slow for large datasets.
TimeSeriesSplit
Never uses future data to predict past! Training expands forward, test is always the next period.
Visual summary: K-Fold tests each sample exactly once; StratifiedKFold preserves the class distribution in every fold; TimeSeriesSplit expands the training window forward in time.
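Printing the fold indices makes the "expanding window" behavior of TimeSeriesSplit concrete (12 toy time-ordered samples):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 time-ordered samples

tscv = TimeSeriesSplit(n_splits=3)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    # Training window grows; the test block is always the next period
    print(f"Fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

In every fold the largest training index is smaller than the smallest test index, so the model never sees the future.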
Step 6: Model Evaluation
How do you know if your model is good? Evaluation metrics tell you how well your model performs on unseen data.
Classification Metrics
from sklearn.metrics import (
    accuracy_score, precision_score,
    recall_score, f1_score,
    confusion_matrix, classification_report
)
# Basic metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")
# Detailed report
print(classification_report(y_test, y_pred))
Metrics Explained
Accuracy
% of correct predictions overall. Can be misleading with imbalanced data!
Precision
Of all "YES" predictions, how many were actually YES? (Avoid false alarms)
Recall
Of all actual YES cases, how many did we catch? (Don't miss anything!)
F1-Score
Harmonic mean of Precision & Recall. Best single metric for imbalanced data.
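A minimal sketch of why accuracy misleads on imbalanced data: a useless model that always predicts "no" still scores 95% accuracy, while F1 exposes it.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Fabricated imbalanced labels: 95 negatives, 5 positives
y_true = np.array([0] * 95 + [1] * 5)
# A "model" that always predicts the majority class
y_pred = np.zeros(100, dtype=int)

print(accuracy_score(y_true, y_pred))             # 0.95: looks impressive
print(f1_score(y_true, y_pred, zero_division=0))  # 0.0: catches zero positives
```

`zero_division=0` just suppresses the warning when no positives are predicted; the F1 of 0.0 is the honest verdict.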
Regression Metrics
from sklearn.metrics import (
    mean_squared_error, mean_absolute_error,
    r2_score, root_mean_squared_error  # root_mean_squared_error needs sklearn >= 1.4
)
# Calculate metrics
mse = mean_squared_error(y_test, y_pred)
rmse = root_mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE: {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"MAE: {mae:.4f}")
print(f"R-squared: {r2:.4f}")
Metrics Explained
MAE
Average absolute error. Easy to interpret: "Off by $X on average".
MSE
Average squared error. Penalizes large errors more heavily.
RMSE
Square root of MSE. Same units as target. Most commonly used!
R² Score
Usually on a 0-1 scale: "model explains X% of variance". 1.0 = perfect fit; it can go negative when the model is worse than always predicting the mean.
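A hand-checkable sketch with three hypothetical house prices ties the four metrics together (RMSE is just the square root of MSE, computed here with np.sqrt so it also runs on older sklearn):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Hypothetical house prices in $1000s
y_true = np.array([200.0, 300.0, 400.0])
y_pred = np.array([210.0, 290.0, 420.0])

mse = mean_squared_error(y_true, y_pred)   # (100 + 100 + 400) / 3 = 200.0
rmse = np.sqrt(mse)                        # back in the target's units (~14.1)
mae = mean_absolute_error(y_true, y_pred)  # (10 + 10 + 20) / 3 ~= 13.3
r2 = r2_score(y_true, y_pred)              # 1 - 600/20000 = 0.97

print(mse, round(rmse, 1), round(mae, 1), round(r2, 2))
```

Note how the single $20k miss inflates MSE (squared errors) far more than MAE, which treats all errors linearly.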
Understanding the Confusion Matrix
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
# Create confusion matrix
cm = confusion_matrix(y_test, y_pred)
# Visualize
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Predicted No', 'Predicted Yes'],
            yticklabels=['Actual No', 'Actual Yes'])
plt.title('Confusion Matrix')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()
Code Breakdown
confusion_matrix()
Creates 2×2 array of TN, FP, FN, TP counts
sns.heatmap()
Visualizes matrix with colors and numbers
annot=True
Shows numbers in each cell
fmt='d'
Format as integers (not decimals)
Confusion Matrix Explained
| Cell | Meaning |
|---|---|
| TN (True Negative) | Correctly predicted NO |
| FP (False Positive) | Incorrectly predicted YES (Type I Error) |
| FN (False Negative) | Incorrectly predicted NO (Type II Error) |
| TP (True Positive) | Correctly predicted YES |
Visual: How to Read the Matrix
|  | Predicted NO | Predicted YES |
|---|---|---|
| Actual NO | TN ✓ Correct | FP ✗ Type I |
| Actual YES | FN ✗ Type II | TP ✓ Correct |
Diagonals = correct predictions; off-diagonals = errors.
Real-World Examples:
False Positives (Type I):
- 🔔 Fire alarm when there's no fire
- 📧 Good email marked as spam
- 🏥 Healthy patient told they're sick
False Negatives (Type II):
- 🔇 No alarm during actual fire!
- 📩 Spam lands in inbox
- 🏥 Sick patient told they're healthy!
Which error is worse? Depends on context! Medical diagnosis: FN is worse (missing disease). Spam filter: FP is worse (losing important email).
Metrics Calculated from Confusion Matrix
Accuracy
(TP + TN) / Total
Precision
TP / (TP + FP)
Recall
TP / (TP + FN)
Specificity
TN / (TN + FP)
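The four formulas above can be checked by hand on a tiny fabricated example, unpacking the matrix with .ravel():

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Tiny hand-checkable labels
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 0, 1])

# ravel() flattens the 2x2 matrix in TN, FP, FN, TP order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
specificity = tn / (tn + fp)

print(tn, fp, fn, tp)                            # 3 1 1 3
print(accuracy, precision, recall, specificity)  # all 0.75 here
```

With 3 correct in each class and 1 error of each type, all four metrics land on 0.75, which is easy to verify by counting.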
Step 7: Deployment
A model that stays on your laptop creates zero value. Deployment puts your model into production where it can make real predictions!
Save Model
import joblib
import pickle
# Save with joblib (recommended)
joblib.dump(model, 'model.joblib')
# Load model
loaded_model = joblib.load('model.joblib')
# Make prediction
prediction = loaded_model.predict(new_data)
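A self-contained round-trip sketch (training a throwaway model on the built-in iris data) verifies that the reloaded model behaves identically:

```python
import joblib
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=42).fit(X, y)

# Round-trip: save, reload, verify identical predictions
joblib.dump(model, "model.joblib")
loaded = joblib.load("model.joblib")

print((loaded.predict(X) == model.predict(X)).all())  # True
```

One caveat worth knowing: joblib files are pickles, so load them with the same scikit-learn version they were saved with (pin library versions in production).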
REST API
# Flask API example
from flask import Flask, request
import joblib
app = Flask(__name__)
model = joblib.load('model.joblib')
@app.route('/predict', methods=['POST'])
def predict():
    data = request.json
    prediction = model.predict([data['features']])
    return {'prediction': int(prediction[0])}
Monitor
After deployment, monitor:
- Model accuracy over time
- Data drift (input distribution changes)
- Latency and throughput
- Error rates
- Business metrics impact
Key Takeaways
Follow the Workflow
The 7-step ML workflow provides structure. Don't skip steps, especially problem definition and data prep
Data Prep Takes 60-80%
Most of your time goes into data cleaning and preparation. This is normal and expected
Always Split Data First
Split into train/test BEFORE preprocessing to prevent data leakage and get honest evaluation
Choose Right Metrics
Accuracy isn't everything. Use F1 for imbalanced data, RMSE for regression, and align with business goals
ML is Iterative
You'll loop back to earlier steps. Poor results? Go back to features or data. This is the process
Deployment is Essential
A model that isn't deployed creates zero value. Plan for production, monitoring, and retraining
Knowledge Check
Test your understanding of the ML workflow:
What percentage of a data scientist's time is typically spent on data preparation?
Why should you split data into train/test sets BEFORE preprocessing?
Which metric would be BEST for evaluating a model that predicts house prices?
What is the purpose of cross-validation?
When encoding categorical variables, when should you use One-Hot Encoding vs Label Encoding?
What is GridSearchCV used for?