Assignment Overview
In this assignment, you will build a complete Customer Segmentation System using various unsupervised learning algorithms. This comprehensive project requires you to apply ALL concepts from Module 4: K-Means Clustering, Hierarchical Clustering, DBSCAN, Principal Component Analysis (PCA), cluster validation metrics, and translating cluster insights into actionable business recommendations.
You will use pandas, numpy, matplotlib, seaborn, scikit-learn, and scipy for this assignment.
- K-Means (4.1): Centroid-based clustering, elbow method, inertia
- Hierarchical (4.2): Dendrograms, linkage methods, agglomerative clustering
- DBSCAN (4.3): Density-based clustering, eps, min_samples
- PCA (4.4): Dimensionality reduction, explained variance
The Scenario
ShopSmart Retail Analytics
You have been hired as a Data Scientist at ShopSmart Retail, a large e-commerce company looking to improve their marketing strategies through customer segmentation. The Chief Marketing Officer has given you this task:
"We have data on customer purchasing behavior, but we're treating everyone the same with our marketing. We need to identify distinct customer segments so we can tailor our campaigns. Use clustering to find natural groupings in our customer base and tell us what makes each segment unique!"
Your Task
Create a Jupyter Notebook called customer_segmentation.ipynb that implements multiple
clustering algorithms, validates cluster quality, reduces dimensionality for visualization,
and provides actionable segment profiles for the marketing team.
The Dataset
You will work with a Customer Behavior dataset. Create this CSV file as shown below:
File: customer_behavior.csv (Customer Data)
```
customer_id,annual_income,spending_score,avg_purchase_value,purchase_frequency,days_since_last_purchase,total_purchases,product_categories_bought,avg_session_duration,website_visits_monthly,email_open_rate,age
C001,85000,72,145.50,24,5,156,8,12.5,18,0.45,34
C002,32000,35,42.00,6,45,28,3,4.2,5,0.12,22
C003,120000,88,285.00,36,2,432,12,18.3,28,0.68,45
C004,45000,42,68.50,12,21,89,5,6.8,9,0.25,28
C005,95000,78,198.00,28,7,224,10,15.2,22,0.52,38
C006,28000,25,35.00,4,62,18,2,3.1,3,0.08,19
C007,150000,92,425.00,48,1,576,15,22.5,35,0.78,52
C008,55000,55,95.00,18,14,145,7,9.5,14,0.35,31
C009,38000,38,52.00,8,38,52,4,5.2,6,0.18,25
C010,180000,95,520.00,52,1,624,18,25.8,42,0.85,58
C011,42000,40,62.00,10,28,75,4,5.8,8,0.22,26
C012,105000,82,225.00,32,4,298,11,16.8,25,0.58,42
C013,25000,22,28.00,3,75,12,2,2.5,2,0.05,20
C014,78000,68,135.00,22,9,178,8,11.2,16,0.42,35
C015,62000,58,108.00,20,12,165,7,10.2,15,0.38,33
C016,135000,90,355.00,42,2,485,14,20.5,32,0.72,48
C017,48000,45,75.00,14,18,112,6,7.5,11,0.28,29
C018,22000,18,22.00,2,88,8,1,1.8,1,0.02,18
C019,88000,75,165.00,26,6,198,9,13.8,20,0.48,36
C020,165000,93,465.00,50,1,598,16,24.2,38,0.82,55
C021,35000,32,48.00,7,42,38,3,4.5,5,0.15,24
C022,72000,62,118.00,19,11,152,7,10.8,14,0.40,32
C023,98000,80,188.00,30,5,268,10,14.8,23,0.55,40
C024,30000,28,38.00,5,55,25,3,3.8,4,0.10,21
C025,112000,85,248.00,35,3,358,12,17.5,27,0.62,44
```
Columns Explained
- customer_id: Unique identifier (string)
- annual_income: Estimated annual income in USD (integer)
- spending_score: Score assigned based on spending behavior, 1-100 (integer)
- avg_purchase_value: Average transaction value in USD (float)
- purchase_frequency: Number of purchases per year (integer)
- days_since_last_purchase: Recency of last purchase in days (integer)
- total_purchases: Lifetime total purchases (integer)
- product_categories_bought: Number of different categories explored (integer)
- avg_session_duration: Average time spent on site in minutes (float)
- website_visits_monthly: Average monthly site visits (integer)
- email_open_rate: Proportion of marketing emails opened, 0-1 (float)
- age: Customer age (integer)
Requirements
Your customer_segmentation.ipynb must implement ALL of the following functions.
Each function is mandatory and will be tested individually.
Load and Preprocess Data
Create a function load_and_preprocess(filename) that:
- Loads the CSV file using pandas
- Handles any missing values
- Standardizes features using StandardScaler
- Returns original df, scaled features, and scaler object
```python
def load_and_preprocess(filename):
    """Load dataset and standardize features."""
    # Must return: (df, X_scaled, scaler)
    pass
```
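One way this function might be filled in. This is a sketch, not the required solution: dropping rows with missing values is one simple strategy (imputation is equally valid), and `customer_id` is excluded from scaling because it is an identifier, not a feature.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

def load_and_preprocess(filename):
    """Load the customer CSV and standardize the numeric features."""
    df = pd.read_csv(filename)
    # Drop rows with missing values (one simple strategy; imputation also works)
    df = df.dropna().reset_index(drop=True)
    # customer_id is an identifier, not a feature, so exclude it from scaling
    features = df.drop(columns=["customer_id"])
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(features)
    return df, X_scaled, scaler
```

Returning the fitted scaler lets you later transform new customers onto the same scale.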
Exploratory Data Analysis
Create a function perform_eda(df) that:
- Creates correlation heatmap of all features
- Generates distribution plots for key features
- Creates pairplots for important feature combinations
- Saves visualizations as eda_plots.png
```python
def perform_eda(df):
    """Perform exploratory data analysis."""
    # Must save: eda_plots.png
    pass
```
Find Optimal K (Elbow Method)
Create a function find_optimal_k(X_scaled, k_range=(2, 11)) that:
- Calculates inertia for different k values
- Calculates silhouette scores for each k
- Plots elbow curve and silhouette scores
- Saves plot as optimal_k.png
- Returns recommended k value
```python
def find_optimal_k(X_scaled, k_range=(2, 11)):
    """Find optimal number of clusters using elbow method."""
    # Return: optimal_k
    pass
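A possible sketch of this function. Picking the silhouette maximum as the recommended k is one simple, automatable heuristic; you may instead read the elbow visually and justify your choice. The `random_state=42` and `n_init=10` values here are arbitrary choices for reproducibility.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def find_optimal_k(X_scaled, k_range=(2, 11)):
    """Plot the elbow curve and silhouette scores; return the k with the best silhouette."""
    ks = list(range(k_range[0], k_range[1]))
    inertias, silhouettes = [], []
    for k in ks:
        km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
        inertias.append(km.inertia_)
        silhouettes.append(silhouette_score(X_scaled, km.labels_))
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.plot(ks, inertias, marker="o")
    ax1.set(xlabel="k", ylabel="Inertia", title="Elbow curve")
    ax2.plot(ks, silhouettes, marker="o")
    ax2.set(xlabel="k", ylabel="Silhouette score", title="Silhouette scores")
    fig.tight_layout()
    fig.savefig("optimal_k.png")
    plt.close(fig)
    # Simple heuristic: take the k that maximizes the silhouette score
    return ks[int(np.argmax(silhouettes))]
```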
K-Means Clustering
Create a function perform_kmeans(X_scaled, n_clusters) that:
- Fits K-Means with specified number of clusters
- Returns cluster labels and cluster centers
- Calculates and prints inertia and silhouette score
```python
def perform_kmeans(X_scaled, n_clusters):
    """Perform K-Means clustering."""
    # Return: (labels, centers, silhouette_score)
    pass
```
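A minimal sketch of this function (the `random_state` and `n_init` values are arbitrary choices for reproducibility):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def perform_kmeans(X_scaled, n_clusters):
    """Fit K-Means and report inertia and silhouette score."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=42).fit(X_scaled)
    sil = silhouette_score(X_scaled, km.labels_)
    print(f"Inertia: {km.inertia_:.2f}  Silhouette: {sil:.3f}")
    return km.labels_, km.cluster_centers_, sil
```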
Hierarchical Clustering
Create a function perform_hierarchical(X_scaled, n_clusters, linkage='ward') that:
- Creates and plots dendrogram
- Performs agglomerative clustering
- Saves dendrogram as dendrogram.png
- Returns cluster labels
```python
def perform_hierarchical(X_scaled, n_clusters, linkage='ward'):
    """Perform Hierarchical clustering."""
    # Return: labels
    pass
```
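A possible sketch. Note that scipy's `linkage` function is accessed through the `hierarchy` namespace so it does not collide with the function's `linkage` argument; scipy builds the merge tree for the dendrogram, while scikit-learn's `AgglomerativeClustering` produces the flat labels.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
from scipy.cluster import hierarchy
from sklearn.cluster import AgglomerativeClustering

def perform_hierarchical(X_scaled, n_clusters, linkage="ward"):
    """Plot a dendrogram, then cut the tree into n_clusters via agglomerative clustering."""
    # Full merge tree for the dendrogram plot
    Z = hierarchy.linkage(X_scaled, method=linkage)
    fig, ax = plt.subplots(figsize=(10, 5))
    hierarchy.dendrogram(Z, ax=ax)
    ax.set(title=f"Dendrogram ({linkage} linkage)", xlabel="Sample index", ylabel="Distance")
    fig.savefig("dendrogram.png")
    plt.close(fig)
    # Flat cluster assignments at the requested number of clusters
    model = AgglomerativeClustering(n_clusters=n_clusters, linkage=linkage)
    return model.fit_predict(X_scaled)
```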
DBSCAN Clustering
Create a function perform_dbscan(X_scaled, eps=0.5, min_samples=5) that:
- Fits DBSCAN with specified parameters
- Identifies noise points (label = -1)
- Reports number of clusters and noise points
- Returns cluster labels
```python
def perform_dbscan(X_scaled, eps=0.5, min_samples=5):
    """Perform DBSCAN clustering."""
    # Return: (labels, n_clusters, n_noise)
    pass
```
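A minimal sketch. DBSCAN labels noise points as -1, so they must be excluded when counting clusters:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def perform_dbscan(X_scaled, eps=0.5, min_samples=5):
    """Fit DBSCAN; points labeled -1 are noise."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X_scaled)
    # The -1 label marks noise, not a cluster
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int(np.sum(labels == -1))
    print(f"Clusters found: {n_clusters}  Noise points: {n_noise}")
    return labels, n_clusters, n_noise
```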
Tune DBSCAN Parameters
Create a function tune_dbscan(X_scaled) that:
- Uses k-distance graph to find optimal eps
- Tests different min_samples values
- Returns best parameters based on silhouette score
```python
def tune_dbscan(X_scaled):
    """Find optimal DBSCAN parameters."""
    # Return: (best_eps, best_min_samples)
    pass
```
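One possible approach, sketched below. Instead of locating the k-distance knee visually, it tries a few quantiles of the k-distance curve as eps candidates; the candidate grids (`min_samples` in 3/5/10, quantiles 0.80/0.90/0.95) are arbitrary choices you should tune for your data. The silhouette is computed on non-noise points only.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score
from sklearn.neighbors import NearestNeighbors

def tune_dbscan(X_scaled):
    """Grid-search eps (from the k-distance curve) and min_samples by silhouette score."""
    best_eps, best_ms, best_score = None, None, -1.0
    for min_samples in (3, 5, 10):
        # k-distance graph: distance to the min_samples-th nearest neighbor, sorted
        nn = NearestNeighbors(n_neighbors=min_samples).fit(X_scaled)
        dists = np.sort(nn.kneighbors(X_scaled)[0][:, -1])
        # Try a few quantiles of the curve as eps candidates (stand-in for the knee)
        for q in (0.80, 0.90, 0.95):
            eps = float(np.quantile(dists, q))
            if eps <= 0:
                continue
            labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X_scaled)
            mask = labels != -1  # score only the clustered (non-noise) points
            if len(set(labels[mask])) < 2:
                continue  # silhouette needs at least two clusters
            score = silhouette_score(X_scaled[mask], labels[mask])
            if score > best_score:
                best_eps, best_ms, best_score = eps, min_samples, score
    return best_eps, best_ms
```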
Apply PCA
Create a function apply_pca(X_scaled, n_components=2) that:
- Applies PCA for dimensionality reduction
- Calculates and displays explained variance ratio
- Plots cumulative explained variance
- Returns transformed data and PCA object
```python
def apply_pca(X_scaled, n_components=2):
    """Apply PCA for dimensionality reduction."""
    # Return: (X_pca, pca, explained_variance_ratio)
    pass
```
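A possible sketch. The output filename `pca_variance.png` is an assumption here; the spec does not name a file for the cumulative-variance plot, so choose what fits your notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

def apply_pca(X_scaled, n_components=2):
    """Project onto the top principal components and plot cumulative explained variance."""
    pca = PCA(n_components=n_components)
    X_pca = pca.fit_transform(X_scaled)
    ratios = pca.explained_variance_ratio_
    print("Explained variance ratio:", np.round(ratios, 3))
    fig, ax = plt.subplots()
    ax.plot(range(1, len(ratios) + 1), np.cumsum(ratios), marker="o")
    ax.set(xlabel="Number of components", ylabel="Cumulative explained variance")
    fig.savefig("pca_variance.png")  # assumed filename, not mandated by the spec
    plt.close(fig)
    return X_pca, pca, ratios
```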
Visualize Clusters
Create a function visualize_clusters(X_pca, labels, title, filename) that:
- Creates 2D scatter plot using PCA components
- Colors points by cluster assignment
- Adds cluster centers if applicable
- Saves plot with given filename
```python
def visualize_clusters(X_pca, labels, title, filename):
    """Visualize clusters in 2D PCA space."""
    # Must save: a .png named by the filename argument
    pass
```
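A minimal sketch of the scatter plot. For brevity it colors points with a colormap and omits the optional cluster-center overlay the spec mentions; DBSCAN's -1 noise label simply gets its own color.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import numpy as np

def visualize_clusters(X_pca, labels, title, filename):
    """Scatter the first two PCA components, colored by cluster assignment."""
    labels = np.asarray(labels)
    fig, ax = plt.subplots(figsize=(7, 5))
    scatter = ax.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, cmap="viridis", s=30)
    fig.colorbar(scatter, ax=ax, label="Cluster")
    ax.set(title=title, xlabel="PC1", ylabel="PC2")
    fig.savefig(f"{filename}.png")
    plt.close(fig)
```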
Compare Clustering Methods
Create a function compare_methods(X_scaled, X_pca, kmeans_labels, hierarchical_labels, dbscan_labels) that:
- Calculates silhouette scores for all methods
- Creates side-by-side cluster visualizations
- Saves comparison as clustering_comparison.png
- Returns comparison DataFrame
```python
def compare_methods(X_scaled, X_pca, kmeans_labels, hierarchical_labels, dbscan_labels):
    """Compare different clustering methods."""
    # Return: comparison_df
    pass
```
Profile Clusters
Create a function profile_clusters(df, labels, feature_names) that:
- Calculates mean values for each cluster
- Identifies distinguishing characteristics
- Creates radar/spider charts for segment profiles
- Returns cluster profile DataFrame
```python
def profile_clusters(df, labels, feature_names):
    """Create detailed profiles for each cluster."""
    # Return: cluster_profiles_df
    pass
```
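A partial sketch of the profiling logic. It computes per-cluster means and flags each cluster's most distinguishing features by how far the cluster mean sits from the overall mean (in overall standard deviations); the radar/spider charts the spec asks for are omitted here for brevity.

```python
import pandas as pd

def profile_clusters(df, labels, feature_names):
    """Mean of each feature per cluster, plus a z-score to flag distinguishing traits."""
    data = df[feature_names].copy()
    data["cluster"] = labels
    profiles = data.groupby("cluster").mean()
    # Distance of each cluster mean from the overall mean, in overall std units
    z = (profiles - df[feature_names].mean()) / df[feature_names].std()
    for cluster, row in z.iterrows():
        top = row.abs().sort_values(ascending=False).head(3)
        print(f"Cluster {cluster} stands out on: {list(top.index)}")
    return profiles
```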
Generate Business Recommendations
Create a function generate_recommendations(cluster_profiles) that:
- Assigns descriptive names to each segment
- Creates marketing strategy recommendations per segment
- Suggests targeted campaigns based on behavior
- Saves recommendations to segment_recommendations.txt
```python
def generate_recommendations(cluster_profiles):
    """Generate business recommendations for each segment."""
    # Must save: segment_recommendations.txt
    pass
```
Main Pipeline
Create a main() function that:
- Runs the complete segmentation pipeline
- Applies all clustering methods
- Generates all required visualizations
- Produces final segment recommendations
```python
def main():
    # 1. Load and preprocess data
    df, X_scaled, scaler = load_and_preprocess("customer_behavior.csv")
    feature_names = df.columns.drop('customer_id').tolist()

    # 2. Exploratory data analysis
    perform_eda(df)

    # 3. Find optimal k
    optimal_k = find_optimal_k(X_scaled)
    print(f"Optimal number of clusters: {optimal_k}")

    # 4. Apply PCA
    X_pca, pca, variance_ratio = apply_pca(X_scaled)
    print(f"Explained variance (2 components): {sum(variance_ratio):.2%}")

    # 5. K-Means clustering
    kmeans_labels, centers, kmeans_silhouette = perform_kmeans(X_scaled, optimal_k)
    visualize_clusters(X_pca, kmeans_labels, "K-Means Clustering", "kmeans_clusters")

    # 6. Hierarchical clustering
    hierarchical_labels = perform_hierarchical(X_scaled, optimal_k)
    visualize_clusters(X_pca, hierarchical_labels, "Hierarchical Clustering", "hierarchical_clusters")

    # 7. DBSCAN clustering
    best_eps, best_min_samples = tune_dbscan(X_scaled)
    dbscan_labels, n_clusters, n_noise = perform_dbscan(X_scaled, best_eps, best_min_samples)
    visualize_clusters(X_pca, dbscan_labels, "DBSCAN Clustering", "dbscan_clusters")

    # 8. Compare methods
    comparison_df = compare_methods(X_scaled, X_pca, kmeans_labels, hierarchical_labels, dbscan_labels)
    print("\nClustering Comparison:")
    print(comparison_df)

    # 9. Profile clusters (using best method)
    cluster_profiles = profile_clusters(df, kmeans_labels, feature_names)
    print("\nCluster Profiles:")
    print(cluster_profiles)

    # 10. Generate recommendations
    generate_recommendations(cluster_profiles)
    print("\nRecommendations saved to segment_recommendations.txt")

    # 11. Save customer segments
    df['segment'] = kmeans_labels
    df.to_csv('customer_segments.csv', index=False)
    print("Customer segments saved to customer_segments.csv")


if __name__ == "__main__":
    main()
```
Submission
Create a public GitHub repository with the exact name shown below:
Required Repository Name
customer-segmentation-clustering
Required Files
customer-segmentation-clustering/
├── customer_segmentation.ipynb # Your Jupyter Notebook with ALL 13 functions
├── customer_behavior.csv # Input dataset (as provided or extended)
├── eda_plots.png # EDA visualizations
├── optimal_k.png # Elbow curve and silhouette plots
├── dendrogram.png # Hierarchical clustering dendrogram
├── clustering_comparison.png # Side-by-side cluster comparison
├── customer_segments.csv # Final customer data with segment labels
├── segment_recommendations.txt # Business recommendations per segment
└── README.md # REQUIRED - see contents below
README.md Must Include:
- Your full name and submission date
- Summary of all clustering methods used and their results
- Your optimal number of clusters and how you determined it
- Description of each customer segment discovered
- Any challenges faced and how you solved them
- Instructions to run your notebook
Do Include
- All 13 functions implemented and working
- Docstrings for every function
- Clear visualizations with labels and titles
- Feature scaling before clustering
- Multiple clustering methods compared
- README.md with segment descriptions
Do Not Include
- Any .pyc or __pycache__ files (use .gitignore)
- Virtual environment folders
- Large model pickle files
- Code that doesn't run without errors
- Hardcoded file paths
Grading Rubric
Your assignment will be graded on the following criteria:
| Criteria | Points | Description |
|---|---|---|
| K-Means Implementation | 30 | Elbow method, clustering, and silhouette validation |
| Hierarchical Clustering | 25 | Dendrogram visualization and agglomerative clustering |
| DBSCAN | 25 | Parameter tuning and noise point identification |
| PCA & Visualization | 25 | Dimensionality reduction and cluster visualizations |
| Data Preprocessing | 20 | Proper feature scaling and handling |
| Cluster Profiling | 30 | Segment analysis and business recommendations |
| Code Quality | 45 | Docstrings, comments, naming conventions, and organization |
| Total | 200 | |
Ready to Submit?
Make sure you have completed all requirements and reviewed the grading rubric above.
What You Will Practice
- K-Means Clustering (4.1): Partitioning data, choosing optimal k, centroid-based learning
- Hierarchical Methods (4.2): Dendrograms, linkage strategies, agglomerative clustering
- Density-Based Clustering (4.3): DBSCAN algorithm, handling noise, arbitrary cluster shapes
- Dimensionality Reduction (4.4): PCA for visualization, explained variance, feature extraction
Pro Tips
Clustering Best Practices
- Always scale features before clustering
- Use multiple methods to validate cluster structure
- Silhouette score helps assess cluster quality
- Visualize with PCA even if you have more features
Method Selection
- K-Means: Fast, works well for spherical clusters
- Hierarchical: Good for understanding relationships
- DBSCAN: Best for arbitrary shapes and outliers
- Compare all three for robust insights
Validation Metrics
- Silhouette Score: ranges from -1 to 1; higher is better
- Inertia: lower means tighter clusters, but it always decreases as k grows, so it cannot pick k on its own
- Calinski-Harabasz: higher indicates denser, better-separated clusters
- Davies-Bouldin: lower indicates better separation
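All three label-free metrics above are available in scikit-learn; a quick illustration on synthetic data (the blob centers are arbitrary, chosen only to give well-separated clusters):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (calinski_harabasz_score,
                             davies_bouldin_score, silhouette_score)

# Four well-separated synthetic clusters
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [6, 0], [0, 6], [6, 6]],
                  cluster_std=1.0, random_state=42)
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)

print(f"Silhouette:        {silhouette_score(X, labels):.3f}")        # higher is better
print(f"Calinski-Harabasz: {calinski_harabasz_score(X, labels):.1f}") # higher is better
print(f"Davies-Bouldin:    {davies_bouldin_score(X, labels):.3f}")    # lower is better
```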
Common Mistakes
- Forgetting to scale features
- Using accuracy metrics (no labels in unsupervised!)
- Choosing k based only on elbow without silhouette
- Not interpreting what clusters actually mean