Assignment 4-A

Unsupervised Learning: Customer Segmentation

Build a complete customer segmentation system applying all Module 4 concepts: K-Means Clustering, Hierarchical Clustering, DBSCAN, PCA for dimensionality reduction, cluster validation metrics, and business insight generation.

6-8 hours
Challenging
200 Points
What You'll Practice
  • Apply K-Means with elbow method
  • Build hierarchical clustering dendrograms
  • Implement DBSCAN for anomaly detection
  • Use PCA for visualization
  • Validate clusters with silhouette scores
01

Assignment Overview

In this assignment, you will build a complete Customer Segmentation System using various unsupervised learning algorithms. This comprehensive project requires you to apply ALL concepts from Module 4: K-Means Clustering, Hierarchical Clustering, DBSCAN, Principal Component Analysis (PCA), cluster validation metrics, and translating cluster insights into actionable business recommendations.

Libraries Allowed: You may use pandas, numpy, matplotlib, seaborn, scikit-learn, and scipy for this assignment.
Skills Applied: This assignment tests your understanding of K-Means (Topic 4.1), Hierarchical Clustering (Topic 4.2), DBSCAN (Topic 4.3), and PCA (Topic 4.4) from Module 4.
K-Means (4.1)

Centroid-based clustering, elbow method, inertia

Hierarchical (4.2)

Dendrograms, linkage methods, agglomerative

DBSCAN (4.3)

Density-based clustering, eps, min_samples

PCA (4.4)

Dimensionality reduction, explained variance

02

The Scenario

ShopSmart Retail Analytics

You have been hired as a Data Scientist at ShopSmart Retail, a large e-commerce company looking to improve their marketing strategies through customer segmentation. The Chief Marketing Officer has given you this task:

"We have data on customer purchasing behavior, but we're treating everyone the same with our marketing. We need to identify distinct customer segments so we can tailor our campaigns. Use clustering to find natural groupings in our customer base and tell us what makes each segment unique!"

Your Task

Create a Jupyter Notebook called customer_segmentation.ipynb that implements multiple clustering algorithms, validates cluster quality, reduces dimensionality for visualization, and provides actionable segment profiles for the marketing team.

03

The Dataset

You will work with a Customer Behavior dataset. Create this CSV file as shown below:

File: customer_behavior.csv (Customer Data)

customer_id,annual_income,spending_score,avg_purchase_value,purchase_frequency,days_since_last_purchase,total_purchases,product_categories_bought,avg_session_duration,website_visits_monthly,email_open_rate,age
C001,85000,72,145.50,24,5,156,8,12.5,18,0.45,34
C002,32000,35,42.00,6,45,28,3,4.2,5,0.12,22
C003,120000,88,285.00,36,2,432,12,18.3,28,0.68,45
C004,45000,42,68.50,12,21,89,5,6.8,9,0.25,28
C005,95000,78,198.00,28,7,224,10,15.2,22,0.52,38
C006,28000,25,35.00,4,62,18,2,3.1,3,0.08,19
C007,150000,92,425.00,48,1,576,15,22.5,35,0.78,52
C008,55000,55,95.00,18,14,145,7,9.5,14,0.35,31
C009,38000,38,52.00,8,38,52,4,5.2,6,0.18,25
C010,180000,95,520.00,52,1,624,18,25.8,42,0.85,58
C011,42000,40,62.00,10,28,75,4,5.8,8,0.22,26
C012,105000,82,225.00,32,4,298,11,16.8,25,0.58,42
C013,25000,22,28.00,3,75,12,2,2.5,2,0.05,20
C014,78000,68,135.00,22,9,178,8,11.2,16,0.42,35
C015,62000,58,108.00,20,12,165,7,10.2,15,0.38,33
C016,135000,90,355.00,42,2,485,14,20.5,32,0.72,48
C017,48000,45,75.00,14,18,112,6,7.5,11,0.28,29
C018,22000,18,22.00,2,88,8,1,1.8,1,0.02,18
C019,88000,75,165.00,26,6,198,9,13.8,20,0.48,36
C020,165000,93,465.00,50,1,598,16,24.2,38,0.82,55
C021,35000,32,48.00,7,42,38,3,4.5,5,0.15,24
C022,72000,62,118.00,19,11,152,7,10.8,14,0.40,32
C023,98000,80,188.00,30,5,268,10,14.8,23,0.55,40
C024,30000,28,38.00,5,55,25,3,3.8,4,0.10,21
C025,112000,85,248.00,35,3,358,12,17.5,27,0.62,44
Columns Explained
  • customer_id - Unique identifier (string)
  • annual_income - Estimated annual income in USD (integer)
  • spending_score - Score assigned based on spending behavior 1-100 (integer)
  • avg_purchase_value - Average transaction value in USD (float)
  • purchase_frequency - Number of purchases per year (integer)
  • days_since_last_purchase - Recency of last purchase (integer)
  • total_purchases - Lifetime total purchases (integer)
  • product_categories_bought - Number of different categories explored (integer)
  • avg_session_duration - Average time spent on site in minutes (float)
  • website_visits_monthly - Average monthly site visits (integer)
  • email_open_rate - Proportion of marketing emails opened 0-1 (float)
  • age - Customer age (integer)
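Note: The features are on very different scales: annual_income ranges from 22,000 to 180,000, while email_open_rate lies between 0 and 1. You MUST standardize the data before clustering; otherwise distance-based algorithms will be dominated by the large-valued features.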
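To see why scaling matters, the short sketch below compares the Euclidean distance between two customers before and after standardization. It uses only the annual_income and email_open_rate values of C001 to C005 from the table above; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# annual_income and email_open_rate for C001..C005 (from the dataset above)
X = np.array([
    [85000, 0.45],
    [32000, 0.12],
    [120000, 0.68],
    [45000, 0.25],
    [95000, 0.52],
])

# Before scaling, the distance between C001 and C002 is almost entirely income
raw_dist = np.linalg.norm(X[0] - X[1])

# After scaling, each column has mean 0 and standard deviation 1
X_scaled = StandardScaler().fit_transform(X)
scaled_dist = np.linalg.norm(X_scaled[0] - X_scaled[1])

print(f"raw distance:    {raw_dist:.2f}")
print(f"scaled distance: {scaled_dist:.2f}")
print("column means:", X_scaled.mean(axis=0).round(6))
print("column stds: ", X_scaled.std(axis=0).round(6))
```

In the raw data the email_open_rate difference (0.33) contributes essentially nothing to the distance; after scaling both features contribute on comparable terms.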
04

Requirements

Your customer_segmentation.ipynb must implement ALL of the following functions. Each function is mandatory and will be tested individually.

1
Load and Preprocess Data

Create a function load_and_preprocess(filename) that:

  • Loads the CSV file using pandas
  • Handles any missing values
  • Standardizes features using StandardScaler
  • Returns original df, scaled features, and scaler object
def load_and_preprocess(filename):
    """Load dataset and standardize features."""
    # Must return: (df, X_scaled, scaler)
    pass
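One possible shape for this function is sketched below. It assumes that customer_id is the only non-feature column and that median imputation is an acceptable way to handle missing values (dropping rows would be another option). The demo reads a tiny in-memory stand-in rather than the real customer_behavior.csv.

```python
import io

import pandas as pd
from sklearn.preprocessing import StandardScaler


def load_and_preprocess(filename):
    """Load the CSV, fill missing numeric values, and standardize features."""
    df = pd.read_csv(filename)
    feature_cols = df.columns.drop("customer_id")
    # Assumption: fill gaps with the column median
    df[feature_cols] = df[feature_cols].fillna(df[feature_cols].median())
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(df[feature_cols])
    return df, X_scaled, scaler


# Tiny stand-in for customer_behavior.csv: two feature columns, one missing value
sample = io.StringIO(
    "customer_id,annual_income,spending_score\n"
    "C001,85000,72\n"
    "C002,32000,\n"
    "C003,120000,88\n"
)
df, X_scaled, scaler = load_and_preprocess(sample)
print(df)
print(X_scaled.round(3))
```

Returning the fitted scaler lets you transform new customers later, or call scaler.inverse_transform on cluster centers to read them in original units.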
2
Exploratory Data Analysis

Create a function perform_eda(df) that:

  • Creates correlation heatmap of all features
  • Generates distribution plots for key features
  • Creates pairplots for important feature combinations
  • Saves visualizations as eda_plots.png
def perform_eda(df):
    """Perform exploratory data analysis."""
    # Must save: eda_plots.png
    pass
3
Find Optimal K (Elbow Method)

Create a function find_optimal_k(X_scaled, k_range=(2, 11)) that:

  • Calculates inertia for different k values
  • Calculates silhouette scores for each k
  • Plots elbow curve and silhouette scores
  • Saves plot as optimal_k.png
  • Returns recommended k value
def find_optimal_k(X_scaled, k_range=(2, 11)):
    """Find optimal number of clusters using elbow method."""
    # Return: optimal_k
    pass
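The core loop of this function might look like the sketch below, with the plotting left out. Two assumptions are made here that the assignment does not state: k_range is interpreted as the arguments to range (so the default covers k = 2 to 10), and the recommended k is the one with the highest silhouette score. The data comes from make_blobs purely for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def find_optimal_k(X_scaled, k_range=(2, 11)):
    """Return the k in range(*k_range) with the highest silhouette score."""
    ks = list(range(*k_range))
    inertias, silhouettes = [], []
    for k in ks:
        km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
        inertias.append(km.inertia_)
        silhouettes.append(silhouette_score(X_scaled, km.labels_))
        print(f"k={k:2d}  inertia={inertias[-1]:10.2f}  silhouette={silhouettes[-1]:.3f}")
    # Plot inertias (elbow) and silhouettes against ks here, then save optimal_k.png
    return ks[int(np.argmax(silhouettes))]


# Synthetic data: three tight, well-separated groups
X, _ = make_blobs(n_samples=90, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=0.5, random_state=0)
optimal_k = find_optimal_k(X)
print("recommended k:", optimal_k)
```

With only 25 customers in the provided CSV, keep the upper end of k_range well below the number of rows; the silhouette score is undefined when every point is its own cluster.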
4
K-Means Clustering

Create a function perform_kmeans(X_scaled, n_clusters) that:

  • Fits K-Means with specified number of clusters
  • Returns cluster labels, cluster centers, and the silhouette score
  • Calculates and prints inertia and silhouette score
def perform_kmeans(X_scaled, n_clusters):
    """Perform K-Means clustering."""
    # Return: (labels, centers, silhouette_score)
    pass
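A minimal sketch of this step follows. Setting n_init and random_state explicitly is a choice made here for reproducibility, since the n_init default has changed across scikit-learn versions. The six-point array is made-up demo data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def perform_kmeans(X_scaled, n_clusters):
    """Fit K-Means and return (labels, centers, silhouette)."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)
    labels = km.fit_predict(X_scaled)
    sil = silhouette_score(X_scaled, labels)
    print(f"inertia={km.inertia_:.3f}, silhouette={sil:.3f}")
    return labels, km.cluster_centers_, sil


# Two obvious groups on a line
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
labels, centers, sil = perform_kmeans(X, 2)
print(labels, centers.ravel())
```

Note that the returned centers live in scaled feature space; to plot them on a PCA scatter they would need to go through the same PCA transform as the data.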
5
Hierarchical Clustering

Create a function perform_hierarchical(X_scaled, n_clusters, linkage='ward') that:

  • Creates and plots dendrogram
  • Performs agglomerative clustering
  • Saves dendrogram as dendrogram.png
  • Returns cluster labels
def perform_hierarchical(X_scaled, n_clusters, linkage='ward'):
    """Perform Hierarchical clustering."""
    # Return: labels
    pass
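The sketch below shows one way to combine SciPy (for the dendrogram) with scikit-learn (for the labels). One detail worth noticing: because the required signature has a parameter named linkage, importing SciPy's linkage function directly would be shadowed inside the function, so the sketch calls it through the hierarchy module. The dendrogram is computed with no_plot=True to keep the example free of plotting; the demo points are invented.

```python
import numpy as np
from scipy.cluster import hierarchy
from sklearn.cluster import AgglomerativeClustering


def perform_hierarchical(X_scaled, n_clusters, linkage="ward"):
    """Build the merge tree for a dendrogram, then cut it into n_clusters."""
    # Called via the module because the `linkage` argument shadows the name
    Z = hierarchy.linkage(X_scaled, method=linkage)
    # no_plot=True returns the layout only; remove it and call
    # plt.savefig("dendrogram.png") to produce the required file
    tree = hierarchy.dendrogram(Z, no_plot=True)
    print("leaf order:", tree["ivl"])
    model = AgglomerativeClustering(n_clusters=n_clusters, linkage=linkage)
    return model.fit_predict(X_scaled)


# Three pairs of nearby points
X = np.array([[0, 0], [0, 1], [10, 10], [10, 11], [20, 0], [20, 1]], dtype=float)
labels = perform_hierarchical(X, n_clusters=3)
print(labels)
```

Ward linkage works on Euclidean distances between raw observations, which fits standardized features; other linkage values accepted by both libraries include "complete", "average", and "single".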
6
DBSCAN Clustering

Create a function perform_dbscan(X_scaled, eps=0.5, min_samples=5) that:

  • Fits DBSCAN with specified parameters
  • Identifies noise points (label = -1)
  • Reports number of clusters and noise points
  • Returns cluster labels, the number of clusters, and the number of noise points
def perform_dbscan(X_scaled, eps=0.5, min_samples=5):
    """Perform DBSCAN clustering."""
    # Return: (labels, n_clusters, n_noise)
    pass
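Counting clusters and noise points from DBSCAN labels can be sketched as follows. The demo data (two dense 5x5 grids plus one isolated point) is constructed only to make the result easy to predict.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def perform_dbscan(X_scaled, eps=0.5, min_samples=5):
    """Fit DBSCAN and return (labels, n_clusters, n_noise)."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X_scaled)
    n_noise = int(np.sum(labels == -1))
    # -1 marks noise and is not a cluster, so exclude it from the count
    n_clusters = len(set(labels.tolist()) - {-1})
    print(f"clusters={n_clusters}, noise points={n_noise}")
    return labels, n_clusters, n_noise


# Two dense 5x5 grids (spacing 0.1) plus one isolated point
gx, gy = np.meshgrid(np.arange(5) * 0.1, np.arange(5) * 0.1)
grid = np.column_stack([gx.ravel(), gy.ravel()])
X = np.vstack([grid, grid + 5.0, [[20.0, 20.0]]])
labels, n_clusters, n_noise = perform_dbscan(X)
```

On a dataset as small as the 25-row CSV, the default min_samples=5 may label many customers as noise, which is one reason the next step tunes the parameters.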
7
Tune DBSCAN Parameters

Create a function tune_dbscan(X_scaled) that:

  • Uses k-distance graph to find optimal eps
  • Tests different min_samples values
  • Returns best parameters based on silhouette score
def tune_dbscan(X_scaled):
    """Find optimal DBSCAN parameters."""
    # Return: (best_eps, best_min_samples)
    pass
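A rough sketch of one tuning strategy is given below. It derives candidate eps values from percentiles of the k-distance curve and scores each (eps, min_samples) pair by silhouette on the non-noise points. The extra keyword arguments, the percentile choices, and the decision to exclude noise from the score are all assumptions of this sketch, not requirements of the assignment. Be aware that scoring only non-noise points can favor settings that discard many points as noise.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score
from sklearn.neighbors import NearestNeighbors


def tune_dbscan(X_scaled, min_samples_grid=(3, 4, 5), percentiles=(50, 75, 90, 95)):
    """Grid-search eps (taken from k-distances) and min_samples by silhouette."""
    best_eps, best_ms, best_score = None, None, -1.0
    for ms in min_samples_grid:
        # Distance to the ms-th nearest point, counting the point itself,
        # which matches how DBSCAN counts min_samples
        nn = NearestNeighbors(n_neighbors=ms).fit(X_scaled)
        dists, _ = nn.kneighbors(X_scaled)
        k_dist = np.sort(dists[:, -1])  # plot this curve to look for the knee
        for eps in np.percentile(k_dist, percentiles):
            if eps <= 0:
                continue  # DBSCAN requires a positive eps
            labels = DBSCAN(eps=eps, min_samples=ms).fit_predict(X_scaled)
            mask = labels != -1
            n_clusters = len(set(labels[mask].tolist()))
            if n_clusters < 2 or mask.sum() <= n_clusters:
                continue  # silhouette is undefined for this labeling
            score = silhouette_score(X_scaled[mask], labels[mask])
            if score > best_score:
                best_eps, best_ms, best_score = float(eps), ms, score
    return best_eps, best_ms


# Synthetic demo: two Gaussian blobs far apart
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(5, 0.3, (30, 2))])
best_eps, best_min_samples = tune_dbscan(X)
print(best_eps, best_min_samples)
```

If no combination yields at least two clusters, this sketch returns (None, None), so a caller should check for that before passing the values to DBSCAN.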
8
Apply PCA

Create a function apply_pca(X_scaled, n_components=2) that:

  • Applies PCA for dimensionality reduction
  • Calculates and displays explained variance ratio
  • Plots cumulative explained variance
  • Returns transformed data, the PCA object, and the explained variance ratio
def apply_pca(X_scaled, n_components=2):
    """Apply PCA for dimensionality reduction."""
    # Return: (X_pca, pca, explained_variance_ratio)
    pass
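The PCA step could be sketched like this, with the cumulative variance printed rather than plotted. The synthetic data has three features, one of which is nearly a copy of another, so two components should capture almost all of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA


def apply_pca(X_scaled, n_components=2):
    """Project onto the leading principal components."""
    pca = PCA(n_components=n_components)
    X_pca = pca.fit_transform(X_scaled)
    ratio = pca.explained_variance_ratio_
    print("explained variance ratio:", ratio.round(4))
    print("cumulative:              ", np.cumsum(ratio).round(4))
    return X_pca, pca, ratio


# Three features; the third is almost identical to the first
rng = np.random.default_rng(0)
a = rng.normal(size=100)
b = rng.normal(size=100)
X = np.column_stack([a, b, a + 0.01 * rng.normal(size=100)])
X_pca, pca, ratio = apply_pca(X)
```

To plot the full cumulative curve across all components, fit a second PCA without n_components and apply np.cumsum to its explained_variance_ratio_.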
9
Visualize Clusters

Create a function visualize_clusters(X_pca, labels, title, filename) that:

  • Creates 2D scatter plot using PCA components
  • Colors points by cluster assignment
  • Adds cluster centers if applicable
  • Saves plot with given filename
def visualize_clusters(X_pca, labels, title, filename):
    """Visualize clusters in 2D PCA space."""
    # Must save: {filename}.png
    pass
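A plotting sketch for this function follows. The optional centers argument is a hypothetical addition (the required signature has no such parameter), and the Agg backend line is only there so the example runs without a display; in a notebook you would normally drop it. The demo writes to a temporary directory.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # render to file without a display; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np


def visualize_clusters(X_pca, labels, title, filename, centers=None):
    """Scatter the first two PCA components colored by cluster; save as PNG."""
    labels = np.asarray(labels)
    fig, ax = plt.subplots(figsize=(6, 5))
    for lab in np.unique(labels):
        pts = X_pca[labels == lab]
        name = "noise" if lab == -1 else f"cluster {lab}"
        ax.scatter(pts[:, 0], pts[:, 1], s=30, label=name)
    if centers is not None:
        # Hypothetical extra: centers must already be projected into PCA space
        ax.scatter(centers[:, 0], centers[:, 1], marker="X", c="black", s=120,
                   label="centers")
    ax.set(xlabel="PC1", ylabel="PC2", title=title)
    ax.legend()
    path = f"{filename}.png"
    fig.savefig(path, dpi=150, bbox_inches="tight")
    plt.close(fig)
    return path


# Demo with made-up 2D points and labels (including one noise point)
X_demo = np.array([[0.0, 0.0], [0.2, 0.1], [3.0, 3.0], [3.1, 2.9], [9.0, -4.0]])
demo_labels = [0, 0, 1, 1, -1]
out = os.path.join(tempfile.mkdtemp(), "demo_clusters")
saved_path = visualize_clusters(X_demo, demo_labels, "Demo", out)
print("saved:", saved_path)
```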
10
Compare Clustering Methods

Create a function compare_methods(X_scaled, X_pca, kmeans_labels, hierarchical_labels, dbscan_labels) that:

  • Calculates silhouette scores for all methods
  • Creates side-by-side cluster visualizations
  • Saves comparison as clustering_comparison.png
  • Returns comparison DataFrame
def compare_methods(X_scaled, X_pca, kmeans_labels, hierarchical_labels, dbscan_labels):
    """Compare different clustering methods."""
    # Return: comparison_df
    pass
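The scoring half of this comparison might be sketched as below; the side-by-side plots are omitted. The helper names are hypothetical, and the sketch assumes that DBSCAN noise points (label -1) are excluded before computing the silhouette, returning NaN when fewer than two clusters remain.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import silhouette_score


def silhouette_or_nan(X, labels):
    """Silhouette over non-noise points, or NaN when it is undefined."""
    labels = np.asarray(labels)
    mask = labels != -1
    n_clusters = len(set(labels[mask].tolist()))
    if n_clusters < 2 or mask.sum() <= n_clusters:
        return float("nan")
    return float(silhouette_score(X[mask], labels[mask]))


def compare_scores(X_scaled, label_sets):
    """Build one summary row per method from a {name: labels} mapping."""
    rows = []
    for name, labels in label_sets.items():
        labels = np.asarray(labels)
        rows.append({
            "method": name,
            "n_clusters": len(set(labels.tolist()) - {-1}),
            "n_noise": int(np.sum(labels == -1)),
            "silhouette": silhouette_or_nan(X_scaled, labels),
        })
    return pd.DataFrame(rows)


# Made-up points and label sets for illustration
X = np.array([[0.0, 0], [0.1, 0], [0.2, 0], [5.0, 0], [5.1, 0], [5.2, 0]])
comparison_df = compare_scores(X, {
    "K-Means": [0, 0, 0, 1, 1, 1],
    "DBSCAN": [0, 0, -1, 1, 1, -1],
})
print(comparison_df)
```

Because the DBSCAN score is computed on fewer points, report n_noise alongside it so the comparison with K-Means and hierarchical clustering stays honest.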
11
Profile Clusters

Create a function profile_clusters(df, labels, feature_names) that:

  • Calculates mean values for each cluster
  • Identifies distinguishing characteristics
  • Creates radar/spider charts for segment profiles
  • Returns cluster profile DataFrame
def profile_clusters(df, labels, feature_names):
    """Create detailed profiles for each cluster."""
    # Return: cluster_profiles_df
    pass
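Cluster means and a simple notion of "distinguishing characteristics" can be sketched with pandas as follows. Ranking features by how far a cluster mean sits from the overall mean, in units of the overall standard deviation, is one assumption among several reasonable choices; the helper distinguishing_features is hypothetical, and the radar charts are not shown. The demo uses four rows and three columns taken from the dataset above, with invented labels.

```python
import pandas as pd


def profile_clusters(df, labels, feature_names):
    """Return per-cluster size and feature means."""
    data = df[feature_names].copy()
    data["cluster"] = labels
    profiles = data.groupby("cluster")[feature_names].mean()
    profiles.insert(0, "size", data.groupby("cluster").size())
    return profiles


def distinguishing_features(df, profiles, feature_names, top_n=2):
    """Hypothetical helper: features whose cluster mean deviates most overall."""
    z = (profiles[feature_names] - df[feature_names].mean()) / df[feature_names].std()
    return {c: z.loc[c].abs().nlargest(top_n).index.tolist() for c in z.index}


# Rows C001, C002, C003, C006 from the dataset; labels are made up
df_demo = pd.DataFrame({
    "customer_id": ["C001", "C002", "C003", "C006"],
    "annual_income": [85000, 32000, 120000, 28000],
    "spending_score": [72, 35, 88, 25],
    "days_since_last_purchase": [5, 45, 2, 62],
})
features = ["annual_income", "spending_score", "days_since_last_purchase"]
profiles = profile_clusters(df_demo, [0, 1, 0, 1], features)
print(profiles)
print(distinguishing_features(df_demo, profiles, features))
```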
12
Generate Business Recommendations

Create a function generate_recommendations(cluster_profiles) that:

  • Assigns descriptive names to each segment
  • Creates marketing strategy recommendations per segment
  • Suggests targeted campaigns based on behavior
  • Saves recommendations to segment_recommendations.txt
def generate_recommendations(cluster_profiles):
    """Generate business recommendations for each segment."""
    # Must save: segment_recommendations.txt
    pass
13
Main Pipeline

Create a main() function that:

  • Runs the complete segmentation pipeline
  • Applies all clustering methods
  • Generates all required visualizations
  • Produces final segment recommendations
def main():
    # 1. Load and preprocess data
    df, X_scaled, scaler = load_and_preprocess("customer_behavior.csv")
    feature_names = df.columns.drop('customer_id').tolist()
    
    # 2. Exploratory data analysis
    perform_eda(df)
    
    # 3. Find optimal k
    optimal_k = find_optimal_k(X_scaled)
    print(f"Optimal number of clusters: {optimal_k}")
    
    # 4. Apply PCA
    X_pca, pca, variance_ratio = apply_pca(X_scaled)
    print(f"Explained variance (2 components): {sum(variance_ratio):.2%}")
    
    # 5. K-Means clustering
    kmeans_labels, centers, kmeans_silhouette = perform_kmeans(X_scaled, optimal_k)
    visualize_clusters(X_pca, kmeans_labels, "K-Means Clustering", "kmeans_clusters")
    
    # 6. Hierarchical clustering
    hierarchical_labels = perform_hierarchical(X_scaled, optimal_k)
    visualize_clusters(X_pca, hierarchical_labels, "Hierarchical Clustering", "hierarchical_clusters")
    
    # 7. DBSCAN clustering
    best_eps, best_min_samples = tune_dbscan(X_scaled)
    dbscan_labels, n_clusters, n_noise = perform_dbscan(X_scaled, best_eps, best_min_samples)
    visualize_clusters(X_pca, dbscan_labels, "DBSCAN Clustering", "dbscan_clusters")
    
    # 8. Compare methods
    comparison_df = compare_methods(X_scaled, X_pca, kmeans_labels, hierarchical_labels, dbscan_labels)
    print("\nClustering Comparison:")
    print(comparison_df)
    
    # 9. Profile clusters (K-Means labels shown; substitute the labels of your best method)
    cluster_profiles = profile_clusters(df, kmeans_labels, feature_names)
    print("\nCluster Profiles:")
    print(cluster_profiles)
    
    # 10. Generate recommendations
    generate_recommendations(cluster_profiles)
    print("\nRecommendations saved to segment_recommendations.txt")
    
    # 11. Save customer segments
    df['segment'] = kmeans_labels
    df.to_csv('customer_segments.csv', index=False)
    print("Customer segments saved to customer_segments.csv")

if __name__ == "__main__":
    main()
05

Submission

Create a public GitHub repository with the exact name shown below:

Required Repository Name
customer-segmentation-clustering
github.com/<your-username>/customer-segmentation-clustering
Required Files
customer-segmentation-clustering/
├── customer_segmentation.ipynb    # Your Jupyter Notebook with ALL 13 functions
├── customer_behavior.csv          # Input dataset (as provided or extended)
├── eda_plots.png                  # EDA visualizations
├── optimal_k.png                  # Elbow curve and silhouette plots
├── dendrogram.png                 # Hierarchical clustering dendrogram
├── clustering_comparison.png      # Side-by-side cluster comparison
├── customer_segments.csv          # Final customer data with segment labels
├── segment_recommendations.txt    # Business recommendations per segment
└── README.md                      # REQUIRED - see contents below
README.md Must Include:
  • Your full name and submission date
  • Summary of all clustering methods used and their results
  • Your optimal number of clusters and how you determined it
  • Description of each customer segment discovered
  • Any challenges faced and how you solved them
  • Instructions to run your notebook
Do Include
  • All 13 functions implemented and working
  • Docstrings for every function
  • Clear visualizations with labels and titles
  • Feature scaling before clustering
  • Multiple clustering methods compared
  • README.md with segment descriptions
Do Not Include
  • Any .pyc or __pycache__ files (use .gitignore)
  • Virtual environment folders
  • Large model pickle files
  • Code that raises errors when run
  • Hardcoded file paths
Important: Before submitting, run all cells in your notebook to make sure it executes without errors and generates all output files correctly!

06

Grading Rubric

Your assignment will be graded on the following criteria:

Criteria                  Points  Description
K-Means Implementation    30      Elbow method, clustering, and silhouette validation
Hierarchical Clustering   25      Dendrogram visualization and agglomerative clustering
DBSCAN                    25      Parameter tuning and noise point identification
PCA & Visualization       25      Dimensionality reduction and cluster visualizations
Data Preprocessing        20      Proper feature scaling and handling
Cluster Profiling         30      Segment analysis and business recommendations
Code Quality              45      Docstrings, comments, naming conventions, and organization
Total                     200

07

What You Will Practice

K-Means Clustering (4.1)

Partitioning data, choosing optimal k, centroid-based learning

Hierarchical Methods (4.2)

Dendrograms, linkage strategies, agglomerative clustering

Density-Based Clustering (4.3)

DBSCAN algorithm, handling noise, arbitrary cluster shapes

Dimensionality Reduction (4.4)

PCA for visualization, explained variance, feature extraction

08

Pro Tips

Clustering Best Practices
  • Always scale features before clustering
  • Use multiple methods to validate cluster structure
  • Silhouette score helps assess cluster quality
  • Project to 2D with PCA for visualization when you have more than two features
Method Selection
  • K-Means: Fast, works well for spherical clusters
  • Hierarchical: Good for understanding relationships
  • DBSCAN: Best for arbitrary shapes and outliers
  • Compare all three for robust insights
Validation Metrics
  • Silhouette Score: -1 to 1, higher is better
  • Inertia: Lower means tighter clusters, but it keeps decreasing as k grows, so lower is not always better
  • Calinski-Harabasz: Higher means denser, well-separated
  • Davies-Bouldin: Lower is better separation
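The four metrics listed above can be computed side by side with scikit-learn. The sketch below fits K-Means with k=3 and k=6 on synthetic data that has three true groups, to illustrate that inertia improves with the larger k while the other three metrics prefer k=3. The data and the helper name are illustrative only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)

# Synthetic data with three well-separated groups
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [8, 8], [-8, 8]],
                  cluster_std=0.6, random_state=0)


def clustering_scores(k):
    """Fit K-Means with k clusters and collect the four validation metrics."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    return {
        "inertia": km.inertia_,                                        # lower = tighter
        "silhouette": silhouette_score(X, km.labels_),                 # higher is better
        "calinski_harabasz": calinski_harabasz_score(X, km.labels_),   # higher is better
        "davies_bouldin": davies_bouldin_score(X, km.labels_),         # lower is better
    }


good, over_split = clustering_scores(3), clustering_scores(6)
for name in good:
    print(f"{name:18s} k=3: {good[name]:10.3f}   k=6: {over_split[name]:10.3f}")
```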
Common Mistakes
  • Forgetting to scale features
  • Using accuracy metrics (no labels in unsupervised!)
  • Choosing k based only on elbow without silhouette
  • Not interpreting what clusters actually mean
09

Pre-Submission Checklist

Code Requirements
Repository Requirements