Assignment Overview
In this assignment, you will build a complete Customer Segmentation System using various unsupervised learning algorithms. This comprehensive project requires you to apply ALL concepts from Module 4: K-Means Clustering, Hierarchical Clustering, DBSCAN, Principal Component Analysis (PCA), cluster validation metrics, and translating cluster insights into actionable business recommendations.
You will use pandas, numpy, matplotlib, seaborn, scikit-learn, and scipy for this assignment.
- K-Means (4.1): Centroid-based clustering, elbow method, inertia
- Hierarchical (4.2): Dendrograms, linkage methods, agglomerative clustering
- DBSCAN (4.3): Density-based clustering, eps, min_samples
- PCA (4.4): Dimensionality reduction, explained variance
The Scenario
ShopSmart Retail Analytics
You have been hired as a Data Scientist at ShopSmart Retail, a large e-commerce company looking to improve their marketing strategies through customer segmentation. The Chief Marketing Officer has given you this task:
"We have data on customer purchasing behavior, but we're treating everyone the same with our marketing. We need to identify distinct customer segments so we can tailor our campaigns. Use clustering to find natural groupings in our customer base and tell us what makes each segment unique!"
Your Task
Create a Jupyter Notebook called customer_segmentation.ipynb that implements multiple
clustering algorithms, validates cluster quality, reduces dimensionality for visualization,
and provides actionable segment profiles for the marketing team.
The Dataset
You will work with a Customer Behavior dataset. Create this CSV file as shown below:
File: customer_behavior.csv (Customer Data)
```
customer_id,annual_income,spending_score,avg_purchase_value,purchase_frequency,days_since_last_purchase,total_purchases,product_categories_bought,avg_session_duration,website_visits_monthly,email_open_rate,age
C001,85000,72,145.50,24,5,156,8,12.5,18,0.45,34
C002,32000,35,42.00,6,45,28,3,4.2,5,0.12,22
C003,120000,88,285.00,36,2,432,12,18.3,28,0.68,45
C004,45000,42,68.50,12,21,89,5,6.8,9,0.25,28
C005,95000,78,198.00,28,7,224,10,15.2,22,0.52,38
C006,28000,25,35.00,4,62,18,2,3.1,3,0.08,19
C007,150000,92,425.00,48,1,576,15,22.5,35,0.78,52
C008,55000,55,95.00,18,14,145,7,9.5,14,0.35,31
C009,38000,38,52.00,8,38,52,4,5.2,6,0.18,25
C010,180000,95,520.00,52,1,624,18,25.8,42,0.85,58
C011,42000,40,62.00,10,28,75,4,5.8,8,0.22,26
C012,105000,82,225.00,32,4,298,11,16.8,25,0.58,42
C013,25000,22,28.00,3,75,12,2,2.5,2,0.05,20
C014,78000,68,135.00,22,9,178,8,11.2,16,0.42,35
C015,62000,58,108.00,20,12,165,7,10.2,15,0.38,33
C016,135000,90,355.00,42,2,485,14,20.5,32,0.72,48
C017,48000,45,75.00,14,18,112,6,7.5,11,0.28,29
C018,22000,18,22.00,2,88,8,1,1.8,1,0.02,18
C019,88000,75,165.00,26,6,198,9,13.8,20,0.48,36
C020,165000,93,465.00,50,1,598,16,24.2,38,0.82,55
C021,35000,32,48.00,7,42,38,3,4.5,5,0.15,24
C022,72000,62,118.00,19,11,152,7,10.8,14,0.40,32
C023,98000,80,188.00,30,5,268,10,14.8,23,0.55,40
C024,30000,28,38.00,5,55,25,3,3.8,4,0.10,21
C025,112000,85,248.00,35,3,358,12,17.5,27,0.62,44
```
Columns Explained
- customer_id: Unique identifier (string)
- annual_income: Estimated annual income in USD (integer)
- spending_score: Score assigned based on spending behavior, 1-100 (integer)
- avg_purchase_value: Average transaction value in USD (float)
- purchase_frequency: Number of purchases per year (integer)
- days_since_last_purchase: Recency of last purchase in days (integer)
- total_purchases: Lifetime total purchases (integer)
- product_categories_bought: Number of different categories explored (integer)
- avg_session_duration: Average time spent on site in minutes (float)
- website_visits_monthly: Average monthly site visits (integer)
- email_open_rate: Proportion of marketing emails opened, 0-1 (float)
- age: Customer age (integer)
Requirements
Your customer_segmentation.ipynb must implement ALL of the following functions.
Each function is mandatory and will be tested individually.
Load and Preprocess Data
Create a function load_and_preprocess(filename) that:
- Loads the CSV file using pandas
- Handles any missing values
- Standardizes features using StandardScaler
- Returns original df, scaled features, and scaler object
```python
def load_and_preprocess(filename):
    """Load dataset and standardize features."""
    # Must return: (df, X_scaled, scaler)
    pass
```
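One way this function might be filled in. This is a sketch, not the required solution: dropping rows with missing values is one simple strategy (imputation is equally valid), and `customer_id` is excluded from scaling because it is an identifier, not a feature.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

def load_and_preprocess(filename):
    """Load the customer CSV and standardize the numeric features."""
    df = pd.read_csv(filename)
    # Drop rows with missing values (one simple strategy; imputation also works)
    df = df.dropna().reset_index(drop=True)
    # customer_id is an identifier, not a feature, so exclude it from scaling
    features = df.drop(columns=["customer_id"])
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(features)
    return df, X_scaled, scaler
```

Returning the fitted scaler lets you later transform new customers onto the same scale.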
Exploratory Data Analysis
Create a function perform_eda(df) that:
- Creates correlation heatmap of all features
- Generates distribution plots for key features
- Creates pairplots for important feature combinations
- Saves visualizations as eda_plots.png
```python
def perform_eda(df):
    """Perform exploratory data analysis."""
    # Must save: eda_plots.png
    pass
```
Find Optimal K (Elbow Method)
Create a function find_optimal_k(X_scaled, k_range=(2, 11)) that:
- Calculates inertia for different k values
- Calculates silhouette scores for each k
- Plots elbow curve and silhouette scores
- Saves plot as optimal_k.png
- Returns recommended k value
```python
def find_optimal_k(X_scaled, k_range=(2, 11)):
    """Find optimal number of clusters using elbow method."""
    # Return: optimal_k
    pass
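A possible sketch of this function. Picking the silhouette maximum as the recommended k is one simple, automatable heuristic; you may instead read the elbow visually and justify your choice. The `random_state=42` and `n_init=10` values here are arbitrary choices for reproducibility.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def find_optimal_k(X_scaled, k_range=(2, 11)):
    """Plot the elbow curve and silhouette scores; return the k with the best silhouette."""
    ks = list(range(k_range[0], k_range[1]))
    inertias, silhouettes = [], []
    for k in ks:
        km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
        inertias.append(km.inertia_)
        silhouettes.append(silhouette_score(X_scaled, km.labels_))
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.plot(ks, inertias, marker="o")
    ax1.set(xlabel="k", ylabel="Inertia", title="Elbow curve")
    ax2.plot(ks, silhouettes, marker="o")
    ax2.set(xlabel="k", ylabel="Silhouette score", title="Silhouette scores")
    fig.tight_layout()
    fig.savefig("optimal_k.png")
    plt.close(fig)
    # Simple heuristic: take the k that maximizes the silhouette score
    return ks[int(np.argmax(silhouettes))]
```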
K-Means Clustering
Create a function perform_kmeans(X_scaled, n_clusters) that:
- Fits K-Means with specified number of clusters
- Returns cluster labels and cluster centers
- Calculates and prints inertia and silhouette score
```python
def perform_kmeans(X_scaled, n_clusters):
    """Perform K-Means clustering."""
    # Return: (labels, centers, silhouette_score)
    pass
```
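A minimal sketch of this function (the `random_state` and `n_init` values are arbitrary choices for reproducibility):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def perform_kmeans(X_scaled, n_clusters):
    """Fit K-Means and report inertia and silhouette score."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=42).fit(X_scaled)
    sil = silhouette_score(X_scaled, km.labels_)
    print(f"Inertia: {km.inertia_:.2f}  Silhouette: {sil:.3f}")
    return km.labels_, km.cluster_centers_, sil
```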
Hierarchical Clustering
Create a function perform_hierarchical(X_scaled, n_clusters, linkage='ward') that:
- Creates and plots dendrogram
- Performs agglomerative clustering
- Saves dendrogram as dendrogram.png
- Returns cluster labels
```python
def perform_hierarchical(X_scaled, n_clusters, linkage='ward'):
    """Perform Hierarchical clustering."""
    # Return: labels
    pass
```
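A possible sketch. Note that scipy's `linkage` function is accessed through the `hierarchy` namespace so it does not collide with the function's `linkage` argument; scipy builds the merge tree for the dendrogram, while scikit-learn's `AgglomerativeClustering` produces the flat labels.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
from scipy.cluster import hierarchy
from sklearn.cluster import AgglomerativeClustering

def perform_hierarchical(X_scaled, n_clusters, linkage="ward"):
    """Plot a dendrogram, then cut the tree into n_clusters via agglomerative clustering."""
    # Full merge tree for the dendrogram plot
    Z = hierarchy.linkage(X_scaled, method=linkage)
    fig, ax = plt.subplots(figsize=(10, 5))
    hierarchy.dendrogram(Z, ax=ax)
    ax.set(title=f"Dendrogram ({linkage} linkage)", xlabel="Sample index", ylabel="Distance")
    fig.savefig("dendrogram.png")
    plt.close(fig)
    # Flat cluster assignments at the requested number of clusters
    model = AgglomerativeClustering(n_clusters=n_clusters, linkage=linkage)
    return model.fit_predict(X_scaled)
```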
DBSCAN Clustering
Create a function perform_dbscan(X_scaled, eps=0.5, min_samples=5) that:
- Fits DBSCAN with specified parameters
- Identifies noise points (label = -1)
- Reports number of clusters and noise points
- Returns cluster labels
```python
def perform_dbscan(X_scaled, eps=0.5, min_samples=5):
    """Perform DBSCAN clustering."""
    # Return: (labels, n_clusters, n_noise)
    pass
```
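A minimal sketch. DBSCAN labels noise points as -1, so they must be excluded when counting clusters:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def perform_dbscan(X_scaled, eps=0.5, min_samples=5):
    """Fit DBSCAN; points labeled -1 are noise."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X_scaled)
    # The -1 label marks noise, not a cluster
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int(np.sum(labels == -1))
    print(f"Clusters found: {n_clusters}  Noise points: {n_noise}")
    return labels, n_clusters, n_noise
```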
Tune DBSCAN Parameters
Create a function tune_dbscan(X_scaled) that:
- Uses k-distance graph to find optimal eps
- Tests different min_samples values
- Returns best parameters based on silhouette score
```python
def tune_dbscan(X_scaled):
    """Find optimal DBSCAN parameters."""
    # Return: (best_eps, best_min_samples)
    pass
```
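One possible approach, sketched below. Instead of locating the k-distance knee visually, it tries a few quantiles of the k-distance curve as eps candidates; the candidate grids (`min_samples` in 3/5/10, quantiles 0.80/0.90/0.95) are arbitrary choices you should tune for your data. The silhouette is computed on non-noise points only.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score
from sklearn.neighbors import NearestNeighbors

def tune_dbscan(X_scaled):
    """Grid-search eps (from the k-distance curve) and min_samples by silhouette score."""
    best_eps, best_ms, best_score = None, None, -1.0
    for min_samples in (3, 5, 10):
        # k-distance graph: distance to the min_samples-th nearest neighbor, sorted
        nn = NearestNeighbors(n_neighbors=min_samples).fit(X_scaled)
        dists = np.sort(nn.kneighbors(X_scaled)[0][:, -1])
        # Try a few quantiles of the curve as eps candidates (stand-in for the knee)
        for q in (0.80, 0.90, 0.95):
            eps = float(np.quantile(dists, q))
            if eps <= 0:
                continue
            labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X_scaled)
            mask = labels != -1  # score only the clustered (non-noise) points
            if len(set(labels[mask])) < 2:
                continue  # silhouette needs at least two clusters
            score = silhouette_score(X_scaled[mask], labels[mask])
            if score > best_score:
                best_eps, best_ms, best_score = eps, min_samples, score
    return best_eps, best_ms
```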
Apply PCA
Create a function apply_pca(X_scaled, n_components=2) that:
- Applies PCA for dimensionality reduction
- Calculates and displays explained variance ratio
- Plots cumulative explained variance
- Returns transformed data and PCA object
```python
def apply_pca(X_scaled, n_components=2):
    """Apply PCA for dimensionality reduction."""
    # Return: (X_pca, pca, explained_variance_ratio)
    pass
```
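A possible sketch. The output filename `pca_variance.png` is an assumption here; the spec does not name a file for the cumulative-variance plot, so choose what fits your notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

def apply_pca(X_scaled, n_components=2):
    """Project onto the top principal components and plot cumulative explained variance."""
    pca = PCA(n_components=n_components)
    X_pca = pca.fit_transform(X_scaled)
    ratios = pca.explained_variance_ratio_
    print("Explained variance ratio:", np.round(ratios, 3))
    fig, ax = plt.subplots()
    ax.plot(range(1, len(ratios) + 1), np.cumsum(ratios), marker="o")
    ax.set(xlabel="Number of components", ylabel="Cumulative explained variance")
    fig.savefig("pca_variance.png")  # assumed filename, not mandated by the spec
    plt.close(fig)
    return X_pca, pca, ratios
```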
Visualize Clusters
Create a function visualize_clusters(X_pca, labels, title, filename) that:
- Creates 2D scatter plot using PCA components
- Colors points by cluster assignment
- Adds cluster centers if applicable
- Saves plot with given filename
```python
def visualize_clusters(X_pca, labels, title, filename):
    """Visualize clusters in 2D PCA space."""
    # Must save: a .png named by the filename argument
    pass
```
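A minimal sketch of the scatter plot. For brevity it colors points with a colormap and omits the optional cluster-center overlay the spec mentions; DBSCAN's -1 noise label simply gets its own color.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import numpy as np

def visualize_clusters(X_pca, labels, title, filename):
    """Scatter the first two PCA components, colored by cluster assignment."""
    labels = np.asarray(labels)
    fig, ax = plt.subplots(figsize=(7, 5))
    scatter = ax.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, cmap="viridis", s=30)
    fig.colorbar(scatter, ax=ax, label="Cluster")
    ax.set(title=title, xlabel="PC1", ylabel="PC2")
    fig.savefig(f"{filename}.png")
    plt.close(fig)
```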
Compare Clustering Methods
Create a function compare_methods(X_scaled, X_pca, kmeans_labels, hierarchical_labels, dbscan_labels) that:
- Calculates silhouette scores for all methods
- Creates side-by-side cluster visualizations
- Saves comparison as clustering_comparison.png
- Returns comparison DataFrame
```python
def compare_methods(X_scaled, X_pca, kmeans_labels, hierarchical_labels, dbscan_labels):
    """Compare different clustering methods."""
    # Return: comparison_df
    pass
```
Profile Clusters
Create a function profile_clusters(df, labels, feature_names) that:
- Calculates mean values for each cluster
- Identifies distinguishing characteristics
- Creates radar/spider charts for segment profiles
- Returns cluster profile DataFrame
```python
def profile_clusters(df, labels, feature_names):
    """Create detailed profiles for each cluster."""
    # Return: cluster_profiles_df
    pass
```
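A partial sketch of the profiling logic. It computes per-cluster means and flags each cluster's most distinguishing features by how far the cluster mean sits from the overall mean (in overall standard deviations); the radar/spider charts the spec asks for are omitted here for brevity.

```python
import pandas as pd

def profile_clusters(df, labels, feature_names):
    """Mean of each feature per cluster, plus a z-score to flag distinguishing traits."""
    data = df[feature_names].copy()
    data["cluster"] = labels
    profiles = data.groupby("cluster").mean()
    # Distance of each cluster mean from the overall mean, in overall std units
    z = (profiles - df[feature_names].mean()) / df[feature_names].std()
    for cluster, row in z.iterrows():
        top = row.abs().sort_values(ascending=False).head(3)
        print(f"Cluster {cluster} stands out on: {list(top.index)}")
    return profiles
```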
Generate Business Recommendations
Create a function generate_recommendations(cluster_profiles) that:
- Assigns descriptive names to each segment
- Creates marketing strategy recommendations per segment
- Suggests targeted campaigns based on behavior
- Saves recommendations to segment_recommendations.txt
```python
def generate_recommendations(cluster_profiles):
    """Generate business recommendations for each segment."""
    # Must save: segment_recommendations.txt
    pass
```
Main Pipeline
Create a main() function that:
- Runs the complete segmentation pipeline
- Applies all clustering methods
- Generates all required visualizations
- Produces final segment recommendations
```python
def main():
    # 1. Load and preprocess data
    df, X_scaled, scaler = load_and_preprocess("customer_behavior.csv")
    feature_names = df.columns.drop('customer_id').tolist()

    # 2. Exploratory data analysis
    perform_eda(df)

    # 3. Find optimal k
    optimal_k = find_optimal_k(X_scaled)
    print(f"Optimal number of clusters: {optimal_k}")

    # 4. Apply PCA
    X_pca, pca, variance_ratio = apply_pca(X_scaled)
    print(f"Explained variance (2 components): {sum(variance_ratio):.2%}")

    # 5. K-Means clustering
    kmeans_labels, centers, kmeans_silhouette = perform_kmeans(X_scaled, optimal_k)
    visualize_clusters(X_pca, kmeans_labels, "K-Means Clustering", "kmeans_clusters")

    # 6. Hierarchical clustering
    hierarchical_labels = perform_hierarchical(X_scaled, optimal_k)
    visualize_clusters(X_pca, hierarchical_labels, "Hierarchical Clustering", "hierarchical_clusters")

    # 7. DBSCAN clustering
    best_eps, best_min_samples = tune_dbscan(X_scaled)
    dbscan_labels, n_clusters, n_noise = perform_dbscan(X_scaled, best_eps, best_min_samples)
    visualize_clusters(X_pca, dbscan_labels, "DBSCAN Clustering", "dbscan_clusters")

    # 8. Compare methods
    comparison_df = compare_methods(X_scaled, X_pca, kmeans_labels, hierarchical_labels, dbscan_labels)
    print("\nClustering Comparison:")
    print(comparison_df)

    # 9. Profile clusters (using best method)
    cluster_profiles = profile_clusters(df, kmeans_labels, feature_names)
    print("\nCluster Profiles:")
    print(cluster_profiles)

    # 10. Generate recommendations
    generate_recommendations(cluster_profiles)
    print("\nRecommendations saved to segment_recommendations.txt")

    # 11. Save customer segments
    df['segment'] = kmeans_labels
    df.to_csv('customer_segments.csv', index=False)
    print("Customer segments saved to customer_segments.csv")


if __name__ == "__main__":
    main()
```
Submission
Create a public GitHub repository with the exact name shown below:
Required Repository Name
customer-segmentation-clustering
Required Files
customer-segmentation-clustering/
├── customer_segmentation.ipynb # Your Jupyter Notebook with ALL 13 functions
├── customer_behavior.csv # Input dataset (as provided or extended)
├── eda_plots.png # EDA visualizations
├── optimal_k.png # Elbow curve and silhouette plots
├── dendrogram.png # Hierarchical clustering dendrogram
├── clustering_comparison.png # Side-by-side cluster comparison
├── customer_segments.csv # Final customer data with segment labels
├── segment_recommendations.txt # Business recommendations per segment
└── README.md # REQUIRED - see contents below
README.md Must Include:
- Your full name and submission date
- Summary of all clustering methods used and their results
- Your optimal number of clusters and how you determined it
- Description of each customer segment discovered
- Any challenges faced and how you solved them
- Instructions to run your notebook
Do Include
- All 13 functions implemented and working
- Docstrings for every function
- Clear visualizations with labels and titles
- Feature scaling before clustering
- Multiple clustering methods compared
- README.md with segment descriptions
Do Not Include
- Any .pyc or __pycache__ files (use .gitignore)
- Virtual environment folders
- Large model pickle files
- Code that doesn't run without errors
- Hardcoded file paths
Grading Rubric
Your assignment will be graded on the following criteria:
| Criteria | Points | Description |
|---|---|---|
| K-Means Implementation | 30 | Elbow method, clustering, and silhouette validation |
| Hierarchical Clustering | 25 | Dendrogram visualization and agglomerative clustering |
| DBSCAN | 25 | Parameter tuning and noise point identification |
| PCA & Visualization | 25 | Dimensionality reduction and cluster visualizations |
| Data Preprocessing | 20 | Proper feature scaling and handling |
| Cluster Profiling | 30 | Segment analysis and business recommendations |
| Code Quality | 45 | Docstrings, comments, naming conventions, and organization |
| Total | 200 | |
Ready to Submit?
Make sure you have completed all requirements and reviewed the grading rubric above.
What You Will Practice
- K-Means Clustering (4.1): Partitioning data, choosing optimal k, centroid-based learning
- Hierarchical Methods (4.2): Dendrograms, linkage strategies, agglomerative clustering
- Density-Based Clustering (4.3): DBSCAN algorithm, handling noise, arbitrary cluster shapes
- Dimensionality Reduction (4.4): PCA for visualization, explained variance, feature extraction
Pro Tips
Clustering Best Practices
- Always scale features before clustering
- Use multiple methods to validate cluster structure
- Silhouette score helps assess cluster quality
- Visualize with PCA even if you have more features
Method Selection
- K-Means: Fast, works well for spherical clusters
- Hierarchical: Good for understanding relationships
- DBSCAN: Best for arbitrary shapes and outliers
- Compare all three for robust insights
Validation Metrics
- Silhouette Score: ranges from -1 to 1; higher is better
- Inertia: lower means tighter clusters, but it always decreases as k grows, so it cannot pick k on its own
- Calinski-Harabasz: higher indicates denser, better-separated clusters
- Davies-Bouldin: lower indicates better separation
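All three label-free metrics above are available in scikit-learn; a quick illustration on synthetic data (the blob centers are arbitrary, chosen only to give well-separated clusters):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (calinski_harabasz_score,
                             davies_bouldin_score, silhouette_score)

# Four well-separated synthetic clusters
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [6, 0], [0, 6], [6, 6]],
                  cluster_std=1.0, random_state=42)
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)

print(f"Silhouette:        {silhouette_score(X, labels):.3f}")        # higher is better
print(f"Calinski-Harabasz: {calinski_harabasz_score(X, labels):.1f}") # higher is better
print(f"Davies-Bouldin:    {davies_bouldin_score(X, labels):.3f}")    # lower is better
```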
Common Mistakes
- Forgetting to scale features
- Using accuracy metrics (no labels in unsupervised!)
- Choosing k based only on elbow without silhouette
- Not interpreting what clusters actually mean