Assignment Overview
In this assignment, you will build a complete Unsupervised Learning System using scikit-learn. This comprehensive project requires you to apply ALL concepts from Module 10: clustering algorithms and dimensionality reduction techniques to discover patterns in unlabeled customer data.
Clustering (10.1)
K-Means algorithm, cluster validation, silhouette analysis, elbow method, and customer segmentation
Dimensionality Reduction (10.2)
PCA, variance explained, component analysis, and high-dimensional data visualization
The Scenario
RetailMax E-Commerce
You have been hired as a Data Scientist at RetailMax, a growing e-commerce company. The marketing team wants to understand their customer base better to create targeted campaigns. Your manager has assigned you a critical project:
"We have transaction data for thousands of customers but no predefined categories. We need you to discover natural customer segments based on their purchasing behavior. Additionally, our customer feature set is quite large, and we need to identify the most important dimensions for visualization and analysis. Use clustering and PCA to help us understand our customers better."
Your Tasks
Create a Jupyter notebook called unsupervised_analysis.ipynb that implements customer
segmentation using clustering algorithms and reduces dimensionality using PCA for visualization
and feature analysis.
Project 1: Customer Segmentation
Segment customers using K-Means clustering based on:
- Purchase frequency and recency
- Average transaction value
- Product category preferences
- Customer lifetime metrics
Project 2: Dimensionality Reduction
Apply PCA to understand the data structure:
- Reduce features for visualization
- Identify principal components
- Analyze variance explained
- Visualize clusters in 2D space
The Dataset
You will work with real customer behavior data from RetailMax's e-commerce platform. Download the dataset below to get started.
retailmax_customers.csv
Customer transaction and behavior data including purchase history, product preferences, engagement metrics, and demographics.
Requirements
Your unsupervised_analysis.ipynb must implement ALL of the following components.
Each section is mandatory and will be graded individually.
Part 1: Data Preparation (40 points)
Data Loading and Exploration
Load the dataset and perform initial exploration:
- Check data shape, types, and missing values
- Generate descriptive statistics for all features
- Visualize feature distributions with histograms
- Create correlation heatmap
def explore_data(df):
    """
    Perform comprehensive data exploration.
    Returns: summary statistics and visualizations
    """
    # Your implementation
    pass
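A minimal sketch of what the exploration step might return. This is not the required solution: the histograms and correlation heatmap are omitted, and the `missing` return value is an extra this sketch adds for convenience.

```python
import pandas as pd

def explore_data(df):
    """Basic exploration: shape, dtypes, missing values, summary stats.
    (Histograms and the correlation heatmap are left out of this sketch.)"""
    print("Shape:", df.shape)
    print(df.dtypes)
    missing = df.isna().sum()   # missing-value count per column
    stats = df.describe()       # descriptive statistics for numeric features
    return stats, missing
```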
Feature Scaling
Prepare features for clustering and PCA:
- Select relevant features (exclude customer_id)
- Apply StandardScaler to normalize features
- Store the scaler for inverse transformations
from sklearn.preprocessing import StandardScaler

def prepare_features(df, feature_cols):
    """
    Scale features for unsupervised learning.
    Returns: scaled_data, scaler, feature_names
    """
    # Your implementation
    pass
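One possible body for `prepare_features`, assuming `feature_cols` already excludes `customer_id`. The fitted scaler is returned so you can later call `scaler.inverse_transform` on centroids.

```python
from sklearn.preprocessing import StandardScaler

def prepare_features(df, feature_cols):
    """Scale the selected features; return scaled array, fitted scaler, names."""
    X = df[feature_cols].values          # pass only feature columns, never IDs
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)   # each feature -> mean 0, std 1
    return X_scaled, scaler, list(feature_cols)
```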
Part 2: K-Means Clustering (80 points)
Elbow Method
Determine optimal number of clusters using the elbow method:
- Test K values from 2 to 10
- Calculate inertia (within-cluster sum of squares) for each K
- Plot the elbow curve
- Identify the "elbow" point
from sklearn.cluster import KMeans

def find_optimal_k_elbow(X, k_range=range(2, 11)):
    """
    Apply elbow method to find optimal K.
    Returns: inertias, optimal_k, elbow_plot
    """
    # Your implementation
    pass
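The inertia loop at the core of the elbow method might look like this; `compute_inertias` is a hypothetical helper name, and the plotting and elbow-detection steps are omitted.

```python
from sklearn.cluster import KMeans

def compute_inertias(X, k_range=range(2, 11)):
    """Fit KMeans for each K and collect inertia (within-cluster sum of squares)."""
    inertias = []
    for k in k_range:
        km = KMeans(n_clusters=k, n_init=10, random_state=42)
        km.fit(X)
        inertias.append(km.inertia_)     # lower inertia = tighter clusters
    return inertias
```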
Silhouette Analysis
Validate cluster quality using silhouette scores:
- Calculate silhouette score for each K
- Plot silhouette scores vs number of clusters
- Create silhouette plots for the optimal K
- Analyze cluster cohesion and separation
from sklearn.metrics import silhouette_score, silhouette_samples

def silhouette_analysis(X, k_range=range(2, 11)):
    """
    Perform silhouette analysis for cluster validation.
    Returns: silhouette_scores, best_k, silhouette_plot
    """
    # Your implementation
    pass
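The score-per-K part of silhouette analysis can be sketched as follows (`silhouette_scores_by_k` is a hypothetical helper; the per-sample silhouette plots built with `silhouette_samples` are not shown):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def silhouette_scores_by_k(X, k_range=range(2, 11)):
    """Mean silhouette score for each candidate K (scores lie in [-1, 1])."""
    scores = {}
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
        scores[k] = silhouette_score(X, labels)   # higher = better separation
    return scores
```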
K-Means Clustering
Perform K-Means clustering with optimal K:
- Fit KMeans with the selected number of clusters
- Assign cluster labels to each customer
- Get cluster centroids
- Add cluster labels to original dataframe
def perform_kmeans(X, n_clusters, random_state=42):
    """
    Perform K-Means clustering.
    Returns: kmeans_model, labels, centroids
    """
    # Your implementation
    pass
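A minimal sketch of `perform_kmeans`; attaching the labels back to the original dataframe is left to the caller.

```python
from sklearn.cluster import KMeans

def perform_kmeans(X, n_clusters, random_state=42):
    """Fit KMeans and return the model, per-sample labels, and centroids."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    labels = km.fit_predict(X)               # cluster index for each row of X
    return km, labels, km.cluster_centers_   # centroids in scaled feature space
```

Note that the centroids live in the scaled feature space; use the stored scaler's `inverse_transform` to report them in original units.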
Cluster Profiling
Create detailed profiles for each cluster:
- Calculate mean values for each feature per cluster
- Count customers in each cluster
- Identify distinguishing characteristics
- Create cluster summary table
def profile_clusters(df, cluster_labels, feature_cols):
    """
    Generate cluster profiles with statistics.
    Returns: cluster_profiles_df, cluster_summary
    """
    # Your implementation
    pass
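The per-cluster means and sizes fall out of a pandas `groupby`; a sketch (identifying distinguishing characteristics is left as analysis on top of these tables):

```python
import pandas as pd

def profile_clusters(df, cluster_labels, feature_cols):
    """Mean of each feature per cluster, plus cluster sizes."""
    out = df[feature_cols].copy()
    out["cluster"] = cluster_labels
    profiles = out.groupby("cluster")[feature_cols].mean()      # feature means
    sizes = out["cluster"].value_counts().sort_index().rename("size")
    return profiles, sizes
```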
Cluster Visualization
Visualize clusters using multiple approaches:
- Scatter plot of top 2 features colored by cluster
- Radar/spider chart for cluster comparison
- Box plots of key features by cluster
- Cluster size distribution bar chart
def visualize_clusters(df, cluster_labels, feature_cols):
    """
    Create cluster visualizations.
    Saves plots to 'visualizations/' folder
    """
    # Your implementation
    pass
Part 3: PCA Dimensionality Reduction (80 points)
Apply PCA
Reduce dimensionality using Principal Component Analysis:
- Fit PCA on scaled data
- Transform data to principal components
- Store component loadings
from sklearn.decomposition import PCA

def apply_pca(X, n_components=None):
    """
    Apply PCA for dimensionality reduction.
    Returns: pca_model, transformed_data, loadings
    """
    # Your implementation
    pass
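A minimal sketch of `apply_pca`, assuming `X` has already been standardized. The loadings here are simply the transposed `components_` matrix; wrapping them in a labeled DataFrame is a natural extension.

```python
from sklearn.decomposition import PCA

def apply_pca(X, n_components=None):
    """Fit PCA on already-scaled data; return model, scores, and loadings."""
    pca = PCA(n_components=n_components)   # None keeps all components
    transformed = pca.fit_transform(X)     # samples projected onto the PCs
    loadings = pca.components_.T           # shape: (n_features, n_components)
    return pca, transformed, loadings
```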
Variance Analysis
Analyze explained variance:
- Calculate explained variance ratio for each component
- Calculate cumulative explained variance
- Plot scree plot (variance by component)
- Determine number of components for 80% and 95% variance
def analyze_variance(pca_model):
    """
    Analyze PCA explained variance.
    Returns: variance_df, scree_plot, n_components_80, n_components_95
    """
    # Your implementation
    pass
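Finding the number of components for 80% and 95% variance reduces to a cumulative sum over `explained_variance_ratio_`; `components_for_variance` is a hypothetical helper, and the scree plot is omitted.

```python
import numpy as np

def components_for_variance(pca_model, threshold):
    """Smallest number of components whose cumulative explained variance
    reaches `threshold` (e.g. 0.80 or 0.95)."""
    cum = np.cumsum(pca_model.explained_variance_ratio_)
    idx = np.searchsorted(cum, threshold) + 1   # first index meeting threshold
    return int(min(idx, len(cum)))              # cap at total component count
```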
Component Interpretation
Interpret principal components:
- Extract component loadings (feature weights)
- Identify top contributing features for each PC
- Create loading heatmap visualization
- Name/describe each principal component
def interpret_components(pca_model, feature_names, n_components=5):
    """
    Interpret principal components.
    Returns: loadings_df, component_descriptions
    """
    # Your implementation
    pass
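Picking the top contributing features per component comes down to ranking absolute loadings; `top_features_per_component` is a hypothetical helper (naming and describing each PC remains a judgment call on top of this output).

```python
import pandas as pd

def top_features_per_component(loadings, feature_names, top_n=3):
    """Features with the largest absolute loading on each component.
    `loadings` has shape (n_features, n_components)."""
    df = pd.DataFrame(loadings, index=feature_names)
    return {
        f"PC{j + 1}": df[j].abs().nlargest(top_n).index.tolist()
        for j in df.columns
    }
```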
2D Visualization with PCA
Visualize data and clusters in 2D PCA space:
- Project data onto first 2 principal components
- Create scatter plot colored by cluster labels
- Add cluster centroids to the plot
- Include explained variance in axis labels
def visualize_pca_clusters(pca_data, cluster_labels, pca_model):
    """
    Visualize clusters in PCA space.
    Saves plot to 'visualizations/pca_clusters.png'
    """
    # Your implementation
    pass
Biplot Visualization
Create a biplot showing samples and feature vectors:
- Plot samples in PC1-PC2 space
- Overlay feature loading vectors as arrows
- Label feature arrows
- Interpret feature relationships
def create_biplot(pca_data, pca_model, feature_names, cluster_labels=None):
    """
    Create PCA biplot with feature vectors.
    Saves plot to 'visualizations/pca_biplot.png'
    """
    # Your implementation
    pass
Part 4: Business Insights (50 points)
Segment Naming and Description
Create business-friendly segment names and descriptions:
- Assign descriptive names to each cluster (e.g., "High-Value Loyalists")
- Write 2-3 sentence descriptions for each segment
- Identify key characteristics that define each segment
Marketing Recommendations
Provide actionable recommendations for each segment:
- Suggest marketing strategies per segment
- Recommend product focus for each group
- Propose retention or growth tactics
- Estimate potential value of targeted campaigns
Summary Report
Create a final summary with:
- Executive summary of findings
- Key metrics table (cluster sizes, avg values)
- Top 3 actionable insights
- Limitations and next steps
Submission Instructions
Submit your completed assignment via GitHub following these instructions:
Create Jupyter Notebook
Create a single notebook called unsupervised_analysis.ipynb containing all requirements:
- Organize with clear markdown headers for each part
- Each function must have docstrings explaining inputs and outputs
- Include markdown cells with analysis and interpretations
- Run all cells top to bottom before submission
Save Visualizations
Export all plots to the visualizations/ folder:
- elbow_plot.png
- silhouette_plot.png
- cluster_scatter.png
- cluster_profiles.png
- pca_variance.png
- pca_clusters.png
- pca_biplot.png
Create README
Create README.md that includes:
- Your name and assignment title
- Summary of customer segments discovered
- Key PCA findings
- Instructions to run your notebook
Create requirements.txt
numpy==1.24.0
pandas==2.0.0
scikit-learn==1.3.0
matplotlib==3.7.0
seaborn==0.12.0
Repository Structure
Your GitHub repository should look like this:
retailmax-customer-segmentation/
├── README.md
├── requirements.txt
├── unsupervised_analysis.ipynb
└── visualizations/
    ├── elbow_plot.png
    ├── silhouette_plot.png
    ├── cluster_scatter.png
    ├── cluster_profiles.png
    ├── pca_variance.png
    ├── pca_clusters.png
    └── pca_biplot.png
Submit via Form
Once your repository is ready:
- Make sure your repository is public
- Click the "Submit Assignment" button below
- Fill in the submission form with your GitHub username
Grading Rubric
Your assignment will be graded on the following criteria:
| Criteria | Points | Description |
|---|---|---|
| Data Preparation | 40 | Data exploration, feature selection, proper scaling |
| Cluster Validation | 40 | Elbow method, silhouette analysis, optimal K selection |
| K-Means Implementation | 40 | Correct clustering, profiling, visualization |
| PCA Analysis | 40 | Variance analysis, component interpretation, loadings |
| PCA Visualization | 40 | 2D cluster plot, biplot, proper labeling |
| Business Insights | 30 | Segment naming, marketing recommendations, summary |
| Code Quality | 20 | Docstrings, comments, clean organization |
| Total | 250 | |
Ready to Submit?
Make sure you have completed all requirements and reviewed the grading rubric above.
What You Will Practice
Clustering (10.1)
K-Means algorithm, cluster validation with elbow and silhouette methods, customer segmentation
Dimensionality Reduction (10.2)
PCA for feature reduction, variance analysis, component interpretation, and visualization
Data Visualization
Scatter plots, biplots, heatmaps, and radar charts for cluster and component analysis
Business Insights
Translating technical results into actionable marketing strategies and recommendations
Pro Tips
Clustering Tips
- Always scale features before K-Means
- Use multiple methods to validate optimal K
- Set random_state for reproducibility
- Consider business context for K selection
PCA Tips
- Standardize data before PCA
- Check cumulative variance explained
- Interpret loadings for component meaning
- Use biplots for feature relationships
Visualization Tips
- Use consistent colors across plots
- Label axes with explained variance
- Add legends for cluster identification
- Save high-resolution images (dpi=300)
Common Mistakes
- Forgetting to scale features
- Including ID columns in clustering
- Choosing K based only on elbow
- Not interpreting results for business