Assignment Overview
In this assignment, you will build a complete Unsupervised Learning System using scikit-learn. This comprehensive project requires you to apply ALL concepts from Module 10: clustering algorithms and dimensionality reduction techniques to discover patterns in unlabeled customer data.
Clustering (10.1)
K-Means algorithm, cluster validation, silhouette analysis, elbow method, and customer segmentation
Dimensionality Reduction (10.2)
PCA, variance explained, component analysis, and high-dimensional data visualization
The Scenario
RetailMax E-Commerce
You have been hired as a Data Scientist at RetailMax, a growing e-commerce company. The marketing team wants to understand their customer base better to create targeted campaigns. Your manager has assigned you a critical project:
"We have transaction data for thousands of customers but no predefined categories. We need you to discover natural customer segments based on their purchasing behavior. Additionally, our customer feature set is quite large, and we need to identify the most important dimensions for visualization and analysis. Use clustering and PCA to help us understand our customers better."
Your Tasks
Create a Jupyter notebook called unsupervised_analysis.ipynb that implements customer
segmentation using clustering algorithms and reduces dimensionality using PCA for visualization
and feature analysis.
Project 1: Customer Segmentation
Segment customers using K-Means clustering based on:
- Purchase frequency and recency
- Average transaction value
- Product category preferences
- Customer lifetime metrics
Project 2: Dimensionality Reduction
Apply PCA to understand the data structure:
- Reduce features for visualization
- Identify principal components
- Analyze variance explained
- Visualize clusters in 2D space
The Dataset
You will work with real customer behavior data from RetailMax's e-commerce platform. Download the dataset below to get started.
retailmax_customers.csv
Customer transaction and behavior data including purchase history, product preferences, engagement metrics, and demographics.
Requirements
Your unsupervised_analysis.ipynb must implement ALL of the following components.
Each section is mandatory and will be graded individually.
Part 1: Data Preparation (40 points)
Data Loading and Exploration
Load the dataset and perform initial exploration:
- Check data shape, types, and missing values
- Generate descriptive statistics for all features
- Visualize feature distributions with histograms
- Create correlation heatmap
def explore_data(df):
    """
    Perform comprehensive data exploration.
    Returns: summary statistics and visualizations
    """
    # Your implementation
    pass
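A minimal sketch of what the exploration step might return. This is not the required solution: the histograms and correlation heatmap are omitted, and the `missing` return value is an extra this sketch adds for convenience.

```python
import pandas as pd

def explore_data(df):
    """Basic exploration: shape, dtypes, missing values, summary stats.
    (Histograms and the correlation heatmap are left out of this sketch.)"""
    print("Shape:", df.shape)
    print(df.dtypes)
    missing = df.isna().sum()   # missing-value count per column
    stats = df.describe()       # descriptive statistics for numeric features
    return stats, missing
```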
Feature Scaling
Prepare features for clustering and PCA:
- Select relevant features (exclude customer_id)
- Apply StandardScaler to normalize features
- Store the scaler for inverse transformations
from sklearn.preprocessing import StandardScaler

def prepare_features(df, feature_cols):
    """
    Scale features for unsupervised learning.
    Returns: scaled_data, scaler, feature_names
    """
    # Your implementation
    pass
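One possible body for `prepare_features`, assuming `feature_cols` already excludes `customer_id`. The fitted scaler is returned so you can later call `scaler.inverse_transform` on centroids.

```python
from sklearn.preprocessing import StandardScaler

def prepare_features(df, feature_cols):
    """Scale the selected features; return scaled array, fitted scaler, names."""
    X = df[feature_cols].values          # pass only feature columns, never IDs
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)   # each feature -> mean 0, std 1
    return X_scaled, scaler, list(feature_cols)
```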
Part 2: K-Means Clustering (80 points)
Elbow Method
Determine optimal number of clusters using the elbow method:
- Test K values from 2 to 10
- Calculate inertia (within-cluster sum of squares) for each K
- Plot the elbow curve
- Identify the "elbow" point
from sklearn.cluster import KMeans

def find_optimal_k_elbow(X, k_range=range(2, 11)):
    """
    Apply elbow method to find optimal K.
    Returns: inertias, optimal_k, elbow_plot
    """
    # Your implementation
    pass
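The inertia loop at the core of the elbow method might look like this; `compute_inertias` is a hypothetical helper name, and the plotting and elbow-detection steps are omitted.

```python
from sklearn.cluster import KMeans

def compute_inertias(X, k_range=range(2, 11)):
    """Fit KMeans for each K and collect inertia (within-cluster sum of squares)."""
    inertias = []
    for k in k_range:
        km = KMeans(n_clusters=k, n_init=10, random_state=42)
        km.fit(X)
        inertias.append(km.inertia_)     # lower inertia = tighter clusters
    return inertias
```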
Silhouette Analysis
Validate cluster quality using silhouette scores:
- Calculate silhouette score for each K
- Plot silhouette scores vs number of clusters
- Create silhouette plots for the optimal K
- Analyze cluster cohesion and separation
from sklearn.metrics import silhouette_score, silhouette_samples

def silhouette_analysis(X, k_range=range(2, 11)):
    """
    Perform silhouette analysis for cluster validation.
    Returns: silhouette_scores, best_k, silhouette_plot
    """
    # Your implementation
    pass
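The score-per-K part of silhouette analysis can be sketched as follows (`silhouette_scores_by_k` is a hypothetical helper; the per-sample silhouette plots built with `silhouette_samples` are not shown):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def silhouette_scores_by_k(X, k_range=range(2, 11)):
    """Mean silhouette score for each candidate K (scores lie in [-1, 1])."""
    scores = {}
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
        scores[k] = silhouette_score(X, labels)   # higher = better separation
    return scores
```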
K-Means Clustering
Perform K-Means clustering with optimal K:
- Fit KMeans with the selected number of clusters
- Assign cluster labels to each customer
- Get cluster centroids
- Add cluster labels to original dataframe
def perform_kmeans(X, n_clusters, random_state=42):
    """
    Perform K-Means clustering.
    Returns: kmeans_model, labels, centroids
    """
    # Your implementation
    pass
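A minimal sketch of `perform_kmeans`; attaching the labels back to the original dataframe is left to the caller.

```python
from sklearn.cluster import KMeans

def perform_kmeans(X, n_clusters, random_state=42):
    """Fit KMeans and return the model, per-sample labels, and centroids."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    labels = km.fit_predict(X)               # cluster index for each row of X
    return km, labels, km.cluster_centers_   # centroids in scaled feature space
```

Note that the centroids live in the scaled feature space; use the stored scaler's `inverse_transform` to report them in original units.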
Cluster Profiling
Create detailed profiles for each cluster:
- Calculate mean values for each feature per cluster
- Count customers in each cluster
- Identify distinguishing characteristics
- Create cluster summary table
def profile_clusters(df, cluster_labels, feature_cols):
    """
    Generate cluster profiles with statistics.
    Returns: cluster_profiles_df, cluster_summary
    """
    # Your implementation
    pass
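The per-cluster means and sizes fall out of a pandas `groupby`; a sketch (identifying distinguishing characteristics is left as analysis on top of these tables):

```python
import pandas as pd

def profile_clusters(df, cluster_labels, feature_cols):
    """Mean of each feature per cluster, plus cluster sizes."""
    out = df[feature_cols].copy()
    out["cluster"] = cluster_labels
    profiles = out.groupby("cluster")[feature_cols].mean()      # feature means
    sizes = out["cluster"].value_counts().sort_index().rename("size")
    return profiles, sizes
```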
Cluster Visualization
Visualize clusters using multiple approaches:
- Scatter plot of top 2 features colored by cluster
- Radar/spider chart for cluster comparison
- Box plots of key features by cluster
- Cluster size distribution bar chart
def visualize_clusters(df, cluster_labels, feature_cols):
    """
    Create cluster visualizations.
    Saves plots to 'visualizations/' folder
    """
    # Your implementation
    pass
Part 3: PCA Dimensionality Reduction (80 points)
Apply PCA
Reduce dimensionality using Principal Component Analysis:
- Fit PCA on scaled data
- Transform data to principal components
- Store component loadings
from sklearn.decomposition import PCA

def apply_pca(X, n_components=None):
    """
    Apply PCA for dimensionality reduction.
    Returns: pca_model, transformed_data, loadings
    """
    # Your implementation
    pass
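A minimal sketch of `apply_pca`, assuming `X` has already been standardized. The loadings here are simply the transposed `components_` matrix; wrapping them in a labeled DataFrame is a natural extension.

```python
from sklearn.decomposition import PCA

def apply_pca(X, n_components=None):
    """Fit PCA on already-scaled data; return model, scores, and loadings."""
    pca = PCA(n_components=n_components)   # None keeps all components
    transformed = pca.fit_transform(X)     # samples projected onto the PCs
    loadings = pca.components_.T           # shape: (n_features, n_components)
    return pca, transformed, loadings
```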
Variance Analysis
Analyze explained variance:
- Calculate explained variance ratio for each component
- Calculate cumulative explained variance
- Plot scree plot (variance by component)
- Determine number of components for 80% and 95% variance
def analyze_variance(pca_model):
    """
    Analyze PCA explained variance.
    Returns: variance_df, scree_plot, n_components_80, n_components_95
    """
    # Your implementation
    pass
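Finding the number of components for 80% and 95% variance reduces to a cumulative sum over `explained_variance_ratio_`; `components_for_variance` is a hypothetical helper, and the scree plot is omitted.

```python
import numpy as np

def components_for_variance(pca_model, threshold):
    """Smallest number of components whose cumulative explained variance
    reaches `threshold` (e.g. 0.80 or 0.95)."""
    cum = np.cumsum(pca_model.explained_variance_ratio_)
    idx = np.searchsorted(cum, threshold) + 1   # first index meeting threshold
    return int(min(idx, len(cum)))              # cap at total component count
```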
Component Interpretation
Interpret principal components:
- Extract component loadings (feature weights)
- Identify top contributing features for each PC
- Create loading heatmap visualization
- Name/describe each principal component
def interpret_components(pca_model, feature_names, n_components=5):
    """
    Interpret principal components.
    Returns: loadings_df, component_descriptions
    """
    # Your implementation
    pass
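Picking the top contributing features per component comes down to ranking absolute loadings; `top_features_per_component` is a hypothetical helper (naming and describing each PC remains a judgment call on top of this output).

```python
import pandas as pd

def top_features_per_component(loadings, feature_names, top_n=3):
    """Features with the largest absolute loading on each component.
    `loadings` has shape (n_features, n_components)."""
    df = pd.DataFrame(loadings, index=feature_names)
    return {
        f"PC{j + 1}": df[j].abs().nlargest(top_n).index.tolist()
        for j in df.columns
    }
```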
2D Visualization with PCA
Visualize data and clusters in 2D PCA space:
- Project data onto first 2 principal components
- Create scatter plot colored by cluster labels
- Add cluster centroids to the plot
- Include explained variance in axis labels
def visualize_pca_clusters(pca_data, cluster_labels, pca_model):
    """
    Visualize clusters in PCA space.
    Saves plot to 'visualizations/pca_clusters.png'
    """
    # Your implementation
    pass
Biplot Visualization
Create a biplot showing samples and feature vectors:
- Plot samples in PC1-PC2 space
- Overlay feature loading vectors as arrows
- Label feature arrows
- Interpret feature relationships
def create_biplot(pca_data, pca_model, feature_names, cluster_labels=None):
    """
    Create PCA biplot with feature vectors.
    Saves plot to 'visualizations/pca_biplot.png'
    """
    # Your implementation
    pass
Part 4: Business Insights (50 points)
Segment Naming and Description
Create business-friendly segment names and descriptions:
- Assign descriptive names to each cluster (e.g., "High-Value Loyalists")
- Write 2-3 sentence descriptions for each segment
- Identify key characteristics that define each segment
Marketing Recommendations
Provide actionable recommendations for each segment:
- Suggest marketing strategies per segment
- Recommend product focus for each group
- Propose retention or growth tactics
- Estimate potential value of targeted campaigns
Summary Report
Create a final summary with:
- Executive summary of findings
- Key metrics table (cluster sizes, avg values)
- Top 3 actionable insights
- Limitations and next steps
Submission Instructions
Submit your completed assignment via GitHub following these instructions:
Create Jupyter Notebook
Create a single notebook called unsupervised_analysis.ipynb containing all requirements:
- Organize with clear markdown headers for each part
- Each function must have docstrings explaining inputs and outputs
- Include markdown cells with analysis and interpretations
- Run all cells top to bottom before submission
Save Visualizations
Export all plots to the visualizations/ folder:
- elbow_plot.png
- silhouette_plot.png
- cluster_scatter.png
- cluster_profiles.png
- pca_variance.png
- pca_clusters.png
- pca_biplot.png
Create README
Create README.md that includes:
- Your name and assignment title
- Summary of customer segments discovered
- Key PCA findings
- Instructions to run your notebook
Create requirements.txt
numpy==1.24.0
pandas==2.0.0
scikit-learn==1.3.0
matplotlib==3.7.0
seaborn==0.12.0
Repository Structure
Your GitHub repository should look like this:
retailmax-customer-segmentation/
├── README.md
├── requirements.txt
├── unsupervised_analysis.ipynb
└── visualizations/
    ├── elbow_plot.png
    ├── silhouette_plot.png
    ├── cluster_scatter.png
    ├── cluster_profiles.png
    ├── pca_variance.png
    ├── pca_clusters.png
    └── pca_biplot.png
Submit via Form
Once your repository is ready:
- Make sure your repository is public
- Click the "Submit Assignment" button below
- Fill in the submission form with your GitHub username
Grading Rubric
Your assignment will be graded on the following criteria:
| Criteria | Points | Description |
|---|---|---|
| Data Preparation | 40 | Data exploration, feature selection, proper scaling |
| Cluster Validation | 40 | Elbow method, silhouette analysis, optimal K selection |
| K-Means Implementation | 40 | Correct clustering, profiling, visualization |
| PCA Analysis | 40 | Variance analysis, component interpretation, loadings |
| PCA Visualization | 40 | 2D cluster plot, biplot, proper labeling |
| Business Insights | 30 | Segment naming, marketing recommendations, summary |
| Code Quality | 20 | Docstrings, comments, clean organization |
| Total | 250 | |
Ready to Submit?
Make sure you have completed all requirements and reviewed the grading rubric above.
What You Will Practice
Clustering (10.1)
K-Means algorithm, cluster validation with elbow and silhouette methods, customer segmentation
Dimensionality Reduction (10.2)
PCA for feature reduction, variance analysis, component interpretation, and visualization
Data Visualization
Scatter plots, biplots, heatmaps, and radar charts for cluster and component analysis
Business Insights
Translating technical results into actionable marketing strategies and recommendations
Pro Tips
Clustering Tips
- Always scale features before K-Means
- Use multiple methods to validate optimal K
- Set random_state for reproducibility
- Consider business context for K selection
PCA Tips
- Standardize data before PCA
- Check cumulative variance explained
- Interpret loadings for component meaning
- Use biplots for feature relationships
Visualization Tips
- Use consistent colors across plots
- Label axes with explained variance
- Add legends for cluster identification
- Save high-resolution images (dpi=300)
Common Mistakes
- Forgetting to scale features
- Including ID columns in clustering
- Choosing K based only on elbow
- Not interpreting results for business