Capstone Project 2

Customer Segmentation

Apply RFM analysis and K-Means clustering to segment customers into actionable groups. Learn to profile customer segments and develop targeted marketing recommendations for each group.

10-14 hours
Intermediate-Advanced
600 Points
What You Will Build
  • RFM feature engineering
  • K-Means clustering model
  • Elbow method analysis
  • Customer segment profiles
  • Marketing recommendations
01

Project Overview

Customer segmentation is one of the most impactful applications of unsupervised learning in business. In this project, you will use RFM (Recency, Frequency, Monetary) analysis combined with K-Means clustering to identify distinct customer groups and develop targeted marketing strategies for each segment.

Skills Applied: This project tests your proficiency in feature engineering (RFM metrics), unsupervised learning (K-Means), data scaling, cluster evaluation (Elbow method, Silhouette score), and business insight generation.
  • RFM Analysis: Calculate Recency, Frequency, and Monetary values
  • K-Means Clustering: Implement and optimize clustering algorithm
  • Segment Profiling: Analyze and name each customer segment
  • Recommendations: Develop marketing strategies per segment

02

Business Scenario

ShopSmart Retail

You have been hired as a Customer Analytics Specialist at ShopSmart, a growing retail company operating in the competitive e-commerce space. The company has experienced steady growth over the past year, but the marketing team has been using generic, one-size-fits-all email campaigns that yield low engagement rates (averaging just 8% open rate and 1.5% click-through rate).

The CMO recognizes that different customers have vastly different needs and represent very different value to the business. A customer who purchased once six months ago shouldn't receive the same messaging as a loyal customer who shops weekly. However, without data-driven customer segments, the team has no framework for personalization.

"We have customer transaction data but no way to identify our best customers versus those at risk of churning. Can you segment our customer base and help us understand each group's characteristics so we can tailor our marketing efforts? We need actionable segments we can actually use in campaigns."

Priya Menon, Head of Marketing
The Business Challenge

ShopSmart faces several common retail challenges that customer segmentation can address:

Customer Churn

35% of customers make only one purchase and never return. The company has no early warning system to identify at-risk customers before they churn.

Inefficient Marketing

Marketing budget is spread evenly across all customers, wasting resources on disengaged users while under-investing in high-value segments.

Unknown Patterns

Management has no visibility into how many loyal customers exist, what percentage are at risk, or which segments drive the most revenue.

Business Objectives

Segmentation
  • Identify distinct customer segments based on behavior
  • Determine the optimal number of clusters
  • Profile each segment with clear characteristics
Strategy
  • Identify high-value customers for loyalty programs
  • Find at-risk customers for retention campaigns
  • Discover growth opportunities in each segment
Questions to Answer
  • Who are our most valuable customers?
  • Which customers are at risk of churning?
  • What percentage of customers are occasional buyers?
Deliverables
  • RFM score for each customer
  • Cluster assignment with labels
  • Marketing recommendations per segment
Pro Tip: Think like a marketer! Give your segments memorable names (e.g., "Champions", "At Risk", "New Customers") that stakeholders can easily understand and act upon.
03

The Dataset

You will work with a customer transaction dataset containing purchase history from a retail business over 12 months. This realistic dataset includes multiple transactions per customer, allowing you to calculate meaningful RFM metrics and identify distinct customer segments.

Dataset Overview:
  • Total Transactions: 150
  • Unique Customers: 94
  • Months of Data: 12
  • Data Columns: 14
What Makes This Dataset Ideal for Segmentation
Repeat Customers

Many customers have multiple transactions, allowing you to measure frequency and identify loyal vs. one-time buyers.

Time Spread

Transactions span a full year, providing sufficient time range to calculate meaningful recency values and identify churn patterns.

Value Diversity

Wide range of transaction amounts from small purchases to large orders, enabling clear monetary segmentation.

Dataset Schema
Column | Type | Description
transaction_id | String | Unique transaction identifier
customer_id | String | Unique customer identifier
customer_name | String | Customer full name
email | String | Customer email address
transaction_date | Date | Date of transaction (YYYY-MM-DD)
product_id | String | Product identifier
product_name | String | Name of product purchased
category | String | Product category (Electronics, Furniture, Office Supplies)
quantity | Integer | Number of units purchased
unit_price | Float | Price per unit ($)
total_amount | Float | Total transaction value ($)
payment_method | String | Payment type (Credit Card, Debit Card, PayPal)
city | String | Customer's city
region | String | Geographic region (North, South, East, West)
Key Columns for RFM Analysis
Column | Used For | Why It Matters
customer_id | Grouping transactions by customer | Enables aggregation of all purchases per customer
transaction_date | Recency calculation | Find most recent purchase date for each customer
transaction_id | Frequency calculation | Count number of transactions per customer
total_amount | Monetary calculation | Sum total spending per customer
Bonus Opportunity: While not required for basic segmentation, columns like category, region, and payment_method can be used for additional profiling insights (e.g., "Champions prefer Electronics and use Credit Cards").
04

RFM Analysis

RFM (Recency, Frequency, Monetary) is a proven customer segmentation technique that scores customers based on their purchase behavior. You must calculate these three metrics for each customer.

Recency (R)

Definition: How recently did the customer make a purchase?

Calculation: Days since last purchase

Lower recency = Better (more recent customer)

Frequency (F)

Definition: How often does the customer purchase?

Calculation: Total number of transactions

Higher frequency = Better (loyal customer)

Monetary (M)

Definition: How much does the customer spend?

Calculation: Total amount spent

Higher monetary = Better (high-value customer)

Understanding RFM Calculations
Recency Details

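Analysis Date: Set to one day after the last transaction in your dataset. This keeps every recency value at least 1 day and makes the calculation reproducible, because it does not depend on the day you run the notebook.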

For Each Customer: Calculate the number of days between the analysis date and their most recent purchase.

Interpretation: A customer with recency of 5 days is much more engaged than one with 200 days.

Frequency Details

Count Transactions: Simply count the total number of transactions each customer has made.

Customer Loyalty: Customers with 10+ transactions show strong loyalty and engagement.

One-time Buyers: Customers with frequency = 1 have already been acquired but not yet retained; they represent second-purchase (conversion) opportunities.

Monetary Details

Sum All Purchases: Total the dollar amount spent across all transactions for each customer.

High-Value Customers: Top 20% of customers often generate 80% of revenue (Pareto Principle).

Revenue Impact: Focus retention efforts on high monetary value customers.
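The three calculations above can be sketched in pandas as follows. This is a minimal illustration, not the required solution: the five transactions are made-up values that only borrow the dataset's column names, and the intermediate column name `last_purchase` is an assumption introduced for the example.

```python
import pandas as pd

# Made-up sample rows using the dataset's column names (illustrative values only).
transactions = pd.DataFrame({
    "transaction_id": ["T1", "T2", "T3", "T4", "T5"],
    "customer_id": ["C1", "C1", "C2", "C3", "C3"],
    "transaction_date": ["2024-01-05", "2024-03-10", "2024-02-01",
                         "2024-03-20", "2024-03-30"],
    "total_amount": [120.0, 80.0, 40.0, 300.0, 150.0],
})
transactions["transaction_date"] = pd.to_datetime(transactions["transaction_date"])

# Analysis date: one day after the last transaction in the data.
analysis_date = transactions["transaction_date"].max() + pd.Timedelta(days=1)

# One row per customer: latest purchase, transaction count, total spend.
rfm = (
    transactions.groupby("customer_id")
    .agg(
        last_purchase=("transaction_date", "max"),
        frequency=("transaction_id", "count"),
        monetary=("total_amount", "sum"),
    )
    .reset_index()
)

# Recency in whole days between the analysis date and the latest purchase.
rfm["recency"] = (analysis_date - rfm["last_purchase"]).dt.days
rfm = rfm[["customer_id", "recency", "frequency", "monetary"]]
print(rfm)
```

On the real data, the same pattern applies after loading the CSV with `pd.read_csv` and converting `transaction_date` to datetime.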

RFM Scoring System (Optional Enhancement)

Beyond raw RFM values, you can create a scoring system (1-5) where each customer receives a score for each metric. This makes segments easier to communicate to business stakeholders.

Score | Recency | Frequency | Monetary
5 (Best) | 0-20 days ago | 8+ transactions | $1,000+ spent
4 | 21-50 days ago | 5-7 transactions | $500-$999 spent
3 | 51-100 days ago | 3-4 transactions | $200-$499 spent
2 | 101-180 days ago | 2 transactions | $50-$199 spent
1 (Worst) | 181+ days ago | 1 transaction | Under $50 spent
RFM Segment Codes: Some analysts combine scores into a three-digit code (e.g., "555" = Champion, "111" = Lost Customer). An RFM score of 555 represents the best possible customer, while 111 indicates a customer at high risk of permanent churn.
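One possible way to turn raw RFM values into 1-5 scores and a three-digit code is `pandas.cut` with the thresholds from the scoring table. The sketch below makes a few assumptions: the three customers are invented, a recency of exactly 180 days is treated as score 2, and the column names `r_score`, `f_score`, `m_score`, and `rfm_code` are hypothetical.

```python
import numpy as np
import pandas as pd

# Invented RFM rows for illustration.
rfm = pd.DataFrame({
    "customer_id": ["C1", "C2", "C3"],
    "recency": [5, 75, 200],
    "frequency": [9, 3, 1],
    "monetary": [1500.0, 250.0, 30.0],
})

# Recency: fewer days is better, so labels run from 5 down to 1.
# Default right=True gives intervals (-1, 20], (20, 50], (50, 100], (100, 180], (180, inf).
rfm["r_score"] = pd.cut(
    rfm["recency"], bins=[-1, 20, 50, 100, 180, np.inf], labels=[5, 4, 3, 2, 1]
).astype(int)

# Frequency: 1 -> 1, 2 -> 2, 3-4 -> 3, 5-7 -> 4, 8+ -> 5.
rfm["f_score"] = pd.cut(
    rfm["frequency"], bins=[0, 1, 2, 4, 7, np.inf], labels=[1, 2, 3, 4, 5]
).astype(int)

# Monetary: left-closed intervals [0, 50), [50, 200), [200, 500), [500, 1000), [1000, inf).
rfm["m_score"] = pd.cut(
    rfm["monetary"], bins=[0, 50, 200, 500, 1000, np.inf],
    labels=[1, 2, 3, 4, 5], right=False,
).astype(int)

# Combine the three scores into a code such as "555" or "111".
rfm["rfm_code"] = (
    rfm["r_score"].astype(str) + rfm["f_score"].astype(str) + rfm["m_score"].astype(str)
)
print(rfm)
```

Fixed thresholds are easy to explain to stakeholders; quantile-based scoring (for example with `pandas.qcut`) is a common alternative when the value ranges in your data differ from the table.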
Required: Your notebook must show the RFM DataFrame with at least the customer_id, recency, frequency, and monetary columns calculated correctly.
05

K-Means Clustering

After calculating RFM values, apply K-Means clustering to group customers into segments. You must determine the optimal number of clusters using the Elbow method and/or Silhouette analysis.

1
Data Preprocessing & Feature Scaling

K-Means clustering is highly sensitive to the scale of features. Without proper scaling, the monetary value (ranging from $50 to $5,000) would dominate over frequency (1-15 transactions), leading to poor cluster quality.

Why Scaling Matters: Imagine three customers:
  • Customer A: Recency=10 days, Frequency=5, Monetary=$2,000
  • Customer B: Recency=15 days, Frequency=6, Monetary=$2,200
  • Customer C: Recency=12 days, Frequency=5, Monetary=$500

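Without scaling, distance is driven almost entirely by Monetary: A and B look nearly identical (a $200 gap) while C looks far away (a $1,500 gap), and the differences in Recency and Frequency barely register. With scaling, all three dimensions are treated equally.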

StandardScaler Method: Transforms each feature to have mean=0 and standard deviation=1, ensuring all RFM metrics contribute equally to the clustering algorithm.
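To see the effect on customers A, B, and C from the example, the sketch below compares Euclidean distances before and after `StandardScaler`. The helper `dist` is a hypothetical name, and fitting a scaler on only three rows is purely illustrative; in the project the scaler would be fit on the full RFM table.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Rows: customers A, B, C; columns: recency, frequency, monetary.
X = np.array([
    [10.0, 5.0, 2000.0],
    [15.0, 6.0, 2200.0],
    [12.0, 5.0, 500.0],
])

def dist(matrix, i, j):
    """Euclidean distance between rows i and j (hypothetical helper)."""
    return float(np.linalg.norm(matrix[i] - matrix[j]))

# Each column is rescaled to mean 0 and standard deviation 1.
X_scaled = StandardScaler().fit_transform(X)

print("raw    A-B:", round(dist(X, 0, 1), 2), " A-C:", round(dist(X, 0, 2), 2))
print("scaled A-B:", round(dist(X_scaled, 0, 1), 2), " A-C:", round(dist(X_scaled, 0, 2), 2))
```

On the raw values the A-C distance is essentially just the $1,500 monetary gap. After scaling, the recency and frequency differences carry comparable weight, so on this toy input A ends up closer to C than to B.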

2
Elbow Method for Optimal Clusters

The Elbow Method helps you determine the optimal number of customer segments by plotting the within-cluster sum of squares (inertia) against different values of k (number of clusters).

What to Look For

The "elbow" is the point where adding more clusters provides diminishing returns. The inertia drops sharply until this point, then levels off. This inflection point suggests the optimal k.

Interpretation Example

If inertia for k=2 through k=7 is 500→200→100→85→80→78, the elbow appears at k=4: inertia falls by 300 and then by 100 up to that point, but only by 15, 5, and 2 afterward. Going beyond k=4 adds complexity without much improvement.

Test Range: Evaluate k values from 2 to 10. Too few clusters (k=2) oversimplify customer diversity. Too many (k=10+) create segments too small to be actionable.
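A minimal sketch of the inertia sweep over k=2 to k=10 is shown below. The data is a synthetic stand-in for a scaled RFM table (three loose groups of 30 customers generated with NumPy), so the numbers it prints say nothing about the project dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for RFM data: three groups of 30 customers (made-up centers).
rng = np.random.default_rng(42)
centers = np.array([[10.0, 9.0, 1500.0], [60.0, 4.0, 500.0], [200.0, 1.0, 60.0]])
spread = np.array([5.0, 1.0, 100.0])
X = np.vstack([c + rng.normal(size=(30, 3)) * spread for c in centers])
X_scaled = StandardScaler().fit_transform(X)

# Fit K-Means for each k and record the within-cluster sum of squares.
inertias = {}
for k in range(2, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 2))
```

Plotting the keys against the values, for example with matplotlib's `plt.plot(list(inertias), list(inertias.values()), marker="o")`, gives the elbow plot.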

3
Silhouette Analysis for Validation

The Silhouette Score measures how well each customer fits within their assigned cluster compared to other clusters. Scores range from -1 to +1.

Score Range | Interpretation | Action
0.70 - 1.00 | Strong, well-defined clusters | Excellent choice
0.50 - 0.70 | Reasonable cluster structure | Acceptable
0.25 - 0.50 | Weak cluster structure | Consider different k
Below 0.25 | Poor clustering | Try different k
Balance Both Metrics: Choose the k value that shows an elbow in the inertia plot AND maintains a good Silhouette Score (above 0.50). Sometimes the mathematically optimal k may not align with business needs (e.g., k=7 might be too many segments to manage effectively).
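The Silhouette side of that comparison can be sketched with scikit-learn's `silhouette_score`. As before, the input is synthetic (three made-up groups), and `best_k` is a hypothetical variable name; the highest-scoring k is a starting point for the business discussion, not an automatic answer.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for RFM data: three groups of 30 customers (made-up centers).
rng = np.random.default_rng(42)
centers = np.array([[10.0, 9.0, 1500.0], [60.0, 4.0, 500.0], [200.0, 1.0, 60.0]])
spread = np.array([5.0, 1.0, 100.0])
X = np.vstack([c + rng.normal(size=(30, 3)) * spread for c in centers])
X_scaled = StandardScaler().fit_transform(X)

# Mean silhouette score for each candidate k.
scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)

# k with the highest mean silhouette score.
best_k = max(scores, key=scores.get)
for k, value in scores.items():
    print(k, round(value, 3))
print("highest silhouette at k =", best_k)
```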
4
Apply Final Clustering & Assign Segments

After determining your optimal k value (typically 3-5 for customer segmentation), apply K-Means to assign each customer to a cluster. Each cluster number will later be mapped to a meaningful business label.

  • 3-4 clusters (Most Common): Champions, Loyals, At-Risk, Lost
  • 5-6 clusters (More Granular): Add segments like "New Customers" or "Hibernating"
  • 7+ clusters (Advanced): Complex segmentation for large enterprises

Random State: Always set a random state (e.g., 42) to ensure reproducible results. This means running your notebook multiple times will produce identical clusters.
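A minimal sketch of the final fit follows, assuming an RFM table with `recency`, `frequency`, and `monetary` columns. The eight customers and the choice of k=4 are invented for illustration; your own k should come from the Elbow and Silhouette analysis.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Invented RFM table for illustration.
rfm = pd.DataFrame({
    "customer_id": [f"CUST{i:03d}" for i in range(1, 9)],
    "recency": [5, 8, 30, 45, 120, 150, 200, 240],
    "frequency": [9, 10, 5, 6, 2, 3, 1, 1],
    "monetary": [1500.0, 1800.0, 600.0, 700.0, 200.0, 150.0, 60.0, 40.0],
})

features = ["recency", "frequency", "monetary"]
X_scaled = StandardScaler().fit_transform(rfm[features])

k = 4  # hypothetical choice; justify yours with the Elbow and Silhouette results
kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
rfm["cluster"] = kmeans.fit_predict(X_scaled)

# Refitting with the same random_state on the same data gives the same labels.
repeat = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_scaled)
print("identical on rerun:", bool((rfm["cluster"].to_numpy() == repeat).all()))
print(rfm)
```

The cluster numbers themselves are arbitrary: cluster 0 is not necessarily the best segment, so inspect each cluster's RFM averages before assigning labels.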

Required Visualizations
1. Elbow Plot

Inertia vs. k to determine optimal clusters

2. Silhouette Plot

Silhouette score vs. k for validation

3. 3D Scatter

RFM clusters in 3D space (Plotly)

06

Segment Profiling & Recommendations

After clustering, analyze each segment's characteristics and develop actionable marketing recommendations. This is where your business acumen shines: translate data patterns into strategic actions.

Segment Analysis Process

After clustering, you need to understand what makes each segment unique. Calculate summary statistics for each cluster to reveal the behavioral patterns that define each customer group.

Key Statistics to Calculate:
  • Mean RFM values - Average behavior of the segment
  • Customer count - Size of each segment
  • Total revenue - Revenue contribution per segment
  • Percentage of total customers - Relative size
  • Revenue percentage - Value contribution
  • Min/Max ranges - Segment boundaries
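The statistics listed above can be computed in one pass with a pandas `groupby`. This is a sketch under assumptions: the six customers and their cluster assignments are invented, and the output column names (`customers`, `revenue`, `customer_pct`, `revenue_pct`, and so on) are hypothetical.

```python
import pandas as pd

# Invented clustered RFM table for illustration.
rfm = pd.DataFrame({
    "customer_id": ["C1", "C2", "C3", "C4", "C5", "C6"],
    "recency": [5, 12, 40, 130, 150, 210],
    "frequency": [9, 8, 5, 2, 3, 1],
    "monetary": [1500.0, 1100.0, 600.0, 200.0, 150.0, 50.0],
    "cluster": [0, 0, 1, 2, 2, 3],
})

# Size, mean RFM values, monetary range, and total revenue per cluster.
profile = rfm.groupby("cluster").agg(
    customers=("customer_id", "count"),
    recency_mean=("recency", "mean"),
    frequency_mean=("frequency", "mean"),
    monetary_mean=("monetary", "mean"),
    monetary_min=("monetary", "min"),
    monetary_max=("monetary", "max"),
    revenue=("monetary", "sum"),
)

# Relative size and value contribution of each cluster.
profile["customer_pct"] = 100 * profile["customers"] / profile["customers"].sum()
profile["revenue_pct"] = 100 * profile["revenue"] / profile["revenue"].sum()
print(profile.round(1))
```

Min/max columns for recency and frequency can be added the same way if you want full segment boundaries.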
Understanding 3D Visualization

A 3D scatter plot with Recency on the X-axis, Frequency on the Y-axis, and Monetary on the Z-axis provides an intuitive view of how your segments are distributed in RFM space. Each point represents a customer, colored by their assigned segment.

What Good Clustering Looks Like
  • Clear separation between segment colors
  • Tight grouping within each color cluster
  • Minimal overlap between segments
  • No outlier segments with just 1-2 customers
Warning Signs
  • Segments heavily overlapping in 3D space
  • One segment containing 80%+ of customers
  • Segments with fewer than 5% of total customers
  • No clear visual distinction between segments
Example Segment Profiles

Below are typical segments you might discover in retail customer data. Your actual segments may differ based on your chosen k value and the clustering results.

Segment | Recency | Frequency | Monetary | Label | Strategy
Cluster 0 | Very Low (5-20 days) | High (8+ purchases) | High ($1,000+) | Champions | VIP loyalty rewards, early product access, referral incentives, satisfaction surveys
Cluster 1 | Low (20-50 days) | Medium (4-7 purchases) | Medium ($400-$999) | Loyal Customers | Upsell premium products, cross-sell recommendations, exclusive member benefits
Cluster 2 | High (100-180 days) | Low (2-3 purchases) | Low ($100-$399) | At Risk | Win-back campaigns (20% discount), re-engagement emails, personalized recommendations
Cluster 3 | Very High (180+ days) | Very Low (1 purchase) | Low (<$100) | Lost | Final re-activation offer (30-40% off), survey for feedback, consider list removal
Your Segments Will Vary: The exact boundaries and characteristics depend on your data and chosen k value. A 3-cluster solution combines some of these groups, while a 5-cluster solution might split "Loyal Customers" into "Promising" and "Established Loyal" segments.
Expected Segment Distribution

In most retail datasets, you'll see an uneven distribution following these patterns:

  • Champions (5-15%): Small but high-value segment generating 40-60% of revenue
  • Loyal Customers (20-30%): Stable segment with good engagement and moderate spend
  • At Risk (25-35%): Critical segment needing immediate retention efforts
  • Lost/Inactive (30-40%): Largest segment with low engagement and minimal value

Required Recommendations

For each segment, provide comprehensive profiling that bridges data analysis and business action:

1. Segment Name

Use memorable, business-friendly labels that instantly communicate the customer's status. Examples: "Champions," "Loyal Customers," "Potential Loyalists," "At Risk," "Hibernating," "Lost," "New Customers," "Promising."

2. Characteristics

Summarize the RFM profile in plain English. Example: "These customers purchased within the last 30 days, buy frequently (5+ times), and spend moderately ($200-$500 on average)."

3. Size & Distribution

Report both count and percentage. Example: "47 customers (23% of total customer base)." This helps prioritize which segments deserve the most attention.

4. Value Contribution

Calculate total revenue from the segment and its percentage of overall revenue. Example: "Generated $12,500 (31% of total revenue)." This quantifies the business impact.

5. Marketing Strategy (Most Important)

Provide 2-3 specific, actionable recommendations tailored to each segment's behavior:

  • For Champions: "Implement VIP loyalty program with exclusive early access to new products. Incentivize referrals with reward points. Conduct satisfaction surveys to maintain engagement."
  • For At-Risk: "Launch win-back email campaign with 20% discount on next purchase. Conduct exit survey to understand disengagement. Offer personalized product recommendations based on past purchases."
  • For Lost: "Send re-engagement campaign highlighting new product lines. Offer significant discount (30-40%) or free shipping. Consider removing from regular marketing lists to reduce costs."
Think ROI: Match marketing intensity to segment value. High-value segments (Champions, Loyal Customers) justify premium retention costs. Low-value, low-engagement segments may not warrant expensive campaigns.
Deliverable: Create a summary table or visualization showing all segments with their labels, sizes, and recommended strategies.
07

Submission Requirements

Create a public GitHub repository with the exact name shown below:

Required Repository Name
customer-segmentation-project
github.com/<your-username>/customer-segmentation-project
Required Project Structure
customer-segmentation-project/
├── data/
│   └── customer_transactions.csv    # The dataset (download from above)
├── notebooks/
│   └── customer_segmentation.ipynb  # Your main analysis notebook
├── outputs/
│   └── customer_segments.csv        # Final segmented customer data
├── requirements.txt                 # Python dependencies
└── README.md                        # REQUIRED - see contents below
README.md Must Include:
  • Your full name and submission date
  • Project overview and business context
  • RFM methodology explanation
  • Number of clusters chosen and why
  • Segment profiles with labels and strategies
  • Screenshots of key visualizations (Elbow plot, 3D scatter)
Python Dependencies

Create a requirements.txt file in your project root listing all Python packages needed to run your notebook. This allows anyone to recreate your environment.

Required Libraries:
  • pandas - Data manipulation and RFM calculations
  • numpy - Numerical operations
  • scikit-learn - K-Means clustering and scaling
  • plotly - Interactive 3D visualizations
  • matplotlib - Static plots (Elbow method)
  • seaborn - Statistical visualizations
  • jupyter - Notebook environment

Version Format: Use package>=version to ensure minimum compatible versions (e.g., pandas>=2.0.0).

Output File: customer_segments.csv

Export a CSV file containing your segmentation results. This file should be saved in an outputs/ folder and include all customers with their assigned segments.

Required Column | Description | Example Value
customer_id | Unique customer identifier | CUST001
recency | Days since last purchase | 15
frequency | Total number of transactions | 8
monetary | Total amount spent | 1250.50
cluster | Numeric cluster assignment | 0
segment_label | Business-friendly segment name | Champions
Export Process: Use pandas .to_csv() method with index=False to prevent adding row numbers. Map cluster numbers (0, 1, 2, 3) to meaningful labels ("Champions", "Loyal Customers", etc.) before exporting.
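The export step might look like the sketch below. The four rows and the `label_map` dictionary are invented: which cluster number deserves which label depends on your own segment profiles, so the mapping shown is an assumption, not a rule.

```python
from pathlib import Path

import pandas as pd

# Invented clustered RFM rows for illustration.
segments = pd.DataFrame({
    "customer_id": ["CUST001", "CUST002", "CUST003", "CUST004"],
    "recency": [15, 40, 130, 220],
    "frequency": [8, 5, 2, 1],
    "monetary": [1250.50, 640.00, 180.00, 45.00],
    "cluster": [0, 1, 2, 3],
})

# Hypothetical mapping; derive yours from each cluster's RFM averages.
label_map = {0: "Champions", 1: "Loyal Customers", 2: "At Risk", 3: "Lost"}
segments["segment_label"] = segments["cluster"].map(label_map)

# Relative path, so the notebook runs from any clone of the repository.
out_dir = Path("outputs")
out_dir.mkdir(exist_ok=True)
out_path = out_dir / "customer_segments.csv"

# index=False keeps pandas row numbers out of the file.
segments.to_csv(out_path, index=False)
print(out_path.read_text())
```

If the notebook lives in `notebooks/`, adjust the relative path (for example `Path("..") / "outputs"`) so the file lands in the repository's `outputs/` folder.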
Do Include
  • Complete RFM calculation with code
  • Elbow method and Silhouette analysis
  • At least 5 visualizations
  • Segment profiles with marketing strategies
  • Exported customer_segments.csv
  • README with methodology explanation
Do Not Include
  • Virtual environment folders (venv, .env)
  • Any .pyc or __pycache__ files
  • Unexecuted notebooks
  • Hardcoded file paths
  • Clusters without business interpretation

08

Grading Rubric

Your project will be evaluated on both technical execution and business insight. A perfect score requires not just correct implementation, but also clear communication of findings and actionable recommendations.

Criteria | Points | Description
1. RFM Calculation | 20 | Correct computation of Recency (days since last purchase), Frequency (transaction count), and Monetary (total spending) values for each customer. Must use appropriate date handling and aggregation functions.
2. Data Preprocessing | 10 | Proper data cleaning (if needed), datetime conversion, and StandardScaler implementation. Features must be scaled before clustering. Handle any edge cases (e.g., customers with single transactions).
3. K-Means Implementation | 20 | Correct K-Means clustering with appropriate parameters (random_state for reproducibility, n_init=10). Clear justification for chosen k value based on analysis, not arbitrary selection.
4. Elbow & Silhouette Analysis | 10 | Both Elbow plot (inertia vs k) and Silhouette scores calculated for k=2 to k=10. Visual plots included with clear labeling. Written explanation of how optimal k was determined.
5. Visualizations | 15 | Minimum 5 professional visualizations: Elbow plot, Silhouette plot, 3D scatter (Plotly), cluster size distribution, and one additional insight chart. All charts must have titles, axis labels, and legends.
6. Segment Profiling | 15 | Each segment has: (1) Memorable business name, (2) Clear RFM characteristics, (3) Size and percentage, (4) Revenue contribution, (5) Summary table or profile card. Segments must be distinct and interpretable.
7. Marketing Recommendations | 10 | 2-3 specific, actionable marketing strategies for each segment. Recommendations must be tailored to segment behavior (not generic). Examples: loyalty programs for champions, win-back campaigns for at-risk customers.
Total | 100 | Weighted Score
Grading Breakdown
  • Excellent (90-100): Outstanding work with insights
  • Good (75-89): Solid implementation
  • Passing (60-74): Meets basic requirements

Helpful Resources