Assignment Overview
In this assignment, you will build a complete E-Commerce Sales Analytics Dashboard that analyzes sales data from an online retail store. This comprehensive project requires you to apply ALL concepts from Module 11: NumPy arrays, Pandas DataFrames, data cleaning, statistical analysis, and data visualization with Matplotlib.
NumPy Operations (11.1)
Array creation, mathematical operations, statistical functions, broadcasting
Pandas Analysis (11.2)
DataFrames, filtering, groupby, merging, pivot tables, data cleaning
Visualization (11.3)
Line plots, bar charts, histograms, scatter plots, subplots
The Scenario
ShopEasy E-Commerce
You have been hired as a Data Analyst at ShopEasy, an online retail company. The business intelligence team has given you this task:
"We have 12 months of sales data but it's messy - missing values, duplicates, and inconsistent formats. We need you to clean the data, analyze sales trends, identify our best products and customers, and create visualizations for our quarterly business review presentation."
Your Task
Create a Jupyter Notebook called ecommerce_analytics.ipynb that performs
comprehensive data analysis on the sales dataset. Your notebook should tell a data story
with clear markdown explanations, clean code, and professional visualizations.
The Dataset
You will work with a sales dataset. Create this CSV file or download it from the link below:
File: sales_data.csv
```csv
order_id,customer_id,product_name,category,quantity,unit_price,order_date,region,payment_method
1001,C101,Wireless Mouse,Electronics,2,29.99,2025-01-15,North,Credit Card
1002,C102,USB-C Cable,Electronics,3,12.99,2025-01-15,South,PayPal
1003,C103,Office Chair,Furniture,1,199.99,2025-01-16,East,Credit Card
1004,C101,Keyboard,Electronics,1,79.99,2025-01-16,North,Credit Card
1005,C104,Desk Lamp,Home,2,34.99,2025-01-17,West,Debit Card
1006,C105,Monitor Stand,Furniture,1,49.99,2025-01-17,North,PayPal
1007,C102,Webcam HD,Electronics,1,89.99,2025-01-18,South,Credit Card
1008,C106,Notebook Set,Office,5,8.99,2025-01-18,East,Cash
1009,C107,Wireless Earbuds,Electronics,2,59.99,2025-01-19,West,Credit Card
1010,C108,Desk Organizer,Office,3,24.99,2025-01-19,North,Debit Card
1011,C103,Bookshelf,Furniture,1,149.99,2025-01-20,East,Credit Card
1012,C109,Power Bank,Electronics,2,39.99,2025-01-20,South,PayPal
1013,C110,LED Strip,Home,4,19.99,2025-01-21,West,Credit Card
1014,C101,Mouse Pad,Electronics,1,,2025-01-21,North,Credit Card
1015,C111,Filing Cabinet,Furniture,1,89.99,2025-01-22,East,Debit Card
1016,C112,USB Hub,Electronics,2,29.99,2025-01-22,,Credit Card
1017,C113,Desk Mat,Office,1,29.99,2025-01-23,West,PayPal
1018,C114,Smart Speaker,Electronics,1,79.99,2025-01-23,North,Credit Card
1019,C115,Wall Clock,Home,2,29.99,2025-01-24,South,Cash
1020,C102,Laptop Stand,Electronics,1,44.99,2025-01-24,South,PayPal
1021,C116,Pen Holder,Office,-2,14.99,2025-01-25,East,Debit Card
1022,C117,Wireless Charger,Electronics,1,34.99,2025-01-25,West,Credit Card
1023,C118,Table Fan,Home,1,49.99,2025-01-26,North,PayPal
1024,C103,Office Chair,Furniture,1,199.99,2025-01-26,East,Credit Card
1025,C119,HDMI Cable,Electronics,5,9.99,2025-01-27,South,Debit Card
```
Note the intentional data quality issues in this dataset:
- Missing values (empty unit_price in order 1014, empty region in order 1016)
- Invalid data (negative quantity in order 1021)
- Duplicate-looking orders (the same customer ordering the same product, e.g. orders 1003 and 1024)
Column Descriptions
- order_id - Unique order identifier (integer)
- customer_id - Customer identifier (string)
- product_name - Name of the product (string)
- category - Product category: Electronics, Furniture, Home, Office (string)
- quantity - Number of items ordered (integer)
- unit_price - Price per unit in dollars (float)
- order_date - Date of order (YYYY-MM-DD)
- region - Sales region: North, South, East, West (string)
- payment_method - Payment type: Credit Card, Debit Card, PayPal, Cash (string)
Requirements
Your ecommerce_analytics.ipynb must include ALL of the following sections.
Each section is mandatory and will be graded.
Data Loading & Initial Exploration
Load the dataset and perform initial exploration:
- Import required libraries (numpy, pandas, matplotlib)
- Load the CSV using pd.read_csv()
- Display shape, dtypes, and the first/last 5 rows
- Use df.info() and df.describe()
- Check for missing values with df.isnull().sum()
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load data
df = pd.read_csv('sales_data.csv')

# Initial exploration
print(f"Dataset Shape: {df.shape}")
print(f"\nData Types:\n{df.dtypes}")
print(f"\nMissing Values:\n{df.isnull().sum()}")
```
Data Cleaning
Clean the dataset by handling all data quality issues:
- Handle missing values (fill or drop, with justification)
- Remove or fix invalid data (negative quantities)
- Convert order_date to datetime type
- Check for and handle duplicates
- Create a new column: total_amount = quantity * unit_price
```python
# Handle missing values (assign the result back; column-level
# fillna(..., inplace=True) is deprecated in modern pandas)
df['unit_price'] = df['unit_price'].fillna(df['unit_price'].median())
df['region'] = df['region'].fillna('Unknown')

# Remove invalid quantities
df = df[df['quantity'] > 0]

# Drop exact duplicate rows
df = df.drop_duplicates()

# Convert date strings to datetime
df['order_date'] = pd.to_datetime(df['order_date'])

# Create total_amount column
df['total_amount'] = df['quantity'] * df['unit_price']
```
NumPy Statistical Analysis
Use NumPy to compute statistics on numeric columns:
- Calculate mean, median, std using NumPy functions
- Find min, max, percentiles (25th, 50th, 75th)
- Create a NumPy array from the total_amount column
- Use boolean indexing to filter high-value orders (> $100)
- Calculate correlation between quantity and total_amount
```python
# NumPy statistics on the order totals
amounts = df['total_amount'].values
print(f"Mean: ${np.mean(amounts):.2f}")
print(f"Median: ${np.median(amounts):.2f}")
print(f"Std Dev: ${np.std(amounts):.2f}")
print(f"25th Percentile: ${np.percentile(amounts, 25):.2f}")
print(f"75th Percentile: ${np.percentile(amounts, 75):.2f}")

# High-value orders via boolean indexing
high_value = amounts[amounts > 100]
print(f"\nHigh-value orders (>$100): {len(high_value)}")

# Correlation between quantity and total_amount
corr = np.corrcoef(df['quantity'], df['total_amount'])[0, 1]
print(f"Correlation (quantity vs total): {corr:.2f}")
```
Pandas GroupBy Analysis
Perform aggregations using Pandas groupby:
- Total sales by category
- Average order value by region
- Top 5 customers by total spending
- Sales count by payment method
- Best-selling products by quantity
```python
# Total sales by category
category_sales = df.groupby('category')['total_amount'].sum().sort_values(ascending=False)

# Average order value by region
region_avg = df.groupby('region')['total_amount'].mean()

# Top 5 customers by total spending
top_customers = df.groupby('customer_id')['total_amount'].sum().nlargest(5)

# Order count by payment method
payment_counts = df['payment_method'].value_counts()

# Best-selling products by quantity
best_sellers = df.groupby('product_name')['quantity'].sum().nlargest(5)
```
Time Series Analysis
Analyze sales trends over time:
- Extract the day of week and day name from order_date
- Calculate daily total sales
- Find the busiest day of the week
- Calculate rolling 3-day average sales
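The steps above can be sketched as follows; the tiny inline frame here is a stand-in for your cleaned dataset (in the notebook, use the df you built in the cleaning section):

```python
import pandas as pd

# Stand-in for the cleaned dataset (same column names as the assignment schema)
df = pd.DataFrame({
    'order_date': pd.to_datetime(['2025-01-15', '2025-01-15', '2025-01-16', '2025-01-17']),
    'total_amount': [59.98, 38.97, 199.99, 69.98],
})

# Extract day-of-week features from the datetime column
df['day_of_week'] = df['order_date'].dt.dayofweek   # Monday = 0
df['day_name'] = df['order_date'].dt.day_name()

# Daily total sales
daily_sales = df.groupby('order_date')['total_amount'].sum()

# Busiest day of the week, measured by total sales
busiest_day = df.groupby('day_name')['total_amount'].sum().idxmax()

# Rolling 3-day average (min_periods=1 so the first days aren't NaN)
rolling_avg = daily_sales.rolling(window=3, min_periods=1).mean()
```

With the full dataset, daily_sales becomes the input for your line-plot trend chart as well.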
Data Visualizations (Minimum 6)
Create at least 6 professional visualizations using Matplotlib:
- Bar Chart: Sales by category
- Pie Chart: Sales distribution by region
- Line Plot: Daily sales trend
- Histogram: Distribution of order values
- Horizontal Bar: Top 5 products by revenue
- Grouped Bar: Sales by category per region
```python
import os
os.makedirs('charts', exist_ok=True)  # savefig fails if charts/ doesn't exist

# Example: Sales by category bar chart
fig, ax = plt.subplots(figsize=(10, 6))
category_sales.plot(kind='bar', color=['#6366f1', '#10b981', '#f59e0b', '#ef4444'], ax=ax)
ax.set_title('Total Sales by Category', fontsize=14, fontweight='bold')
ax.set_xlabel('Category')
ax.set_ylabel('Total Sales ($)')
ax.tick_params(axis='x', rotation=45)
plt.tight_layout()
plt.savefig('charts/sales_by_category.png', dpi=300)
plt.show()
```
Business Insights Summary
Write a markdown summary with key findings:
- What category generates the most revenue?
- Which region has the highest average order value?
- Who are the top 3 customers and what do they buy?
- What day of the week has the most sales?
- 3 actionable recommendations for the business
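Most of these questions fall straight out of groupby results. A minimal sketch of the patterns involved, again using a tiny stand-in frame (your real answers depend on your cleaned data):

```python
import pandas as pd

# Stand-in frame; in the notebook use your cleaned df with total_amount
df = pd.DataFrame({
    'category': ['Electronics', 'Furniture', 'Electronics'],
    'region': ['North', 'East', 'North'],
    'customer_id': ['C101', 'C103', 'C101'],
    'total_amount': [59.98, 199.99, 79.99],
})

# Category generating the most revenue
top_category = df.groupby('category')['total_amount'].sum().idxmax()

# Region with the highest average order value
top_region = df.groupby('region')['total_amount'].mean().idxmax()

# Top 3 customers by total spending
top3 = df.groupby('customer_id')['total_amount'].sum().nlargest(3)
```

Quote the actual numbers (not just the names) in your markdown summary.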
Export Results
Save your analysis outputs:
- Export the cleaned DataFrame to cleaned_sales_data.csv
- Save all charts to a charts/ folder
- Create a summary statistics CSV file
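The exports can be as short as the sketch below (the stand-in frame replaces your cleaned df; file names match the required repository layout):

```python
import os
import pandas as pd

df = pd.DataFrame({'total_amount': [59.98, 199.99]})  # stand-in for the cleaned df

# Create the charts/ folder before any savefig calls target it
os.makedirs('charts', exist_ok=True)

# Export cleaned data (index=False avoids writing a spurious index column)
df.to_csv('cleaned_sales_data.csv', index=False)

# Summary statistics as their own CSV
df.describe().to_csv('summary_stats.csv')
```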
Submission
Create a public GitHub repository with the exact name shown below:
Required Repository Name
ecommerce-sales-analytics
Required Files
```
ecommerce-sales-analytics/
├── ecommerce_analytics.ipynb   # Your Jupyter Notebook with all analysis
├── sales_data.csv              # Original dataset
├── cleaned_sales_data.csv      # Cleaned dataset (output)
├── summary_stats.csv           # Summary statistics (output)
├── charts/                     # Folder with all visualization images
│   ├── sales_by_category.png
│   ├── sales_by_region.png
│   ├── daily_sales_trend.png
│   ├── order_value_distribution.png
│   ├── top_products.png
│   └── category_by_region.png
├── requirements.txt            # Python dependencies
└── README.md                   # Project documentation
```
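For reference, a minimal requirements.txt for this stack could be as simple as the following (unpinned here; pin the exact versions you used if you want reproducible installs):

```
numpy
pandas
matplotlib
```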
README.md Must Include:
- Your full name and submission date
- Project overview and objectives
- Key findings and insights (with numbers)
- Sample visualizations (embed 2-3 charts)
- How to run the notebook
Do Include
- All 8 analysis sections completed
- At least 6 visualizations saved as PNG
- Clear markdown explanations in notebook
- Business insights with recommendations
- All output files generated
- Code runs without errors
Do Not Include
- Jupyter checkpoints (.ipynb_checkpoints/)
- Virtual environment folders
- __pycache__ or *.pyc files
- Large unnecessary files
- Code that doesn't execute
- Plagiarized analysis
When submitting, enter your GitHub username; your repository will be verified automatically.
Grading Rubric
Your assignment will be graded on the following criteria:
| Criteria | Points | Description |
|---|---|---|
| Data Loading & Exploration | 15 | Proper loading, shape/dtype display, missing value identification |
| Data Cleaning | 30 | Handling missing values, invalid data, duplicates, type conversions |
| NumPy Analysis | 25 | Statistical calculations, array operations, boolean indexing |
| Pandas GroupBy | 30 | Correct aggregations, sorting, filtering, top-N queries |
| Time Series Analysis | 20 | Date extraction, daily aggregation, trend analysis |
| Visualizations (6+) | 40 | Professional charts with titles, labels, legends, saved to files |
| Business Insights | 20 | Clear findings, data-driven recommendations, well-written summary |
| Code Quality | 20 | Clean code, markdown explanations, proper organization |
| **Total** | **200** | |
Ready to Submit?
Make sure you have completed all requirements and reviewed the grading rubric above.
What You Will Practice
NumPy Operations (11.1)
Array creation, mathematical operations, statistical functions, boolean indexing, percentiles
Pandas DataFrames (11.2)
Data loading, cleaning, filtering, groupby aggregations, sorting, merging, pivot tables
Matplotlib Visualization (11.3)
Bar charts, pie charts, line plots, histograms, subplots, customization, saving figures
Business Analytics
Translating data into insights, identifying trends, making data-driven recommendations
Pro Tips
Pandas Tips
- Use df.info() to see non-null counts
- Chain methods: df.groupby().agg().sort_values()
- Use .copy() to avoid SettingWithCopyWarning
- Explore df.describe(include='all')
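For instance, the chaining and .copy() tips look like this in practice (toy frame purely for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'category': ['Electronics', 'Office', 'Electronics'],
    'total_amount': [29.99, 8.99, 79.99],
})

# Chained groupby -> aggregate -> sort in a single expression
ranked = (df.groupby('category')['total_amount']
            .agg(['sum', 'mean'])
            .sort_values('sum', ascending=False))

# .copy() makes the subset an independent frame, so assigning a new
# column later doesn't trigger SettingWithCopyWarning
electronics = df[df['category'] == 'Electronics'].copy()
electronics['discounted'] = electronics['total_amount'] * 0.9
```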
Visualization Tips
- Always add titles and axis labels
- Use plt.tight_layout() before saving
- Save at 300 DPI for high quality
- Use consistent color schemes
Data Cleaning Tips
- Document why you fill/drop missing values
- Check for outliers using IQR method
- Verify data types after loading
- Look for impossible values (negative prices)
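The IQR outlier check mentioned above flags values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]; a small NumPy sketch (the sample array is made up for illustration):

```python
import numpy as np

# Hypothetical order totals, with one obvious extreme value
amounts = np.array([59.98, 38.97, 199.99, 44.95, 24.99, 899.00])

# Quartiles and the interquartile range
q1, q3 = np.percentile(amounts, [25, 75])
iqr = q3 - q1

# Tukey fences: anything outside them is a candidate outlier
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = amounts[(amounts < lower) | (amounts > upper)]
```

Whether you drop, cap, or keep flagged values, document the decision in markdown.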
Common Mistakes
- Forgetting to convert date strings to datetime
- Not handling the SettingWithCopyWarning
- Charts without titles or labels
- Not running all cells before submitting