Assignment 11-A

Data Science Analysis Project

Build a complete E-Commerce Sales Analytics Dashboard using NumPy, Pandas, and Matplotlib. Perform exploratory data analysis, clean messy data, compute statistics, create visualizations, and generate actionable business insights.

6-8 hours
Intermediate
200 Points
What You'll Build
  • NumPy array operations
  • Pandas DataFrames & Series
  • Data cleaning & transformation
  • Statistical analysis
  • Matplotlib visualizations
Contents
01

Assignment Overview

In this assignment, you will build a complete E-Commerce Sales Analytics Dashboard that analyzes sales data from an online retail store. This comprehensive project requires you to apply ALL concepts from Module 11: NumPy arrays, Pandas DataFrames, data cleaning, statistical analysis, and data visualization with Matplotlib.

Required Libraries: You must use NumPy, Pandas, and Matplotlib. Seaborn is optional but encouraged for enhanced visualizations.
Skills Applied: This assignment tests your understanding of NumPy operations (Topic 11.1), Pandas data manipulation (Topic 11.2), and data visualization (Topic 11.3) from Module 11.
NumPy Operations (11.1)

Array creation, mathematical operations, statistical functions, broadcasting

Pandas Analysis (11.2)

DataFrames, filtering, groupby, merging, pivot tables, data cleaning

Visualization (11.3)

Line plots, bar charts, histograms, scatter plots, subplots

02

The Scenario

ShopEasy E-Commerce

You have been hired as a Data Analyst at ShopEasy, an online retail company. The business intelligence team has given you this task:

"We have 12 months of sales data but it's messy - missing values, duplicates, and inconsistent formats. We need you to clean the data, analyze sales trends, identify our best products and customers, and create visualizations for our quarterly business review presentation."

Your Task

Create a Jupyter Notebook called ecommerce_analytics.ipynb that performs comprehensive data analysis on the sales dataset. Your notebook should tell a data story with clear markdown explanations, clean code, and professional visualizations.

03

The Dataset

You will work with a small sales dataset. Create the CSV file with the contents shown below:

File: sales_data.csv

order_id,customer_id,product_name,category,quantity,unit_price,order_date,region,payment_method
1001,C101,Wireless Mouse,Electronics,2,29.99,2025-01-15,North,Credit Card
1002,C102,USB-C Cable,Electronics,3,12.99,2025-01-15,South,PayPal
1003,C103,Office Chair,Furniture,1,199.99,2025-01-16,East,Credit Card
1004,C101,Keyboard,Electronics,1,79.99,2025-01-16,North,Credit Card
1005,C104,Desk Lamp,Home,2,34.99,2025-01-17,West,Debit Card
1006,C105,Monitor Stand,Furniture,1,49.99,2025-01-17,North,PayPal
1007,C102,Webcam HD,Electronics,1,89.99,2025-01-18,South,Credit Card
1008,C106,Notebook Set,Office,5,8.99,2025-01-18,East,Cash
1009,C107,Wireless Earbuds,Electronics,2,59.99,2025-01-19,West,Credit Card
1010,C108,Desk Organizer,Office,3,24.99,2025-01-19,North,Debit Card
1011,C103,Bookshelf,Furniture,1,149.99,2025-01-20,East,Credit Card
1012,C109,Power Bank,Electronics,2,39.99,2025-01-20,South,PayPal
1013,C110,LED Strip,Home,4,19.99,2025-01-21,West,Credit Card
1014,C101,Mouse Pad,Electronics,1,,2025-01-21,North,Credit Card
1015,C111,Filing Cabinet,Furniture,1,89.99,2025-01-22,East,Debit Card
1016,C112,USB Hub,Electronics,2,29.99,2025-01-22,,Credit Card
1017,C113,Desk Mat,Office,1,29.99,2025-01-23,West,PayPal
1018,C114,Smart Speaker,Electronics,1,79.99,2025-01-23,North,Credit Card
1019,C115,Wall Clock,Home,2,29.99,2025-01-24,South,Cash
1020,C102,Laptop Stand,Electronics,1,44.99,2025-01-24,South,PayPal
1021,C116,Pen Holder,Office,-2,14.99,2025-01-25,East,Debit Card
1022,C117,Wireless Charger,Electronics,1,34.99,2025-01-25,West,Credit Card
1023,C118,Table Fan,Home,1,49.99,2025-01-26,North,PayPal
1024,C103,Office Chair,Furniture,1,199.99,2025-01-26,East,Credit Card
1025,C119,HDMI Cable,Electronics,5,9.99,2025-01-27,South,Debit Card
Note: The dataset intentionally contains issues you must handle:
  • Missing values (empty cells in unit_price and region)
  • Invalid data (a negative quantity in order 1021)
  • Repeat orders (the same customer ordering the same product, e.g. orders 1003 and 1024)
Column Descriptions
  • order_id - Unique order identifier (integer)
  • customer_id - Customer identifier (string)
  • product_name - Name of the product (string)
  • category - Product category: Electronics, Furniture, Home, Office (string)
  • quantity - Number of items ordered (integer)
  • unit_price - Price per unit in dollars (float)
  • order_date - Date of order (YYYY-MM-DD)
  • region - Sales region: North, South, East, West (string)
  • payment_method - Payment type: Credit Card, Debit Card, PayPal, Cash (string)
04

Requirements

Your ecommerce_analytics.ipynb must include ALL of the following sections. Each section is mandatory and will be graded.

1
Data Loading & Initial Exploration

Load the dataset and perform initial exploration:

  • Import required libraries (numpy, pandas, matplotlib)
  • Load CSV using pd.read_csv()
  • Display shape, dtypes, and first/last 5 rows
  • Use df.info() and df.describe()
  • Check for missing values with df.isnull().sum()
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load data
df = pd.read_csv('sales_data.csv')

# Initial exploration
print(f"Dataset Shape: {df.shape}")
print(f"\nData Types:\n{df.dtypes}")
print(f"\nMissing Values:\n{df.isnull().sum()}")
2
Data Cleaning

Clean the dataset by handling all data quality issues:

  • Handle missing values (fill or drop with justification)
  • Remove or fix invalid data (negative quantities)
  • Convert order_date to datetime type
  • Check for and handle duplicates
  • Create a new column: total_amount = quantity * unit_price
# Handle missing values
df['unit_price'] = df['unit_price'].fillna(df['unit_price'].median())
df['region'] = df['region'].fillna('Unknown')

# Remove invalid quantities
df = df[df['quantity'] > 0]

# Convert date
df['order_date'] = pd.to_datetime(df['order_date'])

# Create total_amount column
df['total_amount'] = df['quantity'] * df['unit_price']
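The requirements above also call for duplicate handling, which the snippet omits. One possible approach is sketched below; treating a repeated customer/product/date combination as a duplicate is an assumption, and the tiny frame is purely illustrative (in your notebook, apply this to the loaded sales DataFrame):

```python
import pandas as pd

# Illustrative mini-frame; in the notebook use the loaded sales DataFrame
df = pd.DataFrame({
    'customer_id': ['C103', 'C103', 'C101'],
    'product_name': ['Office Chair', 'Office Chair', 'Keyboard'],
    'order_date': ['2025-01-16', '2025-01-16', '2025-01-16'],
})

# Flag repeats of the same customer/product/date combination
dup_mask = df.duplicated(subset=['customer_id', 'product_name', 'order_date'], keep='first')
print(f"Duplicate rows: {dup_mask.sum()}")

# Keep only the first occurrence of each duplicate group
df = df[~dup_mask].reset_index(drop=True)
```

Whatever rule you choose, document it in a markdown cell, since "duplicate" can reasonably mean different things for order data.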
3
NumPy Statistical Analysis

Use NumPy to compute statistics on numeric columns:

  • Calculate mean, median, std using NumPy functions
  • Find min, max, percentiles (25th, 50th, 75th)
  • Create a NumPy array from the total_amount column
  • Use boolean indexing to filter high-value orders (> $100)
  • Calculate correlation between quantity and total_amount
# NumPy statistics
amounts = df['total_amount'].to_numpy()

print(f"Mean: ${np.mean(amounts):.2f}")
print(f"Median: ${np.median(amounts):.2f}")
print(f"Std Dev: ${np.std(amounts):.2f}")
print(f"25th Percentile: ${np.percentile(amounts, 25):.2f}")
print(f"75th Percentile: ${np.percentile(amounts, 75):.2f}")

# High-value orders
high_value = amounts[amounts > 100]
print(f"\nHigh-value orders (>$100): {len(high_value)}")
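The correlation requirement can be met with np.corrcoef, which returns a 2x2 correlation matrix; the off-diagonal entry is the correlation coefficient. A sketch with illustrative arrays (in the notebook, use the actual columns):

```python
import numpy as np

# Illustrative values; in the notebook use df['quantity'] and df['total_amount']
quantity = np.array([2, 3, 1, 1, 2])
total_amount = np.array([59.98, 38.97, 199.99, 79.99, 69.98])

# np.corrcoef returns a 2x2 matrix; element [0, 1] is the correlation
corr = np.corrcoef(quantity, total_amount)[0, 1]
print(f"Correlation (quantity vs total_amount): {corr:.3f}")
```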
4
Pandas GroupBy Analysis

Perform aggregations using Pandas groupby:

  • Total sales by category
  • Average order value by region
  • Top 5 customers by total spending
  • Sales count by payment method
  • Best-selling products by quantity
# Sales by category
category_sales = df.groupby('category')['total_amount'].sum().sort_values(ascending=False)

# Average order by region
region_avg = df.groupby('region')['total_amount'].mean()

# Top 5 customers
top_customers = df.groupby('customer_id')['total_amount'].sum().nlargest(5)
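The remaining two aggregations (payment-method counts and best sellers) can be sketched the same way; the mini-frame below is illustrative, and in the notebook these run on the cleaned sales DataFrame:

```python
import pandas as pd

# Illustrative mini-frame; in the notebook use the cleaned sales DataFrame
df = pd.DataFrame({
    'payment_method': ['Credit Card', 'PayPal', 'Credit Card', 'Cash'],
    'product_name': ['Wireless Mouse', 'USB-C Cable', 'Wireless Mouse', 'Notebook Set'],
    'quantity': [2, 3, 1, 5],
})

# Sales count by payment method
payment_counts = df['payment_method'].value_counts()

# Best-selling products by total quantity ordered
best_sellers = df.groupby('product_name')['quantity'].sum().sort_values(ascending=False)
print(payment_counts)
print(best_sellers)
```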
5
Time Series Analysis

Analyze sales trends over time:

  • Extract day of week, day name from order_date
  • Calculate daily total sales
  • Find the busiest day of the week
  • Calculate rolling 3-day average sales
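The time-series steps above can be sketched as follows; the column names match the dataset, but the tiny frame is illustrative (in the notebook, run this on the cleaned DataFrame after the datetime conversion):

```python
import pandas as pd

# Illustrative mini-frame; in the notebook use the cleaned sales DataFrame
df = pd.DataFrame({
    'order_date': pd.to_datetime(['2025-01-15', '2025-01-15', '2025-01-16', '2025-01-17']),
    'total_amount': [59.98, 38.97, 199.99, 69.98],
})

# Day-of-week features from the datetime column
df['day_of_week'] = df['order_date'].dt.dayofweek   # Monday = 0
df['day_name'] = df['order_date'].dt.day_name()

# Daily totals and the busiest day of the week
daily_sales = df.groupby('order_date')['total_amount'].sum()
busiest_day = df.groupby('day_name')['total_amount'].sum().idxmax()

# Rolling 3-day average of daily sales (first two values are NaN)
rolling_avg = daily_sales.rolling(window=3).mean()
print(busiest_day)
print(rolling_avg)
```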
6
Data Visualizations (Minimum 6)

Create at least 6 professional visualizations using Matplotlib:

  • Bar Chart: Sales by category
  • Pie Chart: Sales distribution by region
  • Line Plot: Daily sales trend
  • Histogram: Distribution of order values
  • Horizontal Bar: Top 5 products by revenue
  • Grouped Bar: Sales by category per region
# Example: Sales by category bar chart
import os
os.makedirs('charts', exist_ok=True)  # savefig fails if the folder does not exist

fig, ax = plt.subplots(figsize=(10, 6))
category_sales.plot(kind='bar', color=['#6366f1', '#10b981', '#f59e0b', '#ef4444'], ax=ax)
ax.set_title('Total Sales by Category', fontsize=14, fontweight='bold')
ax.set_xlabel('Category')
ax.set_ylabel('Total Sales ($)')
ax.tick_params(axis='x', rotation=45)
plt.tight_layout()
plt.savefig('charts/sales_by_category.png', dpi=300)
plt.show()
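A second required chart type, the pie chart of sales by region, follows the same pattern; the region totals below are made-up placeholders (in the notebook, compute them with a groupby), and the Agg backend line is only needed when running headless:

```python
import matplotlib
matplotlib.use('Agg')  # headless-safe; drop this line inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder region totals; in the notebook compute them from the cleaned data
region_sales = pd.Series({'North': 540.0, 'South': 310.0, 'East': 690.0, 'West': 260.0})

fig, ax = plt.subplots(figsize=(8, 8))
ax.pie(region_sales, labels=region_sales.index, autopct='%1.1f%%', startangle=90)
ax.set_title('Sales Distribution by Region', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('sales_by_region.png', dpi=300)  # save into charts/ in your project
plt.close(fig)
```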
7
Business Insights Summary

Write a markdown summary with key findings:

  • What category generates the most revenue?
  • Which region has the highest average order value?
  • Who are the top 3 customers and what do they buy?
  • What day of the week has the most sales?
  • 3 actionable recommendations for the business
8
Export Results

Save your analysis outputs:

  • Export cleaned DataFrame to cleaned_sales_data.csv
  • Save all charts to a charts/ folder
  • Create a summary statistics CSV file
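These exports can be sketched as follows; the two-row frame is illustrative (in the notebook, export your actual cleaned DataFrame), and using describe() for the summary-statistics file is one reasonable choice, not a requirement:

```python
import os
import pandas as pd

# Illustrative frame; in the notebook use your actual cleaned DataFrame
df = pd.DataFrame({
    'order_id': [1001, 1002],
    'total_amount': [59.98, 38.97],
})

os.makedirs('charts', exist_ok=True)              # folder for the saved figures
df.to_csv('cleaned_sales_data.csv', index=False)  # cleaned dataset output

# describe() gives a quick summary-statistics table to export
df.describe().to_csv('summary_stats.csv')
```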
05

Submission

Create a public GitHub repository with the exact name shown below:

Required Repository Name
ecommerce-sales-analytics
github.com/<your-username>/ecommerce-sales-analytics
Required Files
ecommerce-sales-analytics/
├── ecommerce_analytics.ipynb    # Your Jupyter Notebook with all analysis
├── sales_data.csv               # Original dataset
├── cleaned_sales_data.csv       # Cleaned dataset (output)
├── summary_stats.csv            # Summary statistics (output)
├── charts/                      # Folder with all visualization images
│   ├── sales_by_category.png
│   ├── sales_by_region.png
│   ├── daily_sales_trend.png
│   ├── order_value_distribution.png
│   ├── top_products.png
│   └── category_by_region.png
├── requirements.txt             # Python dependencies
└── README.md                    # Project documentation
README.md Must Include:
  • Your full name and submission date
  • Project overview and objectives
  • Key findings and insights (with numbers)
  • Sample visualizations (embed 2-3 charts)
  • How to run the notebook
Do Include
  • All 8 analysis sections completed
  • At least 6 visualizations saved as PNG
  • Clear markdown explanations in notebook
  • Business insights with recommendations
  • All output files generated
  • Code runs without errors
Do Not Include
  • Jupyter checkpoints (.ipynb_checkpoints/)
  • Virtual environment folders
  • __pycache__ or *.pyc files
  • Large unnecessary files
  • Code that doesn't execute
  • Plagiarized analysis
Important: Before submitting, restart your kernel and run all cells to ensure the notebook executes from top to bottom without errors!

06

Grading Rubric

Your assignment will be graded on the following criteria:

Criteria                     Points   Description
Data Loading & Exploration     15     Proper loading, shape/dtype display, missing value identification
Data Cleaning                  30     Handling missing values, invalid data, duplicates, type conversions
NumPy Analysis                 25     Statistical calculations, array operations, boolean indexing
Pandas GroupBy                 30     Correct aggregations, sorting, filtering, top-N queries
Time Series Analysis           20     Date extraction, daily aggregation, trend analysis
Visualizations (6+)            40     Professional charts with titles, labels, legends, saved to files
Business Insights              20     Clear findings, data-driven recommendations, well-written summary
Code Quality                   20     Clean code, markdown explanations, proper organization
Total                         200

Ready to Submit?

Make sure you have completed all requirements and reviewed the grading rubric above.

07

What You Will Practice

NumPy Operations (11.1)

Array creation, mathematical operations, statistical functions, boolean indexing, percentiles

Pandas DataFrames (11.2)

Data loading, cleaning, filtering, groupby aggregations, sorting, merging, pivot tables

Matplotlib Visualization (11.3)

Bar charts, pie charts, line plots, histograms, subplots, customization, saving figures

Business Analytics

Translating data into insights, identifying trends, making data-driven recommendations

08

Pro Tips

Pandas Tips
  • Use df.info() to see non-null counts
  • Chain methods: df.groupby().agg().sort_values()
  • Use .copy() to avoid SettingWithCopyWarning
  • Explore df.describe(include='all')
Visualization Tips
  • Always add titles and axis labels
  • Use plt.tight_layout() before saving
  • Save at 300 DPI for high quality
  • Use consistent color schemes
Data Cleaning Tips
  • Document why you fill/drop missing values
  • Check for outliers using IQR method
  • Verify data types after loading
  • Look for impossible values (negative prices)
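The IQR outlier check mentioned above can be sketched in a few lines; the sample values are illustrative (in the notebook, run this on df['total_amount']), and the conventional 1.5x multiplier is an assumption you can tune:

```python
import numpy as np

# Illustrative order totals; in the notebook use df['total_amount']
amounts = np.array([10.0, 12.0, 11.5, 13.0, 9.5, 250.0])

# Interquartile range and the conventional 1.5 * IQR fences
q1, q3 = np.percentile(amounts, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers
outliers = amounts[(amounts < lower) | (amounts > upper)]
print(f"Outliers: {outliers}")
```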
Common Mistakes
  • Forgetting to convert date strings to datetime
  • Not handling the SettingWithCopyWarning
  • Charts without titles or labels
  • Not running all cells before submitting
09

Pre-Submission Checklist

Analysis Requirements
  • All 8 analysis sections completed in the notebook
  • At least 6 visualizations created and saved as PNG files
  • Business insights written with data-driven recommendations
  • Kernel restarted and all cells run top to bottom without errors
Repository Requirements
  • Public repository named exactly ecommerce-sales-analytics
  • All required files present: notebook, both CSVs, summary_stats.csv, charts/, requirements.txt, README.md
  • No .ipynb_checkpoints/, virtual environment folders, or __pycache__ committed