Assignment Overview
In this assignment, you will build a complete E-Commerce Sales Analytics Dashboard that analyzes sales data from an online retail store. This comprehensive project requires you to apply ALL concepts from Module 11: NumPy arrays, Pandas DataFrames, data cleaning, statistical analysis, and data visualization with Matplotlib.
NumPy Operations (11.1)
Array creation, mathematical operations, statistical functions, broadcasting
Pandas Analysis (11.2)
DataFrames, filtering, groupby, merging, pivot tables, data cleaning
Visualization (11.3)
Line plots, bar charts, histograms, scatter plots, subplots
The Scenario
ShopEasy E-Commerce
You have been hired as a Data Analyst at ShopEasy, an online retail company. The business intelligence team has given you this task:
"We have 12 months of sales data but it's messy - missing values, duplicates, and inconsistent formats. We need you to clean the data, analyze sales trends, identify our best products and customers, and create visualizations for our quarterly business review presentation."
Your Task
Create a Jupyter Notebook called ecommerce_analytics.ipynb that performs
comprehensive data analysis on the sales dataset. Your notebook should tell a data story
with clear markdown explanations, clean code, and professional visualizations.
The Dataset
You will work with a sales dataset. Create this CSV file or download it from the link below:
File: sales_data.csv
```csv
order_id,customer_id,product_name,category,quantity,unit_price,order_date,region,payment_method
1001,C101,Wireless Mouse,Electronics,2,29.99,2025-01-15,North,Credit Card
1002,C102,USB-C Cable,Electronics,3,12.99,2025-01-15,South,PayPal
1003,C103,Office Chair,Furniture,1,199.99,2025-01-16,East,Credit Card
1004,C101,Keyboard,Electronics,1,79.99,2025-01-16,North,Credit Card
1005,C104,Desk Lamp,Home,2,34.99,2025-01-17,West,Debit Card
1006,C105,Monitor Stand,Furniture,1,49.99,2025-01-17,North,PayPal
1007,C102,Webcam HD,Electronics,1,89.99,2025-01-18,South,Credit Card
1008,C106,Notebook Set,Office,5,8.99,2025-01-18,East,Cash
1009,C107,Wireless Earbuds,Electronics,2,59.99,2025-01-19,West,Credit Card
1010,C108,Desk Organizer,Office,3,24.99,2025-01-19,North,Debit Card
1011,C103,Bookshelf,Furniture,1,149.99,2025-01-20,East,Credit Card
1012,C109,Power Bank,Electronics,2,39.99,2025-01-20,South,PayPal
1013,C110,LED Strip,Home,4,19.99,2025-01-21,West,Credit Card
1014,C101,Mouse Pad,Electronics,1,,2025-01-21,North,Credit Card
1015,C111,Filing Cabinet,Furniture,1,89.99,2025-01-22,East,Debit Card
1016,C112,USB Hub,Electronics,2,29.99,2025-01-22,,Credit Card
1017,C113,Desk Mat,Office,1,29.99,2025-01-23,West,PayPal
1018,C114,Smart Speaker,Electronics,1,79.99,2025-01-23,North,Credit Card
1019,C115,Wall Clock,Home,2,29.99,2025-01-24,South,Cash
1020,C102,Laptop Stand,Electronics,1,44.99,2025-01-24,South,PayPal
1021,C116,Pen Holder,Office,-2,14.99,2025-01-25,East,Debit Card
1022,C117,Wireless Charger,Electronics,1,34.99,2025-01-25,West,Credit Card
1023,C118,Table Fan,Home,1,49.99,2025-01-26,North,PayPal
1024,C103,Office Chair,Furniture,1,199.99,2025-01-26,East,Credit Card
1025,C119,HDMI Cable,Electronics,5,9.99,2025-01-27,South,Debit Card
```
Note the intentional data quality issues in this dataset:
- Missing values (empty unit_price in order 1014, empty region in order 1016)
- Invalid data (negative quantity in order 1021)
- Duplicate-looking orders (the same customer ordering the same product, e.g. orders 1003 and 1024)
Column Descriptions
- order_id - Unique order identifier (integer)
- customer_id - Customer identifier (string)
- product_name - Name of the product (string)
- category - Product category: Electronics, Furniture, Home, Office (string)
- quantity - Number of items ordered (integer)
- unit_price - Price per unit in dollars (float)
- order_date - Date of order (YYYY-MM-DD)
- region - Sales region: North, South, East, West (string)
- payment_method - Payment type: Credit Card, Debit Card, PayPal, Cash (string)
Requirements
Your ecommerce_analytics.ipynb must include ALL of the following sections.
Each section is mandatory and will be graded.
Data Loading & Initial Exploration
Load the dataset and perform initial exploration:
- Import required libraries (numpy, pandas, matplotlib)
- Load the CSV using pd.read_csv()
- Display shape, dtypes, and the first/last 5 rows
- Use df.info() and df.describe()
- Check for missing values with df.isnull().sum()
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load data
df = pd.read_csv('sales_data.csv')

# Initial exploration
print(f"Dataset Shape: {df.shape}")
print(f"\nData Types:\n{df.dtypes}")
print(f"\nMissing Values:\n{df.isnull().sum()}")
```
Data Cleaning
Clean the dataset by handling all data quality issues:
- Handle missing values (fill or drop, with justification)
- Remove or fix invalid data (negative quantities)
- Convert order_date to datetime type
- Check for and handle duplicates
- Create a new column: total_amount = quantity * unit_price
```python
# Handle missing values (assign the result back; column-level
# fillna(..., inplace=True) is deprecated in modern pandas)
df['unit_price'] = df['unit_price'].fillna(df['unit_price'].median())
df['region'] = df['region'].fillna('Unknown')

# Remove invalid quantities
df = df[df['quantity'] > 0]

# Drop exact duplicate rows
df = df.drop_duplicates()

# Convert date strings to datetime
df['order_date'] = pd.to_datetime(df['order_date'])

# Create total_amount column
df['total_amount'] = df['quantity'] * df['unit_price']
```
NumPy Statistical Analysis
Use NumPy to compute statistics on numeric columns:
- Calculate mean, median, std using NumPy functions
- Find min, max, percentiles (25th, 50th, 75th)
- Create a NumPy array from the total_amount column
- Use boolean indexing to filter high-value orders (> $100)
- Calculate correlation between quantity and total_amount
```python
# NumPy statistics on the order totals
amounts = df['total_amount'].values
print(f"Mean: ${np.mean(amounts):.2f}")
print(f"Median: ${np.median(amounts):.2f}")
print(f"Std Dev: ${np.std(amounts):.2f}")
print(f"25th Percentile: ${np.percentile(amounts, 25):.2f}")
print(f"75th Percentile: ${np.percentile(amounts, 75):.2f}")

# High-value orders via boolean indexing
high_value = amounts[amounts > 100]
print(f"\nHigh-value orders (>$100): {len(high_value)}")

# Correlation between quantity and total_amount
corr = np.corrcoef(df['quantity'], df['total_amount'])[0, 1]
print(f"Correlation (quantity vs total): {corr:.2f}")
```
Pandas GroupBy Analysis
Perform aggregations using Pandas groupby:
- Total sales by category
- Average order value by region
- Top 5 customers by total spending
- Sales count by payment method
- Best-selling products by quantity
```python
# Total sales by category
category_sales = df.groupby('category')['total_amount'].sum().sort_values(ascending=False)

# Average order value by region
region_avg = df.groupby('region')['total_amount'].mean()

# Top 5 customers by total spending
top_customers = df.groupby('customer_id')['total_amount'].sum().nlargest(5)

# Order count by payment method
payment_counts = df['payment_method'].value_counts()

# Best-selling products by quantity
best_sellers = df.groupby('product_name')['quantity'].sum().nlargest(5)
```
Time Series Analysis
Analyze sales trends over time:
- Extract the day of week and day name from order_date
- Calculate daily total sales
- Find the busiest day of the week
- Calculate rolling 3-day average sales
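The steps above can be sketched as follows; the tiny inline frame here is a stand-in for your cleaned dataset (in the notebook, use the df you built in the cleaning section):

```python
import pandas as pd

# Stand-in for the cleaned dataset (same column names as the assignment schema)
df = pd.DataFrame({
    'order_date': pd.to_datetime(['2025-01-15', '2025-01-15', '2025-01-16', '2025-01-17']),
    'total_amount': [59.98, 38.97, 199.99, 69.98],
})

# Extract day-of-week features from the datetime column
df['day_of_week'] = df['order_date'].dt.dayofweek   # Monday = 0
df['day_name'] = df['order_date'].dt.day_name()

# Daily total sales
daily_sales = df.groupby('order_date')['total_amount'].sum()

# Busiest day of the week, measured by total sales
busiest_day = df.groupby('day_name')['total_amount'].sum().idxmax()

# Rolling 3-day average (min_periods=1 so the first days aren't NaN)
rolling_avg = daily_sales.rolling(window=3, min_periods=1).mean()
```

With the full dataset, daily_sales becomes the input for your line-plot trend chart as well.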
Data Visualizations (Minimum 6)
Create at least 6 professional visualizations using Matplotlib:
- Bar Chart: Sales by category
- Pie Chart: Sales distribution by region
- Line Plot: Daily sales trend
- Histogram: Distribution of order values
- Horizontal Bar: Top 5 products by revenue
- Grouped Bar: Sales by category per region
```python
import os
os.makedirs('charts', exist_ok=True)  # savefig fails if charts/ doesn't exist

# Example: Sales by category bar chart
fig, ax = plt.subplots(figsize=(10, 6))
category_sales.plot(kind='bar', color=['#6366f1', '#10b981', '#f59e0b', '#ef4444'], ax=ax)
ax.set_title('Total Sales by Category', fontsize=14, fontweight='bold')
ax.set_xlabel('Category')
ax.set_ylabel('Total Sales ($)')
ax.tick_params(axis='x', rotation=45)
plt.tight_layout()
plt.savefig('charts/sales_by_category.png', dpi=300)
plt.show()
```
Business Insights Summary
Write a markdown summary with key findings:
- What category generates the most revenue?
- Which region has the highest average order value?
- Who are the top 3 customers and what do they buy?
- What day of the week has the most sales?
- 3 actionable recommendations for the business
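Most of these questions fall straight out of groupby results. A minimal sketch of the patterns involved, again using a tiny stand-in frame (your real answers depend on your cleaned data):

```python
import pandas as pd

# Stand-in frame; in the notebook use your cleaned df with total_amount
df = pd.DataFrame({
    'category': ['Electronics', 'Furniture', 'Electronics'],
    'region': ['North', 'East', 'North'],
    'customer_id': ['C101', 'C103', 'C101'],
    'total_amount': [59.98, 199.99, 79.99],
})

# Category generating the most revenue
top_category = df.groupby('category')['total_amount'].sum().idxmax()

# Region with the highest average order value
top_region = df.groupby('region')['total_amount'].mean().idxmax()

# Top 3 customers by total spending
top3 = df.groupby('customer_id')['total_amount'].sum().nlargest(3)
```

Quote the actual numbers (not just the names) in your markdown summary.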
Export Results
Save your analysis outputs:
- Export the cleaned DataFrame to cleaned_sales_data.csv
- Save all charts to a charts/ folder
- Create a summary statistics CSV file
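The exports can be as short as the sketch below (the stand-in frame replaces your cleaned df; file names match the required repository layout):

```python
import os
import pandas as pd

df = pd.DataFrame({'total_amount': [59.98, 199.99]})  # stand-in for the cleaned df

# Create the charts/ folder before any savefig calls target it
os.makedirs('charts', exist_ok=True)

# Export cleaned data (index=False avoids writing a spurious index column)
df.to_csv('cleaned_sales_data.csv', index=False)

# Summary statistics as their own CSV
df.describe().to_csv('summary_stats.csv')
```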
Submission
Create a public GitHub repository with the exact name shown below:
Required Repository Name
ecommerce-sales-analytics
Required Files
```
ecommerce-sales-analytics/
├── ecommerce_analytics.ipynb   # Your Jupyter Notebook with all analysis
├── sales_data.csv              # Original dataset
├── cleaned_sales_data.csv      # Cleaned dataset (output)
├── summary_stats.csv           # Summary statistics (output)
├── charts/                     # Folder with all visualization images
│   ├── sales_by_category.png
│   ├── sales_by_region.png
│   ├── daily_sales_trend.png
│   ├── order_value_distribution.png
│   ├── top_products.png
│   └── category_by_region.png
├── requirements.txt            # Python dependencies
└── README.md                   # Project documentation
```
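For reference, a minimal requirements.txt for this stack could be as simple as the following (unpinned here; pin the exact versions you used if you want reproducible installs):

```
numpy
pandas
matplotlib
```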
README.md Must Include:
- Your full name and submission date
- Project overview and objectives
- Key findings and insights (with numbers)
- Sample visualizations (embed 2-3 charts)
- How to run the notebook
Do Include
- All 8 analysis sections completed
- At least 6 visualizations saved as PNG
- Clear markdown explanations in notebook
- Business insights with recommendations
- All output files generated
- Code runs without errors
Do Not Include
- Jupyter checkpoints (.ipynb_checkpoints/)
- Virtual environment folders
- __pycache__ or *.pyc files
- Large unnecessary files
- Code that doesn't execute
- Plagiarized analysis
When submitting, enter your GitHub username; your repository will be verified automatically.
Grading Rubric
Your assignment will be graded on the following criteria:
| Criteria | Points | Description |
|---|---|---|
| Data Loading & Exploration | 15 | Proper loading, shape/dtype display, missing value identification |
| Data Cleaning | 30 | Handling missing values, invalid data, duplicates, type conversions |
| NumPy Analysis | 25 | Statistical calculations, array operations, boolean indexing |
| Pandas GroupBy | 30 | Correct aggregations, sorting, filtering, top-N queries |
| Time Series Analysis | 20 | Date extraction, daily aggregation, trend analysis |
| Visualizations (6+) | 40 | Professional charts with titles, labels, legends, saved to files |
| Business Insights | 20 | Clear findings, data-driven recommendations, well-written summary |
| Code Quality | 20 | Clean code, markdown explanations, proper organization |
| **Total** | **200** | |
Ready to Submit?
Make sure you have completed all requirements and reviewed the grading rubric above.
What You Will Practice
NumPy Operations (11.1)
Array creation, mathematical operations, statistical functions, boolean indexing, percentiles
Pandas DataFrames (11.2)
Data loading, cleaning, filtering, groupby aggregations, sorting, merging, pivot tables
Matplotlib Visualization (11.3)
Bar charts, pie charts, line plots, histograms, subplots, customization, saving figures
Business Analytics
Translating data into insights, identifying trends, making data-driven recommendations
Pro Tips
Pandas Tips
- Use df.info() to see non-null counts
- Chain methods: df.groupby().agg().sort_values()
- Use .copy() to avoid SettingWithCopyWarning
- Explore df.describe(include='all')
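For instance, the chaining and .copy() tips look like this in practice (toy frame purely for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'category': ['Electronics', 'Office', 'Electronics'],
    'total_amount': [29.99, 8.99, 79.99],
})

# Chained groupby -> aggregate -> sort in a single expression
ranked = (df.groupby('category')['total_amount']
            .agg(['sum', 'mean'])
            .sort_values('sum', ascending=False))

# .copy() makes the subset an independent frame, so assigning a new
# column later doesn't trigger SettingWithCopyWarning
electronics = df[df['category'] == 'Electronics'].copy()
electronics['discounted'] = electronics['total_amount'] * 0.9
```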
Visualization Tips
- Always add titles and axis labels
- Use plt.tight_layout() before saving
- Save at 300 DPI for high quality
- Use consistent color schemes
Data Cleaning Tips
- Document why you fill/drop missing values
- Check for outliers using IQR method
- Verify data types after loading
- Look for impossible values (negative prices)
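The IQR outlier check mentioned above flags values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]; a small NumPy sketch (the sample array is made up for illustration):

```python
import numpy as np

# Hypothetical order totals, with one obvious extreme value
amounts = np.array([59.98, 38.97, 199.99, 44.95, 24.99, 899.00])

# Quartiles and the interquartile range
q1, q3 = np.percentile(amounts, [25, 75])
iqr = q3 - q1

# Tukey fences: anything outside them is a candidate outlier
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = amounts[(amounts < lower) | (amounts > upper)]
```

Whether you drop, cap, or keep flagged values, document the decision in markdown.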
Common Mistakes
- Forgetting to convert date strings to datetime
- Not handling the SettingWithCopyWarning
- Charts without titles or labels
- Not running all cells before submitting