Assignment Overview
In this capstone assignment for Module 7, you will combine everything you've learned about Python for analytics—pandas data manipulation, visualization with matplotlib/seaborn/plotly, and automation techniques—to build a complete, production-ready analytics pipeline.
Data Ingestion
Load and validate data from multiple sources
Transformation
Clean, transform, and aggregate data
Visualization
Generate automated charts and dashboards
Reporting
Create and export professional reports
The Scenario
TechMart E-Commerce
You are a Data Analytics Engineer at TechMart, an online electronics retailer. The management team is tired of manually running reports every week. Your task is to build an automated pipeline that:
"We need a system that automatically processes our daily sales data, calculates key metrics, generates visualizations, and produces a weekly summary report. The CEO wants this on her desk every Monday morning without anyone lifting a finger!"
Your Mission
Build a Python-based automated analytics pipeline that runs on a schedule and produces a comprehensive weekly sales report. Your solution should be robust, well-documented, and ready for production deployment.
The Dataset
You will work with a simulated e-commerce sales dataset. Download the two CSV files below and place them in your project's data/ folder:
techmart_sales.csv
28 transactions with order details, quantities, prices, and regions (save as data/sales_data.csv to match the project structure)
products.csv
Product catalog with names, categories, cost prices, and suppliers
Dataset Description
Sales Data
- order_id - Unique order identifier
- date - Order date (YYYY-MM-DD)
- product_id - Product reference
- quantity - Units sold
- unit_price - Sale price per unit
- customer_id - Customer reference
- region - Sales region
Products Data
- product_id - Unique product identifier
- product_name - Product name
- category - Product category
- cost_price - Cost to purchase
- supplier - Supplier name
Requirements
Your automated pipeline must include all of the following components. This is a comprehensive project that ties together everything from Module 7!
Data Pipeline Script (pipeline.py)
Create a main Python script that orchestrates the entire pipeline:
- Load data from CSV files using pandas
- Validate data (check for missing values, correct data types)
- Merge sales data with product data
- Calculate derived metrics (revenue, profit, profit margin)
# Example structure
import pandas as pd

def load_data():
    """Load and validate data from CSV files"""
    sales = pd.read_csv('data/sales_data.csv', parse_dates=['date'])
    products = pd.read_csv('data/products.csv')
    return sales, products

def transform_data(sales, products):
    """Merge and calculate metrics"""
    df = sales.merge(products, on='product_id')
    df['revenue'] = df['quantity'] * df['unit_price']
    df['cost'] = df['quantity'] * df['cost_price']
    df['profit'] = df['revenue'] - df['cost']
    df['profit_margin'] = df['profit'] / df['revenue']
    return df
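The validation requirement (missing values, correct types) deserves its own function. A minimal sketch of what such checks might look like (the function name and the exact rules are illustrative, not prescribed by the assignment):

```python
import pandas as pd

def validate_data(sales: pd.DataFrame, products: pd.DataFrame) -> None:
    """Fail fast if the raw data violates basic assumptions."""
    # Required columns must be present
    required = {'order_id', 'date', 'product_id', 'quantity', 'unit_price'}
    missing = required - set(sales.columns)
    if missing:
        raise ValueError(f"sales data missing columns: {missing}")
    # No missing values in key fields
    if sales[list(required)].isna().any().any():
        raise ValueError("sales data contains missing values in key columns")
    # Quantities and prices must be positive
    if (sales['quantity'] <= 0).any() or (sales['unit_price'] <= 0).any():
        raise ValueError("non-positive quantity or unit_price found")
    # Every sale must reference a known product
    unknown = set(sales['product_id']) - set(products['product_id'])
    if unknown:
        raise ValueError(f"unknown product_ids in sales: {unknown}")
```

Raising early with a descriptive message makes failures easy to diagnose from the log file instead of surfacing as a cryptic KeyError deep in the pipeline.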
Analysis Functions (analysis.py)
Create reusable functions to calculate key business metrics:
- Total Revenue & Profit: Overall financial performance
- Sales by Category: GroupBy aggregations
- Regional Performance: Compare regions
- Top Products: Best sellers by quantity and revenue
- Daily Trends: Time series analysis
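The metrics above can each be a small, testable function operating on the merged DataFrame. A sketch of three of them (function names are suggestions, not requirements):

```python
import pandas as pd

def total_metrics(df: pd.DataFrame) -> dict:
    """Overall revenue, profit, and margin from the merged frame."""
    revenue = df['revenue'].sum()
    profit = df['profit'].sum()
    return {'total_revenue': revenue,
            'total_profit': profit,
            'profit_margin': profit / revenue if revenue else 0.0}

def sales_by_category(df: pd.DataFrame) -> pd.DataFrame:
    """Revenue and units sold per product category."""
    return (df.groupby('category')
              .agg(revenue=('revenue', 'sum'), units=('quantity', 'sum'))
              .sort_values('revenue', ascending=False))

def top_products(df: pd.DataFrame, n: int = 5) -> pd.DataFrame:
    """Best sellers ranked by revenue."""
    return (df.groupby('product_name')['revenue'].sum()
              .nlargest(n).reset_index())
```

Because each function takes a DataFrame and returns a value (rather than printing or plotting), the same functions feed both the report generator and your unit tests.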
Visualization Module (visualizations.py)
Generate at least 4 different visualizations automatically:
- Bar Chart: Revenue by category or region
- Line Chart: Daily sales trend over time
- Pie Chart: Category distribution
- Heatmap or Box Plot: Advanced visualization
Save all charts as PNG files in an output/charts/ folder.
import matplotlib.pyplot as plt
import seaborn as sns

def create_revenue_by_category(df, output_path):
    """Generate bar chart of revenue by category"""
    fig, ax = plt.subplots(figsize=(10, 6))
    category_revenue = df.groupby('category')['revenue'].sum()
    category_revenue.plot(kind='bar', ax=ax, color=['#6366f1', '#22c55e'])
    ax.set_title('Revenue by Category', fontsize=14, fontweight='bold')
    ax.set_ylabel('Revenue ($)')
    plt.tight_layout()
    plt.savefig(output_path, dpi=150)
    plt.close(fig)
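The daily trend line chart follows the same pattern: group, plot, save, close. A sketch (column names match the merged frame; the styling choices are illustrative):

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen so the pipeline can run headless
import matplotlib.pyplot as plt
import pandas as pd

def create_daily_trend(df: pd.DataFrame, output_path: str) -> None:
    """Line chart of total revenue per day."""
    daily = df.groupby(df['date'].dt.date)['revenue'].sum()
    fig, ax = plt.subplots(figsize=(10, 6))
    daily.plot(kind='line', marker='o', ax=ax, color='#6366f1')
    ax.set_title('Daily Revenue Trend', fontsize=14, fontweight='bold')
    ax.set_xlabel('Date')
    ax.set_ylabel('Revenue ($)')
    plt.tight_layout()
    plt.savefig(output_path, dpi=150)
    plt.close(fig)  # always close to avoid leaking figures in long runs
```

The `Agg` backend matters for scheduled runs: cron and Task Scheduler execute without a display, and the default interactive backend can fail in that environment.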
Report Generator (report.py)
Create an HTML report that includes:
- Executive Summary with key metrics
- Embedded visualizations
- Data tables with top products
- Insights and recommendations
- Timestamp of when report was generated
from datetime import datetime

def generate_html_report(metrics, output_path):
    """Generate HTML report with embedded charts"""
    html = f"""
    <html>
    <head><title>Weekly Sales Report</title></head>
    <body>
        <h1>TechMart Weekly Sales Report</h1>
        <p>Generated: {datetime.now().strftime('%Y-%m-%d %H:%M')}</p>
        <h2>Key Metrics</h2>
        <p>Total Revenue: ${metrics['total_revenue']:,.2f}</p>
        <img src="charts/revenue_by_category.png">
    </body>
    </html>
    """
    with open(output_path, 'w') as f:
        f.write(html)
    return output_path
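One practical refinement worth considering: linking charts by relative path breaks if the report is emailed or moved. An alternative is to inline each PNG as a base64 data URI so the report is a single self-contained file. A sketch (the helper name is an assumption, not part of the assignment spec):

```python
import base64
from pathlib import Path

def embed_chart(png_path: str) -> str:
    """Return an <img> tag with the PNG inlined as a base64 data URI,
    making the HTML report portable as a single file."""
    data = base64.b64encode(Path(png_path).read_bytes()).decode('ascii')
    return f'<img src="data:image/png;base64,{data}">'
```

You would then interpolate `embed_chart('output/charts/revenue_by_category.png')` into the HTML template instead of a raw `<img src=...>` tag.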
Scheduling Configuration
Provide documentation and configuration for scheduling:
- Windows: Task Scheduler XML export or PowerShell script
- Mac/Linux: Cron job configuration
- Include instructions in README for setting up the schedule
# Example cron job (runs every Monday at 7 AM)
0 7 * * 1 /usr/bin/python3 /path/to/pipeline.py
# Windows Task Scheduler (PowerShell, run as Administrator)
$action = New-ScheduledTaskAction -Execute "python" -Argument "C:\path\to\pipeline.py"
$trigger = New-ScheduledTaskTrigger -Weekly -DaysOfWeek Monday -At 7am
Register-ScheduledTask -TaskName "TechMartPipeline" -Action $action -Trigger $trigger
Logging and Error Handling
Make your pipeline production-ready:
- Add logging to track pipeline execution
- Handle errors gracefully (missing files, bad data)
- Write logs to a file for debugging
import logging
from pathlib import Path

Path('logs').mkdir(exist_ok=True)  # basicConfig fails if the folder is missing
logging.basicConfig(
    filename='logs/pipeline.log',
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)

def run_pipeline():
    logging.info("Pipeline started")
    try:
        # Pipeline code here
        logging.info("Pipeline completed successfully")
    except Exception as e:
        logging.error(f"Pipeline failed: {e}")
        raise
Submission
Create a public GitHub repository with the exact name shown below:
Required Repository Name
analytics-pipeline-project
Required Project Structure
analytics-pipeline-project/
├── data/
│ ├── sales_data.csv
│ └── products.csv
├── src/
│ ├── __init__.py
│ ├── pipeline.py # Main orchestration script
│ ├── analysis.py # Analysis functions
│ ├── visualizations.py # Chart generation
│ └── report.py # Report generator
├── output/
│ ├── charts/ # Generated visualizations
│ └── reports/ # Generated HTML reports
├── logs/
│ └── pipeline.log # Execution logs
├── config/
│ └── schedule_config.txt # Scheduling instructions
├── tests/
│ └── test_analysis.py # Unit tests (bonus)
├── requirements.txt # Python dependencies
├── run_pipeline.py # Entry point script
└── README.md # Documentation
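A minimal requirements.txt covering the libraries named in this assignment might look like the following (the version pins are illustrative; record whatever versions you actually develop against, e.g. via `pip freeze`):

```text
pandas>=2.0
matplotlib>=3.7
seaborn>=0.13
plotly>=5.0
```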
README.md Must Include:
- Project overview and purpose
- Installation instructions (how to set up the environment)
- Usage guide (how to run the pipeline)
- Scheduling instructions for Windows and Mac/Linux
- Sample output (screenshots of report and charts)
- Your name and submission date
Do Include
- All Python source files
- Sample data files
- At least one generated report
- requirements.txt with all dependencies
- Comprehensive README
- Proper .gitignore file
Do Not Include
- Virtual environment folder (venv)
- __pycache__ directories
- IDE configuration files
- Sensitive information or credentials
- Large data files (>10MB)
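A .gitignore covering the "Do Not Include" items above might look like this (adjust for your own IDE and OS):

```text
# Virtual environments
venv/
.venv/

# Python bytecode
__pycache__/
*.pyc

# IDE configuration
.idea/
.vscode/

# Notebook checkpoints and OS cruft
.ipynb_checkpoints/
.DS_Store
```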
Enter your GitHub username when you submit - we'll verify your repository automatically
Grading Rubric
Your assignment will be graded on the following criteria:
| Criteria | Points | Description |
|---|---|---|
| Project Structure | 15 | Proper folder organization, all required files present, clean .gitignore |
| Data Pipeline | 25 | Data loading, validation, merging, and transformation work correctly |
| Analysis Functions | 25 | All required metrics calculated correctly with reusable functions |
| Visualizations | 20 | At least 4 charts generated automatically, properly styled |
| Report Generation | 20 | HTML report with metrics, charts, and professional formatting |
| Scheduling & Logging | 15 | Schedule configuration documented, logging implemented |
| Documentation | 15 | Complete README, code comments, docstrings |
| Code Quality | 15 | Clean, readable code following Python conventions |
| Total | 150 | |
Ready to Submit?
Make sure you have completed all requirements and reviewed the grading rubric above.
Skills You Will Practice
Data Pipeline Design
Building modular, reusable data pipelines that can run automatically without human intervention
Python Best Practices
Writing clean, documented, production-ready Python code with proper error handling
Automated Visualization
Generating charts programmatically and saving them for reports
Task Scheduling
Configuring cron jobs and Task Scheduler for automated execution
Report Generation
Creating professional HTML reports with embedded visualizations
Logging & Debugging
Implementing logging for production systems and troubleshooting issues
Pro Tips
Getting Started
- Start with the data loading and validation
- Test each function independently before integrating
- Use a Jupyter notebook for prototyping, then convert to .py files
- Build incrementally—don't try to do everything at once
Code Quality
- Use meaningful function and variable names
- Add docstrings to all functions
- Keep functions small and focused
- Use type hints for better documentation
Visualizations
- Set a consistent style with plt.style.use()
- Always include titles and axis labels
- Use plt.tight_layout() to avoid cutoffs
- Save at 150+ DPI for crisp images
Common Mistakes
- Hardcoding file paths instead of using relative paths
- Not handling missing or malformed data
- Forgetting to close matplotlib figures (memory leak)
- Not testing the complete pipeline end-to-end
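The first two mistakes can be addressed together with pathlib. A sketch of one way to centralize path handling (the `project_paths` helper is illustrative, not part of the required spec):

```python
from pathlib import Path

def project_paths(base: Path) -> dict:
    """Build all pipeline paths relative to the project root, so the
    pipeline works no matter where the scheduler launches it from."""
    paths = {
        'sales': base / 'data' / 'sales_data.csv',
        'products': base / 'data' / 'products.csv',
        'charts': base / 'output' / 'charts',
        'reports': base / 'output' / 'reports',
    }
    for key in ('charts', 'reports'):
        paths[key].mkdir(parents=True, exist_ok=True)  # create if absent
    return paths

# In src/pipeline.py, the project root is the parent of the src/ folder:
# BASE_DIR = Path(__file__).resolve().parent.parent
```

Anchoring paths to `__file__` rather than the current working directory matters for scheduled runs: cron and Task Scheduler rarely launch your script from the folder you expect.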