Assignment Overview
In this capstone assignment for Module 7, you will combine everything you've learned about Python for analytics—pandas data manipulation, visualization with matplotlib/seaborn/plotly, and automation techniques—to build a complete, production-ready analytics pipeline.
Data Ingestion
Load and validate data from multiple sources
Transformation
Clean, transform, and aggregate data
Visualization
Generate automated charts and dashboards
Reporting
Create and export professional reports
The Scenario
TechMart E-Commerce
You are a Data Analytics Engineer at TechMart, an online electronics retailer. The management team is tired of manually running reports every week. Your task is to build an automated pipeline that:
"We need a system that automatically processes our daily sales data, calculates key metrics, generates visualizations, and produces a weekly summary report. The CEO wants this on her desk every Monday morning without anyone lifting a finger!"
Your Mission
Build a Python-based automated analytics pipeline that runs on a schedule and produces a comprehensive weekly sales report. Your solution should be robust, well-documented, and ready for production deployment.
The Dataset
You will work with a simulated e-commerce sales dataset. Download the two CSV files below and place them in your project's data/ folder:
techmart_sales.csv
28 transactions with order details, quantities, prices, and regions (save as data/sales_data.csv to match the project structure)
products.csv
Product catalog with names, categories, cost prices, and suppliers
Dataset Description
Sales Data
- order_id - Unique order identifier
- date - Order date (YYYY-MM-DD)
- product_id - Product reference
- quantity - Units sold
- unit_price - Sale price per unit
- customer_id - Customer reference
- region - Sales region
Products Data
- product_id - Unique product identifier
- product_name - Product name
- category - Product category
- cost_price - Cost to purchase
- supplier - Supplier name
Requirements
Your automated pipeline must include all of the following components. This is a comprehensive project that ties together everything from Module 7!
Data Pipeline Script (pipeline.py)
Create a main Python script that orchestrates the entire pipeline:
- Load data from CSV files using pandas
- Validate data (check for missing values, correct data types)
- Merge sales data with product data
- Calculate derived metrics (revenue, profit, profit margin)
# Example structure
import pandas as pd

def load_data():
    """Load and validate data from CSV files"""
    sales = pd.read_csv('data/sales_data.csv', parse_dates=['date'])
    products = pd.read_csv('data/products.csv')
    return sales, products

def transform_data(sales, products):
    """Merge and calculate metrics"""
    df = sales.merge(products, on='product_id')
    df['revenue'] = df['quantity'] * df['unit_price']
    df['cost'] = df['quantity'] * df['cost_price']
    df['profit'] = df['revenue'] - df['cost']
    df['profit_margin'] = df['profit'] / df['revenue']
    return df
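The validation requirement (missing values, correct types) deserves its own function. A minimal sketch of what such checks might look like (the function name and the exact rules are illustrative, not prescribed by the assignment):

```python
import pandas as pd

def validate_data(sales: pd.DataFrame, products: pd.DataFrame) -> None:
    """Fail fast if the raw data violates basic assumptions."""
    # Required columns must be present
    required = {'order_id', 'date', 'product_id', 'quantity', 'unit_price'}
    missing = required - set(sales.columns)
    if missing:
        raise ValueError(f"sales data missing columns: {missing}")
    # No missing values in key fields
    if sales[list(required)].isna().any().any():
        raise ValueError("sales data contains missing values in key columns")
    # Quantities and prices must be positive
    if (sales['quantity'] <= 0).any() or (sales['unit_price'] <= 0).any():
        raise ValueError("non-positive quantity or unit_price found")
    # Every sale must reference a known product
    unknown = set(sales['product_id']) - set(products['product_id'])
    if unknown:
        raise ValueError(f"unknown product_ids in sales: {unknown}")
```

Raising early with a descriptive message makes failures easy to diagnose from the log file instead of surfacing as a cryptic KeyError deep in the pipeline.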
Analysis Functions (analysis.py)
Create reusable functions to calculate key business metrics:
- Total Revenue & Profit: Overall financial performance
- Sales by Category: GroupBy aggregations
- Regional Performance: Compare regions
- Top Products: Best sellers by quantity and revenue
- Daily Trends: Time series analysis
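The metrics above can each be a small, testable function operating on the merged DataFrame. A sketch of three of them (function names are suggestions, not requirements):

```python
import pandas as pd

def total_metrics(df: pd.DataFrame) -> dict:
    """Overall revenue, profit, and margin from the merged frame."""
    revenue = df['revenue'].sum()
    profit = df['profit'].sum()
    return {'total_revenue': revenue,
            'total_profit': profit,
            'profit_margin': profit / revenue if revenue else 0.0}

def sales_by_category(df: pd.DataFrame) -> pd.DataFrame:
    """Revenue and units sold per product category."""
    return (df.groupby('category')
              .agg(revenue=('revenue', 'sum'), units=('quantity', 'sum'))
              .sort_values('revenue', ascending=False))

def top_products(df: pd.DataFrame, n: int = 5) -> pd.DataFrame:
    """Best sellers ranked by revenue."""
    return (df.groupby('product_name')['revenue'].sum()
              .nlargest(n).reset_index())
```

Because each function takes a DataFrame and returns a value (rather than printing or plotting), the same functions feed both the report generator and your unit tests.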
Visualization Module (visualizations.py)
Generate at least 4 different visualizations automatically:
- Bar Chart: Revenue by category or region
- Line Chart: Daily sales trend over time
- Pie Chart: Category distribution
- Heatmap or Box Plot: Advanced visualization
Save all charts as PNG files in an output/charts/ folder.
import matplotlib.pyplot as plt
import seaborn as sns

def create_revenue_by_category(df, output_path):
    """Generate bar chart of revenue by category"""
    fig, ax = plt.subplots(figsize=(10, 6))
    category_revenue = df.groupby('category')['revenue'].sum()
    category_revenue.plot(kind='bar', ax=ax, color=['#6366f1', '#22c55e'])
    ax.set_title('Revenue by Category', fontsize=14, fontweight='bold')
    ax.set_ylabel('Revenue ($)')
    plt.tight_layout()
    plt.savefig(output_path, dpi=150)
    plt.close(fig)
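The daily trend line chart follows the same pattern: group, plot, save, close. A sketch (column names match the merged frame; the styling choices are illustrative):

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen so the pipeline can run headless
import matplotlib.pyplot as plt
import pandas as pd

def create_daily_trend(df: pd.DataFrame, output_path: str) -> None:
    """Line chart of total revenue per day."""
    daily = df.groupby(df['date'].dt.date)['revenue'].sum()
    fig, ax = plt.subplots(figsize=(10, 6))
    daily.plot(kind='line', marker='o', ax=ax, color='#6366f1')
    ax.set_title('Daily Revenue Trend', fontsize=14, fontweight='bold')
    ax.set_xlabel('Date')
    ax.set_ylabel('Revenue ($)')
    plt.tight_layout()
    plt.savefig(output_path, dpi=150)
    plt.close(fig)  # always close to avoid leaking figures in long runs
```

The `Agg` backend matters for scheduled runs: cron and Task Scheduler execute without a display, and the default interactive backend can fail in that environment.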
Report Generator (report.py)
Create an HTML report that includes:
- Executive Summary with key metrics
- Embedded visualizations
- Data tables with top products
- Insights and recommendations
- Timestamp of when report was generated
from datetime import datetime

def generate_html_report(metrics, output_path):
    """Generate HTML report with embedded charts"""
    html = f"""
    <html>
    <head><title>Weekly Sales Report</title></head>
    <body>
        <h1>TechMart Weekly Sales Report</h1>
        <p>Generated: {datetime.now().strftime('%Y-%m-%d %H:%M')}</p>
        <h2>Key Metrics</h2>
        <p>Total Revenue: ${metrics['total_revenue']:,.2f}</p>
        <img src="charts/revenue_by_category.png">
    </body>
    </html>
    """
    with open(output_path, 'w') as f:
        f.write(html)
    return output_path
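One practical refinement worth considering: linking charts by relative path breaks if the report is emailed or moved. An alternative is to inline each PNG as a base64 data URI so the report is a single self-contained file. A sketch (the helper name is an assumption, not part of the assignment spec):

```python
import base64
from pathlib import Path

def embed_chart(png_path: str) -> str:
    """Return an <img> tag with the PNG inlined as a base64 data URI,
    making the HTML report portable as a single file."""
    data = base64.b64encode(Path(png_path).read_bytes()).decode('ascii')
    return f'<img src="data:image/png;base64,{data}">'
```

You would then interpolate `embed_chart('output/charts/revenue_by_category.png')` into the HTML template instead of a raw `<img src=...>` tag.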
Scheduling Configuration
Provide documentation and configuration for scheduling:
- Windows: Task Scheduler XML export or PowerShell script
- Mac/Linux: Cron job configuration
- Include instructions in README for setting up the schedule
# Example cron job (runs every Monday at 7 AM)
0 7 * * 1 /usr/bin/python3 /path/to/pipeline.py
# Windows Task Scheduler (PowerShell, run as Administrator)
$action = New-ScheduledTaskAction -Execute "python" -Argument "C:\path\to\pipeline.py"
$trigger = New-ScheduledTaskTrigger -Weekly -DaysOfWeek Monday -At 7am
Register-ScheduledTask -TaskName "TechMartPipeline" -Action $action -Trigger $trigger
Logging and Error Handling
Make your pipeline production-ready:
- Add logging to track pipeline execution
- Handle errors gracefully (missing files, bad data)
- Write logs to a file for debugging
import logging
from pathlib import Path

Path('logs').mkdir(exist_ok=True)  # basicConfig fails if the folder is missing
logging.basicConfig(
    filename='logs/pipeline.log',
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)

def run_pipeline():
    logging.info("Pipeline started")
    try:
        # Pipeline code here
        logging.info("Pipeline completed successfully")
    except Exception as e:
        logging.error(f"Pipeline failed: {e}")
        raise
Submission
Create a public GitHub repository with the exact name shown below:
Required Repository Name
analytics-pipeline-project
Required Project Structure
analytics-pipeline-project/
├── data/
│ ├── sales_data.csv
│ └── products.csv
├── src/
│ ├── __init__.py
│ ├── pipeline.py # Main orchestration script
│ ├── analysis.py # Analysis functions
│ ├── visualizations.py # Chart generation
│ └── report.py # Report generator
├── output/
│ ├── charts/ # Generated visualizations
│ └── reports/ # Generated HTML reports
├── logs/
│ └── pipeline.log # Execution logs
├── config/
│ └── schedule_config.txt # Scheduling instructions
├── tests/
│ └── test_analysis.py # Unit tests (bonus)
├── requirements.txt # Python dependencies
├── run_pipeline.py # Entry point script
└── README.md # Documentation
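A minimal requirements.txt covering the libraries named in this assignment might look like the following (the version pins are illustrative; record whatever versions you actually develop against, e.g. via `pip freeze`):

```text
pandas>=2.0
matplotlib>=3.7
seaborn>=0.13
plotly>=5.0
```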
README.md Must Include:
- Project overview and purpose
- Installation instructions (how to set up the environment)
- Usage guide (how to run the pipeline)
- Scheduling instructions for Windows and Mac/Linux
- Sample output (screenshots of report and charts)
- Your name and submission date
Do Include
- All Python source files
- Sample data files
- At least one generated report
- requirements.txt with all dependencies
- Comprehensive README
- Proper .gitignore file
Do Not Include
- Virtual environment folder (venv)
- __pycache__ directories
- IDE configuration files
- Sensitive information or credentials
- Large data files (>10MB)
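A .gitignore covering the "Do Not Include" items above might look like this (adjust for your own IDE and OS):

```text
# Virtual environments
venv/
.venv/

# Python bytecode
__pycache__/
*.pyc

# IDE configuration
.idea/
.vscode/

# Notebook checkpoints and OS cruft
.ipynb_checkpoints/
.DS_Store
```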
Enter your GitHub username when you submit - we'll verify your repository automatically
Grading Rubric
Your assignment will be graded on the following criteria:
| Criteria | Points | Description |
|---|---|---|
| Project Structure | 15 | Proper folder organization, all required files present, clean .gitignore |
| Data Pipeline | 25 | Data loading, validation, merging, and transformation work correctly |
| Analysis Functions | 25 | All required metrics calculated correctly with reusable functions |
| Visualizations | 20 | At least 4 charts generated automatically, properly styled |
| Report Generation | 20 | HTML report with metrics, charts, and professional formatting |
| Scheduling & Logging | 15 | Schedule configuration documented, logging implemented |
| Documentation | 15 | Complete README, code comments, docstrings |
| Code Quality | 15 | Clean, readable code following Python conventions |
| Total | 150 | |
Ready to Submit?
Make sure you have completed all requirements and reviewed the grading rubric above.
Skills You Will Practice
Data Pipeline Design
Building modular, reusable data pipelines that can run automatically without human intervention
Python Best Practices
Writing clean, documented, production-ready Python code with proper error handling
Automated Visualization
Generating charts programmatically and saving them for reports
Task Scheduling
Configuring cron jobs and Task Scheduler for automated execution
Report Generation
Creating professional HTML reports with embedded visualizations
Logging & Debugging
Implementing logging for production systems and troubleshooting issues
Pro Tips
Getting Started
- Start with the data loading and validation
- Test each function independently before integrating
- Use a Jupyter notebook for prototyping, then convert to .py files
- Build incrementally—don't try to do everything at once
Code Quality
- Use meaningful function and variable names
- Add docstrings to all functions
- Keep functions small and focused
- Use type hints for better documentation
Visualizations
- Set a consistent style with plt.style.use()
- Always include titles and axis labels
- Use plt.tight_layout() to avoid cutoffs
- Save at 150+ DPI for crisp images
Common Mistakes
- Hardcoding file paths instead of using relative paths
- Not handling missing or malformed data
- Forgetting to close matplotlib figures (memory leak)
- Not testing the complete pipeline end-to-end
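The first two mistakes can be addressed together with pathlib. A sketch of one way to centralize path handling (the `project_paths` helper is illustrative, not part of the required spec):

```python
from pathlib import Path

def project_paths(base: Path) -> dict:
    """Build all pipeline paths relative to the project root, so the
    pipeline works no matter where the scheduler launches it from."""
    paths = {
        'sales': base / 'data' / 'sales_data.csv',
        'products': base / 'data' / 'products.csv',
        'charts': base / 'output' / 'charts',
        'reports': base / 'output' / 'reports',
    }
    for key in ('charts', 'reports'):
        paths[key].mkdir(parents=True, exist_ok=True)  # create if absent
    return paths

# In src/pipeline.py, the project root is the parent of the src/ folder:
# BASE_DIR = Path(__file__).resolve().parent.parent
```

Anchoring paths to `__file__` rather than the current working directory matters for scheduled runs: cron and Task Scheduler rarely launch your script from the folder you expect.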