Assignment 7-A

Automated Analytics Pipeline

Put your Python for Analytics skills to work! Build a complete automated data pipeline that fetches data, performs analysis, generates visualizations, and delivers reports automatically.

4-6 hours
Intermediate
150 Points
What You'll Build
  • Automated data pipeline script
  • Data cleaning and transformation
  • Automated visualizations
  • HTML/PDF report generation
  • Scheduled task configuration
Contents
01

Assignment Overview

In this capstone assignment for Module 7, you will combine everything you've learned about Python for analytics—pandas data manipulation, visualization with matplotlib/seaborn/plotly, and automation techniques—to build a complete, production-ready analytics pipeline.

Skills Applied: This assignment tests your mastery of pandas operations, data visualization, file handling, report generation, and task scheduling from Module 7.
Data Ingestion

Load and validate data from multiple sources

Transformation

Clean, transform, and aggregate data

Visualization

Generate automated charts and dashboards

Reporting

Create and export professional reports

02

The Scenario

TechMart E-Commerce

You are a Data Analytics Engineer at TechMart, an online electronics retailer. The management team is tired of manually running reports every week. Your task is to build an automated pipeline that meets the following request:

"We need a system that automatically processes our daily sales data, calculates key metrics, generates visualizations, and produces a weekly summary report. The CEO wants this on her desk every Monday morning without anyone lifting a finger!"

Your Mission

Build a Python-based automated analytics pipeline that runs on a schedule and produces a comprehensive weekly sales report. Your solution should be robust, well-documented, and ready for production deployment.

Real-World Skills: Automated reporting pipelines like this are used by companies worldwide. This project simulates what you'd build as a professional data analyst or analytics engineer!
03

The Dataset

You will work with a simulated e-commerce sales dataset. Download the two CSV files below, rename them to sales_data.csv and products.csv to match the required project structure, and place them in your project's data/ folder:

techmart_sales.csv

28 transactions with order details, quantities, prices, and regions

techmart_products.csv

8 products with cost prices and supplier information

Dataset Description
Sales Data
  • order_id - Unique order identifier
  • date - Order date (YYYY-MM-DD)
  • product_id - Product reference
  • quantity - Units sold
  • unit_price - Sale price per unit
  • customer_id - Customer reference
  • region - Sales region
Products Data
  • product_id - Unique product identifier
  • product_name - Product name
  • category - Product category
  • cost_price - Cost to purchase
  • supplier - Supplier name
Bonus Challenge: For extra credit, expand the dataset to include 30+ days of data and add more products. This will make your analysis more realistic!
04

Requirements

Your automated pipeline must include all of the following components. This is a comprehensive project that ties together everything from Module 7!

1
Data Pipeline Script (pipeline.py)

Create a main Python script that orchestrates the entire pipeline:

  • Load data from CSV files using pandas
  • Validate data (check for missing values, correct data types)
  • Merge sales data with product data
  • Calculate derived metrics (revenue, profit, profit margin)
# Example structure
import pandas as pd

def load_data():
    """Load data from CSV files"""
    sales = pd.read_csv('data/sales_data.csv', parse_dates=['date'])
    products = pd.read_csv('data/products.csv')
    return sales, products

def transform_data(sales, products):
    """Merge sales with product details and calculate derived metrics"""
    df = sales.merge(products, on='product_id')
    df['revenue'] = df['quantity'] * df['unit_price']
    df['cost'] = df['quantity'] * df['cost_price']
    df['profit'] = df['revenue'] - df['cost']
    df['profit_margin'] = df['profit'] / df['revenue']
    return df
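Requirement 1 also lists validation, which the structure example above does not show. Here is a minimal sketch; the function name validate_data and these particular checks are illustrative suggestions, not a required design:

```python
import pandas as pd

def validate_data(sales):
    """Fail fast on missing values and enforce numeric types."""
    missing = sales.isna().sum()
    if missing.any():
        raise ValueError(f"Missing values found:\n{missing[missing > 0]}")
    # Coerce price/quantity columns; errors='raise' surfaces bad data early
    for col in ('quantity', 'unit_price'):
        sales[col] = pd.to_numeric(sales[col], errors='raise')
    return sales
```

Calling this right after load_data() means a corrupt CSV stops the pipeline with a clear error instead of producing a silently wrong report.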
2
Analysis Functions (analysis.py)

Create reusable functions to calculate key business metrics:

  • Total Revenue & Profit: Overall financial performance
  • Sales by Category: GroupBy aggregations
  • Regional Performance: Compare regions
  • Top Products: Best sellers by quantity and revenue
  • Daily Trends: Time series analysis
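Two of the metrics above can be sketched with pandas GroupBy; the function names and signatures here are suggestions, not requirements:

```python
import pandas as pd

def sales_by_category(df):
    """Total revenue per category, highest first."""
    return df.groupby('category')['revenue'].sum().sort_values(ascending=False)

def top_products(df, n=5):
    """Best sellers ranked by revenue, with quantity shown alongside."""
    return (df.groupby('product_name')[['quantity', 'revenue']]
              .sum()
              .sort_values('revenue', ascending=False)
              .head(n))
```

Keeping each metric in its own small function makes the pipeline easy to test and lets the report generator call exactly the numbers it needs.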
3
Visualization Module (visualizations.py)

Generate at least 4 different visualizations automatically:

  • Bar Chart: Revenue by category or region
  • Line Chart: Daily sales trend over time
  • Pie Chart: Category distribution
  • Heatmap or Box Plot: Advanced visualization

Save all charts as PNG files in an output/charts/ folder.

import matplotlib.pyplot as plt
import seaborn as sns

def create_revenue_by_category(df, output_path):
    """Generate bar chart of revenue by category"""
    fig, ax = plt.subplots(figsize=(10, 6))
    category_revenue = df.groupby('category')['revenue'].sum()
    category_revenue.plot(kind='bar', ax=ax, color=['#6366f1', '#22c55e'])
    ax.set_title('Revenue by Category', fontsize=14, fontweight='bold')
    ax.set_ylabel('Revenue ($)')
    plt.tight_layout()
    plt.savefig(output_path, dpi=150)
    plt.close()
4
Report Generator (report.py)

Create an HTML report that includes:

  • Executive Summary with key metrics
  • Embedded visualizations
  • Data tables with top products
  • Insights and recommendations
  • Timestamp of when the report was generated
from datetime import datetime

def generate_html_report(metrics, output_path):
    """Generate HTML report with embedded charts"""
    html = f"""
    <html>
    <head><title>Weekly Sales Report</title></head>
    <body>
        <h1>TechMart Weekly Sales Report</h1>
        <p>Generated: {datetime.now().strftime('%Y-%m-%d %H:%M')}</p>
        <h2>Key Metrics</h2>
        <p>Total Revenue: ${metrics['total_revenue']:,.2f}</p>
        <img src="charts/revenue_by_category.png">
    </body>
    </html>
    """
    with open(output_path, 'w') as f:
        f.write(html)
    return output_path
5
Scheduling Configuration

Provide documentation and configuration for scheduling:

  • Windows: Task Scheduler XML export or PowerShell script
  • Mac/Linux: Cron job configuration
  • Include instructions in README for setting up the schedule
# Example cron job (runs every Monday at 7 AM)
0 7 * * 1 /usr/bin/python3 /path/to/pipeline.py

# Windows Task Scheduler (PowerShell)
$action = New-ScheduledTaskAction -Execute "python" -Argument "C:\path\to\pipeline.py"
$trigger = New-ScheduledTaskTrigger -Weekly -DaysOfWeek Monday -At 7am
Register-ScheduledTask -TaskName "TechMartPipeline" -Action $action -Trigger $trigger
6
Logging and Error Handling

Make your pipeline production-ready:

  • Add logging to track pipeline execution
  • Handle errors gracefully (missing files, bad data)
  • Write logs to a file for debugging
import logging
import os

os.makedirs('logs', exist_ok=True)  # basicConfig fails if the logs/ folder is missing

logging.basicConfig(
    filename='logs/pipeline.log',
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)

def run_pipeline():
    logging.info("Pipeline started")
    try:
        # Pipeline code here
        logging.info("Pipeline completed successfully")
    except Exception as e:
        logging.error(f"Pipeline failed: {e}")
        raise
Bonus Points: Add email functionality to automatically send the report, or create an interactive Plotly dashboard!
05

Submission

Create a public GitHub repository with the exact name shown below:

Required Repository Name
analytics-pipeline-project
github.com/<your-username>/analytics-pipeline-project
Required Project Structure
analytics-pipeline-project/
├── data/
│   ├── sales_data.csv
│   └── products.csv
├── src/
│   ├── __init__.py
│   ├── pipeline.py          # Main orchestration script
│   ├── analysis.py          # Analysis functions
│   ├── visualizations.py    # Chart generation
│   └── report.py            # Report generator
├── output/
│   ├── charts/              # Generated visualizations
│   └── reports/             # Generated HTML reports
├── logs/
│   └── pipeline.log         # Execution logs
├── config/
│   └── schedule_config.txt  # Scheduling instructions
├── tests/
│   └── test_analysis.py     # Unit tests (bonus)
├── requirements.txt         # Python dependencies
├── run_pipeline.py          # Entry point script
└── README.md                # Documentation
README.md Must Include:
  • Project overview and purpose
  • Installation instructions (how to set up the environment)
  • Usage guide (how to run the pipeline)
  • Scheduling instructions for Windows and Mac/Linux
  • Sample output (screenshots of report and charts)
  • Your name and submission date
Do Include
  • All Python source files
  • Sample data files
  • At least one generated report
  • requirements.txt with all dependencies
  • Comprehensive README
  • Proper .gitignore file
Do Not Include
  • Virtual environment folder (venv)
  • __pycache__ directories
  • IDE configuration files
  • Sensitive information or credentials
  • Large data files (>10MB)
Submit Your Assignment

Enter your GitHub username - we'll verify your repository automatically

06

Grading Rubric

Your assignment will be graded on the following criteria:

Criteria             | Points | Description
Project Structure    | 15     | Proper folder organization, all required files present, clean .gitignore
Data Pipeline        | 25     | Data loading, validation, merging, and transformation work correctly
Analysis Functions   | 25     | All required metrics calculated correctly with reusable functions
Visualizations       | 20     | At least 4 charts generated automatically, properly styled
Report Generation    | 20     | HTML report with metrics, charts, and professional formatting
Scheduling & Logging | 15     | Schedule configuration documented, logging implemented
Documentation        | 15     | Complete README, code comments, docstrings
Code Quality         | 15     | Clean, readable code following Python conventions
Total                | 150    |
Bonus Points (up to 20): Email integration (+10), Interactive Plotly dashboard (+10), Unit tests (+5), Extended dataset with 30+ days (+5)

Ready to Submit?

Make sure you have completed all requirements and reviewed the grading rubric above.

07

Skills You Will Practice

Data Pipeline Design

Building modular, reusable data pipelines that can run automatically without human intervention

Python Best Practices

Writing clean, documented, production-ready Python code with proper error handling

Automated Visualization

Generating charts programmatically and saving them for reports

Task Scheduling

Configuring cron jobs and Task Scheduler for automated execution

Report Generation

Creating professional HTML reports with embedded visualizations

Logging & Debugging

Implementing logging for production systems and troubleshooting issues

08

Pro Tips

Getting Started
  • Start with the data loading and validation
  • Test each function independently before integrating
  • Use a Jupyter notebook for prototyping, then convert the code to .py modules
  • Build incrementally—don't try to do everything at once
Code Quality
  • Use meaningful function and variable names
  • Add docstrings to all functions
  • Keep functions small and focused
  • Use type hints for better documentation
Visualizations
  • Set a consistent style with plt.style.use()
  • Always include titles and axis labels
  • Use plt.tight_layout() to avoid cutoffs
  • Save at 150+ DPI for crisp images
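The visualization tips above combine into one short sketch; the 'ggplot' style and the output location are arbitrary choices for illustration:

```python
import os
import tempfile

import matplotlib
matplotlib.use('Agg')  # headless backend so charts render without a display
import matplotlib.pyplot as plt

plt.style.use('ggplot')  # one consistent style applied to every chart

out_path = os.path.join(tempfile.gettempdir(), 'daily_trend.png')
fig, ax = plt.subplots(figsize=(8, 5))
ax.plot([1, 2, 3, 4], [120, 180, 150, 210])
ax.set_title('Daily Sales Trend')       # always title your charts
ax.set_xlabel('Day')
ax.set_ylabel('Revenue ($)')
plt.tight_layout()                      # avoid clipped labels
fig.savefig(out_path, dpi=150)          # 150+ DPI for crisp report images
plt.close(fig)                          # free memory in long-running pipelines
```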
Common Mistakes
  • Hardcoding file paths instead of using relative paths
  • Not handling missing or malformed data
  • Forgetting to close matplotlib figures (memory leak)
  • Not testing the complete pipeline end-to-end
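To avoid the hardcoded-path mistake, one common pattern is to anchor every path to the script's own location; the folder names below follow the required project structure, but the pattern itself is just a suggestion:

```python
from pathlib import Path

# Resolve paths relative to this script rather than the current working
# directory, so the pipeline behaves the same when a scheduler launches it.
BASE_DIR = Path(__file__).resolve().parent
DATA_DIR = BASE_DIR / 'data'
CHARTS_DIR = BASE_DIR / 'output' / 'charts'
CHARTS_DIR.mkdir(parents=True, exist_ok=True)  # create output folders on demand

sales_path = DATA_DIR / 'sales_data.csv'
```

This matters for scheduled runs: cron and Task Scheduler often start scripts from a different working directory than the one you tested in.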
09

Pre-Submission Checklist

Code Requirements
  • pipeline.py loads, validates, merges, and transforms the data
  • analysis.py calculates all required metrics with reusable functions
  • visualizations.py generates at least 4 charts saved to output/charts/
  • report.py produces an HTML report with metrics and embedded charts
  • Logging and error handling are implemented
  • Scheduling configuration is documented for Windows and Mac/Linux
Repository Requirements
  • Repository is public and named analytics-pipeline-project
  • Project follows the required folder structure
  • README.md covers overview, installation, usage, and scheduling
  • requirements.txt and .gitignore are present
  • At least one generated report is committed
  • No venv, __pycache__, credentials, or large data files (>10MB)