Capstone Project 11-A

End-to-End Data Science Project

Demonstrate your mastery of data science by building a complete end-to-end machine learning project. Take it from data collection and EDA through model training, evaluation, and deployment, showcasing everything you've learned throughout this course in one comprehensive portfolio project.

20-30 hours
Advanced
500 Points
Skills Demonstrated
  • Data collection & preprocessing
  • Exploratory data analysis (EDA)
  • Feature engineering
  • Model training & evaluation
  • Hyperparameter tuning
  • Model deployment & documentation
Contents
01

Project Overview

The Final Project is your opportunity to showcase everything you've learned in this Data Science course. You will build a complete, end-to-end machine learning project that can be added to your professional portfolio. This project should demonstrate your ability to solve real-world problems using data science techniques.

Portfolio-Ready: This project is designed to be impressive enough to show potential employers. Take your time, document everything thoroughly, and create something you're proud of!
Full Course Integration: This project integrates concepts from ALL modules: Python fundamentals, NumPy, Pandas, data visualization, statistical analysis, machine learning, deep learning, and deployment.
Data Handling

Pandas, NumPy, data cleaning, preprocessing

Visualization

Matplotlib, Seaborn, Plotly, dashboards

Machine Learning

Scikit-learn, model selection, evaluation

Deployment

Streamlit, Flask, or API deployment

02

Project Options

Choose ONE of the following project tracks. Each track presents a unique challenge and allows you to specialize in a specific area of data science.

Option A: Predictive Analytics

Build a predictive model to forecast future outcomes based on historical data. Examples include sales forecasting, stock price prediction, customer churn prediction, or demand forecasting.

Suggested Datasets
  • Kaggle Store Sales Forecasting
  • Time Series Stock Data (Yahoo Finance)
  • Telco Customer Churn Dataset
  • Energy Consumption Forecasting
Required Techniques
  • Time series analysis or regression
  • Feature engineering with dates/lags
  • Cross-validation for time series
  • Forecast visualization with confidence intervals
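The date and lag features listed above can be built in a few lines of pandas. A minimal sketch (column names like `sales_lag_1` and the 7-day rolling window are illustrative choices, not requirements):

```python
import pandas as pd

def add_time_features(df, date_col="date", target_col="sales", lags=(1, 7)):
    """Add calendar and lag features for time-series forecasting."""
    df = df.sort_values(date_col).copy()
    dt = pd.to_datetime(df[date_col])
    df["day_of_week"] = dt.dt.dayofweek
    df["month"] = dt.dt.month
    for lag in lags:
        df[f"{target_col}_lag_{lag}"] = df[target_col].shift(lag)
    # Rolling mean of the previous 7 observations, shifted by 1 so the
    # current target never leaks into its own feature
    df[f"{target_col}_roll7"] = df[target_col].shift(1).rolling(7).mean()
    return df

sales = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=30, freq="D"),
    "sales": range(100, 130),
})
features = add_time_features(sales)
print(features[["date", "sales", "sales_lag_1", "sales_lag_7"]].tail(3))
```

Note that the lag columns contain NaN for the first rows; drop or impute them before training, and remember that an ordinary shuffled cross-validation split would leak future data here.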

Option B: Classification System

Build an intelligent classification system that can categorize data into meaningful classes. Examples include sentiment analysis, spam detection, disease diagnosis, or image classification.

Suggested Datasets
  • IMDB Movie Reviews (Sentiment)
  • Credit Card Fraud Detection
  • Medical Diagnosis Datasets
  • News Category Classification
Required Techniques
  • Multiple classification algorithms comparison
  • Handling imbalanced datasets
  • ROC curves, precision-recall analysis
  • Feature importance analysis
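For the imbalance and ROC requirements above, one lightweight approach is class weighting rather than resampling. A sketch on synthetic data (the ~95/5 split and logistic regression baseline are stand-ins for your actual dataset and models):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced dataset: roughly 5% positive class
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# class_weight="balanced" reweights the loss inversely to class frequency,
# an alternative to resampling methods such as SMOTE
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_tr, y_tr)

proba = clf.predict_proba(X_te)[:, 1]
print("ROC AUC:", round(roc_auc_score(y_te, proba), 3))
print(classification_report(y_te, clf.predict(X_te), digits=3))
```

On imbalanced data, report precision/recall or AUC rather than accuracy alone: a model that predicts "negative" every time would already score ~95% accuracy here.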

Option C: Recommendation Engine

Build a recommendation system that suggests relevant items to users. Examples include movie recommendations, product suggestions, or content personalization.

Suggested Datasets
  • MovieLens Dataset
  • Amazon Product Reviews
  • Spotify Music Data
  • Book Recommendations (Goodreads)
Required Techniques
  • Collaborative filtering
  • Content-based filtering
  • Hybrid approaches
  • Evaluation metrics (RMSE, MAP)
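The collaborative-filtering requirement can be prototyped with plain NumPy before reaching for a library. A toy item-based sketch (the 4x4 rating matrix and the helper names are invented for illustration; real projects would use the MovieLens-scale data listed above):

```python
import numpy as np

# Toy user-item rating matrix (0 = unrated); rows = users, cols = items
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

def cosine_sim(M):
    """Cosine similarity between the columns (items) of M."""
    norms = np.linalg.norm(M, axis=0)
    norms[norms == 0] = 1.0
    return (M.T @ M) / np.outer(norms, norms)

def predict(R, user, item):
    """Item-based CF: similarity-weighted average of the user's other ratings."""
    sim = cosine_sim(R)
    rated = np.nonzero(R[user])[0]
    weights = sim[item, rated]
    if weights.sum() == 0:
        return 0.0
    return float(R[user, rated] @ weights / weights.sum())

# Predict user 0's rating for item 2 (currently unrated)
print(round(predict(R, user=0, item=2), 2))
```

Treating zeros as "unrated" is a simplification; a production system would use an explicit mask and mean-center ratings per user before computing similarities.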

Option D: Custom Project

Have your own project idea? Build something unique that demonstrates your data science skills. Custom projects must be pre-approved by submitting a brief proposal.

Approval Required: If choosing Option D, email your project proposal (1 paragraph describing the problem, dataset, and approach) before starting. Custom projects must meet the same complexity requirements as Options A-C.
03

Technical Requirements

Regardless of which project option you choose, your project must include ALL of the following components:

1
Data Collection & Loading
  • Use a dataset with at least 10,000 rows and 10+ features
  • Document the data source clearly (Kaggle, API, web scraping, etc.)
  • Provide data loading scripts that can be reproduced
  • Include data dictionary explaining each column
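A reproducible loading script mostly means pinning down dtypes and date parsing so every run yields an identical frame. A sketch (the inline CSV stands in for your real `pd.read_csv("data/raw/...")` call; all column names are hypothetical):

```python
import io
import pandas as pd

# In a real project this would be the file in data/raw/; a small inline
# CSV stands in for the download here
RAW_CSV = io.StringIO(
    "customer_id,signup_date,plan,monthly_spend\n"
    "1,2023-01-15,basic,29.99\n"
    "2,2023-02-20,pro,59.99\n"
    "3,2023-03-05,basic,\n"
)

# Explicit dtypes double as a machine-checked data dictionary
DTYPES = {"customer_id": "int64", "plan": "category", "monthly_spend": "float64"}

def load_data(source):
    """Load the raw dataset with fixed dtypes and parsed dates."""
    return pd.read_csv(source, dtype=DTYPES, parse_dates=["signup_date"])

df = load_data(RAW_CSV)
print(df.dtypes)
```

Keeping this in `src/data_loader.py` lets every notebook import the same function instead of re-reading the CSV with slightly different options.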
2
Exploratory Data Analysis (EDA)
  • Statistical summary of all features (describe, info, value_counts)
  • At least 10 meaningful visualizations (histograms, scatter plots, heatmaps, etc.)
  • Correlation analysis and multivariate exploration
  • Clear insights and observations documented in markdown
  • Missing value analysis and outlier detection
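The missing-value and outlier checks above can be standardized into a couple of reusable snippets. A sketch using the common 1.5x IQR rule (the synthetic `income` column is only for demonstration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"income": rng.normal(50_000, 8_000, 500)})
df.loc[::50, "income"] = np.nan       # inject some missing values
df.loc[5, "income"] = 500_000         # inject an obvious outlier

# Missing-value summary: count and percentage per column
missing = pd.DataFrame({
    "n_missing": df.isna().sum(),
    "pct_missing": (df.isna().mean() * 100).round(2),
})
print(missing)

# IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)]
print(f"{len(outliers)} outlier(s) flagged")
```

Document in markdown whether flagged values are errors to drop or genuine extremes to keep; that reasoning is exactly the kind of insight the EDA requirement asks for.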
3
Data Preprocessing & Feature Engineering
  • Handle missing values with appropriate strategies (imputation, removal)
  • Encode categorical variables (one-hot, label, target encoding)
  • Scale/normalize numerical features
  • Create at least 5 new engineered features
  • Feature selection with documented reasoning
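Imputation, scaling, and encoding are easiest to keep leak-free when bundled into a single `ColumnTransformer` that is fit only on training data. A sketch (the toy frame and column names are illustrative):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "age": [25, 32, None, 51],
    "income": [40_000, 60_000, 52_000, None],
    "city": ["NYC", "LA", "NYC", "SF"],
})

numeric = ["age", "income"]
categorical = ["city"]

# Numerics: median-impute then scale; categoricals: one-hot encode
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

X = preprocess.fit_transform(df)
print(X.shape)  # 4 rows; 2 scaled numeric + 3 one-hot city columns
```

The same fitted transformer can then be pickled alongside the model so the deployment app applies identical preprocessing to new inputs.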
4
Model Development
  • Train at least 3 different algorithms
  • Implement proper train/validation/test split (or cross-validation)
  • Perform hyperparameter tuning (GridSearchCV, RandomizedSearchCV, or Optuna)
  • Document model selection reasoning
  • Handle class imbalance if applicable (SMOTE, class weights)
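The tuning requirement can be satisfied with `GridSearchCV`, which combines cross-validation and hyperparameter search in one object. A sketch on synthetic data (the grid is deliberately tiny; your real search would cover more values):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=600, n_features=12, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5],
}

# 5-fold CV over every grid combination, scored by F1
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, scoring="f1")
search.fit(X_tr, y_tr)

print("Best params:", search.best_params_)
print("CV F1:", round(search.best_score_, 3))
print("Test F1:", round(search.score(X_te, y_te), 3))
```

Note the held-out test set is touched exactly once, after the search finishes; tuning against the test set is the data-leakage pitfall called out later in this brief.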
5
Model Evaluation
  • Use appropriate metrics for your problem type (accuracy, F1, RMSE, AUC, etc.)
  • Create confusion matrices, ROC curves, or residual plots as appropriate
  • Compare all models in a summary table
  • Analyze feature importance for the best model
  • Discuss model limitations and potential improvements
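The required model-comparison summary table falls out naturally if you collect metrics into a DataFrame. A sketch with two placeholder models (swap in your own three or more):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

models = {
    "logreg": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(max_depth=4, random_state=1),
}

# Fit each model and record its test metrics in one row
rows = []
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    rows.append({"model": name,
                 "accuracy": accuracy_score(y_te, pred),
                 "f1": f1_score(y_te, pred)})

summary = pd.DataFrame(rows).sort_values("f1", ascending=False)
print(summary.round(3))
print(confusion_matrix(y_te, models["logreg"].predict(X_te)))
```

Export `summary` to markdown for the README; it doubles as the required performance table.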
6
Deployment (Choose One)
  • Option A: Streamlit web application with interactive UI
  • Option B: Flask/FastAPI REST API with documented endpoints
  • Option C: Jupyter dashboard with ipywidgets
  • Must accept new data input and return predictions
  • Include deployment instructions in README
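Whichever deployment option you pick, its core is the same: load the persisted model and expose a predict function the UI or endpoint calls. A sketch of that core (file name and feature count are placeholders; Streamlit/Flask wiring is omitted):

```python
import pickle

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Train and persist the model once, at the end of model development...
X, y = make_regression(n_samples=200, n_features=4, random_state=0)
model = Ridge().fit(X, y)
with open("best_model.pkl", "wb") as f:
    pickle.dump(model, f)

# ...then, inside the app, load it and wrap prediction in a function
# that a Streamlit widget or API endpoint can call with new data
def predict(features):
    with open("best_model.pkl", "rb") as f:
        loaded = pickle.load(f)
    return loaded.predict(np.atleast_2d(features))

print(predict([0.1, -0.3, 0.5, 1.2]))
```

In a real app, load the model once at startup rather than per request, and validate the incoming feature vector before predicting.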
7
Documentation & Presentation
  • Comprehensive README with project overview
  • Well-commented code throughout
  • Executive summary with key findings (1-2 paragraphs)
  • requirements.txt listing all dependencies
  • Clear instructions to reproduce results
04

Deliverables

Your final submission must include all of the following files in your GitHub repository:

Repository Structure
ds-final-project/
├── README.md                      # Project overview, setup instructions, key findings
├── requirements.txt               # All Python dependencies
├── data/
│   ├── raw/                       # Original, unprocessed data
│   └── processed/                 # Cleaned and preprocessed data
├── notebooks/
│   ├── 01_data_exploration.ipynb  # EDA notebook
│   ├── 02_preprocessing.ipynb     # Data cleaning and feature engineering
│   ├── 03_model_training.ipynb    # Model development and evaluation
│   └── 04_final_analysis.ipynb    # Final results and visualizations
├── src/
│   ├── data_loader.py             # Functions to load and preprocess data
│   ├── feature_engineering.py     # Feature engineering functions
│   ├── model.py                   # Model training and prediction functions
│   └── utils.py                   # Utility functions
├── models/
│   └── best_model.pkl             # Saved trained model
├── app/
│   └── streamlit_app.py           # OR flask_app.py - Deployment application
├── reports/
│   ├── figures/                   # Exported visualizations
│   └── final_report.pdf           # Executive summary (optional but recommended)
└── .gitignore                     # Ignore data files, virtual env, etc.
README.md Must Include:
  • Your full name and submission date
  • Project title and executive summary (problem, approach, results)
  • Dataset description and source link
  • Installation instructions (pip install -r requirements.txt)
  • Usage guide for running notebooks and app
  • Key findings and business recommendations
  • Model performance summary table
  • Future work and limitations
Do Include
  • All notebooks with executed output cells
  • Modular, reusable Python code in /src
  • Saved model file (.pkl, .joblib, or .h5)
  • Working deployment application
  • Professional visualizations
  • .gitignore to exclude unnecessary files
Do Not Include
  • Large data files (use .gitignore, provide download link)
  • Virtual environment folders (venv, env)
  • Jupyter checkpoints (.ipynb_checkpoints)
  • API keys or credentials (use .env)
  • Notebooks without executed cells
  • Code without comments
05

Submission

Create a public GitHub repository with the exact name shown below:

Required Repository Name
ds-final-project
github.com/<your-username>/ds-final-project
Important: Before submitting, make sure all notebooks run without errors from top to bottom. Restart the kernel and run all cells to verify!
Submit Your Final Project

Enter your GitHub username - we'll verify your repository automatically

06

Grading Rubric

Your final project will be graded on the following criteria:

Criteria            | Points | Description
Data Handling & EDA | 80     | Data loading, cleaning, and thorough exploratory analysis with meaningful visualizations
Feature Engineering | 60     | Creative and effective feature creation; proper encoding, scaling, and selection
Model Development   | 100    | Multiple models compared; proper validation, hyperparameter tuning, and best-model selection
Model Evaluation    | 60     | Appropriate metrics, comprehensive evaluation, feature importance, model interpretation
Deployment          | 80     | Working application that accepts input and returns predictions; good UX
Code Quality        | 60     | Modular code, proper functions, comments, PEP 8 compliance, error handling
Documentation       | 60     | Comprehensive README, clear notebook explanations, reproducibility
Total               | 500    |
Bonus Points (Up to 50)
  • +15 pts: Deployed to cloud (Streamlit Cloud, Heroku, AWS, etc.)
  • +15 pts: Interactive dashboard with multiple views
  • +10 pts: Exceptional visualizations (publication quality)
  • +10 pts: Deep learning implementation (if appropriate)

Ready to Submit?

Make sure you have completed all requirements and reviewed the grading rubric above.

07

What You Will Demonstrate

Python & Libraries

Proficiency in Python, NumPy, Pandas, Scikit-learn, and visualization libraries

Data Analysis Skills

Ability to explore, clean, and derive insights from complex datasets

Machine Learning

Understanding of ML algorithms, model selection, evaluation, and tuning

Deployment Skills

Ability to package models into usable applications for end users

08

Pro Tips

Project Planning
  • Start with EDA - understand your data first
  • Break the project into weekly milestones
  • Document as you go, not at the end
  • Version control your progress with git commits
Quality Over Quantity
  • Deep analysis beats more models
  • Explain WHY you made each decision
  • Focus on interpretability over complexity
  • Professional visualizations matter
Time Management
  • Week 1: Data collection, EDA (25%)
  • Week 2: Preprocessing, feature engineering (25%)
  • Week 3: Model development, tuning (30%)
  • Week 4: Deployment, documentation (20%)
Common Pitfalls
  • Don't leak test data into training
  • Avoid overfitting - use proper validation
  • Don't ignore class imbalance
  • Test your app before submitting
09

Pre-Submission Checklist

Technical Requirements
Repository Requirements