Project Overview
This capstone project brings together everything you have learned in the Machine Learning course. You will work with the famous Kaggle House Prices dataset containing 1,460 training samples with 79 features describing almost every aspect of residential homes in Ames, Iowa. The dataset includes 36 numerical features and 43 categorical features covering everything from lot size and basement quality to garage type and sale conditions. Your goal is to build a production-ready regression pipeline that accurately predicts sale prices.
The project proceeds in four phases:
- EDA: Explore distributions and correlations, and identify patterns
- Feature Engineering: Create, transform, and select the best features
- Model Training: Train and compare multiple regression algorithms
- Evaluation: Rigorous evaluation with cross-validation and metrics
Learning Objectives
Technical Skills
- Master pandas for data manipulation and cleaning
- Perform comprehensive exploratory data analysis
- Build sklearn pipelines with ColumnTransformer
- Implement feature engineering for mixed data types
- Train and tune multiple regression models
ML Engineering Skills
- Handle missing values with domain knowledge
- Encode categorical variables effectively
- Perform hyperparameter tuning with GridSearchCV
- Evaluate models using appropriate regression metrics
- Create reproducible and documented pipelines
Business Scenario
HomeValue AI
You have been hired as a Machine Learning Engineer at HomeValue AI, a real estate technology startup. The company is building an automated home valuation system to help buyers, sellers, and real estate agents get accurate price estimates instantly. The CEO has given you this challenge:
"We have historical sales data from Ames, Iowa - one of the most detailed housing datasets available. I need you to build a prediction model that can estimate house prices within 10% of actual sale prices. The model needs to be explainable so agents can tell clients WHY a house is valued at a certain price. Can you build us a reliable, interpretable pricing engine?"
Business Questions to Answer
- What is the predicted sale price for a given house?
- What is the prediction confidence interval?
- How accurate is our model on unseen data?
- Which model performs best for this dataset?
- What features most influence house prices?
- How much does each bedroom add to value?
- What is the premium for quality finishes?
- How does neighborhood affect pricing?
- Are there undervalued properties in the market?
- What renovations add the most value?
- How does age affect property value?
- What is the price distribution by neighborhood?
- Where does the model make the largest errors?
- Are there outliers affecting predictions?
- How does model performance vary by price range?
- What data quality issues need addressing?
The Dataset
You will work with the Kaggle House Prices dataset. Download the CSV files for the training and test sets; the training data contains 79 explanatory features plus the SalePrice target:
Dataset Download
Download the house prices dataset files and save them to your project folder. The CSV files contain all necessary data for building your prediction model.
Original Data Source
This project uses the House Prices: Advanced Regression Techniques dataset from Kaggle - one of the most popular competition datasets for learning regression. The dataset was compiled by Dean De Cock for use in data science education and contains 79 features describing homes in Ames, Iowa.
Key Features Overview
| Category | Features | Description |
|---|---|---|
| Area | LotArea, GrLivArea, TotalBsmtSF, 1stFlrSF, 2ndFlrSF, GarageArea | Square footage measurements |
| Rooms | BedroomAbvGr, TotRmsAbvGrd, FullBath, HalfBath, KitchenAbvGr | Room counts |
| Quality | OverallQual, OverallCond | 1-10 rating scales |
| Age | YearBuilt, YearRemodAdd, GarageYrBlt | Construction and remodel years |
| Basement | BsmtFinSF1, BsmtFinSF2, BsmtUnfSF | Basement area breakdown |
| Porch | WoodDeckSF, OpenPorchSF, EnclosedPorch, ScreenPorch | Outdoor areas |
| Target | SalePrice | Sale price in USD |
| Category | Features | Example Values |
|---|---|---|
| Location | Neighborhood, MSZoning | CollgCr, Veenker, NAmes; RL, RM, FV |
| Building | BldgType, HouseStyle, RoofStyle | 1Fam, 2fmCon; 1Story, 2Story |
| Quality | ExterQual, ExterCond, BsmtQual, KitchenQual | Ex, Gd, TA, Fa, Po |
| Garage | GarageType, GarageFinish, GarageCond | Attchd, Detchd, BuiltIn |
| Basement | BsmtExposure, BsmtFinType1, BsmtCond | Gd, Av, Mn, No |
| Utilities | Heating, CentralAir, Electrical | GasA, GasW; Y, N |
| Sale | SaleType, SaleCondition | WD, New, COD; Normal, Abnormal |
| Feature | Missing % | Reason | Recommended Action |
|---|---|---|---|
| PoolQC | 99.5% | No pool | Fill with "None" |
| MiscFeature | 96.3% | No misc feature | Fill with "None" |
| Alley | 93.8% | No alley access | Fill with "None" |
| Fence | 80.8% | No fence | Fill with "None" |
| FireplaceQu | 47.3% | No fireplace | Fill with "None" |
| LotFrontage | 17.7% | Missing data | Impute by neighborhood median |
| GarageYrBlt | 5.5% | No garage | Fill with 0 or YearBuilt |
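The recommended actions in the table above can be sketched in pandas (a hedged sketch, not the required implementation; column names follow the Kaggle data dictionary, and the GarageYrBlt row takes the YearBuilt fallback):

```python
import numpy as np
import pandas as pd

def impute_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the table's recommended actions for high-missingness features."""
    df = df.copy()
    # NaN here means the feature is absent (no pool, no fence, ...), not unrecorded
    for col in ["PoolQC", "MiscFeature", "Alley", "Fence", "FireplaceQu"]:
        df[col] = df[col].fillna("None")
    # LotFrontage: fill with the median frontage of the house's neighborhood
    df["LotFrontage"] = df.groupby("Neighborhood")["LotFrontage"].transform(
        lambda s: s.fillna(s.median())
    )
    # GarageYrBlt: no garage -> fall back to the year the house was built
    df["GarageYrBlt"] = df["GarageYrBlt"].fillna(df["YearBuilt"])
    return df
```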
Sample Data Preview
Here is what a typical training record looks like:
| Id | MSSubClass | LotArea | OverallQual | YearBuilt | GrLivArea | BedroomAbvGr | Neighborhood | SalePrice |
|---|---|---|---|---|---|---|---|---|
| 1 | 60 | 8,450 | 7 | 2003 | 1,710 | 3 | CollgCr | $208,500 |
| 2 | 20 | 9,600 | 6 | 1976 | 1,262 | 3 | Veenker | $181,500 |
| 3 | 60 | 11,250 | 7 | 2001 | 1,786 | 3 | CollgCr | $223,500 |
Project Requirements
Your project must include all of the following components. Structure your deliverables as Jupyter notebooks with clear documentation and code organization.
Exploratory Data Analysis (EDA)
Create 01_eda.ipynb:
- Load data and examine shape, dtypes, and missing values
- Analyze target variable (SalePrice) distribution
- Visualize numerical feature distributions with histograms
- Analyze categorical features with bar plots
- Create correlation matrix heatmap for numerical features
- Identify top 10 features correlated with SalePrice
- Scatter plots for key features vs SalePrice
- Document 5+ key insights from your analysis
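The top-10 correlation step above fits in a few lines of pandas (a sketch, assuming SalePrice is among the numeric columns of the loaded frame):

```python
import pandas as pd

def top_correlated_features(df: pd.DataFrame, target: str = "SalePrice",
                            n: int = 10) -> pd.Series:
    """Return the n numerical features with the strongest (absolute)
    Pearson correlation to the target."""
    corr = df.select_dtypes("number").corr()[target].drop(target)
    return corr.abs().sort_values(ascending=False).head(n)
```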
Feature Engineering
Create 02_feature_engineering.ipynb:
- Handle Missing Values: Domain-aware imputation strategy
- Create New Features:
  - TotalSF = TotalBsmtSF + 1stFlrSF + 2ndFlrSF
  - TotalBathrooms = FullBath + 0.5*HalfBath + BsmtFullBath + 0.5*BsmtHalfBath
  - Age = YrSold - YearBuilt
  - Remodeled = 1 if YearRemodAdd != YearBuilt else 0
  - TotalPorchSF = sum of all porch areas
  - HasPool, HasGarage, HasFireplace = binary indicators
- Encode Categoricals: OrdinalEncoder for quality features, OneHotEncoder for nominal
- Handle Outliers: Identify and treat outliers in GrLivArea, LotArea
- Log Transform: Transform SalePrice and skewed features
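The derived features listed above might look like this in pandas (a sketch only; column names follow the Kaggle data dictionary, and the encoding, outlier, and log-transform steps are left out):

```python
import pandas as pd

def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive the new features described in the requirements."""
    df = df.copy()
    df["TotalSF"] = df["TotalBsmtSF"] + df["1stFlrSF"] + df["2ndFlrSF"]
    df["TotalBathrooms"] = (df["FullBath"] + 0.5 * df["HalfBath"]
                            + df["BsmtFullBath"] + 0.5 * df["BsmtHalfBath"])
    df["Age"] = df["YrSold"] - df["YearBuilt"]
    df["Remodeled"] = (df["YearRemodAdd"] != df["YearBuilt"]).astype(int)
    df["TotalPorchSF"] = (df["WoodDeckSF"] + df["OpenPorchSF"]
                          + df["EnclosedPorch"] + df["ScreenPorch"])
    df["HasPool"] = (df["PoolArea"] > 0).astype(int)
    df["HasGarage"] = (df["GarageArea"] > 0).astype(int)
    df["HasFireplace"] = (df["Fireplaces"] > 0).astype(int)
    return df
```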
Model Training
Create 03_model_training.ipynb:
- Build sklearn Pipeline: Use ColumnTransformer for preprocessing
- Train at least 5 models:
- Linear Regression (baseline)
- Ridge Regression
- Lasso Regression
- Random Forest Regressor
- XGBoost/Gradient Boosting
- Cross-Validation: 5-fold CV for all models
- Hyperparameter Tuning: GridSearchCV for top 2 models
- Model Comparison: Create comparison table with metrics
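The cross-validation and tuning steps above can be wired together as below (a minimal sketch on synthetic data standing in for the preprocessed feature matrix; in the notebook you would pass your full pipeline instead of a bare Ridge):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic stand-in for the preprocessed feature matrix
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 1.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=200)

# 5-fold CV for one candidate model, scored with negative RMSE
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=5,
                         scoring="neg_root_mean_squared_error")
print(f"CV RMSE: {-scores.mean():.3f} +/- {scores.std():.3f}")

# Hyperparameter tuning for a top performer
grid = GridSearchCV(Ridge(), param_grid={"alpha": [0.1, 1.0, 10.0, 100.0]},
                    cv=5, scoring="neg_root_mean_squared_error")
grid.fit(X, y)
print("best alpha:", grid.best_params_["alpha"])
```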
Model Evaluation
Create 04_evaluation.ipynb:
- Calculate metrics: RMSE, MAE, R², MAPE on test set
- Create residual plots (residuals vs predicted)
- Plot actual vs predicted values
- Analyze prediction error distribution
- Feature importance visualization (top 20 features)
- Error analysis by price range (low, medium, high)
- Identify worst predictions and analyze patterns
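The four test-set metrics can be collected in one helper (a sketch; MAPE here is the plain percentage form, which assumes no zero prices):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_report(y_true, y_pred) -> dict:
    """RMSE, MAE, R² and MAPE on the (untransformed) price scale."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return {
        "RMSE": float(np.sqrt(mean_squared_error(y_true, y_pred))),
        "MAE": float(mean_absolute_error(y_true, y_pred)),
        "R2": float(r2_score(y_true, y_pred)),
        "MAPE": float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100),
    }
```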
Final Report
Create analysis_report.pdf:
- Executive Summary (1 page): Key findings and model performance
- Data Analysis (2 pages): EDA insights with visualizations
- Methodology (2 pages): Feature engineering and model approach
- Results (2 pages): Model comparison and final performance
- Recommendations (1 page): Business insights and next steps
Model Specifications
Train and evaluate the following models. Use cross-validation for fair comparison and tune hyperparameters for your top performers.
- Linear Regression: Baseline model, no regularization
- Ridge (L2): Tune alpha: [0.1, 1.0, 10.0, 100.0]
- Lasso (L1): Tune alpha: [0.0001, 0.001, 0.01, 0.1]
- ElasticNet: Optional - combine L1 and L2
- Random Forest: Tune n_estimators, max_depth, min_samples_split
- Gradient Boosting: Tune learning_rate, n_estimators, max_depth
- XGBoost: Tune learning_rate, max_depth, subsample, colsample_bytree
- LightGBM: Optional - faster alternative
Evaluation Metrics
- RMSE: Root Mean Squared Error - penalizes large errors
- MAE: Mean Absolute Error - average error magnitude
- R² Score: Coefficient of determination - variance explained
- RMSLE: Root Mean Squared Log Error - the Kaggle leaderboard metric
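RMSLE is not built into older scikit-learn releases, but it is a one-liner with numpy; note it equals plain RMSE computed on log1p-transformed prices:

```python
import numpy as np

def rmsle(y_true, y_pred) -> float:
    """Root Mean Squared Log Error, the Kaggle leaderboard metric."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))
```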
Sample Pipeline Code
```python
# Build preprocessing pipeline
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor

# Define column types (extend these lists to cover all columns you use)
numerical_cols = ['LotArea', 'GrLivArea', 'TotalBsmtSF', 'YearBuilt', ...]
categorical_cols = ['Neighborhood', 'HouseStyle', 'ExterQual', ...]

# Preprocessing pipelines
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='None')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine with ColumnTransformer
preprocessor = ColumnTransformer(transformers=[
    ('num', numerical_transformer, numerical_cols),
    ('cat', categorical_transformer, categorical_cols)
])

# Full pipeline with model
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', RandomForestRegressor(n_estimators=100, random_state=42))
])
```
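A quick smoke test of the same pipeline shape, using a tiny made-up frame in place of train.csv (all values invented for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up frame standing in for train.csv
df = pd.DataFrame({
    "GrLivArea": [1710.0, 1262.0, 1786.0, np.nan],
    "Neighborhood": ["CollgCr", "Veenker", "CollgCr", np.nan],
    "SalePrice": [208500, 181500, 223500, 140000],
})

num = Pipeline([("imputer", SimpleImputer(strategy="median")),
                ("scaler", StandardScaler())])
cat = Pipeline([("imputer", SimpleImputer(strategy="constant", fill_value="None")),
                ("onehot", OneHotEncoder(handle_unknown="ignore"))])
pre = ColumnTransformer([("num", num, ["GrLivArea"]),
                         ("cat", cat, ["Neighborhood"])])
model = Pipeline([("preprocessor", pre),
                  ("regressor", RandomForestRegressor(n_estimators=50,
                                                      random_state=42))])

X, y = df.drop(columns=["SalePrice"]), df["SalePrice"]
model.fit(X, y)          # imputation, scaling, encoding all happen inside fit
preds = model.predict(X)
```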
Required Visualizations
Create at least 20 visualizations across your notebooks. Each visualization should have proper titles, labels, and interpretive commentary.
Exploratory Visualizations
- SalePrice distribution (histogram + KDE)
- Log-transformed SalePrice distribution
- Correlation heatmap (top 15 features)
- Missing values heatmap
- GrLivArea vs SalePrice scatter
- OverallQual vs SalePrice boxplot
- Neighborhood price distribution
- Year vs SalePrice trend
Model Evaluation Visualizations
- Actual vs Predicted scatter plot
- Residuals vs Predicted plot
- Residual distribution histogram
- Feature importance bar chart (top 20)
- Model comparison bar chart (CV scores)
- Learning curves (train vs validation)
- Cross-validation score boxplots
- Error analysis by price segment
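A residuals-vs-predicted plot like the one required above can be produced with matplotlib (a sketch; the Agg backend and the output filename are choices for headless, scripted runs):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

def residual_plot(y_true, y_pred, path="residuals_plot.png"):
    """Scatter residuals against predictions and save the figure to disk."""
    residuals = np.asarray(y_true) - np.asarray(y_pred)
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.scatter(y_pred, residuals, alpha=0.5)
    ax.axhline(0, color="red", linestyle="--")
    ax.set_xlabel("Predicted SalePrice")
    ax.set_ylabel("Residual")
    ax.set_title("Residuals vs Predicted")
    fig.savefig(path, dpi=150, bbox_inches="tight")
    plt.close(fig)
```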
Submission Requirements
Create a public GitHub repository with the exact name shown below:
Required Repository Name
house-price-prediction-ml
Required Project Structure
house-price-prediction-ml/
├── data/
│ ├── train.csv # Original training data
│ ├── test.csv # Original test data
│ └── data_description.txt # Feature descriptions
├── notebooks/
│ ├── 01_eda.ipynb # Exploratory data analysis
│ ├── 02_feature_engineering.ipynb # Feature engineering
│ ├── 03_model_training.ipynb # Model training & tuning
│ └── 04_evaluation.ipynb # Model evaluation
├── models/
│ └── best_model.joblib # Saved best model
├── reports/
│ └── analysis_report.pdf # Final analysis report
├── visualizations/
│ ├── correlation_heatmap.png
│ ├── feature_importance.png
│ ├── actual_vs_predicted.png
│ ├── residuals_plot.png
│ └── model_comparison.png
├── requirements.txt # Python dependencies
└── README.md # Project documentation
README.md Required Sections
1. Project Header
- Project title and description
- Your full name and submission date
- Course and project number
2. Business Context
- HomeValue AI scenario overview
- Project objectives
- Dataset summary
3. Technologies Used
- Python, pandas, numpy
- scikit-learn, XGBoost
- matplotlib, seaborn
4. Key Findings
- Top 5 insights from EDA
- Most important features
- Best performing model
5. Model Performance
- Final RMSE, MAE, R² scores
- Model comparison table
- Cross-validation results
6. Visualizations
- Key visualization screenshots
- Brief captions for each
- Link to notebooks
7. How to Run
- Installation instructions
- Running notebooks
- Making predictions
8. Contact
- GitHub profile link
- LinkedIn (optional)
Do Include
- All 4 notebooks with clear documentation
- At least 20 visualizations with interpretations
- Trained model saved with joblib
- PDF report with visualizations
- requirements.txt with all dependencies
- Professional README with screenshots
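Saving the trained model with joblib, as the checklist requires, is two calls (a sketch with a throwaway linear model; in the project you would dump the fitted pipeline to models/best_model.joblib):

```python
import numpy as np
import joblib  # ships as a scikit-learn dependency
from sklearn.linear_model import LinearRegression

# Throwaway model standing in for the tuned pipeline
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2 * X.ravel() + 1
fitted = LinearRegression().fit(X, y)

joblib.dump(fitted, "best_model.joblib")   # models/best_model.joblib in the repo
loaded = joblib.load("best_model.joblib")  # reload for serving or grading
```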
Do Not Include
- Virtual environment folders (venv, .venv)
- Jupyter checkpoint files (.ipynb_checkpoints)
- Extremely large data files (>100MB)
- API keys or personal credentials
- Incomplete or non-running notebooks
When submitting, enter your GitHub username - your repository will be verified automatically.
Grading Rubric
Your project will be graded on the following criteria. Total: 500 points.
| Criteria | Points | Description |
|---|---|---|
| Exploratory Data Analysis | 80 | Comprehensive EDA with 15+ visualizations and documented insights |
| Feature Engineering | 100 | At least 10 new features, proper encoding, missing value handling |
| Model Training | 100 | 5+ models trained, proper pipelines, hyperparameter tuning |
| Model Evaluation | 80 | Comprehensive evaluation with multiple metrics and error analysis |
| Model Performance | 50 | R² > 0.85, RMSLE < 0.15 on validation set |
| Analysis Report | 50 | Professional PDF report with insights and recommendations |
| Documentation | 40 | README quality, code comments, notebook organization |
| Total | 500 | |
Grading Levels
- Excellent: Exceeds all requirements with exceptional quality
- Good: Meets all requirements with good quality
- Satisfactory: Meets minimum requirements
- Needs Work: Missing key requirements
Ready to Submit?
Make sure you have completed all requirements and reviewed the grading rubric above.