Project Overview
This capstone project brings together everything you have learned in the Machine Learning course. You will work with the famous Kaggle House Prices dataset containing 1,460 training samples with 79 features describing almost every aspect of residential homes in Ames, Iowa. The dataset includes 36 numerical features and 43 categorical features covering everything from lot size and basement quality to garage type and sale conditions. Your goal is to build a production-ready regression pipeline that accurately predicts sale prices.
The project proceeds in four phases:
- EDA: Explore distributions and correlations, and identify patterns
- Feature Engineering: Create, transform, and select the best features
- Model Training: Train and compare multiple regression algorithms
- Evaluation: Rigorous evaluation with cross-validation and metrics
Learning Objectives
Technical Skills
- Master pandas for data manipulation and cleaning
- Perform comprehensive exploratory data analysis
- Build sklearn pipelines with ColumnTransformer
- Implement feature engineering for mixed data types
- Train and tune multiple regression models
ML Engineering Skills
- Handle missing values with domain knowledge
- Encode categorical variables effectively
- Perform hyperparameter tuning with GridSearchCV
- Evaluate models using appropriate regression metrics
- Create reproducible and documented pipelines
Business Scenario
HomeValue AI
You have been hired as a Machine Learning Engineer at HomeValue AI, a real estate technology startup. The company is building an automated home valuation system to help buyers, sellers, and real estate agents get accurate price estimates instantly. The CEO has given you this challenge:
"We have historical sales data from Ames, Iowa - one of the most detailed housing datasets available. I need you to build a prediction model that can estimate house prices within 10% of actual sale prices. The model needs to be explainable so agents can tell clients WHY a house is valued at a certain price. Can you build us a reliable, interpretable pricing engine?"
Business Questions to Answer
- What is the predicted sale price for a given house?
- What is the prediction confidence interval?
- How accurate is our model on unseen data?
- Which model performs best for this dataset?
- What features most influence house prices?
- How much does each bedroom add to value?
- What is the premium for quality finishes?
- How does neighborhood affect pricing?
- Are there undervalued properties in the market?
- What renovations add the most value?
- How does age affect property value?
- What is the price distribution by neighborhood?
- Where does the model make the largest errors?
- Are there outliers affecting predictions?
- How does model performance vary by price range?
- What data quality issues need addressing?
The Dataset
You will work with the Kaggle House Prices dataset. Download the CSV files for the training and test sets; the training data contains 79 explanatory features plus the SalePrice target:
Dataset Download
Download the house prices dataset files and save them to your project folder. The CSV files contain all necessary data for building your prediction model.
Original Data Source
This project uses the House Prices: Advanced Regression Techniques dataset from Kaggle - one of the most popular competition datasets for learning regression. The dataset was compiled by Dean De Cock for use in data science education and contains 79 features describing homes in Ames, Iowa.
Key Features Overview
| Category | Features | Description |
|---|---|---|
| Area | LotArea, GrLivArea, TotalBsmtSF, 1stFlrSF, 2ndFlrSF, GarageArea | Square footage measurements |
| Rooms | BedroomAbvGr, TotRmsAbvGrd, FullBath, HalfBath, KitchenAbvGr | Room counts |
| Quality | OverallQual, OverallCond | 1-10 rating scales |
| Age | YearBuilt, YearRemodAdd, GarageYrBlt | Construction and remodel years |
| Basement | BsmtFinSF1, BsmtFinSF2, BsmtUnfSF | Basement area breakdown |
| Porch | WoodDeckSF, OpenPorchSF, EnclosedPorch, ScreenPorch | Outdoor areas |
| Target | SalePrice | Sale price in USD |
| Category | Features | Example Values |
|---|---|---|
| Location | Neighborhood, MSZoning | CollgCr, Veenker, NAmes; RL, RM, FV |
| Building | BldgType, HouseStyle, RoofStyle | 1Fam, 2fmCon; 1Story, 2Story |
| Quality | ExterQual, ExterCond, BsmtQual, KitchenQual | Ex, Gd, TA, Fa, Po |
| Garage | GarageType, GarageFinish, GarageCond | Attchd, Detchd, BuiltIn |
| Basement | BsmtExposure, BsmtFinType1, BsmtCond | Gd, Av, Mn, No |
| Utilities | Heating, CentralAir, Electrical | GasA, GasW; Y, N |
| Sale | SaleType, SaleCondition | WD, New, COD; Normal, Abnormal |
| Feature | Missing % | Reason | Recommended Action |
|---|---|---|---|
| PoolQC | 99.5% | No pool | Fill with "None" |
| MiscFeature | 96.3% | No misc feature | Fill with "None" |
| Alley | 93.8% | No alley access | Fill with "None" |
| Fence | 80.8% | No fence | Fill with "None" |
| FireplaceQu | 47.3% | No fireplace | Fill with "None" |
| LotFrontage | 17.7% | Missing data | Impute by neighborhood median |
| GarageYrBlt | 5.5% | No garage | Fill with 0 or YearBuilt |
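The recommended actions in the table above can be sketched in pandas (a hedged sketch, not the required implementation; column names follow the Kaggle data dictionary, and the GarageYrBlt row takes the YearBuilt fallback):

```python
import numpy as np
import pandas as pd

def impute_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the table's recommended actions for high-missingness features."""
    df = df.copy()
    # NaN here means the feature is absent (no pool, no fence, ...), not unrecorded
    for col in ["PoolQC", "MiscFeature", "Alley", "Fence", "FireplaceQu"]:
        df[col] = df[col].fillna("None")
    # LotFrontage: fill with the median frontage of the house's neighborhood
    df["LotFrontage"] = df.groupby("Neighborhood")["LotFrontage"].transform(
        lambda s: s.fillna(s.median())
    )
    # GarageYrBlt: no garage -> fall back to the year the house was built
    df["GarageYrBlt"] = df["GarageYrBlt"].fillna(df["YearBuilt"])
    return df
```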
Sample Data Preview
Here is what a typical training record looks like:
| Id | MSSubClass | LotArea | OverallQual | YearBuilt | GrLivArea | BedroomAbvGr | Neighborhood | SalePrice |
|---|---|---|---|---|---|---|---|---|
| 1 | 60 | 8,450 | 7 | 2003 | 1,710 | 3 | CollgCr | $208,500 |
| 2 | 20 | 9,600 | 6 | 1976 | 1,262 | 3 | Veenker | $181,500 |
| 3 | 60 | 11,250 | 7 | 2001 | 1,786 | 3 | CollgCr | $223,500 |
Project Requirements
Your project must include all of the following components. Structure your deliverables as Jupyter notebooks with clear documentation and code organization.
Exploratory Data Analysis (EDA)
Create 01_eda.ipynb:
- Load data and examine shape, dtypes, and missing values
- Analyze target variable (SalePrice) distribution
- Visualize numerical feature distributions with histograms
- Analyze categorical features with bar plots
- Create correlation matrix heatmap for numerical features
- Identify top 10 features correlated with SalePrice
- Scatter plots for key features vs SalePrice
- Document 5+ key insights from your analysis
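The top-10 correlation step above fits in a few lines of pandas (a sketch, assuming SalePrice is among the numeric columns of the loaded frame):

```python
import pandas as pd

def top_correlated_features(df: pd.DataFrame, target: str = "SalePrice",
                            n: int = 10) -> pd.Series:
    """Return the n numerical features with the strongest (absolute)
    Pearson correlation to the target."""
    corr = df.select_dtypes("number").corr()[target].drop(target)
    return corr.abs().sort_values(ascending=False).head(n)
```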
Feature Engineering
Create 02_feature_engineering.ipynb:
- Handle Missing Values: Domain-aware imputation strategy
- Create New Features:
  - TotalSF = TotalBsmtSF + 1stFlrSF + 2ndFlrSF
  - TotalBathrooms = FullBath + 0.5*HalfBath + BsmtFullBath + 0.5*BsmtHalfBath
  - Age = YrSold - YearBuilt
  - Remodeled = 1 if YearRemodAdd != YearBuilt else 0
  - TotalPorchSF = sum of all porch areas
  - HasPool, HasGarage, HasFireplace = binary indicators
- Encode Categoricals: OrdinalEncoder for quality features, OneHotEncoder for nominal
- Handle Outliers: Identify and treat outliers in GrLivArea, LotArea
- Log Transform: Transform SalePrice and skewed features
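The derived features listed above might look like this in pandas (a sketch only; column names follow the Kaggle data dictionary, and the encoding, outlier, and log-transform steps are left out):

```python
import pandas as pd

def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive the new features described in the requirements."""
    df = df.copy()
    df["TotalSF"] = df["TotalBsmtSF"] + df["1stFlrSF"] + df["2ndFlrSF"]
    df["TotalBathrooms"] = (df["FullBath"] + 0.5 * df["HalfBath"]
                            + df["BsmtFullBath"] + 0.5 * df["BsmtHalfBath"])
    df["Age"] = df["YrSold"] - df["YearBuilt"]
    df["Remodeled"] = (df["YearRemodAdd"] != df["YearBuilt"]).astype(int)
    df["TotalPorchSF"] = (df["WoodDeckSF"] + df["OpenPorchSF"]
                          + df["EnclosedPorch"] + df["ScreenPorch"])
    df["HasPool"] = (df["PoolArea"] > 0).astype(int)
    df["HasGarage"] = (df["GarageArea"] > 0).astype(int)
    df["HasFireplace"] = (df["Fireplaces"] > 0).astype(int)
    return df
```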
Model Training
Create 03_model_training.ipynb:
- Build sklearn Pipeline: Use ColumnTransformer for preprocessing
- Train at least 5 models:
- Linear Regression (baseline)
- Ridge Regression
- Lasso Regression
- Random Forest Regressor
- XGBoost/Gradient Boosting
- Cross-Validation: 5-fold CV for all models
- Hyperparameter Tuning: GridSearchCV for top 2 models
- Model Comparison: Create comparison table with metrics
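The cross-validation and tuning steps above can be wired together as below (a minimal sketch on synthetic data standing in for the preprocessed feature matrix; in the notebook you would pass your full pipeline instead of a bare Ridge):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic stand-in for the preprocessed feature matrix
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 1.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=200)

# 5-fold CV for one candidate model, scored with negative RMSE
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=5,
                         scoring="neg_root_mean_squared_error")
print(f"CV RMSE: {-scores.mean():.3f} +/- {scores.std():.3f}")

# Hyperparameter tuning for a top performer
grid = GridSearchCV(Ridge(), param_grid={"alpha": [0.1, 1.0, 10.0, 100.0]},
                    cv=5, scoring="neg_root_mean_squared_error")
grid.fit(X, y)
print("best alpha:", grid.best_params_["alpha"])
```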
Model Evaluation
Create 04_evaluation.ipynb:
- Calculate metrics: RMSE, MAE, R², MAPE on test set
- Create residual plots (residuals vs predicted)
- Plot actual vs predicted values
- Analyze prediction error distribution
- Feature importance visualization (top 20 features)
- Error analysis by price range (low, medium, high)
- Identify worst predictions and analyze patterns
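The four test-set metrics can be collected in one helper (a sketch; MAPE here is the plain percentage form, which assumes no zero prices):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_report(y_true, y_pred) -> dict:
    """RMSE, MAE, R² and MAPE on the (untransformed) price scale."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return {
        "RMSE": float(np.sqrt(mean_squared_error(y_true, y_pred))),
        "MAE": float(mean_absolute_error(y_true, y_pred)),
        "R2": float(r2_score(y_true, y_pred)),
        "MAPE": float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100),
    }
```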
Final Report
Create analysis_report.pdf:
- Executive Summary (1 page): Key findings and model performance
- Data Analysis (2 pages): EDA insights with visualizations
- Methodology (2 pages): Feature engineering and model approach
- Results (2 pages): Model comparison and final performance
- Recommendations (1 page): Business insights and next steps
Model Specifications
Train and evaluate the following models. Use cross-validation for fair comparison and tune hyperparameters for your top performers.
- Linear Regression: Baseline model, no regularization
- Ridge (L2): Tune alpha: [0.1, 1.0, 10.0, 100.0]
- Lasso (L1): Tune alpha: [0.0001, 0.001, 0.01, 0.1]
- ElasticNet: Optional - combine L1 and L2
- Random Forest: Tune n_estimators, max_depth, min_samples_split
- Gradient Boosting: Tune learning_rate, n_estimators, max_depth
- XGBoost: Tune learning_rate, max_depth, subsample, colsample_bytree
- LightGBM: Optional - faster alternative
Evaluation Metrics
- RMSE: Root Mean Squared Error - penalizes large errors
- MAE: Mean Absolute Error - average error magnitude
- R² Score: Coefficient of determination - variance explained
- RMSLE: Root Mean Squared Log Error - the Kaggle leaderboard metric
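RMSLE is not built into older scikit-learn releases, but it is a one-liner with numpy; note it equals plain RMSE computed on log1p-transformed prices:

```python
import numpy as np

def rmsle(y_true, y_pred) -> float:
    """Root Mean Squared Log Error, the Kaggle leaderboard metric."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))
```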
Sample Pipeline Code
```python
# Build preprocessing pipeline
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor

# Define column types (extend these lists to cover all columns you use)
numerical_cols = ['LotArea', 'GrLivArea', 'TotalBsmtSF', 'YearBuilt', ...]
categorical_cols = ['Neighborhood', 'HouseStyle', 'ExterQual', ...]

# Preprocessing pipelines
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='None')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine with ColumnTransformer
preprocessor = ColumnTransformer(transformers=[
    ('num', numerical_transformer, numerical_cols),
    ('cat', categorical_transformer, categorical_cols)
])

# Full pipeline with model
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', RandomForestRegressor(n_estimators=100, random_state=42))
])
```
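A quick smoke test of the same pipeline shape, using a tiny made-up frame in place of train.csv (all values invented for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up frame standing in for train.csv
df = pd.DataFrame({
    "GrLivArea": [1710.0, 1262.0, 1786.0, np.nan],
    "Neighborhood": ["CollgCr", "Veenker", "CollgCr", np.nan],
    "SalePrice": [208500, 181500, 223500, 140000],
})

num = Pipeline([("imputer", SimpleImputer(strategy="median")),
                ("scaler", StandardScaler())])
cat = Pipeline([("imputer", SimpleImputer(strategy="constant", fill_value="None")),
                ("onehot", OneHotEncoder(handle_unknown="ignore"))])
pre = ColumnTransformer([("num", num, ["GrLivArea"]),
                         ("cat", cat, ["Neighborhood"])])
model = Pipeline([("preprocessor", pre),
                  ("regressor", RandomForestRegressor(n_estimators=50,
                                                      random_state=42))])

X, y = df.drop(columns=["SalePrice"]), df["SalePrice"]
model.fit(X, y)          # imputation, scaling, encoding all happen inside fit
preds = model.predict(X)
```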
Required Visualizations
Create at least 20 visualizations across your notebooks. Each visualization should have proper titles, labels, and interpretive commentary.
Exploratory Visualizations
- SalePrice distribution (histogram + KDE)
- Log-transformed SalePrice distribution
- Correlation heatmap (top 15 features)
- Missing values heatmap
- GrLivArea vs SalePrice scatter
- OverallQual vs SalePrice boxplot
- Neighborhood price distribution
- Year vs SalePrice trend
Model Evaluation Visualizations
- Actual vs Predicted scatter plot
- Residuals vs Predicted plot
- Residual distribution histogram
- Feature importance bar chart (top 20)
- Model comparison bar chart (CV scores)
- Learning curves (train vs validation)
- Cross-validation score boxplots
- Error analysis by price segment
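A residuals-vs-predicted plot like the one required above can be produced with matplotlib (a sketch; the Agg backend and the output filename are choices for headless, scripted runs):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

def residual_plot(y_true, y_pred, path="residuals_plot.png"):
    """Scatter residuals against predictions and save the figure to disk."""
    residuals = np.asarray(y_true) - np.asarray(y_pred)
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.scatter(y_pred, residuals, alpha=0.5)
    ax.axhline(0, color="red", linestyle="--")
    ax.set_xlabel("Predicted SalePrice")
    ax.set_ylabel("Residual")
    ax.set_title("Residuals vs Predicted")
    fig.savefig(path, dpi=150, bbox_inches="tight")
    plt.close(fig)
```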
Submission Requirements
Create a public GitHub repository with the exact name shown below:
Required Repository Name
house-price-prediction-ml
Required Project Structure
house-price-prediction-ml/
├── data/
│ ├── train.csv # Original training data
│ ├── test.csv # Original test data
│ └── data_description.txt # Feature descriptions
├── notebooks/
│ ├── 01_eda.ipynb # Exploratory data analysis
│ ├── 02_feature_engineering.ipynb # Feature engineering
│ ├── 03_model_training.ipynb # Model training & tuning
│ └── 04_evaluation.ipynb # Model evaluation
├── models/
│ └── best_model.joblib # Saved best model
├── reports/
│ └── analysis_report.pdf # Final analysis report
├── visualizations/
│ ├── correlation_heatmap.png
│ ├── feature_importance.png
│ ├── actual_vs_predicted.png
│ ├── residuals_plot.png
│ └── model_comparison.png
├── requirements.txt # Python dependencies
└── README.md # Project documentation
README.md Required Sections
1. Project Header
- Project title and description
- Your full name and submission date
- Course and project number
2. Business Context
- HomeValue AI scenario overview
- Project objectives
- Dataset summary
3. Technologies Used
- Python, pandas, numpy
- scikit-learn, XGBoost
- matplotlib, seaborn
4. Key Findings
- Top 5 insights from EDA
- Most important features
- Best performing model
5. Model Performance
- Final RMSE, MAE, R² scores
- Model comparison table
- Cross-validation results
6. Visualizations
- Key visualization screenshots
- Brief captions for each
- Link to notebooks
7. How to Run
- Installation instructions
- Running notebooks
- Making predictions
8. Contact
- GitHub profile link
- LinkedIn (optional)
Do Include
- All 4 notebooks with clear documentation
- At least 20 visualizations with interpretations
- Trained model saved with joblib
- PDF report with visualizations
- requirements.txt with all dependencies
- Professional README with screenshots
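Saving the trained model with joblib, as the checklist requires, is two calls (a sketch with a throwaway linear model; in the project you would dump the fitted pipeline to models/best_model.joblib):

```python
import numpy as np
import joblib  # ships as a scikit-learn dependency
from sklearn.linear_model import LinearRegression

# Throwaway model standing in for the tuned pipeline
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2 * X.ravel() + 1
fitted = LinearRegression().fit(X, y)

joblib.dump(fitted, "best_model.joblib")   # models/best_model.joblib in the repo
loaded = joblib.load("best_model.joblib")  # reload for serving or grading
```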
Do Not Include
- Virtual environment folders (venv, .venv)
- Jupyter checkpoint files (.ipynb_checkpoints)
- Extremely large data files (>100MB)
- API keys or personal credentials
- Incomplete or non-running notebooks
When submitting, enter your GitHub username - your repository will be verified automatically.
Grading Rubric
Your project will be graded on the following criteria. Total: 500 points.
| Criteria | Points | Description |
|---|---|---|
| Exploratory Data Analysis | 80 | Comprehensive EDA with 15+ visualizations and documented insights |
| Feature Engineering | 100 | At least 10 new features, proper encoding, missing value handling |
| Model Training | 100 | 5+ models trained, proper pipelines, hyperparameter tuning |
| Model Evaluation | 80 | Comprehensive evaluation with multiple metrics and error analysis |
| Model Performance | 50 | R² > 0.85, RMSLE < 0.15 on validation set |
| Analysis Report | 50 | Professional PDF report with insights and recommendations |
| Documentation | 40 | README quality, code comments, notebook organization |
| Total | 500 | |
Grading Levels
- Excellent: Exceeds all requirements with exceptional quality
- Good: Meets all requirements with good quality
- Satisfactory: Meets minimum requirements
- Needs Work: Missing key requirements
Ready to Submit?
Make sure you have completed all requirements and reviewed the grading rubric above.