Assignment 2-A

Regression Models: Housing Price Prediction

Build a complete regression analysis system that applies all Module 2 concepts: simple and multiple linear regression, polynomial regression, Ridge and Lasso regularization, model comparison, and hyperparameter tuning.

Estimated time: 5-7 hours
Difficulty: Intermediate
Points: 175
What You'll Practice
  • Build linear regression models
  • Apply polynomial features
  • Implement Ridge & Lasso regression
  • Evaluate with MSE, RMSE, R²
  • Tune hyperparameters with CV
Contents
01

Assignment Overview

In this assignment, you will build a complete Housing Price Prediction System using various regression techniques. This comprehensive project requires you to apply ALL concepts from Module 2: simple linear regression, multiple linear regression, polynomial regression, Ridge regularization, Lasso regularization, and proper model evaluation using regression metrics.

Libraries Allowed: You may use pandas, numpy, matplotlib, seaborn, and scikit-learn for this assignment.
Skills Applied: This assignment tests your understanding of Linear Regression (Topic 2.1), Polynomial Regression (Topic 2.2), and Regularization (Topic 2.3) from Module 2.
Linear Regression (2.1)

Simple & multiple linear regression, coefficients, assumptions

Polynomial Regression (2.2)

Feature transformation, degree selection, overfitting detection

Regularization (2.3)

Ridge (L2), Lasso (L1), ElasticNet, alpha tuning

02

The Scenario

HomeValue Analytics

You have been hired as a Machine Learning Engineer at HomeValue Analytics, a real estate technology company that helps buyers and sellers understand property values. The lead data scientist has given you this task:

"We have historical housing data with various features like square footage, bedrooms, location scores, and more. We need you to build multiple regression models, compare their performance, and recommend the best approach for predicting house prices. Pay special attention to overfitting - we need models that generalize well!"

Your Task

Create a Jupyter Notebook called regression_analysis.ipynb that implements multiple regression models, compares their performance using appropriate metrics, and provides recommendations for the best model to use in production.

03

The Dataset

You will work with a Housing Price dataset. Create this CSV file as shown below:

File: housing_data.csv (Housing Data)

house_id,square_feet,bedrooms,bathrooms,age_years,garage_size,location_score,has_pool,has_garden,distance_to_city,price
H001,1850,3,2,5,2,8.5,0,1,12,385000
H002,2400,4,3,2,2,9.2,1,1,8,520000
H003,1200,2,1,25,1,6.5,0,0,22,195000
H004,3200,5,4,1,3,9.5,1,1,5,725000
H005,1650,3,2,15,1,7.2,0,1,18,275000
H006,2100,4,2,8,2,8.0,0,1,14,385000
H007,1400,2,1,30,1,5.8,0,0,28,165000
H008,2800,4,3,3,2,9.0,1,1,6,595000
H009,1950,3,2,12,2,7.8,0,1,16,325000
H010,3500,5,4,0,3,9.8,1,1,3,850000
H011,1100,2,1,35,0,5.2,0,0,32,145000
H012,2250,4,2,6,2,8.3,0,1,11,425000
H013,1750,3,2,18,1,6.8,0,0,20,255000
H014,2650,4,3,4,2,8.8,1,1,9,485000
H015,1550,3,1,22,1,6.2,0,0,25,215000
H016,2950,5,3,2,3,9.3,1,1,4,675000
H017,1350,2,1,28,1,5.5,0,0,30,155000
H018,2050,3,2,10,2,7.5,0,1,15,345000
H019,1900,3,2,7,2,8.2,0,1,13,365000
H020,3100,5,4,1,3,9.6,1,1,4,780000
H021,1450,2,1,20,1,6.0,0,0,24,185000
H022,2300,4,2,5,2,8.4,1,1,10,445000
H023,1600,3,2,14,1,7.0,0,1,19,285000
H024,2700,4,3,3,2,8.9,1,1,7,545000
H025,1250,2,1,32,0,5.0,0,0,35,135000
Columns Explained
  • house_id - Unique identifier (string)
  • square_feet - Living area in sq ft (integer) - Key predictor
  • bedrooms - Number of bedrooms (integer)
  • bathrooms - Number of bathrooms (integer)
  • age_years - Age of the house (integer)
  • garage_size - Garage capacity in cars (integer)
  • location_score - Location desirability 1-10 (float)
  • has_pool - Has swimming pool (binary: 0/1)
  • has_garden - Has garden (binary: 0/1)
  • distance_to_city - Distance to city center in km (integer)
  • price - Sale price in dollars (target variable)
Note: You may extend this dataset with more rows for better model training. Consider adding some outliers to test your preprocessing skills.
04

Requirements

Your regression_analysis.ipynb must implement ALL of the following functions. Each function is mandatory and will be tested individually.

1
Load and Explore Data

Create a function load_and_explore(filename) that:

  • Loads the CSV file using pandas
  • Displays basic statistics for all numerical columns
  • Checks for missing values and data types
  • Returns the DataFrame and exploration summary
def load_and_explore(filename):
    """Load dataset and return exploration summary."""
    # Must return: (df, exploration_dict)
    pass
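Hint: a minimal sketch of the pandas calls this function typically wraps. It is a starting point, not a model solution; the keys in the summary dictionary are illustrative and you may structure it differently.

```python
import pandas as pd

def load_and_explore(filename):
    """Load dataset and return exploration summary."""
    df = pd.read_csv(filename)
    exploration = {
        "shape": df.shape,                               # (rows, columns)
        "dtypes": df.dtypes.astype(str).to_dict(),       # column data types
        "missing": df.isnull().sum().to_dict(),          # missing values per column
        "stats": df.describe(),                          # numeric columns only by default
    }
    print(df.describe())
    return df, exploration
```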
2
Visualize Feature Relationships

Create a function visualize_relationships(df, target='price') that:

  • Creates scatter plots of each feature vs target
  • Creates a correlation heatmap
  • Identifies the most correlated features
  • Saves plots as feature_analysis.png
def visualize_relationships(df, target='price'):
    """Create visualizations of feature-target relationships."""
    # Must save: feature_analysis.png
    pass
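Hint: one possible shape for this function using matplotlib only (seaborn's heatmap is an equally good choice). Assumes a DataFrame with numeric feature columns; the return value (features ranked by absolute correlation with the target) is a suggestion, not a requirement.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import pandas as pd

def visualize_relationships(df, target="price"):
    """Create visualizations of feature-target relationships."""
    numeric = df.select_dtypes("number")
    features = [c for c in numeric.columns if c != target]
    fig, axes = plt.subplots(1, len(features) + 1,
                             figsize=(4 * (len(features) + 1), 4))
    # One scatter plot per feature against the target
    for ax, col in zip(axes, features):
        ax.scatter(numeric[col], numeric[target], s=15)
        ax.set_xlabel(col)
        ax.set_ylabel(target)
    # Correlation heatmap in the last panel
    corr = numeric.corr()
    im = axes[-1].imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
    axes[-1].set_xticks(range(len(corr)), corr.columns, rotation=90)
    axes[-1].set_yticks(range(len(corr)), corr.columns)
    fig.colorbar(im, ax=axes[-1])
    fig.tight_layout()
    fig.savefig("feature_analysis.png")
    plt.close(fig)
    # Features ranked by |correlation| with the target
    return corr[target].drop(target).abs().sort_values(ascending=False)
```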
3
Simple Linear Regression

Create a function simple_linear_regression(X, y, feature_name) that:

  • Trains a simple linear regression with ONE feature
  • Plots the regression line with data points
  • Returns model, coefficients, and intercept
  • Prints the regression equation
def simple_linear_regression(X, y, feature_name):
    """Train simple linear regression and visualize."""
    # Return: (model, coefficient, intercept)
    pass
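Hint: a sketch of the scikit-learn calls involved. Note that `LinearRegression` expects X as a 2-D array, so a single feature must be reshaped to a column.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

def simple_linear_regression(X, y, feature_name):
    """Train simple linear regression on one feature and visualize."""
    X = np.asarray(X).reshape(-1, 1)          # single feature as a column vector
    model = LinearRegression().fit(X, y)
    coefficient, intercept = model.coef_[0], model.intercept_
    print(f"price = {coefficient:.2f} * {feature_name} + {intercept:.2f}")
    plt.scatter(X, y, label="data")
    plt.plot(X, model.predict(X), color="red", label="fit")
    plt.xlabel(feature_name)
    plt.ylabel("price")
    plt.legend()
    return model, coefficient, intercept
```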
4
Multiple Linear Regression

Create a function multiple_linear_regression(X_train, X_test, y_train, y_test) that:

  • Trains a multiple linear regression with ALL features
  • Returns model and predictions
  • Displays feature importance (coefficients)
def multiple_linear_regression(X_train, X_test, y_train, y_test):
    """Train multiple linear regression model."""
    # Return: (model, y_pred, coefficients_dict)
    pass
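Hint: a sketch of the multi-feature case. Feature names are read from X_train when it is a DataFrame; with plain arrays (e.g. after scaling), generic positional names are used instead.

```python
from sklearn.linear_model import LinearRegression

def multiple_linear_regression(X_train, X_test, y_train, y_test):
    """Train multiple linear regression model."""
    model = LinearRegression().fit(X_train, y_train)
    y_pred = model.predict(X_test)
    names = (list(X_train.columns) if hasattr(X_train, "columns")
             else [f"x{i}" for i in range(X_train.shape[1])])
    coefficients_dict = dict(zip(names, model.coef_))
    # Display features sorted by coefficient magnitude
    for name, c in sorted(coefficients_dict.items(), key=lambda kv: -abs(kv[1])):
        print(f"{name}: {c:.2f}")
    return model, y_pred, coefficients_dict
```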
5
Polynomial Regression

Create a function polynomial_regression(X_train, X_test, y_train, y_test, degree=2) that:

  • Creates polynomial features using PolynomialFeatures
  • Trains linear regression on transformed features
  • Returns model, transformer, and predictions
def polynomial_regression(X_train, X_test, y_train, y_test, degree=2):
    """Train polynomial regression model."""
    # Return: (model, poly_transformer, y_pred)
    pass
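Hint: the transformer and the model are returned separately so the same `PolynomialFeatures` object can be reused on new data; wrapping both in a `Pipeline` is an equally valid design.

```python
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

def polynomial_regression(X_train, X_test, y_train, y_test, degree=2):
    """Train polynomial regression model."""
    poly_transformer = PolynomialFeatures(degree=degree, include_bias=False)
    X_train_poly = poly_transformer.fit_transform(X_train)   # fit on train only
    X_test_poly = poly_transformer.transform(X_test)         # reuse on test
    model = LinearRegression().fit(X_train_poly, y_train)
    y_pred = model.predict(X_test_poly)
    return model, poly_transformer, y_pred
```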
6
Find Optimal Polynomial Degree

Create a function find_optimal_degree(X_train, X_test, y_train, y_test, max_degree=5) that:

  • Tests polynomial degrees from 1 to max_degree
  • Tracks train and test errors for each degree
  • Plots learning curves to show overfitting
  • Returns the optimal degree based on test performance
def find_optimal_degree(X_train, X_test, y_train, y_test, max_degree=5):
    """Find optimal polynomial degree by comparing train/test errors."""
    # Return: (optimal_degree, results_df)
    pass
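Hint: a sketch of the degree sweep. Train MSE keeps falling as the degree grows; the test MSE curve turning back upward is the overfitting signal, so the optimal degree is the one minimizing test error.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def find_optimal_degree(X_train, X_test, y_train, y_test, max_degree=5):
    """Find optimal polynomial degree by comparing train/test errors."""
    rows = []
    for degree in range(1, max_degree + 1):
        poly = PolynomialFeatures(degree=degree, include_bias=False)
        Xtr, Xte = poly.fit_transform(X_train), poly.transform(X_test)
        model = LinearRegression().fit(Xtr, y_train)
        rows.append({
            "degree": degree,
            "train_mse": mean_squared_error(y_train, model.predict(Xtr)),
            "test_mse": mean_squared_error(y_test, model.predict(Xte)),
        })
    results_df = pd.DataFrame(rows)
    plt.plot(results_df["degree"], results_df["train_mse"], label="train MSE")
    plt.plot(results_df["degree"], results_df["test_mse"], label="test MSE")
    plt.xlabel("polynomial degree")
    plt.ylabel("MSE")
    plt.legend()
    optimal_degree = int(results_df.loc[results_df["test_mse"].idxmin(), "degree"])
    return optimal_degree, results_df
```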
7
Ridge Regression

Create a function ridge_regression(X_train, X_test, y_train, y_test, alpha=1.0) that:

  • Trains Ridge regression with specified alpha
  • Returns model, predictions, and coefficients
  • Compares coefficient magnitudes with linear regression
def ridge_regression(X_train, X_test, y_train, y_test, alpha=1.0):
    """Train Ridge regression model."""
    # Return: (model, y_pred, coefficients)
    pass
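Hint: Ridge has the same fit/predict shape as `LinearRegression`; the extra step is printing its coefficients next to the unregularized ones so the L2 shrinkage is visible.

```python
from sklearn.linear_model import LinearRegression, Ridge

def ridge_regression(X_train, X_test, y_train, y_test, alpha=1.0):
    """Train Ridge regression model."""
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Side-by-side comparison with plain least squares
    ols = LinearRegression().fit(X_train, y_train)
    for i, (r, o) in enumerate(zip(model.coef_, ols.coef_)):
        print(f"feature {i}: ridge={r:.3f}  ols={o:.3f}")
    return model, y_pred, model.coef_
```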
8
Lasso Regression

Create a function lasso_regression(X_train, X_test, y_train, y_test, alpha=1.0) that:

  • Trains Lasso regression with specified alpha
  • Returns model, predictions, and coefficients
  • Identifies features with zero coefficients (feature selection)
def lasso_regression(X_train, X_test, y_train, y_test, alpha=1.0):
    """Train Lasso regression model."""
    # Return: (model, y_pred, coefficients, zero_features)
    pass
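Hint: unlike Ridge, Lasso drives some coefficients exactly to zero, which is how it performs feature selection. The positional names ("x0", "x1", ...) below are placeholders; with a DataFrame you would use its column names.

```python
from sklearn.linear_model import Lasso

def lasso_regression(X_train, X_test, y_train, y_test, alpha=1.0):
    """Train Lasso regression model."""
    model = Lasso(alpha=alpha, max_iter=10000).fit(X_train, y_train)
    y_pred = model.predict(X_test)
    coefficients = model.coef_
    # Coordinate descent yields exact zeros for dropped features
    zero_features = [f"x{i}" for i, c in enumerate(coefficients) if c == 0.0]
    print("Dropped by Lasso:", zero_features or "none")
    return model, y_pred, coefficients, zero_features
```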
9
Tune Regularization Alpha

Create a function tune_alpha(X_train, y_train, model_type='ridge') that:

  • Uses cross-validation to find optimal alpha
  • Tests alphas: [0.001, 0.01, 0.1, 1, 10, 100]
  • Plots alpha vs cross-validation score
  • Returns optimal alpha and CV results
def tune_alpha(X_train, y_train, model_type='ridge'):
    """Find optimal alpha using cross-validation."""
    # Return: (optimal_alpha, cv_results_df)
    pass
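Hint: a sketch using `cross_val_score` (`GridSearchCV` works just as well). Scores use `neg_mean_squared_error`, so higher is better and the best alpha maximizes the mean CV score.

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import cross_val_score

def tune_alpha(X_train, y_train, model_type="ridge"):
    """Find optimal alpha using cross-validation."""
    alphas = [0.001, 0.01, 0.1, 1, 10, 100]
    make_model = Ridge if model_type == "ridge" else Lasso
    rows = []
    for alpha in alphas:
        scores = cross_val_score(make_model(alpha=alpha), X_train, y_train,
                                 cv=5, scoring="neg_mean_squared_error")
        rows.append({"alpha": alpha, "mean_cv_score": scores.mean()})
    cv_results_df = pd.DataFrame(rows)
    plt.semilogx(cv_results_df["alpha"], cv_results_df["mean_cv_score"], marker="o")
    plt.xlabel("alpha")
    plt.ylabel("mean CV score (neg MSE)")
    optimal_alpha = cv_results_df.loc[cv_results_df["mean_cv_score"].idxmax(), "alpha"]
    return optimal_alpha, cv_results_df
```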
10
Calculate Regression Metrics

Create a function calculate_regression_metrics(y_true, y_pred, model_name) that:

  • Calculates MSE, RMSE, MAE, and R² score
  • Returns a dictionary with all metrics
  • Optionally creates residual plot
def calculate_regression_metrics(y_true, y_pred, model_name):
    """Calculate and return regression metrics."""
    # Return: dict with 'mse', 'rmse', 'mae', 'r2'
    pass
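Hint: `sklearn.metrics` provides everything except RMSE, which is just the square root of MSE (and therefore stays in the target's units, dollars here).

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

def calculate_regression_metrics(y_true, y_pred, model_name):
    """Calculate and return regression metrics."""
    mse = mean_squared_error(y_true, y_pred)
    metrics = {
        "mse": mse,
        "rmse": np.sqrt(mse),                       # same units as the target
        "mae": mean_absolute_error(y_true, y_pred),
        "r2": r2_score(y_true, y_pred),
    }
    print(f"{model_name}: " + ", ".join(f"{k}={v:.4f}" for k, v in metrics.items()))
    return metrics
```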
11
Compare All Models

Create a function compare_models(results_dict) that:

  • Takes dictionary of model results
  • Creates comparison bar charts for all metrics
  • Saves comparison as model_comparison.png
  • Returns DataFrame with comparison table
def compare_models(results_dict):
    """Compare all models and visualize results."""
    # Return: comparison_df
    pass
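Hint: one way to structure the comparison, assuming `results_dict` maps model names to the metric dictionaries returned by `calculate_regression_metrics`.

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

def compare_models(results_dict):
    """Compare all models and visualize results."""
    # Rows = models, columns = metrics
    comparison_df = pd.DataFrame(results_dict).T
    n = len(comparison_df.columns)
    fig, axes = plt.subplots(1, n, figsize=(5 * n, 4))
    for ax, metric in zip(axes, comparison_df.columns):
        ax.bar(comparison_df.index, comparison_df[metric])
        ax.set_title(metric)
        ax.tick_params(axis="x", labelrotation=45)
    fig.tight_layout()
    fig.savefig("model_comparison.png")
    plt.close(fig)
    return comparison_df
```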
12
Main Pipeline

Create a main() function that:

  • Runs the complete regression analysis pipeline
  • Trains all model types and collects results
  • Generates all required visualizations
  • Prints final recommendation for best model
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def main():
    # 1. Load and explore data
    df, summary = load_and_explore("housing_data.csv")
    
    # 2. Visualize relationships
    visualize_relationships(df)
    
    # 3. Prepare features and target
    X = df.drop(['house_id', 'price'], axis=1)
    y = df['price']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # 4. Scale features
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    
    # 5. Train all models
    results = {}
    
    # Simple Linear Regression (using square_feet)
    slr_model, coef, intercept = simple_linear_regression(X_train[['square_feet']], y_train, 'square_feet')
    
    # Multiple Linear Regression
    mlr_model, mlr_pred, mlr_coefs = multiple_linear_regression(X_train_scaled, X_test_scaled, y_train, y_test)
    results['Linear Regression'] = calculate_regression_metrics(y_test, mlr_pred, 'Linear Regression')
    
    # Polynomial Regression
    optimal_degree, degree_results = find_optimal_degree(X_train_scaled, X_test_scaled, y_train, y_test)
    poly_model, poly_trans, poly_pred = polynomial_regression(X_train_scaled, X_test_scaled, y_train, y_test, optimal_degree)
    results['Polynomial Regression'] = calculate_regression_metrics(y_test, poly_pred, 'Polynomial Regression')
    
    # Ridge Regression
    ridge_alpha, ridge_cv = tune_alpha(X_train_scaled, y_train, 'ridge')
    ridge_model, ridge_pred, ridge_coefs = ridge_regression(X_train_scaled, X_test_scaled, y_train, y_test, ridge_alpha)
    results['Ridge Regression'] = calculate_regression_metrics(y_test, ridge_pred, 'Ridge Regression')
    
    # Lasso Regression
    lasso_alpha, lasso_cv = tune_alpha(X_train_scaled, y_train, 'lasso')
    lasso_model, lasso_pred, lasso_coefs, zero_feats = lasso_regression(X_train_scaled, X_test_scaled, y_train, y_test, lasso_alpha)
    results['Lasso Regression'] = calculate_regression_metrics(y_test, lasso_pred, 'Lasso Regression')
    
    # 6. Compare all models
    comparison_df = compare_models(results)
    print(comparison_df)
    
    # 7. Recommendation
    best_model = comparison_df.loc[comparison_df['r2'].idxmax()]
    print(f"\nRecommendation: {best_model.name} with R² = {best_model['r2']:.4f}")

if __name__ == "__main__":
    main()
05

Submission

Create a public GitHub repository with the exact name shown below:

Required Repository Name
housing-price-regression
github.com/<your-username>/housing-price-regression
Required Files
housing-price-regression/
├── regression_analysis.ipynb  # Your Jupyter Notebook with ALL 12 functions
├── housing_data.csv           # Input dataset (as provided or extended)
├── feature_analysis.png       # Feature relationship visualizations
├── model_comparison.png       # Model comparison bar charts
├── predictions.csv            # Test predictions from best model
└── README.md                  # REQUIRED - see contents below
README.md Must Include:
  • Your full name and submission date
  • Summary of all models trained and their metrics
  • Your recommendation for the best model and why
  • Any challenges faced and how you solved them
  • Instructions to run your notebook
Do Include
  • All 12 functions implemented and working
  • Docstrings for every function
  • Clear visualizations with labels and titles
  • Model comparison with reasoning
  • Hyperparameter tuning with cross-validation
  • README.md with all required sections
Do Not Include
  • Any .pyc or __pycache__ files (use .gitignore)
  • Virtual environment folders
  • Large model pickle files
  • Code that doesn't run without errors
  • Hardcoded file paths
Important: Before submitting, run all cells in your notebook to make sure it executes without errors and generates all output files correctly!
Submit Your Assignment

Enter your GitHub username - we'll verify your repository automatically

06

Grading Rubric

Your assignment will be graded on the following criteria:

  • Linear Regression (25 pts): Correct implementation of simple and multiple linear regression
  • Polynomial Regression (30 pts): Proper feature transformation, degree selection, overfitting analysis
  • Regularization (35 pts): Correct Ridge and Lasso implementation with alpha tuning
  • Model Evaluation (25 pts): Accurate calculation of MSE, RMSE, MAE, R² and proper comparison
  • Visualizations (25 pts): Clear, informative plots with proper labels and titles
  • Code Quality (35 pts): Docstrings, comments, naming conventions, and clean organization
  • Total: 175 points

Ready to Submit?

Make sure you have completed all requirements and reviewed the grading rubric above.

07

What You Will Practice

Linear Regression (2.1)

Understanding coefficients, interpreting regression equations, feature importance

Polynomial Regression (2.2)

Feature transformation, detecting overfitting, selecting optimal complexity

Regularization (2.3)

Ridge vs Lasso, feature selection with L1, hyperparameter tuning

Model Comparison

Evaluating regression models, understanding metrics, making recommendations

08

Pro Tips

Regression Best Practices
  • Always scale features before regularization
  • Check for multicollinearity using the variance inflation factor (VIF)
  • Visualize residuals to check assumptions
  • Use cross-validation for hyperparameter tuning
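The VIF check above does not require an extra library: regress each feature on the remaining features and compute VIF_i = 1 / (1 - R_i²); values above roughly 10 are commonly read as strong multicollinearity. A minimal sketch (the function name is illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def vif_scores(X):
    """Variance inflation factor for each column of a 2-D feature array."""
    X = np.asarray(X, dtype=float)
    scores = []
    for i in range(X.shape[1]):
        others = np.delete(X, i, axis=1)           # all features except i
        r2 = LinearRegression().fit(others, X[:, i]).score(others, X[:, i])
        scores.append(1.0 / (1.0 - r2) if r2 < 1 else float("inf"))
    return scores
```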
Model Selection
  • Start simple, increase complexity gradually
  • Compare train vs test performance
  • Consider interpretability vs accuracy trade-off
  • Lasso is better when you suspect many irrelevant features
Metrics to Focus On
  • R² tells you how much variance is explained
  • RMSE is in the same units as target
  • MAE is more robust to outliers than MSE
  • Compare metrics across train and test sets
Common Mistakes
  • Forgetting to scale features for regularized models
  • Using polynomial degree too high (overfitting)
  • Not using cross-validation for alpha selection
  • Ignoring the bias-variance trade-off
09

Pre-Submission Checklist

Code Requirements
Repository Requirements