Assignment 6-A

Feature Engineering Pipeline

Build a comprehensive feature engineering pipeline for house price prediction. Apply feature selection, transformation, creation, and encoding techniques to significantly improve model performance.

5-6 hours
Intermediate
200 Points
What You'll Practice
  • Handle missing values strategically
  • Encode categorical variables properly
  • Create new features from existing ones
  • Apply feature selection techniques
  • Build sklearn pipelines
01

Assignment Overview

In this assignment, you will build a complete Feature Engineering Pipeline for predicting house prices. This project requires you to apply core feature engineering techniques: missing value handling, encoding strategies, feature creation, feature selection, and sklearn pipelines. These skills often make the difference between a mediocre model and an excellent one.

Feature Engineering Focus: Good feature engineering can improve model performance more than algorithm selection. You must demonstrate measurable improvement in model metrics after applying your pipeline.
Skills Applied: This assignment tests your understanding of missing value imputation, categorical encoding, feature creation, feature selection methods, and sklearn Pipeline/ColumnTransformer.
Feature Creation

Polynomial features, interactions, aggregations, domain-specific features

Feature Selection

Filter methods, wrapper methods, embedded methods, RFE, importance-based

Sklearn Pipelines

Pipeline, ColumnTransformer, custom transformers, reproducibility

02

The Scenario

HomeValue Analytics - House Price Prediction

You have been hired as a Data Scientist at HomeValue Analytics, a real estate analytics company. The team has a raw dataset of house sales but the initial model performance is poor. The Chief Data Officer has given you this challenge:

"Our raw data has missing values, mixed data types, and irrelevant features. We tried a basic model but got an R² of only 0.65. Can you build a feature engineering pipeline that transforms this messy data into something that gives us at least 0.85 R²?"

Your Task

Create a Jupyter Notebook called feature_engineering.ipynb that implements a complete feature engineering pipeline. Your code must clean the data, create meaningful features, select the most important ones, and demonstrate significant improvement in model performance.

03

The Dataset

Create a synthetic house sales dataset (house_prices.csv) with the following structure:

File: house_prices.csv (House Sales Data)

house_id,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15,neighborhood,garage_type,heating_type,price
1,3,1.5,1800,5650,1.0,0,0,3,7,1180,620,1955,,98178,47.5112,-122.257,1340,5650,Urban,Attached,Gas,221900
2,4,2.5,2570,7242,2.0,0,0,3,7,2170,400,1951,1991,98125,47.7210,-122.319,1690,7639,Suburban,Detached,Electric,538000
3,2,1.0,770,10000,1.0,0,0,3,6,770,0,1933,,98028,47.7379,-122.233,2720,8062,Rural,,Oil,180000
...
Columns Explained
  • house_id - Unique identifier (integer)
  • bedrooms - Number of bedrooms (integer)
  • bathrooms - Number of bathrooms (float)
  • sqft_living - Living area square footage (integer)
  • sqft_lot - Lot size square footage (integer)
  • floors - Number of floors (float)
  • waterfront - Waterfront property, 0/1 (binary)
  • view - View quality rating, 0-4 (ordinal)
  • condition - Overall condition, 1-5 (ordinal)
  • grade - Construction grade, 1-13 (ordinal)
  • sqft_above - Above ground sqft (integer)
  • sqft_basement - Basement sqft (integer)
  • yr_built - Year built (integer)
  • yr_renovated - Year renovated, blank if never (integer/null)
  • zipcode - ZIP code (categorical)
  • lat, long - Geographic coordinates (float)
  • neighborhood - Urban/Suburban/Rural (categorical)
  • garage_type - Attached/Detached/None (categorical with nulls)
  • heating_type - Gas/Electric/Oil (categorical)
  • price - Sale price in USD (target)
Dataset Requirements: Generate at least 5,000 houses with realistic distributions, intentional missing values (~5-10% in yr_renovated and garage_type), and some outliers. Use NumPy's random generator with a fixed seed so the dataset is reproducible.
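A minimal generator sketch (assuming pandas and NumPy are available; the price coefficients and 8% missingness rate are illustrative choices, and only a subset of the required columns is shown):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed -> reproducible dataset
n = 5000

# Illustrative subset of the required columns; extend with the rest analogously
df = pd.DataFrame({
    "house_id": np.arange(1, n + 1),
    "bedrooms": rng.integers(1, 6, n),
    "sqft_living": rng.integers(500, 5000, n),
    "yr_built": rng.integers(1900, 2023, n),
    "yr_renovated": rng.integers(1950, 2023, n).astype(float),
    "neighborhood": rng.choice(["Urban", "Suburban", "Rural"], n),
    "garage_type": rng.choice(["Attached", "Detached", "None"], n).astype(object),
})

# Inject ~8% missingness where the spec asks for it
for col in ["yr_renovated", "garage_type"]:
    df.loc[rng.random(n) < 0.08, col] = np.nan

# Price loosely driven by size and bedrooms, plus Gaussian noise
df["price"] = (200 * df["sqft_living"] + 10_000 * df["bedrooms"]
               + rng.normal(0, 25_000, n)).round()
df.to_csv("house_prices.csv", index=False)
```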
04

Requirements

Your feature_engineering.ipynb must implement ALL of the following functions. Each function is mandatory and will be tested individually.

1
Load and Explore Data

Create a function load_and_explore(filepath) that:

  • Loads the CSV file into a DataFrame
  • Prints shape, dtypes, and missing value counts
  • Identifies numerical vs categorical columns
  • Returns DataFrame and column type lists
def load_and_explore(filepath):
    """Load data and perform initial exploration."""
    # Return: df, numerical_cols, categorical_cols
    pass
2
Handle Missing Values

Create a function handle_missing_values(df, strategy='smart') that:

  • Implements multiple strategies: 'drop', 'mean', 'median', 'mode', 'smart'
  • 'smart' uses domain knowledge (e.g., yr_renovated=0 means never renovated)
  • Creates indicator columns for missingness (useful features)
  • Returns cleaned DataFrame with no missing values
def handle_missing_values(df, strategy='smart'):
    """Handle missing values with specified strategy."""
    # 'smart' strategy: yr_renovated NaN -> 0, garage_type NaN -> 'None'
    # Create: has_renovation, has_garage indicator columns
    pass
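One way the 'smart' branch could look. This is a sketch of the domain rules named in the comment above, not the only valid design; the 'drop'/'mean'/'mode' branches follow the same pattern as 'median':

```python
import numpy as np
import pandas as pd

def handle_missing_values(df, strategy="smart"):
    """Sketch: 'smart' applies the domain rules; 'median' shown as one fallback."""
    df = df.copy()
    if strategy == "smart":
        # Missingness itself is informative: keep it as indicator features
        df["has_renovation"] = df["yr_renovated"].notna().astype(int)
        df["has_garage"] = df["garage_type"].notna().astype(int)
        # Domain rule: missing yr_renovated means "never renovated"
        df["yr_renovated"] = df["yr_renovated"].fillna(0)
        # Domain rule: missing garage_type means "no garage"
        df["garage_type"] = df["garage_type"].fillna("None")
    elif strategy == "median":
        num = df.select_dtypes("number").columns
        df[num] = df[num].fillna(df[num].median())
    return df

demo = pd.DataFrame({
    "yr_renovated": [1991.0, np.nan, 2005.0],
    "garage_type": ["Attached", np.nan, "Detached"],
})
clean = handle_missing_values(demo)
```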
3
Encode Categorical Variables

Create a function encode_categoricals(df, method='auto') that:

  • Uses OneHotEncoder for nominal categories (neighborhood, heating_type)
  • Uses OrdinalEncoder for ordinal categories (view, condition, grade)
  • Uses TargetEncoder for high-cardinality (zipcode)
  • Returns encoded DataFrame and fitted encoders
def encode_categoricals(df, method='auto'):
    """Encode categorical variables appropriately."""
    # Nominal: OneHotEncoder
    # Ordinal: OrdinalEncoder with proper ordering
    # High-cardinality: TargetEncoder
    pass
4
Create Domain Features

Create a function create_domain_features(df) that:

  • Creates age = current_year - yr_built
  • Creates years_since_renovation = current_year - yr_renovated (if renovated)
  • Creates total_sqft = sqft_above + sqft_basement (do not create price_per_sqft: it is derived from the target and would leak it)
  • Creates bed_bath_ratio, living_lot_ratio
  • Returns DataFrame with new features
def create_domain_features(df):
    """Create domain-specific features."""
    # age, years_since_renovation, total_sqft
    # bed_bath_ratio, living_lot_ratio, has_basement
    pass
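A sketch of these domain features; the reference year 2024 is an assumption (you could also use `datetime.date.today().year`), and `yr_renovated == 0` follows the "smart" imputation rule above:

```python
import pandas as pd

CURRENT_YEAR = 2024  # assumed reference year

def create_domain_features(df):
    """Sketch of the domain features listed above."""
    df = df.copy()
    df["age"] = CURRENT_YEAR - df["yr_built"]
    # yr_renovated == 0 encodes "never renovated", so keep 0 there
    df["years_since_renovation"] = (CURRENT_YEAR - df["yr_renovated"]).where(
        df["yr_renovated"] > 0, 0)
    df["total_sqft"] = df["sqft_above"] + df["sqft_basement"]
    df["has_basement"] = (df["sqft_basement"] > 0).astype(int)
    df["bed_bath_ratio"] = df["bedrooms"] / df["bathrooms"]
    df["living_lot_ratio"] = df["sqft_living"] / df["sqft_lot"]
    return df

demo = pd.DataFrame({
    "yr_built": [1955], "yr_renovated": [0.0],
    "sqft_above": [1180], "sqft_basement": [620],
    "bedrooms": [3], "bathrooms": [1.5],
    "sqft_living": [1800], "sqft_lot": [5650],
})
feats = create_domain_features(demo)
```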
5
Create Interaction Features

Create a function create_interaction_features(df, pairs) that:

  • Creates multiplication interactions for specified pairs
  • Creates polynomial features (degree 2) for important numericals
  • Example: sqft_living × grade, bedrooms × bathrooms
  • Returns DataFrame with interaction features
def create_interaction_features(df, pairs):
    """Create interaction features between specified column pairs."""
    # pairs example: [('sqft_living', 'grade'), ('bedrooms', 'bathrooms')]
    pass
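A minimal sketch of pairwise interactions plus a degree-2 polynomial expansion via scikit-learn (column names and demo values are illustrative):

```python
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

def create_interaction_features(df, pairs):
    """Multiply each specified column pair into a new feature."""
    df = df.copy()
    for a, b in pairs:
        df[f"{a}_x_{b}"] = df[a] * df[b]
    return df

demo = pd.DataFrame({"sqft_living": [1800, 2570], "grade": [7, 8],
                     "bedrooms": [3, 4], "bathrooms": [1.5, 2.5]})
out = create_interaction_features(demo, [("sqft_living", "grade"),
                                         ("bedrooms", "bathrooms")])

# Degree-2 polynomial expansion of selected numeric columns:
# yields x1, x2, x1^2, x1*x2, x2^2
poly = PolynomialFeatures(degree=2, include_bias=False)
expanded = poly.fit_transform(demo[["sqft_living", "grade"]])
```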
6
Handle Outliers

Create a function handle_outliers(df, method='iqr', threshold=1.5, treatment='cap') that:

  • Implements IQR method and Z-score method
  • Provides options: 'remove', 'cap', 'log_transform'
  • Visualizes outliers before/after treatment
  • Returns treated DataFrame
def handle_outliers(df, method='iqr', threshold=1.5, treatment='cap'):
    """Detect and handle outliers."""
    # method: 'iqr' or 'zscore'
    # treatment: 'remove', 'cap', or 'log_transform'
    pass
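A sketch of the detection and treatment logic (the explicit `cols` argument is an addition for clarity, and the before/after visualization is omitted here):

```python
import pandas as pd

def handle_outliers(df, cols, method="iqr", threshold=1.5, treatment="cap"):
    """Detect outliers per column, then cap or remove them (plots omitted)."""
    df = df.copy()
    for col in cols:
        if method == "iqr":
            q1, q3 = df[col].quantile([0.25, 0.75])
            iqr = q3 - q1
            lo, hi = q1 - threshold * iqr, q3 + threshold * iqr
        else:  # 'zscore'
            mu, sd = df[col].mean(), df[col].std()
            lo, hi = mu - threshold * sd, mu + threshold * sd
        if treatment == "cap":
            df[col] = df[col].clip(lo, hi)  # winsorize to the bounds
        elif treatment == "remove":
            df = df[df[col].between(lo, hi)]
    return df

demo = pd.DataFrame({"sqft_lot": [1, 2, 3, 4, 100]})
capped = handle_outliers(demo, ["sqft_lot"])
```

Capping is usually safer than removal for price data, since extreme-but-real sales carry signal.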
7
Scale Features

Create a function scale_features(df, method='standard') that:

  • Implements StandardScaler, MinMaxScaler, RobustScaler
  • Only scales numerical columns
  • Returns scaled DataFrame and fitted scaler
def scale_features(df, method='standard'):
    """Scale numerical features."""
    # method: 'standard', 'minmax', 'robust'
    # Return: scaled_df, scaler
    pass
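A sketch using a scaler lookup table; only numeric columns are transformed, and the fitted scaler is returned so the identical transform can be applied to test data:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

SCALERS = {"standard": StandardScaler, "minmax": MinMaxScaler, "robust": RobustScaler}

def scale_features(df, method="standard"):
    """Scale numeric columns only; return the scaled frame and fitted scaler."""
    scaler = SCALERS[method]()
    num_cols = df.select_dtypes("number").columns
    scaled = df.copy()
    scaled[num_cols] = scaler.fit_transform(df[num_cols])
    return scaled, scaler

demo = pd.DataFrame({"sqft_living": [1000, 2000, 3000],
                     "neighborhood": ["Urban", "Rural", "Urban"]})
scaled, scaler = scale_features(demo, method="minmax")
```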
8
Select Features - Filter Methods

Create a function select_features_filter(X, y, method='correlation', k=10) that:

  • Implements correlation-based selection
  • Implements mutual information
  • Implements variance threshold
  • Returns selected features and scores
def select_features_filter(X, y, method='correlation', k=10):
    """Filter-based feature selection."""
    # method: 'correlation', 'mutual_info', 'variance'
    # Return: selected_features, scores
    pass
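A sketch of the correlation and mutual-information branches on synthetic data (the variance-threshold branch, which needs no `y`, is omitted; `VarianceThreshold` in sklearn.feature_selection covers it):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

def select_features_filter(X, y, method="correlation", k=10):
    """Score every feature against y, keep the top k."""
    if method == "correlation":
        scores = X.apply(lambda col: abs(np.corrcoef(col, y)[0, 1]))
    else:  # 'mutual_info'
        scores = pd.Series(mutual_info_regression(X, y, random_state=0),
                           index=X.columns)
    selected = scores.nlargest(min(k, len(scores))).index.tolist()
    return selected, scores

# Synthetic check: one informative feature, one pure noise feature
rng = np.random.default_rng(0)
y = rng.normal(size=200)
X = pd.DataFrame({"signal": y + rng.normal(scale=0.1, size=200),
                  "noise": rng.normal(size=200)})
selected, scores = select_features_filter(X, y, method="correlation", k=1)
```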
9
Select Features - Wrapper Methods

Create a function select_features_wrapper(X, y, method='rfe', n_features=10) that:

  • Implements Recursive Feature Elimination (RFE)
  • Implements Sequential Feature Selection (forward/backward)
  • Returns selected features and ranking
def select_features_wrapper(X, y, method='rfe', n_features=10):
    """Wrapper-based feature selection."""
    # method: 'rfe', 'forward', 'backward'
    # Return: selected_features, ranking
    pass
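A sketch covering both branches: RFE drops the weakest feature repeatedly, while SequentialFeatureSelector handles the 'forward'/'backward' directions (the base estimator, LinearRegression, is an illustrative choice):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE, SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

def select_features_wrapper(X, y, method="rfe", n_features=10):
    """RFE or sequential selection around a base estimator."""
    estimator = LinearRegression()
    if method == "rfe":
        selector = RFE(estimator, n_features_to_select=n_features).fit(X, y)
        ranking = pd.Series(selector.ranking_, index=X.columns)  # 1 = selected
    else:  # 'forward' or 'backward'
        selector = SequentialFeatureSelector(
            estimator, n_features_to_select=n_features, direction=method).fit(X, y)
        ranking = pd.Series(selector.get_support().astype(int), index=X.columns)
    selected = X.columns[selector.get_support()].tolist()
    return selected, ranking

# Synthetic check: y depends on x1 only
rng = np.random.default_rng(1)
X = pd.DataFrame({"x1": rng.normal(size=300), "x2": rng.normal(size=300)})
y = 3 * X["x1"] + rng.normal(scale=0.1, size=300)
selected, ranking = select_features_wrapper(X, y, n_features=1)
```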
10
Select Features - Embedded Methods

Create a function select_features_embedded(X, y, method='lasso') that:

  • Implements Lasso regularization-based selection
  • Implements tree-based feature importance (Random Forest)
  • Visualizes feature importance
  • Returns selected features and importance scores
def select_features_embedded(X, y, method='lasso'):
    """Embedded feature selection methods."""
    # method: 'lasso', 'random_forest', 'gradient_boosting'
    # Return: selected_features, importance_scores
    pass
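A sketch of the Lasso and random-forest branches (the feature-importance plot is omitted; gradient boosting follows the random-forest pattern with a different estimator):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV
from sklearn.ensemble import RandomForestRegressor

def select_features_embedded(X, y, method="lasso", k=10):
    """Importance from shrunken coefficients or tree splits (plot omitted)."""
    if method == "lasso":
        model = LassoCV(cv=5, random_state=0).fit(X, y)
        importance = pd.Series(np.abs(model.coef_), index=X.columns)
    else:  # 'random_forest'
        model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
        importance = pd.Series(model.feature_importances_, index=X.columns)
    selected = importance.nlargest(min(k, len(importance))).index.tolist()
    return selected, importance

# Synthetic check: Lasso should shrink the noise feature toward zero
rng = np.random.default_rng(2)
X = pd.DataFrame({"x1": rng.normal(size=300), "x2": rng.normal(size=300)})
y = 3 * X["x1"] + rng.normal(scale=0.1, size=300)
selected, importance = select_features_embedded(X, y, method="lasso", k=1)
```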
11
Build Sklearn Pipeline

Create a function build_pipeline(numerical_cols, categorical_cols) that:

  • Uses ColumnTransformer for different column types
  • Chains imputers, encoders, scalers in proper order
  • Creates a reproducible, production-ready pipeline
  • Returns the assembled pipeline (unfitted, ready to fit on training data only)
def build_pipeline(numerical_cols, categorical_cols):
    """Build sklearn Pipeline with ColumnTransformer."""
    # numerical: impute -> scale
    # categorical: impute -> encode
    # Return: Pipeline object
    pass
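A sketch of the comment's structure: per-type sub-pipelines wired into a ColumnTransformer, with a Ridge model appended so the whole thing fits and predicts in one call (the choice of imputers, encoders, and Ridge is illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import Ridge

def build_pipeline(numerical_cols, categorical_cols):
    """Numerical: impute -> scale. Categorical: impute -> one-hot."""
    numeric = Pipeline([("impute", SimpleImputer(strategy="median")),
                        ("scale", StandardScaler())])
    categorical = Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                            ("encode", OneHotEncoder(handle_unknown="ignore"))])
    preprocess = ColumnTransformer([("num", numeric, numerical_cols),
                                    ("cat", categorical, categorical_cols)])
    return Pipeline([("preprocess", preprocess), ("model", Ridge())])

# Tiny demo with missing values in both column types
X = pd.DataFrame({"sqft_living": [1000.0, 2000.0, np.nan, 1500.0],
                  "neighborhood": ["Urban", np.nan, "Rural", "Urban"]})
y = pd.Series([200_000, 400_000, 150_000, 300_000])
pipe = build_pipeline(["sqft_living"], ["neighborhood"])
pipe.fit(X, y)
```

Because imputation, encoding, and scaling are all fit inside the pipeline, cross-validation refits them on each training fold, which is exactly what prevents leakage.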
12
Compare Before/After Performance

Create a function compare_performance(X_raw, X_engineered, y) that:

  • Trains same model on raw vs engineered features
  • Uses cross-validation for fair comparison
  • Reports R², MAE, RMSE improvements
  • Visualizes improvement metrics
def compare_performance(X_raw, X_engineered, y):
    """Compare model performance before and after feature engineering."""
    # Train same model (e.g., Ridge) on both datasets
    # Report: R², MAE, RMSE before/after
    # Visualize: bar chart of improvements
    pass
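A sketch of the comparison (bar chart omitted). The synthetic check below uses a deliberately nonlinear target so that the engineered squared feature produces a large, predictable R² gain:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

def compare_performance(X_raw, X_engineered, y, cv=5):
    """Cross-validated R2/MAE/RMSE for the same model on both feature sets."""
    results = {}
    for name, X in [("raw", X_raw), ("engineered", X_engineered)]:
        results[name] = {
            "r2": cross_val_score(Ridge(), X, y, cv=cv, scoring="r2").mean(),
            "mae": -cross_val_score(Ridge(), X, y, cv=cv,
                                    scoring="neg_mean_absolute_error").mean(),
            "rmse": -cross_val_score(Ridge(), X, y, cv=cv,
                                     scoring="neg_root_mean_squared_error").mean(),
        }
    return results

rng = np.random.default_rng(3)
x = rng.uniform(-3, 3, 300)
y = pd.Series(x ** 2 + rng.normal(scale=0.3, size=300))
X_raw = pd.DataFrame({"x": x})
X_engineered = X_raw.assign(x_squared=x ** 2)  # the engineered feature
results = compare_performance(X_raw, X_engineered, y)
```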
13
Main Pipeline

Create a main() function that:

  • Loads and explores the raw data
  • Applies all feature engineering steps
  • Compares feature selection methods
  • Builds the final pipeline
  • Demonstrates performance improvement
def main():
    # Load data
    df, num_cols, cat_cols = load_and_explore("house_prices.csv")
    X_raw = df.drop(columns=["price", "house_id"])
    y = df["price"]
    
    # Apply feature engineering
    df = handle_missing_values(df, strategy='smart')
    df = create_domain_features(df)
    df = handle_outliers(df, method='iqr', treatment='cap')
    X_engineered = df.drop(columns=["price", "house_id"])
    
    # Build pipeline
    pipeline = build_pipeline(num_cols, cat_cols)
    
    # Compare performance
    compare_performance(X_raw, X_engineered, y)
    
    print("Feature engineering complete!")

if __name__ == "__main__":
    main()
05

Submission

Create a public GitHub repository with the exact name shown below:

Required Repository Name
house-price-feature-engineering
github.com/<your-username>/house-price-feature-engineering
Required Files
house-price-feature-engineering/
├── feature_engineering.ipynb  # Your Jupyter Notebook with ALL 13 functions
├── house_prices.csv           # Synthetic dataset (5,000+ houses)
├── feature_importance.png     # Feature importance visualization
├── correlation_heatmap.png    # Correlation matrix heatmap
├── before_after_comparison.png # Performance comparison chart
├── pipeline_diagram.png       # Visual representation of your pipeline
├── feature_report.txt         # Summary of engineered features
└── README.md                  # REQUIRED - see contents below
README.md Must Include:
  • Your full name and submission date
  • List of all engineered features with descriptions
  • Performance improvement metrics (before vs after R², MAE, RMSE)
  • Explanation of your feature selection strategy
  • Instructions to run your notebook
Do Include
  • All 13 functions implemented and working
  • At least 10 new engineered features
  • Multiple feature selection methods compared
  • Sklearn Pipeline using ColumnTransformer
  • Measurable performance improvement (≥0.10 R² gain)
  • README.md with all required sections
Do Not Include
  • Data leakage (fitting on test data)
  • Hardcoded feature selections without justification
  • Any .pyc or __pycache__ files
  • Virtual environment folders
  • Code that doesn't run without errors
  • Target variable in feature engineering
Important: Before submitting, run all cells in your notebook to ensure it executes without errors and generates all output files correctly!

06

Grading Rubric

Your assignment will be graded on the following criteria:

Criteria                  Points   Description
Function Implementation   70       All 13 functions correctly implemented with proper logic
Feature Creation          30       At least 10 meaningful features with domain justification
Feature Selection         30       Comparison of filter, wrapper, and embedded methods
Sklearn Pipeline          25       Proper use of Pipeline and ColumnTransformer
Performance Improvement   25       Demonstrated improvement in R² (≥0.10 gain required)
Visualizations            10       Clear feature importance and comparison visualizations
Code Quality              10       Docstrings, comments, clean organization
Total                     200

07

What You Will Practice

Feature Creation

Creating domain-specific features, interactions, and polynomial features that capture hidden patterns

Feature Selection

Filter, wrapper, and embedded methods - knowing when to use each and comparing their effectiveness

Sklearn Pipelines

Building reproducible, production-ready pipelines with ColumnTransformer and custom transformers

Data Cleaning

Strategic handling of missing values, outliers, and encoding - the foundation of good features

08

Pro Tips

Domain Knowledge is Key
  • Price per sqft (as a transformed target) is often easier to model than raw price
  • Age of house matters more than year built
  • Bathroom-to-bedroom ratio indicates luxury
  • Location features (lat/long) can be clustered
Avoid Data Leakage
  • Never use target for feature engineering
  • Fit scalers/encoders on train data only
  • Use pipelines to ensure proper ordering
  • Target encoding needs special handling
Feature Selection Strategy
  • Start with correlation analysis (filter)
  • Use RFE for final selection (wrapper)
  • Validate with tree-based importance (embedded)
  • Remove highly correlated features (>0.9)
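The last step above, dropping one feature from every highly correlated pair, can be sketched like this (threshold 0.9 as suggested; column names in the demo are illustrative):

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(X, threshold=0.9):
    """Drop one feature from every pair with |correlation| above threshold."""
    corr = X.corr().abs()
    # Upper triangle (k=1) so each pair is inspected exactly once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return X.drop(columns=to_drop)

rng = np.random.default_rng(4)
a = rng.normal(size=100)
X = pd.DataFrame({"sqft_living": a,
                  "sqft_above": 2 * a,  # perfectly correlated with sqft_living
                  "grade": rng.normal(size=100)})
reduced = drop_highly_correlated(X)
```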
Common Pitfalls
  • Creating too many features (overfitting)
  • Not handling multicollinearity
  • Forgetting to scale after feature creation
  • Using raw categorical IDs as features