Assignment Overview
In this assignment, you will build a complete Feature Engineering Pipeline for a machine learning project. This comprehensive project requires you to apply ALL concepts from Module 8: feature creation, categorical encoding, feature scaling, and feature selection to prepare raw data for ML models.
Feature Creation (8.1)
Polynomial features, datetime extraction, domain-specific features
Encoding (8.2)
One-hot, label, ordinal, and target encoding techniques
Scaling (8.3)
StandardScaler, MinMaxScaler, RobustScaler, normalization
Selection (8.4)
Variance threshold, correlation, mutual information, RFE
The Scenario
PropertyPredict Real Estate
You have been hired as a Machine Learning Engineer at PropertyPredict, a real estate analytics company. Your manager has assigned you this project:
"We have raw property listing data that needs to be transformed for our price prediction model. The data contains mixed types: numerical features, categorical variables, dates, and text. Your job is to build a robust feature engineering pipeline that can handle all these data types and produce ML-ready features. The pipeline must be reproducible and work on new data."
Your Task
Create a Jupyter notebook called feature_engineering.ipynb that implements a complete
feature engineering pipeline using scikit-learn. Your pipeline must transform raw property data
into features suitable for training a machine learning model.
The Dataset
You will work with real property listings data containing mixed data types: numerical features, categorical variables, dates, and boolean flags. Download the CSV file and explore it to understand the features you'll need to engineer.
Property Listings Dataset
Real estate data with mixed feature types for feature engineering practice
Requirements
Your feature_engineering.ipynb must implement ALL of the following tasks.
Each task is mandatory and will be tested individually.
Part 1: Feature Creation (50 points)
Extract DateTime Features
Create a function extract_date_features(df) that:
- Extracts year, month, day, day_of_week, quarter from listing_date
- Creates is_weekend boolean feature
- Creates days_since_listing (days from listing to today)
- Returns DataFrame with new date-based features
def extract_date_features(df):
"""Extract features from listing_date column."""
# Must create: year, month, day, day_of_week, quarter, is_weekend, days_since_listing
pass
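For reference, one possible shape of this function, built on pandas' `.dt` accessor (a sketch, not the only valid solution; `days_since_listing` will vary with the date you run it):

```python
import pandas as pd

def extract_date_features(df):
    """Extract features from the listing_date column (sketch)."""
    df = df.copy()
    dates = pd.to_datetime(df["listing_date"])
    df["year"] = dates.dt.year
    df["month"] = dates.dt.month
    df["day"] = dates.dt.day
    df["day_of_week"] = dates.dt.dayofweek          # Monday=0 ... Sunday=6
    df["quarter"] = dates.dt.quarter
    df["is_weekend"] = dates.dt.dayofweek >= 5      # Saturday or Sunday
    df["days_since_listing"] = (pd.Timestamp.today().normalize() - dates).dt.days
    return df

demo = pd.DataFrame({"listing_date": ["2024-01-06", "2024-03-15"]})
out = extract_date_features(demo)
print(out[["year", "quarter", "is_weekend"]])
```

Note that `dt.dayofweek` uses Monday=0, so the weekend check is `>= 5`.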
Create Domain Features
Create a function create_domain_features(df) that:
- Creates price_per_sqft = price / square_feet
- Creates property_age = 2024 - year_built
- Creates bed_bath_ratio = bedrooms / bathrooms (guard against division by zero)
- Creates total_rooms = bedrooms + bathrooms
- Creates is_new_construction = True if year_built >= 2020
def create_domain_features(df):
"""Create domain-specific engineered features."""
# Must create meaningful combinations of existing features
pass
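A sketch of the expected behavior, using the column names from the requirements above (the `NaN` guard on `bathrooms` is an extra precaution, not a stated requirement):

```python
import numpy as np
import pandas as pd

def create_domain_features(df):
    """Create domain-specific engineered features (sketch)."""
    df = df.copy()
    df["price_per_sqft"] = df["price"] / df["square_feet"]
    df["property_age"] = 2024 - df["year_built"]
    # Guard against division by zero if a listing has 0 bathrooms.
    df["bed_bath_ratio"] = df["bedrooms"] / df["bathrooms"].replace(0, np.nan)
    df["total_rooms"] = df["bedrooms"] + df["bathrooms"]
    df["is_new_construction"] = df["year_built"] >= 2020
    return df

demo = pd.DataFrame({
    "price": [300000], "square_feet": [1500],
    "bedrooms": [3], "bathrooms": [2], "year_built": [2021],
})
out = create_domain_features(demo)
print(out[["price_per_sqft", "property_age", "is_new_construction"]])
```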
Create Polynomial Features
Create a function create_polynomial_features(df, columns, degree=2) that:
- Uses sklearn PolynomialFeatures on specified numerical columns
- Creates interaction terms and polynomial terms
- Returns DataFrame with new polynomial features added
def create_polynomial_features(df, columns, degree=2):
"""Create polynomial and interaction features."""
# Must use: sklearn.preprocessing.PolynomialFeatures
pass
Create Binned Features
Create a function create_binned_features(df) that:
- Bins square_feet into categories: 'Small', 'Medium', 'Large', 'Luxury'
- Bins property_age into: 'New', 'Modern', 'Established', 'Historic'
- Uses pd.cut() with appropriate bin edges
def create_binned_features(df):
"""Create binned categorical features from numerical columns."""
# Must use: pd.cut() or pd.qcut()
pass
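A sketch with `pd.cut()`; the bin edges here are illustrative assumptions — choose edges that make sense once you have explored the actual dataset:

```python
import pandas as pd

def create_binned_features(df):
    """Create binned categorical features (sketch; bin edges are illustrative)."""
    df = df.copy()
    df["size_category"] = pd.cut(
        df["square_feet"],
        bins=[0, 1000, 2000, 3500, float("inf")],
        labels=["Small", "Medium", "Large", "Luxury"],
    )
    df["age_category"] = pd.cut(
        df["property_age"],
        bins=[-1, 5, 20, 50, float("inf")],
        labels=["New", "Modern", "Established", "Historic"],
    )
    return df

demo = pd.DataFrame({"square_feet": [800, 4000], "property_age": [2, 75]})
out = create_binned_features(demo)
print(out[["size_category", "age_category"]])
```

`pd.cut()` uses fixed edges; `pd.qcut()` would instead put an equal share of listings in each bin.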
Part 2: Categorical Encoding (50 points)
One-Hot Encoding
Create a function apply_onehot_encoding(df, columns) that:
- Uses sklearn OneHotEncoder on specified categorical columns
- Handles unknown categories gracefully (ignore or use a default)
- Returns DataFrame with encoded columns
def apply_onehot_encoding(df, columns):
"""Apply one-hot encoding to categorical columns."""
# Must use: sklearn.preprocessing.OneHotEncoder
pass
Ordinal Encoding
Create a function apply_ordinal_encoding(df, column, order) that:
- Uses sklearn OrdinalEncoder with specified category order
- Applies it to the 'condition' column with order: Poor < Fair < Good < Excellent
- Returns encoded column values
def apply_ordinal_encoding(df, column, order):
"""Apply ordinal encoding with specified category order."""
# Must use: sklearn.preprocessing.OrdinalEncoder
pass
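A minimal sketch: passing the explicit category order to `OrdinalEncoder` guarantees Poor=0 through Excellent=3, rather than whatever alphabetical order would give:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

def apply_ordinal_encoding(df, column, order):
    """Encode an ordered categorical column as integers (sketch)."""
    encoder = OrdinalEncoder(categories=[order])
    return encoder.fit_transform(df[[column]]).ravel()

demo = pd.DataFrame({"condition": ["Good", "Poor", "Excellent"]})
codes = apply_ordinal_encoding(
    demo, "condition", ["Poor", "Fair", "Good", "Excellent"]
)
print(codes)  # Poor=0, Fair=1, Good=2, Excellent=3
```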
Label Encoding
Create a function apply_label_encoding(df, columns) that:
- Uses sklearn LabelEncoder on specified columns
- Stores the encoder for inverse transformation
- Returns encoded DataFrame and encoder dictionary
def apply_label_encoding(df, columns):
"""Apply label encoding to categorical columns."""
# Must use: sklearn.preprocessing.LabelEncoder
pass
Target Encoding
Create a function apply_target_encoding(df, column, target) that:
- Replaces categories with mean of target variable
- Handles train/test split properly (fit on train, transform both)
- Returns encoded column values
def apply_target_encoding(df, column, target):
"""Apply target encoding (mean encoding) to a categorical column."""
# Calculate mean of target for each category
pass
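The core of target encoding fits in a few lines; this sketch shows the fit step only (the `neighborhood` column is hypothetical). In your notebook you must compute the means on the training split and then map them onto both splits:

```python
import pandas as pd

def apply_target_encoding(df, column, target):
    """Replace each category with the mean of the target (sketch; fit on train only)."""
    means = df.groupby(column)[target].mean()
    return df[column].map(means)

demo = pd.DataFrame({
    "neighborhood": ["A", "A", "B", "B"],
    "price": [100, 200, 300, 500],
})
encoded = apply_target_encoding(demo, "neighborhood", "price")
print(encoded.tolist())  # [150.0, 150.0, 400.0, 400.0]
```

Categories that appear in the test split but not in training will map to `NaN`, so decide on a fallback (e.g. the global target mean).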
Part 3: Feature Scaling (50 points)
Standard Scaling
Create a function apply_standard_scaling(df, columns) that:
- Uses sklearn StandardScaler (z-score normalization)
- Fits on training data and transforms both train and test
- Returns scaled DataFrame and fitted scaler
def apply_standard_scaling(df, columns):
"""Apply standard scaling (z-score normalization)."""
# Must use: sklearn.preprocessing.StandardScaler
pass
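A sketch of the scaling pattern used throughout Part 3 — transform in place and hand back the fitted scaler so it can be applied to the test split later:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

def apply_standard_scaling(df, columns):
    """Z-score the given columns; return the scaled frame and fitted scaler (sketch)."""
    df = df.copy()
    scaler = StandardScaler()
    df[columns] = scaler.fit_transform(df[columns])
    return df, scaler

demo = pd.DataFrame({"price": [100.0, 200.0, 300.0]})
scaled, scaler = apply_standard_scaling(demo, ["price"])
print(scaled["price"].tolist())  # mean 0, unit variance
```

On the test split, call only `scaler.transform(...)` — refitting there is the data-leakage mistake flagged in the Pro Tips below.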
MinMax Scaling
Create a function apply_minmax_scaling(df, columns) that:
- Uses sklearn MinMaxScaler to scale features to [0, 1] range
- Returns scaled DataFrame and fitted scaler
def apply_minmax_scaling(df, columns):
"""Apply min-max scaling to [0, 1] range."""
# Must use: sklearn.preprocessing.MinMaxScaler
pass
Robust Scaling
Create a function apply_robust_scaling(df, columns) that:
- Uses sklearn RobustScaler (uses median and IQR, resistant to outliers)
- Applies it to columns with potential outliers
- Returns scaled DataFrame and fitted scaler
def apply_robust_scaling(df, columns):
"""Apply robust scaling using median and IQR."""
# Must use: sklearn.preprocessing.RobustScaler
pass
Log Transformation
Create a function apply_log_transform(df, columns) that:
- Applies log1p transformation (log(1 + x)) to handle skewed distributions
- Useful for price and square_feet which are right-skewed
- Returns transformed DataFrame
def apply_log_transform(df, columns):
"""Apply log transformation for skewed features."""
# Must use: np.log1p()
pass
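This one is a one-liner around `np.log1p`, sketched here for completeness; `log1p` is preferred over plain `log` because it is defined at 0:

```python
import numpy as np
import pandas as pd

def apply_log_transform(df, columns):
    """Apply log1p to right-skewed columns (sketch)."""
    df = df.copy()
    df[columns] = np.log1p(df[columns])
    return df

demo = pd.DataFrame({"price": [0.0, 99999.0]})
out = apply_log_transform(demo, ["price"])
print(out["price"].tolist())  # log1p(0) = 0, log1p(99999) = ln(100000)
```

`np.expm1()` is the exact inverse if you ever need to map predictions back to the original scale.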
Part 4: Feature Selection (50 points)
Variance Threshold
Create a function select_by_variance(df, threshold=0.01) that:
- Uses sklearn VarianceThreshold to remove low-variance features
- Returns list of features that pass the threshold
def select_by_variance(df, threshold=0.01):
"""Remove features with variance below threshold."""
# Must use: sklearn.feature_selection.VarianceThreshold
pass
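A sketch using `get_support()` to map the selector's boolean mask back to column names:

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

def select_by_variance(df, threshold=0.01):
    """Return the names of features whose variance exceeds the threshold (sketch)."""
    selector = VarianceThreshold(threshold=threshold)
    selector.fit(df)
    return df.columns[selector.get_support()].tolist()

demo = pd.DataFrame({"constant": [1, 1, 1], "varying": [1, 5, 9]})
print(select_by_variance(demo))  # ['varying'] — the constant column is dropped
```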
Correlation Analysis
Create a function remove_correlated_features(df, threshold=0.9) that:
- Calculates correlation matrix for numerical features
- Identifies highly correlated pairs (above threshold)
- Removes one feature from each correlated pair
- Returns DataFrame with reduced features
def remove_correlated_features(df, threshold=0.9):
"""Remove highly correlated features."""
# Calculate correlation matrix and remove redundant features
pass
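One common approach is to scan the upper triangle of the correlation matrix so each pair is considered exactly once (a sketch; the `square_meters` column is a contrived duplicate for the demo):

```python
import numpy as np
import pandas as pd

def remove_correlated_features(df, threshold=0.9):
    """Drop one feature from every highly correlated pair (sketch)."""
    corr = df.corr().abs()
    # Mask the diagonal and lower triangle so each pair appears once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

demo = pd.DataFrame({
    "square_feet": [1000, 1500, 2000],
    "square_meters": [93, 139, 186],   # nearly the same information
    "bedrooms": [2, 3, 2],
})
out = remove_correlated_features(demo)
print(out.columns.tolist())
```

Which member of a correlated pair gets dropped is arbitrary here; a refinement is to keep the one more correlated with the target.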
Mutual Information
Create a function select_by_mutual_info(X, y, k=10) that:
- Uses sklearn mutual_info_regression for regression target
- Selects top k features with highest mutual information
- Returns selected feature names and scores
def select_by_mutual_info(X, y, k=10):
"""Select top k features by mutual information."""
# Must use: sklearn.feature_selection.mutual_info_regression
pass
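A sketch pairing the scores with column names so the result is readable; the synthetic `signal`/`noise` data is only there to show that the informative feature ranks first:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

def select_by_mutual_info(X, y, k=10):
    """Rank features by mutual information with the target; keep the top k (sketch)."""
    scores = mutual_info_regression(X, y, random_state=0)
    ranking = pd.Series(scores, index=X.columns).sort_values(ascending=False)
    return ranking.head(k).index.tolist(), ranking

rng = np.random.default_rng(0)
X = pd.DataFrame({"signal": rng.normal(size=200), "noise": rng.normal(size=200)})
y = 3 * X["signal"] + rng.normal(scale=0.1, size=200)
selected, ranking = select_by_mutual_info(X, y, k=1)
print(selected)  # the informative feature should win
```

Unlike correlation, mutual information also picks up non-linear relationships, which is why it complements the correlation check above rather than replacing it.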
Recursive Feature Elimination
Create a function select_by_rfe(X, y, n_features=10) that:
- Uses sklearn RFE with a base estimator (e.g., RandomForest)
- Recursively eliminates least important features
- Returns selected feature names and ranking
def select_by_rfe(X, y, n_features=10):
"""Select features using Recursive Feature Elimination."""
# Must use: sklearn.feature_selection.RFE
pass
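A sketch with a small random forest as the base estimator (any estimator exposing `feature_importances_` or `coef_` works); the demo data is synthetic:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

def select_by_rfe(X, y, n_features=10):
    """Recursively eliminate the weakest features (sketch)."""
    estimator = RandomForestRegressor(n_estimators=50, random_state=0)
    rfe = RFE(estimator, n_features_to_select=n_features)
    rfe.fit(X, y)
    selected = X.columns[rfe.support_].tolist()
    ranking = dict(zip(X.columns, rfe.ranking_))  # 1 = selected
    return selected, ranking

rng = np.random.default_rng(0)
X = pd.DataFrame({"signal": rng.normal(size=150), "noise": rng.normal(size=150)})
y = 2 * X["signal"] + rng.normal(scale=0.1, size=150)
selected, ranking = select_by_rfe(X, y, n_features=1)
print(selected, ranking)
```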
Part 5: Complete Pipeline (Bonus - 30 points)
Build Complete Pipeline
Create a function build_feature_pipeline() that:
- Uses sklearn Pipeline and ColumnTransformer
- Handles numerical and categorical features separately
- Chains imputation, encoding, scaling, and selection
- Returns a fitted pipeline that can transform new data
def build_feature_pipeline():
"""Build a complete feature engineering pipeline."""
# Must use: sklearn.pipeline.Pipeline and ColumnTransformer
# Chain: SimpleImputer -> Encoders -> Scalers
pass
Demonstrate Pipeline Usage
Create cells that demonstrate:
- Splitting data into train/test sets
- Fitting pipeline on training data only
- Transforming both train and test data
- Showing the final feature matrix ready for ML
Submission Instructions
Submit your completed assignment via GitHub following these instructions:
Create Jupyter Notebook
Create a single notebook called feature_engineering.ipynb containing all functions listed above.
- Organize your notebook with clear markdown headers for each part
- Each function must have a docstring explaining what it does
- Include test cells that demonstrate each function working
- Add markdown cells explaining your approach
Include Test Demonstrations
In your notebook, add cells that:
- Generate the property dataset
- Call each of your functions with the data
- Print results showing the transformations
- Demonstrate the complete pipeline on train/test split
Create README
Create README.md that includes:
- Your name and assignment title
- Instructions to run your code
- List of all functions with brief descriptions
- Any challenges you faced and how you solved them
Create requirements.txt
numpy>=1.24.0
pandas>=2.0.0
scikit-learn>=1.3.0
Repository Structure
Your GitHub repository should look like this:
propertypredict-feature-engineering/
├── README.md
├── requirements.txt
└── feature_engineering.ipynb # All functions with test demonstrations
Submit via Form
Once your repository is ready:
- Make sure your repository is public or shared with your instructor
- Click the "Submit Assignment" button below
- Fill in the submission form with your GitHub repository URL
Grading Rubric
Your assignment will be graded on the following criteria:
| Criteria | Points | Description |
|---|---|---|
| Feature Creation | 50 | DateTime extraction, domain features, polynomial features, binning |
| Categorical Encoding | 50 | One-hot, ordinal, label, and target encoding implementations |
| Feature Scaling | 50 | Standard, MinMax, Robust scaling and log transformation |
| Feature Selection | 50 | Variance threshold, correlation, mutual info, RFE |
| Complete Pipeline (Bonus) | 30 | Scikit-learn Pipeline with ColumnTransformer |
| Total | 200 (+30 bonus) | |
Ready to Submit?
Make sure you have completed all requirements and reviewed the grading rubric above.
What You Will Practice
Feature Creation (8.1)
DateTime extraction, domain-specific features, polynomial features, and binning continuous variables
Categorical Encoding (8.2)
One-hot encoding, ordinal encoding, label encoding, and target/mean encoding for ML models
Feature Scaling (8.3)
StandardScaler, MinMaxScaler, RobustScaler, and log transformations for different scenarios
Feature Selection (8.4)
Variance threshold, correlation analysis, mutual information, and recursive feature elimination
Pro Tips
Encoding Best Practices
- Use one-hot for low cardinality nominal features
- Use ordinal encoding for ordered categories
- Use target encoding for high cardinality features
- Always fit encoders on training data only
Scaling Guidelines
- StandardScaler for normally distributed data
- MinMaxScaler when bounded range is needed
- RobustScaler when outliers are present
- Always scale after train/test split
Time Management
- Start with feature creation (most creative part)
- Test each function independently first
- Build the pipeline incrementally
- Leave time for testing the complete workflow
Common Mistakes
- Data leakage: fitting on test data
- Forgetting to handle missing values first
- Not storing fitted transformers for new data
- Encoding target variable (do not encode y!)