Assignment Overview
In this assignment, you will build a complete Feature Engineering Pipeline for a machine learning project. This comprehensive project requires you to apply ALL concepts from Module 8: feature creation, categorical encoding, feature scaling, and feature selection to prepare raw data for ML models.
Feature Creation (8.1)
Polynomial features, datetime extraction, domain-specific features
Encoding (8.2)
One-hot, label, ordinal, and target encoding techniques
Scaling (8.3)
StandardScaler, MinMaxScaler, RobustScaler, normalization
Selection (8.4)
Variance threshold, correlation, mutual information, RFE
The Scenario
PropertyPredict Real Estate
You have been hired as a Machine Learning Engineer at PropertyPredict, a real estate analytics company. Your manager has assigned you this project:
"We have raw property listing data that needs to be transformed for our price prediction model. The data contains mixed types: numerical features, categorical variables, dates, and text. Your job is to build a robust feature engineering pipeline that can handle all these data types and produce ML-ready features. The pipeline must be reproducible and work on new data."
Your Task
Create a Jupyter notebook called feature_engineering.ipynb that implements a complete
feature engineering pipeline using scikit-learn. Your pipeline must transform raw property data
into features suitable for training a machine learning model.
The Dataset
You will work with real property listings data containing mixed data types: numerical features, categorical variables, dates, and boolean flags. Download the CSV file and explore it to understand the features you'll need to engineer.
Property Listings Dataset
Real estate data with mixed feature types for feature engineering practice
Requirements
Your feature_engineering.ipynb must implement ALL of the following tasks.
Each task is mandatory and will be tested individually.
Part 1: Feature Creation (50 points)
Extract DateTime Features
Create a function extract_date_features(df) that:
- Extracts year, month, day, day_of_week, quarter from listing_date
- Creates is_weekend boolean feature
- Creates days_since_listing (days from listing to today)
- Returns DataFrame with new date-based features
def extract_date_features(df):
"""Extract features from listing_date column."""
# Must create: year, month, day, day_of_week, quarter, is_weekend, days_since_listing
pass
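For reference, one possible shape of this function, built on pandas' `.dt` accessor (a sketch, not the only valid solution; `days_since_listing` will vary with the date you run it):

```python
import pandas as pd

def extract_date_features(df):
    """Extract features from the listing_date column (sketch)."""
    df = df.copy()
    dates = pd.to_datetime(df["listing_date"])
    df["year"] = dates.dt.year
    df["month"] = dates.dt.month
    df["day"] = dates.dt.day
    df["day_of_week"] = dates.dt.dayofweek          # Monday=0 ... Sunday=6
    df["quarter"] = dates.dt.quarter
    df["is_weekend"] = dates.dt.dayofweek >= 5      # Saturday or Sunday
    df["days_since_listing"] = (pd.Timestamp.today().normalize() - dates).dt.days
    return df

demo = pd.DataFrame({"listing_date": ["2024-01-06", "2024-03-15"]})
out = extract_date_features(demo)
print(out[["year", "quarter", "is_weekend"]])
```

Note that `dt.dayofweek` uses Monday=0, so the weekend check is `>= 5`.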
Create Domain Features
Create a function create_domain_features(df) that:
- Creates price_per_sqft = price / square_feet
- Creates property_age = 2024 - year_built
- Creates bed_bath_ratio = bedrooms / bathrooms (guard against division by zero)
- Creates total_rooms = bedrooms + bathrooms
- Creates is_new_construction = True if year_built >= 2020
def create_domain_features(df):
"""Create domain-specific engineered features."""
# Must create meaningful combinations of existing features
pass
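A sketch of the expected behavior, using the column names from the requirements above (the `NaN` guard on `bathrooms` is an extra precaution, not a stated requirement):

```python
import numpy as np
import pandas as pd

def create_domain_features(df):
    """Create domain-specific engineered features (sketch)."""
    df = df.copy()
    df["price_per_sqft"] = df["price"] / df["square_feet"]
    df["property_age"] = 2024 - df["year_built"]
    # Guard against division by zero if a listing has 0 bathrooms.
    df["bed_bath_ratio"] = df["bedrooms"] / df["bathrooms"].replace(0, np.nan)
    df["total_rooms"] = df["bedrooms"] + df["bathrooms"]
    df["is_new_construction"] = df["year_built"] >= 2020
    return df

demo = pd.DataFrame({
    "price": [300000], "square_feet": [1500],
    "bedrooms": [3], "bathrooms": [2], "year_built": [2021],
})
out = create_domain_features(demo)
print(out[["price_per_sqft", "property_age", "is_new_construction"]])
```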
Create Polynomial Features
Create a function create_polynomial_features(df, columns, degree=2) that:
- Uses sklearn PolynomialFeatures on specified numerical columns
- Creates interaction terms and polynomial terms
- Returns DataFrame with new polynomial features added
def create_polynomial_features(df, columns, degree=2):
"""Create polynomial and interaction features."""
# Must use: sklearn.preprocessing.PolynomialFeatures
pass
Create Binned Features
Create a function create_binned_features(df) that:
- Bins square_feet into categories: 'Small', 'Medium', 'Large', 'Luxury'
- Bins property_age into: 'New', 'Modern', 'Established', 'Historic'
- Uses pd.cut() with appropriate bin edges
def create_binned_features(df):
"""Create binned categorical features from numerical columns."""
# Must use: pd.cut() or pd.qcut()
pass
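A sketch with `pd.cut()`; the bin edges here are illustrative assumptions — choose edges that make sense once you have explored the actual dataset:

```python
import pandas as pd

def create_binned_features(df):
    """Create binned categorical features (sketch; bin edges are illustrative)."""
    df = df.copy()
    df["size_category"] = pd.cut(
        df["square_feet"],
        bins=[0, 1000, 2000, 3500, float("inf")],
        labels=["Small", "Medium", "Large", "Luxury"],
    )
    df["age_category"] = pd.cut(
        df["property_age"],
        bins=[-1, 5, 20, 50, float("inf")],
        labels=["New", "Modern", "Established", "Historic"],
    )
    return df

demo = pd.DataFrame({"square_feet": [800, 4000], "property_age": [2, 75]})
out = create_binned_features(demo)
print(out[["size_category", "age_category"]])
```

`pd.cut()` uses fixed edges; `pd.qcut()` would instead put an equal share of listings in each bin.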
Part 2: Categorical Encoding (50 points)
One-Hot Encoding
Create a function apply_onehot_encoding(df, columns) that:
- Uses sklearn OneHotEncoder on specified categorical columns
- Handles unknown categories gracefully (ignore or use a default)
- Returns DataFrame with encoded columns
def apply_onehot_encoding(df, columns):
"""Apply one-hot encoding to categorical columns."""
# Must use: sklearn.preprocessing.OneHotEncoder
pass
Ordinal Encoding
Create a function apply_ordinal_encoding(df, column, order) that:
- Uses sklearn OrdinalEncoder with specified category order
- Applies it to the 'condition' column with order: Poor < Fair < Good < Excellent
- Returns encoded column values
def apply_ordinal_encoding(df, column, order):
"""Apply ordinal encoding with specified category order."""
# Must use: sklearn.preprocessing.OrdinalEncoder
pass
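A minimal sketch: passing the explicit category order to `OrdinalEncoder` guarantees Poor=0 through Excellent=3, rather than whatever alphabetical order would give:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

def apply_ordinal_encoding(df, column, order):
    """Encode an ordered categorical column as integers (sketch)."""
    encoder = OrdinalEncoder(categories=[order])
    return encoder.fit_transform(df[[column]]).ravel()

demo = pd.DataFrame({"condition": ["Good", "Poor", "Excellent"]})
codes = apply_ordinal_encoding(
    demo, "condition", ["Poor", "Fair", "Good", "Excellent"]
)
print(codes)  # Poor=0, Fair=1, Good=2, Excellent=3
```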
Label Encoding
Create a function apply_label_encoding(df, columns) that:
- Uses sklearn LabelEncoder on specified columns
- Stores the encoder for inverse transformation
- Returns encoded DataFrame and encoder dictionary
def apply_label_encoding(df, columns):
"""Apply label encoding to categorical columns."""
# Must use: sklearn.preprocessing.LabelEncoder
pass
Target Encoding
Create a function apply_target_encoding(df, column, target) that:
- Replaces categories with mean of target variable
- Handles train/test split properly (fit on train, transform both)
- Returns encoded column values
def apply_target_encoding(df, column, target):
"""Apply target encoding (mean encoding) to a categorical column."""
# Calculate mean of target for each category
pass
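The core of target encoding fits in a few lines; this sketch shows the fit step only (the `neighborhood` column is hypothetical). In your notebook you must compute the means on the training split and then map them onto both splits:

```python
import pandas as pd

def apply_target_encoding(df, column, target):
    """Replace each category with the mean of the target (sketch; fit on train only)."""
    means = df.groupby(column)[target].mean()
    return df[column].map(means)

demo = pd.DataFrame({
    "neighborhood": ["A", "A", "B", "B"],
    "price": [100, 200, 300, 500],
})
encoded = apply_target_encoding(demo, "neighborhood", "price")
print(encoded.tolist())  # [150.0, 150.0, 400.0, 400.0]
```

Categories that appear in the test split but not in training will map to `NaN`, so decide on a fallback (e.g. the global target mean).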
Part 3: Feature Scaling (50 points)
Standard Scaling
Create a function apply_standard_scaling(df, columns) that:
- Uses sklearn StandardScaler (z-score normalization)
- Fits on training data and transforms both train and test
- Returns scaled DataFrame and fitted scaler
def apply_standard_scaling(df, columns):
"""Apply standard scaling (z-score normalization)."""
# Must use: sklearn.preprocessing.StandardScaler
pass
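A sketch of the scaling pattern used throughout Part 3 — transform in place and hand back the fitted scaler so it can be applied to the test split later:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

def apply_standard_scaling(df, columns):
    """Z-score the given columns; return the scaled frame and fitted scaler (sketch)."""
    df = df.copy()
    scaler = StandardScaler()
    df[columns] = scaler.fit_transform(df[columns])
    return df, scaler

demo = pd.DataFrame({"price": [100.0, 200.0, 300.0]})
scaled, scaler = apply_standard_scaling(demo, ["price"])
print(scaled["price"].tolist())  # mean 0, unit variance
```

On the test split, call only `scaler.transform(...)` — refitting there is the data-leakage mistake flagged in the Pro Tips below.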
MinMax Scaling
Create a function apply_minmax_scaling(df, columns) that:
- Uses sklearn MinMaxScaler to scale features to [0, 1] range
- Returns scaled DataFrame and fitted scaler
def apply_minmax_scaling(df, columns):
"""Apply min-max scaling to [0, 1] range."""
# Must use: sklearn.preprocessing.MinMaxScaler
pass
Robust Scaling
Create a function apply_robust_scaling(df, columns) that:
- Uses sklearn RobustScaler (uses median and IQR, resistant to outliers)
- Applies it to columns with potential outliers
- Returns scaled DataFrame and fitted scaler
def apply_robust_scaling(df, columns):
"""Apply robust scaling using median and IQR."""
# Must use: sklearn.preprocessing.RobustScaler
pass
Log Transformation
Create a function apply_log_transform(df, columns) that:
- Applies log1p transformation (log(1 + x)) to handle skewed distributions
- Useful for price and square_feet which are right-skewed
- Returns transformed DataFrame
def apply_log_transform(df, columns):
"""Apply log transformation for skewed features."""
# Must use: np.log1p()
pass
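This one is a one-liner around `np.log1p`, sketched here for completeness; `log1p` is preferred over plain `log` because it is defined at 0:

```python
import numpy as np
import pandas as pd

def apply_log_transform(df, columns):
    """Apply log1p to right-skewed columns (sketch)."""
    df = df.copy()
    df[columns] = np.log1p(df[columns])
    return df

demo = pd.DataFrame({"price": [0.0, 99999.0]})
out = apply_log_transform(demo, ["price"])
print(out["price"].tolist())  # log1p(0) = 0, log1p(99999) = ln(100000)
```

`np.expm1()` is the exact inverse if you ever need to map predictions back to the original scale.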
Part 4: Feature Selection (50 points)
Variance Threshold
Create a function select_by_variance(df, threshold=0.01) that:
- Uses sklearn VarianceThreshold to remove low-variance features
- Returns list of features that pass the threshold
def select_by_variance(df, threshold=0.01):
"""Remove features with variance below threshold."""
# Must use: sklearn.feature_selection.VarianceThreshold
pass
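A sketch using `get_support()` to map the selector's boolean mask back to column names:

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

def select_by_variance(df, threshold=0.01):
    """Return the names of features whose variance exceeds the threshold (sketch)."""
    selector = VarianceThreshold(threshold=threshold)
    selector.fit(df)
    return df.columns[selector.get_support()].tolist()

demo = pd.DataFrame({"constant": [1, 1, 1], "varying": [1, 5, 9]})
print(select_by_variance(demo))  # ['varying'] — the constant column is dropped
```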
Correlation Analysis
Create a function remove_correlated_features(df, threshold=0.9) that:
- Calculates correlation matrix for numerical features
- Identifies highly correlated pairs (above threshold)
- Removes one feature from each correlated pair
- Returns DataFrame with reduced features
def remove_correlated_features(df, threshold=0.9):
"""Remove highly correlated features."""
# Calculate correlation matrix and remove redundant features
pass
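One common approach is to scan the upper triangle of the correlation matrix so each pair is considered exactly once (a sketch; the `square_meters` column is a contrived duplicate for the demo):

```python
import numpy as np
import pandas as pd

def remove_correlated_features(df, threshold=0.9):
    """Drop one feature from every highly correlated pair (sketch)."""
    corr = df.corr().abs()
    # Mask the diagonal and lower triangle so each pair appears once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

demo = pd.DataFrame({
    "square_feet": [1000, 1500, 2000],
    "square_meters": [93, 139, 186],   # nearly the same information
    "bedrooms": [2, 3, 2],
})
out = remove_correlated_features(demo)
print(out.columns.tolist())
```

Which member of a correlated pair gets dropped is arbitrary here; a refinement is to keep the one more correlated with the target.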
Mutual Information
Create a function select_by_mutual_info(X, y, k=10) that:
- Uses sklearn mutual_info_regression for regression target
- Selects top k features with highest mutual information
- Returns selected feature names and scores
def select_by_mutual_info(X, y, k=10):
"""Select top k features by mutual information."""
# Must use: sklearn.feature_selection.mutual_info_regression
pass
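A sketch pairing the scores with column names so the result is readable; the synthetic `signal`/`noise` data is only there to show that the informative feature ranks first:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

def select_by_mutual_info(X, y, k=10):
    """Rank features by mutual information with the target; keep the top k (sketch)."""
    scores = mutual_info_regression(X, y, random_state=0)
    ranking = pd.Series(scores, index=X.columns).sort_values(ascending=False)
    return ranking.head(k).index.tolist(), ranking

rng = np.random.default_rng(0)
X = pd.DataFrame({"signal": rng.normal(size=200), "noise": rng.normal(size=200)})
y = 3 * X["signal"] + rng.normal(scale=0.1, size=200)
selected, ranking = select_by_mutual_info(X, y, k=1)
print(selected)  # the informative feature should win
```

Unlike correlation, mutual information also picks up non-linear relationships, which is why it complements the correlation check above rather than replacing it.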
Recursive Feature Elimination
Create a function select_by_rfe(X, y, n_features=10) that:
- Uses sklearn RFE with a base estimator (e.g., RandomForest)
- Recursively eliminates least important features
- Returns selected feature names and ranking
def select_by_rfe(X, y, n_features=10):
"""Select features using Recursive Feature Elimination."""
# Must use: sklearn.feature_selection.RFE
pass
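A sketch with a small random forest as the base estimator (any estimator exposing `feature_importances_` or `coef_` works); the demo data is synthetic:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

def select_by_rfe(X, y, n_features=10):
    """Recursively eliminate the weakest features (sketch)."""
    estimator = RandomForestRegressor(n_estimators=50, random_state=0)
    rfe = RFE(estimator, n_features_to_select=n_features)
    rfe.fit(X, y)
    selected = X.columns[rfe.support_].tolist()
    ranking = dict(zip(X.columns, rfe.ranking_))  # 1 = selected
    return selected, ranking

rng = np.random.default_rng(0)
X = pd.DataFrame({"signal": rng.normal(size=150), "noise": rng.normal(size=150)})
y = 2 * X["signal"] + rng.normal(scale=0.1, size=150)
selected, ranking = select_by_rfe(X, y, n_features=1)
print(selected, ranking)
```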
Part 5: Complete Pipeline (Bonus - 30 points)
Build Complete Pipeline
Create a function build_feature_pipeline() that:
- Uses sklearn Pipeline and ColumnTransformer
- Handles numerical and categorical features separately
- Chains imputation, encoding, scaling, and selection
- Returns a fitted pipeline that can transform new data
def build_feature_pipeline():
"""Build a complete feature engineering pipeline."""
# Must use: sklearn.pipeline.Pipeline and ColumnTransformer
# Chain: SimpleImputer -> Encoders -> Scalers
pass
Demonstrate Pipeline Usage
Create cells that demonstrate:
- Splitting data into train/test sets
- Fitting pipeline on training data only
- Transforming both train and test data
- Showing the final feature matrix ready for ML
Submission Instructions
Submit your completed assignment via GitHub following these instructions:
Create Jupyter Notebook
Create a single notebook called feature_engineering.ipynb containing all functions listed above.
- Organize your notebook with clear markdown headers for each part
- Each function must have a docstring explaining what it does
- Include test cells that demonstrate each function working
- Add markdown cells explaining your approach
Include Test Demonstrations
In your notebook, add cells that:
- Generate the property dataset
- Call each of your functions with the data
- Print results showing the transformations
- Demonstrate the complete pipeline on train/test split
Create README
Create README.md that includes:
- Your name and assignment title
- Instructions to run your code
- List of all functions with brief descriptions
- Any challenges you faced and how you solved them
Create requirements.txt
numpy>=1.24.0
pandas>=2.0.0
scikit-learn>=1.3.0
Repository Structure
Your GitHub repository should look like this:
propertypredict-feature-engineering/
├── README.md
├── requirements.txt
└── feature_engineering.ipynb # All functions with test demonstrations
Submit via Form
Once your repository is ready:
- Make sure your repository is public or shared with your instructor
- Click the "Submit Assignment" button below
- Fill in the submission form with your GitHub repository URL
Grading Rubric
Your assignment will be graded on the following criteria:
| Criteria | Points | Description |
|---|---|---|
| Feature Creation | 50 | DateTime extraction, domain features, polynomial features, binning |
| Categorical Encoding | 50 | One-hot, ordinal, label, and target encoding implementations |
| Feature Scaling | 50 | Standard, MinMax, Robust scaling and log transformation |
| Feature Selection | 50 | Variance threshold, correlation, mutual info, RFE |
| Complete Pipeline (Bonus) | 30 | Scikit-learn Pipeline with ColumnTransformer |
| Total | 200 (+30 bonus) | |
Ready to Submit?
Make sure you have completed all requirements and reviewed the grading rubric above.
What You Will Practice
Feature Creation (8.1)
DateTime extraction, domain-specific features, polynomial features, and binning continuous variables
Categorical Encoding (8.2)
One-hot encoding, ordinal encoding, label encoding, and target/mean encoding for ML models
Feature Scaling (8.3)
StandardScaler, MinMaxScaler, RobustScaler, and log transformations for different scenarios
Feature Selection (8.4)
Variance threshold, correlation analysis, mutual information, and recursive feature elimination
Pro Tips
Encoding Best Practices
- Use one-hot for low cardinality nominal features
- Use ordinal encoding for ordered categories
- Use target encoding for high cardinality features
- Always fit encoders on training data only
Scaling Guidelines
- StandardScaler for normally distributed data
- MinMaxScaler when bounded range is needed
- RobustScaler when outliers are present
- Always scale after train/test split
Time Management
- Start with feature creation (most creative part)
- Test each function independently first
- Build the pipeline incrementally
- Leave time for testing the complete workflow
Common Mistakes
- Data leakage: fitting on test data
- Forgetting to handle missing values first
- Not storing fitted transformers for new data
- Encoding target variable (do not encode y!)