Project Overview
This advanced capstone project challenges you to build a complete machine learning pipeline for predicting residential property prices across major Indian cities. You will work with a realistic housing dataset containing 150 properties from Mumbai, Bangalore, Delhi, Chennai, Hyderabad, Pune, and Kolkata. Your goal is to engineer meaningful features, train multiple regression models, compare their performance, and interpret which factors drive property valuations.
Learning Objectives
Feature Engineering Mastery
- Create domain-specific features (price per sqft, room ratios, location scores)
- Understand feature impact through importance analysis
- Combine multiple raw features into composite metrics
- Apply business logic to generate meaningful derived features
Model Comparison Skills
- Train and evaluate 5+ regression algorithms systematically
- Understand trade-offs: accuracy vs interpretability vs training time
- Use cross-validation for robust performance estimation
- Select optimal model based on multiple evaluation metrics
Evaluation & Interpretation
- Interpret RMSE, MAE, and R-squared in business context
- Analyze feature importance from tree-based models
- Identify model strengths and weaknesses through residual analysis
- Translate model insights into actionable business recommendations
End-to-End ML Pipeline
- Build complete workflow: EDA → Engineering → Training → Evaluation
- Handle data preprocessing (encoding, scaling) without leakage
- Document methodology and findings professionally
- Create reproducible analysis with clear explanations
Real-World Application
This project mirrors actual work done by data scientists at real estate tech companies like Zillow, Redfin, or MagicBricks. The ability to predict property prices accurately is a multi-million dollar business problem, and your solution demonstrates production-ready ML engineering skills.
Feature Engineering
Create derived features from raw property attributes
Multiple Models
Train and compare 5+ regression algorithms
Model Comparison
Evaluate using RMSE, MAE, and R-squared metrics
Interpretation
Analyze feature importance and model insights
Business Scenario
PropValue Analytics Pvt. Ltd.
You have been hired as a Machine Learning Engineer at PropValue Analytics, a real estate technology startup that provides property valuation services to banks, insurance companies, and individual buyers across India. The traditional property valuation process takes 5-7 days and costs ₹5,000-₹10,000 per assessment. The company wants to disrupt this market with AI-powered instant valuations priced at ₹499, making property assessment accessible to millions of Indians.
Currently, the company relies on manual appraisals by experienced real estate agents, but this approach doesn't scale. With 50-100 valuation requests coming in daily, the team is overwhelmed. Additionally, manual valuations suffer from inconsistency and human bias, with the same property sometimes receiving price estimates that vary by 15-20% depending on which agent performs the assessment.
"Our clients need accurate property valuations within minutes, not days. We have collected data on 150 properties across 7 major cities with verified sale prices. Can you build a model that predicts prices with at least 85% accuracy and tells us which features matter most for valuation? We also need to understand if our model works equally well across all cities or if we need city-specific models."
The Business Challenge
PropValue Analytics faces several critical challenges that machine learning can address:
Speed vs Accuracy
Manual valuations take 5-7 days but are reasonably accurate (90-95%). Instant online estimates are fast but often wildly inaccurate (60-70% accuracy), leading to customer distrust.
Market Variability
Mumbai properties average ₹150 Lakhs while similar properties in Pune cost ₹60 Lakhs. The model must capture both national patterns and city-specific pricing dynamics.
Feature Complexity
18 features influence price, but which matter most? Is a furnished 2BHK worth more than an unfurnished 3BHK? Does being near a metro station add ₹10 Lakhs or ₹30 Lakhs to value?
Business Questions to Answer
- What is the predicted price for a given property?
- How accurate is the model across different cities?
- What is the prediction confidence interval?
- Which features have the highest impact on price?
- How does location affect property valuation?
- What is the price premium for furnished properties?
- Which algorithm performs best for this data?
- Is there overfitting in complex models?
- What are the trade-offs between models?
- How does price vary by city and region?
- What is the price per square foot by property type?
- How does age affect property depreciation?
The Dataset
You will work with a realistic Indian housing market dataset containing 150 residential properties across 7 major cities. This professionally curated dataset includes verified sale prices, making it ideal for supervised learning. Each property record contains 18 features covering physical attributes, location factors, and neighborhood amenities.
Why This Dataset is Perfect for Regression
Real Market Data
All properties have verified sale prices from actual transactions (2023-2024). No synthetic or estimated values, ensuring your model learns from real market dynamics.
Feature Diversity
Mix of numerical (area, age, price), categorical (city, furnishing), and binary (main road) features. Includes interaction opportunities (bedroom/bathroom ratio, floor position).
Geographic Variation
Properties span Tier-1 metros (Mumbai, Bangalore, Delhi) and Tier-2 cities (Pune, Hyderabad), capturing different market segments and price dynamics.
Dataset Schema
| Column | Type | Description |
|---|---|---|
| property_id | String | Unique property identifier (HP001, HP002, ...) |
| location | String | Specific locality/neighborhood name |
| city | String | City name (Mumbai, Bangalore, Delhi, etc.) |
| region | String | Geographic region (North, South, East, West) |
| property_type | String | Type of property (Apartment, Villa) |
| bedrooms | Integer | Number of bedrooms (1-5) |
| bathrooms | Integer | Number of bathrooms (1-4) |
| area_sqft | Integer | Total area in square feet |
| floor | Integer | Floor number (0 for ground/villa) |
| total_floors | Integer | Total floors in the building |
| age_years | Integer | Age of property in years |
| furnishing | String | Furnishing status (Furnished, Semi-Furnished, Unfurnished) |
| parking | Integer | Number of parking spaces (0-3) |
| amenities_score | Integer | Amenities rating (1-10 scale) |
| nearby_schools | Integer | Number of schools within 2km |
| nearby_hospitals | Integer | Number of hospitals within 2km |
| metro_distance_km | Float | Distance to nearest metro station (km) |
| main_road | String | On main road (Yes/No) |
| price_lakhs | Float | Target Variable: Price in Indian Lakhs |
Understanding Key Features
| Feature Category | Features Included | Expected Impact on Price |
|---|---|---|
| Size & Layout | area_sqft, bedrooms, bathrooms | High - Direct correlation with price |
| Location | city, region, location | High - Mumbai properties 2-3x costlier than others |
| Connectivity | metro_distance_km, main_road | Medium - ₹5-15 Lakhs premium for accessibility |
| Condition & Quality | age_years, furnishing, amenities_score | Medium - New/furnished properties command higher prices |
| Neighborhood | nearby_schools, nearby_hospitals | Low-Medium - Desirable but less impactful than size/location |
| Building Features | floor, total_floors, parking | Low-Medium - Varies by property type and city |
Getting Started
Loading the Dataset
Start by loading the CSV file and examining its basic properties:
- Shape: Check number of rows (properties) and columns (features)
- Price Range: Find minimum and maximum prices in Lakhs
- Cities: List all unique city values
- Preview: Display first few rows to understand data structure
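The checks above can be sketched as follows. This uses a tiny made-up sample frame so the snippet runs on its own; with the real file you would start from `df = pd.read_csv("data/housing_data.csv")` and run the same inspections.

```python
import pandas as pd

# Tiny illustrative sample mimicking a few schema columns (values are made up).
# With the real data: df = pd.read_csv("data/housing_data.csv")
df = pd.DataFrame({
    "property_id": ["HP001", "HP002", "HP003"],
    "city": ["Mumbai", "Pune", "Delhi"],
    "area_sqft": [1200, 950, 1500],
    "price_lakhs": [145.0, 58.5, 110.0],
})

print(df.shape)                                        # rows x columns
print(df["price_lakhs"].min(), df["price_lakhs"].max())  # price range in Lakhs
print(sorted(df["city"].unique()))                     # unique cities
print(df.head())                                       # preview first rows
```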
Project Requirements
Your Jupyter Notebook must include all of the following components. Structure your notebook with clear markdown headers and explanations for each section.
Project Setup and Introduction
Title, your name, date, project overview, and business context. Import all required libraries: pandas, numpy, sklearn, matplotlib, seaborn, plotly.
Required Library Groups
- Data handling: pandas, numpy
- Visualization: matplotlib, seaborn, plotly
- Preprocessing: StandardScaler, train_test_split, LabelEncoder/OneHotEncoder
- Models: LinearRegression, Ridge, Lasso, RandomForestRegressor, GradientBoostingRegressor
- Evaluation: mean_squared_error, mean_absolute_error, r2_score, cross_val_score
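A typical first notebook cell importing these groups might look like the sketch below (seaborn and plotly are imported the same way once installed; they are left as comments here to keep the sketch minimal):

```python
# Data handling
import numpy as np
import pandas as pd

# Visualization (once installed, also: import seaborn as sns; import plotly.express as px)
import matplotlib.pyplot as plt

# Preprocessing and model selection
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler

# Models
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

# Evaluation metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
```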
Exploratory Data Analysis (EDA)
Comprehensive data exploration to understand patterns, distributions, and relationships before modeling.
Univariate Analysis
- Numerical Features: Analyze mean, median, std, min/max for price, area, age, parking
- Target Distribution: Histogram + KDE plot of price_lakhs to check for skewness
- Categorical Features: Value counts and percentages for city, property_type, furnishing
- Missing Values: Check for nulls with df.isnull().sum() (should be zero)
Bivariate Analysis
- Correlation Heatmap: Visualize relationships between all numerical features
- Price by City: Box plots showing price distribution across 7 cities
- Area vs Price: Scatter plot with property_type color coding
- Categorical Comparisons: Bar charts for average price by furnishing and property_type
Outlier Detection
- Price Outliers: Identify properties >3 standard deviations from mean
- Area Extremes: Flag properties with unusually large/small area_sqft
- Age Analysis: Check for very old properties (20+ years) affecting pricing
- Decision: Document whether to keep, cap, or remove outliers
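A minimal sketch of the 3-standard-deviation rule for price outliers, using made-up prices (with real data, substitute `df["price_lakhs"]`):

```python
import pandas as pd

# Made-up prices in Lakhs: twelve typical values plus one extreme property
prices = pd.Series([42, 48, 51, 55, 60, 45, 58, 62, 47, 53, 49, 56, 500])

# Z-score: how many standard deviations each price sits from the mean
z = (prices - prices.mean()) / prices.std()

# Flag properties more than 3 standard deviations from the mean
outliers = prices[z.abs() > 3]
print(outliers)
```

Note that with very small samples the z-score of a single extreme value is bounded, so visual checks (box plots) are a useful complement to this rule.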
Key Insights to Document
- Price Range: What's the min, max, and average property price?
- Top Correlations: Which features correlate most with price (>0.5)?
- City Patterns: Which city has highest/lowest average prices?
- Property Types: Are villas significantly more expensive than apartments?
Feature Engineering
Create at least 5 new derived features (see Feature Engineering section for ideas):
- Price per square foot calculation
- Location-based features (city premium, metro accessibility)
- Property age categories
- Room ratios and space efficiency metrics
- Amenity and accessibility composite scores
Data Preprocessing
Transform data into machine-learning-ready format through encoding and scaling.
Categorical Encoding
Convert text categories to numbers:
- One-Hot Encoding: City, property_type, furnishing, main_road
- Why One-Hot?: No ordinal relationship (Mumbai isn't "greater than" Delhi)
- Result: Creates binary columns (city_Mumbai, city_Delhi, etc.)
- Drop first: Use drop_first=True to avoid multicollinearity
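A minimal sketch of this encoding with pd.get_dummies, on a small made-up frame using the categorical columns named in the schema:

```python
import pandas as pd

# Made-up sample with the categorical columns from the schema
df = pd.DataFrame({
    "city": ["Mumbai", "Delhi", "Pune"],
    "furnishing": ["Furnished", "Unfurnished", "Semi-Furnished"],
    "main_road": ["Yes", "No", "Yes"],
})

# drop_first=True removes one category per column to avoid multicollinearity
encoded = pd.get_dummies(df, columns=["city", "furnishing", "main_road"],
                         drop_first=True)
print(encoded.columns.tolist())
# e.g. city_Mumbai, city_Pune, furnishing_Semi-Furnished, furnishing_Unfurnished, main_road_Yes
```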
Feature Scaling
Normalize numerical features to same scale:
- StandardScaler: Mean=0, StdDev=1 (recommended for linear models)
- Why scale?: area_sqft (500-3000) vs parking (0-3) - prevent large values dominating
- Critical: Fit scaler on training data only, then transform train + test
- Not needed: Tree-based models (Random Forest, Gradient Boosting)
Critical: Prevent Data Leakage
Always follow this order:
- Separate features (X) from target (y)
- Encode categorical variables
- Split into train/test sets (80/20 split)
- Fit scaler on training data only, then transform both train and test
Why? Fitting the scaler on test data causes information leakage and inflates performance metrics.
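The leakage-safe ordering above can be sketched as follows, using a synthetic stand-in for the encoded feature matrix:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded housing features and target (made-up values)
rng = np.random.default_rng(42)
X = rng.normal(size=(150, 5))
y = rng.normal(loc=80, scale=30, size=150)

# 1) Split FIRST (80/20), 2) fit the scaler on the training portion only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler().fit(X_train)      # mean/std come from training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)    # test is transformed, never fitted

print(X_train_scaled.shape, X_test_scaled.shape)
```

Fitting before the split would let test-set statistics influence the scaling, which is exactly the leakage the warning above describes.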
Model Training and Comparison
Train at least 5 different regression models and compare performance:
- Linear Regression (baseline)
- Ridge Regression (L2 regularization)
- Lasso Regression (L1 regularization)
- Random Forest Regressor
- Gradient Boosting Regressor
Model Evaluation
- Calculate RMSE, MAE, and R-squared for each model
- Perform 5-fold cross-validation
- Create a comparison table of all models
- Analyze residual plots for the best model
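The train-and-compare loop can be sketched as below. The data here is synthetic (a linear signal plus noise) so the snippet is self-contained; in the project you would substitute your scaled training and test sets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Synthetic stand-in for the housing features/target (made-up coefficients)
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 6))
y = X @ np.array([30, 12, 5, 8, 3, 2]) + rng.normal(scale=10, size=150) + 80

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "Linear Regression": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.1),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(n_estimators=100, random_state=42),
}

rows = []
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    rows.append({
        "Model": name,
        "RMSE": np.sqrt(mean_squared_error(y_test, pred)),
        "MAE": mean_absolute_error(y_test, pred),
        "R2": r2_score(y_test, pred),
    })

# Comparison table, sorted so the best performer is on top
results = pd.DataFrame(rows).sort_values("R2", ascending=False)
print(results)
```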
Feature Importance Analysis
Understanding which features drive property prices helps stakeholders make data-informed decisions about pricing, renovations, and investment priorities. Different models reveal different aspects of feature importance.
Tree-Based Model Importance
Method: Extract feature_importances_ attribute from Random Forest or Gradient Boosting.
What it shows: How often features are used to split data and how much they reduce error.
Example: If area_sqft has importance 0.45, it accounts for 45% of the total importance across all features.
Action: Sort features by importance, visualize top 10-15 with horizontal bar chart.
Linear Model Coefficients
Method: Extract coef_ attribute from Linear, Ridge, or Lasso models.
What it shows: How much price changes per unit increase in each feature.
Example: Coefficient of 5.2 for bedrooms means each additional bedroom adds 5.2 lakhs to price.
Action: Get absolute values for ranking (ignore positive/negative), visualize top contributors.
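Both extraction methods can be sketched together. The feature names and data below are made up for illustration; in the project they come from your engineered housing features.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Made-up feature names and data; in the project these come from the housing set
features = ["area_sqft", "bedrooms", "metro_distance_km", "amenities_score"]
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(150, 4)), columns=features)
y = 40 * X["area_sqft"] + 8 * X["bedrooms"] + rng.normal(scale=5, size=150)

# Tree-based importances (normalized: they sum to 1)
rf = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)
importances = pd.Series(rf.feature_importances_, index=features).sort_values(ascending=False)

# Linear coefficients: price change per unit increase in each feature
lr = LinearRegression().fit(X, y)
coefs = pd.Series(lr.coef_, index=features)

print(importances)
print(coefs.abs().sort_values(ascending=False))  # rank by magnitude for plotting
```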
Important Considerations:
- Feature scaling matters: For linear models, scale features before extracting coefficients (otherwise large-range features dominate)
- Correlation ≠ Causation: High importance doesn't mean changing that feature will change the price
- Context is key: A feature with 5% importance might still be critical for specific property segments
- Compare across models: If a feature is important in both tree-based AND linear models, it's truly influential
Insights and Recommendations
Transform your analysis into actionable business intelligence. Your insights should connect model findings to real-world decisions for buyers, sellers, investors, and real estate professionals.
Framework for Deriving Insights
Pricing Drivers
Identify the top 3-5 features with highest importance. Quantify their impact with examples: "Each additional bedroom adds approximately 8-12 lakhs to property value" or "Properties in South Mumbai command a 35% premium over similar properties in suburbs."
Surprising Findings
Highlight unexpected patterns from your EDA or feature importance: "Age has minimal impact on price (only 3% importance), suggesting buyers prioritize location and size over property age" or "Parking adds more value than an extra bedroom in urban areas."
Model Performance Context
Explain what your R-squared means in practical terms: "Model achieves 85% R-squared, meaning it can predict property prices within ±15 lakhs for 80% of properties. The remaining 15% variation is likely due to unique features like architectural style, renovation quality, or negotiation skills."
Segment-Specific Patterns
Analyze if pricing rules differ across segments: "For luxury properties (>1 Cr), floor level and amenities become critical. For budget properties (<50 lakhs), total area and locality score are primary drivers."
Data Quality Observations
Note any data limitations: "Properties above 5 Cr are underrepresented (only 2% of dataset), so model predictions may be less reliable for ultra-luxury segment. Recommend collecting more high-end property data."
Stakeholder-Specific Recommendations
For Sellers
- Focus on high-impact, low-cost improvements (e.g., if bathrooms add value, consider bathroom upgrades)
- Time market listings based on locality demand patterns
- Price competitively using model predictions adjusted for unique features
For Buyers
- Identify undervalued properties where actual price is significantly below model prediction
- Prioritize features with high importance if budget is constrained
- Consider emerging localities where location score is improving
For Investors
- Target properties with features that have growing importance trends
- Renovate to maximize features with highest ROI based on model coefficients
- Portfolio diversification based on segment-specific pricing patterns
For Developers
- Design projects emphasizing high-importance features (e.g., optimize sqft/bedroom ratio)
- Price new developments using model as baseline, adjust for amenities
- Identify underserved market segments with pricing gaps
Example High-Quality Insight:
"Analysis of 150 properties reveals that price per sqft varies by 85% across locations, making locality the single strongest pricing factor (42% feature importance). Properties within 2km of metro stations command a premium of 18-25%, while properties with dedicated parking add 8-12 lakhs to valuation. The model achieves 85% R-squared, accurately predicting prices within ±15 lakhs for most properties. This suggests location + connectivity + parking form the golden triangle of value drivers in the current market."
Feature Engineering
Create at least 5 derived features from the raw data. Feature engineering is crucial for improving model performance and capturing domain-specific knowledge about real estate valuation. Well-designed features can boost model accuracy by 10-20% compared to using raw data alone.
Why Feature Engineering Matters:
Raw features tell you a property has 3 bedrooms and 1500 sqft. But what really matters for pricing?
- Price per sqft reveals if a property is overpriced for its size
- Room density indicates efficient space usage vs. sprawling layouts
- Floor ratio captures premium for high floors with better views
- Locality score combines multiple amenities into a single quality metric
Recommended Feature Categories
Key Metrics to Create:
- price_per_sqft: Convert total price to per-square-foot rate for fair size comparison
- room_density: Measure how many rooms exist per 1000 sqft (efficient vs. spacious)
- bhk_ratio: Bedroom-to-bathroom ratio (2:1 is typical, 3:1 may indicate inadequate bathrooms)
- floor_ratio: Position in building (0.8-1.0 = top floors = premium views)
Property: 1500 sqft, 3 bedrooms, 2 bathrooms, 8th floor of 10
- price_per_sqft = (price_lakhs × 100,000) ÷ 1500
- room_density = (3 + 2) ÷ 1500 × 1000 = 3.33
- bhk_ratio = 3 ÷ 2 = 1.5 (balanced)
- floor_ratio = 8 ÷ 10 = 0.8 (high floor premium)
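The worked example above translates directly to code. The price of 90 lakhs is a made-up stand-in, since the example does not state one:

```python
# Property from the worked example: 1500 sqft, 3 bed, 2 bath, 8th floor of 10
area_sqft, bedrooms, bathrooms, floor, total_floors = 1500, 3, 2, 8, 10
price_lakhs = 90.0  # hypothetical price for illustration

price_per_sqft = (price_lakhs * 100_000) / area_sqft      # rupees per square foot
room_density = (bedrooms + bathrooms) / area_sqft * 1000  # rooms per 1000 sqft
bhk_ratio = bedrooms / bathrooms                          # bedroom:bathroom balance
floor_ratio = floor / total_floors                        # position in the building

print(price_per_sqft, round(room_density, 2), bhk_ratio, floor_ratio)
# 6000.0, 3.33, 1.5, 0.8 — matching the hand-computed values above
```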
Location-Based Features:
- city_tier: Classify as Tier-1 metros (Mumbai, Delhi, Bangalore) vs Tier-2 cities
- metro_accessible: Binary flag for metro stations within 1km (walkable distance)
- locality_score: Sum nearby schools + hospitals as neighborhood quality indicator
- connectivity_score: Weighted combination: (metro_accessible × 3) + (main_road × 2)
Tier-1 cities command 2-3x premium. Metro accessibility adds ₹10-20 Lakhs. Properties near 5+ schools/hospitals are family-friendly and more desirable. Main road access improves resale value by ₹5-10 Lakhs.
Age-Based Categorization:
- age_category: Group as New (0-3 years), Recent (4-10 years), or Old (10+ years)
- is_new_property: Binary indicator for brand new construction (premium pricing)
- depreciation_factor: Estimate value reduction: max(0, 1 - age/30) to model 30-year lifespan
New (0-3): Full value, no depreciation, warranty coverage
Recent (4-10): Slight depreciation (5-10%), established neighborhood
Old (10+): 15-30% depreciation, may need renovation, but mature locality
Quality & Amenity Scores:
- furnishing_score: Convert to numeric scale (Unfurnished=0, Semi=1, Furnished=2)
- total_amenity_score: Weighted sum: amenities_score + (parking × 2) + locality_score
- luxury_index: Composite score combining furnishing, amenities, and floor ratio for premium properties
Furnished properties command 10-15% premium. Each parking space adds ₹3-5 Lakhs. Amenities (gym, pool, security) add ₹5-15 Lakhs. Luxury index >7 indicates high-end properties with 20-30% price premium.
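The age and quality features above can be derived with pandas like so; the three-row frame is made up for illustration, and the bin edges follow the categories described above:

```python
import pandas as pd

# Made-up sample; with real data these transformations apply to the full DataFrame
df = pd.DataFrame({
    "age_years": [2, 7, 15],
    "furnishing": ["Furnished", "Semi-Furnished", "Unfurnished"],
    "parking": [2, 1, 0],
    "amenities_score": [8, 6, 4],
})

# New (0-3), Recent (4-10), Old (10+)
df["age_category"] = pd.cut(df["age_years"], bins=[-1, 3, 10, 100],
                            labels=["New", "Recent", "Old"])
df["is_new_property"] = (df["age_years"] <= 3).astype(int)

# Linear depreciation over an assumed 30-year lifespan, floored at zero
df["depreciation_factor"] = (1 - df["age_years"] / 30).clip(lower=0)

# Ordinal furnishing scale: Unfurnished=0, Semi=1, Furnished=2
df["furnishing_score"] = df["furnishing"].map(
    {"Unfurnished": 0, "Semi-Furnished": 1, "Furnished": 2})

print(df[["age_category", "is_new_property", "depreciation_factor", "furnishing_score"]])
```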
Feature Engineering Best Practices:
- Check Correlations: After creating features, use df.corr() to find highly correlated features (>0.85). Remove or combine them to avoid multicollinearity.
- Domain Knowledge: Think like a real estate agent - what factors do buyers actually care about? Create features that capture buyer priorities.
- Test Impact: Train a baseline model, add your engineered features, and measure improvement in R-squared or RMSE to validate their value.
- Document Rationale: In your notebook, explain WHY you created each feature and what business insight it captures.
Common Feature Engineering Mistakes
What NOT to Do
- Create features with missing values (breaks model training)
- Use target variable in feature calculation (data leakage)
- Create 20+ features without testing which help (overfitting risk)
- Ignore extreme outliers in derived features (skews scaling)
Best Practices
- Start with 5-7 well-reasoned features, add more if needed
- Validate each feature makes sense (no negative room densities)
- Create features that could be calculated for new, unseen properties
- Document formulas clearly for reproducibility
ML Models to Implement
Train and compare at least 5 different regression models. Each model should be evaluated using consistent metrics and cross-validation for fair comparison. The goal is not just to find the "best" model, but to understand the trade-offs between interpretability, accuracy, training time, and complexity.
Why Multiple Models?
Different algorithms make different assumptions about data:
- Linear models assume straight-line relationships (fast, interpretable, but may underfit)
- Regularized models prevent overfitting by penalizing complex coefficients
- Tree-based models capture non-linear patterns and feature interactions automatically
- Ensemble methods combine multiple models for superior accuracy (but less interpretable)
Model Comparison Guide
| Model | Type | Key Parameters | Best For |
|---|---|---|---|
| Linear Regression | Linear | None (baseline) | Baseline comparison, interpretable coefficients |
| Ridge Regression | Linear (L2) | alpha: 0.1, 1.0, 10.0 | Handling multicollinearity, preventing overfitting |
| Lasso Regression | Linear (L1) | alpha: 0.01, 0.1, 1.0 | Feature selection, sparse solutions |
| Random Forest | Ensemble | n_estimators: 100, max_depth: 10-20 | Non-linear relationships, feature importance |
| Gradient Boosting | Ensemble | n_estimators: 100, learning_rate: 0.1 | Best accuracy, handles complex patterns |
Deep Dive: Understanding Each Algorithm
Linear Regression
How it works: Finds best-fit line minimizing squared errors. Price = (coef1 × area) + (coef2 × bedrooms) + ... + intercept
- Pros: Fast training, interpretable coefficients showing feature impact
- Cons: Assumes linear relationships, sensitive to outliers, can't capture interactions
- Use case: Baseline to beat. If it performs well (R² >0.75), data has linear patterns.
Ridge & Lasso Regression
How they work: Linear regression + penalty for large coefficients. Ridge (L2) shrinks all coefficients. Lasso (L1) can zero out features.
- Pros: Prevent overfitting, handle correlated features, Lasso does automatic feature selection
- Cons: Still assume linearity, need to tune alpha parameter
- Use case: When many features are correlated (area, bedrooms, bathrooms all correlate)
Random Forest
How it works: Builds 100+ decision trees on random data subsets, averages predictions. Each tree asks yes/no questions about features.
- Pros: Handles non-linear relationships, no feature scaling needed, provides feature importance
- Cons: Slower training, black box (hard to explain why), may overfit with too many trees
- Use case: When relationships are complex (e.g., Mumbai properties behave differently than Pune)
Gradient Boosting
How it works: Builds trees sequentially, each correcting previous tree's errors. Learns patterns iteratively.
- Pros: Often highest accuracy, handles missing data, captures complex interactions
- Cons: Longest training time, easy to overfit, requires careful tuning
- Use case: When you need best prediction accuracy and have clean, engineered features
Model Training Process
Initialize Models
Create instances of all 5 regression algorithms with sensible default hyperparameters. Use consistent random_state (42) for reproducibility.
Train on Training Data
Fit each model using the scaled training features (X_train_scaled) and target values (y_train). The model learns patterns and coefficients during this step.
Generate Predictions
Use each trained model to predict prices on the test set (X_test_scaled). These predictions will be compared against actual prices (y_test).
Calculate Performance Metrics
Compute RMSE, MAE, and R-squared for each model by comparing predictions to actual values. Lower RMSE/MAE and higher R-squared indicate better performance.
Cross-Validation
Perform 5-fold cross-validation on training data to get more reliable performance estimates. This helps detect overfitting.
Compare Results
Create a comparison table showing all metrics for all models. Sort by R-squared to identify the best performer.
Expected Performance Range
| Model | Expected R-squared | Expected RMSE (Lakhs) | Typical Training Time |
|---|---|---|---|
| Linear Regression | 0.75 - 0.82 | 18 - 25 | < 1 second |
| Ridge Regression | 0.76 - 0.83 | 17 - 24 | < 1 second |
| Lasso Regression | 0.74 - 0.81 | 19 - 26 | < 1 second |
| Random Forest | 0.82 - 0.88 | 14 - 20 | 5 - 15 seconds |
| Gradient Boosting | 0.84 - 0.90 | 12 - 18 | 10 - 30 seconds |
Interpretation Tip
R-squared of 0.85 means the model explains 85% of price variance. The remaining 15% is due to unmeasured factors like:
- Neighborhood reputation and schools
- Recent renovations or property condition details
- Seller motivation and negotiation factors
- Market timing and economic conditions
Understanding Evaluation Metrics
RMSE (Root Mean Squared Error)
Formula: Square root of average squared errors
Interpretation: Average prediction error in lakhs. RMSE of 20 means typical error is ±20 lakhs.
Key feature: Heavily penalizes large errors. A few big mistakes hurt RMSE more than many small ones.
Use case: When large errors are particularly costly (e.g., overpricing luxury properties).
MAE (Mean Absolute Error)
Formula: Average of absolute errors
Interpretation: Easier to explain to non-technical stakeholders. MAE of 15 means average error is 15 lakhs.
Key feature: Treats all errors equally, not sensitive to outliers.
Use case: When you want robust metric that isn't distorted by a few extreme cases.
R-squared (R²)
Formula: 1 - (Sum of squared errors / Total variance)
Interpretation: Percentage of price variance explained by model. R² of 0.85 = model explains 85% of variation.
Key feature: Scale-independent; typically ranges from 0 to 1 (higher is better), though it can be negative for a model that performs worse than simply predicting the mean.
Use case: Comparing models on different datasets or quickly assessing model quality.
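All three metrics are available in scikit-learn. A small worked example with made-up actual and predicted prices:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Made-up actual and predicted prices in Lakhs
y_true = np.array([50.0, 80.0, 120.0, 65.0])
y_pred = np.array([55.0, 75.0, 110.0, 70.0])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # penalizes large errors more
mae = mean_absolute_error(y_true, y_pred)           # average absolute error
r2 = r2_score(y_true, y_pred)                       # fraction of variance explained

print(round(rmse, 2), mae, round(r2, 3))
# 6.61, 6.25, 0.936
```

Note how RMSE (6.61) exceeds MAE (6.25) here: the single 10-lakh miss is weighted more heavily by the squaring.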
Comparing Metrics: Practical Examples
| Scenario | Model A | Model B | Which is Better? |
|---|---|---|---|
| General comparison | R² = 0.85, RMSE = 20L | R² = 0.80, RMSE = 18L | Model A - Higher R² means better overall fit |
| Cost of big errors | MAE = 15L, RMSE = 25L | MAE = 17L, RMSE = 20L | Model B - Lower RMSE indicates fewer catastrophic errors |
| Stakeholder reporting | MAE = 12L, R² = 0.82 | MAE = 15L, R² = 0.85 | Model A - Lower MAE easier to communicate ("average error is 12 lakhs") |
| Luxury segment | R² = 0.75, max error = 80L | R² = 0.78, max error = 50L | Model B - Smaller max error critical for high-value properties |
Cross-Validation: The Gold Standard
Test set performance can be misleading if you happen to get an "easy" or "hard" split. Cross-validation divides training data into 5 folds, trains on 4 and validates on 1, rotating through all combinations.
Example interpretation: If cross-validation R² is 0.83 with std dev of 0.03, your model consistently performs well (0.80-0.86 range). If std dev is 0.12, performance is unstable (0.71-0.95 range) - investigate why.
Action: Always report mean CV score ± standard deviation. Use CV scores for model selection, test set only for final performance estimate.
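The mean ± standard deviation report described above can be produced with cross_val_score; the data here is a synthetic stand-in so the sketch runs on its own:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (made-up linear signal plus noise)
rng = np.random.default_rng(7)
X = rng.normal(size=(150, 5))
y = X @ np.array([20, 10, 5, 3, 2]) + rng.normal(scale=8, size=150)

# 5-fold cross-validation on the training data; one R² score per fold
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=5, scoring="r2")
print(f"CV R2: {scores.mean():.3f} +/- {scores.std():.3f}")
```

A small standard deviation across folds indicates stable performance; a large one (e.g. 0.12) means the model's quality depends heavily on which properties land in each fold.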
Detecting Overfitting
Training R² = 0.95, Test R² = 0.72: Model memorized training data. Reduce model complexity or add regularization.
Training R² = 0.85, Test R² = 0.83: Healthy performance, model generalizes well.
Training R² = 0.68, Test R² = 0.70: Underfitting, model too simple. Add features or try more complex algorithms.
Required Visualizations
Create at least 10 visualizations covering EDA, model comparison, and feature importance. Use a mix of Matplotlib, Seaborn, and Plotly for different chart types.
Target Variable Distribution
Histogram of price_lakhs with KDE curve
Correlation Matrix
Correlation between all numerical features
Price by City
Price distribution across different cities
Area vs Price
Relationship with property type color coding
Average Price by City
Mean price comparison across cities
Price by Furnishing
Distribution by furnishing status
Model Comparison
R-squared scores for all 5 models
Actual vs Predicted
45-degree line comparison for best model
Residual Analysis
Residuals vs predicted values
Feature Importance
Top 15 features from Random Forest
Interactive Map
Plotly map showing prices by city
CV Score Distribution
Cross-validation scores across folds
Visualization 1: Feature Importance Analysis
Purpose
Identify which features have the strongest influence on price predictions in tree-based models (Random Forest or Gradient Boosting).
What to Show
- Top 15 features sorted by importance
- Horizontal bar chart for easy feature name reading
- Importance scores (0 to 1 scale)
- Clear labels and title
What to Look For
- Area metrics typically rank #1 or #2
- Location (city) and parking features often in top 5
- Bathrooms often more important than bedrooms
- Engineered features competing with original ones
Visualization 2: Actual vs Predicted Prices
Purpose
Visually assess prediction accuracy by plotting predicted prices against actual prices. Perfect predictions would fall on a 45-degree diagonal line.
What to Show
- Scatter plot: x-axis = actual prices, y-axis = predicted
- 45-degree dashed reference line (perfect prediction)
- Use your best performing model (likely Gradient Boosting)
- Axis labels with units (Lakhs)
What to Look For
- Points close to line = accurate predictions
- Points below line = model under-predicts (conservative)
- Points above line = model over-predicts (optimistic)
- Outliers = properties with unusual characteristics
Business Insight
If your model consistently under-predicts luxury properties (₹1 Cr+), it may need engineered features capturing premium amenities or neighborhood prestige.
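A minimal sketch of the actual-vs-predicted plot described above, using made-up prices and the non-interactive Agg backend so it renders headlessly; in the project, substitute y_test and your best model's predictions:

```python
import matplotlib
matplotlib.use("Agg")  # headless rendering for this sketch
import matplotlib.pyplot as plt
import numpy as np

# Made-up actual prices and predictions with some noise
rng = np.random.default_rng(3)
y_test = rng.uniform(30, 200, size=30)
y_pred = y_test + rng.normal(scale=12, size=30)

fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(y_test, y_pred, alpha=0.7)

# 45-degree dashed reference line: points on it are perfect predictions
lims = [min(y_test.min(), y_pred.min()), max(y_test.max(), y_pred.max())]
ax.plot(lims, lims, "r--", label="Perfect prediction")

ax.set_xlabel("Actual price (Lakhs)")
ax.set_ylabel("Predicted price (Lakhs)")
ax.set_title("Actual vs Predicted Prices")
ax.legend()
fig.savefig("actual_vs_predicted.png")
```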
Visualization Best Practices
EDA Charts
Use Seaborn for statistical visualizations:
- Heatmaps for correlations
- Distribution plots with KDE
- Box/violin plots by category
Model Comparison
Use Matplotlib for simple comparisons:
- Bar charts for metrics across models
- Horizontal bars for feature importance
- Residual plots for error analysis
Interactive Plots
Use Plotly for exploration:
- Scatter plots with hover details
- 3D visualizations of relationships
- Geographic maps for location data
Interpreting Common Visualization Patterns
Distribution Plots (Histograms)
Right-skewed price distribution: Most properties are affordable (₹30-70L), few luxury outliers (₹2 Cr+)
Insight: Apply log transformation to normalize distribution for linear models
Business value: Focus marketing on 70-80% of market (middle segment), create specialized strategies for luxury tier
Correlation Heatmap
Strong correlation (0.8+): area_sqft and bedrooms often correlated - larger homes have more bedrooms
Insight: Creates multicollinearity issue for linear models. Consider removing one or creating ratio feature
Business value: Don't build 5-bedroom homes in small areas - market expects proportionality
Box Plots (Price by City)
Mumbai median at ₹95L, Pune at ₹52L: Location drives 45% price difference even for similar properties
Insight: Location is critical feature - include city/area encoding in model
Business value: Investment strategy should prioritize location over property features. A basic flat in Mumbai outperforms luxury villa in tier-2 city
Residual Plot
Random scatter around zero: Model assumptions are satisfied, no systematic bias
Funnel shape (increasing spread): Model is less confident with expensive properties - heteroscedasticity issue
Business value: Use model confidently for budget-mid range (₹30-80L), add ±20% margin for luxury predictions
Visualization Interpretation Checklist
For every visualization you create, ask these questions:
- What pattern do I see? (e.g., positive correlation, outliers, skewed distribution)
- Why does this pattern exist? (e.g., supply-demand dynamics, construction costs, buyer preferences)
- How does it affect my model? (e.g., need transformation, feature engineering, separate segments)
- What business decision does it inform? (e.g., pricing strategy, target market, renovation priorities)
- Should I investigate further? (e.g., outliers requiring detailed analysis, unexpected correlations)
Submission Requirements
Create a public GitHub repository with the exact name shown below:
Required Repository Name
house-price-prediction
Required Project Structure
Directory Layout
- data/ folder containing housing_data.csv
- notebooks/ folder with house_price_analysis.ipynb (your main notebook)
- models/ folder for saved model files (optional: best_model.pkl)
- requirements.txt at root level listing all dependencies
- README.md at root level with project documentation
README.md Must Include:
- Your full name and submission date
- Project overview and business context
- Model comparison table with RMSE, MAE, R-squared for all models
- Best model selection with justification
- Top 5 feature insights from importance analysis
- Technologies used (Python, Pandas, Scikit-learn, etc.)
- Instructions to run the notebook
- Screenshots of at least 4 visualizations
Required Python Libraries
Create a requirements.txt file with these dependencies (minimum versions):
| Library | Version | Purpose |
|---|---|---|
| pandas | 2.0.0+ | Data manipulation and analysis |
| numpy | 1.24.0+ | Numerical operations and arrays |
| scikit-learn | 1.3.0+ | ML models, preprocessing, evaluation |
| matplotlib | 3.7.0+ | Static visualizations |
| seaborn | 0.12.0+ | Statistical visualizations |
| plotly | 5.18.0+ | Interactive charts |
| jupyter | 1.0.0+ | Notebook environment |
| nbformat | 5.9.0+ | Notebook formatting |
| joblib | 1.3.0+ | Model serialization (optional) |
Do Include
- Clear markdown sections with headers
- All code cells executed with outputs
- At least 5 trained regression models
- At least 10 visualizations
- Model comparison table
- Feature importance analysis
- Business insights and recommendations
- README with model performance and screenshots
Do Not Include
- Virtual environment folders (venv, .env)
- Any .pyc or __pycache__ files
- Unexecuted notebooks
- Hardcoded absolute file paths
- Large model files (keep under 100MB)
- API keys or credentials
Grading Rubric
Your project will be graded on the following criteria. Total: 600 points. Each criterion includes specific requirements that must be met for full credit.
| Criteria | Points |
|---|---|
| EDA and Data Understanding | 75 |
| Feature Engineering | 100 |
| Data Preprocessing | 50 |
| Model Training | 100 |
| Model Evaluation | 75 |
| Visualizations | 75 |
| Feature Importance | 50 |
| Code Quality | 25 |
| Documentation | 25 |
| Business Insights | 25 |
| Total | 600 |
Ready to Submit?
Make sure you have completed all requirements and reviewed the grading rubric above.