Intermediate Project 4

Credit Risk Assessment

Build a machine learning classification pipeline to predict loan default risk using the famous German Credit dataset. Learn to handle imbalanced data, perform feature engineering, and evaluate models with business-relevant metrics like precision, recall, and AUC-ROC.

8-10 hours
Intermediate
350 Points
What You Will Build
  • Exploratory Data Analysis
  • Feature Engineering Pipeline
  • Multiple Classification Models
  • Model Comparison Dashboard
  • Business Recommendations
Contents
01

Project Overview

This project focuses on binary classification for credit risk prediction. You will work with the German Credit Dataset, which contains 1,000 loan applications described by attributes such as credit history, employment, loan purpose, and personal characteristics (the original UCI version has 20 features; the simplified Kaggle copy used here has 9 plus the target). Your goal is to build a model that predicts whether an applicant is a good or bad credit risk.

Skills Applied: This project tests your proficiency in Python (pandas, numpy, matplotlib, seaborn), scikit-learn (classification algorithms, preprocessing, model evaluation), and handling imbalanced datasets.
Explore

Analyze credit features and class distribution

Preprocess

Encode features and handle imbalance

Model

Train multiple classification algorithms

Evaluate

Compare models with business metrics

Learning Objectives

Technical Skills
  • Handle categorical features with encoding techniques
  • Address class imbalance with SMOTE/class weights
  • Implement Logistic Regression, Random Forest, XGBoost
  • Evaluate with confusion matrix, ROC-AUC, precision-recall
  • Perform hyperparameter tuning with cross-validation
Business Skills
  • Understand cost of false positives vs false negatives
  • Choose appropriate threshold for business needs
  • Interpret feature importance for lending decisions
  • Present model recommendations to stakeholders
  • Consider fairness and ethical implications
02

Business Scenario

Apex Financial Services

You have been hired as a Data Scientist at Apex Financial Services, a lending institution that provides personal loans. The risk management team wants to automate the credit scoring process using machine learning. They have historical data on past loan applicants and their repayment behavior.

"We're losing money on bad loans and rejecting good applicants. We need a data-driven approach to credit scoring. Build us a model that can predict default risk accurately, but remember - rejecting a good customer costs us revenue, while approving a bad loan costs us the principal. Find the right balance."

Marcus Williams, Chief Risk Officer, Apex Financial

Questions to Answer

Risk Prediction
  • What features are most predictive of default?
  • Which model performs best for this task?
  • What is the optimal classification threshold?
  • How confident can we be in predictions?
Business Impact
  • What is the cost-benefit of each prediction type?
  • How much can we reduce default rates?
  • Which applicant segments are highest risk?
  • Are there fairness concerns in the model?
Pro Tip: In credit risk, a false negative (approving a bad loan) typically costs 5-10x more than a false positive (rejecting a good applicant). Consider this asymmetry when choosing your evaluation threshold!
03

The Dataset

You will work with the German Credit dataset, a classic dataset for learning credit risk modeling and classification techniques.

Dataset Download

Download the German Credit dataset from Kaggle or use our local copy for convenience.

Original Data Source

This project uses the German Credit Dataset from UCI ML Repository via Kaggle. Originally collected by Prof. Hans Hofmann, this dataset classifies loan applicants as good or bad credit risks based on 20 attributes including credit history, purpose, and personal status.

Dataset Info: 1,000 samples × 10 columns | 9 features + 1 target | Imbalanced: 700 good (70%) / 300 bad (30%) | Mix of numerical and categorical | Classic benchmark for credit scoring models
Dataset Schema

| Column | Type | Description |
|---|---|---|
| Age | Integer | Age in years (19-75) |
| Sex | String | Gender (male/female) |
| Job | Integer | Job type (0-3: unskilled to highly skilled) |
| Housing | String | Housing status (own/rent/free) |
| Saving accounts | String | Savings account level (little/moderate/quite rich/rich) |
| Checking account | String | Checking account status (little/moderate/rich) |
| Credit amount | Integer | Credit amount in DM (250-18424) |
| Duration | Integer | Duration of credit in months (4-72) |
| Purpose | String | Purpose of loan (car, furniture, education, etc.) |
| Risk | String | Target: good (700) / bad (300) |
Dataset Stats: 1000 applications, 10 columns, imbalanced classes (70/30 split), some missing values
Key Insight: Credit amount and duration are strong predictors - higher values increase risk
Sample Data Preview
| Age | Sex | Job | Housing | Credit amount | Duration | Purpose | Risk |
|---|---|---|---|---|---|---|---|
| 67 | male | 2 | own | 1169 | 6 | radio/TV | good |
| 22 | female | 2 | own | 5951 | 48 | radio/TV | bad |
| 49 | male | 1 | own | 2096 | 12 | education | good |
| 45 | male | 2 | free | 7882 | 42 | furniture | good |
| 53 | male | 2 | free | 4870 | 24 | car | bad |
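As a first step, load the CSV and confirm the shape, dtypes, and class balance match the stats above. The file path is an assumption; below, the five preview rows stand in for the full file so the sketch runs on its own:

```python
import pandas as pd

# In your notebook: df = pd.read_csv("data/german_credit.csv")
# For illustration, the five sample rows from the preview above:
df = pd.DataFrame({
    "Age": [67, 22, 49, 45, 53],
    "Sex": ["male", "female", "male", "male", "male"],
    "Job": [2, 2, 1, 2, 2],
    "Housing": ["own", "own", "own", "free", "free"],
    "Credit amount": [1169, 5951, 2096, 7882, 4870],
    "Duration": [6, 48, 12, 42, 24],
    "Purpose": ["radio/TV", "radio/TV", "education", "furniture", "car"],
    "Risk": ["good", "bad", "good", "good", "bad"],
})

print(df.shape)                                  # (rows, columns)
print(df.dtypes)                                 # numeric vs object columns
print(df["Risk"].value_counts(normalize=True))   # class balance check
```

On the full dataset you should see (1000, 10) and roughly 0.70 good / 0.30 bad.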
04

Project Requirements

Create a well-organized Jupyter notebook that covers all the following components with clear documentation and visualizations.

1
Exploratory Data Analysis
  • Load and inspect the German Credit dataset
  • Display dataset shape, dtypes, and descriptive statistics
  • Check for missing values and handle appropriately
  • Analyze target class distribution (good vs bad)
  • Create distribution plots for numerical features
  • Create count plots for categorical features
  • Generate correlation heatmap for numerical features
  • Analyze relationship between features and target
2
Data Preprocessing
  • Handle missing values (imputation or removal)
  • Encode categorical variables (LabelEncoder/OneHotEncoder)
  • Create new features if beneficial (e.g., credit_per_month)
  • Scale numerical features using StandardScaler
  • Split data into train (80%) and test (20%) sets
  • Address class imbalance (SMOTE, class_weight, or sampling)
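The preprocessing steps above can be sketched as follows. The frame is again a synthetic stand-in; the engineered `credit_per_month` feature and the `class_weight` route for imbalance are illustrative choices, not the only valid ones:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the loaded dataset
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "Age": rng.integers(19, 76, 300),
    "Credit amount": rng.integers(250, 18425, 300),
    "Duration": rng.integers(4, 73, 300),
    "Housing": rng.choice(["own", "rent", "free"], 300),
    "Risk": rng.choice(["good", "bad"], 300, p=[0.7, 0.3]),
})

# Example engineered feature: monthly repayment burden
df["credit_per_month"] = df["Credit amount"] / df["Duration"]

# One-hot encode categoricals; map the target to 1 = bad (the event we predict)
X = pd.get_dummies(df.drop(columns="Risk"), drop_first=True)
y = (df["Risk"] == "bad").astype(int)

# Stratified 80/20 split preserves the 70/30 imbalance in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_train, X_test = X_train.copy(), X_test.copy()

# Scale numeric columns, fitting on the training set only to avoid leakage
num_cols = ["Age", "Credit amount", "Duration", "credit_per_month"]
scaler = StandardScaler().fit(X_train[num_cols])
X_train[num_cols] = scaler.transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])

# Imbalance: pass class_weight='balanced' to the models downstream,
# or resample the training set with imblearn's SMOTE if installed.
```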
3
Model Training
  • Train at least 3 different models:
      ◦ Logistic Regression (baseline)
      ◦ Random Forest Classifier
      ◦ XGBoost or Gradient Boosting
  • Use cross-validation (5-fold) for model evaluation
  • Perform hyperparameter tuning (GridSearchCV or RandomizedSearchCV)
  • Document best parameters for each model
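The training loop might look like the sketch below. `make_classification` is a placeholder for your preprocessed `X_train`/`y_train`, and sklearn's `GradientBoostingClassifier` stands in for XGBoost (the brief allows either); the grid is deliberately tiny and should be expanded in practice:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

# Placeholder data with the same 70/30 imbalance as the real dataset
X, y = make_classification(n_samples=500, n_features=10, weights=[0.7],
                           random_state=42)

models = {
    "logreg": LogisticRegression(class_weight="balanced", max_iter=1000),
    "rf": RandomForestClassifier(n_estimators=100, class_weight="balanced",
                                 random_state=42),
    "gb": GradientBoostingClassifier(random_state=42),  # or xgboost.XGBClassifier
}

# 5-fold cross-validated ROC-AUC for each candidate
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: AUC = {scores.mean():.3f} +/- {scores.std():.3f}")

# Small hyperparameter grid for the random forest (expand in practice)
grid = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=42),
    param_grid={"n_estimators": [100, 200], "max_depth": [10, 20]},
    cv=5, scoring="roc_auc")
grid.fit(X, y)
print("best params:", grid.best_params_)
```

Record `grid.best_params_` for each model in your notebook so the tuning is documented.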
4
Model Evaluation
  • Generate confusion matrix for each model
  • Calculate accuracy, precision, recall, F1-score
  • Plot ROC curves and calculate AUC scores
  • Plot Precision-Recall curves
  • Compare models in a summary table
  • Select best model with justification
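One way to build the summary table is to loop over fitted models and collect each metric into a DataFrame. Again the data is a synthetic placeholder and only two models are shown to keep the sketch short:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, weights=[0.7],
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y,
                                          random_state=42)

rows = []
for name, model in {
    "logreg": LogisticRegression(class_weight="balanced", max_iter=1000),
    "rf": RandomForestClassifier(class_weight="balanced", random_state=42),
}.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    proba = model.predict_proba(X_te)[:, 1]
    print(name, "\n", confusion_matrix(y_te, pred))  # TN FP / FN TP
    rows.append({"model": name,
                 "accuracy": accuracy_score(y_te, pred),
                 "precision": precision_score(y_te, pred),
                 "recall": recall_score(y_te, pred),
                 "f1": f1_score(y_te, pred),
                 "auc": roc_auc_score(y_te, proba)})

summary = pd.DataFrame(rows).set_index("model").round(3)
print(summary)  # side-by-side model comparison table
```

The `summary` table doubles as the model-comparison section of your README.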
5
Feature Importance & Insights
  • Extract and visualize feature importances
  • Identify top 10 most predictive features
  • Analyze which factors increase default risk
  • Create risk profiles for different customer segments
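Tree ensembles expose importances directly via `feature_importances_`. The sketch below uses placeholder data and generic feature names; substitute the column names from your preprocessed frame:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder data; use your preprocessed X, y and real column names
X, y = make_classification(n_samples=500, n_features=8, n_informative=4,
                           random_state=42)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)

# Rank features by impurity-based (Gini) importance
imp = pd.Series(rf.feature_importances_,
                index=feature_names).sort_values(ascending=False)
print(imp.head(10))        # top 10 most predictive features
# imp.head(10).plot.barh() # bar chart for the notebook
```

For logistic regression, the analogous ranking comes from the magnitude of `coef_` on the scaled features; permutation importance works for any model.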
6
Business Recommendations
  • Recommend optimal classification threshold
  • Estimate cost savings from model deployment
  • Provide lending guidelines based on model insights
  • Discuss model limitations and ethical considerations
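Threshold selection can be framed directly in cost terms. Following the pro tip's 5-10x asymmetry, the sketch below sweeps thresholds to minimize total cost under an assumed FN:FP cost ratio of 5:1 (synthetic data; calibrate the costs to real loan economics):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, weights=[0.7],
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y,
                                          random_state=42)
proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

# Illustrative assumption: a missed default (FN) costs 5x a wrongly
# rejected good applicant (FP)
COST_FP, COST_FN = 1, 5

best_t, best_cost = 0.5, float("inf")
for t in np.arange(0.05, 0.95, 0.05):
    pred = (proba >= t).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_te, pred).ravel()
    cost = fp * COST_FP + fn * COST_FN
    if cost < best_cost:
        best_t, best_cost = t, cost

print(f"cost-minimizing threshold: {best_t:.2f} (total cost {best_cost})")
```

With FN costs dominating, the cost-minimizing threshold typically sits below 0.5, trading some precision for higher recall on defaulters.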
05

Model Specifications

Implement these classification algorithms and evaluation metrics to ensure your analysis is thorough and industry-standard.

Logistic Regression
  • Purpose: Baseline model
  • Library: sklearn.linear_model
  • Key param: class_weight='balanced'
  • Regularization: C=1.0 (tune)
  • Solver: 'lbfgs' or 'liblinear'
Random Forest
  • Purpose: Ensemble model
  • Library: sklearn.ensemble
  • n_estimators: 100-500
  • max_depth: 10-30 (tune)
  • Feature importance: Built-in
XGBoost
  • Purpose: Advanced ensemble
  • Library: xgboost
  • scale_pos_weight: Handle imbalance
  • learning_rate: 0.01-0.3
  • n_estimators: 100-1000
Evaluation Metrics
Confusion Matrix

TP, TN, FP, FN counts for each model

Precision & Recall

Critical for imbalanced classification

ROC-AUC

Area under ROC curve (aim for >0.75)

F1-Score

Harmonic mean of precision & recall

Class Imbalance: The dataset has 70% good and 30% bad credit. Use techniques like SMOTE, class_weight='balanced', or adjust the classification threshold to handle this imbalance appropriately.
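For the `scale_pos_weight` parameter mentioned above, the usual rule of thumb is the ratio of negative to positive samples, which for this dataset's 70/30 split works out as:

```python
import numpy as np

# 700 good (label 0) vs 300 bad (label 1), as in this dataset
y = np.array([0] * 700 + [1] * 300)
neg, pos = np.bincount(y)
scale_pos_weight = neg / pos
print(scale_pos_weight)  # roughly 2.33
```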
06

Required Visualizations

Create at least 12 visualizations in your notebook. Each visualization should have clear titles, labels, and annotations.

EDA Visualizations
  • Target class distribution (good vs bad)
  • Age distribution histogram
  • Credit amount distribution
  • Duration distribution
  • Categorical feature counts (housing, purpose)
  • Correlation heatmap
  • Box plots by risk category
Model Visualizations
  • Confusion matrices (for all models)
  • ROC curves (all models on same plot)
  • Precision-Recall curves
  • Feature importance bar chart
  • Model comparison bar chart (metrics)
  • Learning curves (optional)
Design Tip: Use a consistent color scheme - green for "good" credit and red for "bad" credit throughout your visualizations.
07

Submission Requirements

Create a public GitHub repository with the exact name shown below:

Required Repository Name
credit-risk-ml
github.com/<your-username>/credit-risk-ml
Required Project Structure
credit-risk-ml/
├── data/
│   └── german_credit.csv           # Dataset
├── notebooks/
│   └── credit_risk_analysis.ipynb  # Main analysis notebook
├── visualizations/
│   ├── confusion_matrix.png        # Confusion matrices
│   ├── roc_curves.png              # ROC curve comparison
│   ├── feature_importance.png      # Top features
│   └── model_comparison.png        # Metrics comparison
├── models/                         # (Optional) Saved models
│   └── best_model.pkl
├── requirements.txt                # Python dependencies
└── README.md                       # Project documentation
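A minimal requirements.txt for this stack might look like the following (unpinned here; pin the exact versions you actually used):

```text
pandas
numpy
matplotlib
seaborn
scikit-learn
xgboost
imbalanced-learn
jupyter
```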
README.md Required Sections
  • Project Title and Description
  • Your name and submission date
  • Dataset description (source, features)
  • Technologies used (Python, sklearn, xgboost)
  • Model comparison results (table format)
  • Best model and its performance
  • Business recommendations
  • How to run the notebook
Submit Your Project

Enter your GitHub username - we will verify your repository automatically

08

Grading Rubric

Your project will be graded on the following criteria. Total: 350 points.

| Criteria | Points | Description |
|---|---|---|
| Exploratory Data Analysis | 50 | Thorough exploration with descriptive statistics and visualizations |
| Data Preprocessing | 40 | Proper encoding, scaling, and handling of missing values |
| Class Imbalance Handling | 30 | Appropriate technique to address 70/30 class distribution |
| Model Training | 50 | At least 3 models with hyperparameter tuning |
| Model Evaluation | 50 | Comprehensive metrics, ROC curves, and comparison |
| Feature Analysis | 30 | Feature importance and business-relevant insights |
| Visualizations | 50 | At least 12 clear, labeled visualizations |
| Documentation | 50 | README, code comments, business recommendations |
| Total | 350 | |
Grading Levels
Excellent
315-350

Exceeds all requirements

Good
262-314

Meets all requirements

Satisfactory
210-261

Meets minimum requirements

Needs Work
< 210

Missing key requirements

Ready to Submit?

Make sure you have completed all requirements and reviewed the grading rubric above.

Submit Your Project