Project Overview
This project focuses on binary classification for credit risk prediction. You will work with the German Credit Dataset containing 1,000 loan applications with 20 features including credit history, employment, loan purpose, and personal attributes. Your goal is to build a model that predicts whether an applicant is a good or bad credit risk.
The project follows a four-stage workflow:
1. Explore: analyze credit features and class distribution
2. Preprocess: encode features and handle imbalance
3. Model: train multiple classification algorithms
4. Evaluate: compare models with business metrics
Learning Objectives
Technical Skills
- Handle categorical features with encoding techniques
- Address class imbalance with SMOTE/class weights
- Implement Logistic Regression, Random Forest, XGBoost
- Evaluate with confusion matrix, ROC-AUC, precision-recall
- Perform hyperparameter tuning with cross-validation
Business Skills
- Understand cost of false positives vs false negatives
- Choose appropriate threshold for business needs
- Interpret feature importance for lending decisions
- Present model recommendations to stakeholders
- Consider fairness and ethical implications
Business Scenario
Apex Financial Services
You have been hired as a Data Scientist at Apex Financial Services, a lending institution that provides personal loans. The risk management team wants to automate the credit scoring process using machine learning. They have historical data on past loan applicants and their repayment behavior.
"We're losing money on bad loans and rejecting good applicants. We need a data-driven approach to credit scoring. Build us a model that can predict default risk accurately, but remember - rejecting a good customer costs us revenue, while approving a bad loan costs us the principal. Find the right balance."
Questions to Answer
- What features are most predictive of default?
- Which model performs best for this task?
- What is the optimal classification threshold?
- How confident can we be in predictions?
- What is the cost-benefit of each prediction type?
- How much can we reduce default rates?
- Which applicant segments are highest risk?
- Are there fairness concerns in the model?
The Dataset
You will work with the German Credit dataset, a classic dataset for learning credit risk modeling and classification techniques. Download from Kaggle or use the local copy:
Original Data Source
This project uses the German Credit Dataset from UCI ML Repository via Kaggle. Originally collected by Prof. Hans Hofmann, this dataset classifies loan applicants as good or bad credit risks based on 20 attributes including credit history, purpose, and personal status.
Dataset Schema
| Column | Type | Description |
|---|---|---|
| Age | Integer | Age in years (19-75) |
| Sex | String | Gender (male/female) |
| Job | Integer | Job type (0-3: unskilled to highly skilled) |
| Housing | String | Housing status (own/rent/free) |
| Saving accounts | String | Savings account level (little/moderate/quite rich/rich) |
| Checking account | String | Checking account status (little/moderate/rich) |
| Credit amount | Integer | Credit amount in DM (250-18424) |
| Duration | Integer | Duration of credit in months (4-72) |
| Purpose | String | Purpose of loan (car, furniture, education, etc.) |
| Risk | String | Target: good (700 cases) / bad (300 cases) |
Sample Data Preview
| Age | Sex | Job | Housing | Credit amount | Duration | Purpose | Risk |
|---|---|---|---|---|---|---|---|
| 67 | male | 2 | own | 1169 | 6 | radio/TV | good |
| 22 | female | 2 | own | 5951 | 48 | radio/TV | bad |
| 49 | male | 1 | own | 2096 | 12 | education | good |
| 45 | male | 2 | free | 7882 | 42 | furniture | good |
| 53 | male | 2 | free | 4870 | 24 | car | bad |
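For orientation, the preview above can be reproduced as a small pandas DataFrame. This is only the five sample rows shown, not the full 1,000-row dataset, which in the project lives at `data/german_credit.csv`:

```python
import pandas as pd

# The five preview rows from the table above (not the full dataset).
sample = pd.DataFrame(
    [
        [67, "male", 2, "own", 1169, 6, "radio/TV", "good"],
        [22, "female", 2, "own", 5951, 48, "radio/TV", "bad"],
        [49, "male", 1, "own", 2096, 12, "education", "good"],
        [45, "male", 2, "free", 7882, 42, "furniture", "good"],
        [53, "male", 2, "free", 4870, 24, "car", "bad"],
    ],
    columns=["Age", "Sex", "Job", "Housing", "Credit amount",
             "Duration", "Purpose", "Risk"],
)

print(sample.shape)                    # (5, 8)
print(sample["Risk"].value_counts())   # 3 good, 2 bad in this preview
```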
Project Requirements
Create a well-organized Jupyter notebook that covers all the following components with clear documentation and visualizations.
Exploratory Data Analysis
- Load and inspect the German Credit dataset
- Display dataset shape, dtypes, and descriptive statistics
- Check for missing values and handle appropriately
- Analyze target class distribution (good vs bad)
- Create distribution plots for numerical features
- Create count plots for categorical features
- Generate correlation heatmap for numerical features
- Analyze relationship between features and target
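The inspection steps in this checklist map directly onto a few pandas calls. A tiny stand-in frame is used below so the sketch runs on its own; in the notebook you would load `data/german_credit.csv` instead:

```python
import pandas as pd

# In the notebook: df = pd.read_csv("data/german_credit.csv")
# A small stand-in frame keeps this sketch self-contained.
df = pd.DataFrame({
    "Age": [67, 22, 49, 45, 53],
    "Credit amount": [1169, 5951, 2096, 7882, 4870],
    "Duration": [6, 48, 12, 42, 24],
    "Risk": ["good", "bad", "good", "good", "bad"],
})

print(df.shape)            # dataset shape
print(df.dtypes)           # column types
print(df.describe())       # descriptive statistics
print(df.isna().sum())     # missing-value check per column

# Target class distribution (the full dataset is 700 good / 300 bad).
print(df["Risk"].value_counts(normalize=True))

# Correlation between numerical features (feeds the heatmap).
print(df.select_dtypes("number").corr())
```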
Data Preprocessing
- Handle missing values (imputation or removal)
- Encode categorical variables (LabelEncoder/OneHotEncoder)
- Create new features if beneficial (e.g., credit_per_month)
- Scale numerical features using StandardScaler
- Split data into train (80%) and test (20%) sets
- Address class imbalance (SMOTE, class_weight, or sampling)
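The preprocessing steps above can be sketched as follows. The frame and the engineered `credit_per_month` column are illustrative; imbalance is deferred to the models' `class_weight` option here (SMOTE from `imblearn` is the oversampling alternative, not shown):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy frame standing in for the German Credit data.
df = pd.DataFrame({
    "Age": [67, 22, 49, 45, 53, 35, 28, 61, 25, 42],
    "Credit amount": [1169, 5951, 2096, 7882, 4870, 9055, 2835, 6948, 3059, 5234],
    "Duration": [6, 48, 12, 42, 24, 36, 24, 36, 12, 30],
    "Housing": ["own", "own", "own", "free", "free",
                "free", "own", "own", "own", "rent"],
    "Risk": ["good", "bad", "good", "good", "bad",
             "good", "good", "good", "bad", "bad"],
})

# Engineered feature: monthly repayment burden.
df["credit_per_month"] = df["Credit amount"] / df["Duration"]

# One-hot encode categoricals; map the target to 0/1 (bad = 1, the risk class).
X = pd.get_dummies(df.drop(columns="Risk"), columns=["Housing"], drop_first=True)
y = (df["Risk"] == "bad").astype(int)

# Stratified 80/20 split preserves the class ratio in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit the scaler on the training set only, to avoid test-set leakage.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

Fitting the scaler before the split would leak test-set statistics into training, which is why the split comes first.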
Model Training
- Train at least 3 different models:
- Logistic Regression (baseline)
- Random Forest Classifier
- XGBoost or Gradient Boosting
- Use cross-validation (5-fold) for model evaluation
- Perform hyperparameter tuning (GridSearchCV or RandomizedSearchCV)
- Document best parameters for each model
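A minimal sketch of this training loop, using synthetic 70/30 data in place of the real dataset. `GradientBoostingClassifier` stands in for XGBoost (the brief allows either), and the grid is kept small for speed:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic imbalanced data mimicking the 70/30 credit target.
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.7, 0.3], random_state=42)

models = {
    "logreg": LogisticRegression(class_weight="balanced", max_iter=1000),
    "rf": RandomForestClassifier(n_estimators=100, random_state=42),
    "gb": GradientBoostingClassifier(random_state=42),  # stand-in for XGBoost
}

# 5-fold cross-validated ROC-AUC for each candidate model.
cv_scores = {name: cross_val_score(m, X, y, cv=5, scoring="roc_auc").mean()
             for name, m in models.items()}
print(cv_scores)

# Hyperparameter tuning on the baseline; document grid.best_params_.
grid = GridSearchCV(
    LogisticRegression(class_weight="balanced", max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5, scoring="roc_auc",
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

`RandomizedSearchCV` has the same interface and is the better choice once the grid grows beyond a handful of combinations.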
Model Evaluation
- Generate confusion matrix for each model
- Calculate accuracy, precision, recall, F1-score
- Plot ROC curves and calculate AUC scores
- Plot Precision-Recall curves
- Compare models in a summary table
- Select best model with justification
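The core evaluation calls look like this for one model (repeat per model and collect the numbers into the summary table); the data here is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (classification_report, confusion_matrix,
                             roc_auc_score)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.7, 0.3], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]   # positive-class probabilities

print(confusion_matrix(y_test, y_pred))       # sklearn layout: [[TN FP] [FN TP]]
print(classification_report(y_test, y_pred))  # accuracy, precision, recall, F1
print("ROC-AUC:", round(roc_auc_score(y_test, y_prob), 3))
```

Note that ROC-AUC is computed from the predicted probabilities, not the hard labels; passing `y_pred` instead of `y_prob` is a common mistake that understates the score.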
Feature Importance & Insights
- Extract and visualize feature importances
- Identify top 10 most predictive features
- Analyze which factors increase default risk
- Create risk profiles for different customer segments
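Extracting the top-10 importances from a fitted tree ensemble is a one-liner with pandas; the feature names below are placeholders for the real encoded column names:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
feature_names = [f"feature_{i}" for i in range(20)]  # placeholder names

model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Built-in importances sum to 1.0; sort descending and keep the top 10.
importances = pd.Series(model.feature_importances_, index=feature_names)
top10 = importances.sort_values(ascending=False).head(10)
print(top10)
# top10.plot.barh() gives the feature-importance bar chart for the notebook.
```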
Business Recommendations
- Recommend optimal classification threshold
- Estimate cost savings from model deployment
- Provide lending guidelines based on model insights
- Discuss model limitations and ethical considerations
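The threshold recommendation follows from the scenario's asymmetric costs: a false negative (approved defaulter) loses the principal, while a false positive (rejected good applicant) loses only the expected revenue. The 5:1 cost ratio below is a made-up illustration, and the data is synthetic; with real cost figures the same sweep applies:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

COST_FN = 5.0   # assumed cost of approving a defaulter (lost principal)
COST_FP = 1.0   # assumed cost of rejecting a good customer (lost revenue)

X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.7, 0.3], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=42)
prob = (LogisticRegression(max_iter=1000)
        .fit(X_tr, y_tr).predict_proba(X_te)[:, 1])

# Sweep thresholds and pick the one with the lowest total expected cost.
thresholds = np.linspace(0.05, 0.95, 19)
costs = []
for t in thresholds:
    pred = (prob >= t).astype(int)
    fp = int(((pred == 1) & (y_te == 0)).sum())
    fn = int(((pred == 0) & (y_te == 1)).sum())
    costs.append(COST_FP * fp + COST_FN * fn)

best_t = thresholds[int(np.argmin(costs))]
print("cost-optimal threshold:", round(best_t, 2))
```

Because false negatives are costed higher, the optimal threshold typically lands below the default 0.5, trading more rejected good applicants for fewer approved defaulters.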
Model Specifications
Implement these classification algorithms and evaluation metrics to ensure your analysis is thorough and industry-standard.
Logistic Regression
- Purpose: Baseline model
- Library: sklearn.linear_model
- Key param: class_weight='balanced'
- Regularization: C=1.0 (tune)
- Solver: 'lbfgs' or 'liblinear'

Random Forest
- Purpose: Ensemble model
- Library: sklearn.ensemble
- n_estimators: 100-500
- max_depth: 10-30 (tune)
- Feature importance: Built-in

XGBoost
- Purpose: Advanced ensemble
- Library: xgboost
- scale_pos_weight: Handle imbalance
- learning_rate: 0.01-0.3
- n_estimators: 100-1000
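These specifications translate directly into constructor calls. The concrete values below are sample picks from the stated tuning ranges, and sklearn's `GradientBoostingClassifier` is used in place of XGBoost so the snippet has no extra dependency (with xgboost installed, `XGBClassifier(scale_pos_weight=...)` is the analogue for the imbalance handling):

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Baseline: logistic regression with balanced class weights.
logreg = LogisticRegression(C=1.0, class_weight="balanced",
                            solver="lbfgs", max_iter=1000)

# Ensemble: random forest, n_estimators and max_depth within the stated ranges.
rf = RandomForestClassifier(n_estimators=300, max_depth=10,
                            class_weight="balanced", random_state=42)

# Advanced ensemble: gradient boosting as the XGBoost stand-in.
gb = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1,
                                random_state=42)

print(type(logreg).__name__, type(rf).__name__, type(gb).__name__)
```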
Evaluation Metrics
- Confusion Matrix: TP, TN, FP, FN counts for each model
- Precision & Recall: critical for imbalanced classification
- ROC-AUC: area under the ROC curve (aim for >0.75)
- F1-Score: harmonic mean of precision and recall
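All of these scalar metrics follow directly from the four confusion-matrix counts; with made-up counts for illustration:

```python
# Illustrative counts (not from any real model run).
TP, TN, FP, FN = 50, 120, 20, 10

accuracy  = (TP + TN) / (TP + TN + FP + FN)        # fraction correct overall
precision = TP / (TP + FP)                         # of predicted positives, how many are real
recall    = TP / (TP + FN)                         # of real positives, how many are found
f1        = 2 * precision * recall / (precision + recall)  # harmonic mean

print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(f1, 3))
```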
Required Visualizations
Create at least 12 visualizations in your notebook. Each visualization should have clear titles, labels, and annotations.
EDA Visualizations
- Target class distribution (good vs bad)
- Age distribution histogram
- Credit amount distribution
- Duration distribution
- Categorical feature counts (housing, purpose)
- Correlation heatmap
- Box plots by risk category
Model Visualizations
- Confusion matrices (for all models)
- ROC curves (all models on same plot)
- Precision-Recall curves
- Feature importance bar chart
- Model comparison bar chart (metrics)
- Learning curves (optional)
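One of the trickier items on this list is the combined ROC plot; a sketch with two models on synthetic data (the headless `Agg` backend and the output filename are choices for this snippet, not requirements):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.7, 0.3], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=42)

fig, ax = plt.subplots(figsize=(6, 5))
for name, model in [("Logistic Regression", LogisticRegression(max_iter=1000)),
                    ("Random Forest", RandomForestClassifier(random_state=42))]:
    prob = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    fpr, tpr, _ = roc_curve(y_te, prob)
    ax.plot(fpr, tpr, label=f"{name} (AUC = {auc(fpr, tpr):.2f})")

ax.plot([0, 1], [0, 1], "k--", label="Chance")  # diagonal = random classifier
ax.set(title="ROC Curves", xlabel="False Positive Rate",
       ylabel="True Positive Rate")
ax.legend()
fig.savefig("roc_curves.png", dpi=150)
```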
Submission Requirements
Create a public GitHub repository with the exact name shown below:
Required Repository Name
credit-risk-ml
Required Project Structure
credit-risk-ml/
├── data/
│ └── german_credit.csv # Dataset
├── notebooks/
│ └── credit_risk_analysis.ipynb # Main analysis notebook
├── visualizations/
│ ├── confusion_matrix.png # Confusion matrices
│ ├── roc_curves.png # ROC curve comparison
│ ├── feature_importance.png # Top features
│ └── model_comparison.png # Metrics comparison
├── models/ # (Optional) Saved models
│ └── best_model.pkl
├── requirements.txt # Python dependencies
└── README.md # Project documentation
README.md Required Sections
- Project Title and Description
- Your name and submission date
- Dataset description (source, features)
- Technologies used (Python, sklearn, xgboost)
- Model comparison results (table format)
- Best model and its performance
- Business recommendations
- How to run the notebook
Enter your GitHub username when you submit; your repository will be verified automatically.
Grading Rubric
Your project will be graded on the following criteria. Total: 350 points.
| Criteria | Points | Description |
|---|---|---|
| Exploratory Data Analysis | 50 | Thorough exploration with descriptive statistics and visualizations |
| Data Preprocessing | 40 | Proper encoding, scaling, and handling of missing values |
| Class Imbalance Handling | 30 | Appropriate technique to address 70/30 class distribution |
| Model Training | 50 | At least 3 models with hyperparameter tuning |
| Model Evaluation | 50 | Comprehensive metrics, ROC curves, and comparison |
| Feature Analysis | 30 | Feature importance and business-relevant insights |
| Visualizations | 50 | At least 12 clear, labeled visualizations |
| Documentation | 50 | README, code comments, business recommendations |
| Total | 350 | |
Grading Levels
- Excellent: exceeds all requirements
- Good: meets all requirements
- Satisfactory: meets minimum requirements
- Needs Work: missing key requirements
Ready to Submit?
Make sure you have completed all requirements and reviewed the grading rubric above.