Project Overview
This project focuses on binary classification for credit risk prediction. You will work with the German Credit Dataset containing 1,000 loan applications with 20 features including credit history, employment, loan purpose, and personal attributes. Your goal is to build a model that predicts whether an applicant is a good or bad credit risk.
The project follows a four-stage workflow:
1. Explore: analyze credit features and class distribution
2. Preprocess: encode features and handle imbalance
3. Model: train multiple classification algorithms
4. Evaluate: compare models with business metrics
Learning Objectives
Technical Skills
- Handle categorical features with encoding techniques
- Address class imbalance with SMOTE/class weights
- Implement Logistic Regression, Random Forest, XGBoost
- Evaluate with confusion matrix, ROC-AUC, precision-recall
- Perform hyperparameter tuning with cross-validation
Business Skills
- Understand cost of false positives vs false negatives
- Choose appropriate threshold for business needs
- Interpret feature importance for lending decisions
- Present model recommendations to stakeholders
- Consider fairness and ethical implications
Business Scenario
Apex Financial Services
You have been hired as a Data Scientist at Apex Financial Services, a lending institution that provides personal loans. The risk management team wants to automate the credit scoring process using machine learning. They have historical data on past loan applicants and their repayment behavior.
"We're losing money on bad loans and rejecting good applicants. We need a data-driven approach to credit scoring. Build us a model that can predict default risk accurately, but remember - rejecting a good customer costs us revenue, while approving a bad loan costs us the principal. Find the right balance."
Questions to Answer
- What features are most predictive of default?
- Which model performs best for this task?
- What is the optimal classification threshold?
- How confident can we be in predictions?
- What is the cost-benefit of each prediction type?
- How much can we reduce default rates?
- Which applicant segments are highest risk?
- Are there fairness concerns in the model?
The Dataset
You will work with the German Credit dataset, a classic dataset for learning credit risk modeling and classification techniques. Download from Kaggle or use the local copy:
Original Data Source
This project uses the German Credit Dataset from UCI ML Repository via Kaggle. Originally collected by Prof. Hans Hofmann, this dataset classifies loan applicants as good or bad credit risks based on 20 attributes including credit history, purpose, and personal status.
Dataset Schema
| Column | Type | Description |
|---|---|---|
| Age | Integer | Age in years (19-75) |
| Sex | String | Gender (male/female) |
| Job | Integer | Job type (0-3: unskilled to highly skilled) |
| Housing | String | Housing status (own/rent/free) |
| Saving accounts | String | Savings account level (little/moderate/quite rich/rich) |
| Checking account | String | Checking account status (little/moderate/rich) |
| Credit amount | Integer | Credit amount in DM (250-18424) |
| Duration | Integer | Duration of credit in months (4-72) |
| Purpose | String | Purpose of loan (car, furniture, education, etc.) |
| Risk | String | Target: good (700 cases) / bad (300 cases) |
Sample Data Preview
| Age | Sex | Job | Housing | Credit amount | Duration | Purpose | Risk |
|---|---|---|---|---|---|---|---|
| 67 | male | 2 | own | 1169 | 6 | radio/TV | good |
| 22 | female | 2 | own | 5951 | 48 | radio/TV | bad |
| 49 | male | 1 | own | 2096 | 12 | education | good |
| 45 | male | 2 | free | 7882 | 42 | furniture | good |
| 53 | male | 2 | free | 4870 | 24 | car | bad |
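For orientation, the preview above can be reproduced as a small pandas DataFrame. This is only the five sample rows shown, not the full 1,000-row dataset, which in the project lives at `data/german_credit.csv`:

```python
import pandas as pd

# The five preview rows from the table above (not the full dataset).
sample = pd.DataFrame(
    [
        [67, "male", 2, "own", 1169, 6, "radio/TV", "good"],
        [22, "female", 2, "own", 5951, 48, "radio/TV", "bad"],
        [49, "male", 1, "own", 2096, 12, "education", "good"],
        [45, "male", 2, "free", 7882, 42, "furniture", "good"],
        [53, "male", 2, "free", 4870, 24, "car", "bad"],
    ],
    columns=["Age", "Sex", "Job", "Housing", "Credit amount",
             "Duration", "Purpose", "Risk"],
)

print(sample.shape)                    # (5, 8)
print(sample["Risk"].value_counts())   # 3 good, 2 bad in this preview
```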
Project Requirements
Create a well-organized Jupyter notebook that covers all the following components with clear documentation and visualizations.
Exploratory Data Analysis
- Load and inspect the German Credit dataset
- Display dataset shape, dtypes, and descriptive statistics
- Check for missing values and handle appropriately
- Analyze target class distribution (good vs bad)
- Create distribution plots for numerical features
- Create count plots for categorical features
- Generate correlation heatmap for numerical features
- Analyze relationship between features and target
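The inspection steps in this checklist map directly onto a few pandas calls. A tiny stand-in frame is used below so the sketch runs on its own; in the notebook you would load `data/german_credit.csv` instead:

```python
import pandas as pd

# In the notebook: df = pd.read_csv("data/german_credit.csv")
# A small stand-in frame keeps this sketch self-contained.
df = pd.DataFrame({
    "Age": [67, 22, 49, 45, 53],
    "Credit amount": [1169, 5951, 2096, 7882, 4870],
    "Duration": [6, 48, 12, 42, 24],
    "Risk": ["good", "bad", "good", "good", "bad"],
})

print(df.shape)            # dataset shape
print(df.dtypes)           # column types
print(df.describe())       # descriptive statistics
print(df.isna().sum())     # missing-value check per column

# Target class distribution (the full dataset is 700 good / 300 bad).
print(df["Risk"].value_counts(normalize=True))

# Correlation between numerical features (feeds the heatmap).
print(df.select_dtypes("number").corr())
```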
Data Preprocessing
- Handle missing values (imputation or removal)
- Encode categorical variables (LabelEncoder/OneHotEncoder)
- Create new features if beneficial (e.g., credit_per_month)
- Scale numerical features using StandardScaler
- Split data into train (80%) and test (20%) sets
- Address class imbalance (SMOTE, class_weight, or sampling)
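The preprocessing steps above can be sketched as follows. The frame and the engineered `credit_per_month` column are illustrative; imbalance is deferred to the models' `class_weight` option here (SMOTE from `imblearn` is the oversampling alternative, not shown):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy frame standing in for the German Credit data.
df = pd.DataFrame({
    "Age": [67, 22, 49, 45, 53, 35, 28, 61, 25, 42],
    "Credit amount": [1169, 5951, 2096, 7882, 4870, 9055, 2835, 6948, 3059, 5234],
    "Duration": [6, 48, 12, 42, 24, 36, 24, 36, 12, 30],
    "Housing": ["own", "own", "own", "free", "free",
                "free", "own", "own", "own", "rent"],
    "Risk": ["good", "bad", "good", "good", "bad",
             "good", "good", "good", "bad", "bad"],
})

# Engineered feature: monthly repayment burden.
df["credit_per_month"] = df["Credit amount"] / df["Duration"]

# One-hot encode categoricals; map the target to 0/1 (bad = 1, the risk class).
X = pd.get_dummies(df.drop(columns="Risk"), columns=["Housing"], drop_first=True)
y = (df["Risk"] == "bad").astype(int)

# Stratified 80/20 split preserves the class ratio in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit the scaler on the training set only, to avoid test-set leakage.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

Fitting the scaler before the split would leak test-set statistics into training, which is why the split comes first.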
Model Training
- Train at least 3 different models:
- Logistic Regression (baseline)
- Random Forest Classifier
- XGBoost or Gradient Boosting
- Use cross-validation (5-fold) for model evaluation
- Perform hyperparameter tuning (GridSearchCV or RandomizedSearchCV)
- Document best parameters for each model
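A minimal sketch of this training loop, using synthetic 70/30 data in place of the real dataset. `GradientBoostingClassifier` stands in for XGBoost (the brief allows either), and the grid is kept small for speed:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic imbalanced data mimicking the 70/30 credit target.
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.7, 0.3], random_state=42)

models = {
    "logreg": LogisticRegression(class_weight="balanced", max_iter=1000),
    "rf": RandomForestClassifier(n_estimators=100, random_state=42),
    "gb": GradientBoostingClassifier(random_state=42),  # stand-in for XGBoost
}

# 5-fold cross-validated ROC-AUC for each candidate model.
cv_scores = {name: cross_val_score(m, X, y, cv=5, scoring="roc_auc").mean()
             for name, m in models.items()}
print(cv_scores)

# Hyperparameter tuning on the baseline; document grid.best_params_.
grid = GridSearchCV(
    LogisticRegression(class_weight="balanced", max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5, scoring="roc_auc",
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

`RandomizedSearchCV` has the same interface and is the better choice once the grid grows beyond a handful of combinations.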
Model Evaluation
- Generate confusion matrix for each model
- Calculate accuracy, precision, recall, F1-score
- Plot ROC curves and calculate AUC scores
- Plot Precision-Recall curves
- Compare models in a summary table
- Select best model with justification
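The core evaluation calls look like this for one model (repeat per model and collect the numbers into the summary table); the data here is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (classification_report, confusion_matrix,
                             roc_auc_score)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.7, 0.3], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]   # positive-class probabilities

print(confusion_matrix(y_test, y_pred))       # sklearn layout: [[TN FP] [FN TP]]
print(classification_report(y_test, y_pred))  # accuracy, precision, recall, F1
print("ROC-AUC:", round(roc_auc_score(y_test, y_prob), 3))
```

Note that ROC-AUC is computed from the predicted probabilities, not the hard labels; passing `y_pred` instead of `y_prob` is a common mistake that understates the score.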
Feature Importance & Insights
- Extract and visualize feature importances
- Identify top 10 most predictive features
- Analyze which factors increase default risk
- Create risk profiles for different customer segments
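Extracting the top-10 importances from a fitted tree ensemble is a one-liner with pandas; the feature names below are placeholders for the real encoded column names:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
feature_names = [f"feature_{i}" for i in range(20)]  # placeholder names

model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Built-in importances sum to 1.0; sort descending and keep the top 10.
importances = pd.Series(model.feature_importances_, index=feature_names)
top10 = importances.sort_values(ascending=False).head(10)
print(top10)
# top10.plot.barh() gives the feature-importance bar chart for the notebook.
```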
Business Recommendations
- Recommend optimal classification threshold
- Estimate cost savings from model deployment
- Provide lending guidelines based on model insights
- Discuss model limitations and ethical considerations
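The threshold recommendation follows from the scenario's asymmetric costs: a false negative (approved defaulter) loses the principal, while a false positive (rejected good applicant) loses only the expected revenue. The 5:1 cost ratio below is a made-up illustration, and the data is synthetic; with real cost figures the same sweep applies:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

COST_FN = 5.0   # assumed cost of approving a defaulter (lost principal)
COST_FP = 1.0   # assumed cost of rejecting a good customer (lost revenue)

X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.7, 0.3], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=42)
prob = (LogisticRegression(max_iter=1000)
        .fit(X_tr, y_tr).predict_proba(X_te)[:, 1])

# Sweep thresholds and pick the one with the lowest total expected cost.
thresholds = np.linspace(0.05, 0.95, 19)
costs = []
for t in thresholds:
    pred = (prob >= t).astype(int)
    fp = int(((pred == 1) & (y_te == 0)).sum())
    fn = int(((pred == 0) & (y_te == 1)).sum())
    costs.append(COST_FP * fp + COST_FN * fn)

best_t = thresholds[int(np.argmin(costs))]
print("cost-optimal threshold:", round(best_t, 2))
```

Because false negatives are costed higher, the optimal threshold typically lands below the default 0.5, trading more rejected good applicants for fewer approved defaulters.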
Model Specifications
Implement these classification algorithms and evaluation metrics to ensure your analysis is thorough and industry-standard.
Logistic Regression
- Purpose: Baseline model
- Library: sklearn.linear_model
- Key param: class_weight='balanced'
- Regularization: C=1.0 (tune)
- Solver: 'lbfgs' or 'liblinear'

Random Forest
- Purpose: Ensemble model
- Library: sklearn.ensemble
- n_estimators: 100-500
- max_depth: 10-30 (tune)
- Feature importance: Built-in

XGBoost
- Purpose: Advanced ensemble
- Library: xgboost
- scale_pos_weight: Handle imbalance
- learning_rate: 0.01-0.3
- n_estimators: 100-1000
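These specifications translate directly into constructor calls. The concrete values below are sample picks from the stated tuning ranges, and sklearn's `GradientBoostingClassifier` is used in place of XGBoost so the snippet has no extra dependency (with xgboost installed, `XGBClassifier(scale_pos_weight=...)` is the analogue for the imbalance handling):

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Baseline: logistic regression with balanced class weights.
logreg = LogisticRegression(C=1.0, class_weight="balanced",
                            solver="lbfgs", max_iter=1000)

# Ensemble: random forest, n_estimators and max_depth within the stated ranges.
rf = RandomForestClassifier(n_estimators=300, max_depth=10,
                            class_weight="balanced", random_state=42)

# Advanced ensemble: gradient boosting as the XGBoost stand-in.
gb = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1,
                                random_state=42)

print(type(logreg).__name__, type(rf).__name__, type(gb).__name__)
```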
Evaluation Metrics
- Confusion Matrix: TP, TN, FP, FN counts for each model
- Precision & Recall: critical for imbalanced classification
- ROC-AUC: area under the ROC curve (aim for >0.75)
- F1-Score: harmonic mean of precision and recall
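All of these scalar metrics follow directly from the four confusion-matrix counts; with made-up counts for illustration:

```python
# Illustrative counts (not from any real model run).
TP, TN, FP, FN = 50, 120, 20, 10

accuracy  = (TP + TN) / (TP + TN + FP + FN)        # fraction correct overall
precision = TP / (TP + FP)                         # of predicted positives, how many are real
recall    = TP / (TP + FN)                         # of real positives, how many are found
f1        = 2 * precision * recall / (precision + recall)  # harmonic mean

print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(f1, 3))
```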
Required Visualizations
Create at least 12 visualizations in your notebook. Each visualization should have clear titles, labels, and annotations.
EDA Visualizations
- Target class distribution (good vs bad)
- Age distribution histogram
- Credit amount distribution
- Duration distribution
- Categorical feature counts (housing, purpose)
- Correlation heatmap
- Box plots by risk category
Model Visualizations
- Confusion matrices (for all models)
- ROC curves (all models on same plot)
- Precision-Recall curves
- Feature importance bar chart
- Model comparison bar chart (metrics)
- Learning curves (optional)
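One of the trickier items on this list is the combined ROC plot; a sketch with two models on synthetic data (the headless `Agg` backend and the output filename are choices for this snippet, not requirements):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.7, 0.3], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=42)

fig, ax = plt.subplots(figsize=(6, 5))
for name, model in [("Logistic Regression", LogisticRegression(max_iter=1000)),
                    ("Random Forest", RandomForestClassifier(random_state=42))]:
    prob = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    fpr, tpr, _ = roc_curve(y_te, prob)
    ax.plot(fpr, tpr, label=f"{name} (AUC = {auc(fpr, tpr):.2f})")

ax.plot([0, 1], [0, 1], "k--", label="Chance")  # diagonal = random classifier
ax.set(title="ROC Curves", xlabel="False Positive Rate",
       ylabel="True Positive Rate")
ax.legend()
fig.savefig("roc_curves.png", dpi=150)
```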
Submission Requirements
Create a public GitHub repository with the exact name shown below:
Required Repository Name
credit-risk-ml
Required Project Structure
credit-risk-ml/
├── data/
│ └── german_credit.csv # Dataset
├── notebooks/
│ └── credit_risk_analysis.ipynb # Main analysis notebook
├── visualizations/
│ ├── confusion_matrix.png # Confusion matrices
│ ├── roc_curves.png # ROC curve comparison
│ ├── feature_importance.png # Top features
│ └── model_comparison.png # Metrics comparison
├── models/ # (Optional) Saved models
│ └── best_model.pkl
├── requirements.txt # Python dependencies
└── README.md # Project documentation
README.md Required Sections
- Project Title and Description
- Your name and submission date
- Dataset description (source, features)
- Technologies used (Python, sklearn, xgboost)
- Model comparison results (table format)
- Best model and its performance
- Business recommendations
- How to run the notebook
Enter your GitHub username when you submit; your repository will be verified automatically.
Grading Rubric
Your project will be graded on the following criteria. Total: 350 points.
| Criteria | Points | Description |
|---|---|---|
| Exploratory Data Analysis | 50 | Thorough exploration with descriptive statistics and visualizations |
| Data Preprocessing | 40 | Proper encoding, scaling, and handling of missing values |
| Class Imbalance Handling | 30 | Appropriate technique to address 70/30 class distribution |
| Model Training | 50 | At least 3 models with hyperparameter tuning |
| Model Evaluation | 50 | Comprehensive metrics, ROC curves, and comparison |
| Feature Analysis | 30 | Feature importance and business-relevant insights |
| Visualizations | 50 | At least 12 clear, labeled visualizations |
| Documentation | 50 | README, code comments, business recommendations |
| Total | 350 | |
Grading Levels
- Excellent: exceeds all requirements
- Good: meets all requirements
- Satisfactory: meets minimum requirements
- Needs Work: missing key requirements
Ready to Submit?
Make sure you have completed all requirements and reviewed the grading rubric above.