Research Article

A Leakage-Aware Machine Learning Pipeline for Credit Default Prediction Using LightGBM

Authors

  • Abdullah Al Mamun, Department of Computer and Information Science, Gannon University, Erie, PA, USA
  • Md Shahiduzzaman, Business Analytics, Trine University, Angola, USA
  • Maria Kabtia, Master of Science in Business Analytics, Trine University, Angola, Indiana, USA
  • Mohammad Sazzad Hossain, Master’s in Business Analytics, Trine University
  • Samia Akter, Master of Science in Business Analytics, Trine University
  • Md Firoz Kabir, Master’s in Information Technology, University of the Cumberlands, USA

Abstract

Credit default prediction remains challenging because loan-outcome datasets are typically imbalanced, heterogeneous, and vulnerable to post-origination target leakage. This study proposes a leakage-aware and interpretable LightGBM-based credit-risk modelling framework for binary loan-status classification into Fully Paid and Charged Off/Default outcomes. The proposed workflow integrates rigorous target definition, removal of repayment-derived leakage variables, robust missing-value handling, outlier winsorisation, date-derived credit-history features, log-transformed monetary variables, affordability and utilisation ratios, mixed categorical encoding, FICO bucketisation, class-frequency reweighting, and mutual-information-based feature selection. A large lending dataset of 887,379 completed loans was analysed, comprising 725,223 Fully Paid loans and 162,156 Charged Off/Default loans. Data were stratified into training, validation, and holdout test sets using a 70/15/15 split, with additional stratified five-fold cross-validation repeated across five random seeds, yielding 25 validation runs. LightGBM was selected as the best-performing model after comparison with Logistic Regression, Random Forest, CatBoost, and XGBoost. It achieved the highest mean cross-validation AUC of 0.762 ± 0.004, outperforming XGBoost, CatBoost, Random Forest, and Logistic Regression, which obtained AUC values of 0.758 ± 0.004, 0.755 ± 0.004, 0.731 ± 0.005, and 0.708 ± 0.005, respectively. On the independent holdout test set, LightGBM achieved an AUC of 0.764, accuracy of 0.853, default sensitivity of 0.730, specificity of 0.880, default F1-score of 0.644, positive predictive value of 0.576, negative predictive value of 0.936, and Brier score of 0.124. Feature-importance and SHAP-direction analysis identified interest rate, sub-grade, debt-to-income ratio, annual income, and FICO range as the dominant risk drivers.
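
The threshold-dependent holdout figures quoted above are linked by simple arithmetic: given sensitivity, specificity, and the default rate, the accuracy, positive and negative predictive values, and F1-score follow directly. The sketch below illustrates that relationship using the numbers reported in the abstract. It is not the authors' code; the function name `derived_metrics` and its arguments are hypothetical, and it assumes the stratified holdout set keeps the same default rate as the full dataset (162,156 of 887,379 loans).

```python
# Illustrative check, not the authors' code: derive accuracy, PPV, NPV and F1
# from sensitivity, specificity and the default rate reported in the abstract.

def derived_metrics(sensitivity, specificity, prevalence):
    """Return accuracy, PPV, NPV and F1 for the positive (default) class.

    All confusion-matrix cells are expressed as fractions of the whole sample.
    """
    tp = sensitivity * prevalence                # defaults correctly flagged
    fn = (1 - sensitivity) * prevalence          # defaults missed
    tn = specificity * (1 - prevalence)          # fully paid loans correctly cleared
    fp = (1 - specificity) * (1 - prevalence)    # fully paid loans wrongly flagged
    ppv = tp / (tp + fp)
    return {
        "accuracy": tp + tn,
        "ppv": ppv,
        "npv": tn / (tn + fn),
        "f1": 2 * ppv * sensitivity / (ppv + sensitivity),
    }

# Assumption: stratified splitting preserves the full-dataset default rate.
prevalence = 162_156 / 887_379
metrics = derived_metrics(sensitivity=0.730, specificity=0.880, prevalence=prevalence)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Under that assumption, the recomputed values round to the reported accuracy of 0.853, positive predictive value of 0.576, negative predictive value of 0.936, and default F1-score of 0.644. The AUC and Brier score cannot be recovered this way because they depend on the full distribution of predicted probabilities rather than a single decision threshold.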

Article information

Journal

Journal of Computer Science and Technology Studies

Volume (Issue)

8 (6)

Pages

143-160

Published

2026-05-09

Keywords:

Credit default prediction, LightGBM, machine learning, credit risk modelling, target leakage, imbalanced classification, feature engineering, SHAP interpretability, loan default prediction, financial risk assessment.