A Leakage-Aware Machine Learning Pipeline for Credit Default Prediction Using LightGBM

Abdullah Al Mamun; Md Shahiduzzaman; Maria Kabtia; Mohammad Sazzad Hossain; Samia Akter; Md Firoz Kabir

doi:10.32996/jcsts.2026.8.6.11

Research Article

Authors

Abdullah Al Mamun, Department of Computer and Information Science, Gannon University, Erie, PA, USA
Md Shahiduzzaman, Business Analytics, Trine University, Angola, USA
Maria Kabtia, Master of Science in Business Analytics, Trine University, Angola, Indiana, USA
Mohammad Sazzad Hossain, Master’s in Business Analytics, Trine University
Samia Akter, Master of Science in Business Analytics, Trine University
Md Firoz Kabir, Master’s in Information Technology, University of the Cumberlands, USA

Abstract

Credit default prediction remains challenging because loan-outcome datasets are typically imbalanced, heterogeneous, and vulnerable to post-origination target leakage. This study proposes a leakage-aware and interpretable LightGBM-based credit-risk modelling framework for binary classification of loan status into Fully Paid and Charged Off/Default outcomes. The proposed workflow integrates rigorous target definition, removal of repayment-derived leakage variables, robust missing-value handling, outlier winsorisation, date-derived credit-history features, log-transformed monetary variables, affordability and utilisation ratios, mixed categorical encoding, FICO bucketisation, class-frequency reweighting, and mutual-information-based feature selection. A large lending dataset of 887,379 completed loans was analysed, comprising 725,223 Fully Paid loans and 162,156 Charged Off/Default loans. Data were stratified into training, validation, and holdout test sets using a 70/15/15 split, with additional stratified five-fold cross-validation repeated across five random seeds, yielding 25 validation runs. LightGBM was selected as the proposed best model after comparison with Logistic Regression, Random Forest, CatBoost, and XGBoost. It achieved the highest mean cross-validation AUC of 0.762 ± 0.004, outperforming XGBoost, CatBoost, Random Forest, and Logistic Regression, which obtained AUC values of 0.758 ± 0.004, 0.755 ± 0.004, 0.731 ± 0.005, and 0.708 ± 0.005, respectively. On the independent holdout test set, LightGBM achieved an AUC of 0.764, accuracy of 0.853, default sensitivity of 0.730, specificity of 0.880, default F1-score of 0.644, positive predictive value of 0.576, negative predictive value of 0.936, and Brier score of 0.124. Feature-importance and SHAP-direction analysis identified interest rate, sub-grade, debt-to-income ratio, annual income, and FICO range as the dominant risk drivers.
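The holdout metrics quoted in the abstract are linked by simple arithmetic: given the default rate, sensitivity, and specificity, the accuracy, predictive values, and F1-score follow directly. The sketch below is a reader-side consistency check and not part of the authors' pipeline. It assumes that stratification leaves the holdout default rate equal to the overall rate (162,156 of 887,379 loans); the function name `derived_metrics` and the constant names are illustrative only.

```python
# Figures reported in the abstract.
N_PAID = 725_223        # Fully Paid loans
N_DEFAULT = 162_156     # Charged Off/Default loans
SENSITIVITY = 0.730     # recall on the default (positive) class
SPECIFICITY = 0.880     # recall on the Fully Paid (negative) class


def derived_metrics(prevalence, sensitivity, specificity):
    """Derive accuracy, PPV, NPV and F1 from class rates.

    All confusion-matrix cells are expressed as fractions of the sample,
    so no absolute holdout size is needed.
    """
    tp = prevalence * sensitivity
    fn = prevalence * (1 - sensitivity)
    tn = (1 - prevalence) * specificity
    fp = (1 - prevalence) * (1 - specificity)
    ppv = tp / (tp + fp)  # precision on the default class
    npv = tn / (tn + fn)
    return {
        "accuracy": tp + tn,
        "ppv": ppv,
        "npv": npv,
        "f1": 2 * ppv * sensitivity / (ppv + sensitivity),
    }


# Assumption: the stratified holdout set has the same default rate
# as the full dataset.
prevalence = N_DEFAULT / (N_PAID + N_DEFAULT)
metrics = derived_metrics(prevalence, SENSITIVITY, SPECIFICITY)

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Under that assumption, the derived values round to the reported accuracy (0.853), positive predictive value (0.576), negative predictive value (0.936), and default F1-score (0.644), so the reported holdout figures are mutually consistent.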

Article information

Journal: Journal of Computer Science and Technology Studies

Volume (Issue): 8 (6)

DOI: https://doi.org/10.32996/jcsts.2026.8.6.11

Pages: 143-160

Published: 2026-05-09

Copyright: Open access. This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
