A Leakage-Aware Machine Learning Pipeline for Credit Default Prediction Using LightGBM
Abstract
Credit default prediction remains challenging because loan-outcome datasets are typically imbalanced, heterogeneous, and vulnerable to post-origination target leakage. This study proposes a leakage-aware and interpretable LightGBM-based credit-risk modelling framework for binary classification of loan status into Fully Paid and Charged Off/Default outcomes. The proposed workflow integrates rigorous target definition, removal of repayment-derived leakage variables, robust missing-value handling, outlier winsorisation, date-derived credit-history features, log-transformed monetary variables, affordability and utilisation ratios, mixed categorical encoding, FICO bucketisation, class-frequency reweighting, and mutual-information-based feature selection. A large lending dataset of 887,379 completed loans was analysed, comprising 725,223 Fully Paid loans and 162,156 Charged Off/Default loans. The data were split into training, validation, and holdout test sets using a stratified 70/15/15 split, and stratified five-fold cross-validation was additionally repeated across five random seeds, yielding 25 validation runs.

LightGBM was selected as the best-performing model after comparison with Logistic Regression, Random Forest, CatBoost, and XGBoost. It achieved the highest mean cross-validation AUC of 0.762 ± 0.004, outperforming XGBoost, CatBoost, Random Forest, and Logistic Regression, which obtained AUC values of 0.758 ± 0.004, 0.755 ± 0.004, 0.731 ± 0.005, and 0.708 ± 0.005, respectively. On the independent holdout test set, LightGBM achieved an AUC of 0.764, accuracy of 0.853, default sensitivity of 0.730, specificity of 0.880, default F1-score of 0.644, positive predictive value of 0.576, negative predictive value of 0.936, and Brier score of 0.124. Feature-importance and SHAP-direction analysis identified interest rate, sub-grade, debt-to-income ratio, annual income, and FICO range as the dominant risk drivers.
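The holdout figures reported above are linked by simple arithmetic, and a short sketch can make that relationship explicit. The Python snippet below is an illustrative check, not the authors' code: the function name `derived_metrics` is hypothetical, and it assumes that the stratified holdout set has the same default share as the full dataset (162,156 of 887,379 loans). Under that assumption, accuracy, positive and negative predictive value, and the default F1-score follow from the reported sensitivity and specificity.

```python
def derived_metrics(prevalence, sensitivity, specificity):
    """Derive accuracy, PPV, NPV and F1 from class share and per-class rates.

    Hypothetical helper for illustration; 'positive' means Charged Off/Default.
    All cell values are expressed as fractions of the evaluation set.
    """
    tp = prevalence * sensitivity              # defaults correctly flagged
    fn = prevalence * (1 - sensitivity)        # defaults missed
    tn = (1 - prevalence) * specificity        # fully paid loans correctly cleared
    fp = (1 - prevalence) * (1 - specificity)  # fully paid loans wrongly flagged

    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    return {
        "accuracy": tp + tn,
        "ppv": ppv,
        "npv": npv,
        "f1": 2 * ppv * sensitivity / (ppv + sensitivity),
    }


# Class counts stated in the abstract.
n_paid = 725_223
n_default = 162_156
n_total = n_paid + n_default

# Assumption: the stratified holdout set preserves the overall default share.
prevalence = n_default / n_total

# Sensitivity and specificity as reported for the holdout test set.
metrics = derived_metrics(prevalence, sensitivity=0.730, specificity=0.880)

# Five folds repeated over five seeds gives the stated number of validation runs.
n_validation_runs = 5 * 5

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Rounded to three decimals, the derived values agree with the accuracy, predictive values, and F1-score quoted in the abstract, which suggests the reported holdout metrics are internally consistent at the stated class balance.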
Article information
Journal
Journal of Computer Science and Technology Studies
Volume (Issue)
8 (6)
Pages
143-160
Published
Copyright
Copyright (c) 2026 https://creativecommons.org/licenses/by/4.0/
Open access

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
