Forecasting Bank Failure with Machine Learning Models: A study on Turkish Banks

Forecasting bank failures has been an essential topic in the literature because such failures significantly affect a country's economic prosperity. Acting as intermediaries, banks channel funds from those with surplus capital to those who require capital to carry out their economic activities. It is therefore essential to build early warning systems that can alert banks and stakeholders in times of financial turbulence. In this paper, three machine learning models, GLMBoost, XGBoost, and Sequential Minimal Optimization (SMO), were used to forecast bank failures. We used data on commercial bank failures in Turkey between 1997 and 2001, covering 17 failed and 20 healthy banks. Our results show that SMO and GLMBoost perform equally well when classifying failed banks, while GLMBoost achieves the best AUC and SMO the best overall classification rate. Lastly, XGBoost, one of the most recent and robust classification models, surprisingly underperformed on all three metrics used in this research.

This paper is organized as follows. The first section gives a general outlook on the banking sector and the importance of banks, along with introductory literature on how bank failure prediction has evolved and brief information on the dataset and the models used in this article.
The second section reviews previous studies on bank or company failure prediction, the machine learning algorithms they applied, and the performance of those algorithms.
The third section first gives a general description of the data used in the research; the full list of banks and ratios can be found in Tables II and VI. It then defines the methods used in this study and most of the terms that appear in the results section.
In the fourth and last section, all the results are presented together with the selected indicators, which can be found in Table III, and the corresponding confusion matrices in Table IV. We compared the results of the applied methods with one another and with results previously reported in the literature. In this way, we could better assess the performance of the implemented models and whether they could replace those used in previous studies.

Literature Review
Torna and DeYoung (2012) examined whether income generated from non-traditional banking activities caused U.S. commercial banks to fail in times of financial crisis. Banks with more than $100 billion in assets were excluded. A multi-period logistic regression showed that the probability of failure for distressed banks decreased with pure fee-based nontraditional activities such as securities brokerage and insurance sales. In contrast, it increased with activities such as venture capital, investment banking, and asset securitization.
Berger and Bouwman (2012) empirically analyzed how capital affects the financial performance of banks during financial crises. Logit survival and ordinary least squares regression models were estimated on data from 1984:Q1 to 2010:Q4. The empirical outcomes demonstrate that stronger capital lowers the bankruptcy probability of small banks, while it improves the performance of medium and large banks during banking crises.
Lu and Whidbee (2013) applied logistic regression to identify the reasons for failures and assess bank-level characteristics. Their data include 6,236 U.S. banks, of which 324 failed during 2007-2011. Their empirical results indicate a potential link between a bank's financial fragility and its likelihood of failure. In addition, banks belonging to multibank holding companies had a better chance of surviving the financial crisis than single banks.
Cleary and Hebb (2015) analyzed the failures of 132 banks over the period 2002-2009 using discriminant analysis. Within the sample, bank failures were predicted with 92% accuracy. They then applied the same analysis to predict bank failures in 2010-2011, where the accuracy ranged between 90% and 95%.
Chiaramonte et al. (2016) investigated U.S. commercial bank data from 2004 to 2012 to analyze how well the Z-score can forecast bank failure. The investigation showed that the Z-score could forecast 76% of bank failures. It is important to note that macro-level indicators did not increase the precision of the forecast. In addition, the ability of the Z-score to predict bank default remains stable within a three-year forward window.
Ekinci and Erdal (2017) analyzed bank failure prediction for 37 commercial banks in Turkey between 1997 and 2001. In their dataset, 20 banks were healthy and 17 had failed. They used Logistic Regression, J48, and Voted Perceptron as base learners, along with different hybrid ensembles. Their empirical findings indicate that hybrid ensemble machine learning models perform better than the traditional base and ensemble models and are the most accurate forecasting methods. The highest AUC value belongs to the RS-B-L ensemble, at 91.5 percent, while the highest classification rate belongs to the RS-B-J48 ensemble, at 83.78 percent.
Le and Viviani (2017) analyzed bank failures using both traditional techniques and machine learning on a sample of 3,000 U.S. banks, of which 1,438 had failed and 1,562 were active. As traditional techniques they used discriminant analysis and logistic regression; as machine learning methods they used artificial neural networks, support vector machines, and k-nearest neighbors. The analysis was based on the CAMEL categories, represented by 31 financial ratios. The empirical findings indicate that the artificial neural network and k-nearest neighbor methods are the most accurate at predicting bank failures.
Gogas et al. (2018) implemented machine learning models to forecast bank failures. Their data consist of 1,443 U.S. banks, 481 of which failed between 2007 and 2013. They applied a two-step feature selection procedure to identify the most informative variables. The selected variables were then fed into an SVM model through a training-testing learning process. The model achieved a forecasting accuracy of 99.22% and outperformed the well-established Ohlson's score.
Carmona et al. (2019) implemented the Extreme Gradient Boosting method to forecast the failure of 157 U.S. national commercial banks from 2001 to 2015. They considered 30 financial ratios in their model. Their results indicate that lower values of retained earnings to average equity, pretax return on assets, and the total risk-based capital ratio are linked with bank failure. They also suggest that retained earnings should be kept within the company in stressful periods and that dividend policies should be reconsidered.
Manthoulis et al. (2020) applied statistical and machine learning methods to a sample of 60,000 observations on U.S. banks between 2006 and 2015. Their findings show that diversification variables improve the predictive power of bank failure forecasting models, mostly for mid- to long-term forecast horizons. Furthermore, they indicate that ordinal classification models provide a better description of the state of banks before failure and are competitive with standard binary classification models.

Data
The data cover 37 private commercial banks in Turkey between 1997 and 2001. Of these 37 banks, 17 failed due to the 1998 and 2001 financial crises. In line with the CAMEL system, 35 financial ratios are used for forecasting. Information about the banks and the financial ratios can be found in Tables II and VI (in the appendix section).

Methods
In this experiment, we used three different machine learning algorithms to classify bank failures in Turkey. After a series of preliminary experiments aimed at finding the best classifiers for this research, we concluded that Sequential Minimal Optimization, GLMBoost, and XGBoost were the top predictors.
We used WEKA 3.9.5 (Waikato Environment for Knowledge Analysis) for data processing and for the experiments. Sequential Minimal Optimization is already part of the WEKA package. For XGBoost and GLMBoost, however, we used the R extension of WEKA, in which the analysis is set up through WEKA while the algorithms run in the R console. The analysis therefore requires R to be installed on the computer.

Gradient Boosting
Guelman (2012) defines gradient boosting as an iterative algorithm that combines parameterized functions with poor individual performance to produce more accurate forecasting rules. While other statistical learning methods, such as support vector machines and neural networks, provide comparable accuracy, gradient boosting yields more interpretable results. The method is highly robust and can be applied to both classification and regression problems with a variety of response distributions, for instance Gaussian, Poisson, and Laplace.
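
The iterative idea can be illustrated with a minimal sketch, assuming squared-error loss and one-split "stumps" as the weak learners. This is not the implementation used in this study; the function names and the toy data are illustrative assumptions.

```python
# Minimal gradient boosting sketch for squared-error loss.
# Weak learner: a one-split "stump" on a single numeric feature.
# All names and the toy data are illustrative assumptions.

def fit_stump(x, residuals):
    """Pick the threshold whose two group means best fit the residuals."""
    best = None
    # Skip the largest value so the right-hand group is never empty.
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, xi):
    threshold, left_mean, right_mean = stump
    return left_mean if xi <= threshold else right_mean


def gradient_boost(x, y, n_rounds=50, learning_rate=0.1):
    """Add one shrunken stump per round, each fitted to the current residuals."""
    base = sum(y) / len(y)
    predictions = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        # For squared-error loss, the negative gradient is the residual.
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        predictions = [pi + learning_rate * predict_stump(stump, xi)
                       for xi, pi in zip(x, predictions)]
    return base, stumps, learning_rate


def predict(model, xi):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, xi) for s in stumps)


def mse(model, x, y):
    return sum((yi - predict(model, xi)) ** 2 for xi, yi in zip(x, y)) / len(y)


# Toy data: the response jumps from 0 to 1 between x = 4 and x = 5.
x_toy = [1, 2, 3, 4, 5, 6, 7, 8]
y_toy = [0, 0, 0, 0, 1, 1, 1, 1]
model = gradient_boost(x_toy, y_toy)
```

Each stump on its own is a poor predictor, but the shrunken sum of many stumps approaches the step in the toy data, which is the sense in which weak functions are combined into a more accurate rule.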

GLMBoost
Adamu et al. (2019) defined Generalized Linear Models (GLM) as a generalization of linear regression that allows response variables that are not normally distributed as well as those that are. GLM is a large class of models for relating responses to linear combinations of predictor variables. Fitting generalized linear models with a gradient boosting algorithm is called GLMBoost. GLMBoost fits linear models via component-wise boosting: each column of the design matrix is fitted separately using a simple linear model, and the best-fitting column is selected in each iteration. The algorithm is thus a form of gradient boosting that optimizes a specified loss function using component-wise linear models as base learners. When GLMBoost is used to fit a linear model, the results are reported to be more accurate and reliable than those of the corresponding GLM function.
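
Component-wise boosting can be sketched as follows. This is a simplified illustration assuming squared-error loss, not the R routine used in this study, which would use a loss suited to a binary response. The function name, the step size `nu`, and the toy data are assumptions made for the example.

```python
# Component-wise boosting sketch with squared-error loss.
# In each round, every centred column is fitted to the residuals on its own,
# and only the best-fitting column has its coefficient updated.
# Names, the step size nu, and the toy data are illustrative assumptions.

def componentwise_boost(X, y, n_rounds=100, nu=0.1):
    n, p = len(X), len(X[0])
    col_means = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - col_means[j] for j in range(p)] for row in X]
    intercept = sum(y) / n
    residuals = [yi - intercept for yi in y]
    coefs = [0.0] * p
    for _ in range(n_rounds):
        best_j, best_beta, best_sse = None, 0.0, None
        for j in range(p):
            sxx = sum(row[j] ** 2 for row in Xc)
            if sxx == 0:
                continue  # a constant column carries no information
            # Least-squares slope of the residuals on column j alone.
            beta = sum(row[j] * r for row, r in zip(Xc, residuals)) / sxx
            sse = sum((r - beta * row[j]) ** 2 for row, r in zip(Xc, residuals))
            if best_sse is None or sse < best_sse:
                best_j, best_beta, best_sse = j, beta, sse
        if best_j is None:
            break
        # Update only the selected component, shrunk by nu.
        coefs[best_j] += nu * best_beta
        residuals = [r - nu * best_beta * row[best_j]
                     for row, r in zip(Xc, residuals)]
    return intercept, col_means, coefs


def predict_linear(model, row):
    intercept, col_means, coefs = model
    return intercept + sum(c * (v - m) for c, v, m in zip(coefs, row, col_means))


# Toy data: the response depends only on the first column (y = 2 * x0 + 1).
X_toy = [[1, 5, 2], [2, 3, 7], [3, 8, 1], [4, 1, 6], [5, 6, 3], [6, 2, 8]]
y_toy_lin = [2 * row[0] + 1 for row in X_toy]
linear_model = componentwise_boost(X_toy, y_toy_lin)
```

Because only one coefficient changes per round, columns that never give the best fit keep a coefficient of exactly zero, so the procedure also acts as a form of variable selection.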

XGBoost
XGBoost is an implementation of gradient boosting machines created by Tianqi Chen in 2014. Designed for efficiency and scalability, its parallel tree boosting capabilities make it significantly faster than other tree-based ensemble algorithms (Quinto, 2020). Analytics Vidhya Content Team (2018) lists the features of XGBoost as follows:
•The model can penalize complex models through both L1 and L2 regularization, which helps prevent overfitting.
•XGBoost incorporates a sparsity-aware split finding algorithm to handle different types of sparsity patterns in the data.
•XGBoost has a distributed weighted quantile sketch algorithm to handle weighted data effectively.

Sequential Minimal Optimization
Platt (1998) defined the Sequential Minimal Optimization (SMO) model as a simple algorithm that can quickly solve the Support Vector Machine quadratic programming (QP) problem without any extra matrix storage and without using numerical QP optimization steps at all. SMO solves the smallest possible optimization problem at each step. For standard SVM QP problems, the smallest possible optimization problem involves two Lagrange multipliers, because the Lagrange multipliers must obey a linear equality constraint. The advantage of SMO is that the two Lagrange multipliers can be solved for analytically. Even though more subproblems are solved in the course of the algorithm, each subproblem is solved so fast that the overall QP problem is solved quickly.

AUC
The area under a receiver operating characteristic (ROC) curve is abbreviated as AUC. Hanley and McNeil (1982) defined AUC as a single scalar value that measures the overall performance of a binary classifier.

Hosmer and Lemeshow (2000) interpret the area under the ROC curve as follows:
If AUC = 0.5: this suggests no discrimination (the same as flipping a coin).
If AUC ≥ 0.9: this is considered outstanding discrimination.
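
As a hedged illustration of what this scalar measures, the area under the ROC curve equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it by comparing all positive-negative pairs; the function name and the toy scores are illustrative assumptions, not values from this study.

```python
# AUC computed as the share of positive-negative pairs that are ranked
# correctly (ties count as half). Names and toy values are illustrative.

def auc(labels, scores):
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: 1 = failed bank, 0 = healthy bank; scores are made up.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.6, 0.1]
toy_auc = auc(toy_labels, toy_scores)  # 3 of the 4 pairs are ordered correctly
```

A classifier that gives every bank the same score gets 0.5 under this definition, which matches the coin-flip interpretation above.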

Confusion Matrix
Miao and Zhu (2020) mentioned that the performance of a classification model is typically measured with a confusion matrix generated by the associated classifier. In the confusion matrix, True Positive and True Negative values are the correctly classified instances. False Negative means that an instance was originally positive but was predicted as negative, while False Positive means that an instance was originally negative but was predicted as positive.
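
The four cells and two rates derived from them can be sketched as below. The labels are hypothetical and are not taken from Table IV; the function names are assumptions made for the example, with 1 standing for a failed bank and 0 for a healthy one.

```python
# Confusion-matrix counts and two rates derived from them.
# The labels below are hypothetical, not the paper's results.

def confusion_counts(actual, predicted, positive=1):
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1          # failed bank, predicted failed
        elif a == positive:
            fn += 1          # failed bank, predicted healthy
        elif p == positive:
            fp += 1          # healthy bank, predicted failed
        else:
            tn += 1          # healthy bank, predicted healthy
    return tp, fp, tn, fn


def classification_rate(tp, fp, tn, fn):
    """Share of all instances that were classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def recall(tp, fn):
    """Share of actual positives that were detected."""
    return tp / (tp + fn)


actual_toy = [1, 1, 1, 0, 0, 0, 0]
predicted_toy = [1, 1, 0, 0, 0, 0, 1]
tp, fp, tn, fn = confusion_counts(actual_toy, predicted_toy)
```

In this toy case two of the three failed banks are detected, so the recall is 2/3, while five of the seven banks are classified correctly overall.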

K-fold cross-validation
In this experiment, we used 10-fold cross-validation (CV) to measure model performance while minimizing the bias associated with random sampling of the training data (Chou and Pham, 2013). Yumurtaci et al. (2015) describe 10-fold CV as follows: the dataset is randomly split into ten (k) subsets of the same size, in which each class is represented in approximately the same proportion as in the full dataset. Each subset is then held out in turn and the learning scheme is trained on the remaining nine-tenths (k-1); its error rate is then calculated on the holdout set. Thus, the learning procedure is executed a total of ten times on different training sets.
We analyzed three classification models: GLMBoost, XGBoost, and SMO. Our ranking of the models is based on three metrics: the classification rate, the recall rate, and AUC. The classification rate is the share of correctly classified instances (both true positives and true negatives), as shown in Table IV. AUC (the area under the ROC curve) is a commonly used metric for classification problems: the better the model separates failed banks from healthy ones, the closer the AUC is to 1, whereas an AUC near 0.5 means the model does no better than tossing a coin. Finally, the recall rate measures the model's ability to correctly classify the failed banks; therefore, higher is better.
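
The stratified 10-fold split described in this section can be sketched as follows. This is an illustration rather than WEKA's own procedure. The class sizes (17 failed, 20 healthy) follow the dataset described earlier, but the label vector is a placeholder, and the function names and the seed are assumptions.

```python
import random

# Stratified k-fold sketch: indices of each class are shuffled and dealt
# round-robin into k folds, so every fold keeps roughly the class mix of
# the full dataset. Names and the seed are illustrative assumptions.

def stratified_folds(labels, k=10, seed=0):
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    folds = [[] for _ in range(k)]
    offset = 0
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[(offset + position) % k].append(index)
        # Continue dealing where the previous class stopped, which keeps
        # the fold sizes balanced.
        offset += len(indices)
    return folds


def train_test_splits(folds):
    """Yield (train, test) index lists, holding out each fold in turn."""
    for i, test in enumerate(folds):
        train = [index for j, fold in enumerate(folds) if j != i
                 for index in fold]
        yield train, test


# Placeholder labels: 1 = failed (17 banks), 0 = healthy (20 banks).
bank_labels = [1] * 17 + [0] * 20
bank_folds = stratified_folds(bank_labels, k=10)
splits = list(train_test_splits(bank_folds))
```

With 37 banks the ten folds cannot all be the same size, so each holds three or four banks. Every bank appears in exactly one test fold, and each of the ten training runs uses the remaining nine folds.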

Results
In terms of classification rate, the ranking is as follows: SMO first, GLMBoost second, and XGBoost third.
In terms of AUC, the ranking is as follows: GLMBoost first, SMO second, and XGBoost third.
In terms of recall, GLMBoost and SMO perform equally, correctly classifying 82 percent of the failed banks, while XGBoost stands at 77 percent.

Conclusion
Due to its importance to the overall economy, forecasting bank failures has become an essential topic in the literature. Our paper used SMO, GLMBoost, and XGBoost to assess bank failures with the three metrics mentioned above. Among these, the recall rate, which gives the percentage of failed banks that were correctly classified, is our most important metric, because in financially turbulent times bankers need to identify fragile banks so that they can be assisted and saved from bankruptcy. Our results show that the best classification rate is provided by SMO, followed by GLMBoost and XGBoost, while the best AUC is provided by GLMBoost, followed by SMO and XGBoost. Finally, the best recall rate is shared by GLMBoost and SMO, while XGBoost performed worst, even though it is commonly used these days due to its successful classification ability. Therefore, when analyzing bank failures with AUC and recall as the top priorities, we suggest using GLMBoost. On the other hand, if the most important metrics are the classification and recall rates, we suggest using SMO for bank failure prediction.
Funding: This research received no external funding.

Conflicts of Interest:
The authors declare no conflict of interest.