Solving Multicollinearity Problem in Linear Regression Model: The Review Suggests New Idea of Partitioning and Extraction of the Explanatory Variables

Multicollinearity remains a major problem in regression analysis and needs a sustainable solution. The problems associated with multicollinearity are worse when it occurs at a high level among the regressors. This review reveals that studies on the subject have focused on developing estimators regardless of the effect of differences in the levels of multicollinearity among regressors. Studies have considered single-estimator and combined-estimator approaches without arriving at a sustainable solution to multicollinearity problems. The possible influence of partitioning the regressors according to their multicollinearity levels, and extracting from each group to develop estimators of the parameters of a linear regression model when multicollinearity occurs, is a new idea in econometrics and therefore requires attention. The results of new studies should be compared with existing methods, namely the principal components estimator, the partial least squares estimator, the ridge regression estimator and the ordinary least squares estimator, using a wide range of criteria and ranking their performance at each level of the multicollinearity parameter and sample size. Based on a recent clue in the literature, it is possible to develop an innovative estimator that will sustainably solve the problem of multicollinearity through partitioning and extraction of explanatory variables, and to identify situations where the innovative estimator will produce the most efficient estimates of the model parameters. The new estimator should be applied to real data and popularized for use.


1 Introduction
Multicollinearity is an important econometric problem that has received increased attention in recent times (Lukman et al., 2015; Ismail and Manjula, 2016; Olanrewaju et al., 2017; Tyagi and Chandra, 2017). The objectives of this review are to examine the findings of researchers, particularly what they have done to address the problems associated with this phenomenon, and to identify the gaps left uncovered. The literature reviewed makes clear that multicollinearity is still recognized as a very serious problem in the linear regression model. Multicollinearity is the term used to describe cases in which the explanatory variables or regressors are correlated. Under multicollinearity, the regression coefficients are characterized by large standard errors and some carry the wrong sign, which leads to wrong inferences. Efforts are ongoing, at both the national and international levels, to tackle the problems of multicollinearity, and the result is an array of estimation methods (each with one limitation or another) designed to mitigate the effects of this econometric problem. As recently reported by Tyagi and Chandra (2017), the techniques available for overcoming this problem include, in chronological order, the Stein estimator proposed by Stein (1956), the principal component regression estimator introduced by Massy (1965), partial least squares originated by Wold in 1966 and the ordinary ridge regression estimator originated by Hoerl and Kennard (1970). This review will inform new studies by helping to identify important gaps.

Single Estimator Approach to Multicollinearity Problem

2.1 The ridge regression estimator
The ridge regression estimator is a single-estimator approach to solving multicollinearity problems. It was first developed by Hoerl and Kennard (1970). The ridge estimator characteristically possesses a smaller mean square error (MSE) than the ordinary least squares estimator (Vinod and Ullah, 1981). It is defined as

$$\hat{\beta}(k) = (X'X + kI)^{-1} X'y \qquad (1)$$

where $k$ is a non-negative constant called the biasing or ridge parameter. When $k = 0$, (1) reduces to the ordinary least squares estimate. Proposed choices of $k$ depend on the error variance $\sigma^2$ and the canonical coefficients $\alpha_i$. Because these are generally unknown, it was suggested that they be replaced by their corresponding unbiased estimates $\hat{\sigma}^2$ and $\hat{\alpha}_i$, where $\hat{\alpha}_i$ is the $i$th element of the vector $\hat{\alpha} = Q'\hat{\beta}$ and $Q$ is an orthogonal matrix. It is known that as $k$ increases, the ridge regression estimator becomes more biased but yields more precise estimates than the ordinary least squares estimator (Mardikyan and Cetin, 2008). It was suggested early on that the value of $k$ should be chosen small enough that the mean squared error of the ridge estimator is less than that of the ordinary least squares estimator.

Many researchers have proposed different techniques for estimating $k$. Hoerl and Kennard (1970) proposed a graphical method, the ridge trace, for selecting the value of the ridge parameter $k$. This is a plot of the individual coefficient estimates $\hat{\beta}_i(k)$ against a range of values of $k$ ($0 < k < 1$). The smallest value of $k$ at which the estimates become stable and the wrong signs in the regression coefficients are corrected is used. Hoerl et al. (1975) proposed a different estimator of $k$ by taking the harmonic mean of the ridge parameter. Kibria (2003) proposed some new estimators of $k$ by taking the geometric mean, arithmetic mean and median of the ridge parameter. Khalaf and Shukur (2005) proposed $k$ in the form of the fixed maximum of the ridge parameter, while Alkhamisi et al. (2006) suggested the arithmetic mean and median of the ridge parameter. Muniz et al. (2012) proposed estimators of $k$ based on the varying maximum and arithmetic mean of the ridge parameter; the varying maximum together with its reciprocal, square root and reciprocal of its square root; and the geometric mean of $k$ together with its square root and reciprocal of its square root. Lukman and Ayinde (2017) recently stated that a value of $k$ for which the residual sum of squares is not too large can be selected. According to this literature, the ridge parameter can take different forms (fixed maximum, varying maximum, arithmetic mean, harmonic mean, geometric mean and median) and various types: original form (O), reciprocal form (R), square root form (SR) and reciprocal of square root form (RSR). Combining the knowledge from these works and comparing on simulated data, the results showed that the best estimators of the ridge parameter span several forms and types and include fixed maximum original, varying maximum original, harmonic mean original and arithmetic mean square root.

The smaller mean square error of the ridge estimator gives it an edge over the ordinary least squares estimator in the presence of multicollinearity. The ridge regression estimator reduces the effect of multicollinearity by adding the ridge parameter $k$ to the main diagonal elements of the correlation matrix $X'X$. Furthermore, Liu (1993) introduced an estimator (a new class of biased estimate in linear regression) that is similar in form to, but differs from, the ridge regression estimator of Hoerl and Kennard (1970).
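As an illustration of the ridge estimator in its familiar form $(X'X + kI)^{-1}X'y$, the following minimal sketch computes it on synthetic collinear data. The function names, the synthetic data and the particular choice $k = \hat{\sigma}^2 / \max_i \hat{\alpha}_i^2$ (one commonly cited option in the ridge literature, with $Q$ taken here to be the eigenvector matrix of $X'X$) are assumptions made for illustration only, not the procedure of any study reviewed above.

```python
import numpy as np

def ridge_estimator(X, y, k):
    """Ridge estimate (X'X + kI)^{-1} X'y for a ridge parameter k >= 0."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)

def ridge_parameter_max(X, y):
    """Illustrative choice k = sigma_hat^2 / max(alpha_hat_i^2) (an assumption)."""
    n, p = X.shape
    beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ beta_ols
    sigma2_hat = resid @ resid / (n - p)      # unbiased estimate of sigma^2
    _, Q = np.linalg.eigh(X.T @ X)            # columns of Q are orthonormal eigenvectors
    alpha_hat = Q.T @ beta_ols                # canonical coefficients
    return sigma2_hat / np.max(alpha_hat ** 2)

# Synthetic data (hypothetical): three regressors sharing one common factor z.
rng = np.random.default_rng(0)
n = 50
z = rng.normal(size=n)
X = np.column_stack([z + 0.05 * rng.normal(size=n) for _ in range(3)])
X = (X - X.mean(axis=0)) / X.std(axis=0)      # standardize the regressors
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(size=n)
y = y - y.mean()                              # centered response, no intercept needed

beta_ols = ridge_estimator(X, y, 0.0)         # k = 0 reproduces least squares
k_hat = ridge_parameter_max(X, y)
beta_ridge = ridge_estimator(X, y, k_hat)
print("k =", k_hat)
print("OLS  :", beta_ols)
print("ridge:", beta_ridge)
```

Setting `k = 0` recovers the least squares solution, while any positive `k` shrinks the coefficient vector; evaluating `ridge_estimator` over a grid of `k` values would give the data for a ridge trace.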

Principal component
Principal component analysis is a traditional multivariate statistical method mainly employed to decrease the number of predictive variables and to remedy the multicollinearity problem (Bair et al., 2006). As revealed by Rosipal and Nicole (2006), the aim of principal component analysis is to identify a few linear combinations of the variables that can be used to summarize the data without losing too much information in the process. The approach transforms the set of $p$ variables into a new set of $p$ orthogonal variables denoted $c_1, \ldots, c_p$, where each variable $c_j$ is a linear combination of the variables $x_1, \ldots, x_p$, that is,

$$c_j = a_{1j} x_1 + a_{2j} x_2 + \cdots + a_{pj} x_p, \qquad j = 1, \ldots, p \qquad (2)$$

where the $a_{ij}$ are the component loadings. The linear combinations $c_1, \ldots, c_p$ are orthogonal, and the variance-covariance matrix of the principal components is

$$\operatorname{Var}(c) = \operatorname{diag}(\lambda_1, \lambda_2, \ldots, \lambda_p) \qquad (3)$$

where $\lambda_j$ is the variance of the $j$th component and $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p$. The first few of the PCs, those with the largest variances, are used in the regression.
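The principal component approach to regression can be sketched as follows: form the orthogonal components from the eigenvectors of $X'X$, regress the response on the first $r$ of them, and map the result back to the original variables. The function name, the synthetic data and the choice of $r$ are hypothetical and serve only to illustrate the idea.

```python
import numpy as np

def pcr_estimator(X, y, r):
    """Regress y on the first r principal components of the centered matrix X,
    then map the fitted coefficients back to the original variables."""
    eigvals, eigvecs = np.linalg.eigh(X.T @ X)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:r]        # indices of the r largest eigenvalues
    A_r = eigvecs[:, order]                      # loadings, shape (p, r)
    C_r = X @ A_r                                # component scores, shape (n, r)
    gamma = (C_r.T @ y) / eigvals[order]         # valid because C_r'C_r is diagonal
    return A_r @ gamma, C_r

# Synthetic collinear data (hypothetical).
rng = np.random.default_rng(1)
n = 80
z = rng.normal(size=n)
X = np.column_stack([z + 0.1 * rng.normal(size=n) for _ in range(3)])
X = X - X.mean(axis=0)
y = X @ np.array([1.0, 1.0, 1.0]) + rng.normal(size=n)
y = y - y.mean()

beta_pcr1, scores1 = pcr_estimator(X, y, r=1)    # keep only the leading component
beta_full, scores = pcr_estimator(X, y, r=3)     # all components: same fit as least squares
print("PCR with one component:", beta_pcr1)
print("PCR with all components:", beta_full)
```

Keeping every component reproduces the least squares fit, so any gain from this approach comes from discarding the low-variance components that carry the near-collinearity.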

Partial least squares
Partial least squares is a relatively recent technique that generalizes and combines attributes of principal component analysis and multiple regression (Abdi, 2003). It originated in the social sciences, especially economics (Wold, 1966), and is now popular and particularly useful in areas such as chemical engineering, where the predictive variables often consist of several different measurements in an experiment and where the relationships between these variables are not well understood (Geladi and Kowalski, 1986). Partial least squares is becoming a multivariate technique of choice for both non-experimental and experimental data in the social sciences (McIntosh et al., 1996). Notably, partial least squares was first presented as an algorithm (Tenenhaus, 1998), and its statistical properties have been investigated by several workers (Stone and Brooks, 1990; Naes and Helland, 1993; Garthwaite, 1994; Helland and Almoy, 1994) with impressive results. One of the desirable properties of partial least squares is that it has a closed form, given in equation (4). The NIPALS algorithm of Wold (1975) was ultimately modified to take the responses into account, which culminated in the partial least squares (PLS) regression algorithm on orthogonal scores presented in Wold et al. (1983). In the usual notation, the general NIPALS algorithm for PLS decomposes the predictor and response matrices as

$$X = TP' + E \qquad (6)$$

$$Y = UQ' + F \qquad (7)$$

where $T$ and $U$ are score matrices, $P$ and $Q$ are loading matrices, and $E$ and $F$ correspond to residual terms. The decomposition in equations (6) and (7) is analyzed to a great extent in Martens and Naes (1989). Burnham et al. (2001) present the procedure for extracting the latent variables and for estimating the coefficients of the regression model. For univariate partial least squares, an attempt is made to find a model of the form

$$\hat{y} = \beta_1 T_1 + \beta_2 T_2 + \cdots + \beta_p T_p \qquad (8)$$

where each $T_j$ is a linear combination of the $X$s and $\beta_j$ is the corresponding parameter. The variables are assumed to be centered to have mean zero, which implies that the intercept term will always be zero.
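For the univariate case, a NIPALS-style extraction of orthogonal scores can be sketched as below. This is a minimal textbook-style PLS1 routine under the centering assumption stated above, not a reproduction of the algorithm printed in any of the cited papers; the function and variable names and the synthetic data are assumptions.

```python
import numpy as np

def pls1_nipals(X, y, n_components):
    """Univariate PLS with orthogonal scores; X and y are assumed centered."""
    n, p = X.shape
    E, f = X.copy(), y.copy()                 # residual matrices, deflated step by step
    W = np.zeros((p, n_components))           # weight vectors
    P = np.zeros((p, n_components))           # X loadings
    T = np.zeros((n, n_components))           # scores (latent variables)
    q = np.zeros(n_components)                # regression of y on each score
    for a in range(n_components):
        w = E.T @ f
        w /= np.linalg.norm(w)                # direction of maximal covariance with f
        t = E @ w
        tt = t @ t
        P[:, a] = E.T @ t / tt
        q[a] = f @ t / tt
        E = E - np.outer(t, P[:, a])          # remove what the score explains in X
        f = f - q[a] * t                      # and in y
        W[:, a], T[:, a] = w, t
    beta = W @ np.linalg.solve(P.T @ W, q)    # coefficients on the original variables
    return beta, T

# Synthetic collinear data (hypothetical).
rng = np.random.default_rng(2)
n = 80
z = rng.normal(size=n)
X = np.column_stack([z + 0.3 * rng.normal(size=n) for _ in range(3)])
X = X - X.mean(axis=0)
y = X @ np.array([1.0, -1.0, 2.0]) + rng.normal(size=n)
y = y - y.mean()

beta_one, T_one = pls1_nipals(X, y, 1)        # a single latent variable
beta_all, T_all = pls1_nipals(X, y, 3)        # as many latent variables as regressors
print("PLS, one component:", beta_one)
print("PLS, all components:", beta_all)
```

With as many components as regressors the fit coincides with least squares; in practice the number of components is kept small and chosen by a model selection criterion.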
How to choose, among all extracted latent variables, precisely those that provide the best fit to the data is another task. According to Burnham and Anderson (2004), this calls for a balance between under-fitted and over-fitted models. Akaike (1974) proposed a measure of statistical model fit known as the Akaike information criterion (AIC), defined as

$$\mathrm{AIC} = -2L_m + 2m \qquad (9)$$

where $L_m$ is the maximized log-likelihood and $m$ is the number of parameters in the model. For linear regression models, $-2L_m$ (known as the deviance) is $n\log(\mathrm{RSS}/n)$ up to an additive constant. This index takes into account both the statistical goodness of fit and the number of parameters to be estimated, and so imposes a penalty for increasing the number of parameters. The model with the lowest AIC is therefore preferred. Schwarz (1978) developed the Bayesian information criterion (BIC) for selecting among a finite set of models. It is based in part on the likelihood function and is closely related to the Akaike information criterion. However, to prevent over-fitting, the penalty term is larger in BIC than in AIC (whenever $\log n > 2$, that is, for $n \ge 8$). The model with the lowest BIC is preferable. BIC is formally defined as

$$\mathrm{BIC} = -2L_m + m\log n \qquad (10)$$

Recall that $R^2 = 1 - \mathrm{RSS}/\mathrm{TSS}$. Adding a variable to a model can only decrease the RSS and thus increase $R^2$. The implication is that $R^2$ by itself is not a good criterion, because it would always choose the largest possible model. The adjusted $R^2$, $R_a^2 = 1 - \frac{\mathrm{RSS}/(n-m)}{\mathrm{TSS}/(n-1)}$, corrects for this: adding a predictor will only increase $R_a^2$ if it has some value. Meanwhile, minimizing the standard error for prediction means minimizing $\hat{\sigma}^2$, which in turn means maximizing $R_a^2$.
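The criteria discussed above can be computed directly from the residual sum of squares. The sketch below uses the linear-regression forms $n\log(\mathrm{RSS}/n) + 2m$ for AIC and $n\log(\mathrm{RSS}/n) + m\log n$ for BIC (both up to an additive constant), together with $R^2$ and the adjusted $R^2$. The function name, the dictionary keys and the synthetic data are assumptions for illustration.

```python
import numpy as np

def selection_criteria(X, y):
    """AIC, BIC, R^2 and adjusted R^2 for a least squares fit of y on X.
    X is assumed to already contain a column of ones for the intercept."""
    n, m = X.shape                                   # m counts every fitted parameter
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    tss = float(np.sum((y - y.mean()) ** 2))
    deviance = n * np.log(rss / n)                   # -2 log-likelihood up to a constant
    return {
        "aic": deviance + 2 * m,
        "bic": deviance + m * np.log(n),
        "r2": 1.0 - rss / tss,
        "adj_r2": 1.0 - (rss / (n - m)) / (tss / (n - 1)),
    }

# Synthetic data (hypothetical): one useful regressor and one pure-noise regressor.
rng = np.random.default_rng(3)
n = 60
x1 = rng.normal(size=n)
noise_col = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(size=n)

X_small = np.column_stack([np.ones(n), x1])
X_big = np.column_stack([np.ones(n), x1, noise_col])

small = selection_criteria(X_small, y)
big = selection_criteria(X_big, y)
for name in ("aic", "bic", "r2", "adj_r2"):
    print(name, small[name], big[name])
```

Comparing the two fits shows the point made in the text: $R^2$ cannot fall when the noise column is added, whereas AIC, BIC and the adjusted $R^2$ each charge for the extra parameter.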
The predicted residual error sum of squares (PRESS) is a statistic used in regression analysis to summarize the fit of a model to a sample of observations that were not themselves used to estimate the model. It is calculated as the sum of squares of the prediction residuals for those observations (Allen, 1974; Tarpey, 2000). The model with the lowest PRESS is preferred. Mallows' Cp is another technique for model selection in regression (Mallows, 1973). It is employed to assess the fit of a regression model estimated by ordinary least squares, and is applied where a number of predictor variables are available for predicting some outcome and the aim is to find the best model involving a subset of these predictors. A small Cp signals that the model is relatively precise; Mallows (1973) concludes that the average MSE of prediction might therefore be a good criterion. Other model selection methods have been developed and applied by many workers, including the Deviance Information Criterion (Vander, 2005; Celeux et al., 2006), the False Discovery Rate (Benjamini and Hochberg, 1995), the Hannan-Quinn Information Criterion (Burnham and Anderson, 2002) and the Bayes Factor (Carlin and Chib, 1995; Chen et al., 2000). Although cross validation is the commonest method, it has been shown that the corrected Akaike information criterion, AICc, outperforms AIC in model selection (Hurvich and Tsai, 1989; McQuarrie and Tsai, 1998). According to the papers reviewed, if the model in question is a linear regression, the number of parameters is the number of regressors, including the intercept (Findley, 1991). The aim is to minimize AIC or BIC: larger models fit better and have a smaller RSS but use more parameters, so the best choice of model balances fit with size.
Since BIC penalizes larger models more heavily, it tends to prefer smaller models than AIC does. This approach has been reported to be beneficial in the linear regression model in the presence of multicollinearity (Gujarati and Porter, 2009).
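PRESS and Mallows' Cp can both be obtained from ordinary least squares fits. The sketch below computes PRESS through the hat-matrix shortcut $\sum_i \big(e_i / (1 - h_{ii})\big)^2$, checks it against an explicit leave-one-out loop, and computes $C_p = \mathrm{RSS}_{\text{sub}} / \hat{\sigma}^2_{\text{full}} - n + 2p_{\text{sub}}$. Function names and data are hypothetical.

```python
import numpy as np

def ols_rss(X, y):
    """Residual sum of squares of the least squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((y - X @ beta) ** 2))

def press(X, y):
    """PRESS via the hat matrix: sum of (e_i / (1 - h_ii))^2."""
    H = X @ np.linalg.pinv(X)                        # hat (projection) matrix
    resid = y - H @ y
    return float(np.sum((resid / (1.0 - np.diag(H))) ** 2))

def press_leave_one_out(X, y):
    """PRESS by refitting the model n times, each time omitting one observation."""
    total = 0.0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        total += (y[i] - X[i] @ beta) ** 2
    return float(total)

def mallows_cp(X_sub, X_full, y):
    """Mallows' Cp of a sub-model, with sigma^2 estimated from the full model."""
    n, p_full = X_full.shape
    sigma2_full = ols_rss(X_full, y) / (n - p_full)
    return ols_rss(X_sub, y) / sigma2_full - n + 2 * X_sub.shape[1]

# Synthetic data (hypothetical): the third regressor has no effect on y.
rng = np.random.default_rng(4)
n = 40
X_full = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X_full @ np.array([1.0, 2.0, -1.0, 0.0]) + rng.normal(size=n)
X_sub = X_full[:, :3]                                # drop the irrelevant regressor

print("PRESS (full model):", press(X_full, y))
print("PRESS (sub-model): ", press(X_sub, y))
print("Cp (sub-model):    ", mallows_cp(X_sub, X_full, y))
```

By construction the Cp of the full model equals its number of parameters, so sub-models are judged by how close their Cp is to their own parameter count while remaining small.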

Combined Estimator Approach to Multicollinearity Problem
As part of the efforts to combat multicollinearity problems, the literature further reveals that some researchers have combined two estimation techniques in the hope that the combination will be superior to the single-estimator approach. In chronological order, one of the oldest attempts was made by Baye and Parker (1984), who gave the r-k class estimator by combining the principal component regression estimator and the ordinary ridge regression estimator; it includes the ordinary least squares estimator, the ordinary ridge regression estimator and the principal component regression estimator as special cases. Nomura and Ohkubo (1985) obtained conditions for dominance of the r-k class estimator over its special cases using the mean square error as the criterion. Liu (1993) combined the merits of the Stein and ordinary ridge regression estimators to obtain an estimator called the Liu estimator. Kaciranlar and Sakallioglu (2001) combined the Liu estimator and the principal component regression estimator to get the r-d class estimator and demonstrated the superiority of this estimator over the ordinary least squares estimator, the Liu estimator and the principal component regression estimator. Ozkale and Kaciranlar (2007) combined the ordinary ridge regression estimator and the Liu estimator to obtain a two-parameter estimator, and derived a necessary and sufficient condition for dominance of the two-parameter estimator over the ordinary least squares estimator in the mean square matrix sense. Yang and Chang (2010) combined the ordinary ridge regression estimator and the Liu estimator in a different manner and proposed another two-parameter estimator. They derived necessary and sufficient conditions for the superiority of this second two-parameter estimator over the ordinary least squares estimator, the ordinary ridge regression estimator, the Liu estimator and the two-parameter estimator of Ozkale and Kaciranlar (2007) under the mean square matrix criterion.
Ozkale (2012) proposed a general class of estimators, the r-(k,d) class estimator, and assessed its performance under the mean square error criterion. The proposed estimator was a combination of the two-parameter estimator and the principal component regression estimator, and its superiority was shown. Chang and Yang (2012) proposed the principal component two-parameter estimator by combining the principal component regression estimator with the second two-parameter estimator, and evaluated its performance in the mean square matrix sense. Tyagi and Chandra (2017) examined, in the presence of multicollinearity with autocorrelated errors, the performance of two biased estimators that include the same number of unknown parameters with the same range. Their results suggest that, for all the parametric conditions considered in the investigation, the r-(k,d) class estimator performs better than the principal component two-parameter estimator in the scalar mean square error sense. The shift from the single-estimator approach to the combined-estimator approach was a progressive effort targeted at solving the multicollinearity problem.
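To make the idea of combining estimators concrete, the following sketch implements the r-k class estimator in its usual form $T_r (T_r' X'X T_r + kI_r)^{-1} T_r' X'y$, where $T_r$ holds the eigenvectors of $X'X$ for the $r$ largest eigenvalues, and the Liu estimator in its usual form $(X'X + I)^{-1}(X'y + d\,\hat{\beta}_{OLS})$. The function names, the synthetic data and the parameter values are assumptions; the other combined estimators mentioned above are not implemented here.

```python
import numpy as np

def r_k_class(X, y, r, k):
    """r-k class estimate: ridge shrinkage applied to the first r principal components."""
    eigvals, eigvecs = np.linalg.eigh(X.T @ X)
    order = np.argsort(eigvals)[::-1][:r]            # r largest eigenvalues
    T_r, lam = eigvecs[:, order], eigvals[order]
    # T_r' X'X T_r is diag(lam), so its ridge-regularized inverse is elementwise.
    return T_r @ ((T_r.T @ X.T @ y) / (lam + k))

def liu_estimator(X, y, d):
    """Liu estimate (X'X + I)^{-1} (X'y + d * beta_ols) for 0 < d < 1."""
    p = X.shape[1]
    beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
    return np.linalg.solve(X.T @ X + np.eye(p), X.T @ y + d * beta_ols)

# Synthetic collinear data (hypothetical).
rng = np.random.default_rng(5)
n = 80
z = rng.normal(size=n)
X = np.column_stack([z + 0.2 * rng.normal(size=n) for _ in range(3)])
X = X - X.mean(axis=0)
y = X @ np.array([1.0, 0.5, -0.5]) + rng.normal(size=n)
y = y - y.mean()
p = X.shape[1]

beta_ols = r_k_class(X, y, r=p, k=0.0)     # special case: ordinary least squares
beta_ridge = r_k_class(X, y, r=p, k=0.5)   # special case: ordinary ridge regression
beta_pcr = r_k_class(X, y, r=2, k=0.0)     # special case: principal component regression
beta_rk = r_k_class(X, y, r=2, k=0.5)      # genuinely combined estimator
beta_liu = liu_estimator(X, y, d=0.5)
print(beta_ols, beta_ridge, beta_pcr, beta_rk, beta_liu, sep="\n")
```

The three special cases are recovered simply by the choice of `r` and `k`, which is what makes the r-k class a convenient umbrella for the single estimators reviewed earlier.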

Idea of Partitioning and Extraction of the Explanatory Variables: Need for Research and Application
Previous studies on the subject have concentrated on developing estimators regardless of the effect of differences in the levels of multicollinearity among regressors. However, the increased concern for a sustainable solution to multicollinearity has compelled a reorientation of research focus towards the development of new ideas. In response, researchers in the field should examine the influence of partitioning the regressors according to their multicollinearity levels and extracting from each group to develop estimators of the parameters of a linear regression model when multicollinearity exists. The idea of partitioning and extracting in order to solve the multicollinearity problem was recently inspired by William (2015). From the review, it is clear that the application of partitioning and extraction of explanatory variables to multicollinearity problems has been grossly understudied and remains largely untested. Therefore, there is a need to encourage research on partitioning and extraction of the explanatory variables and to apply it to solve the multicollinearity problem sustainably.
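The review does not prescribe an algorithm for partitioning and extraction, so the following is only one hypothetical reading of the idea, offered to show what such a procedure might look like. It assumes that the multicollinearity level of each regressor is measured by its variance inflation factor (VIF), that a VIF above 10 marks the high-multicollinearity group, and that "extraction" means keeping the leading principal components of each group that explain at least 95% of that group's variance. The threshold, the share and all function names are assumptions, not part of the reviewed literature.

```python
import numpy as np

def vif(X):
    """Variance inflation factors: diagonal of the inverse correlation matrix."""
    return np.diag(np.linalg.inv(np.corrcoef(X, rowvar=False)))

def extract_components(G, share=0.95):
    """Leading principal components of group G explaining at least `share` of its variance."""
    G = G - G.mean(axis=0)
    _, s, Vt = np.linalg.svd(G, full_matrices=False)
    explained = np.cumsum(s ** 2) / np.sum(s ** 2)
    r = min(int(np.searchsorted(explained, share)) + 1, len(s))
    return G @ Vt[:r].T

def partition_and_extract(X, vif_threshold=10.0, share=0.95):
    """Split regressors into high- and low-VIF groups, then extract from each group."""
    v = vif(X)
    high = np.where(v > vif_threshold)[0]
    low = np.where(v <= vif_threshold)[0]
    blocks = [extract_components(X[:, idx], share) for idx in (high, low) if idx.size > 0]
    return np.hstack(blocks), high, low

# Synthetic data (hypothetical): three near-duplicate regressors and two independent ones.
rng = np.random.default_rng(6)
n = 200
z = rng.normal(size=n)
collinear = np.column_stack([z + 0.05 * rng.normal(size=n) for _ in range(3)])
independent = rng.normal(size=(n, 2))
X = np.hstack([collinear, independent])
y = X @ np.ones(5) + rng.normal(size=n)

Z, high, low = partition_and_extract(X)
gamma, *_ = np.linalg.lstsq(Z, y - y.mean(), rcond=None)   # regression on extracted variables
print("high-VIF group:", high, "low-VIF group:", low)
print("largest VIF before:", vif(X).max(), "after:", vif(Z).max())
```

Whether an estimator built this way outperforms ridge, principal component or partial least squares regression is exactly the open question the review raises; the sketch only shows that the extracted design is far better conditioned than the original one.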

Concluding Remarks: Suggested Research Lines that Will Tackle Multicollinearity Problems
This study reviewed old and new works on multicollinearity problems with the purpose of suggesting a sustainable solution. It contributes to existing knowledge by bringing to light the possible influence of partitioning the regressors according to their multicollinearity levels and extracting from each group to develop estimators of the parameters of a linear regression model when multicollinearity occurs. Researchers in the field of econometrics should capitalize on the findings highlighted in this review to develop innovative estimator(s) that will sustainably solve the problem of multicollinearity by addressing the following problems, which form the important suggestions for future work:
i. develop estimators for solving multicollinearity problems through principal component and partial least squares techniques employing partitioning and extraction of explanatory variables approaches.
ii. examine the performance of newly-developed method in many linear regression models with different degrees of multicollinearity using simulated data.
iii. compare the performance of the new method(s) with existing methods under different degrees of multicollinearity.
iv. identify situations where the new method will produce most efficient result of the model parameters.
v. apply the new method using real data and popularize for use.