CUTOFF THRESHOLD OF VARIABLE IMPORTANCE IN PROJECTION FOR VARIABLE SELECTION

Variable selection has gained prominence because it alleviates the burden of measuring many variables per sample. Partial least squares regression (PLS-R) and the Variable Importance in Projection (VIP) score are commonly combined for variable selection, and the typical rule is to select variables whose VIP score is greater than 1. Because a constant cutoff threshold is not suitable for every data structure, a new cutoff threshold for VIP in classification tasks is proposed and compared with the classical one through simulation. A total of 180 situations were generated from four parameters: the percentage of relevant variables, the magnitude of the mean difference of relevant variables between two groups, the degree of correlation between relevant variables, and the sample size. The results show that the new cutoff threshold identifies relevant variables better than the classical threshold, as indicated by a higher average balanced accuracy in most situations. AMS Subject Classification: 62H30


Introduction
Because of technological progress over the past decade, large amounts of data can now be accumulated. A data set with hundreds or thousands of attributes is called high-dimensional data. For example, microarray data, a large body of biological data on tissues, are derived from DNA microarray experiments. Such an experiment allows simultaneous measurement of tens of thousands of gene expression levels per sample. However, a microarray experiment usually contains fewer than one hundred samples. The number of genes (variables) in the data therefore far exceeds the number of samples. Such a data set presents great challenges in data analysis because some existing methods of data analysis cannot support it. Furthermore, not every gene holds relevant information. Only 5% of all genes contain relevant information about the grouping [1]. Therefore, selecting a subset of relevant genes and using only those genes in the subsequent data analysis is essential.
Variable selection is the process of determining relevant variables from the original variable set. It offers several advantages, such as avoiding overfitting, improving model performance, providing faster and more cost-effective models, and gaining deeper insight into the underlying processes. From the viewpoint of classification, variable selection methods fall into three categories: filter, wrapper and embedded methods. A good review of existing variable selection methods is given in [2]. For high-dimensional data such as microarray data, wrapper and embedded methods are time-consuming, in contrast to the filter method, which considers only the intrinsic properties of the data, independently of the classifier. Since the filter method is independent of the classification algorithm and is performed only once for all classification algorithms, it can be computed quickly and simply. Filter methods are divided into two types according to how they treat dependency between variables (univariate and multivariate). The univariate type treats each variable as independent of the other variables, while the multivariate type takes variable dependency into account when selecting the relevant variable subset.
The VIP is a measure that takes variable dependency into account, which is the benefit of a multivariate filter method. High-dimensional data usually involve correlation between variables, missing observations, or more variables than samples. Under these circumstances, the VIP score obtained by PLS-R has received increasing attention as a measure of the importance of each predictor variable [3], [4], [5]. The average of the squared VIP values is equal to 1. The criterion of a VIP value greater than 1 is therefore often used as a cutoff point for variable selection [3], [5], [6], [7]: predictor variables with a VIP score greater than 1 are selected. However, data structures are generally diverse, so the cutoff threshold should not be the same for different types of data structure [3]. Determining an appropriate cutoff threshold is not simple. A cutoff threshold that is too high leads to the omission of some crucial variables. Conversely, a cutoff threshold that is too low admits more unrelated variables.
In this study, a new cutoff threshold of VIP is proposed for identifying relevant variables. The threshold is estimated by boxplot-based outlier detection applied to the VIP scores of added noise variables.
The rest of this paper is organized as follows. Section 2 presents background and related works. Section 3 describes the methodology. The results and discussions are given in Section 4. Final conclusions are given in Section 5.

Background and Related Works
The Approach of PLS-R
Partial least squares (PLS) is the name of a set of algorithms developed in the 1960s and 1970s by Herman Wold to address problems in econometric path modeling. It was subsequently adopted by his son Svante Wold and colleagues in the 1980s for regression problems in chemometric and spectrometric modeling [8], where it is called partial least squares regression (PLS-R). The advantage of PLS-R is that it handles data sets with many noisy, collinear variables and missing values. Additionally, PLS-R requires no assumption about the error distribution [9]. The number of PLS-R applications is steadily increasing in research fields such as bioinformatics, machine learning and chemometrics [10].
PLS-R models the relationship between blocks of observed variables by means of latent variables, called components. These components are linear transformations of the original predictor variables that have high covariance with the response variables. For a single response variable $y$ and $p$ predictor variables $X$, $X$ and $y$ are decomposed on the basis of these components as in Equation 1 and Equation 2, respectively.
$$X = TP' + E \tag{1}$$
$$y = Tq' + f \tag{2}$$
where $T = [t_1, \ldots, t_h] \in \mathbb{R}^{n \times h}$ is the matrix of the $h$ components for the $n$ samples, and $P = [p_1, \ldots, p_h] \in \mathbb{R}^{p \times h}$ and $q = [q_1, \ldots, q_h] \in \mathbb{R}^{1 \times h}$ denote the loadings of $X$ and $y$, respectively. Generally, $P$ and $q$ are computed by ordinary least squares (OLS). $E$ and $f$ are the residuals of $X$ and $y$, respectively. The construction of the components is the major point of PLS-R. The components are the linear transformations of $X$ that maximize the covariance between the response variable $y$ and the components. The components are found sequentially. The first component ($t_1 = Xw_1$) is determined by maximizing the covariance between $y$ and $t_1$ under the constraint $\lVert w_1 \rVert = 1$. To extract each further component, the original $X$ and $y$ have to be replaced by their residuals. This process is called deflation of $X$ and $y$. The residuals of $X$ and $y$ for the first component are given by Equation 3 and Equation 4, respectively.
$$E_1 = X - t_1 p_1' \tag{3}$$
$$f_1 = y - t_1 q_1 \tag{4}$$
where $p_1$ and $q_1$ are the loadings obtained by OLS fitting. Likewise, the residuals of $X$ and $y$ for the $a$-th component are computed as in Equation 5 and Equation 6, respectively.
$$E_a = E_{a-1} - t_a p_a' \tag{5}$$
$$f_a = f_{a-1} - t_a q_a \tag{6}$$
where $E_0 = X$ and $f_0 = y$.
There are various approaches to PLS-R. The PLS-R described above is called PLS1. More details on the variants of PLS can be found in [11]. The particular algorithm of PLS1 is given in Figure 1. X and y are standardized to have mean 0 and unit variance before the procedure starts. The number of components (h) has to be determined in advance, and there are many techniques for choosing it. Some authors suggested fixing the number of components at three to five [12], [13], [14], whereas others recommended identifying the size of the space by the classification performance under cross-validation [15].
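The sequential extraction and deflation described above can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not a transcription of Figure 1: the function name `pls1`, the use of the sample standard deviation for scaling, and the set of returned arrays are choices made for this example only.

```python
import numpy as np

def pls1(X, y, h):
    """Illustrative PLS1 with h components (hypothetical helper).

    Assumes every column of X and the vector y have non-zero variance.
    """
    # Standardize X and y to mean 0 and unit variance.
    X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    y = (y - y.mean()) / y.std(ddof=1)
    n, p = X.shape
    E, f = X.copy(), y.copy()            # E_0 = X, f_0 = y
    W = np.zeros((p, h))                 # weight vectors w_a
    T = np.zeros((n, h))                 # components (scores) t_a
    P = np.zeros((p, h))                 # X loadings p_a
    q = np.zeros(h)                      # y loadings q_a
    for a in range(h):
        # The unit-norm weight maximizing cov(y, E w) is proportional to E'f.
        w = E.T @ f
        w /= np.linalg.norm(w)
        t = E @ w
        tt = t @ t
        p_a = E.T @ t / tt               # OLS loading of X on t
        q_a = f @ t / tt                 # OLS loading of y on t
        # Deflation: replace X and y by their residuals.
        E = E - np.outer(t, p_a)
        f = f - q_a * t
        W[:, a], T[:, a], P[:, a], q[a] = w, t, p_a, q_a
    return W, T, P, q, E, f
```

With this construction the standardized X equals `T @ P.T + E` after the loop, and the columns of `T` are mutually orthogonal, which is what allows each component's contribution to y to be assessed separately.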

The VIP Score
The VIP score, first published by Wold and others in 1993 [3], measures the explanatory power of the predictor variables with respect to the response variable, based on PLS-R. The VIP score of variable $j$ is calculated as in Equation 7.
$$\mathrm{VIP}_j = \sqrt{\frac{p \sum_{a=1}^{h} R^2(y, t_a)\,\left(w_{aj} / \lVert w_a \rVert\right)^2}{\sum_{a=1}^{h} R^2(y, t_a)}} \tag{7}$$
where $w_{aj}$ is the weight of the $j$-th predictor variable in component $a$ and $R^2(y, t_a)$ is the fraction of variance in $y$ explained by component $a$. A variable with a higher VIP score is more relevant for predicting the response variable.
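As an illustration, the VIP score can be computed from a weight matrix and the per-component explained variance with a few lines of NumPy. The sketch assumes the commonly used form of the score, in which each squared normalized weight is weighted by the fraction of variance in y explained by its component; the function name `vip_scores` and its argument layout are hypothetical.

```python
import numpy as np

def vip_scores(W, r2):
    """VIP score of each predictor (illustrative helper).

    W  : (p, h) array, column a holds the weights w_a of component a.
    r2 : length-h array, r2[a] is the fraction of variance in y
         explained by component a.
    """
    W = np.asarray(W, dtype=float)
    r2 = np.asarray(r2, dtype=float)
    p = W.shape[0]
    # Squared weights after scaling each column to unit length.
    w_sq = (W / np.linalg.norm(W, axis=0)) ** 2      # shape (p, h)
    return np.sqrt(p * (w_sq @ r2) / r2.sum())
```

Because each normalized weight vector has unit length, the squared scores returned by this function average to 1 over the p predictors, which is the property behind the classical cutoff of 1. If the components are mutually orthogonal, `r2[a]` could be taken as the squared correlation between y and component a, for example `np.corrcoef(y, T[:, a])[0, 1] ** 2`; that choice is an assumption of this example.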
Figure 1: Algorithm of PLS-R

Related Work
Two main problems are encountered when high-dimensional data are analyzed. Firstly, the number of predictors is larger than the sample size. Secondly, there is multicollinearity among the predictor variables. Therefore, irrelevant variables should be eliminated from the data set before analysis. The VIP has been used with microarray data to measure the importance of variables (genes) [16], [17], [18]. There are several ways of using the VIP. Most works selected variables with a VIP score above a constant value such as 1 [6], [16], or 2 [18]. Some studies, like [17], used the VIP score to rank variables and chose the top k. Others created a new significance index based on the VIP [6].
The proposed method is compared with the works mentioned above as follows. Generating noise variables by randomizing the order of the samples, adapted from [19], is compared with generating noise variables randomly. The use of the VIP for ranking variable importance is compared with the classical PLS-R coefficients [20], [21], [22], the weight vector ($w_1$) [19], and the t-statistic [12]. Finally, determining the cutoff threshold with a boxplot is compared with using the maximum value of the importance index of the noise variables [20], a percentile of the importance index of the noise variables [19], [20], and a range of the importance index based on Student's t-distribution [22].

Methodology
The proposed cutoff threshold is obtained in several steps. First, noise variables are added to the original data set, and the VIP scores of all variables are computed. VIP scores are always greater than or equal to 0, and the VIP scores of the noise variables ($\mathrm{VIP}_{noise}$) should be close to 0 because these variables are not relevant for predicting the response variable. However, by chance some $\mathrm{VIP}_{noise}$ values may lie far from 0, and these will probably be identified as outliers. Outliers are observations that are inconsistent with the other observations in the data set and are therefore unlikely to come from the same population. Therefore, values that are outliers with respect to $\mathrm{VIP}_{noise}$ are considered to be VIP scores of relevant variables. The cutoff threshold for detecting outliers, estimated with a boxplot, is thus applied to the selection of pertinent variables. The boxplot, introduced by Tukey [23], is a graphical display of data dispersion that indicates which observations are regarded as outliers. Because it requires no assumption about the underlying statistical distribution, the boxplot is a suitable method for detecting outliers of $\mathrm{VIP}_{noise}$. In addition, only the upper outlier limit is required, because a low VIP indicates that a variable is irrelevant. The Boxplot Cutoff Threshold (BCT) is defined in Equation 8.
$$\mathrm{BCT} = Q_3 + 1.5 \times \mathrm{IQR} \tag{8}$$
where $Q_1$ and $Q_3$ are the lower and upper quartiles of $\mathrm{VIP}_{noise}$, respectively, and $\mathrm{IQR} = Q_3 - Q_1$ is the interquartile range. The algorithm for selecting variables via VIP with BCT (VIP-BCT) is shown in Figure 2.
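The boxplot rule can be sketched as follows. The example assumes Tukey's conventional upper fence, Q3 plus 1.5 times the IQR, and NumPy's default (linear interpolation) quartiles; other quartile definitions give slightly different cutoffs. The function names `boxplot_cutoff` and `select_by_bct` are hypothetical, and the VIP scores of the original and noise variables are assumed to have been computed already.

```python
import numpy as np

def boxplot_cutoff(vip_noise, k=1.5):
    """Upper boxplot fence of the noise-variable VIP scores.

    k = 1.5 is Tukey's conventional multiplier (an assumption here).
    """
    q1, q3 = np.percentile(vip_noise, [25, 75])
    return q3 + k * (q3 - q1)

def select_by_bct(vip_orig, vip_noise, k=1.5):
    """Indices of original variables whose VIP exceeds the cutoff."""
    cutoff = boxplot_cutoff(vip_noise, k)
    return np.flatnonzero(np.asarray(vip_orig) > cutoff)

# Toy usage with made-up VIP values.
vip_noise = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
vip_orig = np.array([0.2, 0.9, 0.65, 1.4])
selected = select_by_bct(vip_orig, vip_noise)
```

Only the upper fence is computed, in line with the observation that low VIP values indicate irrelevant variables and need no separate lower limit.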

Design of Simulation
The VIP-BCT and VIP-1 algorithms were compared through a simulation study that focused on a binary classification problem. The vector of the binary response was defined as $y = (-1, \ldots, -1, 1, \ldots, 1)'$ and the matrix of predictor variables as $X = (X_1 | X_2)_{n \times p}$, where $n$ was the sample size, $p$ was the number of predictor variables (equal to 2,000), $X_1$ was the $n \times d$ matrix of the $d$ truly relevant variables and $X_2$ was the matrix of the remaining $p - d$ irrelevant variables. Since the normal distribution has been widely used for simulating gene expression data [24], the irrelevant variables $X_2$ were drawn independently from it, while the relevant variables $X_1$ were generated either from a different distribution or from the same distribution with distinguishable parameters. Thus, the irrelevant variables were drawn from a normal distribution with µ = 0 and σ = 1, and the relevant variables were generated from a multivariate normal distribution with mean and variance-covariance as described below.
Four parameters were varied in the simulation: the percentage of relevant variables (Prel), the magnitude of the mean difference of relevant variables between the two groups (Mdif), the degree of correlation between relevant variables (Σ), and the sample size (n).
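One possible way to generate a single simulated data set of this kind is sketched below. Several details are assumptions made for the example rather than settings taken from the study: the two groups are of (nearly) equal size, the relevant variables have mean 0 in the −1 group and mean `mdif` in the +1 group, and their covariance has unit variances with a common correlation `rho`. The function name `simulate_binary` is hypothetical.

```python
import numpy as np

def simulate_binary(n, p, prel, mdif, rho, seed=0):
    """Simulate X = (X1 | X2) and a binary response y (illustrative).

    prel : percentage of relevant variables (0 < prel <= 100)
    mdif : mean difference of relevant variables between groups
    rho  : common correlation among relevant variables (assumed form)
    """
    rng = np.random.default_rng(seed)
    d = int(round(p * prel / 100))        # number of relevant variables
    n_neg = n // 2
    y = np.concatenate([-np.ones(n_neg), np.ones(n - n_neg)])
    # Compound-symmetric covariance: 1 on the diagonal, rho elsewhere.
    cov = np.full((d, d), float(rho))
    np.fill_diagonal(cov, 1.0)
    # Shift the mean of the relevant variables in the +1 group only.
    shift = np.where(y[:, None] > 0, mdif, 0.0)
    X1 = rng.multivariate_normal(np.zeros(d), cov, size=n) + shift
    # Irrelevant variables: independent standard normal draws.
    X2 = rng.standard_normal((n, p - d))
    return np.hstack([X1, X2]), y, d

X, y, d = simulate_binary(n=40, p=200, prel=5, mdif=2.0, rho=0.5)
```

The first `d` columns of the returned matrix are the truly relevant variables, so the set of relevant indices is known and can be used to score any selection rule.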

Measure of Performance
Balanced accuracy was used to evaluate and compare the performance of the two cutoff threshold algorithms in variable selection. It is defined as the mean of sensitivity and specificity. Sensitivity is the ratio of the number of relevant variables classified correctly to the total number of relevant variables, while specificity is the ratio of the number of irrelevant variables classified correctly to the total number of irrelevant variables. Since the numbers of relevant and irrelevant variables were not equal here, balanced accuracy was chosen for evaluation instead of ordinary accuracy, in order to avoid inflated performance estimates on unbalanced data sets.
Table 1 displays the confusion matrix for balanced accuracy and descriptions of its entries.
From Table 1, $a_1$ is the number of relevant variables classified correctly, $a_2$ is the number of relevant variables classified incorrectly, $a_3$ is the number of irrelevant variables classified incorrectly and $a_4$ is the number of irrelevant variables classified correctly. Thus, sensitivity, specificity and balanced accuracy are respectively calculated as follows.
$$\text{Sensitivity} = \frac{a_1}{a_1 + a_2}, \qquad \text{Specificity} = \frac{a_4}{a_3 + a_4}, \qquad \text{Balanced accuracy} = \frac{\text{Sensitivity} + \text{Specificity}}{2}$$
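A small worked example may make the measure concrete. The sketch below counts the four confusion-matrix entries from a set of selected variable indices and a set of truly relevant indices; the function name `balanced_accuracy` and the index-set representation are assumptions of this example, and it presumes at least one relevant and one irrelevant variable.

```python
def balanced_accuracy(selected, relevant, p):
    """Balanced accuracy of a variable selection (illustrative).

    selected : indices of variables chosen by the cutoff rule
    relevant : indices of the truly relevant variables
    p        : total number of predictor variables
    """
    selected, relevant = set(selected), set(relevant)
    a1 = len(selected & relevant)     # relevant, classified correctly
    a2 = len(relevant - selected)     # relevant, classified incorrectly
    a3 = len(selected - relevant)     # irrelevant, classified incorrectly
    a4 = p - a1 - a2 - a3             # irrelevant, classified correctly
    sensitivity = a1 / (a1 + a2)
    specificity = a4 / (a3 + a4)
    return (sensitivity + specificity) / 2

# 10 variables, 2 truly relevant, 3 selected (1 of them relevant):
# sensitivity = 1/2, specificity = 6/8, balanced accuracy = 0.625.
score = balanced_accuracy(selected={0, 2, 3}, relevant={0, 1}, p=10)
```

In this toy case ordinary accuracy would be 7/10, which overstates performance because most variables are irrelevant; balanced accuracy weights the two classes equally.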

Results and Discussions
Three components were retained, and 200 replications were made for each of the 180 situations to compare the performance of the two algorithms. The balanced accuracy of the two cutoff thresholds across the cases is shown in Table 2, where bold figures denote the best performance. In most cases, the VIP-BCT outperforms the VIP-1. The advantage of the VIP-BCT is most evident when Prel is low, as shown in Figure 4 (a). Under the VIP-1, variables with VIP values greater than 1 were selected. Figure 5 (b) and (c) plot the VIP-BCT for the original predictor variables and the noise variables, respectively. Its cutoff, calculated from the VIP of the noise variables and shown as the red dashed line in Figure 5 (c), was higher than that of the VIP-1. As a result, the VIP-BCT cutoff threshold is more selective. Note that the VIP
The cutoff threshold of the VIP-BCT was greater than 1 in all cases, but it tended to decrease as Prel, Mdif and n increased. This is consistent with [3] with respect to Prel: when Prel was low, the proper cutoff value needed to be greater than 1. For this reason, the VIP-BCT cutoff threshold clearly outperformed the VIP-1 when Prel was low. The average cutoff of the VIP-BCT across the cases is shown in Table 3.

Conclusions
In this study, the new VIP-BCT cutoff threshold was compared with the traditional VIP-1 threshold across 180 simulated situations. The experiment was designed by varying four parameters: Prel, Mdif, Σ and n. The results demonstrate that, in most cases, the VIP-BCT delivers better balanced accuracy than the VIP-1 and is therefore better at identifying relevant variables. Appropriate cutoff values of VIP should differ depending on the data structure. The cutoff values need to be greater than 1, especially when Prel, Mdif and n are low, and they appear to increase as these three parameters decrease. There are various measures for ranking variable importance, and there is usually no explicit rule for estimating a suitable number of variables for those measures. The BCT could be applied as the cutoff threshold for any such measure, and the results of that application should be studied.

Figure 2: Algorithm of the VIP-BCT

Figure 4 (b)-(e) show the average balanced accuracy of the two cutoff thresholds according to the remaining parameters. All five figures confirm again that the VIP-BCT cutoff threshold can beat the VIP-1 cutoff threshold.

Figure 4: Average balanced accuracy of the VIP-BCT and the VIP-1 according to each of four parameters, (a) Prel, (b) Mdif, (c) n and (d) Σ

Table 2: Balanced accuracy of the VIP-1 and the VIP-BCT

Table 3: The average cutoff threshold of the VIP-BCT