Current problems in different scientific fields deal with a large number of variables and are analysed by means of multivariate techniques. However, not all the variables are usually necessary or even useful for the development of mathematical models. Therefore, it is important to identify the best (or optimal) subset of variables. When the generation of all the possible combinations of variables, the All Subset Models approach, is not feasible, then one must use other variable selection techniques. Several different variable selection methods have been proposed.
The Reshaped Sequential Replacement algorithm adopts the Sequential Replacement method proposed by Miller in 1984 in its inner core, but implements new functionalities meant to face different issues, such as computational time, model pathologies and validation.
Sequential Replacement algorithm
The Sequential Replacement method defined by Miller in 1984 is based on a very simple algorithm. Each combination of variables, i.e. a model, is represented by a vector, called seed, which specificies the variables included in the model and is associated with a parameter representing the quality of the model. The basic idea is to replace each variable included in a model of size M (with M ≤ p, p being the total number of variables) one at a time with each of the remaining variables and see whether a better model is obtained. All the variables in the model are replaced with all the remainders and the new seed is chosen only after all the variables have been replaced and the obtained models have been compared. The seed that shows the best quality is chosen as new starting seed and the replacement procedure is reiterated till convergence, i.e. till no replacement gives an improvement of the quality. To this end, the size of each seed, i.e. the number of variables included in the model (M), is predefined and is kept constant throughout the calculation. An illustration of the algorithm is given in the figure below.
Reshaped Sequential Replacement
As aforementioned, the Reshaped Sequential Replacement algorithm is based on the Sequential Replacement approach, but implements some functionalities aimed to: a) decrease the calculation time; b) introduce state-of-the-art validation tools; c) increase the probability to converge to the optimal model; d) identify models that suffer from pathologies, such as overfitting, chance correlation, variable redundancy and collinearity between predictors.
The functions able to "reshape" the original method are:
- Coefficient of determination (Q2cv) or Non-Error Rate (NERcv) in cross-validation: these parameters are used
as fitness functions for regression and classification problems, respectively.
- Tabu list: this function is a coarse-grained filter that allows the preliminary exclusion of variables that are supposed not to be relevant. The assessment of the relevance is based on the univariate Q2cv and on the Canonical Measure of Correlation (CMC) index in regression and classification, respectively. In other words, variables with a negative univariate Q2cv or a CMC smaller than 0.3 are included in the Tabu list and, as a consequence, are not considered in the initial replacement procedure. When the algorithm reaches convergence, tabu variables are recovered. The inclusion of tabu variables in the models is allowed only if they provide a significant improvement, i.e. an increase in the Q2cv or NERcv larger than 0.01.
- Roulette Wheel: this is an algorithm biased towards optimal solutions. It is already used in Genetic Algorithms to select promising parent chromosomes. In the RSR method it is used to select variables for the generation of the initial population. When the Roulette Wheel is active, the initial population is not generated randomly, but each variable is given a probability to enter the initial models proportional to its univariate Q2cv or CMC in regression and classification, respectively. Only variables with a positive Q2cv are considered in the regression case.
- QUIK rule: this function is carried out only with OLS regression and aims at detecting models affected by collinearity between variables.
The QUIK rule is based on the K multivariate correlation index and is defined as:
if Kxy - Kx < δK ⇒ reject model
where δK is a user-defined threshold (e.g. 0.01-0.05, options.thr_quik in the toolbox). The higher the value, the stricter the criterion and therefore the larger the number of rejected models.
The Tabu list, Roulette Wheel and QUIK rule are the functions that affect the search ability of the algorithm. Several other functions are, instead, applied only on the final population of models and are meant to support the analysis and comparison of the final models.
The functions applied only on the final population of models are:
- R-function based rules: two functions, RP and RN, were implemented with the aim to identify two different model pathologies.
The rule based on the RP function aims at detecting models with a redundancy of explanatory variables. The rule is defined as:
if RP < tP ⇒ reject model
where tP is a user defined threshold (options.thr_RP in the toolbox) ranging from 0.01 to 0.1 depending on the data. Lower values of the threshold correspond to a stricter test.
The rule based on the RN function aims, instead, at detecting models affected by the presence of noisy variables. The rule is defined as:
if RN < tN ⇒ reject model
where tN is a threshold value ranging from -0.01 to -0.1, which depends on a tunable parameter ε (options.thr_RN in the toolbox) that defines the level of noise. Higher values of ε correspond to higher values of the tN threshold and therefore to a stricter rule (more variables will be regarded as noisy and the corresponding models will be rejected). Note that the rule based on the RN function was re-defined in Cassotti et al (2014) because the original formulation seemed too strict.
- Y-scrambling: this is a permutation test used to check the presence of chance correlation between predictors and response. The Y-scrambling test is carried out by scrambling the values in the response vector in such a way that each row of the X matrix is no longer associated with its correct response value. The test is carried out 100 times and only for the regression case.
- Random Error Rate (RER) test: for the classification case, the RER test is carried out, in spite of the Y-scrambling. The real Error Rates (ER) are compared
with the random classification error, i.e. the error rate obtained if the objects are randomly assigned to the classes. The Random Error Rate is calculated using the a priori probability (options.prob_type in the toolbox), which can be either uniform or proportional to the number of objects in each class. A model is rejected if:
RER - ERcv < 0.01
- Nested models test: this test allows to detect "nested" models. A model F can be defined as "nested" if there is a model G of higher size (i.e. including more variables) that comprises all the variables of F and has a very similar performance. If this occurs, model G is rejected because its higher complexity is not balanced by a better performance, measured by the Q2cv or NERcv in regression and classification, respectively. The test is defined as:
if Q2cv(G) - Q2cv(F) < thr or Q2cv(G) - Q2cv(F) < 0 ⇒ reject model
where thr is a user defined threshold value (options.thr_nested in the toolbox). A suggested threshold for nested models is 0.005. The larger the value of the threshold, the stricter the criterion because models will more easily be identified as nested. In classification the test is carried out in the same fashion, just using the NERcv in spite of the Q2cv. Note that the absolute value was removed compared to equations (15) and (16) of Cassotti et al (2014) in order to detect all the cases where the model with more variables (G) has a lower performance than the model with less variables (F).
- Canonical Model Correlation (CMC) and Canonical Model Distance (CMD) indices: CMC and CMD indices are used to represent correlations and distances, respectively, between models. CMC and CMD allow an easy comparison of the final models and are useful to determine whether models with different variables are really different in their nature.
For further information on the theory, please see the toolbox references.