Introduction

Many current problems across scientific fields involve a large number of variables and are analysed by means of multivariate techniques. However, not all of the variables are usually necessary, or even useful, for the development of mathematical models. It is therefore important to identify the best (or optimal) subset of variables. When generating all possible combinations of variables (the All Subset Models approach) is not feasible, other variable selection techniques must be used, and several such methods have been proposed.

The Reshaped Sequential Replacement algorithm retains the Sequential Replacement method proposed by Miller in 1984 at its core, but implements new functionalities meant to address different issues, such as computational time, model pathologies and validation.

[-> top]

Sequential Replacement algorithm

The Sequential Replacement method defined by Miller in 1984 is based on a very simple
algorithm. Each combination of variables, i.e. a model, is represented by a vector, called *seed*, which specifies the variables
included in the model and is associated with a parameter representing the quality of the model. The basic idea is to replace each variable
included in a model of size *M* (with *M* ≤ *p*, *p* being the total number of variables), one at a time, with
each of the remaining variables and check whether a better model is obtained. All the variables in the model are replaced with all the remaining ones,
and the new seed is chosen only after every variable has been replaced and the resulting models have been compared. The seed with the
best quality is chosen as the new starting seed and the replacement procedure is reiterated until convergence, i.e. until no replacement improves
the quality. Note that the size of each seed, i.e. the number of variables included in the model (*M*), is predefined
and is kept constant throughout the calculation. An illustration of the algorithm is given in the figure below.
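The replacement loop described above can be sketched as follows. This is a minimal illustration, not the toolbox implementation: `fitness` is a placeholder for any model-quality score (e.g. a cross-validated Q²cv) and is not part of Miller's original formulation.

```python
import numpy as np

def sequential_replacement(X, y, seed, fitness):
    """One Sequential Replacement run for a fixed model size M.

    X       : (n, p) predictor matrix
    y       : (n,) response vector
    seed    : list of M column indices (the starting model)
    fitness : callable scoring a variable subset, higher is better
              (illustrative stand-in for the toolbox's fitness function)
    """
    p = X.shape[1]
    seed = list(seed)
    best_score = fitness(X, y, seed)
    improved = True
    while improved:                      # iterate until convergence
        improved = False
        best_seed = seed
        for pos in range(len(seed)):     # replace each variable in turn...
            for var in range(p):
                if var in seed:
                    continue
                candidate = seed.copy()
                candidate[pos] = var     # ...with each remaining variable
                score = fitness(X, y, candidate)
                if score > best_score:   # remember the best candidate seen
                    best_score, best_seed = score, candidate
        # the new seed is chosen only after ALL replacements were tried
        if best_seed != seed:
            seed, improved = best_seed, True
    return sorted(seed), best_score
```

With a fitness that rewards explained variance, starting from any seed of size 1 on data where a single column fully determines the response, the loop converges to that column.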

[-> top]

Reshaped Sequential Replacement

As mentioned above, the Reshaped Sequential Replacement algorithm is based on the Sequential Replacement approach, but implements some
functionalities aimed at: a) decreasing the calculation time; b) introducing state-of-the-art validation tools; c) increasing the probability of
converging to the optimal model; d) identifying models that suffer from pathologies, such as overfitting, chance correlation, variable redundancy
and collinearity between predictors.

The functions able to "reshape" the original method are:

- Coefficient of determination (Q²cv) or Non-Error Rate (NERcv) in cross-validation: these parameters are used as fitness functions for regression and classification problems, respectively.

- Tabu list: this function is a coarse-grained filter that allows the preliminary exclusion of variables that are presumed not to be relevant.
The assessment of relevance is based on the univariate Q²cv and on the Canonical Measure of Correlation (*CMC*) index in regression and classification, respectively. In other words, variables with a negative univariate Q²cv or a *CMC* smaller than 0.3 are included in the Tabu list and, as a consequence, are not considered in the initial replacement procedure. When the algorithm reaches convergence, tabu variables are recovered. The inclusion of tabu variables in the models is allowed only if they provide a significant improvement, i.e. an increase in the Q²cv or NERcv larger than 0.01.

- Roulette Wheel: this is a selection algorithm biased towards optimal solutions, already used in Genetic Algorithms to select promising
parent chromosomes. In the RSR method it is used to select variables for the generation of the initial population. When the Roulette Wheel
is active, the initial population is not generated randomly; instead, each variable is given a probability of entering the initial models proportional to its univariate Q²cv or *CMC* in regression and classification, respectively. Only variables with a positive Q²cv are considered in the regression case.

- QUIK rule: this function is applied only with OLS regression and aims at detecting models affected by collinearity between variables.
The QUIK rule is based on the *K* multivariate correlation index and is defined as:

if Kxy - Kx < δK ⇒ reject model

where δK is a user-defined threshold (e.g. 0.01-0.05, *options.thr_quik* in the toolbox). The higher the value, the stricter the criterion and therefore the larger the number of rejected models.
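The QUIK rule can be sketched as below. The K-index formula used here is a common formulation based on the eigenvalues of the correlation matrix of the variables; the toolbox's exact implementation may differ, and `quik_rule` is an illustrative name.

```python
import numpy as np

def k_index(A):
    """K multivariate correlation index of the columns of A (a common
    formulation: spread of the eigenvalues of the correlation matrix;
    0 = uncorrelated columns, 1 = perfectly correlated columns)."""
    m = A.shape[1]
    lam = np.linalg.eigvalsh(np.corrcoef(A, rowvar=False))
    lam = np.clip(lam, 0.0, None)            # guard tiny negative eigenvalues
    return float(np.sum(np.abs(lam / lam.sum() - 1.0 / m)) / (2.0 * (m - 1) / m))

def quik_rule(X, y, thr_quik=0.05):
    """QUIK rule: reject the model if Kxy - Kx < thr_quik."""
    Kx = k_index(X)                           # K of the predictors only
    Kxy = k_index(np.column_stack([X, y]))    # K of predictors plus response
    return bool(Kxy - Kx >= thr_quik)         # True -> model accepted
```

A model whose predictors jointly correlate with the response raises Kxy well above Kx and passes; when adding the response barely changes the index, the model is rejected.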

The Tabu list, Roulette Wheel and QUIK rule are the functions that affect the search ability of the algorithm. Several other functions are instead applied only to the final population of models and are meant to support the analysis and comparison of the final models.
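Of these search-phase functions, the Roulette Wheel is easily sketched in isolation. The snippet below draws initial variable subsets with probability proportional to a per-variable fitness (standing in for the univariate Q²cv), excluding non-positive values as in the regression case; the function and argument names are illustrative, not the toolbox's.

```python
import numpy as np

def roulette_wheel_init(fitness, n_models, model_size, rng=None):
    """Fitness-proportional (roulette wheel) sampling of variable subsets
    for the initial population. `fitness` is one score per variable,
    e.g. the univariate Q2cv; non-positive scores are excluded."""
    rng = np.random.default_rng(rng)
    fitness = np.asarray(fitness, dtype=float)
    fitness = np.where(fitness > 0, fitness, 0.0)   # drop non-positive Q2cv
    prob = fitness / fitness.sum()                  # selection probabilities
    return [sorted(rng.choice(len(fitness), size=model_size,
                              replace=False, p=prob))
            for _ in range(n_models)]
```

Variables with higher univariate fitness enter more of the initial models, biasing the starting seeds towards promising regions of the variable space without fixing them deterministically.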

The functions applied only to the final population of models are:

- R-function based rules: two functions, R^P and R^N, were implemented with the aim of identifying two different model pathologies.

The rule based on the R^P function aims at detecting models with a redundancy of explanatory variables. The rule is defined as:

if R^P < t^P ⇒ reject model

where t^P is a user-defined threshold (*options.thr_RP* in the toolbox) ranging from 0.01 to 0.1 depending on the data. Higher values of the threshold correspond to a stricter test, since more models fall below it and are rejected.

The rule based on the R^N function aims, instead, at detecting models affected by the presence of noisy variables. The rule is defined as:

if R^N < t^N ⇒ reject model

where t^N is a threshold value ranging from -0.01 to -0.1, which depends on a tunable parameter ε (*options.thr_RN* in the toolbox) that defines the level of noise. Higher values of ε correspond to higher values of the t^N threshold and therefore to a stricter rule (more variables will be regarded as noisy and the corresponding models will be rejected). Note that the rule based on the R^N function was re-defined in Cassotti *et al.* (2014) because the original formulation seemed too strict.

- Y-scrambling: this is a permutation test used to check for chance correlation between predictors and response. The Y-scrambling test is carried out by scrambling the values in the response vector in such a way that each row of the
**X** matrix is no longer associated with its correct response value. The test is repeated 100 times and is carried out only in the regression case.

- Random Error Rate (RER) test: in the classification case, the RER test is carried out instead of Y-scrambling. The real Error Rates (ER) are compared
with the random classification error, i.e. the error rate obtained if the objects are randomly assigned to the classes. The Random Error Rate is calculated using the
*a priori* probability (*options.prob_type* in the toolbox), which can be either uniform or proportional to the number of objects in each class. A model is rejected if:

RER - ERcv < 0.01
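One plausible formulation of the RER comparison is sketched below; the toolbox's exact definition of the random error may differ, and the function names are illustrative. The expected accuracy of a random assignment is the dot product of the assignment priors with the observed class frequencies.

```python
import numpy as np

def random_error_rate(y, prob_type="proportional"):
    """Expected error of a random class assignment (a plausible
    formulation).  prob_type:
      'uniform'      - each class drawn with probability 1/G
      'proportional' - each class drawn with its observed frequency"""
    _, counts = np.unique(y, return_counts=True)
    freq = counts / counts.sum()               # observed class frequencies
    if prob_type == "uniform":
        prior = np.full(len(counts), 1.0 / len(counts))
    else:
        prior = freq
    # P(correct) = sum_g P(assign g) * P(true class is g)
    return 1.0 - float(prior @ freq)

def rer_test(er_cv, y, prob_type="proportional", thr=0.01):
    """Keep the model only if RER - ERcv >= 0.01."""
    return random_error_rate(y, prob_type) - er_cv >= thr
```

A classifier whose cross-validated error sits within 0.01 of the random error is doing no better than chance and is rejected.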

- Nested models test: this test detects "nested" models. A model F is said to be nested in a model G of larger size (i.e. including more variables) if G comprises all the variables of F and has a very similar performance. When this occurs, model G is rejected because its higher complexity is not balanced by a better performance, measured by the Q²cv or NERcv in regression and classification, respectively. The test is defined as:

if Q²cv(G) - Q²cv(F) < thr or Q²cv(G) - Q²cv(F) < 0 ⇒ reject model G

where *thr* is a user-defined threshold value (*options.thr_nested* in the toolbox). A suggested threshold for nested models is 0.005. The larger the value of the threshold, the stricter the criterion, because models will more easily be identified as nested. In classification the test is carried out in the same fashion, using the NERcv instead of the Q²cv. Note that the absolute value was removed compared to equations (15) and (16) of Cassotti *et al.* (2014) in order to detect all the cases where the model with more variables (G) has a lower performance than the model with fewer variables (F).

- Canonical Model Correlation (*CMC*) and Canonical Model Distance (*CMD*) indices: the *CMC* and *CMD* indices are used to represent correlations and distances, respectively, between models. They allow an easy comparison of the final models and are useful to determine whether models with different variables are really different in nature.
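The nested models rule is simple enough to state directly in code. The sketch below applies the rejection condition above to one pair of models; names and calling convention are illustrative, not the toolbox's.

```python
def nested_rule(subset_f, q2_f, subset_g, q2_g, thr_nested=0.005):
    """Reject the larger model G when it nests F without a payoff:
    reject if Q2cv(G) - Q2cv(F) < thr or Q2cv(G) - Q2cv(F) < 0.
    Returns True when G should be rejected."""
    if not (set(subset_f) < set(subset_g)):
        return False                 # G does not strictly nest F: keep it
    gain = q2_g - q2_f               # performance bought by the extra variables
    return gain < thr_nested or gain < 0
```

In a final population, this rule would be applied to every pair of models where one variable set strictly contains the other; for classification, NERcv values replace the Q²cv values.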

For further information on the theory, please see the toolbox references.

[-> top]