RQK fitness functions for regression models

When regression models are searched for by evolutionary methods, optimising only the leave-one-out explained variance, Q2, has been shown to be often overoptimistic and to fail to yield optimally predictive models. The final selected models often turned out to be less predictive than expected when more severe validation was applied. Analysis of these models showed that chance correlation and noisy variables are frequently the cause of their lack of predictivity.
The RQK function is a new fitness function for model searching, proposed to avoid unwanted model properties such as chance correlation, the presence of noisy variables in the models, and other model pathologies that reduce predictive power [R. Todeschini, V. Consonni, A. Mauri, M. Pavan: Analytica Chimica Acta, 2004, 515, 199-208]. It is a constrained fitness function based on the Q2 statistic (or another fitness function) and four tests that must be fulfilled simultaneously. When the RQK function is used in an evolutionary algorithm that searches for an optimal model population, the chosen fitness function is maximised and a model is accepted only if all of the following tests are satisfied:

  • QUIK rule : K_XY - K_X > δK (otherwise, high predictor multicollinearity)

  • Asymptotic Q2 rule : Q2_LOO - Q2_ASYM > δQ (otherwise, doubtful predictive ability)

  • Redundancy RP rule : R^P > t^P (otherwise, redundancy in explanatory variables)

  • Overfitting RN rule : R^N > t^N (otherwise, overfitting due to noisy variables)
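
The acceptance logic of the four tests above can be sketched in a few lines of Python. This is only an illustration of a constrained fitness function, not the MOBYDIGS implementation: the class and function names are hypothetical, the statistics are assumed to be computed elsewhere, and the threshold values in the usage example are placeholders chosen for the example rather than recommended settings.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ModelStats:
    """Precomputed statistics of one candidate model (hypothetical container)."""
    q2_loo: float   # leave-one-out Q2
    q2_asym: float  # asymptotic Q2
    k_xy: float     # K_XY
    k_x: float      # K_X
    r_p: float      # R^P
    r_n: float      # R^N


@dataclass
class Thresholds:
    """User-chosen thresholds for the four tests (values are assumptions)."""
    delta_k: float
    delta_q: float
    t_p: float
    t_n: float


def failed_rules(m: ModelStats, t: Thresholds) -> List[str]:
    """Return the names of the tests that the model does not satisfy."""
    failed = []
    if not (m.k_xy - m.k_x > t.delta_k):
        failed.append("QUIK")           # high predictor multicollinearity
    if not (m.q2_loo - m.q2_asym > t.delta_q):
        failed.append("asymptotic Q2")  # doubtful predictive ability
    if not (m.r_p > t.t_p):
        failed.append("RP")             # redundancy in explanatory variables
    if not (m.r_n > t.t_n):
        failed.append("RN")             # overfitting due to noisy variables
    return failed


def rqk_fitness(m: ModelStats, t: Thresholds) -> Optional[float]:
    """Return Q2_LOO if all four tests pass, otherwise None (model rejected)."""
    return m.q2_loo if not failed_rules(m, t) else None


# Usage with placeholder numbers (illustrative only).
thr = Thresholds(delta_k=0.05, delta_q=-0.005, t_p=0.1, t_n=-0.3)
good = ModelStats(q2_loo=0.80, q2_asym=0.78, k_xy=0.45, k_x=0.30, r_p=0.5, r_n=-0.1)
bad = ModelStats(q2_loo=0.85, q2_asym=0.78, k_xy=0.31, k_x=0.30, r_p=0.5, r_n=-0.5)

print(rqk_fitness(good, thr))  # accepted: its Q2_LOO is used as fitness
print(failed_rules(bad, thr))  # rejected despite the higher Q2_LOO
```

In an evolutionary search, a rejected model would simply be excluded from the population regardless of how high its Q2 is, which is what distinguishes this constrained scheme from maximising Q2 alone.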

RQK fitness functions were recently proposed with the aim of highlighting pathologies present in regression models.
These functions have been implemented in the MOBYDIGS software for variable subset selection by genetic algorithms.