Publications on Chemometrics and QSAR topics
The full list of publications by researchers at the Milano Chemometrics and QSAR Research Group is available here.
We develop and apply chemometric methods to address real problems in chemistry, toxicology, pharmacology and the environmental sciences.
Special research interests are Genetic Algorithms, Kohonen artificial neural networks, multicriteria decision making, chemometric tools for image analysis, cluster analysis, validation procedures and variable selection in both regression and classification problems.
A new classification method has been proposed: CAIMAN (Classification And Influence Matrix ANalysis), which exploits the properties of the diagonal terms of the influence matrix, also called leverages. You can read more about CAIMAN and download the CAIMAN MATLAB modules here.
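The leverages mentioned above can be illustrated with a minimal Python sketch. It only computes the diagonal of the influence (hat) matrix H = X (X'X)^-1 X' for a column-centred data matrix; it does not reproduce the CAIMAN classifier itself. The function name `leverages` and the random example data are assumptions made for illustration.

```python
import numpy as np

def leverages(X):
    """Diagonal of the influence (hat) matrix H = X (X'X)^-1 X'."""
    X = np.asarray(X, dtype=float)
    # The pseudo-inverse keeps the sketch usable when X'X is ill-conditioned.
    core = np.linalg.pinv(X.T @ X)
    # Row-wise quadratic form x_i' core x_i, without building the n x n matrix.
    return np.einsum("ij,jk,ik->i", X, core, X)

# Hypothetical data: 20 objects described by 3 variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
Xc = X - X.mean(axis=0)  # column centring
h = leverages(Xc)
```

Each leverage lies between 0 and 1, and the leverages sum to the rank of the centred matrix, so objects with values well above the average are far from the centre of the model space.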
A toolbox to calculate several classification models has been developed for MATLAB and can be downloaded. The Classification toolbox (for MATLAB) is a collection of MATLAB modules for calculating classification (supervised pattern recognition) multivariate models: Discriminant Analysis, Partial Least Squares Discriminant Analysis (PLSDA), Classification trees (CART), K-Nearest Neighbors (kNN), Support Vector Machines (SVM), Potential Functions and Soft Independent Modeling of Class Analogy (SIMCA).
Moreover, a collection of MATLAB modules for calculating unsupervised multivariate models for data structure analysis is available (PCA toolbox for MATLAB): Principal Component Analysis (PCA), Multidimensional Scaling (MDS) and Cluster Analysis.
 Molecular Descriptors
Some activities are aimed at the development of new theoretically-based molecular descriptors and at the evaluation of their ability to model different physico-chemical, biological and environmental responses.
In recent years, new 3D molecular descriptors have been developed: WHIM (Weighted Holistic Invariant Molecular descriptors) and G-WHIM (Grid-Weighted Holistic Invariant Molecular descriptors). GETAWAY (GEometry, Topology and Atoms-Weighted AssemblY) descriptors, which also account for local molecular properties, are the most recently proposed.
Moreover, the second edition of the Handbook of Molecular Descriptors (Molecular Descriptors for Chemoinformatics by Roberto Todeschini and Viviana Consonni) has recently been published by Wiley-VCH. It is an encyclopedic collection of the molecular descriptors proposed since the beginnings of the field. About 3300 definitions, presented in alphabetical order, allow not only rapid consultation but also organized learning of the algorithms, meanings and tables of molecular descriptors, QSAR strategies, and other related topics.
In the last year, the MOLE db - Molecular Descriptors Data Base has been released. This is a free on-line database comprising 1124 molecular descriptors calculated on 234773 molecules of the NCI database.
The DRAGON software is available for the calculation of molecular descriptors and includes WHIM and GETAWAY descriptors.
Milano Chemometrics has been involved in several projects related to the use of QSAR for the REACH registration of chemicals. Several quantitative structure-activity and structure-property relationships (QSAR and QSPR) are investigated with the main purpose of evaluating and monitoring both chemometric methods and molecular descriptors. In particular, the Variable Subset Selection approach based on Genetic Algorithms (VSS-GA) has been widely used to search for the best regression models, using the MOBYDIGS software. A family of new fitness functions (RQK functions) has been proposed to avoid some model pathologies when searching for the best regression models.
Recently, we have studied the problem of evaluating the predictive ability of QSAR models. The formula for the predictive squared correlation coefficient Q2 adopted by the current OECD guidelines on QSAR validation is based on the sum of squares (SS) of the external test set responses about the training set response mean. An alternative formula has been proposed in which the reference SS is the sum of squared deviations of the observed values from the training set mean, computed over the training set instead of the external evaluation set. This formula appears to be independent of the distribution of the external objects and satisfies the ergodic property, a condition that we consider a fundamental requirement. Details on these studies can be found in the following papers:
V. Consonni, D. Ballabio, R. Todeschini, Evaluation of model predictive ability by external validation techniques, Journal of Chemometrics (2010), 24, 194-201
V. Consonni, D. Ballabio, R. Todeschini, Comments on the definition of the Q2 parameter for QSAR validation, Journal of Chemical Information and Modeling (2009), 49, 1669-1678
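The difference between the two Q2 formulas described above can be sketched in Python. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, and the per-object averaging in the second function (dividing each sum of squares by the size of its own set, so that the two sets may differ in size) is our reading of the definition rather than something stated in the text above.

```python
import numpy as np

def q2_external_ss(y_train, y_ext, y_ext_pred):
    """Q2 whose reference SS is taken over the external set, about the training mean."""
    y_train = np.asarray(y_train, dtype=float)
    y_ext = np.asarray(y_ext, dtype=float)
    y_ext_pred = np.asarray(y_ext_pred, dtype=float)
    press = np.sum((y_ext - y_ext_pred) ** 2)
    ss_ref = np.sum((y_ext - y_train.mean()) ** 2)
    return 1.0 - press / ss_ref

def q2_training_ss(y_train, y_ext, y_ext_pred):
    """Q2 whose reference SS is taken over the training set (assumed per-object averaging)."""
    y_train = np.asarray(y_train, dtype=float)
    y_ext = np.asarray(y_ext, dtype=float)
    y_ext_pred = np.asarray(y_ext_pred, dtype=float)
    press_per_object = np.mean((y_ext - y_ext_pred) ** 2)
    ss_per_object = np.mean((y_train - y_train.mean()) ** 2)
    return 1.0 - press_per_object / ss_per_object

# Hypothetical responses: the external objects lie close to the training mean.
y_train = [1.0, 2.0, 3.0, 4.0, 5.0]
y_ext = [2.0, 4.0]
y_ext_pred = [2.5, 3.5]
q2_a = q2_external_ss(y_train, y_ext, y_ext_pred)
q2_b = q2_training_ss(y_train, y_ext, y_ext_pred)
```

With the same prediction errors, the first value changes when the external objects are spread differently around the training mean, while the second depends only on the errors and on the training set variability.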

Finally, new QSAR models addressing the prediction of several properties (such as biodegradation and acute toxicity) have been proposed in the literature.
 Applicability Domain
Existing methods and new strategies for defining the Applicability Domain of QSAR models were evaluated and studied. Details on these studies can be found in the following paper:
F. Sahigara, K. Mansouri, D. Ballabio, A. Mauri, V. Consonni, R. Todeschini. Comparison of different approaches to define the Applicability Domain of QSAR models. Molecules (2012), 17, 4791-4810
 Multicriteria Decision Making
Research has been started in the field of multicriteria decision making (MCDM), with particular attention to Hasse diagrams and rank statistics. Recently, new software for MCDM analysis (DART) has been released.
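As a minimal illustration of the partial ordering behind a Hasse diagram, the Python sketch below derives the edges of the diagram (the cover relations) from a small table of criteria. It is not related to the DART software. The function names and data are hypothetical, all criteria are assumed to be "higher is better", and objects with identical profiles are not merged into equivalence classes.

```python
import numpy as np

def dominates(a, b):
    """True if a is at least as good as b on every criterion and better on at least one."""
    return bool(np.all(a >= b) and np.any(a > b))

def hasse_edges(scores):
    """Cover relations (i, j): i dominates j and no third object lies between them."""
    scores = np.asarray(scores, dtype=float)
    n = len(scores)
    dom = [[dominates(scores[i], scores[j]) for j in range(n)] for i in range(n)]
    edges = []
    for i in range(n):
        for j in range(n):
            # Keep the edge only if no k satisfies i > k > j (transitive reduction).
            if dom[i][j] and not any(dom[i][k] and dom[k][j] for k in range(n)):
                edges.append((i, j))
    return edges

# Hypothetical objects scored on two criteria.
scores = [[3, 3], [2, 1], [1, 2], [1, 1]]
edges = hasse_edges(scores)
```

Objects 1 and 2 are incomparable here, because each is better than the other on one criterion, so no edge joins them; this is the information that a single aggregated ranking would hide.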
 Correlation and Information
Data correlation has been studied in depth, leading to the proposal of a general measure of multivariate correlation (the multivariate K correlation index). Use of this index has been proposed in different fields, such as regression and the selection of the significant principal components in PCA.
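A hedged Python sketch of an eigenvalue-based multivariate correlation index is given below. It assumes the definition as we recall it from the literature: the sum over the p eigenvalues of the correlation matrix of the absolute deviations of each explained-variance fraction from 1/p, divided by its maximum value 2(p-1)/p. This formula is not given in the text above and should be checked against the original paper before use; the function name is hypothetical.

```python
import numpy as np

def k_correlation_index(X):
    """Eigenvalue-based multivariate correlation index (assumed definition, see lead-in)."""
    X = np.asarray(X, dtype=float)
    p = X.shape[1]
    R = np.corrcoef(X, rowvar=False)   # p x p correlation matrix
    eig = np.linalg.eigvalsh(R)        # eigenvalues of a symmetric matrix
    frac = eig / eig.sum()             # fraction of variance per component
    return float(np.sum(np.abs(frac - 1.0 / p)) / (2.0 * (p - 1) / p))

# Hypothetical data: three perfectly correlated variables.
t = np.arange(1.0, 7.0)
X_corr = np.column_stack([t, 2 * t + 1, -3 * t])
# Hypothetical data: two exactly uncorrelated variables.
X_orth = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)

k_max = k_correlation_index(X_corr)
k_min = k_correlation_index(X_orth)
```

Under this definition the index approaches 1 when a single component carries all the variance and equals 0 when all eigenvalues are equal, which is why it can serve as a summary of the overall correlation in a data set.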
Recently, novel indices (Canonical Measure of Correlation, CMC, and Canonical Measure of Distance, CMD) have been proposed to measure similarity/diversity between pairs of data sets with the aid of the variable cross-correlation matrix between sets of data. These indices have also been applied to determine the subset of variables that reproduces as well as possible the main structural features of the complete data set. This method can be useful for the pre-treatment of large data sets, since it allows discarding variables that contain redundant information. Reducing the number of variables often allows one to better investigate data structure and obtain more stable results from multivariate modelling methods. To this end, the V-WSP method has been proposed for unsupervised variable reduction.
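The idea of discarding redundant variables without using any response can be illustrated with the Python sketch below. It is a plain greedy pairwise-correlation filter used as a stand-in; it is neither the V-WSP algorithm nor the CMC/CMD indices. The function name, the threshold value and the data are assumptions made for illustration.

```python
import numpy as np

def correlation_filter(X, threshold=0.95):
    """Greedy filter: keep a variable only if its absolute correlation with every
    variable already kept is below the threshold. Returns kept column indices."""
    X = np.asarray(X, dtype=float)
    R = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for j in range(X.shape[1]):
        if all(R[j, k] < threshold for k in kept):
            kept.append(j)
    return kept

# Hypothetical data: column 1 is an exact multiple of column 0.
t = np.arange(1.0, 7.0)
X = np.column_stack([t, 2 * t, [1, -1, 1, -1, 1, -1]])
kept = correlation_filter(X, threshold=0.95)
```

The result of a greedy filter depends on the order in which the variables are visited, which is one of the reasons more careful selection schemes are used in practice.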
Software developed to carry out our research activities:
DRAGON, Molecular Descriptors calculation
MOBYDIGS, Genetic Algorithm - Variable Subset Selection
PCA toolbox: Principal Component Analysis (PCA), Multidimensional Scaling (MDS) and Cluster Analysis in MATLAB
Kohonen and CPANN toolbox, Kohonen maps and counter-propagation in MATLAB
Classification toolbox: MATLAB modules for calculating classification models
DART, software for multicriteria decision making
 Partners in our scientific research