Classification (supervised pattern recognition)
Classification methods are fundamental chemometric (multivariate) techniques aimed to find mathematical models able to recognize the membership of each sample to its proper class on the basis of a set of measurements. Once a classification model has been obtained, the membership of unknown samples to one of the defined classes can be predicted. In other words, classification methods find mathematical relationships between a set of descriptive variables (e.g. chemical measurements) and a qualitative variable, i.e. the membership to a defined category (class). An excellent reference to multivariate classification methods is The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2009), by Trevor Hastie, Robert Tibshirani and Jerome Friedman (available on line here).
Among traditional classifiers, Discriminant Analysis is probably the most known method (McLachlan G, 1992, Discriminant Analysis and Statistical Pattern Recognition. New York: Wiley) and can be considered the first multivariate classification technique. Nowadays, several statistical software packages include procedures referred to by various names such as Linear Discriminant Analysis and Canonical Variates Analysis. Canonical Variates Analysis (CVA) separates samples into classes by minimizing the within-class variance and maximizing the between-class variance. So, with respect to Principal Component Analysis, the aim of CVA is to find directions (i.e. linear combinations of the original variables) in the data space that maximise the ratio of the between-class to within-class variance, rather than maximising the between-sample variance without taking into account any information on the classes, as PCA does. These directions are called discriminant functions or canonical variates and are in number equal to the number of categories minus one.
Quadratic Discriminant Analysis (QDA) is a probabilistic parametric classification technique and separates the class regions by quadratic boundaries. It makes the assumption that each class has a multivariate normal distribution, while the dispersion is different in the classes. A special case, referred to as Linear Discriminant Analysis (LDA), occurs if all the class covariance matrices are assumed to be identical.
Details on Discriminant Analysis can be found here: The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2009), by Trevor Hastie, Robert Tibshirani and Jerome Friedman (available on line here).
Partial Least Square Discriminant Analysis (PLSDA)
Partial Least Square (PLS) was originally designed as a tool for statistical regression and nowadays is one of the most used regression techniques in chemistry. The PLS algorithm has been modified for classification purposes. Barker and Rayens (Barker M, Rayens WS, 2003, Partial least squares for discrimination. Journal of Chemometrics, 17, 166-73) showed that Partial Least Squares-Discriminant Analysis (PLS-DA) corresponds to the inverse-least-squares approach to LDA and produces essentially the same results but with the noise reduction and variable selection advantages of PLS.
PLS-DA is essentially based on the PLS2 algorithm that searches for latent variables with a maximum covariance with the Y-variables. Of course, the main difference is related to the dependent variables, since these represent qualitative (and not quantitative) values, when dealing with classification. In PLS-DA the Y-block describes which samples are in the classes of interest. When dealing with G classes, the class vector is unfolded and the PLS2 algorithm is applied. For each sample, PLS-DA will return the prediction as a vector of size G, with values in-between 0 and 1: a g-th value closer to zero indicates that the sample does not belong to the g-th class, while a value closer to one the opposite. Since predicted vectors will not have the form (0,0,…,1,…0) but real values in the range between 0 and 1, a classification rule must be applied; from predicted values, probabilities can be estimated, as explained in the following publication: N.F. Pérez, J. Ferré, R. Boqué, Calculation of the reliability of classification in discriminant partial least-squares binary classification. Chemometrics and Intelligent Laboratory Systems 95 (2013) 122-128. The sample can thus be assigned to the class with the maximum probability or, alternatively, a threshold on calculated responses can be determined for each class on the basis of the Bayes theorem (the class threshold is selected at the point where the number of false positives and false negatives is minimized).
The optimal number of latent variable is usually selected by means of cross validation procedures, by choosing the latent variables which minimise the cross validation error in classification.
A tutorial on PLSDA is available in the following paper: Ballabio D, Consonni V, (2013) Classification tools in chemistry. Part 1: Linear models. PLS-DA. Analytical Methods, 5, 3790-3798.
Classification trees (CART)
Tree-based approaches consist of algorithms based on rule induction, that is a way of partitioning the data space into different class subspaces. Basically, the data set is recursively splitted into smaller subsets where each subset contains samples belonging to as few categories as possible. In each split (node), the partitioning is performed in such a way to reduce entropy (maximize purity) of the new subsets and the final classification model consists of a collection of nodes (tree) that define the classification rule.
Univariate and multivariate strategies for finding the best split can be distinguished; in the univariate approach the algorithm searches at each binary partitioning the single variable that gives the purest subsets; the partitioning can be formulated as a binary rule: all the samples that satisfy the rule are grouped in one subset, otherwise into another. This is the case of the Classification And Regression Trees (CART), that are a form of binary recursive partitioning based on univariate rule induction (Breiman LJ, Friedman JH, Olsen R, Stone C, 1984, Classification and Regression Trees. Belmont, CA: Wadsworth International Group, Inc.).
K-Nearest Neighbors (kNN)
The K-Nearest Neighbor (KNN) classification rule (Cover TM, Hart PE, 1967, Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13, 21-7.) is conceptually quite simple: an sample is classified according to the classes of the K closest samples, i.e. it is classified according to the majority of its k-nearest neighbors in the data space. In a computational point of view, all that is necessary is to calculate and analyse a distance matrix. The distance of each sample from all the other samples is computed, and the samples are then sorted according to this distance. KNN is a non-linear classification method. Because of these characteristics, KNN has been suggested as a standard comparative method for more sophisticated classification techniques.
When applying KNN, the optimal value of K must be searched for. One option for selecting K is by means of cross validation procedures, i.e. by testing a set of K values (e.g. from 1 to 10); then, the K giving the lowest classification error in cross-validation can be selected as the optimal one.
Potential Functions (Kernel Density Estimators)
Potential Functions consider each sample of the training set as a point in the pattern space. Around this point there is a potential field, that decreases with distance from the sample (D. Coomans, D.L. Massart, Potential Methods in pattern recognition. Part 1. Classification Aspects of the Supervised Method ALLOC. Analytica Chimica Acta, 133 (1981) 215-224). The classification of a new sample into one of the classes is determined by means of the cumulative potential of the class in the position of the test sample. The cumulative potential is obtained by summing up the individual potentials of the class samples. The test object is then classified into the class which gives rise to the largest cumulative potential.
Potential functions can be adapted to class modeling. For each class potential a potential threshold can be defined on the basis of the percentiles of the distribution of class potentials (M. Forina, C. Armanino, R. Leardi, G. Drava, A class-modelling technique based on potential functions, Journal of Chemometrics, 5 (1991) 435-453). A sample is therefore classified in a specific class space if its potential is higher than the potential class threshold.
The shape of the potential field depends on the choice of a potential function (kernel) and a smoothing parameter. Kernels included in this toolbox are: gaussian and triangular. The optimal smoothing value can be selected (for each class) by means of cross validation procedures, i.e. by testing a set of smoothing values (e.g. from 0.1 to 1.2); then, the smoothing value giving the lowest classification error in cross-validation can be selected as the optimal one.
Support Vector Machines (SVM)
Support Vector Machines (SVM) define a decision boundary that optimally separates two classes by maximizing the distance between them (Cortes, C.; Vapnik, V. Support-Vector Networks. In Machine Learning; 1995; pp. 273–297.). The decision boundary can be described as an hyper-plane that is expressed in terms of a linear combination of functions parameterized by support vectors, which consist in a subset of training molecules. SVM algorithms search for the support vectors that give the best separating hyper-plane using a kernel function. Kernels included in this toolbox are: linear, radial basis functions (RBF) and polynomial. During optimization, SVM search the decision boundary with maximal margin (cost, upper bound for the coefficients alpha) among all possible hyper-planes, where the margin can be intended as the distance between the hyper-plane and the closest point for both classes.
When applying SVM, the optimal value of cost must be searched for. One option for selecting the optimal cost value is by means of cross validation procedures, i.e. by testing a set of cost values (e.g. from 0.1 to 1000); then, the cost value giving the lowest classification error in cross-validation can be selected as the optimal one. When dealing with RBF and polynomial kernels, the kernel parameter must be selected too. In this case, a set of values of kernel parameters is tested in cross validation and the value giving the lowest error in classification can be selected as the optimal one.
Details on SVM can be found here: The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2009), by Trevor Hastie, Robert Tibshirani and Jerome Friedman (available on line here).
Soft Independent Modeling of Class Analogy (SIMCA)
Soft Independent Modeling of Class Analogy (SIMCA) is a class modeling technique (Wold S (1976) Pattern recognition by means of disjoint principal components models. Pattern Recognition, 8, 127-39). A SIMCA model consists of a collection of Principal Component Analysis (PCA) models, one for each class. Therefore, PCA is separately calculated on the samples of each class; since the number of significant components can be different for each category, cross-validation has been proposed as a way of choosing the number of retained components of each class model. In this way, SIMCA defines G class models, where G is the total number of classes. Then, a new sample is projected in each subspace and compared to it in order to assess its distance from each class. Distances are calculated on the basis of normalised Q residuals and normalised Hotelling T2 values. Q residuals and Hotelling T2 are normalised over their 95% confidence limits. Samples can be assigned to the class with normalised Q resilduals and T2 Hotelling distance lower than a fixed threshold. The distance threshold is found for each class by increasing the threshold and maximizing class specificity and sensitivity. In this way (class modelling approach) samples could be predicted in none or multiple class spaces. Otherwise, samples can be always assigned to the closest class, i.e. the class with the lowest normalised Q resilduals and T2 Hotelling distance. A tutorial on PCA (which is the basis for UNEQ) can be found here: R. Bro, A.K. Smilde, Principal component analysis, Analytical Methods, 2014,6, 2812-2831
Unequal class models (UNEQ)
Unequal class models (UNEQ) is a class modeling technique(M.P. Derde, D.L. Massart, UNEQ: a disjoint modelling technique for pattern recognition based on normal distribution, Anal. Chim. Acta, 184 (1986), pp. 33-51). A UNEQ model consists of a collection of Principal Component Analysis (PCA) models, one for each class. Therefore, PCA is separately calculated on the samples of each class; since the number of significant components can be different for each category, cross-validation has been proposed as a way of choosing the number of retained components of each class model. In this way, UNEQ defines G class models, where G is the total number of classes. Then, a new sample is projected in each subspace and compared to it in order to assess its distance from each class. Distances are calculated on the basis of normalised Hotelling T2 values. Hotelling T2 are normalised over their 95% confidence limits. Samples can be assigned to the class with normalised T2 Hotelling distance lower than a fixed threshold. The distance threshold is found for each class by changing the threshold and maximizing class specificity and sensitivity. In this way (class modelling approach) samples could be predicted in none or multiple class spaces. Otherwise, samples can be always assigned to the closest class, i.e. the class with the lowest normalised T2 Hotelling value. A tutorial on PCA (which is the basis for UNEQ) can be found here: R. Bro, A.K. Smilde, Principal component analysis, Analytical Methods, 2014,6, 2812-2831. A recent paper explaining UNEQ can be found here: P. Oliveri (2017) Class-modelling in food analytical chemistry: Development, sampling, optimisation and validation issues - A tutorial. Analytica Chimica Acta, in press, DOI: https://doi.org/10.1016/j.aca.2017.05.013