Theory
p is the number of original variables and M the number of retained components; λ_m is the eigenvalue of the m-th principal component. These indicator functions were specifically proposed for spectroscopic data. They assume that the error in the data is random and identically distributed, so the eigenvalues associated with the residual error of the PCA model should be approximately equal. Both IE and IND are calculated as a function of M, as M goes from 1 to p; the minimum of the function indicates the optimal number of retained components.

Another strategy to select significant components is based on the multivariate K correlation index, which quantifies the correlation content of a data matrix [Todeschini, R., 1997. Data correlation, number of significant principal components and shape of molecules. The K correlation index, Anal. Chim. Acta 348, pp. 419-430]. From the K correlation index, a linear function (KL) and a non-linear power function (KP) are derived. KL gives the maximum number of theoretically significant principal components, under the assumption that the information in the data is linearly distributed, while KP estimates the safest minimum number of significant components, under the assumption that the information in the data decreases more steeply.

Finally, another option is to estimate the optimal number of components by means of cross-validation procedures [Wise B, Ricker N (1991) In: Najim K, Dufour E (eds) IFAC Symp on Advanced Control of Chemical Processes, Toulouse, France, 14–16 October 1991, pp 125–130]. The data set is first divided into a number of cross-validation groups. A PCA model is then built on all but one of the groups and used to estimate the variables of the left-out group. For each left-out sample, one variable at a time is removed and treated as missing data; the missing variable is predicted from the model and from the remaining variables of that sample [R. Bro, K.
Kjeldahl, A. K. Smilde, H. A. L. Kiers, 2008, Cross-validation of component models: A critical look at current methods, Anal Bioanal Chem 390, pp 1241–1251]. The residuals of this “reconstruction” are analysed as a function of the number of components used to calculate the PCA model. When PCs describing only small noise variance are added, the cross-validation error (RMSECV) should increase.

Cluster Analysis
Cluster analysis differs from PCA in that the goal is to detect similarities between objects and to find groups in the data on the basis of sample similarities [Massart DL, Kaufman L (1983) The interpretation of analytical chemical data by the use of cluster analysis. New York: Wiley]. Similarities among samples are calculated by means of distances: similar samples are characterised by small distances and dissimilar samples by large distances. In particular, hierarchical agglomerative methods use distance measures, called linkage metrics, to quantify similarities between groups of objects (i.e., clusters). The distances are then used to cluster the samples into groups and finally to display a dendrogram, which encodes the cluster structure of the data.

Multidimensional Scaling
Multidimensional scaling (MDS) operates on distance (or similarity) matrices that represent the similarity/diversity relationships among samples [Seber, G. A. F. Multivariate Observations. Hoboken, NJ: John Wiley & Sons, Inc., 1984]. MDS takes into account the mutual relationships of sample distances by reproducing the data structure encoded in the distance (similarity) matrix in a low-dimensional space. A scatter plot of the samples in this reduced space therefore provides a visual representation of the original distances.
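
The way MDS turns a distance matrix into low-dimensional coordinates can be illustrated with a minimal sketch. The text above does not say which MDS variant is meant, so this example assumes classical (Torgerson) MDS on Euclidean distances; the function name classical_mds and the four-sample data set are hypothetical and chosen only for illustration.

```python
import numpy as np

def classical_mds(D, n_components=2):
    """Classical (Torgerson) MDS on a symmetric distance matrix D (n x n).

    Assumption: this variant is used for illustration only; the text
    does not specify a particular MDS algorithm.
    """
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    # Double-centre the squared distances to obtain an inner-product matrix
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    # B is symmetric, so eigh applies; it returns eigenvalues in ascending order
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:n_components]
    # Clip small negative eigenvalues (rounding or non-Euclidean distances)
    lam = np.clip(eigvals[order], 0.0, None)
    # Coordinates are the leading eigenvectors scaled by sqrt(eigenvalue)
    return eigvecs[:, order] * np.sqrt(lam)

# Hypothetical data: four samples at the corners of a unit square
X = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

Y = classical_mds(D, n_components=2)
# Distances between the embedded samples, for comparison with D
D_hat = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
```

Because the example distances are Euclidean and the data are two-dimensional, the distances in the embedding (D_hat) should match the original ones up to rounding. The coordinates themselves are only defined up to rotation and reflection, so the scatter plot is read in terms of relative positions and not absolute axes.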