The Sediment dataset

The Sediment dataset consists of 1884 sediment samples. Samples were divided into two classes on the basis of their toxicity (class 1: non-toxic, class 2: toxic) and described by 9 chemical variables. The dataset was randomly split into a training set (1413 samples) and a test set (471 samples). The dataset has been published in the following papers:

M. Alvarez-Guerra, D. Ballabio, J. M. Amigo, J. R. Viguri, R. Bro, A chemometric approach to the environmental problem of predicting toxicity in contaminated sediments, Journal of Chemometrics (2010), 24, 379-386

M. Alvarez-Guerra, D. Ballabio, J. M. Amigo, R. Bro, J. R. Viguri, Development of models for predicting toxicity from sediment chemistry by partial least squares-discriminant analysis and counter-propagation artificial neural networks, Environmental Pollution (2010), 158, 607-614

A detailed description of the analysis of this dataset by means of PLSDA is given in the following paper:

D. Ballabio, V. Consonni, Classification tools in chemistry. Part 1: Linear models. PLS-DA, Analytical Methods (2013), 5, 3790-3798

The following paragraphs summarise the PLSDA model built by means of the Classification toolbox for MATLAB. The Sediment dataset is provided together with the toolbox. It can be opened by typing:

load sediment

in the MATLAB command window. Note that the original data were log transformed.


Working with the graphical interface

Once data have been loaded in the MATLAB workspace, you can open the graphical interface by typing the following code in the MATLAB command window:


To build a classification model by means of PLSDA, we first have to load the data and the class vector in the GUI. Select "load data" in the file menu, choose the Xtrain MATLAB variable (the data) and click load. The listbox of the toolbox main form will be updated with the data details (number of samples, number of variables). Then follow the same procedure to load the corresponding class vector (class_train), by clicking "load class" in the file menu. The class details (number of classes) will be updated in the toolbox main form. Finally, load the sample and variable labels (samples_train and variables) by clicking "load labels" in the file menu.
We can have a look at the variable means by choosing "view->plot profiles". This opens a new window showing the profiles of the variable averages (on the raw and scaled data). Since the class vector is loaded, averages are calculated for each class separately.
We can now proceed with the calculation of the PLSDA classification model. By clicking "calculate->optimal components for PLSDA", we can choose the settings for cross validating the model and selecting the optimal number of components. The form for setting the PLSDA options will appear. Here you can select the data scaling (no scaling in this example), the type of assignment criterion (bayes) and the type of validation (venetian blinds cross validation with 5 cv groups). The cross validation procedure will produce a plot of the error rate (and the ratio of not assigned samples, if any) versus the number of components retained in the model. Here we can choose 2 components.
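As a rough illustration of how such a plot can be read, the sketch below picks the smallest number of components that reaches the minimum cross-validated error rate. The error rate values are hypothetical and are not results obtained on the Sediment dataset; the variable names are assumptions, and the toolbox itself leaves this choice to the user.

```matlab
% Hypothetical cross-validated error rates for 1..5 components
% (illustrative values only, not taken from the Sediment dataset)
er_cv = [0.30 0.22 0.22 0.23 0.25];

% min returns the first position of the minimum, that is the
% smallest number of components reaching the lowest error rate
[er_min, n_comp] = min(er_cv);

fprintf('components: %d, error rate in cross validation: %.2f\n', n_comp, er_min);
```

When several models have nearly the same error rate, the one with fewer components is usually preferred, since it is simpler and less prone to overfitting.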
We can now calibrate the PLSDA model by selecting "calculate->PLSDA". The corresponding settings form will appear. Here you can select the number of components to be retained in the model (2 in this example), the data scaling (no scaling), the type of assignment criterion (bayes), and the type of validation (venetian blinds with 5 cv groups). Then, click "calculate".

After the model calculation (and validation), the main form of the toolbox will be updated with the model details (type of calculated model, error rate and non-error rate). In this example, the error rate of the PLSDA model is equal to 0.19 (19%) in fitting and to 0.20 in cross validation. Detailed classification results can be analysed by clicking "results->classification results". The following form will appear.
All classification parameters are shown in this form, both for calibration and validation results. Considering specificity and sensitivity, class 1 and class 2 were partially separated. The overlap between classes is confirmed from the ROC curves of the classes. The ROC curve plots can be opened with the "results->plot ROC curve" button.
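A ROC curve of this kind can be sketched by sweeping a threshold over the calculated responses of one class and recording sensitivity and specificity at each threshold. The snippet below is a minimal illustration on hypothetical values, written in plain MATLAB; it is not the toolbox code, and all variable names are assumptions.

```matlab
% Hypothetical calculated responses for one class and the true
% membership of each sample (illustrative values only)
y_calc   = [0.9 0.8 0.7 0.6 0.4 0.3 0.2 0.1];
is_class = logical([1 1 0 1 0 1 0 0]);

% Candidate thresholds, from the most to the least restrictive
thr  = sort(unique(y_calc), 'descend');
sens = zeros(size(thr));
spec = zeros(size(thr));

for k = 1:numel(thr)
    assigned = y_calc >= thr(k);                             % samples assigned to the class
    sens(k) = sum(assigned & is_class) / sum(is_class);      % true positive rate
    spec(k) = sum(~assigned & ~is_class) / sum(~is_class);   % true negative rate
end

% ROC curve: 1 - specificity on the x axis, sensitivity on the y axis
fpr = 1 - spec;

% Area under the curve by the trapezoidal rule, starting from the origin
auc = trapz([0 fpr], [0 sens]);
fprintf('area under the ROC curve: %.4f\n', auc);
```

Calling plot(fpr, sens, '-o') after this block draws the curve. An area close to 1 indicates well separated classes, while overlapping classes give values closer to 0.5.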
Then, we can have a look at the PLSDA scores by choosing "results->PLSDA scores and loading". A form will appear with score and loading plots. In the score plot, samples are coloured on the basis of their experimental class. The user can change the components shown in the plots, and other statistics are available in the result form, such as Q residuals, Hotelling T2, leverages, calculated and predicted classes, and regression coefficients. All plots can be exported as MATLAB figures. Here, the score and loading plots for the first two components and the regression coefficient plot produced by the Classification toolbox on the Sediment dataset are shown.

By selecting a reference class from the menu "class potential", one can highlight the class potential, which is defined on the basis of the distribution of the class samples in the visualised plot, as shown in the following figure for the first two latent variables of the PLSDA model for the red class (class number two).

Finally, we can save the model ("file->save model"), clear the data ("file->clear data"), and load the test set ("file->load data" and choose Xtest_log) and the corresponding class vector ("file->load class" and choose class_test). At this point, the "predict samples" button in the prediction menu will be activated, so that the test set samples can be predicted. The classification results on the test set can be analysed by choosing "prediction->prediction results":


Working with the command line

Type

load sediment

in the MATLAB command window to load the data. We can select the optimal number of components by using the plsdacompsel function:

res = plsdacompsel(Xtrain_log,class_train,'none','vene',5,'bayes')

This returns the error rate (and the non-error rate) in validation associated with each number of components. Type


in the MATLAB command window to see the error rates. We can then calculate the PLSDA model with 2 components by typing:

model = plsdafit(Xtrain_log,class_train,2,'none','bayes',1)

in the MATLAB command window. Once the model is calculated, we can see the model performance by typing:


Scores, loadings, calculated class, leverages and many other statistics are stored in the model structure. We can proceed by cross validating (with 5 venetian blind groups) the PLSDA model with 2 components:

cv = plsdacv(Xtrain_log,class_train,2,'none','vene',5,'bayes')
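In venetian blinds cross validation, samples are assigned to the cv groups in turn, so that each group takes every fifth sample when 5 groups are used. The following is a minimal sketch of this assignment on a hypothetical set of 12 samples; it does not call the toolbox, and the variable names are chosen only for illustration.

```matlab
% Hypothetical small example (not the Sediment data)
n_samples = 12;
n_groups  = 5;

% Venetian blinds: sample i goes to group mod(i-1, n_groups) + 1
group = mod((1:n_samples) - 1, n_groups) + 1;

for g = 1:n_groups
    test_idx  = find(group == g);   % samples left out in this cv round
    train_idx = find(group ~= g);   % samples used to fit the model in this round
    fprintf('group %d: left-out samples %s\n', g, mat2str(test_idx));
end
```

Because consecutive samples fall in different groups, this scheme works best when the sample order carries no structure, for example after the random split described above.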

Once the validation procedure has finished, we can see the validation performances by typing:


Next, we can predict the test set samples by using the calibrated model:

pred = plsdapred(Xtest_log,model)

and finally we can calculate the classification performances on the external test set predictions:

class_param = calc_class_param(pred.class_pred,class_test)
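To make these measures concrete, the sketch below builds a confusion matrix from two hypothetical class vectors and derives the class sensitivities, the non-error rate and the error rate. It assumes that the non-error rate is the average of the class sensitivities, that the error rate is its complement, and that all samples are assigned to a class. It is plain MATLAB and not the toolbox implementation of calc_class_param; all names are illustrative.

```matlab
% Hypothetical class vectors (1 = non-toxic, 2 = toxic), illustration only
class_true = [1 1 1 1 2 2 2 2 2 2];
class_calc = [1 1 1 2 2 2 2 2 1 1];
n_class = 2;

% Confusion matrix: rows are true classes, columns are assigned classes
conf_mat = zeros(n_class);
for i = 1:numel(class_true)
    r = class_true(i);
    c = class_calc(i);
    conf_mat(r, c) = conf_mat(r, c) + 1;
end

% Sensitivity of each class: correctly assigned samples over class size
sensitivity = diag(conf_mat)' ./ sum(conf_mat, 2)';

% Assumed definitions: NER is the mean class sensitivity, ER = 1 - NER
ner = mean(sensitivity);
er  = 1 - ner;

fprintf('NER = %.4f, ER = %.4f\n', ner, er);
```

Averaging the class sensitivities, rather than counting all correct assignments together, keeps the larger class from dominating the result when the classes have different sizes.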
