The Sediment dataset

The Sediment dataset consists of 1884 sediment samples. The samples were divided into two classes on the basis of their toxicity (class 1: non-toxic, class 2: toxic) and are described by 9 chemical variables. The dataset was randomly split into a training set (1413 samples) and a test set (471 samples). It has been published in the following papers:

M. Alvarez-Guerra, D. Ballabio, J. M. Amigo, J. R. Viguri, R. Bro, A chemometric approach to the environmental problem of predicting toxicity in contaminated sediments, *Journal of Chemometrics* (**2010**), 24, 379-386

M. Alvarez-Guerra, D. Ballabio, J. M. Amigo, R. Bro, J. R. Viguri, Development of models for predicting toxicity from sediment chemistry by partial least squares-discriminant analysis and counter-propagation artificial neural networks, *Environmental Pollution* (**2010**), 158, 607-614

A **detailed description of the analysis of this dataset** by means of PLSDA is given in the following paper:

Ballabio D, Consonni V (**2013**) Classification tools in chemistry. Part 1: Linear models. PLS-DA. *Analytical Methods*, 5, 3790-3798

The following paragraphs summarise the PLSDA model built by means of the Classification toolbox for MATLAB. The Sediment dataset is provided together with the toolbox. It can be opened by typing:

load sediment

in the MATLAB command window. Note that the original data were log transformed.


Working with the graphical interface

Once data have been loaded in the MATLAB workspace, you can open the graphical interface by typing the following code in the MATLAB command window:

class_gui

To build a classification model by means of PLSDA, we first have to load the data and the class vector in the GUI. Select "load data" in the file menu, choose the Xtrain MATLAB variable (the data) and click load. The listbox of the toolbox main form will be updated with the data details (number of samples, number of variables). Then follow the same procedure to load the corresponding class vector (class_train) by clicking "load class" in the file menu. The class details (number of classes) will be updated in the toolbox main form. Finally, we can load the sample and variable labels (samples_train and variables) by clicking "load labels" in the file menu.

After the model calculation (and validation), the main form of the toolbox will be updated with the model details (type of calculated model, error rate and non-error rate). In this example, the error rate of the PLSDA model is equal to 0.19 (19%) in fitting and to 0.20 (20%) in cross validation. Detailed classification results can be analysed by clicking "results->classification results", which opens a form with the classification results.

Finally, we can save the model ("file->save model"), clear the data ("file->clear data"), and load the test set ("file->load data", choosing Xtest_log) and the corresponding class vector ("file->load class", choosing class_test). The "predict samples" button in the prediction menu will then be activated, and the classification results on the test set can be analysed by choosing "prediction->prediction results".


Working with the command line

Type:

load sediment

in the MATLAB command window to load the data. We can select the optimal number of components by using the plsdacompsel function:

res = plsdacompsel(Xtrain_log,class_train,'none','vene',5,'bayes')

We'll get the error rate (and non-error rate) in validation associated with each number of components. Type

res.er

in the MATLAB command window to see the error rates. We can then calculate the PLSDA model with 2 components by typing:

model = plsdafit(Xtrain_log,class_train,2,'none','bayes',1)

in the MATLAB command window. Once the model is calculated, we can see the model performance by typing:

model.class_param

Scores, loadings, calculated classes, leverages and many other statistics are stored in the model structure. We can proceed by cross validating the PLSDA model with 2 components, using 5 venetian blinds groups:

cv = plsdacv(Xtrain_log,class_train,2,'none','vene',5,'bayes')
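The venetian blinds scheme requested in the plsdacv call above can be pictured with a few lines of plain MATLAB. This is only a sketch of the general idea (sample i is assigned to cancellation group mod(i-1, G) + 1), not the toolbox code, and the toolbox may differ in detail. The variable names and the small sample count are hypothetical and chosen for readability.

```matlab
% Sketch of venetian blinds group assignment (not the toolbox implementation).
n_samples = 12;   % hypothetical, kept small so the pattern is visible
n_groups  = 5;    % same number of cancellation groups as in the call above

% Sample i is assigned to group mod(i-1, n_groups) + 1
group = mod((1:n_samples) - 1, n_groups) + 1;
disp(group)

for g = 1:n_groups
    test_idx  = find(group == g);   % samples left out in this round
    train_idx = find(group ~= g);   % samples used to fit the model
    fprintf('group %d: leave out samples %s\n', g, mat2str(test_idx));
end
```

With this scheme every fifth sample is left out in each round, so each cancellation group spans the whole dataset rather than one contiguous block of rows.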

Once the validation procedure has finished, we can see the validation performances by typing:

cv.class_param

Finally, we can predict the test set samples by using the calibrated model:

pred = plsdapred(Xtest_log,model)

and then calculate the classification performance on the external test set predictions:

class_param = calc_class_param(pred.class_pred,class_test)
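To make the performance figures concrete, the sketch below derives a confusion matrix, the class sensitivities, the non-error rate and the error rate from a pair of class vectors, using only built-in MATLAB functions. It is not the code behind calc_class_param. The toy vectors are hypothetical, and the non-error rate is assumed here to be the average of the class sensitivities, with the error rate as its complement.

```matlab
% Sketch: classification measures from true and predicted class vectors.
% Toy data, hypothetical; replace with class_test and pred.class_pred.
class_true = [1 1 1 1 2 2 2 2 2 1]';
class_pred = [1 1 2 1 2 2 1 2 2 1]';

n_class = max(class_true);

% Confusion matrix: rows = true class, columns = predicted class
conf_mat = accumarray([class_true class_pred], 1, [n_class n_class]);

% Sensitivity of each class: correctly assigned samples / samples in the class
sensitivity = diag(conf_mat) ./ sum(conf_mat, 2);

% Assumed definitions: NER = mean class sensitivity, ER = 1 - NER
ner = mean(sensitivity);
er  = 1 - ner;

disp(conf_mat)
fprintf('NER = %.2f, ER = %.2f\n', ner, er);
```

Averaging the sensitivities gives each class the same weight regardless of how many samples it contains, which matters when the toxic and non-toxic classes are unbalanced.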
