The aquatic toxicity dataset

A QSAR model was developed from a dataset consisting in 545 organic molecules to predict acute aquatic toxicity towards Daphnia Magna. LC50 data, which is the concentration that causes death in 50% of test Daphnia Magna over a test duration of 48 hours, was used as model (quantitative) response. The 545 molecules of the dataset were randomly divided into training (436) and external test sets (109). The dataset included 8 molecular descriptors: TPSA(Tot) (Molecular properties), SAacc (Molecular properties), H-050 (Atom-centred fragments), MLOGP (Molecular properties), RDCHI (Connectivity indices), GATS1p (2D autocorrelations), nN (Constitutional indices), C-040 (Atom-centred fragments). The dataset has been published (and described in details) in the following paper:

M. Cassotti, D. Ballabio, V. Consonni, A. Mauri, I. Tetko, R. Todeschini (2014). Prediction of acute aquatic toxicity toward daphnia magna using the GA-kNN method, Alternatives to Laboratory Animals (ATLA), 42,31:41

In the following paragraphs, an example of the application of the PCA model built by means of the PCA toolbox for MATLAB is given. The aquatic toxicity dataset is provided together with the toolbox and can be opened by typing:

load data_toxicity

on the MATLAB command window. In the MATLAB workspace, it will be possible to see the aquatic toxicity dataset, composed by 545 samples, divided in training (436) and test (109) samples. Each sample is described by 8 variables (molecular descriptors). An experimental quantitative response (LC50) is associated to each sample. Moreover, in order to exemplify the use of the toolbox, samples were divided in two classes: class 1 includes samples with lc50 values lower than 4, class 2 includes samples with lc50 values higher than 4.

[-> top]

Working with the graphical interface

Once data have been loaded in the MATLAB workspace, you can open the graphical interface by typing the following code in the MATLAB command window:


In order to build a PCA model, we have to load data and class vector in the GUI. In order to do that, we can proceed in the following way: select "load data" in the file menu. We can select the X_train matlab variable (the data) and click load. The listbox of the toolbox main form will be updated with the data details (number of samples, number of variables). Then we can follow the same procedure for loading the corresponding response vector (y_train), by clicking "load response" in the file menu. Alternatively, we could load the class vector. Finally, we can load the sample and variable labels (samples_train and variables, by clicking "load labels" in the file menu).
We can have a look to the sample profiles by choosing "view->plot profiles ". This will open a new form to show the profiles of the variable averages or the sample profiles, for both the raw and scaled data. If the class vector is loaded, averages are calculated on each class separately.

Otherwise, it is possible to make biplot of variables and/or analyse histograms and boxplots of each variable by selecting "view -> plot univariate stat".
Initially, we can estimate the optimal number of components to be retained in the model by selecting "calculate->optimal number of components for PCA". In the setting form, we can choose the data scaling (autoscaling in this example), the type of cross validation ("vene" for venetian blinds) and the number of cross validation groups (e.g. 5). The following window will appear.

Here it is possible to plot different parameters (eigenvalues, explained variance, cumulative explained variance, rmsecv, IE and IND) as a function of the PCs included in the model. The number of PCs to be shown in the plot can be changed. When plotting eigenvalues, two horizontal lines are drawn; these corresponds to the average eigenvalue and to the average eigenvalue multiplied by 0.7; moreover, the number of components suggested by eigenvalue-based methods (AEC, CAEC, KL, KP) are marked and labelled. For example, in this case, AEC, CAEC and KP indices all suggest 3 as the optimal number of components to be retained in the model.
We can then proceed in the calculation of the PCA model. By selecting the "calculate->PCA" menu, we can choose the data scaling (autoscaling in this example) and the number of components to be retained in the model. After the model calculation, the main form of the toolbox will be updated with the model details (type of calculated model, data scaling, number of retained components, cumulative explained variance). The plots of eigenvalues, explained and cumulative variances associated to each component can be analysed by selecting "results->plot eigenvalues".

Alternatively, eigenvalues can be analysed by selecting "results->view eigenvalues"). A temporary matlab variable will appear with the list of eigenvalues and the associated explained variances.
We can have a look to the PCA scores and loadings, by choosing "results->scores and loading". A form will appear with score and loading plots. In this plot, samples are coloured on the basis of their experimental response (since we initially loaded the response vector). The user can modify the components to be analysed on the plots, while other statistics are available in the result form, such as Q residuals and T2 Hotelling. All plots can be exported as MATLAB figures. The "view sample" button will enable to select a single point (sample) on the score plot and subsequently view the raw and scaled variable profiles of the selected sample, as well as its Q residuals. Here, the score and loading plots between the first two components are shown.

Finally, we can save the model ("file->save model"), clear the data ("file->clear data"), and load the test set ("file->load data" and choose X_test) and the corresponding response vector ("file->load response " and choose lc50_test). In this case, the "predict samples" button in the prediction menu will be activated and the prediction results on the new set of data can be analysed. The scores of the predicted set can be analysed by choosing "results->plot scores and loadings". Since we loaded the quantitative response, predicted samples are plotted as diamond coloured on the basis of the intensity of their experimental response.

[-> top]

Working with the command line


load data_toxicity

on the MATLAB command window to load the data. We can calculate the PCA model with 4 components on the autoscaled data by typing:

model = pca_model(X_train,4,'auto');

on the MATLAB command window. The routine output (model) is a MATLAb structure containing all the model details. Once the model is calculated, we can see for example the eigenvalues by typing:


Cumulative and explained variances can be found in the "exp_var" and "cum_var" fields of the model structure. Scores, loadings, Q residuals, T2 Hotelling are stored in the model structure as well. Type "help pca_model" on the MATLAB command window for further information.
Finally, we can project the test samples in the PCA model by using the calibrated model:

model = pca_project(X_test,model);

Scores, Q residuals, Q contributions, T2 Hotelling and T2 Hotelling contributions of the test samples will be saved in the model structure. Type "help pca_project" on the MATLAB command window for further informations.

[-> top]