The IRIS dataset

IRIS is a classical dataset used by statisticians and chemometricians to test classification methods. It consists of 150 flower samples divided into 3 classes (50 setosa, 50 versicolor, 50 virginica) and described by 4 variables (petal length, petal width, sepal length, sepal width). The dataset was published by Fisher in the following paper:

Fisher RA (**1936**) The use of multiple measurements in taxonomic problems. *Annals of Eugenics*, **7**: 179-188.
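The dataset ships with the toolbox as iris.mat; before starting, the expected contents can be checked directly in MATLAB (X and class are the variable names used throughout this tutorial):

```matlab
load iris        % loads the data matrix X and the class vector
size(X)          % should return [150 4]: 150 samples, 4 variables
unique(class)'   % should return [1 2 3]: setosa, versicolor, virginica
```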


Working with the graphical interface

Once the data have been prepared, you can open the graphical interface by typing the following code in the MATLAB command window:

model_gui

In order to build a classification model by means of CP-ANNs, we have to load the data and the class vector. To do that, select "load data" in the file menu, then "load from file" in the "load" form and select the iris.mat file; the following form will appear:

We can select the X MATLAB variable (the data) and click "load". The listbox of the toolbox main form will be updated with the data details (number of samples, number of variables). Then we can follow the same procedure to load the corresponding class vector (from the same iris file), by clicking "load class" in the file menu. The class details (number of samples, number of classes) will be updated in the toolbox main form.

After the model calculation (and cross-validation), the main form of the toolbox will be updated with the model details (number of neurons, number of epochs, error rate and non-error rate):

Detailed classification results can be analysed by clicking on the "classification results" button. The following form will appear:

Considering precision, specificity and sensitivity, class 1 is perfectly separated, while class 2 and class 3 overlap slightly (both in fitting and cross-validation). This is confirmed by the confusion matrix obtained in cross-validation (click the "view confusion matrix" button in the cross-validation frame):

Clicking on the "plot class profiles" button, the following plot will appear:

where the average of the Kohonen weights for each class and each variable is shown. For example, we can see that class 1 (the best separated one) shows a significant difference on the third and fourth variables with respect to the other two classes, which have more similar values.

The overlap between class 2 and class 3 is confirmed by the ROC curves of the classes as well. The ROC curve plots can be opened with the "plot ROC curve" button.

Finally, we can save the model ("file->save model"), clear the data ("file->clear data"), and load a new set of data ("file->load data"). In this case, the "predict sample" button will be activated and the prediction results on the new set of data can be analysed.


Working with the command line

Type:

load iris

on the MATLAB command window to load the data. Then, we can build a default settings structure, define the number of epochs (50) and the size of the net (11), and have a look at the final settings structure:

settings = som_settings('cpann');

settings.epochs = 50;

settings.nsize = 11;

settings


After that, we can run the classification model based on CP-ANNs by typing:

model = model_cpann(X,class,settings);

At the end of the calculation, we will have a structure (model) that contains all the classification results. We can evaluate the quality of the classification model by looking at the classification indices (non-error rate, error rate, etc.):

model.res.class_param

We'll get something like this in the MATLAB command window (pay attention: since CP-ANNs can be randomly initialised and samples enter the net in random order in each epoch, each model you build will be slightly different, even if the calculation is repeated on the same data):

ans =

       conf_mat: [3x4 double]
            ner: 0.9933
             er: 0.0067
        not_ass: 0
      precision: [1 0.9804 1]
    sensitivity: [1 1 0.9800]
    specificity: [1 0.9900 1]

This means that the error rate of our model is equal to 0.0067 (roughly 0.01, i.e. 1%); considering precision, specificity and sensitivity, class 1 is perfectly separated, while class 2 and class 3 overlap slightly. This is confirmed by the confusion matrix:

model.res.class_param.conf_mat

ans =

    50     0     0     0
     0    50     0     0
     0     1    49     0

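As a quick check, the rates reported above can be recomputed by hand from the confusion matrix (rows are true classes, columns are assigned classes, and the fourth column counts not-assigned samples, all zero here). In this sketch the non-error rate is computed as the fraction of correctly assigned samples; with balanced classes, as in IRIS, this coincides with the average of the class sensitivities:

```matlab
conf = [50  0  0  0
         0 50  0  0
         0  1 49  0];                           % confusion matrix from the model
ner = sum(diag(conf(:,1:3))) / sum(conf(:))     % (50+50+49)/150 = 0.9933
er  = 1 - ner                                   % 0.0067
```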

We can also analyse the position of the samples on the top map and the map weights in the MATLAB command window, but it is easier to do that with the GUI (see the next paragraph).


Graphical results on the command line

Open the GUI interface by typing:

visualize_model(model)

Then, we can inspect the results. For example, we can plot the class labels and the weights of the third variable (by selecting 'class labels' in the "Display labels" combo box and 'variable 3' in the "Display weights" combo box) and verify that this variable has low values in the samples of the first class:

After that, we can have a look at the output weights of class 1 (of course, the neurons where samples of class 1 are placed will have high weights).

Again, we can look at the assignment of each neuron (i.e. each neuron will be coloured on the basis of the assigned class).

Finally, we can look at the weights of a specific neuron and at the labels of the samples placed in that neuron by using the 'get neuron weights' and 'get neuron labels' buttons:

By clicking on "PCA on weights", we can open a new GUI (next picture) and calculate Principal Component Analysis (PCA) on the weights of the neural network, in order to examine the relationship between variables and neurons in a global way, rather than variable by variable. The weights of the Kohonen layer are arranged as a data matrix with r rows and p columns, where r is the number of neurons (11*11 in this example) and p is the number of variables (4 in this example).

In the GUI it is possible to choose the scaling method (here mean centering) and to colour the neurons in the score plot on the basis of the class assignments, or on the basis of the weights of the output neurons with a grey scale. In this way it is possible to understand the relationship between variables and classes, i.e. how the variables describe the classes. In the picture shown, we plotted scores and loadings on the first two components (96% of explained variance). Each point in the score plot represents a neuron of the previous CP-ANN model, coloured on the basis of the output weight of the first class: black neurons are those with a high weight, i.e. with a high probability of belonging to the first class, while white neurons are those with a low weight on the first class. So, by comparing the score and loading plots, it is easy to see that the first class (neurons on the right of the score plot) is characterised by the second variable (placed on the right in the loading plot), while it has low values on variables 3 and 4 (on the left in the loading plot). Analogous conclusions on classes 2 and 3 can be drawn by selecting the corresponding weights in the "class" combo list. When analysing PCA on the Kohonen weights, it is also possible to draw the Voronoi regions related to each neuron in the score space.
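The underlying computation can be sketched with plain MATLAB. Here W is assumed to be the r-by-p matrix of Kohonen weights described above (121-by-4 in this example), already extracted from the model; how the toolbox stores it internally may differ:

```matlab
% W: r x p matrix of Kohonen weights (assumed already extracted from the model)
Wc = W - repmat(mean(W,1), size(W,1), 1);        % mean centering, as in the GUI
[U,S,V] = svd(Wc, 'econ');                       % PCA via singular value decomposition
scores    = U*S;                                 % neuron coordinates in the PC space
loadings  = V;                                   % variable contributions to each PC
explained = 100 * diag(S).^2 / sum(diag(S).^2);  % explained variance (%) per PC
```

Plotting the first two columns of scores against the first two columns of loadings reproduces the score/loading comparison described above.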


Cross validating the model on the command line

Finally, in order to cross-validate the model, we can use 3 cross-validation groups defined by venetian blinds:

cv = cv_cpann(X,class,settings,1,3);
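With venetian blinds, consecutive samples are assigned to the cross-validation groups in a cyclic way. A minimal sketch of the grouping scheme (the exact assignment used internally by the toolbox may differ):

```matlab
n = 150;  k = 3;                 % number of samples and of cv groups
group = mod((1:n) - 1, k) + 1;   % cyclic assignment: 1 2 3 1 2 3 ...
group(1:6)                       % first six samples -> groups 1 2 3 1 2 3
```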

At the end of the calculation, we can look at the classification results by typing cv.class_param:

cv.class_param

ans =

       conf_mat: [3x4 double]
            ner: 0.9500
             er: 0.0500
        not_ass: 0
      precision: [1 0.92 0.92]
    sensitivity: [1 0.92 0.92]
    specificity: [1 0.96 0.96]

