K-means clustering

K-means clustering partitions the observations into exactly k clusters that are as distinct from one another as possible. Like two-step clustering, this method handles very large datasets well, which is a considerable advantage over the next method, hierarchical clustering. The procedure starts with k initial cluster centres: either the researcher specifies them, or the procedure selects k well-spaced observations. Each case is then assigned to the cluster whose centre is nearest, and the location of each centre is updated to the mean values of the cases in that cluster. Assignment and updating are repeated until the centres stop changing or the maximum number of iterations is reached. The aim is to minimize variability within clusters and maximize variability between them, which also yields the most significant ANOVA results.
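The assign-and-update loop described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the SPSS implementation: the function names, the toy data and the simple stopping rule (stop when the centres no longer move) are all hypothetical choices made for the example.

```python
import math

def assign_cases(points, centres):
    """Return, for each point, the index of the nearest centre (Euclidean distance)."""
    return [min(range(len(centres)), key=lambda j: math.dist(p, centres[j]))
            for p in points]

def update_centres(points, labels, centres):
    """Move each centre to the mean of the cases currently assigned to it."""
    new_centres = []
    for j, old in enumerate(centres):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if not members:                      # empty cluster: keep the old centre
            new_centres.append(tuple(old))
            continue
        dims = len(members[0])
        new_centres.append(tuple(sum(m[d] for m in members) / len(members)
                                 for d in range(dims)))
    return new_centres

def kmeans(points, initial_centres, max_iter=10):
    """Repeat assignment and update until the centres stop moving or max_iter is hit."""
    centres = [tuple(c) for c in initial_centres]
    for _ in range(max_iter):
        labels = assign_cases(points, centres)
        new_centres = update_centres(points, labels, centres)
        if new_centres == centres:           # nothing moved: converged
            break
        centres = new_centres
    return assign_cases(points, centres), centres

# Made-up data: two variables per case (not the antibiotic dataset).
toy = [(90, 85), (80, 95), (85, 90), (10, 20), (20, 10), (15, 15)]
labels, centres = kmeans(toy, initial_centres=[toy[0], toy[3]])
print(labels)   # cluster index of each case
print(centres)  # final cluster centres
```

Here the initial centres are simply two of the cases, chosen by hand; this mirrors the option of the researcher supplying the centres rather than the automatic selection of well-spaced observations.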

We will continue discussing the application of cluster analysis using the example of the sensitivity of different bacteria to 42 antibiotics (see Example 7).

Performing k-means clustering in SPSS

1) Click the Analyze menu, point to Classify, and select K-Means Cluster… :

The K-Means Cluster Analysis dialog box opens:

2) Select the variables on which the classification of objects will be based (“Enterococcus”, “E.coli”, “Staphylococcus”, “Streptococcus”, “Other”) and click the upper transfer arrow button. The selected variables are moved to the Variable(s): list box.

3) Select the variable “Antibiotics”, which contains the names of the antibiotics, then click the lower transfer arrow button. The selected variable is moved to the Label Cases by: list box.

4) Now we should decide how many clusters we want in the final classification. This time we choose two clusters; if we are not satisfied with the resulting classification, we can repeat the whole procedure with three, four or more clusters. So, in the Number of Clusters box we leave the default value “2” unchanged.

5) Click the Iterate… button. The K-Means Cluster Analysis: Iterate dialog box opens:

6) The number of iterations sets a stopping criterion for the clustering procedure: it is the maximum number of times cases are reassigned between clusters and the centres are updated. The default is 10 iterations, which is not always enough, so let us change it to “20”. Click the Continue button to return to the K-Means Cluster Analysis dialog box.

7) Click the Options… button. The K-Means Cluster Analysis: Options dialog box opens:

By default only the Initial cluster centers check box is selected. Select the other two check boxes as well: ANOVA table and Cluster information for each case. Click the Continue button to return once more to the K-Means Cluster Analysis dialog box.

8) Click the OK button. An Output Viewer window opens and displays the results.

The results of k-means clustering consist of seven tables: Initial Cluster Centres (shows the initially assigned cluster centres, which usually change in subsequent iterations):

Iteration History (shows how much the cluster centres changed at each iteration as cases were moved between clusters to reach the final classification):

Cluster Membership (describes membership of each observation):

Final Cluster Centres (contains the values of the cluster centres after clustering has finished):

Distances between Final Cluster Centres (contains the Euclidean distances between the final cluster centres):

ANOVA table (indicates which variables contribute most to the final classification solution):

and the last table is Number of cases in Each Cluster:
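Two of the tables listed above, the distances between final cluster centres and the number of cases in each cluster, follow directly from the final centres and the cluster membership. The sketch below illustrates the arithmetic only; the centres and membership values are made up for the example and are not the SPSS output for the antibiotic data.

```python
import math
from collections import Counter

# Hypothetical final centres for two clusters, two variables each.
final_centres = [(85.0, 90.0), (15.0, 15.0)]

# Hypothetical cluster membership, one cluster number per case.
membership = [1, 1, 1, 2, 2, 2]

# Euclidean distance between the two final centres:
# square root of the sum of squared differences over all variables.
distance = math.dist(final_centres[0], final_centres[1])

# Number of cases in each cluster.
cases_per_cluster = Counter(membership)

print(round(distance, 3))
print(dict(cases_per_cluster))
```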

Looking at the table of final cluster centres, we can see that Cluster 1 corresponds to the more active antibiotics, those to which more strains were sensitive. The table with the number of cases in each cluster shows that the first cluster contains 15 cases and the second the remaining 27. The cluster membership table shows which observations fell into which cluster: the first cluster contains cefazolin, cefoperazone, cefotaxime, cefoxitin, ceftriaxone, ciprofloxacin, furagin, gatifloxacin, gentamicin, imipenem, levofloxacin, lomefloxacin, meropenem, ofloxacin, and pefloxacin.

Finally, the ANOVA table lets us establish which variable was the most important in the resulting classification. This is expressed by the F values: the variable with the largest F value (in our example E. coli, with F = 133.393) provides the greatest separation between the clusters.
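To make the role of the F value concrete, the following sketch computes a one-way ANOVA F statistic for a single variable across clusters, that is, the between-cluster mean square divided by the within-cluster mean square. The function name and the data are hypothetical and chosen for illustration. Because the clusters are constructed to maximize differences between them, such F values are best read descriptively, as a ranking of variables, rather than as formal significance tests.

```python
def anova_f(values, labels):
    """One-way ANOVA F statistic of one variable across clusters."""
    groups = {}
    for v, lab in zip(values, labels):
        groups.setdefault(lab, []).append(v)

    n, k = len(values), len(groups)
    grand_mean = sum(values) / n

    # Between-cluster sum of squares: cluster means around the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups.values())
    # Within-cluster sum of squares: cases around their own cluster mean.
    ss_within = sum((v - sum(g) / len(g)) ** 2
                    for g in groups.values() for v in g)

    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Made-up example: six cases in two clusters, two variables.
labels = [1, 1, 1, 2, 2, 2]
var_a = [90, 80, 85, 10, 20, 15]   # cluster means far apart
var_b = [50, 60, 55, 45, 55, 50]   # cluster means close together

print(anova_f(var_a, labels))  # large F: var_a separates the clusters well
print(anova_f(var_b, labels))  # small F: var_b contributes little
```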

It is worth mentioning that the classification obtained with two-step clustering differs from the one produced by k-means clustering: in the first method the ‘more active’ cluster contained 8 antibiotics, whereas with k-means clustering it contains 15. Comparing the values of the variables in both classifications, we can see that the 8-antibiotic cluster singles out the more active antibiotics better than the 15-antibiotic cluster, because its mean values for the variables are higher. For example, the mean percentage of Enterococcus strains sensitive to the antibiotics in the ‘more active’ cluster is 53.62% in two-step clustering but only 29% in k-means clustering. Two-step clustering is sometimes considered to give a better classification not only from a statistical but also from a ‘real life’ point of view. However, we can try to improve our k-means clustering by increasing the number of clusters. This tends to make the clusters more homogeneous and, therefore, potentially more useful in practice.
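Homogeneity can be quantified as the within-cluster sum of squares: the smaller it is, the closer the cases lie to their own cluster centre. The sketch below compares a two-cluster and a three-cluster partition of the same made-up, one-variable dataset; the function name, the data and both partitions are assumptions for illustration, not results from the antibiotic example. A lower value means tighter clusters, but it does not by itself show that the finer classification is more meaningful.

```python
def within_ss(points, labels):
    """Total within-cluster sum of squared deviations from the cluster centres."""
    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)

    total = 0.0
    for members in groups.values():
        dims = len(members[0])
        centre = [sum(m[d] for m in members) / len(members) for d in range(dims)]
        total += sum((m[d] - centre[d]) ** 2
                     for m in members for d in range(dims))
    return total

# Made-up data: percentage of sensitive strains for six hypothetical cases.
toy = [(90,), (80,), (50,), (40,), (10,), (20,)]

two_clusters = [1, 1, 2, 2, 2, 2]     # middle and low cases lumped together
three_clusters = [1, 1, 2, 2, 3, 3]   # middle and low cases separated

print(within_ss(toy, two_clusters))
print(within_ss(toy, three_clusters))
```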

Let us repeat the k-means specification, but this time type “3” in the Number of Clusters: box of the K-Means Cluster Analysis dialog box:

In the Output Viewer window, the table of cluster centres indicates that the first cluster includes the most active antibiotics and the third the least active ones. In the table of distances between clusters, the distance between the first and the third cluster is the largest, which is consistent with this interpretation:

The ANOVA table shows that this time the variable “Enterococcus” contributed most to the separation between clusters:

The table with the number of cases shows that this time the most active cluster (the first) includes only seven antibiotics but, as the table of cluster centres showed, these antibiotics do have a high level of activity. They are ciprofloxacin, furagin, gatifloxacin, gentamicin, imipenem, levofloxacin and meropenem (taken from the cluster membership table, which is not shown here). On average, 61% of enterococcal strains were sensitive to these antibiotics, as were 91% of E. coli, 89% of staphylococci, 81.2% of streptococci and 77.5% of other bacteria:

Therefore, within the k-means method the three-cluster classification can be accepted as better than the two-cluster classification.