Table 6: Cluster sizes using three algorithms, excluding the two outliers.
Algorithm  K=2      K=3         K=4           K=5             K=6
Kmeans     288,92   264,8,108   7,91,238,44   27,34,234,5,80  240,26,41,3,5,65
PAM        246,134  240,50,90   186,71,43,80  186,69,39,6,80  186,69,36,6,68,15
Agnes      227,143  237,13,130  227,5,8,130   237,5,8,100,30  237,4,8,100,30,1
bees is about 24 percent of the total number of bees.
A similar percentage (20 percent) was reported by Tenczar
et al. (2014). Also, when three clusters are considered,
both Kmeans and AGNES produce a third cluster that is very
small compared with the other two, which suggests
the existence of more outliers.
3.1 Clustering Validation
Now, in order to determine the optimal number of
clusters, we computed four internal cluster validation
measures: the Silhouette, the Dunn index, the
Davies-Bouldin index, and the Calinski-Harabasz index
(Halkidi et al., 2001). A Silhouette value close to 1
indicates a good clustering. For the Dunn index, the best
number of clusters is the one with the highest value.
For the Davies-Bouldin index, it is the one with the
lowest value. For the Calinski-Harabasz index, it is
the one with the highest value.
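To make the four criteria concrete, the following minimal Python sketch implements their textbook definitions on a toy data set. It is an illustration only: the paper's values were computed with R packages, whose implementations may differ in details such as the choice of distance or centroid, and the function names and the toy points below are assumptions introduced here.

```python
import math
from collections import defaultdict

def group(points, labels):
    """Map each cluster label to the list of its points."""
    clusters = defaultdict(list)
    for p, lab in zip(points, labels):
        clusters[lab].append(p)
    return dict(clusters)

def centroid(pts):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(c) / len(pts) for c in zip(*pts))

def silhouette(points, labels):
    """Mean silhouette width; values near 1 indicate a good clustering."""
    clusters = group(points, labels)
    widths = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            widths.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for k, other in clusters.items() if k != lab)
        widths.append((b - a) / max(a, b))
    return sum(widths) / len(widths)

def dunn(points, labels):
    """Smallest between-cluster distance over largest diameter (higher is better)."""
    cl = list(group(points, labels).values())
    min_sep = min(math.dist(p, q)
                  for i, ci in enumerate(cl) for cj in cl[i + 1:]
                  for p in ci for q in cj)
    max_diam = max(math.dist(p, q) for c in cl for p in c for q in c)
    return min_sep / max_diam

def davies_bouldin(points, labels):
    """Mean worst-case ratio of within-cluster scatter to centroid distance (lower is better)."""
    cl = list(group(points, labels).values())
    cents = [centroid(c) for c in cl]
    scatter = [sum(math.dist(p, m) for p in c) / len(c) for c, m in zip(cl, cents)]
    worst = [max((scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
                 for j in range(len(cl)) if j != i)
             for i in range(len(cl))]
    return sum(worst) / len(worst)

def calinski_harabasz(points, labels):
    """Between-cluster over within-cluster dispersion, scaled by degrees of freedom (higher is better)."""
    cl = list(group(points, labels).values())
    n, k = len(points), len(cl)
    overall = centroid(points)
    between = sum(len(c) * math.dist(centroid(c), overall) ** 2 for c in cl)
    within = sum(math.dist(p, centroid(c)) ** 2 for c in cl for p in c)
    return (between / (k - 1)) / (within / (n - k))

# Toy data (not the bee data): two well-separated unit squares.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
for f in (silhouette, dunn, davies_bouldin, calinski_harabasz):
    print(f.__name__, round(f(points, labels), 4))
```

On such well-separated data the Silhouette is close to 1, the Dunn and Calinski-Harabasz values are large, and the Davies-Bouldin value is small, which matches the direction of each criterion described above.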
Table 7 shows the results of the measures for the
kmeans algorithm. The Davies-Bouldin index was
computed using the R package clusterSim and the
remaining ones using the R package fpc.
Table 7: Internal measures for clustering validation using
kmeans; (*) indicates the best result for a clustering
validation measure.
Measure            K=2      K=3      K=4     K=5     K=6
Silhouette         0.4557*  0.4483   0.3813  0.3773  0.3865
Dunn               0.0571   0.0780*  0.0607  0.0607  0.0685
Davies-Bouldin     1.6793   1.3964*  1.8070  1.6652  1.6955
Calinski-Harabasz  133.97*  125.74   108.74  98.22   88.64
From Table 7, we can see that the optimal number of
clusters can be either two or three: the Silhouette and
Calinski-Harabasz indices favor two, whereas the Dunn and
Davies-Bouldin indices favor three. Visualization
(see Figure 4) suggests three clusters, but according
to the co-authors of this paper with high domain
knowledge of bee behavior, it is better to consider only
two clusters.
In Table 8, we show the cluster validation measures
for the clusters obtained using the PAM algorithm.
Table 8: Internal measures for clustering validation using
PAM, (*) indicates the best result for a measure.
Measure            K=2      K=3      K=4     K=5     K=6
Silhouette         0.4019*  0.3882   0.1971  0.1961  0.2110
Dunn               0.0395   0.0464*  0.0242  0.0255  0.0255
Davies-Bouldin     1.6773*  2.1711   2.1964  1.9573  1.8428
Calinski-Harabasz  126.69*  97.18    74.17   83.34   80.26
By majority vote, two clusters appears to be the
optimal number: three of the four measures in Table 8
favor K=2. In this case, the result agrees with the
opinion of our co-authors with domain knowledge of
bee behavior.
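The voting rule can be sketched in a few lines of Python. The numbers are copied from Tables 7 and 8; the helper name `vote` and the dictionary layout are assumptions made for this illustration, not part of the original analysis.

```python
from collections import Counter

KS = [2, 3, 4, 5, 6]

# Each entry: measure name -> (values for K=2..6, True if higher is better).
KMEANS = {  # Table 7
    "Silhouette": ([0.4557, 0.4483, 0.3813, 0.3773, 0.3865], True),
    "Dunn": ([0.0571, 0.0780, 0.0607, 0.0607, 0.0685], True),
    "Davies-Bouldin": ([1.6793, 1.3964, 1.8070, 1.6652, 1.6955], False),
    "Calinski-Harabasz": ([133.97, 125.74, 108.74, 98.22, 88.64], True),
}
PAM = {  # Table 8
    "Silhouette": ([0.4019, 0.3882, 0.1971, 0.1961, 0.2110], True),
    "Dunn": ([0.0395, 0.0464, 0.0242, 0.0255, 0.0255], True),
    "Davies-Bouldin": ([1.6773, 2.1711, 2.1964, 1.9573, 1.8428], False),
    "Calinski-Harabasz": ([126.69, 97.18, 74.17, 83.34, 80.26], True),
}

def vote(table):
    """Count, for each K, how many measures select it as the best."""
    votes = Counter()
    for values, higher_is_better in table.values():
        pick = max if higher_is_better else min
        best = pick(range(len(values)), key=values.__getitem__)
        votes[KS[best]] += 1
    return dict(votes)

print("kmeans:", vote(KMEANS))
print("PAM:   ", vote(PAM))
```

With these values the kmeans measures split evenly between K=2 and K=3, while for PAM three of the four measures select K=2.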
3.2 Clusters Visualization
In this section, we show plots for both the two and
three clusters given by the kmeans algorithm and for the
two clusters given by PAM.
From Figure 2, it can be seen very clearly that bees
from the smaller cluster (red) always have more activity
than bees in cluster 1 (blue). Also, we can notice
that the bees' activity starts to increase at day 5.
This is very clear in the red cluster. In addition,
the bees' activity is noticeably greater in the afternoons.
The majority of the members of cluster 1 come
from colony M (168 bees out of 288, 58.33%),
whereas most of the bees in cluster 2 are from colony
L (55 bees out of 92, 59.78%). A Chi-square test
yields a p-value of .014; hence there is a
statistically significant dependency between clusters
and colonies. On the other hand, the other categorical
attribute, "Treatment", behaves in a similar way
for both clusters, giving a p-value of .499. However,
the largest share of the members of cluster 1 (23.96
percent of bees) comes from treatment "Mix D", whereas
21.74 percent of bees in cluster 2 come from
treatment "Mix E". Finally, we analyzed the combined
effect of both "Treatment" and "Colony" on the
cluster formation, and in fact there is an effect. In the
small cluster that includes 92 bees, the p-value for the
Chi-square test is .017, which is significant at the
5 percent level. In the large cluster including 288 bees,
the p-value for the Chi-square test is .016. In cluster 1,
27.27 percent of members belong to colony L and treatment
"Mix E". Also, in the second cluster, 25.95 percent
of bees belong to colony M and treatment "Mix D".
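For readers unfamiliar with the test used above, the following Python sketch shows how a Pearson Chi-square test of independence is computed from a contingency table of cluster by colony counts. The counts in `toy_table` are made up for illustration and are not the bee data, so the resulting p-value does not reproduce the values reported here. The function names are assumptions, no continuity correction is applied, and the tail probability uses the closed form valid for integer degrees of freedom.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variable, integer df >= 1."""
    y = x / 2.0
    if df % 2 == 0:
        # Even df: finite sum from the Poisson/gamma identity.
        terms = sum(y ** i / math.factorial(i) for i in range(df // 2))
        return math.exp(-y) * terms
    # Odd df: start from df = 1 (erfc) and add the recurrence terms.
    extra = sum(y ** (i + 0.5) / math.gamma(i + 1.5) for i in range((df - 1) // 2))
    return math.erfc(math.sqrt(y)) + math.exp(-y) * extra

def chi2_independence(table):
    """Return (statistic, degrees of freedom, p-value) for an r x c table of counts."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n  # count expected under independence
            stat += (observed - expected) ** 2 / expected
    df = (len(row_tot) - 1) * (len(col_tot) - 1)
    return stat, df, chi2_sf(stat, df)

# Hypothetical counts: rows are clusters, columns are colonies.
toy_table = [[30, 20, 10],
             [10, 20, 30]]
stat, df, p = chi2_independence(toy_table)
print(f"chi2 = {stat:.3f}, df = {df}, p = {p:.2e}")
```

A small p-value, as in the cluster versus colony comparison above, indicates that cluster membership and the categorical attribute are unlikely to be independent.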
Figure 3 shows the bees grouped into two clusters
according to their daily activity. From Figure 4,
we can clearly notice that bees in cluster 2 (blue)
have more activity than bees in cluster 1 (red) and
cluster 3 (cyan), but bees in cluster 3 start to increase
their activity at day 7 and become the leading group.
The majority of the members of clusters 1 and 3
come from colony M, but most of the bees in cluster
2 are from colony L. Finally, in Figure 5, we visualize
the two clusters obtained by PAM. Figure 5 suggests