Table 1: Results of ASCDD and ENCLUS on “Gas Sensor Array Drift”.
Cluster | ASCDD Accuracy | ASCDD Subspace | ENCLUS Accuracy | ENCLUS Subspace
1 | 68% | 76, 113, 17, 4, 79, 70, 14, 68, 121, 57, 15, 6, 7, 53, 118, 12, 54, 62, 127 | 41% | 113, 4, 79, 70, 68, 57, 15, 54, 7, 14, 53, 118, 83, 14, 73
2 | 67% | 15, 6, 78, 49, 7, 12, 55, 63 | 55% | 20, 6, 78, 30, 19, 7, 66, 23, 11, 50, 93
3 | 39% | 47, 24, 107, 111, 88, 97, 99, 105 | 31% | 88, 40, 26, 113, 105, 95, 33, 28, 16
4 | 68% | 44, 108, 39, 47, 24, 103, 111, 88, 97, 99, 105 | 52% | 111, 23, 108, 75, 39, 94, 47, 85
5 | 34% | 112, 56, 120, 122, 98, 16, 35, 106, 43, 80, 36, 108, 24, 107, 88, 97, 99, 105 | 19% | 112, 43, 106, 16, 80, 24, 74, 87, 86, 98, 19, 108, 58
6 | 88% | 65, 9, 76, 4, 79, 70, 14, 68, 15, 6, 78, 7, 12, 39, 47, 103 | 59% | 65, 83, 4, 68, 70, 6, 81, 14, 7, 103, 79
clustering within a subspace, no matter whether its
dimension is high or low.
As the number of objects increases, the run-time of
ASCDD grows quadratically and exceeds that of ENCLUS.
The reason is that the calculation of the density for
one object in ASCDD involves all objects, whereas
ENCLUS works similarly to CLIQUE and partitions the
objects into grid cells, which makes it far less
sensitive to the number of objects. Although ASCDD
does not scale linearly with the number of objects,
this additional cost ensures a complete clustering
result. The run-time with regard to the number of
objects also depends on the parameter setting:
a DDT that yields many objects in the clustering
result takes more time than a DDT that involves
fewer objects.
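The difference in scaling can be illustrated with a minimal sketch. The Gaussian kernel used for the per-object density, the fixed cell width, and the names `pairwise_density`, `grid_counts`, `sigma`, and `cell_width` are illustrative assumptions; they are not the distance-density function of ASCDD or the actual ENCLUS implementation. The point is only that the first routine touches every pair of objects, while the second visits each object once.

```python
import math

def pairwise_density(points, sigma=1.0):
    """Density of each object from its distances to ALL objects: O(n^2)."""
    densities = []
    for p in points:
        total = 0.0
        for q in points:  # inner loop over every object -> quadratic cost
            sq_dist = sum((a - b) ** 2 for a, b in zip(p, q))
            # Hypothetical Gaussian kernel, chosen only for illustration.
            total += math.exp(-sq_dist / (2.0 * sigma ** 2))
        densities.append(total)
    return densities

def grid_counts(points, cell_width=1.0):
    """Grid-based density: each object is assigned to one cell: O(n)."""
    counts = {}
    for p in points:
        cell = tuple(math.floor(a / cell_width) for a in p)
        counts[cell] = counts.get(cell, 0) + 1
    return counts
```

Doubling the number of objects roughly quadruples the work of `pairwise_density` but only doubles the work of `grid_counts`.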
ENCLUS finds almost the same low-dimensional
subspace candidates, but it is slower than ASCDD
for high-dimensional subspaces, because ENCLUS
clusters only bottom-up, from low- to high-dimensional
subspaces, which takes much more time than clustering
directly in the high-dimensional subspace as ASCDD does.
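To give a sense of the cost of a bottom-up search, the sketch below counts the lower-dimensional projections of a single k-dimensional subspace that a strictly level-wise (Apriori-style) search has to consider before reaching it. This is a simplified illustration that ignores any sharing of work between candidates; the function name `lower_dimensional_subspaces` is hypothetical and not part of ENCLUS.

```python
from math import comb

def lower_dimensional_subspaces(k):
    """Number of proper, non-empty sub-subspaces of a k-dimensional subspace."""
    # Sum over all dimensionalities j = 1 .. k-1 of the j-element subsets.
    return sum(comb(k, j) for j in range(1, k))
```

For a 16-dimensional subspace, `lower_dimensional_subspaces(16)` gives 65534, which a direct search in the high-dimensional subspace does not need to enumerate.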
4.2 Real Data
The data set “Gas Sensor Array Drift” was obtained
from the UC Irvine Machine Learning Repository
(Frank and Asuncion, 2010). It contains the
measurements of 16 chemical sensors utilized in
simulations for drift compensation in discriminating
six gas types (Ammonia, Acetaldehyde, Acetone,
Ethylene, Ethanol, and Toluene) at various
concentrations. The data was prepared for the
chemo-sensor research and artificial intelligence
communities to develop strategies to cope with
sensor/concept drift. The data set contains 13910
measurements in 128 dimensions with six clusters
(the six gas types). We applied ASCDD and ENCLUS to
the data without cluster labels and then compared
the results with the cluster labels. The clusters are
located in different subspaces, which means that
particular subspaces specialize in detecting
particular gas types. Table 1 shows some examples of
the clustering result and the accuracies for the data
from months one and two. The accuracy of a cluster is
defined as the ratio of the number of correctly
clustered objects to the number of objects in that
cluster.
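The accuracy definition can be sketched as follows. The sketch assumes that the "correctly clustered" objects of a cluster are those carrying the ground-truth label that is most frequent in it; the text does not state how clusters are matched to gas types, so this matching rule and the name `cluster_accuracy` are assumptions.

```python
from collections import Counter

def cluster_accuracy(true_labels_in_cluster):
    """Fraction of objects in one cluster that carry its majority label."""
    if not true_labels_in_cluster:
        raise ValueError("cluster is empty")
    counts = Counter(true_labels_in_cluster)
    # Assumption: the majority ground-truth label defines "correct".
    _, n_correct = counts.most_common(1)[0]
    return n_correct / len(true_labels_in_cluster)
```

For example, `cluster_accuracy(["Ethanol", "Ethanol", "Acetone", "Ethanol"])` returns 0.75.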
This clustering process takes 1440 seconds with
ASCDD and 4410 seconds with ENCLUS. Compared
with ENCLUS, ASCDD is more efficient on
high-dimensional subspaces and is able to detect the
clusters directly in these subspaces with higher
accuracy.
5 CONCLUSIONS
In contrast to traditional clustering methods,
ASCDD is suitable for complex data with arbitrary
forms. It provides useful distribution information and
is easy to apply, since clustering requires only one
parameter, the DDT. The clusters are detected according
to their densities, so the result does not depend on
the input order. The results of ASCDD in our
experiments show high accuracy.
In this paper we improve the methods of subspace
detection and parameter determination in the
subspace clustering method ASCDD for high-dimensional
data sets. By using entropy, ASCDD is able to detect
high-dimensional subspace candidates easily, where a
subspace with low entropy is considered a potential
subspace. We develop a way to extend subspace
candidates to their maximum dimensionality. ASCDD can
directly find clusters within the located subspace
candidates. Since the clustering result and its
quality depend on the choice of the parameter DDT, we
investigate the DDT and introduce a method for
choosing it. The DDT can be chosen to favor either
complete clustering results or a short run-time. One
direction of our future work is to reduce the
computation time for very large numbers of objects.
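The entropy criterion for subspace candidates can be illustrated with a minimal sketch in the style of grid-based entropy. It assumes that every coordinate is already scaled to [0, 1] and that each dimension is split into equal-width intervals; the name `subspace_entropy` and the parameter `bins` are hypothetical and do not reproduce the exact procedure of ASCDD.

```python
import math
from collections import Counter

def subspace_entropy(points, dims, bins=10):
    """Shannon entropy of the grid-cell occupancy of points projected on dims.

    Assumes every coordinate is already scaled to the interval [0, 1].
    """
    cells = Counter(
        tuple(min(int(p[d] * bins), bins - 1) for d in dims) for p in points
    )
    n = len(points)
    # Low entropy: objects concentrate in few cells (a candidate subspace).
    return -sum((c / n) * math.log2(c / n) for c in cells.values())
```

A dimension in which all objects fall into one cell has entropy 0, while a dimension in which the objects are spread evenly over all cells reaches the maximum, log2(bins).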
REFERENCES
Aggarwal, C. C., Wolf, J. L., Yu, P. S., Procopiuc, C., and
Park, J. S. (1999). Fast algorithms for projected clus-