5 SUMMARY AND DISCUSSION
High-dimensional data not only increase the computation time of processing but also degrade the effectiveness of data utilization. This paper proposes a novel data reduction scheme that incorporates a data clustering approach and feature selection techniques. The proposed scheme includes a primitive incremental clustering algorithm and a discerning method for selecting features based on relative difference. The evaluation has shown that the proposed method is effective for different types of single-label datasets. However, discerning the distinctions among features for multi-label problems still requires further investigation.
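The exact form of the relative-difference measure is defined earlier in the paper; as a purely illustrative stand-in, a relative difference between class-conditional feature means for a binary-labelled dataset could be scored as follows (the function name and formula here are assumptions, not the paper's definition):

```python
import numpy as np

def relative_difference_scores(X, y):
    """Illustrative relative-difference score per feature.

    For a binary-labelled dataset (y in {0, 1}), a feature whose mean
    value differs strongly between the two classes, relative to its
    overall magnitude, is treated as more discriminative.  Each column
    is scored independently of the others.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    mean_pos = X[y == 1].mean(axis=0)   # per-feature mean in class 1
    mean_neg = X[y == 0].mean(axis=0)   # per-feature mean in class 0
    # relative (normalised) difference; small epsilon avoids 0/0
    return np.abs(mean_pos - mean_neg) / (mean_pos + mean_neg + 1e-12)
```

For example, a feature taking values 1 in one class and 5 in the other scores higher than a feature that is constant across both classes.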
The advantages of the proposed scheme are as follows. First, the number of reduced dimensions can easily be controlled by the threshold in the incremental clustering algorithm. Second, the scheme is scalable, since the relative discriminant variable for each feature can be calculated independently; the computation is therefore not limited by the size of memory space or by software tools. Third, unlike conventional feature selection methods, the final reduced features are combinations of all potentially significant features rather than a subset of single features drawn from the original datasets.
Processing high-dimensional features is a key problem for many modern applications, such as text classification, information retrieval, social networks, and web analysis. Growth in both data rows and feature columns is a common characteristic of big-data applications. It is worth investigating further how to extend the proposed scheme so that it maintains effective data reduction and adapts efficiently as the data grow. Developing an effective dynamic data reduction solution should be considered an important direction for future work.
ACKNOWLEDGEMENTS
This research was supported in part by the National Science Council of Taiwan, R.O.C., under contract NSC 102-2221-E-024-016.
Incorporating Feature Selection and Clustering Approaches for High-Dimensional Data Reduction