5 SUMMARY AND DISCUSSION
High-dimensional data not only increase the computation time of processing but also degrade the effectiveness of data utilization. This paper proposes a novel data reduction scheme that incorporates a data clustering approach and feature selection techniques. The proposed scheme includes a primitive incremental clustering algorithm and a discerning method for selecting features based on relative difference. The evaluation has shown that the proposed method is effective for different types of single-label datasets. However, discerning the distinctions among features for multi-label problems still requires further investigation.
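The exact form of the relative-difference measure is defined earlier in the paper; as a purely illustrative stand-in, a relative difference between class-conditional feature means for a binary-labelled dataset could be scored as follows (the function name and formula here are assumptions, not the paper's definition):

```python
import numpy as np

def relative_difference_scores(X, y):
    """Illustrative relative-difference score per feature.

    For a binary-labelled dataset (y in {0, 1}), a feature whose mean
    value differs strongly between the two classes, relative to its
    overall magnitude, is treated as more discriminative.  Each column
    is scored independently of the others.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    mean_pos = X[y == 1].mean(axis=0)   # per-feature mean in class 1
    mean_neg = X[y == 0].mean(axis=0)   # per-feature mean in class 0
    # relative (normalised) difference; small epsilon avoids 0/0
    return np.abs(mean_pos - mean_neg) / (mean_pos + mean_neg + 1e-12)
```

For example, a feature taking values 1 in one class and 5 in the other scores higher than a feature that is constant across both classes.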
The advantages of the proposed scheme are as follows. First, the number of reduced dimensions can easily be controlled by the threshold in the incremental clustering algorithm. Second, the scheme is scalable, since the relative discriminant variable for each feature can be calculated independently; the computation is therefore not limited by the size of memory space or by software tools. Third, unlike conventional feature selection methods, the final reduced features are combinations of all potentially significant features rather than a subset of single features drawn from the original datasets.
Processing high-dimensional features is a key problem for many modern applications, such as text classification, information retrieval, social networks, and web analysis. Growth in both data rows and feature columns is a common characteristic of big-data applications. It is worth investigating further how to extend the proposed scheme so that it maintains effective data reduction and adapts efficiently as the data grow. Developing an effective dynamic data reduction solution should be considered an important direction for future work.
ACKNOWLEDGEMENTS
This research was supported in part by the National Science Council of Taiwan, R.O.C., under contract NSC 102-2221-E-024-016.
Incorporating Feature Selection and Clustering Approaches for High-Dimensional Data Reduction