on projection pursuit (Gao and Wang, 2007). The idea is founded on the observation that linear and non-linear structures and features of the original high-dimensional data can be expressed by their projection weights along the optimal projection direction. Their results showed that the method is effective for clustering texts. Martin-Merino et al. proposed a new semi-supervised dimension reduction method for textual data analysis (Martin-Merino and Roman, 2006); here, semi-supervised dimension reduction means exploiting a manually created classification of a subset of the documents. More recently, dimension reduction based on independent component analysis (ICA), which uses ICA to select independent feature vectors, has shown better performance (Shafiei et al., 2007).
In this study, we focus on the selection of characteristic axes for PRDC. The text representation ability of the proposed method is compared with the bag-of-words and N-gram models: each method is used to construct feature vectors for several benchmark datasets, and a popular clustering algorithm, k-means, is applied to evaluate the resulting representations. Based on the experimental results, we show how the proposed method performs compared with the other two methods, as illustrated by the sketch below.
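As a rough illustration of this comparison pipeline (not the experimental code used in this paper), the following sketch builds bag-of-words and character N-gram features for a toy corpus with scikit-learn and clusters each representation with k-means; the corpus strings and parameter values are invented for illustration.

```python
# Illustrative sketch only (not this paper's experimental code):
# bag-of-words and character N-gram features built with scikit-learn,
# each clustered by k-means. Corpus and parameters are made up.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

docs = [
    "data compression represents documents compactly",
    "compression dictionaries capture repeated patterns in documents",
    "heavy rain and strong winds are forecast for the weekend",
    "the weekend forecast predicts rain across the region",
]

# Bag-of-words: one feature per word type.
bow = CountVectorizer(analyzer="word").fit_transform(docs)
# Character N-grams (here 2- and 3-grams).
ngrams = CountVectorizer(analyzer="char", ngram_range=(2, 3)).fit_transform(docs)

for name, X in [("bag-of-words", bow), ("char n-grams", ngrams)]:
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(name, labels)
```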
2 THE PROPOSED METHOD
AND THE TRADITIONAL
METHODS
In this section, we first introduce the pattern represen-
tation scheme using data compression (PRDC). Then
we show how to choose characteristic axes for PRDC.
2.1 Pattern Representation Scheme
using Data Compression
In this study, data compression is used to represent documents. In general, data compression encodes an input string using a model of the input information source, and a compression dictionary serves as this model. The compression dictionary is produced automatically while the input data are being compressed, e.g., in Lempel-Ziv (LZ) compression (Ziv and Lempel, 1978). In the same way, PRDC constructs a compression dictionary by encoding the input data. It then builds a compressibility vector space from the compression dictionaries and projects new input data into this space, so that the features of the data are represented by compressibility vectors. Finally, PRDC classifies the data by analyzing these compressibility vectors.
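To make this concrete, the following is a minimal sketch, not the PRDC implementation: it approximates the idea of compressing one document with a model derived from another by using zlib's preset-dictionary (LZ77) support in place of the LZ78-style dictionary described above; the document strings are invented.

```python
# Rough stand-in only: PRDC builds an LZ dictionary while compressing a
# document, whereas this sketch approximates the idea with zlib's
# preset-dictionary feature. Document strings are invented.
import zlib

def compression_ratio(doc: bytes, dictionary_doc: bytes) -> float:
    """Compress `doc` with a dictionary derived from `dictionary_doc` and
    return output size / input size (a lower ratio suggests more shared structure)."""
    comp = zlib.compressobj(level=9, zdict=dictionary_doc)
    out = comp.compress(doc) + comp.flush()
    return len(out) / len(doc)

a = b"pattern representation by data compression builds a dictionary from each document"
b = b"pattern representation with compression dictionaries characterizes each document"
c = b"quarterly revenue figures exceeded analyst expectations in several markets"

print(compression_ratio(b, a))  # expected to be lower: b shares substrings with a
print(compression_ratio(c, a))  # expected to be higher: c shares little with a
```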
Subsequently, PRDC is used as follows to analyze the relations among similar documents. The compression dictionaries constitute a compressibility vector space, which can be represented by a compressibility table obtained by projecting the input documents into that space. Let $N_i$ be an input document. By compressing $N_i$, a compression dictionary $D_{N_i}$ is obtained. Compressing a document $N_j$ with $D_{N_i}$ yields the compression ratio $C_{N_j D_{N_i}} = K_{N_i} / L_{N_j}$, where $L_{N_j}$ is the size of the input stream $N_j$ and $K_{N_i}$ is the size of the corresponding output stream (the result of compressing $N_j$ with $D_{N_i}$). Compressing every document with all of the dictionaries, we obtain a compressibility vector for each input document. In the compressibility table, the columns correspond to the document data $N_j$, the rows correspond to the compression dictionaries $D_{N_i}$ formed from the same documents, and the elements are the compressibilities $C_{N_j D_{N_i}}$ [%]. PRDC utilizes this table to characterize documents.
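The table construction can be sketched as follows, again using zlib's preset dictionary as a stand-in for the PRDC dictionary and an invented toy corpus: every document supplies a dictionary, every document is compressed against every dictionary, and each row of the resulting matrix is that document's compressibility vector.

```python
# Toy sketch of the compressibility table described above; zlib's preset
# dictionary substitutes for the PRDC dictionary and the corpus is invented.
import zlib

def compressibility(doc: bytes, dictionary_doc: bytes) -> float:
    comp = zlib.compressobj(level=9, zdict=dictionary_doc)
    return len(comp.compress(doc) + comp.flush()) / len(doc)

def compressibility_table(docs):
    # table[j][i] approximates C_{N_j D_{N_i}}: document j compressed with dictionary i.
    return [[compressibility(doc_j, doc_i) for doc_i in docs] for doc_j in docs]

docs = [
    b"data compression builds a dictionary from each input document",
    b"compression dictionaries characterize every input document",
    b"heavy rain and strong winds are forecast for the weekend",
]
for row in compressibility_table(docs):
    print(["%.2f" % c for c in row])
```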
2.2 The Proposed Method
In PRDC, the compressibility vector space (dictionary space) is constructed by randomly choosing dictionaries. This is simple to implement and provides competitive performance. However, with this approach it is impossible to know how many dictionaries should be selected and where they lie in the data space. An appropriate dictionary selection method is therefore needed to improve the representation performance of PRDC. In general, dictionary selection in PRDC can be regarded as selecting dictionaries from a few large clusters in the data space. A large cluster occupies a large region, so randomly selected dictionaries may not represent the cluster properly if one or more of them lie at the edge of the cluster. The larger the clusters become, the larger the selection bias is.
Small clusters, in contrast, can be considered 'pure': dictionaries selected at random from small clusters can form a representative feature space, and this selection determines both how many dictionaries to select and where to select them. However, as the dataset grows, the number of such pure clusters increases and consequently so does the number of selected dictionaries. Hence, this approach suffers from the curse of dimensionality, a well-known problem in data analysis, when the dataset becomes large.
We propose to use ICA to improve the construction of the vector space so that it represents input documents properly, since ICA produces spatially localized and statistically independent basis