show that the proposed method can help us to understand
complex relationships among documents without
using morphological analysis.
2 RELATION ANALYSIS AND
TOPIC EXTRACTION WITH
DATA COMPRESSION
2.1 Representation of Data Feature
using PRDC
In this paper, data compression is used to represent
overlapped topics. In general, data compression encodes
an input string using a model of the input information
source, and a compression dictionary serves as that
model. The compression dictionary is produced
automatically while the input data are compressed, e.g.,
in Lempel-Ziv (LZ) compression (Timothy C. Bell, 1990),
(J. Ziv, 1978). In the same way, PRDC constructs a
compression dictionary by encoding the input data. It
then forms a compressibility space from the compression
dictionaries and projects new input data into that
space, so that the features of the data are represented
by a compressibility vector. Finally, PRDC classifies
the data by analyzing these compressibility vectors.
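
The dictionary-then-projection idea described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the text says LZ compression without naming a variant, so an LZ78-style parse is assumed; the function names `build_dictionary`, `compression_ratio`, and `compressibility_vector` are hypothetical; and sizes are counted in emitted codes and input characters rather than bytes, giving a fraction instead of a percentage.

```python
def build_dictionary(text):
    """Collect the phrases that an LZ78-style parse of `text` adds to its dictionary."""
    phrases = set()
    current = ""
    for ch in text:
        current += ch
        if current not in phrases:
            # A new phrase = a known phrase extended by one character.
            phrases.add(current)
            current = ""
    return phrases


def compression_ratio(text, phrases):
    """Encode `text` greedily with a fixed dictionary.

    Returns (codes emitted) / (characters read); lower means better compression.
    Assumption: a character with no matching phrase costs one code on its own.
    """
    if not text:
        return 0.0
    max_len = max((len(p) for p in phrases), default=1)
    codes = 0
    pos = 0
    while pos < len(text):
        step = 1  # fall back to a single character
        for length in range(min(max_len, len(text) - pos), 1, -1):
            if text[pos:pos + length] in phrases:
                step = length  # longest matching phrase wins
                break
        pos += step
        codes += 1
    return codes / len(text)


def compressibility_vector(text, dictionaries):
    """Project `text` into the space spanned by the given dictionaries."""
    return [compression_ratio(text, d) for d in dictionaries]


if __name__ == "__main__":
    base_documents = ["abababababab", "xyzxyzxyzxyz"]
    dictionaries = [build_dictionary(doc) for doc in base_documents]
    print(compressibility_vector("ababab", dictionaries))
```

A text that shares many phrases with a base document gets a small coordinate on that document's axis, while unrelated text stays near 1.0.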
2.2 Approach of Document Relation
Analysis
In this paper, we propose a document analysis ap-
proach based on PRDC.
Preprocessing. In the preprocessing procedure, we
unify the character codes and remove white spaces,
newlines, tabs, and all non-alphabetical characters.
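
A possible reading of this preprocessing step is sketched below. Two points are assumptions rather than statements from the paper: unifying the character codes is interpreted as Unicode NFKC normalization, and "alphabetical" is interpreted as the ASCII letters A-Z and a-z. The function name `preprocess` is hypothetical.

```python
import re
import unicodedata


def preprocess(text):
    """Return `text` reduced to a contiguous run of ASCII letters."""
    # Assumption: NFKC folds compatibility forms (e.g. full-width letters)
    # into one canonical code for each character.
    normalized = unicodedata.normalize("NFKC", text)
    # Drops white space, newlines, tabs, digits, and punctuation in one pass.
    return re.sub(r"[^A-Za-z]", "", normalized)


if __name__ == "__main__":
    print(preprocess("Data compression,\n\tversion 2!"))
```

Letter case is left untouched here because the text does not say whether case is folded.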
Clustering Similar Documents by PRDC. Subsequently,
PRDC is used as follows for relation analysis of
similar documents, which have overlapped topics.
Compression dictionaries are obtained by compressing
the input documents with Lempel-Ziv (LZ) compression.
These dictionaries constitute a compressibility space.
A compressibility vector table is made by projecting
the input documents into the compressibility space.
Let $N_i$ be an input document. By compressing it, a
compression dictionary is obtained, which is denoted
$D_{N_i}$. Compressing document $N_j$ with $D_{N_i}$,
we get the compression ratio

$$C_{N_j D_{N_i}} = \frac{K_{N_i}}{L_{N_j}},$$

where $L_{N_j}$ is the size of the input stream $N_j$
and $K_{N_i}$ is the size of the output stream.
Compressing with all of the dictionaries, we obtain a
compression ratio vector for each input document. In the
compressibility vector table, the columns show the
document data $N_j$, the rows show the compression
dictionaries $D_{N_i}$ formed from the same documents,
and the elements show the compression ratio
$C_{N_j D_{N_i}}$ [%]. Similar text in different
documents can be extracted by clustering the compression
ratio vectors.
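
The table construction and clustering described above can be sketched as follows. Several parts are assumptions: the LZ variant (an LZ78-style parse), the size measure (emitted codes over input characters, as a fraction rather than a percentage), and above all the clustering rule, which the text does not specify; a simple distance-threshold grouping is used here purely for illustration. All function names are hypothetical.

```python
import math


def build_dictionary(text):
    """Phrases added by an LZ78-style parse of `text` (assumed LZ variant)."""
    phrases, current = set(), ""
    for ch in text:
        current += ch
        if current not in phrases:
            phrases.add(current)
            current = ""
    return phrases


def compression_ratio(text, phrases):
    """Codes emitted / characters read when encoding with a fixed dictionary."""
    if not text:
        return 0.0
    max_len = max((len(p) for p in phrases), default=1)
    codes = pos = 0
    while pos < len(text):
        step = 1
        for length in range(min(max_len, len(text) - pos), 1, -1):
            if text[pos:pos + length] in phrases:
                step = length
                break
        pos += step
        codes += 1
    return codes / len(text)


def compressibility_table(documents):
    """table[i][j]: ratio of document j compressed with the dictionary of document i."""
    dictionaries = [build_dictionary(doc) for doc in documents]
    return [[compression_ratio(doc, d) for doc in documents] for d in dictionaries]


def document_vectors(table):
    """Each column of the table is the compression ratio vector of one document."""
    return [list(column) for column in zip(*table)]


def cluster(vectors, threshold):
    """Group vectors linked by a chain of Euclidean distances <= threshold.

    Assumption: this threshold rule stands in for the unspecified clustering step.
    """
    clusters = []
    for idx, vec in enumerate(vectors):
        linked = [c for c in clusters
                  if any(math.dist(vec, vectors[m]) <= threshold for m in c)]
        merged = [idx]
        for c in linked:
            merged.extend(c)
            clusters.remove(c)
        clusters.append(sorted(merged))
    return clusters


if __name__ == "__main__":
    docs = ["aaaaaaaa", "aaaaaa", "bbbbbbbb", "bbbbbb"]
    table = compressibility_table(docs)
    print(cluster(document_vectors(table), threshold=0.5))
```

Rows correspond to dictionaries and columns to documents, matching the table layout in the text; the threshold value is an illustrative choice and would need tuning on real documents.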
Relation Analysis of Documents. The documents that
belong to each cluster obtained with PRDC share common
topics. Here, we extract those common topics. A topic in
a given document consists of specific phrases or words
that appear more than once. If these phrases or words
can be found, the content of the common topics in a
document can be understood. PRDC is used to discover the
specific phrases and words that belong to a common
topic. First, each document in a cluster is compressed
separately, and a compressibility space is composed from
the generated compression dictionaries. Next, all of the
documents in the cluster are divided into fragments of
length L characters or words, and all fragments are
mapped to the compressibility space.
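
The fragment mapping described above can be sketched as follows. The assumptions are: fragments are non-overlapping and measured in characters (the text also allows words), a shorter trailing fragment is kept, the LZ variant is an LZ78-style parse, and ratios are fractions of codes per character. The names `split_fragments` and `map_fragments` are hypothetical.

```python
def build_dictionary(text):
    """Phrases added by an LZ78-style parse of `text` (assumed LZ variant)."""
    phrases, current = set(), ""
    for ch in text:
        current += ch
        if current not in phrases:
            phrases.add(current)
            current = ""
    return phrases


def compression_ratio(text, phrases):
    """Codes emitted / characters read when encoding with a fixed dictionary."""
    if not text:
        return 0.0
    max_len = max((len(p) for p in phrases), default=1)
    codes = pos = 0
    while pos < len(text):
        step = 1
        for length in range(min(max_len, len(text) - pos), 1, -1):
            if text[pos:pos + length] in phrases:
                step = length
                break
        pos += step
        codes += 1
    return codes / len(text)


def split_fragments(text, length):
    """Cut `text` into consecutive, non-overlapping fragments of `length` characters."""
    return [text[i:i + length] for i in range(0, len(text), length)]


def map_fragments(documents, length):
    """Map every fragment of every document to its coordinates in the space.

    Returns (document index, fragment, coordinates) triples, with one
    coordinate per document dictionary in the cluster.
    """
    dictionaries = [build_dictionary(doc) for doc in documents]
    mapped = []
    for doc_index, doc in enumerate(documents):
        for fragment in split_fragments(doc, length):
            coords = tuple(compression_ratio(fragment, d) for d in dictionaries)
            mapped.append((doc_index, fragment, coords))
    return mapped


if __name__ == "__main__":
    for entry in map_fragments(["aaaaaaaa", "bbbbbbbb"], length=4):
        print(entry)
```

With two documents in a cluster, each fragment becomes a point in a two-dimensional compressibility space that can be plotted directly.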
From the study of the compressibility of each fragment,
a specific phrase or word can be discovered, because a
fragment that is compressed to the same level by the
compression dictionary of every document appears in
common in the cluster. In other words, a fragment
plotted on the diagonal of the compressibility space can
be considered a common topic between two documents.
Conversely, fragments that cannot be compressed by any
compression dictionary are plotted far from the origin
of the compressibility space. Such a fragment is
considered an "unknown" fragment in its cluster of
documents. Then, the relation table is made by
extracting the common fragments. By analyzing the
fragments in the table, overlapped topics between
documents can be confirmed.
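
The geometric criteria above (near the diagonal versus far from the origin) can be turned into a simple rule, sketched below. The text gives no numeric thresholds, so `unknown_above` and `diagonal_tol` are hypothetical parameters with illustrative values, and the label "other" for fragments that fit neither description is also an assumption.

```python
def classify_fragment(coords, unknown_above=0.9, diagonal_tol=0.1):
    """Label one fragment from its coordinates in the compressibility space.

    coords: compression ratios of the fragment under each document dictionary
            (fractions, where lower means better compression).
    """
    if min(coords) >= unknown_above:
        # No dictionary compresses the fragment: far from the origin.
        return "unknown"
    if max(coords) - min(coords) <= diagonal_tol:
        # Compressed to about the same level by every dictionary: near the diagonal.
        return "common"
    # Compresses well under some dictionaries only.
    return "other"


def common_fragments(mapped):
    """Keep the fragments labeled "common" from (fragment, coords) pairs."""
    return [frag for frag, coords in mapped if classify_fragment(coords) == "common"]


if __name__ == "__main__":
    sample = [("topic", (0.40, 0.45)), ("noise", (1.0, 1.0)), ("local", (0.40, 1.0))]
    print(common_fragments(sample))
```

Checking for "unknown" first matters because a fragment that no dictionary compresses also lies on the diagonal, only at the far end from the origin.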
Presentation of Relation Table. The relation table is
made by extracting common fragments. Consider the case
in which the relation is shown for three documents A, B,
and C. The extracted fragments are listed in the first
row of the table. From the second row on, 'O' or 'X'
indicates which documents each fragment appears in.
Fragments a and b in the second column of the table are
common fragments of documents A, B, and C. Fragments c
and d in the third column of the table are common
fragments only of documents A and B, etc. The third and
fourth columns are similar. Moreover, the 6th, 7th and
8th columns show the
ICAART 2009 - International Conference on Agents and Artificial Intelligence