DOCUMENTS AS A BAG OF MAXIMAL SUBSTRINGS
An Unsupervised Feature Extraction for Document Clustering
Tomonari Masada, Yuichiro Shibata and Kiyoshi Oguri
Graduate School of Engineering, Nagasaki University, 1-14 Bunkyo-machi, Nagasaki-shi, Nagasaki, 8528521, Japan
Keywords:
Maximal substrings, Document clustering, Suffix array, Bayesian modeling.
Abstract:
This paper provides experimental results showing how we can use maximal substrings as elementary features
in document clustering. We extract maximal substrings, i.e., the substrings each giving a smaller number
of occurrences even after adding only one character at its head or tail, from the given document set and
represent each document as a bag of maximal substrings after reducing the variety of maximal substrings by
a simple frequency-based selection. This extraction can be done in an unsupervised manner. Our experiment
aims to compare bag of maximal substrings representation with bag of words representation in document
clustering. For clustering documents, we utilize Dirichlet compound multinomials, a Bayesian version of
multinomial mixtures, and measure the results by F-score. Our experiment showed that maximal substrings
were as effective as words extracted by a dictionary-based morphological analysis for Korean documents. For
Chinese documents, maximal substrings were not as effective as words extracted by a supervised segmentation
based on conditional random fields. However, one fourth of the clustering results given by bag of maximal
substrings representation achieved F-scores better than the mean F-score given by bag of words representation.
It can be said that the use of maximal substrings achieved an acceptable performance in document clustering.
1 INTRODUCTION
Recently, researchers have proposed a wide variety of methods envisioning large scale data mining, where documents originating from SNS environments or DNA/RNA sequences provided by next-generation sequencing are typical targets of their proposals. Many of these methods adopt an unsupervised approach, because it is difficult to prepare a sufficient amount of hand-maintained training data for supervised learning when the test data is of very large scale.
This paper focuses on text mining, where we
can use various unsupervised methods, e.g. docu-
ment clustering (Nigam et al., 2000), topic extrac-
tion (Blei et al., 2003), topical trend analysis (Wang
and McCallum, 2006), etc, each based on an elabo-
rated document modeling. However, these unsuper-
vised methods assume that each document is repre-
sented as a bag of words, i.e., as an unordered col-
lection of words. Therefore, we should first extract
elementary features that can be called words.
For some languages, e.g. English, French, Ger-
man, etc, we can obtain words by a simple heuristic
called stemming. However, for many languages, e.g.
Japanese, Chinese, Korean, etc, it is far from a trivial
task to extract words. Japanese and Chinese sentences
contain no white spaces and thus give no boundaries
between the words. While Korean sentences contain
white spaces, each string separated by a white space
often consists of multiple words (Choi et al., 2009).
Many of the existing word extraction methods
require a well-maintained dictionary and/or a well-
trained data model of character sequences. Further,
such extraction methods often presuppose the avail-
ability of a sufficient amount of training data to which
supervised signals (e.g. 0/1 labels giving word bound-
aries, grammatical categories of words, etc) are as-
signed by hand. Therefore, any mining method built on such supervised feature extraction may have difficulty scaling up to larger datasets, even when the mining method itself is unsupervised.
This paper shows how we can use maximal sub-
strings (Okanohara and Tsujii, 2009) as elementary
features of documents. One important characteristic
of maximal substrings is that they can be obtained in
a totally unsupervised manner.
In this paper, we compare bag of maximal sub-
strings representation with bag of words representa-
tion in document clustering. To be precise, we com-
pare the quality of document clusters obtained by us-
ing maximal substrings as document features with the
quality of document clusters obtained by using words
as document features. We compute document clusters
by applying the same clustering method to the same
document set. In our comparison, only the document
representation is different.
As far as we know, this paper is the first one that
gives a quantitative comparison of bag of maximal
substrings representation with bag of words represen-
tation in document clustering. While Chumwatana et
al. conduct a similar experiment with respect to Thai
documents (Chumwatana et al., 2010), they fail to
give a reliable evaluation, because their datasets consist
of only tens of documents. Further, they do not com-
pare bag of maximal substrings representation with
bag of words representation.
We conducted document clustering on tens of
thousands of Korean and Chinese newswire articles.
To compare with maximal substrings, we extracted
words by applying a dictionary-based morphological
analyzer (Gang, 2009) to Korean documents and by
applying a word segmenter implemented by us based
on linear conditional random fields (CRF) (Sutton and
McCallum, 2007) to Chinese documents. The former
extraction method presupposes that we have a well-
maintained dictionary, and the latter presupposes that
we have a sufficient amount of training data.
Our experiment provides the following important observations:
- For Korean documents, maximal substrings are as effective as words extracted by the dictionary-based morphological analyzer.
- For Chinese documents, maximal substrings are not as effective as words extracted by the CRF-based word segmenter. However, the performance achieved with maximal substrings is acceptable.
The rest of the paper is organized as follows. Sec-
tion 2 gives previous works related to the extraction
of elementary features from documents. Section 3 de-
scribes how we use maximal substrings as elementary
features in Bayesian document clustering. Section 4
includes the procedure and the results of our evalua-
tion experiment. Section 5 concludes the paper with
discussions and future work.
2 PREVIOUS WORKS
Most text mining methods require word extraction
as a preprocessing of documents. We have a rela-
tively simple heuristic called stemming for English,
French, German, etc. However, it is far from a trivial
task to extract elementary linguistic features that can
be called words for many languages, e.g. Japanese,
Chinese, Korean, etc.
Word extraction can be conducted, for example,
by analyzing language-specific grammatical struc-
tures with a well-maintained dictionary (Gang, 2009),
or by labeling sequences with an elaborated prob-
abilistic model whose parameters are in advance
optimized with respect to hand-prepared training
data (Tseng et al., 2005). However, recent research
trends point to increasing need for large scale data
mining. Therefore, intensive use of supervised word
extraction becomes less realistic, because it is difficult
to prepare training data of sufficient size and quality.
Actually, we already have important results for
unsupervised feature extraction from documents.
Poon et al. (Poon et al., 2009) propose an unsu-
pervised word segmentation by using log-linear mod-
els, often adopted for supervised word segmentation,
in an unsupervised learning framework. However,
when computing the expected counts required in the learning process, the authors exhaustively enumerate all segmentation patterns. Consequently, this approach is only applicable to languages whose sentences are given as sequences of short strings separated by white spaces (e.g. Arabic and Hebrew), because the total number of segmentation patterns is not so large for each short string. That is, this approach will be extremely inefficient in execution time for languages whose sentences are given with no white spaces (e.g. Chinese and Japanese).
Mochihashi et al. (Mochihashi et al., 2009) pro-
vide a sophisticated Bayesian probabilistic model for
segmenting given sentences into words in a totally un-
supervised manner. The authors improve the genera-
tive model of Teh (Teh, 2006) and utilize it for mod-
eling both character n-grams and word n-grams. The
proposed model can cope with the data containing so-
called out-of-vocabulary words, because the genera-
tive model of character n-grams serves as a new word
generator for that of word n-grams. However, the highly complicated sampling procedure, which includes MCMC for the nested n-gram models and segmentation sampling by an extended forward-backward algorithm, may raise efficiency problems in implementation, even though the model is designed well enough to avoid any exhaustive enumeration of segmentation candidates.
Okanohara et al. (Okanohara and Tsujii, 2009)
propose an unsupervised method from a completely
different angle. The authors extract maximal sub-
strings, i.e., the substrings each giving a smaller
number of occurrences even after adding only one
character at its head or tail, as elementary features.
This extraction can be efficiently implemented based
on the works related to suffix array and Burrows-
Wheeler transform (Kasai et al., 2001; Abouelhoda
et al., 2002; Navarro and Makinen, 2007; Nong et al.,
2008). While Zhang et al. (Zhang and Lee, 2006) also
provide a method for extracting a special set of sub-
strings, this is not the set of maximal substrings. Fur-
ther, their method has many control parameters and
thus seems based on a more heuristic intuition when
compared with the extraction of maximal substrings.
In this paper, we adopt maximal substrings as
elementary features of documents by following the
line of Okanohara et al. and check their effectiveness in document clustering, because both previous works (Okanohara and Tsujii, 2009; Zhang and Lee, 2006) prove the effectiveness of extracted substrings only in document classification.
We can also find previous works using maximal-
ity of substrings for document clustering. Zhang et
al. (Zhang and Dong, 2004) present a Chinese docu-
ment clustering method by using maximal substrings
as elementary features. However, the authors give
no quantitative evaluation. Especially, maximal sub-
strings are not compared with elementary features ex-
tracted by an elaborated supervised method. While
Li et al. (Li et al., 2008) also propose a document
clustering based on the maximality of subsequences,
the authors focus not on character sequences, but on
word sequences. Further, the proposed method uti-
lizes WordNet, i.e., an external knowledge base, for
reducing the variety of maximal subsequences and
thus is not an unsupervised method.
In this paper, we would like to show what kind
of effectiveness maximal substrings can provide in
document clustering when we only use a frequency-
based selection method for reducing the variety of
substrings and use no external knowledge base.
3 DOCUMENT CLUSTERING
WITH MAXIMAL SUBSTRINGS
3.1 Extracting Maximal Substrings
A maximal substring is a substring whose number of occurrences is reduced by adding even one character to its head or tail. We discuss this more formally below.
We assume that we have a string S of length l(S)
over a lexicographically ordered character set Σ. In
addition, we assume that a special character $, called
sentinel, is attached at the tail of S, i.e., S[l(S)] = $.
The sentinel $ does not appear in the given original
string and is smaller than all other characters in lexi-
cographical order.
For a pair of strings S and T over Σ, we define Pos(S, T) = {i : S[i + j − 1] = T[j] for j = 1, ..., l(T)}, i.e., the set of all occurrence positions of T in S. We denote the nth smallest element of Pos(S, T) by pos_n(S, T). Further, we define Rel(S, T) = {(n, pos_n(S, T) − pos_1(S, T)) : n = 1, ..., |Pos(S, T)|}. Rel(S, T) is the set of all occurrence positions relative to the smallest occurrence position. Then, T is a maximal substring of S when the following conditions hold:
1. |Pos(S, T)| > 1;
2. Rel(S, T) ≠ Rel(S, T′) for any T′ such that l(T′) = l(T) + 1 and T[j] = T′[j] for j = 1, ..., l(T); and
3. Rel(S, T) ≠ Rel(S, T′) for any T′ such that l(T′) = l(T) + 1 and T[j] = T′[j + 1] for j = 1, ..., l(T).
The last condition corresponds to the "left expansion" discussed by Okanohara et al. (Okanohara and Tsujii, 2009).
When we extract maximal substrings from a doc-
ument set, we first concatenate all documents by in-
serting a special character, which does not appear in
the given document set, between the documents. The
concatenation order is irrelevant to our discussion.
We put a sentinel at the tail of the resulting string and
obtain a string S from which we extract maximal sub-
strings. We can efficiently extract all maximal sub-
strings from S in time proportional to l(S) (Okanohara
and Tsujii, 2009).
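As an illustration of this definition only, the following Python sketch concatenates documents with a separator character and a sentinel and then enumerates maximal substrings by brute force. The separator character and the length cap max_len are our own illustrative choices, and the sketch is nowhere near the linear-time suffix-array based extraction cited above.

```python
from collections import defaultdict

def maximal_substrings(docs, max_len=20):
    """Brute-force enumeration of maximal substrings (illustration only).

    Documents are concatenated with a separator character assumed absent from
    the documents, and a sentinel is appended.  Every substring of up to
    max_len characters is then tested against the definition: it must occur
    more than once, and no one-character extension to the right or to the
    left may leave its set of relative occurrence positions unchanged.
    """
    sep, sentinel = "\uE000", "\0"          # assumed not to occur in the documents
    s = sep.join(docs) + sentinel

    pos = defaultdict(list)                  # substring -> sorted occurrence positions
    for i in range(len(s)):
        for j in range(i + 1, min(i + max_len + 1, len(s)) + 1):
            pos[s[i:j]].append(i)
    pos = dict(pos)

    def rel(t):
        p = pos.get(t, [])
        return tuple(q - p[0] for q in p)

    result = []
    for t, p in pos.items():
        if len(p) < 2 or len(t) > max_len or sep in t or sentinel in t:
            continue
        right = {s[i:i + len(t) + 1] for i in p if i + len(t) < len(s)}   # right extensions
        left = {s[i - 1:i + len(t)] for i in p if i > 0}                  # left extensions
        if all(rel(t) != rel(u) for u in right | left):
            result.append(t)
    return result
```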
The number of different maximal substrings is in
general far larger than that of different words obtained
by morphological analysis or by word segmentation.
Therefore, we remove maximal substrings containing
special characters put between the documents. Fur-
ther, when the target language provides its sentences
with white spaces, delimiters (e.g. comma, period,
question mark, etc), and other functional characters
(e.g. parentheses, hyphen, center dot, etc), we remove
maximal substrings containing such characters.
Even after the above reduction, we still have a
large number of different maximal substrings. There-
fore, we propose a simple frequency-based strategy
for reducing the variety of maximal substrings. We
only use the following two parameters: the lowest and
the highest frequencies of maximal substrings.
To be precise, we remove all maximal substrings whose frequency is less than a threshold n_L and also remove all maximal substrings whose frequency is greater than a threshold n_H. We specify n_L directly as 10, 20, 50, etc. On the other hand, we specify n_H through the equation n_H = c_H × n_1, where n_1 is the frequency of the most frequent maximal substring and c_H is a real value. It is difficult to specify n_H directly, because an appropriate choice of n_H heavily depends on the size of the dataset. However, n_1 can be regarded as scaling with the size of the dataset. Therefore, we specify n_H by multiplying n_1 by the factor c_H.

Figure 1: Plot of the number of maximal substrings for each different frequency. The top panel (resp. bottom panel) shows the statistics of maximal substrings extracted from the Korean (resp. Chinese) newswire articles used in our evaluation experiment. For example, when we have 800 different maximal substrings each occurring 100 times in the document set, a marker is placed at (100, 800) in the chart.
Figure 1 shows how many different maximal sub-
strings we have at each frequency for the two datasets
used in our experiment. The top panel gives the statis-
tics of maximal substrings for the Korean newswire
article set, and the bottom panel gives the statistics
for the Chinese article set. The horizontal axis rep-
resents the frequency of maximal substrings, and the
vertical axis represents the number of different maxi-
mal substrings having the same frequency. For exam-
ple, when we have 800 different maximal substrings
each occurring at 100 positions in the document set, a
marker is placed at (100,800) in the chart. The data
plot range starts from five on the horizontal axis, because the maximal substrings with frequency less than five were noisy strings and were thus removed. We can observe that the distribution of the number of different maximal substrings roughly follows Zipf's law.
The two shaded areas of each chart in Figure 1 show the frequency intervals where the corresponding maximal substrings are removed by our selection method. In other words, the unshaded area of each chart ranges from n_L to n_H on the horizontal axis and thus shows the frequency interval where the corresponding maximal substrings are used in document clustering. Figure 1 displays the frequency interval giving the best evaluation result for each dataset. For the chart in the top panel, n_L is set to 100 and n_H to 97,879, where n_H is determined by setting c_H to 0.1 with respect to n_1 = 978,789. For the chart in the bottom panel, n_L is set to 20 and n_H to 53,051, where n_H is determined by setting c_H to 0.2 with respect to n_1 = 265,254. Each setting gave the best result, as we will discuss in Section 4.
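A minimal sketch of this frequency-based selection, assuming the frequencies of the extracted maximal substrings are available in a Python dictionary (the function name and the dictionary layout are our own illustrative choices):

```python
def select_by_frequency(freqs, n_low, c_high):
    """Keep the maximal substrings whose frequency f satisfies n_low <= f <= n_high,
    where n_high = c_high * n_1 and n_1 is the largest observed frequency.

    freqs: dict mapping each maximal substring to its frequency in the corpus.
    """
    n_1 = max(freqs.values())
    n_high = c_high * n_1
    return {t: f for t, f in freqs.items() if n_low <= f <= n_high}

# Setting used for the Korean dataset: n_L = 100, c_H = 0.1.
# selected = select_by_frequency(freqs, n_low=100, c_high=0.1)
```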
3.2 Bayesian Document Clustering
When documents are represented as a bag of elemen-
tary features, multinomial distribution (Nigam et al.,
2000) is a natural choice for document modeling, be-
cause we can represent each document as a frequency
histogram of elementary features.
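As a small illustration, assuming the vocabulary of elementary features (words or selected maximal substrings) has already been fixed, the frequency histograms can be collected into a count matrix of the kind used by the clustering model below; all names here are our own:

```python
import numpy as np

def count_matrix(docs_features, vocabulary):
    """Build the J x W matrix of counts c_jw from bags of elementary features.

    docs_features: list of J lists, each holding the features observed in one document
    vocabulary:    list of the W distinct features kept after selection
    """
    index = {f: w for w, f in enumerate(vocabulary)}
    counts = np.zeros((len(docs_features), len(vocabulary)), dtype=np.int64)
    for j, feats in enumerate(docs_features):
        for f in feats:
            if f in index:               # features outside the vocabulary are ignored
                counts[j, index[f]] += 1
    return counts
```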
However, it is often discussed from a Bayesian
view point that multinomial distributions are likely
to overfit to sparse data. The term “sparse” means
that the number of different features appearing in each
document is far less than the total number of differ-
ent features observable in the entire document set.
This tendency becomes more apparent when we use
maximal substrings as document features, because the
number of maximal substrings extracted from a docu-
ment set is in general far larger than that of words ex-
tracted by an elaborated word segmentation method.
Therefore, we use Dirichlet compound multino-
mials (DCM) (Madsen et al., 2005) as our document
model to avoid overfitting. We assume that the num-
ber of clusters is K. DCM has K multinomial distribu-
tions, each of which models a word frequency distri-
bution for a different document cluster. Further, DCM
applies a Dirichlet prior distribution to each of the K
multinomials. DCM can effectively avoid overfitting
with these K Dirichlet prior distributions.
Here we prepare the notation for the following discussion. We assume that the given document set contains J documents and that W different words (or maximal substrings) can be observed in the document set. Let c_jw be the number of occurrences of the wth word (or maximal substring) in the jth document. Let α_kw, k = 1, ..., K, w = 1, ..., W, be the hyperparameters of the Dirichlet priors. The posterior probability that the jth document belongs to the kth cluster is denoted by p_jk. Note that Σ_k p_jk = 1. We define α_k = Σ_w α_kw and c_j = Σ_w c_jw.
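For reference, the factor that drives the E step below is, up to a multinomial coefficient, the DCM marginal likelihood of the jth document under the kth cluster (Madsen et al., 2005), obtained by integrating the multinomial parameters out against the Dirichlet prior:

$$ p(\mathbf{c}_j \mid \boldsymbol{\alpha}_k) = \frac{c_j!}{\prod_w c_{jw}!} \cdot \frac{\Gamma(\alpha_k)}{\Gamma(c_j + \alpha_k)} \prod_w \frac{\Gamma(c_{jw} + \alpha_{kw})}{\Gamma(\alpha_{kw})}. $$

The leading multinomial coefficient does not depend on k and therefore cancels when p_jk is normalized over clusters.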
We update the cluster assignment probabilities and the hyperparameters of the Dirichlet priors with the EM algorithm described below.

E step: Update p_jk by
$$ p_{jk} \leftarrow \frac{\sum_{j'} p_{j'k}}{\sum_{j'} \sum_{k'} p_{j'k'}} \cdot \frac{\Gamma(\alpha_k)}{\Gamma(c_j + \alpha_k)} \prod_{w} \frac{\Gamma(c_{jw} + \alpha_{kw})}{\Gamma(\alpha_{kw})} $$
and then normalize p_jk by p_jk ← p_jk / Σ_k p_jk.

M step: Update α_kw by
$$ \alpha_{kw} \leftarrow \alpha_{kw} \cdot \frac{\sum_{j} p_{jk} \{ \Psi(c_{jw} + \alpha_{kw}) - \Psi(\alpha_{kw}) \}}{\sum_{j} p_{jk} \{ \Psi(c_{j} + \alpha_{k}) - \Psi(\alpha_{k}) \}} $$

where Γ(·) is the gamma function and Ψ(·) is the digamma function. The update formula for α_kw is based on Minka's discussion (Minka, 2000). We terminate the iteration of E and M steps when the log likelihood increases by less than 0.001%.
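A minimal Python sketch of one EM iteration under these updates, for dense count data; the array names and shapes are our own choices, and the initialization and termination details described in the surrounding text are omitted:

```python
import numpy as np
from scipy.special import gammaln, psi   # log-gamma and digamma functions

def em_step(counts, alpha, p):
    """One EM iteration for the DCM mixture.

    counts: (J, W) array of frequencies c_jw
    alpha:  (K, W) array of Dirichlet hyperparameters alpha_kw
    p:      (J, K) array of current cluster posteriors p_jk
    """
    c_j = counts.sum(axis=1)                                   # document lengths
    a_k = alpha.sum(axis=1)                                    # Dirichlet scales

    # E step: mixing weight times DCM likelihood, computed in log space.
    log_mix = np.log(p.sum(axis=0) / p.sum())                  # (K,)
    log_lik = (gammaln(a_k)[None, :]
               - gammaln(c_j[:, None] + a_k[None, :])
               + (gammaln(counts[:, None, :] + alpha[None, :, :])
                  - gammaln(alpha)[None, :, :]).sum(axis=2))   # (J, K)
    log_p = log_mix[None, :] + log_lik
    log_p -= log_p.max(axis=1, keepdims=True)                  # numerical stability
    p_new = np.exp(log_p)
    p_new /= p_new.sum(axis=1, keepdims=True)                  # normalize over clusters

    # M step: Minka's fixed-point update for alpha_kw.
    num = (p_new.T[:, :, None]
           * (psi(counts[None, :, :] + alpha[:, None, :])
              - psi(alpha)[:, None, :])).sum(axis=1)           # (K, W)
    den = (p_new.T
           * (psi(c_j[None, :] + a_k[:, None])
              - psi(a_k)[:, None])).sum(axis=1)                # (K,)
    alpha_new = alpha * num / den[:, None]
    return p_new, alpha_new
```

One would call em_step repeatedly, monitoring the log likelihood, until the 0.001% improvement threshold mentioned above is reached.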
Before entering the loop of E and M steps, we initialize every α_kw to 1, because this makes the Dirichlet priors uniform distributions. Further, we initialize p_jk not randomly but by the EM algorithm for multinomial mixtures (Nigam et al., 2000). In the EM for multinomial mixtures, we use a random initialization for p_jk. The execution of EM for multinomial mixtures is repeated 30 times, each time from a different random initialization. Each of the 30 executions gives a different estimation of p_jk. Therefore, we choose the estimation giving the largest likelihood as the initial setting of p_jk in the EM algorithm for DCM. We conduct this entire procedure three times. Then, among the three clustering results, we select the result giving the largest likelihood as the final output of our document clustering.
The time complexity of our EM algorithm is
O(IKM), where I is the number of iterations and M is
the number of different pairs of document and word.
In general, M is far smaller than J × W due to the
sparseness discussed above.
4 EVALUATION EXPERIMENT
4.1 Procedure
The following two document sets were used in our
evaluation experiment:
- One is the set of Korean newswire articles downloaded from the Web site of Seoul Newspaper (http://www.seoul.co.kr/). We denote this dataset as SEOUL. It consists of 52,730 articles whose dates range from January 1st, 2008 to September 30th, 2009. Each article belongs to one of the following four categories: Economy, Local issues, Politics, and Sports. Therefore, we set K = 4 in DCM.
- The other is the set of Chinese newswire articles downloaded from Xinhua Net (http://www.xinhuanet.com/). We denote this dataset as XINHUA. It consists of 20,127 articles whose dates range from May 8th to December 17th, 2009. All articles are written in simplified Chinese. Each article belongs to one of the following three categories: Economy, International, and Politics. Therefore, we set K = 3.
We regarded article categories as the ground truth
when we evaluated document clusters.
To compare with maximal substrings, we ex-
tracted words by applying KLT morphological ana-
lyzer (Gang, 2009) to Korean articles. On the other
hand, we applied a word segmenter, implemented by
using L1-regularized linear conditional random fields
(CRF) (Sutton and McCallum, 2007), to Chinese ar-
ticles. Our algorithm for parameter optimization in
training this Chinese word segmenter is based on
a stochastic gradient descent algorithm with exponen-
tial decay scheduling (Tsuruoka et al., 2009). This
segmenter achieved the following F-scores for the
four datasets of SIGHAN Bakeoff 2005 (Tseng et al.,
2005): 0.943 (AS), 0.941 (HK), 0.929 (PK) and
0.960 (MSR). In our experiment, we used the seg-
menter trained with MSR dataset, because this dataset
gave the highest F-score. For Korean, we could not find any training data comparable to the SIGHAN training data in size and quality. There-
fore, we used KLT for Korean word segmentation.
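For readers unfamiliar with CRF-based word segmentation, the sketch below shows the standard preprocessing that turns segmented training text into per-character labels and back; the BMES tag set and the helper names are our own illustrative choices, and the CRF training itself (our L1-regularized implementation trained with stochastic gradient descent) is not reproduced here.

```python
def to_bmes(words):
    """Convert a segmented sentence (a list of words) to per-character BMES labels."""
    chars, labels = [], []
    for w in words:
        chars.extend(w)
        if len(w) == 1:
            labels.append("S")                                   # single-character word
        else:
            labels.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])  # begin / middle / end
    return chars, labels

def from_bmes(chars, labels):
    """Recover a word segmentation from per-character BMES labels."""
    words, current = [], ""
    for c, tag in zip(chars, labels):
        current += c
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:                                                  # tolerate ill-formed tags
        words.append(current)
    return words
```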
The wall clock time required for extracting all
maximal substrings was only a few minutes for both
datasets on a PC equipped with Intel Core 2 Quad
9650 CPU. This wall clock time is comparable with
the time required for word extraction by our Chinese
segmenter, though the time required for training the
segmenter is not included. Further, it is much less
than the time required for Korean morphological anal-
ysis, because KLT achieves its excellence by dictio-
nary lookups. While KLT can provide part-of-speech
tags, they are not required for our experiment.
The running time of document clustering is pro-
portional to M, i.e., the number of different pairs
of document and word (or of document and maxi-
mal substring). When we extracted words by KLT
morphological analyzer from SEOUL dataset, M was
equal to 8,208,591 after removing the words whose
frequencies are less than five. When we used max-
imal substrings, M was 37,079,130 after removing
the maximal substrings whose frequencies are less
than five. Further, when we extracted words by the
CRF-based segmenter from XINHUA dataset, M was
3,244,859 after removing the words whose frequen-
cies are less than five. When we used maximal sub-
strings, M was 17,561,135 after removing the maxi-
mal substrings whose frequencies are less than five.
We evaluated the quality of clusters as follows:
1. We calculated precision and recall for each cluster. Precision is defined as #(true positive) / (#(true positive) + #(false positive)), and recall is defined as #(true positive) / (#(true positive) + #(false negative)), where "#" means the number;
2. We calculated the F-score as the harmonic mean of precision and recall for each cluster; and
3. The F-score was averaged over all clusters.
We used the resulting averaged F-score as our evaluation measure.
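A small sketch of this evaluation, assuming each cluster has already been mapped to the ground-truth category it is taken to represent (for example, its majority category); the function and argument names are our own:

```python
def average_f_score(cluster_ids, true_labels, cluster_to_label):
    """Average per-cluster F-score.

    cluster_ids:      predicted cluster id for each document
    true_labels:      ground-truth category for each document
    cluster_to_label: mapping from each cluster id to the category it is taken
                      to represent (e.g., the majority category of the cluster)
    """
    f_scores = []
    for k, label in cluster_to_label.items():
        tp = sum(1 for c, t in zip(cluster_ids, true_labels) if c == k and t == label)
        fp = sum(1 for c, t in zip(cluster_ids, true_labels) if c == k and t != label)
        fn = sum(1 for c, t in zip(cluster_ids, true_labels) if c != k and t == label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f_scores.append(f)
    return sum(f_scores) / len(f_scores)
```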
We executed document clustering 64 times for each setting of n_L and n_H. Consequently, we obtained 64 F-scores for each setting. The effectiveness of each setting of n_L and n_H was represented by the mean and the standard deviation of these 64 F-scores.
We reduced the variety of maximal substrings as follows. We removed the maximal substrings whose frequency was less than n_L. We tested the following six settings for n_L: 10, 20, 50, 100, 200, and 500. We present the evaluation result for each setting of n_L in Table 1, where the mean and the standard deviation are computed over the 64 F-scores for each case.
With respect to the best setting of n_L, we tested the following five settings for c_H to specify n_H: 0.2, 0.1, 0.05, 0.02, and 0.01. For example, when c_H was set to 0.02, n_H was set to 0.02 × n_1. The result for each setting of c_H is given in Table 2, also with the mean and the corresponding standard deviation.
The same reduction method was applied not only
to maximal substrings but also to words extracted by
the morphological analyzer from SEOUL dataset and
to words extracted by our segmenter from XINHUA
dataset. The evaluation results for these supervised
word extraction methods are also given in Table 1 and
Table 2.
4.2 Analysis of Results
We can analyze the results presented in Table 1 and
Table 2 as follows.
In Table 1, the top panel gives the results for SEOUL dataset, and the bottom panel for XINHUA dataset. The column labeled "n_L" includes the settings for n_L, and the column labeled "# words" includes the number of words or the number of maximal substrings remaining after removing low frequency features. We regard the boldfaced cases as the best settings for n_L, because each of these cases leads to an F-score larger than the other settings. We can obtain the following observations from Table 1:
- For SEOUL dataset, the F-scores achieved with maximal substrings were quite close to those achieved with words extracted by the morphological analyzer. It can be said that these results prove the effectiveness of maximal substrings.
- For XINHUA dataset, maximal substrings gave weaker results than words obtained by the CRF-based segmenter. Further, bag of maximal substrings representation led to larger standard deviations of F-scores. These large standard deviations suggest that there is room for improvement in the selection of maximal substrings or in the clustering method, so as to utilize maximal substrings more effectively.
- By removing more low frequency words or maximal substrings, a smaller standard deviation was achieved. A large standard deviation means that the quality of document clusters differs considerably from trial to trial. Therefore, it is desirable to remove as many low frequency words as possible. However, removing too many low frequency words resulted in a large drop in F-scores. Table 1 shows that the range n_L ≤ 200 is recommended.
- We may expect that low frequency words are sharply related to a specific topic and thus have discriminative power. However, in our case, using more low frequency words led to larger standard deviations. This may indicate that most low frequency words misled the clustering process, perhaps by occasionally focusing on a minor aspect of the document content.
We further check whether we can achieve an improvement by also reducing high frequency features. Especially for XINHUA dataset, we could not obtain any good F-scores with the reduction of low frequency maximal substrings alone. Therefore, we next discuss the results presented in Table 2.
In Table 2, the top panel gives the results for SEOUL dataset, and the bottom panel for XINHUA dataset. Each of the boldfaced cases corresponds to the best F-score among the various settings of c_H. For ease of comparison, Table 2 also presents the best cases from Table 1 in the rows that have the value 1.0 in the column labeled c_H.
Table 1: Evaluation of clusters obtained after removing only low frequency features.

SEOUL (52,730 docs, 4 clusters)
                            n_L    # words    F-score
  Maximal Substrings         10    186,032    0.826±0.049
                             20    126,619    0.854±0.036
                             50     72,104    0.867±0.029
                            100     45,360    0.872±0.004
                            200     26,923    0.869±0.002
                            500     12,590    0.856±0.001
  Morphological Analysis     10     61,416    0.859±0.043
                             20     37,800    0.883±0.025
                             50     20,068    0.892±0.002
                            100     12,411    0.890±0.001
                            200      7,620    0.886±0.000
                            500      3,798    0.879±0.002

XINHUA (20,127 docs, 3 clusters)
                            n_L    # words    F-score
  Maximal Substrings         10    285,438    0.650±0.051
                             20    140,690    0.690±0.051
                             50     53,239    0.672±0.039
                            100     25,344    0.649±0.014
                            200     12,351    0.642±0.002
                            500      5,049    0.619±0.002
  Word Segmentation          10     23,234    0.753±0.021
                             20     15,347    0.750±0.021
                             50      8,783    0.755±0.012
                            100      5,709    0.760±0.007
                            200      3,596    0.762±0.007
                            500      1,752    0.741±0.011
The case c_H = 1.0 corresponds to the case where we reduce no high frequency features, i.e., the case considered in Table 1.
Table 2 provides the following observations:
- For bag of words representation, the reduction of high frequency features did not lead to any improvement on either SEOUL dataset or XINHUA dataset. The reduction of high frequency features merely worked as a reduction of the working space for document clustering.
- For bag of maximal substrings representation, the reduction of high frequency features led to a small improvement. However, the improvement was not statistically significant. Therefore, also for maximal substrings, the reduction only worked as a reduction of the working space. Of course, the same observation can be restated as follows: we could reduce the number of elementary features without harming cluster quality.
Table 2: Evaluation of clusters obtained after removing both high and low frequency features.

SEOUL (52,730 docs, 4 clusters)
                                       c_H     # words    F-score
  Maximal Substrings (n_L = 100)       1.0      45,360    0.872±0.004
                                       0.2      45,320    0.873±0.008
                                       0.1      45,260    0.875±0.007
                                       0.05     45,159    0.875±0.008
                                       0.02     44,918    0.873±0.011
                                       0.01     44,589    0.874±0.011
  Morphological Analysis (n_L = 50)    1.0      20,068    0.892±0.002
                                       0.2      20,056    0.890±0.007
                                       0.1      20,040    0.890±0.008
                                       0.05     19,975    0.888±0.009
                                       0.02     19,721    0.887±0.009
                                       0.01     19,179    0.887±0.009

XINHUA (20,127 docs, 3 clusters)
                                       c_H     # words    F-score
  Maximal Substrings (n_L = 20)        1.0     140,690    0.690±0.051
                                       0.2     140,672    0.701±0.059
                                       0.1     140,611    0.693±0.056
                                       0.05    140,466    0.696±0.051
                                       0.02    140,069    0.685±0.051
                                       0.01    139,515    0.698±0.047
  Word Segmentation (n_L = 200)        1.0       3,596    0.762±0.007
                                       0.2       3,593    0.762±0.001
                                       0.1       3,587    0.762±0.001
                                       0.05      3,559    0.762±0.001
                                       0.02      3,468    0.718±0.000
                                       0.01      3,261    0.754±0.001
- For XINHUA dataset, even after the reduction of high frequency features, bag of maximal substrings representation provided weaker results than bag of words representation. This means that we should use a supervised segmenter for Chinese documents as long as we can prepare a sufficient amount of training data.
We can conclude that maximal substrings are
as effective as words extracted by the dictionary-
based morphological analysis for Korean documents.
This may be partly because Korean sentences con-
tain white spaces and thus maximal substring extrac-
tion works as a fine improvement of this intrinsic seg-
mentation. In contrast, for Chinese documents, we
need some more sophistication to make maximal sub-
strings equally effective.
However, we think that the difference in effectiveness between bag of words representation and bag of maximal substrings representation is not so large as to make the latter inapplicable to document clustering.
Figure 2: Distribution of F-scores achieved with maximal substrings for XINHUA dataset. This histogram shows the 64 F-scores obtained by setting n_L = 20 and c_H = 0.2. F-scores are rounded off to two decimal places. While the mean of all 64 F-scores is 0.701, we have 16 F-scores (the black part of the histogram) that are larger than the best mean F-score of 0.762 achieved with CRF-based word segmentation (cf. the bottom panel of Table 2).
Even with respect to XINHUA dataset, the cluster quality obtained with bag of maximal substrings representation is acceptable, as we discuss below.

When we set n_L = 200 and c_H = 0.2 and reduce the variety of words extracted by our CRF-based segmenter from XINHUA dataset, the best mean F-score of 0.762 was achieved (cf. Table 2). However, when we set n_L = 20 and c_H = 0.2 and reduce the variety of maximal substrings, we obtained F-scores larger than 0.762 for 16 of the 64 clustering results. That is, 25% of the clustering results showed a quality better than the mean quality of the document clusters given by bag of words representation.
Figure 2 presents the detailed distribution of the F-scores achieved with bag of maximal substrings representation when we set n_L = 20 and c_H = 0.2 for XINHUA dataset. In this histogram, F-scores are rounded off to two decimal places. The black part of the histogram corresponds to the 16 F-scores that were larger than the best mean F-score of 0.762 achieved with bag of words representation.
In addition, for the case where we achieved the best mean F-score of 0.762 with bag of words representation, all 64 F-scores fell within the interval [0.755, 0.765). Further, the best three F-scores were 0.764, 0.764, and 0.763. In contrast, the best three F-scores were 0.782, 0.781, and 0.780 when we used maximal substrings as elementary features and set n_L = 20 and c_H = 0.2.
Therefore, it is promising future work to introduce some sophistication into the selection of maximal substrings and also into the document clustering method so as to avoid occasionally producing extremely poor clustering results.
5 CONCLUSIONS
As text data originating from SNS environments come to show a wider divergence in writing style and vocabulary, the unsupervised extraction of elementary features becomes more important than before as a preprocessing step for various text mining techniques. Therefore, in this paper, we provide experimental results comparing bag of maximal substrings representation with bag of words representation, because maximal substrings can be extracted in a totally unsupervised manner.
Our results showed that maximal substrings were not as effective as words extracted by the elaborated supervised method for Chinese documents, though we could obtain impressive results for Korean documents. We need a more sophisticated selection method to obtain a special subset of maximal substrings for Chinese documents. We should also improve the document model for clustering so that it captures the frequency statistics of maximal substrings more cleverly than DCM. We think that these future works are worth pursuing for the reasons discussed in the latter part of the previous section.
Further, if we use larger datasets, we can expect that more reliable statistics of maximal substrings will be obtained and thus that more convincing evaluation results will be provided. It is an important future work to acquire a more realistic insight with respect to the trade-off between the following two types of cost:
- the cost to improve the effectiveness of maximal substrings with a more elaborated selection method and/or with a more elaborated clustering method; and
- the cost to prepare training data for supervised feature extraction and then to train a segmentation model by using the data.
We also have a plan to conduct experiments where
we use maximal substrings as elementary features for
a multi-topic analysis based on latent Dirichlet alloca-
tion (Blei et al., 2003) for text data or for DNA/RNA
sequence data (Chen et al., 2010).
ACKNOWLEDGEMENTS
This work was supported in part by Japan Society
for the Promotion of Science (JSPS) Grant-in-Aid
for Young Scientists (B) 60413928 and also by Na-
gasaki University Strategy for Fostering Young Scien-
tists with funding provided by Special Coordination
Funds for Promoting Science and Technology of the
Ministry of Education, Culture, Sports, Science and
Technology (MEXT).
REFERENCES
Abouelhoda, M., Ohlebusch, E., and Kurtz, S. (2002). Op-
timal exact string matching based on suffix arrays.
In SPIRE’02, the Ninth International Symposium on
String Processing and Information Retrieval, pages
31–43.
Blei, D., Ng, A., and Jordan, M. (2003). Latent Dirichlet
allocation. Journal of Machine Learning Research,
3:993–1022.
Chen, X., Hu, X., Shen, X., and Rosen, G. (2010). Prob-
abilistic topic modeling for genomic data interpreta-
tion. In BIBM’10, IEEE International Conference on
Bioinformatics & Biomedicine, pages 18–21.
Choi, K., Isahara, H., Kanzaki, K., Kim, H., Pak, S., and
Sun, M. (2009). Word segmentation standard in Chi-
nese, Japanese and Korean. In the 7th Workshop on
Asian Language Resources, pages 179–186.
Chumwatana, T., Wong, K., and Xie, H. (2010). A SOM-
based document clustering using frequent max sub-
strings for non-segmented texts. Journal of Intelligent
Learning Systems & Applications, 2:117–125.
Gang, S. (2009). Korean morphological analyzer KLT ver-
sion 2.10b. http://nlp.kookmin.ac.kr/HAM/kor/.
Kasai, T., Lee, G., Arimura, H., Arikawa, S., and Park, K.
(2001). Linear-time longest-common-prefix computa-
tion in suffix arrays and its applications. In CPM’01,
the 12th Annual Symposium on Combinatorial Pattern
Matching, pages 181–192.
Li, Y., Chung, S., and Holt, J. (2008). Text document
clustering based on frequent word meaning sequences.
Data & Knowledge Engineering, 64:381–404.
Madsen, R., Kauchak, D., and Elkan, C. (2005). Model-
ing word burstiness using the Dirichlet distribution. In
ICML’05, the 22nd International Conference on Ma-
chine Learning, pages 545–552.
Minka, T. (2000). Estimating a Dirichlet dis-
tribution. http://research.microsoft.com/en-
us/um/people/minka/papers/dirichlet/.
Mochihashi, D., Yamada, T., and Ueda, N. (2009). Bayesian
unsupervised word segmentation with nested Pitman-
Yor language modeling. In ACL/IJCNLP’09, Joint
Conference of the 47th Annual Meeting of the Asso-
ciation for Computational Linguistics and the Fourth
International Joint Conference on Natural Language
Processing of the Asian Federation of Natural Lan-
guage Processing, pages 100–108.
Navarro, G. and Makinen, V. (2007). Compressed full-text
indexes. ACM Computing Surveys (CSUR), 39(1).
Nigam, K., McCallum, A., Thrun, S., and Mitchell, T.
(2000). Text classification from labeled and un-
labeled documents using EM. Machine Learning,
39(2/3):103–134.
Nong, G., Zhang, S., and Chan, W. (2008). Two efficient
algorithms for linear time suffix array construction.
http://doi.ieeecomputersociety.org/10.1109/TC.2010.188.
Okanohara, D. and Tsujii, J. (2009). Text categorization
with all substring features. In SDM’09, 2009 SIAM
International Conference on Data Mining, pages 838–846.
Poon, H., Cherry, C., and Toutanova, K. (2009). Unsu-
pervised morphological segmentation with log-linear
models. In NAACL/HLT’09, North American Chapter
of the Association for Computational Linguistics - Hu-
man Language Technologies 2009 Conference, pages
209–217.
Sutton, C. and McCallum, A. (2007). An introduction to
conditional random fields for relational learning. In
Introduction to Statistical Relational Learning, pages
93–128.
Teh, Y. (2006). A hierarchical Bayesian language model
based on Pitman-Yor processes. In COLING/ACL’06,
Joint Conference of the International Committee on
Computational Linguistics and the Association for
Computational Linguistics, pages 985–992.
Tseng, H., Chang, P., Andrew, G., Jurafsky, D., and Man-
ning, C. (2005). A conditional random field word
segmenter for SIGHAN bakeoff 2005. In the Fourth
SIGHAN Workshop, pages 168–171.
Tsuruoka, Y., Tsujii, J., and Ananiadou, S. (2009). Stochas-
tic gradient descent training for L1-regularized
log-linear models with cumulative penalty. In
ACL/IJCNLP’09, Joint Conference of the 47th Annual
Meeting of the Association for Computational Lin-
guistics and the fourth International Joint Conference
on Natural Language Processing of the Asian Federa-
tion of Natural Language Processing, pages 477–485.
Wang, X. and McCallum, A. (2006). Topics over time: a
non-Markov continuous-time model of topical trends.
In KDD’06, the 12th ACM SIGKDD International
Conference on Knowledge Discovery and Data Min-
ing, pages 424–433.
Zhang, D. and Dong, Y. (2004). Semantic, hierarchical,
online clustering of Web search results. In APWeb’04,
the Sixth Asia Pacific Web Conference, pages 69–78.
Zhang, D. and Lee, W. (2006). Extracting key-substring-
group features for text classification. In KDD’06, the
Twelfth ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, pages 474–
483.