overview of related work and an explanation of the method we propose. In Sections 4 and 5, we detail our experimental procedure using news articles and discuss the results. Lastly, we present our conclusions and future work.
2 RELATED WORK
Traditionally, various methods have been applied to extract topics from a document set. In this paper, we focus in particular on methods based on the vector space model. The vector space model (Salton83) represents a document as a column vector whose elements are the weights of index words. The Euclidean distance, the cosine, and so on are used as similarity measures. One of the popular methods for extracting topics from a document set is clustering (Yang99). After the documents are clustered, the centroid of each cluster is regarded as a topic.
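As a minimal sketch of this centroid-based view (not the method proposed in this paper), the following Python example measures the cosine similarity between two document vectors and then clusters a toy term-document matrix with k-means, treating each cluster centroid as a topic vector. The toy weights and the number of clusters are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: rows = documents, columns = index words.
# The values stand in for tf-idf weights and are purely illustrative.
X = np.array([
    [0.9, 0.1, 0.0, 0.0],
    [0.8, 0.2, 0.1, 0.0],
    [0.0, 0.0, 0.7, 0.9],
    [0.1, 0.0, 0.8, 0.6],
])

# Cosine similarity between the first two document vectors.
cos = X[0] @ X[1] / (np.linalg.norm(X[0]) * np.linalg.norm(X[1]))
print(f"cosine(d1, d2) = {cos:.3f}")

# Cluster the documents; each centroid is read as a topic vector.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
for k, centroid in enumerate(kmeans.cluster_centers_):
    print(f"topic {k}: top word index = {centroid.argmax()}")
```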
Recently, methods based on the factorization of a term-document matrix have attracted attention for topic extraction. T. Kolenda first reported that Independent Component Analysis (ICA) (Hyvarinen00) could be applied to a term-document matrix so that its independent components represent topics (Kolenda00). E. Bingham extracted topics from dynamic textual data such as chat lines with the ICA (Bingham03). In addition, we confirmed that the ICA can extract topics from documents and proposed its application to information filtering (Yokoi08). However, an independent component can have negative elements, which makes it difficult to interpret its values directly as term weights.
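The sketch below illustrates this point with scikit-learn's FastICA applied to a small random matrix standing in for a term-document matrix. The orientation (documents as samples, terms as features), the library, and the random data are assumptions for the example and do not reproduce the setup of (Kolenda00) or (Yokoi08); the point is only that the extracted components generally contain negative entries.

```python
import numpy as np
from sklearn.decomposition import FastICA

# Random non-negative matrix standing in for a term-document matrix D (V x n).
rng = np.random.default_rng(0)
D = rng.random((50, 20))

# Apply ICA with documents as samples and terms as features.
ica = FastICA(n_components=3, random_state=0)
ica.fit(D.T)

# Each row of components_ is one independent component over the vocabulary.
# Unlike NMF bases, these components typically contain negative elements,
# which is why they are hard to read directly as term weights.
print((ica.components_ < 0).any())  # typically True
```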
The NMF (Hoyer04) has also been applied to textual data, and the column vectors of the basis matrix, which is one of the matrices produced by the factorization, were reported to represent the topics in a document set. The NMF factorizes a non-negative matrix into two non-negative matrices, so each element of a column vector in the basis matrix directly corresponds to a term weight. As one application of the NMF to textual data, Xu et al. proposed using the bases for text clustering (Xu03). In addition, modified NMF methods have recently attracted attention (Berry07). Among these methods, we focus in particular on the NMF that imposes sparseness on one of the factorized matrices. Moreover, the conventional reports on applying the NMF to documents targeted a single, static document set. However, when the document set grows very large, it becomes difficult to apply the NMF to it.
Our proposed method sequentially combines the topics, building on the conventional reports that the NMF can extract topics from a document set.
3 TOPIC COMBINATION
In this section, the document vector, the SNMF/L for documents, and the combination of topics are explained.
3.1 Document Vector
A document is represented as a vector in the vector space model (Salton83), and this vector is called a document vector. A document vector is a column vector whose elements are the weights of the words in a document set. The $i$th document vector $d_i$ is defined as:
$$ d_i = \left[\, w_{i1} \;\; w_{i2} \;\; \cdots \;\; w_{iV} \,\right]^{T} \qquad (1) $$
where $w_{ij}$ signifies the weight of the $j$th word in the $i$th document, $V$ signifies the number of words, and $[\cdot]^{T}$ signifies the transposition. In this paper, $w_{ij}$ is established by the tf-idf method and calculated as:

$$ w_{ij} = \mathrm{tf}_{ij} \log\!\left(\frac{n}{\mathrm{df}_{j}}\right) \qquad (2) $$
where $\mathrm{tf}_{ij}$ denotes the frequency of the $j$th word in the $i$th document, $\mathrm{df}_{j}$ denotes the number of documents including the $j$th word, and $n$ denotes the number of documents. The tf-idf method regards the words that appear frequently in a few documents as the characteristic features of the document. In addition, the $n$ document vectors are denoted as $d_1, d_2, \cdots, d_n$, and the term-document matrix $D$ is defined as follows:

$$ D = \left[\, d_1 \;\; d_2 \;\; \cdots \;\; d_n \,\right]. \qquad (3) $$
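As a concrete illustration of Eqs. (1)-(3), the following Python sketch builds tf-idf document vectors and stacks them into a term-document matrix $D$. The toy corpus, the whitespace tokenization, and the use of the natural logarithm in Eq. (2) are assumptions made for the example, not details fixed by the paper.

```python
import math
from collections import Counter

import numpy as np

# Toy corpus; each document is a list of tokens (tokenization is assumed).
docs = [
    "stock market price rises".split(),
    "stock price falls sharply".split(),
    "team wins the final match".split(),
]
n = len(docs)

# Vocabulary: the V index words of the document set.
vocab = sorted({w for d in docs for w in d})
df = {w: sum(1 for d in docs if w in d) for w in vocab}  # document frequency df_j

def doc_vector(tokens):
    """tf-idf document vector d_i, following Eq. (2): w_ij = tf_ij * log(n / df_j)."""
    tf = Counter(tokens)
    return np.array([tf[w] * math.log(n / df[w]) for w in vocab])

# Term-document matrix D = [d_1 d_2 ... d_n]: one column per document, as in Eq. (3).
D = np.column_stack([doc_vector(d) for d in docs])
print(D.shape)  # (V, n)
```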
3.2 SNMF/L for Documents
The SNMF/L is one of the sparse NMF algorithms and can control the degree of sparseness in the basis matrix. The NMF approximately factorizes a matrix whose components are all non-negative into two matrices whose components are also non-negative. When the NMF is applied to a document set, it has been reported that the bases represent the topics included in the document set. By using the SNMF/L in our proposal, the keywords of the topics are highlighted, since only some of the words in each basis receive non-zero weights.

The NMF approximately factorizes a matrix into two matrices as follows:

$$ D = WH \qquad (4) $$
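To make the factorization in Eq. (4) concrete, the sketch below implements plain NMF with the standard multiplicative update rules for the Frobenius norm. It is only an illustrative stand-in: it does not include the sparseness constraint on the basis matrix $W$ that distinguishes the SNMF/L. The rank, the number of iterations, and the random initialization are assumptions for the example.

```python
import numpy as np

def nmf(D, r, n_iter=200, eps=1e-9, seed=0):
    """Approximate D (V x n, non-negative) as W @ H with W (V x r) and H (r x n)
    using the standard multiplicative updates for the Frobenius norm.
    Note: this is plain NMF; the SNMF/L additionally imposes sparseness on W."""
    rng = np.random.default_rng(seed)
    V, n = D.shape
    W = rng.random((V, r))
    H = rng.random((r, n))
    for _ in range(n_iter):
        H *= (W.T @ D) / (W.T @ W @ H + eps)
        W *= (D @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Each column of the basis matrix W can then be read as a topic: its largest
# elements are the term weights of the topic's keywords. Example usage with
# the term-document matrix D built in Section 3.1 (hypothetical call):
# W, H = nmf(D, r=2)
# top_words = W[:, 0].argsort()[::-1][:5]  # indices of the top-5 words of topic 0
```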