[Figure 1 is a tree diagram; its nodes include Computer/Network, Computer, Network, Computer parts, External equipment, Network equipment, Software, CPU, Mainboard, Display, HUB, and Graphics manipulation.]
Figure 1: The tree structure of the computer and network.
The $D$ of SVD (singular value decomposition) is defined as $D = U\Lambda V^T$, with $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$, where the elements of $\Lambda$ are the singular values of $D$. Let $n = \min\{r, m\}$; the singular values satisfy $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_n \ge 0$. $U$ and $V$ are $r \times r$ and $m \times m$ matrices, respectively. After processing by the SVD, $D = U\Lambda V^T$ simplifies to $D_k = U_k \Lambda_k V_k^T$. The dimensions of $U_k$, $\Lambda_k$, and $V_k^T$ are reduced to $r \times k$, $k \times k$, and $k \times m$, respectively. The common dimension $k$ is smaller than that of the original vector space. $\Lambda_k$ retains the $k$ largest singular values of the term-document matrix; $U_k$ holds the document vectors, and $V_k^T$ holds the term vectors. LSA theory not only eliminates disadvantageous factors and extracts the common semantic relations between terms and documents, but also decreases the dimension of the vectors through the singular value decomposition.
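As a concrete illustration, the truncated SVD step can be sketched in a few lines of NumPy. The toy matrix, the choice of k, and the variable names below are our own illustrative assumptions, not the paper's implementation:

```python
import numpy as np

# Illustrative term-document matrix D (r terms x m documents); a real
# LSA system would build this from the term weights of Section 3.3.
D = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0],
              [3.0, 0.0, 0.0],
              [0.0, 2.0, 1.0]])

k = 2  # number of retained singular values, k < min(r, m)

# Full SVD: D = U @ diag(lam) @ Vt, with lam sorted in descending order.
U, lam, Vt = np.linalg.svd(D, full_matrices=False)

# Truncate to the k largest singular values:
# U_k is r x k, Lam_k is k x k, Vt_k is k x m.
U_k = U[:, :k]
Lam_k = np.diag(lam[:k])
Vt_k = Vt[:k, :]

# D_k = U_k @ Lam_k @ Vt_k is the rank-k approximation of D.
D_k = U_k @ Lam_k @ Vt_k
print(np.round(D_k, 2))
```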
3 MULTI-LEVEL TEXT CLASSIFICATION BASED ON LSA
3.1 Classification Tree
We can construct a classification system according to the relationships among the terms. In this paper, we use a tree structure to illustrate the classification system. Figure 1 shows the tree structure of the computer and network domain.
In Figure 1, software can be further divided into many kinds, such as game software, educational software, and application software. All the nodes in the same layer share some similarity; for example, the CPU, mainboard, and CD-ROM are components of a computer. Our purpose is thus to divide web texts as exactly as possible into the sub-categories, which are the nodes of the category tree.
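A minimal sketch of such a category tree as a nested dictionary follows. The node names are taken from Figure 1, but the exact hierarchy and the helper function are illustrative assumptions, since the figure itself is not fully recoverable here:

```python
# Category tree in the spirit of Figure 1, modeled as nested dicts.
# Empty dicts mark leaf nodes, i.e., the final sub-categories.
category_tree = {
    "Computer/Network": {
        "Computer": {
            "Computer parts": {"CPU": {}, "Mainboard": {}, "CD-ROM": {}},
            "External equipment": {"Display": {}},
            "Software": {
                "Game software": {},
                "Educational software": {},
                "Application software": {},
            },
        },
        "Network": {
            "Network equipment": {"HUB": {}},
        },
    },
}

def leaves(tree, path=()):
    """Yield the path to every leaf node (every final sub-category)."""
    for name, children in tree.items():
        if children:
            yield from leaves(children, path + (name,))
        else:
            yield path + (name,)

for leaf in leaves(category_tree):
    print(" > ".join(leaf))
```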
3.2 Training Text and Term Selection
We first convert each web file into a text file. Then we assign the terms to the leaf nodes, respectively; all these terms make up our training data. We use a part-of-speech tagging program to partition the training text according to parts of speech. To reduce the dimension, we further remove from the training text those words that normally contribute little, such as empty (function) words. The training text therefore contains only nouns, verbs, adjectives, and adverbs. Verbs can be divided into three categories: relation verbs, state verbs, and action verbs; we finally delete the relation verbs and the state verbs. The final terms are thus established.
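A hedged sketch of this filtering step, using NLTK's English part-of-speech tagger, is given below. The paper does not name a tagger, and the tag prefixes, the RELATION_STATE_VERBS list, and the function name are all illustrative assumptions:

```python
import nltk  # assumes the 'punkt' and 'averaged_perceptron_tagger' data are installed

# Keep only nouns (NN*), verbs (VB*), adjectives (JJ*), and adverbs (RB*).
KEPT_PREFIXES = ("NN", "VB", "JJ", "RB")

# Illustrative stand-in for the paper's relation verbs and state verbs.
RELATION_STATE_VERBS = {"be", "is", "are", "was", "were", "seem", "have", "has"}

def select_terms(text):
    """Tokenize, POS-tag, and keep content terms as described in Section 3.2."""
    tokens = nltk.word_tokenize(text)
    terms = []
    for word, tag in nltk.pos_tag(tokens):
        if not tag.startswith(KEPT_PREFIXES):
            continue  # drops function ("empty") words, punctuation, etc.
        if tag.startswith("VB") and word.lower() in RELATION_STATE_VERBS:
            continue  # drops relation verbs and state verbs
        terms.append(word.lower())
    return terms

print(select_terms("The CPU is an essential component of every modern computer."))
```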
3.3 Term Weight
The vector space model (VSM) is used in traditional text classification. The elements of the original vectors are 0 or 1: if a document contains a term, the element in the corresponding position is 1; otherwise it is 0. This binary VSM cannot indicate how important a term is within a document, so term frequency is used in place of 0 or 1. The absolute term frequency is the number of occurrences of the word in the document, while the relative term frequency is the normalized term frequency, often determined by the term frequency-inverse document frequency (TF-IDF) formula. Several TF-IDF formulas exist in the literature; one popular formula is as follows:
$$W(t, d) = \frac{tf(t, d) \times \log(N/n_t + 0.01)}{\sqrt{\sum_{t \in d} \left[ tf(t, d) \times \log(N/n_t + 0.01) \right]^2}}$$

where $tf(t, d)$ is the frequency of term $t$ in document $d$, $N$ is the total number of training documents, and $n_t$ is the number of documents that contain $t$.
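A small sketch of this weighting in plain Python follows; the toy corpus, the `weights` helper, and the variable names are illustrative assumptions, and we take tf(t, d) to be the absolute term frequency as defined above:

```python
import math
from collections import Counter

# Toy corpus: each document is a list of selected terms (see Section 3.2).
docs = [["cpu", "mainboard", "cpu"],
        ["software", "game", "software"],
        ["cpu", "software", "network"]]

N = len(docs)  # total number of training documents
# df[t] = n_t: number of documents that contain term t.
df = Counter(t for d in docs for t in set(d))

def weights(doc):
    """W(t,d) = tf*log(N/n_t + 0.01), normalized by the vector's L2 norm."""
    tf = Counter(doc)
    raw = {t: f * math.log(N / df[t] + 0.01) for t, f in tf.items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}

for d in docs:
    print(weights(d))
```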