document and extract some characteristics and rules
that can effectively distinguish between the
categories. These features and rules are then used
by the classifier. In the classification process, the
input text is assigned a category by the trained
classifier (Yang Wen-chuan, 2013). Fig. 1 shows the
complete text classification process.
Figure 1: The basic process flow of text classification.
Text segmentation is the basis of the Chinese text
classification process: it directly affects feature
selection and thus the classification results.
Personal profiles contain professional terms that
are essential features for field recognition. To
exploit these features, we propose a hybrid
segmentation method based on a field Trie tree
dictionary and an HMM model, which can improve
the recognition rate of professional terms in the text.
3 FIELD DICTIONARY
BUILDING METHOD BASED ON
THE PAPER LIBRARY
WANFANG
To construct a specialized field dictionary, we
must first obtain the professional terms of that field.
This paper presents a growing method based on
word seeds. The method mainly depends on the
keyword search capability provided by the paper
database WanFang (http://www.wanfangdata.com.cn).
First, a few lists of professional terms are built
manually as seed lists. It is best to choose terms
that most clearly distinguish the target field from
other fields. In this research we focus on the
economic field, so we choose "Macroeconomic
science", "audit", "finance", "insurance" and other
words that identify the field. Then we use a crawler
to retrieve the papers whose keywords appear in the
seed list and extract the other keywords of those
papers. The extracted keywords can be used as new
keywords in the target field. As this process is
repeated, the dictionary of field-specific
professional terms keeps growing. The whole
acquisition process is shown in Fig. 2.
Figure 2: Automatic field dictionary building method.
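The seed-growing loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the crawler used in the paper: `TOY_PAPERS`, `toy_search` and `grow_dictionary` are hypothetical names, the paper database is replaced by a small in-memory list of keyword sets, and the `max_rounds` stopping rule is our own choice, since the text does not say when the iteration ends.

```python
# Hypothetical stand-in for the paper database: each paper is reduced to
# its author-supplied keyword list. A real system would query WanFang.
TOY_PAPERS = [
    ["finance", "monetary policy", "inflation"],
    ["audit", "internal control"],
    ["inflation", "consumer price index"],
    ["protein folding", "enzyme"],
]


def toy_search(keyword):
    """Return the keyword lists of all toy papers tagged with `keyword`."""
    return [keywords for keywords in TOY_PAPERS if keyword in keywords]


def grow_dictionary(seeds, search, max_rounds=3):
    """Expand a seed list by harvesting keywords of papers that match it.

    `search` is any callable mapping a keyword to a list of keyword lists.
    `max_rounds` is an assumed stopping rule, not taken from the paper.
    """
    dictionary = set(seeds)
    frontier = set(seeds)  # terms added in the previous round
    for _ in range(max_rounds):
        discovered = set()
        for term in frontier:
            for paper_keywords in search(term):
                discovered.update(paper_keywords)
        frontier = discovered - dictionary  # keep only genuinely new terms
        if not frontier:
            break  # nothing new was found, so the dictionary has converged
        dictionary |= frontier
    return dictionary
```

Calling `grow_dictionary(["finance", "audit"], toy_search)` on the toy data adds "monetary policy", "inflation", "internal control" and "consumer price index", while the unrelated biology keywords are never reached. In practice the harvested terms would probably need filtering, because a paper that shares one keyword with the seed list may also carry keywords from other fields.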
4 A HYBRID SEGMENTATION
METHOD BASED ON FIELD
TRIE TREE AND HMM-
VITERBI MODEL
A Trie tree, also known as a dictionary tree or prefix
tree, is a tree structure. Counting, sorting and
searching strings stored in a Trie structure is very
fast (Shang Wen-qian, 2007). Its advantage is that
strings with the same prefix share the storage for
that prefix. This not only reduces the storage cost
of strings with a common prefix, but also lets a
lookup follow the characters of the query one by one
directly to the matching string, which reduces the
number of comparisons and shortens lookup time.
In this paper, we find the professional terms in
the text with a Trie tree.
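To make the prefix-sharing idea concrete, the following is a minimal Python sketch of a character-level Trie together with a left-to-right scan that takes the longest dictionary term at each position. The names `TrieNode`, `Trie`, `longest_match` and `find_terms` are hypothetical, and the paper's actual data structure and matching details may differ.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # character -> TrieNode
        self.is_term = False  # True if the path from the root spells a term


class Trie:
    def __init__(self, terms=()):
        self.root = TrieNode()
        for term in terms:
            self.insert(term)

    def insert(self, term):
        """Add one dictionary term; shared prefixes reuse existing nodes."""
        node = self.root
        for ch in term:
            node = node.children.setdefault(ch, TrieNode())
        node.is_term = True

    def longest_match(self, text, start):
        """Length of the longest term beginning at text[start], or 0."""
        node = self.root
        best = 0
        for i in range(start, len(text)):
            node = node.children.get(text[i])
            if node is None:
                break  # no dictionary term continues with this character
            if node.is_term:
                best = i - start + 1
        return best


def find_terms(text, trie):
    """Scan left to right, taking the longest dictionary term each time."""
    found = []
    pos = 0
    while pos < len(text):
        length = trie.longest_match(text, pos)
        if length:
            found.append(text[pos:pos + length])
            pos += length  # skip past the matched term
        else:
            pos += 1       # no term starts here, move one character on
    return found
```

Because the Trie works on single characters, the same code applies to unsegmented Chinese text. With the dictionary `["金融", "金融学", "保险"]`, `find_terms("他研究金融学和保险", trie)` returns `["金融学", "保险"]`: the longer term "金融学" is preferred over its prefix "金融".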
HMM-Viterbi segmentation (a Hidden Markov Model
decoded with the Viterbi algorithm) can improve the
recognition rate of professional terms, place names,
organization names, etc. However, personal profile
text is relatively short and has a small vocabulary,
so the recognition rate of professional terms directly
affects the classification by field. This study
therefore designs a hybrid segmentation method
based on a word dictionary and probability.
First, obtain the dictionary of the target field and
build the Trie tree from it. Second, use a forward
maximum matching method to extract the
field-specific terms with the Trie tree. Then, use the
HMM-Viterbi model to handle the text that remains
after the field-specific terms are extracted. The
entire segmentation
process consists of pretreatment and segmentation
Finding of People in Economic Field based on Personal Profile