Dictionary-based methods measure similarity on a semantic taxonomy tree or with a multi-dimensional semantic description. Based on WordNet, Pedersen et al. have developed a similarity tool that supports the measures of Resnik, Lin, Jiang-Conrath, Leacock-Chodorow, Hirst-St.Onge, Wu-Palmer, Banerjee-Pedersen, and Patwardhan-Pedersen (Pilehvar, Jurgens and Navigli, 2013). Based on Tongyici Cilin, Wang has proposed a method to measure similarities between Chinese words on a semantic taxonomy tree (Wang, 1999).
Based on HowNet, Liu et al. have proposed to compute Chinese word similarity from its multi-dimensional knowledge description (Liu and Li, 2002). Based on WordNet, Pilehvar et al. have presented a unified approach to computing semantic similarity from words to texts, which runs personalized PageRank on WordNet to obtain a semantic signature for each word and then compares the semantic signatures of word pairs (Pilehvar, Jurgens and Navigli, 2013). For dictionary-based methods, a reliable, high-quality dictionary is difficult to build and requires a great deal of manual labor. Moreover, new words keep emerging as society develops, and they are usually missing from dictionaries, which hurts the performance of dictionary-based word similarity computation.
Corpus-based methods use co-occurrence statistics to compute the similarity of word pairs. Lin has presented an information-theoretic definition of similarity, which measures similarities between words based on their distribution in a database of dependency triples (Lin, 1998). Liu et al. have proposed to measure the indirect association of bilingual words with four common methods, namely PMI, Dice, Phi, and LLR (Liu and Zhao, 2010). With the development of deep learning, a distributed representation for words, i.e., the word vector, was proposed by Bengio et al. (Bengio et al., 2003). Word vectors are trained on a large-scale corpus, and with them the similarity of words is easy to compute. Although the performance of corpus-based methods is affected by the size and quality of the corpus, these methods save a great deal of human labor and can incorporate new words at any time.
For the convenience of acquiring knowledge automatically, we focus on corpus-based methods, especially the word vector method. A series of experiments has been conducted to compare their performance.
3 SUBMITTED AND REVISED SYSTEMS OVERVIEW
In NLPCC-ICCPOL 2016 Task 3, we submitted four system runs, all of which are unsupervised systems.
3.1 Submitted Systems
Run 1: Word Vector Method.
In this run, we use word vectors obtained with word2vec (Mikolov et al., 2013). With these word vector representations, the similarity between two words can be computed by the cosine operation.
We train the word vectors by running the word2vec toolkit (Mikolov et al., 2013). The Sogou news corpus, which contains Internet news from June to July 2012, is selected as the training corpus. The news corpus is formatted and cleaned: HTML markup is removed and only the news text is retained. The news text is then segmented into Chinese words by ICTCLAS 2016. Word2vec runs on the preprocessed news corpus to train effective Chinese word vectors. In this run, the CBOW model is selected, the window size is set to 5, and the dimension of the word vectors is set to 200.
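As a concrete illustration, the following is a minimal sketch of this training setup using the gensim reimplementation of word2vec instead of the original C toolkit; the file names are placeholders, and min_count is gensim's default pruning threshold, not a setting reported here.

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Placeholder path: one news item per line, already segmented into
# space-separated Chinese words (e.g., by ICTCLAS).
sentences = LineSentence("sogou_news_segmented.txt")

# sg=0 selects the CBOW model; window=5 and vector_size=200 match the
# settings above (gensim >= 4.0; earlier versions call this parameter size).
model = Word2Vec(sentences, sg=0, window=5, vector_size=200,
                 min_count=5, workers=4)
model.save("sogou_cbow_200.model")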
The similarity between a pair of words is computed as the cosine similarity of their associated word vectors, scaled by 10, as shown in Equation (1).
$$\mathrm{Similar}(w_1, w_2) = \mathrm{WordVector}(w_1, w_2) = \cos(\mathrm{vector}(w_1), \mathrm{vector}(w_2)) \times 10 \qquad (1)$$

in which $w_1$ and $w_2$ are the target word pair, and $\mathrm{vector}(w_1)$ and $\mathrm{vector}(w_2)$ are their word vectors.
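In code, Equation (1) is straightforward; the sketch below assumes numpy arrays such as those produced by the gensim model trained above, and the example words are arbitrary placeholders.

import numpy as np

def similar(v1, v2):
    # Equation (1): cosine of the two word vectors, scaled by 10.
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return 10.0 * cos

# e.g., score a word pair with the trained model:
# score = similar(model.wv["经济"], model.wv["金融"])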
Run 2: Word Vector and PMI Method.
In the experiments of Run 1, we find that some words, such as GDP, GRE, and WTO, are missing from the word vectors. In Run 1, the similarity of word pairs that involve these missing words is 0, which is not reasonable, so a supplement for the missing words is needed. Therefore, in Run 2, we take the PMI method as a backup for the word vectors; that is to say, word pairs missing from the word vectors are processed by PMI.
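The fallback logic can be sketched as follows, reusing the similar function from Run 1; pmi_similarity is a hypothetical placeholder for the PMI-based score introduced next.

def word_similarity(w1, w2, model, pmi_similarity):
    # Run 2: use Equation (1) when both words have word vectors;
    # otherwise fall back to the PMI-based score.
    if w1 in model.wv and w2 in model.wv:
        return similar(model.wv[w1], model.wv[w2])
    return pmi_similarity(w1, w2)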
For illustration purposes, suppose that the two words whose similarity is to be calculated are $w_1$ and $w_2$. We introduce the following relations for each word pair $(w_1, w_2)$:
$$a = \mathrm{freq}(w_1, w_2)$$