Dictionary-based methods measure similarity using a semantic taxonomy tree or a multi-dimensional semantic description. Based on WordNet, Pedersen et al. have developed a similarity tool, which supports the measures of Resnik, Lin, Jiang-Conrath, Leacock-Chodorow, Hirst-St.Onge, Wu-Palmer, Banerjee-Pedersen, and Patwardhan-Pedersen (Pilevar, Jurgens and Navigli, 2013). Based on Tongyici Cilin, Wang has proposed a method to measure similarities between Chinese words on a semantic taxonomy tree (Wang, 1999).
Based on HowNet, Liu et al. have proposed to compute Chinese word similarity from a multi-dimensional knowledge description (Liu and Li, 2002). Based on WordNet, Pilehvar et al. have presented a unified approach to computing semantic similarity from words to texts, which runs personalized PageRank on WordNet to obtain a semantic signature for each word and then compares the similarity of the semantic signatures of a word pair (Pilevar, Jurgens and Navigli, 2013). For dictionary-based methods, a reliable, high-quality dictionary is difficult to build and requires a great deal of manual labor. Moreover, as society develops, new words continually emerge, and these are usually missing from dictionaries. This hurts the performance of dictionary-based word similarity computation.
Corpus-based methods use co-occurrence statistics to compute the similarity of word pairs. Lin has presented an information-theoretic definition of similarity, which measures similarities between words based on their distribution in a database of dependency triples (Lin, 1998). Liu et al. have proposed to measure the indirect association of bilingual words with four common methods, namely PMI, Dice, Phi, and LLR (Liu and Zhao, 2010). With the development of deep learning, a distributed representation for words, i.e., the word vector, was proposed by Bengio et al. (Bengio et al., 2003). Word vectors are trained on a large-scale corpus, and with them the similarity of words is easy to compute. Although the performance of corpus-based methods depends on the size and quality of the corpus, they save a great deal of human labor and can incorporate new words at any time.
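As a brief illustration, two of the association measures mentioned above (PMI and Dice) can be computed directly from raw co-occurrence counts; the counts in this sketch are invented for illustration:

```python
import math

def pmi(freq_xy, freq_x, freq_y, n):
    """Pointwise mutual information from co-occurrence counts
    (n is the total number of observations)."""
    return math.log2((freq_xy * n) / (freq_x * freq_y))

def dice(freq_xy, freq_x, freq_y):
    """Dice coefficient of two words' occurrence counts."""
    return 2 * freq_xy / (freq_x + freq_y)

# Toy counts: w1 occurs 50 times, w2 occurs 40 times,
# and they co-occur 30 times in a corpus of 10000 windows.
print(pmi(30, 50, 40, 10000))   # high positive value: strong association
print(dice(30, 50, 40))
```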
For the convenience of acquiring knowledge automatically, we focus on corpus-based methods, especially the word-vector method. A series of experiments has been done to compare their performance.
3 SUBMITTED AND REVISED SYSTEMS OVERVIEW
In NLPCC-ICCPOL 2016 Task 3, we have 
submitted four system runs. All of them are 
unsupervised systems.  
3.1 Submitted System 
Run 1: Word Vector Method.  
In this run, we use word vectors obtained with word2vec (Mikolov et al., 2013). Using these word vector representations, the similarity between two words can be computed with the cosine operation.
is advisable to keep all the given values. 
We train the word vectors by running the word2vec toolkit (Mikolov et al., 2013). The Sogou news corpus, which contains Internet news data from June to July 2012, is selected as the training corpus. The news corpus is formatted and cleaned: HTML markup is removed and only the news text is retained. The news text is then segmented into Chinese words by ICTCLAS 2016. Word2vec runs on the preprocessed news corpus to train effective Chinese word vectors. In this run, the CBOW model is selected, the window size is set to 5, and the dimension of the word vectors is set to 200.
The similarity between a pair of words is computed as the cosine of their associated word vectors, as shown in Equation (1).

Similar(w1, w2) = WordVector(w1, w2) = cos(vector(w1), vector(w2)) × 10    (1)

where w1 and w2 are the target word pair, and vector(w1) and vector(w2) are their word vectors.
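Equation (1) amounts to a cosine similarity scaled by a factor of 10; a minimal sketch, with short invented vectors standing in for the 200-dimensional ones:

```python
import numpy as np

def word_vector_similarity(v1, v2):
    """Similar(w1, w2) = cos(vector(w1), vector(w2)) * 10, as in Equation (1)."""
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return 10.0 * cos

# Invented 4-dimensional vectors for illustration.
v1 = np.array([0.2, 0.1, 0.4, 0.3])
v2 = np.array([0.2, 0.1, 0.4, 0.3])
print(word_vector_similarity(v1, v2))  # identical vectors -> 10.0
```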
Run 2: Word Vector and PMI Method.
In the experiments of Run 1, we find that some words, such as GDP, GRE, and WTO, are missing from the word vector vocabulary. The similarities of word pairs involving these missing words are 0 in Run 1, which is not reasonable; there should be a supplement for the missing words. Therefore, in Run 2 we take the PMI method as a backup for the word vectors. That is to say, words missing from the word vectors are processed by PMI.
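The Run 2 fallback can be sketched as follows: use the cosine of the word vectors when both words are in the vocabulary, and otherwise fall back to a PMI score; the lookup tables and counts here are invented for illustration:

```python
import math

def similarity(w1, w2, vectors, cooc, freq, total):
    """Word-vector cosine (scaled by 10) when both words are known;
    otherwise fall back to a PMI-based score."""
    if w1 in vectors and w2 in vectors:
        v1, v2 = vectors[w1], vectors[w2]
        dot = sum(a * b for a, b in zip(v1, v2))
        norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
        return 10.0 * dot / norm
    # PMI backup for words missing from the word-vector vocabulary
    f12 = cooc.get((w1, w2), 0)
    if f12 == 0:
        return 0.0
    return math.log2(f12 * total / (freq[w1] * freq[w2]))

vectors = {"经济": [0.1, 0.3], "数据": [0.1, 0.3]}   # toy vectors
cooc = {("GDP", "经济"): 30}                          # toy co-occurrence counts
freq = {"GDP": 50, "经济": 200}                       # toy word frequencies
print(similarity("经济", "数据", vectors, cooc, freq, 10000))  # 10.0 (identical toy vectors)
print(similarity("GDP", "经济", vectors, cooc, freq, 10000))   # PMI fallback
```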
For illustration purposes, suppose that the two words whose similarity is to be calculated are w1 and w2. We introduce the following quantities for each word pair (w1, w2):

a = freq(w1, w2)