“spring” is FOUNTAIN, not SEASON.
This paper proposes a novel unsupervised WSD
method that extends Basile's method. While
Basile's method considers only the words in a context
for WSD, our method also takes collocations into account
to determine the sense of a given word. In addition
to ordinary collocations (adjacent words that
often appear together), we also define a dependency
collocation, which is a syntactic dependency relation
between a target word and another word in a sentence.
We also propose to change how the context
vector in the semantic space is constructed. In the original
research, the context embedding is computed as the
average of the word embeddings of all words in the context.
However, not all words are related to the sense of the
target word. Our method considers only words that are
highly related to the sense when the context embedding
is built.
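The contrast between the two ways of building a context embedding can be shown in a minimal sketch. The function names, the dictionary-based embedding lookup, and the `relatedness` scoring function with its threshold are illustrative assumptions, not details from the paper; the paper's actual relatedness criterion is defined later in Section 3.

```python
import numpy as np

def context_vector_avg(context_words, emb):
    """Baseline: average the embeddings of all context words."""
    vecs = [emb[w] for w in context_words if w in emb]
    return np.mean(vecs, axis=0)

def context_vector_filtered(context_words, emb, relatedness, threshold=0.3):
    """Sketch of the selective variant: average only words whose
    relatedness score (an assumed scoring function) exceeds a threshold."""
    vecs = [emb[w] for w in context_words
            if w in emb and relatedness(w) >= threshold]
    if not vecs:  # fall back to plain averaging if no word survives
        return context_vector_avg(context_words, emb)
    return np.mean(vecs, axis=0)
```

The fallback branch is a design choice of this sketch: if the filter removes every word, the plain average is still a usable context vector.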
The rest of the paper is organized as follows.
Section 2 provides a brief review of related
work. Section 3 describes the details of our proposed
method. Section 4 reports several experiments
to evaluate our method. Finally, Section 5 concludes
the paper.
2 RELATED WORK
There are three commonly used features in WSD. The
first is the words in the surroundings of the target
word. Part-of-speech (POS) tags of the neighboring
words are also widely used. Local collocations
are another standard feature, capturing
the ordered sequences of words that tend to appear
around the target word (Bazell, 1959).
Many unsupervised WSD methods are based on
calculating the similarity between a word sense and its
context using such features. One of the most traditional
methods for unsupervised WSD is the Lesk algorithm
(Lesk, 1986). It is based on the assumption
that words in a given section of text tend to share
the same topic. As already explained, it computes the
similarity between the sense definition of an ambiguous
word and the terms appearing in its neighborhood.
There are many measures to determine the similarity
between a sense and a context. Torres and Gelbukh
present a comparison of several similarity measures
applied to WSD by the Lesk algorithm (Torres and
Gelbukh, 2009). Since gloss sentences tend to be
short, several methods use external resources to obtain
additional information about the sense. Bhingardive et
al. use broad information from a lexical database
related to the sense, such as hypernyms, hyponyms,
synonyms, and even example sentences in the dictionary,
to construct a vector representation of the sense in
order to identify the most frequent sense (Bhingardive
et al., 2015).
The most important work related to this study is
(Basile et al., 2014). It utilizes a semantic space, a
geometric space in which vectors express the
concepts of words; proximity in this space
measures the semantic relatedness between words. Since
the gloss (definition) and the context are both composed of
several terms, the vector of each set of terms is built
by adding the vectors of all the words in the set.
Pre-trained word embeddings are used to construct the
gloss and context vectors. The cosine similarity between
the gloss and context vectors is used to choose the appropriate
sense of the target word.
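This gloss–context matching scheme can be sketched as follows. The code is a minimal illustration under assumptions: embeddings are a word-to-vector dictionary, glosses are given as token lists per sense, and the function names are invented here, not taken from Basile et al.

```python
import numpy as np

def text_vector(words, emb):
    """Represent a set of terms as the sum of their word embeddings."""
    vecs = [emb[w] for w in words if w in emb]
    return np.sum(vecs, axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def choose_sense(context_words, sense_glosses, emb):
    """Pick the sense whose gloss vector has the highest cosine
    similarity with the context vector."""
    ctx = text_vector(context_words, emb)
    return max(sense_glosses,
               key=lambda s: cosine(text_vector(sense_glosses[s], emb), ctx))
```

For example, with glosses for the two senses of "spring" mentioned in Section 1, a context containing water-related words would select FOUNTAIN over SEASON.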
As already discussed in Section 1, this paper extends
Basile's method in two directions. One is
to incorporate a mechanism that determines a sense
from a collocation. Rules that determine a sense
from a collocation are automatically acquired
from a raw corpus and then integrated into
Basile's WSD model. The other is to propose a
better way to construct the context vector, since the
performance of WSD heavily relies on its quality.
3 PROPOSED METHOD
Figure 1 shows an overview of the proposed system.
It accepts a sentence including a target word
as input and outputs a sense for that word.
Our system consists of two modules: one is a rule-based
WSD system, and the other is a WSD system based
on Highly Related Word Embedding (hereafter, the
HRWE method for short). The first module uses a
database of collocation WSD rules, which determine
the sense from a collocation (word sequence). Briefly,
these rules have the form
collocation → sense. If a rule matches a collocation in
the given sentence, the sense is chosen by that rule; otherwise,
the next module is applied. The second module
is similar to (Basile et al., 2014). It measures the
similarity between the gloss sentences in a dictionary and
the context of the target word in the given sentence, then
chooses the sense whose gloss is the most similar to
the context of the target word. Since the rule-based
module is designed to achieve high precision at the cost
of low recall, it is applied first.
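The two-stage control flow can be sketched as below. The rule store as a dictionary from collocation tuples to senses, the matching by exact word-sequence scan, and all function names are assumptions made for illustration; the paper's rule acquisition and matching details follow in the next subsections.

```python
def disambiguate_two_stage(sentence_words, target, rules, fallback_wsd):
    """Stage 1: if any collocation rule (collocation -> sense) matches a
    word sequence in the sentence, return the rule's sense (high
    precision). Stage 2: otherwise fall back to the similarity-based
    WSD module."""
    words = list(sentence_words)
    for collocation, sense in rules.items():
        n = len(collocation)
        for i in range(len(words) - n + 1):
            if tuple(words[i:i + n]) == collocation:
                return sense
    return fallback_wsd(words, target)
```

Applying the rule module first reflects the design stated above: a matching rule is trusted outright, and the lower-precision similarity module is consulted only when no rule fires.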
In the following subsections, the HRWE method
is introduced first, since it is also used to construct
the set of collocation WSD rules. Then,
the rule-based WSD system is described, in particular
how the WSD rules are acquired automatically.
Unsupervised Word Sense Disambiguation based on Word Embedding and Collocation