SEMANTIC CLASSIFICATION OF UNKNOWN WORDS

BASED ON GRAPH-BASED SEMI-SUPERVISED CLUSTERING

Fumiyo Fukumoto and Yoshimi Suzuki

Interdisciplinary Graduate School of Medicine and Engineering, University of Yamanashi, Yamanashi, Japan

Keywords:

Unknown words, Polysemous verbs, Levin-style semantic classes, Semi-supervised clustering.

Abstract:

This paper presents a method for semantic classiﬁcation of unknown verbs including polysemies into Levin-

style semantic classes. We propose a semi-supervised clustering, which is based on a graph-based unsuper-

vised clustering technique. The algorithm detects the spin conﬁguration that minimizes the energy of the spin

glass. Comparing global and local minima of an energy function, called the Hamiltonian, allows for the de-

tection of nodes with more than one cluster. We extended the algorithm so as to employ a small amount of

labeled data to aid unsupervised learning, and applied the algorithm to cluster verbs including polysemies.

The distributional similarity between verbs used to calculate the Hamiltonian is in the form of probability

distributions over verb frames. The result obtained using 110 test polysemous verbs with labeled data of 10%

showed 0.577 F-score.

1 INTRODUCTION

Semantic verb classiﬁcation is not an end task in it-

self, but supports many NLP tasks, such as subcat-

egorization acquisition (Korhonen, 2002; Kermani-

dis et al., 2008), word sense disambiguation (Nav-

igli, 2009), and language generation (Reiter and Dale,

2000). Much of the previous work on verb classiﬁca-

tion has been to classify verbs into classes with se-

mantically similar senses taken from an existing the-

saurus or taxonomy. However, such a resource makes

it nearly impossible to cover large, and fast-changing

linguistic knowledge required for these NLP tasks,

depending on text-type and subject domain. Let us

take a look at the Levin-style semantic classes (Levin,

1993). It consists of 3,024 verbs. Similarly, Japanese

thesaurus dictionary called Bunrui-Goi-Hyo consists

of 87,743 content words. Therefore, considering this

resource scarcity problem, semantic classiﬁcation of

verbs which do not appear in the resource but appear

in corpora has been an interest since the earliest days

when a number of large scale corpora have become

available.

A number of methodologies have been developed

for verb classiﬁcation. One such attempt is to apply

clustering techniques to classiﬁcation. However, two

main difﬁculties arise in the use of clustering algo-

rithms. The ﬁrst is that we do not know how many

classes there are in a given input. The usual drawback

in many algorithms is that they cannot give a valid

criterion for measuring class structure. The second is

that the algorithm should allow each data point (verb)

to belong to more than one cluster because of the ex-

istence of polysemous verbs.

The aim of this work is to resolve these prob-

lems. We focus on unknown verbs including poly-

semies, and present a method for classifying them into

Levin-style semantic classes. We propose a graph-

based semi-supervised clustering method which al-

lows nodes (verbs) to belong to multiple clusters

(senses). The essence of the approach is to deﬁne

an energy function, called the Hamiltonian which

achieves minimal energy when there is high within-

cluster connectivity and low between-cluster connec-

tivity. The energy minimum is obtained by simu-

lated annealing. In this context, two verbs are “con-

nected” if they share many of the same subcategoriza-

tion frames. Comparing global and local minima of

an energy function Hamiltonian, allows for the de-

tection of overlapping nodes. We extended the al-

gorithm so as to employ a small amount of labeled

data to aid unsupervised learning, and clustered pol-

ysemous verbs. The distributional similarity between

verbs used to calculate the Hamiltonian is in the form

of probability distributions over verb frames. The

results obtained using 110 test verbs including pol-

ysemies with labeled data of 10% showed 0.577 F-

score, and it was comparable with previous work.

Fukumoto F. and Suzuki Y..

SEMANTIC CLASSIFICATION OF UNKNOWN WORDS BASED ON GRAPH-BASED SEMI-SUPERVISED CLUSTERING.

DOI: 10.5220/0003633100370046

In Proceedings of the International Conference on Knowledge Engineering and Ontology Development (KEOD-2011), pages 37-46

ISBN: 978-989-8425-80-5

 2011 SCITEPRESS (Science and Technology Publications, Lda.)

The rest of the paper is organized as follows. The

next section provides an overview of existing tech-

niques. Section 3 explains verb description, i.e., verb

frame patterns. Section 4 describes distributional sim-

ilarity measures to compute semantic similarities be-

tween pairs of verbs. Section 5 explains our clustering

algorithm. Finally, we report some experiments using

110 verbs including polysemies, and end with a dis-

cussion of evaluation.

2 RELATED WORK

Much of the previous work on verb classiﬁcation is to

classify verbs into classes with semantically similar

senses taken from an existing thesaurus or taxonomy.

One attractive attempt is to use Levin-style semantic

classes (Levin, 1993), as this classiﬁcation includes

the largest number of English verbs with ﬁne-grained

classes. Moreover, it is based on the assumption that

the sense of a verb inﬂuences its syntactic behavior,

particularly with respect to the choice of its argu-

ments. Therefore, if we induce a verb classiﬁcation

on the basis of verb features, i.e., syntactic informa-

tion obtained from corpora, then the resulting classi-

ﬁcation should agree with a semantic classiﬁcation to

a certain extent.

Schulte (Schulte im Walde, 2000) attempted to

classify verbs using two algorithms: iterative clus-

tering based on a deﬁnition by (Hughes, 1994), and

unsupervised latent class analysis as described by

(Rooth, 1998), based on the expectation maximiza-

tion algorithm. Stevenson and Joanis compared their

supervised method for verb classiﬁcation with semi-

supervised and unsupervised techniques (Stevenson

and Joanis, 2003). Brew et al. focused on dimension-

ality reduction on the verb frame patterns, and applied

a spectral clustering technique (Ng et al., 2002) to the

unsupervised clustering of German verbs to Levin’s

English classes (Brew and Walde, 2002). They re-

ported that the results by a spectral clustering outper-

formed the standard k-means against all the evalua-

tion measures including “F-measure” and all the dis-

tance measures including “skew divergence.”

In the context of graph-based clustering of words,

Widdows and Dorow used a graph model for un-

supervised lexical acquisition (Widdows and Dorow,

2002). The graph structure is built by linking pairs

of words that participate in particular syntactic rela-

tionships. An incremental cluster-building algorithm

using the graph structure achieved 82% accuracy at

a lexical acquisition task, evaluated against WordNet

10 classes, and each class consists of 20 words. Mat-

suo et al. proposed a method of word clustering based

on a word similarity measure by Web counts (Matsuo

et al., 2006). They used Newman clustering for the

clustering algorithm, and reported that the results ob-

tained with the algorithm were better than those ob-

tained by average-link agglomerativeclustering using

90 Japanese noun words. However, all these meth-

ods relied on hard-clustering models, and thus have

largely ignored the issue of polysemy by assuming

that words belong to only one cluster.

In contrast to hard-clustering algorithms, soft

clustering allows that words to belong to more than

one cluster. Much of the previous work on soft clus-

tering is based on EM algorithm. The earliest work in

this direction is that of Pereira et al (Pereira et al.,

1993), who described a hierarchical soft clustering

method that clusters noun words. The clustering re-

sult was a hierarchy of noun clusters, where every

noun belongs to every cluster with a membership

probability. The initial data for the clustering process

were frequencies of verb–noun pairs in a direct ob-

ject relationship, as extracted from conditional verb–

noun probabilities, the similarity of the distributions

was determined by the KL divergence. The EM al-

gorithm was used to learn the hidden cluster mem-

bership probabilities, and deterministic annealing per-

formed the divisive hierarchical clustering. Schulte

et al. (Schulte im Walde et al., 2008) proposed a

method for semantic verb classiﬁcation that relies on

selectional preferences as verb properties. The model

was implemented as a soft clustering approach to cap-

ture the polysemy of the verbs. The training proce-

dure used the EM algorithm to iteratively improve the

probabilistic parameters of the model, and applied the

MDL principle to induce WordNet-based selectional

preferences for arguments within subcategorization

frames. The results showed that after 10 training it-

erations the verb class model results were above the

baseline results. Our work is similar to their method

in the use of semi-supervised clustering, while they

did not report in detail whether the clusters captured

polysemic verbs. Moreover, the algorithm cannot as-

sign unlabeled data to a new class other than known

classes.

Korhonen et al. (Korhonen et al., 2003) used

verb–frame pairs to cluster verbs relying on the in-

formation bottleneck. Our work is similar to their

method in the use of 110 test verbs provided by (Ko-

rhonen et al., 2003), and focused especially on verbal

polysemy. However, their method interpreted poly-

semy as representedby the soft clusters, i.e., they used

the Information Bottleneck, an iterative soft method

with hardening of the output, while the method pre-

sented in this paper allows that verbs belong to more

than one cluster.

KEOD 2011 - International Conference on Knowledge Engineering and Ontology Development

3 SUBCATEGORIZATION

INFORMATION

A typical word clustering task is to cluster words into

classes based on their distributional similarity. Sim-

ilarity measures based on distributional hypothesis

compare a pair of weighted feature vectors that char-

acterize two words (Hindle, 1990; Lin, 1998; Dagan

et al., 1999).

Like much previous work on verb classiﬁcation,

we used subcategorization frame distributions to cal-

culate similarity between verbs (Schulte im Walde,

2000; Brew and Walde, 2002). More precisely, (Ko-

rhonen et al., 2003) provided subcategorization frame

data. They used the subcategorization acquisition sys-

tems of (Briscoe and Carroll, 1997). The system

employs a robust statistical parser (Briscoe and Car-

roll, 2002), which yields complete but shallow parses,

and a comprehensive subcategorization frame clas-

siﬁer. It incorporates 163 subcategorization distinc-

tions, a set of those found in the ANLT and COM-

LEX dictionaries (Bouraev et al., 1987; Grishman

et al., 1994). A total of 6,433 verbs were ﬁrst se-

lected from COMLEX and British National Corpus

(Leech, 1992). Next, to obtain as comprehensive sub-

categorization frequency information as possible, up

to 10,000 sentences containing an occurrence of each

of these verbs were included in the input data for sub-

categorization acquisition. These sentences were ex-

tracted from ﬁve different corpora, including BNC

(Korhonen et al., 2006). We used these data to cal-

culate similarity between verbs.

4 DISTRIBUTIONAL

SIMILARITY

There is a large body of work on distributional simi-

larity measures. Here, we concentrate on eight more

commonly used measures. In the following formulae,

x and y refer to the verb vectors, their subscripts to the

verb subcategorization frame values.

1. The Binary Cosine Measure (bCos).

The cosine measures the similarity of the two vec-

tors x and y by calculating the cosine of the an-

gle between vectors, where each dimension of the

vector corresponds to each of 163 subcategoriza-

tion patterns and each value of the dimension is

the frequency of each pattern. The binary cosine

measure is a ﬂattened version of the cosine mea-

sure in which all non-zero counts are replaced by

1.0.

2. The Cosine Measure based on Probability of

Relative Frequencies (rfCos).

The differences between the cosine and the value

based on relative frequencies of subcategorization

frames are the values of each dimension, i.e., the

former are frequencies of each pattern and the lat-

ter are the probability of relative frequencies of

each pattern.

3. The Dice Coefﬁcient (Dice).

The Dice Coefﬁcient is a combinatorial similar-

ity measure adopted from the ﬁeld of Information

Retrieval for use as a measure of lexical distribu-

tional similarity. It is computed as twice the ratio

between size of the intersection of the two sub-

categorization patterns and the sum of the sizes of

the individual subcategorization patterns:

Dice(x,y) =

2· | F(x) ∩ F(y) |

| F(x) | + | F(y) |

4. Jaccard’s Coefﬁcient (Jacc).

Jaccard’s Coefﬁcient can be deﬁned as the ratio

between the size of the intersection of the two sub-

categorization patterns and the size of the union of

the subcategorization patterns:

Jacc(x, y) =

| F(x) ∩ F(y) |

| F(x) ∪ F(y) |

5. L

Norm (L

)

The L

Norm is a member of a family of measures

known as the Minkowski Distance, for measuring

the distance between two points in space. The L

distance between two verbs can be written as:

(x,y) =

∑

i=1

| x

− y

| .

6. Kullback-Leibler (KL).

Kullback-Leibler is a measure from information

theory that determines the inefﬁciency of assum-

ing a model probability distribution given the true

distribution.

D(x||y) =

∑

i=1

∗ log

KL is not deﬁned in case y

= 0. Thus, the prob-

ability distributions must be smoothed. We used

two smoothing methods, i.e., Add-one smoothing

SEMANTIC CLASSIFICATION OF UNKNOWN WORDS BASED ON GRAPH-BASED SEMI-SUPERVISED

CLUSTERING

and Witten and Bell smoothing (Witten and Bell,

1991).

Moreover, two variants of KL, α-skew

divergence and the Jensen-Shannon, were used to

perform smoothing.

7. α-skew Divergence (α div.).

The α-skew divergence measure is a variant of

KL, and is deﬁned as:

αdiv(x, y) = D(y || α · x+ (1− α) · y).

Lee reported the best results with α = 0.9 (Lee,

1999). We used the same value.

8. The Jensen-Shannon (JS).

The Jensen-Shannon is a measure that relies on

the assumption that if x and y are similar, they are

close to their average. It is deﬁned as:

JS(x,y) =

[D(x ||

x+ y

) + D(y ||

x+ y

)].

All measures except bCos, rfCos, Dice, and Jacc

showed that smaller values indicate a closer relation

between two verbs. Thus, we used inverse of each

value.

5 CLUSTERING METHOD

We now proceed to a discussion of our modiﬁcations

to the algorithm reported by (Reichardt and Born-

holdt, 2006); we call this semi-supervised RB algo-

rithm. In this work, we focus on background knowl-

edge that can be expressed as a set of constraints

on the clustering process. After a discussion of the

kind of constraints we are using, we describe semi-

supervised RB algorithm.

5.1 The Constraints

In semi-supervised clustering, a small amount of la-

beled data is available to aid the clustering process.

Like much previous work on semi-supervised cluster-

ing (Bar-Hillel et al., 2003; Bilenko et al., 2004), our

work uses both must-link and cannot-link constraints

between pairs of nodes (Wagstaff et al., 2001). Must-

link constraints specify that two nodes (verbs) have

to be in the same cluster. Cannot-link constraints, on

the other hand, specify that two nodes must not be

placed in the same cluster. These constraints are de-

rived from a small amount of labeled data.

We report Add-one smoothing results in the evaluation,

as it was better than Witten and Bell smoothing.

5.2 Clustering Algorithm

The clustering algorithm used in this study was based

on a graph-based unsupervised clustering technique

reported by (Reichardt and Bornholdt, 2006). This al-

gorithm detects the spin conﬁguration that minimizes

the energy of the spin glass. The energy function

Hamiltonian, for assignment of nodes into communi-

ties clusters together those that are linked, and keeps

separate those that are not by rewarding internal edges

between nodes and penalizing existing edges between

different clusters. Here, “community” or “cluster”

have in common that they are groups of densely in-

terconnected nodes that are only sparsely connected

with the rest of the network. Only local information is

used to update the nodes which makes parallelization

of the algorithm straightforward and allows the appli-

cation to very large networks. Moreover, comparing

global and local minima of the energy function allows

the detection of overlapping nodes. Reichardt et al.

evaluated their method by applying several data, the

college football network and a large protein folding

network, and reported that the algorithm successfully

detected overlappingnodes (Reichardt and Bornholdt,

2004). We extended the algorithm so as to employ

a small amount of labeled data to aid unsupervised

learning, and clustered polysemous verbs. Let v

(1 ≤

i ≤ n) be a verb in the input, and σ

be a label assigned

to the cluster in which v

is placed. The Hamiltonian

is deﬁned as:

H({σ

}) = −

∑

i< j

(θ) − γp

)δ

. (1)

Here, δ denotes the Kronecker delta. The function

(θ) refers to the adjacency matrix of the graph. If

both of the v

and v

are labeled data, it is deﬁned by

Eq. (2), otherwise it is deﬁned by Eq. (3).

(θ) =



1 if v

and v

satisfy must-link

0 if v

and v

satisfy cannot-link.

(2)

(θ) =



1 if sim(v

, v

) ≥ θ

0 otherwise.

(3)

We calculated sim(v

), i.e., similarity between v

and v

using one of the measures mentioned in Sec-

tion 4. If θ is 0.9 for example, the value of the topmost

10% of the verb pairs are 1, and the remaining pairs

are 0.

The matrix p

in Eq. (1) denotes the probability

that a link exists between verb v

and v

, and is deﬁned

as:

KEOD 2011 - International Conference on Knowledge Engineering and Ontology Development

∑

i< j

(θ)

N(N − 1)/2

, (4)

where N in Eq. (4) denotes the number of verbs and

N(N − 1)/2 is the total number of verb pairs. As

the parameter γ in Eq. (1) increases, each verb is dis-

tributed into larger number of clusters. Eq. (1) thus

shows comparison of the actual values of internal or

external edges with its respective expectation value

under the assumption of equally probable links and

given data sizes. The minima of the Hamiltonian

H are obtained by simulated annealing (Kirkpatrick

et al., 1983) as illustrated in Figure 1.

We applied the ﬂow of the minima of the Hamil-

tonian shown in Figure 1 for M runs. We need to ﬁnd

a global minimum of the Hamiltonian. Each value

of the H

min

for M runs does not generally coincide

with each other. Only the minimum among the values

can be a global minimum and others are local minima.

However, it often happens that one of the local min-

ima is unexpectedly the minimum value, i.e., a global

minimum

. Thus, we regarded the minimum value

which appears more than m times among the M re-

sults as the desired global minimum. We picked H

min

and its corresponding all {σ

}

min

. If a v

belongs to

more than two {σ

}

min

, the v

is regarded as a polyse-

mous verb. The procedure of verb classiﬁcation using

the RB consists of four steps.

1. Input.

The input is a set of verbs {v

, ···, v

}, where n is

the number of input verbs.

2. Calculation of Similarity.

Similarities for each pair of v

are calculated by

using measures mentioned in Section 4.

3. Construction of Adjacency Matrix.

According to Eq. (2) and (3), adjacency matrix,

(θ) is created.

4. Running RB Algorithm.

The RB algorithm shown in Figure 1 is applied

to the adjacency matrix, and clusters of verbs are

obtained.

We note that the algorithm applies m times to ﬁnd

the minima of the Hamiltonian. Therefore, we paral-

lelized the algorithm using the Message Passing Inter-

face (MPI), as we applied simulated annealing for M

runs. For implementation, we used a supercomputer,

SPARC Enterprise M9000, 64 CPU, 1 TB memory.

The method to obtain the H

min

does not warrant the

value to be an actual global minimum, as it is based on the

Monte-Carlo way.

Input {

(θ) // The adjacency matrix.

}

Output {

min

, {σ

}

min

)

// The minimum of the H and its corresponding

// clusters.

}

Initialization {

1. Assign an initial random value to each {σ

2. Calculate H

:= H({σ

}).

3. H := H

, H

min

:= H

4. T

max

:= 10×|H

|, T

min

:= 0.1×|H

// T is a parameter called the temperature.

// T

max

and T

min

are the maximum and minimum

// value of T, respectively

}

Simulated annealing {

For t = 1, 2, ···, N {

// N is the number of iteration

T := T

max

− (T

max

− T

min

)

(t−1)

(N−1)

// Let T be linearly decreased.

For i = 1, 2, · ··, n {

1. Calculate the change △H(σ

→ s)

from the current value of H when σ

changed to be s as:

△H(σ

→ s) = −

∑

m(m6=i)

(e) − γp

)δ

sσ

2. Choose a random cluster label q

as a candidate of σ

, such that q 6= σ

3. Calculate p(σ

= q) as:

p(σ

= q) =

exp(−

△H(σ

→ q))

∑

exp(−

△H(σ

→ s))

// p(σ

= q) is a probability that a cluster

// label of σ

is q.

4. If r < p(σ

= q), then σ

and H are

updated as follows:

H := H + △H(σ

= q), σ

:= q

// r is a random value and 0 ≤ r ≤ 1

5. If H < H

min

, then

min

:= H, {σ

}

min

:= {σ

}

Figure 1: Minima of the Hamiltonian by Simulated Anneal-

ing.

SEMANTIC CLASSIFICATION OF UNKNOWN WORDS BASED ON GRAPH-BASED SEMI-SUPERVISED

CLUSTERING

Table 1: Clustering results.

θ γ Sim C Prec Rec F

RB 0.2 1.0 Dice 48 0.536 0.626 0.577

EM – – – 59 0.301 0.512 0.387

6 EXPERIMENTS

6.1 Experimental Setup

We used the data consisting of 110 test verbs con-

structed by Korhonen et al. (Korhonen et al., 2003).

There are two types: one is the monosemous gold

standard, which lists only a single sense for each

test verb corresponding to the most frequent sense

in WordNet, and the other is the polysemous stan-

dard, which provides all senses for each verb. We

used polysemous data in the experiments. We ran-

domly selected 10% of verbs from 110 test verbs, and

used them as a labeled data. The remaining verbs are

used as unknown verbs. The selection of labeled data

was repeated 10 times. All results are averaged per-

formance against 10 trials. The similarity between

verbs are calculated by using subcategorization frame

data provided by (Korhonen et al., 2003). In the ex-

periments, we experimentally set m and M in semi-

supervised RB to 3 and 1,000, respectively.

For evaluation of verb classiﬁcation, we used the

precision, recall, and F-score, which were deﬁned by

(Schulte im Walde, 2000), especially to capture how

many verbs does the algorithm actually detect more

than just the predominant sense. Precision was de-

ﬁned by the percentage of verb senses appearing in

the correct clusters compared to the number of verb

senses appearing in any cluster, and recall was deﬁned

by the percentage of verb senses within the correct

clusters compared to the total number of verb senses

to be clustered.

For comparison, we utilized the EM algorithm

which is widely used as a soft clustering and semi-

supervised clustering technique (Schulte im Walde

et al., 2008). We followed the method presented in

(Rooth et al., 1999). We used a probability distri-

bution over verb frames with selectional preferences,

and used up to 30 iterations to learn the model proba-

bilities.

6.2 Basic Results

The results are shown in Table 1. “γ” and “θ” refer

to the parameters used by semi-supervised RB algo-

rithm. “Sim” indicates similarity measure reported

in Section 4. We performed experiments by varying

Table 2: Results against each measure.

Sim θ γ C Prec Rec F

Cos 0.1 1.1 39 0.402 0.583 0.476

rfCos 0.1 1.0 39 0.396 0.565 0.466

Dice 0.2 1.0 48 0.536 0.626 0.577

Jacc 0.1 0.7 36 0.314 0.785 0.449

0.2 0.7 46 0.378 0.724 0.497

KL 0.1 0.9 38 0.411 0.630 0.497

αdiv. 0.1 1.2 39 0.421 0.634 0.506

JS 0.4 1.0 37 0.380 0.539 0.446

EM – – 59 0.301 0.512 0.387

these values. Table 1 denotes the value that max-

imized F-score. “C” refers to the number of clus-

ters obtained by the method, and “EM” shows the

results obtained by EM algorithm. Table 1 shows

that the results obtained by semi-supervised RB al-

gorithm were comparable to those obtained using the

EM algorithm. We note that the number of clusters

obtained by EM algorithm is the number of different

labeled verbs in the test data, as the algorithm can-

not assign unlabeled data to a new class other than

known classes (labeled verbs). The result obtained by

semi-supervised RB shows that three new classes on

an average are correctly obtained in 10 trials, while

the number of clusters is smaller than that obtained

by EM and the correct clusters, 62 clusters.

Table 2 shows the results by using each similarity

measure. We can see from Table 2 that our core ﬁnd-

ing, that unknown words including polysemy actually

aids verb classiﬁcation, was robust across a wide vari-

ety of distributional similarity measures, although the

Dice coefﬁcient was decidedly the best such measure

for this particular problem. The observation indicates

that the RB algorithm, especially the minima of the

Hamiltonian, demonstrates our basic assumption: a

verb which belongs to multiple clusters will in gen-

eral reduce the energy.

We recall that semi-supervised RB algorithm uses

two parameters: γ and θ. We examined how these

parameters affect overall clustering results. Figure 2

shows F-score of 110 verbs plotted against γ value for

the top approach, i.e., by running RB with Dice coefﬁ-

cient as similarity measure. Similarly, Figure 3 shows

F-score of 110 verbs plotted against θ value by run-

ning RB with Dice coefﬁcient.

As can be seen clearly from Figure 2, the overall

performance is extremely worse when the γ value is

smaller than 0.9. This is because most of the cluster-

ing results show that verbs are classiﬁed into one of

the clusters , while they are polysemous verbs. Sim-

ilarly, the performance is worse when the γ value is

KEOD 2011 - International Conference on Knowledge Engineering and Ontology Development

0.25

0.3

0.35

0.4

0.45

0.5

0.55

0.6

0.6 0.8 1 1.2 1.4 1.6 1.8

F-score

Gamma value

Figure 2: F-score against γ values.

0.3

0.35

0.4

0.45

0.5

0.55

0.6

10 20 30 40 50 60 70 80 90

F-score

Theta value (%)

Figure 3: F-score against θ values.

larger than 1.3 because many of the clustering results

show that verbs are classiﬁed into more than one clus-

ters, while they are not polysemous verbs. This indi-

cates that the RB with an extremely large or small

value of γ is not effective for verb classiﬁcation.

Figure 3 also shows the impact against the θ val-

ues on the effectiveness for clustering verbs. As

shown in Figure 3, the larger value of θ affects over-

all performance. This is because the number of se-

lected pairs of verbs is few. As a result, pairs are not

chose, even if they are semantically related. The re-

sult supports the usefulness of Dice coefﬁcient, as the

best performance was obtained when the θ value was

around 20%, then the performance is decreased when

the value of θ is large.

6.3 Efﬁcacy against Unknown Words

It is interesting to note that how semi-supervised RB

algorithm affects the ratio of labeled data. Table 3

shows the results. In table 3, “Labeled data” shows a

rate of labeled data against the number of input data.

Each value shown in Table 3 denotes the average ac-

curacy, namely we randomly selected labeled data.

The process is repeated 10 times for each ratio. The

average accuracy is a ratio that the average number of

unknown words which is assigned correctly divided

by the total number of unknown words over 10 trials.

As can be seen clearly from Table 3, more labeled

verbs improves overall performance in both methods,

while the ratio equals to 5% and 10%, there is no sta-

tistical signiﬁcance between F-scores, as the result

was P-value ≥ .005. using micro sign test. Table

3 shows that the results obtained by semi-supervised

RB are always better than those of EM at all of the ra-

tios, especially precision. As we have shown in (Ko-

rhonen et al., 2003), each sense of the class is very

delicate. Therefore, these results show the effective-

ness of the method.

6.4 Error Analysis against Polysemy

We examined whether polysemous verbs were cor-

rectly classiﬁed into classes. Then, we manually an-

alyzed clustering results obtained by running semi-

supervised RB algorithm, with Dice coefﬁcient as

a similarity measure, which was the best quality F-

score against the polysemic gold standard. 13 out of

71 polysemies (unlabeled data) were perfect classiﬁ-

cation: each polysemous verb was correctly classiﬁed

into multiple clusters. For example, the polysemous

verb, “drop” was correctly classiﬁed into four clus-

ters. The words in italics denotes the majority sense,

which corresponds to the sense according to (Levin,

1993).

Putting {drop ﬁll}

Change of State {drop dry build}

Existence {drop hang hit}

Motion {drop walk travel}

Others were errors and classiﬁed into two patterns

shown in Table 4. In Table 4, “Pattern” and “#times”

refer to a type of an error and the numbers of errors,

respectively. “Example” indicates example verbs, and

“Target” sense(s)“ denotes sense(s) that polysemies

should be assigned. “Clustered sense(s)“ refers to

the sense(s) assigned by the system. “Partial” refers

to partially correct: some senses of a polysemous

verb were correctly identiﬁed, but others were not.

The ﬁrst example of this pattern is that “sit” has two

senses, “verbs of existence” and “verbs of putting”.

However, only one sense: “verbs of existence” was

identiﬁed correctly. The second example is that three

senses of the verb “remove” were correctly into the

classes, while it was classiﬁed incorrectly into the

class “verbs of change of possession”.

SEMANTIC CLASSIFICATION OF UNKNOWN WORDS BASED ON GRAPH-BASED SEMI-SUPERVISED

CLUSTERING

Table 3: Results against the ratio of labeled data.

Labeled

RB EM

data Sim θ γ C Prec Rec F C Prec Rec F

5% Dice 0.2 1.0 44 0.513 0.650 0.574 43 0.241 0.632 0.350

10% Dice 0.2 1.0 48 0.536 0.626 0.577 59 0.301 0.512 0.387

15% Dice 0.3 1.0 47 0.541 0.629 0.582 58 0.320 0.634 0.425

20% Dice 0.4 1.0 58 0.563 0.638 0.593 60 0.381 0.659 0.483

25% Dice 0.2 1.0 43 0.583 0.644 0.612 58 0.488 0.697 0.574

Table 4: Types of errors.

Pattern #times Example Target sense(s) Clustered sense(s)

Partial 23 sit verbs of existence, verbs of putting verbs of existence

31 remove verbs of removing, verbs of killing verbs of removing

verbs of sending and carrying verbs of killing

verbs of change of possession

verbs of sending and carrying

Poly → 1 4 hang verbs of existence, verbs of putting verbs of putting

verbs of killing

verbs involving the body

“poly → 1” of “Pattern” refers to polysemous verb

classiﬁed into only one cluster consisting of multi-

ple senses. “hang” was classiﬁed into one cluster.

However, the cluster consisted of four senses. There

was no error type that the target senses and clustered

senses did not completely match.

Error analysis against polysemies provided some

interesting insights for further improvement. First,

we should be able to obtain further advantages in ef-

ﬁciency and efﬁcacy of the method by using hierar-

chical splits in the clusters, as the number of clus-

ters obtained by semi-supervised RB algorithm was

smaller than the number of correct clusters. One so-

lution is to hierarchically apply RB algorithm, i.e., in

the hierarchical approach, the classiﬁcation problem

can be decomposed into a set of smaller problems cor-

responding to hierarchical splits in the tree (Navigli,

2008). Roughly speaking, one ﬁrst classiﬁes to distin-

guish among classes at the top level, then lower level

classiﬁcation is performed only within the appropri-

ate top level of the tree (Pereira et al., 1993). Each

of these sub-problems can be solved much more efﬁ-

ciently, and hopefully more accurately as well. This is

deﬁnitely worth trying with our method. Second, it is

important to use other types of features, such as selec-

tional preferences using semantic concepts from the-

saurus like WordNet (Schulte im Walde et al., 2008).

Third, we plan to apply the method to other the-

saurus such as WordNet semantic classes, and other

languages to evaluate the robustness of the method.

7 CONCLUSIONS

We have developed an approach for classifying un-

known verbs including polysemies into Levin-style

semantic classes. We proposed a graph-based semi-

supervised clustering method which employs a small

amount of labeled data to aid unsupervised learning.

Moreover, the method allows verbs to belong to mul-

tiple senses. The results using the data consisting 110

test verbs was better than the EM algorithm, as the

F-score obtained by the RB was 0.577 and that of the

EM was 0.387. Moreover, we found that unknown

words including polysemy actually aids verb classi-

ﬁcation, was robust across a wide variety of distri-

butional similarity measures. To examine the effects

of unknown words classiﬁcation, we applied semi-

supervised RB to the different ratio of labeled data

against the number of input data. The results showed

that RB is always better than the EM, even for a small

number of labeled verbs, while more labeled verbs

improves the overall performance in both methods.

Future work includes (i) extending the method to deal

with hierarchical splits in the clusters, (ii) incorpo-

rating other semantic concepts into the method, and

(iii) applying the method to other dictionaries and lan-

guages.

KEOD 2011 - International Conference on Knowledge Engineering and Ontology Development

ACKNOWLEDGEMENTS

The authors would like to thank the referees for their

comments on the earlier version of this paper. This

work was partially supported by the Telecommunica-

tions Advancement Foundation.

REFERENCES

Bar-Hillel, A., Hertz, T., Shental, N., and Weinshall, D.

(2003). Learning Distance Functions using Equiva-

lence Relations. In Proc. of the 20th International

Conference on Machine Learning, pages 11–18.

Bilenko, M., Basu, S., and Mooney, R. J. (2004). Integrating

Constraints and Metric Learning in Semi-Supervised

Clustering. In Proc. of the 21th International Confer-

ence on Machine Learning, pages 81–88.

Bouraev, B., Briscoe, E. J., Carroll, J., Carter, D.,

and Grover, C. (1987). The Derivation of a

Grammatically-Indexed Lexicon from the Longman

Dictionary of Contemporary English. In Proc. of the

25th Annual Meeting of the Association for Computa-

tional Linguistics, pages 193–200.

Brew, C. and Walde, S. S. (2002). Spectral Clustering

for German Verbs. In Proc. of 2002 Conference on

Empirical Methods in Natural Language Processing,

pages 117–123.

Briscoe, E. J. and Carroll, J. (1997). Automatic Extraction

of Subcategorization from Corpora. In Proc. of 5th

ACL Conference on Applied Natural Language Pro-

cessing, pages 356–363.

Briscoe, E. J. and Carroll, J. (2002). Robust Accurate Sta-

tistical Annotaion of General Text. In Proc. of 3rd

International Conference on Language Resources and

Evaluation, pages 1499–1504.

Dagan, I., Lee, L., and Pereira, F. C. N. (1999). Similarity-

based Models of Word Cooccurrence Probabilities.

Machine Learning, 34(1-3):43–69.

Grishman, R., Macleod, C., and Meyers, A. (1994). Com-

plex Syntax: Building a Computational Lexicon. In

Proc. of International Conference on Computational

Linguistics, pages 268–272.

Hindle, D. (1990). Noun Classiﬁcation from Predicate-

Argument Structures. In Proc. of 28th Annual Meet-

ing of the Association for Computational Linguistics,

pages 268–275.

Hughes, J. (1994). Automatically Acquiring Classiﬁcation

of Words. Ph.D. thesis University of Leeds.

Kermanidis, K., Maragoudakis, M., Fakotakis, N., and

Kokkinakis, G. K. (2008). Learning Verb Com-

plements for Modern Greek: Balancing the Noisy

Dataset. Natural Language Engineering, 14(1):71–

100.

Kirkpatrick, S., Jr., C. D. G., and Vecchi, M. P. (1983).

Optimization by Simulated Annealing. Science,

220(4598):671–680.

Korhonen, A. (2002). Subcategorization Acquisition. Ph.D.

thesis University of Cambridge.

Korhonen, A., Krymolowski, Y., and Briscoe, T. (2006).

A Large Subcategorization Lexicon for Natural Lan-

guage Processing Applications. In Proc. of the 5th

International Conference on Language Resources and

Evaluation.

Korhonen, A., Krymolowski, Y., and Marx, Z. (2003). Clus-

tering Polysemic Subcategorization Frame Distribu-

tions Semantically. In Proc. of the 41st Annual Meet-

ing of the Association for Computational Linguistics,

pages 64–71.

Lee, L. (1999). Measures of Distributional Similarity. In

Proc. of the 37th Annual Meeting of the Association

for Computational Linguistics, pages 25–32.

Leech, G. (1992). 100 Million Words of English:

The British National Corpus. Language Research,

28(1):1–13.

Levin, B. (1993.). English Verb Classes and Alternations.

Chicago University Press.

Lin, D. (1998). Automatic Retrieval and Clustering of Sim-

ilar Words. In Proc. of 36th Annual Meeting of the

Association for Computational Linguistics and 17th

International Conference on Computational Linguis-

tics, pages 768–773.

Matsuo, Y., Sakaki, T., Uchiyama, K., and Ishizuka, M.

(2006). Graph-based Word Clustering using a Web

Search Engine. In Proc. of 2006 Conference on

Empirical Methods in Natural Language Processing

(EMNLP2006), pages 542–550.

Navigli, R. (2008). A Structural Approach to the Automatic

Adjudication of Word Sense Disagreements. Natural

Language Engineering, 14(4):547–573.

Navigli, R. (2009). Word Sense Disambiguation: A Survey.

ACM Computing Surveys, 41(2):1–69.

Ng, A. Y., Jordan, M. I., and Weiss, Y. (2002.). On Spectral

Clustering: Analysis and an Algorithm. MIT Press.

Pereira, F., Tishby, N., and Lee, L. (1993). Distributional

Clustering of English Words. In Proc. of the 31st

Annual Meeting of the Association for Computational

Linguistics, pages 183–190.

Reichardt, J. and Bornholdt, S. (2004). Detecting Fuzzy

Community Structure in Complex Networks with a

Potts Model. PHYSICAL REVIEW LETTERS, 93(21).

Reichardt, J. and Bornholdt, S. (2006). Statistical Mechan-

ics of Community Detection. PHYSICAL REVIEW E,

74.

Reiter, E. and Dale, R. (2000.). Building Natural Language

Generation Systems. Cambridge University Press.

Rooth, M. (1998). Two-Dimensional Clusters in Grammat-

ical Relations. In Inducing Lexicons with the EM Al-

gorithm, AIMS Report, 4(3).

Rooth, M., Riezler, S., Prescher, D., Carroll, G., and Beil,

F. (1999). Inducing a Semantically Annotated Lex-

icon via EM-Based Clustering. In Proc. of the 37th

Annual Meeting of the Association for Computational

Linguistics.

Schulte im Walde, S. (2000). Clustering Verbs Seman-

tically according to their Alternation Behaviour. In

Proc. of the 18th International Conference on Com-

putational Linguistics, pages 747–753.

SEMANTIC CLASSIFICATION OF UNKNOWN WORDS BASED ON GRAPH-BASED SEMI-SUPERVISED

CLUSTERING

Schulte im Walde, S., Hying, C., Scheible, C., and Schmid,

H. (2008). Combining EM Training and the MDL

Principle for an Automatic Verb Classiﬁcation Incor-

porating Selectional Preferences. In Proc. of the 46th

Annual Meeting of the Association for Computational

Linguistics, pages 496–504.

Stevenson, S. and Joanis, E. (2003). Semi-Supervised Verb-

Class Discovery using Noisy Features. In Proc. of

the 7th Conference on Natural Language Learning at

HLT-NAACL 2003, pages 71–78.

Wagstaff, K., Cardie, C., Rogers, S., and Schroedl, S.

(2001). Constrained K-Means Clustering with Back-

ground Knowledge. In Proc. of 18th International

Conference on Machine Learning, pages 577–584.

Widdows, D. and Dorow, B. (2002). A Graph Model for

Unsupervised Lexical Acquisition. In Proc. of 19th

International conference on Computational Linguis-

tics (COLING2002), pages 1093–1099.

Witten, I. H. and Bell, T. C. (1991). The Zero-Frequency

Problem: Estimating the Probabilities of Novel Events

in Adaptive Text Compression. IEEE Transactions on

Information Theory, 37(4):1085–1094.

KEOD 2011 - International Conference on Knowledge Engineering and Ontology Development