CALCULATING SEMANTIC SIMILARITY BETWEEN FACTS
Sergey Afonin and Denis Golomazov
Institute of Mechanics, Moscow State University, Michurinskij av. 1, Moscow, Russia
Keywords:
Semantic similarity, Facts, Events.
Abstract:
The present paper is devoted to the calculation of semantic similarity between facts or events. A fact is considered as a single natural language sentence consisting of three parts: "what happened", "where", and "when". Possible types of mismatches between facts are discussed and a function calculating the semantic similarity is proposed. Very preliminary experimental results are presented.
1 INTRODUCTION
People use search engines to find information on any
subject. Sometimes they search for facts. A fact is
something that has happened in a certain place at a
certain time. Why do people need information on facts? If they just want to check the news, they use news portals, not search engines. That is true, but every once in a while people want to find additional information about a fact. For example, they may have heard about some fact from a friend and want to read more about it. Or a journalist creating a dossier on a person may want to check whether certain rumours are actually facts. Moreover, fact search provides a basis for the task of revealing relations between facts and for other data mining problems.
The problem is that fact search is a type of search
that search engines currently cannot always handle
properly. Let us consider some local news, for example, a bank robbery in Livermore, California in July 2008. A user wants to find some additional information on the fact. Suppose they forgot the name of the city and ask Google with the query California bank robbery in July 2008. No relevant results appear on the first page. The same happens if we ask Google with the query Livermore bank burglary in July 2008. This is probably because the fact is of minor importance and the portal on which the news was posted has a low page rank. The search engine ranks the page describing the event relatively low since there was no exact match, and there were a lot of robberies in California during that period. Another problem of modern search engines is their inability to perform almost any analytics. For example, we can get neither a list of all events in some city for some period, nor a list of cities in which bank robberies occurred in July 2008, nor a list of dates on which robberies in California took place.
Fact search appears to be a more complicated task
than ordinary keyword search. The main reason is that
a fact can be represented in many ways, using synonyms and abbreviations, with some keywords included or omitted. The second reason is that a fact may have to be inferred from information distributed over several sentences or even documents.
Let us discuss a virtual system that performs fact
search. It operates as follows. First, it crawls the web
and extracts facts from web pages. Second, it lets
a user enter a sentence describing a fact and returns
semantically similar facts from the database. During
this process it somehow calculates the trust rate of the
fact, i.e. how likely the fact is to be true. To construct
such a system, we divide the fact search task into four
steps.
1. Extraction of facts from a large number of natural language texts.
2. Calculation of semantic similarity between facts.
3. Calculation of the trust rate of a fact.
4. Efficient search for similar facts over a large database; for example, this can include the development of index structures for facts.
In this paper we focus on the task of calculating
semantic similarity between facts. We believe this
task to be the cornerstone of the fact search problem.
For example, the similar facts search task can easily, though not efficiently, be solved with the help of a semantic similarity function and linear search. Having a semantic similarity function for facts, one can
apply it for such common tasks as fact classification
and clustering.
The problem of semantic similarity calculation is
a wide one. Similarity can be calculated between single words (Bollegala et al., 2009), groups of words (Varelas et al., 2005), or sentences and texts (Islam and Inkpen, 2008; Li et al., 2006). In this paper we consider sentences of a special kind, namely sentences that describe facts. We have not found any papers devoted to this particular problem, though some of the general algorithms mentioned above can be applied, and there exist works on event description detection and classification, e.g. (Naughton et al., 2010).
2 SEMANTIC SIMILARITY
BETWEEN FACTS
We consider facts consisting of three parts: what happened, where, and when, so a fact F is a triple F = (what, where, when). Our goal is a function S(F_1, F_2) that calculates the semantic similarity between facts F_1 and F_2. The function S takes values between 0 and 1, and a higher value means higher similarity. In this section we discuss properties that such a function should satisfy.
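To make the triple representation concrete, here is a minimal sketch of a fact as a data structure; the class and field names are our illustration only and are not prescribed by the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Fact:
    """A fact as a triple of raw strings: what happened, where, and when."""
    what: str   # e.g. "robbery"
    where: str  # e.g. "Livermore"
    when: str   # e.g. "28 July 2008"

# Two of the test facts used later in the paper:
f1 = Fact(what="robbery", where="Livermore", when="28 July 2008")
f2 = Fact(what="burglary", where="California", when="July 2008")
```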
First of all, let us note that two facts should be treated as similar if all their components are pairwise similar. It seems unlikely that there exists a "universal" semantic similarity function suitable for all three parts, so three separate functions s_t, s_r, and s_n, measuring the semantic similarity of the what, where, and when parts, respectively, should be defined. The fact similarity function S should combine the values of these three functions. Note that the functions s_t, s_r, and s_n should generally use all components of the compared facts.
As we assume that one of the facts, say F_1, is a user's query and that this query describes the same fact as F_2, but using different lexical means, we classify possible reasons why fact descriptions may mismatch.
Synonymy, Acronyms, Abbreviations etc. Two
facts may be described differently due to synonymy,
acronyms, abbreviations, or slang. For example, A
theft in X bank describes the same fact as A larceny
in X bank, and armed robbery may be replaced by the slang word blagging.
Underspecification. Quite often descriptions of geographic objects contain specifications like small town X that can be omitted in a query. It seems that in most cases descriptive words like large or small may simply be dropped without losing any information about the fact itself. Nevertheless, in some cases, such as small city Moscow, a descriptive word may be used to distinguish the described object from some other (well-known) object.
Vertical Taxonomy Relations. By a vertical taxonomy relation we mean the hyponym-hypernym relation. A user formulating a query may have only fuzzy knowledge about the fact they are looking for. For example, they may not be aware of the type of crime that happened, or of the date or place. If a query asks for a robbery in a small California town, the fact describing a robbery in Livermore, CA should be considered relevant. Similarly, a hypernym relation may exist between the what-parts of two facts, e.g. burglary → larceny → crime, and between the when-parts, e.g. June → summer. We expect this type of mismatch to be one of the most frequent in real applications.
Horizontal Taxonomy Relations. This type corresponds to the case when a user provides information on the same level of abstraction, but it does not match the fact precisely. Two facts refer to different concepts (e.g. robbery and burglary), but these concepts share a common hypernym (crime). Similarly, toponyms like Livermore and Hartford may be considered similar if they have similar descriptions (in this example both names correspond to small towns in California). Clearly, not every pair of words having the same hypernym is similar. For example, both murder and theft are crimes, but the facts A murder in X's office and A theft in X's office are not similar. This means that some additional constraints should be applied. For example, both words may be required to be similar in the sense of the next type.
General Similarity. Both types of taxonomy rela-
tions described above are special cases of semantic
similarity between terms. We have separated them
into specific classes because, if such relations exist, one can expect a strong semantic relation between the corresponding facts. Nevertheless, facts may be similar even if no taxonomic relation is present. For instance, one can expect that facts about a robbery of a bank and a shooting in a bank are similar.
3 EVALUATION
We calculate the semantic similarity between two facts F_1 = (what_1, where_1, when_1) and F_2 = (what_2, where_2, when_2) using the following
formula:
S(F_1, F_2) = min{ s_t(t_1, t_2), s_r(r_1, r_2), s_n(n_1, n_2) }
This function simply returns the worst mismatch of the what-, where-, and when-parts of the two facts. In the sequel we briefly describe the functions s_t, s_r, and s_n.
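As a rough illustration of this combination rule, the sketch below takes the minimum of three component scores, assuming a Fact triple as above and the component functions similarity_what, similarity_where, and similarity_when sketched in the following subsections (all names are ours, and the string parsing a real system would need is glossed over).

```python
def fact_similarity(f1: Fact, f2: Fact) -> float:
    """S(F1, F2): the worst (minimum) of the what-, where-, and when-part similarities."""
    return min(
        similarity_what(f1.what, f2.what),     # s_t
        similarity_where(f1.where, f2.where),  # s_r
        similarity_when(f1.when, f2.when),     # s_n
    )
```

Taking the minimum means a single badly mismatched part (e.g. dates far apart) is enough to make two facts dissimilar, which matches the intuition that all components must be pairwise similar.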
The What Part. Possible types of semantic relations between terms in the what part of an event description can be divided into three classes: vertical taxonomy relations (e.g. Livermore → small town, robbery → crime), horizontal taxonomy relations (terms that have a common direct hypernym), and other semantic relations. We calculate the semantic similarity between two terms what_1 and what_2 (denoted t_1 and t_2 below) using the following formula.
s_t(t_1, t_2) = max{ s_vert(t_1, t_2) × C_1, s_horiz(t_1, t_2) × C_2, s_stat(t_1, t_2) × C_3 },
where the function s_vert estimates whether one of the terms what_1, what_2 is a hypernym of the other, the function s_horiz estimates whether the terms what_1 and what_2 have a common hypernym (i.e. whether they are in a horizontal taxonomy relation), and the function s_stat calculates statistical (corpus-based) similarity between the terms. C_1, C_2, and C_3 are weight coefficients that implement the idea that the vertical taxonomy relation is more important than the horizontal taxonomy relation, which in turn is more important than the "default" semantic similarity calculated statistically. In our experiments we took C_1 = 1, C_2 = 0.8, C_3 = 0.8.
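A minimal sketch of this weighted maximum, assuming helper functions s_vert, s_horiz, and s_stat (each returning a score in [0, 1]) are available; the helpers are only stubs for the techniques described below.

```python
C1, C2, C3 = 1.0, 0.8, 0.8  # weights used in the experiments above

def similarity_what(t1: str, t2: str) -> float:
    """s_t: best of the weighted vertical, horizontal, and statistical similarities."""
    return max(
        s_vert(t1, t2) * C1,   # estimate that one term is a hypernym of the other
        s_horiz(t1, t2) * C2,  # estimate that the terms share a common hypernym
        s_stat(t1, t2) * C3,   # corpus-based ("default") similarity
    )
```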
For hypernym estimation we use a lexical pattern approach, similar to (Bollegala et al., 2009). The statistical similarity function s_stat calculates the Normalized Google Distance (Cilibrasi and Vitanyi, 2007) by means of the Yahoo BOSS API (http://developer.yahoo.com/search/boss). We calculate the estimate that the terms what_1 and what_2 have a common hypernym using the following formula.
s_horiz = max_{h_1 ∈ H_1, h_2 ∈ H_2} s_stat(h_1, h_2),    (1)
where H_1 and H_2 are sets of possible hypernyms of what_1 and what_2, respectively. The sets H_1 and H_2 are constructed by means of the lexical pattern approach.
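The corpus-based part could look roughly as follows. The hits() and candidate_hypernyms() functions are placeholders for a search-engine hit-count lookup (the paper uses the Yahoo BOSS API) and for the lexical-pattern hypernym extraction, respectively, and turning the Normalized Google Distance into a similarity by clipping 1 − NGD to [0, 1] is merely one simple choice.

```python
import math

def hits(query: str) -> int:
    """Placeholder: number of web search results for the query."""
    raise NotImplementedError

def candidate_hypernyms(term: str) -> list[str]:
    """Placeholder: hypernym candidates extracted with lexical patterns."""
    raise NotImplementedError

def s_stat(t1: str, t2: str, total_pages: float = 1e10) -> float:
    """Corpus-based similarity derived from the Normalized Google Distance."""
    f1, f2, f12 = hits(t1), hits(t2), hits(f'"{t1}" "{t2}"')
    if min(f1, f2, f12) == 0:
        return 0.0
    ngd = (max(math.log(f1), math.log(f2)) - math.log(f12)) / \
          (math.log(total_pages) - min(math.log(f1), math.log(f2)))
    return max(0.0, 1.0 - ngd)  # smaller distance -> higher similarity

def s_horiz(t1: str, t2: str) -> float:
    """Formula (1): best statistical similarity over pairs of candidate hypernyms."""
    pairs = ((h1, h2) for h1 in candidate_hypernyms(t1) for h2 in candidate_hypernyms(t2))
    return max((s_stat(h1, h2) for h1, h2 in pairs), default=0.0)
```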
The Where and When Parts. The specificity of the where part of a fact is that it usually contains geographical labels that can be mapped to latitude/longitude coordinates. Let w_1, w_2 be the strings representing the where parts of two facts F_1, F_2. The semantic similarity between w_1 and w_2 is calculated using the following formula.
s(w_1, w_2) = 1 − min_{g_1 ∈ G(w_1), g_2 ∈ G(w_2)} dist(g_1, g_2) / MAX_DIST,
where G(w) is the set of all geographical objects matching the string w, dist(g_1, g_2) is the great-circle distance between geographical objects g_1 and g_2, and MAX_DIST is the maximum distance between two points on the Earth's surface, which is about 20018 km.
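A sketch of the where-part score, assuming a hypothetical geocode() gazetteer lookup; the haversine formula stands in for the great-circle distance.

```python
import math

MAX_DIST_KM = 20018.0  # maximum distance between two points on the Earth's surface

def geocode(place: str) -> list[tuple[float, float]]:
    """Placeholder: all (lat, lon) candidates matching a place string, from a gazetteer."""
    raise NotImplementedError

def great_circle_km(a: tuple[float, float], b: tuple[float, float]) -> float:
    """Great-circle (haversine) distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def similarity_where(w1: str, w2: str) -> float:
    """s_r: 1 minus the smallest normalized distance over all matching geo objects."""
    g1s, g2s = geocode(w1), geocode(w2)
    if not g1s or not g2s:
        return 0.0
    best = min(great_circle_km(g1, g2) for g1 in g1s for g2 in g2s)
    return 1.0 - best / MAX_DIST_KM
```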
To calculate semantic similarity between two strings w_1, w_2 representing the when parts of the facts F_1, F_2, we use the following idea. We map the strings into dates and then calculate the relative time interval between the dates, applying some normalization. We use the following formula.
s(w_1, w_2) = 1 − d(w_1, w_2) / ( d(w_1, w_2) + min{ d(w_1, D), d(w_2, D) } ),
where d(w_1, w_2) is the time interval (in seconds) between the two dates matching the strings w_1 and w_2, and D is the date representing the current moment. Normalization by D is used to implement the idea that a one-year interval 1000 years ago should be considered less important than the same interval nowadays, e.g. between January 1, 2009 and January 1, 2010.
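Finally, a sketch of the when-part score, assuming a parse_date() placeholder that maps a free-form date string to a single datetime (real date expressions such as "July 2008" denote intervals, which is not modelled here).

```python
from datetime import datetime

def parse_date(s: str) -> datetime:
    """Placeholder: map a free-form 'when' string to a datetime."""
    raise NotImplementedError

def similarity_when(w1: str, w2: str, now: datetime | None = None) -> float:
    """s_n = 1 - d(w1, w2) / (d(w1, w2) + min(d(w1, D), d(w2, D)))."""
    now = now or datetime.utcnow()
    d1, d2 = parse_date(w1), parse_date(w2)
    gap = abs((d1 - d2).total_seconds())          # d(w1, w2)
    if gap == 0:
        return 1.0
    age = min(abs((now - d1).total_seconds()),    # d(w1, D)
              abs((now - d2).total_seconds()))    # d(w2, D)
    return 1.0 - gap / (gap + age)
```

With this normalization, a one-year gap between 2009 and 2010 yields a much lower score than a one-year gap a millennium ago, as intended.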
Experimental Results. To evaluate the proposed function, we ran the following experiment. We manually extracted the following facts from the news, each given as (what; where; when).
1. robbery; Livermore; 28 July 2008.
2. burglary; California; July 2008.
3. deposit; Fremont; November 2, 2007.
4. anniversary; small town in California; summer
2007.
5. shootout; California; January 3, 1997.
6. crime; Hartford; August 27, 2007.
7. kill; West Yorkshire, England; February 21, 2010.
8. wine country festival; Livermore; 2008.
9. traffic; on main street in Pleasanton; Tuesday
August 13, 2008.
10. armed robbery; 901 S. Main St. in Hartford, KY;
On Friday July 13, 2007 at approximately 11:15 A.M.
The corresponding similarity matrix for the what, where, and when parts of the facts is presented in Table 1.
Table 1: Similarity between the what/where/when-parts of the test facts.
# 1 2 3 4 5 6 7 8 9 10
1 1/1/1 .5/1/1 .1/1/.7 0/1/.6 .2/1/.1 .9/.8/.7 0/.6/.2 0/1/.9 0/1/1 .7/.8/.6
2 .5/1/1 1/1/1 0/1/.7 0/1/.6 0/1/.1 .9/.8/.7 0/.6/.2 0/1/.9 .1/1/1 .4/.9/.6
3 .1/1/.7 0/1/.7 1/1/1 0/1/.9 0/1/.2 0/.8/.9 0/.6/.1 0/1/.7 0/1/.7 0/.8/.9
4 0/1/.6 0/1/.6 0/1/.9 1/1/1 0/1/.2 0/.8/.9 0/.6/.1 .2/1/.7 0/1/.6 0/.9/1
5 .2/1/.1 0/1/.1 0/1/.2 0/1/.2 1/1/1 0/.8/.2 .1/.6/0 .1/1/.1 .1/1/.1 .3/.9/.2
6 .9/.8/.7 .9/.8/.7 0/.8/.9 0/.8/.9 0/.8/.2 1/1/1 .9/.7/.1 0/.8/.7 .8/.8/.6 .9/.9/.9
7 0/.6/.2 0/.6/.2 0/.6/.1 0/.6/.1 .1/.6/0 .9/.7/.1 1/1/1 0/.6/.2 0/.6/.2 .1/.7/.1
8 0/1/.9 0/1/.9 0/1/.7 .2/1/.7 .1/1/.1 0/.8/.7 0/.6/.2 1/1/1 .1/1/.9 0/.8/.7
9 0/1/1 .1/1/1 0/1/.7 0/1/.6 .1/1/.1 .8/.8/.6 0/.6/.2 .1/1/.9 1/1/1 0/.8/.6
10 .7/.8/.6 .4/.9/.6 0/.8/.9 0/.9/1 .3/.9/.2 .9/.9/.9 .1/.7/.1 0/.8/.7 0/.8/.6 1/1/1

One can see that for very short descriptions the results are meaningful. For example, the most relevant neighbours of the term shootout are robbery and armed robbery, and the closest term to anniversary is festival.
4 CONCLUSIONS AND FUTURE
WORK
In this paper we have described the task of fact search
and proposed a function for calculating semantic similarity between facts that are represented by single sentences and consist of three parts (what, where, and when). Some preliminary experimental results supporting the proposed function are provided.
One direction of future work is to apply methods from ontology theory. For instance, one can further detail the what part of a fact, e.g. by applying the subject-predicate-object ("who-did-what") model of knowledge representation, which is extensively used in ontologies.
A function comparing the where parts could distinguish geographical names from abstract descriptions, e.g. in an American school or in a small town near the West Coast, and compare them in some suitable way. The complex task of disambiguating geographical names that refer to several places (e.g. there are at least five cities named Moscow) can also be approached.
Finally, when comparing the parts of facts we did not take the context, i.e. the other parts, into account. For example, if the what parts of two facts are about politics, we could compare the where parts in some special way.
REFERENCES
Bollegala, D., Matsuo, Y., and Ishizuka, M. (2009). A re-
lational model of semantic similarity between words
using automatically extracted lexical pattern clusters
from the web. In EMNLP ’09, pages 803–812.
Cilibrasi, R. L. and Vitanyi, P. M. B. (2007). The Google
similarity distance. IEEE Transactions on Knowledge
and Data Engineering, 19(3):370–383.
Islam, A. and Inkpen, D. (2008). Semantic text similarity
using corpus-based word similarity and string similar-
ity. ACM Trans. Knowl. Discov. Data, 2(2):1–25.
Li, Y., McLean, D., Bandar, Z. A., O’Shea, J. D., and
Crockett, K. (2006). Sentence similarity based on
semantic nets and corpus statistics. IEEE Trans. on
Knowl. and Data Eng., 18(8):1138–1150.
Naughton, M., Stokes, N., and Carthy, J. (2010). Sentence-
level event classification in unstructured texts. Infor-
mation Retrieval, 13(2):132–156.
Varelas, G., Voutsakis, E., Raftopoulou, P., Petrakis, E. G.,
and Milios, E. E. (2005). Semantic similarity meth-
ods in wordnet and their application to information
retrieval on the web. In WIDM ’05, pages 10–16.