apply it for such common tasks as fact classification
and clustering.
The problem of calculating semantic similarity is
a broad one. The objects being compared can be
single words (Bollegala et al., 2009), groups of
words (Varelas et al., 2005), or sentences and
texts (Islam and Inkpen, 2008; Li et al., 2006). In
this paper we consider sentences of a special kind,
namely those that describe facts. We have not found
any papers devoted to this particular problem,
though the general algorithms mentioned above can
be applied, and there is related work on event
description detection and classification, e.g.
(Naughton et al., 2010).
2 SEMANTIC SIMILARITY
BETWEEN FACTS
We consider facts consisting of three parts: what
happened, where, and when, so a fact $F$ is a triple
$F = (\mathit{what}, \mathit{where}, \mathit{when})$.
Our goal is a function $S(F_1, F_2)$ that calculates
the semantic similarity between facts $F_1$ and
$F_2$. The function $S$ takes values between 0 and 1,
and a higher value means higher similarity. In this
section we discuss the properties that such a
function should satisfy.
First of all, let us note that two facts should be
treated as similar if all their components are
pairwise similar. It seems unlikely that there
exists a “universal” semantic similarity function
suitable for all three parts, so three separate
functions $s_t$, $s_r$, and $s_n$, measuring the
semantic similarity of the what, where, and when
parts, respectively, should be defined. The fact
similarity function $S$ should use the values of
these three functions. Note that the functions
$s_t$, $s_r$, and $s_n$ should generally use all
components of the compared facts.
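
The structure described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Fact` type, the placeholder `exact_match`, and in particular the use of the minimum to combine the three component scores are choices made here, not definitions taken from the paper. The minimum is used simply because it is high only when all three components are pairwise similar.

```python
from typing import Callable, NamedTuple


class Fact(NamedTuple):
    """A fact as a (what, where, when) triple."""
    what: str
    where: str
    when: str


# A component function receives both complete facts, because scoring one
# part may need the other parts as context.
ComponentSimilarity = Callable[[Fact, Fact], float]


def exact_match(part: str) -> ComponentSimilarity:
    """Hypothetical placeholder: 1.0 if the named parts are equal, else 0.0."""
    def similarity(f1: Fact, f2: Fact) -> float:
        a = getattr(f1, part).strip().lower()
        b = getattr(f2, part).strip().lower()
        return 1.0 if a == b else 0.0
    return similarity


def fact_similarity(
    f1: Fact,
    f2: Fact,
    s_what: ComponentSimilarity,
    s_where: ComponentSimilarity,
    s_when: ComponentSimilarity,
) -> float:
    """Combine three component scores into one value in [0, 1].

    Assumption: the minimum is used as the combination rule.
    """
    return min(s_what(f1, f2), s_where(f1, f2), s_when(f1, f2))


query = Fact("robbery", "Livermore", "June")
stored = Fact("Robbery", "livermore", "June")
score = fact_similarity(
    query, stored,
    exact_match("what"), exact_match("where"), exact_match("when"),
)
```

Any other combination rule that stays within [0, 1], such as a product or a weighted mean, could be substituted for `min` without changing the interface.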
As we assume that one of the facts, say $F_1$, is a
user’s query and that this query describes the same
fact as $F_2$ but by different lexical means, we
classify the possible reasons for mismatches between
fact descriptions.
Synonymy, Acronyms, Abbreviations etc. Two
facts may be described differently due to synonymy,
acronyms, abbreviations, or slang. For example, A
theft in X bank describes the same fact as A larceny
in X bank, and armed robbery may be replaced by a
slang word blagging.
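
One simple way to handle this kind of mismatch is to map each term to a canonical form before comparing. The sketch below assumes a small hand-made lookup table; the table contents and the names `CANONICAL`, `normalize_term`, and `same_term` are hypothetical and stand in for whatever thesaurus or lexical resource is actually available.

```python
# Hypothetical lookup table: variant term -> canonical term.
CANONICAL = {
    "larceny": "theft",
    "blagging": "armed robbery",
}


def normalize_term(term: str) -> str:
    """Lower-case, collapse whitespace, then map to a canonical form."""
    key = " ".join(term.lower().split())
    return CANONICAL.get(key, key)


def same_term(a: str, b: str) -> bool:
    """True if both terms normalize to the same canonical form."""
    return normalize_term(a) == normalize_term(b)
```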
Underspecification. Quite often descriptions of
geographic objects contain specifications like small
town X that can be omitted in a query. It seems that
in most cases descriptive words like large or small
may simply be dropped without losing any information
about the fact itself. Nevertheless, in some cases,
such as small city Moscow, a descriptive word may be
used to distinguish the object in question from some
other (well-known) object.
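
A minimal sketch of this idea, assuming a fixed list of descriptive words and a set of names for which the descriptor must be kept; the names `DESCRIPTIVE_WORDS`, `drop_descriptors`, and `protected_names` are hypothetical and not part of the method described here.

```python
# Assumed list of descriptive words that usually carry no information
# about the fact itself.
DESCRIPTIVE_WORDS = frozenset({"small", "large"})


def drop_descriptors(place: str, protected_names=frozenset()) -> str:
    """Remove descriptive words unless the place mentions a protected name.

    protected_names holds lower-cased names for which the descriptor
    distinguishes the object from a better-known namesake.
    """
    tokens = place.split()
    if any(token.lower() in protected_names for token in tokens):
        return " ".join(tokens)
    return " ".join(t for t in tokens if t.lower() not in DESCRIPTIVE_WORDS)
```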
Vertical Taxonomy Relations. By a vertical taxonomy
relation we mean a hyponym-hypernym relation.
A user formulating the query may have only a fuzzy
knowledge of the fact they are looking for. For
example, they may not be aware of the type of crime
that happened, or the date, or the place. If a query
asks for a robbery in a small California town, the
fact describing a robbery in Livermore, CA should be
considered relevant. Similarly, a hypernym relation
may exist between the what-parts of two facts, e.g.
burglary – larceny – crime, and between the
when-parts, e.g. June – summer.
We expect this type of mismatch to be one of
the most frequent in real applications.
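
A vertical relation can be tested by walking up a taxonomy. The sketch below assumes a toy child-to-parent table built from the examples in the text; the table and the names `PARENT`, `hypernyms`, and `is_vertically_related` are hypothetical, and a real system would use a proper lexical or geographic resource instead.

```python
# Toy taxonomy (assumption): each term maps to its direct hypernym.
PARENT = {
    "burglary": "larceny",
    "larceny": "crime",
    "june": "summer",
}


def hypernyms(term: str) -> list:
    """Return the ancestors of term, nearest first."""
    chain = []
    current = term.lower()
    seen = {current}
    while current in PARENT:
        current = PARENT[current]
        if current in seen:  # guard against cycles in the table
            break
        seen.add(current)
        chain.append(current)
    return chain


def is_vertically_related(a: str, b: str) -> bool:
    """True if one of two distinct terms is a hypernym of the other."""
    a, b = a.lower(), b.lower()
    return a in hypernyms(b) or b in hypernyms(a)
```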
Horizontal Taxonomy Relations. This type corresponds
to the case where the user provides information at
the same level of abstraction as the fact, but it
does not match the fact precisely. The two facts
refer to different concepts (e.g. robbery –
burglary), but these concepts share a common
hypernym (crime). Similarly, toponyms like Livermore
and Hartford may be considered similar if they have
similar descriptions (in this example both names
correspond to small towns in California). Clearly,
not every pair of words with the same hypernym is
similar. For example, both murder and stealing are
crimes, but the facts A murder in an X’s office and
A stealing in an X’s office are not similar. This
means that some additional constraints should be
applied. For example, both words may be required to
be similar in the sense of the next type.
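
The two-step check described above (a shared hypernym, plus an additional similarity constraint) might look as follows. Everything here is an assumption made for illustration: the toy taxonomy, the similarity scores in `TOY_SCORES`, the threshold of 0.5, and all function names.

```python
# Toy taxonomy (assumption): each term maps to its direct hypernym.
PARENT = {
    "robbery": "crime",
    "burglary": "crime",
    "murder": "crime",
    "stealing": "crime",
}

# Made-up general-similarity scores for unordered term pairs.
TOY_SCORES = {
    frozenset({"robbery", "burglary"}): 0.8,
    frozenset({"murder", "stealing"}): 0.1,
}


def ancestors(term: str) -> set:
    """Return all hypernyms of term found by walking up PARENT."""
    found = set()
    current = term.lower()
    while current in PARENT and PARENT[current] not in found:
        current = PARENT[current]
        found.add(current)
    return found


def share_hypernym(a: str, b: str) -> bool:
    """True if a and b are distinct, not vertically related, and have a
    common ancestor."""
    a, b = a.lower(), b.lower()
    anc_a, anc_b = ancestors(a), ancestors(b)
    if a == b or a in anc_b or b in anc_a:
        return False
    return bool(anc_a & anc_b)


def toy_similarity(a: str, b: str) -> float:
    """Look up a made-up general-similarity score; 0.0 if unknown."""
    return TOY_SCORES.get(frozenset({a.lower(), b.lower()}), 0.0)


def horizontally_similar(a, b, general_similarity=toy_similarity,
                         threshold=0.5) -> bool:
    """Shared hypernym AND sufficient general similarity."""
    return share_hypernym(a, b) and general_similarity(a, b) >= threshold
```

With these toy values, robbery and burglary pass both checks, while murder and stealing share a hypernym but are rejected by the additional constraint.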
General Similarity. Both types of taxonomy relations
described above are special cases of semantic
similarity between terms. We have separated them
into specific classes because, if such relations
exist, one can expect a strong semantic relation
between the corresponding facts. Nevertheless, facts
may be similar even if no taxonomic relations are
present. For instance, one can expect that facts
about a robbery of a bank and a shooting in a bank
are similar.
3 EVALUATION
We calculate the semantic similarity between two
facts $F_1 = (\mathit{what}_1, \mathit{where}_1, \mathit{when}_1)$
and $F_2 = (\mathit{what}_2, \mathit{where}_2, \mathit{when}_2)$
using the following