if the same restaurant and POI entities are matched up
so that any update or change for an entity from one ta-
ble can be automatically transferred to another.
Another use case that can benefit from matching
restaurants and POIs is deduplicated search for the
same place entity across both tables. A user searching
for a specific restaurant will trigger a search in
both the restaurant and the POI tables. If this
restaurant has a matched POI, and this matching relation
has been identified, one of the two searches can be
avoided, returning the search results faster
and avoiding duplicate results for users.
Motivated by the above use cases, we attempted to
conduct entity matching between restaurant and POI
tables. The whole process includes several subtasks,
such as data exploration, preprocessing, distance
metric calculation, labelling, and supervised
learning. These are elaborated in the following
sections.
2 PRIOR WORK IN RECORD
LINKAGE FOR SPATIAL DATA
The problem of determining whether two records re-
fer to the same or different entities is called Record
Linkage or Entity Resolution, and sometimes Entity
Conflation. As a literary problem it goes back
several centuries, with questions such as the identity
of the poet Homer (“Were the Iliad and the Odyssey
written by the same person?”). The formal study of
entity resolution goes back to the 1940s, leading to a
canonical statistical formulation for aligning medical
records, attributed to (Newcombe et al., 1959);
see also (Talburt, 2011) for an in-depth summary.
2.1 Entity Resolution in Metric Spaces
A typical entity resolution approach is to model each
entity using a collection of features, to train a similar-
ity or distance function based on these features, and
then to merge entities whose similarity is above some
threshold, perhaps transitively using a clustering al-
gorithm. That is, given a distance function between
records, a simple application of this to entity resolu-
tion comes from assuming that two records represent
the same entity if the distance between them is
beneath some threshold (or, conversely, for a similarity
function, if the similarity between them is above
some threshold).
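The threshold rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy coordinates, the Euclidean distance, the threshold value, and the helper names `euclidean` and `match_pairs` are all hypothetical and are not taken from the experiments in this paper.

```python
from math import hypot


def euclidean(a, b):
    """Euclidean distance between two 2-D points (a stand-in distance)."""
    return hypot(a[0] - b[0], a[1] - b[1])


def match_pairs(left, right, distance, threshold):
    """Return (i, j) index pairs whose distance is beneath the threshold."""
    return [
        (i, j)
        for i, a in enumerate(left)
        for j, b in enumerate(right)
        if distance(a, b) < threshold
    ]


# Hypothetical toy records, each reduced to a single 2-D feature vector.
restaurants = [(0.0, 0.0), (5.0, 5.0)]
pois = [(0.1, 0.0), (9.0, 9.0)]

matches = match_pairs(restaurants, pois, euclidean, threshold=0.5)
print(matches)
```

For a similarity function the comparison would simply be reversed, keeping pairs whose similarity is above the threshold.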
This process is sometimes more reliable if the
distance function satisfies the metric properties.
Mathematically speaking, a metric on a space $M$ is a
function $d : M \times M \to \mathbb{R}_{\geq 0}$ which satisfies the
properties of symmetry ($d(x, y) = d(y, x)$), identity
($d(x, y) = 0 \iff x = y$), and the triangle inequality
($d(x, z) \leq d(x, y) + d(y, z)$) for all $x, y, z \in M$. Metric spaces can
be particularly useful for record linkage problems be-
cause the notion of proximity can lend itself to sen-
sible clustering properties. For an accessible intro-
duction to metric spaces for informatics practitioners,
including some well-known caveats, see (Widdows,
2004, Ch 4). For example, standard similarity
functions such as cosine similarity are not metrics because
cos(x, x) = 1, whereas a metric requires d(x, x) = 0,
and conceptual ‘distances’ often do not obey the
symmetry property. In high-dimensional spaces
with many features, the triangle inequality also
becomes a very weak condition that can allow
distantly-related points to become transitively linked
(Chávez et al., 2001).
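The caveat about cosine similarity can be checked numerically. The sketch below is an illustration only: the three example vectors and the helper names are hypothetical, and the common conversion 1 - cos(x, y) is used here as an assumed 'cosine distance', which this paper does not itself define. It shows that the self-similarity is 1 rather than 0, and that the converted distance can still violate the triangle inequality.

```python
import math
from itertools import product


def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero 2-D vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.hypot(x[0], x[1]) * math.hypot(y[0], y[1]))


def cosine_distance(x, y):
    """Assumed conversion of cosine similarity into a 'distance'."""
    return 1.0 - cosine_similarity(x, y)


def triangle_violations(points, d, tol=1e-12):
    """List triples (x, y, z) with d(x, z) > d(x, y) + d(y, z)."""
    return [
        (x, y, z)
        for x, y, z in product(points, repeat=3)
        if d(x, z) > d(x, y) + d(y, z) + tol
    ]


# Hypothetical example vectors: two orthogonal axes and their diagonal.
points = [(1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]

self_similarity = cosine_similarity(points[0], points[0])
violations = triangle_violations(points, cosine_distance)
print(self_similarity, violations)
```

Going from one axis to the other through the diagonal is 'shorter' than going directly, which is exactly the kind of behaviour the triangle inequality rules out.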
The general notion of a distance function that
combines several features was used throughout the
experiments reported in this paper. However, since
the problem addressed in this paper is bipartite match-
ing between two separate datasets, a distance measure
that supports clustering within datasets was not a re-
quirement, and so no attempt was made to ensure that
metric properties were satisfied.
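As a rough illustration of a distance function that combines several features without aiming for metric properties, consider the sketch below. Everything specific in it is an assumption made for the example: the record fields (`name`, `lat`, `lon`), the choice of a string-similarity ratio and a great-circle distance as features, the weights, and the 200 m scale are hypothetical and do not describe the features or parameters used in the experiments.

```python
import math
from difflib import SequenceMatcher


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points."""
    r = 6371000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def name_distance(a, b):
    """1 minus the difflib similarity ratio of the lower-cased names."""
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()


def combined_distance(rec_a, rec_b, w_name=0.5, w_geo=0.5, geo_scale_m=200.0):
    """Weighted sum of a name distance and a capped, scaled geo distance.

    The weights and scale are hypothetical; in a supervised setting they
    would be learned from labelled pairs rather than fixed by hand.
    """
    geo_m = haversine_m(rec_a["lat"], rec_a["lon"], rec_b["lat"], rec_b["lon"])
    geo = min(geo_m / geo_scale_m, 1.0)
    return w_name * name_distance(rec_a["name"], rec_b["name"]) + w_geo * geo


# Hypothetical records: one restaurant, one nearby POI, one unrelated POI.
restaurant = {"name": "Kopi House", "lat": 1.3000, "lon": 103.8000}
near_poi = {"name": "Kopi House Cafe", "lat": 1.3001, "lon": 103.8001}
far_poi = {"name": "Noodle Bar", "lat": 1.3500, "lon": 103.9000}

d_near = combined_distance(restaurant, near_poi)
d_far = combined_distance(restaurant, far_poi)
print(d_near < d_far)
```

Because the function is only ever used to compare one restaurant with one POI, nothing here needs to behave sensibly under transitive chaining within a single table.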
Note that not all spatial record linkage problems
can be posed in this similarity-based form, especially
for moving objects. The challenge of recognizing the
trajectories of individual moving objects from differ-
ent observations arises in computational astronomy
(Kubica et al., 2007), and is an increasing focus for
human-generated datasets (Basık et al., 2017).
2.2 Blocking Features and Identifiers
When inferring pairwise matches from a distance or
similarity function, comparing every record to ev-
ery other record can be intractable and unnecessary.
Sometimes there might be ‘blocking’ features, which
are necessary for matching, in the sense that two
records are prevented from matching if these features
do not have identical values (for example, the year-
of-birth of a person, if this is available in the dataset
and known to be recorded accurately). Some
blocking attributes may even be considered sufficient
for matching, especially if an attribute is meant to be
a unique identifier: for example, two products with
the same barcode might be expected to be identical (as
in Bilenko et al., 2005). Blocking can also be used
as an early-out in computation to reduce a quadratic
problem (comparing every pair of items) to a linear
problem (grouping together all items with identical
or at least similar values of the blocking attribute).
In practice, however, it is rarely the case that any at-
tribute can be trusted entirely with this responsibility.
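The computational benefit of blocking can be sketched as follows. This is a minimal illustration, assuming exact equality on a single blocking key; the `postcode` field, the toy records, and the helper name `candidate_pairs` are hypothetical and are not attributes of the datasets discussed in this paper.

```python
from collections import defaultdict


def candidate_pairs(left, right, key):
    """Yield only the (i, j) pairs whose records share a blocking key.

    Indexing one side by key takes one pass over each table, instead of
    comparing every record in `left` with every record in `right`.
    """
    blocks = defaultdict(list)
    for j, rec in enumerate(right):
        blocks[key(rec)].append(j)
    return [
        (i, j)
        for i, rec in enumerate(left)
        for j in blocks.get(key(rec), [])
    ]


# Hypothetical records with an assumed 'postcode' blocking attribute.
left = [
    {"name": "A", "postcode": "100"},
    {"name": "B", "postcode": "200"},
]
right = [
    {"name": "A'", "postcode": "100"},
    {"name": "C", "postcode": "300"},
    {"name": "B'", "postcode": "200"},
]

pairs = candidate_pairs(left, right, key=lambda rec: rec["postcode"])
print(pairs)
```

Only the surviving candidate pairs would then be passed to the more expensive distance computation; a noisy blocking attribute would silently drop true matches, which is why such attributes can rarely be trusted on their own.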
Spatial Entity Resolution between Restaurant Locations and Transportation Destinations in Southeast Asia