Spatial Entity Resolution between Restaurant Locations and Transportation Destinations in Southeast Asia

Emily Gao and Dominic Widdows

Grab Inc, Singapura

Keywords:

Southeast Asia, Points of Interest, Restaurants, Entity Resolution.

Abstract:

As a tech company, Grab has expanded from transportation to food delivery, aiming to serve Southeast Asia

with hyperlocalized applications. Information about places as transportation destinations can help to improve

our knowledge about places as restaurants, so long as the spatial entity resolution problem between these

datasets can be solved. In this project, we attempted to recognize identical place entities from databases

of Points-of-Interest (POI) and GrabFood restaurants, using their spatial and textual attributes, i.e., latitude,

longitude, place name, and street address. Distance metrics were calculated for these attributes and fed to tree-

based classiﬁers. POI-restaurant matching was conducted separately for Singapore, Philippines, Indonesia,

and Malaysia. Experimental estimates demonstrate that a matching POI can be found for over 35% of restau-

rants in these countries. As part of these estimates, test datasets were manually created, and RandomForest,

AdaBoost, Gradient Boosting, and XGBoost perform well, with most accuracy, precision, and recall scores

close to or higher than 90% for matched vs. unmatched classiﬁcation. To the authors’ knowledge, there are no

previous published scientiﬁc papers devoted to matching of spatial entities for the Southeast Asia region.

1 INTRODUCTION

Location matters in many businesses and services to-

day, particularly for transportation and delivery, sce-

narios in which it is important to ﬁnd the correct pick-

up and drop-off locations very quickly. User expe-

rience can be negatively affected if the location in-

formation is inaccurate or insufﬁcient. Inaccuracies

can originate from imprecise GPS data, manual error

happening in the process of data entry, or the lack of

effective data quality control. Insufﬁciencies can also

take many forms, including lack of coverage, and lack

of detail — for example, we may know the latitude

and longitude of a restaurant location in a mall, but

this might not include information about where pas-

sengers should be dropped off, or where a delivery

courier should park to collect food for delivery. Or

the location of a business may be known, but not its

contact details or opening hours.

One core problem in managing and improving

spatial datasets is recognizing when two records refer

to the same real-world entity. Solving this problem

can improve precision by removing duplicates, and

can enrich detail by (for example) merging a phone

number from one record with the hours of operation

from another, once these records are known to refer

to the same thing. This problem is referred to as en-

tity resolution (see (Talburt, 2011)), and it occurs with

various datasets, including those representing people,

products, works of literature, etc.

For Grab, one entity resolution problem that arises for spatial data is the alignment of transportation destinations and restaurants. Currently Grab maintains two separate tables for transportation and food delivery, because each use case requires some specific features: for example, food delivery needs information about the estimated delivery time, cuisine types, and opening hours, which are absent from the POI table. However, it is highly likely that some entities in the two tables refer to the same place, and determining which ones do is an entity resolution problem.

Matching restaurants and POIs benefits several use cases at Grab. The first is automatic geolocation correction, which is relevant to both transportation and food delivery. As shown in Figure 1, when Driver1 takes a passenger to a KFC (the restaurant destination), he or she may find it very difficult to drop off the passenger because the location suggested by the POI table is incorrect. Driver1 can report this to the Grab transportation team, and the POI is then corrected and updated. Some time later, Driver2 needs to pick up a food order from the same KFC, and may have the same difficulty locating the restaurant quickly because the location returned from the restaurant table is still inaccurate. This unhappy user experience could be avoided

Gao, E. and Widdows, D.

Spatial Entity Resolution between Restaurant Locations and Transportation Destinations in Southeast Asia.

DOI: 10.5220/0009416600920103

In Proceedings of the 6th International Conference on Geographical Information Systems Theory, Applications and Management (GISTAM 2020), pages 92-103

ISBN: 978-989-758-425-1

if the same restaurant and POI entities are matched up

so that any update or change for an entity from one ta-

ble can be automatically transferred to another.

Another use case that can benefit from matching restaurants and POIs is deduplicated search for the same place entity across both tables. A user searching for a specific restaurant will trigger a search in both the restaurant and the POI tables. If this restaurant has a matched POI, and this matching relation has been identified, one of the two searches could be avoided, allowing a faster return of the search results and avoiding duplication of results for users.

Motivated by the above use cases, we attempted to

conduct entity matching between restaurant and POI

tables. The whole process includes several subtasks

such as data exploration, preprocessing, distance met-

rics calculation, labelling, as well as supervised learn-

ing. These will be elaborated in the following sec-

tions.

2 PRIOR WORK IN RECORD LINKAGE FOR SPATIAL DATA

The problem of determining whether two records re-

fer to the same or different entities is called Record

Linkage or Entity Resolution, and sometimes Entity

Conﬂation. As a literary problem it goes back for

several centuries, with questions such as the identity

of the poet Homer (“Were the Iliad and the Odyssey

written by the same person?”). The formal study of

entity resolution goes back to the 1940s, leading to a

canonical statistical formulation for aligning medical

records being attributed to (Newcombe et al., 1959)

(see also (Talburt, 2011) for an in-depth summary).

2.1 Entity Resolution in Metric Spaces

A typical entity resolution approach is to model each

entity using a collection of features, to train a similar-

ity or distance function based on these features, and

then to merge entities whose similarity is above some

threshold, perhaps transitively using a clustering al-

gorithm. That is, given a distance function between

records, a simple application of this to entity resolu-

tion comes from assuming that two records represent

the same entity if the distance between them is beneath some threshold. (Conversely, for a similarity function, two records are assumed to match if the similarity between them is above some threshold.)

This process is sometimes more reliable if the distance function satisfies the metric properties. Mathematically speaking, a metric on a space $M$ is a function $d : M \times M \to \mathbb{R}_{\geq 0}$ which satisfies the properties of symmetry ($d(x, y) = d(y, x)$), identity ($d(x, y) = 0 \iff x = y$), and the triangle inequality ($d(x, z) \leq d(x, y) + d(y, z)$) for all $x, y, z \in M$. Metric spaces can

be particularly useful for record linkage problems be-

cause the notion of proximity can lend itself to sen-

sible clustering properties. For an accessible intro-

duction to metric spaces for informatics practitioners,

including some well-known caveats, see (Widdows, 2004, Ch 4). For example, standard similarity functions such as cosine similarity are not metrics because cos(x, x) = 1 rather than 0, and conceptual ‘distances’ often do not obey the symmetry rule. In high-dimensional spaces with many features, the triangle inequality also becomes a very weak condition that can allow distantly-related points to become transitively linked (Chávez et al., 2001).

The general notion of a distance function that

combines several features was used throughout the

experiments reported in this paper. However, since

the problem addressed in this paper is bipartite match-

ing between two separate datasets, a distance measure

that supports clustering within datasets was not a re-

quirement, and so no attempt was made to ensure that

metric properties were satisﬁed.

Note that not all spatial record linkage problems

can be posed in this similarity-based form, especially

for moving objects. The challenge of recognizing the

trajectories of individual moving objects from differ-

ent observations arises in computational astronomy

(Kubica et al., 2007), and is an increasing focus for

human-generated datasets (Basık et al., 2017).

2.2 Blocking Features and Identiﬁers

When inferring pairwise matches from a distance or

similarity function, comparing every record to ev-

ery other record can be intractable and unnecessary.

Sometimes there might be ‘blocking’ features, which

are necessary for matching, in the sense that two

records are prevented from matching if these features

do not have identical values (for example, the year-

of-birth of a person, if this is available in the dataset

and known to be recorded accurately). Some block-

ing attributes may even be considered to be sufﬁcient

for matching, especially if an attribute is meant to be

a unique identiﬁer — for example, two products with

the same barcode might be expected to be identical (as

in (Bilenko et al., 2005)). Blocking can also be used

as an early-out in computation to reduce a quadratic

problem (comparing every pair of items) to a linear

problem (grouping together all items with identical

or at least similar values of the blocking attribute).

In practice, however, it is rarely the case that any at-

tribute can be trusted entirely with this responsibility.
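The early-out use of blocking described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this work: the function `candidate_pairs`, the toy records, and the choice of year-of-birth as blocking key are all hypothetical.

```python
from collections import defaultdict


def candidate_pairs(left, right, block_key):
    """Yield (l, r) pairs only for records that share a blocking key.

    Grouping `right` by key is linear in its size, so the all-pairs
    comparison is only paid for inside each block.
    """
    blocks = defaultdict(list)
    for rec in right:
        blocks[block_key(rec)].append(rec)
    for rec in left:
        for other in blocks.get(block_key(rec), []):
            yield rec, other


# Hypothetical toy records: (name, year_of_birth).
people_a = [("Ann Lee", 1980), ("Bob Tan", 1975)]
people_b = [("Anne Lee", 1980), ("Robert Tan", 1975), ("Cy Ong", 1980)]

# Block on year of birth: 3 candidate pairs instead of 2 * 3 = 6.
pairs = list(candidate_pairs(people_a, people_b, block_key=lambda r: r[1]))
```

The surviving pairs would still need to be scored by a distance or similarity function; blocking only decides which pairs are worth scoring.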


Figure 1: Use Case Illustration for Automatic Geolocation Update.

2.3 Special Considerations for Spatial Datasets

With spatial datasets, blocking can be particularly

tricky, because there is not a standard system of is-

suing unique identiﬁers. Buildings do not have ISBN

numbers or Driver’s License Numbers! Many have

street addresses, which can sometimes be considered

as a taxonomic path to finding the building (see (Widdows, 2004, Ch 3)), but these are not unique identifiers and usually have many spelling variants (“9 High Street”, “9 High St”, etc.). An obvious class of candidate blocking functions for spatial datasets is shared geographic area: the same city, state or region, or at least the same country. However, there are two problems with relying on these as blocking functions:

• The names in these fields might not match. Surprisingly, this happens with some regularity: for example, a store near the boundary between Bellevue and Redmond may be listed in either. This problem is particularly apparent in some of our datasets for Southeast Asia.

• The input datasets might not be parsed into these

ﬁelds. There are approaches for performing such

parsing automatically (see e.g., (Borkar et al.,

2001)). We are working on this challenge as well,

though the results are not yet ready to be used as

input for restaurant POI matching.

Entity resolution for spatial datasets therefore typi-

cally involves largely statistical and continuous mea-

sures of similarity. The work described here is typical

in this respect, and in terms of methodology is quite

similar to that of (Sehgal et al., 2006). One class of

similarity is spatial proximity: provided that each en-

tity record has latitude and longitude coordinates, the

great circle distance between these points can easily

be calculated and used as a feature. Another class of

similarity is textual similarity: whether a string of text

represents a business name, a street number, a street

name, or a city name, we still expect a high similarity

between two different textual descriptions of the same

entity.
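As a concrete illustration of the spatial proximity feature, the great circle distance between two latitude / longitude points can be computed with the haversine formula. The sketch below assumes a spherical Earth with a mean radius of 6,371 km; the function name and the constant are illustrative choices, not taken from the system described here.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # assumed mean Earth radius in meters


def great_circle_distance_m(lat1, lon1, lat2, lon2):
    """Haversine great circle distance in meters between two points
    given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# One degree of longitude along the equator is about 111 km.
one_degree = great_circle_distance_m(0.0, 0.0, 0.0, 1.0)
```

At the scale of a few hundred meters between a restaurant and a candidate POI, the spherical approximation is usually adequate for use as a classifier feature.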

Another common difﬁculty to note is that decid-

ing whether two records refer to the same entity —

and even deciding if two entities are identical at all —

is less straightforward than one might assume. Some

of the problems in geo-ontology engineering, and es-

pecially problems that arise when trying to combine

this with statistical methods, are discussed in (Janow-

icz, 2012). A restaurant and the building it occupies

are conceptually two different things, as demonstrated

by cases where the building takes on a different ten-

ant, or the restaurant moves to different premises. Of

course, we say “We are going to the restaurant”, rather

than “We are going to the building that houses the

restaurant”, but when we try to build formal compu-

tational models that start with such everyday conve-

niences, difﬁculties soon arise, a case-in-point being

the challenge of building a formal model for reason-

ing from crowdsourced OpenStreetMap tags (Code-

scu et al., 2011).

These theoretical and practical difﬁculties are im-

portant to keep in mind, not because we should de-

spair of ever solving this matching problem, but be-

cause they place sensible limits on what sort of re-

sults we should expect. This motivates the right ques-

tion from a human-centered technology point of view

— we are not asking “Can we build a perfect match-

ing system”, but rather “Can we build a matching sys-

tem whose results will improve the experience of our

users?”


3 DATA SOURCES USED

The two main tables this work focuses on are points-

of-interest (POI) and restaurants.

3.1 Points of Interest

Grab uses POIs as the drop-off / pick-up location recommendations to drivers for both transportation and food delivery purposes. The POIs are licensed externally from Google, Foursquare, Nokia, Azure, OpenStreetMap, and Aapico, and are also collected internally by operations teams. Right now most POIs come from external sources, and Grab’s own data collection activities are growing. In the POI database, each POI entity is identified by its unique id, and has dozens of attributes describing its geolocation, classification, and lifecycle milestones. The POI dataset contains approximately 150M records at the time of writing.

Efforts to deduplicate these records are ongoing, a

problem that effectively subsumes the work on restau-

rant matching (because the restaurant dataset could be

considered as just another data source).

3.2 Restaurants

The restaurant table contains the most up-to-date information about those restaurants that have registered with GrabFood. Restaurant information provided by merchant owners is merged with the spatial information from the Map Operation team, and then entered into the database and cloud data warehouse through a series of internal platforms. Like a POI, each restaurant entity has its unique id, geolocation information, and lifecycle milestones, but with some extra attributes such as status information and owner information. The restaurant dataset contains approximately 200K records at the time of writing.

The two tables exist to support different opera-

tions, i.e., the POIs are used for transportation and

the restaurants are used for food delivery.

3.3 Name Variation

Many of the records within these datasets vary consid-

erably in representation. To give examples at the city

and regional level, Table 1 shows some spelling dif-

ferences that regularly appear with some of the larger

cities in Indonesia.

4 MATCHING APPROACH

At a high level, our approach to matching is sim-

ple: annotate suitable (POI, restaurant) pairs as either

matching or non-matching, and use this to train and

evaluate classiﬁers that rely on basic textual and spa-

tial features. It is important to note that the classiﬁers

trained in this fashion classify pairs of (POI, restau-

rant) to say whether they match or not: it is not just

a classiﬁer that tells if a restaurant can be matched

to any POI, but a classiﬁer that tells us if a particular

match is a good one.

This turns the matching problem into a classi-

ﬁcation problem over the Cartesian product of two

sets. In practice, however, the number of potential

(POI, restaurant) pairs is prohibitively large, and the

chance that any randomly selected pair is a match is

correspondingly small, so we sample down to make

these numbers tractable and sensible by reducing the

matching candidates. The steps are outlined as fol-

lows.

4.1 Early-out Blocking on Geohash

Given approximately 150M POIs and 200K restau-

rants, there are approximately 30 trillion potential

(POI, restaurant) pairs — far too many to consider all

of them! To reduce the number of potential matches,

we start with a blocking strategy based on geohash.

A geohash is a rectangle in the latitude / longitude

coordinate space, where rectangles at different levels

have predictable alphanumeric identiﬁers (Liu et al.,

2014). For these experiments, we found that taking

geohashes at level 6 was a suitable tradeoff between

computational performance and thoroughness. A 6-

level geohash at the equator encloses an area of ap-

proximately 1.2 km × 0.6 km.

Restaurants and POIs are paired up by joining on identical level-6 geohashes. An exhaustive comparison between each restaurant and every POI is unnecessary because two locations that are thousands of miles apart could never be the same entity. We therefore join a restaurant only to POIs that have the same geohash, thus reducing the comparison space dramatically.
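The geohash join can be sketched in plain Python as follows. This is a minimal single-machine illustration under stated assumptions: the production pipeline runs on Spark, and the function names (`geohash_encode`, `block_join`) and the dictionary record layout with `lat` / `lon` keys are hypothetical. The encoder follows the standard geohash scheme of interleaving longitude and latitude bits and emitting one base-32 character per five bits.

```python
from collections import defaultdict

_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, precision=6):
    """Encode a latitude / longitude pair as a geohash string."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars, bits, nbits = [], 0, 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits, lon_lo = (bits << 1) | 1, mid
            else:
                bits, lon_hi = bits << 1, mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits, lat_lo = (bits << 1) | 1, mid
            else:
                bits, lat_hi = bits << 1, mid
        use_lon = not use_lon
        nbits += 1
        if nbits == 5:  # five bits make one base-32 character
            chars.append(_BASE32[bits])
            bits, nbits = 0, 0
    return "".join(chars)


def block_join(restaurants, pois, precision=6):
    """Pair each restaurant with the POIs that share its geohash cell."""
    by_hash = defaultdict(list)
    for poi in pois:
        by_hash[geohash_encode(poi["lat"], poi["lon"], precision)].append(poi)
    return [
        (r, p)
        for r in restaurants
        for p in by_hash.get(geohash_encode(r["lat"], r["lon"], precision), [])
    ]


# Hypothetical toy records: one restaurant, one co-located POI, one far away.
restaurants = [{"id": "r1", "lat": 1.3521, "lon": 103.8198}]
pois = [
    {"id": "p1", "lat": 1.3521, "lon": 103.8198},
    {"id": "p2", "lat": -6.2, "lon": 106.8},
]
pairs = block_join(restaurants, pois)  # only (r1, p1) survives
```

One design caveat of joining on identical cells is that two points lying just either side of a cell boundary receive different geohashes, so a stricter variant would also look in neighboring cells.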

4.2 Features Used

The spatial and non-spatial attributes used to iden-

tify the same place entities include latitude and longi-

tude, place name, and street name, which are common

ﬁelds in the POI and restaurant tables. Some distance

metrics were derived from the above attributes. The

distance metrics are great circle distance calculated


Table 1: Examples of Name Variations Found in Indonesia.

First Name Variant | Second Name Variant | Notes
Jakarta | Djakarta | The spelling ‘Djakarta’ is usually considered obsolete but still appears
Solo | Surakarta | Alternative names for the same city
Lampung | Bandar Lampung | ‘bandar’ means ‘city’ in Malay, these are like ‘New York’ and ‘New York City’
Aceh | Banda Aceh | ‘banda’ and ‘bandar’ are used similarly
Lubuklinggau | Lubuk Linggau | The space is optional
Palangkaraya | Palangka Raya | The space is optional

from latitude and longitude, Levenshtein and Jaro distances for place name, and Levenshtein distance for street name. These metrics are the input features for the machine learning models, and a restaurant-POI pair with lower distance values is more likely to be the same place entity. Levenshtein distance is calculated by counting the number of edit operations needed to convert one string into another, where the edit operations are adding, deleting, and replacing a character. The Jaro similarity is another string similarity measure that has been used successfully for name-matching (see (Cohen et al., 2003)). It is defined as follows:

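$$\mathrm{Jaro}(\theta_1, \theta_2) = \frac{1}{3}\left(\frac{c}{|\theta_1|} + \frac{c}{|\theta_2|} + \frac{c - t}{c}\right) \qquad (1)$$

where $\theta_1$ and $\theta_2$ are the strings, $c$ is the number of characters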

that match within a given distance, and t is the num-

ber of transpositions needed to put the overlapping

characters back in the same order. Jaro distance is

derived from Jaro similarity by subtracting Jaro sim-

ilarity from the value of 1. Both Levenshtein and

Jaro distances are character level based, but Jaro dis-

tance focuses more on local similarities between two

strings. For example, Levenshtein distance between

‘ab’ and ‘ba’, which is normalized by the sum of their

lengths, is 0.5, and Jaro distance between ‘ab’ and

‘ba’ is 1 because within half of the string length we

cannot ﬁnd any identical character pairs from the two

strings.
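The Jaro distance examples above can be checked with a short pure-Python sketch. It follows the common textbook convention, which is an assumption here rather than something confirmed by the original implementation: characters match only within a window of half the longer string's length minus one, and t is counted as half the number of matched characters that are out of order.

```python
def jaro_similarity(s1, s2):
    """Jaro similarity of two strings, in the range [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this window of positions.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    c = 0
    for i, ch in enumerate(s1):
        lo, hi = max(0, i - window), min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                c += 1
                break
    if c == 0:
        return 0.0
    # Walk the matched characters of both strings in order and count
    # the positions where they disagree.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (c / len1 + c / len2 + (c - t) / c) / 3


def jaro_distance(s1, s2):
    """Jaro distance, defined as 1 minus the Jaro similarity."""
    return 1.0 - jaro_similarity(s1, s2)


d_swap = jaro_distance("ab", "ba")      # no character matches within the window
d_prefix = jaro_distance("ab", "abcd")  # "ab" is a prefix of "abcd"
```

With this convention, ‘ab’ and ‘ba’ share no matching characters because the window is zero positions wide for strings of length two, which is why their distance is 1.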

Some preprocessing is applied to name and street address strings to convert characters to lower case, remove blank spaces and special characters, etc. Notably, the Levenshtein distance between two strings is normalized by the sum of the lengths of the two strings. This is because raw Levenshtein distance is more generous to shorter strings. For instance, the Levenshtein distance between ‘a’ and ‘b’ is 1, which is seemingly smaller than the Levenshtein distance (i.e., 3) between ‘abczzzzzzzzzzzzzz’ and ‘fghzzzzzzzzzzzzzz’. However, we tend to recognize that the latter pair is more similar than ‘a’ and ‘b’.
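The effect of normalizing by the summed string lengths can be reproduced with a small sketch. The helper names are hypothetical, and the `preprocess` function is only one plausible reading of the cleaning steps listed above (lower-casing and dropping everything that is not a letter or digit).

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insertion, deletion, and replacement."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # replace or keep
        prev = curr
    return prev[-1]


def normalized_levenshtein(a, b):
    """Levenshtein distance divided by the sum of the two string lengths."""
    total = len(a) + len(b)
    return levenshtein(a, b) / total if total else 0.0


def preprocess(text):
    """Assumed cleaning step: lower-case and keep only letters and digits."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


short_pair = normalized_levenshtein("a", "b")
long_pair = normalized_levenshtein("abc" + "z" * 14, "fgh" + "z" * 14)
```

After normalization the long pair scores well below the short pair, matching the intuition that the two long strings are the more similar ones.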

Great circle distance and Levenshtein distance for street name are more associated with the spatial closeness of a restaurant and a POI. Nevertheless, being spatially close does not necessarily mean two place entities are identical. This is particularly the case in densely populated cities such as Singapore, where stores are often crowded close together. Therefore, similarity between the names of two places is an important feature to incorporate into our evaluation system. We consider both Levenshtein and Jaro distances for place name to increase the weight of the non-spatial attribute in determining place entity similarity.

Figure 2 displays distributions of features for the different countries (i.e., Indonesia, Malaysia, Singapore, and Philippines). The samples plotted in the figure are a subset randomly selected from the final population of POI-restaurant pairs. The final population was obtained after applying early-out blocking on geohash and a few other predefined rules that help to reduce the size of the unmatched POI-restaurant population (elaborated in section 4.4). Because of these predefined downsampling rules, geolocation distances cluster markedly within 200 meters and the Levenshtein distance for name has a clear cutoff at 0.4 (i.e., the threshold we used to filter out POI-restaurant pairs that have a low probability of being identical place entities). Interestingly, while the distribution of Levenshtein distance for name tends to be skewed towards the high-value end, Jaro distance for name follows a more bell-shaped distribution. This indicates that, for this specific study, when the same group of string pairs is evaluated using Jaro distance instead of Levenshtein distance, a greater portion of POI and restaurant names are considered to be similar. This is the case for two strings like ‘ab’ and ‘abcd’, for which the normalized Levenshtein distance is 1/3 whereas the Jaro distance is 1/6. We observed a significant number of POI name and restaurant name pairs falling into this category: where a POI refers to the same place as a restaurant, the POI name is an exact substring of the restaurant name, because a restaurant name is usually appended with a street address that is not included in


the POI name.

4.3 Countries Considered

The countries where Grab offers both transportation

and food delivery services that were considered for

this study are Indonesia, Malaysia, Singapore, Philip-

pines, Vietnam and Thailand. The model training and

results reported here are restricted to the ﬁrst four of

these. Vietnam and Thailand were not investigated in

this pilot study due to the difﬁculty of the character

sets. See section 6 for further discussion of this point.

4.4 Annotation / Labeling with Downsampling

This project aims to match restaurants and POIs through a supervised learning process. Since labels are not available for the restaurant-POI pairs joined on the same geohash, i.e., we do not know a priori whether a restaurant and a POI are matched or not, manual annotation is a necessary step before training and testing the models. We decide the relationship between a restaurant and a POI mainly based on latitude and longitude (i.e., distance), name, and street.

The relationship between a POI and a restaurant can be classified into two categories (see Table 2). Two place entities are considered to refer to the same place if they have similar names and either similar street addresses or spatially close locations. On the other hand, if it is obvious from the different names that the restaurant and the POI cannot be the same place, they are marked as not matching even though they are spatially close.

Even after applying geohash to block out the unmatched restaurant-POI pairs, the percentage of matched restaurant-POI pairs out of the total pairs is still expected to be very low. Take Singapore for example: the total population of restaurant-POI pairs for comparison is about 220 million and the total number of unique restaurant entities is about 8K. Even supposing that every restaurant can find a matched POI, the matching percentage is still as low as 0.0037%. The whole population of restaurant-POI pairs is therefore highly imbalanced, with a much lower rate of occurrence of matched samples. This makes manual annotation difficult: at a rate of 0.0037%, labeling 10,000 randomly chosen restaurant-POI pairs would be expected to yield fewer than one matched sample (10,000 × 0.0037% ≈ 0.37), and roughly a million pairs would be needed to obtain about 37 matched samples. Manually labeling even 10,000 pairs requires too much human labor, which is both time consuming and error prone.

We proposed two predefined rules to exclude or downsample a large portion of the restaurant-POI pairs that are expected to be unmatched. These are based on two underlying assumptions. The first is that a POI that is closer to a restaurant is more likely to be matched with the restaurant than another POI that is further away. Therefore, we only consider the top K nearest POIs as the potential matching candidates. The second is that a restaurant-POI pair with a name Levenshtein distance greater than 0.4 tends to be unmatched. A special case is ‘abc’ and ‘def’, which have a normalized Levenshtein distance of 0.5. As one of the strings gets longer, that ratio rises above 0.5, moving further from being matched. We chose 0.4 as the threshold to apply a stricter penalty to name edit distance. After prescreening the restaurant-POI pairs based on the above predefined rules, the whole population for comparison is reduced significantly, to about 32K. This makes it possible to identify a considerable number of matched restaurant-POI pairs without having to label a huge number of pairs.
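The two prescreening rules can be sketched as a simple filter over candidate pairs. Several details here are assumptions made for illustration only: the record layout (`restaurant_id`, `poi_id`, `distance_m`, `name_lev`), the value of K (not stated in the text), and the order in which the two rules are applied (nearest-K first, then the name threshold).

```python
from collections import defaultdict


def prescreen(pairs, k, max_name_lev=0.4):
    """Keep, per restaurant, the k nearest POIs, then drop pairs whose
    normalized name Levenshtein distance exceeds max_name_lev."""
    by_restaurant = defaultdict(list)
    for pair in pairs:
        by_restaurant[pair["restaurant_id"]].append(pair)
    kept = []
    for candidates in by_restaurant.values():
        nearest = sorted(candidates, key=lambda p: p["distance_m"])[:k]
        kept.extend(p for p in nearest if p["name_lev"] <= max_name_lev)
    return kept


# Hypothetical candidate pairs after the geohash join.
candidates = [
    {"restaurant_id": "r1", "poi_id": "p1", "distance_m": 5.0, "name_lev": 0.10},
    {"restaurant_id": "r1", "poi_id": "p2", "distance_m": 20.0, "name_lev": 0.60},
    {"restaurant_id": "r1", "poi_id": "p3", "distance_m": 300.0, "name_lev": 0.05},
    {"restaurant_id": "r2", "poi_id": "p4", "distance_m": 10.0, "name_lev": 0.45},
]

# With k=2 (an arbitrary choice for this toy example), p3 falls outside the
# nearest two POIs, and p2 and p4 exceed the name threshold.
survivors = prescreen(candidates, k=2)
```

Applying the rules in the opposite order would keep a different set of pairs (p3 would survive for r1), so the ordering is a real design choice rather than a detail.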

The manual labeling follows three steps. We first randomly select 500 samples from the population for comparison, examine each restaurant-POI pair, and assign it a label of matched or unmatched, coded as 1 and 0 respectively. The 500 labeled samples are then used to train and test a Decision Tree, which is subsequently used to predict labels for another 2000 randomly selected samples. Finally, among these 2000 samples, the restaurant-POI pairs predicted as matched are manually checked and any wrong predicted labels are corrected. In this way, for each country we collected over 1000 labeled samples that could be used for training and testing.

4.5 Model Training

Each dataset was divided between training and test-

ing with the ratio of 4:1, that is, 20% of the datasets

were held aside for evaluation. All models were

trained using Scikit Learn in Python (Pedregosa et al.,

2011). The implementation uses PySpark for dis-

tributed computing where appropriate.

4.6 Computational Resources

Preprocessing to obtain the final population of restaurant-POI pairs was the most time-consuming and memory-demanding process. This includes, but is not limited to, joining restaurants and POIs based on geohash, calculating distance metrics, applying predefined rule-out conditions to decrease the unmatched pair population, and saving Parquet and CSV files to AWS S3. These operations were conducted on a Spark cluster of 20 machines. Although suitable partition strategies were utilized, depending on the population sizes of restaurants and POIs for the


Figure 2: Histograms for Features of Geolocation Distance (in Meters), Levenshtein Distance, and Jaro Distance (ID: Indonesia; MY: Malaysia; SG: Singapore; PH: Philippines).

Table 2: Restaurant-POI Relationship Types.

Matched | Name | Street | Distance (meters)
Restaurant | fore coffee - bintarof | jl. boulevard bintaro jaya ruko kebayoran arcade 2 blok b3 no 51 pd. jaya pd. aren tangerang selatan | 0.0
POI | fore coffee - 20fit bintaro | jl. boulevard bintaro jaya ruko kebayoran arcade 2 blok b3 no 51 pondok aren |

Unmatched | Name | Street | Distance (meters)
Restaurant | mie setan ’noodle and dimsum’ - tlogomas | jl. raya tlogomas no. 31 tlogomas lowokwaru malang | 15.92
POI | KFC- tlogomas | jalan raya tlogo mas |

different countries, running time on Spark could last for a few hours, the longest being 4 hours for Indonesia. 10 GB and 40 GB of memory were allocated to the master and executors respectively to avoid out-of-memory failures. Since the final comparison domain was considerably reduced, training and testing the tree-based models and making predictions ran quickly, taking only several minutes.

5 RESULTS AND DISCUSSION

This section reports the results for each country for which supervised classifiers were trained and tested (as shown in Table 3). Here Class 1 refers to the unmatched and Class 2 to the matched pairs. As can be seen, test accuracy is uniformly high across countries. The lowest accuracy is still over 93%, a great improvement over a baseline guess based on the proportion of the majority category (i.e., the non-matching class, which takes up 70-80% of the population). A comparison of precision and recall across the different countries shows that Malaysia has the weakest non-matching results, and the weakest matching results are found in the Philippines. The tree-based classifiers perform best for matching restaurants and POIs in Indonesia in terms of precision and recall scores for the matched class. This is illustrated in Figure 3, which lists some randomly selected samples of restaurant-POI pairs predicted as matched in Indonesia. After checking them on Google Maps, we observed that all the pairs are correctly predicted even though there are many variations in names and street addresses between restaurants and POIs. Some of the cases are hard to identify as matched even for a human reviewer. For example, the first pair has quite different street address expressions, and its geolocation distance is also larger than that of the other pairs. However, the model could still make a correct prediction based on what it learned from the training set. Some POIs have empty street addresses, but the model is able to recognize the matched pairs based on the close geolocation distance combined with similar name strings.

Using Random Forest to predict the matching of all restaurant-POI candidate pairs, we estimated the proportion of matchable restaurants in each country considered. These were approximately 37% (Indonesia), 53% (Malaysia), 51% (Philippines) and 23% (Singapore). On average, 38% of restaurants in the four countries can be matched with at least one POI, implying great potential to enhance the user experience for drivers and passengers if those restaurants happen to be the popular ones. The reason for the low matching rate in Singapore is still under investigation. It is important to note that the precision and recall scores measure the performance of the automatic classifier against the manual matching done during the annotation process. So, for example, achieving recall above 90% for Malaysia means that, of the whole population of manually matched restaurant-POI pairs, over 90% were also captured by the classifier. It does not mean that over 90% of restaurants have been matched to a corresponding POI.
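The difference between these two quantities can be sketched with a toy example. The identifiers and labels below are made up for illustration and are not taken from the datasets; the helper names are likewise hypothetical.

```python
def restaurant_match_rate(pair_predictions, restaurant_ids):
    """Share of restaurants with at least one candidate pair predicted as matched."""
    matched = {rest for (rest, _poi), is_match in pair_predictions.items() if is_match}
    return len(matched & set(restaurant_ids)) / len(restaurant_ids)


def pair_recall(pair_predictions, annotated_matches):
    """Share of manually annotated matched pairs that the classifier also predicts."""
    hits = sum(1 for pair in annotated_matches if pair_predictions.get(pair, False))
    return hits / len(annotated_matches)


# Made-up classifier output for (restaurant, POI) candidate pairs.
predictions = {
    ("r1", "p1"): True,
    ("r1", "p2"): False,
    ("r2", "p3"): True,
    ("r3", "p4"): False,
    ("r4", "p5"): False,
}
annotated = {("r1", "p1"), ("r2", "p3")}  # made-up manual annotations
restaurants = ["r1", "r2", "r3", "r4"]

print(pair_recall(predictions, annotated))              # 1.0
print(restaurant_match_rate(predictions, restaurants))  # 0.5
```

In this toy case the classifier recovers every annotated match, yet only half of the restaurants have a matching POI, which is the distinction drawn above.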

Notably, the input features carry different weights when Random Forest is used to classify the relationship between merchants and POIs, as shown in Figure 4. For each country, the same order of feature importance can be observed: Jaro distance of name is the most important, followed by Levenshtein distance of name, then great-circle geolocation distance, with Levenshtein distance of street name playing a minimal role. It is an interesting finding that Jaro distance of name outweighs Levenshtein distance of name in determining whether a merchant-POI pair is matched. Compared to Levenshtein distance, Jaro distance focuses more on the local similarity of two strings and is more likely to assign a lower distance score to two strings with identical substrings. Because many restaurant entries have their name concatenated with a street name (as discussed in Section 4.2), Jaro distance turns out to be a better measure of the similarity between restaurant and POI names. As for geolocation distance, its much smaller importance score may seem surprising. However, recall that early-out blocking had already been applied on geolocation distance, so the real role of geolocation distance is likely much more prominent than the figure shows.
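The contrast between the two name metrics can be illustrated with a small, self-contained sketch. The implementations below are textbook versions written for this illustration, not the code used in the experiments; normalizing Levenshtein distance by the longer string length is an assumption, and the two names are invented.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalized_levenshtein(a, b):
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def jaro_similarity(a, b):
    """Standard Jaro similarity in [0, 1]."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_flags = [False] * len(a)
    b_flags = [False] * len(b)
    matches = 0
    for i, ca in enumerate(a):
        lo, hi = max(0, i - window), min(len(b), i + window + 1)
        for j in range(lo, hi):
            if not b_flags[j] and b[j] == ca:
                a_flags[i] = b_flags[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    a_matched = [c for c, flag in zip(a, a_flags) if flag]
    b_matched = [c for c, flag in zip(b, b_flags) if flag]
    transpositions = sum(x != y for x, y in zip(a_matched, b_matched)) // 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3


def jaro_distance(a, b):
    return 1.0 - jaro_similarity(a, b)


# Invented example: a POI name versus a restaurant name with a street appended.
poi_name = "warung sari"
restaurant_name = "warung sari jalan melati"
print(round(normalized_levenshtein(poi_name, restaurant_name), 2))  # 0.54
print(round(jaro_distance(poi_name, restaurant_name), 2))           # 0.18
```

In this invented case the appended street name costs thirteen insertions under Levenshtein distance, while Jaro distance stays low because every character of the shorter name is matched in order.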

A map of GrabFood restaurants in Singapore is displayed in Figure 5. The red dots stand for restaurants that do not have matched POIs, and the green dots represent restaurants that have matched POIs. The spatial distribution of the two categories shows no distinct clusters of either type of restaurant; restaurants with and without matched POIs are mixed with each other.

Finally, we ran an experiment to see if results improved or deteriorated when datasets from each country were not kept separate. We expected that the extra training data could be beneficial, but that using training data from another country could lead to poorer results from inappropriate extrapolation. We found that, on the whole, results became less reliable (Table 3), with the exception that the recall of matching results in Malaysia improved a little. One hypothesis is that the similarity between Malaysian and Indonesian addresses and street names may contribute to the usefulness of Indonesian data for Malaysia. The impaired matching result after merging all countries together could be attributed to the different distributions of features across the different countries (see Figure 2). For instance, Levenshtein distance between names for Singapore is distributed relatively more homogeneously, whereas for other countries there is a concentration of high values of Levenshtein distance of name. The distributions of Jaro distance for name do not exactly coincide across countries, and neither do the distributions of Levenshtein distance for street name. Hence, there are regional variations across the countries we are working on in terms of the spatial distribution patterns of restaurants vs. POIs, the name and street address format conventions, etc. As a consequence, a Levenshtein distance for name of 0.1 could indicate a matched POI and restaurant in Indonesia but an unmatched pair in Singapore.
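The effect of pooling data with different feature distributions can be sketched with a deliberately simple stand-in for the tree-based classifiers: a single threshold rule on one distance feature. The two "countries" and all numbers below are made up for illustration and do not reproduce the experiments.

```python
def best_threshold(samples):
    """Return (threshold, accuracy) of the best rule 'match if distance <= threshold'."""
    best = (None, -1.0)
    for threshold in sorted({distance for distance, _ in samples}):
        correct = sum((distance <= threshold) == label for distance, label in samples)
        accuracy = correct / len(samples)
        if accuracy > best[1]:
            best = (threshold, accuracy)
    return best


# Made-up (name distance, is_match) pairs for two hypothetical countries.
country_a = [(0.05, True), (0.10, True), (0.20, True),
             (0.40, False), (0.60, False), (0.80, False)]
country_b = [(0.02, True), (0.04, True), (0.10, False),
             (0.20, False), (0.50, False), (0.70, False)]

print(best_threshold(country_a))              # (0.2, 1.0)
print(best_threshold(country_b))              # (0.04, 1.0)
print(best_threshold(country_a + country_b))  # pooled rule is no longer perfect
```

In this toy data a distance of 0.10 is a match in one country and a non-match in the other, so no single pooled threshold separates both, whereas a separate rule per country does.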

Table 3: Tree-based Model Performance for Different Countries.

Classifiers    Test       F1 score   F1 score   Precision  Precision  Recall     Recall
               accuracy   (class1)   (class2)   (class1)   (class2)   (class1)   (class2)

Malaysia
RandomForest   93.1%      95.2%      87.6%      96.0%      85.7%      94.4%      89.6%
AdaBoost       95.5%      96.9%      91.8%      97.2%      91.2%      96.6%      92.5%
GradientBoost  94.7%      96.4%      90.2%      96.1%      90.9%      96.6%      89.6%
XGBoost        94.7%      96.3%      90.5%      97.1%      88.6%      95.5%      92.5%

Indonesia
RandomForest   97.0%      98.1%      92.6%      97.6%      94.3%      98.6%      90.9%
AdaBoost       96.6%      97.8%      91.9%      98.1%      91.1%      97.6%      92.7%
GradientBoost  98.1%      98.8%      95.5%      99.0%      94.6%      98.6%      96.4%
XGBoost        97.7%      98.6%      94.5%      98.6%      94.5%      98.6%      94.5%

Philippines
RandomForest   97.0%      98.2%      90.1%      99.6%      83.7%      96.9%      97.6%
AdaBoost       96.3%      97.8%      88.2%      99.6%      80.4%      96.1%      97.6%
GradientBoost  96.7%      98.0%      89.1%      99.6%      82.0%      96.5%      97.6%
XGBoost        97.0%      98.2%      90.1%      99.6%      83.7%      96.9%      97.6%

Singapore
RandomForest   97.6%      98.6%      90.6%      98.3%      92.3%      98.9%      88.9%
AdaBoost       96.2%      97.8%      85.7%      98.3%      82.8%      97.2%      88.9%
GradientBoost  97.1%      98.3%      89.3%      98.9%      86.2%      97.8%      92.6%
XGBoost        97.6%      98.6%      90.9%      98.9%      89.3%      98.3%      92.6%

Merge all four countries
RandomForest   95.9%      97.5%      87.2%      98.4%      83.1%      96.6%      91.7%
AdaBoost       95.4%      97.2%      86.0%      98.6%      80.4%      95.9%      92.3%
GradientBoost  95.5%      97.3%      86.3%      98.7%      80.6%      95.9%      92.9%
XGBoost        95.6%      97.3%      86.6%      98.7%      81.0%      96.0%      92.9%

6 CHALLENGES WITH THAI

Matching restaurants and POIs in Thailand is more challenging than in the countries evaluated above, because of the non-Roman character set. Other major languages in countries where Grab operates that use non-Roman character sets include Burmese and Khmer, while Vietnamese script is based on Roman characters with many diacritical accent and tone marks. Each of these scripts is challenging for unfamiliar readers and for computers to process. This section considers issues with Thai that are somewhat representative of the kinds of problems encountered across different languages and especially different scripts.

6.1 Transliteration vs. Translation

Non-Roman character sets are more challenging not just because they make the characters harder for unfamiliar users, but because they influence the structure of names, which often contain repetitions of local and Roman character versions. For example, the name ‘Starbucks’ may appear in a title repeated in Roman / English and Thai characters, as in ‘Starbucks (สตาร์บัคส์)’, which makes string overlap techniques behave differently. This also illustrates the point that translating business names and addresses as if they were regular paragraph text often works badly: for example, a good translation of ‘Starbucks (สตาร์บัคส์)’ into English would not be ‘Starbucks (Starbucks)’. Proper names often should not be translated at all. For example, the (short) Thai name for Bangkok is ‘กรุงเทพ’ (Krung Thep), which means ‘city of angels’, but showing this translation to English speakers would possibly lead to confusion between Bangkok and Los Angeles! In practice, restaurants with the Thai name ‘กรุงเทพ’ (which are relatively common worldwide) are far more likely to be rendered as ‘Krung Thep’ for English speakers, which is a phonetic transliteration rather than a semantic translation.
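One possible preprocessing step for such repeated names, given here only as a hedged sketch and not as part of the method evaluated in this paper, is to separate the Roman and Thai parts of a name before computing string distances. The function name is hypothetical, and the approach assumes the local-script part can be identified by the Thai Unicode block alone.

```python
import re

# Thai Unicode block; other scripts would need their own ranges (assumption).
THAI_RUN = re.compile(r"[\u0E00-\u0E7F]+")
EMPTY_BRACKETS = re.compile(r"\(\s*\)")


def split_scripts(name):
    """Return (roman_part, thai_part) of a mixed-script place name."""
    thai_part = " ".join(THAI_RUN.findall(name))
    roman_part = THAI_RUN.sub(" ", name)
    roman_part = EMPTY_BRACKETS.sub(" ", roman_part)  # drop brackets left empty
    return " ".join(roman_part.split()), thai_part


roman, thai = split_scripts("Starbucks (สตาร์บัคส์)")
print(roman)  # Starbucks
print(thai)   # สตาร์บัคส์
```

Each part could then be compared against the corresponding part of a candidate name, although whether this helps in practice is not something the experiments here address.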

A more familiar example for Europeans is perhaps the famous Avenue des Champs-Élysées in Paris. To render this as ‘Elysian Fields Avenue’ on an English-language street map would be much more likely to confuse than to inform, partly because it is less familiar, and partly because it does not correspond to the physical street signs.

Understanding which phrases should be translated semantically, which should be transliterated between character sets, and which should be left alone appears to be an open area for research. It is at least fair to note that current machine translation systems are not designed to recognize and respond appropriately to these differences, and should not be trusted for these purposes without manual review. For reasons such as this, adapting our methods to Vietnamese and Thai (and eventually Burmese and Khmer) is left for future work.

Figure 3: Example of Restaurant-POI Pairs Predicted as Matched for Indonesia.

Figure 4: Feature Importance Scores in Classification Using Random Forest.

6.2 Non-symmetric Overlap Measures

The relationship between the name ‘Starbucks’ and ‘Starbucks (สตาร์บัคส์)’ is clearly an asymmetric containment relationship. Symmetric measures such as Levenshtein and Jaro do not naturally capture this (though as remarked above, Jaro gives more credit for matching substrings). In other work at Grab on machine translation and intent recognition for search, it is also apparent that the relationship between a source and a target (for example, a food query and a menu item) is often asymmetric. Symmetric distance measures can sometimes be enhanced by taking this into account: for example, smartphone users in informal situations often leave out diacritical marks in Vietnamese and vowels in Indonesian. A thorough analysis of such phenomena is beyond the scope of this paper.
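As a rough illustration of what an asymmetric measure could look like, the sketch below computes a token containment score: the fraction of the source name's tokens that also appear in the target name. This is a generic measure chosen for the example, not one evaluated in this paper, and the whitespace tokenization is a simplifying assumption.

```python
def tokens(name):
    """Lower-cased whitespace tokens with surrounding brackets and punctuation removed."""
    return {tok.strip("()[],.").lower() for tok in name.split()} - {""}


def containment(source, target):
    """Fraction of source tokens found in target; not symmetric in its arguments."""
    source_tokens = tokens(source)
    if not source_tokens:
        return 0.0
    return len(source_tokens & tokens(target)) / len(source_tokens)


short_name = "Starbucks"
long_name = "Starbucks (สตาร์บัคส์)"
print(containment(short_name, long_name))  # 1.0
print(containment(long_name, short_name))  # 0.5
```

The two directions give different scores, which is exactly the information a symmetric distance discards.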

7 CONCLUSIONS

This paper has demonstrated that relatively good matching results between restaurants and points of interest in Southeast Asia can be obtained by training tree-based classifiers iteratively on a few hundred pairs. This is an encouraging start. These results approximate human performance at finding matching pairs given available search tools. An open question this leaves is how close this gets to exhausting the matching possibilities: when no match for a restaurant is found in the POI dataset, does this indicate that matching needs to be improved, or that the POI data itself is lacking? If the latter is the cause of no-match restaurants, a next step may be simply to add the unmatched restaurants as POIs.

Figure 5: Spatial Distribution of Sampled Restaurants With/without Matched POIs in Singapore.

Many local questions can be asked about the effectiveness of raw string matching, because many address names vary systematically. For example, the word ‘Jalan’ in Malay languages (including Indonesian and Malaysian) means ‘Street’ and can be abbreviated as ‘Jl’ and ‘Jln’ without any change in meaning. Such variants should be considered identical for matching. This is also an example of a general question for knowledge discovery and natural language processing in Southeast Asia: when is it helpful to treat Indonesian and Malaysian separately, and when is it useful to treat them as part of the same language group?
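A normalization step of this kind could be sketched as follows. The function name and the example addresses are illustrative assumptions, and a real system would likely need a longer list of abbreviations than the two mentioned above.

```python
import re

# Matches the abbreviations 'Jl' and 'Jln' as whole words, with an optional trailing dot.
JALAN_ABBREVIATION = re.compile(r"\b(?:jl|jln)\b\.?", re.IGNORECASE)


def normalize_street(address):
    """Lower-case an address and expand 'Jl'/'Jln' to 'jalan'."""
    expanded = JALAN_ABBREVIATION.sub("jalan", address.lower())
    return " ".join(expanded.split())


# Illustrative variants of the same street address.
variants = ["Jl. Sudirman 12", "Jln Sudirman 12", "Jalan Sudirman 12"]
print({normalize_street(v) for v in variants})  # {'jalan sudirman 12'}
```

After normalization the three variants are identical strings, so any string distance between them is zero.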

Finally, we have worked only with the Southeast Asian countries and languages that use the Roman alphabet without extensive diacritic modifications (such as those in Vietnamese). Adapting this work to other countries with different alphabets and various transliteration methods will be important for extending coverage throughout the region.

ACKNOWLEDGEMENTS

The authors would like to thank colleagues from Grab

for guidance and help throughout this work, including

Xiaonan Lu, Kristin Tolle, Jagan Varadarajan, Wenjie

Xu, Sien Yi Tan, Sidi Chang, and Jacob Lucas.
