embedding, POS tags, letter cases and prepositions
are considered as well to augment the model.
2 RELATED WORK
The extraction of location mentions from text is a
long-studied problem, since text is one of the most
common forms of encoding geographic information.
In this process, however, the geo-referents of location
mentions are often ambiguous, and even the boundaries
of location mentions in text are difficult to recognize
without sufficient background information, especially
when a place name is abbreviated for brevity. Place
name extraction naturally falls into two sub-problems:
entity delimitation and toponym disambiguation. The
former delimits the boundaries of a place name, which
is the focus of our study, while the latter determines the
most probable geo-referent of the place name.
While entity delimitation can be solved by
matching against a gazetteer (Sultanik and Fink, 2012)
or by hard-coded rules (Cunningham et al., 2001),
current NER tools and systems are generally more
powerful at extracting location mentions from formal
texts in terms of recall. One of the most renowned
NER systems is Stanford NER (Finkel, Grenager and
Manning, 2005), in which a Conditional Random Field
(CRF) model incorporates long-distance features to
identify named entities, including locations. Some
researchers have also attempted to further augment
the NER capability of extracting locations by
constructing a gazetteer from word clusters (Kazama
and Torisawa, 2008). Web NER aims to separate
complex place names from web pages by utilizing
capitalization cues and lexical statistics (Downey,
Broadhead and Etzioni, 2007).
However, NER systems have shown unsatisfactory
performance on social media messages (Bontcheva et
al., 2013), mainly owing to informal and irregular
expressions as well as the shortness of the texts.
Outstanding performance has been achieved by
Stanford NER when it is retrained on annotated
tweets (Lingad, Karimi and Yin, 2013), but annotating
social media messages is time-consuming and
impractical for large-scale text processing.
Meanwhile, LSTM is becoming a popular choice for
NER owing to its suitability for sequence
classification. Bi-LSTM-CRF networks using
contextualized embeddings were employed for
chemical NER (Awan et al., 2019). An exploratory
study on Indonesian Twitter posts (Rachman et al.,
2018) used LSTM for NER and experimented on a
small dataset, achieving an F1-score of 0.81 for
location recognition.
Recent years have witnessed many efforts to
address the problem of extracting location mentions
from social media streams. A model to predict the
occurrence of location mentions in a tweet was
introduced, and it was found that this preliminary step
can enhance the accuracy of entity delimitation
(Hoang and Mothe, 2018). A statistical language
model was built in (Al-Olimat et al., 2017) on top of
augmented and filtered region-specific gazetteers
from online resources such as OpenStreetMap (OSM)
to extract place names from tweets; it achieved an
F1-score of 0.85 while requiring no training data.
Unlike the studies above, we mainly focus on
location mentions referring to small-scale geo-entities,
which rarely appear in most gazetteers. Moreover, a
deep learning model is constructed and trained on
tweets annotated by Stanford NER, without manual
annotation.
3 METHODOLOGY
3.1 Model
We used a Bidirectional LSTM (Bi-LSTM) model,
which belongs to the category of Recurrent Neural
Networks (RNNs), to label whether or not a word is an
element of a location mention. An RNN is an
appropriate approach for sequence classification due
to its capability of passing the output of one node to
its successor, which can be interpreted as the
influence of a word on the following word. LSTM is
a specialized RNN that performs better on long
sequences, whereas in a plain RNN the impact of
earlier inputs decays as gradients vanish. LSTM
employs a gating mechanism, including a forget gate,
to control the information flow in the network.
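For reference, in the standard LSTM formulation (the notation here is conventional, not taken from this paper), the forget gate $f_t$ determines how much of the previous cell state is retained:

```latex
f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f), \qquad
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
```

When $f_t$ is close to 1, the previous cell state $c_{t-1}$ passes through almost unchanged, which is what allows information (and gradients) to survive over long sequences; when it is close to 0, the accumulated state is forgotten.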
Building on LSTM, a Bi-LSTM model is trained
in two directions over the input sequence, forward and
backward, providing comprehensive context
information about the target word. Bi-LSTM models
have been applied to NER tasks and have shown
competitive performance on benchmark datasets
(Chiu and Nichols, 2016).
As seen in Figure 1, features of the input sequence
are passed to the forward and backward LSTM layers
respectively in our model; the two outputs are
subsequently concatenated and passed to a fully
connected layer. Finally, a softmax layer outputs the
probability distribution over all class labels.
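To make the pipeline concrete, the following is a minimal NumPy sketch of this forward/backward/concatenate/fully-connected/softmax architecture. All dimensions, initializations, and function names are illustrative assumptions rather than the paper's implementation, and training is omitted:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step: input, forget, output gates and candidate cell state."""
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b          # stacked pre-activations, shape (4H,)
    i = sigmoid(z[:H])                  # input gate
    f = sigmoid(z[H:2*H])               # forget gate
    o = sigmoid(z[2*H:3*H])             # output gate
    g = np.tanh(z[3*H:])                # candidate cell state
    c = f * c_prev + i * g
    h = o * np.tanh(c)
    return h, c

def run_lstm(xs, W, U, b, hidden, reverse=False):
    """Run an LSTM over the sequence, optionally right-to-left."""
    h, c = np.zeros(hidden), np.zeros(hidden)
    hs = []
    for x in (reversed(xs) if reverse else xs):
        h, c = lstm_step(x, h, c, W, U, b)
        hs.append(h)
    return hs[::-1] if reverse else hs  # realign backward states with tokens

def bilstm_tag(xs, fwd, bwd, W_fc, b_fc, hidden):
    """Concatenate forward/backward states, apply FC layer + softmax per token."""
    hs_f = run_lstm(xs, *fwd, hidden)
    hs_b = run_lstm(xs, *bwd, hidden, reverse=True)
    return [softmax(W_fc @ np.concatenate([hf, hb]) + b_fc)
            for hf, hb in zip(hs_f, hs_b)]

# Toy dimensions (hypothetical, for illustration only)
rng = np.random.default_rng(0)
D, H, C = 5, 4, 2                       # feature dim, hidden size, classes
make = lambda *s: rng.standard_normal(s) * 0.1
fwd = (make(4*H, D), make(4*H, H), make(4*H))
bwd = (make(4*H, D), make(4*H, H), make(4*H))
W_fc, b_fc = make(C, 2*H), make(C)

sentence = [make(D) for _ in range(6)]  # 6 tokens of random feature vectors
probs = bilstm_tag(sentence, fwd, bwd, W_fc, b_fc, H)
```

Each token thus receives a probability distribution over the class labels (here, two classes for in-location vs. not-in-location); at prediction time the argmax per token would give the label sequence.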