Language Identiﬁcation of Similar Languages using Recurrent Neural

Networks

Ermelinda Oro

, Massimo Ruffolo

and Mostafa Sheikhalishahi

Institute for High Performance Computing and Networking, National Research Council (CNR),

Via P. Bucci 41/C, 87036, Rende (CS), Italy

Fondazione Bruno Kessler, e-Health Research Unit, Trento, Italy

Keywords:

Language Identiﬁcation, Word Embedding, Natural Language Processing, Deep Neural Network, Long

Short-Term Memory, Recurrent Neural Network.

Abstract:

The goal of similar Language IDentiﬁcation (LID) is to quickly and accurately identify the language of the

text. It plays an important role in several Natural Language Processing (NLP) applications where it is fre-

quently used as a pre-processing technique. For example, information retrieval systems use LID as a ﬁltering

technique to provide users with documents written only in a given language. Although different approaches

to this problem have been proposed, similar language identiﬁcation, in particular applied to short texts, re-

mains a challenging task in NLP. In this paper, a method that combines word vectors representation and Long

Short-Term Memory (LSTM) has been implemented. The experimental evaluation on public and well-known

datasets has shown that the proposed method improves accuracy and precision of language identiﬁcation tasks.

1 INTRODUCTION

Many approaches of Natural Language Processing

(NLP), such as part-of-speech taggers and parsers, as-

sume that the language of input texts is already given

or recognized by a pre-processing step. Language

IDentiﬁcation (LID) is the task of determining the

language of a given input (written or spoken). Re-

search in LID aims to imitate the human ability to

identify the language of the input. In literature, differ-

ent approaches to LID have been presented. But, LID,

in particular applied to short text, remains an open is-

sue.

The objective of this paper is to present a LID

model, applied to the written text, that results

enough effective and accurate to discriminate sim-

ilar languages, even when it is applied to short

texts. The proposed method combines Word2vec

(Mikolov et al., 2013) representation and Long Short-

Term Memory (LSTM) (Hochreiter and Schmidhu-

ber, 1997). The experimental evaluation shows that

the proposed method obtains better results compared

to approaches presented in the literature.

The main contributions of the paper are:

• Deﬁnition of a new LID method that combines

the word vector representation (Word2vec) and

the classiﬁcation based on neural network (LSTM

RNN).

• Building of a Word2vec representation by using

Wikipedia Corpus.

• Creation of a dataset extracting data from

Wikipedia for Serbian and Croatian Language,

which aren’t yet available in literature.

• Experimental evaluation on public datasets in lit-

erature.

The rest of this article is organized as follows:

Section 2 describes related work, Section 3 shows

the proposed model and Section 4 presents the exper-

imental evaluation, Finally, Section 5 concludes the

paper.

2 RELATED WORK

In this section, the most recent methods aimed to

identify the language in texts is reviewed.

The lack of standardized datasets and evaluation

metrics in LID research makes very difﬁcult to con-

trast the relative effectiveness of the different ap-

proaches to a text representation. Results across dif-

ferent datasets are generally not comparable, as a

methods efﬁcacy can vary substantially with param-

eters such as the number of languages considered, the

Oro, E., Ruffolo, M. and Sheikhalishahi, M.

Language Identiﬁcation of Similar Languages using Recurrent Neural Networks.

DOI: 10.5220/0006678606350640

In Proceedings of the 10th International Conference on Agents and Artiﬁcial Intelligence (ICAART 2018) - Volume 2, pages 635-640

ISBN: 978-989-758-275-2

635

relative amounts of training data and the length of the

test documents (Han et al., 2011). For this reason, we

are particularly interested in related work that makes

available datasets and evaluation metrics enabling ex-

perimentally comparison.

Malmasi and Dras (Malmasi and Dras, 2015) pre-

sented the ﬁrst experimental study to distinguish be-

tween Persian and Dari languages at the sentence

level. They used Support Vector Machine (SVM) and

n-grams of characters and word to classify languages.

For the experimental evaluation, the authors collected

textual news from Voice of America website.

Mathur et al. (Mathur et al., 2017) presented a

method based on Recurrent Neural Networks (RNNs)

and, as feature set, they used word unigrams and char-

acter n-grams. For the experimental evaluation, the

authors used the dataset DSL 2015

(Tan et al., 2014).

Pla and Hurtado (Pla and Hurtado, 2017) applied

a language identiﬁcation method based on SVM to

tweets. They used the bag-of-words model to rep-

resent each tweet as a feature vector containing the

tf-idf factors of selected features. They considered

a wide set of features, such as tokens, n-grams, and

n-grams of characters. For the evaluation of the im-

plemented system, they used the TweetLID ofﬁcial

corpus, which contains multilingual tweets

(Zubiaga

et al., 2016).

Trieschnigg et al. (Trieschnigg et al., 2012) com-

pared a number of methods to automatic language

identiﬁcation. They used a number of classiﬁcation

methods based on the Nearest Neighbor (NN) and

Nearest Prototype (NP) in combination with the co-

sine similarity metric. To perform the experimen-

tal evaluation, they used the Dutch folktale database,

a large collection of folktales in primarily Dutch,

Frisian and a large variety of Dutch dialects.

Ljube

sic and Kranjcic (Ljube

sic and Kranjcic,

2014), using discriminative models, handled the prob-

lem of distinguishing among similar south-Slavic lan-

guage such as Bosnian, Croatian, Montenegrin and

Serbian languages in Twitter. However, they did not

identify the language on the tweet level, but the user

level. The tweets collection has been collected with

the TweetCat tool, they annotated a subset of 500

users according to language that the user’s tweet in.

They attempt with the traditional classiﬁers such as

Gaussian Nave Bayes (GNB), K-Nearest Neighbor

(KNN), Decision Tree (DT) and linear Support Vector

Machine(SVM), as well as classiﬁer ensembles such

as Ada-Boost and random forests. They observe that

each set of features produces very similar results.

http://ttg.uni-saarland.de/lt4vardial2015/dsl.html

http://komunitatea.elhuyar.eus/tweetlid/

Table 1 summarizes the comparison among the

considered related work. Each row of the Table 1 has

Table 1: Comparison of Related Work.

Related Work Algorithm Granularity

(Trieschnigg et al., 2012) Nearest Neighbor Document

(Pla and Hurtado, 2017) SVN Tweets

(Ljube

sic and Kranjcic, 2014) SVM, KNN, RF Tweets

(Malmasi and Dras, 2015) Ensemble SVN Sentence

(Mathur et al., 2017) RNN Sentence

the reference to the related work. The second col-

umn shows the used classiﬁcation algorithm (such as

Nave Bayes, KNN, SVM, Random Forest and Recur-

rent Neural Network). In the third column is indicated

the processed input, i.e., document, sentence or tweet.

Documents can have different lengths (both short and

long). All approaches use, as extracted features, both

character and word n-grams.

Compared to related work, we exploited different

ways to represent input features (i.e., character and

word n-gram vs word embedding model) and to clas-

sify the language (we used LSTM RNN method). In

our experiments, we used the datasets exploited in

(Malmasi and Dras, 2015; Pla and Hurtado, 2017) and

(Mathur et al., 2017) because they are publicly avail-

able and we can, in a straightforward way, compare

results.

3 PROPOSED MODEL

In this section, we present the proposed method that

combines Word2vec with LSTM recurrent neural net-

works.

Figure 1 illustrates an overview of the proposed

LID model.

Figure 1: Proposed model.

ICAART 2018 - 10th International Conference on Agents and Artiﬁcial Intelligence

636

First, Wikipedia the text corpus of each target lan-

guage are collected. After the pre-processing, the text

is fed to Word2vec that outputs a list of vectors re-

lated to each word contained in the input text. Then,

a lookup table that matches each vocabulary word of

the dataset with its related vector is obtained. During

the training phase, the classiﬁer, which corresponds

to an LSTM RNN, takes as input the vectors of the

dataset. After the training of the classiﬁer, we per-

form the test phase that takes as input the test set. Fi-

nally, the accuracy and precision of the built model

are computed.

3.1 Word2vec

The Distributional hypothesis says that words occur-

ring in the same or similar contexts tend to convey

similar meaning (Harris, 1954).

There are many approaches to computing seman-

tic similarity between words based on their distribu-

tion in a corpus. Word2vec (Mikolov et al., 2013) are

model architectures for computing continuous vector

representations of words from very large data sets.

Such vector representations are capable to ﬁnd simi-

larity of words not just considering syntactic regulari-

ties, but also contextual information. Word2vec takes

a large corpus of text as input for training and pro-

duces a set of vectors called embeddings, normally

having several hundred dimensions, with each unique

word in the corpus. Given enough quantity of data,

usage, and contexts, this model can make highly ac-

curate guesses about a words meaning based on past

appearances. Word2vec produces word embeddings

in one of two ways:

• Using context to predict a target word, a method

known as ”continuous bag of words” (CBOW)

• Using a word to predict a target context, which is

called Skip-gram.

To generate our word embeddings, we chose the

CBOW method to train our Word2vec model be-

cause the training time is less than Skip-gram and the

CBOW architecture works slightly better on the syn-

tactic task than the Skip-gram model.

To feed the word2vec model, we prepared a com-

plete corpus that considers every target language by

using Wikipedia. First, we downloaded the wiki

dump of each target language available on Wikipedia.

Wikipedia provides static dumps of the complete con-

tents of all wiki

exported automatically following a

rotating export schedule. The contents of these dumps

are licensed under the GNU Free. In particular, in

April 2017, we obtained XML dumps of Wikipedia

http://dumps.wikimedia.org/backup-index.html

with valid ISO 639-1 codes, giving us Wikipedia

database exports for six languages (Persian, Span-

ish, Macedonian, Bulgarian, Bosnian and Croatian)

target of this work. We discarded exports that con-

tained less than 50 document. For each language,

we randomly selected 40,000 raw pages of at least

500 bytes in length by using the WikiExtractor python

script

.The script removes images, tables, references,

and lists. By using another script, we removed links.

We removed the stop-words. Then, we tokenized the

cleaned text. Finally, we were able to use the obtained

corpus to learn the vector representation of words in

the different considered languages.

3.2 Long Short-Term Memory

Recurrent Neural Networks (RNNs)(Mikolov et al.,

2010) are a special type of neural networks which

have an internal state by virtue of a cycle in their

hidden units. Therefore, RNNs are able to record

temporal dependencies among the input sequence, as

opposed to most other machine learning algorithms

where the inputs are considered independent of each

other. For this reason, they are very well suited to

natural language processing tasks and have been suc-

cessfully used for applications like speech recogni-

tion, handwriting recognition (Graves and Schmidhu-

ber, 2009; Graves, 2012; Graves et al., 2013)

Until recently, RNNs were considered very dif-

ﬁcult to train because of the problem of exploding

or vanishing gradients (Pascanu et al., 2013) which

makes it very difﬁcult for them to learn long se-

quences of input. Few methods like gradient clip-

ping have been proposed to remedy this (Neelakan-

tan et al., 2015). Recently developed architectures

of RNNs such as Long Short-Term Memory (LSTM)

(Hochreiter and Schmidhuber, 1997) and Gated Re-

current Unit (GRU) (Cho et al., 2014) were also

speciﬁcally designed to get around this problem.

In our implementation, we used LSTM neural net-

work. This choice is due to the capability of:

• managing vanishing or exploding gradients prob-

lems,

• handling the long time series of data by managing

different time steps,

• identifying patterns in sequences of data, such as

genomes and text.

In our experiments, we used single hidden layer recur-

rent neural networks that use Long Short-Term Mem-

ory introduced by Graves in (Graves, 2012). Using

a good weight initialization brings substantially faster

http://medialab.di.unipi.it/wiki/Wikipedia Extractor

Language Identiﬁcation of Similar Languages using Recurrent Neural Networks

637

convergence. We use the recently developed weight

initialization method which is introduced by Glorot et

al. (Glorot and Bengio, 2010).

We implemented our approach by using the

Deeplearning4j

. It is a Java-based deep learning li-

brary developed. It is an open source product made

for adaptability in business and released under the

Apache 2.0 license. The library provides several tools

for text pre-processing, including tokenization, and

neural networks implementations. We used this tool

also to build the Word2vec and the LSTM RNN mod-

els.

4 EXPERIMENTAL EVALUATION

This Section presents the experimental evaluation

performed on a new dataset and some well-know

datasets. We show that the proposed method improves

language identiﬁcation results.

4.1 Datasets

This Subsection describes datasets used to experi-

mentally evaluate the proposed method. As observed

in (Malmasi and Dras, 2015), short sentences are

more difﬁcult to classify with respect to longer ones.

In fact, may not exist enough distinguishing features

if a sentence is too short, and conversely, very long

texts will likely have more features that facilitate cor-

rect classiﬁcation. Therefore, to well evaluate our

method, we considered four datasets having different

lengths of input documents, named: Wikipedia, VOA,

DSL 2015, and TweetID. Statistics are shown in Ta-

ble 2.

Table 2: Datasets statistics.

Dataset Languages Train. Size Test Size Doc Length

(# of docs) (# of docs) (# of words)

Wikipedia

Serbian 3000 1000 5-2500

Croatian 3000 1000 5-2500

VOA

Persian 3000 1000 5-55

Dari 3000 1000 5-55

DSL-2015

Bulgarian 18000 2000 22-80

Macedonian 18000 2000 22-80

TweetID

Spanish 1170 1170 1-25

Catalan 1190 1170 1-15

4.1.1 Wikipedia Dataset

A ﬁrst dataset was created by extracting articles from

Wikipedia. We considered two similar languages:

Serbian and Croatian. As shown in Table 2, we lim-

ited the maximum length of each document to 2500

https://deeplearning4j.org//

words. We randomly collected a set of 4000 articles

per language (3000 training and 1000 evaluation).

4.1.2 VOA Dataset

Malmasi and Dras (Malmasi and Dras, 2015) cre-

ated a dataset extracting sentences from the Voice of

America (VOA) website

for Persian and Dari lan-

guages. VOA is an international multimedia broad-

caster with a service in more than 40 languages. It

provides news, information, and cultural program-

ming through the Internet, mobile and social media,

radio, and television. The Persian and Dari are sim-

ilar languages, they are part of the eastern branch of

the Indo-European language family and Dari is a low-

resourced language. The authors collected sentences

in the range of 5-55 tokens in order to maintain a bal-

ance between short and long sentences. For this study,

as shown in Table 2, we have considered a subset of

their dataset, which includes 4000 sentences for each

language.

4.1.3 DSL 2015 Dataset

As Mathur et al. (Mathur et al., 2017) done, we

used the dataset created for the language compe-

tition, named Discriminating between Similar Lan-

guage (DSL) Shared Task 2015. It includes a set

of 20000 instances per language (18000 training and

2000 evaluation) and it was provided for 13 differ-

ent world languages. In particular, as shown in Ta-

ble 2, we considered the similar languages Bulgarian

and Macedonian.

4.1.4 TweetID Dataset

Zubiaga et al. (Zubiaga et al., 2016) collected a

dataset of tweets including seven languages. They ex-

ploited the geo-location to retrieve posted from areas

of interest. In this work, we considered a subset of

Spanish, Catalan tweets. In Table 2 the distribution of

train and test dataset is shown in details.

4.2 Results

In this subsection, we show the evaluation results of

the presented method that enables to identify the lan-

guage of each input text. We trained each model using

the monolingual training dataset, presented in the pre-

vious subsection and veriﬁed results considering the

test set of the same data sources. Because identifying

the languages of each input document is a classiﬁca-

tion problem, we evaluate results by using the stan-

dard notions of accuracy (A) precision (P), recall (R)

https://www.voanews.com/

ICAART 2018 - 10th International Conference on Agents and Artiﬁcial Intelligence

638

and F-score (F). Results are compared with the val-

ues and measures asserted in the papers of the related

work. Results are summarized in Table 3, where for

the approaches existing in literature are shown only

results published in the original papers.

Table 3: Comparison between proposed model (LID) and

related work, considering different datasets.

Dataset Model A P R F

Wikipedia our approach 97.65 97.74 97.90 97.82

VOA

our approach 98.65 98.89 98.40 98.64

(Malmasi and Dras, 2015) 96.00

DLS-2015

our approach 99.50 99.40 99.60 99.50

(Mathur et al., 2017) 95.12

TweetID

our approach 88.37 87.40 89.09 88.23

(Zubiaga et al., 2016) 82.5 74.4 78.2

As shown in Table 3, our proposed method ob-

tained around 97% of accuracy and F-measure con-

sidering the collected Wikipedia’s documents, which

contains documents written in Serbian and Croat-

ian. Our method outperformed the method presented

in (Malmasi and Dras, 2015) considering the same

dataset that contains sentences in Persian and Dari

languages. The improvement is more than 2% of

accuracy. Our method outperformed the method

presented in (Mathur et al., 2017) considering the

same dataset that contains sentences in Bulgarian and

Macedonian languages. The improvement is more

than 4% of accuracy. Our method performs better

than the model that is proposed by (Zubiaga et al.,

2016) and the improvement is about 5% using accu-

racy and is around 12% in F1-Measure as evaluation

metrics on the same dataset that contains tweets writ-

ten in Catalan and Spanish languages.

5 CONCLUSION

In this paper, we presented a method based on neu-

ral networks to identify the language of a given doc-

ument. The method is able to distinguish between

similar languages, even when the input documents are

short texts, like tweets.

The proposed model has been compared with

prior works considering same languages. Experimen-

tal evaluation shows that the proposed method obtain

better results. However, we intend to more datasets to

evaluate our method and work further and deeply on

statistical analysis.

There are several modiﬁcations that could be

tested to improve the proposed method. For exam-

ple, other features extraction techniques, or other re-

cent neural network based classiﬁer, or more datasets

could be used. This work has just shown how the

combination of recent deep learning and vector rep-

resentation techniques allows to getting better results

on the problem of language identiﬁcation of (short)

texts.

REFERENCES

Cho, K., Van Merri

enboer, B., Gulcehre, C., Bahdanau, D.,

Bougares, F., Schwenk, H., and Bengio, Y. (2014).

Learning phrase representations using rnn encoder-

decoder for statistical machine translation. arXiv

preprint arXiv:1406.1078.

Glorot, X. and Bengio, Y. (2010). Understanding the dif-

ﬁculty of training deep feedforward neural networks.

In Proceedings of the Thirteenth International Con-

ference on Artiﬁcial Intelligence and Statistics, pages

249–256.

Graves, A. (2012). Supervised sequence labelling with re-

current neural networks, volume 385. Springer.

Graves, A., Mohamed, A.-r., and Hinton, G. (2013). Speech

recognition with deep recurrent neural networks. In

Acoustics, speech and signal processing (icassp),

2013 ieee international conference on, pages 6645–

6649. IEEE.

Graves, A. and Schmidhuber, J. (2009). Ofﬂine handwrit-

ing recognition with multidimensional recurrent neu-

ral networks. In Advances in neural information pro-

cessing systems, pages 545–552.

Han, B., Lui, M., and Baldwin, T. (2011). Melbourne lan-

guage group microblog track report. In TREC.

Harris, Z. S. (1954). Distributional structure. Word, 10(2-

3):146–162.

Hochreiter, S. and Schmidhuber, J. (1997). Long short-term

memory. Neural computation, 9(8):1735–1780.

Ljube

sic, N. and Kranjcic, D. (2014). Discriminating be-

tween very similar languages among twitter users.

In Proceedings of the Ninth Language Technologies

Conference, pages 90–94.

Malmasi, S. and Dras, M. (2015). Automatic language iden-

tiﬁcation for persian and dari texts. In Proceedings of

PACLING, pages 59–64.

Mathur, P., Misra, A., and Budur, E. (2017). Lide: Lan-

guage identiﬁcation from text documents. arXiv

preprint arXiv:1701.03682.

Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013).

Efﬁcient estimation of word representations in vector

space. arXiv preprint arXiv:1301.3781.

Mikolov, T., Karaﬁ

at, M., Burget, L., Cernock

y, J., and

Khudanpur, S. (2010). Recurrent neural network

based language model. In Interspeech, volume 2,

page 3.

Neelakantan, A., Vilnis, L., Le, Q. V., Sutskever, I., Kaiser,

L., Kurach, K., and Martens, J. (2015). Adding gra-

dient noise improves learning for very deep networks.

arXiv preprint arXiv:1511.06807.

Pascanu, R., Mikolov, T., and Bengio, Y. (2013). On the

difﬁculty of training recurrent neural networks. In In-

ternational Conference on Machine Learning, pages

1310–1318.

Language Identiﬁcation of Similar Languages using Recurrent Neural Networks

639

Pla, F. and Hurtado, L.-F. (2017). Language identiﬁcation of

multilingual posts from twitter: a case study. Knowl-

edge and Information Systems, 51(3):965–989.

Tan, L., Zampieri, M., Ljube

sic, N., and Tiedemann, J.

(2014). Merging comparable data sources for the dis-

crimination of similar languages: The dsl corpus col-

lection. In Proceedings of the 7th Workshop on Build-

ing and Using Comparable Corpora (BUCC), pages

11–15.

Trieschnigg, D., Hiemstra, D., Theune, M., Jong, F., and

Meder, T. (2012). An exploration of language identi-

ﬁcation techniques for the dutch folktale database.

Zubiaga, A., San Vicente, I., Gamallo, P., Pichel, J. R.,

Alegria, I., Aranberri, N., Ezeiza, A., and Fresno, V.

(2016). Tweetlid: a benchmark for tweet language

identiﬁcation. Language Resources and Evaluation,

50(4):729–766.

ICAART 2018 - 10th International Conference on Agents and Artiﬁcial Intelligence

640