and F-score (F). Results are compared with the values reported in the papers of the related work. Results are summarized in Table 3; for the approaches existing in the literature, only the results published in the original papers are shown.
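For reference, the standard definitions we assume for these metrics, computed from per-class true positives (TP), false positives (FP), and false negatives (FN), are

\[ P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F = \frac{2PR}{P + R}, \]

while accuracy A is the fraction of correctly classified documents.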
Table 3: Comparison between the proposed model (LID) and related work on different datasets.

Dataset     Model                       A      P      R      F
Wikipedia   our approach                97.65  97.74  97.90  97.82
VOA         our approach                98.65  98.89  98.40  98.64
            (Malmasi and Dras, 2015)    96.00  –      –      –
DSL-2015    our approach                99.50  99.40  99.60  99.50
            (Mathur et al., 2017)       95.12  –      –      –
TweetLID    our approach                88.37  87.40  89.09  88.23
            (Zubiaga et al., 2016)      82.5   74.4   –      78.2
As shown in Table 3, our proposed method obtained about 97.7% accuracy and 97.8% F-measure on the collected Wikipedia documents, which are written in Serbian and Croatian. Our method outperformed the method presented in (Malmasi and Dras, 2015) on the same dataset, which contains sentences in the Persian and Dari languages; the improvement is more than 2% in accuracy. It also outperformed the method presented in (Mathur et al., 2017) on the same dataset, which contains sentences in the Bulgarian and Macedonian languages; the improvement is more than 4% in accuracy. Finally, our method performs better than the model proposed by (Zubiaga et al., 2016) on the same dataset, which contains tweets written in the Catalan and Spanish languages; the improvement is about 5% in accuracy and around 10% in F-measure.
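The paper does not state which tooling computed these scores; the following is a minimal sketch, assuming scikit-learn and macro-averaging over the classes (the averaging scheme is our assumption), of how the A, P, R, and F values in Table 3 could be obtained from gold labels and predictions.

# Minimal sketch (assumptions: scikit-learn, macro-averaged metrics).
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Toy gold labels and predictions for two similar languages.
y_true = ["sr", "hr", "sr", "hr", "sr", "hr"]
y_pred = ["sr", "hr", "hr", "hr", "sr", "hr"]

a = accuracy_score(y_true, y_pred)
p, r, f, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")
print(f"A={a:.2%}  P={p:.2%}  R={r:.2%}  F={f:.2%}")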
5 CONCLUSION
In this paper, we presented a method based on neu-
ral networks to identify the language of a given doc-
ument. The method is able to distinguish between
similar languages, even when the input documents are
short texts, like tweets.
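As a shape-level illustration only, a recurrent classifier of this general kind can be sketched in a few lines of Keras; the layer sizes, vocabulary size, and sequence length below are placeholder assumptions, not the configuration actually used in this work.

# Illustrative sketch of a recurrent language-identification classifier.
# All sizes are assumed placeholders, not the authors' configuration.
import numpy as np
from tensorflow.keras import layers, models

MAX_LEN = 140   # assumed: roughly tweet-length character sequences
VOCAB = 128     # assumed character vocabulary size
N_CLASSES = 2   # e.g., two similar languages

model = models.Sequential([
    layers.Embedding(input_dim=VOCAB, output_dim=64),  # character embeddings
    layers.LSTM(128),                                  # recurrent encoder
    layers.Dense(N_CLASSES, activation="softmax"),     # language posterior
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Toy run with random data, only to show the expected tensor shapes.
x = np.random.randint(0, VOCAB, size=(32, MAX_LEN))
y = np.random.randint(0, N_CLASSES, size=(32,))
model.fit(x, y, epochs=1, verbose=0)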
The proposed model has been compared with prior works considering the same languages. The experimental evaluation shows that the proposed method obtains better results. However, we intend to evaluate our method on more datasets and to carry out a deeper statistical analysis.
There are several modifications that could be tested to improve the proposed method. For example, other feature extraction techniques, more recent neural-network-based classifiers, or additional datasets could be used. This work has shown how the combination of recent deep learning and vector representation techniques allows obtaining better results on the problem of language identification of (short) texts.
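As one concrete illustration of the vector representation component, word embeddings in the spirit of (Mikolov et al., 2013) can be trained with an off-the-shelf library; the sketch below uses gensim, which is our assumption about tooling rather than the setup used in this work.

# Sketch of training word vectors (Mikolov et al., 2013) with gensim.
# Library choice and hyperparameters are illustrative assumptions.
from gensim.models import Word2Vec

# Toy tokenized corpus; in practice, the training documents would be used.
sentences = [["ovo", "je", "primjer"],
             ["ovo", "je", "primer"]]

model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)
vector = model.wv["ovo"]  # 100-dimensional embedding of a token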
REFERENCES
Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
Glorot, X. and Bengio, Y. (2010). Understanding the dif-
ficulty of training deep feedforward neural networks.
In Proceedings of the Thirteenth International Con-
ference on Artificial Intelligence and Statistics, pages
249–256.
Graves, A. (2012). Supervised sequence labelling with re-
current neural networks, volume 385. Springer.
Graves, A., Mohamed, A.-r., and Hinton, G. (2013). Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6645–6649. IEEE.
Graves, A. and Schmidhuber, J. (2009). Offline handwrit-
ing recognition with multidimensional recurrent neu-
ral networks. In Advances in neural information pro-
cessing systems, pages 545–552.
Han, B., Lui, M., and Baldwin, T. (2011). Melbourne lan-
guage group microblog track report. In TREC.
Harris, Z. S. (1954). Distributional structure. Word, 10(2-
3):146–162.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term
memory. Neural computation, 9(8):1735–1780.
Ljubešić, N. and Kranjčić, D. (2014). Discriminating between very similar languages among Twitter users. In Proceedings of the Ninth Language Technologies Conference, pages 90–94.
Malmasi, S. and Dras, M. (2015). Automatic language identification for Persian and Dari texts. In Proceedings of PACLING, pages 59–64.
Mathur, P., Misra, A., and Budur, E. (2017). LIDE: Language identification from text documents. arXiv preprint arXiv:1701.03682.
Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013).
Efficient estimation of word representations in vector
space. arXiv preprint arXiv:1301.3781.
Mikolov, T., Karafiát, M., Burget, L., Černocký, J., and Khudanpur, S. (2010). Recurrent neural network based language model. In Interspeech, volume 2, page 3.
Neelakantan, A., Vilnis, L., Le, Q. V., Sutskever, I., Kaiser,
L., Kurach, K., and Martens, J. (2015). Adding gra-
dient noise improves learning for very deep networks.
arXiv preprint arXiv:1511.06807.
Pascanu, R., Mikolov, T., and Bengio, Y. (2013). On the
difficulty of training recurrent neural networks. In In-
ternational Conference on Machine Learning, pages
1310–1318.