Authors: Ermelinda Oro 1; Massimo Ruffolo 1 and Mostafa Sheikhalishahi 2
Affiliations: 1 National Research Council (CNR), Italy; 2 Fondazione Bruno Kessler, Italy
Keyword(s):
Language Identification, Word Embedding, Natural Language Processing, Deep Neural Network, Long Short-Term Memory, Recurrent Neural Network.
Related Ontology Subjects/Areas/Topics:
Applications; Artificial Intelligence; Biomedical Engineering; Biomedical Signal Processing; Computational Intelligence; Evolutionary Computing; Health Engineering and Technology Applications; Human-Computer Interaction; Knowledge Discovery and Information Retrieval; Knowledge Engineering and Ontology Development; Knowledge-Based Systems; Machine Learning; Methodologies and Methods; Natural Language Processing; Neural Networks; Neurocomputing; Neurotechnology, Electronics and Informatics; Pattern Recognition; Physiological Computing Systems; Sensor Networks; Signal Processing; Soft Computing; Symbolic Systems; Theory and Methods
Abstract:
The goal of Language IDentification (LID) is to quickly and accurately identify the language of a text. LID plays an important role in several Natural Language Processing (NLP) applications, where it is frequently used as a pre-processing step. For example, information retrieval systems use LID as a filter to provide users only with documents written in a given language. Although different approaches to this problem have been proposed, identifying similar languages, in particular in short texts, remains a challenging NLP task. In this paper, a method that combines word vector representations and Long Short-Term Memory (LSTM) networks has been implemented. Experimental evaluation on public and well-known datasets shows that the proposed method improves the accuracy and precision of language identification tasks.
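To make the described architecture concrete, the following is a minimal sketch of an embedding + LSTM classifier for language identification, written in pure Python. All names and dimensions (`TinyLSTMClassifier`, `vocab_size=30`, `hidden_dim=16`, etc.) are hypothetical illustrations, not the paper's actual implementation; it shows only the forward pass (embedding lookup, LSTM recurrence over the token sequence, softmax over candidate languages), with random untrained weights.

```python
import math
import random

random.seed(0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class TinyLSTMClassifier:
    """Illustrative embedding + LSTM language classifier (forward pass only)."""

    def __init__(self, vocab_size, embed_dim, hidden_dim, n_langs):
        rnd = lambda n, m: [[random.uniform(-0.1, 0.1) for _ in range(m)]
                            for _ in range(n)]
        self.embed = rnd(vocab_size, embed_dim)          # token embedding table
        # One weight matrix per gate (input, forget, cell, output),
        # each mapping the concatenation [x; h_prev] to the hidden size.
        self.W = {g: rnd(hidden_dim, embed_dim + hidden_dim) for g in "ifco"}
        self.b = {g: [0.0] * hidden_dim for g in "ifco"}
        self.Wout = rnd(n_langs, hidden_dim)             # final classification layer
        self.hidden_dim = hidden_dim

    def forward(self, token_ids):
        h = [0.0] * self.hidden_dim   # hidden state
        c = [0.0] * self.hidden_dim   # cell state
        for idx in token_ids:
            x = self.embed[idx] + h   # concatenate [x; h_prev]
            gates = {}
            for g in "ifco":
                z = [sum(w * v for w, v in zip(row, x)) + b
                     for row, b in zip(self.W[g], self.b[g])]
                # candidate cell ("c") uses tanh, the three gates use sigmoid
                gates[g] = [math.tanh(v) if g == "c" else sigmoid(v) for v in z]
            c = [f * cp + i * cc
                 for f, cp, i, cc in zip(gates["f"], c, gates["i"], gates["c"])]
            h = [o * math.tanh(cv) for o, cv in zip(gates["o"], c)]
        # project the final hidden state onto language logits, then softmax
        logits = [sum(w * v for w, v in zip(row, h)) for row in self.Wout]
        m = max(logits)
        exps = [math.exp(l - m) for l in logits]
        s = sum(exps)
        return [e / s for e in exps]

clf = TinyLSTMClassifier(vocab_size=30, embed_dim=8, hidden_dim=16, n_langs=3)
probs = clf.forward([3, 7, 1, 12, 5])  # probability per candidate language
print(probs)
```

In practice one would train such a model with backpropagation and initialize the embedding table from pre-trained word vectors, as the abstract suggests; this sketch only illustrates the data flow from token sequence to a language probability distribution.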