Authors: Diego Bernardes de Lima Santos¹; Frederico Giffoni de Carvalho Dutra²; Fernando Silva Parreiras³ and Wladmir Cardoso Brandão¹
Affiliations:
¹ Department of Computer Science, Pontifical Catholic University of Minas Gerais (PUC Minas), Belo Horizonte, Brazil
² Companhia Energética de Minas Gerais (CEMIG), Belo Horizonte, Brazil
³ Laboratory for Advanced Information Systems, FUMEC University, Belo Horizonte, Brazil
Keyword(s):
Named Entity Recognition, Text Embedding, Neural Network, Transformer, Multilingual, Portuguese.
Abstract:
Recent state-of-the-art named entity recognition approaches are based on deep neural networks that use an attention mechanism to learn how to extract named entities from relevant fragments of text. Training a model in a specific language usually leads to effective recognition, but it requires considerable time and computational resources. Fine-tuning a pre-trained multilingual model can be simpler and faster, but it raises the question of how effective such a recognition model can be. This article exploits multilingual models for named entity recognition by adapting and training transformer-based architectures for Portuguese, a challenging and complex language. Experimental results show that multilingual transformer-based text embedding approaches fine-tuned on a large dataset outperform state-of-the-art transformer-based models trained specifically for Portuguese. In particular, we build a comprehensive dataset from different versions of HAREM to train our multilingual transformer-based text embedding approach, which achieves 88.0% precision and 87.8% F1 in named entity recognition for Portuguese, with gains of up to 9.89% in precision and 11.60% in F1 over the state-of-the-art monolingual approach trained specifically for Portuguese.
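
For illustration, below is a minimal sketch of the fine-tuning setup the abstract describes, assuming the multilingual model is bert-base-multilingual-cased and a simplified BIO tag set in the style of HAREM; the exact model checkpoint, label inventory, and hyperparameters used in the paper are not specified here, so these are placeholder choices.

# Minimal sketch: multilingual transformer fine-tuned for Portuguese NER.
# Assumptions (not from the paper): mBERT checkpoint, illustrative HAREM-style
# BIO labels. Before the head is fine-tuned on HAREM (e.g., with
# transformers.Trainer), its predictions are random.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Hypothetical subset of HAREM-style entity labels in BIO format.
labels = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

# Tokenize a Portuguese sentence and predict a tag per subword token.
sentence = "A CEMIG tem sede em Belo Horizonte."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits          # shape: (1, seq_len, num_labels)
predictions = logits.argmax(dim=-1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, pred in zip(tokens, predictions):
    print(token, labels[int(pred)])

In this setup, only the token-classification head is new; the multilingual encoder is reused as-is, which is what makes fine-tuning cheaper than training a language-specific model from scratch.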