Authors:
Carlos Alberto Alvares Rocha 1,2; Marcos Vinícius Pinheiro Dib 1,3; Li Weigang 1,3; Andrea Ferreira Portela Nunes 1,4; Allan Victor Almeida Faria 1,5; Daniel Oliveira Cajueiro 1,6; Maísa Kely de Melo 1,7 and Victor Rafael Rezende Celestino 8,1
Affiliations:
1 LAMFO - Lab. of ML in Finance and Organizations, University of Brasilia, Campus Darcy Ribeiro, Brasilia, Brazil
2 PPMEC, Faculty of Technology, University of Brasilia, Federal District, Brazil
3 TransLab, Department of Computer Science, University of Brasilia, Campus Darcy Ribeiro, Brasilia, Brazil
4 Ministry of Science, Technology and Innovation of Brazil, Federal District, Brazil
5 Department of Statistics, University of Brasília, Federal District, Brazil
6 Department of Economics, University of Brasilia, Federal District, Brazil
7 Department of Mathematics, Instituto Federal de Minas Gerais Campus Formiga, Formiga, Brazil
8 Department of Business Administration, University of Brasilia, Federal District, Brazil
Keyword(s):
CNN, Deep Learning, MCTI, Longformer, Web Long-text Classification, LSTM, Transfer-learning, Word2vec.
Abstract:
Text classification is a traditional problem in Natural Language Processing (NLP). Most state-of-the-art implementations require large volumes of high-quality labeled data. Models pre-trained on large corpora have proven beneficial for text classification and other NLP tasks, but they can only take a limited number of symbols as input. This real case study explores different machine learning strategies for classifying a small amount of long, unstructured, and uneven data, in order to find a suitable method with good performance. The collected data consists of texts describing financing opportunities that international R&D funding organizations publish on their websites. The main goal is to find international R&D funding for which Brazilian researchers are eligible, sponsored by the Ministry of Science, Technology and Innovation (MCTI). We use pre-training and word embedding solutions to learn the relationships between words from other, larger datasets that are considerably similar to ours. Then, using the acquired features and the available dataset from MCTI, we apply transfer learning plus deep learning models to improve the comprehension of each sentence. Compared with the baseline accuracy of 81% on the available datasets and the 85% accuracy achieved with a Transformer-based approach, the Word2Vec-based approach improved accuracy to 88%. The research results serve as a successful case of artificial intelligence in a federal government application.