Authors:
Vinícius Di Oliveira 1,2; Li Weigang 2 and Geraldo Pereira Rocha Filho 2
Affiliations:
1 Secretary of Economy, Brasilia, Federal District, Brazil; 2 TransLab, University of Brasilia, Brasilia, Federal District, Brazil
Keyword(s):
BERT, Electronic Invoice, Labeled Data-set, Short Text, Supervised Learning, Text Classification.
Abstract:
Classifying short texts with machine learning (ML) models is both promising and challenging for economy-related sectors such as electronic invoice processing and auditing. Labeled short-text data sets are scarce, and building new ones for supervised learning is costly, especially when the labels are assigned manually by experts. To address this, this research proposes the ELEVEN (ELEctronic inVoicEs in portuguese laNguage) Data-Set in an open data format. This labeled short-text database is composed of product descriptions extracted from electronic invoices. These short Portuguese descriptions are unstructured but limited to 120 characters. First, we construct BERT and other models to demonstrate short-text classification using ELEVEN. Then, we show three successful cases, also using the data set we developed, that identify the correct product codes from the short text descriptions of goods captured from electronic invoices, among others. ELEVEN consists of 1.1 million merchandise descriptions recorded as labeled short texts, annotated by specialist tax auditors, and detailed according to the Mercosur Common Nomenclature. For easy public use, ELEVEN is shared on GitHub at https://github.com/vinidiol/descmerc.
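As a rough illustration of the task described above (mapping a short, unstructured Portuguese product description to a label), the following minimal sketch uses a token-overlap nearest-neighbour baseline in plain Python. It is not the BERT model used in the paper, and it does not read the ELEVEN files: the toy descriptions, the placeholder labels, and the function names are all hypothetical. The only detail taken from the abstract is the 120-character limit.

```python
import re
import unicodedata

MAX_LEN = 120  # the abstract states descriptions are limited to 120 characters


def normalize(text):
    """Lowercase, strip accents, and split a description into a set of tokens."""
    text = unicodedata.normalize("NFKD", text[:MAX_LEN].lower())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return set(re.findall(r"[a-z0-9]+", text))


def jaccard(a, b):
    """Jaccard similarity between two token sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def predict(description, labeled):
    """Return the label of the labeled example most similar to `description`."""
    tokens = normalize(description)
    best = max(labeled, key=lambda pair: jaccard(tokens, normalize(pair[0])))
    return best[1]


# Hypothetical toy examples with placeholder labels; NOT taken from ELEVEN,
# whose labels follow the Mercosur Common Nomenclature.
TRAIN = [
    ("CERVEJA PILSEN LATA 350ML", "beer"),
    ("AGUA MINERAL SEM GAS 500ML", "water"),
    ("ARROZ BRANCO TIPO 1 5KG", "rice"),
]

print(predict("Cerveja pilsen garrafa 600ml", TRAIN))
```

A baseline like this only matches surface tokens, so it gives a sense of why abbreviated, noisy invoice text is hard and why learned models such as BERT are evaluated on the data set.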