
Paper

Authors: Vinícius Di Oliveira 1,2; Li Weigang 2 and Geraldo Pereira Rocha Filho 2

Affiliations: 1 Secretary of Economy, Brasilia, Federal District, Brazil; 2 TransLab, University of Brasilia, Brasilia, Federal District, Brazil

Keyword(s): BERT, Electronic Invoice, Labeled Data-set, Short Text, Supervised Learning, Text Classification.

Abstract: Classifying short text with machine learning (ML) models is both promising and challenging for economy-related sectors such as electronic invoice processing and auditing. Labeled short-text data sets are scarce, and building new labeled short-text databases for supervised learning is costly, especially when experts label them manually. To address this, this research proposes the ELEVEN (ELEctronic inVoicEs in portuguese laNguage) Data-Set in an open data format. This labeled short-text database is composed of product descriptions extracted from electronic invoices. These short Portuguese text descriptions are unstructured, but limited to 120 characters. First, we construct BERT and other models to demonstrate short text classification using ELEVEN. Then, we show three successful cases, also using the data set we developed, that identify the correct product codes from the short text descriptions of goods captured from electronic invoices and others. ELEVEN consists of 1.1 million merchandise descriptions recorded as labeled short texts, annotated by specialist tax auditors, and detailed according to the Mercosur Common Nomenclature. For easy public use, ELEVEN is shared on GitHub at https://github.com/vinidiol/descmerc.

CC BY-NC-ND 4.0

Paper citation in several formats:
Di Oliveira, V.; Weigang, L. and Filho, G. (2022). ELEVEN Data-Set: A Labeled Set of Descriptions of Goods Captured from Brazilian Electronic Invoices. In Proceedings of the 18th International Conference on Web Information Systems and Technologies - WEBIST; ISBN 978-989-758-613-2; ISSN 2184-3252, SciTePress, pages 257-264. DOI: 10.5220/0011524800003318

@conference{webist22,
author={Vinícius {Di Oliveira} and Li Weigang and Geraldo Pereira Rocha Filho},
title={ELEVEN Data-Set: A Labeled Set of Descriptions of Goods Captured from Brazilian Electronic Invoices},
booktitle={Proceedings of the 18th International Conference on Web Information Systems and Technologies - WEBIST},
year={2022},
pages={257-264},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0011524800003318},
isbn={978-989-758-613-2},
issn={2184-3252},
}

TY - CONF
JO - Proceedings of the 18th International Conference on Web Information Systems and Technologies - WEBIST
TI - ELEVEN Data-Set: A Labeled Set of Descriptions of Goods Captured from Brazilian Electronic Invoices
SN - 978-989-758-613-2
IS - 2184-3252
AU - Di Oliveira, V.
AU - Weigang, L.
AU - Filho, G.
PY - 2022
SP - 257
EP - 264
DO - 10.5220/0011524800003318
PB - SciTePress