ELEVEN Data-Set: A Labeled Set of Descriptions of Goods Captured from Brazilian Electronic Invoices
Vinícius Di Oliveira, Li Weigang, Geraldo Filho
2022
Abstract
Classifying short text with machine learning (ML) models is both promising and challenging for economy-related sectors such as electronic invoice processing and auditing. Labeled short-text data sets are scarce, and building new ones for supervised learning is costly, especially when experts must annotate them manually. To address this, this research proposes the ELEVEN (ELEctronic inVoicEs in portuguese laNguage) Data-Set, released in an open data format. This labeled short-text database is composed of product descriptions extracted from electronic invoices. The descriptions are unstructured Portuguese text, limited to 120 characters each. First, we build BERT and other models to demonstrate short-text classification using ELEVEN. Then, we present three successful cases, also based on the data set we developed, that identify the correct product codes from the short-text descriptions of goods captured from electronic invoices and other sources. ELEVEN consists of 1.1 million merchandise descriptions recorded as labeled short texts, annotated by specialist tax auditors, and detailed according to the Mercosur Common Nomenclature. For easy public use, ELEVEN is shared on GitHub at https://github.com/vinidiol/descmerc.
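As an illustration of the two constraints the abstract mentions (descriptions of at most 120 characters, labels drawn from the Mercosur Common Nomenclature, NCM), the following is a minimal sketch of how a single record could be checked and its label split into coarser levels. It assumes the usual 8-digit NCM layout, in which the first 2, 4, and 6 digits give the chapter, heading, and Harmonized System subheading. The names `NcmCode`, `parse_ncm`, and `is_valid_description`, as well as the sample description, are hypothetical and are not taken from the ELEVEN repository.

```python
from dataclasses import dataclass

# Length limit stated in the abstract for each product description.
MAX_DESCRIPTION_LENGTH = 120


@dataclass(frozen=True)
class NcmCode:
    """Hypothetical container for an 8-digit NCM code and its prefixes."""
    chapter: str     # first 2 digits
    heading: str     # first 4 digits
    subheading: str  # first 6 digits (Harmonized System level)
    full: str        # all 8 digits


def parse_ncm(raw: str) -> NcmCode:
    """Normalize a code such as '2203.00.00' and split it into levels."""
    digits = raw.replace(".", "").strip()
    if len(digits) != 8 or not (digits.isascii() and digits.isdigit()):
        raise ValueError(f"expected an 8-digit NCM code, got {raw!r}")
    return NcmCode(digits[:2], digits[:4], digits[:6], digits)


def is_valid_description(text: str) -> bool:
    """Check that a description is non-empty and within the length limit."""
    return 0 < len(text.strip()) <= MAX_DESCRIPTION_LENGTH


if __name__ == "__main__":
    # Made-up example record, not a row from the data set.
    description = "CERVEJA PILSEN LATA 350ML"
    code = parse_ncm("2203.00.00")
    print(is_valid_description(description), code.chapter, code.heading)
```

Splitting the label this way makes it possible to train or evaluate a classifier at a coarser level (for example, chapter only) before attempting the full 8-digit code; whether the authors do so is not stated in the abstract.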
Paper Citation
in Harvard Style
Di Oliveira V., Weigang L. and Filho G. (2022). ELEVEN Data-Set: A Labeled Set of Descriptions of Goods Captured from Brazilian Electronic Invoices. In Proceedings of the 18th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST, ISBN 978-989-758-613-2, pages 257-264. DOI: 10.5220/0011524800003318
in Bibtex Style
@conference{webist22,
author={Vinícius Di Oliveira and Li Weigang and Geraldo Filho},
title={ELEVEN Data-Set: A Labeled Set of Descriptions of Goods Captured from Brazilian Electronic Invoices},
booktitle={Proceedings of the 18th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST},
year={2022},
pages={257-264},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0011524800003318},
isbn={978-989-758-613-2},
}
in EndNote Style
TY - CONF
JO - Proceedings of the 18th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST
TI - ELEVEN Data-Set: A Labeled Set of Descriptions of Goods Captured from Brazilian Electronic Invoices
SN - 978-989-758-613-2
AU - Di Oliveira V.
AU - Weigang L.
AU - Filho G.
PY - 2022
SP - 257
EP - 264
DO - 10.5220/0011524800003318