MCCD: Generating Human Natural Language Conversational Datasets

Matheus Sanches, Jader C. de Sá, Allan M. de Souza, Diego Silva, Rafael R. de Souza, Julio Reis, Leandro Villas

2022

Abstract

In recent years, problems in Natural Language Processing (NLP) have been extensively explored, yielding better models for text generation and text understanding. These solutions depend heavily on data, such as dialogues, to train models. For many languages, a lack of data significantly limits the datasets available. The problem worsens when large amounts of data are required to build solutions for a particular domain. This investigation proposes MCCD, a methodology to extract human conversational datasets from several data sources. MCCD identifies different answers to the same message, differentiating the various conversation flows. This enables the resulting dataset to be used in more applications. Datasets generated by MCCD can train models for different purposes, such as Questions & Answers (QA) and open-domain conversational agents. We developed a complete software tool to implement and evaluate our proposal. We applied our solution to extract human conversations from two datasets in the Portuguese language.
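The idea of separating several answers to the same message into distinct conversation flows can be illustrated with a minimal sketch. This is not the authors' implementation: the `(id, parent_id, text)` record layout, the function name `conversation_flows`, and the sample messages are all assumptions made for illustration. The sketch treats replies as a tree and emits every root-to-leaf reply chain as one flow.

```python
from collections import defaultdict


def conversation_flows(messages):
    """Return every root-to-leaf reply chain as a list of message texts.

    `messages` is an iterable of (msg_id, parent_id, text) tuples; this
    layout is hypothetical and only serves the illustration.
    """
    children = defaultdict(list)  # parent id -> ids of direct replies
    text_by_id = {}
    roots = []
    for msg_id, parent_id, text in messages:
        text_by_id[msg_id] = text
        if parent_id is None:
            roots.append(msg_id)
        else:
            children[parent_id].append(msg_id)

    flows = []

    def walk(msg_id, path):
        path = path + [text_by_id[msg_id]]
        replies = children.get(msg_id, [])
        if not replies:
            # A message with no replies ends one conversation flow.
            flows.append(path)
            return
        for reply_id in replies:
            # Each reply to the same message starts a separate branch.
            walk(reply_id, path)

    for root in roots:
        walk(root, [])
    return flows


# Hypothetical thread: message 1 receives two different answers (2 and 3).
messages = [
    (1, None, "Which framework should I use?"),
    (2, 1, "It depends on the project."),
    (3, 1, "I use Django."),
    (4, 3, "Why?"),
]
flows = conversation_flows(messages)
for flow in flows:
    print(" -> ".join(flow))
```

In this toy thread the two answers to the first message produce two flows, each usable as a separate dialogue sample.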

Paper Citation


in Harvard Style

Sanches, M., de Sá, J. C., de Souza, A. M., Silva, D., de Souza, R. R., Reis, J. and Villas, L. (2022). MCCD: Generating Human Natural Language Conversational Datasets. In Proceedings of the 24th International Conference on Enterprise Information Systems - Volume 1: ICEIS, ISBN 978-989-758-569-2, pages 247-255. DOI: 10.5220/0011077400003179


in Bibtex Style

@conference{iceis22,
author={Matheus Sanches and Jader C. de Sá and Allan M. de Souza and Diego Silva and Rafael R. de Souza and Julio Reis and Leandro Villas},
title={MCCD: Generating Human Natural Language Conversational Datasets},
booktitle={Proceedings of the 24th International Conference on Enterprise Information Systems - Volume 1: ICEIS},
year={2022},
pages={247-255},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0011077400003179},
isbn={978-989-758-569-2},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 24th International Conference on Enterprise Information Systems - Volume 1: ICEIS
TI - MCCD: Generating Human Natural Language Conversational Datasets
SN - 978-989-758-569-2
AU - Sanches, Matheus
AU - de Sá, Jader C.
AU - de Souza, Allan M.
AU - Silva, Diego
AU - de Souza, Rafael R.
AU - Reis, Julio
AU - Villas, Leandro
PY - 2022
SP - 247
EP - 255
DO - 10.5220/0011077400003179