forum-like online data sources. This study applied our
software tool to generate two conversational datasets
in the Portuguese language. The datasets obtained in
our study were made available for sharing and reuse.
We found that our methodology and software tool
enable the creation of relevant datasets. This should
support the training of language models. Indeed, our
subsequent research steps involve using the generated
conversational datasets to train new models and refine
existing language models. We aim to apply and evaluate
such models in constructing chatbots for customer
services.
REFERENCES
Bansal, A., Kauffman, R. J., and Weitz, R. R. (1993). Com-
paring the modeling performance of regression and
neural networks as data quality varies: A business
value approach. Journal of Management Information
Systems, 10(1):11–32.
Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.,
Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G.,
Askell, A., Agarwal, S., Herbert-Voss, A., Krueger,
G., Henighan, T. J., Child, R., Ramesh, A., Ziegler,
D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler,
E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner,
C., McCandlish, S., Radford, A., Sutskever, I., and
Amodei, D. (2020). Language models are few-shot
learners. ArXiv, abs/2005.14165.
Budzianowski, P., Wen, T.-H., Tseng, B.-H., Casanueva, I.,
Ultes, S., Ramadan, O., and Gašić, M. (2018). MultiWOZ - a
large-scale multi-domain wizard-of-oz dataset
for task-oriented dialogue modelling. Proceedings of
the 2018 Conference on Empirical Methods in Natural
Language Processing.
Byrne, B., Krishnamoorthi, K., Sankar, C., Neelakantan,
A., Goodrich, B., Duckworth, D., Yavuz, S., Dubey,
A., Kim, K.-Y., and Cedilnik, A. (2019). Taskmaster-
1: Toward a realistic and diverse dialog dataset. Pro-
ceedings of the 2019 Conference on Empirical Meth-
ods in Natural Language Processing and the 9th Inter-
national Joint Conference on Natural Language Pro-
cessing (EMNLP-IJCNLP).
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K.
(2018). BERT: Pre-training of deep bidirectional
transformers for language understanding. arXiv preprint
arXiv:1810.04805.
Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W.,
Wallach, H., Daumé III, H., and Crawford, K. (2021).
Datasheets for datasets. Communications of the ACM,
64(12):86–92.
Kelley, J. F. (1984). An iterative design methodology for
user-friendly natural language office information ap-
plications. ACM Trans. Inf. Syst., 2(1):26–41.
Li, Y., Su, H., Shen, X., Li, W., Cao, Z., and Niu, S.
(2017). DailyDialog: A manually labelled multi-turn
dialogue dataset. In Proceedings of the Eighth Inter-
national Joint Conference on Natural Language Pro-
cessing (Volume 1: Long Papers), pages 986–995,
Taipei, Taiwan. Asian Federation of Natural Language
Processing.
Lowe, R., Pow, N., Serban, I., and Pineau, J. (2015). The
Ubuntu dialogue corpus: A large dataset for research
in unstructured multi-turn dialogue systems. arXiv
preprint arXiv:1506.08909.
Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K., and
Galstyan, A. (2021). A survey on bias and fairness in
machine learning. ACM Comput. Surv., 54(6).
Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark,
C., Lee, K., and Zettlemoyer, L. (2018). Deep contex-
tualized word representations. In Proc. of NAACL.
Qiu, X., Sun, T., Xu, Y., Shao, Y., Dai, N., and Huang,
X. (2020). Pre-trained models for natural language
processing: A survey. Science China Technological
Sciences, 63(10):1872–1897.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and
Sutskever, I. (2019). Language models are unsuper-
vised multitask learners. In OpenAI.
Smith, J. R., Saint-Amand, H., Plamada, M., Koehn, P.,
Callison-Burch, C., and Lopez, A. (2013). Dirt cheap
web-scale parallel text from the Common Crawl. In
Proceedings of the 51st Annual Meeting of the Associ-
ation for Computational Linguistics (Volume 1: Long
Papers), pages 1374–1383, Sofia, Bulgaria. Associa-
tion for Computational Linguistics.
Traum, D. R. (1999). Speech Acts for Dialogue Agents,
pages 169–201. Springer Netherlands, Dordrecht.
Wagner Filho, J. A., Wilkens, R., Idiart, M., and Villav-
icencio, A. (2018). The brWaC corpus: A new
open resource for Brazilian Portuguese. In Pro-
ceedings of the Eleventh International Conference on
Language Resources and Evaluation (LREC 2018),
Miyazaki, Japan. European Language Resources As-
sociation (ELRA).
Williams, J. D., Raux, A., and Henderson, M. (2016). The
dialog state tracking challenge series: A review. Dia-
logue & Discourse, 7(3):4–33.
Wolf, M., Miller, K., and Grodzinsky, F. (2017). Why
we should have seen that coming: Comments on
Microsoft’s Tay “experiment,” and wider implications.
The ORBIT Journal, 1(2):1–12.
MCCD: Generating Human Natural Language Conversational Datasets