Identifying High-Quality Training Data for Misinformation Detection

Jaren Haber, Kornraphop Kawintiranon, Lisa Singh, Alexander Chen, Aidan Pizzo, Anna Pogrebivsky, Joyce Yang

2023

Abstract

Misinformation spread through social media poses a grave threat to public health, interfering with the best scientific evidence available. This spread was particularly visible during the COVID-19 pandemic. To track and curb misinformation, an essential first step is to detect it. One component of misinformation detection is finding examples of misinformation posts that can serve as training data for misinformation detection algorithms. In this paper, we focus on the challenge of collecting high-quality training data in misinformation detection applications. To that end, we demonstrate the effectiveness of a simple methodology and show its viability on five myths related to COVID-19. Our methodology incorporates both dictionary-based sampling and predictions from weak learners to identify a reasonable number of myth examples for data labeling. To aid researchers in adjusting this methodology for specific use cases, we use word usage entropy to describe when fewer iterations of sampling and training will be needed to obtain high-quality samples. Finally, we present a case study that shows the prevalence of three of our myths on Twitter at the beginning of the pandemic.


Paper Citation


in Harvard Style

Haber J., Kawintiranon K., Singh L., Chen A., Pizzo A., Pogrebivsky A. and Yang J. (2023). Identifying High-Quality Training Data for Misinformation Detection. In Proceedings of the 12th International Conference on Data Science, Technology and Applications - Volume 1: DATA; ISBN 978-989-758-664-4, SciTePress, pages 64-76. DOI: 10.5220/0012089000003541


in Bibtex Style

@conference{data23,
author={Jaren Haber and Kornraphop Kawintiranon and Lisa Singh and Alexander Chen and Aidan Pizzo and Anna Pogrebivsky and Joyce Yang},
title={Identifying High-Quality Training Data for Misinformation Detection},
booktitle={Proceedings of the 12th International Conference on Data Science, Technology and Applications - Volume 1: DATA},
year={2023},
pages={64-76},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0012089000003541},
isbn={978-989-758-664-4},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 12th International Conference on Data Science, Technology and Applications - Volume 1: DATA
TI - Identifying High-Quality Training Data for Misinformation Detection
SN - 978-989-758-664-4
AU - Haber J.
AU - Kawintiranon K.
AU - Singh L.
AU - Chen A.
AU - Pizzo A.
AU - Pogrebivsky A.
AU - Yang J.
PY - 2023
SP - 64
EP - 76
DO - 10.5220/0012089000003541
PB - SciTePress