HTC-GEN: A Generative LLM-Based Approach to Handle Data Scarcity in Hierarchical Text Classification

Carmelo Longo, Misael Mongiovì, Luana Bulla, Giusy Tuccari

2024

Abstract

Hierarchical text classification is a challenging task, in particular when complex taxonomies, characterized by multi-level labeling structures, need to be handled. A critical aspect of the task lies in the scarcity of labeled data capable of representing the entire spectrum of taxonomy labels. To address this, we propose HTC-GEN, a novel framework that leverages synthetic data generation by means of large language models, with a specific focus on LLama2. LLama2 generates coherent, contextually relevant text samples across hierarchical levels, emulating the patterns of real-world text data. HTC-GEN obviates the labor-intensive human annotation otherwise required to build training data for supervised models. The proposed methodology also handles the common issue of imbalanced datasets, enabling robust generalization for labels with minimal or missing real-world data. We test our approach on a widely recognized benchmark dataset for hierarchical zero-shot text classification, demonstrating superior performance compared to the state-of-the-art zero-shot model. Our findings underscore the significant potential of synthetic-data-driven solutions to address the challenges of hierarchical text classification.
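
As a rough illustration of the idea described in the abstract (generating synthetic samples for every label in a taxonomy), the following minimal sketch enumerates the label paths of a two-level taxonomy and builds one generation prompt per path. The taxonomy, the prompt wording, and the function names are hypothetical and are not taken from the paper; the call to the language model itself is deliberately left out.

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical toy taxonomy (not from the paper): parent label -> child labels.
TAXONOMY: Dict[str, List[str]] = {
    "Science": ["Physics", "Biology"],
    "Sports": ["Football"],
}


def label_paths(taxonomy: Dict[str, List[str]]) -> Iterator[Tuple[str, str]]:
    """Yield every (parent, child) path of a two-level taxonomy."""
    for parent, children in taxonomy.items():
        for child in children:
            yield (parent, child)


def build_prompt(path: Tuple[str, ...], n_samples: int = 3) -> str:
    """Build an illustrative prompt asking for synthetic documents on one label path."""
    joined = " > ".join(path)
    return (
        f"Write {n_samples} short, distinct documents about the topic "
        f"'{path[-1]}' (category path: {joined})."
    )


def build_generation_jobs(
    taxonomy: Dict[str, List[str]], n_samples: int = 3
) -> List[dict]:
    """Pair each label path with its prompt; generated text would inherit these labels."""
    return [
        {"labels": list(path), "prompt": build_prompt(path, n_samples)}
        for path in label_paths(taxonomy)
    ]


jobs = build_generation_jobs(TAXONOMY)
for job in jobs:
    print(job["labels"], "->", job["prompt"])
```

Each prompt would be sent to a generative model, and the returned texts would be labeled with the full path in `labels`, so that every label, including those with no real examples, receives training data.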


Paper Citation


in Harvard Style

Longo C., Mongiovì M., Bulla L. and Tuccari G. (2024). HTC-GEN: A Generative LLM-Based Approach to Handle Data Scarcity in Hierarchical Text Classification. In Proceedings of the 13th International Conference on Data Science, Technology and Applications - Volume 1: DATA; ISBN 978-989-758-707-8, SciTePress, pages 129-138. DOI: 10.5220/0012790700003756


in Bibtex Style

@conference{data24,
author={Carmelo Longo and Misael Mongiovì and Luana Bulla and Giusy Tuccari},
title={HTC-GEN: A Generative LLM-Based Approach to Handle Data Scarcity in Hierarchical Text Classification},
booktitle={Proceedings of the 13th International Conference on Data Science, Technology and Applications - Volume 1: DATA},
year={2024},
pages={129-138},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0012790700003756},
isbn={978-989-758-707-8},
}


in EndNote Style

TY - CONF

JO - Proceedings of the 13th International Conference on Data Science, Technology and Applications - Volume 1: DATA
TI - HTC-GEN: A Generative LLM-Based Approach to Handle Data Scarcity in Hierarchical Text Classification
SN - 978-989-758-707-8
AU - Longo C.
AU - Mongiovì M.
AU - Bulla L.
AU - Tuccari G.
PY - 2024
SP - 129
EP - 138
DO - 10.5220/0012790700003756
PB - SciTePress