Authors:
Carmelo Longo 1; Misael Mongiovì 1,2; Luana Bulla 1,2 and Giusy Tuccari 1,2
Affiliations:
1 National Research Council, Institute of Cognitive Science and Technology, Italy
2 Department of Mathematics and Computer Science, University of Catania, Italy
Keyword(s):
Hierarchical Text Classification, Synthetic Data Generation, Large Language Models.
Abstract:
Hierarchical text classification is a challenging task, particularly when complex taxonomies with multi-level labeling structures must be handled. A critical aspect of the task lies in the scarcity of labeled data capable of representing the entire spectrum of taxonomy labels. To address this, we propose HTC-GEN, a novel framework that leverages synthetic data generation by means of large language models, with a specific focus on LLama2. LLama2 generates coherent, contextually relevant text samples across hierarchical levels, faithfully emulating the intricate patterns of real-world text data. HTC-GEN obviates the need for the labor-intensive human annotation required to build training data for supervised models. The proposed methodology effectively handles the common issue of imbalanced datasets, enabling robust generalization for labels with minimal or no real-world data. We test our approach on a widely recognized benchmark dataset for hierarchical zero-shot text classification, demonstrating superior performance compared to the state-of-the-art zero-shot model. Our findings underscore the significant potential of synthetic-data-driven solutions to effectively address the intricate challenges of hierarchical text classification.
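Since the abstract describes the approach only at a high level, the following minimal sketch illustrates the general idea of LLM-based synthetic data generation over a label taxonomy. It assumes the Hugging Face transformers text-generation pipeline and an illustrative Llama-2 chat checkpoint; the toy taxonomy, prompt wording, helper name generate_examples, and generation settings are hypothetical and do not reflect the authors' exact HTC-GEN configuration.

```python
# Hedged sketch: generating synthetic training data for hierarchical labels with an LLM.
# Checkpoint, prompt, and settings are illustrative assumptions, not the HTC-GEN setup.
from transformers import pipeline

# A tiny example taxonomy: top-level label -> child labels (illustrative only).
TAXONOMY = {
    "Science": ["Physics", "Biology"],
    "Sports": ["Tennis", "Soccer"],
}

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-7b-chat-hf",  # assumed checkpoint; any instruct LLM works
)

def generate_examples(parent: str, child: str, n: int = 3) -> list[str]:
    """Ask the LLM for short documents that fit a (parent, child) label path."""
    prompt = (
        f"Write a short news-style paragraph about the topic '{child}' "
        f"(a subtopic of '{parent}'). Paragraph:"
    )
    outputs = generator(
        prompt,
        max_new_tokens=120,
        do_sample=True,
        temperature=0.9,
        num_return_sequences=n,
    )
    # Keep only the generated continuation, dropping the prompt prefix.
    return [o["generated_text"][len(prompt):].strip() for o in outputs]

# Build a synthetic labeled dataset covering every leaf of the taxonomy; such data
# can then be used to train any supervised hierarchical text classifier.
synthetic_dataset = [
    {"text": text, "labels": [parent, child]}
    for parent, children in TAXONOMY.items()
    for child in children
    for text in generate_examples(parent, child)
]
```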