
Paper

Authors: Carmelo Longo (1); Misael Mongiovì (1, 2); Luana Bulla (1, 2) and Giusy Tuccari (1, 2)

Affiliations: (1) National Research Council, Institute of Cognitive Science and Technology, Italy; (2) Department of Mathematics and Computer Science, University of Catania, Italy

Keyword(s): Hierarchical Text Classification, Synthetic Data Generation, Large Language Models.

Abstract: Hierarchical text classification is a challenging task, in particular when complex taxonomies, characterized by multi-level labeling structures, need to be handled. A critical aspect of the task lies in the scarcity of labeled data capable of representing the entire spectrum of taxonomy labels. To address this, we propose HTC-GEN, a novel framework that leverages synthetic data generation by means of large language models, with a specific focus on LLama2. LLama2 generates coherent, contextually relevant text samples across hierarchical levels, faithfully emulating the intricate patterns of real-world text data. HTC-GEN obviates the need for the labor-intensive human annotation otherwise required to build training data for supervised models. The proposed methodology effectively handles the common issue of imbalanced datasets, enabling robust generalization for labels with minimal or missing real-world data. We test our approach on a widely recognized benchmark dataset for hierarchical zero-shot text classification, demonstrating superior performance compared to the state-of-the-art zero-shot model. Our findings underscore the significant potential of synthetic-data-driven solutions to effectively address the intricate challenges of hierarchical text classification.
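
The general idea in the abstract (producing a balanced, labeled training set by prompting a language model once per taxonomy label) can be illustrated with a minimal sketch. This is not the authors' implementation: the taxonomy, the prompt wording, the function names, and the `fake_llm` stand-in are all hypothetical, and a real setup would replace `fake_llm` with a call to an actual model such as LLama2.

```python
from typing import Callable, Dict, Iterator, List, Tuple

# Hypothetical two-level taxonomy (illustrative labels, not taken from the paper).
TAXONOMY: Dict[str, List[str]] = {
    "Science": ["Physics", "Biology"],
    "Sports": ["Football", "Tennis"],
}

LabelPath = Tuple[str, str]


def leaf_paths(taxonomy: Dict[str, List[str]]) -> Iterator[LabelPath]:
    """Yield every (parent, child) label path in the taxonomy."""
    for parent, children in taxonomy.items():
        for child in children:
            yield (parent, child)


def build_prompt(path: LabelPath) -> str:
    """Assumed prompt template; the wording is purely illustrative."""
    parent, child = path
    return f"Write a short text about '{child}', a subcategory of '{parent}'."


def generate_dataset(
    taxonomy: Dict[str, List[str]],
    generate: Callable[[str], str],
    samples_per_leaf: int,
) -> List[Tuple[str, LabelPath]]:
    """Request the same number of samples for every leaf label,
    so the resulting dataset is balanced by construction."""
    dataset: List[Tuple[str, LabelPath]] = []
    for path in leaf_paths(taxonomy):
        prompt = build_prompt(path)
        for _ in range(samples_per_leaf):
            dataset.append((generate(prompt), path))
    return dataset


def fake_llm(prompt: str) -> str:
    """Stand-in for a real language model call (assumption)."""
    return f"[synthetic text for: {prompt}]"


data = generate_dataset(TAXONOMY, fake_llm, samples_per_leaf=3)
```

Because every leaf label receives the same number of generated samples, labels with little or no real-world data end up as well represented as any other; the resulting `(text, label_path)` pairs could then be used to train an ordinary supervised classifier.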

CC BY-NC-ND 4.0

Paper citation in several formats:
Longo, C., Mongiovì, M., Bulla, L. and Tuccari, G. (2024). HTC-GEN: A Generative LLM-Based Approach to Handle Data Scarcity in Hierarchical Text Classification. In Proceedings of the 13th International Conference on Data Science, Technology and Applications - DATA; ISBN 978-989-758-707-8; ISSN 2184-285X, SciTePress, pages 129-138. DOI: 10.5220/0012790700003756

@conference{data24,
author={Carmelo Longo and Misael Mongiovì and Luana Bulla and Giusy Tuccari},
title={HTC-GEN: A Generative LLM-Based Approach to Handle Data Scarcity in Hierarchical Text Classification},
booktitle={Proceedings of the 13th International Conference on Data Science, Technology and Applications - DATA},
year={2024},
pages={129-138},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0012790700003756},
isbn={978-989-758-707-8},
issn={2184-285X},
}

TY - CONF
JO - Proceedings of the 13th International Conference on Data Science, Technology and Applications - DATA
TI - HTC-GEN: A Generative LLM-Based Approach to Handle Data Scarcity in Hierarchical Text Classification
SN - 978-989-758-707-8
IS - 2184-285X
AU - Longo, C.
AU - Mongiovì, M.
AU - Bulla, L.
AU - Tuccari, G.
PY - 2024
SP - 129
EP - 138
DO - 10.5220/0012790700003756
PB - SciTePress