HTC-GEN: A Generative LLM-Based Approach to Handle Data Scarcity in Hierarchical Text Classification

Carmelo Longo; Misael Mongiovì; Luana Bulla; Giusy Tuccari

Topics: Big Data Applications; Data and Information Quality for Big Data; Data Science; Deep Learning; Text Analytics

In Proceedings of the 13th International Conference on Data Science, Technology and Applications DATA - Volume 1, pages 129-138, 2024, Dijon, France

Authors: Carmelo Longo ¹ ; Misael Mongiovì ¹,² ; Luana Bulla ¹,² and Giusy Tuccari ¹,²

Affiliations: ¹ National Research Council, Institute of Cognitive Science and Technology, Italy ; ² Department of Mathematics and Computer Science, University of Catania, Italy

Keyword(s): Hierarchical Text Classification, Synthetic Data Generation, Large Language Models.

Abstract: Hierarchical text classification is a challenging task, particularly when complex taxonomies with multi-level labeling structures need to be handled. A critical aspect of the task is the scarcity of labeled data that covers the entire spectrum of taxonomy labels. To address this, we propose HTC-GEN, a novel framework that leverages synthetic data generation by means of large language models, with a specific focus on LLama2. LLama2 generates coherent, contextually relevant text samples across hierarchical levels, faithfully emulating the intricate patterns of real-world text data. HTC-GEN obviates the need for the labor-intensive human annotation required to build training data for supervised models. The proposed methodology effectively handles the common issue of imbalanced datasets, enabling robust generalization for labels with minimal or missing real-world data. We test our approach on a widely recognized benchmark dataset for hierarchical zero-shot text classification, demonstrating superior performance compared to the state-of-the-art zero-shot model. Our findings underscore the significant potential of synthetic-data-driven solutions to address the intricate challenges of hierarchical text classification.
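
The generation step described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and not the authors' implementation: the two-level taxonomy, the prompt wording, the function names, and the `fake_llm` stand-in (used here in place of a real call to a model such as LLama2) are all hypothetical. The sketch only shows the general idea of producing the same number of synthetic samples for every label path, so that no label is left without training data.

```python
from collections import Counter
from typing import Callable, Dict, Iterator, List, Tuple

LabelPath = Tuple[str, str]

# Hypothetical two-level taxonomy (illustrative labels, not taken from the paper).
TAXONOMY: Dict[str, List[str]] = {
    "Science": ["Physics", "Biology"],
    "Sports": ["Tennis", "Soccer"],
}


def label_paths(taxonomy: Dict[str, List[str]]) -> Iterator[LabelPath]:
    """Yield every (parent, child) path so that each leaf label is covered."""
    for parent, children in taxonomy.items():
        for child in children:
            yield parent, child


def build_prompt(path: LabelPath) -> str:
    """Assumed prompt template; the prompts used in the paper may differ."""
    parent, child = path
    return f"Write a short text about '{child}', a subtopic of '{parent}'."


def synthesize(
    taxonomy: Dict[str, List[str]],
    generate: Callable[[str], str],
    samples_per_label: int,
) -> List[Tuple[str, LabelPath]]:
    """Build a balanced synthetic set: the same number of texts per label path."""
    dataset: List[Tuple[str, LabelPath]] = []
    for path in label_paths(taxonomy):
        prompt = build_prompt(path)
        for _ in range(samples_per_label):
            dataset.append((generate(prompt), path))
    return dataset


def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM call; it only echoes the prompt."""
    return f"[generated] {prompt}"


data = synthesize(TAXONOMY, fake_llm, samples_per_label=2)
counts = Counter(path for _, path in data)
print(len(data), dict(counts))
```

In a real pipeline, `fake_llm` would be replaced by an actual text-generation call, and the resulting (text, label path) pairs would be used to train a supervised hierarchical classifier.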

CC BY-NC-ND 4.0

Paper citation in several formats:

Longo, C., Mongiovì, M., Bulla, L. and Tuccari, G. (2024). HTC-GEN: A Generative LLM-Based Approach to Handle Data Scarcity in Hierarchical Text Classification. In Proceedings of the 13th International Conference on Data Science, Technology and Applications - DATA; ISBN 978-989-758-707-8; ISSN 2184-285X, SciTePress, pages 129-138. DOI: 10.5220/0012790700003756

@conference{data24,
author={Carmelo Longo and Misael Mongiovì and Luana Bulla and Giusy Tuccari},
title={HTC-GEN: A Generative LLM-Based Approach to Handle Data Scarcity in Hierarchical Text Classification},
booktitle={Proceedings of the 13th International Conference on Data Science, Technology and Applications - DATA},
year={2024},
pages={129-138},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0012790700003756},
isbn={978-989-758-707-8},
issn={2184-285X},
}

TY - CONF

JO - Proceedings of the 13th International Conference on Data Science, Technology and Applications - DATA
TI - HTC-GEN: A Generative LLM-Based Approach to Handle Data Scarcity in Hierarchical Text Classification
SN - 978-989-758-707-8
IS - 2184-285X
AU - Longo, C.
AU - Mongiovì, M.
AU - Bulla, L.
AU - Tuccari, G.
PY - 2024
SP - 129
EP - 138
DO - 10.5220/0012790700003756
PB - SciTePress