An Index Bucketing Framework to Support Data Manipulation and Extraction of Nested Data Structures
Jeffrey Myers II, Yaser Mowafi
2024
Abstract
Handling nested data collections in large-scale distributed data structures poses considerable challenges for query processing and is often costly and error-prone. These challenges are exacerbated when the data is skewed and nested with irregular inner collections. Processing such data demands costly operations that lead to extensive data duplication and make it difficult to keep the distribution balanced across partitions, which in turn impedes performance and scalability. This work introduces an index bucketing framework that combines upfront computations with data manipulation techniques, focusing specifically on flattening procedures. The framework draws on principles of the bucket spreading strategy, a parallel hash join method that aims to mitigate the adverse effects of data duplication and information loss while effectively handling both skewed and irregularly nested structures. The efficacy of the proposed framework is evaluated on prominent question-answering datasets such as QuAC and NewsQA, comparing its performance against the Pandas Python API and against recursive and iterative flattening implementations.
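To make the flattening problem concrete, the following is a minimal Python sketch of an iterative (stack-based) flattening baseline of the general kind the abstract compares against. It is not the authors' index bucketing framework. The function names `flatten` and `explode`, the dotted-key convention, and the sample record (loosely shaped like a question-answering entry) are all assumptions made for illustration.

```python
from typing import Any, Dict, List


def flatten(record: Any, sep: str = ".") -> Dict[str, Any]:
    """Flatten nested dicts/lists into one flat dict with dotted keys.

    Uses an explicit stack instead of recursion. Empty dicts and empty
    lists contribute no keys, so their presence is lost in the output.
    """
    flat: Dict[str, Any] = {}
    stack = [("", record)]
    while stack:
        prefix, value = stack.pop()
        if isinstance(value, dict):
            for key, child in value.items():
                path = f"{prefix}{sep}{key}" if prefix else str(key)
                stack.append((path, child))
        elif isinstance(value, list):
            for i, child in enumerate(value):
                path = f"{prefix}{sep}{i}" if prefix else str(i)
                stack.append((path, child))
        else:
            flat[prefix] = value
    return flat


def explode(record: Dict[str, Any], list_key: str) -> List[Dict[str, Any]]:
    """Emit one flat row per element of record[list_key].

    Every parent field is copied into every row, which is the kind of
    data duplication the abstract describes.
    """
    parent = {k: v for k, v in record.items() if k != list_key}
    return [flatten({**parent, list_key: item})
            for item in record.get(list_key, [])]


# Hypothetical record with an irregular inner collection:
# the second question has no answers at all.
record = {
    "title": "Example",
    "qas": [
        {"question": "Q1", "answers": ["a", "b"]},
        {"question": "Q2", "answers": []},
    ],
}

flat = flatten(record)
rows = explode(record, "qas")
```

In this sketch, `rows` holds one row per question, each repeating `title`, and the empty `answers` list of the second question leaves no trace in its row. These are small-scale instances of the duplication and information loss that the abstract says the framework is meant to mitigate.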
Paper Citation
in Harvard Style
Myers II J. and Mowafi Y. (2024). An Index Bucketing Framework to Support Data Manipulation and Extraction of Nested Data Structures. In Proceedings of the 16th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR; ISBN 978-989-758-716-0, SciTePress, pages 191-199. DOI: 10.5220/0012828800003838
in Bibtex Style
@conference{kdir24,
author={Jeffrey Myers II and Yaser Mowafi},
title={An Index Bucketing Framework to Support Data Manipulation and Extraction of Nested Data Structures},
booktitle={Proceedings of the 16th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR},
year={2024},
pages={191-199},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0012828800003838},
isbn={978-989-758-716-0},
}
in EndNote Style
TY - CONF
JO - Proceedings of the 16th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR
TI - An Index Bucketing Framework to Support Data Manipulation and Extraction of Nested Data Structures
SN - 978-989-758-716-0
AU - Myers II J.
AU - Mowafi Y.
PY - 2024
SP - 191
EP - 199
DO - 10.5220/0012828800003838
PB - SciTePress