A Hybrid Framework for Resource-Efﬁcient Query Processing by

Effective Utilization of Existing Resources

Mayank Patel

1 a

and Minal Bhise

2 b

Department of Information & Communication Technology, Adani University, Shantigram, Gujarat, India

Distributed Databases Group, Dhirubhai Ambani Institute of Information and Communication Technology, India

Keywords:

Big Data Partitioning, Optimize Resource Utilization, Raw Data Query Processing, Real-Time Dynamic

Resource Allocation, Task Scheduling.

Abstract:

Scientiﬁc experiments and contemporary applications generate substantial volumes of data daily, posing a

challenge for traditional database management systems (DBMS) that expend considerable time and resources

on data loading. In-situ engines offer a distinct advantage by enabling immediate querying on raw data. Re-

searchers have observed that resources are often underutilized during data loading. In contrast, in-situ engines

spend ample time and resources in reparsing required data multiple times. Allocating query speciﬁc resources

is another challenging task that must be addressed to reduce overall workload execution time and resource uti-

lization. This research paper introduces a novel approach called the Resource Availability & Workload-aware

Hybrid Framework (RAW-HF), designed to enhance the efﬁciency of data querying by judiciously utilizing

optimal resources in systems comprising an in-situ engine and DBMS. RAW-HF incorporates modules that

facilitate the optimization of resources necessary for executing a given workload, striving to maximize the uti-

lization of available resources. The effectiveness of RAW-HF is demonstrated using the scientiﬁc dataset Sloan

Digital Sky Survey (SDSS) and Linked Observation data (LOD). Comparative analysis with the state-of-the-

art workload-aware partial loading technique (WA) reveals that RAW-HF excels in allocating query-speciﬁc

resources and implementing resource-aware task scheduling. Results from the study indicate that RAW-HF

outperforms WA, reducing workload execution time by 26%. It also reduces CPU and IO resource utilization

by 26% and 25% compared to WA at a cost of 33% additional RAM.

1 INTRODUCTION

The escalating pace of data generation from diverse

sources, including modern applications, scientiﬁc ex-

periments, and IoT applications, underscores the chal-

lenge of managing perishable resources effectively.

The volume of Astronomy datasets like the Sloan

Digital Sky Survey (SDSS) has increased by 233

times since DR-1 to DR-17(Ahumada et al., 2020).

NASA’s Earth Observing System (EOS) collects more

than 3.3TB of data every day(Guo et al., 2016).

Research indicates that the bottleneck in data pro-

cessing often lies in slow IO devices, particularly

magnetic disks, leading to under-utilization of exist-

ing CPU resources by most DBMS(Dziedzic et al.,

2017). Shockingly, data centers observe a mere 12%

CPU utilization, elevating operational costs in cloud

and distributed environments (Ailamaki, 2015; Max-

https://orcid.org/0000-0002-7804-4017

https://orcid.org/0000-0003-4364-3930

imilien et al., 2022). Parallelizing data loading proves

ineffective for systems with disk-based storage de-

vices (Dziedzic et al., 2017; Patel and Bhise, 2023b).

HTAP or hybrid systems utilize additional re-

sources in processing the same dataset twice. Widely

used DBMSs like Postgres, MySQL, Oracle, Au-

toSteer (Anneser et al., 2023) and other open-source

systems do not monitor the utilization of available re-

sources or consider them during query planning. The

existing systems or techniques proposed to address re-

source optimization or maximization issues have been

developed for speciﬁc DBMSs, which may not work

for most DBMSs, hybrid systems, or cloud vendor

(Pimpley et al., 2022; Kaviani et al., 2019).

This paper addresses the issues of under-utilized

resources and slow IO devices by focusing on the

availability of resources. By monitoring resource uti-

lization in real-time, the proposed approach aims to

mitigate the shortcomings of existing works (Li et al.,

2022; Pimpley et al., 2022; Viswanathan et al., 2018;

Patel and Bhise, 2023a). Speciﬁcally, it allocates

Patel, M. and Bhise, M.

A Hybrid Framework for Resource-Efﬁcient Query Processing by Effective Utilization of Existing Resources.

DOI: 10.5220/0012691600003690

Paper published under CC license (CC BY-NC-ND 4.0)

In Proceedings of the 26th International Conference on Enterprise Information Systems (ICEIS 2024) - Volume 1, pages 337-344

ISBN: 978-989-758-692-7; ISSN: 2184-4992

337

query-speciﬁc resources and optimizes task schedul-

ing, presenting a novel solution to enhance efﬁciency

in data querying. This shift in perspective, from the

under-utilization of resources to the effective manage-

ment of resources, marks a signiﬁcant contribution in

the domain of database management and system opti-

mization.

1.1 Motivation

The majority of existing Database Management Sys-

tems (DBMSs) and contemporary cloud resource uti-

lization strategies encounter challenges in effectively

leveraging all available resources(Ailamaki, 2015;

Maximilien et al., 2022). The optimization of hard-

ware resource utilization has become a focal point,

given the signiﬁcant demand for cloud or hardware

resources in querying large datasets to reduce oper-

ational costs. Notably, external Machine Learning

(ML)-based solutions demand substantial time and

resources for ﬁltering, analyzing resource utilization

data, and training ML models (Ailamaki, 2015; Pim-

pley et al., 2022). However, the challenge lies in

the fact that most systems do not account for real-

time availability of existing resources when schedul-

ing tasks or dynamically allocating query-speciﬁc re-

sources in a dynamic operational environment.

1.2 Paper Contributions

• This study consolidates modules like Raw Data

Query Processing (RQP), Resource Monitoring

(RM), Optimizing Required Resources (ORR),

and Maximizing Utilization of Existing Resources

(MUER) to construct RAW-HF framework.

• ORR module integrated lightweight query com-

plexity, workload, & storage utilization aware par-

titioning algorithms designed for hybrid systems.

MUAR integrated resource utilization aware task

scheduling and resource allocation algorithms.

• RAW-HF results have been compared with state-

of-the-art in-situ engine(Alagiannis et al., 2012),

row store DBMS PostgreSQL(Pos, 2022), and

workload-aware Partial Loading technique(Zhao

et al., 2015) using real-world scientiﬁc datasets.

2 RELATED WORK

Traditional database management systems necessitate

loading the entire dataset before executing a query,

while in-situ engines expedite results by bypassing

data loading and processing only essential raw data

upon query arrival. However, in-situ engines ex-

hibit slower query execution time (QET) compared to

DBMSs due to reparsing of raw data. Notably, NoDB

(Alagiannis et al., 2012) & Slalom (Olma et al., 2020)

proposed caching and indexing processed raw data

in main memory to reduce QET. Despite these ad-

vancements, caching techniques encounter occasional

reparsing issues and high utilization for large datasets.

Invisible loading proposed loading processed data

into column store DBMS generated as a side ef-

fect of executing queries on raw datasets to elimi-

nate reparsing (Abouzied et al., 2013). SCANRAW,

one of the ﬁrst hybrid systems, suggested monitoring

CPU and IO to speculatively load additional data into

DBMS to utilize idle resources, in parallel to query-

ing raw data using an in-situ engine (Cheng and Rusu,

2015). Regardless, both invisible loading and SCAN-

RAW are prone to processing and loading the entire

dataset when workload queries access database at-

tributes even once.

To address this, cost-aware techniques have been

introduced, calculating the cost of accessing raw data

ﬁles and loading only frequently accessed partitions

into DBMS (Patel et al., 2022; Zhao et al., 2015). The

Query Complexity Aware (QCA) technique reduces

loaded data by loading attributes required by complex

queries, maintaining attributes accessed by zero-join

simple queries in raw format (Patel and Bhise, 2022).

These techniques aim to reduce data loading time, re-

searchers have also partitioned datasets into smaller,

workload-aware partitions to reduce query execution

time (Padiya and Bhise, 2017; Tang et al., 2020).

As cloud usage rises, attention has shifted to-

wards optimizing hardware resource utilization to

reduce costs associated with each query. QROP

(Query and Resource Optimization) proposed con-

sidering the resources required by each query during

query planning to minimize costs (Viswanathan et al.,

2018). PCC (Performance Characteristic Curve), de-

rived from past resource utilization and query exe-

cution time data, is employed by QROP to allocate

optimal resources to each query individually, reduc-

ing costs through a trade-off in query response time

(Pimpley et al., 2022). An elastic resource manage-

ment technique traded off OLTP throughput to en-

hance OLAP query time by allocating more resources

to OLAP queries (Raza et al., 2020).

Despite advancements, techniques, and systems

featuring optimizations for resources used by data

loading and query execution tasks, task scheduling,

and query-speciﬁc resource allocation based on real-

time hardware resource availability remain scarce.

ICEIS 2024 - 26th International Conference on Enterprise Information Systems

338

3 RAW-HF: RESOURCE

AVAILABILITY & WORKLOAD

AWARE HYBRID

FRAMEWORK

This section discusses the proposed RAW-HF frame-

work integrated with lightweight partitioning, task

scheduling, and query-speciﬁc resource allocation al-

gorithms. The following subsections discuss the ar-

chitecture and modules of RAW-HF.

3.1 RAW-HF Architecture

The RAW-HF consists of four modules. Each module

serves a task dedicated to it. Figure 1 shows how all

the framework modules connect and communicate.

• Raw Data Query Processing (RQP) module:

RQP module processes the given dataset and

workload automatically for individual and hybrid

systems. It can handle raw data storage, data

loading, and query processing tasks for individual

in-situ engine, DBMS, HTAP, or hybrid systems.

The combination of an in-situ engine (NoDB) and

DBMS (PostgreSQL) is used to build a hybrid

system because in-situ engines eliminate DLT

while DBMS performs complex queries faster.

• Resource Monitoring (RM) module: Monitors

resources utilized by each workload task and ﬁl-

ters essential information for each query obtained

from RM tools like top and iotop for faster online

or ofﬂine analysis. This real-time resource utiliza-

tion data is used by MUER module.This module

also records past query execution information for

ORR and MUER modules to make partitioning

and resource allocation decisions.

• Optimizing Required Resources (ORR) mod-

ule: Reduces unnecessary utilization of resources.

It optimizes data loading and query execution

time by combining query complexity, workload,

& storage utilization aware lightweight partition-

ing for a given dataset & workload.

• Maximizing Utilization of Existing Resources

(MUER) module: MUAR considers the real-time

availability of resources to utilize resources efﬁ-

ciently and reduces total workload execution time

by allocating additional resources. It schedules

data loading tasks in parallel to querying raw data.

It identiﬁes complex queries (CQs) and reduces

QET for CQs by increasing work memory.

RAW-HF core algorithms reside in ORR and

MUER modules. Therefore, both of them have been

explained in detail below.

Figure 1: RAW-HF Architecture.

3.1.1 Optimizing Required Resources (ORR)

The proposed workload and query complexity aware

algorithms of ORR use lightweight query identiﬁca-

tion and partitioning steps to partition the dataset for

a hybrid system. The idea behind the query complex-

ity aware algorithm is to partition the dataset and dis-

tribute the workload in such a way that queries per-

forming faster on a given tool can be allocated to that

tool. The analysis has shown that zero-join queries

perform faster while multiple joins queries are slower

in in-situ engines than in traditional DBMS.

The Query Complexity Identiﬁcation (QCI) algo-

rithm ﬁrst identiﬁes the simple queries (SQ) and com-

plex queries (CQ) type queries stored in the workload

list. It uses the simple logic of counting no. of tables

present in the query statement. The query is classi-

ﬁed as complex; when two or more table instances

are found in the query statement. The single table in-

stance queries are classiﬁed as simple queries. The

GRA function groups attributes accessed by SQs and

CQs in two different lists. The SQ partition is QT P0,

and the CQ partition is QT P1 after the ﬁrst round of

partitioning. SQ partition can be stored in raw for-

mat, while the CQ partition needs to be loaded in

DBMS. ORR further reﬁnes the partitions based on

different storage budget B in QC function. The par-

tially covered queries (PCQ) list is new input query

type QT. QC function can be called until all attributes

are covered for different storage budgets. Most fre-

quent queries and storage budget list can be sorted

in descending order to reduce iterations & cover fre-

quent queries ﬁrst.

A Hybrid Framework for Resource-Efﬁcient Query Processing by Effective Utilization of Existing Resources

339

Algorithm 1: ORR Module Partitioning.

Data: w l = Workload List; QT = Query Types Dic-

tionary; q l = Query List; B = Storage budget B in

MB; que d = Dictionary of Queries; s d = Schema

Dictionary; QT P = Query Type Partitions; QT P’=

Final Query Type Partitions for given budget B; q l =

Query List; PCQ = Partially Covered Queries List;

ca l = List of covered Attributes; rqa l = List of

Remaining Query Attributes; cq l = List of covered

Queries;

Result: SQ-Raw, CQ-DB & CAP Partitions;

# # Query Complexity Identiﬁcation

1. def QCI(w l, que d, s d):

2. For each task T in w l do

3. If T.Statement has multiple tables

4. QT[T.Q ID] = 1

5. Else

6. QT[T.Q ID] = 0

7. End

8. Get QT P0, QT P1 = GRA(que d, QT)

9. Return ﬁnal partitions QT P’;

# #Grouping of Attributes

10. def GRA(que d, QT)

11. For each query i in que d:

12. For each attribute j in que d[i]:

13. If QT[i] == 0

14. Add j in QT P0

15. If budget B is limited

16. QT P’ = QC(i, ca l,

cq l, B)

17.Repeat above step until PCQ is empty for different

18. Else

19. Add j in QT P1

20. End

21. End

22. Return QT P0, QT P1,QT P’;

# #Grouping Attributes based on Budget B

23. def QC(i, ca l, cq l, B)

24. If (SUM(size of all attributes of query

que d[i]))¡ B

25. For each attribute A in que d[i]

26. If A is not in ca l & size of

A¡remaining budget B

27. Add A in rqa l list

28. If size of rqa l ¡ remaining budget B

29. Move all attributes of rqa l in ca l

& update B

30. Add query q in cq l

31. Else

32. Add query q in PCQ

33. Return ca l, cq l, rqa l;

3.1.2 Maximizing Utilization of Existing

Resources (MUER)

This module implements Maximizing Utilization of

Available Resources (MUAR) technique to improve

the utilization of existing resources. MUAR considers

real-time resource monitoring values of CPU, RAM,

and IO resource utilization stored in the global struc-

ture RM AR. RM AR is continuously updated by the

resource monitoring module. MUAR adds a new task

for processing if all three values of RM AR are greater

than the minimum required CPU, RAM & IO re-

sources stored in Min RR. Before executing the query

in the new thread, the WM Query function is called

to set the work memory for each complex query to

increase RAM utilization.

The WM Query function ﬁrst counts the num-

ber of joins used in the given query and stores the

count in J C. The WM Value in line 18 calculates the

work memory value for new queries based on avail-

able memory RM AR.RAM, process count (P C), to-

tal RAM (TR), and join count J C. The ﬁrst part of

the equation divides the available RAM between the

maximum processes that the available CPU cores can

handle. The second part deﬁnes the maximum RAM

that can be assigned to a thread, while the third part

helps in allocating more RAM to complex queries

considering join count J C. MUAR also tries to es-

timate required work memory considering previous

work mem, disk writes, current & past record count

ratio for frequent queries to achieve the best QET.

4 EXPERIMENTAL SETUP

This section provides details of experimental setup.

4.1 Hardware & Software Setup

The experimental machine uses a quad-core Intel i5-

6500 CPU clocked at 3.20GHz. It has 16GB of RAM.

The operating system of the machine is running a 64-

bit Ubuntu 18.04 LTS. The machine has a 500GB

SATA hard disk drive to store raw datasets and DBMS

databases. The disk rotation speed is 7200RPM. The

RAW-HF framework is implemented by combining

and updating the raw data query processing, resource

monitoring, ORR, and MUER modules. The updated

framework uses Eclipse to run Java code connected

with state-of-the-art open-source DBMS PostgreSQL

and NoDB. NoDB has data caching capability and ex-

ecutes SQL queries on raw ﬁles. The top and iotop

tools provide real-time resource utilization data of

CPU, RAM & IO resources to the RM module.

ICEIS 2024 - 26th International Conference on Enterprise Information Systems

340

4.2 Dataset & Query Set

Data release 16 of Sloan Digital Sky Survey (SDSS)

(Ahumada et al., 2020) has been used to check the

ORR and MUER phases of RAW-HF. 19GB partition

having 4M records of PhotoPrimary view having 509

attributes has been used to represent SDSS dataset.

We have extracted the top 1000 unique queries exe-

cuted on PhotoPrimary view. The queries have been

grouped based on the similarity of the attributes and

query type, forming 12 query groups. One represen-

tative query has been chosen from each group.

The 8GB partition with 35M records containing

descriptions of blizzard and hurricane observations

has been extracted from Linked Observation Data

(LOD) (Patni et al., 2010; Padiya and Bhise, 2017)

(Ahumada et al., 2020). LOD is a benchmark RDF

triple store dataset. It has been used to investigate the

performance of the MUER module of RAW-HF. Nine

queries out of 16 queries of LOD workload have 5

joins. Queries with multiple self-joins have been con-

sidered because these queries need to process com-

plex join operations, requiring signiﬁcant resources.

5 RAW-HF RESULTS

This section presents the results of RAW-HF after

combining all techniques proposed in ORR & MUER.

5.1 RAW-HF: Workload Execution

Time

Figure 2: WET: Comparison.

Figure 2 shows the comparison of all the techniques

with RAW-HF. First column shows the raw data query

processing time using NoDB, which has zero data

loading time. The 2nd column shows WET time

required by traditional DBMS PgSQL. The 3rd and

4th columns show the WET results for individual

ORR and MUER phases of RAW-HF. RAW-HF takes

only 22.6 sec to complete the execution compared

to 30.6sec required by workload aware partitioning

techniques like Partial Loading (WA)(Zhao et al.,

2015). It can be observed that the combination of all

techniques achieved total reduction of 96.32% using

RAW-HF compared to NoDB and 26.14% compared

to the WA. RAW-HF beneﬁts from low DLT achieved

by only loading attributes used by complex queries in

the ORR phase for SDSS.

5.2 RAW-HF: Resource Utilization

Figure 3 shows the comparison of resources utilized

by ORR, MUAR, and RAW-HF (ORR+MUER) with

NoDB , PostgreSQL , and WA. Here, the 1M records

dataset used in experiment utilized 4.7GB of space on

IO device. MUAR results showed that CPU utiliza-

tion time is reduced by only 6.34% because most of

the time is spent in data loading process compared to

PgSQL. MUAR can only utilize other CPU cores to

execute read queries in parallel. Figure 3 conﬁrms

that CPU utilization is reduced by 77% only during

query processing tasks due to parallel processing.

The QCA with WSAC technique in ORR reduced

the required DB partition size(IO) by 91.08%, re-

ducing WET, CPU, and RAM utilization by 85.9%,

85.1%, and 80.9%. WA (Partial loading) reduced the

CPU and RAM utilization by 81.3% and 86.4%. The

RAW-HF experiments combined ORR and MUAR

techniques, which showed a 32.8% increase in RAM

utilization compared to the Partial Loading tech-

nique due to parallel processing of queries. How-

ever, RAW-HF improved DLT, QET, WET, CPU, and

DB Size(IO) requirements by 29.5%, 12.9%, 26.14%,

26.14%, 24.92% compared to Partial Loading tech-

nique (Zhao et al., 2015) executing all read queries in

parallel after data loading is complete.

Figure 3: RAW-HF: Resource Utilization.

5.3 RAW-HF for Different Datasets

This section discusses the impact of RAW-HF on

WET for different types of datasets like LOD &

SDSS. Table 1 compares LOD and SDSS datasets

based on the Fraction of Attributes Accessed (FAA)

A Hybrid Framework for Resource-Efﬁcient Query Processing by Effective Utilization of Existing Resources

341

by workload queries and the Fraction of Attributes

Loaded (FAL) by RAW-HF. It can be seen that SDSS

is a broad table dataset. All the SDSS workload

queries access only 10.6% of attributes. On the other

hand, LOD dataset is a narrow table dataset contain-

ing only three attributes. Due to fewer attributes, al-

most all queries use two or more attributes. The im-

pact of broad and narrow tables and queries accessing

only a small part or entire of the dataset can be seen

in the ORR results for SDSS in Figure 4. MUAR al-

located maximum available resources to CQs and re-

duced WET signiﬁcantly for the LOD dataset because

87% of LOD workload queries consist of two or more

joins. In comparison, SDSS had only one query with

more than one join.

Figure 4: Impact of RAW-HF on WET for LOD & SDSS

datasets.

RAW-HF only loads attributes required by com-

plex queries to reduce DLT time. This helps datasets

like SDSS, which requires only a small fraction of

the dataset (10.6%) to answer queries by loading only

6.7% of attributes. The remaining 93.3% of attributes

are not loaded into DBMS, reducing WET by 85.9%

for the SDSS dataset. On the other hand, vertical

partitioning cannot help datasets like LOD that ac-

cess 100% of attributes. Therefore, the WET re-

duction achieved by applying ORR phase is 0% for

LOD. However, the MUAR in MUER phase achieves

84.8% reduction in WET by efﬁciently utilizing ex-

isting CPU and RAM resources for complex queries.

5.4 RAW-HF: Dynamic Resource

Allocation

Figure 5 illustrates the comparison between the dy-

namic task and resource allocation of RAW-HF and

static allocation in PostgreSQL. The results are ob-

tained through the execution of the LOD workload

on a single core (1 thread) and multiple cores (three

threads), varying the values of work memory. In con-

trast to RAW-HF’s query-speciﬁc memory allocation,

the static parallel execution method maintains a con-

stant value of work memory across different queries.

Figure 5: Impact of RAW-HF on WET compared to Static

task & resource allocation.

The comparison reveals that the static multi-

thread execution of the workload, with a default 4 MB

work memory allocation, takes more time than RAW-

HF’s single-thread dynamic memory allocation. In-

creasing the work memory allocation to 3500 MB

with the static approach resulted in a 22% increase

in workload execution time compared to a 1500 MB

allocation. This increase is attributed to the over-

allocation of memory for certain queries, leading to

reduced memory availability for complex queries run-

ning in parallel. RAW-HF reduces WET by 4-67%

compared to the static approach.

6 COMPARISON WITH

STATE-OF-THE-ART

This section presents a comparison of the RAW-

HF technique with other state-of-the-art techniques

and tools like NoDB (Alagiannis et al., 2012), Post-

greSQL (Pos, 2022), WA (Partial Loading) (Zhao

et al., 2015) , and PCC (Pimpley et al., 2022). NoDB

is an open-source in-situ processing engine with main

memory caching and indexing features (Alagiannis

et al., 2012). Partial Loading technique (Zhao et al.,

2015) loaded only 10.6% of original dataset into

DBMS, which reduced the WET time by 88.1% com-

pared to NoDB. In comparison, RAW-HF loads only

6.7% of dataset. Only RAW-HF can execute join

queries on data residing in DBMS and CSV ﬁles. Un-

like PCC, RAW-HF allocates appropriate resources to

ad-hoc queries.

Table 2 presents a comparison of state-of-the-art

raw data query processing techniques with RAW-HF.

Resource Utilization Aware (RUA) shows whether the

technique considered resource utilization information

or not. The Multi-Format Join column represents

whether the tool can execute join queries on data re-

siding in raw and database formats. PDC uses a static

partition of main memory (50%) to keep lookup ta-

bles. PCC needs historical data for allocating ap-

ICEIS 2024 - 26th International Conference on Enterprise Information Systems

342

Table 1: ORR: Fraction of Attributes Accessed (FAA) and Loaded (FAL).

Total

Attributes

Accessed

Attributes

FAA

(%)

Loaded

Attributes

FAL

(%)

DLT

(%)

QET

(%)

WET

(%)

SDSS 509 54 10.6 34 6.7 90.9 87.9 85.9

LOD 3 3 100.0 3 100.0 0 0 0

Table 2: RAW-HF Technique Comparison with State-of-the-art.

# Technique / Tool Partiti-

oning

DBMS

Data

Work-

load

Aware

Ad-

hoc

queries

RUA Multi-

format

Join

Remarks

1 NoDB (Alagiannis

et al., 2012)

None 0 No NA No No Required more memory.

High QET.

2 PostgreSQL (Pos,

2022)

None 100 No NA No No High DLT, Low QET &

memory.

3 WA (Zhao et al.,

2015)

VP 10.6 Yes No Yes No Technique Not

Lightweight

4 PCC (Pimpley

et al., 2022)

- 100 Yes No Yes No Resource Intensive, for

freq. queries

5 RAW-HF (Hybrid) VP 6.7 Yes Yes Yes Yes Lightweight, works for

ad-hoc & freq. queries

Table 3: RAW-HF Performance Parameters Comparison.

# Technique/

Tool

Query Performance % Resource Utilization %

DLT (sec) QET (sec) WET (sec) CPU (sec) RAM (MB) IO (MB)

1 NoDB 0 613.59

(99.12%)

613.59

(96.32%)

613.59

(96.32%)

10956.8

(90.36%)

4710.4

(94.96%)

2 DBMS

(PostgreSQL)

188.63

(90.88%)

44.7

(87.92%)

233.33

(90.31%)

233.33

(89.78%)

2758.3

(61.70%)

2660.4

(91.08%)

3 WA (Partial

Loading)

24.4

(29.51%)

6.2

(12.90%)

30.6

(26.14%)

30.6

(26.14%)

709.2

(+32.87%)

316.0

(24.92%)

4 RAW-HF 17.2 5.4 22.6 22.6 1056.5 237.3

propriate resources to each query. However, it does

not work well for ad-hoc queries. In comparison, al-

though RAW-HF performs workload aware partition-

ing, resource allocation is done based on query com-

plexity. Therefore, RAW-HF is capable of allocating

appropriate resources to ad-hoc queries.

Figure 2 & 3 presented comparison of these tech-

niques with RAW-HF for SDSS dataset. Table 3

shows the RAW-HF performance parameters compar-

ison with state-of-the-art techniques.

7 CONCLUSION

This work presented a Resource Availability and

Workload aware Hybrid Framework (RAW-HF) to

process raw datasets efﬁciently. It optimized re-

quired resources using ORR module developed by in-

tegrating lightweight algorithms of Query Complex-

ity Aware (QCA) and Workload and Storage aware

Cost-based (WSAC) techniques. This work also ad-

dresses the issue of underutilized resources during

data loading and query processing using MUAR mod-

ule, considering the availability of hardware resources

in real-time. RAW-HF allows the execution of join

queries on data stored in DBMS and raw format.

RAW-HF never loads partitions used by simple

queries, reducing DBMS data loading requirements

by 93.3% for SDSS. ORR phase works better for

broad table datasets like SDSS, while MUAR is ca-

pable of improving WET for workloads with com-

plex multi-join queries like LOD. RAW-HF reduced

WET by 84.8% for SDSS and 90.3% for LOD com-

pared to the PostgreSQL. It reduced the WET by 26%,

96%, and 90% compared to the state-of-the-art Partial

A Hybrid Framework for Resource-Efﬁcient Query Processing by Effective Utilization of Existing Resources

343

Loading techniques, NoDB, and PostgreSQL respec-

tively for SDSS.

REFERENCES

(2022). Postgresql: The world’s most advanced open source

database. https://www.postgresql.org/.

Abouzied, A., Abadi, D. J., and Silberschatz, A. (2013).

Invisible loading: Access-driven data transfer from

raw ﬁles into database systems. In Proceedings of the

16th International Conference on Extending Database

Technology - EDBT ’13, volume 20. ACM Press.

Ahumada, R., Prieto, C. A., and et al. (2020). The 16th Data

Release of the Sloan Digital Sky Surveys: First Re-

lease from the APOGEE-2 Southern Survey and Full

Release of eBOSS Spectra. The Astrophysical Journal

Supplement Series, 249(1).

Ailamaki, A. (2015). Databases and hardware: The begin-

ning and sequel of a beautiful friendship. Proceedings

of the VLDB Endowment, 8(12).

Alagiannis, I., Borovica, R., Branco, M., Idreos, S., and

Ailamaki, A. (2012). Nodb in action. Proceedings of

the VLDB Endowment, 5(12).

Anneser, C., Tatbul, N., Cohen, D., Xu, Z., Pandian,

P., Laptev, N., and Marcus, R. (2023). Autosteer:

Learned query optimization for any sql database.

PVLDB, 16:12.

Cheng, Y. and Rusu, F. (2015). Scanraw: A database meta-

operator for parallel in-situ processing and loading.

ACM Transactions on Database Systems, 40(3).

Dziedzic, A., Karpathiotakis, M., Alagiannis, I., Ap-

puswamy, R., and Ailamaki, A. (2017). DBMS Data

Loading: An Analysis on Modern Hardware. In Data

Management on New Hardware, volume 10195 LNCS,

page 95–117. Springer International Publishing.

Guo, H., Wang, L., and Liang, D. (2016). Big earth data

from space: a new engine for earth science. Science

Bulletin, 61(7).

Kaviani, N., Kalinin, D., and Maximilien, M. (2019). To-

wards serverless as commodity. In Proceedings of the

5th International Workshop on Serverless Computing

- WOSC ’19. ACM Press.

Li, Y., Wang, L., Wang, S., Sun, Y., and Peng, Z. (2022).

A resource-aware deep cost model for big data query

processing. Proceedings - International Conference

on Data Engineering.

Maximilien, M., Hadas, D., Danducci II, A., and Moser, S.

(2022). The future is serverless - ibm developer.

Olma, M., Karpathiotakis, M., Alagiannis, I., Athanas-

soulis, M., and Ailamaki, A. (2020). Adaptive parti-

tioning and indexing for in situ query processing. The

VLDB Journal, 29(1).

Padiya, T. and Bhise, M. (2017). DWAHP: Workload Aware

Hybrid Partitioning and Distribution of RDF Data.

In In Proceedings of the 21st International Database

Engineering and Applications Symposium (IDEAS).

ACM.

Patel, M. and Bhise, M. (2022). Query complexity based

optimal processing of raw data. In 2022 IEEE 10th Re-

gion 10 Humanitarian Technology Conference (R10-

HTC). IEEE.

Patel, M. and Bhise, M. (2023a). MUAR: Maximizing

Utilization of Available Resources for Query Process-

ing. In 2023 IEEE/ACM 23rd International Sympo-

sium on Cluster, Cloud and Internet Computing Work-

shops (CCGridW).

Patel, M. and Bhise, M. (2023b). Resource monitoring

framework for big raw data processing. International

Journal of Big Data Intelligence, Inderscience, 9(1).

Patel, M., Yadav, N., and Bhise, M. (2022). Workload

aware cost-based partial loading of raw data for lim-

ited storage resources. In Futuristic Trends in Net-

works and Computing Technologies FTNCT, Lecture

Notes in Electrical Engineering, vol 936. Singapore:

Springer Nature Singapore.

Patni, H., Henson, C., and Sheth, A. (2010). Linked sensor

data. In 2010 International Symposium on Collabora-

tive Technologies and Systems. IEEE.

Pimpley, A., Li, S., Sen, R., Srinivasan, S., and Jindal, A.

(2022). Towards optimal resource allocation for big

data analytics. In International Conference on Extend-

ing Database Technology, EDBT.

Raza, A., Chrysogelos, P., Anadiotis, A. C., and Ailamaki,

A. (2020). Adaptive htap through elastic resource

scheduling. In Proceedings of the 2020 ACM SIG-

MOD International Conference on Management of

Data. ACM.

Tang, H., Byna, S., Dong, B., and Koziol, Q. (2020). Par-

allel query service for object-centric data manage-

ment systems. Proceedings - 2020 IEEE 34th Interna-

tional Parallel and Distributed Processing Symposium

Workshops, IPDPSW 2020.

Viswanathan, L., Jindal, A., and Karanasos, K. (2018).

Query and resource optimization: Bridging the gap.

In 2018 IEEE 34th International Conference on Data

Engineering (ICDE). IEEE.

Zhao, W., Cheng, Y., and Rusu, F. (2015). Vertical parti-

tioning for query processing over raw data. In Pro-

ceedings of the 27th International Conference on Sci-

entiﬁc and Statistical Database Management, volume

29-June-20. ACM.

ICEIS 2024 - 26th International Conference on Enterprise Information Systems

344