plements a variant of the entity-attribute-value (EAV)
model and can be thought of as a multi-dimensional
sorted map. This map is called HTable and is indexed
by the row key, the column name and a timestamp.
HBase has a block cache implementing the LRU replacement
algorithm. Several key-values are grouped
into blocks of configurable size, and these blocks are
the units managed by the cache mechanism. The block size
within the block cache is a configurable parameter that defaults to
64 KB.
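The behavior described above can be illustrated with a minimal sketch of an LRU cache that operates on fixed-size blocks rather than individual key-values; the class and method names below are ours for illustration, not HBase's actual implementation:

```python
from collections import OrderedDict

class BlockCache:
    """Minimal LRU block cache sketch: data is cached in fixed-size
    blocks, and the least recently used block is evicted once the
    byte capacity is exceeded."""

    def __init__(self, capacity_bytes, block_size=64 * 1024):
        self.capacity = capacity_bytes
        self.block_size = block_size
        self.blocks = OrderedDict()  # block_id -> block contents
        self.hits = self.misses = 0

    def _block_id(self, offset):
        # Key-values at nearby offsets share a block, so popular
        # contiguous keys produce hits on the same cached block.
        return offset // self.block_size

    def get(self, offset, load_block):
        bid = self._block_id(offset)
        if bid in self.blocks:
            self.hits += 1
            self.blocks.move_to_end(bid)        # mark most recently used
        else:
            self.misses += 1
            self.blocks[bid] = load_block(bid)  # fetch from storage
            while len(self.blocks) * self.block_size > self.capacity:
                self.blocks.popitem(last=False)  # evict LRU block
        return self.blocks[bid]

    def hit_ratio(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Note that caching whole blocks means a request for any key inside an already-cached block is a hit, which is exactly why clustered popular keys raise the hit ratio.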
3 INTERDEPENDENCE OF RESOURCE USAGE AND CACHE HIT RATIO
The cache hit ratio has a great impact on how a system
performs and is thus directly related to its resource
consumption. By resource consumption we
mean the amount of main memory used, the number
of I/O operations to distinct storage mediums and the
amount of memory/disk swapping needed. The server
usage encompasses the CPU time spent waiting for I/O
operations to complete (I/O_wait), the time spent in user
space (CPU_user) and the time spent in kernel space
(CPU_system):

Server_usage = I/O_wait + CPU_user + CPU_system
With this measure it is possible to have an accurate
picture of how the machine is using its resources. Although
the I/O wait corresponds to a period when the
CPU is free for other computational tasks, we are addressing
a specific scenario that focuses on a NoSQL
database, where we cannot achieve perfect parallelism
between I/O wait and CPU usage. In fact,
as most operations require network and/or disk resources,
we must include I/O wait for the metric to accurately
represent the cost of such operations. Thus,
if the combined I/O wait and CPU usage reaches
100%, the throughput does not increase by adding
more clients.
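The Server_usage metric above can be computed from standard OS counters; a minimal sketch using the cumulative CPU times in Linux's /proc/stat (the same counters top reports as us, sy and wa), where the function names are ours for illustration:

```python
def read_cpu_times(stat_text):
    """Parse the aggregate 'cpu' line of /proc/stat:
    'cpu user nice system idle iowait irq softirq steal ...'."""
    fields = [int(v) for v in stat_text.splitlines()[0].split()[1:]]
    user, nice, system, idle, iowait = fields[:5]
    return {"user": user + nice, "system": system, "iowait": iowait,
            "total": sum(fields)}

def server_usage(before, after):
    """Server_usage = I/O_wait + CPU_user + CPU_system, expressed as a
    fraction of total CPU time elapsed between two samples."""
    total = after["total"] - before["total"]
    busy = sum(after[k] - before[k] for k in ("user", "system", "iowait"))
    return busy / total if total else 0.0
```

In practice the two samples would be taken one second apart (e.g. by reading /proc/stat, sleeping, and reading again), matching the per-second logging described below.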
To demonstrate that the cache
hit ratio is indeed related to server usage, we set up two different
experiments using an HBase deployment and
YCSB (Cooper et al., 2010) as the workload generator.
These experiments, while not necessarily representative
of real-world workloads, cover a wide spectrum
of possible behaviors. With these we are able to
show a clear and direct relationship between the cache
hit ratio and server usage in NoSQL databases.
In both experiments, one node acts as master for
both HBase and HDFS, and it also holds a ZooKeeper (Hunt
et al., 2010) instance running in standalone mode,
which is required by HBase. Our HBase cluster was
composed of 1 RegionServer, configured with a heap
of 4 GB, and 1 DataNode. HBase's LRU block cache
was configured to use 55% of the heap size, which
HBase translates into roughly 2.15 GB. We used one
other node to run the YCSB workload generator. The
YCSB client was configured with a readProportion
of 100%, i.e., it only issues get operations, and with a
fixed throughput of 2000 operations per second over
75 client threads, so that we solely analyze the impact of
the cache hit ratio on server usage. All experiments were
cache hit ratio in server usage. All experiments were
set to run for 30 minutes with 150 seconds of ramp
up time and the results are the computed average of
5 individual runs. The server usage was logged ev-
ery second in the RegionServer node using the UNIX
top command. All nodes used for these experiments
have an Intel i3 CPU at 3.1GHz, with 8GB of main
memory and a local 7200 RPM SATA disk, and are
interconnected by a switched Gigabit local area net-
work.
In the first experiment, a single region was populated
using the YCSB generator with 4,000,000
records (4.3 GB). This means that the region cannot
fit entirely into the block cache: about 1.1 million
records (1.21 GB) remain in secondary memory
and must be brought into main memory when requested.
There were two different scenarios, each with
a different configured request popularity:
1. A uniform popularity distribution, that is, all
records have equal probability of being requested
(this is the case where the cache hit ratio is minimum
(Sleator and Tarjan, 1985));
2. A zipfian popularity distribution, highly skewed
and clustered, meaning that the most popular keys
are contiguous, which makes them fall in the same
cache block.
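The two scenarios map directly onto YCSB's requestdistribution workload property. A hypothetical workload file along these lines could express them (the file name and operationcount value, derived from 2000 ops/s over 30 minutes, are our assumptions; exact property support depends on the YCSB version in use):

```
# workload-read-only.properties (hypothetical file name)
workload=com.yahoo.ycsb.workloads.CoreWorkload
recordcount=4000000
operationcount=3600000
readproportion=1.0
# one of the two scenarios:
requestdistribution=uniform
# requestdistribution=zipfian
```

The target throughput and thread count described above would then be passed at the command line (YCSB's -target and -threads options).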
The results for this experiment are depicted in Table 1.
As expected, the uniform request popularity is
the one that achieves the lowest cache hit ratio (49%)
and thus consumes the most server resources (58.35%).
On the contrary, the zipfian request popularity attains
the highest cache hit ratio (93%), because the most
popular records are clustered and, as a result, fall in
the same block served by HBase's block cache. Because
of that it consumes the least server resources
(only 19.28%) for the same 2000 ops/s.
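One illustrative way to read these numbers (our interpretation, not a model proposed in the paper) is as a two-level cost model: a request served from the block cache costs c_hit, a miss pays the full disk/network path c_miss, so the average per-request cost is p_hit * c_hit + (1 - p_hit) * c_miss. A minimal sketch that also recovers the two costs from a pair of measurements:

```python
def expected_server_usage(p_hit, c_hit, c_miss):
    """Average per-request server cost: hits are served from the block
    cache at cost c_hit, misses pay the full I/O path cost c_miss."""
    return p_hit * c_hit + (1.0 - p_hit) * c_miss

def fit_costs(p1, u1, p2, u2):
    """Solve the 2x2 linear system u = p*c_hit + (1-p)*c_miss
    for two (hit ratio, usage) samples."""
    c_hit = (u1 * (1 - p2) - u2 * (1 - p1)) / \
            ((1 - p2) * p1 - (1 - p1) * p2)
    c_miss = (u1 - p1 * c_hit) / (1 - p1)
    return c_hit, c_miss
```

Fitting the two Table 1 points (49% hit ratio at 58.35% usage, 93% at 19.28%) under this assumed model yields a miss roughly an order of magnitude more expensive than a hit, consistent with the intuition that the miss path dominates server usage.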
Table 1: Average Server_usage and cache hit ratio results under
2 different distributions, for a region not fitting in the block
cache.

Distribution   p_hit   Average Server_usage   #Records
Uniform        49%     58.35%                 4,000,000
Zipfian        93%     19.28%                 4,000,000
In the following experiment we show that two dif-
DataDiversityConvergence 2016 - Workshop on Towards Convergence of Big Data, SQL, NoSQL, NewSQL, Data streaming/CEP, OLTP and OLAP