plements a variant of the entity-attribute-value (EAV)
model and can be thought of as a multi-dimensional
sorted map. This map is called HTable and is indexed
by the row key, the column name and a timestamp.
HBase has a block cache implementing the LRU replacement
algorithm. Several key-values are grouped
into blocks of configurable size, and these blocks are
the units managed by the cache mechanism. The block size
within the block cache is a configurable parameter that defaults to
64 KB.
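The behavior described above can be illustrated with a minimal sketch of an LRU cache that operates on fixed-size blocks rather than individual key-values; the class and method names below are ours for illustration, not HBase's actual implementation:

```python
from collections import OrderedDict

class BlockCache:
    """Minimal LRU block cache sketch: data is cached in fixed-size
    blocks, and the least recently used block is evicted once the
    byte capacity is exceeded."""

    def __init__(self, capacity_bytes, block_size=64 * 1024):
        self.capacity = capacity_bytes
        self.block_size = block_size
        self.blocks = OrderedDict()  # block_id -> block contents
        self.hits = self.misses = 0

    def _block_id(self, offset):
        # Key-values at nearby offsets share a block, so popular
        # contiguous keys produce hits on the same cached block.
        return offset // self.block_size

    def get(self, offset, load_block):
        bid = self._block_id(offset)
        if bid in self.blocks:
            self.hits += 1
            self.blocks.move_to_end(bid)        # mark most recently used
        else:
            self.misses += 1
            self.blocks[bid] = load_block(bid)  # fetch from storage
            while len(self.blocks) * self.block_size > self.capacity:
                self.blocks.popitem(last=False)  # evict LRU block
        return self.blocks[bid]

    def hit_ratio(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Note that caching whole blocks means a request for any key inside an already-cached block is a hit, which is exactly why clustered popular keys raise the hit ratio.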
3 INTERDEPENDENCE OF RESOURCE USAGE AND CACHE HIT RATIO
The cache hit ratio has a great impact on how a system
performs and is thus directly related to its resource
consumption. By resource consumption we
mean the amount of main memory used, the number
of I/O operations to distinct storage mediums and the
amount of memory/disk swapping needed. The server
usage encompasses the CPU time spent waiting for I/O
operations to complete (I/O_wait), the time spent in user
space (CPU_user) and the time spent in kernel space
(CPU_system):

Server_usage = I/O_wait + CPU_user + CPU_system
With this measure it is possible to have an accurate
picture of how the machine is using its resources. Although
the I/O wait corresponds to a period when the
CPU is free for other computational tasks, we are addressing
a specific scenario that focuses on a NoSQL
database, where we cannot achieve perfect parallelism
between I/O wait and CPU usage. In fact,
as most operations require network and/or disk resources,
we must include I/O wait for the metric to accurately
represent the cost of such operations. Thus,
if the combined I/O wait and CPU usage reaches
100%, the throughput does not increase by adding
more clients.
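The Server_usage metric above can be computed from standard OS counters; a minimal sketch using the cumulative CPU times in Linux's /proc/stat (the same counters top reports as us, sy and wa), where the function names are ours for illustration:

```python
def read_cpu_times(stat_text):
    """Parse the aggregate 'cpu' line of /proc/stat:
    'cpu user nice system idle iowait irq softirq steal ...'."""
    fields = [int(v) for v in stat_text.splitlines()[0].split()[1:]]
    user, nice, system, idle, iowait = fields[:5]
    return {"user": user + nice, "system": system, "iowait": iowait,
            "total": sum(fields)}

def server_usage(before, after):
    """Server_usage = I/O_wait + CPU_user + CPU_system, expressed as a
    fraction of total CPU time elapsed between two samples."""
    total = after["total"] - before["total"]
    busy = sum(after[k] - before[k] for k in ("user", "system", "iowait"))
    return busy / total if total else 0.0
```

In practice the two samples would be taken one second apart (e.g. by reading /proc/stat, sleeping, and reading again), matching the per-second logging described below.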
To demonstrate that the cache
hit ratio is indeed related to server usage, we set up two different
experiments using an HBase deployment and
YCSB (Cooper et al., 2010) as the workload generator.
These experiments, while not necessarily representative
of real-world workloads, cover a wide spectrum
of possible behaviors. With these we are able to
show a clear and direct relationship between the cache
hit ratio and server usage in NoSQL databases.
In both experiments, one node acts as master for
both HBase and HDFS, and it also holds a ZooKeeper (Hunt
et al., 2010) instance running in standalone mode,
which is required by HBase. Our HBase cluster was
composed of 1 RegionServer, configured with a heap
of 4 GB, and 1 DataNode. HBase's LRU block cache
was configured to use 55% of the heap size, which
HBase translates into roughly 2.15 GB. We used one
other node to run the YCSB workload generator. The
YCSB client was configured with a readProportion
of 100%, i.e., it only issues get operations, and with a
fixed throughput of 2000 operations per second over
75 client threads, so that we solely analyze the impact of
the cache hit ratio on server usage. All experiments were
cache hit ratio in server usage. All experiments were
set to run for 30 minutes with 150 seconds of ramp
up time and the results are the computed average of
5 individual runs. The server usage was logged ev-
ery second in the RegionServer node using the UNIX
top command. All nodes used for these experiments
have an Intel i3 CPU at 3.1GHz, with 8GB of main
memory and a local 7200 RPM SATA disk, and are
interconnected by a switched Gigabit local area net-
work.
In the first experiment, a single region was populated
using the YCSB generator with 4,000,000
records (4.3 GB). This means that the region cannot
fit entirely into the block cache: about 1.1 million
records (1.21 GB) remain in secondary memory
and must be brought into main memory when requested.
There were two different scenarios, each with
a different configured request popularity:
1. A uniform popularity distribution, that is, all
records have equal probability of being requested
(this is the case where the cache hit ratio is minimum
(Sleator and Tarjan, 1985));
2. A zipfian popularity distribution, highly skewed
and clustered, meaning that the most popular keys
are contiguous, which makes them fall in the same
cache block.
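The two scenarios map directly onto YCSB's requestdistribution workload property. A hypothetical workload file along these lines could express them (the file name and operationcount value, derived from 2000 ops/s over 30 minutes, are our assumptions; exact property support depends on the YCSB version in use):

```
# workload-read-only.properties (hypothetical file name)
workload=com.yahoo.ycsb.workloads.CoreWorkload
recordcount=4000000
operationcount=3600000
readproportion=1.0
# one of the two scenarios:
requestdistribution=uniform
# requestdistribution=zipfian
```

The target throughput and thread count described above would then be passed at the command line (YCSB's -target and -threads options).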
The results for this experiment are depicted in Table 1.
As expected, the uniform request popularity is
the one that achieves the lowest cache hit ratio (49%)
and thus consumes the most server resources (58.35%).
On the contrary, the zipfian request popularity attains
the highest cache hit ratio (93%), because the most
popular records are clustered and, as a result, fall in
the same block served by HBase's block cache. Because
of that it consumes the least server resources
(only 19.28%) for the same 2000 ops/s.
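One illustrative way to read these numbers (our interpretation, not a model proposed in the paper) is as a two-level cost model: a request served from the block cache costs c_hit, a miss pays the full disk/network path c_miss, so the average per-request cost is p_hit * c_hit + (1 - p_hit) * c_miss. A minimal sketch that also recovers the two costs from a pair of measurements:

```python
def expected_server_usage(p_hit, c_hit, c_miss):
    """Average per-request server cost: hits are served from the block
    cache at cost c_hit, misses pay the full I/O path cost c_miss."""
    return p_hit * c_hit + (1.0 - p_hit) * c_miss

def fit_costs(p1, u1, p2, u2):
    """Solve the 2x2 linear system u = p*c_hit + (1-p)*c_miss
    for two (hit ratio, usage) samples."""
    c_hit = (u1 * (1 - p2) - u2 * (1 - p1)) / \
            ((1 - p2) * p1 - (1 - p1) * p2)
    c_miss = (u1 - p1 * c_hit) / (1 - p1)
    return c_hit, c_miss
```

Fitting the two Table 1 points (49% hit ratio at 58.35% usage, 93% at 19.28%) under this assumed model yields a miss roughly an order of magnitude more expensive than a hit, consistent with the intuition that the miss path dominates server usage.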
Table 1: Average Server_usage and cache hit ratio results under
2 different distributions, for a region not fitting in the block
cache.

Distribution   p_hit   Average Server_usage   #Records
Uniform        49%     58.35%                 4,000,000
Zipfian        93%     19.28%                 4,000,000
In the following experiment we show that two dif-
DataDiversityConvergence 2016 - Workshop on Towards Convergence of Big Data, SQL, NoSQL, NewSQL, Data streaming/CEP, OLTP and OLAP