Reliable In-Memory Data Management on Unreliable Hardware
Dirk Habich, Till Kolditz, Juliana Hildebrandt and Wolfgang Lehner
Database Systems Group, Technische Universität Dresden, Germany
Keywords:
In-Memory, Database Systems, Error Detection, Error Correction, AN Encoding.
Abstract:
The key objective of database systems is to reliably manage data, whereby high query throughput and low
query latency are core requirements. To satisfy these requirements, database systems constantly adapt to novel
hardware features. Although it has been intensively studied and commonly accepted that hardware error rates
in terms of bit flips increase dramatically with the decrease of the underlying chip structures, most database
system research activities neglected this fact, leaving error (bit flip) detection as well as correction to the
underlying hardware. Especially for memory, silent data corruption (SDC) as a result of transient bit flips
leading to faulty data is mainly detected and corrected at the DRAM and memory-controller layer. However,
since future hardware becomes less reliable and error detection as well as correction by hardware becomes
more expensive, this free ride will come to an end in the near future. To continue providing reliable data
management, an emerging research direction is to employ specific and tailored protection techniques at
the database system level. Following that direction, we are currently developing and implementing an adapted system
design for state-of-the-art in-memory column stores. In this position paper, we summarize our vision and its
current state, and outline future work of our research.
1 INTRODUCTION
We have already known for a long time that hardware
components are not perfect and soft errors in terms
of single bit flips happen all the time (transient bit
flips). Up to now, hardware-based protection is used
to mitigate these single bit flips. However, recent
studies have shown that future hardware is becom-
ing less and less reliable and the occurrence of multi-
bit flips instead of single bit flips is prevailing (Kim
et al., 2014; Rehman et al., 2016; Henkel et al., 2013;
Shafique et al., 2015). For example, repeatedly ac-
cessing one memory cell in DRAM modules causes
bit flips in physically-adjacent memory cells, whereby
one to four bit flips per 64-bit word have been dis-
covered (Kim et al., 2014; Mutlu, 2017). The reason
for this is a hardware failure mechanism called dis-
turbance error (Kim et al., 2014; Mutlu, 2017). In
this case, electromagnetic (cell-to-cell) interference
leads to bit flips and it is already known that this in-
terference effect increases with smaller feature sizes
and higher densities of transistors (Kim et al., 2014;
Mutlu, 2017). Furthermore, emerging non-volatile
memory technologies like PCM (phase change mem-
ory) (Lee et al., 2009), STT-MRAM (Kultursay et al.,
2013), and RRAM (Wong et al., 2012) exhibit sim-
ilar and perhaps even more severe reliability issues (Khan
et al., 2014; Khan et al., 2016; Liu et al., 2013; Mutlu,
2017). For instance, heat produced by writing one
PCM cell can alter the value stored in many nearby
cells (e.g., up to 11 cells in a 64-byte block). Addi-
tionally, hardware aging effects will lead to changing
bit flip rates at run-time (Henkel et al., 2013).
Generally, all hardware components in the nano-transistor era will show increasingly unreliable be-
havior (Borkar, 2005; Henkel et al., 2013; Rehman
et al., 2016), but memory cells are more suscepti-
ble than logic gates (Henkel et al., 2013; Hwang
et al., 2012; Kim et al., 2007). To tackle these increasing reliability concerns, there are numerous hardware-oriented research activities (Borkar,
2005; Henkel et al., 2013; Khan et al., 2014; Khan
et al., 2016; Kim et al., 2007). However, these activ-
ities show that hardware-based approaches are very
effective on the one hand, but, on the other hand, the protection is very
challenging and each technique introduces large per-
formance, chip area, and power overheads (Henkel et al., 2013; Rehman et al., 2016;
Shafique et al., 2015). Furthermore, the protection
techniques have to be implemented in a pessimistic
way to cover the aging aspect, which usually leads to
over-provisioning. The whole issue is made more difficult
by Dark Silicon (Esmaeilzadeh et al., 2012): billions
of transistors can be put on a chip, but not all of them can
be used at the same time. This and the various new
disruptive hardware interference effects make the reli-
able hardware design and development very challeng-
ing, time consuming, and very expensive (Rehman
et al., 2016). The disadvantages outweigh the advan-
tages for hardware-based protection, so that the semi-
conductor as well as the hardware/software communities
have recently shifted towards mitigating
these reliability issues also at higher software layers,
rather than only in
hardware (Henkel et al., 2013; Rehman et al., 2016;
Shafique et al., 2015).
Consequently, this shift will also affect database
systems, because data as well as query processing
have to be protected in software accordingly to continue to
guarantee reliable data management on future un-
reliable hardware. Unfortunately, classical software-
based protection techniques are usually based on
data/code redundancy using dual or triple modular
redundancy (DMR/TMR). While DMR only allows
error detection, TMR can also correct errors (Pittelli
and Garcia-Molina, 1986; Pittelli and Garcia-Molina,
1989). However, the application of these techniques
with respect to in-memory database systems causes a
high overhead (Pittelli and Garcia-Molina, 1986; Pit-
telli and Garcia-Molina, 1989). For example, DMR
protection requires twice as much memory capac-
ity compared to a normal (unprotected) setting, since
data must be kept twice in different main memory lo-
cations. Furthermore, every query is redundantly ex-
ecuted with an additional voting at the end resulting
in a computational overhead slightly higher than 2x.
Thus, there is a clear need for database-specific pro-
tection approaches without sacrificing the overall per-
formance too much (Böhm et al., 2011). To tackle
that grand challenge, we present our overall vision
and summarize first promising results in this paper.
In detail, we make the following contributions:
1. We describe our big picture by introducing our as-
sumptions and based on that three requirements
for database-specific approaches (Section 2).
2. While Section 3 summarizes our newly developed
error detection approach (Kolditz et al., 2018),
Section 4 outlines our vision and research activities
for error correction.
Finally, we close the paper with related work in Sec-
tion 5 and a short conclusion in Section 6.
2 OVERALL VISION
In principle, any undetected and uncorrected bit flip
silently destroys the reliability objective of database systems
in the form of false negatives (missing tuples), false positives (tuples with invalid predicates), or inaccurate aggregates.

[Figure 1: Reliable In-Memory Column Store Architecture. Storage layer (hardened dictionaries, hardened columns, redundant columns) and processing layer (scan, join, group, sort) on top of unreliable hardware with a changing error model and all components unreliable. Error detection challenges: hardening data and data structures, and adjusting processing for continuous detection, because bit flips can happen anytime. Error correction challenges: adjusting processing to correct detected bit flips on-the-fly, and defining the necessary data redundancy. Trade-offs between reliability, memory, and query latency.]

So far, database systems
do not care about this reliability aspect because hard-
ware components usually provide appropriate protec-
tion techniques. However, this free-ride will come
to an end in the near future (Borkar, 2005; Henkel
et al., 2013; Rehman et al., 2016). To prepare state-
of-the-art in-memory database systems by developing
appropriate approaches, we assume in our work, as il-
lustrated in Figure 1, that (i) all hardware components
are unreliable, (ii) multi-bit flips will occur regularly
rather than exceptionally, and (iii) bit flips are not
handled at the hardware layer. Furthermore, the bit
flip rate changes at runtime due to various effects like
heat (Henkel et al., 2013).
Based on these assumptions, major challenges for
in-memory database systems are reliable data storage
as well as reliable query processing (Böhm et al.,
2011). To address both reliability issues, we envision tightly integrating protection techniques into in-memory database systems and using the available
database knowledge to specialize as well as to balance
protection and the associated overhead. That means,
our goal is to develop an appropriate solution satisfy-
ing the following requirements based on our assump-
tions:
(R1). Our solution has to detect as well as to correct
(i) errors (multi-bit flips) that modify data stored
in main memory, (ii) errors induced during transfers on interconnects, and (iii) errors induced
during computations in query processing (detection capability).
(R2). Our solution has to be adaptable to different er-
ror models at runtime because the number and the
rate of bit flips may vary over hardware genera-
tions or due to hardware aging effects (run-time
adaptability).
(R3). Our solution has to introduce only the overhead in terms of memory consumption and query runtime that is required to protect
against a desired error model. That means, the
overhead should be as small as possible, but still
provide a reliable behavior (balanced overhead).
In our work, we mainly focus on state-of-the-art
in-memory column stores (Abadi et al., 2013; Idreos
et al., 2012; Stonebraker et al., 2005) and our envi-
sioned reliable architecture is depicted in Figure 1. As
illustrated, we explicitly distinguish between error de-
tection and error correction on the one hand. On the
other hand, we have to consider the storage as well as
processing layer of column stores. From our point of
view, the most important part is error detection, be-
cause it is the prerequisite for error correction. If we
are not able to detect bit flips, no correction can be
triggered. Thus, we developed a novel approach for
error detection tailored for in-memory column stores
as summarized in the next section. In the following,
we want to extend this approach with error correction
capabilities.
3 ERROR DETECTION
For error detection, we developed a novel column
store-specific approach called AHEAD (Kolditz et al.,
2018), which is mainly based on error coding, but
we are not using a well-known error code like Ham-
ming (Hamming, 1950; Moon, 2005); instead, our
approach has unique properties as shown later. That
means, we encode all data and data structures in such a way
that we are able to detect bit flips in the base data as well
as during query processing. To represent the intention
of error detection, we introduce new terms for encod-
ing and decoding. We denote the encoding of data as
data hardening, since the data is firmed up so that
corruption becomes detectable. In contrast, we denote
the decoding of data as data softening, as the data becomes
vulnerable to corruption again.
Generally, in-memory column stores maintain re-
lational data using the decomposition storage model
(DSM) (Copeland and Khoshafian, 1985), where each
column of a table is separately stored as a fixed-width
dense array (Abadi et al., 2013). To reconstruct the
tuples of a table, each column record is stored in
the same (array) position across all columns of a ta-
ble (Abadi et al., 2013). Column stores typically sup-
port a fixed set of basic data types, including inte-
gers, decimal (fixed-, or floating-point) numbers, and
strings. For fixed-width data types (e.g., integer, dec-
imal and floating-point), column stores utilize basic
arrays of the respective type for the values of a col-
umn (Abadi et al., 2013; Idreos et al., 2012). For
variable-width data types (e.g., strings), some kind
of dictionary encoding is applied to transform them
into fixed-width columns (Abadi et al., 2013; Abadi
et al., 2006; Binnig et al., 2009). The simplest form
constructs a dictionary for an entire column sorted on
frequency, and represents values as integer positions
in this dictionary (Abadi et al., 2013).

[Figure 2: Example for Hardening Data Arrays. A raw data array (5, 7, 4, 3) with a data bit width of 3 at the logical level is hardened by multiplication with A = 29 into the hardened data array (145, 203, 116, 87) with a hardened data bit width of 8 at the physical level; all 1- and 2-bit flips are detectable.]
That means, in-memory column stores are based
on two main data structures as illustrated in Figure 1:
(i) dictionaries for variable-length data types and (ii)
column arrays for fixed-length data types. Thus, each
base table column is stored either by means of a single
data array or by a combination of a dictionary and a
data array containing fixed-width integer references
to the dictionary. The decision is made based on the
data type of the column. Therefore, we have to harden
both structures.
3.1 Hardened Data Arrays
For data arrays, we only have to harden values and
this is done using AN coding (Avizienis, 1971; Hoff-
mann et al., 2014) as illustrated in Figure 2. AN
coding is a representative of arithmetic error detect-
ing codes, where the hardened code words are com-
puted by multiplying a constant integer value A onto
each original data word. The multiplication modifies
the data word itself and all data is viewed as integers
as shown in Figure 2. As a result of this multipli-
cation, the domain of code words expands such that
only multiples of A become valid code words, and all
other integers are considered non-code words. The
used value of A has an impact on the detection ca-
pability as described later. For softening, a division
by A or a multiplication with the inverse of A is required. Bit
flips are detected by testing the remainder of the division, which must be zero; otherwise, the code word
was corrupted. A unique feature of arithmetic codes,
and thus AN coding, is the ability to operate directly
on hardened data by encoding the other operands,
too (Avizienis, 1971; Hoffmann et al., 2014).
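To make the hardening, softening, and detection arithmetic concrete, the following minimal C++ sketch uses the A = 29 from Figure 2; it merely illustrates the principle and is not the AHEAD implementation.

```cpp
#include <cstdint>
#include <cstdio>

constexpr uint32_t A = 29;  // hardening parameter, as in Figure 2

// Hardening: multiply the data word by A.
uint32_t harden(uint16_t value) { return static_cast<uint32_t>(value) * A; }

// Detection: only multiples of A are valid code words.
bool is_valid(uint32_t code) { return code % A == 0; }

// Softening: divide by A (only meaningful after is_valid()).
uint16_t soften(uint32_t code) { return static_cast<uint16_t>(code / A); }

int main() {
    uint32_t code = harden(5);                     // 5 * 29 = 145 (cf. Figure 2)
    uint32_t sum  = code + harden(7);              // arithmetic directly on hardened data
    std::printf("%u %d\n", code, is_valid(code));  // 145 1
    std::printf("%u %d\n", sum, is_valid(sum));    // 348 1 -> softens to 12
    code ^= 1u << 3;                               // inject a single bit flip
    std::printf("%u %d\n", code, is_valid(code));  // 153 0 -> bit flip detected
}
```

The sum of two operands hardened with the same A is again a multiple of A, which reflects the ability of arithmetic codes to operate directly on hardened data.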
3.1.1 Different Data Types
Hardening arrays of integer data requires only a multiplication with the constant A.
For decimal numbers, the case is a bit more complex:
for the sake of correctness and accuracy, database sys-
tems typically use fixed-point numbers and arithmetic
instead of native floating point numbers (float / dou-
ble) (Neumann, 2016). These fixed-point numbers
are usually represented as integers (Neumann, 2016),
which can be hardened like integers.
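As a minimal illustration (assuming a hypothetical scale of two decimal digits and the A = 29 from Figure 2), such a fixed-point value is first mapped to an integer and then hardened like any other integer:

```cpp
#include <cstdint>

constexpr uint64_t SCALE = 100;  // assumed fixed-point scale: two fractional digits
constexpr uint64_t A     = 29;   // hardening parameter, as in Figure 2

// Represent a decimal value such as 19.99 as the integer 1999 and harden it.
uint64_t harden_decimal(uint64_t integer_part, uint64_t fraction) {
    const uint64_t fixed_point = integer_part * SCALE + fraction;  // 19, 99 -> 1999
    return fixed_point * A;                                        // hardened like an integer
}
```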
Table 1: Super As for detecting a guaranteed minimum number of bit flips (min bfw). Numbers are: super A / hardening overhead / hardened code bit length. ² = derived by approximation, bold = prime, tbc = to be computed.

min bfw | Data Bit Width 8 | Data Bit Width 16 | Data Bit Width 32 (byte-aligned)
1 | 3/2/10 | 3/2/18 | 3/2/34
2 | 29/5/13 | 61/6/22 | 125/7/39
3 | 233/8/16 | 463/9/27 | 881/10/42
4 | 1,939/11/19 | 7,785/13/29 | 16,041/14/46
5 | 13,963/14/22 | 63,877/16/32 | tbc
6 | 55,831/16/24 | tbc | tbc
3.1.2 Parametrization of AN Coding
As mentioned above, AN coding has only one param-
eter A which has an impact on the error detection rate
as well as the necessary storage overhead. Now, to re-
liably detect b bit flips in each code word, a value for
A has to be used which guarantees a minimum Ham-
ming distance of b + 1, whereby A depends on the
data bit width l and on the number of detectable bit
flips (Avizienis, 1971; Hoffmann et al., 2014). More-
over, to reduce the necessary space overhead, not just any value of A is sought but a small
one (called “super A”), so that the domain of code
words stays small. We applied a brute force approach
to compute “super As” for different settings of l and
b, whereby the brute force approach consists of two
components:
Component 1: Determine the minimum Hamming distance for a given A and l (sketched below), and
Component 2: Iterate over all possible As to deter-
mine a small A with a minimum Hamming dis-
tance of b + 1, whereby component 1 is heavily
applied.
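As a sketch of component 1 (assuming C++20, unsigned data words, and a hypothetical function name), the minimum Hamming distance for a given A and data bit width l can be determined by enumerating all pairs of code words; the quadratic enumeration is only feasible for small l, which already hints at why the overall search for super As is so expensive.

```cpp
#include <algorithm>
#include <bit>
#include <cstdint>

// Component 1 (sketch): smallest Hamming distance between any two distinct
// code words A*i and A*j for data words of l bits. Detecting b bit flips
// is guaranteed if the result is at least b + 1.
int min_hamming_distance(uint64_t A, unsigned l) {
    const uint64_t n = 1ull << l;  // number of data words
    int dmin = 64;
    for (uint64_t i = 0; i < n; ++i)
        for (uint64_t j = i + 1; j < n; ++j)
            dmin = std::min(dmin, std::popcount((A * i) ^ (A * j)));
    return dmin;
}
// Example: min_hamming_distance(29, 8) is expected to be at least 3,
// matching Table 1 (A = 29 detects all 1- and 2-bit flips in 8-bit data).
```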
Table 1 lists an extract of computed “super As”.
For example, for 8-bit data and a minimum bit flip
weight of two, we have to use A=29 for the harden-
ing. As depicted, we require five additional bits for
the hardening. If we want to increase the minimum bit
flip weight to 3, we only have to use A=233 resulting
in a code word width of 16. In this case, the data over-
head increases from 62.5% (13-bit code word width)
to 100% (16-bit code word width for 8-bit data).
Based on that, we are able to use this knowledge
for a balanced data hardening with regard to a specific
hardware error model (number of bit flips) and to spe-
cific data characteristics (data bit width). Addition-
ally, column data arrays can be re-hardened at run-
time with different As. Thus, the requirements (R2)
and (R3) are adequately addressed from the storage
perspective. Nevertheless, Table 1 also highlights that
the brute force approach for the computation of A is
very expensive, because we are currently not able to
compute a value for A for all settings of l and b. Thus,
a new approach has to be developed for this compu-
tation.
3.2 Hardened Dictionaries
Dictionaries are usually realized using index struc-
tures to efficiently support encoding and decod-
ing (Binnig et al., 2009). In contrast to data arrays, not
only the data values must be hardened, but also nec-
essary pointers within the index structures. To tackle
that issue, Kolditz et al. (Kolditz et al., 2014) al-
ready proposed various techniques to harden B-Trees,
which we are currently using in our approach. As
they have shown, hardening pointer-intensive struc-
tures are more challenging as hardening data arrays.
However, slightly increasing data redundancy at the
right places by incorporating context knowledge in-
creases error detection significantly (Kolditz et al.,
2014). Moreover, for dictionaries of integer data,
AN hardening can be applied on the dictionary en-
tries. The corresponding column (data array) con-
tains fixed-width, AN hardened integer references to
the dictionary.
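To illustrate only the hardened references (and not the hardened B-Trees of Kolditz et al.), the following simplified sketch stores a string dictionary together with a column of AN-hardened positions; all names are illustrative.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Simplified dictionary-encoded string column with AN-hardened references.
struct HardenedDictColumn {
    uint64_t A;                      // hardening parameter for the references
    std::vector<std::string> dict;   // dictionary entries (hardened separately, e.g. in a B-Tree)
    std::vector<uint64_t> refs;      // AN-hardened positions into dict

    void append(const std::string& value) {
        uint64_t pos = 0;            // naive lookup; a real dictionary uses an index structure
        while (pos < dict.size() && dict[pos] != value) ++pos;
        if (pos == dict.size()) dict.push_back(value);
        refs.push_back(pos * A);     // harden the reference
    }

    const std::string& get(std::size_t row) const {
        const uint64_t code = refs[row];
        if (code % A != 0)           // continuous error detection on every access
            throw std::runtime_error("bit flip detected in dictionary reference");
        return dict[code / A];
    }
};
```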
3.3 Continuous Error Detection
To satisfy requirement (R1), we integrated bit flip de-
tection into each and every physical query operator
by checking each value. From our point of view,
this is the best solution, because bit flips caused by
any hardware components are continuously detected.
Moreover, each and every value is checked for bit
flips in the columns of base tables and intermediate
results. The integration can be seamlessly done for
both state-of-the-art processing models of column-
at-a-time (Abadi et al., 2013; Idreos et al., 2012)
and vector-at-a-time (Zukowski et al., 2012) with our
hardened storage concept. There are two reasons:
(i) the column structure is unchanged, only the data
width is increased and (ii) the values are multiplied
by A and can thus be processed as before.
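As a minimal sketch of this integration (assuming unsigned 64-bit code words, no overflow, and a simple "value < constant" selection), a scan operator hardens the predicate literal with the same A, evaluates the predicate directly on the hardened values, and checks every code word it touches.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Selection scan with continuous error detection (illustrative sketch).
std::vector<std::size_t> scan_less_than(const std::vector<uint64_t>& hardened_column,
                                        uint64_t A, uint64_t constant) {
    const uint64_t hardened_constant = constant * A;   // harden the literal once
    std::vector<std::size_t> matches;
    for (std::size_t i = 0; i < hardened_column.size(); ++i) {
        const uint64_t code = hardened_column[i];
        if (code % A != 0)                             // check each and every value
            throw std::runtime_error("bit flip detected at position " + std::to_string(i));
        if (code < hardened_constant)                  // predicate directly on hardened data
            matches.push_back(i);
    }
    return matches;
}
```

Since multiplication with a positive A preserves the order of unsigned values (as long as no overflow occurs), the comparison behaves exactly as it would on softened data.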
We also fully implemented our error detection ap-
proach and conducted an experimental evaluation us-
ing the SSB benchmark (O’Neil et al., 2009). In
our evaluation, we compared our approach with the
Unprotected baseline and dual modular redundancy (DMR).

[Figure 3: Relative SSB runtimes for vectorized (SSE4.2) execution (average over all scale factors).]

In the Unprotected baseline, data is always
compressed on a byte-level based on the column char-
acteristics. DMR uses the Unprotected setting and
replicates all data in main memory, executes each
query twice sequentially, and afterwards a voter com-
pares both results. Our approach hardens each col-
umn using the largest currently known A for the cor-
responding column data width from Table 1. Thus,
compared to the Unprotected setting, our approach in-
creases the data width of each column to the next byte
level. For all approaches, we measured all 13 SSB
queries for vectorized (Intel SSE4.2) execution and
we varied the SSB scale factor from 1 to 10. Each
single experiment ran 10 times. Figure 3 shows vec-
torized (SSE4.2) runtimes relative to the Unprotected
baseline. On the one hand, DMR results in the ex-
pected runtime overhead of about 100%, because each
query is executed twice. On the other hand, the run-
time overhead of our approach is only between 7% and
28%. This is due to the tight integration of error de-
tection in each operator combined with our AN cod-
ing approach.
3.4 Future Work
Our future work in this direction is manifold, whereby
three aspects are very important: (i) developing a new approach for the computation of the parameter value
A, (ii) investigating the interplay of data compression
and data hardening, and (iii) developing appropriate techniques to protect internal query operator data structures like hash maps. For instance, the second aspect
is crucial, because in-memory column stores heavily
employ lightweight data compression to reduce the
necessary memory space and to address the access
bottleneck between main memory and CPU (Abadi
et al., 2006; Damme et al., 2017; Zukowski et al.,
2006). While both domains are orthogonal to each
other, their interplay is very important to keep the
overall memory footprint of the data as low as possi-
ble (Kolditz et al., 2015). With data hardening, com-
pression gains even more significance, since it can re-
duce the newly introduced storage overhead. How-
ever, combining both is challenging and we briefly
outline some aspects requiring closer investigation.
Fitness of Compression Algorithms. There is a large
number of lightweight data compression algorithms
(Abadi et al., 2006; Damme et al., 2017; Zukowski
et al., 2006), which differ in how well they are suited
for the combination with AN coding.
Order of Hardening and Compression. Hardening
could be applied to compressed data, or vice versa.
The decision depends on the compression algorithm:
While dictionary coding (Abadi et al., 2006) must
be applied before hardening to obtain integers from
variable-width data, null suppression (Abadi et al.,
2006; Damme et al., 2017; Lemire and Boytsov,
2015; Zukowski et al., 2006) could be applied before
or after.
Hardened Compression Meta Data. Most
lightweight compression algorithms store some meta
data along with the compressed data to enable decom-
pression. If hardening comes before compression, the
latter must harden the meta data on its own. For in-
stance, with run length encoding (Abadi et al., 2006;
Damme et al., 2017) of hardened data, the run values
will already be hardened, while the run lengths as
meta data still need to be hardened.
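A possible shape of such a combination is sketched below; this is an assumption of ours, not a description of an existing AHEAD operator. Run-length encoding takes the already hardened run values as they are and hardens its own meta data, the run lengths, here with the same A for simplicity.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Run-length encoding over an already hardened column (illustrative sketch):
// run values are copied as hardened code words, run lengths (meta data) are
// hardened by the compressor itself.
std::vector<std::pair<uint64_t, uint64_t>>
rle_compress_hardened(const std::vector<uint64_t>& hardened_column, uint64_t A) {
    std::vector<std::pair<uint64_t, uint64_t>> runs;   // (hardened value, hardened length)
    std::size_t i = 0;
    while (i < hardened_column.size()) {
        const uint64_t value = hardened_column[i];     // already a multiple of A
        uint64_t length = 1;
        while (i + length < hardened_column.size() && hardened_column[i + length] == value)
            ++length;
        runs.emplace_back(value, length * A);          // harden the run length
        i += length;
    }
    return runs;
}
```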
Detection and Re-encoding vs. Decompression.
Detection and re-encoding happen many times per
query. Conversely, decompression can often be de-
layed until the end of the query, since many operators
can process compressed data directly (Abadi et al.,
2013; Zukowski et al., 2006). Hence, detection and
re-encoding should not require decompression.
4 ERROR CORRECTION
Up to now, we have only considered error detection. Next, we want to extend our approach with the ability
of continuous error correction. In this case, detected
bit flips should be corrected on-the-fly during query
processing. At the moment, we are already able to
detect bit flips on value granularity and can find out
where the error occurred. Based on that property, we
believe that specific correction techniques can be de-
veloped and integrated into the query processing. For
example, if we detect a faulty code word in the input
of an operator, we can re-transmit this value, possibly
several times, to correct errors induced during trans-
mission. If we get a valid code word, processing can
continue with this correct code word.

[Figure 4: Example for Linear Combination. The hardened data arrays of column X (A=29: 145, 203, 116, 87) and column Y (A=61: 122, 61, 183, 305) are summed into the redundant data of column Z = X+Y (267, 264, 299, 392); X can be reconstructed by Z-Y and Y by Z-X.]

If we get an in-
valid code word, we can assume that bits are flipped
in main memory and then we require an appropriate
technique for error correction. In any case, correcting bit
flips in memory requires data redundancy.
Here, classical RAID techniques or techniques from
the network coding domain, like linear combinations,
may be interesting to consult.
In particular, a very promising idea from our per-
spective is to use linear combinations of hardened
data column arrays, as illustrated in Figure 4, for
error correction. In this example, a single redun-
dant column Z is created as a linear combination of
two columns X and Y . As depicted, the columns
X and Y can be appropriately reconstructed for cor-
rection, whereby the reconstruction is possible on a
value granularity. Fundamentally, AN coding and lin-
ear combinations perfectly fit together and they re-
quire less space overhead than triple modular redun-
dancy. Furthermore, the redundant data in the form of
linear combinations is automatically hardened at the
same time, so that it can also be checked for bit
flips. The main research questions are:
1. Which linear combinations should be created at
all?
2. How can queries be answered on these linear com-
binations?
3. What does an efficient access to these linear com-
binations for value-based correction look like?
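To illustrate the value-granular reconstruction of Figure 4, the following sketch corrects a single corrupted value of column X from the redundant column Z = X + Y; the function name, the fixed parameters A_x = 29 and A_y = 61, and the error handling are purely illustrative, and the sketch assumes that at most one of the involved columns is faulty per value.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Read X[row]; if its code word is corrupted, reconstruct it as Z[row] - Y[row].
uint64_t read_x_corrected(const std::vector<uint64_t>& X,   // hardened with A_x
                          const std::vector<uint64_t>& Y,   // hardened with A_y
                          const std::vector<uint64_t>& Z,   // redundant column Z = X + Y
                          std::size_t row,
                          uint64_t A_x = 29, uint64_t A_y = 61) {
    if (X[row] % A_x == 0)
        return X[row];                        // valid code word, nothing to correct
    if (Y[row] % A_y != 0)                    // Y must be intact for this reconstruction
        throw std::runtime_error("Y[row] also corrupted");
    const uint64_t reconstructed = Z[row] - Y[row];
    if (reconstructed % A_x != 0)             // Z itself carries a bit flip
        throw std::runtime_error("reconstruction failed, Z[row] corrupted");
    return reconstructed;                     // e.g. Z[0] - Y[0] = 267 - 122 = 145
}
```

Which columns to combine and how to answer queries on such combinations are exactly the open questions listed above.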
5 RELATED WORK
Fundamentally, hardware components fabricated with
nano-scale transistors face serious reliability issues
like soft errors, aging, thermal hot spots, and process
variations as a consequence of the aggressive transis-
tor miniaturization (Rehman et al., 2016). Addition-
ally, memory cells are more susceptible to soft errors
than logic gates (Henkel et al., 2013; Hwang et al.,
2012; Kim et al., 2007). These issues arise from mul-
tiple sources and they jeopardize the correct applica-
tion execution (Rehman et al., 2016). The recently
published book (Rehman et al., 2016) summarizes
state-of-the-art protection techniques in all hardware
as well as software layers and presents new results of
a large research initiative.
Hardware-Level Protection. Hardware protection
to mitigate soft errors can be done on three lay-
ers (Rehman et al., 2016): (i) transistor, (ii) circuit,
and (iii) architectural. On the transistor layer, several
techniques have been proposed to harden transistors
against radiation events like alpha particles or neutron
strikes (Itoh et al., 1980; Kohara et al., 1990). For
example, thick polyimide can be used for alpha parti-
cle protection (Itoh et al., 1980; Kohara et al., 1990).
However, this technique cannot be utilized for neutron
strikes (Rehman et al., 2016). In general, techniques
at this layer have in common that the protection re-
sults in adapted fabrication processes using special-
ized materials (Itoh et al., 1980; Kohara et al., 1990;
Rehman et al., 2016). Therefore, these techniques are
very effective, but they produce (i) substantial over-
head in terms of area and cost, and (ii) immense vali-
dation and verification costs.
At the circuit layer, redundant circuits and er-
ror detection/correction circuits are prominent exam-
ples (Dell, 1997; Ernst et al., 2004; Kim et al., 2007;
Rehman et al., 2016). For instance, the RAZOR ap-
proach introduces shadow flip flops in the pipeline
to recover from errors in logic gates (Ernst et al.,
2004). Memories and caches are usually protected
using error correcting codes (ECC) or parity tech-
niques. Current ECC memories are based on Ham-
ming using a (72,64) code, meaning that 64 bits of
data are enhanced with 8 bits of parity allowing sin-
gle error correction and double error detection. To
address multi-bit flips, advanced ECC schemes have
to be used. Examples are (i) IBM’s Chipkill ap-
proach, which computes the parity bits from differ-
ent memory words and even separate DIMMs instead
of physically adjacent bits (Dell, 1997), and (ii) (Kim
et al., 2007), which shows that other ECC codes like
BCH-codes (Moon, 2005) can be realized in hard-
ware. However, this increases the number of transis-
tors in hardware and consequently impacts the energy
demand, the overhead growing quickly as the code
strength is increased (Kim et al., 2007). Addition-
ally, reading and computing the enhanced ECC bits
can be a performance bottleneck during read opera-
tions (Kim et al., 2007).
At the architectural layer, the protection is based
upon the redundant execution either in space (using
duplicated hardware units) or in time (using the same
hardware multiple times for redundant execution and
comparing the results). Dual Modular Redundancy
(DMR) and Triple Modular Redundancy (TMR) are
traditional approaches. Generally, these techniques
lead to an increased power usage which may poten-
tially increase the temperature (Rehman et al., 2016).
Increased temperatures lead to higher soft error rates
and increased aging (Rehman et al., 2016).
To summarize, hardware-based protection has
been proposed at different layers. The techniques are
usually very effective, but they also have some draw-
backs in terms of (i) high chip area overhead leading
at the same time to more power overhead and (ii) per-
formance penalties. Furthermore, the high verifica-
tion/validation costs make the reliable hardware de-
sign and development very expensive and time con-
suming (Rehman et al., 2016). To overcome these
non-negligible drawbacks, a rich set of software-
based techniques has evolved.
Software-Level Protection. Classical software-
based protection techniques are (Goloubeva et al.,
2006; Rehman et al., 2016): (i) N-version pro-
gramming, (ii) code redundancy, (iii) control flow
checking, and (iv) checkpoint recovery. For in-
stance, N-version programming (Avizienis, 1985)
is based on implementing multiple program ver-
sions of the same specification which reduces the
probability of identical errors occurring in two or
more versions. State-of-the-art redundancy-based
techniques are Error Detection using Duplicated
Instructions (EDDI) (Oh et al., 2002) and Software
Implemented Fault Tolerance (SWIFT) (Reis et al.,
2005). Both provide software reliability by dupli-
cating instructions, and inserting comparison and
checking instructions. However, these techniques
incur significant performance overheads (Oh et al.,
2002; Reis et al., 2005).
Moreover, AN coding has also been used for
software-based fault tolerance (Hoffmann et al., 2014;
Schiffel, 2011; Ulbrich et al., 2012). For instance, the
work of Schiffel (Schiffel, 2011) allows encoding ex-
isting software binaries or to add encoding at compile
time, where not all variables’ states need to be known
in advance. However, in her work she only describes
encoding integers of size |D| ∈ {1, 8, 16, 32} bits and
pointers, where the encoded values are always 64 bits
large. Furthermore, protecting processors by AN cod-
ing was also suggested in (Forin, 1989).
6 CONCLUSION
A few years ago, Böhm et al. (Böhm et al., 2011)
pointed out the lack of data management techniques
dealing with an increasing number of bit flips in main
memory as a more and more relevant source of er-
rors. Thus, in this position paper, we presented our overall vision for reliable data management on unreliable hardware, because recent studies show that future hardware is becoming less reliable. In particular,
we summarized our newly developed error detection ap-
proach, which is the first comprehensive database-
specific approach to tackle a reliable data manage-
ment on unreliable hardware.
REFERENCES
Abadi, D., Boncz, P. A., Harizopoulos, S., Idreos, S., and
Madden, S. (2013). The design and implementation
of modern column-oriented database systems. Foun-
dations and Trends in Databases, 5(3):197–280.
Abadi, D. J., Madden, S., and Ferreira, M. (2006). Integrat-
ing compression and execution in column-oriented
database systems. In SIGMOD 2006, pages 671–682.
Avizienis, A. (1971). Arithmetic error codes: Cost and
effectiveness studies for application in digital system
design. IEEE Trans. Computers, 20(11):1322–1331.
Avizienis, A. (1985). The n-version approach to fault-
tolerant software. IEEE Trans. Software Eng.,
11(12):1491–1501.
Binnig, C., Hildenbrand, S., and Färber, F. (2009).
Dictionary-based order-preserving string compression
for main memory column stores. In SIGMOD 2009,
pages 283–296.
Böhm, M., Lehner, W., and Fetzer, C. (2011). Resiliency-
aware data management. PVLDB, 4(12):1462–1465.
Borkar, S. Y. (2005). Designing reliable systems from unre-
liable components: The challenges of transistor vari-
ability and degradation. IEEE Micro, 25(6):10–16.
Copeland, G. P. and Khoshafian, S. (1985). A decompo-
sition storage model. In SIGMOD 1985, pages 268–
279.
Damme, P., Habich, D., Hildebrandt, J., and Lehner, W.
(2017). Lightweight data compression algorithms: An
experimental survey (experiments and analyses). In
EDBT 2017, pages 72–83.
Dell, T. J. (1997). A white paper on the benefits of chipkill-
correct ECC for PC server main memory. IBM Micro-
electronics Division, 11.
Ernst, D., Das, S., Lee, S., Blaauw, D., Austin, T., Mudge,
T., Kim, N. S., and Flautner, K. (2004). Razor: circuit-
level correction of timing errors for low-power opera-
tion. IEEE Micro, 24(6):10–20.
Esmaeilzadeh, H., Blem, E. R., Amant, R. S., Sankar-
alingam, K., and Burger, D. (2012). Dark silicon and
the end of multicore scaling. IEEE Micro, 32(3):122–
134.
Forin, P. (1989). Vital coded microprocessor: Principles and
application for various transit systems. IFAC-GCCT.
Goloubeva, O., Rebaudengo, M., Reorda, M. S., and Vi-
olante, M. (2006). Software-implemented hardware
fault tolerance. Springer Science & Business Media.
Hamming, R. W. (1950). Error detecting and error correct-
ing codes. Bell System technical journal, 29(2).
Henkel, J., Bauer, L., Dutt, N., Gupta, P., Nassif, S. R.,
Shafique, M., Tahoori, M. B., and Wehn, N. (2013).
Reliable on-chip systems in the nano-era: lessons
learnt and future trends. In DAC 2013, pages 99:1–
99:10.
Hoffmann, M., Ulbrich, P., Dietrich, C., Schirmeier, H.,
Lohmann, D., and Schröder-Preikschat, W. (2014). A
practitioner’s guide to software-based soft-error miti-
gation using AN-codes. In HASE 2014, pages 33–40.
Hwang, A. A., Stefanovici, I. A., and Schroeder, B. (2012).
Cosmic rays don’t strike twice: understanding the na-
ture of DRAM errors and the implications for system
design. In ASPLOS 2012, pages 111–122.
Idreos, S., Groffen, F., Nes, N., Manegold, S., Mullender,
K. S., and Kersten, M. L. (2012). Monetdb: Two
decades of research in column-oriented database ar-
chitectures. IEEE Data Eng. Bull., 35(1):40–45.
Itoh, K., Hori, R., Masuda, H., Kamigaki, Y., Kawamoto,
H., and Katto, H. (1980). A single 5v 64k dynamic
ram. In ISSCC 1980, volume 23, pages 228–229.
Khan, S., Lee, D., Kim, Y., Alameldeen, A. R., Wilkerson,
C., and Mutlu, O. (2014). The efficacy of error miti-
gation techniques for dram retention failures: A com-
parative experimental study. SIGMETRICS Perform.
Eval. Rev., 42(1):519–532.
Khan, S. M., Lee, D., and Mutlu, O. (2016). PARBOR:
an efficient system-level technique to detect data-
dependent failures in DRAM. In DSN 2016, pages
239–250.
Kim, J., Hardavellas, N., Mai, K., Falsafi, B., and Hoe,
J. (2007). Multi-bit error tolerant caches using two-
dimensional error coding. In Symposium on Microar-
chitecture 2007, pages 197–209.
Kim, Y., Daly, R., Kim, J., Fallin, C., Lee, J., Lee, D., Wilk-
erson, C., Lai, K., and Mutlu, O. (2014). Flipping bits
in memory without accessing them: An experimental
study of DRAM disturbance errors. In ISCA 2014,
pages 361–372.
Kohara, M., Mashiko, Y., Nakasaki, K., and Nunoshita, M.
(1990). Mechanism of electromigration in ceramic
packages induced by chip-coating polyimide. IEEE
Transactions on Components, Hybrids, and Manufac-
turing Technology, 13(4):873–878.
Kolditz, T., Habich, D., Damme, P., Lehner, W., Kuvaiskii,
D., Oleksenko, O., and Fetzer, C. (2015). Resiliency-
aware data compression for in-memory database sys-
tems. In DATA 2015, pages 326–331.
Kolditz, T., Habich, D., Lehner, W., Werner, M., and
de Bruijn, S. T. J. (2018). AHEAD: adaptable data
hardening for on-the-fly hardware error detection dur-
ing database query processing. In SIGMOD, pages
1619–1634.
Kolditz, T., Kissinger, T., Schlegel, B., Habich, D., and
Lehner, W. (2014). Online bit flip detection for in-
memory b-trees on unreliable hardware. In DaMoN
2014, pages 5:1–5:9.
Kultursay, E., Kandemir, M. T., Sivasubramaniam, A., and
Mutlu, O. (2013). Evaluating STT-RAM as an energy-
efficient main memory alternative. In ISPASS 2013,
pages 256–267.
Lee, B. C., Ipek, E., Mutlu, O., and Burger, D. (2009). Ar-
chitecting phase change memory as a scalable dram
alternative. In ISCA 2009, pages 2–13.
Lemire, D. and Boytsov, L. (2015). Decoding billions of in-
tegers per second through vectorization. Softw., Pract.
Exper., 45(1):1–29.
Liu, J., Jaiyen, B., Kim, Y., Wilkerson, C., and Mutlu, O.
(2013). An experimental study of data retention be-
havior in modern dram devices: Implications for re-
tention time profiling mechanisms. SIGARCH Com-
put. Archit. News, 41(3):60–71.
Moon, T. K. (2005). Error correction coding. Mathematical
Methods and Algorithms. John Wiley and Sons.
Mutlu, O. (2017). The rowhammer problem and other is-
sues we may face as memory becomes denser. In
DATE 2017, pages 1116–1121.
Neumann, T. (2016). The price of correctness.
http://databasearchitects.blogspot.de/2015/12/the-
price-of-correctness.html.
Oh, N., Shirvani, P. P., and McCluskey, E. J. (2002).
Error detection by duplicated instructions in super-
scalar processors. IEEE Transactions on Reliability,
51(1):63–75.
O’Neil, P., O’Neil, E., Chen, X., and Revilak, S. (2009).
The Star Schema Benchmark and Augmented Fact Ta-
ble Indexing, pages 237–252.
Pittelli, F. M. and Garcia-Molina, H. (1986). Database
processing with triple modular redundancy. In SRDS
1986, pages 95–103.
Pittelli, F. M. and Garcia-Molina, H. (1989). Reliable
scheduling in a TMR database system. ACM Trans.
Comput. Syst., 7(1):25–60.
Rehman, S., Shafique, M., and Henkel, J. (2016). Reliable
Software for Unreliable Hardware - A Cross Layer
Perspective. Springer.
Reis, G. A., Chang, J., Vachharajani, N., Rangan, R., and
August, D. I. (2005). SWIFT: software implemented
fault tolerance. In CGO 2005, pages 243–254.
Schiffel, U. (2011). Hardware error detection using AN-
Codes. PhD thesis, Dresden University of Technology.
Shafique, M. et al. (2015). Multi-layer software reliability
for unreliable hardware. it - Information Technology,
57(3):170–180.
Stonebraker, M., Abadi, D. J., Batkin, A., Chen, X., Cher-
niack, M., Ferreira, M., Lau, E., Lin, A., Madden,
S., O’Neil, E. J., O’Neil, P. E., Rasin, A., Tran, N.,
and Zdonik, S. B. (2005). C-store: A column-oriented
DBMS. In VLDB 2005, pages 553–564.
Ulbrich, P., Hoffmann, M., Kapitza, R., Lohmann, D.,
Schroder-Preikschat, W., and Schmid, R. (2012).
Eliminating single points of failure in software-based
redundancy. In EDCC 2012, pages 49–60.
Wong, H. P., Lee, H., Yu, S., Chen, Y., Wu, Y., Chen, P.,
Lee, B., Chen, F. T., and Tsai, M. (2012). Metal-oxide
RRAM. Proceedings of the IEEE, 100(6):1951–1970.
Zukowski, M., Héman, S., Nes, N., and Boncz, P. A. (2006).
Super-scalar RAM-CPU cache compression. In ICDE
2006, page 59.
Zukowski, M., van de Wiel, M., and Boncz, P. A. (2012).
Vectorwise: A vectorized analytical DBMS. In ICDE
2012, pages 1349–1350.