GREEN COMPUTING
A Case for Data Caching and Flash Disks?
Karsten Schmidt, Theo Härder, Joachim Klein and Steffen Reithermann
University of Kaiserslautern, Gottlieb-Daimler-Str., 67663 Kaiserslautern, Germany
Keywords: Flash memory, flash disks, solid state disk, data caching, cache management, energy efficiency.
Abstract: Green computing or energy saving when processing information is primarily considered a task of processor
development. However, this position paper advocates that a holistic approach is necessary to reduce power
consumption to a minimum. We discuss the potential of integrating NAND flash memory into DB-based ar-
chitectures and its support by adjusted DBMS algorithms governing IO processing. The goal is to drastically
improve energy efficiency while maintaining performance comparable to that of disk-based systems.
1 INTRODUCTION
Recently, green computing has gained a lot of attention
and visibility, also triggered by the public discussion
concerning global warming due to increased CO2
emissions. It was primarily addressed by enhanced
research and development efforts to reduce power
usage, heat transmission, and, in turn, cooling needs
of hardware devices, in particular, processor chips
using extensive hardware controls. Thermal
management, however, is a holistic challenge. It
includes not only simultaneous optimizations in the
materials, devices, circuits, cores, and chip areas, but
also combined efforts regarding system architecture
(e.g., integration of more energy-efficient storage
devices), system software (e.g., energy-optimal
application of algorithms), and system management
(e.g., a control center responding to workload
changes by allocating/deactivating entire servers). In
the area of database management systems (DBMSs),
to date, little research work has contributed to this
important overall goal. But, NAND flash memory
(also denoted as solid state disk) seems to have the
potential to become the future store for permanent
database data, because – compared to magnetic disks
(disk, for short) – it promises breakthroughs in
bandwidth (IOps), energy saving, reliability, and
volumetric capacity (Gray and Fitzgerald, 2007).
So far, flash memory was considered ideal for
storing permanent data in embedded devices,
because it is energy efficient, small, light-weight,
noiseless, and shock resistant. So, it is used in
personal digital assistants (PDAs), pocket PCs, or
digital cameras and provides the great advantage of
zero-energy needs, when idle or turned off. In these
cases, flash use could be optimally configured to
typical single-user workloads known in advance.
However, not all of these aspects carry over to DB servers,
and not all DBMS algorithms can be directly applied
when processing data on flash.
2 NAND FLASH-BASED DISKS
Because NAND flash memory is non-volatile,
allows for sequential and random block reads/writes,
and keeps its state even without energy supply, it
can be compared to disks and can take over the role
of disks in server environments. Therefore, we call
such storage units NAND flash-based disks or flash,
for short. To evaluate their potential when mapping
DB data to such devices, we briefly sketch – for
DBMS use – the typical read/write model of disk
and flash.
2.1 Read/Write Models
Disks are devices enabling very fast sequential block
reads and, at the same time, equally fast writes,
whereas random block read/writes are much slower
(requiring substantial “mechanical time fractions”).
The block size can be configured to the needs of the
DBMS application with page sizes typically ranging
between 4KB and 64KB. To hide the access gap be-
tween memory and disk, DBMSs use a large DB
cache in memory (RAM) where (in the simplest
case) each cache frame can keep a DB page which,
in turn, can be mapped to a disk block. In most
DBMSs, propagation of DB pages follows the
update-in-place principle applying WAL (Gray and
Reuter, 1993).
Flash storage is divided into m equal blocks typi-
cally much larger than DB pages. A flash block nor-
mally contains b (32 – 128) fixed-size pages where a
page ranges between 512B and 2KB. Because zeros
cannot be directly written to a page, one must erase
(reset) the block to all 1’s, before a page can be writ-
ten. Thus, a written page cannot be updated
anymore, but only freshly written after the entire
block is erased again. Hence, the block is the unit of
erasure automatically done by the flash device when
allocating an empty block. The page is the smallest
and the block the largest unit of read whereas the
page is the unit of write; using chained IO, the
DBMS, however, can write 1 < i ≤ b pages into a
block at a time. Note that whenever a page is written
in-place, the flash device automatically allocates a new
block and moves all pages from the old block, together
with the updated page, to the new block (keeping a cluster
property). This wear leveling (Ban, 2004) is entirely
transparent to the client, i.e., the DBMS, such that
all references to displaced pages, e.g., index pointers
and other links, remain valid.
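To make the asymmetry of this write model concrete, the following minimal sketch (in Python; the names FlashBlock and SimpleFTL, the block size, and all data structures are our own illustrative assumptions, not part of any actual flash controller) models the erase-before-write rule and the transparent block relocation described above:
```python
# Illustrative model of NAND erase-before-write and block relocation (wear leveling).
# All names and sizes are assumptions for this sketch, not vendor specifications.

PAGES_PER_BLOCK = 64                 # b pages per block (paper: 32-128)

class FlashBlock:
    def __init__(self):
        self.pages = [None] * PAGES_PER_BLOCK    # None = erased (all 1's)
        self.erase_count = 0                     # wear indicator

    def erase(self):
        self.pages = [None] * PAGES_PER_BLOCK
        self.erase_count += 1

    def program(self, slot, data):
        # A page can only be written once after an erase of its block.
        assert self.pages[slot] is None, "page must be erased before writing"
        self.pages[slot] = data

class SimpleFTL:
    """Updates a logical page by relocating the whole block to a spare block."""
    def __init__(self, num_blocks):
        self.blocks = [FlashBlock() for _ in range(num_blocks)]
        self.spare = list(range(1, num_blocks))  # pool of empty (erased) blocks
        self.map = {}                            # logical page no -> (block, slot)

    def write_new_page(self, page_no, data, block_no, slot):
        self.blocks[block_no].program(slot, data)
        self.map[page_no] = (block_no, slot)

    def update_page(self, page_no, data):
        old_no, slot = self.map[page_no]
        new_no = self.spare.pop(0)
        old, new = self.blocks[old_no], self.blocks[new_no]
        # Copy all pages of the old block, substituting the updated page.
        for s, content in enumerate(old.pages):
            new.program(s, data if s == slot else content)
        # Logical references stay valid: only the internal mapping changes.
        for lp, (b, s) in self.map.items():
            if b == old_no:
                self.map[lp] = (new_no, s)
        old.erase()                              # whole-block erase counts as wear
        self.spare.append(old_no)
```
In the sketch, update_page makes the read/write asymmetry visible: a single logical page update costs b page writes plus one block erase, which is exactly what flash-aware DBMS algorithms try to avoid.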
Another concern, called write endurance and often
cited in the literature (Kerekes, 2007), is the limited
number of erase cycles, between 100,000 (older references)
and 5,000,000 (most recent references). When a block
reaches this erase cycle limit, it can no longer be
used and has to be marked as corrupted. Hence,
management of flash relies on a pool of spare
blocks; due to the application of wear leveling
overly frequent overwriting of the same block is
avoided.
2.2 Flash Potential
To gain a deeper and more complete comparative
picture of disk and flash, we want to outline the
differences and advantages for IO, power, size, and
price of both device types and indicate where drastic
processing improvements and costs can be
anticipated. Here, we can only summarize the
evaluation of others (Gray and Fitzgerald, 2007,
Nath and Kansal, 2007) in a coarse way and give
indicative numbers or orders of magnitude of gains
or degradations. Of course, this discussion assumes
that DBMS algorithms provide adjusted mappings to
take full advantage of the flash potential. For the
performance figures, we assume what technology
currently provides for fast disks (e.g., SCSI 15k
rpm) and flash (SAMSUNG, 2008).
IO performance: We distinguish different forms
of IO processing: Sequential IO continuously reads/
writes blocks to the device whereas random IO can
be directly performed to/from any given block
address.
Sequential reads and writes on flash having a
bandwidth of ~90 MBps are comparable to those on
fast disks.
Random reads on flash are spectacularly faster
by a factor of 10–15 (2800 IOps compared to
<200 IOps).
Random writes, requiring block erasure first,
perform worst with ~27 IOps and are slower
by a factor of 4–8 compared to disks.
Hence, dramatic bandwidth gains are obtained
for random reads while random writes are
problematic and have to be algorithmically
addressed at the DBMS side.
Energy consumption: The power needed to drive
a flash read/write is 0.9 Watt and, hence, by a factor
of >15 lower than for a disk. Using the figures for
IOps, we can compute IOps/Watt as another
indicator for energy-saving potential. Hence, 3,100
flash-reads and 30 flash-writes can be achieved per
Watt, whereas a disk only reaches 13 operations per
Watt.
Unit size: Starting in 1996, NAND flash chips
doubled their densities each year and currently pro-
vide 64 Gbit. According to Hwang (2006), this
growth will continue or accelerate such that 256
GByte per chip are available in 2012. Because sever-
al of these chips can be packaged as a “disk”, the
DB community should be prepared for flash drives
with terabyte capacity in the near future. Hwang
(2006) also expects the advent of 20 TByte flash
devices in 2015. Of course, disks will also reach a
comparable capacity, but flash drives will provide
further important properties, as outlined in Section 1.
Price per unit: Today, flash is quite expensive. A
GByte of flash memory amounts to 20$, but technology
forecasts predict a dramatic decrease to only 2$/GByte
in the near future. Therefore, Gray and
Fitzgerald (2007) expect that disk and flash of
comparable capacity will have roughly the same
price (e.g., ~500$ for an SCSI and ~400$ for a flash
drive). This assumption allows us to compute IOps/$
as an additional measure of comparison. While
flash-read gets 7.0 IOps/$, flash-write gets a poor 0.07,
and an SCSI operation 0.5 IOps/$. Hence, using
flash, we achieve for the same amount of money 14
times more reads, but only 1/7 of the writes, compared to
disks.
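The following short computation merely reproduces the IOps/Watt and IOps/$ figures quoted above; the inputs are the indicative values assumed in this section (2,800 read and ~27 write IOps at 0.9 Watt and ~400$ for flash; <200 IOps at roughly 15 Watt and ~500$ for the SCSI disk), not measurements:
```python
# Back-of-the-envelope check of the IOps/Watt and IOps/$ figures from Section 2.2.
# All inputs are the paper's indicative values, not measurements.

flash = {"read_iops": 2800, "write_iops": 27, "watt": 0.9, "price_usd": 400}
disk  = {"read_iops": 200,  "write_iops": 200, "watt": 15,  "price_usd": 500}

def per_watt(dev, op):   return dev[op] / dev["watt"]
def per_dollar(dev, op): return dev[op] / dev["price_usd"]

print(per_watt(flash, "read_iops"))     # ~3100 flash reads per Watt
print(per_watt(flash, "write_iops"))    # ~30 flash writes per Watt
print(per_watt(disk, "read_iops"))      # ~13 disk operations per Watt
print(per_dollar(flash, "read_iops"))   # 7.0 IOps/$ for flash reads
print(per_dollar(flash, "write_iops"))  # ~0.07 IOps/$ for flash writes
print(per_dollar(disk, "read_iops"))    # ~0.4-0.5 IOps/$ for the SCSI disk
```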
This evaluation reveals a large optimization potential
even for a DBMS, provided its processing characteristics
can be adjusted such that the drawbacks are avoided
and, at the same time, the strengths of flash are
exploited as far as possible.
Here, we want to explore the energy-saving potential
of DBMSs, in particular, when data caching is used
together with flash as permanent storage.
3 DB CACHE MANAGEMENT
Caching plays a dominant role in all DB-based
applications and its importance steadily grows with
the number of Internet applications. In current
DBMSs, the overall goal of cache management is to
exploit locality of reference as far as possible
thereby minimizing IO to the permanent DB on disk.
When a requested page is not found in the cache, a
page fault occurs. If a free frame is available in the
cache, the requested page can be directly loaded from disk.
Otherwise, a page replacement algorithm determines
a “victim” to make room for the request. If the
victim page is marked as updated, loading has to be
deferred until this page is written (flushed) to the
disk. To reduce such wait situations, the cache
manager often gives such modified pages a
preferential treatment and flushes them asynchro-
nously (without being evicted) to make them again
“clean” in the cache.
Most of the prominent page replacement algo-
rithms, e.g., LRD (Effelsberg and Härder, 1984) or
LRU-K (O’Neil et al., 1993), proven in disk-based
DB applications, concentrate on single-page selec-
tions and are not flash-aware, because they do not
consider the asymmetric cost of read and write or
even the potential of multi-page fetches, e.g., an en-
tire block content, which can be helpful as a kind of
prefetching for specific workloads.
With flash, single-page flush optimization or victim
determination is counterproductive. As a general
principle, output should be absolutely minimized,
potentially at the expense of input. Because writes
need special handling and are, therefore, much more
expensive than reads, page replacement should bias
the eviction of read-only pages over modified pages
which should be collected, before they are flushed.
Ideally such a collection should fill a block on flash.
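As a minimal illustration of this bias (a sketch under our own assumptions; the structures, names, and the simple LRU order are not the replacement algorithms referred to below), a flash-aware cache could evict clean pages first and defer modified pages until a block-sized set can be flushed together:
```python
from collections import OrderedDict

PAGES_PER_FLASH_BLOCK = 64            # assumed number of DB pages per flash block

class FlashAwareCache:
    """LRU-like cache that prefers evicting clean pages and flushes modified
    pages only in block-sized batches (sketch; the policy is simplified)."""

    def __init__(self, capacity, flush_block):
        self.capacity = capacity
        self.flush_block = flush_block            # writes one full block (chained IO)
        self.pages = OrderedDict()                # page_no -> (data, dirty), LRU first

    def _make_room(self):
        # First choice: drop the oldest clean page (no flash write needed).
        for page_no, (_, dirty) in self.pages.items():
            if not dirty:
                del self.pages[page_no]
                return
        # Otherwise flush the oldest modified pages together as one flash block.
        victims = list(self.pages.items())[:PAGES_PER_FLASH_BLOCK]
        self.flush_block([(p, data) for p, (data, _) in victims])
        for page_no, _ in victims:
            del self.pages[page_no]

    def access(self, page_no, data, modify=False):
        if page_no not in self.pages and len(self.pages) >= self.capacity:
            self._make_room()
        old_data, old_dirty = self.pages.pop(page_no, (data, False))
        # Re-insert at the most-recently-used end, keeping the dirty flag sticky.
        self.pages[page_no] = (data if modify else old_data, old_dirty or modify)
```
The flush_block callback stands for the out-buffering discussed in Section 4; a multi-cache manager would maintain one such cache per page type and only flush complete blocks.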
If there is no restriction on page types to be col-
lected for flash mapping, a single cache is sufficient.
However, if the flash is organized in type-specific
files, this can be best achieved by type-specific
caches controlled by a multi-cache manager, which
only flushes complete blocks. To isolate update
propagation from logging and recovery concerns,
e.g., transaction-specific force-writes at commit, a
no-force strategy is mandatory for modified pages
(Gray and Reuter, 1993). A no-steal policy for dirty
pages, i.e., pages modified by still running
transactions, is not required, as long as the WAL
principle is observed. Note, there is an important
dependency between data cache flushes and log
buffer writes (see Section 4.2).
Preliminary results show that flash-aware re-
placement using a single cache cannot support most
workloads well. For scenarios sketched in Section 4,
multi-cache management may outperform a single
common LRU- or LRD-based cache by a substantial
margin (e.g., more than a factor of 2).
4 MAPPING SCENARIOS
Flash-aware cache algorithms must address the IO
asymmetry when performing writes and reads. First,
let’s look at the physical mapping of flash pages.
Figure 1: Logical-to-physical data mapping.
Figure 1 illustrates that the flash can be divided into
several files whose blocks may be either randomly
written (e.g., for user and index data) or filled in a
circular fashion (e.g., for log data). Consecutive data
pages are often mapped to the same block whereas
index pages are typically dispersed across several
blocks. As an additional degree of freedom, the DB
page size can be chosen as a multiple of the flash
page size. Thus, a DB page is mapped to one or
more consecutive flash pages, which, in turn, sets
two major goals for DB page writing:
1. Avoid expensive erasure by always using empty
blocks for page writing.
2. Flush collections of modified pages such that
new blocks are filled without fragmentation.
Naively achieving the first goal, i.e., writing single or few
pages (chained IO) always into a new block, causes
tremendous fragmentation in these blocks. At the end,
most of the blocks are poorly filled, read
performance degrades, and flash capacity is wasted.
Therefore, goal 2 prefers to buffer DB pages for
output (out-buffer, for short) to fill flash blocks
completely. Indeed, this avoids fragmentation as
long as enough empty blocks are available to
propagate new or modified data. But pages being
relocated into new blocks leave gaps in their old
blocks causing fragmentation and later garbage
collection. Hence, replacement must balance both
extremes to minimize overall writing.
Due to different DB page types for indexes, data
(often chained), log, and temporary data (used for
external sorts and intermediate results), the
propagation strategy decides whether it is better to
assign (rather) static and (highly) dynamic pages to
the same block or to separate them. From an
application point of view, the context of related user
data should not be destroyed due to page update
propagation to preserve locality of data reference.
On the other hand, maintaining the original block
assignment of data may be overly expensive due to
read/write asymmetry. To achieve reasonable page
mapping with minimum fragmentation and maximal
locality support, different out-buffering strategies
(block-building) are possible.
4.1 Propagation Strategies
Figure 2 sketches strategies for out-buffering modified
DB pages before flushing them to blocks.
Figure 2: Out-buffering of pages.
In-place: Substituting each page in-place avoids
fragmentation entirely, but requires each time that the
entire flash block is erased and rewritten with all
existing data.
Relocate: Shifting a page to a new physical position
leaves a gap (outdated page) in the present block, but
the page can be out-buffered together with other pages
(even unmodified pages) to be written into another block.
Allocate new block: When an empty block is filled with
new or modified pages, the block can be written by
chained IO, which is much more efficient than separate
page flushes.
Table 1: Comparison of propagation strategies (each cell lists
performance | power consumption | fragmentation; - poor, o average, + good).
Strategy    single-page write   multi-page write (same type)   multi-page write (mixed types)
in-place    - | - | +           - | - | +                      - | - | +
relocate    o | o | +           o | o | +                      o | o | +
new block   + | + | -           + | + | o                      o | + | o
To summarize these ideas, Table 1 shows the
anticipated performance, power consumption, and
degree of fragmentation for single-page and out-buffered
page flushes. The combination of different
page types to build blocks is disadvantageous for
sequential data being spread over more blocks than
necessary when being read later. Table 1 also reveals
that the straightforward and easy to use in-place
strategy does not perform well in most of the
considered aspects. Depending on DBMS
requirements, different propagation strategies favor
different design goals (e.g., power, time, cost,
space); cost-based decisions may choose the best
strategy even on demand.
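Read as data, Table 1 also allows such a cost-based choice to be sketched; the ratings below encode the table's qualitative entries (+ = 2, o = 1, - = 0), and the weights are free parameters we introduce to express which design goal dominates:
```python
# Cost-based choice among the propagation strategies of Table 1 (sketch).
# Each triple rates (performance, power consumption, fragmentation).

RATINGS = {
    #              single page    multi page     multi page
    #                             (same type)    (mixed types)
    "in-place":   [(0, 0, 2),     (0, 0, 2),     (0, 0, 2)],
    "relocate":   [(1, 1, 2),     (1, 1, 2),     (1, 1, 2)],
    "new block":  [(2, 2, 0),     (2, 2, 1),     (1, 2, 1)],
}
WRITE_PATTERNS = {"single": 0, "multi-same": 1, "multi-mixed": 2}

def choose_strategy(pattern, weights=(1.0, 1.0, 1.0)):
    """Pick the strategy with the best weighted score for a write pattern;
    weights order: performance, power consumption, fragmentation."""
    col = WRITE_PATTERNS[pattern]
    def score(name):
        return sum(w * r for w, r in zip(weights, RATINGS[name][col]))
    return max(RATINGS, key=score)

# Example: multi-page writes of one page type, favouring power saving:
print(choose_strategy("multi-same", weights=(1.0, 2.0, 0.5)))  # -> "new block"
```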
4.2 Mapping Data to Flash
So far, we have characterized the general properties
and principles of mapping cached data to flash. For
that purpose, Gal and Toledo (2005) already present-
ed lower-level algorithms.
To become more DB-spe-
cific in our position paper, we propose in the follow-
ing how the mapping can be applied to important
DBMS processing situations and workloads:
Log data: The properties of collecting and writ-
ing log data lend themselves perfectly to flash sup-
port. The log file is organized in a circular way and
new log data is always appended to the current end
of the log file, which makes the log tail a high-traffic
data element. Because flash blocks are rather
large, they normally are not filled by the log data of
a single transaction. However, repeated log writes of
consecutively committing transactions to the same
block would cause a performance-critical bottleneck.
Therefore, a pool of block-size log buffers should be
allocated, which are alternately used to enable filling
of log data while a full block (using chained IO) is
asynchronously written to flash. For this purpose,
the concept of group commit was developed for
which the log data are written in commit sequence
(at least the end-of-transaction entries) to the log.
The successful commit is acknowledged to the user
together with the release of all locks kept by the
committed transactions when the log data has
reached stable storage (e.g., flash). If the delayed
lock release is a performance problem, group
commit can be combined with pre-commit (see Gray
and Reuter, 1993) which immediately releases locks
at the end of the individual transaction Ti although
the log data is still in a volatile state. The only
situation in which such a transaction may still fail is a
crash. But then all pre-committed (and other
running) transactions potentially dependent on the
updates of this transaction (because of early lock
release) also fail because their log data appear after
the log data of Ti.
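A minimal sketch of such a group-commit log manager (names are our own assumptions; the pool of alternately used block-size buffers and the asynchronous chained-IO block write are reduced here to a single active buffer and a synchronous callback):
```python
import threading

LOG_BLOCK_SIZE = 128 * 1024       # assumed flash block size reserved for the log

class GroupCommitLog:
    """Collects log records in a block-size buffer and acknowledges commits
    only after their records have reached flash (simplified sketch)."""

    def __init__(self, flash_append):
        self.flash_append = flash_append       # writes one full block (chained IO)
        self.active = bytearray()              # log buffer currently being filled
        self.lock = threading.Lock()
        self.flushed = threading.Condition(self.lock)
        self.next_lsn = 0                      # bytes appended so far
        self.flushed_lsn = 0                   # bytes durably on flash

    def append(self, record: bytes) -> int:
        with self.lock:
            self.active += record
            self.next_lsn += len(record)
            if len(self.active) >= LOG_BLOCK_SIZE:
                self._flush_locked()
            return self.next_lsn               # LSN a committing transaction waits for

    def _flush_locked(self):
        block, self.active = self.active, bytearray()
        self.flash_append(bytes(block))        # asynchronous in a real system
        self.flushed_lsn = self.next_lsn
        self.flushed.notify_all()              # wake all commits grouped in this block

    def wait_for_commit(self, lsn, group_window=0.005):
        with self.lock:
            if self.flushed_lsn >= lsn:
                return
            # Group commit: wait briefly so other transactions' commit records
            # can join the same block write before forcing it out.
            self.flushed.wait(group_window)
            if self.flushed_lsn < lsn:
                self._flush_locked()
```
Locks of a committed transaction would be released once wait_for_commit returns; with pre-commit, they could already be released after append.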
Static user data: DB pages containing user data
are primarily read and rarely updated, hence, almost
static. In many cases, the physical allocation of DB
pages on external media is of particular importance,
e.g., if a cluster property has to be preserved. For
that reason, data pages should be updated in-place as
in disk-based systems, i.e., n single-page updates re-
quire n block writes. Note, wear leveling does not
affect the cluster property at the block level, because
random page reads are not sensitive to physical
neighbourhood of blocks. Hence, management of
DB pages within a block provides new opportunities
to reallocate data on flash without losing
performance.
Append-on data: User data is often appended to
the end of a file by using a sequential numbering
scheme for primary keys, or to ordered XML trees,
which are right-growing when new order-dependent
data is attached. In this case, the use of a pool of
block-size buffers, similar to that for log data, can
dramatically improve the performance.
Dynamic user data: In such a scenario, page up-
dates are assumed to be so frequent that block
update in-place is too expensive because of a
substantial number of single-page writes and, in
turn, energy consumption. Therefore, we sacrifice
the cluster property (automatically preserved for in-
place block updates by the flash) and propose a
DBMS-controlled relocation, as illustrated in Figure 3.
Figure 3: DBMS-controlled mapping to flash.
Modified pages are out-buffered until an entire
block can be written. Administration information in
the form of block info and page mapping is
maintained by the DBMS and remains in memory.
To enable recovery from a crash, the DBMS saves
sufficient redundancy in a log file. Hence, if a flash
block can store d DB pages, DBMS-controlled
relocation reduces the write overhead of n pages to
n/d block writes.
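A sketch of this DBMS-controlled relocation (structures and names loosely follow Figure 3 but are otherwise our own assumptions): modified pages are collected in an out-buffer, one chained-IO block write propagates up to d pages, and the in-memory page mapping and block info record the new locations and the gaps left behind:
```python
PAGES_PER_BLOCK = 64                  # d DB pages per flash block (assumed)

class RelocatingPropagator:
    """Collects modified DB pages and writes them as complete flash blocks,
    keeping the logical-to-physical page mapping in memory (sketch)."""

    def __init__(self, write_block, log_write):
        self.write_block = write_block    # chained IO: d pages -> one new flash block
        self.log_write = log_write        # redundancy needed for crash recovery
        self.page_map = {}                # DB page no -> (block no, slot)
        self.block_info = {}              # block no -> set of still-valid slots
        self.out_buffer = []              # (page_no, data) awaiting propagation

    def propagate(self, page_no, data):
        self.out_buffer.append((page_no, data))
        if len(self.out_buffer) == PAGES_PER_BLOCK:
            self._flush_block()

    def _flush_block(self):
        batch, self.out_buffer = self.out_buffer, []
        block_no = self.write_block([data for _, data in batch])   # one block write
        self.block_info[block_no] = set(range(len(batch)))
        for slot, (page_no, _) in enumerate(batch):
            old = self.page_map.get(page_no)
            if old is not None and old[0] in self.block_info:
                self.block_info[old[0]].discard(old[1])   # old copy becomes a gap
            self.page_map[page_no] = (block_no, slot)
        # Enough redundancy to rebuild page_map/block_info after a crash.
        self.log_write({"block": block_no, "pages": [p for p, _ in batch]})
```
With a block holding d pages, n page updates then cost roughly n/d block writes plus the mapping log, at the price of giving up the cluster property.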
Index data: Index maintenance provides similar
update frequencies as dynamic user data and, there-
fore, we propose a similar DBMS-controlled
mapping (Figure 3). A first prototype of flash-aware
index algorithms was developed by Nath and Kansal
(2007), where memory had to cope with severe size
restrictions. Although they omitted scalability
considerations, they emphasized the need for
flexible index maintenance to reduce expensive
write operations. Due to the frequent but small
changes within an index node, it is better to collect
corresponding log information for that node update
instead of flushing every single modification
separately. Moreover, index traversals accessing
single dispersed pages profit from the random-read
speed of the flash. Other B*-tree operations, such as
merge and split, may be deferred or covered by the
same logging mechanism.
Hot spots: A critical task is the management of
data which is updated with a high repetition rate
(hot-spot data). The use of NAND flash implies the
need for an adjusted approach. Due to their high
locality of reference, the cache manager would never
evict such pages for replacement in case of a no-force
policy. The normal logging component is used to
collect the needed redundancy to guarantee the
ACID properties in case of a crash. To enable reuse
of the log space, hot-spot data must eventually reach
the stable storage. Therefore, a checkpointing
mechanism should be applied to periodically flush
hot-spot pages using the normal out-buffer method.
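A minimal sketch of such a checkpoint (names and the log interface are hypothetical, and the cache and propagator are the sketches from above): it simply pushes all currently modified hot-spot pages through the normal out-buffer path so that older log space can be reused:
```python
import time

CHECKPOINT_INTERVAL = 30.0        # seconds between checkpoints (assumed)

def checkpoint_hot_pages(cache, propagator, log):
    """Periodically flushes modified pages that normal replacement never evicts,
    so that log space written before the checkpoint can be reused (sketch)."""
    while True:
        time.sleep(CHECKPOINT_INTERVAL)
        begin_lsn = log.current_lsn()                 # hypothetical log interface
        for page_no, (data, dirty) in list(cache.pages.items()):
            if dirty:
                propagator.propagate(page_no, data)   # normal out-buffer path
                cache.pages[page_no] = (data, False)  # page is clean again
        if propagator.out_buffer:                     # force out a partial block
            propagator._flush_block()
        log.truncate_before(begin_lsn)                # hypothetical log interface
```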
4.3 Further Optimization Options
The forms of mapping and their effectiveness/energy-
saving potential discussed so far could be further en-
hanced by a number of optional methods.
Workload analysis: To optimize the read/write
ratio, the cache manager may predict access patterns
or pro-actively tolerate performance degradations to
gain additional benefit for later read operations. For
example, flushing sequentially referenced pages im-
mediately may increase current write costs. But, because
their re-reference probability is very low, the
cache space freed may amortize the initial
writing costs. Besides reducing gaps in existing
blocks, a suitable propagation strategy compares
write and read operations within a workload for a
specific page type or each page individually. Hence,
the cache manager has to distinguish between page
types and their future usage.
Database design: Defining a flash-aware DB
schema should reduce index usage or redundant data
structures, because updates would affect too many
pages. A higher amount of data to read, even for se-
lective accesses or joins, may compensate for the
otherwise unacceptable update costs of such redundant
data.
Replacement urgency: Regarding power con-
sumption and latency of a write compared to a read
operation, the cache may prefer to eagerly drop non-
modified pages. Indeed, modified pages may benefit
from a deferred flush, as long as WAL is used and
data consistency is assured. Thus, a page may be
replaced containing modifications from several
transactions resulting in one instead of multiple
block writes (no-force policy).
5 SUMMARY AND OUTLOOK
We outlined the specific properties of NAND flash
when used as persistent devices in DB servers. Al-
though magnetic disks with salient and proven
performance and reliability properties and, above all,
acceptable capacities (512 GByte and larger) will be
available on the market, flash disks may provide
even superior properties in the near future. The
dramatic reduction of energy consumption and the
potential to read random data nearly as fast as
sequential data are outstanding advantages which
make NAND flash memory an almost perfect
hard disk alternative. Because of the problematic
erase operation of flash blocks, DBMS algorithms
and, above all, the mapping between cache memory
and flash disk have to be adjusted to reach the best
performance possible. We illustrated the op-
portunities and drawbacks of various mapping sce-
narios and developed ideas for optimization.
Primarily, the collection of data pages in block-size
memory buffers may greatly improve the
performance.
Currently, we are working on the integration of
flash memory into a DBMS and its adaptation
providing a flash-aware cache together with
enhanced mapping algorithms. Other hot topics for
the improvement of energy efficiency in DBMSs are
the use of specialized logging techniques and group
commit, as well as cost-based query optimization
that regards energy consumption as a prime cost
factor.
REFERENCES
Ban, A., 2004. Wear leveling of static areas in flash
memory. US patent, (6732221); Assigned to M-
Systems
Effelsberg, W., and Härder, T., 1984. Principles of
Database Buffer Management. In ACM Transactions
on Database Systems 9(4): 560-595
Gal, E., and Toledo, S., 2005. Algorithms and data
structures for flash memories. In Computing Surveys
37(2): 138-163
Gray, J., and Fitzgerald, B., 2007. FLASH Disk
Opportunity for Server-Applications. http://research.microsoft.com/~Gray/papers/FlashDiskPublic.doc
Gray, J., and Reuter, A., 1993. Transaction Processing:
Concepts and Techniques. Morgan Kaufmann
Härder, T., and Reuter, A., 1983. Principles of
Transaction-Oriented Database Recovery. In
Computing Surveys 15(4): 287-317
Hwang, C., 2006. Chip memory to keep doubling annually
for 10 years: Hwang. http://www.korea.net/
News/news/LangView.asp?serial_no=20060526020&
lang_no=5&part=106&SearchDay=
Kerekes, Z., 2007. SSD Myths and Legends – "write
endurance". STORAGEsearch.com. http://www.storagesearch.com/ssdmyths-endurance.html
Nath, S., and Kansal, A., 2007. FlashDB: Dynamic Self-
tuning Database for NAND Flash. ftp.research.microsoft.com/pub/tr/TR-2006-168.pdf
O’Neil, E. J., O’Neil, P. E., and Weikum, G., 1993. The
LRU-K Page Replacement Algorithm for Database Disk
Buffering. In Proceedings of ACM SIGMOD, pages 297-306
SAMSUNG, 2008. Samsung Develops MLC-based 128 Gigabyte,
SATA II Solid State Drive. http://www.samsung.com/global/business/semiconductor/newsView.do?news_id=893