Towards New Interface for Non-volatile Memory Storage
Shuichi Oikawa
Faculty of Engineering, Information and Systems, University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki, Japan
Keywords:
Operating Systems, File Systems, Storage, Non-volatile Memory.
Abstract:
Non-volatile memory (NVM) storage is rapidly dominating the high end markets of enterprise storage, which
requires high performance, and mobile devices, which require lower power consumption. As NVM storage
becomes more popular, its form evolves from that of HDDs into those that fit the market requirements more
appropriately. Such evolution also stimulates performance improvement because it leads to changes in
the interface that connects NVM storage with systems. There is a claim that the further improvement of NVM
storage performance makes it better to poll a storage device to sense the completion of access requests rather
than to use interrupts. Polling based NVM storage can expand to become main memory since no complex
mechanism is required to handle interrupts and access requests are processed one by one.
This paper predicts that NVM storage will take the form of main memory, and proposes constructing a file
system directly on it in order to overcome its drawbacks when used simply as main memory. The performance
projection of the proposed architecture is that accessing files on such a file system can reduce the overhead
introduced by handling block devices.
1 INTRODUCTION
Non-volatile memory (NVM) storage is rapidly dom-
inating the high end markets of enterprise storage,
which requires high performance, and mobile de-
vices, which require lower power consumption. Flash
memory (which in this paper stands for NAND flash
memory) is currently the most popular NVM, and its
storage is called SSDs (Solid State Drives). As NVM
storage becomes more popular, its form evolves from
that of HDDs (Hard Disk Drives) into those that fit the
market requirements more appropriately. Such evolution
also stimulates performance improvement
because it leads to changes in the interface that
connects NVM storage with systems. The examples
of the currently employed interfaces are PCI Express
and NVM Express.
The investigation results shown by (Caulfield
et al., 2010) and (Yang et al., 2012) raised an interesting
claim: the further improvement of NVM
storage performance makes it better to poll a storage
device to sense completion of access requests rather
than to use interrupts. They investigated high perfor-
mance NVM storage architecture and found that the
existing block device interface of the operating sys-
tem (OS) kernel does not always perform well with it.
The reason for this claim is that, when the time
required for process context switching and interrupt
processing is excluded, there is no time left for a yielded
process to execute if the processing times of access
requests become short; thus, it is simply faster for the
system to poll the device while waiting for the completion
of access requests.
Such a polling based storage device can simplify the
mechanisms that constitute it, since no complex
mechanism is required to support interrupts
and access requests are processed synchronously.
Synchronous processing of access requests corre-
sponds to the functionality of main memory; thus,
NVM storage can expand to become main memory
in this aspect. Flash memory is, however, not byte
addressable; thus, a DRAM buffer is required in order
to connect flash memory to the memory bus of
CPUs, and address translation is performed to export
the whole address space of flash memory. This paper
calls this memory architecture NVM storage based
main memory. eNVy (Wu and Zwaenepoel, 1994) is
an example of such a memory architecture, although it
assumes that flash memory is byte addressable. This
approach, however, has a significant drawback: there
is no way to predict the addresses of future accesses; thus,
access locality is the only means to mitigate the long
access latency of NVM.
This paper predicts that NVM storage will take
the form of main memory and connect to the memory
bus; thus, the storage devices become byte addressable,
and CPUs can simply access the data on them by
using memory access instructions. The drawback of
this architecture is that access latencies vary depend-
ing upon the locations of the accessed data. Based on
the prediction, the paper proposes constructing a file
system directly on NVM storage in order to overcome
its drawbacks when used simply as main memory.
The proposed architecture utilizes NVM storage
based main memory as a base device of a file system.
The file system directly interacts with the device with-
out the interposition of the block device driver frame-
work and the page cache mechanism. This architec-
ture can be a solution to avoid the drawback posed by
NVM storage based main memory, and also has sev-
eral advantages over the existing file system architec-
ture based on block devices. The advantages include
the simplification of the OS kernel architecture and
faster processing of data access requests because of
the simplified execution paths.
We show the performance advantage of the pro-
posed architecture by performing two experiments.
The experiments show the performance impacts im-
posed by the existing block device driver framework.
The results show the significant overhead of the block
device driver framework, indicating that the proposed
architecture leaves substantial room for high performance
access to data on NVM storage.
The rest of this paper is organized as follows. Sec-
tion 2 describes the background of the work. Section
3 describes the details of the proposed architecture.
Section 4 describes the results of the experiments. Sec-
tion 5 describes the related work. Section 6 summa-
rizes the paper.
2 BACKGROUND
This section describes the background of the work.
It first describes the OS kernel’s interaction with
block devices. It then describes the implication of
NVM storage performance improvement. It finally
describes the overall architecture of NVM storage
based main memory.
2.1 Interacting with Block Devices
Block devices, such as SSDs and HDDs, are not byte
addressable; thus, CPUs cannot directly access the
data on these devices. A certain size of data, which
is typically a multiple of 512 bytes, needs to be trans-
ferred between memory and a block device for CPUs
to access the data on the device. Such a unit to trans-
fer data is called a block.
The OS kernel employs a file system to store data
in a block device. A file system is constructed on a
block device, and files are stored in it.

Figure 1: The existing architecture to interact with block
devices. The CPU accesses the page cache on DRAM main
memory through the memory bus, and the HDD/SSD block
device is attached through the I/O bus.

Figure 2: The asynchronous access command processing
and process context switches. The kernel processes a system
call from Process 1, issues a command to the block device,
and switches to Process 2; the device processes the command
and notifies its completion by an interrupt, after which the
kernel switches back to Process 1. T_proc2 denotes the time
left for Process 2.

In order to read
the data in a file, the data first needs to be read from a
block device to memory. If the data in memory is
modified, it is written back to the block device. A memory
region used to store the data of a block device is
called a page cache. CPUs therefore access the page
cache in place of the block device. Figure 1 depicts
vice as the existing architecture to interact with block
devices.
Since HDDs are orders of magnitude slower than
memory to access data on them, various techniques
were devised to amortize the slow access time. The
asynchronous access command processing is one of
the commonly used techniques. Its basic idea is that a
CPU executes another process while a device pro-
cesses a command. In Figure 2, Process 1 issues a
system call to access data on a block device. The
kernel processes the system call and issues an access
command to the corresponding device. The kernel
then looks for the next process to execute and performs
context switching to Process 2. Meanwhile, the
device processes the command, and sends an interrupt
to notify its completion. The kernel handles the inter-
rupt, processes command completion, and performs
context switching back to Process 1. T_proc2 is the time
left for Process 2 to run. Because HDDs are slow and
thus their command processing time is long, T_proc2 is
long enough for Process 2 to proceed with its execution.
2.2 Implication of NVM Storage
Performance Improvement
The performance of flash memory storage has been
improved by exploiting parallel access to multiple
chips (Josephson et al., 2010), and newer NVM
TowardsNewInterfaceforNon-volatileMemoryStorage
175
technologies inherently achieve higher performance.
Such higher performance invalidates the premise on
which the asynchronous access command processing
was devised, namely the slow access time of block devices,
and can affect how the OS kernel manages the interaction
with NVM storage.
One such possibility is the claim made by
(Caulfield et al., 2010) and (Yang et al., 2012). The
claim compares the asynchronous access command
processing with the synchronous processing, and
shows that the synchronous processing is faster. The
claim was supported by the experiments performed by
employing a DRAM-based prototype block storage
device, which was connected to a system through the
PCIe Gen2 I/O bus. For a 4KB transfer experiment
performed by (Yang et al., 2012), the asynchronous
processing takes 9.0 µs for Process 1 to receive data
from the block device, and T_proc2 is 2.7 µs. In contrast,
the synchronous processing takes only 4.38 µs. The
difference is 4.62 µs, which is longer than T_proc2 of
the asynchronous processing; thus, the synchronous
processing saves the overall processing time.
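Putting these numbers together, even if all of T_proc2 is
counted as useful work for Process 2, synchronous processing
still comes out ahead for each 4 KB request:

    (9.0 µs - 4.38 µs) - T_proc2 = 4.62 µs - 2.7 µs = 1.92 µs saved per request.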
As the I/O bus becomes faster, the command pro-
cessing time of a device becomes shorter; thus, T_proc2
also becomes shorter because the context switching
time remains the same. Therefore, there will be no
useful time left for another process while waiting for
command completion.
2.3 NVM Storage based Main Memory
High performance NVM storage makes asynchronous
processing of access requests meaningless as de-
scribed above. Moreover, most current NVM
storage devices are equipped with a certain amount of
DRAM as a buffer to accommodate transient data.
Though the sizes of the DRAM buffer vary in accordance with
their target markets and prices, devices with 1 GB of DRAM
buffer can be found among recent products.
These facts make it natural for NVM storage to
expand into main memory. By connecting it directly
to CPUs through a memory bus, no complex
mechanism is required to enable the asynchronous
access command processing, and access requests are
simply processed synchronously. In order to make
NVM storage work as main memory, the whole address
space made available by NVM storage needs to
be addressable by CPUs, even though NVM storage itself
cannot be directly connected to CPUs. Instead, a DRAM buffer
is connected to CPUs, and an address translation
mechanism maps the DRAM buffer onto the address
space of NVM storage. This paper
calls this memory architecture NVM storage based
main memory, and Figure 3 depicts it.

Figure 3: NVM storage based main memory architecture.
NVM storage is connected to CPUs through a DRAM buffer
and an address translation mechanism on the memory bus.

Making NVM
storage work as main memory is not a new idea as it
was researched when flash memory appeared. eNVy
(Wu and Zwaenepoel, 1994) is such an example.
Newer NVM technologies, such as PCM and
ReRAM, are byte addressable and provide higher per-
formance than flash memory. They, however, have
limited write endurance similar to flash memory; thus,
they cannot simply replace DRAM, and a DRAM buffer
is required to place frequently accessed data. Therefore,
NVM storage based main memory is also legitimate
for such newer NVM technologies.
This approach, however, has the significant drawback
that access latencies vary depending upon the locations
of the accessed data. If the data is in the DRAM
buffer, the access latencies are comparable to those of DRAM.
If the data is on NVM, the access latencies can be
significantly longer. The problem is that there is no
generally applicable way to predict the addresses of future
accesses. If such addresses could be predicted,
the data at those addresses could be read ahead into the
DRAM buffer in order to mitigate the long access latency
of NVM. Main memory is, however, accessed by physical
addresses. Because physical addresses do not carry
any information about the higher-level abstractions that could
serve as a clue for predicting future accesses, no useful
information is available. Therefore, access locality is the
only means to keep frequently accessed data
in the DRAM buffer.
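To make this drawback concrete, the following self-contained sketch models NVM storage based main memory as a small direct-mapped DRAM buffer in front of a larger, slower NVM address space. All names, sizes, and the direct-mapped policy are illustrative assumptions, not the design of any real device; the point is that a physical address alone gives the translation layer nothing to predict with, so only the reuse of recently accessed lines keeps accesses at DRAM latency.

    /*
     * Illustrative model of NVM storage based main memory: a small DRAM
     * buffer caches fixed-size lines of a larger NVM address space, and
     * address translation decides whether an access hits the DRAM buffer
     * or must first fill the line from NVM (the long-latency case).
     * All names, sizes, and the direct-mapped policy are assumptions.
     */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define LINE_SIZE  4096u                    /* translation granularity */
    #define DRAM_LINES 256u                     /* 1 MB DRAM buffer        */
    #define NVM_LINES  4096u                    /* 16 MB emulated NVM      */

    static uint8_t nvm[NVM_LINES][LINE_SIZE];   /* stands in for flash/PCM */
    static uint8_t dram[DRAM_LINES][LINE_SIZE]; /* the DRAM buffer         */
    static struct { uint32_t line; bool valid; } tag[DRAM_LINES];
    static unsigned long hits, misses;

    /* Translate a physical address to a location in the DRAM buffer,
     * filling the line from NVM on a miss and writing back the old line. */
    static uint8_t *translate(uint64_t paddr)
    {
        uint32_t line   = (uint32_t)(paddr / LINE_SIZE);
        uint32_t slot   = line % DRAM_LINES;          /* direct-mapped     */
        uint32_t offset = (uint32_t)(paddr % LINE_SIZE);

        if (!tag[slot].valid || tag[slot].line != line) {
            if (tag[slot].valid)                      /* write back old line */
                memcpy(nvm[tag[slot].line], dram[slot], LINE_SIZE);
            memcpy(dram[slot], nvm[line], LINE_SIZE); /* slow NVM access     */
            tag[slot].line  = line;
            tag[slot].valid = true;
            misses++;
        } else {
            hits++;                                   /* DRAM-latency access */
        }
        return &dram[slot][offset];
    }

    int main(void)
    {
        for (uint64_t a = 0; a < 64 * (uint64_t)LINE_SIZE; a += 8)
            *translate(a) = 0xAB;     /* sequential accesses mostly hit */
        printf("hits=%lu misses=%lu\n", hits, misses);
        return 0;
    }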
3 PROPOSED ARCHITECTURE
This section describes the proposed architecture that
constructs a file system on NVM storage based main
memory.
The proposed architecture utilizes NVM storage
based main memory as a base device of a file sys-
tem. NVM storage persistently stores the data of a file
system, and CPUs access the data through a memory
bus. The file system directly interacts with the de-
vice through the memory interface; thus, the block
device driver for NVM storage is not required. Files
PECCS2014-InternationalConferenceonPervasiveandEmbeddedComputingandCommunicationSystems
176
in the file system can be directly accessed and also
mapped into the virtual address spaces of processes
since files reside in memory; thus, there is no need to
copy the data of files to a page cache. Therefore, the
page cache mechanism is not required, either.
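From a user process's point of view, such direct access looks like ordinary memory-mapped file I/O; what changes is inside the kernel, where no page cache copy stands between the mapping and the file data. The following is a minimal POSIX sketch, in which the file name is a placeholder:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("data.bin", O_RDWR);          /* placeholder file name */
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

        /* Map the whole file; subsequent loads and stores go straight to
         * the file data (on the proposed architecture, to NVM storage
         * based main memory, with no intermediate page cache copy). */
        char *p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        p[0] = 'x';                                 /* store into the file  */
        printf("first byte: %c\n", p[0]);           /* load from the file   */

        munmap(p, st.st_size);
        close(fd);
        return 0;
    }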
This architecture can be a solution to avoid the
drawback posed by NVM storage based main mem-
ory. File systems are designed to allocate the blocks
referenced by a single file as contiguously as possible,
since contiguous blocks can be accessed faster
on HDDs. Such contiguous blocks make it easy to
read ahead blocks into the DRAM buffer. Moreover,
recent advances in file systems and storage architectures
bring the concept of object based storage devices
(OSDs), and higher abstractions are introduced and
made available in storage. Such an architecture also
makes it possible to take advantage of the knowledge of higher
abstractions for the prediction of future accesses. It
should also be possible for user processes to give hints
to NVM storage based main memory in order to read
ahead blocks since user processes should have knowl-
edge of their access patterns.
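One conceivable shape for such hints is a byte-range advice call; the sketch below borrows the existing posix_fadvise() interface purely as an analogy, and the helper name is hypothetical rather than an interface proposed by this paper.

    #define _POSIX_C_SOURCE 200112L   /* feature test macro for posix_fadvise */
    #include <fcntl.h>

    /* Advise that a byte range of the file will be accessed soon, so the
     * kernel (or, in the proposed architecture, the NVM storage based
     * main memory) may read it ahead into the DRAM buffer. */
    static int hint_will_need(int fd, off_t offset, off_t len)
    {
        return posix_fadvise(fd, offset, len, POSIX_FADV_WILLNEED);
    }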
The architecture also has several advantages over
the existing file system architecture based on block
devices, such as the simplification of the kernel archi-
tecture and faster processing of data access requests
because of the simplified execution paths in the ker-
nel. The kernel architecture can be significantly sim-
plified since the architecture does not require block
device drivers and a page cache. File systems di-
rectly interact with NVM storage through the mem-
ory interface, although an additional command interface
may be needed to communicate with it in order to ex-
change hint information. The simplified architecture
can stimulate active development of more advanced
features that take advantage of NVM storage based
main memory. Accessing data in a file becomes much
faster since there is no need to go through a com-
plex software framework that consists of a page cache
mechanism and block device drivers in the kernel.
Such a complex software framework paid off when
block devices were as slow as HDDs. High performance
NVM storage makes it outdated, and can instead
benefit from the simplified execution paths in the
kernel.
4 EXPERIMENTS
This section shows the performance advantage of the
proposed architecture. We performed two experi-
ments in order to show the performance impacts im-
posed by the existing block device driver framework.
We performed the experiments on the QEMU system
emulator, and measured the number of instructions required
to read and write files of various sizes from a
user process by using the read and write system calls on
the Linux kernel, version 3.4. The NVM
storage based main memory is emulated by memory;
thus, no extra overhead is considered. The user process
allocates an 8 KB buffer for reading and writing.
For the measurements of file reading, the file was
read when its data was not cached; thus, the measure-
ments include the costs of page cache allocation if
applicable. For the measurements of file writing, the
file was written when all the blocks of the file were
allocated; thus, the measurements do not include the
costs of block allocation.
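For concreteness, the sketch below shows the kind of user-side access pattern the measurements correspond to: the file is read and then rewritten through read() and write() system calls with an 8 KB buffer. It is an illustrative reconstruction rather than the exact benchmark program, the file name and size are placeholders, and the instruction counts themselves are collected by the QEMU emulator, not by this code.

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    #define BUF_SIZE (8 * 1024)
    static char buf[BUF_SIZE];

    int main(int argc, char **argv)
    {
        const char *path = argc > 1 ? argv[1] : "testfile"; /* placeholder */
        ssize_t n;

        int fd = open(path, O_RDONLY);              /* read the whole file  */
        if (fd < 0) { perror("open"); return 1; }
        while ((n = read(fd, buf, sizeof(buf))) > 0)
            ;                                       /* 8 KB per system call */
        close(fd);

        fd = open(path, O_WRONLY);                  /* rewrite it in place; */
        if (fd < 0) { perror("open"); return 1; }   /* blocks preallocated  */
        for (long i = 0; i < 128 * 1024 / 8; i++)   /* e.g. a 128 MB file   */
            if (write(fd, buf, sizeof(buf)) < 0) { perror("write"); break; }
        close(fd);
        return 0;
    }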
4.1 Performance Impact of I/O Request
Scheduling
The first experiment measures the performance im-
pact of I/O request scheduling involved in the block
device driver framework. The I/O request scheduling
is especially necessary for HDDs. It queues access
requests and sorts them, so that it can minimize seek
times, which are the major factor of access latencies.
The I/O scheduler is, however, unnecessary for NVM
storage, whose random access performance is much
better than that of HDDs. Therefore, it is recommended for
NVM storage to use the noop mode, which stands for
no operation, of the I/O scheduler, in order to mini-
mize the overhead. The noop mode still involves sim-
ple queueing.
The synchronous access of NVM storage allows
the I/O scheduler to be skipped completely. When
skipped, each access request is processed in the order
of issuing. By comparing the noop mode of the I/O
scheduler and the synchronous access mode, it is pos-
sible to reveal the performance impact of I/O request
scheduling. We developed a ram disk block device
driver that can choose between the noop mode of the
I/O scheduler and the synchronous access mode.
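The sketch below illustrates, under the assumption of a Linux 3.4-era block layer, how the synchronous access mode of such a ram disk driver can bypass the I/O scheduler by registering a make_request function, in the style of the in-kernel brd driver. The names ramdisk_make_request, ramdisk_dev, and the two copy helpers are hypothetical, not the actual code of our driver.

    /*
     * Sketch of the synchronous access mode: a make_request function
     * registered with blk_queue_make_request() receives each bio
     * directly, so requests bypass the I/O scheduler and complete in
     * issue order.  Assumes the Linux 3.4-era block layer API.
     */
    #include <linux/bio.h>
    #include <linux/blkdev.h>
    #include <linux/highmem.h>

    struct ramdisk_dev;                              /* assumed driver state */
    void ramdisk_copy_to(struct ramdisk_dev *dev, sector_t sector,
                         const void *src, unsigned int len);
    void ramdisk_copy_from(struct ramdisk_dev *dev, sector_t sector,
                           void *dst, unsigned int len);

    static void ramdisk_make_request(struct request_queue *q, struct bio *bio)
    {
        struct ramdisk_dev *dev = q->queuedata;
        sector_t sector = bio->bi_sector;
        struct bio_vec *bvec;
        int i;

        bio_for_each_segment(bvec, bio, i) {
            void *kaddr = kmap_atomic(bvec->bv_page);

            if (bio_data_dir(bio) == WRITE)          /* assumed helpers      */
                ramdisk_copy_to(dev, sector, kaddr + bvec->bv_offset,
                                bvec->bv_len);
            else
                ramdisk_copy_from(dev, sector, kaddr + bvec->bv_offset,
                                  bvec->bv_len);

            kunmap_atomic(kaddr);
            sector += bvec->bv_len >> 9;
        }
        bio_endio(bio, 0);                           /* complete immediately */
    }

    /*
     * Set-up: the synchronous mode uses
     *     queue = blk_alloc_queue(GFP_KERNEL);
     *     blk_queue_make_request(queue, ramdisk_make_request);
     * whereas the queueing mode instead uses blk_init_queue() with a
     * request function, so bios pass through the I/O scheduler (noop).
     */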
Figures 4 and 5 show the results of file reading and
writing with and without the I/O request scheduling,
respectively. We used three file systems, Ext2,
Ext4, and Btrfs, for the measurements in order to see
the differences in their processing costs. In the figures,
queueing means the I/O request scheduling is used.
The results show the significant overheads of the
I/O request scheduling. For file reading, Ext2, Ext4,
and Btrfs perform 4.69x, 4.75x, and 2.02x slower
with the I/O request scheduling than with the synchronous
access, respectively. For file writing, Ext2, Ext4, and
Btrfs perform 3.52x, 3.31x, and 1.48x slower with
the I/O request scheduling than with the synchronous access, respectively.
Figure 4: The experiment result of file reading with and
without I/O request scheduling. The vertical axis is the number
of executed instructions [1/1000], the horizontal axis is the
file size [MB] from 1 to 512, and the series are Ext2, Ext4, and
Btrfs, each with and without queueing.

Figure 5: The experiment result of file writing with and
without I/O request scheduling, with the same axes and file
systems as Figure 4.

The results suggest that the synchronous access of NVM storage can reduce the I/O
processing cost significantly, and the processing cost
of the I/O request scheduling can pay off only with
slow storage devices.
The results also clearly show the differences in
file system performance. Btrfs is the slowest among
the three file systems that were used for the measure-
ments although it is the most functionally rich file sys-
tem. Ext2 is the fastest but the functionally simplest.
Ext4 performs comparably to Ext2 with more func-
tionality than Ext2.
4.2 Performance Impact of XIP
The second experiment measures the performance im-
pact of the XIP (Execution-In-Place) feature in order
to seek further performance improvement. The
XIP feature is for byte addressable devices, such as
ram disks, and enables direct access to blocks with-
out going through a page cache. Since the XIP fea-
ture does not require a page cache, there is no need
to copy data to a page cache; thus, it incurs less pro-
cessing cost. The XIP feature requires device drivers
to support the direct access interface; thus, we implemented
the interface in our ram disk driver.

Figure 6: The experiment result of file reading with and
without XIP. The vertical axis is the number of executed
instructions [1/1000], the horizontal axis is the file size [MB]
from 1 to 512, and the series are PRAMFS (XIP), Ext2 (XIP),
and Ext2.

Figure 7: The experiment result of file writing with and
without XIP.
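As a rough illustration of what such an implementation looks like, the sketch below follows the XIP direct access interface of Linux 3.4-era kernels, modelled on the in-kernel brd driver: the file system asks the driver for the kernel address and page frame number backing a sector and then accesses or maps that memory directly, with no page cache copy in between. The names ramdisk_direct_access, ramdisk_dev, and ramdisk_page are hypothetical, not our driver's actual code.

    #include <linux/blkdev.h>
    #include <linux/genhd.h>
    #include <linux/mm.h>
    #include <linux/module.h>

    struct ramdisk_dev;                                  /* driver state    */
    struct page *ramdisk_page(struct ramdisk_dev *dev, sector_t sector);

    static int ramdisk_direct_access(struct block_device *bdev, sector_t sector,
                                     void **kaddr, unsigned long *pfn)
    {
        struct ramdisk_dev *dev = bdev->bd_disk->private_data;
        struct page *page = ramdisk_page(dev, sector);   /* assumed helper  */

        if (!page)
            return -ENODEV;
        *kaddr = page_address(page);   /* accessed directly by the FS       */
        *pfn = page_to_pfn(page);      /* used when mapping into user space */
        return 0;
    }

    static const struct block_device_operations ramdisk_fops = {
        .owner         = THIS_MODULE,
        .direct_access = ramdisk_direct_access,
    };

With this interface in place, a file system such as Ext2 can, to our understanding, access file data in place when mounted with its xip option instead of copying it into the page cache.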
Figures 6 and 7 show the results of file reading
and writing with and without the XIP feature, respectively.
We used two file systems, Ext2 and
PRAMFS. We chose Ext2 because it is the fastest file
system, as shown in the previous experiment. We also
chose PRAMFS because it is a file system
that directly accesses memory and does not require a
device driver. In the figures, XIP means the XIP fea-
ture is used.
The results show that the XIP feature can accelerate
file access performance even further. For file read-
ing and writing, Ext2 with XIP performs 2.92x and
3.42x faster than without XIP, respectively. PRAMFS
is even faster than Ext2 with XIP. It is 1.36x and
1.26x faster than Ext2 for file reading and writing,
respectively. The cumulative performance gains by
both XIP and the exclusion of I/O request scheduling
for file reading and writing are 13.69x and 12.03x,
respectively, on Ext2. If further acceleration made
possible by PRAMFS is included, the gains become
18.61x and 15.15x, respectively.
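These cumulative figures are simply the products of the
individual speedups reported above; for file reading on Ext2,
for example, 4.69 × 2.92 ≈ 13.69, and multiplying in the
PRAMFS factor gives 13.69 × 1.36 ≈ 18.6.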
The above experiment results indicate that constructing a
file system on NVM storage based main memory enables
significantly higher performance of file access
than using the block device framework, by removing
the overheads of the framework.
5 RELATED WORK
eNVy (Wu and Zwaenepoel, 1994) proposes NVM
storage based main memory. It assumes the utiliza-
tion of NOR flash memory, which is byte addressable
and whose read access latency is comparable
to that of DRAM. Moneta (Caulfield et al., 2010) is a
storage array architecture designed for NVM. While
its evaluation revealed the necessity of reducing the
software costs to deal with block devices, it does not
consider the removal of the block device interface.
(Yang et al., 2012) also investigated the software costs
to deal with block devices on the premise that PCIe
gen3 based flash storage devices will become even
faster. While it proposed the synchronous interface
to block devices, it is still based on the block device
interface. (Tanakamaru et al., 2013) proposed a solid
state storage device that is a hybrid of flash memory
and ReRAM. While their hybrid architecture is similar
to NVM storage based main memory, they do
not discuss the interaction with a file system. (Meza
et al., 2013) describes the idea to coordinate the man-
agement of memory and storage under a single hard-
ware unit in a single address space. They focused
energy efficiency, and did not propose any software
architecture. (Condit et al., 2009) and SCMFS (Wu
and Reddy, 2011) proposed file systems that are
designed to be constructed directly on NVM. They,
however, do not consider a hybrid storage architecture.
6 SUMMARY AND FUTURE
WORK
Non-volatile memory (NVM) storage is becoming
more popular as its performance and cost efficiency
improve. High performance NVM storage can expand
to become NVM storage based main memory since
polling based access is faster and its hardware can be
simplified. This paper proposed constructing a file
system directly on NVM storage based main memory
in order to overcome its drawbacks when used simply
as main memory. The performance projection of the
proposed architecture is that accessing files on such
a file system can reduce the overhead introduced by
handling block devices.
We are currently developing a simulation environ-
ment of NVM storage based main memory based on
a virtual machine. By employing a virtual machine
environment, the simulation of NVM storage and its
interface with the DRAM buffer becomes possible.
It also enables more precise performance evaluation
with realistic benchmarks and workloads.
REFERENCES
Caulfield, A. M., De, A., Coburn, J., Mollov, T. I.,
Gupta, R. K., and Swanson, S. (2010). Moneta: A
high-performance storage array architecture for next-
generation, non-volatile memories. In Proceedings
of the 2010 43rd Annual IEEE/ACM International
Symposium on Microarchitecture, MICRO ’43, pages
385–395, Washington, DC, USA. IEEE Computer So-
ciety.
Condit, J., Nightingale, E. B., Frost, C., Ipek, E., Lee, B.,
Burger, D., and Coetzee, D. (2009). Better I/O through
byte-addressable, persistent memory. In Proceedings
of the ACM SIGOPS 22nd Symposium on Operating
Systems Principles, SOSP ’09, pages 133–146, New
York, NY, USA. ACM.
Josephson, W. K., Bongo, L. A., Li, K., and Flynn, D.
(2010). DFS: A file system for virtualized flash stor-
age. Trans. Storage, 6(3):14:1–14:25.
Meza, J., Luo, Y., Khan, S., Zhao, J., Xie, Y., and Mutlu, O.
(2013). A case for efficient hardware-software coop-
erative management of storage and memory. In Pro-
ceedings of the 5th Workshop on Energy-Efficient De-
sign (WEED), pages 1–7.
Tanakamaru, S., Doi, M., and Takeuchi, K. (2013). Unified
solid-state-storage architecture with NAND flash memory
and ReRAM that tolerates 32x higher BER for big-
data applications. In Solid-State Circuits Conference
Digest of Technical Papers (ISSCC), 2013 IEEE Inter-
national, pages 226–227.
Wu, M. and Zwaenepoel, W. (1994). eNVy: A non-volatile,
main memory storage system. In Proceedings of the
6th International Conference on Architectural Sup-
port for Programming Languages and Operating Sys-
tems, ASPLOS VI, pages 86–97, New York, NY,
USA. ACM.
Wu, X. and Reddy, A. L. N. (2011). SCMFS: A file system for
storage class memory. In Proceedings of 2011 Inter-
national Conference for High Performance Comput-
ing, Networking, Storage and Analysis, SC ’11, pages
39:1–39:11, New York, NY, USA. ACM.
Yang, J., Minturn, D. B., and Hady, F. (2012). When
poll is better than interrupt. In Proceedings of the
10th USENIX Conference on File and Storage Tech-
nologies, FAST’12, pages 1–7, Berkeley, CA, USA.
USENIX Association.
TowardsNewInterfaceforNon-volatileMemoryStorage
179