Distributed File System Based on Erasure Coding for I/O-Intensive Applications

Dimitri Pertin¹,², Sylvain David², Pierre Évenou², Benoît Parrein¹ and Nicolas Normand¹
¹LUNAM Université, Université de Nantes, IRCCyN UMR CNRS 6597, Nantes, France
²Fizians SAS, Nantes, France
Keywords: Distributed File System, RozoFS, Erasure Coding, Mojette Transform, IOzone, Video Editing.
Abstract: Distributed storage systems take advantage of network, storage and computational resources to provide a scalable infrastructure. But in such large systems, failures are frequent and expected. Data replication is the common technique to provide fault tolerance, but it suffers from high storage consumption. Erasure coding is an alternative that offers the same data protection while significantly reducing storage consumption. As it entails additional workload, current storage providers limit its use to long-term storage. We present the Mojette Transform (MT), an erasure code whose computations rely on fast XOR operations. The MT is part of RozoFS, a distributed file system that provides a global namespace relying on a cluster of storage nodes. This work is part of our ongoing effort to prove that erasure coding is not necessarily a bottleneck for I/O-intensive applications. In order to validate our approach, we consider a case study in which a RozoFS storage cluster supports video editing as an I/O-intensive application.
1 INTRODUCTION
Distributed storage systems have been used to provide data reliability. Failures in such large systems are considered the norm, and can come from either hardware or software. They can result in dramatic data loss and/or a crash of the service. The traditional way to deal with data protection is to replicate the data. Once n copies of the data are distributed across multiple network nodes, the system is able to face n − 1 failures: if a node breaks down, its data is not accessible anymore, but other copies are still available on other nodes. On the other hand, this technique is expensive, particularly for huge amounts of data. Replication is the default data protection feature included in distributed storage systems.
Erasure coding is an alternative that provides the same data protection while significantly reducing storage consumption. Optimal codes, called Maximum Distance Separable (MDS) codes, encode k data blocks into n parity blocks, with n ≥ k. The encoded blocks hold enough redundancy to recover the original data from any subset of k parity blocks during the decoding process. Compared to replication, whose storage overhead is n, erasure coding's overhead is n/k. However, encoding and decoding require more computation than simple replication, and efforts are being made today to design efficient codes. The most famous are the Reed-Solomon (RS) codes, which rely on Galois field arithmetic: in this arithmetic, addition is fast as it corresponds to an exclusive-OR (XOR), but multiplication has many implementations that are much more computationally expensive. Storage systems that provide erasure coding through RS suffer from this computational slowdown. Many implementations of RS codes exist; example libraries are OpenFEC (http://openfec.org/) and Jerasure (Plank, 2007).
We propose the use of the Mojette Transform (MT) (Normand, Kingston and Évenou, 2006) as an erasure code for distributed storage. The MT relies only on additions, and its implementation uses fast XOR operations. The MT is part of RozoFS (code available at http://www.rozofs.org), an open-source software solution providing a distributed file system. In RozoFS, when a client wants to store a file, the data is cut into small chunks of 8 KB. Chunks are composed of k blocks of fixed size, which depends on the desired protection. These blocks are encoded into n parity blocks, which are discrete projections in the Mojette formalism. These projections are then distributed to the storage nodes, and the system is able to face
n − k failures. Decoding is done client-side after retrieving k blocks out of the n from the storage nodes.
Current storage systems provide erasure coding to benefit from its storage savings for long-term storage. When intensive data I/O is needed, replication is put forward because of the additional workload of erasure codes. Our work departs from this position by exploring the use of an erasure code for intensive applications. We experiment with this concept by applying RozoFS to video editing.
The rest of the paper is structured as follows: Section 2 briefly describes some distributed storage systems and their data protection policies. In Section 3, we present the MT and its erasure code characteristics, while Section 4 describes how RozoFS uses it to provide a distributed file system. The video editing experiment is presented in Section 5 and demonstrates that erasure coding is not a bottleneck for I/O-intensive applications. Section 6 concludes the paper with possible future work.
2 RELATED WORK
Erasure coding receives significant attention from both the business and research communities. Scality, Caringo, Isilon and Cleversafe are examples of companies providing proprietary storage solutions based on erasure codes. Data protection in these systems is partially described in technical white papers.

Scality and Caringo provide both replication and erasure coding to protect data. Erasure coding is only used for the long-term storage of massive amounts of data, and both recommend replication for intensive applications. Isilon puts forward the use of erasure coding through a Reed-Solomon implementation in OneFS; replication is only used when erasure coding is not applicable (e.g. too few nodes). Cleversafe provides data protection exclusively through erasure coding, relying on Luby's implementation (Blömer et al.) of the Reed-Solomon algorithm.
Besides these solutions, well-known and stable free alternatives exist. One of the most popular is Apache HDFS. This distributed file system (DFS) aims at archiving and data-intensive computing (i.e. it is not really an I/O-centric DFS). It relies on the MapReduce framework, which divides and distributes tasks to storage nodes. Data protection is done by replication, and triplication is the default protection policy. Carnegie Mellon University has developed a version based on erasure codes, called HDFS-RAID (Fan et al., 2009), which uses Reed-Solomon codes for generic parity computations. HDFS-RAID has been applied in practice (e.g. in the Facebook warehouse). However, only a small part of the warehouse's data has been converted to erasure coding: warm data (i.e. frequently accessed data) is still replicated, and once data has not been used for a while, the raid-node daemon encodes it and erases the corresponding replicated blocks to make room. GlusterFS (http://www.gluster.org/) and Ceph (Weil et al., 2006) are examples of popular I/O-centric DFSs. Replication is currently their standard for data protection, but erasure coding is a hot topic on their roadmaps.
Erasure coding is thus already included in proprietary solutions, but it is not considered efficient enough for intensive applications. It is used today exclusively for long-term storage, where systems benefit from its low storage consumption compared to replication. Open-source storage solutions are still actively looking for high-speed erasure code implementations as an alternative to replication.

Our contribution, RozoFS, is an open-source software solution jointly developed by Fizians SAS and the University of Nantes. It provides a POSIX DFS whose file operations are exclusively done by erasure coding. This code is based on the MT, whose computations rely entirely on very fast XOR operations. We therefore expect RozoFS to be efficient enough for intensive applications.
3 MOJETTE TRANSFORM
The Mojette Transform (MT) is a mathematical tool based on discrete geometry. Long developed at the University of Nantes, its first definition as an erasure code dates back to (Normand et al., 1996). A first design for distributed storage relying on the MT was proposed in (Guédon et al., 2001).
3.1 Algorithm
Fundamentally, an erasure code should efficiently compute a desired amount of redundant information from a set of data. This is done by a linear operation that should be efficiently invertible; the inverse operation reconstructs the original data from a subset of the computed information.

Mojette encoding is done as follows. A set of data fills a discrete grid whose elements are identified by (i, j), where i and j are integers. The linear transform consists in summing these elements along different discrete directions. Each direction is defined by a couple of integers (p, q). The elements aligned in a direction are summed to form a bin, and the set of bins defined by the same direction is called a projection. Hence, it is possible to use the Mojette Transform to compute a desired number of projections from a set of lines of data. If k and n are respectively the number of lines of the grid and the number of projections, and if n > k, the set of projections holds redundant information. Figure 1 shows the computation of 3 projections from a set of 2 lines of data.

Figure 1: The Mojette Transform, applied to two lines of 4 integers, computes here the following set of 3 projections: {(p_i, q_i)} = {(1, 1), (0, 1), (−1, 1)}.
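To make the encoding step concrete, here is a minimal sketch in C of the forward Mojette Transform on a grid of k lines and l columns of integers. The bin-indexing convention (b = q·i − p·j, shifted so that indices start at 0) and the function names are our own choices for illustration; this is not the RozoFS implementation.

    #include <stdlib.h>

    /* Number of bins of projection (p, q) for a grid of k lines and l
     * columns, cf. Eq. (2): B = (k-1)|p| + (l-1)|q| + 1.               */
    static int mojette_bins(int k, int l, int p, int q)
    {
        return (k - 1) * abs(p) + (l - 1) * abs(q) + 1;
    }

    /* Forward Mojette projection of grid[k][l] along direction (p, q),
     * assuming q >= 1. Element (i, j) (column i, line j) is added to bin
     * b = q*i - p*j: stepping p columns and q lines leaves b unchanged,
     * so elements lying on the same discrete line of direction (p, q)
     * are summed into the same bin.                                     */
    static void mojette_project(const int *grid, int k, int l,
                                int p, int q, int *bins)
    {
        int nbins  = mojette_bins(k, l, p, q);
        int offset = (p > 0) ? p * (k - 1) : 0;   /* keep bin indices >= 0 */

        for (int b = 0; b < nbins; b++)
            bins[b] = 0;
        for (int j = 0; j < k; j++)               /* line   */
            for (int i = 0; i < l; i++)           /* column */
                bins[q * i - p * j + offset] += grid[j * l + i];
    }

In RozoFS the grid elements are machine words rather than small integers, and the additions become XOR operations, as described in Section 3.2.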
Figure 2: Inverse Mojette Transform as described in (Normand, Kingston and Évenou, 2006).
The decoding is the inverse operation: it consists in filling the empty grid using only the projection data. Data reconstruction is possible if the Katz criterion (Normand, Kingston and Évenou, 2006) holds for the rectangular-shaped grid; this criterion was extended to any convex shape in (Normand et al., 1996). Reconstruction is straightforward for the elements in the corners of the grid, since each of these elements is the only contributor to one projection bin. (Normand, Kingston and Évenou, 2006) showed that if a reconstruction path can be found to fill one side of the grid, then it can be applied several times to easily reconstruct the whole grid. Figure 2 illustrates such a reconstruction for the example of Figure 1. According to the Katz criterion, the following condition is sufficient to validate the reconstruction:

    \sum_{i=1}^{r} |q_i| \geq k        (1)

where r is the number of remaining projections, q_i is the second component of the direction (p_i, q_i) of the i-th projection, and k is the number of lines of the grid. For instance, in Figure 2, \sum_{i=1}^{2} |q_i| = 2, which equals k, so reconstruction is possible. By modifying the number of lines k and the number of projections n, it is possible to set the desired fault-tolerance threshold.
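As a small illustration of criterion (1), the following helper (our own naming and interface) checks whether a set of r surviving projections, assumed to have pairwise distinct directions, is sufficient to reconstruct a grid of k lines.

    #include <stdbool.h>
    #include <stdlib.h>

    /* Sufficient condition (1): the remaining projections (p_i, q_i)
     * allow reconstruction of a k-line grid if the sum of |q_i| reaches
     * k (directions are assumed pairwise distinct).                    */
    static bool katz_reconstructible(const int *q, int r, int k)
    {
        int sum = 0;
        for (int i = 0; i < r; i++)
            sum += abs(q[i]);
        return sum >= k;
    }

With the settings used in Section 3.2 (q_i = 1 for every projection), the condition simply requires that at least k projections survive.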
3.2 Implementation
In the storage application, we cut the input stream into k fixed-size blocks. Each line of the grid in the Mojette representation is such a data block; the grid is therefore rectangular. The set of projections is chosen such that q_i = 1. If we suppose that l, the size of the blocks (i.e. the number of grid columns), is larger than k, the number of blocks (i.e. the number of grid lines), then the Katz criterion guarantees that reconstruction is possible from any subset of k projections.

The strength of the Mojette Transform is that encoding and decoding rely entirely on additions. Compared to classical codes like Reed-Solomon, there is no need to compute expensive multiplications or divisions in Galois fields. The elements of the grid should fit computer words to improve computation performance; addition is then done by an efficient exclusive-OR (XOR). Most recent Intel-based processors can perform very fast computations on 128-bit elements, as they fit the dedicated SSE (Streaming SIMD Extensions) registers.
In traditional solutions like Reed-Solomon codes, the size of the parity blocks is fixed. The Mojette Transform relaxes this constraint: the size B of each projection varies slightly with the projection angle, and is given by the following formula:

    B(k, l, p, q) = (k - 1)|p| + (l - 1)|q| + 1        (2)

where k is the number of grid lines and l is the size of the blocks. The MT is thus considered (1 + ε) MDS (Parrein et al., 2001), where ε accounts for the extra bins needed for decoding, relative to the number of elements in the grid. The set of projections is chosen such that q_i = 1 and p_i varies around 0 in order to minimize the value of ε.
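The sketch below evaluates Eq. (2) for a projection set chosen as described above (q_i = 1, p_i spread around 0) and derives the resulting ε, i.e. the fraction of extra bins carried by the k projections used for decoding compared to the k·l grid elements. The grid dimensions and the projection set are arbitrary example values of ours.

    #include <stdio.h>
    #include <stdlib.h>

    static int mojette_bins(int k, int l, int p, int q)    /* Eq. (2) */
    {
        return (k - 1) * abs(p) + (l - 1) * abs(q) + 1;
    }

    int main(void)
    {
        int k = 4, l = 1024;                 /* example grid: 4 lines     */
        int p[] = { 0, 1, -1, 2, -2, 3 };    /* n = 6 directions, q_i = 1 */

        int bins = 0;
        for (int i = 0; i < k; i++)          /* decode from k projections */
            bins += mojette_bins(k, l, p[i], 1);

        double epsilon = (double)bins / (k * l) - 1.0;
        printf("bins used for decoding: %d, epsilon = %.4f\n", bins, epsilon);
        return 0;
    }

Because the p_i stay close to 0, ε remains small and the MT stays close to the MDS optimum; note that ε depends on which k of the n projections are actually used for decoding.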
4 ROZOFS
RozoFS is an open-source software solution providing a scalable distributed file system. Once mounted, the file system yields a single namespace (i.e. a hierarchy of directories and files) that relies on clusters of commodity storage nodes. RozoFS manages scalability by adding storage nodes to expand the global resources of the system. For data reliability, it relies on the Mojette Transform as an erasure code.
DistributedFileSystemBasedonErasureCodingforI/O-IntensiveApplications
453
4.1 Information Dispersal
We consider a network of commodity computers (or nodes) communicating through one or several links. The more links there are, the more paths are available for packets, and the more reliable and high-performing the system is. Reliability comes from the capacity to keep communicating even if a link is down; again, failure probabilities are non-negligible and failures should be considered the norm. Multiple links also yield high performance because packets can be sent in parallel.

RozoFS considers flows of information of fixed size. A small data chunk of 8 KB fills a Mojette grid as seen in Figure 1. The protection settings, called layouts, define the values of n and k (respectively the number of projections and the number of lines in the Mojette grid).
Table 1: The protection settings provided by RozoFS.

    Layout   k    n    storage nodes   fault tolerance
    0        2    3    4               1
    1        4    6    8               2
    2        8    12   16              4
Currently, three layouts are defined in RozoFS. Table 1 displays the relevant information for these configurations. Each layout corresponds to a storage consumption of n/k = 1.5 times the size of the input. For instance, consider writing 1 GB of data into an exported RozoFS volume set with layout 0: the result is 3 projection files of around 500 MB each (i.e. plus ε) distributed across the physical disks.
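To illustrate Table 1 and the "plus ε" remark, the following sketch estimates the size of each projection produced from one 8 KB chunk under layout 0, using Eq. (2). The element width (128-bit words, as suggested in Section 3.2) and the projection directions are assumptions of ours, not documented RozoFS defaults.

    #include <stdio.h>
    #include <stdlib.h>

    #define CHUNK_BYTES 8192    /* RozoFS chunk size (8 KB)             */
    #define WORD_BYTES  16      /* assumed element width: 128-bit words */

    static int mojette_bins(int k, int l, int p, int q)     /* Eq. (2) */
    {
        return (k - 1) * abs(p) + (l - 1) * abs(q) + 1;
    }

    int main(void)
    {
        /* Layout 0 of Table 1: k = 2 lines encoded into n = 3 projections,
         * here with the assumed directions q = 1 and p in {-1, 0, 1}.     */
        int k = 2, n = 3, p[] = { -1, 0, 1 };
        int l = CHUNK_BYTES / k / WORD_BYTES;     /* elements per line */

        long stored = 0;
        for (int i = 0; i < n; i++) {
            long bytes = (long)mojette_bins(k, l, p[i], 1) * WORD_BYTES;
            printf("projection (p = %2d, q = 1): %ld bytes\n", p[i], bytes);
            stored += bytes;
        }
        printf("stored / original = %.3f (n/k = 1.5 plus epsilon)\n",
               (double)stored / CHUNK_BYTES);
        return 0;
    }

The same reasoning, applied chunk by chunk, gives the figure quoted above: writing 1 GB under layout 0 produces 3 projection files of roughly 500 MB each.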
4.2 Stackable Architecture
RozoFS relies on three components:
exportd: the daemon that manages metadata;
storaged: the daemon in charge of the projections;
rozofsmount: used by the clients to mount RozoFS
locally.
For the sake of modularity, these components are stackable. For instance, they can all be collocated on a single node, storing projections on different disks, to obtain protection against disk failures (similar to some RAID configurations). In a large network topology, dedicated nodes are preferred for the sake of scalability.
RozoFS Client. Each client can use RozoFS as a standard POSIX file system thanks to the rozofsmount process, which relies on FUSE (http://www.fuse.sourceforge.net) to operate in user mode. Clients handle two kinds of operations: (i) the metadata operations (lookup, getattr, etc.), which interface with the export server; (ii) the file I/O operations (reading and writing the projections), which interface with the storage nodes through the storcli subprocesses. The encoding and decoding workloads are managed by the clients. The network design should take care of reliability: services and links must be replicated to remain available in the face of failures.

Figure 3: The RozoFS cluster is composed of 6 nodes, interconnected by Gigabit Ethernet, that hold different services. They provide the DFS (a single global namespace) to 5 clients (NLE1 to NLE5) running a Non-Linear Editing (NLE) application.
Metadata Server. This node manages the exportd process that services the metadata operations. Its configuration file defines the exported file systems. It keeps statistics on the storage nodes to provide a data distribution policy based on storage capacity. The metadata server supplies clients with two lists: (i) the list of storage nodes servicing the mounted volume; (ii) the list of storage nodes associated with a regular file (for read/write operations). This service should be replicated to guarantee high availability.
Storage Nodes. These entities run the storaged daemon. Two services are provided by storage nodes: (i) management services for exportd; (ii) projection I/O services for the clients, provided by the storio processes. Each storio process can listen on a specific network interface. A storcli groups all the storio processes of a storage node into a load-balancing group to improve availability.
5 EVALUATION
In this section, we explore the capacity of RozoFS to
scale as the global workload increases.
Experiment Setup. For our evaluation, we employ a RozoFS cluster of 6 similar servers. Each machine contains an Intel Xeon CPU @ 2.40 GHz, 12 GB of RAM, and 10 disks (7,200 rpm SAS) of 2 TB each. These drives are managed by a RAID-5 controller, providing better I/O performance and local reliability. Each node has 8 × 1 Gbit Ethernet ports.
The exported volume is set to layout 0, providing fault tolerance against a single failure. The services are distributed as follows:
the 6 nodes serve as storage servers;
one of them serves as the active metadata server;
another is a passive/replicated metadata server;
the 6 nodes mount RozoFS and re-export the file system through the AFP protocol for client access.
Figure 3 displays the platform used here. For high-speed communication and reliability, 4 Ethernet ports are reserved for the storio processes; RozoFS itself manages the packet scheduling for load balancing and availability. Because the metadata server is a potential point of failure, it is necessary to set up a high-availability strategy. Here, we use DRBD (Distributed Replicated Block Device, www.drbd.org) to synchronise the active metadata server with the passive one, and Pacemaker (http://clusterlabs.org/) as the cluster management software that handles the failover mechanisms in case of failure.
The Study Case. We use the previous RozoFS cluster as a storage solution for video editing. Non-Linear Editing (NLE) applications entail intensive workloads for computers; multiple-source editing is particularly file I/O intensive since several multimedia contents are accessed simultaneously. For this experiment, we use the well-known NLE software Apple Final Cut Pro (http://www.apple.com/final-cut-pro/). Remote clients, running Apple Mac OS, attach the RozoFS file system to their own machine using the AFP protocol. Once mounted, the file system provides easy access to the source media stored on RozoFS. In our case, the system stores video encoded with the lossy compressed format Apple ProRes 422, which requires a rate of at least 200 Mbit/s (i.e. 25 MB/s) to avoid frame drops. We consider that editing involves 5 input video files simultaneously and outputs a single one. The software is designed both for sequential operations (e.g. reading the video stream) and for direct access to a specific part of the video.
5.1 IOzone Benchmark
IOzone (http://www.iozone.org) is a file system benchmark that measures file I/O performance under a set of stress tests. In particular, it can perform sequential and random read/write tests, which match our study case. We explore the capacity of RozoFS to adapt as a growing number of clients access the file system simultaneously.

Figure 4: Accumulated throughput (in MB/s) recorded by the threads, depending on the number of clients working in parallel, for sequential access operations (sequential write and sequential read).
IOzone takes the following parameters. The file size is set to 1 GB, which is larger than the CPU cache and smaller than the RAM capacity. To fit the multi-source editing scenario, each client involved in the test manages 5 threads, and each thread reads or writes a 1 GB file according to the desired access strategy. For instance, in a write test, each node induces the writing of 5 GB into RozoFS (which represents 7.5 GB on the physical disks).

We measure the accumulated throughput recorded by the threads as the number of clients grows. Each thread should access data at a rate of at least 25 MB/s to validate the study case. Figures 4 and 5 display the average results from 5 test iterations for sequential and random access, respectively.
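For reference, the arithmetic behind these runs can be reproduced with the small helper below (parameters taken from the setup above; the 1.5 factor ignores ε):

    #include <stdio.h>

    int main(void)
    {
        double file_gb      = 1.0;    /* file size handled by each thread  */
        int    threads      = 5;      /* threads per client (multi-source) */
        double layout_ratio = 1.5;    /* layout 0: n/k                     */
        double min_rate     = 25.0;   /* required MB/s per thread (ProRes) */

        for (int clients = 1; clients <= 5; clients++) {
            double written_gb = clients * threads * file_gb;
            printf("%d client(s): %4.1f GB written to RozoFS, "
                   "%4.1f GB on disk, >= %5.1f MB/s aggregate needed\n",
                   clients, written_gb, written_gb * layout_ratio,
                   clients * threads * min_rate);
        }
        return 0;
    }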
Benchmarking Sequential I/Os. The write operations are asynchronous and multiple write requests are issued simultaneously. Figure 4 shows that, as the workload grows, the sequential write performance scales up to 3 nodes. When more clients are added, the nodes are overwhelmed by the requests and cache misses slow down performance, as data needs to be fetched from the disks. The sequential read operations benefit significantly from fast access, as data is prefetched into the cache, and the performance scales up as the number of clients increases. For instance, when 5 clients are involved, the 25 threads record an accumulated throughput of 1800 MB/s: each client receives 1800/5 = 360 MB/s and each thread receives 360/5 = 72 MB/s. In every case in this benchmark, the study case is validated, as each thread receives more than 25 MB/s. Mechanical disks are particularly sensitive to the type of access. For sequential access, the disk's arm moves the read-and-write head to reach the correct track on the physical disk; once there, the following accesses are straightforward.

Figure 5: Accumulated throughput (in MB/s) recorded by the threads, depending on the number of clients working in parallel, for random access operations (random read and random write).
Benchmarking Random I/Os. Random operations are slowed down by intermediate seeks between operations, which increase the latency of the disk head. Figure 5 shows that the throughput suffers from these random accesses. During random read operations, the performance scales as it benefits from small cache effects, and each thread receives at least 25 MB/s. However, the random write test is clearly the worst case and shows the limits of the hardware, as the performance collapses: for 3 clients, each thread receives 350/3/5 ≈ 23 MB/s, and performance gets worse with more clients. We should note, however, that 5 writing threads per client does not fit our study case: in video editing, each editor outputs a single file, which corresponds to a single write thread per client.
6 CONCLUSIONS
In this work, we have presented the Mojette Transform (MT) as an efficient alternative to classical erasure codes, since it relies on fast computations, and as a strong candidate to handle intensive applications. We have set the MT as the I/O-centric process in RozoFS with good expectations for practical efficiency. Finally, we have designed an evaluation based on a RozoFS cluster. The platform is able to handle multiple parallel accesses from intensive I/O applications (i.e. multiple-source video editing). The evaluation has revealed that erasure coding is not a bottleneck for intensive applications and should not be limited to long-term storage.

More specific measurements should be done to reveal the real cost of the MT in terms of computation time. Our evaluation could be extended to more intensive applications like virtualization, which accesses data over block devices. There is clearly a need for comparisons with other existing solutions: the MT must be compared to erasure code libraries such as OpenFEC and Jerasure. Further erasure code aspects beyond computational considerations should also be explored, such as the node repair problem.
ACKNOWLEDGEMENTS
This work has been supported by the French Agence
Nationale de la Recherche (ANR) through the project
FEC4Cloud (ANR-12-EMMA-0031-01).
REFERENCES

Blömer, J., Kalfane, M., Karpinski, M., Karp, R., Luby, M., and Zuckerman, D. Technical report.
Fan, B., Tantisiriroj, W., Xiao, L., and Gibson, G. (2009). DiskReduce: RAID for data-intensive scalable computing. In Proc. PDSW 2009.
Guédon, J. P., Parrein, B., and Normand, N. (2001). Internet distributed image information system. Integr. Comput.-Aided Eng., 8(3):205–214.
Normand, N., Guédon, J., Philippé, O., and Barba, D. (1996). Controlled redundancy for image coding and high-speed transmission. In Proc. VCIP 1996, Orlando, FL.
Parrein, B., Normand, N., and Guédon, J. P. (2001). Multiple description coding using exact discrete Radon transform. In Proceedings of the Data Compression Conference, DCC '01, pages 508–.
Plank, J. S. (2007). Jerasure: A library in C/C++ facilitating erasure coding for storage applications. Technical Report CS-07-603, University of Tennessee.
Weil, S. A., Brandt, S. A., Miller, E. L., Long, D. D. E., and Maltzahn, C. (2006). Ceph: A scalable, high-performance distributed file system. In Proc. OSDI 2006.
CLOSER2014-4thInternationalConferenceonCloudComputingandServicesScience
456