Figure 1: Deduplication process.
store (Chunking). The commonly used dividing mechanisms are fixed-length chunking and variable-length chunking. Fixed-length chunking divides the data into chunks of equal length, while variable-length chunking divides it into chunks of differing lengths. A typical implementation of variable-length chunking scans the data sequentially from the first byte to the last with a short sliding window, using an algorithm such as Rabin's (M. O. Rabin, 1981); when a special value appears in the window, that position is recognized as a chunk boundary (U. Manber, 1994)(A. Muthitacharoen et al., 2001). This paper calls that value the 'anchor'. The average chunk size is determined by how many bits of the window value are tested against the anchor pattern.
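As an illustration only, the following Python sketch shows one way such variable-length chunking can be realized. It uses a simple 64-bit polynomial rolling hash over a sliding window rather than a true Rabin fingerprint, and the constants (window length, anchor mask, minimum and maximum chunk sizes) are assumed values, not those of any particular system; testing 13 bits of the window value gives an expected average chunk size of about 8 KiB.

    WINDOW = 48                          # bytes covered by the sliding window
    ANCHOR_BITS = 13                     # expected average chunk size: about 2**13 = 8 KiB
    ANCHOR_MASK = (1 << ANCHOR_BITS) - 1
    MIN_CHUNK, MAX_CHUNK = 2 * 1024, 64 * 1024
    BASE, MOD = 257, 1 << 64
    OUT = pow(BASE, WINDOW - 1, MOD)     # weight of the byte leaving the window

    def chunk_stream(data: bytes):
        """Split data into variable-length chunks at content-defined anchors."""
        chunks, start, h = [], 0, 0
        for i, b in enumerate(data):
            if i >= WINDOW:                              # slide the window forward
                h = (h - data[i - WINDOW] * OUT) % MOD
            h = (h * BASE + b) % MOD                     # fold in the incoming byte
            length = i - start + 1
            at_anchor = (h & ANCHOR_MASK) == 0 and length >= MIN_CHUNK
            if at_anchor or length >= MAX_CHUNK:         # anchor found, or boundary forced
                chunks.append(data[start:i + 1])
                start = i + 1
        if start < len(data):                            # remainder becomes the last chunk
            chunks.append(data[start:])
        return chunks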
Next, the module calculates a unique code from each chunk's data in order to analyze the similarity of chunks (Fingerprinting). These codes are computed with, for example, the SHA-1, SHA-256 or MD5 algorithms (J. Burrows and D. O. C. W. DC, 1995)(R. Rivest, 1992). This paper calls the codes 'fingerprints'. The module then decides the uniqueness of each chunk using the codes (Decision): a chunk is a duplicate if an identical chunk has already been stored, and unique if it differs from all previously stored chunks. Finally, the module writes only the unique data into the storage (Writing).
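The fingerprinting, decision and writing steps can be sketched similarly. In the minimal Python sketch below, the names deduplicate, index and storage are assumptions for illustration: SHA-1 serves as the fingerprint, a dictionary stands in for the chunk index, and a list stands in for the chunk store, whereas a real system keeps both on persistent storage.

    import hashlib

    def deduplicate(chunks, index, storage):
        """Fingerprint each chunk, decide its uniqueness, and write only unique data.

        index   : dict mapping fingerprint -> position in storage (the chunk index)
        storage : list standing in for the chunk store
        Returns the recipe, i.e. the fingerprint sequence needed to restore the data."""
        recipe = []
        for chunk in chunks:
            fp = hashlib.sha1(chunk).hexdigest()   # Fingerprinting
            if fp not in index:                    # Decision: has this chunk been seen?
                index[fp] = len(storage)           # Writing: only unique chunks are stored
                storage.append(chunk)
            recipe.append(fp)                      # duplicates are kept only by reference
        return recipe

Restoring the backup then amounts to concatenating storage[index[fp]] for every fingerprint in the recipe.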
Backup target data includes various types of files, such as images, documents, compressed files, structured M2M data and so on (N. Park and D. J. Lilja, 2010). The level of duplication contained in the data also depends on the environment: for example, some data is scarcely duplicated because it has been heavily edited, changed or updated, while other data is densely duplicated because it is rarely changed and is simply replicated or copied.
2.2 Issues
Many approaches have been proposed and implemented with the aim of increasing the deduplication rate and decreasing the processing time. These approaches, however, apply only to the improvement of a single-layer deduplication system. Typical techniques include choosing between fixed and variable chunking, using a Bloom filter, constructing a layered index table, becoming aware of the data contents, and so on (Y. Tan et al., 2010)(C. Liu et al., 2008)(Y. Won et al., 2008)(B. Zhu et al., 2008).
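As an illustration of the Bloom-filter technique mentioned above, the toy Python sketch below (with assumed sizing parameters) keeps a small in-memory bit array over all fingerprints seen so far: when the filter reports a fingerprint as absent, the chunk is certainly new and the expensive lookup in the full index can be skipped, whereas a positive answer still requires consulting the full index because of possible false positives.

    import hashlib

    class BloomFilter:
        """Toy Bloom filter used as a fast pre-check before the full index lookup."""
        def __init__(self, size_bits=1 << 23, num_hashes=4):   # assumed sizing
            self.size = size_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(size_bits // 8)

        def _positions(self, fingerprint: bytes):
            # derive num_hashes bit positions from the fingerprint
            for k in range(self.num_hashes):
                digest = hashlib.sha1(fingerprint + bytes([k])).digest()
                yield int.from_bytes(digest[:8], "big") % self.size

        def add(self, fingerprint: bytes):
            for p in self._positions(fingerprint):
                self.bits[p // 8] |= 1 << (p % 8)

        def might_contain(self, fingerprint: bytes) -> bool:
            # False means "definitely not stored yet"; True may be a false positive
            return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(fingerprint))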
How to choose an adequate chunk size in practice is a sensitive aspect of the algorithm, because it affects the behavior of the whole system. In many cases, the size is determined as a reasonable balance between performance and deduplication capability for a hypothetical environment (D. Meister and A. Brinkmann, 2009)(C. Dubnicki et al., 2009)(Quantum Corporation, 2009)(EMC Corporation, 2010)(G. Wallace et al., 2012).
The trade-off between the chunk size and the processing time is not easy to overcome. A smaller chunk size provides more precise reduction of duplication, whereas a bigger one provides only coarser reduction. At the same time, a smaller chunk size requires more CPU- and IO-intensive processing, whereas a bigger one requires less. Furthermore, the correlation is non-linear: as the chunk size decreases, the processing time increases ever more steeply.
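A rough, purely illustrative calculation makes this non-linearity concrete: for a backup of S bytes and an average chunk size c, roughly S / c chunks must be formed, fingerprinted and looked up, so halving c doubles this per-chunk work on top of the fixed cost of scanning the data. The snippet below assumes a hypothetical 1 TiB backup.

    S = 1 * 1024 ** 4                   # assumed backup size: 1 TiB
    for c_kib in (64, 16, 4, 1):        # candidate average chunk sizes in KiB
        n_chunks = S // (c_kib * 1024)
        print(f"average chunk {c_kib:>2} KiB -> about {n_chunks:,} fingerprints and index lookups")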
On the other hand, in a typical user environment the time tolerance for the backup operation is pre-defined. This tolerance is chosen to minimize the impact of the operation on system continuity. One important factor is the period of time during which backups are permitted to run, which should be set to assure the least interference with normal operations. We call this period the Backup-window. In such cases, even if the system desires more precise deduplication to reduce storage capacity and cost, the Backup-window does not allow the use of an adequately small chunk size. In practical environments, a bigger chunk size is adopted as a compromise to maintain the required performance.
In addition, a smaller chunk size brings disadvantages as well as advantages. For example, a smaller chunk works efficiently to reduce duplication when the duplicate areas are adequately spread out across the data, but works inefficiently when the duplicate areas are heavily or only sparsely localized. When the duplication is heavy or sparse, a smaller chunk wastes time in the chunking and decision steps: long continuous runs of duplicate or unique data make the fine-grained chunking process less productive, so in this case a smaller chunk size is not beneficial in comparison with a bigger one. In a typical user environment, a periodical full backup scenario is popular because it supports safe disaster recovery; in this situation the target data includes a large amount of duplicate area, so a smaller chunk size may be inefficient. On the other hand, when a periodical differential backup scenario is used to reduce excessive cost, the target data includes duplicate areas with lower density, e.g. around a half. Thus, the efficiency of a chunk size depends on the amount and locality of duplication embedded in the target data, which makes it difficult to choose the size uniformly over all different environments.