Most of the file systems used today use journalling in order to ensure file system consistency. This involves writing either metadata alone, or both metadata and data, to a journal prior to committing them to the file system itself. In the event described previously, the journal can be “replayed” in an attempt to either finish committing the data to disk, or at least bring the disk back to a previous consistent state, with a higher probability of success. Such a safety mechanism is not free, nor does it completely avert risk. Ultimately, the heavier the use of journalling (i.e. for both metadata and data), the lower the risk of unrecoverable inconsistency, at the expense of performance.
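To make the ordering concrete, the following is a minimal sketch of write-ahead journalling in Python; the journal file name and JSON record format are illustrative assumptions, not the on-disk format of any particular file system.

import json, os

JOURNAL = "journal.log"   # illustrative journal file, not a real on-disk format

def journal_write(records):
    """Append intended changes to the journal and flush them to stable
    storage before touching the 'real' file system structures."""
    with open(JOURNAL, "a") as j:
        for rec in records:
            j.write(json.dumps(rec) + "\n")
        j.flush()
        os.fsync(j.fileno())    # the journal must be durable before the commit

def commit(records):
    """Apply the changes to their final location, then clear the journal."""
    for rec in records:
        with open(rec["path"], "a") as f:
            f.write(rec["data"])
            f.flush()
            os.fsync(f.fileno())
    # Once every change has reached its final location, the journal can be truncated.
    open(JOURNAL, "w").close()

def replay():
    """After a crash, re-apply any journalled changes that may not have been
    committed (a real journal records enough to make this idempotent)."""
    if not os.path.exists(JOURNAL):
        return
    with open(JOURNAL) as j:
        pending = [json.loads(line) for line in j if line.strip()]
    if pending:
        commit(pending)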
As mentioned previously, ZFS is a CoW file system; it never overwrites data in place, and its transactions are atomic. As a result, the on-disk format is always consistent, hence the lack of an fsck tool for ZFS. The closest ZFS equivalent to journalling is the ZFS Intent Log (ZIL). However, the two function completely differently: in traditional file systems, data held in RAM is typically flushed to a journal, which is then read when its contents are to be committed to the file system. As a gross oversimplification of the behaviour of ZFS, the ZIL is only ever read to replay transactions following a failure, with data still being read from RAM when committed to disk. It is possible to place the ZIL on a dedicated VDEV, called a SLOG, though there are some important considerations to be made.
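The behavioural difference can be expressed as the same gross oversimplification in code form; all names and structures below are invented for illustration. The key contrast is that the journal is read on every commit, whereas the intent log is written for synchronous operations but only read back during crash recovery, since committed data comes from RAM.

class JournalledFS:
    """Traditional journalling (illustrative): data is flushed to the journal,
    and the journal is read back when committing to the file system."""
    def __init__(self):
        self.journal, self.disk = [], {}

    def write(self, key, value):
        self.journal.append((key, value))     # flushed from RAM to the journal

    def commit(self):
        for key, value in self.journal:       # journal is READ on every commit
            self.disk[key] = value
        self.journal.clear()


class ZFSLikeFS:
    """ZFS-like intent log (illustrative): the ZIL records intent for
    synchronous writes, but commits come straight from the in-RAM state;
    the ZIL is only read to replay transactions after a failure."""
    def __init__(self):
        self.ram, self.zil, self.disk = {}, [], {}

    def write(self, key, value, sync=False):
        self.ram[key] = value
        if sync:
            self.zil.append((key, value))     # logged, but not read in normal operation

    def commit_txg(self):
        self.disk.update(self.ram)            # data is read from RAM, not from the ZIL
        self.zil.clear()                      # intent log entries are now obsolete

    def replay_after_crash(self):
        for key, value in self.zil:           # the only time the ZIL is ever read
            self.disk[key] = value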
A.4 Silent Corruption
Silent corruption refers to the corruption of data that goes undetected by the normal operation of a system and, in some cases, cannot be resolved with certainty. It is often assumed that server-grade hardware is all but immune to such errors, given error-correcting code (ECC) system memory on top of the common ECC and/or cyclic redundancy check (CRC) capabilities of various components and buses within the storage subsystem. However, this is far from the case in reality. In 2007, Panzer-Steindel at CERN released a study which revealed the following errors under various incidents and tests (though the sampled configurations are not mentioned):
Disk Errors. Approximately 50 single-bit errors and
50 sector-sized regions of corrupted data, over
a period of five weeks of activity across 3000
systems
RAID-5 Verification. Recalculation of parity;
approximately 300 block problem fixes across
492 systems over four weeks
CASTOR Data Pool Checksum Verification. Approximately “one bad file in 1500 files” in 8.7 TB of data, with an estimated “byte error rate of 3 × 10⁻⁷”
Conventional RAID and file system combinations have no capability to resolve the aforementioned errors. In a RAID-1 mirror, the array cannot determine which copy of the data is correct, only that there is a mismatch. A parity array is arguably even worse in this situation: a consistency check recalculates parity from the corrupt data, so it flags the parity blocks as mismatching, and a repair would simply rewrite the parity to match the corruption.
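A minimal sketch of the dilemma, with invented data layouts, is given below: without an independent checksum, a mirror can only report that its copies disagree, and a naive parity “repair” recomputes parity from the corrupt data.

from functools import reduce

def mirror_read(copy_a: bytes, copy_b: bytes) -> bytes:
    """RAID-1 style read check: a mismatch is detectable, but there is no way
    to tell which copy is the correct one."""
    if copy_a != copy_b:
        raise RuntimeError("mirror mismatch detected, but cannot tell which copy is good")
    return copy_a

def xor_parity(blocks: list[bytes]) -> bytes:
    """XOR parity over equally sized blocks, as in RAID-5."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

def raid5_scrub(data_blocks: list[bytes], parity: bytes) -> bytes:
    """RAID-5 style consistency check: if a data block is silently corrupt,
    the recomputed parity will not match, and 'fixing' the parity simply
    bakes the corruption in."""
    recomputed = xor_parity(data_blocks)
    if recomputed != parity:
        parity = recomputed      # the corrupt data is now treated as authoritative
    return parity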
In this instance, CASTOR (the CERN Advanced STORage manager) and its checksumming capability, coupled with data replication, is the only method that can counter silent corruption: if a file’s checksum does not match on verification, the file is deemed corrupt and can be rewritten from a replica. There are two disadvantages to this approach: at the time of the report’s publication, this validation process did not run in real time; and it is a file-level mechanism, meaning that reading a large file to calculate its checksum, and rewriting the whole file from a replica if an error is discovered, is expensive in terms of disk activity, as well as CPU time at a large enough scale.
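A sketch of the file-level approach follows; the checksum algorithm and replica lookup are assumptions for illustration, not CASTOR’s actual implementation. The cost is visible in the sketch: the entire file must be read to verify it, and the entire file is rewritten if the checksum does not match.

import hashlib, shutil

def verify_and_repair(path: str, expected_checksum: str, replica_path: str) -> bool:
    """File-level scrub (illustrative, not CASTOR's implementation): the whole
    file is read to verify it, and the whole file is rewritten on failure."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):   # read the entire file
            h.update(chunk)
    if h.hexdigest() == expected_checksum:
        return True
    shutil.copyfile(replica_path, path)                     # rewrite the entire file
    return False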
As stated in A.2, ZFS’s on-disk structure is a Merkle tree, storing the checksums of data blocks in their parent nodes. Like CASTOR, ZFS can run a scrub operation to verify these checksums. However, ZFS also verifies a block’s checksum automatically every time the block is read, and if verification fails and a redundant copy exists, it repairs just that block, as opposed to an entire file.
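In contrast, a block-level scheme along the lines described above might look like the following toy model (not ZFS’s actual data structures): the checksum stored in the parent is verified on every read, and only the failing copy is repaired from a good one.

import hashlib

class BlockPointer:
    """Toy model of a parent node holding the checksum of a child block,
    together with up to three redundant copies ('ditto blocks')."""
    def __init__(self, copies: list[bytes]):
        self.copies = copies                  # redundant copies of the same block
        self.checksum = hashlib.sha256(copies[0]).hexdigest()

    def read(self) -> bytes:
        """Verify the checksum on every read; self-heal bad copies from a good one."""
        good, bad_indices = None, []
        for i, data in enumerate(self.copies):
            if hashlib.sha256(data).hexdigest() == self.checksum:
                good = data
            else:
                bad_indices.append(i)
        if good is None:
            raise IOError("all copies fail checksum verification")
        for i in bad_indices:
            self.copies[i] = good             # repair only this block, not a whole file
        return good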
All the aforementioned points apply to both
metadata and data. A crucial difference between a
conventional file system combined with RAID and
ZFS is that these copies, known as ditto
blocks, can exist anywhere within a zpool (allowing
for some data-level resiliency even on a single disk),
and can have up to three instances. ZFS tries to
ensure ditto blocks are placed at least 1/8 of a disk
apart as a worst case scenario. Metadata ditto blocks
are mandatory, with ZFS increasing the replication
count higher up the tree (these blocks have a greater
number of children, thus are more critical to
consistency).
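A rough sketch of the placement policy described above is given below; this is a simplification for illustration, as ZFS’s actual allocator is considerably more involved.

def ditto_offsets(disk_size: int, copies: int = 3) -> list[int]:
    """Toy placement policy: spread up to three ditto blocks so that, at worst,
    copies on the same disk sit about 1/8 of the disk apart (a simplification
    of ZFS's allocator)."""
    assert 1 <= copies <= 3
    spread = disk_size // 8
    return [(i * spread) % disk_size for i in range(copies)]

# e.g. on a 1 TB disk, copies land at least 125 GB apart
print(ditto_offsets(10**12))   # [0, 125000000000, 250000000000]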
Another form of silent corruption associated with traditional RAID arrays is the “write hole”: the same type of occurrence as outlined above, but arising on power failure, when a stripe’s data and parity are not updated atomically. In production this is rare due to the use of uninterruptible power supplies (UPSs) to prevent system power loss and RAID controllers