towards minimizing this overhead.
In the rest of the paper, we begin by summarizing
the related work. Afterwards, we explain our
distributed checkpointing mechanism and provide an
exemplary architecture and experimental environment
to evaluate our approach. We close with conclusions
and planned future work.
2 RELATED WORK
Checkpointing performance has long been a widely
investigated topic in the distributed computing
domain, and numerous techniques aim to minimize
checkpoint and recovery costs. (Heo, Junyoung & Yi,
Sangho & Cho, Yookun & Hong, Jiman & Shin, Sung, 2006)
propose a solution for minimizing checkpointing costs
in terms of storage, while (Mao, Yanhua and Junqueira,
Flavio and Marzullo, K., 2008) focus on network
performance in terms of throughput: they propose a
high-performance checkpointing and recovery approach
for replicated state machines derived from the Paxos
consensus protocol, which is out of scope for this
research. (B. Ghit and D. H. J. Epema, 2017) propose
to checkpoint only straggling tasks in order to
minimize the number of checkpoints and hence the
overall checkpointing overhead. (Naksinehaboon,
Nichamon and Liu, Yudan and Leangsuksun, C. and
Nassar, Ruba and Paun, Mihaela and Scott, S., 2008)
propose a checkpointing mechanism that reduces the
checkpoint data size by checkpointing only dirty
pages, i.e. pages modified since the last checkpoint.
This approach, named the incremental checkpoint model,
involves a decision mechanism that persists only the
minimal data that needs to be checkpointed since the
last checkpoint in the execution history.
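The dirty-page idea described above can be illustrated with a minimal sketch. It is not the implementation from the cited work: the `PagedMemory` class, its in-memory page list, and the method names are all hypothetical, and the decision mechanism of the incremental checkpoint model is not modeled.

```python
from typing import Dict, List, Set


class PagedMemory:
    """Toy memory model (hypothetical) that tracks pages written since the last checkpoint."""

    def __init__(self, num_pages: int) -> None:
        self.pages: List[bytes] = [b""] * num_pages
        self.dirty: Set[int] = set()

    def write(self, page_no: int, data: bytes) -> None:
        # Every write marks the page as dirty.
        self.pages[page_no] = data
        self.dirty.add(page_no)

    def incremental_checkpoint(self) -> Dict[int, bytes]:
        # Persist only the dirty pages, then reset the dirty set.
        snapshot = {n: self.pages[n] for n in sorted(self.dirty)}
        self.dirty.clear()
        return snapshot


mem = PagedMemory(num_pages=4)
mem.write(1, b"x")
first = mem.incremental_checkpoint()   # contains page 1 only
mem.write(2, b"y")
second = mem.incremental_checkpoint()  # contains page 2 only
```

The second checkpoint holds a single page even though two pages have been written overall, which is where the reduction in checkpoint size comes from.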
Many other efforts target an efficient
implementation of the checkpointing mechanism at the
system level (Gioiosa, R. and Sancho, J.C. and Jiang,
S. and Petrini, Fabrizio, 2005) or at the user level
(Sancho, J.C. and Petrini, Fabrizio and Johnson, G.
and Frachtenberg, Eitan, 2004). (Sancho, J.C. and
Petrini, Fabrizio and Johnson, G. and Frachtenberg,
Eitan, 2004) characterize the user-level approach as
one in which checkpointing is performed explicitly by
external applications, and propose an approach for
determining the optimal checkpoint frequency.
(Gioiosa, R. and Sancho, J.C. and Jiang, S. and
Petrini, Fabrizio, 2005) define the system-level
approach as a generally applicable one in which an
application is unaware of whether it is being
checkpointed. Gioiosa et al. also propose a
methodology called buffered coscheduling, which is
implemented at kernel level and hence has
unrestricted access to processor registers and file
descriptors. They further describe several
checkpointing formulations, such as internal
checkpointing, which uses the UNIX/Linux signal
mechanism.
The idea of using replicated state machines to
model a distributed checkpointing approach has
already been stated by (Bolosky, William and Bradshaw,
Dexter and Haagens, Randolph and Kusters, Norbert
and Microsoft, Peng, 2011) and (Fred B. Schneider,
1990): replicated state machines can be made
fault-tolerant by running them on multiple computers
and feeding them the same inputs.
3 DISTRIBUTED CHECKPOINTING
The distributed checkpointing approach for
replicated state machines builds on the idea that each
replica saves the state of the execution history for a
predesignated period of time. This way, each replica
locally stores one or more portions of the execution
history, which a freshly booting replica can later
retrieve.
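As a rough illustration of this idea, the sketch below lets each replica hold some segments of a shared history and lets a booting replica stitch them together. The `Replica` class, the `(start_index, entries)` segment layout, and the `recover` function are assumptions made for this example only. The sketch keeps everything in memory and ignores networking and persistence.

```python
from typing import List, Tuple

# Hypothetical layout: (index of the first entry in the full history, entries)
Segment = Tuple[int, List[str]]


class Replica:
    """A replica that locally stores one or more portions of the history."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.segments: List[Segment] = []

    def store(self, start: int, entries: List[str]) -> None:
        self.segments.append((start, list(entries)))


def recover(replicas: List[Replica]) -> List[str]:
    """Collect the segments of all replicas and stitch them into one history."""
    segments = sorted(seg for r in replicas for seg in r.segments)
    history: List[str] = []
    for start, entries in segments:
        if start > len(history):
            raise ValueError(f"gap in history before index {start}")
        # Append only the entries not already covered by earlier segments.
        history.extend(entries[len(history) - start:])
    return history


r1, r2, r3 = Replica("r1"), Replica("r2"), Replica("r3")
r1.store(0, ["a0", "a1"])
r2.store(2, ["a2"])
r3.store(1, ["a1", "a2", "a3"])
recovered = recover([r1, r2, r3])  # ["a0", "a1", "a2", "a3"]
```

Overlapping segments are tolerated, while a missing portion raises an error so that a booting replica never continues from an incomplete history.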
Let us assume that each state machine consists of
states denoted $s_i$ and actions that trigger
transitions between states, such as
$s_i \xrightarrow{a_k} s_j$, where the transition from
state $i$ to state $j$ is triggered by action $k$.
These definitions result in the basic state machine
model in Equation 1, which is used throughout this
paper. In this definition, $\delta$ defines the
transitions as a function from state-action pairs to
states.
\[
\begin{aligned}
M &= \{S, A\} \\
S &= \{s_0, s_1, \ldots\} \\
A &= \{a_0, a_1, \ldots\} \\
\delta &: S \times A \to S
\end{aligned}
\tag{1}
\]
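A minimal sketch of this model in Python, assuming states and actions are plain strings and $\delta$ is a dictionary from state-action pairs to states. The `StateMachine` class and the three-state example are hypothetical and serve only as illustration.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Tuple

State = str
Action = str


@dataclass(frozen=True)
class StateMachine:
    states: FrozenSet[State]                   # S
    actions: FrozenSet[Action]                 # A
    delta: Dict[Tuple[State, Action], State]   # transition function on S x A

    def step(self, state: State, action: Action) -> State:
        # Raises KeyError if no transition is defined for this pair.
        return self.delta[(state, action)]


# Hypothetical example machine with three states and two actions.
machine = StateMachine(
    states=frozenset({"s0", "s1", "s2"}),
    actions=frozenset({"a0", "a1"}),
    delta={
        ("s0", "a0"): "s1",
        ("s1", "a1"): "s2",
        ("s2", "a0"): "s0",
    },
)
next_state = machine.step("s0", "a0")  # "s1"
```

Storing the transition function as a dictionary makes it partial: pairs without an entry simply have no transition.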
Based on the definition in Equation 1, we can
exemplify the execution history of a state machine as
in Equation 2, where the history begins with a state
and continues with action-state pairs in which each
state is reachable through the preceding action
according to the state machine definition. This form
can also represent a portion of the history, i.e. only
part of the full execution of the state machine.
However, if the history begins with the initial state
(e.g. $s_0$), then it represents the full execution
history up to the final state of the sequence.
\[
H = s_i, a_j, s_k, a_p, s_r, \ldots
\tag{2}
\]
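The structure of such a history can be checked mechanically. The sketch below assumes a history is a flat list that alternates states and actions, and that the transition function is given as a dictionary. The function names and the example transitions are hypothetical.

```python
from typing import Dict, List, Tuple

Delta = Dict[Tuple[str, str], str]


def is_valid_history(delta: Delta, history: List[str]) -> bool:
    """Check that the history is a state followed by action-state pairs
    and that every pair agrees with the transition function."""
    if len(history) % 2 == 0:  # must be: state, (action, state)*
        return False
    for i in range(0, len(history) - 2, 2):
        state, action, nxt = history[i], history[i + 1], history[i + 2]
        if delta.get((state, action)) != nxt:
            return False
    return True


def is_full_history(history: List[str], initial: str) -> bool:
    """A history is full if it starts at the initial state."""
    return bool(history) and history[0] == initial


# Hypothetical transition function used only for this example.
delta: Delta = {("s0", "a0"): "s1", ("s1", "a1"): "s2", ("s2", "a0"): "s0"}

portion = ["s1", "a1", "s2", "a0", "s0"]  # valid, but only a portion
full = ["s0", "a0", "s1", "a1", "s2"]     # valid and starts at s0
broken = ["s0", "a1", "s2"]               # no transition (s0, a1)
```

Here `portion` and `full` both pass `is_valid_history`, but only `full` passes `is_full_history` with `"s0"` as the initial state.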
CLOSER 2020 - 10th International Conference on Cloud Computing and Services Science