may only run on a particular subset of resources, it
would be beneficial to store checkpoints on or close
to resources capable of resuming its execution.
Replication. Replication of jobs in an HTC system is
generally dismissed due to increased overheads and
reduced system throughput. While this holds true for
heavily utilised HTC clusters, there is a case for
energy-conscious replication of jobs. The Newcastle
University HTC cluster features significant spare
capacity, so the replication of certain jobs need not
impact the makespan of other jobs. If replicas were to
run alongside interactive users, the additional energy
cost associated with the HTC workload would also be
minimal, since those computers are already powered on.
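
The energy argument for replicating alongside interactive
users can be illustrated with a minimal sketch. It assumes
a simple model in which the energy attributable to a
replica is the power drawn above the state the computer
would otherwise be in: idle if a user is logged in, asleep
if not. The function name and all power figures are
hypothetical and chosen only for illustration; they are not
measurements from the Newcastle cluster.

```python
def marginal_energy_kwh(runtime_h, active_w, idle_w, sleep_w, user_present):
    """Energy attributable to one replica, in kWh (illustrative model).

    Assumption: without the replica the computer would sit idle if an
    interactive user is present, and would be asleep otherwise.
    """
    baseline_w = idle_w if user_present else sleep_w
    return (active_w - baseline_w) * runtime_h / 1000.0


# Hypothetical figures: 100 W active, 60 W idle, 2 W asleep, 2 h replica.
alongside_user = marginal_energy_kwh(2, 100, 60, 2, user_present=True)
woken_machine = marginal_energy_kwh(2, 100, 60, 2, user_present=False)
print(alongside_user, woken_machine)
```

Under these assumed figures, the replica placed alongside a
user is charged only for the gap between active and idle
power, whereas a replica that keeps an otherwise sleeping
computer awake is charged for almost its full power draw.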
Vacation. To minimise the impact of HTC workloads on
interactive users of computers, HTC clusters are
configured to ensure an interrupted job vacates a
resource quickly. Many clusters, including the one at
Newcastle, are configured to vacate HTC tasks
immediately without checkpointing, leading to wasted
execution. A more beneficial approach would be to allow
a checkpoint at the time of vacation, but to limit the
impact on users with a timeout interval after which the
checkpoint operation is abandoned.
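
The checkpoint-on-vacation policy with a timeout can be
sketched as a small model of a single vacation event. This
is an illustration under stated assumptions rather than an
implementation from any particular scheduler: the names
`VacationOutcome` and `vacate` are hypothetical, a
checkpoint is assumed to take a known duration, and an
abandoned checkpoint is assumed to save nothing.

```python
from dataclasses import dataclass


@dataclass
class VacationOutcome:
    checkpoint_saved: bool  # True if the checkpoint finished within the timeout
    user_delay_s: float     # time the interactive user waits for the resource
    work_lost_s: float      # execution that must be repeated on resumption


def vacate(work_since_checkpoint_s, checkpoint_duration_s, timeout_s):
    """Model one vacation under a checkpoint-with-timeout policy."""
    if checkpoint_duration_s <= timeout_s:
        # Checkpoint completes: the user waits for it, but no work is lost.
        return VacationOutcome(True, checkpoint_duration_s, 0.0)
    # Checkpoint abandoned at the timeout: the user waits for the full
    # timeout and all work since the last checkpoint is lost.
    return VacationOutcome(False, timeout_s, work_since_checkpoint_s)


print(vacate(600, 20, 30))  # checkpoint fits within the timeout
print(vacate(600, 45, 30))  # checkpoint abandoned after the timeout
print(vacate(600, 20, 0))   # timeout of zero: immediate vacation
```

Setting the timeout to zero models immediate vacation
without checkpointing, while larger values trade a bounded
delay for the user against the execution that would
otherwise be wasted.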
5 CONCLUSION
In this paper we have shown that existing checkpointing
mechanisms are inadequate for reducing makespan while
maintaining acceptable levels of energy consumption in
multi-use clusters with interactive user interruptions.
Our preliminary experimentation shows that the naive
application of checkpointing approaches can negatively
impact energy consumption, but that small changes to
make these strategies energy- and load-aware may lead to
significant benefits. We highlight key considerations
when adopting checkpointing in an HTC cluster and
motivate a number of areas of future research interest
in energy-efficient checkpointing. A detailed evaluation
of new energy-aware checkpointing strategies will form
the basis of our ongoing research.
On Energy-efficient Checkpointing in High-throughput Cycle-stealing Distributed Systems