On Energy-efficient Checkpointing in High-throughput Cycle-stealing Distributed Systems

Matthew Forshaw, A. Stephen McGough, Nigel Thomas

Abstract

Checkpointing is a fault-tolerance mechanism commonly used in High Throughput Computing (HTC) environments to allow the execution of long-running computational tasks on compute resources subject to hardware and software failures and interruptions from resource owners. With increasing scrutiny of the energy consumption of IT infrastructures, it is important to understand the impact of checkpointing on the energy consumption of HTC environments. In this paper we demonstrate through trace-driven simulation on real-world datasets that existing checkpointing strategies are inadequate at maintaining an acceptable level of energy consumption whilst reducing the makespan of tasks. Furthermore, we identify factors important in deciding whether to employ checkpointing within an HTC environment, and propose novel strategies to curtail the energy consumption of checkpointing approaches.

Paper Citation


in Harvard Style

Forshaw M., McGough A. and Thomas N. (2014). On Energy-efficient Checkpointing in High-throughput Cycle-stealing Distributed Systems. In Proceedings of the 3rd International Conference on Smart Grids and Green IT Systems - Volume 1: SMARTGREENS, ISBN 978-989-758-025-3, pages 262-267. DOI: 10.5220/0004958302620267


in Bibtex Style

@conference{smartgreens14,
author={Matthew Forshaw and A. Stephen McGough and Nigel Thomas},
title={On Energy-efficient Checkpointing in High-throughput Cycle-stealing Distributed Systems},
booktitle={Proceedings of the 3rd International Conference on Smart Grids and Green IT Systems - Volume 1: SMARTGREENS},
year={2014},
pages={262-267},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004958302620267},
isbn={978-989-758-025-3},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 3rd International Conference on Smart Grids and Green IT Systems - Volume 1: SMARTGREENS
TI - On Energy-efficient Checkpointing in High-throughput Cycle-stealing Distributed Systems
SN - 978-989-758-025-3
AU - Forshaw M.
AU - McGough A.
AU - Thomas N.
PY - 2014
SP - 262
EP - 267
DO - 10.5220/0004958302620267