Monitoring Large Cloud-Based Systems

Mauro Andreolini, Marcello Pietri, Stefania Tosi, Andrea Balboni


Large-scale cloud-based services are built upon a multitude of hardware and software resources, distributed across one or more data centers. Controlling and managing these resources requires integrating several pieces of software that together yield a representative view of the data center status. Today’s monitoring solutions, both closed and open source, fall short in different ways, including lack of scalability, poor representation of global state conditions, inability to guarantee persistence in service delivery, and the impossibility of monitoring multi-tenant applications. In this paper, we present a novel monitoring architecture that addresses these issues. It integrates a hierarchical scheme for monitoring the resources in a cluster with a distributed hash table (DHT) that broadcasts system state information among different monitors. The architecture aims at high scalability, effectiveness, and resilience, and supports monitoring services that span different clusters or even different data centers of the cloud provider. We evaluate the scalability of the proposed architecture through a bottleneck analysis based on experimental results.



Paper Citation

in Harvard Style

Andreolini M., Pietri M., Tosi S. and Balboni A. (2014). Monitoring Large Cloud-Based Systems. In Proceedings of the 4th International Conference on Cloud Computing and Services Science - Volume 1: CLOSER, ISBN 978-989-758-019-2, pages 341-351. DOI: 10.5220/0004794003410351
