Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids. It is based on a hierarchical design targeted at federations of clusters. It leverages widely used technologies such as XML for data representation, XDR for compact, portable data transport, and RRDtool for data storage and visualization. It uses carefully engineered data structures and algorithms to achieve very low per-node overheads and high concurrency. Because it is used on thousands of clusters around the world, it has been ported to an extensive set of operating systems and processor architectures.
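To make the transport choice concrete, the sketch below encodes a single (name, value) metric pair in the XDR style described above: strings are length-prefixed and zero-padded to a 4-byte boundary, and numbers are big-endian. This is an illustrative encoding following the XDR rules, not Ganglia's actual wire format; the function names are hypothetical.

```python
import struct

def xdr_string(s: str) -> bytes:
    """Encode a string XDR-style: a 4-byte big-endian length, then the
    bytes, zero-padded to a multiple of 4 bytes."""
    data = s.encode("ascii")
    pad = (4 - len(data) % 4) % 4
    return struct.pack(">I", len(data)) + data + b"\x00" * pad

def encode_metric(name: str, value: float) -> bytes:
    """Pack a metric as an XDR string followed by an XDR double
    (big-endian IEEE 754). Hypothetical layout for illustration."""
    return xdr_string(name) + struct.pack(">d", value)

# A compact, self-describing datagram payload for one metric sample.
msg = encode_metric("load_one", 0.25)
```

Because every field is padded to 4 bytes and uses network byte order, the same payload decodes identically on any CPU architecture, which is what makes XDR attractive for a system ported as widely as Ganglia.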
Ganglia relies on a multicast-based listen/announce protocol to monitor state within clusters, and it uses a tree of point-to-point connections amongst representative cluster nodes to federate clusters and aggregate their state. This paper presents the design, implementation, and evaluation of Ganglia, along with experience gained through real-world deployments on systems of widely varying scale, configuration, and application domain over the last few years.
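The listen/announce pattern can be sketched with ordinary UDP multicast sockets: each node both sends its state to a multicast group and joins that group to hear every other node's announcements, so every node can build a view of the whole cluster without any central server. The group address and port below are illustrative defaults, not a specification of Ganglia's configuration.

```python
import socket
import struct

MCAST_GRP = "239.2.11.71"   # illustrative multicast group address
MCAST_PORT = 8649           # illustrative port

def make_announcer() -> socket.socket:
    """UDP socket for periodically multicasting this node's local state.
    TTL of 1 keeps announcements within the local network segment."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
    return sock

def make_listener() -> socket.socket:
    """UDP socket that joins the multicast group, so this node receives
    every peer's announcement and can maintain a soft-state view of the
    entire cluster."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", MCAST_PORT))
    mreq = struct.pack("4sl", socket.inet_aton(MCAST_GRP), socket.INADDR_ANY)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    return sock

# Usage: an announcer would call
#   make_announcer().sendto(payload, (MCAST_GRP, MCAST_PORT))
# while every node runs a listener loop over make_listener().recvfrom(...).
```

Symmetric multicast within a cluster handles intra-cluster monitoring; the point-to-point federation tree is a separate layer, with a representative node per cluster polling its peers and forwarding aggregated state upward.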
The design challenges for distributed monitoring systems include:
The system should scale gracefully with the number of nodes. Clusters today commonly consist of hundreds or even thousands of nodes, and Grid computing efforts such as TeraGrid will eventually push these numbers even further.
The system should be robust to node and network failures of various types. As systems scale in the number of nodes, failures become both inevitable and commonplace. The system should localize such failures so that it continues to operate and deliver useful service in the presence of failures.
The system should be extensible in the types of data that are monitored and in the manner in which such data are collected. It is impossible to know a priori everything that will need to be monitored, so the system should allow new data to be collected and monitored in a convenient fashion.
The system should incur management overheads that scale slowly with the number of nodes. For example, managing the system should not require a linear increase in system administrator time as the number of nodes increases. Manual configuration should also be avoided as much as possible.
The system should be portable to a variety of operating systems and CPU architectures. Despite the recent trend toward Linux on x86, there are still wide variations in the hardware and software used for high-performance computing. Systems such as Globus further facilitate the use of such heterogeneous systems.
The system should incur low per-node overheads for all scarce computational resources, including CPU, memory, I/O, and network bandwidth. This is particularly important for high-performance systems, where applications often have enormous resource demands.
Ganglia has proven to be a solid monitoring platform that supports a wide range of operating systems and processor architectures, and it is used by large clusters around the world, especially in university settings.