This article needs additional citations for verification .(April 2014) |
Stable release | 3.7.2 [1] / 14 June 2016 |
---|---|
Written in | C, Perl, PHP, Python |
Operating system | Cross-platform |
Available in | English |
Type | Distributed monitoring |
License | BSD license |
Website | www |
Ganglia is a scalable, distributed monitoring tool for high-performance computing systems, clusters and networks. The software is used to view either live or recorded statistics covering metrics such as CPU load averages or network utilization for many nodes.
Ganglia software is bundled with enterprise-level Linux distributions such as Red Hat Enterprise Linux (RHEL) or the CentOS repackaging of the same. Ganglia grew out of requirements for monitoring systems by Berkeley (University of California) but now sees use by commercial and educational organisations such as Cray, MIT, NASA and Twitter.
It is based on a hierarchical design targeted at federations of clusters. It relies on a multicast-based listen/announce protocol to monitor state within clusters and uses a tree of point-to-point connections amongst representative cluster nodes to federate clusters and aggregate their state. It leverages widely used technologies such as XML for data representation, XDR for compact, portable data transport, and RRDtool for data storage and visualization. It uses carefully engineered data structures and algorithms to achieve very low per-node overheads and high concurrency. The implementation is robust, has been ported to an extensive set of operating systems and processor architectures, and is currently in use on over 500 clusters around the world. It has been used to link clusters across university campuses and around the world and can scale to handle clusters with 2000 nodes. [2]
The ganglia system comprises two unique daemons, a PHP-based web front-end, and a few other small utility programs.
Gmond is a multi-threaded daemon which runs on each cluster node you want to monitor. Installation does not require having a common NFS filesystem or a database back-end, installing special accounts or maintaining configuration files.
Gmond has four main responsibilities:
Each gmond transmits information in two different ways:
Federation in Ganglia is achieved using a tree of point-to-point connections amongst representative cluster nodes to aggregate the state of multiple clusters. At each node in the tree, a Ganglia Meta Daemon (gmetad) periodically polls a collection of child data sources, parses the collected XML, saves all numeric, volatile metrics to round-robin databases and exports the aggregated XML over a TCP socket to clients. Data sources may be either gmond daemons, representing specific clusters, or other gmetad daemons, representing sets of clusters. Data sources use source IP addresses for access control and can be specified using multiple IP addresses for failover. The latter capability is natural for aggregating data from clusters since each gmond daemon contains the entire state of its cluster.
The Ganglia web front-end provides a view of the gathered information via real-time dynamic web pages. Most importantly, it displays Ganglia data in a meaningful way for system administrators and computer users. Although the web front-end to ganglia started as a simple HTML view of the XML tree, it has evolved into a system that keeps a colorful history of all collected data.
The Ganglia web front-end caters to system administrators and users. For example, one can view the CPU utilization over the past hour, day, week, month, or year. The web front-end shows similar graphs for memory usage, disk usage, network statistics, number of running processes, and all other Ganglia metrics.
The web front-end depends on the existence of the gmetad which provides it with data from several Ganglia sources. Specifically, the web front-end will open the local port 8651 (by default) and expects to receive a Ganglia XML tree. The web pages themselves are highly dynamic; any change to the Ganglia data appears immediately on the site. This behavior leads to a very responsive site, but requires that the full XML tree be parsed on every page access. Therefore, the Ganglia web front-end should run on a fairly powerful, dedicated machine if it presents a large amount of data.
The Ganglia web front-end is written in PHP, and uses graphs generated by gmetad to display history information. It has been tested on many flavours of Unix (primarily Linux) with the Apache webserver and the PHP5 module.
In computer networking, multicast is a type of group communication where data transmission is addressed to a group of destination computers simultaneously. Multicast can be one-to-many or many-to-many distribution. Multicast differs from physical layer point-to-multipoint communication.
Anycast is a network addressing and routing methodology in which a single IP address is shared by devices in multiple locations. Routers direct packets addressed to this destination to the location nearest the sender, using their normal decision-making algorithms, typically the lowest number of BGP network hops. Anycast routing is widely used by content delivery networks such as web and name servers, to bring their content closer to end users.
Nagios is an event monitoring system that offers monitoring and alerting services for servers, switches, applications and services. It alerts users when things go wrong and alerts them a second time when the problem has been resolved.
In computer science and networking in particular, a session is a time-delimited two-way link, a practical layer in the TCP/IP protocol enabling interactive expression and information exchange between two or more communication devices or ends – be they computers, automated systems, or live active users. A session is established at a certain point in time, and then ‘torn down’ - brought to an end - at some later point. An established communication session may involve more than one message in each direction. A session is typically stateful, meaning that at least one of the communicating parties needs to hold current state information and save information about the session history to be able to communicate, as opposed to stateless communication, where the communication consists of independent requests with responses.
Protocol-Independent Multicast (PIM) is a family of multicast routing protocols for Internet Protocol (IP) networks that provide one-to-many and many-to-many distribution of data over a LAN, WAN or the Internet. It is termed protocol-independent because PIM does not include its own topology discovery mechanism, but instead uses routing information supplied by other routing protocols. PIM is not dependent on a specific unicast routing protocol; it can make use of any unicast routing protocol in use on the network. PIM does not build its own routing tables. PIM uses the unicast routing table for reverse-path forwarding.
An overlay network is a computer network that is layered on top of another network. The concept of overlay networking is distinct from the traditional model of OSI layered networks, and almost always assumes that the underlay network is an IP network of some kind.
IP multicast is a method of sending Internet Protocol (IP) datagrams to a group of interested receivers in a single transmission. It is the IP-specific form of multicast and is used for streaming media and other network applications. It uses specially reserved multicast address blocks in IPv4 and IPv6.
In IBM System z9 and successor mainframes, the System z Integrated Information Processor (zIIP) is a special purpose processor. It was initially introduced to relieve the general mainframe central processors (CPs) of specific Db2 processing loads, but currently is used to offload other z/OS workloads as described below. The idea originated with previous special purpose processors, the zAAP, which offloads Java processing, and the IFL, which runs Linux and z/VM but not other IBM operating systems such as z/OS, DOS/VSE and TPF. A System z PU is "characterized" as one of these processor types, or as a CP, or SAP. These processors do not contain microcode or hardware features that accelerate their designated workloads. Instead, by relieving the general CP of particular workloads, they often lead to a higher workload throughput at reduced license fees.
Transparent Inter Process Communication (TIPC) is an inter-process communication (IPC) service in Linux designed for cluster-wide operation. It is sometimes presented as Cluster Domain Sockets, in contrast to the well-known Unix Domain Socket service; the latter working only on a single kernel.
Apache Hadoop is a collection of open-source software utilities for reliable, scalable, distributed computing. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model. Hadoop was originally designed for computer clusters built from commodity hardware, which is still the common use. It has since also found use on clusters of higher-end hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common occurrences and should be automatically handled by the framework.
IBM App Connect Enterprise (abbreviated as IBM ACE, formerly known as IBM Integration Bus, WebSphere Message Broker, WebSphere Business Integration Message Broker, WebSphere MQSeries Integrator and started life as MQSeries Systems Integrator. App Connect IBM's integration software offering, allowing business information to flow between disparate applications across multiple hardware and software platforms. Rules can be applied to the data flowing through user-authored integrations to route and transform the information. The product can be used as an Enterprise Service Bus supplying a communication channel between applications and services in a service-oriented architecture. App Connect from V11 supports container native deployments with highly optimised container start-up times.
JGroups is a library for reliable one-to-one or one-to-many communication written in the Java language.
The Red Hat Cluster Suite (RHCS) includes software to create a high availability and load balancing cluster. Both can be used on the same system although this use case is unlikely. Both products, the High Availability Add-On and Load Balancer Add-On, are based on open-source community projects. Red Hat Cluster developers contribute code upstream for the community. Computational clustering is not part of cluster suite, but instead provided by Red Hat MRG.
oVirt is a free, open-source virtualization management platform. It was founded by Red Hat as a community project on which Red Hat Virtualization is based. It allows centralized management of virtual machines, compute, storage and networking resources, from an easy-to-use web-based front-end with platform independent access. KVM on x86-64, PowerPC64 and s390x architecture are the only hypervisors supported, but there is an ongoing effort to support ARM architecture in a future releases.
Data center bridging (DCB) is a set of enhancements to the Ethernet local area network communication protocol for use in data center environments, in particular for use with clustering and storage area networks.
collectd is a Unix daemon that collects, transfers and stores performance data of computers and network equipment. The acquired data is meant to help system administrators maintain an overview over available resources to detect existing or looming bottlenecks.
IEEE 802.1aq is an amendment to the IEEE 802.1Q networking standard which adds support for Shortest Path Bridging (SPB). This technology is intended to simplify the creation and configuration of Ethernet networks while enabling multipath routing.
The Slurm Workload Manager, formerly known as Simple Linux Utility for Resource Management (SLURM), or simply Slurm, is a free and open-source job scheduler for Linux and Unix-like kernels, used by many of the world's supercomputers and computer clusters.
Hierarchical Cluster Engine (HCE) is a FOSS complex solution for: construct custom network mesh or distributed network cluster structure with several relations types between nodes, formalize the data flow processing goes from upper node level central source point to down nodes and backward, formalize the management requests handling from multiple source points, support native reducing of multiple nodes results, internally support powerful full-text search engine and data storage, provide transactions-less and transactional requests processing, support flexible run-time changes of cluster infrastructure, have many languages bindings for client-side integration APIs in one product build on C++ language.
Network Device Interface (NDI) is a software specification developed by the technology company NewTek. It enables high-definition video to be transmitted, received, and communicated over a computer network with low latency and high quality. This royalty-free specification supports frame-accurate switching, making it suitable for live video production environments.