Prometheus (software)

Prometheus
Initial release 24 November 2012
Stable release v3.0.1 [1] / 28 November 2024
Repository github.com/prometheus/prometheus
Written in Go
Operating system Cross-platform
Type Time series database
License Apache License 2.0
Website prometheus.io

Prometheus is a free software application used for event monitoring and alerting. [2] It records metrics in a time series database (allowing for high dimensionality) built using an HTTP pull model, with flexible queries and real-time alerting. [3] [4] The project is written in Go and licensed under the Apache 2 License, with source code available on GitHub, [5] and is a graduated project of the Cloud Native Computing Foundation, along with Kubernetes and Envoy. [6]

History

Prometheus was developed at SoundCloud starting in 2012, [7] when the company discovered that its existing metrics and monitoring tools (using StatsD and Graphite) were insufficient for its needs. Specifically, the company identified needs that Prometheus was built to meet, including a multi-dimensional data model, operational simplicity, scalable data collection, and a powerful query language, all in a single tool. [8] The project was open source from the beginning and began to be used by Boxever and Docker users as well, despite not being explicitly announced. [8] [9] Prometheus was inspired by Borgmon, the monitoring tool used at Google. [10] [11]

By 2013, Prometheus was introduced for production monitoring at SoundCloud. [8] The official public announcement was made in January 2015. [8]

In May 2016, the Cloud Native Computing Foundation accepted Prometheus as its second incubated project, after Kubernetes. The blog post announcing this stated that the tool was in use at many companies including DigitalOcean, Ericsson, CoreOS, Weaveworks, Red Hat, and Google. [12]

Prometheus 1.0 was released in July 2016. [13] Subsequent versions were released through 2016 and 2017, leading to Prometheus 2.0 in November 2017. [14]

In August 2018, the Cloud Native Computing Foundation announced that the Prometheus project had graduated. [6]

A variety of conferences focused on Prometheus have been held.

Architecture

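A typical monitoring platform with Prometheus is composed of multiple tools: exporters that run on the monitored hosts and expose local metrics, the Prometheus server that collects and stores the metrics, Alertmanager [15] to handle alerts, and Grafana to produce dashboards. [citation needed]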

Data storage format

Prometheus data is stored in the form of metrics, with each metric having a name that is used for referencing and querying it. Each metric can be drilled down by an arbitrary number of key=value pairs (labels). Labels can include information on the data source (which server the data is coming from) and other application-specific breakdown information such as the HTTP status code (for metrics related to HTTP responses), query method (GET versus POST), endpoint, etc. The ability to specify an arbitrary list of labels and to query based on these in real time is why Prometheus' data model is called multi-dimensional. [16] [8] [9]
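The multi-dimensional model can be illustrated with a minimal Python sketch that treats each time series as a metric name plus a set of labels and selects series by label value. The metric name http_requests_total, the label names, and the helper select are hypothetical and chosen only for illustration; this is not how Prometheus represents data internally.

```python
from typing import Dict, List, Tuple

# A series is identified by its metric name plus its full label set.
Series = Tuple[str, Dict[str, str]]

# Hypothetical series, for illustration only.
SERIES: List[Series] = [
    ("http_requests_total", {"instance": "web-1", "method": "GET", "status": "200"}),
    ("http_requests_total", {"instance": "web-1", "method": "POST", "status": "500"}),
    ("http_requests_total", {"instance": "web-2", "method": "GET", "status": "200"}),
]


def select(series: List[Series], metric: str, **matchers: str) -> List[Series]:
    """Return series with the given metric name whose labels equal every matcher."""
    return [
        (name, labels)
        for name, labels in series
        if name == metric and all(labels.get(k) == v for k, v in matchers.items())
    ]


# Any label can act as a query dimension without declaring it in advance.
get_requests = select(SERIES, "http_requests_total", method="GET")
failed_on_web1 = select(SERIES, "http_requests_total", instance="web-1", status="500")
print(len(get_requests), len(failed_on_web1))
```

Because every label is part of the series identity, adding a new label value creates a new series rather than changing an existing one.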

Prometheus stores data locally on disk, which allows fast storage and fast querying. [8] Metrics can also be written to remote storage. [17]

Data collection

Prometheus collects data in the form of time series. The time series are built through a pull model: the Prometheus server queries a list of data sources (sometimes called exporters) at a specific polling frequency. Each of the data sources serves the current values of the metrics for that data source at the endpoint queried by Prometheus. The Prometheus server then aggregates data across the data sources. [8] Prometheus has a number of mechanisms to automatically discover resources that should be used as data sources. [18]
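The pull model can be sketched in Python using only the standard library: a target serves its current metric values over HTTP, and a scraper fetches and parses them. The metric name demo_requests_total, the handler class, and the loopback address are assumptions made for this illustration; a real Prometheus server would repeat the scrape at every polling interval and attach a timestamp to each sample.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical metric value held by the target.
REQUESTS_SERVED = 7


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the current metric values at /metrics, as a scrape target would."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = f"demo_requests_total {REQUESTS_SERVED}\n".encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet


def scrape(url: str) -> dict:
    """Fetch a target once and parse 'name value' lines into a dict."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        text = resp.read().decode()
    samples = {}
    for line in text.splitlines():
        if line and not line.startswith("#"):
            name, value = line.rsplit(" ", 1)
            samples[name] = float(value)
    return samples


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

result = scrape(f"http://127.0.0.1:{server.server_port}/metrics")
server.shutdown()
server.server_close()
print(result)
```

The target never pushes anything: it only answers when asked, so the scraper controls the sampling frequency.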

PromQL

Prometheus provides its own query language, PromQL (Prometheus Query Language), which lets users select and aggregate data. PromQL is designed specifically for time series data and therefore provides time-related query functionality. Examples include the rate() function, the instant vector, and the range vector, which can provide many samples for each queried time series. [19] Prometheus has four clearly defined metric types around which the PromQL components revolve: counter, gauge, histogram, and summary. [20]
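The idea behind rate() applied to a range vector can be sketched as follows. This is a deliberately simplified illustration, not the PromQL implementation: it divides the counter increase by the time between the first and last sample and treats any drop in value as a counter reset, whereas the real function also extrapolates to the edges of the query window. The name simple_rate and the sample values are hypothetical.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def simple_rate(samples: List[Sample]) -> float:
    """Per-second increase across the samples, treating any drop as a counter reset."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # After a reset the counter restarts from zero, so the new value is the increase.
        increase += cur - prev if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# A range vector for one series: several samples, with a reset before the third one.
range_vector = [(0.0, 100.0), (15.0, 130.0), (30.0, 10.0), (45.0, 40.0)]
print(simple_rate(range_vector))
```

An instant vector, by contrast, would hold only the most recent sample of each series, which is why rate() needs a range vector as input.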

Example code

    # A metric with label filtering
    go_gc_duration_seconds{instance="localhost:9090", job="alertmanager"}

    # Aggregation operators
    sum by (app, proc) (instance_memory_limit_bytes - instance_memory_usage_bytes) / 1024 / 1024

[21]
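What the aggregation in the example computes can be sketched in Python: samples are grouped by the app and proc labels, all other labels are dropped, and the values in each group are summed before being converted from bytes to MiB. The label values and byte counts below are hypothetical, and sum_by is an illustrative helper rather than part of any Prometheus library.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Each sample: (labels, free memory in bytes). Values are hypothetical.
Sample = Tuple[Dict[str, str], float]

free_bytes: List[Sample] = [
    ({"app": "api", "proc": "web", "instance": "a"}, 512 * 1024 * 1024),
    ({"app": "api", "proc": "web", "instance": "b"}, 256 * 1024 * 1024),
    ({"app": "api", "proc": "worker", "instance": "a"}, 128 * 1024 * 1024),
]


def sum_by(samples: List[Sample], *keys: str) -> Dict[Tuple[str, ...], float]:
    """Group samples by the given label names and sum the values in each group."""
    totals: Dict[Tuple[str, ...], float] = defaultdict(float)
    for labels, value in samples:
        totals[tuple(labels.get(k, "") for k in keys)] += value
    return dict(totals)


# Equivalent in spirit to: sum by (app, proc) (...) / 1024 / 1024
free_mib = {
    group: total / 1024 / 1024
    for group, total in sum_by(free_bytes, "app", "proc").items()
}
print(free_mib)
```

The instance label disappears from the result because it is not listed in the grouping, so the two web instances collapse into a single output series.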

Alerts and monitoring

Alert rules are configured in Prometheus; each rule specifies a condition that must hold for a given duration before the alert triggers. When alerts trigger, they are forwarded to the Alertmanager service. Alertmanager can include logic to silence alerts and to forward them to email, Slack, or notification services such as PagerDuty. [22] Other messaging systems, such as Microsoft Teams, [23] can be integrated through the Alertmanager webhook receiver, a mechanism for external integrations. [24] The Prometheus Alerts app can also receive alerts directly on Android devices without any target configuration in Alertmanager. [25]
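The requirement that a condition hold for a specific duration can be sketched as a small state machine. The class AlertRule, the parameter hold_seconds, and the evaluation times are hypothetical; the state names inactive, pending, and firing follow Prometheus terminology, but the code is only an illustration of the idea and not the server's rule evaluator.

```python
from typing import Optional


class AlertRule:
    """Fires only after the condition has held continuously for hold_seconds."""

    def __init__(self, hold_seconds: float) -> None:
        self.hold_seconds = hold_seconds
        self.active_since: Optional[float] = None

    def evaluate(self, condition_true: bool, now: float) -> str:
        if not condition_true:
            self.active_since = None  # any break in the condition resets the timer
            return "inactive"
        if self.active_since is None:
            self.active_since = now
        if now - self.active_since >= self.hold_seconds:
            return "firing"
        return "pending"


rule = AlertRule(hold_seconds=300)
evaluations = [(0, True), (60, True), (120, False), (180, True), (480, True)]
states = [rule.evaluate(cond, t) for t, cond in evaluations]
print(states)
```

The brief recovery at 120 seconds restarts the timer, so the alert fires only once the condition has held from 180 to 480 seconds. Requiring a hold duration in this way keeps short spikes from generating notifications.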

Dashboards

Prometheus is not intended as a full-fledged dashboarding tool. Although it can graph specific queries, it needs to be connected to Grafana to generate dashboards; this has been cited as a disadvantage due to the additional setup complexity. [26]

Interoperability

Prometheus favors white-box monitoring. Applications are encouraged to publish (export) internal metrics to be collected periodically by Prometheus. [27] Some exporters and agents for various applications are available to provide metrics. [28] Prometheus supports some monitoring and administration protocols to allow interoperability for transitioning: Graphite, StatsD, SNMP, JMX, and CollectD.

Prometheus focuses on the availability of the platform and basic operations. [29] The metrics are typically stored for a few weeks. For long-term storage, the metrics can be streamed to remote storage. [17]

Standardization into OpenMetrics

There is an effort to promote the Prometheus exposition format into a standard known as OpenMetrics. [30] Some products have adopted the format: InfluxData's TICK suite, [31] InfluxDB, Google Cloud Platform, [32] and Datadog. [33]
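The Prometheus text exposition format is line-oriented: optional comment lines describe a metric, and each sample line carries a metric name, an optional label set in braces, and a value. The following Python sketch renders a few such lines. It covers only simple samples, not the full format or the OpenMetrics specification, and the metric name, label values, and helper functions are hypothetical.

```python
from typing import Dict


def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_sample(name: str, labels: Dict[str, str], value: float) -> str:
    """Render one sample as a line of the text exposition format."""
    if labels:
        pairs = ",".join(
            f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
        )
        return f"{name}{{{pairs}}} {value}"
    return f"{name} {value}"


lines = [
    "# HELP http_requests_total Total HTTP requests handled.",
    "# TYPE http_requests_total counter",
    render_sample("http_requests_total", {"method": "GET", "status": "200"}, 1027),
    render_sample("http_requests_total", {"method": "POST", "status": "500"}, 3),
]
print("\n".join(lines))
```

A plain-text format of this kind is easy to produce from any language, which helps explain why other products could adopt it.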

Usage

Prometheus was first used in-house at SoundCloud, where it was developed, for monitoring their systems. [8] The Cloud Native Computing Foundation has a number of case studies of other companies using Prometheus. These include digital hosting service DigitalOcean, [34] digital festival DreamHack, [35] and email and contact migration service ShuttleCloud. [36] Separately, Pandora Radio has mentioned using Prometheus to monitor its data pipeline. [37]

GitLab provides a Prometheus integration guide for exporting GitLab metrics to Prometheus, [38] and the integration has been enabled by default since version 9.0. [39]

References

  1. Latest release at GitHub
  2. "Overview". prometheus.io.
  3. James Turnbull (12 June 2018). Monitoring with Prometheus. Turnbull Press. ISBN 978-0-9888202-8-9.
  4. "Prometheus: From metrics to insight. Power your metrics and alerting with a leading open-source monitoring solution". Retrieved December 26, 2018.
  5. "Prometheus". GitHub. Retrieved December 26, 2018.
  6. Evans, Kristen (August 9, 2018). "Cloud Native Computing Foundation Announces Prometheus Graduation". Retrieved December 26, 2018.
  7. Brian Brazil (9 July 2018). Prometheus: Up & Running: Infrastructure and Application Performance Monitoring. O'Reilly Media. p. 3. ISBN 978-1-4920-3409-4.
  8. Volz, Julius; Rabenstein, Björn (January 26, 2015). "Prometheus: Monitoring at SoundCloud". SoundCloud.
  9. "Monitor Docker Containers with Prometheus". 5π Consulting. January 26, 2015. Archived from the original on January 3, 2019. Retrieved December 26, 2018.
  10. Murphy, Niall; Beyer, Betsy; Jones, Chris; Petoff, Jennifer (2016). Site Reliability Engineering: How Google Runs Production Systems. O'Reilly Media. ISBN 978-1491929124. Even though Borgmon remains internal to Google, the idea of treating time-series data as a data source for generating alerts is now accessible to everyone through those open source tools like Prometheus ...
  11. Volz, Julius (4 September 2017). "PromCon 2017: Conference Recap" via YouTube. I joined SoundCloud back in 2012 coming from Google...we didn't yet have any monitoring tools that that works with this kind of dynamic environment. We were kind of missing the way Google did its monitoring for its own internal cluster scheduler and we were very inspired by that and finally decided to build our own open-source solution.
  12. "Cloud Native Computing Foundation Accepts Prometheus as Second Hosted Project". Cloud Native Computing Foundation. May 9, 2016. Retrieved December 26, 2018.
  13. "Prometheus 1.0 Is Here". Cloud Native Computing Foundation. July 18, 2016. Retrieved December 26, 2018.
  14. "New Features in Prometheus 2.0.0". Robust Perception. November 8, 2017. Retrieved December 26, 2018.
  15. "Alertmanager". GitHub. 17 May 2022.
  16. "Data model". Prometheus. Retrieved December 26, 2018.
  17. "Integrations - Prometheus". prometheus.io.
  18. "Prometheus: Collects metrics, provides alerting and graphs web UI". March 18, 2017. Retrieved December 26, 2018.
  19. "Querying Prometheus". Retrieved November 4, 2019.
  20. "Metric types". prometheus.io. Retrieved 2024-06-29.
  21. pygments/tests/examplefiles/promql/example.promql at master · pygments/pygments on GitHub
  22. Dubey, Abhishek (March 25, 2018). "AlertManager Integration with Prometheus". Retrieved December 26, 2018.
  23. Danuka, Praneeth (March 8, 2020). "Alerting for Cloud-native Applications with Prometheus". Retrieved October 18, 2020.
  24. "Integrations | Prometheus".
  25. "Prometheus alerts - Apps on Google Play".
  26. Ryckbosch, Frederick (July 28, 2017). "Prometheus monitoring: Pros and cons". Retrieved December 26, 2018.
  27. Prometheus. "Instrumentation - Prometheus". prometheus.io.
  28. "Exporters". prometheus.io.
  29. Prometheus. "Prometheus - Monitoring system & time series database". prometheus.io.
  30. "OpenMetrics". GitHub. 2018-11-13.
  31. "Telegraf from InfluxData". GitHub. 2018-12-25.
  32. "Announcing Stackdriver Kubernetes Monitoring".
  33. "DataDogHQ".
  34. Evans, Kristen (February 28, 2017). "Prometheus User Profile: How DigitalOcean Uses Prometheus". Cloud Native Computing Foundation. Retrieved December 26, 2018.
  35. Evans, Kristen (August 24, 2016). "Prometheus User Profile: Monitoring the World's Largest Digital Festival – DreamHack". Cloud Native Computing Foundation. Retrieved December 26, 2018.
  36. Evans, Kirsten (May 17, 2017). "Prometheus User Profile: ShuttleCloud Explains Why Prometheus Is Good for Your Small Startup". Cloud Native Computing Foundation. Retrieved December 26, 2018.
  37. Haidrey, Ace (March 15, 2018). "Apache Airflow at Pandora". Engineering at Pandora. Retrieved December 26, 2018.
  38. "GitLab Prometheus metrics". Retrieved December 26, 2018.
  39. "GitLab 9.0 released with Subgroups and Deploy Boards". GitLab. 2017-03-22.

Further reading