Observability (software)

Last updated May 26, 2024

In software engineering, more specifically in distributed computing, observability is the ability to collect data about programs' execution, modules' internal states, and the communication among components.^[1]^[2] To improve observability, software engineers use a wide range of logging and tracing techniques to gather telemetry information, and tools to analyze and use it. Observability is foundational to site reliability engineering, as it is the first step in triaging a service outage. One of the goals of observability is to minimize the amount of prior knowledge needed to debug an issue.

Etymology, terminology and definition

The term is borrowed from control theory, where the "observability" of a system measures how well its state can be determined from its outputs. Similarly, software observability measures how well a system's state can be understood from the obtained telemetry (metrics, logs, traces, profiling).

The definition of observability varies by vendor:

a measure of how well you can understand and explain any state your system can get into, no matter how novel or bizarre [...] without needing to ship new code
— Honeycomb ^[3]

software tools and practices for aggregating, correlating and analyzing a steady stream of performance data from a distributed application along with the hardware and network it runs on
— IBM Instana ^[4]

observability starts by shipping all your raw data to central service before you begin analysis
— Edge Delta ^[5]

the ability to measure a system’s current state based on the data it generates, such as logs, metrics, and traces
— Dynatrace ^[6]

Observability is tooling or a technical solution that allows teams to actively debug their system. Observability is based on exploring properties and patterns not defined in advance.
— Google Cloud ^[7]

proactively collecting, visualizing, and applying intelligence to all of your metrics, events, logs, and traces—so you can understand the behavior of your complex digital system
— New Relic ^[8]

The term is frequently abbreviated to the numeronym o11y (where 11 stands for the number of letters between the first and last letters of the word). This is similar to other computer science abbreviations such as i18n, l10n, and k8s.^[9]
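
The numeronym rule described above can be illustrated with a short sketch. The function name `numeronym` is hypothetical and chosen only for this example; the expanded words for i18n, l10n, and k8s are the commonly understood ones.

```python
def numeronym(word: str) -> str:
    """Abbreviate a word as: first letter + number of inner letters + last letter."""
    if len(word) <= 3:
        # Too short to abbreviate meaningfully; return it unchanged.
        return word
    return f"{word[0]}{len(word) - 2}{word[-1]}"


for term in ("observability", "internationalization", "localization", "kubernetes"):
    print(term, "->", numeronym(term))
```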

Observability vs. monitoring

Observability and monitoring are sometimes used interchangeably.^[10] As tooling, commercial offerings and practices evolved in complexity, "monitoring" was re-branded as observability in order to differentiate new tools from the old.

The terms are commonly contrasted: monitoring relies on predefined sets of telemetry,^[7] whereas observability supports exploring properties and patterns that were not defined in advance.^[7] A monitored system may also be observable.^[11]

Majors et al. suggest that engineering teams that only have monitoring tools end up relying on expert foreknowledge (seniority), whereas teams that have observability tools rely on exploratory analysis (curiosity).^[3]

Telemetry types

Observability relies on three main types of telemetry data: metrics, logs and traces.^[6]^[7]^[12] These are often referred to as the "pillars of observability".^[13]

Metrics

A metric is a point-in-time measurement (a scalar) that represents some system state. Examples of common metrics include:

number of HTTP requests per second;
total number of query failures;
database size in bytes;
time in seconds since last garbage collection.

Monitoring tools are typically configured to emit alerts when certain metric values exceed set thresholds. Thresholds are set based on knowledge about normal operating conditions and experience.
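
Threshold-based alerting of this kind can be sketched as follows. This is a minimal illustration, not the behavior of any particular monitoring tool; the metric names, threshold values, and the `ThresholdRule` and `evaluate` helpers are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThresholdRule:
    metric: str          # name of the metric to watch (hypothetical names below)
    threshold: float     # value chosen from experience of normal operation
    description: str


def evaluate(rules, samples):
    """Return the rules whose latest sample exceeds the configured threshold."""
    return [
        rule for rule in rules
        if rule.metric in samples and samples[rule.metric] > rule.threshold
    ]


RULES = [
    ThresholdRule("http_requests_per_second", 500.0, "request rate unusually high"),
    ThresholdRule("query_failures_total", 10.0, "too many failed queries"),
    ThresholdRule("seconds_since_last_gc", 300.0, "garbage collection overdue"),
]

# Latest observed values; invented for illustration.
latest_samples = {
    "http_requests_per_second": 120.0,
    "query_failures_total": 42.0,
    "seconds_since_last_gc": 12.5,
}

firing = evaluate(RULES, latest_samples)
for rule in firing:
    print(f"ALERT {rule.metric}: {rule.description}")
```

Because the rules are written in advance, this approach only detects the conditions someone anticipated.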

Metrics are typically tagged to facilitate grouping and searchability.
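
As a rough sketch of how tags support grouping, the following in-memory counter stores one value per tag combination and can then be summed along any single tag. The `TaggedCounter` class and the tag names are hypothetical; real metrics libraries expose their own APIs.

```python
from collections import defaultdict


class TaggedCounter:
    """A counter that keeps a separate value for every combination of tags."""

    def __init__(self, name):
        self.name = name
        self._values = defaultdict(int)

    def inc(self, amount=1, **tags):
        # Sort the tags so that the same combination always maps to the same key.
        key = tuple(sorted(tags.items()))
        self._values[key] += amount

    def total_by(self, tag):
        """Sum the counter over all other tags, grouped by the given tag."""
        totals = defaultdict(int)
        for key, value in self._values.items():
            totals[dict(key).get(tag, "<untagged>")] += value
        return dict(totals)


requests = TaggedCounter("http_requests_total")
requests.inc(method="GET", status="200")
requests.inc(method="GET", status="200")
requests.inc(method="POST", status="500")
requests.inc(method="GET", status="404")

print(requests.total_by("method"))
print(requests.total_by("status"))
```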

Application developers choose which metrics to instrument their software with before it is released. As a result, when a previously unknown issue is encountered, new metrics cannot be added without shipping new code. Furthermore, the cardinality of metrics (the number of distinct tag-value combinations) can quickly make the storage of telemetry data prohibitively expensive. Because metrics are cardinality-limited, they are often used to represent aggregate values (for example, average page load time, or a 5-second average of the request rate). Without external context, it is impossible to correlate events (such as user requests) with distinct metric values.
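
The cardinality problem can be made concrete with a little arithmetic. In the sketch below, the tag names and the number of distinct values per tag are invented; the product is an upper bound on the number of time series, since not every combination necessarily occurs in practice.

```python
from math import prod


def series_count(tag_cardinalities):
    """Upper bound on distinct time series: the product of per-tag value counts."""
    return prod(tag_cardinalities.values())


# Hypothetical tags for a single request-count metric.
base_tags = {"method": 5, "status": 10, "endpoint": 50}
print(series_count(base_tags))       # 5 * 10 * 50 = 2500

# Adding one high-cardinality tag multiplies the number of series.
with_user_id = {**base_tags, "user_id": 100_000}
print(series_count(with_user_id))    # 2500 * 100000 = 250000000
```

This is why per-request identifiers are usually kept out of metric tags and why metrics tend to carry aggregates instead.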

Logs

Logs, or log lines, are generally free-form, unstructured text intended to be human-readable. Modern logging is instead structured to enable machine parsing.^[3] As with metrics, an application developer must instrument the application upfront and ship new code if different logging information is required.

Logs typically include a timestamp and a severity level. An event (such as a user request) may be fragmented across multiple log lines and interleaved with logs from concurrent events.
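
A minimal sketch of structured logging follows, using JSON lines as the format. The `request_id` field is an assumption added for the example: it is one possible way to reassemble the lines of a single event after they have been interleaved with lines from concurrent events. The helper names `log_line` and `group_by_request` are hypothetical.

```python
import json
from collections import defaultdict
from datetime import datetime, timezone


def log_line(severity, message, **fields):
    """Build one structured log line: timestamp, severity, message, extra fields."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "severity": severity,
        "message": message,
        **fields,
    }
    return json.dumps(record)


# Lines from two concurrent requests, interleaved as they might appear in a log file.
lines = [
    log_line("INFO", "request started", request_id="req-1", path="/cart"),
    log_line("INFO", "request started", request_id="req-2", path="/login"),
    log_line("ERROR", "database timeout", request_id="req-1"),
    log_line("INFO", "request finished", request_id="req-2", status=200),
    log_line("INFO", "request finished", request_id="req-1", status=500),
]


def group_by_request(raw_lines):
    """Parse the lines and collect the messages belonging to each request."""
    events = defaultdict(list)
    for raw in raw_lines:
        record = json.loads(raw)
        events[record["request_id"]].append(record["message"])
    return dict(events)


for request_id, messages in group_by_request(lines).items():
    print(request_id, messages)
```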

Traces

Distributed traces

A cloud-native application is typically made up of distributed services that together fulfill a single request. A distributed trace is an interrelated series of discrete events (also called spans) that tracks the progression of a single user request.^[3] A trace shows the causal and temporal relationships between the services that interoperate to fulfill a request.
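
The sketch below shows one way a set of spans could be rendered as a trace. It assumes that each span records the identifier of the span that caused it (giving the causal relationship) together with start and end times (giving the temporal relationship). The `Span` fields, the service names, and the `render_trace` helper are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_id: Optional[str]   # None marks the root span of the trace
    service: str
    start_ms: int
    end_ms: int


def render_trace(spans):
    """Return the spans as indented lines: nesting is causal, ordering is temporal."""
    children = defaultdict(list)
    for span in spans:
        children[span.parent_id].append(span)

    lines = []

    def visit(parent_id, depth):
        for span in sorted(children[parent_id], key=lambda s: s.start_ms):
            lines.append(f"{'  ' * depth}{span.service} [{span.start_ms}-{span.end_ms} ms]")
            visit(span.span_id, depth + 1)

    visit(None, 0)
    return lines


# Spans may arrive at the backend in any order.
spans = [
    Span("c", "a", "payment", 40, 90),
    Span("a", None, "frontend", 0, 100),
    Span("b", "a", "cart", 5, 35),
    Span("d", "c", "database", 50, 80),
]

print("\n".join(render_trace(spans)))
```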

Instrumenting an application with traces means sending span information to a tracing backend. The tracing backend correlates the received spans to generate presentable traces. To be able to follow a request as it traverses multiple services, spans are labeled with unique identifiers that enable constructing a parent-child relationship between spans. Span information is typically shared in the HTTP headers of outbound requests.^[3]^[14]^[15]
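
As an illustration of sharing span information in HTTP headers, the following sketch builds and parses a `traceparent` header in the layout defined by the W3C Trace Context recommendation (version, 32-hex-digit trace ID, 16-hex-digit parent span ID, 2-hex-digit flags). The helper names `inject` and `extract` are assumptions for this example, and the sketch omits parts of the specification such as the `tracestate` header and the rejection of all-zero identifiers.

```python
import re
import secrets

# version "00" - trace-id (32 hex) - parent-id (16 hex) - trace-flags (2 hex)
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_trace_id():
    return secrets.token_hex(16)   # 16 random bytes -> 32 hex characters


def new_span_id():
    return secrets.token_hex(8)    # 8 random bytes -> 16 hex characters


def inject(headers, trace_id, span_id, sampled=True):
    """Add the caller's trace and span identifiers to outbound request headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"
    return headers


def extract(headers):
    """Return (trace_id, parent_span_id) from inbound headers, or None if absent/invalid."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_span_id, _flags = match.groups()
    return trace_id, parent_span_id


# Service A starts a trace and calls service B.
trace_id = new_trace_id()
span_a = new_span_id()
outbound_headers = inject({}, trace_id, span_a)

# Service B continues the same trace; its new span records span A as its parent.
received_trace_id, parent_id = extract(outbound_headers)
span_b = new_span_id()
print({"trace_id": received_trace_id, "span_id": span_b, "parent_id": parent_id})
```

Both services report their spans with the same trace identifier, which lets the tracing backend link span B to span A.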

Continuous profiling

Continuous profiling is another telemetry type used to precisely determine how an application consumes resources.^[16]

Instrumentation

To be able to observe an application, telemetry about the application's behavior needs to be collected or exported. Instrumentation means generating telemetry alongside the normal operation of the application.^[3] Telemetry is then collected by an independent backend for later analysis.

In fast-changing systems, instrumentation itself is often the best possible documentation, since it combines intention (what are the dimensions that an engineer named and decided to collect?) with the real-time, up-to-date information of live status in production.^[3]

Instrumentation can be automatic or custom. Automatic instrumentation offers blanket coverage and immediate value; custom instrumentation brings higher value but requires more intimate involvement with the instrumented application.

Instrumentation can be native, that is, done in-code by modifying the code of the instrumented application, or out-of-code (for example, using a sidecar or eBPF).
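
A minimal sketch of custom, in-code instrumentation is a decorator that records the duration and outcome of every call alongside the function's normal operation. The `instrumented` decorator, the in-memory `TELEMETRY` store, and the `checkout` function are hypothetical; a real application would export the recorded data to a telemetry backend rather than keep it in a dictionary.

```python
import functools
import time
from collections import defaultdict

# Stand-in for a telemetry backend: recorded events, keyed by function name.
TELEMETRY = defaultdict(list)


def instrumented(func):
    """Record duration and outcome of each call without changing its behavior."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        outcome = "ok"
        try:
            return func(*args, **kwargs)
        except Exception:
            outcome = "error"
            raise   # the caller still sees the original exception
        finally:
            TELEMETRY[func.__name__].append(
                {"outcome": outcome, "duration_s": time.perf_counter() - start}
            )
    return wrapper


@instrumented
def checkout(cart_total):
    if cart_total < 0:
        raise ValueError("negative total")
    return round(cart_total * 1.2, 2)


checkout(10.0)
try:
    checkout(-1.0)
except ValueError:
    pass

print([event["outcome"] for event in TELEMETRY["checkout"]])
```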

Verifying new features in production by shipping them together with custom instrumentation is a practice called "observability-driven development".^[3]

"Pillars of observability"

Metrics, logs and traces are most commonly listed as the pillars of observability.^[13] Majors et al. suggest instead that the pillars of observability are high cardinality, high dimensionality, and explorability, arguing that runbooks and dashboards have little value because "modern systems rarely fail in precisely the same way twice."^[3]

Self-monitoring

Self-monitoring is a practice in which observability stacks monitor each other in order to reduce the risk of inconspicuous outages. Self-monitoring may be put in place in addition to high availability and redundancy to further avoid correlated failures.

External links

CNCF Observability Technical Advisory Group (TAG)

Bibliography

Boten, Alex; Majors, Charity (2022). Cloud-Native Observability with OpenTelemetry. Packt Publishing. ISBN 978-1-80107-190-1. OCLC 1314053525.
Majors, Charity; Fong-Jones, Liz; Miranda, George (2022). Observability engineering : achieving production excellence (1st ed.). Sebastopol, CA: O'Reilly Media, Inc. ISBN 9781492076445. OCLC 1315555871.
Sridharan, Cindy (2018). Distributed systems observability : a guide to building robust systems (1st ed.). Sebastopol, CA: O'Reilly Media, Inc. ISBN 978-1-4920-3342-4. OCLC 1044741317.
Hausenblas, Michael (2023). Cloud Observability in Action. Manning. ISBN 9781633439597. OCLC 1359045370.

References

[1] Fellows, Geoff (1998). "High-Performance Client/Server: A Guide to Building and Managing Robust Distributed Systems". Internet Research. 8 (5). doi:10.1108/intr.1998.17208eaf.007. ISSN 1066-2243.
[2] Cantrill, Bryan (2006). "Hidden in Plain Sight: Improvements in the observability of software can help you diagnose your most crippling performance problems". Queue. 4 (1): 26–36. doi:10.1145/1117389.1117401. ISSN 1542-7730. S2CID 14505819.
[3] Majors, Charity; Fong-Jones, Liz; Miranda, George (2022). Observability engineering : achieving production excellence (1st ed.). Sebastopol, CA: O'Reilly Media, Inc. ISBN 9781492076445. OCLC 1315555871.
[4] "What is observability". IBM. 15 October 2021. Retrieved 9 March 2023.
[5] "How to Begin Observability at the Data Source". Cisco. 26 October 2023. Retrieved 26 October 2023.
[6] Livens, Jay (October 2021). "What is observability?". Dynatrace. Retrieved 9 March 2023.
[7] "DevOps measurement: Monitoring and observability". Google Cloud. Retrieved 9 March 2023.
[8] Reinholds, Amy (30 November 2021). "What is observability?". New Relic. Retrieved 9 March 2023.
[9] "How Are Structured Logs Different from Events?". 26 June 2018.
[10] Hadfield, Ally (29 June 2022). "Observability vs. Monitoring: What's The Difference in DevOps?". Instana. Retrieved 15 March 2023.
[11] Kidd, Chrissy. "Monitoring, Observability & Telemetry: Everything You Need To Know for Observable Work". Retrieved 15 March 2023.
[12] "What is Observability? A Beginner's Guide". Splunk. Retrieved 9 March 2023.
[13] Sridharan, Cindy (2018). "Chapter 4. The Three Pillars of Observability". Distributed systems observability : a guide to building robust systems (1st ed.). Sebastopol, CA: O'Reilly Media, Inc. ISBN 978-1-4920-3342-4. OCLC 1044741317.
[14] "Trace Context". W3C. 2021-11-23. Retrieved 2023-09-27.
[15] "b3-propagation". openzipkin. Retrieved 2023-09-27.
[16] "What is continuous profiling?". Cloud Native Computing Foundation. 31 May 2022. Retrieved 9 March 2023.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
