Tracing (software)

Last updated

Tracing in software engineering refers to the process of capturing and recording information about the execution of a software program. This information is typically used by programmers for debugging purposes, and additionally, depending on the type and detail of information contained in a trace log, by experienced system administrators or technical-support personnel and by software monitoring tools to diagnose common problems with software. [1] Tracing is a cross-cutting concern.

Contents

There is not always a clear distinction between tracing and other forms of logging , except that the term tracing is almost never applied to logging that is a functional requirement of a program (therefore excluding logging of data from an external source, such as data acquisition in a high-energy physics experiment, and write-ahead logging). Logs that record program usage (such as a server log) or operating-system events primarily of interest to a system administrator (see for example Event Viewer ) fall into a terminological gray area.

Tracing is primarily used for anomaly detection, fault analysis, debugging or diagnostic purposes in distributed software systems, such as microservices or serverless functions. [2]

Software tracing

Software tracing is a tool for developers to gather information for debugging. This information is used both during development cycles and post-release. Unlike event logging, software tracing usually does not have the concept of a "class" of event or an "event code". Other reasons why event-logging solutions based on event codes are inappropriate for software tracing include:

Tools

OpenTelemetry is a CNCF open source project that provides comprehensive support for distributed tracing. [3] Some vendors including Datadog, New Relic, Splunk also offer tracing SaaS services. [4]

Google and Meta have developed their own tracing frameworks namely Dapper and Canopy. [2]

Application-specific tracing

System-specific tracing

In operating systems, tracing can be used in situations (such as booting) where some of the technologies used to provide event logging may not be available.

Linux offers system-level and user-level tracing capabilities with kernel markers and LTTng. ftrace also supports Linux kernel tracing. syslog is another tool in various operating systems for logging and tracing system messages.

FreeBSD and SmartOS employ DTrace for tracing for the kernel and the userland.

In embedded software, tracing also requires special techniques for efficient instrumentation and logging and low CPU overhead. [6]

Techniques

Trace generation and collection

Trace generation of method calls can be done with source code instrumentation, runtime information collection, or under debugger control. [7] Tracing macros, aspect-oriented programming and related instrumentation techniques can be employed.

Libraries used in source code send data to an agent or directly to the collection component. [4]

Trace analysis

To model execution trees, ISVis converts a rooted tree into a directed acyclic graph while Jinsight utilizes the call frame principle to gather and represent cumulative information about traces. [7]

The primary visualization method is the swimlane view, which is exemplified by tools like Jaeger and often includes annotations and key-value attributes. Despite its widespread use, this design lacks rigorous justification and users frequently face challenges like missing features and confusing navigation. Alternatives to swimlane views exist, like Jaeger’s service dependency view or SkyWalking’s List, Tree, and Table views. Aggregate visualizations are also used for analyzing large volumes of traces, with systems like Canopy offering queryable metrics and Jaeger providing trace comparison features. [8]

Event logging

Event logging provides system administrators with information useful for diagnostics and auditing. The different classes of events that will be logged, as well as what details will appear in the event messages, are often considered early in the development cycle. Many event logging technologies allow or even require each class of event to be assigned a unique "code", which is used by the event logging software or a separate viewer (e.g., Event Viewer) to format and output a human-readable message. This facilitates localization and allows system administrators to more easily obtain information on problems that occur.

Because event logging is used to log high-level information (often failure information), performance of the logging implementation is often less important.

A special concern, preventing duplicate events from being recorded "too often" is taken care of through event throttling.

Difficulties in making a clear distinction between event logging and software tracing arise from the fact that some of the same technologies are used for both, and further because many of the criteria that distinguish between the two are continuous rather than discrete. The following table lists some important, but by no means precise or universal, distinctions that are used by developers to select technologies for each purpose, and that guide the separate development of new technologies in each area:

Event loggingSoftware tracing
Consumed primarily by system administratorsConsumed primarily by developers
Logs "high level" information (e.g. failed installation of a program)Logs "low level" information (e.g. a thrown exception)
Must not be too "noisy" (containing many duplicate events or information that is not helpful for its intended audience)Can be noisy
A standards-based output format is often desirable, sometimes even requiredFew limitations on output format
Event log messages are often localized Localization is rarely a concern
Addition of new types of events, as well as new event messages, need not be agileAddition of new tracing messages must be agile

Challenges and limitations

Enabling or disabling tracing during runtime often necessitates the inclusion of extra data in the binary. This can lead to performance degradation, even when tracing is not active.

If tracing is enabled or disabled at compile-time, collecting trace data from a client's system hinges on their willingness and capability to install a version of the software specifically enabled for tracing, and subsequently replicate the issue.

Tracing in software typically demands high standards of robustness, not only in the accuracy and reliability of the trace output but also in ensuring that the process being traced remains uninterrupted.

Given its low-level nature, tracing can generate a large volume of messages. To mitigate performance issues, it's often necessary to have the option to deactivate software tracing, either at the time of compilation or during run-time.

Security and privacy

In proprietary software, tracing data may include sensitive information about the product's source code.

See also

Related Research Articles

<span class="mw-page-title-main">Embedded system</span> Computer system with a dedicated function

An embedded system is a specialized computer system—a combination of a computer processor, computer memory, and input/output peripheral devices—that has a dedicated function within a larger mechanical or electronic system. It is embedded as part of a complete device often including electrical or electronic hardware and mechanical parts. Because an embedded system typically controls physical operations of the machine that it is embedded within, it often has real-time computing constraints. Embedded systems control many devices in common use. In 2009, it was estimated that ninety-eight percent of all microprocessors manufactured were used in embedded systems.

<span class="mw-page-title-main">Kernel panic</span> Fatal error condition associated with Unix-like computer operating systems

A kernel panic is a safety measure taken by an operating system's kernel upon detecting an internal fatal error in which either it is unable to safely recover or continuing to run the system would have a higher risk of major data loss. The term is largely specific to Unix and Unix-like systems. The equivalent on Microsoft Windows operating systems is a stop error, often called a "blue screen of death".

Nucleus RTOS is a real-time operating system (RTOS) produced by the Embedded Software Division of Mentor Graphics, a Siemens Business, supporting 32- and 64-bit embedded system platforms. The operating system (OS) is designed for real-time embedded systems for medical, industrial, consumer, aerospace, and Internet of things (IoT) uses. Nucleus was released first in 1993. The latest version is 3.x, and includes features such as power management, process model, 64-bit support, safety certification, and support for heterogeneous computing multi-core system on a chip (SOCs) processors.

<span class="mw-page-title-main">DTrace</span> Dynamic tracing framework for kernel and applications

DTrace is a comprehensive dynamic tracing framework originally created by Sun Microsystems for troubleshooting kernel and application problems on production systems in real time. Originally developed for Solaris, it has since been released under the free Common Development and Distribution License (CDDL) in OpenSolaris and its descendant illumos, and has been ported to several other Unix-like systems.

In software engineering, profiling is a form of dynamic program analysis that measures, for example, the space (memory) or time complexity of a program, the usage of particular instructions, or the frequency and duration of function calls. Most commonly, profiling information serves to aid program optimization, and more specifically, performance engineering.

Open-source software development (OSSD) is the process by which open-source software, or similar software whose source code is publicly available, is developed by an open-source software project. These are software products available with its source code under an open-source license to study, change, and improve its design. Examples of some popular open-source software products are Mozilla Firefox, Google Chromium, Android, LibreOffice and the VLC media player.

Dynamic program analysis is the act of analyzing software that involves executing a program – as opposed to static program analysis, which does not execute it.

The Windows software trace preprocessor is a preprocessor that simplifies the use of WMI event tracing to implement efficient software tracing in drivers and applications that target Windows 2000 and later operating systems. WPP was created by Microsoft and is included in the Windows DDK. Although WPP is wide in its applicability, it is not included in the Windows SDK, and therefore is primarily used for drivers and driver support software produced by software vendors that purchase the Windows DDK.

In computer science, fault injection is a testing technique for understanding how computing systems behave when stressed in unusual ways. This can be achieved using physical- or software-based means, or using a hybrid approach. Widely studied physical fault injections include the application of high voltages, extreme temperatures and electromagnetic pulses on electronic components, such as computer memory and central processing units. By exposing components to conditions beyond their intended operating limits, computing systems can be coerced into mis-executing instructions and corrupting critical data.

In computer programming, instrumentation is the act of modifying software so that analysis can be performed on it.

<span class="mw-page-title-main">SystemTap</span> Scripting language and tool

In computing, SystemTap is a scripting language and tool for dynamically instrumenting running production Linux-based operating systems. System administrators can use SystemTap to extract, filter and summarize data in order to enable diagnosis of complex performance or functional problems.

Kernel markers were a static kernel instrumentation support mechanism for Linux kernel source code, allowing special tools such as LTTng or SystemTap to trace information exposed by these probe points. Kernel markers were declared in the kernel code by one-liners of the form:

LTTng is a system software package for correlated tracing of the Linux kernel, applications and libraries. The project was originated by Mathieu Desnoyers with an initial release in 2005. Its predecessor is the Linux Trace Toolkit.

perf is a performance analyzing tool in Linux, available from Linux kernel version 2.6.31 in 2009. Userspace controlling utility, named perf, is accessed from the command line and provides a number of subcommands; it is capable of statistical profiling of the entire system.

<span class="mw-page-title-main">DynamoRIO</span> Software framework

DynamoRIO is a BSD-licensed dynamic binary instrumentation framework for the development of dynamic program analysis tools. DynamoRIO targets user space applications under the Android, Linux, and Windows operating systems running on the AArch32, IA-32, and x86-64 instruction set architectures.

CodeXL was an open-source software development tool suite which included a GPU debugger, a GPU profiler, a CPU profiler, a graphics frame analyzer and a static shader/kernel analyzer.

ftrace is a tracing framework for the Linux kernel. Although its original name, Function Tracer, came from ftrace's ability to record information related to various function calls performed while the kernel is running, ftrace's tracing capabilities cover a much broader range of kernel's internal operations.

The Cloud Native Computing Foundation (CNCF) is a Linux Foundation project that was started in 2015 to help advance container technology and align the tech industry around its evolution.

eBPF Runtime system for operating systems

eBPF is a technology that can run programs in a privileged context such as the operating system kernel. It is the successor to the Berkeley Packet Filter filtering mechanism in Linux and is also used in non-networking parts of the Linux kernel as well.

In software engineering, more specifically in distributed computing, observability is the ability to collect data about programs' execution, modules' internal states, and the communication among components. To improve observability, software engineers use a wide range of logging and tracing techniques to gather telemetry information, and tools to analyze and use it. Observability is foundational to site reliability engineering, as it is the first step in triaging a service outage. One of the goals of observability is to minimize the amount of prior knowledge needed to debug an issue.

References

  1. "The Tracing Book". Archived from the original on 2009-02-24.
  2. 1 2 Li, Bowen; Peng, Xin; Xiang, Qilin; Wang, Hanzhang; Xie, Tao; Sun, Jun; Liu, Xuanzhe (2022). "Enjoy your observability: an industrial survey of microservice tracing and analysis". Empirical Software Engineering. 27 (1): 25. doi:10.1007/s10664-021-10063-9. ISSN   1382-3256. PMC   8629732 . PMID   34867075.
  3. Mandel, Maya (2023-06-07). "Council Post: Distributed Tracing: The Key To Microservices Observability". Forbes. Retrieved 2024-01-12.
  4. 1 2 Janes, Andrea; Li, Xiaozhou; Lenarduzzi, Valentina (2023). "Open tracing tools: Overview and critical comparison". Journal of Systems and Software. 204. Elsevier BV: 111793. arXiv: 2207.06875 . doi:10.1016/j.jss.2023.111793. ISSN   0164-1212.
  5. "Tracepoints (Debugging with GDB)". sourceware.org. Retrieved 2022-06-24.
  6. Kraft, Johan; Wall, Anders; Kienle, Holger (2010), "Trace Recording for Embedded Systems: Lessons Learned from Five Industrial Projects", Runtime Verification, Springer Berlin Heidelberg, pp. 315–329, doi:10.1007/978-3-642-16612-9_24, ISBN   9783642166112
  7. 1 2 Mertz, Jhonny; Nunes, Ingrid (2019). On the Practical Feasibility of Software Monitoring: a Framework for Low-Impact Execution Tracing. CASCON '04: Proceedings of the 2004 conference of the Centre for Advanced Studies on Collaborative research. IEEE. pp. 169–180. doi:10.1109/SEAMS.2019.00030. ISBN   978-1-7281-3368-3.
  8. "A Qualitative Interview Study of Distributed Tracing Visualisation: A Characterisation of Challenges and Opportunities". IEEE Xplore. 2023-02-01. Retrieved 2024-01-12.