Sparse

Original author(s) Linus Torvalds
Developer(s) Josh Triplett, Christopher Li, Luc Van Oostenryck
Initial release 2003
Stable release 0.6.4 / September 6, 2021 [1]
Repository
Written in C
Operating system Linux, BSD, macOS, MinGW, Cygwin
Type Static code analysis
License MIT License
Website sparse.docs.kernel.org

Sparse is a static analysis tool designed to find possible coding faults in the Linux kernel. [2] Unlike other such tools, it was initially designed to flag only constructs likely to be of interest to kernel developers, such as the mixing of pointers to user and kernel address spaces.

Sparse checks for known problems and allows the developer to include annotations in the code that convey information about data types, such as the address space that pointers point to and the locks that a function acquires or releases.

Linus Torvalds started writing Sparse in 2003. Josh Triplett was its maintainer from 2006, a role taken over by Christopher Li in 2009 [3] and by Luc Van Oostenryck in November 2018. [4] Sparse is released under the MIT License.

Annotations

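Some of the checks performed by Sparse require annotating the source code using the __attribute__ GCC extension, or the Sparse-specific __context__ specifier. [5] The attributes that Sparse understands include address_space(num), noderef, bitwise, force, nocast, safe and context(expression, in_context, out_context).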

When an API is defined with a macro, the specifier __attribute__((context(...))) can be replaced by __context__(...).
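The difference between the two forms can be sketched as follows. This is a minimal illustration, not kernel code: the example_lock structure and every example_* name are hypothetical, and the macro definitions simply mirror the pattern of compiling the annotations away when __CHECKER__ is not defined. How Sparse reports a context imbalance may vary between versions.

```c
#include <assert.h>

/* Annotations are only visible to Sparse; a normal compiler sees nothing. */
#ifdef __CHECKER__
# define example_acquires(x) __attribute__((context(x,0,1)))
# define example_releases(x) __attribute__((context(x,1,0)))
# define example_acquire(x)  __context__(x,1)
# define example_release(x)  __context__(x,-1)
#else
# define example_acquires(x)
# define example_releases(x)
# define example_acquire(x)  (void)0
# define example_release(x)  (void)0
#endif

/* Hypothetical toy lock, single-threaded, for illustration only. */
struct example_lock { int held; };

/* Function-based API: the attribute states that the context goes 0 -> 1. */
static void example_lock_fn(struct example_lock *l) example_acquires(l)
{
    assert(!l->held);      /* taking the lock twice is a bug */
    l->held = 1;
    example_acquire(l);    /* raises the context by one inside the body */
}

/* Function-based API: the attribute states that the context goes 1 -> 0. */
static void example_unlock_fn(struct example_lock *l) example_releases(l)
{
    assert(l->held);       /* releasing a free lock is a bug */
    l->held = 0;
    example_release(l);    /* lowers the context by one inside the body */
}

/* Macro-based API: there is no declaration to attach an attribute to,
 * so the context change is expressed with the __context__ specifier. */
#define example_lock_macro(l) \
    do { (l)->held = 1; example_acquire(l); } while (0)
#define example_unlock_macro(l) \
    do { (l)->held = 0; example_release(l); } while (0)
```

A caller would pair example_lock_fn() with example_unlock_fn(), or example_lock_macro() with example_unlock_macro(); under Sparse, a function that returns with the context still raised would be expected to produce a context imbalance warning.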

Linux kernel definitions

The Linux kernel defines the following short forms as pre-processor macros in files linux/compiler.h and linux/types.h (when building without the __CHECKER__ flag, all these annotations are removed from the code):

#ifdef __CHECKER__
# define __user         __attribute__((noderef, address_space(1)))
# define __kernel       __attribute__((address_space(0)))
# define __safe         __attribute__((safe))
# define __force        __attribute__((force))
# define __nocast       __attribute__((nocast))
# define __iomem        __attribute__((noderef, address_space(2)))
# define __must_hold(x) __attribute__((context(x,1,1)))
# define __acquires(x)  __attribute__((context(x,0,1)))
# define __releases(x)  __attribute__((context(x,1,0)))
# define __acquire(x)   __context__(x,1)
# define __release(x)   __context__(x,-1)
# define __cond_lock(x,c) ((c) ? ({ __acquire(x); 1; }) : 0)
# define __percpu       __attribute__((noderef, address_space(3)))
#ifdef CONFIG_SPARSE_RCU_POINTER
# define __rcu          __attribute__((noderef, address_space(4)))
#else
# define __rcu
#endif
extern void __chk_user_ptr(const volatile void __user *);
extern void __chk_io_ptr(const volatile void __iomem *);
#else
# define __user
# define __kernel
# define __safe
# define __force
# define __nocast
# define __iomem
# define __chk_user_ptr(x) (void)0
# define __chk_io_ptr(x) (void)0
# define __builtin_warning(x, y...) (1)
# define __must_hold(x)
# define __acquires(x)
# define __releases(x)
# define __acquire(x) (void)0
# define __release(x) (void)0
# define __cond_lock(x,c) (c)
# define __percpu
# define __rcu
#endif

#ifdef __CHECKER__
# define __bitwise __attribute__((bitwise))
#else
# define __bitwise
#endif
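As a rough sketch of how the address-space macros are meant to be used, the following marks a pointer as belonging to user space and only reads through it via a copy helper. The functions example_copy_from_user() and example_read_flag() are hypothetical stand-ins written for this illustration (the helper is just a memcpy and performs none of the checks a real kernel routine would); only the __user and __force definitions follow the pattern shown above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Same pattern as above: the annotations vanish unless Sparse is running. */
#ifdef __CHECKER__
# define __user  __attribute__((noderef, address_space(1)))
# define __force __attribute__((force))
#else
# define __user
# define __force
#endif

/* Hypothetical stand-in for a user-space copy routine.
 * Returns the number of bytes that could NOT be copied (always 0 here). */
static unsigned long example_copy_from_user(void *dst,
                                            const void __user *src,
                                            size_t n)
{
    /* The __force cast deliberately strips the address space for memcpy. */
    memcpy(dst, (__force const void *)src, n);
    return 0;
}

/* Hypothetical handler that receives a pointer from user space. */
static int example_read_flag(const int __user *uptr)
{
    int flag;

    assert(uptr != NULL);

    /* Writing "return *uptr;" here is the kind of direct dereference
     * that the noderef attribute is meant to let Sparse flag. */
    if (example_copy_from_user(&flag, uptr, sizeof(flag)))
        return -1;
    return flag;
}
```

Built with an ordinary compiler, __user and __force expand to nothing and the code behaves like plain C; the annotations only matter when the file is run through Sparse.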

Examples

The types __le32 and __be32 represent 32-bit integer types with different endianness. However, the C language provides no way to specify that variables of these types should not be mixed. The bitwise attribute marks these types as restricted, so Sparse warns if variables of these types are mixed with each other or with other integer variables:

typedef __u32 __bitwise __le32;
typedef __u32 __bitwise __be32;

To mark a valid conversion between restricted types, a cast with the force attribute is used, which prevents Sparse from giving a warning.
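A minimal sketch of such a conversion is shown below. The type names example_le32 and example_be32 and the helper functions are hypothetical stand-ins for illustration (uint32_t replaces the kernel's __u32 so the example is self-contained); the __bitwise and __force definitions follow the kernel pattern of expanding to nothing when __CHECKER__ is not defined.

```c
#include <assert.h>
#include <stdint.h>

/* Annotations are only visible to Sparse; a normal compiler sees nothing. */
#ifdef __CHECKER__
# define __bitwise __attribute__((bitwise))
# define __force   __attribute__((force))
#else
# define __bitwise
# define __force
#endif

/* Hypothetical stand-ins for the restricted __le32 and __be32 types. */
typedef uint32_t __bitwise example_le32;
typedef uint32_t __bitwise example_be32;

/* Reverse the byte order of a plain 32-bit value. */
static inline uint32_t example_swab32(uint32_t x)
{
    return ((x & 0x000000ffu) << 24) |
           ((x & 0x0000ff00u) << 8)  |
           ((x & 0x00ff0000u) >> 8)  |
           ((x & 0xff000000u) >> 24);
}

/* Intentional conversion between the two restricted types: the __force
 * casts tell Sparse that leaving and entering the types is deliberate. */
static inline example_le32 example_be32_to_le32(example_be32 v)
{
    return (__force example_le32)example_swab32((__force uint32_t)v);
}

/* Usage: build a restricted value, convert it, and inspect the raw bits. */
static inline void example_selfcheck(void)
{
    example_be32 be = (__force example_be32)0x11223344u;
    example_le32 le = example_be32_to_le32(be);

    /* Assigning "le = be;" directly is the kind of mixing Sparse flags. */
    assert((__force uint32_t)le == 0x44332211u);
}
```

Without the __force casts the code still compiles with an ordinary compiler, because the annotations are removed; the casts exist only to document the conversion for Sparse.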


References

  1. Luc Van Oostenryck (2021-09-06). "Sparse 0.6.4". linux-sparse@vger.kernel.org (Mailing list). Retrieved 2024-05-08.
  2. Yoann Padioleau; René Rydhof Hansen; Julia L. Lawall; Gilles Muller (2006). Semantic patches for documenting and automating collateral evolutions in Linux device drivers. Proceedings of the 3rd workshop on Programming languages and operating systems: linguistic support for modern operating systems. CiteSeerX 10.1.1.122.7080. doi:10.1145/1215995.1216005. ISBN 1-59593-577-0. The Linux community has recently begun using various tools to better analyze C code. Sparse is a library that, like a compiler front end, provides convenient access to the abstract syntax tree and typing information of a C program.
  3. Christopher Li (2009-10-16). "Sparse 0.4.2 released". linux-sparse (Mailing list). Retrieved 2010-11-06.
  4. "change Sparse's maintainer". Retrieved 2018-12-10.
  5. "Attribute Syntax Using the GNU Compiler Collection (GCC)". Free Software Foundation. Retrieved 2010-11-13.
