Bit array

Last updated

A bit array (also known as bitmask, [1] bit map, bit set, bit string, or bit vector) is an array data structure that compactly stores bits. It can be used to implement a simple set data structure. A bit array is effective at exploiting bit-level parallelism in hardware to perform operations quickly. A typical bit array stores kw bits, where w is the number of bits in the unit of storage, such as a byte or word, and k is some nonnegative integer. If w does not divide the number of bits to be stored, some space is wasted due to internal fragmentation.

Contents

Definition

A bit array is a mapping from some domain (almost always a range of integers) to values in the set {0, 1}. The values can be interpreted as dark/light, absent/present, locked/unlocked, valid/invalid, et cetera. The point is that there are only two possible values, so they can be stored in one bit. As with other arrays, the access to a single bit can be managed by applying an index to the array. Assuming its size (or length) to be n bits, the array can be used to specify a subset of the domain (e.g. {0, 1, 2, ..., n1}), where a 1-bit indicates the presence and a 0-bit the absence of a number in the set. This set data structure uses about n/w words of space, where w is the number of bits in each machine word. Whether the least significant bit (of the word) or the most significant bit indicates the smallest-index number is largely irrelevant, but the former tends to be preferred (on little-endian machines).

A finite binary relation may be represented by a bit array called a logical matrix. In the calculus of relations, these arrays are composed with matrix multiplication where the arithmetic is Boolean, and such a composition represents composition of relations. [2]

Basic operations

Although most machines are not able to address individual bits in memory, nor have instructions to manipulate single bits, each bit in a word can be singled out and manipulated using bitwise operations. In particular:

To obtain the bit mask needed for these operations, we can use a bit shift operator to shift the number 1 to the left by the appropriate number of places, as well as bitwise negation if necessary.

Given two bit arrays of the same size representing sets, we can compute their union, intersection, and set-theoretic difference using n/w simple bit operations each (2n/w for difference), as well as the complement of either:

for i from 0 to n/w-1     complement_a[i] := not a[i]     union[i]        := a[i] or b[i]     intersection[i] := a[i] and b[i]     difference[i]   := a[i] and (not b[i])

If we wish to iterate through the bits of a bit array, we can do this efficiently using a doubly nested loop that loops through each word, one at a time. Only n/w memory accesses are required:

for i from 0 to n/w-1     index := 0    // if needed     word := a[i]     for b from 0 to w-1         value := word and 1 ≠ 0         word := word shift right 1         // do something with value         index := index + 1    // if needed

Both of these code samples exhibit ideal locality of reference, which will subsequently receive large performance boost from a data cache. If a cache line is k words, only about n/wk cache misses will occur.

More complex operations

As with character strings it is straightforward to define length, substring, lexicographical compare, concatenation, reverse operations. The implementation of some of these operations is sensitive to endianness.

Population / Hamming weight

If we wish to find the number of 1 bits in a bit array, sometimes called the population count or Hamming weight, there are efficient branch-free algorithms that can compute the number of bits in a word using a series of simple bit operations. We simply run such an algorithm on each word and keep a running total. Counting zeros is similar. See the Hamming weight article for examples of an efficient implementation.

Inversion

Vertical flipping of a one-bit-per-pixel image, or some FFT algorithms, requires flipping the bits of individual words (so b31 b30 ... b0 becomes b0 ... b30 b31). When this operation is not available on the processor, it's still possible to proceed by successive passes, in this example on 32 bits:

exchange two 16-bit halfwords exchange bytes by pairs (0xddccbbaa -> 0xccddaabb) ... swap bits by pairs swap bits (b31 b30 ... b1 b0 -> b30 b31 ... b0 b1)  The last operation can be written ((x&0x55555555) << 1) | (x&0xaaaaaaaa) >> 1)). 

Find first one

The find first set or find first one operation identifies the index or position of the 1-bit with the smallest index in an array, and has widespread hardware support (for arrays not larger than a word) and efficient algorithms for its computation. When a priority queue is stored in a bit array, find first one can be used to identify the highest priority element in the queue. To expand a word-size find first one to longer arrays, one can find the first nonzero word and then run find first one on that word. The related operations find first zero, count leading zeros, count leading ones, count trailing zeros, count trailing ones, and log base 2 (see find first set) can also be extended to a bit array in a straightforward manner.

Compression

A bit array is the most dense storage for "random" bits, that is, where each bit is equally likely to be 0 or 1, and each one is independent. But most data are not random, so it may be possible to store it more compactly. For example, the data of a typical fax image is not random and can be compressed. Run-length encoding is commonly used to compress these long streams. However, most compressed data formats are not so easy to access randomly; also by compressing bit arrays too aggressively we run the risk of losing the benefits due to bit-level parallelism (vectorization). Thus, instead of compressing bit arrays as streams of bits, we might compress them as streams of bytes or words (see Bitmap index (compression)).

Advantages and disadvantages

Bit arrays, despite their simplicity, have a number of marked advantages over other data structures for the same problems:

However, bit arrays are not the solution to everything. In particular:

Applications

Because of their compactness, bit arrays have a number of applications in areas where space or efficiency is at a premium. Most commonly, they are used to represent a simple group of boolean flags or an ordered sequence of boolean values.

Bit arrays are used for priority queues, where the bit at index k is set if and only if k is in the queue; this data structure is used, for example, by the Linux kernel, and benefits strongly from a find-first-zero operation in hardware.

Bit arrays can be used for the allocation of memory pages, inodes, disk sectors, etc. In such cases, the term bitmap may be used. However, this term is frequently used to refer to raster images, which may use multiple bits per pixel.

Another application of bit arrays is the Bloom filter, a probabilistic set data structure that can store large sets in a small space in exchange for a small probability of error. It is also possible to build probabilistic hash tables based on bit arrays that accept either false positives or false negatives.

Bit arrays and the operations on them are also important for constructing succinct data structures, which use close to the minimum possible space. In this context, operations like finding the nth 1 bit or counting the number of 1 bits up to a certain position become important.

Bit arrays are also a useful abstraction for examining streams of compressed data, which often contain elements that occupy portions of bytes or are not byte-aligned. For example, the compressed Huffman coding representation of a single 8-bit character can be anywhere from 1 to 255 bits long.

In information retrieval, bit arrays are a good representation for the posting lists of very frequent terms. If we compute the gaps between adjacent values in a list of strictly increasing integers and encode them using unary coding, the result is a bit array with a 1 bit in the nth position if and only if n is in the list. The implied probability of a gap of n is 1/2n. This is also the special case of Golomb coding where the parameter M is 1; this parameter is only normally selected when −log(2 − p) / log(1 − p) ≤ 1, or roughly the term occurs in at least 38% of documents.

Language support

The APL programming language fully supports bit arrays of arbitrary shape and size as a Boolean datatype distinct from integers. All major implementations (Dyalog APL, APL2, APL Next, NARS2000, Gnu APL, etc.) pack the bits densely into whatever size the machine word is. Bits may be accessed individually via the usual indexing notation (A[3]) as well as through all of the usual primitive functions and operators where they are often operated on using a special case algorithm such as summing the bits via a table lookup of bytes.

The C programming language's bit fields , pseudo-objects found in structs with size equal to some number of bits, are in fact small bit arrays; they are limited in that they cannot span words. Although they give a convenient syntax, the bits are still accessed using bytewise operators on most machines, and they can only be defined statically (like C's static arrays, their sizes are fixed at compile-time). It is also a common idiom for C programmers to use words as small bit arrays and access bits of them using bit operators. A widely available header file included in the X11 system, xtrapbits.h, is “a portable way for systems to define bit field manipulation of arrays of bits.” A more explanatory description of aforementioned approach can be found in the comp.lang.c faq.

In C++, although individual bools typically occupy the same space as a byte or an integer, the STL type vector<bool> is a partial template specialization in which bits are packed as a space efficiency optimization. Since bytes (and not bits) are the smallest addressable unit in C++, the [] operator does not return a reference to an element, but instead returns a proxy reference. This might seem a minor point, but it means that vector<bool> is not a standard STL container, which is why the use of vector<bool> is generally discouraged. Another unique STL class, bitset, [3] creates a vector of bits fixed at a particular size at compile-time, and in its interface and syntax more resembles the idiomatic use of words as bit sets by C programmers. It also has some additional power, such as the ability to efficiently count the number of bits that are set. The Boost C++ Libraries provide a dynamic_bitset class [4] whose size is specified at run-time.

The D programming language provides bit arrays in its standard library, Phobos, in std.bitmanip. As in C++, the [] operator does not return a reference, since individual bits are not directly addressable on most hardware, but instead returns a bool.

In Java, the class BitSet creates a bit array that is then manipulated with functions named after bitwise operators familiar to C programmers. Unlike the bitset in C++, the Java BitSet does not have a "size" state (it has an effectively infinite size, initialized with 0 bits); a bit can be set or tested at any index. In addition, there is a class EnumSet , which represents a Set of values of an enumerated type internally as a bit vector, as a safer alternative to bit fields.

The .NET Framework supplies a BitArray collection class. It stores bits using an array of type int (each element in the array usually represents 32 bits). [5] The class supports random access and bitwise operators, can be iterated over, and its Length property can be changed to grow or truncate it.

Although Standard ML has no support for bit arrays, Standard ML of New Jersey has an extension, the BitArray structure, in its SML/NJ Library. It is not fixed in size and supports set operations and bit operations, including, unusually, shift operations.

Haskell likewise currently lacks standard support for bitwise operations, but both GHC and Hugs provide a Data.Bits module with assorted bitwise functions and operators, including shift and rotate operations and an "unboxed" array over boolean values may be used to model a Bit array, although this lacks support from the former module.

In Perl, strings can be used as expandable bit arrays. They can be manipulated using the usual bitwise operators (~ | & ^), [6] and individual bits can be tested and set using the vec function. [7]

In Ruby, you can access (but not set) a bit of an integer (Fixnum or Bignum) using the bracket operator ([]), as if it were an array of bits.

Apple's Core Foundation library contains CFBitVector and CFMutableBitVector structures.

PL/I supports arrays of bit strings of arbitrary length, which may be either fixed-length or varying. The array elements may be aligned each element begins on a byte or word boundary or unaligned elements immediately follow each other with no padding.

PL/pgSQL and PostgreSQL's SQL support bit strings as native type. There are two SQL bit types: bit(n) and bit varying(n), where n is a positive integer. [8]

Hardware description languages such as VHDL, Verilog, and SystemVerilog natively support bit vectors as these are used to model storage elements like flip-flops, hardware busses and hardware signals in general. In hardware verification languages such as OpenVera, e and SystemVerilog, bit vectors are used to sample values from the hardware models, and to represent data that is transferred to hardware during simulations.

Common Lisp provides a one-dimensional bit-vector implementation as a special case of the built-in array, acting in a dual capacity as a class and a type specifier. [9] Being a derivative of the array, it relies on the general make-array function to be configured with an element type of bit, which optionally permits the bit vector to be designated as dynamically resizable. The bit-vector, however, is not infinite in extent. A more restricted simple-bit-vector type exists, which explicitly excludes the dynamic characteristics. [10] Bit vectors are represented as, and can be constructed in a more concise fashion by, the reader macro#*bits. [11] In addition to the general functions applicable to all arrays, dedicated operations exist for bit vectors. Single bits may be accessed and modified using the bit and sbit functions [12] and an extensive number of logical operations is supported. [13]

See also

Related Research Articles

In computer science, an array is a data structure consisting of a collection of elements, of same memory size, each identified by at least one array index or key. An array is stored such that the position of each element can be computed from its index tuple by a mathematical formula. The simplest type of data structure is a linear array, also called one-dimensional array.

<span class="mw-page-title-main">Hash function</span> Mapping arbitrary data to fixed-size values

A hash function is any function that can be used to map data of arbitrary size to fixed-size values, though there are some hash functions that support variable length output. The values returned by a hash function are called hash values, hash codes, digests, or simply hashes. The values are usually used to index a fixed-size table called a hash table. Use of a hash function to index a hash table is called hashing or scatter storage addressing.

<span class="mw-page-title-main">Trie</span> K-ary search tree data structure

In computer science, a trie, also called digital tree or prefix tree, is a type of k-ary search tree, a tree data structure used for locating specific keys from within a set. These keys are most often strings, with links between nodes defined not by the entire key, but by individual characters. In order to access a key, the trie is traversed depth-first, following the links between nodes, which represent each character in the key.

In computer programming, a bitwise operation operates on a bit string, a bit array or a binary numeral at the level of its individual bits. It is a fast and simple action, basic to the higher-level arithmetic operations and directly supported by the processor. Most bitwise operations are presented as two-operand instructions where the result replaces one of the input operands.

A bitboard is a specialized bit array data structure commonly used in computer systems that play board games, where each bit corresponds to a game board space or piece. This allows parallel bitwise operations to set or query the game state, or determine moves or plays in the game.

In computer science, primitive data types are a set of basic data types from which all other data types are constructed. Specifically it often refers to the limited set of data representations in use by a particular processor, which all compiled programs must use. Most processors support a similar set of primitive data types, although the specific representations vary. More generally, "primitive data types" may refer to the standard data types built into a programming language. Data types which are not primitive are referred to as derived or composite.

<span class="mw-page-title-main">C syntax</span> Set of rules defining correctly structured programs

The syntax of the C programming language is the set of rules governing writing of software in the C language. It is designed to allow for programs that are extremely terse, have a close relationship with the resulting object code, and yet provide relatively high-level data abstraction. C was the first widely successful high-level language for portable operating-system development.

<span class="mw-page-title-main">Pointer (computer programming)</span> Object which stores memory addresses in a computer program

In computer science, a pointer is an object in many programming languages that stores a memory address. This can be that of another value located in computer memory, or in some cases, that of memory-mapped computer hardware. A pointer references a location in memory, and obtaining the value stored at that location is known as dereferencing the pointer. As an analogy, a page number in a book's index could be considered a pointer to the corresponding page; dereferencing such a pointer would be done by flipping to the page with the given page number and reading the text found on that page. The actual format and content of a pointer variable is dependent on the underlying computer architecture.

The Burroughs B6x00-7x00 instruction set includes the set of valid operations for the Burroughs B6500, B7500 and later Burroughs large systems, including the current Unisys Clearpath/MCP systems; it does not include the instruction for other Burroughs large systems including the B5000, B5500, B5700 and the B8500. These unique machines have a distinctive design and instruction set. Each word of data is associated with a type, and the effect of an operation on that word can depend on the type. Further, the machines are stack based to the point that they had no user-addressable registers.

The computer programming languages C and Pascal have similar times of origin, influences, and purposes. Both were used to design their own compilers early in their lifetimes. The original Pascal definition appeared in 1969 and a first compiler in 1970. The first version of C appeared in 1972.

A bitmap index is a special kind of database index that uses bitmaps.

<span class="mw-page-title-main">C data types</span> Data types supported by the C programming language

In the C programming language, data types constitute the semantics and characteristics of storage of data elements. They are expressed in the language syntax in form of declarations for memory locations or variables. Data types also determine the types of operations or methods of processing of data elements.

Data structure alignment is the way data is arranged and accessed in computer memory. It consists of three separate but related issues: data alignment, data structure padding, and packing.

A bit field is a data structure that consists of one or more adjacent bits which have been allocated for specific purposes, so that any single bit or group of bits within the structure can be set or inspected. A bit field is most commonly used to represent integral types of known, fixed bit-width, such as single-bit Booleans.

Bit manipulation is the act of algorithmically manipulating bits or other pieces of data shorter than a word. Computer programming tasks that require bit manipulation include low-level device control, error detection and correction algorithms, data compression, encryption algorithms, and optimization. For most other tasks, modern programming languages allow the programmer to work directly with abstractions instead of bits that represent those abstractions.

In computer science, a type punning is any programming technique that subverts or circumvents the type system of a programming language in order to achieve an effect that would be difficult or impossible to achieve within the bounds of the formal language.

ALGOL 68RS is the second ALGOL 68 compiler written by I. F. Currie and J. D. Morrison, at the Royal Signals and Radar Establishment (RSRE). Unlike the earlier ALGOL 68-R, it was designed to be portable, and implemented the language of the Revised Report.

In computer software and hardware, find first set (ffs) or find first one is a bit operation that, given an unsigned machine word, designates the index or position of the least significant bit set to one in the word counting from the least significant bit position. A nearly equivalent operation is count trailing zeros (ctz) or number of trailing zeros (ntz), which counts the number of zero bits following the least significant one bit. The complementary operation that finds the index or position of the most significant set bit is log base 2, so called because it computes the binary logarithm ⌊log2(x)⌋. This is closely related to count leading zeros (clz) or number of leading zeros (nlz), which counts the number of zero bits preceding the most significant one bit. There are two common variants of find first set, the POSIX definition which starts indexing of bits at 1, herein labelled ffs, and the variant which starts indexing of bits at zero, which is equivalent to ctz and so will be called by that name.

In computing, sequence containers refer to a group of container class templates in the standard library of the C++ programming language that implement storage of data elements. Being templates, they can be used to store arbitrary elements, such as integers or custom classes. One common property of all sequential containers is that the elements can be accessed sequentially. Like all other standard library components, they reside in namespace std.

AVX-512 are 512-bit extensions to the 256-bit Advanced Vector Extensions SIMD instructions for x86 instruction set architecture (ISA) proposed by Intel in July 2013, and first implemented in the 2016 Intel Xeon Phi x200, and then later in a number of AMD and other Intel CPUs. AVX-512 consists of multiple extensions that may be implemented independently. This policy is a departure from the historical requirement of implementing the entire instruction block. Only the core extension AVX-512F is required by all AVX-512 implementations.

References

  1. "Linux Magic System Request Key Hacks". Kernel.org.
  2. Irving Copilowish (December 1948) "Matrix development of the calculus of relations", Journal of Symbolic Logic 13(4): 193–203 Jstor link
  3. "SGI.com Tech Archive Resources now retired". SGI. 2 January 2018.
  4. "dynamic_bitset<Block, Allocator> - 1.66.0". www.boost.org.
  5. ".NET mscorlib source code". github.com/microsoft. 15 October 2021.
  6. "perlop - perldoc.perl.org". perldoc.perl.org.
  7. "vec - perldoc.perl.org". perldoc.perl.org.
  8. "8.10. Bit String Types". 30 September 2021.
  9. "CLHS: System Class BIT-VECTOR". www.lispworks.com.
  10. "CLHS: Type SIMPLE-BIT-VECTOR". www.lispworks.com.
  11. "CLHS: Section 2.4.8.4". www.lispworks.com.
  12. "CLHS: Accessor BIT, SBIT". www.lispworks.com.
  13. "CLHS: Function BIT-AND, BIT-ANDC1, BIT-ANDC2..." www.lispworks.com.