Run-length encoding

Last updated

Run-length encoding (RLE) is a form of lossless data compression in which runs of data (consecutive occurrences of the same data value) are stored as a single occurrence of that data value and a count of its consecutive occurrences, rather than as the original run. As an imaginary example of the concept, when encoding an image built up from colored dots, the sequence "green green green green green green green green green" is shortened to "green x 9". This is most efficient on data that contains many such runs, for example, simple graphic images such as icons, line drawings, games, and animations. For files that do not have many runs, encoding them with RLE could increase the file size.

Contents

RLE may also refer in particular to an early graphics file format supported by CompuServe for compressing black and white images, that was widely supplanted by their later Graphics Interchange Format (GIF).

RLE also refers to a little-used image format in Windows 3.x that is saved with the file extension rle; it is a run-length encoded bitmap, and the format was used for the Windows 3.x startup screen.

History and applications

Run-length encoding (RLE) schemes were employed in the transmission of analog television signals as far back as 1967. [1] In 1983, run-length encoding was patented by Hitachi. [2] [3] [4] RLE is particularly well suited to palette-based bitmap images (which use relatively few colours) such as computer icons, and was a popular image compression method on early online services such as CompuServe before the advent of more sophisticated formats such as GIF. [5] It does not work well on continuous-tone images (which use very many colours) such as photographs, although JPEG uses it on the coefficients that remain after transforming and quantizing image blocks.

Common formats for run-length encoded data include Truevision TGA, PackBits (by Apple, used in MacPaint), PCX and ILBM. The International Telecommunication Union also describes a standard to encode run-length colour for fax machines, known as T.45. [6] That fax colour coding standard, which along with other techniques is incorporated into Modified Huffman coding,[ citation needed ] is relatively efficient because most faxed documents are primarily white space, with occasional interruptions of black.

Algorithm

RLE has a space complexity of , where n is the size of the input data.

Encoding algorithm

Run-length encoding compresses data by reducing the physical size of a repeating string of characters. This process involves converting the input data into a compressed format by identifying and counting consecutive occurrences of each character. The steps are as follows:

  1. Traverse the input data.
  2. Count the number of consecutive repeating characters (run length).
  3. Store the character and its run length.

Python implementation

Imports and helper functions
fromitertoolsimportrepeat,compress,groupbydefilen(iterable):"""    Return the number of items in iterable.    >>> ilen(x for x in range(1000000) if x % 3 == 0)    333334    """# using zip() to wrap the input with 1-tuples which compress() reads as true values.returnsum(compress(repeat(1),zip(iterable)))
defrle_encode(iterable,*,length_first=True):"""    >>> "".join(rle_encode("AAAABBBCCDAA"))    '4A3B2C1D2A'    >>> "".join(rle_encode("AAAABBBCCDAA", length_first=False))    'A4B3C2D1A2'    """return(f"{ilen(g)}{k}"iflength_firstelsef"{k}{ilen(g)}"# ilen(g): length of iterable gfork,gingroupby(iterable))

[7]

Decoding algorithm

The decoding process involves reconstructing the original data from the encoded format by repeating characters according to their counts. The steps are as follows:

  1. Traverse the encoded data.
  2. For each count-character pair, repeat the character count times.
  3. Append these characters to the result string.

Python implementation

Imports
fromitertoolsimportchain,repeat,batched
defrle_decode(iterable,*,length_first=True):"""    >>> "".join(rle_decode("4A3B2C1D2A"))    'AAAABBBCCDAA'    >>> "".join(rle_decode("A4B3C2D1A2", length_first=False))    'AAAABBBCCDAA'    """returnchain.from_iterable(repeat(b,int(a))iflength_firstelserepeat(a,int(b))fora,binbatched(iterable,2))

[7]

Example

Consider a screen containing plain black text on a solid white background. There will be many long runs of white pixels in the blank space, and many short runs of black pixels within the text. A hypothetical scan line, with B representing a black pixel and W representing white, might read as follows:

WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW

With a run-length encoding (RLE) data compression algorithm applied to the above hypothetical scan line, it can be rendered as follows:

12W1B12W3B24W1B14W

This can be interpreted as a sequence of twelve Ws, one B, twelve Ws, three Bs, etc., and represents the original 67 characters in only 18. While the actual format used for the storage of images is generally binary rather than ASCII characters like this, the principle remains the same. Even binary data files can be compressed with this method; file format specifications often dictate repeated bytes in files as padding space. However, newer compression methods such as DEFLATE often use LZ77-based algorithms, a generalization of run-length encoding that can take advantage of runs of strings of characters (such as BWWBWWBWWBWW).

Run-length encoding can be expressed in multiple ways to accommodate data properties as well as additional compression algorithms. For instance, one popular method encodes run lengths for runs of two or more characters only, using an "escape" symbol to identify runs, or using the character itself as the escape, so that any time a character appears twice it denotes a run. On the previous example, this would give the following:

WW12BWW12BB3WW24BWW14

This would be interpreted as a run of twelve Ws, a B, a run of twelve Ws, a run of three Bs, etc. In data where runs are less frequent, this can significantly improve the compression rate.

One other matter is the application of additional compression algorithms. Even with the runs extracted, the frequencies of different characters may be large, allowing for further compression; however, if the run lengths are written in the file in the locations where the runs occurred, the presence of these numbers interrupts the normal flow and makes it harder to compress. To overcome this, some run-length encoders separate the data and escape symbols from the run lengths, so that the two can be handled independently. For the example data, this would result in two outputs, the string "WWBWWBBWWBWW" and the numbers (12,12,3,24,14).

Variants

See also

Related Research Articles

In information theory, data compression, source coding, or bit-rate reduction is the process of encoding information using fewer bits than the original representation. Any particular compression is either lossy or lossless. Lossless compression reduces bits by identifying and eliminating statistical redundancy. No information is lost in lossless compression. Lossy compression reduces bits by removing unnecessary or less important information. Typically, a device that performs data compression is referred to as an encoder, and one that performs the reversal of the process (decompression) as a decoder.

<span class="mw-page-title-main">JPEG</span> Lossy compression method for reducing the size of digital images

JPEG is a commonly used method of lossy compression for digital images, particularly for those images produced by digital photography. The degree of compression can be adjusted, allowing a selectable tradeoff between storage size and image quality. JPEG typically achieves 10:1 compression with little perceptible loss in image quality. Since its introduction in 1992, JPEG has been the most widely used image compression standard in the world, and the most widely used digital image format, with several billion JPEG images produced every day as of 2015.

Lossless compression is a class of data compression that allows the original data to be perfectly reconstructed from the compressed data with no loss of information. Lossless compression is possible because most real-world data exhibits statistical redundancy. By contrast, lossy compression permits reconstruction only of an approximation of the original data, though usually with greatly improved compression rates.

PCX, standing for PiCture eXchange, is an image file format developed by the now-defunct ZSoft Corporation of Marietta, Georgia, United States. It was the native file format for PC Paintbrush and became one of the first widely accepted DOS imaging standards, although it has since been succeeded by more sophisticated image formats, such as BMP, JPEG, and PNG. PCX files commonly store palette-indexed images ranging from 2 or 4 colors to 16 and 256 colors, although the format has been extended to record true-color (24-bit) images as well.

PackBits is a fast, simple lossless compression scheme for run-length encoding of data.

<span class="mw-page-title-main">Raster graphics</span> Image display as a 2D grid of pixels

In computer graphics and digital photography, a raster graphic represents a two-dimensional picture as a rectangular matrix or grid of pixels, viewable via a computer display, paper, or other display medium. A raster image is technically characterized by the width and height of the image in pixels and by the number of bits per pixel. Raster images are stored in image files with varying dissemination, production, generation, and acquisition formats.

bzip2 File compression software

bzip2 is a free and open-source file compression program that uses the Burrows–Wheeler algorithm. It only compresses single files and is not a file archiver. It relies on separate external utilities for tasks such as handling multiple files, encryption, and archive-splitting.

Lempel–Ziv–Welch (LZW) is a universal lossless data compression algorithm created by Abraham Lempel, Jacob Ziv, and Terry Welch. It was published by Welch in 1984 as an improved implementation of the LZ78 algorithm published by Lempel and Ziv in 1978. The algorithm is simple to implement and has the potential for very high throughput in hardware implementations. It is the algorithm of the Unix file compression utility compress and is used in the GIF image format.

The BMP file format, or bitmap, is a raster graphics image file format used to store bitmap digital images, independently of the display device, especially on Microsoft Windows and OS/2 operating systems.

<span class="mw-page-title-main">ILBM</span> File format

Interleaved Bitmap (ILBM) is an image file format conforming to the Interchange File Format (IFF) standard. The format originated on the Amiga platform, and on IBM-compatible systems, files in this format or the related PBM format are typically encountered in games from late 1980s and early 1990s that were either Amiga ports or had their graphical assets designed on Amiga machines.

Truevision TGA, often referred to as TARGA, is a raster graphics file format created by Truevision Inc.. It was the native format of TARGA and VISTA boards, which were the first graphic cards for IBM-compatible PCs to support high color or true color display. This family of graphic cards was intended for professional computer image synthesis and video editing with PCs; for this reason, usual resolutions of TGA image files match those of the NTSC and PAL video formats.

<span class="mw-page-title-main">Smacker video</span> Digital video file format

Smacker video is a video file format developed by Epic Games Tools, and primarily used for full-motion video in video games. Smacker uses an adaptive 8-bit RGB palette. RAD's format for video at higher color depths is Bink Video. The Smacker format specifies a container format, a video compression format, and an audio compression format. Since its release in 1994, Smacker has been used in over 2300 games. Blizzard used this format for the cinematic videos seen in its games Warcraft II, StarCraft and Diablo I.

QuickTime Animation format is a video compression format and codec created by Apple Computer to enable playback of RGB video in real time without expensive hardware. It is generally found in the QuickTime container with the FourCC 'rle '. It can perform either lossless or lossy compression and is one of the few video codecs that supports an alpha channel. Supported color depths are 1-bit (monochrome), 15-bit RGB, 24-bit RGB, 32-bit ARGB, as well as palettized RGB. As a result of reverse-engineering of the format, a decoder is implemented in XAnim as well as an encoder and decoder in libavcodec.

Modified Huffman coding is used in fax machines to encode black-on-white images (bitmaps). It combines the variable-length codes of Huffman coding with the coding of repetitive data in run-length encoding.

An image file format is a file format for a digital image. There are many formats that can be used, such as JPEG, PNG, and GIF. Most formats up until 2022 were for storing 2D images, not 3D ones. The data stored in an image file format may be compressed or uncompressed. If the data is compressed, it may be done so using lossy compression or lossless compression. For graphic design applications, vector formats are often used. Some image file formats support transparency.

JBIG2 is an image compression standard for bi-level images, developed by the Joint Bi-level Image Experts Group. It is suitable for both lossless and lossy compression. According to a press release from the Group, in its lossless mode JBIG2 typically generates files 3–5 times smaller than Fax Group 4 and 2–4 times smaller than JBIG, the previous bi-level compression standard released by the Group. JBIG2 was published in 2000 as the international standard ITU T.88, and in 2001 as ISO/IEC 14492.

The Apple Icon Image format (.icns) is an icon format used in Apple Inc.'s macOS. It supports icons of 16 × 16, 32 × 32, 48 × 48, 128 × 128, 256 × 256, 512 × 512 points at 1x and 2x scale, with both 1- and 8-bit alpha channels and multiple image states. The fixed-size icons can be scaled by the operating system and displayed at any intermediate size.

Silicon Graphics Image (SGI) or the RGB file format is the native raster graphics file format for Silicon Graphics workstations. The format was invented by Paul Haeberli. It can be run-length encoded (RLE). FFmpeg and ImageMagick, among others, support this format.

<span class="mw-page-title-main">Sixel</span> Bitmap graphics format

Sixel, short for "six pixels", is a bitmap graphics format supported by terminals and printers from DEC. It consists of a pattern six pixels high and one wide, resulting in 64 possible patterns. Each possible pattern is assigned an ASCII character, making the sixels easy to transmit on 7-bit serial links.

<span class="mw-page-title-main">Progressive Graphics File</span> File format

PGF is a wavelet-based bitmapped image format that employs lossless and lossy data compression. PGF was created to improve upon and replace the JPEG format. It was developed at the same time as JPEG 2000 but with a focus on speed over compression ratio.

References

  1. Robinson, A. H.; Cherry, C. (1967). "Results of a prototype television bandwidth compression scheme". Proceedings of the IEEE . 55 (3). IEEE: 356–364. doi:10.1109/PROC.1967.5493.
  2. "Run Length Encoding Patents". Internet FAQ Consortium. 21 March 1996. Retrieved 14 July 2019.
  3. "Method and system for data compression and restoration". Google Patents . 7 August 1984. Retrieved 14 July 2019.
  4. "Data recording method". Google Patents . 8 August 1983. Retrieved 14 July 2019.
  5. Dunn, Christopher (1987). "Smile! You're on RLE!" (PDF). The Transactor . 7 (6). Transactor Publishing: 16–18. Retrieved 2015-12-06.
  6. Recommendation T.45 (02/00): Run-length colour encoding. International Telecommunication Union. 2000. Retrieved 2015-12-06.
  7. 1 2 "more-itertools 10.4.0 documentation". August 2024.