Consistent Overhead Byte Stuffing


Consistent Overhead Byte Stuffing (COBS) is an algorithm for encoding data bytes that results in efficient, reliable, unambiguous packet framing regardless of packet content, thus making it easy for receiving applications to recover from malformed packets. It employs a particular byte value, typically zero, to serve as a packet delimiter (a special value that indicates the boundary between packets). When zero is used as a delimiter, the algorithm replaces each zero data byte with a non-zero value so that no zero data bytes will appear in the packet and thus be misinterpreted as packet boundaries.


Byte stuffing is a process that transforms a sequence of data bytes that may contain 'illegal' or 'reserved' values (such as a packet delimiter) into a potentially longer sequence that contains no occurrences of those values. The extra length of the transformed sequence is typically referred to as the overhead of the algorithm. HDLC framing is a well-known example, used particularly in PPP (see RFC 1662 § 4.2). Although HDLC framing has an overhead of <1% in the average case, it suffers from a very poor worst-case overhead of 100%; for inputs that consist entirely of bytes that require escaping, HDLC byte stuffing will double the size of the input.

The COBS algorithm, on the other hand, tightly bounds the worst-case overhead. COBS requires a minimum of 1 byte of overhead, and a maximum of ⌈n/254⌉ bytes for n data bytes (one byte in 254, rounded up). Consequently, the time to transmit the encoded byte sequence is highly predictable, which makes COBS useful for real-time applications in which jitter may be problematic. The algorithm is computationally inexpensive, and in addition to its desirable worst-case overhead, its average overhead is also low compared to other unambiguous framing algorithms like HDLC. [1] [2] COBS does, however, require up to 254 bytes of lookahead: before transmitting its first byte, it needs to know the position of the first zero byte (if any) in the following 254 bytes.
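For example, a 10,000-byte packet incurs at most ⌈10000/254⌉ = 40 bytes of overhead, under 0.4%. A minimal sketch of the corresponding worst-case output-size bound in C follows; the helper name is illustrative, not part of any published COBS API:

#include <stddef.h>

/* Worst-case size of COBS-encoded output for n input bytes, excluding
   the trailing delimiter: one code byte per 254 input bytes, rounded up,
   with a minimum overhead of one byte (covering the empty packet). */
size_t cobs_max_encoded_size(size_t n)
{
    return n + (n == 0 ? 1 : (n + 253) / 254);
}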

A 1999 Internet Draft proposed to standardize COBS as an alternative for HDLC framing in PPP, due to the aforementioned poor worst-case overhead of HDLC framing. [3]

Packet framing and stuffing

When packetized data is sent over any serial medium, some protocol is required to demarcate packet boundaries. This is done by using a framing marker, a special bit-sequence or character value that indicates where the boundaries between packets fall. Data stuffing is the process that transforms the packet data before transmission to eliminate all occurrences of the framing marker, so that when the receiver detects a marker, it can be certain that the marker indicates a boundary between packets.

COBS transforms an arbitrary string of bytes in the range [0,255] into bytes in the range [1,255]. Having eliminated all zero bytes from the data, a zero byte can now be used to unambiguously mark the end of the transformed data. This is done by appending a zero byte to the transformed data, thus forming a packet consisting of the COBS-encoded data (the payload) followed by a zero byte (the delimiter) that unambiguously marks the end of the packet.

(Any other byte value may be reserved as the packet delimiter, but using zero simplifies the description.)
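To illustrate the receiver side, the following sketch (with hypothetical names such as on_byte_received and handle_encoded_packet) accumulates bytes from a serial link and treats each zero byte as a frame boundary. Because encoded payloads contain no zeros, a receiver that starts listening mid-stream simply discards bytes until the first delimiter and is then synchronized:

#include <stddef.h>
#include <stdint.h>

#define MAX_ENCODED 255

/* Hypothetical callback invoked with one complete COBS-encoded packet. */
extern void handle_encoded_packet(const uint8_t *packet, size_t length);

/* Hypothetical per-byte receive hook for a serial link. */
void on_byte_received(uint8_t byte)
{
    static uint8_t packet[MAX_ENCODED];
    static size_t len = 0;
    static int overflow = 0;

    if (byte == 0x00) {                  /* delimiter: frame boundary */
        if (len > 0 && !overflow)
            handle_encoded_packet(packet, len);
        len = 0;
        overflow = 0;                    /* resynchronized either way */
    } else if (len < sizeof packet) {
        packet[len++] = byte;            /* non-zero bytes belong to the frame */
    } else {
        overflow = 1;                    /* frame too long: drop it, wait for delimiter */
    }
}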

[Figure: Consistent Overhead Byte Stuffing (COBS) encoding process]

There are two equivalent ways to describe the COBS encoding process:

Prefixed block description
To encode some bytes, first append a zero byte, then break them into groups of either 254 non-zero bytes, or 0–253 non-zero bytes followed by a zero byte. Because of the appended zero byte, this is always possible.
Encode each group by deleting the trailing zero byte (if any) and prepending the number of non-zero bytes, plus one. Thus, each encoded group is the same size as the original, except that 254 non-zero bytes are encoded into 255 bytes by prepending a byte of 255.
As a special exception, if a packet ends with a group of 254 non-zero bytes, it is not necessary to add the trailing zero byte. This saves one byte in some situations.
Linked list description
First, insert a zero byte at the beginning of the packet, and after every run of 254 non-zero bytes. This encoding is obviously reversible. It is not necessary to insert a zero byte at the end of the packet if it happens to end with exactly 254 non-zero bytes.
Second, replace each zero byte with the offset to the next zero byte, or the end of the packet. Because of the extra zeros added in the first step, each offset is guaranteed to be at most 255.
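The linked-list description translates almost directly into code. The following two-pass sketch (a hypothetical helper, written for clarity rather than speed; the one-pass implementation below is what one would actually use) first inserts the extra zero bytes and then overwrites each zero with the offset to the next one. The output excludes the trailing delimiter:

#include <stddef.h>
#include <stdint.h>

/* Two-pass COBS encoder following the linked-list description literally.
   "out" must hold at least length + length/254 + 2 bytes. */
size_t cobs_encode_two_pass(const uint8_t *in, size_t length, uint8_t *out)
{
    /* Pass 1: copy the input, inserting a placeholder zero at the start
       and after every run of 254 non-zero bytes. */
    size_t n = 0;                 /* bytes written to out */
    size_t run = 0;               /* length of the current non-zero run */
    out[n++] = 0;                 /* placeholder for the leading code byte */
    for (size_t i = 0; i < length; i++) {
        if (run == 254) {         /* break up runs longer than 254 */
            out[n++] = 0;
            run = 0;
        }
        out[n++] = in[i];
        run = in[i] ? run + 1 : 0;
    }

    /* Pass 2: scan backwards, replacing each zero byte with the offset
       to the next zero byte (or to the end of the encoded data). The
       zeros inserted in pass 1 guarantee every offset fits in one byte. */
    size_t next_zero = n;         /* position of the most recent zero seen */
    for (size_t i = n; i-- > 0; ) {
        if (out[i] == 0) {
            out[i] = (uint8_t)(next_zero - i);
            next_zero = i;
        }
    }
    return n;
}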

Encoding examples

These examples show how various data sequences would be encoded by the COBS algorithm. All bytes are expressed as hexadecimal values. In each encoded sequence, the first byte and each byte that replaced a zero are the code (overhead) bytes added or modified by the algorithm; the final 00 is the appended packet delimiter.

Example   Unencoded data (hex)      Encoded with COBS (hex)
1         00                        01 01 00
2         00 00                     01 01 01 00
3         00 11 00                  01 02 11 01 00
4         11 22 00 33               03 11 22 02 33 00
5         11 22 33 44               05 11 22 33 44 00
6         11 00 00 00               02 11 01 01 01 00
7         01 02 03 ... FD FE        FF 01 02 03 ... FD FE 00
8         00 01 02 ... FC FD FE     01 FF 01 02 ... FC FD FE 00
9         01 02 03 ... FD FE FF     FF 01 02 03 ... FD FE 02 FF 00
10        02 03 04 ... FE FF 00     FF 02 03 04 ... FE FF 01 01 00
11        03 04 05 ... FF 00 01     FE 03 04 05 ... FF 02 01 00

Below is a diagram using example 4 from the above table to illustrate how each modified data byte is located, and how it is identified as either a data byte or the end-of-frame byte.

   [OHB]                                 : Overhead byte (start of frame)
    3+---------------->|                 : Points to relative location of first zero symbol
                       2+--------->|     : Is a zero data byte, pointing to next zero symbol
                                   [EOP] : Location of end-of-packet zero symbol

    0     1     2     3     4     5      : Byte position
    03    11    22    02    33    00     : COBS data frame
          11    22    00    33           : Extracted data

   OHB = Overhead Byte (points to next zero symbol)
   EOP = End Of Packet

Examples 7 through 11 show how the overhead varies depending on the data being encoded, for packet lengths of 255 or more.

Implementation

The following code implements a COBS encoder and decoder in the C programming language:

#include <stddef.h>
#include <stdint.h>
#include <assert.h>

/** COBS encode data to buffer
    @param data Pointer to input data to encode
    @param length Number of bytes to encode
    @param buffer Pointer to encoded output buffer
    @return Encoded buffer length in bytes
    @note Does not output delimiter byte
*/
size_t cobsEncode(const void *data, size_t length, uint8_t *buffer)
{
    assert(data && buffer);

    uint8_t *encode = buffer;  // Encoded byte pointer
    uint8_t *codep = encode++; // Output code pointer
    uint8_t code = 1;          // Code value

    for (const uint8_t *byte = (const uint8_t *)data; length--; ++byte)
    {
        if (*byte) // Byte not zero, write it
            *encode++ = *byte, ++code;

        if (!*byte || code == 0xff) // Input is zero or block completed, restart
        {
            *codep = code, code = 1, codep = encode;
            if (!*byte || length)
                ++encode;
        }
    }
    *codep = code; // Write final code value

    return (size_t)(encode - buffer);
}

/** COBS decode data from buffer
    @param buffer Pointer to encoded input bytes
    @param length Number of bytes to decode
    @param data Pointer to decoded output data
    @return Number of bytes successfully decoded
    @note Stops decoding if delimiter byte is found
*/
size_t cobsDecode(const uint8_t *buffer, size_t length, void *data)
{
    assert(buffer && data);

    const uint8_t *byte = buffer;      // Encoded input byte pointer
    uint8_t *decode = (uint8_t *)data; // Decoded output byte pointer

    for (uint8_t code = 0xff, block = 0; byte < buffer + length; --block)
    {
        if (block) // Decode block byte
            *decode++ = *byte++;
        else
        {
            block = *byte++;             // Fetch the next block length
            if (block && (code != 0xff)) // Encoded zero, write it unless it's delimiter
                *decode++ = 0;
            code = block;
            if (!code) // Delimiter code found
                break;
        }
    }

    return (size_t)(decode - (uint8_t *)data);
}
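A brief usage sketch follows, reusing the functions above and example 4 from the table (the main function is illustrative only). Note that since the encoder does not emit the delimiter, the caller appends the zero byte itself:

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

int main(void)
{
    const uint8_t input[] = { 0x11, 0x22, 0x00, 0x33 }; /* example 4 */
    uint8_t encoded[8], decoded[8];

    size_t enc_len = cobsEncode(input, sizeof input, encoded);
    encoded[enc_len] = 0x00;                /* append the packet delimiter */

    for (size_t i = 0; i <= enc_len; i++)
        printf("%02X ", encoded[i]);        /* prints: 03 11 22 02 33 00 */
    printf("\n");

    size_t dec_len = cobsDecode(encoded, enc_len, decoded);
    printf("%zu bytes decoded\n", dec_len); /* 4 bytes: 11 22 00 33 */
    return 0;
}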


References

  1. Cheshire, Stuart; Baker, Mary (April 1999). "Consistent Overhead Byte Stuffing" (PDF). IEEE/ACM Transactions on Networking. 7 (2): 159–172. CiteSeerX 10.1.1.108.3143. doi:10.1109/90.769765. S2CID 47267776. Retrieved November 30, 2015.
  2. Cheshire, Stuart; Baker, Mary (17 November 1997). Consistent Overhead Byte Stuffing (PDF). ACM SIGCOMM '97. Cannes. Retrieved November 23, 2010.
  3. Carlson, James; Cheshire, Stuart; Baker, Mary (November 1997). PPP Consistent Overhead Byte Stuffing (COBS). I-D draft-ietf-pppext-cobs-00.txt.