LEB128 or Little Endian Base 128 is a variable-length code compression used to store arbitrarily large integers in a small number of bytes. LEB128 is used in the DWARF debug file format [1] [2] and the WebAssembly binary encoding for all integer literals. [3]
LEB128 format is very similar to variable-length quantity (VLQ) format; the primary difference is that LEB128 is little-endian whereas variable-length quantities are big-endian. Both allow small numbers to be stored in a single byte, while also allowing encoding of arbitrarily long numbers. There are two versions of LEB128: unsigned LEB128 and signed LEB128. The decoder must know whether the encoded value is unsigned LEB128 or signed LEB128.
To encode an unsigned number using unsigned LEB128 (ULEB128) first represent the number in binary. Then zero extend the number up to a multiple of 7 bits (such that if the number is non-zero, the most significant 7 bits are not all 0). Break the number up into groups of 7 bits. Output one encoded byte for each 7 bit group, from least significant to most significant group. Each byte will have the group in its 7 least significant bits. Set the most significant bit on each byte except the last byte. The number zero is usually encoded as a single byte 0x00. WebAssembly allows alternate encodings of zero (0x80 0x00, 0x80 0x80 0x00, ...). [3]
As an example, here is how the unsigned number 624485 gets encoded:
MSB ------------------ LSB 10011000011101100101 In raw binary 010011000011101100101 Padded to a multiple of 7 bits 0100110 0001110 1100101 Split into 7-bit groups 00100110 10001110 11100101 Add high 1 bits on all but last (most significant) group to form bytes 0x26 0x8E 0xE5 In hexadecimal → 0xE5 0x8E 0x26 Output stream (LSB to MSB)
Unsigned LEB128 and VLQ (variable-length quantity) both compress any given integer into not only the same number of bits, but exactly the same bits—the two formats differ only in exactly how those bits are arranged.
A signed number is represented similarly: Starting with an -bit two's complement representation, where is a multiple of 7, the number is broken into groups as for the unsigned encoding.
For example, the signed number -123456 is encoded as 0xC0 0xBB 0x78:
MSB ------------------ LSB 11110001001000000 Binary encoding of 123456 000011110001001000000 As a 21-bit number 111100001110110111111 Negating all bits (ones' complement) 111100001110111000000 Adding one (two's complement) 1111000 0111011 1000000 Split into 7-bit groups 01111000 10111011 11000000 Add high 1 bits on all but last (most significant) group to form bytes 0x78 0xBB 0xC0 In hexadecimal → 0xC0 0xBB 0x78 Output stream (LSB to MSB)
A straightforward scalar implementation of LEB128 decoding is fairly slow, even more so on modern hardware where branch misprediction is relatively expensive. A series of papers presents SIMD techniques for accelerating decoding (it is called VByte in these papers, but is another name for the same encoding). The "Vectorized VByte Decoding" paper [4] presented "Masked VByte", which demonstrated speeds of 650–2700 million integers per second on commodity Haswell hardware, depending on encoding density. A followup paper presented a variant encoding, "Stream VByte: Faster Byte Oriented Integer Compression", [5] which increased speeds to over 4 billion integers per second. This stream encoding separates the control stream from the encoded data, so is not binary compatible with LEB128.
do{byte=low-order7bitsofvalue;value>>=7;if(value!=0)/* more bytes to come */sethigh-orderbitofbyte;emitbyte;}while(value!=0);
more=1;negative=(value<0);/* the size in bits of the variable value, e.g., 64 if value's type is int64_t */size=no.ofbitsinsignedinteger;while(more){byte=low-order7bitsofvalue;value>>=7;/* the following is only necessary if the implementation of >>= uses a logical shift rather than an arithmetic shift for a signed left operand */if(negative)value|=(~0<<(size-7));/* sign extend *//* sign bit of byte is second high-order bit (0x40) */if((value==0&&signbitofbyteisclear)||(value==-1&&signbitofbyteisset))more=0;elsesethigh-orderbitofbyte;emitbyte;}
result=0;shift=0;while(true){byte=nextbyteininput;result|=(low-order7bitsofbyte)<<shift;if(high-orderbitofbyte==0)break;shift+=7;}
result=0;shift=0;/* the size in bits of the result variable, e.g., 64 if result's type is int64_t */size=numberofbitsinsignedinteger;do{byte=nextbyteininput;result|=(low-order7bitsofbyte<<shift);shift+=7;}while(high-orderbitofbyte!=0);/* sign bit of byte is second high-order bit (0x40) */if((shift<size)&&(signbitofbyteisset))/* sign extend */result|=(~0<<shift);
functionencodeUnSignedInt32toLeb128(value){varrawbine=value.toString(2);//In raw binaryvarpaded=rawbine.padStart(Math.ceil(((rawbine.length)/7))*7,'0')//Padded to a multiple of 7 bitsvarsplited=paded.match(/.{1,7}/g);//Split into 7 - bit groupsconstresult=[];//Add high 1 bits on all but last(most significant) group to form bytesvary=splited.length-1for(leti=0;i<splited.length;i++){if(i===0){varstr="0"+splited[i];result[y]=parseInt(str,2).toString(16).padStart(2,'0');}else{varstr="1"+splited[i];result[y]=parseInt(str,2).toString(16).padStart(2,'0');}y--;}returnresult.join('');};
functiondecodeLeb128toUnSignedInt32(value){vartabvalue=value.split("");varresult="";consttabtemp=[];vary=tabvalue.length/2-1;for(leti=0;i<tabvalue.length;i++){if(i%2!=0){tabtemp[y]=((parseInt(tabvalue[i-1],16).toString(2)).padStart(4,'0')+(parseInt(tabvalue[i],16).toString(2)).padStart(4,'0')).slice(1);y--;}}result=parseInt(tabtemp.join(''),2).toString(10)returnresult;};
constencodeSignedLeb128FromInt32=(value)=>{value|=0;constresult=[];while(true){constbyte_=value&0x7f;value>>=7;if((value===0&&(byte_&0x40)===0)||(value===-1&&(byte_&0x40)!==0)){result.push(byte_);returnresult;}result.push(byte_|0x80);}};
constdecodeSignedLeb128=(input)=>{letresult=0;letshift=0;while(true){constbyte=input.shift();result|=(byte&0x7f)<<shift;shift+=7;if((0x80&byte)===0){if(shift<32&&(byte&0x40)!==0){returnresult|(~0<<shift);}returnresult;}}};
In computer science, an integer is a datum of integral data type, a data type that represents some range of mathematical integers. Integral data types may be of different sizes and may or may not be allowed to contain negative values. Integers are commonly represented in a computer as a group of binary digits (bits). The size of the grouping varies so the set of integer sizes available varies between different types of computers. Computer hardware nearly always provides a way to represent a processor register or memory address as an integer.
PackBits is a fast, simple lossless compression scheme for run-length encoding of data.
A computer number format is the internal representation of numeric values in digital device hardware and software, such as in programmable computers and calculators. Numerical values are stored as groupings of bits, such as bytes and words. The encoding between numerical values and bit patterns is chosen for convenience of the operation of the computer; the encoding used by the computer's instruction set generally requires conversion for external use, such as for printing and display. Different types of processors may have different internal representations of numerical values and different conventions are used for integer and real numbers. Most calculations are carried out with number formats that fit into a processor register, but some software systems allow representation of arbitrarily large numbers using multiple words of memory.
G.711 is a narrowband audio codec originally designed for use in telephony that provides toll-quality audio at 64 kbit/s. It is an ITU-T standard (Recommendation) for audio encoding, titled Pulse code modulation (PCM) of voice frequencies released for use in 1972.
x86 assembly language is the name for the family of assembly languages which provide some level of backward compatibility with CPUs back to the Intel 8008 microprocessor, which was launched in April 1972. It is used to produce object code for the x86 class of processors.
INT is an assembly language instruction for x86 processors that generates a software interrupt. It takes the interrupt number formatted as a byte value.
In computer programming, a bitwise operation operates on a bit string, a bit array or a binary numeral at the level of its individual bits. It is a fast and simple action, basic to the higher-level arithmetic operations and directly supported by the processor. Most bitwise operations are presented as two-operand instructions where the result replaces one of the input operands.
The Lempel–Ziv–Markov chain algorithm (LZMA) is an algorithm used to perform lossless data compression. It has been under development since either 1996 or 1998 by Igor Pavlov and was first used in the 7z format of the 7-Zip archiver. This algorithm uses a dictionary compression scheme somewhat similar to the LZ77 algorithm published by Abraham Lempel and Jacob Ziv in 1977 and features a high compression ratio and a variable compression-dictionary size, while still maintaining decompression speed similar to other commonly used compression algorithms.
External Data Representation (XDR) is a standard data serialization format, for uses such as computer network protocols. It allows data to be transferred between different kinds of computer systems. Converting from the local representation to XDR is called encoding. Converting from XDR to the local representation is called decoding. XDR is implemented as a software library of functions which is portable between different operating systems and is also independent of the transport layer.
The syntax of the C programming language is the set of rules governing writing of software in C. It is designed to allow for programs that are extremely terse, have a close relationship with the resulting object code, and yet provide relatively high-level data abstraction. C was the first widely successful high-level language for portable operating-system development.
Shift JIS is a character encoding for the Japanese language, originally developed by the Japanese company ASCII Corporation in conjunction with Microsoft and standardized as JIS X 0208 Appendix 1.
Extended Unix Code (EUC) is a multibyte character encoding system used primarily for Japanese, Korean, and simplified Chinese (characters).
The Fletcher checksum is an algorithm for computing a position-dependent checksum devised by John G. Fletcher (1934–2012) at Lawrence Livermore Labs in the late 1970s. The objective of the Fletcher checksum was to provide error-detection properties approaching those of a cyclic redundancy check but with the lower computational effort associated with summation techniques.
Bencode is the encoding used by the peer-to-peer file sharing system BitTorrent for storing and transmitting loosely structured data.
The computer programming languages C and Pascal have similar times of origin, influences, and purposes. Both were used to design their own compilers early in their lifetimes. The original Pascal definition appeared in 1969 and a first compiler in 1970. The first version of C appeared in 1972.
In the C programming language, data types constitute the semantics and characteristics of storage of data elements. They are expressed in the language syntax in form of declarations for memory locations or variables. Data types also determine the types of operations or methods of processing of data elements.
Action Message Format (AMF) is a binary format used to serialize object graphs such as ActionScript objects and XML, or send messages between an Adobe Flash client and a remote service, usually a Flash Media Server or third party alternatives. The Actionscript 3 language provides classes for encoding and decoding from the AMF format.
A variable-length quantity (VLQ) is a universal code that uses an arbitrary number of binary octets to represent an arbitrarily large integer. A VLQ is essentially a base-128 representation of an unsigned integer with the addition of the eighth bit to mark continuation of bytes. VLQ is identical to LEB128 except in endianness. See the example below.
In the C programming language, operations can be performed on a bit level using bitwise operators.
Concise Binary Object Representation (CBOR) is a binary data serialization format loosely based on JSON authored by Carsten Bormann and Paul Hoffman. Like JSON it allows the transmission of data objects that contain name–value pairs, but in a more concise manner. This increases processing and transfer speeds at the cost of human readability. It is defined in IETF RFC 8949.
{{cite journal}}
: Cite journal requires |journal=
(help)