Xxencoding


xxencode is a binary-to-text encoding similar to uuencode that uses only the alphanumeric characters and the plus and minus signs. It was invented as a means of transferring files in a format that would survive character-set translation, particularly translation between ASCII and the EBCDIC encoding used on IBM mainframes. [1]


The encoding process

xxencoded data starts with a line of the form:

 begin <mode> <file>

where <mode> is the file's read/write/execute permissions as three octal digits, and <file> is the name to be used when recreating the binary data.

xxencode repeatedly takes groups of three bytes from the input, adding trailing zero bytes if fewer than three bytes remain. These 24 bits are split into four 6-bit numbers, each of which is used as an index into the following 64-character table:

            1         2         3         4         5         6
 0123456789012345678901234567890123456789012345678901234567890123
 |         |         |         |         |         |         |
 +-0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
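
The mapping can be expressed directly in code. The following Python sketch (the names XX_ALPHABET and encode_group are illustrative, not taken from any historical implementation) shows how one group of up to three bytes becomes four characters from the table above:

 # The xxencode alphabet: indices 0..63 map to '+', '-', '0'-'9', 'A'-'Z', 'a'-'z'.
 XX_ALPHABET = "+-0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

 def encode_group(three_bytes: bytes) -> str:
     """Encode up to 3 input bytes (zero-padded) as four alphabet characters."""
     padded = three_bytes.ljust(3, b"\0")                  # add trailing zero bytes
     n = (padded[0] << 16) | (padded[1] << 8) | padded[2]  # one 24-bit number
     return "".join(XX_ALPHABET[(n >> shift) & 0x3F]       # four 6-bit indices
                    for shift in (18, 12, 6, 0))

For example, encode_group(b"htt") returns "O5Fo", the encoding of the first three bytes ("htt") in the example further down.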

Each group of sixty output characters (corresponding to 45 input bytes) is written as a separate line, preceded by an encoded character giving the number of input bytes encoded on that line. For every line except the last, this is the character 'h' (the character mapping to the value 45). If the input length is not evenly divisible by 45, the last line contains the remaining output characters, preceded by the count of remaining input bytes encoded in the same way. Finally, a line containing just a single space (or plus character) is written, followed by one line containing the string "end".
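
As a rough sketch of the whole process (reusing XX_ALPHABET and encode_group from the previous snippet; an illustration of the rules described above, not the original xxencode source):

 def xxencode(data: bytes, name: str, mode: str = "644") -> str:
     """Produce a complete xxencoded text for the given bytes."""
     lines = [f"begin {mode} {name}"]
     for i in range(0, len(data), 45):                 # 45 input bytes per line
         chunk = data[i:i + 45]
         body = "".join(encode_group(chunk[j:j + 3])
                        for j in range(0, len(chunk), 3))
         # Prefix: the character whose index is the byte count ('h' for 45 bytes).
         lines.append(XX_ALPHABET[len(chunk)] + body)
     lines.append("+")                                 # terminator line ('+' or a space)
     lines.append("end")
     return "\n".join(lines) + "\n"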

xxencoded data is generally distinguishable from uuencoded data by the first character of a data line ('h' for xxencode, 'M' for uuencode). This assumes the output contains at least one full-length line (45 encoded bytes, i.e. 60 characters).
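
That heuristic amounts to inspecting the length character of the first full data line; a minimal sketch with a hypothetical helper name:

 def guess_scheme(first_data_line: str) -> str:
     """Guess the encoding scheme from a full-length (45-byte) data line."""
     if first_data_line.startswith("h"):
         return "xxencode"   # 'h' is index 45 in the xxencode alphabet
     if first_data_line.startswith("M"):
         return "uuencode"   # 'M' is chr(32 + 45), uuencode's length character for 45
     return "unknown"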

Example

The following is an example of xxencoding a one-line text file. In this example, %0D is the byte representation for carriage return (CR), and %0A is the byte representation for line feed (LF).

file
 File Name = wikipedia-url.txt
 File Contents = http://www.wikipedia.org%0D%0A

xxencoding
 begin 644 wikipedia-url.txt
 OO5FoQ1cj9rRrRmtrOKhdQ4JYOK2iPr7b1Ec+
 +
 end
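
The data line above can be checked with a small decoder sketch (hypothetical helper, no error handling, reusing XX_ALPHABET from the earlier snippet):

 def xxdecode_line(line: str) -> bytes:
     """Decode a single xxencoded data line back to its original bytes."""
     count = XX_ALPHABET.index(line[0])                # leading length character
     values = [XX_ALPHABET.index(c) for c in line[1:]]
     out = bytearray()
     for i in range(0, len(values), 4):
         a, b, c, d = values[i:i + 4]
         n = (a << 18) | (b << 12) | (c << 6) | d      # rebuild the 24-bit group
         out += bytes([(n >> 16) & 0xFF, (n >> 8) & 0xFF, n & 0xFF])
     return bytes(out[:count])                         # discard the padding bytes

 # xxdecode_line("OO5FoQ1cj9rRrRmtrOKhdQ4JYOK2iPr7b1Ec+")
 # == b"http://www.wikipedia.org\r\n"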


Related Research Articles

Plain text: term for computer data consisting only of unformatted characters of readable material

In computing, plain text is a loose term for data that represent only characters of readable material but not its graphical representation nor other objects. It may also include a limited number of "whitespace" characters that affect simple arrangement of text, such as spaces, line breaks, or tabulation characters. Plain text is different from formatted text, where style information is included; from structured text, where structural parts of the document such as paragraphs, sections, and the like are identified; and from binary files in which some portions must be interpreted as binary objects.

UTF-8 is a variable-length character encoding standard used for electronic communication. Defined by the Unicode Standard, the name is derived from Unicode Transformation Format – 8-bit.

Lempel–Ziv–Welch (LZW) is a universal lossless data compression algorithm created by Abraham Lempel, Jacob Ziv, and Terry Welch. It was published by Welch in 1984 as an improved implementation of the LZ78 algorithm published by Lempel and Ziv in 1978. The algorithm is simple to implement and has the potential for very high throughput in hardware implementations. It is the algorithm of the Unix file compression utility compress and is used in the GIF image format.

A text file is a kind of computer file that is structured as a sequence of lines of electronic text. A text file exists stored as data within a computer file system. In operating systems such as CP/M and DOS, where the operating system does not keep track of the file size in bytes, the end of a text file is denoted by placing one or more special characters, known as an end-of-file (EOF) marker, as padding after the last line in a text file. On modern operating systems such as Microsoft Windows and Unix-like systems, text files do not contain any special EOF character, because file systems on those operating systems keep track of the file size in bytes. Most text files need to have end-of-line delimiters, which are handled in a few different ways depending on the operating system. Some operating systems with record-oriented file systems may not use newline delimiters and will primarily store text files with lines separated as fixed or variable length records.

In computer programming, Base64 is a group of binary-to-text encoding schemes that represent binary data by splitting it into 24-bit groups, each of which is written as four 6-bit Base64 digits.
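
That 24-bit grouping is the same idea used by xxencode, only with a different 64-character alphabet. A quick illustration using Python's standard base64 module:

 import base64

 # Three input bytes yield exactly four Base64 characters (no '=' padding needed).
 print(base64.b64encode(b"htt"))   # b'aHR0'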

Newline: special characters in computing signifying the end of a line of text

A newline is a control character or sequence of control characters in character encoding specifications such as ASCII, EBCDIC, Unicode, etc. This character, or a sequence of characters, is used to signify the end of a line of text and the start of a new one.

uuencoding is a form of binary-to-text encoding that originated in the Unix programs uuencode and uudecode written by Mary Ann Horton at the University of California, Berkeley in 1980, for encoding binary data for transmission in email systems.

The Lempel–Ziv–Markov chain algorithm (LZMA) is an algorithm used to perform lossless data compression. It has been under development since either 1996 or 1998 by Igor Pavlov and was first used in the 7z format of the 7-Zip archiver. This algorithm uses a dictionary compression scheme somewhat similar to the LZ77 algorithm published by Abraham Lempel and Jacob Ziv in 1977 and features a high compression ratio and a variable compression-dictionary size, while still maintaining decompression speed similar to other commonly used compression algorithms.

Quoted-Printable, or QP encoding, is a binary-to-text encoding system using printable ASCII characters to transmit 8-bit data over a 7-bit data path or, more generally, over a medium which is not 8-bit clean. Historically, because of the wide range of systems and protocols that could be used to transfer messages, e-mail was often assumed to be non-8-bit-clean; however, modern SMTP servers are in most cases 8-bit clean and support the 8BITMIME extension. Quoted-Printable can also be used with data that contains non-permitted octets or line lengths exceeding SMTP limits. It is defined as a MIME content transfer encoding for use in e-mail.

yEnc is a binary-to-text encoding scheme for transferring binary files in messages on Usenet or via e-mail. It reduces the overhead over previous US-ASCII-based encoding methods by using an 8-bit encoding method. yEnc's overhead is often as little as 1–2%, compared to 33–40% overhead for 6-bit encoding methods like uuencode and Base64. yEnc was initially developed by Jürgen Helbing, and its first release was in early 2001. By 2003, yEnc had become the de facto standard encoding for binary files on Usenet. The name yEncode is a play on "Why encode?", since the idea is to encode characters only when absolutely required to adhere to the message format standard.

BinHex, originally short for "binary-to-hexadecimal", is a binary-to-text encoding system that was used on the classic Mac OS for sending binary files through e-mail. Originally a hexadecimal encoding, subsequent versions of BinHex are more similar to uuencode, but combined both "forks" of the Mac file system together along with extended file information. BinHexed files take up more space than the original files, but will not be corrupted by non-"8-bit clean" software.

Binary file: non-human-readable computer file encoded in binary form

A binary file is a computer file that is not a text file; the term is often used to mean simply "non-text file". Many binary file formats contain parts that can be interpreted as text; for example, some computer document files containing formatted text, such as older Microsoft Word document files, contain the text of the document but also contain formatting information in binary form.

In computing, a hex dump is a hexadecimal view of computer data, from memory or from a computer file or storage device. Looking at a hex dump of data is usually done in the context of either debugging, reverse engineering or digital forensics.

Binary Ordered Compression for Unicode (BOCU) is a MIME compatible Unicode compression scheme. BOCU-1 combines the wide applicability of UTF-8 with the compactness of Standard Compression Scheme for Unicode (SCSU). This Unicode encoding is designed to be useful for compressing short strings, and maintains code point order. BOCU-1 is specified in a Unicode Technical Note.

Ascii85, also called Base85, is a form of binary-to-text encoding developed by Paul E. Rutter for the btoa utility. By using five ASCII characters to represent four bytes of binary data, it is more efficient than uuencode or Base64, which use four characters to represent three bytes of data.
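
For comparison with the three-bytes-to-four-characters schemes above, a short illustration using Python's standard base64 module, which includes an Ascii85 encoder:

 import base64

 # Four input bytes map to five Ascii85 characters.
 print(base64.a85encode(b"Wiki"))   # b'=(uGa'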

A binary-to-text encoding is an encoding of data in plain text. More precisely, it is an encoding of binary data in a sequence of printable characters. These encodings are necessary for transmission of data when the communication channel does not allow binary data or is not 8-bit clean. PGP documentation uses the term "ASCII armor" for binary-to-text encoding when referring to Base64.

Intel hexadecimal object file format, Intel hex format or Intellec Hex is a file format that conveys binary information in ASCII text form, making it possible to store on non-binary media such as paper tape, punch cards, etc., to display on text terminals or be printed on line-oriented printers. The format is commonly used for programming microcontrollers, EPROMs, and other types of programmable logic devices and hardware emulators. In a typical application, a compiler or assembler converts a program's source code to machine code and outputs it into a HEX file. Some also use it as a container format holding packets of stream data. Common file extensions used for the resulting files are .HEX or .H86. The HEX file is then read by a programmer to write the machine code into a PROM or is transferred to the target system for loading and execution.

The Apple Icon Image format is an icon format used in Apple Inc.'s macOS. It supports icons of 16 × 16, 32 × 32, 48 × 48, 128 × 128, 256 × 256, 512 × 512 points at 1x and 2x scale, with both 1- and 8-bit alpha channels and multiple image states. The fixed-size icons can be scaled by the operating system and displayed at any intermediate size.

SREC (file format): file format developed by Motorola

Motorola S-record is a file format, created by Motorola in the mid-1970s, that conveys binary information as hex values in ASCII text form. This file format may also be known as SRECORD, SREC, S19, S28, S37. It is commonly used for programming flash memory in microcontrollers, EPROMs, EEPROMs, and other types of programmable logic devices. In a typical application, a compiler or assembler converts a program's source code to machine code and outputs it into a HEX file. The HEX file is then imported by a programmer to "burn" the machine code into non-volatile memory, or is transferred to the target system for loading and execution.

Tektronix hex format and Extended Tektronix hex format / Extended Tektronix Object Format are ASCII-based hexadecimal file formats, created by Tektronix, for conveying binary information for applications like programming microcontrollers, EPROMs, and other kinds of chips.

References

  1. Tony Catone (February 1995). "Keys to the kingdom: Unlocking Internet file formats". University of Pennsylvania.