Uuencoding

uuencoding is a form of binary-to-text encoding that originated in the Unix programs uuencode and uudecode written by Mary Ann Horton at the University of California, Berkeley in 1980, [1] for encoding binary data for transmission in email systems.

The name "uuencoding" is derived from Unix-to-Unix Copy (UUCP): "Unix-to-Unix encoding" is a safe encoding for transferring arbitrary files from one Unix system to another, without any guarantee that the intervening links are all Unix systems. Since an email message might be forwarded through or to computers with different character sets, through transports that are not 8-bit clean, or handled by programs that are not 8-bit clean, forwarding a binary file via email might corrupt it. By encoding such data into a character subset common to most character sets, the encoded form was unlikely to be "translated" or corrupted, and would thus arrive intact at the destination. The program uudecode reverses the effect of uuencode, recreating the original binary file exactly. uuencode and uudecode became popular for sending binary (and especially compressed) files by email and for posting them to Usenet newsgroups.

It has now been largely replaced by MIME and yEnc. With MIME, files that might have been uuencoded are instead transferred with Base64 encoding.

Encoded format

A uuencoded file starts with a header line of the form:

begin <mode> <file><newline>

<mode> is the file's Unix file permissions as three octal digits (e.g. 644, 744). This is typically only significant to Unix-like operating systems.

<file> is the file name to be used when recreating the binary data.

<newline> signifies a newline character, used to terminate each line.

Each data line uses the format:

<length character><formatted characters><newline>

<length character> is a character indicating the number of data bytes which have been encoded on that line. This is an ASCII character determined by adding 32 to the actual byte count, with the sole exception of a grave accent "`" (ASCII code 96) signifying zero bytes. All data lines, except the last (if the data length was not divisible by 45), have 45 bytes of encoded data (60 characters after encoding). Therefore, the vast majority of length values are 'M', (32 + 45 = ASCII code 77 or "M").
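The rule above can be sketched in a few lines of Python. This is only an illustration: the names `length_char` and `line_length` are hypothetical and do not come from any uuencode implementation.

```python
def length_char(n: int) -> str:
    """Return the length character for a data line holding n bytes (0..45)."""
    if not 0 <= n <= 45:
        raise ValueError("a uuencoded line holds between 0 and 45 bytes")
    # Zero is written as a grave accent; every other count is offset by 32.
    return "`" if n == 0 else chr(32 + n)


def line_length(ch: str) -> int:
    """Recover the byte count; the 6-bit mask maps both ' ' and '`' to zero."""
    return (ord(ch) - 32) & 0x3F


print(length_char(45))   # M
print(length_char(3))    # #
print(line_length("`"))  # 0
```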

<formatted characters> are encoded characters. See § Formatting mechanism for more details on the actual implementation.

The file ends with two lines:

`<newline>
end<newline>

The second to last line is also a character indicating the line length, with the grave accent signifying zero bytes.

As a complete file, the uuencoded output for a plain text file named cat.txt containing only the characters Cat would be

begin 644 cat.txt
#0V%T
`
end

The begin line is a standard uuencode header; the '#' indicates that its line encodes three characters; the last two lines appear at the end of all uuencoded files.
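As a rough sketch of how such a file is assembled, the following Python uses the standard `binascii.b2a_uu` function for the data lines and adds the header and trailer by hand. The helper name `uuencode_file` is hypothetical, and the `backtick=True` argument (available from Python 3.7) is chosen here so that zeros appear as grave accents, matching the example above.

```python
import binascii


def uuencode_file(data: bytes, name: str, mode: int = 0o644) -> str:
    """Wrap data in a begin/end envelope, 45 source bytes per data line."""
    lines = [f"begin {mode:o} {name}"]
    for i in range(0, len(data), 45):
        # b2a_uu encodes at most 45 bytes and returns one line ending in a newline.
        # backtick=True writes the value zero as "`" rather than as a space.
        encoded = binascii.b2a_uu(data[i:i + 45], backtick=True)
        lines.append(encoded.decode("ascii").rstrip("\n"))
    # Trailer: a zero-length data line followed by the "end" line.
    lines += ["`", "end"]
    return "\n".join(lines) + "\n"


print(uuencode_file(b"Cat", "cat.txt"), end="")
```

Run as written, this should print the same four lines as the cat.txt example.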

Formatting mechanism

The mechanism of uuencoding repeats the following for every 3 bytes, encoding them into 4 printable characters, each character representing a radix-64 numerical digit:

  1. Start with 3 bytes from the source, 24 bits in total.
  2. Split into 4 6-bit groupings, each representing a value in the range 0 to 63: bits (00-05), (06-11), (12-17) and (18-23).
  3. Add 32 to each of the values, so that the possible results lie between 32 (" " space) and 95 ("_" underline). 96 ("`" grave accent) as the "special character" is a logical extension of this range. Although the space character is documented as the encoding for the value 0, implementations such as GNU sharutils [2] actually use the grave accent to encode zeros in the body of the file as well, never using space.
  4. Output the ASCII equivalent of these numbers.

If the source length is not divisible by 3, the last 3-byte group is filled out with padding bytes so that it still encodes to 4 characters. These padding bytes are not counted in the line's <length character>, so the decoder does not append unwanted bytes to the file.
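The four steps and the padding rule can be sketched as follows. This is a minimal illustration, not a reference implementation: `encode_body` is a hypothetical name, it produces only the characters that follow the length character, it pads with zero bytes, and it writes zero as a grave accent in the way the text attributes to GNU sharutils.

```python
def encode_body(chunk: bytes) -> str:
    """Encode up to 45 bytes into the characters that follow the length character."""
    # Pad with zero bytes so the input splits into whole 3-byte groups.
    padded = chunk + b"\x00" * (-len(chunk) % 3)
    out = []
    for i in range(0, len(padded), 3):
        # Step 1: combine 3 bytes into one 24-bit integer.
        group = int.from_bytes(padded[i:i + 3], "big")
        # Step 2: take the four 6-bit values, most significant first.
        for shift in (18, 12, 6, 0):
            value = (group >> shift) & 0x3F
            # Steps 3 and 4: add 32 and emit the character; zero becomes "`".
            out.append("`" if value == 0 else chr(32 + value))
    return "".join(out)


print(encode_body(b"Cat"))  # 0V%T
print(encode_body(b"Ca"))   # 0V$`
```

For the two-byte input the length character would be '"' (32 + 2), which tells the decoder to keep only the first two of the three decoded bytes.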

uudecoding is the reverse of the above: subtract 32 from each character's ASCII code (modulo 64, to account for the grave accent) to get a 6-bit value, concatenate four 6-bit groups to get 24 bits, then output 3 bytes.
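A decoder for a single data line might look like the sketch below. The name `decode_line` is hypothetical, the function does no validation of its input, and it assumes that a body shortened by a tool that strips trailing spaces can be filled back out with spaces.

```python
def decode_line(line: str) -> bytes:
    """Decode one uuencoded data line (length character plus body)."""
    def value(ch: str) -> int:
        # Subtract 32, then reduce modulo 64 so "`" (96) and " " (32) both give 0.
        return (ord(ch) - 32) & 0x3F

    count = value(line[0])
    body = line[1:]
    # Assumption: missing trailing characters were spaces, so restore them.
    body += " " * (-len(body) % 4)
    out = bytearray()
    for i in range(0, len(body), 4):
        group = 0
        for ch in body[i:i + 4]:
            # Concatenate four 6-bit values into one 24-bit integer.
            group = (group << 6) | value(ch)
        out += group.to_bytes(3, "big")
    # The length character says how many bytes are real data; drop the padding.
    return bytes(out[:count])


print(decode_line("#0V%T"))  # b'Cat'
```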

The encoding process is demonstrated by this table, which shows the derivation of the above encoding for "Cat".

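| Step | Values |
| --- | --- |
| Original characters | C, a, t |
| Original ASCII, decimal | 67, 97, 116 |
| ASCII, binary | 01000011 01100001 01110100 |
| New decimal values | 16, 54, 5, 52 |
| +32 | 48, 86, 37, 84 |
| Uuencoded characters | 0, V, %, T |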

uuencode table

The following table shows the conversion of the decimal value of the 6-bit fields obtained during the conversion process and their corresponding ASCII character output code and character.

Note that some encoders might produce space (code 32) instead of grave accent ("`", code 96), while some decoders might refuse to decode data containing space.

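| 6-bit value | ASCII code | ASCII char | 6-bit value | ASCII code | ASCII char | 6-bit value | ASCII code | ASCII char | 6-bit value | ASCII code | ASCII char |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 00 | 96 | ` | 16 | 48 | 0 | 32 | 64 | @ | 48 | 80 | P |
| 01 | 33 | ! | 17 | 49 | 1 | 33 | 65 | A | 49 | 81 | Q |
| 02 | 34 | " | 18 | 50 | 2 | 34 | 66 | B | 50 | 82 | R |
| 03 | 35 | # | 19 | 51 | 3 | 35 | 67 | C | 51 | 83 | S |
| 04 | 36 | $ | 20 | 52 | 4 | 36 | 68 | D | 52 | 84 | T |
| 05 | 37 | % | 21 | 53 | 5 | 37 | 69 | E | 53 | 85 | U |
| 06 | 38 | & | 22 | 54 | 6 | 38 | 70 | F | 54 | 86 | V |
| 07 | 39 | ' | 23 | 55 | 7 | 39 | 71 | G | 55 | 87 | W |
| 08 | 40 | ( | 24 | 56 | 8 | 40 | 72 | H | 56 | 88 | X |
| 09 | 41 | ) | 25 | 57 | 9 | 41 | 73 | I | 57 | 89 | Y |
| 10 | 42 | * | 26 | 58 | : | 42 | 74 | J | 58 | 90 | Z |
| 11 | 43 | + | 27 | 59 | ; | 43 | 75 | K | 59 | 91 | [ |
| 12 | 44 | , | 28 | 60 | < | 44 | 76 | L | 60 | 92 | \ |
| 13 | 45 | - | 29 | 61 | = | 45 | 77 | M | 61 | 93 | ] |
| 14 | 46 | . | 30 | 62 | > | 46 | 78 | N | 62 | 94 | ^ |
| 15 | 47 | / | 31 | 63 | ? | 47 | 79 | O | 63 | 95 | _ |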
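Because every entry except the first is simply the value plus 32, the whole table can be regenerated from a one-line alphabet. The snippet below is an illustrative sketch; `UU_ALPHABET` is a name chosen here, and it follows the grave-accent convention for zero.

```python
# The 64 output characters in value order: "`" for 0, then ASCII 33 ("!") to 95 ("_").
UU_ALPHABET = "`" + "".join(chr(code) for code in range(33, 96))

assert len(UU_ALPHABET) == 64

# Print four column groups of 16 rows: 6-bit value, ASCII code, character.
for row in range(16):
    print("   ".join(
        f"{v:02d} {ord(UU_ALPHABET[v]):3d} {UU_ALPHABET[v]}"
        for v in (row, row + 16, row + 32, row + 48)
    ))
```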

Example

The following is an example of uuencoding a one-line text file. In this example, %0D is the byte representation for carriage return, and %0A is the byte representation for line feed.

file
File Name = wikipedia-url.txt
File Contents = http://www.wikipedia.org%0D%0A
uuencoding
begin 644 wikipedia-url.txt
::'1T<#HO+W=W=RYW:6MI<&5D:6$N;W)G#0H`
`
end

Forks (file, resource)

Unix traditionally has a single fork where file data is stored. However, some file systems support multiple forks associated with a single file. For example, the classic Mac OS Hierarchical File System (HFS) supported a data fork and a resource fork. Mac OS HFS+ supports multiple forks, as does Microsoft Windows NTFS with alternate data streams. Most uuencoding tools handle only the data from the primary data fork, which can result in a loss of information when encoding and decoding (for example, Windows NTFS file comments are kept in a different fork). Some tools (like the classic Mac OS application UUTool) solved the problem by concatenating the different forks into one file and differentiating them by file name.

Relation to xxencode, Base64, and Ascii85

Despite its limited range of characters, uuencoded data is sometimes corrupted on passage through certain computers using non-ASCII character sets such as EBCDIC. One attempt to solve the problem was the xxencode format, which used only alphanumeric characters and the plus and minus symbols. More common today is the Base64 format, which likewise uses alphanumeric characters (plus "+" and "/") instead of the ASCII range 32–95. All three formats use 6 bits (64 different characters) to represent their input data.

Base64 can also be generated by the uuencode program and is similar in format, except for the actual character translation:

The header is changed to

begin-base64 <mode> <file>

the trailer becomes

====

and lines between are encoded with characters chosen from

ABCDEFGHIJKLMNOP
QRSTUVWXYZabcdef
ghijklmnopqrstuv
wxyz0123456789+/
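Since both schemes carry the same 6-bit values and differ only in the character chosen for each value, a uuencoded line body can be re-expressed as Base64 by pure character translation. The sketch below illustrates this; `uu_body_to_base64` is a hypothetical helper, and the "=" padding it appends follows standard Base64 rather than anything stated above.

```python
import base64
import string

B64_ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def uu_body_to_base64(body: str, nbytes: int) -> str:
    """Re-express a uuencoded line body (without its length character) as Base64."""
    # Each character stands for the same 6-bit value in both schemes.
    values = [(ord(ch) - 32) & 0x3F for ch in body]
    text = "".join(B64_ALPHABET[v] for v in values)
    # Base64 marks padding with "=" instead of a per-line byte count.
    pad = -nbytes % 3
    return text[:len(text) - pad] + "=" * pad


print(uu_body_to_base64("0V%T", 3))              # Q2F0
print(base64.b64encode(b"Cat").decode("ascii"))  # Q2F0
```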

Another alternative is Ascii85, which encodes four bytes of binary data in five ASCII characters. Ascii85 is used in PostScript and PDF formats.

Disadvantages

uuencoding turns every 3 source bytes into 4 characters and also adds begin/end lines, the filename, and per-line delimiters. This adds at least 33% data overhead compared to the source alone, though this can be at least somewhat compensated for by compressing the file before uuencoding it.
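The overhead can be worked out from the line layout: a full data line carries 45 source bytes in 62 output bytes (one length character, 60 encoded characters, and a newline), which is about 37.8% more than the source. The following sketch computes the total size; `uuencoded_size` is a hypothetical helper that assumes single-byte LF line endings and a lone grave accent as the zero-length line.

```python
import math


def uuencoded_size(n: int, name: str, mode: str = "644") -> int:
    """Size in bytes of a uuencoded file holding n source bytes (LF line endings)."""
    header = len(f"begin {mode} {name}\n")
    trailer = len("`\nend\n")
    full, rest = divmod(n, 45)
    # A full line: length character + 60 encoded characters + newline.
    body = full * 62
    if rest:
        # A short last line: length character + 4 characters per 3-byte group + newline.
        body += 1 + 4 * math.ceil(rest / 3) + 1
    return header + body + trailer


for n in (3, 45, 1_000_000):
    size = uuencoded_size(n, "cat.txt")
    print(n, size, f"{size / n - 1:.1%} overhead")
```

For very small inputs the fixed header and trailer dominate; for large inputs the ratio approaches 62/45.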

Support in languages

Python

The Python language supports uuencoding using the codecs module with the codec "uu":

For Python 2 (deprecated/sunset as of January 1st 2020):

$ python -c 'print "Cat".encode("uu")'
begin 666 <data>
#0V%T
 
end

For Python 3 where the codecs module needs to be imported and used directly:

$ python3 -c "from codecs import encode; print(encode(b'Cat', 'uu'))"
b'begin 666 <data>\n#0V%T\n \nend\n'

To decode, pass the whole file:

$ python3 -c "from codecs import decode; print(decode(b'begin 666 <data>\n#0V%T\n \nend\n', 'uu'))"
b'Cat'

Perl

The Perl language supports uuencoding natively using the pack() and unpack() operators with the format string "u":

$ perl -e 'print pack("u","Cat")'
#0V%T

Decoding uuencoded data with unpack works the same way:

$ perl -e 'print unpack("u","#0V%T")'
Cat

To produce well-formed uuencoded files, you need to use modules [3] or a little more code: [4]

Encode (oneliner)

$ perl -ple 'BEGIN{use File::Basename;$/=undef;$sn=basename($ARGV[0]);} $_= "begin 600 $sn\n".(pack "u", $_)."`\nend" if $_' /some/file/to_encode.gz

Encode/Decode (proper Perl scripts)

https://metacpan.org/dist/PerlPowerTools/view/bin/uuencode

https://metacpan.org/dist/PerlPowerTools/view/bin/uudecode

References

  1. Horton, Mark. "UUENCODE(1C) UNIX Programmer's Manual". The Unix Heritage Society. Retrieved 2020-11-10.
  2. "uuencode.c source". fossies.org. Retrieved 2021-06-05.
  3. "PerlPowerTools source". metacpan.org. Retrieved 2024-02-12.
  4. "uuencode.pl source". main.linuxfocus.org. Retrieved 2024-02-12.