Binary file

Last updated
A hex dump of the 318 byte Wikipedia favicon, or . The first column numerates the line's starting address, while the * indicates repetition. Wikipedia favicon hexdump.svg
A hex dump of the 318 byte Wikipedia favicon, or Wikipedia-favicon.png . The first column numerates the line's starting address, while the * indicates repetition.

A binary file is a computer file that is not a text file. [1] The term "binary file" is often used as a term meaning "non-text file". [2] Many binary file formats contain parts that can be interpreted as text; for example, some computer document files containing formatted text, such as older Microsoft Word document files, contain the text of the document but also contain formatting information in binary form. [2]

Contents

Structure

Binary files are usually thought of as being a sequence of bytes, which means the binary digits (bits) are grouped in eights. Binary files typically contain bytes that are intended to be interpreted as something other than text characters. Compiled computer programs are typical examples; indeed, compiled applications are sometimes referred to, particularly by programmers, as binaries. But binary files can also mean that they contain images, sounds, compressed versions of other files, etc. – in short, any type of file content whatsoever. [1]

Some binary files contain headers, blocks of metadata used by a computer program to interpret the data in the file. The header often contains a signature or magic number which can identify the format. For example, a GIF file can contain multiple images, and headers are used to identify and describe each block of image data. The leading bytes of the header would contain text like GIF87a or GIF89a that can identify the binary as a GIF file. If a binary file does not contain any headers, it may be called a flat binary file.

A text file may consist partly or entirely of encoded binary information. When sending binary files over the network they may be encoded so that they use only printable characters. This is often necessary due to the limitations of network protocols used for internet browsing and e-mail communication. One such encoding is Base64. Also, files containing public-key and private-key information for use in systems employing asymmetric cryptography (such as website certificates) may also be stored with the binary information encoded in printable characters.

Manipulation

To send binary files through certain systems (such as email) that do not allow all data values, they are often translated into a plain text representation (using, for example, Base64). Encoding the data has the disadvantage of increasing the file size during the transfer (for example, using Base64 will increase the file's size by approximately 30%), as well as requiring translation back into binary after receipt. The increased size may be countered by lower-level link compression, as the resulting text data will have about as much less entropy as it has increased size, so the actual data transferred in this scenario would likely be very close to the size of the original binary data. See Binary-to-text encoding for more on this subject.

Microsoft Windows and its standard libraries for the C and C++ programming languages allow the programmer to specify a parameter indicating if a file is expected to be plain text or binary when opening a file; this affects the standard library calls to read and write from the file in that the system converts between the C/C++ "end of line" character (the ASCII linefeed character) and the end-of-line sequence Windows expects in files (the ASCII carriage return and linefeed characters in sequence). In Unix-like systems, the C and C++ standard libraries on those systems also allow the programmer to specify whether a file is expected to be text or binary, but the libraries can and do ignore that parameter, as the end-of-line sequence in Unix-like systems is just the C/C++ end-of-line character.

Viewing

A hex editor or viewer may be used to view file data as a sequence of hexadecimal (or decimal, binary or ASCII character) values for corresponding bytes of a binary file. [2]

If a binary file is opened in a text editor, each group of eight bits will typically be translated as a single character, and the user will see a (probably unintelligible) display of textual characters. If the file is opened in some other application, that application will have its own use for each byte: maybe the application will treat each byte as a number and output a stream of numbers between 0 and 255—or maybe interpret the numbers in the bytes as colors and display the corresponding picture. Other type of viewers (called 'word extractors') simply replace the unprintable characters with spaces revealing only the human-readable text. This type of view is useful for a quick inspection of a binary file in order to find passwords in games, find hidden text in non-text files and recover corrupted documents. [2] It can even be used to inspect suspicious files (software) for unwanted effects. For example, the user would see any URL/email to which the suspected software may attempt to connect in order to upload unapproved data (to steal). If the file is itself treated as an executable and run, then the operating system will attempt to interpret the file as a series of instructions in its machine language.

Interpretation

Standards are very important to binary files. For example, a binary file interpreted by the ASCII character set will result in text being displayed. A custom application can interpret the file differently: a byte may be a sound, or a pixel, or even an entire word. Binary itself is meaningless, until such time as an executed algorithm defines what should be done with each bit, byte, word or block. Thus, just examining the binary and attempting to match it against known formats can lead to the wrong conclusion as to what it actually represents. This fact can be used in steganography, where an algorithm interprets a binary data file differently to reveal hidden content. Without the algorithm, it is impossible to tell that hidden content exists.

Binary compatibility

Two files that are binary compatible will have the same sequence of zeros and ones in the data portion of the file. The file header, however, may be different.

The term is used most commonly to state that data files produced by one application are exactly the same as data files produced by another application. For example, some software companies produce applications for Windows and the Macintosh that are binary compatible, which means that a file produced in a Windows environment is interchangeable with a file produced on a Macintosh. This avoids many of the conversion problems caused by importing and exporting data.

One possible binary compatibility issue between different computers is the endianness of the computer. Some computers store the bytes in a file in a different order. [3]

See also

Related Research Articles

Multipurpose Internet Mail Extensions (MIME) is an Internet standard that extends the format of email messages to support text in character sets other than ASCII, as well as attachments of audio, video, images, and application programs. Message bodies may consist of multiple parts, and header information may be specified in non-ASCII character sets. Email messages with MIME formatting are typically transmitted with standard protocols, such as the Simple Mail Transfer Protocol (SMTP), the Post Office Protocol (POP), and the Internet Message Access Protocol (IMAP).

<span class="mw-page-title-main">Plain text</span> Term for computer data consisting only of unformatted characters of readable material

In computing, plain text is a loose term for data that represent only characters of readable material but not its graphical representation nor other objects. It may also include a limited number of "whitespace" characters that affect simple arrangement of text, such as spaces, line breaks, or tabulation characters. Plain text is different from formatted text, where style information is included; from structured text, where structural parts of the document such as paragraphs, sections, and the like are identified; and from binary files in which some portions must be interpreted as binary objects.

Run-length encoding (RLE) is a form of lossless data compression in which runs of data are stored as a single data value and count, rather than as the original run. This is most efficient on data that contains many such runs, for example, simple graphic images such as icons, line drawings, Conway's Game of Life, and animations. For files that do not have many runs, RLE could increase the file size.

<span class="mw-page-title-main">String (computer science)</span> Sequence of characters, data type

In computer programming, a string is traditionally a sequence of characters, either as a literal constant or as some kind of variable. The latter may allow its elements to be mutated and the length changed, or it may be fixed. A string is generally considered as a data type and is often implemented as an array data structure of bytes that stores a sequence of elements, typically characters, using some character encoding. String may also denote more general arrays or other sequence data types and structures.

In computer programming, Base64 is a group of binary-to-text encoding schemes that transforms binary data into a sequence of printable characters, limited to a set of 64 unique characters. More specifically, the source binary data is taken 6 bits at a time, then this group of 6 bits is mapped to one of 64 unique characters.

UTF-7 is an obsolete variable-length character encoding for representing Unicode text using a stream of ASCII characters. It was originally intended to provide a means of encoding Unicode text for use in Internet E-mail messages that was more efficient than the combination of UTF-8 with quoted-printable.

uuencoding is a form of binary-to-text encoding that originated in the Unix programs uuencode and uudecode written by Mary Ann Horton at the University of California, Berkeley in 1980, for encoding binary data for transmission in email systems.

Quoted-Printable, or QP encoding, is a binary-to-text encoding system using printable ASCII characters to transmit 8-bit data over a 7-bit data path or, generally, over a medium which is not 8-bit clean. Historically, because of the wide range of systems and protocols that could be used to transfer messages, e-mail was often assumed to be non-8-bit-clean – however, modern SMTP servers are in most cases 8-bit clean and support 8BITMIME extension. It can also be used with data that contains non-permitted octets or line lengths exceeding SMTP limits. It is defined as a MIME content transfer encoding for use in e-mail.

<span class="mw-page-title-main">Delimiter</span> Characters that specify the boundary between regions in a data stream

A delimiter is a sequence of one or more characters for specifying the boundary between separate, independent regions in plain text, mathematical expressions or other data streams. An example of a delimiter is the comma character, which acts as a field delimiter in a sequence of comma-separated values. Another example of a delimiter is the time gap used to separate letters and words in the transmission of Morse code.

yEnc is a binary-to-text encoding scheme for transferring binary files in messages on Usenet or via e-mail. It reduces the overhead over previous US-ASCII-based encoding methods by using an 8-bit encoding method. yEnc's overhead is often as little as 1–2%, compared to 33–40% overhead for 6-bit encoding methods like uuencode and Base64. yEnc was initially developed by Jürgen Helbing, and its first release was early 2001. By 2003 yEnc became the de facto standard encoding system for binary files on Usenet. The name yEncode is a wordplay on "Why encode?", since the idea is to only encode characters if it is absolutely required to adhere to the message format standard.

BinHex, originally short for "binary-to-hexadecimal", is a binary-to-text encoding system that was used on the classic Mac OS for sending binary files through e-mail. Originally a hexadecimal encoding, subsequent versions of BinHex are more similar to uuencode, but combined both "forks" of the Mac file system together along with extended file information. BinHexed files take up more space than the original files, but will not be corrupted by non-"8-bit clean" software.

The data URI scheme is a uniform resource identifier (URI) scheme that provides a way to include data in-line in Web pages as if they were external resources. It is a form of file literal or here document. This technique allows normally separate elements such as images and style sheets to be fetched in a single Hypertext Transfer Protocol (HTTP) request, which may be more efficient than multiple HTTP requests, and used by several browser extensions to package images as well as other multimedia content in a single HTML file for page saving. As of 2022, data URIs are fully supported by most major browsers, and partially supported in Internet Explorer.

Ascii85, also called Base85, is a form of binary-to-text encoding developed by Paul E. Rutter for the btoa utility. By using five ASCII characters to represent four bytes of binary data, it is more efficient than uuencode or Base64, which use four characters to represent three bytes of data.

URL encoding, officially known as percent-encoding, is a method to encode arbitrary data in a uniform resource identifier (URI) using only the US-ASCII characters legal within a URI. Although it is known as URL encoding, it is also used more generally within the main Uniform Resource Identifier (URI) set, which includes both Uniform Resource Locator (URL) and Uniform Resource Name (URN). As such, it is also used in the preparation of data of the application/x-www-form-urlencoded media type, as is often used in the submission of HTML form data in HTTP requests.

This article compares Unicode encodings. Two situations are considered: 8-bit-clean environments, and environments that forbid use of byte values that have the high bit set. Originally such prohibitions were to allow for links that used only seven data bits, but they remain in some standards and so some standard-conforming software must generate messages that comply with the restrictions. Standard Compression Scheme for Unicode and Binary Ordered Compression for Unicode are excluded from the comparison tables because it is difficult to simply quantify their size.

A binary-to-text encoding is encoding of data in plain text. More precisely, it is an encoding of binary data in a sequence of printable characters. These encodings are necessary for transmission of data when the communication channel does not allow binary data or is not 8-bit clean. PGP documentation uses the term "ASCII armor" for binary-to-text encoding when referring to Base64.

The Apple Icon Image format (.icns) is an icon format used in Apple Inc.'s macOS. It supports icons of 16 × 16, 32 × 32, 48 × 48, 128 × 128, 256 × 256, 512 × 512 points at 1x and 2x scale, with both 1- and 8-bit alpha channels and multiple image states. The fixed-size icons can be scaled by the operating system and displayed at any intermediate size.

<span class="mw-page-title-main">SREC (file format)</span> File format developed by Motorola

Motorola S-record is a file format, created by Motorola in the mid-1970s, that conveys binary information as hex values in ASCII text form. This file format may also be known as SRECORD, SREC, S19, S28, S37. It is commonly used for programming flash memory in microcontrollers, EPROMs, EEPROMs, and other types of programmable logic devices. In a typical application, a compiler or assembler converts a program's source code to machine code and outputs it into a HEX file. The HEX file is then imported by a programmer to "burn" the machine code into non-volatile memory, or is transferred to the target system for loading and execution.

A file format is a standard way that information is encoded for storage in a computer file. It specifies how bits are used to encode information in a digital storage medium. File formats may be either proprietary or free.

010 Editor is a commercial hex editor and text editor for Microsoft Windows, Linux and macOS. Typically 010 Editor is used to edit text files, binary files, hard drives, processes, tagged data, source code, shell scripts, log files, etc. A large variety of binary data formats can be edited through the use of Binary Templates.

References

  1. 1 2 "Binary file definition by The Linux Information Project (LINFO)". www.linfo.org. Retrieved 2017-10-12.
  2. 1 2 3 4 "Ascii vs. Binary Files". www.cs.umd.edu. Retrieved 2017-10-12.
  3. "NCL: Reading binary data". www.ncl.ucar.edu. Archived from the original on 2017-10-12. Retrieved 2017-10-12.