Text file

Filename extension	.txt
Internet media type	text/plain
Type code	TEXT
Uniform Type Identifier (UTI)	public.plain-text
UTI conformation	public.text
Type of format	Document file format, Generic container format

Last updated November 28, 2024

A text file (sometimes spelled textfile; an old alternative name is flat file) is a kind of computer file that is structured as a sequence of lines of electronic text. A text file is stored as data within a computer file system.

In operating systems such as CP/M, where the operating system does not keep track of the file size in bytes, the end of a text file is denoted by placing one or more special characters, known as an end-of-file (EOF) marker, as padding after the last line in a text file. In modern operating systems such as DOS, Microsoft Windows and Unix-like systems, text files do not contain any special EOF character, because file systems on those operating systems keep track of the file size in bytes.
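The padding convention described above can be sketched in a few lines. The paragraph does not name the marker character; the sketch assumes the Ctrl-Z byte (0x1A) and the 128-byte record size that CP/M used, and the helper name `strip_cpm_padding` is hypothetical.

```python
# Assumption: the EOF marker is Ctrl-Z (ASCII SUB, 0x1A), as on CP/M.
CPM_EOF = b"\x1a"


def strip_cpm_padding(data: bytes) -> bytes:
    """Return the text portion of a CP/M-style file image.

    Everything from the first EOF marker onward is treated as padding.
    If no marker is present, the data is returned unchanged.
    """
    end = data.find(CPM_EOF)
    return data if end == -1 else data[:end]


# One 128-byte record: 13 bytes of text, padded up to the record boundary.
record = b"Hello, CP/M\r\n".ljust(128, CPM_EOF)
print(len(record), strip_cpm_padding(record))
```

On a system that records the file size in bytes, no such scan is needed, because the reader simply stops at the recorded length.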

Some operating systems, such as Multics, Unix-like systems, CP/M, DOS, the classic Mac OS, and Windows, store text files as a sequence of bytes, with an end-of-line delimiter at the end of each line. Other operating systems, such as OpenVMS and OS/360 and its successors, have record-oriented filesystems, in which text files are stored as a sequence either of fixed-length records or of variable-length records with a record-length value in the record header.
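The difference between the two storage models can be illustrated with a small sketch. The record layout used here (a 2-byte big-endian length header before each record) is a hypothetical one chosen for illustration; it is not the actual on-disk format of OpenVMS or OS/360, and the function names are likewise invented.

```python
import struct


def lines_from_stream(data: bytes) -> list[bytes]:
    """Byte-stream model: find lines by scanning for end-of-line delimiters."""
    return data.splitlines()


def pack_records(lines: list[bytes]) -> bytes:
    """Record model: prefix each line with its length (hypothetical 2-byte header)."""
    return b"".join(struct.pack(">H", len(line)) + line for line in lines)


def lines_from_records(data: bytes) -> list[bytes]:
    """Record model: read a length header, then exactly that many bytes."""
    records, pos = [], 0
    while pos < len(data):
        (length,) = struct.unpack_from(">H", data, pos)
        pos += 2
        records.append(data[pos:pos + length])
        pos += length
    return records


lines = [b"alpha", b"", b"gamma"]
print(lines_from_stream(b"alpha\n\ngamma\n") == lines)
print(lines_from_records(pack_records(lines)) == lines)
```

In the record model no byte value is reserved as a delimiter, so a line may contain any byte; in the stream model the delimiter cannot appear inside a line.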

"Text file" refers to a type of container, while plain text refers to a type of content.

At a generic level of description, there are two kinds of computer files: text files and binary files.^[1]

Data storage

Figure: A stylized iconic depiction of a CSV-formatted text file.

Because of their simplicity, text files are commonly used for storage of information. They avoid some of the problems encountered with other file formats, such as endianness, padding bytes, or differences in the number of bytes in a machine word. Further, when data corruption occurs in a text file, it is often easier to recover and continue processing the remaining contents. A disadvantage of text files is that they usually have a low entropy, meaning that the information occupies more storage than is strictly necessary.
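As a rough illustration of the endianness point, the sketch below stores the same integer three ways. The value 1025 and the variable names are arbitrary choices made for this example.

```python
import struct

value = 1025  # 0x00000401

# Binary storage: the byte order must be agreed on by writer and reader.
little = struct.pack("<I", value)  # little-endian 32-bit unsigned integer
big = struct.pack(">I", value)     # big-endian 32-bit unsigned integer

# Text storage: the decimal digits read the same on any machine.
text = str(value).encode("ascii")

# Reading little-endian bytes with the wrong byte order gives a wrong number.
(misread,) = struct.unpack(">I", little)

print(little, big, text, misread)
print(int(text.decode("ascii")) == value)
```

The trade-off mentioned above also shows up here: a large value such as 4000000000 needs ten ASCII digits as text but still fits in four bytes in binary form.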

A simple text file may need no additional metadata (other than knowledge of its character set) to assist the reader in interpretation. A text file may contain no data at all, in which case it is a zero-byte file.

Encoding

The ASCII character set is the most common compatible subset of character sets for English-language text files, and is generally assumed to be the default encoding in many situations. It covers American English, but for the British pound sign, the euro sign, or characters used outside English, a richer character set must be used. In many systems, this is chosen based on the default locale setting of the computer on which the file is read. Prior to UTF-8, this was traditionally a single-byte encoding (such as one of ISO-8859-1 through ISO-8859-16) for European languages and a wide character encoding for Asian languages.

Because encodings necessarily have only a limited repertoire of characters, often very small, many are only usable to represent text in a limited subset of human languages. Unicode is an attempt to create a common standard for representing all known languages, and most known character sets are subsets of the very large Unicode character set. Although there are multiple character encodings available for Unicode, the most common is UTF-8, which has the advantage of being backwards-compatible with ASCII; that is, every ASCII text file is also a UTF-8 text file with identical meaning. UTF-8 also has the advantage that it is easily auto-detectable. Thus, a common operating mode of UTF-8-capable software, when opening files of unknown encoding, is to try UTF-8 first and fall back to a locale-dependent legacy encoding when the file is definitely not UTF-8.
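The "try UTF-8 first, then fall back" strategy can be sketched as follows. This is a minimal illustration, not how any particular program implements it; the function name `decode_text` and its `fallback` parameter are hypothetical, and the locale lookup uses Python's `locale.getpreferredencoding`.

```python
import locale
from typing import Optional, Tuple


def decode_text(data: bytes, fallback: Optional[str] = None) -> Tuple[str, str]:
    """Decode bytes as UTF-8 if valid, else with a legacy encoding.

    Returns the decoded text and the name of the encoding that was used.
    """
    try:
        # Strict decoding: any invalid byte sequence raises an error,
        # which is what makes UTF-8 practical to auto-detect.
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Fall back to the caller's choice or the locale's preferred encoding.
        encoding = fallback or locale.getpreferredencoding(False)
        return data.decode(encoding, errors="replace"), encoding


print(decode_text("café".encode("utf-8")))
print(decode_text("café".encode("latin-1"), fallback="latin-1"))
```

Pure ASCII input always takes the first branch, which reflects the backwards compatibility described above.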

Formats

On most operating systems, the name text file refers to a file format that allows only plain text content with very little formatting (e.g., no bold or italic types). Such files can be viewed and edited on text terminals or in simple text editors. Text files usually have the MIME type text/plain, usually with additional information indicating an encoding.

Microsoft Windows text files

DOS and Microsoft Windows use a common text file format, with each line of text separated by a two-character combination: carriage return (CR) and line feed (LF). It is common for the last line of text not to be terminated with a CR-LF marker, and many text editors (including Notepad) do not automatically insert one on the last line.
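A reader of this format has to cope with both a terminated and an unterminated last line. The following sketch shows one way to do that; the function names are hypothetical.

```python
from typing import List


def split_crlf_lines(data: bytes) -> List[bytes]:
    """Split DOS/Windows text on CR-LF, with or without a final CR-LF."""
    lines = data.split(b"\r\n")
    # A trailing CR-LF (or empty input) leaves one empty piece at the end.
    if lines and lines[-1] == b"":
        lines.pop()
    return lines


def join_crlf_lines(lines: List[bytes], terminate_last: bool = False) -> bytes:
    """Join lines with CR-LF; the last line is terminated only on request."""
    data = b"\r\n".join(lines)
    if terminate_last and lines:
        data += b"\r\n"
    return data


print(split_crlf_lines(b"first\r\nsecond"))
print(split_crlf_lines(b"first\r\nsecond\r\n"))
```

One consequence of this tolerance is that the sketch cannot distinguish a file whose last line is empty from a file that simply ends with CR-LF.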

On Microsoft Windows operating systems, a file is regarded as a text file if the suffix of the name of the file (the "filename extension") is .txt. However, many other suffixes are used for text files with specific purposes. For example, source code for computer programs is usually kept in text files that have file name suffixes indicating the programming language in which the source is written.

Most Microsoft Windows text files use ANSI, OEM, Unicode or UTF-8 encoding. What Microsoft Windows terminology calls "ANSI encodings" are usually single-byte ISO/IEC 8859 encodings (i.e. ANSI in the Microsoft Notepad menus is really "System Code Page", a non-Unicode legacy encoding), except in locales such as Chinese, Japanese and Korean that require double-byte character sets. ANSI encodings were traditionally used as default system locales within Microsoft Windows, before the transition to Unicode. By contrast, OEM encodings, also known as DOS code pages, were defined by IBM for use in the original IBM PC text mode display system. They typically include graphical and line-drawing characters common in DOS applications. "Unicode"-encoded Microsoft Windows text files contain text in UTF-16 Unicode Transformation Format. Such files normally begin with a byte order mark (BOM), which communicates the endianness of the file content. Although UTF-8 does not suffer from endianness problems, many Microsoft Windows programs (e.g. Notepad) prepend a BOM to the contents of UTF-8-encoded files,^[2] to differentiate UTF-8 encoding from other 8-bit encodings.^[3]
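BOM-based detection can be sketched with the constants in Python's standard `codecs` module. The sketch covers only the UTF-8 and UTF-16 marks discussed above and deliberately ignores UTF-32; the function names and the `default` parameter are hypothetical.

```python
import codecs
from typing import Optional, Tuple

# Byte order marks and the encoding each one signals.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]


def sniff_bom(data: bytes) -> Tuple[Optional[str], int]:
    """Return (encoding, BOM length), or (None, 0) if no known BOM is present."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return encoding, len(bom)
    return None, 0


def decode_with_bom(data: bytes, default: str = "utf-8") -> str:
    """Decode using the BOM if there is one, else the assumed default."""
    encoding, skip = sniff_bom(data)
    return data[skip:].decode(encoding or default)


print(sniff_bom(codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")))
print(decode_with_bom(codecs.BOM_UTF8 + "hi".encode("utf-8")))
```

For the UTF-8 case alone, Python's built-in `utf-8-sig` codec does the same job: it strips a leading BOM when decoding and writes one when encoding.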

Unix text files

On Unix-like operating systems, the text file format is precisely defined: POSIX defines a text file as a file that contains characters organized into zero or more lines,^[4] where lines are sequences of zero or more non-newline characters plus a terminating newline character,^[5] normally LF.
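The line definition above implies a simple check: a file is made up of whole lines only if it is empty or its last byte is a newline. The sketch below tests just that property, assuming LF as the newline character; it does not check any other requirement the standard places on text files, and the function names are hypothetical.

```python
from typing import List, Tuple


def split_posix_lines(data: bytes) -> Tuple[List[bytes], bytes]:
    """Split into complete lines (newline removed) and a trailing fragment.

    The fragment is empty when every line is properly terminated.
    """
    parts = data.split(b"\n")
    fragment = parts.pop()  # whatever follows the last newline
    return parts, fragment


def ends_in_complete_line(data: bytes) -> bool:
    """True if the data is zero or more newline-terminated lines."""
    _, fragment = split_posix_lines(data)
    return fragment == b""


print(ends_in_complete_line(b""))            # zero lines
print(ends_in_complete_line(b"one\ntwo\n"))  # two complete lines
print(ends_in_complete_line(b"one\ntwo"))    # "two" lacks its newline
```

Under this definition the trailing characters of a file that lacks a final newline do not form a line at all, which is why some Unix tools warn about or mishandle such files.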

Additionally, POSIX defines a printable file as a text file whose characters are printable or space or backspace according to regional rules. This excludes most control characters, which are not printable.^[6]

Apple Macintosh text files

Prior to the advent of macOS, the classic Mac OS system regarded the content of a file (the data fork) to be a text file when its resource fork indicated that the type of the file was "TEXT".^[7] Lines of classic Mac OS text files are terminated with CR characters.^[8]

Being a Unix-like system, macOS uses Unix format for text files.^[8] The Uniform Type Identifier (UTI) used for text files in macOS is "public.plain-text"; additional, more specific UTIs are "public.utf8-plain-text" for UTF-8-encoded text, "public.utf16-external-plain-text" and "public.utf16-plain-text" for UTF-16-encoded text, and "com.apple.traditional-mac-plain-text" for classic Mac OS text files.^[7]

Rendering

When a text file is opened in a text editor, its human-readable content is presented to the user, usually as the file's plain text. Depending on the application, control codes may either be acted upon by the editor as instructions or be rendered as visible escape characters that can be edited as plain text. Even when a text file contains plain text, control characters within the file (especially the end-of-file character) can cause some programs to leave part of that text undisplayed.

Notes and references

[1] Lewis, John (2006). Computer Science Illuminated. Jones and Bartlett. ISBN 0-7637-4149-3.
[2] "Using Byte Order Marks". Internationalization for Windows Applications. Microsoft. Jan 7, 2021. Archived from the original on Feb 21, 2023. Retrieved 2022-04-21.
[3] Freytag, Asmus (2015-12-18). "FAQ – UTF-8, UTF-16, UTF-32 & BOM". The Unicode Consortium. Retrieved 2016-05-30. Yes, UTF-8 can contain a BOM. However, it makes no difference as to the endianness of the byte stream. UTF-8 always has the same byte order. An initial BOM is only used as a signature — an indication that an otherwise unmarked text file is in UTF-8. Note that some recipients of UTF-8 encoded data do not expect a BOM. Where UTF-8 is used transparently in 8-bit environments, the use of a BOM will interfere with any protocol or file format that expects specific ASCII characters at the beginning, such as the use of "#!" of at the beginning of Unix shell scripts.
[4] "3.403 Text File". IEEE Std 1003.1, 2017 Edition. IEEE Computer Society. Retrieved 2019-03-01.
[5] "3.206 Line". IEEE Std 1003.1, 2013 Edition. IEEE Computer Society. Retrieved 2015-12-15.
[6] "3.284 Printable File". IEEE Std 1003.1, 2013 Edition. IEEE Computer Society. Retrieved 2015-12-15.
[7] "System-Declared Uniform Type Identifiers". Guides and Sample Code. Apple Inc. 2009-11-17. Retrieved 2016-09-12.
[8] "Designing Scripts for Cross-Platform Deployment". Mac Developer Library. Apple Inc. 2014-03-10. Retrieved 2016-09-12.

External links

Power of Plain Text on C2 wiki

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.