Control character

Last updated

In computing and telecommunication, a control character or non-printing character (NPC) is a code point in a character set that does not represent a written character or symbol. They are used as in-band signaling to cause effects other than the addition of a symbol to the text. All other characters are mainly graphic characters , also known as printing characters (or printable characters), except perhaps for "space" characters. In the ASCII standard there are 33 control characters, such as code 7, BEL, which rings a terminal bell.

Contents

History

Procedural signs in Morse code are a form of control character.

A form of control characters were introduced in the 1870 Baudot code: NUL and DEL. The 1901 Murray code added the carriage return (CR) and line feed (LF), and other versions of the Baudot code included other control characters.

The bell character (BEL), which rang a bell to alert operators, was also an early teletype control character.

Some control characters have also been called "format effectors".

In ASCII

Early symbols assigned to the 32 control characters, space and delete characters. (ISO 2047, MIL-STD-188-100, 1972) US ASCII Control Character Symbols.png
Early symbols assigned to the 32 control characters, space and delete characters. (ISO 2047, MIL-STD-188-100, 1972)

There were quite a few control characters defined (33 in ASCII, and the ECMA-48 standard adds 32 more). This was because early terminals had very primitive mechanical or electrical controls that made any kind of state-remembering API quite expensive to implement, thus a different code for each and every function looked like a requirement. It quickly became possible and inexpensive to interpret sequences of codes to perform a function, and device makers found a way to send hundreds of device instructions. Specifically, they used ASCII code 2710 (escape), followed by a series of characters called a "control sequence" or "escape sequence". The mechanism was invented by Bob Bemer, the father of ASCII. For example, the sequence of code 2710, followed by the printable characters "[2;10H", would cause a Digital Equipment Corporation VT100 terminal to move its cursor to the 10th cell of the 2nd line of the screen. Several standards exist for these sequences, notably ANSI X3.64. But the number of non-standard variations in use is large, especially among printers, where technology has advanced far faster than any standards body can possibly keep up with.

All entries in the ASCII table below code 3210 (technically the C0 control code set) are of this kind, including CR and LF used to separate lines of text. The code 12710 (DEL) is also a control character. [1] [2] Extended ASCII sets defined by ISO 8859 added the codes 12810 through 15910 as control characters. This was primarily done so that if the high bit was stripped, it would not change a printing character to a C0 control code. This second set is called the C1 set.

These 65 control codes were carried over to Unicode. Unicode added more characters that could be considered controls, but it makes a distinction between these "Formatting characters" (such as the zero-width non-joiner) and the 65 control characters.

The Extended Binary Coded Decimal Interchange Code (EBCDIC) character set contains 65 control codes, including all of the ASCII control codes plus additional codes which are mostly used to control IBM peripherals.

ASCII control codes. [3]
0x000x10
0x00 NUL DLE
0x01 SOH DC1
0x02 STX DC2
0x03 ETX DC3
0x04 EOT DC4
0x05 ENQ NAK
0x06 ACK SYN
0x07 BEL ETB
0x08 BS CAN
0x09 HT EM
0x0A LF SUB
0x0B VT ESC
0x0C FF FS
0x0D CR GS
0x0E SO RS
0x0F SI US
0x7F DEL

The control characters in ASCII still in common use include:

Control characters may be described as doing something when the user inputs them, such as code 3 (End-of-Text character, ETX, ^C) to interrupt the running process, or code 4 (End-of-Transmission character, EOT, ^D), used to end text input on Unix or to exit a Unix shell. These uses usually have little to do with their use when they are in text being output.

In Unicode

In Unicode, "Control-characters" are U+0000U+001F (C0 controls), U+007F (delete), and U+0080U+009F (C1 controls). Their General Category is "Cc". Formatting codes are distinct, in General Category "Cf". The Cc control characters have no Name in Unicode, but are given labels such as "<control-001A>" instead. [4]

Display

There are a number of techniques to display non-printing characters, which may be illustrated with the bell character in ASCII encoding:

How control characters map to keyboards

ASCII-based keyboards have a key labelled "Control", "Ctrl", or (rarely) "Cntl" which is used much like a shift key, being pressed in combination with another letter or symbol key. In one implementation, the control key generates the code 64 places below the code for the (generally) uppercase letter it is pressed in combination with (i.e., subtract 0x40 from ASCII code value of the (generally) uppercase letter). The other implementation is to take the ASCII code produced by the key and bitwise AND it with 0x1F, forcing bits 5 to 7 to zero. For example, pressing "control" and the letter "g" (which is 0110 0111 in binary), produces the code 7 (BELL, 7 in base ten, or 0000 0111 in binary). The NULL character (code 0) is represented by Ctrl-@, "@" being the code immediately before "A" in the ASCII character set. For convenience, some terminals accept Ctrl-Space as an alias for Ctrl-@. In either case, this produces one of the 32 ASCII control codes between 0 and 31. Neither approach works to produce the DEL character because of its special location in the table and its value (code 12710), Ctrl-? is sometimes used for this character. [5]

When the control key is held down, letter keys produce the same control characters regardless of the state of the shift or caps lock keys. In other words, it does not matter whether the key would have produced an upper-case or a lower-case letter. The interpretation of the control key with the space, graphics character, and digit keys (ASCII codes 32 to 63) vary between systems. Some will produce the same character code as if the control key were not held down. Other systems translate these keys into control characters when the control key is held down. The interpretation of the control key with non-ASCII ("foreign") keys also varies between systems.

Control characters are often rendered into a printable form known as caret notation by printing a caret (^) and then the ASCII character that has a value of the control character plus 64. Control characters generated using letter keys are thus displayed with the upper-case form of the letter. For example, ^G represents code 7, which is generated by pressing the G key when the control key is held down.

Keyboards also typically have a few single keys which produce control character codes. For example, the key labelled "Backspace" typically produces code 8, "Tab" code 9, "Enter" or "Return" code 13 (though some keyboards might produce code 10 for "Enter").

Many keyboards include keys that do not correspond to any ASCII printable or control character, for example cursor control arrows and word processing functions. The associated keypresses are communicated to computer programs by one of four methods: appropriating otherwise unused control characters; using some encoding other than ASCII; using multi-character control sequences; or using an additional mechanism outside of generating characters. "Dumb" computer terminals typically use control sequences. Keyboards attached to stand-alone personal computers made in the 1980s typically use one (or both) of the first two methods. Modern computer keyboards generate scancodes that identify the specific physical keys that are pressed; computer software then determines how to handle the keys that are pressed, including any of the four methods described above.

The design purpose

The control characters were designed to fall into a few groups: printing and display control, data structuring, transmission control, and miscellaneous.

Printing and display control

Printing control characters were first used to control the physical mechanism of printers, the earliest output device. An early example of this idea was the use of Figures (FIGS) and Letters (LTRS) in Baudot code to shift between two code pages. A later, but still early, example was the out-of-band ASA carriage control characters. Later, control characters were integrated into the stream of data to be printed. The carriage return character (CR), when sent to such a device, causes it to put the character at the edge of the paper at which writing begins (it may, or may not, also move the printing position to the next line). The line feed character (LF/NL) causes the device to put the printing position on the next line. It may (or may not), depending on the device and its configuration, also move the printing position to the start of the next line (which would be the leftmost position for left-to-right scripts, such as the alphabets used for Western languages, and the rightmost position for right-to-left scripts such as the Hebrew and Arabic alphabets). The vertical and horizontal tab characters (VT and HT/TAB) cause the output device to move the printing position to the next tab stop in the direction of reading. The form feed character (FF/NP) starts a new sheet of paper, and may or may not move to the start of the first line. The backspace character (BS) moves the printing position one character space backwards. On printers, including hard-copy terminals, this is most often used so the printer can overprint characters to make other, not normally available, characters. On video terminals and other electronic output devices, there are often software (or hardware) configuration choices that allow a destructive backspace (e.g., a BS, SP, BS sequence), which erases, or a non-destructive one, which does not. The shift in and shift out characters (SI and SO) selected alternate character sets, fonts, underlining, or other printing modes. Escape sequences were often used to do the same thing.

With the advent of computer terminals that did not physically print on paper and so offered more flexibility regarding screen placement, erasure, and so forth, printing control codes were adapted. Form feeds, for example, usually cleared the screen, there being no new paper page to move to. More complex escape sequences were developed to take advantage of the flexibility of the new terminals, and indeed of newer printers. The concept of a control character had always been somewhat limiting, and was extremely so when used with new, much more flexible, hardware. Control sequences (sometimes implemented as escape sequences) could match the new flexibility and power and became the standard method. However, there were, and remain, a large variety of standard sequences to choose from.

Data structuring

The separators (File, Group, Record, and Unit: FS, GS, RS and US) were made to structure data, usually on a tape, in order to simulate punched cards. End of medium (EM) warns that the tape (or other recording medium) is ending. While many systems use CR/LF and TAB for structuring data, it is possible to encounter the separator control characters in data that needs to be structured. The separator control characters are not overloaded; there is no general use of them except to separate data into structured groupings. Their numeric values are contiguous with the space character, which can be considered a member of the group, as a word separator.

For example, the RS separator is used by RFC   7464 (JSON Text Sequences) to encode a sequence of JSON elements. Each sequence item starts with a RS character and ends with a line feed. This allows to serialize open-ended JSON sequences. It is one of the JSON streaming protocols.

Transmission control

The transmission control characters were intended to structure a data stream, and to manage re-transmission or graceful failure, as needed, in the face of transmission errors.

The start of heading (SOH) character was to mark a non-data section of a data stream—the part of a stream containing addresses and other housekeeping data. The start of text character (STX) marked the end of the header, and the start of the textual part of a stream. The end of text character (ETX) marked the end of the data of a message. A widely used convention is to make the two characters preceding ETX a checksum or CRC for error-detection purposes. The end of transmission block character (ETB) was used to indicate the end of a block of data, where data was divided into such blocks for transmission purposes.

The escape character (ESC) was intended to "quote" the next character, if it was another control character it would print it instead of performing the control function. It is almost never used for this purpose today. Various printable characters are used as visible "escape characters", depending on context.

The substitute character (SUB) was intended to request a translation of the next character from a printable character to another value, usually by setting bit 5 to zero. This is handy because some media (such as sheets of paper produced by typewriters) can transmit only printable characters. However, on MS-DOS systems with files opened in text mode, "end of text" or "end of file" is marked by this Ctrl-Z character, instead of the Ctrl-C or Ctrl-D, which are common on other operating systems.

The cancel character (CAN) signaled that the previous element should be discarded. The negative acknowledge character (NAK) is a definite flag for, usually, noting that reception was a problem, and, often, that the current element should be sent again. The acknowledge character (ACK) is normally used as a flag to indicate no problem detected with current element.

When a transmission medium is half duplex (that is, it can transmit in only one direction at a time), there is usually a master station that can transmit at any time, and one or more slave stations that transmit when they have permission. The enquire character (ENQ) is generally used by a master station to ask a slave station to send its next message. A slave station indicates that it has completed its transmission by sending the end of transmission character (EOT).

The device control codes (DC1 to DC4) were originally generic, to be implemented as necessary by each device. However, a universal need in data transmission is to request the sender to stop transmitting when a receiver is temporarily unable to accept any more data. Digital Equipment Corporation invented a convention which used 19 (the device control 3 character (DC3), also known as control-S, or XOFF) to "S"top transmission, and 17 (the device control 1 character (DC1), a.k.a. control-Q, or XON) to start transmission. It has become so widely used that most don't realize it is not part of official ASCII. This technique, however implemented, avoids additional wires in the data cable devoted only to transmission management, which saves money. A sensible protocol for the use of such transmission flow control signals must be used, to avoid potential deadlock conditions, however.

The data link escape character (DLE) was intended to be a signal to the other end of a data link that the following character is a control character such as STX or ETX. For example a packet may be structured in the following way (DLE) <STX> <PAYLOAD> (DLE) <ETX>.

Miscellaneous codes

Code 7 (BEL) is intended to cause an audible signal in the receiving terminal. [6]

Many of the ASCII control characters were designed for devices of the time that are not often seen today. For example, code 22, "synchronous idle" (SYN), was originally sent by synchronous modems (which have to send data constantly) when there was no actual data to send. (Modern systems typically use a start bit to announce the beginning of a transmitted word this is a feature of asynchronous communication. Synchronous communication links were more often seen with mainframes, where they were typically run over corporate leased lines to connect a mainframe to another mainframe or perhaps a minicomputer.)

Code 0 (ASCII code name NUL) is a special case. In paper tape, it is the case when there are no holes. It is convenient to treat this as a fill character with no meaning otherwise. Since the position of a NUL character has no holes punched, it can be replaced with any other character at a later time, so it was typically used to reserve space, either for correcting errors or for inserting information that would be available at a later time or in another place. In computing, it is often used for padding in fixed length records; to mark the end of a string; and formerly to give printing devices enough time to execute a control function.

Code 127 (DEL, a.k.a. "rubout") is likewise a special case. Its 7-bit code is all-bits-on in binary, which essentially erased a character cell on a paper tape when overpunched. Paper tape was a common storage medium when ASCII was developed, with a computing history dating back to WWII code breaking equipment at Biuro Szyfrów. Paper tape became obsolete in the 1970s, so this clever aspect of ASCII rarely saw any use after that. Some systems (such as the original Apples) converted it to a backspace. But because its code is in the range occupied by other printable characters, and because it had no official assigned glyph, many computer equipment vendors used it as an additional printable character (often an all-black "box" character useful for erasing text by overprinting with ink).

Non-erasable programmable ROMs are typically implemented as arrays of fusible elements, each representing a bit, which can only be switched one way, usually from one to zero. In such PROMs, the DEL and NUL characters can be used in the same way that they were used on punched tape: one to reserve meaningless fill bytes that can be written later, and the other to convert written bytes to meaningless fill bytes. For PROMs that switch one to zero, the roles of NUL and DEL are reversed; also, DEL will only work with 7-bit characters, which are rarely used today; for 8-bit content, the character code 255, commonly defined as a nonbreaking space character, can be used instead of DEL.

Many file systems do not allow control characters in filenames, as they may have reserved functions.

See also

Notes and references

  1. ASCII format for network interchange. 1969-10-01. doi: 10.17487/RFC0020 . RFC 20 . Retrieved 2023-04-05.
  2. "5.2 Control Characters". American National Standard Code for Information Interchange | ANSI X3.4-1977 (PDF). National Institute for Standards. 1977. Archived (PDF) from the original on 2022-10-09.
  3. MS-DOS QBasic v1.1 Documentation. Microsoft 1987-1991.
  4. "4.8 Name". The Unicode Standard Version 13.0 – Core Specification (PDF). Unicode, Inc. Archived (PDF) from the original on 2022-10-09.
  5. "ASCII Characters". Archived from the original on October 28, 2009. Retrieved 2010-10-08.
  6. ASCII format for Network Interchange. October 1969. doi: 10.17487/RFC0020 . RFC 20 . Retrieved 2013-11-03. An old RFC, which explains the structure and meaning of the control characters in chapters 4.1 and 5.2

Related Research Articles

<span class="mw-page-title-main">ASCII</span> American character encoding standard

ASCII, an acronym for American Standard Code for Information Interchange, is a character encoding standard for electronic communication. ASCII codes represent text in computers, telecommunications equipment, and other devices. Because of technical limitations of computer systems at the time it was invented, ASCII has just 128 code points, of which only 95 are printable characters, which severely limited its scope. Modern computer systems have evolved to use Unicode, which has millions of code points, but the first 128 of these are the same as the ASCII set.

<span class="mw-page-title-main">Plain text</span> Term for computer data consisting only of unformatted characters of readable material

In computing, plain text is a loose term for data that represent only characters of readable material but not its graphical representation nor other objects. It may also include a limited number of "whitespace" characters that affect simple arrangement of text, such as spaces, line breaks, or tabulation characters. Plain text is different from formatted text, where style information is included; from structured text, where structural parts of the document such as paragraphs, sections, and the like are identified; and from binary files in which some portions must be interpreted as binary objects.

In telecommunication, an End-of-Transmission character (EOT) is a transmission control character. Its intended use is to indicate the conclusion of a transmission that may have included one or more texts and any associated message headings.

In computing and telecommunication, an escape character is a character that invokes an alternative interpretation on the following characters in a character sequence. An escape character is a particular case of metacharacters. Generally, the judgement of whether something is an escape character or not depends on the context.

In computer science, an escape sequence is a combination of characters that has a meaning other than the literal characters contained therein; it is marked by one or more preceding characters.

<span class="mw-page-title-main">Control key</span> Key on computer keyboards

In computing, a Control keyCtrl is a modifier key which, when pressed in conjunction with another key, performs a special operation. Similarly to the Shift key, the Control key rarely performs any function when pressed by itself. The Control key is located on or near the bottom left side of most keyboards, with many featuring an additional one at the bottom right.

A text file is a kind of computer file that is structured as a sequence of lines of electronic text. A text file exists stored as data within a computer file system. In operating systems such as CP/M and DOS, where the operating system does not keep track of the file size in bytes, the end of a text file is denoted by placing one or more special characters, known as an end-of-file (EOF) marker, as padding after the last line in a text file. On modern operating systems such as Microsoft Windows and Unix-like systems, text files do not contain any special EOF character, because file systems on those operating systems keep track of the file size in bytes. Most text files need to have end-of-line delimiters, which are done in a few different ways depending on operating system. Some operating systems with record-orientated file systems may not use new line delimiters and will primarily store text files with lines separated as fixed or variable length records.

A carriage return, sometimes known as a cartridge return and often shortened to CR, <CR> or return, is a control character or mechanism used to reset a device's position to the beginning of a line of text. It is closely associated with the line feed and newline concepts, although it can be considered separately in its own right.

<span class="mw-page-title-main">Newline</span> Special characters in computing signifying the end of a line of text

A newline is a control character or sequence of control characters in character encoding specifications such as ASCII, EBCDIC, Unicode, etc. This character, or a sequence of characters, is used to signify the end of a line of text and the start of a new one.

The null character is a control character with the value zero. It is present in many character sets, including those defined by the Baudot and ITA2 codes, ISO/IEC 646, the C0 control code, the Universal Coded Character Set, and EBCDIC. It is available in nearly all mainstream programming languages. It is often abbreviated as NUL. In 8-bit codes, it is known as a null byte.

<span class="mw-page-title-main">Tab key</span> Key on a keyboard for tabulation

The tab keyTab ↹ on a keyboard is used to advance the cursor to the next tab stop.

ISO/IEC 2022Information technology—Character code structure and extension techniques, is an ISO/IEC standard in the field of character encoding. It is equivalent to the ECMA standard ECMA-35, the ANSI standard ANSI X3.41 and the Japanese Industrial Standard JIS X 0202. Originating in 1971, it was most recently revised in 1994.

A bell character is a device control code originally sent to ring a small electromechanical bell on tickers and other teleprinters and teletypewriters to alert operators at the other end of the line, often of an incoming message. Though tickers punched the bell codes into their tapes, printers generally do not print a character when the bell code is received. Bell codes are usually represented by the label "BEL". They have been used since 1870.

Caret notation is a notation for control characters in ASCII. The notation assigns ^A to control-code 1, sequentially through the alphabet to ^Z assigned to control-code 26 (0x1A). For the control-codes outside of the range 1–26, the notation extends to the adjacent, non-alphabetic ASCII characters.

The C0 and C1 control code or control character sets define control codes for use in text by computer systems that use ASCII and derivatives of ASCII. The codes represent additional information about the text, such as the position of a cursor, an instruction to start a new line, or a message that the text has been received.

The delete control character is the last character in the ASCII repertoire, with the code 127. It is supposed to do nothing and was designed to erase incorrect characters on paper tape. It is denoted as ^? in caret notation and is U+007F in Unicode.

Many Unicode characters are used to control the interpretation or display of text, but these characters themselves have no visual or spatial representation. For example, the null character is used in C-programming application environments to indicate the end of a string of characters. In this way, these programs only require a single starting memory address for a string, since the string ends once the program reads the null character.

The End-of-Text character (ETX) is a control character used to inform the receiving computer that the end of a record has been reached. This may or may not be an indication that all of the data in a record have been received. It is often used in conjunction with Start of Text (STX) and Data Link Escape (DLE), e.g., to distinguish data frames in the data link layer. All this use is pretty much obsolete.

ISO 1745:1975 Information processing – Basic mode control procedures for data communication systems is an early ISO standard defining a Telex-oriented communications protocol that used the non-printable ASCII transmission control characters SOH, STX, ETX, EOT, ENQ (Enquiry), ACK (Acknowledge), DLE, NAK, SYN, and ETB.

<span class="mw-page-title-main">Hazeltine 1500</span> Computer terminal

The Hazeltine 1500 was a popular smart terminal introduced by Hazeltine Corporation in April 1977 at a price of $1,125. Using a microprocessor and semiconductor random access memory, it implemented the basic features of the earlier Hazeltine 2000 in a much smaller and less expensive system, less than half the price. It came to market just as the microcomputer revolution was taking off, and the 1500 was very popular among early hobbyist users.