Luit

luit
Screenshot: luit rendering ISO 8859-1 accented characters on a UTF-8 terminal emulator.
Original author(s): Juliusz Chroboczek
Initial release: 2001
Stable release: 2.0 / February 17, 2013 [1]
Operating system: Unix and Unix-like
Type: Utility software
License: MIT/X Consortium License
Website: invisible-island.net/luit/

luit is a utility program used to translate the character set of a computer program so that its output can be displayed correctly on a terminal emulator that uses a different character set. [2] Whereas iconv converts the character set of strings or text files at rest, luit converts the input and output of programs running interactively.

Overview

The main purpose of luit is to allow "legacy" applications that use character sets other than UTF-8 to work with contemporary terminal emulators.

luit may be required today when connecting to a "legacy" host that only supports an older encoding, such as ISO 8859-1. For example, instead of running "ssh legacy-machine", a user may have to run "LC_ALL=fr_FR luit ssh legacy-machine" to properly render French accented characters on a UTF-8 terminal. [2]
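To make the mismatch concrete, the short Python sketch below shows why ISO 8859-1 output cannot be displayed as-is on a UTF-8 terminal and what a converter has to do to each byte. It only illustrates the byte-level conversion and is not luit's implementation; the helper name latin1_to_utf8 is made up for this example.

```python
def latin1_to_utf8(raw: bytes) -> bytes:
    """Re-encode ISO 8859-1 bytes as UTF-8 (hypothetical helper)."""
    return raw.decode("iso-8859-1").encode("utf-8")


# "café" as a host using ISO 8859-1 would send it: é is the single byte 0xE9.
legacy_output = b"caf\xe9"

# A UTF-8 terminal cannot interpret the lone 0xE9 byte as a character.
try:
    legacy_output.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# After conversion, é becomes the two-byte UTF-8 sequence 0xC3 0xA9.
converted = latin1_to_utf8(legacy_output)

print(valid_utf8)
print(converted)
```

In this sketch the unconverted bytes are rejected by a UTF-8 decoder, while the converted bytes decode to the intended text.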

luit is also used to properly render the output of applications that use ISO 2022 character set switching. ISO 2022 is an older standard [3] that allows an application to "switch" between different character sets, e.g., to mix line-drawing characters with text or to display text in multiple languages. UTF-8 has no such switching mechanism; the encoding is stateless and gives each unique character (including line-drawing characters) its own numerical encoding. luit can translate between these two encodings.
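The difference between a switching encoding and a stateless one can be seen with Python's built-in codecs. The sketch below uses ISO-2022-JP, one encoding built on ISO 2022, purely as an illustration; it is an assumption of this example and not a statement about which encodings luit handles.

```python
ESC = b"\x1b"

# Latin "A", Hiragana "あ" (U+3042), Latin "B".
text = "A\u3042B"

# ISO-2022-JP inserts escape sequences to switch character sets mid-stream.
stateful = text.encode("iso2022_jp")

# UTF-8 needs no switching: each character has its own byte sequence.
stateless = text.encode("utf-8")

print(stateful)
print(stateless)
print(ESC in stateful, ESC in stateless)
```

The ISO-2022-JP bytes can only be read correctly by a decoder that tracks which character set is currently selected, whereas each UTF-8 sequence can be decoded on its own.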

Examples of programs that require translation to run correctly on a UTF-8 terminal include earlier versions of emacs/MULE, [4] and programs that use ISO 2022 shift sequences in ANSI escape codes that switch to an alternate character set in order to draw line-drawing characters.
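As a rough illustration of the line-drawing case, the Python sketch below translates a stream that uses ESC ( 0 and ESC ( B, the sequences VT100-style terminals use to select the DEC Special Graphics set and to return to ASCII, into plain Unicode text. It is a simplified sketch, not luit's code: it handles only these two sequences and a small subset of the graphics characters, and the function name translate_line_drawing is hypothetical.

```python
ESC = "\x1b"

# Subset of the DEC Special Graphics set and the Unicode equivalents.
DEC_GRAPHICS = {
    "j": "\u2518",  # lower-right corner
    "k": "\u2510",  # upper-right corner
    "l": "\u250c",  # upper-left corner
    "m": "\u2514",  # lower-left corner
    "q": "\u2500",  # horizontal line
    "x": "\u2502",  # vertical line
}


def translate_line_drawing(stream: str) -> str:
    """Replace character-set switching with stateless Unicode characters."""
    out = []
    graphics = False  # state: is the graphics set currently selected?
    i = 0
    while i < len(stream):
        if stream.startswith(ESC + "(0", i):
            graphics = True
            i += 3
        elif stream.startswith(ESC + "(B", i):
            graphics = False
            i += 3
        else:
            ch = stream[i]
            out.append(DEC_GRAPHICS.get(ch, ch) if graphics else ch)
            i += 1
    return "".join(out)


boxed = translate_line_drawing(ESC + "(0lqqk" + ESC + "(B" + " ok")
print(boxed)
```

The same letters "lqqk" pass through unchanged when no switching sequence precedes them, which is why the translator has to keep track of state.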

luit is invoked automatically by xterm when necessary to translate program output into UTF-8, [5] for programs running on a local computer. When connecting remotely to another computer, the user must run luit directly.

luit interprets application output according to the locale's character set with ISO 2022 shifts and ECMA-48 escape sequences. If an application emits output in an encoding other than the locale's character set (one that may have matched the terminal emulator's expectations in the absence of luit), luit can misinterpret that output and send corrupted text to the terminal. [6]
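The kind of corruption described above can be reproduced in a few lines of Python. The sketch assumes a hypothetical situation in which the application already emits UTF-8 while the converter has been told to expect ISO 8859-1; it shows the general effect of a wrong assumption and does not model luit itself.

```python
# The application actually emits UTF-8: é is the two bytes 0xC3 0xA9.
app_output = "\u00e9".encode("utf-8")

# The converter wrongly treats each byte as an ISO 8859-1 character...
misread = app_output.decode("iso-8859-1")

# ...and re-encodes the result as UTF-8 for the terminal.
sent_to_terminal = misread.encode("utf-8")

print(misread)
print(sent_to_terminal)
```

One intended character turns into two unrelated ones, so the terminal shows garbled text even though every byte it receives is valid UTF-8.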

History

luit was written in 2001 by Juliusz Chroboczek, [4] when major Linux distributions began migrating to the Unicode character set from "legacy" encodings such as ISO 8859-1. [3] It has since become a widely installed base utility, present on more than half of all Linux computer systems by some estimates. [7] [8] It is also part of IBM's AIX. [9]

Implementations

There are two versions of luit: one maintained by Thomas Dickey [5] as part of xterm, and another formerly updated by Freedesktop.org. [10] [11] Some Linux distributions ship the latter version [12] as part of their X11 utilities package. However, the latter fork was retired during the project's migration to GitLab because it was unmaintained. [13]

References

  1. "LUIT - Change Log". 2013-02-17.
  2. "luit manual page".
  3. "UTF-8 and Unicode FAQ for Unix/Linux".
  4. "luit author website".
  5. "luit home page".
  6. "luit notes".
  7. "x11-utils Debian popularity contest results".
  8. "Ubuntu popularity contest results".
  9. AIX 7.1 manual.
  10. "Xorg luit home page".
  11. Coopersmith, Alan (March 22, 2012). "Luit 1.1.1 release announcement".
  12. "Freedesktop mailing list discussion, 'luit forked?'". April 2009.
  13. Jackson, Adam (August 7, 2018). "[PATCH app/luit] Retire this fork of luit". xorg-devel@lists.x.org (Mailing list).