Iconv

Last updated

iconv
Original author(s) Hewlett-Packard
Developer(s) Various open-source and commercial developers
Repository https://git.savannah.gnu.org/git/libiconv.git
Operating system Unix, Unix-like, Microsoft Windows, IBM i
Platform Cross-platform
Type Command
License libiconv: LGPL
iconv: GPL
win-iconv: Public domain [1]

In Unix and Unix-like operating systems, iconv (an abbreviation of internationalization conversion) [2] is a command-line program [3] and a standardized application programming interface (API) [4] used to convert between different character encodings. "It can convert from any of these encodings to any other, through Unicode conversion." [5]

Contents

History

Initially appearing on the HP-UX operating system, [6] iconv() as well as the utility was standardized within XPG4 and is part of the Single UNIX Specification (SUS).

Implementations

Most Linux distributions provide an implementation, either from the GNU Standard C Library (included since version 2.1, February 1999), or the more traditional GNU libiconv, for systems based on other Standard C Libraries.

The iconv function [7] on both is licensed as LGPL, so it is linkable with closed source applications.

Unlike the libraries, the iconv utility is licensed under GPL in both implementations. [8] The GNU libiconv implementation is portable, and can be used on various UNIX-like and non-UNIX systems. Version 0.3 dates from December 1999.

The uconv utility from International Components for Unicode provides an iconv-compatible command-line syntax for transcoding.

Most BSD systems use NetBSD's implementation, which first appeared in December 2004.

Support

Currently, over a hundred different character encodings are supported in the GNU variant. [5]

Ports

Under Microsoft Windows, the iconv library and the utility is provided by GNU's libiconv found in Cygwin [9] and GnuWin32 [10] environments; there is also a "purely Win32" implementation called "win-iconv" that uses Windows' built-in routines for conversion. [11] The iconv function is also available for many programming languages.

The iconv command has also been ported to the IBM i operating system. [12]

Usage

stdin can be converted from ISO-8859-1 to current locale and output to stdout using: [13]

iconv-fiso-8859-1 

An input file infile can be converted from ISO-8859-1 to UTF-8 and output to output file outfile using:

iconv-fiso-8859-1-tutf-8<infile>-o<outfile> 

See also

Related Research Articles

<span class="mw-page-title-main">Character encoding</span> Using numbers to represent text characters

Character encoding is the process of assigning numbers to graphical characters, especially the written characters of human language, allowing them to be stored, transmitted, and transformed using computers. The numerical values that make up a character encoding are known as code points and collectively comprise a code space, a code page, or character map.

<span class="mw-page-title-main">Cygwin</span> Unix-like environment for Windows

Cygwin is a free and open-source Unix-like environment and command-line interface (CLI) for Microsoft Windows. The project also provides a software repository containing many open-source packages. Cygwin allows source code for Unix-like operating systems to be compiled and run on Windows. Cygwin provides native integration of Windows-based applications.

In computing, tar is a computer software utility for collecting many files into one archive file, often referred to as a tarball, for distribution or backup purposes. The name is derived from "tape archive", as it was originally developed to write data to sequential I/O devices with no file system of their own, such as devices that use magnetic tape. The archive data sets created by tar contain various file system parameters, such as name, timestamps, ownership, file-access permissions, and directory organization. POSIX abandoned tar in favor of pax, yet tar sees continued widespread use.

In computing, a code page is a character encoding and as such it is a specific association of a set of printable characters and control characters with unique numbers. Typically each number represents the binary value in a single byte.

<span class="mw-page-title-main">Windows-1252</span> Windows character set for Latin alphabet

Windows-1252 or CP-1252 is a legacy single-byte character encoding that is used by default in Microsoft Windows throughout the Americas, Western Europe, Oceania, and much of Africa.

<span class="mw-page-title-main">Newline</span> Special characters in computing signifying the end of a line of text

A newline is a control character or sequence of control characters in character encoding specifications such as ASCII, EBCDIC, Unicode, etc. This character, or a sequence of characters, is used to signify the end of a line of text and the start of a new one.

ISO/IEC 8859-2:1999, Information technology — 8-bit single-byte coded graphic character sets — Part 2: Latin alphabet No. 2, is part of the ISO/IEC 8859 series of ASCII-based standard character encodings, first edition published in 1987. It is informally referred to as "Latin-2". It is generally intended for Central or "Eastern European" languages that are written in the Latin script. Note that ISO/IEC 8859-2 is very different from code page 852 which is also referred to as "Latin-2" in Czech and Slovak regions. Almost half the use of the encoding is for Polish, and it's the main legacy encoding for Polish, while virtually all use of it has been replaced by UTF-8.

patch (Unix) Unix utility to apply changes to text files

The computer tool patch is a Unix program that updates text files according to instructions contained in a separate file, called a patch file. The patch file is a text file that consists of a list of differences and is produced by running the related diff program with the original and updated file as arguments. Updating files with patch is often referred to as applying the patch or simply patching the files.

<span class="mw-page-title-main">GB 18030</span> Official Chinese character encoding

GB 18030 is a Chinese government standard, described as Information Technology — Chinese coded character set and defines the required language and character support necessary for software in China. GB18030 is the registered Internet name for the official character set of the People's Republic of China (PRC) superseding GB2312. As a Unicode Transformation Format, GB18030 supports both simplified and traditional Chinese characters. It is also compatible with legacy encodings including GB/T 2312, CP936, and GBK 1.0.

xargs is a command on Unix and most Unix-like operating systems used to build and execute commands from standard input. It converts input from standard input into arguments to a command.

dd is a command-line utility for Unix, Plan 9, Inferno, and Unix-like operating systems and beyond, the primary purpose of which is to convert and copy files. On Unix, device drivers for hardware and special device files appear in the file system just like normal files; dd can also read and/or write from/to these files, provided that function is implemented in their respective driver. As a result, dd can be used for tasks such as backing up the boot sector of a hard drive, and obtaining a fixed amount of random data. The dd program can also perform conversions on the data as it is copied, including byte order swapping and conversion to and from the ASCII and EBCDIC text encodings.

In computing, uconv is a command-line tool that is bundled with International Components for Unicode that converts text files between different character encodings. It is very similar to the iconv command that is part of the Single UNIX Specification which is usually implemented using libiconv. In fact the command line options for transcoding are the same. The command uconv can also convert to and from various Unicode normalization forms.

The Standard Compression Scheme for Unicode (SCSU) is a Unicode Technical Standard for reducing the number of bytes needed to represent Unicode text, especially if that text uses mostly characters from one or a small number of per-language character blocks. It does so by dynamically mapping values in the range 128–255 to offsets within particular blocks of 128 characters. The initial conditions of the encoder mean that existing strings in ASCII and ISO-8859-1 that do not contain C0 control codes other than NULL TAB CR and LF can be treated as SCSU strings. Since most alphabets do reside in blocks of contiguous Unicode codepoints, texts that use small alphabets and either ASCII punctuation or punctuation that fits within the window for the main alphabet can be encoded at one byte per character, most other punctuation can be encoded at 2 bytes per symbol through non-locking shifts. SCSU can also switch to UTF-16 internally to handle non-alphabetic languages.

tr (Unix) Unix text formatting utility

tr is a command in Unix, Plan 9, Inferno, and Unix-like operating systems. It is an abbreviation of translate or transliterate, indicating its operation of replacing or removing specific characters in its input data set.

In computing, kill is a command that is used in several popular operating systems to send signals to running processes.

In computing, cmp is a command-line utility on Unix and Unix-like operating systems that compares two files of any type and writes the results to the standard output. By default, cmp is silent if the files are the same; if they differ, the byte and line number at which the first difference occurred is reported. The command is also available in the OS-9 shell.

This article provides basic comparisons for notable text editors. More feature details for text editors are available from the Category of text editor features and from the individual products' articles. This article may not be up-to-date or necessarily all-inclusive.

Windows code pages are sets of characters or code pages used in Microsoft Windows from the 1980s and 1990s. Windows code pages were gradually superseded when Unicode was implemented in Windows, although they are still supported both within Windows and other platforms, and still apply when Alt code shortcuts are used.

sum is a legacy utility available on some Unix and Unix-like operating systems. This utility outputs a 16-bit checksum of each argument file, as well as the number of blocks they take on disk. Two different checksum algorithms are in use. POSIX abandoned sum in favor of cksum.

luit

luit is a utility program used to translate the character set of a computer program so that its output can be displayed correctly on a terminal emulator that uses a different character set. Whereas iconv converts the character set of strings or text files at rest, luit converts the input and output of programs running interactively.

References

  1. "win-iconv/readme.txt at master · win-iconv/win-iconv · GitHub". GitHub .
  2. "R: Convert Character Vector between Encodings". astrostatistics.psu.edu. Retrieved 21 April 2018.
  3. "iconv". pubs.opengroup.org. Retrieved 21 April 2018.
  4. "iconv". www.opengroup.org. Retrieved 21 April 2018.
  5. 1 2 "libiconv - GNU Project - Free Software Foundation (FSF)". www.gnu.org. Retrieved 21 April 2018.
  6. "iconv(3C)". docstore.mik.ua. Retrieved 21 April 2018.
  7. "glibc: iconv/iconv.c" . Retrieved 30 November 2016.[ permanent dead link ]
  8. "glibc: iconv/iconv_prog.c" . Retrieved 30 November 2016.[ permanent dead link ]
  9. "Cygwin Package Search: libiconv". Archived from the original on 30 November 2016. Retrieved 30 November 2016.
  10. "LibIconv for Windows". gnuwin32.sourceforge.net. Retrieved 21 April 2018.
  11. "win32-iconv". GitHub. Retrieved 30 November 2016.
  12. IBM. "IBM System i Version 7.2 Programming Qshell" (PDF). IBM . Retrieved 5 September 2020.
  13. "IBM Knowledge Center". www-01.ibm.com. Retrieved 21 April 2018.