Iconv


iconv
Original author(s): Hewlett-Packard
Developer(s): Various open-source and commercial developers
Repository: https://git.savannah.gnu.org/git/libiconv.git
Operating system: Unix, Unix-like, Microsoft Windows, IBM i
Platform: Cross-platform
Type: Command
License: libiconv: LGPL; iconv: GPL; win-iconv: Public domain [1]

In Unix and Unix-like operating systems, iconv (an abbreviation of internationalization conversion) [2] is a command-line program [3] and a standardized application programming interface (API) [4] used to convert between different character encodings. "It can convert from any of these encodings to any other, through Unicode conversion." [5]


History

Both the iconv() function and the iconv utility first appeared on the HP-UX operating system. [6] They were standardized within XPG4 and are part of the Single UNIX Specification (SUS).

Implementations

Most Linux distributions provide an implementation, either from the GNU C Library (which has included one since version 2.1, February 1999) or from the more traditional GNU libiconv, which serves systems based on other C standard libraries.

In both implementations the iconv function [7] is licensed under the LGPL, so it can be linked with closed-source applications.

Unlike the libraries, the iconv utility is licensed under the GPL in both implementations. [8] The GNU libiconv implementation is portable and can be used on various UNIX-like and non-UNIX systems. Version 0.3 of libiconv dates from December 1999.

The uconv utility from International Components for Unicode provides an iconv-compatible command-line syntax for transcoding.

Most BSD systems use NetBSD's implementation, which first appeared in December 2004.

Support

GNU libiconv supports over a hundred different character encodings. [5]
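
The set of encodings an installed iconv recognizes can be inspected with the -l option, which POSIX specifies for the utility. The following is a minimal sketch; the output format and the exact names listed vary between implementations, so the check below is only a rough, case-insensitive match.

```shell
# Show the first few lines of the encoding list. sed reads all of its
# input, so iconv is never cut off mid-write.
iconv -l | sed -n '1,5p'

# Rough check that some spelling of UTF-8 appears in the list.
if iconv -l | grep -i 'utf-8' > /dev/null; then
    echo "UTF-8 is listed"
fi
```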

Ports

Under Microsoft Windows, the iconv library and utility are provided by GNU libiconv, which is available in the Cygwin [9] and GnuWin32 [10] environments. There is also a "purely Win32" implementation called "win-iconv" that uses Windows' built-in routines for conversion. [11] The iconv function is also available for many programming languages.

The iconv command has also been ported to the IBM i operating system. [12]

Usage

Standard input can be converted from ISO-8859-1 to the encoding of the current locale and written to standard output using: [13]

iconv -f iso-8859-1
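
As an illustration of the stdin-to-stdout form, the sketch below feeds a short ISO-8859-1 string through a pipe. The sample text and the use of od to display bytes are assumptions made for the example. The second command names the target encoding explicitly so that the result does not depend on the locale.

```shell
# The byte 0xE9 (octal 351) is "é" in ISO-8859-1.
# Without -t, iconv converts to the current locale's encoding; this can
# fail if that encoding has no "é" (for example, a plain ASCII locale).
printf 'caf\351\n' | iconv -f iso-8859-1 ||
    echo "the current locale cannot represent the input" >&2

# With an explicit target, od shows the converted bytes:
# 0xE9 becomes the two-byte UTF-8 sequence c3 a9.
printf 'caf\351\n' | iconv -f iso-8859-1 -t utf-8 | od -An -tx1
```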

An input file infile can be converted from ISO-8859-1 to UTF-8 and written to the output file outfile using:

iconv -f iso-8859-1 -t utf-8 infile -o outfile
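
The file-to-file conversion can be checked with a round trip, sketched below under a few assumptions: the file names and sample text are made up for the example, mktemp -d is available (it is common but not required by POSIX), and the installed iconv accepts -o. Where -o is not available, redirecting standard output to the file has the same effect.

```shell
# Use a scratch directory so no existing files are touched;
# it can be deleted afterwards.
workdir=$(mktemp -d)

# Create a small ISO-8859-1 file: octal 357 (0xEF) is "ï", giving "naïve".
printf 'na\357ve\n' > "$workdir/infile"

# Convert to UTF-8, writing the result to outfile.
iconv -f iso-8859-1 -t utf-8 -o "$workdir/outfile" "$workdir/infile"

# Convert back and compare with the original bytes.
iconv -f utf-8 -t iso-8859-1 "$workdir/outfile" > "$workdir/roundtrip"
cmp "$workdir/infile" "$workdir/roundtrip" && echo "round trip preserved the data"
```

A successful round trip only shows that every character in the input exists in both encodings. Converting in the other direction, from UTF-8 to ISO-8859-1, fails for text containing characters that ISO-8859-1 cannot represent.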


References

  1. "win-iconv/readme.txt at master · win-iconv/win-iconv · GitHub".
  2. "R: Convert Character Vector between Encodings". astrostatistics.psu.edu. Retrieved 21 April 2018.
  3. "iconv". pubs.opengroup.org. Retrieved 21 April 2018.
  4. "iconv". www.opengroup.org. Retrieved 21 April 2018.
  5. "libiconv - GNU Project - Free Software Foundation (FSF)". www.gnu.org. Retrieved 21 April 2018.
  6. "iconv(3C)". docstore.mik.ua. Retrieved 21 April 2018.
  7. "glibc: iconv/iconv.c". Retrieved 30 November 2016. [permanent dead link]
  8. "glibc: iconv/iconv_prog.c". Retrieved 30 November 2016. [permanent dead link]
  9. "Cygwin Package Search: libiconv". Archived from the original on 30 November 2016. Retrieved 30 November 2016.
  10. "LibIconv for Windows". gnuwin32.sourceforge.net. Retrieved 21 April 2018.
  11. "win32-iconv". GitHub. Retrieved 30 November 2016.
  12. IBM. "IBM System i Version 7.2 Programming Qshell" (PDF). Retrieved 5 September 2020.
  13. "IBM Knowledge Center". www-01.ibm.com. Retrieved 21 April 2018.