VISCII

MIME / IANA: VISCII
Language(s): Vietnamese, English
Created by: Viet-Std Group
Definitions: RFC 1456
Classification: 8-bit SBCS
Based on: ASCII

VISCII is an unofficially defined, modified-ASCII character encoding for using the Vietnamese language with computers. It should not be confused with the similarly named, officially registered VSCII encoding. VISCII keeps the 95 printable characters of ASCII unmodified, but it replaces 6 of the 33 control characters with printable characters and adds 128 precomposed characters in the upper half of the 8-bit range. Unicode and the Windows-1258 code page are now used for virtually all Vietnamese computer data,[citation needed] but legacy VSCII and VISCII files may need conversion.

History and naming

VISCII was designed in 1992 by the Vietnamese Standardization Working Group (Viet-Std Group), [1] based in Silicon Valley, California, and led by Christopher Cuong T. Nguyen, Cuong M. Bui, and Hoc D. Ngo, while the group was working with the Unicode Consortium to include precomposed Vietnamese characters in the Unicode standard. VISCII, along with VIQR, was first published in a bilingual report in September 1992, in which it was dubbed the "Vietnamese Standard Code for Information Interchange". [2] The report noted that computer usage in Vietnam was proliferating, that the volume of computer-based communication among Vietnamese abroad was increasing, that existing applications used vendor-specific encodings which could not interoperate with one another, and that standardisation between vendors was therefore necessary. The successful inclusion of composed and precomposed Vietnamese in Unicode 1.0 was the result of the lessons learned from the development of 8-bit VISCII and 7-bit VIQR. [2]

The next year, in 1993, Vietnam adopted TCVN 5712, its first national standard in the information technology domain. [3] This defined a character encoding named VSCII, which had been developed by the TCVN Technical Committee on Information Technology (TCVN/TC1) and whose name also stands for "Vietnamese Standard Code for Information Interchange". [3] VSCII is incompatible with, and otherwise unrelated to, the earlier-published VISCII. [4] Unlike VISCII, VSCII is a "Vietnamese Standard" in the sense of a national standard.

VISCII and VIQR were approved as the informational-status RFC 1456, attributed to the Viet-Std Group and dated May 1993. As is usual for informational RFCs, RFC 1456 describes them as "conventions" used by overseas Vietnamese speakers on Usenet and states that it "specifies no level of standard". In spite of this, it continues to call VISCII the "VIetnamese Standard Code for Information Interchange" (the same name taken by VSCII). [5] The labels VISCII and csVISCII are registered with the IANA for VISCII, with reference to RFC 1456. [6] (There is, on the other hand, no official IANA label for TCVN 5712 / VSCII, although x-viet-tcvn5712 was previously supported by Mozilla Firefox. [7])

Design

A traditional extended ASCII character set consists of the ASCII set plus up to 128 characters. Vietnamese requires 134 additional letter-diacritic combinations, which is six too many. There are (short of dropping tone mark support for capital letters, as in VSCII-3) essentially four different ways to handle this problem:

  1. Use variable-width encoding (as does UTF-8)
  2. Include combining diacritical marks for tone marks (as do VSCII-2 and Windows-1258) or for diacritics in general (as do ANSEL and VNI)
  3. Replace some ASCII punctuation, preferably punctuation which is not invariant in ISO 646 (as does VNI for DOS)
  4. Replace at least six of the basic ASCII control characters (as do VPS and VSCII-1)
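The first two options can be illustrated with Python's standard library. This is only a sketch for a single letter (ế, U+1EBF), chosen here as an example; it shows the general techniques, not how any particular Vietnamese software implemented them.

```python
import unicodedata

letter = "\u1ebf"  # ế: LATIN SMALL LETTER E WITH CIRCUMFLEX AND ACUTE

# Option 1: a variable-width encoding spends several bytes on one letter.
utf8_bytes = letter.encode("utf-8")

# Option 2: split the letter into a base character plus combining marks.
decomposed = unicodedata.normalize("NFD", letter)
mark_names = [unicodedata.name(ch) for ch in decomposed]

# Windows-1258 stores the circumflexed vowel as one byte and the tone mark
# as a second, combining byte (Python ships a "cp1258" codec).
cp1258_bytes = "\u00ea\u0301".encode("cp1258")

# Recomposition gives back the single precomposed code point.
recomposed = unicodedata.normalize("NFC", decomposed)

print(len(utf8_bytes), mark_names, len(cp1258_bytes), recomposed == letter)
```

Both approaches keep all of ASCII intact, at the cost of one letter no longer being one byte.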

VISCII went for the last option, replacing six of the least problematic (e.g., least likely to be recognised by an application and acted on specially) C0 control codes (STX, ENQ, ACK, DC4, EM, and RS) with six of the least-used uppercase letter-diacritic combinations. [2] While this option may cause programs that use those control codes to malfunction when handling VISCII text, it creates fewer complications than the other options (the designers note that non-8-bit-clean transmission had been found to pose more difficulty in practice than the control character re-use). [2] Nonetheless, the locations of both C0 and C1 control characters, and the codes used for the non-breaking space in ISO-8859-1, Mac OS Roman and OEM-US, were deliberately assigned to uppercase letters. The intention was that, if graphical characters could not be displayed for those codes, using the lowercase code points with an all-capital font would be a serviceable workaround. [2]
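The six reused positions can be written down as a small lookup table. The sketch below assumes the standard ASCII byte values for the six control codes named above and takes the letters from the character set table; the helper name `reused_control_offsets` is made up for illustration.

```python
from typing import List

# ASCII control position -> letter that VISCII places there.
VISCII_C0_LETTERS = {
    0x02: "\u1eb2",  # STX -> Ẳ
    0x05: "\u1eb4",  # ENQ -> Ẵ
    0x06: "\u1eaa",  # ACK -> Ẫ
    0x14: "\u1ef6",  # DC4 -> Ỷ
    0x19: "\u1ef8",  # EM  -> Ỹ
    0x1E: "\u1ef4",  # RS  -> Ỵ
}

def reused_control_offsets(data: bytes) -> List[int]:
    """Offsets of bytes that VISCII reads as letters but ASCII reads as controls."""
    return [i for i, b in enumerate(data) if b in VISCII_C0_LETTERS]

# 128 upper-half positions plus these 6 give the 134 combinations needed.
total_letters = 128 + len(VISCII_C0_LETTERS)

print(reused_control_offsets(b"ab\x02c\x1e"), total_letters)
```

A check like this could flag VISCII data that is likely to upset software which still acts on those control codes.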

However, using up all the extended code points for accented letters left no room for the useful symbols, superscripted numbers, curved quotes, proper dashes, etc. that most other extended ASCII character sets provide.

For user-friendliness, characters that VISCII shares with ISO-8859-1 were deliberately placed, for the most part, at the same code points as in ISO-8859-1 (the uppercase Õ being noted as an exception). [2]
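The overlap with ISO-8859-1 can be checked with Python's built-in "latin-1" codec. The sketch below uses only a handful of assignments copied from the character set table, not the full encoding, and the helper name `same_as_latin1` is an assumption for illustration.

```python
# A sample of VISCII assignments (byte value -> character) from the table.
VISCII_SAMPLE = {
    0xC0: "\u00c0",  # À
    0xC9: "\u00c9",  # É
    0xE0: "\u00e0",  # à
    0xF5: "\u00f5",  # õ
    0xFD: "\u00fd",  # ý
    0xA0: "\u00d5",  # Õ, the noted exception
    0xD5: "\u1ea1",  # ạ, which occupies the slot Õ has in ISO-8859-1
}

def same_as_latin1(byte_value: int, char: str) -> bool:
    """True if ISO-8859-1 puts the same character at this byte value."""
    return bytes([byte_value]).decode("latin-1") == char

shared = sorted(b for b, ch in VISCII_SAMPLE.items() if same_as_latin1(b, ch))
moved = sorted(b for b, ch in VISCII_SAMPLE.items() if not same_as_latin1(b, ch))

# Where ISO-8859-1 itself keeps uppercase Õ.
latin1_o_tilde = "\u00d5".encode("latin-1")[0]

print([hex(b) for b in shared], [hex(b) for b in moved], hex(latin1_o_tilde))
```

In this sample the lowercase õ stays at its ISO-8859-1 position while the uppercase Õ does not, which is the exception the designers noted.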

Support

VISCII is partially supported by the TriChlor Software Group in California, which has released various VISCII-compliant software packages, libraries, and fonts for MS-DOS and Windows, Unix, and Macintosh. VISCII-compliant software is available at many FTP sites.

VISCII was historically offered as an encoding for outgoing email by Mozilla Thunderbird. [8] It was also supported by the Windows Vietnamese keyboard software, WinVNKey, created by Christopher Cuong T. Nguyen and later upgraded through various Windows versions by Hoc D. Ngo and others.

VISCII was mostly used by overseas Vietnamese speakers, with VSCII (TCVN) being more popular in northern Vietnam and VNI being more popular in southern Vietnam. [9]

Character set

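VISCII
    0    1    2    3    4    5    6    7    8    9    A    B    C    D    E    F
0x  NUL  SOH  Ẳ    ETX  EOT  Ẵ    Ẫ    BEL  BS   HT   LF   VT   FF   CR   SO   SI
              1EB2           1EB4 1EAA
1x  DLE  DC1  DC2  DC3  Ỷ    NAK  SYN  ETB  CAN  Ỹ    SUB  ESC  FS   GS   Ỵ    US
                        1EF6                     1EF8                     1EF4
2x  SP   !    "    #    $    %    &    '    (    )    *    +    ,    -    .    /
3x  0    1    2    3    4    5    6    7    8    9    :    ;    <    =    >    ?
4x  @    A    B    C    D    E    F    G    H    I    J    K    L    M    N    O
5x  P    Q    R    S    T    U    V    W    X    Y    Z    [    \    ]    ^    _
6x  `    a    b    c    d    e    f    g    h    i    j    k    l    m    n    o
7x  p    q    r    s    t    u    v    w    x    y    z    {    |    }    ~    DEL
8x  Ạ    Ắ    Ằ    Ặ    Ấ    Ầ    Ẩ    Ậ    Ẽ    Ẹ    Ế    Ề    Ể    Ễ    Ệ    Ố
    1EA0 1EAE 1EB0 1EB6 1EA4 1EA6 1EA8 1EAC 1EBC 1EB8 1EBE 1EC0 1EC2 1EC4 1EC6 1ED0
9x  Ồ    Ổ    Ỗ    Ộ    Ợ    Ớ    Ờ    Ở    Ị    Ỏ    Ọ    Ỉ    Ủ    Ũ    Ụ    Ỳ
    1ED2 1ED4 1ED6 1ED8 1EE2 1EDA 1EDC 1EDE 1ECA 1ECE 1ECC 1EC8 1EE6 0168 1EE4 1EF2
Ax  Õ    ắ    ằ    ặ    ấ    ầ    ẩ    ậ    ẽ    ẹ    ế    ề    ể    ễ    ệ    ố
    00D5 1EAF 1EB1 1EB7 1EA5 1EA7 1EA9 1EAD 1EBD 1EB9 1EBF 1EC1 1EC3 1EC5 1EC7 1ED1
Bx  ồ    ổ    ỗ    Ỡ    Ơ    ộ    ờ    ở    ị    Ự    Ứ    Ừ    Ử    ơ    ớ    Ư
    1ED3 1ED5 1ED7 1EE0 01A0 1ED9 1EDD 1EDF 1ECB 1EF0 1EE8 1EEA 1EEC 01A1 1EDB 01AF
Cx  À    Á    Â    Ã    Ả    Ă    ẳ    ẵ    È    É    Ê    Ẻ    Ì    Í    Ĩ    ỳ
    00C0 00C1 00C2 00C3 1EA2 0102 1EB3 1EB5 00C8 00C9 00CA 1EBA 00CC 00CD 0128 1EF3
Dx  Đ    ứ    Ò    Ó    Ô    ạ    ỷ    ừ    ử    Ù    Ú    ỹ    ỵ    Ý    ỡ    ư
    0110 1EE9 00D2 00D3 00D4 1EA1 1EF7 1EEB 1EED 00D9 00DA 1EF9 1EF5 00DD 1EE1 01B0
Ex  à    á    â    ã    ả    ă    ữ    ẫ    è    é    ê    ẻ    ì    í    ĩ    ỉ
    00E0 00E1 00E2 00E3 1EA3 0103 1EEF 1EAB 00E8 00E9 00EA 1EBB 00EC 00ED 0129 1EC9
Fx  đ    ự    ò    ó    ô    õ    ỏ    ọ    ụ    ù    ú    ũ    ủ    ý    ợ    Ữ
    0111 1EF1 00F2 00F3 00F4 00F5 1ECF 1ECD 1EE5 00F9 00FA 0169 1EE7 00FD 1EE3 1EEE

The line under each row gives the hexadecimal Unicode code point of each non-ASCII character.
Differences from ISO-8859-1: every cell whose listed code point differs from its byte value.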
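Since legacy VISCII files may need conversion, the chart above can be turned into a lookup table. The following is a minimal sketch that transcribes the code points from the chart. The names `HIGH_HALF`, `C0_OVERRIDES`, `viscii_to_str` and `str_to_viscii` are made up for illustration, and established tools that already support VISCII are likely a better choice for real data.

```python
import unicodedata

# Code points for bytes 0x80-0xFF, one chart row (16 cells) per line.
HIGH_HALF = [
    0x1EA0, 0x1EAE, 0x1EB0, 0x1EB6, 0x1EA4, 0x1EA6, 0x1EA8, 0x1EAC, 0x1EBC, 0x1EB8, 0x1EBE, 0x1EC0, 0x1EC2, 0x1EC4, 0x1EC6, 0x1ED0,
    0x1ED2, 0x1ED4, 0x1ED6, 0x1ED8, 0x1EE2, 0x1EDA, 0x1EDC, 0x1EDE, 0x1ECA, 0x1ECE, 0x1ECC, 0x1EC8, 0x1EE6, 0x0168, 0x1EE4, 0x1EF2,
    0x00D5, 0x1EAF, 0x1EB1, 0x1EB7, 0x1EA5, 0x1EA7, 0x1EA9, 0x1EAD, 0x1EBD, 0x1EB9, 0x1EBF, 0x1EC1, 0x1EC3, 0x1EC5, 0x1EC7, 0x1ED1,
    0x1ED3, 0x1ED5, 0x1ED7, 0x1EE0, 0x01A0, 0x1ED9, 0x1EDD, 0x1EDF, 0x1ECB, 0x1EF0, 0x1EE8, 0x1EEA, 0x1EEC, 0x01A1, 0x1EDB, 0x01AF,
    0x00C0, 0x00C1, 0x00C2, 0x00C3, 0x1EA2, 0x0102, 0x1EB3, 0x1EB5, 0x00C8, 0x00C9, 0x00CA, 0x1EBA, 0x00CC, 0x00CD, 0x0128, 0x1EF3,
    0x0110, 0x1EE9, 0x00D2, 0x00D3, 0x00D4, 0x1EA1, 0x1EF7, 0x1EEB, 0x1EED, 0x00D9, 0x00DA, 0x1EF9, 0x1EF5, 0x00DD, 0x1EE1, 0x01B0,
    0x00E0, 0x00E1, 0x00E2, 0x00E3, 0x1EA3, 0x0103, 0x1EEF, 0x1EAB, 0x00E8, 0x00E9, 0x00EA, 0x1EBB, 0x00EC, 0x00ED, 0x0129, 0x1EC9,
    0x0111, 0x1EF1, 0x00F2, 0x00F3, 0x00F4, 0x00F5, 0x1ECF, 0x1ECD, 0x1EE5, 0x00F9, 0x00FA, 0x0169, 0x1EE7, 0x00FD, 0x1EE3, 0x1EEE,
]

# The six C0 positions that carry letters instead of control codes.
C0_OVERRIDES = {0x02: 0x1EB2, 0x05: 0x1EB4, 0x06: 0x1EAA,
                0x14: 0x1EF6, 0x19: 0x1EF8, 0x1E: 0x1EF4}

def build_decoding_table():
    """Return a 256-entry list mapping each byte value to one character."""
    table = [chr(b) for b in range(0x80)]  # start from plain ASCII
    for byte_value, code_point in C0_OVERRIDES.items():
        table[byte_value] = chr(code_point)
    table.extend(chr(cp) for cp in HIGH_HALF)
    return table

DECODE = build_decoding_table()
ENCODE = {ch: byte_value for byte_value, ch in enumerate(DECODE)}

def viscii_to_str(data: bytes) -> str:
    """Decode VISCII bytes; every byte value maps to exactly one character."""
    return "".join(DECODE[b] for b in data)

def str_to_viscii(text: str) -> bytes:
    """Encode text as VISCII; raises KeyError for characters outside the set."""
    # NFC first, so base letters plus combining marks become precomposed letters.
    return bytes(ENCODE[ch] for ch in unicodedata.normalize("NFC", text))

print(viscii_to_str(b"Vi\xaet").encode("utf-8"))
```

Note that the six overridden control codes cannot be represented in this mapping, so data that relied on them would not survive a round trip.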

References

  1. Phung, Quang; Ngo, Hoc D.; Bui, Cuong. "Vietnamese-Standard Working Group Home Page". Viet-Std Group. Retrieved 2019-08-23.
  2. Vietnamese Character Encoding Standardization Report - VISCII And VIQR 1.1 Character Encoding Specifications (Technical report). Viet-Std Group. 1992.
  3. "[news] TCVN 5712:1993 (VSCII) -- Vietnamese national standard". 1993-06-02. Archived from the original on 2017-01-11.
  4. Lunde, Ken (13 January 2009). "Chapter 1: CJKV Information Processing Overview (§ Are VISCII and VSCII identical? What about TCVN?)". CJKV Information Processing (2nd ed.). p. 17. ISBN 978-0-596-51447-1.
  5. Vietnamese Standardization Working Group (May 1993). Conventions for Encoding the Vietnamese Language. IETF. doi:10.17487/RFC1456. RFC 1456.
  6. "Character Sets". IANA.
  7. Sivonen, Henri (2014-09-26). "Character encoding changes in m-c require c-c action". mozilla.dev.apps.thunderbird.
  8. Sivonen, Henri (2014-09-26). "Character encoding changes in m-c require c-c action". mozilla.dev.apps.thunderbird. VISCII and armscii-8 are special in the sense that, for long time, Thunderbird itself (misguidedly) provided these encodings in the user interface for the choice of outgoing character encoding when composing a message. Therefore, it is possible that there exists a Thunderbird-created legacy of VISCII and armscii-8 email and Usenet posts.
  9. Ngo, Hoc Dinh; Tran, TuBinh. "5. Why Having Vietnamese Charset (Character Set – Encoding) Conversion?". Some special functions of WinVNKey.
