Tamil All Character Encoding

Tamil All Character Encoding (TACE16) is a scheme for encoding the Tamil script in the Private Use Area of Unicode, implementing a syllabary-based character model differing from the modified-ISCII model used by Unicode's existing Tamil implementation. [1] [2]

Keyboard drivers and fonts

The keyboard driver for this encoding scheme is available free of charge on the Tamil Virtual Academy website. [3] [4] It uses the Tamil 99 and Tamil Typewriter keyboard layouts, which are approved by the Government of Tamil Nadu, and maps input keystrokes to their corresponding characters in the TACE16 scheme. [2] Fonts for reading files created using TACE16 are available on the same website. [3] [4] These fonts provide glyphs not only for the TACE16 code points but also for the standard Unicode code points for ASCII and Tamil characters, giving backward compatibility with existing files created using the Unicode Tamil block.

Character set

All the characters of this encoding scheme are located in the private use area of the Basic Multilingual Plane of Unicode's Universal Coded Character Set.

Tamil All Character Encoding (TACE16) Character Set [5]
Vowels→ A Ā I Ī U Ū E Ē Ai O Ō Au (Miscellaneous)
Consonants
_0 _1 _2 _3 _4 _5 _6 _7 _8 _9 _A _B _C _D _E _F
(Symbols)U+E10_ராஜ
(Numbers)U+E18_
(Fractions)U+E1A_𑿌𑿐𑿑𑿓𑿅𑿉𑿎𑿄𑿈𑿋𑿍𑿏𑿀𑿁𑿂𑿆
U+E1F_ி
U+E20_
KU+E21_க்காகிகீகுகூகெகேகைகொகோகௌ
NgU+E22_ங்ஙாஙிஙீஙுஙூஙெஙேஙைஙொஙோஙௌ
CU+E23_ச்சாசிசீசுசூசெசேசைசொசோசௌ
ÑU+E24_ஞ்ஞாஞிஞீஞுஞூஞெஞேஞைஞொஞோஞௌ
U+E25_ட்டாடிடீடுடூடெடேடைடொடோடௌ
U+E26_ண்ணாணிணீணுணூணெணேணைணொணோணௌ
TU+E27_த்தாதிதீதுதூதெதேதைதொதோதௌ
NU+E28_ந்நாநிநீநுநூநெநேநைநொநோநௌ
PU+E29_ப்பாபிபீபுபூபெபேபைபொபோபௌ
MU+E2A_ம்மாமிமீமுமூமெமேமைமொமோமௌ
YU+E2B_ய்யாயியீயுயூயெயேயையொயோயௌ
RU+E2C_ர்ராரிரீருரூரெரேரைரொரோரௌ
LU+E2D_ல்லாலிலீலுலூலெலேலைலொலோலௌ
VU+E2E_வ்வாவிவீவுவூவெவேவைவொவோவௌ
U+E2F_ழ்ழாழிழீழுழூழெழேழைழொழோழௌ
U+E30_ள்ளாளிளீளுளூளெளேளைளொளோளௌ
U+E31_ற்றாறிறீறுறூறெறேறைறொறோறௌ
U+E32_ன்னானினீனுனூனெனேனைனொனோனௌ
Grantha characters
JU+E33_ஜ்ஜாஜிஜீஜுஜூஜெஜேஜைஜொஜோஜௌ
ShU+E34_ஶ்ஶாஶிஶீஶுஶூஶெஶேஶைஶொஶோஶௌ
U+E35_ஷ்ஷாஷிஷீஷுஷூஷெஷேஷைஷொஷோஷௌ
SU+E36_ஸ்ஸாஸிஸீஸுஸூஸெஸேஸைஸொஸோஸௌ
HU+E37_ஹ்ஹாஹிஹீஹுஹூஹெஹேஹைஹொஹோஹௌ
KṣU+E38_க்ஷ்க்ஷக்ஷாக்ஷிக்ஷீக்ஷுக்ஷூக்ஷெக்ஷேக்ஷைக்ஷொக்ஷோக்ஷௌஶ்ரீ
Legend:
Syllabograms with irregular glyphs, which inherently need to be handled individually by a font. [lower-alpha 1]
Newly added. Not present in Unicode version 6.3.
Corresponds to a character in the Tamil Supplement block, added in Unicode version 12 (2019)
Allocated for research (NLP)
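
The row-and-column layout of the chart suggests that a consonant syllable's code point can be split arithmetically. The following is a minimal sketch, not an official TACE16 API: the function name `split_tace16` is hypothetical, and it assumes that the upper digits select the consonant row (U+E21_ to U+E32_, in the order the rows appear above) and that the low hexadecimal digit selects the column, with column 0 holding the pure consonant.

```python
# Consonant rows U+E21_ .. U+E32_, in the order they appear in the chart.
CONSONANTS = "கஙசஞடணதநபமயரலவழளறன"

def split_tace16(code_point):
    """Return (base consonant, column index) for a consonant-row code point.

    Assumption: high digits select the consonant row and the low hex digit
    selects the column (0 = pure consonant, as the rows above suggest).
    """
    row, column = code_point >> 4, code_point & 0xF
    if not 0xE21 <= row <= 0xE32:
        raise ValueError(f"U+{code_point:04X} is outside the consonant rows")
    return CONSONANTS[row - 0xE21], column

first_row = split_tace16(0xE210)   # first cell of the K row
last_row = split_tace16(0xE32C)    # column C of the last consonant row
```

Under this layout no lookup table or shaping logic is needed to recover the consonant: a shift and a mask suffice.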

Comparison of TACE16 to present Tamil Unicode

Criticism of the standard Unicode character model for Tamil

Unicode's encoding models for Devanagari, Tamil, Kannada, Sinhala and emoji require use of the invisible zero-width joiner and zero-width non-joiner characters.

The existing Unicode character model for Tamil is, like most of Indic Unicode, [lower-alpha 2] an abugida-based model derived from ISCII. It has been criticized for several reasons. [1]

Unicode represents only 31 Tamil base characters as single code points, out of 247 grapheme clusters. These include stand-alone vowels, and 23 basic consonant glyphs (which, due to not bearing a virama, nonetheless denote a syllable with both a consonant and a vowel when used on their own). The others are represented as sequences of code points, requiring software support for advanced typography features (such as Apple Advanced Typography, Graphite, or OpenType advanced typography) to render correctly. This also requires the use of invisible zero-width joiner and zero-width non-joiner characters in places where the desired grapheme cluster would otherwise be ambiguous. This complexity can result in security vulnerabilities and ambiguous combinations, can require the use of an exception table to forbid invalid combinations of code points, and can necessitate the use of string normalization to compare two strings for equality.
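
The normalization point can be illustrated with Python's standard `unicodedata` module. This is a small sketch rather than a complete comparison routine; the helper name `same_text` is chosen for illustration. The syllable "ko" can be stored with the single vowel sign U+0BCA or with the canonically equivalent pair U+0BC6 U+0BBE, so a naive comparison of the two strings fails until both are normalized.

```python
import unicodedata

composed = "\u0b95\u0bca"          # KA + VOWEL SIGN O (one vowel-sign code point)
decomposed = "\u0b95\u0bc6\u0bbe"  # KA + VOWEL SIGN E + VOWEL SIGN AA

def same_text(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

naive_equal = composed == decomposed            # False: code points differ
normalized_equal = same_text(composed, decomposed)  # True: canonically equivalent
lengths = (len(composed), len(decomposed))      # (2, 3) code points
```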

Additionally, since syllables with both a consonant and a vowel form 64 to 70% of Tamil text, an abugida-based model which encodes the consonant and vowel parts as separate code points is inefficient, in terms of how long a string needs to be to contain a given piece of text, in comparison with a syllabary-based model.
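
As a rough illustration of the size argument, the sketch below compares UTF-16 storage for three syllables in Unicode Tamil against a syllabary model that spends one Basic Multilingual Plane code point per syllable. U+E000 is used purely as a placeholder private-use code point, not as an actual TACE16 assignment, and the helper `utf16_bytes` is a name chosen for this example.

```python
def utf16_bytes(text):
    """Number of bytes the text occupies in UTF-16 (no byte-order mark)."""
    return len(text.encode("utf-16-le"))

# Three syllables in Unicode Tamil: each is a consonant plus one combining sign.
syllables = ["\u0b95\u0bbe", "\u0bae\u0bcb", "\u0b95\u0bcd"]

unicode_size = sum(utf16_bytes(s) for s in syllables)

# Syllabary model: one BMP code point per syllable (placeholder U+E000).
syllabary_size = sum(utf16_bytes("\ue000") for _ in syllables)
```

Here the Unicode Tamil form takes 12 bytes and the one-code-point-per-syllable form takes 6; conjuncts such as the Kṣ row, which need three or four code points in Unicode Tamil, widen the gap further.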

Furthermore, ISCII is primarily an encoding of Devanagari, and the ISCII encodings of other Brahmic scripts (including Tamil) encode characters over the code points of the corresponding characters in Devanagari ISCII. Although Unicode encodes the Brahmic scripts separately from one another, the Tamil block mirrors the ISCII layout (with Devanagari-style character ordering, and reserved space in positions corresponding to Devanagari characters with no Tamil equivalent); consequently, the characters are not in the natural sequence order, and strings collated by code point (analogous to "ASCIIbetical" sorting of English text) will not produce the expected sorting order. It requires a complex collation algorithm for arranging them in the natural order.
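
The ordering problem can be seen with single letters. The sketch below assumes the traditional consonant order is the row order of the TACE16 chart; the names `TRADITIONAL` and `rank` are illustrative. It handles only isolated consonants, whereas real collation of Unicode Tamil strings would normally rely on a full collation algorithm.

```python
# Traditional consonant order, taken from the row order of the TACE16 chart.
TRADITIONAL = "கஙசஞடணதநபமயரலவழளறன"
rank = {letter: position for position, letter in enumerate(TRADITIONAL)}

letters = ["\u0ba9", "\u0baa", "\u0bb4", "\u0bb2"]  # NNNA, PA, LLLA, LA

# Sorting by code point follows the ISCII-derived block layout.
by_code_point = sorted(letters)

# Sorting by traditional rank needs an explicit lookup table.
by_tradition = sorted(letters, key=rank.__getitem__)
```

Code-point order places ன (U+0BA9) first because it sits next to ந in the block, whereas it is the last consonant in the traditional sequence.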

TACE16 in comparison

The following data provides a comparison of current Unicode Tamil vs. TACE16 on e-governance and browsing: [1]

TACE16 provides improvements in both processing time and storage space. It covers all of the characters used in general Tamil text; its code points follow the natural sequence order; and it is unambiguous, with each code point corresponding to exactly one character. [1] The TACE16 system takes fewer instruction cycles than Unicode Tamil, and also allows programming based directly on Tamil grammar, which in Unicode Tamil requires additional framework development.

Responses by the Unicode Consortium

The Unicode Consortium publishes a dedicated FAQ page on the Tamil script which responds to some of the criticisms. In defence of the ISCII model, the Consortium notes that expert linguists, typographers and programmers were involved in its development, but acknowledges that compromises were made due to ISCII being constrained to single-byte extended ASCII. The Consortium points out that Unicode Tamil is now implemented by all major operating systems and web browsers, and maintains that it should be used in open interchange contexts, such as online, since tools such as search engines would not necessarily be able to identify or interpret a sequence of Unicode private-use code points as Tamil text. However, the Consortium does not object to the use of Private-Use Area schemes, including TACE16, internally to particular processes for which they are useful. In particular, it highlights that both markup schemes and alternative encoding schemes may be used by researchers for specialised purposes such as natural-language processing. [6]

Unicode defines normative named-sequences for all Tamil pure consonants and syllables which are represented with sequences of more than one code point, and a dedicated table is published as part of the Unicode Standard listing all of these sequences, in their traditional order, along with their correct glyphs. The Consortium points out that it has been open to accepting proposals for characters for which no existing Unicode representation exists: for example, adding several historical fractions and other symbols as the Tamil Supplement block in version 12.0 in 2019. [6]
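
These named sequences can be queried programmatically. The sketch below assumes CPython 3.3 or later, where `unicodedata.lookup` accepts named sequences as well as character names; the variable names are illustrative.

```python
import unicodedata

# A named sequence resolves to a multi-code-point string, not a single character.
moo = unicodedata.lookup("TAMIL SYLLABLE MOO")
kss = unicodedata.lookup("TAMIL CONSONANT KSS")

moo_points = [f"U+{ord(ch):04X}" for ch in moo]
kss_points = [f"U+{ord(ch):04X}" for ch in kss]
```

The first lookup yields two code points (U+0BAE U+0BCB) and the second four (U+0B95 U+0BCD U+0BB7 U+0BCD), which is the same multi-code-point representation that the criticism above is directed at.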

Regarding collation, the Consortium argues that obtaining the correct result from sorting by code point is the exception rather than the rule, highlighting that, in unmodified ASCIIbetical ordering, the uppercase Latin letter Z sorts before the lowercase letter a, and also highlighting that collation rules often differ by language (see e.g. ö). Regarding space efficiency, the Consortium argues that storage space and bandwidth taken up by text is usually far overshadowed by other accompanying media such as images and video, and that text content performs well under general-purpose compression methods such as Deflate (originally from the ZIP file format, standardized in RFC 1951 and integrated in the HTTP protocol as a generic encoding scheme). [6]
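
The compression argument can be tried with Python's standard `zlib` module, which wraps a Deflate stream. This is only a sketch: the sample string is an arbitrary, highly repetitive phrase chosen for illustration, so the ratio it produces overstates what ordinary prose would achieve.

```python
import zlib

# Arbitrary repetitive sample; real text compresses less dramatically.
sample = "தமிழ் எழுத்துரு " * 100

raw = sample.encode("utf-8")
packed = zlib.compress(raw)   # zlib container around a Deflate stream
ratio = len(packed) / len(raw)

# Decompression restores the original text exactly.
restored = zlib.decompress(packed).decode("utf-8")
```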

Unicode Stability Policy

When first published (version 1.0.0), Unicode made only limited stability guarantees. As such, the original Tibetan block was deleted in version 1.0.1 (and its space has since been occupied by the Myanmar block), and the original block for Korean syllables was deleted in version 2.0 (and is now occupied by CJK Unified Ideographs Extension A). Both the current Hangul Syllables block for Korean syllables, and the current Tibetan block, date back to Unicode 2.0. These changes were made on the assumption that little or no content using Unicode for those writing systems yet existed, [7] since they broke compatibility with any existing Unicode content in, and input methods for, those writing systems. After what became known as the "Korean mess", the responsible committees pledged never to make such a compatibility-breaking change again, [7] a pledge which now forms part of the Unicode Stability Policy. [8]

This stability policy has been upheld ever since, in spite of demands to re-encode or change the character model for both Tibetan and Korean a second time, made by China and North Korea respectively. [9] [10] [11] [12] Likewise in relation to Tamil, the Consortium emphasises the "crucial issue of maintaining the stability of the standard for existing implementations", and argues that "the resulting costs and impact of destabilizing the standard" would substantially outweigh any efficiency benefits in processing speed or storage space. [6]

A proposal to re-encode Tamil [13] was rejected by the Unicode Consortium, which stated that re-encoding would be damaging and that there was no convincing evidence that the Unicode Tamil encoding is deficient. [14]

Alternatives

Open-Tamil

The Open-Tamil project [15] provides many common Tamil text-processing operations. It claims Level-1 compliance for Tamil text processing without using TACE16, but it is built on top of the extra programming logic that Unicode Tamil requires.

Footnotes

  1. Highlighted syllabograms in the U and Ū columns are those where the vowel portion of the glyph matches neither the simple subjoining forms shown for those combining vowel marks in the Unicode block chart, nor the right-joining Grantha forms (as used for those combining vowel marks in isolation by, for example, Noto fonts).
  2. Except for Tibetan, which uses a different model, and for Thai and related scripts, which use a model derived from TIS-620.

References

  1. REPORT ON THE FINAL RECOMMENDATIONS OF THE TASK FORCE ON TACE16 (PDF) (Report).
  2. "TENDER DOCUMENT for Development of Tamil Fonts and Tamil Keyboard driver for 16-bit encodings (Unicode and TACE16)" (PDF). Tamil Virtual Academy.
  3. "தமிழ் எழுத்துருக்கள்" [Tamil Fonts]. தமிழ் இணையக் கல்விக்கழகம் TAMIL VIRTUAL ACADEMY.
  4. Tamil Nadu Government's Order (G.O.), Keyboard Drivers and Fonts. Archived 27 December 2023 at archive.today.
  5. Tamil Virtual Academy. "Annexure 4: Typewriter Extended Keyboard Sequence for Unicode and TACE16" (PDF). Tender Document for Development of Tamil Fonts and Tamil Keyboard driver for 16-bit encodings (Unicode and TACE16). Chennai.
  6. "FAQ - Tamil Language and Script". Unicode Consortium.
  7. Yergeau, F. (1998). UTF-8, a transformation format of ISO 10646. IETF. doi:10.17487/rfc2279. RFC 2279.
  8. "Unicode Character Encoding Stability Policies". Unicode Consortium.
  9. West, Andrew (2006-09-14). "Precomposed Tibetan Part 1 : BrdaRten". BabelStone.
  10. China National Body (2003-10-20). "China's Statement of BrdaRten ad hoc". ISO/IEC JTC1/SC2/WG2 N2674.
  11. Karlsson, Kent (2000-03-02). "Comments on DPRK New Work Item proposal on Korean characters". ISO/IEC JTC1/SC2/WG2 N2167.
  12. Cho, Chun-Hui (2000-07-05). "DPRK letter on character names and ordering in 10646-1: 2000" (PDF). ISO/IEC JTC1/SC2/WG2 N2231.
  13. Anantham, A.R.Amaithi (2012-01-26). "Fresh Encoding Proposals" (PDF). Unicode.
  14. "Archive of Notices of Non-Approval". Unicode. 2012-03-05.
  15. Annamalai, M.; Arulalan, T. Open-Tamil: Tamil language text processing tools for Python v3. Retrieved 2023-12-31.