Zawgyi font

Zawgyi
Foundry: Arthouse (Mandalay)
Date released: 4 December 2007

Zawgyi font [lower-alpha 1] is a typeface widely used for Burmese-language text on websites. It represents Burmese script with code points from the Myanmar Unicode block, but assigns and orders them in a way that does not comply with the Unicode Standard. Prior to 2019, it was the most popular font on Burmese websites.

Unicode incompatibility

Encoding formats of ကြော့ in Zawgyi (top) and Unicode (bottom). In normal Unicode rendering, the codepoint sequence on the top renders as ေၾကာ့ instead.
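The difference described in the caption can be inspected by listing the code points of both sequences. The following is a minimal sketch, not a reference implementation: the Unicode sequence is the standard spelling of the syllable, while the Zawgyi sequence is an assumption based on how this syllable is commonly encoded in Zawgyi. The helper name `describe` is hypothetical.

```python
import unicodedata

# Standard Unicode spelling: stored in logical order, consonant first.
UNICODE_SEQ = "\u1000\u103c\u1031\u102c\u1037"
# Assumed Zawgyi spelling: stored in visual order, vowel sign E first.
# U+107E is a code point that Unicode assigns to a Shan letter.
ZAWGYI_SEQ = "\u1031\u107e\u1000\u102c\u1037"


def describe(text: str) -> list[str]:
    """Return one 'U+XXXX NAME' string per code point in text."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"
        for ch in text
    ]


if __name__ == "__main__":
    for label, seq in (("Unicode", UNICODE_SEQ), ("Zawgyi", ZAWGYI_SEQ)):
        print(label)
        for line in describe(seq):
            print("  " + line)
```

Both sequences contain five code points, but the Zawgyi sequence begins with the vowel sign U+1031, whereas the Unicode sequence begins with the consonant U+1000.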

The Burmese script is a complex text layout script: the positions and shapes of its graphemes vary with context. Support for complex text rendering on personal computers did not arrive until Windows XP Service Pack 2 in 2004, and a Burmese font using this technology did not exist until 2005. [1] [2] Furthermore, Unicode's implementation of the Burmese script underwent significant revisions up until Unicode 5.1 in 2008. [3] Combined with Western sanctions on Myanmar, this meant that much Burmese localization technology was developed locally, without external cooperation. [4]

Numerous attempts at creating fonts with Burmese support were made in the 2000s, but the resulting fonts were only partially Unicode compliant. [2] Some code points for the Burmese script were implemented as specified in Unicode, but others were not, which made these fonts incompatible with Unicode. [5] The Unicode Consortium refers to such schemes as ad hoc font encodings. [6] With the advent of mobile phones, manufacturers such as Samsung and Huawei simply replaced the Unicode-compliant Burmese system fonts with their Zawgyi equivalents. [1]

Ad hoc font encodings have significant shortcomings. Because Zawgyi is effectively a separate encoding, text written in it appears garbled to Unicode users, and vice versa. [7] Because the Zawgyi encoding is less efficient than the one specified in Unicode, it needs more code points than are allocated for Burmese. [6] As a result, Zawgyi took over code points in the Unicode block that are reserved for minority languages of Myanmar. [1] [2] In Zawgyi, the same word can be encoded in several different ways, which makes a Zawgyi text corpus difficult to search and analyze. Zawgyi text is also difficult to sort. [8] In addition, using Unicode would ease the implementation of natural language processing technologies. [2]
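Two of the properties described above can be turned into a rough check for Zawgyi-encoded text. The sketch below is a hypothetical heuristic, not the method used by any production detector: it assumes the input is Burmese-language text and flags it when a word starts with the vowel sign U+1031 (visual order) or when it uses code points in the range U+1060 to U+1097, which Unicode assigns to other languages of Myanmar. The function name `looks_like_zawgyi` and the chosen range are assumptions for illustration.

```python
import re

# In standard Unicode order the vowel sign E (U+1031) follows its consonant,
# so a U+1031 at the start of a word suggests visual-order (Zawgyi) text.
LEADING_VOWEL_E = re.compile(r"(?:^|\s)\u1031")

# Assumed range: code points that Unicode assigns to other languages of
# Myanmar and that standard Burmese-language text would not normally use.
EXTENDED_RANGE = re.compile(r"[\u1060-\u1097]")


def looks_like_zawgyi(text: str) -> bool:
    """Heuristically guess whether Burmese-language text is Zawgyi-encoded."""
    return bool(LEADING_VOWEL_E.search(text) or EXTENDED_RANGE.search(text))


if __name__ == "__main__":
    print(looks_like_zawgyi("\u1031\u107e\u1000\u102c\u1037"))  # assumed Zawgyi spelling
    print(looks_like_zawgyi("\u1000\u103c\u1031\u102c\u1037"))  # standard Unicode spelling
```

A rule-based check like this misclassifies text in Shan, Karen, or other languages that legitimately use the extended range, so it is only suitable where the language is already known to be Burmese.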

The Myanmar government designated 1 October 2019 as "U-Day" to officially switch to Unicode. [4] The full transition was expected by some to take two years. [9]

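Unicode's Locale Data Markup Language (LDML) uses the private-use script code Qaag to identify text written in Zawgyi, so that text, applications, input methods, and fonts can be tagged during migration. [10]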
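As an illustration of how such a script code might be used, the sketch below extracts the script subtag from a language tag and checks whether it is Qaag. The tag `my-Qaag`, the helper names, and the simplified parsing are assumptions for illustration; the parsing does not implement the full language tag grammar.

```python
from typing import Optional

ZAWGYI_SCRIPT = "Qaag"  # script code named in the text above


def script_subtag(tag: str) -> Optional[str]:
    """Return the four-letter script subtag of a language tag, if present.

    Simplified: accepts '-' or '_' separators and stops at the first
    single-character subtag, which introduces an extension or private use.
    """
    parts = tag.replace("_", "-").split("-")
    for part in parts[1:]:
        if len(part) == 1:
            break
        if len(part) == 4 and part.isalpha():
            return part.title()
    return None


def is_zawgyi_tagged(tag: str) -> bool:
    """Report whether a language tag marks its text as Zawgyi-encoded."""
    return script_subtag(tag) == ZAWGYI_SCRIPT


if __name__ == "__main__":
    for tag in ("my-Qaag", "my-Mymr", "my"):
        print(tag, is_zawgyi_tagged(tag))
```

Tagging content this way lets a pipeline decide whether a conversion step is needed without inspecting the text itself.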

Conversion

International Components for Unicode supports conversion of Zawgyi-encoded data to conformant Unicode by means of the Zawgyi-my transliterator. [11]
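A conversion along these lines might look as follows in Python. This is a sketch under stated assumptions: it relies on PyICU, a third-party binding for ICU that is not mentioned in the text above, and on an ICU build that includes the Zawgyi-my transform. When either is missing, the hypothetical helper `zawgyi_to_unicode` returns None instead of raising.

```python
from typing import Optional

try:
    import icu  # PyICU, a third-party binding for ICU (assumed installed)
except ImportError:
    icu = None


def zawgyi_to_unicode(text: str) -> Optional[str]:
    """Convert Zawgyi-encoded text using ICU's Zawgyi-my transliterator.

    Returns None when PyICU or the transliterator is unavailable.
    """
    if icu is None:
        return None
    try:
        converter = icu.Transliterator.createInstance("Zawgyi-my")
    except icu.ICUError:
        # The installed ICU build does not provide this transform.
        return None
    return converter.transliterate(text)


if __name__ == "__main__":
    # Assumed Zawgyi spelling of the syllable shown in the figure caption.
    print(zawgyi_to_unicode("\u1031\u107e\u1000\u102c\u1037"))
```

The transliterator converts whatever it is given, so it should only be applied to text already known or detected to be Zawgyi; running it over standard Unicode text may corrupt it.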

Notes

  1. The first version of the typeface is known as Zawgyi-One or zawgyi1.

References

  1. Hotchkiss, Griffin (23 March 2016). "Battle of the fonts". Frontier Myanmar. Retrieved 24 December 2019. With the release of Windows XP service pack 2, complex scripts were supported, which made it possible for Windows to render a Unicode-compliant Burmese font such as Myanmar1 (released in 2005). [...] Myazedi, BIT, and later Zawgyi, circumscribed the rendering problem by adding extra code points that were reserved for Myanmar's ethnic languages. Not only does the re-mapping prevent future ethnic language support, it also results in a typing system that can be confusing and inefficient, even for experienced users. [...] Huawei and Samsung, the two most popular smartphone brands in Myanmar, are motivated only by capturing the largest market share, which means they support Zawgyi out of the box.
  2. Sin, Thant (7 September 2019). "Unified under one font system as Myanmar prepares to migrate from Zawgyi to Unicode". Rising Voices. Retrieved 24 December 2019. Standard Myanmar Unicode fonts were never mainstreamed unlike the private and partially Unicode compliant Zawgyi font. [...] Unicode will improve natural language processing
  3. Hosken, Martin (25 January 2007). "Representing Myanmar in Unicode" (PDF). Unicode Consortium. Retrieved 24 December 2019.
  4. "Unicode in, Zawgyi out: Modernity finally catches up in Myanmar's digital world". The Japan Times. 27 September 2019. Archived from the original on 22 August 2020. Retrieved 24 December 2019. Oct. 1 is "U-Day," when Myanmar officially will adopt the new system. [...] Microsoft and Apple helped other countries standardize years ago, but Western sanctions meant Myanmar lost out.
  5. "Why Unicode is Needed". Google Code: Zawgyi Project. Retrieved 31 October 2013.
  6. "Myanmar Scripts and Languages". Frequently Asked Questions. Unicode Consortium. Retrieved 24 December 2019. "UTF-8" technically does not apply to ad hoc font encodings such as Zawgyi.
  7. LaGrow, Nick; Pruzan, Miri (26 September 2019). "Integrating autoconversion: Facebook's path from Zawgyi to Unicode - Facebook Engineering". Facebook Engineering. Facebook. Retrieved 25 December 2019. It makes communication on digital platforms difficult, as content written in Unicode appears garbled to Zawgyi users and vice versa. [...] In order to better reach their audiences, content producers in Myanmar often post in both Zawgyi and Unicode in a single post, not to mention English or other languages.
  8. Watkins, Justin (2 November 2016). "Why we should stop Zawgyi in its tracks. It harms others and ourselves. Use Unicode!" (PDF). SOAS, University of London. Retrieved 24 December 2019. (1) Use of Zawgyi encroaches on the opportunities for other languages of Myanmar to develop in electronic form – Unicode does not! (2) Zawgyi does not conform to international computing standards – Unicode does! (3) Zawgyi cannot sort correctly: useless for storing data – Unicode can be used for anything! (4) Can store the same word in several different ways: useless for searching, processing, analysing text – Unicode can be used for anything
  9. Saw Yi Nanda (21 November 2019). "Myanmar switch to Unicode to take two years: app developer". The Myanmar Times. Retrieved 24 December 2019.
  10. Davis, Mark (25 October 2023). "Unicode Locale Data Markup Language (LDML)". unicode.org. Retrieved 11 December 2023. Qaag is a special script code for identifying the non-standard use of Myanmar characters for display with the Zawgyi font. The purpose of the code is to enable migration to standard, interoperable use of Unicode by providing an identifier for Zawgyi for tagging text, applications, input methods, font tables, transformations, and other mechanisms used for migration.
  11. "Myanmar Tools Python Documentation". Google, LLC.