Zawgyi font

Zawgyi
Foundry: Arthouse (Mandalay)
Date released: 4 December 2007

Zawgyi is a typeface widely used for Burmese-language text on websites. It is also known as Zawgyi-One or zawgyi1, although updated versions of the font were not named Zawgyi-Two. Prior to 2019, it was the most popular font on Burmese websites.

The font assigns its Burmese characters to code points in the Burmese block of Unicode, but does so in a way that does not comply with the Unicode Standard.

Unicode incompatibility (ad hoc font encodings)

Burmese script requires complex text layout: the positions and shapes of its graphemes vary with context. Support for complex text rendering on personal computers did not arrive until Windows XP Service Pack 2 in 2004, and a Burmese font utilizing this technology did not exist until 2005. [1] [2] Furthermore, there were significant revisions in Unicode's implementation of Burmese script up until Unicode 5.1 in 2008. [3] Compounded by Western sanctions on Myanmar, [4] this resulted in much of the Burmese localization technology being developed locally without external cooperation.

Numerous attempts at creating fonts with Burmese support were made in the 2000s, but they were developed as Unicode fonts that were only partially Unicode compliant. [2] Some of the codepoints for Burmese script were implemented as specified in Unicode, but others were not. Therefore, these fonts became incompatible with Unicode. [5] This is referred to as ad hoc font encodings by the Unicode Consortium. [6] With the advent of mobile phones, manufacturers such as Samsung and Huawei simply replaced the Unicode compliant Burmese system fonts with their Zawgyi equivalents. [1]

There are significant shortcomings in using ad hoc font encodings. Because Zawgyi is in effect a separate encoding, text written in one system appears garbled to users of the other. [7] Because the Zawgyi font encoding was not implemented as efficiently as specified in Unicode, [6] it had to occupy more codepoints than are allocated for Burmese. As a result, the Zawgyi encoding took over codepoints reserved for the minority languages of Myanmar. [1] [2] In Zawgyi, the same word can be encoded in multiple different ways, making a Zawgyi text corpus difficult to search and analyze. [8] Zawgyi text is also difficult to sort. [8] In addition, using Unicode would ease the implementation of natural language processing technologies. [2]
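The search problem can be illustrated with a minimal sketch. It assumes, purely for illustration, that a font draws the vowel signs U+102D and U+102F identically whichever one is stored first, so two visually identical words can differ as codepoint sequences. The function name `canonicalize_marks` and the chosen ordering are hypothetical and are not part of Zawgyi, Unicode, or any library.

```python
# Illustrative only: the set of marks and the ordering rule are assumptions.
VOWEL_I = "\u102d"  # MYANMAR VOWEL SIGN I
VOWEL_U = "\u102f"  # MYANMAR VOWEL SIGN U
MARK_ORDER = {VOWEL_I: 0, VOWEL_U: 1}


def canonicalize_marks(text: str) -> str:
    """Sort each run of the listed marks into one fixed order."""
    out = []
    run = []
    for ch in text:
        if ch in MARK_ORDER:
            run.append(ch)
        else:
            # A non-mark character ends the current run of marks.
            out.extend(sorted(run, key=MARK_ORDER.__getitem__))
            run.clear()
            out.append(ch)
    out.extend(sorted(run, key=MARK_ORDER.__getitem__))
    return "".join(out)


# Two spellings that differ only in the stored order of the marks.
spelling_a = "\u1000" + VOWEL_I + VOWEL_U
spelling_b = "\u1000" + VOWEL_U + VOWEL_I

print(spelling_a == spelling_b)  # False
print(canonicalize_marks(spelling_a) == canonicalize_marks(spelling_b))  # True
```

A plain string comparison or substring search treats the two spellings as different words. A corpus that permits both has to be canonicalized along these lines before it can be searched or sorted reliably.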

The Myanmar government designated October 1, 2019 as "U-Day" to officially switch to Unicode. [4] The full transition was expected by some to take two years. [9]

Unicode's locale data specification (LDML) uses the private-use script code Qaag to identify text written in Zawgyi. [10]
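As a rough sketch of how such tagging might be consumed, the following checks whether a language tag such as `my-Qaag` carries the Qaag script subtag. The helper names `script_subtag` and `is_zawgyi_tagged` are hypothetical, and the parsing is a simplification of BCP 47 rather than a full implementation.

```python
from typing import Optional

ZAWGYI_SCRIPT = "Qaag"  # script code named in the text above


def script_subtag(tag: str) -> Optional[str]:
    """Return the four-letter script subtag of a language tag, if any."""
    parts = tag.replace("_", "-").split("-")
    for part in parts[1:]:
        if len(part) == 1:
            # A singleton starts an extension or private-use sequence.
            break
        if len(part) == 4 and part.isascii() and part.isalpha():
            return part.title()
    return None


def is_zawgyi_tagged(tag: str) -> bool:
    """True if the tag marks its content as Zawgyi-encoded."""
    return script_subtag(tag) == ZAWGYI_SCRIPT


print(is_zawgyi_tagged("my-Qaag"))     # True
print(is_zawgyi_tagged("my-Mymr-MM"))  # False
```

Content tagged this way can be routed to a converter, while content tagged `my` or `my-Mymr` is left untouched.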

Conversion

International Components for Unicode supports conversion of Zawgyi-encoded data to conformant Unicode by means of the Zawgyi-my transliterator. [11]
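A minimal sketch of calling this transliterator from Python follows. It assumes the PyICU bindings are installed and linked against an ICU build that ships the `Zawgyi-my` transform; the wrapper names `load_zawgyi_transliterator` and `zawgyi_to_unicode` are hypothetical.

```python
from typing import Any, Optional


def load_zawgyi_transliterator() -> Optional[Any]:
    """Return ICU's Zawgyi-my transliterator, or None if unavailable."""
    try:
        import icu  # PyICU; assumed to be installed
    except ImportError:
        return None
    try:
        return icu.Transliterator.createInstance("Zawgyi-my")
    except icu.ICUError:
        # The linked ICU build may not include this transform.
        return None


def zawgyi_to_unicode(text: str, translit: Optional[Any] = None) -> str:
    """Convert Zawgyi-encoded text to conformant Unicode."""
    if translit is None:
        translit = load_zawgyi_transliterator()
    if translit is None:
        raise RuntimeError("ICU Zawgyi-my transliterator is not available")
    return translit.transliterate(text)
```

The transliterator converts whatever it is given, so it should be applied only to text already known or detected to be Zawgyi. Passing the transliterator as a parameter lets one instance be reused across many strings.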

References

  1. Hotchkiss, Griffin (March 23, 2016). "Battle of the fonts". Frontier Myanmar. Retrieved 24 December 2019. With the release of Windows XP service pack 2, complex scripts were supported, which made it possible for Windows to render a Unicode-compliant Burmese font such as Myanmar1 (released in 2005). ... Myazedi, BIT, and later Zawgyi, circumscribed the rendering problem by adding extra code points that were reserved for Myanmar's ethnic languages. Not only does the re-mapping prevent future ethnic language support, it also results in a typing system that can be confusing and inefficient, even for experienced users. ... Huawei and Samsung, the two most popular smartphone brands in Myanmar, are motivated only by capturing the largest market share, which means they support Zawgyi out of the box.
  2. Sin, Thant (7 September 2019). "Unified under one font system as Myanmar prepares to migrate from Zawgyi to Unicode". Rising Voices. Retrieved 24 December 2019. Standard Myanmar Unicode fonts were never mainstreamed unlike the private and partially Unicode compliant Zawgyi font. ... Unicode will improve natural language processing
  3. Hosken, Martin (January 25, 2007). "Representing Myanmar in Unicode" (PDF). Unicode Consortium. Retrieved 24 December 2019.
  4. "Unicode in, Zawgyi out: Modernity finally catches up in Myanmar's digital world". The Japan Times. Sep 27, 2019. Archived from the original on 22 August 2020. Retrieved 24 December 2019. Oct. 1 is "U-Day," when Myanmar officially will adopt the new system. ... Microsoft and Apple helped other countries standardize years ago, but Western sanctions meant Myanmar lost out.
  5. "Why Unicode is Needed". Google Code: Zawgyi Project. Retrieved 31 October 2013.
  6. "Myanmar Scripts and Languages". Frequently Asked Questions. Unicode Consortium. Retrieved 24 December 2019. "UTF-8" technically does not apply to ad hoc font encodings such as Zawgyi.
  7. LaGrow, Nick; Pruzan, Miri (Sep 26, 2019). "Integrating autoconversion: Facebook's path from Zawgyi to Unicode". Facebook Engineering. Facebook. Retrieved 25 December 2019. It makes communication on digital platforms difficult, as content written in Unicode appears garbled to Zawgyi users and vice versa. ... In order to better reach their audiences, content producers in Myanmar often post in both Zawgyi and Unicode in a single post, not to mention English or other languages.
  8. Watkins, Justin (Nov 2, 2016). "Why we should stop Zawgyi in its tracks. It harms others and ourselves. Use Unicode!" (PDF). SOAS, University of London. Retrieved 24 December 2019. (1) Use of Zawgyi encroaches on the opportunities for other languages of Myanmar to develop in electronic form - Unicode does not! (2) Zawgyi does not conform to international computing standards - Unicode does! (3) Zawgyi cannot sort correctly: useless for storing data - Unicode can be used for anything! (4) Can store the same word in several different ways: useless for searching, processing, analysing text - Unicode can be used for anything
  9. Saw Yi Nanda (21 Nov 2019). "Myanmar switch to Unicode to take two years: app developer". The Myanmar Times. Retrieved 24 December 2019.
  10. Davis, Mark (2023-10-25). "Unicode Locale Data Markup Language (LDML)". unicode.org. Retrieved 11 December 2023. Qaag is a special script code for identifying the non-standard use of Myanmar characters for display with the Zawgyi font. The purpose of the code is to enable migration to standard, interoperable use of Unicode by providing an identifier for Zawgyi for tagging text, applications, input methods, font tables, transformations, and other mechanisms used for migration.
  11. "Myanmar Tools Python Documentation". Google, LLC.