Unicode, Inc. | |
---|---|
Formation | January 3, 1991 |
Founded at | California, US |
Type | Non-profit consortium |
Tax ID no. | 77-0269756 [1] |
Legal status | 501(c)(3) [1] California nonprofit benefit corporation |
Purpose | "To develop, extend and promote use of various standards, data, and open source software libraries which specify the representation of text in modern software[,] ... allowing data to be shared across multiple platforms, languages and countries without corruption" [2] |
Coordinates | 37°24′42″N 122°04′15″W / 37.411759°N 122.070958°W |
Revenue (2018) | $467,576 [2] |
Expenses (2018) | $470,257 [2] |
Employees (2018) | 3 [2] |
Volunteers (2018) | 10 [2] |
Website | home |
The Unicode Consortium (legally Unicode, Inc.) is a 501(c)(3) non-profit organization incorporated and based in Mountain View, California, U.S. [4] Its primary purpose is to maintain and publish the Unicode Standard, which was developed to replace existing character encoding schemes that are limited in size and scope and are incompatible with multilingual environments.
The consortium describes its overall purpose as:
...enabl[ing] people around the world to use computers in any language, by providing freely-available specifications and data to form the foundation for software internationalization in all major operating systems, search engines, applications, and the World Wide Web. An essential part of this purpose is to standardize, maintain, educate and engage academic and scientific communities, and the general public about, make publicly available, promote, and disseminate to the public a standard character encoding that provides for an allocation for more than a million characters. [5]
Unicode's success at unifying character sets has led to its widespread adoption in the internationalization and localization of software. [6] The standard has been implemented in many technologies, including XML, the Java programming language, Swift, and modern operating systems. [7]
Members are usually, but not exclusively, computer software and hardware companies with an interest in text-processing standards, [8] including Adobe, Apple, the Bangladesh Computer Council, Emojipedia, Facebook, Google, IBM, Microsoft, the Omani Ministry of Endowments and Religious Affairs, Monotype Imaging, Netflix, Salesforce, SAP SE, Tamil Virtual Academy, and the University of California, Berkeley. [9] [10] [11] Technical decisions relating to the Unicode Standard are made by the Unicode Technical Committee (UTC). [12]
The project to develop a universal character encoding scheme called Unicode was initiated in 1987 by Joe Becker, Lee Collins, and Mark Davis. [13] [14] The Unicode Consortium was incorporated in California on January 3, 1991, [15] with the stated aim to develop, extend, and promote the use of the Unicode Standard. [16] Mark Davis served as president of the Unicode Consortium from its incorporation in 1991 until 2023, when he became CTO. [17]
Our goal is to make sure that all of the text on computers for every language in the world is represented but we get a lot more attention for emojis than for the fact that you can type Chinese on your phone and have it work with another phone.
— Unicode Consortium co-founder and CTO, Mark Davis [18]
The Unicode Consortium cooperates with many standards development organizations, including ISO/IEC JTC 1/SC 2 and W3C. [19] While Unicode is often considered equivalent to ISO/IEC 10646, and the character sets are essentially identical, the Unicode Standard imposes additional restrictions on implementations that ISO/IEC 10646 does not. [20] Apart from The Unicode Standard (TUS) and its annexes (UAX), the Unicode Consortium also maintains the CLDR, has collaborated with the IETF on IDNA, [21] [22] and publishes related standards (UTS), reports (UTR), and utilities. [23] [24]
The group selects the emoji icons used by the world's smartphones, based on submissions from individuals and organizations who present their case with evidence for why each one is needed. [18]
The Unicode Technical Committee (UTC) meets quarterly to decide whether new characters will be encoded. A quorum of half of the Consortium's full members is required. [25]
As of May 2024, there are nine full members: Adobe, Airbnb, Apple, Google, Meta, Microsoft, Netflix, Salesforce and Translated. [26]
The UTC accepts documents from any organization or individual, whether they are members of the Unicode Consortium or not. [27] [28] The UTC holds its meetings behind closed doors. [29] As of July 2020, the UTC rules on both emoji and script proposals at the same meeting.
Due to the COVID-19 pandemic's effect on travel, the meetings, which used to be hosted by various companies for free, were held online via Zoom in 2020, [30] although the discussions remain confidential.
The UTC prefers to work by consensus, but on particularly contentious issues, votes may be necessary. [31]: §9 After it meets, the UTC releases a public statement on each proposal it considered. [25] Due to the volume of proposals, various subcommittees, such as the Script Ad Hoc Group and Emoji Subcommittee, exist to submit recommendations to the full UTC en banc. [32] [28] The UTC is under no obligation to heed these recommendations, [31]: §1.7 although in practice it usually does.
The Unicode Consortium maintains a History of Unicode Release and Publication Dates.
Publications include:
Unicode, formally The Unicode Standard, is a text encoding standard maintained by the Unicode Consortium and designed to support the use of text in all of the world's writing systems that can be digitized. Version 16.0 of the standard defines 154,998 characters and 168 scripts used in various ordinary, literary, academic, and technical contexts.
UTF-8 is a character encoding standard used for electronic communication. Defined by the Unicode Standard, the name is derived from Unicode Transformation Format – 8-bit. Almost every web page is stored in UTF-8.
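As a rough illustration of the "8-bit" in the name, the following sketch uses Python's built-in `utf-8` codec to show how many 8-bit code units different code points occupy. The sample characters and the helper name `utf8_length` are chosen here for illustration and do not come from the text above.

```python
# Sample code points picked for illustration only (an assumption of this sketch).
SAMPLES = {
    "Latin capital letter A": "\u0041",
    "Latin small letter e with acute": "\u00e9",
    "euro sign": "\u20ac",
    "grinning face emoji": "\U0001F600",
}


def utf8_length(ch: str) -> int:
    """Return the number of bytes UTF-8 uses for a single code point."""
    return len(ch.encode("utf-8"))


for name, ch in SAMPLES.items():
    hex_bytes = ch.encode("utf-8").hex(" ")
    print(f"U+{ord(ch):04X} {name}: {utf8_length(ch)} byte(s) [{hex_bytes}]")
```

The byte count grows from one to four as the code point value increases, which is why UTF-8 is described as a variable-length encoding.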
UTF-16 (16-bit Unicode Transformation Format) is a character encoding capable of encoding all 1,112,064 valid code points of Unicode (in fact, this number of code points is dictated by the design of UTF-16). The encoding is variable-length, as code points are encoded with one or two 16-bit code units. UTF-16 arose from an earlier, now obsolete, fixed-width 16-bit encoding known as "UCS-2" (for 2-byte Universal Character Set), once it became clear that more than 2^16 (65,536) code points were needed, including most emoji and important CJK characters such as those used in personal and place names.
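The figure of 1,112,064 and the "one or two code units" rule can be checked with a small sketch. The function name `to_utf16_units` is hypothetical; the arithmetic follows the standard surrogate-pair scheme, in which the range U+D800 to U+DFFF is reserved and therefore excluded from the count of valid code points.

```python
MAX_CODE_POINT = 0x10FFFF
SURROGATE_FIRST, SURROGATE_LAST = 0xD800, 0xDFFF

# All code points U+0000..U+10FFFF minus the 2,048 reserved surrogates.
VALID_CODE_POINTS = (MAX_CODE_POINT + 1) - (SURROGATE_LAST - SURROGATE_FIRST + 1)


def to_utf16_units(cp: int) -> list:
    """Return the 16-bit code unit(s) that encode one code point."""
    if not 0 <= cp <= MAX_CODE_POINT or SURROGATE_FIRST <= cp <= SURROGATE_LAST:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp < 0x10000:
        return [cp]                      # fits in a single code unit
    offset = cp - 0x10000                # 20-bit value split into two halves
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return [high, low]


print(VALID_CODE_POINTS)
print([hex(u) for u in to_utf16_units(0x1F600)])
```

Two 10-bit halves give 2^20 supplementary code points on top of the 2^16 values of a single unit, which is where the upper limit of U+10FFFF comes from.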
UTF-32 (32-bit Unicode Transformation Format) is a fixed-length encoding that uses exactly 32 bits (four bytes) per Unicode code point. A number of leading bits must be zero, because there are far fewer than 2^32 Unicode code points; only 21 bits are actually needed. In contrast, all other Unicode transformation formats are variable-length encodings. Each 32-bit value in UTF-32 represents one Unicode code point and is exactly equal to that code point's numerical value.
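A minimal sketch of the fixed-width property, using Python's built-in `utf-32-be` codec (big-endian, no byte order mark). The sample string is an arbitrary choice made for this example.

```python
import struct

# Arbitrary sample: U+0041, U+20AC, U+1F600 (three code points).
text = "A\u20ac\U0001F600"

encoded = text.encode("utf-32-be")

# Unpack the bytes as big-endian unsigned 32-bit integers, one per code point.
values = list(struct.unpack(f">{len(text)}I", encoded))

print(len(encoded))                      # four bytes for every code point
print([f"U+{v:04X}" for v in values])    # each value is the code point itself
print(max(v.bit_length() for v in values))
```

Each unpacked integer equals the corresponding `ord()` value, and none of them needs more than 21 bits.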
An internationalized domain name (IDN) is an Internet domain name that contains at least one label displayed in software applications, in whole or in part, in non-Latin script or alphabet or in the Latin alphabet-based characters with diacritics or ligatures. These writing systems are encoded by computers in multibyte Unicode. Internationalized domain names are stored in the Domain Name System (DNS) as ASCII strings using Punycode transcription.
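The ASCII form stored in the DNS can be illustrated with Python's standard `punycode` and `idna` codecs. This is only a sketch: the domain `bücher.example` is a commonly used illustration rather than something taken from the text above, and Python's built-in `idna` codec implements the older IDNA 2003 rules, so it should not be read as a complete or current IDNA implementation.

```python
# Illustrative label containing one non-ASCII letter (u with diaeresis).
label = "b\u00fccher"
domain = label + ".example"

# Raw Punycode transcription of a single label.
punycode_label = label.encode("punycode")

# Full domain conversion: non-ASCII labels receive the "xn--" ACE prefix,
# while labels that are already ASCII are left alone.
ascii_domain = domain.encode("idna")

# Decoding restores the Unicode form for display.
round_trip = ascii_domain.decode("idna")

print(punycode_label)
print(ascii_domain)
print(round_trip == domain)
```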
An emoji is a pictogram, logogram, ideogram, or smiley embedded in text and used in electronic messages and web pages. The primary function of modern emoji is to fill in emotional cues otherwise missing from typed conversation as well as to replace words as part of a logographic system. Emoji exist in various genres, including facial expressions, activity, food and drinks, celebrations, flags, objects, symbols, places, types of weather, animals and nature.
Michael Everson is an American and Irish linguist, script encoder, typesetter, type designer and publisher. He runs a publishing company called Evertype, through which he has published over one hundred books since 2006.
The ConScript Unicode Registry is a volunteer project to coordinate the assignment of code points in the Unicode Private Use Areas (PUA) for the encoding of artificial scripts, such as those for constructed languages. It was founded by John Cowan and was maintained by him and Michael Everson. It is not affiliated with the Unicode Consortium.
ISO 15924, Codes for the representation of names of scripts, is an international standard defining codes for writing systems or scripts. Each script is given both a four-letter code and a numeric code.
The Ideographic Research Group (IRG), formerly called the Ideographic Rapporteur Group, is a subgroup of Working Group 2 (WG2) of ISO/IEC JTC1 Subcommittee 2 (SC2), which is the committee responsible for developing the Universal Coded Character Set. IRG is tasked with preparing and reviewing sets of CJK unified ideographs for eventual inclusion in both ISO/IEC 10646 and The Unicode Standard. The IRG is composed of representatives from national standards bodies from China, Japan, South Korea, Vietnam, and other regions that have historically used Chinese characters, as well as experts from liaison organizations such as the SAT Daizōkyō Text Database Committee (SAT), Taipei Computer Association (TCA), and the Unicode Technical Committee (UTC). The group holds two meetings every year lasting 4-5 days each, subsequently reporting its activities to its parent ISO/IEC JTC 1/SC 2 (SC2/WG2) committee.
〒 is the service mark of Japan Post and its successor, Japan Post Holdings, the postal operator in Japan. It has also been used as the Japanese postal code mark since postal codes were introduced in 1968. Historically, it was used by the Ministry of Communications, which operated the postal service. The mark is a stylized katakana syllable te (テ), from the word teishin. The mark was introduced on February 8, 1887.
Ken Roger Lunde is an American specialist in information processing for East Asian languages.
Mark Edward Davis is an American specialist in the internationalization and localization of software and the co-founder and chief technical officer of the Unicode Consortium, previously serving as its president until 2022.
The Basic Latin Unicode block, sometimes informally called C0 Controls and Basic Latin, is the first block of the Unicode standard, and the only block which is encoded in one byte in UTF-8. The block contains all the letters and control codes of the ASCII encoding. It ranges from U+0000 to U+007F, contains 128 characters and includes the C0 controls, ASCII punctuation and symbols, ASCII digits, both the uppercase and lowercase of the English alphabet and a control character.
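The one-byte claim can be checked directly. The sketch below walks the block's stated range, U+0000 to U+007F, and compares each code point's UTF-8 encoding with the single byte of the same value; the variable names are chosen for this example.

```python
BLOCK_START, BLOCK_END = 0x0000, 0x007F

# Every Basic Latin code point should encode to exactly one byte whose
# value equals the code point, matching the ASCII encoding.
single_byte = all(
    chr(cp).encode("utf-8") == bytes([cp])
    for cp in range(BLOCK_START, BLOCK_END + 1)
)

block_size = BLOCK_END - BLOCK_START + 1

# The first code point after the block already needs two bytes.
next_length = len(chr(BLOCK_END + 1).encode("utf-8"))

print(single_byte, block_size, next_length)
```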
KPS 9566 is a North Korean standard specifying a character encoding for the Chosŏn'gŭl (Hangul) writing system used for the Korean language. The edition of 1997 specified an ISO 2022-compliant 94×94 two-byte coded character set. Subsequent editions have added additional encoded characters outside of the 94×94 plane, in a manner comparable to UHC or GBK.
The ISO basic Latin alphabet is an international standard for a Latin-script alphabet that consists of two sets of 26 letters, codified in various national and international standards and used widely in international communication. They are the same letters that make up the current English alphabet. Since medieval times, they have also been the letters of the modern Latin alphabet. The order is also important for sorting words into alphabetical order.
The Universal Coded Character Set is a standard set of characters defined by the international standard ISO/IEC 10646, Information technology — Universal Coded Character Set (UCS), which is the basis of many character encodings and which grows as characters from previously unrepresented writing systems are added.
The regional indicator symbols are a set of 26 alphabetic Unicode characters (A–Z) intended to be used to encode ISO 3166-1 alpha-2 two-letter country codes in a way that allows optional special treatment.
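A short sketch of the mapping from a two-letter country code to a pair of regional indicator symbols. It relies on the symbols being encoded contiguously from U+1F1E6 (REGIONAL INDICATOR SYMBOL LETTER A); the function name `regional_indicators` is hypothetical, and whether the resulting pair is drawn as a flag depends on the font and platform, which is the "optional special treatment" mentioned above.

```python
import unicodedata

REGIONAL_INDICATOR_A = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A


def regional_indicators(country_code: str) -> str:
    """Map a two-letter code such as 'JP' to two regional indicator symbols."""
    code = country_code.upper()
    if len(code) != 2 or not all("A" <= c <= "Z" for c in code):
        raise ValueError(f"expected two letters A-Z, got {country_code!r}")
    return "".join(chr(REGIONAL_INDICATOR_A + ord(c) - ord("A")) for c in code)


pair = regional_indicators("JP")
for ch in pair:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

The function does not check that the code is actually assigned in ISO 3166-1; it only performs the letter-to-symbol offset.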
Optical Character Recognition is a Unicode block containing signal characters for OCR and MICR standards.
Hangul, Hangul Supplementary-A, and Hangul Supplementary-B were character blocks that existed in Unicode 1.0 and 1.1, and ISO/IEC 10646-1:1993. These blocks encoded precomposed modern Hangul syllables. These three Unicode 1.x blocks were deleted and superseded by the new Hangul Syllables block (U+AC00–U+D7AF) in Unicode 2.0 and ISO/IEC 10646-1:1993 Amd. 5 (1998), and are now occupied by CJK Unified Ideographs Extension A and Yijing Hexagram Symbols. Moving or removing existing characters has been prohibited by the Unicode Stability Policy for all versions following Unicode 2.0, so the Hangul Syllables block introduced in Unicode 2.0 is immutable.
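Because the Hangul Syllables block is fixed, the position of each precomposed syllable can be computed arithmetically from its component jamo. The sketch below follows the algorithm described in the conformance chapter of the Unicode Standard; the constants are the standard's (19 leading consonants, 21 vowels, 28 trailing positions including "none"), while the function names `decompose` and `compose` are chosen for this example.

```python
import unicodedata

S_BASE = 0xAC00                  # first code point of the Hangul Syllables block
L_BASE, V_BASE, T_BASE = 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT      # 588 syllables per leading consonant
S_COUNT = L_COUNT * N_COUNT      # 11,172 precomposed syllables


def decompose(syllable: str):
    """Return (leading, vowel, trailing) indices for one precomposed syllable."""
    index = ord(syllable) - S_BASE
    if not 0 <= index < S_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    leading, remainder = divmod(index, N_COUNT)
    vowel, trailing = divmod(remainder, T_COUNT)
    return leading, vowel, trailing


def compose(leading: int, vowel: int, trailing: int = 0) -> str:
    """Inverse of decompose: build the syllable from its jamo indices."""
    return chr(S_BASE + (leading * V_COUNT + vowel) * T_COUNT + trailing)


han = "\ud55c"                   # the syllable HAN, used as a sample
l, v, t = decompose(han)
jamo = chr(L_BASE + l) + chr(V_BASE + v) + chr(T_BASE + t)
print((l, v, t), jamo == unicodedata.normalize("NFD", han))
```

The last assigned syllable, `compose(18, 20, 27)`, lands on U+D7A3, so the 11,172 syllables fit inside the block's U+AC00 to U+D7AF range with a few positions left unassigned.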
Note: During the ongoing COVID-19 pandemic crisis, until further notice, all Unicode Technical Committee meetings are held via video conference. Details for joining the meeting hosted on the Unicode Zoom account are listed on the logistics page for each meeting.