Mark Davis | |
---|---|
Born | Mark Edward Davis, 13 September 1952, Riverside, California, U.S. |
Alma mater | Stanford University (PhD) |
Known for | Unicode, Unicode Consortium |
Scientific career | |
Fields | Internationalization and localization |
Institutions | IBM, Apple, Taligent, Unicode Consortium |
Thesis | Formal problems for Utilitarianism (1979) |
Website | www |
Mark Edward Davis (born September 13, 1952) is an American specialist in the internationalization and localization of software and the co-founder and chief technical officer of the Unicode Consortium. He served as its president until 2022. [1] [2]
He is one of the key technical contributors to the Unicode specifications, as the primary author or co-author of the specifications for bidirectional text algorithms (used worldwide to display Arabic and Hebrew text), collation (used by sorting and search algorithms), Unicode normalization, Unicode scripts, text segmentation, identifiers, regular expressions, data compression, character encoding and security. [3] [4] [5]
Davis was educated at Stanford University where he was awarded a PhD in Philosophy in 1979. [6]
Davis has specialized in the internationalization and localization of software for many years. After his PhD, he worked in Zurich, Switzerland for several years, then returned to the US to join Apple, where he co-authored the Macintosh KanjiTalk and Script Manager, and authored the Macintosh Arabic and Hebrew systems. He also worked on parts of the Mac OS, including contributions to the design of TrueType. Later, he was the manager and architect for the Taligent international frameworks and was then the architect for a large part of the Java international libraries. [7] At IBM, he was the Chief Software Globalization Architect. He is the author of a number of patents, primarily in internationalization and localization. At various times he has also managed groups or departments covering text, internationalization, operating system services, porting and technical communications. [8]
Davis founded and was responsible for the overall architecture of International Components for Unicode (ICU: a major Unicode software internationalization library) and designed the core of the Java internationalization classes. He is also the vice-chair of the Unicode Common Locale Data Repository (CLDR) project, [9] and is a co-author of BCP 47 (Best Current Practice 47), the IETF language tag specification (RFC 4646 and RFC 5646), used for identifying languages in XML and HTML documents.
Since the start of 2006, Davis has been working on software internationalization at Google, focusing on effective and secure use of Unicode (especially in the index and search pipeline), overall improvement and adoption of the software internationalization libraries (including ICU) and the introduction and maintenance of stable identifiers for languages, scripts, regions, time zones and currencies. [10]
The Unicode Standard, Version 5.0 [11]
Davis is married to Anne Gundelfinger. [12] He has two daughters from a previous marriage.
A bidirectional text contains two text directionalities, right-to-left (RTL) and left-to-right (LTR). It generally involves text containing different types of alphabets, but may also refer to boustrophedon, in which the text direction alternates from row to row.
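As a rough illustration of what "two directionalities" means in practice, the sketch below uses Python's standard `unicodedata` module to look up each character's bidirectional class and report whether a string mixes strong left-to-right and right-to-left characters. The helper names `strong_directions` and `is_bidirectional` are made up for this example, and the check is only a detection heuristic; it does not implement the Unicode bidirectional algorithm itself.

```python
import unicodedata

# Strong bidirectional classes from the Unicode Character Database:
# "L" is left-to-right; "R" (e.g. Hebrew) and "AL" (Arabic letters) are right-to-left.
LTR_CLASSES = {"L"}
RTL_CLASSES = {"R", "AL"}


def strong_directions(text: str) -> set[str]:
    """Return the set of strong directions ("LTR", "RTL") present in text."""
    directions = set()
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in LTR_CLASSES:
            directions.add("LTR")
        elif bidi_class in RTL_CLASSES:
            directions.add("RTL")
    return directions


def is_bidirectional(text: str) -> bool:
    """True if text contains both strong LTR and strong RTL characters."""
    return strong_directions(text) == {"LTR", "RTL"}


print(is_bidirectional("abc שלום"))  # Latin and Hebrew letters together
print(is_bidirectional("abc 123"))   # digits and spaces are not strong characters
```

Digits, spaces and punctuation have weak or neutral classes, which is why only the strong classes are counted here.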
Unicode, formally The Unicode Standard, is a text encoding standard maintained by the Unicode Consortium and designed to support the use of text in all of the world's writing systems that can be digitized. Version 16.0 of the standard defines 154,998 characters and 168 scripts used in various ordinary, literary, academic, and technical contexts.
UTF-16 (16-bit Unicode Transformation Format) is a character encoding method capable of encoding 1,112,064 code points of Unicode. The encoding is variable-length, as code points are encoded with one or two 16-bit code units. UTF-16 arose from an earlier obsolete fixed-width 16-bit encoding now known as 'UCS-2' (for 2-byte Universal Character Set), once it became clear that more than 2^16 (65,536) code points were needed, including most emoji and important CJK characters such as those used in personal and place names.
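The one-or-two code unit scheme can be sketched in a few lines of Python. The function names `utf16_code_units` and `encodable_code_points` are hypothetical and exist only for this illustration; in real code one would normally call `str.encode("utf-16")` instead.

```python
def utf16_code_units(code_point: int) -> list[int]:
    """Encode a single Unicode code point as one or two 16-bit code units."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded")
    if code_point < 0x10000:
        # Basic Multilingual Plane: a single code unit, as in UCS-2.
        return [code_point]
    # Supplementary planes: split the 20-bit offset into a surrogate pair.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)   # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits
    return [high, low]


def encodable_code_points() -> int:
    """All code points U+0000..U+10FFFF minus the 2,048 surrogates."""
    return 0x110000 - (0xDFFF - 0xD800 + 1)


print([hex(u) for u in utf16_code_units(0x41)])     # "A": one code unit
print([hex(u) for u in utf16_code_units(0x1F600)])  # an emoji: two code units
print(encodable_code_points())                      # 1112064
```

The count in the last line matches the 1,112,064 figure quoted above: the full code space of 1,114,112 values minus the 2,048 surrogate values reserved for pairs.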
Taligent Inc. was an American software company. Based on the Pink object-oriented operating system conceived by Apple in 1988, Taligent Inc. was incorporated as an Apple/IBM partnership in 1992, and was dissolved into IBM in 1998.
In computing, internationalization and localization (American) or internationalisation and localisation (British), often abbreviated i18n and l10n respectively, are means of adapting computer software to different languages, regional peculiarities and technical requirements of a target locale.
Mojibake is the garbled or gibberish text that is the result of text being decoded using an unintended character encoding. The result is a systematic replacement of symbols with completely unrelated ones, often from a different writing system.
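A minimal sketch of how mojibake arises, using only Python's built-in codecs: text is encoded as UTF-8 and then decoded as Latin-1 (ISO 8859-1), which is an assumed mismatch chosen for illustration. Because Latin-1 maps every byte to a character, the round trip in this particular case can be reversed.

```python
original = "München"

# Encode correctly as UTF-8, then decode with the wrong encoding.
garbled = original.encode("utf-8").decode("latin-1")
print(garbled)  # MÃ¼nchen

# Undo the mistake by reversing the two steps.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)  # München
```

The single character "ü" becomes two unrelated characters because its UTF-8 form is two bytes, and each byte is read as a separate Latin-1 character. Repair is not always possible; some wrong decodings discard or replace bytes.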
The Unicode Consortium is a 501(c)(3) non-profit organization incorporated and based in Mountain View, California, U.S. Its primary purpose is to maintain and publish the Unicode Standard which was developed with the intention of replacing existing character encoding schemes that are limited in size and scope, and are incompatible with multilingual environments.
Punycode is a representation of Unicode with the limited ASCII character subset used for Internet hostnames. Using Punycode, host names containing Unicode characters are transcoded to a subset of ASCII consisting of letters, digits, and hyphens, which is called the letter–digit–hyphen (LDH) subset. For example, München is encoded as Mnchen-3ya.
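The München example can be reproduced with the `punycode` codec that ships with Python's standard library. This is only a demonstration of the transcoding step; it does not add the prefix used in domain names.

```python
label = "München"

# Encode the Unicode label into the letter-digit-hyphen subset of ASCII.
encoded = label.encode("punycode")
print(encoded)  # b'Mnchen-3ya'

# Decoding restores the original label.
decoded = encoded.decode("punycode")
print(decoded)  # München
```

The ASCII characters of the label are copied through unchanged, and the suffix after the last hyphen encodes which non-ASCII character was removed and where it belongs.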
An internationalized domain name (IDN) is an Internet domain name that contains at least one label displayed in software applications, in whole or in part, in a non-Latin script or alphabet, or in Latin alphabet-based characters with diacritics or ligatures. These writing systems are encoded by computers in multibyte Unicode. Internationalized domain names are stored in the Domain Name System (DNS) as ASCII strings using Punycode transcription.
An emoji is a pictogram, logogram, ideogram, or smiley embedded in text and used in electronic messages and web pages. The primary function of modern emoji is to fill in emotional cues otherwise missing from typed conversation as well as to replace words as part of a logographic system. Emoji exist in various genres, including facial expressions, activity, food and drinks, celebrations, flags, objects, symbols, places, types of weather, animals and nature.
The Unicode collation algorithm (UCA) is an algorithm defined in Unicode Technical Standard #10. It is a customizable method for producing binary keys from strings representing text in any writing system and language that can be represented with Unicode. These keys can then be efficiently compared byte by byte in order to collate or sort them according to the rules of the language, with options for ignoring case, accents, etc.
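To illustrate the ASCII form stored in the DNS, the sketch below uses Python's built-in `idna` codec, which implements the older IDNA 2003 rules and is used here only as a convenient stand-in. The host name `münchen.example` is an invented example.

```python
hostname = "münchen.example"

# Each non-ASCII label is converted to Punycode and given the "xn--" prefix;
# labels that are already ASCII are left unchanged.
ascii_form = hostname.encode("idna")
print(ascii_form)  # b'xn--mnchen-3ya.example'

# Applications convert back to Unicode for display.
display_form = ascii_form.decode("idna")
print(display_form)  # münchen.example
```

Newer software generally follows the later IDNA 2008 rules, which differ in some details from what this codec does, so the output should be read as an approximation of current practice.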
ISO 15924, Codes for the representation of names of scripts, is an international standard defining codes for writing systems or scripts. Each script is given both a four-letter code and a numeric code.
International Components for Unicode (ICU) is an open-source project of mature C/C++ and Java libraries for Unicode support, software internationalization, and software globalization. ICU is widely portable to many operating systems and environments. It gives applications the same results on all platforms and between C, C++, and Java software. The ICU project is a technical committee of the Unicode Consortium and sponsored, supported, and used by IBM and many other companies. ICU has been included as a standard component with Microsoft Windows since Windows 10 version 1703.
WorldScript is the multilingual text rendering engine for Apple Macintosh's classic Mac OS, before Mac OS X was introduced.
The Common Locale Data Repository (CLDR) is a project of the Unicode Consortium to provide locale data in XML format for use in computer applications. CLDR contains locale-specific information that an operating system will typically provide to applications. CLDR is written in the Locale Data Markup Language (LDML).
A whitespace character is a character data element that represents white space when text is rendered for display by a computer.
Globalize is a cross-platform JavaScript library for internationalization and localization that uses the Unicode Common Locale Data Repository (CLDR).
An IETF BCP 47 language tag is a standardized code that is used to identify human languages on the Internet. The tag structure has been standardized by the Internet Engineering Task Force (IETF) in Best Current Practice (BCP) 47; the subtags are maintained by the IANA Language Subtag Registry.
The regional indicator symbols are a set of 26 alphabetic Unicode characters (A–Z) intended to be used to encode ISO 3166-1 alpha-2 two-letter country codes in a way that allows optional special treatment.
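The mapping from a two-letter country code to a pair of regional indicator symbols is a fixed offset from the ASCII letters, as the following sketch shows. The function name `regional_indicators` is hypothetical; the block starts at U+1F1E6 (regional indicator symbol letter A). Whether the resulting pair is drawn as a flag depends on the font and platform.

```python
REGIONAL_INDICATOR_A = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A


def regional_indicators(alpha2: str) -> str:
    """Map an ISO 3166-1 alpha-2 code such as "DE" to two regional indicators."""
    code = alpha2.upper()
    if len(code) != 2 or not all("A" <= ch <= "Z" for ch in code):
        raise ValueError("expected a two-letter A-Z country code")
    return "".join(
        chr(REGIONAL_INDICATOR_A + ord(ch) - ord("A")) for ch in code
    )


pair = regional_indicators("DE")
print([f"U+{ord(ch):04X}" for ch in pair])  # ['U+1F1E9', 'U+1F1EA']
```

The function does not check that the code is actually assigned in ISO 3166-1; it only performs the character mapping.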
Tags is a Unicode block containing formatting tag characters. The block is designed to mirror ASCII. It was originally intended for language tags, but has now been repurposed as emoji modifiers, specifically for region flags.
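The "mirror ASCII" design means each tag character sits at a fixed offset of 0xE0000 from its ASCII counterpart. The sketch below builds a region flag sequence on the assumption, taken from the emoji tag sequence format, that it consists of U+1F3F4 (waving black flag), the tag characters spelling the lowercase region code, and U+E007F (cancel tag). The function name `region_flag_sequence` is made up for this example.

```python
BLACK_FLAG = "\U0001F3F4"  # WAVING BLACK FLAG, the base of the sequence
TAG_OFFSET = 0xE0000       # tag character = ASCII code + 0xE0000
CANCEL_TAG = "\U000E007F"  # terminates the tag sequence


def region_flag_sequence(region_code: str) -> str:
    """Build the tag sequence for a region code such as "GB-SCT"."""
    code = region_code.lower().replace("-", "")
    if not code or not code.isascii() or not code.isalnum():
        raise ValueError("expected ASCII letters and digits")
    tags = "".join(chr(TAG_OFFSET + ord(ch)) for ch in code)
    return BLACK_FLAG + tags + CANCEL_TAG


sequence = region_flag_sequence("GB-SCT")
print([f"U+{ord(ch):04X}" for ch in sequence])
```

Only a small number of such sequences are recommended for general interchange, so most region codes passed to this function will not display as a flag.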