Mark Davis (Unicode)

Mark Davis
Born: Mark Edward Davis, 13 September 1952 (age 72)
Alma mater: Stanford University (PhD)
Known for: Unicode; Unicode Consortium
Scientific career
Fields: Internationalization and localization
Institutions: IBM; Apple; Google; Taligent; Unicode Consortium
Thesis: Formal problems for Utilitarianism (1979)
Website: www.macchiato.com

Mark Edward Davis (born September 13, 1952) is an American specialist in the internationalization and localization of software. He is the co-founder and chief technical officer of the Unicode Consortium, and served as its president until 2022. [1] [2]

He is one of the key technical contributors to the Unicode specifications, as the primary author or co-author of the specifications for bidirectional text algorithms (used worldwide to display Arabic and Hebrew text), collation (used by sorting and search algorithms), Unicode normalization, Unicode scripts, text segmentation, identifiers, regular expressions, data compression, character encoding and security. [3] [4] [5]

Education

Davis was educated at Stanford University, where he was awarded a PhD in philosophy in 1979. [6]

Career and research

Davis has specialized in the internationalization and localization of software for many years. After his PhD, he worked in Zurich, Switzerland for several years, then returned to the US to join Apple, where he co-authored the Macintosh KanjiTalk and Script Manager, and authored the Macintosh Arabic and Hebrew systems. He also worked on parts of the Mac OS, including contributions to the design of TrueType. Later, he was the manager and architect for the Taligent international frameworks and was then the architect for a large part of the Java international libraries. [7] At IBM, he was the Chief Software Globalization Architect. He is the author of a number of patents, primarily in internationalization and localization. At various times he has also managed groups or departments covering text, internationalization, operating system services, porting and technical communications. [8]

Davis founded and was responsible for the overall architecture of International Components for Unicode (ICU, a major Unicode software internationalization library) and designed the core of the Java internationalization classes. He is also the vice-chair of the Unicode Common Locale Data Repository (CLDR) project, [9] and is a co-author of the IETF language tag specification, Best Current Practice (BCP) 47 (RFC 4646 and RFC 5646), used for identifying languages in XML and HTML documents.

Since the start of 2006, Davis has been working on software internationalization at Google, focusing on effective and secure use of Unicode (especially in the index and search pipeline), overall improvement and adoption of the software internationalization libraries (including ICU) and the introduction and maintenance of stable identifiers for languages, scripts, regions, time zones and currencies. [10]

Publications

The Unicode Standard, Version 5.0 [11]

Personal life

Davis is married to Anne Gundelfinger. [12] He has two daughters from a previous marriage.

Related Research Articles

A bidirectional text contains two text directionalities, right-to-left (RTL) and left-to-right (LTR). It generally involves text containing different types of alphabets, but may also refer to boustrophedon, which is changing text direction in each row.
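As a small illustration of mixed directionality, the sketch below uses Python's standard `unicodedata` module to look up the bidirectional class of each character in a string. The helper name `bidi_classes` is hypothetical, and this only inspects per-character properties; it does not implement the full Unicode bidirectional algorithm.

```python
import unicodedata


def bidi_classes(text: str) -> list[str]:
    """Return the Unicode bidirectional class of each character in text."""
    return [unicodedata.bidirectional(ch) for ch in text]


# Latin "a", Hebrew alef (U+05D0), Arabic alef (U+0627), digit "1"
sample = "a\u05d0\u06271"
classes = bidi_classes(sample)
# "L" = left-to-right, "R" = right-to-left, "AL" = Arabic letter, "EN" = European number
```

A string whose characters carry both "L" and "R" or "AL" classes is the kind of text that needs bidirectional layout.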

Unicode: Character encoding standard

Unicode, formally The Unicode Standard, is a text encoding standard maintained by the Unicode Consortium designed to support the use of text in all of the world's writing systems that can be digitized. Version 16.0 of the standard defines 154998 characters and 168 scripts used in various ordinary, literary, academic, and technical contexts.

UTF-16: Variable-width encoding of Unicode, using one or two 16-bit code units

UTF-16 (16-bit Unicode Transformation Format) is a character encoding method capable of encoding 1,112,064 code points of Unicode. The encoding is variable-length, as code points are encoded with one or two 16-bit code units. UTF-16 arose from an earlier obsolete fixed-width 16-bit encoding now known as 'UCS-2' (for 2-byte Universal Character Set), once it became clear that more than 2^16 (65,536) code points were needed, including most emoji and important CJK characters such as those used in personal and place names.
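The one-or-two code unit rule can be sketched in a few lines of Python. The function name `utf16_code_units` is hypothetical; the arithmetic follows the standard surrogate-pair scheme, in which code points at or above U+10000 are split into a high and a low surrogate.

```python
def utf16_code_units(code_point: int) -> list[int]:
    """Return the UTF-16 code unit(s) for a single Unicode scalar value."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point < 0x10000:
        # Basic Multilingual Plane: one 16-bit code unit.
        return [code_point]
    # Supplementary planes: split the 20-bit offset into two 10-bit halves.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return [high, low]


# Total encodable code points: 0x110000 values minus the 2,048 surrogates.
ENCODABLE = 0x110000 - 0x800  # 1,112,064
```

For example, `utf16_code_units(0x41)` gives a single unit, while `utf16_code_units(0x1F600)` (an emoji) gives the pair `[0xD83D, 0xDE00]`.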

Taligent: Software company (1992–1998)

Taligent Inc. was an American software company. Based on the Pink object-oriented operating system conceived by Apple in 1988, Taligent Inc. was incorporated as an Apple/IBM partnership in 1992, and was dissolved into IBM in 1998.

Internationalization and localization: Process of making software accessible to people in different areas of the world

In computing, internationalization and localization (American) or internationalisation and localisation (British), often abbreviated i18n and l10n respectively, are means of adapting computer software to different languages, regional peculiarities and technical requirements of a target locale.

Mojibake: Garbled text as a result of incorrect character encodings

Mojibake is the garbled or gibberish text that is the result of text being decoded using an unintended character encoding. The result is a systematic replacement of symbols with completely unrelated ones, often from a different writing system.
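A minimal sketch of how mojibake arises, assuming text that was written as UTF-8 and then read back as Latin-1 (ISO 8859-1). The variable names are illustrative only.

```python
original = "M\u00fcnchen"  # "München"

# Encode as UTF-8, then decode with the wrong encoding (Latin-1).
garbled = original.encode("utf-8").decode("latin-1")

# Because Latin-1 maps every byte to a character, this particular mix-up
# is reversible: re-encode as Latin-1 and decode as UTF-8.
repaired = garbled.encode("latin-1").decode("utf-8")
```

Here the single character "ü" becomes two unrelated characters in `garbled`, because its UTF-8 form occupies two bytes and each byte is read as a separate Latin-1 character.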

Unicode Consortium: Nonprofit organization that coordinates the development of the Unicode Standard

The Unicode Consortium is a 501(c)(3) non-profit organization incorporated and based in Mountain View, California, U.S. Its primary purpose is to maintain and publish the Unicode Standard which was developed with the intention of replacing existing character encoding schemes that are limited in size and scope, and are incompatible with multilingual environments.

Punycode is a representation of Unicode with the limited ASCII character subset used for Internet hostnames. Using Punycode, host names containing Unicode characters are transcoded to a subset of ASCII consisting of letters, digits, and hyphens, which is called the letter–digit–hyphen (LDH) subset. For example, München is encoded as Mnchen-3ya.
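The München example can be reproduced with the `punycode` codec in Python's standard library. This is only a sketch of the raw transcoding; the variable names are illustrative.

```python
label = "M\u00fcnchen"  # "München"

# Transcode the Unicode label to the letter-digit-hyphen (LDH) subset.
encoded = label.encode("punycode")

# The transformation is reversible.
decoded = encoded.decode("punycode")
```

The ASCII letters are kept in order, and the position and identity of the non-ASCII character are encoded in the suffix after the final hyphen.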

Internationalized domain name: Type of internet domain name

An internationalized domain name (IDN) is an Internet domain name that contains at least one label displayed in software applications, in whole or in part, in non-Latin script or alphabet or in the Latin alphabet-based characters with diacritics or ligatures. These writing systems are encoded by computers in multibyte Unicode. Internationalized domain names are stored in the Domain Name System (DNS) as ASCII strings using Punycode transcription.

An emoji is a pictogram, logogram, ideogram, or smiley embedded in text and used in electronic messages and web pages. The primary function of modern emoji is to fill in emotional cues otherwise missing from typed conversation as well as to replace words as part of a logographic system. Emoji exist in various genres, including facial expressions, expressions, activity, food and drinks, celebrations, flags, objects, symbols, places, types of weather, animals and nature.

The Unicode collation algorithm (UCA) is an algorithm defined in Unicode Technical Report #10, which is a customizable method to produce binary keys from strings representing text in any writing system and language that can be represented with Unicode. These keys can then be efficiently compared byte by byte in order to collate or sort them according to the rules of the language, with options for ignoring case, accents, etc.

ISO 15924, Codes for the representation of names of scripts, is an international standard defining codes for writing systems or scripts. Each script is given both a four-letter code and a numeric code.

International Components for Unicode (ICU) is an open-source project of mature C/C++ and Java libraries for Unicode support, software internationalization, and software globalization. ICU is widely portable to many operating systems and environments. It gives applications the same results on all platforms and between C, C++, and Java software. The ICU project is a technical committee of the Unicode Consortium and is sponsored, supported, and used by IBM and many other companies. ICU has been included as a standard component with Microsoft Windows since Windows 10 version 1703.

WorldScript is the multilingual text rendering engine for Apple Macintosh's classic Mac OS, before Mac OS X was introduced.

The Common Locale Data Repository (CLDR) is a project of the Unicode Consortium to provide locale data in XML format for use in computer applications. CLDR contains locale-specific information that an operating system will typically provide to applications. CLDR is written in the Locale Data Markup Language (LDML).

A whitespace character is a character data element that represents white space when text is rendered for display by a computer.

Globalize is a cross-platform JavaScript library for internationalization and localization that uses the Unicode Common Locale Data Repository (CLDR).

An IETF BCP 47 language tag is a standardized code that is used to identify human languages on the Internet. The tag structure has been standardized by the Internet Engineering Task Force (IETF) in Best Current Practice (BCP) 47; the subtags are maintained by the IANA Language Subtag Registry.

The regional indicator symbols are a set of 26 alphabetic Unicode characters (A–Z) intended to be used to encode ISO 3166-1 alpha-2 two-letter country codes in a way that allows optional special treatment.
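The mapping from a two-letter country code to regional indicator symbols is a fixed offset from U+1F1E6 (regional indicator symbol letter A). The sketch below assumes an ASCII alpha-2 code as input; the function name `regional_indicators` is hypothetical, and whether the resulting pair is displayed as a flag depends on the rendering platform.

```python
REGIONAL_INDICATOR_A = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A


def regional_indicators(alpha2: str) -> str:
    """Map an ISO 3166-1 alpha-2 code to its pair of regional indicator symbols."""
    code = alpha2.upper()
    if len(code) != 2 or not all("A" <= ch <= "Z" for ch in code):
        raise ValueError("expected a two-letter ASCII country code")
    return "".join(
        chr(REGIONAL_INDICATOR_A + ord(ch) - ord("A")) for ch in code
    )


germany = regional_indicators("DE")  # U+1F1E9 U+1F1EA
```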

Tags is a Unicode block containing formatting tag characters. The block is designed to mirror ASCII. It was originally intended for language tags, but has now been repurposed as emoji modifiers, specifically for region flags.

References

  1. Luckerson, Victor (2016). "Meet the 63-Year-Old in Charge of Approving New Emojis". Time.
  2. "Executive Officers and Staff". www.unicode.org.
  3. "Mark Davis - President, CLDR-TC Chair, & Emoji Subcommittee Chair at Unicode Consortium". THE ORG.
  4. "Board of Directors". unicode.org.
  5. DPA, German Press Agency (January 1, 2018). "Mark Davis: The lesser known master of emojis". Daily Sabah.
  6. Davis, Mark Edward (1979). Formal problems for Utilitarianism. stanford.edu (PhD thesis). Stanford University. OCLC 917950786. ProQuest 302982299.
  7. Davis, M. E.; Grimes, J. D.; Knoles, D. J. (1996). "Creating global software: Text handling and localization in Taligent's CommonPoint application system". IBM Systems Journal. 35 (2): 227–243. doi:10.1147/sj.352.0227. ISSN 0018-8670.
  8. Davis, Mark (2020). "Mark Davis Conference Biography". macchiato.com.
  9. "CLDR Process - CLDR - Unicode Common Locale Data Repository". cldr.unicode.org.
  10. Treanor, Sarah; Nunis, Vivienne (2021). "Face palm: When the emoji you want doesn't exist". bbc.co.uk. London: BBC News.
  11. The Unicode Consortium (November 2006), The Unicode Standard, Version 5.0, Addison-Wesley Professional, ISBN 0-321-48091-0
  12. Wong, Queenie (2016-02-12). "Q&A: Mark Davis, president of the Unicode Consortium, on the rise of emojis". mercurynews.com. The Mercury News. Retrieved 2018-04-05.