ISO basic Latin alphabet

The ISO basic Latin alphabet is an international standard (beginning with ISO/IEC 646) for a Latin-script alphabet that consists of two sets (uppercase and lowercase) of 26 letters, codified in various national and international standards and used widely in international communication. [1] They are the same letters that make up the current English alphabet. Since medieval times, they have also been the letters of the modern Latin alphabet. The order of the letters is also important for sorting words into alphabetical order.

The two sets contain the following 26 letters each: [1]

ISO basic Latin alphabet
Uppercase letter set A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
Lowercase letter set a b c d e f g h i j k l m n o p q r s t u v w x y z

History

By the 1960s it had become apparent to the computer and telecommunications industries in the First World that a non-proprietary method of encoding characters was needed. The International Organization for Standardization (ISO) encapsulated the Latin script in its 7-bit character-encoding standard, ISO/IEC 646. To achieve widespread acceptance, this encapsulation was based on popular usage. The standard was based on the already published American Standard Code for Information Interchange, better known as ASCII, whose character set included the 26 × 2 letters of the English alphabet. Later standards issued by the ISO, for example ISO/IEC 8859 (8-bit character encoding) and ISO/IEC 10646 (Unicode Latin), have continued to define the 26 × 2 letters of the English alphabet as the basic Latin script, with extensions to handle other letters in other languages. [1]

Terminology

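The Unicode block that contains the alphabet is called "C0 Controls and Basic Latin". It has two relevant subheadings: "Uppercase Latin alphabet", whose letters start at U+0041, and "Lowercase Latin alphabet", whose letters start at U+0061. [2]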

Two further sets appear in the Halfwidth and Fullwidth Forms block: fullwidth uppercase letters starting at U+FF21 and fullwidth lowercase letters starting at U+FF41. [3]
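The relationship between the basic and fullwidth letter sets can be illustrated with a short Python sketch. It assumes Python's standard `unicodedata` module; the starting code points are taken from the Unicode code charts, and the helper name `letter_set` is hypothetical, chosen only for this illustration.

```python
import unicodedata

# Starting code points of the four letter sets (taken from the Unicode charts).
STARTS = {
    "basic uppercase": 0x0041,
    "basic lowercase": 0x0061,
    "fullwidth uppercase": 0xFF21,
    "fullwidth lowercase": 0xFF41,
}

def letter_set(start: int) -> str:
    """Return the 26 consecutive characters beginning at code point `start`."""
    return "".join(chr(start + offset) for offset in range(26))

for label, start in STARTS.items():
    first = letter_set(start)[0]
    # The character name is looked up in the Unicode character database.
    print(f"{label}: U+{start:04X} {unicodedata.name(first)}")

# Compatibility normalization (NFKC) folds fullwidth letters to the basic ones.
fullwidth_upper = letter_set(0xFF21)
basic_upper = unicodedata.normalize("NFKC", fullwidth_upper)
```

Under these assumptions, `basic_upper` holds the ordinary uppercase letters A to Z, which is one common way to treat fullwidth text as equivalent to basic Latin text.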

Timeline for encoding standards

Timeline for widely used computer codes supporting the alphabet

Representation

The uppercase letters of the ISO basic Latin alphabet on a 16-segment display (plus the Arabic numerals).

In ASCII the letters belong to the printable characters, and in Unicode, since version 1.0, they belong to the block "C0 Controls and Basic Latin". In both cases, as well as in ISO/IEC 646, ISO/IEC 8859 and ISO/IEC 10646, they occupy hexadecimal positions 41 to 5A for uppercase and 61 to 7A for lowercase.
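The code positions given above can be checked with a minimal Python sketch. It uses only the standard `string` module and built-in codecs; the helper name `code_range` is hypothetical. Python ships no codec named after ISO/IEC 646 itself, so the `ascii` codec stands in for it here, which is an assumption of this example.

```python
import string

UPPER = string.ascii_uppercase  # "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
LOWER = string.ascii_lowercase  # "abcdefghijklmnopqrstuvwxyz"

def code_range(letters: str) -> tuple[str, str]:
    """Return the first and last code positions of `letters` as hex strings."""
    return f"{ord(letters[0]):02X}", f"{ord(letters[-1]):02X}"

upper_range = code_range(UPPER)  # positions of A and Z
lower_range = code_range(LOWER)  # positions of a and z

# The letters encode to identical bytes in ASCII, ISO/IEC 8859-1 and UTF-8.
encodings = ("ascii", "iso-8859-1", "utf-8")
same_bytes = len({UPPER.encode(enc) for enc in encodings}) == 1

# Each lowercase letter sits a fixed distance above its uppercase counterpart.
case_offsets = {ord(lo) - ord(up) for up, lo in zip(UPPER, LOWER)}
```

Here `upper_range` is `("41", "5A")`, `lower_range` is `("61", "7A")`, and `case_offsets` contains the single value 0x20, so the two cases differ in exactly one bit.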

All letters, regardless of case, have code words in the ICAO spelling alphabet and can be represented in Morse code.
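A small Python sketch can illustrate the case-insensitive mapping from letters to ICAO code words and Morse code. The function names `spell` and `to_morse` are hypothetical, and characters outside the 26 letters are simply skipped, which is a simplification made for this example.

```python
import string

# ICAO spelling alphabet code words, in alphabetical order A to Z.
ICAO = ("Alfa Bravo Charlie Delta Echo Foxtrot Golf Hotel India Juliett Kilo "
        "Lima Mike November Oscar Papa Quebec Romeo Sierra Tango Uniform "
        "Victor Whiskey X-ray Yankee Zulu").split()

# International Morse code for the letters A to Z.
MORSE = (".- -... -.-. -.. . ..-. --. .... .. .--- -.- .-.. -- -. --- .--. "
         "--.- .-. ... - ..- ...- .-- -..- -.-- --..").split()

ICAO_BY_LETTER = dict(zip(string.ascii_uppercase, ICAO))
MORSE_BY_LETTER = dict(zip(string.ascii_uppercase, MORSE))

def spell(word: str) -> list[str]:
    """Spell `word` with ICAO code words, ignoring case and non-letters."""
    return [ICAO_BY_LETTER[ch] for ch in word.upper() if ch in ICAO_BY_LETTER]

def to_morse(word: str) -> str:
    """Encode `word` in Morse code, with letters separated by spaces."""
    return " ".join(MORSE_BY_LETTER[ch] for ch in word.upper()
                    if ch in MORSE_BY_LETTER)
```

For example, `spell("Iso")` and `spell("ISO")` both give `["India", "Sierra", "Oscar"]`, since the input is folded to uppercase before lookup.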

Usage

All of the lowercase letters are used in the International Phonetic Alphabet (IPA). In X-SAMPA and SAMPA these letters have the same sound value as in IPA.

Alphabets containing the same set of letters

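The list below only includes alphabets that lack letters whose diacritical marks make them distinct letters and multigraphs that constitute distinct letters.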

Notable omissions due to these rules include Spanish, Esperanto, Filipino and German. The German alphabet is sometimes considered by tradition to contain only 26 letters (with ä, ö, ü considered variants and ß considered a ligature), but the current German orthographic rules include ä, ö, ü, ß in the alphabet placed after Z; however, this order is normally not used in collation: usually ä, ö, ü are collated as a, o, u (or sometimes as ae, oe, ue), ß as ss.
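The two German collation conventions mentioned above can be sketched as sort keys in Python. This is a simplified illustration, not a full implementation of any collation standard: the table names `BASE_FOLD` and `E_FOLD` and the function `collation_key` are hypothetical, and the input is assumed to use precomposed characters.

```python
# Convention 1: ä, ö, ü are collated as a, o, u; ß as ss.
BASE_FOLD = str.maketrans({"ä": "a", "ö": "o", "ü": "u", "ß": "ss"})
# Convention 2: ä, ö, ü are collated as ae, oe, ue; ß as ss.
E_FOLD = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})

def collation_key(word: str, table: dict = BASE_FOLD) -> str:
    """Return a sort key built only from the 26 basic letters."""
    return word.lower().translate(table)

words = ["Zebra", "Ärger", "Apfel", "Straße", "Strasse", "Ofen", "Öl"]

# Plain code-point order places Ä and Ö after Z.
naive = sorted(words)
# Order when umlauts are treated as their base letters.
sorted_base = sorted(words, key=collation_key)
# Order when umlauts are expanded to ae, oe, ue.
sorted_e = sorted(words, key=lambda w: collation_key(w, E_FOLD))
```

With the first convention "Ärger" sorts after "Apfel", while with the second it sorts before it, because the key "aerger" precedes "apfel".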

Alphabet | Diacritic | Multigraphs (not constituting distinct letters) | Ligatures
Afrikaans alphabet | á, ä, é, è, ê, ë, í, î, ï, ó, ô, ö, ú, û, ü, ý | Digraphs: ⟨aa⟩, ⟨ai⟩, ⟨ch⟩, ⟨ee⟩, ⟨ei⟩, ⟨eu⟩, ⟨gh⟩, ⟨ie⟩, ⟨nj⟩, ⟨ng⟩, ⟨oe⟩, ⟨oi⟩, ⟨oo⟩, ⟨ou⟩, ⟨sj⟩, ⟨tj⟩, ⟨ts⟩, ⟨ui⟩, ⟨uu⟩. Trigraphs: ⟨aai⟩, ⟨eeu⟩, ⟨oei⟩, ⟨ooi⟩ | ʼn (N-apostrophe)
Aragonese alphabet (Academia de l'Aragonés orthography) | á, é, í, ó, ú, ü, lꞏl | ⟨ch⟩, ⟨gu⟩, ⟨ll⟩, ⟨ny⟩, ⟨qu⟩, ⟨rr⟩, ⟨tz⟩ |
Catalan alphabet | à, é, è, í, ï, ó, ò, ú, ü, ç, lꞏl | ⟨gu⟩, ⟨ig⟩, ⟨ix⟩, ⟨ll⟩, ⟨ny⟩, ⟨qu⟩, ⟨rr⟩, ⟨ss⟩ |
Dutch alphabet | ä, é, è, ë, ï, ö, ü | The digraph ⟨ij⟩ is sometimes considered to be a separate letter. When that is the case, it usually replaces or is intermixed with ⟨y⟩. Other digraphs: ⟨aa⟩, ⟨ae⟩, ⟨ai⟩, ⟨au⟩, ⟨ch⟩, ⟨ee⟩, ⟨ei⟩, ⟨eu⟩, ⟨ie⟩, ⟨oe⟩, ⟨oi⟩, ⟨oo⟩, ⟨ou⟩, ⟨ui⟩, ⟨uu⟩ |
English alphabet | only in loanwords (see footnote 1 below) | ⟨sh⟩, ⟨ch⟩, ⟨ea⟩, ⟨ou⟩, ⟨th⟩, ⟨ph⟩, ⟨ng⟩ | æ, œ (both archaic)
French alphabet | à, â, ç, é, è, ê, ë, î, ï, ô, ù, û, ü, ÿ | ⟨ai⟩, ⟨au⟩, ⟨ei⟩, ⟨eu⟩, ⟨oi⟩, ⟨ou⟩, ⟨eau⟩, ⟨ch⟩, ⟨ph⟩, ⟨gn⟩, ⟨an⟩, ⟨am⟩, ⟨en⟩, ⟨em⟩, ⟨in⟩, ⟨im⟩, ⟨on⟩, ⟨om⟩, ⟨un⟩, ⟨um⟩, ⟨yn⟩, ⟨ym⟩, ⟨ain⟩, ⟨aim⟩, ⟨ein⟩, ⟨oin⟩ | æ (rare), œ (mandatory)
Italian alphabet (extended) [note 1] | à, è, é, ì, î (formal), ò, ó, ù | ⟨ch⟩, ⟨ci⟩, ⟨gh⟩, ⟨gi⟩, ⟨gl⟩, ⟨gli⟩, ⟨gn⟩, ⟨sc⟩, ⟨sci⟩ |
Ido alphabet* | none | ⟨qu⟩, ⟨ch⟩, ⟨sh⟩ |
Indonesian alphabet | only in learning materials (see footnote 4 below) | ⟨kh⟩, ⟨ng⟩, ⟨ny⟩, ⟨sy⟩, diphthongs: ⟨ai⟩, ⟨au⟩, ⟨ei⟩, ⟨oi⟩ |
Interlingua alphabet* | only in unassimilated loanwords (see footnote 2 below) | ⟨ch⟩, ⟨ph⟩, ⟨qu⟩, ⟨rh⟩, ⟨sh⟩ |
Javanese Latin alphabet | é, è | ⟨dh⟩, ⟨kh⟩, ⟨ng⟩, ⟨ny⟩, ⟨sy⟩, ⟨th⟩ |
Latino sine flexione alphabet* | only an optional accent for unusual stress (see footnote 3 below) | ⟨ae⟩, ⟨ch⟩, ⟨oe⟩, ⟨ph⟩, ⟨qu⟩, ⟨rh⟩, ⟨th⟩ [8] |
Luxembourgish alphabet | ä, é, ë | ⟨aa⟩, ⟨ch⟩, ⟨ck⟩, ⟨ee⟩, ⟨ei⟩, ⟨ie⟩, ⟨ii⟩, ⟨ng⟩, ⟨oo⟩, ⟨ou⟩, ⟨qu⟩, ⟨ue⟩, ⟨uu⟩, ⟨sch⟩ |
Malay alphabet | only in learning materials (see footnote 4 below) | ⟨gh⟩, ⟨kh⟩, ⟨ng⟩, ⟨ny⟩, ⟨sy⟩ |
Portuguese alphabet [note 2] | ã, õ, á, é, í, ó, ú, â, ê, ô, à, ç | ⟨ch⟩, ⟨lh⟩, ⟨nh⟩, ⟨rr⟩, ⟨ss⟩, ⟨am⟩, ⟨em⟩, ⟨im⟩, ⟨om⟩, ⟨um⟩, ⟨ãe⟩, ⟨ão⟩, ⟨õe⟩ |
Sundanese Latin alphabet | é | ⟨eu⟩, ⟨ng⟩, ⟨ny⟩ |

* Constructed languages

  1. English is one of the few modern European languages requiring no diacritics for native words (although a diaeresis is used by some American publishers in words such as "coöperation"). [note 3] [9]
  2. Interlingua, a constructed language, never uses diacritics except in unassimilated loanwords. However, they can be removed if they are not used to modify the vowel (e.g. cafe, from French: café). [10]
  3. Latino sine flexione, a.k.a. "Peano's Interlingua", allows but does not require the placement of an accent for unusual stress. (It antedates the other "Interlingua" by roughly four decades.)
  4. Malay and Indonesian (based on Malay) are the only languages outside Europe that use the whole Latin alphabet and require no diacritics or ligatures. [note 4] Many of the 700+ languages of Indonesia also use the Indonesian alphabet to write their languages, some, such as Javanese, adding the diacritics é and è, and some omitting q, x, and z.

Column numbering

The Roman (Latin) alphabet is commonly used for column numbering in a table or chart. This avoids confusion with row numbers using Arabic numerals. For example, a 3-by-3 table would contain columns A, B, and C, set against rows 1, 2, and 3. If more columns are needed beyond Z (normally the final letter of the alphabet), the column immediately after Z is AA, followed by AB, and so on (see bijective base-26 system). This can be seen by scrolling far to the right in a spreadsheet program such as Microsoft Excel or LibreOffice Calc.

These are double-digit "letters" for table columns, in the same way that 10 through 99 are double-digit numbers. The Greek alphabet has a similar extended form that uses such double-digit letters if necessary, but it is used for chapters of a fraternity as opposed to columns of a table.

For bullet points, by contrast, the double-digit letters are AA, BB, CC, etc., rather than following the number-like place-value system explained above for table columns.
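The two labelling schemes described in this section can be sketched in Python as follows. The function names `column_label`, `column_index` and `bullet_label` are hypothetical, and the bullet scheme (each letter repeated once more per pass through the alphabet) is an assumption based on the sequence AA, BB, CC given above.

```python
import string

LETTERS = string.ascii_uppercase

def column_label(index: int) -> str:
    """Convert a 1-based column index to a bijective base-26 label."""
    if index < 1:
        raise ValueError("column index must be 1 or greater")
    label = ""
    while index > 0:
        # Subtracting 1 first makes the digits run from A (1) to Z (26),
        # with no digit for zero.
        index, remainder = divmod(index - 1, 26)
        label = LETTERS[remainder] + label
    return label

def column_index(label: str) -> int:
    """Convert a column label such as "AB" back to its 1-based index."""
    index = 0
    for ch in label.upper():
        index = index * 26 + (LETTERS.index(ch) + 1)
    return index

def bullet_label(index: int) -> str:
    """Convert a 1-based bullet number to a repeated-letter label."""
    if index < 1:
        raise ValueError("bullet number must be 1 or greater")
    repeats, position = divmod(index - 1, 26)
    return LETTERS[position] * (repeats + 1)
```

Under these assumptions, column 27 is labelled AA and column 28 is labelled AB, whereas bullet 27 is AA and bullet 28 is BB.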

Notes

  1. The Italian alphabet is traditionally considered to have only 21 letters, excluding j, k, w, x, y. However, in practice these letters occur in a number of loanwords. J also occurs in some native Italian proper names as a variant of writing semivocalic i.
  2. Note for Portuguese: k and y (but not w) were part of the alphabet until several spelling reforms during the 20th century, the aim of which was to change the etymological Portuguese spelling into an easier phonetic spelling. These letters were replaced by other letters having the same sound: thus psychologia became psicologia, kioske became quiosque, martyr became mártir, etc. Nowadays k, w, and y are only found in foreign words and their derived terms and in scientific abbreviations (e.g. km, byronismo). These letters are considered part of the alphabet again following the 1990 Portuguese Language Orthographic Agreement, which came into effect on January 1, 2009, in Brazil. See Reforms of Portuguese orthography.
  3. As an example of an article containing a diaeresis in "coöperate", as well as accents on loan words in English, such as a cedilla in "façades" and a circumflex in the word "crêpe", see Grafton, Anthony (October 23, 2006). "Books: The Nutty Professors, The history of academic charisma". The New Yorker.
  4. However, Malay and Indonesian learning materials may use ⟨é⟩ (E with acute) to clarify the pronunciation of the letter E; in that case, ⟨e⟩ is pronounced /ə/ while ⟨é⟩ is pronounced /e/ and ⟨è⟩ is pronounced /ɛ/.

References

  1. "Internationalisation standardization of 7-bit codes, ISO 646". Trans-European Research and Education Networking Association (TERENA). Retrieved October 3, 2010.
  2. "C0 Controls and Basic Latin" (PDF). Unicode.org. Retrieved August 8, 2016.
  3. "Halfwidth and Fullwidth Forms" (PDF). Unicode.org. Retrieved August 8, 2016.
  4. "The Postal History of ICAO". www.icao.int. Archived from the original on February 12, 2019. Retrieved February 17, 2019.
  5. Standard ECMA-6: 7-Bit Coded Character Set (PDF) (5th ed.). Geneva, Switzerland: European Computer Manufacturers Association (Ecma). March 1985. Archived from the original (PDF) on May 29, 2016. Retrieved May 29, 2016. The Technical Committee TC1 of ECMA met for the first time in December 1960 to prepare standard codes for Input/Output purposes. On April 30, 1965, Standard ECMA-6 was adopted by the General Assembly of ECMA.
  6. "Unicode character database". The Unicode Standard. Retrieved March 22, 2013.
  7. The Unicode Standard Version 1.0, Volume 1. Addison-Wesley Publishing Company, Inc. 1990. ISBN 0-201-56788-1.
  8. Not "letters", per: Ager, Simon. "Latino sine Flexione". Omniglot. Latino sine Flexione alphabet. Retrieved April 14, 2023.
  9. "The New Yorker's odd mark — the diaeresis". December 16, 2010. Archived from the original on December 16, 2010.
  10. "Introduction al IED (in anglese)". www.interlingua.com. Retrieved September 21, 2020.