European ordering rules

The European ordering rules (EOR / EN 13710) define an ordering for strings in languages written with the Latin, Greek and Cyrillic alphabets. The standard covers languages used in the European Union, the European Free Trade Association, and parts of the former Soviet Union. It is a tailoring of the Common Tailorable Template of ISO/IEC 14651. [1] EOR can in turn be tailored for individual European languages, but in inter-European contexts it can be used without further tailoring.

Method

Like ISO/IEC 14651, on which it is based, EOR assigns weights at four levels; an optional fifth level is usually omitted.
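How the levels interact can be sketched in Python. This is a minimal illustration rather than anything taken from EN 13710: the name `compare_keys` and the toy weights are hypothetical, and it assumes each string has already been converted into one tuple of weights per level. A later level is consulted only when all earlier levels tie.

```python
from typing import Tuple

# One tuple of integer weights per level (hypothetical toy representation).
Key = Tuple[Tuple[int, ...], ...]


def compare_keys(a: Key, b: Key) -> int:
    """Return -1, 0 or 1; earlier levels take precedence over later ones."""
    for level_a, level_b in zip(a, b):
        if level_a != level_b:
            return -1 if level_a < level_b else 1
    return 0


# Toy keys: level 1 (letters), level 2 (diacritics), level 3 (case),
# level 4 (punctuation). The numbers are made up for illustration.
plain = ((1, 2), (0, 0), (0, 0), ())
accented = ((1, 2), (0, 1), (0, 0), ())
later_letter = ((1, 3), (0, 0), (0, 0), ())
```

Here `plain` and `accented` tie at level 1 and are separated at level 2, while `later_letter` sorts after both because level 1 already differs. Python compares tuples element by element, so sorting such keys directly gives the same order.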

Level 1

The first level orders the letters themselves. At this level, the Latin letters are ordered as follows:

a b c d ð e f g h i j k l m n o p q r s t u v w x y z þ

The Greek alphabet has the following order:

α β γ δ ε ϝ ϛ ζ η θ ι κ λ μ ν ξ ο π ϟ ρ σ τ υ φ χ ψ ω ϡ

Cyrillic script has the following order:

а б в г ґ д ђ ѓ е ё є ж з з́ ѕ и і ї й ј к л љ м н њ о п р с с́ т ћ ќ у ў ф х ц ч џ ш щ ъ ы ь ѣ э ю я

The order for the three alphabets is:

  1. Latin alphabet
  2. Greek alphabet
  3. Cyrillic alphabet
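The letter and script orders above can be turned into a level-1 sort key. The following Python sketch is illustrative only: `SCRIPTS`, `WEIGHT` and `level1_key` are hypothetical names, the two Cyrillic entries written with a combining acute (з́ and с́) are left out for simplicity, and characters not in the tables are ignored rather than handled as the standard would.

```python
import unicodedata

# Letter sequences as listed above; scripts rank Latin < Greek < Cyrillic.
# Simplification: з́ and с́ (letter plus combining acute) are omitted.
SCRIPTS = [
    "a b c d ð e f g h i j k l m n o p q r s t u v w x y z þ",
    "α β γ δ ε ϝ ϛ ζ η θ ι κ λ μ ν ξ ο π ϟ ρ σ τ υ φ χ ψ ω ϡ",
    "а б в г ґ д ђ ѓ е ё є ж з ѕ и і ї й ј к л љ м н њ о п р с т ћ ќ у ў "
    "ф х ц ч џ ш щ ъ ы ь ѣ э ю я",
]

# Map each letter to (script rank, rank of the letter within its script).
WEIGHT = {
    letter: (script_rank, letter_rank)
    for script_rank, letters in enumerate(SCRIPTS)
    for letter_rank, letter in enumerate(
        unicodedata.normalize("NFC", letters).split()
    )
}


def level1_key(word: str) -> list:
    """Level-1 weights only: diacritics, case and unknown characters are ignored."""
    weights = []
    for ch in unicodedata.normalize("NFC", word).lower():
        if ch not in WEIGHT:
            # Fall back to the base letter, e.g. "é" -> "e", "έ" -> "ε".
            ch = unicodedata.normalize("NFD", ch)[0]
        if ch in WEIGHT:
            weights.append(WEIGHT[ch])
    return weights


words = ["яблоко", "zebra", "ωμέγα", "apple"]
by_level1 = sorted(words, key=level1_key)
```

With these tables, Latin words sort before Greek ones and Greek before Cyrillic, and ð and þ take the positions given in the Latin list rather than their code point positions.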

The Georgian and Armenian alphabets were not included in ENV 13710:2000, but they were covered by CR 14400:2001, "European ordering rules – Ordering for Latin, Greek, Cyrillic, Georgian and Armenian scripts". Both documents have been incorporated into and replaced by EN 13710:2011. [2]

All scripts encoded in ISO/IEC 10646 (Unicode) are covered by ISO/IEC 14651 (with its data file, the CTT) and by the Unicode collation algorithm (UCA, with its associated DUCET). Both are available at no charge.

Level 2

The second level orders additions to the letters, such as diacritics and other variations. Letters with diacritical marks (like à, î, õ, and ü) are ordered as variants of the base letter. The letters æ, œ, ĳ and ŋ are ordered as modifications of ae, oe, ij and n respectively, and similar cases are treated the same way.

Level 2 defines the following order of diacritics and other modifications:

  1. Acute accent (á)
  2. Grave accent (à)
  3. Breve (ă)
  4. Circumflex (â)
  5. Caron (š)
  6. Ring (å)
  7. Diaeresis (ä)
  8. Double acute accent (ő)
  9. Tilde (ã)
  10. Dot (ż)
  11. Cedilla (ç)
  12. Ogonek (ą)
  13. Macron (ā)
  14. With stroke through (ø)
  15. Modified letter(s) (æ)
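The diacritic order above can be sketched as a secondary sort key using Unicode decomposition. This is a rough illustration under several assumptions: `MARKS`, `MARK_RANK` and `level2_sort_key` are hypothetical names, base letters are compared by plain code point as a stand-in for level 1, weights are compared left to right, and items 14 and 15 (stroke and modified letters) are left out because those characters do not decompose under NFD.

```python
import unicodedata

# Combining marks in the level-2 order listed above (rank 0 means "no mark").
MARKS = [
    "\u0301",  # acute
    "\u0300",  # grave
    "\u0306",  # breve
    "\u0302",  # circumflex
    "\u030C",  # caron
    "\u030A",  # ring above
    "\u0308",  # diaeresis
    "\u030B",  # double acute
    "\u0303",  # tilde
    "\u0307",  # dot above
    "\u0327",  # cedilla
    "\u0328",  # ogonek
    "\u0304",  # macron
]
MARK_RANK = {mark: rank for rank, mark in enumerate(MARKS, start=1)}


def level2_sort_key(word: str):
    """Return (base letters, diacritic ranks); base letters decide first."""
    base, ranks = [], []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch):
            if ranks:
                # Simplification: if a letter carries several marks,
                # the last one wins; unlisted marks rank after all others.
                ranks[-1] = MARK_RANK.get(ch, len(MARKS) + 1)
        else:
            base.append(ch)
            ranks.append(0)
    return "".join(base), tuple(ranks)


variants = sorted(["ā", "ä", "a", "à", "á", "â"], key=level2_sort_key)
```

Because the base letters form the first component of the key, a diacritic only matters when two strings have identical base letters.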

Level 3

The third level distinguishes between capital and small letters, as in "Polish" and "polish".
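A case distinction that only breaks ties can be sketched as follows. The function name `level3_sort_key` is hypothetical, level 2 is omitted, and the choice to put the small letter before the capital is an assumption of this sketch, since the text above does not say which comes first.

```python
def level3_sort_key(word: str):
    """Compare case-insensitively first, then by case pattern."""
    # Assumption: a small letter (False) sorts before its capital (True).
    return word.lower(), tuple(ch.isupper() for ch in word)


words = ["Polka", "Polish", "polish"]
by_levels = sorted(words, key=level3_sort_key)
by_code_point = sorted(words)
```

With this key, "polish" and "Polish" end up next to each other and ahead of "Polka", whereas a plain code point sort places every capitalised word before every lowercase one.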

Level 4

The fourth level concerns punctuation and whitespace characters. It distinguishes, for example, between "MacDonald" and "Mac Donald", and between "its" and "it's".
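One way to picture this level is to set punctuation and whitespace aside for the earlier comparison and use them only as a final tie-breaker. The sketch below makes that assumption explicit; `level4_sort_key` is a hypothetical name, levels 2 and 3 are omitted, and ordering the variant without special characters first is a choice of this example, not a statement about the standard.

```python
def level4_sort_key(word: str):
    """Compare letters and digits first, then punctuation and whitespace."""
    letters = "".join(ch for ch in word if ch.isalnum())
    # Record each special character together with its position in the word.
    specials = tuple(
        (pos, ch) for pos, ch in enumerate(word) if not ch.isalnum()
    )
    return letters.lower(), specials


names = sorted(["Mac Donald", "MacDonald", "MacArthur"], key=level4_sort_key)
forms = sorted(["it's", "its", "itself"], key=level4_sort_key)
```

In this sketch "Mac Donald" and "MacDonald" stay adjacent, and "it's" is kept next to "its" instead of being ordered by the code point of the apostrophe.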

Level 5

An optional, and usually omitted, fifth level can distinguish typographical differences, including whether the text is italic, normal or bold.

References

  1. "ENV 13710 – a 'European Pre-Standard': European ordering rules". Retrieved 2020-11-25.
  2. CEN/CENELEC: EN 13710:2011-09 European Ordering Rules - Ordering of characters from Latin, Greek, Cyrillic, Georgian and Armenian scripts