Alphabetical order

Alphabetical order is a system whereby character strings are placed in order based on the position of the characters in the conventional ordering of an alphabet. It is one of the methods of collation. In mathematics, a lexicographical order is the generalization of the alphabetical order to other data types, such as sequences of numbers or other ordered mathematical objects.

When applied to strings or sequences that may contain digits, numbers or more elaborate types of elements, in addition to alphabetical characters, the alphabetical order is generally called a lexicographical order.

To determine which of two strings of characters comes first when arranging in alphabetical order, their first letters are compared. If they differ, then the string whose first letter comes earlier in the alphabet comes before the other string. If the first letters are the same, then the second letters are compared, and so on. If a position is reached where one string has no more letters to compare while the other does, then the first (shorter) string is deemed to come first in alphabetical order.
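The comparison rule above can be sketched in a few lines of Python. This is a minimal illustration, assuming both strings contain only lowercase letters a to z so that comparing characters matches their position in the alphabet; the function name `comes_before` is hypothetical.

```python
def comes_before(a: str, b: str) -> bool:
    """Return True if string a sorts strictly before string b."""
    for ch_a, ch_b in zip(a, b):
        if ch_a != ch_b:
            # The first position where the letters differ decides the order.
            return ch_a < ch_b
    # No difference found in the shared positions: one string is a prefix
    # of the other (or they are equal), so the shorter one comes first.
    return len(a) < len(b)


print(comes_before("as", "aster"))   # "as" runs out of letters first
print(comes_before("at", "aster"))   # second letters differ: t vs s
```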

Capital or upper case letters are generally considered to be identical to their corresponding lower case letters for the purposes of alphabetical ordering, although conventions may be adopted to handle situations where two strings differ only in capitalization. Various conventions also exist for the handling of strings containing spaces, modified letters, such as those with diacritics, and non-letter characters such as marks of punctuation.
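One possible convention for capitalization can be sketched as a sort key: compare the strings with case ignored, and fall back to the original spelling only when two strings differ in capitalization alone. Putting the capitalized form first in that tie is an assumption made for this example, not a rule stated above.

```python
def case_insensitive_key(s: str):
    # Primary key: the string with case differences removed.
    # Secondary key: the original string, used only to break ties;
    # in Python's default comparison "A" sorts before "a".
    return (s.casefold(), s)


words = ["apple", "Banana", "Apple", "banana"]
ordered = sorted(words, key=case_insensitive_key)
print(ordered)
```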

The result of placing a set of words or strings in alphabetical order is that all of the strings beginning with the same letter are grouped together; within that grouping all words beginning with the same two-letter sequence are grouped together; and so on. The system thus tends to maximize the number of common initial letters between adjacent words.

History

Alphabetical order was first used in the 1st millennium BCE by Northwest Semitic scribes using the abjad system. [1] However, a range of other methods of classifying and ordering material, including geographical, chronological, hierarchical and by category, were preferred over alphabetical order for centuries. [2]

Parts of the Bible are dated to the 7th–6th centuries BCE. In the Book of Jeremiah, the prophet utilizes the Atbash substitution cipher, based on alphabetical order. Similarly, biblical authors used acrostics based on the (ordered) Hebrew alphabet. [3]

The first effective use of alphabetical order as a cataloging device among scholars may have been in ancient Alexandria, [4] in the Great Library of Alexandria, which was founded around 300 BCE. The poet and scholar Callimachus, who worked there, is thought to have created the world's first library catalog, known as the Pinakes, with scrolls shelved in alphabetical order of the first letter of authors' names. [2]

In the 1st century BCE, Roman writer Varro compiled alphabetic lists of authors and titles. [5] In the 2nd century CE, Sextus Pompeius Festus wrote an encyclopedic epitome of the works of Verrius Flaccus, De verborum significatu, with entries in alphabetic order. [6] In the 3rd century CE, Harpocration wrote a Homeric lexicon alphabetized by all letters. [7] In the 10th century, the author of the Suda used alphabetic order with phonetic variations.

Alphabetical order as an aid to consultation started to enter the mainstream of Western European intellectual life in the second half of the 12th century, when alphabetical tools were developed to help preachers analyse biblical vocabulary. This led to the compilation of alphabetical concordances of the Bible by the Dominican friars in Paris in the 13th century, under Hugh of Saint Cher. Older reference works such as St. Jerome's Interpretations of Hebrew Names were alphabetized for ease of consultation. The use of alphabetical order was initially resisted by scholars, who expected their students to master their area of study according to its own rational structures; its success was driven by such tools as Robert Kilwardby's index to the works of St. Augustine, which helped readers access the full original text instead of depending on the compilations of excerpts which had become prominent in 12th century scholasticism. The adoption of alphabetical order was part of the transition from the primacy of memory to that of written works. [8] The idea of ordering information by the order of the alphabet also met resistance from the compilers of encyclopaedias in the 12th and 13th centuries, who were all devout churchmen. They preferred to organise their material theologically – in the order of God's creation, starting with Deus (meaning God). [2]

In 1604 Robert Cawdrey had to explain in Table Alphabeticall, the first monolingual English dictionary, "Nowe if the word, which thou art desirous to finde, begin with (a) then looke in the beginning of this Table, but if with (v) looke towards the end". [9] Although as late as 1803 Samuel Taylor Coleridge condemned encyclopedias with "an arrangement determined by the accident of initial letters", [10] many lists are today based on this principle.

Arrangement in alphabetical order can be seen as a force for democratising access to information, as it does not require extensive prior knowledge to find what is needed. [2]

Ordering in the Latin script

Basic order and examples

The standard order of the modern ISO basic Latin alphabet is:

A-B-C-D-E-F-G-H-I-J-K-L-M-N-O-P-Q-R-S-T-U-V-W-X-Y-Z

An example of straightforward alphabetical ordering follows:

As, Aster, Astrolabe, Astronomy, Astrophysics, At, Ataman, Attack, Baa

The above words are ordered alphabetically. As comes before Aster because they begin with the same two letters and As has no more letters after that whereas Aster does. The next three words come after Aster because their fourth letter (the first one that differs) is r, which comes after e (the fourth letter of Aster) in the alphabet. Those words themselves are ordered based on their sixth letters (l, n and p respectively). Then comes At, which differs from the preceding words in the second letter (t comes after s). Ataman comes after At for the same reason that Aster came after As. Attack follows Ataman based on comparison of their third letters, and Baa comes after all of the others because it has a different first letter.

Treatment of multiword strings

When some of the strings being ordered consist of more than one word, i.e., they contain spaces or other separators such as hyphens, then two basic approaches may be taken. In the first approach, all strings are ordered initially according to their first word; strings sharing a first word are then ordered by their second word, and so on.

In the second approach, strings are alphabetized as if they had no spaces or other separators.

The second approach is the one usually taken in dictionaries, and it is thus often called dictionary order by publishers. The first approach has often been used in book indexes, although each publisher traditionally set its own standards for which approach to use therein; there was no ISO standard for book indexes (ISO 999) before 1975.
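The difference between the two approaches can be sketched with two sort keys. The place names below are illustrative strings chosen for this example, and the key function names are hypothetical.

```python
import re

names = ["Oak Ridge", "Oakley Park", "Oak", "Oakley River", "Oak Hill"]


def word_by_word_key(s: str):
    # First approach: compare first words, then second words, and so on.
    return re.split(r"[\s-]+", s.lower())


def letter_by_letter_key(s: str):
    # Second approach: ignore spaces and hyphens entirely.
    return re.sub(r"[\s-]+", "", s.lower())


word_by_word = sorted(names, key=word_by_word_key)
letter_by_letter = sorted(names, key=letter_by_letter_key)
print(word_by_word)
print(letter_by_letter)
```

Under the first key "Oak Ridge" stays with the other "Oak" entries; under the second it moves after both "Oakley" entries, because "oakr" follows "oakl".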

Special cases

Modified letters

In French, modified letters (such as those with diacritics) are treated the same as the base letter for alphabetical ordering purposes. For example, rôle comes between rock and rose, as if it were written role. However, languages that use such letters systematically generally have their own ordering rules. See § Language-specific conventions below.
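Treating a modified letter like its base letter can be approximated by removing combining marks before comparison. The sketch below uses Python's standard `unicodedata` module; it only illustrates the base-letter idea and does not implement full French collation rules. The helper name `strip_diacritics` is hypothetical.

```python
import unicodedata


def strip_diacritics(s: str) -> str:
    # Decompose each character into base letter + combining marks (NFD),
    # then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


words = ["rose", "rôle", "rock"]
by_code_point = sorted(words)
by_base_letter = sorted(words, key=lambda w: strip_diacritics(w).lower())
print(by_code_point)    # "rôle" lands after "rose"
print(by_base_letter)   # "rôle" lands between "rock" and "rose"
```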

Ordering by surname

In most cultures where family names are written after given names, it is still desired to sort lists of names (as in telephone directories) by family name first. In this case, names need to be reordered to be sorted correctly. For example, Juan Hernandes and Brian O'Leary should be sorted as "Hernandes, Juan" and "O'Leary, Brian" even if they are not written this way. Capturing this rule in a computer collation algorithm is complex, and simple attempts will fail. For example, unless the algorithm has at its disposal an extensive list of family names, there is no way to decide if "Gillian Lucille van der Waal" is "van der Waal, Gillian Lucille", "Waal, Gillian Lucille van der", or even "Lucille van der Waal, Gillian".
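The difficulty can be seen in a deliberately naive sketch that assumes the family name is always the last space-separated word. The function name `naive_surname_key` is hypothetical, and the approach is shown as an example of a simple attempt that fails, not as a recommended method.

```python
def naive_surname_key(full_name: str) -> str:
    # Assumption: the last word is the family name. This is often wrong.
    parts = full_name.split()
    surname, given = parts[-1], " ".join(parts[:-1])
    return f"{surname}, {given}"


names = ["Brian O'Leary", "Juan Hernandes"]
ordered = sorted(names, key=lambda n: naive_surname_key(n).lower())
print(ordered)

# The heuristic silently picks one of several plausible readings here:
print(naive_surname_key("Gillian Lucille van der Waal"))
```

For the last name the function returns "Waal, Gillian Lucille van der", with no way of knowing whether "van der Waal" or "Lucille van der Waal" was intended.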

Ordering by surname is frequently encountered in academic contexts. Within a single multi-author paper, ordering the authors alphabetically by surname, rather than by other methods such as reverse seniority or subjective degree of contribution to the paper, is seen as a way of "acknowledg[ing] similar contributions" or "avoid[ing] disharmony in collaborating groups". [11] The practice in certain fields of ordering citations in bibliographies by the surnames of their authors has been found to create bias in favour of authors with surnames which appear earlier in the alphabet, while this effect does not appear in fields in which bibliographies are ordered chronologically. [12]

The and other common words

If a phrase begins with a very common word (such as "the", "a" or "an", called articles in grammar), that word is sometimes ignored or moved to the end of the phrase, but this is not always the case. For example, the book "The Shining" might be treated as "Shining", or "Shining, The" and therefore before the book title "Summer of Sam". However, it may also be treated as simply "The Shining" and after "Summer of Sam". Similarly, "A Wrinkle in Time" might be treated as "Wrinkle in Time", "Wrinkle in Time, A", or "A Wrinkle in Time". All three alphabetization methods are fairly easy to create by algorithm, but many programs rely on simple lexicographic ordering instead.
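The three treatments can be sketched as follows, assuming English titles and only the articles "the", "a" and "an". The function names are hypothetical.

```python
ARTICLES = ("the ", "a ", "an ")


def ignore_article(title: str) -> str:
    """Sort key that drops a leading article."""
    lowered = title.lower()
    for article in ARTICLES:
        if lowered.startswith(article):
            return lowered[len(article):]
    return lowered


def move_article_to_end(title: str) -> str:
    """Rewrite 'The Shining' as 'Shining, The'."""
    for article in ARTICLES:
        if title.lower().startswith(article):
            n = len(article)
            return f"{title[n:]}, {title[:n - 1]}"
    return title


titles = ["Summer of Sam", "The Shining"]
article_ignored = sorted(titles, key=ignore_article)
plain = sorted(titles, key=str.lower)   # simple lexicographic ordering
print(article_ignored)
print(plain)
print(move_article_to_end("A Wrinkle in Time"))
```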

Mac prefixes

The prefixes M and Mc in Irish and Scottish surnames are abbreviations for Mac and are sometimes alphabetized as if the spelling is Mac in full. Thus McKinley might be listed before Mackintosh (as it would be if it had been spelled out as "MacKinley"). Since the advent of computer-sorted lists, this type of alphabetization is less frequently encountered, though it is still used in British telephone directories.

St prefix

The prefix St or St. is an abbreviation of "Saint", and is traditionally alphabetized as if the spelling is Saint in full. Thus in a gazetteer St John's might be listed before Salem (as it would be if it had been spelled out as "Saint John's"). Since the advent of computer-sorted lists, this type of alphabetization is less frequently encountered, though it is still sometimes used.
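Both the Mac and the St conventions amount to expanding an abbreviation before comparing. A minimal sketch, assuming the prefix appears at the very start of the string and handling only "Mc" and "St"/"St." (the function name `expand_prefixes` is hypothetical):

```python
import re


def expand_prefixes(name: str) -> str:
    s = name.lower()
    # "St John's" or "St. John's" is compared as "saint john's".
    s = re.sub(r"^st\.?\s+", "saint ", s)
    # "McKinley" is compared as "mackinley".
    s = re.sub(r"^mc(?=[a-z])", "mac", s)
    return s


surnames = sorted(["Mackintosh", "McKinley"], key=expand_prefixes)
places = sorted(["Salem", "St John's"], key=expand_prefixes)
print(surnames)
print(places)
```

A plain case-insensitive sort would instead give Mackintosh before McKinley and Salem before St John's.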

Ligatures

Ligatures (two or more letters merged into one symbol) which are not considered distinct letters, such as Æ and Œ in English, are typically collated as if the letters were separate—"æther" and "aether" would be ordered the same relative to all other words. This is true even when the ligature is not purely stylistic, such as in loanwords and brand names.

Special rules may need to be adopted to sort strings which vary only by whether two letters are joined by a ligature.
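A sketch of this treatment: expand each ligature to its component letters for the main comparison, and use the original spelling only as a tie-breaker. Placing the unligatured spelling first in a tie is an assumption made for this example; the names `LIGATURES` and `ligature_key` are hypothetical.

```python
LIGATURES = {"æ": "ae", "œ": "oe", "Æ": "AE", "Œ": "OE"}


def ligature_key(s: str):
    expanded = "".join(LIGATURES.get(c, c) for c in s).lower()
    # Primary key: the expanded spelling. Secondary key: the original
    # string, so "aether" and "æther" get a fixed relative order.
    return (expanded, s)


words = ["æther", "aether", "aeon"]
ordered = sorted(words, key=ligature_key)
print(ordered)
```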

Treatment of numerals

When some of the strings contain numerals (or other non-letter characters), various approaches are possible. Sometimes such characters are treated as if they came before or after all the letters of the alphabet. Another method is for numbers to be sorted alphabetically as they would be spelled: for example 1776 would be sorted as if spelled out "seventeen seventy-six", and 24 heures du Mans as if spelled "vingt-quatre..." (French for "twenty-four"). When numerals or other symbols are used as special graphical forms of letters, as 1337 for leet or the movie Seven (which was stylised as Se7en), they may be sorted as if they were those letters. Natural sort order orders strings alphabetically, except that multi-digit numbers are treated as a single character and ordered by the value of the number encoded by the digits.
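Natural sort order, the last method described above, can be sketched by splitting each string into alternating text and digit runs and comparing the digit runs by numeric value. The function name `natural_key` and the file names are illustrative.

```python
import re


def natural_key(s: str):
    # re.split with a capturing group yields text, digits, text, digits, ...
    # so every odd-indexed part is a run of digits.
    parts = re.split(r"(\d+)", s)
    return [int(p) if i % 2 else p.lower() for i, p in enumerate(parts)]


files = ["file10", "file2", "file1"]
plain = sorted(files)
natural = sorted(files, key=natural_key)
print(plain)     # character by character: "file10" precedes "file2"
print(natural)   # by numeric value: 1, 2, 10
```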

In the case of monarchs and popes, although their numbers are in Roman numerals and resemble letters, they are normally arranged in numerical order: so, for example, even though V comes after I, the Danish king Christian IX comes after his predecessor Christian VIII.
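A sketch of ordering regnal names by number, assuming every name ends in a space followed by a well-formed Roman numeral. The helper names are hypothetical, and names without a numeral are not handled.

```python
ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_to_int(numeral: str) -> int:
    total = 0
    # Pair each symbol with the one that follows it (a space after the last).
    for ch, nxt in zip(numeral, numeral[1:] + " "):
        value = ROMAN[ch]
        # A smaller symbol before a larger one is subtracted (IX = 9).
        total += -value if ROMAN.get(nxt, 0) > value else value
    return total


def regnal_key(name: str):
    base, _, numeral = name.rpartition(" ")
    return (base.lower(), roman_to_int(numeral))


kings = ["Christian IX", "Christian X", "Christian VIII"]
by_letters = sorted(kings)
by_number = sorted(kings, key=regnal_key)
print(by_letters)
print(by_number)
```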

Language-specific conventions

Languages which use an extended Latin alphabet generally have their own conventions for treatment of the extra letters. Also, in some languages certain digraphs are treated as single letters for collation purposes. For example, the Spanish alphabet treats ñ as a basic letter following n, and formerly treated the digraphs ch and ll as basic letters following c and l, respectively. Under an alphabetization rule issued by the Royal Spanish Academy in 1994, ch and ll are now alphabetized as two-letter combinations. The digraphs remained formally designated as letters after that change, but have not been since 2010. On the other hand, the digraph rr follows rqu as expected (and did so even before the 1994 alphabetization rule), while vowels with acute accents (á, é, í, ó, ú) have always been ordered in parallel with their base letters, as has the letter ü.
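The effect of treating digraphs as letters can be sketched by splitting each word into "letters" of a given alphabet and comparing their positions. This is an illustration only, assuming lowercase-foldable input in precomposed form; characters outside the alphabet are silently skipped, and the names `TRADITIONAL`, `MODERN` and `spanish_key` are hypothetical.

```python
import re

# Pre-1994 order, with ch and ll as letters after c and l; ñ follows n.
TRADITIONAL = "a b c ch d e f g h i j k l ll m n ñ o p q r s t u v w x y z".split()
MODERN = [letter for letter in TRADITIONAL if len(letter) == 1]
# Accented vowels and ü are ordered with their base letters.
STRIP_ACCENTS = str.maketrans("áéíóúü", "aeiouu")


def spanish_key(word: str, alphabet):
    rank = {letter: i for i, letter in enumerate(alphabet)}
    # Try the longest letters first so "ch" is matched before "c".
    pattern = "|".join(sorted(alphabet, key=len, reverse=True))
    tokens = re.findall(pattern, word.lower().translate(STRIP_ACCENTS))
    return [rank[t] for t in tokens]


words = ["cuna", "chico", "dado", "luz", "llave", "mano"]
traditional = sorted(words, key=lambda w: spanish_key(w, TRADITIONAL))
modern = sorted(words, key=lambda w: spanish_key(w, MODERN))
print(traditional)
print(modern)
```

Under the traditional alphabet "chico" follows every word beginning with plain c, and "llave" follows "luz"; under the modern rule both move forward.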

In a few cases, such as Arabic and Kiowa, the alphabet has been completely reordered.

Alphabetization rules vary from language to language. The Kiowa alphabet, for example, is ordered as follows:

A, AU, E, I, O, U, B, F, P, V, D, J, T, TH, G, C, K, Q, CH, X, S, Z, L, Y, W, H, M, N

Automation

Collation algorithms (in combination with sorting algorithms) are used in computer programming to place strings in alphabetical order. A standard example is the Unicode Collation Algorithm, which can be used to put strings containing any Unicode symbols into (an extension of) alphabetical order. [14] It can be made to conform to most of the language-specific conventions described above by tailoring its default collation table. Several such tailorings are collected in the Common Locale Data Repository.

Similar orderings

The principle behind alphabetical ordering can still be applied in languages that do not strictly speaking use an alphabet – for example, they may be written using a syllabary or abugida – provided the symbols used have an established ordering.

For logographic writing systems, such as Chinese hanzi or Japanese kanji, the method of radical-and-stroke sorting is frequently used as a way of defining an ordering on the symbols. Japanese sometimes uses pronunciation order, most commonly with the Gojūon order but sometimes with the older Iroha ordering.

In mathematics, lexicographical order is a means of ordering sequences in a manner analogous to that used to produce alphabetical order. [16]
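Python's built-in comparison of tuples is lexicographic, so it can serve as a small illustration of the same rule applied to sequences of numbers rather than letters. The sample sequences are arbitrary.

```python
sequences = [(2,), (1, 10), (1, 2, 3), (1, 2)]

# Elements are compared position by position; if one sequence is a
# prefix of the other, the shorter one comes first.
ordered = sorted(sequences)
print(ordered)
```

Note that (1, 10) follows (1, 2, 3) because the elements are compared as numbers, 10 > 2, not digit by digit.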

Some computer applications use a version of alphabetical order that can be achieved using a very simple algorithm, based purely on the ASCII or Unicode codes for characters. This may have non-standard effects such as placing all capital letters before lower-case ones. See ASCIIbetical order.
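Python's default string comparison works on code points, so it shows this effect directly. The word list is illustrative, and `str.casefold` is used here as one simple way to compare without regard to case.

```python
words = ["banana", "Cherry", "apple", "Zebra"]

by_code_point = sorted(words)                     # capitals sort first
case_folded = sorted(words, key=str.casefold)     # case ignored
print(by_code_point)
print(case_folded)
```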

A rhyming dictionary is based on sorting words in alphabetical order starting from the last to the first letter of the word.
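This ordering can be sketched by sorting on each word spelled backwards. The word list is illustrative, and the approach groups words by spelling of their endings, which only approximates rhyme.

```python
words = ["nation", "bat", "station", "cat", "motion"]

# Compare the last letters first, then the second-to-last, and so on.
by_ending = sorted(words, key=lambda w: w[::-1])
print(by_ending)
```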

Notes

  1. There is an exception: In ABC Chinese–English Dictionary the tone order is "zero tone (neutral tone), first tone (flat tone), second tone (rising tone), third tone (falling-rising tone) and fourth tone (falling tone)".

References

  1. Reinhard G. Lehmann: "27-30-22-26. How Many Letters Needs an Alphabet? The Case of Semitic", in: The idea of writing: Writing across borders, edited by Alex de Voogt and Joachim Friedrich Quack, Leiden: Brill 2012, pp. 11–52.
  2. Street, Julie (10 June 2020). "From A to Z - the surprising history of alphabetical order" (text and audio). ABC News (ABC Radio National). Australian Broadcasting Corporation. Archived from the original on 2 July 2020. Retrieved 6 July 2020.
  3. e.g. Psalms 25, 34, 37, 111, 112, 119 and 145 of the Hebrew Bible
  4. Daly, Lloyd. Contributions to the History of Alphabetization in Antiquity and the Middle Ages. Brussels, 1967. p. 25.
  5. O'Hara, James (1989). "Messapus, Cycnus, and the Alphabetical Order of Vergil's Catalogue of Italian Heroes". Phoenix. 43 (1): 35–38. doi:10.2307/1088539. JSTOR 1088539.
  6. LIVRE XI – texte latin – traduction + commentaires. Archived from the original on 9 June 2012. Retrieved 8 May 2012.
  7. Gibson, Craig (2002). Interpreting a classic: Demosthenes and his ancient commentators.
  8. Rouse, Mary A.; Rouse, Richard M. (1991), "Statim invenire: Schools, Preachers and New Attitudes to the Page", Authentic Witnesses: Approaches to Medieval Texts and Manuscripts, University of Notre Dame Press, pp. 201–219, ISBN 0-268-00622-9
  9. Cawdrey, Robert (1604). A Table Alphabeticall. London. p. [A4]v.
  10. Coleridge's Letters, No.507.
  11. Tscharntke, Teja; Hochberg, Michael E; Rand, Tatyana A; Resh, Vincent H; Krauss, Jochen (January 2007). "Author Sequence and Credit for Contributions in Multiauthored Publications". PLOS Biol. 5 (1): e18. doi:10.1371/journal.pbio.0050018. PMC 1769438. PMID 17227141.
  12. Stevens, Jeffrey R.; Duque, Juan F. (2018). "Order Matters: Alphabetizing In-Text Citations Biases Citation Rates" (PDF). Psychonomic Bulletin & Review. 26 (3): 1020–1026. doi:10.3758/s13423-018-1532-8. PMID 30288671. S2CID 52922399. Archived (PDF) from the original on 10 November 2018. Retrieved 10 November 2018.
  13. "Arabic Mathematical Alphabetic Symbols" (PDF). The Unicode Standard. Archived (PDF) from the original on 30 October 2022. Retrieved 26 November 2022.
  14. "Unicode Technical Standard #10: Unicode collation algorithm". Unicode, Inc. (unicode.org). 20 March 2008. Archived from the original on 27 August 2008. Retrieved 27 August 2008.
  15. Midgley, Ralph. "Volapük to English dictionary" (PDF). Archived from the original (PDF) on 1 September 2012. Retrieved 24 September 2019.
  16. Franz Baader; Tobias Nipkow (1999). Term Rewriting and All That. Cambridge University Press. pp. 18–19. ISBN 978-0-521-77920-3.
