Collation

Last updated

Collation is the assembly of written information into a standard order. Many systems of collation are based on numerical order or alphabetical order, or extensions and combinations thereof. Collation is a fundamental element of most office filing systems, library catalogs, and reference books.

Contents

Collation differs from classification in that the classes themselves are not necessarily ordered. However, even if the order of the classes is irrelevant, the identifiers of the classes may be members of an ordered set, allowing a sorting algorithm to arrange the items by class.

Formally speaking, a collation method typically defines a total order on a set of possible identifiers, called sort keys, which consequently produces a total preorder on the set of items of information (items with the same identifier are not placed in any defined order).

A collation algorithm such as the Unicode collation algorithm defines an order through the process of comparing two given character strings and deciding which should come before the other. When an order has been defined in this way, a sorting algorithm can be used to put a list of any number of items into that order.

The main advantage of collation is that it makes it fast and easy for a user to find an element in the list, or to confirm that it is absent from the list. In automatic systems this can be done using a binary search algorithm or interpolation search; manual searching may be performed using a roughly similar procedure, though this will often be done unconsciously. Other advantages are that one can easily find the first or last elements on the list (most likely to be useful in the case of numerically sorted data), or elements in a given range (useful again in the case of numerical data, and also with alphabetically ordered data when one may be sure of only the first few letters of the sought item or items).

Ordering

Numerical and chronological

Strings representing numbers may be sorted based on the values of the numbers that they represent. For example, "−4", "2.5", "10", "89", "30,000". Pure application of this method may provide only a partial ordering on the strings, since different strings can represent the same number (as with "2" and "2.0" or, when scientific notation is used, "2e3" and "2000").

A similar approach may be taken with strings representing dates or other items that can be ordered chronologically or in some other natural fashion.

Alphabetical

Alphabetical order is the basis for many systems of collation where items of information are identified by strings consisting principally of letters from an alphabet. The ordering of the strings relies on the existence of a standard ordering for the letters of the alphabet in question. (The system is not limited to alphabets in the strict technical sense; languages that use a syllabary or abugida, for example Cherokee, can use the same ordering principle provided there is a set ordering for the symbols used.)

To decide which of two strings comes first in alphabetical order, initially their first letters are compared. The string whose first letter appears earlier in the alphabet comes first in alphabetical order. If the first letters are the same, then the second letters are compared, and so on, until the order is decided. (If one string runs out of letters to compare, then it is deemed to come first; for example, "cart" comes before "carthorse".) The result of arranging a set of strings in alphabetical order is that words with the same first letter are grouped together, and within such a group words with the same first two letters are grouped together, and so on.

Capital letters are typically treated as equivalent to their corresponding lowercase letters. (For alternative treatments in computerized systems, see Automated collation, below.)

Certain limitations, complications, and special conventions may apply when alphabetical order is used:

In several languages the rules have changed over time, and so older dictionaries may use a different order than modern ones. Furthermore, collation may depend on use. For example, German dictionaries and telephone directories use different approaches.

Root sorting

Some Arabic dictionaries, such as Hans Wehr's bilingual A Dictionary of Modern Written Arabic , group and sort Arabic words by semitic root. [1] For example, the words kitāba (كتابة 'writing'), kitāb (كتاب 'book'), kātib (كاتب 'writer'), maktaba (مكتبة 'library'), maktab (مكتب 'office'), maktūb (مكتوب 'fate,' or 'written'), are agglomerated under the triliteral root k-t-b (ك ت ب), which denotes 'writing'. [2]

Radical-and-stroke sorting

See also Chinese characters and Chinese character orders

Another form of collation is radical-and-stroke sorting, used for non-alphabetic writing systems such as the hanzi of Chinese and the kanji of Japanese, whose thousands of symbols defy ordering by convention. In this system, common components of characters are identified; these are called radicals in Chinese and logographic systems derived from Chinese. Characters are then grouped by their primary radical, then ordered by number of pen strokes within radicals. When there is no obvious radical or more than one radical, convention governs which is used for collation. For example, the Chinese character 妈 (meaning "mother") is sorted as a six-stroke character under the three-stroke primary radical 女.

The radical-and-stroke system is cumbersome compared to an alphabetical system in which there are a few characters, all unambiguous. The choice of which components of a logograph comprise separate radicals and which radical is primary is not clear-cut. As a result, logographic languages often supplement radical-and-stroke ordering with alphabetic sorting of a phonetic conversion of the logographs. For example, the kanji word Tōkyō (東京) can be sorted as if it were spelled out in the Japanese characters of the hiragana syllabary as "to-u-ki-yo-u" (とうきょう), using the conventional sorting order for these characters.[ citation needed ]

In addition, Chinese characters can also be sorted by stroke-based sorting. In Greater China, surname stroke ordering is a convention in some official documents where people's names are listed without hierarchy.

Automation

When information is stored in digital systems, collation may become an automated process. It is then necessary to implement an appropriate collation algorithm that allows the information to be sorted in a satisfactory manner for the application in question. Often the aim will be to achieve an alphabetical or numerical ordering that follows the standard criteria as described in the preceding sections. However, not all of these criteria are easy to automate. [3]

The simplest kind of automated collation is based on the numerical codes of the symbols in a character set, such as ASCII coding (or any of its supersets such as Unicode), with the symbols being ordered in increasing numerical order of their codes, and this ordering being extended to strings in accordance with the basic principles of alphabetical ordering (mathematically speaking, lexicographical ordering). So a computer program might treat the characters a, b, C, d, and $ as being ordered $, C, a, b, d (the corresponding ASCII codes are $ = 36, a = 97, b = 98, C = 67, and d = 100). Therefore, strings beginning with C, M, or Z would be sorted before strings with lower-case a, b, etc. This is sometimes called ASCIIbetical order . This deviates from the standard alphabetical order, particularly due to the ordering of capital letters before all lower-case ones (and possibly the treatment of spaces and other non-letter characters). It is therefore often applied with certain alterations, the most obvious being case conversion (often to uppercase, for historical reasons [note 1] ) before comparison of ASCII values.

In many collation algorithms, the comparison is based not on the numerical codes of the characters, but with reference to the collating sequence – a sequence in which the characters are assumed to come for the purpose of collation – as well as other ordering rules appropriate to the given application. This can serve to apply the correct conventions used for alphabetical ordering in the language in question, dealing properly with differently cased letters, modified letters, digraphs, particular abbreviations, and so on, as mentioned above under Alphabetical order, and in detail in the Alphabetical order article. Such algorithms are potentially quite complex, possibly requiring several passes through the text. [3]

Problems are nonetheless still common when the algorithm has to encompass more than one language. For example, in German dictionaries the word ökonomisch comes between offenbar and olfaktorisch, while Turkish dictionaries treat o and ö as different letters, placing oyun before öbür.

A standard algorithm for collating any collection of strings composed of any standard Unicode symbols is the Unicode Collation Algorithm. This can be adapted to use the appropriate collation sequence for a given language by tailoring its default collation table. Several such tailorings are collected in Common Locale Data Repository.

Sort keys

In some applications, the strings by which items are collated may differ from the identifiers that are displayed. For example, The Shining might be sorted as Shining, The (see Alphabetical order above), but it may still be desired to display it as The Shining. In this case two sets of strings can be stored, one for display purposes, and another for collation purposes. Strings used for collation in this way are called sort keys.

Issues with numbers

Sometimes, it is desired to order text with embedded numbers using proper numerical order. For example, "Figure 7b" goes before "Figure 11a", even though '7' comes after '1' in Unicode. This can be extended to Roman numerals. This behavior is not particularly difficult to produce as long as only integers are to be sorted, although it can slow down sorting significantly. For example, Microsoft Windows does this when sorting file names.

Sorting decimals properly is a bit more difficult, because different locales use different symbols for a decimal point, and sometimes the same character used as a decimal point is also used as a separator, for example "Section 3.2.5". There is no universal answer for how to sort such strings; any rules are application dependent.

Labeling of ordered items

In some contexts, numbers and letters are used not so much as a basis for establishing an ordering, but as a means of labeling items that are already ordered. For example, pages, sections, chapters, and the like, as well as the items of lists, are frequently "numbered" in this way. Labeling series that may be used include ordinary Arabic numerals (1, 2, 3, ...), Roman numerals (I, II, III, ... or i, ii, iii, ...), or letters (A, B, C, ... or a, b, c, ...). (An alternative method for indicating list items, without numbering them, is to use a bulleted list.)

When letters of an alphabet are used for this purpose of enumeration, there are certain language-specific conventions as to which letters are used. For example, the Russian letters Ъ and Ь (which in writing are only used for modifying the preceding consonant), and usually also Ы, Й, and Ё, are omitted. Also in many languages that use extended Latin script, the modified letters are often not used in enumeration.

See also

Notes

  1. Historically, computers only handled text in uppercase (this dates back to telegraph conventions).

Related Research Articles

A bidirectional text contains two text directionalities, right-to-left (RTL) and left-to-right (LTR). It generally involves text containing different types of alphabets, but may also refer to boustrophedon, which is changing text direction in each row.

<span class="mw-page-title-main">Diacritic</span> Modifier mark added to a letter

A diacritic is a glyph added to a letter or to a basic glyph. The term derives from the Ancient Greek διακριτικός, from διακρίνω. The word diacritic is a noun, though it is sometimes used in an attributive sense, whereas diacritical is only an adjective. Some diacritics, such as the acute ⟨á⟩, grave ⟨à⟩, and circumflex ⟨â⟩, are often called accents. Diacritics may appear above or below a letter or in some other position such as within the letter or between two letters.

<span class="mw-page-title-main">Grapheme</span> Smallest functional written unit

In linguistics, a grapheme is the smallest functional unit of a writing system. The word grapheme is derived from Ancient Greek γράφω (gráphō) 'write' and the suffix -eme by analogy with phoneme and other names of emic units. The study of graphemes is called graphemics. The concept of graphemes is abstract and similar to the notion in computing of a character. By comparison, a specific shape that represents any particular grapheme in a given typeface is called a glyph.

<span class="mw-page-title-main">String (computer science)</span> Sequence of characters, data type

In computer programming, a string is traditionally a sequence of characters, either as a literal constant or as some kind of variable. The latter may allow its elements to be mutated and the length changed, or it may be fixed. A string is generally considered as a data type and is often implemented as an array data structure of bytes that stores a sequence of elements, typically characters, using some character encoding. String may also denote more general arrays or other sequence data types and structures.

<span class="mw-page-title-main">Sorting</span> Action of arranging objects into order

Sorting refers to ordering data in an increasing or decreasing manner according to some linear relationship among the data items.

  1. ordering: arranging items in a sequence ordered by some criterion;
  2. categorizing: grouping items with similar properties.
<span class="mw-page-title-main">Polish alphabet</span> Script of the Polish language

The Polish alphabet is the script of the Polish language, the basis for the Polish system of orthography. It is based on the Latin alphabet but includes certain letters with diacritics: the acute accent ; the overdot ; the tail or ogonek ; and the stroke. ⟨q⟩, ⟨v⟩, and ⟨x⟩, which are used only in foreign words, are usually absent from the Polish alphabet. However, prior to the standardization of Polish spelling, ⟨x⟩ was sometimes used in place of ⟨ks⟩.

Alphabetical order is a system whereby character strings are placed in order based on the position of the characters in the conventional ordering of an alphabet. It is one of the methods of collation. In mathematics, a lexicographical order is the generalization of the alphabetical order to other data types, such as sequences of numbers or other ordered mathematical objects.

<span class="mw-page-title-main">Ligature (writing)</span> Glyph combining two or more letterforms

In writing and typography, a ligature occurs where two or more graphemes or letters are joined to form a single glyph. Examples are the characters ⟨æ⟩ and ⟨œ⟩ used in English and French, in which the letters ⟨a⟩ and ⟨e⟩ are joined for the first ligature and the letters ⟨o⟩ and ⟨e⟩ are joined for the second ligature. For stylistic and legibility reasons, ⟨f⟩ and ⟨i⟩ are often merged to create ⟨fi⟩ ; the same is true of ⟨s⟩ and ⟨t⟩ to create ⟨st⟩. The common ampersand, ⟨&⟩, developed from a ligature in which the handwritten Latin letters ⟨e⟩ and ⟨t⟩ were combined.

<span class="mw-page-title-main">Digraph (orthography)</span> Pair of characters used to write one phoneme

A digraph or digram is a pair of characters used in the orthography of a language to write either a single phoneme, or a sequence of phonemes that does not correspond to the normal values of the two characters combined.

<span class="mw-page-title-main">European ordering rules</span>

The European ordering rules, define an ordering for strings written in languages that are written with the Latin, Greek and Cyrillic alphabets. The standard covers languages used by the European Union, the European Free Trade Association, and parts of the former Soviet Union. It is a tailoring of the Common Tailorable Template of ISO/IEC 14651. EOR can in turn be tailored for different (European) languages. But in inter-European contexts, EOR can be used without further tailoring.

<span class="mw-page-title-main">Latin script</span> Writing system based on the alphabet used by the Romans

The Latin script, also known as the Roman script, and technically Latin writing system is an alphabetic writing system based on the letters of the classical Latin alphabet, derived from a form of the Greek alphabet which was in use in the ancient Greek city of Cumae, in southern Italy. The Greek alphabet was altered by the Etruscans, and subsequently their alphabet was altered by the Romans. Several Latin-script alphabets exist, which differ in graphemes, collation and phonetic values from the classical Latin alphabet.

Unicode equivalence is the specification by the Unicode character encoding standard that some sequences of code points represent essentially the same character. This feature was introduced in the standard to allow compatibility with preexisting standard character sets, which often included similar or identical characters.

In Unicode, a script is a collection of letters and other written signs used to represent textual information in one or more writing systems. Some scripts support one and only one writing system and language, for example, Armenian. Other scripts support many different writing systems; for example, the Latin script supports English, French, German, Italian, Vietnamese, Latin itself, and several other languages. Some languages make use of multiple alternate writing systems and thus also use several scripts; for example, in Turkish, the Arabic script was used before the 20th century but transitioned to Latin in the early part of the 20th century. More or less complementary to scripts are symbols and Unicode control characters.

<span class="mw-page-title-main">Umlaut (diacritic)</span> Diacritic mark to indicate sound shift

The umlaut is the diacritical mark used to indicate in writing the result of the historical sound shift due to which former back vowels are now pronounced as front vowels.

A Latin-script alphabet is an alphabet that uses letters of the Latin script. The 21-letter archaic Latin alphabet and the 23-letter classical Latin alphabet belong to the oldest of this group. The 26-letter modern Latin alphabet is the newest of this group.

Kangxi Radicals is a Unicode block. In version 3.0 (1999), this separate Kangxi Radicals block was introduced which encodes the 214 radicals in sequence, at U+2F00–2FD5. These are specific code points intended to represent the radical qua radical, as opposed to the character consisting of the unaugmented radical; thus, U+2F00 represents radical 1 while U+4E00 represents the character meaning "one". In addition, the CJK Radicals Supplement block (2E80–2EFF) was introduced, encoding alternative forms taken by Kangxi radicals as they appear within specific characters. For example, ⺁ "CJK RADICAL CLIFF" (U+2E81) is a variant of ⼚ radical 27 (U+2F1A), itself identical in shape to the character consisting of unaugmented radical 27, 厂 "cliff" (U+5382).

Tamil All Character Encoding (TACE16) is a scheme for encoding the Tamil script in the Private Use Area of Unicode, implementing a syllabary-based character model differing from the modified-ISCII model used by Unicode's existing Tamil implementation.

Pinyin alphabetical order, or Pinyin order in short, is a sound-based Chinese character sorting method which has been used for arrangement of entries in Xinhua Dictionary, Xiandai Hanyu Cidian, Oxford Chinese Dictionary and many other modern dictionaries. In this method, Chinese characters are arranged according to the order of the Latin alphabet adopted in "Chinese Pinyin Scheme".

Chinese character order, or Chinese character indexing, Chinese character collation and Chinese character sorting, is the way in which a Chinese character set is sorted into a sequence for the convenience of information retrieval. It may also refer to the sequence so produced. English dictionaries and indexes are normally arranged in alphabetical order for quick lookup, but Chinese is written in tens of thousands of different characters, not just dozens of letters in an alphabet, and that makes the sorting job much more challenging.

<span class="mw-page-title-main">Chinese character strokes</span> Smallest writing units of Chinese characters

Strokes are the smallest structural units making up written Chinese characters. In the act of writing, a stroke is defined as a movement of a writing instrument on a writing material surface, or the trace left on the surface from a discrete application of the writing implement. The modern sense of discretized strokes first came into being with the clerical script during the Han dynasty. In the regular script that emerged during the Tang dynasty—the most recent major style, highly studied for its aesthetics in East Asian calligraphy—individual strokes are discrete and highly regularized. By contrast, the ancient seal script has line terminals within characters that are often unclear, making them nontrivial to count.

References

  1. Abu-Haidar, J. A. (1983). "Review of A Dictionary of Modern Written Arabic (Arabic-English)". Bulletin of the School of Oriental and African Studies, University of London. 46 (2): 351–353. ISSN   0041-977X. JSTOR   615409.
  2. "Hans Wehr Arabic-English Dictionary". ejtaal.net. Retrieved 2023-06-04.
  3. 1 2 M Programming: A Comprehensive Guide, Richard F. Walters, Digital Press, 1997