Arabic letter frequency



The frequency of letters in text has often been studied for use in cryptanalysis, and in frequency analysis in particular.

No language has an exact letter frequency distribution, as all writers write slightly differently. As a rule, texts in different languages that use the Arabic script (e.g. Arabic, Ottoman Turkish, Persian and Urdu) have different letter frequencies, most obviously for letters that are used only in some languages (e.g. the Persian letters پ, چ, گ, which are not used to write Arabic).
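As a rough illustration of the point about language-specific letters, the sketch below reports which of the three Persian letters named above occur in a piece of text. The function name is hypothetical, and treating the presence of these letters as a signal is an assumption made for illustration; it is not a reliable language identifier.

```python
# Letters named in the text as used in Persian but not in Arabic.
PERSIAN_ONLY = {"پ", "چ", "گ"}

def persian_only_letters(text: str) -> set[str]:
    """Return the Persian-specific letters that occur in the text."""
    return PERSIAN_ONLY & set(text)
```

An empty result does not prove that a text is Arabic; it only means that none of these three letters appear in it.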

Methods encoding the most frequent letters with the shortest symbols were pioneered by telegraph codes, and are used in modern data-compression techniques such as Huffman coding.
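The idea of giving frequent letters shorter codes can be sketched with a small Huffman construction. The example below computes only the code lengths, not the bit patterns, and uses six letters with frequencies taken from the table in this article. The function name and the choice of sample letters are assumptions for illustration.

```python
import heapq

def huffman_code_lengths(freqs: dict[str, float]) -> dict[str, int]:
    """Return the Huffman code length (in bits) for each symbol."""
    # Heap entries: (total weight, unique tie-breaker, symbols in this subtree).
    heap = [(w, i, [s]) for i, (s, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    lengths = {s: 0 for s in freqs}
    counter = len(heap)
    while len(heap) > 1:
        w1, _, syms1 = heapq.heappop(heap)
        w2, _, syms2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        for s in syms1 + syms2:
            lengths[s] += 1
        heapq.heappush(heap, (w1 + w2, counter, syms1 + syms2))
        counter += 1
    return lengths

# A handful of values (percent) taken from the frequency table in this article.
sample = {"ا": 12.50, "ل": 12.07, "ن": 6.61, "م": 6.52, "ظ": 0.18, "ؤ": 0.09}
lengths = huffman_code_lengths(sample)
```

With this sample, the common letters ا and ل receive shorter codes than the rare letters ظ and ؤ. A code built over all 36 letters would generally assign different lengths.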

Arabic letters

The Arabic alphabet consists of 28 primary letters, which are letters 1 to 28 in Table 1. The eight modified letters in positions 29 to 36 of the same table are used in the same way as the primary letters. If these eight modified forms are folded into the primary list based on shape or phonetic similarity, the result is as shown in Table 2. For accurate frequency analysis, the frequency of each of the 36 letters in Table 1 is counted independently.
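The two counting schemes described above can be sketched as follows: one function counts each of the 36 letters on its own, and another folds the eight modified letters into primary letters. The `FOLD` mapping is an assumption based on common shape and phonetic groupings; the grouping actually used in Table 2 may differ, and the function names are hypothetical.

```python
from collections import Counter

# The 28 primary letters and the 8 modified letters.
PRIMARY = "ا ب ت ث ج ح خ د ذ ر ز س ش ص ض ط ظ ع غ ف ق ك ل م ن ه و ي".split()
MODIFIED = "ء آ أ إ ؤ ئ ة ى".split()

# ASSUMPTION: one plausible folding; Table 2 may group letters differently.
FOLD = {
    "ء": "ا", "آ": "ا", "أ": "ا", "إ": "ا",
    "ؤ": "و", "ئ": "ي", "ى": "ي", "ة": "ه",
}

def count_letters(text: str) -> Counter:
    """Count each of the 36 letters independently; ignore everything else."""
    letters = set(PRIMARY + MODIFIED)
    return Counter(ch for ch in text if ch in letters)

def fold_counts(counts: Counter) -> Counter:
    """Merge the modified letters into their assumed primary forms."""
    folded = Counter()
    for letter, n in counts.items():
        folded[FOLD.get(letter, letter)] += n
    return folded
```

Counting first and folding afterwards keeps the 36-letter counts available, so a different folding can be applied later without rescanning the text.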

The ordering of the alphabet shown in the tables differs from the ordering used by the Unicode standard and is claimed to be more logical.

Figure 1: Arabic characters that can be produced using the Arabic Letter Keyboard Intellark.
Table 1: The Arabic alphabet. Letters 1 to 28 are the primary letters. Letters 29 to 36 are the modified letters.
Table 2: The Arabic alphabet, with modified letters lumped onto their primary forms.
Letter frequency distribution for the counted letters: Histogram data sorted on frequency.

Although the full set of Arabic characters includes about ten diacritics, as shown in Figure 1, frequency analysis of Arabic characters is concerned only with the frequency of the alphabet letters shown in Table 2.
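Since diacritics are excluded from the count, vocalized text would need them removed before letters are tallied. A minimal sketch, assuming the diacritics in question are Unicode combining marks (general category `Mn`), which covers the common Arabic harakat; the function name is hypothetical.

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Drop combining marks (Unicode category Mn), which include the Arabic harakat."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
```

No Unicode normalization is applied here, so precomposed letters such as أ and ؤ stay intact and can still be counted as separate letters.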

Arabic letter frequency using general sources

The Arabic sources listed under References were used to assemble a corpus large enough to compute frequency statistics.

Collectively, these sources add up to 3,378 pages, with 1,297,259 words, and 5,122,132 letters.
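The corpus totals give a feel for its scale. The short calculation below derives two averages from the figures quoted above; the variable names are illustrative only.

```python
# Corpus totals quoted in the text.
pages, words, letters = 3_378, 1_297_259, 5_122_132

letters_per_word = letters / words   # average counted letters per word
words_per_page = words / pages       # average words per page

print(f"{letters_per_word:.2f} letters per word")
print(f"{words_per_page:.0f} words per page")
```

This works out to a little under four counted letters per word and roughly 384 words per page.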

The following graph shows the letter frequency distribution for the counted letters.

Letter    Relative frequency in the Arabic language
ء    0.31%
ؤ    0.09%
ئ    0.28%
ا    12.50%
آ    0.15%
أ    2.89%
إ    1.00%
ب    4.67%
ة    1.42%
ت    2.61%
ث    0.87%
ج    1.23%
ح    1.86%
خ    0.79%
د    2.67%
ذ    0.96%
ر    4.20%
ز    0.52%
س    2.47%
ش    0.73%
ص    1.04%
ض    0.44%
ط    0.50%
ظ    0.18%
ع    4.01%
غ    0.33%
ف    2.84%
ق    2.69%
ك    2.04%
ل    12.07%
م    6.52%
ن    6.61%
ه    5.08%
و    5.80%
ى    1.29%
ي    6.36%
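For use in frequency analysis, the values above can be held in a dictionary and ranked. The sketch below copies the 36 percentages from the table, checks that they sum to about 100, and picks out the most frequent letters. The names `FREQ` and `top_letters` are hypothetical.

```python
# Relative frequencies (percent) copied from the table above.
FREQ = {
    "ء": 0.31, "ؤ": 0.09, "ئ": 0.28, "ا": 12.50, "آ": 0.15, "أ": 2.89,
    "إ": 1.00, "ب": 4.67, "ة": 1.42, "ت": 2.61, "ث": 0.87, "ج": 1.23,
    "ح": 1.86, "خ": 0.79, "د": 2.67, "ذ": 0.96, "ر": 4.20, "ز": 0.52,
    "س": 2.47, "ش": 0.73, "ص": 1.04, "ض": 0.44, "ط": 0.50, "ظ": 0.18,
    "ع": 4.01, "غ": 0.33, "ف": 2.84, "ق": 2.69, "ك": 2.04, "ل": 12.07,
    "م": 6.52, "ن": 6.61, "ه": 5.08, "و": 5.80, "ى": 1.29, "ي": 6.36,
}

def top_letters(freq: dict[str, float], k: int) -> list[str]:
    """Return the k most frequent letters, highest first."""
    return sorted(freq, key=freq.get, reverse=True)[:k]

total = sum(FREQ.values())                  # rounding leaves this slightly off 100
top5 = top_letters(FREQ, 5)
top5_share = sum(FREQ[ch] for ch in top5)   # combined share of the five leaders
```

The five most frequent letters (ا, ل, ن, م, ي) together account for about 44% of all counted letters, which is why a simple substitution cipher over Arabic text leaks so much through its most common symbols.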
 


References

  1. Ibn Kathir, Ismail (c. 1300). The Beginning and the End (in Arabic). Retrieved 23 January 2011.
  2. Almubarakfuri, Safiyyurrahman (2002). The Sealed Nectar (in Arabic). Darussalam Publications. ISBN 978-1591440710. Retrieved 24 January 2011.
  3. Ash-shuri, Majdi (c. 1900). Masterpiece of the Bride (in Arabic). Retrieved 24 January 2011.