Arabic letter frequency

The frequency of letters in text has often been studied for use in cryptanalysis, and frequency analysis in particular.

No language has an exact letter frequency distribution, as all writers write slightly differently. As a rule, texts in different languages that use the Arabic script (e.g. Arabic, Kurdish, Malay, Persian and Urdu) will have different letter frequencies, most obviously for letters that are used only in some languages (e.g. the letters ڤ, پ, چ, گ, or ڨ, which are not part of Standard Arabic).

Methods encoding the most frequent letters with the shortest symbols were pioneered by telegraph codes, and are used in modern data-compression techniques such as Huffman coding.
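As a rough illustration of how frequent letters end up with short codes, the following Python sketch builds Huffman code lengths for a handful of letters. The weights are taken from the frequency table in this article; the function name huffman_code_lengths and the choice of six sample letters are assumptions made for the example, not part of any cited method.

```python
import heapq
from itertools import count


def huffman_code_lengths(weights):
    """Return {symbol: code length in bits} for a Huffman code over `weights`."""
    if len(weights) == 1:
        # A single symbol still needs one bit to be written down.
        return {sym: 1 for sym in weights}
    tiebreak = count()  # unique counter so the dicts are never compared
    heap = [(w, next(tiebreak), {sym: 0}) for sym, w in weights.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees; every symbol inside moves one level deeper.
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        merged = {sym: depth + 1 for sym, depth in {**a, **b}.items()}
        heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))
    return heap[0][2]


# Six sample letters with their percentages from the frequency table.
sample = {"ا": 12.50, "ل": 12.07, "ن": 6.61, "ب": 4.67, "ظ": 0.18, "ؤ": 0.09}
lengths = huffman_code_lengths(sample)
for letter, bits in sorted(lengths.items(), key=lambda kv: kv[1]):
    print(letter, bits)
```

In this small sample the most frequent letter receives a 1-bit code and the two rarest letters receive 5-bit codes. A code built over all 36 letters would assign different lengths.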

Arabic letters

The Arabic alphabet consists of 28 primary letters, numbered 1 to 28 in Table 1. The eight modified letters in positions 29 to 36 of the same table are used in the same way as the primary letters. If these eight modified forms are folded into the primary list on the basis of shape or phonetic similarity, the result is Table 2. For accurate frequency analysis, each of the 36 letters of Table 1 is counted independently.
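The two counting schemes can be sketched in Python as follows. The 36 letters are the ones that appear in the frequency table of this article. The folding map ASSUMED_FOLD is hypothetical: Table 2 is an image whose contents are not reproduced in the text, so the pairings below are only one plausible choice based on shape, and the function names are likewise illustrative.

```python
from collections import Counter

# The 36 letters as they appear in this article's frequency table.
LETTERS_36 = "ءؤئاآأإبةتثجحخدذرزسشصضطظعغفقكلمنهوىي"

# ASSUMPTION: a hypothetical folding of the eight modified letters onto
# primary letters. The article's Table 2 may pair them differently.
ASSUMED_FOLD = {
    "آ": "ا", "أ": "ا", "إ": "ا", "ء": "ا",
    "ؤ": "و", "ئ": "ي", "ى": "ي", "ة": "ه",
}


def count_letters(text, alphabet=LETTERS_36):
    """Count each letter of `alphabet` independently; other characters are skipped."""
    allowed = set(alphabet)
    return Counter(ch for ch in text if ch in allowed)


def fold_counts(counts, fold=ASSUMED_FOLD):
    """Add the count of each modified letter to the primary letter it maps to."""
    folded = Counter()
    for letter, n in counts.items():
        folded[fold.get(letter, letter)] += n
    return folded


counts = count_letters("أكل الولد التفاحة")
folded = fold_counts(counts)
print(counts["ا"], counts["أ"], folded["ا"])
```

Counting the 36 letters first and folding afterwards keeps the detailed counts available, so a different folding can be tried without re-reading the text.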

The ordering of the alphabet shown in the tables differs from the one used by the Unicode standard and is arguably more logical.

Figure 1: Arabic characters that can be produced using the Arabic Letter Keyboard Intellark.

Table 1: The Arabic alphabet. Letters 1 to 28 are the primary letters. Letters 29 to 36 are the modified letters.

Table 2: The Arabic alphabet, with modified letters lumped onto their primary forms.

Letter frequency distribution for the counted letters: Histogram data sorted on frequency.

Although the full set of Arabic characters includes about ten diacritics, as shown in Figure 1, frequency analysis of Arabic characters is concerned only with the frequency of the alphabet letters shown in Table 2.
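One way to set the diacritics aside before counting is to drop Unicode nonspacing marks, as in the minimal Python sketch below. This is an assumption about how such filtering could be done, not a description of the method used for the statistics in this article, and the function name strip_diacritics is illustrative. The text is deliberately not normalized first, because canonical decomposition would split letters such as آ into a base letter plus a mark.

```python
import unicodedata


def strip_diacritics(text):
    """Drop nonspacing marks (category Mn), which include the Arabic vowel marks."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


# "kataba" written with a fatha (U+064E) after each of its three letters.
vocalized = "\u0643\u064E\u062A\u064E\u0628\u064E"
plain = strip_diacritics(vocalized)
print(len(vocalized), len(plain))
```

Here the six code points of the vocalized word reduce to its three letters.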

Arabic letter frequency using general sources

The Arabic sources listed under References were used to gather enough text for meaningful frequency statistics.

Collectively, these sources add up to 3,378 pages, with 1,297,259 words, and 5,122,132 letters.
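The corpus totals above imply some simple averages, worked out in the short Python sketch below. The variable names are illustrative. The averages are derived here from the three stated totals and are not figures reported by the sources.

```python
# Corpus totals as stated in the text.
pages = 3_378
words = 1_297_259
letters = 5_122_132

letters_per_word = letters / words
words_per_page = words / pages
letters_per_page = letters / pages

print(f"{letters_per_word:.2f} letters per word")
print(f"{words_per_page:.0f} words per page")
print(f"{letters_per_page:.0f} letters per page")
```

This gives roughly 3.9 letters per word and about 384 words per page.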

The following table shows the relative frequency of each counted letter.

Letter    Relative frequency in the Arabic language
ء         0.31%
ؤ         0.09%
ئ         0.28%
ا         12.50%
آ         0.15%
أ         2.89%
إ         1.00%
ب         4.67%
ة         1.42%
ت         2.61%
ث         0.87%
ج         1.23%
ح         1.86%
خ         0.79%
د         2.67%
ذ         0.96%
ر         4.20%
ز         0.52%
س         2.47%
ش         0.73%
ص         1.04%
ض         0.44%
ط         0.50%
ظ         0.18%
ع         4.01%
غ         0.33%
ف         2.84%
ق         2.69%
ك         2.04%
ل         12.07%
م         6.52%
ن         6.61%
ه         5.08%
و         5.80%
ى         1.29%
ي         6.36%
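To work with these figures programmatically, the table can be loaded into a dictionary, checked, and sorted by frequency, which corresponds to the sorted histogram mentioned earlier. The Python sketch below is only an illustration: the name FREQ_PERCENT and the derived values are assumptions of the example, while the percentages are copied from the table.

```python
# Percentages copied from the frequency table in this article.
FREQ_PERCENT = {
    "ء": 0.31, "ؤ": 0.09, "ئ": 0.28, "ا": 12.50, "آ": 0.15, "أ": 2.89,
    "إ": 1.00, "ب": 4.67, "ة": 1.42, "ت": 2.61, "ث": 0.87, "ج": 1.23,
    "ح": 1.86, "خ": 0.79, "د": 2.67, "ذ": 0.96, "ر": 4.20, "ز": 0.52,
    "س": 2.47, "ش": 0.73, "ص": 1.04, "ض": 0.44, "ط": 0.50, "ظ": 0.18,
    "ع": 4.01, "غ": 0.33, "ف": 2.84, "ق": 2.69, "ك": 2.04, "ل": 12.07,
    "م": 6.52, "ن": 6.61, "ه": 5.08, "و": 5.80, "ى": 1.29, "ي": 6.36,
}

# Sanity check: rounded percentages should add up to about 100.
total = sum(FREQ_PERCENT.values())

# Rank the letters from most to least frequent.
ranked = sorted(FREQ_PERCENT.items(), key=lambda kv: kv[1], reverse=True)
top5 = [letter for letter, _ in ranked[:5]]
top2_share = ranked[0][1] + ranked[1][1]

print(f"sum of percentages: {total:.2f}")
print("five most frequent letters:", " ".join(top5))
print(f"share of the two most frequent letters: {top2_share:.2f}%")
```

The percentages sum to slightly more than 100 because each entry is rounded to two decimal places. The two most frequent letters, ا and ل, together account for nearly a quarter of all counted letters.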
 

References

  1. Ibn Kathir, Ismail (c. 1300). The Beginning and the End (in Arabic). Retrieved 23 January 2011.
  2. Almubarakfuri, Safiyyurrahman (2002). The Sealed Nectar (in Arabic). Darussalam Publications. ISBN 978-1591440710. Retrieved 24 January 2011.
  3. Ash-shuri, Majdi (c. 1900). Masterpiece of the Bride (in Arabic). Retrieved 24 January 2011.