Buckwalter transliteration

The Buckwalter Arabic transliteration was developed as part of the ALPNET Arabic Project, which Ken Beesley was running in 1988.

Start

The first Arabic language analyst for the project was Derek Foxley, a BYU undergraduate hired part-time. Foxley was taking fourth-year Arabic courses at BYU at the time. [1] Tim Buckwalter was hired several months later as a full-time employee of ALPNET; he was also a PhD candidate in Arabic at the time. One of his tasks on the project was to collaborate with Foxley, the part-time employee, and assign him Arabic language tasks.

Beesley mentored Buckwalter and Foxley in some of the finer details of linguistics, and one day at the whiteboard he prodded the two of them to come up with a transliteration schema on the spot. Foxley had been entering most of the data up to that point in the project, so he was ready to address this. In close collaboration with Buckwalter, he came up with nearly all the characters used in the transliteration table. Buckwalter oversaw Foxley's Arabic tasks and made the final adjustments and refinements to the table. The schema had no name at the time. In the years following the project, however, Buckwalter entered thousands of textual items using it, and presented and championed it many times. It was therefore named after him.

At the time, no such one-for-one letter transliteration was in use, or at least none that the team was aware of.

Beesley later moved to Xerox, which bought the rights to the ALPNET data in the 1990s. This is documented in several other articles that Beesley has presented over the years.

Commentary on the system

The Buckwalter Transliteration is an ASCII-only transliteration scheme, representing Arabic orthography strictly one-to-one, unlike the more common romanization schemes that add morphological information not expressed in Arabic script. Thus, for example, a wāw will be transliterated as w regardless of whether it is realized as a vowel /uː/ or a consonant /w/. Only when the wāw is modified by a hamza (ؤ) does the transliteration change to &. This allows the user to type or convert text exactly as it is seen.

However, the transliteration schema has drawn some criticism. Some users state that the unmodified letters are straightforward to read (except for * = dhaal, E = ayin and v = thaa), but that the transliterations of letters with diacritics and of the harakat take some time to get used to. For example, the nunated endings -un, -an, -in appear as N, F, K, the sukūn ("no vowel") appears as o, and taʾ marbūṭah (ة) is p. The difficulty probably arises because the Buckwalter transliteration is usually used or presented without the rationale behind the letters. Though those particular letters seem random, they are in fact mnemonically linked to the original letters.

Furthermore, since the original Buckwalter scheme was developed, several variants have emerged, not all of them standardized. The original scheme uses characters that are reserved in XML, so "XML safe" versions often modify the characters < > & (إ, أ and ؤ respectively; Buckwalter suggests transliterating them as I O W, respectively). Completely "safe" transliteration schemes replace all non-alphanumeric characters (such as $';*) with alphanumeric characters. [2]
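The XML-safe substitution can be sketched as a three-character swap applied to text that is already in Buckwalter form. This is a minimal illustration rather than a standard implementation: the function names `to_xml_safe` and `from_xml_safe` are hypothetical, and the sketch assumes the input follows the original table, in which upper case I, O and W are otherwise unused.

```python
# Hypothetical helpers: swap the three XML-reserved Buckwalter characters
# for the I / O / W replacements mentioned in the text.
XML_SAFE = str.maketrans({"<": "I", ">": "O", "&": "W"})
XML_UNSAFE = str.maketrans({"I": "<", "O": ">", "W": "&"})


def to_xml_safe(bw_text):
    """Replace < > & in Buckwalter text with I O W."""
    return bw_text.translate(XML_SAFE)


def from_xml_safe(safe_text):
    """Undo to_xml_safe (valid only if I, O, W had no other use in the input)."""
    return safe_text.translate(XML_UNSAFE)


print(to_xml_safe(">ano"))        # -> Oano
print(to_xml_safe("{lo<ixaA'i"))  # -> {loIixaA'i
```

Because I, O and W do not occur in the original table, the swap is reversible for pure Buckwalter text; embedded Latin text containing those letters would not survive the reverse step.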

When transliterating Arabic text, several other issues may arise. First, some Arabic characters are not specified in the transliteration table, including non-alphabetic characters such as ۞ and ۝, punctuation such as ؛ ؟, and Eastern Arabic numerals. Similarly, Arabic text sometimes borrows letters from Persian that are not part of the Arabic alphabet, some of which are defined in the full Buckwalter table. [3] Symbols that are not defined in the transliteration table may be deleted, kept as non-Latin symbols embedded in transliterated text, or transliterated into different (non-conflicting) Latin symbols. (For instance, it is straightforward to convert Eastern Arabic, or "Hindi", numerals to the Western Arabic numerals 0-9.) Another issue is how to handle Arabic text with embedded ASCII text; for instance, an Arabic sentence that refers to "IBM" or that includes a quote in English. If the Latin text is not explicitly marked, it is a challenge to distinguish transliterated Arabic from Latin. If transliterated text with embedded Latin is later transliterated back to Arabic, the Latin text will be turned into garbage Arabic. Finally, another important decision is how much normalization of the Arabic text should be done during transliteration. This may include removing kashida, removing short vowels and/or other diacritics, and/or normalizing spelling. [2]
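Two of the choices described above, converting Eastern Arabic numerals and deciding how much to normalize, can be sketched as a small pre-processing step. This is only an illustration under stated assumptions: the function name `normalize` and its flags are hypothetical, and "diacritics" is assumed here to mean the Unicode range U+064B to U+0652 (tanwin, short vowels, shadda, sukun) plus the dagger alif U+0670. Spelling normalization is not attempted.

```python
import re

# Eastern Arabic ("Hindi") digits U+0660..U+0669 mapped to ASCII 0..9.
DIGITS = {0x0660 + i: str(i) for i in range(10)}
KASHIDA = "\u0640"  # tatwil / kashida
# Assumed diacritic set: U+064B..U+0652 plus dagger alif U+0670.
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def normalize(text, strip_diacritics=True, strip_kashida=True):
    """Apply optional normalizations before transliteration."""
    text = text.translate(DIGITS)
    if strip_kashida:
        text = text.replace(KASHIDA, "")
    if strip_diacritics:
        text = DIACRITICS.sub("", text)
    return text


print(normalize("\u0661\u0669\u0668\u0668"))  # -> 1988
```

Each step discards information, so text normalized this way can no longer be transliterated back to the exact original.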

On the other hand, the usual punctuation marks and symbols (- !@#%?.,;:()[]+=) were not used as transliteration characters because they also occur in Arabic text. Thus, if the English name IBM appeared in an Arabic text, the original concept was to mark it by putting doubled double quotes around it: ""IBM"". This mechanism lets automatic language processing leave non-Arabic text as is, unprocessed, whenever it sees the doubled quotes. Originally < > & were not used either, especially < and >, which are quotation marks borrowed from French and occasionally appear in Arabic text. They were added later out of necessity. Their XML-safe replacements keep to the mnemonic approach (discussed below), in that I, O and W correspond, if imprecisely, to the sounds of the letters they stand for.
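The doubled-quote convention can be sketched as a split on the marked spans, with the converter applied only to what lies outside them. This is an assumption-laden illustration, not the original ALPNET code: the name `transliterate_marked` is hypothetical, marked spans are assumed not to contain a double quote themselves, and the demo converter covers only the four letters of the example word.

```python
import re

# A span wrapped in doubled double quotes, e.g. ""IBM"", is left untouched.
PROTECTED = re.compile(r'(""[^"]*"")')


def transliterate_marked(text, convert):
    """Apply convert() to everything except ""...""-marked spans."""
    parts = PROTECTED.split(text)
    # With one capture group, re.split places the marked spans at odd indexes.
    return "".join(p if i % 2 else convert(p) for i, p in enumerate(parts))


# Stand-in converter for the demo: only shin, ra, kaf and ta marbuta.
DEMO_TABLE = str.maketrans(
    {"\u0634": "$", "\u0631": "r", "\u0643": "k", "\u0629": "p"}
)


def demo_convert(segment):
    return segment.translate(DEMO_TABLE)


print(transliterate_marked('\u0634\u0631\u0643\u0629 ""IBM""', demo_convert))
# -> $rkp ""IBM""
```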

Key concepts in development of the table

Three key concepts were used in the transliteration schema:

The first was that each Arabic letter (sound) can only correspond to one English language character. Some Arabic letters produce a sound that takes two English letters to write, so a single letter or common symbol had to be used for them.

The second concept was to use the familiar if possible. If an Arabic letter had always been associated with the letter “s” in English, for example, then it would be easier to remember if it could be kept that way.

The third key concept was that the table had to be fully and easily mnemonic. Therefore, every item correlates, in the following order of preference, (a) to the sound of the Arabic letter, (b) to a physical aspect of the original Arabic letter, or (c) to the name it is called.

Mechanics

Lower case letters were preferred. However, when several Arabic letters have similar sounds, the lower case letter was used for the more open sound and an upper case letter for the more close or restricted sound. For example, Arabic has two letters similar to the English [d] sound. The plain sound was given a lower case “d” and the emphatic sound [dˤ] was assigned an upper case “D”.

In other words, an upper case letter indicates that the letter is similar to a lower case letter – but has a qualitative difference in some way.

Buckwalter transliteration table

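Arabic | DIN 31635 | Buckwalter | Qalam | BATR | IPA (MSA)
ا | ʾ / ā | A | ' / aa | A / aa | ʔ, aː
ب | b | b | b | b | b
ت | t | t | t | t | t
ث | ṯ | v | th | c | θ
ج | ǧ | j | j | j | dʒ ~ ɡ ~ ʒ
ح | ḥ | H | H | H | ħ
خ | ḫ | x | kh | K | x
د | d | d | d | d | d
ذ | ḏ | * | dh | z' | ð
ر | r | r | r | r | r
ز | z | z | z | z | z
س | s | s | s | s | s
ش | š | $ | sh | x | ʃ
ص | ṣ | S | S | S | sˤ
ض | ḍ | D | D | D | dˤ
ط | ṭ | T | T | T | tˤ
ظ | ẓ | Z | Z | Z | ðˤ ~ zˤ
ع | ʿ | E | ` | E | ʕ
غ | ġ | g | gh | g | ɣ
ف | f | f | f | f | f
ق | q | q | q | q | q
ك | k | k | k | k | k
ل | l | l | l | l | l
م | m | m | m | m | m
ن | n | n | n | n | n
ه | h | h | h | h | h
و | w / ū | w | w | w / uu | w, uː
ي / ی [4] | y / ī | y / Y | y | y / ii | j, iː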
hamza
  • lone hamza: '
  • hamza on alif: >
  • hamza below alif: <
  • hamza on waw: &
  • hamza on ya: }
alif
  • madda on alif: |
  • alif al-wasla: {
  • dagger alif: `
  • alif maqsura: Y
harakat
  • fatha: a
  • damma: u
  • kasra: i
  • fathatayn: F
  • dammatayn: N
  • kasratayn: K
  • shadda: ~
  • sukun: o
ta marbouta: p
tatwil: _

Mnemonics

ا A: This letter is usually pronounced [aː]. It is not a lower case “a” because that would conflict with the fatha diacritical mark, which is pronounced shorter, [a].
ب b: Pronounced [b].
ة p: This is the tah marbutah, and a “p” looks very similar to the way it is written when connected to a preceding letter.
ت t: Pronounced [t].
ث v: Pronounced [θ]. It has 3 dots above it that, when written, look like an upside down “v”; therefore, a “v” was used.
ج j: Pronounced [dʒ].
ح H: This letter is pronounced [ħ] and it conflicts with the [h] sound of another letter, so an upper case “H” is used.
خ x: Pronounced [x].
د d: Pronounced [d].
ذ *: Pronounced [ð]. It has a dot above it, so the single asterisk, which looks similar to a dot above the line, was used.
ر r: Pronounced [r].
ز z: Pronounced [z].
س s: Pronounced [s].
ش $: Pronounced [ʃ]. The dollar sign was used because it looks like “s” but also has an extra property, a line through it. Upper case “S” could not be used because it is used for another letter.
ص S: Pronounced [sˤ].
ض D: Pronounced [dˤ].
ط T: Pronounced [tˤ].
ظ Z: Pronounced [ðˤ~zˤ].
ع E: Pronounced [ʕ], a sound not found in English, so a purely visual mnemonic was used: this letter and the letter E look similar.
غ g: Pronounced [ɣ~ʁ], sounds not found in English. It has often been written as “gh”, so the "g" was kept, and a visual mnemonic was used as well: it has a similar appearance to the lower case letter “g”.
ف f: Pronounced [f].
ق q: Pronounced [q].
ك k: Pronounced [k].
ل l: Pronounced [l].
م m: Pronounced [m].
ن n: Pronounced [n].
ه h: Pronounced [h].
و w: Usually pronounced [w].
ی Y: Pronounced [aː]. A visual mnemonic was used: it looks like the next letter but has no dots underneath.
ي y: Pronounced [j].
ً F: Pronounced [an]. In Arabic this is called the fathatan, the dual fatha. Upper case “F” is used because the lower case is already used.
ٌ N: Pronounced [un]. Lower case “n” is already used, and for consistency with “F” for nunated [an], upper case “N” is used.
ٍ K: Pronounced [in]. This is the kasratan, the nunated kasra. Lower case “k” is already used, and for consistency with “F” for nunated [an], upper case “K” is used.
َ a: Pronounced [a].
ُ u: Pronounced [u].
ِ i: Pronounced [i].
ّ ~: This is the shadda, which marks gemination of the consonant it sits above. The tilde is also a marking that sits above a letter and is found on most English keyboards. It is a "physical mnemonic".
ْ o: This is the “sukun”, which indicates that there is no vowel sound on that letter. A close visual mnemonic with lower case “o” was used.

The original ALPNET team quickly adopted this schema. Even though Beesley had no background in Arabic, he was quickly able to understand and use it. The strength of the Buckwalter transliteration is that every Arabic letter is represented distinctly, while its reliance on traditional transliterations, and on mnemonic devices for anything non-traditional, makes it very easy to learn.

Sample

The first article of The Universal Declaration of Human Rights:

Arabic text

يُولَدُ جَمِيعُ ٱلنَّاسِ أَحْرَارًا مُتَسَاوِينَ فِي ٱلْكَرَامَةِ وَٱلْحُقُوقِ. وَقَدْ وُهِبُوا عَقْلًا وَضَمِيرًا وَعَلَيْهِمْ أَنْ يُعَامِلَ بَعْضُهُمْ بَعْضًا بِرُوحِ ٱلْإِخَاءِ. [5]

Buckwalter transliteration

yuwladu jamiyEu {ln~aAsi >aHoraArFA mutasaAwiyna fiy {lokaraAmapi wa{loHuquwqi. waqado wuhibuwA EaqolFA waDamiyrFA waEalayohimo >ano yuEaAmila baEoDuhumo baEoDFA biruwHi {lo<ixaA'i.

DIN 31635

Yūladu ǧamīʿu n-nāsi ʾaḥrāran mutasāwīna fī l-karāmati wa-l-ḥuqūq. Wa-qad wuhibū ʿaqlan wa-ḍamīran wa-ʿalayhim ʾan yuʿāmila baʿḍuhum baʿḍan bi-rūḥi l-ʾiḫāʾi.

English text

All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood. [6]

Notes

  1. See the first page of one of the first presentations given by Dr. Beesley, in 1989 at the University of Utah; the footnotes list the contributors in order up to that point.
  2. For a complete description of different Buckwalter schemes as well as a more detailed discussion of the trade-offs between different schemes, see Habash, Nizar. Introduction to Arabic Natural Language Processing. Morgan & Claypool, 2010.
  3. Buckwalter, Tim. Buckwalter Arabic Transliteration Table.
  4. In Egypt, Sudan and sometimes other regions, the final form is sometimes ی (without dots).
  5. "Universal Declaration of Human Rights - Arabic (Alarabia)". ohchr.org. OHCHR. 2016. Retrieved October 22, 2016.
  6. "Universal Declaration of Human Rights - English". ohchr.org. OHCHR. 2016. Retrieved October 22, 2016.
