Moby Project

The Moby Project is a collection of public-domain lexical resources created by Grady Ward. Ward dedicated the resources to the public domain, and they are now mirrored at Project Gutenberg. As of 2007, the collection contained the largest free phonetic database, with 177,267 words and corresponding pronunciations. [1]

Hyphenator

The Moby Hyphenator II contains hyphenations of 187,175 words and phrases (including 9,752 entries where no hyphenations are given, such as through and avoir). The character encoding appears to be MacRoman, and hyphenation is indicated by a bullet (•, character value 165 decimal, or A5 hexadecimal). Some entries, however, have a combination of actual hyphens and character 165, such as "bar•ber-sur•geon".

There is little to no documentation of the hyphenation choices made; the following examples might give some flavour of the style of hyphenation used: at•mos•phere; at•tend•ant; ca•pac•i•ty; un•col•or•a•ble.
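As a minimal sketch of how such an entry might be decoded, the snippet below assumes the file really is Mac OS Roman (the text above only says it appears to be) and that byte 0xA5 marks each hyphenation point. The function name `split_hyphenation` is hypothetical and is not part of the Moby distribution.

```python
BULLET = "\u2022"  # the character that byte 0xA5 decodes to under Mac OS Roman


def split_hyphenation(raw: bytes):
    """Return (plain word, list of fragments) for one raw entry.

    Assumes Mac OS Roman encoding and a bullet at each hyphenation point.
    Real hyphens in the entry are left untouched.
    """
    text = raw.decode("mac_roman").rstrip("\r\n")
    plain = text.replace(BULLET, "")
    fragments = text.split(BULLET)
    return plain, fragments


if __name__ == "__main__":
    print(split_hyphenation(b"bar\xa5ber-sur\xa5geon"))
```

An entry with no hyphenation points, such as through, comes back as a single fragment, so a count of one-fragment results could be compared against the 9,752 figure quoted above.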

Languages

Moby Language II contains wordlists of five languages: French, German, Italian, Japanese, and Spanish. Their statistics are:

Language   Words     Size (in bytes)
French     138,257   1,524,757
German     159,809   2,055,986
Italian     60,453     561,981
Japanese   115,523     934,783
Spanish     86,059     850,523
Total      560,101   5,928,030

However, some of the lists are contaminated: for example, the Japanese list contains English words such as abnormal and non-words such as abcdefgh and m,./. The lists are also sorted inconsistently: the French list is in straight alphabetical order, while the German list gives the traditionally capitalized words in alphabetical order, followed by the traditionally lower-cased words in alphabetical order. The Italian list, by contrast, contains no capitalized words at all.

The lists do not use accented characters, so "e^tre" is how a user would look up the French word être ("to be").
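The lookup convention can be sketched as follows. Only the circumflex case ("e^tre") is confirmed by the text above; the idea that the mark follows its base letter is taken from that single example, and how the lists encode other diacritics is not stated, so the sketch leaves them untouched. The function name `to_moby_key` is hypothetical.

```python
import unicodedata

COMBINING_CIRCUMFLEX = "\u0302"


def to_moby_key(word: str) -> str:
    """Rewrite circumflexed letters as letter + '^', e.g. 'être' -> 'e^tre'.

    Assumption: only the circumflex convention is known from the example in
    the text; other combining marks are left as they are after decomposition.
    """
    decomposed = unicodedata.normalize("NFD", word)
    return decomposed.replace(COMBINING_CIRCUMFLEX, "^")


if __name__ == "__main__":
    print(to_moby_key("être"))
```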

Part-of-Speech

Moby Part-of-Speech contains 233,356 words fully described by part(s) of speech, listed in priority order. The format of the file is word\parts-of-speech, with the following parts of speech being identified:

Part-of-speech              Code
Noun                        N
Plural                      p
Noun phrase                 h
Verb (usually participle)   V
Transitive verb             t
Intransitive verb           i
Adjective                   A
Adverb                      v
Conjunction                 C
Preposition                 P
Interjection                !
Pronoun                     r
Definite article            D
Indefinite article          I
Nominative                  o
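The word\parts-of-speech format and the code table above suggest a parser along the following lines. This is a sketch under stated assumptions: the delimiter is taken to be a backslash, as written in the format description, and the sample line in the usage example is invented for illustration rather than copied from the file. The names `POS_CODES` and `parse_pos_line` are hypothetical.

```python
POS_CODES = {
    "N": "Noun", "p": "Plural", "h": "Noun phrase",
    "V": "Verb (usually participle)", "t": "Transitive verb",
    "i": "Intransitive verb", "A": "Adjective", "v": "Adverb",
    "C": "Conjunction", "P": "Preposition", "!": "Interjection",
    "r": "Pronoun", "D": "Definite article", "I": "Indefinite article",
    "o": "Nominative",
}


def parse_pos_line(line: str, delimiter: str = "\\"):
    """Split 'word<delimiter>codes' and expand each one-letter code.

    The codes are kept in file order, which the text says is priority order.
    """
    word, sep, codes = line.rstrip("\r\n").rpartition(delimiter)
    if not sep:
        raise ValueError(f"no delimiter found in line: {line!r}")
    return word, [POS_CODES.get(code, f"unknown ({code})") for code in codes]


if __name__ == "__main__":
    # Invented sample line, for illustration only.
    print(parse_pos_line("run\\NVti"))
```

The delimiter is a parameter so that the same function can be reused if a given copy of the file turns out to use a different separator character.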

Pronunciator

The Moby Pronunciator II contains 177,267 entries with corresponding pronunciations. Most of the entries describe a single word, but approximately 79,000 [2] contain hyphenated or multiple-word phrases, names, or lexemes. The Project Gutenberg distribution also contains a copy of cmudict v0.3. The file contains lines of the format word[/part-of-speech] pronunciation. Each line ends with the ASCII carriage return character (CR, '\r', 0x0D, 13 in decimal).

The word field can include apostrophes (e.g. isn't), hyphens (e.g. able-bodied), and multiple words separated by underscores (e.g. monkey_wrench). Non-English words are generally rendered, as stated in the documentation, without accents or other diacritical marks. However, in 36 entries (e.g. São_Miguel), some non-ASCII accented characters remain, represented using Mac OS Roman encoding.

The part-of-speech field is used to disambiguate 770 of the words which have differing pronunciations depending on their part of speech. For example, for the words spelled close, the verb has the pronunciation /ˈkloʊz/, whereas the adjective is /ˈkloʊs/. The parts of speech have been assigned the following codes:

Part-of-speech   Code
Noun             n
Verb             v
Adjective        aj
Adverb           av
Interjection     interj
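The line layout described above (word, optional /part-of-speech, a space, then the pronunciation, with CR line endings) could be read roughly as follows. This is a sketch, not the project's own tooling: it assumes the first space separates the entry from the pronunciation, that the word itself never contains a slash, and that the bytes are Mac OS Roman. The function names and the sample line are hypothetical.

```python
PRON_POS_CODES = {
    "n": "Noun", "v": "Verb", "aj": "Adjective",
    "av": "Adverb", "interj": "Interjection",
}


def split_entries(raw: bytes):
    """Decode Mac OS Roman bytes and split on CR, dropping empty lines."""
    return [line for line in raw.decode("mac_roman").split("\r") if line]


def parse_pron_line(line: str):
    """Return (word, part of speech or None, raw pronunciation string)."""
    entry, _, pronunciation = line.strip("\r\n").partition(" ")
    word, _, pos_code = entry.partition("/")
    # Underscores join the words of a multi-word entry such as monkey_wrench.
    return word.replace("_", " "), PRON_POS_CODES.get(pos_code), pronunciation


if __name__ == "__main__":
    # Invented sample line, for illustration only.
    print(parse_pron_line("close/v kl/oU/z"))
```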

The pronunciation field follows the word and the optional part-of-speech field. Several special symbols are present:

Symbol   Meaning
_        Used to separate words
'        Primary stress on the following syllable
,        Secondary stress on the following syllable

The rest of the symbols are used to represent IPA characters. The pronunciations are generally consistent with a General American dialect of English that exhibits the father-bother merger, the hurry-furry merger and the lot-cloth split, but not the cot-caught merger or the wine-whine merger. Each phoneme is represented by a sequence of one or more characters. Some of the sequences are delimited with a slash character "/", as shown in the following table, but note that the sequence for /ɔɪ/ is delimited by two slash characters at either end:

Symbol    IPA
/&/       æ
/-/       ə
/@/       ʌ, ə
/[@]/r    ɜr, ər
/A/       ɑ, ɑː
/aI/      aɪ
/AU/      aʊ
b         b
d         d
/D/       ð
/dZ/      dʒ
/E/       ɛ
/eI/      eɪ
f         f
g         ɡ
h         h
hw        hw
/i/       iː
/I/       ɪ
/j/       j
/ju/      juː
k         k
l         l
m         m
n         n
/N/       ŋ
/O/       ɔ, ɔː
//Oi//    ɔɪ
/oU/      oʊ
p         p
r         r
s         s
/S/       ʃ
t         t
/T/       θ
/tS/      tʃ
/u/       uː
/U/       ʊ
v         v
w         w
z         z
/Z/       ʒ
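Because the slash-delimited sequences have variable length, a pronunciation string has to be tokenized before it can be mapped to IPA. The sketch below shows one way to do that with a regular expression that tries the double-slash form first. The mapping covers only a handful of symbols from the table above, unknown tokens pass through unchanged, and the sample strings are invented for illustration. All names are hypothetical.

```python
import re

# Partial mapping taken from the symbol tables above; extend as needed.
MOBY_TO_IPA = {
    "/&/": "æ", "/-/": "ə", "/D/": "ð", "/E/": "ɛ", "/I/": "ɪ",
    "/N/": "ŋ", "/S/": "ʃ", "/T/": "θ", "/U/": "ʊ", "/Z/": "ʒ",
    "//Oi//": "ɔɪ",
    "'": "ˈ",  # primary stress on the following syllable
    ",": "ˌ",  # secondary stress on the following syllable
    "_": " ",  # word separator
}

# Order matters: '//...//' must be tried before '/.../', then any single char.
TOKEN = re.compile(r"//[^/]+//|/[^/]+/|.")


def tokenize(pronunciation: str):
    """Split a pronunciation string into phoneme and marker tokens."""
    return TOKEN.findall(pronunciation)


def to_ipa(pronunciation: str) -> str:
    """Map each token to IPA, leaving unmapped tokens as they are."""
    return "".join(MOBY_TO_IPA.get(tok, tok) for tok in tokenize(pronunciation))


if __name__ == "__main__":
    # Invented sample strings, for illustration only.
    print(tokenize("b//Oi//"))
    print(to_ipa("'f/I//S/"))
```

Passing unmapped tokens through unchanged keeps the plain consonant letters (b, d, f and so on) working without listing them, at the cost of silently accepting symbols that may be encoding errors.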

To this collection are added a number of extra sequences representing phonemes found in several other languages. These are used to encode the non-English words, phrases and names that are included in the database. The following table contains these extra phonemes, but note that the extent to which some of these may exist due to encoding errors is not clear.

Symbol   IPA
A        a
e        e, ɛ
i        i, ɪ
N        Nasalisation of preceding vowel
o        o
O        [intent not clear]
R        ʁ
S        s
u        u
V        v, β, ʋ
W        w
/x/      x
/y/      ø
Y        y
/z/      ts
Z        z

Shakespeare

Moby Shakespeare contains the complete unabridged works of Shakespeare. This specific resource is not available from Project Gutenberg, but it is available in a 1993 version on the web. [3]

Thesaurus

The Moby Thesaurus II contains 30,260 root words, with 2,520,264 synonyms and related terms – an average of 83.3 per root word. Each line consists of a list of comma-separated values, with the first term being the root word, and all following words being related terms.
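A line in this comma-separated layout can be split into its root word and related terms as sketched below. The sample line is invented for illustration, and the function name `parse_thesaurus_line` is hypothetical. The last lines of the example recompute the quoted average from the two counts given above.

```python
def parse_thesaurus_line(line: str):
    """Return (root word, list of related terms) for one thesaurus line."""
    root, *related = line.rstrip("\r\n").split(",")
    return root, related


if __name__ == "__main__":
    # Invented sample line, for illustration only.
    print(parse_thesaurus_line("happy,cheerful,glad"))

    # Average number of related terms per root word, from the figures above.
    average = 2_520_264 / 30_260
    print(round(average, 1))
```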

Grady Ward placed this thesaurus in the public domain in 1996. It has also been available as a Debian package, although the package was discontinued starting with Debian Bullseye. [4]

Words

Moby Words II has been described as the largest word list in the world. [1] The distribution consists of the following 16 files:

Filename       Words     Description
ACRONYMS.TXT   6,213     Common acronyms and abbreviations
COMMON.TXT     74,550    Common words present in two or more published dictionaries
COMPOUND.TXT   256,772   Phrases, proper nouns, and acronyms not included in the common words file
CROSSWD.TXT    113,809   Words included in the first edition of the Official Scrabble Players Dictionary
CRSWD-D.TXT    4,160     Additions to the Official Scrabble Players Dictionary in the second edition
FICTION.TXT    467       A list of the most commonly occurring substrings in the book The Joy Luck Club
FREQ.TXT       1,000     Most frequently occurring words in the English language, listed in descending order
FREQ-INT.TXT   1,000     Most frequently occurring words on Usenet in 1992, listed with corresponding percentage in decreasing order
KJVFREQ.TXT    1,185     Most frequently occurring substrings in the King James Version of the Bible, listed in descending order
NAMES.TXT      21,986    Most common names used in the United States and Great Britain
NAMES-F.TXT    4,946     Common English female names
NAMES-M.TXT    3,897     Common English male names
OFTENMIS.TXT   366       Most common misspelled English words
PLACES.TXT     10,196    Place names in the United States
SINGLE.TXT     354,984   Single words excluding proper nouns, acronyms, compound words and phrases, but including archaic words and significant variant spellings
USACONST.TXT   7,618     United States Constitution including all amendments current to 1993
Total          863,149   Not the total of unique words.
Total Uniq     639,995   Total of single, proper nouns, acronyms, and compound words and phrases (all of the files that contain unique words).

References

  1. "ACL SIGLEX Resource Links". Special Interest Group on the Lexicon of the Association for Computational Linguistics. August 13, 2004. Archived from the original on December 15, 2018. Retrieved May 9, 2022. "Moby Words: 610,000+ words and phrases. The largest word list in the world"
  2. Obtained by running the UNIX command grep '.*[-_].* .*' mobypron.unc | wc -l after converting the line endings and correcting some encoding errors.
  3. mobyshak.txt 1993 version
  4. Tosi, Sandro (July 13, 2020). "RM: dict-moby-thesaurus -- RoQA; dead upstream (10+ years); python2-only; no extrenal [sic] deps; extremely low popcon". Debian Bug report logs. Retrieved May 10, 2022.