IndoWordNet

IndoWordNet [1] is a linked lexical knowledge base of wordnets of 18 scheduled languages of India, viz., Assamese, Bangla, Bodo, Gujarati, Hindi, Kannada, Kashmiri, Konkani, Malayalam, Meitei (Manipuri), Marathi, Nepali, Odia, Punjabi, Sanskrit, Tamil, Telugu and Urdu.

Dravidian WordNet is a wordnet for the Dravidian languages. [2]

Background

In the early 1990s, the wordnet for English, called Princeton WordNet, was created at Princeton University by George Miller and Christiane Fellbaum, who went on to receive the Antonio Zampolli Prize in 2006. [3] EuroWordNet, a conglomeration of wordnets for European languages, followed in 1998. [4] Wordnets are now essential resources for natural language processing, information extraction, word sense disambiguation and other computations involving text.

Importance of Indian languages

Indian languages form a significant component of the world's linguistic landscape. Four language families are represented in the Indian subcontinent: Indo-European, Dravidian, Tibeto-Burman and Austroasiatic. [5] Several Indian languages rank among the most widely spoken in the world by number of native speakers, e.g., Hindi-Urdu 5th, Bangla 7th and Marathi 12th, as per the List of languages by number of native speakers. Creating wordnets of Indian languages is therefore an important techno-scientific and linguistic project.

Genesis of Indian language wordnets

Such a project took off in 2000 with the creation of the Hindi WordNet by the Natural Language Processing group at the Center for Indian Language Technology (CFILT) in the Computer Science and Engineering Department at IIT Bombay. [6] It was made publicly available in 2006 under the GNU license. The Hindi WordNet was created with support from the TDIL project of the Ministry of Communication and Information Technology, India, and partially from the Ministry of Human Resources Development, India.

Wordnets of other languages of India then followed suit. The large nationwide project of building Indian language wordnets was called the IndoWordNet project. IndoWordNet [1] links the wordnets of the 18 scheduled languages listed above. These wordnets are being created using the expansion approach, starting from the Hindi WordNet. The Hindi WordNet itself was created from first principles (described under Principles of wordnet construction) and was the first wordnet for an Indian language; the method adopted was the same as that used for the Princeton WordNet for English.
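The expansion approach can be illustrated with a minimal sketch. It assumes, purely for illustration, that each synset carries a numeric concept ID that a target-language synset inherits from its Hindi source; the `Synset` class, the `expand` function, the ID 101 and the English glosses are hypothetical and do not come from the IndoWordNet data or tools.

```python
from dataclasses import dataclass


@dataclass
class Synset:
    synset_id: int      # hypothetical concept ID shared across languages
    pos: str            # part of speech
    words: list[str]    # synonymous words, most common first
    gloss: str          # short definition (given in English here for readability)


def expand(source: Synset, target_words: list[str], target_gloss: str) -> Synset:
    """Build a target-language synset that reuses the source synset's ID and POS."""
    if not target_words:
        raise ValueError("a synset needs at least one word")
    return Synset(source.synset_id, source.pos, list(target_words), target_gloss)


# Illustrative entry for the concept 'water'; the ID 101 is made up.
hindi_water = Synset(101, "noun", ["पानी", "जल"], "clear liquid needed for life")

# A lexicographer supplies the Marathi words; the concept ID is inherited.
marathi_water = expand(hindi_water, ["पाणी", "जल"], "clear liquid needed for life")
```

Because the target synset keeps the source ID, the two wordnets stay linked concept by concept without a separate alignment step, which is the main attraction of expansion over building each wordnet independently.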

Polish WordNet is being mapped to Princeton WordNet based on the strategy followed by IndoWordNet. [7]

Principles of wordnet construction

The wordnets follow the principles of minimality, coverage and replaceability for the synsets. Minimality means that the synset should contain at least a 'core' set of lexemes that uniquely identifies the concept it represents, e.g., {house, family} standing for the concept of 'family' ("she is from a noble house"). Coverage means that the synset should include all the words representing the concept in the language; e.g., the word 'ménage' has to appear in the 'family' synset, albeit towards the end, since its usage is rare. Finally, replaceability means that the words towards the beginning of the synset should be able to replace one another in a reasonable amount of corpus text, e.g., 'house' and 'family' can replace each other in the sentence "she is from a noble house".
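The three principles can be pictured with a small sketch that models a synset as an ordered word list. The `Synset` class, the `core_size` field, the gloss text and the `substitution_candidates` helper are assumptions made for this illustration, not part of any IndoWordNet tool. The helper only generates candidate sentences; whether a substitution is acceptable still has to be judged by a lexicographer.

```python
import re
from dataclasses import dataclass


@dataclass
class Synset:
    gloss: str
    words: list[str]   # ordered: core words first, rarely used words last
    core_size: int     # how many leading words form the minimal core

    def core(self) -> list[str]:
        """Words that together pin down the concept (minimality)."""
        return self.words[:self.core_size]


def substitution_candidates(sentences: list[str], old: str, new: str) -> list[str]:
    """Return each sentence containing `old` with `old` swapped for `new`."""
    pattern = re.compile(rf"\b{re.escape(old)}\b")
    return [pattern.sub(lambda m: new, s) for s in sentences if pattern.search(s)]


# Coverage: the rare word 'ménage' is included, but placed at the end.
family = Synset(
    gloss="a lineage or household (illustrative gloss)",
    words=["house", "family", "ménage"],
    core_size=2,
)

# Replaceability: produce sentences for a human to review.
candidates = substitution_candidates(["she is from a noble house"], "house", "family")
```

Here `family.core()` gives the two words that identify the concept, and `candidates` holds the rewritten sentence from the example in the text.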

Statistics of Indian language wordnets

The number of synsets in each language (as of August 2014) and the institutes creating the language wordnets are listed below:

Language | Synsets | Institute
Assamese | 14958 | Guwahati University, Guwahati, Assam
Bengali | 36346 | Indian Statistical Institute, Kolkata, West Bengal
Bodo | 15785 | Guwahati University, Guwahati, Assam
Gujarati | 35599 | Dharamsinh Desai University, Nadiad, Gujarat
Hindi | 38607 | IIT Bombay, Mumbai, Maharashtra
Kannada | 20033 | Mysore University, Mysore, Karnataka
Kashmiri | 29469 | Kashmir University, Srinagar, Jammu and Kashmir
Konkani | 32370 | Goa University, Taleigao, Goa
Malayalam | 30060 | Amrita University, Coimbatore, Tamil Nadu
Marathi | 29674 | IIT Bombay, Mumbai, Maharashtra
Meitei | 16351 | Manipur University, Imphal, Manipur
Nepali | 11713 | Assam University, Silchar, Assam
Oriya | 35284 | Hyderabad Central University, Hyderabad, Andhra Pradesh
Punjabi | 32364 | Thapar University and Punjabi University, Patiala, Punjab
Sanskrit | 23140 | IIT Bombay, Mumbai, Maharashtra
Tamil | 25431 | Tamil University, Thanjavur, Tamil Nadu
Telugu | 21925 | Dravidian University, Kuppam, Andhra Pradesh
Urdu | 34280 | Jawaharlal Nehru University, New Delhi

Summary

IndoWordNet is highly similar to EuroWordNet. However, the pivot language is Hindi, which in turn is linked to the English WordNet. In addition, typical Indian language phenomena such as complex predicates and causative verbs are captured in IndoWordNet.
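The role of Hindi as pivot can be sketched as a lookup through shared concept IDs. Everything in this sketch is an assumption made for illustration: the dictionary layout, the ID 101, and the function names `synset_ids`, `translate` and `to_english` are hypothetical and do not reflect the actual IndoWordNet storage format or API.

```python
# Toy wordnets keyed by language, then by a made-up concept ID
# that each language inherits from the Hindi synset.
WORDNETS = {
    "hindi": {101: ["पानी", "जल"]},
    "marathi": {101: ["पाणी", "जल"]},
    "bengali": {101: ["জল", "পানি"]},
}

# Link from Hindi concept IDs to English words (stand-in for the
# Hindi-to-English WordNet linkage mentioned in the text).
HINDI_TO_ENGLISH = {101: ["water"]}


def synset_ids(word: str, lang: str) -> list[int]:
    """IDs of all synsets in `lang` that contain `word`."""
    return [sid for sid, words in WORDNETS[lang].items() if word in words]


def translate(word: str, src: str, tgt: str) -> list[str]:
    """Words in `tgt` that share a concept ID with `word` in `src`."""
    result: list[str] = []
    for sid in synset_ids(word, src):
        result.extend(WORDNETS[tgt].get(sid, []))
    return result


def to_english(word: str, src: str) -> list[str]:
    """English words reached from `word` via the Hindi pivot ID."""
    result: list[str] = []
    for sid in synset_ids(word, src):
        result.extend(HINDI_TO_ENGLISH.get(sid, []))
    return result
```

With a pivot, each language needs only one link (to Hindi) rather than a separate mapping to every other language; any pair of languages, and English, is then reachable through the shared ID.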

IndoWordNet is publicly browsable. The Indian language wordnet building efforts that form the subcomponents of the IndoWordNet project are the North East WordNet project, the Dravidian WordNet project and the Indradhanush project, all of which are funded by the TDIL project.

References

  1. Pushpak Bhattacharyya, IndoWordNet, Language Resources and Evaluation Conference (LREC 2010), Malta, May 2010.
  2. https://www.amrita.edu/publication/building-wordnet-dravidian-languages [dead link]
  3. Christiane Fellbaum (ed.), WordNet: An Electronic Lexical Database, MIT Press, 1998.
  4. P. Vossen (ed.), EuroWordNet: A Multilingual Database with Lexical Semantic Networks, Kluwer Pub., 1998.
  5. Joseph E. Schwartzberg, Encyclopædia Britannica, India: Linguistic Composition, 2007.
  6. Dipak Narayan, Debasri Chakrabarty, Prabhakar Pande and P. Bhattacharyya, An Experience in Building the Indo WordNet: a WordNet for Hindi, International Conference on Global WordNet (GWC 02), Mysore, India, January 2002.
  7. E. Rudnicka, M. Maziarz, M. Piasecki and S. Szpakowicz, Mapping plWordNet onto Princeton WordNet, 24th International Conference on Computational Linguistics (COLING), India, December 2012.