IndoWordNet

IndoWordNet [1] is a linked lexical knowledge base of wordnets of 18 scheduled languages of India, viz., Assamese, Bangla, Bodo, Gujarati, Hindi, Kannada, Kashmiri, Konkani, Malayalam, Meitei (Manipuri), Marathi, Nepali, Odia, Punjabi, Sanskrit, Tamil, Telugu and Urdu. Wordnets are computational databases of words used for Natural Language Processing, Information Extraction, Word Sense Disambiguation and other forms of computational processing involving text.

Dravidian WordNet is a WordNet for Dravidian Languages. [2]

Background

In the early 1990s, the original wordnet for English, called Princeton WordNet, was created at Princeton University in a project led by George Miller and Christiane Fellbaum, who received the Zampoli Prize in 2006. [3] EuroWordNet, a collection of wordnets for European languages, followed in 1998. [4]

Importance of Indian languages

Indian languages form a significant component of the linguistic landscape of the world. There are four language families in the Indian subcontinent: Indo-European, Dravidian, Tibeto-Burman and Austro-Asiatic. [5] Several Indian languages rank among the most widely spoken languages in the world by number of speakers, for example Hindi-Urdu (5th), Bangla (7th) and Marathi (12th), according to the List of languages by number of native speakers. Creating wordnets for Indian languages is therefore a worthwhile techno-scientific and linguistic endeavour.

Genesis of Indian language wordnets

Initial work started in 2000 with the Hindi WordNet being created by the Natural Language Processing group at the Center for Indian Language Technology (CFILT) in the Computer Science and Engineering Department at IIT Bombay. [6] It was made publicly available in 2006 under the GNU license. The Hindi WordNet was created with support from the TDIL project of the Ministry of Communication and Information Technology, India and also partial support from the Ministry of Human Resources Development, India.

In subsequent years, wordnets for other languages of India followed. The large nationwide project of building Indian language wordnets was called the IndoWordNet project. IndoWordNet [1] links the wordnets of the 18 scheduled languages listed above. These wordnets are being created using the expansion approach, starting from the Hindi WordNet. The Hindi WordNet itself was created from first principles (described below) and was the first wordnet for an Indian language. The method adopted was the same as that used for the Princeton WordNet for English.

IndoWordNet is thus highly similar to EuroWordNet. [7] However, the pivot language is Hindi, which is in turn linked to the English WordNet.
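The pivot arrangement can be pictured as a lookup routed through a shared synset identifier. The following is a minimal sketch, not the project's actual data model or API: it assumes each language wordnet maps a hypothetical integer synset ID (shared with the Hindi WordNet) to its words, and that Hindi synsets carry a link to an English label. The IDs, the transliterated words and the names `HINDI`, `OTHER`, `HINDI_TO_ENGLISH`, `linked_words` and `english_labels` are all illustrative assumptions.

```python
from typing import Dict, List

# Toy data only: hypothetical synset IDs shared across languages,
# as an expansion from a Hindi pivot would produce.
HINDI: Dict[int, List[str]] = {101: ["ghar", "makaan"], 102: ["parivaar"]}
OTHER: Dict[str, Dict[int, List[str]]] = {
    "marathi": {101: ["ghar"]},
    "bangla": {101: ["bari"], 102: ["paribar"]},
}
# Hypothetical link from Hindi synset IDs to English synset labels.
HINDI_TO_ENGLISH: Dict[int, str] = {101: "house", 102: "family"}


def wordnet_for(lang: str) -> Dict[int, List[str]]:
    """Return the ID-to-words table for a language (empty if unknown)."""
    return HINDI if lang == "hindi" else OTHER.get(lang, {})


def synset_ids(word: str, lang: str) -> List[int]:
    """Find every synset ID in `lang` whose word list contains `word`."""
    return [sid for sid, words in wordnet_for(lang).items() if word in words]


def linked_words(word: str, source: str, target: str) -> List[str]:
    """Follow the shared synset ID from the source language to the target."""
    result: List[str] = []
    for sid in synset_ids(word, source):
        result.extend(wordnet_for(target).get(sid, []))
    return result


def english_labels(word: str, lang: str) -> List[str]:
    """Reach English through the Hindi pivot's synset IDs."""
    return [HINDI_TO_ENGLISH[sid] for sid in synset_ids(word, lang)
            if sid in HINDI_TO_ENGLISH]
```

Because every language reuses the pivot's identifiers in this sketch, any pair of languages can be connected without a dedicated pairwise mapping; a concept missing from one language simply yields an empty result.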

The Indian language wordnet building efforts that form the subcomponents of the IndoWordNet project are the North East WordNet project, the Dravidian WordNet project and the Indradhanush project, all of which are funded by the TDIL project.

Principles of wordnet construction

The wordnets follow the principles of minimality, coverage and replaceability for the synsets. Minimality means that there should be at least a 'core' set of lexemes in the synset that uniquely identifies the concept represented by the synset, e.g., {house, family} standing for the concept of 'family' ("she is from a noble house"). Coverage means that the synset should include all the words representing the concept in the language, e.g., the word 'ménage' has to appear in the 'family' synset, albeit towards the end of the synset, since its usage is rare. Finally, replaceability means that the words towards the beginning of the synset should be able to replace one another in a reasonable number of corpus contexts, e.g., 'house' and 'family' can replace each other in the sentence "she is from a noble house".
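The three principles can be illustrated with a small data structure. This is a hedged sketch rather than the representation used by any of the wordnets: the `Synset` class, its `core_size` field and its method names are hypothetical, and the substitution step uses naive string replacement, whereas real replaceability is judged against corpus contexts.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Synset:
    concept: str
    words: List[str]  # ordered: common words first, rare words last
    core_size: int    # hypothetical: how many leading words form the core

    def core(self) -> List[str]:
        """Minimality: the leading words that pin down the concept."""
        return self.words[:self.core_size]

    def missing(self, known_words: Set[str]) -> Set[str]:
        """Coverage: words known to express the concept but absent here."""
        return known_words - set(self.words)

    def substitutions(self, sentence: str, used: str) -> List[str]:
        """Replaceability: candidate sentences with other core words swapped in.

        Naive substring replacement; a person (or a corpus check) would
        still have to judge whether each candidate is acceptable.
        """
        return [sentence.replace(used, w) for w in self.core() if w != used]


# The example from the text: 'ménage' is rare, so it sits at the end.
family = Synset("family", ["house", "family", "ménage"], core_size=2)
candidates = family.substitutions("she is from a noble house", "house")
```

Keeping the word list ordered lets a single structure serve all three principles: the front of the list is the core used for minimality and replaceability, while the full list is what coverage is checked against.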

The IndoWordNet project differs from some other related projects in that typical Indian language phenomena such as complex predicates and causative verbs are captured.

Data

IndoWordNet is publicly browsable. The number of synsets (as of August 2014) in the languages and the institutes creating the language WordNets are as follows:

Language    Synsets   Institute
Assamese    14958     Guwahati University, Guwahati, Assam
Bengali     36346     Indian Statistical Institute, Kolkata, West Bengal
Bodo        15785     Guwahati University, Guwahati, Assam
Gujarati    35599     Dharamsinh Desai University, Nadiad, Gujarat
Hindi       38607     IIT Bombay, Mumbai, Maharashtra
Kannada     20033     Mysore University, Mysore, Karnataka
Kashmiri    29469     Kashmir University, Srinagar, Jammu and Kashmir
Konkani     32370     Goa University, Taleigao, Goa
Malayalam   30060     Amrita University, Coimbatore, Tamil Nadu
Marathi     29674     IIT Bombay, Mumbai, Maharashtra
Meitei      16351     Manipur University, Imphal, Manipur
Nepali      11713     Assam University, Silchar, Assam
Oriya       35284     Hyderabad Central University, Hyderabad, Andhra Pradesh
Punjabi     32364     Thapar University and Punjabi University, Patiala, Punjab
Sanskrit    23140     IIT Bombay, Mumbai, Maharashtra
Tamil       25431     Tamil University, Thanjavur, Tamil Nadu
Telugu      21925     Dravidian University, Kuppam, Andhra Pradesh
Urdu        34280     Jawaharlal Nehru University, New Delhi


References

  1. Pushpak Bhattacharyya, IndoWordNet, Lexical Resources Engineering Conference 2010 (LREC 2010), Malta, May 2010.
  2. https://www.amrita.edu/publication/building-wordnet-dravidian-languages [ dead link ]
  3. Christiane Fellbaum (ed.), WordNet: An Electronic Lexical Database, MIT Press, 1998.
  4. P. Vossen (ed.), EuroWordNet: A Multilingual Database with Lexical Semantic Networks, Kluwer Pub., 1998.
  5. Joseph E. Schwartzberg, Encyclopædia Britannica, India: Linguistic Composition, 2007.
  6. Dipak Narayan, Debasri Chakrabarty, Prabhakar Pande and P. Bhattacharyya, An Experience in Building the Indo WordNet - a WordNet for Hindi, International Conference on Global WordNet (GWC 02), Mysore, India, January 2002.
  7. Dash, Niladri Sekhar (2017). The WordNet in Indian Languages. Singapore: Springer Nature. pp. 275+. ISBN 978-981-10-1907-4.