IndoWordNet [1] is a linked lexical knowledge base of wordnets of 18 scheduled languages of India, viz., Assamese, Bangla, Bodo, Gujarati, Hindi, Kannada, Kashmiri, Konkani, Malayalam, Meitei (Manipuri), Marathi, Nepali, Odia, Punjabi, Sanskrit, Tamil, Telugu and Urdu.
Dravidian WordNet is a WordNet for Dravidian Languages. [2]
In the early 1990s, the wordnet for English, called Princeton WordNet, was created at Princeton University by George Miller and Christiane Fellbaum, who went on to receive the prestigious Zampolli Prize in 2006. [3] EuroWordNet, a conglomeration of European-language wordnets, followed in 1998. [4] Wordnets are now essential resources for Natural Language Processing, Information Extraction, Word Sense Disambiguation and other computations involving text.
Indian languages form a very significant component of the world's linguistic landscape. Four language families are represented in the Indian subcontinent: Indo-European, Dravidian, Tibeto-Burman and Austro-Asiatic. [5] Several Indian languages rank among the most widely spoken in the world by number of native speakers, e.g., Hindi-Urdu 5th, Bangla 7th and Marathi 12th, as per the List of languages by number of native speakers. Creating wordnets of Indian languages is therefore a highly important techno-scientific and linguistic project.
Such a project took off in 2000 with the creation of the Hindi WordNet by the Natural Language Processing group at the Center for Indian Language Technology (CFILT) in the Computer Science and Engineering Department at IIT Bombay. [6] It was made publicly available in 2006 under the GNU license. The Hindi WordNet was created with support from the TDIL project of the Ministry of Communication and Information Technology, India, and partially from the Ministry of Human Resources Development, India.
Wordnets of other languages of India then followed suit. The large nationwide project of building Indian language wordnets was called the IndoWordNet project. The wordnets are being created using the expansion approach from the Hindi WordNet. The Hindi WordNet was created from first principles (mentioned below) and was the first wordnet for an Indian language. The method adopted was the same as that used for the Princeton WordNet for English.
Polish WordNet is being mapped to Princeton WordNet based on the strategy followed by IndoWordNet. [7]
The wordnets follow the principles of minimality, coverage and replaceability for the synsets. Minimality means that a synset must contain at least a 'core' set of lexemes that uniquely identify the concept it represents, e.g., {house, family} standing for the concept of 'family' ("she is from a noble house"). Coverage means that the synset should contain all the words in the language that represent the concept, e.g., the word 'ménage' has to appear in the 'family' synset, albeit towards the end, since its usage is rare. Finally, replaceability means that the words towards the beginning of the synset should be able to replace one another in a reasonable amount of corpus text, e.g., 'house' and 'family' can replace each other in the sentence "she is from a noble house".
The number of synsets (as of August 2014) in each language and the institutes creating the language wordnets are as below:
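The three principles can be illustrated with a minimal sketch. It is not the IndoWordNet data format: the `Synset` class, its `core_size` field and the gloss text are hypothetical names and values introduced only for illustration, and the lexemes are the ones from the 'family' example above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Synset:
    """Toy synset: lexemes are ordered from most to least common usage."""
    gloss: str
    lexemes: List[str]
    core_size: int = 2  # hypothetical: number of leading lexemes treated as the 'core'

    def core(self) -> List[str]:
        # Minimality: the leading lexemes that together identify the concept.
        return self.lexemes[:self.core_size]

    def covers(self, word: str) -> bool:
        # Coverage: every word expressing the concept is a member, even rare ones.
        return word in self.lexemes

    def replace(self, sentence: str, old: str, new: str) -> str:
        # Replaceability: core lexemes can be swapped for one another in a sentence.
        if old not in self.core() or new not in self.core():
            raise ValueError("both words must be core lexemes of the synset")
        return sentence.replace(old, new)


# Illustrative gloss; the ordering puts the rare word 'ménage' last.
family = Synset(gloss="people of common descent",
                lexemes=["house", "family", "ménage"])

print(family.core())
print(family.covers("ménage"))
print(family.replace("she is from a noble house", "house", "family"))
```

Keeping the lexemes in a single ordered list lets one structure express all three principles: the head of the list is the core, membership gives coverage, and swaps are restricted to the head.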
Language | Synsets | Institute |
---|---|---|
Assamese | 14958 | Guwahati University, Guwahati, Assam |
Bengali | 36346 | Indian Statistical Institute, Kolkata, West Bengal |
Bodo | 15785 | Guwahati University, Guwahati, Assam |
Gujarati | 35599 | Dharamsinh Desai University, Nadiad, Gujarat |
Hindi | 38607 | IIT Bombay, Mumbai, Maharashtra |
Kannada | 20033 | Mysore University, Mysore, Karnataka |
Kashmiri | 29469 | Kashmir University, Srinagar, Jammu and Kashmir |
Konkani | 32370 | Goa University, Taleigao, Goa |
Malayalam | 30060 | Amrita University, Coimbatore, Tamil Nadu |
Marathi | 29674 | IIT Bombay, Mumbai, Maharashtra |
Meitei | 16351 | Manipur University, Imphal, Manipur |
Nepali | 11713 | Assam University, Silchar, Assam |
Oriya | 35284 | Hyderabad Central University, Hyderabad, Andhra Pradesh |
Punjabi | 32364 | Thapar University and Punjabi University, Patiala, Punjab |
Sanskrit | 23140 | IIT Bombay, Mumbai, Maharashtra |
Tamil | 25431 | Tamil University, Thanjavur, Tamil Nadu |
Telugu | 21925 | Dravidian University, Kuppam, Andhra Pradesh |
Urdu | 34280 | Jawaharlal Nehru University, New Delhi |
IndoWordNet is structurally very similar to EuroWordNet. The difference is that its pivot language is Hindi, which in turn is linked to the English WordNet. Typical Indian language phenomena such as complex predicates and causative verbs are also captured in IndoWordNet.
IndoWordNet is publicly browsable. The Indian language wordnet building efforts that form the subcomponents of the IndoWordNet project are the North East WordNet project, the Dravidian WordNet project and the Indradhanush project, all of which are funded by the TDIL project.
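The pivot arrangement can be sketched as follows. This is a toy model under stated assumptions, not the actual IndoWordNet storage or API: it assumes that linked synsets in different languages share one identifier, and the identifier `101`, the dictionary names and the word entries are all hypothetical illustration data.

```python
# Toy wordnets keyed by a shared synset identifier (assumed linking scheme).
WORDNETS = {
    "hindi": {101: ["घर", "गृह"]},
    "marathi": {101: ["घर"]},
}

# Only the pivot language (Hindi) is linked to English in this sketch.
HINDI_TO_ENGLISH = {101: ["house", "home"]}


def linked_words(word, source, target):
    """Return the target-language lexemes of every synset containing `word`."""
    results = []
    for synset_id, lexemes in WORDNETS[source].items():
        if word in lexemes:
            results.extend(WORDNETS[target].get(synset_id, []))
    return results


def english_via_pivot(word, source):
    """Reach English from any language by going through the Hindi pivot links."""
    results = []
    for synset_id, lexemes in WORDNETS[source].items():
        if word in lexemes:
            results.extend(HINDI_TO_ENGLISH.get(synset_id, []))
    return results


print(linked_words("घर", "marathi", "hindi"))
print(english_via_pivot("घर", "marathi"))
```

With a pivot, each language needs only one set of links (to Hindi) rather than links to every other language, and a single Hindi-English mapping connects all of them to the English WordNet.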
Marathi is an Indo-Aryan language predominantly spoken by Marathi people in the Indian state of Maharashtra. It is the official language of Maharashtra, and an additional official language in the state of Goa, where it is used for replies when a request is received in Marathi. It is one of the 22 scheduled languages of India, with 83 million speakers as of 2011. Marathi ranks 13th in the list of languages with most native speakers in the world. Marathi has the third largest number of native speakers in India, after Hindi and Bengali. The language has some of the oldest literature of all modern Indian languages. The major dialects of Marathi are Standard Marathi and the Varhadi dialect.
WordNet is a lexical database that links words through semantic relations, including synonymy, hyponymy and meronymy. The synonyms are grouped into synsets with short definitions and usage examples. It can thus be seen as a combination and extension of a dictionary and thesaurus. While it is accessible to human users via a web browser, its primary use is in automatic text analysis and artificial intelligence applications. It was first created for the English language, and the English WordNet database and software tools have been released under a BSD-style license and are freely available for download from the WordNet website. There are now WordNets in more than 200 languages.
The Indo-Aryan languages are a branch of the Indo-Iranian languages in the Indo-European language family. As of the early 21st century, they have more than 800 million speakers, primarily concentrated in Bangladesh, India, Pakistan, Sri Lanka, Maldives and Nepal. Moreover, apart from the Indian subcontinent, large immigrant and expatriate Indo-Aryan–speaking communities live in Northwestern Europe, Western Asia, North America, the Caribbean, Southeast Africa, Polynesia and Australia, along with several million speakers of Romani languages primarily concentrated in Southeastern Europe. There are over 200 known Indo-Aryan languages.
Languages spoken in the Republic of India belong to several language families, the major ones being the Indo-Aryan languages spoken by 78.05% of Indians and the Dravidian languages spoken by 19.64% of Indians; both families together are sometimes known as Indic languages. Languages spoken by the remaining 2.31% of the population belong to the Austroasiatic, Sino–Tibetan, Tai–Kadai, and a few other minor language families and isolates. According to the People's Linguistic Survey of India, India has the second highest number of languages (780), after Papua New Guinea (840). Ethnologue lists a lower number of 456.
There are currently 900 permitted private satellite television channels in India as of February 2021. Numerous regional channels are available throughout India, often distributed according to languages.
Konkani is an Indo-Aryan language spoken by the Konkani people, primarily in the Konkan region, along the western coast of India. It is one of the 22 scheduled languages mentioned in the Indian Constitution, and the official language of the Indian state of Goa. It is also spoken in Karnataka, Maharashtra, Kerala, Gujarat as well as Damaon, Diu & Silvassa.
The voiced retroflex lateral flap is a type of consonantal sound, used in some spoken languages. The 'implicit' symbol in the International Phonetic Alphabet is ⟨𝼈 ⟩. The sound may also be transcribed as a short ⟨ɭ̆ ⟩, or with the retired IPA dot diacritic, ⟨ɺ̣⟩.
A semantic lexicon is a digital dictionary of words labeled with semantic classes so associations can be drawn between words that have not previously been encountered. Semantic lexicons are built upon semantic networks, which represent the semantic relations between words. The difference between a semantic lexicon and a semantic network is that a semantic lexicon has definitions for each word, or a "gloss".
Since the Iron Age in India, the native languages of the Indian subcontinent are divided into various language families, of which the Indo-Aryan and the Dravidian are the most widely spoken. There are also many languages belonging to unrelated language families such as Munda and Tibeto-Burman, spoken by smaller groups.
The Linguistic Survey of India (LSI) is a comprehensive survey of the languages of British India, describing 364 languages and dialects. The Survey was first proposed by George Abraham Grierson, a member of the Indian Civil Service and a linguist who attended the Seventh International Oriental Congress held at Vienna in September 1886. His proposal for a linguistic survey was initially turned down by the Government of India. After he persisted and demonstrated that it could be done at a reasonable cost using the existing network of government officials, it was approved in 1891. It was, however, formally begun only in 1894, and the survey continued for thirty years, with the last of the results being published in 1928.
South Asia is home to several hundred languages, spanning the countries of Afghanistan, Bangladesh, Bhutan, India, Maldives, Nepal, Pakistan, and Sri Lanka. It is home to the third most spoken language in the world, Hindi–Urdu; and the sixth most spoken language, Bengali. The languages in the region mostly comprise Indo-Iranic and Dravidian languages, and further members of other language families like Austroasiatic, and Tibeto-Burman languages.
Vikaspedia is an online information guide launched by the Government of India. The website was implemented by C-DAC Hyderabad and is run by the Department of Electronics and Information Technology, Ministry of Communications and Information Technology. It is built as a portal for the social sectors, and offers information in 23 languages: English, Assamese, Telugu, Hindi, Bengali, Gujarati, Kannada, Malayalam, Tamil, Bodo, Dogri, Sanskrit, Kashmiri, Konkani, Nepali, Odia, Urdu, Maithili, Meitei, Santali, Sindhi, Punjabi, and Marathi.
plWordNet is a lexico-semantic database of the Polish language. It includes sets of synonymous lexical units (synsets) followed by short definitions. plWordNet serves as a thesaurus-dictionary where concepts (synsets) and individual word meanings are defined by their location in the network of mutual relations, reflecting the lexico-semantic system of the Polish language. plWordNet is also used as one of the basic resources for the construction of natural language processing tools for Polish.
The Bulgarian WordNet (BulNet) is an electronic multilingual dictionary of synonym sets along with their explanatory definitions and sets of semantic relations with other words in the language.
Malayalam WordNet (പദശൃംഖല) is an online WordNet created for the Malayalam language. It has been developed by the Department of Computer Science, Cochin University of Science and Technology.
Indic OCR refers to the process of converting text images written in Indic scripts into e-text using Optical character recognition (OCR) techniques. Broadly, it can also refer to the OCR systems of Brahmic scripts for languages of South Asia and Southeast Asia, not just the scripts of the Indian subcontinent, which are all written in an abugida-based writing system.
The Indian 1-rupee note (₹1) is made up of 100 paise (₹1 = 100 paise). Currently, it is the smallest Indian banknote in circulation and the only one being issued by the Government of India, as all other banknotes in circulation are issued by the Reserve Bank of India. As a result, the one rupee note is the only note bearing the signature of the Finance Secretary and not the Governor of the RBI. Predominantly pinkish green paper is used during printing.
The Inner–Outer hypothesis of the subclassification of the Indo-Aryan language family argues for a division of the family into two groups, an Inner core and an Outer periphery, evidenced by shared traits of the languages falling into one of the two groups. Proponents of the theory generally believe the distinction to be the result of gradual migrations of Indo-Aryan speakers into the Indian subcontinent, with the inner languages representing a second wave of migration speaking a different dialect of Old Indo-Aryan, overtaking the first-wave speakers in the center and relegating them to the outer region.
Meitei input methods are methods that allow computer users to enter text in the Meitei script for the Meitei language.