LANGUAGE IN INDIA

Strength for Today and Bright Hope for Tomorrow

Volume 6 : 10 October 2006
ISSN 1930-2940

Managing Editor: M. S. Thirumalai, Ph.D.
Editors: B. Mallikarjun, Ph.D.
         Sam Mohanlal, Ph.D.
         B. A. Sharada, Ph.D.
         A. R. Fatihi, Ph.D.
         Lakhan Gusain, Ph.D.
         K. Karunakaran, Ph.D.
         Jennifer Marie Bayer, Ph.D.








  • E-mail your articles and book-length reports (preferably in Microsoft Word) to thirumalai@mn.rr.com.
  • Contributors from South Asia may send their articles to
    B. Mallikarjun,
    Central Institute of Indian Languages,
    Manasagangotri,
    Mysore 570006, India
    or e-mail to mallikarjun@ciil.stpmy.soft.net
  • Your articles and book-length reports should follow the MLA, LSA, or IJDL Stylesheet.
  • The Editorial Board has the right to accept, reject, or suggest modifications to the articles submitted for publication, and to make suitable stylistic adjustments. High quality, academic integrity, ethics and morals are expected from the authors and discussants.

Copyright © 2004
M. S. Thirumalai


 
Web www.languageinindia.com

A SURVEY OF THE STATE OF THE ART IN
TAMIL LANGUAGE TECHNOLOGY
S. Rajendran, Ph.D.


A PRELUDE

The use of computers for language analysis drives the technological development of languages in general, and of Tamil in particular. Worldwide developments have had their impact on Tamil too. Both government and private organizations have initiated programs for the technological development of the Tamil language.

The Department of Electronics conducted training courses on Natural Language Processing through selected institutions throughout India and paved the way for the technological development of Tamil. It funded machine translation programs among Indian languages and between English and Indian languages. It also funded the development of corpora for Indian languages. It identified certain centres for the technological development of Indian languages and funded them to initiate projects aimed at achieving this goal.

Anna University at Chennai was identified for the technological development of the Tamil language and provided with a fund of a few crores of rupees to fulfil this mission. Under this scheme, a Resource Centre for Indian Language Technology Solutions-Tamil has been established at Anna University. A team of researchers employed under the scheme has prepared a number of language technology products. This has led to the technological development of Tamil in many areas. Many other organizations, both government and private, followed suit.

Tamil University at Thanjavur, Tamil Virtual University, AUKBC Research Centre at Chennai, Central Institute of Indian Languages at Mysore, and the International Forum for Information Technology in Tamil (INFITT), which conducts an international Tamil Internet conference every year, have all worked towards the technological development of Tamil. Apart from these institutions, IIT, Chennai, IISC, Bangalore, and Micro Software, Bangalore have also contributed to the technological development of Tamil.

In this paper the technological development of Tamil is classified under certain heads, and the research work undertaken and successfully completed, as well as the products made, are discussed in detail.

CORPUS AND CORPUS MANAGEMENT TOOLS

Corpus linguistics seeks to further our understanding of language through the analysis of large quantities of naturally occurring data. There is a long tradition of corpus linguistic studies in Europe. The need for a corpus of a language is manifold: from the preparation of a dictionary or lexicon to machine translation, the corpus has become an indispensable resource for the technological development of languages. A corpus is a large body of text incorporating various types of textual material, including newspapers, weeklies, fiction, scientific writing, literary writing, and so on. A corpus should represent all the styles of a language. It must be very large, as it is to be used for many language applications, such as the preparation of lexicons of different sizes, purposes and types, machine translation programs, and so on.

Tagged Corpus, Parallel Corpus, and Aligned Corpus

Corpora can be distinguished as tagged, parallel, and aligned. A tagged corpus is one that is tagged for part of speech. A parallel corpus contains texts and their translations in each of the languages involved, which allows wider scope for double-checking translation equivalents. An aligned corpus is a kind of bilingual corpus in which text samples of one language and their translations into another language are aligned sentence by sentence, phrase by phrase, word by word, or even character by character.
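The distinction between these corpus types can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the tag names (PRON, VERB), the `AlignedPair` class, the helper functions, and the two toy sentences are hypothetical and are not taken from any of the corpora or projects surveyed here. Alignment is done naively by sentence position, which only works when the translation preserves sentence order and count.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Tagged corpus: each sentence is a list of (token, part-of-speech tag) pairs.
# The tag names are illustrative only, not a tagset used by any project.
TaggedSentence = List[Tuple[str, str]]

tagged_corpus: List[TaggedSentence] = [
    [("அவன்", "PRON"), ("வந்தான்", "VERB")],
    [("அவள்", "PRON"), ("பாடினாள்", "VERB")],
]


@dataclass
class AlignedPair:
    """One aligned unit (here a sentence) of a bilingual corpus."""
    source: str
    target: str


def align_sentences(source: List[str], target: List[str]) -> List[AlignedPair]:
    """Pair sentences by position (a naive, assumed alignment strategy)."""
    if len(source) != len(target):
        raise ValueError("positional alignment needs equal sentence counts")
    return [AlignedPair(s, t) for s, t in zip(source, target)]


def tag_counts(corpus: List[TaggedSentence]) -> Dict[str, int]:
    """Count how often each part-of-speech tag occurs in a tagged corpus."""
    return dict(Counter(tag for sentence in corpus for _, tag in sentence))


# Parallel corpus: the same text in two languages, not yet aligned.
tamil = ["அவன் வந்தான்.", "அவள் பாடினாள்."]
english = ["He came.", "She sang."]

# Aligned corpus: the parallel texts linked sentence by sentence.
aligned_corpus = align_sentences(tamil, english)

if __name__ == "__main__":
    for pair in aligned_corpus:
        print(pair.source, "<->", pair.target)
    print(tag_counts(tagged_corpus))
```

Finer-grained alignment (phrase, word, or character level) would need the same pairing idea applied to smaller units, usually with a statistical or dictionary-based aligner rather than position alone.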

CIIL Corpus for Tamil

As far as building corpora for Indian languages is concerned, it was the Central Institute of Indian Languages (CIIL) that took the initiative and started preparing corpora for some of the Indian languages (Tamil, Telugu, Kannada, and Malayalam). The Department of Electronics (DOE) financed the corpus-building project. The target was a corpus of ten million words for each language, but owing to financial and time constraints the project ended with three million words for each language. The three-million-word Tamil corpus was built by CIIL in this way. It is a partially tagged corpus. It is available on CD, and a free copy can be obtained from CIIL for research purposes. At present CIIL is planning to build corpora of 10 million words for Indian languages.

AUKBCRC’s Improved Tagged Corpus for Tamil

AUKBC Research Centre, which has taken up NLP-oriented work for Tamil, has improved upon the CIIL Tamil Corpus and tagged it for its MT programs. It has also developed English-Tamil parallel corpora in support of its goal of preparing an MT tool for English-Tamil translation. A parallel corpus is very useful for training MT systems and for building example-based machine translation.

Corpus Indexing Tools (Concordance, KWIC index, etc.)

Many such tools have been made for Tamil. A few important ones are listed in the full article.
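As a rough illustration of what a KWIC (keyword in context) index does, the following Python sketch lists every occurrence of a keyword together with a window of neighbouring words. It is a minimal sketch under stated assumptions, not a description of any of the Tamil tools surveyed: the function names `tokenize` and `kwic`, the `width` parameter, and the sample text are all hypothetical, and tokenization is plain whitespace splitting with surrounding ASCII punctuation stripped.

```python
import string
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    """Split on whitespace and strip surrounding ASCII punctuation."""
    tokens = (t.strip(string.punctuation) for t in text.split())
    return [t for t in tokens if t]


def kwic(text: str, keyword: str, width: int = 2) -> List[Tuple[str, str, str]]:
    """Return (left context, keyword, right context) for each match.

    Matching is case-insensitive and on whole tokens only; `width` is the
    number of tokens shown on each side of the keyword.
    """
    tokens = tokenize(text)
    rows = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, token, right))
    return rows


sample = "A corpus is a body of text. A tagged corpus is tagged for part-of-speech."
rows = kwic(sample, "corpus", width=2)

if __name__ == "__main__":
    # Right-align the left context so the keyword lines up in one column.
    for left, key, right in rows:
        print(f"{left:>20} | {key} | {right}")
```

Whole-token matching is a deliberate simplification. For a morphologically rich language such as Tamil, a practical concordancer would probably need to match inflected forms of a word as well, for example through a morphological analyser.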

This Article

In addition, this long research article gives a detailed presentation of the state of the art in Tamil language technology, reviews the strengths and weaknesses of current research, and suggests directions for future work.





S. Rajendran, Ph.D.
Department of Linguistics
Tamil University
Thanjavur 613 005
Tamilnadu, India
raj_ushush@yahoo.com
 
  • Send your articles as an attachment to your e-mail to thirumalai@mn.rr.com.
  • Please ensure that your name, academic degrees, institutional affiliation and institutional address, and your e-mail address are all given on the first page of your article. Also include a declaration that your article or work submitted for publication in LANGUAGE IN INDIA is an original work by you and that you have duly acknowledged the work or works of others you either cited or used in writing your articles, etc. Remember that by maintaining academic integrity we not only do the right thing but also help the growth, development and recognition of Indian scholarship.