Language in India

http://www.languageinindia.com
Volume 6 : 3 March 2006

Strength for Today and Bright Hope for Tomorrow

Editor: M. S. Thirumalai, Ph.D.
Associate Editors: B. Mallikarjun, Ph.D.
         Sam Mohanlal, Ph.D.
         B. A. Sharada, Ph.D.
         A. R. Fatihi, Ph.D.
         Lakhan Gusain, Ph.D.

SOME LIMITATIONS OF CORPUS-BASED LANGUAGE STUDY
Niladri Sekhar Dash, Ph.D.

1. Introduction

Modern linguistic research and application have started paying due attention to corpora, that is, databases of natural language. However, this was not the situation even a few decades ago. From its very inception, corpus linguistics has been the target of stringent criticism from scholars of various fields. The supporters of generative linguistics in particular could not tolerate the growth of corpus linguistics, and argued that linguistic investigations carried out with the support of empirical language databases were useless. They have always been ready to criticise corpus linguistics on the slightest pretext, to prove that corpus-based language research is not at all a scientific way of studying language.

The generative linguists are not alone in diminishing the value of corpus. Scholars from other domains of linguistics have joined them in denying the academic as well as practical importance of corpus in linguistic research, investigation, and application. Corpus itself also has some limitations, which cannot be ignored at present. We shall discuss these limitations in brief to see how they create hurdles of several kinds, and how people are adopting various measures to overcome them.

Within the broader domain of linguistics, corpus linguistics is a branch that aims at studying a language in its most versatile environment. Language, as a tool of human thinking and communication, is always lively and dynamic, with wide scope for continuous modification, change, and diversion. Any science that sets out to study a dynamic human aspect like the language of a living society is bound to be fraught with limitations and failures, since we do not have a system by which we can properly capture and reflect on that dynamic aspect. Corpus linguistics is no exception.

It is also bound to fail at certain junctures, since it dares to reflect on language, which often evades any scheme of generalisation and/or particularisation. Therefore, when we deal with corpus linguistics we must keep in mind that, like any other social science, it has some unavoidable limitations that are part of its very nature. Taking these limitations into account, we try to evaluate its relevance within the wider spectrum of language science. In the sections below we shall highlight these limitations and identify measures that can make corpus linguistics more useful, practical, and trustworthy.

2. Limitation in Generative Quality

Corpus has been the target of criticism by generative linguists because it tries to manifest the real-life use of a language. The criticism was started by Chomsky, the father of generative linguistics. In 1957, in his review of Skinner's book Verbal Behavior (1957), Chomsky strongly criticised Skinner's arguments and set out to show the limitations and worthlessness of the theory. In this review Chomsky also raised a few questions about the relevance of language databases, in the form of corpus, to linguistics. This criticism not only discouraged scholars from exploring the content of Skinner's book, but also deterred them from realising the functional viability of language corpus in linguistic research. In his argument Chomsky clearly stated that (Chomsky 1957: 54):

It is evident that more is involved in sentence structure than insertion of lexical items in grammatical frames; no approach to language that fails to take these deeper processes into account can possibly achieve much success in accounting for actual linguistic behavior.

Thus, he systematically nullified the reasons and arguments of the behaviourists to show that (Chomsky 1957: 56):

It appears that we recognize a new item as a sentence not because it matches some familiar item in any simple way, but because it is generated by the grammar that each individual has somehow and in some form internalised.

In the very next year, in 1958, in a memorable lecture delivered at the University of Texas, Chomsky completely denied the importance of corpus in linguistic research and investigation in the following words (Chomsky 1968: 159):

Any natural corpus will be skewed. Some sentences won't occur because they are obvious; others because they are false, still others because they are impolite. The corpus, if natural, will be so wildly skewed that the description [based upon it] would be no more than a mere list.

Recently, in a face-to-face interview with Andor (2004), Chomsky again mounted a strong attack on corpus linguistics. This attack is sharper and more pointed than the earlier ones. In the present context of language research, Andor put the following question to Chomsky to learn his opinion regarding the position of corpus in linguistics:

Let me inquire about your current view about corpus-based linguistic description and theorizing, an amazingly developing field, which, as many would say, has grown from childhood to adulthood, or, at least, to adolescence. This field, they say, can no longer be considered as only a methodological approach to linguistic analysis, but has to be accepted as an outstanding research field of empirical importance, which is extensively utilized and relied on, for instance, in current research in lexical semantics and construction grammar.

In reply, Chomsky stated clearly that corpus has no value to him. In his own words:

Corpus linguistics does not mean anything. It's like saying suppose a physicist decides, suppose physics and chemistry decide that instead of relying on experiments, what they're going to do is to take videotapes of things happening in the world and they'll collect huge videotapes of everything that's happening and from that maybe they'll come up with some generalizations or insights. Well, you know, sciences don't do this.

Thus, for several decades, Chomsky and his supporters have tried hard to deny corpus any practical utility or relevance within the spectrum of linguistics. Moreover, they have tried to divert the entire direction of linguistic research away from empiricism and usage-based language study and towards rationalism and intuitive investigation. The outcome of this effort has been detrimental for linguistics at large, since some of his supporters have taken to pouncing upon corpus linguists at every available opportunity.

On the other hand, scholars who were not ready to enter the labyrinth of generative linguistics were not spared. Within the last five decades we have observed that supporters of the generative school went a few steps further than Chomsky and ignored the importance of statistics even in the fields of applied linguistics. The argument these scholars put forward is the following: the use of corpus in linguistic research is a foolish act of scholarship. A language corpus, due to its form, content, and composition, is able only to highlight marginal samples of a language. It has no potential beyond this. Therefore, no analysis of it, however intensive or extensive, is capable of reflecting on the linguistic competence or linguistic generativity of the users. Since the goal of linguists is to investigate the generative power of the users, corpus has no functional or theoretical value.

The argument is no doubt true. However, it is nothing more than a repetition of the arguments made by Chomsky decades ago. Starting with the publication of Syntactic Structures (Chomsky 1957), in most of Chomsky's writings we come across the argument that the use of a natural language is infinite. Therefore, nothing will be gained from the analysis of finite sets of data, even if the database is a very large one. To get clues about the internal structure of a language, as well as reliable ideas about the generative quality of the language users, we need to depend on the linguistic competence of the users, not on their performance. That is, the native speaker's ability to generate infinite varieties of construction may be explained only if the underlying grammatical rules of generation are properly understood and explained.

The above argument implies that understanding the internal linguistic ability of a human is understanding his language. Since corpus is a crippled manifestation of the internal linguistic ability of a human being, it cannot be a reliable resource for linguists. In essence, a corpus can at best assemble a small subset of language data as a replica of linguistic performance of a human. Therefore, it cannot be a hunting ground of linguists, since their primary goal is not to provide a detailed account about when, where, and how a person has used language for his several purposes but 'to understand the tacit, internalised knowledge of language' of the person.

Within the last few years, however, the aggression shown towards corpus linguistics by Chomsky and his supporters has notably diminished. Even Chomsky himself has acknowledged the relevance of corpus in linguistics, particularly in some areas of applied linguistics such as phonetics and child language acquisition. Moreover, in the recent interview he has also indirectly appreciated the value of corpus in linguistic studies in the following way:

If you want to use hints from data that you acquire by looking at large corpuses, fine. That's useful information for you, fine. ... You are observing the tides. And from that general observation about the tides you see regularities and so on and that leads you to construct experimental frameworks including highly abstract situations. ... You may be motivated by phenomena that you've observed in the world, but as soon as you get beyond the most superficial stage, you guide inquiry by partial understanding and experiments in which you construct situations in which you hope to get answers to particular questions that are arising from a theoretical framework. And that's done whether you're studying speech acts or human interaction or discourse or any other topic. There's no other rational way to proceed.

Modern corpus linguistics does not hesitate to proceed along this path. In fact, it intends to traverse this path with full enthusiasm. There is actually no difference between Chomsky and the supporters of corpus linguistics. We, as supporters of corpus linguistics, collect and analyse real-life samples of language to obtain the information and examples needed to examine and substantiate propositions about language, which may be related to applied linguistics, generative linguistics, or even language technology.

The path is almost the same for both of us; the difference lies only in the manner of the journey. While generative linguists want to explore the internal structure of language, corpus linguists want to explore the external structure of language to reach the same destination. To reach their final destination, corpus linguists pay attention to every external aspect of a language, investigate every unique property of a language, and observe every meander, twist, and turn reflected in the actual use of language. Their manner of journey should not be branded faulty simply because it does not match the one followed by the generative linguists.

Some scholars of the generative school are far more antagonistic and hostile than others. They argue that linguistics is one of the most important branches of cognitive psychology, where there is no scope for empirical evidence for generalisation, verification, or authentication. Research and investigation of language should be based only on evidence acquired from intuitive inferences, experiments should be detached from real-life situations, and analysis should be free from usage-based findings. Such investigation never requires data collected in a corpus, nor does it need proofs of actual language use to substantiate its observations (for details, see Stubbs 1993: 3-6).

Let us inform those who still live in the ivory tower of such a make-believe world of linguistic research and investigation that modern linguistics does not dare to surge ahead with only a few theoretical speculations and intuitive observations about the properties of a language. Modern linguistics is now taking direct help from computer technology to bring a new dimension to language study with the help of real-life language databases compiled in corpora. It is also trying to examine whether the arguments and observations furnished by generative linguistics are actually attested in the evidence of real-life language databases, and if so, how far they are true.

The sum of the above discussion is that although the criticisms made by the generative school hindered the initial progress of corpus linguistics, they did not succeed in stopping entirely the practice of using corpus in linguistic research and application. In the very early stage of its inception, corpus linguistics did suffer a setback, but scholars across the world, either openly or in secrecy, continued with the task of corpus generation and analysis for their linguistic works of various kinds. From the beginning of the 1970s, corpus linguistics gradually regained its lost strength at a slow pace. By the mid-1980s, it was able to establish itself as one of the most promising fields of linguistic research and investigation in the world (Andor 2004)[1].

The discussion presented above clearly shows that corpus linguistics had to fight a long battle for its survival and growth. The strongest challenge came from Chomsky and his supporters. We know that within a very short period of time Chomsky was able to turn the entire direction of linguistic study from empiricism to rationalism. We should know the reasons behind this diversion of the course of language study from one direction to the other. However, to know these reasons we should first evaluate the relevance of the two theories (i.e., empiricism and rationalism) in linguistics. We should also evaluate the pertinence of the two theories in the areas of applied linguistics and language technology, two important domains of linguistic research in the modern era. Probably, a short discussion on the goals and perspectives of any social science will provide us with the necessary clues to understand the present problem.

The first thing that we need to understand is whether we should rely at all on the facts and events occurring around us in order to understand a particular social or natural phenomenon that draws our attention. Should we rely on or ignore the evidence that has a direct impact on a social event of our life and living? Should we rely on our intuition alone to define and analyse a social event? Will our intuition lead us to analyse a social event correctly if we pay no attention to the factual side of the event?

To substantiate our argument, let us, for the time being, consider biology, a discipline of natural science, as an example. What is the normal course of activities that the scientists of this discipline adopt in their investigation and analysis? Do they rely on their intuition only? Do they take support from the external world? Are the results of their study based on their intuitive logic or on the real-life evidence they have examined and experimented with?

Probably we need not answer all these questions categorically, since the answers are self-evident. What we understand from this argument is that a science, whether natural or social, never dares to design principles and theories just on the basis of some fanciful intuitions and utopian assumptions. It has to follow faithfully the facts and evidence actually occurring in real life and nature. In all branches of science, the relevance as well as importance of real-life evidence and proofs is surely much higher than that of the hypotheses and assumptions of the investigators.

This argument has no reason to be false in the case of linguistics either, since this particular branch of human knowledge deals with one of the most living and dynamic aspects of human life: language. Undoubtedly rationalism has created a lasting impact on this field, which however does not imply that empiricism has no relevance here. The basic structure of rationalism in linguistics stands on a few strong pillars: rationalistic assumptions, the profound linguistic knowledge of the experts, and the peerless linguistic genius of a few native linguists. Because of these factors, some visionary linguists may easily reach the level of generalisation with minimum support from the field of particularisation. In fact, based on such schemes of generalisation a rationalist can generalise about various aspects of a natural language from the analysis of a few particular premises.

Proofs of this method are available in numerous theories and hypotheses that gifted linguists have made about a language on the basis of their knowledge of and exposure to the language. In most cases, however, such hypotheses, although proved correct for the language or language groups investigated, have proved fallacious or insufficient for other languages or groups of languages. In most cases, the arguments made by the experts are based on the degree of their exposure to the language or the depth of their knowledge about the language.

In such a situation there are at least three slippery stretches:

There is a high possibility of mistakes or erroneous generalisations due to insufficient data and examples[2];

Observation as well as analysis is bound to vary from person to person; and

More insightful persons will derive better inferences from the same examples and premises, so that their conclusions appear more authentic and acceptable.

However, none of these hypotheses and inferences can be considered reliable, since they are not attested by real-life examples examined and verified against authentic language databases.

On the other hand, corpus linguistics is entirely based on empirical evidence, and the sources of this evidence are language databases acquired from real-life language use. Therefore, by close reference to language corpora we are able to know how a language is being used in various facets of life, where change in the language takes place, how the language changes, what factors work behind these changes, how these changes affect the general frame of the language, etc.

Also, by analysing the evidence stored in corpora we are able to know whether a sentence is rightly constructed, what the structure of the sentence is, in which domains such a construction is normally used, which types of text carry the maximum number of such constructions, whether a sentence carries any special feature that does not match the normal sentence constructions of the language, whether there is any special type of sentence in a specific field, how these sentences differ from common sentences, whether such speciality is a new phenomenon in the language, whether these sentences can be considered valid in the language in spite of their speciality, etc. In fact, even at the sentence level, and without taking other linguistic aspects of a language into consideration, many such questions can be properly addressed and investigated if we have a large corpus composed of various types of sentence obtained from various domains of language use.
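
As an illustration only, the question of which domains carry the largest number of a given construction can be sketched in a few lines of Python. Everything in the sketch is an assumption made for the example: the toy corpus, the domain labels, and the crude regular expression that stands in for a passive-like construction are hypothetical, and a real study would rely on a properly sampled and annotated corpus.

```python
import re
from collections import Counter

# Hypothetical toy corpus of (domain, sentence) pairs. A real corpus would
# hold many thousands of sentences for every domain.
TOY_CORPUS = [
    ("news", "The bill was passed by the assembly."),
    ("news", "The minister opened the new bridge."),
    ("science", "The samples were heated for two hours."),
    ("science", "The results were confirmed by a second trial."),
    ("fiction", "She laughed and closed the door."),
]

# Assumed, deliberately crude pattern for a passive-like construction:
# a form of "be" followed by a word ending in -ed.
PASSIVE_LIKE = re.compile(
    r"\b(?:is|are|was|were|be|been|being)\s+\w+ed\b", re.IGNORECASE
)


def count_by_domain(corpus, pattern):
    """Return {domain: (matching sentences, total sentences)}."""
    hits, totals = Counter(), Counter()
    for domain, sentence in corpus:
        totals[domain] += 1
        if pattern.search(sentence):
            hits[domain] += 1
    return {domain: (hits[domain], totals[domain]) for domain in totals}


if __name__ == "__main__":
    for domain, (hit, total) in count_by_domain(TOY_CORPUS, PASSIVE_LIKE).items():
        print(f"{domain}: {hit} of {total} sentences match")
```

On this toy data the pattern matches one of the two news sentences, both science sentences, and no fiction sentence. The counts say nothing about English at large; they only show the kind of domain-wise question a corpus lets us ask.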

The central argument is that we can faithfully depend on a corpus for presenting an authentic analysis and estimation of a language or variety. We can also show that the empirical and rationalistic approaches to language study differ not only in approach but also in method of investigation, techniques of language analysis, and system of information presentation. However, these differences do not mean that the two approaches never meet. The truth is that each approach is incomplete without the assistance of the other. The study of a living natural language always intermixes the systems and techniques of both approaches to be maximally authentic and reliable.

We have already noted that Chomsky and his supporters considerably changed the goal and approach of language study within a short span of time. To these scholars, the most important thing in the study of a natural language is the internal linguistic ability of a language user, and not the usage varieties observed in language. To establish this generative approach to language investigation they tried to nullify the relevance of corpus, which tries to explore empirical linguistic evidence of various types. The baseline of their argument is that if we try to gauge the linguistic potential of a normal human being by referring to his stock of various linguistic usages, we will definitely make mistakes. That stock contains many examples of erroneous linguistic usage, the analysis of which may lead us to wrong conclusions and fallacious observations.

Use of language can be erroneous for several reasons: emotion, excitement, insanity, intoxication, and mental disability, to mention a few. Now, if we try to analyse a general corpus that also contains data produced under these conditions, we will obviously end up with wrong results and faulty observations. If we do not set aside the evidence produced by such factors, our observations will be skewed, our analyses will be biased, and our conclusions will be dubious. As a result, the process of generalisation about the linguistic capability of a human being will never be beyond question and doubt.

We must admit that evidence of this type can never be used as proof of the linguistic capability of a normal human being. However, from this evidence we may find some new clues that might give us important insights into the nature of the use of a language. For instance, let us assume that we want to investigate the phenomena observed in language when a person is emotionally charged, excited, insane, mentally disabled, or intoxicated. In such a study the language of these special situations is our best and most authentic resource. Scientific analysis of these language samples will faithfully show how language changes in a special situation, what deficiencies are normally observed in such data, what kinds of linguistic expression normally occur in them, how the language of special situations deviates from the standard norm, etc.

Finally, the question is: shall we rely on corpus? Shall we ignore it? The doubt is cleared only when we clearly understand that a corpus is nothing more than a simple collection of large sets of externalised evidences, which are compiled from a language used at certain time, place, and context. Being a set of performance data a corpus in reality is a competent guide to modelling linguistic competence either of an individual or a language community.

There is another theoretical question: is language data really a poor reflection of the linguistic competence of an individual or a speech community? Perhaps not. Scholars (Labov 1969) have shown that the great majority of normal utterances in all contexts of language use are grammatical in form, albeit with some variation. This signifies that it is not right to claim that the sentences that appear in a corpus are grammatically incorrect and unacceptable. Nor does it support Chomsky's (1968: 88) claim that performance data is 'degenerate'. According to some scholars (Ingram 1989: 223), the claim made by Chomsky is nothing more than an exaggeration meant to deny the referential value of empirical language databases.

To sum up, while generative linguistics visualises a wide network of linguistic elements through sets of generative rules underlying a language, corpus linguistics explores to what extent these rules are manifested in the evidence of real empirical databases. It also aims to explore whether the evidence can be accounted for systematically, so as to identify the rudiments of the linguistic rules we use in all possible linguistic interactions. This difference in approach, I suppose, leads the two fields in opposite directions of language research.

Generative linguistics is interested in finding the underlying 'finite' rules employed for generating innumerable linguistic constructions, that is, the 'competence' behind various linguistic 'performances'. Corpus linguistics, on the other hand, is highly reluctant to admit the presence of linguistic generativity unless it is supported by proper empirical verification. It agrees to admit a claim about language only after it is verified with proofs obtained from examples of actual use. It prefers to traverse the paths of pure empiricism, which are rarely explored in generative linguistics.

3. Limitation in Balanced Text Representation

Corpus linguistics has also been the target of stringent criticism for another pertinent reason. People who are working in various domains of mainstream linguistics have complained about the lack of width and span of text representation within corpus. According to scholars, the improper and skewed representation of the target language fails to serve properly or fails to meet the general requirements of the language investigators (Landau 2001: 321).

A living language is the most powerful tool of a speech community for establishing interpersonal communication. People use language for various reasons, in various ways, in various modes, and in various settings. We use language in need and without need, to prescribe and describe, to express and suppress, to reveal and conceal, to convey and hide, to convince and deceive, to encourage and discourage, to capitulate and manipulate, to infuse and diffuse, to reflect and project. Thus, each of us uses language for various needs and in various contexts, situations, and manners, day after day, month after month, year after year, and generation after generation. No language corpus, however big and widely representative it may be, can ever dream of reflecting every side of language use in all its diversity.

For this, however, we should not blame corpus linguistics alone. No method is available even today by which we can know how people use language continuously in so many situations, contexts, varieties, and manners. Even the internal grammar of the generative school will fail here, since it is not able to show whether a language will be used tomorrow in the same manners and situations as it is used today. It also cannot affirm whether the linguistic principles and rules that are considered valid and useful today will be equally valid and effective tomorrow. Therefore, no theory or 'ism' is in a position today to predict whether any linguistic rule will survive to describe and define the language that will be used by the people of tomorrow. In this crucial situation, the role of language corpus is very significant. Although it cannot project into tomorrow's language, it has the potential to direct us towards the course that the language is likely to take in future years[3].

Corpus has the potential to reflect on the past and show how language was used by people then. Besides, comparative evaluation of a language with respect to its use in the past and in the present has become possible due to the availability of diachronic or historical corpora. In fact, scientific analysis of diachronic corpora has yielded some important findings about the use of a language, which were not available to us even from generative linguistics. We argue that until and unless we have a better way of representing a natural language in a far more democratic and balanced manner, we have no alternative but to depend on language corpora.

It is perhaps not appropriate to dismiss the importance of corpus entirely; it is more pertinent to criticise the methods and statistical frames applied for corpus generation. A particular deficiency on the part of a subject cannot be made a pretext to nullify the total relevance of the subject. The best remedy for the deficiency is to redesign the whole model of corpus representation so that the newly adopted system helps us to generate a corpus that has the statistical qualities needed to be far more representative of the target language or variety.

Even then, we must admit clearly that a language corpus, however representative, large, and statistically complete, can never account for the infinite potential varieties of use of a language. Similarly, it can never claim to represent properly all the aspects of a living language that throbs with the life and living of a speech community. Moreover, subtle and barely noticeable changes in various linguistic properties of a living language are hardly reflected and captured within a corpus, even if we make it properly diachronic and universal. Finally, in spite of its wide synchronic structure, it fails to represent all possible linguistic varieties exercised at all levels of linguistic interaction observed in both the speech and the writing of a speech community.

In essence, a language corpus is nothing more than a tiny replica of the huge galaxy of language use. Therefore, in principle, it can never be a good representative of a language. We would do better to compare a corpus with an aquarium, which meekly tries to represent the vast ocean rich with innumerable treasures. In our view, a corpus is no more than a bucketful of water collected from a river. Even so, the analysis of this water will supply valuable information and evidence about the condition of the water that goes on flowing through the middle of a community and is randomly used by the living generations[4].

Also, the utilisation of quantitative or statistical methods in corpus study, as well as in other fields of language study, has been the target of severe criticism for some years. As a result, corpus-based study of the vocabulary of a language has "declined in influence from the 1950s until its revival ... [in] the 1980s" (Kennedy 1992: 339). The strings of discontinuity in the study of linguistics -

can be located fairly precisely in the later 1950s. Chomsky had, effectively, put to flight the corpus linguistics of an earlier generation. His view on the inadequacy of corpora, and the adequacy of intuition, became the orthodoxy of a succeeding generation of theoretical linguistics (Leech 1991: 8).

In general, most of the works of the early corpus linguistics were considered fallacious due to two flawed assumptions (McEnery and Wilson 1996: 13):

The number of sentences in a natural language is finite, and

Sentences of a language can be collected and enumerated.

These fallacious assumptions were given undue importance in the early works of corpus linguistics, since corpus was mistaken for the only source of evidence in the formation of linguistic theories. "This was when the linguists ... regarded the corpus as the sole explicandum of linguistics" (Leech 1991: 14). Also, people at the time made some weaker claims for corpus, suggesting that the purpose of linguists working in the tradition of structuralism "is not simply to account for utterances which comprise his corpus" but rather to "account for utterances which are not in his corpus at a given time" (Hockett 1948: 269).

Probably we cannot deny the truth that quantitative information has relevance in language research and application, since we all know that many successful studies on speech research drew heavily from information obtained through statistical analyses of corpora of spoken text. Since information about the frequency of use of various linguistic properties in speech is not available via intuition or introspection, real-life speech corpora are essential for acquiring quantitative information about the various aspects of speech.

In fact, recently conducted quantitative analyses on speech corpora show that our assumptions about the patterns of use of various language properties in normal human speech events often go wrong (Svartvik 1986). We, therefore, argue that there are some obvious benefits of language corpora if these are used systematically and scientifically in language analysis and interpretation. Statistical analysis of various language properties is one of the most powerful methodologies in empirical linguistics because it is scientific, systematic, and open to any kind of objective verification of results (Leech 1992).
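The kind of quantitative information referred to above can be illustrated with a minimal Python sketch that converts raw token counts into frequencies per thousand tokens, so that samples of different sizes can be compared. The token list is an invented sample and the function name relative_frequencies is our own assumption; neither comes from the studies cited here.

```python
from collections import Counter


def relative_frequencies(tokens, per=1000):
    """Return each token's frequency normalised to `per` tokens."""
    total = len(tokens)
    if total == 0:
        return {}
    counts = Counter(tokens)
    return {word: count * per / total for word, count in counts.items()}


# Invented stand-in for a transcribed speech sample (not real corpus data).
tokens = "well you know i mean well it is you know fine".split()
freqs = relative_frequencies(tokens)

# List the items from most to least frequent.
for word, rate in sorted(freqs.items(), key=lambda item: -item[1]):
    print(f"{word}: {rate:.1f} per 1000 tokens")
```

Normalised rates of this kind are what allow an investigator to set an intuitive guess about usage against the observed figures.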

4. Limitation in Technical Efficiency

We must acknowledge that, without the active involvement of advanced and sophisticated computer technology, the chance of success in generating language corpora in electronic form is virtually zero. It is almost impossible for an individual to generate a large, multidimensional, well-representative, and balanced corpus of multibillion words by simply collecting language data from multiple domains of real-life language use. In that case, it may consume the entire life of a data collector[5]. Even for a group of scholars, collecting a large and multidimensional language database is not an easy task. Our experience shows that such work consumes both time and money, since it requires the full-fledged involvement of several scholars for a long time (Dash 2005, Chapter 3). One must therefore agree to provide a sufficient amount of time and money for the sake of corpus generation.

Even then, the enterprise will not be entirely free from unintentional errors. Since individual likes and dislikes play vital decisive roles in a collective work, one has to take the necessary precautions at every step so that the common goal of the work is not blurred. Therefore, to overcome the differences that may arise in the course of data collection and compilation, it is always sensible to work in tandem in a pre-planned manner, with close collaboration of the participants under the invisible guidance of collective wisdom. This will not only help the concerned workers to overcome individual mistakes but also help to strengthen the joint enterprise.

The advent of computer technology has, however, saved us from troubles of such high magnitude. There are, nevertheless, some limitations in the use of the computer technology available to us for the generation of language corpora in electronic form. In fact, with the presently available technology these limitations are hardly surmountable. This leads some people to argue that the use of computers in corpus generation is not at all useful and trustworthy, because it fails to address all the types of needs a corpus linguist usually faces in his work. In defence of the relevance of computers in electronic corpus generation, let us argue in the following manner.

The question of corpus generation is not the only issue that is directly related to computer technology. The questions of corpus processing and use are also involved with it. Works like the extraction of characters, words, phrases, idiomatic expressions, compound words, sentences and other linguistic properties from corpora also need the help of advanced computer tools and techniques. These works can be done successfully only if the corpus builders have good computers equipped with the necessary software at their disposal. In the earlier days, the collection of linguistic information and data from hand-made corpora was most often done manually. Today these works are done mostly automatically, either by the linguists themselves or by some supporting hands who are well-versed in computer handling and data processing.
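The extraction work described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the sample text is invented, the function name extract_units is hypothetical, and the splitting rules (sentence breaks after ".", "!" or "?", words as runs of letters or digits) are deliberately naive; real corpus tools use far more careful tokenisation.

```python
import re
from collections import Counter


def extract_units(text):
    """Split raw text into sentences, words, and non-space characters."""
    # Naive rule: a sentence ends at ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Naive rule: a word is any run of letters, digits, or underscores.
    words = re.findall(r"\w+", text.lower())
    characters = [c for c in text if not c.isspace()]
    return sentences, words, characters


# Invented sample text standing in for a corpus file.
sample = "The river flows. The river is wide! Is the water clean?"
sentences, words, characters = extract_units(sample)
word_freq = Counter(words)

print(len(sentences), "sentences,", len(words), "words")
print(word_freq.most_common(3))
```

The same three lists can then feed any further count or search, which is the step that once had to be carried out by eye.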

It is always difficult to extract the necessary linguistic evidence, examples and information from handmade corpora, whether to refute an earlier observation or to furnish a new one about a language. The best way to overcome this problem is to convert a handmade corpus into an electronic one so that it becomes more easily accessible to us. In that case we have to be more computer-savvy so that we are able to handle and direct a computer as our work demands. Alternatively, we can hire a computer programmer who will be asked to execute various linguistic tasks and experiments on corpora according to the suggestions of the language investigators.

In the age of the pre-electronic corpus, such facilities were not available to us. Therefore, people used nothing more than their eyes to search for characters, words, terms, phrases, idioms, sentences, and other linguistic elements in the language databases. They required a huge amount of energy, strength, and perseverance to identify each and every feature systematically, either to draw conclusions or to challenge previous observations. Such large-scale enterprises, by virtue of their complexity, were mostly time-consuming, error-prone and expensive. Although they required good and efficient data processing systems, these were not available at that time. Probably because of these difficulties, people criticised corpus linguistics severely and eventually became hostile towards it.

However, the situation changed remarkably by the 1970s with the arrival of digital computers that are vast in storage capacity, fast in processing ability, clinical in data analysis, and accurate in inference drawing. This has given a new lease of life to the field of corpus linguistics, which has now found a long-awaited congenial climate for its revival and growth. Now, people hardly have any problem in dealing with large corpora in various ways to address their requirements.

By the dawn of the new millennium, corpus linguistics had started to expand its domain at high speed. This has forced critics to take a step backward. In fact, in every domain of linguistics, the corpus and the computer have become two indispensable components. These resources are being used to evaluate and verify the reliability and authenticity of the principles and theories proposed by earlier scholars, including those from generative linguistics. Let us hope that the systematic utilisation of computer technology in the generation and processing of electronic corpora will open up new avenues for bringing lasting changes to the life and living of the common people across the speech communities.

5. Supremacy of Written Texts Over Spoken Texts

Another important criticism against corpus linguistics is that the present scenario of corpus generation and processing is mostly tilted towards corpora of written texts. As a result, corpora of spoken texts are not properly developed and utilised, although spoken texts hold automatic priority over written texts in language research and analysis.

This is, in fact, a genuine criticism if we look into the core of the present scenario of corpus generation. In a simple study, we have observed that the number of corpora made from written texts far exceeds the number of corpora made from spoken texts. Obviously, there are some unavoidable factors that have tilted the balance. These factors are summarised below:

Designing and developing a corpus of written text is much easier than designing and developing a speech corpus. The electronic evolution in the publication industry, the rapid growth of writing with the help of computers, the easy availability of written texts from various electronic resources such as web pages, homepages, and the internet, and the use of OCR (Optical Character Recognition) systems for quick conversion of written and printed texts into electronic form have made it possible for us to develop a written corpus quite easily. Any person who has access to all these resources can develop a corpus of written text samples without much trouble.

Also, the tools and systems needed for processing a written corpus are easily available to users. In most cases these tools and systems are either freely downloadable from the internet or can be bought at a very marginal cost. Moreover, these techniques and systems are so user-friendly that even a novice can use them for his/her linguistic work. A linguist who is interested in processing a text corpus at his disposal can use these tools without having proper training.

In case of corpus of spoken texts, the present situation is not so encouraging. Designing and developing a speech corpus is a highly complicated task that needs careful implementation of several techniques and systems at various stages of its generation. Moreover, it requires advanced and technologically sophisticated devices, which are not within the scope of most of the language investigators.

At the initial stage, after the plan of work is ready, we need to record spoken interactions from various domains on electronic devices of different kinds. Next, we need the necessary tools to convert these texts into written form by applying techniques of text conversion and transcription. Since the work is quite complicated, it requires a team of expert linguists who have mastery over phonetics and field linguistics (Samarain 1966).

After the conversion of the spoken texts into written forms, text samples need to be annotated with various types of tag-set for proper understanding, processing and retrieval of information from the texts (Garside 1995). Moreover, the tools and facilities available for written text processing and encoding are not easily available for spoken texts processing. Because of these constraints the growth of speech corpus is not at par with that of the written corpora. We apprehend that this trend will continue for some more time until and unless these hurdles are removed.

This, however, does not imply that the importance of the written corpus has increased over that of spoken texts. The truth is - the value and importance of the speech corpus remain as intact as they were before. We also believe that the spoken form is the most reliable and authentic proof of a language. Due to this fact we usually pay more attention to the spoken form than to the written form (Eeg-Olofsson 1991). We cannot concentrate more on spoken texts because such texts are not reliably available to us. We have at present no alternative but to rely on corpora of written texts.

However, for the last few years we observe a drastic change in the attitude of the corpus linguists. People have started designing systems and tools for quick and easy collection of corpora of spoken texts. Also, they have started designing various tools and systems for processing spoken texts (Esling and Gaylord 1993, Edwards and Lampert 1993). Moreover, people have realised the relevance of spoken corpora in the area of speech technology, particularly in the works of developing tools and systems for text-to-speech conversion, speech-to-text conversion, speech recognition, identification, etc.

Within the last two decades a few large speech corpora have been developed (Dash 2005, Chapter 2), while a few others are on the way to completion[6]. In the case of Indian languages, some attempts have been initiated by the Ministry of Communication and Information Technology, Govt. of India for the generation of speech corpora in all major Indian languages. Let us hope that within the next few years we will have a few speech corpora in Indian languages, the processing and analysis of which will help us to devise sophisticated tools and systems of speech technology for Indian languages. Also, empirical investigation and analysis of those spoken corpora will help us to bring new insights into the language and life of the Indian people.

6. Absence of Texts From Dialogic Interaction

Recently some scholars have raised voice against corpus linguistics with the argument that present day corpora most often fail to represent the impromptu and unprepared dialogic interactions, which usually take place spontaneously in regular linguistic activities of people (Selting and Couper-Kuhlen 2000). Those who want to advocate language study through evidences of language used in dialogic interactions argue that the absence of texts from any kind of dialogic interaction makes a corpus not only skewed but also crippled lacking in the aspect of spontaneity, which is one of the most valuable properties of a natural language (Weigand and Dascal 2001). Due to lack of this particular property a corpus fails to represent the real picture of language found in normal life notwithstanding the fact that natural, spontaneous, and impromptu samples of dialogic interaction can only faithfully represent the basic texture of a natural language[7].

There are definitely some grains of truth in this criticism. It is true that a corpus, either in spoken or in written form, is actually a database far removed from its actual context of occurrence. In fact, detachment from the contexts makes a corpus a lifeless language database, which is devoid of many properties of a living dialogic interaction as well as of information related to discourse and pragmatics. As a result, a corpus often fails to bring to light the real purpose carefully concealed within a complicated linguistic action called 'negotiation'. Moreover, it fails to identify the situations of 'language-in-use', fails to determine the interactive action games involved within dialogic interactions, and fails to describe properly the "cognitive and perceptual background from which the interlocutors derive their cognitive and perceptual means of communication" (Weigand 2004).

In essence, a speech corpus detached from its actual context of occurrence loses much of its pragmatic and discoursal information the analysis of which may provide valuable clues and insights to understand spoken texts in better way. The analysis of speech corpora available so far cannot provide us clues to know how the motives of the interactants are actually hidden in their verbal deliberations, how speakers gauge the mental condition and intention of the listeners they are addressing, and how language is used as a tool to continue or terminate an ongoing spoken interaction.

The simple way to overcome these difficulties is to accumulate in a speech corpus as many text samples as possible from various dialogic interactions as well as from different spoken negotiations. Modern corpus linguists have now turned their attention in this direction and are trying to compensate for the loss suffered for years. However, we cannot ignore the truth that the actual generation of a corpus with dialogic interactions of various types is far more complicated than the generation of a general speech corpus. The present trend of generating multi-modal corpora (discussed in chapter 3, section ????), however, can probably help us to make this dream a reality in the near future.

7. Absence of Pictorial Elements

In general, a language corpus does not contain tables, diagrams, sketches, figures, images, formulae, and other visual elements. However, these elements are often present in written and printed texts and documents. Particularly, texts belonging to school and college curriculum, children literature, science books including physics, chemistry, biology, medicine, engineering, computer and others, advertisements, etc. contain various types of visual element for proper understanding of the content. Similarly, the value of pictorial elements is fathomless in texts related to advertisement, since in most cases, underlying message of an advertisement is heavily dependent on these visual elements. That means without proper reference to visual elements it is hardly possible to extract the central message of the text.

On the contrary, texts related to literary prose (e.g. fictions, short stories, travelogues, etc.) and social science (e.g. political science, history, education, philosophy, religion, etc) carry less amounts of visual element, which although help us in understanding the topic or idea presented in these works, are not directly integrated with visual elements as found in case of advertisements. In some places, however, some sketches and illustrations are attached with these texts to draw attention of the target readers. This signifies that the relevance of pictorial elements in the texts of literature and social science is not of primary importance. That means, the central idea of these texts can be understood even if the readers are not provided with illustrations and pictorial elements.

But in case of texts related to children literature, this assumption stated above is not true. We have noted that most of the texts of children literature, either informative or imaginative, carry visual elements - the lack of which definitely diminishes the amount of pleasure and information the children are supposed to extract from these texts. For instance, let us consider the children literature composed by Sukumar Roy[8].

If we remove the sketches and the pictorial illustrations from the verses compiled in Abol-Tabol, then we will probably destroy the entire world of fantasy, joy and enthralment of children. Similarly, if we remove those mesmerising pictures and sketches of Hanglathoriam, Gomrathoriam, Becarathoriam, Cillanosoras, Langrathoriam and others from his Heshoram Hunshiyarer Dayeri, then obviously the joy of hunting in the world of fantasy will lose much of its charm and beauty. This implies that in case of generation of a text corpus of children literature the removal of visual illustrations and pictorial elements from the texts is actually a destruction of a major share of the world of fanciful imagination of the children, which may eventually tell upon the over-all growth and nourishment of their minds[9].

In plain terms, however, visual elements found in a written or printed text are not included in an electronic corpus. But the presence of these elements in a printed text helps the author either to elaborate his idea in clearer and more lucid terms or to convey his argument or theory with more clarity to the target readers. For instance, when a writer uses diagrams and tables in his writing, it is implied that these elements are considered indispensable, and that the lack of them would make his text impenetrable and clumsy. That means these visual elements carry an extra load of information, which the text itself usually fails to carry to the target readers. The underlying and undeniable truth is that all types of visual element carry a certain amount of information, which is not possible to extract from written texts only.

If we agree to this argument, then we must admit that a corpus should carry these visual elements. The lack of these properties definitely forces a corpus to lose much of its information, which could play vital roles to determine the actual nature of texts. Particularly, in the context of stylistic analysis of texts these visual elements could provide necessary information to understand the stylistic patterns of particular authors. If an author, for instance, uses large number of tables and figures in his writing, then these will supply necessary clues to understand his style of writing. If we remove these elements from his text then our interpretation about his style of writing will have chances to be mistaken and falsified.

In essence, a corpus devoid of such visual elements is bound to lose much of its information. However, in the discussion about the limitation of corpus, we must understand clearly that due to unavoidable technical constraints, it is not yet possible to incorporate pictorial elements of a printed text within a corpus of its electronic version. If, in future, any technique or system is developed to overcome this limitation, then definitely a corpus will be more representative of a language as well as a true replica of the texts where from it is generated.

8. Lack of Samples From Poetic Texts

It should be noted that a corpus of written text usually contains samples of prose text. Rarely does it contain samples from poems, verses, nursery rhymes, songs, ballads, rhymes, and other poetic texts. However, it sporadically contains one or two extracts of poems used in a prose composition. Why text samples from poetic creations are not usually included in a corpus of prose text is a long-standing question. The reasons are many, most of which are related to the style, content and goal of the respective text, and they are summarised in the following ways:

The expectation of readers to a poetic creation is different than that of a prose creation. That means, what we expect to extract from a poetic composition usually differs from that of a prose composition. The difference in expectation from these two types of text is exquisitely reflected in the writings of Rabindranath Tagore[10] and Budhadeb Basu[11].

The language normally used in poetry, songs and rhymes is not similar to the language used in texts of literature, essays, science, technology, commerce and newspapers, etc. The use of words, multiword units, sentence structures, idiomatic expressions, etc. in poetry is different from that in a prose text. Similarly, placing the sentence-final verb at the beginning or middle of a sentence to form a different sentence structure, and changing the shape of a line-ending word to form a matching couplet, are common practices in poetry writing. Such uses of words, phrases, and sentences are hardly found in a prose text. In fact, uniqueness of this kind makes the language of poetry greatly different from that of prose. Therefore, within the realm of the informative language of prose, the impressionistic language of poetry has no chance for entrance and coexistence. This rule is followed by most of the text corpora across the world.

Although we use the language of both prose and poetry in various contexts of our life either to convey information or to express our emotion and feeling, the language of prose has a direct and practical role in exhibiting the intricate picture of life, living, society and time, which cannot be performed by the language of poetry. Prose can show reality in a much better way than poetry. On the other hand, the language of poetry can reflect on the mind and heart of the writer in a more profound way, which cannot be done by the language of prose.

At the time of writing prose, the writer is more compact, systematic, and methodical. He tries to arrange his arguments in such a way that there are no loopholes or laxity in his statement. The primary aim of the writer is to acquire compactness in his presentation. Therefore, the writer has no scope for hyperbole or exaggeration. This argument does not hold for poetry, since the very nature of poetry is elusiveness and mysticism. In fact, nobody will raise any complaint about exaggeration in poetry if it succeeds in triggering a vision of another world or a feat of experience in the mind of the readers. In essence, the language of poetry is a manifestation of the 'moment made eternal' whereas the language of prose is the manifestation of the 'reality pricked with crisis and chaos'.

From a purely linguistic point of view, poetry contains a large number of function words like pronouns, indeclinables, prepositions, postpositions, etc., which are often too elusive to clarify the meaning of the text. Also, poetry contains a large number of words that mostly carry emotion and feeling, by which the writer creates a world of imagination that is different from the world of reality. The readers need a tool of vision to break through the cloak of mysticism to reach the world of truth. A prose text, on the other hand, carries a large number of content words like verbs, nouns, and adjectives, which have comparatively fixed meanings to carry specific knowledge and information.

Although there is no apparent difference in quality, there is difference in quantity of expression in the language of the two types. If we consider language as a straight line placed horizontally, the language of poetry will occupy one end while the language of prose will occupy the other end. The language at the end where prose resides is almost concrete, realistic, and pragmatic, while the language where poetry resides is mostly abstract, imaginative, and surrealistic. In the middle of the line lies the whole world of other texts where features of both prose and poetry are intermixed knowingly or unknowingly by the writers.

The language of poetry often tries to differ from that of prose due to various linguistic reasons related to phonology, grammar, semantics, and stylistics. Use of various processes of phonology, neologism, archaism, provincialism, poeticism, syntactic deconstruction, etc is quite recurrent in poetry but rarely observed in a prose text.

Thus, from various angles and perspectives, it is possible to show how the language of prose differs from that of poetry. And, due to these immitigable differences, it has always been considered sensible to keep the language of poetry apart from the language of prose in a corpus.

However, if we find that the lack of proper representation of texts from poetry makes a corpus skewed in its representation of a language, then we can think of generating a corpus of text samples of poetry by collecting large representative samples from songs, poems, verses, rhymes, folksongs, ballads, elegies, etc.

In fact, the generation of a corpus of poetic texts will give us two important opportunities to deal with the language of a speech community. First, we will be able to analyse the language of poetry separately to observe its form and features. Second, we will be in a position to make several comparative studies between the texts of prose and the texts of poetry to trace finer aspects of their similarity and difference.
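One simple comparative measure of the kind suggested here is the share of function words in each type of text. The Python sketch below is only an illustration under loud assumptions: the function-word list is a tiny hand-picked English set, both sample sentences are invented by us, and the name function_word_ratio is hypothetical. A real study would use full prose and poetry corpora and a proper word list for the language concerned.

```python
import re

# Tiny illustrative list (an assumption, not a standard resource).
FUNCTION_WORDS = {
    "the", "a", "an", "of", "in", "on", "to", "and", "but",
    "i", "you", "he", "she", "it", "we", "they", "my", "your", "is", "was",
}


def function_word_ratio(text):
    """Return the proportion of tokens that appear in FUNCTION_WORDS."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return sum(token in FUNCTION_WORDS for token in tokens) / len(tokens)


# Invented one-line samples standing in for prose and poetry corpora.
prose_sample = "The committee approved the annual budget on Monday."
poetry_sample = "And I, in my silence, sang to you of the sea."

print(f"prose:  {function_word_ratio(prose_sample):.3f}")
print(f"poetry: {function_word_ratio(poetry_sample):.3f}")
```

Two invented sentences prove nothing about either genre; the point is only that, once both corpora exist, such a claim can be tested by counting.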

9. Other Limitations

Besides the major limitations discussed above there are some other limitations of a corpus, either written or spoken. Some of these limitations are hinted by Winograd (1983), Kennedy (1998) and others. The most relevant among these are as follows:

A corpus often fails to highlight the social, evocative, and historical aspects related with a language,

From a corpus it is not easy to determine why a particular type of language is used as a standard one while others are used as regional variants, and

Analysis of corpus often fails to show how the linguistic differences play decisive roles to establish and maintain the group identity of speakers; how idiolect determines one's power, position, and economic status in society; and how language differs depending on the domains of usage.

Analysis of corpus also fails to ventilate how a narration of a story, novel or an essay disturbs some of the readers with the evocation of emotion while other readers remain undisturbed; how the knowledge of the world and context play pivotal roles in determining the actual meaning of an utterance; how a living language is forced to evolve with the change of time and society; how a language is divided into many types due to various non-linguistic factors; and how two different languages combine together to give birth to a new language in course of time.

10. Conclusion

There are some obvious benefits of corpus-based language research and application. Both from theoretical and application point of view, it is a powerful method, which is scientific, empirical, realistic, and open to any kind of objective verification (Leech 1992). There is no denial of the fact that quantitative data is necessary not only in the works of language technology but also in other applied fields of linguistics (e.g. lexicography, language teaching, speech analysis, translation, etc.) as well as within the general domains of mainstream linguistics. History has enough evidence to show that many successful approaches to speech analysis have depended on quantitative data obtained from speech databases made in the form of corpora.

In the field of language teaching, definite quantitative information about the occurrence of phonemes, morphemes, words, and sentences obtained from corpora often leads both teachers and students to deal with the language more scientifically and fruitfully. Such information about the statistical frequency of use of various language properties is not available via introspection. Recent quantitative studies based on the Bengali text corpus (Dash 2004) show that intuition about the use of various properties in the language can be misleading.

Because of these advantages, criticisms against corpus linguistics, although they achieved partial success in the initial stage, failed to stop the growth of corpus generation and application. In both phonetics and speech analysis, naturally occurring data has remained an essential source of evidence, where neither introspection nor intuition has any role in linguistic inquiry. In the area of language acquisition, too, observations on naturally occurring evidence have remained authentic for validation and verification, since no introspective judgements are allowed to justify the phenomena observed in the process of language acquisition by infants.

In general, a raw corpus is an aid to prepare and revise text of various types. An annotated corpus, on the other hand, is more suitable for various works of language technology to design systems capable to correct spelling errors, search lexical items, lemmatise words, parse sentences, disambiguate sense variation of words (Winograd 1983: 26). Statistical results obtained from corpora are utilised to prepare materials for language teaching, build optical character recognition systems, develop spelling checkers, etc. (Ljung 1997). Both annotated and unannotated corpora are used for machine translation, electronic dictionary generation, lexicographic works, and language teaching (Wichmann et al. 1997). To sum up, Svartvik (1986) provides a large list where information and examples obtained from corpora are used in various fields of linguistics[12]. In a broad sense, there are several types of corpus use:

Corpus is used as a large diluted source of language data as a yard-stick for linguistic and non-linguistic verification and validation.

It is used as a useful resource in general language study, description, and teaching.

It is used as reliable linguistic treasure-house to build lexical databases, dictionaries, thesauruses, reference books, and course-books.

It is used as test-bed for training and testing devices and tools developed in the field of language technology.

It is used as ready-made resource for multipurpose non-linguistic works for necessary reference.

It is used as customisable database to study particular areas of interest related to life, language, and society.

The multipurpose use of language corpora in the Indian context is far below their use in English and other languages in advanced countries (Dash 2003). There are various reasons behind this. Initially, the most difficult hurdle for us was the lack of adequate knowledge about the method of corpus generation, since it was a new thing in India. The actual task of corpus generation could start only in much later years, after due consideration of the methods adopted for other languages in the world. At the present stage, the number of corpora in Indian languages is very small. Moreover, these are also beyond the reach of the majority of people due to some unavoidable technical and legal constraints. Ignorance about the presence of corpora as well as the rare availability of these databases is also responsible for blocking the path between the corpora and the users. Besides, there is a dearth of information about the actual value of the corpus in linguistics and language technology among Indian scholars (Dash 2003). Therefore, it is really difficult to convince the hardcore traditional linguists about how the corpus is competent to make valuable contributions to research and application of the Indian languages for the benefit of the entire nation.

NOTES

[1] According to a general estimate (Johansson 1991: 312), there were around 10 electronic corpora by 1965, and the number had reached 320 by 1990. In a recent study, however, we counted around five thousand corpora presently available in electronic form across the world. The number will definitely increase if the unknown corpora are taken into consideration.

[2] For instance, the observations, analyses and arguments so far furnished on problems related to syntax are shamelessly skewed and tilted towards examples taken from English and its allied languages. Syntactic problems of other languages, particularly those of the languages of less advanced communities, are rarely addressed and highlighted. Almost all the practitioners of generative linguistics as well as descriptive linguistics have built theories, principles and propositions with recurrent reference to English. Chomsky is no exception. As a result, linguists of other languages have most often either tried to fit their languages within the frame of English or, in other situations, tried to trace some superficial similarities and dissimilarities at the surface level of constructions. Such a tradition of mimicry or senseless imitation is neither useful for English nor beneficial to the non-privileged languages. There is no chance, in the near future, of a change in the attitude of scholars unless corpus linguistics establishes itself as the most powerful domain of language research and application with close reference to each individual language.

[3] In this case we can easily refer to the work of Stenström, Andersen, and Hasund (2002). These scholars critically analysed a speech corpus of present-day London teenagers to trace any clue about the direction in which English will turn in the future. The study is highly useful in this context because it has been able to show how English is likely to be shaped in future by the usage of the new generation.

[4] Scholars (Leech 1992, Stubbs 1996: 231, Stubbs 2000: 17) have drawn an analogy between the importance of the advent of corpora to linguists and that of the invention of the telescope to astronomers. Also, there is Halliday's (1991) famous analogy between the climate (= the long-term, fairly stable, slowly evolving language 'system') and the weather (which can include all sorts of local quirks). When a corpus is small in size and data for any given linguistic feature is sparse, 'weather' effects lead to bad conclusions: a single instance of a linguistic feature may be taken as an aberration. But as a corpus becomes larger, it is easier to tell aberrations from regularities. Of course, this may become a less useful analogy now, since the climate itself is changing and becoming more like the weather.

[5] Since a normal human would require superhuman ability to compile a corpus of several million words manually, we are struck with awe and disbelief to learn how the German scholar Käding (1897), more than a hundred years ago, single-handedly developed and designed a corpus of about 11 million words manually, without the help of modern computers.

[6] Most of these corpora are stored, transcribed, and processed at the Linguistic Data Consortium, USA; the European Language Resource Association, Paris; the Oxford Text Archive, Oxford; and the International Computer Archive of Modern English, Bergen.

[7] These features are, therefore, considered to be the most salient aspects of a language, particularly of a spoken language.

[8] Sukumar Roy is considered one of the best (probably the greatest) writers of children's literature in Bengali. Most of his writings belong to the genre of 'absurd text', where fantasy is boundless and imagination has the wings to soar high above the world of mundane reality. For nearly a century his writings have been the most entertaining source of enjoyment for both the young and the old generations in Bengal.

[9] It is not yet clear to us why texts composed for children carry a larger number of pictorial elements than texts composed for the adult members of a speech community. The argument presented here is nothing more than a simplification of an intriguing question of human psychology and social science. This argument, therefore, is open to any kind of reassessment and refutation by social scientists. Since this is not the place to discuss this issue in more detail, we leave it here for the experts.

[10] Rabindranath Tagore, in his book entitled Banglabhasa Paricay, has argued that a human being uses language not only to convey information to others, but also to express his joy and woe, love and liking, etc. Man builds to address his need but creates to find pleasure. Therefore, language has two important functions: one is related to his need and urgency, while the other is related to his pleasure, to his whimsicality. Human knowledge and thought have their best realisation in the language of science and philosophy, but human emotion and feelings have their best reflection in the language of poetry (translated by the present author).

[11] In an almost similar tone, Buddhadev Basu has articulated the underlying differences between prose and poetry in his book entitled Kalidaser Meghdut. He argues that people of the present age admit that language works in two different ways. In one way it informs, in another way it awakens. In the world of information and knowledge we need clear and cohesive language, full of clarity and transparency, so that knowledge and information fit into the texture of the language without any hitch. But in the language of poetry we look for impression, which will surpass all the barriers of meaning marked by grammar. The language of poetry should have the quality to expand far and wide to awaken the memories, dreams, thoughts, and associations dormant within our mind. The sound in poetry is meant to generate recurrent echoes in the mind of the readers (translated by the present author).

[12] According to Svartvik (1986: 8-9), a corpus is used in "Lexicography, lexicology, syntax, semantics, word-formation, parsing, question-answer synthesis, software development, spelling checkers, speech synthesis and recognition, text-to-speech conversion, pragmatics, text linguistics, language teaching and learning, stylistics, machine translation, child language, psycholinguistics, sociolinguistics, theoretical linguistics, corpus clones in other languages such as Arabic and Spanish - well, even language and sex".

REFERENCES

Andor, Josef (2004) "The Master and his performance: An interview with Noam Chomsky". Journal of Intercultural Pragmatics. 1(1): 93-111.

Chomsky, A. Noam (1959) Review of Verbal Behavior by B.F. Skinner. Language. 35(1): 26-58.

Chomsky, A. Noam (1957) Syntactic Structures. New York and Glasgow: Harper Collins Publishers.

Chomsky, A. Noam (1968) Language and Mind. New York: Harcourt Brace.

Dash, Niladri Sekhar (2003) "Corpus linguistics in India: present scenario and future direction". Indian Linguistics. 64(1-2): 85-113.

Dash, Niladri Sekhar (2004) "Frequency and function of characters used in Bangla text corpus". Literary and Linguistic Computing. 19(2): 145-159.

Dash, Niladri Sekhar (2004) "Issues involved in the development of a corpus-based machine translation system". International Journal of Translation. 16(2): 57-79.

Dash, Niladri Sekhar (2005) Corpus Linguistics and Language Technology. With Reference to Indian Languages. New Delhi: Mittal Publications.

Edwards, J.A. and M.D. Lampert (Eds.) (1993) Talking Data: Transcription and Coding in Discourse Research. Hillsdale, N.J.: Lawrence Erlbaum Associates.

Eeg-Olofsson, M. (1991) "Probabilistic word-class tagging of a corpus of spoken English". In: Eeg-Olofsson, M. (Ed.) Pp. 1-99.

Esling, J.H. and H. Gaylord (1993) "Computer codes for phonetic symbols". Journal of the International Phonetic Association. 23(2): 77-82.

Garside, Roger (1995) "Grammatical tagging of the spoken part of the British National Corpus: a progress report". In: Leech, G. et al. (Eds.) Pp. 161-167.

Halliday, M.A.K. (1991) "Corpus studies and probabilistic grammar". In: Aijmer, K. and B. Altenberg (Eds.) English Corpus Linguistics: Studies in Honour of Jan Svartvik. London: Longman. Pp. 30-43.

Hockett, Charles F. (1948) "A note on structure". International Journal of American Linguistics. 14: 269-271.

Ingram, D. (1989) First Language Acquisition. Cambridge: Cambridge University Press.

Johansson, Stig (1991) "Times change and so do corpora". In: Aijmer, K. and B. Altenberg (Eds.) English Corpus Linguistics: Studies in Honour of Jan Svartvik. London: Longman. Pp. 305-314.

Kennedy, Graeme (1992) "Preferred ways of putting things with implications for language teaching". In: Svartvik, J. (Ed.) Directions in Corpus Linguistics: Proceedings of Nobel Symposium 82. Berlin: Mouton de Gruyter. Pp. 335-373.

Kennedy, Graeme (1998) An Introduction to Corpus Linguistics. London: Addison Wesley Longman Inc.

Labov, W. (1969) "The logic of non-standard English". Georgetown Monographs on Language and Linguistics. No. 22. Georgetown University Press.

Landau, Sidney I. (2001) Dictionaries: The Art and Craft of Lexicography. 2nd Edition. Cambridge: Cambridge University Press.

Leech, Geoffrey (1991) "The state of the art in corpus linguistics". In: Aijmer, K. and B. Altenberg (Eds.) English Corpus Linguistics: Studies in Honour of Jan Svartvik. London: Longman. Pp. 8-29.

Leech, Geoffrey (1993) "Corpus annotation schemes". Literary and Linguistic Computing. 8(4): 275-281.

Ljung, M. (Ed.) (1997) Corpus-Based Studies in English: Papers from the 17th International Conference on English-Language Research Based on Computerised Corpora. Amsterdam-Atlanta, GA.: Rodopi.

McEnery, Tony and Andrew Wilson (1996) Corpus Linguistics. Edinburgh: Edinburgh University Press.

Samarin, W.J. (1966) Field Linguistics. New York: Holt, Rinehart and Winston.

Selting, Margaret and E. Couper-Kuhlen (Eds.) (2001) Studies in Interactional Linguistics. Amsterdam/Philadelphia: John Benjamins.

Stenström, Anna-Brita, G. Andersen, and I.K. Hasund (2002) Trends in Teenage Talk: Corpus Compilation, Analysis and Findings. Amsterdam: John Benjamins Publishing Company.

Stubbs, Michael (1993) "British tradition in text analysis: from Firth to Sinclair". In: Baker, M., G. Francis, and E. Tognini-Bonelli (Eds.) Text and Technology: In Honour of John Sinclair. Philadelphia: John Benjamins. Pp. 1-35.

Stubbs, Michael (1996) Text and Corpus Analysis. Oxford: Blackwell.

Stubbs, Michael (2000) "Society, education and language: the last 2,000 (and the next 20?) years". In: Trappes-Lomax, H. (Ed.) Change and Continuity in Applied Linguistics. Clevedon: BAAL and Multilingual Matters. Pp. 15-34.

Svartvik, Jan (1986) "For Nelson Francis". International Computer Archive of Modern English News. No. 10: 8-9.

Weigand, Edda (2004) "Possibilities and limitations of corpus linguistics". In: Aijmer, K. and J. Allwood (Eds.) Dialogic Analysis VIII. New Trends in Dialogue Analysis. Tübingen: Niemeyer. Pp. 18-35.

Weigand, Edda and M. Dascal (Eds.) (2001) Negotiation and Power in Dialogic Interaction. Amsterdam/Philadelphia: John Benjamins.

Wichmann, A., S. Fligelstone, T. McEnery, and G. Knowles (Eds.) (1997) Teaching and Language Corpora. London: Longman.

Winograd, Terry (1983) Language as a Cognitive Process. Vol. I. Mass.: Addison-Wesley.

Niladri Sekhar Dash, Ph.D.
Linguistic Research Unit
Indian Statistical Institute
Kolkata
West Bengal, India
niladri@isical.ac.in
