LANGUAGE IN INDIA

Strength for Today and Bright Hope for Tomorrow

2 : 1 March 2002

Editor: M. S. Thirumalai, Ph.D.
Associate Editor: B. Mallikarjun, Ph.D.


Ph.D. Dissertation

Transformation of Natural Language into Indexing Language: Kannada - A Case Study

B. A. Sharada, Ph.D.


© 2002. by B. A. Sharada, E-mail: drsharada@sancharnet.in, or sharada@ciil.stpmy.soft.net . Ph.D. in Library and Information Science, Awarded by the University of Mysore, 1999. Guide: Dr. M. A. Gopinath, E-mail: gopinath@nccu.edu, Visiting Professor, School of Library and Information Sciences, North Carolina Central University, Fayetville Road, Durham NC 27707, USA, and formerly Professor and Head, Documentation Research and Training Centre, Bangalore-560 059, India. (Kindly note that the appendices chapter is not included in this presentation for technical reasons. Our scanner is not sensitive enough to make clear copies of the Kannada typewritten materials and black and white diagrams as images. For copies of the appendices, please e-mail Dr. Sharada. Editor, Language in India.)

CONTENTS

Introduction
Chapter One Index and Indexing Language
Chapter Two Theories of Linguistics
Chapter Three Compatibility - Linguistics and Indexing Language
Chapter Four Properties of Kannada
Chapter Five Technical Literature and Glossary in Kannada
Chapter Six Rules for Generating Subject Headings- Pre - coordinate Indexing
Chapter Seven Transformational Grammar and
Analysis of Document Titles in Kannada
Chapter Eight Illustrative Examples in Demonstrating Rules
Chapter Nine CONCLUSION
References

*** *** ***


INTRODUCTION

0. Introduction

There has been a dramatic increase in the quantum of knowledge and information, resulting in increased production of books and other multimedia communication materials, including Compact Disc - Read Only Memory (CD-ROM). These repositories of knowledge are the bridges between information generators and information users. The success of such a repository depends entirely upon how well the recorded knowledge is organized and retrieved.

Classification and indexing are efficient methods of organizing materials subject-wise. Such an arrangement is most useful for effective retrieval of the kind of information required by the patrons and the information scientists serving them. As an aid to this work there are many systematic indexing languages, such as Dewey Decimal Classification, Universal Decimal Classification etc. The significant contribution from India to this field is Colon Classification, developed by Dr. S. R. Ranganathan (SRR).

0.1. Need and Importance of the Study

An Indexing Language (IL) is a technical language based on the structure and functioning of a Natural Language (NL). Development of an IL in a NL is part of the development of a NL. Most of the existing and available ILs are rendered or based upon English. Many ILs are also available in some other languages like French, German, Chinese, Italian etc.

Though India is rich with 1652 mother tongues, out of which 18 are Scheduled Languages included in the Constitution of India, there is a paucity in development of ILs in Indian languages. It is ideal that every language has its own IL and at least a family of languages have an IL.

Karnataka, one of the States of the Union of India, was formed on a linguistic basis on November 1, 1956. The Karnataka Official Language Act 1963 recognized Kannada as its Official Language. This gave a fillip to its extensive use in administration, education and mass communication. The Government, voluntary organizations, institutions and universities are making all-round efforts to develop it as an effective medium of communication for all purposes. However, for want of adequate and appropriate research in Indian languages in the area of IL, libraries and information centers are adopting English coinages as they are, without any alternatives or modified formulations to meet the linguistic and cultural needs.

The structure of Indian languages in general is different from that of English. Hence, each of them needs an IL derived on the basis of its own structure. Since India is a multilingual country and is considered a linguistic area, the comprehensive rules derived in developing an IL in Kannada can be applied to other Dravidian languages and also to all other Indian languages. This study, the preparation of a module, has utilitarian value for preparing a pre-coordinate IL in Kannada in particular and in other Indian languages in general.

The glossary had to be prepared, since there is no authority list or subject heading list in Kannada like the Library of Congress Subject Headings (LCSH) and the Sears List of Subject Headings in English.

0.2. Definition of the Concepts

The following are the operational definitions of some of the important technical terms used in the study.

Natural Language: The NL is the primary medium for human communication. The function of an NL is to communicate the semantic content of its expressions directly.

Indexing Language: The IL is an artificial language made up of expressions connecting several kernel terms. The function of an IL is to do whatever an NL does and, in addition, to organize the semantic content through a different expression, providing a point of access to the seekers of information. An IL is a system for naming subjects and has a controlled vocabulary. The vocabulary of an IL may be verbal or coded. A classification scheme uses a coded vocabulary in the form of notation, and authority lists use a verbal vocabulary.

Kannada: Kannada is one of the 1652 mother tongues spoken in India. Forty three million people use it as their mother tongue. It is also one of the 18 Scheduled Languages included in the VIII Schedule of the Constitution of India. It belongs to the Dravidian family of languages. Within Dravidian, it belongs to the South Dravidian group. It is recognized as the Official Language of Karnataka.

Interdisciplinary Subject: A subject that emerges as a result of interaction between two known, well demarcated disciplines.

Infolinguistics: An interdisciplinary subject that has emerged out of the interaction between the two subjects - information science and linguistics.

Linguistics: Linguistics is considered as scientific study of language.

Linguistic Area: A geographical region determined by shared linguistic characteristics.

0.3. Objectives of the Study

The objectives of this study are as follows:

  1. Exploring the possibility of interdisciplinary perspective between linguistics and information science since linguistics is used as a representation mechanism for the information content of the document.
  2. Study of different linguistic theories and their relevance and application to indexing language.
  3. Study of properties of Kannada relevant to indexing language.
  4. Survey of technical literature in Kannada, its use for the preparation of a model glossary on education using a bibliometric law.
  5. Study of different steps in coining the subject headings and problems involved in deriving the descriptors in Kannada.
  6. Study of feasibility of application of computers for developing IL.
  7. Application of TG to the NL approach of IL and developing parsers.
  8. Preparation of a sample PCIL module in Kannada.

0.4. Hypothesis and Methodology

The major hypotheses on which the research is conceived are as follows:

  1. The need for pre-coordinate indexing language is much felt in Indian languages.
  2. The concepts of IL can be analyzed in a proper perspective with the knowledge of linguistics.
  3. Any language, natural or artificial has its structure and vocabulary.
  4. The pre-coordinate indexing language model derived for Kannada is applicable to all the Indian languages in general and in particular to Dravidian languages.
  5. The word order of Dravidian languages tallies with the facet structure of IL proposed by SRR in his Colon Classification.
  6. The use of computers in developing an IL reduces the size and quantum of terminology, besides simplifying the procedures of indexing, analyzing and problem solving.
  7. Depending upon the need and the purpose, parsers have to be developed in the natural language processing environment. The definition of a parser may also change depending upon the purpose.
  8. Generally, the IL is free from verbs, and it needs parsers that identify the Noun Phrase (NP) alone instead of both NPs and Verb Phrases (VP).

The following methodologies are adopted in the present study: the historical method of IL; the survey method, which involves the sociolinguistic study of the Kannada background; the logical method, which involves a comparative approach to Kannada and English; the statistical method to compile the glossary; the questionnaire method for eliciting document titles; the application of linguistic theories; and the use of computers to develop parsers in the NLP environment. The freely faceted or analytico-synthetic classification system, namely the Colon Classification, the brainchild of S. R. Ranganathan, embodying the prevalent research on the general theory of classification, and the techniques of transformational grammar expounded by Noam Chomsky are used as the basis in designing the IL model in Kannada.

0.5. Scope and Limitations

The dimension of IL is so vast that it covers the whole universe of subjects. The present study to prepare an IL model in Kannada is limited to a sample in the discipline 'Education', and it concentrates on the 'Special Isolate' part of Colon Classification. Some of the rules are retained depending upon their suitability to the Kannada language. Similarly, for analyzing document titles in Kannada and developing parsers in the NLP environment, the Chomskian school of thought is adopted. As far as computer application is concerned, out of the software available for processing Kannada, 'Bhasha' and 'Kavitha' are used for word processing and indexing respectively. Since the present study deals with 'words', the bibliometric model adopted here is 'Zipf's Law', and the CDS/ISIS package is used for creating the inverted file.

0.6. Chapterization

The chapterization is done in such a way that it first gives an introduction to IL in general, followed by theories of linguistics and finally the way in which linguistic theory can be practically applied to IL. Chapter one provides the introduction. Chapters two and three provide the methodology: from linguistics, transformational grammar, discussed in Chapter two, and from information science, Colon Classification, discussed in Chapter three. The basic objective of the present study is to prepare an IL module in Kannada. It has to be derived on the basis of the structure and properties of Kannada, including the technical terminology and the rules for generating subject headings. These are discussed in Chapters four, five and six. Analysis and interpretation of the data are presented in Chapters seven and eight. The last Chapter presents the inferences and findings.

0.6.1. Chapter One: Index and Indexing Language

Chapter one is an introductory chapter on index, indexing language, its role in information retrieval systems and the variety of indexing languages. Linguistics is used as a representation mechanism in Information Science. By applying theories from linguistics to information science, a new interdisciplinary theme integrating information science and linguistics, 'Infolinguistics', is generated.

0.6.2. Chapter Two: Theories of Linguistics

In linguistics, syntax is discussed in different schools of thought. Since the Chomskian school of thought has been adopted for the present study, importance is given here to 'Transformational Grammar' (TG), its place in linguistics, its history and its development. Since 'Case Grammar' is the topic most often touched upon by information scientists, it is also discussed. Important grammatical categories are introduced here.

0.6.3. Chapter Three: Compatibility of NL and IL

The third chapter looks into the compatibility of NL and IL. Here the structures of IL and Indian languages are compared. While parts of speech such as Noun Phrase, Adjective, etc., are used to analyze NL, the fundamental categories mentioned in the 'Colon Classification', such as Personality, Matter, Energy, Space and Time, are used to analyze IL.

In the comparative study of the syntactic structure of NL and IL, it was found that the IL structure was the same for each subject in each language, whereas the structure among the NLs was different, because IL is in the conceptual order and independent of linguistic syntax. Similarity was found among the Indian languages taken in the sample, and their structure tallied with that of IL. The main reason is that most of the Indian languages have a word order of the type 'Subject Object Verb' (SOV), while English has SVO word order, which does not tally with the conceptualized structure of IL.

The Chomskian TG theories are applied to IL in general, from the first generation 'Standard Model' up to the latest 'Government and Binding' theories, which consist of many sub-theories. Out of them, it is illustrated with examples that 'Case Theory', 'Theta Theory' and the 'X-Bar' convention are suitable to IL.

0.6.4. Chapter Four: Properties of Kannada

This chapter identifies the properties of Kannada language and literature and they are discussed in detail. This study helps in analyzing the Kannada titles and tagging them with grammatical categories. The properties discussed here are limited to IL analysis.

0.6.5. Chapter Five: Technical Literature in Kannada

The development of technical literature in Kannada in almost all spheres of life stresses the need for an IL based on its structure. The fifth chapter discusses technical literature in Kannada, its history, objectives and rationale, and the principles used in glossary preparation in Kannada. An experiment is undertaken to prepare a sample glossary in Kannada based on bibliometric laws and with the application of grammatical aspects.

0.6.6. Chapter Six: Subject Headings - Pre-coordinate Indexing

The functions involved in generating subject headings are explained, taking a few existing pre-coordinate ILs as examples in order to prepare the Kannada module. The ISO standard is discussed, and for language standardization 'Kannadashaili kaipidi' is taken as the basis. The list of Main subjects is rendered in Kannada. Cognitive modules are also discussed, and an attempt is made to develop a Kannada expert system based on a knowledge representation module. It is argued that the purpose and objective of the study should be taken into consideration instead of ritually following the NLP models.

0.6.7. Chapter Seven : Application of TG

While Chapter two discusses the theories of TG, the seventh chapter elucidates the practical aspects of the application of TG, wherein the following points are discussed:

  1. The difference between a complete sentence and a document title according to TG.
  2. The syntactic components involved in a title and their origin from a phrase structure.
  3. Application of deep structure and the process involved in arriving to surface structure.
  4. Integration of TG from linguistics and conceptualization from information science, in order to obtain the structure of IL from document titles in Kannada.

To derive rules in (a) the Natural Language Processing (NLP) environment in Kannada and (b) the classificatory structure, an experiment is done by administering the keywords in Kannada to ten experts in a particular field.

0.6.8. Chapter Eight : Illustrating with Examples

Lastly, based on the properties and theories of NL and IL discussed in the previous chapters from one to six, a package is prepared by developing an IL in Kannada. Following are the modules of the package:

  1. Schedule in Kannada for the discipline 'Education' with the list of subject headings with notation.
  2. KWIC and KWOC index for titles in Kannada.

*** *** ***



CHAPTER ONE

INDEX AND INDEXING LANGUAGE

1.0 Introduction
1.1 Infolinguistics
1.2 Classification
1.3 Indexing and Information Retrieval
1.3.1 Indexing Systems
1.3.2 Varieties of Indexing Systems
1.3.2.1 Derived or Natural Language Indexes
1.3.2.2 Mechanized Information System
1.3.2.2.1 Title Based Indexing
1.3.2.2.2 Catch Word - Title Indexing
1.3.2.2.3 Keyword in Context Indexing
1.3.2.2.4 Keyword out of Context
1.3.2.3 Citation Index
1.3.2.4 Automatic Indexing
1.3.2.5 Permuted Index or Coordinative Systems
1.3.2.5.1 Pre - coordinate Indexing
1.3.2.5.1.1 Pre - coordinate Indexing Languages
1.3.2.5.2 Post - coordinate Indexing
1.3.2.5.2.1 Computer Based Post - coordinate Systems
1.3.2.5.2.2 Post - coordinate Indexing Language
1.4 Conclusion

1.0. Introduction

Information science is an intra- and trans-disciplinary science serving all other sciences with its theory and practice, aimed at preparing and providing 'information data' and useful information wherever necessary for the proposed goal, eventually benefiting mankind and its future (Curras, 1992). The present era has been called 'the age of information'. Language is not a barrier to the growth of knowledge. The information flood is extensive and complex, but at the same time human memory has not grown in size. The main focus of information science is to closely match the two states of the mind, namely,

  1. Formal or information generation.
  2. Informal or information seeking and information utilization.

The 'Text' will be formal, comprising information conveyed by a language in the form of words → phrases → sentences → paragraphs → chapters → entire text. The volumes of the text will be the unity of the ideas, comprising formal grammar, semantics and other linguistic units. This will be the structure of knowledge.

The user's need in terms of search expression will be informal. Information seeking is its main function. The main constituents are the thought formulation for a search and the role of language. This comprises starting → browsing → connecting → focusing → expressing. In this, the hierarchy of thought is created.

The following schema presents the two states of mind - Formal and Informal:

Figure 1: The two states of mind, formal and informal

The main focus of information science is to closely match these two states of mind, i.e., formal and informal, or information generation and information seeking and utilization. Therefore it is necessary to organize information at various levels of technological development. To cope with this, an information processing system such as a search language reduces information into a set of parameters and projects the contextual relevance.

1.1. Infolinguistics

Theoretical studies of search language require a theoretical framework. A new field of study called 'Infolinguistics' has been generated through an interdisciplinary approach arising out of 'Information Science' and 'Linguistics' (Sharada 1995 a, b). Here Linguistics is used as a representation mechanism for the information content of the text of a document. In other words, it provides a surrogate for information, and this forms the main function of Infolinguistics. The representational properties of language are syntax and semantics. Syntax deals with the analysis of the structure of a sentence, and semantics studies the meaning. Keeping this in view, Infolinguistics can be defined as the syntactic representation and semantic interpretation of natural language for indexing purposes.

1.2. Classification

The new role of search language or classification in information science is to act as a filter for the information flood. To put it in the words of S. R. Ranganathan (1944):

Classification is a lingua franca for knowledge processing and use. A lingua franca with fixed etymology and semantics and a syntax capable of marshaling and presenting it all in most helpful filiatory order is indispensable.

The arrangement of documents is wholly dependent on the indexing scheme that is adopted by the system.

1.3. Indexing and Information Retrieval

"Index is that which serves to direct to a particular point or conclusions" (Clark 1933). In the context of information retrieval systems, an index is a mechanism or tool that indicates to the searcher the information potentially relevant to a query. In the library, shelf arrangement and the card catalogue are considered forms of index, since they serve to indicate classes of documents.

The first function of an index is to act as a link between a source of information and its user. When the size of the collection is quite large, an index is an essential tool for retrieval. A good index minimizes the search effort and ensures optimum results. The index performs a wide and important role in an information retrieval system. The indexer serves as an intermediary between authors and users with the help of an Indexing Language (IL). An IL is a system for naming subjects. It is an artificial language adapted to the requirements of indexing. Like any language, an IL also consists of two basic elements:

  1. Vocabulary - a list of terms used in the system.
  2. Syntax - the recognized pattern of relationship between the terms used in the system.

If the terms that appear in the documents are used without any modification, the language of the index is a natural language (NL). Since the usage of an NL leads to many problems, such as those arising from the use of different words by different authors to denote the same idea, an alternative to NL is to use an artificial language adapted to the specific needs. Such a language operates with a controlled vocabulary. An IL that has a controlled vocabulary and attempts to indicate the relationships between terms in the index vocabulary is systematically structured.
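The effect of a controlled vocabulary can be sketched in a few lines of Python. This is only an illustration: the authority list below is hypothetical (it is not taken from the thesis glossary), and the function names are assumptions introduced for the example. Variant terms used by different authors are mapped to one preferred term, so that documents on the same idea are gathered under one heading.

```python
# Hypothetical authority list: each variant term that authors may use is
# mapped to one preferred term of the controlled vocabulary.
AUTHORITY = {
    "teaching": "education",
    "instruction": "education",
    "schooling": "education",
    "education": "education",
}


def to_preferred(term, authority=AUTHORITY):
    """Return the preferred index term, or None if the term is not controlled."""
    return authority.get(term.lower())


def index_terms(natural_terms, authority=AUTHORITY):
    """Map natural language terms to a sorted list of distinct preferred terms."""
    preferred = {to_preferred(term, authority) for term in natural_terms}
    preferred.discard(None)  # terms outside the vocabulary are not indexed
    return sorted(preferred)


print(index_terms(["Teaching", "schooling", "weather"]))
```

Terms that are absent from the authority list are simply dropped here; a real system would more likely refer them to the indexer for a decision.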

The artificial language uses concept indexing rather than term indexing. The terms are representative of the NL used by authors. The concepts take the standard descriptions established in the IL. The NL is flexible and allows authors to use different terms to denote the same concept. The indexer, who is more concerned with the ideas conveyed than with the niceties of language, depends upon an artificial language. All structured ILs are based upon careful subject analysis. The vocabulary of an IL is verbal or coded. A classification scheme employs a coded vocabulary in the form of its notation. Thus, for example, in the Colon Classification (CC) Schedule 'Indian History' is rendered as V.44. In the Sears List of Subject Headings, which employs a verbal vocabulary, it is rendered as: India - History. In any case, the selection of terms to be used in each discipline is primary, and coding is done at a later stage.

1.3.1. Indexing systems

An indexing system is a systematic organization of documents for retrieval. In an information retrieval system (IRS), the index serves as a guide to the concepts in a collection of documents. It informs the searcher of the existence of documents and contains document surrogates, such as author, title, imprint, call number etc. An index is a systematic guide to concepts derived from a collection of documents, represented by entries arranged in a known and searchable alphabetical, numerical or classified order. In library terminology, an index is an indicator of content and location, or descriptor and locator. In an IRS an index performs two simultaneous functions:

  1. Retrieving information on documents that are required, and
  2. Holding back information on documents that are not required.

In the context of an IRS, the term index is primarily used for a system capable of retrieving information about required documents based on a particular subject. The principal index is the subject index.

Subject indexing as a process involves four major operations such as:

  1. Analyzing,
  2. Arranging,
  3. Assigning notations, and
  4. Maintenance of a search file.

The first step is conceptual analysis, deciding what the document is about. The second step is translating the conceptual analysis into index terms, which act as labels for the subject matter, and sequencing them in a meaningful syntactic order called citation order. The third step is assigning notational symbols, which help in retrieval. The fourth step is arranging the entries in a searchable order, or maintaining a search file.

Linguistically, the text in a document is made up of terms. A request for the document is also made up of terms. Such a request is conceptually analyzed and described by means of the controlled vocabulary. The request is matched against the search file or index, and information about the document is retrieved. The two characteristics of indexing, exhaustivity and specificity, affect two important measures of an IRS, namely recall and precision, which operate at the search stage or output stage of the system (Brown, 1982). The rules of all indexing systems are designed to increase recall and efficiency and, to a certain extent, precision also.

Recall: The IRS must be able to retrieve information in response to the reader's requests, which vary from a single specific document to a set of articles on a particular subject. A document that is useful for the information need that prompted the user's request may be termed a 'relevant document'. The ability of the IRS to point to all the relevant documents is known as the 'recall power' of the system, which implies quantity. Hence the recall performance of an IRS can be expressed quantitatively by means of a ratio called the recall ratio, as given below:

\[ \text{Recall ratio} = \frac{R}{C} \times 100 \]

Where R is the number of relevant documents retrieved against a search and C is the total number of relevant documents to that particular request in the collection.

Precision: In an IRS, index acts as a filter. If Recall is the measure of system's ability to let through wanted items, precision is the measure of the system's ability to hold back unwanted items. The formula for Precision is:

\[ \text{Precision} = \frac{R}{L} \times 100 \]

Where R is the total number of relevant documents retrieved in that search and L is the total number of documents retrieved in that search. The precision ratio is a qualitative one. Usually, for a common frame of reference, the following terms are used:

  1. Hits = All relevant documents retrieved. They add to recall and precision.
  2. Misses = All relevant documents not retrieved. They lower recall.
  3. Noise = All irrelevant documents retrieved against a search. It lowers precision.
  4. Dodged = All irrelevant documents not retrieved.
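The two ratios and the four terms above can be worked through with a minimal Python sketch. The document identifiers and the function name `retrieval_measures` are assumptions made for the illustration; R, C and L follow the definitions given with the formulas.

```python
def retrieval_measures(relevant, retrieved, collection):
    """Compute recall and precision (as percentages) for a single search.

    relevant   -- documents in the collection relevant to the request (C of them)
    retrieved  -- documents actually retrieved by the search (L of them)
    collection -- every document in the collection
    """
    relevant, retrieved, collection = set(relevant), set(retrieved), set(collection)
    hits = relevant & retrieved                  # relevant and retrieved (R)
    misses = relevant - retrieved                # relevant but not retrieved
    noise = retrieved - relevant                 # retrieved but not relevant
    dodged = collection - relevant - retrieved   # neither relevant nor retrieved
    recall = 100 * len(hits) / len(relevant) if relevant else 0.0
    precision = 100 * len(hits) / len(retrieved) if retrieved else 0.0
    return {
        "hits": len(hits),
        "misses": len(misses),
        "noise": len(noise),
        "dodged": len(dodged),
        "recall_ratio": recall,
        "precision_ratio": precision,
    }


# Hypothetical collection of ten documents, four of them relevant to the request.
collection = [f"d{i}" for i in range(1, 11)]
result = retrieval_measures(
    relevant=["d1", "d2", "d3", "d4"],
    retrieved=["d1", "d2", "d5"],
    collection=collection,
)
print(result)
```

In this example two of the four relevant documents are retrieved, so the recall ratio is 50, while two of the three retrieved documents are relevant, so the precision ratio is about 67.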

Information retrieval is the provision of enough (quantity) and relevant (precision) responses to requests for information. Indexing the concepts with one of the indexing systems as a tool makes information retrieval possible. The IL consists basically of an index vocabulary, together with means of showing semantic relations to help recall and syntactic devices to help precision.

1.3.2. Varieties of Indexing Systems

Subject indexing systems are the tools with which subject indexes are prepared. A subject index is the index of concepts found in a collection of documents. The following schema presents the different kinds of indexing systems:

Figure 2: Indexing System

Since the target NL of the present study is Kannada, the examples of document titles are selected from Kannada.

1.3.2.1. Derived or Natural Language Indexes

Indexes for a book can be of three kinds:

  1. Author index,
  2. Title index, and
  3. Subject index.

Conrad Gesner's Bibliotheca Universalis listed documents in alphabetical order of the authors' forenames in 1545. Later, in 1548, he listed the same documents in a subject classification order with an alphabetic subject index to the classification codes. This can be considered the genesis of all the present indexing systems and techniques. In 1856 Andrea Crestadoro made an attempt to show the importance of the titles of documents in cataloging work. Later, in 1959, H. P. Luhn of IBM, utilizing the power of computers, developed a new indexing technique called the Key Word in Context (KWIC) index. From the 1970s, with the rise of Selective Dissemination of Information (SDI) services, titles of scientific documents began to play a significant role in science communication. Title based indexes depend upon manipulation of all the key words in the title to give multiple entries, one entry for each significant word. No attempt is made to use the indexer's own knowledge of the subject or other guides; only the information manifest in the document is used to derive the index. Indexing thus derived directly from the document is derived indexing.

1.3.2.2. Mechanized Information System

A great deal of research is conducted in the application of computers to the intellectual aspects of information retrieval in: (a) creation of index term profiles for documents, (b) creation of abstracts, and (c) automatic derivation of classificatory structures that display relation between document classes, etc. Computers help to process large quantity of data at very high speed. Derived indexing involves minimum intellectual effort and is therefore well suited to computer processing which can give a variety of products from the same input. There are several methods to produce title based indexes.

1.3.2.2.1. Title Based Indexing

The title of a document is ambiguous because the author tries to codify the topic or theme of his work in it. In some books a very clear indication of what the book is about will be given in the title. For example, pashu sangoopane mattu kooli saakane.

At the same time, some titles will not be of any help in understanding the content of the book, because they have been chosen to attract readers' attention rather than to state the subject coverage. For example, sari hejje. This book deals with error analysis in language teaching.

In some cases, authors choose different words to name their books on the same subject. For example,

hariharadeeva
harihara kaviya eradu ragalegalu
hariharana puraatana ragalegalu
hariharana nuutana ragalegalu

If the significant word in each title is same, such word can be used as a basis for the retrieval system.

1.3.2.2.2. Catch Word - Title Indexing

Catch word indexing is very simple and suitable whenever a large quantity of titles is to be processed. 'British Books in Print' has adopted this method.

1.3.2.2.3. Key Word in Context Indexing (KWIC)

The KWIC index is a further development of catch word title indexing and is the simplest form of machine generated index. Provided with a stop word list, the computer ignores all syntactical words such as articles, prepositions etc., and selects the remaining words in the title as indexing words. The result of the machine manipulation is an index of key terms printed in alphabetical order, together with the text immediately surrounding each term: each significant word, as entry point, appears in a designated middle position, while the rest of the title is printed on either side. The alphabetical filing is done on the basis of the key word printed in bold letters in the middle. The only disadvantage of KWIC is that it is entirely dependent upon authors giving titles of descriptive quality. This is successful in Kannada and is demonstrated in Chapter Eight.

bhaaratada samskrutiya adhyayana
praachiina bhaaratada itihaasa mattu samskruti
pravaasi kanda bhaarata
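The KWIC procedure described above can be sketched in Python using the three transliterated titles. This is a minimal illustration and not the software used in the thesis: the stop word list (containing only 'mattu', "and") and the function names are assumptions, and no attempt is made to reduce inflected forms such as 'bhaaratada' to a base form.

```python
# Assumed stop word list for the illustration: 'mattu' ("and").
STOP_WORDS = {"mattu"}

TITLES = [
    "bhaaratada samskrutiya adhyayana",
    "praachiina bhaaratada itihaasa mattu samskruti",
    "pravaasi kanda bhaarata",
]


def kwic(titles, stop_words=STOP_WORDS):
    """Return (left context, key word, right context) entries filed by key word."""
    entries = []
    for title in titles:
        words = title.split()
        for i, word in enumerate(words):
            if word.lower() in stop_words:
                continue  # syntactical words never become entry points
            left = " ".join(words[:i])
            right = " ".join(words[i + 1:])
            entries.append((word.lower(), left, word, right))
    entries.sort()  # alphabetical filing on the key word
    return [(left, word, right) for _, left, word, right in entries]


def format_kwic(entries, width=40):
    """Place every key word in the same middle column, context on either side."""
    return "\n".join(f"{left:>{width}}  {word}  {right}" for left, word, right in entries)


print(format_kwic(kwic(TITLES)))
```

Each title yields one entry per significant word, so the three titles produce ten entries in all.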

1.3.2.2.4. Key Word Out of Context (KWOC)

In KWOC every index word is extracted from its context and printed separately in the left hand margin, with the unmodified title in its normal order printed to the right.

bhaarata -- bhaaratada samskrutiya adhyayana
bhaarata -- praachiina bhaaratada itihaasa mattu samskruti
bhaarata -- pravaasi kanda bhaarata

In this system titles are liable to give rise to a number of entries depending upon the significant terms. Therefore they are normally used as indexes, i.e., guides leading to entries in a separate list, rather than as methods of arrangement of items. This has also been achieved in Kannada and demonstrated in Chapter Eight.
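A KWOC listing of the kind shown above can be sketched in Python as follows. The sketch rests on stated assumptions: the stop word list and the small table of base forms (which files 'bhaaratada' under 'bhaarata', as in the example entries) are hypothetical stand-ins for a proper morphological analysis of Kannada, and the function name is introduced only for illustration.

```python
# Assumed stop word list and hypothetical table of base forms.
STOP_WORDS = {"mattu"}
BASE_FORMS = {"bhaaratada": "bhaarata", "samskrutiya": "samskruti"}

TITLES = [
    "bhaaratada samskrutiya adhyayana",
    "praachiina bhaaratada itihaasa mattu samskruti",
    "pravaasi kanda bhaarata",
]


def kwoc(titles, stop_words=STOP_WORDS, base_forms=BASE_FORMS):
    """Return sorted (heading, unmodified title) pairs."""
    entries = set()
    for title in titles:
        for word in title.lower().split():
            if word in stop_words:
                continue
            heading = base_forms.get(word, word)  # file inflected forms together
            entries.add((heading, title))
    return sorted(entries)


for heading, title in kwoc(TITLES):
    print(f"{heading} -- {title}")
```

Because the key word is taken out of its context, the title itself is printed unchanged, and all three titles are gathered under the heading 'bhaarata'.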

Further enriched KWIC or KWOC gives index entries wherein additional terms are inserted into the title or added at the end. This involves intellectual effort in the selection of additional terms. In recent years there has been considerable pressure on authors to give their papers meaningful titles which can be used in computer generated indexes.

The KWWC is based on similar principles, except that it is a 'key word with center' index. The KEYTALPHA is just a modified form with key terms arranged alphabetically. The WADEX is the words and author index: along with the key words, the author is also indexed.

1.3.2.3 Citation Index

Eugene Garfield was the first to realize the presence of 'a cognitive and moral connection' between sources and their references. He showed the possibility of constructing an index on the basis of a structured list of all references in a given collection of articles, where each cited reference is followed by all the citing documents.

All documents are likely to contain a list of references or bibliographic citations. This is the way in which the author shows the foundation on which the document is prepared. Hence there is a link between the document and the items cited in its list of references. This can be inverted to say that there is a link between the original item and the documents citing it, so that under one cited document all the documents that have cited it are listed. For example, if three papers A, B and C have cited X, then the citation index will list the citing documents A, B and C under the cited document 'X'. By scanning a very large number of documents by means of a computer, the citation index can establish a much larger number of such links between scientific articles and their citations.
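
The inversion just described can be illustrated with a short Python sketch. The papers A, B, C and the cited document X come from the example above; the extra cited items Y and Z and the function name are assumptions added only to make the illustration less trivial.

```python
from collections import defaultdict


def build_citation_index(references):
    """Invert {citing document: [cited items]} into {cited item: [citing documents]}."""
    index = defaultdict(list)
    for citing in sorted(references):
        for cited in references[citing]:
            if citing not in index[cited]:
                index[cited].append(citing)
    return dict(index)


# Reference lists of three citing papers (Y and Z are hypothetical).
REFERENCES = {
    "A": ["X", "Y"],
    "B": ["X"],
    "C": ["X", "Z"],
}

if __name__ == "__main__":
    for cited, citing in sorted(build_citation_index(REFERENCES).items()):
        print(cited, "<-", ", ".join(citing))
```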

Science Citation Index 1961 -
Social Science Citation Index 1966 -
Arts and Humanities Citation Index 1977 -

These indexes cover over 5000 periodicals. The periodicals are scanned, and all the bibliographic links found are fed into a computer to generate the citation index, corporate index and source index. Citation indexes are yet to be prepared or generated in Indian languages, including Kannada.

1.3.2.4. Automatic Indexing

In the present state of the art, there are many ways of using computers to derive suitable indexing terms and to produce a conventional index of the type found at the end of books. Some software is designed specifically for the computerized management of structured databases, for example Micro CDS/ISIS, devised by the UNESCO Library, Archives and Documentation Services. It is a generalized information storage and retrieval system. It enables the setting up of fast access files to facilitate quick search and retrieval of records from a database. One of these files is the field select table (FST), which specifies the indexing parameters for the database. The CDS/ISIS provides five different indexing techniques, listed below, together with several facilities for formulating search expressions, interfaces in the PASCAL language for strong search in a given field, and facilities for thesaurus construction, maintenance and use, for which it provides a powerful search facility.

The IT Codes are as follows:

0 Builds an element from each line extracted by the format; useful for indexing whole lines.
1 Builds an element from each sub field or line extracted by the format.
2 Builds an element from each string of characters enclosed in angular brackets (< >).
3 Same as indexing technique 2, except that slashes (/../) are used instead of angular brackets.
4 Builds an element from each word, prefixed and suffixed with a space.
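
The effect of some of these techniques can be imitated in Python as shown below. This is a rough emulation of the behaviour described in the list, not CDS/ISIS code and not FST syntax; the function name and sample strings are assumptions, technique 1 (sub fields) is left out, and the stop word check stands in for the stop word file.

```python
import re


def index_elements(text, technique, stop_words=frozenset()):
    """Imitate indexing techniques 0, 2, 3 and 4 on text already extracted by a format."""
    if technique == 0:
        elements = text.splitlines()                 # one element per line
    elif technique == 2:
        elements = re.findall(r"<([^<>]+)>", text)   # strings in angular brackets
    elif technique == 3:
        elements = re.findall(r"/([^/]+)/", text)    # strings between slashes
    elif technique == 4:
        elements = [w for w in text.split() if w.lower() not in stop_words]
    else:
        raise ValueError("technique not covered by this sketch")
    return [e.strip() for e in elements if e.strip()]


if __name__ == "__main__":
    print(index_elements("A study of <Kannada grammar> and <indexing>", 2))
    print(index_elements("A study of /Kannada grammar/ and /indexing/", 3))
    print(index_elements("A study of Kannada grammar and indexing", 4,
                         stop_words={"a", "of", "and"}))
```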

To prevent non-significant words from being indexed, a stop word file needs to be prepared for the database. Even without knowing the full title of a document, readers can retrieve it with the help of one or two relevant keywords. There are instances where the computer based system contains the whole text of documents. In such cases one can retrieve part or all of the text in response to a query. Developments in computer technology have made the introduction of such services technically feasible, and they are now becoming economically feasible also. Such automatic indexing is possible in Kannada, either by transliterating the titles into Roman script or with the help of the GIST script processor. With the help of GIST, the data can be entered in Kannada script in the CDS/ISIS and the terms will be indexed in Kannada alphabetical order.

1.3.2.5. Permuted Index or Coordinative Systems

The index language helps to index both single concepts and compound subjects made up of a number of concepts. As shown in Figure 2, coordinative systems can be divided into two kinds, namely pre - coordinate and post - coordinate indexes. In pre - coordinate indexing, subjects, including compound subjects, are analyzed into their constituent concepts, and the concepts are cited in a prescribed sequence to constitute the scheme of classification, subject heading etc. Since all the terms are determined in advance in the schedules or schemes of subject headings, the class relationships are expressed once and for all. The indexer or classifier coordinates the appropriate terms at the time of indexing a document: a string of terms denoting the concepts found in the document is joined together to represent the document. Since the concepts and their relations are predetermined, the pre - coordinate system is completely dependent upon the concept relations implicit in the index terms assigned to describe the individual document. Classification schemes like Colon Classification, Dewey Decimal Classification, UDC, Alphabetical Subject Catalog, etc., are examples of pre - coordinate indexing systems. They serve to arrange documents on the shelf and help in retrieving them from a collection. Since the concept coordination takes place at the input stage (while indexing), this principle is called pre - coordinate indexing.

The ILs like CC, which are based upon the principles of analysis and synthesis, are called 'Analytico - synthetic' or faceted classifications. In order to classify a compound subject in CC, the indexer must first analyze the subject into its elementary constituents, then locate these elements in the CC Schedule and recombine or synthesize them to form the compound subject expressed in notational terms. The CC does not enumerate compound subjects. Many other schemes do list or enumerate compound subjects: they attempt to provide ready made notations for compound subjects as expressed in documents. Such schemes are commonly called Enumerative classifications. Example: Dewey Decimal Classification (Brown 1982).

1.3.2.5.1. Pre - coordinate Indexing

The three major areas to be considered for indexing are: (a) Shelf arrangement of books (b) Library catalogues and bibliographies and (c) Book indexes.

  1. Shelf classification: In present day open access libraries the books are to be arranged in a way helpful to the readers. The most beneficial arrangement is one in which all the related subjects are brought together in a systematic or classified order. Most of the indexing languages like DDC, CC etc., have been devised with this objective.
  2. Library catalogues and bibliographies : A library catalogue records the stock of that library, whereas a bibliography is not limited to the stock of a library but has other limitations such as national, international, language, subject etc. At the subject level both are alike. The arrangement of catalogues could be:
    1. Alphabetical subject catalogue : Subject entries and cross references are arranged alphabetically in one sequence.
    2. Classified catalogue : Related subjects are brought together by using notation as its code vocabulary.
    3. Feature headings : Feature headings are guide cards, each bearing relevant class number and NL term.
    4. Alphabetico - classed catalogue : Combination of the alphabetical approach with the helpful groupings of the systematic approaches, wherein the headings are indirect. For example: Aluminium will be entered under metals - non-ferrous - aluminium, not under Aluminium itself, with the result that all entries on metals will be grouped together under metals.
    5. Multiple entry system: This system involves multiple entries.
    6. Unit entry forms : Card catalog, usually of the standard size 12.5 X 7.5 cm, arranged in the libraries according to the indexing system headings. New cards are added wherever they are needed.
    7. Book forms : At one time this was popular in public libraries with closed access, where the catalogs were printed in book form.
    8. COM : Computer Output in the form of Micrographics.
    9. MARC : Machine Readable Cataloging began in 1966 as a cooperative venture involving 16 libraries other than Library of Congress.
    10. On Line Catalogs : The catalogs are held by computers with access through on - line terminals.
    11. Bibliographies : These are normally printed and intended for wide distribution. They may be current or retrospective.

On the whole, these pre - coordinate systems are basically one-place systems following the citation or significance order. At the search stage pre-coordinate systems present certain advantages: a number of searches can be conducted simultaneously. Pre - coordinate systems, which have been severely criticized in recent years by advocates of post-coordinate methods, are yet to be restored to their previous importance by the computer revolution.

1.3.2.5.1.1. Pre - Coordinate Indexing Languages

The key part of a classification scheme is the Schedule - the index vocabulary. The following indexing languages are widely used:

  1. The Decimal Classification of Melvil Dewey (DDC). This is considered the first IL in library classification and is used mainly in public libraries.
  2. The Universal Decimal Classification (UDC). Originally based on the fifth edition of the DDC, this is the second major scheme. It is widely used in special libraries.
  3. The Bibliographic Classification of H.E.Bliss (BC)
  4. The Colon Classification of S.R.Ranganathan (CC)
  5. The Library of Congress (LC)
  6. Subject headings used in the dictionary catalogues of the Library of Congress (LCSH). Basically, LC is intended for shelf arrangement and is complemented by an alphabetical subject catalogue arranged according to LCSH.
  7. Sears List of Subject Headings

The above mentioned systems are available only in English and some other foreign languages but not in any of the Indian languages. There are some more schemes, like the Subject Classification of J.D.Brown (SC), but they are not in vogue in many libraries. The classification schemes mentioned above rely on main classes or the traditional disciplines. But in the present information era, research in all disciplines has given rise to interdisciplinary topics. To take these new topics into account, research is being conducted in the field of IL. For example: Classification Research Group (CRG), Broad System of Ordering of UNISIST (BSO), PRECIS, POPSI etc.

PRECIS: PRECIS is the abbreviation of Preserved Context Indexing System. It was designed to generate subject headings with the help of the computer. It is one of the best systems currently available, based on more than 20 years of experience in the detailed indexing of books for BNB and on theoretical work carried out by CRG. It is an alphabetical subject building system based on the semantic and syntactic characteristics of the language. The syntactic relationships are shown by a set of role operators. In the NL, the passive voice form is preferred over the active voice (Austin 1984).

POPSI: The Postulate-based Permuted Subject Indexing (POPSI) was developed through logical interpretation of the deep structure of subject indexing language (SIL). The POPSI draws attention to the helpfulness of adopting a suitable device for ensuring an optimally effective organizing classification through the alphabetization of verbal subject - propositions. The POPSI prescribes the use of apparatus words, such as prepositions, conjunctions, participles etc., as and when necessary to communicate the exact meaning of subject - propositions. These words are put in parentheses and are ignored in alphabetization. Since the POPSI index entries are all verbal, filing them in one alphabetical sequence in a unipartite index is easy. The POPSI procedure involves: (a) Analysis (b) Formalization (c) Standardization (d) Modulation (e) Organizing classification entry (f) Terms of approach (g) Associative classification entries and (h) Alphabetization. One of the POPSI's special features is its technique of generating and organizing classification by juxtaposition of subject propositions in the verbal plane (Bhattachrya 1990).

1.3.2.5.2 Post Co-ordinate Indexing

Systems that allow class relations to be exploited by manipulation of classes at the time of searching are termed post coordinate systems. In these, the documents are indexed by terms denoting individual concepts. The headings are single concepts, each carrying the code or accession number of the document. This allows free manipulation of terms at the time of search to retrieve information on documents with any logical combination of terms. Single concepts are thus coordinated to build up a composite subject at the output stage instead of at the input stage. The use of a post coordinate system implies the use of some new kind of physical medium rather than the conventional card catalog. A few of the manual post co-ordinate indexes are : (a) Unit term (b) Optical co-incidence card and (c) Peek - a - boo.
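
The coordination of single-concept headings at the search stage can be sketched in Python with sets of accession numbers. The documents, accession numbers and function names below are hypothetical and serve only to illustrate the principle; intersecting the sets corresponds to superimposing unit term or peek - a - boo cards.

```python
from collections import defaultdict


def build_term_index(documents):
    """Map each single-concept heading to the accession numbers posted under it."""
    index = defaultdict(set)
    for accession_number, terms in documents.items():
        for term in terms:
            index[term].add(accession_number)
    return index


def search_all(index, *terms):
    """Documents posted under every one of the given terms (logical AND)."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


def search_any(index, *terms):
    """Documents posted under at least one of the given terms (logical OR)."""
    return set().union(*(index.get(term, set()) for term in terms))


# Hypothetical accession numbers and index terms.
DOCUMENTS = {
    101: {"bhaarata", "itihaasa"},
    102: {"bhaarata", "samskruti"},
    103: {"bhaarata", "itihaasa", "samskruti"},
}

if __name__ == "__main__":
    index = build_term_index(DOCUMENTS)
    print(sorted(search_all(index, "itihaasa", "samskruti")))
```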

In the words of Robert Collison (1959),

One of the most exciting experiments in indexing in this generation is the process invented by Mortimer Taube and his associates in documentation. It is based on the unit term system of coordinate indexing. The theory is that each title, each article etc., can be reduced for indexing purposes to a number of basic ideas capable of being represented mostly by single terms.

Early proponents of post - coordinate indexing claimed that, to select the correct key words, it was sufficient to read through the document to be indexed and underline the significant words (Fosket 1981). This process does not take synonyms into account and cannot demonstrate any kind of relationship. To achieve good results under normal conditions, it is better to use a controlled vocabulary with post - coordinate indexing, as is done in pre - coordinate indexing. While selecting the terms, the indexer has to choose a preferred term and refer to it from its synonyms, distinguish homographs and be aware of semantic relations. The need to refer from the subject file to the accession file is a disadvantage of post - coordinate systems, since it makes searching more tedious than with a card catalogue. To overcome this, two methods have been suggested. The first is a Master Matrix with a micro - image of an abstract of each document at the appropriate position; peek - a - boo cards are superimposed on it, and the images at positions where holes are present in all the cards are projected one at a time on to a screen. The second method is a development of the dual dictionary, using a computer. It is simple to print out the contents of a post-coordinate index in the form of a series of headings under which document numbers are listed; the contents of a set of unit term cards are thus transferred to a printed sheet. If two such printouts are made and bound side by side, comparing the entries under two headings is made easy. It would be easier still if brief details of each document were printed in one of the lists by the side of each accession number, which helps in locating relevant documents (Fosket, 1981). None of these systems has been tried out in Indian languages.

1.3.2.5.2.1 Computer Based Post - Coordinate Systems

The majority of computer based systems are indexed by post - coordinate methods or use text searching, except for a few pre-coordinate systems like PRECIS, BTI etc. A few examples of computer based systems are: MEDLARS, ERIC, CAS, and ISI.

MEDLARS: The Medical Literature Analysis and Retrieval System is typical of a very large number of data bases linked to the production of a printed index. It is one of the first models of computer - based services depending upon intellectual indexing. Demand searches, the SDI service, the on-line access system etc., are unique features of MEDLARS. Other data bases have benefited from this pioneering work.

ERIC: The Educational Resources Information Center serves as a clearing - house for educational information. It was established in view of the publication of an increasing number of reports without adequate bibliographic control. The journals Resources in Education and Current Index to Journals in Education cover report literature from 1966 and 1969 respectively. The reports are given an ERIC document number. The ERIC Thesaurus is also available in machine readable form for performing searches. The full database is available through various utilities, like DIALOG, AUSINET etc.

CAS: The Chemical Abstracts Service is a very important abstracting service in the field of Science and Technology. The whole operation is computerized once the abstracts have been produced and key words allocated. DIALOG has a file, CA Search.

ISI: The Institute for Scientific Information (ISI) uses only manifest information like authors, titles, citations and bibliographical references. The Science Citation Index has been produced since 1964. In 1973, the Social Science Citation Index was set up to cover the areas of the social sciences. The Arts and Humanities Citation Index is also produced by ISI to cover the humanities disciplines. The citation indexes are computer based. They lend themselves to a variety of uses. A substantial part of the database is available through DIALOG.

The MEDLARS and ERIC use a controlled descriptor vocabulary for indexing, while CA uses keywords and titles. In all of them, text searching techniques may be used to search the NL sections of each entry.

1.3.2.5.2.2 Post - coordinate Indexing Language

A post-coordinate indexing language consists of a set of terms selected for use as indexing terms or subject descriptors. Usually the terms are arranged alphabetically. Though these indexing terms are very similar to the lists of subject headings used in pre - coordinate indexing, a post - coordinate indexing language employs only a limited degree of pre - coordination of terms. The indexing terms are not in the form of compound subject headings; documents are indexed according to their individual constituent concepts. The post - coordinate indexing language is also referred to as a THESAURUS. Some thesauri are alphabetical listings and some incorporate a classified arrangement of concepts. The function of a thesaurus is to control the use of synonyms and word forms. Under each of its preferred indexing terms, a thesaurus links related terms representing concepts related in a genus/species relationship, indicated by:

BT : Broader Term - more general
NT : Narrower Term - more specific
RT : Related Term - not a genus/species relationship but another relationship, such as that between a thing and an action performed on that thing.

Science and Technology were the first fields to prepare ILs for post coordinate indexing. The most widely used post-coordinate scheme is the EJC thesaurus, used by a limited number of libraries. Most libraries using the post-coordinate indexing method tend to generate their own lists, using one of the major lists or thesauri as a model. Two such examples are the 'EJC Thesaurus' and 'Thesaurofacet', a thesaurus and faceted classification for Engineering and related subjects. Since these two are complementary rather than parallel, in the latter both the classification and the thesaurus have to be used together for best results.
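
The BT, NT and RT links and the control of synonyms described above can be modelled with a small data structure. The Python sketch below is only an illustration: the class and method names are hypothetical, the metals example is borrowed from the alphabetico - classed catalogue discussed earlier, and the related term 'smelting' and the synonym 'aluminum' are assumed sample data.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    broader: set = field(default_factory=set)    # BT
    narrower: set = field(default_factory=set)   # NT
    related: set = field(default_factory=set)    # RT


class Thesaurus:
    def __init__(self):
        self.entries = {}   # preferred term -> Entry
        self.use = {}       # non-preferred term -> preferred term

    def entry(self, term):
        if term not in self.entries:
            self.entries[term] = Entry()
        return self.entries[term]

    def add_broader(self, term, broader_term):
        """Record BT and its reciprocal NT in one step."""
        self.entry(term).broader.add(broader_term)
        self.entry(broader_term).narrower.add(term)

    def add_related(self, term, other):
        """RT links are reciprocal."""
        self.entry(term).related.add(other)
        self.entry(other).related.add(term)

    def add_synonym(self, non_preferred, preferred):
        self.use[non_preferred] = preferred

    def preferred(self, term):
        return self.use.get(term, term)

    def all_narrower(self, term):
        """Every term below the given term in the genus/species hierarchy."""
        found, stack = set(), [term]
        while stack:
            for nt in self.entry(stack.pop()).narrower:
                if nt not in found:
                    found.add(nt)
                    stack.append(nt)
        return found


thesaurus = Thesaurus()
thesaurus.add_broader("non-ferrous metals", "metals")
thesaurus.add_broader("aluminium", "non-ferrous metals")
thesaurus.add_related("aluminium", "smelting")   # thing and action (assumed example)
thesaurus.add_synonym("aluminum", "aluminium")   # assumed non-preferred form
```

A search on 'metals' can then be widened to all narrower terms, and a query using a non-preferred form is redirected to the preferred term.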

A few more post-coordinate indexing languages are:

MeSH: Medical Subject Headings - a thesaurus.
BSI Root thesaurus: It is based on original principles of Roget's thesaurus.
Roget's thesaurus: It is a systematic list accompanied by an alphabetical display.

Some of the thesauri in the Social Sciences are:

ERIC: Information retrieval thesaurus of Education terms
Semantic code dictionary of Education
London Education classification
EUDISED multilingual thesaurus

Research is in progress to develop post-coordinate indexing languages in Indian languages.

1.4 Conclusion

Since enumerative schemes do not have a clear facet structure, and the most important focus cannot be identified in them, S. R. Ranganathan's analytico - synthetic or free faceted structure is adopted for the present study. His postulates and principles for concept categorization and knowledge organization give rise to a subject structure and an organization of subjects in a sequence that is acceptable to specialists in different subject areas (Neelameghan 1992). His theory of classification divides the task of classification into three planes of work.

  1. Idea plane which deals with classification of ideas into a hierarchical order.
  2. Verbal plane deals with standardization of terminology, and
  3. Notational plane deals with assigning a class number to the idea.

Hence, his theory of classification forms an excellent basis for indexing irrespective of the NL concerned. The index language, though an artificial language, is dependent on NL expression. In order to understand and analyze an NL expression in a given context, one needs a knowledge of linguistics, in particular syntax, semantics, lexicography etc., so that concepts can be analyzed in a proper perspective. Linguistics is used as a representation mechanism for the information content of a document. This is the main reason for introducing infolinguistics (Figure 1) in between the dual states of mind. An attempt is made to obtain the solution from NL analysis by applying transformational grammar to IL in general and Kannada in particular. The next chapter discusses various aspects of transformational generative grammar and semantics.

*** *** ***


CHAPTER TWO

THEORIES OF LINGUISTICS

2.0 Introduction
2.1 Historical Development of American Linguistics
2.1.1 Post-Bloomfieldian Theories
2.2 Syntax
2.2.1 Transformation
2.2.2 First Generation Syntactic Structure
2.2.3 Aspects Model - Standard Theory
2.2.4 Extended Standard Theory (EST)
2.2.5 Revised Extended Standard Theory (REST)
2.2.6 Government and Binding
2.3 Case Grammar
2.3.1 Definition of Case Categories
2.4 Semantics
2.4.1 Semantic Relation
2.5 Conclusion

2.0. Introduction

In the previous chapter it was stated that linguistics is used as a representation mechanism for the information content of the text of a document. The representational properties of an NL are syntax and semantics, and the present chapter deals with both. In linguistics, syntax has been discussed in different schools of thought. Since the Chomskian school of thought is adopted for the present study, prominence is given to it and it is explained in detail.

A natural language (NL) is the primary medium for human communication. The term language refers to the totality of utterances that can be made in a speech community. The scientific study of language is linguistics. Hocket (1942) explicitly defined linguistics as a classificatory science, the linguist's task being that of classifying data.

2.1. Historical Development of American Linguistics

Linguistics has built up a tremendous body of new knowledge concerning the nature and functioning of human language since the last quarter of the nineteenth century. The period from 1875 to 1925 saw an increasing variety of language and dialect surveys with constant improvements in the techniques of making the surveys and interpreting the data (Whitney 1975). In 1926, Leonard Bloomfield published his work 'Postulates for the Study of Language'. The most important publication concerning the scientific study of language was his work 'Language' (1933). According to him the central concept in linguistic analysis is structure. It is the ordered or patterned set of oppositions which are presumed to be discoverable in a language (Floyd 1961). Linguistics in the 1950s was dominated by the 'American Structuralism' or 'Descriptive Linguistics'. As Palmer states,

For many years from 1930 until the late 1950s, the most influential school of linguistics was one which is usually described as 'Structural' and associated chiefly with the name of the American linguist Leonard Bloomfield (Palmer 1971).

Bloomfield worked out his philosophy of grammar within behaviorist boundaries. Research was restricted to the observable. The most observable feature of language systems is the sound system or phonology. The morpheme is the minimum meaningful unit of expression.

The post-Bloomfieldian linguists envisaged language in a very precise and limited way and postulated not only that it has a phonemic-morphemic structure but also that the structure can be discovered by a set of procedures. According to this postulate, phonemes should be found first and then the morphemes. This meant that phonemes had to be found without reference to the morphemes, and both had to be found without reference to meaning (semantics). Though this was theoretically possible, no linguist tried to do it in actual practice because it was practically impossible. Bloomfield stated that morphemes consist of phonemes. The morpheme '- ing', for instance, consists of the phonemes /i/ and /n/. He further stated that morphemes belong to various 'Form Classes'. Combinations of such classes with different constructions and meanings are possible. Before morphemes are strung together, the classes have to be identified first, and statements about which classes may combine with which are made next. Here a class means 'a set of phonological segments that have more features in common'.

The 'Discovery Procedure' (DP) was the result of linguistic research carried on by Bloomfield and his followers. It is a mechanical device that accepts a set of data as input and yields a grammar as output. For example, if enough data from some language is given to a computer with such a program, it will construct a fully explicit and accurate grammar for that language. One of the first problems encountered was that of classifying the material being dealt with. This was approached by attempting to formalize the traditional notions of 'Parts of speech'. The division of words and phrases into Noun, Noun Phrase, Verb, Verb Phrase, Adjective, Adverb, Clitic, Particle etc., was called Immediate Constituent Analysis (ICA) (Grinder & Elgin 1973).

Sentences are not merely strings of words in an acceptable order and 'making sense'; they are structures of successive components, consisting of groups of words and single words. These single words and groups of words are called constituents. The ICA is basic to syntax. The ways in which longer sentences are built up from, and analyzed into, short basic sentence patterns are called Expansions (Robins 1971). One of the best methods of displaying IC analysis is to use the principle of the Family Tree.

Example: An old man with a stick followed the woman.

Figure 3

Expansion in this sense is not literally expansion; it is a technical term for the substitution of one sequence of morphemes for another. In the above example, 'An old man with a stick' can be replaced by the name of the person who has the stick, and similarly the name or relationship of the woman may be substituted in the second half of the sentence.

Rajan followed his wife Or Rajan followed Sita.

The principle of expansion is derived from the principle of substitution. By using this procedure, linguists were able to arrive at an abstract structural formula that represented the relationships present in the sequence under consideration. This operation of substituting one sequence of morphemes for another to arrive at a conception of expansion was first derived by Zellig Harris and further developed by Rulon Wells, who suggested class abbreviations for traditional terms, such as N(oun), V(erb), A(djective) and T (article). The analysis of a sequence such as the following results in a structural formula:

An old man followed the woman with green sari

An   old  man  followed  the  woman  with green  sari
T    A    N    V         T    N      A           N

This major conceptual breakthrough seems to be the proximate cause of the development of transformational grammar by Harris. He first determined the classes on the basis of their co-occurrence and patterns of distribution and finally presented the notion of Transformation itself. This was revised and refined by his student and collaborator Noam Chomsky. Since 1957 extensive developments have taken place in the theory, and finality is yet to be reached.

2.1.1. Post-Bloomfieldian Theories

One of the most prominent post-Bloomfieldian theories is Transformational Generative Linguistics (TG Grammar in short). TG incorporates a full theory of language description, which takes the form of a series of rules. These rules, based on the theory underlying them, are said to generate the grammatical sentences of a language. The term 'generation' does not mean the literal production of sentences, but the prediction of the forms that sentences will take in the language when produced. The study of the principles and processes by which sentences are constructed in a particular language is called Syntax.

2.2. Syntax

'Syntactic Structures' by Noam Chomsky (1957) introduced to the world the most influential of all modern linguistic theories, 'Transformational Generative Grammar'. According to him, language comprises a number of components. The syntax of a language contains a phrase structure (PS) component and a transformational component. In phrase structure, the assumed largest unit of grammar, the sentence [ S ], is progressively expanded by the application of rules into 'strings' of smaller units, because in TG the sentence is the basic unit of the syntactic system. Instead of beginning with actual sentences, directions for generating structural descriptions of sentences are set forth in PS rules. Each rule provides a symbol representing a constituent of a sentence to the left of an arrow and a symbol or series of symbols to the right. The following are the symbols used in PS rules:

S Sentence
NP Noun phrase
VP Verb phrase
N Noun
V Verb
T,art or D Determiner
Pron Pronoun
Aux Auxiliary
M Modal Auxiliary
Be The verb Be
Pred Predicate (noun, adjective, adverb)
Vt Transitive Verb
Vi Intransitive verb
Vl Linking Verb
Comp Complement (noun or adjective)
Adj Adjective
Adv Adverb
PP Prepositional phrase

Unlike the tree explained in IC analysis, these diagrams are called labeled trees, because each successive representation of S consists of structural elements with a grammatical designation (NP etc.) called nodes. The tree diagrams are also called 'Phrase Markers', which show the hierarchical structure of the sentence.

Figure 4

2.2.1. Transformation

The term transformation means 'to convert'. In the context of grammar, it means converting a sentence with a given constituent structure into another structure. For example, in converting an active sentence into a passive sentence, the positions of the nouns or noun phrases have to be interchanged, 'by' inserted before the second NP in the passive, and at the same time the verb changed from the active to the passive form. This is a good example of transformation. In 'Syntactic Structures' Chomsky handles the active - passive relationship by saying that

if S1 is a grammatical sentence of the form
NP1 - Aux - V - NP2, then the corresponding string of the form
NP2 - Aux+be+en - V - by+NP1 is also a grammatical sentence.

Here Aux refers to tense and all auxiliary verbs, while be+en (en stands for the past participle) provides the passive element. The dashes and plus signs can be ignored. Transformation (T) rules are applied to the output of the PS rules to give the final output of the syntactic component of the description. The T rules involve not the division of the sentence into smaller parts, but the alteration or rearrangement of a structure in various ways.
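
The passive rule quoted above rearranges the four elements NP1, Aux, V and NP2 and inserts be, en and by. A minimal Python sketch of that rearrangement follows. The class and function names are hypothetical, and the two small look-up tables that turn the abstract string into English words (singular forms of 'be' only, a two-verb participle list) are assumptions made for the illustration, not part of the rule itself.

```python
from typing import NamedTuple


class ActiveString(NamedTuple):
    np1: str    # first noun phrase
    aux: str    # tense marker, e.g. "past"
    verb: str   # verb stem
    np2: str    # second noun phrase


def passive(s):
    """NP1 - Aux - V - NP2  becomes  NP2 - Aux + be + en - V - by + NP1."""
    return [s.np2, s.aux, "be", "en", s.verb, "by", s.np1]


# Assumed look-up tables standing in for the later rules that spell out
# Aux + be and en + V as actual word forms (singular subject only).
BE_FORMS = {"past": "was", "present": "is"}
PAST_PARTICIPLES = {"read": "read", "follow": "followed"}


def realize(string):
    """Turn the abstract passive string into an English sentence."""
    np2, aux, _be, _en, verb, by, np1 = string
    return f"{np2} {BE_FORMS[aux]} {PAST_PARTICIPLES[verb]} {by} {np1}"


if __name__ == "__main__":
    active = ActiveString("the students", "past", "read", "the book")
    print(passive(active))
    print(realize(passive(active)))
```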

The stages of development of TG are as follows:

  1. The first generation TG - Syntactic Structure
  2. Aspects - Standard theory
  3. Extended Standard Theory
  4. Revised Extended Standard Theory
  5. Government and Binding.

2.2.2. First Generation Syntactic Structure

The original form in Syntactic Structures is called the Classical theory by Chomsky. Fundamental to TG is the notion of rule: TG is a rule-based grammar. The rules are part of the device for generating the sentences of a language. They are instructions for generating all possible sentences in a language. The rules of TG are rewrite rules. Chomsky explained the term syntax as the study of the principles and processes by which sentences are constructed in a particular language. He considered phonemics, morphology and phrase structure as linguistic levels, which are a set of descriptive devices that are made available for the construction of grammars. He viewed grammar as an instrument that mirrors the behavior of the speaker, who on the basis of a finite and accidental experience with language can produce or understand an indefinite number of sentences, and he considered language to be a complex system. A meaningful sequence of words so produced is a sentence. A language produced by a finite state machine was called a 'Finite State Language' and the machine itself a 'Finite State Grammar'. It was graphically represented in the form of a State Diagram.

The grammar can be extended by adding closed loops. An infinite number of sentences can be produced in this way.
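The effect of a closed loop can be illustrated with a minimal sketch. The states, words and the `sentences` helper below are assumptions chosen for illustration and are not taken from the dissertation; the self-loop on state 1 is the closed loop, and each extra pass through it yields new sentences.

```python
# Hypothetical state diagram: state -> list of (word, next_state).
# The transition ("old", 1) leaving state 1 returns to state 1: a closed loop.
TRANSITIONS = {
    0: [("the", 1)],
    1: [("old", 1), ("man", 2), ("men", 3)],
    2: [("comes", "END")],
    3: [("come", "END")],
}


def sentences(max_loops):
    """Yield every sentence that passes through the loop at most max_loops times."""
    def walk(state, words, loops):
        if state == "END":
            yield " ".join(words)
            return
        for word, next_state in TRANSITIONS[state]:
            is_loop = next_state == state
            if is_loop and loops == max_loops:
                continue  # loop budget used up; without a budget the set is infinite
            yield from walk(next_state, words + [word], loops + is_loop)

    yield from walk(0, [], 0)


for n in range(3):
    print(n, len(list(sentences(n))))
```

Raising `max_loops` by one always adds two more sentences, which is the sense in which the loop makes the set of sentences unbounded.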

The state diagrams are usually represented by arrows tracing a path. The machines that produce language in this manner are known mathematically as 'Finite State Markov Processes', and the speaker is viewed as being such a machine. Many languages, English for example, are not finite state languages. Hence the Markov Process model cannot be accepted. So, Chomsky thought of a grammar which is more powerful. The new form of grammar, associated with constituent analysis, had rewrite rules. The first PS rule breaks up the sentence into its principal constituents.

Example: The students read the book

  1. S → NP + VP
  2. NP → T + N
  3. VP → Verb + NP
  4. T → the
  5. N → students, book
  6. Verb → read

The derivation can be represented in an obvious way by means of the following tree structure:

Figure 7: PS rule tree structure

The+students+read+the+book is a terminal string. A set of strings is called a terminal language if it is the set of terminal strings for some grammar [Σ, F], where Σ is the set of initial strings and F is the set of rules or instruction formulas. Σ can be extended to include declarative and interrogative sentences as additional symbols. Thus, given a terminal language and its grammar, one can reconstruct the PS of each sentence of the language, as described in the above diagram.
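The derivation of this terminal string from the six rewrite rules can be sketched as follows. This is a minimal illustration, not the author's procedure: the `RULES` table, the `derive` helper and the use of `Verb` as the single verb symbol are assumptions, and the derivation always rewrites the leftmost non-terminal.

```python
# Rewrite rules of the example; each symbol maps to its alternative expansions.
RULES = {
    "S": [["NP", "VP"]],
    "NP": [["T", "N"]],
    "VP": [["Verb", "NP"]],
    "T": [["the"]],
    "N": [["students"], ["book"]],
    "Verb": [["read"]],
}


def derive(start, choices):
    """Leftmost derivation from `start`.

    `choices` gives, in order, the index of the alternative to use each time
    a symbol with more than one expansion (here only N) is rewritten.
    Returns every string of the derivation, from the initial to the terminal one.
    """
    choices = iter(choices)
    steps = [list(start)]
    while True:
        current = steps[-1]
        idx = next((i for i, sym in enumerate(current) if sym in RULES), None)
        if idx is None:  # only terminal symbols remain
            return steps
        options = RULES[current[idx]]
        expansion = options[next(choices)] if len(options) > 1 else options[0]
        steps.append(current[:idx] + expansion + current[idx + 1:])


steps = derive(["S"], choices=[0, 1])  # first N -> students, second N -> book
terminal_string = "+".join(steps[-1])
for step in steps:
    print(" + ".join(step))
```

Each printed line corresponds to one level of the tree, so the list of steps carries the same information as the phrase marker.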

Of the two models discussed above, i.e., the Markov Process model and the Phrase Structure model, the first was based on a conception of language as a finite state process and the latter was based on Immediate Constituent Analysis. For the purposes of grammar the first is inadequate, and the second is more powerful than the first. Attempts to improve on grammars of the form [Σ, F] began with the process of conjunction, which is considered to be one of the most productive processes.

For example, if we have two sentences,
S1 (a) The scene - of the movie - was in India
S2 (b) The scene - of the play - was in India
they can be conjoined as
S3 The scene - of the movie and of the play - was in India.

In grammars of the [Σ, F] type there is no way to incorporate two sentences into one in this manner. Conjunction provides one of the best criteria for determining how to set up constituents. The next improvement was the study of 'auxiliary verbs'. Even with the verbal root fixed, there are many other forms that this element can assume. Example: has+taken, will+take, has+been+taken, is+being+taken etc. The form 'would have been taking' is past tense, perfect (marked by 'have' and the past participle 'been') and progressive (marked by the occurrence of 'be' in 'been' and the '-ing' of 'taking'). The auxiliary elements, including the (be + en) element, are introduced by the following rules:

Verb → Aux + V
V → hit, take, walk, etc.
Aux → C (M) (have + en) (be + ing) (be + en)
M → will, can, may, etc.
C → { S in the context NP singular
    { 0 in the context NP plural
    { Past
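The number of auxiliary patterns licensed by the Aux rule can be checked with a short sketch. It assumes the rule reads Aux → C (M) (have + en) (be + ing) (be + en), with C obligatory and each bracketed element optional; the function name `aux_expansions` is hypothetical.

```python
from itertools import product

# The four optional elements of the Aux rule, in their fixed order.
OPTIONAL = [["M"], ["have", "en"], ["be", "ing"], ["be", "en"]]


def aux_expansions():
    """Return every string generated by Aux -> C (M) (have+en) (be+ing) (be+en)."""
    results = []
    for mask in product([False, True], repeat=len(OPTIONAL)):
        parts = ["C"]  # C (tense/number) is always present
        for keep, element in zip(mask, OPTIONAL):
            if keep:
                parts.append("+".join(element))
        results.append(" ".join(parts))
    return results


expansions = aux_expansions()
print(len(expansions))
print(expansions[-1])
```

The pattern `C have+en be+en`, for instance, underlies a form such as has+been+taken; spelling out the affixes on the following verbs is not attempted here.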

'Be' is the root of forms such as be, am, is, was, are, were, being, been etc. 'En' denotes the past participle, which marks the passive verb. To transform a sentence into the passive, the 'be + en' formula has to be used.

Example: I saw him

He was seen by me (where 'was' is the 'be' verb and 'seen' is the 'en' form of 'see'). An auxiliary verb is a helping verb in grammatical conjugation.

Example: I am going ('am' is the auxiliary verb).

There are certain restrictions on the use of 'be + en'. It can be selected only if the following V is transitive (for example, 'was' + 'eaten' is permitted but not 'was' + 'occurred'), and it cannot be selected if the V is followed by an NP. It must occur before V + by + NP (where V is transitive). The passive also inverts the order of the surrounding NPs.

If S1 = NP1 - Aux - V - NP2, then the corresponding string of the form
NP2 - Aux+be+en - V - by+NP1 is also a grammatical sentence.
S1: Raja - S - eat - ice cream (Raja eats ice cream.)
= Ice cream - S+be+en - eat - by+Raja (Ice cream is eaten by Raja.)
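The structural change made by the passive rule can be sketched as a rearrangement of a four-term analysis. The tuple representation and the name `passive` are assumptions for illustration; the sketch works on the symbol string only and does not spell out the final word forms.

```python
def passive(active):
    """Map (NP1, Aux, V, NP2) to (NP2, Aux+be+en, V, by+NP1)."""
    np1, aux, verb, np2 = active
    return (np2, f"{aux}+be+en", verb, f"by+{np1}")


active = ("Raja", "S", "eat", "ice cream")
result = passive(active)
print(" - ".join(result))
```

The two noun phrases exchange places, be+en is attached to Aux, and by is attached to the original subject, exactly as the rule states.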

Chomsky refers to rules of this kind as 'grammatical transformations', or T. A transformation T operates on a given string with a given constituent structure and converts it into a new string with a new derived constituent structure. Certain transformations are obligatory, whereas others are only optional. The passive transformation, for example, is optional. The rule

C → { S, 0, Past }

is obligatory.

 

The distinction between these two kinds of transformation leads to a fundamental distinction among the sentences of a language. When only obligatory transformations are applied in the generation of a sentence, a kernel sentence is formed. Active sentences were thus kernel sentences, and passives were 'transforms' of them; such sentences are 'derived' sentences. Chomsky stated that a transformation is a rule which transforms underlying structures into derived structures or transforms (Chomsky 1956). Since the deep structure was supposed to represent the meaning of the sentence, abstract markers were placed in it in the later models of the grammar to give positive, negative and interrogative sentences.

 

S → (emphatic) (imperative) (negative) (question) NP + VP

 

Question and Negative markers serve as triggers for transformations.

Kernel sentence: Raja will pass the test.
Question transformation: Will Raja pass the test?
Negative transformation: Raja will not pass the test.
Emphatic transformation: Raja did pass the test.
Imperative transformation: Pass the test!
Negative emphatic transformation: Raja did not pass the test.
Emphatic imperative: Do pass the test!
Negative imperative: Don't pass the test!
Emphatic interrogative: Did Raja pass?
Negative interrogative: Didn't Raja pass?
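Two of these transformations, question and negative, can be sketched for kernel sentences that contain a modal. This is a simplified illustration under stated assumptions: the modal list is limited to the three modals named in the text, the helpers `split_kernel`, `question` and `negative` are hypothetical, and no attempt is made to handle do-support or subjects that begin with a lower-case word.

```python
MODALS = {"will", "can", "may"}


def split_kernel(sentence):
    """Split 'Subject modal rest.' into (subject words, modal, remaining words)."""
    words = sentence.rstrip(".").split()
    i = next(k for k, word in enumerate(words) if word in MODALS)
    return words[:i], words[i], words[i + 1:]


def question(sentence):
    """Question marker: move the modal in front of the subject."""
    subject, modal, rest = split_kernel(sentence)
    return " ".join([modal.capitalize()] + subject + rest) + "?"


def negative(sentence):
    """Negative marker: insert 'not' after the modal."""
    subject, modal, rest = split_kernel(sentence)
    return " ".join(subject + [modal, "not"] + rest) + "."


kernel = "Raja will pass the test."
print(question(kernel))
print(negative(kernel))
```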

A universal feature of all languages is their infinite productivity. Even with an unchanging vocabulary, the number of grammatical sentences that can be produced has no limit. Though this characteristic of language was noticed by W. von Humboldt over a century ago, it has been particularly emphasized by TG linguists under the title of recursiveness or recursion, which means that certain grammatical constructions can be extended indefinitely by repeated applications of the same rule. Thus noun phrases may be coordinated without limit. There is also the possibility of repeatedly embedding (subordinating) one sentence structure within the structure of another.

For example, the well-known single-sentence rhyme 'The House that Jack Built' exemplifies an extreme application and reapplication of this sort of embedding. The fully worked out tree for it would extend over several pages, with embedded Ss such as S1, S2 etc., each S being expanded as an NP and a VP.
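The repeated embedding in the rhyme can be sketched with a recursive function. The rule shape (NP → the N that V NP) and the name `noun_phrase` are assumptions made for illustration; each recursive call embeds one more clause inside the noun phrase.

```python
def noun_phrase(clauses):
    """Build a noun phrase by embedding clauses, outermost first.

    clauses: list of (noun, verb) pairs; the empty list gives the base case.
    """
    if not clauses:
        return "the house that Jack built"
    (noun, verb), rest = clauses[0], clauses[1:]
    return f"the {noun} that {verb} {noun_phrase(rest)}"  # same rule reapplied


clauses = [("cat", "killed"), ("rat", "ate"), ("malt", "lay in")]
print("This is " + noun_phrase(clauses) + ".")
```

Because the rule calls itself, the list of clauses can be extended indefinitely without adding any new rule.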

2.2.3 Aspects Model Standard Theory

In Aspects of the Theory of Syntax, nouns are chosen on the basis of context-free rules; verbs are then chosen on the basis of context-sensitive rules, which are stated in terms of lexical features. Since nouns are the first words to be chosen, they are identified by lexical features only. Verbs and adjectives require additional features to indicate the environments in which they can appear. The Aspects grammar was organized into three major components:

The syntax, the phonology and the semantics.
The syntactic component had two sub components:

  1. Base
    1. PSG Rule
    2. Lexicon (with rules of lexical insertion)
  2. Transformational

The syntactic component enumerates the set of tree representations (Deep Structures) that serve as input to the other two components. The latter two components are called 'Interpretive'. The base specifies fully developed tree structures. The terminal nodes are the set of words and abstract markers from which the semantic component can interpret the meaning of the tree. These fully specified trees are 'Deep Structures'. The tree derived by the application of T-rules is the 'surface structure'. The base contained the lexicon as well as two general types of rules: (a) the Phrase Structure Grammar rules (PSG rules) and (b) Lexical Insertion rules. The PSG rules are of two types: (a) Context Free (CF) and (b) Context Sensitive (CS). The object that resulted from the application of all these rules is a 'Complex Symbol'. This is one of the additions to transformational theory made by the 'Aspects model'.

Example of a tree with complex symbols:

Figure 8

The complex symbol specified what kind of noun could occur under the node of any given tree. In the above example, the N 'sincerity' is [-Count] [+Common] [+Abstract]; 'may' is the auxiliary.

The verb 'frightens' is analyzed by rules under the complex symbol 'Q'.

Transformations preserve meaning. Deep Structure contains the full information needed to specify the meaning of the tree structure, which is mapped into surface structure by transformation. The 'Aspects model' made transformation self-evident (Chomsky 1965). The separation of levels of analysis insisted upon by the structuralist school was respected in the Aspects model, since the semantic and syntactic components were independent, articulating only at the point of deep structure (Grinder and Elgin 1973). The PSG rules and T-rules handled distinct sets of objects that resulted in formal objects. The term surface structure is usually reserved for the result of the phonological interpretation of the final derived phrase marker, as illustrated below:

Figure 9

Subsequent research on the role of surface structure in determining the meaning of a sentence led to the Extended Standard Theory, since some aspects of semantic representation were questioned from the beginning.

2.2.4 Extended Standard Theory (EST)

Ray Jackendoff offered a substantial criticism of the Standard Theory and showed that surface structure played a much more important role in semantic interpretation than deep structure. For example, by studying the interaction of negation and quantification within a sentence, Jackendoff showed that their relative position in the surface structure of the sentence was crucial for interpretation (Jackendoff 1965). To incorporate the role of surface structure in determining semantic representation without abandoning the identification of deep structure with semantic representation, generative semantics introduced the notion of 'Global Rules'. These rules relate surface structure to the semantic representation postulated by generative semantics. It was also proposed that global rules may appear quite generally in the grammar, in phonology as well as in syntax and semantics.

The EST assumes that the rewriting rules of the base generate deep structures into which lexical items are inserted. Thematic relations between the verb and the NPs which are grammatically related to it are defined at this level. Other semantic properties are determined by rules applying to surface structure. Chomsky introduced the term 'Trace Theory'. A trace, in his view, can be considered as indicating the position of a variable bound by a kind of quantifier, which is introduced into the logical form by rules applying to the surface structure. The theory has the following form: the deep structures are generated by the base components with their specific properties; transformations form surface structures that are enriched by traces; and these surface structures are associated by further rules with phonetic representation and logical form (meaning), as in the following schema:

Figure 10

Here the partial representation of meaning is determined by grammatical structure. The derivation of logical form proceeds step by step, determined by a derivational process analogous to those of syntax and phonology.

The EST maintains that it is not the deep structure that undergoes semantic interpretation, but it is the surface structure that is associated directly with semantic representation. The deep structures do not vary from one language to another. All languages have the same deep structure. Certain properties of underlying deep structure are captured in the enriched sense of surface structure by means of trace theory. Surface structure determines semantic representation. Chomsky further states that surface structure is something quite abstract, involving properties that do not appear in the physical form. It is by virtue of such properties that language is worth studying (Chomsky, 1971).

2.2.5 Revised Extended Standard Theory (REST)

There are two principal innovations in the REST:

  • Introduction of the trace theory of movement rules into Chomsky's Syntactic theory and
  • Semantic skepticism achieves official status, which specifically excluded meaning from the grammatical structure of sentences.

(A)
                     B            T           SR-1
Sentence grammar --------> IPM --------> SS --------> LF

                     SR-2
Other systems: LF --------> "Meaning"

Chomsky explains that the rules of the base (B), including the rules of the categorial component and the lexicon, form Initial Phrase Markers (IPM). The rules of the transformational component (T) convert these to surface structures (SS), which are converted to logical form (LF) by certain rules of semantic interpretation (SR-1, the rules involving scope, thematic relations etc.). The LF so generated is subject to further interpretation by other semantic rules (SR-2), interacting with other cognitive structures, giving a fuller representation of meaning.

Schema A takes grammatical properties and relations (such as coreference and thematic relations) to be goals of sentence grammar. Katz (1980) has argued that Chomsky's theory both requires sentence grammar to account for these properties and relations and precludes it from doing so, because the boundary imposed on sentence grammar in schema A excludes meaning. He further stated that with the development of the EST and REST, Chomsky returned to his Syntactic Structures position with one modification: certain aspects of quantificational structure enter sentence grammar by virtue of a new linguistic level called 'Logical Form'. Chomsky suggested that all semantic information is determined by a suitably enriched notion of surface structure. In this theory, the syntactic and semantic properties of the former deep structure are dissociated. To avoid the confusion resulting from the term deep structure, it was replaced by Initial Phrase Markers (IPM). The IPMs generated by the base have significant and revealing properties. They enter into SS, determining the structures that undergo semantic interpretation.

2.2.6 Government and Binding

A further addition to TG is the Government and Binding (GB) theory of Chomsky (1981). It is more explicit and explanatory than the earlier theories. According to GB theory, the structure of universal grammar (UG) consists of interacting subsystems of grammatical rules and principles.

The subcomponents of the rule system are as follows (Chomsky 1981):

  1. Lexicon
  2. Syntax
    a. Categorial component
    b. Transformational component
  3. PF-component
  4. LF-component

The syntactic categorial component (2a) involves PS rules that generally follow X-Bar theory in one or another of its variants. Under the X-Bar theory of base rules, lexical entries can be limited to a minimal form with indication of no more than inherent and selectional features, and PS rules can be dispensed with (Chomsky 1986). Subcomponents 1 and 2(a) together constitute the base. Base rules generate deep structure (D-structure). The D-structures are mapped to surface structure (S-structure) by the rule Move-α (Move-Alpha), which constitutes the theory of movement. Movement is never determined by a specific rule but rather results from the interaction of principles (Chomsky 1986). Move-α constitutes 2(b), generating the S-structure, which is in turn interpreted by components 3 and 4.

The subsystems of the principles include the following sub theories or theoretical modules (Chomsky 1985).

  1. Bounding theory
  2. Government theory
  3. θ-theory
  4. Binding theory
  5. Case theory
  6. Control theory

Bounding theory poses locality conditions on certain processes and related items. Government theory is concerned with the relation between the head of a construction and the categories dependent on it. The θ-theory is concerned with the assignment of thematic roles such as agent-of-action, patient-of-action, etc. Binding theory refers to the relations of anaphors, pronouns, names and variables to possible antecedents. Case theory is concerned with the assignment of abstract case and its morphological realization. Control theory determines the potential for reference of the abstract pronominal element PRO. These modules are interconnected. The third and fifth theories are closely related. The fourth and fifth are developed within the second. Interaction exists between the subsystems of rules (A) and principles (B). Bounding theory is connected with the rule Move-α. The θ-theory interacts with both D-structure and LF. Notions such as constituent command (C-command) are found to be central to many of these theories. Through the interaction of these subsystems it is possible to account for many properties of particular languages.

The 'Classical' GB model is as follows:

                                      Logical Form
D-Structure -------> S-Structure ------->
                                      Phonology

Figure 11: Classical GB model

It is also called the 'T' model of Chomsky. More recently Chomsky has been of the opinion that, for a substantial core of NL, PS rules are completely dispensable, and that T-rules can also be eliminated in favor of the general principle Move-Alpha (Chomsky, 1991). Within a span of more than four decades, generative syntax has arrived at a conception of Universal Grammar (UG) as virtually a rule-free system. In their overview of GB, Van Riemsdijk and Williams (1986) state that "From today's perspective most research carried out before the late 1960s appears data-bound, construction-bound and lacking in appreciation for the existence of highly general principles of linguistic organization".

2.3 Case Grammar

The study of TG will be incomplete without a mention of Fillmore's conception of 'Case Grammar'. Fillmore is of the view that grammatical features found in one language show up in some form or other in other languages (Fillmore 1968). The grammatical notion 'case' deserves a place in the Base component of the grammar of every language. Case is one of the underlying syntactic-semantic relationships in a language, which make up a universal set of innate concepts that explain judgments about notions such as 'who did what to whom' (Palmatier 1972). Case grammar is a modification of the theory of TG. It reintroduces the conceptual framework of core relationships from traditional grammar, but maintains a distinction between deep and surface structure from generative grammar, with the word deep signifying 'semantic deep'.

Sentence → Modality + Proposition
[ S → M + P ]
Modality → Negation, Tense, Mood and Aspect.
Proposition → Tenseless set of relationships involving verbs and noun separated from modality.

Definition of case categories:

Agentive [A] -- The case of the typically animate perceived instigator of the action identified by the verb.

Experiencer [E] -- The case of the animate being affected by the state or action.

Instrumental [I] -- The case of the inanimate object controlled by the agent and causally involved in the action or state.

Causative [C] -- The case of the inanimate force causally involved in the action or state.

Objective [O] -- The semantically most neutral case: anything representable by a noun whose role in the action or state is identified by the semantic interpretation of the verb itself.

Source [Sr] -- The case which reports the location of an object moving away from the locus.

Locative [L] -- The case which identifies the spatial, temporal or institutional orientation of the state or action identified by the verb.

Factitive [F] -- The case of the object or being resulting from the action or state identified by the verb, or understood as a part of the meaning of the verb.

Benefactive [B] -- The case of the animate being which is benefited by the result of the action denoted by the verb.
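The rule S → M + P and the case labels defined above can be sketched as a small data structure. The class names, the `who_did_what` helper and the example sentence ('Raja opened the door with a key') are assumptions introduced for illustration and do not come from the dissertation.

```python
from dataclasses import dataclass

# Case labels as defined in the list above.
CASES = {
    "A": "Agentive", "E": "Experiencer", "I": "Instrumental",
    "C": "Causative", "O": "Objective", "Sr": "Source",
    "L": "Locative", "F": "Factitive", "B": "Benefactive",
}


@dataclass
class Proposition:
    """Tenseless set of relationships between a verb and its nouns."""
    verb: str
    roles: dict  # case label -> noun phrase

    def __post_init__(self):
        unknown = set(self.roles) - set(CASES)
        if unknown:
            raise ValueError(f"unknown case labels: {sorted(unknown)}")


@dataclass
class Sentence:
    """Sentence -> Modality + Proposition."""
    modality: dict  # negation, tense, mood, aspect
    proposition: Proposition

    def who_did_what(self):
        roles = self.proposition.roles
        return (roles.get("A"), self.proposition.verb, roles.get("O"))


# Hypothetical example: 'Raja opened the door with a key.'
example = Sentence(
    modality={"tense": "past", "negation": False},
    proposition=Proposition("open", {"A": "Raja", "O": "the door", "I": "a key"}),
)
print(example.who_did_what())
```

Keeping tense and negation in the modality part leaves the proposition unchanged when only the modality of the sentence varies.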

The system of deep cases has become one of the modules of generative Government and Binding theory under Theta theory (θ-theory), or the theory of thematic roles (Chomsky 1981). A thematic role may correlate in surface structure with various phenomena such as syntactic position, adpositions, inflectional suffixes etc. (Kiefer, Ferenc 1992).

2.4 Semantics

One of the three major components considered in 'Aspects of the Theory of Syntax', the first complete model by Noam Chomsky, was 'Semantics'. Semantics is the study and representation of the meaning of language expressions, and of the relationships of meaning among them (Allan, 1992). The general notion of semantics is that it studies the meaning that can be expressed. The keynote of a modern linguistic approach to semantics is that "meaning can be best studied as a linguistic phenomenon with 'knowledge of language' and the 'knowledge of real world'" (Leech 1975). A semantic theory is a general theory of language meaning, and should account for the correlation between the sense of a language expression and its denotation. Denotation is the relation between language expressions and what they denote in words. A semantic theory of a NL is part of a linguistic description of that language (Katz & Fodor 1963). They further state that:

Linguistic description minus (-) Grammar = semantics.
LD-G=S

That is, if the problems belonging to grammar are subtracted from the problems in the description of a language, the problems that belong to semantics can be determined. Grammar assigns structural descriptions. To determine the domain of a semantic theory, the formula LD - G = S may be applied. The speaker's ability to interpret sentences provides empirical data for the construction of a semantic theory. Semantic theory describes and explains the interpretive ability of speakers by accounting for their performance in determining the number and content of the readings of a sentence, in detecting semantic anomalies, in deciding on paraphrase relations between sentences, and in marking every other semantic relation. A semantic theory interprets the syntactic structure revealed by the grammatical description of a language.

One important component of a semantic theory of a NL is a dictionary. From the viewpoint of semantic theory, a dictionary entry consists of a grammatical section and a semantic section, catering for syntactic and semantic relationships respectively.

2.4.1 Semantic Relation

From the IL point of view the following three semantic relations are worth discussing. They are:

Equivalence
Hierarchical
Affinitive

An equivalence relationship implies that there is more than one term denoting the same concept, as in:

Synonyms and antonyms
Quasi-synonyms
Same continuum
Overlapping
Preferred spelling
Acronyms, abbreviation
Current and established term
Translations

Hierarchical relationship is that of genus to species and whole to part.

Affinitive/Associative includes:

Coordination
Genetic
Concurrent (two activities taking place at the same time in Association. Example: Education-Teaching)
Cause and effect (Example: Teaching-Learning)
Instruments (Example: Teaching-Overhead projectors)
Materials (Example: Plastic films)
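The three kinds of semantic relation can be sketched as typed links in a small thesaurus structure. The `Thesaurus` class and its methods are hypothetical; the term pairs are the examples given in the list above. As a simplification, every link is stored in both directions, so the direction of a hierarchical (genus to species) link is not recorded.

```python
from collections import defaultdict

RELATION_TYPES = {"equivalence", "hierarchical", "affinitive"}


class Thesaurus:
    def __init__(self):
        # term -> set of (related term, relation type, subtype)
        self._links = defaultdict(set)

    def relate(self, term_a, term_b, relation, subtype=None):
        """Record a typed relation between two terms (stored symmetrically)."""
        if relation not in RELATION_TYPES:
            raise ValueError(f"unknown relation type: {relation}")
        self._links[term_a].add((term_b, relation, subtype))
        self._links[term_b].add((term_a, relation, subtype))

    def related(self, term, relation=None):
        """Terms linked to `term`, optionally restricted to one relation type."""
        return sorted(t for t, r, _ in self._links[term] if relation in (None, r))


thesaurus = Thesaurus()
thesaurus.relate("Teaching", "Education", "affinitive", "concurrent")
thesaurus.relate("Teaching", "Learning", "affinitive", "cause and effect")
thesaurus.relate("Teaching", "Overhead projectors", "affinitive", "instruments")
print(thesaurus.related("Teaching"))
```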

The semantic relations discussed here are based on Foskett (1982). There is a lively and productive debate in progress concerning exactly how semantics relates to syntactic rules. It is argued by Di Sciullo and Williams that words are syntactic atoms, determined by principles that are dissociated from syntactic rules. Mark Baker is of the opinion that the structure of complex predicates (for example kill, murder, assassinate, massacre etc., which are causative forms based on the intransitive die) is explicable in terms of the principles that govern syntactic concern (Jones & Kay 1973).

Of the two schools of semantic thought, Interpretative and Generative semantics, Chomsky and Katz favored Interpretative semantics, which assigned meanings to the output of syntactic rules and which was further developed into X-Bar theory. Generative semantics was a programmatic theory of syntax using purported meaning components etc. It failed because the syntactic phrase markers used do not properly reflect semantic structure.

2.5 Conclusion

We must know how far the Transformational Linguistics approach can provide a methodology. To that end, the theories discussed here are applied to the IL environment in the next chapter. In the forthcoming chapters, TG is also applied to document titles in Kannada and rules are formulated.

*** *** ***



CHAPTER THREE

COMPATIBILITY BETWEEN LINGUISTICS AND INDEXING LANGUAGE

3.0 Introduction
3.1 Basic Components of IL
3.2 Fundamental Categories
3.2.1 Personality
3.2.2 Matter
3.2.3 Energy
3.2.4 Space
3.2.5 Time
3.3 Facet Structure
3.4 Facet Syntax and Linguistic Syntax
3.5 Sample Infolinguistic Studies
3.6 Application of TG to IL
3.6.1 Computer Application of TG
3.6.2 Manual Application of TG
3.6.2.1 Application of X-Bar to Document Titles
3.6.2.2 Application of θ Theory to Document Titles
3.6.2.3 Application of Case Theory to Document Titles
3.7 Conclusion
3.7.1 Advantages
3.7.2 Disadvantages

3.0 Introduction

The function of a NL is to communicate the semantic content of its expressions in a simple, direct manner to the receiver. The function of an IL, on the other hand, is to do whatever a NL does and, in addition, to organize the semantic content through a different expression. In this process the expression in an IL becomes different from the NL expression. In short, the semantic approach needs compatibility between NL and IL expressions. One more important function of an IL expression is to provide a point of access to the seekers of information. This has to be achieved with minimum distortion.

An IL is made up of expressions connecting several kernel terms. These kernel terms have indicated roles in an index expression, in the form of pre-coordinate subject headings at the input stage or post-coordinate search statements at the output or retrieval stage. Therefore, an index expression can be taken as equivalent to a sentence in a NL discourse. An index expression consists of kernel terms in the prescribed sequence of their roles according to indexing principles. It has connectives and conjunctives to make the index expression complete. In the last four decades, the development of the grammar of IL has had a close parallel in the studies of the theory of syntax and generative grammar for NL. In the Standard Transformational Grammar (TG), the deep structure of a sentence determines the semantic content while its surface structure determines its phonetic interpretation. IL subscribes to the same model of a deep structure underlying a surface linear ordering. In linguistic notation, as discussed in Chapter Two, a sentence is formed by a Noun Phrase and a Verb Phrase. Between NP and VP a relation of predication may be defined. The deep structure of every language is built upon this relation, apparently without exception (McNeill, D 1969).

The mapping between the deep structure and its surface structure is the transformation. "Real progress in linguistics consists in the discovery that certain features of given languages can be reduced to Universal properties of language, and explained in terms of these deeper aspects of linguistic form" (Chomsky 1969). It can be inferred that any language, whether natural or artificial, will have a syntax. The postulates and principles of syntax may change from language to language.

3.1 Basic Components of IL

In the IL, the symbol 'S' (Sentence) of NL is replaced by 'T' (Title). The person whose versatile and unique contribution to the field of IL is still recognized and adopted at the international level is S R Ranganathan. His notable contribution is in the area of the syntactic analysis, structuring and representation of subjects. His General Theory of Classification is based on postulates and on the study of the attributes of the Universe of Subjects (US), in particular its structure and development. A study of the ideas forming the components of the large variety of subjects in the US indicates that they can be categorized into three types:

  1. Basic Subject Idea (BSI)
  2. Isolate Idea (II)
  3. Speciator Idea (SI)

A BSI is a subject without any components; an II is a component of a subject but not a subject by itself; and an SI is a modifier which, when combined with a BSI or an II, produces a change in its connotation. By combining these three kinds of ideas, a Simple subject (BSI), a Compound Basic subject (BSI + SI), a Compound Isolate (II + SI), a Compound subject (BSI + II) and a Complex subject (combination of all) can be formulated. The large variety of isolate ideas occurring in diverse subjects is categorized into seven types by SRR. They are:

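Table 1
________________________________________________________________________________
Number  Isolate idea                  Manifestation of the     Indicator digit
                                      fundamental category
________________________________________________________________________________
1       Time                          TIME [T]                 . (dot)
2       Space                         SPACE [S]
3       Action                        ENERGY [E]               : (colon)
4       Method
5       Property                      MATTER [M]               ; (semicolon)
6       Material
7       Totality of all attributes    PERSONALITY [P]          , (comma)
        of an entity taken together
________________________________________________________________________________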

By deeming each of them to be a manifestation of one and only one of the five Fundamental Categories (FC), the seven varieties of II are reduced to five FCs: [P], [M], [E], [S] and [T]. Each facet was given a separate indicator digit. There is a similarity between SRR's five FCs and Whorf's hypothesis on language, which states that "every language contains terms that have come to attain cosmic scope of an unformulated Philosophy...such as our words like 'reality' 'substance' 'matter' and 'space', 'time' past present and future" (Neelameghan 1972). The structuring of subjects by SRR is based upon these five fundamental categories, which center around the concept of the Basic Subject (BS).
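The reduction of the seven isolate types to the five FCs can be sketched as a lookup table. The dictionary and function names are hypothetical. Assigning Method, Property and Material all to Matter follows the three varieties of Matter named in section 3.2.2, and the table above shows no indicator digit for Space, so none is given here.

```python
# Isolate idea type -> fundamental category (assumed reading of the table above).
ISOLATE_TO_FC = {
    "Time": "T",
    "Space": "S",
    "Action": "E",
    "Method": "M",
    "Property": "M",
    "Material": "M",
    "Totality": "P",
}

# Indicator digits as listed in the table; Space is left out because the
# table gives no digit for it.
INDICATOR_DIGIT = {"T": ".", "E": ":", "M": ";", "P": ","}


def fundamental_category(isolate_type):
    """Return the FC of which the given isolate type is a manifestation."""
    return ISOLATE_TO_FC[isolate_type]


categories = sorted(set(ISOLATE_TO_FC.values()))
print(len(ISOLATE_TO_FC), "isolate types ->", len(categories), "fundamental categories")
```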

3.2 Fundamental Categories

3.2.1 Personality

Personality is the core component, which is the manifestation of the FC Personality [P]. Taking into consideration the definition of subject as a "system - an asymmetric, noncommunicative, centralised system" (Neelameghan, 1972), the FC Personality is in conformity with the concept of the 'Leading part' in a "Centralized system" (Seetharama 1972). For the recognition of Personality, SRR suggested the method of 'Residue'. In this method, a kernel idea is correlated with each of the four FCs - Time, Space, Energy and Matter - in succession, and if the kernel idea cannot be deemed to be a manifestation of any one of these four FCs, it is deemed to be a manifestation of the FC Personality. However, this was not found to be adequate. Gopinath (1980) has analyzed the problem of identifying FCs in interdisciplinary subjects and has framed criteria and methods for the same. He states that "the problem in the recognition of the FC Personality is not definitional, but contextual. The semantic and syntactic aspects in the formation of the compound subjects and the generalization of these structures to a model base ... that is a Basic subject...sets the difficulties in the recognition of Personality".
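The method of 'Residue' can be sketched as a sequence of checks with a default. The term lists and the name `recognise_fc` are hypothetical stand-ins for the indexer's judgment; only the order of the checks and the fall-through to Personality come from the description above. The sketch also shares the limitation noted in the text: anything not recognised is treated as Personality regardless of context.

```python
# Hypothetical term lists standing in for the indexer's judgment.
KNOWN_TERMS = {
    "T": {"century", "summer", "1999"},
    "S": {"india", "karnataka"},
    "E": {"teaching", "treatment"},
    "M": {"plastic", "hardness"},
}


def recognise_fc(kernel_idea, known_terms=KNOWN_TERMS):
    """Correlate a kernel idea with Time, Space, Energy and Matter in succession.

    If it is a manifestation of none of them, it is deemed Personality.
    """
    idea = kernel_idea.lower()
    for fc in ("T", "S", "E", "M"):
        if idea in known_terms[fc]:
            return fc
    return "P"  # the residue


for term in ("India", "Teaching", "Rice"):
    print(term, "->", recognise_fc(term))
```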

3.2.2 Matter

As per the above Table 1, the manifestation of Matter is of three varieties, namely 'Matter - Material', 'Matter - Property' and 'Matter - Method'. Matter represents a property or the material constituent of the focal idea of the subject statement. After 1964, the qualifier concept was recognized, which led to the recognition of the material constituent; such qualifiers are known as Speciators.

3.2.3 Energy

Energy connotes some kind of action in relation to the focal idea. Ranganathan (1957) stated: "Energy manifests itself either as motion, interaction or mutual action of some kind or as one of the isolates postulated to be Energy, such as those denoted by the term - Physiology, Morphology, Ecology, Disease etc." Any action is termed an 'Energy' facet.

3.2.4 Space

The concept of the FC Space is in accordance with what is commonly understood by that term. The surface of the earth and the space inside and outside it are manifestations of the FC Space. Geographical areas and physiographic features are likewise manifestations of the FC Space.

3.2.5 Time

Time isolate ideas such as millennium, century, decade, year etc. are manifestations of the FC Time. Time isolates of another kind, such as day and night, seasons such as summer and winter, and meteorological qualities like wet, dry, stormy etc., are also taken as manifestations of the FC Time.

Keeping in view the explanation of each FC, it is seen that these FCs are identifiable without much difficulty. Postulates and principles provide a kind of typology of generic relations resulting in a Facet Structure which can be used for generating an organized set of subject propositions. The five FCs are interrelated and, keeping this in view, SRR sequenced them as PMEST in order of decreasing concreteness of categories. With the aid of the postulates of FC, rounds, levels and basic facet, and of the canons and principles, a helpful sequence of compound subjects going with one and the same basic subject, and an overall sequence of subjects going with different basic subjects, have been achieved. Work on the analysis of subjects in terms of categories has been attempted by different scholars, for example Dobrowolski, Cordonnier and Eric de Grollier, Farradane, Foskett, Vickery, Mills, Kyle, Cerenin, Vleduts, Stockolova, Perry, Kent, Shera and Egan, who have used different terminologies which can be grouped or reduced to the five FC - PMEST (Seetharama 1972). Among the earlier specialists in constructing IL, the Classification Research Group (CRG) of Britain, established in 1948, is worth mentioning. The influence of SRR's ideas is discernible in the faceted schemes produced by CRG. Farradane from CRG doubted and abandoned the idea of the Universe of subjects being divided into Basic subjects, Main subjects, Compound subjects etc., and maintained that it was from the universe of concepts that all compound subjects must be ultimately constructed (Palmer & Austin 1971).

Another systematic attempt to design an IL for the Social Sciences is that of Barbara Kyle (1958). She identified only two categories, namely Personality and Activities. Like Farradane, she also abandoned the traditional disciplines and arranged all the concepts, irrespective of their origin, under the two FC, with Activities preceding Personality in the sequence. Space and Time are also taken into account.

Linguistically, the subject structure can be designated either by one term or by a more complicated linguistic expression. Usually, concepts can be taken as implicit in a subject. Human minds are able to form concepts which are of an abstract nature (Johansen 1990). SRR (1967) postulated an Absolute Syntax, which he described as "the sequence in which the component ideas of compound subjects going with a Basic Subject, usually arrange themselves in the minds of the majority of normal intellectuals." This postulate helps in deriving principles for the sequence of component ideas in a subject.

3.3 Facet Structure

Structure is the way in which the components of an entity are put together. Anything that has structure has parts, properties or aspects which are related to each other in some manner. The generalized facet structure for subjects is represented by the following schema (Neelameghan 1979).

Figure 12

Subject structuring obtained using the generalized facet structure is found to give a co-extensive representation of subjects and an arrangement of subjects helpful to a majority of users (Neelameghan 1979). The sequence of facets in compound subjects is called the Facet Syntax (FS). A number of principles have been formulated in FS, such as: (a) Commodity - Raw material, (b) Actand - Action - Actor - Tool, (c) Cow - Calf, (d) Whole - Organ and (e) Wall - Picture principle. The Wall - Picture principle is the most general of these, because the other principles for helpful sequence are derivable from it or are corollaries to it. The wall-picture principle states that, if two facets A and B of a subject are such that the concept behind B will not be operative unless the concept behind A is conceded, even as a mural picture is not possible unless the wall exists to draw upon, then facet A should precede facet B (Neelameghan 1971).
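Read as an ordering rule, the wall-picture principle says that a facet may appear only after every facet it presupposes. This can be sketched as a topological ordering. The example below is an illustration under stated assumptions: the function name and the sample dependencies are hypothetical, and the use of Python's standard `graphlib` module is a convenience of this sketch, not a method described in the source.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def helpful_sequence(depends_on):
    """Order facets so that every 'wall' precedes its 'picture'.

    depends_on maps a facet B to the set of facets A that must be
    conceded before the concept behind B becomes operative.
    """
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    # Hypothetical dependencies, for illustration only.
    example = {
        "Disease": {"Lung"},          # no disease of the lung without the lung
        "Treatment": {"Disease"},     # no treatment without a disease
        "Antibiotic": {"Treatment"},  # the agent presupposes the treatment
    }
    print(helpful_sequence(example))
```

Because each facet here presupposes exactly one other, only one sequence satisfies the principle; with looser dependencies several sequences could be equally valid.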

3.4 Facet Syntax and Linguistic Syntax

Table 2 gives examples of the difference between Facet Syntax and Linguistic Syntax. The facet syntax is based on the wall-picture principle.

Table 2

Language | Subject in NL | Facet Syntax
English | Antibiotic treatment of bacterial disease | Child Medicine, Lung, Bacterial Treatment, Antibiotic
Kannada | makkalalli eekaanujiivi swaasakoosa roogada jiivirodaka cikitse | makkala aarogya, swaasakoosa, eekaanujiivi, rooga, cikitse, jiivirodhaka.
Tamil | kulandekalin nuraiiral kiriminoykkana antibiotic cikiccai. | kulandekalin aarokyam, nurai iral, kiriminoykkana, cikitsai, antibiotic.
Telugu | pillala uupiri tittilaku cendina krimimuulaka vyadula kriminasaka cikitsa | pillala aarogyam, pirititti, krimimuulaka, roogamu, cikitsa, kriminasaka.
English | The sociology of alcoholism among middle-class people in developing countries 1950-70. | Sociology, Middle-class, Alcoholism, Developing Countries, 1950-70.
Kannada | abhivruddhisiila raastragalalli madyamavargadavara meele madyapaanada prabhava 1950-70. | samajasastra, madyama varga, madyapaana, abhivriddhisiila, raastra 1950-70
Tamil | munnerum naatkalil naduttara makkalidaye kutippalakkam parriya samuuka vijnanam. 1950-70 | samuuka vijnanam, naduttara makkalidaye, kudippalakkam munnerum, naatkal, 1950-70.
Telugu | abhivriddi chendutunna desalalo madyataragati, prajalapai saaraa prabhavampai sangika pariseelana. 1950-70. | sangika sastram, madyataragathi, prajalu, saaraa, prabhavam abhivriddi cendutunna desam 1950-70

The facet syntax derived on the basis of the postulates and principles of the General Theory of Library Classification, particularly the wall-picture principle, is the same for each subject in each language. It follows the conceptual order and is independent of linguistic syntax, although the linguistic syntax differs from language to language because the word order is different in each language. For example, the word order of English is Subject Verb Object (S V O), whereas most of the Indian languages have S O V word order. Taking the first example in Table 2 and tallying its three main concepts with word order, S will be Child, O will be bacterial diseases of lungs, and V will be Treatment. Hence the facet syntax tallies with the S O V word order of Indian languages.
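The independence of facet syntax from word order can be illustrated with a small sketch. It assumes that each concept has already been assigned to a category; the tags used below for the second example of Table 2 (for instance Alcoholism as Energy) are hypothetical assignments made for this illustration, as are the function and variable names.

```python
# Citation order assumed here: basic subject (BS) first, then P, M, E, S, T.
ORDER = ["BS", "P", "M", "E", "S", "T"]


def facet_syntax(tagged_concepts):
    """Arrange (concept, category) pairs in BS-PMEST order.

    sorted() is stable, so concepts sharing a category keep their input order.
    """
    ranked = sorted(tagged_concepts, key=lambda pair: ORDER.index(pair[1]))
    return [concept for concept, _ in ranked]


# The same concepts in two different input orders, standing in for the
# differing word orders of two languages. Category tags are assumptions.
first_word_order = [
    ("Sociology", "BS"), ("Alcoholism", "E"), ("Middle-class", "P"),
    ("Developing countries", "S"), ("1950-70", "T"),
]
second_word_order = [
    ("Developing countries", "S"), ("Middle-class", "P"),
    ("Alcoholism", "E"), ("Sociology", "BS"), ("1950-70", "T"),
]

if __name__ == "__main__":
    print(facet_syntax(first_word_order))
    print(facet_syntax(second_word_order))
```

Both calls yield the same sequence, which matches the order of the English facet string for this subject in Table 2.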

Another faceted scheme much influenced by SRR's ideas is the Broad System of Ordering (BSO). The basic facet pattern embodied in a particular subject field is as follows:

  1. Tools or equipment for carrying out operation.
  2. Operations (Purposive activities by people).
  3. Process, interaction.
  4. Parts, subsystems, objects of action or study.
  5. Objects of action or study, products or total system.

Example: 'Child welfare in disaster relief.' 575,32,0,73,50

In the above BSO code number, the first element in combination order, namely the concept Child, belongs to facet 5; the second element, the process which requires a welfare operation to be undertaken, namely the concept Disaster, belongs to facet 3. Facet 4 is not applicable here. Though facet 2 is applicable, it has no role in this combination, since Welfare defines the whole combination area. Facet 1 would be applicable if a particular Welfare Agency were to be specified. The citation order within the subject field is regularly the reverse of the scheduled sequence of the elements concerned, which is quite similar to the PMEST order of SRR, which is in the order of decreasing concreteness of categories. Neelameghan (1971) suggested a model of deep structure underlying a surface linear ordering using the wall-picture principle. Harris and others (1979) agreed with this model, but instead of the wall-picture principle they followed the 'General to Particular' and 'Abstract to Concrete' principles. For example, the whole sequence begins with the very broad category that constitutes the basic subject and its entire literature, and ends in the 'External Dimension' with the physical particulars of the document. The 'Internal Dimension' leads to particular linguistic acts, errors and objects.
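The BSO rule that citation order is the reverse of the scheduled sequence can be sketched in a few lines. The facet numbers for Child (5) and Disaster (3) are taken from the example above; the function name and the representation of each concept as a (term, facet number) pair are assumptions of this sketch, which does not attempt to reproduce actual BSO notation.

```python
def citation_order(concepts):
    """Cite concepts in the reverse of their schedule (facet number) order.

    concepts is a list of (term, facet_number) pairs; the concept from the
    highest-numbered facet is cited first.
    """
    ranked = sorted(concepts, key=lambda pair: pair[1], reverse=True)
    return [term for term, _ in ranked]


if __name__ == "__main__":
    # 'Child welfare in disaster relief': Child is facet 5, Disaster facet 3.
    print(citation_order([("Disaster", 3), ("Child", 5)]))
```

For this input the sketch cites Child before Disaster, in line with the combination order described in the text.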

In Faceted Information Retrieval for Linguistics (FIRL), Harris (1979) considers that, among the five FC, the core component, the Personality facet, is represented at one level by the sub-disciplines and theoretical schools of linguistics and at another level by the characterization of the language speaker. Energy is clearly the speaker's performance. Space and Time turn up in that order in dialect and historical period.

Hemalatha Iyer (1990), while discussing the transformational rules to NL representation, states that the facet structure of a subject proposition can be correlated to a similar structure in linguistics. She finds a parallel in the inter-constituent structure of a formal language in Halliday's (1976) System and Structure, makes a comparison between linguistic structure and facet structure, and formulates rules for transformation from facet structure to NL representation. She infers that a pre-coordinate index string would facilitate collocation and browsing, while the NL representation would help the user to interpret the subject of the document accurately.

The terms in an IL should be grouped in one location in an exhaustive manner so that the searcher can get the information in a short time. Since IL is limited to a certain extent in its syntax and semantics when conveying meaning to the searcher, the question is, 'Is there any way to help the users without changing the meaning?'. Though a grammar like PMEST gives an efficient typology for indexing purposes, it does not work in favour of NL. This is supported by Iyer's statement that "Facet structure representation is not as effective as NL in communicating the subject of the document to the user" (Iyer 1990). We have to test whether theories from modern linguistics like Transformational Grammar are able to give much better compatibility to IL, in particular for Indian languages.

3.5 Sample Infolinguistic Studies

Information scientists have carried out linguistic research in classification and information processing in the following areas (Neelameghan 1982):

  1. Linguistic problems in natural language interactive inquiry systems.
  2. Multi-lingual thesauri.
  3. Input-output problems in multi-lingual information networks.
  4. Mechanical linguistic aids in thesauri development.
  5. Languages for control and access as related to both data entry and inquiry.
  6. Semantic and conceptual foundations of classification.

3.6 Application of TG to IL

3.6.1 Computer Application of TG

Based on Chomskian phrase structure grammars, parsers have been developed which represent a sentence in a tree structure. On the programming side, Definite Clause Grammars (DCG) are the basis, and PROLOG (Programming in Logic) is one of the most popular languages in Artificial Intelligence programming. Finite State Transition Network (FSTN), Recursive Transition Network (RTN), Augmented Transition Network (ATN), etc., are some of the computational models. FSTN parsers are useful in dealing with a very limited subset of a natural language with a limited vocabulary. Finite State Grammars are not recursive. Hence, RTNs were developed, which have subnetworks and build large networks in a modular way. Any RTN which allows additional tests and stores information on the labels is called an ATN. It can store information in registers and provides registers for each constituent such as noun phrase, verb phrase, etc. At the end of parsing, the contents of the registers are grouped to form a valid sentence structure. Until then, the ATN keeps on trying alternative sentence structures (Prasad 1992). In the present context, in addition to the syntactic models, semantic models are also being developed. The input sentences are transformed through the use of domain-dependent semantic rewrite rules which create the target knowledge structure. Contextual Dependency Grammar and Modular Logic Grammar are a few examples of this. Salton (1984) hopes that new developments may render the linguistic techniques more attractive in future. Suppose a sentence like the one given below is fed to the computer:

'Students read lessons'. This sentence is analyzed as:
[S, [np, [n, students]], [vp, [tv, read], [np, [n, lessons]]]]
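A bracketed analysis of this kind can be produced by a very small phrase structure parser. The sketch below is written in Python rather than PROLOG and is not the DCG or ATN implementation referred to above; the three-word lexicon, the toy rules S -> NP VP, NP -> N, VP -> TV NP and all function names are assumptions made for illustration.

```python
# Toy lexicon: word -> lexical category (n = noun, tv = transitive verb).
LEXICON = {"students": "n", "lessons": "n", "read": "tv"}


def parse_np(tokens, pos):
    """NP -> N. Return (tree, next_position) or (None, pos) on failure."""
    if pos < len(tokens) and LEXICON.get(tokens[pos]) == "n":
        return ["np", ["n", tokens[pos]]], pos + 1
    return None, pos


def parse_vp(tokens, pos):
    """VP -> TV NP. Return (tree, next_position) or (None, pos) on failure."""
    if pos < len(tokens) and LEXICON.get(tokens[pos]) == "tv":
        obj, nxt = parse_np(tokens, pos + 1)
        if obj is not None:
            return ["vp", ["tv", tokens[pos]], obj], nxt
    return None, pos


def parse_sentence(text):
    """S -> NP VP. Return the tree as nested lists, or None if no parse."""
    tokens = text.lower().rstrip(".").split()
    np, pos = parse_np(tokens, 0)
    if np is None:
        return None
    vp, pos = parse_vp(tokens, pos)
    if vp is None or pos != len(tokens):
        return None
    return ["s", np, vp]


if __name__ == "__main__":
    print(parse_sentence("Students read lessons."))
```

The nested lists mirror the bracketed structure shown above; a sentence that does not fit the toy rules simply yields no parse.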

3.6.2 Manual Application of TG

To exploit internal similarities of the major categories, Chomsky devised the X - Bar convention to show the occurrence restrictions holding within sentences. He has shown how the internal structure of the derived nominals reflects the sentence. Word categories like Noun, Verb, Auxiliary etc. are lexical categories, whereas NP, VP, Adj ph, Pre ph, Adv ph and S are the non-final nodes / phrase markers. There are intermediaries which are neither lexical categories nor phrase markers. For this type of representation, the X - Bar convention is used.

XP = Phrasal category, X' = Intermediary, X = Lexical. However, linguists now mix the bar convention and the phrasal category convention. The central idea in the X - bar theory is that the PS - rules determining th