Introduction

This paper covers corpus linguistics and its connections with lexicology and translation. The connection with translation matters most to me, because I am keen to explore a topic that is closely tied to my future profession. Frankly speaking, it was not an easy journey, but I am hopeful that it will prove successful.

A corpus is an electronically stored collection of samples of naturally occurring language. Most modern corpora are at least 1 million words in size and consist either of complete texts or of large extracts from long texts.
Usually the texts are selected to represent a type of communication or a variety of language; for example, a corpus may be compiled to represent the English used in history textbooks, or Canadian French, or Internet discussions of genetic modification. Corpora are investigated through the use of dedicated software. Corpus linguistics can be regarded as a sophisticated method of finding answers to the kinds of questions linguists have always asked. A large corpus can be a test bed for hypotheses and can be used to add a quantitative dimension to many linguistic studies.
It is also true, however, that corpus software presents the researcher with language in a form that is not normally encountered, and that this can highlight patterning that often goes unnoticed. Corpus linguistics has also, therefore, led to a reassessment of what language is like.

During this journey we will look at:
– What is Corpus Linguistics?
– Corpus Linguistics Terms and Their Meanings
– History of Corpus Linguistics
– Resources and Methodologies for Corpus Linguistics, Corpora
– Translation
– Corpus Linguistics and Linguistic Theory, Corpus-Based Descriptions

So fasten your seat belts: we are flying!

What is Corpus Linguistics?

Corpus linguistics is the study of language, and a method of linguistic analysis, based on a collection of natural or “real-world” texts known as a corpus. It is used to investigate a wide range of linguistic questions and offers a unique insight into the dynamics of language, which has made it one of the most widely used linguistic methodologies. Since corpus linguistics involves large corpora that consist of millions or sometimes even billions of words, it relies heavily on computers to determine what rules govern the language and what patterns (grammatical or lexical, for instance) occur.
Thus it is not surprising that corpus linguistics emerged in its modern form only after the computer revolution in the 1980s. The Brown Corpus, the first modern and electronically readable corpus, however, was created by Henry Kucera and W. Nelson Francis as early as the 1960s.

Corpus Linguistics Terms and Their Meanings

Corpus (plural corpora). A collection of systematically or randomly collected texts of natural language which is electronically stored and processed. A corpus can consist of texts in a single language or in multiple languages.
It contains a large number of texts, which allows researchers to analyse linguistic rules, but a corpus does not represent the entire language, no matter how large it is.

Multilingual corpus. As its name suggests, a multilingual corpus consists of texts in multiple languages.

Parsed corpus (treebank). A collection of texts in naturally occurring language in which each sentence is parsed, that is, syntactically analysed and annotated. The syntactic analysis is typically given in a tree-like structure, which is why a parsed corpus is also known as a treebank.

Parallel corpora. A collection of texts which are translations of each other.

Annotation. The extension of a text by the addition of various kinds of linguistic information. Examples include parsing and tagging. The term is often used in reference to annotated corpora, as opposed to unannotated corpora, which consist of plain text in its raw state.

Collocation. A sequence or pattern in which words appear together or co-occur.

Concordance. A word or phrase together with its immediate context.
In corpus linguistics, concordances are used to analyse the different uses of a single word, word frequency, and phrases or idioms.

Orthography. The standardised writing system of a particular language, including conventions of spelling, capitalisation and punctuation. Orthography can pose a problem in the analysis of writing systems which use accents, because native speakers of these languages sometimes use alternative characters in place of the accented letters or omit the accents completely.
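
To make the Concordance entry above more concrete, here is a minimal Python sketch of a keyword-in-context (KWIC) display, the usual layout for concordance lines. The function name kwic, the width parameter and the sample text are assumptions made for this illustration; they are not taken from any particular concordancing program.

```python
import re

def kwic(text, target, width=30):
    """Return keyword-in-context lines for every occurrence of `target`."""
    lines = []
    # \b restricts matches to whole words; re.escape guards special characters.
    pattern = re.compile(r"\b" + re.escape(target) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so that the keywords form a column.
        lines.append(f"{left:>{width}} [{match.group()}] {right}")
    return lines

# Hypothetical sample text, invented for this illustration.
sample = (
    "A corpus is a collection of texts. A large corpus can be a test bed "
    "for hypotheses. Each corpus is investigated with dedicated software."
)

for line in kwic(sample, "corpus"):
    print(line)
```

Lining the keyword up in a single column is what lets the eye pick out repeated words to its left and right, which is the kind of patterning a concordance is meant to expose.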

Token. An occurrence of an individual word. Tokens play an important role in tokenisation, the division of a text or collection of words into tokens. This method is often used in the study of languages which do not delimit words with spaces.

Lemmatisation. The term derives from the word lemma, which refers to the set of different forms of a single word, such as laugh and laughed. Lemmatisation is the process of grouping together the different forms of the same word.

Wildcard. A special character such as the question mark (?) or asterisk (*) which can stand for a character or word.

3A perspective. A research method used in corpus linguistics which was introduced by S. Wallis and G. Nelson. 3A stands for annotation, abstraction and analysis.

History of Corpus Linguistics

The history of corpus linguistics is typically divided into two periods:
– early corpus linguistics, also known as pre-Chomsky corpus linguistics, and
– modern corpus linguistics.

The early examples of corpus linguistics date to late 19th-century Germany.
In 1897, the German linguist J. Käding used a large corpus of about 11 million words to analyse the distribution of letters and their sequences in the German language. A corpus of that size, comparable to a modern corpus, was revolutionary at the time.
Other early linguists who used corpora to study language include Franz Boas (Handbook of American Indian Languages, 1911), Zellig Harris (Methods in Structural Linguistics, 1951), Charles C. Fries (The Structure of English, 1952), Leonard Bloomfield (Language, 1933), Archibald A. Hill and others, mostly American structural and field linguists. Some of them, such as Fries and A. Aileen Traver, also started to use corpora in the pedagogical study of foreign languages.
In 1961, Henry Kucera and W. Nelson Francis of Brown University started work on the Brown University Standard Corpus of Present-Day American English, commonly known simply as the Brown Corpus, which is the first modern, electronically readable corpus.
It consists of about 1 million words of American English text organised into 15 categories. By the standards of modern corpus linguistics the Brown Corpus is rather small; however, it is widely considered one of the most important works in the history of corpus linguistics. But this was also the time of Chomsky’s criticism of corpus linguistics, which would result in a period of decline. Chomsky rejected the use of corpora as a tool for linguistic studies, arguing that the linguist must model language on competence rather than performance, and that a corpus does not allow language to be modelled on competence.
Corpus linguistics was not abandoned completely; however, it was not until the 1980s that linguists began to show an increased interest in the use of corpora for research. The revival of corpus linguistics and its emergence in its modern form were greatly influenced by the advent of computers and network technology in the 1980s, which allowed linguists to use electronic language samples as well as electronic tools.
The use of computers, however, dates back to the early 1970s, when the Montreal French Project developed the first computerised corpus of spoken language, while Jan Svartvik began work on the London-Lund Corpus with the aid of the Brown Corpus and the Survey of English Usage (SEU) at University College London.
All of the work mentioned before the 1980s, together with the early examples of corpus linguistics, paved the way for the modern corpus-based study of language as we know it today. The term corpus linguistics was finally adopted after J. Aarts and W. Meijs published Corpus Linguistics: Recent Developments in the Use of Computer Corpora in English Language Research in 1984.

Resources and Methodologies for Corpus Linguistics, Corpora

The basic resource for corpus linguistics is a collection of texts, called a corpus.
Corpora can be of varying sizes, are compiled for different purposes, and are composed of texts of different types. All corpora are homogeneous to a certain extent; they are composed of texts from one language or one variety of a language or one register, etc. They also are all heterogeneous to a certain extent, in that at the very least they are composed of a number of different texts. Most corpora contain information in addition to the texts that make them up, such as information about the texts themselves, part-of-speech tags for each word, and parsing information.

What Corpus Linguistics Does

Gives access to naturalistic linguistic information. As mentioned before, corpora consist of “real-world” texts, which are mostly a product of real-life situations. This makes corpora a valuable research source for dialectology, sociolinguistics and stylistics.

Facilitates linguistic research. Electronically readable corpora have dramatically reduced the time needed to find particular words or phrases. Research that would take days or even years to complete manually can be done in a matter of seconds with a high degree of accuracy.

Enables the study of wider patterns and collocation of words. Before the advent of computers, corpus linguistics studied only single words and their frequency. Modern technology has made it possible to study wider patterns and the collocation of words.

Allows analysis of multiple parameters at the same time. Various corpus linguistics software programmes and analytical tools allow researchers to analyse a larger number of parameters simultaneously. In addition, many corpora are enriched with various kinds of linguistic information, such as annotation.

Facilitates the study of a second language. Studying a second language through naturally occurring language gives students a better “feeling” for the language and lets them learn it as it is used in real rather than “invented” situations.

What Corpus Linguistics Does Not

Does not explain why. The study of corpora tells us what happened and how, but it does not tell us why; for instance, why the frequency of a particular word has increased over time.

Does not represent the entire language. Corpus linguistics studies language through randomly or systematically selected corpora. These typically consist of a large number of naturally occurring texts, but they do not represent the entire language, and neither do linguistic analyses that rely on the methods and tools of corpus linguistics.

Searches, Software, and Methodologies

Corpora are interrogated through the use of dedicated software, the nature of which inevitably reflects assumptions about methodology in corpus investigation. At the most basic level, corpus software:
– searches the corpus for a given target item,
– counts the number of instances of the target item in the corpus and calculates relative frequencies,
– displays instances of the target item so that the corpus user can carry out further investigation.
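
The three basic operations listed above (search, count with relative frequency, display) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names tokenise and search_corpus, the very simple tokenisation rule, the per-million normalisation and the two-text toy corpus are all choices made for this example, not features of any specific corpus package.

```python
import re

def tokenise(text):
    """Lowercase the text and split it into word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def search_corpus(texts, target):
    """Search for `target`, count it, and collect the sentences that contain it."""
    total_tokens = 0
    hits = 0
    examples = []
    for text in texts:
        tokens = tokenise(text)
        total_tokens += len(tokens)
        hits += tokens.count(target)
        # Keep each sentence containing the target for later inspection.
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            if target in tokenise(sentence):
                examples.append(sentence)
    # Relative frequency, normalised here to occurrences per million tokens.
    per_million = hits / total_tokens * 1_000_000 if total_tokens else 0.0
    return hits, total_tokens, per_million, examples

# Hypothetical toy corpus, invented for this illustration.
corpus = [
    "The translator read the text. The text was long.",
    "A corpus is a collection of texts.",
]
hits, size, per_million, examples = search_corpus(corpus, "text")
print(f"{hits} hits in {size} tokens ({per_million:.0f} per million)")
for sentence in examples:
    print(" ", sentence)
```

Note that the search matches the exact token only, so texts is not counted as an instance of text; grouping such forms together would require lemmatisation. Normalising to a fixed base such as one million tokens is what allows frequencies from corpora of different sizes to be compared.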
It is apparent that corpus methodologies are essentially quantitative. Indeed, corpus linguistics has been criticized for allowing only the observation of relative quantity and for failing to expand the explanatory power of linguistic theory (for discussion, see Meyer, 2002: 2–5). It is shown in this article that corpus linguistics can indeed enrich language theory, though only if preconceptions about what that theory consists of are allowed to change. Here, however, we leave that argument aside as we review corpus investigation software in more detail.

Corpus Linguistics and Linguistic Theory, Corpus-Based Descriptions

As has been noted, corpus linguistics is essentially a methodology or set of methodologies, rather than a theory of language description. Essentially, corpus linguistics means this:
– looking at naturally occurring language;
– looking at relatively large amounts of such language;
– observing relative frequencies, either in raw form or mediated through statistical operations;
– observing patterns of association, either between a feature and a text type or between groups of words.
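
The last point, observing patterns of association between words, can be illustrated with a small Python sketch that counts which words occur near a chosen node word. The function name collocates, the window of two tokens on each side and the sample sentences are assumptions made for this example; real studies usually go on to apply a statistical association measure, which is not attempted here.

```python
from collections import Counter
import re

def collocates(text, node, window=2):
    """Count the words occurring within `window` tokens of `node`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        # Words to the left and right of the node, excluding the node itself.
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        counts.update(left + right)
    return counts

# Hypothetical sample text, invented for this illustration.
sample = (
    "She made a strong case. He drank strong tea. "
    "They like strong tea in the morning."
)
print(collocates(sample, "strong").most_common(3))
```

For simplicity the window ignores sentence boundaries, so a word from the next sentence can be counted as a neighbour. Raw counts like these also favour words that are frequent everywhere, which is one reason relative frequencies are often mediated through statistical operations.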
Reduced to its essence in this way, corpus linguistics appears to be ‘theory neutral,’ although the practice of doing corpus linguistics is never neutral, as each practitioner defines what is meant by a ‘feature’ and what frequencies should be observed, in line with a theoretical approach to what matters in language. Approaches to the use of a corpus that essentially rely on the existence of categories derived from noncorpus investigations of language are sometimes referred to as ‘corpus based’ (Tognini-Bonelli, 2001).
Studies of this kind can test hypotheses arising from grammatical descriptions based on intuition or on limited data. Experiments have been designed specifically to do this (Nelson et al., 2002: 257–283).
For example, Meyer (2002: 7–8) describes work on ellipsis from a typological and psycholinguistic point of view that predicts that of the three possible clause locations of ellipsis in American spoken English, one will be much more frequent than the others. A corpus study reveals this to be an accurate prediction. On the other hand, the study of pseudo-titles mentioned in the section ‘Languages and Varieties’ shows how assumptions about language (in this instance, about the influence of one variety of English on another) can be shown to be false. Biber et al.
(1999: 7) comment that “corpus-based analysis of grammatical structure can uncover characteristics that were previously unsuspected.” They mention as examples of this the surprisingly high frequency of complex relative clause constructions in conversation, and the frequency of simplified grammatical constructions in academic prose. A clearer integration between linguistic theory and corpus linguistics is demonstrated by Matthiessen’s work on probability (see the section ‘Probability’).
This work takes its categories from an existing description of English (Halliday’s (1985) systemic functional grammar), but the corpus study was more integral to the theory, as it was the only way that statements about probability of occurrence of each item in the system could be made with accuracy.

Corpus-Driven Descriptions

However, more radical challenges to language description can be found. Sinclair (1991, 2004) argues that the kind of patterning observable in a corpus (and nowhere else) necessitates descriptions of a markedly different kind from those commonly available.
Both the descriptions and the theories that they in turn inspire are, in Tognini-Bonelli’s (2001) terms, “corpus driven.” Some of the challenges to tradition that corpus-driven theories involve are these:
– Lexis and grammar are not distinct, and grammar is not an abstract system underlying language.
– Choice of any kind is heavily restricted by choice of lexis.
– Meaning is not atomistic, residing in words, but prosodic, belonging to variable units of meaning and always located in texts.
Evidence for these claims is presented in the section ‘Observing patterned behavior’ above. The notion of pattern grammar focuses on the way that different lexical items behave differently in terms of how they are complemented.
Grammatical generalizations about complementation cannot be made without describing that individual lexical behavior. Similarly, choice between features such as ‘positive’ and ‘negative’ depends to some extent on lexical item, as some verbs (such as afford) occur in the negative much more frequently than most. In other words, the probability of any grammatical category’s occurring is strongly affected not only by the register but also by the lexis used. Finally, the evidence of phraseology is that it makes more sense to see meaning as belonging to phrases than to individual words.
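
The claim that the probability of a grammatical category depends on the lexical item can be sketched as a simple calculation: for each verb, the share of its occurrences that are negative. The sketch below assumes that every occurrence has already been annotated for polarity. The function name negative_share and the observations list are hypothetical, and the figures are invented for illustration, not real corpus counts.

```python
from collections import defaultdict

def negative_share(observations):
    """Map each verb to the share of its occurrences that are negative."""
    totals = defaultdict(int)
    negatives = defaultdict(int)
    for verb, polarity in observations:
        totals[verb] += 1
        if polarity == "negative":
            negatives[verb] += 1
    return {verb: negatives[verb] / totals[verb] for verb in totals}

# Invented toy annotations (verb, polarity); NOT real corpus figures.
observations = [
    ("afford", "negative"), ("afford", "negative"),
    ("afford", "negative"), ("afford", "positive"),
    ("buy", "positive"), ("buy", "positive"),
    ("buy", "positive"), ("buy", "negative"),
]
for verb, share in negative_share(observations).items():
    print(f"{verb}: {share:.0%} negative")
```

With real annotated data, comparing such shares across verbs would show whether a verb such as afford really does favour the negative more than most.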
Findings such as these have led many writers to see a need for descriptions of language that are radically different from those currently available. Sinclair (1991, 2004) proposes, for example, that meaning be seen as belonging to ‘units of meaning,’ each unit being describable in the way set out in […]. He criticized conventional grammar for distinguishing between structures (a series of ‘slots’) and lexis (the ‘fillers’), such that it appears that any slot can be filled by any filler: there are no restrictions other than what the speaker wishes to say.
This is clearly sometimes the case, and when it is, Sinclair […]

Translation

Corpora can be used to train translators, as a resource for practising translators, and as a means of studying the process of translation and the kinds of choices that translators make. Parallel corpora are often used in these applications, and software exists that will ‘align’ two corpora such that the translation of each sentence in the original text is immediately identifiable. This allows one to observe how a given word has been translated in different contexts.
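
Once two texts are aligned sentence by sentence, checking how a word has been translated becomes a simple lookup. The Python sketch below assumes the alignment has already been done and is stored as (source, target) sentence pairs. The function name translations_of and the five English-German pairs are invented for this illustration; they are not data from any real parallel corpus and prove nothing about actual translation frequencies.

```python
import re

def translations_of(aligned_pairs, source_word, target_word):
    """Split the pairs containing `source_word` by whether the
    translation contains `target_word`."""
    with_target, without_target = [], []
    for source, target in aligned_pairs:
        # Skip pairs whose source sentence does not contain the word at all.
        if source_word not in re.findall(r"\w+", source.lower()):
            continue
        if target_word in re.findall(r"\w+", target.lower()):
            with_target.append((source, target))
        else:
            without_target.append((source, target))
    return with_target, without_target

# Invented, hand-aligned sentence pairs (English, German); illustration only.
pairs = [
    ("She came with her brother.", "Sie kam mit ihrem Bruder."),
    ("Tea with milk, please.", "Tee mit Milch, bitte."),
    ("He is angry with me.", "Er ist böse auf mich."),
    ("I agree with you.", "Ich stimme dir zu."),
    ("The book is new.", "Das Buch ist neu."),
]
matched, other = translations_of(pairs, "with", "mit")
total = len(matched) + len(other)
print(f"'with' rendered by 'mit' in {len(matched)} of {total} pairs")
for source, target in other:
    print(" ", source, "->", target)
```

Listing the pairs where the expected equivalent is absent is often the more informative half of the output, because it shows the translator which other renderings were chosen in context.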
One interesting finding is that apparently equivalent words, such as English go and Swedish gå, or English with and German mit (Viberg, 1996; Schmied and Fink, 2000), occur as translations of each other in only a minority of instances. This suggests differences in the ways those languages use the items concerned. More generally, examination of parallel corpora emphasizes that what translators translate is not the word but a larger unit (Teubert and Čermáková, 2004).
Although a single word may have many equivalents when translated, a word in context may well have only one such equivalent. For example, although travail as an individual word is sometimes translated as work and sometimes as labor, the phrase travaux préparatoires is translated only as preparatory work. Thus, Teubert and Čermáková argue, travaux préparatoires and preparatory work may be considered to be equivalent translation units, whereas no such claim can be made for travaux and work. As well as giving information about languages, corpus studies have also indicated that translated language is not the same as nontranslated language.
Studies of corpora of translated texts have shown that they tend to have higher incidences of very frequent words and that they tend to be more explicit in terms of grammar (Baker, 1993). They may also be influenced by the structure of the source language, as was indicated in the discussion of wh-clefts in English and Swedish in the section ‘Languages and Varieties.’
In communities where people read a large number of translated texts, the foreign language, via its translations, may even influence the home language. Gellerstam (1996) notes that some words in Swedish have taken on the meanings of English words that look similar, and argues that this is because translators tend to translate the English word with the similar-looking Swedish word, thereby using the Swedish word with a new meaning, which then enters the language.
One example is the Swedish word dramatisk, which used to indicate something relating to drama but which now, like the English word dramatic, also means ‘substantial and surprising.’

Conclusion

So every journey has its end, and ours is no exception. It was a long journey, but it was worth it. Corpus linguistics is a relatively new discipline, and a fast-changing one. As computer resources, particularly web-based ones, develop, sophisticated corpus investigations come within the reach of the ordinary translator, language learner, or linguist.
Our understanding of the ways that types of language might vary from one another, and our appreciation of the ways that words pattern in language, have been immeasurably improved by corpus studies. Even more significant, perhaps, is the development of new theories of language that take corpus research as their starting point.

The list of used literature

1. M. A. K. Halliday – Lexicology and Corpus Linguistics
2. Teubert and Čermáková, 2004
3. Wallis, S. and Nelson, G. ‘Knowledge discovery in grammatically analysed corpora’. Data Mining and Knowledge Discovery, 5: 307–340. 2001