a collection of naturally occurring samples of language which have been collected and collated for easy access by researchers and materials developers who want to know how words and other linguistic items are actually used. A corpus may vary from a few sentences to a set of written texts or recordings. In language analysis, corpora usually consist of a relatively large, planned collection of texts or parts of texts, stored and accessed by computer. A corpus is designed to represent different types of language use, e.g. casual conversation, business letters, ESP (English for Specific Purposes) texts. A number of different types of corpora may be distinguished, for example:
1 specialized corpus: a corpus of texts of a particular type, such as academic articles, student writing, etc.
2 general corpus or reference corpus: a large collection of many different types of texts, often used to produce reference materials for language learning (e.g. dictionaries) or used as a baseline for comparison with specialized corpora
3 comparable corpora: two or more corpora in different languages or language varieties containing the same kinds and amounts of texts, to enable differences or equivalences to be compared
4 learner corpus: a collection of texts or language samples produced by language learners
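The use of a reference corpus as a baseline for a specialized corpus can be illustrated with a minimal sketch. It assumes relative word frequency as the point of comparison, which is only one of several possible measures. The function name `relative_frequencies`, the simple regular-expression tokenizer, and the toy texts are all hypothetical and far too small to count as real corpora.

```python
from collections import Counter
import re


def relative_frequencies(texts):
    """Return each word's share of all word tokens in a list of texts."""
    tokens = []
    for text in texts:
        # Crude tokenizer (an assumption): lowercase runs of letters and apostrophes.
        tokens.extend(re.findall(r"[a-z']+", text.lower()))
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


# Hypothetical toy data standing in for a specialized (academic) corpus
# and a general reference corpus.
specialized = [
    "The results suggest that the hypothesis holds.",
    "We suggest further analysis of the results.",
]
reference = [
    "The cat sat on the mat.",
    "We went to the market and the results were posted.",
    "I suggest we leave early.",
]

spec_freq = relative_frequencies(specialized)
ref_freq = relative_frequencies(reference)

# Print each word with its relative frequency in both collections.
for word in ("results", "suggest", "the"):
    print(word, round(spec_freq.get(word, 0.0), 3), round(ref_freq.get(word, 0.0), 3))
```

A word whose relative frequency is clearly higher in the specialized collection than in the reference collection is a candidate for being characteristic of that text type. With real corpora a statistical measure would normally be used instead of a raw comparison.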
What is a Corpus?
The word "corpus", derived from the Latin word meaning "body", may be used to refer to any text in written or spoken form. In modern linguistics, however, the term refers to a large collection of texts, held in machine-readable form, that represents a sample of a particular variety or use of one or more languages. Other definitions, broader or stricter, exist. See, for example, the definition in the book "Corpus Linguistics" by Tony McEnery and Andrew Wilson, or read more about different kinds of corpora in the Systematic Dictionary of Corpus Linguistics.
Computer-readable corpora can consist of raw text only, i.e. plain text with no additional information. Many...