
Text corpus

Overview of Text Corpora

- A corpus that contains texts in a single language is a monolingual corpus.
- A corpus that contains text data in multiple languages is a multilingual corpus.
- Corpora are often annotated, for example with part-of-speech tags (POS tagging) and the lemma (base form) of each word.
- When the language of the corpus is not a working language of the researchers, interlinear glossing is used to make the annotation bilingual.
- Some corpora carry further structured levels of analysis, including morphology, semantics, and pragmatics.

Applications of Text Corpora

- Corpora are the main knowledge base in corpus linguistics.
- They are used in language technology, natural language processing, and computational linguistics.
- Corpora and the frequency lists derived from them are useful for language teaching.
- They are used in machine translation, particularly in the form of aligned parallel corpora.
- Text corpora are also used in philology and the study of historical documents.

Notable Text Corpora

- Comprehensive lists of text corpora are available.
- Notable text corpora include the British National Corpus (BNC) and the Corpus of Contemporary American English (COCA).
- Other notable corpora include parallel corpora used in machine translation research.
- The Amarna letters and the Kültepe Texts are examples of short-duration corpora.
- In the study of ancient cities, some corpora are delimited by the dates of their find sites.

Related Concepts and Terms

- Related terms and concepts include concordance, corpus linguistics, distributional-relational database, the Linguistic Data Consortium, natural language processing, and the Natural Language Toolkit.
- Parallel text alignment is crucial for the analysis of parallel corpora.
- Search engines access the web corpus.
- Speech corpus and translation memory are also related to text corpora.
- Treebank and Zipf's Law are additional concepts related to text corpora.

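The frequency lists mentioned under Applications, and the concordance and Zipf's Law listed among the related concepts, can be illustrated with a small sketch. The Python below uses only the standard library. The toy corpus, the crude regex tokenizer, and the function names (tokenize, frequency_list, concordance) are assumptions made for illustration and do not come from any particular corpus toolkit.

```python
import re
from collections import Counter

# Hypothetical toy corpus; a real corpus would be far larger.
TOY_CORPUS = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "A cat and a dog met on the mat.",
]


def tokenize(text):
    """Lowercase the text and extract runs of letters and apostrophes.

    This is a deliberately crude tokenizer (an assumption of this sketch);
    real corpora usually need language-aware tokenization.
    """
    return re.findall(r"[a-z']+", text.lower())


def frequency_list(documents):
    """Return (word, count) pairs, most frequent first, ties in A-Z order."""
    counts = Counter(tok for doc in documents for tok in tokenize(doc))
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))


def concordance(documents, keyword, width=2):
    """Return keyword-in-context lines with `width` tokens on each side."""
    lines = []
    for doc in documents:
        tokens = tokenize(doc)
        for i, tok in enumerate(tokens):
            if tok == keyword:
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                lines.append(f"{left} [{tok}] {right}".strip())
    return lines


if __name__ == "__main__":
    # Zipf's Law predicts that rank * count stays roughly constant in a
    # large corpus; a corpus this small will not show the pattern clearly.
    for rank, (word, count) in enumerate(frequency_list(TOY_CORPUS)[:5], start=1):
        print(rank, word, count, rank * count)

    for line in concordance(TOY_CORPUS, "sat"):
        print(line)
```

On this toy input the most frequent word is "the" and the concordance for "sat" returns one line per sentence that contains it. On a real corpus the same frequency list could be used to pick vocabulary for language teaching, which is the use the section describes.
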
References and Additional Resources

- Yoon and Hirvela (2004) discuss ESL student attitudes toward corpus use in L2 writing.
- Wołk and Marasek (2014, 2015) present research on real-time statistical speech translation and on mining parallel data from comparable corpora.
- External links provide additional resources on text corpora, including ACL SIGLEX Resource Links and Developing Linguistic Corpora: a Guide to Good Practice.
- Free samples of web-based corpora are available for American English, British English, Spanish, and Portuguese.
- Intercorp and Sketch Engine offer open corpora with free access, while TS Corpus provides a Turkish corpus for academic research.