Glossary Term
Text corpus
Overview of Text Corpora
- A corpus may contain texts in a single language, known as a monolingual corpus.
- A corpus may also contain text data in multiple languages, known as a multilingual corpus.
- Corpora are often subjected to annotation, such as part-of-speech tagging (POS-tagging) and indicating the lemma form of each word.
- Interlinear glossing is used to make the annotation bilingual when the language of the corpus is not a working language of the researchers.
- Some corpora have further structured levels of analysis applied, including morphology, semantics, and pragmatics.
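As a rough illustration of what POS and lemma annotation looks like as data, the sketch below stores each token with its surface form, tag, and lemma. The `Token` class, the tag names, and the hand-annotated sentence are assumptions made for this example, not the format of any particular corpus.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One annotated token (hypothetical layout for illustration)."""
    form: str   # surface form as it appears in the text
    pos: str    # part-of-speech tag; the tagset here is an assumption
    lemma: str  # dictionary (lemma) form of the word


# A hand-annotated example sentence, not taken from a real corpus.
sentence = [
    Token("The", "DET", "the"),
    Token("cats", "NOUN", "cat"),
    Token("were", "AUX", "be"),
    Token("sleeping", "VERB", "sleep"),
]


def lemmas_with_pos(tokens, pos):
    """Return the lemmas of all tokens carrying the given POS tag."""
    return [t.lemma for t in tokens if t.pos == pos]


print(lemmas_with_pos(sentence, "NOUN"))
```

Keeping the annotation layers as separate fields makes it possible to query by lemma or tag without re-analysing the raw text.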
Applications of Text Corpora
- Corpora are the main knowledge base in corpus linguistics.
- They are used in language technology, natural language processing, and computational linguistics.
- Corpora and frequency lists derived from them are useful for language teaching.
- They are used in machine translation, particularly with aligned parallel corpora.
- Text corpora are also used in philology and the study of historical documents.
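A frequency list of the kind mentioned above can be sketched in a few lines. This is a minimal example using only the Python standard library; the `frequency_list` function and the two-sentence toy corpus are hypothetical, and the tokenizer (lowercasing plus a `\w+` match) is deliberately naive.

```python
import re
from collections import Counter


def frequency_list(texts, top_n=None):
    """Count word forms across texts and return (word, count) pairs,
    most frequent first. Tokenization is a naive \\w+ match."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"\w+", text.lower()))
    return counts.most_common(top_n)


# Toy corpus for illustration only.
corpus = ["The cat sat on the mat.", "The dog sat too."]

for word, count in frequency_list(corpus, top_n=2):
    print(word, count)
```

A real frequency list would usually be built over lemmas rather than raw word forms, which requires an annotated corpus.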
Notable Text Corpora
- There is a comprehensive list of text corpora available.
- Notable text corpora include the British National Corpus (BNC) and the Corpus of Contemporary American English (COCA).
- Other notable corpora include parallel corpora used in machine translation research.
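- Some archaeological corpora cover so short a period that they provide a snapshot in time; the Amarna letters (c. 1350 BC) are among the shortest, spanning roughly 15 to 30 years.
- The corpus of an ancient city, such as the Kültepe Texts of Turkey, may be divided into a series of corpora determined by find-site dates.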
Related Concepts and Terms
- Related terms and concepts include concordance, corpus linguistics, distributional-relational database, the Linguistic Data Consortium, natural language processing, and the Natural Language Toolkit (NLTK).
- Parallel text alignment is crucial for the analysis of parallel corpora.
- Search engines provide access to the web, which can itself be treated as a corpus.
- Speech corpus and translation memory are also related to text corpora.
- Treebank and Zipf's Law are additional concepts related to text corpora.
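Of the related concepts above, a concordance is perhaps the easiest to show concretely. The following is a minimal keyword-in-context (KWIC) sketch, assuming a pre-tokenized text; the `kwic` function, its `window` parameter, and the sample tokens are all hypothetical names chosen for this example.

```python
def kwic(tokens, keyword, window=2):
    """Return (left context, match, right context) for every
    case-insensitive occurrence of keyword in tokens."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines


# Toy token sequence for illustration only.
tokens = "the cat sat on the mat and the dog sat too".split()

for left, match, right in kwic(tokens, "sat"):
    print(f"{left:>10} [{match}] {right}")
```

Dedicated concordancers add sorting, regular-expression queries, and searches over annotation layers, none of which this sketch attempts.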
References and Additional Resources
- Yoon and Hirvela (2004) discuss ESL student attitudes toward corpus use in L2 writing.
- Wołk and Marasek (2014, 2015) present research on real-time statistical speech translation and on mining parallel data from comparable corpora.
- Various external links provide additional resources on text corpora, including ACL SIGLEX Resource Links and Developing Linguistic Corpora: a Guide to Good Practice.
- Free samples of web-based corpora are available for American English, British English, Spanish, and Portuguese.
- Intercorp and Sketch Engine offer open corpora with free access, while TS Corpus provides a Turkish corpus for academic research.