Overview of Text Corpora
– A monolingual corpus contains texts in a single language.
– A multilingual corpus contains text data in multiple languages.
– Corpora are often annotated, for example with part-of-speech (POS) tags and the lemma (base form) of each word.
– Interlinear glossing is used to make the annotation bilingual when the language of the corpus is not a working language of the researchers.
– Some corpora have further structured levels of analysis applied, including morphology, semantics, and pragmatics.
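The annotation layers listed above can be pictured as extra fields attached to each token. The following is a minimal sketch, assuming a hand-annotated three-word sample; the `Token` class, its field names, and the tag labels are illustrative assumptions and do not follow any particular corpus format or tagset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One annotated word: surface form, lemma, and POS tag (hypothetical layout)."""
    form: str
    lemma: str
    pos: str


# Hand-annotated sample sentence; the tags are illustrative labels only.
sentence = [
    Token("The", "the", "DET"),
    Token("cats", "cat", "NOUN"),
    Token("slept", "sleep", "VERB"),
]


def lemmas_with_pos(tokens, pos):
    """Return the lemmas of all tokens carrying the given POS tag."""
    return [t.lemma for t in tokens if t.pos == pos]


def to_tagged_text(tokens):
    """Render tokens as space-separated form/lemma/POS triples."""
    return " ".join(f"{t.form}/{t.lemma}/{t.pos}" for t in tokens)


print(to_tagged_text(sentence))
print(lemmas_with_pos(sentence, "VERB"))
```

Keeping the lemma and tag next to the surface form lets a query match "slept" when the researcher searches for the verb "sleep".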
Applications of Text Corpora
– Corpora are the main knowledge base in corpus linguistics.
– They are used in language technology, natural language processing, and computational linguistics.
– Corpora and frequency lists derived from them are useful for language teaching.
– They are used in machine translation, particularly with aligned parallel corpora.
– Text corpora are also used in philology and the study of historical documents.
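A frequency list of the kind mentioned above can be derived from a corpus by counting tokens. The sketch below assumes a two-sentence toy corpus and a deliberately crude tokenizer (lowercased runs of ASCII letters); the function names are illustrative, and a real project would likely use a language-aware tokenizer.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and extract runs of ASCII letters (a crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def frequency_list(documents):
    """Count word occurrences across all documents, most frequent first."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return counts.most_common()


# Toy corpus, invented for illustration.
toy_corpus = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
]

freq = frequency_list(toy_corpus)
for word, count in freq[:3]:
    print(word, count)
```

A language teacher could use such a list to decide which words to introduce first, since the highest-ranked entries account for most of the running text.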
Notable Text Corpora
– There is a comprehensive list of text corpora available.
– Notable text corpora include the British National Corpus (BNC) and the Corpus of Contemporary American English (COCA).
– Other notable corpora include parallel corpora used in machine translation research.
– The Amarna letters and the Kültepe Texts are examples of corpora that cover a short span of time.
– In the study of ancient cities, some corpora are delimited by the dates of their find sites.
Related Concepts and Terms
– Related terms and concepts include concordance, corpus linguistics, distributional-relational database, the Linguistic Data Consortium, natural language processing, and the Natural Language Toolkit (NLTK).
– Parallel text alignment is crucial for the analysis of parallel corpora.
– Search engines access the web corpus.
– Speech corpus and translation memory are also related to text corpora.
– Treebank and Zipf’s Law are additional concepts related to text corpora.
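Of the related concepts above, a concordance is perhaps the easiest to illustrate: it lists every occurrence of a word together with its surrounding context. The following is a minimal keyword-in-context (KWIC) sketch, assuming a pre-tokenized toy text; the `kwic` function and its `width` parameter are illustrative names, not part of any library.

```python
def kwic(tokens, keyword, width=2):
    """Return (left context, match, right context) for every occurrence of keyword.

    `width` is the number of tokens of context kept on each side.
    """
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Toy text, invented for illustration and split on whitespace.
tokens = "the cat sat on the mat and the dog sat on the log".split()

hits = kwic(tokens, "sat")
for left, match, right in hits:
    print(f"{left:>10} [{match}] {right}")
```

Aligning the matches in a column, as the print loop does, makes recurring patterns to the left and right of the keyword easy to spot.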
References and Additional Resources
– Yoon and Hirvela (2004) discuss ESL student attitudes toward corpus use in L2 writing.
– Wołk and Marasek (2014, 2015) present research on real-time statistical speech translation and on mining parallel data from comparable corpora.
– Various external links provide additional resources on text corpora, including ACL SIGLEX Resource Links and Developing Linguistic Corpora: a Guide to Good Practice.
– Free samples of web-based corpora are available for American English, British English, Spanish, and Portuguese.
– InterCorp and Sketch Engine offer open corpora with free access, while TS Corpus provides a Turkish corpus for academic research.
In linguistics and natural language processing, a corpus (pl.: corpora) or text corpus is a dataset consisting of natively digital and older, digitized language resources, either annotated or unannotated.
Annotated corpora have been used in corpus linguistics for statistical hypothesis testing, checking occurrences, and validating linguistic rules within a specific language territory.
In search technology, a corpus is the collection of documents which is being searched.
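In the search sense, the corpus is simply the set of documents a query runs against. One common way to make such a collection searchable is an inverted index, sketched below under stated assumptions: the three documents are invented, the tokenizer keeps only lowercased runs of ASCII letters, and `build_index` and `search` are illustrative names rather than the API of any search engine.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every word of the query (boolean AND)."""
    words = re.findall(r"[a-z]+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Toy document collection, invented for illustration.
docs = {
    1: "Corpora are used in machine translation.",
    2: "A speech corpus contains audio recordings.",
    3: "Parallel corpora support machine translation research.",
}

index = build_index(docs)
print(sorted(search(index, "machine translation")))
```

The index is built once over the whole corpus, so each query only intersects a few small sets instead of rescanning every document.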