Glossary Term

Part-of-speech tagging

Part-of-speech tagging basics and techniques

- Part-of-speech (POS) tagging is the process of marking up each word in a text as corresponding to a particular part of speech.
- The tag is based on both the definition of the word and its context.
- A simplified form of POS tagging is commonly taught to school-age children, who learn to identify words as nouns, verbs, adjectives, adverbs, etc.
- POS tagging is now done using algorithms in computational linguistics.
- There are two groups of POS-tagging algorithms: rule-based and stochastic.
- Stochastic taggers use the percentage of the time each category (for example a noun, an adjective, or a number) follows a given word to help determine the part of speech of the next word.
- More advanced hidden Markov models (HMMs) can learn the probabilities of larger sequences, not only of pairs.
- When several ambiguous words occur together, every combination of tags can be enumerated and assigned a relative probability, and the most probable combination is chosen.
- CLAWS achieved 93-95% accuracy in part-of-speech tagging.
- Charniak pointed out that merely assigning the most common tag to each known word and 'proper noun' to all unknown words approaches 90% accuracy.
- DeRose and Church developed dynamic programming algorithms for part-of-speech tagging.
- DeRose used a table of pairs, while Church used a table of triples.
- Both methods achieved over 95% accuracy.
- DeRose's work was replicated for Greek and proved effective.
- Major algorithms for part-of-speech tagging include the Viterbi algorithm, the Brill tagger, Constraint Grammar, and the Baum-Welch algorithm.
- Hidden Markov model and visible Markov model taggers can both be implemented using the Viterbi algorithm.
- The rule-based Brill tagger learns a set of rule patterns and then applies them.
- Machine learning methods such as SVM, maximum entropy classifiers, perceptrons, and nearest-neighbor have also been applied to part-of-speech tagging.
- Direct comparisons of methods have been published; one reported result is 97.36% accuracy using the structure regularization method.

Tag Sets and Variations

- English has 9 commonly taught parts of speech, but there are many more categories and sub-categories.
- Nouns can be distinguished as plural, possessive, or singular, while verbs can be marked for tense and aspect.
- In some tagging systems, different inflections of the same root word receive different part-of-speech tags, which increases the number of tags.
- Tag sets for POS tagging in English typically distinguish from 50 to 150 separate parts of speech.
- Different languages have different tag sets, with heavily inflected languages having larger tag sets.

History and Development

- Research on part-of-speech tagging has been closely tied to corpus linguistics.
- The Brown Corpus, developed in the mid-1960s, was the first major corpus of English for computer analysis.
- The Brown Corpus consists of about 1,000,000 words of running English prose text.
- It was painstakingly tagged with part-of-speech markers over many years: a program produced a first pass, which was later reviewed and corrected by hand.
- The tagging of the Brown Corpus formed the basis for many later part-of-speech tagging systems.
- The corpus has been used for numerous studies and inspired the development of similar tagged corpora in other languages.
- Larger corpora, such as the 100 million word British National Corpus, have since superseded the Brown Corpus.
- For a long time, part-of-speech tagging was considered an inseparable part of natural language processing, because certain words are ambiguous.
- In the mid-1980s, researchers began using hidden Markov models (HMMs) to disambiguate parts of speech.
- HMMs involve counting cases in a tagged corpus and creating a table of probabilities for certain sequences.
- HMMs were used to tag the Lancaster-Oslo-Bergen Corpus of British English.
- The use of HMMs improved part-of-speech tagging accuracy.
- HMMs reduced the need to analyze higher levels of language understanding for each word.

Unsupervised Tagging

- Unsupervised tagging techniques use untagged corpora to derive part-of-speech categories.
- Iterative processes reveal patterns in word use and similarity classes.
- Rule-based, stochastic, and neural approaches are used in unsupervised tagging.
- Unsupervised tagging can provide valuable new insights.
- Induction-based methods can achieve accuracy above 95%.

Related Topics and References

- See also: semantic net, sliding window based part-of-speech tagging, trigram tagger, and word sense disambiguation.
- References include 'POS tags' in Sketch Engine and 'A Universal Part-of-Speech Tagset'.
- Works cited: Charniak's 'Statistical Techniques for Natural Language Parsing' and DeRose's 'Stochastic Methods for Resolution of Grammatical Category Ambiguity in Inflected and Uninflected Languages'.
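
Worked example: HMM tagging with the Viterbi algorithm

The bullets on HMMs (counting cases to build a table of probabilities) and on the Viterbi algorithm can be made concrete with a small sketch. This is only an illustration under stated assumptions, not the method used by CLAWS, DeRose, or Church: it assumes a table of tag pairs (a bigram model), add-one smoothing for unseen events, and a four-sentence toy corpus invented for this example. The names `train`, `log_prob`, `viterbi`, and the tag labels are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Tiny hand-tagged corpus, invented purely for illustration.
TAGGED = [
    [("the", "DET"), ("can", "NOUN"), ("rusts", "VERB")],
    [("the", "DET"), ("dog", "NOUN"), ("can", "AUX"), ("run", "VERB")],
    [("dogs", "NOUN"), ("run", "VERB")],
    [("the", "DET"), ("run", "NOUN"), ("ends", "VERB")],
]
START = "<s>"  # pseudo-tag marking the beginning of a sentence


def train(sentences):
    """Count tag-pair transitions and tag-to-word emissions."""
    trans = defaultdict(Counter)  # trans[prev_tag][tag] = count
    emit = defaultdict(Counter)   # emit[tag][word] = count
    for sent in sentences:
        prev = START
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            prev = tag
    return trans, emit


def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability of `key` (smoothing is an assumption)."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))


def viterbi(words, trans, emit):
    """Return the most probable tag sequence for `words` under the bigram HMM."""
    if not words:
        return []
    tags = sorted(emit)
    # Vocabulary size plus one slot reserved for unknown words.
    vocab = len({w for counts in emit.values() for w in counts}) + 1
    # best[i][t] = (log prob of the best path ending in tag t at word i, previous tag)
    best = [{}]
    for t in tags:
        score = log_prob(trans[START], t, len(tags)) + log_prob(emit[t], words[0], vocab)
        best[0][t] = (score, None)
    for i in range(1, len(words)):
        best.append({})
        for t in tags:
            prev_tag, score = max(
                ((p, best[i - 1][p][0] + log_prob(trans[p], t, len(tags))) for p in tags),
                key=lambda pair: pair[1],
            )
            best[i][t] = (score + log_prob(emit[t], words[i], vocab), prev_tag)
    # Follow the back-pointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]


trans, emit = train(TAGGED)
print(viterbi(["the", "can", "rusts"], trans, emit))
print(viterbi(["the", "dog", "can", "run"], trans, emit))
```

On this toy corpus the sketch tags "can" as NOUN after "the" and as AUX after "dog", which is the kind of context-driven disambiguation the probability table is meant to capture. Dynamic programming keeps only the best path into each tag at each word, so the cost grows linearly with sentence length instead of enumerating every tag combination.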