Glossary Term
Part-of-speech tagging
Part-of-speech tagging basics and techniques
- Part-of-speech tagging is the process of marking up a word in a text as corresponding to a particular part of speech.
- The tag is assigned based on both the word's definition and its context.
- A simplified form of POS tagging is commonly taught to school-age children, who learn to identify words as nouns, verbs, adjectives, adverbs, etc.
- Once performed by hand, POS tagging is now done algorithmically in computational linguistics.
- There are two groups of POS-tagging algorithms: rule-based and stochastic.
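- Stochastic taggers rely on the probabilities of tag sequences; for example, after an article such as 'the', the next word might be a noun 40% of the time, an adjective 40%, and a number 20%.
- More advanced (higher-order) hidden Markov models (HMMs) learn the probabilities not only of pairs but also of triples or larger sequences.
- When several ambiguous words occur together, a tagger can enumerate every combination of tags, assign each a relative probability by multiplying the probabilities of the individual choices, and pick the most probable combination.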
- CLAWS achieved 93-95% accuracy in part-of-speech tagging.
- Charniak's research showed that assigning the most common tag to known words and 'proper noun' to unknowns can achieve 90% accuracy.
- In 1987, Steven DeRose and Kenneth W. Church independently developed dynamic programming algorithms that solve the same problem in far less time than enumerating every combination.
- DeRose used a table of pairs, while Church used a table of triples.
- Both methods achieved over 95% accuracy.
- DeRose's work was replicated for Greek and proved effective.
- Major algorithms for part-of-speech tagging include the Viterbi algorithm, Brill tagger, Constraint Grammar, and Baum-Welch algorithm.
- Hidden Markov model and visible Markov model taggers can both be implemented using the Viterbi algorithm.
- The rule-based Brill tagger applies learned rule patterns.
- Machine learning methods like SVM, maximum entropy classifier, perceptron, and nearest-neighbor have been applied to part-of-speech tagging.
- A 2014 paper reported 97.36% accuracy on a standard benchmark dataset using the structure regularization method.
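The idea of scoring tag combinations, and of replacing full enumeration with dynamic programming over a table of pairs, can be sketched in a few lines of Python. This is a minimal illustration, not the code of any tagger named above: the tag inventory, the probability tables, and the function names are all assumptions invented for the example.

```python
from itertools import product

# Toy model, invented for illustration only (not estimated from any corpus).
TAGS = ["DET", "NOUN", "VERB"]

# P(tag | previous tag); "<s>" marks the start of the sentence.
TRANSITION = {
    "<s>":  {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1},
    "DET":  {"DET": 0.0, "NOUN": 0.9, "VERB": 0.1},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}

# P(word | tag); a word missing from a tag's row has probability 0.
EMISSION = {
    "DET":  {"the": 1.0},
    "NOUN": {"dog": 0.4, "barks": 0.1, "walk": 0.5},
    "VERB": {"barks": 0.6, "walk": 0.4},
}


def sequence_probability(words, tags):
    """Multiply the probability of each choice in turn for one tag sequence."""
    prob, prev = 1.0, "<s>"
    for word, tag in zip(words, tags):
        prob *= TRANSITION[prev][tag] * EMISSION[tag].get(word, 0.0)
        prev = tag
    return prob


def brute_force_tag(words):
    """Enumerate every tag combination and keep the most probable one."""
    candidates = product(TAGS, repeat=len(words))
    return list(max(candidates, key=lambda tags: sequence_probability(words, tags)))


def viterbi_tag(words):
    """Find the best sequence with dynamic programming over tag pairs."""
    if not words:
        return []
    # best[tag] = (probability, tag path) of the best path ending in that tag.
    best = {
        tag: (TRANSITION["<s>"][tag] * EMISSION[tag].get(words[0], 0.0), [tag])
        for tag in TAGS
    }
    for word in words[1:]:
        step = {}
        for tag in TAGS:
            emit = EMISSION[tag].get(word, 0.0)
            prob, path = max(
                ((p * TRANSITION[prev][tag] * emit, seq)
                 for prev, (p, seq) in best.items()),
                key=lambda item: item[0],
            )
            step[tag] = (prob, path + [tag])
        best = step
    return max(best.values(), key=lambda item: item[0])[1]


print(viterbi_tag(["the", "dog", "barks"]))
print(viterbi_tag(["the", "walk"]))
```

Brute-force enumeration grows exponentially with sentence length, while the dynamic programming version only keeps the best path per tag at each word. With these toy numbers the ambiguous word 'walk' is resolved by the preceding article.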
Tag Sets and Variations
- English has 9 commonly taught parts of speech, but there are many more categories and sub-categories.
- Nouns can have plural, possessive, and singular forms, while verbs can be marked for tense and aspect.
- Different inflections of the same root word can receive different part-of-speech tags, which increases the number of tags.
- Computer taggers for English typically distinguish between 50 and 150 separate parts of speech.
- Different languages have different tag sets, with heavily inflected languages having larger tag sets.
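The difference between a fine-grained tag set and the handful of commonly taught categories can be illustrated by collapsing one into the other. The sketch below uses a few Penn Treebank-style tag names; the mapping table is a hand-written, incomplete assumption for illustration, and `coarsen` is a hypothetical helper, not part of any library.

```python
# Illustrative subset of fine-grained tags mapped to coarse classes.
# This table is an assumption for the sketch, not a complete or official mapping.
FINE_TO_COARSE = {
    "NN": "NOUN",    # singular common noun
    "NNS": "NOUN",   # plural common noun
    "NNP": "NOUN",   # singular proper noun
    "VB": "VERB",    # verb, base form
    "VBD": "VERB",   # verb, past tense
    "VBG": "VERB",   # verb, gerund or present participle
    "JJ": "ADJ",     # adjective
    "RB": "ADV",     # adverb
    "DT": "DET",     # determiner
}


def coarsen(tagged_words, unknown="X"):
    """Replace each fine-grained tag with its coarse class ("X" if unmapped)."""
    return [(word, FINE_TO_COARSE.get(tag, unknown)) for word, tag in tagged_words]


fine = [("the", "DT"), ("dogs", "NNS"), ("barked", "VBD"), ("loudly", "RB")]
print(coarsen(fine))
print(len(FINE_TO_COARSE), "fine tags ->", len(set(FINE_TO_COARSE.values())), "coarse tags")
```

Collapsing tags loses distinctions such as number and tense, which is why taggers for heavily inflected languages tend to keep larger tag sets.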
History and Development
- Research on part-of-speech tagging has been closely tied to corpus linguistics.
- The Brown Corpus, developed in the mid-1960s, was the first major corpus of English for computer analysis.
- The Brown Corpus was painstakingly tagged with part-of-speech markers over many years.
- The corpus has been used for numerous studies and inspired the development of similar tagged corpora in other languages.
- Part-of-speech tagging was considered an inseparable part of natural language processing for a long time.
- The Brown Corpus consists of about 1,000,000 words of running English prose text.
- It was tagged with part-of-speech markers using a program and later reviewed and corrected by hand.
- The tagging of the Brown Corpus formed the basis for many later part-of-speech tagging systems.
- Larger corpora, such as the 100 million word British National Corpus, have since superseded the Brown Corpus.
- Part-of-speech tagging was long considered essential to natural language processing because some ambiguous words cannot be tagged correctly without understanding the semantics or even the pragmatics of the context.
- In the mid-1980s, researchers began using hidden Markov models (HMMs) to disambiguate parts of speech.
- HMM taggers are built by counting cases in a tagged corpus (such as the Brown Corpus) and making a table of the probabilities of certain sequences.
- HMMs were used to tag the Lancaster-Oslo-Bergen Corpus of British English.
- The use of HMMs improved part-of-speech tagging accuracy.
- HMMs reduced the need for analyzing higher levels of language understanding for each word.
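The counting step described above can be sketched as follows. The tiny tagged corpus and the function name `transition_table` are invented for illustration; a real tagger would count over a large corpus and would also need word-given-tag probabilities and some smoothing for unseen pairs.

```python
from collections import Counter, defaultdict

# Tiny hand-tagged corpus, invented for illustration.
TAGGED_SENTENCES = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("old", "ADJ"), ("dog", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("the", "DET"), ("two", "NUM"), ("cats", "NOUN")],
]


def transition_table(sentences):
    """Count adjacent tag pairs and turn the counts into P(next tag | tag)."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tags = [tag for _, tag in sentence]
        for prev, nxt in zip(tags, tags[1:]):
            counts[prev][nxt] += 1
    return {
        prev: {tag: n / sum(following.values()) for tag, n in following.items()}
        for prev, following in counts.items()
    }


table = transition_table(TAGGED_SENTENCES)
for prev, row in table.items():
    print(prev, row)
```

In this toy corpus a determiner is followed by a noun in half of the cases and by an adjective or a number in a quarter each, which is the kind of table a stochastic tagger consults when disambiguating the next word.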
Unsupervised Tagging
- Unsupervised tagging techniques use untagged corpora to derive part-of-speech categories.
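- Iterative processes reveal patterns in word use and derive similarity classes; for example, 'the', 'a', and 'an' occur in similar contexts, while 'eat' occurs in very different ones.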
- Rule-based, stochastic, and neural approaches are used in unsupervised tagging.
- The classes that emerge are often close to those linguists would expect, and the differences sometimes suggest valuable new insights.
- Induction-based methods can achieve accuracy above 95%.
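One simple way to look for similarity classes in untagged text is to describe each word by the words that appear immediately to its left and right, then compare those descriptions. The sketch below shows only that comparison step, not an iterative induction procedure or the method of any particular system; the sentences and the helper names `context_vectors` and `cosine` are assumptions made for the example.

```python
from collections import Counter, defaultdict
from math import sqrt

# Untagged toy sentences, invented for illustration.
SENTENCES = [
    "the dog sleeps", "the cat sleeps", "a dog barks",
    "a cat purrs", "the dog barks", "a cat sleeps",
]


def context_vectors(sentences):
    """Count each word's left and right neighbours across all sentences."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        for i in range(1, len(words) - 1):
            vectors[words[i]]["L:" + words[i - 1]] += 1
            vectors[words[i]]["R:" + words[i + 1]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


vectors = context_vectors(SENTENCES)
print("dog ~ cat:", round(cosine(vectors["dog"], vectors["cat"]), 2))
print("the ~ a:  ", round(cosine(vectors["the"], vectors["a"]), 2))
print("dog ~ the:", round(cosine(vectors["dog"], vectors["the"]), 2))
```

Words that share contexts score high against each other and low against words from other classes, so grouping by similarity starts to separate determiners from nouns without any tags in the input.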
Related Topics and References
- See also: Semantic net, sliding window based part-of-speech tagging, trigram tagger, and word sense disambiguation.
- References include POS tags in Sketch Engine and A Universal Part-of-Speech Tagset.
- Works cited: Charniak's 'Statistical Techniques for Natural Language Parsing' and DeRose's 'Stochastic Methods for Resolution of Grammatical Category Ambiguity in Inflected and Uninflected Languages.'