Glossary Term

Stemming

Introduction to Stemming
- Stemming is the process of reducing inflected words to their word stem.
- Stemming is used in linguistic morphology and information retrieval.
- Stemming algorithms have been studied since the 1960s.
- Stemming helps search engines treat words with the same stem as synonyms.
- The first published stemmer was written by Julie Beth Lovins in 1968.

Types of Stemming Algorithms
- Simple stemmers use a lookup table to map inflected forms to their stems.
- The lookup approach may use part-of-speech tagging to avoid overstemming.
- Suffix-stripping algorithms find root forms by applying a set of rules.
- Prefix stripping can also be implemented for some languages.
- Suffix-stripping algorithms may differ in results and performance.

Production Technique of Stemming Algorithms
- The lookup table used by a stemmer is generally produced semi-automatically.
- Inverted algorithms generate inflected forms from a given root form.
- Unlikely generated forms can be filtered out during production.
- The Paice-Husk Stemmer features an externally stored set of stemming rules.
- Chris D. Paice developed a direct measure for comparing stemmers, based on counting overstemming and understemming errors.

Lemmatisation Algorithms
- Lemmatisation first determines the part of speech of a word.
- Different normalization rules are applied depending on the part of speech.
- Correct identification of the lexical category is crucial for accurate lemmatisation.
- Lemmatisation provides more accurate normalization than suffix stripping.
- Lemmatisation algorithms can modify the stem based on additional information.

Stochastic Algorithms and Other Techniques
- Stochastic algorithms use probability to identify the root form of a word.
- N-gram analysis uses the n-gram context of a word to determine the correct stem.
- Hybrid approaches combine two or more stemming techniques.
- Affix stemmers deal with both prefixes and suffixes.
- Matching algorithms use a stem database to identify stems.
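
The lookup-table and suffix-stripping approaches listed under Types of Stemming Algorithms can be combined in a few lines of code. The following is a minimal sketch only, not any published stemmer: the table entries, the rule list, the minimum stem length, and the names `LOOKUP_TABLE`, `SUFFIX_RULES`, `MIN_STEM_LENGTH`, `strip_suffix`, and `stem` are all assumptions made for illustration.

```python
# Toy illustration only: the table entries and suffix rules below are
# hypothetical and far smaller than those of any real stemmer.

# Lookup table for irregular forms that suffix rules cannot handle.
LOOKUP_TABLE = {
    "ran": "run",
    "geese": "goose",
    "better": "good",
}

# Ordered (suffix, replacement) rules, tried from top to bottom.
SUFFIX_RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ing", ""),
    ("ed", ""),
    ("ly", ""),
    ("s", ""),
]

# Assumed guard against overstemming very short words such as "is" or "bed".
MIN_STEM_LENGTH = 3


def strip_suffix(word: str) -> str:
    """Apply the first rule that matches and leaves a long enough stem."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            if len(candidate) >= MIN_STEM_LENGTH:
                return candidate
    # No rule applied: return the word unchanged.
    return word


def stem(word: str) -> str:
    """Try the lookup table first, then fall back to suffix stripping."""
    word = word.lower()
    if word in LOOKUP_TABLE:
        return LOOKUP_TABLE[word]
    return strip_suffix(word)
```

With these assumed rules, `stem("walking")` and `stem("walked")` both give `walk`, while `stem("ponies")` gives `poni`, which shows that a stem need not be a dictionary word. Irregular forms such as `geese` are resolved only because they appear in the lookup table.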