Glossary Term
Stemming
Introduction to Stemming
- Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base, or root form.
- Stemming is used in linguistic morphology and information retrieval.
- Stemming algorithms have been studied since the 1960s.
- Stemming helps search engines treat words with the same stem as synonyms, a form of query expansion known as conflation.
- The first published stemmer was written by Julie Beth Lovins in 1968.
Types of Stemming Algorithms
- Simple stemmers use a lookup table to map inflected forms to their stems.
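- The lookup approach may use preliminary part-of-speech tagging to avoid overstemming.
- Suffix-stripping algorithms find root forms by applying a small set of rules, such as removing a trailing "ing", "ed", or "ly", instead of relying on a lookup table.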
- Prefix stripping can also be implemented in some languages.
- Suffix-stripping algorithms may differ from one another in both results and performance.
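The lookup and suffix-stripping approaches listed above can be sketched together in a few lines of Python. This is a toy illustration, not any published stemmer: the table entries, the rule list, and the minimum stem length are all assumptions chosen for the example.

```python
# Toy stemmer: consult a lookup table first, then fall back to
# suffix-stripping rules. All entries below are illustrative assumptions.

LOOKUP = {
    "ran": "run",
    "geese": "goose",
}

# (suffix, replacement) pairs, tried in order.
SUFFIX_RULES = [
    ("ing", ""),
    ("ed", ""),
    ("ly", ""),
    ("s", ""),
]

MIN_STEM_LENGTH = 3  # skip a rule if the remaining stem would be shorter


def stem(word):
    """Return a stem for `word` using the table, then the suffix rules."""
    word = word.lower()
    if word in LOOKUP:
        return LOOKUP[word]
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            if len(candidate) >= MIN_STEM_LENGTH:
                return candidate
    return word


if __name__ == "__main__":
    for w in ["ran", "walking", "walked", "quickly", "cats", "sing"]:
        print(w, "->", stem(w))
```

The length check keeps short words such as "sing" intact. The sketch still returns "runn" for "running", which shows why practical suffix strippers need further rules, for example for doubled consonants.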
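Production Techniques for Stemming Algorithms
- The lookup table used by a stemmer can be produced semi-automatically.
- An inverted algorithm generates candidate inflected forms from a given root form, for example "running", "runs", "runned", and "runly" from "run".
- Some generated forms, such as "runned" and "runly", are valid constructions but unlikely to occur; the production technique can avoid generating them.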
- The Paice/Husk stemmer is an iterative stemmer that features an externally stored set of stemming rules.
- Chris D. Paice also developed a direct measure for comparing stemmers, based on counting over-stemming and under-stemming errors.
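Semi-automatic table production can be sketched as follows: generate candidate forms from each root, then invert them into a lookup table. The suffix list, the consonant-doubling step, and the idea of filtering against a set of attested words are assumptions made for this example; the names `generate_forms` and `build_lookup` are hypothetical.

```python
# Generate candidate inflected forms for each root and invert them into a
# lookup table (inflected form -> root). The suffix list is an assumption.

SUFFIXES = ["s", "ing", "ed", "ly"]
VOWELS = "aeiou"


def generate_forms(root):
    """Return candidate inflected forms for a non-empty root."""
    forms = []
    for suffix in SUFFIXES:
        forms.append(root + suffix)
        # Before a vowel-initial suffix, also try doubling a final
        # consonant ("run" -> "running").
        if suffix[0] in VOWELS and root[-1] not in VOWELS:
            forms.append(root + root[-1] + suffix)
    return forms


def build_lookup(roots, attested=None):
    """Map each generated form to its root.

    If `attested` (a set of words seen in real text) is given, forms
    outside it are dropped; this filter is an assumed way of avoiding
    unlikely forms, not a documented technique.
    """
    table = {}
    for root in roots:
        table[root] = root
        for form in generate_forms(root):
            if attested is None or form in attested:
                table[form] = root
    return table


if __name__ == "__main__":
    print(generate_forms("run"))
    print(build_lookup(["run"], attested={"runs", "running"}))
```

Without the `attested` filter the table also contains unlikely entries such as "runned" and "runly".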
Lemmatisation Algorithms
- Lemmatisation involves determining the part of speech of a word.
- Different normalization rules are applied based on the part of speech.
- Correct identification of the lexical category is crucial for accurate lemmatisation.
- Lemmatisation can normalize more accurately than suffix stripping, but a wrong or missing part-of-speech tag limits that benefit.
- Lemmatisation algorithms can modify the stem based on additional information.
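The dependence on part of speech can be illustrated with a toy lemmatiser in which the tag selects the rule set. The tag names ("NOUN", "VERB", "ADJ"), the exception entries, and the rules are assumptions for this sketch, and the tag is supplied by the caller instead of being predicted by a tagger.

```python
# Toy lemmatiser: the part of speech decides which rules apply.
# Tags, exceptions, and rules below are illustrative assumptions.

EXCEPTIONS = {
    ("better", "ADJ"): "good",
    ("went", "VERB"): "go",
    ("mice", "NOUN"): "mouse",
}

RULES = {
    "NOUN": [("ies", "y"), ("s", "")],
    "VERB": [("ing", ""), ("ed", ""), ("s", "")],
    "ADJ": [("est", ""), ("er", "")],
}

MIN_STEM_LENGTH = 3  # do not strip if fewer characters would remain


def lemmatise(word, pos):
    """Return a lemma for `word`, given its part-of-speech tag `pos`."""
    word = word.lower()
    if (word, pos) in EXCEPTIONS:
        return EXCEPTIONS[(word, pos)]
    for suffix, replacement in RULES.get(pos, []):
        remaining = len(word) - len(suffix)
        if word.endswith(suffix) and remaining >= MIN_STEM_LENGTH:
            return word[:remaining] + replacement
    return word


if __name__ == "__main__":
    print(lemmatise("meeting", "VERB"))
    print(lemmatise("meeting", "NOUN"))
    print(lemmatise("better", "ADJ"))
```

The same surface form "meeting" is reduced to "meet" when tagged as a verb and left unchanged when tagged as a noun, which is why a wrong tag yields a wrong lemma.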
Stochastic Algorithms and Other Techniques
- Stochastic algorithms use probability to identify the root form of a word, typically with a model trained on a table of root-to-inflected-form relations.
- n-gram analysis uses the n-gram context of a word to choose the correct stem.
- Hybrid approaches combine two or more stemming techniques.
- Affix stemmers deal with both prefixes and suffixes.
- Matching algorithms use a stem database to identify stems.
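Of the techniques above, the matching approach is the simplest to sketch: look for a stored stem that begins the word, subject to a constraint on how much of the word the stem must cover. The database contents, the 0.35 threshold, and the function name `match_stem` are assumptions chosen for illustration.

```python
# Toy matching stemmer: pick the longest stored stem that is a prefix of
# the word and covers enough of it. Database and threshold are assumptions.

STEM_DATABASE = {"be", "brows", "comput", "run"}
MIN_RATIO = 0.35  # minimum len(stem) / len(word); an arbitrary choice


def match_stem(word, stems=STEM_DATABASE, min_ratio=MIN_RATIO):
    """Return the best matching stem, or the word itself if none fits."""
    word = word.lower()
    best = None
    for stem in stems:
        if not word.startswith(stem):
            continue
        # Relative-length constraint: reject stems that are too short
        # compared with the whole word.
        if len(stem) / len(word) < min_ratio:
            continue
        if best is None or len(stem) > len(best):
            best = stem
    return best if best is not None else word


if __name__ == "__main__":
    for w in ["being", "beside", "computing", "browsing"]:
        print(w, "->", match_stem(w))
```

With this threshold, "be" is accepted as the stem of "been" and "being" but rejected for "beside", where it covers only a third of the word.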