Glossary Term

n-gram

Definition and Types of N-grams
- An n-gram is a sequence of n adjacent symbols in a particular order.
- The symbols can be letters, syllables, or whole words in a language dataset.
- They can also be adjacent phonemes extracted from a speech-recording dataset, or adjacent base pairs from a genome.
- Latin numerical prefixes name n-grams of different sizes: unigram (n = 1), bigram (n = 2), trigram (n = 3), and so on.
- In computational biology, Greek numerical prefixes or English cardinal numbers are used for polymers or oligomers of a known size, called k-mers.

N-grams in Language Models
- Traditional bag-of-words models ignore word order. Using n-grams as features lets such a model recover some of that order, because each n-gram records which words appear next to each other.
- In NLP, n-grams are used to capture contextual information: the sequence of words and the relationships between neighbouring words.
- N-gram statistics are a basic building block for language models that generate coherent and contextually relevant text.

Examples of N-grams
- Figure 1 shows example sequences and their corresponding 1-gram, 2-gram, and 3-gram sequences.
- Word-level examples from the Google N-gram corpus include the 3-gram "ceramics collectables collectibles" (55) and the 4-gram "serve as the independent" (794). The number in parentheses is how many times the n-gram appeared.
- N-gram frequencies show which word combinations occur and how often, which makes them useful for analyzing patterns and trends in large text corpora.
- They also help in understanding the co-occurrence of words and their semantic relationships.

References
- Andrei Z. Broder, et al. 'Syntactic clustering of the web.' Computer Networks and ISDN Systems 29.8 (1997): 1157–1166.
- Alex Franz and Thorsten Brants. 'All Our N-gram are Belong to You.' Google Research Blog. Archived from the original on 17 October 2006. Retrieved 16 December 2011.
- Christopher D. Manning and Hinrich Schütze. 'Foundations of Statistical Natural Language Processing.' MIT Press, 1999.
- Owen White, et al. 'A quality control algorithm for DNA sequencing projects.' Nucleic Acids Research 21.16 (1993): 3829–3838.
- Frederick J. Damerau. 'Markov Models and Linguistic Theory.' Mouton, The Hague, 1971.

Further Reading and External Links
- 'Contextual Language Models For Ranking Answers To Natural Language Definition Questions' by Alejandro Figueroa and John Atkinson.
- 'Authorship Verification for Short Messages Using Stylometry' by Marcelo Luiz Brocardo, et al.
- Google Ngram Viewer: a tool to explore n-gram frequencies in Google Books.
- Ngram Extractor: a tool that provides the weight of n-grams based on their frequency.
- Google's Google Books n-gram viewer and Web n-grams database (September 2006): a database to explore n-gram frequencies.
- STATOPERATOR N-grams Project Weighted: a project that assigns weights to n-grams based on their frequency.
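
Worked example
The definition and the word-order point made earlier on this page can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the helper name `ngrams`, the toy sentences, and the use of `collections.Counter` as a "bag" are choices made here and do not come from any of the tools or references listed on this page.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def ngrams(symbols: Sequence, n: int) -> List[Tuple]:
    """Return every run of n adjacent symbols, in their original order.

    `ngrams` is a hypothetical helper written for this example.
    A sequence shorter than n yields an empty list.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return [tuple(symbols[i:i + n]) for i in range(len(symbols) - n + 1)]


# Word-level n-grams from a toy sentence (illustrative text only).
words = "to be or not to be".split()
unigrams = ngrams(words, 1)
bigrams = ngrams(words, 2)
trigrams = ngrams(words, 3)

# Character-level bigrams: the symbols are letters instead of words.
char_bigrams = ngrams("gram", 2)

# Counting n-grams gives their frequencies in the text.
bigram_counts = Counter(bigrams)

# Two sentences with the same words in a different order.
a = "dog bites man".split()
b = "man bites dog".split()

# A bag of unigrams cannot tell the two sentences apart ...
same_unigram_bag = Counter(ngrams(a, 1)) == Counter(ngrams(b, 1))
# ... but a bag of bigrams can, because each bigram keeps local order.
same_bigram_bag = Counter(ngrams(a, 2)) == Counter(ngrams(b, 2))

print(bigrams)
print(bigram_counts.most_common(1))
print(same_unigram_bag, same_bigram_bag)
```

The same function works on any sequence type, so phonemes or base pairs could be passed in as a list or string in the same way. Storing each n-gram as a tuple keeps it hashable, which is what allows `Counter` to tally frequencies.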