Definition and Types of N-grams
– An n-gram is a sequence of n adjacent symbols in a particular order (a short code sketch follows this list).
– The symbols can be adjacent letters, syllables, or whole words in a language dataset.
– N-grams can also be adjacent phonemes extracted from a speech-recording dataset or adjacent base pairs from a genome.
– Latin numerical prefixes are used to name n-grams of different sizes, such as unigram, bigram, trigram, etc.
– Greek numerical prefixes or English cardinal numbers are used in computational biology for polymers or oligomers of a known size, called k-mers.
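As a rough sketch of this sliding-window idea (the function name `ngrams` and the example sentence are illustrative assumptions, not taken from any cited source):

```python
def ngrams(symbols, n):
    """Return all n-grams (as tuples) over a sequence of symbols."""
    return [tuple(symbols[i:i + n]) for i in range(len(symbols) - n + 1)]

# Word-level bigrams from a tokenized sentence (illustrative input).
words = "to be or not to be".split()
print(ngrams(words, 2))    # [('to', 'be'), ('be', 'or'), ('or', 'not'), ...]

# Character-level trigrams over the raw string, blanks included.
print(ngrams("to be", 3))  # [('t', 'o', ' '), ('o', ' ', 'b'), (' ', 'b', 'e')]
```

The same function covers letters, words, phonemes, or any other symbol sequence, since it relies only on slicing.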
N-grams in Language Models
– N-grams allow bag-of-words models to capture information, such as word order, that plain bag-of-words features discard.
– In NLP, they capture local context: which words occur next to each other, and in what order.
– N-grams are the basis of classic statistical language models, which assign probabilities to word sequences and can generate locally coherent, contextually relevant text (a toy bigram model is sketched after this list).
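To make the language-model bullet concrete, here is a hedged sketch of a maximum-likelihood bigram model; the toy corpus, the `<s>`/`</s>` boundary markers, and all names are invented for illustration:

```python
from collections import Counter, defaultdict

# Toy corpus (an illustrative assumption, not a real dataset).
corpus = ["i like cats", "i like dogs", "i see cats"]

# Count how often each word follows each preceding word.
bigram_counts = defaultdict(Counter)
for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    for prev, nxt in zip(tokens, tokens[1:]):
        bigram_counts[prev][nxt] += 1

def p_next(prev, nxt):
    """P(nxt | prev) by relative frequency; no smoothing, so unseen pairs get 0."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][nxt] / total if total else 0.0

print(p_next("i", "like"))     # 2/3: 'like' follows 'i' in two of three sentences
print(p_next("like", "cats"))  # 1/2: after 'like', 'cats' and 'dogs' are equally likely
```

A practical model would add smoothing (e.g. add-one or Kneser-Ney) so that unseen bigrams do not receive zero probability.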
Examples of N-grams
– Example sequences and their corresponding 1-gram, 2-gram, and 3-gram sequences are shown in Figure 1.
– Word-level examples from the Google N-gram corpus include the 3-gram ‘ceramics collectables collectibles’ (frequency 55) and the 4-gram ‘serve as the independent’ (frequency 794).
– N-grams provide insight into word combinations and their frequencies (a small counting sketch follows this list).
– They can be used to analyze patterns and trends in large text corpora.
– Co-occurrence counts over n-grams give indirect evidence about how words relate in meaning.
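A small counting sketch in the spirit of these bullets; the tiny text below merely stands in for a large corpus such as the Google N-gram corpus:

```python
from collections import Counter

# Illustrative stand-in text; real corpora contain billions of tokens.
text = "ceramics collectables collectibles ceramics collectables fine ceramics"
tokens = text.split()

# Count word-level trigrams by zipping the token list against shifted copies.
trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))
for gram, count in trigram_counts.most_common(3):
    print(" ".join(gram), count)
```

Entries such as ‘ceramics collectables collectibles’ (55) in the Google corpus are exactly this kind of (n-gram, frequency) pair, computed at web scale.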
Further Reading and External Links
– ‘Foundations of Statistical Natural Language Processing’ by Christopher D. Manning and Hinrich Schütze.
– ‘A quality control algorithm for DNA sequencing projects’ by Owen White, et al.
– ‘Markov Models and Linguistic Theory’ by Frederick J. Damerau.
– ‘Contextual Language Models For Ranking Answers To Natural Language Definition Questions’ by Alejandro Figueroa and John Atkinson.
– ‘Authorship Verification for Short Messages Using Stylometry’ by Marcelo Luiz Brocardo, et al.
– Google Ngram Viewer: A tool to explore n-gram frequencies in Google Books.
– Ngram Extractor: A tool that provides the weight of n-grams based on their frequency.
– Google's Google Books Ngram Viewer and Web n-grams database (September 2006): A database to explore n-gram frequencies.
– STATOPERATOR N-grams Project Weighted: A project that assigns weights to n-grams based on their frequency.
An n-gram is a sequence of n adjacent symbols in a particular order. The symbols may be n adjacent letters (including punctuation marks and blanks), syllables, or, rarely, whole words found in a language dataset; or adjacent phonemes extracted from a speech-recording dataset; or adjacent base pairs extracted from a genome. They are collected from a text corpus or speech corpus. Using Latin numerical prefixes, an n-gram of size 1 is called a "unigram", one of size 2 a "bigram" (or, less commonly, a "digram"), and so on; with English cardinal numbers instead, larger sizes are called "four-gram", "five-gram", etc. Similarly, Greek numerical prefixes ("monomer", "dimer", "trimer", "tetramer", "pentamer") or English cardinal numbers ("one-mer", "two-mer", "three-mer") are used in computational biology for polymers or oligomers of a known size, called k-mers. When the items are words, n-grams may also be called shingles.
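The computational-biology usage mentioned above amounts to the same sliding window over a string of base pairs; a minimal sketch, where the sequence "GATTACA" is an invented example input:

```python
def kmers(seq, k):
    """Return every length-k substring (k-mer) of seq, in order of appearance."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

print(kmers("GATTACA", 3))  # ['GAT', 'ATT', 'TTA', 'TAC', 'ACA']
```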

In the context of NLP, n-grams allow bag-of-words models to capture information such as word order, which would not be possible in the traditional bag-of-words setting.
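A quick way to see this, sketched with invented example sentences: the two word orders below have identical bags of words but different bigram sets:

```python
from collections import Counter

a = "man bites dog".split()
b = "dog bites man".split()

# Identical unigram (bag-of-words) counts: word order is lost.
print(Counter(a) == Counter(b))  # True

# Different bigram counts: word order is (partially) recovered.
print(Counter(zip(a, a[1:])) == Counter(zip(b, b[1:])))  # False
```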