Glossary Term
n-gram
Definition and Types of N-grams
- An n-gram is a sequence of adjacent symbols in a particular order.
- The symbols can be adjacent letters, syllables, or whole words in a language dataset.
- N-grams can also be adjacent phonemes extracted from a speech-recording dataset or adjacent base pairs from a genome.
- Latin numerical prefixes are used to name n-grams of different sizes: unigram (n = 1), bigram (n = 2), trigram (n = 3), and so on.
- In computational biology, polymers or oligomers of a known size are called k-mers and are named with Greek numerical prefixes (monomer, dimer, trimer) or English cardinal numbers (one-mer, two-mer, three-mer).
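
The definition above can be illustrated with a minimal sketch. The helper name ngrams and the sample inputs are assumptions made for this example and do not come from any particular library; the same function is applied to words, letters, and base pairs.

```python
from typing import List, Sequence, Tuple


def ngrams(symbols: Sequence, n: int) -> List[Tuple]:
    """Return every run of n adjacent symbols, in their original order."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A sequence of length L contains L - n + 1 n-grams (none if L < n).
    return [tuple(symbols[i:i + n]) for i in range(len(symbols) - n + 1)]


# Word-level bigrams (hypothetical sample sentence).
words = "to be or not to be".split()
word_bigrams = ngrams(words, 2)

# Letter-level trigrams: a string is a sequence of characters.
char_trigrams = ngrams("ngram", 3)

# Base-pair 3-mers from a made-up DNA fragment.
dna_3mers = ngrams("ATGCA", 3)

print(word_bigrams)
print(char_trigrams)
print(dna_3mers)
```

Because the function only relies on slicing, it works on any sequence type, which mirrors the point that the symbols may be letters, words, phonemes, or base pairs.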
N-grams in Language Models
- A traditional bag-of-words model ignores word order; counting n-grams instead of single words lets it capture local word order.
- In NLP, n-grams provide local context: each n-gram records which words occur next to each other and in what sequence.
- An n-gram language model estimates the probability of a word from the n − 1 words that precede it, which helps it generate locally coherent text.
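
As a rough sketch of both ideas, the example below first shows that two sentences with the same words but different order share a bag of words yet differ in their bag of bigrams, and then estimates next-word probabilities from bigram counts by simple relative frequency. The function names bag and next_word_probs and the toy sentences are assumptions for illustration only; practical models usually add smoothing, which is omitted here.

```python
from collections import Counter
from typing import Dict, List


def bag(tokens: List[str], n: int) -> Counter:
    """Count the n-grams in a token list (n = 1 gives a plain bag of words)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


a = "dog bites man".split()
b = "man bites dog".split()

same_unigrams = bag(a, 1) == bag(b, 1)  # word order is invisible to unigrams
same_bigrams = bag(a, 2) == bag(b, 2)   # bigrams tell the sentences apart


def next_word_probs(tokens: List[str], prev: str) -> Dict[str, float]:
    """Relative-frequency estimate of P(next word | prev) from bigram counts."""
    following = Counter(nxt for cur, nxt in zip(tokens, tokens[1:]) if cur == prev)
    total = sum(following.values())
    if total == 0:
        return {}  # prev never appears with a successor in this corpus
    return {word: count / total for word, count in following.items()}


# Hypothetical toy corpus, not real data.
corpus = "the cat sat on the mat the cat ran".split()
probs = next_word_probs(corpus, "the")

print(same_unigrams, same_bigrams)
print(probs)
```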
Examples of N-grams
- Example sequences and their corresponding 1-gram, 2-gram, and 3-gram sequences are shown in Figure 1.
- Word-level examples from the Google N-gram corpus, with their frequencies in parentheses, include the 3-gram "ceramics collectables collectibles" (55) and the 4-gram "serve as the independent" (794).
- N-gram counts show which word combinations occur in a corpus and how often.
- Comparing these counts across a large text corpus can reveal patterns and trends in usage.
- Words that frequently co-occur in the same n-grams are often semantically related, so n-gram statistics can hint at such relationships.
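
Frequency counts like those quoted above can be produced along the following lines. This is only a sketch: the function name ngram_counts and the three sample documents are invented for illustration and are not taken from the Google corpus, and tokenization is reduced to lowercasing and splitting on whitespace.

```python
from collections import Counter
from typing import Iterable


def ngram_counts(documents: Iterable[str], n: int) -> Counter:
    """Count word-level n-grams across several documents."""
    counts: Counter = Counter()
    for doc in documents:
        tokens = doc.lower().split()
        # Zipping n shifted copies of the token list yields each n-gram as a tuple.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts


# Made-up documents for the example.
docs = [
    "serve as the independent variable",
    "serve as the independent body",
    "serve as the index",
]

four_grams = ngram_counts(docs, 4)
top = four_grams.most_common(1)  # the most frequent 4-gram and its count

print(top)
```

Counting per document rather than over one concatenated string avoids creating n-grams that span document boundaries.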
Further Reading and External Links
- 'Foundations of Statistical Natural Language Processing' by Christopher D. Manning and Hinrich Schütze.
- 'A quality control algorithm for DNA sequencing projects' by Owen White, et al.
- 'Markov Models and Linguistic Theory' by Frederick J. Damerau.
- 'Contextual Language Models For Ranking Answers To Natural Language Definition Questions' by Alejandro Figueroa and John Atkinson.
- 'Authorship Verification for Short Messages Using Stylometry' by Marcelo Luiz Brocardo, et al.
- Google Ngram Viewer: A tool to explore n-gram frequencies in Google Books.
- Ngram Extractor: A tool that provides the weight of n-grams based on their frequency.
- Google's Google Books n-gram viewer and Web n-grams database (September 2006): A database to explore n-gram frequencies.
- STATOPERATOR N-grams Project Weighted: A project that assigns weights to n-grams based on their frequency.