n-gram

Definition and Types of N-grams
– An n-gram is a sequence of n adjacent symbols in a particular order.
– The symbols can be adjacent letters, syllables, or whole words in a language dataset.
– N-grams can also be adjacent phonemes extracted from a speech-recording dataset or adjacent base pairs from a genome.
– Latin numerical prefixes are used to name n-grams of different sizes: a unigram has size 1, a bigram size 2, a trigram size 3, and so on.
– In computational biology, polymers or oligomers of a known size are called k-mers and are named with Greek numerical prefixes (monomer, dimer, trimer) or English cardinal numbers (one-mer, two-mer, three-mer).
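
The definition above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name ngrams and the sample inputs are assumptions chosen for the example. It slides a window of size n over any sequence of symbols, so the same code handles words, letters, or base pairs.

```python
def ngrams(symbols, n):
    """Return every run of n adjacent symbols, in order, as a list of tuples."""
    if n < 1:
        raise ValueError("n must be at least 1")
    symbols = list(symbols)
    # A sequence of length L contains L - n + 1 windows of size n (none if L < n).
    return [tuple(symbols[i:i + n]) for i in range(len(symbols) - n + 1)]


# Word-level bigrams from a tokenized sentence (illustrative input).
word_bigrams = ngrams("to be or not to be".split(), 2)

# Character-level trigrams (3-mers) from a short, made-up base-pair string.
char_trigrams = ngrams("GATTACA", 3)
```

Because the function only relies on slicing, the choice of symbol (letter, syllable, word, phoneme, base pair) is left entirely to the caller.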

N-grams in Language Models
– A traditional bag-of-words model records which words occur and how often, but discards word order.
– Using n-grams instead of single words as features lets such a model capture local word order, because each n-gram records n words that appear next to each other.
– In NLP, this gives each word a limited context: the other words adjacent to it within the n-gram.
– N-gram language models estimate the probability of the next word from the n - 1 words that precede it, which helps generated text stay locally coherent and relevant to its context.
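
The difference between a plain bag of words and a bag of n-grams can be seen in a small sketch. The helper names bag_of_ngrams and bigram_probabilities, the whitespace tokenization, and the example sentences are all assumptions made for illustration. The second function uses the standard maximum-likelihood estimate, the count of a word pair divided by the count of its first word, with no smoothing.

```python
from collections import Counter


def bag_of_ngrams(text, n):
    """Count the word-level n-grams in a text (naive whitespace tokenization)."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bigram_probabilities(text):
    """Estimate P(next word | previous word) as count(pair) / count(previous word)."""
    tokens = text.lower().split()
    pair_counts = Counter(zip(tokens, tokens[1:]))
    # Only words that are followed by another word can act as a context.
    context_counts = Counter(tokens[:-1])
    return {pair: count / context_counts[pair[0]] for pair, count in pair_counts.items()}


a = "dog bites man"
b = "man bites dog"

# The two sentences contain the same words, so their unigram bags are identical...
same_unigrams = bag_of_ngrams(a, 1) == bag_of_ngrams(b, 1)
# ...but their bigram bags differ, which is the word-order information n-grams add.
same_bigrams = bag_of_ngrams(a, 2) == bag_of_ngrams(b, 2)

probs = bigram_probabilities("the cat sat on the mat")
```

In this toy text "the" is followed once by "cat" and once by "mat", so each continuation gets an estimated probability of 0.5. Practical models typically add smoothing so that unseen n-grams do not receive probability zero.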

Examples of N-grams
– Example sequences and their corresponding 1-gram, 2-gram, and 3-gram sequences are shown in Figure 1.
– Word-level examples from the Google N-gram corpus include the 3-gram "ceramics collectables collectibles" (55) and the 4-gram "serve as the independent" (794); the number in parentheses is the count recorded for that n-gram.
– N-grams provide insights into word combinations and their frequencies.
– They can be used to analyze patterns and trends in large text corpora.
– Counting which words co-occur within n-grams shows which words tend to be used together, which gives some evidence about how they are related.
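
Frequency tables like the ones described above can be built by counting n-grams across a collection of sentences. The following is a rough sketch under stated assumptions: the function name ngram_frequencies is hypothetical, and the three-sentence corpus is made up for illustration; it is not drawn from the Google N-gram corpus.

```python
from collections import Counter


def ngram_frequencies(sentences, n):
    """Count word-level n-grams across sentences without crossing sentence boundaries."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        counts.update(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts


# Made-up toy corpus, for illustration only.
corpus = [
    "serve as the independent variable",
    "serve as the independent auditor",
    "serve as the incubator",
]

four_gram_counts = ngram_frequencies(corpus, 4)
top = four_gram_counts.most_common(1)
```

Here the most frequent 4-gram is "serve as the independent", which occurs twice. On a large corpus the same tally, sorted by count, yields the kind of ranked word-combination list used to study patterns and trends.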

