Glossary Term

BERT (language model)

Design and Pretraining

- BERT is an encoder-only transformer architecture.
- BERT consists of three modules: embedding, a stack of encoders, and un-embedding.
- The un-embedding module is needed for pretraining but is often unnecessary for downstream tasks.
- BERT uses WordPiece to convert English words into integer codes (tokenizer sketch below).
- BERT's vocabulary size is 30,000.
- BERT was pre-trained on two tasks: language modeling and next sentence prediction.
- Language modeling involved predicting selected tokens given their context; the selected tokens were replaced with [MASK] tokens or random word tokens before prediction (masking sketch below).
- Next sentence prediction involved determining whether two spans appeared consecutively in the training corpus.
- During pre-training, BERT learns latent representations of words and sentences in context.

Architecture details

- BERT was released in two sizes: BASE and LARGE (configuration summary below).
- The lowest layer of BERT is the embedding layer, which contains word_embeddings, position_embeddings, and token_type_embeddings.
- word_embeddings converts input tokens into vectors.
- position_embeddings performs absolute position embedding.
- token_type_embeddings distinguishes tokens appearing before and after the [SEP] token (embedding-layer sketch below).

Performance

- BERT achieved state-of-the-art performance on the GLUE task set, SQuAD, and SWAG.
- GLUE (General Language Understanding Evaluation) is a benchmark consisting of 9 tasks.
- SQuAD is the Stanford Question Answering Dataset.
- SWAG is Situations With Adversarial Generations.

Analysis

- The reasons for BERT's state-of-the-art performance are not yet well understood.
- Current research focuses on analyzing BERT's output, its internal vector representations, and its attention weights.
- BERT's bidirectional training contributes to its high performance: the model gains a deep understanding of context by attending to words on both the left and the right of each token.
- BERT's encoder-only architecture limits its ability to generate text.

Recognition

- The research paper describing BERT won the Best Long Paper Award at the 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).
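
To make the WordPiece step concrete, the sketch below loads a pretrained BERT tokenizer via the Hugging Face transformers library and converts a sentence into integer codes. The bert-base-uncased checkpoint and the example sentence are illustrative choices, not part of the glossary definition above.

```python
# Sketch: WordPiece tokenization with a pretrained BERT tokenizer.
# Assumes the Hugging Face `transformers` package and the public
# bert-base-uncased checkpoint (whose vocabulary has 30,522 entries).
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

text = "BERT converts English words into integer codes."
tokens = tokenizer.tokenize(text)               # WordPiece subword tokens
ids = tokenizer.convert_tokens_to_ids(tokens)   # integer codes from the vocabulary

print(tokens)                 # e.g. ['bert', 'converts', 'english', 'words', ...]
print(ids)                    # corresponding vocabulary indices
print(tokenizer.vocab_size)   # 30522
```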
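
The masked-language-modeling objective described under Design and Pretraining can be sketched in plain Python. The 15% selection rate and the 80/10/10 split between [MASK], random-token, and unchanged replacements follow the original BERT paper; the function name, the -100 ignore label, and the example ids are illustrative assumptions.

```python
import random

def mask_for_mlm(token_ids, vocab_size, mask_id, select_prob=0.15):
    """Illustrative masking for BERT-style masked language modeling.

    Each token is selected for prediction with probability `select_prob`.
    A selected token is replaced by [MASK] 80% of the time, by a random
    vocabulary token 10% of the time, and left unchanged 10% of the time.
    The original ids are kept as labels; -100 marks unselected positions
    (a common "ignore in the loss" convention, assumed here).
    """
    inputs, labels = list(token_ids), []
    for i, tok in enumerate(token_ids):
        if random.random() < select_prob:
            labels.append(tok)                            # model must recover the original token
            r = random.random()
            if r < 0.8:
                inputs[i] = mask_id                       # replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.randrange(vocab_size)  # replace with a random token
            # else: keep the original token unchanged
        else:
            labels.append(-100)                           # not selected; excluded from the loss
    return inputs, labels

# Example with made-up token ids; 103 is the [MASK] id in bert-base-uncased.
masked, targets = mask_for_mlm([2023, 2003, 1037, 7099], vocab_size=30522, mask_id=103)
```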
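
For the two released sizes mentioned under Architecture details, the commonly cited configurations are summarized below; the dictionary name is arbitrary and the parameter counts are approximate figures from the original paper.

```python
# Approximate configurations of the two released BERT sizes
# (layer counts, hidden sizes, and head counts as reported in the BERT paper).
BERT_CONFIGS = {
    "BASE":  {"layers": 12, "hidden_size": 768,  "attention_heads": 12, "parameters": "~110M"},
    "LARGE": {"layers": 24, "hidden_size": 1024, "attention_heads": 16, "parameters": "~340M"},
}
```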
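
Finally, the embedding layer described under Architecture details sums its three components at each position. The PyTorch sketch below shows that structure with BERT-BASE-like dimensions; the class name is made up, and the real implementation also applies layer normalization and dropout, omitted here for brevity.

```python
import torch
import torch.nn as nn

class BertLikeEmbeddings(nn.Module):
    """Simplified sketch of BERT's embedding layer: word, position, and
    token-type embeddings are summed elementwise per position."""
    def __init__(self, vocab_size=30522, hidden=768, max_pos=512, type_vocab=2):
        super().__init__()
        self.word_embeddings = nn.Embedding(vocab_size, hidden)        # token id -> vector
        self.position_embeddings = nn.Embedding(max_pos, hidden)       # absolute position -> vector
        self.token_type_embeddings = nn.Embedding(type_vocab, hidden)  # before/after [SEP] -> vector

    def forward(self, input_ids, token_type_ids):
        positions = torch.arange(input_ids.size(1), device=input_ids.device)
        return (self.word_embeddings(input_ids)
                + self.position_embeddings(positions)      # broadcast over the batch
                + self.token_type_embeddings(token_type_ids))

# Example: one sequence of four token ids, all in segment 0.
emb = BertLikeEmbeddings()
out = emb(torch.tensor([[101, 7592, 2088, 102]]), torch.zeros(1, 4, dtype=torch.long))
print(out.shape)  # torch.Size([1, 4, 768])
```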