Design and pretraining
– BERT is an encoder-only transformer architecture.
– BERT consists of three modules: an embedding module, a stack of transformer encoder layers, and an un-embedding module.
– The un-embedding module is necessary for pretraining but often unnecessary for downstream tasks.
– BERT uses the WordPiece tokenizer, which splits English text into subword units and maps them to integer token IDs.
– BERT’s vocabulary size is 30,000.
– BERT was pre-trained on language modeling and next sentence prediction tasks.
– Language modeling involved predicting a randomly selected subset of input tokens from their surrounding context (see the sketch after this list).
– Before prediction, most of the selected tokens were replaced with a [MASK] token, some with a random token, and the rest were left unchanged; the model was trained to recover the original tokens.
– Next sentence prediction involved classifying whether two text spans appeared consecutively in the training corpus.
– BERT learns latent representations of words and sentences in context during pre-training.
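As a concrete illustration of WordPiece tokenization and the masked-token objective described above, the sketch below loads a pre-trained BERT checkpoint and predicts a masked word from its bidirectional context. The Hugging Face transformers library and the bert-base-uncased checkpoint name are assumptions made for illustration, not part of the original pre-training setup.

```python
# Minimal sketch: WordPiece tokenization and masked-token prediction with a
# pre-trained BERT checkpoint (assumes the Hugging Face `transformers` library).
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

text = "The capital of France is [MASK]."
inputs = tokenizer(text, return_tensors="pt")  # WordPiece tokens -> integer IDs
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist()))

with torch.no_grad():
    logits = model(**inputs).logits            # shape: (1, seq_len, vocab_size)

# Locate the [MASK] position and take the highest-scoring vocabulary entry.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
predicted_id = logits[0, mask_pos].argmax().item()
print(tokenizer.convert_ids_to_tokens([predicted_id]))  # typically ['paris']
```

The head that maps the final hidden state back to the vocabulary here plays the role of the un-embedding module mentioned above, which is why it is needed for pre-training but can be dropped for many downstream tasks.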
Architecture details
– BERT comes in two sizes: BERT-BASE and BERT-LARGE.
– The lowest layer of BERT is the embedding layer, which contains word_embeddings, position_embeddings, and token_type_embeddings.
– word_embeddings converts input tokens into vectors.
– position_embeddings adds a learned absolute position embedding for each token position.
– token_type_embeddings distinguishes tokens of the first segment (before [SEP]) from tokens of the second segment (after it); a sketch of the full embedding layer follows this list.
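A minimal PyTorch sketch of this embedding layer is shown below. The attribute names mirror the bullets above; the specific dimensions (a 30,522-entry vocabulary in the released uncased checkpoint, hidden size 768, maximum length 512) are those of the BASE model and are stated here as assumptions rather than taken from this article.

```python
# Sketch of BERT's embedding layer: the three embeddings are summed, then
# layer-normalized. Dimensions follow the released BASE checkpoint (assumed).
import torch
import torch.nn as nn

class BertEmbeddings(nn.Module):
    def __init__(self, vocab_size=30522, hidden_size=768,
                 max_position=512, type_vocab_size=2):
        super().__init__()
        self.word_embeddings = nn.Embedding(vocab_size, hidden_size)
        self.position_embeddings = nn.Embedding(max_position, hidden_size)
        self.token_type_embeddings = nn.Embedding(type_vocab_size, hidden_size)
        self.LayerNorm = nn.LayerNorm(hidden_size, eps=1e-12)
        self.dropout = nn.Dropout(0.1)

    def forward(self, input_ids, token_type_ids):
        # Absolute positions 0..seq_len-1, shared across the batch.
        positions = torch.arange(input_ids.size(1), device=input_ids.device)
        x = (self.word_embeddings(input_ids)
             + self.position_embeddings(positions)          # absolute positions
             + self.token_type_embeddings(token_type_ids))  # segment A vs. B
        return self.dropout(self.LayerNorm(x))
```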
Performance
– BERT achieved state-of-the-art performance on the GLUE task set, SQuAD, and SWAG (see the fine-tuning sketch after this list).
– GLUE (General Language Understanding Evaluation) is a benchmark suite of nine natural language understanding tasks.
– SQuAD is the Stanford Question Answering Dataset, a reading-comprehension benchmark.
– SWAG (Situations With Adversarial Generations) is a commonsense inference benchmark.
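These state-of-the-art results were obtained by fine-tuning the pre-trained model separately on each downstream task. The sketch below illustrates a single fine-tuning step for a GLUE-style sentence-pair classification task; the classification head, example sentences, label, learning rate, and the Hugging Face transformers API are illustrative assumptions, not the exact recipe used in the original evaluation.

```python
# Sketch: one fine-tuning step of BERT on a GLUE-style sentence-pair task
# (e.g. paraphrase detection). Assumes the Hugging Face `transformers` library.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)   # adds a fresh classification head

# Sentence pairs are packed as "[CLS] sentence A [SEP] sentence B [SEP]".
batch = tokenizer(
    ["The company bought the startup."],
    ["The startup was acquired by the company."],
    padding=True, return_tensors="pt")
labels = torch.tensor([1])               # hypothetical label: 1 = paraphrase

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss = model(**batch, labels=labels).loss  # cross-entropy on the [CLS] output
loss.backward()
optimizer.step()
```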
Analysis
– The reasons for BERT’s state-of-the-art performance are not well understood.
– Current research focuses on analyzing BERT’s outputs, internal vector representations, and attention weights (see the sketch after this list).
– BERT’s bidirectional training contributes to its high performance.
– BERT gains a deep understanding of context by attending to the words on both the left and the right of each token.
– BERT’s encoder-only architecture limits its ability to generate text.
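One common starting point for this kind of analysis is to export the per-layer hidden states and attention maps. The sketch below does so through the Hugging Face transformers API; the library and the bert-base-uncased checkpoint name are assumptions made for illustration.

```python
# Sketch: extract BERT's internal representations and attention weights for
# analysis (assumes the Hugging Face `transformers` library).
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased",
                                  output_hidden_states=True,
                                  output_attentions=True)

inputs = tokenizer("BERT reads context in both directions.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(len(outputs.hidden_states))   # 13: embedding output + one per encoder layer
print(outputs.attentions[0].shape)  # (batch, heads, seq_len, seq_len) for layer 1
```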
Recognition
– The research paper describing BERT won the Best Long Paper Award at the 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).
Overview
Bidirectional Encoder Representations from Transformers (BERT) is a language model based on the transformer architecture, notable for its dramatic improvement over previous state-of-the-art models. It was introduced in October 2018 by researchers at Google. A 2020 literature survey concluded that "in a little over a year, BERT has become a ubiquitous baseline in Natural Language Processing (NLP) experiments counting over 150 research publications analyzing and improving the model."
BERT was originally implemented in the English language at two model sizes: (1) BERT-BASE: 12 encoder layers with 12 bidirectional self-attention heads each, totaling 110 million parameters, and (2) BERT-LARGE: 24 encoder layers with 16 bidirectional self-attention heads each, totaling 340 million parameters. Both models were pre-trained on the Toronto BookCorpus (800M words) and English Wikipedia (2,500M words).
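These sizes can be sanity-checked by counting the parameters of the publicly released checkpoints. The snippet below, which assumes the Hugging Face transformers library and the bert-base-uncased / bert-large-uncased checkpoint names, prints counts close to the 110 million and 340 million figures quoted above.

```python
# Sketch: count parameters of the two released BERT sizes
# (assumes the Hugging Face `transformers` library and public checkpoint names).
from transformers import AutoModel

for name in ("bert-base-uncased", "bert-large-uncased"):
    model = AutoModel.from_pretrained(name)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.0f}M parameters")
```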