Glossary Term
BERT (language model)
Design and Pretraining
- BERT is an encoder-only transformer architecture.
- BERT consists of three modules: embedding, a stack of encoders, and un-embedding.
- The un-embedding module is necessary for pretraining but often unnecessary for downstream tasks.
- BERT uses WordPiece tokenization to split English text into subword tokens, each mapped to an integer code.
- BERT's vocabulary size is 30,000.
- BERT was pre-trained on two tasks: masked language modeling and next sentence prediction.
- Masked language modeling involved predicting randomly selected tokens (15% of the input) given their surrounding context.
- Of the selected tokens, 80% were replaced with a special [MASK] token, 10% with a random token, and 10% left unchanged.
- Next sentence prediction involved determining if two spans appeared sequentially in the training corpus.
- BERT learns latent representations of words and sentences in context during pre-training.
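The masking procedure above can be sketched in a few lines of Python. This is an illustrative reimplementation, not BERT's actual code; the `MASK` string and the `tok_…` placeholder for random replacements are stand-ins for real WordPiece vocabulary entries.

```python
import random

MASK, VOCAB_SIZE = "[MASK]", 30000  # BERT's WordPiece vocabulary size

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """BERT-style masking: select ~15% of tokens; of those,
    80% become [MASK], 10% a random token, 10% stay unchanged."""
    rng = random.Random(seed)
    masked, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok  # the model must predict the original token
            r = rng.random()
            if r < 0.8:
                masked[i] = MASK
            elif r < 0.9:
                # random replacement drawn from the vocabulary (placeholder name)
                masked[i] = f"tok_{rng.randrange(VOCAB_SIZE)}"
            # else: token is left unchanged but is still a prediction target
    return masked, targets
```

Note that even an unchanged selected token remains a prediction target; this keeps the model from learning that visible tokens never need predicting.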
Architecture details
- BERT has two versions: BASE (12 encoder layers, about 110 million parameters) and LARGE (24 encoder layers, about 340 million parameters).
- The lowest layer of BERT is the embedding layer, which contains word_embeddings, position_embeddings, and token_type_embeddings.
- word_embeddings converts input tokens into vectors.
- position_embeddings encodes the absolute position of each token in the sequence.
- token_type_embeddings distinguishes tokens appearing before and after the [SEP] special token, i.e., which of the two input segments a token belongs to.
Performance
- BERT achieved state-of-the-art performance on the GLUE task set, SQuAD, and SWAG.
- GLUE is a general language understanding evaluation task set consisting of 9 tasks.
- SQuAD is the Stanford Question Answering Dataset.
- SWAG refers to Situations With Adversarial Generations.
Analysis
- The reasons for BERT's state-of-the-art performance are not well understood.
- Current research focuses on analyzing BERT's output, internal vector representations, and attention weights.
- BERT's bidirectional training contributes to its high performance.
- BERT gains a deep understanding of context by considering words from both left and right sides.
- BERT's encoder-only architecture limits its ability to generate text.
Recognition
- The research paper describing BERT won the Best Long Paper Award at the 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).