
BERT (language model)


Design and Pretraining
– BERT is an encoder-only transformer architecture.
– BERT consists of three modules: embedding, a stack of encoders, and un-embedding.
– The un-embedding module is necessary for pretraining but often unnecessary for downstream tasks.
– BERT uses WordPiece to convert English words into integer codes.
– BERT’s vocabulary size is 30,000.
– BERT was pre-trained on two tasks: masked language modeling and next-sentence prediction.
– Masked language modeling involved predicting randomly selected tokens given their surrounding context.
– Of the selected tokens (15% of each sequence), 80% were replaced with a [MASK] token, 10% with a random token, and 10% were left unchanged.
– Next sentence prediction involved determining if two spans appeared sequentially in the training corpus.
– BERT learns latent representations of words and sentences in context during pre-training.
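The masking scheme above can be sketched in a few lines. This is an illustrative reimplementation of the 80/10/10 corruption rule, not BERT's original preprocessing code: the token IDs are toy values, positions are selected independently rather than as an exact 15% sample, and special tokens such as [CLS] and [SEP] are not excluded as they would be in practice. The label value -100 follows the common "ignore this position" convention.

```python
import random

MASK_ID = 103       # [MASK] token id in standard BERT vocabularies
VOCAB_SIZE = 30000  # vocabulary size per the original paper

def mask_tokens(token_ids, mask_prob=0.15, rng=None):
    """Return (corrupted_ids, labels).

    Each position is selected with probability mask_prob. Of the selected
    tokens, 80% become [MASK], 10% become a random token, and 10% are left
    unchanged. labels[i] holds the original id at selected positions and
    -100 elsewhere (positions the loss should ignore).
    """
    rng = rng or random.Random()
    corrupted = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:
            labels[i] = tok  # the model must predict the original token here
            r = rng.random()
            if r < 0.8:
                corrupted[i] = MASK_ID
            elif r < 0.9:
                corrupted[i] = rng.randrange(VOCAB_SIZE)
            # else: leave the token unchanged (10% of selected cases)
    return corrupted, labels

corrupted, labels = mask_tokens(list(range(1000, 1200)), rng=random.Random(0))
```

Because some selected tokens are left unchanged, the model cannot assume a visible token is correct, which forces it to build a contextual representation of every position.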

Architecture details
– BERT has two versions: BASE and LARGE.
– The lowest layer of BERT is the embedding layer, which contains word_embeddings, position_embeddings, and token_type_embeddings.
– word_embeddings converts input tokens into vectors.
– position_embeddings performs absolute position embedding.
– token_type_embeddings distinguishes tokens before and after [SEP].
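The three embedding tables combine by simple elementwise addition: each table is indexed by its own key (token id, absolute position, segment type) and the lookups are summed. The sketch below uses random weights and illustrative dimensions and token ids, not BERT's trained values:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, MAX_POS, TYPES, HIDDEN = 30000, 512, 2, 768

# Three lookup tables, analogous to word_embeddings,
# position_embeddings, and token_type_embeddings.
word_emb = rng.standard_normal((VOCAB, HIDDEN))
pos_emb = rng.standard_normal((MAX_POS, HIDDEN))
type_emb = rng.standard_normal((TYPES, HIDDEN))

def embed(token_ids, token_types):
    """Input embedding = word + absolute position + segment (token type)."""
    positions = np.arange(len(token_ids))
    return word_emb[token_ids] + pos_emb[positions] + type_emb[token_types]

# Two segments: tokens up to the first [SEP] get type 0, the rest type 1.
tokens = np.array([101, 7592, 102, 2088, 102])  # illustrative ids
types = np.array([0, 0, 0, 1, 1])
x = embed(tokens, types)  # shape (5, 768), one vector per input token
```

In the real model a LayerNorm and dropout follow this sum before the encoder stack.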

– BERT achieved state-of-the-art performance on the GLUE task set, SQuAD, and SWAG.
– GLUE (General Language Understanding Evaluation) is a benchmark suite of nine language-understanding tasks.
– SQuAD is the Stanford Question Answering Dataset.
– SWAG refers to Situations With Adversarial Generations.

– The reasons for BERT’s state-of-the-art performance are not well understood.
– Current research focuses on analyzing BERT’s output, internal vector representations, and attention weights.
– BERT’s bidirectional training contributes to its high performance.
– BERT gains a deep understanding of context by considering words from both left and right sides.
– BERT’s encoder-only architecture limits its ability to generate text.

– The research paper describing BERT won the Best Long Paper Award at the 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).

Bidirectional Encoder Representations from Transformers (BERT) is a language model based on the transformer architecture, notable for its dramatic improvement over previous state-of-the-art models. It was introduced in October 2018 by researchers at Google. A 2020 literature survey concluded that "in a little over a year, BERT has become a ubiquitous baseline in Natural Language Processing (NLP) experiments counting over 150 research publications analyzing and improving the model."

BERT was originally implemented in the English language at two model sizes: (1) BERT-BASE: 12 encoders with 12 bidirectional self-attention heads totaling 110 million parameters, and (2) BERT-LARGE: 24 encoders with 16 bidirectional self-attention heads totaling 340 million parameters. Both models were pre-trained on the Toronto BookCorpus (800M words) and English Wikipedia (2,500M words).
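These parameter counts can be roughly reconstructed from the standard transformer dimensions. The hidden sizes (768 and 1024), feed-forward size (4× hidden), and vocabulary size (30,522 in the released English models) are assumptions not stated above; the arithmetic below is a sanity-check sketch, not the official accounting:

```python
def bert_param_count(layers, hidden, vocab=30522, max_pos=512, types=2):
    """Approximate parameter count of a BERT-style encoder stack."""
    # Embedding tables plus the LayerNorm that follows their sum.
    embeddings = (vocab + max_pos + types) * hidden + 2 * hidden
    # Per encoder layer: Q, K, V, and output projections (weights + biases),
    # the two feed-forward linear layers, and two LayerNorms.
    attn = 4 * (hidden * hidden + hidden)
    ffn = 2 * (hidden * 4 * hidden) + 4 * hidden + hidden
    layer_norms = 2 * 2 * hidden
    per_layer = attn + ffn + layer_norms
    # Final pooler projection on the [CLS] vector.
    pooler = hidden * hidden + hidden
    return embeddings + layers * per_layer + pooler

base = bert_param_count(12, 768)    # about 110 million
large = bert_param_count(24, 1024)  # about 335 million
print(f"BASE: {base/1e6:.0f}M, LARGE: {large/1e6:.0f}M")
```

The totals land within a couple of percent of the published 110M/340M figures, with most parameters in the encoder layers rather than the embeddings.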
