information from a composite understanding of the
surrounding words. The MLM objective is
mathematically represented as:

L(\theta) = -\sum_{i \in M} \log p(x_i \mid \hat{x})    (1)
where \hat{x} is the input sequence with some tokens masked, M is the set of masked positions, x_i is the original token at a masked position, and \theta denotes the model parameters.
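As an illustrative sketch (not taken from the original paper), the following Python snippet computes the MLM loss of Eq. (1) for a toy example; the logits, vocabulary size, and masked positions are placeholder assumptions rather than outputs of a real encoder.

import torch
import torch.nn.functional as F

# Toy setup: random logits stand in for the encoder's output at each position.
vocab_size, seq_len = 100, 8
logits = torch.randn(seq_len, vocab_size)
original_ids = torch.randint(0, vocab_size, (seq_len,))
mask_positions = torch.tensor([1, 4])  # positions that were replaced by [MASK]

# L(theta) = -sum over masked positions of log p(x_i | x_hat)
log_probs = F.log_softmax(logits, dim=-1)
mlm_loss = -log_probs[mask_positions, original_ids[mask_positions]].sum()
print(mlm_loss)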
2.2.2 NSP
BERT incorporates the NSP task to learn relationships between sentences. In this binary classification task, BERT takes in sentence pairs and is trained to predict whether the second sentence actually follows the first in the original document. This task enhances BERT's
ability to capture the relationships between sentences,
which is crucial for downstream tasks that involve
understanding the structure of documents, such as
question answering and natural language inference.
The NSP task can be formalized as:

L(\theta) = -\sum_{i=1}^{N} [\, y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \,]    (2)
where N is the number of sentence pairs, y_i indicates whether the two sentences in pair i are consecutive, and \hat{y}_i is the model's predicted probability that they are.
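A minimal sketch of Eq. (2), again with placeholder labels and predictions rather than real model outputs: the loss is the binary cross-entropy summed over N sentence pairs.

import torch

# Placeholder values for N = 3 sentence pairs.
y = torch.tensor([1.0, 0.0, 1.0])        # 1 = the second sentence follows the first
y_hat = torch.tensor([0.9, 0.2, 0.7])    # model-predicted probabilities

# L(theta) = -sum_i [ y_i log y_hat_i + (1 - y_i) log(1 - y_hat_i) ]
nsp_loss = -(y * torch.log(y_hat) + (1 - y) * torch.log(1 - y_hat)).sum()
print(nsp_loss)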
2.3 Decoder-Only Language Models
Decoder-only LMs, like OpenAI’s GPT, generally consist of multiple modified Transformer decoder blocks stacked on top of each other. Each block comprises Masked Multi-Head Self-Attention, Position-wise Feed-Forward Networks, and Layer Normalization (Radford et al. 2021).
In contrast to encoder models, decoder-only models are typically pretrained and generate text in a unidirectional, auto-regressive manner. This means each token is generated based on the previously generated tokens, without seeing future tokens in the sequence.
The autoregressive language modeling task can be
mathematically represented as follows:
Given a sequence of tokens x_1, x_2, \dots, x_n, the model predicts each token x_i based on all previous tokens. The probability of the sequence can be factorized as:

P(x_1, x_2, \dots, x_n) = \prod_{i=1}^{n} P(x_i \mid x_1, x_2, \dots, x_{i-1})    (3)
For each step i, the model outputs a probability distribution over the vocabulary for the next token x_i, based on the previous tokens.
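The factorization in Eq. (3) can be sketched as follows; next_token_probs is a hypothetical stand-in for a decoder-only LM that returns a distribution over the vocabulary given the prefix.

import torch
import torch.nn.functional as F

vocab_size = 50

def next_token_probs(prefix_ids):
    # Hypothetical placeholder: a real model would condition on prefix_ids;
    # here we only derive a deterministic random distribution from its length.
    torch.manual_seed(len(prefix_ids))
    return F.softmax(torch.randn(vocab_size), dim=-1)

sequence = [3, 17, 42, 8]
log_prob = torch.tensor(0.0)
for i, token in enumerate(sequence):
    probs = next_token_probs(sequence[:i])   # p(x_i | x_1, ..., x_{i-1})
    log_prob = log_prob + torch.log(probs[token])
print(torch.exp(log_prob))                    # P(x_1, ..., x_n)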
2.4 Mainstream LLMs of Different
Architecture
While OpenAI’s ChatGPT is undoubtedly the most popular and well-known language model of recent years, there are many other decoder-only LMs, such as Google’s latest model Gemini, Llama and Llama 2 developed by Meta, and Google’s Bard, LaMDA, and PaLM. Among encoder-only LMs, there are likewise many models besides BERT, such as Microsoft’s DeBERTa, Google’s ALBERT, and Meta’s RoBERTa. Apart from encoder-only and decoder-only architectures, seq2seq (encoder-decoder) models like Meta’s BART and Google’s T5 are also widely applied.
3 TRAINING EFFICIENCY
Since few studies focus specifically on training efficiency, and actual training efficiency depends on many factors, this section compares the training efficiency of decoder LMs and encoder LMs mainly from a theoretical perspective.
3.1 Pretraining Tasks Complexity and
Time
For encoder-only LMs, tasks like MLM are
inherently parallelizable since each masked token's
prediction is relatively independent of others. This
parallelism can lead to efficient use of computational
resources, potentially reducing pretraining time. Due
to their ability to process input tokens in parallel,
encoder-only models can efficiently handle large
batches of data, which can shorten the overall time
required for pretraining.
For decoder-only LMs, the autoregressive pretraining task, where each token prediction depends on the previously predicted tokens, can limit parallel processing, potentially making pretraining more time-consuming than for encoder-only models. While the sequential learning process is thorough, it might require more time to reach similar levels of understanding and generation capability because of its inherent sequential processing constraints.
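As an illustrative sketch under our own assumptions (not taken from the cited works), the contrast between the two pretraining regimes can be seen in their attention masks: an encoder-style model lets every position attend to every other position, while a decoder-style model uses a causal (lower-triangular) mask that hides future tokens, which enforces the left-to-right prediction described above.

import torch

seq_len = 5
# Encoder-style (bidirectional): every token may attend to every token.
bidirectional_mask = torch.ones(seq_len, seq_len)
# Decoder-style (causal): token i may attend only to tokens 0..i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len))
print(bidirectional_mask)
print(causal_mask)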
Encoder-decoder models are often pre-trained on
a variety of complex tasks that require both
understanding and generating text. While this makes
them highly versatile, it also means that their
pretraining can be the most time-consuming due to
the complexity of the tasks and the need to learn both
encoding and decoding capabilities. The use of