where ℎ_𝑡 is the hidden state at time step 𝑡, 𝑥_𝑡 is the input, and 𝑦_𝑡 is the output. The activation function is denoted by 𝑓, the bias term by 𝑏, and the weight matrix by 𝑊. 𝑓 is usually a nonlinear function such as tanh or ReLU.
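For concreteness, a minimal NumPy sketch of the recurrence these symbols describe is given below. The split of 𝑊 into separate input-to-hidden, hidden-to-hidden, and hidden-to-output matrices (W_xh, W_hh, W_hy) is an illustrative assumption, since the text refers only to a generic weight matrix 𝑊.

```python
import numpy as np

def srn_step(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
    """One step of a simple recurrent network (SRN).

    h_t = f(W_xh @ x_t + W_hh @ h_prev + b_h)   # hidden state update
    y_t = W_hy @ h_t + b_y                      # output at time step t
    The matrix names are illustrative; the text only uses a generic W and b.
    """
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)  # f = tanh here
    y_t = W_hy @ h_t + b_y
    return h_t, y_t
```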
In text classification tasks, RNNs, especially
SRNs, are used to process and understand text
sequences. These networks are able to learn the
temporal relationships between words, which is
essential for understanding sentence structure and
semantics. Text classification with RNN models is commonplace in applications such as spam detection, topic categorization, and sentiment analysis. For example, in topic classification tasks, RNNs help to recognize the topic of a text. The processing pipeline for these tasks usually has the following steps: first, the text sequence is encoded (i.e., text vectorization); then, the encoded sequence is fed to the model; finally, the classification decision is made by one or more fully connected layers.
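A minimal PyTorch sketch of this pipeline follows; the layer sizes and the use of an embedding layer for vectorization are illustrative assumptions rather than details taken from the text.

```python
import torch
import torch.nn as nn

class RNNTextClassifier(nn.Module):
    """Sketch of the pipeline above: vectorize -> SRN encoder -> fully connected layer."""
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)        # text vectorization
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)  # simple RNN encoder
        self.fc = nn.Linear(hidden_dim, num_classes)                # classification head

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)   # (batch, seq_len, embed_dim)
        _, h_last = self.rnn(embedded)         # final hidden state: (1, batch, hidden_dim)
        return self.fc(h_last.squeeze(0))      # class logits
```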
Although SRNs are theoretically capable of
handling long-distance temporal dependencies, in
practice they often experience gradient vanishing or
gradient explosion during training. This makes it
difficult for the network to learn long-term
dependencies. Moreover, the RNN is a biased model: later words have a greater influence on the final representation than earlier words. As a result, it may be less effective at capturing the semantics of long texts, such as entire documents, because the key constituents may appear anywhere in the document, not just at the end.
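The vanishing-gradient issue can be illustrated with a short experiment (an illustration, not taken from the text): with a randomly initialized tanh SRN, the gradient that reaches the first input typically shrinks sharply as the sequence grows.

```python
import torch
import torch.nn as nn

# Illustrative check: how much gradient from the last output reaches the first input.
rnn = nn.RNN(input_size=8, hidden_size=8, nonlinearity='tanh')
for seq_len in (5, 50, 500):
    x = torch.randn(seq_len, 1, 8, requires_grad=True)  # (seq_len, batch, input_size)
    out, _ = rnn(x)
    out[-1].sum().backward()                 # backpropagate from the last time step
    grad_first = x.grad[0].norm().item()     # gradient magnitude at the first time step
    print(f"seq_len={seq_len:4d}  ||dL/dx_1|| = {grad_first:.2e}")
```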
2.2 LSTM
Among the many RNN variants, Long Short-Term Memory (LSTM) is the most popular and widely used architecture. By introducing three gates (an input gate, an output gate, and a forget gate) together with a cell state, LSTM alleviates the vanishing and exploding gradient problems that vanilla RNNs suffer from when dealing with long-term dependencies. The cell state is the key to realizing long-term memory in LSTM. Only a small amount of linear manipulation is applied as the cell state is passed along, so information can flow through largely unchanged.
This enables long term memory retention. The
addition of information to the cell state is controlled
by the input gate. It controls the inflow of information
through a sigmoid activation function. The forget gate determines which portion of the information in the cell state should be discarded; like the input gate, it regulates this flow with a sigmoid function. Using two activation functions, the output gate determines which portion of the cell state is passed on to the next time step: the sigmoid function controls the passing of information, and the tanh function scales it. With these three gate units, the cell state is updated at each time step and passed between time steps (Staudemeyer & Morris 2019).
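For concreteness, a NumPy sketch of a single LSTM step following this gate description is given below; the per-gate parameter layout (separate W, U, and b for each gate) is an illustrative assumption rather than notation from the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b are dicts keyed by 'i', 'f', 'o', 'g'
    (input gate, forget gate, output gate, candidate content);
    this parameter layout is an illustrative assumption."""
    i_t = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])  # input gate: what to add
    f_t = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])  # forget gate: what to discard
    o_t = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])  # output gate: what to emit
    g_t = np.tanh(W['g'] @ x_t + U['g'] @ h_prev + b['g'])  # candidate cell content
    c_t = f_t * c_prev + i_t * g_t   # cell state update: mostly linear, so memory persists
    h_t = o_t * np.tanh(c_t)         # tanh scales the cell state; sigmoid gates its outflow
    return h_t, c_t
```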
In practical applications, the text data to be learned is usually not short. Therefore, achieving long-term memory of textual information is particularly important for classification tasks.
LSTM performs well in text classification tasks
because it has a good understanding of the long-term
contextual information in the document. To be
specific, LSTM can learn the deeper meaning of text,
such as grammar and syntax. It can also learn the
complex relationships between words and how their
order in a sentence affects the semantics. As one of
the variants of Simple RNNs, LSTMs are also capable
of handling variable-length input sequences.
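As an illustration of handling variable-length inputs, the sketch below pads and packs three sequences of different lengths before feeding them to an LSTM; the use of PyTorch's packed-sequence utilities is an assumption about tooling, not something prescribed in the text.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_sequence

# Hypothetical mini-batch: three already-embedded sequences of different lengths.
seqs = [torch.randn(5, 16), torch.randn(3, 16), torch.randn(8, 16)]
lengths = torch.tensor([s.size(0) for s in seqs])

padded = pad_sequence(seqs, batch_first=True)   # (3, 8, 16), zero-padded to max length
packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=False)

lstm = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)
_, (h_n, _) = lstm(packed)   # h_n holds the final hidden state of each sequence
print(h_n.shape)             # torch.Size([1, 3, 32])
```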
As one of the most common and fundamental models used in text classification tasks, LSTM has many variants designed to boost its ability to learn long-distance dependencies. Sentence-State LSTM (S-LSTM) improves parallelism by updating the states of all words in each recurrent step instead of processing them one by one. The model also introduces a global sentence-level state that represents the sentence and exchanges information with the word states. This allows S-LSTM to learn whole-sentence features faster while remaining efficient on long sentences (Zhang et al. 2018). Tree-LSTM is a variant that extends the LSTM network to tree topologies. At each node, this architecture can process the information provided by its child nodes simultaneously. Tree-LSTM is particularly good at handling data with hierarchical structure (for example, syntactic trees of sentences) because it can capture the semantics and syntax of sentences better than a vanilla LSTM. In contrast to the unidirectional LSTM, the Bidirectional LSTM processes a sequence in both directions and can therefore draw on more contextual information, which generally yields higher classification accuracy than a unidirectional LSTM. In addition, LSTMs can further enhance their performance through attention mechanisms.
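A minimal sketch of a bidirectional LSTM classifier head is given below; concatenating the final forward and backward hidden states before the fully connected layer is one common choice, and the dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Illustrative bidirectional LSTM classifier head.
bilstm = nn.LSTM(input_size=16, hidden_size=32, batch_first=True, bidirectional=True)
fc = nn.Linear(2 * 32, 4)   # 4 classes, chosen arbitrarily for the sketch

x = torch.randn(2, 10, 16)                       # (batch, seq_len, embed_dim)
_, (h_n, _) = bilstm(x)                          # h_n: (2 directions, batch, hidden)
features = torch.cat([h_n[0], h_n[1]], dim=-1)   # concatenate forward/backward states
logits = fc(features)
print(logits.shape)                              # torch.Size([2, 4])
```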
2.3 GRU
The Gated Recurrent Unit (GRU) is a highly effective variant of the LSTM network. Compared to LSTM, it has only two gate units: an update gate and a reset gate. The update gate determines how the input