NLP 101: From RNN to Attention Mechanism

Bikash Bhoi
6 min read · Dec 3, 2020

by Bikash and Abhijit

Deep Learning technologies have gained popularity over the last few years, with significant impact on real-world applications like image and speech recognition, Natural Language Processing (NLP), classification, extraction, and prediction. These advances are made possible by artificial neural networks.

Among these architectures, Recurrent Neural Networks (RNNs) and Long Short Term Memory networks (LSTMs) offer tremendous versatility because they operate over sequences. RNNs and LSTMs have internal memory, so they remember previous inputs and their context, giving users more flexibility in the types of data that models or networks can process.

Recurrent Neural Network (RNN)

A general feedforward neural network works as a function from an input to an output. It does not work well on sequential data and fails to store context when the input is a sequence like speech or text. A Recurrent Neural Network (RNN) is a feedforward neural network with a state (internal memory) that acts as a feedback loop for processing the next input.

The same RNN cell is applied to every input (hence the recurrent nature), and the output of the cell depends on the current input and the output of the previous RNN cell.
The RNN cell uses a fully connected layer followed by a tanh activation function.
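To make the recurrence concrete, here is a minimal NumPy sketch of a single RNN step unrolled over a toy sequence (the weight names W_xh, W_hh and b_h are illustrative, not taken from any particular library):

```python
import numpy as np

def rnn_cell(x_t, h_prev, W_xh, W_hh, b_h):
    """One vanilla RNN step: a fully connected layer over the current input and
    the previous hidden state, followed by tanh."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Toy usage: unroll the *same* cell (same weights) over a short sequence.
input_dim, hidden_dim, seq_len = 4, 3, 5
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(hidden_dim, input_dim))
W_hh = rng.normal(size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)                      # initial hidden state
for x_t in rng.normal(size=(seq_len, input_dim)):
    h = rnn_cell(x_t, h, W_xh, W_hh, b_h)     # the output feeds back as the next state
print(h.shape)                                # (3,)
```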

Even though the RNN is effective at modelling short sequences of data, it suffers from vanishing and exploding gradient problems and fails on long input sequences.

Long Short Term Memory (LSTM)

Like the RNN, Long Short Term Memory is a class of deep learning model that is able to model temporal sequences, and it does so by using the prior context of the data to inform its current and future predictions.

As discussed in the previous section, an RNN model accounts for prior information in the data by creating a feedback mechanism that feeds a representation of previous inputs back into itself. The hidden state contains information about the inputs to the RNN cell at previous time steps, and the model maintains that information over time. However, since the updates to the hidden state are multiplicative (updates over time multiply), it becomes increasingly difficult for the RNN model to maintain any long term dependencies. Furthermore, updates to the hidden state vector in an RNN are not conducive to keeping track of multiple contexts: the addition of a new piece of information changes the hidden state vector entirely, with no mechanism for retaining previously stored information that is unrelated to the new context.

LSTMs are an upgrade over RNNs in that they were designed to update the model's memory additively and selectively, using various gates. The LSTM architecture improves upon the RNN in the following ways (a minimal sketch of these gate computations follows the list):

  • Maintaining a Cell State: LSTMs maintain a long term memory of the inputs in the cell state. Information from the previous and current timesteps is combined to selectively modify the cell state so that it keeps the most relevant context about the input data. The magnitude of the cell state is not normalized, which allows it to represent the strength of a particular feature based on past data. This helps the LSTM retain information about long term dependencies.
  • Adding a FORGET gate: The forget gate acts upon the current input x_{t} and the previous hidden state h_{t-1} to selectively forget information from the previous cell state that might not be relevant to the current time step.
  • Adding an INPUT gate: The input gate acts on the current input x_{t} and the previous hidden state to select the information that has to be added to the cell state.
  • Adding an UPDATE gate: The update step acts on the outputs of the previous two gates to selectively filter and update the cell state C. This ensures that only the relevant information is written into the cell state, and the model is able to handle multiple contexts.
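A minimal NumPy sketch of these gate computations, following the standard LSTM equations (the standard formulation also includes an output gate that produces the hidden state from the cell state; the weight and bias names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    """One LSTM step. Each W_* acts on the concatenation [h_prev, x_t]."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W_f @ z + b_f)          # FORGET gate: what to drop from the cell state
    i_t = sigmoid(W_i @ z + b_i)          # INPUT gate: what new information to admit
    c_tilde = np.tanh(W_c @ z + b_c)      # candidate values for the cell state
    c_t = f_t * c_prev + i_t * c_tilde    # additive, selective UPDATE of the cell state C
    o_t = sigmoid(W_o @ z + b_o)          # output gate
    h_t = o_t * np.tanh(c_t)              # new hidden state exposed to the next step
    return h_t, c_t

# Toy usage over a short sequence
hid, inp = 3, 4
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(hid, hid + inp)) for _ in range(4)]
bs = [np.zeros(hid) for _ in range(4)]
h, c = np.zeros(hid), np.zeros(hid)
for x_t in rng.normal(size=(5, inp)):
    h, c = lstm_cell(x_t, h, c, *Ws, *bs)
```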

Gated Recurrent Unit (GRU)

GRUs are a variation of the LSTM that simplifies the architecture by unifying the input and forget gates into a single update gate. GRUs also merge the cell state into the hidden state, and apply a series of edits to the hidden state of the previous timestep to obtain the current hidden state.

In the GRU architecture, the gates and the hidden state update are defined as:

z_{t} = sigmoid( W_{z} · [h_{t-1}, x_{t}] )
r_{t} = sigmoid( W_{r} · [h_{t-1}, x_{t}] )
h̃_{t} = tanh( W · [r_{t} * h_{t-1}, x_{t}] )
h_{t} = (1 − z_{t}) * h_{t-1} + z_{t} * h̃_{t}

(source: Understanding LSTM Networks — colah's blog)

Unlike the LSTM, which updates the cell state through a fuzzy combination of the previous cell state with the previous hidden state and the current input (by maintaining separate input and forget gates), the GRU updates the previous hidden state by selectively mixing in information from the candidate state h̃_{t} and carrying forward everything else from the previous state. GRUs thus have fewer parameters than LSTMs and are faster to train, with comparable performance.
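A corresponding NumPy sketch of a single GRU step, following the equations above (weight names are illustrative, and bias terms are omitted to match the equations):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_cell(x_t, h_prev, W_z, W_r, W_h):
    """One GRU step: update gate z_t, reset gate r_t, candidate state h_tilde."""
    z_t = sigmoid(W_z @ np.concatenate([h_prev, x_t]))            # update gate
    r_t = sigmoid(W_r @ np.concatenate([h_prev, x_t]))            # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))  # candidate state
    # Carry forward (1 - z_t) of the old state, admit z_t of the new candidate
    return (1.0 - z_t) * h_prev + z_t * h_tilde

# Toy usage over a short sequence
hid, inp = 3, 4
rng = np.random.default_rng(0)
W_z, W_r, W_h = (rng.normal(size=(hid, hid + inp)) for _ in range(3))
h = np.zeros(hid)
for x_t in rng.normal(size=(5, inp)):
    h = gru_cell(x_t, h, W_z, W_r, W_h)
```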

Attention Mechanism: Building Blocks

Attention mechanisms have been explored in the context of image processing, for example in SENets. Vaswani et al. demonstrated the use of attention in encoder-decoder models, showing superlative results across a bevy of NLP benchmarks.

The attention mechanism is a modification of the regular Sequence to Sequence (seq2seq) architecture used in many NLP tasks like language translation, text summarization, etc. The seq2seq architecture has two components: an encoder, which encodes the input into a context vector, and a decoder, which uses the context vector from the encoder to make predictions.

A drawback of the traditional seq2seq architecture is that the decoder uses only the final hidden state of the encoder to make its predictions. Thus, information from the earlier timesteps of the encoder is lost: the encoder compresses the information from all the previous timesteps into a single vector, which results in significant information loss.
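As a sketch of that bottleneck, here is a minimal vanilla-RNN encoder: in plain seq2seq only the final hidden state (the context vector) reaches the decoder, while the per-timestep states are discarded (the function and variable names are illustrative):

```python
import numpy as np

def encode(inputs, W_xh, W_hh, b_h):
    """Vanilla-RNN encoder: returns the final hidden state (the context vector)
    and every intermediate hidden state (which plain seq2seq throws away)."""
    h = np.zeros(W_hh.shape[0])
    states = []
    for x_t in inputs:
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        states.append(h)
    return h, np.stack(states)

rng = np.random.default_rng(0)
source_sequence = rng.normal(size=(6, 4))            # a toy source sequence
context, encoder_states = encode(source_sequence,
                                 rng.normal(size=(3, 4)),
                                 rng.normal(size=(3, 3)),
                                 np.zeros(3))
# Plain seq2seq hands only `context` to the decoder; `encoder_states`
# is exactly what the attention mechanism below puts back to use.
```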

The attention mechanism modifies the traditional seq2seq architecture by introducing an attention layer which takes in the hidden vectors from all the encoder timesteps (represented by h) as well as the hidden vector from the decoder (represented by s), concatenates each encoder hidden vector with the decoder state, passes the result through a feed-forward network, and converts the resulting scores into the ratio in which the encoder outputs (h) should be combined to form the input to the decoder.
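A minimal NumPy sketch of this attention layer, assuming an additive (concatenate-and-feed-forward) scoring function; the weight names W_a and v_a are illustrative:

```python
import numpy as np

def attention(encoder_states, s, W_a, v_a):
    """Score each encoder hidden state h_i against the decoder state s, softmax
    the scores into weights, and mix the encoder states into a context vector."""
    scores = []
    for h_i in encoder_states:
        concat = np.concatenate([h_i, s])            # concatenate h_i with the decoder state s
        scores.append(v_a @ np.tanh(W_a @ concat))   # small feed-forward network -> scalar score
    scores = np.array(scores)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                         # softmax: the ratio for combining the h's
    context = weights @ encoder_states               # weighted sum of encoder hidden states
    return context, weights

# Toy usage with random encoder states and a random decoder state
rng = np.random.default_rng(1)
encoder_states = rng.normal(size=(6, 3))             # h_1 ... h_6
s = rng.normal(size=(3,))                            # decoder hidden state s
W_a = rng.normal(size=(5, 6))                        # maps [h_i; s] (dim 3 + 3) to a hidden size of 5
v_a = rng.normal(size=(5,))
context, weights = attention(encoder_states, s, W_a, v_a)
print(weights.round(3), context.shape)               # weights sum to 1; context has shape (3,)
```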

To sum it up, RNNs are good for sequential data but struggle with long input sequences. LSTMs and GRUs use gates to overcome the long term dependency issue and are still used as building blocks in state-of-the-art deep learning applications like speech recognition and natural language understanding.
