Written by ajayjangid


April 13, 2020

Intuitive Understanding of Seq2seq model & Attention Mechanism in Deep Learning

In this article, I will explain the sequence-to-sequence model, which has recently seen great demand in applications like Machine translation, Image Captioning, Video Captioning, Speech recognition, etc.

Seq2seq model:

The sequence-to-sequence model was first introduced by Google in 2014. So what is a seq2seq model? It maps an input sequence to an output sequence, where the lengths of the input and the output may differ. Variants of recurrent neural networks, such as Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU), are the methods we mostly use here, since they overcome the problem of vanishing gradients.

Let’s look at an example to get a clear picture of what it is:

Here we can observe that one language is translated into another. There are many such applications, for example:

  1. Speech Recognition 
  2. Machine Language Translation
  3. Named entity/Subject extraction
  4. Relation Classification 
  5. Path Query Answering 
  6. Speech Generation
  7. Chatbot 
  8. Text Summarization 
  9. Product Sales Forecasting

We can divide our seq2seq (encoder-decoder) model into two phases:

  1. Training phase, i.e., the process inside the encoder and decoder.
  2. Inference phase, i.e., the process at test time.

So let’s dive into stage 1, the training phase.

Training Phase:

In the training phase, we set up our encoder and decoder models. After setting them up, we train the model, making a prediction at every timestep by reading the input word by word or character by character.

Before training and testing, the data needs to be cleaned, and we need to add tokens that mark the start and end of each sentence so that the model knows when to stop. Let’s look at the encoder architecture!
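As a small illustration, here is a minimal Python sketch of adding start/end tokens to target sentences (the helper name and token strings are assumptions, not from any particular library):

```python
# Wrap each target sentence with start/end tokens so the decoder
# knows where a sequence begins and when to stop generating.
def add_special_tokens(sentences, start_token="<Go>", end_token="<end>"):
    return [f"{start_token} {s.strip()} {end_token}" for s in sentences]

targets = ["I love cricket", "How are you"]
print(add_special_tokens(targets))
# e.g. ['<Go> I love cricket <end>', '<Go> How are you <end>']
```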


Let’s walk through the diagram to get a clear view of the flow of the encoder part.

From the figure of the encoder as an LSTM network, we get the intuition that at each timestep a word is read and processed, capturing the contextual information from the input sequence passed to the encoder model. In the figure, we pass an example like “How are you <end>”, which is processed word by word. One important note: the encoder returns its final hidden state (h3) and cell state (c3) as the initial state for the decoder model.
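The encoder described above can be sketched in Keras (one of the libraries referenced in this article); the vocabulary and dimension sizes below are illustrative assumptions:

```python
from tensorflow.keras.layers import Input, Embedding, LSTM

vocab_size, embed_dim, latent_dim = 10000, 128, 256   # illustrative sizes

encoder_inputs = Input(shape=(None,))                 # variable-length token ids
enc_emb = Embedding(vocab_size, embed_dim)(encoder_inputs)
# return_state=True exposes the final hidden state (h) and cell state (c),
# which are passed on as the decoder's initial state.
encoder_outputs, state_h, state_c = LSTM(latent_dim, return_state=True)(enc_emb)
encoder_states = [state_h, state_c]
```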


The decoder is also an LSTM network; it reads the entire target sequence word by word and predicts the same sequence offset by one timestep. The decoder model is trained to predict the next word in the sequence given the previous word. Let’s go through the figure for a better understanding of the decoder model’s flow:

Source of the image – link

<Go> and <end> are special tokens added to the target sequence before feeding it into the decoder. The target sequence is unknown while decoding a test sequence, so we start predicting it by passing the first word into the decoder, which is always the <Go> token (or any start token you specify). The <end> token signals the end of the sentence to the model.
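A hedged Keras sketch of the full training setup described above, with the decoder initialized from the encoder’s final (h, c) states and trained with teacher forcing; all sizes are illustrative assumptions:

```python
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense
from tensorflow.keras.models import Model

vocab_size, embed_dim, latent_dim = 10000, 128, 256   # illustrative sizes

# Encoder (as sketched earlier), repeated so this snippet is self-contained
encoder_inputs = Input(shape=(None,))
enc_emb = Embedding(vocab_size, embed_dim)(encoder_inputs)
_, state_h, state_c = LSTM(latent_dim, return_state=True)(enc_emb)

# Decoder: starts from the encoder's final (h, c) states and is trained to
# predict the target sequence offset by one timestep (teacher forcing)
decoder_inputs = Input(shape=(None,))                 # target ids, starting with <Go>
dec_emb = Embedding(vocab_size, embed_dim)(decoder_inputs)
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(dec_emb, initial_state=[state_h, state_c])
decoder_outputs = Dense(vocab_size, activation="softmax")(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy")
```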

Inference Phase:

After training our encoder-decoder (seq2seq) model, we test it on new, unseen input sequences for which the target sequence is unknown. So, in order to get a prediction for a given sentence, we need to set up the inference architecture to decode a test sequence:

So how does it work, then?

Here are the steps to process for decoding the test sequence:

  1. First, encode the entire input sequence and initialize the decoder with the internal states of the encoder.
  2. Pass the <Go> token (or whichever start token you specified) as the first input to the decoder.
  3. Run the decoder for one timestep with the internal states.
  4. The output will be a probability distribution over the next word; the word with the maximum probability is selected.
  5. Pass that word as the input to the decoder at the next timestep and update the internal states.
  6. Repeat steps 3-5 until we generate the <end> token or hit the maximum length of the target sequence.
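The steps above can be sketched as a framework-agnostic greedy decoding loop; `encode` and `decode_step` are hypothetical stand-ins for the trained encoder and decoder:

```python
# Greedy decoding sketch. `encode(input_seq)` returns the encoder's internal
# states; `decode_step(token, states)` runs the decoder for one timestep and
# returns (most-likely next word, updated states).
def greedy_decode(encode, decode_step, input_seq, max_len=20,
                  go_token="<Go>", end_token="<end>"):
    states = encode(input_seq)                      # step 1: encoder states
    token, output = go_token, []                    # step 2: start with <Go>
    for _ in range(max_len):                        # step 6: cap at max length
        token, states = decode_step(token, states)  # steps 3-5: one step,
        if token == end_token:                      # pick the likeliest word
            break
        output.append(token)
    return output
```

With real models plugged in, `decode_step` would apply argmax over the decoder’s softmax output; here the control flow is the point.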

Let’s go through a figure that shows, for a given input, how the inference process runs and how a sentence is encoded and decoded:

So why does the seq2seq model fail, then?

We have now understood the architecture and flow of the encoder-decoder model. It is very useful, but it has a limitation. As we saw, the encoder takes the input and converts it into a fixed-size vector, and then the decoder makes a prediction and gives the output sequence. This works fine for short sequences, but it fails for long ones, because it becomes difficult for the encoder to memorize the entire sequence into a fixed-size vector and to compress all the contextual information from the sequence. We observe that as the sequence length increases, the model’s performance starts to degrade.

So how to overcome this problem? 


To overcome this problem, we introduce the attention mechanism. Here, we give importance to specific parts of the input sequence, rather than the entire sequence, when predicting a word. Basically, with attention we don’t throw away the intermediate encoder states; we utilize all of them to generate the context vector from which the decoder produces the output.

Let’s go through an example to understand how attention mechanism works:

  • Source sequence: “Which sport do you like the most?”
  • Target sequence: “I love cricket”

The idea of attention is to utilize all the contextual information from the input sequence so that we can decode the target sequence. Let’s look at the example we mentioned. The first word “I” in the target sequence is connected to the fourth word “you” in the source sequence, right? Similarly, the second word “love” in the target sequence is associated with the fifth word “like” in the source sequence. We can observe that we are paying more attention to contextual information like “sport”, “like”, and “you” in the source sequence.

So, instead of going through the entire sequence it pays attention to specific words from the sequence and gives out the result based on that.

Source: link

How does it work:

Now that we understand how attention works theoretically, it’s time to understand it technically. We will stick to our example “Which sport do you like the most?”.

Step by step:

  1. Computing the score of each encoder state:

We train a feed-forward neural network together with our encoder-decoder, using all the encoder states (recall that earlier we used only the encoder’s final state to initialize the decoder). By training this network, we generate high scores for the states the decoder should attend to, while states with low scores are ignored.

Let s1, s2, s3, s4, s5, s6, and s7 be the scores generated for the states h1, h2, h3, h4, h5, h6, and h7 respectively. Since we assumed that we need to pay more attention to the states h2, h4, and h5 and ignore h1, h3, h6, and h7 in order to predict “I”, we expect the above neural network to generate scores such that s2, s4, and s5 are high while s1, s3, s6, and s7 are relatively low.
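One common (assumed) choice for this scoring network is additive, Bahdanau-style attention, where a small feed-forward net scores each encoder state against the decoder’s previous state. A NumPy sketch with made-up sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.normal(size=(7, 4))       # encoder states h1..h7, dimension 4 (made up)
s_prev = rng.normal(size=4)       # decoder's previous hidden state

# Additive scoring: score(h_i) = v . tanh(W [h_i; s_prev])
W = rng.normal(size=(4, 8))
v = rng.normal(size=4)
scores = np.array([v @ np.tanh(W @ np.concatenate([h, s_prev])) for h in H])
# `scores` holds one scalar s_i per encoder state h_i
```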

  2. Computing the attention weights:

After generating the scores, we apply softmax to them to obtain the weights w1, w2, w3, w4, w5, w6, and w7. Softmax normalizes the scores into probabilities: each weight lies in the range 0-1, and they all sum to 1.

For example:

w1 = 0.23, w2 = 0.11, w3 = 0.13, ... where all the weights sum to 1.
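A minimal NumPy sketch of turning scores into attention weights with softmax (the score values below are made up):

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())       # subtract the max for numerical stability
    return e / e.sum()

scores = np.array([1.0, 3.0, 0.5, 2.5, 2.8, 0.2, 0.1])  # s1..s7 (made up)
w = softmax(scores)               # attention weights w1..w7, summing to 1
```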

  3. Computing the context vector:

After computing the attention weights, we calculate the context vector, which will be used by the decoder to predict the next word in the sequence.

ContextVector = w1 * h1 + w2 * h2 + w3 * h3 + w4 * h4 + w5 * h5 + w6 * h6 + w7 * h7

Clearly, if the values of w2, w4, and w5 are high and those of w1, w3, w6, and w7 are low, then the context vector will contain more information from the states h2, h4, and h5 and relatively less information from the states h1, h3, h6, and h7.
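The weighted sum above can be computed in NumPy as a single matrix product (all numbers here are made up for illustration):

```python
import numpy as np

H = np.arange(21, dtype=float).reshape(7, 3)               # states h1..h7 (rows)
w = np.array([0.02, 0.40, 0.03, 0.25, 0.25, 0.03, 0.02])   # weights w1..w7
context = w @ H   # equals w1*h1 + w2*h2 + ... + w7*h7
```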

  4. Adding the context vector to the output of the previous timestep:

Here we simply combine the context vector with the <Go> token, since at the first timestep we don’t have a previous timestep’s output.

After this, the decoder generates an output word, and in the same way we get a prediction for every word in the sequence. Once the decoder outputs <end>, we stop generating words.


Key points:

  1. Unlike the plain seq2seq model, where we used one fixed-size vector for all decoder timesteps, with the attention mechanism we generate a context vector at every timestep.
  2. Thanks to the attention mechanism, the performance of the model improves and we observe better results.


There are two different ways of doing the attention mechanism; we will not go into depth here. The two classes depend on how much of the input sequence’s information the context vector needs to compress:

  1. Global Attention
  2. Local Attention

Let’s dive into these.

Global Attention:

Here, the attention is placed on all the source positions. In other words, all the hidden states of the encoder are considered for deriving the attended context vector.

Local Attention:

Here, the attention is placed on only a few source positions. Only a few hidden states of the encoder are considered for deriving the attended context vector.
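A hedged NumPy sketch of the difference: global attention uses all encoder states, while local attention restricts the weighted sum to a window of positions (here the window is passed in explicitly, which is an assumption; Luong-style local attention predicts the window position):

```python
import numpy as np

def attention_context(H, weights, positions=None):
    # Weighted sum of encoder states; restrict to `positions` for local attention.
    if positions is not None:
        H, weights = H[positions], weights[positions]
        weights = weights / weights.sum()   # renormalize over the window
    return weights @ H

H = np.random.default_rng(1).normal(size=(7, 4))  # 7 encoder states (made up)
w = np.full(7, 1 / 7)                             # uniform weights for illustration
global_ctx = attention_context(H, w)                            # all positions
local_ctx = attention_context(H, w, positions=np.arange(2, 5))  # window h3..h5
```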

Reading Data

Pre-processing data:

Building vocabulary:

Splitting data into train and test:

Training data in batches:

Defining Encoder model:

Training our model:

Saving our model:

Inference model:

Decoding sequence:

Testing on train data:

Testing on test data:

Setting the inference model:



Conclusion:

  1. We understood the concept behind the seq2seq model, its working, and its limitation.
  2. We also understood how the attention mechanism solves the problem of the seq2seq model: using attention, the model is capable of finding the mapping between the input sequence and the output sequence.
  3. The performance of the model can be increased by enlarging the training dataset and using a bi-directional LSTM for a better context vector.
  4. Use the beam search strategy for decoding the test sequence instead of the greedy approach (argmax).
  5. Evaluate the performance of your model with the BLEU score.
  6. The one drawback of attention is that it is time-consuming. To overcome this problem, Google introduced the Transformer model, which we will cover in upcoming blogs.
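As a follow-up to the beam-search suggestion above, here is a hedged sketch of beam search over a toy step function; `next_probs` is a hypothetical stand-in for the decoder’s next-word distribution:

```python
from math import log

# Beam search sketch: keep the `beam_width` highest-scoring partial sequences
# instead of greedily committing to the single most likely word at each step.
def beam_search(next_probs, beam_width=3, max_len=10, end_token="<end>"):
    beams = [(["<Go>"], 0.0)]                      # (sequence, log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_token:
                candidates.append((seq, score))    # finished beam carries over
                continue
            for word, p in next_probs(seq).items():
                candidates.append((seq + [word], score + log(p)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(seq[-1] == end_token for seq, _ in beams):
            break
    return beams[0][0]
```

Summing log-probabilities (rather than multiplying raw probabilities) is the usual choice, since it avoids numerical underflow on long sequences.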


References:

  1. https://www.appliedaicourse.com/
  2. https://www.coursera.org/lecture/nlp-sequence-models/attention-model-lSwVa (Andrew Ng’s Explanation on Attention)
  3. https://arxiv.org/abs/1409.3215
  4. https://keras.io/examples/lstm_seq2seq/
  5. https://arxiv.org/abs/1406.1078
