In this article, I will explain the sequence-to-sequence (seq2seq) model, which has recently seen great demand in applications like machine translation, image captioning, video captioning, speech recognition, etc.
Seq2seq model:
Sequence to sequence was first introduced by Google in 2014. So let's answer the question: what is a seq2seq model? A sequence-to-sequence model maps a fixed-length input sequence to a fixed-length output sequence, where the lengths of the input and output may differ. Variants of recurrent neural networks, such as Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks, are mostly used for this, since they mitigate the vanishing gradient problem.
Let's look at an example to get a clear picture of what it is:
Here we can observe that one language is translated into another. There are many applications, such as the following:
- Speech Recognition
- Machine Translation
- Named entity/subject extraction
- Relation Classification
- Path Query Answering
- Speech Generation
- Chatbot
- Text Summarization
- Product Sales Forecasting
We divide our seq2seq (encoder-decoder) model into two phases:
- Training phase, i.e., the process inside the encoder and decoder.
- Inference phase, i.e., the process during test time.
So let's dive into stage 1, the training phase.
Training Phase:
In the training phase, we set up the encoder and decoder models. After setting them up, we train the model, which makes a prediction at every timestep by reading the input word by word (or character by character).
Before training and testing, the data needs to be cleaned, and we need to add tokens that mark the start and end of each sentence so that the model knows when to stop. Let's look at the encoder architecture!
Encoder:
Let's go through the diagram so that we get a clear view of the flow of the encoder part.
From the figure of the encoder as an LSTM network, we get the intuition that at each timestep a word is read and processed, and the network captures the contextual information of the input sequence seen so far. In the figure, we pass an example like "How are you <end>", which is processed word by word. One important note: the encoder returns its final hidden state (h3) and cell state (c3), which are used as the initial state of the decoder.
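For intuition, a minimal Keras sketch of such an encoder might look like the following (the layer sizes here are illustrative; the full, trainable model is built later in this article):

from keras.layers import Input, LSTM, Embedding

vocab_size, embed_dim, latent_dim = 10000, 128, 256          # illustrative sizes
encoder_inputs = Input(shape=(None,))                        # sequence of word indices
enc_emb = Embedding(vocab_size, embed_dim)(encoder_inputs)   # word embeddings
# return_state=True also returns the final hidden state (h) and cell state (c)
encoder_outputs, state_h, state_c = LSTM(latent_dim, return_state=True)(enc_emb)
encoder_states = [state_h, state_c]                          # used to initialize the decoder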
Decoder:
The decoder is also an LSTM network; it reads the entire target sequence word by word and predicts the same sequence offset by one timestep. The decoder is trained to predict the next word in the sequence given the previous word. Let's go through the figure for a better understanding of the decoder flow:
Source of the image – link
<Go> and <end> are special tokens that are added to the target sequence before feeding it into the decoder. The target sequence is unknown while decoding a test sequence, so we start predicting the target sequence by passing in the first word, which is always the <Go> token (or any start token you specify). The <end> token signals the end of the sentence to the model.
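For instance, with a target sentence like "I love cricket" (used again later in this article), the decoder is fed the sequence starting with <Go> and is trained to predict the same sequence shifted by one step. A plain illustration of this offset:

# Decoder input at each timestep and the word it is trained to predict
decoder_input  = ['<Go>', 'I', 'love', 'cricket']
decoder_target = ['I', 'love', 'cricket', '<end>']
for inp, tgt in zip(decoder_input, decoder_target):
    print(f'{inp:>8} -> {tgt}')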
Inference Phase:
After training our encoder-decoder (seq2seq) model, it is tested on new, unseen input sequences for which the target sequence is unknown. So, in order to get a prediction for a given sentence, we need to set up an inference architecture to decode a test sequence:
So how does it work, then?
Here are the steps for decoding a test sequence (a condensed code sketch follows this list):
- First, encode the entire input sequence and initialize the decoder with the internal states of the encoder.
- Pass the <Go> token (or the token you specified) as the input to the decoder.
- Run the decoder for one timestep with the internal states.
- The output will be a probability distribution over the next word; the word with the maximum probability is selected.
- Pass that maximum-probability word as the input to the decoder in the next timestep and update the internal states with those of the current timestep.
- Repeat steps 3-5 until the <end> token is generated or the maximum length of the target sequence is reached.
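In condensed form, the loop above looks roughly like this. It is a sketch that mirrors the decode_sequence function implemented later in the article; the function name and parameters (start_id, end_id, max_len) are illustrative, not part of the original code:

import numpy as np

def greedy_decode(encoder_model, decoder_model, input_seq, start_id, end_id, max_len):
    states = encoder_model.predict(input_seq)          # step 1: encode and grab the internal states
    token = start_id                                   # step 2: start with the <Go> token
    decoded = []
    for _ in range(max_len):                           # stop at the maximum target length
        probs, h, c = decoder_model.predict([np.array([[token]])] + states)  # step 3: one decoder step
        token = int(np.argmax(probs[0, -1, :]))        # step 4: pick the most probable word
        if token == end_id:                            # stop once <end> is generated
            break
        decoded.append(token)
        states = [h, c]                                # step 5: carry the updated states forward
    return decoded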
Let's go through a figure where we give an input and show how the inference process works, i.e., how a sentence is encoded and decoded:
So why does the seq2seq model fail, then?
We have understood the architecture and flow of the encoder-decoder model; it is very useful, but it has some limitations. As we saw, the encoder takes the input and compresses it into a fixed-size vector, and the decoder then makes predictions to produce the output sequence. This works fine for short sequences, but it fails for long ones because it becomes difficult for the encoder to memorize the entire sequence, i.e., to compress all of its contextual information into a single fixed-size vector. As a result, we observe that model performance starts to degrade as the sequence length increases.
So how to overcome this problem?
Attention:
To overcome this problem, we introduce the attention mechanism. The idea is to give importance to specific parts of the input sequence, instead of the entire sequence, when predicting a word. Basically, with attention we don't throw away the intermediate encoder states; we use all of them to generate the context vector from which the decoder produces the output.
Let's go through an example to understand how the attention mechanism works:
- Source sequence: "Which sport do you like the most?"
- Target sequence: “I love cricket”
The idea of attention is to utilize all the contextual information from the input sequence so that we can decode the target sequence. Let's look at the example we mentioned. The first word 'I' in the target sequence is connected to the fourth word 'you' in the source sequence, right? Similarly, the second word 'love' in the target sequence is associated with the fifth word 'like' in the source sequence. We can observe that we are paying more attention to contextual information like 'sport', 'like', and 'you' in the source sequence.
So, instead of going through the entire sequence, the model pays attention to specific words of the sequence and produces its result based on them.
Source: link
How does it work:
We understood how attention works in theory; now it's time to understand it technically. We will stick to our example "Which sport do you like the most?".
Step by step:
- Computing score of each encoder state:
We train a small feed-forward scoring network jointly with the encoder-decoder, using all the encoder states (recall that in the basic model we initialize the decoder only with the encoder's final state). By training this network, we generate high scores for the states that deserve attention, while states with low scores are effectively ignored.
Let s1, s2, s3, s4, s5, s6, and s7 be the scores generated for the states h1, h2, h3, h4, h5, h6, and h7 respectively. Since we assumed that we need to pay more attention to the states h2, h4, and h5 and ignore h1, h3, h6, and h7 in order to predict "I", we expect this network to generate scores such that s2, s4, and s5 are high while s1, s3, s6, and s7 are relatively low.
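As a rough illustration (not the article's trained network), here is how a Bahdanau-style feed-forward score could be computed with NumPy on toy values; h_states stands for the seven encoder states h1-h7 and s_prev for the previous decoder state, and all names and sizes here are made up for the sketch:

import numpy as np

np.random.seed(0)
hidden_size = 4                               # toy dimension for illustration
h_states = np.random.randn(7, hidden_size)    # h1 ... h7 (one row per source word)
s_prev = np.random.randn(hidden_size)         # previous decoder state

# Bahdanau-style scoring: s_i = v^T tanh(W1 h_i + W2 s_prev)
W1 = np.random.randn(hidden_size, hidden_size)
W2 = np.random.randn(hidden_size, hidden_size)
v = np.random.randn(hidden_size)

scores = np.tanh(h_states @ W1 + s_prev @ W2) @ v   # shape (7,), i.e. s1 ... s7
print(scores)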
- Computing attention weights:
After generating the scores, we apply softmax to them to obtain the attention weights w1, w2, w3, w4, w5, w6, and w7. Since softmax produces a probability distribution, every weight lies between 0 and 1 and all the weights sum to 1.
For example:
w1 = 0.23, w2 = 0.11, w3 = 0.13, and so on, with all the weights summing to 1.
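Continuing the toy sketch from above, applying softmax to those scores gives the weights:

# Softmax over the scores: all weights lie in (0, 1) and sum to 1
weights = np.exp(scores) / np.sum(np.exp(scores))   # w1 ... w7
print(weights, weights.sum())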
- Computing context vector:
After computing the attention weights, we calculate the context vector, which the decoder will use to predict the next word in the sequence.
ContextVector = w1 * h1 + w2 * h2 + w3 * h3 + w4 * h4 + w5 * h5 + w6 * h6 + w7 * h7
Clearly, if the values of w2, w4, and w5 are high and those of w1, w3, w6, and w7 are low, then the context vector will contain more information from the states h2, h4, and h5 and relatively less information from the states h1, h3, h6, and h7.
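In the toy sketch, the context vector is exactly this weighted sum:

# Context vector: weighted sum of the encoder hidden states
context_vector = np.sum(weights[:, np.newaxis] * h_states, axis=0)   # shape (hidden_size,)
print(context_vector)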
- Combining the context vector with the output of the previous timestep:
Here we simply combine the context vector with the embedding of the <Go> token, since at the first timestep there is no output from a previous timestep.
After this, the decoder generates the output word for that timestep, and in the same way we get a prediction for every word in the sequence. Once the decoder outputs <end>, we stop generating words.
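In the toy sketch, this combination is a simple concatenation of the context vector with the embedding of the previous output token (the <Go> token at the first timestep); the result is what the decoder cell consumes. The go_embedding here is again a made-up toy value:

go_embedding = np.random.randn(hidden_size)                           # toy embedding of the <Go> token
decoder_step_input = np.concatenate([context_vector, go_embedding])   # fed to the decoder LSTM cell
print(decoder_step_input.shape)                                       # (2 * hidden_size,)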
Note:
- Unlike the basic seq2seq model, where a single fixed-size vector is used for every decoder timestep, with the attention mechanism we generate a new context vector at every timestep.
- Thanks to the attention mechanism, the performance of the model improves and we observe better results.
There are two different ways of applying the attention mechanism; we will not go into depth here. The two classes differ in how much of the input sequence is used to compute the context vector:
- Global Attention
- Local Attention
Let’s dive into these.
Global Attention:
Here, the attention is placed on all the source positions. In other words, all the hidden states of the encoder are considered for deriving the attended context vector.
Local Attention:
Here, the attention is placed on only a few source positions. Only a few hidden states of the encoder are considered for deriving the attended context vector.
Importing the necessary packages:

#Packages
import pandas as pd
import re
import string
from string import digits
import numpy as np
from sklearn.utils import shuffle
from keras.layers import Input, LSTM, Embedding, Dense, Dropout, TimeDistributed
from keras.models import Model
Reading Data
data = pd.read_csv('./mar.txt', sep='\t', names=['eng','mar'])
data.head()

#Output
    eng     mar
0   Go.     जा.
1   Run!    पळ!
2   Run!    धाव!
3   Run!    पळा!
4   Run!    धावा!
Pre-processing data:
# Lowercase all the characters
data.eng = data.eng.apply(lambda x: x.lower())
data.mar = data.mar.apply(lambda x: x.lower())

# Remove quotes
data.eng = data.eng.apply(lambda x: re.sub("'", "", x))
data.mar = data.mar.apply(lambda x: re.sub("'", "", x))

# Set of punctuation characters to remove
exclude = set(string.punctuation)

# Remove all special characters
data.eng = data.eng.apply(lambda x: "".join(ch for ch in x if ch not in exclude))
data.mar = data.mar.apply(lambda x: "".join(ch for ch in x if ch not in exclude))

# Remove all numbers from the text
remove_digits = str.maketrans('', '', digits)
data.eng = data.eng.apply(lambda x: x.translate(remove_digits))
data.mar = data.mar.apply(lambda x: re.sub("[२३०८१५७९४६]", "", x))

# Remove extra spaces
data.eng = data.eng.apply(lambda x: x.strip())
data.mar = data.mar.apply(lambda x: x.strip())
data.eng = data.eng.apply(lambda x: re.sub(" +", " ", x))
data.mar = data.mar.apply(lambda x: re.sub(" +", " ", x))

# Add start and end tokens to the target sequences
data.mar = data.mar.apply(lambda x: 'START_ ' + x + ' _END')
Building vocabulary:
# Vocabulary of English: store all the words in a set (same for the Marathi vocab)
all_eng_words = set()
for eng in data.eng:
    for word in eng.split():
        if word not in all_eng_words:
            all_eng_words.add(word)

# Vocabulary of Marathi
all_marathi_words = set()
for mar in data.mar:
    for word in mar.split():
        if word not in all_marathi_words:
            all_marathi_words.add(word)

# Max length of the source sequences (defines the input size)
length_list = []
for l in data.eng:
    length_list.append(len(l.split(' ')))
max_source_length = np.max(length_list)
max_source_length
#35

# Max length of the target sequences (defines the output size)
length_list = []
for l in data.mar:
    length_list.append(len(l.split(' ')))
max_target_size = np.max(length_list)
max_target_size
#37

input_words = sorted(list(all_eng_words))
target_words = sorted(list(all_marathi_words))

# Store the vocabulary sizes for the encoder and decoder
num_of_encoder_tokens = len(all_eng_words)
num_of_decoder_tokens = len(all_marathi_words)
print("Encoder token size is {} and decoder token size is {}".format(num_of_encoder_tokens, num_of_decoder_tokens))
#output Encoder token size is 8882 and decoder token size is 14689

# +1 for zero padding
num_of_decoder_tokens += 1
num_of_decoder_tokens
#output 14690

# Dictionary mapping index -> English word
eng_index_to_char_dict = {}
# Dictionary mapping English word -> index
eng_char_to_index_dict = {}
for key, value in enumerate(input_words):
    eng_index_to_char_dict[key] = value
    eng_char_to_index_dict[value] = key

# Similarly for the target, i.e., Marathi words
mar_index_to_char_dict = {}
mar_char_to_index_dict = {}
for key, value in enumerate(target_words):
    mar_index_to_char_dict[key] = value
    mar_char_to_index_dict[value] = key
Splitting data into train and test:
# Splitting our data into train and test parts
from sklearn.model_selection import train_test_split

X, y = data.eng, data.mar
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
print("Training data size is {} and testing data size is {}".format(X_train.shape, X_test.shape))
#output Training data size is (23607,) and testing data size is (10118,)
Training data in batches:
# We will not train on the whole data at once; instead we train in batches
# to reduce computation and improve the learning and performance of the model.
def generate_batch(X = X_train, y = y_train, batch_size = 128):
    while True:
        for j in range(0, len(X), batch_size):
            # Encoder input
            encoder_input_data = np.zeros((batch_size, max_source_length), dtype='float32')
            # Decoder input
            decoder_input_data = np.zeros((batch_size, max_target_size), dtype='float32')
            # Target
            decoder_target_data = np.zeros((batch_size, max_target_size, num_of_decoder_tokens), dtype='float32')
            for i, (input_text, target_text) in enumerate(zip(X[j:j+batch_size], y[j:j+batch_size])):
                for t, word in enumerate(input_text.split()):
                    encoder_input_data[i, t] = eng_char_to_index_dict[word]  # encoder input sequence
                for t, word in enumerate(target_text.split()):
                    if t < len(target_text.split()) - 1:
                        decoder_input_data[i, t] = mar_char_to_index_dict[word]  # decoder input sequence
                    if t > 0:
                        # Decoder target sequence (one-hot encoded);
                        # it does not include the START_ token and is offset by one timestep
                        decoder_target_data[i, t - 1, mar_char_to_index_dict[word]] = 1
            yield([encoder_input_data, decoder_input_data], decoder_target_data)
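To sanity-check the generator, you can pull one batch and inspect the array shapes (a quick check using the variables defined above; the exact sizes depend on your data):

batch_gen = generate_batch(X_train, y_train, batch_size=128)
(enc_in, dec_in), dec_target = next(batch_gen)
print(enc_in.shape)       # (128, max_source_length)
print(dec_in.shape)       # (128, max_target_size)
print(dec_target.shape)   # (128, max_target_size, num_of_decoder_tokens)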
Defining Encoder model:
latent_dim = 256

# Encoder
encoder_inputs = Input(shape=(None,))
enc_emb = Embedding(num_of_encoder_tokens, latent_dim, mask_zero=True)(encoder_inputs)
encoder_lstm = LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder_lstm(enc_emb)
# We discard `encoder_outputs` and only keep the states.
encoder_states = [state_h, state_c]

Defining decoder model:

# Set up the decoder, using `encoder_states` as the initial state.
decoder_inputs = Input(shape=(None,))
dec_emb_layer = Embedding(num_of_decoder_tokens, latent_dim, mask_zero=True)
dec_emb = dec_emb_layer(decoder_inputs)
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
# A Bahdanau-style attention step could be inserted here, roughly:
#   score             = v^T tanh(W1 * encoder_outputs + W2 * decoder_hidden)
#   attention_weights = softmax(score)
#   context_vector    = sum(attention_weights * encoder_outputs)
#   decoder input     = concatenate([context_vector, dec_emb])
# To keep the model runnable and consistent with the inference code below,
# we build the plain encoder-decoder without attention.
decoder_outputs, _, _ = decoder_lstm(dec_emb, initial_state=encoder_states)
decoder_dense = TimeDistributed(Dense(num_of_decoder_tokens, activation='softmax'))
decoder_outputs = decoder_dense(decoder_outputs)
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

Compiling model:

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

train_samples = len(X_train)
val_samples = len(X_test)
batch_size = 64
epochs = 50
Training our model:
history = model.fit_generator(generator=generate_batch(X_train, y_train, batch_size=batch_size),
                              steps_per_epoch=train_samples//batch_size,
                              epochs=epochs,
                              validation_data=generate_batch(X_test, y_test, batch_size=batch_size),
                              validation_steps=val_samples//batch_size)
Saving our model:
model.save_weights('./weights.h5')
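A later snippet loads a full model from './model.h5', while only the weights are saved here. If you want that load to work, you could also save the whole model, for example (a suggestion, not part of the original pipeline):

model.save('./model.h5')  # saves the architecture, weights and optimizer state together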
Inference model:
# Inference model
# Encoder model: takes the input sequence and returns its internal states for the decoder
encoder_model = Model(encoder_inputs, encoder_states)

# The decoder takes the hidden state and cell state as inputs; at every step it predicts
# the next word and returns updated states, which are fed back in at the next step
decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

dec_emb2 = dec_emb_layer(decoder_inputs)  # Get the embeddings of the decoder sequence

# To predict the next word in the sequence, set the initial states to the states from the previous time step
decoder_outputs2, state_h2, state_c2 = decoder_lstm(dec_emb2, initial_state=decoder_states_inputs)
decoder_states2 = [state_h2, state_c2]
decoder_outputs2 = decoder_dense(decoder_outputs2)  # A dense softmax layer to generate a prob. dist. over the target vocabulary

# Final decoder model
decoder_model = Model(
    [decoder_inputs] + decoder_states_inputs,
    [decoder_outputs2] + decoder_states2)
Decoding sequence:
def decode_sequence(input_seq):
    # Encode the input as state vectors.
    states_value = encoder_model.predict(input_seq)
    # Generate empty target sequence of length 1.
    target_seq = np.zeros((1, 1))
    # Populate the first character of target sequence with the start character.
    target_seq[0, 0] = mar_char_to_index_dict['START_']

    # Sampling loop for a batch of sequences
    # (to simplify, here we assume a batch of size 1).
    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value)

        # Sample a token
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_char = mar_index_to_char_dict[sampled_token_index]
        decoded_sentence += ' ' + sampled_char

        # Exit condition: either hit max length or find stop character.
        if (sampled_char == '_END' or len(decoded_sentence) > 50):
            stop_condition = True

        # Update the target sequence (of length 1).
        target_seq = np.zeros((1, 1))
        target_seq[0, 0] = sampled_token_index

        # Update states
        states_value = [h, c]

    return decoded_sentence
Testing on train data:
train_gen = generate_batch(X_train, y_train, batch_size = 1)
k = -1

k += 1
(input_seq, actual_output), _ = next(train_gen)
decoded_sentence = decode_sequence(input_seq)
print('Input English sentence as per data:', X_train[k:k+1].values[0])
print('Actual Marathi Translation as per data:', y_train[k:k+1].values[0][6:-4])
print('Predicted Marathi Translation predicted by model:', decoded_sentence[:-4])

#output
Input English sentence as per data: i want something to drink.
Actual Marathi Translation as per data: मला काहीतरी प्यायला हवंय.
Predicted Marathi Translation predicted by model: मला काहीतरी प्यायला हवंय.
Testing on test data:
val_gen = generate_batch(X_test, y_test, batch_size = 1)
k = -1

k += 1
(input_seq, actual_output), _ = next(val_gen)
decoded_sentence = decode_sequence(input_seq)
print('Input English sentence as per data:', X_test[k:k+1].values[0])
print('Actual Marathi Translation as per data:', y_test[k:k+1].values[0][6:-4])
print('Predicted Marathi Translation predicted by model:', decoded_sentence[:-4])

#output
Input English sentence as per data: i dont speak ukrainian.
Actual Marathi Translation as per data: मी युक्रेनियन बोलत नाही.
Predicted Marathi Translation predicted by model: मी युक्रेनियन भाषा बोलत नाही.

For an unseen query: load the model and its weights, and compile it

from keras.models import load_model

model = load_model('./model.h5')
model.load_weights('./weights.h5')
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Pre-processing step (mirrors the cleaning applied to the training data)
from string import digits

def pre_processing(sentence):
    sentence = sentence.lower()
    sentence = re.sub("'", "", sentence)
    sentence = "".join(ch for ch in sentence if ch not in exclude)
    remove_digits = str.maketrans('', '', digits)
    sentence = sentence.translate(remove_digits)
    sentence = re.sub(" +", " ", sentence).strip()
    return sentence
Setting the inference model:

The inference encoder and decoder models are the same as the ones defined in the "Inference model" section above, so we only need a decoding function for the unseen query:

# Decoding the unseen query
def decode_sequence(input_seq):
    # Encode the input as state vectors.
    states_value = encoder_model.predict(input_seq)
    # Generate an empty target sequence of length 1 and seed it with the start token.
    target_seq = np.zeros((1, 1))
    target_seq[0, 0] = mar_char_to_index_dict['START_']

    decoded_sentence = ''
    while True:
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value)

        # Sample a token
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_char = mar_index_to_char_dict[sampled_token_index]
        # Stop on the end token (or once the output gets unreasonably long)
        if sampled_char == '_END' or len(decoded_sentence.split()) > max_target_size:
            break
        decoded_sentence += ' ' + sampled_char

        # Update the target sequence (of length 1) and the states
        target_seq = np.zeros((1, 1))
        target_seq[0, 0] = sampled_token_index
        states_value = [h, c]

    return decoded_sentence
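The pre_processing function above returns a cleaned sentence, but the encoder expects an integer array. A minimal sketch of that missing step, assuming the dictionaries and decode_sequence defined above; the helper name translate_query is mine, and out-of-vocabulary words are simply skipped:

def translate_query(sentence):
    cleaned = pre_processing(sentence)
    encoder_input_data = np.zeros((1, max_source_length), dtype='float32')
    for t, word in enumerate(cleaned.split()):
        if word in eng_char_to_index_dict:        # skip words not seen during training
            encoder_input_data[0, t] = eng_char_to_index_dict[word]
    return decode_sequence(encoder_input_data)

print(translate_query('How are you?'))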
Conclusion:
- We understood the concept behind the seq2seq model, how it works, and its limitations.
- We also understood how the attention mechanism solves the problem of the seq2seq model: by using attention, the model is capable of finding the mapping between the input sequence and the output sequence.
- The performance of the model can be improved by increasing the training dataset and using a bidirectional LSTM for a better context vector.
- Use the beam search strategy for decoding the test sequence instead of using the greedy approach (argmax)
- Evaluate the performance of your model based on the BLEU score
- One drawback of attention is that it is time-consuming. To overcome this problem, Google introduced the Transformer model, which we will cover in upcoming blogs.