How to implement seq2seq with Keras

6 minute read

keras

## Why do you need to read this?

Preprocessing for Seq2Seq takes time, but most of it can be treated almost as a template, except for the reshaping part! So here I will walk through a complete data preparation guide for seq2seq with Keras. Let's get started!

## Contents

  1. Define Seq2Seq Architecture
  2. Text Cleaning
  3. [Put `<BOS>` tag and `<EOS>` tag for decoder input](#FUNCTIONS)
  4. Make Vocabulary (VOCAB_SIZE)
  5. Tokenize Bag of words to Bag of IDs
  6. Padding (MAX_LEN)
  7. Word Embedding (EMBEDDING_DIM)
  8. Reshape the Data to Match the Network's Shapes
  9. Split the Data into Training, Validation, and Test Sets
  10. Conclusion

## 1. Define Seq2Seq Architecture

What is a Seq2Seq text generation model? Seq2Seq is a type of encoder-decoder model built on RNNs. It can be used for machine interaction (e.g. chatbots) and machine translation.

- Seq2Seq by using LSTM:

```python
from keras.models import Model
from keras.layers import Input, LSTM, Dense, TimeDistributed

# embed_layer, MAX_LEN and VOCAB_SIZE come from the preprocessing steps below
def seq2seq_model_builder(HIDDEN_DIM=300):

    encoder_inputs = Input(shape=(MAX_LEN,), dtype='int32')
    encoder_embedding = embed_layer(encoder_inputs)
    encoder_LSTM = LSTM(HIDDEN_DIM, return_state=True)
    encoder_outputs, state_h, state_c = encoder_LSTM(encoder_embedding)

    decoder_inputs = Input(shape=(MAX_LEN,), dtype='int32')
    decoder_embedding = embed_layer(decoder_inputs)
    decoder_LSTM = LSTM(HIDDEN_DIM, return_state=True, return_sequences=True)
    decoder_outputs, _, _ = decoder_LSTM(decoder_embedding, initial_state=[state_h, state_c])

    # softmax over the vocabulary at every decoder time step
    outputs = TimeDistributed(Dense(VOCAB_SIZE, activation='softmax'))(decoder_outputs)
    model = Model([encoder_inputs, decoder_inputs], outputs)

    return model

model = seq2seq_model_builder(HIDDEN_DIM=300)
model.summary()
```

For training our seq2seq model, we will use the Cornell Movie-Dialogs Corpus, which contains 220,579 conversational exchanges between 10,292 pairs of movie characters, involving 9,035 characters from 617 movies. Here is one of the conversations from the dataset:

```md
Mike:
"Drink up, Charley. We're ahead of you."

Charley:
"I'm not thirsty."
```
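
The rest of the post assumes you already have `encoder_input_text` and `decoder_input_text` lists. If you are starting from the raw corpus, here is a minimal sketch of how they could be built, assuming the standard `movie_lines.txt` / `movie_conversations.txt` files with the ` +++$+++ ` field separator (the file names and parsing details are my assumptions, so check them against your download):

```python
import ast

# sketch: build (encoder, decoder) text pairs from the raw corpus files
# assumes the usual "field1 +++$+++ field2 +++$+++ ..." layout of the dataset
lines = open('movie_lines.txt', encoding='utf-8', errors='ignore').read().split('\n')
convs = open('movie_conversations.txt', encoding='utf-8', errors='ignore').read().split('\n')

id2line = {}
for line in lines:
    parts = line.split(' +++$+++ ')
    if len(parts) == 5:
        id2line[parts[0]] = parts[4]                  # line id -> utterance text

encoder_input_text, decoder_input_text = [], []
for conv in convs:
    parts = conv.split(' +++$+++ ')
    if len(parts) == 4:
        ids = ast.literal_eval(parts[3])              # e.g. "['L194', 'L195', 'L196']"
        for a, b in zip(ids[:-1], ids[1:]):
            if a in id2line and b in id2line:
                encoder_input_text.append(id2line[a]) # previous utterance (question)
                decoder_input_text.append(id2line[b]) # next utterance (answer)
```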

## 2. Text Cleaning

I always use my own function below to clean text for Seq2Seq:

- Text Cleaning Function:

```python
import re

def clean_text(text):
    '''Clean text by removing unnecessary characters and altering the format of words.'''

    text = text.lower()

    text = re.sub(r"i'm", "i am", text)
    text = re.sub(r"he's", "he is", text)
    text = re.sub(r"she's", "she is", text)
    text = re.sub(r"it's", "it is", text)
    text = re.sub(r"that's", "that is", text)
    text = re.sub(r"what's", "what is", text)
    text = re.sub(r"where's", "where is", text)
    text = re.sub(r"how's", "how is", text)
    text = re.sub(r"\'ll", " will", text)
    text = re.sub(r"\'ve", " have", text)
    text = re.sub(r"\'re", " are", text)
    text = re.sub(r"\'d", " would", text)
    text = re.sub(r"won't", "will not", text)
    text = re.sub(r"can't", "cannot", text)
    text = re.sub(r"n't", " not", text)
    text = re.sub(r"n'", "ng", text)
    text = re.sub(r"'bout", "about", text)
    text = re.sub(r"'til", "until", text)
    text = re.sub(r"[-()\"#/@;:<>{}`+=~|.!?,]", "", text)

    return text

cleaned_text = []
for text in texts:
    cleaned_text.append(clean_text(text))
```
    
- Input:
	```md
	# encoder input text data
	["Drink up, Charley. We're ahead of you.",
	 'Did you change your hair?',
	 'I believe I have found a faster way.']
	```

- Output:
	```md
	# encoder input text data
	['drink up charley we are ahead of you',
	 'did you change your hair',
	 'i believe i have found a faster way']
	```
    

## <a name="FUNCTIONS"></a>3. Put `<BOS>` tag and `<EOS>` tag for decoder input

- Tag Creator Function:

```python
def tagger(decoder_input_sentence):
    bos = "<BOS> "
    eos = " <EOS>"
    final_target = [bos + text + eos for text in decoder_input_sentence]
    return final_target

decoder_inputs = tagger(decoder_input_text)
```


- Input:    
	```md
	# decoder input text data
  [['with the teeth of your zipper',
    'so they tell me',
    'so  which dakota you from'],,,,]
	```

- Output:
	```md
	# decoder input text data
  [['<BOS> with the teeth of your zipper <EOS>',
    '<BOS> so they tell me <EOS>',
    '<BOS> so  which dakota you from <EOS>'],,,,]
	```

## <a name="IO" ></a>4. Make Vocabulary (VOCAB_SIZE)

- Make Vocabulary Function:
```python
from keras.preprocessing.text import Tokenizer

def vocab_creater(text_lists, VOCAB_SIZE):

  tokenizer = Tokenizer(num_words=VOCAB_SIZE)
  tokenizer.fit_on_texts(text_lists)
  dictionary = tokenizer.word_index

  word2idx = {}
  idx2word = {}
  for k, v in dictionary.items():
      if v < VOCAB_SIZE:
          word2idx[k] = v
          idx2word[v] = k

  return word2idx, idx2word

word2idx, idx2word = vocab_creater(text_lists=encoder_input_text+decoder_input_text, VOCAB_SIZE=14999)
```
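A quick sanity check of what the function returns (the actual ids below depend on word frequencies in your corpus, so treat this as illustrative):

```python
# illustrative checks; concrete values depend on the corpus
print(len(word2idx))            # at most VOCAB_SIZE - 1 words are kept
print(word2idx.get('you'))      # id of a frequent word (1 = most frequent overall)
print(idx2word[1])              # the most frequent word in the corpus
```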

## 5. Tokenize Bag of Words to Bag of IDs

- Tokenize Function:

```python
from keras.preprocessing.text import Tokenizer

VOCAB_SIZE = 14999

def text2seq(encoder_text, decoder_text, VOCAB_SIZE):

  # the tokenizer must be fit on the texts before it can map words to ids
  tokenizer = Tokenizer(num_words=VOCAB_SIZE)
  tokenizer.fit_on_texts(encoder_text + decoder_text)
  encoder_sequences = tokenizer.texts_to_sequences(encoder_text)
  decoder_sequences = tokenizer.texts_to_sequences(decoder_text)

  return encoder_sequences, decoder_sequences

encoder_sequences, decoder_sequences = text2seq(encoder_text, decoder_text, VOCAB_SIZE)
```


- Input:    
	```md
	# Cleaned texts
	[['with the teeth of your zipper',
	  'so they tell me',
	  'so  which dakota you from'],,,,]
	```

- Output:
	```md
	# decoder input text data
	[[10, 27, 8, 4, 27, 1107, 802],
	 [3, 5, 186, 168],
	 [662, 4, 22, 346, 6, 130, 3, 5, 2407],,,,,]
	```


## <a name="DS"></a>6. Padding (MAX_LEN)

- Padding Function:
```python
from keras.preprocessing.sequence import pad_sequences

def padding(encoder_sequences, decoder_sequences, MAX_LEN):

  encoder_input_data = pad_sequences(encoder_sequences, maxlen=MAX_LEN, dtype='int32', padding='post', truncating='post')
  decoder_input_data = pad_sequences(decoder_sequences, maxlen=MAX_LEN, dtype='int32', padding='post', truncating='post')

  return encoder_input_data, decoder_input_data

encoder_input_data, decoder_input_data = padding(encoder_sequences, decoder_sequences, MAX_LEN)
```

- Input:
	```md
	# decoder input text data
	[[10, 27, 8, 4, 27, 1107, 802],
	 [3, 5, 186, 168],
	 [662, 4, 22, 346, 6, 130, 3, 5, 2407],,,,,]
	```

- Output:
	```md
	# MAX_LEN = 10
	# decoder input text data
	array([[10, 27, 8, 4, 27, 1107, 802, 0, 0, 0],
	       [3, 5, 186, 168, 0, 0, 0, 0, 0, 0],
	       [662, 4, 22, 346, 6, 130, 3, 5, 2407, 0],,,,,]
	```
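How should you pick MAX_LEN? One simple option (my suggestion, not part of the original pipeline) is to look at the length distribution of the tokenized sequences and choose a value that covers most sentences without truncating too many:

```python
import numpy as np

# choose MAX_LEN from the data instead of guessing
lengths = [len(seq) for seq in encoder_sequences + decoder_sequences]
print(np.percentile(lengths, 95))  # a MAX_LEN near this value keeps ~95% of sentences untruncated
```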

## 7. Word Embedding (EMBEDDING_DIM)

We use pretrained GloVe word vectors. We can create the embedding layer from GloVe in 3 steps:

  1. Load the GloVe vectors (glove.6B.100d.txt) into a dictionary
  2. Create Embedding Matrix from our Vocabulary
  3. Create Embedding Layer

Let’s take a look!

- Word Embedding Function:

```python
import os
import numpy as np
from keras.layers import Embedding

GLOVE_DIR = 'path to the directory that contains glove.6B.100d.txt'

# step 1: load the GloVe vectors into a dictionary {word: vector}
def glove_100d_dictionary(GLOVE_DIR):
    embeddings_index = {}
    f = open(os.path.join(GLOVE_DIR, 'glove.6B.100d.txt'))
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
    f.close()
    return embeddings_index

# step 2: build the embedding matrix from our vocabulary
# this time: embedding_dimention = 100
def embedding_matrix_creater(embedding_dimention):
    embedding_matrix = np.zeros((len(word_index) + 1, embedding_dimention))
    for word, i in word_index.items():
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            # words not found in the embedding index will be all-zeros
            embedding_matrix[i] = embedding_vector
    return embedding_matrix

# step 3: create the (frozen) embedding layer
def embedding_layer_creater(VOCAB_SIZE, EMBEDDING_DIM, MAX_LEN, embedding_matrix):
    embedding_layer = Embedding(input_dim=VOCAB_SIZE,
                                output_dim=EMBEDDING_DIM,
                                input_length=MAX_LEN,
                                weights=[embedding_matrix],
                                trainable=False)
    return embedding_layer

embedding_layer = embedding_layer_creater(VOCAB_SIZE, EMBEDDING_DIM, MAX_LEN, embedding_matrix)
```

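To wire the three steps together, something like the sketch below should work. Note that `word_index` is the tokenizer's `word_index` from step 4 (called `dictionary` there), and `embed_layer` is the name the model builder in step 1 expects; the matrix's row count also has to match the Embedding layer's `input_dim`:

```python
# sketch: run the three steps and expose the layer under the name used in step 1
EMBEDDING_DIM = 100                                          # glove.6B.100d.txt has 100-d vectors
embeddings_index = glove_100d_dictionary(GLOVE_DIR)
embedding_matrix = embedding_matrix_creater(EMBEDDING_DIM)   # shape (len(word_index) + 1, 100)
# the Embedding layer only accepts the weights if the matrix has exactly VOCAB_SIZE rows,
# so make sure len(word_index) + 1 and VOCAB_SIZE agree (or trim/pad the matrix)
embed_layer = embedding_layer_creater(VOCAB_SIZE, EMBEDDING_DIM, MAX_LEN, embedding_matrix)
```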

## <a name="data"></a>8. Reshape the Data to Match the Network's Shapes

- Reshape Decoder Output Data Function:
```python
import numpy as np
# MAX_LEN = 20
# num_samples = len(encoder_sequences)
# VOCAB_SIZE = 15000

def decoder_output_creater(decoder_input_data, num_samples, MAX_LEN, VOCAB_SIZE):

  decoder_output_data = np.zeros((num_samples, MAX_LEN, VOCAB_SIZE), dtype="float32")

  for i, seqs in enumerate(decoder_input_data):
      for j, seq in enumerate(seqs):
          if j > 0:
              decoder_output_data[i][j][seq] = 1.
  print(decoder_output_data.shape)

  return decoder_output_data

decoder_output_data = decoder_output_creater(decoder_input_data, num_samples, MAX_LEN, VOCAB_SIZE)
```

- Input:
	```md
	# MAX_LEN = 10
	# decoder input text data
	array([[10, 27, 8, 4, 27, 1107, 802, 0, 0, 0],
	       [3, 5, 186, 168, 0, 0, 0, 0, 0, 0],
	       [662, 4, 22, 346, 6, 130, 3, 5, 2407, 0],,,,,]
	```

- Output:
	```md
	# output shape: (num_samples, MAX_LEN, VOCAB_SIZE)
	# decoder_output_data.shape: (15000, 10, 15000)
	array([[[0., 0., 0., ..., 0., 0., 0.],
	        [0., 0., 0., ..., 0., 0., 0.],
	        [0., 0., 1., ..., 0., 0., 0.],
	        ...,
	        [1., 0., 0., ..., 0., 0., 0.],
	        [1., 0., 0., ..., 0., 0., 0.],
	        [1., 0., 0., ..., 0., 0., 0.]],
	       ...], dtype=float32)
	```

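One practical warning: one-hot encoding the decoder targets is memory hungry. A quick back-of-the-envelope check with the shapes printed above:

```python
# float32 = 4 bytes; shape (num_samples, MAX_LEN, VOCAB_SIZE) = (15000, 10, 15000)
print(15000 * 10 * 15000 * 4 / 1e9, 'GB')  # -> 9.0 GB
```

If that does not fit in your RAM, a common workaround is to build the one-hot targets batch by batch inside a data generator instead of all at once.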
## 9. Split the Data into Training, Validation, and Test Sets

- Split Data Function:

```python
from sklearn.model_selection import train_test_split

def data_spliter(encoder_input_data, decoder_input_data, test_size1=0.2, test_size2=0.3):

  en_train, en_test, de_train, de_test = train_test_split(encoder_input_data, decoder_input_data, test_size=test_size1)
  en_train, en_val, de_train, de_val = train_test_split(en_train, de_train, test_size=test_size2)

  return en_train, en_val, en_test, de_train, de_val, de_test

en_train, en_val, en_test, de_train, de_val, de_test = data_spliter(encoder_input_data, decoder_input_data)
```
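
With the model from step 1 and the splits above, training could look roughly like the sketch below. The split function only covers the two input arrays, so `decoder_output_data` from step 8 would have to be split the same way (for example by passing it to `train_test_split` together with the inputs); the `de_out_train` / `de_out_val` names here are my own:

```python
# rough training sketch, assuming decoder_output_data has been split
# into de_out_train / de_out_val alongside the encoder/decoder inputs
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

model.fit([en_train, de_train], de_out_train,
          validation_data=([en_val, de_val], de_out_val),
          batch_size=64,
          epochs=10)
```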

## 10. Conclusion

We have quickly walked through text preprocessing for seq2seq. I hope it will be useful for you as well.

