How to implement seq2seq with Keras
Why do you need to read this?
Preprocessing data for Seq2Seq takes time, but apart from the reshaping step it can be treated almost as a template. Here I will walk through the complete data-preparation pipeline for a seq2seq model with Keras. Let's get started!
Contents
- Define Seq2Seq Architecture
- Text Cleaning
- Put <BOS> and <EOS> tags for decoder input
- Make Vocabulary (VOCAB_SIZE)
- Tokenize Bag of words to Bag of IDs
- Padding (MAX_LEN)
- Word Embedding (EMBEDDING_DIM)
- Reshape the Data depending on neural network shapes
- Split Data for training, validation, and testing
- Conclusion
1. Define Seq2Seq Architecture
What is a Seq2Seq text generation model? Seq2Seq is a type of encoder-decoder model built on RNNs. It can be used as a model for machine interaction and machine translation.
- Seq2Seq using LSTM:
```python
from keras.models import Model
from keras.layers import Input, LSTM, Dense, TimeDistributed

# Seq2Seq by using LSTM
# embed_layer is the pretrained embedding layer built in step 7 (there called
# embedding_layer); MAX_LEN and VOCAB_SIZE also come from the steps below.
def seq2seq_model_builder(HIDDEN_DIM=300):

    encoder_inputs = Input(shape=(MAX_LEN, ), dtype='int32',)
    encoder_embedding = embed_layer(encoder_inputs)
    encoder_LSTM = LSTM(HIDDEN_DIM, return_state=True)
    encoder_outputs, state_h, state_c = encoder_LSTM(encoder_embedding)

    decoder_inputs = Input(shape=(MAX_LEN, ), dtype='int32',)
    decoder_embedding = embed_layer(decoder_inputs)
    decoder_LSTM = LSTM(HIDDEN_DIM, return_state=True, return_sequences=True)
    decoder_outputs, _, _ = decoder_LSTM(decoder_embedding, initial_state=[state_h, state_c])

    # dense_layer = Dense(VOCAB_SIZE, activation='softmax')
    outputs = TimeDistributed(Dense(VOCAB_SIZE, activation='softmax'))(decoder_outputs)
    model = Model([encoder_inputs, decoder_inputs], outputs)

    return model

model = seq2seq_model_builder(HIDDEN_DIM=300)
model.summary()
```
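The compile step is not shown in this article; since the decoder targets are one-hot vectors over the vocabulary (see step 8), a minimal sketch could look like this (the choice of the rmsprop optimizer is my assumption, not something the article fixes):
```python
# One-hot decoder targets (step 8) pair naturally with categorical cross-entropy.
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
```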
For training our seq2seq model, we will use the Cornell Movie-Dialogs Corpus, which contains 220,579 conversational exchanges between 10,292 pairs of movie characters, involving 9,035 characters from 617 movies. Here is one of the conversations from the dataset:
```md
Mike:
"Drink up, Charley. We're ahead of you."
Charley:
"I'm not thirsty."
```
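The loading code itself is not part of this article; as a rough sketch (my assumption), the raw utterances can be read from the corpus' movie_lines.txt, whose fields are separated by " +++$+++ " with the utterance text in the last field:
```python
# Rough sketch of loading raw utterances from the Cornell corpus.
# The file is not UTF-8 encoded, so undecodable bytes are simply ignored here.
def load_lines(path='movie_lines.txt'):
    texts = []
    with open(path, encoding='utf-8', errors='ignore') as f:
        for line in f:
            parts = line.strip().split(' +++$+++ ')
            if len(parts) == 5:
                texts.append(parts[4])
    return texts

texts = load_lines()
```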
2. Text Cleaning
I always use my own function below to clean text for Seq2Seq:
- Text Cleaning Function:
```python
import re

def clean_text(text):
    '''Clean text by removing unnecessary characters and altering the format of words.'''
    text = text.lower()
    text = re.sub(r"i'm", "i am", text)
    text = re.sub(r"he's", "he is", text)
    text = re.sub(r"she's", "she is", text)
    text = re.sub(r"it's", "it is", text)
    text = re.sub(r"that's", "that is", text)
    text = re.sub(r"what's", "what is", text)
    text = re.sub(r"where's", "where is", text)
    text = re.sub(r"how's", "how is", text)
    text = re.sub(r"\'ll", " will", text)
    text = re.sub(r"\'ve", " have", text)
    text = re.sub(r"\'re", " are", text)
    text = re.sub(r"\'d", " would", text)
    text = re.sub(r"won't", "will not", text)
    text = re.sub(r"can't", "cannot", text)
    text = re.sub(r"n't", " not", text)
    text = re.sub(r"n'", "ng", text)
    text = re.sub(r"'bout", "about", text)
    text = re.sub(r"'til", "until", text)
    text = re.sub(r"[-()\"#/@;:<>{}`+=~|.!?,]", "", text)
    return text

cleaned_text = []
for text in texts:
    cleaned_text.append(clean_text(text))

cleaned_text
```
- Input:
```md
# encoder input text data
["Drink up, Charley. We're ahead of you.",
 'Did you change your hair?',
 'I believe I have found a faster way.']
```
- Output:
```md
# cleaned encoder input text data
['drink up charley we are ahead of you',
 'did you change your hair',
 'i believe i have found a faster way']
```
3. Put <BOS> and <EOS> tags for decoder input
- Tag Creator Function:
```python
def tagger(decoder_input_sentence):

    bos = "<BOS> "
    eos = " <EOS>"
    final_target = [bos + text + eos for text in decoder_input_sentence]

    return final_target

decoder_inputs = tagger(decoder_input_text)
```
- Input:
```md
# decoder input text data
[['with the teeth of your zipper',
'so they tell me',
'so which dakota you from'],,,,]
```
- Output:
```md
# decoder input text data
[['<BOS> with the teeth of your zipper <EOS>',
'<BOS> so they tell me <EOS>',
'<BOS> so which dakota you from <EOS>'],,,,]
```
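One caveat worth adding (mine, not from the original article): by default Keras' Tokenizer filters out the characters < and >, which would turn "<BOS>" into "bos" in the next step. If you want to keep the tags intact as vocabulary entries, the filter string can be adjusted, for example:
```python
from keras.preprocessing.text import Tokenizer

# Default filters include '<' and '>'; removing them preserves '<BOS>'/'<EOS>'
# (they are still lowercased to '<bos>'/'<eos>' unless lower=False is set).
tokenizer = Tokenizer(num_words=VOCAB_SIZE,
                      filters='!"#$%&()*+,-./:;=?@[\\]^_`{|}~\t\n')
```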
4. Make Vocabulary (VOCAB_SIZE)
- Make Vocabulary Function:
```python
from keras.preprocessing.text import Tokenizer
def vocab_creater(text_lists, VOCAB_SIZE):

    tokenizer = Tokenizer(num_words=VOCAB_SIZE)
    tokenizer.fit_on_texts(text_lists)
    dictionary = tokenizer.word_index

    word2idx = {}
    idx2word = {}
    for k, v in dictionary.items():
        if v < VOCAB_SIZE:
            word2idx[k] = v
            idx2word[v] = k
        else:
            continue

    return word2idx, idx2word

word2idx, idx2word = vocab_creater(text_lists=encoder_input_text+decoder_input_text, VOCAB_SIZE=14999)
```
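The function returns two lookup dictionaries. The exact indices depend on word frequencies in the corpus, so they are not listed here; a quick hedged check of what comes back could be (in practice it also helps to keep the fitted Tokenizer itself, since the same word-to-id mapping is needed again in the next step):
```python
# word2idx maps tokens to integer ids, idx2word is the reverse lookup.
print(len(word2idx))   # at most VOCAB_SIZE - 1 entries
print(idx2word[1])     # index 1 is the most frequent token in the corpus
```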
5. Tokenize Bag of words to Bag of IDs
- Tokenize Function:
```python
from keras.preprocessing.text import Tokenizer

VOCAB_SIZE = 14999

def text2seq(encoder_text, decoder_text, VOCAB_SIZE):

    tokenizer = Tokenizer(num_words=VOCAB_SIZE)
    # Fit on the same texts as in step 4 so the word-to-id mapping matches
    tokenizer.fit_on_texts(encoder_text + decoder_text)
    encoder_sequences = tokenizer.texts_to_sequences(encoder_text)
    decoder_sequences = tokenizer.texts_to_sequences(decoder_text)

    return encoder_sequences, decoder_sequences

encoder_sequences, decoder_sequences = text2seq(encoder_text, decoder_text, VOCAB_SIZE)
```
- Input:
```md
# Cleaned texts
[['with the teeth of your zipper',
'so they tell me',
'so which dakota you from'],,,,]
```
- Output:
```md
# decoder input text data
[[10, 27, 8, 4, 27, 1107, 802],
[3, 5, 186, 168],
[662, 4, 22, 346, 6, 130, 3, 5, 2407],,,,,]
```
6. Padding (MAX_LEN)
- Padding Function:
```python
from keras.preprocessing.sequence import pad_sequences
def padding(encoder_sequences, decoder_sequences, MAX_LEN):

    encoder_input_data = pad_sequences(encoder_sequences, maxlen=MAX_LEN, dtype='int32', padding='post', truncating='post')
    decoder_input_data = pad_sequences(decoder_sequences, maxlen=MAX_LEN, dtype='int32', padding='post', truncating='post')

    return encoder_input_data, decoder_input_data

encoder_input_data, decoder_input_data = padding(encoder_sequences, decoder_sequences, MAX_LEN)
```
- Input:
```md
# decoder input text data
[[10, 27, 8, 4, 27, 1107, 802],
 [3, 5, 186, 168],
 [662, 4, 22, 346, 6, 130, 3, 5, 2407],,,,,]
```
- Output:
```md
# MAX_LEN = 10
# decoder input text data
array([[10, 27, 8, 4, 27, 1107, 802, 0, 0, 0],
       [3, 5, 186, 168, 0, 0, 0, 0, 0, 0],
       [662, 4, 22, 346, 6, 130, 3, 5, 2407, 0],,,,,]
```
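How MAX_LEN is chosen is not specified in the article; one common heuristic (an assumption on my part) is to look at the length distribution of the tokenized sequences and pick a cutoff that covers most of them:
```python
import numpy as np

# Inspect sequence lengths to pick a MAX_LEN that covers most examples
# without padding everything to the length of the longest outlier.
lengths = [len(seq) for seq in encoder_sequences + decoder_sequences]
print(np.percentile(lengths, 95))   # e.g. take the 95th percentile as MAX_LEN
```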
7. Word Embedding (EMBEDDING_DIM)
We use pretrained GloVe word vectors. We can create the embedding layer from GloVe in 3 steps:
- Load the GloVe file (glove.6B.100d.txt)
- Create Embedding Matrix from our Vocabulary
- Create Embedding Layer
Let’s take a look!
- Word Embedding Function:
```python
import os
import numpy as np
from keras.layers import Embedding

GLOVE_DIR = ...  # path to the directory containing glove.6B.100d.txt

def glove_100d_dictionary(GLOVE_DIR):
    embeddings_index = {}
    f = open(os.path.join(GLOVE_DIR, 'glove.6B.100d.txt'))
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
    f.close()
    return embeddings_index

# this time: embedding_dimention = 100
def embedding_matrix_creater(embedding_dimention):
    embedding_matrix = np.zeros((len(word_index) + 1, embedding_dimention))
    for word, i in word_index.items():
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            # words not found in the embedding index will be all-zeros.
            embedding_matrix[i] = embedding_vector
    return embedding_matrix

def embedding_layer_creater(VOCAB_SIZE, EMBEDDING_DIM, MAX_LEN, embedding_matrix):
    embedding_layer = Embedding(input_dim=VOCAB_SIZE,
                                output_dim=EMBEDDING_DIM,
                                input_length=MAX_LEN,
                                weights=[embedding_matrix],
                                trainable=False)
    return embedding_layer

embedding_layer = embedding_layer_creater(VOCAB_SIZE, EMBEDDING_DIM, MAX_LEN, embedding_matrix)
```
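The article lists the three helpers but not how they are wired together; a hedged sketch of the full sequence (word_index and EMBEDDING_DIM are my assumptions based on the snippets above, and the first dimension of embedding_matrix must match the input_dim given to the Embedding layer):
```python
# Chain the three helpers: read GloVe vectors, build the weight matrix
# for our vocabulary, then wrap it in a frozen Keras Embedding layer.
EMBEDDING_DIM = 100
word_index = word2idx                                # word-to-id mapping from step 4
embeddings_index = glove_100d_dictionary(GLOVE_DIR)
embedding_matrix = embedding_matrix_creater(EMBEDDING_DIM)
embedding_layer = embedding_layer_creater(VOCAB_SIZE, EMBEDDING_DIM, MAX_LEN, embedding_matrix)
```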
8. Reshape the Data depending on neural network shapes
- Reshape Decoder Output Data Function:
```python
import numpy as np
# MAX_LEN = 20
# num_samples = len(encoder_sequences)
# VOCAB_SIZE = 15000
def decoder_output_creater(decoder_input_data, num_samples, MAX_LEN, VOCAB_SIZE):

    decoder_output_data = np.zeros((num_samples, MAX_LEN, VOCAB_SIZE), dtype="float32")

    for i, seqs in enumerate(decoder_input_data):
        for j, seq in enumerate(seqs):
            if j > 0:
                # Shift by one timestep: the token fed to the decoder at step j
                # becomes the one-hot target for step j-1 (teacher forcing).
                decoder_output_data[i][j-1][seq] = 1.

    print(decoder_output_data.shape)
    return decoder_output_data

decoder_output_data = decoder_output_creater(decoder_input_data, num_samples, MAX_LEN, VOCAB_SIZE)
```
- Input:
```md
# MAX_LEN = 10
# decoder input text data
array([[10, 27, 8, 4, 27, 1107, 802, 0, 0, 0],
       [3, 5, 186, 168, 0, 0, 0, 0, 0, 0],
       [662, 4, 22, 346, 6, 130, 3, 5, 2407, 0],,,,,]
```
- Output:
```md
# output.shape = (num_samples, MAX_LEN, VOCAB_SIZE)
# decoder_output_data.shape = (15000, 10, 15000)
array([[[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 1., ..., 0., 0., 0.],
        ...,
        [1., 0., 0., ..., 0., 0., 0.],
        [1., 0., 0., ..., 0., 0., 0.],
        [1., 0., 0., ..., 0., 0., 0.]],
       ...,
      ], dtype=float32)
```
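A practical caveat (mine, not from the article): the dense one-hot target array gets very large, roughly 9 GB of RAM for the shape shown above, so for the full corpus you may prefer to build it batch by batch:
```python
# float32 one-hot targets: num_samples * MAX_LEN * VOCAB_SIZE * 4 bytes
approx_bytes = 15000 * 10 * 15000 * 4
print(approx_bytes / 1e9)  # ~9.0 GB for the shape shown above
```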
9. Split Data for training, validation, and testing
- Split Data Function:
```python
from sklearn.model_selection import train_test_split

def data_spliter(encoder_input_data, decoder_input_data, test_size1=0.2, test_size2=0.3):

    en_train, en_test, de_train, de_test = train_test_split(encoder_input_data, decoder_input_data, test_size=test_size1)
    en_train, en_val, de_train, de_val = train_test_split(en_train, de_train, test_size=test_size2)

    return en_train, en_val, en_test, de_train, de_val, de_test

en_train, en_val, en_test, de_train, de_val, de_test = data_spliter(encoder_input_data, decoder_input_data)
```
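One thing the split above does not cover is decoder_output_data from step 8, which has to be split in exactly the same row order as the inputs. A hedged sketch of how all three arrays could be split in one call and then fed to the (already compiled) model from step 1; the batch size and epoch count are my assumptions:
```python
from sklearn.model_selection import train_test_split

# Split all three arrays together so the rows stay aligned.
(en_train, en_test,
 de_train, de_test,
 de_out_train, de_out_test) = train_test_split(
    encoder_input_data, decoder_input_data, decoder_output_data, test_size=0.2)

model.fit([en_train, de_train], de_out_train,
          batch_size=64, epochs=30,
          validation_split=0.1)
```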
10. Conclusion
We have quickly walked through text preprocessing for seq2seq. I hope it will be useful for you as well.