How to implement seq2seq with Keras
Why do you need to read this?
Preprocessing data for Seq2Seq takes time, but apart from the reshaping step it can be treated almost as a template. Here I will walk through the complete data-preparation pipeline for a seq2seq model with Keras. Let's get started!
Contents
- Define Seq2Seq Architecture
- Text Cleaning
- Put <BOS> and <EOS> tags for decoder input
- Make Vocabulary (VOCAB_SIZE)
- Tokenize Bag of words to Bag of IDs
- Padding (MAX_LEN)
- Word Embedding (EMBEDDING_DIM)
- Reshape the Data depending on neural network shapes
- Split Data for training, validation, and testing
- Conclusion
1. Define Seq2Seq Architecture
What is a Seq2Seq text generation model? Seq2Seq is a type of encoder-decoder model built on RNNs. It can be used as a model for machine interaction and machine translation.
- Seq2Seq using LSTM:
```python
from keras.models import Model
from keras.layers import Input, LSTM, Dense, TimeDistributed

# Seq2Seq by using LSTM
# embed_layer is the pretrained embedding layer built in step 7 (there called
# embedding_layer); MAX_LEN and VOCAB_SIZE also come from the steps below.
def seq2seq_model_builder(HIDDEN_DIM=300):

    encoder_inputs = Input(shape=(MAX_LEN, ), dtype='int32',)
    encoder_embedding = embed_layer(encoder_inputs)
    encoder_LSTM = LSTM(HIDDEN_DIM, return_state=True)
    encoder_outputs, state_h, state_c = encoder_LSTM(encoder_embedding)

    decoder_inputs = Input(shape=(MAX_LEN, ), dtype='int32',)
    decoder_embedding = embed_layer(decoder_inputs)
    decoder_LSTM = LSTM(HIDDEN_DIM, return_state=True, return_sequences=True)
    decoder_outputs, _, _ = decoder_LSTM(decoder_embedding, initial_state=[state_h, state_c])

    # dense_layer = Dense(VOCAB_SIZE, activation='softmax')
    outputs = TimeDistributed(Dense(VOCAB_SIZE, activation='softmax'))(decoder_outputs)
    model = Model([encoder_inputs, decoder_inputs], outputs)

    return model

model = seq2seq_model_builder(HIDDEN_DIM=300)
model.summary()
```
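The compile step is not shown in this article; since the decoder targets are one-hot vectors over the vocabulary (see step 8), a minimal sketch could look like this (the choice of the rmsprop optimizer is my assumption, not something the article fixes):
```python
# One-hot decoder targets (step 8) pair naturally with categorical cross-entropy.
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
```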
For training our seq2seq model, we will use the Cornell Movie-Dialogs Corpus, which contains 220,579 conversational exchanges between 10,292 pairs of movie characters, involving 9,035 characters from 617 movies. Here is one of the conversations from the dataset:
```md
Mike:
"Drink up, Charley. We're ahead of you."
Charley:
"I'm not thirsty."
```
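The loading code itself is not part of this article; as a rough sketch (my assumption), the raw utterances can be read from the corpus' movie_lines.txt, whose fields are separated by " +++$+++ " with the utterance text in the last field:
```python
# Rough sketch of loading raw utterances from the Cornell corpus.
# The file is not UTF-8 encoded, so undecodable bytes are simply ignored here.
def load_lines(path='movie_lines.txt'):
    texts = []
    with open(path, encoding='utf-8', errors='ignore') as f:
        for line in f:
            parts = line.strip().split(' +++$+++ ')
            if len(parts) == 5:
                texts.append(parts[4])
    return texts

texts = load_lines()
```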
2. Text Cleaning
I always use my own function below to clean text for Seq2Seq:
- Text Cleaning Function:
```python
import re

def clean_text(text):
    '''Clean text by removing unnecessary characters and altering the format of words.'''
    text = text.lower()
    text = re.sub(r"i'm", "i am", text)
    text = re.sub(r"he's", "he is", text)
    text = re.sub(r"she's", "she is", text)
    text = re.sub(r"it's", "it is", text)
    text = re.sub(r"that's", "that is", text)
    text = re.sub(r"what's", "what is", text)
    text = re.sub(r"where's", "where is", text)
    text = re.sub(r"how's", "how is", text)
    text = re.sub(r"\'ll", " will", text)
    text = re.sub(r"\'ve", " have", text)
    text = re.sub(r"\'re", " are", text)
    text = re.sub(r"\'d", " would", text)
    text = re.sub(r"won't", "will not", text)
    text = re.sub(r"can't", "cannot", text)
    text = re.sub(r"n't", " not", text)
    text = re.sub(r"n'", "ng", text)
    text = re.sub(r"'bout", "about", text)
    text = re.sub(r"'til", "until", text)
    text = re.sub(r"[-()\"#/@;:<>{}`+=~|.!?,]", "", text)
    return text

cleaned_text = []
for text in texts:
    cleaned_text.append(clean_text(text))

cleaned_text
```
- Input:
```md
# encoder input text data
["Drink up, Charley. We're ahead of you.",
 'Did you change your hair?',
 'I believe I have found a faster way.']
```
- Output:
```md
# cleaned encoder input text data
['drink up charley we are ahead of you',
 'did you change your hair',
 'i believe i have found a faster way']
```
3. Put <BOS> and <EOS> tags for decoder input
- Tag Creator Function:
```python
def tagger(decoder_input_sentence):

    bos = "<BOS> "
    eos = " <EOS>"
    final_target = [bos + text + eos for text in decoder_input_sentence]

    return final_target

decoder_inputs = tagger(decoder_input_text)
```
- Input:
```md
# decoder input text data
[['with the teeth of your zipper',
'so they tell me',
'so which dakota you from'],,,,]
```
- Output:
```md
# decoder input text data
[['<BOS> with the teeth of your zipper <EOS>',
'<BOS> so they tell me <EOS>',
'<BOS> so which dakota you from <EOS>'],,,,]
```
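One caveat worth adding (mine, not from the original article): by default Keras' Tokenizer filters out the characters < and >, which would turn "<BOS>" into "bos" in the next step. If you want to keep the tags intact as vocabulary entries, the filter string can be adjusted, for example:
```python
from keras.preprocessing.text import Tokenizer

# Default filters include '<' and '>'; removing them preserves '<BOS>'/'<EOS>'
# (they are still lowercased to '<bos>'/'<eos>' unless lower=False is set).
tokenizer = Tokenizer(num_words=VOCAB_SIZE,
                      filters='!"#$%&()*+,-./:;=?@[\\]^_`{|}~\t\n')
```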
4. Make Vocabulary (VOCAB_SIZE)
- Make Vocabulary Function:
```python
from keras.preprocessing.text import Tokenizer
def vocab_creater(text_lists, VOCAB_SIZE):

    tokenizer = Tokenizer(num_words=VOCAB_SIZE)
    tokenizer.fit_on_texts(text_lists)
    dictionary = tokenizer.word_index

    word2idx = {}
    idx2word = {}
    for k, v in dictionary.items():
        if v < VOCAB_SIZE:
            word2idx[k] = v
            idx2word[v] = k
        else:
            continue

    return word2idx, idx2word

word2idx, idx2word = vocab_creater(text_lists=encoder_input_text+decoder_input_text, VOCAB_SIZE=14999)
```
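The function returns two lookup dictionaries. The exact indices depend on word frequencies in the corpus, so they are not listed here; a quick hedged check of what comes back could be (in practice it also helps to keep the fitted Tokenizer itself, since the same word-to-id mapping is needed again in the next step):
```python
# word2idx maps tokens to integer ids, idx2word is the reverse lookup.
print(len(word2idx))   # at most VOCAB_SIZE - 1 entries
print(idx2word[1])     # index 1 is the most frequent token in the corpus
```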
5. Tokenize Bag of words to Bag of IDs
- Tokenize Function:
```python
from keras.preprocessing.text import Tokenizer

VOCAB_SIZE = 14999

def text2seq(encoder_text, decoder_text, VOCAB_SIZE):

    tokenizer = Tokenizer(num_words=VOCAB_SIZE)
    # Fit on the same texts as in step 4 so the word-to-id mapping matches
    tokenizer.fit_on_texts(encoder_text + decoder_text)
    encoder_sequences = tokenizer.texts_to_sequences(encoder_text)
    decoder_sequences = tokenizer.texts_to_sequences(decoder_text)

    return encoder_sequences, decoder_sequences

encoder_sequences, decoder_sequences = text2seq(encoder_text, decoder_text, VOCAB_SIZE)
```
- Input:
```md
# Cleaned texts
[['with the teeth of your zipper',
'so they tell me',
'so which dakota you from'],,,,]
```
- Output:
```md
# decoder input text data
[[10, 27, 8, 4, 27, 1107, 802],
[3, 5, 186, 168],
[662, 4, 22, 346, 6, 130, 3, 5, 2407],,,,,]
```
6. Padding (MAX_LEN)
- Padding Function:
```python
from keras.preprocessing.sequence import pad_sequences
def padding(encoder_sequences, decoder_sequences, MAX_LEN):

    encoder_input_data = pad_sequences(encoder_sequences, maxlen=MAX_LEN, dtype='int32', padding='post', truncating='post')
    decoder_input_data = pad_sequences(decoder_sequences, maxlen=MAX_LEN, dtype='int32', padding='post', truncating='post')

    return encoder_input_data, decoder_input_data

encoder_input_data, decoder_input_data = padding(encoder_sequences, decoder_sequences, MAX_LEN)
```
- Input:
```md
# decoder input text data
[[10, 27, 8, 4, 27, 1107, 802],
 [3, 5, 186, 168],
 [662, 4, 22, 346, 6, 130, 3, 5, 2407],,,,,]
```
- Output:
```md
# MAX_LEN = 10
# decoder input text data
array([[10, 27, 8, 4, 27, 1107, 802, 0, 0, 0],
       [3, 5, 186, 168, 0, 0, 0, 0, 0, 0],
       [662, 4, 22, 346, 6, 130, 3, 5, 2407, 0],,,,,]
```
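How MAX_LEN is chosen is not specified in the article; one common heuristic (an assumption on my part) is to look at the length distribution of the tokenized sequences and pick a cutoff that covers most of them:
```python
import numpy as np

# Inspect sequence lengths to pick a MAX_LEN that covers most examples
# without padding everything to the length of the longest outlier.
lengths = [len(seq) for seq in encoder_sequences + decoder_sequences]
print(np.percentile(lengths, 95))   # e.g. take the 95th percentile as MAX_LEN
```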
7. Word Embedding (EMBEDDING_DIM)
We use pretrained GloVe word vectors. We can create the embedding layer from GloVe in 3 steps:
- Load the GloVe file (glove.6B.100d.txt)
- Create Embedding Matrix from our Vocabulary
- Create Embedding Layer
Let’s take a look!
- Word Embedding Function:
```python
import os
import numpy as np
from keras.layers import Embedding

GLOVE_DIR = ...  # path to the directory containing glove.6B.100d.txt

def glove_100d_dictionary(GLOVE_DIR):
    embeddings_index = {}
    f = open(os.path.join(GLOVE_DIR, 'glove.6B.100d.txt'))
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
    f.close()
    return embeddings_index

# this time: embedding_dimention = 100
def embedding_matrix_creater(embedding_dimention):
    embedding_matrix = np.zeros((len(word_index) + 1, embedding_dimention))
    for word, i in word_index.items():
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            # words not found in the embedding index will be all-zeros.
            embedding_matrix[i] = embedding_vector
    return embedding_matrix

def embedding_layer_creater(VOCAB_SIZE, EMBEDDING_DIM, MAX_LEN, embedding_matrix):
    embedding_layer = Embedding(input_dim=VOCAB_SIZE,
                                output_dim=EMBEDDING_DIM,
                                input_length=MAX_LEN,
                                weights=[embedding_matrix],
                                trainable=False)
    return embedding_layer

embedding_layer = embedding_layer_creater(VOCAB_SIZE, EMBEDDING_DIM, MAX_LEN, embedding_matrix)
```
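The article lists the three helpers but not how they are wired together; a hedged sketch of the full sequence (word_index and EMBEDDING_DIM are my assumptions based on the snippets above, and the first dimension of embedding_matrix must match the input_dim given to the Embedding layer):
```python
# Chain the three helpers: read GloVe vectors, build the weight matrix
# for our vocabulary, then wrap it in a frozen Keras Embedding layer.
EMBEDDING_DIM = 100
word_index = word2idx                                # word-to-id mapping from step 4
embeddings_index = glove_100d_dictionary(GLOVE_DIR)
embedding_matrix = embedding_matrix_creater(EMBEDDING_DIM)
embedding_layer = embedding_layer_creater(VOCAB_SIZE, EMBEDDING_DIM, MAX_LEN, embedding_matrix)
```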
8. Reshape the Data depending on neural network shapes
- Reshape Decoder Output Data Function:
```python
import numpy as np
# MAX_LEN = 20
# num_samples = len(encoder_sequences)
# VOCAB_SIZE = 15000
def decoder_output_creater(decoder_input_data, num_samples, MAX_LEN, VOCAB_SIZE):

    decoder_output_data = np.zeros((num_samples, MAX_LEN, VOCAB_SIZE), dtype="float32")

    for i, seqs in enumerate(decoder_input_data):
        for j, seq in enumerate(seqs):
            if j > 0:
                # Shift by one timestep: the token fed to the decoder at step j
                # becomes the one-hot target for step j-1 (teacher forcing).
                decoder_output_data[i][j-1][seq] = 1.

    print(decoder_output_data.shape)
    return decoder_output_data

decoder_output_data = decoder_output_creater(decoder_input_data, num_samples, MAX_LEN, VOCAB_SIZE)
```
- Input:
```md
# MAX_LEN = 10
# decoder input text data
array([[10, 27, 8, 4, 27, 1107, 802, 0, 0, 0],
       [3, 5, 186, 168, 0, 0, 0, 0, 0, 0],
       [662, 4, 22, 346, 6, 130, 3, 5, 2407, 0],,,,,]
```
- Output:
```md
# output.shape = (num_samples, MAX_LEN, VOCAB_SIZE)
# decoder_output_data.shape = (15000, 10, 15000)
array([[[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 1., ..., 0., 0., 0.],
        ...,
        [1., 0., 0., ..., 0., 0., 0.],
        [1., 0., 0., ..., 0., 0., 0.],
        [1., 0., 0., ..., 0., 0., 0.]],
       ...,
      ], dtype=float32)
```
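A practical caveat (mine, not from the article): the dense one-hot target array gets very large, roughly 9 GB of RAM for the shape shown above, so for the full corpus you may prefer to build it batch by batch:
```python
# float32 one-hot targets: num_samples * MAX_LEN * VOCAB_SIZE * 4 bytes
approx_bytes = 15000 * 10 * 15000 * 4
print(approx_bytes / 1e9)  # ~9.0 GB for the shape shown above
```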
9. Split Data for training, validation, and testing
- Split Data Function:
```python
from sklearn.model_selection import train_test_split

def data_spliter(encoder_input_data, decoder_input_data, test_size1=0.2, test_size2=0.3):

    en_train, en_test, de_train, de_test = train_test_split(encoder_input_data, decoder_input_data, test_size=test_size1)
    en_train, en_val, de_train, de_val = train_test_split(en_train, de_train, test_size=test_size2)

    return en_train, en_val, en_test, de_train, de_val, de_test

en_train, en_val, en_test, de_train, de_val, de_test = data_spliter(encoder_input_data, decoder_input_data)
```
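One thing the split above does not cover is decoder_output_data from step 8, which has to be split in exactly the same row order as the inputs. A hedged sketch of how all three arrays could be split in one call and then fed to the (already compiled) model from step 1; the batch size and epoch count are my assumptions:
```python
from sklearn.model_selection import train_test_split

# Split all three arrays together so the rows stay aligned.
(en_train, en_test,
 de_train, de_test,
 de_out_train, de_out_test) = train_test_split(
    encoder_input_data, decoder_input_data, decoder_output_data, test_size=0.2)

model.fit([en_train, de_train], de_out_train,
          batch_size=64, epochs=30,
          validation_split=0.1)
```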
10. Conclusion
We have quickly walked through text preprocessing for seq2seq. I hope it will be useful for you as well.