TensorFlow/Keras Word Tokenizer
Word Tokenizer Introduction:
If you want to classify text data, the first and foremost step is to convert the raw text into numerical vectors so that machine learning algorithms can understand the data. In this article, we are going to discuss the Keras word Tokenizer.
We could convert the text into vectors by writing a custom function that splits the sentences into words, collects the unique words, and assigns each of them a number, but most machine learning libraries already ship with tokenizers of this kind. A rough sketch of that manual approach is shown below.
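Just for intuition, here is a minimal sketch of what such a hand-rolled word indexer could look like (the function name and the naive whitespace splitting are purely illustrative assumptions):

```python
# A hypothetical, minimal word indexer -- only to illustrate the idea,
# not a replacement for the library tokenizers discussed below.
def build_word_index(sentences):
    word_index = {}
    for sentence in sentences:
        for word in sentence.lower().split():          # naive split on whitespace
            if word not in word_index:
                word_index[word] = len(word_index) + 1  # assign the next id
    return word_index

print(build_word_index(['Tensorflow NLP Tokenizer', 'tokenizer is really cool']))
# {'tensorflow': 1, 'nlp': 2, 'tokenizer': 3, 'is': 4, 'really': 5, 'cool': 6}
```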
There are a number of ways to convert text features into numerical vectors, for example:
- CountVectorizer (Bag of Words – sklearn; a quick sketch of its usage follows this list)
- TfidfVectorizer (Term Frequency – Inverse Document Frequency – sklearn)
- HashingVectorizer (hashing-based encoding – sklearn)
- Keras Tokenizer (if you use TensorFlow/Keras)
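For comparison, this is roughly how the sklearn CountVectorizer from the list above is used. This is a minimal sketch: the sample sentences are made up, and only the standard scikit-learn API is assumed.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['Tensorflow NLP Tokenizer', 'tokenizer is really cool']

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(corpus)   # sparse bag-of-words count matrix

print(vectorizer.vocabulary_)            # word -> column index mapping
print(bow.toarray())                     # one row of counts per sentence
```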
In the rest of this article, we focus on the TensorFlow/Keras Tokenizer.
Keras Word Tokenizer:
```python
from tensorflow.keras.preprocessing.text import Tokenizer

words = [
    'Tensorflow NLP Tokenizer',
    'tokenizer! is really cool'
]

tokenizer = Tokenizer(num_words=10)
tokenizer.fit_on_texts(words)
word_index = tokenizer.word_index

print(word_index)
```

Output:

```
{'tokenizer': 1, 'tensorflow': 2, 'nlp': 3, 'is': 4, 'really': 5, 'cool': 6}
```
In the above example, Keras was able to split the words and also removed the special characters. Note that the first sentence contains 'Tokenizer' while the second contains 'tokenizer!': Keras lowercased both and stripped the punctuation, so they map to the same token. In other words, the Tokenizer does some preprocessing before splitting the sentences.
Once the Tokenizer has split the sentences, it maintains a dictionary named word_index that contains every unique word in the provided data and assigns each word a unique number; that number is then used in place of the word. The collection of all the sentences is also called the corpus (going forward, I will use "corpus" to refer to all of the data).
texts_to_sequences:
```python
from tensorflow.keras.preprocessing.text import Tokenizer

words = [
    'Tensorflow NLP Tokenizer',
    'tokenizer! is really cool',
    'Tensorflow is awesome!',
    'spliting sentences in to words using tokenizer'
]

tokenizer = Tokenizer(num_words=10)
tokenizer.fit_on_texts(words)
word_index = tokenizer.word_index

# texts_to_sequences takes raw sentences and encodes them using the previously fitted vocabulary
sequences = tokenizer.texts_to_sequences(words)

print(word_index)
print(sequences)
```

Output:

```
{'tokenizer': 1, 'tensorflow': 2, 'is': 3, 'nlp': 4, 'really': 5, 'cool': 6, 'awesome': 7, 'spliting': 8, 'sentences': 9, 'in': 10, 'to': 11, 'words': 12, 'using': 13}
[[2, 4, 1], [1, 3, 5, 6], [2, 3, 7], [8, 9, 1]]
```
Now we can see that texts_to_sequences converted the sentences directly into vectors. Also note that because we passed num_words=10, only words with an index below 10 are kept, which is why the last sentence is encoded as just [8, 9, 1].
- The good thing about the Tokenizer is that it can be fit on one dataset and still be applied to another.
- What I mean is that we can fit the Tokenizer on the training data and then only use the transform step on the test data.
- We must fit and transform on the training data and only transform on the test data, because the model is built using the word indexes of the training data; if we re-fit on the test data, the word indexes change and the effectiveness of the model may decrease.
Using the word tokenizer on test data:
```python
from tensorflow.keras.preprocessing.text import Tokenizer

words = [
    'Tensorflow NLP Tokenizer',
    'tokenizer! is really cool',
    'Tensorflow is awesome!',
    'spliting sentences in to words using tokenizer'
]

tokenizer = Tokenizer(num_words=10)
tokenizer.fit_on_texts(words)
word_index = tokenizer.word_index

# texts_to_sequences takes raw sentences and encodes them using the previously fitted vocabulary
sequences = tokenizer.texts_to_sequences(words)

print(word_index)
print(sequences)

test_data = [
    'NLP Tokenizer became really powerful',
    'Tensorflow tokenizer at work'
]

# Note: we do not fit on the test data, we only transform it
seq = tokenizer.texts_to_sequences(test_data)
print("Test Seq: ", seq)
```

Output:

```
{'tokenizer': 1, 'tensorflow': 2, 'is': 3, 'nlp': 4, 'really': 5, 'cool': 6, 'awesome': 7, 'spliting': 8, 'sentences': 9, 'in': 10, 'to': 11, 'words': 12, 'using': 13}
[[2, 4, 1], [1, 3, 5, 6], [2, 3, 7], [8, 9, 1]]
Test Seq:  [[4, 1, 5], [2, 1]]
```
- We can see that we only get word indexes for the words that are in the training vocabulary.
- For example, 'NLP Tokenizer became really powerful' is encoded as [4, 1, 5], where 4 = nlp, 1 = tokenizer, 5 = really; the remaining words, which are not present in the training data, are ignored (see the decoding sketch right after this list).
- So the Tokenizer only transforms the words that are present in the training data. It is therefore always a good idea to have a large training corpus so that the vocabulary covers as many words as possible.
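To double-check which words survived the transform, we can invert word_index and decode the sequences back to text. This small sketch continues from the snippet above; the decode helper is just an illustrative assumption, not part of the Keras API.

```python
# Build a reverse lookup (index -> word) from the tokenizer fitted above
index_word = {index: word for word, index in tokenizer.word_index.items()}

def decode(sequence):
    # Map each id back to its word
    return ' '.join(index_word.get(i, '?') for i in sequence)

for s in seq:
    print(decode(s))
# nlp tokenizer really
# tensorflow tokenizer
```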
Solving the out-of-vocabulary words problem using oov_token:
```python
from tensorflow.keras.preprocessing.text import Tokenizer

words = [
    'Tensorflow NLP Tokenizer',
    'tokenizer! is really cool',
    'Tensorflow is awesome!',
    'spliting sentences in to words using tokenizer'
]

# We can pass an out-of-vocabulary token (oov_token) to the Tokenizer constructor;
# any word the tokenizer does not know is replaced with this token
tokenizer = Tokenizer(num_words=10, oov_token='OOV')
tokenizer.fit_on_texts(words)
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(words)

print(word_index)
print(sequences)

test_data = [
    'NLP Tokenizer became really powerful',
    'Tensorflow tokenizer at work'
]

# Again, we only transform the test data, we do not fit on it
seq = tokenizer.texts_to_sequences(test_data)
print("Test Seq: ", seq)
```

Output:

```
{'OOV': 1, 'tokenizer': 2, 'tensorflow': 3, 'is': 4, 'nlp': 5, 'really': 6, 'cool': 7, 'awesome': 8, 'spliting': 9, 'sentences': 10, 'in': 11, 'to': 12, 'words': 13, 'using': 14}
[[3, 5, 2], [2, 4, 6, 7], [3, 4, 8], [9, 1, 1, 1, 1, 1, 2]]
Test Seq:  [[5, 2, 1, 6, 1], [3, 2, 1, 1]]
```
We can pass the out-of-vocabulary token (oov_token) to the Tokenizer constructor; here we used 'OOV'. If the tokenizer encounters any out-of-vocabulary word, it replaces that word with the specified token. Notice that even some training sentences contain the OOV index here: with num_words=10, words whose index is 10 or higher are also mapped to the OOV token.
Note: make sure to use a unique string for the oov_token so that it does not conflict with any real word in your corpus.
Text padding:
- It is a good idea to have the same length for all our sentences so that the neural network can perform its calculations efficiently.
- But our sentences may have different lengths, so we can use text padding to bring all of them to the same length.
- We can use the Keras pad_sequences function to pad the sentences; please see the following code snippet.
```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

words = [
    'Tensorflow NLP Tokenizer',
    'tokenizer! is really cool',
    'Tensorflow is awesome!',
    'spliting sentences in to words using tokenizer'
]

# num_words=20 so that all 14 unique words (plus OOV) stay in the vocabulary
tokenizer = Tokenizer(num_words=20, oov_token='OOV')
tokenizer.fit_on_texts(words)
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(words)

print(word_index)
print(sequences)

padded_seq = pad_sequences(sequences)
print("Padded Seq : \n", padded_seq)
```

Output:

```
{'OOV': 1, 'tokenizer': 2, 'tensorflow': 3, 'is': 4, 'nlp': 5, 'really': 6, 'cool': 7, 'awesome': 8, 'spliting': 9, 'sentences': 10, 'in': 11, 'to': 12, 'words': 13, 'using': 14}
[[3, 5, 2], [2, 4, 6, 7], [3, 4, 8], [9, 10, 11, 12, 13, 14, 2]]
Padded Seq : 
 [[ 0  0  0  0  3  5  2]
 [ 0  0  0  2  4  6  7]
 [ 0  0  0  0  3  4  8]
 [ 9 10 11 12 13 14  2]]
```
- Now our sentences are transformed into a matrix; each row represents one sentence, and all rows have the same length.
- The padded sentence length is equal to the length of the longest sentence in the training data: the 4th sentence is the longest with 7 words, so all other sentences are padded with zeros to make them 7 tokens long.
- So basically, pad_sequences takes the longest sequence length and pads all other sentences up to that length (a small sketch of setting this length explicitly follows this list).
- Also note that the zeros are added at the beginning of the sentences; this is called pre-padding.
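If you prefer to be explicit about the target length, a small sketch (continuing with the sequences from the snippet above) is to derive maxlen from the training sequences yourself and reuse it later, for example when padding test data:

```python
# Derive the padding length from the training sequences and reuse it,
# so every batch ends up with the same shape.
maxlen = max(len(s) for s in sequences)    # 7 for the corpus above

padded_train = pad_sequences(sequences, maxlen=maxlen)
print(padded_train.shape)                  # (4, 7): 4 sentences, 7 tokens each
```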
We can also pad at the end by passing padding='post' to the pad_sequences function, as shown below.
Word Tokenizer: post padding, maxlen and truncating:
```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

words = [
    'Tensorflow NLP Tokenizer',
    'tokenizer! is really cool',
    'Tensorflow is awesome!',
    'spliting sentences in to words using tokenizer'
]

tokenizer = Tokenizer(num_words=20, oov_token='OOV')
tokenizer.fit_on_texts(words)
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(words)

print(word_index)
print(sequences)

# padding='post' adds the zeros at the end; maxlen=6 caps every sequence at 6 tokens
padded_seq = pad_sequences(sequences, padding='post', maxlen=6)
print("Padded Seq : \n", padded_seq)
```

Output:

```
{'OOV': 1, 'tokenizer': 2, 'tensorflow': 3, 'is': 4, 'nlp': 5, 'really': 6, 'cool': 7, 'awesome': 8, 'spliting': 9, 'sentences': 10, 'in': 11, 'to': 12, 'words': 13, 'using': 14}
[[3, 5, 2], [2, 4, 6, 7], [3, 4, 8], [9, 10, 11, 12, 13, 14, 2]]
Padded Seq : 
 [[ 3  5  2  0  0  0]
 [ 2  4  6  7  0  0]
 [ 3  4  8  0  0  0]
 [10 11 12 13 14  2]]
```
- Now we can see the sentences are padded at the end.
- We can also control the sequence length with maxlen. Observe that the 4th sentence now has only 6 tokens and the word at its beginning has been dropped (by default, truncation happens at the beginning, just like padding). So we lose data when maxlen is smaller than the longest sentence; choose the maxlen parameter according to your sentences.
We can also choose from which end the data is dropped when a small maxlen is used, by passing the truncating parameter as shown below.
```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

words = [
    'Tensorflow NLP Tokenizer',
    'tokenizer! is really cool',
    'Tensorflow is awesome!',
    'spliting sentences in to words using tokenizer'
]

tokenizer = Tokenizer(num_words=20, oov_token='OOV')
tokenizer.fit_on_texts(words)
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(words)

print(word_index)
print(sequences)

# truncating='post' drops the extra tokens from the end instead of the beginning
padded_seq = pad_sequences(sequences, padding='post', truncating='post', maxlen=6)
print("Padded Seq : \n", padded_seq)
```

Output:

```
{'OOV': 1, 'tokenizer': 2, 'tensorflow': 3, 'is': 4, 'nlp': 5, 'really': 6, 'cool': 7, 'awesome': 8, 'spliting': 9, 'sentences': 10, 'in': 11, 'to': 12, 'words': 13, 'using': 14}
[[3, 5, 2], [2, 4, 6, 7], [3, 4, 8], [9, 10, 11, 12, 13, 14, 2]]
Padded Seq : 
 [[ 3  5  2  0  0  0]
 [ 2  4  6  7  0  0]
 [ 3  4  8  0  0  0]
 [ 9 10 11 12 13 14]]
```
- Now we can see that the last word of the 4th sentence, i.e. 'tokenizer', has been dropped.
Let's apply pad_sequences on the test data as well:
```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

words = [
    'Tensorflow NLP Tokenizer',
    'tokenizer! is really cool',
    'Tensorflow is awesome!',
    'spliting sentences in to words using tokenizer'
]

tokenizer = Tokenizer(num_words=20, oov_token='OOV')
tokenizer.fit_on_texts(words)
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(words)

print(word_index)
print(sequences)

padded_seq = pad_sequences(sequences, padding='post', truncating='post', maxlen=6)
print("\nPadded Seq : \n", padded_seq)

test_data = [
    'NLP Tokenizer became really powerful',
    'Tensorflow tokenizer at work'
]

# Only transform the test data with the already fitted tokenizer, then pad it the same way
seq = tokenizer.texts_to_sequences(test_data)
padded_test_seq = pad_sequences(seq, padding='post', truncating='post', maxlen=6)

print("\nTest Seq: ", seq)
print("\nPadded test seq : \n", padded_test_seq)
```

Output:

```
{'OOV': 1, 'tokenizer': 2, 'tensorflow': 3, 'is': 4, 'nlp': 5, 'really': 6, 'cool': 7, 'awesome': 8, 'spliting': 9, 'sentences': 10, 'in': 11, 'to': 12, 'words': 13, 'using': 14}
[[3, 5, 2], [2, 4, 6, 7], [3, 4, 8], [9, 10, 11, 12, 13, 14, 2]]

Padded Seq : 
 [[ 3  5  2  0  0  0]
 [ 2  4  6  7  0  0]
 [ 3  4  8  0  0  0]
 [ 9 10 11 12 13 14]]

Test Seq:  [[5, 2, 1, 6, 1], [3, 2, 1, 1]]

Padded test seq : 
 [[5 2 1 6 1 0]
 [3 2 1 1 0 0]]
```
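As a final note, these padded sequences are typically what you feed into an Embedding layer. The following is only a hedged sketch of that next step: the layer sizes and the binary-classification head are arbitrary placeholders, and vocab_size simply matches the num_words we used above.

```python
import numpy as np
import tensorflow as tf

vocab_size = 20   # matches num_words passed to the Tokenizer above

# A tiny placeholder model that consumes padded integer sequences
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=8),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# The padded test sequences from above: 2 sentences, 6 tokens each
example_batch = np.array([[5, 2, 1, 6, 1, 0],
                          [3, 2, 1, 1, 0, 0]])
print(model.predict(example_batch).shape)  # (2, 1): one score per sentence
```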
I have tried to cover the basic operations of the Keras Tokenizer. Explore more at:
- https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer
- https://keras.io/preprocessing/text/