This is the 23rd article in my series of articles on Python for NLP. In the previous article of this series, I explained how to perform neural machine translation using seq2seq architecture with Python's Keras library for deep learning.
In this article we will study BERT, which stands for Bidirectional Encoder Representations from Transformers, and its application to text classification. BERT is a text representation technique like Word Embeddings. If you have no idea how word embeddings work, take a look at my article on word embeddings.
Like word embeddings, BERT is a text representation technique, one that builds on state-of-the-art deep learning ideas such as bidirectional context encoding and the Transformer architecture. BERT was developed by researchers at Google in 2018 and has been proven to be state-of-the-art for a variety of natural language processing tasks such as text classification, text summarization, text generation, etc. Just recently, Google announced that BERT is being used as a core part of their search algorithm to better understand queries.
In this article we will not go into the mathematical details of how BERT is implemented, as there are plenty of resources already available online. Rather, we will see how to perform text classification using the BERT Tokenizer. In this article you will see how the BERT Tokenizer can be used to create a text classification model. In the next article I will explain how the BERT Tokenizer, along with the BERT embedding layer, can be used to create even more efficient NLP models.
Note: All the scripts in this article have been tested using the Google Colab environment, with the Python runtime set to GPU.
The dataset used in this article can be downloaded from this Kaggle link.
If you download the dataset and extract the compressed file, you will see a CSV file. The file contains 50,000 records and two columns: review and sentiment. The review column contains the text of the review and the sentiment column contains its sentiment. The sentiment column can have two values, i.e. "positive" and "negative", which makes our problem a binary classification problem.
We previously performed sentiment analysis of this dataset in an earlier article, where we achieved a maximum accuracy of 92% on the training set via a word embedding technique and a convolutional neural network. On the test set the maximum accuracy achieved was 85.40%, using word embeddings and a single LSTM with 128 nodes. Let's see if we can get better accuracy using BERT representations.
Installing and Importing Required Libraries
Before you can go and use the BERT text representation, you need to install BERT for TensorFlow 2.0. Execute the following pip commands on your terminal to install BERT for TensorFlow 2.0.
!pip install bert-for-tf2
!pip install sentencepiece
Next, you need to make sure that you are running TensorFlow 2.0. Google Colab, by default, doesn't run your script on TensorFlow 2.0. Therefore, to make sure that you are running your script via TensorFlow 2.0, execute the following script:
try:
    %tensorflow_version 2.x
except Exception:
    pass

import math
import random
import re

import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow_hub as hub
from tensorflow.keras import layers
import bert
In the above script, in addition to TensorFlow 2.0, we also import tensorflow_hub, which is essentially a repository of prebuilt and pretrained models developed in TensorFlow. We will be importing and using a built-in BERT model from TF Hub. Finally, if you see the following output, you are good to go:
TensorFlow 2.x selected.
Importing and Preprocessing the Dataset
The following script imports the dataset using the read_csv() method of Pandas. The script also prints the shape of the dataset.
movie_reviews = pd.read_csv("/content/drive/My Drive/Colab Datasets/IMDB Dataset.csv")
movie_reviews.isnull().values.any()
movie_reviews.shape
The output shows that our dataset has 50,000 rows and 2 columns.
Next, we will preprocess our data to remove any punctuation and special characters. To do so, we will define a function that takes a raw text review as input and returns the corresponding cleaned review.
def preprocess_text(sen):
    # Removing html tags
    sentence = remove_tags(sen)

    # Remove punctuations and numbers
    sentence = re.sub('[^a-zA-Z]', ' ', sentence)

    # Single character removal
    sentence = re.sub(r"\s+[a-zA-Z]\s+", ' ', sentence)

    # Removing multiple spaces
    sentence = re.sub(r'\s+', ' ', sentence)

    return sentence
TAG_RE = re.compile(r'<[^>]+>')

def remove_tags(text):
    return TAG_RE.sub('', text)
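As a quick sanity check, the two helpers above can be exercised on a made-up review (the example input is mine, not from the dataset; the helpers are repeated here so the snippet runs on its own). Note how the pipeline strips HTML tags, punctuation, digits, and stray single characters:

```python
import re

TAG_RE = re.compile(r'<[^>]+>')

def remove_tags(text):
    return TAG_RE.sub('', text)

def preprocess_text(sen):
    sentence = remove_tags(sen)                          # strip HTML tags
    sentence = re.sub('[^a-zA-Z]', ' ', sentence)        # drop punctuation and digits
    sentence = re.sub(r"\s+[a-zA-Z]\s+", ' ', sentence)  # drop single characters
    sentence = re.sub(r'\s+', ' ', sentence)             # collapse whitespace
    return sentence

print(preprocess_text("An excellent movie!<br /> A must-watch."))
# -> 'An excellent movie must watch '
```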
The following script cleans all the text reviews:
reviews = []
sentences = list(movie_reviews['review'])
for sen in sentences:
    reviews.append(preprocess_text(sen))
Our dataset contains two columns: the review column contains the text of each review, while the sentiment column contains its sentiment, stored as text. The following script displays the unique values in the sentiment column:

movie_reviews["sentiment"].unique()
array(['positive', 'negative'], dtype=object)
You can see that the sentiment column contains two unique values, i.e. positive and negative. Deep learning algorithms work with numbers. Since we have only two unique values in the output, we can convert them into 1 and 0. The following script replaces the positive sentiment with 1 and the negative sentiment with 0:
y = movie_reviews['sentiment']
y = np.array(list(map(lambda x: 1 if x=="positive" else 0, y)))
The reviews variable contains the text reviews while the y variable contains the corresponding labels. Let's randomly print a review:
Phil the Alien is one of those quirky films where the humour is based around the oddness of everything rather than actual punchlines At first it was very odd and pretty funny but as the movie progressed didn find the jokes or oddness funny anymore Its low budget film thats never problem in itself there were some pretty interesting characters but eventually just lost interest imagine this film would appeal to stoner who is currently partaking For something similar but better try Brother from another planet
It clearly looks like a negative review. Let's just confirm it by printing the corresponding label value:
The output 0 confirms that it is a negative review. We have now preprocessed our data and are ready to create BERT representations from it.
Creating a BERT Tokenizer
In order to use BERT text embeddings as input to train a text classification model, we need to tokenize our text reviews. Tokenization refers to dividing a sentence into individual tokens, which in BERT's case can be words or subwords. To tokenize our text, we will be using the BERT tokenizer. Look at the following script:
BertTokenizer = bert.bert_tokenization.FullTokenizer
bert_layer = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1", trainable=False)
vocabulary_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
to_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = BertTokenizer(vocabulary_file, to_lower_case)
In the script above we first create a reference, BertTokenizer, to the FullTokenizer class from the bert.bert_tokenization module. Next, we create a BERT embedding layer by importing the BERT model from TensorFlow Hub. The trainable parameter is set to False, which means that we will not be training the BERT embedding. In the next line, we retrieve the BERT vocabulary file in the form of a numpy array. We then read the do_lower_case flag, and finally we pass our vocabulary_file and to_lower_case variables to the BertTokenizer constructor.
It is pertinent to mention that in this article, we will only be using the BERT Tokenizer. In the next article we will use BERT Embeddings along with the tokenizer.
Let's now see if our BERT tokenizer is actually working. To do so, we will tokenize a random sentence, as shown below:
tokenizer.tokenize("don't be so judgmental")
['don', "'", 't', 'be', 'so', 'judgment', '##al']
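The ## prefix marks a continuation subword: BERT's WordPiece tokenizer splits a word that is not in its vocabulary into the longest vocabulary pieces it can find, working left to right. A minimal sketch of that greedy longest-match idea (using a tiny toy vocabulary of my own, not BERT's real ~30,000-entry one, and skipping BERT's initial punctuation splitting):

```python
def wordpiece(word, vocab):
    # Greedy longest-match-first segmentation, as WordPiece does.
    tokens, start = [], 0
    while start < len(word):
        end, cur = len(word), None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # mark continuation pieces
            if piece in vocab:
                cur = piece
                break
            end -= 1
        if cur is None:
            return ["[UNK]"]  # no piece matches: unknown word
        tokens.append(cur)
        start = end
    return tokens

toy_vocab = {"judgment", "##al", "don", "##t"}
print(wordpiece("judgmental", toy_vocab))  # ['judgment', '##al']
```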
You can see that the text has been successfully tokenized. You can also get the ids of the tokens using the convert_tokens_to_ids() method of the tokenizer object. Look at the following script:
tokenizer.convert_tokens_to_ids(tokenizer.tokenize("dont be so judgmental"))
[2123, 2102, 2022, 2061, 8689, 2389]
Now we will define a function that accepts a single text review and returns the ids of the tokenized words in the review. Execute the following script:
def tokenize_reviews(text_reviews):
    return tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text_reviews))
And execute the following script to actually tokenize all the reviews in the input dataset:
tokenized_reviews = [tokenize_reviews(review) for review in reviews]
Preparing Data for Training
The reviews in our dataset have varying lengths. Some reviews are very short while others are very long. To train the model, the input sentences should be of equal length. One way to create sentences of equal length is to pad the shorter sentences with 0s. However, this can result in a sparse matrix containing a large number of 0s. The other way is to pad sentences within each batch. Since we will be training the model in batches, we can pad the sentences locally within each training batch, depending upon the length of the longest sentence it contains. To do so, we first need to find the length of each sentence.
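The batch-local padding idea can be illustrated without TensorFlow. In this toy sketch (my own illustration, not part of the original scripts), each batch is padded only up to the length of its own longest sequence, so different batches end up with different widths:

```python
def pad_batch(batch, pad_id=0):
    # Pad every sequence up to the length of the longest sequence in THIS batch only.
    max_len = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in batch]

# Two toy batches: they pad to lengths 3 and 4 respectively.
print(pad_batch([[5, 3], [7, 1, 2]]))   # [[5, 3, 0], [7, 1, 2]]
print(pad_batch([[9], [4, 4, 4, 4]]))   # [[9, 0, 0, 0], [4, 4, 4, 4]]
```

Sorting the reviews by length before batching, as the article does next, keeps sequences of similar length together and so minimizes the padding each batch needs.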
The following script creates a list of lists where each sublist contains the tokenized review, the label of the review, and the length of the review:
reviews_with_len = [[review, y[i], len(review)] for i, review in enumerate(tokenized_reviews)]
In our dataset, the first half of the reviews are positive while the last half contains negative reviews. Therefore, in order to have both positive and negative reviews in the training batches we need to shuffle the reviews. The following script shuffles the data randomly:

random.shuffle(reviews_with_len)
Once the data is shuffled, we will sort the data by the length of the reviews. To do so, we will use the sort() function of the list and tell it that we want to sort with respect to the third item in each sublist, i.e. the length of the review.
reviews_with_len.sort(key=lambda x: x[2])
Once the reviews are sorted by length, we can remove the length attribute from all the reviews. Execute the following script to do so:
sorted_reviews_labels = [(review_lab[0], review_lab[1]) for review_lab in reviews_with_len]
Once the reviews are sorted, we will convert the dataset so that it can be used to train TensorFlow 2.0 models. Run the following code to convert the sorted dataset into a TensorFlow 2.0-compliant input dataset shape.
processed_dataset = tf.data.Dataset.from_generator(lambda: sorted_reviews_labels, output_types=(tf.int32, tf.int32))
Finally, we can now pad our dataset for each batch. The batch size we are going to use is 32, which means that after processing 32 reviews, the weights of the neural network will be updated. To pad the reviews locally with respect to batches, execute the following:
BATCH_SIZE = 32
batched_dataset = processed_dataset.padded_batch(BATCH_SIZE, padded_shapes=((None, ), ()))
Let's print the first batch and see how padding has been applied to it:
(<tf.Tensor: shape=(32, 21), dtype=int32, numpy= array([[ 2054, 5896, 2054, 2466, 2054, 6752, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [ 3078, 5436, 3078, 3257, 3532, 7613, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [ 3191, 1996, 2338, 5293, 1996, 3185, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [ 2062, 23873, 3993, 2062, 11259, 2172, 2172, 2062, 14888, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [ 1045, 2876, 9278, 2023, 2028, 2130, 2006, 7922, 12635, 2305, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], ...... [ 7244, 2092, 2856, 10828, 1997, 10904, 2402, 2472, 3135, 2293, 2466, 2007, 10958, 8428, 10102, 1999, 1996, 4281, 4276, 3773, 0], [ 2005, 5760, 7788, 4393, 8808, 2498, 2064, 12826, 2000, 1996, 11056, 3152, 3811, 16755, 2169, 1998, 2296, 2028, 1997, 2068, 0], [ 2307, 3185, 2926, 1996, 2189, 3802, 2696, 2508, 2012, 2197, 2023, 8847, 6702, 2043, 2017, 2031, 2633, 2179, 2008, 2569, 2619], [ 2028, 1997, 1996, 4569, 15580, 2102, 5691, 2081, 1999, 3522, 2086, 2204, 23191, 5436, 1998, 11813, 6370, 2191, 2023, 2028, 4438], [ 2023, 3185, 2097, 2467, 2022, 5934, 1998, 3185, 4438, 2004, 2146, 2004, 2045, 2024, 2145, 2111, 2040, 6170, 3153, 1998, 2552]], dtype=int32)>, <tf.Tensor: shape=(32,), dtype=int32, numpy= array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1], dtype=int32)>)
The above output shows the first five and last five padded reviews. From the last five reviews, you can see that the total number of words in the longest sentence was 21. Therefore, in the first five reviews the 0s are added at the end of the sentences so that their total length is also 21. The padding for the next batch will be different, depending upon the length of the longest sentence in that batch.
Once we have applied padding to our dataset, the next step is to divide the dataset into test and training sets. We can do that with the help of the following code:
TOTAL_BATCHES = math.ceil(len(sorted_reviews_labels) / BATCH_SIZE)
TEST_BATCHES = TOTAL_BATCHES // 10
batched_dataset = batched_dataset.shuffle(TOTAL_BATCHES)
test_data = batched_dataset.take(TEST_BATCHES)
train_data = batched_dataset.skip(TEST_BATCHES)
In the code above we first find the total number of batches by dividing the total number of records by the batch size. Next, 10% of the data is set aside for testing. To do so, we use the take() method of the batched_dataset object to store 10% of the data in the test_data variable. The remaining data is stored in the train_data object for training, using the skip() method.
The dataset has been prepared and now we are ready to create our text classification model.
Creating the Model
Now we are all set to create our model. To do so, we will create a class named TEXT_MODEL that inherits from the tf.keras.Model class. Inside the class we will define our model layers. Our model will consist of three convolutional neural network layers. You can use LSTM layers instead, and can also increase or decrease the number of layers. I have copied the number and types of layers from SuperDataScience's Google Colab notebook, and this architecture seems to work quite well for the IMDB movie reviews dataset as well.
Let's now create our model class:
class TEXT_MODEL(tf.keras.Model):
    def __init__(self,
                 vocabulary_size,
                 embedding_dimensions=128,
                 cnn_filters=50,
                 dnn_units=512,
                 model_output_classes=2,
                 dropout_rate=0.1,
                 training=False,
                 name="text_model"):
        super(TEXT_MODEL, self).__init__(name=name)

        self.embedding = layers.Embedding(vocabulary_size, embedding_dimensions)
        self.cnn_layer1 = layers.Conv1D(filters=cnn_filters, kernel_size=2, padding="valid", activation="relu")
        self.cnn_layer2 = layers.Conv1D(filters=cnn_filters, kernel_size=3, padding="valid", activation="relu")
        self.cnn_layer3 = layers.Conv1D(filters=cnn_filters, kernel_size=4, padding="valid", activation="relu")
        self.pool = layers.GlobalMaxPool1D()
        self.dense_1 = layers.Dense(units=dnn_units, activation="relu")
        self.dropout = layers.Dropout(rate=dropout_rate)
        if model_output_classes == 2:
            self.last_dense = layers.Dense(units=1, activation="sigmoid")
        else:
            self.last_dense = layers.Dense(units=model_output_classes, activation="softmax")

    def call(self, inputs, training):
        l = self.embedding(inputs)
        l_1 = self.cnn_layer1(l)
        l_1 = self.pool(l_1)
        l_2 = self.cnn_layer2(l)
        l_2 = self.pool(l_2)
        l_3 = self.cnn_layer3(l)
        l_3 = self.pool(l_3)

        concatenated = tf.concat([l_1, l_2, l_3], axis=-1)  # (batch_size, 3 * cnn_filters)
        concatenated = self.dense_1(concatenated)
        concatenated = self.dropout(concatenated, training)
        model_output = self.last_dense(concatenated)

        return model_output
The above script is pretty straightforward. In the constructor of the class, we initialize some attributes with default values. These values will be replaced later by the values passed when an object of the TEXT_MODEL class is created.
Next, three convolutional neural network layers have been initialized with kernel or filter sizes of 2, 3, and 4, respectively. Again, you can change the filter sizes if you want.
Next, inside the call() function, global max pooling is applied to the output of each of the convolutional neural network layers. The three pooled outputs are then concatenated together and fed to the first densely connected layer. The second densely connected layer is used to predict the output sentiment, since it only contains 2 classes. In case you have more classes in the output, you can update the model_output_classes variable accordingly.
Let's now define the values for the hyperparameters of our model.
VOCAB_LENGTH = len(tokenizer.vocab)
EMB_DIM = 200
CNN_FILTERS = 100
DNN_UNITS = 256
OUTPUT_CLASSES = 2
DROPOUT_RATE = 0.2
NB_EPOCHS = 5
Next, we need to create an object of the TEXT_MODEL class and pass the hyperparameter values that we defined in the last step to its constructor:
text_model = TEXT_MODEL(vocabulary_size=VOCAB_LENGTH,
                        embedding_dimensions=EMB_DIM,
                        cnn_filters=CNN_FILTERS,
                        dnn_units=DNN_UNITS,
                        model_output_classes=OUTPUT_CLASSES,
                        dropout_rate=DROPOUT_RATE)
Before we can actually train the model we need to compile it. The following script compiles the model:
if OUTPUT_CLASSES == 2:
    text_model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
else:
    text_model.compile(loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["sparse_categorical_accuracy"])
Finally, to train our model, we can use the fit method of the model class:

text_model.fit(train_data, epochs=NB_EPOCHS)
Here is the result after 5 epochs:
Epoch 1/5
1407/1407 [==============================] - 381s 271ms/step - loss: 0.3037 - accuracy: 0.8661
Epoch 2/5
1407/1407 [==============================] - 381s 271ms/step - loss: 0.1341 - accuracy: 0.9521
Epoch 3/5
1407/1407 [==============================] - 383s 272ms/step - loss: 0.0732 - accuracy: 0.9742
Epoch 4/5
1407/1407 [==============================] - 381s 271ms/step - loss: 0.0376 - accuracy: 0.9865
Epoch 5/5
1407/1407 [==============================] - 383s 272ms/step - loss: 0.0193 - accuracy: 0.9931
<tensorflow.python.keras.callbacks.History at 0x7f5f65690048>
You can see that we got an accuracy of 99.31% on the training set.
Let's now evaluate our model's performance on the test set:
results = text_model.evaluate(test_data)
print(results)
156/Unknown - 4s 28ms/step - loss: 0.4428 - accuracy: 0.8926[0.442786190037926, 0.8926282]
From the output, we can see that we got an accuracy of 89.26% on the test set.
Going Further - Hand-Held End-to-End Project
Does your inquisitive nature make you want to go further? We recommend checking out our Guided Project: "Image Captioning with CNNs and Transformers with Keras".
In this guided project - you'll learn how to build an image captioning model, which accepts an image as input and produces a textual caption as the output.
You'll learn how to:
- Preprocess text
- Vectorize text input easily
- Work with the tf.data API and build performant Datasets
- Build Transformers from scratch with TensorFlow/Keras and KerasNLP - the official horizontal addition to Keras for building state-of-the-art NLP models
- Build hybrid architectures where the output of one network is encoded for another
How do we frame image captioning? Most consider it an example of generative deep learning, because we're teaching a network to generate descriptions. However, I like to look at it as an instance of neural machine translation - we're translating the visual features of an image into words. Through translation, we're generating a new representation of that image, rather than just generating new meaning. Viewing it as translation, and only by extension generation, scopes the task in a different light, and makes it a bit more intuitive.
Framing the problem as one of translation makes it easier to figure out which architecture we'll want to use. Encoder-only Transformers are great at understanding text (sentiment analysis, classification, etc.) because Encoders encode meaningful representations. Decoder-only models are great for generation (such as GPT-3), since decoders are able to infer meaningful representations into another sequence with the same meaning. Translation is typically done by an encoder-decoder architecture, where encoders encode a meaningful representation of a sentence (or image, in our case) and decoders learn to turn this sequence into another meaningful representation that's more interpretable for us (such as a sentence).
In this article you saw how we can use the BERT Tokenizer to prepare text inputs for text classification. We performed sentiment analysis of IMDB movie reviews and achieved an accuracy of 89.26% on the test set. In this article we did not use BERT embeddings; we only used the BERT Tokenizer to tokenize the words. In the next article, you will see how the BERT Tokenizer along with BERT Embeddings can be used to perform text classification.