Text Classification with BERT Tokenizer and TF 2.0 in Python

This is the 23rd article in my series of articles on Python for NLP. In the previous article of this series, I explained how to perform neural machine translation using seq2seq architecture with Python's Keras library for deep learning.

In this article we will study BERT, which stands for Bidirectional Encoder Representations from Transformers and its application to text classification. BERT is a text representation technique like Word Embeddings. If you have no idea of how word embeddings work, take a look at my article on word embeddings.

Like word embeddings, BERT is also a text representation technique which is a fusion of variety of state-of-the-art deep learning algorithms, such as bidirectional encoder LSTM and Transformers. BERT was developed by researchers at Google in 2018 and has been proven to be state-of-the-art for a variety of natural language processing tasks such text classification, text summarization, text generation, etc. Just recently, Google announced that BERT is being used as a core part of their search algorithm to better understand queries.

In this article we will not go into the mathematical details of how BERT is implemented, as there are plenty of resources already available online. Rather we will see how to perform text classification using the BERT Tokenizer. In this article you will see how the BERT Tokenizer can be used to create text classification model. In the next article I will explain how the BERT Tokenizer, along with BERT embedding layer, can be used to create even more efficient NLP models.

Note: All the scripts in this article have been tested using Google Colab environment, with Python runtime set to GPU.

The Dataset

The dataset used in this article can be downloaded from this Kaggle link.

If you download the dataset and extract the compressed file, you will see a CSV file. The file contains 50,000 records and two columns: review and sentiment. The review column contains text for the review and the sentiment column contains sentiment for the review. The sentiment column can have two values i.e. "positive" and "negative" which makes our problem a binary classification problem.

We have previously performed sentimental analysis of this dataset in a previous article where we achieved maximum accuracy of 92% on the training set via word a embedding technique and convolutional neural network. On the test set the maximum accuracy achieved was 85.40% using the word embedding and single LSTM with 128 nodes. Let's see if we can get better accuracy using BERT representation.

Installing and Importing Required Libraries

Before you can go and use the BERT text representation, you need to install BERT for TensorFlow 2.0. Execute the following pip commands on your terminal to install BERT for TensorFlow 2.0.

!pip install bert-for-tf2
!pip install sentencepiece

Next, you need to make sure that you are running TensorFlow 2.0. Google Colab, by default, doesn't run your script on TensorFlow 2.0. Therefore, to make sure that you are running your script via TensorFlow 2.0, execute the following script:

try:
    %tensorflow_version 2.x
except Exception:
    pass
import tensorflow as tf

import tensorflow_hub as hub

from tensorflow.keras import layers
import bert

In the above script, in addition to TensorFlow 2.0, we also import tensorflow_hub, which basically is a place where you can find all the prebuilt and pretrained models developed in TensorFlow. We will be importing and using a built-in BERT model from TF hub. Finally, if in the output you see the following output, you are good to go:

TensorFlow 2.x selected.

Importing and Preprocessing the Dataset

The following script imports the dataset using the read_csv() method of the Pandas dataframe. The script also prints the shape of the dataset.

movie_reviews = pd.read_csv("/content/drive/My Drive/Colab Datasets/IMDB Dataset.csv")

movie_reviews.isnull().values.any()

movie_reviews.shape

Output

(50000, 2)

The output shows that our dataset has 50,000 rows and 2 columns.

Next, we will preprocess our data to remove any punctuations and special characters. To do so, we will define a function that takes as input a raw text review and returns the corresponding cleaned text review.

def preprocess_text(sen):
    # Removing html tags
    sentence = remove_tags(sen)

    # Remove punctuations and numbers
    sentence = re.sub('[^a-zA-Z]', ' ', sentence)

    # Single character removal
    sentence = re.sub(r"\s+[a-zA-Z]\s+", ' ', sentence)

    # Removing multiple spaces
    sentence = re.sub(r'\s+', ' ', sentence)

    return sentence
TAG_RE = re.compile(r'<[^>]+>')

def remove_tags(text):
    return TAG_RE.sub('', text)

The following script cleans all the text reviews:

reviews = []
sentences = list(movie_reviews['review'])
for sen in sentences:
    reviews.append(preprocess_text(sen))

Our dataset contains two columns, as can be verified from the following script:

print(movie_reviews.columns.values)

Output:

['review' 'sentiment']

The review column contains text while the sentiment column contains sentiments. The sentiments column contains values in the form of text. The following script displays unique values in the sentiment column:

movie_reviews.sentiment.unique()

Output:

array(['positive', 'negative'], dtype=object)

You can see that the sentiment column contains two unique values i.e. positive and negative. Deep learning algorithms work with numbers. Since we have only two unique values in the output, we can convert them into 1 and 0. The following script replaces positive sentiment by 1 and the negative sentiment by 0.

y = movie_reviews['sentiment']

y = np.array(list(map(lambda x: 1 if x=="positive" else 0, y)))

Now the reviews variable contain text reviews while the y variable contains the corresponding labels. Let's randomly print a review.

print(reviews[10])

Output:

Phil the Alien is one of those quirky films where the humour is based around the oddness of everything rather than actual punchlines At first it was very odd and pretty funny but as the movie progressed didn find the jokes or oddness funny anymore Its low budget film thats never problem in itself there were some pretty interesting characters but eventually just lost interest imagine this film would appeal to stoner who is currently partaking For something similar but better try Brother from another planet 

It clearly looks like a negative review. Let's just confirm it by printing the corresponding label value:

print(y[10])

Output:

0

The output 0 confirms that it is a negative review. We have now preprocessed our data and we are now ready to create BERT representations from our text data.

Creating a BERT Tokenizer

In order to use BERT text embeddings as input to train text classification model, we need to tokenize our text reviews. Tokenization refers to dividing a sentence into individual words. To tokenize our text, we will be using the BERT tokenizer. Look at the following script:

BertTokenizer = bert.bert_tokenization.FullTokenizer
bert_layer = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1",
                            trainable=False)
vocabulary_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
to_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = BertTokenizer(vocabulary_file, to_lower_case)

In the script above we first create an object of the FullTokenizer class from the bert.bert_tokenization module. Next, we create a BERT embedding layer by importing the BERT model from hub.KerasLayer. The trainable parameter is set to False, which means that we will not be training the BERT embedding. In the next line, we create a BERT vocabulary file in the form a numpy array. We then set the text to lowercase and finally we pass our vocabulary_file and to_lower_case variables to the BertTokenizer object.

It is pertinent to mention that in this article, we will only be using BERT Tokenizer. In the next article we will use BERT Embeddings along with tokenizer.

Let's now see if our BERT tokenizer is actually working. To do so, we will tokenize a random sentence, as shown below:

tokenizer.tokenize("don't be so judgmental")

Output:

['don', "'", 't', 'be', 'so', 'judgment', '##al']

You can see that the text has been successfully tokenized. You can also get the ids of the tokens using the convert_tokens_to_ids() of the tokenizer object. Look at the following script:

tokenizer.convert_tokens_to_ids(tokenizer.tokenize("dont be so judgmental"))

Output:

[2123, 2102, 2022, 2061, 8689, 2389]

Now will define a function that accepts a single text review and returns the ids of the tokenized words in the review. Execute the following script:

def tokenize_reviews(text_reviews):
    return tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text_reviews))

And execute the following script to actually tokenize all the reviews in the input dataset:

tokenized_reviews = [tokenize_reviews(review) for review in reviews]

Prerparing Data For Training

The reviews in our dataset have varying lengths. Some reviews are very small while others are very long. To train the model, the input sentences should be of equal length. To create sentences of equal length, one way is to pad the shorter sentences by 0s. However, this can result in a sparse matrix contain large number of 0s. The other way is to pad sentences within each batch. Since we will be training the model in batches, we can pad the sentences within the training batch locally depending upon the length of the longest sentence. To do so, we first need to find the length of each sentence.

The following script creates a list of lists where each sublist contains tokenized review, the label of the review and the length of the review:

reviews_with_len = [[review, y[i], len(review)]
                 for i, review in enumerate(tokenized_reviews)]

In our dataset, the first half of the reviews are positive while the last half contains negative reviews. Therefore, in order to have both positive and negative reviews in the training batches we need to shuffle the reviews. The following script shuffles the data randomly:

random.shuffle(reviews_with_len)

Once the data is shuffled, we will sort the data by the length of the reviews. To do so, we will use the sort() function of the list and will tell it that we want to sort the list with respect to the third item in the sublist i.e. the length of the review.

reviews_with_len.sort(key=lambda x: x[2])

Once the reviews are sorted by length, we can remove the length attribute from all the reviews. Execute the following script to do so:

sorted_reviews_labels = [(review_lab[0], review_lab[1]) for review_lab in reviews_with_len]

Once the reviews are sorted we will convert thed dataset so that it can be used to train TensorFlow 2.0 models. Run the following code to convert the sorted dataset into a TensorFlow 2.0-compliant input dataset shape.

processed_dataset = tf.data.Dataset.from_generator(lambda: sorted_reviews_labels, output_types=(tf.int32, tf.int32))

Finally, we can now pad our dataset for each batch. The batch size we are going to use is 32 which means that after processing 32 reviews, the weights of the neural network will be updated. To pad the reviews locally with respect to batches, execute the following:

BATCH_SIZE = 32
batched_dataset = processed_dataset.padded_batch(BATCH_SIZE, padded_shapes=((None, ), ()))

Let's print the first batch and see how padding has been applied to it:

next(iter(batched_dataset))

Output:

(<tf.Tensor: shape=(32, 21), dtype=int32, numpy=
 array([[ 2054,  5896,  2054,  2466,  2054,  6752,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0],
        [ 3078,  5436,  3078,  3257,  3532,  7613,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0],
        [ 3191,  1996,  2338,  5293,  1996,  3185,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0],
        [ 2062, 23873,  3993,  2062, 11259,  2172,  2172,  2062, 14888,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0],
        [ 1045,  2876,  9278,  2023,  2028,  2130,  2006,  7922, 12635,
          2305,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0],
      ......
      
        [ 7244,  2092,  2856, 10828,  1997, 10904,  2402,  2472,  3135,
          2293,  2466,  2007, 10958,  8428, 10102,  1999,  1996,  4281,
          4276,  3773,     0],
        [ 2005,  5760,  7788,  4393,  8808,  2498,  2064, 12826,  2000,
          1996, 11056,  3152,  3811, 16755,  2169,  1998,  2296,  2028,
          1997,  2068,     0],
        [ 2307,  3185,  2926,  1996,  2189,  3802,  2696,  2508,  2012,
          2197,  2023,  8847,  6702,  2043,  2017,  2031,  2633,  2179,
          2008,  2569,  2619],
        [ 2028,  1997,  1996,  4569, 15580,  2102,  5691,  2081,  1999,
          3522,  2086,  2204, 23191,  5436,  1998, 11813,  6370,  2191,
          2023,  2028,  4438],
        [ 2023,  3185,  2097,  2467,  2022,  5934,  1998,  3185,  4438,
          2004,  2146,  2004,  2045,  2024,  2145,  2111,  2040,  6170,
          3153,  1998,  2552]], dtype=int32)>,
 <tf.Tensor: shape=(32,), dtype=int32, numpy=
 array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1,
        1, 1, 0, 1, 0, 1, 1, 0, 1, 1], dtype=int32)>)

The above output shows the first five and last five padded reviews. From the last five reviews, you can see that the total number of words in the largest sentence were 21. Therefore, in the first five reviews the 0s are added at the end of the sentences so that their total length is also 21. The padding for the next batch will be different depending upon the size of the largest sentence in the batch.

Once we have applied padding to our dataset, the next step is to divide the dataset into test and training sets. We can do that with the help of following code:

TOTAL_BATCHES = math.ceil(len(sorted_reviews_labels) / BATCH_SIZE)
TEST_BATCHES = TOTAL_BATCHES // 10
batched_dataset.shuffle(TOTAL_BATCHES)
test_data = batched_dataset.take(TEST_BATCHES)
train_data = batched_dataset.skip(TEST_BATCHES)

In the code above we first find the total number of batches by dividing the total records by 32. Next, 10% of the data is left aside for testing. To do so, we use the take() method of batched_dataset() object to store 10% of the data in the test_data variable. The remaining data is stored in the train_data object for training using the skip() method.

The dataset has been prepared and now we are ready to create our text classification model.

Creating the Model

Now we are all set to create our model. To do so, we will create a class named TEXT_MODEL that inherits from the tf.keras.Model class. Inside the class we will define our model layers. Our model will consist of three convolutional neural network layers. You can use LSTM layers instead and can also increase or decrease the number of layers. I have copied the number and types of layers from SuperDataScience's Google colab notebook and this architecture seems to work quite well for the IMDB Movie reviews dataset as well.

Let's now create out model class:

class TEXT_MODEL(tf.keras.Model):
    
    def __init__(self,
                 vocabulary_size,
                 embedding_dimensions=128,
                 cnn_filters=50,
                 dnn_units=512,
                 model_output_classes=2,
                 dropout_rate=0.1,
                 training=False,
                 name="text_model"):
        super(TEXT_MODEL, self).__init__(name=name)
        
        self.embedding = layers.Embedding(vocabulary_size,
                                          embedding_dimensions)
        self.cnn_layer1 = layers.Conv1D(filters=cnn_filters,
                                        kernel_size=2,
                                        padding="valid",
                                        activation="relu")
        self.cnn_layer2 = layers.Conv1D(filters=cnn_filters,
                                        kernel_size=3,
                                        padding="valid",
                                        activation="relu")
        self.cnn_layer3 = layers.Conv1D(filters=cnn_filters,
                                        kernel_size=4,
                                        padding="valid",
                                        activation="relu")
        self.pool = layers.GlobalMaxPool1D()
        
        self.dense_1 = layers.Dense(units=dnn_units, activation="relu")
        self.dropout = layers.Dropout(rate=dropout_rate)
        if model_output_classes == 2:
            self.last_dense = layers.Dense(units=1,
                                           activation="sigmoid")
        else:
            self.last_dense = layers.Dense(units=model_output_classes,
                                           activation="softmax")
    
    def call(self, inputs, training):
        l = self.embedding(inputs)
        l_1 = self.cnn_layer1(l) 
        l_1 = self.pool(l_1) 
        l_2 = self.cnn_layer2(l) 
        l_2 = self.pool(l_2)
        l_3 = self.cnn_layer3(l)
        l_3 = self.pool(l_3) 
        
        concatenated = tf.concat([l_1, l_2, l_3], axis=-1) # (batch_size, 3 * cnn_filters)
        concatenated = self.dense_1(concatenated)
        concatenated = self.dropout(concatenated, training)
        model_output = self.last_dense(concatenated)
        
        return model_output

The above script is pretty straightforward. In the constructor of the class, we initialze some attributes with default values. These values will be replaced later on by the values passed when the object of the TEXT_MODEL class is created.

Next, three convolutional neural network layers have been initialized with the kernel or filter values of 2, 3, and 4, respectively. Again, you can change the filter sizes if you want.

Next, inside the call() function, global max pooling is applied to the output of each of the convolutional neural network layer. Finally, the three convolutional neural network layers are concatenated together and their output is fed to the first densely connected neural network. The second densely connected neural network is used to predict the output sentiment since it only contains 2 classes. In case you have more classes in the output, you can updated the output_classes variable accordingly.

Let's now define the values for the hyper parameters of our model.

VOCAB_LENGTH = len(tokenizer.vocab)
EMB_DIM = 200
CNN_FILTERS = 100
DNN_UNITS = 256
OUTPUT_CLASSES = 2

DROPOUT_RATE = 0.2

NB_EPOCHS = 5

Next, we need to create an object of the TEXT_MODEL class and pass the hyper paramters values that we defined in the last step to the constructor of the TEXT_MODEL class.

text_model = TEXT_MODEL(vocabulary_size=VOCAB_LENGTH,
                        embedding_dimensions=EMB_DIM,
                        cnn_filters=CNN_FILTERS,
                        dnn_units=DNN_UNITS,
                        model_output_classes=OUTPUT_CLASSES,
                        dropout_rate=DROPOUT_RATE)

Before we can actually train the model we need to compile it. The following script compiles the model:

if OUTPUT_CLASSES == 2:
    text_model.compile(loss="binary_crossentropy",
                       optimizer="adam",
                       metrics=["accuracy"])
else:
    text_model.compile(loss="sparse_categorical_crossentropy",
                       optimizer="adam",
                       metrics=["sparse_categorical_accuracy"])

Finally to train our model, we can use the fit method of the model class.

text_model.fit(train_data, epochs=NB_EPOCHS)

Here is the result after 5 epochs:

Epoch 1/5
1407/1407 [==============================] - 381s 271ms/step - loss: 0.3037 - accuracy: 0.8661
Epoch 2/5
1407/1407 [==============================] - 381s 271ms/step - loss: 0.1341 - accuracy: 0.9521
Epoch 3/5
1407/1407 [==============================] - 383s 272ms/step - loss: 0.0732 - accuracy: 0.9742
Epoch 4/5
1407/1407 [==============================] - 381s 271ms/step - loss: 0.0376 - accuracy: 0.9865
Epoch 5/5
1407/1407 [==============================] - 383s 272ms/step - loss: 0.0193 - accuracy: 0.9931
<tensorflow.python.keras.callbacks.History at 0x7f5f65690048>

You can see that we got an accuracy of 99.31% on the training set.

Let's now evaluate our model's performance on the test set:

results = text_model.evaluate(test_dataset)
print(results)

Output:

156/Unknown - 4s 28ms/step - loss: 0.4428 - accuracy: 0.8926[0.442786190037926, 0.8926282]

From the output, we can see that we got an accuracy of 89.26% on the test set.

Conclusion

In this article you saw how we can use BERT Tokenizer to create word embeddings that can be used to perform text classification. We performed sentimental analysis of IMDB movie reviews and achieved an accuracy of 89.26% on the test set. In this article we did not use BERT embeddings, we only used BERT Tokenizer to tokenize the words. In the next article, you will see how BERT Tokenizer along with BERT Embeddings can be used to perform text classification.

Author image
About Usman Malik
Paris (France) Twitter
Programmer | Blogger | Data Science Enthusiast | PhD To Be | Arsenal FC for Life