This is the 20th article in my series of articles on Python for NLP. In the last few articles, we have been exploring deep learning techniques to perform a variety of machine learning tasks, and you should also be familiar with the concept of word embeddings. Word embeddings are a way to convert textual information into numeric form, which in turn can be used as input to statistical algorithms. In my article on word embeddings, I explained how we can create our own word embeddings and how we can use built-in word embeddings such as GloVe.
In this article, we are going to study FastText, which is another extremely useful module for word embedding and text classification. FastText was developed by Facebook and has shown excellent results on many NLP problems, such as semantic similarity detection and text classification.
In this article, we will briefly explore the FastText library. This article is divided into two sections. In the first section, we will see how the FastText library creates vector representations that can be used to find semantic similarities between words. In the second section, we will see the application of the FastText library for text classification.
FastText for Semantic Similarity
FastText supports both the Continuous Bag of Words and Skip-Gram models. In this article, we will implement the skip-gram model to learn vector representations of words from the Wikipedia articles on artificial intelligence, machine learning, deep learning, and neural networks. Since these topics are quite similar, combining them gives us a substantial amount of data to create a corpus. You can add more topics of a similar nature if you want.
As a first step, we need to import the required libraries. We will make use of the Wikipedia library for Python, which can be downloaded via the following command:
$ pip install wikipedia
Importing Libraries
The following script imports the required libraries into our application:
from keras.preprocessing.text import Tokenizer
from gensim.models.fasttext import FastText
import numpy as np
import matplotlib.pyplot as plt
import nltk
from string import punctuation
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize
from nltk import WordPunctTokenizer
import wikipedia
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
en_stop = set(nltk.corpus.stopwords.words('english'))
%matplotlib inline
You can see that we are using the FastText module from the gensim.models.fasttext library. For word representation and semantic similarity, we can use the Gensim model for FastText. This model can run on Windows; however, for text classification we will have to use a Linux platform. We will see that in the next section.
Scraping Wikipedia Articles
In this step, we will scrape the required Wikipedia articles. Look at the script below:
artificial_intelligence = wikipedia.page("Artificial Intelligence").content
machine_learning = wikipedia.page("Machine Learning").content
deep_learning = wikipedia.page("Deep Learning").content
neural_network = wikipedia.page("Neural Network").content
artificial_intelligence = sent_tokenize(artificial_intelligence)
machine_learning = sent_tokenize(machine_learning)
deep_learning = sent_tokenize(deep_learning)
neural_network = sent_tokenize(neural_network)
artificial_intelligence.extend(machine_learning)
artificial_intelligence.extend(deep_learning)
artificial_intelligence.extend(neural_network)
To scrape a Wikipedia page, we can use the page method from the wikipedia module. The name of the page that you want to scrape is passed as a parameter to the page method. The method returns a WikipediaPage object, which you can then use to retrieve the page contents via the content attribute, as shown in the above script.
The scraped content from the four Wikipedia pages is then tokenized into sentences using the sent_tokenize method. The sent_tokenize method returns a list of sentences. The sentences for the four pages are tokenized separately. Finally, sentences from the four articles are joined together via the extend method.
Data Preprocessing
The next step is to clean our text data by removing punctuation and numbers. We will also convert the data to lowercase. The words in our data will be lemmatized to their root form. Furthermore, stop words and words with a length of less than 4 will be removed from the corpus.
The preprocess_text function, as defined below, performs these preprocessing tasks.
import re
from nltk.stem import WordNetLemmatizer
stemmer = WordNetLemmatizer()
def preprocess_text(document):
    # Remove all the special characters
    document = re.sub(r'\W', ' ', str(document))

    # Remove all single characters surrounded by spaces
    document = re.sub(r'\s+[a-zA-Z]\s+', ' ', document)

    # Remove single characters from the start
    document = re.sub(r'^[a-zA-Z]\s+', ' ', document)

    # Substitute multiple spaces with a single space
    document = re.sub(r'\s+', ' ', document, flags=re.I)

    # Remove a prefixed 'b' (left over from byte strings)
    document = re.sub(r'^b\s+', '', document)

    # Convert to lowercase
    document = document.lower()

    # Lemmatize, then remove stop words and very short words
    tokens = document.split()
    tokens = [stemmer.lemmatize(word) for word in tokens]
    tokens = [word for word in tokens if word not in en_stop]
    tokens = [word for word in tokens if len(word) > 3]

    preprocessed_text = ' '.join(tokens)
    return preprocessed_text
Let's see if our function performs the desired task by preprocessing a dummy sentence:
sent = preprocess_text("Artificial intelligence, is the most advanced technology of the present era")
print(sent)
final_corpus = [preprocess_text(sentence) for sentence in artificial_intelligence if sentence.strip() !='']
word_punctuation_tokenizer = nltk.WordPunctTokenizer()
word_tokenized_corpus = [word_punctuation_tokenizer.tokenize(sent) for sent in final_corpus]
The preprocessed sentence looks like this:
artificial intelligence advanced technology present
You can see that the punctuation and stop words have been removed, and the sentence has been lemmatized. Furthermore, words with a length of less than 4, such as "era", have also been removed. This threshold was chosen arbitrarily for this test, so you may allow words with smaller or greater lengths in the corpus.
Creating Word Representations
We have preprocessed our corpus. Now is the time to create word representations using FastText. Let's first define the hyper-parameters for our FastText model:
embedding_size = 60
window_size = 40
min_word = 5
down_sampling = 1e-2
Here embedding_size is the size of the embedding vector. In other words, each word in our corpus will be represented as a 60-dimensional vector. The window_size is the number of words occurring before and after the target word that are used to learn its representation. This might sound tricky; however, in the skip-gram model we input a word to the algorithm and the output is the context words. If the window size is 40, for each input there will be up to 80 outputs: the 40 words that occur before the input word and the 40 words that occur after it. The word embeddings for the input word are learned using these output words.
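To make the idea concrete, here is a minimal, purely illustrative sketch of how (target, context) pairs could be generated for a skip-gram style model with a given window size. This is not Gensim's internal implementation, just a demonstration of which context words fall inside the window:
def skipgram_pairs(tokens, window_size):
    # For every target word, collect the words that fall within
    # window_size positions before and after it.
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

print(skipgram_pairs(["deep", "learning", "uses", "neural", "networks"], 2))
# [('deep', 'learning'), ('deep', 'uses'), ('learning', 'deep'), ('learning', 'uses'), ...]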
The next hyper-parameter is min_word, which specifies the minimum frequency a word must have in the corpus for a word representation to be generated for it. Finally, the most frequently occurring words will be down-sampled according to the threshold specified by the down_sampling attribute.
Let's now create our FastText model for word representations.
%%time
ft_model = FastText(word_tokenized_corpus,
                    size=embedding_size,
                    window=window_size,
                    min_count=min_word,
                    sample=down_sampling,
                    sg=1,
                    iter=100)
All the parameters in the above script are self-explanatory, except sg. The sg parameter defines the type of model that we want to create. A value of 1 specifies that we want to create a skip-gram model, whereas zero specifies the continuous bag of words model, which is also the default value.
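Note that the parameter names above follow the older Gensim 3.x API. If you are running Gensim 4.0 or later, a couple of constructor arguments were renamed; a sketch of the equivalent call would look like this:
ft_model = FastText(word_tokenized_corpus,
                    vector_size=embedding_size,  # 'size' was renamed to 'vector_size' in Gensim 4.x
                    window=window_size,
                    min_count=min_word,
                    sample=down_sampling,
                    sg=1,
                    epochs=100)  # 'iter' was renamed to 'epochs' in Gensim 4.x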
Execute the training script above. It may take some time to run. On my machine, the time statistics for the above code are as follows:
CPU times: user 1min 45s, sys: 434 ms, total: 1min 45s
Wall time: 57.2 s
Let's now see the word representation for the word "artificial". To do so, you can use the wv attribute of the FastText object and index it with the word.
print(ft_model.wv['artificial'])
Here is the output:
[-3.7653010e-02 -4.5558015e-01 3.2035065e-01 -1.5289043e-01
4.0645871e-02 -1.8946664e-01 7.0426887e-01 2.8806925e-01
-1.8166199e-01 1.7566417e-01 1.1522485e-01 -3.6525184e-01
-6.4378887e-01 -1.6650060e-01 7.4625671e-01 -4.8166099e-01
2.0884991e-01 1.8067230e-01 -6.2647951e-01 2.7614883e-01
-3.6478557e-02 1.4782918e-02 -3.3124462e-01 1.9372456e-01
4.3028224e-02 -8.2326338e-02 1.0356739e-01 4.0792203e-01
-2.0596240e-02 -3.5974573e-02 9.9928051e-02 1.7191900e-01
-2.1196717e-01 6.4424530e-02 -4.4705093e-02 9.7391091e-02
-2.8846195e-01 8.8607501e-03 1.6520244e-01 -3.6626378e-01
-6.2017748e-04 -1.5083785e-01 -1.7499258e-01 7.1994811e-02
-1.9868813e-01 -3.1733567e-01 1.9832127e-01 1.2799081e-01
-7.6522082e-01 5.2335665e-02 -4.5766738e-01 -2.7947658e-01
3.7890410e-03 -3.8761377e-01 -9.3001537e-02 -1.7128626e-01
-1.2923178e-01 3.9627206e-01 -3.6673656e-01 2.2755004e-01]
In the output above, you can see the 60-dimensional vector for the word "artificial".
Let's now find the top 5 most similar words for the words 'artificial', 'intelligence', 'machine', 'network', 'recurrent', and 'deep'. You can choose any number of words. The following script prints the specified words along with their 5 most similar words.
semantically_similar_words = {words: [item[0] for item in ft_model.wv.most_similar([words], topn=5)]
                              for words in ['artificial', 'intelligence', 'machine', 'network', 'recurrent', 'deep']}

for k, v in semantically_similar_words.items():
    print(k+":"+str(v))
The output is as follows:
artificial:['intelligence', 'inspired', 'book', 'academic', 'biological']
intelligence:['artificial', 'human', 'people', 'intelligent', 'general']
machine:['ethic', 'learning', 'concerned', 'argument', 'intelligence']
network:['neural', 'forward', 'deep', 'backpropagation', 'hidden']
recurrent:['rnns', 'short', 'schmidhuber', 'shown', 'feedforward']
deep:['convolutional', 'speech', 'network', 'generative', 'neural']
We can also find the cosine similarity between the vectors for any two words, as shown below:
print(ft_model.wv.similarity(w1='artificial', w2='intelligence'))
The output shows a value of "0.7481". Cosine similarity can range from -1 to 1, and a higher value means higher similarity.
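If you want to verify this number yourself, the similarity is simply the cosine of the angle between the two word vectors, which you can compute directly with NumPy:
import numpy as np

v1 = ft_model.wv['artificial']
v2 = ft_model.wv['intelligence']

# Cosine similarity: dot product of the vectors divided by the product of their norms
cosine = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(cosine)  # should closely match the value returned by ft_model.wv.similarity()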
Visualizing Word Similarities
Though each word in our model is represented as a 60-dimensional vector, we can use the principal component analysis (PCA) technique to find two principal components. The two principal components can then be used to plot the words in a two-dimensional space. First, however, we need to create a list of all the words in the semantically_similar_words dictionary. The following script does that:
from sklearn.decomposition import PCA
all_similar_words = sum([[k] + v for k, v in semantically_similar_words.items()], [])
print(all_similar_words)
print(type(all_similar_words))
print(len(all_similar_words))
In the script above, we flatten all the key-value pairs in the semantically_similar_words dictionary into a single list. Each key in the dictionary is a word, and the corresponding value is a list of its semantically similar words. Since we found the top 5 most similar words for each of the 6 words 'artificial', 'intelligence', 'machine', 'network', 'recurrent', and 'deep', you will see that there are 36 items in the all_similar_words list (each of the 6 words plus its 5 most similar words).
Next, we have to find the word vectors for all these words, and then use PCA to reduce the dimensionality of the word vectors from 60 to 2. We can then use plt, which is an alias of the matplotlib.pyplot module, to plot the words in a two-dimensional vector space.
Execute the following script to visualize the words:
word_vectors = ft_model.wv[all_similar_words]
pca = PCA(n_components=2)
p_comps = pca.fit_transform(word_vectors)
word_names = all_similar_words
plt.figure(figsize=(18, 10))
plt.scatter(p_comps[:, 0], p_comps[:, 1], c='red')
for word, x, y in zip(word_names, p_comps[:, 0], p_comps[:, 1]):
    plt.annotate(word, xy=(x+0.06, y+0.03), xytext=(0, 0), textcoords='offset points')
The output of the above script looks like this:
You can see that the words that frequently occur together in the text are close to each other in the two-dimensional plane as well. For instance, the words "deep" and "network" are almost overlapping. Similarly, the words "feedforward" and "backpropagation" are also very close.
Now we know how to create word embeddings using FastText. In the next section, we will see how FastText can be used for text classification tasks.
FastText for Text Classification
Text classification refers to classifying textual data into predefined categories based on the contents of the text. Sentiment analysis, spam detection, and tag prediction are some of the most common use cases for text classification.
The FastText text classification module can only be run on Linux or macOS. If you are a Windows user, you can use Google Colaboratory to run the FastText text classification module. All the scripts in this section have been run using Google Colaboratory.
The Dataset
The dataset for this article can be downloaded from this Kaggle link. The dataset contains multiple files, but we are only interested in the yelp_review.csv file. The file contains more than 5.2 million reviews of different businesses, including restaurants, bars, dentists, doctors, beauty salons, etc. However, we will only be using the first 50,000 records to train our model, due to memory constraints. You can try with more records if you want.
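The script below loads a file named yelp_reviews_short.csv, so you may want to create this truncated copy of the Kaggle file yourself first. A minimal sketch, assuming the full file is stored as yelp_review.csv in the same Colab Datasets folder (adjust the paths to match your own setup):
import pandas as pd

# Read only the first 50,000 rows of the full Yelp reviews file
full_reviews = pd.read_csv("/content/drive/My Drive/Colab Datasets/yelp_review.csv", nrows=50000)

# Save the truncated copy used in the rest of this section
full_reviews.to_csv("/content/drive/My Drive/Colab Datasets/yelp_review_short.csv", index=False)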
Let's import the required libraries and load the dataset:
import pandas as pd
import numpy as np
yelp_reviews = pd.read_csv("/content/drive/My Drive/Colab Datasets/yelp_review_short.csv")
bins = [0,2,5]
review_names = ['negative', 'positive']
yelp_reviews['reviews_score'] = pd.cut(yelp_reviews['stars'], bins, labels=review_names)
yelp_reviews.head()
In the script above, we load the yelp_review_short.csv file, which contains the 50,000 reviews, with the pd.read_csv function.
We will simplify our problem by converting the numerical values for the reviews into categorical ones. This is done by adding a new column, reviews_score, to our dataset. If the user review has a value of 1 or 2 in the stars column (which rates the business on a 1-5 scale), the reviews_score column will have the string value negative. If the rating is between 3 and 5 in the stars column, the reviews_score column will contain the value positive. This makes our problem a binary classification problem.
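To make the binning explicit: pd.cut with bins [0, 2, 5] assigns star ratings in the interval (0, 2] to negative and ratings in (2, 5] to positive. A small standalone sketch illustrating this:
import pandas as pd

stars = pd.Series([1, 2, 3, 4, 5])
print(pd.cut(stars, bins=[0, 2, 5], labels=['negative', 'positive']))
# 1 and 2 map to 'negative'; 3, 4 and 5 map to 'positive'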
Finally, the head of the dataframe is printed.
Installing FastText
The next step is to download the FastText source code, which can be done using the wget command to fetch the archive from the GitHub repository, as shown in the following script:
!wget https://github.com/facebookresearch/fastText/archive/v0.1.0.zip
Note: If you are executing the above command from a Linux terminal, you don't have to prefix the command with !. In a Google Colaboratory notebook, any command after the ! is executed as a shell command rather than within the Python interpreter. Hence all non-Python commands here are prefixed by !.
If you run the above script and see the following results, that means FastText has been successfully downloaded:
--2019-08-16 15:05:05-- https://github.com/facebookresearch/fastText/archive/v0.1.0.zip
Resolving github.com (github.com)... 140.82.113.4
Connecting to github.com (github.com)|140.82.113.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://codeload.github.com/facebookresearch/fastText/zip/v0.1.0 [following]
--2019-08-16 15:05:05-- https://codeload.github.com/facebookresearch/fastText/zip/v0.1.0
Resolving codeload.github.com (codeload.github.com)... 192.30.255.121
Connecting to codeload.github.com (codeload.github.com)|192.30.255.121|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘v0.1.0.zip’
v0.1.0.zip [ <=> ] 92.06K --.-KB/s in 0.03s
2019-08-16 15:05:05 (3.26 MB/s) - ‘v0.1.0.zip’ saved [94267]
The next step is to unzip the downloaded archive. Simply type the following command:
!unzip v0.1.0.zip
Next, you have to navigate to the directory where you unzipped FastText and then execute the make command to build the C++ binaries. Execute the following steps:
cd fastText-0.1.0
!make
If you see the following output, that means FastText is successfully installed on your machine.
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/args.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/dictionary.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/productquantizer.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/matrix.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/qmatrix.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/vector.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/model.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/utils.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/fasttext.cc
c++ -pthread -std=c++0x -O3 -funroll-loops args.o dictionary.o productquantizer.o matrix.o qmatrix.o vector.o model.o utils.o fasttext.o src/main.cc -o fasttext
To verify the installation, execute the following command:
!./fasttext
You should see that these commands are supported by FastText:
usage: fasttext <command> <args>
The commands supported by FastText are:
supervised train a supervised classifier
quantize quantize a model to reduce the memory usage
test evaluate a supervised classifier
predict predict most likely labels
predict-prob predict most likely labels with probabilities
skipgram train a skipgram model
cbow train a cbow model
print-word-vectors print word vectors given a trained model
print-sentence-vectors print sentence vectors given a trained model
nn query for nearest neighbors
analogies query for analogies
Text Classification
Before we train FastText models to perform text classification, it is pertinent to mention that FastText accepts data in a special format, where each line starts with the label prefixed by __label__, followed by the text:
__label__tag This is sentence 1
__label__tag2 This is sentence 2.
If we look at our dataset, it is not in the desired format. The text with positive sentiment should look like this:
__label__positive burgers are very big portions here.
Similarly, negative reviews should look like this:
__label__negative They do not use organic ingredients, but I thi...
The following script filters the reviews_score and text columns from the dataset and then prefixes __label__ to all the values in the reviews_score column. Similarly, the \n and \t characters are replaced by a space in the text column. Finally, the updated dataframe is written to disk as yelp_reviews_updated.txt.
import pandas as pd
import csv

# Keep only the label and text columns
col = ['reviews_score', 'text']
yelp_reviews = yelp_reviews[col]

# Prefix each label with __label__ as required by FastText
yelp_reviews['reviews_score'] = ['__label__' + s for s in yelp_reviews['reviews_score']]

# Replace newlines and tabs in the review text with spaces
yelp_reviews['text'] = yelp_reviews['text'].replace('\n', ' ', regex=True).replace('\t', ' ', regex=True)

# Write the labeled data to disk in FastText's expected plain-text format
yelp_reviews.to_csv(r'/content/drive/My Drive/Colab Datasets/yelp_reviews_updated.txt', index=False, sep=' ', header=False, quoting=csv.QUOTE_NONE, quotechar="", escapechar=" ")
Let's now print the head of the updated yelp_reviews dataframe.
yelp_reviews.head()
You should see the following results:
reviews_score text
0 __label__positive Super simple place but amazing nonetheless. It...
1 __label__positive Small unassuming place that changes their menu...
2 __label__positive Lester's is located in a beautiful neighborhoo...
3 __label__positive Love coming here. Yes the place always needs t...
4 __label__positive Had their chocolate almond croissant and it wa...
Similarly, the tail of the dataframe looks like this:
reviews_score text
49995 __label__positive This is an awesome consignment store! They hav...
49996 __label__positive Awesome laid back atmosphere with made-to-orde...
49997 __label__positive Today was my first appointment and I can hones...
49998 __label__positive I love this chic salon. They use the best prod...
49999 __label__positive This place is delicious. All their meats and s...
We have converted our dataset into the required shape. The next step is to divide our data into training and test sets. 80% of the data, i.e. the first 40,000 of the 50,000 records, will be used to train the model, while the remaining 20% (the last 10,000 records) will be used to evaluate the performance of the algorithm.
The following script divides the data into training and test sets:
!head -n 40000 "/content/drive/My Drive/Colab Datasets/yelp_reviews_updated.txt" > "/content/drive/My Drive/Colab Datasets/yelp_reviews_train.txt"
!tail -n 10000 "/content/drive/My Drive/Colab Datasets/yelp_reviews_updated.txt" > "/content/drive/My Drive/Colab Datasets/yelp_reviews_test.txt"
Once the above script is executed, the yelp_reviews_train.txt file will be generated, which contains the training data. Similarly, the newly generated yelp_reviews_test.txt file will contain the test data.
Now is the time to train our FastText text classification algorithm.
%%time
!./fasttext supervised -input "/content/drive/My Drive/Colab Datasets/yelp_reviews_train.txt" -output model_yelp_reviews
To train the algorithm we have to use the supervised command and pass it the input file. The model name is specified after the -output keyword. The above script will result in a trained text classification model called model_yelp_reviews.bin. Here is the output for the script above:
Read 4M words
Number of words: 177864
Number of labels: 2
Progress: 100.0% words/sec/thread: 2548017 lr: 0.000000 loss: 0.246120 eta: 0h0m
CPU times: user 212 ms, sys: 48.6 ms, total: 261 ms
Wall time: 15.6 s
You can take a look at the model via the !ls command as shown below:
!ls
Here is the output:
args.o Makefile quantization-results.sh
classification-example.sh matrix.o README.md
classification-results.sh model.o src
CONTRIBUTING.md model_yelp_reviews.bin tutorials
dictionary.o model_yelp_reviews.vec utils.o
eval.py PATENTS vector.o
fasttext pretrained-vectors.md wikifil.pl
fasttext.o productquantizer.o word-vector-example.sh
get-wikimedia.sh qmatrix.o yelp_reviews_train.txt
LICENSE quantization-example.sh
You can see model_yelp_reviews.bin in the above list of files.
Finally, to test the model you can use the test command. You have to specify the model name and the test file after the test command, as shown below:
!./fasttext test model_yelp_reviews.bin "/content/drive/My Drive/Colab Datasets/yelp_reviews_test.txt"
The output of the above script looks like this:
N 10000
P@1 0.909
R@1 0.909
Number of examples: 10000
Here P@1 refers to precision at one and R@1 refers to recall at one, i.e. precision and recall computed over the single top predicted label. You can see that our model achieves a precision and recall of 0.909, which is pretty good.
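If you want to inspect individual predictions rather than the aggregate metrics, you can use the predict command (listed in the usage output earlier), which prints the most likely label for each line of a file. For example, the following command (piped through head to keep the output short) should print the predicted labels for the first ten test reviews:
!./fasttext predict model_yelp_reviews.bin "/content/drive/My Drive/Colab Datasets/yelp_reviews_test.txt" 1 | head -n 10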
Let's now try cleaning our text by separating out punctuation and special characters and converting it to lowercase, to improve the uniformity of the text. The following script cleans the training set:
!cat "/content/drive/My Drive/Colab Datasets/yelp_reviews_train.txt" | sed -e "s/\([.\!?,’/()]\)/ \1 /g" | tr "[:upper:]" "[:lower:]" > "/content/drive/My Drive/Colab Datasets/yelp_reviews_train_clean.txt"
And the following script cleans the test set:
"/content/drive/My Drive/Colab Datasets/yelp_reviews_test.txt" | sed -e "s/\([.\!?,’/()]\)/ \1 /g" | tr "[:upper:]" "[:lower:]" > "/content/drive/My Drive/Colab Datasets/yelp_reviews_test_clean.txt"
Now, we will train the model on the cleaned training set:
%%time
!./fasttext supervised -input "/content/drive/My Drive/Colab Datasets/yelp_reviews_train_clean.txt" -output model_yelp_reviews
And finally, we will use the model trained on the cleaned training set to make predictions on the cleaned test set:
!./fasttext test model_yelp_reviews.bin "/content/drive/My Drive/Colab Datasets/yelp_reviews_test_clean.txt"
The output of the above script is as follows:
N 10000
P@1 0.915
R@1 0.915
Number of examples: 10000
You can see a slight increase in both precision and recall. To further improve the model, you can increase the number of epochs and the learning rate of the model. The following script sets the number of epochs to 30 and the learning rate to 0.5.
%%time
!./fasttext supervised -input "/content/drive/My Drive/Colab Datasets/yelp_reviews_train_clean.txt" -output model_yelp_reviews -epoch 30 -lr 0.5
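Another hyper-parameter worth experimenting with is word n-grams. The -wordNgrams option makes the classifier use, for example, bigrams in addition to single words, which often helps on sentiment tasks. A sketch of such a run (the exact value, and the resulting scores, are left for you to explore):
%%time
!./fasttext supervised -input "/content/drive/My Drive/Colab Datasets/yelp_reviews_train_clean.txt" -output model_yelp_reviews -epoch 30 -lr 0.5 -wordNgrams 2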
You can try different numbers and see if you can get better results. Don't forget to share your results in the comments!
Conclusion
The FastText model has recently been shown to achieve state-of-the-art results for word embeddings and text classification tasks on many datasets. It is very easy to use and lightning fast compared to other word embedding models.
In this article, we briefly explored how to find semantic similarities between different words by creating word embeddings with FastText. The second part of the article explained how to perform text classification via the FastText library.