Introduction
As I write this article, 1,907,223,370 websites are active on the internet and 2,722,460 emails are being sent per second. This is an unbelievably huge amount of data, and it is impossible for a user to get insights from such volumes of it. Furthermore, a large portion of this data is either redundant or doesn't contain much useful information. The most efficient way to access the most important parts of the data, without having to sift through the redundant and insignificant parts, is to summarize it so that it contains only non-redundant, useful information. The data can be in any form, such as audio, video, images, and text. In this article, we will see how we can use automatic text summarization techniques to summarize text data.
Text summarization is a subdomain of Natural Language Processing (NLP) that deals with extracting summaries from huge chunks of text. There are two main types of techniques used for text summarization: NLP-based techniques and deep learning-based techniques. In this article, we will see a simple NLP-based technique for text summarization. We will not use any machine learning library in this article. Rather, we will simply use Python's NLTK library for summarizing Wikipedia articles.
Text Summarization Steps
I will explain the steps involved in text summarization using NLP techniques with the help of an example.
The following is a paragraph from one of the famous speeches by Denzel Washington at the 48th NAACP Image Awards:
So, keep working. Keep striving. Never give up. Fall down seven times, get up eight. Ease is a greater threat to progress than hardship. Ease is a greater threat to progress than hardship. So, keep moving, keep growing, keep learning. See you at work.
We can see from the paragraph above that he is basically motivating others to work hard and never give up. To summarize the above paragraph using NLP-based techniques we need to follow a set of steps, which will be described in the following sections.
Convert Paragraphs to Sentences
We first need to convert the whole paragraph into sentences. The most common way of converting paragraphs to sentences is to split the paragraph whenever a period is encountered. So if we split the paragraph under discussion into sentences, we get the following sentences (a short code sketch of this step follows the list):
- So, keep working
- Keep striving
- Never give up
- Fall down seven times, get up eight
- Ease is a greater threat to progress than hardship
- Ease is a greater threat to progress than hardship
- So, keep moving, keep growing, keep learning
- See you at work
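As a rough sketch of this step, the paragraph can be split into sentences with NLTK's sentence tokenizer (this assumes NLTK and its punkt tokenizer data are already available; installation is covered later in the article):

import nltk

paragraph = ("So, keep working. Keep striving. Never give up. Fall down seven times, get up "
             "eight. Ease is a greater threat to progress than hardship. Ease is a greater "
             "threat to progress than hardship. So, keep moving, keep growing, keep learning. "
             "See you at work.")

# Split the paragraph into a list of sentences
sentences = nltk.sent_tokenize(paragraph)
print(sentences)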
Text Preprocessing
After converting the paragraph into sentences, we need to remove all the special characters, stop words and numbers from all the sentences. After preprocessing, we get the following sentences (a code sketch of this step follows the list):
- keep working
- keep striving
- never give
- fall seven time get eight
- ease greater threat progress hardship
- ease greater threat progress hardship
- keep moving keep growing keep learning
- see work
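A minimal sketch of this preprocessing, assuming the sentences list from the previous sketch and NLTK's English stop word list (note that the walkthrough above also reduces "times" to "time", a stemming step omitted here for brevity):

import re
import nltk

# NLTK's built-in list of English stop words ("up", "is", "a", ...)
stop_words = set(nltk.corpus.stopwords.words('english'))

preprocessed_sentences = []
for sentence in sentences:
    # Keep letters only, lowercase the text, and drop stop words
    cleaned = re.sub('[^a-zA-Z]', ' ', sentence).lower()
    words = [word for word in cleaned.split() if word not in stop_words]
    preprocessed_sentences.append(' '.join(words))

print(preprocessed_sentences)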
Tokenizing the Sentences
We need to tokenize all the sentences to get all the words that exist in the sentences. After tokenizing the sentences, we get the following list of words (a code sketch follows the list):
['keep',
'working',
'keep',
'striving',
'never',
'give',
'fall',
'seven',
'time',
'get',
'eight',
'ease',
'greater',
'threat',
'progress',
'hardship',
'ease',
'greater',
'threat',
'progress',
'hardship',
'keep',
'moving',
'keep',
'growing',
'keep',
'learning',
'see',
'work']
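Continuing the sketch, the cleaned sentences can be flattened into a single list of word tokens:

# Collect every word from the preprocessed sentences into one list
all_words = []
for sentence in preprocessed_sentences:
    all_words.extend(nltk.word_tokenize(sentence))

print(all_words)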
Find Weighted Frequency of Occurrence
Next, we need to find the weighted frequency of occurrence of each word. We can find the weighted frequency of a word by dividing its frequency by the frequency of the most occurring word. The following table contains the weighted frequencies for each word:
Word | Frequency | Weighted Frequency |
---|---|---|
ease | 2 | 0.40 |
eight | 1 | 0.20 |
fall | 1 | 0.20 |
get | 1 | 0.20 |
give | 1 | 0.20 |
greater | 2 | 0.40 |
growing | 1 | 0.20 |
hardship | 2 | 0.40 |
keep | 5 | 1.00 |
learning | 1 | 0.20 |
moving | 1 | 0.20 |
never | 1 | 0.20 |
progress | 2 | 0.40 |
see | 1 | 0.20 |
seven | 1 | 0.20 |
striving | 1 | 0.20 |
threat | 2 | 0.40 |
time | 1 | 0.20 |
work | 1 | 0.20 |
working | 1 | 0.20 |
Since the word "keep" has the highest frequency of 5, the weighted frequency of each word has been calculated by dividing its number of occurrences by 5.
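A sketch of this calculation, using the all_words list from the previous sketch:

# Count how often each word occurs
word_counts = {}
for word in all_words:
    word_counts[word] = word_counts.get(word, 0) + 1

# Divide each count by the count of the most frequent word ("keep" occurs 5 times)
max_count = max(word_counts.values())
weighted_frequencies = {word: count / max_count for word, count in word_counts.items()}

print(weighted_frequencies)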
Replace Words by Weighted Frequency in Original Sentences
The next step is to plug the weighted frequencies in place of the corresponding words in the original sentences and find their sum. Note that the weighted frequency of any word removed during preprocessing (stop words, punctuation, digits, etc.) is zero and therefore does not need to be added, as shown below (a code sketch of this step follows the table):
Sentence | Sum of Weighted Frequencies |
---|---|
So, keep working | 1 + 0.20 = 1.20 |
Keep striving | 1 + 0.20 = 1.20 |
Never give up | 0.20 + 0.20 = 0.40 |
Fall down seven times, get up eight | 0.20 + 0.20 + 0.20 + 0.20 + 0.20 = 1.0 |
Ease is a greater threat to progress than hardship | 0.40 + 0.40 + 0.40 + 0.40 + 0.40 = 2.0 |
Ease is a greater threat to progress than hardship | 0.40 + 0.40 + 0.40 + 0.40 + 0.40 = 2.0 |
So, keep moving, keep growing, keep learning | 1 + 0.20 + 1 + 0.20 + 1 + 0.20 = 3.60 |
See you at work | 0.20 + 0.20 = 0.40 |
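A sketch of this scoring step, using the original sentences and the weighted_frequencies dictionary from the previous sketches:

sentence_scores = {}
for sentence in sentences:
    score = 0
    # Clean the sentence the same way as during preprocessing, then sum the
    # weighted frequencies of its words; removed words contribute nothing
    for word in re.sub('[^a-zA-Z]', ' ', sentence).lower().split():
        score += weighted_frequencies.get(word, 0)
    sentence_scores[sentence] = score

print(sentence_scores)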
Sort Sentences in Descending Order of Sum
The final step is to sort the sentences in descending order of their sum. The sentences with the highest scores summarize the text. For instance, look at the sentence with the highest sum of weighted frequencies:
So, keep moving, keep growing, keep learning
You can easily judge what the paragraph is about. Similarly, you can add the sentence with the second highest sum of weighted frequencies to get a more informative summary. Take a look at the following sentences:
So, keep moving, keep growing, keep learning. Ease is a greater threat to progress than hardship.
These two sentences give a pretty good summary of what was said in the paragraph.
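To finish the sketch, the highest-scoring sentences can be picked with Python's built-in sorted function; taking the top two gives a summary close to the one shown above:

# Pick the two sentences with the highest scores
top_sentences = sorted(sentence_scores, key=sentence_scores.get, reverse=True)[:2]
summary = ' '.join(top_sentences)
print(summary)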
Summarizing Wikipedia Articles
Now we know how the process of text summarization works using a very simple NLP technique. In this section, we will use Python's NLTK library to summarize a Wikipedia article.
Fetching Articles from Wikipedia
Before we can summarize Wikipedia articles, we need to fetch them from the web. To do so, we will use a couple of libraries. The first library that we need to install is Beautiful Soup, which is a very useful Python utility for web scraping. Execute the following command at the command prompt to download the Beautiful Soup utility:
$ pip install beautifulsoup4
Another important library that we need to parse XML and HTML is the lxml library. Execute the following command at the command prompt to download lxml:

$ pip install lxml
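We will also rely on NLTK later in this article. If it is not already installed, it can be downloaded the same way:

$ pip install nltk

Then, in a Python session, fetch the tokenizer and stop word data once (exact resource names may vary slightly between NLTK versions):

import nltk
nltk.download('punkt')      # sentence and word tokenizer models
nltk.download('stopwords')  # English stop word list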
Now let's write some Python code to scrape data from the web. The article we are going to scrape is the Wikipedia article on Artificial Intelligence. Execute the following script:
import bs4 as bs
import urllib.request
import re

# Fetch the raw HTML of the Wikipedia article
scraped_data = urllib.request.urlopen('https://en.wikipedia.org/wiki/Artificial_intelligence')
article = scraped_data.read()

# Parse the HTML and collect the text of all paragraph tags
parsed_article = bs.BeautifulSoup(article, 'lxml')
paragraphs = parsed_article.find_all('p')

article_text = ""
for p in paragraphs:
    article_text += p.text
In the script above, we first import the libraries required for scraping data from the web. We then use the urlopen function from the urllib.request module to fetch the page and call the read function on the object it returns to read the data. To parse the data, we create a BeautifulSoup object and pass it the scraped data (article) along with the lxml parser.
In Wikipedia articles, all the text of the article is enclosed inside <p> tags. To retrieve the text, we call the find_all function on the BeautifulSoup object, passing the tag name as a parameter. The find_all function returns all the paragraphs in the article as a list, and we concatenate them to recreate the article text.
Once the article is scraped, we need to do some preprocessing.
Preprocessing
The first preprocessing step is to remove references from the article. In Wikipedia, references are enclosed in square brackets. The following script removes the square brackets and replaces the resulting multiple spaces with a single space. Take a look at the script below:
# Removing Square Brackets and Extra Spaces
article_text = re.sub(r'\[[0-9]*\]', ' ', article_text)
article_text = re.sub(r'\s+', ' ', article_text)
The article_text object now contains text without bracketed references. However, we do not want to remove anything else from it, since this is the original article. We will not remove other numbers, punctuation marks, and special characters from this text, because the summary sentences will be drawn from it and the weighted word frequencies will be matched against its sentences.
To clean the text and calculate weighted frequencies, we will create another object. Take a look at the following script:
# Removing special characters and digits
formatted_article_text = re.sub('[^a-zA-Z]', ' ', article_text)
formatted_article_text = re.sub(r'\s+', ' ', formatted_article_text)
Now we have two objects: article_text, which contains the original article, and formatted_article_text, which contains the formatted article. We will use formatted_article_text to create weighted frequency histograms for the words, and we will use these weighted frequencies to score the sentences in the article_text object.
Converting Text To Sentences
At this point we have preprocessed the data. Next, we need to tokenize the article into sentences. We will use the article_text object for tokenizing the article into sentences, since it contains full stops. The formatted_article_text object does not contain any punctuation and therefore cannot be split into sentences using the full stop as a delimiter.
The following script performs sentence tokenization:
import nltk  # nltk has not been imported yet, so import it here

sentence_list = nltk.sent_tokenize(article_text)
Find Weighted Frequency of Occurrence
To find the frequency of occurrence of each word, we use the formatted_article_text variable. We use this variable because it doesn't contain punctuation, digits, or other special characters. Take a look at the following script:
# Build a dictionary of raw word frequencies, skipping English stop words
stopwords = nltk.corpus.stopwords.words('english')

word_frequencies = {}
for word in nltk.word_tokenize(formatted_article_text):
    if word not in stopwords:
        if word not in word_frequencies.keys():
            word_frequencies[word] = 1
        else:
            word_frequencies[word] += 1
In the script above, we first store all the English stop words from the nltk library in a stopwords variable. Next, we loop through all the words in the formatted article text and check whether each word is a stop word. If it is not, we check whether the word already exists in the word_frequencies dictionary. If the word is encountered for the first time, it is added to the dictionary as a key and its value is set to 1. Otherwise, if the word already exists in the dictionary, its value is simply incremented by 1.
Finally, to find the weighted frequency, we can simply divide the number of occurrences of each word by the frequency of the most occurring word, as shown below:
maximum_frequency = max(word_frequencies.values())

for word in word_frequencies.keys():
    word_frequencies[word] = (word_frequencies[word] / maximum_frequency)
Calculating Sentence Scores
We have now calculated the weighted frequencies for all the words. Now is the time to calculate the scores for each sentence by adding weighted frequencies of the words that occur in that particular sentence. The following script calculates sentence scores:
# Score each sentence by summing the weighted frequencies of its words
sentence_scores = {}
for sent in sentence_list:
    for word in nltk.word_tokenize(sent.lower()):
        if word in word_frequencies.keys():
            if len(sent.split(' ')) < 30:  # skip very long sentences
                if sent not in sentence_scores.keys():
                    sentence_scores[sent] = word_frequencies[word]
                else:
                    sentence_scores[sent] += word_frequencies[word]
In the script above, we first create an empty sentence_scores dictionary. The keys of this dictionary will be the sentences themselves and the values will be the corresponding scores of the sentences. Next, we loop through each sentence in sentence_list and tokenize the sentence into words.
We then check if the word exists in the word_frequencies dictionary. This check is performed because we created the sentence_list list from the article_text object, whereas the word frequencies were calculated using the formatted_article_text object, which doesn't contain any stop words, numbers, etc.
We do not want very long sentences in the summary, so we calculate the score only for sentences with fewer than 30 words (although you can tweak this parameter for your own use case). Next, we check whether the sentence exists in the sentence_scores dictionary. If it doesn't, we add it to the sentence_scores dictionary as a key and assign it the weighted frequency of the first word in the sentence as its value. Otherwise, if the sentence already exists in the dictionary, we simply add the weighted frequency of the word to the existing value.
Getting the Summary
Now we have the sentence_scores dictionary, which contains the sentences with their corresponding scores. To summarize the article, we can take the top N sentences with the highest scores. The following script retrieves the top 7 sentences and prints them on the screen:
import heapq
summary_sentences = heapq.nlargest(7, sentence_scores, key=sentence_scores.get)
summary = ' '.join(summary_sentences)
print(summary)
In the script above, we use the heapq library and call its nlargest function to retrieve the top 7 sentences with the highest scores.
The output summary looks like this:
Artificial intelligence (AI), sometimes called machine intelligence, is intelligence demonstrated by machines, in contrast to the natural intelligence displayed by humans and other animals. Many tools are used in AI, including versions of search and mathematical optimization, artificial neural networks, and methods based on statistics, probability and economics. The traditional problems (or goals) of AI research include reasoning, knowledge representation, planning, learning, natural language processing, perception and the ability to move and manipulate objects. When access to digital computers became possible in the middle 1950s, AI research began to explore the possibility that human intelligence could be reduced to symbol manipulation. One proposal to deal with this is to ensure that the first generally intelligent AI is 'Friendly AI', and will then be able to control subsequently developed AIs. Nowadays, the vast majority of current AI researchers work instead on tractable "narrow AI" applications (such as medical diagnosis or automobile navigation). Machine learning, a fundamental concept of AI research since the field's inception, is the study of computer algorithms that improve automatically through experience.
Remember, since Wikipedia articles are updated frequently, you might get different results depending upon the time of execution of the script.
Conclusion
This article explains the process of text summarization with the help of the Python NLTK library. The process of scraping articles using the BeautifulSoup library has also been briefly covered. I recommend that you scrape some other article from Wikipedia and see whether you can get a good summary of it.