Python for NLP: Introduction to the Pattern Library

This is the eighth article in my series of articles on Python for NLP. In my previous article, I explained how Python's TextBlob library can be used to perform a variety of NLP tasks ranging from tokenization to POS tagging, and text classification to sentiment analysis. In this article, we will explore Python's Pattern library, which is another extremely useful Natural Language Processing library.

The Pattern library is a multipurpose library capable of handling the following tasks:

  • Natural Language Processing: Performing tasks such as tokenization, stemming, POS tagging, sentiment analysis, etc.
  • Data Mining: It contains APIs to mine data from sites like Twitter, Facebook, Wikipedia, etc.
  • Machine Learning: Contains machine learning models such as SVM, KNN, and perceptron, which can be used for classification, regression, and clustering tasks.

In this article, we will see the first two applications of the Pattern library from the above list. We will explore the use of the Pattern Library for NLP by performing tasks such as tokenization, stemming and sentiment analysis. We will also see how the Pattern library can be used for web mining.

Installing the Library

To install the library, you can use the following pip command:

$ pip install pattern

Alternatively, if you are using the Anaconda distribution of Python, you can install the library with the following conda command:

$ conda install -c asmeurer pattern

Pattern Library Functions for NLP

In this section, we will see some of the NLP applications of the Pattern Library.

Tokenizing, POS Tagging, and Chunking

In the NLTK and spaCy libraries, we have separate functions for tokenizing, POS tagging, and finding noun phrases in text documents. In the Pattern library, on the other hand, the all-in-one parse method takes a text string as input and returns the tokens in the string along with their POS tags.

The parse method also tells us whether a token belongs to a noun phrase or a verb phrase, and whether it is a subject or an object. You can also retrieve lemmatized tokens by setting the lemmata parameter to True. The syntax of the parse method, along with the default values of its parameters, is as follows:

parse(string,
    tokenize=True,      # Split punctuation marks from words?
    tags=True,          # Parse part-of-speech tags? (NN, JJ, ...)
    chunks=True,        # Parse chunks? (NP, VP, PNP, ...)
    relations=False,    # Parse chunk relations? (-SBJ, -OBJ, ...)
    lemmata=False,      # Parse lemmata? (ate => eat)
    encoding='utf-8',   # Input string encoding.
    tagset=None         # Penn Treebank II (default) or UNIVERSAL.
)

Let's see the parse method in action:

from pattern.en import parse
from pattern.en import pprint

pprint(parse('I drove my car to the hospital yesterday', relations=True, lemmata=True))

To use the parse method, you have to import the en module from the pattern library. The en module contains English language NLP functions. If you use the pprint method to print the output of the parse method on the console, you should see the following output:

         WORD   TAG    CHUNK   ROLE   ID     PNP    LEMMA

             I   PRP    NP      SBJ    1      -      i
         drove   VBD    VP      -      1      -      drive
            my   PRP$   NP      OBJ    1      -      my
           car   NN     NP ^    OBJ    1      -      car
            to   TO     -       -      -      -      to
           the   DT     NP      -      -      -      the
      hospital   NN     NP ^    -      -      -      hospital
     yesterday   NN     NP ^    -      -      -      yesterday

In the output, you can see the tokenized words along with their POS tag, the chunk that the tokens belong to, and the role. You can also see the lemmatized form of the tokens.

If you call the split method on the object returned by the parse method, the output will be a list of sentences, where each sentence is a list of tokens and each token is a list containing the word followed by the tags associated with it.

For instance, look at the following script:

from pattern.en import parse
from pattern.en import pprint

print(parse('I drove my car to the hospital yesterday', relations=True, lemmata=True).split())

The output of the script above looks like this:

[[['I', 'PRP', 'B-NP', 'O', 'NP-SBJ-1', 'i'], ['drove', 'VBD', 'B-VP', 'O', 'VP-1', 'drive'], ['my', 'PRP$', 'B-NP', 'O', 'NP-OBJ-1', 'my'], ['car', 'NN', 'I-NP', 'O', 'NP-OBJ-1', 'car'], ['to', 'TO', 'O', 'O', 'O', 'to'], ['the', 'DT', 'B-NP', 'O', 'O', 'the'], ['hospital', 'NN', 'I-NP', 'O', 'O', 'hospital'], ['yesterday', 'NN', 'I-NP', 'O', 'O', 'yesterday']]]
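The nested structure is easier to see once it is unpacked. The following is a minimal sketch that hard-codes the first three tokens from the output above (instead of calling parse again) and pulls out the word, POS tag, and lemma of each token. The variable names are illustrative and not part of Pattern.

```python
# First three tokens of the parse(...).split() output shown above,
# copied here so the snippet runs without the Pattern library.
sentences = [[
    ['I', 'PRP', 'B-NP', 'O', 'NP-SBJ-1', 'i'],
    ['drove', 'VBD', 'B-VP', 'O', 'VP-1', 'drive'],
    ['my', 'PRP$', 'B-NP', 'O', 'NP-OBJ-1', 'my'],
]]

# In this output the word comes first, the POS tag second,
# and the lemma last (because lemmata=True was passed).
word_tag_lemma = [
    (token[0], token[1], token[-1])
    for sentence in sentences
    for token in sentence
]

for word, tag, lemma in word_tag_lemma:
    print(word, tag, lemma)
```

Indexing from the end for the lemma keeps the sketch independent of how many tag columns sit in between.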

Pluralizing and Singularizing the Tokens

The pluralize and singularize methods are used to convert singular words to plurals and vice versa, respectively.

from pattern.en import pluralize, singularize

print(pluralize('leaf'))
print(singularize('theives'))

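The output looks like this. Note that the input "theives" in the script is a misspelling of "thieves"; singularize does not check spelling, so the result is misspelled as well: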

leaves
theif

Converting Adjective to Comparative and Superlative Degrees

You can retrieve the comparative and superlative degrees of an adjective using the comparative and superlative functions. For instance, the comparative degree of good is better and the superlative degree of good is best. Let's see this in action:

from pattern.en import comparative, superlative

print(comparative('good'))
print(superlative('good'))

Output:

better
best

Finding N-Grams

An n-gram is a sequence of "n" consecutive words in a sentence. For instance, for the sentence "He goes to hospital", the 2-grams would be (He goes), (goes to) and (to hospital). N-grams can play a crucial role in text classification and language modeling.

In the Pattern library, the ngrams function is used to find all the n-grams in a text string. The first parameter of the ngrams function is the text string. The number of words in each n-gram is passed to the n parameter. Look at the following example:

from pattern.en import ngrams

print(ngrams("He goes to hospital", n=2))

Output:

[('He', 'goes'), ('goes', 'to'), ('to', 'hospital')]
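To make the idea concrete, here is a minimal plain-Python sketch of the same computation. The helper simple_ngrams is hypothetical and not part of Pattern; it only splits on whitespace and does not try to reproduce Pattern's own tokenization.

```python
def simple_ngrams(text, n=2):
    """Return every run of n consecutive whitespace-separated words."""
    words = text.split()
    # zip over n copies of the word list, each shifted by one position;
    # zip stops at the shortest copy, so only complete n-grams are kept.
    return list(zip(*(words[i:] for i in range(n))))


bigrams = simple_ngrams("He goes to hospital", n=2)
print(bigrams)
```

For a sentence of k words, this yields k - n + 1 n-grams, which is why the four-word sentence above produces three 2-grams.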

Finding Sentiments

Sentiment refers to an opinion or feeling towards a certain thing. The Pattern library offers functionality to find sentiment from a text string.

In Pattern, the sentiment function is used to find the polarity (positivity or negativity) of a text along with its subjectivity.

Depending upon the most commonly occurring positive (good, best, excellent, etc.) and negative (bad, awful, pathetic, etc.) adjectives, a sentiment score between 1 and -1 is assigned to the text. This sentiment score is also called the polarity.

In addition to the sentiment score, subjectivity is also returned. The subjectivity value lies between 0 and 1 and quantifies how much of the text is personal opinion as opposed to factual information. A higher subjectivity means that the text contains more personal opinion than factual information.

from pattern.en import sentiment

print(sentiment("This is an excellent movie to watch. I really love it"))

When you run the above script, you should see the following output:

(0.75, 0.8)

The sentence "This is an excellent movie to watch. I really love it" has a sentiment of 0.75, which shows that it is highly positive. Similarly, the subjectivity of 0.8 refers to the fact that the sentence is a personal opinion of the user.
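The returned tuple can be unpacked and turned into readable labels. The sketch below uses the values from the output above. The helper describe_sentiment and its cut-off values (0.1 for polarity, 0.5 for subjectivity) are assumptions chosen for illustration, not thresholds defined by Pattern.

```python
def describe_sentiment(polarity, subjectivity):
    """Map a (polarity, subjectivity) pair to coarse labels."""
    # Cut-offs are illustrative assumptions, not part of Pattern.
    if polarity > 0.1:
        tone = 'positive'
    elif polarity < -0.1:
        tone = 'negative'
    else:
        tone = 'neutral'
    kind = 'opinion' if subjectivity > 0.5 else 'fact-like'
    return tone, kind


# Values taken from the sentiment output shown above.
polarity, subjectivity = (0.75, 0.8)
label = describe_sentiment(polarity, subjectivity)
print(label)
```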

Checking if a Statement is a Fact

The modality function from the Pattern library can be used to find the degree of certainty in a text string. The modality function returns a value between -1 and 1. For facts, the modality function returns a value greater than 0.5.

Here is an example of it in action:

from pattern.en import parse, Sentence
from pattern.en import modality

text = "Paris is the capital of France"
sent = parse(text, lemmata=True)
sent = Sentence(sent)

print(modality(sent))

Output:

1.0

In the script above we first import the parse method along with the Sentence class. On the second line, we import the modality function. The parse method takes text as input and returns a tokenized form of the text, which is then passed to the Sentence class constructor. The modality method takes the Sentence class object and returns the modality of the sentence.

Since the text string "Paris is the capital of France" is a fact, in the output, you will see a value of 1.

Similarly, for a sentence that expresses uncertainty, the value returned by the modality method is lower. Look at the following script:

text = "I think we can complete this task"
sent = parse(text, lemmata=True)
sent = Sentence(sent)

print(modality(sent))

Output:

0.25

Since the string in the above example is not very certain, the modality of the above string will be 0.25.

Spelling Corrections

The suggest method can be used to check whether a word is spelled correctly. It returns a list of (word, probability) tuples. If the word is spelled correctly, the list contains only the word itself with a probability of 1.0. Otherwise, the list contains the possible corrections for the word along with their probability of correctness.

Look at the following example:

from pattern.en import suggest

print(suggest("Whitle"))

In the script above we have a word Whitle which is incorrectly spelled. In the output, you will see possible suggestions for this word.

[('While', 0.6459209419680404), ('White', 0.2968881412952061), ('Title', 0.03280067283431455), ('Whistle', 0.023549201009251473), ('Chile', 0.0008410428931875525)]

According to the suggest method, there is a probability of 0.64 that the word is "While". Similarly, there is a probability of 0.29 that the word is "White", and so on.

Now let's spell a word correctly:

from pattern.en import suggest
print(suggest("Fracture"))

Output:

[('Fracture', 1.0)]

From the output, you can see that there is a 100% chance that the word is spelled correctly.
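Since the result is a list of (word, probability) pairs, picking the most likely correction is a one-liner. The sketch below works on the suggestions copied from the "Whitle" output above; best_correction is a hypothetical helper, not a Pattern function.

```python
def best_correction(suggestions):
    """Return the (word, probability) pair with the highest probability."""
    return max(suggestions, key=lambda pair: pair[1])


# Suggestions copied from the output of suggest("Whitle") shown above.
suggestions = [
    ('While', 0.6459209419680404),
    ('White', 0.2968881412952061),
    ('Title', 0.03280067283431455),
    ('Whistle', 0.023549201009251473),
    ('Chile', 0.0008410428931875525),
]

best_word, best_probability = best_correction(suggestions)
print(best_word, round(best_probability, 2))
```

Using max with a key function does not rely on the list already being sorted by probability.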

Working with Numbers

The Pattern library contains functions that can be used to convert numbers in the form of text strings into their numeric counterparts and vice versa. To convert from text to numeric representation the number function is used. Similarly to convert back from numbers to their corresponding text representation the numerals function is used. Look at the following script:

from pattern.en import number, numerals

print(number("one hundred and twenty two"))
print(numerals(256.390, round=2))

Output:

122
two hundred and fifty-six point thirty-nine

In the output, you will see 122, which is the numeric representation of the text "one hundred and twenty two". Similarly, you should see "two hundred and fifty-six point thirty-nine", which is the text representation of the number 256.390 rounded to two decimal places.

Remember that the round parameter of the numerals function specifies the number of decimal places to which the number is rounded.

The quantify function is used to get a word count estimation of the items in a list, that is, a phrase for referring to each group of items. If a list has 3-8 similar items, the quantify function will describe them as "several". Two similar items are described as "a pair of".

from pattern.en import quantify

print(quantify(['apple', 'apple', 'apple', 'banana', 'banana', 'banana', 'mango', 'mango']))

In the list, we have three apples, three bananas, and two mangoes. The output of the quantify function for this list looks like this:

several bananas, several apples and a pair of mangoes

Similarly, the following example demonstrates the other word count estimations.

from pattern.en import quantify

print(quantify({'strawberry': 200, 'peach': 15}))
print(quantify('orange', amount=1200))

Output:

hundreds of strawberries and a number of peaches
thousands of oranges
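The estimates are based on how often each item occurs. To check the counts behind the first quantify example, the standard library's collections.Counter can be used. This is only a sketch for inspecting the input; it is not a claim about how quantify is implemented.

```python
from collections import Counter

# Same list as in the first quantify example above.
fruits = ['apple', 'apple', 'apple', 'banana', 'banana', 'banana', 'mango', 'mango']

# Counter maps each distinct item to the number of times it occurs.
counts = Counter(fruits)

for fruit, count in counts.items():
    print(fruit, count)
```

The two groups of three are the ones reported as "several", and the group of two is the one reported as "a pair of".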

Pattern Library Functions for Data Mining

In the previous section, we saw some of the most commonly used functions of the Pattern library for NLP. In this section, we will see how the Pattern library can be used to perform a variety of data mining tasks.

The web module of the Pattern library is used for web mining tasks.

Accessing Web Pages

The URL object is used to retrieve contents from the webpages. It has several methods that can be used to open a webpage, download the contents from a webpage and read a webpage.

You can directly use the download method to download the HTML contents of any webpage. The following script downloads the HTML source code for the Wikipedia article on artificial intelligence.

from pattern.web import download

page_html = download('https://en.wikipedia.org/wiki/Artificial_intelligence', unicode=True)

You can also download files from webpages, for example images, using the URL class:

from pattern.web import URL, extension

page_url = URL('https://upload.wikimedia.org/wikipedia/commons/f/f1/RougeOr_football.jpg')
file = open('football' + extension(page_url.page), 'wb')
file.write(page_url.download())
file.close()

In the script above, we first create a URL object for the image. Next, we pass the page attribute of the URL object to the extension function, which returns the file extension. The extension is appended to the string "football" to form the file name. The open function opens this file for writing in binary mode ('wb'). Finally, the download() method downloads the image, which is written to the file in the current working directory.
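To see what the file name ends up being without downloading anything, the extension can also be derived with the standard library alone. This is a sketch using os.path and urllib.parse as stand-ins; it does not use Pattern's extension function.

```python
import os
from urllib.parse import urlparse

image_url = 'https://upload.wikimedia.org/wikipedia/commons/f/f1/RougeOr_football.jpg'

# Take the path component of the URL and split off its extension.
url_path = urlparse(image_url).path
file_extension = os.path.splitext(url_path)[1]

# Same naming scheme as in the script above.
local_name = 'football' + file_extension
print(local_name)
```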

Finding URLs within Text

You can use the find_urls function to extract URLs from text strings. Here is an example:

from pattern.web import find_urls

print(find_urls('To search anything, go to www.google.com', unique=True))

In the output, you will see the URL for the Google website as shown below:

['www.google.com']

Making Asynchronous Requests for Webpages

Webpages can be very large, and it can take quite a bit of time to download the complete contents of a webpage. This can block the user from performing any other task in the application until the download finishes. However, the web module of the Pattern library contains the asynchronous function, which downloads the contents of a webpage in parallel. It runs in the background so that the user can interact with the application while the webpage is being downloaded.

Let's take a very simple example of the asynchronous method:

from pattern.web import asynchronous, time, Google, find_urls

asyn_req = asynchronous(Google().search, 'artificial intelligence', timeout=4)
while not asyn_req.done:
    time.sleep(0.1)
    print('searching...')

print(asyn_req.value)

# find_urls expects a string, so convert the results to text first
print(find_urls(str(asyn_req.value), unique=True))

In the above script, we retrieve the first page of Google search results for the query "artificial intelligence". While the request runs in the background, the while loop keeps executing and prints a message every 0.1 seconds until the request is done. The results are then printed using the value attribute of the object returned by the asynchronous function. Finally, we extract the URLs from the results and print them on the screen.
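The same start-then-poll pattern can be tried out without a network connection or a license key. The sketch below uses concurrent.futures from the standard library in place of Pattern's asynchronous function; slow_lookup is a hypothetical stand-in for a slow request and does not contact any server.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_lookup(query):
    """Stand-in for a slow network call; no request is actually made."""
    time.sleep(0.3)
    return 'results for ' + query


with ThreadPoolExecutor(max_workers=1) as executor:
    # Start the work in a background thread and return immediately.
    future = executor.submit(slow_lookup, 'artificial intelligence')

    polls = 0
    while not future.done():
        # The main thread is free to do other work here.
        time.sleep(0.1)
        polls += 1

    result = future.result()

print(result)
```

As in the Pattern example, the main thread only checks a "done" flag and reads the value once the background work has finished.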

Getting Search Engine Results with APIs

The Pattern library contains the SearchEngine class, which serves as the base class for the classes that call the APIs of different search engines and websites such as Google, Bing, Facebook, Wikipedia, and Twitter. The SearchEngine constructor accepts three parameters:

  • license: The developer license key for the corresponding search engine or website
  • throttle: The time delay between successive requests to the server
  • language: The language of the results

The search method of the SearchEngine class is used to send a search query to a search engine. The search method can take the following parameters:

  • query: The search string
  • type: The type of data you want to search for. It can take three values: SEARCH, NEWS and IMAGE.
  • start: The page from which you want to start the search
  • count: The number of results per page.

The search engine classes that inherit the SearchEngine class along with its search method are: Google, Bing, Twitter, Facebook, Wikipedia, and Flickr.

The search method returns a result object for each item found. The result object can then be used to retrieve information about that search result. The attributes of the result object are url, title, text, language, author, and date.

Now let's see a very simple example of how we can search for something on Google via the Pattern library. Remember, to make this example work, you will have to use your developer license key for the Google API.

from pattern.web import Google

google = Google(license=None)
for search_result in google.search('artificial intelligence'):
    print(search_result.url)
    print(search_result.text)

In the script above, we create an object of the Google class. In the constructor of Google, pass your own license key to the license parameter. Next, we pass the string "artificial intelligence" to the search method. By default, the first 10 results from the first page are returned. We iterate over them and display the url and text of each result on the screen.

The process is similar for the Bing search engine; you only have to replace the Google class with the Bing class in the script above.

Let's now search Twitter for the latest tweets that contain the text "artificial intelligence", fetching three batches of three tweets each. Execute the following script:

from pattern.web import Twitter

twitter = Twitter()
index = None
for j in range(3):
    for tweet in twitter.search('artificial intelligence', start=index, count=3):
        print(tweet.text)
        index = tweet.id

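In the script above, we first import the Twitter class from the pattern.web module. Next, we iterate over the tweets returned by the search method and display the text of each tweet on the console. The id of the last tweet seen is stored in index and passed as the start parameter of the next search, so that each batch continues from where the previous one stopped. You do not need a license key to run the above script.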

Converting HTML Data to Plain Text

The download method of the URL class returns data in the form of HTML. However, if you want to analyze the text itself, for instance for sentiment classification, you need cleaned data without HTML tags. You can clean the data with the plaintext method, which takes the HTML content returned by the download method as a parameter and returns the cleaned text.

Look at the following script:

from pattern.web import URL, plaintext

html_content = URL('https://stackabuse.com/python-for-nlp-introduction-to-the-textblob-library/').download()
cleaned_page = plaintext(html_content.decode('utf-8'))
print(cleaned_page)

In the output, you should see the cleaned text of the webpage at https://stackabuse.com/python-for-nlp-introduction-to-the-textblob-library/.

It is important to remember that if you are using Python 3, you will need to call the decode('utf-8') method to convert the data from bytes to string format.
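The two steps involved, decoding bytes to a string and then stripping tags, can be illustrated with the standard library alone. The sketch below uses html.parser instead of Pattern's plaintext function and a small hard-coded HTML snippet instead of a downloaded page; TextExtractor is a hypothetical helper and much cruder than plaintext.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text between tags and ignore the tags themselves."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Skip whitespace-only fragments between tags.
        if data.strip():
            self.parts.append(data.strip())


# Hard-coded stand-in for the bytes returned by a download.
raw_bytes = b'<html><body><h1>Pattern</h1><p>A <b>Python</b> library.</p></body></html>'

# In Python 3, bytes must be decoded before being treated as text.
html_text = raw_bytes.decode('utf-8')

extractor = TextExtractor()
extractor.feed(html_text)
extractor.close()

cleaned = ' '.join(extractor.parts)
print(cleaned)
```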

Parsing PDF Documents

The Pattern library contains the PDF class, which can be used to parse a PDF document. PDF (Portable Document Format) is a cross-platform file format that stores images, text, and fonts in a stand-alone document.

Let's see how a PDF document can be parsed with the PDF object:

from pattern.web import URL, PDF

pdf_doc = URL('http://demo.clab.cs.cmu.edu/NLP/syllabus_f18.pdf').download()
print(PDF(pdf_doc.decode('utf-8')))

In the script, we download the document using the download method. Next, the downloaded PDF document is passed to the PDF class, and the resulting object is printed on the console.

Clearing the Cache

The results returned by methods such as SearchEngine.search() and URL.download() are, by default, stored in the local cache. To clear the cache after downloading an HTML document, we can use the clear method of the cache object, as shown below:

from pattern.web import cache

cache.clear()

Conclusion

The Pattern library is one of the most useful natural language processing libraries in Python. Although it is not as well-known as spaCy or NLTK, it contains functionalities such as finding superlatives and comparatives, and fact and opinion detection, which distinguish it from the other NLP libraries.

In this article, we studied the application of the Pattern library to natural language processing, data mining, and web scraping. We saw how to perform basic NLP tasks such as tokenization, lemmatization and sentiment analysis with the Pattern library. Finally, we also saw how to use Pattern for making search engine queries, mining online tweets and cleaning HTML documents.

About Usman Malik