Introduction
Natural language refers to the language humans use to communicate with each other. This communication can be verbal or textual. For instance, face-to-face conversations, tweets, blogs, emails, websites, and SMS messages all contain natural language.
Natural language is an incredibly important thing for computers to understand for a few reasons (among others):
- It can be viewed as a source of huge amounts of data, which, if processed intelligently, can yield useful information
- It can be used to allow computers to better communicate with humans
However, unlike humans, computers cannot easily comprehend natural language. Sophisticated techniques and methods are required to translate natural language into a format understandable by computers. Natural language processing (NLP) is the application area that helps us achieve this objective. NLP refers to the techniques and methods involved in the automatic manipulation of natural language. It is widely used in machine learning, information summarization, human-computer interaction, and much more.
This article contains a brief overview of NLP application areas, important NLP tasks and concepts, and some very handy NLP tools. The rest of the article is organized as follows:
NLP Application Areas
- Machine Learning
- Human Computer Interaction
- Information Extraction
- Summarization
NLP Processes and Concepts
- Tokenization
- Stop Word Removal
- Stemming and Lemmatization
- POS Tagging
- Named Entity Recognition
- Bag of Words Approach
- TF-IDF
- N-Grams
NLP Tools
- Python NLTK
- Scikit-Learn
- TextBlob
- spaCy
- Stanford NLP
NLP Application Areas
NLP is currently being used in a variety of areas to solve difficult problems.
Machine Learning
NLP is used in conjunction with machine learning techniques to perform tasks such as emotion detection, sentiment analysis, dialogue act recognition, spam email classification etc. Machine learning techniques require data to train algorithms.
Natural language in the form of tweets, blogs, websites, chats etc. is a huge source of data. NLP plays a very important role in data collection by converting natural language into a format that can be used by machine learning techniques to train algorithms.
Spam classification is a classic example of the use of NLP in machine learning. Spam classification refers to the process of classifying emails as "spam" or "ham" based on the contents of the email. For instance, an email "Hello, congratulations you have won $100000000" can be classified as "spam", while another email "As per our discussion, please find attached minutes of the meeting" can be classified as "ham".
NLP tools and techniques help convert the text of these emails into feature vectors that can be used by machine learning applications to train the algorithm and then predict a new email instance as spam or ham.
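As a rough illustration, here is a minimal sketch of such a pipeline using scikit-learn, with a handful of made-up example emails (the data and labels are purely hypothetical):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data -- made-up emails purely for illustration
emails = [
    "Hello, congratulations you have won $100000000",
    "As per our discussion, please find attached minutes of the meeting",
    "Click here to claim your free prize now",
    "Can we reschedule tomorrow's meeting to 3 pm?",
]
labels = ["spam", "ham", "spam", "ham"]

# Convert the raw text into numeric feature vectors (bag-of-words counts)
vectorizer = CountVectorizer()
features = vectorizer.fit_transform(emails)

# Train a simple Naive Bayes classifier on those feature vectors
classifier = MultinomialNB()
classifier.fit(features, labels)

# Predict whether a previously unseen email is spam or ham
new_email = ["Congratulations, you have won a free cruise"]
print(classifier.predict(vectorizer.transform(new_email)))
```

With only four training examples the prediction is not meaningful, but the overall flow (text in, feature vectors out, then train and predict) is the same one used in real spam filters.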
Human Computer Interaction
Human computer interaction has evolved from simple mouse-and-keyboard desktop-based interaction to more natural interaction involving speech and gestures. For example, Amazon's Alexa and Apple's Siri are two prime examples of such interaction, where humans use speech to interact with the system and perform different tasks. Another example of natural interaction is Google's homepage, where you can perform search operations via speech. Natural language processing lies at the foundation of such interaction.
Information Extraction
Another important NLP task is to extract useful information from documents that can be used for different purposes. For instance, publicly traded companies are required to publish financial information and make it available to their shareholders. NLP can be used to extract financial information from these kinds of documents to automatically gather information on how a company (or industry) is doing.
If mined carefully, the information extracted from these sources can help companies make appropriate decisions. For instance, if a company didn't perform very well last quarter then you could use NLP to determine this and then automatically sell your shares of their stock.
Summarization
Not every piece of information in a text document is useful, so we may want to cut out any "fluff" and read only what is important. Summarization refers to condensing a document in a way that it contains only key pieces of information, leaving behind all the waste.
For example, instead of having to read a multi-page news story you could use summarization to extract only the important information about whatever news event the article was about. Or a summary could be made as a preview of the article to help you decide if you want to read the full text or not.
Important NLP Concepts
In this section we will study some of the most common NLP concepts.
Tokenization
Tokenization refers to dividing text into smaller chunks, typically words. It is usually the first task performed in the natural language processing pipeline. Tokenization can be performed at two levels: word-level and sentence-level.
At word-level, tokenization returns a set of words in a sentence.
For instance, tokenizing the sentence "I am feeling hungry" returns the following set of words:
Words = ['I', 'am', 'feeling', 'hungry']
At sentence-level, tokenization returns the individual sentences in a document. For instance, consider a document with the following text.
"Paris is the capital of France. It is located in northern France. It is a beautiful city"
Sentence-level tokenization of the above document will return the following set of sentences:
- S1 = "Paris is the capital of France"
- S2 = "It is located in northern France"
- S3 = "It is a beautiful city"
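If you want to try this yourself, here is a minimal sketch using NLTK (covered later in this article); it assumes NLTK is installed and the punkt tokenizer data has been downloaded:

```python
from nltk.tokenize import word_tokenize, sent_tokenize
# import nltk; nltk.download('punkt')  # required once to fetch the tokenizer models

document = "Paris is the capital of France. It is located in northern France. It is a beautiful city."

# Word-level tokenization
print(word_tokenize("I am feeling hungry"))
# ['I', 'am', 'feeling', 'hungry']

# Sentence-level tokenization
print(sent_tokenize(document))
# ['Paris is the capital of France.', 'It is located in northern France.', 'It is a beautiful city.']
```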
Stop Word Removal
Stop words in natural language are words that do not provide any useful information in a given context. For instance if you are developing an emotion detection engine, words such as "is", "am", and "the" do not convey any information related to emotions.
For instance, in the sentence "I am feeling happy today", the first two words "I" and "am" can be removed since they do not provide any emotion-related information. However, the word "I" may be important for other reasons, like identifying who is feeling happy. There is no universal list of stop words to remove as some can actually provide value - it just depends on your application.
In NLP, each and every word requires processing. Therefore, it is convenient that we only have those words in our text that are important in a given context. This saves processing time and results in a more robust NLP engine.
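A minimal sketch of stop word removal, using NLTK's built-in English stop word list (assuming NLTK and its stopwords corpus are available), could look like this:

```python
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# import nltk; nltk.download('stopwords'); nltk.download('punkt')  # required once

sentence = "I am feeling happy today"
stop_words = set(stopwords.words('english'))

# Keep only the tokens that are not in the stop word list
filtered = [word for word in word_tokenize(sentence) if word.lower() not in stop_words]
print(filtered)  # roughly ['feeling', 'happy', 'today']
```

Whether this is the right list of words to remove depends entirely on your application, as discussed above.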
Stemming and Lemmatization
Stemming refers to the process of stripping suffixes from words in an attempt to normalize them and reduce them to their non-changing portion. For example, stemming the words "computational", "computed", and "computing" would result in "comput", since this is the non-changing part of the word. Stemming operates on single words and does not take the context of the word into account. Note, however, that "comput" is not a real word and carries no semantic information on its own.
Lemmatization, on the other hand, performs a similar task but takes context into account while reducing words to their base form. Lemmatization is more complex, as it performs dictionary look-ups to fetch the exact base word (the lemma), which does carry semantic information.
The word "talking" will be reduced to "talk" by both stemming and lemmatization. For the word "worse", however, lemmatization will return "bad", since the lemmatizer takes the context of the word into account: it knows that "worse" is an adjective and the comparative form of "bad", so it returns the latter. Stemming, on the other hand, simply strips suffixes and will never map "worse" to "bad".
Stemming and lemmatization are very useful for finding the semantic similarity between different pieces of texts.
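Here is a quick comparison using NLTK's PorterStemmer and WordNetLemmatizer (assuming the WordNet corpus has been downloaded); note that the lemmatizer needs a part-of-speech hint to make use of context:

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer
# import nltk; nltk.download('wordnet')  # required once for the lemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Stemming just strips suffixes -- no dictionary lookup, no context
print(stemmer.stem("computational"))  # comput
print(stemmer.stem("talking"))        # talk

# Lemmatization looks the word up; the POS hint ('v' = verb, 'a' = adjective) supplies context
print(lemmatizer.lemmatize("talking", pos="v"))  # talk
print(lemmatizer.lemmatize("worse", pos="a"))    # bad
```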
Parts of Speech (POS)
Another important NLP task is to assign part of speech tags to words. To construct a meaningful and grammatically correct sentence, parts of speech play an important role. The arrangement and co-occurrence of different parts of speech in a sentence make it grammatically and semantically understandable. Furthermore, parts of speech are also integral to context identification. POS tagging helps achieve these tasks.
A POS tagger labels words with their corresponding parts of speech. For instance "laptop, mouse, keyboard" are tagged as nouns. Similarly "eating, playing" are verbs while "good" and "bad" are tagged as adjectives.
While this may sound like a simple task, it is not. For many words you can just use a dictionary lookup to see what POS a word is, but many words have various meanings and could therefore be a different POS. For example, if you encountered the word "wind", would it be a noun or a verb? That really depends on the context, so your POS tagger would need to understand context, to a degree.
Different sets of POS tags are used by different NLP libraries. The Stanford POS tag list is one such list.
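As a small sketch, NLTK's built-in tagger can be used like this (assuming the averaged perceptron tagger data has been downloaded); it uses the surrounding words to decide, for example, how to tag each occurrence of "wind":

```python
import nltk
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')  # required once

tokens = nltk.word_tokenize("The wind was strong, so we could not wind the sail")
print(nltk.pos_tag(tokens))
# Typically the first "wind" comes back as a noun (NN) and the second as a verb (VB)
```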
Named Entity Recognition
Named entity recognition refers to the process of classifying entities into predefined categories such as person, location, organization, vehicle etc.
For instance, consider the sentence "Sundar Pichai, CEO of Google is having a talk in Washington". A typical named entity recognizer will return the following information about this sentence:
- Sundar Pichai -> Person
- CEO -> Position
- Google -> Organization
- Washington -> Location
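As a rough sketch, a pre-trained library such as spaCy (discussed later in this article) can produce similar output, assuming its small English model is installed; the exact entities and labels depend on the model:

```python
import spacy

# Assumes: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Sundar Pichai, CEO of Google is having a talk in Washington")
for ent in doc.ents:
    print(ent.text, "->", ent.label_)

# Typically something like:
#   Sundar Pichai -> PERSON
#   Google -> ORG
#   Washington -> GPE
```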
An important application of named entity recognition is that of topic modeling where using the information about the entities in the text, the topic of the document can automatically be detected.
Bag-of-Words Approach
Bag-of-words refers to a methodology used for extracting features from text documents. These features can then be used for various tasks, such as training machine learning algorithms.
The bag-of-words approach is straightforward. It starts by building a vocabulary of all the unique words occurring in all the documents in the training set. This vocabulary serves as a feature vector for the training set. For instance, consider the following three documents.
- D1 = "I am happy for your success"
- D2 = "I am sorry for your loss"
- D3 = "He is sorry, he cannot come"
The vocabulary, or feature vector, built using the above documents will look like this:

I | am | happy | for | your | success | sorry | loss | He | is | cannot | come |

In the training set, one row is inserted for each document. For each attribute (word) in a row, the frequency of that word in the corresponding document is inserted. If a word doesn't exist in the document, a 0 is added for it. The training data for the documents above looks like this:

Document | I | am | happy | for | your | success | sorry | loss | He | is | cannot | come |
---|---|---|---|---|---|---|---|---|---|---|---|---|
D1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
D2 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
D3 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 2 | 1 | 1 | 1 |

Table 1: Training features containing the term frequency of each word in each document
This is called the bag-of-words approach since the sequence of words in a document isn't taken into account. For instance, the training example constructed from the sentence "Sorry I your am for loss" will be exactly the same as the one constructed from "I am sorry for your loss". The occurrence of each word is all that matters in this approach.
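A minimal sketch of the bag-of-words approach with scikit-learn's CountVectorizer (assuming scikit-learn is installed) might look like this; note that by default the vectorizer lowercases text and drops one-character tokens such as "I", so the token pattern is widened here to stay closer to the hand-built table:

```python
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "I am happy for your success",
    "I am sorry for your loss",
    "He is sorry, he cannot come",
]

# token_pattern keeps one-character tokens such as "I";
# text is still lowercased by default
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
counts = vectorizer.fit_transform(documents)

print(vectorizer.get_feature_names_out())  # the learned vocabulary (scikit-learn >= 1.0)
print(counts.toarray())                    # one row of term frequencies per document
```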
TF-IDF
TF-IDF stands for "term frequency-inverse document frequency". The intuition behind calculating TF-IDF values is that words that occur frequently in one document but are less frequent across all the documents should be given more weight, since they are more crucial for classification. To understand TF-IDF, let's consider the same example that we studied in the last section. Suppose we have these three documents D1, D2 and D3:
- D1 = "I am happy for your success"
- D2 = "I am sorry for your loss"
- D3 = "He is sorry, he cannot come"
TF-IDF is a combination of two values: TF (Term Frequency) and IDF (Inverse Document Frequency).
Term frequency refers to the number of times a word occurs within a document. In the document D1, the term "happy" occurs one time. Similarly, the term "success" also occurs one time. In D3, "he" occurs twice, so the term frequency for "he" is 2 for D3. It is important to mention here that the term frequency of a word is calculated per document.
In some scenarios where document lengths vary, the term frequency for a particular word is calculated as:

Term frequency = (Number of occurrences of a word) / (Total number of words in the document)
However for the sake of simplicity, we will only use the number of occurrences of a word in the document. Table 1 contains term frequency for all the words in D1, D2 and D3 documents.
Inverse Document Frequency for a particular word refers to the total number of documents in a dataset divided by the number of documents in which the word exists. In the documents D1, D2 and D3 the word "your" occurs in D1 and D2. So IDF for "your" will be 3/2.
To dampen the effect of this ratio, it is common practice to take the log of the IDF value. The final formula for the IDF of a particular word looks like this:
IDF(word) = Log((Total number of documents)/(Number of documents containing the word))
Let's try to find the IDF value for the word "happy". We have three documents i.e. D1, D2 and D3 and the word "happy" occurs only in one document. So the IDF value for "happy" will be:
IDF(happy) = log(3/1) = log(3) = 0.477
Finally the term TF-IDF is a product of TF and IDF values for a particular term in the document.
For "happy", the TF-IDF value will be 1 x 0.477 = 0.477
By the same process, the IDF value for the word "your" will be log(3/2) ≈ 0.18. Notice that "your" occurs in two documents and is therefore less unique, hence the lower IDF value compared to "happy", which occurs in only one document.
This kind of analysis can be helpful for things like search or document categorization (think automated tagging).
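As a rough sketch, scikit-learn's TfidfVectorizer can compute these values for you (assuming scikit-learn is installed). Keep in mind that its defaults use the natural log, smoothing, and L2 normalization, so the numbers will not exactly match the hand calculation above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "I am happy for your success",
    "I am sorry for your loss",
    "He is sorry, he cannot come",
]

# The token pattern keeps one-character tokens such as "I", as in the bag-of-words example
vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
tfidf = vectorizer.fit_transform(documents)

print(vectorizer.get_feature_names_out())   # the learned vocabulary
print(tfidf.toarray().round(2))             # one row of TF-IDF weights per document
```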
N-Grams as Features
N-grams refers to sets of co-occurring words. The intuition behind the N-gram approach is that words occurring together provide more information than words occurring individually. Consider for example the following sentence:
S1 = "Manchester united is the most successful English football club"
Here if we create a feature set for this sentence with individual words, it will look like this:
Features = {Manchester, United, is, the, most, successful, English, football, club}
But if we look at this sentence we can see that "Manchester United" together provides more information about what is being said in the sentence rather than if you inspect the words "Manchester" and "United" separately.
N-grams allow us to take the co-occurrence of words into account while processing the content of the document.
In N-grams, the N refers to the number of co-occurring words. For instance, let's reconsider the sentence "Manchester United is the most successful English football club". If we try to construct 2-grams from this sentence, they would look like this:
2-Grams(S1) = ('Manchester United', 'United is', 'is the', 'the most', 'most successful', 'successful English', 'English football', 'football club')
Now, if you look at these N-grams you can see that at least three of them convey significant bits of information about the sentence, e.g. "Manchester United", "English football", and "football club". From these N-grams we can understand that the sentence is about Manchester United, an English football club.
You can choose any value for N. The number of N-grams for a sentence S having X words is:

N-grams(S) = X - (N - 1)
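To make this concrete, here is a minimal sketch that builds the 2-grams with NLTK (assuming the punkt tokenizer data is available) and checks the count against the formula above:

```python
from nltk.util import ngrams
from nltk.tokenize import word_tokenize

sentence = "Manchester United is the most successful English football club"
tokens = word_tokenize(sentence)  # X = 9 words

# Build the 2-grams (bigrams); each element is a pair of co-occurring words
bigrams = [" ".join(gram) for gram in ngrams(tokens, 2)]
print(bigrams)       # ['Manchester United', 'United is', 'is the', ...]
print(len(bigrams))  # X - (N - 1) = 9 - 1 = 8
```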
A set of N-grams can be helpful for things like autocomplete/autocorrect and language models. Creating N-grams from a huge corpus of text provides lots of information about which words typically occur together, and therefore allows you to predict which word will come next in a sentence.
In this section, we covered some of the basic natural language processing concepts and processes. Implementing these processes manually is cumbersome and time consuming. Thankfully, a lot of software libraries are available that automate these processes. A brief overview of some of these libraries has been provided in the next section.
Some useful NLP Tools
Following are some of the most commonly used NLP tools. All of these tools provide most of the basic NLP functionality; however, they differ in their implementation and licensing.
Python NLTK
Python Natural Language Toolkit (NLTK) is by far the most popular and complete natural language processing tool. Implemented in Python, NLTK has all the basic natural language processing capabilities such as stemming, lemmatization, named entity recognition, POS tagging, etc. If Python is your language of choice, look no further than Python NLTK.
Scikit-Learn
Scikit-learn is another extremely useful Python library for natural language processing. Although Scikit-Learn is primarily focused on machine learning tasks, it contains most of the basic natural language processing capabilities as well and should not be overlooked.
TextBlob
Though you can get almost every NLP task done with Python NLTK, getting used to the complex syntax and functionality of NLTK can be time consuming. TextBlob is a simple-to-use NLP library built on top of NLTK and Pattern, with a less steep learning curve. TextBlob is highly recommended for absolute beginners to natural language processing in Python, or for someone who cares more about getting to the end result than about how the internals work.
spaCy
TextBlob and NLTK are extremely good for educational purposes and for exploring how NLP works. They contain lots of options for performing any one task. For instance, if you are trying to train a POS tagger on your own data, libraries like NLTK and TextBlob should be your choice. On the other hand, if you are looking to build something from the best possible combination of functionalities, spaCy is a better option. spaCy is faster and more accurate than both NLTK and TextBlob. One downside to spaCy is that it currently supports only the English language, while NLTK has support for multiple languages.
Stanford NLP
If you are into Java-based natural language processing tools, Stanford NLP should be your first choice. Stanford NLP is a GPL-licensed NLP library capable of performing all the fundamental NLP tasks, e.g. tokenization, coreference resolution, stemming, etc. It is also of such high quality that it is used in many research papers, so you'll likely hear of it quite often in the academic world. The good thing about Stanford NLP is that it supports multiple languages, such as Chinese, English, Spanish, and French.
Learn More
This article provides a brief overview of many of the basic NLP tasks and concepts. While we covered quite a bit of the main areas of NLP, there is still a ton to learn.
In the upcoming articles, we will dig deeper into these concepts and will see how they can actually be used and implemented in Python.