This is the ninth article in my series of articles on Python for NLP. In the previous article, we saw how Python's Pattern library can be used to perform a variety of NLP tasks ranging from tokenization to POS tagging, and text classification to sentiment analysis. Before that we explored the TextBlob library for performing similar natural language processing tasks.
In this article, we will explore the StanfordCoreNLP library, another extremely handy library for natural language processing. We will see the different features of StanfordCoreNLP with the help of examples. So without wasting any more time, let's get started.
Setting up the Environment
The installation process for StanfordCoreNLP is not as straightforward as for the other Python libraries. As a matter of fact, StanfordCoreNLP is a library that's actually written in Java. Therefore, make sure you have Java installed on your system. You can download the latest version of Java for free.
Once you have Java installed, you need to download the JAR files for the StanfordCoreNLP library. The JAR files contain the models used to perform the different NLP tasks. To get the JAR files for the English models, download and unzip the archive available on the official StanfordCoreNLP website.
The next thing you have to do is run the server that will serve the requests sent by the Python wrapper to the StanfordCoreNLP library. Navigate into the folder where you unzipped the JAR files and execute the following command on the command prompt:
$ java -mx6g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -timeout 10000
The above command starts the StanfordCoreNLP server. The -mx6g parameter specifies that the memory used by the server should not exceed 6 gigabytes. Keep in mind that you need a 64-bit system to give the JVM a heap as large as 6 GB; if you are running a 32-bit system, you may have to reduce the memory dedicated to the server.
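For example, if your machine has less memory available, you can start the server with a smaller heap simply by lowering the value of that parameter; the rest of the command stays the same:
$ java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -timeout 10000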
Once you start the server, you should see output similar to the following:
[main] INFO CoreNLP - --- StanfordCoreNLPServer#main() called ---
[main] INFO CoreNLP - setting default constituency parser
[main] INFO CoreNLP - warning: cannot find edu/stanford/nlp/models/srparser/englishSR.ser.gz
[main] INFO CoreNLP - using: edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz instead
[main] INFO CoreNLP - to use shift reduce parser download English models jar from:
[main] INFO CoreNLP - http://stanfordnlp.github.io/CoreNLP/download.html
[main] INFO CoreNLP - Threads: 8
[main] INFO CoreNLP - Starting server...
[main] INFO CoreNLP - StanfordCoreNLPServer listening at /0:0:0:0:0:0:0:0:9000
The server is running at port 9000.
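Before moving on to Python, you can optionally check that the server is responding. Here is a hedged example using curl (any HTTP client works); the sample sentence and annotator list are arbitrary:
$ curl --data 'The quick brown fox jumped over the lazy dog.' 'http://localhost:9000/?properties={"annotators":"pos","outputFormat":"json"}'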
Now the final step is to install the Python wrapper for the StanfordCoreNLP library. The wrapper we will be using is pycorenlp. The following command installs the wrapper library:
$ pip install pycorenlp
Now we are all set to connect to the StanfordCoreNLP server and perform the desired NLP tasks.
To connect to the server, we have to pass the address of the StanfordCoreNLP server that we started earlier to the StanfordCoreNLP class of the pycorenlp module. The object returned can then be used to perform NLP tasks. Look at the following script:
from pycorenlp import StanfordCoreNLP
nlp_wrapper = StanfordCoreNLP('http://localhost:9000')
Performing NLP Tasks
In this section, we will briefly explore the use of StanfordCoreNLP library for performing common NLP tasks.
Lemmatization, POS Tagging and Named Entity Recognition
Lemmatization, part-of-speech (POS) tagging, and named entity recognition are among the most basic NLP tasks. The StanfordCoreNLP library supports pipeline functionality that can be used to perform these tasks in a structured way.
In the following script, we create an annotator that first splits the document into sentences, then splits each sentence into words or tokens, and finally annotates the words with POS and named entity recognition tags.
doc = "Ronaldo has moved from Real Madrid to Juventus. While messi still plays for Barcelona"
annot_doc = nlp_wrapper.annotate(doc,
    properties={
        'annotators': 'ner, pos',
        'outputFormat': 'json',
        'timeout': 1000,
    })
In the script above we have a document with two sentences. We call the annotate method of the StanfordCoreNLP wrapper object that we initialized earlier, passing the text along with a properties dictionary containing three keys. The annotators property specifies the type of annotation we want to perform on the text. We pass 'ner, pos' as its value, which tells the server to annotate our document for POS tags and named entities; the server also runs the annotators these depend on, such as tokenization and lemmatization, which is why lemmas show up in the output as well.
The outputFormat property defines the format in which you want the annotated text. The possible values are json for JSON objects, xml for XML format, text for plain text, and serialize for serialized data.
The final property is the timeout in milliseconds, which limits how long the annotation request may take before timing out.
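To make these options concrete, here is a minimal sketch of the same call with an explicit annotator list and plain-text output; the annotator combination shown is just an illustration, not the only valid one:
# Hedged sketch: explicit annotator pipeline and human-readable text output
annot_text = nlp_wrapper.annotate(doc,
    properties={
        'annotators': 'tokenize, ssplit, pos, lemma, ner',
        'outputFormat': 'text',   # plain text instead of a JSON object
        'timeout': 10000,
    })

print(annot_text)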
In the output, you should see a JSON object as follows:
{'sentences': [{'index': 0, 'entitymentions': [{'docTokenBegin': 0, 'docTokenEnd': 1, 'tokenBegin': 0, 'tokenEnd': 1, 'text': 'Ronaldo', 'characterOffsetBegin': 0, 'characterOffsetEnd': 7, 'ner': 'PERSON'}, {'docTokenBegin': 4, 'docTokenEnd': 6, 'tokenBegin': 4, 'tokenEnd': 6, 'text': 'Real Madrid', 'characterOffsetBegin': 23, 'characterOffsetEnd': 34, 'ner': 'ORGANIZATION'}, {'docTokenBegin': 7, 'docTokenEnd': 8, 'tokenBegin': 7, 'tokenEnd': 8, 'text': 'Juventus', 'characterOffsetBegin': 38, 'characterOffsetEnd': 46, 'ner': 'ORGANIZATION'}], 'tokens': [{'index': 1, 'word': 'Ronaldo', 'originalText': 'Ronaldo', 'lemma': 'Ronaldo', 'characterOffsetBegin': 0, 'characterOffsetEnd': 7, 'pos': 'NNP', 'ner': 'PERSON', 'before': '', 'after': ' '}, {'index': 2, 'word': 'has', 'originalText': 'has', 'lemma': 'have', 'characterOffsetBegin': 8, 'characterOffsetEnd': 11, 'pos': 'VBZ', 'ner': 'O', 'before': ' ', 'after': ' '}, {'index': 3, 'word': 'moved', 'originalText': 'moved', 'lemma': 'move', 'characterOffsetBegin': 12, 'characterOffsetEnd': 17, 'pos': 'VBN', 'ner': 'O', 'before': ' ', 'after': ' '}, {'index': 4, 'word': 'from', 'originalText': 'from', 'lemma': 'from', 'characterOffsetBegin': 18, 'characterOffsetEnd': 22, 'pos': 'IN', 'ner': 'O', 'before': ' ', 'after': ' '}, {'index': 5, 'word': 'Real', 'originalText': 'Real', 'lemma': 'real', 'characterOffsetBegin': 23, 'characterOffsetEnd': 27, 'pos': 'JJ', 'ner': 'ORGANIZATION', 'before': ' ', 'after': ' '}, {'index': 6, 'word': 'Madrid', 'originalText': 'Madrid', 'lemma': 'Madrid', 'characterOffsetBegin': 28, 'characterOffsetEnd': 34, 'pos': 'NNP', 'ner': 'ORGANIZATION', 'before': ' ', 'after': ' '}, {'index': 7, 'word': 'to', 'originalText': 'to', 'lemma': 'to', 'characterOffsetBegin': 35, 'characterOffsetEnd': 37, 'pos': 'TO', 'ner': 'O', 'before': ' ', 'after': ' '}, {'index': 8, 'word': 'Juventus', 'originalText': 'Juventus', 'lemma': 'Juventus', 'characterOffsetBegin': 38, 'characterOffsetEnd': 46, 'pos': 'NNP', 'ner': 'ORGANIZATION', 'before': ' ', 'after': ''}, {'index': 9, 'word': '.', 'originalText': '.', 'lemma': '.', 'characterOffsetBegin': 46, 'characterOffsetEnd': 47, 'pos': '.', 'ner': 'O', 'before': '', 'after': ' '}]}, {'index': 1, 'entitymentions': [{'docTokenBegin': 14, 'docTokenEnd': 15, 'tokenBegin': 5, 'tokenEnd': 6, 'text': 'Barcelona', 'characterOffsetBegin': 76, 'characterOffsetEnd': 85, 'ner': 'ORGANIZATION'}], 'tokens': [{'index': 1, 'word': 'While', 'originalText': 'While', 'lemma': 'while', 'characterOffsetBegin': 48, 'characterOffsetEnd': 53, 'pos': 'IN', 'ner': 'O', 'before': ' ', 'after': ' '}, {'index': 2, 'word': 'messi', 'originalText': 'messi', 'lemma': 'messus', 'characterOffsetBegin': 54, 'characterOffsetEnd': 59, 'pos': 'NNS', 'ner': 'O', 'before': ' ', 'after': ' '}, {'index': 3, 'word': 'still', 'originalText': 'still', 'lemma': 'still', 'characterOffsetBegin': 60, 'characterOffsetEnd': 65, 'pos': 'RB', 'ner': 'O', 'before': ' ', 'after': ' '}, {'index': 4, 'word': 'plays', 'originalText': 'plays', 'lemma': 'play', 'characterOffsetBegin': 66, 'characterOffsetEnd': 71, 'pos': 'VBZ', 'ner': 'O', 'before': ' ', 'after': ' '}, {'index': 5, 'word': 'for', 'originalText': 'for', 'lemma': 'for', 'characterOffsetBegin': 72, 'characterOffsetEnd': 75, 'pos': 'IN', 'ner': 'O', 'before': ' ', 'after': ' '}, {'index': 6, 'word': 'Barcelona', 'originalText': 'Barcelona', 'lemma': 'Barcelona', 'characterOffsetBegin': 76, 'characterOffsetEnd': 85, 'pos': 'NNP', 'ner': 'ORGANIZATION', 'before': ' ', 'after': ''}]}]}
If you look at the above output carefully, you can find the POS tags, named entities and the lemmatized version of each word.
Lemmatization
Let's now explore the annotated results. We'll first print the lemmatized form of each word in the two sentences of our document:
for sentence in annot_doc["sentences"]:
    for word in sentence["tokens"]:
        print(word["word"] + "=>" + word["lemma"])
In the script above, the outer loop iterates through each sentence in the document and the inner loop iterates through each word in the sentence. Inside the inner loop, the word and its corresponding lemmatized form are printed on the console. The output looks like this:
Ronaldo=>Ronaldo
has=>have
moved=>move
from=>from
Real=>real
Madrid=>Madrid
to=>to
Juventus=>Juventus
.=>.
While=>while
messi=>messus
still=>still
plays=>play
for=>for
Barcelona=>Barcelona
For example, you can see that the word moved has been lemmatized to move; similarly, the word plays has been lemmatized to play.
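If you need the lemmas as whole sentences rather than one word per line, you can join them per sentence. A small sketch built on the same annot_doc object:
# Rebuild each sentence from the lemmas of its tokens
for sentence in annot_doc["sentences"]:
    lemmatized_sentence = " ".join(word["lemma"] for word in sentence["tokens"])
    print(lemmatized_sentence)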
POS Tagging
In the same way, we can find the POS tags for each word. Look at the following script:
for sentence in annot_doc["sentences"]:
    for word in sentence["tokens"]:
        print(word["word"] + "=>" + word["pos"])
In the output, you should see the following results:
Ronaldo=>NNP
has=>VBZ
moved=>VBN
from=>IN
Real=>JJ
Madrid=>NNP
to=>TO
Juventus=>NNP
.=>.
While=>IN
messi=>NNS
still=>RB
plays=>VBZ
for=>IN
Barcelona=>NNP
The POS tags used here come from the Penn Treebank tag set.
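If you want a quick overview of how often each tag occurs, the same nested loop can feed a counter from the standard library:
from collections import Counter

# Count the POS tags across all sentences in the annotated document
pos_counts = Counter(word["pos"]
                     for sentence in annot_doc["sentences"]
                     for word in sentence["tokens"])

print(pos_counts.most_common())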
Named Entity Recognition
To find named entities in our document, we can use the following script:
for sentence in annot_doc["sentences"]:
    for word in sentence["tokens"]:
        print(word["word"] + "=>" + word["ner"])
The output looks like this:
Ronaldo=>PERSON
has=>O
moved=>O
from=>O
Real=>ORGANIZATION
Madrid=>ORGANIZATION
to=>O
Juventus=>ORGANIZATION
.=>O
While=>O
messi=>O
still=>O
plays=>O
for=>O
Barcelona=>ORGANIZATION
We can see that Ronaldo has been identified as a PERSON while Barcelona has been identified as an ORGANIZATION, which in this case is correct.
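Note that the token-level view tags Real and Madrid separately. The JSON output shown earlier also contains an entitymentions list for each sentence, which groups multi-token entities such as Real Madrid into a single mention. A short sketch that prints the grouped mentions:
# Print grouped entity mentions instead of per-token NER tags
for sentence in annot_doc["sentences"]:
    for mention in sentence["entitymentions"]:
        print(mention["text"] + "=>" + mention["ner"])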
Sentiment Analysis
To find the sentiment of a sentence, all you have to do is pass sentiment as the value for the annotators property. Look at the following script:
doc = "I like this chocolate. This chocolate is not good. The chocolate is delicious. It's a very tasty chocolate. This is so bad"
annot_doc = nlp_wrapper.annotate(doc,
    properties={
        'annotators': 'sentiment',
        'outputFormat': 'json',
        'timeout': 1000,
    })
To find the sentiment, we can iterate over each sentence and then read its sentimentValue property. The sentimentValue is a score from 0 to 4, where 0 corresponds to highly negative sentiment and 4 to highly positive sentiment. The sentiment property gives the sentiment in verbal form, e.g. Positive, Negative or Neutral.
The following script finds the sentiment for each sentence in the document we defined above.
for sentence in annot_doc["sentences"]:
    print(" ".join([word["word"] for word in sentence["tokens"]]) + " => "
          + str(sentence["sentimentValue"]) + " = " + sentence["sentiment"])
Output:
I like this chocolate . => 2 = Neutral
This chocolate is not good . => 1 = Negative
The chocolate is delicious . => 3 = Positive
It's a very tasty chocolate . => 3 = Positive
This is so bad => 1 = Negative
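If you prefer a document-level summary over the per-sentence labels, the sentence results can be aggregated. A minimal sketch, assuming the annot_doc produced by the sentiment call above:
from collections import Counter

# Tally the verbal labels and average the numeric scores over all sentences
labels = Counter(sentence["sentiment"] for sentence in annot_doc["sentences"])
average_score = (sum(int(sentence["sentimentValue"]) for sentence in annot_doc["sentences"])
                 / len(annot_doc["sentences"]))

print(labels)
print(round(average_score, 2))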
Conclusion
StanfordCoreNLP is another extremely handy library for natural language processing. In this article, we studied how to set up the environment to run StanfordCoreNLP. We then explored the use of the StanfordCoreNLP library for common NLP tasks such as lemmatization, POS tagging and named entity recognition, and finally we rounded off the article with sentiment analysis using StanfordCoreNLP.