Python for NLP: Working with Text and PDF Files

This is the first article in my series on Python for Natural Language Processing (NLP). We will start with the basics and see how to work with plain text files and PDF files using Python.

Working with Text Files

Text files are probably the most basic types of files that you are going to encounter in your NLP endeavors. In this section, we will see how to read from a text file in Python, create a text file, and write data to the text file.

Reading a Text File

Create a text file with the following text and save it in your local directory with a ".txt" extension.

Welcome to Natural Language Processing  
It is one of the most exciting research areas as of today  
We will see how Python can be used to work with text files.  

In my case, I stored the file named "myfile.txt" in my root "D:" directory.

Reading All File Contents

Now let's see how we can read the whole contents of the file. The first step is to open the file by passing its path to Python's built-in open function. The r prefix makes the path a raw string, so the backslash is not treated as the start of an escape sequence:

myfile = open(r"D:\myfile.txt")

If you execute the above piece of code and do not see an error, your file was successfully opened. Make sure to change the file path to the location in which you saved your text file.

Let's now see what is stored in the myfile variable:

print(myfile)  

The output looks like this:

<_io.TextIOWrapper name='D:\\myfile.txt' mode='r' encoding='cp1252'>  

The output shows that the myfile variable is a wrapper around the myfile.txt file and that the file was opened in read-only mode ('r'), which is the default.
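The encoding in that output (cp1252) comes from the platform's default, so it can differ from one machine to another. The following is a minimal sketch, assuming a scratch file in a temporary directory instead of the D: drive, of passing the encoding explicitly so that reading does not depend on that default:

```python
import os
import tempfile

# Assumption: a scratch file in a temporary directory stands in for the
# article's myfile.txt on the D: drive.
path = os.path.join(tempfile.mkdtemp(), "myfile.txt")

# Write some non-ASCII text as UTF-8 so that the encoding actually matters.
with open(path, "w", encoding="utf-8") as f:
    f.write("naïve café")

# Passing encoding= explicitly avoids relying on the platform default.
myfile = open(path, encoding="utf-8")
print(myfile.mode)      # the mode defaults to read-only text
print(myfile.encoding)  # the encoding we asked for
text = myfile.read()
myfile.close()
```

For NLP work, where text often contains accented or non-Latin characters, stating the encoding up front is usually the safer choice.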

If you specify the wrong file path, the call to open fails with a FileNotFoundError and the print statement is never reached:

myfile222 = open(r"D:\myfile222.txt")
print(myfile222)

FileNotFoundError: [Errno 2] No such file or directory: 'D:\\myfile222.txt'

Errno 2 has two likely causes: either the file doesn't exist, or you passed the wrong file path to the open function.
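If you would rather handle a missing file than let the script stop, you can catch the exception. This is only a sketch: the path is hypothetical and deliberately points at a file that does not exist in a temporary directory.

```python
import os
import tempfile

# Assumption: this path is hypothetical and deliberately does not exist.
missing_path = os.path.join(tempfile.mkdtemp(), "myfile222.txt")

try:
    myfile222 = open(missing_path)
except FileNotFoundError as error:
    # The exception carries the error number and the offending path.
    error_number = error.errno
    print("Could not open", error.filename)
else:
    # Only reached when the file was opened successfully.
    myfile222.close()
```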

Now, let's read the contents of the file. To do so, you need to call the read() function on the myfile variable, as shown below:

myfile = open(r"D:\myfile.txt")
print(myfile.read())  

In the output, you should see the text of the file, as shown below:

Welcome to Natural Language Processing  
It is one of the most exciting research areas as of today  
We will see how Python can be used to work with text files.  

Now if you try to call the read method again, you will see that nothing will be printed on the console:

print(myfile.read())  

This is because once you call the read method, the cursor is moved to the end of the text. Therefore, when you call read again, nothing is displayed since there is no more text to print.

To solve this problem, call the seek() method with 0 as the argument after calling read(). This moves the cursor back to the start of the text file. The following script shows how this works:

myfile = open(r"D:\myfile.txt")
print(myfile.read())  
myfile.seek(0)  
print(myfile.read())  

In the output, you will see the contents of the text file printed twice.
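To make the cursor behavior visible, here is a small sketch that uses the file object's tell() method alongside read() and seek(). It assumes a scratch file in a temporary directory rather than the file on the D: drive.

```python
import os
import tempfile

# Assumption: a scratch file in a temporary directory stands in for myfile.txt.
path = os.path.join(tempfile.mkdtemp(), "myfile.txt")
with open(path, "w") as f:
    f.write("Welcome to Natural Language Processing")

myfile = open(path)
start = myfile.tell()   # cursor position before anything is read
first = myfile.read()   # reads everything and leaves the cursor at the end
second = myfile.read()  # nothing left to read, so this is an empty string
myfile.seek(0)          # move the cursor back to the start
third = myfile.read()   # the full text again
myfile.close()

print(repr(second))
print(first == third)
```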

Once you are done working with a file, it is important to close the file so that other applications can access the file. To do so, you need to call the close() method.

myfile.close()

Reading a File Line by Line

Instead of reading all the contents of the file as a single string, we can also read the file line by line. To do so, we call the readlines() method, which returns a list in which each line of the text file is one item.

myfile = open(r"D:\myfile.txt")
print(myfile.readlines())  

In the output, you will see each line in the text file as a list item:

['Welcome to Natural Language Processing\n', 'It is one of the most exciting research areas as of today\n', 'We will see how Python can be used to work with text files.']

In many cases this makes the text easier to work with. A file object can also be iterated over directly, one line at a time. For example, we can loop through the lines and print the first word of each line.

myfile = open(r"D:\myfile.txt")
for line in myfile:
    print(line.split()[0])

The output looks like this:

Welcome  
It  
We  
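The loop above assumes that every line contains at least one word. On a blank line, split() returns an empty list and indexing it with [0] raises an IndexError. A hedged variant that skips blank lines, using a scratch file in a temporary directory as a stand-in for myfile.txt, could look like this:

```python
import os
import tempfile

# Assumption: a scratch file with a blank line in the middle.
path = os.path.join(tempfile.mkdtemp(), "myfile.txt")
with open(path, "w") as f:
    f.write("Welcome to Natural Language Processing\n"
            "\n"
            "We will see how Python can be used to work with text files.")

first_words = []
with open(path) as myfile:
    for line in myfile:
        words = line.split()
        if words:  # an empty list means the line was blank
            first_words.append(words[0])

print(first_words)
```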

Writing to a Text File

To write to a text file, you simply have to open a file with mode set to w or w+. The former opens a file in the write mode, while the latter opens the file in both read and write mode. If the file doesn't exist, it will be created. It is important to mention that if you open a file that already contains some text with w or w+ mode, all the existing file contents will be removed, as shown below:

myfile = open(r"D:\myfile.txt", 'w+')
print(myfile.read())  

In the output, you will see nothing printed on the screen: because the file was opened in w+ mode, all of its contents have been removed. To avoid this, append text to the file instead, which I cover below as well.

Now, let's write some content in the file using the write() method.

myfile = open(r"D:\myfile.txt", 'w+')
print(myfile.read())  
myfile.write("The file has been rewritten")  
myfile.seek(0)  
print(myfile.read())  

In the script above, we write text to the file and then call the seek() method to shift the cursor back to the start and then call the read method to read the contents of the file. In the output, you will see the newly added content as shown below:

The file has been rewritten  
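It is worth seeing that the old contents disappear as soon as the file is opened for writing, not when write() is called. The sketch below, which assumes a scratch file in a temporary directory, checks the file size right after opening it in w mode:

```python
import os
import tempfile

# Assumption: a scratch file in a temporary directory stands in for myfile.txt.
path = os.path.join(tempfile.mkdtemp(), "myfile.txt")
with open(path, "w") as f:
    f.write("Original contents")

# Reopening in 'w' mode empties the file before anything is written.
myfile = open(path, "w")
size_after_open = os.path.getsize(path)
myfile.write("The file has been rewritten")
myfile.close()

with open(path) as f:
    contents = f.read()

print(size_after_open)
print(contents)
```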

Often, you don't want to wipe out the existing contents of the file. Rather, you may need to add content at the end of the file.

To do so, open the file in a+ mode, which stands for append plus read.

Again, create a file with the following contents and save it as "myfile.txt" in the root of the "D:" drive:

Welcome to Natural Language Processing  
It is one of the most exciting research areas as of today  
We will see how Python can be used to work with text files.  

Execute the following script to open the file with the append mode:

myfile = open(r"D:\myfile.txt", 'a+')
myfile.seek(0)  
print(myfile.read())  

In the output, you will see the contents of the file.

Next, let's append some text to the file.

myfile.write("\nThis is a new line")  

Let's now again read the file contents:

myfile.seek(0)  
print(myfile.read())  

In the output, you will see the newly appended line at the end of the text as shown below:

Welcome to Natural Language Processing  
It is one of the most exciting research areas as of today  
We will see how Python can be used to work with text files.  
This is a new line  
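When you only need to add text and do not need to read the file through the same handle, plain a mode is enough. The following is a minimal sketch under that assumption, again using a scratch file in a temporary directory:

```python
import os
import tempfile

# Assumption: a scratch file in a temporary directory stands in for myfile.txt.
path = os.path.join(tempfile.mkdtemp(), "myfile.txt")
with open(path, "w") as f:
    f.write("Welcome to Natural Language Processing")

# 'a' opens the file for appending; existing contents are kept.
with open(path, "a") as myfile:
    myfile.write("\nThis is a new line")

# Reopen in the default read mode to check the result.
with open(path) as myfile:
    lines = myfile.read().splitlines()

print(lines)
```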

Finally, before moving on to the next section, let's see how a context manager can be used to automatically close the file after performing the desired operations.

with open(r"D:\myfile.txt") as myfile:
    print(myfile.read())

Using the with keyword, as shown above, you don't need to explicitly close the file. Rather, the above script opens the file, reads its contents, and then closes it automatically.
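You can confirm this behavior through the file object's closed attribute. The sketch below assumes a scratch file in a temporary directory and also shows that the file is closed when an error occurs inside the with block:

```python
import os
import tempfile

# Assumption: a scratch file in a temporary directory stands in for myfile.txt.
path = os.path.join(tempfile.mkdtemp(), "myfile.txt")
with open(path, "w") as f:
    f.write("Welcome to Natural Language Processing")

with open(path) as myfile:
    closed_inside = myfile.closed  # still open inside the block
    contents = myfile.read()
closed_outside = myfile.closed     # closed once the block ends

# The file is closed even if the block is left because of an exception.
try:
    with open(path) as failing_file:
        raise ValueError("simulated failure while processing the file")
except ValueError:
    closed_after_error = failing_file.closed

print(closed_inside, closed_outside, closed_after_error)
```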

Working with PDF Files

In addition to text files, we often need to work with PDF files to perform different natural language processing tasks. By default, Python doesn't come with any built-in library that can be used to read or write PDF files. Rather, we can use the PyPDF2 library.

Before we can use the PyPDF2 library, we need to install it. If you are using the pip installer, you can install the PyPDF2 library with the following command:

$ pip install PyPDF2

Alternatively, if you are using Python from an Anaconda environment, you can execute the following command at the conda command prompt:

$ conda install -c conda-forge pypdf2

Note: A PDF document can be created from different sources, such as word processing documents or images. In this article, we will only deal with PDF documents created using word processors. PDF documents created from images require other specialized libraries, which I will explain in a later article.

As a dummy document to play around with, you can download the PDF from this link:

http://www.bavtailor.com/wp-content/uploads/2018/10/Lorem-Ipsum.pdf

Download the document locally at the root of the "D" drive.

Reading a PDF Document

To read a PDF document, we first have to open it like any ordinary file. Look at the following script:

import PyPDF2  
mypdf = open(r'D:\Lorem-Ipsum.pdf', mode='rb')

When opening a PDF file, the mode must be set to rb, which stands for "read binary", because PDF files are binary rather than plain text.

Once the file is opened, we need to pass it to the PdfFileReader class of the PyPDF2 library, as shown below. Note that this class name belongs to older PyPDF2 releases; PyPDF2 3.0 removed it in favor of PdfReader.

pdf_document = PyPDF2.PdfFileReader(mypdf)  

Now using the pdf_document variable, we can perform a variety of read functions. For instance, to get the total number of pages in the PDF document, we can use the numPages attribute:

pdf_document.numPages  

Since our PDF document has only one page, you will see 1 in the output.

Finally, to extract the text from the PDF document, you first need to get the page of the PDF document using the getPage() function.

Next, you can call the extractText() function to extract the text from that particular page.

The following script extracts the text from the first page of the PDF and then prints it on the console.

first_page = pdf_document.getPage(0)

print(first_page.extractText())  

In the output, you should see the text from the first page of the PDF.

Writing to a PDF Document

It is not possible to directly write Python strings to a PDF document using the PyPDF2 library due to fonts and other constraints. However, for the sake of demonstration, we will read content from our PDF document and then write that content to another PDF file that we will create.

Let's first read the contents of the first page of our PDF document.

import PyPDF2

mypdf = open(r'D:\Lorem-Ipsum.pdf', mode='rb')
pdf_document = PyPDF2.PdfFileReader(mypdf)  
pdf_document.numPages

page_one = pdf_document.getPage(0)  

The above script reads the first page of our PDF document. Now we can write the contents from the first page to a new PDF document using the following script:

pdf_document_writer = PyPDF2.PdfFileWriter()  

The script above creates an object that can be used to write content to a PDF file. First, we will add a page to this object and pass it the page that we retrieved from the other PDF.

pdf_document_writer.addPage(page_one)  

Next, we need to open a new file in wb (write binary) mode. Opening a file in this mode creates the file if it doesn't exist and overwrites it if it does.

pdf_output_file = open('new_pdf_file.pdf', 'wb')  

Finally, we need to call the write() method on the PDF writer object and pass it the newly created file.

pdf_document_writer.write(pdf_output_file)

Close both the mypdf and pdf_output_file files and go to the program's working directory. You should see a new file named new_pdf_file.pdf. Open it and you should see that it contains the contents of the first page of our original PDF.

Let's try to read the contents of our newly created PDF document. Change the path in the script below to the working directory in which your new_pdf_file.pdf was created:

import PyPDF2

mypdf = open(r'C:\Users\Admin\new_pdf_file.pdf', mode='rb')

pdf_document = PyPDF2.PdfFileReader(mypdf)  
pdf_document.numPages  
page_one = pdf_document.getPage(0)

print(page_one.extractText())  

Let's now work with a bigger PDF file. Download the PDF file from this link:

http://ctan.math.utah.edu/ctan/tex-archive/macros/latex/contrib/lipsum/lipsum.pdf

Save it in your local directory. The name of the downloaded file will be "lipsum.pdf".

Execute the following script to see the number of pages in the file:

import PyPDF2

mypdf = open(r'D:\lipsum.pdf', mode='rb')  
pdf_document = PyPDF2.PdfFileReader(mypdf)  
pdf_document.numPages  

In the output, you will see 87 printed out since there are 87 pages in the PDF. Let's print all the pages in the document on the console:

import PyPDF2

mypdf = open(r'D:\lipsum.pdf', mode='rb')  
pdf_document = PyPDF2.PdfFileReader(mypdf)

for i in range(pdf_document.numPages):  
    page_to_print = pdf_document.getPage(i)
    print(page_to_print.extractText())

In the output, you will see the text of all the pages of the PDF document printed on the screen.
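For NLP tasks you will usually want to keep the extracted text rather than just print it. The following is a rough sketch, not a definitive implementation: extract_pages and read_pdf_text are hypothetical helper names, and the code assumes the same PdfFileReader-style interface used in this article (numPages, getPage() and extractText()).

```python
def extract_pages(pdf_document):
    """Return the extracted text of every page as a list of strings.

    pdf_document is assumed to behave like the PdfFileReader object used
    above: it exposes numPages and getPage(i), and each page has extractText().
    """
    return [pdf_document.getPage(i).extractText()
            for i in range(pdf_document.numPages)]


def read_pdf_text(path):
    """Open the PDF at path and return its page texts (hypothetical helper)."""
    import PyPDF2  # imported here so extract_pages can be used without it

    # The with block closes the file once the text has been extracted.
    with open(path, mode="rb") as mypdf:
        return extract_pages(PyPDF2.PdfFileReader(mypdf))
```

Calling read_pdf_text with the path to lipsum.pdf should then give you one string per page, which you can join or process page by page.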

Conclusion

Reading and writing text documents is a fundamental step in developing natural language processing applications. In this article, we saw how to read and write text files and PDF files using Python.

In the next article, we will start our discussion of a few other NLP tasks, such as stemming, lemmatization, and tokenization, with the spaCy library.

About Usman Malik
Paris (France)
Programmer | Blogger | Data Science Enthusiast | PhD To Be | Arsenal FC for Life