Perform OCR on a Scanned PDF in Python Using borb

The Portable Document Format (PDF) is not a WYSIWYG (What You See is What You Get) format. It was developed to be platform-agnostic, independent of the underlying operating system and rendering engines.

To achieve this, PDF was constructed to be interacted with via something more like a programming language, and relies on a series of instructions and operations to achieve a result. In fact, PDF is based on a scripting language - PostScript, which was the first device-independent Page Description Language.
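
To make the idea of "instructions and operations" concrete, here is a minimal sketch that assembles the kind of text-showing instruction sequence found in a PDF page's content stream. The operators (BT, Tf, Td, Tj, ET) come from the PDF specification; the function name, the font resource name F1 and the coordinates are assumptions made up for this illustration, and this is not how borb builds its output internally.

```python
def text_content_stream(text: str, x: int, y: int, font_name: str = "F1", size: int = 12) -> str:
    """Build a one-line PDF content stream that shows `text` at (x, y)."""
    # Backslashes and parentheses must be escaped inside a PDF literal string
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    # BT/ET open and close a text object, Tf selects font and size,
    # Td moves the text position, Tj shows the string
    return f"BT /{font_name} {size} Tf {x} {y} Td ({escaped}) Tj ET"


if __name__ == "__main__":
    print(text_content_stream("Hello World!", 10, 700))
```

A PDF viewer executes such operators in order, which is why a page without them has nothing to select or extract.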

In this guide, we'll be using borb - a Python library dedicated to reading, manipulating and generating PDF documents. It offers both a low-level model (allowing you access to the exact coordinates and layout if you choose to use those) and a high-level model (where you can delegate the precise calculations of margins, positions, etc. to a layout manager).

Specifically, we'll take a look at how to apply Optical Character Recognition (OCR) to a scanned PDF document.

Installing borb

borb can be downloaded from source on GitHub, or installed via pip:

$ pip install borb
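
Since the OCR step later in this guide relies on pytesseract, which in turn calls the tesseract executable, it can help to check up front that everything is in place. The following is a small sketch using only the standard library; the function name is made up for this example, and depending on your borb version, pytesseract may need to be installed separately.

```python
import importlib.util
import shutil


def check_ocr_prerequisites() -> dict:
    """Report which OCR prerequisites appear to be available on this machine."""
    return {
        # find_spec returns None when a top-level package is not installed
        "borb": importlib.util.find_spec("borb") is not None,
        "pytesseract": importlib.util.find_spec("pytesseract") is not None,
        # By default, pytesseract looks for the tesseract executable on PATH
        "tesseract binary": shutil.which("tesseract") is not None,
    }


if __name__ == "__main__":
    for name, available in check_ocr_prerequisites().items():
        print(f"{name}: {'found' if available else 'missing'}")
```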

“My PDF Document Has No Text!”

This is one of the most common questions on programming forums and help desks:

"My document does not seem to have text in it. Help?"

Or:

"Your text-extraction code sample does not work for my document. How come?"

The answer is often as straightforward as "your scanner hates you".

Most of the documents for which this doesn't work are PDF documents that are essentially glorified images. They contain all the meta-data needed to constitute a PDF, but their pages are just large (often low-quality) images, created by scanning physical papers.

As a consequence, there are no text-rendering instructions in these documents, and most PDF libraries will not be able to extract any text from them. borb, however, loves to help and can be applied in these cases, with built-in support for OCR.
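
To illustrate what "no text-rendering instructions" means, the sketch below checks a decoded page content stream for a text object (BT ... ET) containing a text-showing operator (Tj or TJ). It is a deliberately naive heuristic: the two sample streams are hand-written for this example, real content streams are usually compressed and must be decoded first, and a regular expression is no substitute for a real PDF parser.

```python
import re

# BT ... ET delimit a text object; Tj and TJ are text-showing operators
_TEXT_PATTERN = re.compile(rb"\bBT\b.*?\b(?:Tj|TJ)\b.*?\bET\b", re.DOTALL)


def has_text_instructions(content_stream: bytes) -> bool:
    """Return True if the decoded content stream appears to show any text."""
    return _TEXT_PATTERN.search(content_stream) is not None


# A page that renders a line of text
text_page = b"BT /F1 12 Tf 10 700 Td (Lorem Ipsum) Tj ET"
# A "scanned" page: it only scales and paints an image resource
scanned_page = b"q 256 0 0 256 10 500 cm /Im1 Do Q"

if __name__ == "__main__":
    print(has_text_instructions(text_page))
    print(has_text_instructions(scanned_page))
```

The second stream only places an image (the Do operator), so a text extractor has nothing to work with until OCR adds the missing instructions.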

In this section we'll be using a special EventListener implementation called OCRAsOptionalContentGroup. This class uses tesseract (or rather pytesseract) to perform OCR (optical character recognition) on the Document.

If you'd like to read more about OCR in Python, read our Guide to Simple Optical Character Recognition with PyTesseract!

Once finished, the recognized text is re-inserted in each Page as a special "layer" (in PDF this is called an "optional content group").

With the content now restored, the usual tricks (SimpleTextExtraction) yield the expected results.

You'll start by creating a method that builds a PIL Image with some text in it. This Image will then be inserted in a PDF.

Creating an Image

import typing
from pathlib import Path

from PIL import Image as PILImage  # type: ignore[import]
from PIL import ImageDraw, ImageFont

def create_image() -> PILImage.Image:
    # Create new Image
    img = PILImage.new("RGB", (256, 256), color=(255, 255, 255))

    # Create ImageFont
    # CAUTION: you may need to adjust the path to your particular font directory
    font = ImageFont.truetype("/usr/share/fonts/truetype/ubuntu/UbuntuMono-B.ttf", 24)

    # Draw text
    draw = ImageDraw.Draw(img)
    draw.text((10, 10),
              "Hello World!",
              fill=(0, 0, 0),
              font=font)

    # Return
    return img
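
Because the font path above is specific to one machine, a small helper can pick the first font file that actually exists. This is only a sketch: the helper name is made up, and the candidate paths other than the one used above are hypothetical examples that you should replace with fonts installed on your system.

```python
from pathlib import Path
from typing import Iterable, Optional

# Hypothetical candidate locations; adjust to the fonts installed on your system
CANDIDATE_FONTS = [
    "/usr/share/fonts/truetype/ubuntu/UbuntuMono-B.ttf",
    "/usr/share/fonts/truetype/dejavu/DejaVuSansMono-Bold.ttf",
    "C:/Windows/Fonts/consolab.ttf",
]


def first_existing_font(candidates: Iterable[str] = CANDIDATE_FONTS) -> Optional[Path]:
    """Return the first candidate path that points to an existing file, else None."""
    for candidate in candidates:
        path = Path(candidate)
        if path.is_file():
            return path
    return None
```

The result can be passed to ImageFont.truetype(str(path), 24). If the helper returns None, Pillow's ImageFont.load_default() is a possible fallback.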

Now let's build a PDF with this image to represent our scanned document. The text inside the image isn't parsable, as there are no text-rendering instructions behind it:

import typing
# New imports
from borb.pdf.canvas.layout.image.image import Image
from borb.pdf.canvas.layout.page_layout.multi_column_layout import SingleColumnLayout
from borb.pdf.canvas.layout.page_layout.page_layout import PageLayout
from borb.pdf.canvas.layout.text.paragraph import Paragraph
from borb.pdf.document import Document
from borb.pdf.page.page import Page
from borb.pdf.pdf import PDF

# Main method to create the document
def create_document():

    # Create Document
    d: Document = Document()

    # Create/add Page
    p: Page = Page()
    d.append_page(p)

    # Set PageLayout
    l: PageLayout = SingleColumnLayout(p)

    # Add Paragraph
    l.add(Paragraph("Lorem Ipsum"))

    # Add Image
    l.add(Image(create_image()))

    # Write
    with open("output_001.pdf", "wb") as pdf_file_handle:
        PDF.dumps(pdf_file_handle, d)

When you open the resulting document and select the text in it, you'll see immediately that only the top line is actually text. The rest is an Image with text (the Image you created).

Now, let's apply OCR to this document, and overlay actual text so that it becomes parsable:

# New imports
from pathlib import Path
from borb.toolkit.ocr.ocr_as_optional_content_group import OCRAsOptionalContentGroup
from borb.toolkit.text.simple_text_extraction import SimpleTextExtraction

def apply_ocr_to_document():

    # Set up everything for OCR
    # CAUTION: adjust this path to the directory holding your tesseract data files
    tesseract_data_dir: Path = Path("/home/joris/Downloads/tessdata-master/")
    assert tesseract_data_dir.exists()
    l: OCRAsOptionalContentGroup = OCRAsOptionalContentGroup(tesseract_data_dir)

    # Read Document
    doc: typing.Optional[Document] = None
    with open("output_001.pdf", "rb") as pdf_file_handle:
        doc = PDF.loads(pdf_file_handle, [l])

    assert doc is not None

    # Store Document
    with open("output_002.pdf", "wb") as pdf_file_handle:
        PDF.dumps(pdf_file_handle, doc)
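
The tesseract data directory above is hard-coded to one machine. As a sketch of a more portable setup, the helper below looks at the TESSDATA_PREFIX environment variable (which Tesseract itself reads) and an optional fallback, and only accepts a directory that contains .traineddata files. The function name is made up for this example, and it assumes TESSDATA_PREFIX points directly at the directory holding the data files.

```python
import os
from pathlib import Path
from typing import List, Optional


def resolve_tessdata_dir(fallback: Optional[Path] = None) -> Path:
    """Return a directory containing *.traineddata files, or raise FileNotFoundError."""
    candidates: List[Path] = []

    # Prefer the environment variable if it is set and non-empty
    env_value = os.environ.get("TESSDATA_PREFIX")
    if env_value:
        candidates.append(Path(env_value))

    # Then try the caller-supplied fallback location
    if fallback is not None:
        candidates.append(fallback)

    for candidate in candidates:
        if candidate.is_dir() and any(candidate.glob("*.traineddata")):
            return candidate

    raise FileNotFoundError(
        "No directory with *.traineddata files found; "
        "set TESSDATA_PREFIX or pass a fallback path"
    )
```

The returned Path could then be handed to OCRAsOptionalContentGroup in place of the hard-coded directory.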

You can see this created an extra layer in the PDF. This layer is named "OCR by borb", and contains the rendering instructions borb re-inserted in the Document.

You can toggle the visibility of this layer, which can be handy when debugging. With the layer visible, you can see that borb re-inserted the text-rendering instruction to ensure "Hello World!" is in the Document. Let's hide this layer again.

Keep in mind OCR is a heuristic. The location and matched text may not always be 100% correct. That's just the way it goes. Typically, you'll keep the layer hidden (but selectable) so the original image is in place, and you can select/copy an approximation of it.
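
When you know what a page should say, one way to get a feel for how close the OCR result is would be a simple similarity score. The sketch below uses difflib from the standard library; the function names and the sample strings are made up for this illustration, and the score is a rough indicator rather than a proper OCR accuracy metric.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Collapse runs of whitespace and ignore case before comparing."""
    return " ".join(text.split()).lower()


def ocr_similarity(expected: str, recognized: str) -> float:
    """Return a similarity ratio between 0.0 and 1.0 for the two strings."""
    return SequenceMatcher(None, normalize(expected), normalize(recognized)).ratio()


if __name__ == "__main__":
    # A typical OCR slip: the letter "l" recognized as the digit "1"
    print(round(ocr_similarity("Hello World!", "Hello Wor1d!"), 2))
```

A threshold on this score could be used to flag pages worth checking by hand.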

Now, even with the layer hidden, you can select the text.

And if you apply SimpleTextExtraction now, you should be able to retrieve all the text in the Document.

# New imports
from borb.toolkit.text.simple_text_extraction import SimpleTextExtraction

def read_modified_document():

    doc: typing.Optional[Document] = None
    l: SimpleTextExtraction = SimpleTextExtraction()
    with open("output_002.pdf", "rb") as pdf_file_handle:
        doc = PDF.loads(pdf_file_handle, [l])

    print(l.get_text_for_page(0))


def main():
    create_document()
    apply_ocr_to_document()
    read_modified_document()

    
if __name__ == "__main__":
    main()

This prints:

Lorem Ipsum
Hello World!

Awesome!

Conclusion

In this guide you've learned how to apply OCR to PDF documents, ensuring your scanned documents are searchable and ready for future processing.

Last Updated: October 5th, 2021

Author: Joris Schellekens

I'm a software architect from Belgium, with a passion for machine learning, knowledge-based systems and graph algorithms. I'm also the author of borb, the pure Python PDF library.

    © 2013-2021 Stack Abuse. All rights reserved.