The Portable Document Format (PDF) is not a WYSIWYG (What You See is What You Get) format. It was developed to be platform-agnostic, independent of the underlying operating system and rendering engines.
To achieve this, PDF was constructed to be interacted with via something more like a programming language, and relies on a series of instructions and operations to achieve a result. In fact, PDF is based on a scripting language - PostScript, which was the first device-independent Page Description Language.
In this guide, we'll be using borb - a Python library dedicated to reading, manipulating and generating PDF documents. It offers both a low-level model (allowing you access to the exact coordinates and layout if you choose to use those) and a high-level model (where you can delegate the precise calculations of margins, positions, etc to a layout manager).
Specifically, we'll take a look at how to apply Optical Character Recognition (OCR) to a scanned PDF document.
Installing borb
borb can be downloaded from source on GitHub, or installed via pip:
$ pip install borb
“My PDF Document Has No Text!”
This is one of the most common questions on any programming forum or help desk:
"My document does not seem to have text in it. Help?"
Or:
"Your text-extraction code sample does not work for my document. How come?"
The answer is often as straightforward as "your scanner hates you".
Most of the documents for which text extraction doesn't work are PDF documents that are essentially glorified images. They contain all the metadata needed to constitute a PDF, but their pages are just large (often low-quality) images, created by scanning physical papers.
As a consequence, there are no text-rendering instructions in these documents. And most PDF libraries will not be able to handle them. borb, however, loves to help and can be applied in these cases, with built-in support for OCR.
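To make "no text-rendering instructions" a bit more concrete: in a decoded page content stream, text is drawn inside BT ... ET blocks using operators such as Tj and TJ, whereas a scanned page typically just paints an image with the Do operator. The sketch below is a rough heuristic and not part of borb. The function name has_text_operators and both sample streams are made up for illustration, and it assumes the content stream has already been decompressed.

```python
import re

# A text object starts with BT and ends with ET. Tj and TJ are the most
# common text-showing operators (the ' and " operators are ignored here).
_TEXT_OBJECT = re.compile(rb"\bBT\b.*?\b(?:Tj|TJ)\b.*?\bET\b", re.DOTALL)


def has_text_operators(decoded_content_stream: bytes) -> bool:
    """Heuristically report whether a decoded content stream draws any text."""
    return _TEXT_OBJECT.search(decoded_content_stream) is not None


# Illustrative (hand-written) content streams
TEXT_PAGE = b"BT /F1 12 Tf 72 720 Td (Lorem Ipsum) Tj ET"
SCANNED_PAGE = b"q 612 0 0 792 0 0 cm /Im1 Do Q"

if __name__ == "__main__":
    print(has_text_operators(TEXT_PAGE))
    print(has_text_operators(SCANNED_PAGE))
```

A byte-level search like this can be fooled (for example by string literals that happen to contain operator names), so treat it as a quick diagnostic rather than a parser.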
In this section we'll be using a special EventListener implementation called OCRAsOptionalContentGroup. This class uses tesseract (or rather pytesseract) to perform OCR (optical character recognition) on the Document.
If you'd like to read more about OCR in Python, read our Guide to Simple Optical Character Recognition with PyTesseract!
Once finished, the recognized text is re-inserted in each Page as a special "layer" (in PDF this is called an "optional content group").
With the content now restored, the usual tricks (SimpleTextExtraction) yield the expected results.
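Since the OCR step depends on both the tesseract executable and the pytesseract wrapper, it can save some debugging time to check for them up front. The following is a minimal sketch using only the standard library. check_ocr_prerequisites is a hypothetical helper (not a borb function), and it assumes tesseract is expected to be on your PATH.

```python
import importlib.util
import shutil
from typing import Dict


def check_ocr_prerequisites() -> Dict[str, bool]:
    """Report whether the tools the OCR step depends on can be found."""
    return {
        # The tesseract executable should be discoverable on PATH
        "tesseract_binary": shutil.which("tesseract") is not None,
        # pytesseract is the Python wrapper around that executable
        "pytesseract_module": importlib.util.find_spec("pytesseract") is not None,
    }


if __name__ == "__main__":
    for name, found in check_ocr_prerequisites().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

If the binary is installed somewhere outside PATH, this check will report it as missing even though pytesseract can be pointed at it explicitly.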
You'll start by creating a method that builds a PIL Image with some text in it. This Image will then be inserted in a PDF.
Creating an Image
import typing
from pathlib import Path
from PIL import Image as PILImage  # type: ignore[import]
from PIL import ImageDraw, ImageFont

def create_image() -> PILImage.Image:
    # Create new Image (white, 256 x 256 pixels)
    img = PILImage.new("RGB", (256, 256), color=(255, 255, 255))
    # Create ImageFont
    # CAUTION: you may need to adjust the path to your particular font directory
    font = ImageFont.truetype("/usr/share/fonts/truetype/ubuntu/UbuntuMono-B.ttf", 24)
    # Draw text (black) near the top-left corner
    draw = ImageDraw.Draw(img)
    draw.text((10, 10),
              "Hello World!",
              fill=(0, 0, 0),
              font=font)
    # Return
    return img
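Because the font path above is specific to one Linux distribution, you may prefer to probe a few locations before calling ImageFont.truetype. The sketch below shows one way to do that with the standard library only. The find_font helper and the candidate paths other than the Ubuntu one are assumptions for illustration, so adjust them to your system.

```python
from pathlib import Path
from typing import Iterable, Optional

# Hypothetical candidate locations; only the first comes from this guide
CANDIDATE_FONTS = [
    "/usr/share/fonts/truetype/ubuntu/UbuntuMono-B.ttf",
    "/usr/share/fonts/truetype/dejavu/DejaVuSansMono-Bold.ttf",
    "/Library/Fonts/Arial.ttf",
    "C:/Windows/Fonts/arial.ttf",
]


def find_font(candidates: Iterable[str] = CANDIDATE_FONTS) -> Optional[Path]:
    """Return the first candidate path that is an existing file, else None."""
    for candidate in candidates:
        path = Path(candidate)
        if path.is_file():
            return path
    return None


if __name__ == "__main__":
    print(find_font())
```

If find_font() returns None, Pillow's ImageFont.load_default() is a possible fallback, though its default glyphs may be harder for OCR to recognize.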
Now let's build a PDF with this image to represent our scanned document. The image part isn't parsable, as it contains no text-rendering instructions:
import typing
# New imports
from borb.pdf.canvas.layout.image.image import Image
from borb.pdf.canvas.layout.page_layout.multi_column_layout import SingleColumnLayout
from borb.pdf.canvas.layout.page_layout.page_layout import PageLayout
from borb.pdf.canvas.layout.text.paragraph import Paragraph
from borb.pdf.document import Document
from borb.pdf.page.page import Page
from borb.pdf.pdf import PDF

# Main method to create the document
def create_document():
    # Create Document
    d: Document = Document()
    # Create/add Page
    p: Page = Page()
    d.append_page(p)
    # Set PageLayout
    l: PageLayout = SingleColumnLayout(p)
    # Add Paragraph
    l.add(Paragraph("Lorem Ipsum"))
    # Add Image
    l.add(Image(create_image()))
    # Write
    with open("output_001.pdf", "wb") as pdf_file_handle:
        PDF.dumps(pdf_file_handle, d)
The resulting document should look like this:
When you select the text in this document, you'll see immediately that only the top line is actually text. The rest is an Image with text (the Image you created):
Now, let's apply OCR to this document, and overlay actual text so that it becomes parsable:
# New imports
from pathlib import Path
from borb.toolkit.ocr.ocr_as_optional_content_group import OCRAsOptionalContentGroup
from borb.toolkit.text.simple_text_extraction import SimpleTextExtraction

def apply_ocr_to_document():
    # Set up everything for OCR
    # CAUTION: adjust this path to the tessdata directory on your machine
    tesseract_data_dir: Path = Path("/home/joris/Downloads/tessdata-master/")
    assert tesseract_data_dir.exists()
    l: OCRAsOptionalContentGroup = OCRAsOptionalContentGroup(tesseract_data_dir)
    # Read Document (the listener performs OCR while the PDF is being loaded)
    doc: typing.Optional[Document] = None
    with open("output_001.pdf", "rb") as pdf_file_handle:
        doc = PDF.loads(pdf_file_handle, [l])
    assert doc is not None
    # Store Document
    with open("output_002.pdf", "wb") as pdf_file_handle:
        PDF.dumps(pdf_file_handle, doc)
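The assert in the function above only confirms that the tessdata directory exists. A slightly more informative check, sketched here, lists which language models are actually present. Tesseract stores each model in a file named after its language code, such as eng.traineddata. The available_languages helper is hypothetical and not part of borb.

```python
from pathlib import Path
from typing import List


def available_languages(tessdata_dir: Path) -> List[str]:
    """Return the language codes that have a *.traineddata file in tessdata_dir."""
    if not tessdata_dir.is_dir():
        raise FileNotFoundError(f"tessdata directory not found: {tessdata_dir}")
    # "eng.traineddata" -> "eng"
    return sorted(p.stem for p in tessdata_dir.glob("*.traineddata"))
```

Calling this before constructing OCRAsOptionalContentGroup lets you fail early with a clear message if, say, "eng" is missing from the returned list.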
Running apply_ocr_to_document() creates an extra layer in output_002.pdf. This layer is named "OCR by borb", and contains the rendering instructions borb re-inserted in the Document.
You can toggle the visibility of this layer (this can be handy when debugging):
You can see that borb re-inserted the text-rendering instructions to ensure "Hello World!" is in the Document. Let's hide this layer again.
Keep in mind OCR is a heuristic. The location and matched text may not always be 100% correct. That's just the way it goes. Typically, you'll keep the layer hidden (but selectable) so the original image is in place, and you can select/copy an approximation of it.
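Because the recognized text is only an approximation, exact string comparisons against OCR output tend to be brittle. One option, sketched below with the standard library's difflib, is to compare with a similarity threshold instead. The helper names and the 0.85 threshold are arbitrary choices for illustration, not something borb prescribes.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Collapse runs of whitespace and ignore case before comparing."""
    return " ".join(text.split()).casefold()


def similarity(expected: str, recognized: str) -> float:
    """Return a ratio between 0.0 and 1.0; 1.0 means identical after normalizing."""
    return SequenceMatcher(None, normalize(expected), normalize(recognized)).ratio()


def roughly_matches(expected: str, recognized: str, threshold: float = 0.85) -> bool:
    """Treat OCR output as a match if it is close enough to the expected text."""
    return similarity(expected, recognized) >= threshold
```

With this approach a single misread character, such as "Hel1o World!" for "Hello World!", still counts as a match, while unrelated text does not.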
Now (even with the layer hidden), you can select the text:
And if you apply SimpleTextExtraction now, you should be able to retrieve all the text in the Document.
# New imports
from borb.toolkit.text.simple_text_extraction import SimpleTextExtraction

def read_modified_document():
    doc: typing.Optional[Document] = None
    l: SimpleTextExtraction = SimpleTextExtraction()
    with open("output_002.pdf", "rb") as pdf_file_handle:
        doc = PDF.loads(pdf_file_handle, [l])
    print(l.get_text_for_page(0))

def main():
    create_document()
    apply_ocr_to_document()
    read_modified_document()

if __name__ == "__main__":
    main()
This prints:
Lorem Ipsum
Hello World!
Awesome!
Conclusion
In this guide you've learned how to apply OCR to PDF documents, ensuring your scanned documents are searchable and ready for future processing.