Extract and Process PDF Invoices in Python with borb

Introduction

The Portable Document Format (PDF) is not a WYSIWYG (What You See is What You Get) format. It was developed to be platform-agnostic, independent of the underlying operating system and rendering engines.

To achieve this, a PDF is described by something closer to a programming language: it relies on a series of instructions and operations to produce a result. In fact, PDF is based on PostScript, a scripting language that was the first device-independent Page Description Language.

In this guide, we'll be using borb - a Python library dedicated to reading, manipulating and generating PDF documents. It offers both a low-level model (allowing you access to the exact coordinates and layout if you choose to use those) and a high-level model (where you can delegate the precise calculations of margins, positions, etc to a layout manager).

We'll take a look at how to process a PDF invoice in Python using borb by extracting its text. Because the text in a PDF can be extracted, the format lends itself to automated processing.

Automating processing is one of the fundamental goals of machines. If no machine-readable document, such as JSON, is supplied alongside the human-oriented invoice, you'll have to parse the PDF contents yourself.

Installing borb

borb can be downloaded from source on GitHub, or installed via pip:

$ pip install borb

Creating a PDF Invoice in Python with borb

In the previous guide, we generated a PDF invoice using borb, which we'll now process.

If you'd like to read more about How to Create Invoices in Python with borb, we've got you covered!

The generated PDF document looks like this:

[Image: the generated PDF invoice]

Processing a PDF Invoice with borb

Let's start by opening the PDF file and loading it into a Document - the object-representation of the file:

import typing
from borb.pdf.document import Document
from borb.pdf.pdf import PDF

def main():
    d: typing.Optional[Document] = None
    with open("output.pdf", "rb") as pdf_in_handle:
        d = PDF.loads(pdf_in_handle)

    assert d is not None


if __name__ == "__main__":
    main()

The code follows a pattern similar to the one in the json library: a static method, loads(), which here accepts a file handle and returns a data structure. (In the json library itself, loads() takes a string, while load() is the variant that takes a file handle.)

Next, we'd like to extract all the text contents of the file. borb enables this by allowing you to register EventListener objects, which are notified while the Document is being parsed.

For instance, whenever borb encounters some kind of text-rendering instruction it will notify all registered EventListener objects, which can then process the emitted Event.
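To make the idea concrete, here is a minimal, library-free sketch of the listener pattern described above. It is not borb's implementation: TextEvent, Listener, CollectText and parse() are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextEvent:
    """Hypothetical stand-in for a text-rendering event."""
    text: str
    x: float
    y: float


class Listener:
    """Base class: receives every event the toy parser emits."""

    def on_event(self, event: TextEvent) -> None:
        raise NotImplementedError


class CollectText(Listener):
    """Collects the text of every event it is notified about."""

    def __init__(self) -> None:
        self.chunks: List[str] = []

    def on_event(self, event: TextEvent) -> None:
        self.chunks.append(event.text)


def parse(events: List[TextEvent], listeners: List[Listener]) -> None:
    """Toy 'parser': notifies each registered listener of each event."""
    for event in events:
        for listener in listeners:
            listener.on_event(event)


collector = CollectText()
parse([TextEvent("Invoice #", 300, 700), TextEvent("1741", 360, 700)], [collector])
print(" ".join(collector.chunks))  # prints "Invoice # 1741"
```

The parser knows nothing about what a listener does with an event, which is what lets different listeners (text, images, pattern matching) share the same parsing pass.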

borb comes with quite a few implementations of EventListener:

  • SimpleTextExtraction: Extracts text from a PDF
  • SimpleImageExtraction: Extracts all images from a PDF
  • RegularExpressionTextExtraction: Matches a regular expression, and returns the matches per page
  • etc.

We'll start by extracting all the text:

import typing
from borb.pdf.document import Document
from borb.pdf.pdf import PDF

# New import
from borb.toolkit.text.simple_text_extraction import SimpleTextExtraction

def main():

    d: typing.Optional[Document] = None
    l: SimpleTextExtraction = SimpleTextExtraction()
    with open("output.pdf", "rb") as pdf_in_handle:
        d = PDF.loads(pdf_in_handle, [l])

    assert d is not None
    print(l.get_text_for_page(0))


if __name__ == "__main__":
    main()

This code-snippet should print all the text in the invoice, in reading order (top to bottom, left to right):

[Street Address] Date 6/5/2021
[City, State, ZIP Code] Invoice # 1741
[Phone] Due Date 6/5/2021
[Email Address]
[Company Website]
BILL TO SHIP TO
[Recipient Name] [Recipient Name]
[Company Name] [Company Name]
[Street Address] [Street Address]
[City, State, ZIP Code] [City, State, ZIP Code]
[Phone] [Phone]
DESCRIPTION QTY UNIT PRICE AMOUNT
Product 1 2 $ 50 $ 100
Product 2 4 $ 60 $ 240
Labor 14 $ 60 $ 840
Subtotal $ 1,180.00
Discounts $ 177.00
Taxes $ 100.30
Total $ 1163.30

This isn't very useful yet, since it requires more processing before we can do much with it. Still, it's a great start, especially compared to OCR-scanned PDF documents!
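As an illustration of what that further processing might look like, here is a rough sketch that pulls a few fields out of the extracted text with the standard re module. The sample string is copied from the output above; parse_invoice_fields() is a hypothetical helper, and the patterns assume the labels and spacing appear exactly as shown, which may not hold for other invoices.

```python
import re
from decimal import Decimal
from typing import Dict

# Sample lines copied from the extracted text shown above.
EXTRACTED = """\
[Street Address] Date 6/5/2021
[City, State, ZIP Code] Invoice # 1741
[Phone] Due Date 6/5/2021
Subtotal $ 1,180.00
Discounts $ 177.00
Taxes $ 100.30
Total $ 1163.30
"""


def parse_invoice_fields(text: str) -> Dict[str, object]:
    """Hypothetical helper: extract the invoice number and the money lines."""
    fields: Dict[str, object] = {}

    number = re.search(r"Invoice #\s*(\d+)", text)
    if number:
        fields["invoice_number"] = int(number.group(1))

    # Lines of the form "<label> $ <amount>", e.g. "Taxes $ 100.30"
    money_line = r"^(Subtotal|Discounts|Taxes|Total) \$ ([\d,]+\.\d{2})$"
    for label, amount in re.findall(money_line, text, flags=re.MULTILINE):
        # Drop the thousands separator before converting
        fields[label.lower()] = Decimal(amount.replace(",", ""))

    return fields


fields = parse_invoice_fields(EXTRACTED)
print(fields["invoice_number"], fields["total"])  # prints "1741 1163.30"
```

Decimal is used instead of float so that monetary amounts are not subject to binary rounding.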

Let's refine this code and tell borb which Rectangle we are interested in.

For instance, let's extract the shipping information (but you can modify the code to retrieve any area of interest).

To restrict extraction to a Rectangle, we'll use the LocationFilter class. This class implements EventListener. It is notified of all Events when the Page is rendered, and passes on to its child listeners only those that occur inside the predefined bounds:

import typing
from decimal import Decimal

from borb.pdf.document import Document
from borb.pdf.pdf import PDF
from borb.toolkit.text.simple_text_extraction import SimpleTextExtraction

# New import
from borb.toolkit.location.location_filter import LocationFilter
from borb.pdf.canvas.geometry.rectangle import Rectangle

def main():

    d: typing.Optional[Document] = None

    # Define rectangle of interest
    # x, y, width, height
    r: Rectangle = Rectangle(Decimal(280),
                             Decimal(510),
                             Decimal(200),
                             Decimal(130))

    # Set up EventListener(s)
    l0: LocationFilter = LocationFilter(r)
    l1: SimpleTextExtraction = SimpleTextExtraction()
    l0.add_listener(l1)

    with open("output.pdf", "rb") as pdf_in_handle:
        d = PDF.loads(pdf_in_handle, [l0])

    assert d is not None
    print(l1.get_text_for_page(0))


if __name__ == "__main__":
    main()

Running this code, assuming that the right rectangle is chosen, prints:

SHIP TO
[Recipient Name]
[Company Name]
[Street Address]
[City, State, ZIP Code]
[Phone]

This code is not very flexible or future-proof. It takes some fiddling to find the right Rectangle, and there is no guarantee it will work if the layout of the invoice changes even slightly.

For practical use, we'll need something more robust.

We can start by removing the hard-coded Rectangle. RegularExpressionTextExtraction can match a regular expression and return (among other things) its coordinates on the Page! Using pattern-matching, we can search for elements in a document automatically and retrieve them, instead of guessing where to draw a rectangle.

Let's use this class to find the words "SHIP TO", and build a Rectangle based on those coordinates:

import typing
from borb.pdf.document import Document
from borb.pdf.pdf import PDF
from borb.pdf.canvas.geometry.rectangle import Rectangle

# New imports
from borb.toolkit.text.regular_expression_text_extraction import RegularExpressionTextExtraction, PDFMatch

def main():

    d: typing.Optional[Document] = None
        
    # Set up EventListener
    l: RegularExpressionTextExtraction = RegularExpressionTextExtraction("SHIP TO")
    with open("output.pdf", "rb") as pdf_in_handle:
        d = PDF.loads(pdf_in_handle, [l])

    assert d is not None

    matches: typing.List[PDFMatch] = l.get_matches_for_page(0)
    assert len(matches) == 1

    r: Rectangle = matches[0].get_bounding_boxes()[0]
    print("%f %f %f %f" % (r.get_x(), r.get_y(), r.get_width(), r.get_height()))

if __name__ == "__main__":
    main()

Here, we've retrieved the Rectangle that bounds the matched text and printed its coordinates (x, y, width, height):

299.500000 621.000000 48.012000 8.616000

You will have noticed that get_bounding_boxes() returns a typing.List[Rectangle] rather than a single Rectangle. This is because a regular expression can match across multiple lines of text in the PDF, in which case there is more than one bounding box.
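If a match does span several boxes, one option is to merge them into a single enclosing box. The sketch below uses plain (x, y, width, height) tuples rather than borb's Rectangle, and enclosing_box() is a hypothetical helper, not part of the library.

```python
from typing import List, Tuple

# (x, y, width, height), with (x, y) assumed to be the lower-left corner
Box = Tuple[float, float, float, float]


def enclosing_box(boxes: List[Box]) -> Box:
    """Return the smallest box that contains every box in `boxes`."""
    if not boxes:
        raise ValueError("at least one box is required")
    left = min(x for x, _, _, _ in boxes)
    bottom = min(y for _, y, _, _ in boxes)
    right = max(x + w for x, _, w, _ in boxes)
    top = max(y + h for _, y, _, h in boxes)
    return (left, bottom, right - left, top - bottom)


merged = enclosing_box([(100.0, 620.0, 50.0, 10.0), (80.0, 600.0, 40.0, 10.0)])
print(merged)  # prints (80.0, 600.0, 70.0, 30.0)
```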

Also, keep in mind that the origin of a PDF page's coordinate system (the [0, 0] point) is located in the bottom-left corner, so the top of the Page has the highest Y-coordinate.
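A practical consequence of the bottom-left origin is that moving down the page means subtracting from y. The following sketch shows the arithmetic with plain tuples of Decimal values; region_below() is a hypothetical helper, and it assumes that (x, y) names the lower-left corner of a box.

```python
from decimal import Decimal
from typing import Tuple

# (x, y, width, height), with (x, y) assumed to be the lower-left corner
Box = Tuple[Decimal, Decimal, Decimal, Decimal]


def region_below(anchor: Box, width: Decimal, height: Decimal) -> Box:
    """Return a box whose top edge touches the bottom edge of `anchor`."""
    x, y, _, _ = anchor
    # y is the anchor's bottom edge; going down the page means subtracting
    return (x, y - height, width, height)


# Coordinates of the "SHIP TO" match printed above
anchor = (Decimal("299.5"), Decimal("621"), Decimal("48.012"), Decimal("8.616"))
below = region_below(anchor, Decimal(200), Decimal(100))
print(below[1], below[1] + below[3])  # prints "521 621"
```

The resulting box runs from y = 521 up to y = 621, so it sits directly beneath the matched heading without including it.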

Now that we know where to find "SHIP TO", we can update our earlier code to position the Rectangle of interest relative to those words, so that it covers the heading and the lines underneath it:

import typing
from decimal import Decimal

from borb.pdf.document import Document
from borb.pdf.pdf import PDF
from borb.pdf.canvas.geometry.rectangle import Rectangle
from borb.toolkit.location.location_filter import LocationFilter
from borb.toolkit.text.regular_expression_text_extraction import RegularExpressionTextExtraction, PDFMatch
from borb.toolkit.text.simple_text_extraction import SimpleTextExtraction

def find_ship_to() -> Rectangle:

    d: typing.Optional[Document] = None

    # Set up EventListener
    l: RegularExpressionTextExtraction = RegularExpressionTextExtraction("SHIP TO")
    with open("output.pdf", "rb") as pdf_in_handle:
        d = PDF.loads(pdf_in_handle, [l])

    assert d is not None

    matches: typing.List[PDFMatch] = l.get_matches_for_page(0)
    assert len(matches) == 1

    return matches[0].get_bounding_boxes()[0]


def main():

    d: typing.Optional[Document] = None

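    # Define rectangle of interest, relative to the "SHIP TO" heading.
    # The origin is bottom-left, so subtracting from y moves down the page:
    # start 100 points below the heading and extend 130 points upwards,
    # which covers the heading and the lines beneath it.
    ship_to_rectangle: Rectangle = find_ship_to()
    r: Rectangle = Rectangle(ship_to_rectangle.get_x() - Decimal(50),
                             ship_to_rectangle.get_y() - Decimal(100),
                             Decimal(200),
                             Decimal(130))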

    # Set up EventListener(s)
    l0: LocationFilter = LocationFilter(r)
    l1: SimpleTextExtraction = SimpleTextExtraction()
    l0.add_listener(l1)

    with open("output.pdf", "rb") as pdf_in_handle:
        d = PDF.loads(pdf_in_handle, [l0])

    assert d is not None
    print(l1.get_text_for_page(0))

if __name__ == "__main__":
    main()

And this code prints:

SHIP TO
[Recipient Name]
[Company Name]
[Street Address]
[City, State, ZIP Code]
[Phone]

This still requires some knowledge of the document, but it isn't nearly as rigid as the previous approach. As long as you know which text to look for, you can get its coordinates and extract the contents of a rectangle positioned relative to it.
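Once the block has been extracted, turning it into structured data is ordinary string handling. Here is a small sketch that maps the lines printed above to named fields. The field names and parse_address_block() are hypothetical, and the approach assumes the lines always appear in this order.

```python
from typing import Dict, List

# Sample copied from the output shown above.
BLOCK = """\
SHIP TO
[Recipient Name]
[Company Name]
[Street Address]
[City, State, ZIP Code]
[Phone]
"""

# Hypothetical field names, in the order the lines appear on the invoice
FIELD_NAMES: List[str] = ["name", "company", "street", "city_state_zip", "phone"]


def parse_address_block(text: str, heading: str = "SHIP TO") -> Dict[str, str]:
    """Map each non-empty line after the heading to a field name."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if lines and lines[0] == heading:
        lines = lines[1:]  # drop the heading itself
    return dict(zip(FIELD_NAMES, lines))


address = parse_address_block(BLOCK)
print(address["company"])  # prints "[Company Name]"
```

Because zip() stops at the shorter sequence, a block with missing trailing lines simply yields fewer fields instead of raising an error.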

Conclusion

In this guide, we've taken a look at how to process an invoice in Python using borb. We started by extracting all the text, then refined our process to extract only a region of interest. Finally, we matched a regular expression against the PDF to locate that region automatically, which makes the process more robust to layout changes.

Last Updated: March 23rd, 2023

Author: Joris Schellekens

I'm a software architect from Belgium, with a passion for machine learning, knowledge-based systems and graph algorithms. I'm also the author of borb, the pure Python PDF library.
