Introduction
The Portable Document Format (PDF) is not a WYSIWYG (What You See is What You Get) format. It was developed to be platform-agnostic, independent of the underlying operating system and rendering engines.
To achieve this, PDF was constructed to be interacted with via something more like a programming language, and relies on a series of instructions and operations to achieve a result. In fact, PDF is based on a scripting language - PostScript, which was the first device-independent Page Description Language.
In this guide, we'll be using borb - a Python library dedicated to reading, manipulating and generating PDF documents. It offers both a low-level model (allowing you access to the exact coordinates and layout if you choose to use those) and a high-level model (where you can delegate the precise calculations of margins, positions, etc to a layout manager).
We'll take a look at how to process a PDF invoice in Python using borb by extracting its text. PDF is an extractable format, which makes it well suited to automated processing.
Automating processing is one of the fundamental goals of machines, and if someone doesn't supply a parsable document, such as json, alongside a human-oriented invoice, you'll have to parse the PDF contents yourself.
Installing borb
borb can be downloaded from source on GitHub, or installed via pip:
$ pip install borb
Creating a PDF Invoice in Python with borb
In the previous guide, we generated a PDF invoice using borb, which we'll now process.
If you'd like to read more about How to Create Invoices in Python with borb, we've got you covered!
The generated PDF document specifically looks like this:
Processing a PDF Invoice with borb
Let's start by opening the PDF file and loading it into a Document - the object-representation of the file:
import typing
from borb.pdf.document import Document
from borb.pdf.pdf import PDF


def main():
    d: typing.Optional[Document] = None
    with open("output.pdf", "rb") as pdf_in_handle:
        d = PDF.loads(pdf_in_handle)
    assert d is not None


if __name__ == "__main__":
    main()
The code follows a pattern similar to the one you might see in the json library: a static method, loads(), accepts a file handle and returns a data structure.
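For comparison, here is a minimal sketch of the analogous pattern in the standard library's json module. Note that in json it is load() that reads from a file handle, while loads() takes a string. The in-memory file and the sample payload below are made up for illustration.

```python
import io
import json

# An in-memory text file standing in for a real file on disk (illustrative data).
fake_file = io.StringIO('{"invoice_number": 1741, "total": "1163.30"}')

# json.load() takes a file handle and returns a plain Python data structure.
invoice = json.load(fake_file)

print(invoice["invoice_number"])
```

In both cases the caller hands over an open handle and gets back an in-memory object to work with.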
Next, we'd like to be able to extract all the text contents of the file. borb enables this by allowing you to register EventListener classes to the parsing of the Document.
For instance, whenever borb encounters some kind of text-rendering instruction it will notify all registered EventListener objects, which can then process the emitted Event.
borb comes with quite a few implementations of EventListener:
- SimpleTextExtraction: Extracts text from a PDF
- SimpleImageExtraction: Extracts all images from a PDF
- RegularExpressionTextExtraction: Matches a regular expression, and returns the matches per page
- etc.
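The listener mechanism is an instance of the observer pattern. The following is a plain-Python sketch of that idea, not borb's actual implementation: TextEvent, Listener, CollectText and parse are hypothetical names invented for this example.

```python
from typing import List


class TextEvent:
    """Hypothetical stand-in for an event emitted while parsing."""

    def __init__(self, text: str) -> None:
        self.text = text


class Listener:
    """Hypothetical base class: receives every event the parser emits."""

    def handle(self, event: TextEvent) -> None:
        raise NotImplementedError


class CollectText(Listener):
    """Collects the text of each event, loosely like a text extractor."""

    def __init__(self) -> None:
        self.chunks: List[str] = []

    def handle(self, event: TextEvent) -> None:
        self.chunks.append(event.text)


def parse(instructions: List[str], listeners: List[Listener]) -> None:
    """Toy parser: turns each instruction into an event and notifies listeners."""
    for instruction in instructions:
        event = TextEvent(instruction)
        for listener in listeners:
            listener.handle(event)


collector = CollectText()
parse(["Invoice #", "1741"], [collector])
print(" ".join(collector.chunks))
```

The parser knows nothing about what each listener does with an event, which is why different extractors can be swapped in without touching the parsing code.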
We'll start by extracting all the text:
import typing
from borb.pdf.document import Document
from borb.pdf.pdf import PDF

# New import
from borb.toolkit.text.simple_text_extraction import SimpleTextExtraction


def main():
    d: typing.Optional[Document] = None
    l: SimpleTextExtraction = SimpleTextExtraction()
    with open("output.pdf", "rb") as pdf_in_handle:
        d = PDF.loads(pdf_in_handle, [l])
    assert d is not None
    print(l.get_text_for_page(0))


if __name__ == "__main__":
    main()
This code-snippet should print all the text in the invoice, in reading order (top to bottom, left to right):
[Street Address] Date 6/5/2021
[City, State, ZIP Code] Invoice # 1741
[Phone] Due Date 6/5/2021
[Email Address]
[Company Website]
BILL TO SHIP TO
[Recipient Name] [Recipient Name]
[Company Name] [Company Name]
[Street Address] [Street Address]
[City, State, ZIP Code] [City, State, ZIP Code]
[Phone] [Phone]
DESCRIPTION QTY UNIT PRICE AMOUNT
Product 1 2 $ 50 $ 100
Product 2 4 $ 60 $ 240
Labor 14 $ 60 $ 840
Subtotal $ 1,180.00
Discounts $ 177.00
Taxes $ 100.30
Total $ 1163.30
On its own, this isn't very useful yet, since the raw text needs more processing before we can do much with it. It's a great start though, especially compared to OCR-scanned PDF documents!
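As an illustration of the kind of further processing the raw text would need, here is a small sketch that pulls a few fields out with regular expressions. It assumes the extracted text has single spaces between tokens, exactly as in the printed output above; the find_amount helper is a hypothetical name, and real extraction output may need looser patterns.

```python
import re
from decimal import Decimal

# A few lines copied from the extracted text above.
text = """[Street Address] Date 6/5/2021
[City, State, ZIP Code] Invoice # 1741
[Phone] Due Date 6/5/2021
Subtotal $ 1,180.00
Discounts $ 177.00
Taxes $ 100.30
Total $ 1163.30"""


def find_amount(label: str, source: str) -> Decimal:
    """Return the dollar amount on the line that starts with `label`."""
    pattern = "^" + re.escape(label) + r" \$ ([\d,]+\.\d{2})$"
    match = re.search(pattern, source, re.MULTILINE)
    if match is None:
        raise ValueError(f"no amount found for {label!r}")
    # Strip thousands separators before converting.
    return Decimal(match.group(1).replace(",", ""))


invoice_number = int(re.search(r"Invoice # (\d+)", text).group(1))
due_date = re.search(r"Due Date (\d{1,2}/\d{1,2}/\d{4})", text).group(1)
total = find_amount("Total", text)

print(invoice_number, due_date, total)
```

Anchoring the pattern at the start of the line keeps "Total" from matching inside "Subtotal". Patterns like these are tied to the wording of the invoice, so they break if a label changes.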
Let's refine the extraction code and tell borb which Rectangle we are interested in.
For instance, let's extract the shipping information (but you can modify the code to retrieve any area of interest).
In order to allow borb to filter out a Rectangle we'll be using the LocationFilter class. This class implements EventListener. It gets notified of all Events when rendering the Page and passes those (to its children) that occur inside predefined bounds:
import typing
from decimal import Decimal
from borb.pdf.document import Document
from borb.pdf.pdf import PDF
from borb.toolkit.text.simple_text_extraction import SimpleTextExtraction

# New import
from borb.toolkit.location.location_filter import LocationFilter
from borb.pdf.canvas.geometry.rectangle import Rectangle


def main():
    d: typing.Optional[Document] = None

    # Define rectangle of interest
    # x, y, width, height
    r: Rectangle = Rectangle(Decimal(280),
                             Decimal(510),
                             Decimal(200),
                             Decimal(130))

    # Set up EventListener(s)
    l0: LocationFilter = LocationFilter(r)
    l1: SimpleTextExtraction = SimpleTextExtraction()
    l0.add_listener(l1)

    with open("output.pdf", "rb") as pdf_in_handle:
        d = PDF.loads(pdf_in_handle, [l0])
    assert d is not None
    print(l1.get_text_for_page(0))


if __name__ == "__main__":
    main()
Running this code, assuming that the right rectangle is chosen, prints:
SHIP TO
[Recipient Name]
[Company Name]
[Street Address]
[City, State, ZIP Code]
[Phone]
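For intuition, filtering by location boils down to a rectangle containment test. The sketch below is plain Python and not borb's implementation: Box and contains are hypothetical names, the glyph positions are made up, and borb's own test may differ (for example, it might check overlap rather than full containment).

```python
from decimal import Decimal
from typing import NamedTuple


class Box(NamedTuple):
    """Hypothetical rectangle: lower-left corner plus width and height."""
    x: Decimal
    y: Decimal
    width: Decimal
    height: Decimal


def contains(outer: Box, inner: Box) -> bool:
    """True if `inner` lies entirely within `outer` (edges inclusive)."""
    return (
        outer.x <= inner.x
        and outer.y <= inner.y
        and inner.x + inner.width <= outer.x + outer.width
        and inner.y + inner.height <= outer.y + outer.height
    )


# The region of interest used in the snippet above.
region = Box(Decimal(280), Decimal(510), Decimal(200), Decimal(130))

# Made-up text positions, for illustration only.
inside = Box(Decimal(300), Decimal(600), Decimal(48), Decimal(9))
outside = Box(Decimal(60), Decimal(600), Decimal(48), Decimal(9))

print(contains(region, inside), contains(region, outside))
```

Only events whose position passes a test of this kind would be forwarded to the child listeners.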
Hard-coding the Rectangle like this is not exactly the most flexible or future-proof approach. It takes some fiddling to find the right Rectangle, and there is no guarantee it will work if the layout of the invoice changes even slightly.
We'll have to build something more robust for this to be of practical use.
We can start by removing the hard-coded Rectangle. RegularExpressionTextExtraction can match a regular expression and return (among other things) its coordinates on the Page! Using pattern-matching, we can search for elements in a document automatically and retrieve them, instead of guessing where to draw a rectangle.
Let's use this class to find the words "SHIP TO", and build a Rectangle based on those coordinates:
import typing
from borb.pdf.document import Document
from borb.pdf.pdf import PDF
from borb.pdf.canvas.geometry.rectangle import Rectangle

# New imports
from borb.toolkit.text.regular_expression_text_extraction import RegularExpressionTextExtraction, PDFMatch


def main():
    d: typing.Optional[Document] = None

    # Set up EventListener
    l: RegularExpressionTextExtraction = RegularExpressionTextExtraction("SHIP TO")
    with open("output.pdf", "rb") as pdf_in_handle:
        d = PDF.loads(pdf_in_handle, [l])
    assert d is not None

    matches: typing.List[PDFMatch] = l.get_matches_for_page(0)
    assert len(matches) == 1

    r: Rectangle = matches[0].get_bounding_boxes()[0]
    print("%f %f %f %f" % (r.get_x(), r.get_y(), r.get_width(), r.get_height()))


if __name__ == "__main__":
    main()
Here, we've retrieved the Rectangle that bounds the matched text and printed its coordinates (x, y, width, height):
299.500000 621.000000 48.012000 8.616000
You will have noticed that get_bounding_boxes() returns typing.List[Rectangle]. A list is needed because a regular expression can match across multiple lines of text in the PDF, in which case each line gets its own bounding box.
Also, keep in mind the origin of a PDF (the [0, 0] point) is located in the bottom left corner. So the top of the Page has the highest Y-coordinate.
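To see what the bottom-left origin means in practice, here is a short sketch of the arithmetic for deriving a region of interest from the printed bounding box. The offsets (50 and 100) and the region size (200 by 130) are hand-picked for this invoice layout, and the variable names are only illustrative.

```python
from decimal import Decimal

# Bounding box of "SHIP TO" as printed above: x, y, and height.
match_x, match_y = Decimal("299.5"), Decimal("621")
match_height = Decimal("8.616")

# Region of interest, positioned relative to the match.
region_x = match_x - Decimal(50)
region_y = match_y - Decimal(100)  # lower edge: 100 points below the match
region_width, region_height = Decimal(200), Decimal(130)

region_top = region_y + region_height

# Because y grows upward, "below the heading" means smaller y values.
print(region_x, region_y, region_top)

# The heading's top edge still falls inside the region,
# so the "SHIP TO" text itself is captured along with the lines under it.
heading_top = match_y + match_height
print(region_y <= match_y and heading_top <= region_top)
```

Subtracting from y moves the region down the page, which is the opposite of what you'd expect from screen coordinates with a top-left origin.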
Now that we know where to find "SHIP TO", we can update our earlier code to place the Rectangle of interest around those words and the lines just underneath them:
import typing
from decimal import Decimal
from borb.pdf.document import Document
from borb.pdf.pdf import PDF
from borb.pdf.canvas.geometry.rectangle import Rectangle
from borb.toolkit.location.location_filter import LocationFilter
from borb.toolkit.text.regular_expression_text_extraction import RegularExpressionTextExtraction, PDFMatch
from borb.toolkit.text.simple_text_extraction import SimpleTextExtraction


def find_ship_to() -> Rectangle:
    d: typing.Optional[Document] = None

    # Set up EventListener
    l: RegularExpressionTextExtraction = RegularExpressionTextExtraction("SHIP TO")
    with open("output.pdf", "rb") as pdf_in_handle:
        d = PDF.loads(pdf_in_handle, [l])
    assert d is not None

    matches: typing.List[PDFMatch] = l.get_matches_for_page(0)
    assert len(matches) == 1
    return matches[0].get_bounding_boxes()[0]


def main():
    d: typing.Optional[Document] = None

    # Define rectangle of interest
    ship_to_rectangle: Rectangle = find_ship_to()
    r: Rectangle = Rectangle(ship_to_rectangle.get_x() - Decimal(50),
                             ship_to_rectangle.get_y() - Decimal(100),
                             Decimal(200),
                             Decimal(130))

    # Set up EventListener(s)
    l0: LocationFilter = LocationFilter(r)
    l1: SimpleTextExtraction = SimpleTextExtraction()
    l0.add_listener(l1)

    with open("output.pdf", "rb") as pdf_in_handle:
        d = PDF.loads(pdf_in_handle, [l0])
    assert d is not None
    print(l1.get_text_for_page(0))


if __name__ == "__main__":
    main()
And this code prints:
SHIP TO
[Recipient Name]
[Company Name]
[Street Address]
[City, State, ZIP Code]
[Phone]
This still requires some knowledge of the document, but isn't nearly as rigid as the previous approach. As long as you know which text to look for, you can get its coordinates and snatch the contents within a rectangle placed relative to it on the page.
Conclusion
In this guide we've taken a look at how to process an invoice in Python using borb. We've started by extracting all the text, and refined our process to extract only a region of interest. Finally, we matched a regular expression against a PDF to make the process even more robust and future-proof.