The Portable Document Format (PDF) is not a WYSIWYG (What You See is What You Get) format. It was developed to be platform-agnostic, independent of the underlying operating system and rendering engines.
To achieve this, a PDF is written in something closer to a programming language than to a picture of a page: it relies on a series of instructions and operations to produce a result. In fact, PDF is based on a scripting language - PostScript, which was the first device-independent Page Description Language.
It has operators that modify the graphics state, which, from a high level, look something like:
- Set the font to "Helvetica"
- Set the stroke color to black
- Go to (60,700)
- Draw the glyph "H"
This explains a few things:
- Why it's so hard to extract text from a PDF in an unambiguous way
- Why it's difficult to edit a PDF document
- Why most PDF libraries enforce a very low-level approach to content creation (you, the programmer, have to specify the coordinates at which to render text, the margins, etc.)
In this guide, we'll be using borb - a Python library dedicated to reading, manipulating and generating PDF documents, to create a PDF document. It offers both a low-level model (allowing you access to the exact coordinates and layout if you choose to use those) and a high-level model (where you can delegate the precise calculations of margins, positions, etc to a layout manager).
We'll take a look at how to create and inspect a PDF document in Python, using borb, as well as how to use some of the LayoutElements to add barcodes and tables.
Installing borb
borb can be downloaded from source on GitHub, or installed via pip:
$ pip install borb
Creating a PDF Document in Python with borb
borb has two intuitive key classes - Document and Page, which represent a document and the pages within it. These are the main framework for creating PDF documents.
Additionally, the PDF class represents an API for loading and saving the Documents we create.
With that in mind, let's create an empty PDF file:
from borb.pdf.document import Document
from borb.pdf.page.page import Page
from borb.pdf.pdf import PDF
# Create an empty Document
document = Document()
# Create an empty page
page = Page()
# Add the Page to the Document
document.append_page(page)
# Write the Document to a file
with open("output.pdf", "wb") as pdf_file_handle:
    PDF.dumps(pdf_file_handle, document)
Most of the code speaks for itself here. We start by creating an empty Document, then add an empty Page to the Document with the append_page() function, and finally store the file through PDF.dumps().
It's worth noting that we used the "wb" flag to write in binary mode, since a PDF is a binary file and we don't want Python to apply a text encoding to it.
This results in an empty PDF file, named output.pdf, on your local file system.
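If you'd like a quick sanity check without opening a viewer, you can look at the first and last bytes of the file. The sketch below assumes the usual PDF conventions (a %PDF- header at the start and an %%EOF marker at the end); looks_like_pdf is a hypothetical helper written for this guide, not part of borb, and the sample bytes are a stand-in rather than real borb output.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if the bytes start with a PDF header and end with an EOF marker."""
    has_header = data.startswith(b"%PDF-")
    # The %%EOF marker may be followed by trailing whitespace or newlines
    has_eof = data.rstrip().endswith(b"%%EOF")
    return has_header and has_eof


# Stand-in bytes; in practice you would read the generated file instead:
#   with open("output.pdf", "rb") as pdf_file_handle:
#       data = pdf_file_handle.read()
sample = b"%PDF-1.7\n...objects...\nstartxref\n2274\n%%EOF\n"

print(looks_like_pdf(sample))
print(looks_like_pdf(b"just some text"))
```

This only checks the outer markers, so it tells you the file was written as a PDF, not that its contents are valid.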
Creating a "Hello World" Document with borb
Of course, empty PDF documents don't really convey a lot of information. Let's add some content to the Page, before we add it to the Document instance.
In a similar vein to the two integral classes from before, to add content to the Page, we'll add a PageLayout which specifies the type of layout we'd like to see, and add one or more Paragraphs to that layout.
To this end, the Document is the lowest-level instance in the hierarchy of objects, while the Paragraph is the highest-level instance, stacked on top of the PageLayout and consequently, the Page.
Let's add a Paragraph to our Page:
from borb.pdf.document import Document
from borb.pdf.page.page import Page
from borb.pdf.pdf import PDF
from borb.pdf.canvas.layout.paragraph import Paragraph
from borb.pdf.canvas.layout.page_layout.multi_column_layout import SingleColumnLayout
from borb.io.read.types import Decimal
document = Document()
page = Page()
# Setting a layout manager on the Page
layout = SingleColumnLayout(page)
# Adding a Paragraph to the Page
layout.add(Paragraph("Hello World", font_size=Decimal(20), font="Helvetica"))
document.append_page(page)
with open("output.pdf", "wb") as pdf_file_handle:
    PDF.dumps(pdf_file_handle, document)
You'll notice we added 2 extra objects:
- An instance of PageLayout, made more concrete through its subclass SingleColumnLayout: this class keeps track of where content is being added to a Page, which area(s) are available for future content, what the Page margins are, and what the leading (the space between Paragraph objects) is supposed to be. Since we're only working with one column here, we're using a SingleColumnLayout. Alternatively, we can use the MultiColumnLayout.
- A Paragraph instance: this class represents a block of text. You can set properties such as the font, font_size, font_color, and many others. For more examples, you should check out the documentation.
This generates an output.pdf file that contains our Paragraph.
Inspecting the Generated PDF with borb
Note: This section is completely optional if you are not interested in the inner workings of a PDF document.
But it can be very useful to know a bit about the format (such as when you're debugging the classic "why does my content not show up on this page" issue).
Typically, a PDF reader will read the document starting at the last bytes:
xref
0 11
0000000000 00000 f
0000000015 00000 n
0000002169 00000 n
0000000048 00000 n
0000000105 00000 n
0000000258 00000 n
0000000413 00000 n
0000000445 00000 n
0000000475 00000 n
0000000653 00000 n
0000001938 00000 n
trailer
<</Root 1 0 R /Info 2 0 R /Size 11 /ID [<61e6d144af4b84e0e0aa52deab87cfe9><61e6d144af4b84e0e0aa52deab87cfe9>]>>
startxref
2274
%%EOF
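Here we see the end-of-file marker (%%EOF) and the cross-reference table (typically abbreviated to xref).
The xref section begins with the keyword "xref" and runs up to the keyword "trailer". The number after "startxref" (2274 here) is the byte offset at which that xref section begins, which is why a reader starts at the end of the file.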
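To make the "start at the end" idea concrete, here is a minimal sketch of how a reader might locate the xref offset. It assumes the classic layout shown above (a startxref keyword, a number, then %%EOF); find_startxref and its tail_size parameter are hypothetical names for this illustration, not borb functions.

```python
import re


def find_startxref(pdf_bytes: bytes, tail_size: int = 1024) -> int:
    """Return the byte offset written after the last 'startxref' keyword."""
    # Only the tail of the file needs to be inspected
    tail = pdf_bytes[-tail_size:]
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", tail)
    if not matches:
        raise ValueError("no startxref keyword found near the end of the data")
    # A file can contain several xref sections; the last one is the entry point
    return int(matches[-1])


# The last bytes of the example document, shortened for readability
tail = b"trailer\n<</Root 1 0 R /Info 2 0 R /Size 11>>\nstartxref\n2274\n%%EOF\n"
print(find_startxref(tail))  # 2274
```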
An xref (a document can have multiple) acts as a lookup table for the PDF reader.
It contains the byte offset (starting at the top of the file) of each object in a PDF. The first line of the xref (0 11) says there are 11 objects in this xref, and that the first object starts at number 0.
Each subsequent line consists of the byte offset, followed by the so-called generation number and the letter f or n:
- Objects marked with f are free objects; they are not expected to be rendered.
- Objects marked with n are "in use".
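The structure described above is regular enough to parse with a few lines of Python. The following is a sketch, assuming a single xref subsection in the plain-text form shown in the listing; XrefEntry and parse_xref_section are hypothetical names introduced for this guide.

```python
from typing import List, NamedTuple


class XrefEntry(NamedTuple):
    object_number: int
    offset: int
    generation: int
    in_use: bool


def parse_xref_section(text: str) -> List[XrefEntry]:
    """Parse an 'xref' section that contains a single subsection."""
    lines = [line.strip() for line in text.strip().splitlines()]
    if lines[0] != "xref":
        raise ValueError("expected the section to start with 'xref'")
    # Subsection header: number of the first object, then the entry count
    first_number, count = (int(part) for part in lines[1].split())
    entries = []
    for index, line in enumerate(lines[2:2 + count]):
        offset, generation, flag = line.split()
        entries.append(
            XrefEntry(first_number + index, int(offset), int(generation), flag == "n")
        )
    return entries


XREF = """xref
0 11
0000000000 00000 f
0000000015 00000 n
0000002169 00000 n
0000000048 00000 n
0000000105 00000 n
0000000258 00000 n
0000000413 00000 n
0000000445 00000 n
0000000475 00000 n
0000000653 00000 n
0000001938 00000 n"""

entries = parse_xref_section(XREF)
print(len(entries))        # 11
print(entries[1].offset)   # 15
print(entries[0].in_use)   # False
```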
At the bottom of the xref, we find the trailer dictionary. Dictionaries, in PDF syntax, are delimited by << and >>.
This dictionary has the following pairs:
/Root 1 0 R
/Info 2 0 R
/Size 11
/ID [<61e6d144af4b84e0e0aa52deab87cfe9> <61e6d144af4b84e0e0aa52deab87cfe9>]
The trailer dictionary is the starting point for the PDF reader and contains references to all other data.
In this case:
- /Root: this is another dictionary that links to the actual content of the document.
- /Info: this is a dictionary containing meta-information of the document (author, title, etc).
Strings like 1 0 R are called "references" in PDF syntax. And this is where the xref table comes in handy.
To find the object associated with 1 0 R we look at object 1 (generation number 0).
The xref lookup table tells us we can expect to find this object at byte 15 of the document.
If we check that out, we'll find:
1 0 obj
<</Pages 3 0 R>>
endobj
Notice how this object starts with 1 0 obj and ends with endobj. This is another confirmation that we are in fact dealing with object 1.
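Jumping to an object by its byte offset can be sketched in plain Python. The bytes below are a hypothetical stand-in for the start of a file (a 15-byte header followed by object 1), and read_object_at and find_references are illustrative helpers, not borb functions. Searching for the next endobj is a simplification that would break if that word appeared inside stream data.

```python
import re
from typing import List, Tuple


def read_object_at(pdf_bytes: bytes, offset: int) -> bytes:
    """Return the raw bytes of the object that starts at the given offset."""
    end = pdf_bytes.index(b"endobj", offset) + len(b"endobj")
    return pdf_bytes[offset:end]


def find_references(obj_bytes: bytes) -> List[Tuple[int, int]]:
    """Collect (object number, generation number) pairs written as 'N G R'."""
    return [(int(n), int(g)) for n, g in re.findall(rb"(\d+)\s+(\d+)\s+R\b", obj_bytes)]


# Hypothetical file start: a version line plus a binary comment line
header = b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n"
data = header + b"1 0 obj\n<</Pages 3 0 R>>\nendobj\n"

offset = len(header)
print(offset)                      # 15

obj = read_object_at(data, offset)
print(obj.startswith(b"1 0 obj"))  # True
print(find_references(obj))        # [(3, 0)]
```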
This dictionary tells us we can find the pages of the document in object 3:
3 0 obj
<</Count 1 /Kids [4 0 R]
/Type /Pages>>
endobj
This is the /Pages dictionary, and it tells us there is 1 page in this document (the /Count entry). The entry for /Kids is typically an array, with one object-reference per page.
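Following references from the root down to the pages is essentially a tree walk. As a rough sketch, we can model the objects seen so far as plain Python dictionaries; the Ref tuple, the objects table and collect_pages are assumptions made for this illustration and do not reflect how borb stores things internally.

```python
from typing import Dict, List, NamedTuple


class Ref(NamedTuple):
    number: int
    generation: int


# Hand-written model of objects 1, 3 and 4 from the example document
objects: Dict[Ref, dict] = {
    Ref(1, 0): {"Pages": Ref(3, 0)},
    Ref(3, 0): {"Type": "Pages", "Count": 1, "Kids": [Ref(4, 0)]},
    Ref(4, 0): {"Type": "Page", "MediaBox": [0, 0, 595, 842], "Parent": Ref(3, 0)},
}


def collect_pages(table: Dict[Ref, dict], node_ref: Ref) -> List[Ref]:
    """Walk the page tree depth-first and return references to the leaf pages."""
    node = table[node_ref]
    if node.get("Type") == "Page":
        return [node_ref]
    pages: List[Ref] = []
    # A /Pages node may contain pages or further /Pages nodes
    for kid in node.get("Kids", []):
        pages.extend(collect_pages(table, kid))
    return pages


root = objects[Ref(1, 0)]
pages = collect_pages(objects, root["Pages"])
print(pages)  # [Ref(number=4, generation=0)]
```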
We can expect to find the first page in object 4:
4 0 obj
<</Type /Page /MediaBox [0 0 595 842]
/Contents 5 0 R /Resources 6 0 R /Parent 3 0 R>>
endobj
This dictionary contains several interesting entries:
- /MediaBox: physical dimensions of the page (in this case an A4 sized page).
- /Contents: reference to a (typically compressed) stream of PDF content operators.
- /Resources: reference to a dictionary containing all the resources (fonts, images, etc) used for rendering this page.
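To see why a /MediaBox of [0 0 595 842] corresponds to A4, we can convert the values to millimetres. This small sketch assumes the default PDF unit of 1/72 inch; points_to_mm is a hypothetical helper name.

```python
POINTS_PER_INCH = 72
MM_PER_INCH = 25.4


def points_to_mm(points: float) -> float:
    """Convert default PDF units (1/72 inch) to millimetres."""
    return points / POINTS_PER_INCH * MM_PER_INCH


# [lower-left x, lower-left y, upper-right x, upper-right y]
media_box = [0, 0, 595, 842]
width_mm = points_to_mm(media_box[2] - media_box[0])
height_mm = points_to_mm(media_box[3] - media_box[1])

print(round(width_mm), round(height_mm))  # 210 297
```

A4 paper measures 210 by 297 mm, so the rounded values match.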
Let's check out object 5 to find what is actually being rendered on this page:
5 0 obj
<</Filter /FlateDecode /Length 85>>
stream
xÚãR@
\È<§®`a¥£šÔw3T0É
€!K¡š3Benl7'§9©99ù
åùE9)
!Y(®!8õÂyšT*î
endstream
endobj
As mentioned earlier, this (content) stream is compressed. You can tell which compression method was used by the /Filter entry. Here it is /FlateDecode, which is zlib (deflate) compression. If we decompress the stream of object 5, we should get the actual content operators:
5 0 obj
<</Filter /FlateDecode /Length 85>>
stream
q
BT
0.000000 0.000000 0.000000 rg
/F1 1.000000 Tf
20.000000 0 0 20.000000 60.000000 738.000000 Tm
(Hello world) Tj
ET
Q
endstream
endobj
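Since /FlateDecode is zlib compression, Python's built-in zlib module is enough to go from one form to the other. The compressed bytes printed above can't be copied reliably out of a web page, so the sketch below compresses the operators itself and then restores them; with a real file you would pass the bytes between stream and endstream to zlib.decompress() instead.

```python
import zlib

# The decompressed content operators from the listing above
content = (
    b"q\n"
    b"BT\n"
    b"0.000000 0.000000 0.000000 rg\n"
    b"/F1 1.000000 Tf\n"
    b"20.000000 0 0 20.000000 60.000000 738.000000 Tm\n"
    b"(Hello world) Tj\n"
    b"ET\n"
    b"Q\n"
)

# What a PDF writer does before storing the stream
compressed = zlib.compress(content)

# What a PDF reader does when it sees /Filter /FlateDecode
restored = zlib.decompress(compressed)

print(restored == content)  # True
print(restored.decode("ascii"))
```

The compressed data starts with the byte 0x78, which is the letter x in ASCII. That is consistent with the x at the start of the compressed listing earlier.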
Finally, we are at the level where we can decode the content. Each line consists of operands followed by their operator. Let's quickly go over the operators:
- q: preserves the current graphics state (pushing it onto a stack).
- BT: begin text.
- 0 0 0 rg: set the current fill (non-stroking) color to (0,0,0) RGB. This is black. The stroke color is set with the uppercase RG operator.
- /F1 1 Tf: set the current font to /F1 (this is an entry in the resources dictionary mentioned earlier) and the font size to 1.
- 20.000000 0 0 20.000000 60.000000 738.000000 Tm: set the text matrix. Text matrices warrant a guide of their own. Suffice it to say that this matrix regulates font size and text position. Here we are scaling the font to font-size 20, and setting the text-drawing cursor to 60,738. The PDF coordinate system starts at the bottom left of a page, so 60,738 is somewhere near the top left of the page (considering the page is 842 units tall).
- (Hello world) Tj: strings in PDF syntax are delimited by ( and ). This command tells the PDF reader to render the string "Hello world" at the position we indicated earlier with the text matrix, in the font, size and color we specified in the commands before that.
- ET: end text.
- Q: pop the graphics state from the stack (thus restoring the graphics state).
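The operators above can be replayed with a very small interpreter, which also shows why text extraction means tracking state. This is only a sketch: it assumes one instruction per line as in our listing (real content streams don't guarantee that), it handles just the operators we've seen, and for simplicity it saves the whole state dictionary on q. parse_instructions and trace_text are hypothetical names.

```python
from typing import List, Tuple

STREAM = """q
BT
0.000000 0.000000 0.000000 rg
/F1 1.000000 Tf
20.000000 0 0 20.000000 60.000000 738.000000 Tm
(Hello world) Tj
ET
Q"""


def parse_instructions(stream: str) -> List[Tuple[str, str]]:
    """Split each line into (operator, operands); the operator is the last token."""
    instructions = []
    for line in stream.strip().splitlines():
        parts = line.strip().rsplit(None, 1)
        operator = parts[-1]
        operands = parts[0] if len(parts) == 2 else ""
        instructions.append((operator, operands))
    return instructions


def trace_text(stream: str) -> List[dict]:
    """Replay the stream and record what each Tj operator would draw."""
    state = {"fill_rgb": None, "font": None, "font_size": None, "scale": 1.0, "position": None}
    stack = []  # saved states: q pushes, Q pops
    drawn = []
    for operator, operands in parse_instructions(stream):
        if operator == "q":
            stack.append(dict(state))
        elif operator == "Q":
            state = stack.pop()
        elif operator == "rg":
            state["fill_rgb"] = tuple(float(v) for v in operands.split())
        elif operator == "Tf":
            name, size = operands.split()
            state["font"], state["font_size"] = name, float(size)
        elif operator == "Tm":
            a, _b, _c, _d, e, f = (float(v) for v in operands.split())
            # a is the horizontal scale; (e, f) is the text position
            state["scale"], state["position"] = a, (e, f)
        elif operator == "Tj":
            drawn.append({
                "text": operands[1:-1],  # strip the ( and ) delimiters
                "font": state["font"],
                "size": (state["font_size"] or 0.0) * state["scale"],
                "position": state["position"],
                "fill_rgb": state["fill_rgb"],
            })
    return drawn


drawn = trace_text(STREAM)
print(drawn[0]["text"], drawn[0]["size"], drawn[0]["position"])
```

Note that the effective size of 20 only appears once the font size from Tf (1) is combined with the scale from Tm (20), which is exactly the kind of bookkeeping that makes unambiguous text extraction hard.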
Adding Other borb LayoutElements to Pages
borb comes with a wide variety of LayoutElement objects. In the previous example we briefly explored Paragraph. But there are also other elements such as UnorderedList, OrderedList, Image, Shape, Barcode and Table.
Let's create a slightly more challenging example, with a Table and Barcode. Tables consist of TableCells, which we add to the Table instance.
A Barcode can be one of many BarcodeTypes - we'll be using a QR code:
from borb.pdf.document import Document
from borb.pdf.page.page import Page
from borb.pdf.pdf import PDF
from borb.pdf.canvas.layout.paragraph import Paragraph
from borb.pdf.canvas.layout.page_layout.multi_column_layout import SingleColumnLayout
from borb.io.read.types import Decimal
from borb.pdf.canvas.layout.table import Table, TableCell
from borb.pdf.canvas.layout.barcode import Barcode, BarcodeType
from borb.pdf.canvas.color.color import X11Color
document = Document()
page = Page()
# Layout
layout = SingleColumnLayout(page)
# Create and add heading
layout.add(Paragraph("DefaultCorp Invoice", font="Helvetica", font_size=Decimal(20)))
# Create and add barcode
layout.add(Barcode(data="0123456789", type=BarcodeType.QR, width=Decimal(64), height=Decimal(64)))
# Create and add table
table = Table(number_of_rows=5, number_of_columns=4)
# Header row
table.add(TableCell(Paragraph("Item", font_color=X11Color("White")), background_color=X11Color("SlateGray")))
table.add(TableCell(Paragraph("Unit Price", font_color=X11Color("White")), background_color=X11Color("SlateGray")))
table.add(TableCell(Paragraph("Amount", font_color=X11Color("White")), background_color=X11Color("SlateGray")))
table.add(TableCell(Paragraph("Price", font_color=X11Color("White")), background_color=X11Color("SlateGray")))
# Data rows
for n in [("Lorem", 4.99, 1), ("Ipsum", 9.99, 2), ("Dolor", 1.99, 3), ("Sit", 1.99, 1)]:
    table.add(Paragraph(n[0]))
    table.add(Paragraph(str(n[1])))
    table.add(Paragraph(str(n[2])))
    # Format the line total with two decimals to avoid floating-point noise
    table.add(Paragraph(f"{n[1] * n[2]:.2f}"))
# Set padding
table.set_padding_on_all_cells(Decimal(5), Decimal(5), Decimal(5), Decimal(5))
layout.add(table)
# Append page
document.append_page(page)
# Persist PDF to file
with open("output4.pdf", "wb") as pdf_file_handle:
    PDF.dumps(pdf_file_handle, document)
Some implementation details:
- borb supports various color models, including: RGBColor, HexColor, X11Color and HSVColor.
- You can add LayoutElement objects directly to a Table object, but you can also wrap them with a TableCell object. This gives you some extra options, such as setting col_span and row_span or, in this case, background_color.
- If no font, font_size or font_color are specified, Paragraph will assume a default of Helvetica, size 12, black.
Conclusion
In this guide, we've taken a look at borb, a library for reading, writing and manipulating PDF files.
We've taken a look at the key classes such as Document and Page, as well as some of the elements such as Paragraph, Barcode and PageLayout. Finally, we've created a couple of PDF files with varying contents, as well as inspected how PDFs store data under the hood.