The Portable Document Format (PDF) is not a WYSIWYG (What You See is What You Get) format. It was developed to be platform-agnostic, independent of the underlying operating system and rendering engines.
To achieve this, PDF was constructed to be interacted with via something more like a programming language, and relies on a series of instructions and operations to achieve a result. In fact, PDF is based on a scripting language - PostScript, which was the first device-independent Page Description Language.
It has operators that modify graphics states, which, at a high level, look something like:
- Set the font to "Helvetica"
- Set the stroke color to black
- Go to (60,700)
- Draw the glyph "H"
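As a rough illustration, the four instructions above could be spelled out as PDF content operators. The sketch below is plain Python and does not use pText; the function name `content_stream` and the resource name `F1` (standing in for a Helvetica font resource) are assumptions made for this example.

```python
def content_stream(font: str, size: int, x: int, y: int, text: str) -> str:
    """Build a minimal text-drawing content stream (illustrative only)."""
    # Parentheses and backslashes must be escaped inside PDF literal strings.
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    operations = [
        "BT",                  # begin a text object
        f"/{font} {size} Tf",  # set the font resource and font size
        "0 0 0 RG",            # set the stroke color to black (RGB)
        f"{x} {y} Td",         # go to (x, y)
        f"({escaped}) Tj",     # draw the glyph(s)
        "ET",                  # end the text object
    ]
    return "\n".join(operations)


stream = content_stream("F1", 12, 60, 700, "H")
print(stream)
```

Each line is a list of operands followed by its operator, which is why the same page can be produced by many different instruction sequences.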
This explains a few things:
- Why it's so hard to extract text from a PDF in an unambiguous way
- Why it's difficult to edit a PDF document
- Why most PDF libraries enforce a very low-level approach to content creation (you, the programmer, have to specify the coordinates at which to render text, the margins, etc.)
In this guide, we'll be using pText - a Python library dedicated to reading, manipulating and generating PDF documents, to create a PDF document. It offers both a low-level model (allowing you access to the exact coordinates and layout if you choose to use those) and a high-level model (where you can delegate the precise calculations of margins, positions, etc to a layout manager).
We'll take a look at how to create and inspect a PDF document in Python, using pText, as well as how to use some of the LayoutElements to add barcodes and tables.
pText can be downloaded from source on GitHub, or installed via pip:
$ pip install ptext-joris-schellekens
Note: As of writing, version 1.8.6 doesn't install the external requirements by default, such as the python-barcode and qrcode libraries. If prompted with an error, please install them manually:
$ pip install qrcode python-barcode requests
Creating a PDF Document in Python with pText
pText has two intuitive key classes - Document and Page, which represent a document and the pages within it. These are the main framework for creating PDF documents. A third class, PDF, is what we use to save the Documents we create.
With that in mind, let's create an empty PDF file:
from ptext.pdf.document import Document
from ptext.pdf.page.page import Page
from ptext.pdf.pdf import PDF

# Create an empty Document
document = Document()

# Create an empty page
page = Page()

# Add the Page to the Document
document.append_page(page)

# Write the Document to a file
with open("output.pdf", "wb") as pdf_file_handle:
    PDF.dumps(pdf_file_handle, document)
Most of the code speaks for itself here. We start by creating an empty Document, then add an empty Page to the Document with the append_page() function, and finally store the file through PDF.dumps().
It's worth noting that we used the "wb" flag to write in binary mode, since a PDF is a sequence of bytes and we don't want Python to apply a text encoding to it.
This results in an empty PDF file, named output.pdf, on your local file system.
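A quick way to sanity-check a generated file is to look at its first and last bytes. The sketch below relies only on two facts about the format: a PDF starts with the %PDF- header and ends with an %%EOF marker. The helper name `looks_like_pdf` and the sample bytes are made up for illustration; this is not a full validity check.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Cheap plausibility check: PDF header at the start, %%EOF near the end."""
    has_header = data.startswith(b"%PDF-")
    has_eof_marker = b"%%EOF" in data[-1024:]
    return has_header and has_eof_marker


# Hypothetical stand-in for the bytes of a real file.
sample = b"%PDF-1.7\n(objects would go here)\nstartxref\n0\n%%EOF\n"
print(looks_like_pdf(sample))

# For a file on disk, read it in binary mode, just as it was written:
# with open("output.pdf", "rb") as pdf_file_handle:
#     print(looks_like_pdf(pdf_file_handle.read()))
```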
Creating a "Hello World" Document with pText
Of course, empty PDF documents don't really convey a lot of information. Let's add some content to the Page, before we add it to the Document.
In a similar vein to the two integral classes from before, to add content to the Page, we'll add a PageLayout which specifies the type of layout we'd like to see, and add one or more Paragraphs to that layout.
To this end, the Document is the lowest-level instance in the hierarchy of objects, while the Paragraph is the highest-level instance, stacked on top of the PageLayout and consequently, the Page.
Let's add a Paragraph to our Page:
from ptext.pdf.document import Document
from ptext.pdf.page.page import Page
from ptext.pdf.pdf import PDF
from ptext.pdf.canvas.layout.paragraph import Paragraph
from ptext.pdf.canvas.layout.page_layout import SingleColumnLayout
from ptext.io.read.types import Decimal

document = Document()
page = Page()

# Setting a layout manager on the Page
layout = SingleColumnLayout(page)

# Adding a Paragraph to the Page
layout.add(Paragraph("Hello World", font_size=Decimal(20), font="Helvetica"))

document.append_page(page)

with open("output.pdf", "wb") as pdf_file_handle:
    PDF.dumps(pdf_file_handle, document)
You'll notice we added 2 extra objects:
- An instance of PageLayout, made more concrete through its subclass SingleColumnLayout: this class keeps track of where content is being added to a Page, which area(s) are available for future content, what the Page margins are, and what the leading (the space between Paragraph objects) is supposed to be. Since we're only working with one column here, we're using a SingleColumnLayout. Alternatively, we can use a multi-column layout.
- A Paragraph instance: this class represents a block of text. You can set properties such as the font, font_size, font_color, and many others. For more examples, you should check out the documentation.
Running the script generates an output.pdf file that contains our Paragraph.
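To make the bookkeeping of a layout manager more concrete, here is a toy version written from scratch. It is not pText's SingleColumnLayout: the class name `ToySingleColumnLayout`, the 60-unit margin and the 6-unit leading are all assumptions chosen for the example. It only tracks a vertical cursor and hands out positions for blocks of a given height.

```python
from dataclasses import dataclass, field
from decimal import Decimal
from typing import Tuple


@dataclass
class ToySingleColumnLayout:
    """Hypothetical layout tracker; all default values are illustrative."""
    page_width: Decimal = Decimal(595)
    page_height: Decimal = Decimal(842)
    margin: Decimal = Decimal(60)
    leading: Decimal = Decimal(6)
    cursor_y: Decimal = field(init=False)

    def __post_init__(self) -> None:
        # PDF coordinates start at the bottom left, so content begins
        # just below the top margin and the cursor moves downwards.
        self.cursor_y = self.page_height - self.margin

    def place(self, block_height: Decimal) -> Tuple[Decimal, Decimal]:
        """Return the lower-left corner for a block and advance the cursor."""
        y = self.cursor_y - block_height
        if y < self.margin:
            raise ValueError("block does not fit on this page")
        self.cursor_y = y - self.leading
        return (self.margin, y)


layout = ToySingleColumnLayout()
first = layout.place(Decimal(20))
second = layout.place(Decimal(20))
print(first, second)
```

A real layout manager also has to measure text, wrap lines and handle page breaks, which is exactly the work the high-level model takes off your hands.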
Inspecting the Generated PDF with pText
Note: This section is completely optional if you are not interested in the inner workings of a PDF document.
But it can be very useful to know a bit about the format (such as when you're debugging the classic "why does my content not show up on this page" issue).
Typically, a PDF reader will read the document starting at the last bytes:
xref
0 11
0000000000 00000 f
0000000015 00000 n
0000002169 00000 n
0000000048 00000 n
0000000105 00000 n
0000000258 00000 n
0000000413 00000 n
0000000445 00000 n
0000000475 00000 n
0000000653 00000 n
0000001938 00000 n
trailer
<</Root 1 0 R /Info 2 0 R /Size 11 /ID [<61e6d144af4b84e0e0aa52deab87cfe9><61e6d144af4b84e0e0aa52deab87cfe9>]>>
startxref
2274
%%EOF
Here we see the end-of-file marker (%%EOF) and the cross-reference table (typically abbreviated to xref). The table begins at the xref keyword and runs up to the trailer keyword. An xref (a document can have multiple) acts as a lookup table for the PDF reader.
It contains the byte offset (counted from the top of the file) of each object in a PDF. The first line of the xref (0 11) says there are 11 objects in this xref, and that the numbering starts at object 0.
Each subsequent line consists of the byte offset, followed by the so-called generation number and the letter f or n:
- Objects marked with f are free objects; they are not expected to be rendered.
- Objects marked with n are "in use".
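The structure described above is simple enough to parse by hand. The following sketch reads the xref section shown earlier into a dictionary keyed by object number. The names `XrefEntry` and `parse_xref` are made up for this example, and the parser assumes a single subsection, as in this document.

```python
from typing import Dict, NamedTuple


class XrefEntry(NamedTuple):
    offset: int
    generation: int
    in_use: bool


XREF = """xref
0 11
0000000000 00000 f
0000000015 00000 n
0000002169 00000 n
0000000048 00000 n
0000000105 00000 n
0000000258 00000 n
0000000413 00000 n
0000000445 00000 n
0000000475 00000 n
0000000653 00000 n
0000001938 00000 n
"""


def parse_xref(section: str) -> Dict[int, XrefEntry]:
    """Map object numbers to their xref entries (single subsection only)."""
    lines = section.strip().splitlines()
    if lines[0].strip() != "xref":
        raise ValueError("not an xref section")
    # The header line holds the first object number and the entry count.
    first, count = (int(part) for part in lines[1].split())
    table = {}
    for index, line in enumerate(lines[2:2 + count]):
        offset, generation, flag = line.split()
        table[first + index] = XrefEntry(int(offset), int(generation), flag == "n")
    return table


table = parse_xref(XREF)
print(table[1])
```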
At the bottom of the xref, we find the trailer dictionary. Dictionaries, in PDF syntax, are delimited by << and >>.
This dictionary has the following pairs:
/Root 1 0 R
/Info 2 0 R
/ID [<61e6d144af4b84e0e0aa52deab87cfe9> <61e6d144af4b84e0e0aa52deab87cfe9>]
The trailer dictionary is the starting point for the PDF reader and contains references to all other data.
In this case:
/Root: this is another dictionary that links to the actual content of the document.
/Info: this is a dictionary containing meta-information of the document (author, title, etc).
Entries such as 1 0 R are called "references" in PDF syntax. And this is where the xref table comes in handy.
To find the object associated with 1 0 R we look at object 1 (generation number 0). The xref lookup table tells us we can expect to find this object at byte 15 of the document.
If we check that out, we'll find:
1 0 obj <</Pages 3 0 R>> endobj
Notice how this object starts with 1 0 obj and ends with endobj. This is another confirmation that we are in fact dealing with object 1.
This dictionary tells us we can find the pages of the document in object 3:
3 0 obj <</Count 1 /Kids [4 0 R] /Type /Pages>> endobj
This is the /Pages dictionary, and it tells us there is 1 page in this document (the /Count entry). The entry for /Kids is typically an array, with one object-reference per page.
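Following references such as 1 0 R and 3 0 R boils down to a lookup plus a slice. The sketch below does this on a tiny hand-built byte string rather than on the real output.pdf. The two-line header is an assumption (the real file's header is not shown in this guide), the offsets are found by searching instead of reading an xref table, and `parse_reference`, `read_object` and `resolve` are hypothetical helper names.

```python
import re
from typing import Dict, Tuple

# Hand-built sample: an assumed header followed by objects 1 and 3.
data = (
    b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n"
    b"1 0 obj\n<</Pages 3 0 R>>\nendobj\n"
    b"3 0 obj\n<</Count 1 /Kids [4 0 R] /Type /Pages>>\nendobj\n"
)

# Stand-in for the xref table: object number -> byte offset.
offsets = {1: data.index(b"1 0 obj"), 3: data.index(b"3 0 obj")}


def parse_reference(reference: str) -> Tuple[int, int]:
    """Split a reference like '1 0 R' into (object number, generation)."""
    match = re.fullmatch(r"(\d+) (\d+) R", reference.strip())
    if match is None:
        raise ValueError(f"not a reference: {reference!r}")
    return int(match.group(1)), int(match.group(2))


def read_object(raw: bytes, offset: int) -> bytes:
    """Return the bytes from the given offset up to and including endobj."""
    end = raw.index(b"endobj", offset) + len(b"endobj")
    return raw[offset:end]


def resolve(raw: bytes, table: Dict[int, int], reference: str) -> bytes:
    object_number, _generation = parse_reference(reference)
    return read_object(raw, table[object_number])


print(offsets[1])
print(resolve(data, offsets, "3 0 R").decode("ascii"))
```

With this assumed header, object 1 happens to land at byte 15, the same offset the xref table reports for the real file.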
We can expect to find the first page in object 4:
4 0 obj <</Type /Page /MediaBox [0 0 595 842] /Contents 5 0 R /Resources 6 0 R /Parent 3 0 R>> endobj
This dictionary contains several interesting entries:
/MediaBox: physical dimensions of the page (in this case an A4 sized page).
/Contents: reference to a (typically compressed) stream of PDF content operators.
/Resources: reference to a dictionary containing all the resources (fonts, images, etc) used for rendering this page.
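The /MediaBox values are expressed in PDF user-space units, which by default are 1/72 of an inch. A small conversion shows why [0 0 595 842] corresponds to A4 (210 mm by 297 mm); the helper name `points_to_mm` is made up for this example.

```python
POINTS_PER_INCH = 72   # default PDF user-space unit is 1/72 inch
MM_PER_INCH = 25.4


def points_to_mm(points: float) -> float:
    """Convert default PDF user-space units to millimetres."""
    return points / POINTS_PER_INCH * MM_PER_INCH


media_box = [0, 0, 595, 842]
width_mm = points_to_mm(media_box[2] - media_box[0])
height_mm = points_to_mm(media_box[3] - media_box[1])
print(round(width_mm, 1), round(height_mm, 1))  # roughly 209.9 and 297.0
```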
Let's check out object 5 to find what is actually being rendered on this page:
5 0 obj
<</Filter /FlateDecode /Length 85>>
stream
(85 bytes of compressed binary data, not readable as text)
endstream
endobj
As mentioned earlier, this (content) stream is compressed. You can tell which compression method was used by the /Filter entry. If we decompress object 5 (FlateDecode is zlib/deflate compression, so this is a zlib inflate rather than a literal unzip), we should get the actual content operators:
5 0 obj
<</Filter /FlateDecode /Length 85>>
stream
q
BT
0.000000 0.000000 0.000000 rg
/F1 1.000000 Tf
20.000000 0 0 20.000000 60.000000 738.000000 Tm
(Hello world) Tj
ET
Q
endstream
endobj
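Python's standard zlib module implements the same compression that /FlateDecode refers to. Since the original compressed bytes cannot be reproduced reliably in text, the sketch below compresses the operators itself and then inflates them again, which is the step a PDF reader performs on the stream data.

```python
import zlib

operators = (
    b"q\n"
    b"BT\n"
    b"0.000000 0.000000 0.000000 rg\n"
    b"/F1 1.000000 Tf\n"
    b"20.000000 0 0 20.000000 60.000000 738.000000 Tm\n"
    b"(Hello world) Tj\n"
    b"ET\n"
    b"Q"
)

# What a writer stores between "stream" and "endstream".
compressed = zlib.compress(operators)

# What a reader does when it sees /Filter /FlateDecode.
restored = zlib.decompress(compressed)

print(len(operators), len(compressed))
print(restored.decode("ascii"))
```

The compressed length will not necessarily equal the 85 bytes reported in the dictionary above, since that depends on the exact bytes and compression settings the library used.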
Finally, we are at the level where we can decode the content. Each line consists of arguments followed by their operator. Let's quickly go over the operators:
q: preserves the current graphic state (pushing it to a stack).
BT: begin text.
0 0 0 rg: set the current non-stroking (fill) color to (0,0,0) rgb. This is black. (The uppercase RG operator sets the stroke color.)
/F1 1 Tf: set the current font to /F1 (this is an entry in the resources dictionary mentioned earlier) and the font size to 1.
20.000000 0 0 20.000000 60.000000 738.000000 Tm: set the text-matrix. Text matrices warrant a guide of their own. Suffice to say that this matrix regulates font size and text position. Here we are scaling the font to font-size 20, and setting the text-drawing cursor to 60,738. The PDF coordinate system starts at the bottom left of a page. So 60,738 is somewhere near the top left of the page (considering the page is 842 units tall).
(Hello world) Tj: strings in PDF syntax are delimited by ( and ). This command tells the PDF reader to render the string "Hello world" at the position we indicated earlier with the text-matrix, in the font, size and color we specified in the commands before that.
ET: end text.
Q: pop the graphics state from the stack (thus restoring the graphics state).
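The "arguments first, operator last" structure can be sketched with a small tokenizer. This is a simplified illustration and not how pText parses content: the names `TOKEN` and `parse_operations` are made up, and nested parentheses, hex strings, arrays and dictionaries are deliberately not handled.

```python
import re
from typing import List, Tuple

STREAM = """q
BT
0.000000 0.000000 0.000000 rg
/F1 1.000000 Tf
20.000000 0 0 20.000000 60.000000 738.000000 Tm
(Hello world) Tj
ET
Q"""

# A literal string in parentheses, or a run of non-delimiter characters.
TOKEN = re.compile(r"\((?:\\.|[^\\()])*\)|/?[^\s()]+")
NUMBER = re.compile(r"[+-]?\d*\.?\d+")


def parse_operations(stream: str) -> List[Tuple[List[str], str]]:
    """Group tokens into (operands, operator) pairs."""
    operations = []
    operands = []
    for token in TOKEN.findall(stream):
        # Strings, names (/F1) and numbers are operands; the rest are operators.
        if token[0] in "(/" or NUMBER.fullmatch(token):
            operands.append(token)
        else:
            operations.append((operands, token))
            operands = []
    return operations


operations = parse_operations(STREAM)
for operands, operator in operations:
    print(operator, operands)
```

The last two operands of Tm give the text position, so the distance from the top edge of this 842-unit page is 842 - 738 = 104 units.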
Adding Other pText LayoutElements to Pages
pText comes with a wide variety of LayoutElement objects. In the previous example we briefly explored Paragraph. But there are also other elements, such as Table and Barcode.
Let's create a slightly more challenging example, with a Table and a Barcode. Tables consist of TableCells, which we add to the Table. A Barcode can be one of many BarcodeTypes - we'll be using a QR code:
from ptext.pdf.document import Document
from ptext.pdf.page.page import Page
from ptext.pdf.pdf import PDF
from ptext.pdf.canvas.layout.paragraph import Paragraph
from ptext.pdf.canvas.layout.page_layout import SingleColumnLayout
from ptext.io.read.types import Decimal
from ptext.pdf.canvas.layout.table import Table, TableCell
from ptext.pdf.canvas.layout.barcode import Barcode, BarcodeType
from ptext.pdf.canvas.color.color import X11Color

document = Document()
page = Page()

# Layout
layout = SingleColumnLayout(page)

# Create and add heading
layout.add(Paragraph("DefaultCorp Invoice", font="Helvetica", font_size=Decimal(20)))

# Create and add barcode
layout.add(Barcode(data="0123456789", type=BarcodeType.QR, width=Decimal(64), height=Decimal(64)))

# Create and add table
table = Table(number_of_rows=5, number_of_columns=4)

# Header row
table.add(TableCell(Paragraph("Item", font_color=X11Color("White")), background_color=X11Color("SlateGray")))
table.add(TableCell(Paragraph("Unit Price", font_color=X11Color("White")), background_color=X11Color("SlateGray")))
table.add(TableCell(Paragraph("Amount", font_color=X11Color("White")), background_color=X11Color("SlateGray")))
table.add(TableCell(Paragraph("Price", font_color=X11Color("White")), background_color=X11Color("SlateGray")))

# Data rows: each tuple is (item name, unit price, amount)
for n in [("Lorem", 4.99, 1), ("Ipsum", 9.99, 2), ("Dolor", 1.99, 3), ("Sit", 1.99, 1)]:
    table.add(Paragraph(n[0]))
    table.add(Paragraph(str(n[1])))
    table.add(Paragraph(str(n[2])))
    # Price = unit price * amount
    table.add(Paragraph(str(n[1] * n[2])))

# Set padding
table.set_padding_on_all_cells(Decimal(5), Decimal(5), Decimal(5), Decimal(5))
layout.add(table)

# Append page
document.append_page(page)

# Persist PDF to file
with open("output4.pdf", "wb") as pdf_file_handle:
    PDF.dumps(pdf_file_handle, document)
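The Price column in the example is computed from the unit price and the amount before being turned into text. Multiplying binary floats can occasionally produce long decimal tails, so for invoice-style data it may be worth preparing the cell strings with Python's standard decimal module first. The sketch below is independent of pText; the helper name `invoice_rows` and the choice to write prices as strings are assumptions for this example.

```python
from decimal import Decimal
from typing import List, Tuple

# Same items as in the example above, with prices written as strings
# so that Decimal sees the exact values.
items = [("Lorem", "4.99", 1), ("Ipsum", "9.99", 2), ("Dolor", "1.99", 3), ("Sit", "1.99", 1)]


def invoice_rows(entries: List[Tuple[str, str, int]]) -> List[Tuple[str, str, str, str]]:
    """Return (item, unit price, amount, price) as ready-to-render strings."""
    rows = []
    for name, unit_price, amount in entries:
        price = Decimal(unit_price) * amount
        rows.append((name, unit_price, str(amount), str(price)))
    return rows


rows = invoice_rows(items)
total = sum(Decimal(row[3]) for row in rows)

for row in rows:
    print(row)
print("Total:", total)
```

Each string in a row could then be wrapped in a Paragraph and added to the Table exactly as in the loop above.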
Some implementation details:
- pText supports various color models, including the X11 named colors (X11Color) used here.
- You can add LayoutElement objects directly to a Table object, but you can also wrap them in a TableCell object. This gives you some extra options, such as setting row_span or, in this case, background_color.
- If no font or font_size is specified, Paragraph falls back to default values.
This results in a PDF file, output4.pdf, containing the heading, the QR code and the table.
In this guide, we've taken a look at pText, a library for reading, writing and manipulating PDF files.
We've taken a look at the key classes such as Document and Page, as well as some of the elements such as Paragraph, Barcode and PageLayout. Finally, we've created a couple of PDF files with varying contents, as well as inspected how PDFs store data under the hood.