Creating a PDF Document in Python with pText

The Portable Document Format (PDF) is not a WYSIWYG (What You See is What You Get) format. It was developed to be platform-agnostic, independent of the underlying operating system and rendering engines.

To achieve this, PDF was constructed to be interacted with via something more like a programming language, and relies on a series of instructions and operations to achieve a result. In fact, PDF is based on a scripting language - PostScript, which was the first device-independent Page Description Language.

PDF has operators that modify the graphics state, which, at a high level, look something like this:

  • Set the font to "Helvetica"
  • Set the stroke color to black
  • Go to (60,700)
  • Draw the glyph "H"

This explains a few things:

  • Why it's so hard to extract text from a PDF in an unambiguous way
  • Why it's difficult to edit a PDF document
  • Why most PDF libraries enforce a very low-level approach to content creation (you, the programmer, have to specify the coordinates at which to render text, the margins, etc.)

In this guide, we'll be using pText - a Python library dedicated to reading, manipulating and generating PDF documents, to create a PDF document. It offers both a low-level model (allowing you access to the exact coordinates and layout if you choose to use those) and a high-level model (where you can delegate the precise calculations of margins, positions, etc to a layout manager).

We'll take a look at how to create and inspect a PDF document in Python, using pText, as well as how to use some of the LayoutElements to add barcodes and tables.

Installing pText

pText can be downloaded from source on GitHub, or installed via pip:

$ pip install ptext-joris-schellekens

Note: As of writing, version 1.8.6 doesn't install its external requirements, such as the python-barcode and qrcode libraries, by default. If you run into an import error, install them manually:

$ pip install qrcode python-barcode requests

Creating a PDF Document in Python with pText

pText has two intuitive key classes - Document and Page, which represent a document and the pages within it. These are the main framework for creating PDF documents.

Additionally, the PDF class represents an API for loading and saving the Documents we create.

With that in mind, let's create an empty PDF file:

from ptext.pdf.document import Document
from ptext.pdf.page.page import Page
from ptext.pdf.pdf import PDF

# Create an empty Document
document = Document()

# Create an empty page
page = Page()

# Add the Page to the Document
document.append_page(page)

# Write the Document to a file
with open("output.pdf", "wb") as pdf_file_handle:
    PDF.dumps(pdf_file_handle, document)

Most of the code speaks for itself here. We start by creating an empty Document, then add an empty Page to the Document with the append_page() function, and finally store the file through PDF.dumps().

It's worth noting that we used the "wb" flag to write in binary mode, since a PDF file is binary data and we don't want Python to apply any text encoding to it.
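To see why the mode matters, here is a small standard-library sketch that does not involve pText at all. It writes a few placeholder bytes (a made-up payload, not a valid PDF) to a temporary file, once in binary mode and once in text mode. The helper name try_write is hypothetical and only exists for this illustration.

```python
import os
import tempfile

# A few bytes standing in for serialized PDF data (placeholder, not a valid PDF).
payload = b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n"


def try_write(mode: str) -> str:
    """Write the payload to a temp file opened with `mode` and report the outcome."""
    fd, path = tempfile.mkstemp(suffix=".pdf")
    os.close(fd)
    try:
        with open(path, mode) as handle:
            handle.write(payload)
        return "ok"
    except TypeError as error:
        # Text-mode files only accept str, so writing bytes fails here.
        return f"TypeError: {error}"
    finally:
        os.remove(path)


print(try_write("wb"))  # binary mode accepts bytes
print(try_write("w"))   # text mode rejects bytes
```

A file opened with "w" expects str and would also apply a text encoding and newline translation, neither of which is wanted for PDF data.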

Running the script produces an empty PDF file named output.pdf on your local file system:

[Image: a blank PDF document created in Python]

Creating a "Hello World" Document with pText

Of course, empty PDF documents don't really convey a lot of information. Let's add some content to the Page before we add it to the Document instance.

In a similar vein to the two integral classes from before, to add content to the Page, we'll add a PageLayout which specifies the type of layout we'd like to see, and add one or more Paragraphs to that layout.

To this end, the Document is the lowest-level instance in the hierarchy of objects, while the Paragraph is the highest-level instance, stacked on top of the PageLayout and consequently, the Page.

Let's add a Paragraph to our Page:

from ptext.pdf.document import Document
from ptext.pdf.page.page import Page
from ptext.pdf.pdf import PDF
from ptext.pdf.canvas.layout.paragraph import Paragraph
from ptext.pdf.canvas.layout.page_layout import SingleColumnLayout
from decimal import Decimal

document = Document()
page = Page()

# Setting a layout manager on the Page
layout = SingleColumnLayout(page)

# Adding a Paragraph to the Page
layout.add(Paragraph("Hello World", font_size=Decimal(20), font="Helvetica"))

# Add the Page to the Document
document.append_page(page)


with open("output.pdf", "wb") as pdf_file_handle:
    PDF.dumps(pdf_file_handle, document)

You'll notice we added 2 extra objects:

  • An instance of PageLayout, made more concrete through its subclass SingleColumnLayout: this class keeps track of where content is being added to a Page, which area(s) are available for future content, what the Page margins are, and what the leading (the space between Paragraph objects) is supposed to be.

Since we're only working with one column here, we're using a SingleColumnLayout. Alternatively, we can use the MultiColumnLayout.

  • A Paragraph instance: this class represents a block of text. You can set properties such as the font, font_size, font_color, and many others. For more examples, you should check out the documentation.

This generates an output.pdf file that contains our Paragraph:

[Image: a "Hello World" PDF created in Python]

Inspecting the Generated PDF with pText

Note: This section is completely optional if you are not interested in the inner workings of a PDF document.

But it can be very useful to know a bit about the format (such as when you're debugging the classic "why does my content not show up on this page" issue).

Typically, a PDF reader will read the document starting at the last bytes:

0 11
0000000000 00000 f
0000000015 00000 n
0000002169 00000 n
0000000048 00000 n
0000000105 00000 n
0000000258 00000 n
0000000413 00000 n
0000000445 00000 n
0000000475 00000 n
0000000653 00000 n
0000001938 00000 n
<</Root 1 0 R /Info 2 0 R /Size 11 /ID [<61e6d144af4b84e0e0aa52deab87cfe9><61e6d144af4b84e0e0aa52deab87cfe9>]>>

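This is the cross-reference table (typically abbreviated to xref), which sits just before the end-of-file marker (%%EOF).

The xref section starts with the keyword "xref", and the "startxref" keyword near the end of the file records the byte offset at which that section begins.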
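As a rough sketch of how a reader might locate the xref from the end of a file, the snippet below scans the tail of a byte string for the startxref keyword and reads the offset that follows it. The sample bytes are hand-made for this illustration (they are not the output.pdf from above), and find_xref_offset is a hypothetical helper, not part of pText.

```python
def find_xref_offset(pdf_bytes: bytes, tail_size: int = 1024) -> int:
    """Return the byte offset stored after the last 'startxref' keyword."""
    tail = pdf_bytes[-tail_size:]
    marker = b"startxref"
    position = tail.rfind(marker)
    if position == -1:
        raise ValueError("no startxref keyword found near the end of the file")
    # The first whitespace-separated token after the keyword is the offset.
    return int(tail[position + len(marker):].split()[0])


# A tiny hand-made byte string with the same overall shape as a PDF file.
header = b"%PDF-1.7\n"
sample = (
    header
    + b"xref\n0 1\n0000000000 65535 f \n"
    + b"trailer\n<</Size 1>>\n"
    + b"startxref\n" + str(len(header)).encode("ascii") + b"\n%%EOF\n"
)

offset = find_xref_offset(sample)
print(offset, sample[offset:offset + 4])
```

The bytes found at the returned offset are the xref keyword itself, which is where parsing of the table would begin.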

An xref (a document can have multiple) acts as a lookup table for the PDF reader.

It contains the byte offset (starting at the top of the file) of each object in a PDF. The first line of the xref (0 11) says there are 11 objects in this xref, and that the first object starts at number 0.

Each subsequent line consists of the byte offset, followed by the so-called generation number and the letter f or n:

  • Objects marked with f are free objects; they are not in use and are not expected to be rendered.
  • Objects marked with n are "in use".
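The layout described above can be parsed with a few lines of Python. This is a minimal sketch that only handles a single subsection like the one in this document; XrefEntry and parse_xref_subsection are names made up for the example.

```python
from typing import List, NamedTuple


class XrefEntry(NamedTuple):
    object_number: int
    byte_offset: int
    generation: int
    in_use: bool


def parse_xref_subsection(text: str) -> List[XrefEntry]:
    """Parse one xref subsection: a 'first count' line followed by the entries."""
    lines = [line.strip() for line in text.strip().splitlines()]
    first_object, count = (int(token) for token in lines[0].split())
    entries = []
    for index, line in enumerate(lines[1:1 + count]):
        offset, generation, flag = line.split()
        entries.append(
            XrefEntry(first_object + index, int(offset), int(generation), flag == "n")
        )
    return entries


# The xref lines from the dump shown earlier in this section.
XREF = """
0 11
0000000000 00000 f
0000000015 00000 n
0000002169 00000 n
0000000048 00000 n
0000000105 00000 n
0000000258 00000 n
0000000413 00000 n
0000000445 00000 n
0000000475 00000 n
0000000653 00000 n
0000001938 00000 n
"""

entries = parse_xref_subsection(XREF)
for entry in entries:
    print(entry)
```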

At the bottom of the xref, we find the trailer dictionary. Dictionaries, in PDF syntax, are delimited by << and >>.

This dictionary has the following pairs:

  • /Root 1 0 R
  • /Info 2 0 R
  • /Size 11
  • /ID [<61e6d144af4b84e0e0aa52deab87cfe9> <61e6d144af4b84e0e0aa52deab87cfe9>]

The trailer dictionary is the starting point for the PDF reader and contains references to all other data.

In this case:

  • /Root : this is another dictionary that links to the actual content of the document.
  • /Info : this is a dictionary containing meta-information of the document (author, title, etc).

Tokens like 1 0 R are called "references" in PDF syntax: an object number, a generation number and the keyword R. This is where the xref table comes in handy.

To find the object associated with 1 0 R we look at object 1 (generation number 0).

The xref lookup table tells us we can expect to find this object at byte 15 of the document.
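The lookup can be sketched in Python as follows. The offsets are copied from the in-use entries of the xref table shown earlier; the resolve function and the XREF_OFFSETS name are hypothetical and only serve this illustration.

```python
import re

# (object number, generation) -> byte offset, copied from the in-use
# entries of the xref table shown earlier in this section.
XREF_OFFSETS = {
    (1, 0): 15, (2, 0): 2169, (3, 0): 48, (4, 0): 105, (5, 0): 258,
    (6, 0): 413, (7, 0): 445, (8, 0): 475, (9, 0): 653, (10, 0): 1938,
}

REFERENCE = re.compile(r"(\d+)\s+(\d+)\s+R")


def resolve(reference: str) -> int:
    """Translate a reference such as '1 0 R' into a byte offset."""
    match = REFERENCE.fullmatch(reference.strip())
    if match is None:
        raise ValueError(f"not a reference: {reference!r}")
    key = (int(match.group(1)), int(match.group(2)))
    return XREF_OFFSETS[key]


print(resolve("1 0 R"))  # offset of the /Root object
print(resolve("2 0 R"))  # offset of the /Info object
```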

Looking at byte 15 of the file, we find:

1 0 obj
<</Pages 3 0 R>>
endobj

Notice how this object starts with 1 0 obj and ends with endobj. This is another confirmation that we are in fact dealing with object 1.

This dictionary tells us we can find the pages of the document in object 3:

3 0 obj
<</Count 1 /Kids [4 0 R]
 /Type /Pages>>

This is the /Pages dictionary, and it tells us there is 1 page in this document (the /Count entry). The entry for /Kids is typically an array, with one object-reference per page.
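To make the traversal concrete, here is a small sketch that models objects 1, 3 and 4 as plain Python dictionaries and walks from the root to the page list. The ("ref", n) tuples are a stand-in for references like n 0 R, object 4 is reduced to its /Type entry, and all names are invented for the example. The walk is recursive because a /Kids array may contain further /Pages nodes as well as pages.

```python
# Hand-written model of objects 1, 3 and 4; ("ref", n) stands for "n 0 R".
OBJECTS = {
    1: {"Pages": ("ref", 3)},
    3: {"Type": "Pages", "Count": 1, "Kids": [("ref", 4)]},
    4: {"Type": "Page"},
}


def deref(value):
    """Follow a ("ref", n) tuple to the object it points at."""
    if isinstance(value, tuple) and value[0] == "ref":
        return OBJECTS[value[1]]
    return value


def collect_pages(node):
    """Depth-first walk: /Pages nodes have /Kids, /Page nodes are leaves."""
    node = deref(node)
    if node["Type"] == "Page":
        return [node]
    pages = []
    for kid in node["Kids"]:
        pages.extend(collect_pages(kid))
    return pages


root = OBJECTS[1]
pages = collect_pages(root["Pages"])
print(len(pages))
```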

We can expect to find the first page in object 4:

4 0 obj
<</Type /Page /MediaBox [0 0 595 842]
 /Contents 5 0 R /Resources 6 0 R /Parent 3 0 R>>

This dictionary contains several interesting entries:

  • /MediaBox: physical dimensions of the page (in this case an A4 sized page).
  • /Contents: reference to a (typically compressed) stream of PDF content operators.
  • /Resources: reference to a dictionary containing all the resources (fonts, images, etc) used for rendering this page.
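As a quick check of the A4 claim, the /MediaBox values can be converted to millimetres. This sketch assumes the default PDF user unit of 1/72 inch; media_box_size_mm is a hypothetical helper name.

```python
POINTS_PER_INCH = 72   # default PDF user unit: 1/72 inch
MM_PER_INCH = 25.4


def media_box_size_mm(media_box):
    """Convert [llx, lly, urx, ury] in points to (width, height) in millimetres."""
    lower_left_x, lower_left_y, upper_right_x, upper_right_y = media_box
    width_pt = upper_right_x - lower_left_x
    height_pt = upper_right_y - lower_left_y
    return (
        width_pt / POINTS_PER_INCH * MM_PER_INCH,
        height_pt / POINTS_PER_INCH * MM_PER_INCH,
    )


width_mm, height_mm = media_box_size_mm([0, 0, 595, 842])
print(f"{width_mm:.1f} mm x {height_mm:.1f} mm")  # 209.9 mm x 297.0 mm
```

A4 is 210 mm by 297 mm, so 595 by 842 points is A4 rounded to whole points.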

Let's check out object 5 to find what is actually being rendered on this page:

5 0 obj
<</Filter /FlateDecode /Length 85>>
(85 bytes of compressed binary data, not printable as text)

As mentioned earlier, this (content) stream is compressed. You can tell which compression method was used by the /Filter entry; /FlateDecode means zlib/deflate compression. If we decompress the stream of object 5, we get the actual content operators:

5 0 obj
<</Filter /FlateDecode /Length 85>>
    q
    BT
    0.000000 0.000000 0.000000 rg
    /F1 1.000000 Tf
    20.000000 0 0 20.000000 60.000000 738.000000 Tm
    (Hello world) Tj
    ET
    Q
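The decompression step itself can be reproduced with Python's zlib module, since /FlateDecode data is in the zlib/deflate format. The sketch below does a round trip on a shortened copy of the operators rather than reading the bytes out of output.pdf, so the byte counts it prints will not match the /Length of 85 above.

```python
import zlib

# A shortened copy of the content operators, used as stand-in stream data.
operators = (
    b"q\n"
    b"BT\n"
    b"0 0 0 rg\n"
    b"/F1 1 Tf\n"
    b"20 0 0 20 60 738 Tm\n"
    b"(Hello world) Tj\n"
    b"ET\n"
    b"Q\n"
)

# What a PDF writer does before storing the stream ...
compressed = zlib.compress(operators, 9)

# ... and what a PDF reader does when it meets /Filter /FlateDecode.
restored = zlib.decompress(compressed)

print(len(operators), "->", len(compressed), "bytes")
print(restored.decode("ascii"))
```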

Finally, we are at the level where we can decode the content. Each line consists of arguments followed by their operator. Let's quickly go over the operators:

  • q: preserves the current graphic state (pushing it to a stack).
  • BT: begin text.
  • 0 0 0 rg: set the current non-stroking (fill) color to (0,0,0) in RGB. This is black.
  • /F1 1 Tf: set the current font to /F1 (this is an entry in the resources dictionary mentioned earlier) and the font size to 1.
  • 20.000000 0 0 20.000000 60.000000 738.000000 Tm: set the text matrix. Text matrices warrant a guide of their own; suffice it to say that this matrix controls both font size and text position. Here we are scaling the font to size 20 and setting the text-drawing cursor to (60, 738). The PDF coordinate system starts at the bottom left of a page, so (60, 738) is near the top left of the page (which is 842 units tall).
  • (Hello world) Tj: strings in PDF syntax are delimited by ( and ). This command tells the PDF reader to render the string "Hello world" at the position we indicated earlier with the text matrix, in the font, size and color we specified in the commands before that.
  • ET: end text.
  • Q: pop the graphics state from the stack (thus restoring the graphics state).
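The "arguments followed by their operator" structure lends itself to a very small parser. The sketch below is a simplification for this particular stream: it does not handle nested or escaped parentheses, arrays, or inline images, and the names parse_operations and is_operand are made up for the example. It also reads the position out of the Tm operands.

```python
import re

CONTENT = """
q
BT
0.000000 0.000000 0.000000 rg
/F1 1.000000 Tf
20.000000 0 0 20.000000 60.000000 738.000000 Tm
(Hello world) Tj
ET
Q
"""

# A token is either a (string) without nested parentheses or a run of non-space characters.
TOKEN = re.compile(r"\([^)]*\)|\S+")
NUMBER = re.compile(r"[+-]?(\d+\.?\d*|\.\d+)")


def is_operand(token: str) -> bool:
    """Strings, /Names and numbers are operands; anything else is an operator."""
    return token.startswith(("(", "/")) or NUMBER.fullmatch(token) is not None


def parse_operations(content: str):
    """Group tokens into (operator, operands) pairs, in stream order."""
    operations, operands = [], []
    for token in TOKEN.findall(content):
        if is_operand(token):
            operands.append(token)
        else:
            operations.append((token, operands))
            operands = []
    return operations


operations = parse_operations(CONTENT)
for operator, operands in operations:
    print(operator, operands)

# The six Tm operands are a b c d e f; e and f are the x and y position.
tm = [float(value) for operator, operands in operations
      if operator == "Tm" for value in operands]
page_height = 842  # from the /MediaBox of the page
distance_from_top = page_height - tm[5]
print(tm[4], tm[5], distance_from_top)
```

With a baseline at y = 738 on a page 842 units tall, the text sits 104 units below the top edge.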

Adding Other pText LayoutElements to Pages

pText comes with a wide variety of LayoutElement objects. In the previous example we briefly explored Paragraph, but there are also other elements such as UnorderedList, OrderedList, Image, Shape, Barcode and Table.

Let's create a slightly more challenging example, with a Table and Barcode. Tables consist of TableCells, which we add to the Table instance.

A Barcode can be one of many BarcodeTypes - we'll be using a QR code:

from ptext.pdf.document import Document
from ptext.pdf.page.page import Page
from ptext.pdf.pdf import PDF
from ptext.pdf.canvas.layout.paragraph import Paragraph
from ptext.pdf.canvas.layout.page_layout import SingleColumnLayout
from decimal import Decimal
from ptext.pdf.canvas.layout.table import Table, TableCell
from ptext.pdf.canvas.layout.barcode import Barcode, BarcodeType
from ptext.pdf.canvas.color.color import X11Color

document = Document()
page = Page()

# Layout
layout = SingleColumnLayout(page)

# Create and add heading
layout.add(Paragraph("DefaultCorp Invoice", font="Helvetica", font_size=Decimal(20)))

# Create and add barcode
layout.add(Barcode(data="0123456789", type=BarcodeType.QR, width=Decimal(64), height=Decimal(64)))

# Create and add table
table = Table(number_of_rows=5, number_of_columns=4)

# Header row
table.add(TableCell(Paragraph("Item", font_color=X11Color("White")), background_color=X11Color("SlateGray")))
table.add(TableCell(Paragraph("Unit Price", font_color=X11Color("White")), background_color=X11Color("SlateGray")))
table.add(TableCell(Paragraph("Amount", font_color=X11Color("White")), background_color=X11Color("SlateGray")))
table.add(TableCell(Paragraph("Price", font_color=X11Color("White")), background_color=X11Color("SlateGray")))

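# Data rows: item name, unit price, amount and line total
for n in [("Lorem", 4.99, 1), ("Ipsum", 9.99, 2), ("Dolor", 1.99, 3), ("Sit", 1.99, 1)]:
    table.add(Paragraph(n[0]))
    table.add(Paragraph(str(n[1])))
    table.add(Paragraph(str(n[2])))
    table.add(Paragraph(str(n[1] * n[2])))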

# Set padding
table.set_padding_on_all_cells(Decimal(5), Decimal(5), Decimal(5), Decimal(5))

# Add the table to the layout
layout.add(table)

# Append page
document.append_page(page)

# Persist PDF to file
with open("output4.pdf", "wb") as pdf_file_handle:
    PDF.dumps(pdf_file_handle, document)

Some implementation details:

  • pText supports various color models, including: RGBColor, HexColor, X11Color and HSVColor.
  • You can add LayoutElement objects directly to a Table object, but you can also wrap them in a TableCell object. This gives you some extra options, such as setting col_span and row_span or, in this case, background_color.
  • If no font, font_size or font_color are specified, Paragraph will assume a default of Helvetica, size 12, black.

This results in:

[Image: a corporate invoice PDF created in Python]


Conclusion

In this guide, we've taken a look at pText, a library for reading, writing and manipulating PDF files.

We've taken a look at the key classes such as Document and Page, as well as some of the elements such as Paragraph, Barcode and PageLayout. Finally, we've created a couple of PDF files with varying contents, as well as inspected how PDFs store data under the hood.

Ghent, Belgium
I'm a software architect from Belgium, with a passion for machine learning, knowledge-based systems and graph algorithms. I'm also the author of pText, the pure python PDF library.