Today, a world without the Portable Document Format (PDF) seems unthinkable. It has become one of the most commonly used data formats ever. Up to PDF version 1.4, displaying a PDF document in a suitable PDF viewer works fine. Unfortunately, the features from the newer PDF revisions, such as forms, are tricky to implement, and still require further work to be fully functional in the tools. Using various Python libraries you can create your own application comparatively easily.
This article is part two of a short series on PDFs with Python. Part one gave an introduction to reading PDF documents using Python, starting with a summary of the various Python libraries. It then showed how to manipulate existing PDFs, and how to read and extract their content, both text and images. Furthermore, it showed how to split documents into their individual pages.
In this article you will learn how to add images to your PDF in the form of watermarks, stamps, and barcodes. This is helpful, for example, to stamp or mark documents that are intended for a specific audience only or that are still of draft quality, or simply to add a barcode for identification purposes.
Adding a Watermark via Command Line with pdftk
In order to add a watermark to an existing PDF on a Unix/Linux command line we can use pdftk. The name abbreviates "PDF Toolkit", and the tool describes itself as "a simple tool for doing everyday things with PDF documents".
pdftk has been ported to Java, and is available as a package for Debian GNU/Linux.
For this to work you need a background PDF that contains the word "DRAFT" on a transparent layer, which you can apply to an existing single-page PDF as follows:
$ pdftk input.pdf background background.pdf output output.pdf
The pdftk tool takes in the PDF file input.pdf, merges it with background.pdf, and outputs the result to the file output.pdf. Figure 1 shows the output of this action.
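If you want to trigger the same pdftk call from a Python script, a minimal sketch could look like the following. The helper names build_pdftk_command() and run_pdftk() are hypothetical and only serve as illustration. Besides background, the sketch also accepts the operations multibackground, stamp, and multistamp as described in the pdftk documentation, so it is worth checking them against your installed version.

```python
import subprocess

# Operations that overlay a second PDF, as described in the pdftk documentation
OVERLAY_OPERATIONS = {"background", "multibackground", "stamp", "multistamp"}


def build_pdftk_command(input_file, overlay_file, output_file, operation="background"):
    """Return the pdftk argument list for merging overlay_file into input_file."""
    if operation not in OVERLAY_OPERATIONS:
        raise ValueError("unsupported pdftk operation: " + operation)
    return ["pdftk", input_file, operation, overlay_file, "output", output_file]


def run_pdftk(input_file, overlay_file, output_file, operation="background"):
    """Run pdftk and raise CalledProcessError if it exits with an error."""
    command = build_pdftk_command(input_file, overlay_file, output_file, operation)
    subprocess.run(command, check=True)
```

Calling run_pdftk("input.pdf", "background.pdf", "output.pdf") is then equivalent to the command line shown above, provided that pdftk is installed and found in your PATH.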
For more complex actions, like stamping a document with different stamps per page, have a look at the description at the PDF Labs project page. We also show the stamping use case later in this article, although our example uses the library pdfrw instead of pdftk.
Adding a Watermark with PyPDF2
The PyPDF2 library provides a method called mergePage() that accepts another PDF page to be used as a watermark or stamp.
In the example below we start by reading the first page of the original PDF document and of the watermark. To read the files we use the PdfFileReader() class. As a second step we merge the two pages using the mergePage() method. Finally, we write the result to the output file. This is done in three steps: creating an object based on the PdfFileWriter() class, adding the merged page to this object using the addPage() method, and writing the new content to the output file using the write() method.
#!/usr/bin/python
# Adding a watermark to a single-page PDF

import PyPDF2

input_file = "example.pdf"
output_file = "example-drafted.pdf"
watermark_file = "draft.pdf"

with open(input_file, "rb") as filehandle_input:
    # read content of the original file
    pdf = PyPDF2.PdfFileReader(filehandle_input)

    with open(watermark_file, "rb") as filehandle_watermark:
        # read content of the watermark
        watermark = PyPDF2.PdfFileReader(filehandle_watermark)

        # get first page of the original PDF
        first_page = pdf.getPage(0)

        # get first page of the watermark PDF
        first_page_watermark = watermark.getPage(0)

        # merge the two pages
        first_page.mergePage(first_page_watermark)

        # create a pdf writer object for the output file
        pdf_writer = PyPDF2.PdfFileWriter()

        # add page
        pdf_writer.addPage(first_page)

        with open(output_file, "wb") as filehandle_output:
            # write the watermarked file to the new file
            pdf_writer.write(filehandle_output)
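The listing above handles a single page only. As a rough sketch, the same idea can be extended to every page of a document. The function names watermark_all_pages() and write_watermarked_copy() are hypothetical, and the code assumes the same legacy PyPDF2 API as above (PdfFileReader, getNumPages(), getPage(), mergePage(), addPage()), which newer releases of the library have renamed.

```python
def watermark_all_pages(pdf, watermark_page, pdf_writer):
    """Merge watermark_page onto every page of pdf and add it to pdf_writer."""
    for page_number in range(pdf.getNumPages()):
        page = pdf.getPage(page_number)
        page.mergePage(watermark_page)
        pdf_writer.addPage(page)
    return pdf_writer


def write_watermarked_copy(input_file, watermark_file, output_file):
    """Write a copy of input_file with the watermark on all pages."""
    # imported here so that the helper above can be used on its own
    import PyPDF2

    with open(input_file, "rb") as filehandle_input, \
            open(watermark_file, "rb") as filehandle_watermark:
        pdf = PyPDF2.PdfFileReader(filehandle_input)
        watermark_page = PyPDF2.PdfFileReader(filehandle_watermark).getPage(0)
        pdf_writer = watermark_all_pages(pdf, watermark_page, PyPDF2.PdfFileWriter())

        # write while the input files are still open
        with open(output_file, "wb") as filehandle_output:
            pdf_writer.write(filehandle_output)
```

The output is written inside the with block on purpose, so that the input files are still open while the pages are being written.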
Adding an Image with PyMuPDF
PyMuPDF is the Python binding for MuPDF, which is a lightweight PDF and XPS viewer. In your Python script the module that needs to be imported is named fitz, and this name goes back to Fitz, the rendering library at the core of MuPDF.
For this section we are going to show how to add an image by using a barcode as an example, since this is a pretty common task. The same steps can be applied to adding any kind of image to a PDF.
In order to decorate a PDF document with a barcode we simply add an image as another PDF layer at the desired position. As for image formats, PyMuPDF accepts PNG or JPEG, but not SVG.
The position of the image is defined as a rectangle by creating a fitz.Rect() object, which requires two pairs of coordinates: (x0,y0) for the upper-left corner and (x1,y1) for the lower-right corner. PyMuPDF interprets the upper-left corner of the page as (0,0).
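To illustrate how such a rectangle can be derived rather than hard-coded, here is a small sketch that computes the coordinates for an image placed in the upper-right corner of a page. The function upper_right_rectangle() and its parameters are hypothetical, and the page width of 595 points (roughly A4) is an assumption for this example.

```python
def upper_right_rectangle(page_width, image_width, image_height,
                          margin_right, margin_top):
    """Return (x0, y0, x1, y1) with the origin in the upper-left page corner."""
    x0 = page_width - margin_right - image_width
    y0 = margin_top
    return (x0, y0, x0 + image_width, y0 + image_height)


# a 100 x 100 point image on a page that is 595 points wide
coordinates = upper_right_rectangle(595, 100, 100, margin_right=45, margin_top=20)
print(coordinates)  # (450, 20, 550, 120)
```

The resulting tuple can be unpacked into the rectangle object, as in fitz.Rect(*coordinates). In a real script the page width could be taken from the rect.width attribute of the page instead of a fixed number.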
After opening the input file and extracting the first page from it, we add the image containing the barcode using the method insertImage(). This method requires two parameters: the position, delivered via image_rectangle, and the name of the image file to be inserted. Using the save() method the modified PDF is stored to disk. Figure 2 shows the barcode after it was added to the example PDF.
#!/usr/bin/python

import fitz

input_file = "example.pdf"
output_file = "example-with-barcode.pdf"
barcode_file = "barcode.png"

# define the position (upper-right corner)
image_rectangle = fitz.Rect(450,20,550,120)

# retrieve the first page of the PDF
file_handle = fitz.open(input_file)
first_page = file_handle[0]

# add the image
first_page.insertImage(image_rectangle, filename=barcode_file)

file_handle.save(output_file)
Adding Stamps with pdfrw
pdfrw is a pure Python-based PDF parser to read and write PDF documents. It faithfully reproduces vector formats without rasterization. For Debian GNU/Linux, the package repository contains releases for both Python 2 and 3.
The following example demonstrates how to add a barcode or watermark to an existing PDF that contains multiple pages. From the pdfrw package it is sufficient to import the three classes PdfReader, PdfWriter, and PageMerge. Next, you create the corresponding reader and writer objects to access the contents of both the PDF and the watermark. For each page in the original document you then create a PageMerge object to which you add the watermark, and which is rendered using the render() method. Finally, you write the modified pages to the output file. Figure 3 shows the modified document next to the code that made the addition possible.
#!/usr/bin/python
# Adding a watermark to a multi-page PDF

from pdfrw import PdfReader, PdfWriter, PageMerge

input_file = "example.pdf"
output_file = "example-drafted.pdf"
watermark_file = "barcode.pdf"

# define the reader and writer objects
reader_input = PdfReader(input_file)
writer_output = PdfWriter()
watermark_input = PdfReader(watermark_file)

# use the first page of the watermark PDF
watermark = watermark_input.pages[0]

# go through the pages one after the next
for current_page in range(len(reader_input.pages)):
    merger = PageMerge(reader_input.pages[current_page])
    merger.add(watermark).render()

# write the modified content to disk
writer_output.write(output_file, reader_input)
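The use case of different stamps per page mentioned earlier could be sketched with pdfrw along the following lines. The functions pick_stamp() and stamp_document() are hypothetical. The sketch assumes a stamp PDF whose pages correspond to the pages of the input document, and as a design choice it reuses the last stamp page once the stamp PDF runs out of pages.

```python
def pick_stamp(page_index, stamps):
    """Return the stamp for the given page, repeating the last one if needed."""
    return stamps[min(page_index, len(stamps) - 1)]


def stamp_document(input_file, stamp_file, output_file):
    """Stamp each page of input_file with the matching page of stamp_file."""
    # imported here so that pick_stamp() can be used without pdfrw
    from pdfrw import PdfReader, PdfWriter, PageMerge

    reader_input = PdfReader(input_file)
    stamps = PdfReader(stamp_file).pages

    # go through the pages and merge the matching stamp into each of them
    for page_index, page in enumerate(reader_input.pages):
        PageMerge(page).add(pick_stamp(page_index, stamps)).render()

    # write the modified content to disk
    PdfWriter().write(output_file, reader_input)
```

With a three-page stamps.pdf and a five-page example.pdf, pages four and five would receive the third stamp.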
Adding images, watermarks, or stamps to a PDF file is quite simple. With a few lines of code this complex-sounding task is solved within minutes, and it works well no matter which of the libraries shown here you choose.
Part three of this series will exclusively focus on writing/creating PDFs, and will also include both deleting and re-combining single pages into a new document.