This article is the second in a series on working with PDFs in Python:
- Reading and Splitting Pages
- Adding Images and Watermarks (you are here)
- Inserting, Deleting, and Reordering Pages
Introduction
Today, a world without the Portable Document Format (PDF) seems unthinkable. It has become one of the most commonly used document formats. Documents that use features up to PDF version 1.4 display fine in most PDF viewers. Unfortunately, features from newer PDF revisions, such as forms, are tricky to implement and still require further work to be fully functional in the tools. Using various Python libraries, you can build your own PDF application with comparatively little effort.
This article is part two of a short series on PDFs with Python. Part one introduced reading PDF documents with Python, starting with a summary of the available Python libraries. It then showed how to manipulate existing PDFs and how to read and extract their content, both text and images. It also showed how to split a document into its individual pages.
In this article you will learn how to add images to a PDF in the form of watermarks, stamps, and barcodes. This is helpful, for example, to mark documents that are intended for a specific audience only or that are still drafts, or simply to add a barcode for identification purposes.
Adding a Watermark via Command Line with pdftk
To add a watermark to an existing PDF on a Unix/Linux command line we can use pdftk. The name abbreviates "PDF Toolkit", and the tool describes itself as "a simple tool for doing everyday things with PDF documents". pdftk has been ported to Java and is available as a package for Debian GNU/Linux.
For this to work you need a background PDF, here background.pdf, that contains the word "DRAFT" on a transparent layer. You can apply it to an existing single-page PDF as follows:
$ pdftk input.pdf background background.pdf output output.pdf
The pdftk tool takes the PDF file input.pdf, merges it with background.pdf, and writes the result to the file output.pdf. Figure 1 shows the output of this action.
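If the watermarking step is part of a larger Python workflow, the same pdftk call can be issued from a script. The following is a minimal sketch using only the standard library. The helper names build_pdftk_command and run_pdftk are made up for illustration, and the sketch assumes that pdftk is installed and available on the PATH.

```python
import shutil
import subprocess


def build_pdftk_command(input_file, overlay_file, output_file, operation="background"):
    """Return the pdftk argument list for a background or stamp operation."""
    if operation not in ("background", "stamp"):
        raise ValueError("unsupported operation: " + operation)
    return ["pdftk", input_file, operation, overlay_file, "output", output_file]


def run_pdftk(input_file, overlay_file, output_file, operation="background"):
    """Run pdftk and raise if the tool is missing or exits with an error."""
    if shutil.which("pdftk") is None:
        raise FileNotFoundError("pdftk was not found on the PATH")
    command = build_pdftk_command(input_file, overlay_file, output_file, operation)
    # check=True turns a non-zero exit status into a CalledProcessError
    subprocess.run(command, check=True)


if __name__ == "__main__":
    # Prints the command line used in the pdftk example above.
    print(" ".join(build_pdftk_command("input.pdf", "background.pdf", "output.pdf")))
```

Passing the arguments as a list rather than as one shell string avoids quoting problems with file names that contain spaces.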
For more complex actions, such as stamping a document with a different stamp on each page, have a look at the description on the PDF Labs project page. We also show the stamping use case later in this article, although our example uses the library pdfrw instead of pdftk.
Adding a Watermark with PyPDF2
The PyPDF2 library provides a method called mergePage() that accepts another PDF page to be used as a watermark or stamp.
In the example below we start by reading the first page of the original PDF document and the first page of the watermark. To read the files we use the PdfFileReader() class. As a second step we merge the two pages using the mergePage() method. Finally, we write the result to the output file. This is done in three steps: creating an object based on the PdfFileWriter() class, adding the merged page to this object using the addPage() method, and writing the new content to the output file using the write() method.
#!/usr/bin/python
# Adding a watermark to a single-page PDF
import PyPDF2
input_file = "example.pdf"
output_file = "example-drafted.pdf"
watermark_file = "draft.pdf"
with open(input_file, "rb") as filehandle_input:
    # read content of the original file
    pdf = PyPDF2.PdfFileReader(filehandle_input)
    with open(watermark_file, "rb") as filehandle_watermark:
        # read content of the watermark
        watermark = PyPDF2.PdfFileReader(filehandle_watermark)
        # get first page of the original PDF
        first_page = pdf.getPage(0)
        # get first page of the watermark PDF
        first_page_watermark = watermark.getPage(0)
        # merge the two pages
        first_page.mergePage(first_page_watermark)
        # create a pdf writer object for the output file
        pdf_writer = PyPDF2.PdfFileWriter()
        # add page
        pdf_writer.addPage(first_page)
        with open(output_file, "wb") as filehandle_output:
            # write the watermarked file to the new file
            pdf_writer.write(filehandle_output)
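The example above handles a single page. A multi-page document can be treated the same way by looping over its pages. The following is a minimal sketch, assuming a PyPDF2 release that still provides the camelCase methods used in this article (getNumPages(), getPage(), mergePage(), addPage()). The helper names watermark_all_pages and watermark_file_pages are made up for illustration.

```python
def watermark_all_pages(reader, watermark_page, writer):
    """Merge watermark_page onto every page of reader and add it to writer."""
    for index in range(reader.getNumPages()):
        page = reader.getPage(index)
        page.mergePage(watermark_page)
        writer.addPage(page)
    return writer


def watermark_file_pages(input_file, watermark_file, output_file):
    """Watermark every page of input_file and store the result in output_file."""
    import PyPDF2  # imported here so the helper above has no hard dependency

    with open(input_file, "rb") as fh_input, open(watermark_file, "rb") as fh_mark:
        reader = PyPDF2.PdfFileReader(fh_input)
        watermark_page = PyPDF2.PdfFileReader(fh_mark).getPage(0)
        writer = watermark_all_pages(reader, watermark_page, PyPDF2.PdfFileWriter())
        # write while the input files are still open, as in the example above
        with open(output_file, "wb") as fh_output:
            writer.write(fh_output)
```

A call such as watermark_file_pages("example.pdf", "draft.pdf", "example-drafted.pdf") would then apply the first page of draft.pdf to every page of example.pdf.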
Adding an Image with PyMuPDF
PyMuPDF provides the Python bindings for MuPDF, a lightweight PDF and XPS viewer. In your Python script the module to import is named fitz, a name that goes back to the previous name of PyMuPDF.
In this section we show how to add an image, using a barcode as an example since this is a pretty common task. The same steps apply to adding any kind of image to a PDF.
In order to decorate a PDF document with a barcode we simply add an image as another PDF layer at the desired position. As for image formats, PyMuPDF accepts PNG or JPEG, but not SVG.
The position of the image is defined as a rectangle using fitz.Rect(), which requires two pairs of coordinates: (x1,y1) for the upper-left corner and (x2,y2) for the lower-right corner of the rectangle. PyMuPDF interprets the upper-left corner of the page as (0,0).
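The rectangle coordinates can also be derived from the page size instead of being hard-coded. The following is a minimal sketch in plain Python. The helper name corner_rectangle, the margin parameter, and the A4 page size of 595 x 842 points are assumptions made for illustration and are not part of PyMuPDF.

```python
def corner_rectangle(page_width, page_height, image_width, image_height,
                     corner="upper-right", margin=20):
    """Return (x1, y1, x2, y2) for an image placed in a page corner.

    Coordinates follow the convention described above: (0, 0) is the
    upper-left corner of the page and y grows downwards.
    """
    vertical, horizontal = corner.split("-")
    if vertical not in ("upper", "lower") or horizontal not in ("left", "right"):
        raise ValueError("unknown corner: " + corner)
    x1 = margin if horizontal == "left" else page_width - margin - image_width
    y1 = margin if vertical == "upper" else page_height - margin - image_height
    return (x1, y1, x1 + image_width, y1 + image_height)


if __name__ == "__main__":
    print(corner_rectangle(595, 842, 100, 100))                       # (475, 20, 575, 120)
    print(corner_rectangle(595, 842, 100, 100, corner="lower-left"))  # (20, 722, 120, 822)
```

The resulting tuple can be unpacked into the rectangle constructor, for example fitz.Rect(*corner_rectangle(595, 842, 100, 100)).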
After opening the input file and extracting its first page, the image containing the barcode is added using the insertImage() method. This method requires two parameters: the position, passed as image_rectangle, and the name of the image file to be inserted. The save() method then stores the modified PDF to disk. Figure 2 shows the barcode after it was added to the example PDF.
#!/usr/bin/python
import fitz
input_file = "example.pdf"
output_file = "example-with-barcode.pdf"
barcode_file = "barcode.png"
# define the position (upper-right corner)
image_rectangle = fitz.Rect(450, 20, 550, 120)
# retrieve the first page of the PDF
file_handle = fitz.open(input_file)
first_page = file_handle[0]
# add the image (the keyword argument is spelled "filename")
first_page.insertImage(image_rectangle, filename=barcode_file)
file_handle.save(output_file)
Adding Stamps with pdfrw
pdfrw is a pure Python-based PDF parser to read and write PDF documents. It faithfully reproduces vector formats without rasterization. For Debian GNU/Linux, the package repository contains releases for both Python 2 and 3.
The following example demonstrates how to add a barcode or watermark to an existing PDF that contains multiple pages. From the pdfrw package it is sufficient to import the three classes PdfReader, PdfWriter, and PageMerge. Next, you create the reader and writer objects to access the contents of both the PDF and the watermark. For each page in the original document you create a PageMerge object, add the watermark to it, and render the result using the render() method. Finally, you write the modified pages to the output file. Figure 3 shows the modified document next to the code that made the addition possible.
#!/usr/bin/python
# Adding a watermark to a multi-page PDF
from pdfrw import PdfReader, PdfWriter, PageMerge
input_file = "example.pdf"
output_file = "example-drafted.pdf"
watermark_file = "barcode.pdf"
# define the reader and writer objects
reader_input = PdfReader(input_file)
writer_output = PdfWriter()
watermark_input = PdfReader(watermark_file)
watermark = watermark_input.pages[0]
# go through the pages one after the next
for current_page in range(len(reader_input.pages)):
    merger = PageMerge(reader_input.pages[current_page])
    merger.add(watermark).render()
# write the modified content to disk
writer_output.write(output_file, reader_input)
Conclusion
Adding images, watermarks, or stamps to a PDF file is quite simple. With a few lines of code this complex-sounding task is solved within minutes. Each of the libraries shown here, PyPDF2, PyMuPDF, and pdfrw, handles it well.
Part three of this series will focus on writing and creating PDFs, and will also cover deleting pages and recombining single pages into a new document.