Split, Merge and Rotate PDF Documents in Python with borb

Introduction

The Portable Document Format (PDF) is not a WYSIWYG (What You See is What You Get) format. It was developed to be platform-agnostic, independent of the underlying operating system and rendering engines.

To achieve this, PDF was constructed to be interacted with via something more like a programming language, and relies on a series of instructions and operations to achieve a result. In fact, PDF is based on a scripting language - PostScript, which was the first device-independent Page Description Language.

In this guide, we'll be using borb - a Python library dedicated to reading, manipulating and generating PDF documents. It offers both a low-level model (giving you access to the exact coordinates and layout if you choose to use those) and a high-level model (where you can delegate the precise calculations of margins, positions, etc. to a layout manager).

Specifically, we'll take a look at how to split and merge PDF documents in Python using borb, and at how to rotate pages in a PDF document.

Splitting and merging PDF documents are the basis for many use-cases:

  • Processing an invoice (you don't need the terms and conditions so you can remove those pages)
  • Adding a cover letter to documents (a test report, an invoice, promotional material)
  • Aggregating test-results from heterogeneous sources
  • Etc.

Installing borb

borb can be downloaded from source on GitHub, or installed via pip:

$ pip install borb
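
If you'd like to confirm that the package is visible to your interpreter before running the examples, one option is a small check like the one below. It's a minimal sketch that uses only the standard library; is_installed is a hypothetical helper name, not something borb provides:

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Reports whether the borb package can be found by this interpreter
    print("borb available:", is_installed("borb"))
```

If this reports False, the package was most likely installed into a different Python environment than the one you're running.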

Splitting a PDF using borb

To demonstrate this, you'll need a PDF with a few pages.
We'll start by creating such a PDF using borb. This step is optional; you can of course simply use a PDF you have lying around instead:

from borb.pdf.canvas.color.color import HexColor
from borb.pdf.canvas.layout.page_layout.multi_column_layout import SingleColumnLayout
from borb.pdf.canvas.layout.page_layout.page_layout import PageLayout
from borb.pdf.canvas.layout.text.paragraph import Paragraph
from borb.pdf.document import Document
from borb.pdf.page.page import Page
from borb.pdf.pdf import PDF
from decimal import Decimal

def create_document(heading_color: HexColor = HexColor("0b3954"), 
                    text_color: HexColor = HexColor("de6449"),
                    file_name: str = "output.pdf"):

    d: Document = Document()

    N: int = 10
    for i in range(0, N):
    
        # Create a new Page, and append it to the Document
        p: Page = Page()
        d.append_page(p)
        
        # Set the PageLayout of the new Page
        l: PageLayout = SingleColumnLayout(p)
        
        # Add the paragraph to identify the Page
        l.add(Paragraph("Page %d of %d" % (i+1, N),
                        font_color=heading_color,
                        font_size=Decimal(24)))
                        
        # Add a Paragraph of dummy text                        
        l.add(Paragraph("""
                        Lorem Ipsum is simply dummy text of the printing and typesetting industry. 
                        Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, 
                        when an unknown printer took a galley of type and scrambled it to make a type specimen book. 
                        It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. 
                        It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, 
                        and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.
                        """,
                        font_color=text_color))
    
    # Persist the Document to disk
    with open(file_name, "wb") as pdf_out_handle:
        PDF.dumps(pdf_out_handle, d)

This example code generates a PDF document consisting of 10 pages:

  • Each page starts with "Page x of 10". This will make it easier to identify the pages later on.
  • Each page contains 1 paragraph of text.

Splitting PDF Documents in Python

Now let's split this PDF. We'll start by splitting it in two, the first half containing the first 5 pages, and the second half containing the remaining pages:

def split_half_half():

  # Read PDF
  with open("output.pdf", "rb") as pdf_file_handle:
    input_pdf = PDF.loads(pdf_file_handle)

  # Create two empty PDFs to hold each half of the split
  output_pdf_001 = Document()
  output_pdf_002 = Document()

  # Split
  for i in range(0, 10):
    if i < 5:
      output_pdf_001.append_page(input_pdf.get_page(i))
    else:
      output_pdf_002.append_page(input_pdf.get_page(i))

  # Write PDF
  with open("output_001.pdf", "wb") as pdf_out_handle:
    PDF.dumps(pdf_out_handle, output_pdf_001)

  # Write PDF
  with open("output_002.pdf", "wb") as pdf_out_handle:
    PDF.dumps(pdf_out_handle, output_pdf_002)

We've extracted the first 5 pages into a new Document, and the following 5 pages into a second new Document, effectively splitting the original one into two smaller entities.

This is made easy by the get_page() method, as its return value can be passed directly to append_page().

You can check the resulting PDFs to verify the code works as intended: the first document ends at page 5, and the second one starts at page 6.
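
The loop above hard-codes both the page count (10) and the split point (5). As a rough sketch of how the same split could be computed for a document of any length, the helper below returns the two groups of zero-based indices. Note that split_indices is a hypothetical name introduced for illustration and is not part of borb, and that the page count is assumed to be supplied by the caller:

```python
from typing import List, Tuple


def split_indices(n_pages: int, split_at: int) -> Tuple[List[int], List[int]]:
    """Return the zero-based page indices before split_at, and from split_at onward."""
    if not 0 <= split_at <= n_pages:
        raise ValueError("split_at must lie between 0 and n_pages")
    return list(range(split_at)), list(range(split_at, n_pages))


# Splitting a 10-page document down the middle
first_half, second_half = split_indices(10, 10 // 2)
print(first_half)   # [0, 1, 2, 3, 4]
print(second_half)  # [5, 6, 7, 8, 9]
```

Each index can then be passed to get_page() and the result to append_page(), exactly as in split_half_half().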

We can split it based on other criteria as well! In the next example, we'll split the PDF by putting all odd pages in one PDF and all even pages in another:

def split_even_odd():

  # Read PDF
  with open("output.pdf", "rb") as pdf_file_handle:
    input_pdf = PDF.loads(pdf_file_handle)
  
  # Create two empty PDFs to hold each half of the split
  output_pdf_001 = Document()
  output_pdf_002 = Document()

  # Split
  for i in range(0, 10):
    if i % 2 == 0:
      output_pdf_001.append_page(input_pdf.get_page(i))
    else:
      output_pdf_002.append_page(input_pdf.get_page(i))

  # Write PDF
  with open("output_001.pdf", "wb") as pdf_out_handle:
    PDF.dumps(pdf_out_handle, output_pdf_001)

  # Write PDF
  with open("output_002.pdf", "wb") as pdf_out_handle:
    PDF.dumps(pdf_out_handle, output_pdf_002)

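You can open the resulting PDF documents to verify the split. Because the loop index starts at 0, the condition i % 2 == 0 selects pages 1, 3, 5, 7 and 9 for output_001.pdf, while output_002.pdf receives pages 2, 4, 6, 8 and 10.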
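
The even/odd rule is just one possible criterion. A minimal sketch of the general idea, assuming you only need to decide per page index which of the two outputs a page belongs to, could look like this. The name partition_pages is hypothetical and the function works on plain integers rather than on borb objects:

```python
from typing import Callable, List, Tuple


def partition_pages(
    n_pages: int, goes_first: Callable[[int], bool]
) -> Tuple[List[int], List[int]]:
    """Split zero-based page indices into two lists according to a predicate."""
    first: List[int] = []
    second: List[int] = []
    for i in range(n_pages):
        # Indices for which the predicate holds go to the first output
        (first if goes_first(i) else second).append(i)
    return first, second


# Index 0 is page 1, so even indices correspond to odd page numbers
odd_numbered, even_numbered = partition_pages(10, lambda i: i % 2 == 0)
print([i + 1 for i in odd_numbered])   # [1, 3, 5, 7, 9]
print([i + 1 for i in even_numbered])  # [2, 4, 6, 8, 10]
```

Swapping in a different predicate (for example lambda i: i < 3) changes the split without touching the rest of the code.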

Merging PDF Documents in Python

To work through the next examples, we'll need two PDFs. Let's use the earlier code to generate them, if you don't already have some:

create_document(HexColor("247B7B"), HexColor("78CDD7"), "output_001.pdf")
create_document(file_name="output_002.pdf")

Merging works much like splitting, except that we can add entire documents to other documents, not just individual pages. Sometimes, though, you might want to split a document first (for example, to cut off the last page) before merging it with another one.

We can merge them entirely (concatenating both PDFs), but we can also just add some pages of the first PDF to the second if we'd prefer it that way - using the append_page() function like before.
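
For the partial case, it can help to work out which pages end up where before touching any files. The sketch below builds such a plan for the "cut off the last page" scenario. It is only an illustration: merge_plan and its drop_last parameter are hypothetical names, the page counts are assumed to be known up front, and nothing here calls borb:

```python
from typing import Dict, List, Tuple


def merge_plan(
    page_counts: Dict[str, int], drop_last: Tuple[str, ...] = ()
) -> List[Tuple[str, int]]:
    """Return (source name, zero-based page index) pairs in output order."""
    plan: List[Tuple[str, int]] = []
    for name, count in page_counts.items():
        # Keep one page fewer for sources whose last page should be dropped
        keep = count - 1 if name in drop_last else count
        plan.extend((name, i) for i in range(max(keep, 0)))
    return plan


plan = merge_plan(
    {"output_001.pdf": 10, "output_002.pdf": 10},
    drop_last=("output_001.pdf",),
)
print(len(plan))         # 19
print(plan[8], plan[9])  # ('output_001.pdf', 8) ('output_002.pdf', 0)
```

Each pair in the plan maps onto one get_page() call on the named source followed by one append_page() call on the output document.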

Let's start by concatenating them entirely:

def concatenate_two_documents():

  # Read first PDF
  with open("output_001.pdf", "rb") as pdf_file_handle:
    input_pdf_001 = PDF.loads(pdf_file_handle)
  
  # Read second PDF
  with open("output_002.pdf", "rb") as pdf_file_handle:
    input_pdf_002 = PDF.loads(pdf_file_handle)
  
  # Build new PDF by concatenating two inputs
  output_document = Document()
  output_document.append_document(input_pdf_001)
  output_document.append_document(input_pdf_002)
  
  # Write PDF
  with open("output.pdf", "wb") as pdf_out_handle:
    PDF.dumps(pdf_out_handle, output_document)

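If you generated the inputs with create_document(), the resulting output.pdf contains 20 pages: the 10 pages of output_001.pdf followed by the 10 pages of output_002.pdf.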

Rotating Pages in PDF Documents in Python

A page in a PDF document can be rotated by any multiple of 90 degrees. This kind of operation allows you to easily switch between landscape and portrait mode.
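
To make the "multiple of 90 degrees" constraint and the portrait/landscape switch concrete, here is a small arithmetic sketch. It is not how borb implements rotation; normalize_rotation and rotated_size are hypothetical helpers, and 595 x 842 is used as an approximate A4 page size in points:

```python
from typing import Tuple


def normalize_rotation(angle: int) -> int:
    """Map any multiple of 90 degrees onto 0, 90, 180 or 270."""
    if angle % 90 != 0:
        raise ValueError("rotation must be a multiple of 90 degrees")
    return angle % 360


def rotated_size(width: float, height: float, angle: int) -> Tuple[float, float]:
    """Return the visible (width, height) of a page after rotating it."""
    # A quarter or three-quarter turn swaps width and height
    if normalize_rotation(angle) in (90, 270):
        return height, width
    return width, height


print(normalize_rotation(-90))      # 270
print(rotated_size(595, 842, 90))   # (842, 595)
print(rotated_size(595, 842, 180))  # (595, 842)
```

A single quarter turn in either direction therefore turns a portrait page into a landscape one, while a half turn leaves the orientation unchanged.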

In the next example you'll rotate a page from one of the input PDFs we created earlier:

def rotate_first_page():
  # Read PDF
  with open("output_001.pdf", "rb") as pdf_file_handle:
    input_pdf_001 = PDF.loads(pdf_file_handle)

  # Rotate page
  input_pdf_001.get_page(0).rotate_left()  
  
  # Write PDF to disk
  with open("output.pdf", "wb") as pdf_out_handle:
    PDF.dumps(pdf_out_handle, input_pdf_001)

In the resulting PDF, the first page is rotated to the left, while the remaining pages keep their original orientation.

Conclusion

In this guide, we've taken a look at how to merge and split PDF documents. We've also modified an existing PDF by rotating one of its pages.

Last Updated: September 23rd, 2021

Author: Joris Schellekens

I'm a software architect from Belgium, with a passion for machine learning, knowledge-based systems and graph algorithms. I'm also the author of borb, the pure python PDF library.

    © 2013-2021 Stack Abuse. All rights reserved.