November 18, 2022 10:21 pm GMT

Working with PDFs in Python

This article was originally written by Giridhar Talla on the Honeybadger Developer Blog.

Working with files is a common task in any programming language, and Python can handle almost any file format. This article explains how to work with PDF files in Python. Python 3 has many libraries that can help you read and create PDF files, and this post gives a quick overview of the packages you'll need.

Table of Contents

  • What is a PDF?
  • Setup
  • Working with PDF files
    • Creating a PDF
    • Extracting text from a PDF
    • Converting .txt files to a PDF
    • Concatenating and merging PDF files
    • Encrypting and decrypting PDFs

What Is a PDF?

Working with PDF files is not the same as working with other file formats. The Portable Document Format (PDF) is a binary file format for presenting documents consistently across devices and software. It was initially created by Adobe and is now an open standard maintained by the International Organization for Standardization (ISO). A PDF file is more than just a collection of text; it is a collection of data in binary format. The data can be of many kinds, including text, images, tables, and rich media, such as audio and video. However, the format is not designed to be edited in place. It is a popular format for storing documents since it is easy to share or print. Refer to the Wikipedia article on the PDF format for more information.
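To make the "binary format" point concrete, here is a minimal sketch that checks whether a file starts with the PDF signature bytes (%PDF-). The helper name looks_like_pdf and the sample file contents are made up for illustration, and a real validator would need to check far more than the header.

```python
import os
import tempfile


def looks_like_pdf(path):
    """Return True if the file starts with the PDF signature bytes."""
    # Open in binary mode: a PDF is not a plain-text file.
    with open(path, "rb") as f:
        return f.read(5) == b"%PDF-"


# Demo with two throwaway files (the contents are made up for illustration).
with tempfile.TemporaryDirectory() as tmp:
    fake_pdf = os.path.join(tmp, "sample.pdf")
    note = os.path.join(tmp, "note.txt")
    with open(fake_pdf, "wb") as f:
        f.write(b"%PDF-1.7\n%%EOF\n")
    with open(note, "wb") as f:
        f.write(b"just text\n")
    print(looks_like_pdf(fake_pdf))  # True
    print(looks_like_pdf(note))      # False
```

A check like this only tells you that the file claims to be a PDF; reading its contents still requires a library such as the ones introduced in the next section.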

Setup

I assume Python is already installed on your machine. If not, go to the official website and download it. You'll need two libraries to work with PDF files. The first is PyPDF2, a Python library for reading and modifying PDF files. The second is FPDF, a library for creating PDF files. PyPDF2 is an excellent package for working with existing PDF files, but it cannot add text or other content to a new PDF file. You'll use FPDF for that.

Note: If you're using Python 2, you can use PyPDF (the old version of PyPDF2) instead. I'll use PyPDF2 with Python 3 in this article, although you can use either PyPDF2 or PyPDF4. Both do the same thing and are compatible with Python 3. Simply swap out the import statements.

Let's get started on this installation. Install PyPDF2 and FPDF using pip or conda (if you're using Anaconda).

pip install pypdf2 fpdf2

You may use the following command to check the installation.

    $ pip show pypdf2 fpdf
    Name: PyPDF2
    Version: 1.26.0
    Summary: PDF toolkit
    Home-page: http://mstamy2.github.com/PyPDF2
    Author: Mathieu Fenniak
    Author-email: [email protected]
    License: UNKNOWN
    Location: c:\users\giri\python3.9\lib\site-packages
    Requires:
    Required-by:
    ---
    Name: fpdf
    Version: 1.7.2
    Summary: Simple PDF generation for Python
    Home-page: http://code.google.com/p/pyfpdf
    Author: Olivier PLATHEY ported by Max
    Author-email: [email protected]
    License: LGPLv3+
    Location: c:\users\giri\python3.9\lib\site-packages
    Requires:
    Required-by:

Quick note: You can find the entire directory of the code and working examples here.

Working with PDF Files

Now that you have PyPDF2 and FPDF installed, let's get started. First, let's look at extracting information about a PDF file. You can use the PdfFileReader class of PyPDF2, which lets you read the content of a PDF file. The getDocumentInfo method of PdfFileReader returns the metadata of the PDF file in the form of a dictionary, and the getNumPages method returns the total number of pages in the PDF file. You can use this information to automate tasks on your existing PDF files, such as sorting them by number of pages or by author.

pdf_info.py

    ## Import
    from PyPDF2 import PdfFileReader

    ## Setup
    pdf = PdfFileReader(open('pdf_path', "rb"))
    info = pdf.getDocumentInfo()
    number_of_pages = pdf.getNumPages()

    ## Extracting information
    pdf_info = f"""
        Information about {info.title}:
        Author: {info.author}
        Creator: {info.creator}
        Producer: {info.producer}
        Subject: {info.subject}
        Title: {info.title}
        Number of pages: {number_of_pages}
      """
    print(pdf_info)

You can see the output as shown below:

    Information about Test PDF:
    Author: Giridhar
    Creator: Honeybadger
    Producer: PyFPDF 1.7.2 http://pyfpdf.googlecode.com/
    Subject: Test PDF created using PyPDF2
    Title: Test PDF
    Number of pages: 1

You can refer to the documentation for all the different methods and parameters of PdfFileReader.
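As an illustration of the automated sorting mentioned above, here is a minimal sketch that orders a list of files by page count. The function sort_by_page_count, its count_pages parameter, and the file names and page counts are all hypothetical; the page counts are hard-coded so that the sketch runs without any real PDF files.

```python
def sort_by_page_count(paths, count_pages):
    """Return paths ordered from fewest to most pages.

    count_pages is any callable mapping a path to its page count.
    """
    return sorted(paths, key=count_pages)


# Stand-in page counts (made up) so the sketch runs without real PDF files.
fake_counts = {"report.pdf": 12, "memo.pdf": 1, "manual.pdf": 48}
ordered = sort_by_page_count(fake_counts, fake_counts.get)
print(ordered)  # ['memo.pdf', 'report.pdf', 'manual.pdf']
```

With real files, count_pages could be a small function that opens each path with PdfFileReader and returns getNumPages(), as in pdf_info.py.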

Creating a PDF

Now, let's create a new PDF file. You can create a brand-new PDF file with PdfFileWriter from PyPDF2. However, it has no methods for adding text or building page content flexibly. Instead, you can use the FPDF library. Import the package and create a new PDF file object using FPDF(), defining the orientation, unit, and format. You can add a new blank page using the add_page method.

create_pdf.py

    ## Import
    from fpdf import FPDF

    ## Create a new PDF file
    ## Orientation: P = Portrait, L = Landscape
    ## Unit = mm, cm, in
    ## Format = 'A3', 'A4' (default), 'A5', 'Letter', 'Legal', custom size with (width, height)
    pdf = FPDF(orientation="P", unit="mm", format="A4")

    ## Add a page
    pdf.add_page()

Note: You can also include some meta-information if you want. The FPDF class provides the required methods.

You can also specify the font, font size, and style using the set_font method and color of the text using the set_text_color method.

You can add text to the PDF file using the cell(w, h, txt) method. You can specify whether to move the cursor to the following line using ln and text-alignment using align.

create_pdf.py

    ...

    ## Specify Font
    ## Font Family: Arial, Courier, Helvetica, Times, Symbol
    ## Font Style: B = Bold, I = Italic, U = Underline, combinations (i.e., BI, BU, etc.)
    pdf.set_font("Arial", size=18)
    pdf.set_text_color(0, 0, 255)

    ## Add text
    ## Cell(w, h, txt, border, ln, align)
    ## w = width, h = height
    ## txt = your text
    ## ln = (0 or False; 1 or True - move cursor to next line)
    ## border = (0 or False; 1 or True - draw border around the cell)
    pdf.cell(200, 10, txt="Hello World!", ln=1, align="C")

    pdf.set_font("Arial", size=12)
    pdf.set_text_color(0, 0, 0)
    pdf.cell(200, 10, txt="This pdf is created using FPDF in Python.", ln=1, align="C")

You can also add an image to your PDF file using the image method. Finally, to write out the PDF file, use the output method. Given a relative file name, it saves the new PDF file in the current working directory.

create_pdf.py

    ...

    ## Add image
    ## name = Path or URL of the image
    ## x = x-coordinate, y = y-coordinate (default = None)
    ## w = width, h = height (If not specified or equal to zero, they are automatically calculated.)
    ## type = Image format. JPG, JPEG, PNG and GIF (If not specified, the type is inferred from the file extension.)
    pdf.image(name="boy_night.jpg", h=107, type="JPG")

    ## Output the PDF
    pdf.output("test_pdf.pdf")
    print("pdf has been created successfully....")

Run the above program, and if you see the success message, your PDF is created. Check out the whole program for creating a PDF file below.

create_pdf.py

    ## Import
    from fpdf import FPDF

    pdf = FPDF(orientation="P", unit="mm", format="A4")

    ## Adding meta data to the PDF file
    pdf.set_title("Test PDF")
    pdf.set_author("Giridhar")
    pdf.set_creator("Honeybadger")
    pdf.set_subject("Test PDF created using PyPDF2")
    pdf.set_keywords("PDF, Python, Tutorial")

    pdf.add_page()

    ## Add text
    pdf.set_font("Arial", size=18)
    pdf.set_text_color(0, 0, 255)
    pdf.cell(200, 10, txt="Hello World!", ln=1, align="C")

    pdf.set_font("Arial", size=12)
    pdf.set_text_color(0, 0, 0)
    pdf.cell(200, 10, txt="This pdf is created using FPDF in Python.", ln=1, align="C")

    ## Add image
    pdf.image(name="boy_night.jpg", h=107, type="JPG")

    ## Save the PDF file
    pdf.output("test_pdf.pdf")
    print("pdf has been created successfully....")

Extracting Text from a PDF

Now that you have created a PDF file, let's look at extracting the text using Python. PyPDF2 represents each page in a PDF as an object called PageObject. You can use several methods of the PageObject class to interact with the pages in a PDF file. The getPage(pageNumber) method of the PdfFileReader class returns a PageObject instance for that page. To extract the text from that specific page, you can use the extractText() method of the PageObject class. You are free to do anything you want with the text.

extract_single_page.py

    ## Import
    from PyPDF2 import PdfFileReader

    ## Create the PdfFileReader instance
    pdf = PdfFileReader(open("<path_to_pdf>", "rb"))

    ## Get the page object and extract the text
    page_object = pdf.getPage(0)  # page number starts from 0 (0-index)
    text = page_object.extractText()

    ## Print the text
    print(text)

Again, the getPage method returns a single page. The PdfFileReader class has a .pages attribute that returns the list of all the pages in a PDF file as PageObjects. You can loop through the pages and extract the text on each page.

extract_text.py

    ## Import
    from PyPDF2 import PdfFileReader

    ## Create the PdfFileReader instance
    pdf = PdfFileReader(open("<path_to_pdf>", "rb"))

    ## Looping through the page objects array
    for page in pdf.pages:
        text = page.extractText()
        print(text)

Now, you can create a .txt file from the contents of the PDF. Follow the comments in the code snippet if you get lost.

extract_text.py

    ## Import
    from PyPDF2 import PdfFileReader

    ## Declare the PdfFileReader instance
    pdf = PdfFileReader(open("<path_to_pdf>", "rb"))

    ## Create a new text file and open it in write mode
    with open("<path_to_text_file>", "w") as f:
        ## Loop through the PDF pages
        for page in pdf.pages:
            text = page.extractText()
            ## Write to the text file
            f.write(text)

You can also create a new PDF file by extracting a specific page or a range of pages from a PDF. Using the PdfFileWriter class in PyPDF2 allows you to create a new PDF file and add these pages.

The PdfFileWriter class creates a new PDF file, and you can add a page to the new PDF file using the addPage() method. It requires an existing PageObject as input.

extract_text_to_pdf.py

    ## Import
    from PyPDF2 import PdfFileReader, PdfFileWriter

    ## Declare the PdfFileReader instance and create a new PDF file using PdfFileWriter
    old_pdf = PdfFileReader(open("<path_to_pdf>", "rb"))
    new_pdf = PdfFileWriter()

    ## Loop through the pages and add them to the new PDF file
    for page in old_pdf.pages[1:4]:  # [1:4] selects index positions 1 to 3
        new_pdf.addPage(page)

    ## Save the new PDF file
    with open("<path_to_new_pdf>", "wb") as f:
        new_pdf.write(f)

The above code generates a new PDF file containing the pages at index positions 1 to 3 of the original PDF (that is, its second through fourth pages).
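Because page indexes start at 0 and a slice excludes its stop index, it is easy to be off by one page here. The following sketch uses a plain list of made-up labels as a stand-in for pdf.pages to show which pages a slice selects; it assumes only standard Python slicing behavior.

```python
# Stand-in for pdf.pages: index 0 is the first page of the document.
pages = ["page 1", "page 2", "page 3", "page 4", "page 5"]

# The stop index is exclusive, so [1:4] selects indices 1, 2, and 3.
selected = pages[1:4]
print(selected)  # ['page 2', 'page 3', 'page 4']

# To take the first three pages instead, start the slice at 0.
first_three = pages[0:3]
print(first_three)  # ['page 1', 'page 2', 'page 3']
```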

Converting .txt Files to a PDF

As noted earlier, a PDF is not designed to be edited in place. Instead, you can convert it to a .txt or other type of file, modify the contents, and then convert the result into a new PDF file. Let's look at converting a .txt file into a PDF file.

To generate a PDF file from text, you should use the FPDF library. You must loop over the lines in the text file and add each line to a blank PDF, just as you created a .txt file from a PDF.

convert_txt_to_pdf.py

    ## Import
    from fpdf import FPDF

    ## Create a new PDF
    pdf = FPDF(orientation="P", unit="mm", format="A4")
    pdf.add_page()
    pdf.set_font("Arial", size=12)

    ## Open the .txt file in read mode (closed automatically by "with")
    with open("<path-to-text-file>", "r") as text:
        ## Loop through the lines in the text file and add them to the PDF
        for line in text:
            ## Strip the trailing newline; ln=1 already moves to the next line
            pdf.cell(0, 5, txt=line.rstrip("\n"), ln=1)

    ## Save the pdf file
    pdf.output("<new-path-to-pdf>")
    print("PDF created!")

The code snippet generates a new PDF file from an existing text (.txt) file.

Next, we'll see how to work with existing PDF files. PyPDF2 can combine, encrypt, and decrypt PDF files.

Concatenating and Merging PDF Files

In this section, you'll learn how to merge or concatenate PDF files. You can use the PdfFileMerger class to combine the PDF files. It enables us to combine PDF files in two different ways. The first way is to use the append method. It concatenates (adds) a new PDF to the end of the previous one. The second way is to use the merge method, which allows you to define the page range to merge.

To combine PDF files, create a new PDF merger object and then add the two PDF files using the append method. Finally, use the write() method to create the new PDF file. It writes the combined PDF to the file object you pass in, which saves it to disk.

append_pdf.py

    ## Import
    from PyPDF2 import PdfFileMerger

    ## Create a PDF merger object
    pdf_merger = PdfFileMerger()

    ## Append the PDFs to the merger
    pdf_merger.append("pdf_1.pdf")
    pdf_merger.append("pdf_2.pdf")

    ## Write to file
    with open("append_pdf.pdf", "wb") as f:
        pdf_merger.write(f)

You can also append more than two PDF files to the same PDF merger object (append_multiple_pdf.py). It creates a new PDF file containing the pages of all the PDF files, one after another in the order they were appended.

The merge method works like append, except that it inserts the second PDF at a chosen position instead of adding it to the end. It takes two required arguments: the first is the page index position, and the second is the path to the second file. The page index position is the index in the first PDF file at which the new pages are inserted.

merge_pdf.py

    ## Import
    from PyPDF2 import PdfFileMerger

    ## Create a PDF merger object
    pdf_merger = PdfFileMerger()

    ## Append the PDFs to the merger
    pdf_merger.append("pdf_1.pdf")

    ## Merge the second PDF file using index position and path
    pdf_merger.merge(1, "pdf_2.pdf")

    ## Write to file
    with open("merged_pdf.pdf", "wb") as f:
        pdf_merger.write(f)

This generates a new PDF file in which the pages of the second PDF are inserted after the first page of the first PDF (i.e., at index position 1). You can also choose specific pages from the second PDF file to merge. After the path, specify the index range of pages to combine.

merge_pdf_range.py

    ## Import
    from PyPDF2 import PdfFileMerger

    ## Create a PDF merger object
    pdf_merger = PdfFileMerger()

    ## Append the PDFs to the merger
    pdf_merger.append("pdf_1.pdf")

    ## Merge the second PDF file using index position and path
    pdf_merger.merge(1, "pdf_2.pdf", (1,3))  # pages = (start, stop)

    ## Write to file
    with open("append_1_to_3_pdf.pdf", "wb") as f:
        pdf_merger.write(f)

This creates a new PDF file in which only pages 2 and 3 of the second PDF file (index positions 1 and 2; the stop index 3 is excluded) are inserted into the first PDF file, starting at its second page (i.e., index position = 1). This method helps you merge only the pages you choose.
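The effect of the position and page-range arguments can be modeled with plain Python lists. The sketch below is not PyPDF2's implementation: merge_pages is a hypothetical helper, the page labels are made up, and it assumes the (start, stop) tuple behaves like range(start, stop) with an exclusive stop.

```python
def merge_pages(target, position, source, pages=None):
    """Insert pages from source into target at position; return a new list.

    pages is an optional (start, stop) tuple with an exclusive stop.
    """
    if pages is not None:
        source = source[pages[0]:pages[1]]
    return target[:position] + list(source) + target[position:]


pdf_1 = ["A1", "A2", "A3"]
pdf_2 = ["B1", "B2", "B3", "B4"]

print(merge_pages(pdf_1, 1, pdf_2))
# ['A1', 'B1', 'B2', 'B3', 'B4', 'A2', 'A3']
print(merge_pages(pdf_1, 1, pdf_2, pages=(1, 3)))
# ['A1', 'B2', 'B3', 'A2', 'A3']
```

Working through the expected page order on paper like this before running the merger can save a few rounds of opening the output file to check it.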

Encrypting and Decrypting PDFs

Everything revolves around safety. Encryption is the process of protecting data via mathematical algorithms and a password (similar to a 'key') to decode the original data. You can read more about encryption in this article.

Encrypting a PDF file lets you limit access to it: only someone who has the password, such as you or your client, can open it. You can encrypt a PDF file using the encrypt(user_pwd, owner_pwd) method of the PdfFileWriter class, and decrypt it using the decrypt(password) method of the PdfFileReader class to access it.

The encrypt method accepts the following arguments:

  • user_pwd: the user password, which is used to open the PDF file.
  • owner_pwd: the owner password, which is used to restrict the PDF file's edit and view access (admin privileges). By default, the owner password is the same as the user password.
  • use_128bit: whether to use 128-bit encryption. It defaults to True; when set to False, 40-bit encryption is used.

Note: At this stage, PyPDF2 allows you to encrypt a PDF file but does not allow you to specify any permissions on the document. You can accomplish this with another library, such as pdfrw.

The code sample below demonstrates how to encrypt a PDF file with PyPDF2.

encrypt_pdf.py

    ## Import
    from PyPDF2 import PdfFileWriter, PdfFileReader

    pdf_reader = PdfFileReader("<path_to_pdf_file>")
    pdf_writer = PdfFileWriter()

    for page in pdf_reader.pages:
        pdf_writer.addPage(page)

    ## Encrypt the PDF file
    pdf_writer.encrypt(user_pwd = "<user-password>", owner_pwd = "<owner-password>", use_128bit = True)

    with open("encrypted_pdf.pdf", "wb") as f:
        pdf_writer.write(f)

The code produces a new PDF file, encrypting it with the password. Whenever you try to open the PDF, you must enter the user password to view the contents.

If you try to access the PDF using PyPDF2, it displays the following error:

    Traceback (most recent call last):
      File "read_pdf.py", line 6, in <module>
        pdf_reader.getPage(0)
        raise utils.PdfReadError("file has not been decrypted")
    PyPDF2.utils.PdfReadError: file has not been decrypted

To open the PDF, you must supply a password. You can use the decrypt(password) method of PdfFileReader with either the user or the owner password to decrypt the PDF file.

The decrypt method returns an integer representing the success of the decryption:

  • 0 denotes that the password is incorrect.
  • 1 indicates that the user password is a match.
  • 2 indicates that the owner's password was matched.

decrypt_pdf.py

    ## Import
    from PyPDF2 import PdfFileReader

    ## Get the encrypted file
    pdf_reader = PdfFileReader("encrypted_pdf.pdf")

    ## Decrypt the file using password
    pdf_reader.decrypt("SuperSecret")

    print(pdf_reader.getPage(0).extractText())

Now, you can work with the PDF file as you did throughout this article.
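Since decrypt reports a wrong password through its return value, it can help to check that value before reading any pages. Here is a minimal sketch of such a check, assuming decrypt returns 0, 1, or 2 as listed above. The unlock helper and the FakeReader class are hypothetical; FakeReader exists only so the sketch runs without an encrypted file.

```python
DECRYPT_RESULTS = {
    0: "wrong password",
    1: "user password matched",
    2: "owner password matched",
}


def unlock(reader, password):
    """Call reader.decrypt(password) and raise if the password is rejected."""
    result = reader.decrypt(password)
    if result == 0:
        raise ValueError("could not decrypt: wrong password")
    return DECRYPT_RESULTS[result]


class FakeReader:
    """Hypothetical stand-in for PdfFileReader, used only to run this sketch."""

    def decrypt(self, password):
        return 1 if password == "SuperSecret" else 0


print(unlock(FakeReader(), "SuperSecret"))  # user password matched
```

In a real script, you would pass the PdfFileReader instance from decrypt_pdf.py in place of FakeReader().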

Conclusion

As previously stated, you can use any PDF toolkit to work with PDF files in Python, as I did in this post using PyPDF2 and FPDF. You can generate, read, edit, combine, encrypt, and decrypt PDF files. You can also convert a PDF file to another format and vice versa. For your next projects, you could build an online PDF file converter or an application that creates PDF files online. You could also make an application to automate the process of creating invoices. You are not, however, limited to the libraries mentioned in this article. Django and Flask both have their own packages for working with PDF files. I hope that this post provides you with a foundation for working with PDF files in Python.


Original Link: https://dev.to/honeybadger/working-with-pdfs-in-python-61d
