PyPDF2 is a popular Python library for working with PDF files. It lets you extract text, metadata, and images from PDF files, or manipulate and combine them to create new PDFs. In this article, we’ll explore the PyPDF2 library and its features, and demonstrate its usage through practical examples.
Before we get started, you need to have Python installed on your system. You can install Python from the official Python website. Once Python is installed, you can install PyPDF2 using pip:
pip install pypdf2
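PyPDF2 provides a comprehensive set of tools for working with PDF files, including the ability to read a PDF and count its pages, extract text, merge several PDFs into one, and add a watermark to every page.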
Let’s explore some of these features using practical examples.
To read a PDF file, we first import the PyPDF2 library and open the file with a PdfReader object (called PdfFileReader in older releases; PyPDF2 3.0 removed the old name):
import PyPDF2
# Open the file in binary mode and report how many pages it has
with open('example.pdf', 'rb') as file:
    pdf_reader = PyPDF2.PdfReader(file)
    print(f"Number of pages: {len(pdf_reader.pages)}")
Keep in mind that we need to open the file in binary mode ('rb') since we’re working with a binary file format.
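As a small illustration of why binary mode matters, the sketch below reads the first few bytes of a file and compares them with the `%PDF-` marker that a well-formed PDF starts with. The helper name `looks_like_pdf` is made up for this example, and the check is only a quick sanity test: some tolerant readers accept files with a few stray bytes before the header, which this sketch would reject.

```python
def looks_like_pdf(path):
    """Return True if the file begins with the PDF header bytes.

    Hypothetical helper for illustration; it is not part of PyPDF2.
    """
    # 'rb' yields raw bytes, so the comparison uses a bytes literal.
    with open(path, "rb") as file:
        return file.read(5) == b"%PDF-"
```

Calling `looks_like_pdf('example.pdf')` before handing the file to PyPDF2 is one cheap way to catch a mislabeled file early.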
To extract text from a PDF file, we loop through each page and call the extract_text() method (extractText() before PyPDF2 3.0) on the page object:
import PyPDF2
with open('example.pdf', 'rb') as file:
    pdf_reader = PyPDF2.PdfReader(file)
    text = ""
    # pdf_reader.pages yields one page object per page, in order
    for page in pdf_reader.pages:
        text += page.extract_text()
print(text)
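If you need to know which page a piece of text came from, it can help to keep the pages separate rather than joining them into one string. The sketch below assumes the PyPDF2 3.x naming (`reader.pages` and `page.extract_text()`); the function name `extract_text_by_page` is invented for this example.

```python
def extract_text_by_page(reader):
    """Return a list with one string per page of the given reader.

    Assumes `reader` behaves like PyPDF2.PdfReader in PyPDF2 3.x:
    it exposes `.pages`, and each page has `extract_text()`.
    """
    texts = []
    for page in reader.pages:
        # Fall back to "" so the list stays aligned with page order
        # even if a page yields no text.
        texts.append(page.extract_text() or "")
    return texts

# Possible usage with a real file (requires PyPDF2 to be installed):
#   import PyPDF2
#   with open('example.pdf', 'rb') as file:
#       page_texts = extract_text_by_page(PyPDF2.PdfReader(file))
```

The result is indexed from zero, so `page_texts[0]` holds the text of the first page.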
Merging PDF files is simple: we create a PdfMerger object (PdfFileMerger before PyPDF2 3.0), append each input file, and then write the result to a new file:
import PyPDF2
pdf_merger = PyPDF2.PdfMerger()
pdf_files = ['example1.pdf', 'example2.pdf']
# Append the files in the order they should appear in the output
for pdf_file in pdf_files:
    with open(pdf_file, 'rb') as file:
        pdf_merger.append(file)
with open('merged.pdf', 'wb') as output_file:
    pdf_merger.write(output_file)
pdf_merger.close()
To add a watermark to a PDF file, we create a PdfWriter object (PdfFileWriter before PyPDF2 3.0), loop through the pages of the input PDF, and merge the first page of watermark.pdf into each one:
import PyPDF2
with open('example.pdf', 'rb') as file, open('watermark.pdf', 'rb') as watermark_file:
    pdf_reader = PyPDF2.PdfReader(file)
    watermark_page = PyPDF2.PdfReader(watermark_file).pages[0]
    pdf_writer = PyPDF2.PdfWriter()
    for page in pdf_reader.pages:
        # Draw the watermark's content on top of the page
        page.merge_page(watermark_page)
        pdf_writer.add_page(page)
    # Write the output while the input files are still open
    with open('watermarked.pdf', 'wb') as output_file:
        pdf_writer.write(output_file)
In this article, we have explored some of the core features of the PyPDF2 library for working with PDF files in Python. With PyPDF2, you can easily read, manipulate, and create PDF files in your Python projects.
Keep in mind that PyPDF2 has some limitations, such as not being able to extract text from scanned or image-based PDFs. In such cases, you may want to explore other libraries like PDFMiner or OCR tools for text extraction.
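One rough way to spot pages that fall under this limitation is to check which pages produced little or no text. The sketch below works on a plain list of per-page strings, however they were obtained; the function name `pages_needing_ocr` and the `min_chars` threshold are assumptions made for illustration, not part of PyPDF2.

```python
def pages_needing_ocr(page_texts, min_chars=1):
    """Return 1-based page numbers whose extracted text looks empty.

    Hypothetical helper: a page counts as "empty" when its text,
    with surrounding whitespace removed, has fewer than `min_chars`
    characters. None is treated the same as an empty string.
    """
    flagged = []
    for number, text in enumerate(page_texts, start=1):
        if len((text or "").strip()) < min_chars:
            flagged.append(number)
    return flagged
```

An empty result does not prove the extraction is good, and a flagged page may simply be blank, so treat the output as a hint about where OCR might be worth trying.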