To extract text from a PDF file, you can use the PyPDF2
library. If you don’t have it installed, you can install it using pip
:
pip install PyPDF2
Here’s a Python script to accomplish the task:
import PyPDF2
import sys
def extract_pdf_text(pdf_path, output_path):
try:
with open(pdf_path, 'rb') as pdf_file:
print(f"Reading PDF file: {pdf_path}")
pdf_reader = PyPDF2.PdfFileReader(pdf_file)
if pdf_reader.isEncrypted:
print("The PDF file is encrypted. Unable to extract text.")
return False
total_pages = pdf_reader.numPages
print(f"Total pages: {total_pages}")
with open(output_path, 'w', encoding='utf-8') as output_file:
print(f"Extracting text to: {output_path}")
for page in range(total_pages):
text = pdf_reader.getPage(page).extractText()
output_file.write(text)
print("Text extraction completed.")
return True
except FileNotFoundError:
print(f"Error: The file {pdf_path} was not found.")
return False
except PyPDF2.utils.PdfReadError:
print(f"Error: Unable to read the PDF file {pdf_path}")
return False
except Exception as e:
print(f"Error: An unexpected error occurred: {str(e)}")
return False
if __name__ == "__main__":
if len(sys.argv) != 3:
print("Usage: python script_name.py input_pdf_path output_txt_path")
else:
input_pdf_path = sys.argv[1]
output_txt_path = sys.argv[2]
extract_pdf_text(input_pdf_path, output_txt_path)
To use the script, save it as pdf_to_text.py
and then run it from the command line with the input and output file paths as command-line arguments:
python pdf_to_text.py input.pdf output.txt
The script reads the PDF file, checks if it’s encrypted, and then extracts the text from each page to a text file. It also includes error handling and progress messages.