Written By:
iamabhishek

Extract Text from PDF in Python - PyPDF2 Module

Posted in Programming   LAST UPDATED: NOVEMBER 30, 2021

In this simple tutorial, we will learn how to extract text from a given PDF in Python. The PDF can have multiple pages; we will extract the text from every page.

We will be using the PyPDF2 module for extracting text from PDF files.

Extract Text from PDF in Python using PyPDF2 module

To install the PyPDF2 module, you can use the pip command. Run the command below to download and install it:

pip install PyPDF2

Once we have installed the PyPDF2 module, we can write the code for opening the PDF file, reading its text, and then printing it on the console or writing it to a separate text file.

Using the PyPDF2 module

To extract text from a PDF file, we will use the PdfFileReader class. Its constructor takes a stream parameter, to which we pass the file stream of the PDF file.

Now let's see how we can use the PyPDF2 module to read PDF files:

from PyPDF2 import PdfFileReader

# open the PDF file
pdfFile = open('mypdf.pdf', 'rb')

# create a PdfFileReader object to read the file
pdfReader = PdfFileReader(pdfFile)

print("Printing the document info: " + str(pdfReader.getDocumentInfo()))
print("- - - - - - - - - - - - - - - - - - - -")
print("Number of Pages: " + str(pdfReader.getNumPages()))

# close the PDF file object
pdfFile.close()

In the code above, we first call Python's open() function to open the file for reading in binary mode ('rb'), then use the resulting file object to initialize the PdfFileReader object.

Once we have the PdfFileReader object ready, we can use its methods like getDocumentInfo() to get the file information, or getNumPages() to get the total number of pages in the PDF file.

Next, the getPage() method returns a page of the PDF file for a given page index, which starts from 0, and the extractText() method of that page object extracts the text of the page.

from PyPDF2 import PdfFileReader

# open the PDF file
pdfFile = open('mypdf.pdf', 'rb')

# create a PdfFileReader object to read the file
pdfReader = PdfFileReader(pdfFile)

print("PDF File name: " + str(pdfReader.getDocumentInfo().title))
print("PDF File created by: " + str(pdfReader.getDocumentInfo().creator))
print("- - - - - - - - - - - - - - - - - - - -")

numOfPages = pdfReader.getNumPages()

# loop over every page; page indexes start from 0
for i in range(0, numOfPages):
    print("Page Number: " + str(i))
    print("- - - - - - - - - - - - - - - - - - - -")
    pageObj = pdfReader.getPage(i)
    print(pageObj.extractText())
    print("- - - - - - - - - - - - - - - - - - - -")

# close the PDF file object
pdfFile.close()

In the code above, we print the title and the creator of the PDF file mypdf.pdf (change this to your own PDF file name, or provide the full path to the file). Both are attributes of the object returned by the getDocumentInfo() method.
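One thing to watch for: getDocumentInfo() returns None when a PDF carries no document information, and individual fields such as title can be None as well, so reading .title directly can fail on some files. Below is a minimal sketch of a defensive helper. The name describe_info and the placeholder strings are our own choices for illustration, not part of PyPDF2, and the SimpleNamespace object only stands in for the real return value so the example runs without a PDF.

```python
from types import SimpleNamespace


def describe_info(info):
    """Build the two info lines, tolerating a missing info object or fields.

    `info` is expected to be whatever getDocumentInfo() returned, which
    may be None. (describe_info is a hypothetical helper, not a PyPDF2 call.)
    """
    title = getattr(info, "title", None) or "(no title)"
    creator = getattr(info, "creator", None) or "(no creator)"
    return [
        "PDF File name: " + str(title),
        "PDF File created by: " + str(creator),
    ]


# Stand-in for a real document info object, used only for demonstration.
sample_info = SimpleNamespace(title="Annual Report", creator="Writer")

for line in describe_info(sample_info):
    print(line)

# A PDF without document information: the helper falls back to placeholders.
for line in describe_info(None):
    print(line)
```

In the tutorial script, you would pass pdfReader.getDocumentInfo() to the helper instead of the stand-in object.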

Then we use a Python for loop to print the text of every page of the PDF. Once we are done, we call the close() method on the file object to release the file resource.
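The tutorial mentions writing the text to a separate text file but only prints it, so here is a rough sketch of that variant. It makes two assumptions. First, it uses the newer PyPDF2 names PdfReader, reader.pages, and extract_text(), as found in PyPDF2 3.0, where the camelCase calls used above are no longer available; on an older install, you would keep PdfFileReader, getPage(), and extractText(). Second, the function names join_pages and pdf_to_text_file and the 1-based page headers are our own choices for illustration.

```python
def join_pages(page_texts):
    """Combine per-page text into one string with a header before each page.

    Pages are numbered from 1 here, unlike the 0-based index used above.
    """
    parts = []
    for number, text in enumerate(page_texts, start=1):
        parts.append("Page Number: " + str(number))
        parts.append(text or "")  # treat a page with no text as empty
    return "\n".join(parts) + "\n" if parts else ""


def pdf_to_text_file(pdf_path, txt_path):
    """Extract the text of every page of pdf_path and save it to txt_path."""
    # Imported here so join_pages() can be used without PyPDF2 installed.
    # Assumption: a PyPDF2 release that provides PdfReader (e.g. 3.0).
    from PyPDF2 import PdfReader

    # The with statement closes the file even if reading raises an error,
    # so no explicit close() call is needed.
    with open(pdf_path, "rb") as pdf_file:
        reader = PdfReader(pdf_file)
        page_texts = [page.extract_text() for page in reader.pages]

    with open(txt_path, "w", encoding="utf-8") as txt_file:
        txt_file.write(join_pages(page_texts))

    return len(page_texts)


# Example call (the file names are placeholders):
# pdf_to_text_file("mypdf.pdf", "mypdf.txt")
```

Keeping the formatting in join_pages() separate from the PyPDF2 calls makes it easy to change the output layout without touching the PDF reading code.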

Other Applications of PyPDF2 Module

The PyPDF2 module can be used to perform many operations on PDF files, such as:

  1. Reading the text of a PDF file, which we just did above.

  2. Rotating a PDF file page in steps of 90 degrees.

  3. Merging two or more PDF files at a defined page number.

  4. Appending two or more PDF files, one after another.

  5. Reading the meta information of a PDF file, such as creator, author, and date of creation.

  6. Creating a new PDF file using the text coming from some text file.
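As an illustration of the rotating and appending items in the list above, here is a minimal sketch. It assumes the PyPDF2 3.0 API (PdfReader, PdfWriter, page.rotate(), and writer.add_page()); older releases use different names for these calls. The function names normalize_angle and append_pdfs are hypothetical helpers written for this example. Rotation is limited to multiples of 90 degrees, which is what page.rotate() accepts.

```python
def normalize_angle(angle):
    """Validate a clockwise rotation and reduce it to the range 0..359."""
    if angle % 90 != 0:
        raise ValueError("rotation must be a multiple of 90 degrees")
    return angle % 360


def append_pdfs(pdf_paths, out_path, angle=0):
    """Write all pages of pdf_paths, in order, to out_path.

    Every page is rotated clockwise by `angle` degrees first (0 = unchanged).
    """
    # Imported here so normalize_angle() works without PyPDF2 installed.
    # Assumption: PyPDF2 3.0, where these names exist.
    from PyPDF2 import PdfReader, PdfWriter

    angle = normalize_angle(angle)
    writer = PdfWriter()

    for path in pdf_paths:
        reader = PdfReader(path)
        for page in reader.pages:
            if angle:
                page.rotate(angle)
            writer.add_page(page)

    with open(out_path, "wb") as out_file:
        writer.write(out_file)


# Example call (the file names are placeholders):
# append_pdfs(["first.pdf", "second.pdf"], "combined.pdf", angle=90)
```

Validating the angle before opening any file means a bad argument fails early, before an output file is created.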

Conclusion:

In this tutorial, we covered how we can extract text from a PDF file. This is useful if you are working on a project where you want to convert PDF documents to text that can be stored in a database for data collection. Keep in mind that PyPDF2 reads the text stored inside the PDF; a scanned file that contains only page images has no such text and needs to go through OCR first.

Similarly, there can be many other use cases, such as reading text from candidate resumes for analysis, or reading text from invoices.

If you have a special use case, do share it with us in the comment section below. Also, if you face any issue while running the Python script, share the error with us in the comments and we will help you.
