In this tutorial, we will learn how to extract text from a PDF file in Python. The PDF can have multiple pages; we will extract the text from every page.
We will be using the PyPDF2 module for extracting text from PDF files.
To install the PyPDF2 module, run the following pip command:
pip install PyPDF2
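The class and method names used in this tutorial (PdfFileReader, getNumPages(), and so on) come from PyPDF2 releases before 3.0; from 3.0 onward they were removed in favor of new names such as PdfReader. Since a plain pip install may fetch a newer release, it can help to check which version is installed. The sketch below uses only the standard library; installed_version is a hypothetical helper name chosen for illustration, not part of PyPDF2.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(packageName):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(packageName)
    except PackageNotFoundError:
        return None


# report whether PyPDF2 is available before running the examples
pypdf2Version = installed_version("PyPDF2")
if pypdf2Version is None:
    print("PyPDF2 is not installed; run: pip install PyPDF2")
else:
    print("PyPDF2 version: " + pypdf2Version)
```

If this reports 3.0 or later, pinning an older release (for example, pip install "PyPDF2<3.0") should let the code in this tutorial run with the names shown here.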
Once the PyPDF2 module is installed, we can write the code to open the PDF file, read its text, and either print it on the console or write it to a separate text file.
To extract text from a PDF file, we will use the PdfFileReader class. Its constructor takes a stream parameter, to which we pass the file stream of the PDF file.
Now let's see how we can use the PyPDF2 module to read PDF files:
from PyPDF2 import PdfFileReader
# open the PDF file
pdfFile = open('mypdf.pdf', 'rb')
# create PdfFileReader object to read the file
pdfReader = PdfFileReader(pdfFile)
print("Printing the document info: " + str(pdfReader.getDocumentInfo()))
print("- - - - - - - - - - - - - - - - - - - -")
print("Number of Pages: " + str(pdfReader.getNumPages()))
# close the PDF file object
pdfFile.close()
In the code above, we first use the open() function to open the file for reading in binary mode ('rb'), then use this file object to initialize the PdfFileReader object.
Once we have the PdfFileReader object ready, we can use its methods, such as getDocumentInfo() to get the file information or getNumPages() to get the total number of pages in the PDF file.
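As a minimal sketch, the same steps can be wrapped in a function that uses a with block, so the file is closed automatically even if reading fails. The function name summarize_pdf and the shape of the returned dictionary are assumptions made for illustration; the PyPDF2 calls are the same ones described above. The sketch also assumes that getDocumentInfo() can return None for a PDF that carries no document information.

```python
def summarize_pdf(pdfPath):
    """Return the page count and document info of a PDF as plain Python values."""
    # imported inside the function so this sketch can be loaded without PyPDF2
    from PyPDF2 import PdfFileReader

    # the with block closes the file automatically, even if reading raises an error
    with open(pdfPath, 'rb') as pdfFile:
        pdfReader = PdfFileReader(pdfFile)
        info = pdfReader.getDocumentInfo()
        # copy the values while the file is still open; info may be None
        details = {} if info is None else {key: str(info[key]) for key in info}
        return {"pages": pdfReader.getNumPages(), "info": details}
```

A call such as summarize_pdf('mypdf.pdf') would then return the number of pages together with the info entries, whose keys typically look like '/Title' or '/Creator'.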
Then we have the getPage() method, which returns a page of the PDF file by its page index (starting from 0), and finally the extractText() method, which extracts the text from that page.
from PyPDF2 import PdfFileReader
# open the PDF file
pdfFile = open('mypdf.pdf', 'rb')
# create PdfFileReader object to read the file
pdfReader = PdfFileReader(pdfFile)
print("PDF File name: " + str(pdfReader.getDocumentInfo().title))
print("PDF File created by: " + str(pdfReader.getDocumentInfo().creator))
print("- - - - - - - - - - - - - - - - - - - -")
numOfPages = pdfReader.getNumPages()
for i in range(0, numOfPages):
    print("Page Number: " + str(i))
    print("- - - - - - - - - - - - - - - - - - - -")
    pageObj = pdfReader.getPage(i)
    print(pageObj.extractText())
    print("- - - - - - - - - - - - - - - - - - - -")
# close the PDF file object
pdfFile.close()
In the code above, we print the title and the name of the creator of the PDF file mypdf.pdf (change this to your own PDF file name and provide the full path to the file). Both are attributes of the object returned by the getDocumentInfo() method.
Then we use a Python for loop to print the text of all the pages of the PDF. Once we are done, we call the close() method on the file object to release the file resource.
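The script above prints the text on the console. As a rough sketch of the other option, writing the text to a separate text file, the loop can write to an output file instead. The function name pdf_to_text_file, its parameters, and the UTF-8 encoding of the output are assumptions made for illustration; the PyPDF2 calls are the same ones used above.

```python
def pdf_to_text_file(pdfPath, textPath):
    """Extract the text of every page of pdfPath and save it to textPath."""
    # imported inside the function so this sketch can be loaded without PyPDF2
    from PyPDF2 import PdfFileReader

    # both files are closed automatically when the with block ends
    with open(pdfPath, 'rb') as pdfFile, open(textPath, 'w', encoding='utf-8') as textFile:
        pdfReader = PdfFileReader(pdfFile)
        numOfPages = pdfReader.getNumPages()
        for i in range(numOfPages):
            # write a small header followed by the text of page i
            textFile.write("Page Number: " + str(i) + "\n")
            textFile.write(pdfReader.getPage(i).extractText() + "\n")
    # report how many pages were written
    return numOfPages
```

Calling pdf_to_text_file('mypdf.pdf', 'mypdf.txt') would then create mypdf.txt next to the script, with one header line per page followed by that page's text.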
The PyPDF2 module can be used to perform many operations on PDF files, such as:
Reading the text of the PDF file, which we just did above
Rotating a PDF file page by any defined angle
Merging two or more PDF files at a defined page number
Appending two or more PDF files, one after another
Finding the meta information of any PDF file, such as the creator, author, date of creation, etc.
We can even create a new PDF file using the text coming from some text file.
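To illustrate one of the operations listed above, here is a minimal sketch of appending PDF files one after another with the PdfFileMerger class, which belongs to the same pre-3.0 PyPDF2 API as PdfFileReader. The function name append_pdfs and the file names in the usage note are hypothetical.

```python
def append_pdfs(pdfPaths, outPath):
    """Write all PDFs in pdfPaths, one after another, into a single file at outPath."""
    # imported inside the function so this sketch can be loaded without PyPDF2
    from PyPDF2 import PdfFileMerger

    merger = PdfFileMerger()
    try:
        for pdfPath in pdfPaths:
            # add every page of this file after the pages collected so far
            merger.append(pdfPath)
        with open(outPath, 'wb') as outFile:
            merger.write(outFile)
    finally:
        # release the input files held by the merger
        merger.close()
```

For example, append_pdfs(['first.pdf', 'second.pdf'], 'combined.pdf') would produce combined.pdf with the pages of both files in order. To insert a file at a defined page number instead of at the end, the merger's merge(position, fileobj) method can be used in place of append().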
In this tutorial, we covered how to extract text from a PDF file. This is useful if you are working on a project where you want to convert PDF documents to text that can be stored in a database for data collection. Keep in mind that extractText() reads the text stored in the PDF, so a scanned document that contains only page images needs an OCR step before any text can be extracted.
Similarly, there can be many other use cases, such as reading the text of candidate resumes for analysis or reading the text of invoices.
If you have a special use case, do share it with us in the comment section below. Also, if you face any issues while running the Python script, share the error with us in the comments and we will help you.