Extract Text from PDF in Python - PyPDF2 Module

In this simple tutorial, we will learn how to extract text from a given PDF in Python. The PDF can have multiple pages; we will extract the text from every page.

We will be using the PyPDF2 module for extracting text from PDF files.

Extract Text from PDF in Python

To install the PyPDF2 module, you can use the pip command. Run the pip command below to download and install the PyPDF2 module:

pip install PyPDF2

Once we have installed the PyPDF2 module, we can write the code to open the PDF file, read its text, and then print it on the console or write it to a separate text file.

Using the PyPDF2 module

To extract text from a PDF file we will use the PdfReader class (named PdfFileReader before PyPDF2 3.0, where the old name was removed). Its constructor takes a stream parameter, to which we pass the file stream of the PDF file.

Now let's see how we can use the PyPDF2 module to read PDF files:

from PyPDF2 import PdfReader  # PdfFileReader was removed in PyPDF2 3.0; PdfReader replaces it

# open the PDF file in binary read mode
pdfFile = open('mypdf.pdf', 'rb')

# create a PdfReader object to read the file
pdfReader = PdfReader(pdfFile)

# metadata replaces getDocumentInfo(); len(pdfReader.pages) replaces getNumPages()
print("Printing the document info: " + str(pdfReader.metadata))
print("- - - - - - - - - - - - - - - - - - - -")
print("Number of Pages: " + str(len(pdfReader.pages)))

# close the PDF file object
pdfFile.close()

In the code above, we first use the built-in open() function to open the file in binary read mode ('rb'), then use this file object to initialize the reader object.

Once the reader object is ready, we can use its metadata attribute (getDocumentInfo() before PyPDF2 3.0) to get the file information, and len() on its pages attribute (getNumPages() before PyPDF2 3.0) to get the total number of pages in the PDF file.

The pages attribute can also be indexed, starting from 0, to get a single page (getPage() before PyPDF2 3.0). Finally, the page's extract_text() method (extractText() before PyPDF2 3.0) extracts the text from that page.

from PyPDF2 import PdfReader  # PdfFileReader was removed in PyPDF2 3.0; PdfReader replaces it

# open the PDF file in binary read mode
pdfFile = open('mypdf.pdf', 'rb')

# create a PdfReader object to read the file
pdfReader = PdfReader(pdfFile)

# metadata replaces getDocumentInfo(); it can be None if the PDF stores no document info
info = pdfReader.metadata
print("PDF title: " + str(info.title if info else None))
print("PDF file created by: " + str(info.creator if info else None))
print("- - - - - - - - - - - - - - - - - - - -")

# len(pdfReader.pages) replaces getNumPages()
numOfPages = len(pdfReader.pages)

for i in range(0, numOfPages):
    print("Page Number: " + str(i))
    print("- - - - - - - - - - - - - - - - - - - -")
    # pdfReader.pages[i] replaces getPage(i)
    pageObj = pdfReader.pages[i]
    # extract_text() replaces extractText()
    print(pageObj.extract_text())
    print("- - - - - - - - - - - - - - - - - - - -")

# close the PDF file object
pdfFile.close()

In the code above, we print the title and the name of the creator of the PDF file mypdf.pdf (change this to your PDF file name and provide the full path to the file). Both are attributes of the document information object, which the reader exposes as metadata (getDocumentInfo() before PyPDF2 3.0).

Then we use a Python for loop to print the text of every page of the PDF. Once we are done, we call the close() method on the file object to release the file resource.
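Instead of printing to the console, the extracted text can also be written to a separate text file. The following is a minimal sketch, assuming PyPDF2 2.x or 3.x (where the reader class is PdfReader and each page has an extract_text() method). The helper names join_page_texts and pdf_to_text_file, the page header format, and the file names are illustrative assumptions, not part of PyPDF2.

```python
from pathlib import Path


def join_page_texts(pages):
    """Return the text of all pages, each preceded by a page header line."""
    parts = []
    for number, page in enumerate(pages, start=1):
        # extract_text() may give an empty result for pages without a text layer
        text = page.extract_text() or ""
        parts.append(f"--- Page {number} ---\n{text}")
    return "\n".join(parts)


def pdf_to_text_file(pdf_path, txt_path):
    """Extract the text of every page of pdf_path and write it to txt_path."""
    # Imported here so that join_page_texts can be used without PyPDF2 installed
    from PyPDF2 import PdfReader

    # The with block closes the PDF file even if extraction fails
    with open(pdf_path, "rb") as pdf_file:
        reader = PdfReader(pdf_file)
        text = join_page_texts(reader.pages)
    Path(txt_path).write_text(text, encoding="utf-8")


# Example call (the file names are placeholders):
# pdf_to_text_file("mypdf.pdf", "mypdf.txt")
```

Using a with block here is a design choice: it makes the explicit close() call unnecessary, at the cost of reading all pages before the block ends.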

Other Applications of PyPDF2 Module

The PyPDF2 module can be used to perform many operations on PDF files, such as:

Reading the text of a PDF file, which we just did above.
Rotating a page of a PDF file in multiples of 90 degrees.
Merging two or more PDF files at a defined page number.
Appending two or more PDF files, one after another.
Finding the meta information of a PDF file, such as the creator, author, and date of creation.
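Creating a new PDF file from the pages of existing PDF files. (Rendering the contents of a plain text file into a PDF needs a separate PDF-generation library, because PyPDF2 does not lay out text.)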
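Two of the operations listed above, appending files and rotating a page, can be sketched as follows. This is only a rough illustration, assuming PyPDF2 2.x or 3.x (PdfMerger, PdfReader, PdfWriter, and the page rotate() method); the function names append_pdfs and rotate_first_page are hypothetical helpers and the file names in the usage comment are placeholders.

```python
def append_pdfs(input_paths, output_path):
    """Append whole PDF files, one after another, into output_path."""
    from PyPDF2 import PdfMerger  # assumes PyPDF2 2.x or 3.x

    merger = PdfMerger()
    for path in input_paths:
        merger.append(path)  # adds all pages of this file at the end
    merger.write(output_path)
    merger.close()


def rotate_first_page(input_path, output_path, angle=90):
    """Copy input_path to output_path with its first page rotated clockwise."""
    from PyPDF2 import PdfReader, PdfWriter  # assumes PyPDF2 2.x or 3.x

    reader = PdfReader(input_path)
    writer = PdfWriter()
    for index, page in enumerate(reader.pages):
        if index == 0:
            page.rotate(angle)  # the angle must be a multiple of 90
        writer.add_page(page)
    with open(output_path, "wb") as out_file:
        writer.write(out_file)


# Example calls (the file names are placeholders):
# append_pdfs(["first.pdf", "second.pdf"], "combined.pdf")
# rotate_first_page("combined.pdf", "rotated.pdf", angle=90)
```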

Conclusion

In this tutorial, we covered how to extract text from a PDF file. This is useful if you are working on a project where you want to convert PDF files to text that can be stored in a database for data collection. Note that PyPDF2 reads only the text stored in the PDF, so scanned pages must first be given a text layer with an OCR tool.

Similarly, there can be many other use cases, such as reading text from candidate resumes for analysis, or reading text from invoices.

If you have a special use case, do share it with us in the comment section below. Also, if you face any issues while running the Python script, do share the error with us by posting in the comments and we will definitely help you.

Frequently Asked Questions (FAQs)

1. How can I extract text from a PDF file using PyPDF2 in Python?

PyPDF2 provides a simple and intuitive API to extract text from PDF files. You can open a PDF, iterate over its pages, and use the extract_text() method to retrieve the text content.

2. Does PyPDF2 handle scanned or image-based PDFs?

No, PyPDF2 is primarily designed for extracting text from text-based PDFs. It may not work well with scanned or image-based PDFs that lack textual content.
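One rough way to spot scanned or image-based pages is to check which pages yield no text at all. The sketch below assumes page objects with an extract_text() method, as in PyPDF2 2.x or 3.x; the function name pages_without_text is a hypothetical helper. An empty result only suggests that a page is image-based, it does not prove it.

```python
def pages_without_text(pages):
    """Return the 0-based indexes of pages whose extracted text is empty."""
    empty = []
    for index, page in enumerate(pages):
        text = page.extract_text() or ""
        # Treat whitespace-only output as "no text"
        if not text.strip():
            empty.append(index)
    return empty


# Example use with PyPDF2 (the file name is a placeholder):
# from PyPDF2 import PdfReader
# reader = PdfReader("mypdf.pdf")
# print(pages_without_text(reader.pages))
```

Pages reported by this check are candidates for an OCR step before text extraction.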

3. Can PyPDF2 preserve the original formatting and layout of the extracted text?

PyPDF2 focuses on extracting the textual content from PDF files rather than preserving the original formatting or layout. The extracted text is returned as a plain string.

4. Are there any limitations or considerations when using PyPDF2 for text extraction?

PyPDF2 relies on the structure and encoding of PDF files. If a PDF file has complex formatting, unusual encoding, or encrypted content, PyPDF2's text extraction may encounter limitations or difficulties.
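For the encrypted case, a reader usually has to be decrypted before its pages can be read. The following is a minimal sketch, assuming PyPDF2 2.x or 3.x, where the reader has an is_encrypted attribute and a decrypt() method that returns a falsy value when the password does not match. The helper names unlock and open_pdf are illustrative assumptions, and some encryption schemes may need extra dependencies.

```python
def unlock(reader, password=None):
    """Decrypt reader in place if needed; raise ValueError if that fails."""
    if not reader.is_encrypted:
        return reader
    if password is None:
        raise ValueError("PDF is encrypted and no password was given")
    # decrypt() returns 0 (falsy) when the password does not match
    if not reader.decrypt(password):
        raise ValueError("Could not decrypt the PDF with the given password")
    return reader


def open_pdf(pdf_path, password=None):
    """Open pdf_path with PyPDF2 and decrypt it when a password is needed."""
    from PyPDF2 import PdfReader  # assumes PyPDF2 2.x or 3.x

    return unlock(PdfReader(pdf_path), password)


# Example call (the file name and password are placeholders):
# reader = open_pdf("protected.pdf", password="secret")
# print(reader.pages[0].extract_text())
```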

5. Are there alternative libraries for extracting text from PDFs in Python?

Yes, there are alternative libraries like PDFMiner, PyMuPDF, and pdftotext that can be used for text extraction from PDFs in Python. These libraries offer different features and capabilities, so it's worth exploring them to find the best fit for your specific requirements.
