Read PDF in Python

Samyak Jain Oct-12, 2021 Jun-19, 2021
  1. Use the PyPDF2 Module to Read a PDF in Python
  2. Use the pdfplumber Module to Read a PDF in Python
  3. Use the textract Module to Read a PDF in Python
  4. Use the pdfminer.six Module to Read a PDF in Python

A PDF document is not meant to be edited easily, but it can be shared easily and reliably. A PDF document can contain different elements such as text, links, images, tables, forms, and more.

In this tutorial, we will read a PDF file in Python.

Use the PyPDF2 Module to Read a PDF in Python

PyPDF2 is a Python module that we can use to extract a PDF document’s information, merge documents, split a document, crop pages, encrypt or decrypt a PDF file, and more.

We open the PDF document in binary read mode using open('document_path.PDF', 'rb'). PdfFileReader() creates a reader object for the document. We can extract text from a page using the getPage() and extractText() methods, and the .numPages attribute gives the number of pages in the document. These are the names used by PyPDF2 releases before 3.0; PyPDF2 3.0 replaced them with PdfReader, .pages, extract_text(), and len(reader.pages).

For example,

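from PyPDF2 import PdfFileReader

# These names apply to PyPDF2 releases before 3.0 (later: PdfReader, .pages, extract_text()).
# Open the PDF in binary read mode; the with block closes the file afterwards.
with open('document_path.PDF', 'rb') as temp:
    PDF_read = PdfFileReader(temp)
    first_page = PDF_read.getPage(0)  # pages are indexed from 0
    print(first_page.extractText())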

The above code will print the text on the first page of the provided PDF document.
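
The example above reads only the first page. A minimal sketch of reading every page with the .numPages attribute could look like the following. The helper names extract_all_text and read_pdf are hypothetical, and the calls assume a PyPDF2 release before 3.0, where PdfFileReader, getPage(), and extractText() are still available.

```python
def extract_all_text(reader):
    """Return the text of every page, separated by blank lines."""
    texts = []
    # numPages is the page count; getPage() takes a zero-based index.
    for index in range(reader.numPages):
        page = reader.getPage(index)
        texts.append(page.extractText())
    return "\n\n".join(texts)


def read_pdf(path):
    """Hypothetical wrapper: open the file and extract all of its text."""
    # Imported here so extract_all_text() can be tried without PyPDF2 installed.
    from PyPDF2 import PdfFileReader

    with open(path, "rb") as handle:
        return extract_all_text(PdfFileReader(handle))
```

Calling read_pdf('document_path.PDF') would then return the text of the whole document as one string.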

Use the pdfplumber Module to Read a PDF in Python

pdfplumber is a Python module that we can use to extract text and other content, such as tables, from a PDF document. It exposes more layout detail than the PyPDF2 module. Here we use its own pdfplumber.open() function to open the PDF file.

For example,

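import pdfplumber

# pdfplumber.open() works as a context manager and closes the file on exit.
with pdfplumber.open("document_path.PDF") as temp:
    first_page = temp.pages[0]  # pages is a list, indexed from 0
    print(first_page.extract_text())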

The above code will print the text from the first page of the provided PDF document.
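
To read the whole document rather than one page, we can loop over the pages list. The sketch below is one way to do that. The helper names join_page_texts and read_with_pdfplumber are hypothetical, and the check for missing text assumes that extract_text() may return None or an empty string for a page with no text layer, depending on the pdfplumber version.

```python
def join_page_texts(pages):
    """Concatenate the text of each page, skipping pages with no text."""
    texts = []
    for page in pages:
        text = page.extract_text()
        if text:  # skips both None and an empty string
            texts.append(text)
    return "\n".join(texts)


def read_with_pdfplumber(path):
    """Hypothetical wrapper: open the PDF and join the text of all pages."""
    # Imported here so join_page_texts() can be tried without pdfplumber installed.
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        return join_page_texts(pdf.pages)
```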

Use the textract Module to Read a PDF in Python

We can use the textract.process() function from the textract module to read a PDF document. It returns the extracted text as bytes, and the method argument selects the extraction backend.

For example,

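import textract
# process() returns the extracted text as bytes, so decode it before printing.
PDF_read = textract.process('document_path.PDF', method='pdfminer')
print(PDF_read.decode('utf-8'))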
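
Because textract.process() returns bytes rather than a string, the result usually needs decoding before further use. The following is a small sketch of that step. The helper names to_text and read_with_textract are hypothetical, and UTF-8 is assumed as the output encoding.

```python
def to_text(raw, encoding="utf-8"):
    """Decode the bytes returned by textract.process() into a str."""
    # errors="replace" substitutes U+FFFD for bytes that are invalid in the encoding.
    return raw.decode(encoding, errors="replace").strip()


def read_with_textract(path):
    """Hypothetical wrapper: extract a PDF with textract and return a str."""
    # Imported here so to_text() can be tried without textract installed.
    import textract

    return to_text(textract.process(path))
```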

Use the pdfminer.six Module to Read a PDF in Python

pdfminer.six is a Python module that we can use to read and extract text from a PDF document. We will use the extract_text() function from its pdfminer.high_level module, which returns the text of the whole document as a single string.

For example,

from pdfminer.high_level import extract_text
# extract_text() returns the text of the whole document as one string.
PDF_read = extract_text('document_path.PDF')
print(PDF_read)
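
Since extract_text() returns one string for the whole document, we may want to split it back into pages. The sketch below assumes that the returned text marks the end of each page with a form feed character ("\f"), which is worth checking on your own documents. The helper names split_pages and read_pages are hypothetical.

```python
def split_pages(text):
    """Split extracted text into per-page strings on form feed characters."""
    pages = text.split("\f")
    # A trailing form feed leaves an empty final chunk; drop it.
    if pages and pages[-1] == "":
        pages.pop()
    return pages


def read_pages(path):
    """Hypothetical wrapper: extract a PDF and return a list of page texts."""
    # Imported here so split_pages() can be tried without pdfminer.six installed.
    from pdfminer.high_level import extract_text

    return split_pages(extract_text(path))
```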
