Pdfminer Package in Python

Manav Narula Apr-14, 2022 Jan-02, 2021

PDF (Portable Document Format) is one of the most widely used document formats.

Python can read and work with many types of files, and several packages are available for handling PDF files.

pdfminer is one such package. It provides functionality for working with PDF files and reading text data from them.

We will discuss some basics of this package below.

Installing the pdfminer Package in Python

The original pdfminer package is no longer actively maintained and was written primarily for Python 2. For Python 3, we can use the maintained fork of this package, called pdfminer.six.

We can install it using the following pip command from the command prompt.

pip install pdfminer.six
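
To confirm that the installation worked, a quick check like the one below can help. This is only a minimal sketch: it assumes the fork is importable under the module name pdfminer and is registered with pip under the distribution name pdfminer.six.

```python
import importlib.util
from importlib import metadata

# pdfminer.six installs its code under the module name "pdfminer".
is_installed = importlib.util.find_spec("pdfminer") is not None

if is_installed:
    try:
        # Look up the version that pip recorded for the pdfminer.six distribution.
        print("pdfminer.six version:", metadata.version("pdfminer.six"))
    except metadata.PackageNotFoundError:
        # The module may come from another distribution, such as the original pdfminer.
        print("A pdfminer module is present, but not from pdfminer.six")
else:
    print("pdfminer is not installed; run: pip install pdfminer.six")
```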

Using the pdfminer Package in Python

To extract text from a PDF saved on the device, we can use the extract_text() function from the pdfminer.high_level module. We pass the path of the file to the function, and it returns the extracted text as a string.

See the following example.

from pdfminer.high_level import extract_text
s = extract_text('sample.pdf')
print(s)

Output:

Sample PDF from device
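
In practice the file may be missing, or only the first few pages may be needed. The sketch below wraps extract_text() in a hypothetical helper, read_pdf_text(), which is not part of pdfminer. It relies on the maxpages argument of extract_text() to cap the number of pages processed, and the sample.pdf path is only a placeholder.

```python
from pathlib import Path

try:
    from pdfminer.high_level import extract_text
except ImportError:  # pdfminer.six is not installed
    extract_text = None


def read_pdf_text(path, max_pages=0):
    """Return the text of a PDF, or None if the file or pdfminer is missing.

    read_pdf_text is a hypothetical helper, not part of pdfminer.
    max_pages=0 means "no limit", matching the default of maxpages.
    """
    pdf_path = Path(path)
    if not pdf_path.is_file():
        print(f"File not found: {pdf_path}")
        return None
    if extract_text is None:
        print("pdfminer.six is not installed")
        return None
    # maxpages caps how many pages are processed.
    return extract_text(str(pdf_path), maxpages=max_pages)


text = read_pdf_text("sample.pdf", max_pages=2)
if text is not None:
    print(text)
```

Returning None for a missing file keeps the calling code simple. A damaged or encrypted PDF can still raise an exception from pdfminer itself, which this sketch does not catch.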

We can use the same function in different ways.

Instead of a path, we can pass a file object. We open the PDF file using the open() function and hand the resulting file object to extract_text(). For this, the file must be opened in binary read mode (rb), because a PDF is not a plain-text file.

For example,

from pdfminer.high_level import extract_text
with open('sample.pdf', 'rb') as f:
    s = extract_text(f) 
print(s)

Output:

Sample PDF from device

We can also read a file from the web and extract its content using this function.

First, we download the file by passing its URL to the requests.get() function. The raw bytes of the response are available through its content attribute.

We then wrap these bytes in an in-memory file object using the io.BytesIO() class and extract the text using the extract_text() function.

Check the syntax below.

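import io
import requests
from pdfminer.high_level import extract_text

url = 'https://example.com/sample.pdf'  # placeholder URL (an assumption); replace with a real PDF link
r = requests.get(url)
# Wrap the downloaded bytes in a file-like object for extract_text()
s = extract_text(io.BytesIO(r.content))
print(s)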
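
A download can fail, or return an HTML error page instead of a PDF, so it may be worth checking the bytes before parsing them. The sketch below is one possible approach, not something pdfminer provides: looks_like_pdf() and text_from_pdf_bytes() are hypothetical helpers, and the check simply tests for the %PDF- marker that normally opens a PDF file.

```python
import io

try:
    from pdfminer.high_level import extract_text
except ImportError:  # pdfminer.six is not installed
    extract_text = None


def looks_like_pdf(data):
    """Return True if the bytes start with the usual PDF header marker.

    This is a strict check; files with leading junk bytes would be rejected.
    """
    return data.startswith(b"%PDF-")


def text_from_pdf_bytes(data):
    """Extract text from in-memory PDF bytes (hypothetical helper)."""
    if not looks_like_pdf(data):
        raise ValueError("Content does not look like a PDF")
    if extract_text is None:
        raise RuntimeError("pdfminer.six is not installed")
    # io.BytesIO turns the raw bytes into a binary file-like object.
    return extract_text(io.BytesIO(data))


print(looks_like_pdf(b"%PDF-1.7 ..."))            # True
print(looks_like_pdf(b"<html>Not found</html>"))  # False
```

With the requests example, this could be called as text_from_pdf_bytes(r.content), ideally after r.raise_for_status() so that HTTP errors surface before any parsing is attempted.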

The pdfminer package was widely used up to Python 2.7 but then lost popularity due to compatibility issues with Python 3.

Since then, other packages for working with PDF files in Python have emerged. PyPDF2 is one such alternative.

Author: Manav Narula