How to Extract Images From PDF Files Using Python

Lakshay Kapoor Feb 02, 2024
  1. Install the PyMuPDF Library in Python
  2. Extract Images From a PDF File in Python

Python can automate many operations on external files. One of them is extracting images from PDF files, which is especially useful when a PDF is too long to go through by hand.

This guide shows you how to extract images from PDF files in Python.

Install the PyMuPDF Library in Python

To perform this operation, install the PyMuPDF library, which is imported in Python under the name fitz. It handles files in PDF, XPS, FB2, OpenXPS, and EPUB formats and is known for its high performance and rendering quality. It does not come pre-installed with Python. The following command installs it together with Pillow, which provides the PIL module imported later in this guide.

pip install PyMuPDF Pillow
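
The pip package names differ from the names used in import statements (PyMuPDF is imported as fitz, Pillow as PIL), which can be confusing after installation. As a quick sanity check, the sketch below uses only the standard library to report whether both modules can be found. The helper name is_installed is made up for this illustration.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module with this name can be located."""
    return importlib.util.find_spec(module_name) is not None


# "fitz" is the import name of PyMuPDF, "PIL" the import name of Pillow.
for name in ("fitz", "PIL"):
    status = "available" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```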

Extract Images From a PDF File in Python

Now, to extract images from a PDF file, there is a stepwise procedure:

  • First, all the necessary libraries are imported.
import fitz
import io
from PIL import Image
  • Then, the path to the file from which the images have to be extracted is defined. The file is opened using the open() function from the fitz module.
file_path = "randomfile.pdf"
open_file = fitz.open(file_path)
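  • After that, the program iterates over every page of the PDF file and checks whether the page contains any images.
for page_number in range(len(open_file)):
    page = open_file[page_number]
    # Each entry describes one image on the page; entry[0] is its xref
    list_image = page.get_images()

    if list_image:
        print(f"{len(list_image)} images found on page {page_number}")
    else:
        print("No images found on page", page_number)

In this step, the get_images() method returns a list of tuples, one per image found on the page. Each tuple describes an image rather than holding its data, and its first item is the xref number that identifies the image inside the PDF. Older PyMuPDF releases named this method getImageList(). Note that page_number starts at 0, so the first page is reported as page 0.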
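
To make the shape of those tuples more concrete, here is a small sketch that picks out the leading fields of one entry in the image list. It assumes the field order given in the PyMuPDF documentation (xref, mask xref, width, height, followed by further fields), which is worth checking against the version you have installed. The helper summarize_entry and the sample tuple are hand-written for illustration and do not come from a real PDF.

```python
def summarize_entry(entry):
    """Pick out a few leading fields from one image-list tuple."""
    # Assumed order of the first four fields: xref, smask, width, height.
    xref, smask, width, height = entry[:4]
    return {"xref": xref, "has_mask": smask > 0, "size": (width, height)}


# Hand-written sample for illustration only, not taken from a real PDF.
sample_entry = (12, 0, 640, 480, 8, "DeviceRGB", "", "Im0", "DCTDecode")
print(summarize_entry(sample_entry))
# prints {'xref': 12, 'has_mask': False, 'size': (640, 480)}
```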

  • Then, the extract_image() method (extractImage() in older PyMuPDF releases) returns the image data together with extra information such as the image size and the file extension. This step runs as a loop nested inside the page loop from the previous step.
for image_number, img in enumerate(page.get_images(), start=1):
    xref = img[0]  # reference number of the image inside the PDF

    image_base = open_file.extract_image(xref)
    bytes_image = image_base["image"]  # raw image bytes

    ext_image = image_base["ext"]  # file extension, such as "png"

After combining all these steps into one single program, you can easily extract all the images from a PDF file.
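
One possible way to combine the steps is sketched below. It is not the only arrangement: it assumes a recent PyMuPDF release with the get_images() and extract_image() method names, and it writes the extracted bytes straight to disk instead of going through Pillow. The function names extract_images and image_filename and the output naming scheme are choices made for this example.

```python
from pathlib import Path


def image_filename(page_index, image_number, ext):
    """Build a name such as 'page5_img1.png' (pages shown 1-based)."""
    return f"page{page_index + 1}_img{image_number}.{ext}"


def extract_images(pdf_path, out_dir):
    """Save every embedded image of pdf_path into out_dir; return the paths."""
    # PyMuPDF is imported here so image_filename() stays usable without it.
    import fitz

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []

    with fitz.open(pdf_path) as pdf:
        for page_index in range(len(pdf)):
            page = pdf[page_index]
            for image_number, img in enumerate(page.get_images(), start=1):
                xref = img[0]  # reference number of the image in the PDF
                base_image = pdf.extract_image(xref)
                if not base_image:
                    continue  # the xref did not resolve to an image
                name = image_filename(page_index, image_number, base_image["ext"])
                target = out_dir / name
                target.write_bytes(base_image["image"])
                saved.append(target)
    return saved
```

Calling extract_images("randomfile.pdf", "extracted") would then write one file per image into an extracted folder. An image that appears on several pages is saved once per page in this sketch. Pillow only becomes necessary if the images need converting or resizing, in which case Image.open(io.BytesIO(...)) can load the extracted bytes.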

Now, suppose randomfile.pdf has 5 pages and only the last page contains an image. Because page numbers start at 0, the last page is reported as page 4, and the output will look like this.

No images found on page 0
No images found on page 1
No images found on page 2
No images found on page 3
1 images found on page 4

Lakshay Kapoor is a final year B.Tech Computer Science student at Amity University Noida. He is familiar with programming languages and their real-world applications (Python/R/C++). Deeply interested in the area of Data Sciences and Machine Learning.
