Java OCR

Sheeraz Gul Oct 12, 2023
  1. Java OCR
  2. How to Use OCR in Java
Java OCR

This tutorial demonstrates the implementation of OCR in Java.

Java OCR

The OCR or Tesseract OCR is an optical character reading engine developed in 1985 by HP laboratories, and since 2006 it has been developed by Google. The tesseract OCR runs on Unicode UTF-8 support and can detect more than 100 languages, which is why it is used to create language scanning software.

The newest Tesseract OCR is tesseract version 4, which includes the OCR-based neural net system LSTM, used for line recognition. The tesseract OCR provides functionalities to perform image processing with AI and machine learning in Java.

OCR Applications

The applications of OCR can be defined in the following points:

  1. Preprocess the image’s data, for example, filtering, de-skew, and converting to greyscale.
  2. It can also detect words, lines, and characters.
  3. Recognize characters in the post process, choosing the best characters based on their confidence from the language data, and this language data also includes grammar, dictionary, etc.
  4. It can also generate a ranking list based on training data set.

How to Use OCR in Java

Follow the steps below to use OCR in Java:

  1. First, download the Tess4j API.
  2. Extract the downloaded file.
  3. For OCR, we have to create a new project in our IDE.
  4. Once the new project is created, import the jar file in your build path.
  5. You can find the jar file in the .\Tess4J-3.4.8-src\Tess4J\dist path.

If the above method doesn’t work, extract the downloaded file somewhere, go to Eclipse, select the Open Project from File System option, and select the path of the extracted folder.

This will open the tess4j OCR project directly in Eclipse, and then we will add classes to create our applications.

Let’s try a simple example of an image-to-text conversion using the tesseract OCR in Java.

package tess4jtest;

import java.io.File;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;
public class test {
  public static void main(String[] args) {
    // Instance of tesseract
    Tesseract DemoTesseract = new Tesseract();
    try {
      // path to the tessdata in the extracted folder
      DemoTesseract.setDatapath(
          "C:\\Users\\Sheeraz\\OneDrive\\Desktop\\New folder (3)\\Tess4J\\tessdata");
      // InputPath
      String OutputText = DemoTesseract.doOCR(
          new File("C:\\Users\\Sheeraz\\OneDrive\\Desktop\\New folder (3)\\Tess4J\\sample.png"));

      System.out.print(OutputText);
    } catch (TesseractException ex) {
      ex.printStackTrace();
    }
  }
}

The code above will read the text from the following image:

sample image from the code

Let’s see the output of the above code.

Demmck is a resource for everyone interested in programming. embedded software, and electronics. It
covers the programming languages like Python‘ CIC++‘ ct and so on in mi: website’sfirstdevelopment
smge. Openrsource hardware alsofalls in the website's scope, like Arduino‘ Raspberry PL and BeagleBone.
Demmckaims to provide tutorials, howito's, and cheat sheets to different levels of developers and
hobbyists

As we can see, some output change from the image because the OCR tried to read the image. Now let’s try the same example on the following handwritten example:

handwritten sample from the code

Change this line of code:

String OutputText = DemoTesseract.doOCR(
    new File("C:\\Users\\Sheeraz\\OneDrive\\Desktop\\New folder (3)\\Tess4J\\sample.png"));

to

String OutputText = DemoTesseract.doOCR(
    new File("C:\\Users\\Sheeraz\\OneDrive\\Desktop\\New folder (3)\\Tess4J\\sample1.png"));

Now the output for the above code for the handwritten image is below.

This Is a handwr.' ++en
Exampte
NrHC as good as X"  "Cam.

The above images are considered clear what if there is an uncleared image? To perform OCR on an uncleared image, we need to process the image first; let’s try an example.

package tess4jtest;

import java.awt.Graphics2D;
import java.awt.Image;
import java.awt.image.*;
import java.io.*;
import javax.imageio.ImageIO;
import net.sourceforge.tess4j.*;

public class test {
  public static void ImageProcess(BufferedImage Input_Image, float scaleFactor, float offset)
      throws IOException, TesseractException {
    // First of all, create an empty image buffer
    BufferedImage Output_Image = new BufferedImage(1050, 1024, Input_Image.getType());

    // Now create a 2D platform
    Graphics2D DemoGraphic = Output_Image.createGraphics();

    // draw a new image
    DemoGraphic.drawImage(Input_Image, 0, 0, 1050, 1024, null);
    DemoGraphic.dispose();

    // rescale the OP object for the grey images
    RescaleOp RescaleImage = new RescaleOp(scaleFactor, offset, null);

    // perform scaling
    BufferedImage Buffered_FOP_Image = RescaleImage.filter(Output_Image, null);
    ImageIO.write(Buffered_FOP_Image, "jpg",
        new File(
            "C:\\Users\\Sheeraz\\OneDrive\\Desktop\\New folder (3)\\Tess4J\\output\\output.png"));

    Tesseract DemoTesseract = new Tesseract();

    DemoTesseract.setDatapath(
        "C:\\Users\\Sheeraz\\OneDrive\\Desktop\\New folder (3)\\Tess4J\\tessdata");

    String OutputStr = DemoTesseract.doOCR(Buffered_FOP_Image);
    System.out.println(OutputStr);
  }

  public static void main(String args[]) throws Exception {
    File InputFile =
        new File("C:\\Users\\Sheeraz\\OneDrive\\Desktop\\New folder (3)\\Tess4J\\sample2.png");

    BufferedImage Input_Image = ImageIO.read(InputFile);

    double Image_Double =
        Input_Image.getRGB(Input_Image.getTileWidth() / 2, Input_Image.getTileHeight() / 2);

    // compare the values
    if (Image_Double >= -1.4211511E7 && Image_Double < -7254228) {
      ImageProcess(Input_Image, 3f, -10f);
    } else if (Image_Double >= -7254228 && Image_Double < -2171170) {
      ImageProcess(Input_Image, 1.455f, -47f);
    } else if (Image_Double >= -2171170 && Image_Double < -1907998) {
      ImageProcess(Input_Image, 1.35f, -10f);
    } else if (Image_Double >= -1907998 && Image_Double < -257) {
      ImageProcess(Input_Image, 1.19f, 0.5f);
    } else if (Image_Double >= -257 && Image_Double < -1) {
      ImageProcess(Input_Image, 1f, 0.5f);
    } else if (Image_Double >= -1 && Image_Double < 2) {
      ImageProcess(Input_Image, 1f, 0.35f);
    }
  }
}

The code above will first process a noisy image and then try to read the image text. The image used for this code is below.

Sample of an unclear image

See the output:

I liVEI'» 1W,
Every :10 I 0
£0 Workyéy git;
A/Jo T would
M: £0 visié MW

Advantages and Disadvantages of OCR

Here are the advantages of using OCR in Java:

  1. It helps to increase work efficiency in offices and other places.
  2. OCR ensures the content is intact, saving time.
  3. The tesseract OCR can instantly search through the given content, which is immensely useful.
  4. It saves the manual labor of workers.

The disadvantages of OCR are:

  1. The OCR is only limited to language recognition.
  2. This OCR doesn’t provide 100% accuracy of the content.
  3. This OCR requires a lot of effort to create trainer data.
  4. The performance of OCR is based on the image so we may need to extra process the image for better results.
Author: Sheeraz Gul
Sheeraz Gul avatar Sheeraz Gul avatar

Sheeraz is a Doctorate fellow in Computer Science at Northwestern Polytechnical University, Xian, China. He has 7 years of Software Development experience in AI, Web, Database, and Desktop technologies. He writes tutorials in Java, PHP, Python, GoLang, R, etc., to help beginners learn the field of Computer Science.

LinkedIn Facebook