How to Parse HTML in Java

MD Aminul Islam Feb 02, 2024
  1. Working of Jsoup in Java
  2. Use Jsoup to Parse HTML in Java

If your program works with HTML files, you need an efficient way to parse them. In Java, a popular choice is Jsoup, an open-source library for parsing, querying, and manipulating HTML.

This article shows how to parse HTML with Jsoup, with an example and explanations along the way.

Working of Jsoup in Java

Jsoup works by parsing the HTML of a web page and converting it into a Document object. You can think of this object as a programmatic representation of the DOM.

The static parse method of the Jsoup class creates the Document. Some of its overloads are listed below:

  1. parse(File file, @Nullable String charsetName) - parses an HTML file.
  2. parse(InputStream in, @Nullable String charsetName, String baseUri) - reads the InputStream and parses it; baseUri is used to resolve relative links.
  3. parse(String html) - parses an HTML string.
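
To make the idea of a Document as a DOM representation more concrete, here is a minimal sketch. The class name DocumentTreeSketch, its helper methods, and the sample markup are made up for illustration; only the Jsoup calls (Jsoup.parse, title, body, children, tagName) come from the library.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, used only for this illustration.
class DocumentTreeSketch {
  // Returns the text of the <title> element.
  static String titleOf(String html) {
    Document doc = Jsoup.parse(html);
    return doc.title();
  }

  // Returns the tag names of the direct children of <body>, in document order.
  static List<String> bodyChildTags(String html) {
    Document doc = Jsoup.parse(html);
    List<String> tags = new ArrayList<>();
    for (Element child : doc.body().children()) {
      tags.add(child.tagName());
    }
    return tags;
  }

  public static void main(String[] args) {
    // Sample markup invented for the example.
    String html = "<html><head><title>Demo page</title></head>"
        + "<body><h1>Heading</h1><p>First paragraph</p></body></html>";
    System.out.println(titleOf(html));       // prints "Demo page"
    System.out.println(bodyChildTags(html)); // prints "[h1, p]"
  }
}
```

Once the markup is parsed, the code walks the tree through method calls instead of searching the raw text, which is the main benefit of working with a Document.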
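
The three overloads listed above can be compared side by side in a short sketch. The class ParseOverloadsSketch, the sample markup, and the temporary file are assumptions made for the example; the stream variant reads from an in-memory byte array so the code runs without network access.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical helper class, used only for this illustration.
class ParseOverloadsSketch {
  // Sample markup with a relative link, invented for the example.
  static final String HTML = "<html><head><title>Overloads</title></head>"
      + "<body><a href=\"/docs\">Docs</a></body></html>";

  // parse(String html): the markup is already in memory.
  static String titleFromString() {
    return Jsoup.parse(HTML).title();
  }

  // parse(File file, String charsetName): the markup lives on disk.
  static String titleFromFile() throws IOException {
    Path tempFile = Files.createTempFile("jsoup-sketch", ".html");
    try {
      Files.write(tempFile, HTML.getBytes(StandardCharsets.UTF_8));
      return Jsoup.parse(tempFile.toFile(), "UTF-8").title();
    } finally {
      Files.deleteIfExists(tempFile);
    }
  }

  // parse(InputStream in, String charsetName, String baseUri):
  // the base URI lets Jsoup turn the relative href into an absolute URL.
  static String absoluteLinkFromStream(String baseUri) throws IOException {
    byte[] bytes = HTML.getBytes(StandardCharsets.UTF_8);
    try (InputStream in = new ByteArrayInputStream(bytes)) {
      Document doc = Jsoup.parse(in, "UTF-8", baseUri);
      return doc.select("a").first().absUrl("href");
    }
  }

  public static void main(String[] args) throws IOException {
    System.out.println(titleFromString());
    System.out.println(titleFromFile());
    System.out.println(absoluteLinkFromStream("https://www.example.com/"));
  }
}
```

The stream variant is the one that takes a base URI, which is why it suits content fetched from a web server, where links in the page are often relative.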

Use Jsoup to Parse HTML in Java

Our example below fetches a web page and parses it using Jsoup. The Java code for our example is as follows:

package javaparsehtml;

// Importing the necessary classes
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JavaParseHtml {
  public static void main(String[] args) {
    String address = "https://www.example.com";
    try {
      // Providing the URL of the website
      URL myUrl = new URL(address);
      // Create an HTTP connection
      HttpURLConnection myConnection = (HttpURLConnection) myUrl.openConnection();
      // Ask the server for HTML, since that is what Jsoup will parse
      myConnection.setRequestProperty("Accept", "text/html");

      // Open the response stream; try-with-resources closes it afterwards
      try (InputStream responseStream = myConnection.getInputStream()) {
        // Parsing the website
        Document myDoc = Jsoup.parse(responseStream, "UTF-8", address);
        // Showing the output as HTML
        System.out.println(myDoc.html());
      } finally {
        // Release the connection once we are done with it
        myConnection.disconnect();
      }
    } catch (IOException e) {
      // MalformedURLException is a subclass of IOException, so a bad URL ends up here too
      e.printStackTrace();
    }
  }
}

The example above illustrates how to parse an HTML page, and the comments explain the purpose of each step.

In the example, we created an HTTP connection based on the provided URL and then set a request property (the Accept header). After that, we opened an InputStream on the response and parsed the website.

Lastly, we print the parsed document as HTML. After executing the above Java program, you will get output like the following:

<!doctype html>
<html>
 <head>
  <title>Example Domain</title>
  <meta charset="utf-8">
  <meta http-equiv="Content-type" content="text/html; charset=utf-8">
  <meta name="viewport" content="width=device-width, initial-scale=1">
  <style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
        
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 2em;
        background-color: #fdfdff;
        border-radius: 0.5em;
        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        div {
            margin: 0 auto;
            width: auto;
        }
    }
    </style>
 </head>
 <body>
  <div>
   <h1>Example Domain</h1>
   <p>This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.</p>
   <p><a href="https://www.iana.org/domains/example">More information...</a></p>
  </div>
 </body>
</html>
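
Printing the whole page is rarely the end goal. As a rough sketch of a next step, the code below pulls individual pieces out of a parsed Document with CSS selectors. The class SelectSketch and its helper methods are hypothetical, and the markup is a trimmed copy of the output above, parsed from a string so the example runs offline.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, used only for this illustration.
class SelectSketch {
  // Trimmed copy of the page shown in the output above.
  static final String HTML = "<!doctype html><html><head><title>Example Domain</title></head>"
      + "<body><div><h1>Example Domain</h1>"
      + "<p><a href=\"https://www.iana.org/domains/example\">More information...</a></p>"
      + "</div></body></html>";

  // Returns the combined text of all <h1> elements.
  static String headingText(Document doc) {
    return doc.select("h1").text();
  }

  // Returns the href of the first link, or an empty string if the page has none.
  static String firstLinkHref(Document doc) {
    Element link = doc.selectFirst("a[href]");
    return link == null ? "" : link.attr("href");
  }

  public static void main(String[] args) {
    Document doc = Jsoup.parse(HTML);
    System.out.println(headingText(doc));   // prints "Example Domain"
    System.out.println(firstLinkHref(doc)); // prints "https://www.iana.org/domains/example"
  }
}
```

The same two helpers would work on a Document produced from a live connection, since they only depend on the parsed tree.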

An important note: if the Jsoup jar file is not on your project's classpath, you first need to add it to your project or declare Jsoup as a dependency in your build tool. Otherwise, the program will not compile.
