How to Parse HTML in PHP

Olorunfemi Akinlua Feb 02, 2024
  1. Use DomDocument() to Parse HTML in PHP
  2. Use simplehtmldom to Parse HTML in PHP
  3. Use DiDOM to Parse HTML in PHP
Parsing HTML turns raw markup into a structure we can query, which makes it easier to analyze a page or build dynamic HTML. In more detail, a parser reads the raw HTML, builds a DOM tree of its elements, from headings to paragraphs, and lets us extract the information we need.

In PHP, we parse HTML with built-in classes or with third-party libraries, typically for web scraping or content analysis. Whatever the method, the goal is to load the HTML document into an object from which we can read individual elements and their text.

This article discusses the built-in DOMDocument class and two third-party libraries, simplehtmldom and DiDOM.

Use DomDocument() to Parse HTML in PHP

Whether the source is a local HTML file or an online webpage, the built-in DOMDocument and DOMXPath classes can parse it and expose its elements, whose text we can store as strings or arrays or, as in our example, print directly.

Let’s parse the following HTML file, saved as index.html, and print its headings, sub-headings, and paragraphs.
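Pages fetched from the web are often less tidy than a hand-written file, and loadHTML() can emit libxml warnings for markup it does not recognise (older libxml builds typically complain about HTML5 elements such as section). A minimal sketch of collecting those messages instead of printing them, assuming a short inline HTML string as input:

```php
<?php

// Assumption: this inline string stands in for markup fetched from a real page.
$html = '<section><h2>Title</h2><p>Body text</p></section>';

// Collect libxml messages internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// Read the collected messages, then restore the previous error handling.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

$title = $dom->getElementsByTagName('h2')->item(0)->textContent;

echo $title, "\n";
echo count($errors), " parser message(s) collected\n";
```

The number of collected messages depends on the libxml version installed, so treat it as diagnostic information rather than a fixed value.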

<!DOCTYPE html>
<html lang="en">
    <head>
        <meta charset="UTF-8" />
        <meta http-equiv="X-UA-Compatible" content="IE=edge" />
        <meta name="viewport" content="width=device-width, initial-scale=1.0" />
        <title>Document</title>
    </head>
    <body>
        <h2 class="main">Welcome to the Abode of PHP</h2>
        <p class="special">
            PHP has been the saving grace of the internet from its inception, it
            runs over 70% of website on the internet
        </p>
        <h3>Understanding PHP</h3>
        <p>
            Lorem ipsum dolor, sit amet consectetur adipisicing elit. Eum minus
            eos cupiditate earum et optio culpa, eligendi facilis laborum
            dolore.
        </p>
        <h3>Using PHP</h3>
        <p>
            Lorem ipsum dolor, sit amet consectetur adipisicing elit. Eum minus
            eos cupiditate earum et optio culpa, eligendi facilis laborum
            dolore.
        </p>
        <h3>Install PHP</h3>
        <p>
            Lorem ipsum dolor, sit amet consectetur adipisicing elit. Eum minus
            eos cupiditate earum et optio culpa, eligendi facilis laborum
            dolore.
        </p>
        <h3>Configure PHP</h3>
        <p>
            Lorem ipsum dolor, sit amet consectetur adipisicing elit. Eum minus
            eos cupiditate earum et optio culpa, eligendi facilis laborum
            dolore.
        </p>

        <h2 class="main">Welcome to the Abode of JS</h2>
        <p class="special">
            PHP has been the saving grace of the internet from its inception, it
            runs over 70% of website on the internet
        </p>
        <h3>Understanding JS</h3>
        <p>
            Lorem ipsum dolor, sit amet consectetur adipisicing elit. Eum minus
            eos cupiditate earum et optio culpa, eligendi facilis laborum
            dolore.
        </p>
    </body>
</html>

PHP code:

<?php

$html = 'index.html';

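// Print the text content of every <$element> tag found in the file at $path.
function getRootElement($element, $path)
{
    $dom = new DOMDocument();

    // Set parser options before loading; setting them afterwards has no effect.
    $dom->preserveWhiteSpace = false;

    // Read the file and build the DOM tree from its contents.
    $dom->loadHTML(file_get_contents($path));

    // getElementsByTagName() returns a DOMNodeList of the matching elements.
    $content = $dom->getElementsByTagName($element);

    foreach ($content as $each) {
        echo $each->nodeValue;
        echo "\n";
    }
}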

echo "The H2 contents are:\n";
getRootElement("h2", $html);
echo "\n";

echo "The H3 contents are:\n";
getRootElement("h3", $html);
echo "\n";

echo "The Paragraph contents include\n";
getRootElement("p", $html);
echo "\n";

The output of the code snippet is:

The H2 contents are:
Welcome to the Abode of PHP
Welcome to the Abode of JS

The H3 contents are:
Understanding PHP
Using PHP
Install PHP
Configure PHP
Understanding JS

The Paragraph contents include

PHP has been the saving grace of the internet from its inception, it
runs over 70% of website on the internet

...
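The DOMXPath class mentioned at the start of this section is not used in the example above. A minimal sketch of how it could select elements by class and collect their text into an array, assuming a short inline HTML string in place of index.html:

```php
<?php

// Assumption: a trimmed-down inline document stands in for index.html.
$html = <<<MARKUP
<!DOCTYPE html>
<html>
<body>
    <h2 class="main">Welcome to the Abode of PHP</h2>
    <p class="special">A short paragraph.</p>
    <h3>Understanding PHP</h3>
    <h2 class="main">Welcome to the Abode of JS</h2>
</body>
</html>
MARKUP;

$dom = new DOMDocument();
$dom->loadHTML($html);

// DOMXPath runs XPath queries against an already loaded DOMDocument.
$xpath = new DOMXPath($dom);

// Select every <h2> whose class attribute is exactly "main".
$nodes = $xpath->query('//h2[@class="main"]');

$headings = [];
foreach ($nodes as $node) {
    // trim() drops any whitespace surrounding the text.
    $headings[] = trim($node->textContent);
}

print_r($headings);
```

Note that the expression compares the whole class attribute, so an element with class="main featured" would not match this particular query.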

Use simplehtmldom to Parse HTML in PHP

For additional functionality such as CSS-style selectors, you can use a third-party library called Simple HTML DOM Parser, a simple and fast PHP parser. You can download it and include or require its single PHP file.

With it, you can easily reach the elements you want. The str_get_html() function parses an HTML string into a DOM object, and that object's find() method looks up a specific HTML element or tag.

To keep only elements with a particular class, we compare each found element's class property with the class we want. To get the text itself, we read the element's innertext property and store it in an array.

Using the same HTML file as in the previous section, let’s parse it with simplehtmldom.

<?php

require_once('simple_html_dom.php');

function getByClass($element, $class)
{
    $content = [];

    $file = 'index.html';

    $html_string = file_get_contents($file);

    // Build a simplehtmldom object from the HTML string.
    $html = str_get_html($html_string);

    // Keep the inner text of every matching tag whose class equals $class.
    foreach ($html->find($element) as $node) {
        if ($node->class === $class) {
            $content[] = $node->innertext;
        }
    }

    print_r($content);
}

getByClass("h2", "main");
getByClass("p", "special");

The output of the code snippet is:

Array
(
    [0] => Welcome to the Abode of PHP
    [1] => Welcome to the Abode of JS
)
Array
(
    [0] =>               PHP has been the saving grace of the internet from its inception, it              runs over 70% of website on the internet
    [1] =>               PHP has been the saving grace of the internet from its inception, it              runs over 70% of website on the internet
)
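One caveat of comparing the class attribute with ===: it only matches when the attribute holds exactly one class. An element written as class="main featured" would be skipped. A minimal sketch of a more tolerant check, using a hypothetical helper named hasClass() that is plain PHP and not part of simplehtmldom:

```php
<?php

// Hypothetical helper (not part of simplehtmldom): returns true when $class
// is one of the whitespace-separated tokens in a class attribute value.
function hasClass($attribute, $class)
{
    // A missing attribute may not arrive as a string at all.
    if (!is_string($attribute) || trim($attribute) === '') {
        return false;
    }

    $tokens = preg_split('/\s+/', trim($attribute));

    return in_array($class, $tokens, true);
}

var_dump(hasClass('main', 'main'));          // single class
var_dump(hasClass('main featured', 'main')); // one of several classes
var_dump(hasClass('mainstream', 'main'));    // substring only, not a token
var_dump(hasClass(false, 'main'));           // attribute missing
```

Under that assumption, the condition inside the loop could call hasClass() with the element's class value instead of comparing it with ===.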

Use DiDOM to Parse HTML in PHP

This third-party library is installed with Composer, a PHP dependency manager that handles all of a project's libraries and dependencies. DiDOM is hosted on GitHub and is designed to be fast and memory-efficient.

If you don’t have Composer yet, install it from getcomposer.org first. Once it is available, the following command adds the DiDOM library to your project.

composer require imangazaliev/didom

After that, you can use the code below, which is structured much like the simplehtmldom example and also relies on a find() method. The text() method returns an element's text content as a string we can use in our code.

The has() method checks whether the document contains an element matching a given tag or class selector and returns a Boolean value.

<?php

use DiDom\Document;

require_once('vendor/autoload.php');

$html = 'index.html';

// The second argument tells DiDOM that $html is a file path, not an HTML string.
$document = new Document($html, true);

echo "H3 Element\n";

if ($document->has('h3')) {
    $elements = $document->find('h3');
    foreach ($elements as $element) {
        echo $element->text();
        echo "\n";
    }
}

echo "\nElement with the Class 'main'\n";

if ($document->has('.main')) {
    $elements = $document->find('.main');
    foreach ($elements as $element) {
        echo $element->text();
        echo "\n";
    }
}

The output of the code snippet is:

H3 Element
Understanding PHP
Using PHP
Install PHP
Configure PHP
Understanding JS

Element with the Class 'main'
Welcome to the Abode of PHP
Welcome to the Abode of JS