Scraping, Parsing, and API Use IV

Using the simple_html_dom class to extract data from a string of raw HTML code.


In this series of articles, we're looking at a variety of techniques that involve scraping, parsing, and using an API to get information. The use case is a script to assemble a documentation page for the MODX documentation site by gathering information from several different sources and saving the finished HTML code for the page to a file.

In the previous article, we saw the curlGetData() function. In this one, we'll look at the simple_html_dom class we'll use to parse the raw HTML data from each MODX repository extras page.


A Class for Parsing HTML

In the previous article, we saw how to get the raw HTML code of a web page using cURL. Now, we need a way to extract data from it. You've probably read that using regular expressions to pull data from raw HTML is difficult and error-prone. It can work for very simple cases, but when things get more complicated, it tends to blow up in your face. Luckily, there's an excellent alternative: the Simple HTML DOM class (which we'll refer to by the actual class name, simple_html_dom).

A full description of the simple_html_dom class is beyond the scope of this article, but we'll look at some basic techniques for using it.


Getting Started

You can actually call the functions of the class as functions, as long as they're included above the calls, but it's safer to call them as class methods. For example, one of the functions we'll use is called find(). Since there could be other functions with that name in memory, we'll use an object-oriented approach and instantiate the class. Here are some code selections from our script that use the class:

/* Load the class file */
include 'C:\xampp\htdocs\addons\assets\mycomponents\_extras-rtfm\simple_html_dom.php';

/* Instantiate simple_html_dom class object */
$html = new simple_html_dom();

/* Get data from MODX extras repository */
getDataFromMODX($phs, $html);

The $html variable now holds an instance of the simple_html_dom class, not HTML code, and the object doesn't hold any data yet. To give it some, we call our getDataFromMODX() function, passing $html as the second argument. That function uses the curlGetData() function we saw in the last article to pull a page of HTML code from the MODX extras repository. The raw HTML has to be "loaded" into the object inside the function before it can be parsed.

Here's a very abbreviated version of the code from the getDataFromMODX() function. We'll see the full function in the next article.

function getDataFromMODX(&$phs, $html) {
    /** @var $html simple_html_dom */
    /* Get the package name from the placeholder array and use it
       to create the URL for the extras page at the repository */

    $phs['package_url'] = 'https://modx.com/extras/package/' . strtolower($phs['package_name']);

    /* Get the raw page content with our curlGetData function */
    $pageText = curlGetData($phs['package_url']);

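    /* Load (and parse) the raw HTML into the simple_html_dom object */
    $html->load($pageText);

    /* ... rest of the function omitted here ... */
}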

The code above gets the raw HTML from the MODX repository page for an extra and "loads" it into our $html object, which is an instance of the simple_html_dom class. The load() method of the class not only loads the raw HTML into the class, it also parses it so that we can easily extract the parts of it we want.


Pulling the Data in a Tag

Once the parsed code is present in the class object, you can use various methods to extract specific parts of it. Many of the methods take a jQuery-like selector as an argument and return an object (*not* a string of HTML). For example, on the MODX repo page, the name of the package is in the title tag like this:

<title>NewsPublisher 3.0.4-pl</title>

Because we don't want the version number, we have to use a little regex action to pull out just the package name:

function getTitle($html, $tagName) {
    /** @var $html simple_html_dom */

    /* Get the title object */
    $tag = $html->getElementByTagName($tagName);

    /* Extract the text between the opening
       and closing title tags */
    $title = $tag->innertext;

    /* Pull out just the extra name */
    preg_match('/^(.*)\s\d/', $title, $matches);
    return isset($matches[1]) ? $matches[1] : '';
}

The $tagName argument, 'title', is sent in the call to this function. We use the simple_html_dom getElementByTagName() method to get the title, but remember that what comes back is a PHP object, so we need to extract the title itself by grabbing the object's innertext property. The regular expression in the next line says we want any string of text followed by a whitespace character followed by a digit. The parentheses around the first part say that we want to "capture" just the part before the space. That will be in $matches[1] if the match is successful. If it is, we return just the package name. If not, we return an empty string.

(We use innertext here because we want the title, but not the enclosing tags. If we wanted them, we'd use outertext.)
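To see how the pattern in getTitle() behaves on its own, here is a minimal sketch that applies it to plain strings. The extractPackageName() helper and the second and third sample titles are made up for illustration and are not part of the original script. The sketch assumes PHP 7 or later because of the type declarations.

```php
<?php
/* Hypothetical helper: the same pattern used in getTitle(),
   applied to a plain string instead of a simple_html_dom object */
function extractPackageName(string $title): string {
    /* (.*) is greedy, so the capture runs up to the LAST
       whitespace character that is followed by a digit */
    return preg_match('/^(.*)\s\d/', $title, $matches) === 1 ? $matches[1] : '';
}

$samples = array(
    'NewsPublisher 3.0.4-pl',  /* title from the article */
    'My Extra 2 1.0.0-pl',     /* made-up title with a digit in the name */
    'PlainTitle',              /* made-up title with no version */
);

foreach ($samples as $sample) {
    echo '[' . extractPackageName($sample) . "]\n";
}
```

Because the capture group is greedy, a name that itself contains a space followed by a digit is still kept whole, and a title with no version number produces an empty string rather than an error.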


Pulling the Data by ID

Not all the data we want is so conveniently held in a single tag. For example, we also want the description of the extra. That's in a block of text that looks like this:

<div id="tab-description">
    <p>Description of extra</p>
    <p>Description continued</p>
    <p>Etc.</p>
</div>

In this case, we want to pull out all the paragraph sections in the div and concatenate them into a single variable, with some exceptions. We do that like this:

function getDescription($html, $divSelector) {
    /** @var $html simple_html_dom */

    $descDiv = $html->find($divSelector, 0);
    $children = $descDiv->children;

    $output = '';

    /* Concatenate just p tags with some exceptions */
    foreach ($children as $child) {
        if ($child->tag == 'p') {
            $inner = $child->innertext;
            if (!empty($inner)) {
                if ((stripos($inner, 'Install with') === 0) || (stripos($inner, 'See the') === 0) || (stripos($inner, 'Install in') === 0)) {
                    continue;
                }
                $output .= "\n\n" . $child->outertext;
            }
        }
    }

    return $output;
}

First, we get the description div with the find() method, using the selector we passed in the $divSelector argument. Passing 0 as the second argument tells find() that we want the first member of the resulting array (which in this case has only one member). Next, we get all its children with $children = $descDiv->children. Then, we walk through the children, and if the child's tag is p, we add the child's outertext (the paragraph including its p tags) to the $output variable. Each paragraph is prefixed with two newlines because this is going in a file that will be pasted into the MODX documentation page for the extra, which uses Markup rather than HTML.

We exclude any paragraphs that start with "Install with", "See the", or "Install in" because those are not really part of the description.

Finally, we return the concatenated string containing the full description.
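The exclusion test in getDescription() can be tried out without any HTML at all. The sketch below moves the three stripos() checks into a helper; the startsWithExcludedPhrase() name and the sample paragraph texts are assumptions made for illustration, not code from the original script.

```php
<?php
/* Hypothetical helper (not in the original script): true when a paragraph's
   text begins with one of the phrases we want to leave out */
function startsWithExcludedPhrase(string $inner): bool {
    $excluded = array('Install with', 'See the', 'Install in');
    foreach ($excluded as $phrase) {
        /* stripos() returns 0 for a match at the very start and false for
           no match, so the strict === 0 comparison matters here */
        if (stripos($inner, $phrase) === 0) {
            return true;
        }
    }
    return false;
}

/* Made-up paragraph texts for illustration */
$paragraphs = array(
    'Description of extra',
    'install with Package Manager.',
    'Please see the documentation for details.',
);

foreach ($paragraphs as $text) {
    echo (startsWithExcludedPhrase($text) ? 'skip: ' : 'keep: ') . $text . "\n";
}
```

Only the second paragraph is skipped. The match is case-insensitive, and a phrase that appears later in the paragraph (as in the third sample) does not count, because the comparison is against position 0.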


Getting Elements by CSS Class Name

Two of the sections we want to use don't have an ID, so we need to use their class names. Luckily, the class names are only used once on each page so we don't have to search through them. From one section we want the download statistics, and from the other we want the requirements and the list of supported databases. That HTML code looks like this:

<div class="stats">
    Downloads: 9,764<br>
    License: GPLv2<br>
</div>

<div class="supports"> Requires Revolution 2.2.x or greater
      <br />
      Supports mysql,sqlsrv </div>

To get those sections, and pull out the data we want, we use this code:

$statsDiv = $html->find('div.stats', 0);
$requiresDiv = $html->find('div.supports', 0);

$phs['info'] = toList($statsDiv->innertext . $requiresDiv->innertext);
$pattern = '/Downloads\D*([\d\,]+)/';
preg_match($pattern, $phs['info'], $matches);
$phs['downloads'] = number_format((int) str_replace(',', '', $matches[1]));

First, we get the two divs using the find() method, passing it the selector in each case. We need to parse and format the data from the two divs, so we pass the concatenated innertext of both to our toList() function (see below) and put the result in the info member of the $phs (placeholders) array. Then, we pull out just the number of downloads from the result and put it in the downloads member of the $phs array.

The regular expression pattern searches for "Downloads", followed by any run of characters that aren't digits, followed by a string of digits that may contain commas. The parentheses capture just the number.
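The snippet above assumes the pattern always matches. As a hedged sketch, the same extraction can be wrapped in a helper that reports a missing number instead of reading an unset $matches[1]. The extractDownloads() name is hypothetical, the sample strings are made up, and the nullable return type assumes PHP 7.1 or later.

```php
<?php
/* Hypothetical helper: returns the download count as an integer,
   or null when the text contains no "Downloads" figure */
function extractDownloads(string $text): ?int {
    if (preg_match('/Downloads\D*([\d,]+)/', $text, $matches) !== 1) {
        return null;
    }
    /* Strip the thousands separators before casting to an integer */
    return (int) str_replace(',', '', $matches[1]);
}

/* Made-up input in the same shape as the list built by toList() */
$info = '<li>Downloads: 9,764</li><li>License: GPLv2</li>';

$count = extractDownloads($info);
echo ($count === null ? 'unknown' : number_format($count)) . "\n";
```

Returning null keeps "no number found" distinct from a genuine count of zero, so the caller can decide what to put in the placeholder.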


The toList() Function

This function parses the data in our two divs and reformats it into an unordered list.

function toList($text) {
    $text = str_replace('<br />', '<br>', $text);
    $textArray = explode('<br>', $text);

    foreach ($textArray as $key => $line) {
        $textArray[$key] = "\n    <li>" . trim($line) . '</li>';
    }
    $result = "\n<ul>" . implode('', $textArray) . "\n</ul>";
    $result = str_replace('Requires ', 'Requires: ', $result);
    $result = str_replace('Supports ', 'Supports: ', $result);
    return $result;
}

First, we "normalize" the br tags so they are all the same. Then we use them as a delimiter to split the text into a PHP array with each element representing a line. Next, we add li tags to each line and trim any extra white space. Then we "implode" the array into a single string surrounded by ul tags. The two str_replace() lines simply add a colon after "Requires" and "Supports".

The result looks like this:

<ul>
    <li>Downloads: 9,764</li>
    <li>License: GPLv2</li>
    <li>Requires: Revolution 2.2.x or greater</li>
    <li>Supports: mysql,sqlsrv</li>
</ul>
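
The splitting step in toList() handles the two br spellings that appear on the repository page. If you need to cope with other spellings, or with a trailing br that would produce an empty list item, a slightly more tolerant variation might look like the sketch below. This is not the code used in our script, and the splitOnBreaks() name is made up for illustration.

```php
<?php
/* Variation on the first half of toList(), not the article's code:
   split on any form of the br tag and drop lines that end up empty */
function splitOnBreaks(string $text): array {
    /* Matches <br>, <br/>, <br /> and uppercase versions of each */
    $parts = preg_split('/<br\s*\/?>/i', $text);
    $parts = array_map('trim', $parts);

    /* array_filter() keeps the original keys, so reindex afterward */
    return array_values(array_filter($parts, function ($line) {
        return $line !== '';
    }));
}

$text = " Downloads: 9,764<br>\n License: GPLv2<br />\n Supports mysql,sqlsrv <BR/>";
print_r(splitOnBreaks($text));
```

The resulting array can be wrapped in li tags and imploded exactly as in toList().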

Wrapping Up

The simple_html_dom class can do a lot more things, including changing the content of the HTML, but in this article we've seen some of the basics you can use to extract various kinds of data from an HTML page.


Coming Up

Now that we've seen some of the helper functions used to grab and massage the HTML data, in the following article we'll look at the full code of the getDataFromMODX() function.




