Bob's Guides | Scraping, Parsing, and API Use V

In this series of articles, we're looking at a variety of techniques that involve scraping, parsing, and using an API to get information. The use case is a script to assemble a documentation page for the MODX documentation site by gathering information from several different sources and saving the finished HTML code for the page to a file.

In the previous article, we learned more about the simple_html_dom class used to parse raw HTML data, and we saw some helper functions used to deal with the results. In this one, we'll look at the getDataFromMODX() function that uses the code from the previous article. You can look back at the previous article to see the code of the helper functions and how to use the simple_html_dom class.

Getting The MODX Repository Page

I'm always impressed by the foresight of the MODX core development team. The repository code makes contributors fill out a form when submitting extras. The repository page for each extra is then created from the form data. That means that every repo extra page is in the same form with the same headings. The uniformity of the pages makes it much easier to extract data from the repo about any extra.

To review, we instantiated the simple_html_dom class object in our skeleton code back in article II of this series. Then, we passed that class object to our getDataFromMODX() function as the second argument. As a reminder, here's that code:

/* Instantiate simple_html_dom class object */
$html = new simple_html_dom();

/* ... */

/* Get data from MODX extras repository */
getDataFromMODX($phs, $html);

Here's the code of the getDataFromMODX() function:

/**
 * Get data from MODX Extras repository
 *
 * @param array $phs - placeholder array (passed by reference)
 * @param simple_html_dom $html - simple_html_dom object
 */
function getDataFromMODX(&$phs, $html) {
    global $totalDownloads;
    /** @var $html simple_html_dom */

    /* Get the package url and package name from the placeholder array and
       use them to create the URL for the extras page at the repository */

    $phs['package_url'] = 'https://modx.com/extras/package/' . strtolower($phs['package_name']);

    /* Get the raw page content with our curlGetData() function */
    $pageText = curlGetData($phs['package_url']);

    if (empty($pageText)) {

        /* If it's Subscribe, fix the URL */
        if (stripos($phs['package_url'], 'subscribe') !== false) {
            $phs['package_url'] = 'https://modx.com/extras/package/usersubscriptionsignupsystem';
            $pageText = curlGetData($phs['package_url']);
        }
        $phs['downloads'] = 0;
    }

    if (!empty($pageText)) {
        /* Load raw HTML into our simple_html_dom object */
        $html->load($pageText);

        $phs['description'] = getDescription($html, 'div[id=tab-description]');

        /* Update package_name if we can get it from modx.com/extras */
        /* See previous article for the getTitle() function */
        $title = getTitle($html, 'title');
        if (!empty($title)) {
            $phs['package_name'] = $title;
        }

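        $statsDiv = $html->find('div.stats', 0);
        $requiresDiv = $html->find('div.supports', 0);

        /* Either div may be missing; find() returns null in that case */
        $infoHtml = ($statsDiv ? $statsDiv->innertext : '')
            . ($requiresDiv ? $requiresDiv->innertext : '');

        /* See previous article for the toList() function */
        $phs['info'] = toList($infoHtml);

        /* Pull out the number of downloads (0 if the pattern isn't found) */
        $downloads = 0;
        if (preg_match('/Downloads\D*([\d,]+)/', $phs['info'], $matches)) {
            /* Remove any commas before converting to an integer */
            $downloads = (int) str_replace(',', '', $matches[1]);
        }

        /* Store the count formatted with thousands separators */
        $phs['downloads'] = number_format($downloads);
    }

    /* Increment the total downloads; strip the separators first because
       intval('1,234') would stop at the comma and return 1 */
    $totalDownloads += (int) str_replace(',', '', (string) $phs['downloads']);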
}

The code above gets the raw HTML from the repository page for the extra, parses it with help from the support functions (and a little magic from the simple_html_dom class), and places the results in the $phs array. Note that the $phs array is passed to the function by reference (&$phs), so any changes we make to it persist outside the function.
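
To make the by-reference point concrete, here is a minimal sketch comparing the two ways of passing an array. The function names setNameByValue() and setNameByReference() are made up for this illustration and are not part of the script in this series:

```php
<?php
/* Hypothetical helpers for illustration only; not part of the series code */

/* Receives a copy of the array: changes are lost when the function returns */
function setNameByValue(array $phs): void {
    $phs['package_name'] = 'Changed';
}

/* Receives a reference: changes persist in the caller's array */
function setNameByReference(array &$phs): void {
    $phs['package_name'] = 'Changed';
}

$phs = array('package_name' => 'Original');

setNameByValue($phs);
$afterByValue = $phs['package_name'];      /* still 'Original' */

setNameByReference($phs);
$afterByReference = $phs['package_name'];  /* now 'Changed' */
```

Without the ampersand, getDataFromMODX() would fill in a private copy of the array and the caller would never see the description, info, or download count.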

The comments in the code above explain what it's doing.
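
The download-count step is easier to follow in isolation. The sketch below uses a pattern equivalent to the one in the function, wrapped in a hypothetical helper called parseDownloadCount(); the sample text is invented and only stands in for whatever toList() actually returns. One detail worth keeping in mind: intval() stops at the first non-numeric character, so a formatted string such as '12,345' converts to 12 unless the commas are removed first.

```php
<?php
/* Hypothetical helper; the name parseDownloadCount is not from the series */
function parseDownloadCount(string $info): int {
    /* "Downloads", then any non-digits, then a run of digits and commas */
    if (!preg_match('/Downloads\D*([\d,]+)/', $info, $matches)) {
        return 0; /* pattern not found */
    }
    /* Strip the thousands separators before converting to an integer */
    return (int) str_replace(',', '', $matches[1]);
}

/* Invented sample text standing in for the output of toList() */
$info = '<li>Downloads: 12,345</li><li>License: GPLv2</li>';

$count = parseDownloadCount($info);   /* 12345 */
$display = number_format($count);     /* '12,345' */
$missing = parseDownloadCount('<li>License: GPLv2</li>'); /* 0 */

/* intval() stops at the comma, so a formatted string loses its tail */
$truncated = intval('12,345');        /* 12 */
```

Keeping the raw integer for arithmetic and formatting it only for display avoids that truncation when adding to a running total.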

The if statement near the top of the function was necessary to deal with a mistake at the repository for my Subscribe extra, which somehow got the URL https://modx.com/extras/package/usersubscriptionsignupsystem instead of https://modx.com/extras/package/subscribe. Without that mistake, the inner check for 'subscribe' could be removed. The outer if statement would still be needed to set the download count to 0 when the page can't be retrieved.
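
If more than one extra ended up with a mismatched URL, an alternative to retrying after a failed request would be a small lookup table consulted before the request is made. This is only a sketch of that idea, not what the script does: the $slugOverrides array and the buildPackageUrl() helper are hypothetical, 'MyExtra' is a made-up package name, and only the Subscribe entry comes from the situation described above.

```php
<?php
/* Hypothetical override table: lowercase package name => repository slug.
   Only the Subscribe entry comes from the article. */
$slugOverrides = array(
    'subscribe' => 'usersubscriptionsignupsystem',
);

/* Hypothetical helper; buildPackageUrl is not part of the series code */
function buildPackageUrl(string $packageName, array $overrides): string {
    $slug = strtolower($packageName);
    /* Swap in the known repository slug when the default one is wrong */
    if (isset($overrides[$slug])) {
        $slug = $overrides[$slug];
    }
    return 'https://modx.com/extras/package/' . $slug;
}

$subscribeUrl = buildPackageUrl('Subscribe', $slugOverrides);
$otherUrl = buildPackageUrl('MyExtra', $slugOverrides); /* made-up name */
```

The trade-off is that the table has to be maintained by hand, while the retry in the function above only costs one extra request for the single affected extra.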

See the previous article for the code of the helper functions called here and how to use the simple_html_dom class to extract data from raw HTML.

Coming Up

In the next article, we'll look at the implementation of another function from our skeleton code: getDataFromGitHub().

For information on how to use MODX to create a web site (and other topics), see my main web site, Bob's Guides, or better yet, buy my book: MODX: The Official Guide.

Looking for high-quality, MODX-friendly hosting? Since May 2016, Bob's Guides has been hosted at Hosting.com (formerly A2 Hosting). MODX will work fine at most hosting services, but having a MODX-friendly host can prevent a lot of frustration. Better yet, the Hosting.com Solid-State-Drive servers are configured to handle the many Ajax and database calls made by MODX, especially the MODX Manager. My Manager runs about four times as fast as it did on my previous host.

Copyright © 2011-2026
Bob Ray
