Scraping, Parsing, and API Use V

Extracting data from MODX repository pages


In this series of articles, we're looking at a variety of techniques that involve scraping, parsing, and using an API to get information. The use case is a script to assemble a documentation page for the MODX documentation site by gathering information from several different sources and saving the finished HTML code for the page to a file.

In the previous article, we learned more about the simple_html_dom class used to parse raw HTML data, and we saw some helper functions used to deal with the results. In this one, we'll look at the getDataFromMODX() function that uses the code from the previous article. You can look back at the previous article to see the code of the helper functions and how to use the simple_html_dom class.


Getting The MODX Repository Page

I'm always impressed by the foresight of the MODX core development team. The repository code makes contributors fill out a form when submitting extras. The repository page for each extra is then created from the form data. That means that every repo extra page follows the same layout, with the same headings. The uniformity of the pages makes it much easier to extract data from the repo about any extra.

To review, we instantiated the simple_html_dom class object in our skeleton code back in article II of this series. Then, we passed that class object to our getDataFromMODX() function as the second argument. As a reminder, here's that code:

/* Instantiate simple_html_dom class object */
$html = new simple_html_dom();

/* ... */

/* Get data from MODX extras repository */
getDataFromMODX($phs, $html);


Here's the code of the getDataFromMODX() function:

/**
 * Get data from MODX Extras repository
 *
 * @param array $phs - placeholder array (passed by reference)
 * @param simple_html_dom $html - simple_html_dom object
 */
function getDataFromMODX(&$phs, $html) {
    global $totalDownloads;
    /** @var $html simple_html_dom */

    /* Build the URL for the extra's page at the repository from the
       package name stored in the placeholder array */

    $phs['package_url'] = 'https://modx.com/extras/package/' . strtolower($phs['package_name']);

    /* Get the raw page content with our curlGetData() function */
    $pageText = curlGetData($phs['package_url']);

    if (empty($pageText)) {

        /* If it's Subscribe, fix the URL */
        if (stripos($phs['package_url'], 'subscribe') !== false) {
            $phs['package_url'] = 'https://modx.com/extras/package/usersubscriptionsignupsystem';
            $pageText = curlGetData($phs['package_url']);
        }
        $phs['downloads'] = 0;
    }

    if (!empty($pageText)) {
        /* Load raw HTML into our simple_html_dom object */
        $html->load($pageText);

        $phs['description'] = getDescription($html, 'div[id=tab-description]');

        /* Update package_name if we can get it from modx.com/extras */
        /* See previous article for the getTitle() function */
        $title = getTitle($html, 'title');
        if (!empty($title)) {
            $phs['package_name'] = $title;
        }

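        $statsDiv = $html->find('div.stats', 0);
        $requiresDiv = $html->find('div.supports', 0);

        /* find() returns null when nothing matches, so guard each lookup */
        $statsText = $statsDiv ? $statsDiv->innertext : '';
        $requiresText = $requiresDiv ? $requiresDiv->innertext : '';

        /* See previous article for the toList() function */
        $phs['info'] = toList($statsText . $requiresText);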

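        /* Pull out the number of downloads */
        $count = 0;
        if (preg_match('/Downloads\D*([\d,]+)/', $phs['info'], $matches)) {
            /* Remove any commas to get a plain integer */
            $count = (int) str_replace(',', '', $matches[1]);
        }

        /* Store the number formatted with thousands separators */
        $phs['downloads'] = number_format($count);
    }

    /* Increment the total downloads; strip the commas first because
       intval('12,345') stops at the comma and returns 12 */
    $totalDownloads += (int) str_replace(',', '', (string) $phs['downloads']);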
}

The code above gets the raw HTML from the repository page for the extra, parses it with help from the support functions and a little magic from the simple_html_dom class, and places the results in the $phs array. Note that the $phs array is passed to the function by reference (&$phs), so any changes we make to it persist outside the function.
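To make the effect of the & concrete, here is a minimal sketch comparing a function that receives the array by value with one that receives it by reference. Both helper functions are hypothetical and exist only for this illustration; they are not part of the script in this series.

```php
<?php
/* Hypothetical helpers, for illustration only */

/* Receives a copy of the array: the change is lost on return */
function setDownloadsByValue(array $phs): void {
    $phs['downloads'] = '1,234';
}

/* Receives a reference: the change is visible to the caller */
function setDownloadsByReference(array &$phs): void {
    $phs['downloads'] = '1,234';
}

$phs = array('package_name' => 'Subscribe');

/* The caller's array is untouched after the by-value call */
setDownloadsByValue($phs);
$byValueResult = isset($phs['downloads']);

/* The caller's array now contains the new key */
setDownloadsByReference($phs);
$byReferenceResult = $phs['downloads'];
```

After the first call, the caller's array still has no downloads key. After the second call, it does. That is the behavior getDataFromMODX() relies on.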

The comments in the code above explain what it's doing.
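The download-count step is the easiest part to try in isolation. The sketch below runs the same regular expression against a made-up sample string. The sample markup is an assumption; the real text returned by toList() may look different. It also shows one detail worth keeping in mind: intval() stops at the first non-numeric character, so a formatted number needs its commas removed before it is used in arithmetic.

```php
<?php
/* Hypothetical sample of the text stored in $phs['info'];
   the real repository markup may differ */
$info = '<li>Downloads: 12,345</li><li>License: GPLv2</li>';

/* Capture the digits (and commas) that follow the word "Downloads" */
$count = 0;
if (preg_match('/Downloads\D*([\d,]+)/', $info, $matches)) {
    /* Strip the commas to get a plain integer */
    $count = (int) str_replace(',', '', $matches[1]);
}

/* Format for display with thousands separators */
$formatted = number_format($count);

/* intval() on the formatted string reads only the leading digits */
$truncated = intval($formatted);

/* Removing the commas first recovers the full value */
$recovered = (int) str_replace(',', '', $formatted);
```

Here $count and $recovered are both 12345, $formatted is the string '12,345', and $truncated is only 12.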

The if statement near the top of the function was necessary to deal with a mistake at the repository for my Subscribe extra, which somehow got the URL https://modx.com/extras/package/usersubscriptionsignupsystem instead of https://modx.com/extras/package/subscribe. Without that mistake, the nested if statement that retries with the corrected URL could be removed.
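If more exceptions like this ever turned up, one alternative would be a small lookup table of URL overrides. The sketch below is only an illustration of that idea, not the approach used in the function above, which tries the default URL first and retries only when the page comes back empty. The buildPackageUrl() helper and the $overrides array are hypothetical names, and NewsPublisher is used purely as a sample package name.

```php
<?php
/* Hypothetical helper: builds the repository URL for a package,
   consulting a table of known exceptions first */
function buildPackageUrl(string $packageName, array $overrides = array()): string {
    $slug = strtolower($packageName);

    /* Swap in the override if this package has one */
    if (isset($overrides[$slug])) {
        $slug = $overrides[$slug];
    }

    return 'https://modx.com/extras/package/' . $slug;
}

/* Lowercased package name => slug actually used at the repository */
$overrides = array('subscribe' => 'usersubscriptionsignupsystem');

$subscribeUrl = buildPackageUrl('Subscribe', $overrides);
$otherUrl = buildPackageUrl('NewsPublisher', $overrides);
```

With this table, Subscribe resolves directly to the usersubscriptionsignupsystem URL, while any package without an entry falls through to the default lowercase form.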

See the previous article for the code of the helper functions called here and how to use the simple_html_dom class to extract data from raw HTML.


Coming Up

In the following article we'll look at the implementation of another function from our skeleton code: getDataFromGitHub().




