Scraping, Parsing, and API Use XI

This final article in the series contains the full code, all in one place.


In this series of articles, we've looked at a variety of techniques that involve scraping, parsing, and using an API to get information. The use case is a script to assemble a documentation page for the MODX documentation site by gathering information from several different sources and saving the finished HTML code for the page to a file.

In the previous article, we saw the writeFile() function, which writes the final HTML file for each extra. In this final article, we'll look at the full script used to do the job — all in one place.


Wrapping Things Up

Our script loops through the list of extras and fills our placeholders array ($phs) for each one. It's shown below in its final form. Besides being nice to see in one place, it can serve as a reference, since it has code examples of all the techniques used in this series:

  • Using the GitHub API
  • Using cURL to get the content of a web page
  • Using the Git class to issue Git commands and get their results in PHP
  • Using the DirWalker class to traverse directories and their files
  • Using the simple_html_dom class to parse and extract information from HTML text
  • Formatting timestamps and human-readable date/time strings
  • Using regular expression (regex) searches to extract data from text
  • Replacing placeholders in a Tpl chunk
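
As a quick illustration of the last technique on that list, here is a minimal sketch of placeholder replacement on its own. The fillPlaceholders() function is a hypothetical helper, not part of the script; it assumes the same [[+key]] tag style used in the template file.

```php
<?php
/**
 * Replace [[+key]] tags in a template string.
 * fillPlaceholders() is a hypothetical helper, not part of the original script.
 *
 * @param $tpl string - template text containing [[+key]] tags
 * @param $phs array - placeholder array (key => value)
 *
 * @return string - template with known tags replaced
 */
function fillPlaceholders($tpl, array $phs) {
    $pairs = array();
    foreach ($phs as $key => $value) {
        $pairs['[[+' . $key . ']]'] = (string) $value;
    }
    /* strtr() performs all the replacements in a single pass */
    return strtr($tpl, $pairs);
}

$tpl = '<h1>[[+package_name]]</h1><p>By [[+author]]</p>';
echo fillPlaceholders($tpl, array(
    'package_name' => 'NewsPublisher',
    'author' => 'Bob Ray',
));
```

Unlike the writeFile() function in the full script, this sketch also replaces tags whose values are empty or 0. Tags with no matching key are left in place.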

As I said a while back in this series, there are some bits of code here that I'm not particularly proud of. If this script were going to be released as an extra, I would have used a class, more sanity checking, more intelligent dependency injection instead of some hard-coded values, some clearer variable names, and better separation of the model and view. This code was only going to be used once, though, so I chose not to spend the extra time making it "correct" and bulletproof. I think it still provides some good examples of how to do various kinds of parsing and information retrieval.
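
To give a rough idea of what replacing some of the hard-coded values might look like, here is a minimal sketch that gathers the settings into one array. The buildConfig() function, its key names, and the GITHUB_TOKEN environment variable are all assumptions made for illustration; none of them appear in the script.

```php
<?php
/**
 * Build the script's settings from an array instead of hard-coding them.
 * buildConfig() and its key names are hypothetical.
 *
 * @param $overrides array - values that replace the defaults
 *
 * @return array - complete settings array
 */
function buildConfig(array $overrides = array()) {
    $defaults = array(
        'github_user'  => '',
        'github_token' => '',
        'local_path'   => '',
        'max_extras'   => 50,
    );
    /* Ignore unknown keys; let known keys override the defaults */
    $config = array_merge($defaults, array_intersect_key($overrides, $defaults));

    /* Sanity check: make sure the path ends with a slash */
    if ($config['local_path'] !== '' && substr($config['local_path'], -1) !== '/') {
        $config['local_path'] .= '/';
    }
    return $config;
}

$config = buildConfig(array(
    'github_user'  => 'BobRay',
    /* Hypothetical environment variable; empty string if it's not set */
    'github_token' => (string) getenv('GITHUB_TOKEN'),
    'local_path'   => 'C:/xampp/htdocs/addons/assets/mycomponents',
));
```

The functions in the script could then take the $config array as an argument rather than relying on values set inside them.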



The Code

Here, finally, is the code of the full script used to produce the HTML files for all my extras (creatively named script.php). For my 42 extras, it takes about seven minutes to run (another good reason for not making it a MODX snippet, which would have timed out after just a few extras). There's no time limit for running it in a code editor or from the command line.
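
The reason there's no time limit from the command line is that PHP's max_execution_time setting defaults to 0 (unlimited) there, while under a web server it defaults to 30 seconds. If you ever needed to run a long script like this one both ways, a guard along these lines might help. This is only a sketch: isCommandLine() is a hypothetical helper that's not in the script, and the web server may impose its own timeout regardless.

```php
<?php
/**
 * Report whether the script is running from the command line.
 * isCommandLine() is a hypothetical helper; the parameter exists only
 * to make the check easy to test.
 *
 * @param $sapi string - name of the PHP server API
 *
 * @return bool - true if running from the command line
 */
function isCommandLine($sapi = PHP_SAPI) {
    return $sapi === 'cli';
}

if (!isCommandLine()) {
    /* Under a web server, try to lift PHP's execution time limit */
    set_time_limit(0);
}
```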

The final stats, in case you're curious:

  • Total Downloads (free extras only): 166,249
  • Total Files: 35,717
  • Total Lines: 756,267
  • Total Commits: 4,153
<?php
/**
 * Created by PhpStorm.
 * User: BobRay
 * Date: 6/9/2017
 * Time: 11:54 PM
 */

include 'C:\xampp\htdocs\addons\assets\mycomponents\_extras-rtfm\simple_html_dom.php';
include 'C:\xampp\htdocs\addons\assets\mycomponents\dirwalker\core\components\dirwalker\model\dirwalker\dirwalker.class.php';
include 'C:\xampp\htdocs\addons\assets\mycomponents\_extras-rtfm\Git.php';


/**
 * Function to get the content of a web page or a return from
 * the GitHub API with cURL
 *
 * @param $url
 * @param bool $returnData - Sets whether to return the content or just true/false
 * @param int $timeout - cURL timeout value
 * @param int $tries - number of attempts to make
 *
 * @return bool|mixed - content if $returnData is true; else true/false (success/failure)
 */
function curlGetData($url, $returnData = true, $timeout = 6, $tries = 6) {
    $username = 'BobRay';
    $token = '4d20f30853205c456110433d8522b39e18c5b036';
    $retVal = false;
    $errorMsg = '(' . $url . ' - curl) ' . 'failed';
    $ch = curl_init();

    curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
    curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 6.0)");
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, $returnData);
    curl_setopt($ch, CURLOPT_HEADER, false);
    curl_setopt($ch, CURLOPT_NOBODY, !$returnData);
    if (strpos($url, 'github') !== false) {

        if (!empty($username) && !empty($token)) {
            curl_setopt($ch, CURLOPT_USERPWD, $username . ':' . $token);
        }
    }

    $i = $tries;

    while ($i--) {
        $retVal = @curl_exec($ch);
        if (!empty($retVal)) {
            break;
        }
    }

    if (empty($retVal) || ($retVal === false)) {
        $e = curl_error($ch);
        if (!empty($e)) {
            $errorMsg = $e;
        }
        echo "\n" . $errorMsg;
    } elseif (!$returnData) { /* Just checking for existence */
        $statusCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        $retVal = $statusCode == 200 || $statusCode == 301 || $statusCode == 302;
    }
    curl_close($ch);
    return $retVal;
}

/**
 * Format both timestamps and human-readable date/time strings
 *
 * @param $t int | string - timestamp or date/time string
 * @param string $format - format template for strftime()
 *
 * @return string - formatted time
 */

function formatTime($t, $format = "%b %d, %Y") {
    $t = is_numeric($t) ? $t : strtotime($t);
    return strftime($format, $t);
}


/**
 * Concatenate just p tags and their content in a div, with some exceptions
 *
 * @param $html simple_html_dom  - simple_html_dom object
 * @param $divSelector string - selector for Div. E.g., 'div[id=tab-description]'
 *
 * @return string - single string containing concatenated content of selected p tags
 */

function getDescription($html, $divSelector) {
    /** @var $html simple_html_dom */

    $descDiv = $html->find($divSelector, 0);
    $children = $descDiv->children;

    $output = '';

    /* Concatenate just p tags with some exceptions */
    foreach ($children as $child) {
        if ($child->tag == 'p') {
            $inner = $child->innertext;
            if (!empty($inner)) {
                if ((stripos($inner, 'Install with') === 0)
                    || (stripos($inner, 'See the') === 0)
                    || (stripos($inner, 'Install in') === 0)) {
                    continue;
                }
                $output .= "\n\n" . $child->outertext;
            }
        }
    }

    return $output;

}

/**
 * Get Title of extra from page content
 *
 * @param $html simple_html_dom - simple_html_dom object to analyze
 * @param $tagName - name of HTML tag to look for
 *
 * @return string - title of extra
 */
function getTitle($html, $tagName) {
    /** @var $html simple_html_dom */
    $title = $html->getElementByTagName($tagName)->innertext;
    preg_match('/^(.*)\s\d/', $title, $matches);
    return isset($matches[1]) ? $matches[1] : '';
}

/**
 * Convert lines separated with br tags into a ul list
 *
 * @param $text string - text to convert
 *
 * @return mixed|string
 */
function toList($text) {
    $text = str_replace('<br />', '<br>', $text);
    $textArray = explode('<br>', $text);

    foreach ($textArray as $key => $line) {
        $textArray[$key] = "\n    <li>" . trim($line) . '</li>';
    }
    $result = "\n<ul>" . implode('', $textArray) . "\n</ul>";
    $result = str_replace('Requires ', 'Requires: ', $result);
    $result = str_replace('Supports ', 'Supports: ', $result);
    return $result;
}

/**
 * Get the Authors of an extra from the readme.md file
 *
 * @param $readme_md string - content of readme.md
 *
 * @return string - formatted Authors section
 */
function getAuthors($readme_md) {
    $outerPattern = '/\*\*([^\]]*Author|Authors|Collaborators|Collaborator|Contributors|Contributor):\*\*([^\n]+)/';
    $matches = array();

    preg_match_all($outerPattern, $readme_md, $matches);
    // echo "\n MATCHES: " . print_r($matches, true) . "\n";
    $titles = array();
    $count = count($matches[1]);

    for ($j = 0; $j < $count; $j++) {
        $line = $matches[1][$j] . ': ' . $matches[2][$j];
        // $line = str_replace('<br>', '', $line);
        if ((strpos($line, ']') !== false) && strpos($line, ')') !== false) {
            $innerPattern = '/([^\[]+)\[([^\]]+)\]\(([^\)]+)\)/';
            preg_match($innerPattern, $line, $hits);
            // echo "\n Hits: " . print_r($hits, true) . "\n";
            $titles[] = "\n    <li>" . $hits[1] . '<a href="' . $hits[3] . '">' . $hits[2] . '</a></li>';
        } else {
            $titles[] = "\n    <li>" . $line . '</li>';
        }
        // echo $newLine . "\n";
        // $titles[] = $newLine;

    }
    // echo $result;
    return "\n<ul>" . implode('', $titles) . "\n</ul>";
}

/**
 * Get data from the GitHub API and put it in the placeholders array
 *
 * @param $gitHubUrl string  - URL of API for the author
 * @param $file string - name of GitHub repository for this extra
 * @param $phs array - placeholder array (as reference)
 */
function getDataFromGitHub($gitHubUrl, $file, &$phs) {
    global $totalCommits;
    $url = $gitHubUrl . $file;


    $gitHubData = json_decode(curlGetData($url), true);
    if (isset($gitHubData['message']) && ($gitHubData['message'] === 'Not Found')) {
        echo " -- No GitHub Data ";
    } else {
        // echo "\n    " . $gitHubUrl;
        $phs['package_name'] = $gitHubData['name'];
        $phs['github_url'] = $gitHubData['html_url'];
        $phs['name'] = $gitHubData['name'];
        $phs['github_date'] = formatTime($gitHubData['created_at']);
        $phs['github_updated_on'] = formatTime($gitHubData['updated_at']);
        $phs['documentation_url'] = $gitHubData['homepage'];
        $phs['github_issues_url'] = $gitHubData['html_url'] . '/' . 'issues';
        $phs['watchers'] = $gitHubData['watchers_count'];
        $phs['stars'] = $gitHubData['stargazers_count'];

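        /* Get commits. The API returns at most 30 commits per page by
           default, so this count is only provisional; getCommitCount()
           later replaces it with the count from the local repo, which is
           also what gets added to $totalCommits in the main loop. */
        $commits = json_decode(curlGetData($url . '/commits'), true);
        $phs['commits'] = is_array($commits) ? count($commits) : 0;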

    }
}

/**
 * Get data from MODX Extras repository
 *
 * @param $phs array - placeholder array (as reference)
 * @param $html simple_html_dom - simple_html_dom object to analyze
 */

function getDataFromMODX(&$phs, $html) {
    global $totalDownloads;
    /** @var $html simple_html_dom */

    $phs['package_url'] = 'https://modx.com/extras/package/' .
        strtolower($phs['package_name']);

    $pageText = curlGetData($phs['package_url']);
    if (empty($pageText)) {

        echo " -- No data from MODX\n";
        if (stripos($phs['package_url'], 'subscribe') !== false) {
            $phs['package_url'] = 'https://modx.com/extras/package/usersubscriptionsignupsystem';
            $pageText = curlGetData($phs['package_url']);
            if (!empty($pageText)) {
                echo " -- second attempt succeeded";
            }
        }
        $phs['downloads'] = 0;
    }

    if (!empty($pageText)) {
        $html->load($pageText);
        // $html = str_get_html($text);
        $phs['description'] = getDescription($html, 'div[id=tab-description]');

        /* Update package_name if we can get it from modx.com/extras */
        $title = getTitle($html, 'title');
        if (!empty($title)) {
            $phs['package_name'] = $title;
        }

        $statsDiv = $html->find('div.stats', 0);
        $requiresDiv = $html->find('div.supports', 0);

        $phs['info'] = toList($statsDiv->innertext . $requiresDiv->innertext);
        preg_match('/Downloads\D*([\d\,]+)/', $phs['info'], $matches);
        $phs['downloads'] = number_format((int) str_replace(',', '', $matches[1]));
        // echo $phs['info'];
        // echo print_r($phs, true);

    }

    /* Strip the thousands separator first: intval('12,345') is 12 */
    $totalDownloads += (int) str_replace(',', '', $phs['downloads']);
}

/**
 * Get count of files and lines of code for an extra from
 * local directory using DirWalker
 *
 * @param $dw DirWalker - DirWalker class object
 * @param $localPath string - path to directory containing extras
 * @param $file string - name of extra directory
 * @param $phs array - placeholder array (as reference)
 * @return int
 */
function getFileCount($dw, $localPath, $file, &$phs) {
    global $totalLines;
    /** @var $dw DirWalker */
    $path = $localPath . $file;
    $dw->resetFiles();
    $dw->dirWalk($path, true);
    $files = $dw->getFiles();
    $fileCount = count($files);
    $phs['file_count'] = number_format($fileCount);
    $lineCount = 0;
    foreach ($files as $path => $fileName) {
        $fileArray = file($path);
        $lineCount += count($fileArray);
    }
    $phs['lines'] = number_format($lineCount);
    $totalLines += $lineCount;
    return $fileCount;


}

/**
 * Get number of commits for project from local Git repo
 * using Git.php
 *
 * @param $localPath string - path to local extras directory
 * @param $file string - name of specific extra directory
 * @param $phs array - placeholder array (as reference)
 *
 * @return int - total number of commits for project
 */
function getCommitCount($localPath, $file, &$phs) {

    $path = $localPath . "\\" . $file;
    chdir($path);

    $repo = Git::open($path);  // -or- Git::create('/path/to/repo')
    Git::windows_mode();
    $result = $repo->run('log --pretty=short --oneline');
    // echo "\n\n" . $result;
    /* One line per commit; trim() keeps a trailing newline from adding one */
    $result = count(explode("\n", trim($result)));

    $phs['commits'] = $result;

    return $result;
}

/**
 * Write the final file as extraName.html
 * using template file and placeholder array
 *
 * @param $phs array - placeholder array
 */
function writeFile ($phs) {
    $file = dirname(__FILE__)  . '/files/' . $phs['package_name'] . '.html';

    $tpl = file_get_contents(dirname(__FILE__) . '/template.html');
    // $chunk = $modx->newObject('modChunk');
    // $chunk->setContent($tpl);
    // $content = $chunk->process($phs);
    foreach ($phs as $key => $value) {
        if (!empty($value)) {
            $tpl = str_replace('[[+' . $key . ']]', $value, $tpl);
        }
    }
    $tpl = str_replace('https//bobsguides', 'https://bobsguides', $tpl);
    $tpl = str_replace('MODx', 'MODX', $tpl);
    $fp = fopen($file, 'w');
    fwrite($fp, $tpl);
    fclose($fp);
}


$localPath = 'C:/xampp/htdocs/addons/assets/mycomponents/';

$files = array (
    'newspublisher',
    'activationemail',
    'subscribe',
    'GoRevo',
    'SiteCheck',
    'botblockx',
    'cacheclear',
    'cachemaster',
    'canonical',
    'captcha',
    'caseinsensitiveurls',
    'classextender',
    'convertdatabasecharset',
    'constantcontact',
    'defaultresourcegroup',
    'defaultusergroup',
    'dirwalker',
    'emailresource',
    'ezfaq',
    'fileupload',
    'fixedpre',
    'getDynaDescription',
    'lexiconhelper',
    'loglogins',
    'logpagenotfound',
    'mandrillx',
    'messagemanager',
    'notify',
    'objectexplorer',
    'orphans',
    'personalize',
    'quickemail',
    'reflectblock',
    'refreshcache',
    'season',
    'siteatoz',
    'sizematters',
    'spform',
    'stagecoach',
    'syntaxhighlighter',
    'thermx',
    'upgrademodx',
);
/* Set max number of extras to process */
$i = 1;
$max = 50;

/* Create 'files' directory if necessary */
if (! file_exists('files')) {
    mkdir('files');
}

/* Instantiate simple_html_dom class object */
$html = new simple_html_dom();

/* Initialize starting total values */
$totalDownloads = 0;
$totalCommits = 0;
$totalFiles = 0;
$totalLines = 0;


foreach($files as $file) {
    /* Set some default placeholder values */
    $phs = array(
        'today' => formatTime(time()),
        'package_name' => $file,
        'author' => 'Bob Ray',
        'author_web_site_name' => "Bob's Guides",
    );

    /* Bail if we've hit the max value */
    if ($i++ > $max) {
        break;
    }
    echo "\n" . $file . "\n";

    /* Get readme.md content from local file */
    $readme_md = file_get_contents($localPath .
        $file . '/readme.md');

    /* Set authors placeholder */
    $phs['authors'] = getAuthors($readme_md);

    /* get data from GitHub */
    $gitHubUrl = 'https://api.github.com/repos/BobRay/';
    getDataFromGitHub($gitHubUrl, $file, $phs);

    /* Get data from MODX extras repository */
    getDataFromMODX($phs, $html);

    // echo "\nDownloads: " . $totalDownloads;

    /* Get file and line counts from local files using DirWalker,
       and the commit count from the local Git repo */
    $dw = new DirWalker();
    /** @var $dw DirWalker */
    $totalFiles += getFileCount($dw, $localPath, $file, $phs);
    $totalCommits += (int) getCommitCount($localPath, $file, $phs);

    /* Write the final file */
    writeFile($phs);
}

/* Show total statistics */

echo "\nTotal Downloads: " . number_format($totalDownloads);
echo "\nTotal Commits: " . number_format($totalCommits);
echo "\nTotal Files: " . number_format($totalFiles);
echo "\nTotal Lines: " . number_format($totalLines);
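
One caveat for anyone running the script today: formatTime() relies on strftime(), which is deprecated as of PHP 8.1. Here is a minimal sketch of an alternative that uses date() instead. The name formatTimeWithDate() is hypothetical, and the 'M d, Y' template is the date() counterpart of "%b %d, %Y". One difference to be aware of: date() always produces English month names, while strftime() follows the current locale.

```php
<?php
/**
 * Format timestamps and date/time strings without strftime().
 * formatTimeWithDate() is a hypothetical replacement for formatTime().
 *
 * @param $t int | string - timestamp or date/time string
 * @param string $format - format template for date()
 *
 * @return string - formatted time, or '' if $t can't be parsed
 */
function formatTimeWithDate($t, $format = 'M d, Y') {
    $t = is_numeric($t) ? (int) $t : strtotime($t);
    if ($t === false) {
        return '';   /* strtotime() could not parse the string */
    }
    return date($format, $t);
}

/* Fix the time zone so the results don't depend on the server settings */
date_default_timezone_set('UTC');
echo formatTimeWithDate('2017-06-09T23:54:00Z'), "\n";   /* Jun 09, 2017 */
echo formatTimeWithDate(0), "\n";                        /* Jan 01, 1970 */
```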




Coming Up

In the next article, we'll look at how to control the order in which TVs appear on the Template tab of the Create/Edit Resource panel.




