Bob's Guides | Scraping, Parsing, and API Use II

In this series of articles, we'll look at a variety of techniques that involve scraping, parsing, and using an API to get information. The use case is a script to assemble a documentation page for the MODX documentation site by gathering information from several different sources and saving the finished HTML code for the page to a file.

In the previous article, we saw an overview of the project. In this one, we'll look at a skeleton of the script used to create the final files.

Skeletons

People design and develop projects in various ways. Some time ago, when object-oriented programming was touted as the only intelligent way to program, I used index cards to design each project. Each card was devoted to a single class. The card listed the class name, parent class (if any), class variables, class methods, and the other classes that class communicated with. It's not a bad method, but I found it unsatisfying. Today, the same process would involve an IDE that would create a visual image of the project and manage its development for you. If you were part of a team working on the project, a method like this would be very helpful. As a single programmer, though, it's not the way I usually work. I like coding a lot better than I like planning, and I usually can't resist starting to write code when I should be planning.

Designing a whole project before you start any coding has many advantages. Various logic problems and alternate solutions will present themselves, and you can solve them before you write a bunch of code that won't actually work. I recommend this method, even though I don't always use it. I will do a full design if the project is large and complex (e.g., MyComponent, SiteCheck, or GoRevo), but for simpler projects, I do a rough design in my head and dive in. I don't recommend this technique unless the project is very simple, and even then, lots of "simple" projects turn out not to be so simple once you get going on them. No matter how well you've planned your project, there's a good chance that you'll think of a new feature somewhere along the line that will require the project to be redesigned.

For this project, as soon as I learned that GitHub had an API that let you gather information about a GitHub repository, I couldn't resist diving in and writing the code to access the API. Similarly, when I realized that the best place to get some of the information I needed was the MODX Extras repository, I couldn't resist pulling information from it. And when I learned about a PHP class to analyze the DOM structure of a web page, I immediately started trying to apply that to the pages I pulled from the MODX Extras site.

All that said, in this series of articles, I'm going to pretend that I did the project the right way and designed it all ahead of time. In that vein, here is the skeleton design of the program:


<?php
/**
 * Created by PhpStorm.
 * User: BobRay
 * Date: 6/9/2017
 * Time: 11:54 PM
 */

/* Load class files */
include 'C:\xampp\htdocs\addons\assets\mycomponents\_extras-rtfm\simple_html_dom.php';
include 'C:\xampp\htdocs\addons\assets\mycomponents\dirwalker\core\components\dirwalker\model\dirwalker\dirwalker.class.php';
include 'C:\xampp\htdocs\addons\assets\mycomponents\_extras-rtfm\Git.php';


/**
 * Function to get the content of a web page with cURL
 *
 * @param $url
 * @param bool $returnData - Sets whether to return the content or just true/false
 * @param int $timeout - cURL timeout value
 * @param int $tries - number of attempts to make
 *
 * @return bool|mixed - content if $returnData is true; else true/false (success/failure)
 */
function curlGetData($url, $returnData = false, $timeout = 6, $tries = 6) {}

/**
 * Format both timestamps and human-readable date/time strings
 *
 * @param $t int | string - timestamp or date/time string
 * @param string $format - format template for strftime()
 *
 * @return string - formatted time
 */
function formatTime($t, $format = "%b %d, %Y") {}


/**
 * Get description of project from MODX extras page by
 * concatenating just p tags and their content in a div, with some exceptions
 *
 * @param $html simple_html_dom  - simple_html_dom object to analyze
 * @param $divSelector string - selector for Div. E.g., 'div[id=tab-description]'
 *
 * @return string - single string containing concatenated content of selected p tags
 */
function getDescription($html, $divSelector) {}

/**
 * Get Title of extra from page content
 *
 * @param $html simple_html_dom - simple_html_dom object to analyze
 * @param $tagName - name of HTML tag to look for
 *
 * @return string - title of extra
 */
function getTitle($html, $tagName) {}

/**
 * Convert lines separated with br tags into a ul list
 *
 * @param $text string - text to convert
 *
 * @return mixed|string
 */
function toList($text) {}

/**
 * Get the Authors of an extra from the readme.md file
 *
 * @param $readme_md string - content of readme.md file
 *
 * @return string - formatted Authors section
 */
function getAuthors($readme_md) {}

/**
 * Get data from GitHubAPI and put it in the placeholders array
 *
 * @param $gitHubUrl string  - URL of API for the author
 * @param $file string - name of GitHub repository for this extra
 * @param $phs array - placeholder array (as reference)
 */
function getDataFromGitHub($gitHubUrl, $file, &$phs) {}

/**
 * Get data from MODX Extras repository
 *
 * @param $phs array - placeholder array (as reference)
 * @param $html simple_html_dom - simple_html_dom object to analyze
 */
function getDataFromMODX(&$phs, $html) {}

/**
 * Get count of files and lines of code for an extra from
 * local directory using DirWalker
 *
 * @param $dw DirWalker - DirWalker class object
 * @param $localPath string - path to directory containing extras
 * @param $file string - name of extra directory
 * @param $phs array - placeholder array (as reference)
 * @return int
 */
function getFileCount($dw, $localPath, $file, &$phs) {}

/**
 * Get number of commits for project from local GitHub repo
 * using Git.php
 *
 * @param $localPath string - path to local extras directory
 * @param $file string - name of specific extra directory
 * @param $phs array - placeholder array (as reference)
 *
 * @return int - total number of commits for project
 */
function getCommitCount($localPath, $file, &$phs) {}

/**
 * Write the final file as extraName.html
 * using template file and placeholder array
 *
 * @param $phs array - placeholder array
 */
function writeFile ($phs) {}

/* *************************************** */
/* Script code starts here */

$localPath = 'C:/xampp/htdocs/addons/assets/mycomponents/';

$files = array (
    'newspublisher',
    'activationemail',
    'subscribe',
    'GoRevo',
    'SiteCheck',
    'botblockx',
    'cacheclear',
    'cachemaster',
    'canonical',
    'captcha',
    'caseinsensitiveurls',
    'classextender',
    'convertdatabasecharset',
    'constantcontact',
    'defaultresourcegroup',
    'defaultusergroup',
    'dirwalker',
    'emailresource',
    'ezfaq',
    'fileupload',
    'fixedpre',
    'getDynaDescription',
    'lexiconhelper',
    'loglogins',
    'logpagenotfound',
    'mandrillx',
    'messagemanager',
    'notify',
    'objectexplorer',
    'orphans',
    'personalize',
    'quickemail',
    'reflectblock',
    'refreshcache',
    'season',
    'siteatoz',
    'sizematters',
    'spform',
    'stagecoach',
    'syntaxhighlighter',
    'thermx',
    'upgrademodx',
);
/* Set max number of extras to process */
$i = 1;
$max = 50;

/* Create 'files' directory if necessary */
if (! file_exists('files')) {
    mkdir('files');
}

/* Instantiate simple_html_dom class object */
$html = new simple_html_dom();

/* Initialize starting total values */
$totalDownloads = 0;
$totalCommits = 0;
$totalFiles = 0;
$totalLines = 0;


foreach($files as $file) {
    /* Set some default placeholder values */
    $phs = array(
        'today' => formatTime(time()),
        'package_name' => $file,
        'author' => 'Bob Ray',
        'author_web_site_name' => "Bob's Guides",
    );

    /* Bail if we've hit the max value */
    if ($i++ > $max) {
        break;
    }
    echo "\n" . $file . "\n";

    /* Get readme.md content from local file */
    $readme_md = file_get_contents($localPath . $file . '/readme.md');

    /* Set authors placeholder */
    $phs['authors'] = getAuthors($readme_md);

    /* get data from GitHub */
    $gitHubUrl = 'https://api.github.com/repos/BobRay/';
    getDataFromGitHub($gitHubUrl, $file, $phs);

    /* Get data from MODX extras repository */
    getDataFromMODX($phs, $html);

    // echo "\nDownloads: " . $totalDownloads;

    /* Get file and line counts from local files using DirWalker,
       and the commit count using Git.php */
    $dw = new DirWalker();
    /** @var $dw DirWalker */
    $totalFiles += getFileCount($dw, $localPath, $file, $phs);
    $totalCommits += (int) getCommitCount($localPath, $file, $phs);

    /* Write the final file */
    writeFile($phs);
}

/* Show total statistics */

echo "\nTotal Downloads: " . number_format($totalDownloads);
echo "\nTotal Commits: " . number_format($totalCommits);
echo "\nTotal Files: " . number_format($totalFiles);
echo "\nTotal Lines: " . number_format($totalLines);
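
The loop above builds a $phs array for each extra and hands it to writeFile(), which combines it with a template file. As a rough illustration of that last step, here is a minimal sketch. It assumes the template uses MODX-style [[+name]] placeholders, which the skeleton doesn't actually specify. The fillTemplate() function and the sample template string are hypothetical names made up for this example and are not part of the real script.

```php
<?php
/**
 * Hypothetical helper (not part of the actual script):
 * replace [[+key]] placeholders in a template string
 * with the matching values from a placeholder array.
 *
 * @param string $template - template text containing [[+key]] tags (assumed syntax)
 * @param array $phs - placeholder array (key => value)
 *
 * @return string - template with known placeholders filled in
 */
function fillTemplate(string $template, array $phs): string {
    $pairs = array();
    foreach ($phs as $key => $value) {
        $pairs['[[+' . $key . ']]'] = (string) $value;
    }

    /* Nothing to replace; return the template untouched */
    if (empty($pairs)) {
        return $template;
    }

    /* strtr() replaces each tag in one pass, so a replaced
       value is never scanned again for further tags */
    return strtr($template, $pairs);
}

/* Sample template and values, invented for this example */
$template = '<h1>[[+package_name]]</h1><p>Author: [[+author]]</p>';
$phs = array(
    'package_name' => 'dirwalker',
    'author' => 'Bob Ray',
);

echo fillTemplate($template, $phs) . "\n";
```

Placeholders with no matching key are left in place, which makes missing data easy to spot in the finished page. A writeFile() built along these lines could then save the result with file_put_contents(), though that's only one way to do it.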

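One note on formatTime(): the skeleton's docblock mentions strftime(), which is deprecated as of PHP 8.1. If you're following along on a newer PHP version, something like the sketch below may serve instead. It's an assumption on my part about how such a function could work, not the code from the actual script. The name formatTimeSketch() is hypothetical, and the 'M d, Y' format for date() is meant to mirror the "%b %d, %Y" default in the skeleton.

```php
<?php
/**
 * Hypothetical variant of formatTime() that uses date()
 * rather than strftime() (not the actual script's code).
 *
 * @param int|string $t - timestamp or date/time string
 * @param string $format - format template for date()
 *
 * @return string - formatted time, or '' if $t can't be parsed
 */
function formatTimeSketch($t, $format = 'M d, Y') {
    /* Numeric values (including numeric strings) are
       treated as Unix timestamps; anything else goes
       through strtotime() */
    $timestamp = is_numeric($t) ? (int) $t : strtotime($t);

    if ($timestamp === false) {
        return '';
    }

    return date($format, $timestamp);
}

/* A date/time string */
echo formatTimeSketch('2017-06-09 23:54:00') . "\n"; // prints "Jun 09, 2017"

/* The current timestamp, as used for the 'today' placeholder */
echo formatTimeSketch(time()) . "\n";
```

Unlike strftime(), date() always produces English month names, so this version ignores the locale. For a one-language documentation page, that seemed like an acceptable trade-off.
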
Coming Up

In the following articles we'll look at the implementation of the skeleton functions above, starting with the curlGetData() function.

For information on how to use MODX to create a web site (and other topics), see my main web site, Bob's Guides, or better yet, buy my book: MODX: The Official Guide.

Copyright © 2011-2026
Bob Ray
