Scraping, Parsing, and API Use X

Writing the finished documentation files

In this series of articles, we're looking at a variety of techniques that involve scraping, parsing, and using an API to get information. The use case is a script to assemble a documentation page for the MODX documentation site by gathering information from several different sources and saving the finished HTML code for the page to a file.

In the previous article, we saw how to create a formatted date/time string from either a timestamp or a human-readable string with formatTime() and how to use regular expression searches to pull information out of a file formatted in Markup with getAuthors(). In this one, we'll look at the final piece of code that puts all the pieces together. It replaces the placeholders in our Tpl chunk and writes an HTML file for each extra.


Wrapping Things Up

Our script loops through the list of extras and fills our placeholders array ($phs) with information. In the previous articles we saw how to extract information from GitHub, the MODX Extras repository, local Markup files, and the local Git repository. Various functions did the work for us using cURL, the simple_html_dom class, DirWalker, the Git class, and regular expression (regex) searches, among other things.

At the point in the loop where all placeholders for a single extra have been filled, we call the writeFile() function to create the final file that's pasted into the MODX documentation site. We'll look at the code of that function in a bit, but first we'll review the Tpl chunk containing the placeholder tags.

A Little Review

As I mentioned way back in Article I, the Tpl chunk is really a file on disk, though it's used exactly as true MODX Tpl chunks are used. A few of the final HTML files needed to be edited before posting, and it was easier to do that in my code editor (PhpStorm), which validates the HTML and has great features for editing it that you can't get in the MODX Manager. I could have created a static chunk, but using files meant that I never had to instantiate MODX. That made the script faster. It also used a lot less memory. Better yet, it allowed me to run and debug the script in my editor.

I posted the Tpl chunk back in Article I of this series, but I've put a copy here so you don't have to jump back there. Here's the code:

        <a href="#[[+package_name]]-Whatis[[+package_name]]">What is [[+package_name]]?</a>

        <a href="#[[+package_name]]-Information">Package Information</a>

        <a href="#[[+package_name]]-History">History</a>

        <a href="#[[+package_name]]-Download">Download</a>

        <a href="#[[+package_name]]-DevelopmentandBugReporting">Development and Bug Reporting</a>

            <a href="#[[+package_name]]-Documentation">Documentation</a>

        <a href="#[[+package_name]]-SeeAlso">See Also</a>
    </li> -->

<h2 id="[[+package_name]]-Whatis[[+package_name]]">What is [[+package_name]]?</h2>

<h2 id="[[+package_name]]-Information">Package Information</h2>


<h2 id="[[+package_name]]-History">History</h2>


    <p>This version of the [[+package_name]] extra was developed by [[+author]]. It was first posted to GitHub on [[+github_date]]. As of [[+today]] it had been last updated on [[+github_updated_on]], had [[+commits]] commits, and had been downloaded [[+downloads]] times. The [[+package_name]] package consists of [[+file_count]] separate files, containing [[+lines]] lines of code. At that time, the [[+package_name]] repository had [[+watchers]] watchers and [[+stars]] stars.</p>

<p>It is currently maintained by [[+author]].</p>

<h2 id="[[+package_name]]-Download">Download</h2>

[[+package_name]] can be downloaded and installed from within the MODX Revolution Manager via
    <a href="/revolution/2.x/developing-in-modx/advanced-development/package-management" title="Package Manager"
       target="_blank">Package Manager</a> (Extras -&gt; Installer), or from the <a href="[[+package_url]]"
       target="_blank">MODX Extras Repository</a>.
<h2 id="[[+package_name]]-DevelopmentandBugReporting">Development and Bug Reporting</h2>
[[+package_name]] is stored and developed using GitHub, and can be found here:
    <a href="[[+github_url]]" target="_blank">[[+package_name]] GitHub main page</a>.
Bugs and feature requests can be filed here:
    <a href="[[+github_issues_url]]" target="_blank">[[+package_name]] issues page</a>.

<p>Questions about how to use [[+package_name]] should be posted on the <a href=""
                                                                           target="_blank">MODX Forums</a>.</p>
<h2 id="[[+package_name]]-Documentation">Documentation</h2>

The full documentation for [[+package_name]] can be found at the author's web site ([[+author_web_site_name]]): <a
        href="[[+documentation_url]]" target="_blank">[[+package_name]] Documentation</a>.

    <h2 id="[[+package_name]]-SeeAlso">See Also</h2>

    <ol class="ug-toc see-also">
        <a href="/extras/[[+url]]">[[+link_text]]</a>

    </ol> -->

The writeFile() Function

This function is fairly straightforward. At the end of the loop when the placeholders array ($phs) is fully populated for the extra being processed, writeFile() is called like this:

/* Write the final file */
writeFile($phs);

Here's the code of the writeFile() function:

/**
 * Write the final file as extraName.html
 * using template file and placeholder array
 *
 * @param $phs array - placeholder array
 */
function writeFile ($phs) {
    /* Output file: ./files/PackageName.html */
    $file = dirname(__FILE__)  . '/files/' . $phs['package_name'] . '.html';

    $tpl = file_get_contents(dirname(__FILE__) . '/template.html');
    // $chunk = $modx->newObject('modChunk');
    // $chunk->setContent($tpl);
    // $content = $chunk->process($phs);
    foreach ($phs as $key => $value) {
        /* Skip empty values so their tags stay visible in the output */
        if (!empty($value)) {
            $tpl = str_replace('[[+' . $key . ']]', $value, $tpl);
        }
    }
    $tpl = str_replace('https//bobsguides', 'https://bobsguides', $tpl);
    $tpl = str_replace('MODx', 'MODX', $tpl);
    $fp = fopen($file, 'w');
    fwrite($fp, $tpl);
    fclose($fp);
}

First, notice that unlike most of the other functions, we don't pass the $phs array by reference here because we're not modifying it. We're just using it to plug the values into the Tpl chunk.
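For anyone unsure of the difference, here is a minimal sketch of passing an array by value versus by reference in PHP. The two function names are made up for this illustration and are not part of the script.

```php
<?php
// Hypothetical functions, for illustration only

// By value: the function works on its own copy of the array
function setByValue($phs) {
    $phs['author'] = 'changed';
}

// By reference (note the &): changes are visible to the caller
function setByReference(&$phs) {
    $phs['author'] = 'changed';
}

$phs = array('author' => 'original');

setByValue($phs);
echo $phs['author'] . "\n";   // original

setByReference($phs);
echo $phs['author'] . "\n";   // changed
```

Since writeFile() only reads from $phs, passing by value is the safer choice: nothing the function does can alter the caller's array by accident.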

At the top of the loop that iterates through the extras, we set $phs['package_name'] to the name of the package. The writeFile() code starts by using that placeholder to set the name of the file to be written (e.g., currentDirectory/files/NewsPublisher.html). As you can see from the Tpl chunk code at the top of this article, that placeholder is also used in a bunch of places in the Tpl chunk itself.

The commented-out lines show how to use an "ad hoc" chunk to replace the placeholders in the Tpl chunk. Using that code would have meant instantiating MODX. It would also have made the script much slower since MODX would have had to fully parse the chunk to identify the placeholder tags. I left the commented section in so you could see how it's done. It's a convenient method for replacing placeholders in a MODX snippet, where the $modx variable already exists.

Next, we loop through the placeholders and use str_replace() to replace each placeholder with its value in the $phs array. There is a way to use str_replace() here without a loop. That would look like this (replacing the foreach() loop):

$tpl = str_replace(array_keys($phs), array_values($phs), $tpl);

The problem with that code is that the array keys are bare names such as package_name, so it would replace only the name and leave behind the square brackets enclosing the placeholder tags and the placeholder token (+). It would also replace those names wherever they happen to appear in the ordinary text of the page. We could have included the brackets and the token in all the placeholder keys, but it would have been a pain to type them all, and some would probably have been forgotten or mistyped. The foreach() loop code I used is easier to understand and probably about as fast.
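If you wanted the single-call form anyway, the tags could be built from the keys in code rather than typed by hand. This is only a sketch, not what the script does: the helper name fillPlaceholders() and the sample values are invented for the example. It drops empty values first so that, like the foreach() loop, it leaves unset placeholders in place.

```php
<?php
/**
 * Hypothetical helper (not part of the original script): replace
 * [[+key]] tags in one str_replace() call by wrapping each key.
 */
function fillPlaceholders($tpl, $phs) {
    // Drop empty values so their tags stay visible, as in the foreach() version
    $phs = array_filter($phs);

    // Turn 'package_name' into '[[+package_name]]', and so on
    $tags = array_map(function ($key) {
        return '[[+' . $key . ']]';
    }, array_keys($phs));

    return str_replace($tags, array_values($phs), $tpl);
}

// Sample data, invented for this example
$tpl = '<h2>What is [[+package_name]]?</h2><p>By [[+author]], [[+commits]] commits.</p>';
$phs = array('package_name' => 'NewsPublisher', 'author' => 'Example Author', 'commits' => '');

echo fillPlaceholders($tpl, $phs);
```

With that sample data, the package name and author are filled in and the [[+commits]] tag is left untouched because its value is empty.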

The two extra str_replace() calls fix malformed Bob's Guides URLs (a missing colon after https) and the old spelling of MODX (MODx), both of which appeared in some of my older extras.

Finally, we open the file, write the content of the Tpl chunk (with all the placeholders replaced) to it, and close it. In a very few cases, some placeholder values didn't get set in the $phs array. In those cases, the placeholder tags remained in the HTML file so they were easy to spot and fix manually.
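Spotting leftover tags by eye works for a handful of files, but a regex search can list them too. Here is a rough sketch of that idea. The function name findUnfilledPlaceholders() is hypothetical, and the pattern assumes placeholder keys contain only letters, digits, and underscores, as the ones in the Tpl chunk above do.

```php
<?php
/**
 * Hypothetical helper (not part of the original script): list the
 * names of any [[+placeholder]] tags still present in the HTML.
 */
function findUnfilledPlaceholders($html) {
    // Assumes keys contain only letters, digits, and underscores
    preg_match_all('/\[\[\+(\w+)\]\]/', $html, $matches);

    // $matches[1] holds the captured names; remove repeats and re-index
    return array_values(array_unique($matches[1]));
}

// Sample HTML, invented for this example
$html = '<p>[[+downloads]] downloads, [[+stars]] stars, [[+downloads]] total.</p>';
print_r(findUnfilledPlaceholders($html));
```

For the sample string, the function returns an array containing downloads and stars, each listed once.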

Coming Up

In this series of articles, we've seen the skeleton of the script file used to create the HTML files for all extras, and the functions used to fill it out. You might be curious, though, to see the full script in one place. We'll see that in the next (and final) article in this series. The full script is also handy because it contains code examples of all the techniques we've seen in the series.

