Scraping, Parsing, and API Use I

A tour de force of techniques for pulling information from the web and from various kinds of files


In this series of articles, we'll look at a variety of techniques that involve scraping, parsing, and using an API to get information.



A Use Case

I finally got around to creating pages for my extras at the MODX Documentation site. Since I have over 40 extras, and I loathe repetitive tasks, I decided to create a script that assembled the pages for me and saved them to separate files so I could just paste the content into the MODX docs pages. The process took quite a while, so it might have been faster to just create the pages by hand, but it was a lot more fun to automate it, and I learned a bunch of stuff that you might find useful.

Ultimately, the process involved the following steps:

  • Scraping the MODX Extras page for each add-on at modx.com/extras with cURL
  • Using a DOM parser to extract information from the extras page
  • Using the GitHub API and JSON to get information about the extra's repository
  • Using regular expressions to pull information from the readme.md file (in Markdown format) for each extra
  • Using James Brumond's Git class to query a local Git repo
  • Using the DirWalker class to count the files in a given repo
  • Plugging all the resulting information into a Tpl chunk
  • Saving the finished product to an individual file

There is also some interesting code in the project to manage and format the resulting information before plugging it into the Tpl chunk.

To design the project, I had to start near the end of the process by creating the Tpl chunk and a series of dummy functions to gather the information.
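
The stub-first approach described above can be sketched roughly as follows. This is only an illustration, written in Python for brevity rather than in the language of the original script. The function names (`scrape_extras_page`, `query_github`, `count_files`, `gather`) and the canned return values are hypothetical and not taken from the project. Each dummy function returns data in the shape the template expects, so the assembly step can be tried out before any real scraping or API code exists.

```python
# Hypothetical stubs: each stands in for one information-gathering step
# and returns canned values keyed by template placeholder name.

def scrape_extras_page(package_name):
    """Stub for the step that scrapes the extras page."""
    return {"description": "TODO description", "downloads": "0"}


def query_github(package_name):
    """Stub for the step that asks the GitHub API about the repository."""
    return {"github_date": "TODO", "watchers": "0", "stars": "0"}


def count_files(package_name):
    """Stub for the step that counts the files in the local repo."""
    return {"file_count": "0"}


def gather(package_name):
    """Merge the results of every step into one placeholder dictionary."""
    fields = {"package_name": package_name}
    for step in (scrape_extras_page, query_github, count_files):
        fields.update(step(package_name))
    return fields


if __name__ == "__main__":
    print(gather("NewsPublisher"))
```

Once the merged dictionary looks right, each stub can be swapped for a real implementation one at a time without touching the assembly code.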

In this article, we'll look at the Tpl chunk and its placeholders. In the next one, we'll see the outline of the code and the functions involved. In the articles after that, we'll see the finished functions themselves.


The Tpl chunk

I call it a Tpl chunk because it has the form of one and is used in the same way, but it's actually a file called template.html with MODX placeholders for specific bits of information. I gave it a .html suffix so my code editor (PhpStorm) would make sure it was valid HTML. This thing went through a lot of iterations before reaching its final form, but I won't bore you with those steps. It's loosely based on the code for other extras at the MODX docs site. Here is the final code of the Tpl chunk:

<ul>
    <li>
        <a href="#[[+package_name]]-Whatis[[+package_name]]">What is [[+package_name]]?</a>
    </li>

    <li>
        <a href="#[[+package_name]]-Information">Package Information</a>
    </li>

    <li>
        <a href="#[[+package_name]]-History">History</a>
    </li>

    <li>
        <a href="#[[+package_name]]-Download">Download</a>
    </li>

    <li>
        <a href="#[[+package_name]]-DevelopmentandBugReporting">Development and Bug Reporting</a>
    </li>

    <li>
        <a href="#[[+package_name]]-Documentation">Documentation</a>
    </li>

    <!--
    <li>
        <a href="#[[+package_name]]-SeeAlso">See Also</a>
    </li> -->
</ul>

<h2 id="[[+package_name]]-Whatis[[+package_name]]">What is [[+package_name]]?</h2>
[[+description]]

<h2 id="[[+package_name]]-Information">Package Information</h2>

[[+info]]


<h2 id="[[+package_name]]-History">History</h2>

[[+authors]]


<p>
    This version of the [[+package_name]] extra was developed by [[+author]]. It was first posted to GitHub on [[+github_date]]. As of [[+today]] it had been last updated on [[+github_updated_on]], had [[+commits]] commits, and had been downloaded [[+downloads]] times. The [[+package_name]] package consists of [[+file_count]] separate files, containing [[+lines]] lines of code. At that time, the [[+package_name]] repository had [[+watchers]] watchers and [[+stars]] stars.</p>

<p>It is currently maintained by [[+author]].</p>

<h2 id="[[+package_name]]-Download">Download</h2>

<p>
[[+package_name]] can be downloaded and installed from within the MODX Revolution Manager via
    <a href="/revolution/2.x/developing-in-modx/advanced-development/package-management" title="Package Manager"
       target="_blank">Package Manager</a> (Extras -&gt; Installer), or from the <a href="[[+package_url]]"
       target="_blank">MODX Extras Repository</a>.
</p>
<h2 id="[[+package_name]]-DevelopmentandBugReporting">Development and Bug Reporting
</h2>
<p>
[[+package_name]] is stored and developed using GitHub, and can be found here:
    <a href="[[+github_url]]" target="_blank">[[+package_name]] GitHub main page</a>.
</p>
<p>
Bugs and feature requests can be filed here:
    <a href="[[+github_issues_url]]" target="_blank">[[+package_name]] issues page</a>.
</p>

<p>Questions about how to use [[+package_name]] should be posted on the <a href="https://forums.modx.com"
                                                                           target="_blank">MODX Forums</a>.</p>
<h2 id="[[+package_name]]-Documentation">Documentation</h2>

<p>
The full documentation for [[+package_name]] can be found at the author's web site ([[+author_web_site_name]]): <a
        href="[[+documentation_url]]" target="_blank">[[+package_name]] Documentation</a>.
</p>

    <!--
    <h2 id="[[+package_name]]-SeeAlso">
    See Also
    </h2>

    <ol class="ug-toc see-also">
        <li>
        <a href="/extras/[[+url]]">[[+link_text]]</a>
        </li>

    </ol> -->


A few of the final files for this project had to be edited before posting because some of the information was not available for them, but most of them were used as is. If you would like to preview the results for a typical extra, you can take a look at the NewsPublisher page.
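
Filling a template like this one comes down to replacing each `[[+name]]` tag with its value, and the files that need hand editing are the ones where some tag had no value. Here is a minimal sketch of both ideas, in Python purely for illustration. `fill_template` and `unfilled_placeholders` are hypothetical helper names, the regular expression assumes placeholder names contain only letters, digits, and underscores (as they do in the template above), and MODX itself is not involved.

```python
import re

# Matches MODX-style placeholder tags such as [[+package_name]].
# Assumption: names contain only letters, digits, and underscores.
PLACEHOLDER = re.compile(r"\[\[\+(\w+)\]\]")


def fill_template(template, fields):
    """Replace each [[+name]] tag with fields[name]; leave unknown tags untouched."""
    def substitute(match):
        name = match.group(1)
        return str(fields[name]) if name in fields else match.group(0)
    return PLACEHOLDER.sub(substitute, template)


def unfilled_placeholders(text):
    """Return the sorted names of placeholder tags still present in text."""
    return sorted(set(PLACEHOLDER.findall(text)))


if __name__ == "__main__":
    template = (
        '<h2 id="[[+package_name]]-History">History</h2>\n'
        "<p>Developed by [[+author]].</p>"
    )
    page = fill_template(template, {"package_name": "NewsPublisher"})
    print(page)
    print(unfilled_placeholders(page))
```

Running a check like `unfilled_placeholders` on each generated file would flag the pages that still need manual attention before posting.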

The commented-out sections at the end of the UL list and at the bottom of the template are there in case a "See Also" section turns out to be appropriate for a given extra. That section would have to be created by hand because there's no way to automate it.


A Word About Coding Standards and Security

If this code were going into an extra, the functions would be part of a class file, the code specific to this project would be pulled out of the functions and handled with dependency injection, and some security issues would be dealt with more carefully. Some values that really should be passed as arguments are hard-coded into the functions. This code, however, was only going to run once when I finally got it working, so I didn't spend a lot of time making it "correct." If I end up using parts of this for other tasks, I may be sorry that I didn't do it properly, but as is, it does a nice job of illustrating the concepts I'm discussing.


Coming Up

In the next article we'll look at the skeleton of the script used to create the files.




