Scraping, Parsing, and API Use VI

Using the GitHub API to extract information

In this series of articles, we're looking at a variety of techniques that involve scraping, parsing, and using an API to get information. The use case is a script to assemble a documentation page for the MODX documentation site by gathering information from several different sources and saving the finished HTML code for the page to a file.

In the previous article, we looked at the getDataFromMODX() function that uses the helper functions from the article before that. In this one, we'll look at the implementation of another function from our skeleton code: getDataFromGitHub().

[Image: MODX logo]

Getting Data from GitHub

GitHub has been kind enough to create an API that allows us to get statistics on each repository. The endpoint (URL) to get data on a particular extra's repository takes this form:

    https://api.github.com/repos/{owner}/{repo}

For example, the URL for my StageCoach repository is:

    https://api.github.com/repos/BobRay/StageCoach

If you take the URL above and paste it into the address bar of a browser, you may see the JSON string that GitHub returns. I say "may" because GitHub imposes strict rate limits, so you can only do this successfully a few times per day. When you hit the limit, you'll see an unhelpful "Not Found" message. If someone else from your ISP has been querying the GitHub API, you may see that on your first try. That's why our curlGetData() function supplies credentials for a user account at GitHub (article III of this series), which vastly expands the limit.
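When a request fails (rate limit hit, or a nonexistent repo), the body GitHub sends back is still valid JSON, but it contains a message field instead of repo data. Here is a minimal sketch of detecting that case; the sample payload is a made-up stand-in for a live response:

```php
<?php
/* Hypothetical sample of an error payload like the one GitHub
   returns when a request fails; in the real script this string
   would come from curlGetData() */
$json = '{"message": "Not Found", "documentation_url": "https://docs.github.com"}';

/* true => return an associative array rather than an object */
$data = json_decode($json, true);

/* Successful repo responses have no top-level "message" key,
   so its presence signals an error response */
if (isset($data['message'])) {
    echo ' -- No GitHub Data (' . $data['message'] . ")\n";
}
```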

The process of getting the information is pretty simple: we use curlGetData() with the URL of the particular repository we want to know about, use json_decode() to convert the returned JSON string into a PHP associative array, and extract the data to our placeholder array ($phs).

Because we're looping through all my repos, this process is repeated many times.
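Condensed, the per-repo flow looks like this. The JSON string below is a trimmed, hypothetical stand-in for what curlGetData() would return from the live API:

```php
<?php
/* Trimmed, hypothetical sample of a GitHub repo payload; in the
   real script this string comes from curlGetData($url) */
$json = '{"name": "StageCoach",
          "html_url": "https://github.com/BobRay/StageCoach",
          "watchers_count": 7,
          "stargazers_count": 7}';

$phs = array();

/* true => decode to an associative array rather than an object */
$gitHubData = json_decode($json, true);

/* Extract the fields we care about into the placeholder array */
$phs['package_name'] = $gitHubData['name'];
$phs['github_url'] = $gitHubData['html_url'];
$phs['github_issues_url'] = $gitHubData['html_url'] . '/issues';
$phs['watchers'] = $gitHubData['watchers_count'];
$phs['stars'] = $gitHubData['stargazers_count'];
```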

Memory Refreshment

Our function to query the GitHub API calls the curlGetData() function from Article III of this series. I won't reproduce the whole function, but here are the parts that relate to GitHub:

function curlGetData($url, $returnData = true, $timeout = 6, $tries = 6) {
    $username = 'BobRay'; /* Replace with your GitHub username */
    $token = '***********************'; /* Replace with your GitHub Token */

    /* ... */

    if (strpos($url, 'github') !== false) {
        if (!empty($username) && !empty($token)) {
            curl_setopt($ch, CURLOPT_USERPWD, $username . ':' . $token);
        }
    }

    /* ... */

The function is called inside a loop like this (where $file is the name of the specific repo):

/* get data from GitHub */
    $gitHubUrl = 'https://api.github.com/repos/BobRay/';
    getDataFromGitHub($gitHubUrl, $file, $phs);

The URL sent to curlGetData() is the full URL to the specific repo, e.g., https://api.github.com/repos/BobRay/StageCoach. If the URL contains the string github, we tell cURL to supply the username and token. To get a personal API token for GitHub, see GitHub's documentation on creating a personal access token.
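The github check is just a substring test on the URL. Pulled out into a tiny helper (my own refactoring for illustration, not part of the original script), the logic looks like this:

```php
<?php
/* Return the CURLOPT_USERPWD value for GitHub URLs, or null when
   no credentials should be sent (non-GitHub URL, or no account
   details configured) */
function gitHubUserPwd($url, $username, $token) {
    if (strpos($url, 'github') !== false
            && !empty($username) && !empty($token)) {
        return $username . ':' . $token;
    }
    return null;
}
```

curl_setopt($ch, CURLOPT_USERPWD, ...) would then be called only when the return value is not null, so requests to non-GitHub URLs never carry the credentials.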

Getting the Data

Here is our function to get the data from GitHub for a particular repo:

/**
 * Get data from the GitHub API and put it in the placeholders array
 *
 * @param $gitHubUrl string - URL of API for the author
 * @param $file string - name of GitHub repository for this extra
 * @param $phs array - placeholder array (as reference)
 */
function getDataFromGitHub($gitHubUrl, $file, &$phs) {
    global $totalCommits;
    $url = $gitHubUrl . $file;

    $gitHubData = json_decode(curlGetData($url), true);
    if (isset($gitHubData['message']) && ($gitHubData['message'] === 'Not Found')) {
        echo " -- No GitHub Data ";
    } else {
        // echo "\n    " . $gitHubUrl;
        $phs['package_name'] = $gitHubData['name'];
        $phs['github_url'] = $gitHubData['html_url'];
        $phs['name'] = $gitHubData['name'];
        $phs['github_date'] = formatTime($gitHubData['created_at']);
        $phs['github_updated_on'] = formatTime($gitHubData['updated_at']);
        $phs['documentation_url'] = $gitHubData['homepage'];
        $phs['github_issues_url'] = $gitHubData['html_url'] . '/' . 'issues';
        $phs['watchers'] = $gitHubData['watchers_count'];
        $phs['stars'] = $gitHubData['stargazers_count'];

        /* Get commits */
        $commits = json_decode(curlGetData($url . '/commits'), true);
        $phs['commits'] = count($commits);
        $totalCommits += str_replace(',', '', $phs['commits']);
    }
}

The code above is fairly straightforward. We get the JSON string from GitHub with curlGetData() and convert it to a PHP associative array with json_decode(). Then, we set the placeholders we want to use (with a little occasional formatting). Remember that the $phs array is passed by reference to the function above (&$phs) so that the changes we make here will persist outside the function.
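The by-reference behavior is worth a quick illustration. With &$phs in the signature, the function operates on the caller's array itself, so keys set inside it survive the call; without the &, PHP would copy the array and the changes would be lost. A minimal demonstration (not from the original script):

```php
<?php
/* The & makes $phs an alias for the caller's array, not a copy */
function setPlaceholder(&$phs, $key, $value) {
    $phs[$key] = $value;
}

$phs = array();
setPlaceholder($phs, 'package_name', 'StageCoach');
/* $phs['package_name'] is now visible here, outside the function */
```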

At present, the commits count is unreliable. It shows only commits from the current year. As a result, I had to create a function to get the true commit count from my local Git repo for each extra (we'll look at that in the next article). There is a way to aggregate the commit counts from various years, but it's fairly complex. It was easier, and much faster, just to get them locally. I left the code to get the commit count here for you to see, but the value is overwritten by my own function. The comma is removed for commit counts over 1,000 when computing $totalCommits because PHP's string-to-number conversion stops at the first non-numeric character, so a value like '1,234' would be added as just 1.
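The comma problem is easy to demonstrate: casting a formatted count to a number silently truncates at the comma, so the total would be badly undercounted unless the comma is stripped first:

```php
<?php
$formatted = '1,234';

/* PHP's numeric conversion stops at the comma */
$wrong = (int) $formatted;

/* Stripping the comma first gives the real count */
$right = (int) str_replace(',', '', $formatted);
```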

At this writing, the watchers_count and the stargazers_count are the same, though they are intended to be different down the road. The first will be the number of people watching the repo, the second will be the number of people who have awarded a star rating to the repo. At least that's what the GitHub docs say now.

Coming Up

In the following article we'll look at the code used to get the total number of commits for each repo.

Looking for high-quality, MODX-friendly hosting? As of May 2016, Bob's Guides is hosted at A2 hosting.
