Scraping, Parsing, and API Use III

A generic cURL function for getting content or raw data from remote sites


In this series of articles, we're looking at a variety of techniques that involve scraping, parsing, and using an API to get information. The use case is a script to assemble a documentation page for the MODX documentation site by gathering information from several different sources and saving the finished HTML code for the page to a file.

In the previous article, we saw the "skeleton" of the project with empty functions to do the heavy lifting. In this one, we'll look at the curlGetData() function.


What is cURL?

The cURL library (libcurl) is installed by default in almost all installations of PHP these days. The name, cURL, stands for "client URL". It's a way of making HTTP requests to a web page and getting the returned headers and content of the page using PHP code. You might think of it as a simple browser that runs from a PHP script.

When you use a regular browser to visit a web page, the browser makes an HTTP request to the web page, using code much like the cURL code, and then displays the page in the browser, with modifications to the raw HTML code made (optionally) by any CSS and JavaScript referenced on the page. There's a lot going on under the hood, but we don't need to bother with that here. When you use cURL to make a similar request, you get back only the raw HTML code of the page as the server sent it. cURL doesn't run JavaScript, so any changes a script would make during the page load are not included.

If the URL you're making the request to is an API, you'll typically get back a JSON (JavaScript Object Notation) string containing data rather than HTML code. JSON is just a convenient way of encoding information that gets passed around in HTTP requests. The information is converted into a single string before being transmitted, usually with PHP's json_encode() function. On the receiving end, the JSON string can be converted back into a PHP object or array using PHP's json_decode(). There are also two MODX convenience methods that do essentially the same thing: $modx->toJSON() and $modx->fromJSON().
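As a quick illustration of that round trip, here is a minimal sketch using only PHP's built-in functions. The sample data is made up for the example.

```php
<?php
// A small PHP array standing in for data an API might send (values are made up).
$data = ['id' => 1, 'name' => 'Example User', 'tags' => ['a', 'b']];

// Encode to a single JSON string, as a server would before transmitting it.
$json = json_encode($data);
echo $json, "\n"; // {"id":1,"name":"Example User","tags":["a","b"]}

// Decode back into a PHP associative array (second argument is true).
$asArray = json_decode($json, true);

// Without the second argument, JSON objects become stdClass objects instead.
$asObject = json_decode($json);

echo $asArray['name'], "\n"; // Example User
echo $asObject->name, "\n";  // Example User
```

Which form you pick is mostly a matter of taste. The rest of this article uses the associative-array form.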

Let's look at an example. There's a fake API at typicode.com. In fact, if you click on this link, you can see the JSON code for a list of 10 users just as it would be retrieved by a cURL request.

Imagine that we got that JSON data string with cURL and placed it in a variable: $s. Here's that same string of information after running it through json_decode($s, true) (the second argument to json_decode() tells it that we want a PHP associative array) — I'm showing just the first three users to save space:

$users = array(
    0 => array(
            'id' => 1,
            'name' => 'Leanne Graham',
            'username' => 'Bret',
            'email' => 'Sincere@april.biz',
            'address' => array(
                'street' => 'Kulas Light',
                'suite' => 'Apt. 556',
                'city' => 'Gwenborough',
                'zipcode' => '92998-3874',
                'geo' => array(
                    'lat' => '-37.3159',
                    'lng' => '81.1496',
                ),
            ),
            'phone' => '1-770-736-8031 x56442',
            'website' => 'hildegard.org',
            'company' => array(
                'name' => 'Romaguera-Crona',
                'catchPhrase' => 'Multi-layered client-server neural-net',
                'bs' => 'harness real-time e-markets',
            ),
        ),
    1 => array(
            'id' => 2,
            'name' => 'Ervin Howell',
            'username' => 'Antonette',
            'email' => 'Shanna@melissa.tv',
            'address' => array(
                'street' => 'Victor Plains',
                'suite' => 'Suite 879',
                'city' => 'Wisokyburgh',
                'zipcode' => '90566-7771',
                'geo' => array(
                    'lat' => '-43.9509',
                    'lng' => '-34.4618',
                ),
            ),
            'phone' => '010-692-6593 x09125',
            'website' => 'anastasia.net',
            'company' => array(
                'name' => 'Deckow-Crist',
                'catchPhrase' => 'Proactive didactic contingency',
                'bs' => 'synergize scalable supply-chains',
            ),
        ),
    2 => array(
            'id' => 3,
            'name' => 'Clementine Bauch',
            'username' => 'Samantha',
            'email' => 'Nathan@yesenia.net',
            'address' => array(
                'street' => 'Douglas Extension',
                'suite' => 'Suite 847',
                'city' => 'McKenziehaven',
                'zipcode' => '59590-4157',
                'geo' => array(
                    'lat' => '-68.6102',
                    'lng' => '-47.0653',
                ),
            ),
            'phone' => '1-463-123-4447',
            'website' => 'ramiro.info',
            'company' => array(
                'name' => 'Romaguera-Jacobson',
                'catchPhrase' => 'Face to face bifurcated interface',
                'bs' => 'e-enable strategic applications',
            ),
        ),
);
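Once the data is a PHP array, getting at individual values is ordinary array work. The sketch below uses an abbreviated copy of the first two users (only a few of the fields) so that it can run on its own.

```php
<?php
// Abbreviated copy of the first two users from the array above.
$users = [
    ['id' => 1, 'name' => 'Leanne Graham', 'email' => 'Sincere@april.biz',
     'address' => ['city' => 'Gwenborough']],
    ['id' => 2, 'name' => 'Ervin Howell', 'email' => 'Shanna@melissa.tv',
     'address' => ['city' => 'Wisokyburgh']],
];

// Nested values are reached by chaining keys.
echo $users[0]['address']['city'], "\n"; // Gwenborough

// Build a simple name => email lookup from the list.
$emails = array_column($users, 'email', 'name');
echo $emails['Ervin Howell'], "\n"; // Shanna@melissa.tv

foreach ($users as $user) {
    // The ?? operator supplies a fallback if a key is missing.
    $city = $user['address']['city'] ?? 'unknown';
    echo $user['name'], ' lives in ', $city, "\n";
}
```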

You might be wondering how I got that nice PHP array out of the variable. Neither var_dump() nor print_r() will do it. Instead, I used this code to produce the array:

echo var_export(json_decode($s, true), true);

The var_export() function produces code that can actually be used by PHP rather than the purely informative displays created by var_dump() and print_r(). This is really handy if you want to update PHP code in a file using PHP code to do the job.
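Here is a small sketch of that "PHP writing PHP" idea. It assumes a made-up settings array and a temporary file; neither is part of the project.

```php
<?php
// A made-up array to save; any array of scalars works the same way.
$settings = ['timeout' => 6, 'tries' => 6, 'verify' => false];

// With true as the second argument, var_export() returns valid PHP source.
$code = var_export($settings, true);

// Write a small PHP file that returns the array ...
$file = tempnam(sys_get_temp_dir(), 'cfg');
file_put_contents($file, "<?php\nreturn " . $code . ";\n");

// ... and load it back with include.
$loaded = include $file;
unlink($file);

var_dump($loaded === $settings); // bool(true)
```

The same thing is not possible with the output of print_r() or var_dump(), because neither produces text that PHP can parse.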


cURL can also be used to make POST requests, for example, to simulate submitting a form, or to create a bot to register fake users on a forum (of course you'd never do that, right?), but that's beyond the scope of this article. We only need to use cURL to grab data.


Our curlGetData() function

For our purposes, the curlGetData() function is pretty vanilla, with one exception: it uses GitHub authentication. The function is called a number of times in our project, so it's quite generic. It gets the data and returns it with a little added error handling. Sometimes it will return HTML, sometimes it will return JSON. It depends on what the page returns. Here's the code:

/**
 * Function to get the content of a web page or a return from the GitHub API with cURL
 *
 * @param string $url - full URL to request
 * @param bool $returnData - Sets whether to return the content or just true/false
 * @param int $timeout - cURL timeout value
 * @param int $tries - number of attempts to make
 *
 * @return bool|mixed - content if $returnData is true; else true/false (success/failure)
 */
function curlGetData($url, $returnData = true, $timeout = 6, $tries = 6) {
    $username = 'BobRay'; /* Replace with your GitHub username */
    $token = '***********************'; /* Replace with your GitHub Token */
    $retVal = false;
    $errorMsg = '(' . $url . ' - curl) ' . 'failed';
    $ch = curl_init();

    curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
    curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 6.0)");
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, $returnData);
    curl_setopt($ch, CURLOPT_HEADER, false);
    curl_setopt($ch, CURLOPT_NOBODY, !$returnData);
    if (strpos($url, 'github') !== false) {

        if (!empty($username) && !empty($token)) {
            curl_setopt($ch, CURLOPT_USERPWD, $username . ':' . $token);
        }
    }

    $i = $tries;

    while ($i--) {
        $retVal = @curl_exec($ch);
        if (!empty($retVal)) {
            break;
        }
    }

    if (empty($retVal) || ($retVal === false)) {
        $e = curl_error($ch);
        if (!empty($e)) {
            $errorMsg = $e;
        }
        echo "\n" . $errorMsg;
    } elseif (!$returnData) { /* Just checking for existence */
        $statusCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        $retVal = $statusCode == 200 || $statusCode == 301 || $statusCode == 302;
    }
    curl_close($ch);
    return $retVal;
}
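Because curlGetData() can hand back HTML, JSON, or false, the calling code has to work out which one it got. The helper below is one way that might be done. The name interpretResponse() is hypothetical and is not part of the project.

```php
<?php
/**
 * Hypothetical helper: interpret whatever curlGetData() returned.
 *
 * @param mixed $raw - return value of curlGetData()
 * @return array|string|null - array for JSON objects/arrays, the raw string
 *                             for anything else (e.g. HTML), null on failure
 */
function interpretResponse($raw) {
    if (!is_string($raw) || $raw === '') {
        return null; /* The request failed or returned nothing */
    }
    $decoded = json_decode($raw, true);
    if (json_last_error() === JSON_ERROR_NONE && is_array($decoded)) {
        return $decoded;
    }
    return $raw; /* Not JSON, so treat it as HTML or plain text */
}

/* Possible usage with the function above:
 *   $result = interpretResponse(curlGetData($url));
 */
```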


This is a fairly standard cURL function that, with a little modification, can be used anywhere you need to get data from a remote site with cURL. Setting CURLOPT_SSL_VERIFYPEER to 0 is a little insecure, but setting it to 1 will sometimes make the request fail (for example, when PHP can't find a set of CA certificates to check against). With verification off, cURL doesn't confirm that it's really talking to modx.com or github.com. That would create a security issue only if one of those sites were hijacked or impersonated *and* the downloaded data were going to be written to a DB, or displayed in a spot where it might execute. Since it's just going to be written to a file, this is not a serious problem for our project.
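If you reuse the function somewhere that security matters more, you'd want verification turned on. The sketch below shows the options that would typically be involved. The helper name secureCurlOptions() and the path to the certificate file are assumptions for illustration.

```php
<?php
/**
 * Hypothetical helper: cURL options for a verified HTTPS request.
 *
 * @param string|null $caBundle - assumed path to a CA certificate file
 * @return array - options suitable for curl_setopt_array()
 */
function secureCurlOptions($caBundle = null) {
    $options = array(
        CURLOPT_SSL_VERIFYPEER => true, /* Check the server's certificate */
        CURLOPT_SSL_VERIFYHOST => 2,    /* Check that it matches the host name */
    );
    /* Only point cURL at the bundle if the file is really there */
    if ($caBundle !== null && is_readable($caBundle)) {
        $options[CURLOPT_CAINFO] = $caBundle;
    }
    return $options;
}

/* Possible usage after curl_init():
 *   curl_setopt_array($ch, secureCurlOptions('/path/to/cacert.pem'));
 */
```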

The CURLOPT_URL value is a string containing the full URL of the site we want to contact.

For some sites CURLOPT_FOLLOWLOCATION would have to be set to true, since there might be a redirect involved, but that's not the case with our URLs. We're not using the headers for anything, so CURLOPT_HEADER is set to false.

The second argument ($returnData) defaults to true because we always want returned data for our project. The CURLOPT_NOBODY setting is the opposite of the $returnData value. These are options because there are cases where you don't want the data back; you just want to see if the URL exists (say, for link-checking code). In that case, set $returnData to false. cURL then requests only the headers, and the function returns true if the status code is 200, 301, or 302 and false otherwise. The CURLOPT_USERAGENT setting is arbitrary, but some sites are cranky about returning any data if it's not set.
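The link-checking case might look something like the sketch below. The findBrokenLinks() function is hypothetical. It takes the existence test as a callable so the example can run without a network connection.

```php
<?php
/**
 * Hypothetical link checker: returns the URLs that fail the existence test.
 *
 * @param array $urls - URLs to check
 * @param callable $exists - takes a URL, returns true if it exists
 * @return array - URLs for which $exists returned false
 */
function findBrokenLinks(array $urls, callable $exists) {
    $broken = array();
    foreach ($urls as $url) {
        if (!$exists($url)) {
            $broken[] = $url;
        }
    }
    return $broken;
}

/* With the function from this article, the call would be something like:
 *   $broken = findBrokenLinks($urls, function ($url) {
 *       return curlGetData($url, false) === true;
 *   });
 */
```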

The CURLOPT_TIMEOUT setting is in seconds and defaults to 6 because both MODX and GitHub can time out if they're busy and you use a short timeout value.

The $tries setting determines how many times the request will be attempted. It won't be necessary for all sites, but it can't hurt.
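The retry idea can be pulled out into a small helper of its own. This is a sketch, not the code used in the project: the retry() name is made up, and the optional pause between attempts is an addition that curlGetData() doesn't have.

```php
<?php
/**
 * Hypothetical retry helper: call $attempt until it returns a non-empty value.
 *
 * @param callable $attempt - function to call; returns data or false
 * @param int $tries - maximum number of attempts
 * @param int $pauseMs - milliseconds to wait between failed attempts
 * @return mixed - first non-empty result, or the last (empty) one
 */
function retry(callable $attempt, $tries = 6, $pauseMs = 0) {
    $result = false;
    for ($i = 0; $i < $tries; $i++) {
        $result = $attempt();
        if (!empty($result)) {
            break;
        }
        /* No point in pausing after the final attempt */
        if ($pauseMs > 0 && $i < $tries - 1) {
            usleep($pauseMs * 1000);
        }
    }
    return $result;
}
```

A short pause gives a busy server a moment to recover, at the cost of making a failing request take longer to give up.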

The $username and $token settings are necessary because without them, GitHub allows only a small number of unauthenticated API requests from your IP address (its documentation lists 60 per hour). If you share an IP address, other users there might also be making GitHub API calls and you may not get any response at all. Notice that the credentials are only used when the URL contains "github" and both values are non-empty. Often, after you've passed the rate limit, GitHub will respond with an unhelpful "Not Found" message. If you intend to actually use this code with GitHub, you'll definitely want to have a GitHub account (it's free), and get an API token (also free). To get a personal API token for GitHub, see this page.
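If you'd rather not keep the token in the source file, one option is to read the credentials from environment variables. The sketch below assumes two variable names, GITHUB_USERNAME and GITHUB_TOKEN, chosen for this example. The githubCredentials() function is hypothetical as well.

```php
<?php
/**
 * Hypothetical helper: build the "username:token" string for CURLOPT_USERPWD.
 * The environment variable names are assumptions made for this example.
 *
 * @return string|null - credentials, or null if either value is missing
 */
function githubCredentials() {
    $username = getenv('GITHUB_USERNAME');
    $token = getenv('GITHUB_TOKEN');
    if (empty($username) || empty($token)) {
        return null; /* Fall back to unauthenticated requests */
    }
    return $username . ':' . $token;
}

/* Possible usage inside curlGetData():
 *   $credentials = githubCredentials();
 *   if ($credentials !== null) {
 *       curl_setopt($ch, CURLOPT_USERPWD, $credentials);
 *   }
 */
```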


Coming Up

In the following article we'll look at the simple_html_dom class we'll be using to parse raw HTML from the MODX repository page for each extra.




