Scraping, Parsing, and API Use IX

A general-purpose time formatting function, and parsing a Markdown readme.md file to get the authors of an extra


In this series of articles, we're looking at a variety of techniques that involve scraping, parsing, and using an API to get information. The use case is a script to assemble a documentation page for the MODX documentation site by gathering information from several different sources and saving the finished HTML code for the page to a file.

In the previous article, we looked at the getFileCount() function, which used the DirWalker class to count the number of files and the number of lines of code for each extra. In this one, we'll look at the implementation of two utility functions that we skipped in previous articles: formatTime() and getAuthors().


The formatTime() Function

I separated out this function because it's handy to use in lots of different situations. You often have a preferred way of displaying the date and/or time something happened. PHP has a number of functions to handle this problem. The difficulty is that the input you have can change across situations. Sometimes, you have a Unix timestamp, which is just a big number representing the number of seconds since 00:00:00 (12:00 am) UTC on January 1, 1970. You almost never want to display this number. Instead, you want a human-readable date/time string like this one: 7 pm January 5, 2017.

At other times, you already have a human-readable date/time string, but almost never in the format you want to display. If you use PHP to get a date/time from MODX, you'll almost always get a human-readable string, even though what's stored in the MODX database is a Unix timestamp. The exceptions are when you query the DB directly for the raw value or, in the case of Template Variables, when you use $tv->getValue().

Whenever you have a human-readable date in the wrong format, it's almost always easier to have PHP convert it to a timestamp and then format the timestamp.
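As a minimal sketch of that two-step approach, the code below converts a date/time string to a timestamp with strtotime() and then formats it with date(). The input string and the output format are illustrative assumptions, not part of the documentation script, and the time zone is set to UTC only so the result is predictable:

```php
<?php
// Illustrative only: the input string and the format are assumptions.
date_default_timezone_set('UTC');

// Step 1: convert the human-readable string to a Unix timestamp.
$timestamp = strtotime('2017-01-05 19:00:00');

// Step 2: format the timestamp the way we want to display it.
$display = date('g a F j, Y', $timestamp);

echo $display . "\n"; // 7 pm January 5, 2017
```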

This very compact two-line function will detect whether the input is a timestamp or a human-readable date, and act accordingly, returning a string containing the date/time in your preferred format. Here is the function:

/**
 * Format both timestamps and human-readable date/time strings
 *
 * @param $t int | string - timestamp or date/time string
 * @param string $format - format template for strftime()
 *
 * @return string - formatted time
 */

function formatTime($t, $format = "%b %d, %Y") {
    $t = is_numeric($t) ? $t : strtotime($t);
    return strftime($format, $t);
}

In the code above, $t is the incoming time/date, which can be either a Unix timestamp or a human-readable date. The second argument defaults to "%b %d, %Y", which presents the date/time in the form Feb 01, 2011, but you can send any valid strftime() format string. For example, the string "%Y" will return just the four-digit year. The strftime() function will respect the locale setting (set with setlocale()), so it will automatically display month and day names in the appropriate language.

See the format table on the strftime() page of the PHP manual for a reference to all the possible formats. Note that some format-string elements are not supported on all platforms.

The first line of the code above uses is_numeric() to see if the value of $t is numeric (a Unix timestamp). If so, it does nothing with it. If it's not numeric, it must be a date/time string, so we convert it to a timestamp with strtotime(). The strtotime() function will convert most common English date/time strings to a timestamp, though it wouldn't be a bad idea to test $t after that first line with if ($t === false) and handle the error if the conversion fails.

The first line above uses PHP's ternary operator for a compact and efficient "if/then/else" test. It is equivalent to the code below:

if (is_numeric($t)) {
   $t = $t; // no change
} else {
   $t = strtotime($t); // convert to timestamp
}

Note that an actual timestamp can be passed to this function as either an integer or a string. The is_numeric() function will return true for either 1504843291 or '1504843291'. The human-readable string '5/22/2017', on the other hand, is not numeric.
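The short check below is only a sketch with illustrative sample values. It shows what is_numeric() reports for the inputs just mentioned, plus one edge case worth knowing about: a bare year passed as a string is also numeric, so formatTime() would treat it as a timestamp rather than as a year.

```php
<?php
// Sample inputs are illustrative; the last one shows an edge case.
$samples = array(1504843291, '1504843291', '5/22/2017', '2017');
$results = array();

foreach ($samples as $sample) {
    // true means formatTime() would treat the value as a timestamp
    $isTimestamp = is_numeric($sample);
    $results[] = $isTimestamp;
    echo var_export($sample, true) . ' => ' . var_export($isTimestamp, true) . "\n";
}
```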

After the first line has executed, $t definitely contains a Unix timestamp, so in the last line, we simply convert it to a human-readable string with strftime(), using our format string as the first argument, and return it.

If you want to play with formatting timestamps in human-readable form, you can paste the function above into a snippet and add this code below it:

return formatTime(time(), 'some format string');

The snippet should display the human-readable form of the current date/time returned by time().
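Two caveats are worth a sketch. First, strtotime() returns false when it can't parse a string. Second, strftime() is deprecated as of PHP 8.1. The variant below is a hypothetical alternative, not the function used in this series: the name formatTimeChecked() is made up, it uses date() instead of strftime(), so the format string follows date() rules, and it returns null for input it can't parse. Unlike strftime(), date() always produces English month and day names regardless of setlocale().

```php
<?php
/**
 * Hypothetical variant of formatTime() (not the article's function).
 * Uses date() instead of strftime() and reports unparseable input.
 *
 * @param int|string $t - timestamp or date/time string
 * @param string $format - format string for date(), not strftime()
 *
 * @return string|null - formatted time, or null if $t can't be parsed
 */
function formatTimeChecked($t, $format = 'M d, Y') {
    $t = is_numeric($t) ? (int) $t : strtotime($t);
    if ($t === false) {
        return null; // strtotime() could not understand the string
    }
    return date($format, $t);
}

date_default_timezone_set('UTC'); // only to make the sample output predictable
echo formatTimeChecked('2011-02-01') . "\n"; // Feb 01, 2011
var_dump(formatTimeChecked('garbage'));      // NULL
```

Returning null is just one way to signal failure. Throwing an exception or returning an empty string would work as well, depending on how the caller wants to handle bad input.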


The getAuthors() Function

This function took quite a while to develop. I saved it for this article because it uses a regular expression to pull and format the author(s) of the extra and it will take some effort to explain. It gets the author(s) from the readme.md file for each extra. That file is used by GitHub for the display you see on the main page of a repository. The readme.md file uses the Markdown language to format the output. You may have read that it's not a good idea to use regular expressions to parse HTML files, and it's true unless your use case is very simple. Markdown, on the other hand, is a more reliable target for regular expression use.

The reason it was so difficult to develop this function is that the author(s) listed for my extras can take a number of forms. Here are some examples:

ClassExtender

**Author:** Bob Ray [Bob's Guides](https://bobsguides.com)

Intended output:

<ul>
    <li>Author: Bob Ray <a href="https://bobsguides.com">Bob's Guides</a>
</ul>

FileUpload

**Author:** Michel van de Wetering [Michel van de Wetering GitHub](https://github.com/mvdwetering)
<br>
**Author:** Bob Ray [Bob's Guides](https://bobsguides.com)

Intended output:

<ul>
    <li>Author: Michel van de Wetering <a href="https://github.com/mvdwetering">Michel van de Wetering GitHub</a>
    <li>Author: Bob Ray <a href="https://bobsguides.com">Bob's Guides</a>
</ul>

NewsPublisher

**Evolution Author:** Raymond Irving [SlideShare](https://www.slideshare.net/xwisdom)
<br>
**Revolution Author:** Bob Ray [Bob's Guides](https://bobsguides.com)

**Contributors:** Invaluable fixes, improvements, and feature additions were created and tested by Markus Schlegel, donshakespeare, Bruno17, Gregor Šekoranja, Alberto Ramacciotti, and others too numerous to mention.

Intended output:

<ul>
    <li>Evolution Author: Raymond Irving <a href="https://www.slideshare.net/xwisdom">SlideShare</a>
    <li>Revolution Author: Bob Ray <a href="https://bobsguides.com">Bob's Guides</a>
    <li>Contributors: Invaluable fixes, improvements, and feature additions were created and tested by Markus Schlegel, donshakespeare, Bruno17, Gregor Šekoranja, Alberto Ramacciotti, and others too numerous to mention.
    </li>
</ul>

I confess that I had to do a little editing of the readme.md files for my extras to make sure the Author(s) section was standardized.


The Code

Here is the code of the getAuthors() function:

/**
 * Get the Authors of an extra from the readme.md file
 *
 * @param $readme_md string - content of readme.md file
 *
 * @return string - formatted Authors section
 */
function getAuthors($readme_md) {
    $outerPattern = '/\*\*([^\]]*Author|Authors|Collaborators|Collaborator|Contributors|Contributor):\*\*([^\n]+)/';
    $matches = array();

    preg_match_all($outerPattern, $readme_md, $matches);
    // echo "\n MATCHES: " . print_r($matches, true) . "\n";
    $titles = array();
    $count = count($matches[1]);

    for ($j = 0; $j < $count; $j++) {
        $line = $matches[1][$j] . ': ' . $matches[2][$j];

        if ((strpos($line, ']') !== false) && strpos($line, ')') !== false) {
            $innerPattern = '/([^\[]+)\[([^\]]+)\]\(([^\)]+)\)/';
            preg_match($innerPattern, $line, $hits);
            // echo "\n Hits: " . print_r($hits, true) . "\n";
            $titles[] = "\n    <li>" . $hits[1] . '<a href="' . $hits[3] . '">' . $hits[2] . '</a>';
        } else {
            $titles[] = "\n    <li>" . $line . '</li>';
        }
        // echo $newLine . "\n";
    }
    // echo $result;
    return "\n<ul>" . implode('', $titles) . "\n</ul>";
}

The only argument for the function is $readme_md, which is the content of the readme.md file for one extra. It's called like this when looping through the extras:

/* Get readme.md content from local file */
$readme_md = file_get_contents($localPath . $file . '/readme.md');

/* Set authors placeholder */
$phs['authors'] = getAuthors($readme_md);

The outer regular expression search uses this pattern:

$outerPattern = '/\*\*([^\]]*Author|Authors|Collaborators|Collaborator|Contributors|Contributor):\*\*([^\n]+)/';

The slashes at each end are simply delimiters that tell preg_match_all() where the pattern begins and ends. The pattern itself looks for text that begins with two literal * characters (\*\*). These asterisks have to be escaped with backslashes because the asterisk has a special meaning in regular expression (regex) searches and we want only literal asterisks. Next comes the first capture group, which is a set of alternatives separated by | characters: either any sequence of characters that is not a closing square bracket followed by Author ([^\]]*Author), which allows titles like Evolution Author, or exactly one of the strings Authors, Collaborators, Collaborator, Contributors, or Contributor. That must be followed by a colon and two more literal asterisks (:\*\*), and then by any sequence of characters that is not a newline ([^\n]+). The two sets of parentheses in the pattern indicate "capture groups" that we want preg_match_all() to extract from the content. What they capture is shown in the array below.

When we call preg_match_all() with that pattern on the content of the NewsPublisher readme.md file, we get back this PHP array:

 Array (
    [0] => Array
        (
            [0] => **Evolution Author:** Raymond Irving [SlideShare](https://www.slideshare.net/xwisdom)
            [1] => **Revolution Author:** Bob Ray [Bob's Guides](https://bobsguides.com)
            [2] => **Contributors:** Invaluable fixes, improvements, and feature additions were created and tested by Markus Schlegel, donshakespeare, Bruno17, Gregor Šekoranja, Alberto Ramacciotti, and others too numerous to mention.
        )

    [1] => Array
        (
            [0] => Evolution Author
            [1] => Revolution Author
            [2] => Contributors
        )

    [2] => Array
        (
            [0] =>  Raymond Irving [SlideShare](https://www.slideshare.net/xwisdom)
            [1] =>  Bob Ray [Bob's Guides](https://bobsguides.com)
            [2] =>  Invaluable fixes, improvements, and feature additions were created and tested by Markus Schlegel, donshakespeare, Bruno17, Gregor Šekoranja, Alberto Ramacciotti, and others too numerous to mention.
        )

)
 

The whole pattern captures the three strings in the Markdown code that contain any of the search terms (Author, Collaborator, Contributor, etc.). The three full strings are in the first member of the array ([0]).

The next member of the array ([1]) contains the titles from each of the three lines. The final member of the array ([2]) contains the author name, link text, and link URL from each of the three lines.
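To see how the alternatives in the first capture group behave, here is a small sketch that runs the same outer pattern on made-up sample text (the names and URL are invented for illustration). Because the [^\]]* prefix belongs only to the Author alternative, a prefixed title such as Lead Author matches, while Lead Contributors does not:

```php
<?php
// The pattern is copied from getAuthors(); the sample text is made up.
$outerPattern = '/\*\*([^\]]*Author|Authors|Collaborators|Collaborator|Contributors|Contributor):\*\*([^\n]+)/';

$sample = "**Lead Author:** Jane Doe [Site](https://example.com)\n"
        . "<br>\n"
        . "**Lead Contributors:** Nobody in particular\n"
        . "**Contributors:** Several people\n";

preg_match_all($outerPattern, $sample, $matches);

// $matches[1] holds the titles that were captured:
// 'Lead Author' and 'Contributors', but not 'Lead Contributors'
print_r($matches[1]);
```

This is one reason a standardized Author(s) section in each readme.md file matters: the pattern only recognizes the title forms it was written for.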

Once we have this array (contained in the $matches variable), we need to parse out the title, people, link text, and URL. We do that by walking through the array and using another regex search, but first we concatenate the corresponding entries from the second ([1]) and third ([2]) members of the outer array. The variable $j is the index of the line being processed. So on the first pass, $j is zero, and this line:

$line = $matches[1][$j] . ': ' . $matches[2][$j];

gives us Evolution Author and adds a colon and space followed by Raymond Irving [SlideShare](https://www.slideshare.net/xwisdom) to the end of it, or:

Evolution Author: Raymond Irving [SlideShare](https://www.slideshare.net/xwisdom)

In order to pull out the components of this line before we build our final HTML, we use preg_match() with this pattern:

$innerPattern = '/([^\[]+)\[([^\]]+)\]\(([^\)]+)\)/';

The pattern looks first for any sequence of characters that is not an opening square bracket (([^\[]+)) and captures it. This captures both the title (Evolution Author:) and the author's name. Then the pattern looks for a literal opening bracket (\[), followed by any sequence of characters that doesn't contain a closing square bracket, followed by a literal closing square bracket (\[([^\]]+)\]). That captures the link text (SlideShare). Notice that we don't capture the opening and closing brackets because they are outside the parentheses. Finally, we do a similar search for a literal opening parenthesis, followed by any sequence of characters that is not a closing parenthesis, followed by a literal closing parenthesis (\(([^\)]+)\)). Again, we don't capture the parentheses, just what's inside them. This is the URL.

The resulting $hits array looks like this:

Array (
    [0] => Evolution Author: Raymond Irving [SlideShare](https://www.slideshare.net/xwisdom)
    [1] => Evolution Author: Raymond Irving
    [2] => SlideShare
    [3] => https://www.slideshare.net/xwisdom
)

Member 0 of the $hits array is the whole string. Member 1 is the title and person. Member 2 is the link text. Member 3 is the URL.

Now that we have the components, it's relatively simple to put them together to create the HTML we want for our final output for one author or contributor:
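The inner search can be tried on its own with a sketch like the one below. The pattern and the sample line come from this article, but the variable names $titleAndName, $linkText, and $url are illustrative. It also checks the return value of preg_match(), which is 1 on a match and 0 when nothing matches:

```php
<?php
// Pattern copied from getAuthors(); the sample line comes from the article.
$innerPattern = '/([^\[]+)\[([^\]]+)\]\(([^\)]+)\)/';
$line = "Evolution Author: Raymond Irving [SlideShare](https://www.slideshare.net/xwisdom)";

// preg_match() returns 1 on a match, 0 on no match
if (preg_match($innerPattern, $line, $hits) === 1) {
    // Skip $hits[0] (the whole match) and name the three capture groups
    list(, $titleAndName, $linkText, $url) = $hits;

    // Group 1 keeps the space that precedes the opening bracket
    echo trim($titleAndName) . "\n"; // Evolution Author: Raymond Irving
    echo $linkText . "\n";           // SlideShare
    echo $url . "\n";                // https://www.slideshare.net/xwisdom
}
```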

if ((strpos($line, ']') !== false) && strpos($line, ')') !== false) {
    $innerPattern = '/([^\[]+)\[([^\]]+)\]\(([^\)]+)\)/';
    preg_match($innerPattern, $line, $hits);
    // echo "\n Hits: " . print_r($hits, true) . "\n";
    $titles[] = "\n    <li>" . $hits[1] . '<a href="' . $hits[3] . '">' . $hits[2] . '</a>';
} else {
    $titles[] = "\n    <li>" . $line . '</li>';
}

After looping through the three entries (the two authors and the contributors) and placing each in our (poorly named) $titles array, we use implode() to compress them into a single block of HTML code and return it:

return "\n<ul>" . implode('', $titles) . "\n</ul>";

This result matches the "Intended output" shown for NewsPublisher earlier in this article. Notice that before running the inner regex search, we test each line with the following condition, which checks for the closing bracket and closing parenthesis of a link, because the link text and URL may not exist for some extras:

if ((strpos($line, ']') !== false) && strpos($line, ')') !== false)

If the link text and URL are missing, we present the line we got from the first regex search with no modifications.
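A more defensive variation is sketched below. It is not the code used for the documentation script, and the helper names esc() and buildAuthorItem() are hypothetical. Instead of the strpos() test, it relies on the return value of preg_match(), so a line that happens to contain ] and ) without being a real link can't leave $hits empty. It also escapes the text for HTML and closes every list item:

```php
<?php
/** Hypothetical helper: escape text for use in HTML output. */
function esc($text) {
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

/**
 * Hypothetical helper (not part of the original script): build one <li>
 * from a "Title: Name [link text](url)" line.
 */
function buildAuthorItem($line) {
    $innerPattern = '/([^\[]+)\[([^\]]+)\]\(([^\)]+)\)/';

    if (preg_match($innerPattern, $line, $hits) === 1) {
        // $hits[1] = title and name, $hits[2] = link text, $hits[3] = URL
        return '<li>' . esc($hits[1])
            . '<a href="' . esc($hits[3]) . '">' . esc($hits[2]) . '</a></li>';
    }
    // No Markdown link found: output the line as plain text
    return '<li>' . esc($line) . '</li>';
}

echo buildAuthorItem('Author: Bob Ray [Guides](https://bobsguides.com)') . "\n";
echo buildAuthorItem('Contributors: A & B') . "\n";
```

Whether the escaping is needed depends on how much you trust the readme.md content. For files you maintain yourself, the simpler original is probably fine.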



Wrapping Up

In the sections above, we saw how to create a formatted date/time string from either a timestamp or a human-readable string with formatTime(). We also saw how to use regular expression searches to pull information out of a readme.md file formatted in Markdown with getAuthors(). If you followed that second section closely, you got a brief introduction to regular expression (regex) searches, a technique that's extremely valuable for any PHP programmer.



Coming Up

In the next article, we'll look at the code that writes the final HTML files I pasted into the MODX documentation pages using our placeholders array ($phs) and our Tpl chunk.




