Scraping, Parsing, and API Use VIII

Counting the number of files in a directory and its subdirectories with DirWalker

In this series of articles, we're looking at a variety of techniques that involve scraping, parsing, and using an API to get information. The use case is a script to assemble a documentation page for the MODX documentation site by gathering information from several different sources and saving the finished HTML code for the page to a file.

In the previous article, we looked at the getCommitCount() function used to count the number of commits in a local Git repository. In this one, we'll look at the implementation of another function from our skeleton code: getFileCount(), which uses the DirWalker class to count the number of files in a directory and all its subdirectories.

I wrote DirWalker a while back as part of the MyComponent extra. I needed to collect all files of a certain type in an extra's directories for various reasons. Sometimes, I wanted to make sure all properties used in the PHP or HTML files were represented in a snippet's Properties array. At other times, I was looking for System Settings, or checking to make sure all lexicon tags had an entry in the lexicon files.

DirWalker "walks" through a directory and all its subdirectories, building a PHP associative array in which each key is the full path to a file (including the filename) and each value is just the name of the file. Optionally, DirWalker accepts filters that include or exclude certain file extensions or directories. The search can be recursive or limited to a single directory.
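To make the shape of that array concrete, here is a minimal sketch that builds the same kind of result (full path as key, filename as value) with PHP's built-in SPL iterators. It is not DirWalker's actual code: the walkFiles() function is a hypothetical stand-in, and it leaves out the include and exclude filters entirely.

```php
<?php
/**
 * Hypothetical stand-in for DirWalker (NOT its real implementation).
 * Returns an array of the form: full path => file name.
 */
function walkFiles(string $dir, bool $recursive = true): array {
    $files = [];
    if ($recursive) {
        /* Descend into every subdirectory, skipping "." and ".." */
        $iterator = new RecursiveIteratorIterator(
            new RecursiveDirectoryIterator($dir, FilesystemIterator::SKIP_DOTS)
        );
    } else {
        /* Look at the top-level directory only */
        $iterator = new FilesystemIterator($dir, FilesystemIterator::SKIP_DOTS);
    }
    foreach ($iterator as $item) {
        /* Keep files only; directories are not counted */
        if ($item->isFile()) {
            $files[$item->getPathname()] = $item->getFilename();
        }
    }
    return $files;
}
```

Calling count() on the returned array gives the number of files, which is the same thing getFileCount() does with the array it gets from DirWalker.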

For this project, I simply wanted a gross count of all files (but not directories), so no filters were used. DirWalker was called this way (after including the class file at the top of the script):

$dw = new DirWalker();
/** @var $dw DirWalker */
$totalFiles += getFileCount($dw, $localPath, $file, $phs);

The code above is called inside the loop that processes each extra. We pass the DirWalker object in the $dw variable. $localPath is the path to the directory that holds all the extras (with a trailing slash, since the function simply appends the extra's name to it), and $file is the name of a specific extra, which is also the name of its directory. The $phs variable holds our placeholders array and, as usual, it is passed by reference.
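Passing $phs by reference means that anything the function writes into the array is visible to the calling code afterward. Here is a small illustration of that behavior. The setCount() helper is hypothetical and is not part of the project:

```php
<?php
/* Hypothetical helper, used only to show by-reference passing */
function setCount(array &$phs, int $count): void {
    /* The & in the signature means this writes to the caller's array */
    $phs['file_count'] = number_format($count);
}

$phs = [];
setCount($phs, 1234);
echo $phs['file_count'], PHP_EOL; // prints 1,234
```

Without the & in the function signature, the function would modify its own copy and the caller's $phs would still be empty.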

The getFileCount() function

Here is the code for the getFileCount() function:

/**
 * Get count of files and lines of code for an extra from
 * local directory using DirWalker
 *
 * @param $dw DirWalker - DirWalker class object
 * @param $localPath string - path to directory containing extras
 * @param $file string - name of extra directory
 * @param $phs array - placeholder array (as reference)
 * @return int
 */
function getFileCount($dw, $localPath, $file, &$phs) {
    global $totalLines;
    /** @var $dw DirWalker */

    /* Clear files left over from the previous extra */
    $dw->resetFiles();

    $path = $localPath . $file;
    $dw->dirWalk($path, true);
    $files = $dw->getFiles();
    $fileCount = count($files);
    $phs['file_count'] = number_format($fileCount);

    $lineCount = 0;
    foreach ($files as $filePath => $fileName) {
        $fileArray = file($filePath);
        /* file() returns false if the file can't be read */
        if ($fileArray !== false) {
            $lineCount += count($fileArray);
        }
    }
    $phs['lines'] = number_format($lineCount);
    $totalLines += $lineCount;
    return $fileCount;
}

First, we call $dw->resetFiles() because DirWalker is called many times and we only want the files for the current extra. The resetFiles() method simply clears the array of files by setting it to an empty array.
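To see why the reset matters when one object is reused across loop iterations, consider this minimal sketch. FileCollector is a hypothetical class written for this example. Only its resetFiles() behavior (setting the array back to empty) mirrors what is described above for DirWalker:

```php
<?php
/* Hypothetical collector; not the DirWalker class itself */
class FileCollector {
    private array $files = [];

    public function add(string $path): void {
        $this->files[$path] = basename($path);
    }

    public function resetFiles(): void {
        /* Clear the list by setting it to an empty array */
        $this->files = [];
    }

    public function getFiles(): array {
        return $this->files;
    }
}

$collector = new FileCollector();

$collector->add('/extras/one/a.php');
$countOne = count($collector->getFiles());

/* Without this call, files from the first extra would still be in the list */
$collector->resetFiles();

$collector->add('/extras/two/b.php');
$countTwo = count($collector->getFiles());
```

Both $countOne and $countTwo end up as 1. If the resetFiles() call were left out, the second count would include the file from the first extra as well.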

Next, we call DirWalker's dirWalk() method with the path to the root directory of the extra. The second argument tells DirWalker that we want a recursive search so all subdirectories will be included.

During the "walk," DirWalker adds every file it finds to its internal array, which is returned by $dw->getFiles(). Then we use count() to get the number of elements (files) in the array, format it with number_format(), and put the result in our placeholder array as $phs['file_count'].

Since I was getting all the files for the extra, I was also curious about the total number of lines of code used. The foreach() statement loops through all the files and uses PHP's file() function to load each file into a PHP array, with each line as an array element. Using count() on that array gives the number of lines in the file, which is added to the $lineCount variable. When the loop finishes, the formatted total goes in the $phs['lines'] placeholder. The count is also added to the global variable $totalLines so I could see the total number of lines in all extras combined.
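Loading each file with file() is simple and works fine for typical source files. If some files were very large, one alternative would be to read a line at a time so the whole file never sits in memory. The countLines() function below is a hypothetical sketch of that approach, not code from the project:

```php
<?php
/**
 * Count the lines in one file by reading it a line at a time.
 * Hypothetical helper; returns 0 if the file can't be opened.
 */
function countLines(string $path): int {
    $handle = @fopen($path, 'r');
    if ($handle === false) {
        return 0;
    }
    $lines = 0;
    /* fgets() returns false at end of file */
    while (fgets($handle) !== false) {
        $lines++;
    }
    fclose($handle);
    return $lines;
}
```

For a readable file, this gives the same number as count(file($path)), including when the last line has no trailing newline.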

Coming Up

In the next article, we'll take a quick look at a couple of the utility functions in the project that we haven't covered yet: formatTime() and getAuthors().
