BotBlockX Plugin Tutorial

I could tell you how many hours it takes to develop a MODX extra Transport Package complete with a build script, properties, multiple MODX elements, internationalized strings, error checks, and then fully test it, but you wouldn't believe me. If you use this extra and like it, please consider donating. The suggested donation for this extra is $5.00, but any amount you want to give is fine (really). For a one-time donation of $50.00 you can use all of my non-commercial extras with a clear conscience.


PayPal

BotBlockX is based on Alex Kemp's classic bot-blocking routine. I have rewritten it for MODX Revolution and added a few tweaks, but the heart of it is Alex's original code.

The default install of this plugin will block badly behaved bots (both slow and fast), while leaving good bots like Google alone. If you want to allow slow scrapers to grab your content (some regular users download whole sites to use for offline reference), you can disable the slow scraper check by changing the &total_visits property to 'none'.

BotBlockX creates a lot of files in the block/ directory, but they are all zero-length files so they shouldn't count as resources on your site. This is possible because of Alex's ingenious use of the files' modification time and access time to store information about the visitor's behavior. Both the block/ directory and the blocklogs/ directory are placed directly under the MODX core directory to speed up access time.

How it Works

When a user visits the site, the plugin (connected to both OnPageNotFound and OnHandleRequest), creates an "IP file," the name of which is a substring of the MD5 hash of their IP (the length of which is determined by the &ip_length property). The duration of their visit, and the number of hits during the specified interval is stored in the file's access time and modification time.

Note that almost all "good" bots are whitelisted by IP range and can do whatever they want. You can disable this (and significantly speed up the plugin) by changing the &use_whitelist property of the plugin, but you may end up blocking some legitimate, but badly behaved, search spiders.

To control good bots, place these lines in your robots.txt file in the MODX root directory. Leave a space above and below the lines:

User-agent: *
Crawl-delay: 90
Crawl-Delay: 90

If the lines above are in your robots.txt file, no well-behaved bot should be caught by the fast scraper code.

Slow Scrapers — If a user makes more than &total_visits requests in the seconds defined by &start_over_secs, they've violated the slow scraper rules and will receive a message telling them that they'll have to wait to see another page. With the default settings, they can make 1500 requests in 12 hours and violators will have to wait about 12 hours before getting access to any page at the site.

Fast Scrapers — If a user makes more than &max_visits requests in the seconds defined by &interval, they've violated the fast scraper rules and will receive a message telling them that they'll have to wait to see another page. With the default settings, they can make 14 requests every 7 seconds. The wait penalty varies with the frequency of their requests. It's hard to test, but I think it's usually around 60-120 seconds.

Testing

Note that it's almost impossible to trigger BotBlockX with a standard browser and the default settings. If you set &total_visits to 5 and &secs_to_start_over to 20, you can trigger the slow scraper warning, but you'll be blocking regular visitors to your site (don't forget to set them back).

Note that once you've blocked yourself, you'll have to either wait 12 hours to see the front end again, or delete the IP file in the core/block/ directory. Since the filenames are hashed, you'll probably have to delete all of them, so try this test before there are a lot of files there.

To trigger the fast scraper block, you'll need a bot program that can make more than one request per second.

If you just want to see if the program is working at all, go to the files tab and expand the "block/" directory at the root of the site. You should see a growing number of zero-length IP files there. If any bots have been blocked, you'll see entries in the log file in the "log/" directory, also in the MODx root directory.

Installing BotBlockX

Go to System | Package Management and click on the "Download Extras" button. Put "BotBlockX" in the search box and press "Enter". Click on the "Download" button next to BotBlockX on the right side of the grid. Wait until the "Download" button changes to "Downloaded" and click on the "Finish" button.

The BotBlockX package should appear in the Package Manager grid. Clock on the "Install" button and respond to the forms presented. That's it. Once the install finishes, bot-blocking will begin immediately. The LogPageNotFound plugin and the ReflectBlock plugin will be installed, but these are disabled by default. You'll also get the three snippet/resource pairs that will let you view the logs created by the three plugins.

Some servers do not have both the fileatime() and filemtime() functions enabled. The BotBlockX package performs a test for this during the install and aborts the install if both functions are not working correctly. If that happens, you'll see a message telling you about it and will have to talk things over with your host.

Another possible issue is that some servers reset the access time of all files when doing their daily backups (usually around 2:30 a.m.) If this is the case, some visitors may be blocked for a short period after the backup. I think that most servers don't do this, but if yours does, there is a fix for it in the comments of the code at Alex's site.

BotBlockXLog

When a bot is blocked, an entry is written to the ipblock.log file (inside the blocklogs/ directory below the MODX core directory). The log entry will the IP, host name, date/time, useragent, and the the type of scraper (slow or fast). The fields are separated by back-ticks.

Remember that there will be an entry for every request made by a blocked bot, so there will be a lot of duplicates, especially with stupid fast scrapers. The log is limited to 300 entries, but you can change that with the &log_max_lines property. New entries will be added at the top and old ones will scroll off the bottom.

Properties

The default property settings have evolved over a number of years and I don't recommend changing most of them except for testing. You can change the &ip_length property if your site is very popular or unpopular, but the default &ip_length of 3 is a fairly good compromise for most sites. And you can change the &total_visits limit for slow scrapers if you site is larger or smaller than usual, but remember that a visit to a single page can generate more than one request. You can set &total_visits to 'none' to disable the slow scraper block.

To change the other properties, and the content of the messages sent to violators, you need to edit the code of the plugin. I originally had a number of Tpl chunks and many properties to control the plugin, but the speed penalty was too severe.

Note that the &max_visits property value *must* be greater than the setting for the &interval property.

Note that if you have an Adsense site, every request will be followed by one or more visits from the GoogleBot looking for the same URL. You'll also see the GoogleBot failing to find the unpublished report page when you access them.

You can prevent some of that with these lines in your robots.txt file. Adjust the file suffix as necessary.

User-agent: Mediapartners-Google*
Disallow: bot-block-report.html

Reports

The report snippet and resource to view it are included in the package. The resource is unpublished and hidden from menus by default, but you can still view the log by previewing the resource from the Manager when you are logged in as a Super User.

The log snippet has two properties: &table_width (to set the width of the table) and &cell_width (to set the width of each cell). Feel free to adjust them to meet your needs.

If you see serious repeat offenders in the log, you can block them by IP with code like this in your .htaccess file (using their actual IPs):

order allow,deny
deny from 127.0.0.1
deny from 127.0.0.2
deny from 127.0.0.3
allow from all

Blocking users in the .htaccess file is extremely fast and it stops the users dead before they even reach the site. It's not all that practical, however, since people can spoof IPs, there are zillions of evildoers out there, and the request can be from an IP that might later be assigned to a legitimate user. Be sure to note the host before adding an IP block. You don't want to block the GoogleBot or yourself. The User Agent can be helpful here, but many evildoers will fake the User Agent and pretend to be the GoogleBot.

Ultimately, it's better to block miscreants by what they are looking for, using something like this:

RewriteCond %{REQUEST_URI} forgot_password [NC,OR]
RewriteCond %{REQUEST_URI} phpinfo [NC,OR]
RewriteCond %{REQUEST_URI} password_forgotten [NC,OR]
RewriteCond %{REQUEST_URI} mysql [NC,OR]
RewriteCond %{REQUEST_URI} sqlpatch [NC,OR]
RewriteCond %{REQUEST_URI} humans [NC,OR]
RewriteCond %{REQUEST_URI} checkout [NC,OR]
RewriteCond %{REQUEST_URI} customer [NC,OR]
RewriteCond %{REQUEST_URI} admin [NC,OR]
RewriteCond %{REQUEST_URI} expand [NC,OR]
RewriteCond %{REQUEST_URI} contract [NC,OR]
RewriteCond %{REQUEST_URI} alert [NC,OR]
RewriteCond %{REQUEST_URI} client-info\.php [NC,OR]
RewriteCond %{QUERY_STRING} (.*)(http|https|ftp):\/\/(.*) [NC,OR]
RewriteCond %{QUERY_STRING} forgot_password [NC,OR]
RewriteCond %{HTTP_USER_AGENT} libwww-perl.* [NC]
RewriteRule .* - [F,L]

Include the "OR" directive on every condition but the last. Before adding the lines to the .htaccess file, make sure that none of the strings you're blocking are used in aliases at your site.

You should *always* back up your .htaccess file before making any changes! If there are errors in the .htaccess file, all visitors to your site may get a server 500 error.

 

My book, MODX: The Official Guide - Digital Edition is now available here. The paper version of the book is available from Amazon.

If you have the book and would like to download the code, you can find it here.

If you have the book and would like to see the updates and corrections page, you can find it here.

MODX: The Official Guide is 772 pages long and goes far beyond this web site in explaining beginning and advanced MODX techniques. It includes detailed information on:

  • Installing MODX
  • How MODX Works
  • Working with MODX resources and Elements
  • Using Git with MODX
  • Using common MODX add-on components like SPForm, Login, getResources, and FormIt
  • MODX security Permissions
  • Customizing the MODX Manager
  • Using Form Customization
  • Creating Transport Packages
  • MODX and xPDO object methods
  • MODX System Events
  • Using PHP with MODX

Go here for more information about the book.

Thank you for visiting BobsGuides.com

  —  Bob Ray