Understanding .htaccess II

Understanding regular expressions used in .htaccess files


In this article we'll take a look at how regular expressions work. We'll also discuss a great online tool that not only lets you test your regular expressions, but gives real-time feedback on what is being matched and what any capture groups are capturing.



Introduction

Regular expressions (often shortened to "regex") can look like gibberish to people who aren't used to them. Even for experienced regular expression users, something like this example from MyComponent's Lexicon Helper code is difficult to parse (it should be a single line):

'#function getLanguageTopics\(\)\s*\{\s*return\s
*array\([\'\"]([^\"\']+)[\"\']\)#';

There are much more complex regular expressions. For example, here's a widely circulated pattern, based on the RFC 5322 standard, for validating an email address:

(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^
_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b
\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:
(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:
[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?
[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9]
[0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f
\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])

(If you need to paste the code above, remove the line breaks to put it all on one line.)

Luckily, the regular expressions used in .htaccess rewrite conditions and rules tend to be fairly simple. In fact, so are the ones people actually use to validate email addresses. Here's a fairly simple regex that will allow most valid emails and catch most typos in an email address:

^[A-Za-z0-9._+\-\']+@[A-Za-z0-9.\-]+\.[A-Za-z]{2,}$

We'll revisit this one later in the article.

If you're curious about the history of regular expressions, check out this section of the Wikipedia Regular Expression page.

This article will definitely not give an exhaustive explanation of regular expressions. My goal is to provide enough information to understand and create rewrite conditions and rules for use in .htaccess files.


First, the Tool

PhpStorm has a built-in regular expression editor, but not everyone uses PhpStorm, and I prefer the Rubular Regular Expression Editor (which I'll refer to as "the Rubular Editor" or just "Rubular" for short). It's an online tool for writing and testing regular expressions.

The Rubular editor gives you progressive highlighting of the match being generated. It shows the full match in a window, and any capture groups in another window. It also tries its best to tell you what's wrong when you've entered an invalid regular expression. As a bonus, it has a regular expression cheat sheet at the bottom of the screen.

I encourage you to use the Rubular Editor to play with the code in this series of articles.


Some Basics

A regular expression is really a pattern (and is often called that in both code and comments). Its job is to match some string of text. That long email validation pattern I showed you above, for example, is intended to match any possible valid email address. Some simple regular expressions contain nothing but text. In others, symbols are used to match various things.

When regular expressions are used in PHP code, they're required to have a delimiter at each end of the pattern. The most common delimiter by far is the slash: /some_pattern/. You can use almost any character as the delimiter (anything that isn't a letter, a digit, a backslash, or whitespace), as long as it appears at both ends of the pattern. If the delimiter also appears inside the pattern, it has to be escaped with a backslash, so it's best to pick a delimiter that the pattern does not contain.

Although the delimiters are required in code, they're *not used* in the regular expressions in an .htaccess file. A delimiter is provided automatically by mod_rewrite behind the scenes when it processes each regex. The delimiters are also unnecessary in the Rubular Editor, though you can see them on the screen beyond each end of the pattern.

The vertical bar (|) is used for a logical OR. So if you wanted to match the strings "tick" or "tack," the regex pattern tick|tack would do the job.

Parentheses have multiple uses in regular expressions. If there were other characters to be matched before or after that part of the pattern above, you'd want to use parentheses, like this: (tick|tack). For example, this pattern would match "ticking" or "tacking": (tick|tack)ing. Try this in the Rubular editor. Put ticking tacking in the "Your Test String" box and (tick|tack)ing in the "Your Regular Expression" box.

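Notice that something new has appeared in the "Match Groups" box. It's our old friends "tick" and "tack." That's because parentheses also serve to "capture" the matching text inside them. Rubular lists the captured text for each match it finds: "tick" for match 1 and "tack" for match 2. Because the pattern has only one set of parentheses, there is only one capture group, and its contents can be used later in a rewrite rule as $1. A second set of parentheses would be available as $2, and so on (more on this later).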
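If you'd rather see the same experiment in code, here's a minimal sketch in Python (my choice for illustration; it isn't what .htaccess or Rubular uses, but its re module treats this pattern the same way). It shows that the pattern has one capture group, and that the group captures different text for each match:

```python
import re

# One capture group, (tick|tack), followed by the literal text "ing".
pattern = re.compile(r"(tick|tack)ing")

text = "ticking tacking"

# finditer() yields one match object for each match found in the text.
matches = list(pattern.finditer(text))

full_matches = [m.group(0) for m in matches]  # the whole matched text
group_one = [m.group(1) for m in matches]     # what the parentheses captured

print(full_matches)    # ['ticking', 'tacking']
print(group_one)       # ['tick', 'tack']
print(pattern.groups)  # 1 (the pattern contains a single capture group)
```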

Character classes can also be used to match multiple strings. A character class looks like this: [somecharacters]. There are almost always multiple ways to specify a match in a regular expression. For example, our original match of "tick" and "tack" could be rewritten as t(i|a)ck. If you try that at Rubular, you'll see that the two captured match groups are now i and a.

A character class matches any of the letters between the brackets, so we could also rewrite our tick|tack expression as t[ia]ck. If you try that at Rubular, you'll see that the Match Groups have disappeared, because there are no parentheses to capture anything.
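Here's a small Python sketch (Python is just my choice for illustration, and the test string is made up) comparing the two spellings. Both match the same words, but only the version with parentheses captures anything:

```python
import re

text = "tick tack tock"

with_group = re.compile(r"t(i|a)ck")  # alternation inside a capture group
with_class = re.compile(r"t[ia]ck")   # character class, no capture group

group_matches = [m.group(0) for m in with_group.finditer(text)]
class_matches = [m.group(0) for m in with_class.finditer(text)]
captured = [m.group(1) for m in with_group.finditer(text)]

print(group_matches)      # ['tick', 'tack']
print(class_matches)      # ['tick', 'tack']
print(captured)           # ['i', 'a']
print(with_class.groups)  # 0 (nothing to capture)
```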

Literal characters can be matched by "escaping" them with a backslash (\). Imagine that we wanted to match "(tick)" and "(tack)" (surrounded by actual parentheses). Because parentheses have a special meaning in a regex, we'd need to escape them like this: \(t[ia]ck\). We've preceded each of the parentheses with a backslash to let the regex engine know that we want to match literal parentheses. In that case, nothing will be captured.
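A quick Python sketch of the escaped version (the test string is made up for illustration). The escaped parentheses are matched as ordinary characters, and the pattern ends up with no capture groups at all:

```python
import re

# \( and \) match literal parentheses instead of starting a capture group.
literal_parens = re.compile(r"\(t[ia]ck\)")

text = "(tick) tack (tack)"

# With no capture groups in the pattern, findall() returns the full matches.
found = literal_parens.findall(text)

print(found)                  # ['(tick)', '(tack)']
print(literal_parens.groups)  # 0
```

The bare "tack" in the middle is skipped because it isn't wrapped in parentheses.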

There are different flavors of regular expressions, but generally, these are the characters that need escaping if you want to match them literally and they are outside of a character class: .^$*+?()[{\|. Inside a character class (inside square brackets), you generally only need to escape these: ^-]\ though in some flavors of regex, it doesn't hurt to escape other characters. In others, it's an error, so it's best to only escape the ones we've listed.

The ^ character is kind of a special case. It's used in two ways. Outside of a character class, it refers to the beginning of a line, so ^(tick|tack) will match either word, but only if it appears at the beginning of the line. It would match the "tick" in "ticking", but not the "tick" in "backtick."

Inside a character class (the square brackets), if the ^ character appears at the beginning of the character list, it "negates" the character class. So [^ia] will match any character *except* i and a. If the ^ character occurs anywhere else in a character class, it's taken as a literal ^.
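Here's a hedged Python sketch of both behaviors. The word list is made up, and I've put the negated class inside t...ck so the effect is easy to see:

```python
import re

# [^ia] matches any single character except i and a.
not_i_or_a = re.compile(r"t[^ia]ck")

words = ["tick", "tack", "tock", "tuck"]
allowed = [w for w in words if not_i_or_a.fullmatch(w)]
print(allowed)  # ['tock', 'tuck']

# When ^ is not the first character in the class, it is a literal caret.
literal_caret = re.findall(r"[ia^]", "a^b")
print(literal_caret)  # ['a', '^']
```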

The $ character matches the end of a line. So ^the whole line$ will match "the whole line," but only if there is nothing else on the line.

You will often see the ^ and $ characters in .htaccess rewrite conditions and rules, but they are not required. If you want to match "http" only when it appears at the beginning of a URL, you need the pattern ^http. If, on the other hand, you want to match a directory name which might or might not occur at the beginning of a URL, you don't want the ^ character.

For some reason, I used to have trouble remembering which of the two characters ^ and $ referred to the beginning and end of a line. This went away when I came up with the phrase, "the hat goes on the head" to remind me that the ^ character referred to the beginning (head) of the line.

Those two characters (^ and $) are anchors: they match a position rather than a character, so they never add anything to the matched or captured text. If your pattern is ^the whole line$, the newline character at the end of the line won't be part of the full match.
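The anchor behavior described above can be sketched in Python (used here only for illustration; the URLs and the assets/ directory name are made up):

```python
import re

# ^ ties the match to the start of the string.
at_start = re.compile(r"^(tick|tack)")
starts_ticking = at_start.search("ticking") is not None
starts_backtick = at_start.search("backtick") is not None
print(starts_ticking, starts_backtick)  # True False

# ^ and $ match positions, so the trailing newline is not in the match.
whole_line = re.compile(r"^the whole line$")
m = whole_line.search("the whole line\n")
matched_text = m.group(0)
print(repr(matched_text))  # 'the whole line'

extra_text = whole_line.search("the whole line and more") is not None
print(extra_text)  # False

# Anchored vs. unanchored: "http" only at the start, a directory anywhere.
http_at_start = re.search(r"^http", "http://example.com/") is not None
http_later = re.search(r"^http", "see http://example.com/") is not None
dir_anywhere = re.search(r"assets/", "/site/assets/css/") is not None
print(http_at_start, http_later, dir_anywhere)  # True False True
```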

Ranges — Character classes can also have ranges, separated by a dash character (-). So the character class [a-z] will match any lowercase letter. The character class [a-zA-Z] will match any letter. And the character class [a-zA-Z0-9] will match any letter or number.
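A short Python sketch of those three ranges (the test strings are made up for illustration):

```python
import re

lower_q = re.fullmatch(r"[a-z]", "q") is not None       # lowercase letter
upper_q = re.fullmatch(r"[a-z]", "Q") is not None       # not in a-z
any_letter = re.fullmatch(r"[a-zA-Z]", "Q") is not None  # either case
print(lower_q, upper_q, any_letter)  # True False True

# Letters and digits only: the hyphen, underscore, and ! are skipped.
alnum = re.findall(r"[a-zA-Z0-9]", "a-B_9!")
print(alnum)  # ['a', 'B', '9']
```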

Multiple Matches are very common in regular expressions. Suppose you're parsing MODX chunk tags like [[$chunkName]]. The part between the brackets and after the $ could be anything. We have to escape the opening brackets to prevent them from being interpreted as character classes (but not the closing brackets). We also have to escape the $ symbol. We only want to capture the name of the chunk so our regex pattern could look like this:

\[\[\$(.+)]]

The brackets and the dollar sign are just literal characters (right brackets ] don't need to be escaped). The part we're capturing (inside the parentheses) is the name of the chunk. The dot character (.) stands for any character at all. The + character stands for one or more of the expression it follows, so (.+) will capture anything inside the tag as long as it's not empty (the + requires at least one character for a match).
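Here's a minimal Python sketch of that pattern at work. Python is only my choice for illustration (MODX itself is PHP), and the chunk name footer is made up:

```python
import re

# Two literal [ characters, a literal $, then one or more of anything
# (captured), then two literal ] characters.
chunk_tag = re.compile(r"\[\[\$(.+)]]")

m = chunk_tag.search("<p>[[$footer]]</p>")
chunk_name = m.group(1)
print(chunk_name)  # footer

# The + requires at least one character, so an empty tag doesn't match.
empty_tag = chunk_tag.search("[[$]]")
print(empty_tag)  # None
```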

What if a character might or might not appear? The * character stands for zero or more of the expression it follows. The regex above won't quite do for a chunk tag, because it might have a ! character to make it uncached and the tag might also have properties, like this:

[[!$chunkName?&property=`value`]]

To capture just the chunk name, we'd need something like this:

\[\[!*\$([a-zA-Z\-_]+)

This will look for two opening brackets (\[\[) followed by zero or more exclamation points (!*) followed by a dollar sign (\$) followed by a capture group containing one or more of a character class containing letters, a hyphen, or an underscore. In this case, instead of !*, we'd probably use !? because we know there will only be one exclamation point and the question mark character matches zero or one of the element it follows.

Of course this pattern will not capture any properties in the chunk tag. Since it's being processed in PHP code, it would probably be handled in two steps by tacking \??(.*)]] on the end of our pattern to capture the rest of the tag except for the closing brackets (in other words, a question mark, followed by any series of characters followed by two closing brackets), then processing the second capture group separately to get the properties and their values. The full regex would look like this:

\[\[!*\$([a-zA-Z\-_]+)\??(.*)]]

The \?? looks odd. It matches zero or one question mark. This is necessary because the chunk might not have properties at all. The first question mark is escaped, so it means a literal question mark. The second one is not escaped, so it means 0 or 1 of the character ahead of it (the literal question mark). Notice that we did not capture the literal question mark at the beginning of the properties because it's not inside the parentheses. We don't need it to parse the properties themselves.
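To see the two capture groups, here's a hedged Python sketch using the full pattern on the example tag from earlier and on a plain tag with no properties. It's an illustration of the regex only, not the code MODX uses:

```python
import re

tag_pattern = re.compile(r"\[\[!*\$([a-zA-Z\-_]+)\??(.*)]]")

# Uncached chunk tag with a property.
with_props = tag_pattern.search("[[!$chunkName?&property=`value`]]")
name_1, props_1 = with_props.group(1), with_props.group(2)
print(name_1)   # chunkName
print(props_1)  # &property=`value`

# Plain chunk tag: no !, no ?, no properties.
plain = tag_pattern.search("[[$chunkName]]")
name_2, props_2 = plain.group(1), plain.group(2)
print(name_2)         # chunkName
print(repr(props_2))  # ''
```

In both cases the literal question mark stays out of the second group, so the properties string can be parsed on its own.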

The actual code used to parse a chunk tag in MODX is much more complicated than our example. Because MODX tags can be nested, the MODX parser needs to handle a chunk tag like this, where the name of the tag is the pagetitle of the current document and the value of the property comes from a TV:

[[!$[[*pagetitle]]?&property=`[[*tv_name]]`]]

The parser would first parse the inner tags, replacing them with their values. Then, our example code would be able to parse the resulting tag.

Sometimes you need to match literal dots. Because the dot character has a special meaning, you have to escape it when you want to match a literal dot, so in .htaccess files, you'll often see something like this: yoursite\.com to match your domain name.
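A small Python sketch (for illustration only; yoursiteXcom is a made-up string) shows why the backslash matters:

```python
import re

escaped = re.compile(r"yoursite\.com")   # \. matches only a literal dot
unescaped = re.compile(r"yoursite.com")  # . matches any character

escaped_real = escaped.search("yoursite.com") is not None
escaped_fake = escaped.search("yoursiteXcom") is not None
unescaped_fake = unescaped.search("yoursiteXcom") is not None

print(escaped_real)    # True
print(escaped_fake)    # False
print(unescaped_fake)  # True (probably not what you wanted)
```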

The dot character (unescaped) matches anything, but sometimes you want a more specific match. Here are some other options:

  • \d — any digit
  • \D — any non-digit
  • \s — any whitespace character
  • \S — any non-whitespace character
  • \w — Any word character (letter, number, or underscore)
  • \W — Any non-word character
  • \b — Any word boundary

The pattern \s+ is handy when you're matching a string entered by a user and the user might use more than one space to separate parts of the entry.
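Here's a minimal Python sketch of a few of these shorthand classes, including the \s+ case. The input strings are made up for illustration:

```python
import re

# \s+ treats any run of spaces or tabs as a single separator.
entry = "Smith   John \t 42"
parts = re.split(r"\s+", entry)
print(parts)  # ['Smith', 'John', '42']

# \d matches one digit at a time.
digits = re.findall(r"\d", "a1b22")
print(digits)  # ['1', '2', '2']

# \w+ matches runs of letters, digits, and underscores.
words = re.findall(r"\w+", "one_two three-4")
print(words)  # ['one_two', 'three', '4']

# \b keeps "cat" from matching inside "concat".
whole_words = re.findall(r"\bcat\b", "cat concat cat.")
print(whole_words)  # ['cat', 'cat']
```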

Here are the characters used for extending what's matched and their meanings:

  • ? — zero or one
  • * — zero or more
  • + — one or more
  • {n} — exactly n (e.g., {2} means exactly 2)
  • {n,} — n or more (e.g., {2,} means 2 or more)
  • {n1,n2} — n1 to n2 (e.g., {2,5} means 2 to 5)
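The list above can be checked with a short Python sketch. The helper function matching() is my own, made up for this illustration; it reports which candidate strings a pattern matches in full:

```python
import re

def matching(pattern, candidates):
    """Return the candidates that the pattern matches from start to end."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

candidates = ["", "a", "aa", "aaa", "aaaaaa"]

zero_or_one = matching(r"a?", candidates)
zero_or_more = matching(r"a*", candidates)
one_or_more = matching(r"a+", candidates)
exactly_two = matching(r"a{2}", candidates)
two_or_more = matching(r"a{2,}", candidates)
two_to_five = matching(r"a{2,5}", candidates)

print(zero_or_one)   # ['', 'a']
print(zero_or_more)  # ['', 'a', 'aa', 'aaa', 'aaaaaa']
print(one_or_more)   # ['a', 'aa', 'aaa', 'aaaaaa']
print(exactly_two)   # ['aa']
print(two_or_more)   # ['aa', 'aaa', 'aaaaaa']
print(two_to_five)   # ['aa', 'aaa']
```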

Email Revisited

Now that we've looked at some basic regular expression principles, let's revisit that email validator expression we saw earlier in this article:

^[A-Za-z0-9._+\-\']+@[A-Za-z0-9.\-]+\.[A-Za-z]{2,}$

To make it easier to see the elements of this regex, let's split it up a little (note that the regex below would not work, because the spaces aren't there in an email address).

^  [A-Za-z0-9._+\-\']+  @  [A-Za-z0-9.\-]+  \.  [A-Za-z]{2,}  $

To interpret this one, first look at the character classes surrounded by square brackets. Then find the @ and the literal dot (\.) we expect to see in an email address.

Consider the example, JoeBlow@somedomain.com, for the explanation below.

You could read this regex as matching the beginning of a line, followed by one or more of the characters in the first set of square brackets, followed by @, followed by one or more of the characters in the second set of square brackets, followed by a literal dot, followed by two or more of the characters in the third set of square brackets, followed by the end of the line.

Let's look at that section by section:

The first section, ^[A-Za-z0-9._+\-\']+@, matches the beginning of a line, followed by one or more (the + sign after the closing bracket) of the characters in the brackets, followed by an @ sign. The allowed characters are letters, digits, the dot, the underscore, the plus sign, the literal dash, and the literal single quote. In our example, this part matches JoeBlow@.

The next section ([A-Za-z0-9.\-]+\.) is slightly more restrictive. The character set is almost the same as the first, but leaves out the underscore, plus sign, and single quote. It ends with a literal dot, so it matches somedomain. in our example.

The final section ([A-Za-z]{2,}$) matches only letters because that's all that's allowed in the part of a domain name after the dot. The {2,} part requires it to contain at least two letters. (Before there were domains like .museum, it would have been {2,3} to only allow two or three letters.) The $ at the end matches the end of the address. This section matches the com in our example. The literal dot in .com was matched in the previous part.
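If you want to try the pattern in code, here's a hedged Python sketch. The function name looks_like_email() and every address except JoeBlow@somedomain.com are made up for illustration:

```python
import re

EMAIL_PATTERN = re.compile(
    r"^[A-Za-z0-9._+\-\']+@[A-Za-z0-9.\-]+\.[A-Za-z]{2,}$"
)

def looks_like_email(address):
    """Return True if the address matches the simple email pattern."""
    return EMAIL_PATTERN.search(address) is not None

print(looks_like_email("JoeBlow@somedomain.com"))   # True
print(looks_like_email("o'brien@example.co.uk"))    # True
print(looks_like_email("JoeBlow@somedomain"))       # False (no dot section)
print(looks_like_email("JoeBlow@somedomain.c"))     # False (only one letter)
print(looks_like_email("Joe Blow@somedomain.com"))  # False (space not allowed)
```

One Python-specific wrinkle: $ also matches just before a newline at the very end of the string, so strip the input first if that matters to you.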


Wrapping Up

In the sections above, we've covered most of the commonly used elements in a regular expression. If you have trouble understanding the regex examples in the following articles, you can refer back to this one or look at any of the numerous Regular Expression Cheat Sheets on the web.

Here's one that's a printable .pdf document: PDF Regular Expression Cheat Sheet.

This one has good examples and links to its own Regular Expression editor: Regular Expression Cheat Sheet.


Coming Up

In the next article, we'll look at rewrite conditions and how regular expressions are used in them to limit when our rewrite rules are applied.





