RegularExpressions

Summary: Some basic info about PHP regular expressions.

Categories: Markup, MarkupWriting, PmWiki Developer

How to begin

At least skim the sections on regular expressions and regular expression modifiers, so that you know what's there.

If you do anything even mildly complicated, test your regular expressions.

Check that the expression says what you wanted it to say.
Devise strings that should almost match and check that they indeed don't match.
Devise strings that just barely match and check that they indeed do match.

It is strongly recommended that you test the regular expression outside of PmWiki. Concentrate on getting the markup to do what you want it to do first, and fix up problems from interaction with other PmWiki markup later. This will give you a better intuition about what problems originated within the regular expression, and what problems originated elsewhere.
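The three checks above can be run from a small standalone PHP script, without any wiki involved. The following is only a minimal sketch: the pattern (the simple double-quoted string from the Strings section, anchored with ^ and $) and all the test strings are assumptions chosen for illustration.

```php
<?php
// Pattern under test: a simple double-quoted string, anchored so that the
// whole test string has to match (pattern and test data are illustrative).
$pattern = '/^"[^"]*"$/';

// Strings that should just barely match.
$shouldMatch = ['""', '"a"', '"two words"'];
// Strings that should almost match, but must not.
$shouldNotMatch = ['"', '"unterminated', 'no quotes', '"a"b"'];

$failures = 0;
foreach ($shouldMatch as $s) {
    if (preg_match($pattern, $s) !== 1) {
        echo "FAIL (expected match): $s\n";
        $failures++;
    }
}
foreach ($shouldNotMatch as $s) {
    if (preg_match($pattern, $s) !== 0) {
        echo "FAIL (expected no match): $s\n";
        $failures++;
    }
}
echo $failures === 0 ? "all tests passed\n" : "$failures test(s) failed\n";
```

Keeping the two lists separate makes it easy to add a new near-miss string whenever a bug turns up.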

Tools for testing regular expressions (and links to other resources on regular expressions in general) can be found at the bottom of the Wikipedia article on Regular Expressions.

https://regex101.com/ provides an interactive debugger.

Check that the expression doesn't interfere with the other PmWiki markup. Install the MarkupRulesetDebugging recipe and see what other markup exists - it's all documented, but it's easy to overlook a page of the docs, and ?action=ruletable?columns=pat,rep is guaranteed to show them all. Also, it might be a good idea to install other markup recipes and check for conflicts with these.

Looking at the other markups helps with designing a new markup, but you also have to check - it's all too easy to misread a regular expression.

Take all your test strings, put them on a wiki page, and see what markup grabs them - if one of the just barely matching strings is grabbed by anything else, you have a conflict.
If you find a conflict, either rework your markup and change your regular expression accordingly, or correct the mistake in the regular expression, or (as a last resort) define which of the two markups should take precedence and adjust the $when parameter of your Markup() call accordingly.

Useful examples

Note: The examples in this section are strongly biased towards HTML issues. PmWiki typically doesn't analyse HTML, so somebody should add regular expressions as used for PmWiki markup to correct the bias. (Please don't try email addresses unless you know what a nested comment in an email address is.)

Each regular expression is built up step by step from its constituents, so most expressions repeat parts of an earlier one and add something new.

Some regular expressions have been broken into several lines to better fit the typical width of a browser window. These line breaks must be eliminated before the regular expressions can be used in PHP code!

There's another pitfall: Many of the regular expressions listed here contain single quotes ' or double quotes ". In a PHP string, the quotes must be escaped (prefixed with a backslash \); depending on what quote character you use, you'll need to escape single or double quotes. Escaping is not necessary if the regular expressions are read from a text file or from user input.
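As a sketch of the quoting pitfall, here is the same regular expression written as three kinds of PHP string literal. The variable names are made up for this example. The pattern was picked because it contains no backslashes of its own; backslashes inside PHP string literals have their own escaping rules, which this sketch deliberately sidesteps.

```php
<?php
// Regular expression as typed into a tester: "[^"]*"|'[^']*'
// Single-quoted PHP string: only the single quotes need a backslash.
$single = '/"[^"]*"|\'[^\']*\'/';
// Double-quoted PHP string: only the double quotes need a backslash.
$double = "/\"[^\"]*\"|'[^']*'/";
// Nowdoc: the text is taken literally, so nothing needs escaping.
$nowdoc = <<<'RE'
/"[^"]*"|'[^']*'/
RE;

// All three literals produce the same pattern string.
var_dump($single === $double, $single === $nowdoc);
```

A nowdoc is probably the least error-prone choice for patterns that contain both kinds of quotes.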

Strings

Warning: The regular expressions in this section are untested.

The simplest version of a "-delimited string is

 "[^"]*"

that is: a double quote, a sequence of zero or more arbitrary characters except the quote itself, and another double quote. (Often, there are alternatives with a single quote, so I'll talk about "quote characters" from here on.)

Most languages with string values allow inclusion of the quote character by prefixing it with an escape character, typically a backslash; this also requires that the backslash itself is escaped:

 "([^"\]|\\|\")*"

The above regular expression is wrong: the backslash is used as an escape character within regular expressions themselves. We have to escape it, like this:

 "([^"\\]|\\\\|\\")*"

PHP (like most languages that borrow from C) has additional escape sequences, such as \n (for the newline character) or \x5a (for the character with the hexadecimal code 5a, which happens to be the letter Z). The regular expression for additionally recognising \n and \xnn would be:

 "([^"\\]|\\\\|\\"|\\n|\\x[a-fA-F0-9][a-fA-F0-9])*"

(Note that PHP has a lot of additional stuff in strings; this last regular expression is just an example how one would construct a full parser, not a useful end product.)
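A quick way to try the escape-aware string pattern is a table of inputs with the expected outcome for each. This is a sketch only: the pattern is written with the * quantifier after the group, the ^ and $ anchors are added for whole-string testing, and the test strings are invented for this example.

```php
<?php
// A nowdoc keeps the backslashes exactly as they appear in the regular expression.
$pattern = <<<'RE'
/^"([^"\\]|\\\\|\\"|\\n|\\x[a-fA-F0-9][a-fA-F0-9])*"$/
RE;

// Input string => whether it is expected to match.
// In these single-quoted PHP literals, \\ stands for one backslash.
$tests = [
    '"plain"'          => true,
    '"a \\" quote"'    => true,   // escaped quote
    '"back\\\\slash"'  => true,   // escaped backslash
    '"line\\nbreak"'   => true,   // backslash followed by n
    '"hex \\x5a"'      => true,   // backslash, x, two hex digits
    '"bad \\q"'        => false,  // unknown escape sequence
    '"half \\x5"'      => false,  // only one hex digit
    '"open'            => false,  // missing closing quote
];

foreach ($tests as $input => $expected) {
    $matched = preg_match($pattern, $input) === 1;
    printf("%-16s %s\n", $input, $matched === $expected ? 'ok' : 'UNEXPECTED');
}
```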

Attribute=value pairs in an HTML tag

Warning: The regular expressions in this section are untested.

An attribute is a sequence of ASCII letters:

 [a-zA-Z]+

A value may be a quoted string (that is, a quote, anything but that quote, then the same quote again; all this for two kinds of quotes, namely ' and "):

 "[^"]*"|'[^']*'

or (if we're using relaxed rules) something without a space:

 \S+

giving us:

 "[^"]*"|'[^']*'|\S+

The equals sign may be surrounded by spaces (the standard admits at most a single space, but most browsers are friendlier and so are we):

 \s*=\s*

A single attribute-value pair hence looks like this:

 [a-zA-Z]+\s*=\s*("[^"]*"|'[^']*'|\S+)

Multiple attributes, each prefixed by one or more spaces:

 (\s+[a-zA-Z]+\s*=\s*("[^"]*"|'[^']*'|\S+))*

Note that regular expressions like this one are fairly typical for parsing parameter lists.

With relaxed rules, we can have valueless attributes:

 (\s+[a-zA-Z]+)*

but they must come at the end of the attribute list:

 (\s+[a-zA-Z]+\s*=\s*("[^"]*"|'[^']*'|\S+))*(\s+[a-zA-Z]+)*
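To see how the single-pair expression behaves on a real attribute list, it can be applied repeatedly with preg_match_all(). The sketch below makes a few assumptions: the named groups name and value are additions for readability, the single-quoted alternative is written as '[^']*', and the attribute string is invented.

```php
<?php
// One attribute=value pair, prefixed by whitespace, with named groups
// (the group names are an addition for this sketch).
$pair = <<<'RE'
/\s+(?<name>[a-zA-Z]+)\s*=\s*(?<value>"[^"]*"|'[^']*'|\S+)/
RE;

// Hypothetical attribute list, as it might appear after a tag name.
$attributes = ' src="a b.png" alt=\'it is\' width=40';

// PREG_SET_ORDER yields one array per matched pair.
preg_match_all($pair, $attributes, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    echo $m['name'], ' => ', $m['value'], "\n";
}
```

Note that the captured values still carry their surrounding quotes; stripping them is left to the calling code.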

Finding a specific attribute in an HTML tag

Warning: The regular expressions in this section are untested.

This deals only with attributes that have a value. (Those without an attribute value are assumed to have a value that's the same as the attribute name, i.e. attr=attr. To check for that case, you'd have to check for both cases; this is left as an exercise to the reader.) (I.e. I'm too lazy to write that up right now.)

Assume we want to find the href attribute in an img tag. Since we're not interested in the other attributes, we need a non-capturing version of the above attribute matcher, in which each group opens with (?: instead of a plain parenthesis:

 (?:\s+[a-zA-Z]+\s*=\s*(?:"[^"]*"|'[^']*'|\S+))*

and it must match only as much as minimally required (else it would eat up the href attribute as well):

 (?:\s+[a-zA-Z]+\s*=\s*(?:"[^"]*"|'[^']*'|\S+))*?

Now to extract the first href (additional hrefs are ignored by browsers anyway), we need the <img tag, a sequence of non-href attributes, href itself, the equals sign, the attribute value (in parentheses to capture it!), the remaining attributes, optional whitespace, and a closing >:

 <img(?:\s+[a-zA-Z]+\s*=\s*(?:"[^"]*"|'[^']*'|\S+))*?
 \s+href\s*=\s*("[^"]*"|'[^']*'|\S+)
 (?:\s+[a-zA-Z]+\s*=\s*(?:"[^"]*"|'[^']*'|\S+))*?\s*>

The $replace parameter of the Markup() function can now pick up the image URL as $1.

There are some small improvements to be made:

If the tag is from an HTML document, letter case doesn't matter in tags, so we should not nail down the tag name to lower-case letters.
If the tag is from an XHTML document, the closing delimiter will be />, not >. In most cases, we don't know whether the tag is in an HTML or XHTML document, so we simply allow both variants.
If we want to re-emit the HTML tag, with all attributes except the href passed through unchanged, we need to capture the attribute sequences before and after the href.
We want to allow valueless attributes at the end. (As said above, we don't care about an href attribute without a value. For href, being valueless wouldn't make sense - it would be equivalent to href=href, which isn't very likely to be useful.)

These changes give:

  <[iI][mM][gG]((?:\s+[a-zA-Z]+\s*=\s*(?:"[^"]*"|'[^']*'|\S+))*?)
  \s+href\s*=\s*("[^"]*"|'[^']*'|\S+)
  ((?:\s+[a-zA-Z]+\s*=\s*(?:"[^"]*"|'[^']*'|\S+))*?(?:\s+[a-zA-Z]+)*\s*/?)>

Now the $replace expression has three values to play with:

$1 - the attributes that came before href.
$2 - the value of the href attribute.
$3 - the attributes that came after href, up to and including the closing / if it was present.
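The three captured values can be inspected outside the wiki with preg_match() and preg_replace(). This is a sketch under a few assumptions: the pattern is joined into a single line, the non-capturing groups are written as (?:...), the single-quoted alternative is written as '[^']*', the delimiter ~ is chosen because the pattern contains a /, and the input tag is invented.

```php
<?php
// The three-group pattern joined into one line; the nowdoc avoids quote escaping.
$pattern = <<<'RE'
~<[iI][mM][gG]((?:\s+[a-zA-Z]+\s*=\s*(?:"[^"]*"|'[^']*'|\S+))*?)\s+href\s*=\s*("[^"]*"|'[^']*'|\S+)((?:\s+[a-zA-Z]+\s*=\s*(?:"[^"]*"|'[^']*'|\S+))*?(?:\s+[a-zA-Z]+)*\s*/?)>~
RE;

// Hypothetical input tag, invented for this example.
$tag = '<IMG alt="logo" href="pic.png" width=40 ismap />';

if (preg_match($pattern, $tag, $m) === 1) {
    echo 'before href: ', $m[1], "\n";
    echo 'href value:  ', $m[2], "\n";
    echo 'after href:  ', $m[3], "\n";
}

// Re-emit the tag with href moved to the front and everything else unchanged.
$rewritten = preg_replace($pattern, '<img href=$2$1$3>', $tag);
echo $rewritten, "\n";
```

The captured href value includes its quotes, so the replacement string does not add any of its own.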

Note that we can do away with that [iI][mM][gG] and [a-zA-Z] stuff if we can add a /i modifier to the end of the regular expression in PHP. However, if we wish to do some letter case dependent matching within the tags, a pattern-wide /i is not an option, so the above example makes all uppercase/lowercase distinction explicit.
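The effect of the modifier can be compared with the explicit character classes in a few lines. This is a sketch with invented test tags and the delimiter ~. The last pattern uses a scoped inline modifier, (?i:...), which is PCRE syntax not covered in the text above; it is shown only as one possible way to mix case-insensitive and case-sensitive parts.

```php
<?php
// Explicit character classes versus the pattern-wide i modifier.
$explicit = '~<[iI][mM][gG]~';
$modifier = '~<img~i';

foreach (['<img src=x>', '<IMG src=x>', '<Img src=x>', '<p>'] as $tag) {
    // Both patterns are expected to agree on every tag.
    var_dump(preg_match($explicit, $tag) === preg_match($modifier, $tag));
}

// Scoped variant: only the tag name ignores letter case, ALT must be uppercase.
$scoped = '~<(?i:img)\s+ALT~';
var_dump(preg_match($scoped, '<Img ALT="x">'));  // tag name in mixed case
var_dump(preg_match($scoped, '<img alt="x">'));  // attribute in lower case
```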

HTML comments

Warning: The regular expressions in this section are untested.

This one is simple: a <!--, anything that's not an end-of-comment delimiter, a --, optionally spaces, and a >, giving us:

 <!--(.*?)--\s*>
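A short sketch of the comment expression in use, stripping comments from a piece of HTML. Two things here are assumptions: the whitespace before the closing > is written as \s* so that it is optional, and the s modifier is added so that . also matches newlines and multi-line comments are caught. The sample HTML is invented.

```php
<?php
// "~" is used as the delimiter; the s modifier lets "." match newlines too.
$pattern = '~<!--(.*?)--\s*>~s';

// Hypothetical input: one single-line comment, one multi-line comment
// whose closing delimiter contains a space.
$html = "a<!-- one -->b<!--\ntwo\n-- >c";

// Collect the comment bodies (capture group 1).
preg_match_all($pattern, $html, $found);
$bodies = $found[1];

// Remove the comments altogether.
$stripped = preg_replace($pattern, '', $html);
echo $stripped, "\n";
```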