RegularExpressions

How to begin

At least skim the sections on regular expressions and regular expression modifiers, so that you know what's there.

If you do anything even mildly complicated, test your regular expressions. Check that the expression says what you wanted it to say. Check that the expression doesn't interfere with the other PmWiki markup.

Install the MarkupRulesetDebugging recipe and see what other markup exists. It's all documented, but it's easy to overlook a page of the docs. Just looking at the other markups helps with designing a new markup, but you also have to check: it's all too easy to misread a regular expression.

Useful examples

Note: The examples in this section are strongly biased towards HTML issues. PmWiki typically doesn't analyse HTML, so somebody should add regular expressions as used for PmWiki markup to correct the bias. (Please don't try email addresses unless you know what a nested comment in an email address is.)

Each regular expression is shown by building it from its constituents, with parts copied down from previous regular expressions.

Some regular expressions have been broken into several lines to better fit the typical width of a browser window. These line breaks must be eliminated before the regular expressions can be used in PHP code!

There's another pitfall: Many of the regular expressions listed here contain single quotes ' or double quotes ". In a PHP string, the quotes must be escaped (prefixed with a backslash \); depending on which quote character delimits the string, you'll need to escape single or double quotes. Escaping is not necessary if the regular expressions are read from a text file or from user input.

Strings

Warning: The regular expressions in this section are untested.

The simplest version of a "-delimited string is

"[^"]*"
that is: a double quote, a sequence of zero or more arbitrary characters except the quote itself, and another double quote. (Often, there are alternatives with a single quote, so I'll talk about "quote characters" from here on.)

Most languages with string values allow inclusion of the quote character by prefixing it with an escape character, typically a backslash; this also requires that the backslash itself is escaped:

"([^"\]|\\|\")*"

The above regular expression is wrong: the backslash is used as an escape character within regular expressions themselves. We have to escape it, like this:

"([^"\\]|\\\\|\\")*"

PHP (like most languages that borrow from C) has additional escape sequences, such as \n (for the newline character) or \x5a (for the character with the hexadecimal code 5a):

"([^"\\]|\\\\|\\"|\\n|\\x[a-fA-F0-9][a-fA-F0-9])*"

(Note that PHP has a lot of additional stuff in strings; this last regular expression is just an example of how one would construct a full parser, not a useful end product.)

Attribute=value pairs in an HTML tag

Warning: The regular expressions in this section are untested.

An attribute name is a sequence of ASCII letters:

[a-zA-Z]+
A value may be a quoted string (that is, a quote, anything but that quote, then the same quote again; all this for two kinds of quotes, namely ' and "):

"[^"]*"|'[^']*'

or (if we're using relaxed rules) something without a space:

\S+

giving us:

"[^"]*"|'[^']*'|\S+

The equals sign may be surrounded by spaces (the standard admits at most a single space, but most browsers are friendlier and so are we):

\s*=\s*
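Since these expressions are marked as untested, a quick check of the value matcher may help. The following is a minimal sketch, not part of the original recipe: it uses Python's `re` module as a stand-in for PHP's PCRE functions (the constructs used here behave the same way in both), and the names `VALUE`, `EQUALS` and `is_value` are made up for illustration.

```python
import re

# Value: a double-quoted string, a single-quoted string, or (relaxed rules)
# a run of non-whitespace characters. Hypothetical names, for illustration.
VALUE = r'''"[^"]*"|'[^']*'|\S+'''

# Equals sign, optionally surrounded by whitespace.
EQUALS = r'\s*=\s*'


def is_value(text):
    """Return True if the whole of text is matched by VALUE."""
    return re.fullmatch(VALUE, text) is not None


if __name__ == "__main__":
    for candidate in ['"a b"', "'a b'", 'abc', 'a b', '"a b']:
        print(repr(candidate), is_value(candidate))
```

Quoted values may contain spaces, unquoted ones may not. Note that under the relaxed rule an unterminated quote without spaces, such as `"abc`, is still accepted by the `\S+` branch.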
A single attribute-value pair hence looks like this:

[a-zA-Z]+\s*=\s*("[^"]*"|'[^']*'|\S+)

Multiple attributes, each prefixed by one or more spaces:

(\s+[a-zA-Z]+\s*=\s*("[^"]*"|'[^']*'|\S+))*

Note that regular expressions like this one are fairly typical for parsing parameter lists.

With relaxed rules, we can have valueless attributes:

(\s+[a-zA-Z]+)*

but they must come at the end of the attribute list:

(\s+[a-zA-Z]+\s*=\s*("[^"]*"|'[^']*'|\S+))*(\s+[a-zA-Z]+)*

Finding a specific attribute in an HTML tag

Warning: The regular expressions in this section are untested.

This deals only with attributes that have a value. (Those without an attribute value are assumed to have a value that's the same as the attribute name.)

Assume we want to find the href attribute in an img tag. Since we're not interested in the other attributes, we need a non-capturing version of the above attribute matcher:

(?:\s+[a-zA-Z]+\s*=\s*(?:"[^"]*"|'[^']*'|\S+))*

and it must match only as much as minimally required (else it would eat up the href attribute as well):

(?:\s+[a-zA-Z]+\s*=\s*(?:"[^"]*"|'[^']*'|\S+))*?

Now to extract the first href attribute:

<img(?:\s+[a-zA-Z]+\s*=\s*(?:"[^"]*"|'[^']*'|\S+))*?
\s+href\s*=\s*("[^"]*"|'[^']*'|\S+)
(?:\s+[a-zA-Z]+\s*=\s*(?:"[^"]*"|'[^']*'|\S+))*?\s*>

There are some small improvements to be made:

* Match the tag name regardless of case.
* Capture the attributes before and after href as well.
* Allow valueless attributes at the end of the attribute list.
* Allow an optional / before the closing >.
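These changes give:

<[iI][mM][gG]((?:\s+[a-zA-Z]+\s*=\s*(?:"[^"]*"|'[^']*'|\S+))*?)
\s+href\s*=\s*("[^"]*"|'[^']*'|\S+)
((?:\s+[a-zA-Z]+\s*=\s*(?:"[^"]*"|'[^']*'|\S+))*?(?:\s+[a-zA-Z]+)*\s*/?)>

Now the first group captures the attributes before href, the second the href value (including its quotes, if any), and the third everything between that value and the closing >.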
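To see what the three capturing groups of the final expression contain, it can be assembled from its parts and run against a few tags. This is a sketch under assumptions: Python's `re` module stands in for PHP's PCRE functions, the non-capturing groups are written as `(?:...)`, the single-quoted branch is written as `'[^']*'`, and the names `VALUE`, `PAIRS`, `VALUELESS`, `IMG_HREF` and `find_href` are hypothetical.

```python
import re

# Building blocks, following the construction in the text.
VALUE = r'''"[^"]*"|'[^']*'|\S+'''
# Zero or more attribute=value pairs: non-capturing and lazy.
PAIRS = r'(?:\s+[a-zA-Z]+\s*=\s*(?:' + VALUE + r'))*?'
# Zero or more valueless attributes.
VALUELESS = r'(?:\s+[a-zA-Z]+)*'

IMG_HREF = re.compile("".join([
    r'<[iI][mM][gG]',
    r'(' + PAIRS + r')',                      # group 1: attributes before href
    r'\s+href\s*=\s*',
    r'(' + VALUE + r')',                      # group 2: the href value
    r'(' + PAIRS + VALUELESS + r'\s*/?)',     # group 3: the rest of the tag
    r'>',
]))


def find_href(tag):
    """Return the three captured groups, or None if nothing matches."""
    match = IMG_HREF.search(tag)
    return match.groups() if match else None


if __name__ == "__main__":
    print(find_href('<img src="a.png" href="b.html" alt=x>'))
    print(find_href("<IMG href='x' />"))
    print(find_href('<img src="a.png">'))
```

The captured value keeps its surrounding quotes, so a caller that wants the bare value has to strip them afterwards.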
Note that we can do away with that [iI][mM][gG] and [a-zA-Z] stuff if we can add a case-insensitivity modifier (i) to the regular expression.

HTML comments

Warning: The regular expressions in this section are untested.

This one is simple:

<!--(.*?)--\s*>
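The comment expression can be tried out in a few lines. The sketch below makes several assumptions: Python's `re` module is used in place of PHP's PCRE functions, the whitespace before the closing > is treated as optional (`\s*`), and `re.DOTALL` is added so that `.` also matches newlines, which the expression on its own does not do (in PHP that corresponds to the `s` modifier). The names `COMMENT` and `find_comments` are made up for this example.

```python
import re

# Lazy match of the comment body; DOTALL lets a comment span several lines.
COMMENT = re.compile(r'<!--(.*?)--\s*>', re.DOTALL)


def find_comments(html):
    """Return the bodies of all HTML comments found in html."""
    return COMMENT.findall(html)


if __name__ == "__main__":
    sample = 'a <!-- one --> b <!--two\nlines-- > c'
    print(find_comments(sample))
```

The lazy `.*?` matters here: with a greedy `.*` the first match would run from the first `<!--` to the last `-->` in the text.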