$CustomSyntax

Summary: Design notes of PmSyntax and custom markup rules

Version: pmwiki-2.5.1

Prerequisites: PmWiki 2.3.0

Status: Beta

Maintainer: Petko

License: GPL

Categories: Markup, MarkupWriting, PmWikiDeveloper, PmWikiInternals

Users: +1

Discussion: CustomSyntax-Talk

Design notes of PmSyntax and documentation of the format for custom markup rules.

Description

PmSyntax's default settings recognize most core markup rules, as well as those of many recipes that use patterns similar to core directives and markup expressions; see Test.PmSyntax.

This page explains how the function works, how you can add custom markup that is not in the initial ruleset, and offers some suggestions about styling.

You may want to do this to provide syntax highlighting in your recipe's documentation and in the edit form (if the wiki has enabled it).

Recipes that are usable with PmSyntax can be added to Category.PmSyntax.

$CustomSyntax array

PmSyntax has a renderer function written in JavaScript that works somewhat like the core PmWiki Markup functions. The $CustomSyntax definitions likewise resemble PmWiki's custom markup declarations with Markup(), which your recipe probably uses.

Your recipe can add entries to the $CustomSyntax array, in the following format:

SDVA($CustomSyntax, array(
  'MyRecipe'   => 'When  Semantic-type  /regexp/g',
  'MyRecipe,2' => 'When  Semantic-types  /regexp1/g  /regexp2/g  ...',
  'MyRecipe,3' => 'When  
    Semantic-types 
    /regexp3/g  
    /regexp4/g
      ...',
));

Please use the SDVA() function call to allow wiki administrators to disable or override the rules.

The keys of the array should be unique to your recipe. If you have several rules, it may be best to have them all with the same prefix.

The values of the array are lists of keywords and regular expressions, separated by white space:

The separator needs to be at least 2 white-space characters (2+ spaces, tabs, newlines, or mix-and-match).
The "When" anchor inserts your custom syntax rules before ("<anchor") or after (">anchor") existing core rules.
The Semantic-types string indicates whether your rule is a directive, inline punctuation, or something else. This translates to CSS class names for the output.
The regular expressions define the patterns for your custom markup.
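As a rough illustration of this format, the following JavaScript sketch splits a rule value into its parts. It is not PmSyntax's actual parser: the function name splitRule and the returned object shape are assumptions made for this example.

```javascript
// Illustrative sketch only: PmSyntax's real parser may differ.
// Splits a $CustomSyntax value on runs of 2+ white-space characters.
function splitRule(value) {
  const [when, types, ...regexps] = value.trim().split(/\s{2,}/);
  return { when, types, regexps };
}

// The value of the 'BoldItalic' rule used on this page.
const rule = splitRule("<punct   punct   /'[*~]|[*~]'/g");
console.log(rule.when);    // "<punct"
console.log(rule.types);   // "punct"
console.log(rule.regexps); // [ "/'[*~]|[*~]'/g" ]
```

A single space never acts as a separator here, which is why a regular expression may contain single spaces.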

Basic example

Say we want to highlight some custom inline punctuation, for example '*bold*' and '~italic~' which are not defined in the PmWiki core or in PmSyntax.

We could add something like this to our recipe, or to config.php:

SDVA($CustomSyntax, array(
  'BoldItalic' => "<punct   punct   /'[*~]|[*~]'/g",
));

To elaborate:

BoldItalic is the unique name of our rule.
<punct is the "when" anchor, here "<" means it will be processed "before" the core rule named "punct".
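The second punct keyword is the semantic type, in this case "inline punctuation". Text that matches our rule will be output with the CSS classname "pmpunct", so it is easy to style and to tell apart from other markup.
The regular expression matches the opening delimiters '* and '~ and the closing delimiters *' and ~'; the text between them is not part of the match.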
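To see what this rule would consume, the regular expression can be tried directly in JavaScript. The sample sentence below is made up for the illustration.

```javascript
// The JavaScript regular expression carried by the 'BoldItalic' rule.
const boldItalic = /'[*~]|[*~]'/g;

// Hypothetical sample text using the custom markup.
const sample = "Some '*bold*' and '~italic~' text";

// Only the opening and closing delimiters match, not the words between them.
const delimiters = sample.match(boldItalic);
console.log(delimiters); // [ "'*", "*'", "'~", "~'" ]
```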

Note that these are JavaScript regular expressions written as PHP strings. There are a few differences from PHP regular expressions which are outlined below.

"When" keywords

The PmSyntax core rules are processed one after another in a predefined order. When you define entries in $CustomSyntax, the new rules are inserted before or after the core rules.

As each rule is processed, any text matching the rule's regular expressions is wrapped in the rule's semantic CSS classes, removed from the text, and replaced with a token. At the end of the processing, all tokens are replaced with their corresponding texts.
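The token mechanism described above can be pictured with a small JavaScript sketch. This is not PmSyntax's code: the function name highlightSketch, the token format, and the two sample rules are assumptions made for illustration, and the cleanup step that escapes <, > and & is left out.

```javascript
// Illustrative sketch of "wrap, replace with a token, restore at the end".
function highlightSketch(text, rules) {
  const kept = []; // highlighted fragments, indexed by token number
  for (const { type, regexp } of rules) {
    text = text.replace(regexp, (match) => {
      kept.push(`<span class="pm${type}">${match}</span>`);
      // The token hides the match from all later rules.
      return `\u0001${kept.length - 1}\u0001`;
    });
  }
  // Restore step: swap tokens back until none remain
  // (a fragment may itself contain tokens from earlier rules).
  while (/\u0001\d+\u0001/.test(text)) {
    text = text.replace(/\u0001(\d+)\u0001/g, (_, i) => kept[Number(i)]);
  }
  return text;
}

// Two simplified sample rules, processed in this order.
const rules = [
  { type: "mx", regexp: /\{\(.*?\)\}/g },
  { type: "punct", regexp: /'''/g },
];
const html = highlightSketch("'''Now''' {(ftime)}", rules);
console.log(html);
```

Because the markup expression is replaced by a token first, the later "punct" rule cannot match anything inside it.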

Important core anchors are the following:

_begin - before any other rule (anchor, does nothing)
preserve - escaped [=text=] or [@text@] not processed by PmWiki
joinline - the "single-backslash-before-newline" markup that glues the line with the next one
pagevar - page variables like {*$:Variable}
mx - markup expressions like {(ftime)}
ptv0, ptv1 - page text variables like (:name:value:)
Most custom rules should be positioned before or after one of these indented anchors:
- skin, meta0 ... meta3, tmpl, rdir - various core meta directives, templates, redirect
- _url - URLs and InterMaps
  - In addition, InterMap is a special "when" anchor; see the next section.
- ws0 ... ws3 - WikiStyles
- form, dir0, dir1 - forms, core directives and generic directives like (:something name=value:)
- bullet - list items (ordered, bulleted, definition), start-of-line spacing/indents, \\ linebreaks
- punct - inline punctuation (bold, superscript...)
- tablecapt, tablerow, tableattr - simple tables
- heading - !, !!, !!!, ... headings (may contain links and punctuation)
cleanup - this converts any unconsumed special characters <, >, & into their HTML entities
restore - this replaces all tokens in the text with the highlighted <span> tags.
_end - end of processing, the text is already converted to styled HTML (anchor, does nothing)

These rules are processed in an order similar to the core markup rules. Your own recipe inserts custom markup somewhere in that order (before or after existing rules), so it should insert the custom syntax rules in a similar manner.

In general, you should insert your rule before an existing rule that may mistakenly consume your own markup. On the other hand, if your markup can contain other markups, for example a footnote can contain links and inline punctuation, you can position your rule after the other markups have been processed.

So, to insert your rule before, say, wikistyles, you can use the "when" keyword <ws0 (before "ws0"); or >ws3 (after "ws3") to insert it after wikistyles.

Unlike PmWiki's Markup() function, you can only define your rules relative to the existing core rules, not to other custom rules. Your rules will be inserted in the order you define them in $CustomSyntax one after another, so you control this. If your "when" keyword is not one of the core anchors, your rule will be dropped.
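The following JavaScript sketch shows one way to read a "when" keyword and reject unknown anchors. It is only an illustration: the function parseWhen is hypothetical, the anchor list is copied from this page (with "meta0 ... meta3" and "ws0 ... ws3" assumed to expand to four names each), and the special "InterMap" keyword is not handled.

```javascript
// Core anchors as listed on this page (ranges expanded by assumption).
const coreAnchors = [
  "_begin", "preserve", "joinline", "pagevar", "mx", "ptv0", "ptv1",
  "skin", "meta0", "meta1", "meta2", "meta3", "tmpl", "rdir", "_url",
  "ws0", "ws1", "ws2", "ws3", "form", "dir0", "dir1", "bullet", "punct",
  "tablecapt", "tablerow", "tableattr", "heading",
  "cleanup", "restore", "_end",
];

// Hypothetical helper: returns null when the rule would be dropped.
function parseWhen(when) {
  const m = /^([<>])(.+)$/.exec(when);
  if (!m || !coreAnchors.includes(m[2])) return null;
  return { position: m[1] === "<" ? "before" : "after", anchor: m[2] };
}

console.log(parseWhen("<ws0"));       // { position: "before", anchor: "ws0" }
console.log(parseWhen(">ws3"));       // { position: "after", anchor: "ws3" }
console.log(parseWhen(">MyOtherRule")); // null, not a core anchor
```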

If you have any questions or difficulties, please let us know.

Custom InterMap rules

Locally defined InterMap prefixes are automatically recognized, so in most cases you don't need to do anything.

In some cases there may be special characters inside the InterMap that are not usually part of a URL, so PmSyntax may not recognize it as a URL. In another case, there may be a pattern not defined as an InterMap but behaving in a similar way so it may be simple to add it.

In such cases you can define a regular expression for the prefix that includes all the characters or patterns it may contain.

This is done by having the "When" keyword set to "InterMap", followed by 2 space characters, then a partial regular expression of the InterMap prefix pattern.

For example, the ChessMarkup recipe uses the prefix "Chessboard:" but it may contain the markup "{FEN}", and the "{" character is not allowed unencoded in URLs. So the recipe defines the following $CustomSyntax entry:

SDVA($CustomSyntax, array(
  'ChessMarkup' => 'InterMap  Chessboard:\\{FEN\\}'
));

The first word (the "when" anchor) is "InterMap", and the rest is a regular expression (without the wrapping "/" characters).
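As a quick check of the pattern itself, here is what JavaScript receives once PHP has unescaped the string, and what it matches. How PmSyntax merges this partial pattern into its own URL rule is not shown; the variable names are made up for the example.

```javascript
// After PHP unescaping, the partial pattern is: Chessboard:\{FEN\}
const partial = String.raw`Chessboard:\{FEN\}`;

// Compiled on its own, only to show which prefix it matches.
const prefix = new RegExp(partial);

console.log(prefix.test("Chessboard:{FEN}")); // true
console.log(prefix.test("Chessboard:FEN"));   // false
```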

Semantic types

A semantic type is what the markup rule represents: a directive, a link, punctuation, or something else. It translates to CSS classes. Some "when" anchors reuse the names of their semantic types.

Here is the logic behind the types; you may choose to follow this logic or to adopt a different one.

comment - a commented-out text like (:comment text:)
escaped - text that is ignored or not processed by PmWiki like [=escaped=], also used for hidden text in links or tooltip titles like [[(Main.)Home page "Welcome"]]; it shows with slightly subdued color and grayish background
punct - inline punctuation like [-small-] and [[link]], shown in red and bold
heading - heading, Q:/A:, horizontal rule like !! Heading . Note that a heading can contain anchors, links and other inline markup; the background is under the whole line
meta - markup that controls the processing, and doesn't necessarily produce HTML output, can be a directive like (:noleft:) or (:template ...:), a conditional like (:elseif3 ...:), or a wikistyle like %list ...%
directive - most other directives like (:cellnr:) or (:pagelist ...:).
mx - markup expressions like {(ftime)}
bullet - list items (bulleted, ordered, definition), start- and end-of-line markups (indents, linebreaks) usually in bold and in green color
table - simple tables like ||!head||cell||
string - i18n strings like $[Edit] or entities like &nbsp;
var - page (text) variables, template variables, like {*$:Summary}
keyword - keywords used in forms, conditionals and templates (generally the first word after the directive name)
url - link URLs, InterMap links

Special:

tag - usually the start and end parts of a directive, for example in (:abc some text content:), the parts marked as "tag" would be the opening "(:abc" and the closing ":)". These are in bold, and the interior is normally not in bold.
attr and value - attributes and values inside directives, as in (:directive attr=value:)
nobg - an element with a transparent background rather than the default one (for example a core meta directive in blue color, without the light-blue background)

Above are the colors in the default installation; you should expect wiki administrators or skin maintainers to override these colors, see PmSyntax#colors.

Combining semantic types:

*type (asterisk and type, e.g. "*meta") - combine "tag" (bold) with the other type
type1_type2 (e.g. "meta_nobg") - combine several types/classnames; the example "meta_nobg" would output <span class="pmmeta pmnobg">
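The mapping from combined types to class names can be sketched in a few lines of JavaScript. The helper classNames is hypothetical and covers only the "*" prefix and the "_" combination described above, not the advanced "=", "!" or ">" forms.

```javascript
// Hypothetical helper: turns a simple semantic type into CSS class names.
function classNames(type) {
  const names = [];
  if (type.startsWith("*")) {
    names.push("tag");        // "*type" adds the "tag" (bold) class
    type = type.slice(1);
  }
  names.push(...type.split("_")); // "type1_type2" combines several types
  return names.map((name) => "pm" + name).join(" ");
}

console.log(classNames("punct"));     // "pmpunct"
console.log(classNames("*meta"));     // "pmtag pmmeta"
console.log(classNames("meta_nobg")); // "pmmeta pmnobg"
```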

Advanced usage, requires numbered captured matches, see section further down:

=type (equals sign and type) - directive containing mostly plain text as in (:title text:)
!type (exclamation mark and type) - directive may contain attributes=values, variables and/or strings
=type1>*type2>type3 - you earned a black belt for nested custom syntax rules
external - a special mode, documented below, which calls Highlight.js to do the highlighting.

Another simple example: Our skin has tabs, and a (:notabs:) directive that is not in the core. PmSyntax recognizes (:notabs:) as a "generic" directive, but we want to style it like other "meta" directives, e.g. (:noleft:). We could add the following in $CustomSyntax:

SDVA($CustomSyntax, array(
  'MySkin' => ">skin   *meta   /\\(:notabs:\\)/gi",
));

To elaborate:

MySkin is the unique name of our rule.
>skin is the "when" anchor, here it means it will be processed after the core rule named "skin" (which does noleft, nofooter, etc.).
*meta is the semantic type, in this case "meta tag". Text that matches our rule will be output with the CSS classnames "pmtag pmmeta", so it is easy to style and to tell apart from other markup.

When you define your rule in $CustomSyntax, the "Semantic-type" entry is actually one of the above types. When some text matches the regular expression of your rule, it is wrapped in a <span> element with the classname of your semantic type. The classnames are prefixed with "pm", like "pmescaped" or "pmpunct", to prevent interference from various frameworks that may use the same classnames.

It is possible to define custom types, but then you need to also provide the CSS for these. If you must add CSS, please limit yourself to adding colors and backgrounds (possibly via CSS variables to allow adaptations). Any custom CSS must NOT change the font family, font size, or any other metrics of fonts, characters, lines and spacing, otherwise it will cause misalignment of the edit form and make it unusable.

Regular expressions

The regular expressions added to the $CustomSyntax array are in fact PHP strings representing JavaScript regular expressions.

You already know and use regular expressions in PHP. Those in JavaScript have the following particularities:

Explicit Global: An expression like /regexp/ only finds the first match in the text; you need to use /regexp/g with the "g" (global) pattern modifier if you need to find all matches (you most often do).
Recent DotAll: The "/s" pattern modifier (the "dot" meta character may include newlines) was only recently added to all major browsers (mid-2021); it may be safer to use the character class [\s\S] in place of your dot.
No lookbehind: The positive and negative lookbehind assertions like (?<=text) and (?<!text) are unsupported on one major browser platform, Safari desktop and mobile. (Lookahead assertions work.)
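The first two points can be verified with a short JavaScript snippet. The sample strings are made up for the illustration.

```javascript
const directives = "(:a:) and (:b:)";

// Without "g" only the first match is found; with "g" all of them are.
const first = directives.match(/\(:\w+:\)/);
const all = directives.match(/\(:\w+:\)/g);
console.log(first[0]); // "(:a:)"
console.log(all);      // [ "(:a:)", "(:b:)" ]

// Without the "s" modifier the dot stops at a newline; [\s\S] does not.
const multiline = "[^two\nlines^]";
const dotMatches = /\[\^.*\^\]/.test(multiline);
const classMatches = /\[\^[\s\S]*\^\]/.test(multiline);
console.log(dotMatches);   // false
console.log(classMatches); // true
```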

In the PHP string, you generally need to escape a backslash with another backslash.

A regular expression must not contain 2 or more consecutive white-space characters, because PmSyntax will mistake them for a separator. For example, in a double-quoted PHP string [\n\t ] becomes a newline, a tab and a space. Escape the backslashes, as in [\\t\\n ], so that the set contains only a single literal space.
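The effect of stray white space can be shown with a JavaScript sketch. It assumes that the value is split on runs of 2+ white-space characters, as described earlier on this page; the rewrite with [ ][ ] is only one possible way to express two spaces without placing them side by side.

```javascript
// Assumption: separators are runs of 2+ white-space characters.
const separator = /\s{2,}/;

// Two consecutive spaces inside the regular expression break it in two.
const broken = "<punct  punct  /a  b/g".split(separator);
console.log(broken); // [ "<punct", "punct", "/a", "b/g" ]

// Same pattern written so that no two white-space characters are adjacent.
const safe = "<punct  punct  /a[ ][ ]b/g".split(separator);
console.log(safe);   // [ "<punct", "punct", "/a[ ][ ]b/g" ]
```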

Advanced (more complex) cases

Workaround for lack of "lookbehind" assertions

We can solve the lack of "lookbehind" assertions with 2 regular expressions, the first one matching a container, and the second one matching against the container text.

This is how we would match a hypothetical markup of a single tilde "~" at the end of a line:

  '<punct   bullet   /[^~]~$/mg   /~/'

To elaborate:

<punct is the "when" anchor
bullet is the semantic type "start- and end-of line markup, bullets, indents"
the first regular expression finds the containers: portions that have one character not a tilde, followed by one tilde at the end of a line
the second regular expression finds only the tilde within each container, marks it as the 'bullet' type, and consumes (removes) it from the container, leaving a token; the container is then returned to the text.
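The two-step matching can be imitated in plain JavaScript. This sketch is only an illustration: the function markTrailingTilde is hypothetical, and it writes the <span> directly instead of leaving a token as PmSyntax does.

```javascript
// Step 1 finds containers (a non-tilde followed by one tilde at end of line).
// Step 2 marks only the tilde inside each container.
function markTrailingTilde(text) {
  return text.replace(/[^~]~$/mg, (container) =>
    container.replace(/~/, '<span class="pmbullet">~</span>'));
}

const marked = markTrailingTilde("one~\ntwo ~ three\nfour~~");
console.log(marked);
// Only the tilde ending the first line is wrapped; a tilde in the middle
// of a line and a double tilde at the end of a line are left alone.
```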

Markup with "start", "middle", and "end"

You can easily mark some text with "starting tag" (bold), "middle content" (normal), and "ending tag" (bold). Your regular expression needs to capture 3 sub-patterns with parentheses for the "start, middle and end" parts of your markup.

For example, if we want to define a Footnote markup like [^footnote text^], we can add the following entry to $CustomSyntax:

SDVA($CustomSyntax, array(
  'Footnote' => ">heading   =directive   /(\\[\\^)([\\S\\s]*?)(\\^\\])/g",
));

To elaborate:

Footnote is our unique identifier for the rule
>heading is the "when" anchor, here we position our rule after headings, near the end of the processing (we want any links and other punctuation to have already been processed)
=directive is our semantic type, "directive", and the "=" signifies that the regular expression has captured matches, $1 for the start, $2 for the middle and $3 for the end of the directive; $1 and $3 will be styled as "tag" (bold), not $2.
The regular expression indeed wraps in parentheses the starting and ending [^ and ^] parts, and the middle part between them. If you have other sub-patterns, you have to make them non-capturing with "(?:...)".

In the above regular expression we use [\S\s] (any character including newlines), so we could have line breaks in the footnote. If we wanted to only allow single-line footnotes, we would use the dot meta-character instead: /(\\[\\^)(.*?)(\\^\\])/g
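The three captured parts, and the difference between the two variants, can be inspected in JavaScript. The sample text is made up; the regular expressions are the ones from the rule above, after PHP unescaping.

```javascript
const footnote = /(\[\^)([\S\s]*?)(\^\])/g;   // allows line breaks
const singleLine = /(\[\^)(.*?)(\^\])/g;      // single-line footnotes only

const page = "A[^one^] B[^two\nlines^]";

// Start ($1), middle ($2) and end ($3) of every footnote.
const parts = [...page.matchAll(footnote)].map(
  ([, start, middle, end]) => ({ start, middle, end })
);
console.log(parts.length);     // 2
console.log(parts[1].middle);  // "two\nlines"

// The dot variant skips the footnote that spans two lines.
const singleLineCount = [...page.matchAll(singleLine)].length;
console.log(singleLineCount);  // 1
```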

Markup with "start", "middle" with standard attributes, and "end"

The "=type" semantic definition (e.g. "=directive" or "=meta") makes it easy to define start, middle, and end for a markup with nice styles. The "middle" part is not processed further.

A slightly more satisfying result can be achieved just as easily with "!type" (e.g. "!directive" or "!meta"). In such a case, the "middle" part is analyzed for strings, attributes, values, and special characters, like a generic directive:

(:mydirective "Some text" attr=value fmt=#template source={$Name}:)

Only the middle part ($2, second captured subpattern) is re-processed.

Markup with "start", custom "middle" patterns, and "end"

You can have a markup rule with multiple semantic types, with the same number of regular expressions.

If the "Semantic type" contains the character ">", then it is used as a separator to split the types into pieces. Every piece has a corresponding regular expression for that semantic type.

So, the first regular expression finds the container. If it has captured matches, the "middle" part is processed with all the other semantic types, one after another, and finally the first semantic type is applied.

Here is a shortened example of the $CustomSyntax definition of the Formula recipe:

SDVA($CustomSyntax, array(
  'Formula' => '<ws0  
    =directive>attr>keyword
    /(\{\$)([\s\S]*?)(\$\})/g 
    /#[#a-f\d]+/i
    /\\\\[a-z]+/gi',
));

To elaborate:

Formula is the unique identifier of the rule (recipe name)
<ws0 is the "when" anchor, before "ws0" (empty wikistyles)
=directive>attr>keyword are the semantic types, where:
- =directive is the container, with numbered captured matches for start, middle and end
- attr and keyword are the semantic types found in the "middle" part (both are directly within the container, "keyword" is not nested inside "attr").
The regular expressions are as follows:
- the first one is the container, matches multiline text between "{$" and "$}"
- the second one matches an attribute (the colors of the formula) in the "middle" part; it is not global, only the first match in the "middle" part will be considered an attribute
- the third regular expression matches any keywords like \sum or \infinity in the "middle" part.

After the container {$...$} is found, the middle part is re-processed searching for attributes and keywords, then the container is itself processed.
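To see which pieces each of the three regular expressions picks out, they can be run by hand in JavaScript. The sample formula is invented for the illustration, and the snippet only lists the matches; it does not reproduce the styling done by PmSyntax.

```javascript
// The three regular expressions of the rule, after PHP unescaping.
const container = /(\{\$)([\s\S]*?)(\$\})/g;
const attr = /#[#a-f\d]+/i;      // not global: first match only
const keyword = /\\[a-z]+/gi;    // global: every keyword

// Hypothetical wiki text containing one formula.
const source = String.raw`Area: {$ #ff0000 \sum x \infinity $}`;

const [, startTag, middle, endTag] = container.exec(source);
console.log(startTag, endTag);   // "{$" "$}"

// Both patterns are applied to the middle part only.
const firstAttr = middle.match(attr)[0];
const keywords = middle.match(keyword);
console.log(firstAttr);          // "#ff0000"
console.log(keywords);           // [ "\\sum", "\\infinity" ]
```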

5-part markup with "directive with attributes", "middle", and "directiveend"

A case similar to this one:

(:XYZ attr=value:) middle 123 (:XYZend:)

The core may already highlight this reasonably well, since the start and end directives (:XYZ attr=value:) and (:XYZend:) are already highlighted.

But if you want more control, notably wrapping the middle content, or excluding it from the other markup processing, or even custom nested highlighting within it, here is how to do it.

Here your regular expression needs to have 5 captured matches:

directive open, as in (:XYZ
directive attributes, as in attr=value
directive close, as in :)
middle content, as in middle 123
directive end, as in (:XYZend:)

Then, the second match will be parsed for "attributes". Depending on your semantic type prefix ("!" or "="), the 4th match will be either again processed for attributes or left as is.

If you have nested types and regular expressions (see previous section), they will be applied to the 4th match only. Such a complex example can be (carefully) reviewed at Test.PmSyntax-MermaidJS.
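A regular expression with the 5 captured matches might look like the one below. It is an illustrative pattern written for the (:XYZ:) example on this page, not one taken from a real recipe.

```javascript
// Hypothetical 5-part pattern: open, attributes, close, middle, end.
const fivePart = /(\(:XYZ)(.*?)(:\))([\s\S]*?)(\(:XYZend:\))/g;

const wikiText = "(:XYZ attr=value:) middle 123 (:XYZend:)";
const [, dirOpen, dirAttrs, dirClose, content, dirEnd] = fivePart.exec(wikiText);

console.log(dirOpen);  // "(:XYZ"
console.log(dirAttrs); // " attr=value"
console.log(dirClose); // ":)"
console.log(content);  // " middle 123 "
console.log(dirEnd);   // "(:XYZend:)"
```

The middle part uses [\s\S]*? so that the content between the two directives may span several lines.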

Mode "external" calls an external highlighter

You may need to embed code from another programming language into PmWiki, and want to highlight it as well.

The "external" mode currently uses Highlight.js if it is installed and enabled (see WikiStyles#highlight). Highlight.js already highlights code in preformatted blocks preceded by the %hlt LANGUAGE% WikiStyle:

!! Some PHP code
%hlt php% [@
  # only geeks allowed
  if(!isset($EnableGeek)) exit;
@]

!! Some CSS
%hlt css% [@
/* make sure nobody can read my page 
   except ants walking on the screen */
body, body * {
  font-size: 1px !important;
}@]

It is in fact quite easy to include the "external" processing in a directive from a recipe, but it has to do only the highlighting, not the directive itself. The directive needs to be processed later, or if it is standard, can be left for PmSyntax to process it.

There are 2 cases:

We know the other programming language
- We define the language name as a nested type as in external>html.
- Our regular expression needs to match the full container, in which is captured 1 group, the code.
The other programming language is in the directive.
- Our "external" regular expression needs to match the full container, in which are captured 2 groups, $1 the language name, and $2 the code.

Within the container, only the "code" part is highlighted with the "language name".

This mode will highlight the code and consume it (remove it from the container, leaving a token), and will return the container with the token into the full text. Other rules may process the directives of the container - core rules, or your own.

A: We know the other programming language

Say we have a custom directive that looks like this:

(:richtext:)some HTML code(:richtextend:)

We can add such a $CustomSyntax configuration:

SDVA($CustomSyntax, array(
  'RichText' => '>preserve
   external>html
   /\(:richtext.*?:\)([\s\S]+?)\(:richtextend:\)/gi',
));

To elaborate:

RichText is the unique identifier for the rule
>preserve is the "when" anchor; use ">preserve" if your code is not wrapped in the escape sequences [@...@] or [=...=]; use "<preserve" if it is.
external>html means mode "external", language "html" (could be any programming language supported by Highlight.js)
the regular expression matches the full container (:richtext:)..(:richtextend:), and in parentheses is captured the code ($1) to be externally highlighted.
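The captured group can be checked in JavaScript before adding the rule. The sample text is made up, and the snippet only extracts the code; the call to Highlight.js is left to PmSyntax.

```javascript
// The regular expression of the 'RichText' rule, after PHP unescaping.
const richText = /\(:richtext.*?:\)([\s\S]+?)\(:richtextend:\)/gi;

// Hypothetical page text with one rich text block.
const wiki = "Intro (:richtext:)<b>Hi</b>(:richtextend:) outro";

// $1 is the code that would be handed to the external highlighter.
const codeParts = [...wiki.matchAll(richText)].map((m) => m[1]);
console.log(codeParts); // [ "<b>Hi</b>" ]
```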

B: The other programming language is in the directive

Say we have a custom directive that looks like this:

(:code lang=LANGUAGE numberlines=1:)the actual code(:codeend:)

The wiki has the Highlight.js library enabled, and it can do the language "LANGUAGE".

We can add such a $CustomSyntax configuration:

SDVA($CustomSyntax, array(
  'CodeCodeEnd' => '>preserve
   external
   /\(:code .*?lang=([-\w+]+).*?:\)([\s\S]+?)\(:codeend:\)/gi',
));

To elaborate:

CodeCodeEnd is the unique identifier for the rule
>preserve is the "when" anchor; use ">preserve" if your code is not wrapped in the escape sequences [@...@] or [=...=]; use "<preserve" if it is.
external is the mode (type)
the regular expression matches the full container (:code:)..(:codeend:), and in parentheses are captured the language name ($1) and the code ($2) to be externally highlighted.
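Here, too, the two captured groups can be checked in JavaScript. The sample directive is made up, and the snippet stops at extracting the language name and the code.

```javascript
// The regular expression of the 'CodeCodeEnd' rule, after PHP unescaping.
const codeBlock = /\(:code .*?lang=([-\w+]+).*?:\)([\s\S]+?)\(:codeend:\)/gi;

// Hypothetical page text with one code block.
const input = "(:code lang=php numberlines=1:)echo 1;(:codeend:)";

// $1 is the language name, $2 is the code to highlight.
const found = [...input.matchAll(codeBlock)].map(
  ([, language, code]) => ({ language, code })
);
console.log(found); // [ { language: "php", code: "echo 1;" } ]
```

The character class [-\w+] lets the language name contain hyphens and plus signs, as in "c++".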

To do / some day / maybe

Change log / Release notes

2022-01-11 : added section about "external>html"
2022-01-07 : added section Mode "external" calls an external highlighter
2022-01-06 : added section 5 parts markup with "directive with attributes", "middle", and "directiveend", Test.PmSyntax-MermaidJS.
2022-01-04 : the documentation is mostly ready

Contributors

Written and maintained by Petko

Comments

See discussion at CustomSyntax-Talk

