Content

Summary: API to create external pages dependent upon text in a wiki page.

Version:1.8 2008-11-26

Prerequisites:

Status: Bug fixes, referencing content from other pages

Maintainer:Martin Fick

Discussion: Content-Talk

Categories: MarkupWriting

Questions answered by this recipe

Is there an easy way to embed output from markups that generate images into wiki pages?
- Is there an easy way to manage the temporary files needed for these images?
- Is there an easy way to restrict access to these embedded images?
- Is there a clean logical URL scheme (no ugly hashes) I can use to reference these images from elsewhere?
Is there an easy safe way to call external programs/filters to process markup?
Is there an easy way to create links to output generated by external programs/filters?
Is there an easy way to pipeline output from one external program/filter to another?
Is there a way for developers to leverage external programs/filters already interfaced with pmwiki when interfacing new programs/filters with pmwiki?

Description

API to create external pages dependent upon text in a wiki page.

This recipe is a building block meant primarily to be used by other recipes, recipes which need to create derived content related to data/markup in a wiki page. This related data will either take the form of a separate file that can be accessed through a link in the current page, an image embedded in the current wiki page or inline output in the current wiki page. The Content recipe provides a mechanism for recipe builders to define content types along with conversion mechanisms between these content types (filters/converters). This recipe also provides wiki authors with the markup to provide the source data (text) to these content types along with the markup to reference this content using a simple abstract syntax known as ContentPaths.

Lastly, this recipe provides many underlying technical mechanisms that help recipe builders with the common tasks of managing dynamic content. This frees them from much of that burden and lets them focus on the specific issues of their types and converters. Mechanisms provided by this recipe include: temporary file management for converters, cache management, temporary content preview generation, content page splitting (allowing a large file to be split into multiple files/links) and content protection (using pmwiki's built-in authorization scheme for pages).

Notes

To install this recipe:

Put content.php and markupsamepass.php in your cookbook directory.
Add the following line to local/config.php

@include_once("$FarmD/cookbook/content.php");

Optional interesting configuration:

Configure your cache directory with $ContentCfgCacheDir
Configure your preview signing key with $ContentCfgPreviewKey
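
As a rough sketch, the two optional settings could sit next to the include line in local/config.php. Both values are placeholders, and setting the variables before the include is an assumption based on common PmWiki recipe practice rather than something this page states.

```php
// Fragment for local/config.php (sketch; both values are placeholders).
$ContentCfgCacheDir = '/var/cache/pmwiki-content';  // hypothetical path, not web accessible
$ContentCfgPreviewKey = 'replace-with-a-long-random-string';

// Assumption: the variables are set before the recipe is loaded.
@include_once("$FarmD/cookbook/content.php");
```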

Since this recipe is mostly an API, this page is a little longer than a typical recipe. There is no use installing this recipe unless you need it for another recipe or want to develop to the API.

This recipe was written as a building block for the Music recipe. The Music recipe is therefore an excellent example of what can easily be done with this recipe and how to make efficient use of the Content API.

API

Defining and using a Base Type

One of the simplest uses of the content recipe would be the creation of a (plain) text type. This type would allow a wiki author to enter plain text, such as a poem, into a wiki page and to make this plain text poem available as a link from the wiki page. A recipe author can easily define such a simple type with the ContentRegisterType() function like this:

  ContentRegisterType('text', 'text/plain', 'txt', null, true);

This will not only create the text type, it will define its mime type and define a pair of directives to enclose plain text in: (:text:) ... (:textend:). A wiki author can then use these directives to define text which will be considered as text/plain and to create a link labeled text that will point to a file of type text/plain (with a .txt extension for easy opening).

Defining and using a Derived Type

Alone, this single type may not seem very useful, but once a base type has been created, other derived types may be created along with filters which convert from the base type to the derived types. Imagine a text2ps filter which converts plain text to postscript so that poems can be nicely printed. A recipe builder can define such a filter along with the ps type like this:

  ContentRegisterType('ps', 'application/postscript', '');
  ContentRegConverter('text', 'ps', 'php_text2ps');

Having defined a new type, ps, and a php function, php_text2ps(), which can convert from text to ps, ps will now be considered a derived type of text and whenever a wiki author uses the previously defined (:text:) ... (:textend:) directives to define some text, they will no longer just see a link to a plain text version of this text content, but now there will also be a link to the postscript version labeled ps!

Defining an Embeddable Type

To further extend our example, suppose that a recipe author has a filter which can convert postscript to a gif. By defining the gif type and a ps2gif() converter, the output of this converter, a gif image, will automatically be embedded (because gif is defined as a pmwiki image type, see pmwiki's $ImgExtPattern) in the wiki page where the original (:text:) ... (:textend:) directives were entered, above the previously mentioned text and ps links.

Built-In Types

There are three built in types to the Content recipe which are treated differently than other types. They are: the inline, safe, and raw types. These types are special and meant to be used as destination (output) types for converters. Whenever a converter converts to one of these types, the output from the converter will be embedded as either wiki text or html above the other type links. This allows the content recipe to be used as a simple mechanism to create directives which call external programs/scripts which then create html or more wiki text. For example, the SDML recipe uses the inline type to embed ascii sequence diagrams.

The first two types, inline and safe, will be interpreted as wiki text and therefore may contain wiki markup which will then be further processed by pmwiki. Both of these types are safe from html injection. The inline type will be embedded with no special processing (except for the html injection protection), while the safe type will incur an immediate call to the pmwiki MarkupToHTML() function, and therefore will terminate any unterminated tags up until that point (ifs, tables...). Lastly, the raw type will be embedded as raw html and will not undergo any further markup processing. Use this type with care! Due to the inherent insecurity of allowing users to inject html directly into the output of a wiki page, the raw type may not be defined as a starting type. In other words, the (:content:) directive will not accept data of type raw. If you choose to use this as an output type with one of your converters, you had better know what you are doing!

Taking Advantage of Other Filters and Other Types

Once we have several types and conversion mechanisms defined between them, new types can automatically take advantage of currently existing types and converters. Imagine now a new type which defines a music notation along with a converter which can convert this music type to postscript. This new type will now automatically get embedded in our wiki page as a gif since the content system already knows how to convert postscript to gifs!

A Content Path: Referencing Content

Wiki authors may want to reference content defined in other sections of a page. This can be done using an enhancement to page variables called Content Paths. A Content Path can be defined in varying degrees of accuracy. If we wanted to embed another gif of our original text, named poem, somewhere in our page apart from its text definition, this could be done like this:

  {$/text/ps/gif~poem}

This page variable will evaluate to a URL to a gif which will end up being embedded in the wiki page at the current location. A more verbose way to define this Content Path would be:

  {$/text/text2ps:ps/ps2gif:gif~poem}

Where the first example defined a type path from the base type to the final type, this more accurate example defines not only the types which must be passed through from the base type to the final type, but also the specific converters to get there. The reason we can omit these converters is that, if they are not defined, they are assumed to be <prevtype>2<nexttype>. On the other hand, a lazier approach than the first one would be to write the Content Path like this:

  {$/../gif}

This Content Path presupposes many things and could lead to different results depending on where it is in a page! First since there is no base type and no name specified (the name would normally be specified after a tilde: '~'), they are assumed to be whatever the most recent content type and name that we have defined content for (so this must be used near the content definition). The next thing that this leaves up to the content system to fill in is the actual type path from the base type to the final type, this is indicated by the '..' (dotdot). Since the content system knows the base type (text) and the final type (gif), it is smart enough to fill in the blanks (or the dotdot) with ps!
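
Filling in the '..' can be pictured as a search over the registered converters. The following is only a sketch of that idea, not the recipe's actual code: the converter table, the function name sketch_find_type_path() and the choice of a shortest-path search are all assumptions made for illustration.

```php
<?php
// Hypothetical converter table: input type => list of output types.
// In the recipe this information would come from the registered converters.
$sketch_converters = array(
  'text'  => array('ps'),
  'music' => array('ps'),
  'ps'    => array('gif'),
);

// Breadth-first search for the shortest chain of types from $from to $to.
// Returns the chain as an array of type names, or null when none exists.
function sketch_find_type_path($converters, $from, $to) {
  $queue = array(array($from));
  $seen = array($from => true);
  while ($queue) {
    $path = array_shift($queue);
    $last = $path[count($path) - 1];
    if ($last === $to) return $path;
    $next = isset($converters[$last]) ? $converters[$last] : array();
    foreach ($next as $type) {
      if (isset($seen[$type])) continue;   // never revisit a type
      $seen[$type] = true;
      $queue[] = array_merge($path, array($type));
    }
  }
  return null;
}

// Prints the chain found between the base type and the final type.
echo implode('/', sketch_find_type_path($sketch_converters, 'text', 'gif')), "\n";
```

With this table, the chain from text to gif passes through ps, which matches the example in the text.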

Arguments may be added to content paths with a dot after the converter or the output type which takes them for short. i.e. {$/abcpp/abcpp2abcm.ARG/ps/ppm/gif} or the shorter version {$/abcpp/abcm.ARG/ps/ppm/gif}
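
The default converter naming rule (<prevtype>2<nexttype>) can be illustrated by expanding a short Content Path into its verbose form. This is a minimal sketch under stated assumptions: sketch_expand_content_path() is a hypothetical helper, not part of the recipe, and it handles neither arguments nor the '..' shorthand.

```php
<?php
// Expand a short content path such as "/text/ps/gif~poem" into the verbose
// form "/text/text2ps:ps/ps2gif:gif~poem". Segments that already name a
// converter ("converter:type") are kept unchanged.
function sketch_expand_content_path($path) {
  $name = '';
  $tilde = strpos($path, '~');
  if ($tilde !== false) {
    $name = substr($path, $tilde);        // keeps the leading "~"
    $path = substr($path, 0, $tilde);
  }
  $segments = explode('/', trim($path, '/'));
  $prev = array_shift($segments);         // the base type
  $out = array($prev);
  foreach ($segments as $segment) {
    if (strpos($segment, ':') !== false) {
      list(, $type) = explode(':', $segment, 2);
      $out[] = $segment;                  // converter already given
    } else {
      $type = $segment;
      $out[] = "{$prev}2{$type}:{$type}"; // default converter name
    }
    $prev = $type;
  }
  return '/' . implode('/', $out) . $name;
}

echo sketch_expand_content_path('/text/ps/gif~poem'), "\n";
```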

Caching

The content system is set up to automatically cache the output of each stage of content processing. This means that dynamic content will only be created once. Successive accesses will be retrieved from the cache. This can substantially improve the performance of many pages, even on their first pass! The caching system is smart enough to clear out a page's cache each time the page is saved, ensuring that old inaccurate or unused data does not persist on disk. When feasible, the caching mechanism is also used to speed up preview generation.
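
The create-once, reuse-afterwards behaviour can be sketched with a small file cache. None of the names below belong to the recipe: sketch_cache_get(), sketch_cache_clear() and the directory layout are hypothetical, and the real cache is organised per page and per conversion stage.

```php
<?php
// Return cached output for $key if present; otherwise run $producer once,
// store its result, and return it. $dir is a hypothetical cache directory.
function sketch_cache_get($dir, $key, $producer) {
  $file = $dir . '/' . rawurlencode($key);   // readable, filesystem-safe name
  if (is_file($file)) return file_get_contents($file);
  $data = call_user_func($producer);
  if (!is_dir($dir)) mkdir($dir, 0700, true);
  file_put_contents($file, $data);
  return $data;
}

// Drop everything cached under $dir, as would happen when a page is saved.
function sketch_cache_clear($dir) {
  foreach (glob($dir . '/*') ?: array() as $file) unlink($file);
}

// Usage: the second call is served from disk, so the producer runs once.
$dir = sys_get_temp_dir() . '/content-sketch-' . getmypid();
sketch_cache_clear($dir);
$runs = 0;
$make = function () use (&$runs) { $runs++; return 'converted'; };
sketch_cache_get($dir, 'text/ps~poem', $make);
sketch_cache_get($dir, 'text/ps~poem', $make);
```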

Setting The Temporary and Cache Directories

The temporary and cache directories should share a common filesystem (so that files can be hard linked/renamed between them), and should probably also not be web accessible. The default location for these directories is under /var/tmp/content but this can be changed by setting the $ContentCfgCacheDir variable.

URL Previews

Previews are handled differently by the content system than the normal content creation/retrieval mechanisms. Since each browser fetch is handled by a separate process, previews cannot access the content of the current unsaved preview wiki page when serving up previews of the individual content type links/images. However, accurate previews are still possible: the preview page will create preview links for each content type. These preview links have embedded source data in the URLs, enabling them to create the individual content type pages from just the URL data!

Previews may use temporary resources on disk (including a temporary cache), but these resources are currently immediately cleared after content generation/serving and do not interfere with the saved contents of a page (what other readers will see).

Content Protection

The Content system creates content from the source data embedded in wiki pages. As such, the output of that content is assumed to be as sensitive as the original source data in the wiki page from which the source data comes. To achieve this, the content system checks that read authorization is granted to the current wiki page anytime that source content is taken from that page. Naturally, write protection is ensured by the fact that one needs write authorization in order to insert source data into a wiki page in the first place.

Signed Previews

For the paranoid, previews can be signed by simply setting the $ContentCfgPreviewKey to a value. If this key is set, the content source data and path will be signed with this key by means of a simple md5 checksum. When previews are fetched, this md5 checksum is verified before submitting the source data in the URLs to any content converters. Signing ensures that only authorized wiki authors can submit data to the various content conversion filters, even in preview mode.
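
The signing step might look roughly like the sketch below. The function names are hypothetical, and the exact fields and their order in the checksum are assumptions; only the use of an md5 checksum over the key, source data and path comes from the description above.

```php
<?php
// Sign preview source data and its content path with a secret key.
// Assumption: key, path and data are simply joined before hashing.
function sketch_sign_preview($key, $path, $data) {
  return md5($key . "\n" . $path . "\n" . $data);
}

// Verify a signature before handing the data to any converter.
// hash_equals() compares the two strings in constant time.
function sketch_verify_preview($key, $path, $data, $signature) {
  return hash_equals(sketch_sign_preview($key, $path, $data), $signature);
}

$sig = sketch_sign_preview('secret', '/text/ps', 'Roses are red');
var_dump(sketch_verify_preview('secret', '/text/ps', 'Roses are red', $sig));
var_dump(sketch_verify_preview('secret', '/text/ps', 'tampered', $sig));
```

A URL whose data or path was altered after signing fails the check, so it never reaches a converter.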

Converters

Converters are defined by recipe builders using the function:

  ContentRegConverter($intype, $outtype, $fnc, $cnv=null)

The $fnc is a php function that will be called when the conversion from $intype to $outtype is needed. The $cnv parameter represents the converter name. This name may be used to reference this converter in content paths. If the converter name is omitted, it is assumed to be <intype>2<outtype>.

The converter function itself must implement the following function signature:

 <functionname>($cp, $cnv, $intype, $outtype, $args, $data)

The $cp is a content path structure as returned by ContentParsePath(). The $cnv is the name of the converter which along with the types is specified so that you may overload converters. The $args are the arguments to the converter to be interpreted however the recipe builder wants. Lastly, the data source of type $intype will be contained in the $data parameter, and naturally the converter output should be of type $outtype and returned by the converter.
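
As an illustration of this signature, here is a minimal sketch of a text to postscript converter. The name sketch_text2ps() is hypothetical and the PostScript it emits is deliberately simplistic (one page, fixed font, no page overflow handling); only the parameter list is taken from the description above.

```php
<?php
// Converter sketch: plain text in, minimal PostScript out.
// Only $data is used; the other parameters are accepted to match the
// converter signature.
function sketch_text2ps($cp, $cnv, $intype, $outtype, $args, $data) {
  $ps = "%!PS\n/Courier findfont 12 scalefont setfont\n";
  $y = 720;                                  // start near the top of the page
  foreach (explode("\n", $data) as $line) {
    // Escape the characters that are special inside a PostScript string.
    $line = strtr($line, array('\\' => '\\\\', '(' => '\\(', ')' => '\\)'));
    $ps .= "72 $y moveto ($line) show\n";
    $y -= 14;                                // move down one line
  }
  return $ps . "showpage\n";
}

// Register the converter only when the Content recipe is loaded.
if (function_exists('ContentRegConverter')) {
  ContentRegConverter('text', 'ps', 'sketch_text2ps');
}
```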

FS Converters

In order to simplify the development of converters that are external programs there is a builtin filesystem converter. This converter will supply the input to your filter as a file and will take the output from another file. The FSConverter will take care of placing these files in a separate directory for the converter, and will clean them up for you when the conversion is done (and even submit the output file directly to the cache system for efficiency's sake!)

This converter can be used by calling its registering function:

  ContentRegFSConverter($intype, $outtype, $cmdfmt, $cnv=null, $argfnc=null)

As can be noted, this is somewhat similar to registering your own converter except that you do not need to register the php function name. Instead you must provide a $cmdfmt string which will contain the format of the system command to execute. This format string can contain any valid shell commands and can contain three special tokens to represent the input/output files and user provided arguments: ${i}, ${o}, and additionally the ${a} token if $argfnc is provided. The input and output files will be generated by the converter and will end with the input and output type's extension respectively. The usual precautions should be noted if you are going to be supplying user provided data in the $cmdfmt. Hint: don't, use the safer $argfnc method discussed below.

If both the $cmdfmt and the $argfnc are provided, the FSConverter will call the $argfnc function with an array of arguments as $argv, and in turn expects an array of arguments in return. The array of input arguments is created by parsing each dot '.' separated argument to the current converter into its own array element for easier handling by the argument processing function. The argument processing function must implement the following function signature:

  <functionname>($cp, $cnv, $intype, $outtype, $argv)

The first 4 arguments, $cp, $cnv, $intype, and $outtype are defined the same way they are for the Converter function signature defined previously. The array of arguments returned by this function will each be safely escaped for shell use (to prevent malicious shell character injections) and then space separated and substituted into $cmdfmt in place of the token ${a} if the token is present. Here is an example converter registration along with an argument processing function for use with the music recipe:

  ContentRegFSConverter("abcpp", "abctab", 'abcpp ${a} < ${i} > ${o}',null,'music_args_abcpp');

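  function music_args_abcpp($cp, $cnv, $intype, $outtype, $argv) {
    // Start with an empty array so that an array is returned even without arguments.
    $out = array();
    // Turn each dot separated argument into a command line flag.
    foreach($argv as $arg) $out[] = "-$arg";
    return $out;
  }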

Note: be sure to read the security section below to find out why you should not actually use this example!
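
The token substitution described in this section can be sketched as follows. This is not the FSConverter's own code: sketch_build_command() is a hypothetical helper, 'somefilter' is a made-up program name, and escaping the file names as well as the arguments is an assumption.

```php
<?php
// Substitute the ${i}, ${o} and ${a} tokens of a command format string.
// Each argument is escaped on its own, then the results are space separated.
function sketch_build_command($cmdfmt, $infile, $outfile, $argv) {
  $args = implode(' ', array_map('escapeshellarg', $argv));
  return strtr($cmdfmt, array(
    '${i}' => escapeshellarg($infile),
    '${o}' => escapeshellarg($outfile),
    '${a}' => $args,
  ));
}

// A hostile argument ends up as one quoted word instead of a second command.
echo sketch_build_command('somefilter ${a} < ${i} > ${o}',
  '/tmp/in.txt', '/tmp/out.ps', array('-x; rm -rf /')), "\n";
```

Note the single quotes around the format string: in double quotes PHP would try to interpolate ${a} itself.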

A Note about FSConverters and Security

When deciding to interface with an external program using the FSConverter it is important to be aware of the capabilities of the program you are embedding and how they may pose a threat to your system. Many programs have potentially malicious operations which may be invoked by the data given to them. Fairly obvious things to look out for range from: 1) the ability to execute random system commands (worst case), to 2) saving data to files (potentially destructive and probably escalatable to worse), to 3) simply reading other files (information disclosure).

For example, the gnuplot program allows all three of these features to be invoked directly from its "language": 1) one may escape to the shell with the !, i.e. !rm -rf /. 2) one may save data to files, i.e. save "/etc/hosts" or 3) view other files on the system, i.e. load "/etc/passwd". While these examples are not likely to actually succeed on most systems, do take this potential seriously!

A more subtle example is the abcpp program mentioned as an example in the previous section. The abcpp program works similarly to the C preprocessor, but on abc music files instead of .c files. This program has an #include "file" feature which may potentially lead to information leakage from your system. This is however a great candidate for the Cookbook.SchrootConverter extension! The SchrootConverter extension can make many programs safer for use even with the potential for malicious input data!
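
Sandboxing addresses dangerous input data; a stricter argument processing function can additionally narrow what reaches the command line. The sketch below is an assumption-laden illustration: sketch_args_filter() is a hypothetical name, and the rule of accepting only short alphanumeric arguments is an arbitrary example policy, not a recommendation for any particular program. It does nothing about directives embedded in the data itself.

```php
<?php
// Argument processing sketch: keep only short alphanumeric arguments and
// turn each into a "-flag". Anything else is silently dropped.
function sketch_args_filter($cp, $cnv, $intype, $outtype, $argv) {
  $out = array();
  foreach ($argv as $arg) {
    // \z anchors at the very end, so a trailing newline cannot slip through.
    if (preg_match('/^[A-Za-z0-9]{1,16}\z/', $arg)) $out[] = "-$arg";
  }
  return $out;
}

print_r(sketch_args_filter(null, 'c', 'a', 'b', array('v', '../etc', 'x y', 'q2')));
```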

Splitting The Output Pages

Some converters may want to provide several pages for output. From the user perspective this is handled by creating enumerated links for them to click on. From the recipe builder's point of view each output page must be able to be referenced by a separate content path, likely by varying the arguments to the particular converter which will do the splitting. This is done by returning a php array from your converter instead of returning the output data. The content system will interpret an array as an indication that splitting is needed, and it will create multiple links each resembling the current content path but with enumerated arguments matching the keys of the array.
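
The return shape of a splitting converter can be sketched like this. The converter name and the fixed page size are hypothetical, and the sketch shows only the array return value; how the converter later serves a single page when called with an enumerated argument is not covered here.

```php
<?php
// Converter sketch that splits its input into pages of $perpage lines.
// Returning an array instead of a string signals that splitting is wanted;
// the keys are used here as 1-based page numbers.
function sketch_text2pages($cp, $cnv, $intype, $outtype, $args, $data) {
  $perpage = 2;   // hypothetical fixed page size for the illustration
  $pages = array();
  foreach (array_chunk(explode("\n", $data), $perpage) as $i => $lines) {
    $pages[$i + 1] = implode("\n", $lines);
  }
  return $pages;
}

print_r(sketch_text2pages(null, 'text2pages', 'text', 'pages', null, "a\nb\nc"));
```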

Enumerated Links

In order to easily support output page splitting, the content system defines a link markup type for enumerated links. Enumerated links are links in a series that should be associated with the same text except that they should be differentiated by a number. The syntax for enumerated links is similar to normal links except that one should simply space separate the urls like this:

   [[ URL1 URL2 URL3... | Enumerated Link Text ]]

This will create the equivalent of:

  Enumerated Link Text([[URL1|1]] [[URL2|2]] [[URL3|3]])

Extra spaces between the URLs will cause the numbering to jump; you may use this intentionally.
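
The expansion of an enumerated link can be sketched as below. The helper sketch_enumerated_links() is hypothetical and takes the text found between [[ and ]]; that each extra space skips exactly one number is an assumption about how the numbering jumps.

```php
<?php
// Expand "URL1 URL2 URL3 | Text" into "Text([[URL1|1]] [[URL2|2]] [[URL3|3]])".
// Assumption: every extra space between URLs skips exactly one number.
function sketch_enumerated_links($inner) {
  list($urls, $text) = array_map('trim', explode('|', $inner, 2));
  $links = array();
  foreach (explode(' ', $urls) as $i => $url) {
    if ($url === '') continue;            // an extra space: skip this number
    $links[] = '[[' . $url . '|' . ($i + 1) . ']]';
  }
  return $text . '(' . implode(' ', $links) . ')';
}

echo sketch_enumerated_links(' URL1 URL2 URL3 | Enumerated Link Text '), "\n";
echo sketch_enumerated_links(' URL1 URL2  URL3 | Page '), "\n";
```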

The `(:content:) ... (:contentend:)` Directives

Wiki authors may use these directives to define the source text for a specific type. These directives will create a series of links to all the types to which this type can be translated. If any of these new types are embeddable such as images, they will be embedded at the current location in the page.

The source text is defined as the text between these two directives. The (:content:) directive must take at least one argument, the type, it may optionally take a second argument, the name. If the name is not defined one will be given to it automatically. On top of these 2 arguments, there are several options which can be specified with the first directive, they are:

  types=[+|-]type,...
  list=default|all|none
  embed=true|false
  args=converter|type.args,...
  defargs=converter|type.args,...

The types= and list= options allow you to customize which of the available type-links to create links to or to embed. For each input type there is a default list (set by the specific type's recipe builder) which can be altered with the list= option. This list can then be added to or deleted from with the types= option. If a type in the types= option does not contain a '+' or a '-', the types list will be assumed to be an absolute list and the list= option will be ignored entirely.
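
How list= and types= combine can be sketched as a small resolution function. The name sketch_resolve_types() and its parameters are hypothetical; $default and $all stand in for the lists that the type's recipe builder would supply.

```php
<?php
// Resolve which output types to link or embed.
// $default and $all are hypothetical lists supplied by the type's recipe.
function sketch_resolve_types($default, $all, $list, $types) {
  $items = array_values(array_filter(explode(',', $types), 'strlen'));
  // Any unsigned entry makes types= an absolute list; list= is then ignored.
  foreach ($items as $item) {
    if ($item[0] !== '+' && $item[0] !== '-') {
      return array_map(function ($t) { return ltrim($t, '+-'); }, $items);
    }
  }
  if ($list === 'all') $result = $all;
  elseif ($list === 'none') $result = array();
  else $result = $default;
  foreach ($items as $item) {
    $type = substr($item, 1);
    if ($item[0] === '+') {
      if (!in_array($type, $result)) $result[] = $type;   // add once
    } else {
      $result = array_values(array_diff($result, array($type)));
    }
  }
  return $result;
}

print_r(sketch_resolve_types(array('ps', 'gif'), array('ps', 'gif', 'pdf'), 'default', '+pdf,-ps'));
```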

The embed= option can be used to disable image embedding in favor of a link.

You may use the args= and defargs= options to indicate arguments which need to be passed to converters. The args= option is a comma separated list of converters/types followed by a dot '.' and then the arguments to the converter. i.e. args=abcpp.arg1.arg2:abcm,abctab.arg3 To explicitly set an argument list to null, use a dot but do not put anything after it. i.e. args=abcpp.
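
Parsing an args= value can be sketched as follows. The helper name is hypothetical and the sketch covers only the comma and dot syntax described above; entries without any dot are skipped, which is an assumption.

```php
<?php
// Parse an args= value such as "abcpp.arg1.arg2,abctab.arg3" into
// array('abcpp' => array('arg1', 'arg2'), 'abctab' => array('arg3')).
// A dot with nothing after it ("abcpp.") yields an empty argument list.
function sketch_parse_args_option($value) {
  $result = array();
  foreach (explode(',', $value) as $entry) {
    if (strpos($entry, '.') === false) continue;  // no arguments given
    $parts = explode('.', $entry);
    $target = array_shift($parts);                // converter or type name
    // "abcpp." leaves a single empty string behind: explicit "no arguments".
    $result[$target] = ($parts === array('')) ? array() : $parts;
  }
  return $result;
}

print_r(sketch_parse_args_option('abcpp.arg1.arg2,abctab.arg3'));
```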

The defargs= option understands the same syntax as the args= option, but it makes the arguments the default for those converters anytime the source data is referenced. The default arguments option provides a mechanism where arguments become 'bound' to the source. This makes it easy for other authors to reference content generated by a specific source without them needing to know/understand which arguments converters should get to make the source render/convert properly to an output format they wish to use. This is handy when referencing content via content path variables or with the (:contentlist:) directive.

The `(:contentlist:)` Directive

The (:contentlist:) directive accepts all the same arguments as the (:content:) directive (except for the defargs= option), but it may not be used to define source content. Instead this directive is used to output various types from a previously defined source. Like content path variables, this is a handy way to reference content defined elsewhere. But unlike content path variables, this directive allows you to easily define a whole set of output types and their converter arguments for a source. i.e. (:contentlist types=midi,mp3,vorbis args=abcm.args:)

More on Registering Types

The registering function for types is defined as:

  ContentRegisterType($type, $mime=null, $extension=null, $defaultlist=null, $directive=false)

If you want the extension to be the same as the type you may define it as a blank string ''; leaving it null will omit it entirely. The $defaultlist is a string of the same format as the types= option to the (:content:) directive and will be the default list of types output whenever this type is used as a source. The $directive argument can either be the name of a directive you wish to use to identify content for the type being registered, or if set to true it will make a directive with the same name as the type. The default for $directive is false which (as of version 1.6) means that this particular type may never be input directly as source; it can only be used as an output or transitional type. This feature makes it possible to define types which you would not trust users to create source for but which you may need to convert to. You may create types that do not have a directive defined but are still allowed as source types by setting $directive to null.

Debugging

To help debug content generation there is a log file for each page. This log file is stored in the cache directory for the page (which should not have web access) and may be configured to be accessible from the web to make debugging easier. To enable log file viewing with action=contentlog, set the variable $ContentLogViewer to the authorization level to which you would like to make it available. 'write' is probably a good default so that authors can view it. If log file viewing is enabled for at least 'write' authorization level, the log file will also be appended to the output of previews for quick easy viewing.

Custom converters may write to this logfile by use of the ContentLog($cp, $text) function. Simply supply the $cp structure for the current conversion along with the $text to log.

Release Notes

1.8 - 2008-11-26

Fixed some log authorization problems (bugs).
Fixed the missing name in source content types returned by the (:content:) and (:contentlist:) directives (bug).
Fixed a warning for when a type has no converters associated with it. (bug)
Made embed=false work (bugfix)
Added a nocheck option to the (:content:) and (:contentlist:) directives which creates links to types whether there is content or not (similar to the + at the end of cpaths)
Enabled Pagenames in cpaths. cpaths can now reference content from other pages similar to the way that page and pagetext variables do. i.e. {Group.Name$/cpath~name}. Useful in pagelists like this: {{=$FullName}$/cpath~name}
Added a page= option to the (:contentlist:) directive to enable the directive to reference content from another page. Very useful in pagelist, perhaps with the embed=false option.

1.7 - 2008-03-28

IMPORTANT: Cache filenaming has changed in this release. It is therefore suggested that you clear your content cache (rm -rf $ContentCfgCacheDir/cache) to remove old cache files now "incorrectly" named.

Refactored many small pieces of code, polished many aspects of the codebase:

Audited all the content access paths and tightened access control policies
Previews no longer accept source for types which do not accept source from directives
Previews of generated content now require 'edit' privileges
Audited error paths to improve error handling
Fixed many converter argument handling corner cases
Caching semantics and logging have been improved
Made friendly cache file names instead of md5 ones (see cache note above!)

1.6 - 2008-03-20

Fixed preview bug where saved data was displayed instead of the preview, introduced in 1.5
Enhanced the security model to disallow types which do not register a directive to be source types unless specifically designated as allowable.

1.5 - 2008-03-19

Fixed another caching bug where ContentCachePutFile() could mistakenly put files in the main cache during a preview.
Simplified caching a bit more, improves caching performance also.

1.4 - 2008-03-18

Fixed caching bug. Cached content was being deleted on editing instead of saving (noticed by Patrick Ogay)
Enable log file appending to previews (suggested by Patrick Ogay)
Major fixes to caching, caching is used more aggressively now
Some caching is used for previews now!

1.3 - 2008-03-16

All changes to this version should be backwards compatible with the previous release except for any custom uses of the previously undefined argument processing.

Added the args= and defargs= options to the (:content:) directive
Arguments are now annotated with dot '.' so that the comma can be used to separate values in directive option lists
Added support for custom argument processing via an external function to the FSConverter, upgraded the registering function signature for this converter
Added the (:contentlist:) directive
Added a mechanism to log converter output for a page and the action=contentlog to view it
Added the ContentLog($cp, $text) function

1.2 - 2008-02-15

Added support for 3 inline types
Fixed the [@ContentFSCon
