ConvertHTML

Summary: Convert an HTML page to PmWiki markup
Version: 20210207
Prerequisites: PmWiki 2.2.58 or later
Status: beta
Maintainer: Petko (original author: Eemeli Aro)
Discussion: ConvertHTML-Talk
License: GPLv2

Questions answered by this recipe

  • How do I convert an HTML page to PmWiki markup?
  • How can I migrate a site to PmWiki or import HTML pages?

Description

PmWiki markup does not support all HTML markup, so a 100% conversion is not possible. However, PmWiki can make replacements to the text as it is being edited or saved, and ConvertHTML uses this to implement a relatively comprehensive set of rules for converting HTML tags to wiki markup.

To install this recipe:

  • download convert-html.php to your cookbook directory
  • add the following line to your configuration file:
    if ($action=='edit') include_once("$FarmD/cookbook/convert-html.php");

What it does

ConvertHTML uses the $ROEPatterns array to translate most HTML tags, leaving the rest intact. All replacements are case-insensitive, and attribute values may be enclosed in single or double quotes or, in some cases, left unquoted. The XHTML / at the end of a lone tag is always optional.
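
As a rough illustration of the pattern-array approach described above, the following sketch applies two replacement rules to a string. The rules, the $rules variable and the apply_rules() helper are hypothetical and are not copied from convert-html.php; in a real configuration the entries would be assigned to $ROEPatterns and PmWiki would apply them itself.

```php
<?php
// Hypothetical rules for illustration only; not the recipe's actual patterns.
$rules = array();
// <hr> or <hr />, in any case and without attributes -> horizontal rule
$rules['#<hr\s*/?>#i'] = "\n----\n";
// <b>...</b> or <strong>...</strong> without attributes -> '''bold'''
$rules['#<(b|strong)>(.*?)</\1>#is'] = "'''$2'''";

// Stand-in for what PmWiki does with such an array: apply each rule in turn.
function apply_rules($text, $rules) {
  foreach ($rules as $pattern => $replacement) {
    $text = preg_replace($pattern, $replacement, $text);
  }
  return $text;
}

// Prints '''bold''' and '''strong''' followed by ---- on its own line.
echo apply_rules('<B>bold</B> and <strong>strong</strong><hr />', $rules);
```

Because the patterns carry no attribute matching, a tag such as <b class="x"> is left intact here, which mirrors the "only if they contain no attributes" behaviour described for some tags.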

Any HTML inside [=...=] or [@...@] tags will be left untouched.

The following tags will be parsed only if they contain no attributes: B, BIG, BLOCKQUOTE, BODY, CODE, DD, DEL, EM, HEAD, HR, HTML, I, INS, PRE, SMALL, STRONG, SUB, SUP, TITLE, TT.

The following tags will be parsed even if they contain attributes: A, BR, DIV, DL, DT, FORM, H1..6, IMG, INPUT, LI, OL, OPTION, P, SELECT, SPAN, TABLE, TEXTAREA, TD, UL. These attributes will be assigned within an applicable (:...:) or %...% ... %% statement. For the most part, the recipe does not verify that these attributes are valid or effective as PmWiki markup.

Some additional notes:

  • <meta name="description|keywords" content="..." /> is also recognised, as are HTML comments <!-- ... -->.
  • Link and image targets that start with a . or a / are prefixed with Path:, while those that contain neither / nor : but do contain a . are prefixed with Attach:.
  • As PmWiki doesn't support spaces within named anchors ([[#...]]), these spaces are replaced with the _ character.
  • IMG tags with alt or title attributes are correctly handled, and align=left|right on an image results in the markup %lfloat% or %rfloat% at the beginning of the line.
  • Ordered and unordered lists are supported to an arbitrary depth.
  • Attributes defined for a TR are only applied to the first TD of the TR.
  • Only the clear attribute is supported for BR; having it set to all, left or right results in [[<<]] instead of \\ markup.
  • The generated markup for form elements may differ from the usual PmWiki markup conventions, which make use of positional arguments instead of named arguments. The markup should still be valid, however.
  • TEXTAREA is only supported for single-line default values, as PmWiki markup doesn't support it for multiple lines.
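
The link-target and named-anchor rules in the notes above can be restated as a small sketch. The function names are hypothetical and the code is not taken from convert-html.php; in particular, leaving all other targets unchanged is an assumption made here for illustration.

```php
<?php
// Illustrative helpers only; they restate the documented rules in plain PHP.

// Decide which PmWiki prefix a link or image target would get.
function sketch_prefix_target($target) {
  // Targets starting with "." or "/" are treated as site paths.
  if ($target !== '' && ($target[0] === '.' || $target[0] === '/')) {
    return 'Path:' . $target;
  }
  // No "/" and no ":" but at least one "." looks like an uploaded file.
  if (strpos($target, '/') === false
      && strpos($target, ':') === false
      && strpos($target, '.') !== false) {
    return 'Attach:' . $target;
  }
  // Assumption: anything else (full URLs, bare names) is left unchanged.
  return $target;
}

// Named anchors cannot contain spaces, so swap them for underscores.
function sketch_anchor_name($name) {
  return str_replace(' ', '_', $name);
}

echo sketch_prefix_target('/img/logo.png'), "\n";             // Path:/img/logo.png
echo sketch_prefix_target('photo.jpg'), "\n";                 // Attach:photo.jpg
echo sketch_prefix_target('https://example.org/a.png'), "\n"; // unchanged
echo sketch_anchor_name('section two'), "\n";                 // section_two
```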

Usage

  1. Install the recipe
  2. Paste HTML into a PmWiki edit box
  3. Press "Preview" or "Save and edit"
  4. Verify the resulting markup

Notes

The $ROEPatterns array is available in the PmWiki core starting from pmwiki-2.2.0-beta45. For earlier versions, you'll need to implement Cookbook.ROEPatterns or change the reference in the recipe file to use $ROSPatterns instead.

Suggestions, fixes and improvements to the regular expressions involved are quite positively encouraged.

I am aware that <p>...</p> tags end up having two empty rows between blocks, but this shouldn't affect the page's rendering and I'm not quite sure how to fix this in a robust manner.

If you use SourceBlock, you may need to add the following to your config file just before including convert-html.php:

$ROEPatterns['#\(:(code|source)(?:\s+.*?)?:\).*?\(:\1e?nd:\)#sei'] = 'Keep(stripslashes("$0"), "H")';
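
The line above relies on the "e" pattern modifier, which was removed in PHP 7.0. A possible equivalent for newer PHP versions is sketched below; it assumes that your PmWiki version accepts a callable as a $ROEPatterns replacement and passes it to preg_replace_callback(), which you should verify before relying on it.

```php
// Untested sketch for PHP 7+: same pattern without the "e" modifier.
// Assumption: PmWiki calls callable replacements via preg_replace_callback().
$ROEPatterns['#\(:(code|source)(?:\s+.*?)?:\).*?\(:\1e?nd:\)#si'] =
  function ($m) {
    // $m[0] is the whole matched block. stripslashes() is not needed,
    // since callbacks receive the match without added escaping.
    return Keep($m[0], 'H');
  };
```

As with the original line, this would go in the config file just before convert-html.php is included.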

I haven't actually tested the html2wiki program mentioned on the talk page, but as far as I can tell from its source files this recipe handles all of the markup also handled by html2wiki.

Release Notes

  • 20210207 : update for PHP 7.3-8.0.
  • 20150827 : fix incomplete/missing definitions, reported by Oliver Betz.
  • 20150816 : update for PHP 5.5, requires PmWiki 2.2.58 or later.
  • 2011-02-16
    • bugfixed quotes in [=...=] and [@...@] exclusion (reported by Maxim)
  • 2010-12-23
    • added [=...=] and [@...@] exclusion
  • 2010-04-20
  • 2009-08-25
    • A accesskey, rel, and target attributes are handled, with <a ... target="_blank"> becoming %newwin%[[...]] (suggested by overtones99)
    • bugfix: using stripslashes instead of stripmagic
  • 2009-04-20
    • support for form elements (suggested by simon)
    • bugfix: links now need to contain a '.' to become Attach: links (reported by simon)
  • 2008-10-07
    • better documentation
    • bugfixes: white space in output, DL lists
    • IMG alt/title and align attributes
    • better A names and targets
  • 2008-10-05 : first public release

Comments

See discussion at ConvertHTML-Talk
