Summary: Convert an HTML page to PmWiki markup
Prerequisites: PmWiki 2.2.58 or later
(original author: Eemeli Aro)
Questions answered by this recipe
- How do I convert an HTML page to PmWiki markup?
- How can I migrate a site to PmWiki or import HTML pages?
PmWiki markup does not cover all of HTML, so a 100% conversion is not possible. However, PmWiki can make replacements to the text as it is being edited or saved. ConvertHTML implements a relatively comprehensive set of rules for converting HTML tags to wiki markup.
To install this recipe:
- download convert-html.php to your cookbook directory
- add the following line to your configuration file:
if ($action=='edit') include_once("$FarmD/cookbook/convert-html.php");
What it does
ConvertHTML uses the $ROEPatterns array to translate most HTML tags, leaving the rest intact. All replacements are case-insensitive, and attribute values may be surrounded by single or double quotes or, in some cases, left unquoted. The XHTML / at the end of a lone tag is always optional.
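As a rough illustration of this approach, the standalone sketch below applies a small pattern-to-replacement array with preg_replace. It is not the recipe's actual rule set: the array name $rules, the function name convert_sketch, and the three rules are hypothetical, and the real patterns live in convert-html.php.

```php
<?php
// Hypothetical, simplified rules in the spirit of $ROEPatterns:
// regex pattern => replacement. Not taken from convert-html.php.
$rules = array(
  // <b>...</b> or <strong>...</strong> without attributes -> '''strong'''
  '#<(b|strong)>(.*?)</\1>#is' => "'''\$2'''",
  // <i>...</i> or <em>...</em> without attributes -> ''emphasis''
  '#<(i|em)>(.*?)</\1>#is' => "''\$2''",
  // <hr>, <hr/> or <hr />: the trailing slash is optional -> ----
  '#<hr\s*/?>#i' => '----',
);

// Apply every rule in order; tags without a matching rule are left intact.
function convert_sketch($html, $rules) {
  return preg_replace(array_keys($rules), array_values($rules), $html);
}

echo convert_sketch('<B>bold</B> and <em>emphasis</em><HR />', $rules), "\n";
// prints: '''bold''' and ''emphasis''----
```

The i modifier gives the case-insensitive matching described above, and a tag with no matching rule (for example <br> in this sketch) passes through unchanged.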
Any HTML inside
] tags will be left untouched.
The following tags will be parsed only if they contain no attributes: B, BIG, BLOCKQUOTE, BODY, CODE, DD, DEL, EM, HEAD, HR, HTML, I, INS, PRE, SMALL, STRONG, SUB, SUP, TITLE, TT.
The following tags will be parsed even if they contain attributes: A, BR, DIV, DL, DT, FORM, H1..6, IMG, INPUT, LI, OL, OPTION, P, SELECT, SPAN, TABLE, TEXTAREA, TD, UL. These attributes will be assigned within an applicable %...% ... %% statement. The validity or effectiveness of these attributes as PmWiki markup isn't verified, for the most part.
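To make the attribute handling more concrete, here is a minimal sketch of how a SPAN with attributes could be turned into a %...% ... %% statement. The function names parse_attrs_sketch and span_to_wikistyle_sketch are hypothetical, and the exact output of the recipe may well differ; like the recipe, the sketch does not check whether the attributes are meaningful to PmWiki.

```php
<?php
// Parse name=value pairs; values may use double quotes, single quotes, or none.
function parse_attrs_sketch($attrText) {
  $attrs = array();
  $re = '/([a-z][a-z0-9_-]*)\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|([^\s>]+))/i';
  if (preg_match_all($re, $attrText, $m, PREG_SET_ORDER)) {
    foreach ($m as $hit) {
      // At most one of the three value groups (2, 3, 4) is non-empty.
      $value = '';
      for ($i = 2; $i <= 4; $i++) {
        if (isset($hit[$i]) && $hit[$i] !== '') { $value = $hit[$i]; break; }
      }
      $attrs[strtolower($hit[1])] = $value;
    }
  }
  return $attrs;
}

// Rewrite <span attrs>content</span> as %attrs%content%%.
function span_to_wikistyle_sketch($html) {
  return preg_replace_callback(
    '#<span\b([^>]*)>(.*?)</span>#is',
    function ($m) {
      $parts = array();
      foreach (parse_attrs_sketch($m[1]) as $name => $value) {
        // Quote values containing whitespace so they stay a single argument.
        if (preg_match('/\s/', $value)) $value = "'" . $value . "'";
        $parts[] = "$name=$value";
      }
      if (!$parts) return $m[2];  // no attributes: keep only the content
      return '%' . implode(' ', $parts) . '%' . $m[2] . '%%';
    },
    $html
  );
}

echo span_to_wikistyle_sketch('<SPAN class="note" id=\'n1\' lang=en>text</SPAN>'), "\n";
// prints: %class=note id=n1 lang=en%text%%
```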
Some additional notes:
- <meta name="description|keywords" content="..." /> is also recognised, as are HTML comments <!-- ... -->.
- Link and image targets that start with a . or a / are prepended with Path:; those that contain neither / nor : but do contain a . are prepended with Attach:.
- As PmWiki doesn't support spaces within named anchors ([[#...]]), these spaces are replaced with the _ character.
- IMG tags with title attributes are correctly handled, and align=left|right on an image results in the corresponding float markup (e.g. %rfloat%) at the beginning of the line.
- Ordered and unordered lists are supported to an arbitrary depth.
- Attributes defined for a TR are only applied to the first TD of the TR.
- Only the clear attribute is supported for BR; having it set to right results in [[<<]] instead of the usual line break markup.
- The generated markup for form elements may differ from the usual PmWiki markup conventions, which make use of positional arguments instead of named arguments. The markup should still be valid, however.
- TEXTAREA is only supported for single-line default values, as PmWiki markup doesn't support multi-line defaults.
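Two of the notes above, the link target prefixes and the anchor names, can be sketched as small standalone functions. The function names are hypothetical, and the Attach: prefix for dotted file names is an assumption inferred from the changelog entry about Attach: links rather than from the recipe's source.

```php
<?php
// Sketch of the link/image target rule from the notes above.
function prefix_target_sketch($target) {
  // Targets starting with "." or "/" are treated as paths.
  if (preg_match('#^[./]#', $target)) return 'Path:' . $target;
  // No "/" and no ":" but a "." present: assumed to be an attached file.
  if (strpbrk($target, '/:') === false && strpos($target, '.') !== false)
    return 'Attach:' . $target;
  // Everything else (full URLs, Group/Page names) is left alone.
  return $target;
}

// Named anchors cannot contain spaces, so spaces become underscores.
function anchor_name_sketch($name) {
  return str_replace(' ', '_', $name);
}

echo prefix_target_sketch('/img/logo.png'), "\n";  // prints: Path:/img/logo.png
echo prefix_target_sketch('logo.png'), "\n";       // prints: Attach:logo.png
echo anchor_name_sketch('my section'), "\n";       // prints: my_section
```

Absolute URLs such as http://example.com/a.png contain both / and :, so neither prefix applies to them in this sketch.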
To use this recipe:
- Install the recipe
- Paste HTML into a PmWiki edit box
- Press "Preview" or "Save and edit"
- Verify the resulting markup
The $ROEPatterns array is available in the PmWiki core starting from pmwiki-2.2.0-beta45. For earlier versions, you'll need to implement Cookbook.ROEPatterns or replace the reference in the cookbook to use
Suggestions, fixes and improvements to the regular expressions involved are quite positively encouraged.
I am aware that <p>...</p> tags end up producing two empty lines between blocks, but this shouldn't affect the page's rendering, and I'm not quite sure how to fix it in a robust manner.
If you use SourceBlock, you may need to add the following to your config file just before including this recipe:
$ROEPatterns['#\(:(code|source)(?:\s+.*?)?:\).*?\(:\1e?nd:\)#sei'] = 'Keep(stripslashes("$0"), "H")';
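To see which parts of a page that expression would protect, the standalone sketch below runs the same regular expression without the e modifier (which marks the replacement as PHP code to evaluate) and lists the matching blocks. The names $pattern and protected_blocks_sketch and the sample page text are hypothetical; PmWiki's Keep() function is not called here.

```php
<?php
// Same expression as in the configuration line above, minus the "e" modifier.
$pattern = '#\(:(code|source)(?:\s+.*?)?:\).*?\(:\1e?nd:\)#si';

// Return every (:code:)...(:codeend:) or (:source:)...(:sourceend:) block.
function protected_blocks_sketch($text, $pattern) {
  preg_match_all($pattern, $text, $m);
  return $m[0];
}

$page = "intro <b>bold</b>\n"
      . "(:source lang=php:)\n<b>kept</b>\n(:sourceend:)\n"
      . "outro";

// Only the (:source:) block is reported; the <b>bold</b> in the intro is
// outside it and would still be converted.
print_r(protected_blocks_sketch($page, $pattern));
```

The backreference \1 requires the closing directive to repeat the opening name, so a (:code:) block is not ended by (:sourceend:).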
I haven't actually tested the html2wiki program mentioned on the talk page, but as far as I can tell from its source files this recipe handles all of the markup also handled by html2wiki.
- 20150827 : fix incomplete/missing definitions, reported by Oliver Betz.
- 20150816 : update for PHP 5.5, requires PmWiki 2.2.58 or later.
- bugfixed quotes in
] exclusion (reported by Maxim)
- The accesskey, rel, and target attributes of A tags are handled, with <a ... target="_blank"> becoming %newwin%[[...]] (suggested by overtones99)
- bugfix: using stripslashes instead of stripmagic
- support for form elements (suggested by simon)
- bugfix: links now need to contain a '.' to become Attach: links (reported by simon)
- better documentation
- bugfixes: white space in output, DL lists
- better A names and targets
- 2008-10-05: first public release
See discussion at ConvertHTML-Talk