ConvertHTML-Talk

Summary: Talk Page for ConvertHTML recipe
Maintainer: Petko (original author: Eemeli Aro)

Comments

Alternative: html2wiki

There is a Perl program, html2wiki, which does a good job. You can use the converter on the web page, or install the program.

It can be installed from CPAN in the usual Perl way, or some Linux distributions may have it as a separate package, such as libhtml-wikiconverter-perl.

You need to install both the HTML::WikiConverter module and HTML::WikiConverter::PmWiki (the PmWiki "dialect" module).

The html2wiki script is a standalone program which takes an HTML input file and creates wikified output. You can then cut-and-paste the output into the wiki (or use your favourite editor, see EmacsPmWikiMode and Pywe).

For example:

   html2wiki --dialect=PmWiki input.html >output.wiki

20 Sept, 2022

html2wiki worked for me using:

sudo apt-get update

sudo apt-get -y install libhtml-wikiconverter-dokuwiki-perl

and then downloading the PmWiki dialect module from the repository below and following its build and install instructions:

https://github.com/gitpan/HTML-WikiConverter-PmWiki

Then, for example:

   html2wiki --dialect=PmWiki input.html >output.wiki

Throws deprecated error with PHP 7.3

Running PmWiki 2.2.134 with PHP 7.3, ConvertHTML triggers a "Deprecated: Function create_function() is deprecated in .../pmwiki.php on line 501" error.

No problem for me but maybe worth mentioning. OliverBetz February 07, 2021, at 04:38 PM

Updated for PHP 7.3-8.0 today. --Petko February 07, 2021, at 05:04 PM

Awesome, thanks! OliverBetz February 07, 2021, at 07:42 PM
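For anyone seeing the same notice in their own local customizations: create_function() was deprecated in PHP 7.2 and removed in PHP 8.0, and the usual replacement is an anonymous function. The following is a minimal sketch of that change, not the code from pmwiki.php or the recipe; the $prefix value and the callback are hypothetical.

```php
<?php
// Hypothetical example, for illustration only.
$prefix = 'Attach:';

// Old style (deprecated in PHP 7.2, removed in PHP 8.0), building code from a string:
//   $cb = create_function('$name', 'return "' . $prefix . '" . $name;');

// Closure equivalent: the outer variable is captured with "use" instead of
// being pasted into a code string.
$cb = function (string $name) use ($prefix): string {
    return $prefix . $name;
};

$result = implode(' ', array_map($cb, ['a.png', 'b.pdf']));
echo $result, "\n";
```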

Fewer conversions in version 20150816

Is it intentional that the 20150816 version doesn't convert <p>, tables and much more that the 2011-02-16 version handled?

No, it is an omission. Thanks for noticing -- should be fixed now. --Petko August 27, 2015, at 02:52 PM

Using ConvertHTML in another recipe --tamouse June 24, 2012, at 12:16 PM

I am looking at creating a recipe that will do the conversion of HTML outside the edit cycle. Would it be possible to use this recipe in that way?

Errors?

Version 2011-02-16 converts code included in [=...=] or [@...@]. Example:

$LinkPageSelfFmt = "<span class='selflink'>\$LinkText</span>";

becomes (the backslash before $LinkText is lost)

$LinkPageSelfFmt = "<span class='selflink'>$LinkText</span>";

OliverBetz 2011-05-14
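One general way to avoid touching escaped code is to apply conversions only to the text outside [=...=] and [@...@] blocks. The sketch below shows that idea in isolation; convert_outside_escapes() and the span-stripping callback are hypothetical helpers and are not how the recipe is implemented.

```php
<?php
// Sketch only: run a conversion on everything except [=...=] and [@...@] blocks.
// convert_outside_escapes() is a hypothetical helper, not part of the recipe.
function convert_outside_escapes(string $text, callable $convert): string {
    // Split on escaped blocks and keep them in the result list.
    $parts = preg_split('/(\[=.*?=\]|\[@.*?@\])/s', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
    foreach ($parts as $i => $part) {
        // With one capture group, odd indexes hold the escaped blocks: skip them.
        if ($i % 2 === 0) {
            $parts[$i] = $convert($part);
        }
    }
    return implode('', $parts);
}

// Hypothetical conversion used only to make the effect visible.
$strip_spans = function (string $s): string {
    return preg_replace('#</?span[^>]*>#i', '', $s);
};

$out = convert_outside_escapes(
    "<span>outside</span> [@<span class='selflink'>\\\$LinkText</span>@]",
    $strip_spans
);
echo $out, "\n";
```

The span outside the escaped block is removed, while the block itself, including the backslash, is passed through unchanged.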


The latest version is giving syntax errors for me when editing certain pages (that contain no HTML):

Parse error: syntax error, unexpected ':', expecting T_VARIABLE or '$' in /home/smspower/public_html/pmwiki.php(1691) : regexp code on line 18
Fatal error: preg_replace() [<a href='function.preg-replace'>function.preg-replace</a>]: Failed evaluating code: Keep(stripslashes(&quot;[@
...
in /home/smspower/public_html/pmwiki.php on line 1691

The markup snippet is part-way through my Site.LocalTemplates. I'm not sure what the problem is; it was fine with 2009-08-25.

That would be a bug in how I used 'quotes "inside" quotes' on a preg_replace call with the PREG_REPLACE_EVAL modifier. Fixed now with version 2011-02-16. —Eemeli Aro February 16, 2011, at 05:49 AM
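For context, the PREG_REPLACE_EVAL (/e) modifier was deprecated in PHP 5.5 and removed in PHP 7.0; preg_replace_callback() avoids this class of quoting bug because the matched text is passed as data rather than spliced into a code string. A minimal sketch, assuming a hypothetical keep_stub() function standing in for whatever protects the matched text; this is not the recipe's code.

```php
<?php
// keep_stub() is a hypothetical stand-in for a "protect this text" function.
function keep_stub(string $s): string {
    return '<<' . $s . '>>';
}

// Old style (removed in PHP 7.0) evaluated a code string built from the match:
//   preg_replace('/\[@(.*?)@\]/se', "keep_stub(stripslashes('$1'))", $text);

// Callback style: quotes and colons in the match cannot break any code.
$text = "[@say 'hi' and \"bye\": ok@]";
$out = preg_replace_callback('/\[@(.*?)@\]/s', function (array $m): string {
    return keep_stub($m[1]);
}, $text);
echo $out, "\n";
```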

font face not converted

I've just tried to convert some text that came from Google showing me a Word document. It took care of most issues, but I had to clean up a few hundred "<font face="Arial" size="5">DITA </font><font face="Arial" size="6">1</font>" type of things. Any possibility these could be included in the ROS patterns? Also & nbsp ; (ampersand-nbsp-semicolon) is left untranslated. --Peter Bowers May 11, 2010, at 08:19 AM

FONT tags I've left untouched for now. Yes, they're annoying, but they may also be necessary for the page layout. If you'd like to remove them on your own site, try adding the following to your config file:
$ROEPatterns['#</?font([^>]*)>#i'] = '';
&nbsp; is left as it is since it's valid PmWiki markup as well. To replace them with normal spaces, you could try adding the following to your config:
$ROEPatterns['#&nbsp;#'] = ' ';
--Eemeli Aro May 11, 2010, at 09:18 AM
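To see what these two patterns do before adding them to a config file, they can be tried with plain preg_replace() in a standalone script. This is only a quick check using the sample string from the question; inside PmWiki the patterns would go into $ROEPatterns as shown above.

```php
<?php
// Standalone check of the two suggested patterns, outside PmWiki.
$html = '<font face="Arial" size="5">DITA </font>'
      . '<font face="Arial" size="6">1</font>&nbsp;done';

// Remove opening and closing FONT tags, whatever their attributes.
$out = preg_replace('#</?font([^>]*)>#i', '', $html);

// Turn non-breaking space entities into normal spaces.
$out = preg_replace('#&nbsp;#', ' ', $out);

echo $out, "\n";
```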

redundant links

and one more - i was running into the issue that a link like <a href="http://blah.com">http://blah.com</a> ... is getting turned into [[http://blah.com|http://blah.com]], which is slightly redundant. i successfully added this line to the bottom of my ROEPatterns to reduce it even further:

	# convert [[http://blam.com|http://blam.com]] to http://blam.com
	,'#\[\[(http[^\|]+)\s*\|\s*\1\]\]#i' => '$1'

thanks again! overtones99 August 26, 2009, at 01:28 AM
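A quick standalone check of this pattern, run with plain preg_replace() outside PmWiki. The sample text is made up for illustration; the back-reference \1 means that only links whose text repeats the URL are collapsed.

```php
<?php
// Same pattern as above, tested outside PmWiki.
$pattern = '#\[\[(http[^\|]+)\s*\|\s*\1\]\]#i';

// Made-up sample: one redundant link and one link with real link text.
$wiki = 'See [[http://blah.com|http://blah.com]] and [[http://blah.com|Blah]].';

$out = preg_replace($pattern, '$1', $wiki);
echo $out, "\n";
```

The second link is left alone because its text differs from its URL.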


converting annoying tabs...

sorry - one more - it may just be a result of my own crappy first timer html coding efforts from several years ago, but i'm getting TONS of tabs everywhere in my output. i've found that adding the following very simple line is indispensable in my scenario:

	'#\t#i' => "",    # get rid of weird tabbing

overtones99 August 25, 2009, at 03:31 PM


Archived comments

"title" error, additional tags

In the title pattern seems to be an error: "\*s" should be "\s*".

<HTML></HTML>, <HEAD></HEAD> and <BODY></BODY> should be removed.

What about <FONT> tags? IMO annoying, shouldn't they be removed?

What about converting character entities (e.g. "Umlauts") to searchable characters? Strings containing "&uuml;" etc. are not searchable! Not an easy task because it might have unwanted side effects and it should respect the character set in use. Maybe it should be done only when found inside <HTML></HTML>.

OliverBetz 2010-01-24

I've updated the recipe to fix the title pattern error (thanks!) and to add the HTML/HEAD/BODY removal, provided that they don't have any parameters. FONT tags I've left untouched for now. Yes, they're annoying, but they may also be necessary for the page layout. If you'd like to remove them on your own site, try adding the following to your config file:
$ROEPatterns['#</?font([^>]*)>#i'] = '';
Converting character entities to characters or vice versa isn't a bad idea, but it's a different thing from what this recipe does: PmWiki will happily handle entities and characters, unlike HTML. —Eemeli Aro April 20, 2010, at 07:50 AM
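For sites that do want entities turned into characters, PHP's built-in html_entity_decode() can do it with an explicit target character set. This is a minimal sketch that assumes a UTF-8 wiki and is not part of the recipe. Note the side effect mentioned above: entities such as &amp; and &lt; are decoded too, which can turn escaped text into live markup.

```php
<?php
// Assumption: the wiki stores pages as UTF-8.
$source = 'M&uuml;nchen &amp; K&ouml;ln';

// ENT_HTML5 selects the HTML5 entity table; ENT_QUOTES also decodes quote entities.
$decoded = html_entity_decode($source, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $decoded, "\n";
```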

converting underlines

hi eemeli. i just noticed that underlines <u> aren't getting converted. i added the following line to my $ROEPatterns:

'#<u>(.*?)</u>#i' => "{+$1+}",

thanks. overtones99 October 02, 2009, at 09:18 PM

Added in version 2010-04-20. —Eemeli Aro April 20, 2010, at 07:50 AM
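The effect of the underline pattern can be checked in isolation with preg_replace(). The sample sentence is made up. Because the pattern has the i modifier but not s, it matches <U> as well as <u>, but only when the opening and closing tags are on the same line.

```php
<?php
// Same pattern and replacement as in the comment above, tested outside PmWiki.
$html = 'an <u>underlined</u> and <U>shouted</U> word';

$out = preg_replace('#<u>(.*?)</u>#i', '{+$1+}', $html);
echo $out, "\n";
```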

current content of convert-html file

The title field in the current convert-html file is

 Ubuntu Edgy on the Apple Macbook

and the file contents have nothing to do with html2pmwiki markup conversion

Jean-Pierre Chrétien 2010-01-08

Thanks for letting us know, now fixed (copied from convert-html-2009-08-25.php). --Petko February 09, 2010, at 07:38 AM

Update of today?

Hi Eemeli, the convert-html.php script was today uploaded again with no author specified and without further information. Spammed or correct version? -- SchreyP January 19, 2010, at 05:05 PM

convert-html.php (upload date 2010-01-19) and convert-html-2009-08-25.php (old) are identical, so don't worry OliverBetz 2010-01-24.

converting links with '_blank' to %newwin%[[url|text]]

hi. this works great. however, i've found that adding the following line to the top of my ROEPatterns is a must-have for my setup - maybe it is for others too?

    # add %newwin% before links with _blank
    '#<a\s[^>]*\bhref=([\'"])([^\'"]*?)\1[^>]*_blank[^>]*>(.*?)</a>#is' => "%newwin%[[$2|$3]]",

overtones99 August 25, 2009, at 04:58 AM

Thank you for the idea; I ended up reworking the A attribute handling so the recipe now understands rel and accesskey attributes as well as target. --Eemeli Aro August 25, 2009, at 08:39 AM

Thanks! the function works great! in fact, it also solves another problem i was having, where links without "" (ie. <a href=http://blah.com>, as opposed to <a href="http://blah.com">) weren't getting converted - but now they are! thanks! overtones99 August 25, 2009, at 03:31 PM
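For reference, this is how the originally suggested _blank pattern behaves when run on its own with preg_replace(); the sample links are made up. It requires a quoted href, which is why unquoted links were not converted by it. The reworked recipe handles attributes differently, and that code is not shown here.

```php
<?php
// The pattern from the first comment in this thread, tested outside PmWiki.
$pattern = '#<a\s[^>]*\bhref=([\'"])([^\'"]*?)\1[^>]*_blank[^>]*>(.*?)</a>#is';

$quoted   = '<a href="http://blah.com" target="_blank">Blah</a>';
$unquoted = '<a href=http://blah.com target=_blank>Blah</a>';

$out_quoted   = preg_replace($pattern, '%newwin%[[$2|$3]]', $quoted);
$out_unquoted = preg_replace($pattern, '%newwin%[[$2|$3]]', $unquoted);

echo $out_quoted, "\n";    // converted
echo $out_unquoted, "\n";  // unchanged: the pattern needs a quote after href=
```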


I followed the steps mentioned herein. The cookbook doesn't seem to work. I still see plain HTML code as the output.

Are you using at least version 2.2.0-beta45 of PmWiki? If not, you'll need to follow the additional instructions in the "Notes" section. --Eemeli Aro January 16, 2009, at 01:26 PM

I found this recipe really useful, and can fully recommend it. It saved me a lot of time.

But there are a couple of minor things to note:

<a href="Two#two">second</a>
<a href="#three">third</a>

incorrectly gives

[[Attach:Two#two|second]]
[[Attach:#three|third]]

should give

[[Two#two|second]]
[[#three|third]]

-- simon

I'm not convinced this isn't partly a feature. For <a href="#three">third</a>, yes, the result is wrong; for <a href="Two#two">second</a> I'm not so sure. Note also that <a href="Two">second</a> currently gives [[Attach:Two|second]]. For a quick fix, change the last parenthesized part of the regular expression on line 58 of convert-html.php to ([^/:\'"\#]+?), i.e. add '\#' to the character class. --Eemeli Aro February 19, 2009, at 05:15 PM
thanks, I still believe they should not be generating the Attach:, both links are clearly source anchors linking to a destination anchor within a page. -- simon
PS Retested the exact example from above and got the following output in my installation
[[Attach:Two#two|second]]
[[Attach:#three|third]]
This is now fixed, generating Attach: links now requires the link href to have a dot in it. —Eemeli Aro April 20, 2009, at 08:57 AM
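The rule described here (Attach: only when the href contains a dot) can be illustrated with a small standalone function. link_target() is a hypothetical helper written for this example; the recipe's actual regular expressions may differ, and treating anything with a URL scheme as an external link is an added assumption.

```php
<?php
// Hypothetical helper illustrating the stated rule; not the recipe's code.
function link_target(string $href): string {
    // Assumption: hrefs with a scheme (http:, mailto:, ...) are left as they are.
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $href)) {
        return $href;
    }
    // A dot suggests a file name, so treat it as an attachment.
    if (strpos($href, '.') !== false) {
        return 'Attach:' . $href;
    }
    // Otherwise it is a page name or an in-page anchor.
    return $href;
}

$targets = array_map('link_target', ['Two#two', '#three', 'manual.pdf']);
echo implode("\n", $targets), "\n";
```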

Note

  • does not convert <form to (:input form ...
  • does not convert <input to (:input ...
-- simon
Does now. —Eemeli Aro April 20, 2009, at 08:57 AM
Brilliant, thanks very much simon June 18, 2009, at 05:29 PM
