ISO8859MakePageNamePatterns
Questions answered by this recipe
How can I strip accents from characters to get more readable page names?
How can I convert existing page names to names without accents and similar marks?
How can I convert ISO 8859 character input to unaccented equivalents?
Usage
Add the code shown below for your character set to config.php for automatic creation of page names with accents stripped from their characters. This adds a conversion mapping array to PmWiki's $MakePageNamePatterns.
Links like [[Español]], [[Français]], [[Überänderung]] will point to pages Espanol, Francais, Ueberaenderung instead of the valid url-encoded page names Espa%f1ol, Fran%e7ais, %dcber%e4nderung (using the ISO 8859-1 character set).
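As a rough illustration of the mapping described above, the following stand-alone sketch applies a few conversion patterns with plain preg_replace. The function name demo_strip_accents and the shortened pattern list are assumptions made for this example; it is not PmWiki's own page name code, and it assumes the file is saved as UTF-8.

```php
<?php
// Hypothetical demo, not part of PmWiki: a shortened subset of the recipe's mapping.
$demoPatterns = array(
  '/ñ/' => 'n', '/ç/' => 'c',
  '/Ü/' => 'Ue', '/ä/' => 'ae',
);

// Apply every pattern => replacement pair in order.
function demo_strip_accents($name, array $patterns) {
  return preg_replace(array_keys($patterns), array_values($patterns), $name);
}

foreach (array('Español', 'Français', 'Überänderung') as $link) {
  echo $link, ' => ', demo_strip_accents($link, $demoPatterns), "\n";
}

// For comparison: the url-encoded form of the ISO 8859-1 name "Español"
// (\xF1 is the Latin-1 byte for ñ). PHP emits upper-case hex digits.
echo rawurlencode("Espa\xF1ol"), "\n";
```

The unaccented names need no url-encoding at all, which is what makes them easier to read in links and addresses.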
To convert existing page names you can use the script isorename.php; see below.
For ISO 8859-1 (Latin-1 Western European)
NOTE: This recipe depends on the encoding of config.php. You need to make sure that config.php is saved with ISO 8859-1 encoding. By default PmWiki ships with the encoding set to Latin-1. For more information see Page Encoding below.
# standard patterns from pmwiki.php
SDV($PageNameChars, '-[:alnum:]');
SDV($MakePageNamePatterns, array(
  "/'/" => '',                            # strip single-quotes
  "/[^$PageNameChars]+/" => ' ',          # convert everything else to space
  '/((^|[^-\\w])\\w)/' => 'cb_toupper',   # capitalize first letter of word using core function
  '/ /' => ''));                          # remove remaining spaces

# additional character conversion patterns for ISO 8859-1 character set
SDV($ISO88591MakePageNamePatterns, array(
  '/À/' => 'A', '/Á/' => 'A', '/Â/' => 'A', '/Ã/' => 'A', '/Ä/' => 'Ae', '/Å/' => 'Ao', '/Æ/' => 'Ae',
  '/Ç/' => 'C', '/Œ/' => 'Oe',
  '/È/' => 'E', '/É/' => 'E', '/Ê/' => 'E', '/Ë/' => 'E',
  '/Ì/' => 'I', '/Í/' => 'I', '/Î/' => 'I', '/Ï/' => 'I',
  '/Ð/' => 'D', '/Ñ/' => 'N',
  '/Ò/' => 'O', '/Ó/' => 'O', '/Ô/' => 'O', '/Õ/' => 'O', '/Ö/' => 'Oe', '/Ø/' => 'Oe',
  '/Ù/' => 'U', '/Ú/' => 'U', '/Û/' => 'U', '/Ü/' => 'Ue',
  '/Ý/' => 'Y', '/Þ/' => 'Th', '/ß/' => 'ss', '/œ/' => 'oe',
  '/à/' => 'a', '/á/' => 'a', '/â/' => 'a', '/ã/' => 'a', '/ä/' => 'ae', '/å/' => 'ao', '/æ/' => 'ae',
  '/ç/' => 'c',
  '/è/' => 'e', '/é/' => 'e', '/ê/' => 'e', '/ë/' => 'e',
  '/ì/' => 'i', '/í/' => 'i', '/î/' => 'i', '/ï/' => 'i',
  '/ð/' => 'd', '/ñ/' => 'n',
  '/ò/' => 'o', '/ó/' => 'o', '/ô/' => 'o', '/õ/' => 'o', '/ö/' => 'oe', '/ø/' => 'oe',
  '/ù/' => 'u', '/ú/' => 'u', '/û/' => 'u', '/ü/' => 'ue',
  '/ý/' => 'y', '/þ/' => 'th', '/ÿ/' => 'y'
));

# join to standard patterns
$MakePageNamePatterns = array_merge($ISO88591MakePageNamePatterns, $MakePageNamePatterns);
For other ISO 8859 standards
Please add a suitable character conversion array for your character set.
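As a starting point only, here is a minimal and deliberately incomplete sketch of what such an array might look like for ISO 8859-2 (Latin-2). The variable name $ISO88592MakePageNamePatterns and the choice of characters are assumptions for illustration and are not part of the recipe. The characters are written as byte escapes, so this particular sketch does not depend on the encoding in which the file is saved.

```php
<?php
// Hypothetical, incomplete sketch for ISO 8859-2 (Latin-2); not part of the recipe.
// Each pattern matches one ISO 8859-2 byte; the comment names the character.
$ISO88592MakePageNamePatterns = array(
  "/\xA3/" => 'L',  // L with stroke (upper case)
  "/\xB3/" => 'l',  // l with stroke (lower case)
  "/\xA9/" => 'S',  // S with caron
  "/\xB9/" => 's',  // s with caron
  "/\xAE/" => 'Z',  // Z with caron
  "/\xBE/" => 'z',  // z with caron
  "/\xC8/" => 'C',  // C with caron
  "/\xE8/" => 'c',  // c with caron
  "/\xF3/" => 'o',  // o with acute
  "/\xBC/" => 'z',  // z with acute
);

// In config.php the array would be merged in front of the standard patterns,
// in the same way as the ISO 8859-1 array:
//   $MakePageNamePatterns = array_merge($ISO88592MakePageNamePatterns, $MakePageNamePatterns);

// Stand-alone check: "\xA3\xF3d\xBC" is the city name Lodz with its
// Polish diacritics, encoded as ISO 8859-2 bytes.
$demoResult = preg_replace(array_keys($ISO88592MakePageNamePatterns),
                           array_values($ISO88592MakePageNamePatterns),
                           "\xA3\xF3d\xBC");
echo $demoResult, "\n";
```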
Alternative conversion pipeline
An alternative approach is to rely on the HTML entity definitions to handle the conversion. For example, the character "à" is described by the entity "&agrave;", from which the "&" and "grave;" parts can be stripped so that it is rendered as "a".
function unaccent_entity($s) {
  # callbacks in $MakePageNamePatterns receive the regex match array
  # (as cb_toupper does), so use the full match; input is assumed to be UTF-8
  if (is_array($s)) $s = $s[0];
  if (strpos($s = htmlentities($s, ENT_QUOTES, 'UTF-8'), '&') !== false)
    $s = html_entity_decode(
      preg_replace('/&([a-z]{1,2})(?:acute|cedil|circ|grave|lig|orn|ring|slash|tilde|uml);/i', '$1', $s),
      ENT_QUOTES, 'UTF-8');
  return $s;
}
$MakePageNamePatterns = array_merge(['/^(.*)$/'=>'unaccent_entity'], $MakePageNamePatterns);
Note that this alternative approach does not replace 'ä' and 'å' with 'ae' and 'ao', but both with 'a', so it does not produce the accepted alternatives for ISO 8859-1 characters like Å Ä Ö Ü Ø Æ Œ Ð Þ å ä ö ü ø ð ß þ. - HansB
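The entity approach can be traced step by step with a stand-alone sketch. The function name demo_unaccent_entity is an assumption for this example; it takes a plain string rather than a match array, and it assumes UTF-8 input.

```php
<?php
// Hypothetical stand-alone trace of the entity approach; assumes UTF-8 input.
function demo_unaccent_entity($s) {
  // Step 1: accented characters become named entities, e.g. "&agrave;"
  $entities = htmlentities($s, ENT_QUOTES, 'UTF-8');
  // Step 2: keep only the base letter(s) of each accent entity
  $stripped = preg_replace(
    '/&([a-z]{1,2})(?:acute|cedil|circ|grave|lig|orn|ring|slash|tilde|uml);/i',
    '$1', $entities);
  // Step 3: turn any remaining entities (such as "&amp;") back into characters
  return html_entity_decode($stripped, ENT_QUOTES, 'UTF-8');
}

echo htmlentities('à', ENT_QUOTES, 'UTF-8'), "\n";
echo demo_unaccent_entity('Français'), "\n";
echo demo_unaccent_entity('Überänderung'), "\n";
```

The last call yields "Uberanderung" rather than "Ueberaenderung", which is the limitation described in the note above.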
Converting existing pagenames to unaccented equivalents
You can use the script isorename.php. Install it as usual. After you have installed the character conversion patterns above, run it with admin permission by adding the action
?action=isorename
to a page URL.
This will look through all the files in all groups and automatically rename any pages whose names contain accented or similar characters, i.e. the new MakePageName patterns will be applied.
- Do a test run without renaming anything with parameter test=1 (?action=isorename&test=1).
- Make a backup copy of original files with parameter backup=1.
- Use pagename wildcard patterns with parameter pattern=..., for instance to rename files in group Main: ?action=isorename&pattern=Main.*
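Conceptually, a test run amounts to computing the new name for every page and reporting only those that would change. The sketch below illustrates that idea. It is not the code of isorename.php, which is not shown in this recipe; the function name demo_plan_renames, the page list and the shortened pattern list are all assumptions.

```php
<?php
// Hypothetical sketch of a dry run; not taken from isorename.php.
$demoPatterns = array('/ñ/' => 'n', '/é/' => 'e', '/ü/' => 'ue');

// Return an array of old name => new name for pages whose name would change.
function demo_plan_renames(array $pagenames, array $patterns) {
  $plan = array();
  foreach ($pagenames as $old) {
    $new = preg_replace(array_keys($patterns), array_values($patterns), $old);
    if ($new !== $old) $plan[$old] = $new;   // unchanged names are skipped
  }
  return $plan;
}

$plan = demo_plan_renames(
  array('Main.HomePage', 'Main.Español', 'Main.Café'), $demoPatterns);
foreach ($plan as $old => $new) echo "$old -> $new\n";
```

Nothing is renamed here; the sketch only lists the planned changes, which is the kind of review a test run is meant to allow before the real run.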
Preserving Original Characters in the Title
To preserve the original accented page name as a page title you may want to add it to the page with the (:title :) markup. This could be automated somewhat for new page creation by setting up a template page that includes an empty (:title :) markup, and setting the variable $EditTemplatesFmt in config.php to point to this template page (see EditTemplates for details and for other possibilities to provide a template for the edit).
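A minimal config.php fragment for the template approach might look like this. The page name Site.TitleTemplate is a made-up example; use whatever page holds your template.

```php
# Hypothetical template page name; create this page yourself and
# put an empty (:title :) markup on it.
$EditTemplatesFmt = 'Site.TitleTemplate';
```

The author of a new page still has to type the accented title into the (:title :) markup by hand, which is why this only automates the process "somewhat".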
Or use a NewPage Box or NewPage Box Plus with a template page.
Or use the Fox form processing script with a new page form which automatically fills the (:title :) markup with the original accented page name that was entered:
(:if auth edit:)
(:fox newpageform template={$FullName}#newpage:)
(:input text newedit size=20:)(:input submit post "Create Page":)
(:foxend newpageform:)
[[#newpage]]
(:title {$$newedit}:)
[[#newpageend]]
(:ifend:)
Add such a form to your SideBar to create new pages from anywhere with ease.
Hyphenated Dash-Page-Names, Avoiding CamelCase
If you want to avoid CamelCase and convert spaces to hyphens (which is more SEO friendly), you can try this:
$group = PageVar($pagename,'$Group');
# callback helper function
function cb_strtoupper($m) { return strtoupper($m[1]); }
$PageNameChars ='-[:alnum:]';
if ($group=='PmWiki' || $group=='Site' || $group=='SiteAdmin') {
  $MakePageNamePatterns = array(
    "/'/" => '', //strip single-quotes
    "/[^$PageNameChars]+/" => ' ', // convert everything else to space
    '/((^|[^-\\w])\\w)/' => "cb_strtoupper", //make first letters upper case
    '/ /' => '', //remove any other spaces
  );
} else {
  $MakePageNamePatterns = array(
    "/'/" => '',
    "/[^$PageNameChars]+/" => '-',
    '/((^|[^-\\w])\\w)/' => "cb_strtoupper",
  );
  $AsSpacedFunction = 'HyphenToSpace';
  function HyphenToSpace($x) { return ucfirst(str_replace('-',' ',$x)); }
}
This excludes the groups PmWiki, Site and SiteAdmin from hyphenated names. It uses the custom callback helper function "cb_strtoupper" instead of the core function. It also converts hyphens in link text to spaces, so your hyphenated page names will be displayed with spaces. But care needs to be taken when writing links from other pages to pages in group PmWiki etc.: you should use the CamelCase words for those in your simple links. -- HansB June 20, 2017, at 06:07 AM
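To see what the hyphenated patterns do to a piece of link text, the following stand-alone sketch applies them outside PmWiki. All demo_ names are assumptions for this example, and the loop in demo_make_page_name only mimics, as an assumption, how PmWiki processes $MakePageNamePatterns (string replacements, with function names used as callbacks).

```php
<?php
// Hypothetical stand-alone demo of the hyphenated patterns; not PmWiki code.
function demo_cb_strtoupper($m) { return strtoupper($m[1]); }

$demoPageNameChars = '-[:alnum:]';
$demoPatterns = array(
  "/'/" => '',                                   // strip single-quotes
  "/[^$demoPageNameChars]+/" => '-',             // convert everything else to a hyphen
  '/((^|[^-\\w])\\w)/' => 'demo_cb_strtoupper',  // upper-case the first letter
);

// Apply each pattern in order; replacements that name a function act as callbacks.
function demo_make_page_name($text, array $patterns) {
  foreach ($patterns as $pat => $rep) {
    $text = is_callable($rep)
      ? preg_replace_callback($pat, $rep, $text)
      : preg_replace($pat, $rep, $text);
  }
  return $text;
}

// Same logic as HyphenToSpace in the recipe, renamed for the demo.
function demo_hyphen_to_space($x) { return ucfirst(str_replace('-', ' ', $x)); }

$name = demo_make_page_name("it's a new page", $demoPatterns);
echo $name, "\n";
echo demo_hyphen_to_space($name), "\n";
```

Because every separator has already become a hyphen by the time the callback runs, only the very first letter is capitalised, giving "Its-a-new-page" as the page name and "Its a new page" as the displayed text.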
Dash-Pagenames addresses this issue and provides functional cross-linking between groups using/needing different $MakePageNamePatterns. With the option $ForcePageNamesToASCII = 1; set, Dash-Pagenames provides hyphenated/dashed page names with any European accented characters replaced by ASCII substitutes. And you can have page names in lower case with the option $ForcePageNamesToLowerCase = 1;. - HansB 2023-02-19
Dash-Pagenames - URLs and page names with dashes for word spacing, UTF-8 friendly
See also Router - Router allows a website's URL structure to be different from PmWiki's group/page structure.
Page Encoding
In order to catch and convert characters of a given encoding type, config.php must be saved using that encoding type, or PmWiki will be unable to find the characters to convert. See Character encoding of config.php.
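The reason is that the same character is stored as different bytes in different encodings. The stand-alone sketch below writes the bytes out explicitly to show a pattern for "ñ" matching or missing Latin-1 input; the variable names are assumptions for this example.

```php
<?php
// Stand-alone demo: the character "ñ" as bytes in two encodings.
$latin1Input   = "Espa\xF1ol";   // "Español" as ISO 8859-1 bytes
$latin1Pattern = "/\xF1/";       // "ñ" as written in a config.php saved as ISO 8859-1
$utf8Pattern   = "/\xC3\xB1/";   // "ñ" as written in a config.php saved as UTF-8

$foundLatin1 = preg_match($latin1Pattern, $latin1Input);  // 1: pattern found
$foundUtf8   = preg_match($utf8Pattern, $latin1Input);    // 0: pattern not found

echo "Latin-1 pattern matches: $foundLatin1\n";
echo "UTF-8 pattern matches: $foundUtf8\n";
```

With a mismatched encoding the conversion patterns simply never match, so page names are left unchanged without any error message.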
See Also
- AlternateNamingScheme Use other naming schemes for PmWiki pages
- Dash-Pagenames URLs and page names with dashes for word spacing, UTF-8 friendly (new)
- Router Router allows a website's url structure to be different from PmWiki's group/page structure. (beta)
- UnaccentUTF8 Diacritics-insensitive page index and searches (Beta)
Contributors
Comments
See discussion at ISO8859MakePageNamePatterns-Talk