UnaccentUTF8

Summary: Diacritics-insensitive page index and searches
Version: 20230203
Prerequisites: Enabled PmWiki.UTF-8; PHP 5.4 or more recent with the Intl extension enabled
Status: Beta
Maintainer: Petko
License: Public domain
Users: +1
Discussion: UnaccentUTF8-Talk?

Questions answered by this recipe

How to enable diacritics-insensitive search and pagelists?

Description

With this recipe, a search returns pages that contain either the accented or the unaccented spelling of the search term.

The folding function strips diacritics from letters and lowercases them, so matching is also case-insensitive. For example, searching for either "Māori" or "Maori" should find pages containing either variant.

This applies only to text content and search terms; it does not restrict or modify page names.

This currently works for Latin (Roman), Cyrillic, Greek, Arabic and Hebrew characters with diacritics.
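
As a rough illustration of the folding described above, the following standalone sketch applies the same ICU rule string used in the Installation section, outside of PmWiki. The name `fold_demo` is hypothetical and exists only for this example; the sketch assumes the Intl extension is loaded and the file is saved as UTF-8.

```php
<?php
// Standalone sketch; requires the PHP Intl extension. Save this file as UTF-8.
// fold_demo() is a hypothetical helper, not part of PmWiki or of this recipe.
function fold_demo($str) {
  static $tr = null;
  if ($tr === null) {
    // Same rule string as in the recipe's config.php snippet.
    $tr = Transliterator::createFromRules(
      ':: Latin-ASCII ; :: NFD; :: [:Nonspacing Mark:] Remove; :: Lower(); :: NFC;',
      Transliterator::FORWARD);
  }
  return $tr->transliterate($str);
}

// Accented and plain spellings fold to the same string.
var_dump(fold_demo("Māori") === fold_demo("maori")); // bool(true)
var_dump(fold_demo("Café") === fold_demo("CAFE"));   // bool(true)
```

Because both the indexed text and the search term pass through the same folding, either spelling can find the other.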

Installation

Note: PmWiki.UTF-8 must be enabled, and your config.php file must be saved in the UTF-8 encoding.

  1. Delete wiki.d/.pageindex.
  2. Add to config.php:
    $StrFoldFunction = $PageIndexFoldFunction = 'UnaccentUTF8'; # See Cookbook:UnaccentUTF8
    $PmTransliterator = Transliterator::createFromRules(
      ':: Latin-ASCII ; :: NFD; :: [:Nonspacing Mark:] Remove; :: Lower(); :: NFC;',
      Transliterator::FORWARD);
    function UnaccentUTF8($str) {
      global $PmTransliterator;
      # for German language umlauts ü->ue, uncomment next 2 lines
      # $str = preg_replace("/ä|ö|ü|Ä|Ö|Ü/", '$0e', $str);
      # $str = str_replace("\xcc\x88", 'e', $str);
      return $PmTransliterator->transliterate($str);
    }

This code must be added before scripts/pagelist.php is loaded. Some recipes, such as SearchCloud, load that script themselves; include them after this function is defined so that they use the new folding rules.
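
The ordering requirement above might look like this in config.php. This is only a sketch of the relevant fragment, not a complete file; the recipe file name in the `include_once` line is a hypothetical placeholder, so substitute the actual file of whichever recipe you use.

```php
# 1. The folding setup from step 2 comes first.
$StrFoldFunction = $PageIndexFoldFunction = 'UnaccentUTF8';
# ... $PmTransliterator and function UnaccentUTF8() exactly as listed in step 2 ...

# 2. Only afterwards include recipes that load scripts/pagelist.php themselves.
#    "searchcloud.php" is an assumed file name used for illustration.
include_once("$FarmD/cookbook/searchcloud.php");
```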

Configuration, Internationalization

N/A

Usage

Just search as usual.
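
Searching needs no special syntax, but the optional German lines in the installation code change which spellings match. The sketch below runs the same folding with those two lines uncommented, outside of PmWiki; `fold_german_demo` is a hypothetical name used only here, and the Intl extension and a UTF-8 source file are assumed.

```php
<?php
// Standalone sketch; requires the PHP Intl extension. Save this file as UTF-8.
// fold_german_demo() is a hypothetical name used only for this illustration.
function fold_german_demo($str) {
  static $tr = null;
  if ($tr === null) {
    $tr = Transliterator::createFromRules(
      ':: Latin-ASCII ; :: NFD; :: [:Nonspacing Mark:] Remove; :: Lower(); :: NFC;',
      Transliterator::FORWARD);
  }
  // The two optional lines from the recipe, uncommented:
  $str = preg_replace("/ä|ö|ü|Ä|Ö|Ü/", '$0e', $str); // precomposed umlauts: append "e"
  $str = str_replace("\xcc\x88", 'e', $str);          // combining diaeresis U+0308 becomes "e"
  return $tr->transliterate($str);
}

var_dump(fold_german_demo("Müller") === fold_german_demo("Mueller")); // bool(true)
var_dump(fold_german_demo("Müller") === fold_german_demo("Muller"));  // bool(false)
```

With the German option enabled, "Mueller" finds "Müller", but "Muller" no longer does; without it, the reverse holds.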

Notes

  • This requires the PHP extension Intl to be enabled on the server.
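
One way to check for the extension is a short script like the following, run with the same PHP build that serves the wiki. It uses only standard PHP functions and is not part of the recipe; the variable name and messages are illustrative.

```php
<?php
// Check that the Intl extension and its Transliterator class are available.
$hasIntl = extension_loaded('intl') && class_exists('Transliterator');
echo $hasIntl
  ? "Intl is available\n"
  : "Intl is missing: the UnaccentUTF8 function cannot work\n";
```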

Change log / Release notes

  • 20230203 First public release after 2 months of use on 2 high-volume websites.

See also

  • Cookbook / ISO8859MakePageNamePatterns: How to convert ISO 8859 character input for page names to unaccented ASCII equivalents
  • PmWiki / UTF-8: Enabling UTF-8 Unicode language encoding in your wiki

Contributors

Written and maintained by Petko.

Comments

See discussion at UnaccentUTF8-Talk?
