00682: Search results when using UTF-8 and non-Latin characters
Description: When using UTF-8, searching for a non-Latin string returns only case-sensitive matches. It is a known PHP limitation, but a utf8tolower conversion using xlpage-utf-8.php would be a solution.
Just for completeness, here are my comments from the mailing list...
Unfortunately, at the moment there's not really a good way for us to do this -- it's a limitation of PHP.
The basic functions available in PHP to perform case-insensitive searches in substrings aren't really aware of uppercase and lowercase distinctions for utf-8 encoded strings.
One approach would be to convert all terms to lowercase when doing the string search, but even here PHP's support is limited. To convert utf-8 to lowercase we'd have to use something like PHP's mb_strtolower function, but a lot of PHP installations don't have the mb_* functions available by default. Also, we have to be careful that we don't perform utf-8 lowercase conversions on sites that are using iso-8859-1 or other character encodings.
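A rough sketch of that lowercase-then-search idea, to make the two caveats concrete. The function name contains_ci and its $charset parameter are made up for this illustration and are not PmWiki code; it only assumes the standard mb_strtolower, strtolower and strpos functions.

```php
<?php
// Hypothetical helper for illustration; not part of PmWiki.
// $charset is an assumed parameter naming the site's encoding.
function contains_ci($haystack, $needle, $charset = 'UTF-8')
{
    if ($needle === '') {
        return true; // an empty search term matches everything
    }
    $isUtf8 = (strcasecmp($charset, 'UTF-8') === 0);
    if ($isUtf8 && function_exists('mb_strtolower')) {
        // Multibyte-aware lowercasing, used only when mbstring is
        // loaded and the site really is UTF-8.
        $haystack = mb_strtolower($haystack, 'UTF-8');
        $needle   = mb_strtolower($needle, 'UTF-8');
    } else {
        // Byte-oriented fallback: dependable for ASCII letters only.
        $haystack = strtolower($haystack);
        $needle   = strtolower($needle);
    }
    return strpos($haystack, $needle) !== false;
}

var_dump(contains_ci('Hello World', 'WORLD'));   // bool(true)
// Non-Latin text matches case-insensitively when mbstring is loaded.
var_dump(contains_ci('Привет Мир', 'МИР'));
```

Without mbstring the non-Latin search stays case sensitive, which is exactly the behaviour reported in this issue.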
On the other hand, the xlpage-utf-8.php script is already defining a table of case conversions, so maybe I can get the search script to use that.
I've put this on my ToDo list, so maybe I can come up with a fix reasonably soon.
Here is a little patch for utf-8 case-insensitive search:
It works successfully at www.pmwiki.ru. Tested on pmwiki-2.1.5.
satrap June 07, 2006, at 05:17 PM
The patch works, but the current $CaseConversions table is l=>u (lower to upper), so when it is used in reverse some uppercase characters map to the wrong lowercase character. A complete u=>l table can be added easily though.
I'm sure Patrick will implement a complete utf8 solution in pmwiki core soon. Even better, pmwiki should turn to full utf-8 by default.
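To illustrate the reversal problem, here is a small sketch with a made-up three-entry table. The entries and variable names are assumptions for the example and are not copied from xlpage-utf-8.php; the file is assumed to be saved as UTF-8.

```php
<?php
// Illustrative lower => upper table (hypothetical entries).
$lowerToUpper = array(
    'σ' => 'Σ',
    'ς' => 'Σ',  // final sigma shares the same capital letter
    'д' => 'Д',
);

// array_flip() keeps only the last key for a repeated value,
// so Σ ends up mapped to the final sigma.
$flipped = array_flip($lowerToUpper);

// An explicit upper => lower table states the intended mapping.
$upperToLower = array(
    'Σ' => 'σ',
    'Д' => 'д',
);

// strtr() with an array replaces whole byte sequences, which works
// for well-formed UTF-8 text.
echo strtr('ДΣ', $flipped), "\n";       // дς
echo strtr('ДΣ', $upperToLower), "\n";  // дσ
```

Any lowercase letters that share one capital will collide like this when an l=>u table is simply flipped, which is why a separate u=>l table is the safer route.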
New version of the patch: pmwiki-utf8-search.zip. Tested on PmWiki 2.2.0-beta16. By the way, when the mbstring extension is installed and a modern version of PHP is used, there is no need for the $CaseConversions array at all.
It works, thanks satrap. However, patching every new pmwiki build isn't the best practice, imho.
I wonder why Patrick doesn't include such a fix in pmwiki's core. UTF-8 is the preferable encoding today, even for English-only websites. Also, the mbstring extension is available on most, if not all, PHP hosts.