00682: Search results when using UTF-8 and non-Latin characters

Summary: Search results when using UTF-8 and non-Latin characters
Created: 2006-03-03 01:29
Status: Closed: added in PmWiki 2.2.x beta versions.
Category: Bug
From: Athan
Assigned:
Priority: 54
Version: 2.1 b33
OS: Any

Description: When using UTF-8, searching for a non-Latin string returns only case-sensitive matches. This is a known PHP limitation, but a utf8tolower conversion using xlpage-utf-8.php would be a solution.


Just for completeness, here are my comments from the mailing list...

Unfortunately, at the moment there's not really a good way for us to do this -- it's a limitation of PHP.

The basic functions available in PHP to perform case-insensitive searches in substrings aren't really aware of uppercase and lowercase distinctions for utf-8 encoded strings.
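
To make the limitation concrete, here is a minimal sketch. It assumes a UTF-8 encoded source file and either the default "C" locale or PHP 8.2+, where the byte-oriented string functions fold only ASCII letters. The multibyte call assumes the mbstring extension is loaded, and the sample strings are purely illustrative.

```php
<?php
// Sample strings are illustrative; this file must be saved as UTF-8.
$haystack = 'ΑΘΗΝΑ is the capital';   // Greek word in uppercase
$needle   = 'αθηνα';                  // the same word in lowercase

// Byte-oriented search: only ASCII letters are case-folded here
// (assumption: "C" locale or PHP 8.2+), so the Greek needle is not found.
$bytePos = stripos($haystack, $needle);

// Multibyte-aware search, usable only when mbstring is installed.
$mbPos = function_exists('mb_stripos')
    ? mb_stripos($haystack, $needle, 0, 'UTF-8')
    : null;

var_dump($bytePos); // bool(false)
var_dump($mbPos);   // int(0) when mbstring is available, NULL otherwise
```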

One approach would be to convert all terms to lowercase when doing the string search, but even here PHP's support is limited. To convert utf-8 to lowercase we'd have to use something like PHP's mb_strtolower function, but a lot of PHP installations don't have the mb_* functions available by default. Also, we have to be careful not to perform utf-8 lowercase conversions on sites that are using iso-8859-1 or other character encodings.

On the other hand, the xlpage-utf-8.php script is already defining a table of case conversions, so maybe I can get the search script to use that.
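
The ideas above (lowercase both sides, prefer mb_strtolower when present, fall back to a conversion table, and leave non-utf-8 sites alone) could be combined roughly as follows. This is only a sketch in modern PHP, not the code that went into PmWiki: the function name, its parameters, and the four-entry table are hypothetical, and the table merely stands in for whatever case table xlpage-utf-8.php provides. The fallback branch assumes strtolower changes only ASCII letters ("C" locale or PHP 8.2+).

```php
<?php
// Hypothetical helper, not part of PmWiki. $upperToLower stands in for a
// case table such as the one xlpage-utf-8.php defines.
function utf8_lower_sketch($text, $charset, array $upperToLower)
{
    // Sites using iso-8859-1 or another encoding keep the byte-wise function.
    if (strcasecmp($charset, 'utf-8') !== 0) {
        return strtolower($text);
    }
    // Prefer mbstring when the installation provides it.
    if (function_exists('mb_strtolower')) {
        return mb_strtolower($text, 'UTF-8');
    }
    // Fallback: lowercase the ASCII letters, then replace multibyte
    // uppercase sequences from the table. strtr() with an array tries the
    // longest keys first and never rescans text it has already replaced.
    return strtr(strtolower($text), $upperToLower);
}

// Sample entries only; a real table would cover every cased character.
$table = ['Α' => 'α', 'Θ' => 'θ', 'Η' => 'η', 'Ν' => 'ν'];

// Lowercase both the page text and the search term, then compare bytes.
$page = utf8_lower_sketch('Η ΑΘΗΝΑ', 'utf-8', $table);
$term = utf8_lower_sketch('αθηνα', 'utf-8', $table);
var_dump(strpos($page, $term) !== false); // bool(true)
```

Passing the charset explicitly is one way to keep the utf-8 conversion away from sites that use a different encoding.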


I've put this on my ToDo list, so maybe I can come up with a fix reasonably soon.


Here is a little patch for utf-8 case-insensitive search:

utf-8-search.zip

It works successfully at www.pmwiki.ru. Tested on pmwiki-2.1.5.
satrap June 07, 2006, at 05:17 PM


The patch works, but the current $CaseConversions table is l=>u (lower to upper), so some uppercase characters map to the wrong lowercase character. A complete u=>l table can be added easily, though. I'm sure Patrick will implement a complete utf-8 solution in the pmwiki core soon. Even better, pmwiki should switch to full utf-8 by default.
Athan
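
One way a reversed l=>u table can produce wrong lowercase letters is when two lowercase characters share a single uppercase form. The sketch below uses a hypothetical three-entry table (not copied from PmWiki) to show the effect. It may or may not be the exact case observed here, but it illustrates why an explicit u=>l table is safer than flipping the existing one.

```php
<?php
// Hypothetical excerpt of a lower=>upper table; not copied from PmWiki.
// Greek has two lowercase sigmas that share one uppercase form.
$lowerToUpper = [
    'σ' => 'Σ',   // ordinary sigma
    'ς' => 'Σ',   // final sigma
    'α' => 'Α',
];

// array_flip() keeps the last key seen for a duplicated value, so the
// reversed table sends 'Σ' to the final sigma 'ς'.
$flipped = array_flip($lowerToUpper);

// An explicit upper=>lower table does not depend on entry order.
$upperToLower = ['Σ' => 'σ', 'Α' => 'α'];

var_dump($flipped['Σ']);      // string(2) "ς"
var_dump($upperToLower['Σ']); // string(2) "σ"
```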


New version of the patch: pmwiki-utf8-search.zip. Tested on PmWiki 2.2.0-beta16. By the way, when the mbstring extension is installed and a modern version of PHP is used, there is no need for the $CaseConversions array at all.

satrap December 21, 2006, at 04:49 PM

It works, thanks satrap. However, patching every new pmwiki build isn't the best practice imho.
I wonder why Patrick doesn't include such a fix in pmwiki's core. UTF-8 is the preferred encoding today, even for English-only websites. Also, the mbstring extension is available on most, if not all, PHP hosts.
Athan