00086: Neutralise accented letters

Summary: Neutralise accented letters
Created: 2004-10-10 05:25
Status: Closed - added in 2.2.0-beta43
Category: Feature
From: PRZ
Assigned:
Priority: 55554 3
Version: All
OS:

Description: For the search tool on international pages, it might be very useful to neutralise accented letters, i.e. to search with accented letters transformed to their unaccented equivalents.

This is important because:

  • Pages are frequently written with improper orthography (without accents), due to the fairly hostile behaviour of computers, or to writers' laziness.
  • Typos are very frequent.

For example, when you search for the French word 'élégance', the search term would be transformed into 'elegance', and in the pages it would match the words written 'élegance' or 'elègance', which are typos, or simply 'elegance', which is not correct in French.

The same behaviour may also be desirable for WikiWords and Links.

See http://fr2.php.net/strtr

From the above address, this function in PHP:

 function removeaccents($string) {
   return strtr($string,
     'ŠŽšžŸÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝàáâãäåçèéêëìíîïñòóôõöøùúûüýÿ',
     'SZszYAAAAAACEEEEIIIINOOOOOOUUUUYaaaaaaceeeeiiiinoooooouuuuyy');
 }
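
For example, normalising both the search term and the page text this way would make 'élégance' match 'elegance' (note that this two-string form of strtr() works byte by byte, so it only behaves as intended with a single-byte encoding such as ISO-8859-1, not with UTF-8 pages):

 // Illustration only: $pagetext stands for whatever text is being searched.
 $term = removeaccents('élégance');                        // 'elegance'
 $found = (stristr(removeaccents($pagetext), $term) !== false);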

Performance loss may not be a problem for WikiWords and Links, but it might be a problem when searching through whole page texts.
A solution might be to create a special 'searchpage' when saving an edit, as sketched below. The volume should not be that large, because the searchpage would not carry all of the page history. That might also improve general search performance. A start of indexation?
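
A purely illustrative sketch of that idea (the function name and the '.search' file layout are made up here, and this is not wired into PmWiki's real edit hooks): when a page is saved, a neutralised copy of its current text is written alongside it, and the search tool scans those copies instead of the full pages.

 // Hypothetical: store a neutralised copy of the current text at save time.
 function SaveSearchCopy($pagename, $text) {
   $neutral = strtolower(removeaccents($text));   // current text only, no history
   file_put_contents("wiki.d/$pagename.search", $neutral);
 }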

If validated for WikiWords and Links, this may also help to find a solution to PITS.00053, by storing pages under names without accents and setting the Title by default to the accented form.

List of characters in the Windows Latin font

  • Ã,Á,Â,À,Ä,Å => A
  • Ê,Ë,È,É => E
  • Í,Î,Ï,Ì => I
  • Ñ => N
  • Õ,Ó,Ö,Ô,Ò => O
  • Ú,Û,Ù,Ü => U
  • Ý => Y
  • á,ã,â,ä,à,å => a
  • ç => c
  • ê,ë,è,é => e
  • í,ï,î,ì => i
  • ñ => n
  • ð,õ,ó,ô,ö,ò => o
  • ú,û,ù,ü => u
  • ý,ÿ => y

Bonjour!

Good idea for the search feature, though for German it's not as important as for French.

For PITS.00053, there is a possible problem if two deliberately different page names (for example Wählen and Wahlen) result in the same "neutralised" name. I think this would confuse authors.

(This would also give irrelevant hits in the search feature, but that might be tolerable.)

--Henning October 11, 2004, at 03:33 AM


One afterthought: In German, it would be helpful to have the following conversions for search purposes (search is case insensitive, so I only list lower case here):

  • ä => ae
  • ö => oe
  • ü => ue
  • ß => ss
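
A minimal sketch of those conversions (the function name is just for illustration, and the pages are assumed to be stored in one known encoding); the array form of strtr() handles the one-character-to-two-character replacements:

 // German transliterations for search purposes.
 function removeumlauts($string) {
   return strtr($string, array('ä' => 'ae', 'ö' => 'oe', 'ü' => 'ue', 'ß' => 'ss'));
 }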

--Henning October 11, 2004, at 08:34 AM


The search page could have a checkbox option for "neutralize accented letters", which would be off by default. Then if you really must do that kind of search, you would check it. This would allow for a naive (simpler) implementation that takes longer to execute, but that would not be active by default. Just a thought. --Fabio
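
In rough terms, such an option could simply guard the normalisation step (hypothetical form field name, not the actual search form):

 // Only neutralise when the visitor ticked a "neutralize accented letters"
 // checkbox (e.g. an <input type="checkbox" name="neutral" value="1"> field).
 if (@$_REQUEST['neutral'])
   $term = removeaccents($term);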


This is a very useful feature, but I'll need to think about it in a larger context for a bit before it can be implemented. The issue is that not everyone is using a roman (Latin-1) charset, so it'd be nice if the above could be implemented in a way that extends to other charsets as well. It may be that the search code will have an option to allow custom translations before doing text search, or even custom comparison functions.
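
One way such an option might look (hypothetical variable and function names, not PmWiki's actual search code): a configurable list of callbacks applied to both the search terms and the page text before comparison, so each site can plug in whatever suits its charset.

 // Hypothetical per-site normalisation callbacks, applied before text search.
 $SearchNormalizeFunctions = array('removeaccents', 'strtolower');

 function NormalizeForSearch($text) {
   global $SearchNormalizeFunctions;
   foreach ((array)$SearchNormalizeFunctions as $fn)
     $text = $fn($text);
   return $text;
 }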

To move this feature off of suspended status, it'd help to see more priority votes above--until then it's likely to fall under PmWikiPhilosophy #3 (avoid gratuitous features).

--Pm


I just learned about the Levenshtein distance, which quantifies the difference between two strings. Maybe this PITS entry could be implemented in a simple manner by using the Levenshtein algorithm for the search, accepting results with a distance greater than zero but below a certain threshold as hits?

It could be user-invoked by a special search field notation, such as using ~apple to search for anything resembling "apple", such as "äpple", "ápple", "aple" etc.

(To be really tricky, the threshold Levenshtein distance could be determined by the number of tilde signs used, such as ~~~apple finding even "Äpfél". Of course, this would also yield false hits.)
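
A rough sketch of that idea, using PHP's built-in levenshtein() function (the word splitting is simplified, and note that levenshtein() counts bytes, so a multi-byte accented character adds more than one edit in UTF-8):

 // Hypothetical fuzzy word match: the number of leading tildes sets the
 // maximum accepted Levenshtein distance (~apple => 1, ~~~apple => 3).
 function FuzzyWordMatch($term, $text) {
   if (!preg_match('/^(~+)(.+)$/', $term, $m)) return false;
   $maxdist = strlen($m[1]);
   $word = strtolower($m[2]);
   foreach (preg_split('/\s+/', strtolower($text)) as $w)
     if ($w != '' && levenshtein($word, $w) <= $maxdist) return true;
   return false;
 }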

Just an idea ...

--Henning August 03, 2007, at 09:26 AM


Closed -- added in 2.2.0-beta43.

Pm November 14, 2007, at 09:53 AM