00425: Words like DVDs and CDs are mistaken for WikiWord links

Summary: Words like DVDs and CDs are mistaken for WikiWord links
Created: 2005-04-28 11:44
Status: CoreCandidate, awaiting feedback / votes
Category: Bug
Assigned:
Priority: 321
Version: 2.0b35
OS: Unknown

Description: I think PmWiki's WikiWord implementation is a bit overzealous. It correctly knows to ignore all-capital abbreviations/acronyms like DVD, CD and TV, but if you make these plural like DVDs, CDs and TVs, then it thinks they ought to be linked.


It would be nice to see a good implementation of a "Glossary cookbook" for those problems, so that users of PmWiki (not the admin) don't have every word that merely looks like a WikiWord turned into a link when it only needs a short definition and isn't worth a wiki page.

This matters for keeping wiki administration, and technical documentation written in a wiki, sane.

I have looked at the other existing approaches: markup for plurals, acronyms, glossary, abbr ...

I think that a glossary-like system would be a good way to solve:

  • Problems with plurals
  • Problems with technical acronyms
  • Problems with software/hardware names that mix capital and lowercase letters.
  • The need for a glossary of words that don't merit a wiki page of their own and are outside the scope of the site using PmWiki.
  • Ease of use and administration of a wiki used for documentation.

CarlosAB


We might just want to redefine $WikiWordPattern so that at least two lowercase letters are required to build a wikiword. This has been discussed somewhat on the pmwiki-users mailing list.
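
To make the idea concrete, here is a minimal sketch of what such a redefinition might look like. It assumes the stock pattern has the shape quoted further down this page, and it prepends a lookahead that demands two lowercase letters. The names $TwoLowerPattern and IsWikiWord are made up for this illustration and are not part of PmWiki.

```php
<?php
// Assumed stock-style pattern (the one quoted later on this page,
// without any length check).
$WikiWordPattern = '[[:upper:]][[:alnum:]]*(?:[[:upper:]][[:lower:]0-9]'
  . '|[[:lower:]0-9][[:upper:]])[[:alnum:]]*';

// Hypothetical variant: the lookahead insists on at least two lowercase
// letters within the run of alphanumerics that starts the word.
$TwoLowerPattern = '(?=[[:alnum:]]*[[:lower:]][[:alnum:]]*[[:lower:]])'
  . $WikiWordPattern;

// Sketch-only helper: does the whole word match the pattern?
function IsWikiWord($pattern, $word) {
  return preg_match('/^(?:' . $pattern . ')$/', $word) === 1;
}

foreach (array('DVDs', 'CDs', 'TVs', 'DVD', 'WikiWord', 'PmWiki') as $w) {
  printf("%-8s stock=%s two-lower=%s\n", $w,
    IsWikiWord($WikiWordPattern, $w) ? 'link' : 'text',
    IsWikiWord($TwoLowerPattern, $w) ? 'link' : 'text');
}
```

With this variant DVDs, CDs and TVs stay plain text while WikiWord and PmWiki still link. The trade-off is that any name with a single lowercase letter would also stop linking.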

Comments, votes?

--Pm


PM, you have my vote -IanMacGregor


I've recently changed my running $WikiWordPattern setup to allow specifying a minimum word length.

local/config.php:

$WikiWordMinLength = 5;

pmwiki.php:

SDV($WikiWordMinLength,2);
$WikiWordPattern = "(?=.\x7B$WikiWordMinLength,\x7D)[[:upper:]][[:alnum:]]*(?:[[:upper:]][[:lower:]0-9]|[[:lower:]0-9][[:upper:]])[[:alnum:]]*";

I'd probably suggest smarter escaping of the { }'s around the length variable. I'm not a PHP guy personally, so I hacked it together heh. Anyway, the SDV(...,2) default keeps things working the way they already do. Adding a maximum length probably wouldn't hurt either, for public sites?
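
One possible way to tidy the escaping is to build the pattern with sprintf, as sketched below. Note that a lookahead of the form .{N,} counts any character on the line, so "DVDs and" already satisfies a minimum of 5. The sketch counts alphanumerics only, which is an assumption about the intended behaviour. MinLengthWikiWordPattern and FindWikiWords are hypothetical helpers, not PmWiki functions.

```php
<?php
// Hypothetical helper: builds a WikiWord pattern whose lookahead
// requires at least $minLength consecutive alphanumeric characters.
function MinLengthWikiWordPattern($minLength) {
  return sprintf('(?=[[:alnum:]]{%d,})', $minLength)
    . '[[:upper:]][[:alnum:]]*(?:[[:upper:]][[:lower:]0-9]'
    . '|[[:lower:]0-9][[:upper:]])[[:alnum:]]*';
}

// Sketch-only helper: collects every candidate word in running text.
function FindWikiWords($pattern, $text) {
  preg_match_all('/\b(?:' . $pattern . ')/', $text, $m);
  return $m[0];
}

$WikiWordMinLength = 5;
$text = 'DVDs and CDs are not pages, but WikiWord is.';
print_r(FindWikiWords(MinLengthWikiWordPattern($WikiWordMinLength), $text));
```

A minimum of 2 leaves short words such as CDs linkable again, in line with the SDV default above.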

... holy! this was created back in 2005! lol, oops. Sorry. Maybe my change will be of benefit to someone anyway heh ...

-unfy


I'd like to see a rule where all leading or trailing capitals are treated as part of the first or last word respectively. E.g. FFred is not changed to F Fred.

simon July 23, 2015, at 04:16 AM

The AsSpaced function in pmwiki.php does this. You can use $AsSpacedFunction to set your own in config.php.

So for example, I did this at the end of my config.php (a copy / paste of AsSpaced, edited to skip the space insertion you're talking about) and it works:

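function UnfyAsSpaced($text) {
  // Space between a lowercase letter or digit and a following capital.
  $text = preg_replace("/([[:lower:]\\d])([[:upper:]])/", '$1 $2', $text);
  // Space before a trailing run of digits and hyphens.
  $text = preg_replace('/([^-\\d])(\\d[-\\d]*( |$))/','$1 $2',$text);
  // The original final step, disabled here, is what splits "FFred"
  // into "F Fred":
  //return preg_replace("/([[:upper:]])([[:upper:]][[:lower:]\\d])/",
  //  '$1 $2', $text);
  return $text;
}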

$AsSpacedFunction = 'UnfyAsSpaced';

-unfy July 23rd 2015
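
For anyone weighing this change, the sketch below compares the two behaviours side by side. StockStyleAsSpaced is an assumed reconstruction of the original function, pieced together from the commented-out lines in the post above. KeepCapsAsSpaced mirrors UnfyAsSpaced. Both names are made up for this illustration.

```php
<?php
// Assumed shape of the stock function (reconstructed, not verified
// against pmwiki.php).
function StockStyleAsSpaced($text) {
  $text = preg_replace('/([[:lower:]\d])([[:upper:]])/', '$1 $2', $text);
  $text = preg_replace('/([^-\d])(\d[-\d]*( |$))/', '$1 $2', $text);
  return preg_replace('/([[:upper:]])([[:upper:]][[:lower:]\d])/',
    '$1 $2', $text);
}

// Same steps, minus the final one that splits a run of capitals.
function KeepCapsAsSpaced($text) {
  $text = preg_replace('/([[:lower:]\d])([[:upper:]])/', '$1 $2', $text);
  $text = preg_replace('/([^-\d])(\d[-\d]*( |$))/', '$1 $2', $text);
  return $text;
}

foreach (array('FFred', 'WikiWord', 'HTMLPage') as $name) {
  printf("%-9s stock: %-10s keep-caps: %s\n", $name,
    StockStyleAsSpaced($name), KeepCapsAsSpaced($name));
}
```

One side effect worth knowing about: with the last step removed, a name like HTMLPage is no longer split into "HTML Page" either.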

The $WikiWordPattern I used above doesn't properly handle the second part of the regex (i.e. XXX2 still gets wikified). I've changed it to:

$WikiWordPattern = "(?=.{" . $WikiWordMinLength . ",})[[:upper:]][[:alnum:]]*((?=.{" . $WikiWordMinLength . "})?:[[:upper:]][[:lower:]0-9]|[[:lower:]0-9][[:upper:]])[[:alnum:]]*";

If this is inappropriate here, just lemme know (and possibly point me to where it might belong heh).

-unfy July 27th 2015