01413: Custom PageStore receives requests for file system paths

Summary: Custom PageStore receives requests for file system paths
Created: 2017-06-22 15:52
Status: Closed, works good enough for now.
Category: Bug
From: Sven
Assigned:
Priority: 3
Version: 2.2.98
OS: Ubuntu 14.04.5 LTS trusty | Apache/2.4.10 | PHP 5.5.9-1ubuntu4.21

Description: I just implemented my new PageStore and found there seem to be some requests that probably should have been addressed to file_exists() instead of PageExists():

W: slash in pagename: /var/www/path/to/FarmD/local/farmmap.txt (exists)
    @ self->exists                   | FarmD:pmwiki.php:1160
    @ PageExists                     | FarmD:pmwiki.php:354

W: slash in pagename: local/localmap.txt (exists)
    @ self->exists                   | FarmD:pmwiki.php:1160
    @ PageExists                     | FarmD:pmwiki.php:354

I was going to add a skip check that ignores all requests whose $pagename contains a slash, but there's also this one:

W: slash in pagename: Extemporate/GroupAttributes (read)
    @ self->read                     | FarmD:pmwiki.php:1134
    @ ReadPage                       | FarmD:pmwiki.php:2023
    @ PmWikiAuth                     | FarmD:pmwiki.php:1176
    @ RetrieveAuthPage               | FarmD:pmwiki.php:1697
    @ HandleBrowse                   | FarmD:pmwiki.php:394
    @ HandleDispatch                 | FarmD:pmwiki.php:381

So is my PageStore expected to understand Group/Page in addition to Group.Page?

Sven June 22, 2017, at 03:56 PM

This notice has been hidden in 2.2.98. It is related to the PmWiki.InterMap feature. For the core it is simpler to check through all InterMap locations -- at any rate, if the string is not a pagename, the "exists()" function should simply return false. It is safer to assume that the pagename may come with either separator (dot or slash) -- feel free to str_replace the one into the other if you like (and it may be safer to replace everything to dots). --Petko June 22, 2017, at 04:15 PM

Thanks for reminding me to upgrade! I suggest to let the (:pitsform:)'s preview issue a big reminder box "Your version differs from the latest release. Have you tried upgrading?" in that case.
Now I'm at 2.2.98 and the warnings are still there, as expected, because they are generated by my PageStore to help me understand how PmWiki interacts with it.
I've added blacklist rules for "$FarmD/local/farmmap.txt" and "local/localmap.txt" now. Are those two guaranteed to be the only not-a-pagename requests, or do I need a more sophisticated mechanism to determine what kinds of strings are pagenames? I can't really test existence because the whole purpose of that PageStore is to make up pages on the spot if the other PageStores couldn't find them.
If I need a more sophisticated mechanism, what are the best good and still cheap to test criteria to support most of future receipes that might extend the possibilities for page names? Sven June 22, 2017, at 07:57 PM

No, these are not guaranteed to be the only "not-a-page" requests. The array $InterMapFiles (pmwiki.php:56, pmwiki.php:357) contains both filenames and pagenames, and can be rewritten or modified in a local customization (like almost everything in PmWiki).

If current and future recipes extend or restrict sensibly the possibilities of page names, try this:

  if (preg_match("!^{$GroupPattern}[./]$NamePattern$!", $pagename)) { 
     # probably an actual pagename, case sensitively
  }

Define $GroupPattern and $NamePattern as global. Note that if someone checks for lowercase "group.name" it may (UTF-8) or may not (ISO) match the patterns. I wouldn't know how likely is this to happen and what would be the consequences, but an actual pagename/filename created and updated by PmWiki "should" always match that pattern, yet a case-insensitive filesystem may return different-cased variants as existing. --Petko June 23, 2017, at 01:10 AM

Thanks! RegExp matching was one of the things I feared could be costly if my PageStore had to do it for each request. I'll update my blacklist approach to use the $InterMapFiles as a first measure for old versions. Meanwhile I had the idea to reduce the amount of string comparison in future versions by guaranteeing that those file path requests all start with either / or ./, but if users are meant to configure their own InterMap files, that won't work. My blacklist approach however cannot easily deal with new versions of PmWiki adding other file path checks for new features. How about a global variable instead, in which core states the intent of its PageExists() calls? E.g. $CorePageExistsFeature = 'init' very early when core doesn't actually intend any checks, just announces it will clarify them. Then $CorePageExistsFeature = 'InterMap' for that, and later on $CorePageExistsFeature = 'wiki' for regular wikipages. Then I can easily branch, and fast:

if ($CorePageExistsFeature) {
  if ($CorePageExistsFeature !== 'WikiPage') { return false; }
} else {
  // costly blacklist approach with in_array
}

I think you might severely underestimate the speed of PHP 5 and 7, I normally wouldn't worry. The global variable $LinkPattern does not exist when InterMap files/pages are processed, and is created just afterwards, before processing markup [[links]] from pages. Maybe you could check for it. --Petko June 24, 2017, at 07:00 AM

Then there's still the worry of each PageStore having to do its own str_replace('/', '.', …) or vice-versa, because most actual data sources will expect one always-same separator. What do you think of making future core's PageExists() call the PageStore's exists method with a second argument, and array that holds each level of hierarchy? That way PageStores can even prepare for the great day when someone invents a good idea for more than two levels. Sven June 23, 2017, at 04:04 AM

PS: Just thought that the level names array would need to be on the other calls as well, that makes the API less pretty when page data is in between. How about PageStores expose a public variable to opt-in for a level names array instead of $pagename? Sven June 23, 2017, at 04:42 AM

By "level of hierarchy" and "level names array" do you mean array('Group', 'Name') ? --Petko

Yes, array('Group', 'Name') would be the current hierarchy's levels (don't know a better name). — Sven June 24, 2017, at 07:29 PM

I found $NamePattern and $GroupPattern. Their explanations should point to an explanation of which characters are available for exotic custom group/page namespaces, and which characters need to be reserved for PmWiki or any recipe that needs a safe, easy, fast-to-detect delimiter for group/page names. I suggest we reserve "for markup and delimiter use" all Unicode below and including U+009F application program command, except for numbers, letters, U+002C comma, U+002D hyphen-minus, U+0040 commercial at and U+005F low line. Would be nice if that page establishes another rule: That group and page names cannot be empty, and both their first and last character (coincides if name length is 1) must be neither Unicode whitespace nor a Unicode control character. I don't expect PmWiki to enforce this, it's just that I'd rather have a common minimal standard than each PageStore recipe making up their own rules for what they won't support. — Sven June 25, 2017, at 11:01 AM

For 10+ years the available characters were different from what you suggest: see the definitions at pmwiki.php:47, xlpage-utf-8.php:47, see also $PageNamePatterns, $MakePageNamePatterns, and the function MakePageName. I don't want to change them and document something that is not necessarily intentional. If your own recipe requires something precise and different (more or less restricted), feel free to enforce it by setting these variables in your recipe, and document it in the cookbook. See also the function ResolvePageName() and $EnableFixedUrlRedirect for incomplete page names. See also Cookbook:Cluster, Cookbook:SubgroupMarkup and Category.CustomPageStore.

Another note: in the past I was an "optimization freak", I felt pain when I knew something was less than perfect/fastest/best, and I feel yours, but discussing with Pm and the community, working with the core and 60+ modules I now find "pragmatism" to be of more value, and "good enough" to be good enough. Not only that, I strongly believe that less is more and especially for software, worse is better. I worry about speed or scalability when it becomes a problem. Just me, you of course can have other priorities. --Petko June 25, 2017, at 12:15 PM

Thanks for the hint to xlpage-utf-8.php:47. I had thought my suggestion was just codification of status quo, as I'd never have expected a range of control characters to become valid for group and page names just by enabling the default UTF-8 support. (Update: Verified. I can't decide whether I like ? or ? better.) I'll restrict that on my farm and accept it on others then.
The warning above the CustomPageStore category sounds like it's generally accepted for PageStores to have lots of compatibility quirks, so I'll just code what works for me and not care about exotic edge cases. :-)) In that spirit I can actually stop caring about the original issue as well. — Sven June 25, 2017, at 04:13 PM