00358: Track pages of interest

Summary: Track pages of interest
Created: 2005-03-05 13:35
Status: Suspended - awaiting discussion
Category: Cookbook
Assigned:
Priority: 44433
Version: 2
OS: N/A

See also: PITS.00291

Description

Fast access to recent changes *of interest*

Problem

Wiki visitors often want to be notified of changes in specific pages, or page groups.

Existing solutions

The PmWiki/MailPosts system is an approximation to such a notification system. It is too limited for our purposes because users cannot express interest in changes in just some specific pages.
However, since everybody is getting the same list of changed pages, it's enough to set up a mailing list so that everybody who wants to get notified of changes can simply subscribe to the list.

Design choice points

  1. (User Interface) The visitor could either have a watchlist page created in the wiki for him, and would have to periodically review it to see what pages have changed. Or he could get a notification email whenever a page on his watchlist changes.
  2. (Implementation) The software must be able to find all watchers of a page, and all pages watched by a given visitor. Many caching and indexing schemes are conceivable; the challenge is to select the one that's best for PmWiki.
  3. (Generality) Account-based authentication would simplify many parts of the subscription system, but it would also restrict the range of wikis where this extension is applicable.

Caveats

  • (Importance: Low) A visitor shouldn't be notified of pages that he has no read access to. Since read privileges can vary over time, actual selection of pages in the list shown to the visitor should be postponed to the last moment possible.
  • (Importance: High) Watchlist pages would need a way to reset them.

Considerations for proposals

Mail address verification

The big issue here is how to verify that it's indeed the owner of a mail address who requested a subscription or unsubscription.

The simplest mechanism seems to be this:

1. PmWiki generates an "authentication ID" (actually a random number from a large domain). It remembers that ID together with the (un)subscription request information.

2. PmWiki sends a mail to the given mail address, reading roughly as follows:

 Dear user,
 somebody (probably you) requested that this email address
 be (not) notified whenever page SomePage is changed. To
 prevent malicious changes, we need you to confirm this request
 by clicking on the following link:
 http://wiki.tld/pmwiki.php?action=confirm&code=01946502399643654
 If you didn't ask for this change in subscription, please ignore
 this message - obviously somebody mistyped the email address or
 tried to play a dumb joke on you.
 If you're being sent such unwanted mail more often than you'd like,
 please contact abuse@wiki.tld, and we'll stop our software from
 ever bothering you again - and please accept our apologies.
 Yours sincerely,
 wiki.tld

3. User clicks on the link.

4. PmWiki gets the ?action=confirm&code=01946502399643654 information back, looks up what it should be doing if that confirmation code comes in, and (un)subscribes the mail address.
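
The four steps above can be sketched as follows. This is only an illustration in Python, not PmWiki code: the in-memory dictionaries and the function names request_change and confirm are assumptions made for the example, and sending the mail itself is left out.

```python
import secrets

# Hypothetical in-memory store: confirmation code -> pending request.
pending = {}
# Hypothetical subscription table: page name -> set of mail addresses.
subscriptions = {}


def request_change(page, email, subscribe):
    """Step 1: remember the request under a fresh random code and return the code."""
    code = secrets.token_urlsafe(16)  # unguessable value from a large domain
    pending[code] = (page, email, subscribe)
    return code  # step 2 would mail a link containing this code


def confirm(code):
    """Step 4: apply the pending request for this code; return False if unknown."""
    request = pending.pop(code, None)  # each code can be used only once
    if request is None:
        return False
    page, email, subscribe = request
    watchers = subscriptions.setdefault(page, set())
    if subscribe:
        watchers.add(email)
    else:
        watchers.discard(email)
    return True
```

Nothing changes in the subscription table until confirm() is called with a code that was actually issued, so a mistyped or malicious request has no effect unless the owner of the address clicks the link.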

Cutting down on mail address verification

In its simplest form, whenever a visitor requests a change in subscription, he'll get a confirmation email. This is tedious for users.

For a wiki with password-based authentication, there's no way around it. (Even people who know an authentication password may be mischievous.)

For a wiki with account-based authentication, it's possible to store their mail address with the account data, and not request mail verification when a user subscribes to a page.
Mail addresses must still be verified, but now this needs to be done just once.


More on subscription handling

For the sake of scalability, we could implement a script that PmWiki runs once a day, on the first 'read' call of the day (if access to the site crontab is not available). That script would compare yesterday's list of all pages in the site and their modification timestamps with the current timestamps, while caching current timestamps, and use a list of

group.page1:user1:email1,user2:email2,user3:email3

or some such (details of format TBD), to determine who gets an email and/or which watchlists need to change. Then it would write the new list of pages and modification timestamps. To simplify data entry, the email could be stored in a cookie on the client side. The maintenance of the page:email list would be done with a click-to-toggle-registration mechanism as described below under proposals. That file would be invisible to browsers, since access to the wiki.d directory is denied by .htaccess.
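
A minimal sketch of that daily comparison, in Python for illustration only. It assumes the tentative line format shown above (the source marks it as TBD), plain dictionaries mapping page names to modification timestamps, and hypothetical function names; actually sending mail and rewriting watchlists is left out.

```python
def parse_watch_line(line):
    """Split 'Group.Page:user1:email1,user2:email2' into (page, [(user, email), ...])."""
    page, _, rest = line.strip().partition(":")
    watchers = []
    for entry in rest.split(","):
        user, _, email = entry.partition(":")
        if user and email:
            watchers.append((user, email))
    return page, watchers


def changed_pages(old_stamps, new_stamps):
    """Pages that are new or whose timestamp differs from yesterday's."""
    return sorted(p for p, t in new_stamps.items() if old_stamps.get(p) != t)


def mails_to_send(watch_lines, old_stamps, new_stamps):
    """Map each mail address to the sorted list of its watched pages that changed."""
    changed = set(changed_pages(old_stamps, new_stamps))
    outbox = {}
    for line in watch_lines:
        page, watchers = parse_watch_line(line)
        if page in changed:
            for _user, email in watchers:
                outbox.setdefault(email, []).append(page)
    return {email: sorted(pages) for email, pages in outbox.items()}
```

After a run, the script would store new_stamps as the next day's old_stamps. The sketch only reports pages that somebody is watching, so any special handling of newly created pages would need to be added separately.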

New pages would be reported only in the email for the day they are created, and they would be added to all watchlists.


Proposals

Radu's version (modified 2005March18)

  • I should learn to avoid fixation :) Of course indexing and dynamic rendering would be better. I'll get to do a version of that a bit later. Sorry for using this page as a thinking pad *blush*

On the page template, add a (:track:) directive that checks

  • If the Author string is empty or 'Guest', it doesn't render anything.
  • If a track4 attribute doesn't exist on the current page, it adds it as ',' (a comma, to ease checking for the name in the list while avoiding substrings), then
  • if the track4 attribute does not contain ',{$Author},':
    • on the current page, render text with a link:
      • that says "Track",
      • with title "Click to track changes on this page on Track4{$Author}" and
      • onClick ({$FullName}?action=track) pmwiki:
        • adds '{$Author},' to the track4 attribute of $FullName,
        • if Profiles.Track4{$Author} doesn't exist, creates it (maybe as edit-locked with the admin pass)
        • does track() - see below.
  • If the track4 attribute of the current page exists and contains ',{$Author},'
    • find the bullet that starts with [[{$FullName}]] on page Profiles.Track4{$Author}
    • move it immediately below a line that contains some Fmt like "Pages viewed since their last edit"
    • on the current page, render text with a link:
      • that says "tracking...",
      • with title "Click to stop tracking this page", and
      • onClick ({$FullName}?action=notrack) pmwiki removes:
        • '{$Author},' from the track4 attribute of $FullName and
        • the bullet starting with [[$FullName]] from Profiles.Track4{$Author}.
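
The comma-delimited track4 attribute can be illustrated with a short Python sketch. The function names are hypothetical and the sketch assumes author names never contain a comma; it only shows why the leading and trailing commas keep a name such as "Ann" from matching inside "Annabel".

```python
def is_tracking(track4, author):
    """True if author appears as a whole entry in a ',a,b,' style string."""
    return f",{author}," in track4


def add_tracker(track4, author):
    """Append 'author,' unless already present; an empty attribute starts as ','."""
    track4 = track4 or ","
    if is_tracking(track4, author):
        return track4
    return track4 + author + ","


def remove_tracker(track4, author):
    """Drop the author's entry while keeping the surrounding commas intact."""
    if not track4:
        return track4
    return track4.replace(f",{author},", ",", 1)
```

For example, adding "Ann" and then "Annabel" to an empty attribute gives ",Ann,Annabel,", and removing "Ann" from that leaves ",Annabel,".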

When the page is saved from an edit (this is the ugly part: each save from edit could potentially process many pages - gotta think more on that),

  • if it contains a track4 attribute, it checks all Track4{$Author} pages:
    • if they contain {$FullName}, that line is removed
    • then pmwiki does track()

track():

  • adds a bullet '* [[$FullName]] $DateStamp (:if author {$Author}:)(<remove>)(:if:)' at the top of Profiles.Track4{$Author}
  • the link that says "remove":
    • has alt "Click to stop tracking this page" and
    • onClick (Track4<Author>?action=trackoff&page={$FullName}) pmwiki removes:
      • 'substr($PageName,7),' from the track4 attribute of the source page and
      • the bullet starting with '[['.$_GET['page'].']]' from {$FullName}.

Dealing with the edit problem:

  • since the track page will be locked anyway, in order to avoid parsing the current text, we may add the two sets of links (modified and seen) as attributes and render them only at read time.
  • or simply use Pm's version and add some indexing... gotta think on that too :)

Pm's proposal

Well, I still think that having the "save page" action update authors' watchlists is the hard way to go about it. Here's my algorithm:

  • On the page template, if the current Author is not defined, don't add anything.
  • If the current Author is defined, read in the Author's watchlist page (using ReadTrail() from trails.php) and see if the current page is on it.
    • If not, add a link saying "Add this page to your watchlist"
    • If it is, add a link saying "Remove this page from your watchlist"
  • When an Author clicks on the add-to-watchlist link, simply add the current page as a bullet item at the end of the author's watchlist page. (If the watchlist page is empty, also add
  (:pagelist trail={$FullName} order=-date:)

to the top.)

  • When an Author clicks on the remove-from-watchlist link, simply remove any bullet items containing the current page from the author's watchlist page.

That's it. Much simpler, because it doesn't involve adding a page attribute, or updating lots of watchlists whenever a page is modified-- one only modifies watchlists in response to add-page-to-watchlist or remove-page-from-watchlist actions.
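
A rough Python sketch of the add and remove actions, treating the watchlist page as plain text with one bullet per watched page. This is an illustration under assumptions, not the proposed implementation: the proposal itself would read the page with ReadTrail() from trails.php, and the function names here are hypothetical.

```python
PAGELIST = "(:pagelist trail={$FullName} order=-date:)"


def is_watched(text, page):
    """True if some bullet item on the watchlist links to the page."""
    link = f"[[{page}]]"
    return any(line.startswith("*") and link in line for line in text.splitlines())


def add_to_watchlist(text, page):
    """Append a bullet for the page; seed an empty watchlist with the pagelist directive."""
    if is_watched(text, page):
        return text
    lines = text.splitlines() if text.strip() else [PAGELIST]
    lines.append(f"* [[{page}]]")
    return "\n".join(lines)


def remove_from_watchlist(text, page):
    """Remove every bullet item that links to the page."""
    link = f"[[{page}]]"
    kept = [l for l in text.splitlines() if not (l.startswith("*") and link in l)]
    return "\n".join(kept)
```

Only the author's own watchlist page is rewritten, and only in response to one of these two actions; saving a watched page touches no watchlist.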

Radu: I'd still like to see the pages either demoted or completely off the watchlist if the guy has seen the tracked page since its last modification time. According to your version, users will see the same list no matter how many times they've seen each page. It sort of takes away from the idea of ToCheck list. Maybe add a button that temporarily hides a bullet until the next modif of that page? Or adds a checked icon in front of the bullet... Aw. Any of this would have the same effect as my version, since on each update all tracked pages would need updating (but eliminates the delete problem) donno... what do other people think?

My point is that external memory helpers (like the reference list we're designing here) are supposed to actually help remembering (in this case "remember what still needs my attention now" rather than "what am I generally interested in") *shrug*

Pm's analysis

The mechanism I propose is *definitely* faster and involves less overall work and stress on the server, even in the scenario you propose. The trick is to realize that the watchlists are only "generated" when the watchlist page is viewed. Let's consider your scenario with 20 authors watching 200 pages each.

When someone edits a watched page, how many pages are written or rewritten?

RL: Up to 21 -- the edited page itself, plus each of the watchlists.

Pm: Just one -- the edited page itself. The watchlist pages don't need updating because they hold a simple list of references to the watched pages. The "sorting" of this list takes place when the watchlist page is viewed.

Result: Updating/editing is definitely less expensive in Pm's approach.

What's the cost of an update?

RL: Very high. In order to maintain page consistency, PmWiki uses exclusion locks to prevent any other process from accessing the database while files are being updated. Thus the entire wiki is blocked for the duration of up to twenty page updates. Things become much worse if we attempt to maintain any sort of page history on the watchlist pages.

Pm: Since an update only updates one page, it's the same cost as the existing environment.

How expensive is it to view a watchlist?

RL: Very inexpensive -- simply display the watchlist page as a normal wiki page.

Pm: Somewhat more expensive, in that the system has to scan the contents of the watchlist in order to sort the pagelist into the correct order. However,

1. This is known to be relatively inexpensive -- the (:pagelist:) and (:searchresults:) algorithms are known to work extremely well when dealing with lists of 200 pages. Furthermore, it's not like doing a search, where we have to find the pages meeting given criteria -- the list of pages is already known (on the watchlist).

2. This expense is an "on-demand" expense -- it's only incurred when someone actually views a watchlist. If nobody views a watchlist, the expense is never incurred. Consider what happens if ten pages on the site are updated but nobody checks their watchlist. Under the other approach, we will have updated as many as 10x20 == 200 pages even though none of those updated pages were actually viewed.

Result: Overall efficiency depends on the viewing pattern of the watchlists -- however, both approaches are known to have adequate performance.
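
The write counts used in this comparison can be checked with a few lines of Python. The function names are hypothetical; the model simply counts page writes in the worst case, where every watcher's list is rewritten on each save.

```python
def page_writes_on_save(edits, watchers_per_page):
    """Worst case when every save also rewrites each watcher's list."""
    return edits * (1 + watchers_per_page)


def page_writes_on_view(edits):
    """Watchlists sorted at view time: a save writes only the edited page."""
    return edits


print(page_writes_on_save(10, 20))  # 210: 10 edited pages plus 200 watchlist rewrites
print(page_writes_on_view(10))      # 10
```

The difference of 200 writes is the 10x20 figure quoted above, and a single edit with 20 watchers gives the "up to 21" figure.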

What happens if someone deletes a page and then adds new contents in the same location?

RL: The watchlist subscriptions are lost, since they're stored as attributes of the watched page.

Pm: Since the watchlists are held in separate watchlist pages, deleting a watched page and then adding a new one doesn't affect the subscriptions.

Result: The Pm algorithm is more robust in light of page deletions.

What happens as the users/pages ratio gets larger?

RL: The cost of updates increases substantially, as each edit requires updating a larger number of pages. If a page is watched by 1000 users, then an edit to that page will require updating 1000 pages.

Pm: It's not a substantial additional cost. The number of watchlists increases, which may mean more on-demand sorting of pagelists, but this is not onerous. If it becomes expensive, it's relatively easy to build in optimizations to the watchlist algorithms such that the scanning/sorting is only performed the first time a page is viewed, and the results of the scan/sort stored in the page until another update occurs.
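
The optimization mentioned here could look roughly like the following Python sketch. It is an assumption-laden illustration: the class name, the mtimes dictionary, and the last_change value (anything that changes whenever a page on the site is updated, such as the latest modification time) are all hypothetical.

```python
class WatchlistView:
    """Sort watched pages newest first, re-sorting only after a site update."""

    def __init__(self, pages):
        self.pages = list(pages)
        self._cached = None
        self._cached_at = None  # last_change value seen when the cache was built
        self.sorts = 0          # how many times the full sort actually ran

    def render(self, mtimes, last_change):
        """Return the pages ordered by modification time, using the cache if fresh."""
        if self._cached is None or self._cached_at != last_change:
            self._cached = sorted(
                self.pages, key=lambda p: mtimes.get(p, 0), reverse=True
            )
            self._cached_at = last_change
            self.sorts += 1
        return self._cached
```

Repeated views between updates reuse the stored order, so the scan and sort cost is paid at most once per update and per watchlist that is actually viewed.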

Overall I'm fairly certain the approach I describe will be much more efficient and effective, will scale better, and will be more flexible than the one you've described. But I'm open to hearing about any holes in my analysis.


Comments

A watchlist feature would be a great addition for using PmWiki as ISO-9000-compliant document management system. 00386 (Upload Versioning) would be a prerequisite for that, though. --Henning March 16, 2005, at 03:56 AM

Please do not replace the current MailPosts functionality -- as an admin I use it for monitoring all the wikisites (now complemented by getting posts when someone is blocked as well), my users do not get mailposts (especially as they would have to bother me to do so, because I would have to add them to the config files). I never considered the MailPosts feature to be a user-notification system. A system where users can subscribe to pages is great, though. -- Crisses

Hm, I didn't notice the MailPosts limitations because my administrator didn't get the Apache to send any mails at all :-( I'm definitely looking for a user-specific notification system that can be managed by authors and by readers. In other words, an author should be able to add users to the notification list, and the readers should be able to add themselves to notification lists, too. (Now that I think about it, the notification system should extend on uploads as well as on wiki pages.) Well, these are my personal requirements - maybe I'm lucky and someone else needs something similar :-) --Henning May 12, 2005, at 12:27 PM

I am in the process of evaluating PMWiki for use as an internal private wiki. A watchlist feature is critical for us ... users need to be able to specify their pages of interest. PM's design above is not only more performant but also more elegant and refined, in line with many PMWiki features; it gets my vote. This issue has been floating around for some time. Is there an expectation when this feature will be implemented? -- SteveF May 22, 2006

Also, on a related note, in general I think it is important to remember that not all PMWiki installations are large or public. I would prefer that PMWiki features address security issues optionally. The idea of making someone respond to a verification email every time they add themselves to a page watchlist would make a highly collaborative system very difficult to use. I would like to see verification be an optional setting by the admin. -- SteveF May 22, 2006

Finally, a nice feature beyond the basic watchlist would be the ability for a user to automate the watchlist process; for example, if the user had a way to say "always notify me for changes to any page I have created". -- SteveF May 22, 2006


Quite by accident I found Cookbook:WatchLists which seems to fit this feature request. --Henning July 03, 2006, at 08:50 AM